It’s been surprisingly hard to pin down exactly how Apple’s M1 compares to Intel’s x86 processors. Although the chip family has been extensively reviewed in a number of common consumer applications, the inevitable differences between macOS and Windows, the effect of emulation, and the differing degrees of optimization between x86 and M1 all make precise measurement more difficult.
An interesting new benchmark result and accompanying review from app developer and engineer Craig Hunter shows the M1 Ultra completely destroying every Intel x86 CPU in the field. It’s not even a fair fight. According to Hunter’s results, an M1 Ultra running six threads matches the performance of a 28-core Xeon workstation from 2019.
Any lingering hope that the M1 Ultra would suffer a sudden and unexplained scaling collapse above six cores is dashed once we extend the graph’s y-axis high enough to accommodate the data.
This is a huge win for the M1. Apple’s new CPU is twice as fast as the 28-core Mac Pro’s top result. But what do we know about the test itself?
The benchmark Hunter used is USM3D, described by NASA as “a tetrahedral unstructured flow solver widely used in industry, government and academia to solve aerodynamic problems.”
As mentioned earlier, this is a computational fluid dynamics test, and CFD tests are notoriously sensitive to memory bandwidth. We have never tested USM3D at ExtremeTech and it is not an application I am familiar with, so we reached out to Hunter for some additional clarification on the test itself and how he compiled it for each platform. There has been some speculation online that the M1 Ultra hit these performance levels thanks to advanced matrix extensions or some other, unspecified optimization that was not in play on the Intel platform.
According to Hunter, this is not true.
“I did not link to any Apple frameworks when compiling USM3D on the M1, nor did I try to tune or optimize the code for Accelerate or AMX,” said the engineer and app developer. “I used the stock USM3D source with gfortran and did a fairly standard compile with -O3 optimization.”
“Honestly, I think this puts the M1 USM3D executable at a slight disadvantage compared to the Intel USM3D executable,” he continued. “I’ve been using the Intel Fortran compiler for over 30 years (it was DEC Fortran, then Compaq Fortran, before it became Intel Fortran) and I know how to get the most out of it. The Intel compiler performs some aggressive vectorization and optimization when compiling USM3D and has historically performed better than gfortran on x86-64. So I expect I left some performance on the table by using gfortran for the M1.”
We asked Hunter what he thought explained the M1 Ultra’s performance relative to the various Intel systems. The engineer has decades of experience evaluating CFD performance on a variety of platforms, from desktop systems like the Mac Pro and Mac Studio to actual supercomputers.
“Based on all the tests past and present, I think it’s the SoC architecture that’s making the biggest difference here with the Apple Silicon machines, and as we bring more cores into the calculation, system bandwidth is going to be the main driver of performance scaling. The M1 Ultra in the Studio has an insane amount of system bandwidth.
“The benchmark is based on the NASA USM3D CFD code, which is available to U.S. citizens upon request at software.nasa.gov. It comes as source code and needs to be compiled with a Fortran compiler (you would also need to build OpenMPI with matching compiler support). The makefiles are set up for macOS or Linux using the Intel Fortran compiler, which creates a highly optimized executable for x86-64. You could use gfortran as well (that is what I used for the arm64 Apple M1 systems), but I would expect performance to be less than what ifort can enable on x86-64.”
What these results say about the x86 / M1 matchup
It is not surprising that an SoC with more memory bandwidth than any previous CPU performs well in a bandwidth-limited workload. What’s interesting about these results is that they do not necessarily depend on any particular aspect of x86 versus ARM. Give an AMD or Intel CPU as much memory bandwidth as Apple is fielding here, and performance might improve accordingly.
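To make the bandwidth argument concrete, here is a minimal roofline-style sketch in Python. It is not taken from Hunter's testing or from USM3D. The function names, the peak compute figure, the bandwidth figures, and the arithmetic intensity are all hypothetical placeholders chosen for illustration. The point is only that when a workload performs few floating-point operations per byte moved, attainable throughput tracks memory bandwidth rather than core count or ISA.

```python
# Roofline-style estimate: a kernel's attainable throughput is capped either by
# the processor's peak compute rate or by how fast memory can feed it.
# Every number below is an illustrative placeholder, not a measurement.


def attainable_gflops(peak_gflops: float, bandwidth_gbs: float, flops_per_byte: float) -> float:
    """Return the lower of the compute roof and the memory roof, in GFLOP/s."""
    memory_roof = bandwidth_gbs * flops_per_byte
    return min(peak_gflops, memory_roof)


def is_bandwidth_bound(peak_gflops: float, bandwidth_gbs: float, flops_per_byte: float) -> bool:
    """True when the memory roof sits below the compute roof."""
    return bandwidth_gbs * flops_per_byte < peak_gflops


# Assumed arithmetic intensity for a bandwidth-hungry solver (hypothetical).
INTENSITY = 0.25  # FLOPs per byte moved

# Two hypothetical machines with identical compute but different bandwidth.
narrow = attainable_gflops(peak_gflops=2000.0, bandwidth_gbs=100.0, flops_per_byte=INTENSITY)
wide = attainable_gflops(peak_gflops=2000.0, bandwidth_gbs=400.0, flops_per_byte=INTENSITY)

print(f"narrow-memory machine: {narrow} GFLOP/s")  # 25.0
print(f"wide-memory machine:   {wide} GFLOP/s")    # 100.0
print(f"speedup from bandwidth alone: {wide / narrow}x")  # 4.0x
```

In this toy model, quadrupling bandwidth quadruples throughput even though the compute hardware is unchanged, which is the shape of the argument being made about the M1 Ultra. Real CFD codes are messier, since cache behavior and inter-core communication also matter.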
In my article RISC vs. CISC Is the Wrong Lens for Comparing Modern x86, ARM CPUs, I spent some time discussing how Intel won the ISA wars decades ago not because x86 was inherently the best instruction set architecture, but because it could leverage an array of continuous manufacturing improvements and iteratively improve x86 from generation to generation. Here, we see Apple arguably doing something similar. The M1 Ultra isn’t trashing every Intel x86 CPU because it’s magic, but because integrating DRAM on-package the way Apple has unlocks remarkable performance improvements. There is no reason why x86 CPUs cannot take advantage of these gains as well. The fact that this benchmark is so limited by memory bandwidth suggests that top-end Alder Lake systems might match or outperform older Xeons, such as the one in the 28-core Mac Pro, but they still would not match the M1 Ultra for sheer bandwidth between the SoC and main memory.
We do, in fact, see x86 CPUs taking baby steps toward integrating more high-speed memory directly on the package, but Intel is keeping this technology focused on servers for now, with Sapphire Rapids and its on-package HBM2 memory (available on some future SKUs). Neither Intel nor AMD has built anything like the M1 Ultra, at least not yet. So far, AMD has focused on integrating larger L3 caches rather than moving toward on-package DRAM. Any such move would require buy-in from OEMs and multiple other players in the PC manufacturing space.
I don’t expect the x86 makers to rush to adopt this technology just because Apple is using it, but the M1 does show remarkable performance in certain tests, along with strong performance per watt. You can bet every aspect of the Cupertino company’s manufacturing and design approach has been put under a (probably literal) microscope at AMD and Intel. This is especially true of gains that are not tied to any specific ISA or manufacturing technology.