en.Wedoany.com Reported - Intel and AMD recently released the complete specification for the ACE CPU extension, which aims to make x86 processors faster and more energy-efficient when running specific AI tasks. The extension gives such tasks a better path for executing on the CPU.

Most AI models currently run on GPUs, but not every AI task suits that hardware. For smaller models or latency-sensitive single-user operations, running on the CPU avoids the overhead of transferring data between the CPU and GPU. In many scenarios, moreover, no GPU is available, or only a limited-performance integrated GPU is present. ACE addresses these cases with a standard that reuses the existing AVX10 registers and adds dedicated silicon for matrix multiplication. Its key advantages include higher energy efficiency, simpler development and optimization, and support for 512-bit inputs, which makes ACE easy to integrate with existing designs.
Matrix multiplication is a fundamental operation in AI workloads and consists of multiply-add loops over arrays of data. Most CPUs can perform it, but only at limited speed and a higher power cost. Compared to AVX10, ACE can execute 16 times more operations from the same number of input vectors. This does not equate to a 16x speedup, since actual performance depends on the implementation, but Intel and AMD are expected to devote more silicon to this task in future designs to raise performance. Because each ACE instruction performs more work than an equivalent AVX10 loop, instruction overhead drops, and better memory bandwidth utilization may be achieved immediately.
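The multiply-add loop described above, and one way a 16x operation count could arise, can be sketched in plain Python. This is an illustration only, not ACE code: the article does not say how ACE instructions are organized, so the outer-product formulation, the 32-bit element width, and every function name below are assumptions made for the example.

```python
def matmul_naive(a, b):
    """Multiply two matrices (lists of rows) with plain multiply-add loops."""
    n, k, m = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]  # one multiply-add per step
            c[i][j] = acc
    return c


def matmul_outer(a, b):
    """Same product, accumulated as a sum of outer products (rank-1 updates)."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for p in range(k):
        col = [a[i][p] for i in range(n)]  # one input vector
        row = b[p]                         # the other input vector
        for i in range(n):
            for j in range(m):
                c[i][j] += col[i] * row[j]
    return c


# Assumption: a 512-bit vector holding 32-bit elements has 16 lanes.
LANES = 512 // 32


def macs_elementwise(lanes=LANES):
    # A lane-wise fused multiply-add pairs element i with element i only.
    return lanes


def macs_outer_product(lanes=LANES):
    # A hypothetical outer-product step pairs every element of one
    # input vector with every element of the other.
    return lanes * lanes


ratio = macs_outer_product() // macs_elementwise()
```

Under these assumptions, two 16-element input vectors yield 16 multiply-adds in a lane-wise operation and 256 in an outer product, which is one reading of the 16x figure. Whether that turns into a matching speedup still depends on how much hardware an implementation provides, as noted above.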
The benefits of ACE extend beyond completing the same work with fewer instructions. The standard is implementation-agnostic, so machine learning frameworks (e.g., PyTorch, TensorFlow) and their underlying libraries need only a single code path instead of multiple variants for each AVX support level of the underlying hardware. ACE natively supports most data types used in machine learning operations, including INT8, INT32, FP8, FP16, FP32, and BF16, and can natively use the Open Compute Project's MX block scaling format, a capability not available in AVX10. Developers can also move some NPU-specific workloads back to the CPU, where ACE provides a unified target across x86 hardware and avoids the complexity caused by hardware differences.
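To give a sense of what block scaling means, the sketch below quantizes a block of values to small integers that share a single power-of-two scale. It is a simplified illustration, not a bit-exact implementation of the OCP MX specification and not ACE code: the block size of 32, the signed 8-bit element range, and the function names are assumptions chosen for the example.

```python
import math

BLOCK = 32      # assumed number of elements sharing one scale
ELEM_MAX = 127  # assumed element range: signed 8-bit integers


def quantize_block(values, elem_max=ELEM_MAX):
    """Return (shared_exponent, integer_elements) for one block of floats."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0] * len(values)
    # Smallest power-of-two scale that keeps the largest value in range.
    exp = math.ceil(math.log2(amax / elem_max))
    scale = 2.0 ** exp
    # Round to the nearest integer and clamp as a safeguard.
    elems = [max(-elem_max, min(elem_max, round(v / scale))) for v in values]
    return exp, elems


def dequantize_block(exp, elems):
    """Rebuild approximate floats from the shared exponent and elements."""
    scale = 2.0 ** exp
    return [e * scale for e in elems]


# Usage: one block of 32 values that happen to be exactly representable.
block = [0.5 * i for i in range(BLOCK)]
shared_exp, elems = quantize_block(block)
restored = dequantize_block(shared_exp, elems)
```

The point of the layout is that each block stores one scale plus narrow elements, so a matrix unit can multiply the narrow elements directly and apply the scales once per block rather than per value.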









