AVX-512


AVX-512 is a set of 512-bit extensions to the 256-bit Advanced Vector Extensions (AVX) SIMD instructions for the x86 instruction set architecture. It was proposed by Intel in July 2013 and implemented in Intel's Xeon Phi x200 and Skylake-X CPUs, as well as in the new Xeon Scalable Processor Family and the Xeon D-2100 Embedded Series.
AVX-512 is not the first 512-bit SIMD instruction set that Intel has introduced in processors: the earlier 512-bit SIMD instructions used in the first generation Xeon Phi coprocessors, derived from Intel's Larrabee project, are similar but not binary compatible and only partially source compatible.
AVX-512 consists of multiple extensions that may be implemented independently. This policy is a departure from the historical requirement of implementing the entire instruction block. Only the core extension AVX-512F is required by all AVX-512 implementations.

Instruction set

The AVX-512 instruction set consists of several separate sets, each with its own CPUID feature bit; however, they are typically grouped by the processor generation that implements them.
; F, CD, ER, PF: Introduced with Xeon Phi x200 and Xeon E5-26xx V5, with the last two (ER and PF) being specific to Knights Landing.
; VL, DQ, BW: Introduced with Skylake-X and Cannon Lake.
; IFMA, VBMI: Introduced with Cannon Lake.
; 4VNNIW, 4FMAPS: Introduced with Knights Mill.
; VPOPCNTDQ: Vector population count instruction. Introduced with Knights Mill and Ice Lake.
; VNNI, VBMI2, BITALG: Introduced with Ice Lake.
; VP2INTERSECT: Introduced with Tiger Lake.
; GFNI, VPCLMULQDQ, VAES: Introduced with Ice Lake.

Encoding and features

The VEX prefix used by AVX and AVX2, while flexible, did not leave enough room for the features Intel wanted to add to AVX-512. This has led them to define a new prefix called EVEX.
Compared to VEX, EVEX adds several features. Of these, the extended registers, the SIMD width bit, and the opmask registers are mandatory in AVX-512, and all of them require support from the OS.

SIMD modes

The AVX-512 instructions are designed to mix with 128/256-bit AVX/AVX2 instructions without a performance penalty. In addition, the AVX-512VL extension allows the use of AVX-512 instructions on the 128/256-bit registers XMM/YMM, so most SSE and AVX/AVX2 instructions have new AVX-512 versions encoded with the EVEX prefix, which give access to new features such as opmasks and the additional registers. Unlike AVX-256, the new instructions do not have new mnemonics but share a namespace with AVX, making the distinction between VEX- and EVEX-encoded versions of an instruction ambiguous in source code. Since AVX-512F only works on 32- and 64-bit values, SSE and AVX/AVX2 instructions that operate on bytes or words are available only with the AVX-512BW extension.
Name | Extension sets | Registers | Types
Legacy SSE | SSE-SSE4.2 | xmm0-xmm15 | single floats. From SSE2: bytes, words, doublewords, quadwords and double floats.
AVX-128 | AVX, AVX2 | xmm0-xmm15 | bytes, words, doublewords, quadwords, single floats and double floats.
AVX-256 | AVX, AVX2 | ymm0-ymm15 | single float and double float. From AVX2: bytes, words, doublewords, quadwords.
AVX-128 | AVX-512VL | xmm0-xmm31 | doublewords, quadwords, single float and double float. With AVX-512BW: bytes and words.
AVX-256 | AVX-512VL | ymm0-ymm31 | doublewords, quadwords, single float and double float. With AVX-512BW: bytes and words.
AVX-512 | AVX-512F | zmm0-zmm31 | doublewords, quadwords, single float and double float. With AVX-512BW: bytes and words.

Extended registers

The width of the SIMD register file is increased from 256 bits to 512 bits, and the number of registers is expanded from 16 to a total of 32, ZMM0-ZMM31. These registers can be addressed as 256-bit YMM registers from the AVX extensions and as 128-bit XMM registers from Streaming SIMD Extensions. Legacy AVX and SSE instructions can be extended to operate on the 16 additional registers XMM16-XMM31 and YMM16-YMM31 when using their EVEX-encoded form.

Opmask registers

Most AVX-512 instructions may indicate one of 8 opmask registers. For instructions which use a mask register as an opmask, register `k0` is special: a hardcoded constant used to indicate unmasked operations. For other operations, such as those that write to an opmask register or perform arithmetic or logical operations, `k0` is a functioning, valid register. In most instructions, the opmask is used to control which values are written to the destination. A flag controls the opmask behavior, which can either be "zero", which zeros everything not selected by the mask, or "merge", which leaves everything not selected untouched. The merge behavior is identical to the blend instructions.
The opmask registers are normally 16 bits wide, but can be up to 64 bits wide with the AVX-512BW extension. How many of the bits are actually used depends on the element type of the masked instruction. For 32-bit single floats or doublewords, 16 bits are used to mask the 16 elements in a 512-bit register. For double floats and quadwords, at most 8 mask bits are used.
The opmask register is the reason why several bitwise instructions, which naturally have no element width, had one added in AVX-512. For instance, bitwise AND, OR and the 128-bit shuffle now exist in both doubleword and quadword variants, with the only difference being in the final masking.
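The two masking modes described above can be illustrated with a small software model. This is only a sketch in Python, not real vector code: the helper name `masked_op` and the use of plain lists for vector registers are assumptions made for illustration.

```python
def masked_op(op, a, b, dest, mask, zeroing=False):
    """Model an element-wise operation under an opmask.

    a, b, dest are equal-length lists standing in for vector registers;
    bit i of `mask` selects element i. (Hypothetical helper.)
    """
    out = []
    for i, (x, y, old) in enumerate(zip(a, b, dest)):
        if (mask >> i) & 1:
            out.append(op(x, y))   # selected: write the new result
        elif zeroing:
            out.append(0)          # "zero" mode: clear unselected elements
        else:
            out.append(old)        # "merge" mode: keep the old destination value
    return out


a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
dest = [-1, -1, -1, -1]
merged = masked_op(lambda x, y: x + y, a, b, dest, 0b0101)
zeroed = masked_op(lambda x, y: x + y, a, b, dest, 0b0101, zeroing=True)
print(merged)  # [11, -1, 33, -1]
print(zeroed)  # [11, 0, 33, 0]
```

With a 512-bit register of doublewords the lists would hold 16 elements and 16 mask bits would be used; with quadwords, 8.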

New opmask instructions

The opmask registers have a new mini extension of instructions operating directly on them. Unlike the rest of the AVX-512 instructions, these instructions are all VEX encoded. The initial opmask instructions are all 16-bit versions. With AVX-512DQ, 8-bit versions were added to better match the needs of masking 8 64-bit values, and with AVX-512BW, 32-bit and 64-bit versions were added so that they can mask up to 64 8-bit values. The instructions KORTEST and KTEST can be used to set the x86 flags based on mask registers, so that they may be used together with non-SIMD x86 branch and conditional instructions.
Instruction | Extension set | Description
KAND | F | Bitwise logical AND Masks
KANDN | F | Bitwise logical AND NOT Masks
KMOV | F | Move from and to Mask Registers or General Purpose Registers
KUNPCK | F | Unpack for Mask Registers
KNOT | F | NOT Mask Register
KOR | F | Bitwise logical OR Masks
KORTEST | F | OR Masks And Set Flags
KSHIFTL | F | Shift Left Mask Registers
KSHIFTR | F | Shift Right Mask Registers
KXNOR | F | Bitwise logical XNOR Masks
KXOR | F | Bitwise logical XOR Masks
KADD | BW/DQ | Add Two Masks
KTEST | BW/DQ | Bitwise comparison and set flags
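As a rough illustration of how a mask instruction can feed ordinary branches, the following Python sketch models the flag results of KORTEST on 16-bit masks: the zero flag is set when the OR of the two masks is zero, and the carry flag when it is all ones. The function name and the dictionary return value are assumptions made for illustration.

```python
def kortest(k1, k2, width=16):
    """Model KORTEST: OR two masks and derive ZF and CF from the result."""
    all_ones = (1 << width) - 1
    combined = (k1 | k2) & all_ones
    return {"ZF": combined == 0, "CF": combined == all_ones}


none_set = kortest(0x0000, 0x0000)   # no element selected
all_set = kortest(0xFF00, 0x00FF)    # every element selected
some_set = kortest(0x0001, 0x0000)   # at least one, but not all
print(none_set, all_set, some_set)
```

A loop can therefore test "no lane active" or "all lanes active" with a single conditional jump after the instruction.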

New instructions in AVX-512 foundation

Many AVX-512 instructions are simply EVEX versions of old SSE or AVX instructions. There are, however, several new instructions, and old instructions that have been replaced with new AVX-512 versions. The new or majorly reworked instructions are listed below. These foundation instructions also include the extensions from AVX-512VL and AVX-512BW since those extensions merely add new versions of these instructions instead of new instructions.

Blend using mask

There are no EVEX-prefixed versions of the blend instructions from SSE4; instead, AVX-512 has a new set of blending instructions using mask registers as selectors. Together with the general compare into mask instructions below, these may be used to implement generic ternary operations or cmov, similar to XOP's VPCMOV.
Since blending is an integral part of the EVEX encoding, these instructions may also be considered basic move instructions. Using the zeroing blend mode, they can also be used as masking instructions.
Instruction | Extension set | Description
VBLENDMPD | F | Blend float64 vectors using opmask control
VBLENDMPS | F | Blend float32 vectors using opmask control
VPBLENDMD | F | Blend int32 vectors using opmask control
VPBLENDMQ | F | Blend int64 vectors using opmask control
VPBLENDMB | BW | Blend byte integer vectors using opmask control
VPBLENDMW | BW | Blend word integer vectors using opmask control

Compare into mask

AVX-512F has four new compare instructions. Like their XOP counterparts they use the immediate field to select between 8 different comparisons. Unlike their XOP inspiration, however, they save the result to a mask register and initially only support doubleword and quadword comparisons. The AVX-512BW extension provides the byte and word versions. Note that two mask registers may be specified for the instructions, one to write to and one to declare regular masking.
Immediate | Comparison | Description
0 | EQ | Equal
1 | LT | Less than
2 | LE | Less than or equal
3 | FALSE | Set to zero
4 | NEQ | Not equal
5 | NLT | Greater than or equal
6 | NLE | Greater than
7 | TRUE | Set to one

Instruction | Extension set | Description
VPCMPD, VPCMPUD | F | Compare signed/unsigned doublewords into mask
VPCMPQ, VPCMPUQ | F | Compare signed/unsigned quadwords into mask
VPCMPB, VPCMPUB | BW | Compare signed/unsigned bytes into mask
VPCMPW, VPCMPUW | BW | Compare signed/unsigned words into mask
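The interaction between the immediate and the two mask registers can be sketched in Python as follows. This is a software model only: `vpcmp` and `PREDICATES` are hypothetical names, vectors are plain lists, and Python integers stand in for both the signed and the unsigned forms. Elements excluded by the optional write mask produce a zero bit in the result.

```python
import operator

# Immediate value -> predicate, following the table above.
PREDICATES = {
    0: operator.eq,            # EQ
    1: operator.lt,            # LT
    2: operator.le,            # LE
    3: lambda x, y: False,     # FALSE
    4: operator.ne,            # NEQ
    5: operator.ge,            # NLT
    6: operator.gt,            # NLE
    7: lambda x, y: True,      # TRUE
}


def vpcmp(a, b, imm, writemask=None):
    """Compare two vectors element-wise and return the result mask as an int."""
    result = 0
    for i, (x, y) in enumerate(zip(a, b)):
        enabled = writemask is None or (writemask >> i) & 1
        if enabled and PREDICATES[imm & 7](x, y):
            result |= 1 << i
    return result


a = [1, 5, 3, 7]
b = [2, 5, 1, 9]
lt_mask = vpcmp(a, b, 1)                        # elements 0 and 3 are less
lt_masked = vpcmp(a, b, 1, writemask=0b0011)    # only elements 0 and 1 considered
print(bin(lt_mask), bin(lt_masked))  # 0b1001 0b1
```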

Logical set mask

The final way to set masks is using Logical Set Mask. These instructions perform either AND or NAND, and then set the destination opmask based on the result values being zero or non-zero. Note that, like the comparison instructions, these take two opmask registers: one as the destination and one as a regular opmask.
Instruction | Extension set | Description
VPTESTMD, VPTESTMQ | F | Logical AND and set mask for 32 or 64 bit integers.
VPTESTNMD, VPTESTNMQ | F | Logical NAND and set mask for 32 or 64 bit integers.
VPTESTMB, VPTESTMW | BW | Logical AND and set mask for 8 or 16 bit integers.
VPTESTNMB, VPTESTNMW | BW | Logical NAND and set mask for 8 or 16 bit integers.
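A minimal Python model of the two behaviours, assuming list-based vectors and hypothetical function names: the AND form sets a mask bit where the element-wise AND is non-zero, and the NAND form sets it where the AND is zero.

```python
def vptestm(a, b, writemask=None):
    """Set mask bit i when (a[i] & b[i]) is non-zero."""
    result = 0
    for i, (x, y) in enumerate(zip(a, b)):
        enabled = writemask is None or (writemask >> i) & 1
        if enabled and (x & y) != 0:
            result |= 1 << i
    return result


def vptestnm(a, b, writemask=None):
    """Set mask bit i when (a[i] & b[i]) is zero."""
    result = 0
    for i, (x, y) in enumerate(zip(a, b)):
        enabled = writemask is None or (writemask >> i) & 1
        if enabled and (x & y) == 0:
            result |= 1 << i
    return result


a = [0b1100, 0b0011, 0x00, 0xFF]
b = [0b0100, 0b1100, 0xFF, 0x0F]
test_mask = vptestm(a, b)     # elements 0 and 3 share set bits
ntest_mask = vptestnm(a, b)   # elements 1 and 2 share none
print(bin(test_mask), bin(ntest_mask))  # 0b1001 0b110
```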

Compress and expand

The compress and expand instructions match the APL operations of the same name. They use the opmask in a slightly different way from other AVX-512 instructions. Compress only saves the values marked in the mask, but saves them compacted by skipping and not reserving space for unmarked values. Expand operates in the opposite way, by loading as many values as indicated in the mask and then spreading them to the selected positions.
Instruction | Description
VCOMPRESSPD, VCOMPRESSPS | Store sparse packed double/single-precision floating-point values into dense memory
VPCOMPRESSD, VPCOMPRESSQ | Store sparse packed doubleword/quadword integer values into dense memory/register
VEXPANDPD, VEXPANDPS | Load sparse packed double/single-precision floating-point values from dense memory
VPEXPANDD, VPEXPANDQ | Load sparse packed doubleword/quadword integer values from dense memory/register
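The compress and expand behaviour described above can be sketched with plain Python lists. The function names are hypothetical, and the model ignores memory-versus-register destinations; it only shows how the mask drives the packing.

```python
def compress(src, mask):
    """Keep only the elements whose mask bit is set, packed contiguously."""
    return [x for i, x in enumerate(src) if (mask >> i) & 1]


def expand(packed, mask, dest):
    """Spread consecutive packed values to the positions selected by the mask.

    Unselected positions keep the value from `dest` (merge behaviour).
    """
    out = list(dest)
    values = iter(packed)
    for i in range(len(out)):
        if (mask >> i) & 1:
            out[i] = next(values)
    return out


src = [10, 11, 12, 13, 14, 15, 16, 17]
mask = 0b10100110                       # selects elements 1, 2, 5 and 7
packed = compress(src, mask)
restored = expand(packed, mask, [0] * 8)
print(packed)    # [11, 12, 15, 17]
print(restored)  # [0, 11, 12, 0, 0, 15, 0, 17]
```

Expanding a compressed vector with the same mask restores the selected elements to their original positions, which is why the two operations are described as opposites.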

Permute

A new set of permute instructions has been added for full two-input permutations. They all take three arguments: two source registers and one index register. The result is output by overwriting either the first source register or the index register. AVX-512BW extends the instructions to also include 16-bit versions, and the AVX-512_VBMI extension defines the byte versions of the instructions.
Instruction | Extension set | Description
VPERMB | VBMI | Permute packed byte elements.
VPERMW | BW | Permute packed word elements.
VPERMT2B | VBMI | Full byte permute overwriting first source.
VPERMT2W | BW | Full word permute overwriting first source.
VPERMI2PD, VPERMI2PS | F | Full double/single floating point permute overwriting the index.
VPERMI2D, VPERMI2Q | F | Full doubleword/quadword permute overwriting the index.
VPERMI2B | VBMI | Full byte permute overwriting the index.
VPERMI2W | BW | Full word permute overwriting the index.
VPERMT2PS, VPERMT2PD | F | Full single/double floating point permute overwriting first source.
VPERMT2D, VPERMT2Q | F | Full doubleword/quadword permute overwriting first source.
VSHUFF32x4, VSHUFF64x2, VSHUFI32x4, VSHUFI64x2 | F | Shuffle four packed 128-bit lanes.
VPMULTISHIFTQB | VBMI | Select packed unaligned bytes from quadword sources.
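A two-input permute can be modelled as a lookup into the concatenation of two tables. In the Python sketch below (hypothetical function name, lists as registers, element count assumed to be a power of two), the low bits of each index pick an element and the next bit picks the table. The T2 and I2 forms differ only in which register receives the result, which a functional model does not need to distinguish.

```python
def permute2(table_a, table_b, idx):
    """Select each output element from table_a or table_b according to idx."""
    n = len(table_a)              # assumed to be a power of two
    out = []
    for i in idx:
        table = table_b if i & n else table_a   # bit log2(n) chooses the table
        out.append(table[i & (n - 1)])          # low bits choose the element
    return out


table_a = [10, 11, 12, 13]
table_b = [20, 21, 22, 23]
interleaved = permute2(table_a, table_b, [0, 4, 1, 5])
print(interleaved)  # [10, 20, 11, 21]
```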

Bitwise ternary logic

Two new instructions can logically implement all possible bitwise operations between three inputs. They take three registers as input and an 8-bit immediate field. Each bit in the output is generated by using the three corresponding input bits as an index that selects one of the 8 positions in the 8-bit immediate. Since only 8 combinations are possible with three bits, this allows all possible three-input bitwise operations to be performed.
These are the only new bitwise vector instructions in AVX-512F. EVEX versions of the two-source integer bitwise instructions AND, ANDN, OR and XOR are also available in AVX-512F in doubleword and quadword forms, while EVEX versions of their floating-point forms were added in AVX-512DQ.
The only difference between the doubleword and quadword versions is the application of the opmask.
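Instruction | Description
VPTERNLOGD, VPTERNLOGQ | Bitwise Ternary Logic

Truth table:
A0 | A1 | A2 | Double AND | Double OR | Bitwise blend
0 | 0 | 0 | 0 | 0 | 0
0 | 0 | 1 | 0 | 1 | 1
0 | 1 | 0 | 0 | 1 | 0
0 | 1 | 1 | 0 | 1 | 1
1 | 0 | 0 | 0 | 1 | 0
1 | 0 | 1 | 0 | 1 | 0
1 | 1 | 0 | 0 | 1 | 1
1 | 1 | 1 | 1 | 1 | 1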
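The lookup mechanism can be modelled bit by bit in Python. In this sketch (hypothetical function name), A0 is taken as the most significant bit of the 3-bit index, so reading each truth-table column from the bottom row to the top gives the immediate: 0x80 for the double AND, 0xFE for the double OR and 0xCA for the bitwise blend.

```python
def vpternlog(a, b, c, imm8, width=32):
    """Apply the three-input bitwise function encoded in imm8 to a, b and c."""
    out = 0
    for bit in range(width):
        index = (((a >> bit) & 1) << 2) | (((b >> bit) & 1) << 1) | ((c >> bit) & 1)
        out |= ((imm8 >> index) & 1) << bit   # look up the result bit in imm8
    return out


a, b, c = 0xF0F0F0F0, 0xCCCCCCCC, 0xAAAAAAAA
MASK32 = 0xFFFFFFFF
double_and = vpternlog(a, b, c, 0x80)
double_or = vpternlog(a, b, c, 0xFE)
blend = vpternlog(a, b, c, 0xCA)    # a selects between b (bit set) and c (bit clear)
print(hex(double_and), hex(double_or), hex(blend))
```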

Conversions

A number of conversion or move instructions were added; these complete the set of conversion instructions available from SSE2.
Instruction | Extension set | Description
VPMOVQD, VPMOVSQD, VPMOVUSQD, VPMOVQW, VPMOVSQW, VPMOVUSQW, VPMOVQB, VPMOVSQB, VPMOVUSQB, VPMOVDW, VPMOVSDW, VPMOVUSDW, VPMOVDB, VPMOVSDB, VPMOVUSDB | F | Down convert quadword or doubleword to doubleword, word or byte; unsaturated, saturated or saturated unsigned. The reverse of the sign/zero extend instructions from SSE4.1.
VPMOVWB, VPMOVSWB, VPMOVUSWB | BW | Down convert word to byte; unsaturated, saturated or saturated unsigned.
VCVTPS2UDQ, VCVTPD2UDQ, VCVTTPS2UDQ, VCVTTPD2UDQ | F | Convert with or without truncation, packed single or double-precision floating point to packed unsigned doubleword integers.
VCVTSS2USI, VCVTSD2USI, VCVTTSS2USI, VCVTTSD2USI | F | Convert with or without truncation, scalar single or double-precision floating point to unsigned doubleword integer.
VCVTPS2QQ, VCVTPD2QQ, VCVTPS2UQQ, VCVTPD2UQQ, VCVTTPS2QQ, VCVTTPD2QQ, VCVTTPS2UQQ, VCVTTPD2UQQ | DQ | Convert with or without truncation, packed single or double-precision floating point to packed signed or unsigned quadword integers.
VCVTUDQ2PS, VCVTUDQ2PD | F | Convert packed unsigned doubleword integers to packed single or double-precision floating point.
VCVTUSI2SS, VCVTUSI2SD | F | Convert scalar unsigned integers to single or double-precision floating point.
VCVTUQQ2PS, VCVTUQQ2PD | DQ | Convert packed unsigned quadword integers to packed single or double-precision floating point.
VCVTQQ2PD, VCVTQQ2PS | DQ | Convert packed quadword integers to packed double or single-precision floating point.

Floating point decomposition

Among the unique new features in AVX-512F are instructions to decompose floating-point values and handle special floating-point values. Since these instructions are completely new, they also exist in scalar versions.
Instruction | Description
VGETEXPPD, VGETEXPPS | Convert exponents of packed fp values into fp values
VGETEXPSD, VGETEXPSS | Convert exponent of scalar fp value into fp value
VGETMANTPD, VGETMANTPS | Extract vector of normalized mantissas from float64/float32 vector
VGETMANTSD, VGETMANTSS | Extract float64/float32 of normalized mantissa from float64/float32 scalar
VFIXUPIMMPD, VFIXUPIMMPS | Fix up special packed float64/float32 values
VFIXUPIMMSD, VFIXUPIMMSS | Fix up special scalar float64/float32 value
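The decomposition performed by the exponent and mantissa extraction instructions can be approximated in Python with `math.frexp`. This sketch uses hypothetical function names and makes two simplifying assumptions: the mantissa is normalized to the interval [1, 2) and keeps the sign of the input, and only finite, non-zero inputs are handled. The other normalization intervals and the special-value handling of the real instructions are not modelled.

```python
import math


def getexp(x):
    """Return the unbiased exponent floor(log2(|x|)) as a float."""
    _, e = math.frexp(x)      # x == m * 2**e with 0.5 <= |m| < 1
    return float(e - 1)


def getmant(x):
    """Return the mantissa of x normalized to [1, 2), with the sign of x."""
    m, _ = math.frexp(x)
    return 2.0 * m


x = 10.0
exponent = getexp(x)
mantissa = getmant(x)
print(exponent, mantissa)  # 3.0 1.25
```

Multiplying the mantissa by two raised to the exponent gives back the original value, here 1.25 * 2**3 = 10.0.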

Floating point arithmetic

This is the second set of new floating-point instructions, which includes new scaling instructions and approximate calculation of the reciprocal and the reciprocal of the square root. The approximate reciprocal instructions guarantee a relative error of at most 2^−14.
Instruction | Description
VRCP14PD, VRCP14PS | Compute approximate reciprocals of packed float64/float32 values
VRCP14SD, VRCP14SS | Compute approximate reciprocals of scalar float64/float32 value
VRNDSCALEPS, VRNDSCALEPD | Round packed float32/float64 values to include a given number of fraction bits
VRNDSCALESS, VRNDSCALESD | Round scalar float32/float64 value to include a given number of fraction bits
VRSQRT14PD, VRSQRT14PS | Compute approximate reciprocals of square roots of packed float64/float32 values
VRSQRT14SD, VRSQRT14SS | Compute approximate reciprocal of square root of scalar float64/float32 value
VSCALEFPS, VSCALEFPD | Scale packed float32/float64 values with float32/float64 values
VSCALEFSS, VSCALEFSD | Scale scalar float32/float64 value with float32/float64 value
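The scaling and round-to-fraction-bits operations can be approximated in Python as follows. The function names are hypothetical, and the sketch assumes finite inputs and round-to-nearest-even rounding (which is what Python's built-in `round` performs); the other rounding modes and the special-value handling of the real instructions are not modelled.

```python
import math


def scalef(x, y):
    """Scale x by two raised to the floor of y: x * 2**floor(y)."""
    return math.ldexp(x, math.floor(y))


def rndscale(x, fraction_bits):
    """Round x so that it keeps only `fraction_bits` binary fraction bits."""
    scaled = math.ldexp(x, fraction_bits)          # x * 2**fraction_bits
    return math.ldexp(round(scaled), -fraction_bits)


scaled_up = scalef(3.0, 2.7)        # 3.0 * 2**2
scaled_down = scalef(3.0, -1.5)     # 3.0 * 2**-2
quarter_steps = rndscale(3.14159, 2)
print(scaled_up, scaled_down, quarter_steps)  # 12.0 0.75 3.25
```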

Broadcast

Miscellaneous

New instructions by sets

Conflict detection

The instructions in AVX-512 conflict detection are designed to help efficiently calculate conflict-free subsets of elements in loops that normally could not be safely vectorized.
Instruction | Name | Description
VPCONFLICTD, VPCONFLICTQ | Detect conflicts within vector of packed double- or quadword values. | Compares each element of the source with all elements at earlier (less significant) positions and forms a bit vector of the results.
VPLZCNTD, VPLZCNTQ | Count the number of leading zero bits for packed double- or quadword values. | Vectorized LZCNT instruction.
VPBROADCASTMB2Q, VPBROADCASTMW2D | Broadcast mask to vector register. | Either 8-bit mask to quadword vector, or 16-bit mask to doubleword vector.
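A small Python model can show what the conflict-detection result looks like. In this sketch (hypothetical function name, a list standing in for the vector), output element i has bit j set when an earlier element j holds the same value as element i. A loop that scatters through these values as indices can use the result to find lanes that would collide.

```python
def vpconflict(src):
    """For each element, build a bit vector of earlier elements with the same value."""
    out = []
    for i, x in enumerate(src):
        bits = 0
        for j in range(i):          # only positions before i are compared
            if src[j] == x:
                bits |= 1 << j
        out.append(bits)
    return out


indices = [3, 7, 3, 3, 7]
conflicts = vpconflict(indices)
print(conflicts)  # [0, 0, 1, 5, 2]
```

Elements whose result is zero have no earlier duplicate and can safely be processed together in one vector step.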

Exponential and reciprocal

AVX-512 exponential and reciprocal instructions contain more accurate approximate reciprocal instructions than those in the AVX-512 foundation; their relative error is at most 2^−28. They also contain two new exponential functions that have a relative error of at most 2^−23.
Instruction | Description
VEXP2PD, VEXP2PS | Compute approximate exponential 2^x of packed double or single-precision floating point values
VRCP28PD, VRCP28PS | Compute approximate reciprocals of packed double or single-precision floating point values
VRCP28SD, VRCP28SS | Compute approximate reciprocal of scalar double or single-precision floating point value
VRSQRT28PD, VRSQRT28PS | Compute approximate reciprocals of square roots of packed double or single-precision floating point values
VRSQRT28SD, VRSQRT28SS | Compute approximate reciprocal of square root of scalar double or single-precision floating point value

Prefetch

AVX-512 prefetch instructions contain new prefetch operations for the new scatter and gather functionality introduced in AVX2 and AVX-512. T0 prefetch means prefetching into level 1 cache and T1 means prefetching into level 2 cache.
Instruction | Description
VGATHERPF0DPS, VGATHERPF0QPS, VGATHERPF0DPD, VGATHERPF0QPD | Using signed dword/qword indices, prefetch sparse byte memory locations containing single/double-precision data using opmask k1 and T0 hint.
VGATHERPF1DPS, VGATHERPF1QPS, VGATHERPF1DPD, VGATHERPF1QPD | Using signed dword/qword indices, prefetch sparse byte memory locations containing single/double-precision data using opmask k1 and T1 hint.
VSCATTERPF0DPS, VSCATTERPF0QPS, VSCATTERPF0DPD, VSCATTERPF0QPD | Using signed dword/qword indices, prefetch sparse byte memory locations containing single/double-precision data using writemask k1 and T0 hint with intent to write.
VSCATTERPF1DPS, VSCATTERPF1QPS, VSCATTERPF1DPD, VSCATTERPF1QPD | Using signed dword/qword indices, prefetch sparse byte memory locations containing single/double-precision data using writemask k1 and T1 hint with intent to write.

4FMAPS and 4VNNIW

BW, DQ and VBMI

AVX-512DQ adds new doubleword and quadword instructions. AVX-512BW adds byte and word versions of the same instructions, and adds byte and word versions of the doubleword/quadword instructions in AVX-512F. A few instructions which get only word forms with AVX-512BW acquire byte forms with the AVX-512_VBMI extension.
Two new instructions were added to the mask instruction set: KADD and KTEST. The rest of the mask instructions, which had only word forms, got byte forms with AVX-512DQ and doubleword/quadword forms with AVX-512BW. KUNPCKBW was extended to KUNPCKWD and KUNPCKDQ by AVX-512BW.
Among the instructions added by AVX-512DQ are several SSE and AVX instructions that did not get AVX-512 versions with AVX-512F; among those are all the two-input bitwise instructions and the extract/insert integer instructions.
Instructions that are completely new are covered below.

Floating point instructions

Three new floating point operations are introduced. Since they are entirely new, rather than AVX-512 versions of existing instructions, they have both packed (SIMD) and scalar versions.
The VFPCLASS instructions test whether a floating-point value belongs to one of eight special classes of floating-point values; the immediate field controls which of the eight classes set a bit in the output mask register. The VRANGE instructions perform minimum or maximum operations depending on the value of the immediate field, which can also control whether the operation is done on absolute values and, separately, how the sign is handled. The VREDUCE instructions operate on a single source and subtract from it the integer part of the source value plus a number of its fraction bits specified in the immediate field.
Instruction | Extension set | Description
VFPCLASSPS, VFPCLASSPD | DQ | Test types of packed single and double precision floating point values.
VFPCLASSSS, VFPCLASSSD | DQ | Test types of scalar single and double precision floating point values.
VRANGEPS, VRANGEPD | DQ | Range restriction calculation for packed floating point values.
VRANGESS, VRANGESD | DQ | Range restriction calculation for scalar floating point values.
VREDUCEPS, VREDUCEPD | DQ | Perform reduction transformation on packed floating point values.
VREDUCESS, VREDUCESD | DQ | Perform reduction transformation on scalar floating point values.

Other instructions

VBMI2

VBMI2 extends VPCOMPRESS and VPEXPAND with byte and word variants. The concatenate-and-shift instructions are new.
Instruction | Description
VPCOMPRESSB, VPCOMPRESSW | Store sparse packed byte/word integer values into dense memory/register
VPEXPANDB, VPEXPANDW | Load sparse packed byte/word integer values from dense memory/register
VPSHLD | Concatenate and shift packed data left logical
VPSHLDV | Concatenate and variable shift packed data left logical
VPSHRD | Concatenate and shift packed data right logical
VPSHRDV | Concatenate and variable shift packed data right logical
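The concatenate-and-shift operations can be modelled per element in Python. This sketch assumes 32-bit elements and shift counts taken modulo the element width; the function names and the argument order are choices made for illustration, not the operand order of the instructions. The left form keeps the upper half of the shifted double-width value, and the right form keeps the lower half.

```python
def shld(hi, lo, count, width=32):
    """Concatenate hi:lo, shift left by count, return the upper `width` bits."""
    count %= width
    mask = (1 << width) - 1
    return ((((hi << width) | lo) << count) >> width) & mask


def shrd(lo, hi, count, width=32):
    """Concatenate hi:lo, shift right by count, return the lower `width` bits."""
    count %= width
    mask = (1 << width) - 1
    return (((hi << width) | lo) >> count) & mask


left = shld(0x12345678, 0x9ABCDEF0, 8)
right = shrd(0x12345678, 0x9ABCDEF0, 8)
rotated = shld(0x80000001, 0x80000001, 1)   # same value twice gives a rotate
print(hex(left), hex(right), hex(rotated))  # 0x3456789a 0xf0123456 0x3
```

In the variable forms each element would use its own count from a third vector; applying these functions element by element models that.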

VNNI

Vector Neural Network Instructions.
Instruction | Description
VPDPBUSD | Multiply and add unsigned and signed bytes
VPDPBUSDS | Multiply and add unsigned and signed bytes with saturation
VPDPWSSD | Multiply and add signed word integers
VPDPWSSDS | Multiply and add word integers with saturation
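The byte multiply-and-add operation can be sketched for a single 32-bit lane in Python. The function name is hypothetical; the model assumes four unsigned bytes from one source, four signed bytes from the other, and a signed 32-bit accumulator that either wraps around or saturates.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def dot_accumulate_bytes(acc, unsigned_bytes, signed_bytes, saturate=False):
    """Add the dot product of four u8 and four s8 values to a 32-bit accumulator."""
    total = acc + sum(u * s for u, s in zip(unsigned_bytes, signed_bytes))
    if saturate:
        return max(INT32_MIN, min(INT32_MAX, total))   # clamp to int32
    return (total + 2**31) % 2**32 - 2**31              # wrap to int32


lane = dot_accumulate_bytes(100, [1, 2, 3, 255], [1, -1, 2, -128])
wrapped = dot_accumulate_bytes(INT32_MAX, [255] * 4, [127] * 4)
clamped = dot_accumulate_bytes(INT32_MAX, [255] * 4, [127] * 4, saturate=True)
print(lane, wrapped, clamped)
```

A 512-bit register holds 16 such lanes, so one instruction performs 64 byte multiplications.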

IFMA

VPOPCNTDQ and BITALG

VP2INTERSECT

GFNI

EVEX-encoded Galois field new instructions:
Instruction | Description
VGF2P8AFFINEINVQB | Galois field affine transformation inverse
VGF2P8AFFINEQB | Galois field affine transformation
VGF2P8MULB | Galois field multiply bytes
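The byte multiplication performed by VGF2P8MULB is multiplication in GF(2^8) with the reduction polynomial x^8 + x^4 + x^3 + x + 1 (0x11B), the same field that AES uses. A single-byte Python model (hypothetical function name, shift-and-XOR method) is sketched below.

```python
def gf2p8_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a        # addition in GF(2^8) is XOR
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B         # reduce when the degree reaches 8
    return result


product = gf2p8_mul(0x57, 0x83)
print(hex(product))  # 0xc1
```

The vector instruction applies this operation independently to every byte pair of its two sources.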

VPCLMULQDQ

VPCLMULQDQ together with AVX-512F adds an EVEX-encoded 512-bit version of the PCLMULQDQ instruction. Together with AVX-512VL, it adds EVEX-encoded 256- and 128-bit versions. VPCLMULQDQ alone adds only a VEX-encoded 256-bit version. The variants wider than 128 bits perform the same operation on each 128-bit portion of the input registers, but they do not extend it to select quadwords from different 128-bit fields.
Instruction | Description
VPCLMULQDQ | Carry-less multiplication quadword
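The per-lane behaviour can be sketched in Python. In this model (hypothetical function names), each 128-bit lane is a Python integer, bit 0 of the immediate selects the low or high quadword of the first source and bit 4 does the same for the second source, and the same selection is applied to every lane.

```python
def clmul64(a, b):
    """Carry-less product of two 64-bit values: XOR replaces addition."""
    result = 0
    for bit in range(64):
        if (b >> bit) & 1:
            result ^= a << bit
    return result


def vpclmulqdq(src1_lanes, src2_lanes, imm8):
    """Apply the carry-less multiply independently to each 128-bit lane."""
    q = (1 << 64) - 1
    shift1 = 64 if imm8 & 0x01 else 0   # quadword choice in the first source
    shift2 = 64 if imm8 & 0x10 else 0   # quadword choice in the second source
    return [clmul64((x >> shift1) & q, (y >> shift2) & q)
            for x, y in zip(src1_lanes, src2_lanes)]


src1 = [(0x3 << 64) | 0x5, (0x7 << 64) | 0x2]
src2 = [(0x6 << 64) | 0x3, (0x1 << 64) | 0x4]
low_products = vpclmulqdq(src1, src2, 0x00)
high_products = vpclmulqdq(src1, src2, 0x11)
print(low_products, high_products)  # [15, 8] [10, 7]
```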

VAES

EVEX-encoded AES instructions. The wider than 128-bit variations of the instruction perform the same operation on each 128-bit portion of input registers.
Instruction | Description
VAESDEC | Perform one round of an AES decryption flow
VAESDECLAST | Perform last round of an AES decryption flow
VAESENC | Perform one round of an AES encryption flow
VAESENCLAST | Perform last round of an AES encryption flow

BF16

AI acceleration instructions operating on the Bfloat16 format.
Instruction | Description
VCVTNE2PS2BF16 | Convert two packed single precision numbers to one packed Bfloat16 number
VCVTNEPS2BF16 | Convert one packed single precision number to one packed Bfloat16 number
VDPBF16PS | Calculate dot product of two Bfloat16 pairs and accumulate the result into one packed single precision number
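Bfloat16 keeps the sign, the 8 exponent bits and the top 7 fraction bits of a single-precision value, so a conversion amounts to rounding away the low 16 bits. The Python sketch below (hypothetical function names) uses round-to-nearest-even on the bit pattern; NaN inputs and the denormal handling of the real instructions are not modelled.

```python
import struct


def f32_to_bf16_bits(x):
    """Round a float to the nearest Bfloat16 and return its 16-bit pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding = 0x7FFF + ((bits >> 16) & 1)   # ties go to the even result
    return ((bits + rounding) >> 16) & 0xFFFF


def bf16_bits_to_f32(b):
    """Widen a Bfloat16 bit pattern back to a float by appending 16 zero bits."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]


one = f32_to_bf16_bits(1.0)
pi_bits = f32_to_bf16_bits(3.14159265)
pi_back = bf16_bits_to_f32(pi_bits)
print(hex(one), hex(pi_bits), pi_back)  # 0x3f80 0x4049 3.140625
```

The round trip shows the precision loss: only about three significant decimal digits survive.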

Legacy instructions upgraded with EVEX encoded versions

CPUs with AVX-512



QEMU supports AVX-512.

Performance

Intel's Advisor tool supports native AVX-512 performance and vector code quality analysis for the 2nd generation Intel Xeon Phi processor. Along with the traditional hotspots profile, Advisor Recommendations and "seamless" integration of Intel Compiler vectorization diagnostics, the Advisor Survey analysis also provides AVX-512 ISA metrics and new AVX-512-specific "traits", e.g. Scatter, Compress/Expand and mask utilization.
AVX-512 causes frequency throttling even greater than that of its predecessors, which penalizes mixed workloads. The additional downclocking is triggered by the 512-bit width of the vectors; using the 256-bit part of AVX-512 does not trigger it. As a result, GCC prefers 256-bit vectors by default.