AArch64


AArch64 or ARM64 is the 64-bit extension of the ARM architecture.
It was first introduced with the ARMv8-A architecture.

ARMv8-A

Announced in October 2011, ARMv8-A represents a fundamental change to the ARM architecture. It adds an optional 64-bit architecture, named "AArch64", and the associated new "A64" instruction set. AArch64 provides user-space compatibility with the existing 32-bit architecture, ARMv7-A, which is referred to as "AArch32", and with the old 32-bit instruction set, now named "A32". The Thumb instruction set is referred to as "T32" and has no 64-bit counterpart. ARMv8-A allows 32-bit applications to be executed in a 64-bit OS, and a 32-bit OS to run under the control of a 64-bit hypervisor.
ARM announced its Cortex-A53 and Cortex-A57 cores on 30 October 2012. Apple was the first to release an ARMv8-A compatible core in a consumer product. AppliedMicro, using an FPGA, was the first to demo ARMv8-A. The first ARMv8-A SoC from Samsung is the Exynos 5433 used in the Galaxy Note 4, which features two clusters of four Cortex-A57 and Cortex-A53 cores in a big.LITTLE configuration, although it runs only in AArch32 mode.
ARMv8-A makes VFPv3/v4 and Advanced SIMD standard in both AArch32 and AArch64. It also adds cryptography instructions supporting AES, SHA-1/SHA-256 and finite field arithmetic.
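As a rough illustration of how software can tell these execution states and optional features apart at build time, the following C sketch checks predefined compiler macros. The macro names (`__aarch64__`, `__arm__`, `__ARM_NEON`, `__ARM_FEATURE_AES`, `__ARM_FEATURE_SHA2`, `__ARM_FEATURE_CRYPTO`) come from GCC/Clang and the Arm C Language Extensions rather than from this article, and the function names are hypothetical. The checks reflect what the compiler was told to target, not what the CPU running the program supports.

```c
#include <stdio.h>
#include <string.h>

/* Name of the execution state the compiler is targeting. */
const char *target_state(void) {
#if defined(__aarch64__)
    return "AArch64";
#elif defined(__arm__)
    return "AArch32";
#else
    return "non-ARM";
#endif
}

/* 1 if Advanced SIMD was enabled at compile time, else 0. */
int has_advanced_simd(void) {
#if defined(__ARM_NEON)
    return 1;
#else
    return 0;
#endif
}

/* 1 if the AES instructions were enabled at compile time, else 0. */
int has_aes(void) {
#if defined(__ARM_FEATURE_AES) || defined(__ARM_FEATURE_CRYPTO)
    return 1;
#else
    return 0;
#endif
}

/* 1 if the SHA-256 instructions were enabled at compile time, else 0. */
int has_sha2(void) {
#if defined(__ARM_FEATURE_SHA2) || defined(__ARM_FEATURE_CRYPTO)
    return 1;
#else
    return 0;
#endif
}

/* Print a one-line summary of the compile-time target. */
void print_target_features(void) {
    printf("%s: simd=%d aes=%d sha2=%d\n",
           target_state(), has_advanced_simd(), has_aes(), has_sha2());
}
```

Calling `print_target_features()` from a program built with different `-march` options shows how the optional cryptography instructions appear or disappear while the execution state stays the same.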

AArch64 features

AArch64 was introduced in ARMv8-A and is included in subsequent versions of ARMv8-A. AArch64 is not included in ARMv8-R or ARMv8-M, because they are both 32-bit architectures.

ARMv8.1-A

In December 2014, ARMv8.1-A, an update with "incremental benefits over v8.0", was announced. The enhancements fell into two categories: changes to the instruction set, and changes to the exception model and memory translation.
Instruction set enhancements included the following:
Enhancements for the exception model and memory translation system included the following:

ARMv8.2-A

In January 2016, ARMv8.2-A was announced. Its enhancements fell into four categories:
Scalable Vector Extension (SVE)
The Scalable Vector Extension is "an optional extension to the ARMv8.2-A architecture and newer" developed specifically for vectorization of high-performance computing scientific workloads. The specification allows for variable vector lengths to be implemented from 128 to 2048 bits. The extension is complementary to, and does not replace, the NEON extensions.
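Because the vector length is chosen by the implementation, SVE code is normally written so that it does not depend on any particular length. The portable C sketch below only simulates that idea: it processes an array one "vector" at a time for a caller-supplied length and uses a per-lane predicate to mask off the tail. The function name `vla_saxpy` and the restriction to multiples of 128 bits are assumptions of this sketch, and no SVE instructions are involved.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Simulated vector-length-agnostic loop: y[i] += a * x[i].
 * vl_bits is the simulated vector length (128..2048, multiple of 128).
 * The same source gives the same result for every legal vl_bits.
 */
void vla_saxpy(size_t vl_bits, float a, const float *x, float *y, size_t n) {
    assert(vl_bits >= 128 && vl_bits <= 2048 && vl_bits % 128 == 0);

    size_t lanes = vl_bits / 32;          /* 32-bit float lanes per vector */

    for (size_t i = 0; i < n; i += lanes) {
        for (size_t lane = 0; lane < lanes; ++lane) {
            /* Predicate: a lane is active only while it is inside the array. */
            int active = (i + lane) < n;
            if (active) {
                y[i + lane] += a * x[i + lane];
            }
        }
    }
}

/* Number of vector-sized steps the loop above takes for n elements. */
size_t vla_steps(size_t vl_bits, size_t n) {
    size_t lanes = vl_bits / 32;
    return (n + lanes - 1) / lanes;
}
```

With ten elements, a 128-bit vector takes three steps and a 2048-bit vector takes one, yet both produce identical output, which is the property that lets one binary run on implementations with different vector lengths.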
A 512-bit SVE variant has already been implemented on the Fugaku supercomputer using the Fujitsu A64FX ARM processor. Fugaku aims to be the world's highest-performing supercomputer with "the goal of beginning full operations around 2021."
SVE is supported by the GCC compiler, with GCC 8 supporting automatic vectorization and GCC 10 supporting C intrinsics. As of July 2020, LLVM and Clang support C intrinsics.
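To give a feel for the C intrinsics mentioned above, here is a minimal sketch of an element-wise array addition. The intrinsic names (`svcntw`, `svwhilelt_b32_u64`, `svld1_f32`, `svadd_f32_x`, `svst1_f32`) and the `__ARM_FEATURE_SVE` macro come from the Arm C Language Extensions header `arm_sve.h`, not from this article, and `add_arrays` is a hypothetical name. The SVE path is compiled only when the compiler targets SVE (for example with `-march=armv8.2-a+sve`); otherwise a plain scalar loop is used.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

/* out[i] = a[i] + b[i] for i in [0, n). */
void add_arrays(const float *a, const float *b, float *out, size_t n) {
#if defined(__ARM_FEATURE_SVE)
    /* svcntw() is the number of 32-bit lanes in one vector on this CPU. */
    for (uint64_t i = 0; i < (uint64_t)n; i += svcntw()) {
        /* Predicate is true for lanes whose index i + lane is below n. */
        svbool_t pg = svwhilelt_b32_u64(i, (uint64_t)n);
        svfloat32_t va = svld1_f32(pg, a + i);
        svfloat32_t vb = svld1_f32(pg, b + i);
        svst1_f32(pg, out + i, svadd_f32_x(pg, va, vb));
    }
#else
    /* Scalar fallback for targets without SVE. */
    for (size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
#endif
}
```

The loop never mentions a vector width; the predicate produced for the last iteration handles any leftover elements, so no separate scalar tail loop is needed on the SVE path.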

ARMv8.3-A

In October 2016, ARMv8.3-A was announced. Its enhancements fell into six categories:
The ARMv8.3-A architecture is supported by the GCC 7 compiler.

ARMv8.4-A

In November 2017, ARMv8.4-A was announced. Its enhancements fell into these categories:

ARMv8.5-A

In September 2018, ARMv8.5-A was announced. Its enhancements fell into these categories:
On 2 August 2019, Google announced that Android would adopt the Memory Tagging Extension.

ARMv8.6-A

In September 2019, ARMv8.6-A was announced. It adds:
Examples include fine-grained traps, Wait-for-Event instructions, EnhancedPAC2 and FPAC. The bfloat16 extensions for SVE and Neon are mainly intended for deep learning.
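For readers unfamiliar with the format, bfloat16 keeps the sign bit, the 8 exponent bits and the top 7 fraction bits of an IEEE 754 single-precision value, so it covers the same range as a 32-bit float with less precision. The C sketch below converts between the two in software. The function names are hypothetical, and round-to-nearest-even is one common choice made for this sketch; it is not claimed to match the behavior of the ARMv8.6-A instructions.

```c
#include <stdint.h>
#include <string.h>

_Static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

/* Convert a 32-bit float to bfloat16 using round-to-nearest-even. */
uint16_t float_to_bf16(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);

    /* NaN: truncate and force a fraction bit so the result stays a NaN. */
    if ((bits & 0x7FFFFFFFu) > 0x7F800000u) {
        return (uint16_t)((bits >> 16) | 0x0040u);
    }

    /* Add 0x7FFF plus the lowest kept bit, then drop the low 16 bits. */
    uint32_t rounding = 0x7FFFu + ((bits >> 16) & 1u);
    return (uint16_t)((bits + rounding) >> 16);
}

/* Widen a bfloat16 value back to a 32-bit float (always exact). */
float bf16_to_float(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

Values such as 1.0 or 3.0 survive a round trip unchanged, while a value like 0.1 comes back with only about two to three significant decimal digits, a loss that deep learning workloads generally tolerate.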

Future ARM architecture features

In May 2019, ARM announced their upcoming Scalable Vector Extension 2 and Transactional Memory Extension.
Scalable Vector Extension 2 (SVE2)
SVE2 builds on SVE's scalable vectorization to increase fine-grained data-level parallelism, allowing more work to be done per instruction. SVE2 aims to bring these benefits to a wider range of software, including DSP and multimedia SIMD code that currently uses Neon. The development versions of LLVM/Clang 9.0 and GCC 10.0 were updated to support SVE2.
Transactional Memory Extension (TME)
Following the x86 extensions, TME brings support for Hardware Transactional Memory and Transactional Lock Elision. TME aims to bring scalable concurrency to increase coarse-grained thread-level parallelism, allowing more work to be done per thread. The development versions of LLVM/Clang 9.0 and GCC 10.0 were updated to support TME.
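Lock elision can be sketched as follows: a thread first tries to run its critical section as a hardware transaction without taking the lock, and falls back to the real lock only if the transaction fails. The C sketch below assumes the TME intrinsics `__tstart`, `__tcancel` and `__tcommit` from the Arm C Language Extensions header `arm_acle.h`, guarded by `__ARM_FEATURE_TME`; neither the intrinsics nor the retry count of three comes from this article, and the transactional path has not been exercised on hardware. The lock, counter and function names are hypothetical. On targets without TME only the lock-based path is compiled.

```c
#include <stdatomic.h>
#include <stdint.h>

#if defined(__ARM_FEATURE_TME)
#include <arm_acle.h>
#endif

static atomic_int fallback_lock = 0;   /* 0 = free, 1 = held */
static long shared_counter = 0;        /* data protected by the lock */

/* Simple spinlock used when a transaction cannot be used. */
static void lock_acquire(void) {
    int expected = 0;
    while (!atomic_compare_exchange_weak_explicit(
               &fallback_lock, &expected, 1,
               memory_order_acquire, memory_order_relaxed)) {
        expected = 0;                  /* reset after a failed attempt */
    }
}

static void lock_release(void) {
    atomic_store_explicit(&fallback_lock, 0, memory_order_release);
}

/* Increment the counter, eliding the lock when a transaction succeeds. */
void counter_increment(void) {
#if defined(__ARM_FEATURE_TME)
    for (int attempt = 0; attempt < 3; ++attempt) {
        if (__tstart() == 0) {         /* 0 means the transaction started */
            /* Reading the lock adds it to the read set, so a later
               lock_acquire() by another thread aborts this transaction. */
            if (atomic_load_explicit(&fallback_lock,
                                     memory_order_relaxed) != 0) {
                __tcancel(1);          /* lock is held: abort and retry */
            }
            shared_counter++;
            __tcommit();
            return;
        }
    }
#endif
    /* Fallback path: take the lock for real. */
    lock_acquire();
    shared_counter++;
    lock_release();
}

long counter_value(void) {
    return shared_counter;
}
```

The fallback lock is still required because a hardware transaction may fail for reasons unrelated to contention, so the elided path is an optimization rather than a replacement for the lock.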