LLVM


The LLVM compiler infrastructure project is a set of compiler and toolchain technologies, which can be used to develop a front end for any programming language and a back end for any instruction set architecture. LLVM is designed around a language-independent intermediate representation that serves as a portable, high-level assembly language that can be optimized with a variety of transformations over multiple passes.
LLVM is written in C++ and is designed for compile-time, link-time, run-time, and "idle-time" optimization. Originally implemented for C and C++, the language-agnostic design of LLVM has since spawned a wide variety of front ends: languages with compilers that use LLVM include ActionScript, Ada, C#, Common Lisp, Crystal, CUDA, D, Delphi, Dylan, Fortran, Graphical G Programming Language, Halide, Haskell, Java bytecode, Julia, Kotlin, Lua, Objective-C, OpenGL Shading Language, PostgreSQL's SQL and PLpgSQL, Ruby, Rust, Scala, Swift, and Xojo.

History

The LLVM project started in 2000 at the University of Illinois at Urbana–Champaign, under the direction of Vikram Adve and Chris Lattner. LLVM was originally developed as a research infrastructure to investigate dynamic compilation techniques for static and dynamic programming languages. LLVM was released under the University of Illinois/NCSA Open Source License, a permissive free software licence. In 2005, Apple Inc. hired Lattner and formed a team to work on the LLVM system for various uses within Apple's development systems. LLVM is an integral part of Apple's latest development tools for macOS and iOS. Since 2013, Sony has been using LLVM's primary front end Clang compiler in the software development kit of its PlayStation 4 console.
The name LLVM was originally an initialism for Low Level Virtual Machine. This abbreviation has officially been removed to avoid confusion, as LLVM has evolved into an umbrella project that has little relationship to what most current developers think of as virtual machines. Now, LLVM is a brand that applies to the LLVM umbrella project, the LLVM intermediate representation, the LLVM debugger, the LLVM implementation of the C++ Standard Library, etc. LLVM is administered by the LLVM Foundation. Its president is compiler engineer Tanya Lattner.
"For designing and implementing LLVM", the Association for Computing Machinery presented Vikram Adve, Chris Lattner, and Evan Cheng with the 2012 ACM Software System Award.
With v9.0.0, LLVM was relicensed under the Apache License 2.0 with LLVM Exceptions.

Features

LLVM can provide the middle layers of a complete compiler system, taking intermediate representation code from a compiler and emitting an optimized IR. This new IR can then be converted and linked into machine-dependent assembly language code for a target platform. LLVM can accept the IR from the GNU Compiler Collection toolchain, allowing it to be used with a wide array of extant compilers written for that project.
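The "IR in, optimized IR out" role described above can be illustrated with a minimal Python sketch. It uses a hypothetical tuple-based toy IR and two hand-written passes (constant propagation with folding, then dead-code elimination); the instruction format and function names are assumptions made for illustration and are not LLVM's data structures or API.

```python
# Toy three-address IR: each instruction is (dest, op, arg1, arg2).
# Hypothetical format for illustration only; each dest is assigned once.

def propagate_and_fold(instrs):
    """Substitute known constants and fold add/mul on two integer literals."""
    ops = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
    known = {}  # dest -> constant value discovered so far
    out = []
    for dest, op, a, b in instrs:
        a = known.get(a, a)
        b = known.get(b, b)
        if op == "const":
            known[dest] = a
            out.append((dest, "const", a, None))
        elif op in ops and isinstance(a, int) and isinstance(b, int):
            value = ops[op](a, b)
            known[dest] = value
            out.append((dest, "const", value, None))
        else:
            out.append((dest, op, a, b))
    return out


def eliminate_dead_code(instrs, live_out):
    """Scan backwards and keep only instructions whose result is needed."""
    live = set(live_out)
    kept = []
    for dest, op, a, b in reversed(instrs):
        if dest in live:
            live.discard(dest)
            kept.append((dest, op, a, b))
            live.update(x for x in (a, b) if isinstance(x, str))
    kept.reverse()
    return kept


def run_pipeline(instrs, live_out):
    """Run the passes in sequence, each consuming the previous pass's IR."""
    return eliminate_dead_code(propagate_and_fold(instrs), live_out)


program = [
    ("t0", "const", 2, None),
    ("t1", "const", 3, None),
    ("t2", "mul", "t0", "t1"),  # foldable: 2 * 3
    ("t3", "add", "t2", "x"),   # depends on the runtime input x
    ("t4", "add", "t0", "t1"),  # result is never used
]
optimized = run_pipeline(program, live_out={"t3"})
print(optimized)
```

Each pass takes a list of instructions and returns a new list, so passes can be reordered or repeated, which is the general idea behind running multiple transformations over one representation.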
LLVM can also generate relocatable machine code at compile-time or link-time or even binary machine code at run-time.
LLVM supports a language-independent instruction set and type system. Each instruction is in static single assignment form, meaning that each variable is assigned once and then frozen. This helps simplify the analysis of dependencies among variables. LLVM allows code to be compiled statically, as it is under the traditional GCC system, or left for late-compiling from the IR to machine code via just-in-time compilation, similar to Java. The type system consists of basic types such as integer or floating point numbers and five derived types: pointers, arrays, vectors, structures, and functions. A type construct in a concrete language can be represented by combining these basic types in LLVM. For example, a class in C++ can be represented by a mix of structures, functions and arrays of function pointers.
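As a rough illustration of the single-assignment property, the following Python sketch renames variables in straight-line code so that every name is assigned exactly once. The function name `to_ssa`, the `(target, expression)` input format, and the `_N` suffix scheme are assumptions for this example; control flow, and the phi nodes it would require, is deliberately left out.

```python
import re


def to_ssa(statements):
    """Rename variables in straight-line code so each name is assigned once.

    `statements` is a list of (target, expression) pairs. Identifiers in each
    expression are rewritten to the latest version of that variable.
    """
    version = {}  # variable -> latest version number
    result = []

    def current(match):
        name = match.group(0)
        # Names never assigned in this block are inputs; leave them as-is.
        return f"{name}_{version[name]}" if name in version else name

    for target, expr in statements:
        new_expr = re.sub(r"\b[A-Za-z_]\w*", current, expr)
        version[target] = version.get(target, 0) + 1
        result.append((f"{target}_{version[target]}", new_expr))
    return result


code = [("x", "a + b"), ("x", "x * 2"), ("y", "x + 1")]
ssa = to_ssa(code)
for target, expr in ssa:
    print(f"{target} = {expr}")
```

After renaming, every use refers to exactly one definition, which is why dependency analysis becomes simpler: there is no need to ask which of several assignments to `x` reaches a given use.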
The LLVM JIT compiler can optimize unneeded static branches out of a program at runtime, and thus is useful for partial evaluation in cases where a program has many options, most of which can easily be determined unneeded in a specific environment. This feature is used in the OpenGL pipeline of Mac OS X Leopard to provide support for missing hardware features.
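The idea of removing branches once their conditions are known can be sketched in Python as follows. The tuple-based program tree, the option names, and the `specialize` function are all hypothetical and only illustrate partial evaluation in general; they do not describe how the LLVM JIT or the OpenGL pipeline is implemented.

```python
def specialize(node, options):
    """Prune branches whose condition is already decided by `options`.

    Node forms (hypothetical, for illustration only):
      ("if", option_name, then_node, else_node)
      ("seq", [child_nodes])
      ("op", name)
    """
    kind = node[0]
    if kind == "if":
        _, name, then_node, else_node = node
        if name in options:
            # Condition is known now: keep only the branch that will run.
            chosen = then_node if options[name] else else_node
            return specialize(chosen, options)
        # Condition still unknown: keep the branch, specialize both arms.
        return ("if", name,
                specialize(then_node, options),
                specialize(else_node, options))
    if kind == "seq":
        return ("seq", [specialize(child, options) for child in node[1]])
    return node


pipeline = ("seq", [
    ("if", "has_hw_fog", ("op", "gpu_fog"), ("op", "cpu_fog")),
    ("if", "debug", ("op", "validate"), ("seq", [])),
    ("if", "wireframe", ("op", "draw_lines"), ("op", "draw_triangles")),
])

# Two options are known in this environment; "wireframe" is not.
specialized = specialize(pipeline, {"has_hw_fog": False, "debug": False})
print(specialized)
```

Only the `wireframe` test survives specialization, so the remaining program performs one check at run time instead of three.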
Graphics code within the OpenGL stack can be left in intermediate representation, and then compiled when run on the target machine. On systems with high-end graphics processing units, the resulting code remains quite thin, passing the instructions on to the GPU with minimal changes. On systems with low-end GPUs, LLVM will compile optional procedures that run on the local central processing unit that emulate instructions that the GPU cannot run internally. LLVM improved performance on low-end machines using Intel GMA chipsets. A similar system was developed under the Gallium3D LLVMpipe, and incorporated into the GNOME shell to allow it to run without a proper 3D hardware driver loaded.
In terms of run-time performance of the compiled programs, GCC outperformed LLVM by 10% on average in 2011. Newer results from 2013 indicate that LLVM has caught up with GCC in this area and now produces binaries of approximately equal performance.

Components

LLVM has become an umbrella project containing multiple components.

Front ends

LLVM was originally written to be a replacement for the existing code generator in the GCC stack, and many of the GCC front ends have been modified to work with it, resulting in the now-defunct llvm-gcc suite. The modifications generally involve a GIMPLE-to-LLVM IR step so that LLVM optimizers and codegen can be used instead of GCC's GIMPLE system. Apple has historically been an important user of llvm-gcc. This was considered mostly a temporary measure, and with the advent of Clang and the advantages of LLVM and Clang's modern and modular codebase, it is now mostly obsolete.
LLVM currently supports compiling of Ada, C, C++, D, Delphi, Fortran, Haskell, Julia, Objective-C, Rust, and Swift using various front ends.
Widespread interest in LLVM has led to several efforts to develop new front ends for a variety of languages. The one that has received the most attention is Clang, a new compiler supporting C, C++, and Objective-C. Primarily supported by Apple, Clang is aimed at replacing the C/Objective-C compiler in the GCC system with a system that is more easily integrated with integrated development environments and has wider support for multithreading. Support for OpenMP directives has been included in Clang since release 3.8.
The Utrecht Haskell compiler can generate code for LLVM. Though the generator is in the early stages of development, in many cases it has been more efficient than the C code generator. There is a Glasgow Haskell Compiler backend using LLVM that achieves a 30% speed-up of the compiled code relative to native code compiling via GHC or C code generation followed by compiling, missing only one of the many optimizing techniques implemented by the GHC.
Many other components are in various stages of development, including, but not limited to, the Rust compiler, a Java bytecode front end, a Common Intermediate Language front end, the MacRuby implementation of Ruby 1.9, various front ends for Standard ML, and a new graph coloring register allocator.

Intermediate representation

The core of LLVM is the intermediate representation, a low-level programming language similar to assembly. IR is a strongly typed reduced instruction set computing instruction set which abstracts away most details of the target. For example, the calling convention is abstracted through call and ret instructions with explicit arguments. Also, instead of a fixed set of registers, IR uses an infinite set of temporaries of the form %0, %1, etc. LLVM supports three equivalent forms of IR: a human-readable assembly format, an in-memory format suitable for frontends, and a dense bitcode format for serializing. A simple "Hello, world!" program in the IR format:

@.str = internal constant [14 x i8] c"hello, world\0A\00"
declare i32 @printf(ptr, ...)
define i32 @main() nounwind {
  %call = call i32 (ptr, ...) @printf(ptr @.str)
  ret i32 0
}
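
In the human-readable form, a string constant carries an explicit array type `[N x i8]` whose length counts every byte, including the trailing newline and the terminating NUL, and non-printable bytes are written as a backslash followed by two hexadecimal digits. The following Python sketch shows how a front end might emit such a line; the helper name `ir_string_constant` is an assumption for illustration and is not part of any LLVM tool or binding.

```python
def ir_string_constant(name, text):
    """Return a textual LLVM IR global holding a NUL-terminated C string."""
    data = text.encode("utf-8") + b"\x00"
    # Printable ASCII is kept as-is, except the double quote (0x22) and the
    # backslash (0x5C); every other byte becomes a two-digit hex escape.
    escaped = "".join(
        chr(b) if 0x20 <= b < 0x7F and b not in (0x22, 0x5C) else f"\\{b:02X}"
        for b in data
    )
    return f'@{name} = internal constant [{len(data)} x i8] c"{escaped}"'


line = ir_string_constant(".str", "hello, world\n")
print(line)
```

The array length is derived from the encoded bytes rather than written by hand, since a mismatch between the declared length and the initializer is rejected when the IR is parsed.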

The many different conventions used and features provided by different targets mean that LLVM cannot truly produce a target-independent IR and retarget it without breaking some established rules. Examples of target dependence beyond what is explicitly mentioned in the documentation can be found in a 2011 proposal for "wordcode", a fully target-independent variant of LLVM IR intended for online distribution.

Back ends

As of version 3.4, LLVM supports many instruction sets, including ARM, Qualcomm Hexagon, MIPS, Nvidia Parallel Thread Execution, PowerPC, AMD TeraScale, AMD Graphics Core Next, SPARC, z/Architecture, x86, x86-64, and XCore. Some features are not available on some platforms. Most features are present for x86, x86-64, z/Architecture, ARM, and PowerPC. RISC-V is supported as of version 7. In the past, LLVM also fully or partially supported other back ends, including a C back end, Cell SPU, mblaze, AMD R600, DEC/Compaq Alpha and Nios2, but most of this hardware is obsolete, and continued LLVM support and maintenance for it could not be justified.
LLVM also supports WebAssembly as a target, which allows compiled programs to execute in WebAssembly-enabled environments such as Google Chrome / Chromium, Firefox, Microsoft Edge, Apple Safari or WAVM. WebAssembly support makes it possible to take mostly unmodified source code, programs and libraries written in C, C++, D, Rust, Nim, Kotlin and possibly other third-party LLVM-based languages and target them to WebAssembly.
The LLVM machine code subproject is LLVM's framework for translating machine instructions between textual forms and machine code. Formerly, LLVM relied on the system assembler, or one provided by a toolchain, to translate assembly into machine code. LLVM MC's integrated assembler supports most LLVM targets, including x86, x86-64, ARM, and ARM64. For some targets, including the various MIPS instruction sets, integrated assembly support is usable but still in the beta stage.

Linker

The lld subproject is an attempt to develop a built-in, platform-independent linker for LLVM. lld aims to remove dependence on a third-party linker. lld supports ELF, PE/COFF, Mach-O, and WebAssembly in descending order of completeness. In cases where lld is insufficient, another linker such as GNU ld can be used.
Using lld allows link-time optimization. When link-time optimization is enabled, the compiler generates LLVM bitcode instead of native code, and native code generation is done by the linker.

C++ Standard Library

The LLVM project includes an implementation of the C++ Standard Library called libc++, dual-licensed under the MIT License and the UIUC license.
With v9.0.0, libc++ was relicensed under the Apache License 2.0 with LLVM Exceptions.

Polly

Polly implements a suite of cache-locality optimizations as well as auto-parallelism and vectorization using a polyhedral model.
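Loop tiling is one well-known example of a cache-locality transformation that polyhedral optimizers can derive. The Python sketch below applies it by hand to a matrix transpose and checks that the tiled loop nest computes the same result as the original one. The function names and the tile size are assumptions for illustration; this is not Polly's output, and the locality benefit itself would only be observable in a compiled language with contiguous arrays.

```python
def transpose_naive(a, n):
    """Reference loop nest: walks `a` row by row and `out` column by column."""
    out = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            out[j][i] = a[i][j]
    return out


def transpose_tiled(a, n, tile=4):
    """Same computation, reordered so work proceeds in tile x tile blocks."""
    out = [[0] * n for _ in range(n)]
    for ii in range(0, n, tile):          # iterate over tile origins
        for jj in range(0, n, tile):
            # min() handles partial tiles when n is not a multiple of tile.
            for i in range(ii, min(ii + tile, n)):
                for j in range(jj, min(jj + tile, n)):
                    out[j][i] = a[i][j]
    return out


n = 10
matrix = [[i * n + j for j in range(n)] for i in range(n)]
tiled = transpose_tiled(matrix, n)
print(tiled == transpose_naive(matrix, n))
```

The transformation changes only the order in which iterations execute, not the set of iterations, which is the kind of reordering a polyhedral model is designed to reason about.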

Debugger

Literature