OpenCL

OpenCL is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units, graphics processing units, digital signal processors, field-programmable gate arrays and other processors or hardware accelerators. OpenCL specifies programming languages for programming these devices and application programming interfaces to control the platform and execute programs on the compute devices. OpenCL provides a standard interface for parallel computing using task- and data-based parallelism.
OpenCL is an open standard maintained by the non-profit technology consortium Khronos Group. Conformant implementations are available from Altera, AMD, Apple, ARM, Creative, IBM, Imagination, Intel, Nvidia, Qualcomm, Samsung, Vivante, Xilinx, and ZiiLABS.

Overview

OpenCL views a computing system as consisting of a number of compute devices, which might be central processing units or "accelerators" such as graphics processing units, attached to a host processor. It defines a C-like language for writing programs. Functions executed on an OpenCL device are called "kernels". A single compute device typically consists of several compute units, which in turn comprise multiple processing elements (PEs). A single kernel execution can run on all or many of the PEs in parallel. How a compute device is subdivided into compute units and PEs is up to the vendor; a compute unit can be thought of as a "core", but the notion of core is hard to define across all the types of devices supported by OpenCL, and the number of compute units may not correspond to the number of cores claimed in vendors' marketing literature.
In addition to its C-like programming language, OpenCL defines an application programming interface that allows programs running on the host to launch kernels on the compute devices and manage device memory, which is separate from host memory. Programs in the OpenCL language are intended to be compiled at run-time, so that OpenCL-using applications are portable between implementations for various host devices. The OpenCL standard defines host APIs for C and C++; third-party APIs exist for other programming languages and platforms such as Python, Java, Perl and .NET. An implementation of the OpenCL standard consists of a library that implements the API for C and C++, and an OpenCL C compiler for the compute device targeted.
In order to open the OpenCL programming model to other languages or to protect the kernel source from inspection, the Standard Portable Intermediate Representation can be used as a target-independent way to ship kernels between a front-end compiler and the OpenCL back-end.
More recently, the Khronos Group ratified SYCL, a higher-level programming model for OpenCL. SYCL is a single-source domain-specific embedded language (DSEL) based on pure C++11, intended to improve programming productivity.

Memory hierarchy

OpenCL defines a four-level memory hierarchy for the compute device:

global memory: shared by all processing elements, but has high access latency (__global);
read-only memory: smaller, low latency, writable by the host CPU but not the compute devices (__constant);
local memory: shared by a group of processing elements (__local);
per-element private memory (__private).

Not every device needs to implement each level of this hierarchy in hardware. Consistency between the various levels in the hierarchy is relaxed, and only enforced by explicit synchronization constructs, notably barriers.
Devices may or may not share memory with the host CPU. The host API provides handles on device memory buffers and functions to transfer data back and forth between host and devices.
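To make the roles of global, local and private memory more concrete, the following plain C sketch emulates, sequentially on the host, how one work-group might stage data from global memory into local memory and then reduce it in steps separated by barriers. It is an illustration under stated assumptions, not OpenCL code: the names group_sum, reduce_all and GROUP_SIZE are hypothetical, and each completed inner loop stands in for a barrier.

```c
#include <assert.h>
#include <stddef.h>

#define GROUP_SIZE 4  /* hypothetical work-group size; must be a power of two */

/* Emulates one work-group summing GROUP_SIZE elements of a "global" buffer.
 * `local` plays the role of local memory shared by the group, and the loop
 * variable `lid` plays the role of a work-item's local id. */
static float group_sum(const float *global, size_t group_id)
{
    float local[GROUP_SIZE];

    /* Phase 1: every work-item copies one element from global to local memory. */
    for (size_t lid = 0; lid < GROUP_SIZE; lid++) {
        local[lid] = global[group_id * GROUP_SIZE + lid];
    }
    /* A barrier would go here: all copies must finish before any reads. */

    /* Phase 2: tree reduction. Each stride is one barrier-separated step;
     * within a step, reads and writes touch disjoint halves of `local`. */
    for (size_t stride = GROUP_SIZE / 2; stride > 0; stride /= 2) {
        for (size_t lid = 0; lid < stride; lid++) {
            local[lid] += local[lid + stride];
        }
        /* Finishing the inner loop emulates the barrier for this step. */
    }
    return local[0];
}

/* Host-side emulation: run every group, then combine the partial sums. */
static float reduce_all(const float *global, size_t n)
{
    assert(n % GROUP_SIZE == 0);
    float total = 0.0f;
    for (size_t g = 0; g < n / GROUP_SIZE; g++) {
        total += group_sum(global, g);
    }
    return total;
}
```

On a real device the work-items of a group run concurrently, which is why the explicit synchronization between steps is needed; the sequential loops here only mimic that ordering.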

OpenCL C language

The programming language that is used to write compute kernels is called OpenCL C and is based on C99, but adapted to fit the device model in OpenCL. Memory buffers reside in specific levels of the memory hierarchy, and pointers are annotated with the region qualifiers __global, __local, __constant, and __private, reflecting this. Instead of a device program having a main function, OpenCL C functions are marked __kernel to signal that they are entry points into the program to be called from the host program. Function pointers, bit fields and variable-length arrays are omitted, and recursion is forbidden. The C standard library is replaced by a custom set of standard functions, geared toward math programming.
OpenCL C is extended to facilitate use of parallelism with vector types and operations, synchronization, and functions to work with work-items and work-groups. In particular, besides scalar types such as float and double, which behave similarly to the corresponding types in C, OpenCL provides fixed-length vector types such as float4 (a 4-vector of single-precision floats); such vector types are available in lengths two, three, four, eight and sixteen for various base types. Vectorized operations on these types are intended to map onto SIMD instruction sets, e.g., SSE or VMX, when running OpenCL programs on CPUs. Other specialized types include 2-d and 3-d image types.
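As a rough illustration of what component-wise vector operations mean, the following plain C sketch emulates a four-component float vector on the host. The type float4_emul and the helpers f4_add and f4_dot are hypothetical stand-ins; in OpenCL C itself the vector type is built in, addition is written with the ordinary + operator, and a dot product is available as a built-in function.

```c
#include <assert.h>  /* only needed for the checks in the usage example */

/* Hypothetical host-side stand-in for a 4-component float vector. */
typedef struct { float s[4]; } float4_emul;

/* Component-wise addition: each lane is added independently,
 * which is what lets the operation map onto one SIMD instruction. */
static float4_emul f4_add(float4_emul a, float4_emul b)
{
    float4_emul r;
    for (int k = 0; k < 4; k++) {
        r.s[k] = a.s[k] + b.s[k];
    }
    return r;
}

/* Dot product: multiply lane by lane, then sum the four products. */
static float f4_dot(float4_emul a, float4_emul b)
{
    float sum = 0.0f;
    for (int k = 0; k < 4; k++) {
        sum += a.s[k] * b.s[k];
    }
    return sum;
}
```

For example, adding {1, 2, 3, 4} and {10, 20, 30, 40} with f4_add gives {11, 22, 33, 44}, and f4_dot on the same inputs gives 300.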

Example: matrix-vector multiplication

The following is a matrix-vector multiplication algorithm in OpenCL C.

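// Multiplies A*x, leaving the result in y.
// A is a row-major matrix, meaning the (i,j) element is at A[i*ncols+j].
__kernel void matvec(__global const float *A, __global const float *x,
                     uint ncols, __global float *y)
{
    size_t i = get_global_id(0);            // Global id, used as the row index
    __global const float *a = &A[i*ncols];  // Pointer to the i'th row
    float sum = 0.f;                        // Accumulator for the dot product
    for (size_t j = 0; j < ncols; j++) {
        sum += a[j] * x[j];
    }
    y[i] = sum;
}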

The kernel function matvec computes, in each invocation, the dot product of a single row of a matrix A and a vector x:
y_i = a_{i,:} \cdot x = \sum_j a_{i,j} x_j
To extend this into a full matrix-vector multiplication, the OpenCL runtime maps the kernel over the rows of the matrix. On the host side, the clEnqueueNDRangeKernel function does this; it takes as arguments the kernel to execute, its arguments, and a number of work-items, corresponding to the number of rows in the matrix A.
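The mapping of one kernel invocation per matrix row can be sketched in plain C. The code below is a sequential host-side emulation under stated assumptions, not OpenCL code: matvec_work_item and launch_matvec are hypothetical names, the parameter i stands in for the index a work-item would obtain from get_global_id(0), and the loop in launch_matvec stands in for the runtime launching one work-item per row.

```c
#include <assert.h>
#include <stddef.h>

/* Emulates a single kernel invocation (work-item): the dot product of
 * row i of the row-major matrix A with the vector x, stored in y[i]. */
static void matvec_work_item(const float *A, const float *x,
                             size_t ncols, float *y, size_t i)
{
    const float *row = &A[i * ncols];  /* element (i,j) is at A[i*ncols + j] */
    float sum = 0.0f;
    for (size_t j = 0; j < ncols; j++) {
        sum += row[j] * x[j];
    }
    y[i] = sum;
}

/* Emulates the host-side launch with nrows work-items. A real runtime
 * may run the work-items in parallel and in any order; that is safe
 * here because each one writes only its own y[i]. */
static void launch_matvec(const float *A, const float *x,
                          size_t nrows, size_t ncols, float *y)
{
    assert(A != NULL && x != NULL && y != NULL);
    for (size_t i = 0; i < nrows; i++) {
        matvec_work_item(A, x, ncols, y, i);
    }
}
```

With the 2-by-3 matrix {1, 2, 3, 4, 5, 6} and x = {1, 2, 3}, this yields y = {14, 32}.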

Example: computing the FFT

This example will load a fast Fourier transform implementation and execute it. The implementation is shown below. The code asks the OpenCL library for the first available graphics card, creates memory buffers for reading and writing, JIT-compiles the FFT-kernel and then finally asynchronously runs the kernel. The result from the transform is not read in this example.

#include <stdio.h>
#include <time.h>
#include "CL/opencl.h"
#define NUM_ENTRIES 1024

int main //

The actual calculation is performed by the kernel inside the file "fft1D_1024_kernel_src.cl":

R""

A full, open source implementation of an OpenCL FFT can be found on Apple's website.

History

OpenCL was initially developed by Apple Inc., which holds trademark rights, and refined into an initial proposal in collaboration with technical teams at AMD, IBM, Qualcomm, Intel, and Nvidia. Apple submitted this initial proposal to the Khronos Group. On June 16, 2008, the Khronos Compute Working Group was formed with representatives from CPU, GPU, embedded-processor, and software companies. This group worked for five months to finish the technical details of the specification for OpenCL 1.0 by November 18, 2008. This technical specification was reviewed by the Khronos members and approved for public release on December 8, 2008.

OpenCL 1.0

OpenCL 1.0 was released with Mac OS X Snow Leopard on August 28, 2009. According to an Apple press release:

Snow Leopard further extends support for modern hardware with Open Computing Language, which lets any application tap into the vast gigaflops of GPU computing power previously available only to graphics applications. OpenCL is based on the C programming language and has been proposed as an open standard.

AMD decided to support OpenCL instead of the now deprecated Close to Metal in its Stream framework. RapidMind announced their adoption of OpenCL underneath their development platform to support GPUs from multiple vendors with one interface. On December 9, 2008, Nvidia announced its intention to add full support for the OpenCL 1.0 specification to its GPU Computing Toolkit. On October 30, 2009, IBM released its first OpenCL implementation as a part of the XL compilers.

OpenCL 1.1

OpenCL 1.1 was ratified by the Khronos Group on June 14, 2010, and adds significant functionality for enhanced parallel programming flexibility and performance, including:

New data types including 3-component vectors and additional image formats;
Handling commands from multiple host threads and processing buffers across multiple devices;
Operations on regions of a buffer including read, write and copy of 1D, 2D, or 3D rectangular regions;
Enhanced use of events to drive and control command execution;
Additional OpenCL built-in C functions such as integer clamp, shuffle, and asynchronous strided copies;
Improved OpenGL interoperability through efficient sharing of images and buffers by linking OpenCL and OpenGL events.
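The rectangular-region operations in the list above address a sub-block of a buffer through row and slice pitches. The following plain C sketch shows that addressing scheme on host memory, assuming the usual convention that the first coordinate is measured in bytes, the second in rows and the third in slices. The function names rect_offset and copy_rect are hypothetical and are not part of the OpenCL API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Byte offset of a point in a pitched buffer:
 * origin[0] is in bytes, origin[1] in rows, origin[2] in slices. */
static size_t rect_offset(const size_t origin[3],
                          size_t row_pitch, size_t slice_pitch)
{
    return origin[2] * slice_pitch + origin[1] * row_pitch + origin[0];
}

/* Copies a region (width in bytes, height in rows, depth in slices)
 * from src to dst, one contiguous row at a time. */
static void copy_rect(const unsigned char *src, unsigned char *dst,
                      const size_t src_origin[3], const size_t dst_origin[3],
                      const size_t region[3],
                      size_t src_row_pitch, size_t src_slice_pitch,
                      size_t dst_row_pitch, size_t dst_slice_pitch)
{
    assert(region[0] <= src_row_pitch && region[0] <= dst_row_pitch);
    size_t s0 = rect_offset(src_origin, src_row_pitch, src_slice_pitch);
    size_t d0 = rect_offset(dst_origin, dst_row_pitch, dst_slice_pitch);
    for (size_t z = 0; z < region[2]; z++) {
        for (size_t y = 0; y < region[1]; y++) {
            size_t s = s0 + z * src_slice_pitch + y * src_row_pitch;
            size_t d = d0 + z * dst_slice_pitch + y * dst_row_pitch;
            memcpy(dst + d, src + s, region[0]);
        }
    }
}
```

For instance, copying the 2-by-2 block at origin (1, 1, 0) out of a 4-by-4 byte buffer holding the values 0 to 15 into a tightly packed destination produces the bytes 5, 6, 9, 10.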

OpenCL 1.2

On November 15, 2011, the Khronos Group announced the OpenCL 1.2 specification, which added significant functionality over the previous versions in terms of performance and features for parallel programming. Most notable features include:

Device partitioning: the ability to partition a device into sub-devices so that work assignments can be allocated to individual compute units. This is useful for reserving areas of the device to reduce latency for time-critical tasks.
Separate compilation and linking of objects: the functionality to compile OpenCL into external libraries for inclusion into other programs.
Enhanced image support: 1.2 adds support for 1D images and 1D/2D image arrays. Furthermore, the OpenGL sharing extensions now allow for OpenGL 1D textures and 1D/2D texture arrays to be used to create OpenCL images.
Built-in kernels: custom devices that contain specific unique functionality are now integrated more closely into the OpenCL framework. Kernels can be called to use specialised or non-programmable aspects of underlying hardware. Examples include video encoding/decoding and digital signal processors.
DirectX functionality: DX9 media surface sharing allows for efficient sharing between OpenCL and DX9 or DXVA media surfaces. Equally, for DX11, seamless sharing between OpenCL and DX11 surfaces is enabled.
The ability to force IEEE 754 compliance for single precision floating point math: OpenCL by default allows the single precision versions of the division, reciprocal, and square root operation to be less accurate than the correctly rounded values that IEEE 754 requires. If the programmer passes the "-cl-fp32-correctly-rounded-divide-sqrt" command line argument to the compiler, these three operations will be computed to IEEE 754 requirements if the OpenCL implementation supports this, and will fail to compile if the OpenCL implementation does not support computing these operations to their correctly-rounded values as defined by the IEEE 754 specification. This ability is supplemented by the ability to query the OpenCL implementation to determine if it can perform these operations to IEEE 754 accuracy.

OpenCL 2.0

On November 18, 2013, the Khronos Group announced the ratification and public release of the finalized OpenCL 2.0 specification. Updates and additions to OpenCL 2.0 include:

Shared virtual memory
Nested parallelism
Generic address space
Images
C11 atomics
Pipes
Android installable client driver extension

OpenCL 2.1

The ratification and release of the OpenCL 2.1 provisional specification was announced on March 3, 2015 at the Game Developer Conference in San Francisco. It was released on November 16, 2015. It introduced the OpenCL C++ kernel language, based on a subset of C++14, while maintaining support for the preexisting OpenCL C kernel language. Vulkan and OpenCL 2.1 share SPIR-V as an intermediate representation allowing high-level language front-ends to share a common compilation target. Updates to the OpenCL API include:

Additional subgroup functionality
Copying of kernel objects and states
Low-latency device timer queries
Ingestion of SPIR-V code by runtime
Execution priority hints for queues
Zero-sized dispatches from host

AMD, ARM, Intel, HPC, and YetiWare have declared support for OpenCL 2.1.

OpenCL 2.2

OpenCL 2.2 brings the OpenCL C++ kernel language into the core specification for significantly enhanced parallel programming productivity. It was released on May 16, 2017. A maintenance update with bug fixes was released in May 2018.

The OpenCL C++ kernel language is a static subset of the C++14 standard and includes classes, templates, lambda expressions, function overloads and many other constructs for generic and meta-programming.
OpenCL 2.2 uses the new Khronos SPIR-V 1.1 intermediate language, which fully supports the OpenCL C++ kernel language.
OpenCL library functions can now use the C++ language to provide increased safety and reduced undefined behavior while accessing features such as atomics, iterators, images, samplers, pipes, and device queue built-in types and address spaces.
Pipe storage is a new device-side type in OpenCL 2.2 that is useful for FPGA implementations by making connectivity size and type known at compile time, enabling efficient device-scope communication between kernels.
OpenCL 2.2 also includes features for enhanced optimization of generated code: applications can provide the value of specialization constant at SPIR-V compilation time, a new query can detect non-trivial constructors and destructors of program scope global objects, and user callbacks can be set at program release time.
OpenCL 2.2 runs on any OpenCL 2.0-capable hardware.

OpenCL 3.0

The OpenCL 3.0 specification is provisional. OpenCL 1.2 functionality is mandatory, while all OpenCL 2.x features and the new 3.0 features are optional. The new preferred kernel language is C++ for OpenCL, which includes many C++17 features.

Roadmap

When releasing OpenCL 2.2, the Khronos Group announced that OpenCL would converge where possible with Vulkan to enable OpenCL software deployment flexibility over both APIs. This has now been demonstrated by Adobe's Premiere Rush, which uses the clspv open source compiler to compile significant amounts of OpenCL C kernel code to run on a Vulkan runtime for deployment on Android. OpenCL has a forward-looking roadmap independent of Vulkan, with 'OpenCL Next' under development and targeting release in 2020. OpenCL Next may integrate extensions such as Vulkan / OpenCL Interop, Scratch-Pad Memory Management, Extended Subgroups, SPIR-V 1.4 ingestion and SPIR-V Extended debug info. OpenCL is also considering a Vulkan-like loader and layers and a 'Flexible Profile' for deployment flexibility on multiple accelerator types.

Vendor implementations

Timeline of vendor implementations

December 10, 2008: AMD and Nvidia held the first public OpenCL demonstration, a 75-minute presentation at SIGGRAPH Asia 2008. AMD showed a CPU-accelerated OpenCL demo explaining the scalability of OpenCL on one or more cores while Nvidia showed a GPU-accelerated demo.
March 16, 2009: at the 4th Multicore Expo, Imagination Technologies announced the PowerVR SGX543MP, the first GPU of this company to feature OpenCL support.
March 26, 2009: at GDC 2009, AMD and Havok demonstrated the first working implementation for OpenCL accelerating Havok Cloth on AMD Radeon HD 4000 series GPU.
April 20, 2009: Nvidia announced the release of its OpenCL driver and SDK to developers participating in its OpenCL Early Access Program.
August 5, 2009: AMD unveiled the first development tools for its OpenCL platform as part of its ATI Stream SDK v2.0 Beta Program.
August 28, 2009: Apple released Mac OS X Snow Leopard, which contains a full implementation of OpenCL.
September 28, 2009: Nvidia released its own OpenCL drivers and SDK implementation.
October 13, 2009: AMD released the fourth beta of the ATI Stream SDK 2.0, which provides a complete OpenCL implementation on both R700/R800 GPUs and SSE3 capable CPUs. The SDK is available for both Linux and Windows.
October 27, 2009: S3 released their first product supporting native OpenCL 1.0 – the Chrome 5400E embedded graphics processor.
November 26, 2009: Nvidia released drivers for OpenCL 1.0.
December 10, 2009: VIA released their first product supporting OpenCL 1.0 – ChromotionHD 2.0 video processor included in VN1000 chipset.
December 21, 2009: AMD released the production version of the ATI Stream SDK 2.0, which provides OpenCL 1.0 support for R800 GPUs and beta support for R700 GPUs.
June 1, 2010: ZiiLABS released details of their first OpenCL implementation for the ZMS processor for handheld, embedded and digital home products.
June 30, 2010: IBM released a fully conformant version of OpenCL 1.0.
September 13, 2010: Intel released details of their first OpenCL implementation for the Sandy Bridge chip architecture. Sandy Bridge will integrate Intel's newest graphics chip technology directly onto the central processing unit.
November 15, 2010: Wolfram Research released Mathematica 8 with the OpenCLLink package.
March 3, 2011: Khronos Group announced the formation of the WebCL working group to explore defining a JavaScript binding to OpenCL. This creates the potential to harness GPU and multi-core CPU parallel processing from a Web browser.
March 31, 2011: IBM released a fully conformant version of OpenCL 1.1.
April 25, 2011: IBM released OpenCL Common Runtime v0.1 for Linux on x86 Architecture.
May 4, 2011: Nokia Research released an open source WebCL extension for the Firefox web browser, providing a JavaScript binding to OpenCL.
July 1, 2011: Samsung Electronics released an open source prototype implementation of WebCL for WebKit, providing a JavaScript binding to OpenCL.
August 8, 2011: AMD released the OpenCL-driven AMD Accelerated Parallel Processing Software Development Kit v2.5, replacing the ATI Stream SDK as technology and concept.
December 12, 2011: AMD released AMD APP SDK v2.6 which contains a preview of OpenCL 1.2.
February 27, 2012: The Portland Group released the PGI OpenCL compiler for multi-core ARM CPUs.
April 17, 2012: Khronos released a WebCL working draft.
May 6, 2013: Altera released the Altera SDK for OpenCL, version 13.0. It is conformant to OpenCL 1.0.
November 18, 2013: Khronos announced that the specification for OpenCL 2.0 had been finalized.
March 19, 2014: Khronos released the WebCL 1.0 specification.
August 29, 2014: Intel released an HD Graphics 5300 driver that supports OpenCL 2.0.
September 25, 2014: AMD released Catalyst 14.41 RC1, which includes an OpenCL 2.0 driver.
January 14, 2015: Xilinx Inc. announced the SDAccel development environment for OpenCL, C, and C++, which achieved Khronos conformance.
April 13, 2015: Nvidia released WHQL driver v350.12, which includes OpenCL 1.2 support for GPUs based on Kepler or later architectures. Drivers 340+ support OpenCL 1.1 for Tesla and Fermi.
August 26, 2015: AMD released AMD APP SDK v3.0 which contains full support of OpenCL 2.0 and sample coding.
November 16, 2015: Khronos announced that the specification for OpenCL 2.1 had been finalized.
April 18, 2016: Khronos announced that the specification for OpenCL 2.2 had been provisionally finalized.
November 3, 2016: Intel SDK 2016 r3 added OpenCL 2.1 support for Gen7+.
February 17, 2017: Nvidia begins evaluation support of OpenCL 2.0 with driver 378.66.
May 16, 2017: Khronos announced that the specification for OpenCL 2.2 had been finalized with SPIR-V 1.2.
May 14, 2018: Khronos announced a maintenance update for OpenCL 2.2 with bug fixes and unified headers.
April 27, 2020: Khronos announced a provisional version of OpenCL 3.0.

Devices

As of 2016, OpenCL runs on graphics processing units, CPUs with SIMD instructions, FPGAs, Movidius Myriad 2, Adapteva Epiphany and DSPs.

Khronos Conformance Test Suite

Conformant products

The Khronos Group maintains an extended list of OpenCL-conformant products.
All standard-conformant implementations can be queried using one of the clinfo tools.

Version support

Products and their version of OpenCL support include:

OpenCL 3.0 support

None yet: possible on all hardware with OpenCL 1.2+ support; Khronos test suite work is in progress
Intel NEO Compute: beta for Gen 12 Tiger Lake

OpenCL 2.2 support

None yet: the Khronos test suite is ready; possible with a driver update on all hardware with 2.0 and 2.1 support
Intel NEO Compute: work in progress for actual products

OpenCL 2.1 support

Intel 5th and 6th gen processors (support backported)
Intel 7th, 8th, 9th, 10th, 11th and 12th gen processors
Khronos: possible with a driver update on all hardware with 2.0 support

OpenCL 2.0 support

AMD GCN GPUs (some GCN 1st gen GPUs support only 1.2 with some extensions)
AMD GCN APUs
Intel 5th and 6th gen processors
Qualcomm Adreno 5xx series
Qualcomm Adreno 6xx series
ARM Mali G51 and G71 in Android 7.1 and Linux
ARM Mali G31, G52, G72 and G76
Nvidia Kepler, Maxwell, Pascal, Volta and Turing GPUs with driver version 378.66+ (incomplete evaluation support)

OpenCL 1.2 support

Some AMD GCN 1st gen GPUs (some OpenCL 2.0 features are not possible today, but many more extensions are available than on TeraScale)
AMD TeraScale 2 and 3 GPUs
AMD TeraScale APUs
Nvidia Kepler, Maxwell, Pascal, Volta and Turing GPUs
Intel 3rd and 4th gen processors
Qualcomm Adreno 4xx series
ARM Mali Midgard 3rd gen
ARM Mali Midgard 4th gen

OpenCL 1.1 support

Some AMD TeraScale 1 GPUs
Nvidia Tesla and Fermi GPUs
Qualcomm Adreno 3xx series
ARM Mali Midgard 1st and 2nd gen

OpenCL 1.0 support

Mostly updated to 1.1 and 1.2 after a first driver that supported 1.0 only

Portability, performance and alternatives

A key feature of OpenCL is portability, via its abstracted memory and execution model, and the programmer is not able to directly use hardware-specific technologies such as inline Parallel Thread Execution for Nvidia GPUs unless they are willing to give up direct portability on other platforms. It is possible to run any OpenCL kernel on any conformant implementation.
However, performance of the kernel is not necessarily portable across platforms. Existing implementations have been shown to be competitive when kernel code is properly tuned, though, and auto-tuning has been suggested as a solution to the performance portability problem, yielding "acceptable levels of performance" in experimental linear algebra kernels. Portability of an entire application containing multiple kernels with differing behaviors was also studied, and shows that portability only required limited tradeoffs.
A study at Delft University from 2011 that compared CUDA programs and their straightforward translation into OpenCL C found CUDA to outperform OpenCL by at most 30% on the Nvidia implementation. The researchers noted that their comparison could be made fairer by applying manual optimizations to the OpenCL programs, in which case there was "no reason for OpenCL to obtain worse performance than CUDA". The performance differences could mostly be attributed to differences in the programming model and to NVIDIA's compiler optimizations for CUDA compared to those for OpenCL.
Another study at D-Wave Systems Inc. found that "The OpenCL kernel’s performance is between about 13% and 63% slower, and the end-to-end time is between about 16% and 67% slower" than CUDA's performance.
The fact that OpenCL allows workloads to be shared by CPU and GPU, executing the same programs, means that programmers can exploit both by dividing work among the devices. This leads to the problem of deciding how to partition the work, because the relative speeds of operations differ among the devices. Machine learning has been suggested to solve this problem: Grewe and O'Boyle describe a system of support-vector machines trained on compile-time features of programs that can decide the device partitioning problem statically, without actually running the programs to measure their performance.
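To show what the partitioning decision amounts to, here is a deliberately simple plain C sketch that splits a number of work-items between two devices in proportion to their relative throughput. This is not the machine-learning approach of Grewe and O'Boyle; the function split_work and its speed parameters are hypothetical, and the sketch assumes the relative speeds have already been estimated by some other means.

```c
#include <assert.h>
#include <stddef.h>

/* Returns how many of n work-items to give device A when A and B have
 * relative throughputs speed_a and speed_b (arbitrary but equal units).
 * Device B receives the remaining n - result work-items. */
static size_t split_work(size_t n, double speed_a, double speed_b)
{
    assert(speed_a >= 0.0 && speed_b >= 0.0 && speed_a + speed_b > 0.0);
    double share = speed_a / (speed_a + speed_b);       /* A's fraction */
    size_t for_a = (size_t)(share * (double)n + 0.5);   /* round to nearest */
    return for_a > n ? n : for_a;                       /* clamp to n */
}
```

With 1000 work-items and device B three times as fast as device A, split_work(1000, 1.0, 3.0) assigns 250 work-items to A and leaves 750 for B. A fixed proportional split like this ignores that the relative speeds differ from one kernel to another, which is the difficulty the paragraph above describes.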

Project Coriander: conversion of CUDA to OpenCL 1.2 with CUDA-on-CL
