Out-of-order execution

In computer engineering, out-of-order execution is a paradigm used in most high-performance central processing units to make use of instruction cycles that would otherwise be wasted. In this paradigm, a processor executes instructions in an order governed by the availability of input data and execution units, rather than by their original order in a program. In doing so, the processor can avoid being idle while waiting for the preceding instruction to complete and can, in the meantime, process the next instructions that are able to run immediately and independently.

History

Out-of-order execution is a restricted form of data flow computation, which was a major research area in computer architecture in the 1970s and early 1980s.
Important academic research in this subject was led by Yale Patt and his HPSm simulator. A paper by James E. Smith and A. R. Pleszkun, published in 1985 completed the scheme by describing how the precise behavior of exceptions could be maintained in out-of-order machines.
Arguably the first machine to use out-of-order execution is the CDC 6600, which uses a scoreboard to resolve conflicts.
About three years later, the IBM System/360 Model 91 introduced Tomasulo's algorithm, which makes full out-of-order execution possible. In 1990, IBM introduced the first out-of-order microprocessor, the POWER1, although out-of-order execution is limited to floating-point instructions.
In 1990s, out-of-order execution became more common, and was featured in the IBM/Motorola PowerPC 601, Fujitsu/HAL SPARC64, Intel Pentium Pro, MIPS R10000, HP PA-8000, AMD K5 and DEC Alpha 21264. Notable exceptions to this trend include the Sun UltraSPARC, HP/Intel Itanium, Intel Atom until Silvermont Architecture, and the IBM POWER6.
The high logical complexity of the out-of-order technique is the reason that it did not reach mainstream machines until the mid-1990s. Many low-end processors meant for cost-sensitive markets still do not use this paradigm due to the large silicon area required for its implementation. Low power usage is another design goal that is harder to achieve with an out-of-order execution design.
A vulnerability in some microprocessor manufacturers' implementations of the out-of-order execution mechanism was reported to the manufacturers on June 1, 2017, but which was not publicized until January 2018, as an exploitable vulnerability that led to millions of vulnerable systems. The vulnerability was named Spectre. A similar vulnerability, Meltdown, which was disclosed at the same time, took advantage of an assumption that some manufacturers had made when loading data into a processor's cache by allowing data to be cached from a privileged security boundary. This resulted in a race condition that could be timed to leak privileged information.

Basic concept

In-order processors

In earlier processors, the processing of instructions is performed in an instruction cycle normally consisting of the following steps:

Instruction fetch.
If input operands are available, the instruction is dispatched to the appropriate functional unit. If one or more operands are unavailable during the current clock cycle, the processor stalls until they are available.
The instruction is executed by the appropriate functional unit.
The functional unit writes the results back to the register file.
Out-of-order processors

This new paradigm breaks up the processing of instructions into these steps:

Instruction fetch.
Instruction dispatch to an instruction queue.
The instruction waits in the queue until its input operands are available. The instruction is then allowed to leave the queue before earlier, than older instructions.
The instruction is issued to the appropriate functional unit and executed by that unit.
The results are queued.
Only after all older instructions have their results written back to the register file, then this result is written back to the register file. This is called the graduation or retire stage.

The key concept of OoOE processing is to allow the processor to avoid a class of stalls that occur when the data needed to perform an operation are unavailable. In the outline above, the OoOE processor avoids the stall that occurs in step of the in-order processor when the instruction is not completely ready to be processed due to missing data.
OoOE processors fill these "slots" in time with other instructions that are ready, then re-order the results at the end to make it appear that the instructions were processed as normal. The way the instructions are ordered in the original computer code is known as program order, in the processor they are handled in data order, the order in which the data, operands, become available in the processor's registers. Fairly complex circuitry is needed to convert from one ordering to the other and maintain a logical ordering of the output; the processor itself runs the instructions in seemingly random order.
The benefit of OoOE processing grows as the instruction pipeline deepens and the speed difference between main memory and the processor widens. On modern machines, the processor runs many times faster than the memory, so during the time an in-order processor spends waiting for data to arrive, it could have processed a large number of instructions.

Dispatch and issue decoupling allows out-of-order issue

One of the differences created by the new paradigm is the creation of queues that allows the dispatch step to be decoupled from the issue step and the graduation stage to be decoupled from the execute stage. An early name for the paradigm was decoupled architecture. In the earlier in-order processors, these stages operated in a fairly lock-step, pipelined fashion.
The instructions of the program may not be run in the correct order, as long as the end result is correct. It separates the fetch and decode stages from the execute stage in a pipelined processor by using a buffer.
The buffer's purpose is to partition the memory access and execute functions in a computer program and achieve high-performance by exploiting the fine-grain parallelism between the two. In doing so, it effectively hides all memory latency from the processor's perspective.
A larger buffer can, in theory, increase throughput. However, if the processor has a branch misprediction then the entire buffer may need to be flushed, wasting a lot of clock cycles and reducing the effectiveness. Furthermore, larger buffers create more heat and use more die space. For this reason processor designers today favour a multi-threaded design approach.
Decoupled architectures are generally thought of as not useful for general purpose computing as they do not handle control intensive code well. Control intensive code include such things as nested branches that occur frequently in operating system kernels. Decoupled architectures play an important role in scheduling in very long instruction word architectures.
To avoid false operand dependencies, which would decrease the frequency when instructions could be issued out of order, a technique called register renaming is used. In this scheme, there are more physical registers than defined by the architecture. The physical registers are tagged so that multiple versions of the same architectural register can exist at the same time.

Execute and writeback decoupling allows program restart

The queue for results is necessary to resolve issues such as branch mispredictions and exceptions/traps. The results queue allows programs to be restarted after an exception, which requires the instructions to be completed in program order. The queue allows results to be discarded due to mispredictions on older branch instructions and exceptions taken on older instructions.
The ability to issue instructions past branches that are yet to resolve is known as speculative execution.

Micro-architectural choices

Are the instructions dispatched to a centralized queue or to multiple distributed queues?
Is there an actual results queue or are the results written directly into a register file? For the latter, the queueing function is handled by register maps that hold the register renaming information for each instruction in flight.

Popular movies

The Hunger Games (film) - 2012 American dystopian action thriller science fiction-adventure film directed by Gary Ross and based on Suzanne Collins’s 2008 novel of the same name. It is the first insta...
untitled Captain Marvel sequel - part of Marvel Cinematic Universe....
Killers of the Flower Moon (film project) - Killers of the Flower Moon - film project in United States of America. It was presented as drama, detective fiction, thriller. The film project starred Leonardo Dicaprio, Robert De Niro. Director of...
Five Nights at Freddy's (film) - Five Nights at Freddy's - film published in 2017 in United States of America. Scenarist of the film - Scott Cawthon....

Popular books

Book of Revelation - The Book of Revelation is the final book of the New Testament, and consequently is also the final book of the Christian Bible. Its title is derived from the first word of the Koine Greek text: apok...
Book of Genesis - account of the creation of the world, the early history of humanity, Israel's ancestors and the origins...
Gospel of Matthew - The Gospel According to Matthew is the first book of the New Testament and one of the three synoptic gospels. It tells how Israel's Messiah, rejected and executed in Israel, pronounces judgement on ...
Michelin Guide - Michelin Guides are a series of guide books published by the French tyre company Michelin for more than a century. The term normally refers to the annually published Michelin Red Guide , the oldest...
Psalms - The Book of Psalms , commonly referred to simply as Psalms , the Psalter or "the Psalms", is the first book of the Ketuvim , the third section of the Hebrew Bible, and thus a book of th...
Ecclesiastes - Ecclesiastes is one of 24 books of the Tanakh , where it is classified as one of the Ketuvim . Originally written c. 450–200 BCE, it is also among the canonical Wisdom literature of the Old Tes...
The 48 Laws of Power - non-fiction book by American author Robert Greene. The book...

Popular television series

The Crown (TV series) - historical drama web television series about the reign of Queen Elizabeth II, created and principally written by Peter Morgan, and produced by Left Bank Pictures and Sony Pictures Tel...
Friends - American sitcom television series, created by David Crane and Marta Kauffman, which aired on NBC from September 22, 1994, to May 6, 2004, lasting ten seasons. With an ensemble cast sta...
Young Sheldon - spin-off prequel to The Big Bang Theory and begins with the character Sheldon...
Modern Family - American television mockumentary family sitcom created by Christopher Lloyd and Steven Levitan for the American Broadcasting Company. It ran for eleven seasons, from September 23...
Loki (TV series) - upcoming American web television miniseries created for Disney+ by Michael Waldron, based on the Marvel Comics character of the same name. It is set in the Marvel Cinematic Universe, shar...
Game of Thrones - American fantasy drama television series created by David Benioff and D. B. Weiss for HBO. It...
Shameless (American TV series) - American comedy-drama television series developed by John Wells which debuted on Showtime on January 9, 2011. It...