Software pipelining

In computer science, software pipelining is a technique used to optimize loops, in a manner that parallels hardware pipelining. Software pipelining is a type of out-of-order execution, except that the reordering is done by a compiler instead of the processor. Some computer architectures have explicit support for software pipelining, notably Intel's IA-64 architecture.
It is important to distinguish software pipelining, which is a target code technique for overlapping loop iterations, from modulo scheduling, currently the most effective known compiler technique for generating software pipelined loops.
Software pipelining has been known to assembly language programmers of machines with instruction-level parallelism since such architectures existed. Effective compiler generation of such code dates to the invention of modulo scheduling by Rau and Glaeser.
Lam showed that special hardware is unnecessary for effective modulo scheduling. Her technique, modulo variable expansion, is widely used in practice.
Gao et al. formulated optimal software pipelining as an integer linear programming problem, culminating in the validation of advanced heuristics in an evaluation paper. This paper has a good set of references on the topic.

Example

Consider the following loop:
for i = 1 to bignumber
  A(i)
  B(i)
  C(i)
end
In this example, let A(i), B(i), C(i) be instructions, each operating on data i, that are dependent on each other. In other words, A(i) must complete before B(i) can start, and B(i) must complete before C(i) can start. For example, A could load data from memory into a register, B could perform some arithmetic operation on the data, and C could store the data back into memory. However, let there be no dependence between operations for different values of i. In other words, A(2) can begin before A(1) finishes.
Without software pipelining, the operations execute in the following sequence:
A(1) B(1) C(1) A(2) B(2) C(2) A(3) B(3) C(3) ...
Assume that each instruction takes 3 clock cycles to complete. Also assume that an instruction can be dispatched every cycle, as long as it has no dependencies on an instruction that is already executing. In the unpipelined case, each iteration thus takes 9 cycles to complete: 3 clock cycles for A, 3 clock cycles for B, and 3 clock cycles for C.
Now consider the following sequence of instructions with software pipelining:
A(1) A(2) A(3) B(1) B(2) B(3) C(1) C(2) C(3) ...
It can be easily verified that an instruction can be dispatched each cycle, which means that the same 3 iterations can be executed in a total of 9 cycles, giving an average of 3 cycles per iteration.
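The dispatch claim can be checked with a small simulation. The sketch below assumes a simplified in-order machine that dispatches at most one instruction per cycle and stalls until the producer of an instruction's input has completed; the function name dispatch_cycles and the data layout are illustrative only. Note that this model lets A(i+1) dispatch right after C(i), so the sequential order comes out slightly better than a strict 9 cycles per iteration, but it still stalls on every B and C.

```python
LATENCY = 3  # cycles per instruction, as assumed in the text


def dispatch_cycles(order):
    """Return the dispatch cycle of each (op, iteration) pair in `order`.

    Assumed model: in-order issue, one dispatch per cycle; B(i) waits for
    A(i) to complete and C(i) waits for B(i) to complete.
    """
    depends_on = {"B": "A", "C": "B"}
    dispatched = {}
    cycle = 0
    for op, i in order:
        producer = depends_on.get(op)
        if producer is not None:
            # Stall until the producing instruction has finished.
            cycle = max(cycle, dispatched[(producer, i)] + LATENCY)
        dispatched[(op, i)] = cycle
        cycle += 1
    return [dispatched[key] for key in order]


iterations = [1, 2, 3]
sequential = [(op, i) for i in iterations for op in "ABC"]  # A(1) B(1) C(1) A(2) ...
pipelined = [(op, i) for op in "ABC" for i in iterations]   # A(1) A(2) A(3) B(1) ...

print("sequential:", dispatch_cycles(sequential))
print("pipelined: ", dispatch_cycles(pipelined))
```

In the pipelined order every instruction is dispatched in the cycle right after its predecessor, whereas the sequential order leaves gaps while each B and C waits for its input.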

Implementation

Software pipelining is often used in combination with loop unrolling, and this combination of techniques is often a far better optimization than loop unrolling alone. In the example above, we could write the code as follows (assume for the moment that bignumber is divisible by 3):
for i = 1 to (bignumber - 2) step 3
  A(i)
  A(i+1)
  A(i+2)
  B(i)
  B(i+1)
  B(i+2)
  C(i)
  C(i+1)
  C(i+2)
end
Of course, matters are complicated if we can't guarantee that the total number of iterations will be divisible by the number of iterations we unroll. See the article on loop unrolling for more on solutions to this problem, but note that software pipelining prevents the use of Duff's device.
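One common way to handle a trip count that is not a multiple of the unroll factor is a cleanup loop after the unrolled body. The following sketch only illustrates the ordering of operations; Python itself gains nothing from it, and the functions a, b and c are hypothetical stand-ins for the load, arithmetic and store steps.

```python
def a(x):
    return x            # stand-in for a load


def b(x):
    return x * x + 1    # stand-in for an arithmetic operation


def c(x):
    return x            # stand-in for a store (returns the value written)


def run_unrolled(data, unroll=3):
    """Apply a, b, c to every element, `unroll` iterations at a time."""
    n = len(data)
    out = [None] * n
    main_end = n - n % unroll  # last index covered by full groups
    for i in range(0, main_end, unroll):
        group = range(i, i + unroll)
        loaded = [a(data[j]) for j in group]   # A(i) A(i+1) A(i+2)
        computed = [b(v) for v in loaded]      # B(i) B(i+1) B(i+2)
        for j, v in zip(group, computed):      # C(i) C(i+1) C(i+2)
            out[j] = c(v)
    for i in range(main_end, n):               # cleanup: leftover iterations
        out[i] = c(b(a(data[i])))
    return out


print(run_unrolled([1, 2, 3, 4, 5]))
```

The cleanup loop runs at most unroll - 1 iterations, so its cost is small when the trip count is large.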
In the general case, loop unrolling may not be the best way to implement software pipelining. Consider a loop containing instructions with a high latency. For example, the following code:
for i = 1 to bignumber
  A(i) ; 3 cycle latency
  B(i) ; 3
  C(i) ; 12
  D(i) ; 3
  E(i) ; 3
  F(i) ; 3
end
would require 12 iterations of the loop to be unrolled to avoid the bottleneck of instruction C. This means that the code of the loop would increase by a factor of 12. Even worse, the prologue will likely be even larger than the code for the loop, and very probably inefficient because software pipelining cannot be used in this code. Furthermore, if bignumber is expected to be moderate in size compared to the number of iterations unrolled, then the execution will spend most of its time in this inefficient prologue code, rendering the software pipelining optimization ineffectual.
By contrast, here is the software-pipelined version of our example (the prologue and epilogue are explained below):
prologue
for i = 1 to (bignumber - 6)
  A(i+6)
  B(i+5)
  C(i+4)
  D(i+2) ; note that we skip i+3
  E(i+1)
  F(i)
end
epilogue
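Before getting to the prologue and epilogue, which handle iterations at the beginning and end of the loop, let's verify that this code does the same thing as the original for iterations in the middle of the loop. Specifically, consider iteration 7 in the original loop. The first iteration of the pipelined loop will be the first iteration that includes an instruction from iteration 7 of the original loop. The sequence of instructions is:
Iteration 1: A(7) B(6) C(5) D(3) E(2) F(1)
Iteration 2: A(8) B(7) C(6) D(4) E(3) F(2)
Iteration 3: A(9) B(8) C(7) D(5) E(4) F(3)
Iteration 4: A(10) B(9) C(8) D(6) E(5) F(4)
Iteration 5: A(11) B(10) C(9) D(7) E(6) F(5)
Iteration 6: A(12) B(11) C(10) D(8) E(7) F(6)
Iteration 7: A(13) B(12) C(11) D(9) E(8) F(7)
The instructions of original iteration 7 thus still execute in the order A, B, C, D, E, F, spread over seven iterations of the pipelined loop. However, unlike the original loop, the pipelined version avoids the bottleneck at instruction C. Note that there are 12 instructions between C(7) and the dependent instruction D(7), which means that the latency cycles of instruction C(7) are used for other instructions instead of being wasted.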
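The distance between C and its dependent D can be counted mechanically. The sketch below assumes the stage offsets A(i+6), B(i+5), C(i+4), D(i+2), E(i+1), F(i) for the pipelined loop body, which is the assignment consistent with the "skip i+3" comment; the names OFFSETS, kernel_iteration and instructions_between are illustrative.

```python
# Assumed offset of each instruction relative to the pipelined loop index i.
OFFSETS = {"A": 6, "B": 5, "C": 4, "D": 2, "E": 1, "F": 0}


def kernel_iteration(i):
    """Instructions issued by iteration i of the pipelined loop, in order."""
    return [(op, i + offset) for op, offset in OFFSETS.items()]


def instructions_between(first, second, iterations):
    """Count the instructions issued strictly between `first` and `second`."""
    stream = [ins for i in iterations for ins in kernel_iteration(i)]
    return stream.index(second) - stream.index(first) - 1


for i in range(1, 8):
    row = " ".join(f"{op}({n})" for op, n in kernel_iteration(i))
    print(f"Iteration {i}: {row}")

print(instructions_between(("C", 7), ("D", 7), range(1, 8)))
```

With one instruction dispatched per cycle, those intervening instructions are what cover the 12-cycle latency of C.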
The prologue and epilogue handle iterations at the beginning and end of the loop. Here is a possible prologue for our example above:
; loop prologue (arranged on lines for clarity)
A(1)
A(2), B(1)
A(3), B(2), C(1)
A(4), B(3), C(2) ; cannot start D(1) yet
A(5), B(4), C(3), D(1)
A(6), B(5), C(4), D(2), E(1)
Each line above corresponds to an iteration of the main pipelined loop, but without the instructions for iterations that have not yet begun. Similarly, the epilogue progressively removes instructions for iterations that have completed:
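; loop epilogue (arranged on lines for clarity)
B(bignumber), C(bignumber-1), D(bignumber-3), E(bignumber-4), F(bignumber-5)
C(bignumber), D(bignumber-2), E(bignumber-3), F(bignumber-4)
D(bignumber-1), E(bignumber-2), F(bignumber-3)
D(bignumber), E(bignumber-1), F(bignumber-2)
E(bignumber), F(bignumber-1)
F(bignumber)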
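The prologue and epilogue can be derived from the loop body by dropping instructions whose original iteration number falls outside 1..n. The sketch below does exactly that, again assuming the offsets A(i+6), B(i+5), C(i+4), D(i+2), E(i+1), F(i) and at least 6 original iterations; the function schedule is a hypothetical helper, not part of any compiler.

```python
# Assumed offset of each instruction relative to the pipelined loop index i.
OFFSETS = {"A": 6, "B": 5, "C": 4, "D": 2, "E": 1, "F": 0}


def schedule(n):
    """Return (phase, instructions) rows covering n original iterations.

    Rows with index i < 1 form the prologue, rows with i > n - depth form
    the epilogue; in both, instructions for iterations that have not yet
    begun or have already completed are filtered out. Assumes n >= 6.
    """
    depth = max(OFFSETS.values())
    rows = []
    for i in range(1 - depth, n + 1):
        row = [(op, i + off) for op, off in OFFSETS.items() if 1 <= i + off <= n]
        if i < 1:
            phase = "prologue"
        elif i <= n - depth:
            phase = "kernel"
        else:
            phase = "epilogue"
        rows.append((phase, row))
    return rows


for phase, row in schedule(8):
    print(f"{phase:9}", ", ".join(f"{op}({k})" for op, k in row))
```

Every original instruction appears exactly once across the three phases, and within each original iteration the order A through F is preserved.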

Difficulties of implementation

The requirement of a prologue and epilogue is one of the major difficulties of implementing software pipelining. Note that the prologue in this example is 18 instructions, 3 times as large as the loop itself. The epilogue would also be 18 instructions. In other words, the prologue and epilogue together are 6 times as large as the loop itself. While still better than attempting loop unrolling for this example, software pipelining requires a trade-off between speed and memory usage. Keep in mind, also, that if the code bloat is too large, it will affect speed anyway via a decrease in cache performance.
A further difficulty is that on many architectures, most instructions use a register as an argument, and that the specific register to use must be hard-coded into the instruction. In other words, on many architectures, it is impossible to code such an instruction as "multiply the contents of register X and register Y and put the result in register Z", where X, Y, and Z are numbers taken from other registers or memory. This has often been cited as a reason that software pipelining cannot be effectively implemented on conventional architectures.
In fact, Monica Lam presents an elegant solution to this problem in her thesis, A Systolic Array Optimizing Compiler. She calls it modulo variable expansion. The trick is to replicate the body of the loop after it has been scheduled, allowing different registers to be used for different values of the same variable when they have to be live at the same time. For the simplest possible example, let's suppose that A(i) and B(i) can be issued in parallel and that the latency of the former is 2 cycles. The pipelined body could then be:
A(i+2); B(i)
Register allocation of this loop body runs into the problem that the result of A(i+2) must stay live for two iterations. Using the same register for the result of A(i+2) and the input of B(i) will result in incorrect results.
However, if we replicate the scheduled loop body, the problem is solved:
A(i+2); B(i)
A(i+3); B(i+1)
Now a separate register can be allocated to the results of A(i+2) and A(i+3). To be more concrete:
r1 = A(i+2); B(i) = r1
r2 = A(i+3); B(i+1) = r2
i = i + 2 // Just to be clear
On the assumption that each instruction bundle reads its input registers before writing its output registers, this code is correct. At the start of the replicated loop body, r1 holds the value of A(i+2) from the previous replicated loop iteration. Since i has been incremented by 2 in the meantime, this is actually the value of A(i) in this replicated loop iteration.
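The effect of the second register can be seen in a small simulation. The sketch below models each bundle as "read the input register, then write the output register", uses 0-based iteration numbers, and relies on a hypothetical produce(i) standing in for the result of A(i); the values handed to B are collected in a list so they can be compared.

```python
def produce(i):
    """Hypothetical stand-in for the value computed by A(i)."""
    return 100 + i


def one_register(n):
    """Pipelined body with a single register: B(i) sees the wrong value."""
    r = produce(0)
    r = produce(1)            # overwrites A(0) before B(0) has read it
    seen_by_b = []
    for i in range(n):
        consumed = r          # B(i) reads its input first
        r = produce(i + 2)    # then A(i+2) writes its result
        seen_by_b.append(consumed)
    return seen_by_b


def two_registers(n):
    """Replicated body with two registers (n assumed even).

    A real implementation would not start A for iterations past the end;
    that is the job of the epilogue and is ignored here.
    """
    r1, r2 = produce(0), produce(1)
    seen_by_b = []
    for i in range(0, n, 2):
        consumed = r1         # B(i) reads r1, which holds A(i)
        r1 = produce(i + 2)
        seen_by_b.append(consumed)
        consumed = r2         # B(i+1) reads r2, which holds A(i+1)
        r2 = produce(i + 3)
        seen_by_b.append(consumed)
    return seen_by_b


expected = [produce(i) for i in range(4)]
print(one_register(4) == expected)
print(two_registers(4) == expected)
```

With one register, B(i) receives the result of A(i+1); with two alternating registers each B(i) receives the result of A(i).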
Of course, code replication increases code size and cache pressure just as the prologue and epilogue do. Nevertheless, for loops with large trip counts on architectures with enough instruction level parallelism, the technique easily performs well enough to be worth any increase in code size.

IA-64 implementation

Intel's IA-64 architecture provides an example of an architecture designed with the difficulties of software pipelining in mind. Some of the architectural support for software pipelining includes:

A "rotating" register bank; instructions can refer to a register number that is redirected to a different register each iteration of the loop. This makes the replication of the loop body shown in the previous example unnecessary.
Predicates that take their value from special looping instructions. These predicates turn on or off certain instructions in the loop, making a separate prologue and epilogue unnecessary.
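How these two features work together can be illustrated with a toy model. The sketch below is not IA-64 semantics in detail: the class RotatingRegisters, the stage predicate list and the three-stage loop (load, double, store) are all assumptions made for illustration. It shows a single loop body that fills and drains the pipeline without a separate prologue or epilogue.

```python
class RotatingRegisters:
    """Toy rotating register file: logical index k maps to a physical slot
    that shifts by one each time rotate() is called."""

    def __init__(self, size):
        self.slots = [None] * size
        self.base = 0

    def _physical(self, k):
        return (self.base + k) % len(self.slots)

    def read(self, k):
        return self.slots[self._physical(k)]

    def write(self, k, value):
        self.slots[self._physical(k)] = value

    def rotate(self):
        # After rotation, a value written as logical k is read as k + 1.
        self.base = (self.base - 1) % len(self.slots)


def pipelined_double(data):
    """Three-stage loop (load, compute, store) with predicated stages."""
    if not data:
        return []
    regs = RotatingRegisters(3)
    stage_on = [True, False, False]  # one predicate per pipeline stage
    remaining = len(data) - 1        # iterations still to be started
    loads = 0
    out = []
    while any(stage_on):
        if stage_on[0]:              # stage 0: load the next element
            regs.write(0, data[loads])
            loads += 1
        if stage_on[1]:              # stage 1: compute on last pass's load
            regs.write(1, regs.read(1) * 2)
        if stage_on[2]:              # stage 2: store the finished value
            out.append(regs.read(2))
        start_next = remaining > 0   # the loop branch sets the new predicate
        if start_next:
            remaining -= 1
        stage_on = [start_next] + stage_on[:-1]
        regs.rotate()
    return out


print(pipelined_double([1, 2, 3, 4]))
```

The predicates switch stages on one by one while the pipeline fills and off one by one while it drains, which is the role otherwise played by the prologue and epilogue.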
