A barrel processor is a CPU that switches between threads of execution on every cycle. This CPU design technique is also known as "interleaved" or "fine-grained" temporal multithreading. Unlike simultaneous multithreading in modern superscalar architectures, it generally does not allow execution of multiple instructions in one cycle. As in preemptive multitasking, each thread of execution is assigned its own program counter and other hardware registers. A barrel processor with n threads can guarantee that each thread will execute one instruction every n cycles, unlike a preemptive multitasking machine, which typically runs one thread of execution for tens of millions of cycles while all other threads wait their turn. A technique called C-slowing can automatically generate a corresponding barrel processor design from a single-tasking processor design. An n-way barrel processor generated this way acts much like n separate multiprocessing copies of the original single-tasking processor, each one running at roughly 1/n the original speed.
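The strict rotation described above can be illustrated with a minimal simulation sketch. It is a toy model, not a description of any real machine: the function names run_barrel and issue_cycles are hypothetical, instructions are represented as opaque labels, and each thread's program is assumed to loop forever.

```python
def run_barrel(programs, cycles):
    """Simulate an n-way barrel processor for a number of cycles.

    programs: one list of instruction labels per hardware thread
              (toy stand-ins for real instructions).
    Returns a trace of (cycle, thread_id, instruction) tuples.
    """
    n = len(programs)
    pcs = [0] * n  # each thread has its own program counter
    trace = []
    for cycle in range(cycles):
        tid = cycle % n  # switch to the next thread on every cycle
        program = programs[tid]
        instr = program[pcs[tid] % len(program)]  # assumption: programs loop
        trace.append((cycle, tid, instr))
        pcs[tid] += 1  # only the issuing thread advances
    return trace


def issue_cycles(trace, tid):
    """Return the cycles on which thread `tid` issued an instruction."""
    return [cycle for cycle, t, _ in trace if t == tid]


if __name__ == "__main__":
    trace = run_barrel([["a0", "a1"], ["b0"], ["c0", "c1", "c2"]], cycles=9)
    for tid in range(3):
        print(tid, issue_cycles(trace, tid))
```

In this model each of the n threads issues exactly once every n cycles, and the cycles on which a given thread issues do not depend on what the other threads are running, which corresponds to the roughly 1/n per-thread speed mentioned above.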
History
One of the earliest examples of a barrel processor was the I/O processing system in the CDC 6000 series supercomputers. These executed one instruction from each of 10 different virtual processors before returning to the first processor. One motivation for barrel processors was to reduce hardware costs. In the case of the CDC 6x00 PPUs, the digital logic of the processor was much faster than the core memory, so rather than having ten separate processors, the design gave the PPUs ten separate core memory units that all shared a single set of processor logic. Another example is the Honeywell 800, which had 8 groups of registers, allowing up to 8 concurrent programs. After each instruction, the processor would switch to the next active program in sequence.
Barrel processors have also been used as large-scale central processors. The Tera MTA was a large-scale barrel processor design with 128 threads per core. The MTA architecture has seen continued development in successive products, such as the Cray Urika-GD, originally introduced in 2012 and targeted at data-mining applications.
Barrel processors are also found in embedded systems, where they are particularly useful for their deterministic real-time thread performance. An example is the XMOS XCore XS1, a four-stage barrel processor with eight threads per core. The XS1 is found in Ethernet, USB, audio, and control devices, and other applications where I/O performance is critical. Barrel processors have also been used in specialized devices such as the eight-thread Ubicom IP3023 network I/O processor. Some 8-bit microcontrollers by Padauk Technology feature barrel processors with up to 8 threads per core.
Comparison with single-threaded processors
Advantages
A single-tasking processor spends a lot of time idle, doing nothing useful whenever a cache miss or pipeline stall occurs. Advantages of barrel processors over single-tasking processors include:
The ability to do useful work on the other threads while the stalled thread is waiting.
Designing an n-way barrel processor with an n-deep pipeline is much simpler than designing a single-tasking processor because a barrel processor never has a pipeline stall and doesn't need feed-forward circuits.
For real-time applications, a barrel processor can guarantee that a "real-time" thread can execute with precise timing, no matter what happens to the other threads, even if some other thread locks up in an infinite loop or is continuously interrupted by hardware interrupts.
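The pipeline argument in the list above can be checked with a small sketch. It assumes a simplified model in which one instruction enters stage 0 on every cycle in strict thread rotation and advances one stage per cycle; the function names pipeline_snapshot and has_same_thread_overlap are hypothetical and introduced only for illustration.

```python
def pipeline_snapshot(cycle, n_threads, depth):
    """Return the thread occupying each pipeline stage at a given cycle.

    Stage 0 is the first stage; the instruction in stage s was issued
    at cycle - s. Stages not yet filled at start-up hold None.
    """
    return [
        (cycle - s) % n_threads if cycle - s >= 0 else None
        for s in range(depth)
    ]


def has_same_thread_overlap(n_threads, depth, cycles=100):
    """Check whether two in-flight instructions ever share a thread."""
    for cycle in range(cycles):
        occupants = [
            t for t in pipeline_snapshot(cycle, n_threads, depth)
            if t is not None
        ]
        if len(occupants) != len(set(occupants)):
            return True
    return False


if __name__ == "__main__":
    print(pipeline_snapshot(5, n_threads=4, depth=4))
    print(has_same_thread_overlap(n_threads=4, depth=4))
    print(has_same_thread_overlap(n_threads=4, depth=5))
```

Under this model, when the number of threads is at least the pipeline depth, every stage holds an instruction from a different thread, so an instruction never waits on the result of an earlier in-flight instruction from its own thread. When the pipeline is deeper than the number of threads, two instructions from the same thread can be in flight together, and the simplification no longer holds.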
Disadvantages
There are a few disadvantages to barrel processors.
The state of each thread must be kept on-chip, typically in registers, to avoid costly off-chip context switches. This requires a large number of registers compared to typical processors.
Either all threads must share the same cache, which slows overall system performance, or there must be one unit of cache for each execution thread, which can significantly increase the transistor count and thus the cost of such a CPU. However, in hard real-time embedded systems where barrel processors are often found, memory access costs are typically calculated assuming worst-case cache behavior, so this is a minor concern. Some barrel processors such as the XMOS XS1 do not have a cache at all.