Instruction pipelining

Basic five-stage pipeline in a RISC machine (IF = Instruction Fetch, ID = Instruction Decode, EX = Execute, MEM = Memory access, WB = Register write back). The vertical axis is successive instructions, the horizontal axis is time. So in the green column, the earliest instruction is in WB stage, and the latest instruction is undergoing instruction fetch.

Instruction pipelining is a technique used in the design of modern microprocessors, microcontrollers and CPUs to increase their instruction throughput (the number of instructions that can be executed in a unit of time).

The main idea is to divide (termed "split") the processing of a CPU instruction, as defined by the instruction microcode, into a series of independent steps of micro-operations (also called "microinstructions", "micro-op" or "µop"), with storage at the end of each step. This allows the CPUs control logic to handle instructions at the processing rate of the slowest step, which is much faster than the time needed to process the instruction as a single step.

The term pipeline refers to the fact that each step is carrying a single microinstruction (like a drop of water), and each step is linked to another step (analogy; similar to water pipes).

Most modern CPUs are driven by a clock. The CPU consists internally of logic and memory (flip flops). When the clock signal arrives, the flip flops store their new value then the logic requires a period of time to decode the flip flops new values. Then the next clock pulse arrives and the flip flops store another values, and so on. By breaking the logic into smaller pieces and inserting flip flops between pieces of logic, the time required by the logic (to decode values till generating valid outputs depending on these values) is reduced. In this way the clock period can be reduced.
For example, the RISC pipeline is broken into five stages with a set of flip flops between each stage as follows:

Instruction fetch
Instruction decode and register fetch
Execute
Memory access
Register write back

Processors with pipelining consist internally of stages (modules) which can semi-independently work on separate microinstructions. Each stage is linked by flip flops to the next stage (like a "chain") so that the stage's output is an input to another stage until the job of processing instructions is done. Such organization of processor internal modules reduces the instruction's overall processing time.

A non-pipeline architecture is not as efficient because some CPU modules are idle while another module is active during the instruction cycle. Pipelining does not completely remove idle time in a pipelined CPU, but making CPU modules work in parallel increases instruction throughput.

An instruction pipeline is said to be fully pipelined if it can accept a new instruction every clock cycle. A pipeline that is not fully pipelined has wait cycles that delay the progress of the pipeline.

Advantages and Disadvantages of Pipelining

Advantages of Pipelining:

The cycle time of the processor is reduced; increasing the instruction throughput. Pipelining doesn't reduce the time it takes to complete an instruction; instead it increases the number of instructions that can be processed simultaneously ("at once") and reduces the delay between completed instructions (called 'throughput').
The more pipeline stages a processor has, the more instructions it can process "at once" and the less of a delay there is between completed instructions. Every predominant general purpose microprocessor manufactured today uses at least 2 stages of pipeline up to 30 or 40 stages.
If pipelining is used, the CPU Arithmetic logic unit can be designed faster, but will be more complex.
Pipelining in theory increases performance over an un-pipelined core by a factor of the number of stages (assuming the clock frequency also increases by the same factor) and the code is ideal for pipeline execution.
Pipelined CPUs generally work at a higher clock frequency than the RAM clock frequency, (as of 2008 technologies, RAMs work at a low frequencies compared to CPUs frequencies) increasing computers overall performance.

Disadvantages of Pipelining:

Pipelining has many disadvantages though there are a lot of techniques used by CPUs and compilers designers to overcome most of them; the following is a list of common drawbacks:

The design of a non-pipelined processor is simpler and cheaper to manufacture, non-pipelined processor executes only a single instruction at a time. This prevents branch delays (in Pipelining, every branch is delayed) as well as problems when serial instructions being executed concurrently.
In pipelined processor, insertion of flip flops between modules increases the instruction latency compared to a non-pipelining processor.
A non-pipelined processor will have a defined instruction throughput. The performance of a pipelined processor is much harder to predict and may vary widely for different programs.
Many designs include pipelines as long as 7, 10, 20, 31 and even more stages; a disadvantage of a long pipeline is when a program branches, the entire pipeline must be flushed (cleared). The higher throughput of pipelines falls short when the executed code contains many branches: the processor cannot know in advance where to read the next instruction, and must wait for the branch instruction to finish, leaving the pipeline behind it empty. This disadvantage can be reduced by predicting whether the conditional branch instruction will branch based on previous activity. After the branch is resolved, the next instruction has to travel all the way through the pipeline before its result becomes available and the processor resumes "working" again. In such extreme cases, the performance of a pipelined processor could be worse than non-pipelined processor.
Unfortunately, not all instructions are independent. In a simple pipeline, completing an instruction may require 5 stages. To operate at full performance, this pipeline will need to run 4 subsequent independent instructions while the first is completing. Any of those 4 instructions might depend on the output of the first instruction, causing the pipeline control logic to wait and insert a stall or wasted clock cycle into the pipeline until the dependency is resolved. Fortunately, techniques such as forwarding can significantly reduce the cases where stalling is required.
Self-modifying programs may fail to execute properly on a pipelined architecture when the instructions being modified are near the instructions being executed. This can be caused by the instructions may already being in the Prefetch Input Queue, so the modification may not take effect for the upcoming execution of instructions. Instruction caches make the problem even worse.
Hazards: When a programmer (or compiler) writes assembly code, they generally assume that each instruction is executed before the next instruction is being executed. When this assumption is not validated by pipelining it causes a program to behave incorrectly, the situation is known as a hazard.
Various techniques for resolving hazards or working around such as forwarding and delaying (by inserting a stall or a wasted clock cycle) exist.

Examples

Generic pipeline

To the right is a generic pipeline with four stages:

Fetch
Decode
Execute
Write-back

The top gray box is the list of instructions waiting to be executed; the bottom gray box is the list of instructions that have been completed; and the middle white box is the pipeline.

Execution is as follows:

Time	Execution
0	Four instructions are awaiting to be executed
1	the green instruction is fetched from memory
2	the green instruction is decoded the purple instruction is fetched from memory
3	the green instruction is executed (actual operation is performed) the purple instruction is decoded the blue instruction is fetched
4	the green instruction's results are written back to the register file or memory the purple instruction is executed the blue instruction is decoded the red instruction is fetched
5	the green instruction is completed the purple instruction is written back the blue instruction is executed the red instruction is decoded
6	The purple instruction is completed the blue instruction is written back the red instruction is executed
7	the blue instruction is completed the red instruction is written back
8	the red instruction is completed
9	All instructions are executed

Bubble

When a "hiccup" (interruption) in execution occurs, a "bubble" is created in the pipeline in which nothing useful happens. In cycle 2, the fetching of the purple instruction is delayed and the decoding stage in cycle 3 now contains a bubble. Everything behind the purple instruction is delayed as well but everything in front of the purple instruction continues with execution.

Clearly, when compared to the execution above, the bubble yields a total execution time of 8 clock ticks instead of 7.

Bubbles are like stalls (delays), in which nothing useful will happen for the fetch, decode, execute and writeback. It is like a NOP (short for No OPeration) code.

Example 1

A typical instruction to add two numbers might be ADD A, B, C, which adds the values found in memory locations A and B, and then puts the result in memory location C. In a pipelined processor the pipeline controller would break this into a series of tasks similar to:

LOAD A, R1
LOAD B, R2
ADD R1, R2, R3
STORE R3, C
LOAD next instruction

The locations 'R1' and 'R2' are registers in the CPU. The values stored in memory locations labeled 'A' and 'B' are loaded (copied) into these registers, then added, and the result is stored in a memory location labeled 'C'.

In this example the pipeline is three stages long- load, execute, and store. Each of the steps are called pipeline stages.

On a non-pipelined processor, only one stage can be working at a time so the entire instruction has to complete before the next instruction can begin. On a pipelined processor, all of the stages can be working at once on different instructions. So when this instruction is at the execute stage, a second instruction will be at the decode stage and a 3rd instruction will be at the fetch stage.

Example 2

To better understand the concept, we can look at a theoretical 3-stage pipeline:

Stage	Description
Load	Read instruction from memory
Execute	Execute instruction
Store	Store result in memory and/or registers

and a pseudo-code assembly listing to be executed:

LOAD  #40, A      ; load 40 in A
MOVE  A, B        ; copy A in B
ADD   #20, B      ; add 20 to B
STORE B, 0x300    ; store B into memory cell 0x300

This is how it would be executed:

Clock 1
Load	Execute	Store
LOAD

The LOAD instruction is fetched from memory.

Clock 2
Load	Execute	Store
MOVE	LOAD

The LOAD instruction is executed, while the MOVE instruction is fetched from memory.

Clock 3
Load	Execute	Store
ADD	MOVE	LOAD

The LOAD instruction is in the Store stage, where its result (the number 40) will be stored in the register A. In the meantime, the MOVE instruction is being executed. Since it must move the contents of A into B, it must wait for the ending of the LOAD instruction.

Clock 4
Load	Execute	Store
STORE	ADD	MOVE

The STORE instruction is loaded, while the MOVE instruction is finishing off and the ADD is calculating.

And so on. Note that, sometimes, an instruction will depend on the result of another one (like our MOVE example). When more than one instruction references a particular location for an operand, either reading it (as an input) or writing it (as an output), executing those instructions in an order different from the original program order can lead to the hazards situation (mentioned above).