

ACA-<sup>£</sup>Lecture

- A **pipeline** is a set of data processing elements connected in series, so that the output of one element is the input of the next one.
- The pipeline organization can be demonstrated by this simple example:
- Multiply and Add Operation : Ai \* Bi + Ci
- 3 Suboperation Segment
  - 1)  $R1 \leftarrow Ai, R2 \leftarrow Bi$  : Input Ai and Bi
  - 2)  $R3 \leftarrow R1 * R2, R4 \leftarrow Ci$  : Multiply and input Ci 3)  $R5 \leftarrow R3 + R4$  : Add Ci

| Content of Re | egisters                | in Pipel                | ine Example                               | :                       |                                                       | Å. P       |
|---------------|-------------------------|-------------------------|-------------------------------------------|-------------------------|-------------------------------------------------------|------------|
| Clock         | Segment 1               |                         | Segment 2                                 |                         | Segment 3                                             |            |
| Number        | <i>R</i> 1              | R2                      | <i>R</i> 3                                | <i>R</i> 4              | R5                                                    | R1 R2      |
| 1 2 2         | $A_1$<br>$A_2$          | $B_1$<br>$B_2$<br>$B_2$ | $A_1 * B_1$                               | $\overline{C_1}$        | $A_1 * B_1 + C_1$                                     | Multiplier |
| 4<br>5        | $A_3$<br>$A_4$<br>$A_5$ | $B_3$<br>$B_4$<br>$B_5$ | $A_2 * B_2$ $A_3 * B_3$ $A_4 * B_4$       | $C_2$<br>$C_3$<br>$C_4$ | $A_2 * B_2 + C_2 A_3 * B_3 + C_3$                     | R3 R4      |
| 67            | $A_6$<br>$A_7$          | $B_6$<br>$B_7$          | $A_5 * B_5$<br>$A_6 * B_6$<br>$A_7 * B_7$ | C5<br>C6                | $A_4 * B_4 + C_4$ $A_5 * B_5 + C_5$ $A_4 * B_6 + C_6$ | Adder      |
| 8             |                         |                         | A1+ D1                                    | _                       | $A_7 * B_7 + C_7$                                     | R5         |

Dializa Essentia

Each clock produces new output and moves the data one step down the pipeline. When no more input data are available, the clock must continue until the last output emerges out of the pipeline.

# **Pipeline Logic:**

- A clock drives all the registers in the pipeline. This clock causes the CLC output to be latched in the register which provides input to the next stage, and thus making a start of new computation possible for next stage.
- The maximum clock rate is decided by the time delay of the CLC in the stage and the delay of the staging latch.





- Each segment consists of CLC (Si) that performs a suboperation over the data stream flowing through the pipe.
- The segments are separated by registers (Ri) that hold the intermediate results between stages.

| Clock<br>cycles | 1  | 2  | з  | 4  | 5  | 6  | 7  | 8  | 9  |  |
|-----------------|----|----|----|----|----|----|----|----|----|--|
| 1               | Τī | T2 | Тз | Τ₄ | Тs | Те |    |    |    |  |
| 2 nent          |    | Ŧ  | Τz | Γз | T₄ | Тs | Тe |    |    |  |
| Segr            |    |    | T  | Τz | Тз | T4 | Тs | Те |    |  |
| 4               |    |    |    | Th | Τz | Тз | T₄ | Тs | Тө |  |



#### Instruction Pipelining

 Instruction execution is extremely complex and involves several operations which are executed successively



 Pipelining is an implementation technique whereby multiple instructions are overlapped in execution. This is solved without additional hardware but only by letting different parts of the hardware work for different instructions at the same time.

- The pipeline organization of a CPU is similar to an assembly line: the work to be done in an instruction is broken into smaller steps (pieces), each of which takes a fraction of the time needed to complete the entire instruction. Each of these steps is a *pipe* stage (or a *pipe segment*).
- Pipe stages are connected to form a pipe:



 The time required for moving an instruction from one stage to the next: a machine cycle (often this is one clock cycle). The execution of one instruction takes several machine cycles as it passes through the pipeline.



ACA-<sup>£</sup>Lecture

٨

#### Acceleration by Pipelining



We consider that each instruction takes execution time Tex.

Execution time for the 7 instructions, with pipelining: (T<sub>ex</sub>/2)\*8= 4\*T<sub>ex</sub> Four-segment CPU pipeline :

- » 1) FI : Instruction Fetch
- » 2) DA : Decode Instruction & calculate EA
- » 3) FO : Operand Fetch
- » 4) EX : Execution



**Fetch instruction** 

from memory

Decode instruction

and calculate

effective address

Segment 1

Segment 2 :

Six stage pipeline



Execution time for the 7 instructions, with pipelining:

$$(T_{ex}/6)^{*}12=2^{*}T_{ex}$$

- After a certain time (N-1 cycles) all the N stages of the pipeline are working: the pipeline is filled. Now, theoretically, the pipeline works providing maximal parallelism (N instructions are active simultaneously).
- Apparently a greater number of stages always provides better performance. However:
  - a greater number of stages increases the overhead in moving information between stages and synchronization between stages.
  - with the number of stages the complexity of the CPU grows.
  - it is difficult to keep a large pipeline at maximum rate because of *pipeline hazards*.

| 80486 and Pentium: | five-stage pipeline for integer instr. |
|--------------------|----------------------------------------|
|                    | eight-stage pipeline for FP instr.     |
| PowerPC:           | four-stage pipeline for integer instr. |
|                    | six-stage pipeline for FP instr.       |



#### **Steady-state Analysis of Pipelines:**

Let n= # of stages. m= # of tasks run on the pipeline

• Efficiency (e): is the ratio of the energy used in doing the work and the total energy supplied.

$$\mathbf{e} = \frac{\mathbf{m} \cdot \mathbf{n}}{(\mathbf{m} + \mathbf{n} - \mathbf{1}) \cdot \mathbf{n}} = \frac{\mathbf{m}}{(\mathbf{m} + \mathbf{n} - \mathbf{1})}$$

When n>>m, then e tends to be m/n. When m>>n, then e tends to be1. When n=m, then e is 0.5



ACA-<sup>£</sup>Lecture

#### **Steady-state Analysis of Pipelines:**

Let n= # of stages.

m= # of tasks run on the pipeline

**Speedup (S):** is the ratio of the time taken by the nonpipelined architecture to the time taken by the pipeline.

$$S = \frac{(n.t).m}{(m+n-1).t} = \frac{m.n}{(m+n-1)}$$

When n>>m, then S tends to be m. When m>>n, then S tends to be n. When n=m, then S is n/m



ACA-<sup>£</sup>Lecture

#### **Steady-state Analysis of Pipelines:**

Let n= # of stages.

m= # of tasks run on the pipeline

Throughput (Q): is defined as the number of tasks completed per unit amount of time.

$$Q = \frac{m}{(m+n-1).t} = \frac{e}{t}$$

When n>>m, then Q tends to be m/(n.t) When m>>n, then Q tends to be1/t. When n=m, then Q is 0.5/t



ACA-<sup>£</sup>Lecture

## **Processor Design: Pipeline**



### <u>Summary</u>

- Instructions are executed by the CPU as a sequence of steps. Instruction execution can be substantially accelerated by *instruction pipelining*.
- A pipeline is organized as a succession of N stages. At a certain moment N instructions can be active inside the pipeline.
- Keeping a pipeline at its maximal rate is prevented by pipeline hazards. Structural hazards are due to resource conflicts. Data hazards are produced by data dependencies between instructions. Control hazards are produced as consequence of branch instructions
- With conditional branch we have a penalty even if the branch has not been taken. This is because we have to wait until the branch condition is available.
- Branch instructions represent a major problem in assuring an optimal flow through the pipeline.
  Several approaches have been taken for reducing branch penalties