

# EEC170 Computer Architecture

FQ 2005

Courtesy of Prof. Kent Wilken and Prof. John Owe

#### **Processor Designs So Far**

- We have Single Cycle Design with low CPI but high CCT
- We have Multicycle Design with low CCT but high CPI
- We want best of both: low CCT and low CPI
- Achieved using pipelining



# Laundry Timing

- How long does laundry take with a "single cycle" (wash/dry/fold is one clock) design? What is the clock cycle time? "CPI"? How many clocks?
- How long does laundry take with a "multiple cycle" (longest of wash/dry/fold is one clock) design? What is the CCT? "CPI"? How many clocks?





# Laundry Timing

 How long does laundry take with a "pipelined" (longest of wash/dry/fold is one clock) design? What is the CCT? How many clocks?

# Pipelined Laundry: Start work ASAP



Stall for Dependences

#### Sandwich Bar Analogy

- Pipelining: Multiple people going through the sandwich bar at the same time
  - If you're in front of the pickles, but you don't need the pickles ...
- Car assembly line

#### **Digital System Efficiency**

 A synchronous digital system doing too much work between clocks can be inefficient because logic can be static for much of clock period

















# **Pipeline Efficiency**

- Efficiency can be improved by
  - I. subdividing logic into multiple stages and shortening CCT
  - II. overlapping operation execution







#### **Pipeline Depth**

- If three pipeline stages are better than non-pipelined, are four stages better than three?
- What is the limit to the number of stages?
- What is the CCT at that limit?

# Why Pipeline?

- Suppose we execute 100 instructions. How long on each architecture?
- Single Cycle Machine
  - 4.5 ns/cycle, CPI=1
- Multicycle Machine
  - 1.0 ns/cycle, CPI=4.1
- Ideal pipelined machine
  - 1.0 ns/cycle, CPI=1 (but remember fill cost!)

# Why Pipeline?

- Suppose we execute 100 instructions
- Single Cycle Machine
  - 4.5 ns/cycle x 1 CPI x 100 inst = 450 ns
- Multicycle Machine
  - 1.0 ns/cycle x 4.1 CPI (due to inst mix) x 100 inst = 410 ns
- Ideal pipelined machine
  - 1.0 ns/cycle x (1 CPI x 100 inst + 4 cycle fill) = 104 ns

#### **Fill Costs**

- Suppose we pipeline different numbers of instructions. What is the overhead of pipelining?
- Ideal pipelined machine
  - 1.0 ns/cycle, CPI=1, 10 instructions
  - 1.0 ns/cycle, CPI=1, 1000 instructions
  - 1.0 ns/cycle, CPI=1, 100,000 instructions

#### **Fill Costs**

- Overhead: Ideal pipelined machine
  - 1.0 ns/cycle, CPI=1, 10 instructions
  - 10 ns runtime + 4 ns overhead: 40% overhead
  - 1.0 ns/cycle, CPI=1, 1000 instructions
  - 1000 ns runtime + 4 ns overhead: 0.4% overhead
  - 1.0 ns/cycle, CPI=1, 100,000 instructions
  - 100,000 ns runtime + 4 ns overhead: 0.004% overhead

## **Pipelined Multiplier**

- We can take the Wallace tree multiplier we developed in Chapter 3 and create a two stage pipeline
  - Reduce partial product terms
  - Do final n-2 bit addition using carry lookahead adder
- The two operations are roughly balanced, both take log(n) time







# **Pipelined Processor Design**

- We'll take the single-cycle design and transform it into a pipelined design
- What we formally called a 'phase' of an instruction's execution will now become a pipeline stage







































**Overlapping Instruction Execution** Time (in clock cycles) -Program execution order CC1 | CC2 | CC3 | CC4 | CC5 | CC6 | CC7 | CC8 | CC9 Iw \$1, 100(\$0) Reg DM Reg ALU Reg IM DM Reg 2 ALU lw \$3, 300(\$0) IM Reg DM Reg ALU. lw \$4, 400(\$0) IM Reg DM Reg ALU Iw \$5, 500(\$0) DM IM Reg Reg

















# **Data Hazards**

- Data hazards can occur when:
  - 1. Instructions  $\mathbf{I}_1$  and  $\mathbf{I}_2$  are both in the pipeline
  - 2.  $I_2$  follows  $I_1$
  - 3.  $I_1$  produces a result that is used by  $I_2$
- The problem: because I<sub>1</sub> has not posted its result to the register file, I<sub>2</sub> may read an obsolete value.















However \$2 is available at CC4 from internal pipeline latch

| т                             | ime (in clo | ock cyc | les) |     |     | -   |     |       |     |
|-------------------------------|-------------|---------|------|-----|-----|-----|-----|-------|-----|
| Program<br>execution<br>order | CC1         | CC2     | CC3  | CC4 | CC5 | CC6 | CC7 | CC8   | CC9 |
| Sub \$2, \$1, \$3             | IM -        | Reg     | ALU  |     | Reg |     |     |       |     |
| And \$12, \$2, \$             | 5           | IM      | Reg  | ALU |     | Reg |     |       |     |
| Or \$13, \$6, \$2             |             |         | M    | Reg | ALU | DM  | Reg |       |     |
| Add \$14, \$2, \$             | 2           |         |      | IM_ | Reg | ALU | DM  | Reg   |     |
| Sw \$15, 100(\$               | 2)          |         |      |     | IM  | Reg | ALU | - DM- | Reg |

#### Forwarding

- Most data hazards can be resolved by forwarding, sending the internal pipeline result from I<sub>1</sub> to the stage where it is needed by I<sub>2</sub> which is following in the pipeline
- Result is still posted to the register file when I<sub>1</sub> reaches WB stage
- Forwarding requires hardware modifications to the datapath



#### **DataPath with Forwarding**

- Forwarding data paths for results from Memory stage ٠ and WriteBack stage to Execution stage
- New ALU MUX to select operand from register file or ٠ newest result via forwarding



#### **Fowarding Path Control**

#### Control used to decide whether to forward from the Memory stage

If ((EX/MEM.RegWrite) #must be writing a register and (EX/MEM.RegisterRd != 0) #\$0 is always up to date and (EX/MEM.RegisterRd = TD/EX.RegisterRs) #stage 4 result and stage 3 source operand match ForwardA = 10

If ((EX/MEM.RegWrite)
and (EX/MEM.RegisterRd != 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
ForwardB = 10

# **Fowarding Path Control**

#### Control used to decide whether to forward from the Writeback stage:

If ((MEM/WB.RegWrite) #must be writing a register If (UMEW/WE.RegWEIEc) #must be writing a register and (MEW/WE.RegisterRd = 0) #\$0 is always up to date and !((EK/MEM.RegisterRd = ID/EX.RegisterRs) and (EX/MEM.RegWEIEc)) # forward result from stage 4 is newer, has priority and (MEM/WE.RegisterRd = ID/EX.RegisterRe)) #stage 4 result and stage 3 source operand match Forwardh = 01

((MEM/WB.RegWrite)

- rt (velt2/WE.Keg%tlte) and (NEM/WE.RegisterRd != 0) and !((EX/MEM.RegisterRd = ID/EX.RegisterRt) and (EX/MEM.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01











| Load Hazard Solution #1                                                                |              |  |  |  |  |  |  |
|----------------------------------------------------------------------------------------|--------------|--|--|--|--|--|--|
| <ul> <li>Have compiler insert a NOP to space<br/>instructions further apart</li> </ul> |              |  |  |  |  |  |  |
| lw                                                                                     | \$2,20(\$1)  |  |  |  |  |  |  |
| nop                                                                                    |              |  |  |  |  |  |  |
| and                                                                                    | \$4,\$2,\$5  |  |  |  |  |  |  |
| or                                                                                     | \$8,\$2,\$6  |  |  |  |  |  |  |
| add                                                                                    | \$9,\$4,\$2  |  |  |  |  |  |  |
| slt                                                                                    | \$1,\$10,\$7 |  |  |  |  |  |  |
|                                                                                        |              |  |  |  |  |  |  |
|                                                                                        |              |  |  |  |  |  |  |













# **Refined NOP Injection**

 Setting all control bits to zero is correct but is overkill. What is the minimum set of control bits that must be set to zero?

#### **Early Branch Resolution**

Branch resolved in stage 2 to reduce penalty















**Delayed Branches & Exception**  Delayed branches must save PCs of next two instructions on exception, rather than one source1 esult decode memory access rdest source2 pc +4 instructior ALU ALU reg read result op rdest branch op rdest pc+4 op Page Fault save save

# **Delayed Branching Performance**

- For typical integer codes a useful instruction is executed in delay slot about 70% of the time thus branch cost is:
  - (cycles lost per branch = 0.3) x (fraction of branches = 0.25) = 8% performance loss
  - Significant reduction from 75%, but still want smaller branch cost

<u>NOTE:</u> If you consider slides 46847 as the baseline, then CPI for branch instruction is 4 in baseline implementation and 75% performance loss is correct. This is a bit different than what we discussed in class.



#### **Branch Prediction**

- Use a 'magic box' to:
  - · Determine the instruction being fetched is a branch
  - Predict the branch outcome (taken/not taken) Produce the target address

While the instruction is being fetched!



#### Simple Branch Prediction Table Branch Target Buffer (BTB)

- Predictor module is a 2<sup>n</sup> entry table indexed by n lower PC address bits. Fields are:
  - 1. 32-n upper PC address bits
- 2. IsBranch bit
- 3. Target Address (30 bits)
- 4. Prediction
- Table was filled during past executions of the branch













**Confirming Branch Prediction** 

 Prediction must be confirmed in ID stage, branchhistory state updated



# **Confirming Branch Prediction**

- Must consider pipeline action for following cases:
   No hit, no branch: do nothing
  - 2. No hit, branch:
  - a. Nullify IF instruction b. Branch to computed target address
  - 3. Hit, condition = prediction: do nothing
  - 4. Hit, condition = taken, prediction = not taken:
     a. Nullify IF instruction
    - b. Branch to computer target address
  - Hit, condition = not taken, prediction = taken

     Nullify IF instruction
    - b. Branch to ID stage PC+4



#### **BTB Updates**

- Must consider BTB updates for following cases:
  - 1. No hit, no branch: do nothing
  - 2. No hit, branch:
  - Write new BTB entry, set prediction to weak condition
  - 3. Hit, condition = taken: ++prediction
  - 4. Hit, condition = not taken: --prediction