# **EEC 170 Computer Architecture** Fall 2005 **Multiple Issue Introduction**

Courtesy of Prof. Mary Jane Irwin (Penn State Uni

# **Review: Pipeline Hazards**

- Structural hazards
  - Design pipeline to eliminate structural hazards
- Data hazards – read before write
  - Use data forwarding inside the pipeline
  - For those cases that forwarding won't solve (e.g., load-use), include hazard hardware to insert stalls in the instruction stream
- Control hazards – beq, bne, j, jr, jal
  - Stall – hurts performance
  - Move decision point as early in the pipeline as possible – reduces number of stalls at the cost of additional hardware
  - Delay decision (requires compiler support) – not feasible for deeper pipes requiring more than one delay slot to be filled
  - Predict – with even more hardware, can reduce the impact of control hazard stalls even further if the branch prediction (BHT) is correct and if the branched-to instruction is cached (BTB)

# Extracting Yet More Performance

Two options:

- Increase the depth of the pipeline to increase the clock rate – superpipelining (more details to come)
- Fetch (and execute) more than one instruction at one time (expand every pipeline stage to accommodate multiple instructions) – multiple-issue

Launching multiple instructions per stage allows the instruction execution rate, CPI, to be less than 1

- So instead we use IPC: instructions per clock cycle
  - E.g., a 6 GHz, four-way multiple-issue processor can execute at a peak rate of 24 billion instructions per second with a best case CPI of 0.25 or a best case IPC of 4
- If the datapath has a five stage pipeline, how many instructions are active in the pipeline at any given time?
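The arithmetic behind the example above can be checked with a short sketch. The function name `peak_rates` and its parameters are hypothetical, chosen only for illustration; the in-flight count assumes every stage of every issue slot is occupied, which is a best case.

```python
def peak_rates(clock_hz: float, issue_width: int, pipeline_stages: int) -> dict:
    """Best-case figures for an idealized multiple-issue pipeline (no stalls)."""
    return {
        "peak_ips": clock_hz * issue_width,               # instructions per second
        "best_cpi": 1 / issue_width,                      # clock cycles per instruction
        "best_ipc": issue_width,                          # instructions per clock cycle
        "max_in_flight": issue_width * pipeline_stages,   # full pipeline, all slots used
    }

# The slide's example: 6 GHz, four-way issue, five-stage pipeline.
rates = peak_rates(6e9, 4, 5)
for name, value in rates.items():
    print(f"{name}: {value}")
```

With these inputs the sketch reproduces the 24 billion instructions per second, CPI of 0.25, and IPC of 4 quoted above, and gives up to 4 × 5 = 20 instructions in the pipeline at once.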

### **Superpipelined Processors**

- Increase the depth of the pipeline leading to shorter clock cycles (and more instructions "in flight" at one time)
  - The higher the degree of superpipelining, the more forwarding/hazard hardware needed, the more pipeline latch overhead (i.e., the pipeline latch accounts for a larger and larger percentage of the clock cycle time), and the bigger the clock skew issues (i.e., because of faster and faster clocks)
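A rough way to see why latch overhead grows with pipeline depth is the simple model below. It assumes the logic delay divides evenly across stages and that each stage pays a fixed latch delay; the function names and the 10 ns / 0.1 ns figures are made up for illustration, not taken from the lecture.

```python
def cycle_time_ns(logic_ns: float, stages: int, latch_ns: float) -> float:
    """Clock period when total logic delay is split evenly over `stages`."""
    return logic_ns / stages + latch_ns

def latch_fraction(logic_ns: float, stages: int, latch_ns: float) -> float:
    """Share of each clock cycle spent in the pipeline latch."""
    return latch_ns / cycle_time_ns(logic_ns, stages, latch_ns)

# Assumed numbers: 10 ns of total logic, 0.1 ns per pipeline latch.
depths = [5, 10, 20, 40]
fractions = [latch_fraction(10.0, d, 0.1) for d in depths]
for d, f in zip(depths, fractions):
    print(f"{d:2d} stages: cycle = {cycle_time_ns(10.0, d, 0.1):.3f} ns, latch share = {f:.1%}")
```

Under this model the clock period keeps shrinking as stages are added, but the latch takes an ever larger share of it, so the speedup from deeper pipelining flattens out.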

### Superpipelined vs Superscalar

- Superpipelined processors have longer instruction latency than superscalar (SS) processors, which can degrade performance in the presence of true dependencies
- Superscalar processors are more susceptible to resource conflicts – but we can fix this with hardware!

# Instruction vs Machine Parallelism

- Instruction-level parallelism (ILP) of a program – a measure of the average number of instructions in a program that a processor might be able to execute at the same time
  - Mostly determined by the number of true (data) dependencies and procedural (control) dependencies in relation to the number of other instructions
  - Data-level parallelism (DLP), e.g., the loop `DO I = 1 TO 100` / `A[I] = A[I] + 1` / `CONTINUE`
- Machine parallelism of a processor – a measure of the ability of the processor to take advantage of the ILP of the program
  - Determined by the number of instructions that can be fetched and executed at the same time
- To achieve high performance, need both ILP and machine parallelism
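One common idealization of ILP, sketched here, is the number of instructions divided by the length of the longest chain of true dependencies. This is an assumption for illustration, not the lecture's definition: it treats every instruction as taking one cycle, ignores control dependencies, and assumes unlimited hardware. The names `ilp_estimate` and `ipc_bound` are hypothetical.

```python
def ilp_estimate(num_instrs: int, true_deps: list[tuple[int, int]]) -> float:
    """Instructions / critical-path length, assuming unit latency.

    true_deps holds (producer, consumer) instruction indices with
    producer < consumer (program order).
    """
    depth = [1] * num_instrs  # earliest cycle in which each instruction can run
    # Visit consumers in program order so every producer's depth is already final.
    for producer, consumer in sorted(true_deps, key=lambda d: d[1]):
        depth[consumer] = max(depth[consumer], depth[producer] + 1)
    return num_instrs / max(depth)

def ipc_bound(ilp: float, issue_width: int) -> float:
    """Sustained IPC cannot exceed either the program's ILP or the machine's width."""
    return min(ilp, issue_width)

# Six instructions: a chain 0 -> 1 -> 2, a pair 3 -> 4, and one independent instruction.
ilp = ilp_estimate(6, [(0, 1), (1, 2), (3, 4)])
print(ilp, ipc_bound(ilp, 4), ipc_bound(ilp, 1))
```

The `ipc_bound` helper illustrates the last bullet: a wide machine running low-ILP code, or a narrow machine running high-ILP code, is limited by whichever is smaller.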

## Multiple-Issue Processor Styles

### Static multiple-issue processors (aka VLIW)

- Decisions on which instructions to execute simultaneously are being made statically (at compile time by the compiler)
- E.g., Intel Itanium and Itanium 2 for the IA-64 ISA – EPIC (Explicitly Parallel Instruction Computing)

### Dynamic multiple-issue processors (aka superscalar)

- Decisions on which instructions to execute simultaneously are being made dynamically (at run time by the hardware)
- E.g., IBM Power 2, Pentium 4, MIPS R10K, HP PA 8500

### **Multiple-Issue Datapath Responsibilities**

Must handle, with a combination of hardware and software fixes, the fundamental limitations of

- Storage (data) dependencies – aka data hazards
  - Limitation more severe in a SS/VLIW processor due to (usually) low ILP
- Procedural dependencies – aka control hazards
  - Ditto, but even more severe
  - Use dynamic branch prediction to help resolve the ILP issue
- Resource conflicts – aka structural hazards
  - A SS/VLIW processor has a much larger number of potential resource conflicts
  - Functional units may have to arbitrate for result buses and register file write ports
  - Resource conflicts can be eliminated by duplicating the resource or by pipelining the resource



# In-Order Issue with In-Order Completion

- Simplest policy is to issue instructions in exact program order and to complete them in the same order they were fetched (i.e., in program order)
- Example:
  - Assume a pipelined processor that can fetch and decode two instructions per cycle, that has three functional units (a single cycle adder, a single cycle shifter, and a two cycle multiplier), and that can complete (and write back) two results per cycle
  - And an instruction sequence with the following characteristics
    - I1 – needs two execute cycles (a multiply)
    - I2
    - I3
    - I4 – needs the same function unit as I3
    - I5 – needs data value produced by I4
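The example can be explored with a small scheduling sketch. It is only one possible reading of the policy and rests on several assumptions that are not in the slides: I2 and I5 use the adder, I3 and I4 use the shifter, functional units are not pipelined, fetch and decode are not modeled, and a finished result simply waits until all older instructions have written back (at most two write-backs per cycle). The names `Instr` and `schedule` are hypothetical, and the cycle numbers may differ from the lecture's own timing diagram.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    name: str
    unit: str
    latency: int = 1
    sources: tuple = ()  # names of earlier instructions whose results are needed

def schedule(program, issue_width=2, wb_width=2):
    """Return (issue_cycle, writeback_cycle) dicts for in-order issue and completion."""
    unit_free = {}  # unit -> first cycle in which it is free again
    done = {}       # name -> last execute cycle
    issue = {}
    cycle, i = 1, 0
    while i < len(program):
        issued = 0
        while i < len(program) and issued < issue_width:
            ins = program[i]
            ready = all(done[s] < cycle for s in ins.sources)
            free = unit_free.get(ins.unit, 1) <= cycle
            if not (ready and free):
                break  # in-order issue: nothing may pass a stalled instruction
            issue[ins.name] = cycle
            done[ins.name] = cycle + ins.latency - 1
            unit_free[ins.unit] = cycle + ins.latency
            issued += 1
            i += 1
        cycle += 1
    wb, prev_cycle, used = {}, 0, 0
    for ins in program:  # write back strictly in program order
        c = max(done[ins.name] + 1, prev_cycle)
        if c == prev_cycle and used == wb_width:
            c += 1       # write ports for that cycle are full
        if c != prev_cycle:
            prev_cycle, used = c, 0
        used += 1
        wb[ins.name] = c
    return issue, wb

program = [
    Instr("I1", "multiplier", latency=2),
    Instr("I2", "adder"),                   # unit is an assumption
    Instr("I3", "shifter"),                 # unit is an assumption
    Instr("I4", "shifter"),                 # same unit as I3 (from the slide)
    Instr("I5", "adder", sources=("I4",)),  # needs I4's result (from the slide)
]
issue, wb = schedule(program)
print("issue:", issue)
print("write-back:", wb)
```

Under these assumptions I4 waits one cycle for the shifter and I5 waits one cycle for I4's result, while I2's finished result is held until I1's two-cycle multiply completes.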













# Handling Output Dependencies

There is one more situation that stalls instruction issuing with IOI-OOC (in-order issue with out-of-order completion). Assume

- I1 – writes to R3
- I2 – writes to R3
- I5 – reads R3

- If the I1 write occurs after the I2 write, then I5 reads an incorrect value for R3
- I2 has an output dependency on I1 – write before write
  - The issuing of I2 would have to be stalled if its result might later be overwritten by a previous instruction (i.e., I1) that takes longer to complete – the stall happens before instruction issue
- While IOI-OOC yields higher performance, it requires more dependency checking hardware
  - Dependency checking needed to resolve both read before write and write before write
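The issue-time check for output dependencies could look roughly like the sketch below. It is a simplified illustration, not the hardware's actual logic: the function name `must_stall_for_output_dep` and the `(dest_reg, finish_cycle)` bookkeeping are assumptions, and treating a same-cycle write as a conflict is a deliberately conservative choice.

```python
def must_stall_for_output_dep(new_dest, new_finish_cycle, in_flight):
    """Decide whether a newly decoded instruction must wait before issuing.

    in_flight lists (dest_reg, finish_cycle) for earlier, already-issued
    instructions. The new instruction stalls if any of them writes the same
    register and would finish at the same time or later, since that older
    write would then overwrite (or collide with) the newer result.
    """
    return any(dest == new_dest and finish >= new_finish_cycle
               for dest, finish in in_flight)

# I1 is a two-cycle multiply issued in cycle 1, so it finishes in cycle 2 and writes R3.
in_flight = [("R3", 2)]

# I2 is a one-cycle instruction that also writes R3.
print(must_stall_for_output_dep("R3", 1, in_flight))  # issued alongside I1: finishes first
print(must_stall_for_output_dep("R3", 3, in_flight))  # issued in cycle 3: finishes after I1
print(must_stall_for_output_dep("R4", 1, in_flight))  # different destination register
```

The first call reports a stall, which is the situation described above: without it, I1's later write would leave a stale value in R3 for I5 to read.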







### **Dependencies Review**

- Each of the three data dependencies manifests itself through the use of registers (or other storage locations)
  - True data dependencies (read before write)
  - Antidependencies (write before read) – storage conflict
  - Output dependencies (write before write) – storage conflict
- True dependencies represent the flow of data and information through a program
- Anti- and output dependencies arise because the limited number of registers means that programmers reuse registers for different computations
- When instructions are issued out-of-order, the correspondence between registers and values breaks down and the values *conflict* for registers
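The three dependency types can be told apart mechanically by comparing the registers two instructions read and write. The sketch below is illustrative only: the `(dest, sources)` tuple encoding and the function name `classify` are assumptions, and the three-instruction sequence is a made-up example rather than one from the lecture.

```python
def classify(earlier, later):
    """Return the dependencies of `later` on `earlier`.

    Each instruction is (dest_register, source_registers); `earlier`
    precedes `later` in program order.
    """
    e_dest, e_srcs = earlier
    l_dest, l_srcs = later
    kinds = set()
    if e_dest is not None and e_dest in l_srcs:
        kinds.add("true")    # later reads what earlier writes
    if l_dest is not None and l_dest in e_srcs:
        kinds.add("anti")    # later overwrites what earlier still has to read
    if e_dest is not None and e_dest == l_dest:
        kinds.add("output")  # both write the same register
    return kinds

# Hypothetical sequence that reuses R3 for two different computations.
i1 = ("R3", ("R3", "R5"))  # R3 = R3 op R5
i2 = ("R4", ("R3",))       # R4 = R3 + 1
i3 = ("R3", ("R5",))       # R3 = R5 + 1

print(classify(i1, i2))  # I2 needs I1's result
print(classify(i2, i3))  # I3 must not overwrite R3 before I2 reads it
print(classify(i1, i3))  # I1 and I3 both write R3, and I1 also reads it
```

Only the first pair reflects a real flow of data; the other two exist purely because R3 is reused, which is the storage-conflict case described above.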

