

# Review: MIPS Architecture • Good example of *RISC* processor: Reduced Instruction-Set Computer • RISC really a misnomer: architecture goal is speed not small instruction set. However, certain simplicities result in short clock cycle • Alternate RISC definition: Relegate the Interesting Stuff to the Compiler • Avoid runtime overhead if complexity can be resolved at compile time • Old school view: software is hard, keep compiler simple RISC view: complier writers are sharp, have good modern development tools

- MIPS company spun off from Hennessy's MIPS processor project at Stanford
  - MIPS: Microprocessor without Interlocking Pipeline Stages
    - Designed for efficient pipelining (see Chapter 6)

### Review: MIPS General Architecture Characteristics

- 32-bit integer registers 
   ⇒ 32-bit architecture
  - 32-bit integer arithmetic
  - Higher precision must be done in software
  - Memory addresses up to 2<sup>32</sup> = 4GBytes
  - MIPS-64 extends architecture to 64 bits
- 32-bit fixed-length instructions (for simplicity)
   More complex operations done as a sequence of instructions
- Very few instruction formats (for simplicity)
- General Principle: Small/simple 
   Fast

# **Review: MIPS Arithmetic Operations**

- Uses only 3 specifier arithmetic instructions:
  - Operation result, source1, source2 E.g.,: ADD a, b, c ⇒ a = b + c
- Other architectures (e.g., x86) use 2 specifier instructions
  - One specifier is used as both a destination and source:
  - Operation destination/source1, source2
     E.g.,: ADD a, b ⇔ a = a + b

## **Review: 3 Specifier Instructions**

 3 specifier is the most general form, reduces instruction count vs. two specifier, e.g.:

$$a = b + c; \Rightarrow ADD a, b, c for$$

3 specifier 2 specifier

Why do you suppose two-specifier was ever used?

# 



# Nested Subroutine Calls • Need to handle nested subroutine calls • A calls B, B calls C, C calls D ... • Recursive subroutine calls (A conditionally calls A) • Must avoid overwriting return address in \$31 = \$ra • General solution: Store return address on stack • MIPS (other RISCs) does not have architected stack pointer • Any general purpose register could be used as stack pointer • Software convention uses \$29 = \$sp • Save return address, push ⇒ add \$sp, \$sp, -4 #expand stack by one word sw \$ra,0(\$sp) #save ret. addr. on top of stack

Restore return address, pop 
 \[
 \]
 Iw \$ra,0(\$sp) #get ret. addr, from top of stack
 add \$sp,\$sp,4 #shrink stack by one word





- RISC fixed-length instructions do not allow large immediate constants (e.g., 32 bits)
- MIPS uses special instruction in two-instruction sequence to create constants > 16 bits (uncommon case)
  - Load Upper Immediate (lui):

| lui | N/A | Rd | immediate data |
|-----|-----|----|----------------|
| 6   | 5   | 5  | 16             |

- Sets Rd upper 16 bits to immediate data, lower 16 bits to 0s
- 32 bit constant:

# **Register 0**

- Register 0 is special purpose
- Read from \$0 always produces value 0
   used to synthesize instructions beyond base set without added complexity
  - "move \$7, \$20" = add \$7, \$0, \$20
  - "load \$7, immediate" = addi \$7, \$0, immediate
  - Pseudo Instructions
- Write to \$0 does nothing
   add \$0, \$12, \$20 = NOP
- Various other examples of exploiting \$0 as we review instruction set

#### **Instruction Set Design**

- As seen, some operations can be synthesized with short sequence, e.g.:
  - Can also create 32-bit constant w/o lui:

addi \$t0,\$0,1234<sub>hex</sub> sll \$t0,\$t0,16 addi \$t0,\$t0,5678<sub>hex</sub> # \$t0 has 00001234<sub>hex</sub>
# \$t0 has 12340000<sub>hex</sub>
# \$t0 has 12345678<sub>hex</sub>

 How to decide if specific instruction (e.g., lui) is worthwhile?

**Experimentation/Quantitative Analysis** 

# Evaluating Instruction Sets?

#### Design-time metrics:

- Can it be implemented, in how long, at what cost?
   Can it be programmed? Ease of compilation?
- Static Metrics:
- ° How many bytes does the program occupy in memory? Dynamic Metrics:
- \* How many instructions are executed?
- ° How many bytes does the processor fetch to execute the program?
- \* How many clocks are required per instruction? CPI
- \* How "lean" a clock is practical?

Best Metric: Time to execute the program!

Inst. Count Cycle Time

NOTE: this depends on instructions set, processor organization, and compilation techniques.

# Aspects of CPU Performance CPU time = Seconds Program × Cycles x Seconds Program \* \* \* \* Program X \* \* \* Program X X \* \* Instr. Set X X \* \*

Х

х



# Example Tradeoff: Auto Increment (Decrement)

- Some architectures (<u>not</u> most RISCs) include instruction that auto increments base/index register
  - MIPS: lw \$t1,0(\$t0) # \$t0 is index addi \$t0,\$t0,4 # increment index
  - Auto Inc:lw+ \$t1,0(\$t0) # load and inc index
- Auto Inc (dec) may be useful for incrementing through arrays, but
  - not always increment by same value (e.g., 4)
    - instruction complicates hardware because produces two
      results (two register writes):
      - Value from memory
      - Updated index

Organization

Technology

#### Example Tradeoff: Memory Operations

- Some architectures (<u>not</u> RISC) allow arithmetic operands from memory
  - e.g., ADDM \$s0,\$s1,offset(\$t0)
- Problems:
  - Limited offset field (e.g., 11 bits)
  - requires 3 operations, hence very slow:
    - 1.
       compute address (r\_base + offset)

       2.
       load data (memory is slow)

       3.
       ALU operation
  - significantly complicates pipeline design
     ⇒ slower clock cycle

#### Example Tradeoff: Dedicated Loop Counter

- A special loop-counter register is automatically decremented based on branch opcode
  - Branch if Loop-Counter = 0

Bcond addr offset

- + Allows larger branch offset (lower instruction count, but small)
- ++ does not require separate increment instruction (lower instruction count)
- requires more work within branch instruction, may increase clock cycle time

|      | Varia    | Example<br>ble-Leng | Tradeoff:<br>gth Instructions                           |  |
|------|----------|---------------------|---------------------------------------------------------|--|
| mai  |          | as neede            | act code by using as<br>ed                              |  |
|      | opcode   | Rs Rd               |                                                         |  |
| • jı | imp labe | I                   |                                                         |  |
| [    | opcode   | offset              |                                                         |  |
|      | opcode   |                     | offset                                                  |  |
| kno  | wing ho  | ow long,            | struction before<br>where operator<br>bund ⇔ higher CPI |  |





| Ξ         | xam      | ple: HLL        | MIPS Assembly                                   |    |
|-----------|----------|-----------------|-------------------------------------------------|----|
|           | C Co     | de: swap(int)   | v[ ], int k)                                    |    |
|           |          | { int tem       |                                                 |    |
|           |          | temp =          | v[k];                                           |    |
|           |          | v[k] = v        | [k+1];                                          |    |
| MIPS Code | <b>:</b> | v[k+1] =        | = temp; }                                       |    |
| swap:     | addi     | \$sp, \$sp, -12 | # make room on stack for 3 regs                 |    |
|           | SW       | \$t0, 0(\$sp)   | # save \$t0 on stack                            |    |
|           | SW       | \$s0,4(\$sp)    | # save \$s0 on stack                            |    |
|           | SW       | \$s1, 8(\$sp)   | # save \$s1 on stack                            |    |
|           | muli     | \$t0, \$a0, 4   | # \$t0 = k * 4                                  |    |
|           | add      | \$t0,\$a1, \$t0 | <pre># \$t0 = v + k*4 = the addr. of v[k]</pre> |    |
|           | lw       | \$s0, 0(\$t0)   | # \$s0 = temp = v[k]                            |    |
|           | lw       | \$s1, 4(\$t0)   | # \$s1 = v[k+1] = next element of v             |    |
|           | SW       | \$s1, 0(\$t0)   | # v[k] = v[k+1]                                 |    |
|           | SW       | \$s0, 4(\$t0)   | # v[k+1] = temp                                 |    |
|           | lw       | \$t0, 0(\$sp)   | # restore \$t0 from stack                       |    |
|           | lw       | \$s0, 4(\$sp)   | # restore \$s0 from stack                       |    |
|           | lw       | \$s1, 8(\$sp)   | # restore \$s1 from stack                       |    |
|           | addi     | \$sp, \$sp, 12  | # restore stack pointer                         |    |
|           | jr       | \$ra            | # return to calling routine                     |    |
|           |          |                 |                                                 | 22 |









# **Lexical Analysis** Syntax Analysis (Parsing) • Translates character string to string of tokens Groups tokens into syntactic structure, in Tokens are the basic lexical units in the HLL, e.g.: the form of a syntax tree (parse tree) to identify statements sum := Y + Z x 12 ; statement expr • Token type and value are identified IF (New > MaxNum) ... => token string: (Recall high school English)

# Semantic Analysis

Performs static error checking

• A keyword: Then

• An identifier: newdata • An operator: >=

• A constant: 17

• Punctuation: ;

(Keyword, IF) (Identifier, "New") (Operator, >) (Identifier, "MaxNum")

• etc.

- Uses symbol table to check that variables are defined, declared
- Checks Operand/Result variables for type compatibility
- Generates intermediate representation used by compiler backend
  - May resemble 'generic' assembly code, assumes unlimited registers
    - Statement: S = A + B x C Becomes:
      - move R101,A move R102,B
      - move R103,C
      - mult R104,R102,R103 add R105, R101,R104
      - move S,R105

# **Compiler Optimization Basics**

- Usually optimizations only occur in the backend, frontend produces simple translation
- Backend attempts to improve code quality, some combination of code speed, size and possibly power consumption



# General Optimization Phase Structure Each optimization phase is generally structured as an analysis algorithm followed by a code-improving transformation algorithm Analysis is specific to the transformation

1

#### **Control Flow Analysis** Compiler divides program into basic blocks (BB) Straight-line code with no branch in except at start, no branch out except at end output Optimizations within BB are local, across BB are global LW r<sub>101</sub>, 0(r<sub>102</sub>) <stall> <stall> B1 r<sub>105</sub>, 0(r<sub>101</sub>) LW <stall> <stall> **B**3 B2 LW r<sub>107</sub>, 4(r<sub>102</sub>) <stall> SW r<sub>107</sub>, 0(r<sub>104</sub>)







# **Partial Deadcode Elimination**

- If instruction's result is not used along a path, the instruction can be pushed down the path it is used
  - reduces instruction's execution count

|                          |               | 101,1<br>02,r103,r104<br>6,0(r102) |
|--------------------------|---------------|------------------------------------|
|                          |               |                                    |
|                          | mult r1       | 06,r105, 3                         |
|                          |               | $\leq$                             |
| addi r107,<br>mult r106, |               | addi r108,r10                      |
|                          | $\overline{}$ |                                    |
|                          | add r101,r1   | 01, <b>r10</b> 6                   |
|                          |               |                                    |
|                          | sw r1         | 01,0(r108)                         |
|                          | -51111        | 01,0(1100)                         |

# Partial Deadcode Elimination If instruction's result is not used along a path, the instruction can be pushed down the path it is used reduces instruction's execution count addi r107,r107,1 mult r106,r107,r108,1 addi r101,r108,1 mult r106,r108,1 sw r101,0(r108)





## **Register Allocation**

- Until toward end of optimization phases compiler assumes unlimited number of symbolic registers
- Register allocation assigns symbolic registers to real registers
- Traditional allocation algorithms use graph coloring











# Loop Unrolling

- Can increase loop performance in exchange code size increase
  - HLL Code:
    - FOR (j = 0, j<1000, j++) sum = sum + a[j];
  - Intermediate code for loop:
    - loop: Iw r102,0(r101) add r103,r103,r102 addi r101,r101,1 siti r104,r101,1000 bne r104,r0,loop

# r101 is address of j # sum is r103, sum = sum + a[j] #j++ # (j<1000)

Three instructions of loop overhead per iteration

# Unrolled Loop

- Replicate the loop body to reduce per iternation loop overhead
  - HLL code
    - FOR (j = 0; j<1000; j=j+2) sum = sum + a[j]; sum = sum + a[j+1];
  - Intermediate code

     loop:
     lw r102,0(r101)
     add r103,r103,r102
     lw r105,4(r101)
     add r103,r103,r105
     add r103,r103,r105
     add r103,r103,r105
     add r101,r101,2
     stit r104,r101,r100
     bms r104,r01,0100
     bms r104,r01,010
- # load a[j] # sum a[j] # load a[j+1] # sum a[j+1] #j=j+2 # (j<1000)
- Loop Overhead is now 1.5 instructions per iteration

# **Subroutine In-Line Expansion**

- Insert the body of a subroutine rather than a call
  - Eliminates overhead of call/return
  - Simplifies program structure, allowing better instruction scheduling, register allocation
     Original code:

Load

Add

Call A

- <next> Optimized Code
- Load
- Add
- <body of A>
- <next>

### **EEC 175**

- Complete course on compiler optimization
- Explores the software side of the hardware/software interface
- Project course/design elective
- Spring Quarter, 5 units

# **EEC175 Topics**

Basic blocks Control flow graph Loops Call graph Program profiling Instruction Scheduling Loop-Invariant Code Motion Procedure In-Lining Branch Optimization Branch Alignment Unreachable Code Elimination Data-Flow Analysis Data Dependence Aliasing Reaching Definitions Def-Use Chains Bit-Vector Iterative Analysis Live-Range Analysis Symbolic-Register Renumbering Symbolic-Register Interference Available Expressions Dead-Code Elimination Code Hoisting/Sinking Common Sub-Expression Elimination

Partial Deadcode Elimination Strength Reduction Constant Propagation