#### An Approach to Low-power, High-performance, Fast Fourier Transform Processor Design

#### **Bevan Baas**

**Department of Electrical Engineering** 

bbaas@nova.stanford.edu http://nova.stanford.edu/~bbaas/

May 23, 1997

# Outline

#### Motivation and Introduction

- Energy-Efficient VLSI Processing
- Fast Fourier Transform Overview
- FFT Chip Architectures
- The Spiffee Processor
- Conclusion

# The Fast Fourier Transform (FFT)

- One of the most widely used digital signal processing algorithms
- Used in:
  - Communications
  - Radar
  - Instrumentation
  - Medical imaging



#### Low Power

- As semiconductor processing technology advances....
  - Clock speeds increase
  - Integration increases
    - ➡ Power increases

| Year       | 1995    | 1998    | 2001    | 2004    | 2007     |
|------------|---------|---------|---------|---------|----------|
| Technology | 0.35 µm | 0.25 µm | 0.18 µm | 0.13 µm | 0.10 µm  |
| Vdd        | 3.3 V   | 2.5 V   | 1.8 V   | 1.5 V   | 1.2 V    |
| Clock      | 300 MHz | 450 MHz | 600 MHz | 800 MHz | 1000 MHz |
| Power      | 80 W    | 100 W   | 120 W   | 140 W   | 160 W    |

Source: SIA Roadmap

• More and more applications are power-limited

- FFT algorithm and architecture for high energy-efficiency and high performance
- Circuits for low voltage operation
- Design of a single-chip, 1024-point, FFT processor

# Outline

- Motivation and Introduction

#### Energy-Efficient VLSI Processing

- Fast Fourier Transform Overview
- FFT Chip Architectures
- The Spiffee Processor
- Conclusion





- $P_{\text{active}} = \sum_{\text{all nodes}} \text{activity} \times C \times V^2 \times \text{frequency}$
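As a rough illustration of the relation above, the following Python sketch sums activity × C × V² × frequency over a list of nodes. The node list is a made-up placeholder, not data from the talk. The last lines apply the same V² × frequency scaling to the two Spiffee1 operating points reported in the energy-efficiency comparison table (845 mW at 3.3 V and 173 MHz; 9.5 mW at 1.1 V and 16 MHz).

```python
def active_power(nodes, vdd, freq_hz):
    """Sum activity * C * V^2 * f over (activity, capacitance_farads) pairs."""
    return sum(activity * cap * vdd ** 2 * freq_hz for activity, cap in nodes)


# Hypothetical nodes: (switching activity, capacitance in farads).
nodes = [(0.5, 20e-15), (0.1, 50e-15), (0.25, 10e-15)]

p_high = active_power(nodes, vdd=3.3, freq_hz=173e6)
p_low = active_power(nodes, vdd=1.1, freq_hz=16e6)
ratio = p_high / p_low  # depends only on the V^2 * f scaling, not on the nodes

# Scale the reported 845 mW (3.3 V, 173 MHz) down to 1.1 V and 16 MHz.
predicted_mw = 845 * (1.1 / 3.3) ** 2 * (16 / 173)

print(f"power ratio: {ratio:.1f}x, predicted at 1.1 V: {predicted_mw:.1f} mW")
```

Under this simple model the predicted power at 1.1 V lands near the reported 9.5 mW, although the model accounts only for switching power.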

# **Energy-Efficiency**

- Goal is energy-efficiency with good performance
  - Not just low-power, which can be easily obtained by reducing performance
- For DSP, high energy-efficiency is key
  - Algorithms are often easily parallelized
  - Often insensitive to latency
  - Therefore, high-performance can usually be obtained through parallel processors

# Ultra Low Power (ULP) Overview

- Key idea: Biggest gain by lowering supply voltage
  - Switching energy is proportional to V<sup>2</sup>
- To maintain performance, must also lower V<sub>t</sub>
  - Lowering V<sub>t</sub> significantly requires a process change
- Adjust V<sub>t</sub> by biasing substrate/wells



# Measured V<sub>t</sub> Adjustments



# **ULP** Implications

- Circuits operate with "leaky" transistors (low I<sub>on</sub>/I<sub>off</sub> ratio)
- Static circuits generally OK
- No pure dynamic circuits, NMOS pass gates, etc.
- Redesign high fan-in circuits



# Low Vt Design

- Nodes with high fan-in require re-design
- Memory bitline is a common structure with high fan-in
- Worst case: Reading a '0' in a column with M-1 '1's



# Hierarchical-Bitline SRAM

- 2-level hierarchical bitlines
- Reduces cell leakage on bitlines
- Reduces bitline capacitance by almost 50%
- 8 Local-bitlines x 16 cells each
- 3 Separate nwell biases
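To see why shorter bitlines help with leaky transistors, here is a minimal Python sketch of the worst-case read described for low-V<sub>t</sub> memories: one accessed cell supplies I<sub>on</sub> while the other M-1 cells on the same bitline each leak I<sub>off</sub>. The on/off ratio is a hypothetical placeholder, all cells are assumed identical, and leakage reaching the global bitline through the other local bitlines is ignored.

```python
def worst_case_read_margin(cells_per_bitline, on_off_ratio):
    """Read current of the accessed cell divided by the total leakage
    of the remaining cells sharing the same bitline."""
    leaking_cells = cells_per_bitline - 1
    return on_off_ratio / leaking_cells


ON_OFF_RATIO = 1000.0  # hypothetical I_on / I_off for a low-Vt transistor

flat = worst_case_read_margin(128, ON_OFF_RATIO)   # one flat 128-cell bitline
local = worst_case_read_margin(16, ON_OFF_RATIO)   # one 16-cell local bitline
improvement = local / flat

print(f"flat: {flat:.1f}, local: {local:.1f}, improvement: {improvement:.1f}x")
```

With 16 cells per local bitline, only 15 cells leak against the accessed cell instead of 127, so the margin in this simplified model improves by 127/15 regardless of the assumed on/off ratio.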



### Simulation Results

- 128-word memory, Vdd = 300 mV, parasitics included



### Outline

- Motivation and Introduction
- Energy-Efficient VLSI Processing

#### Fast Fourier Transform Overview

- FFT Chip Architectures
- The Spiffee Processor
- Conclusion

# The Fast Fourier Transform (FFT)

- Efficient method of calculating the Discrete Fourier Transform (DFT)
- Believed to have been discovered by Gauss in 1805
- Re-discovered by Cooley and Tukey in 1965
- N = length of transform, must be composite
  - $N = N_1 \times N_2 \times \dots \times N_m$

| Transform Length | DFT Operations | FFT Operations | DFT ops / FFT ops |
|------------------|----------------|----------------|-------------------|
| 64               | 4,096          | 384            | 11                |
| 256              | 65,536         | 2,048          | 32                |
| 1,024            | 1,048,576      | 10,240         | 102               |
| 65,536           | 4,294,967,296  | 1,048,576      | 4,096             |
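The figures in the table are consistent with counting N² operations for the direct DFT and N·log₂N for the FFT. The short Python sketch below regenerates them under that assumption; the function names are illustrative only.

```python
def dft_ops(n):
    """Operation count for a direct DFT, taken as N^2."""
    return n * n


def fft_ops(n):
    """Operation count for a radix-2 FFT, taken as N * log2(N)."""
    log2_n = n.bit_length() - 1
    if 1 << log2_n != n:
        raise ValueError("n must be a power of two")
    return n * log2_n


table = [(n, dft_ops(n), fft_ops(n), round(dft_ops(n) / fft_ops(n)))
         for n in (64, 256, 1024, 65536)]

for n, dft, fft, ratio in table:
    print(f"{n:>6,} {dft:>14,} {fft:>10,} {ratio:>6,}")
```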



# FFT Hardware Algorithms

- Simple, compact design more important than the number of operations
- Nearly all modern FFT processors use common-factor, radix-2<sup>m</sup> architectures

| Processor             | Architecture |
|-----------------------|--------------|
| LSI, L64280           | radix-2      |
| Plessey, PDSP16510A   | radix-4      |
| Dassault Electronique | radix-4      |
| Cobra, Colorado State | radix-4      |
| CNET, E. Bidet        | radix-2,4    |

# Outline

- Motivation and Introduction
- Energy-Efficient VLSI Processing
- Fast Fourier Transform Overview

#### FFT Chip Architectures

- The Spiffee Processor
- Conclusion

# **Common FFT Architectures**



# Cached Memory

• Small cache used to hold frequently-used data



- Cache size $C = 2^{\lceil \log_2 \sqrt[E]{N} \rceil}$
  - E = Number of "Epochs" or passes through the data
- Partition processor based on activity
  - High activity: processor, cache
  - Low activity: main memory
  - Reduce leakage in low activity portion by increasing V<sub>t</sub>
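A small Python sketch of the cache-size relation follows. It reads the expression as C = 2^⌈log₂(N)/E⌉, the smallest power of two that is at least the E-th root of N; the ceiling is an assumption about the intended rounding, and the function name is illustrative.

```python
def cache_size(n_points, epochs):
    """Cache size in words for an N-point transform split into `epochs` passes."""
    log2_n = n_points.bit_length() - 1
    if 1 << log2_n != n_points:
        raise ValueError("n_points must be a power of two")
    stages_per_epoch = -(-log2_n // epochs)  # integer ceiling of log2(N) / E
    return 1 << stages_per_epoch


print(cache_size(1024, 2))   # two epochs over 1024 points
print(cache_size(1024, 3))   # three epochs need a smaller cache
```

For N = 1024 and E = 2 this gives 32 words, which matches the 1024<sup>1/2</sup> = 32-word cache quoted for Spiffee.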

# Cached Memory Algorithm

- Previous caching algorithms
  - Gentleman and Sande, 1966
  - Singleton, 1967
  - Brenner, 1969
  - Rabiner and Gold, 1975
  - Bailey, 1990
  - Carlson, 1990
- Processor's view: Fast, large memory
- Memory's view: Very large radix processor
  - Similar to Radix N<sup>1/E</sup>
- Especially good algorithm for large N
# Radix-2 Decimation-in-time Dataflow



# Radix-2 Decimation-in-time Cached FFT Dataflow Diagram
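The dataflow can be sketched in software. The Python code below is a minimal model of a radix-2 decimation-in-time cached FFT for the balanced case where log₂N is divisible by the number of epochs: in each epoch a group of C words is loaded into a small cache, log₂C butterfly passes run entirely inside the cache, and the group is written back. The grouping and addressing shown here are one straightforward way to arrange this and are an assumption, not a description of the Spiffee hardware; `cached_fft` and `naive_dft` are illustrative names.

```python
import cmath


def bit_reverse(i, bits):
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def cached_fft(x, epochs=2):
    n = len(x)
    bits = n.bit_length() - 1
    assert 1 << bits == n and bits % epochs == 0, "balanced power-of-two case only"
    per_epoch = bits // epochs          # butterfly passes per epoch
    c = 1 << per_epoch                  # cache size in words
    mem = [x[bit_reverse(i, bits)] for i in range(n)]  # main memory
    for e in range(epochs):
        lo = e * per_epoch              # first FFT stage handled by this epoch
        low_mask = (1 << lo) - 1
        for g in range(n // c):
            # Address bits outside [lo, lo + per_epoch) come from the group index.
            base = (g & low_mask) | ((g >> lo) << (lo + per_epoch))
            addrs = [base | (k << lo) for k in range(c)]
            cache = [mem[a] for a in addrs]             # load cache
            for p in range(per_epoch):
                stage = lo + p
                half = 1 << stage
                for k in range(c):
                    if k & (1 << p):
                        continue                        # lower butterfly leg
                    k2 = k | (1 << p)
                    w = cmath.exp(-2j * cmath.pi * (addrs[k] % half) / (2 * half))
                    t = w * cache[k2]
                    cache[k], cache[k2] = cache[k] + t, cache[k] - t
            for a, v in zip(addrs, cache):              # write cache back
                mem[a] = v
    return mem


def naive_dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


data = [complex(i % 7, (3 * i) % 5) for i in range(64)]
error = max(abs(a - b) for a, b in zip(cached_fft(data, epochs=2), naive_dft(data)))
print(f"max error vs. direct DFT: {error:.2e}")
```

Main memory is touched only twice per epoch (one load and one write-back per word), while all butterfly traffic stays in the cache.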



### Outline

- Motivation and Introduction
- Energy-Efficient VLSI Processing
- Fast Fourier Transform Overview
- FFT Chip Architectures

#### The Spiffee Processor

- Conclusion

# Design Goals

- 1024 complex point FFT processor
- Single-chip
- Deep pipelining
- Functional and good performance at low Vdd (400mV), low V<sub>t</sub> (0V)
- Robust circuits to operate in a possibly noisy environment

# Algorithm

- Radix-2
- One butterfly / cycle
  - 1 complex multiply and 2 complex add/subtracts
  - 4 real multiplies and 6 real adds
- Butterfly: X = A + BW, Y = A - BW

- Cached FFT Algorithm
  - Mem = 1024 words x 36 bits
  - C = 1024<sup>1/2</sup> = 32 words x 40 bits
- Non-iterative datapath
  - High usage $\implies$ good area efficiency
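The operation count for one radix-2 butterfly can be checked with a short Python sketch. It computes X = A + BW and Y = A - BW on explicit real and imaginary parts so that each real multiply and add is visible; the tuple representation and function name are illustrative choices, not the chip's number format.

```python
def butterfly(a, b, w):
    """Radix-2 butterfly on (re, im) tuples: returns X = A + B*W, Y = A - B*W."""
    ar, ai = a
    br, bi = b
    wr, wi = w
    # Complex multiply B*W: 4 real multiplies and 2 real adds.
    pr = br * wr - bi * wi
    pi = br * wi + bi * wr
    # Two complex add/subtracts: 4 more real adds, 6 in total.
    x = (ar + pr, ai + pi)
    y = (ar - pr, ai - pi)
    return x, y


x, y = butterfly((1, 2), (3, 4), (0, 1))
print(x, y)
```

The four multiplies and six add/subtracts line up with a datapath that completes one butterfly per cycle using four multipliers and six adder/subtractors.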

# **Block Diagram**

- SRAM Arrays (8)
  - Hierarchical bitlines, 6T cells, 128 x 36-bit
- Caches (4)
  - Dual-ported, 10T cells, 16 x 40-bit
- Multipliers (4)
  - 20-bit x 20-bit, 24-bit product 2's complement
- Adder/Subtractors (6)
  - 24-bit, CLA-Ripple
- ROMs (2)
  - Hierarchical bitlines, 256 x 40-bit



- 9-stage pipeline

| MEM<br>RD   | CROSSB<br>RD | MULT1 | MULT2 | MULT3 | ADD/SUB<br>CMULT | ADD/SUB<br>XY            | CROSSB<br>WR | MEM<br>WR |
|-------------|--------------|-------|-------|-------|------------------|--------------------------|--------------|-----------|
| A<br>B<br>W | -            |       | B x W |       |                  | X = A + BW<br>Y = A - BW | -            | X<br>Y    |

- Throughput of one complex butterfly per cycle
- Stall 1 out of every 80 cycles due to Read-after-Write hazard
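A back-of-the-envelope estimate of the 1024-point execution time follows from these two figures. The Python sketch below assumes (N/2)·log₂N butterflies at one per cycle plus one stall cycle per 80 butterflies, and it ignores pipeline fill/drain and data load/unload; those simplifications, and the function names, are assumptions rather than details from the talk.

```python
def fft_cycles(n_points=1024, stall_every=80):
    """Approximate cycle count: one butterfly per cycle plus periodic stalls."""
    log2_n = n_points.bit_length() - 1
    butterflies = (n_points // 2) * log2_n
    stalls = butterflies // stall_every
    return butterflies + stalls


def exec_time_us(clock_mhz, n_points=1024):
    """Approximate execution time in microseconds at the given clock rate."""
    return fft_cycles(n_points) / clock_mhz


print(fft_cycles())
print(f"{exec_time_us(173):.1f} us at 173 MHz")
print(f"{exec_time_us(16):.0f} us at 16 MHz")
```

The estimate comes out at about 30 µsec for a 173 MHz clock, in line with the execution time reported for Spiffee1 at 3.3 V.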

# Clocking

- Single-phase clock
- Each flip-flop contains minimum-size local clock buffers



- Selectable on-chip programmable oscillator or external clock

# Spiffee

- 460,000 transistors
- 5.985mm x 8.204mm (0.7μm design rules, L<sub>poly</sub>=0.6μm)
- $V_{tn} = 0.63\,\text{V}$, $V_{tp} = -0.88\,\text{V}$
- 1 poly, 3 metal layers
- Full custom
- 650-element scan path



# **Energy-Efficiency Comparison**

- 17 times more efficient @ 1.1 V

| Processor             | Year | CMOS Tech (µm) | Datapath width (bits) | Supply Voltage (V) | 1024-pt Exec Time (µsec) | Power (mW) | Clock (MHz) | Number of chips | Adjusted transforms / mJ \* |
|-----------------------|------|----------------|-----------------------|--------------------|--------------------------|------------|-------------|-----------------|-----------------------------|
| LSI, L64280           | 1990 | 1.5            | 20                    | 5                  | 26                       | 20,000     | 40          | 20              | 11                          |
| Plessey, PDSP16510A   | 1989 | 1.4            | 16                    | 5                  | 96                       | 3,000      | 40          | 1               | 12                          |
| Dassault Electronique | 1990 | 1              | 12                    | 5                  | 128                      | 12,000     | 20          | 6               | 1                           |
| Texas Mem Sys, TM-66  | -    | 0.8            | 32                    | 5                  | 65                       | 7,000      | 50          | 3               | 16                          |
| Cobra, Colorado State | 1994 | 0.75           | 23                    | 5                  | 9.5                      | 7,700      | 40          | 16              | 49                          |
| Sicom, SNC960A        | 1996 | 0.6            | 16                    | 5                  | 20                       | 2,500      | 65          | 1               | 29                          |
| CNET, E. Bidet        | 1994 | 0.5            | 10                    | 3.3                | 51                       | 330        | 20          | 1               | 30                          |
| Spiffee1              | 1995 | 0.7            | 20                    | 1.1                | 330                      | 9.5        | 16          | 1               | 819                         |
| Spiffee1              | 1995 | 0.7            | 20                    | 3.3                | 30                       | 845        | 173         | 1               | 101                         |

\* Adjusted transforms per mJ = Tech × ((DPath × 2/3) + (DPath<sup>2</sup> × 1/3)) / (Power × Exec\_Time), with the datapath term normalized to a 10-bit datapath
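The adjusted figure can be recomputed from the other columns. One reading of the footnote that reproduces every entry in the last column is: transforms per mJ, scaled by the technology in µm and by a datapath cost of (2/3)·DPath + (1/3)·DPath², normalized to a 10-bit datapath. The exact weighting and the 10-bit reference are assumptions inferred from the numbers, and the function name is illustrative.

```python
def adjusted_transforms_per_mj(tech_um, dpath_bits, power_mw, exec_us, ref_bits=10):
    """Transforms per mJ scaled by technology and datapath width (assumed form)."""
    def cost(bits):
        # Assumed weighting: 2/3 linear in width, 1/3 quadratic in width.
        return bits * 2 / 3 + bits ** 2 / 3

    transforms_per_mj = 1e6 / (power_mw * exec_us)  # mW * usec = nJ per transform
    return tech_um * cost(dpath_bits) / cost(ref_bits) * transforms_per_mj


# (name, tech in um, datapath bits, power in mW, exec time in usec, listed value)
rows = [
    ("LSI, L64280", 1.5, 20, 20000, 26, 11),
    ("Plessey, PDSP16510A", 1.4, 16, 3000, 96, 12),
    ("Dassault Electronique", 1.0, 12, 12000, 128, 1),
    ("Texas Mem Sys, TM-66", 0.8, 32, 7000, 65, 16),
    ("Cobra, Colorado State", 0.75, 23, 7700, 9.5, 49),
    ("Sicom, SNC960A", 0.6, 16, 2500, 20, 29),
    ("CNET, E. Bidet", 0.5, 10, 330, 51, 30),
    ("Spiffee1 @ 1.1 V", 0.7, 20, 9.5, 330, 819),
    ("Spiffee1 @ 3.3 V", 0.7, 20, 845, 30, 101),
]

computed = [round(adjusted_transforms_per_mj(t, d, p, e)) for _, t, d, p, e, _ in rows]
for (name, *_rest, listed), value in zip(rows, computed):
    print(f"{name:<24} computed {value:>4}  listed {listed:>4}")
```

Dividing the 1.1 V Spiffee1 figure by the best of the earlier processors (Cobra, at 49) gives a factor of roughly 17.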







# Improvements Using Well/Substrate Biasing



# Estimated Performance in a Low-V<sub>t</sub> Process

- Portions of Spiffee were fabricated in a low-V<sub>t</sub> 0.8μm process with L<sub>poly</sub>=0.26μm and included an identical programmable oscillator



- Measurements predict at Vdd=0.4V:
  - 57 MHz @ less than 9.7 mW
  - 1024-pt FFT in 93 µsec
  - More than 66 times more efficient than the previously best known processor

- Input = $\cos(2\pi \times 23/N) + \sin(2\pi \times 83/N) + \cos(2\pi \times 211/N) - j\sin(2\pi \times 211/N)$
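To get a feel for what this test input should produce, the Python sketch below builds the signal for N = 1024 and lists the non-zero output bins. It assumes each argument is multiplied by the sample index n (which does not appear in the expression above) and uses a simple recursive radix-2 FFT in floating point, not a model of the processor's fixed-point datapath.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 decimation-in-time FFT (length must be a power of two)."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


N = 1024


def sample(n):
    a = 2 * math.pi * n / N  # assumed: sample index n scales every argument
    return (math.cos(23 * a) + math.sin(83 * a)
            + math.cos(211 * a) - 1j * math.sin(211 * a))


spectrum = fft([sample(n) for n in range(N)])
peaks = sorted(k for k, v in enumerate(spectrum) if abs(v) > N / 4)
print(peaks)
```

The real cosine and sine each produce a pair of mirrored bins (k and N - k), while the complex term cos - j·sin is a single negative-frequency exponential and therefore produces only one bin, at N - 211.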



# Outline

- Motivation and Introduction
- Energy-Efficient VLSI Processing
- Fast Fourier Transform Overview
- FFT Chip Architectures
- The Spiffee Processor
- Conclusion

#### Contributions

- FFT caching algorithm for high energy-efficiency
- Hierarchical-bitline SRAM and ROM memories for low-V<sub>t</sub> operation
- Design of a 1024-point, single-chip, full-custom, FFT processor
  - Fabricated and fully functional on first-pass silicon
  - 17 times more efficient than the previously most efficient known
  - Functional at 173 MHz @ 3.3 V

# ULPAcc

- 16-word x 24-bit dual-ported memory
- 24-bit accumulator
- On-chip controller and oscillator
- 11,700 transistors



# Srambb

- 128-word x 36-bit array
- On-chip controller, buffers, and oscillator
- 46,200 transistors



# Multbb

- 20-bit x 20-bit multiplier
- On-chip controller, buffers, and oscillator
- 28,500 transistors



### Other Projects and Publications

- Memory optimizing simulator
- MCM Test Chip
- Publications
  - B. M. Baas, "An Energy-Efficient Single-Chip FFT Processor," Proceedings of the 1996 IEEE Symposium on VLSI Circuits, Honolulu, HI, USA, 13-15 June 1996.
  - J. B. Burr, Z. Chen, B. M. Baas, "Stanford Ultra-Low-Power CMOS Technology and Applications," in Low-power HF Microelectronics, a Unified Approach. Stevenage, UK: The Institution of Electrical Engineers, 1996.
  - B. M. Baas, "An Energy-Efficient FFT Processor Architecture," StarLab Technical Report NGT-70340-1994-1, January 25, 1994.
  - B. M. Baas, "A Pipelined Memory System For an Interleaved Processor," StarLab Technical Report NSF-GF-1992-1, June 18, 1992.

#### Future Work

- Investigate multiple datapath/cache pair systems
- Investigate multiple processor systems
- Modify Spiffee to be usable in a system
- Possible commercialization

# Acknowledgements

- Parents and family
- Advisors and mentors
  - Prof. Len Tyler, Prof. Kunle Olukotun, Prof. Allen Peterson, Jim Burr, Masataka Matsui
- Other faculty
  - Prof. Don Cox, Prof. Thomas Cover, Prof. Teresa Meng
- Colleagues
  - Vjekoslav Svilan, Yenwen Lu, Gerard Yeh, Ely Tsern, Jim Burnham, Birdy Amrutur, Gu-Yeon Wei, Dan Weinlade, STARLab members
- Support
  - Michael Godfrey, Marli Williams, Doris Reed
  - NSF, NASA, MOSIS, AISES-GE, Texas Instruments, Sun