A 167-processor Computational Array for Highly-Efficient DSP and Embedded Application Processing

Dean Truong
Wayne Cheng
Tinoosh Mohsenin
Zhiyi Yu
Toney Jacobson
Gouri Landge
Michael Meeuwsen
Christine Watnik
Paul Mejia
Anh Tran
Jeremy Webb
Eric Work
Zhibin Xiao
Bevan Baas
VLSI Computation Laboratory
Department of Electrical and Computer Engineering
University of California, Davis


Applications that must compute complex DSP workloads are becoming increasingly commonplace. They are often composed of multiple DSP tasks and appear in areas such as wired and wireless communications, multimedia, sensor signal processing, and medical/biological processing. Many are embedded and strongly energy-constrained. In addition, many of these workloads require very high throughput, often dissipate a significant portion of the system power budget, and are therefore of considerable interest.

In contrast to general-purpose workloads, DSP workloads typically comprise a collection of DSP kernels that are numerically intensive, easily parallelizable, and do not require large data working sets or large programs.

One-time fabrication costs for state-of-the-art CMOS designs are several million dollars, and the total design cost of a modern chip can easily reach tens of millions of dollars. These costs are expected to continue rising. In this context, programmable and/or reconfigurable processors that are not tailored to a single application or a small class of applications become increasingly attractive.

The presented processing array computes the aforementioned complex DSP application workloads with high performance and high energy efficiency, and is well suited for implementation in future fabrication technologies.

The 164 programmable processors are able to dynamically and independently select their supply voltage from one of two power grids and to dynamically and independently tune their clock frequency. Changes can be made by a local configurable hardware controller, local software, or configuration.
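As a rough illustration of why per-processor voltage and frequency selection saves energy, the sketch below picks the lower of two supply grids whenever it can meet a required clock rate, and compares first-order dynamic power (proportional to V² · f). The voltage and frequency values, the `OperatingPoint` type, and the selection policy are all assumptions made for this example; they are not the chip's measured operating points or its controller algorithm.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class OperatingPoint:
    name: str
    vdd: float        # supply voltage in volts (illustrative value)
    f_max_mhz: float  # highest clock rate assumed usable at this voltage


# Two hypothetical supply grids; the numbers are placeholders, not measurements.
VDD_LOW = OperatingPoint("low", 0.9, 500.0)
VDD_HIGH = OperatingPoint("high", 1.3, 1000.0)


def choose_operating_point(required_mhz: float) -> Tuple[OperatingPoint, float]:
    """Pick the lower grid whenever it can sustain the required clock rate."""
    if required_mhz <= VDD_LOW.f_max_mhz:
        return VDD_LOW, required_mhz
    # Otherwise use the high grid, capped at its assumed maximum frequency.
    return VDD_HIGH, min(required_mhz, VDD_HIGH.f_max_mhz)


def relative_dynamic_power(point: OperatingPoint, f_mhz: float) -> float:
    """First-order CMOS dynamic power (V^2 * f), normalized to the
    high grid running at its maximum frequency."""
    return (point.vdd ** 2 * f_mhz) / (VDD_HIGH.vdd ** 2 * VDD_HIGH.f_max_mhz)


if __name__ == "__main__":
    for demand in (300.0, 800.0):
        point, f = choose_operating_point(demand)
        print(demand, point.name, round(relative_dynamic_power(point, f), 3))
```

Under these placeholder numbers, a lightly loaded processor on the low grid draws a small fraction of the full-speed dynamic power, which is the motivation for letting each processor choose independently.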

All 167 processors and 3 shared memories contain independent individual local oscillators that are able to halt, restart, or change frequency arbitrarily in a Globally Asynchronous Locally Synchronous (GALS) fashion. There are no PLLs, DLLs, clock crystals, or global clock signals. Individual oscillators fully halt (leakage power only) when there is no work to do, and restart in less than one cycle after work is available.
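The effect of halting an oscillator when no work is pending can be sketched with a simple tick-based model. This is a simplified illustration only: the function name, the one-item-per-cycle processing rate, and the instantaneous restart are assumptions (the text above states only that restart takes less than one cycle).

```python
def count_clock_cycles(work_arrivals, total_ticks, halts_when_idle):
    """Count oscillator cycles spent over `total_ticks`.

    `work_arrivals` maps a tick index to the number of work items that
    arrive at that tick; each item is assumed to take one cycle.
    """
    pending = 0
    cycles = 0
    for tick in range(total_ticks):
        pending += work_arrivals.get(tick, 0)
        if pending > 0:
            pending -= 1
            cycles += 1  # oscillator running and doing useful work
        elif not halts_when_idle:
            cycles += 1  # free-running clock toggles with nothing to do
        # else: oscillator is halted, so no cycle is spent on this tick
    return cycles


if __name__ == "__main__":
    arrivals = {0: 3, 10: 2}  # bursty workload: 5 items in 20 ticks
    print(count_clock_cycles(arrivals, 20, halts_when_idle=True))
    print(count_clock_cycles(arrivals, 20, halts_when_idle=False))
```

In this model the halting oscillator spends cycles only on the five work items, while a free-running clock toggles on every tick regardless of load.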

All processors are interconnected by a "double-link" reconfigurable mesh network. The novel network design allows links to be configured to pass data across the chip in dedicated channels without disturbing intermediate processors and without regard to their clock or voltage domains, with small circuit area. Configurable pipeline registers enable full-rate communication over long distances.
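A long-distance channel of this kind can be pictured as a list of per-tile link settings along a path through the mesh. The sketch below is hypothetical: the row-then-column path choice, the `pass_through` mode name, and the `hops_per_register` spacing for pipeline registers are assumptions for illustration, not the chip's actual configuration format.

```python
def long_distance_route(src, dst, hops_per_register=4):
    """Return (path, settings) for a dedicated channel from src to dst.

    `src` and `dst` are (row, col) tile coordinates. The path travels
    along the row first and then along the column (an assumed policy).
    Intermediate tiles only forward data; every `hops_per_register`-th
    hop enables a pipeline register so the link can run at full rate.
    """
    (r, c), (dst_r, dst_c) = src, dst
    path = [(r, c)]
    while c != dst_c:
        c += 1 if dst_c > c else -1
        path.append((r, c))
    while r != dst_r:
        r += 1 if dst_r > r else -1
        path.append((r, c))

    settings = []
    # Skip the source and destination tiles: only tiles in between
    # are configured to pass data through without processing it.
    for hop, tile in enumerate(path[1:-1], start=1):
        settings.append({
            "tile": tile,
            "mode": "pass_through",
            "pipeline_register": hop % hops_per_register == 0,
        })
    return path, settings


if __name__ == "__main__":
    path, settings = long_distance_route((0, 0), (2, 3))
    for s in settings:
        print(s)
```

The intermediate processors in `settings` never execute code on behalf of the channel, which mirrors the claim that data crosses the chip without disturbing them.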

The 65 nm chip comprises a 2-D array of processors containing:



Dean Truong, Wayne Cheng, Tinoosh Mohsenin, Zhiyi Yu, Toney Jacobson, Gouri Landge, Michael Meeuwsen, Christine Watnik, Paul Mejia, Anh Tran, Jeremy Webb, Eric Work, Zhibin Xiao, Bevan Baas. "A 167-processor Computational Array for Highly-Efficient DSP and Embedded Application Processing." In Proceedings of the IEEE HotChips Symposium on High-Performance Chips (HotChips 2008), August 2008.

BibTeX Entry

   author    = {Dean Truong and Wayne Cheng and Tinoosh Mohsenin and Zhiyi Yu
               and Toney Jacobson and Gouri Landge and Michael Meeuwsen
               and Christine Watnik and Paul Mejia and Anh Tran and Jeremy Webb
               and Eric Work and Zhibin Xiao and Bevan Baas},
   title     = {A 167-processor Computational Array for Highly-Efficient DSP and Embedded Application Processing},
   booktitle = {IEEE HotChips Symposium on High-Performance Chips
               (HotChips 2008)},
   month     = {Aug.},
   year      = {2008}



Last update: Sep. 07, 2010