## WP 5.6: A 210Mb/s Radix-4 Bit-level Pipelined Viterbi Decoder

Alfred K. Yeung, Jan M. Rabaey

Dept. of EECS, University of California, Berkeley, CA

In recent years, there has been tremendous interest in high-speed implementations of the Viterbi algorithm for applications such as digital sequence detection in high-speed magnetic disk drives and demodulation in wireless communication channels. The classical approach for a high-speed single-chip implementation is to use a parallel add-compare-select (ACS) unit for each state of the trellis. The major design challenges are the fundamental speed bottleneck imposed by the recursive ACS iteration, and the large area consumed by the parallel ACS units and the associated interconnect routing. Algorithmic techniques such as block processing can boost performance by trading hardware complexity for throughput, but the area penalty may jeopardize a single-chip solution.

In this paper, the design of a high-speed and compact parallel ACS unit using a combination of algorithmic, logic and circuit techniques is described. The results are demonstrated by a 16-state, R=1/2, 210Mb/s Viterbi decoder that outperforms the fastest single-chip implementation previously reported by a factor of 1.8 in terms of the throughput/area metric, using the same process technology [1].

Using redundant number representation and carry-propagation-free addition, the addition and the compare-select operations needed in the ACS unit can be executed bit-wise, starting from the most significant bit [2]. The speed of an ACS unit is improved using bit-level pipelining, which exploits an extra level of parallelism (Figure 1a, 1b). The pipelined unit is not only faster than a comparable word-parallel implementation, but its speed is also independent of word length. Because the carry propagates by 1b in redundant number addition, pipeline stages can only be inserted every two bits.
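
To illustrate why redundant arithmetic removes the carry chain, the following minimal sketch uses carry-save form, one common redundant number representation. This is an assumption made for illustration: the text does not say which redundant representation the chip uses, and the names `carry_save_add`, `value` and `WIDTH` are hypothetical. Each output bit depends only on the three input bits at the same position, and every carry moves exactly one position to the left, so nothing ripples across the word.

```python
WIDTH = 8  # illustrative word length (assumption)
MASK = (1 << WIDTH) - 1


def carry_save_add(s, c, b):
    """Add binary operand b to the redundant pair (s, c) without carry propagation."""
    new_s = (s ^ c ^ b) & MASK                            # per-bit sum
    new_c = (((s & c) | (s & b) | (c & b)) << 1) & MASK   # per-bit carry, moved 1b left
    return new_s, new_c


def value(s, c):
    """Conventional (non-redundant) value of the pair, modulo 2**WIDTH."""
    return (s + c) & MASK


# Accumulate three hypothetical branch metrics onto a state metric of 0.
s, c = 0, 0
for bm in (5, 9, 14):
    s, c = carry_save_add(s, c, bm)
print(value(s, c))  # 28
```

The pair `(s, c)` is never collapsed into a single binary word inside the loop; `value` is only used here to check the result. In hardware, that is what lets each bit position be evaluated without waiting for a carry from far lower positions.
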
By doubling the clock and retiming, two bit-slices can be folded onto one (Figure 1c). This results in a 50% reduction in hardware at the expense of a small speed degradation due to the multiplexors and the extra pipeline stage. Throughput of the ACS is further enhanced by applying one stage of look-ahead, i.e., by collapsing 2 iterations of a radix-2 trellis into 1 iteration of a radix-4 trellis [1]. The radix-4 ACS is about 5% slower than a radix-2 ACS of a similar design but processes two trellis steps per iteration, achieving an effective speed-up of 1.9.

Figure 2 is a block diagram of the ACS. Each column represents an ACS unit for an 8b-wide state metric. In each state, comparison starts from the MSB and propagates to the LSB, where the decision vector is encoded into 2 bits. Once a partial or final decision is made, the decision vector is set to override the comparisons in the lower-order bits. The normalization unit normalizes the state metric to prevent overflow by resetting its MSB when needed.

Figure 3 shows the design of a bit-slice of the radix-4 ACS unit. Due to the large capacitive loading on the state metric, the generation of the state metric limits the cycle time. The critical path is shortened by encoding the state metric so that the compare-select operation can be realized by a fast dynamic 4-input OR gate. Encoding also allows the state metric to be evaluated concurrently with the decision logic (instead of subsequently, as in a conventional design). This scheme, together with other high-speed techniques such as single-phase clocking, integration of logic with latches to reduce the gate count along the critical path, dynamic logic and compact layout, enables a 210MHz internal clock rate.

In parallel ACS design, wiring area represents a very significant overhead. The bit-slice organization of this ACS allows each bit of the state metric to be routed in a single routing channel, saving area by avoiding bus cornering.
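
To make the look-ahead recursion and the state-metric connectivity behind this wiring concrete, the sketch below is a word-level software model of a 16-state ACS. It is not the bit-level, MSB-first circuit described here. It assumes a shift-register state convention and a "smaller metric wins" rule, and the function names and random branch metrics are hypothetical. It checks that one radix-4 iteration gives the same state metrics as two radix-2 iterations, while producing one 2b decision per state.

```python
import random

N_STATES = 16  # 4 memory bits


def next_state(s, u):
    """Shift input bit u into the 4b state register (assumed state convention)."""
    return ((s << 1) | u) & (N_STATES - 1)


def radix2_step(metrics, bm):
    """One radix-2 ACS iteration; bm[p][u] is the metric of the branch leaving p on input u."""
    new = []
    for ns in range(N_STATES):
        u = ns & 1                                       # input bit that leads into ns
        preds = [(ns >> 1) | (b << 3) for b in (0, 1)]   # two one-step predecessors
        new.append(min(metrics[p] + bm[p][u] for p in preds))
    return new


def radix4_step(metrics, bm1, bm2):
    """One radix-4 ACS iteration spanning two trellis steps; returns metrics and 2b decisions."""
    new, decisions = [], []
    for ns in range(N_STATES):
        u1, u2 = (ns >> 1) & 1, ns & 1                   # the two input bits leading into ns
        cands = []
        for d in range(4):                               # four two-step predecessors
            q = (ns >> 2) | (d << 2)
            mid = next_state(q, u1)                      # intermediate state after step 1
            cands.append(metrics[q] + bm1[q][u1] + bm2[mid][u2])
        best = min(range(4), key=cands.__getitem__)
        new.append(cands[best])
        decisions.append(best)
    return new, decisions


rng = random.Random(0)
metrics = [rng.randrange(64) for _ in range(N_STATES)]
bm1 = [[rng.randrange(16) for _ in range(2)] for _ in range(N_STATES)]
bm2 = [[rng.randrange(16) for _ in range(2)] for _ in range(N_STATES)]

two_radix2 = radix2_step(radix2_step(metrics, bm1), bm2)
one_radix4, decisions = radix4_step(metrics, bm1, bm2)
print(one_radix4 == two_radix2)  # True
```

In this model every state reads the metrics of four predecessor states per iteration, which is the connectivity the state-metric routing channels have to carry.
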
A CAD program based on a simple channel-area estimator optimizes state placement, reducing channel area while bounding the worst-case wire length. Only 32% of the ACS array is used for routing state metrics, branch metrics and power, compared to the 42% wiring area achieved in Reference 3, where area efficiency is the prime goal.

The block diagram of the Viterbi decoder is shown in Figure 4. Data symbols are serialized into the branch metric unit (BMU), where the 16x5 branch metrics are calculated and fed into the ACS. The trace-back unit implements a survivor path length of 32 using three 16x2 stack memories that use shift registers for speed and control simplicity. In N-state radix-4 trace-back, the recursive loop contains an N:1 select operation and can limit the cycle time, especially for large N (Figure 5a). Using the retiming technique, the critical path can be reduced to a 4:1 select operation, independent of N (Figure 5b).

To ease system design, an on-chip clock doubler multiplies the external clock and drives a secondary clock driver located in the middle of the chip. Skew-balanced tertiary clock drivers in the logic blocks and datapath slices complete the clock distribution network. To reduce I/O switching noise and simplify the system interface, all I/O operations are parallelized to one-fourth of the internal clock rate. MOS bypass capacitors totalling 1.3nF are laid out under the metal-2 power supply lines throughout the chip to ensure the integrity of the power supply.

In self-test mode, the input register functions as a 24-tap LFSR to generate a pseudorandom pattern for chip testing. An on-chip VCO facilitates high-speed testing at variable clock frequencies.

The chip implements a 16-state decoder, with R=1/2, 8-level soft-decision inputs, and generator polynomials G1=10011 and G2=11101. Fabricated in a 1.2µm 2-metal CMOS technology, the 29.7mm² chip contains 75k transistors (Figure 6). The throughput and power dissipation vs.
supply voltage measured at 11°C are shown in Figure 7. Making elaborate use of dynamic logic, the chip dissipates almost 3W at 5V. Nonetheless, at a supply voltage of about 3.3V, power dissipation is similar to that of a mostly static design running at 5V [1], while achieving the same throughput.

Comparisons with several high-speed and/or area-efficient custom designs previously reported are listed in Table 1. This design is 50% faster than the 140Mb/s performance milestone of Reference 1. Assuming throughput can be improved by a linear trade-off in area using algorithmic techniques, this design is superior, having the highest throughput/area. Speed and area scale with the number of states. Speed decreases slightly because of increased wiring capacitance. The projected area of 90mm² at 64 states confirms the feasibility of a high-speed single-chip implementation of the commercially important 64-state Viterbi decoder in 1.2µm technology.

## Acknowledgment

This research was sponsored by ARPA as part of the Infopad multimedia project. The authors thank T. Meng for technical assistance.

## References

- [1] Black, P., T. Meng, "A 140Mb/s 32-State Radix-4 Viterbi Decoder," ISSCC Digest of Technical Papers, pp. 70-71, Feb. 1992.
- [2] Fettweis, G., H. Meyr, "High-Speed Parallel Viterbi Decoding Algorithm and VLSI-Architecture," IEEE Communications Magazine, May 1991.
- [3] Sparso, J., et al., "An Area-Efficient Topology for VLSI Implementation of Viterbi Decoders and Other Shuffle-Exchange Type Structures," IEEE J. Solid-State Circuits, Feb. 1991.

[Figure labels: normalization; state metric routing; branch metric routing; states S[0] to S[15]; 16 x 2 decision vector (to trace-back unit); ACS unit for 8-bit state metric.]

Figure 1: (a) 4b ACS unit using carry-propagation-free addition. (b) Bit-level pipelined ACS (clock doubling, retiming). (c) Bit-level pipelined, time-multiplexed ACS unit.
Figure 2: 16-state, radix-4 ACS unit block diagram.

Figure 3: Radix-4 ACS bit-slice circuit.

Figure 4: Block diagram of Viterbi decoder.

Figure 5: Radix-4 trace-back recursion.

Figure 6: Viterbi decoder micrograph (page 344).

Figure 7: Measured decode rate and power dissipation vs. supply voltage at 27°C.

| Design      | Tech (µm) | ACS Area (mm²) | Total Area (mm²) | Throughput (Mb/s) | Throughput/Area (ACS only / Total) |
|-------------|-----------|----------------|------------------|-------------------|------------------------------------|
| This Work   | 1.2       | 21.0           | 50.0             | 200               | 2.3 / 1.8                          |
| Radix-4 [1] | 1.2       | 33.6           | 62.0             | 140               | 1.0 / 1.0                          |
| Radix-2 [3] | 2.0       | 7.1            | na               | 50                | 1.7 / na                           |
| Qualcomm    | 1.5       | na             | 54.1             | 47                | na / 0.4                           |

All figures are normalized to 32-state, 1.2µm technology and typical operating conditions.

Table 1: Performance comparisons (page 344).
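
The last column of Table 1 appears to be throughput divided by area, expressed relative to the Radix-4 design of Reference 1 (whose entry is 1.0 / 1.0). That reading is inferred from the numbers and is not stated in the table. The short check below, with a hypothetical helper `relative_throughput_per_area`, reproduces the column from the area and throughput entries under that assumption.

```python
# (ACS area in mm^2, total area in mm^2, throughput in Mb/s) as listed in Table 1.
# None stands for an "na" entry.
designs = {
    "This Work":   (21.0, 50.0, 200),
    "Radix-4 [1]": (33.6, 62.0, 140),
    "Radix-2 [3]": (7.1, None, 50),
    "Qualcomm":    (None, 54.1, 47),
}


def relative_throughput_per_area(name, baseline="Radix-4 [1]"):
    """Return (ACS-only, total) throughput/area relative to the baseline design."""
    acs, total, rate = designs[name]
    ref_acs, ref_total, ref_rate = designs[baseline]
    by_acs = None if acs is None else round((rate / acs) / (ref_rate / ref_acs), 1)
    by_total = None if total is None else round((rate / total) / (ref_rate / ref_total), 1)
    return by_acs, by_total


for name in designs:
    print(name, relative_throughput_per_area(name))
```

Under this reading, the factor of 1.8 quoted in the text corresponds to the total-area ratio of this work against Reference 1.
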