2006 Workshop on On- and Off-Chip Interconnection Networks for Multicore Systems

Luca Benini, DEIS Università di Bologna, Design Automation for Networks-on-chip: Status and Outlook

Scalable Networks-on-Chip (NoCs) are needed to match the ever-increasing communication demands of large-scale Multi-Processor System-on-Chip (MPSoC) platforms for high-end embedded applications. The heterogeneous nature of on-chip cores and the energy-efficiency requirements typical of embedded computing call for application-specific NoCs, which eliminate much of the overhead associated with general-purpose communication architectures. However, application-specific NoCs must be supported by adequate design flows to reduce design time and effort.

We survey the main challenges in application-specific NoC design and outline a complete NoC design flow. Experimental results demonstrate that it is indeed possible to generate an application-specific NoC from a high-level specification in a few hours. Comparison with hand-tuned solutions shows that the automatically generated ones are very competitive in area, performance, and power. Even though early results are promising, much work is still needed to develop NoC design flows that can handle the design implementation challenges posed by nanometer CMOS technology.

Ivo Bolsens, Xilinx

Shekhar Borkar, Intel, Networks for Multi-core Chip—A Controversial View

When you think of networks for multi-core chips, traditional high-dimensional networks, such as rings, meshes, and crossbars, come to mind. Are these the right choices? This talk will evaluate some of these options and propose a research agenda.

Bill Dally, Stanford, Future Directions for On-Chip Interconnection Networks

On-chip interconnection networks have different constraints than off-chip networks. They are implemented with technology which gives very different channel characteristics and network element costs than their off-chip counterparts. The optimal design of an on-chip network is very sensitive to the properties of network elements (channels, buffers, and switches). Aggressive circuit design of these elements can realize large improvements in energy and performance. This enabling circuit technology changes the optimal network design and should be regarded as a prerequisite for network design studies. This talk discusses circuit and architecture issues in on-chip interconnection network design and proposes a research agenda.

Chita Das, Penn State, Exploring NoC Design Space for Multicore Architectures

Integration of multiple cores on the same chip has signaled the beginning of communication-centric, rather than computation-centric, systems. Further, technology trends have accentuated the importance of interconnect-conscious design, as global wire delays do not scale down as fast as gate delays in new technologies. Consequently, on-chip interconnects, also known as Network-on-Chip (NoC) architectures, are predicted to be a major bottleneck in designing embedded System-on-Chip (SoC) architectures and high-performance multicore architectures alike. However, unlike traditional multiprocessor interconnects, the design of scalable, high-performance NoCs poses a whole set of new challenges in terms of on-chip area budget, energy/thermal efficiency, and reliability constraints. In this talk, we will summarize our research effort in designing NoC architectures encompassing performance, scalability, power, thermal, and reliability issues. In particular, we will discuss the design of two novel router architectures, a dynamic buffer management scheme, and a 3D router architecture. The talk will conclude with pointers to our ongoing and future research directions.

José Duato, Universidad Politécnica de Valencia, On-Chip Networks: Do We Need More Research?

In this talk we discuss how previous research results for off-chip networks can be incorporated into on-chip network designs, analyzing whether there is a need for new research on this topic. A case study for on-chip networks in embedded systems is presented, showing that while many previously proposed concepts and solutions can be reused, new challenges also exist. Even when breakthrough solutions may not be required, this case study shows that the new design constraints recommend significant changes in the way those solutions are implemented.

Ron Ho, Sun, Interconnection Technologies for Large-Scale Multiprocessors

Continued scaling of silicon technologies has enabled single-die multiprocessors of moderate complexity, with chips integrating eight cores commercially available today. Pushing further to tens of cores requires either growing die sizes beyond acceptable cost and yield limits, or using multiple chips and connecting them with power-hungry and bandwidth-limited links. This talk will discuss two interconnect technologies for multi-core systems that enable integration of hundreds of cores. Proximity IO uses chip-to-chip capacitive coupling to give a grid of multiple chips the same bandwidth, latency, and power characteristics as if they were a single, larger, virtual die. To reduce the energy costs of on-chip communication, we also use on-chip capacitive coupling to drive data on low-swing wires across chips. Together, these technologies enable system exploration of a wide variety of interconnected computers. I will describe both technologies, show results from a silicon test chip, and discuss their limitations.

Mark Horowitz, Stanford, Scaling, Power and the Future of CMOS

In the mid-1980s, the power growth that accompanied scaling forced the industry to focus on CMOS technology and leave nMOS and bipolars for niche applications. Now, 20 years later, CMOS technology is facing power issues of its own. A review of the "cause" of the problem makes clear that there are no easy solutions this time: no new technology or simple system/circuit change will rescue us. Power, and not the number of devices, is now the primary limiter of chip performance, and the need to create power-efficient designs is changing how we do design. This talk will review power-optimized design methods and show how power is strongly tied to performance and that variability adversely affects power efficiency. Projecting forward, it shows that unless die size shrinks, in future technologies most of the devices will need to be idle most of the time, which has strong ramifications for both the underlying device and system design.

Manolis Katevenis, University of Crete, Towards Light-Weight Intra-CMP Network Interfaces

This talk will present opinions on future Network Interfaces (NIs) for high-speed communication, and our research plans in this area. Processor-to-network interfaces are the next-to-be-removed system bottleneck. Challenges include low latency, high throughput even for small messages, high flexibility, and low cost. In chip multiprocessors (CMPs), the NIs between processor cores and the network-on-chip (NoC) must be small when compared to the processor itself and the local memory that it connects to (e.g. L1 cache). Consequently, (a) the NI must not require dedicated memory of its own, but rather it must dynamically share a portion of local memory; and (b) sending/receiving information (enqueue/dequeue/RDMA) must be as fast as reading/writing a few words in L1 cache.

A powerful and simplifying architecture is to integrate the network interface with the cache controller. Send/enqueue resembles cache block flush/replace, or write-update protocols, or writing into non-cacheable address space with write-combine. Receive/dequeue resembles, and benefits from, cache block prefetching. Support for synchronization primitives can be provided by enqueue/dequeue operations optionally triggering new events or packet generation; these resemble cache coherence protocol actions.

Steve Keckler, The University of Texas, Micronetwork-based Processor Microarchitectures

While substantial research in NoCs has focused on interconnects for chip-multiprocessors, network technology also provides opportunities for scalable processor and memory system architectures. Microarchitectural networks, or micronets, are lightweight networks that are integrated tightly into a processor core, replacing common control and data buses. In this talk, I will describe the micronetworks that we designed and implemented in the TRIPS processor, a 130nm ASIC that is constructed from distributed and replicated processor elements (tiles). In particular, I will discuss the operand network used for data communication in the TRIPS core and how we were able to exploit the specific requirements of the processor execution model to implement single-cycle per hop operand delivery across a 5x5 array of execution, register, and data cache tiles. I will also describe the distributed processor protocols enabled by both data and control micronetworks and reflect on our experience of building a micronetwork-based processor.

Partha Kundu, Intel, On-Die Interconnects for Next Generation CMPs

With ever-increasing transistor counts enabled by silicon technology, many computationally intensive problems are within reach of a single chip or small form-factor platforms. One such problem relates to solving the current data explosion problem for end-users. Single-chip multi-processors (CMPs) will in the future enable algorithms that, in real time, aid in a) recognition of multi-modal objects, b) classification and categorization of unstructured data, and c) synthesis of complex physical objects as replicas of their actual counterparts. We examine commonly used kernels in such applications to understand the on-die interconnect's requirements for future CMPs. We find that many of these applications, although constrained by off-die bandwidth, can benefit from good caching solutions. We observe that a high-performance on-die interconnect plays a key role in architecting such caching solutions.

Using an example 2D-mesh network topology, the talk discusses the design issues encountered in architecting such an interconnect. While previous research related to improving network throughput—specifically switch allocation, buffer management and flow control—may be leveraged and adapted effectively for on-die networks, we conclude that the significant challenge is one of managing power and improving the overall energy efficiency of the network.

Robert Mullins, Cambridge, Communication-Centric Design

As architects focus on coarse-grained parallelism as the primary route to boosting performance, chip-wide communication infrastructures will become central to realising real system-level performance gains. This focus on communication is reinforced by technology scaling that accelerates local computations but has little impact on the delay of longer interconnects. In this talk I will describe our work on low-latency on-chip routers and our approaches to clocking on-chip networks. Many of these ideas have been evaluated as part of our latest test chip "Lochside". To conclude, I will speculate on where the current convergence in VLSI platform architectures may lead us.

Li-Shiuan Peh, Princeton, Low-power Interconnection Networks

Systems from microprocessors to supercomputers, from embedded systems-on-a-chip to Internet routers, are becoming increasingly interconnected, relying on network fabrics to scale up. With networks taking up a substantial portion of a system's limited power budget, it is now critical to explore low-power interconnection networks. In this talk, I'll briefly survey my group's research in low-power networks, in both design tools and network architectures, then zoom into our work on network thermal modeling and management. I'll round off the talk with a brief discussion of our ongoing research thrust towards "network-driven architectures" for CMPs, which explores the embedding of global coordination functions such as coherence directories within the network fabric, leveraging the inherent scalability of networks for future many-core chips.

Michael Taylor, UCSD, Scalar Operand Networks for Tiled Microprocessors

Scalar Operand Networks ("SONs") are a class of network optimized for the transport of operands among remote ALUs and memories. These sub-nanosecond networks, the central communication mechanism inside microprocessors, are perhaps the fastest class of network currently known to man, and are the lowest-latency apparatus for executing programs that have parallelism but are dependence-heavy.

This talk examines how scalable forms of SONs can allow microprocessors to scale to 100s or 1000s of functional units. The key is to organize chip resources as an array of small tiles, which are interconnected by a scalable, point-to-point, pipelined SON. This allows the frequency of these systems ("tiled microprocessors") to remain high, while the quantity of exposed on-chip resources (e.g., ALUs) remains linear with die area. Scalable SONs offer extremely low latency and occupancy communication, on the order of a few cycles, versus thirty or so for conventional multicore processors. This low cost expands the set of applications that can be parallelized and enables compilers to exploit fine-grained parallelism. This talk discusses the scalable SON we designed for the 16-issue MIT Raw 180 nm VLSI prototype, and some of the efforts we have made to characterize SON properties in general.

Drew Wingard, Sonics, Intelligent Interconnects for Multicore SoCs

Many performance SoCs have adopted multiple processor strategies to meet the system requirements in the "convergence" era. Heterogeneous processor architectures have emerged as the most popular architecture for many embedded systems. We will describe a common interconnect architecture for such designs, and highlight key interconnect fabric performance characteristics and intelligent network services that should be provided by the interconnect. We also compare the requirements of several SoC applications against the benefits of the intelligent interconnect approach.

6-7 December 2006, Stanford, California