Abstract

The demise of Moore's Law and Dennard scaling has resulted in diminishing performance gains for general-purpose processors, and so has prompted a surge in academic and commercial interest in hardware accelerators. Specialized hardware has already redefined the computing landscape by enabling the emergence of disruptive, large-scale applications that would not have been possible with CPUs alone. RTL simulators play a key role in enabling the accelerated computing revolution: they are to hardware engineers what debuggers and runtime systems are to software engineers. Without RTL simulators, no hardware accelerator could be functionally designed. As accelerators increase in size and complexity, the hardware design industry will increasingly need faster RTL simulators to permit chip design in reasonable time. Since the advent of multicore computers, parallelism has been the preferred approach to improving software performance. RTL simulation seems to offer many opportunities to follow such a path: accelerators are written in hardware description languages that contain parallel constructs for describing independent hardware components that run in parallel and synchronize only at clock edges. Unfortunately, there is a mismatch between RTL simulation and today's multicore systems: tasks in RTL simulation tend to be very small, resulting in fine-grain parallelism. This fine-grain parallelism contrasts with the coarse-grain parallel workloads for which modern multicore systems are built, which leads to simulator designs that can achieve only weak parallel performance scaling. This thesis argues that we need computing architectures that can achieve strong scaling to truly speed up RTL simulation through parallelism. A strong scaling architecture is one that can make effective use of additional cores without having to increase the total workload size. This enables even small or moderate-size designs to exploit parallelism to run quickly.
This thesis contributes Manticore, a co-designed manycore architecture and compiler for RTL simulation that achieves strong parallel performance scaling. Manticore combines a bulk-synchronous parallel execution model with static scheduling to eliminate the runtime overheads of synchronization among hundreds of cores, simplify core design, and significantly increase the parallelism possible on a single chip. Our modest FPGA prototype of Manticore greatly increases the parallel RTL simulation rate compared to a state-of-the-art software simulator running on top-of-the-line desktop and server x86 processors. The ideas underlying Manticore's design present a first step towards fast, scale-out RTL simulation.
