Submitted to the 31st Annual ACM/IEEE International Symposium on Microarchitecture (MICRO-31)
A Comparison of VLIW and Traditional DSP
Architectures for Compiled Code
Mazen A. R. Saghir, Paul Chow, and Corinna G. Lee
Department of Electrical and Computer Engineering
University of Toronto
Abstract
Although programmable digital signal processors comprise a significant fraction of the processors sold in the world, their basic architectures have changed little since they were originally developed. The evolution and implementation of these processors have been based more on commonly held beliefs than on quantitative data. In this paper, we show that by changing to a VLIW model with more registers, orthogonal instructions, and better flexibility for instruction-level parallelism, it is possible to achieve at least a factor of 1.3–2 in performance gain over the traditional DSP architectures on a suite of DSP benchmarks. When accounting for the effect of restrictive register use in traditional DSP architectures, we argue that the actual performance gain is at least a factor of 1.8–2.8. To counter an argument about extra chip area, we show that the cost of adding more registers is minimal when the overall area of the processor and the performance benefits are considered. Although a VLIW architecture has a much lower instruction density, we also show that the average number of instructions is actually reduced because there are fewer memory operations. A significant contribution to the better performance of the VLIW architecture is the ability to express more instances of parallelism than the restricted parallelism of the more traditional architectures. However, efficient techniques for encoding long instructions are required to make the higher flexibility and better performance of VLIW architectures feasible.
1. Introduction
Digital signal processors (DSPs) are specialized microprocessors optimized to execute the computationally-intensive operations commonly found in the inner loops of digital signal processing algorithms. DSPs are typically designed to achieve high performance and low cost, and are therefore used extensively in embedded systems. Some of the architectural features commonly found in DSPs include fast multiply-accumulate hardware, separate address units, multiple data-memory banks, specialized addressing modes, and support for low-overhead looping. Another common feature of DSP architectures is the use of tightly-encoded instruction sets. For example, using 16- to 48-bit instruction words, a common DSP instruction can specify up to five parallel operations: a multiply-accumulate, two pointer updates, and two data moves.
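The five parallel operations mentioned above correspond to the body of a fixed-point FIR filter inner loop. The following C sketch (illustrative; not tied to any particular DSP's ISA, and the names are our own) marks where each of the five operations occurs per iteration:

```c
#include <stdint.h>

/* Fixed-point FIR inner loop. On a traditional DSP, a single
 * tightly-encoded instruction per iteration can cover all five
 * operations marked in the comments below. */
int32_t fir(const int16_t *x, const int16_t *h, int n)
{
    int32_t acc = 0;                  /* accumulator                */
    for (int i = 0; i < n; i++) {
        int16_t xi = *x++;            /* data move + pointer update */
        int16_t hi = *h++;            /* data move + pointer update */
        acc += (int32_t)xi * hi;      /* multiply-accumulate        */
    }
    return acc;
}
```

The loop itself would map onto a DSP's low-overhead (zero-delay) looping hardware rather than an explicit branch.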
Tightly-encoded instructions that could specify five operations in a single instruction were initially used to improve code density with the belief that this also reduced instruction memory requirements, and hence cost. Tightly-encoded instructions were also used to reduce instruction memory bandwidth, which was of particular concern when packaging was not very advanced. As a result of this encoding style, DSP instruction-set architectures (ISAs) are not very well suited for automatic code generation by modern, high-level language (HLL) compilers. For example, most operations are accumulator based, and very few registers can be specified in a DSP instruction. Although this reduces the number of bits required for encoding instructions, it also makes it more difficult for the compiler to generate efficient code. Furthermore, DSP instructions can only specify limited instances of parallelism that are commonly used in the inner loops of DSP algorithms. As a result, DSPs cannot exploit parallelism beyond what is supported by the ISA, and this can degrade performance significantly.
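One concrete way the accumulator-based, few-register style limits parallelism is that a reduction must flow through a single serial accumulation chain. The sketch below (a hypothetical example, not taken from any specific compiler's output) splits a dot product across two independent accumulators; with enough orthogonal registers, the two chains can proceed in parallel on separate functional units, whereas an accumulator-based ISA forces one serial chain:

```c
#include <stdint.h>

/* Dot product with two independent accumulator chains. Requires
 * two live accumulator registers, which an accumulator-based DSP
 * ISA typically cannot name. */
int32_t dot2(const int16_t *x, const int16_t *h, int n)
{
    int32_t acc0 = 0, acc1 = 0;
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        acc0 += (int32_t)x[i]     * h[i];      /* chain 0 */
        acc1 += (int32_t)x[i + 1] * h[i + 1];  /* chain 1 */
    }
    if (i < n)                                 /* odd leftover */
        acc0 += (int32_t)x[i] * h[i];
    return acc0 + acc1;
}
```

The transformation doubles the instruction-level parallelism of the loop body at the cost of one extra register, which is exactly the trade-off the restricted DSP encodings cannot express.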
For the early generations of programmable DSPs, tightly-encoded instruction sets were an acceptable solution. At that time, DSPs were mainly used to implement simple algorithms that were relatively easy to code and optimize in assembly language. Furthermore, the compiler technology of that time was not advanced enough to generate code with the same efficiency and compactness as that of a human programmer. However, as DSP and multimedia applications become larger and more sophisticated, and as design and development cycles grow shorter, it is no longer feasible to program DSPs in assembly language. Furthermore, current compiler technology has advanced to the point where it can generate very efficient and compact code. For these reasons, DSP manufacturers are currently seeking alternative instruction set architectures. One such alternative, used in new media processors and DSPs such as the Texas Instruments TMS320C6x [1], is the VLIW architecture.
In this paper, we compare the performance and cost of a VLIW architecture to those achieved with more traditional DSP architectures. In Section 2, we examine several architectural styles that may be used to
implement embedded DSPs, and we compare their advantages and disadvantages to VLIW architectures. In Section 3, we describe the methodology used to model two commercial DSPs, and we compare their performance and cost to those of our model VLIW architecture. In Section 4, we present the results of this comparison. Finally, in Section 5, we summarize the main points of this paper and present our conclusions.
2. Architectural Choices
Among the features that are desirable to have in an embedded DSP architecture is the ability to support the exploitation of parallelism. This is especially important with DSP applications since they exhibit high levels of parallelism. Another desirable feature is that the architecture be an easy target for HLL compilers. This enables the processor to be programmed in a HLL, which reduces code development time, increases its portability, and improves its maintainability. It also makes it easier for a modern compiler to optimize the generated code and improve execution performance. Finally, to meet the low cost requirements of embedded DSP systems, an architecture must occupy a small die area, and must use functional and control units having low complexity. In this section, we describe some of the architectural alternatives that can be used to implement embedded DSPs, and we discuss their advantages and disadvantages.
2.1. Superscalar Architectures
Superscalar architectures are the current style used to implement general-purpose processors. They exploit parallelism in the stream of instructions fetched from memory by issuing multiple instructions per cycle to the datapath. As instructions are fetched from memory, they are temporarily stored in an instruction buffer where they are examined by a complex control unit. The control unit determines the interdependencies between these instructions, as well as the dependencies with instructions already executing. Once these dependencies are resolved, and hardware resources become available, the control unit issues as many instructions in parallel as possible for execution on the datapath. In older superscalar architectures, instructions could only be issued in parallel if they happened to be in the correct static order inside the instruction buffer. Since instructions could only be issued in the same order they were fetched, these architectures were said to exploit static parallelism. On the other hand, more recent superscalar architectures are able to exploit parallelism among any group of instructions in the buffer. As such, they are said to exploit dynamic parallelism. This is achieved, in part, by using such techniques as dynamic register renaming or branch prediction, which eliminate false dependencies between instructions and increase the ability to issue instructions in parallel [2]. Instructions in the buffer can therefore be issued out of the order in which they were fetched, and the control unit is responsible for ensuring that the state of the processor is modified in the same order the instructions are fetched.
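The register renaming mentioned above can be sketched very compactly: each architectural-register write is assigned a fresh physical register, so a later write (a WAW hazard) or a pending earlier read (a WAR hazard) no longer conflicts with it. The following C fragment is a minimal illustration under our own simplifying assumptions (eight architectural registers, an unbounded physical register file, no free-list reclamation):

```c
#define NUM_ARCH 8

/* Minimal register-rename table: maps architectural registers to
 * physical registers, allocating a fresh physical register on
 * every write. */
typedef struct {
    int map[NUM_ARCH];   /* architectural -> physical mapping */
    int next_phys;       /* next unallocated physical register */
} rename_table;

void rt_init(rename_table *rt)
{
    for (int r = 0; r < NUM_ARCH; r++)
        rt->map[r] = r;              /* identity mapping at reset */
    rt->next_phys = NUM_ARCH;
}

/* A read uses whatever mapping is current. */
int rt_read(const rename_table *rt, int arch)
{
    return rt->map[arch];
}

/* A write allocates a fresh physical register, breaking any
 * false (WAR/WAW) dependency on the old one. */
int rt_write(rename_table *rt, int arch)
{
    rt->map[arch] = rt->next_phys++;
    return rt->map[arch];
}
```

Two back-to-back writes to the same architectural register receive distinct physical registers, so the hardware may execute them (and the instructions that read each version) out of order while still committing results in fetch order.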
The compiler technology associated with superscalar architectures is also very advanced and well understood. In addition to generating efficient code by using machine-independent, scalar optimizations [3], contemporary optimizing compilers are able to exploit the underlying features of their target architectures. Typically, they apply machine-dependent optimizations such as instruction scheduling, which minimizes pipeline latencies and helps expose parallel instructions to the hardware; data prefetching, which exploits the memory hierarchy; register renaming, which attempts to statically remove false dependencies between instructions; and branch prediction, which attempts to minimize branch penalties by predicting execution control paths.
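Of these machine-dependent optimizations, instruction scheduling is the one most directly visible at the source level. As a sketch (our own example, under the assumption of an in-order pipeline with a one-cycle load-use penalty), a scheduler can hoist the next iteration's loads above the current iteration's multiply-accumulate so the loads' latency overlaps useful work:

```c
#include <stdint.h>

/* Dot product restructured the way a load-use scheduler would:
 * the loads for iteration i+1 are issued before the multiply-
 * accumulate for iteration i, hiding load latency. The result is
 * identical to the naive loop. */
int32_t dot_scheduled(const int16_t *x, const int16_t *h, int n)
{
    if (n <= 0)
        return 0;
    int32_t acc = 0;
    int16_t xi = x[0], hi = h[0];         /* prologue: first loads   */
    for (int i = 1; i < n; i++) {
        int16_t xn = x[i], hn = h[i];     /* next loads issued early */
        acc += (int32_t)xi * hi;          /* use previous values     */
        xi = xn;
        hi = hn;
    }
    acc += (int32_t)xi * hi;              /* epilogue: last product  */
    return acc;
}
```

Compilers perform this reordering on the intermediate representation rather than the source, but the dependence-driven transformation is the same.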
Given their ability to exploit parallelism dynamically, and their sophisticated compilers, superscalar architectures are capable of exploiting the high levels of parallelism found in DSP applications. However, as available parallelism increases, and the number of functional units and registers increases to support it, the complexity and cost of their control units also increase. Given the stringent low-cost requirements of most embedded DSP systems, superscalar architectures may therefore not be the most cost-effective architectural choice.
2.2. SIMD Architectures
With the recent growth in multimedia applications for desktop computer systems, many general-purpose processor vendors have introduced multimedia extensions to their instruction set architectures [4]. Typically, these extensions take the form of special, single-instruction, multiple-data (SIMD) instructions that perform identical, parallel operations on a fixed number of packed operands. The operands are typically 8-bit, 16-bit, or 32-bit integers that are packed into 32-bit or 64-bit registers. That is why these SIMD instructions are sometimes called packed arithmetic instructions, or are said to exploit subword parallelism. SIMD instructions enhance the execution performance of vectorizable loops that perform homogeneous operations on arrays of data. Since such loops are very common in multimedia and DSP applications, SIMD architectures can be used for these applications. Moreover, because a single SIMD instruction replaces several scalar instructions, programs that use SIMD instructions achieve higher levels of code density and require less instruction memory. The amount of hardware needed to support SIMD instructions is also minimal, and consists mainly of the additional control circuitry that enables an ALU to perform multiple slices of the same operation on individual subwords of its input operands. Currently, SIMD architectures are difficult targets for HLL compilers, and, like most commercial DSPs, must be programmed in assembly language for their full benefits to be achieved. Since high-level l...