Submitted to the 31st Annual ACM/IEEE International Symposium on Microarchitecture (MICRO-31)

A Comparison of VLIW and Traditional DSP
Architectures for Compiled Code
Mazen A. R. Saghir, Paul Chow, and Corinna G. Lee
Department of Electrical and Computer Engineering
University of Toronto

Abstract
Although programmable digital signal processors comprise a significant fraction of the processors sold in the world, their basic architectures have changed little since they were originally developed. The evolution and implementation of these processors have been based more on commonly held beliefs than on quantitative data. In this paper, we show that by changing to a VLIW model with more registers, orthogonal instructions, and greater flexibility for instruction-level parallelism, it is possible to achieve a performance gain of at least a factor of 1.3–2 over traditional DSP architectures on a suite of DSP benchmarks. When accounting for the effect of restrictive register use in traditional DSP architectures, we argue that the actual performance gain is at least a factor of 1.8–2.8. To counter the argument that more registers require extra chip area, we show that the cost of adding registers is minimal when the overall area of the processor and the performance benefits are considered. Although a VLIW architecture has a much lower instruction density, we also show that the average number of instructions is actually reduced because there are fewer memory operations. A significant contribution to the better performance of the VLIW architecture is its ability to express more instances of parallelism than the restricted parallelism of the more traditional architectures. However, efficient techniques for encoding long instructions are required to make the higher flexibility and better performance of VLIW architectures feasible.



1. Introduction
Digital signal processors (DSPs) are specialized microprocessors optimized to execute the computationally-intensive operations commonly found in the inner loops of digital signal processing algorithms. DSPs are typically designed to achieve high performance and low cost, and are therefore used extensively in embedded systems. Some of the architectural features commonly found in DSPs include fast multiply-accumulate hardware, separate address units, multiple data-memory banks, specialized addressing modes, and support for low-overhead looping. Another common feature of DSP architectures is the use of tightly-encoded instruction sets. For example, using 16- to 48-bit instruction words, a common DSP instruction can specify up to five parallel operations: a multiply-accumulate, two pointer updates, and two data moves.
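As a concrete illustration (ours, not drawn from the paper), the C inner loop below is the kind of kernel such an instruction targets: each iteration performs one multiply-accumulate, two data loads, and two pointer updates, which are exactly the five operations a traditional DSP can pack into a single instruction word.

    /* Illustrative FIR inner loop: one multiply-accumulate, two loads,
       and two pointer updates per iteration -- the five operations a
       tightly-encoded DSP instruction can issue in a single word. */
    float fir(const float *x, const float *h, int ntaps)
    {
        float acc = 0.0f;
        for (int i = 0; i < ntaps; i++)
            acc += *x++ * *h++;   /* MAC; x and h are post-incremented */
        return acc;
    }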

Tightly-encoded instructions were initially used to improve code density, in the belief that this also reduced instruction memory requirements, and hence cost. They were also used to reduce instruction memory bandwidth, which was of particular concern when packaging technology was less advanced. As a result of this encoding style, however, DSP instruction-set architectures (ISAs) are not well suited for automatic code generation by modern, high-level language (HLL) compilers. For example, most operations are accumulator-based, and very few registers can be specified in a DSP instruction. Although this reduces the number of bits required to encode instructions, it also makes it more difficult for the compiler to generate efficient code. Furthermore, DSP instructions can only specify the limited instances of parallelism that are commonly used in the inner loops of DSP algorithms. As a result, DSPs cannot exploit parallelism beyond what the ISA supports, and this can degrade performance significantly.
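As a small, hypothetical illustration of the register problem, consider compiling the expression below for an accumulator-based machine: with a single accumulator, the first product must be spilled to memory before the second multiply can proceed, whereas a machine with a larger, orthogonal register file keeps both products in registers and needs no memory traffic for the intermediate.

    /* Hypothetical compilation of y = a*b + c*d.
       Accumulator-based DSP (one accumulator):
           acc = a * b        ; first product
           tmp = acc          ; spill the intermediate to memory
           acc = c * d        ; second product overwrites acc
           acc = acc + tmp    ; reload and add
       Register-rich machine:
           r1 = a * b
           r2 = c * d         ; independent; may issue in parallel
           r3 = r1 + r2                                            */
    int sum_of_products(int a, int b, int c, int d)
    {
        return a * b + c * d;
    }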

For the early generations of programmable DSPs, tightly-encoded instruction sets were an acceptable solution. At that time, DSPs were mainly used to implement simple algorithms that were relatively easy to code and optimize in assembly language. Furthermore, the compiler technology of the time was not advanced enough to generate code with the same efficiency and compactness as that of a human programmer. However, as DSP and multimedia applications become larger and more sophisticated, and as design and development cycles grow shorter, programming DSPs in assembly language is no longer feasible. Moreover, compiler technology has since advanced to the point where it can generate very efficient and compact code. For these reasons, DSP manufacturers are currently seeking alternative instruction-set architectures. One such alternative, used in new media processors and DSPs such as the Texas Instruments TMS320C6x [1], is the VLIW architecture.

In this paper, we compare the performance and cost of a VLIW architecture with those achieved by more traditional DSP architectures. In Section 2, we examine several architectural styles that may be used to implement embedded DSPs, and we compare their advantages and disadvantages with those of VLIW architectures. In Section 3, we describe the methodology used to model two commercial DSPs, and we compare their performance and cost to those of our model VLIW architecture. In Section 4, we present the results of this comparison. Finally, in Section 5, we summarize the main points of this paper and present our conclusions.

2. Architectural Choices
Among the features that are desirable in an embedded DSP architecture is the ability to support the exploitation of parallelism. This is especially important for DSP applications since they exhibit high levels of parallelism. Another desirable feature is that the architecture be an easy target for HLL compilers. This enables the processor to be programmed in an HLL, which reduces code development time and improves code portability and maintainability. It also makes it easier for a modern compiler to optimize the generated code and improve execution performance. Finally, to meet the low cost requirements of embedded DSP systems, an architecture must occupy a small die area and must use functional and control units of low complexity. In this section, we describe some of the architectural alternatives that can be used to implement embedded DSPs, and we discuss their advantages and disadvantages.

2.1. Superscalar Architectures

Superscalar architectures are the current style used to implement general-purpose processors. They exploit parallelism in the stream of instructions fetched from memory by issuing multiple instructions per cycle to the datapath. As instructions are fetched from memory, they are temporarily stored in an instruction buffer, where they are examined by a complex control unit. The control unit determines the dependencies among these instructions, as well as their dependencies with instructions already executing. Once these dependencies are resolved and hardware resources become available, the control unit issues as many instructions as possible in parallel for execution on the datapath. In older superscalar architectures, instructions could only be issued in parallel if they happened to be in the correct static order inside the instruction buffer. Since instructions could only be issued in the order they were fetched, these architectures were said to exploit static parallelism. More recent superscalar architectures, on the other hand, are able to exploit parallelism among any group of instructions in the buffer, and are therefore said to exploit dynamic parallelism. This is achieved, in part, by using techniques such as dynamic register renaming, which eliminates false dependencies between instructions, and branch prediction, which increases the ability to issue instructions in parallel [2]. Instructions in the buffer can therefore be issued out of the order in which they were fetched, and the control unit is responsible for ensuring that the state of the processor is modified in the same order the instructions were fetched.
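The C fragment below (our sketch, not the paper's) shows the kind of false dependency that renaming removes: the second assignment to t creates an output (write-after-write) dependency on the first, so without renaming the two statement pairs must execute serially, while with renaming each write of t receives its own physical register and the pairs can issue in parallel.

    /* Sketch of a false (output) dependency removed by register renaming. */
    void false_dep(int a, int b, int c, int d, int *x, int *y)
    {
        int t;
        t = a + b;    /* first write of t                      */
        *x = t * 2;   /* true dependency on the first write    */
        t = c + d;    /* second write: a WAW hazard on t only  */
        *y = t * 3;   /* true dependency on the second write   */
        /* With renaming, the two (write, use) pairs map to different
           physical registers and can execute concurrently.    */
    }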


The compiler technology associated with superscalar architectures is also very advanced and well understood. In addition to generating efficient code by using machine-independent, scalar optimizations [3], contemporary optimizing compilers are able to exploit the underlying features of their target architectures. Typically, they apply machine-dependent optimizations such as instruction scheduling, which minimizes pipeline latencies and helps expose parallel instructions to the hardware; data prefetching, which exploits the memory hierarchy; register renaming, which attempts to statically remove false dependencies between instructions; and branch prediction, which attempts to minimize branch penalties by predicting execution control paths.
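To show what instruction scheduling buys, the hand-scheduled C loop below (an illustration under our own assumptions, not code from the paper) issues the load for the next iteration before the multiply-accumulate that consumes the previously loaded value, hiding the load latency that a naive ordering would expose.

    /* Dot product with the next iteration's load hoisted above the
       multiply-accumulate, mimicking what an instruction scheduler
       does to hide load latency. Assumes n >= 1. */
    float dot(const float *x, const float *h, int n)
    {
        float sum = 0.0f;
        float xi = x[0];
        for (int i = 0; i < n - 1; i++) {
            float xnext = x[i + 1];  /* next iteration's load, issued early */
            sum += xi * h[i];        /* uses the value loaded last time     */
            xi = xnext;
        }
        sum += xi * h[n - 1];        /* final element                       */
        return sum;
    }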

Given their ability to exploit parallelism dynamically, and their sophisticated compilers, superscalar architectures are capable of exploiting the high levels of parallelism found in DSP applications. However, as the available parallelism increases, and as the number of functional units and registers grows to support it, the complexity and cost of the control unit also increase. Given the stringent low-cost requirements of most embedded DSP systems, superscalar architectures may therefore not be the most cost-effective architectural choice.

2.2. SIMD Architectures

With the recent growth in multimedia applications for desktop computer systems, many general-purpose processor vendors have introduced multimedia extensions to their instruction-set architectures [4]. Typically, these extensions take the form of special single-instruction, multiple-data (SIMD) instructions that perform identical, parallel operations on a fixed number of packed operands. The operands are typically 8-bit, 16-bit, or 32-bit integers that are packed into 32-bit or 64-bit registers. That is why these SIMD instructions are sometimes called packed arithmetic instructions, or are said to exploit subword parallelism. SIMD instructions enhance the execution performance of vectorizable loops that perform homogeneous operations on arrays of data, and since such loops are very common in multimedia and DSP applications, SIMD architectures are well suited to these applications. Moreover, since a single SIMD instruction encodes several scalar operations, programs that use SIMD instructions achieve higher levels of code density and require less instruction memory. The amount of hardware needed to support SIMD instructions is also minimal, consisting mainly of the additional control circuitry that enables an ALU to perform multiple slices of the same operation on the individual subwords of its input operands. Currently, however, SIMD architectures are difficult targets for HLL compilers, and, like most commercial DSPs, must be programmed in assembly language for their full benefits to be achieved. Since high-level l...
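A minimal sketch of subword parallelism, written in portable C rather than with any vendor's SIMD extension: the function below adds four packed 8-bit lanes inside a 32-bit word with wrap-around semantics by masking off the inter-lane carry bits and patching the top bit of each lane separately.

    #include <stdint.h>

    /* Packed add of four unsigned 8-bit lanes in a 32-bit word
       (wrapping, no saturation): classic SWAR subword arithmetic. */
    uint32_t padd_u8x4(uint32_t a, uint32_t b)
    {
        uint32_t low = (a & 0x7f7f7f7fu) + (b & 0x7f7f7f7fu); /* add without
                                                                 lane carries */
        return low ^ ((a ^ b) & 0x80808080u);                 /* restore each
                                                                 lane's MSB   */
    }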
