Fixed-point configurable hardware components for adaptive filters

To reduce the gap between VLSI technology capability and designer productivity, design reuse based on IP (intellectual property) is commonly used. In terms of arithmetic accuracy, the generated architecture can generally only be configured through the input and output word-lengths. In this paper, a new kind of fixed-point arithmetic IP is presented through the LMS and delayed-LMS examples. The operator and memory word-lengths are optimized under an accuracy constraint defined by the user. To significantly reduce the optimization and design times, the architecture parameter determination is based on an analytical approach.


I. INTRODUCTION

Advances in VLSI technology offer the opportunity to integrate hardware accelerators and heterogeneous processors in a single chip (System on Chip) or to obtain FPGAs with several million gate equivalents. Thus, complex signal processing applications can now be implemented in embedded systems. Time-to-market pressure requires a shorter system development time, and thus high-level design tools are needed. To reduce the gap between hardware complexity and designer productivity, design reuse [1] based on IP (Intellectual Property) has to be used.

To reduce cost and power consumption, fixed-point arithmetic is required. For an efficient hardware implementation, the chip size and power consumption have to be minimized. Thus, the goal of the hardware implementation is to minimize the operator word-lengths as long as the desired accuracy constraint is respected.

From an arithmetic point of view, the available IP blocks are limited. The IP user can only configure the input and output word-lengths and sometimes the word-lengths of some specific operators. The link between the application performance and the data word-lengths is not immediate. Moreover, the fixed-point design search space cannot be explored easily with this approach. Thus, the IP user must convert the application into fixed-point. But manual fixed-point conversion is a tedious, time-consuming and error-prone task.

In this paper, a new kind of IP is presented through the LMS (Least Mean Square) and Delayed-LMS (DLMS) examples. These IP blocks are configurable according to an accuracy constraint influencing the algorithm quality. The IP user specifies the accuracy constraint and the operator word-lengths are automatically optimized. The optimal operator word-lengths, which minimize the architecture cost and respect the accuracy constraint, must be searched for. The accuracy constraint can [...] parallelism level K which allows the throughput constraint to be respected, the architecture execution time is evaluated as explained in Section III-C. Once the different operator word-lengths and the parallelism level are defined, the VHDL code representing the LMS or DLMS architecture at the RTL level is generated.

[...] underlined with several experiments in Section IV.

II. LMS/DLMS ALGORITHM AND ARCHITECTURE

A. LMS and DLMS algorithms

The aim of adaptive filters is to estimate a sequence of scalars from an observation sequence filtered by a system whose coefficients vary. These coefficients converge towards the optimum coefficients, which minimize the mean square error (MSE) between the filtered observation signal and the desired sequence. This type of filter is used in different fields such as noise cancellation, equalization, linear prediction and channel estimation. The LMS-based algorithms are the most commonly used because their implementation in embedded systems is simpler than that of the RLS algorithm. The LMS adaptive algorithm, presented in Figure 1.a, estimates a sequence of scalars $y_n$ from a sequence of N-length vectors $\mathbf{x}_n$ [3]. The linear estimate of $y_n$ is $\mathbf{w}_n^t \mathbf{x}_n$, where $\mathbf{w}_n$ is an N-length weight vector which converges to the optimal vector $\mathbf{w}_{opt}$. The vector $\mathbf{w}_n$ is updated according to the following equation

$$\mathbf{w}_{n+1} = \mathbf{w}_n + \mu\, \mathbf{x}_{n-D}\, e_{n-D} \quad \text{with} \quad e_n = y_n - \mathbf{w}_n^t \mathbf{x}_n \qquad (1)$$

where $\mu$ is a positive constant representing the adaptation step. The delay D is zero for the LMS algorithm and non-zero for the Delayed-LMS.

B. Generic LMS architecture

The generic architecture for the LMS/DLMS algorithm is presented in Figure 1.b. The architecture is made up of a filter part and an adaptation part which computes the new coefficient values. To satisfy the throughput constraint, the filter part and the adaptation part can be parallelized. For the filter part, K multiplications are used in parallel, and for the adaptation part K MAD (Multiply-Add) patterns are used in parallel. The cycle-time corresponds to the maximum value between the multiplier and adder latencies.

The filter part is divided into several pipeline stages. The first stage corresponds to the multiply operation. To add the different multiplication results, an adder based on a tree structure is used. This tree is made up of $\lceil \log_2(K) \rceil$ levels. This global addition execution is pipelined. Let $L_{ADD}$ be the number of additions which can be executed in one cycle-time. Thus, the number of pipelined stages for the global addition is given by

$$M_{ADD} = \left\lceil \frac{\lceil \log_2(K) \rceil}{L_{ADD}} \right\rceil \quad \text{with} \quad L_{ADD} = \left\lfloor \frac{T_{cycle}}{t_{ADD}} \right\rfloor \qquad (2)$$

where $t_{ADD}$ is the adder latency. The last pipelined stage for the filter part corresponds to the final accumulation. The adaptation part is divided into three pipelined stages. The first one is for the subtraction. The second stage corresponds to the multiplication, and the final addition composes the last stage. The timing constraint management is detailed in Section III-C.

Fig. 1. LMS/DLMS algorithm and generic architecture for the IP

III. FIXED-POINT OPTIMIZATION

The IP generation methodology is presented in Figure 2. The first stage of the methodology corresponds to the data dynamic range determination. In linear time-invariant systems, analytical approaches [5] can be used. But for the LMS algorithm, none of these methods is applicable. So a floating-point simulation is made to evaluate the dynamic range from the input data. Then, the binary-point position is deduced from the dynamic range to ensure that all data values can be coded without overflow. The third stage is the data word-length [...]

Fig. 2. Methodology for the fixed-point IP generation

A. Computation Accuracy Evaluation

To evaluate the accuracy, analytical methods are preferred to simulation-based methods, which lead to long simulation times. An analytical expression of the SQNR of the LMS algorithm is computed [...]

[...] where $\alpha_n$ is the noise associated with the term [...] and [...] depends on the way the filter is computed. The error in finite precision is given by

[...] (5)

with $\eta_n$ the global noise in the inner product $\mathbf{w}_n^t \mathbf{x}_n$. This global noise is the sum of each multiplication output noise and the output accumulation noise. Moreover, a new term $\rho_n$ is introduced

[...]

$\rho_n$ is the N-length error vector due to the quantization effects on coefficients. This noise cannot be considered as the noise due to the quantization of a signal. The mean of each term is represented by $m$ whereas $\sigma^2$ represents its variance; both can be determined as explained in [7].

b) Noise power expression: The study is made at steady state, once the filter coefficients have converged. The noise is measured at the filter output. The power of the error between the filter output in finite precision and in infinite precision is determined. It is composed of three terms.

[...]

At steady state, the vector $\mathbf{w}_n$ can be approximated by the optimum vector $\mathbf{w}_{opt}$. So the term $E(\alpha_n^t \mathbf{w}_n)^2$ is equal to $|\mathbf{w}_{opt}|^2 (m_\alpha^2 + \sigma_\alpha^2)$ with $|\mathbf{w}_{opt}|^2 = \sum_i w_{opt,i}^2$.

The second term is detailed in [6] and is equal to

[...]

The last term $E(\eta_n^2)$ depends on the specific implementation chosen for the filter output computation (filtered data).

B. Architecture Cost Evaluation

The IP processing unit is based on a collection of operators extracted from a library. This library contains the arithmetic operators, the registers and the multiplexers for the different possible word-lengths. Each library element is automatically generated and characterized in terms of area and energy consumption from scripts for the Synopsys tools.

The IP architecture area and energy consumption are obtained from the sum of the area and energy consumption of the different basic elements. The elements correspond to the memory (coefficients $\mathbf{w}_n$ and input data $\mathbf{x}_n$), the operators (multiplier, adder, subtracter), the registers and the multiplexers used inside the datapath.

C. Throughput constraint

The system must satisfy a given constraint to ensure real-time execution. The LMS architecture presented in Figure 1 and detailed in Section II-B is divided into two parts, the filter part and the adaptation part. The execution time of the filter part is obtained with the following expression

$$T_{FIR} = \frac{N}{K} T_{cycle} + M_{ADD} T_{cycle} + T_{cycle} \qquad (10)$$

The execution time of the adaptation part is given by

$$T_{Adapt} = T_{cycle} + \frac{N}{K} T_{cycle} + T_{cycle} \qquad (11)$$

The system throughput constraint depends on the chosen algorithm. For the LMS algorithm, the sampling period $T_e$ must satisfy the following expression

$$T_{FIR} + T_{Adapt} < T_e \qquad (12)$$

Even if the Delayed-LMS algorithm has a slower convergence speed than the LMS algorithm, since the error is delayed, the filter part and the adaptation part can be computed in parallel, which gives it a higher execution frequency. The constraints become

$$T_{FIR} < T_e \quad \text{and} \quad T_{Adapt} < T_e \qquad (13)$$

The parallelism level is obtained by solving expressions (12) and (13). These expressions require the knowledge of the operator latency, which depends on the operator word-lengths. Thus, firstly, the operator word-lengths are optimized with K equal to 1. The obtained operator word-lengths make it possible to determine the operator latency. Secondly, K is computed from the throughput constraint, and then the operator word-lengths are optimized with the real value of K.

IV. EXPERIMENTS

The LMS and DLMS IP blocks have been used for different experiments to underline the necessity of optimizing the operator and memory word-lengths under an accuracy constraint. The IP users have to supply the reference and the input signal. For the architecture generation, the throughput constraint $T_e$ and the accuracy constraint $SQNR_{min}$ must be defined.

The LMS and DLMS IP have been tested for different values of the throughput constraint $T_e$ and the accuracy constraint $SQNR_{min}$. For each $T_e$ and $SQNR_{min}$ value, the operator and memory word-lengths are optimized under the accuracy constraint. Then, the architecture is generated. The architecture area, the parallelism level and the energy consumption are measured and the results are presented respectively in Figure 3.a, 3.b and 3.c. The operator library has been generated from the 0.18 µm technology from ST Microelectronics. The results are presented for a timing constraint between 60 ns and 170 ns and for an accuracy constraint between 30 dB and 90 dB.

Fig. 3. Experiment results: architecture area, energy consumption and parallelism level for different values of the accuracy and timing constraints

The architecture area increases when the timing constraint decreases. Indeed, to respect this constraint, the parallelism level K must be higher. More operators are needed and thus the processing unit area is increased. The architecture costs (area, energy consumption) increase with the accuracy constraint. High values of the accuracy constraint require operators and data with a greater word-length. This increase in operator word-length raises the energy consumption and the area of the processing and memory units. Moreover, it increases the operator latency. Thus, to respect the timing constraint, the parallelism level K must be higher and the processing unit area is increased.

Our results have been compared to a classical solution based on 16 x 16 → 32-bit multiplications and 32-bit additions. This solution leads to an SQNR of 52 dB. The cost has been evaluated for the classical approach and for our optimized approach, for an accuracy constraint of 52 dB and with different timing constraints. The results are presented in Figure 3.d. For the same computation accuracy, our approach reduces the architecture area by 30 % and the energy consumption by 23 %.

V. CONCLUSION

In this paper, a new kind of fixed-point arithmetic IP has been proposed. The LMS/DLMS IP blocks have been detailed. A generic architecture has been proposed to adapt the parallelism level according to the timing and the computation accuracy constraints. To speed up the operator word-length optimization, the cost, the accuracy and the throughput constraints are evaluated analytically. The results underline the need to optimize the operator and memory word-lengths. Compared to a classical approach, for the same computation accuracy, the architecture area and the energy consumption are reduced respectively by 30 % and 23 %. With our approach, the user can optimize the trade-off between the architecture cost, the accuracy and the execution time. Accuracy models have been defined for other specific applications like NLMS and APA [10]. Moreover, an automatic and generic floating-to-fixed-point conversion methodology is under development [9].

REFERENCES

[1] M. Keating and P. Bricaud. Reuse Methodology Manual. Kluwer Academic Publishers, 3rd edition, 2000.
[2] D. Menard, M. Guitton, S. Pillement, and O. Sentieys. Design and Implementation of WCDMA Platforms: Challenges and Trade-offs. In Proceedings of the International Signal Processing Conference (ISPC 03), April 2003.
[3] S. Haykin. Adaptive Filter Theory. Prentice Hall, Englewood Cliffs, 2nd edition, 1991.
[4] C. Caraiscos and B. Liu, "A Roundoff Error Analysis of the LMS Adaptive Algorithm", IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-32, no. 1, February 1984.
[5] R. Kearfott. Interval Computations: Introduction, Uses, and Resources. Euromath Bulletin, 2(1):95-112, 1996.
[6] [...] of Fixed-Point LMS algorithm", Proceedings of the IEEE ICASSP 2004, [...]
[7] G. Constantinides, P. Cheung and W. Luk, "Truncation Noise in Fixed-Point SFGs", IEE Electronics Letters, 35(23):2012-2014, November 1999.
[8] D. Katsushige, N. Kiyoshi and K. Hitoshi, "Pipelined LMS Adaptive Filter Using a New Look-Ahead Transformation", IEEE Transactions on Circuits and Systems, vol. 46, January 1999.
[9] N. Herve, D. Menard, and O. Sentieys. Data wordlength optimization for FPGA synthesis. In Proceedings of the IEEE International Workshop on Signal Processing Systems, SIPS'05, Athens, Greece, November 2005.
[10] R. Rocher, D. Menard, O. Sentieys and P. Scalart, "Accuracy Evaluation of Fixed-Point APA algorithm", Proceedings of the IEEE ICASSP 2005, [...]
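
As an illustration of the coefficient update of equation (1) and of the throughput constraints (10) to (13), the following Python sketch may help. It is a minimal sketch under stated assumptions, not the generated IP or the analytical accuracy model of the paper. The names `quantize`, `dlms`, `sqnr_db` and `min_parallelism` are hypothetical. Quantization is modelled as rounding to a uniform grid without saturation, the SQNR is measured by simulation against a floating-point run, operator latencies are taken as fixed inputs rather than derived from word-lengths, and N/K is rounded up to an integer number of cycles.

```python
import math
import numpy as np


def quantize(v, frac_bits):
    """Round v to a grid of step 2**-frac_bits (assumed model, no saturation).
    frac_bits=None keeps the floating-point value."""
    if frac_bits is None:
        return v
    step = 2.0 ** -frac_bits
    return np.round(v / step) * step


def dlms(x, y, n_taps, mu, delay=0, frac_bits=None):
    """Filter output for the update of equation (1).
    delay=0 gives the LMS, delay>0 the Delayed-LMS."""
    xp = np.concatenate([np.zeros(n_taps - 1), quantize(x, frac_bits)])
    w = np.zeros(n_taps)
    e = np.zeros(len(x))
    out = np.zeros(len(x))
    for n in range(len(x)):
        x_n = xp[n:n + n_taps][::-1]            # x_n = [x[n], ..., x[n-N+1]]
        out[n] = quantize(w @ x_n, frac_bits)   # linear estimate w_n^t x_n
        e[n] = y[n] - out[n]
        if n >= delay:                          # update uses x_{n-D} and e_{n-D}
            x_d = xp[n - delay:n - delay + n_taps][::-1]
            w = quantize(w + mu * x_d * e[n - delay], frac_bits)
    return out


def sqnr_db(ref, fixed):
    """Simulated SQNR over the second half of the run (steady state)."""
    half = len(ref) // 2
    noise = fixed[half:] - ref[half:]
    return 10.0 * math.log10(np.mean(ref[half:] ** 2) / np.mean(noise ** 2))


def min_parallelism(n_taps, t_e, t_mult, t_add, delayed=False):
    """Smallest K meeting (12) for the LMS or (13) for the DLMS, else None."""
    t_cycle = max(t_mult, t_add)                # cycle-time, Section II-B
    l_add = int(t_cycle // t_add)               # additions per cycle
    for k in range(1, n_taps + 1):
        m_add = math.ceil(math.ceil(math.log2(k)) / l_add)   # equation (2)
        n_over_k = math.ceil(n_taps / k)        # assumption: whole cycles
        t_fir = (n_over_k + m_add + 1) * t_cycle             # equation (10)
        t_adapt = (1 + n_over_k + 1) * t_cycle               # equation (11)
        if delayed:
            ok = t_fir < t_e and t_adapt < t_e               # equation (13)
        else:
            ok = t_fir + t_adapt < t_e                       # equation (12)
        if ok:
            return k
    return None


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.uniform(-0.5, 0.5, 4000)
    h = np.array([0.5, -0.25, 0.125, 0.0625])   # hypothetical unknown system
    y = np.convolve(x, h)[:len(x)]
    ref = dlms(x, y, 4, 0.1, delay=2)
    for bits in (8, 12, 16):
        fixed = dlms(x, y, 4, 0.1, delay=2, frac_bits=bits)
        print(bits, "fractional bits:", round(sqnr_db(ref, fixed), 1), "dB")
    print("K (LMS): ", min_parallelism(32, 100.0, 4.0, 2.0))
    print("K (DLMS):", min_parallelism(32, 100.0, 4.0, 2.0, delayed=True))
```

The script sweeps a single word-length for all data, whereas the methodology described here optimizes each operator and memory word-length separately and re-evaluates K once the latencies are known. The sketch only shows how the accuracy and the parallelism level react to those two inputs.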