Highly Optimized 2-D Convolvers (R) in FLEX Devices February 1997, ver. 1 Introduction Application Note 82 The FLEX(R) 10K and FLEX 8000 device families can efficiently implement two-dimensional convolvers, which are useful for many image processing applications such as filtering, sharpening, edge detection, and interpolation. A convolver processes information in small pieces, such as a 3 x 3 convolution window. Each element in the convolution window is multiplied by a constant coefficient, and then the products are summed to obtain the desired result. This application note describes how to create more efficient convolvers in FLEX 10K and FLEX 8000 designs. f For more information on two-dimensional convolvers, see the following documents: Conventions Application Note 73 (Implementing FIR Filters in FLEX Devices) Product Information Bulletin 21 (Implementing Logic with the Embedded Array in FLEX 10K Devices) For this application note, the 9 locations in a 3 x 3 convolution window are numbered for reference, as shown in Figure 1. The subscript indicates which tap or coefficient is being referenced. The variable Tn refers to the tap data value at location n (e.g., T2 refers to the tap value at location 2). Figure 1. Tap & Coefficient Reference Locations Altera Corporation A-AN-082-01 1 2 3 4 5 6 7 8 9 1 AN 82: Highly Optimized 2-D Convolvers in FLEX Devices Conventional Convolver Figure 2 shows the coefficient window for a sample convolver design. Figure 2. Coefficient Window Row 1 1.5 0.375 -0.625 Row 2 0.375 0.125 -0.125 Row 3 1.25 -0.125 0.250 Figure 3 shows a block diagram of a conventional convolver using the coefficients shown in Figure 2. In this convolver, the coefficients remain constant and the tap values change every clock cycle as new data is shifted into the convolver. 2 Altera Corporation AN 82: Highly Optimized 2-D Convolvers in FLEX Devices Figure 3. Conventional Convolver Block Diagram Row 1 8 Row 2 8 Row 3 8 8 1.25 -0.125 16 16 T8 T9 0.25 8 8 8 8 T7 T4 0.375 16 0.125 16 16 8 8 T5 -0.125 T6 1.5 16 16 17 17 17 18 8 T1 8 T2 0.375 16 -0.625 T3 16 17 18 19 20 Y The output of each register is called a tap (Tn). Each tap is multiplied by the corresponding coefficient (Cn) from Figure 2, and then the products are summed to produce the final result (Y). The equation for the convolver in Figure 3 is shown below: 9 Y = T n Cn 1 Altera Corporation 3 AN 82: Highly Optimized 2-D Convolvers in FLEX Devices Building the conventional two-dimensional convolver for maximum speed requires 9 multipliers and an adder tree consisting of 8 separate adders. Assuming the data and coefficients are 8 bits wide and the coefficients are constant, each multiplier would require 36 logic elements (LEs) and each adder would require one LE per bit, for a total of 539 LEs for the entire design. Highly Optimized Convolver You can improve the speed and logic efficiency of the conventional convolver by taking advantage of the numeric relationships between the coefficients. Although the convolution window in Figure 2 is asymmetric, all of the coefficients are integer multiples of 0.125. By rewriting the coefficient window and substituting K = 0.125, you obtain the window shown in Figure 4. Figure 4. Modified Coefficient Window 12 3 -5 3 1 -1 10 -1 2 x K In this example, you can take advantage of the fact that all the factored coefficients are integers. Multiplying the taps by the coefficients 12, 3, 5, 1, and 10 is simple because all the integers are the sum of powers of 2. For example, multiplying tap T1 by 12 is the same as adding 8T1 and 4T1 (i.e., 12T1 = 8T1 + 4T1). Because 8 and 4 are both powers of 2, multiplication can be avoided. Instead, T1 can be shifted left by 3 bits and 2 bits, respectively. Therefore, 12T1 can be computed with a single adder that is only 10 bits wide, assuming T1 is an 8-bit number. For example, if T1 = 11111111B, you can calculate 12T1 as shown in Figure 5. Figure 5. Multiplying 12 x T1 11111111000 + 1111111100 101111110100 = 8 x T1 + 4 x T1 12 x T1 Because the 2 least significant bits (LSBs) of 8T1 and 4T1 are always 0, the 2 extra bits in the adder are not required. Therefore, this calculation requires only 10 LEs. 4 Altera Corporation AN 82: Highly Optimized 2-D Convolvers in FLEX Devices Similarly, 3, 5, and 10 can each be computed with a single adder. Multiplying or dividing by any power of 2 is a shift left or right and does not require any LEs. However, LEs are required to add the product into the final result. To further improve the logic efficiency of this design, you can take advantage of the "sparseness" of the coeffiecients (i.e., the relatively few 1s in the coefficients). The basic method is to add or subtract various multiples of 2 of the taps to obtain the desired multiplication factor--e.g., add together all taps that have a 1 in the most significant bit (MSB). Table 1 lists each tap's coefficient and its binary representation. Numbers in parentheses are negative. Table 1. Binary Representation of the 9 Coefficients Coefficient C1 C2 C3 C4 C5 C6 C7 C8 C9 Weight Decimal Binary 12 1100 3 0011 -5 (0101) 3 0011 1 0001 -1 (0001) 10 1010 -1 (0001) 2 0010 To perform these calculations efficiently in parallel, begin by adding taps with a 1 in the MSB and work your way down to the LSB. In this example, the MSB of C1 and C7 is 1. Therefore, you should calculate an intermediate result by adding C1 and C7, remembering to shift the product left by 3 bits before adding it to the final result. Similarly, the (MSB - 1) bit of C1 and C3 is 1. In this case, compute (C1 - C3), remembering to shift the product left by 2 bits before adding it to the final result. Repeat this procedure for the next 2 bits of the weight, i.e., the (MSB - 2) bit and the LSB. Next, add all of the products to produce the final result. Last, multiply the final result by K. Because 0.125 is a power of 2, no multiplication is necessary--the final result is simply shifted right by 3 bits. Altera Corporation 5 AN 82: Highly Optimized 2-D Convolvers in FLEX Devices The following steps show mathematically how to use this method to create a highly optimized convolver design, or any other multiply and accumulate (MAC) design. All numbers are in binary. Y = T1C1 + T2C2 + T3C3 + T4C4 + T5C5 + T6C6 + T7C7 + T8C8 + T9C9 1. Substitute the binary values of the coefficients: Y = (T1 x 1100) + (T2 x 0011) - (T3 x 0101) + (T4 x 0011) + (T5 x 0001) - (T6 x 0001) + (T7 x 1010) - (T8 x 0001) + (T9 x 0010) 2. Factor out 8 (1000B): Y = [(T1 + T7) x 1000] + (T1 x 100) + (T2 x 011) - (T3 x 101) + (T4 x 011) + (T5 x 001) - (T6 x 001) + (T7 x 010) - (T8 x 001) + (T9 x 010) 3. Factor out 4 (100B): Y = [(T1 + T7) x 1000] + [(T1 - T3) x 100] + (T1 x 00) + (T2 x 11) - (T3 x 01) + (T4 x 11) + (T5 x 01) - (T6 x 01) + (T7 x 10) - (T8 x 01) + (T9 x 10) 4. Factor out 2 (10B): Y = [(T1 + T7) x 1000] + [(T1 - T3) x 100] + [(T2 + T4 + T7 + T9) x 10] + (T1 x 0) + (T2 x 1) - (T3 x 1) + (T4 x 1) + (T5 x 1) - (T6 x 1) + (T7 x 0) - (T8 x 1) + (T9 x 0) 5. Add the remaining non-zero terms: Y = [(T1 + T7) x 1000] + [(T1 - T3) x 100] + [(T2 + T4 + T7 + T9) x 10] + [(T2 - T3 + T4 + T5 - T6 - T8)] x 1 6. Group the terms by significance and perform more factoring: Y = [(T1 + T7) x 10] + [(T1 - T3) x 1] x 10 + [(T2 + T4 + T7 + T9) x 1 x 10] + [(T2 - T3 + T4 + T5 - T6 - T8) x 1] 6 Altera Corporation AN 82: Highly Optimized 2-D Convolvers in FLEX Devices Figure 6 shows a block diagram of the multiply and accumulate section of the optimized convolver. Figure 6. Highly Optimized Convolver Block Diagram Shift Left by 3 Bits Tap Position 1 Shift Left by 1 Bit Shift Left by 2 Bits 7 1 3 2 No Shift 4 7 9 4 5 2 6 3 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 + + + - + + + + + + + - + + 9 9 Shift Left by 1 Bit + 9 9 + + 11 Shift Left by 1 Bit + 9 9 9 + + 12 + + + 11 13 Shift Left by 1 Bit - + + 15 Y The highly optimized fixed-coefficient convolver uses 215 LEs, as opposed to the 529 LEs required by the conventional convolver shown in Figure 3--a savings of 60%. The highly optimized design, written in the Altera Hardware Description Language (AHDLTM), is shown in Figure 7. This pipelined design will run at over 90 MHz in a FLEX 8000A device with a -2 speed grade, a speed that is the equivalent of 810 MIPS for only 214 LEs. 1 Altera Corporation This design--named fstconv--can be downloaded from the Altera world-wide web site at http://www.altera.com. 7 AN 82: Highly Optimized 2-D Convolvers in FLEX Devices Figure 7. fstconv.tdf File (Part 1 of 2) SUBDESIGN fstconv ( in[3..1][8..1] out[15..1] clk ) VARIABLE row1[3..1][8..1] row2[3..1][8..1] row3[3..1][8..1] : INPUT; : OUTPUT; : INPUT; : DFF; -- Row 1 of shift register : DFF; -- Row 2 of shift register : DFF; -- Row 3 of shift register inter_sh3[9..1] : inter_sh2[9..1] : inter_sh1_t[2..1][9..1] : inter_sh1[10..1] : inter_sh0_t1[3..1][9..1] : inter_sh0_t2a[10..1] : inter_sh0_t2b[9..1] : inter_sh0[11..1] : inter_3_2[11..1] : inter_3_2_1[13..1] : inter_3_2_1_0[15..1] : BEGIN DFF; DFF; DFF; DFF; DFF; DFF; DFF; DFF; DFF; DFF; DFF; ------ Intermediate shift by 3 node Intermediate shift by 2 node Shift by 1 node More shift by 1 node Shift by 0 -- Shift by 3 + shift by 2 -- inter_3_2 + shift by 1 -- inter_3_2_1 + shift by 0 -- Connect the clock to all the internal registers (inter_sh3[].clk, inter_sh2[].clk, inter_sh1_t[][].clk, inter_sh1[].clk, inter_sh0_t1[][].clk, inter_sh0_t2a[].clk, inter_sh0_t2b[].clk, inter_sh0[11..1].clk, inter_3_2[].clk, inter_3_2_1[].clk inter_3_2_1_0[].clk) = clk; -- Connect the shift register: -- Data will shift from left to right, which will be 1 to 3 row1[1][] = in[1][]; row2[1][] = in[2][]; row3[1][] = in[3][]; row1[3..2][] = row1[2..1][]; row2[3..2][] = row2[2..1][]; row3[3..2][] = row3[2..1][]; row1[][].clk = clk; row2[][].clk = clk; row3[][].clk = clk; 8 Altera Corporation AN 82: Highly Optimized 2-D Convolvers in FLEX Devices Figure 7. fstconv.tdf (Part 2 of 2) -- The shift by 3 data inter_sh3[] = (row1[1][8], row1[1][]) + (row3[1][8], row3[1][]); -- The shift by 2 data inter_sh2[] = (row1[1][8], row1[1][]) - (row1[3][8], row1[3][]); -- The shift by 1 data inter_sh1_t[1][] = (row1[2][8], row1[2][]) + (row2[1][8], row2[1][]); -- 2 + 4 inter_sh1_t[2][] = (row3[1][8], row3[1][]) + (row3[3][8], row3[3][]); -- 7 + 9 inter_sh1[] = (inter_sh1_t[1][9], inter_sh1_t[1][]) + (inter_sh1_t[2][9], inter_sh1_t[2][]); -- The shift by zero data inter_sh0_t1[1][] = (row1[2][8], row1[2][]) + (row2[1][8], row2[1][]); inter_sh0_t1[2][] = (row2[2][8], row2[2][]) (row2[3][8], row2[3][]); inter_sh0_t1[3][] = (row1[3][8], row1[3][]) + (row3[2][8], row3[2][]); inter_sh0_t2a[] = (inter_sh0_t1[1][9], inter_sh0_t1[1][]) + (inter_sh0_t1[2][9], inter_sh0_t1[2][]); inter_sh0_t2b[] = inter_sh0_t1[3][]; inter_sh0[] = (inter_sh0_t2a[10], inter_sh0_t2a[]) (inter_sh0_t2b[9], inter_sh0_t2b[9], inter_sh0_t2b[]); -- Shift by 3 + shift by 2 inter_3_2[] = (inter_sh3[9], inter_sh3[], gnd) + (inter_sh2[9], inter_sh2[9], inter_sh2[]); -- Shift by 3 + shift by 2 + shift by 1 inter_3_2_1[] = (inter_3_2[11], inter_3_2[], gnd) + (inter_sh1[10], inter_sh1[10], inter_sh1[10], inter_sh1[]); -- Shift by 3 + 2 + 1 + 0 inter_3_2_1_0[] = (inter_3_2_1[13], inter_3_2_1[], gnd) + (inter_sh0[11], inter_sh0[11], inter_sh0[11], inter_sh0[11], inter_sh0[]); -- Output out[] = END; Altera Corporation inter_3_2_1_0[]; 9 AN 82: Highly Optimized 2-D Convolvers in FLEX Devices Conclusion 10 This application note shows how to improve the logic efficiency of a convolver design when the coefficients can be factored and are sparse. To reduce the amount of logic required for your convolver design, use the following tips: If possible, factor out a common coefficient, so that a single final multiplication can be performed at the end (if the factor is a multiple of 2, multiplication is not required). If possible, set all the coefficients to a power of 2 or to the sum of a few powers of 2. Logic is not required to multiply a tap by the coefficients 1, 2, 4, and 8--the taps are just shifted left by 0, 1, 2, or 3 bits, respectively. Likewise, the numbers 3, 5, 6, 9, 10, and 12 can be calculated with a single adder. The other numbers below 16 will require more logic, but will still be more efficient than using the conventional method. Altera Corporation AN 82: Highly Optimized 2-D Convolvers in FLEX Devices Notes: Altera Corporation 11 AN 82: Highly Optimized 2-D Convolvers in FLEX Devices (R) 2610 Orchard Parkway San Jose, CA 95134-2020 (408) 894-7000 http://www.altera.com Applications Hotline: (800) 800-EPLD Customer Marketing: (408) 894-7104 Literature Services: (888) 3-ALTERA lit_req@altera.com 12 Altera, FLEX, FLEX 10K, AHDL, and FLEX 8000 are trademarks and/or service marks of Altera Corporation in the United States and other countries. Altera acknowledges the trademarks of other organizations for their respective products or services mentioned in this document. Altera products are protected under numerous U.S. and foreign patents and pending applications, maskwork rights, and copyrights. Altera warrants performance of its semiconductor products to current specifications in accordance with Altera's standard warranty, but reserves the right to make changes to any products and services at any time without notice. Altera assumes no responsibility or liability arising out of the application or use of any information, product, or service described herein except as expressly agreed to in writing by Altera Corporation. Altera customers are advised to obtain the latest version of device specifications before relying on any published information and before placing orders for products or services. Copyright 1997 Altera Corporation. All rights reserved. Altera Corporation Printed on Recycled Paper.