®
Altera Corporation 1
Highly Optimized
2-D Convolvers
in FLEX Devices
February 1997, ver. 1 Application Note 82
A-AN-082-01
Introduction
The FLEX
®
10K and FLEX 8000 device families can efficiently implement
two-dimensional convolvers, which are useful for many image processing
applications such as filtering, sharpening, edge detection, and
interpolation. A convolver processes information in small pieces, such as
a 3
×
3 convolution window. Each element in the convolution window is
multiplied by a constant coefficient, and then the products are summed to
obtain the desired result.
This application note describes how to create more efficient convolvers in
FLEX 10K and FLEX 8000 designs.
f
For more information on two-dimensional convolvers, see the following
documents:
Application Note 73 (Implementing FIR Filters in FLEX Devices)
Product Information Bulletin 21 (Implementing Logic with the Embedded
Array in FLEX 10K Devices)
Conventions
For this application note, the 9 locations in a 3
×
3 convolution window are
numbered for reference, as shown in Figure 1. The subscript indicates
which tap or coefficient is being referenced. The variable
T
n
refers to the
tap data value at location
n
(e.g.,
T
2
refers to the tap value at location 2).
Figure 1. Tap & Coefficient Reference Locations
123
456
789
2 Altera Corporation
AN 82: Highly Optimized 2-D Convolvers in FLEX Devices
Conventional
Convolver
Figure 2 shows the coefficient window for a sample convolver design.
Figure 2. Coefficient Window
Figure 3 shows a block diagram of a conventional convolver using the
coefficients shown in Figure 2. In this convolver, the coefficients remain
constant and the tap values change every clock cycle as new data is shifted
into the convolver.
1.5
0.375
–0.6250.375
0.125
–0.1251.25
–0.125
0.250
Row 1
Row 2
Row 3
Altera Corporation 3
AN 82: Highly Optimized 2-D Convolvers in FLEX Devices
Figure 3. Conventional Convolver Block Diagram
The output of each register is called a tap (
T
n
). Each tap is multiplied by
the corresponding coefficient (
C
n
) from Figure 2, and then the products
are summed to produce the final result (
Y
). The equation for the convolver
in Figure 3 is shown below:
1.25 –0.125 0.25 0.375 0.125 –0.125 1.5 0.375
888888 8
8
17
18
19
20
8
–0.625
16 16 16 16 16 16 16 16 16
18
17 17
17
8
8
8
Row 1
Row 2
Row 3
T7T8T9T4T5T6T1T2T3
Y
YT
n
C
n
1
9
=
4 Altera Corporation
AN 82: Highly Optimized 2-D Convolvers in FLEX Devices
Building the conventional two-dimensional convolver for maximum
speed requires 9 multipliers and an adder tree consisting of 8 separate
adders. Assuming the data and coefficients are 8 bits wide and the
coefficients are constant, each multiplier would require 36 logic elements
(LEs) and each adder would require one LE per bit, for a total of 539 LEs
for the entire design.
Highly
Optimized
Convolver
You can improve the speed and logic efficiency of the conventional
convolver by taking advantage of the numeric relationships between the
coefficients. Although the convolution window in Figure 2 is asymmetric,
all of the coefficients are integer multiples of 0.125. By rewriting the
coefficient window and substituting K = 0.125, you obtain the window
shown in Figure 4.
Figure 4. Modified Coefficient Window
In this example, you can take advantage of the fact that all the factored
coefficients are integers. Multiplying the taps by the coefficients 12, 3, 5, 1,
and 10 is simple because all the integers are the sum of powers of 2. For
example, multiplying tap
T
1
by 12 is the same as adding 8
T
1
and 4
T
1
(i.e.,
12
T
1
= 8
T
1
+ 4
T
1
). Because 8 and 4 are both powers of 2, multiplication can
be avoided. Instead,
T
1
can be shifted left by 3 bits and 2 bits, respectively.
Therefore, 12
T
1
can be computed with a single adder that is only 10 bits
wide, assuming
T
1
is an 8-bit number. For example, if
T
1
= 11111111
B
, you
can calculate 12
T
1
as shown in Figure 5.
Figure 5. Multiplying 12
×
T
1
Because the 2 least significant bits (LSBs) of 8
T
1
and 4
T
1
are always 0, the
2 extra bits in the adder are not required. Therefore, this calculation
requires only 10 LEs.
12
3
–53
1
–110
–1
2
× K
11111111000
+ 1111111100
101111110100
8 ×
T
1
+ 4 ×
T
1
12 ×
T
1
=
Altera Corporation 5
AN 82: Highly Optimized 2-D Convolvers in FLEX Devices
Similarly, 3, 5, and 10 can each be computed with a single adder.
Multiplying or dividing by any power of 2 is a shift left or right and does
not require any LEs. However, LEs are required to add the product into
the final result.
To further improve the logic efficiency of this design, you can take
advantage of the “sparseness” of the coeffiecients (i.e., the relatively few
1s in the coefficients). The basic method is to add or subtract various
multiples of 2 of the taps to obtain the desired multiplication factor—e.g.,
add together all taps that have a 1 in the most significant bit (MSB). Table 1
lists each tap’s coefficient and its binary representation. Numbers in
parentheses are negative.
To perform these calculations efficiently in parallel, begin by adding taps
with a 1 in the MSB and work your way down to the LSB. In this example,
the MSB of
C
1
and
C
7
is 1. Therefore, you should calculate an intermediate
result by adding
C
1
and
C
7
, remembering to shift the product left by 3 bits
before adding it to the final result.
Similarly, the (MSB – 1) bit of
C
1
and
C
3
is 1. In this case, compute
(
C
1
C
3
), remembering to shift the product left by 2 bits before adding it
to the final result.
Repeat this procedure for the next 2 bits of the weight, i.e., the (MSB – 2)
bit and the LSB. Next, add all of the products to produce the final result.
Last, multiply the final result by K. Because 0.125 is a power of 2, no
multiplication is necessary—the final result is simply shifted right by 3
bits.
Table 1. Binary Representation of the 9 Coefficients
Coefficient Weight
Decimal Binary
C
1
12 1100
C
2
3 0011
C
3
–5 (0101)
C
4
3 0011
C
5
1 0001
C
6
–1 (0001)
C
7
10 1010
C
8
–1 (0001)
C
9
2 0010
6 Altera Corporation
AN 82: Highly Optimized 2-D Convolvers in FLEX Devices
The following steps show mathematically how to use this method to
create a highly optimized convolver design, or any other multiply and
accumulate (MAC) design. All numbers are in binary.
Y
=
T
1
C
1
+
T
2
C
2
+
T
3
C
3
+
T
4
C
4
+
T
5
C
5
+
T
6
C
6
+
T
7
C
7
+
T
8
C
8
+
T
9
C
9
1. Substitute the binary values of the coefficients:
Y
= (
T
1
×
1100) + (
T
2
×
0011) – (
T
3
×
0101) + (
T
4
×
0011) +
(
T
5
×
0001) – (
T
6
× 0001) + (T7 × 1010) – (T8 × 0001) + (T9 × 0010)
2. Factor out 8 (1000B):
Y = [(T1 + T7) × 1000] + (T1 × 100) + (T2 × 011) – (T3 × 101) +
(T4 × 011) + (T5 × 001) – (T6 × 001) + (T7 × 010) – (T8 × 001) + (T9 × 010)
3. Factor out 4 (100B):
Y = [(T1 + T7) × 1000] + [(T1T3) × 100] + (T1 × 00) + (T2 × 11) –
(T3 × 01) + (T4 × 11) + (T5 × 01) – (T6 × 01) + (T7 × 10) – (T8 × 01) +
(T9 × 10)
4. Factor out 2 (10B):
Y = [(T1 + T7) × 1000] + [(T1T3) × 100] + [(T2 + T4 + T7 + T9) × 10] +
(T1 × 0) + (T2 × 1) – (T3 × 1) + (T4 × 1) + (T5 × 1) – (T6 × 1) + (T7 × 0) –
(T8 × 1) + (T9 × 0)
5. Add the remaining non-zero terms:
Y = [(T1 + T7) × 1000] + [(T1T3) × 100] + [(T2 + T4 + T7 + T9) × 10] +
[(T2T3 + T4 + T5T6T8)] × 1
6. Group the terms by significance and perform more factoring:
Y = [(T1 + T7) × 10] + [(T1 T3) × 1] × 10 + [(T2 + T4 + T7 + T9) × 1 × 10] +
[(T2T3 + T4 + T5T6T8) × 1]
Altera Corporation 7
AN 82: Highly Optimized 2-D Convolvers in FLEX Devices
Figure 6 shows a block diagram of the multiply and accumulate section of
the optimized convolver.
Figure 6. Highly Optimized Convolver Block Diagram
The highly optimized fixed-coefficient convolver uses 215 LEs, as
opposed to the 529 LEs required by the conventional convolver shown in
Figure 3—a savings of 60%.
The highly optimized design, written in the Altera Hardware Description
Language (AHDL), is shown in Figure 7. This pipelined design will run
at over 90 MHz in a FLEX 8000A device with a -2 speed grade, a speed that
is the equivalent of 810 MIPS for only 214 LEs.
1This design—named fstconv—can be downloaded from the
Altera world-wide web site at http://www.altera.com.
11
15
13
12
17 312 974 2
45 6 3 8
Tap Position
11
+
88888
8
88 8888 88
9999999
Shift Left
by 1 Bit
Shift Left
by 1 Bit
Shift Left
by 1 Bit
+++– ++ ++ + ++– ++
Shift Left
by 3 Bits Shift Left
by 2 Bits Shift Left by 1 Bit No Shift
++
++
+
+
+
++
+
Y
8 Altera Corporation
AN 82: Highly Optimized 2-D Convolvers in FLEX Devices
Figure 7. fstconv.tdf File (Part 1 of 2)
SUBDESIGN fstconv
(in[3..1][8..1] : INPUT;
out[15..1] : OUTPUT;
clk : INPUT;
)
VARIABLE
row1[3..1][8..1] : DFF; -- Row 1 of shift register
row2[3..1][8..1] : DFF; -- Row 2 of shift register
row3[3..1][8..1] : DFF; -- Row 3 of shift register
inter_sh3[9..1] : DFF; -- Intermediate shift by 3 node
inter_sh2[9..1] : DFF; -- Intermediate shift by 2 node
inter_sh1_t[2..1][9..1] : DFF; -- Shift by 1 node
inter_sh1[10..1] : DFF; -- More shift by 1 node
inter_sh0_t1[3..1][9..1]: DFF; -- Shift by 0
inter_sh0_t2a[10..1] : DFF;
inter_sh0_t2b[9..1] : DFF;
inter_sh0[11..1] : DFF;
inter_3_2[11..1] : DFF; -- Shift by 3 + shift by 2
inter_3_2_1[13..1] : DFF; -- inter_3_2 + shift by 1
inter_3_2_1_0[15..1] : DFF; -- inter_3_2_1 + shift by 0
BEGIN
-- Connect the clock to all the internal registers
(inter_sh3[].clk, inter_sh2[].clk, inter_sh1_t[][].clk,
inter_sh1[].clk, inter_sh0_t1[][].clk,
inter_sh0_t2a[].clk, inter_sh0_t2b[].clk,
inter_sh0[11..1].clk, inter_3_2[].clk, inter_3_2_1[].clk
inter_3_2_1_0[].clk) = clk;
-- Connect the shift register:
-- Data will shift from left to right, which will be 1 to 3
row1[1][] = in[1][];
row2[1][] = in[2][];
row3[1][] = in[3][];
row1[3..2][] = row1[2..1][];
row2[3..2][] = row2[2..1][];
row3[3..2][] = row3[2..1][];
row1[][].clk = clk;
row2[][].clk = clk;
row3[][].clk = clk;
Altera Corporation 9
AN 82: Highly Optimized 2-D Convolvers in FLEX Devices
Figure 7. fstconv.tdf (Part 2 of 2)
-- The shift by 3 data
inter_sh3[] = (row1[1][8], row1[1][]) + (row3[1][8], row3[1][]);
-- The shift by 2 data
inter_sh2[] = (row1[1][8], row1[1][]) - (row1[3][8], row1[3][]);
-- The shift by 1 data
inter_sh1_t[1][] = (row1[2][8], row1[2][]) +
(row2[1][8], row2[1][]); -- 2 + 4
inter_sh1_t[2][] = (row3[1][8], row3[1][]) +
(row3[3][8], row3[3][]); -- 7 + 9
inter_sh1[] = (inter_sh1_t[1][9], inter_sh1_t[1][]) +
(inter_sh1_t[2][9], inter_sh1_t[2][]);
-- The shift by zero data
inter_sh0_t1[1][] = (row1[2][8], row1[2][]) +
(row2[1][8], row2[1][]);
inter_sh0_t1[2][] = (row2[2][8], row2[2][]) -
(row2[3][8], row2[3][]);
inter_sh0_t1[3][] = (row1[3][8], row1[3][]) +
(row3[2][8], row3[2][]);
inter_sh0_t2a[] = (inter_sh0_t1[1][9], inter_sh0_t1[1][]) +
(inter_sh0_t1[2][9], inter_sh0_t1[2][]);
inter_sh0_t2b[] = inter_sh0_t1[3][];
inter_sh0[] = (inter_sh0_t2a[10], inter_sh0_t2a[]) -
(inter_sh0_t2b[9], inter_sh0_t2b[9],
inter_sh0_t2b[]);
-- Shift by 3 + shift by 2
inter_3_2[] = (inter_sh3[9], inter_sh3[], gnd) +
(inter_sh2[9], inter_sh2[9], inter_sh2[]);
-- Shift by 3 + shift by 2 + shift by 1
inter_3_2_1[] = (inter_3_2[11], inter_3_2[], gnd) +
(inter_sh1[10], inter_sh1[10], inter_sh1[10],
inter_sh1[]);
-- Shift by 3 + 2 + 1 + 0
inter_3_2_1_0[] = (inter_3_2_1[13], inter_3_2_1[], gnd) +
(inter_sh0[11], inter_sh0[11],
inter_sh0[11], inter_sh0[11], inter_sh0[]);
-- Output
out[] = inter_3_2_1_0[];
END;
10 Altera Corporation
AN 82: Highly Optimized 2-D Convolvers in FLEX Devices
Conclusion This application note shows how to improve the logic efficiency of a
convolver design when the coefficients can be factored and are sparse. To
reduce the amount of logic required for your convolver design, use the
following tips:
If possible, factor out a common coefficient, so that a single final
multiplication can be performed at the end (if the factor is a multiple
of 2, multiplication is not required).
If possible, set all the coefficients to a power of 2 or to the sum of a few
powers of 2. Logic is not required to multiply a tap by the coefficients
1, 2, 4, and 8—the taps are just shifted left by 0, 1, 2, or 3 bits,
respectively. Likewise, the numbers 3, 5, 6, 9, 10, and 12 can be
calculated with a single adder. The other numbers below 16 will
require more logic, but will still be more efficient than using the
conventional method.
Notes:
AN 82: Highly Optimized 2-D Convolvers in FLEX Devices
Altera Corporation 11
Altera, FLEX, FLEX 10K, AHDL, and FLEX 8000 are trademarks and/or service marks of Altera Corporation
in the United States and other countries. Altera acknowledges the trademarks of other organizations for their
respective products or services mentioned in this document. Altera products are protected under numerous
U.S. and foreign patents and pending applications, maskwork rights, and copyrights. Altera warrants
performance of its semiconductor products to current specifications in accordance with Altera’s standard
warranty, but reserves the right to make changes to any products and services at any time without notice.
Altera assumes no responsibility or liability arising out of the application or use of any
information, product, or service described herein except as expressly agreed to in writing
by Altera Corporation. Altera customers are advised to obtain the latest version of device
specifications before relying on any published information and before placing orders for
products or services.
Copyright 1997 Altera Corporation. All rights reserved.
2610 Orchard Parkway
San Jose, CA 95134-2020
(408) 894-7000
http://www.altera.com
Applications Hotline:
(800) 800-EPLD
Customer Marketing:
(408) 894-7104
Literature Services:
(888) 3-ALTERA
lit_req@altera.com
®
AN 82: Highly Optimized 2-D Convolvers in FLEX Devices
12 Altera Corporation
Printed on Recycled Paper.