REAL
TIME FOURIER TRANSFORM USING PARALLEL RTX PROCESSORS
C. H.
Ting, Applied Biosystems, Inc.
and Rick VanNorman, Harris Semiconductor
Abstract
The
Continuous Fourier Transform (CFT) algorithm is well suited for parallel
processing using an array of fast RTX processors to speed up the transform
process. Using four 6 MHz RTX2000
processors linked together, the input sampling rate is pushed to 2 kHz. The resulting frequency spectrum is
displayed simultaneously while the transform is being computed. This experiment demonstrates
conclusively that the speed of CFT is linearly proportional to the number of
processors dedicated to the transform computation.
1. Introduction
In the
1989 Rochester Forth conference we reported the implementation of the
Continuous Fourier Transform (CFT) on an RTX2000 processor (1). CFT has many unique characteristics
which take advantage of a fast computing element like RTX2000. Since CFT guarantees that computation
error does not accumulate, integer arithmetic operations can be used for all
its computations. Another important
property of CFT is that the computation of each component in the frequency
spectrum is independent of the other components. Therefore, the computation can be
segregated to as many independent processors as there are frequency components,
and the throughput is linearly proportional to the number of processors
dedicated to the computational task.
This report details the construction of such a parallel processor array
and its performance.
2. Theory of Continuous Fourier Transform (CFT)
The
theory of CFT was discussed fully in our prior report. Here we shall give only a brief summary
of the theory to facilitate later discussions. Assume that we have a continuous
stream of data, sampled at a constant interval, from which the frequency
spectrum is computed in real time:
f(0), f(1), f(2), ... , f(k), ...
where
k may run to infinity. A window of
N contiguous elements from f(k) on are taken to obtain a frequency spectrum:
f(k), f(k+1), f(k+2), ... ,
f(k+N-1)
The
Continuous Fourier Transform computes the following N frequency components:
k+N-1
T(k,n) = ? f(m) exp{-2j¹mn/N}
m=k
n = 0, 1, 2, ... , N-1
Once
this set of frequency components is computed, the next set with a new data
point f(k+N) can be obtained easily with the following recursive equation:
T(k+1,n) =
T(k,n) + [f(k+N) - f(k)] exp{-2j¹kn/N}
n = 0, 1, 2, ... , N-1
Each T
component can be evaluated independent of other T components. As many as N processors can be employed
to solve the Fourier transform. The
processors only need the data stream f(k), and the assignments of the frequency
identifier n. There is no
inter-processor communication necessary.
In
this experiment, we distribute the T components among four RTX2000
processors. The speed of the
Fourier transform is increased by a factor of 4. Only the real part of the phase factor
is used in computation, and the result is that of a cosine transform. Computing only the cosine transform
further increases the processing speed by a factor of two. As the spectrum is computed on data in a
sliding window in real time, it is not necessary to do the sine transform.
3. The System Architecture
This
parallel processor architecture more closely approximates the Parallel CFT
Processor proposed in the original article by Ting (2), in which one slave
processor would be used to process one frequency component in the frequency
spectrum. However, as the number of
slave processors increases, the throughput bottle neck will shift from the
computational section to the output section of the Parallel CFT Processor
System. In this experiment, the
spectrum output is neatly integrated into the slave RTX processor array. The overall balance among input,
processing and output in the current system is a significant achievement.
The
Real Time CFT Parallel Processor System is designed with simplicity and
versatility as the primary considerations.
Its structure is shown schematically in Figure 1. A real time analog signal is fed into an
A/D converter, which broadcasts the digitized data over the A/D Data Bus to
four Slave Processors. These Slave
Processors, each handles a quarter of the computation load, present the cosine
transform results to the D/A Data Bus.
A D/A converter generates the frequency spectrum continuously from the
data on the D/A Data Bus. A Master
Processor generates all the timing signals to the A/D converter, the Slave
Processors and the D/A converter.
In the current setup, each component in the frequency spectrum is
computed from a window of 1024 contiguous data points in the input signal, and
512 components in the cosine transform are displayed continuously on an
oscilloscope.
4. Hardware Implementation
The
Master Processor and the four Slave Processors are identical single board
computers based on the INDELKO Forth Kit, with RTX2000GI-8 CPU from Harris
Semiconductor Corp. Each board has
64 Kbytes of RAM memory and the cmForth operating system is in EPROM's. Each board derives the system clock from
an individual 12 MHz crystal and operates at 6 MHz cycling rate. The Master Processor drives the A/D
converter, the Slave Processors and the D/A converter through a 74HCT374
latch. The Slave Processors receive
data from the A/D converter through a set of '374 latch and they put the
transformed data onto the D/A Data bus through a second set of '374
latches. The A/D converter and the
D/A converter are contained in a single KSV3100A chip from Samsung. Both the A/D and D/A Data Buses are 8
bits in width and they share the same I/O port (Port 1CH) on the G-Bus of
RTX2000.
The
schematics of the Master Processor and the Slave Processors are shown in Figure
2 and 3. Each processor is on a
circuit board, which plugs into a DIN style socket on the mother board. The A/D and D/A converter chip KSV3100A,
the '374 latches and '138 decoders are all located on the mother board.
Although
the D/A in KSV3100A is a 10 bit converter, only the most significant 8 bits are
used for the D/A Data Bus. Both A/D
and D/A converters have a voltage range of 0 to 2 volts. Since the outputs from the Slave
Processors are in signed two's complement form, the most significant sign bit
is inverted before feeding into the D/A through a '04 inverter.
The
reset and RS232 lines on each processor board are brought to the mother
board. They are connected to a
single reset switch and a DB25 connector through a flexible cable header. This arrangement allows a host computer
to communicate to each individual processor through a common RS232 cable. All processors boot cmForth from their
own on-board EPROM's. The same CFT program is downloaded to each processor. The Master Processor executes the word
MASTER to generate the timing signals to the rest of the system. Slave Processor 1 executes CFT1. Slave Processor 2 executes CFT2, etc.
To
facilitate the testing and verification of the system, a 555 timer is also
built on the mother board. It feeds a square wave of variable frequency to the
A/D.
It
takes 24 cycles (4 us) for a Slave processor to generate a new component in the
frequency spectrum. The four Slave
Processors are interleaved and the Master Processor clocks out consecutive
components by sending 0.5 us pulses at 1 us intervals sequentially to Slaves 1
to 4 and then repeats the sequence 128 times. Thus the 512 components in the cosine
transform can be displayed continuously on an oscilloscope. The Slaves
synchronize themselves by monitoring the EI2 input lines driven by the Master.
The Master samples the A/D converter and loads the data into slave processors'
input '374 latches. Figure 4 shows
all the timing signals generated by the Master.
While
initiating the A/D converter, the Master also raises the SyncSlaves line. When the Slaves sense the SyncSlaves
signals on their EI2 pins, they read the A/D data and start their own transform
routine. The first transform
results will be latched to the output '374 latches in about 10 us. The slave's waiting loop for EI2 signal
is a very tight 5 cycle loop.
Consequently, the jitter in the output of the slaves is limited to 0.84
us, which does not disturb the interleaved sequencing of the Slaves' outputs.
Using
interrupt servicing technique would reduce the jittering to within 0.5 us. This will be helpful when we build a 24
Slave Processor Array. However, the polling technique works very satisfactorily
in this four processor array and it eliminates four resettable flip-flops to
handle the EI2 interrupts.
This
scheme works because RTX2000 is a very well behaved machine. Its cycles and actions can be timed
precisely as all instructions are executed in one or two machine cycles. For most other processors, one would
have to use a deep FIFO chip to buffer the output data from the processor to
the D/A converter. In our current
experiment, the '374 latch, which is a single stage FIFO, can perform
flawlessly because the latched data will be picked up assuredly before the next
data point arrives.
5. Software Implementation
The
complete software package used in this experiment is shown in Listing 1. It is loaded to all processors on the
top of the cmForth kernel which is copied from the EPROM's into the RAM memory
during cold boot. The Master
Processor executes the word MASTER in Screen 2. The Slave Processors execute
respectively one of the CFT1, CFT2, CFT3 or CFT4 commands in Screens 12 to 15.
In
Screen 2, MASTER is the word which causes the Master Processor to generate all
the timing signals needed by the Slave Processors and the A/D-D/A converters,
as shown in Figure 4. HOLD is the
command disabling the output '374 latches of the Slaves, and preventing them
from fighting against one another over the D/A Data Bus. T1 is a test routine to be used by the
Slave Processors. It causes the
Slave to display a sawtooth wave through the D/A converter. It is useful to verify that the Master
generates the correct timing signals, and that the output data latch in a Slave
is working properly.
Screens
4 to 7 contain the cmForth optimizing compiler, which compresses as much Forth
functionality into a RTX2000 machine instruction as possible. The optimizing compiler takes care of
optimizing regular Forth code into highly efficient RTX machine code. Therefore, hand optimizing is necessary
only in the most time critical routine (TERM) in CFT.
Screen
8 contains some important slave test routines. DD dumps 256 bytes of memory for
inspection. Address of the next
byte above the dumped region is left on the stack so that DD can be used again
to continue the dumping and inspection.
VECTORS displays the contents of the Interrupt Vector Register in
RTX2000. It shows whether one can
read the SyncSlaves signal through the EI2 pin. The initialization routine in cmForth
often leaves the Data Stack Underflow Interrupt on and then VECTORS will
display a pageful of 1A0, which is the vector address of data stack underflow
interrupt. This condition must be
corrected by pushing a few dummy numbers on the stack.
INPUT
in Screen 8 reads and displays 256 samples of data from the A/D converter. This command will ensure that the A/D
converter and the '374 data input latches are working properly.
Screens
9 and 10 contain code to initialize the sine table and other memory areas used
by the CFT routine. Entries in the
sine table are 14 bits two's complement values. They are selected so that after
multiplying with the 8 bit input data and accumulated 1024 times, the result
fills in a 32 bit accumulator optimally.
(TERM)
in Screen 11 is the most critical piece of code in this parallel processor CFT
system. It updates every fourth
element in the frequency domain, after an input data is read in from the A/D
converter. After the 32 bit
accumulated result is obtained, the most significant 16 bit is written to the
output G port. Only the most
significant 8 bits are latched into the '374 output data latch, from where it
will be clocked out to the D/A converter by the Master Processor. The inner loop in (TERM) is executed 128
times to compute a quarter of the 512 cosine transform components. It takes 24 machine cycles to complete
this inner loop. With four Slave
Processors interleaved in their outputs, the effective system throughput is one
output data per 6 machine cycles-- 1 us on 6 MHz RTX2000 processors.
CFT1
to CFT4 in Screens 12 to 15 are similar.
The difference is that each word selects a difference initial phase
factor for the computation. The
initial phase factor determines the frequency of the cosine transform component
in the output frequency spectrum.
CFT2 to CFT4 also contain code to delay the starting of (TERM) so that
the output data from the Slave Processors are interleaved correctly.
6. Conclusion
With
this Parallel CFT Processor System we have proved conclusively that the CFT
algorithm can be implemented on a parallel processing structure and that the
processing speed can be increased linearly proportional to the number of
processors dedicated to the task.
It is most gratifying to observe that the CFT system takes analog
signals and generates the Fourier spectrum in real time, without degradation
while running unattended continuously .
The
architecture described here can be expanded to include up to 24 RTX2000 Slave
Processors, using only one RTX2000 as the Master. This structure will be able to produce
one frequency component in every machine cycle of the Master Processor. The sampling rate will be 12 kHz using 6
MHz RTX's, or 20 kHz using 10 MHz RTX's.
To use more slave processors, we will have to use a faster master
processor or use specialized control circuits to generate the required timing
signals to all the slave processors.
RTX2000
is very fast as a general purpose CPU and controller. However, it is still far from being
optimized for the task of CFT. In
the 24 cycle inner loop most of the cycles are fetching data from the memory
and storing results back to memory.
The ghost of the von Neumann bottle neck follows us wherever we go. The only way to get rid of the bottle
neck is to build specialized CFT slave processors with integrated
multiplier/accumulator and table memory in ROM. Then the ultimate speed limiting factor
is how fast can we route the transformed spectrum out to the display device and
to the system which will ultimately make use of the spectrum.
References
1. C. H. Ting and Rick VanNorman,
Continuous Fourier Transform, Proceedings of the 1989 Rochester Forth
Conference, p. 121, 1989.
2. C. H. Ting, Fourier Transform Faster
Than Fast Fourier Transform, Real Time Signal Processing, SPIE Proceedings,
Vol. 241, p. 167, 1980.
Figure
1. Architecture of the
Parallel CFT Processor System.
Figure
2. Schematics of the Master
Processor.
Figure
3. Schematics of the Slave
Processor 1.
Figure
4. Timing signals generated
by the Master Processor.