CHAPTER 9. HIGH DENSITY PARALLEL PROCESSING

I. The Processor Array and Macro Controller

Summary
A GAPP processor array of 11520 processors and its associated controller were built and fully tested. The controller allows the processor array to be programmed conveniently using high level languages without sacrificing speed or code efficiency. Several test and demonstration programs were developed. The hardware structure and special features of this system are presented in this report.
1. The Parallel Processor System
As part of the process of evaluating parallel processor algorithms, with emphasis on image processing, we have developed a complete parallel array processor system. This is a necessary tool because large programs require an unacceptably long time to execute on a software simulator, and because algorithm optimization ultimately requires testing with real time data. Included in the system is a SIMD (Single
Instruction Multiple Datapath) processor array with 96 by 120 GAPP processors
(Geometric Arithmetic Parallel Processor, NCR 45CG72) and an MIMD (Multiple
Instruction Multiple Datapath) controller optimized for program compression and
fast program flow control.
Rather than inventing a new operating environment, the system was designed to operate as an external coprocessor with an IBM AT personal computer as the host. Paths
are provided from the host to the parallel processor system for program and
data loading, run-time operation, and status monitoring. In addition, high speed 12 bit parallel
input and output ports are provided which are capable of 10 megawords per
second synchronous data transfers, and slower asynchronous transfers.
Processing within the parallel processor system is completely
self-contained so that once started by the host, program execution can proceed
independently. Our present system
loads data via a DMA channel in the host computer, and the results can be unloaded similarly to the host or to a real time video display unit. A software console program was developed
for use in the host computer to control the parallel processor system
interactively.
2. Processor Array
The SIMD data processing section consists of an array
of GAPP devices and is constructed from four circuit board assemblies, each of
which has a 60 by 48 array of single bit GAPP processors. The boards, which were specially
designed for this application, contain extensive signal buffering to allow array
expansion in all four directions by using multiple boards. This versatility allows altering the
array aspect ratio for experimentation with various classes of problems. Array expansion with these components is
feasible up to approximately 256 by 256 cells, beyond which it is worthwhile to design a unique board package for each case so as to optimize density and area efficiency.
The processors are organized as two arrays as shown in Figure 1: one of 12 by 96
processors for input/output corner-turning, and a main array of 108 by 96
processors. The two arrays may
optionally execute from independent instruction and address streams. However, in the present system only one
stream is used with one controller.
In the main array, the EW and NS register planes are connected in a
cylindrical surface topology in both the east-west and north-south directions,
although spiral and other interconnections are jumper selectable. The corner-turning array is cylindrical
in the north-south direction for the NS register plane, and uses the east edge for input and the west edge for output.
The two arrays are connected, also in a cylindrical manner, by the CM
register plane in the GAPP devices.
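The corner-turning operation converts word-serial input arriving at the array edge into the bit-plane format that the single bit processors compute on. A minimal C sketch of a 12-word by 12-bit transpose is given below; the function and data layout are hypothetical illustrations, not the actual GAPP datapath.

```c
#include <stdint.h>

/* Hypothetical corner-turning sketch: 12 incoming 12-bit words
 * (one per row of the 12-wide corner-turning array) are transposed
 * so that bit k of every word lands in bit-plane k. */
enum { WORDS = 12, BITS = 12 };

void corner_turn(const uint16_t in[WORDS], uint16_t planes[BITS])
{
    for (int b = 0; b < BITS; b++) {
        planes[b] = 0;
        for (int w = 0; w < WORDS; w++)
            planes[b] |= (uint16_t)(((in[w] >> b) & 1u) << w);
    }
}
```

In the hardware the same effect is achieved by shifting words through the EW register plane while the CM plane carries bit planes into the main array.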
3. The Distributed Macro Controller
The Distributed Macro Controller (DMC) addresses a critical issue in parallel processing computers such as the GAPP: high processor density combined with limited memory and instruction sets. The controller allows direct implementation of adaptive programming decisions made by the host; that is, without loss of machine cycles. The top level architectural innovation
is that the controller is a MIMD machine that processes three different
instruction streams simultaneously, as shown in Figure 2. A Flow Control Unit feeds several (here
two) Macro Generator Units. The
instruction streams from the Macro Generator Units are combined to feed the
control lines of the GAPP array.
Each of these units will be a single chip in VLSI. All Macro Generator Units are identical.
Both the Flow Control and Macro Generator Units use
externally writable control store memory to store instruction streams. The Flow Control Unit supervises program
flow within the DMC while the Macro Generator Units produce output instructions
for the GAPP array. The MIMD
architecture is hierarchical; i.e., the Flow Control Unit directs the
production of the programs from the Macro Generator Units. The final output stream consists of two
15 bit words, combined to form a single 20 bit instruction and address stream
for the GAPP. The controller offers
a very high degree of program compression.
Existing sequencers have wide microcode words but little program
optimization or compression.
The Flow Control Unit allows eight levels of nested
subroutines and eight levels of nested loops. As loops increment, the loop counts of interior loops can be changed.
Subroutine calls and returns are performed in 3 clock cycles. With the provision that subroutines are
at least three instructions long, this allows penalty-free
macroprogramming. External inputs
may be tested for conditional operations (branching, looping, calling, and
returning). The Flow Control Unit
uses a 32 bit wide instruction format.
It is designed to be a single chip, unlike existing sequencers, although
its primary function in the present controller is to direct the internal flow
(within the DMC) of the program.
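The nested looping and subroutine mechanism described above can be modeled with separate return and loop stacks. The C sketch below uses the eight-level depth from the text, but the operations and their encoding are invented for illustration, not the Flow Control Unit's actual microarchitecture.

```c
/* Sketch of a sequencer with nested subroutine and loop stacks. */
enum { DEPTH = 8 };

typedef struct {
    int pc;
    int ret_stack[DEPTH]; int ret_sp;                 /* subroutine returns */
    int loop_cnt[DEPTH];  int loop_top[DEPTH]; int loop_sp;
} Sequencer;

void call(Sequencer *s, int target) {
    s->ret_stack[s->ret_sp++] = s->pc + 1;
    s->pc = target;
}
void ret(Sequencer *s) { s->pc = s->ret_stack[--s->ret_sp]; }

void loop_begin(Sequencer *s, int count) {
    s->loop_cnt[s->loop_sp] = count;
    s->loop_top[s->loop_sp] = s->pc + 1;   /* first body instruction */
    s->loop_sp++;
    s->pc++;
}
/* at loop end: jump back while iterations remain */
void loop_end(Sequencer *s) {
    if (--s->loop_cnt[s->loop_sp - 1] > 0)
        s->pc = s->loop_top[s->loop_sp - 1];
    else { s->loop_sp--; s->pc++; }
}
```

Because interior loop counts are writable while outer loops increment, data-dependent iteration patterns can be set up by the host without stalling the instruction stream.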
The Macro Generator Units, which are physically
identical and each designed to be a single chip, have several novel features:
(1) Callable macro and address routines
(2) Automatic memory management
(3) Static and dynamic reinterpretation
logic
(4) A rich set of stack operations.
Feature (1) is for program compression. Pre-loaded instruction streams can be
called by specifying a pointer and length.
Typically, these are not GAPP instructions but instructions which cause
the Macro Generator Units to produce GAPP code indirectly.
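Calling a pre-loaded routine by pointer and length might be modeled as below; the store size, word width, and contents are assumptions for illustration, not the Macro Generator Unit's actual format.

```c
#include <stdint.h>
#include <string.h>

/* Sketch of macro calling by (pointer, length): the macro store is
 * preloaded by the host; a call names only an offset and a count. */
enum { STORE = 256 };
static uint16_t macro_store[STORE];

/* emit 'len' words starting at 'ptr' into the output stream */
int call_macro(int ptr, int len, uint16_t *out)
{
    memcpy(out, &macro_store[ptr], (size_t)len * sizeof out[0]);
    return len;
}
```

The compression comes from the calling side: a long instruction sequence costs one (pointer, length) pair per use instead of being stored inline each time.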
Memory management calculates physical addresses given
logical ones. Thus all of the
memory addressing is indirect and penalty free. A linked list of memory segments with
"occupied" and "free" areas is maintained. This handles allocation of memory and
makes the task of the GAPP programmer much easier. Also, if these functions were to be
performed in software, the processing system would not be able to operate at
full speed.
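The logical-to-physical translation over a linked list of occupied and free segments could look roughly like the following; the segment structure and the convention that logical addresses count through occupied segments only are assumptions based on the description above.

```c
#include <stddef.h>

/* Sketch of segment-list memory management. */
typedef struct Seg {
    int base, len;         /* physical base and length */
    int occupied;          /* 1 = occupied, 0 = free   */
    struct Seg *next;
} Seg;

/* logical address = offset counted through occupied segments only */
int to_physical(const Seg *list, int logical)
{
    for (const Seg *s = list; s; s = s->next) {
        if (!s->occupied) continue;
        if (logical < s->len) return s->base + logical;
        logical -= s->len;
    }
    return -1;             /* out of range */
}
```

Done in hardware, this walk costs no machine cycles from the programmer's point of view; done in host software, it would throttle the array below full speed, as noted above.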
Reinterpretation is a method of program compression
useful from both an op-code and memory point of view. There are both dynamic and static
reinterpretations available.
Reinterpretation involves performing an exclusive-OR operation on the
output with a mask pattern. A
number of patterns may be stored. Dynamic reinterpretation allows an external constant to be loaded in, whose bits can then be used to modify the output with one of several masks.
Static reinterpretation is only selectable at the macro-instruction
level. Thus, the if-then-else
construction becomes available to parallel processors without penalty.
The usefulness of reinterpretation is that
applications natural to geometric SIMD machines tend to be highly
patterned. Addition, subtraction, and template matching to either a zero or a one differ only in the selection of CARRY and BORROW; loops proceed by alternately selecting one stack or another; and so on. A study of the class of
transformations natural to GAPP-like machines reveals the frequent occurrence
of such patterns, allowing switching between instructions with the use of
reinterpretation bits.
Reinterpretation of address bits allows symmetrical operations within
address space.
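The XOR-mask mechanism can be sketched as follows. The mask table, selection scheme, and 32 bit container are illustrative assumptions; the text specifies only that the output word is exclusive-ORed with one of several stored patterns, either fixed per macro instruction (static) or steered by bits of an externally loaded constant (dynamic).

```c
#include <stdint.h>

/* Sketch of reinterpretation: XOR the outgoing instruction word
 * with a stored mask, e.g. the single-bit difference that turns
 * an add (CARRY) into a subtract (BORROW). */
enum { NMASK = 4 };
static uint32_t mask_table[NMASK];

/* static: mask chosen once at the macro-instruction level */
uint32_t reinterpret_static(uint32_t instr, int mask_sel)
{
    return instr ^ mask_table[mask_sel];
}

/* dynamic: a bit of an external constant decides per emission
 * whether the mask is applied */
uint32_t reinterpret_dynamic(uint32_t instr, uint32_t ext_const,
                             int bit, int mask_sel)
{
    uint32_t apply = (ext_const >> bit) & 1u;
    return instr ^ (apply ? mask_table[mask_sel] : 0u);
}
```

One stored macro thus serves both branches of an if-then-else: the host loads the constant, and the emitted stream flips between the two variants with no extra cycles.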
The controller also provides a large set of stack
operators, operating simultaneously on two stacks holding address pointers to
the GAPP memory. Stacks offer a way
of changing the instruction sequence to the GAPP in nonconsecutive or nonlinear
ways. Two stacks with two top elements
cached in the address Macro Generator Unit give the programmer convenient
access to four different memory areas to implement complex arithmetic and logic
operations.
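A minimal model of the dual pointer stacks is sketched below. The depth and operation set are assumptions; the point being illustrated is that with the top two elements of each stack cached, four GAPP memory areas are addressable at once.

```c
/* Sketch of one of the two address-pointer stacks. */
enum { SDEPTH = 16 };
typedef struct { int mem[SDEPTH]; int sp; } Stack;

void push(Stack *s, int v)  { s->mem[s->sp++] = v; }
int  pop(Stack *s)          { return s->mem[--s->sp]; }
int  top(const Stack *s)    { return s->mem[s->sp - 1]; }
int  next(const Stack *s)   { return s->mem[s->sp - 2]; }

/* With two stacks A and B, {top, next} of each give four address
 * pointers simultaneously, e.g. the real and imaginary parts of
 * two complex operands. */
```

Stack operations then step these pointers through nonconsecutive memory areas without any address arithmetic appearing in the GAPP instruction stream.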
4. Conclusions
This parallel processor system is currently fully operational with the GAPP processor array. The system was designed to run at the maximum clock rate of 10 MHz; currently it is running at 2.5 MHz. The computational throughput is thus 28.8 giga instructions per second (11520 single bit processors at 2.5 MHz). We are developing software tools and programs to evaluate its performance for many different classes of problems, such as image processing and understanding, real time signal processing and analysis, translation invariant and non-invariant problems, and general studies on the use and programming of parallel processing systems. Some of the results will be shown in a video tape, demonstrating the real time image processing capability of this system.
The controller is rich in ways that optimize the
programming of GAPP-like arrays. We
believe that the study of its concepts will allow better understanding of the
maximum efficiency obtainable from geometric SIMD arrays and lead to distinct
developments in this branch of computer science.
5. Acknowledgements
Dr. Wlodzimierz Holsztynski, the inventor of the GAPP
device, provided the architectural design of the Distributed Macro
Controller. Integrated Test Systems
of Santa Barbara, CA constructed both the GAPP processor boards and the
Distributed Macro Controller.
Figure 2. Functional units within the DMC.