Ramanathan Annamalai
Design and Synthesis of Maximum Throughput Parallel Array Architectures for Real-Time Image Transforms
Monday, June 29, 1998
9:00 AM
206 Egan Building
Abstract
Image transforms are widely used to compress still images and video so that they can be stored and/or transmitted efficiently. Separable image transforms can be realized as matrix products in two stages, with the second stage (column processing) acting on the result of the first stage (row processing). Column processing (CP) requires the row processing output in transposed order, which is usually provided in hardware by a dual-ported memory. However, this intermediate storage and retrieval requirement serializes computation, restricts the ability to pipeline, and severely limits the maximum throughput that can be achieved. It also complicates the control strategy of the architecture and makes efficient block (second-level) pipelining difficult to implement.
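The two-stage row-column decomposition described above can be sketched in a few lines of Python. This is only an illustration of the conventional scheme with an explicit intermediate transposition, not the architecture developed in the thesis. The orthonormal DCT-II kernel and all function names (dct_kernel, transform_rows, separable_2d) are assumptions chosen for the example.

```python
import math

def dct_kernel(n):
    """Orthonormal DCT-II kernel C, with C[k][j] = a(k) * cos((2j + 1) k pi / (2n))."""
    kernel = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        kernel.append([scale * math.cos((2 * j + 1) * k * math.pi / (2 * n))
                       for j in range(n)])
    return kernel

def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def transform_rows(x, c):
    """Apply the 1D kernel to every row of x, i.e. compute X * C^T."""
    return matmul(x, transpose(c))

def separable_2d(x, c):
    """Compute Y = C X C^T in two stages with an explicit transposition."""
    z = transform_rows(x, c)     # stage 1: row processing
    zt = transpose(z)            # intermediate transposition (the memory step)
    yt = transform_rows(zt, c)   # stage 2: column processing on transposed data
    return transpose(yt)
```

The transpose(z) call stands in for the dual-ported transposition memory: stage 2 cannot start on a column until every row of stage 1 has contributed to it.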
This work develops a number of modular parallel structures for both stages of a 2D image transform. These structures are linear arrays characterized by their simple distributed memory and control requirements. They can accept a continuous stream of inputs and process it at the input sample rate without any data buffering or reordering (transposition). Since the arrays are capable of sustaining continuous real-time serial input, their block pipelining characteristics are excellent, leading to the highest possible throughput of one $N \times N$ 2D transform computation every $N^2$ cycles.
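As a rough software analogy, the sketch below consumes an N x N block one sample at a time in raster order and accumulates the 2D result without ever storing a transposed intermediate. It is not the array structure of the thesis, which the abstract does not detail. The accumulate-as-you-go formulation and the name stream_2d_transform are assumptions made for illustration, and the inner loops stand in for work that parallel hardware would do concurrently.

```python
def stream_2d_transform(samples, c):
    """Consume one N x N block serially and return (C X C^T, steps taken).

    samples: the block in raster order, one value per step.
    c:       N x N transform kernel as a list of lists.
    """
    n = len(c)
    y = [[0.0] * n for _ in range(n)]   # column-stage accumulators
    row_acc = [0.0] * n                 # row-stage accumulators, one per hypothetical PE
    steps = 0
    for idx, sample in enumerate(samples):
        i, j = divmod(idx, n)           # position of this sample in the block
        for k in range(n):
            row_acc[k] += sample * c[k][j]   # accumulate Z[i][k]
        steps += 1
        if j == n - 1:
            # Row i of Z is complete: fold it into Y as an outer-product update,
            # Y[m][k] += C[m][i] * Z[i][k], so no transposition is required.
            for m in range(n):
                for k in range(n):
                    y[m][k] += c[m][i] * row_acc[k]
            row_acc = [0.0] * n
    return y, steps
```

For a full block the function takes exactly N^2 input steps, matching the one-block-per-$N^2$-cycles figure when one sample arrives per cycle.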
By combining modular array cores for the row and column processing, 2D transform architectures can be easily assembled for varying output patterns (serial, parallel, zigzag ordering). Two families of architectures are developed: Architectures-I, which use a hardwired kernel and can perform a specified forward/inverse transform pair, and Architectures-II, which can load any kernel on the fly and use it until a different kernel is loaded and applied. Architectures-I are suitable for realizing popular transform pairs (e.g. DCT/IDCT) in the same hardware, and Architectures-II for providing maximum flexibility and reconfigurability while maintaining optimal throughput.
Handshaking protocols are developed that enable data block pipelining and alternative kernel reloading from the host on the fly (without adding any delay). The internal array datapath widths are made minimal while meeting the IDCT specifications of the H.261 video compression standard from ITU-T. A Unified DCT/IDCT datapath specification is developed that will not have any overflows or underflows and achieves a high SNR for the DCT and PSNR for the DCT/IDCT forward/inverse transform pair.
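For reference, PSNR between an original block and its DCT/IDCT reconstruction is conventionally computed as in the minimal sketch below. The peak value of 255 assumes 8-bit samples; the abstract does not state how the measurements were set up, and the function name psnr is illustrative.

```python
import math

def psnr(reference, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2D blocks.

    peak=255.0 assumes 8-bit samples (an assumption, not taken from the source).
    """
    squared_error = 0.0
    count = 0
    for ref_row, rec_row in zip(reference, reconstructed):
        for ref, rec in zip(ref_row, rec_row):
            squared_error += (ref - rec) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0.0:
        return math.inf      # identical blocks: no noise at all
    return 10.0 * math.log10(peak * peak / mse)
```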
Synopsys synthesizable VHDL models are developed. The VHDL code for the Processing Elements (PE) is derived automatically using DG2VHDL, a CAD tool that facilitates the rapid prototyping of DSP algorithms and is under development in the Parallel Processing and Architectures group of Prof. Manolakos. The design partitioning is done in such a way that when a different transform pair is needed, few blocks will have to be resynthesized and this leads to a small "adaptation turn-around" time to produce a new transform pair processor from an existing one.
Hardware Synthesis is performed with the Synopsys Behavioral Compiler (BC) and Design Compiler (DC). Gate-level area estimates are obtained and architectural trade-offs are discussed.
Thesis Committee:
Prof. E.S. Manolakos (advisor)
Prof. W. Meleis
Dr. J. Fridman (Analog Devices)