Re-configurable Computing

by Richard Swan, Anthony Wyatt, Richard Cant, Caroline Langensiepen

Introduction

Tired of constantly having to upgrade your hardware to meet the demands of the latest software? Well, this may be unnecessary in the near future. The solution to this expensive problem could be re-configurable hardware. Using a software approach, each application can be written so that it reconfigures the hardware to suit its particular needs, thus ensuring optimum performance.

Re-configurable computing is a relatively new field of research and development, with the first research beginning in the late 1980s. It is an attempt to bridge the traditional gap between hardware and software within the computing field. There are certain processors designed specifically for a single application (e.g. a 3D graphics accelerator), which perform their specific tasks at far greater speed than a traditional (general purpose) processor. However, the application-specific computer can perform only that specialised task at speed, and other applications would run poorly or not at all.

General Purpose vs. Specialised Processors

In the past, attempts have been made to improve the performance of the general purpose computer by adding to it some application-specific capabilities. The most popular is the addition of instructions to perform floating point and numerical operations. In recent years, Intel has added instructions for faster procedure calling, floating point mathematics, and graphics. The re-configurable computer attempts to make the general-purpose computer into an application-specific system by configuring or modifying an added piece of hardware to match the current application. This extra hardware is called the re-configurable device.

The re-configurable device allows designers to build part, or all, of their design in hardware rather than software. By exporting functionality into hardware, significant speed gains can be realized because the functionality doesn't have to be split into individual instructions for the CPU to fetch, decode, etc. The functionality becomes the hardware. The ability to implement an application in hardware provides an opportunity to exploit the inherent concurrency of digital circuits. That is, the device can be configured, or partitioned, into multiple subsystems - all of which could run concurrently with each other.

In the past, hardware designers have used schematics and hardware description languages (e.g. VHDL, discussed later) to describe their designs, while software designers have used high level languages and design tools. This split is reflected in the finding that design methodologies for concurrent software systems tend to have mainly sequential constructs, with concurrent/parallel constructs added on almost as an "after-thought".

A practical reconfigurable computing system requires a host. The host is the computer system that uses the re-configurable device, controls access to it, and programs it. There are various methods of connecting the host to the re-configurable device. At the lowest level, the device is embedded into the microprocessor core; at the highest level, the device could be connected via a connection cable [1]. The device itself is a microchip known as a Field Programmable Gate Array (FPGA), the current average size of which is about 2.5 inches square. It is this microchip that is the cornerstone of the re-configurable computing field.

A Bit of History

The FPGA is the outcome of the convergence of two separate technologies, the programmable logic device (PLD) and the Application Specific Integrated Circuit (ASIC). The history of the PLD began with the first PROM (Programmable Read Only Memory) devices and added versatility with the PAL (Programmable Array Logic) which allowed for a greater number of inputs and the inclusion of embedded registers. These devices have continued to grow in size and power. In the meantime, the ASIC has always been a powerful device but its use has traditionally required a considerable investment of both time and money. Attempts to reduce this burden have proceeded through modularisation of circuit elements as in the cell based ASIC and through standardisation of mask layers as pioneered by Ferranti with the ULA (Uncommitted Logic Array). The final step was to combine these two approaches with an interconnection mechanism that could be programmed in the field using fuses, antifuses or RAM cells as in the ground-breaking Xilinx devices of the mid-1980s. The resulting circuits are very similar in capability and application to the largest PLDs although there are detailed differences which betray the separate ancestries. Apart from re-configurable computing, FPGAs are used as controllers, encoders/decoders, and for prototyping of custom VLSI circuits and microprocessors.

The first manufacturer of these devices was Xilinx [2] and Xilinx devices remain one of the most popular with companies and research groups. Other vendors currently in this marketplace include Atmel, Altera, AMD and Motorola.

Applications of FPGAs

The earliest FPGAs were used mainly to "sweep up" pieces of hardware design that did not correspond to any existing, off the shelf, component. The main benefits of such usage were reduced package count when compared to an implementation using small scale integrated circuits, or reduced capital investment if a custom ASIC was the alternative. As time has passed the available devices have grown in size and so further applications have become possible. One interesting usage of the technology is for the prototyping of full custom circuits that have a complicated internal structure such as processors. Most of the major processor manufacturers now make an FPGA prototype at some point in the design process. One of the companies using this technology is Argonaut, who use FPGAs to prototype a RISC processor with user defined instructions before actually fabricating the final design in a "hard wired" ASIC technology. For many low volume applications the capital cost of a hard wired ASIC cannot be justified and so there is an increasing trend to use the FPGA version in the field, rather than just as a prototype.

Once the FPGA is not just a prototype, an interesting possibility opens up. Since the device is reprogrammable in the field, the design no longer has to carry the burden of flexibility to cope with different applications. Instead, an optimal design can be created for each application and loaded into the FPGA whenever that application is required. For example, consider a signal processing system. Such a system might have requirements for a number of different algorithms, including Fast Fourier Transforms, Z transforms and a variety of filtering operations. Each of the algorithms might need to be run at a variety of different wordlengths. A general purpose chip designed to do all of these would have many compromises in its structure. In contrast an FPGA layout could be individually created for each algorithm/wordlength combination. The fact that each of these layouts would be a no-compromise implementation of the algorithm in question would reduce the gate count and increase the speed compared to the general purpose design. These gains would help to offset the overheads that the use of the FPGA brings when compared to the custom ASIC.

When taken to the extreme, this kind of FPGA usage puts the software directly onto the layout of the chip. In order to see how this could work we next examine the architecture of a typical FPGA.

A Typical FPGA

A typical mid-range FPGA family is the Xilinx XC4000 series. These devices are quoted as ranging in size between 3,000 and 85,000 equivalent gates [1], although in practical applications it is impossible to utilize all of them. The XC4000 architecture is based around a kind of universal "cell" called the Configurable Logic Block (CLB). Each CLB contains a logic section consisting of three function generators and a storage section containing two flip-flops. The functions that are to be implemented in the logic section, together with routing of the outputs and the configuration of the flip-flops (edge triggered or level sensitive for example) are all defined in memory cells within the CLB. An alternative mode allows some of these memory cells to be used directly as RAM. A further refinement is the provision of special logic to facilitate 'carry look ahead' and propagation for arithmetic functions. The CLBs in the XC4000 family are arranged in a rectangular matrix, the smallest device having a 10x10 array whilst the largest one has 3,136 CLBs arranged 56x56.
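
As a rough illustration of how memory cells can define a logic function, the sketch below models a function generator as a lookup table addressed by its inputs. The class name `FunctionGenerator`, the three-input width and the full-adder example are assumptions chosen for illustration; they are not taken from the Xilinx documentation, and the real CLB has further features (routing, flip-flops, carry logic) that are not modelled here.

```python
from itertools import product


class FunctionGenerator:
    """Hypothetical model: a logic function stored as a truth table in memory cells."""

    def __init__(self, n_inputs, func):
        self.n_inputs = n_inputs
        # "Configuring" the generator fills one memory cell per input pattern.
        self.cells = [
            int(bool(func(*bits))) for bits in product((0, 1), repeat=n_inputs)
        ]

    def __call__(self, *bits):
        # Evaluation is a memory read: the inputs form the address.
        address = 0
        for bit in bits:
            address = (address << 1) | (bit & 1)
        return self.cells[address]


# Two generators configured as the sum and carry outputs of a one-bit full adder.
sum_gen = FunctionGenerator(3, lambda a, b, cin: a ^ b ^ cin)
carry_gen = FunctionGenerator(3, lambda a, b, cin: (a & b) | (cin & (a ^ b)))


def full_adder(a, b, cin):
    return sum_gen(a, b, cin), carry_gen(a, b, cin)


# Array sizes quoted in the text: 10x10 and 56x56 CLBs.
smallest_clbs = 10 * 10
largest_clbs = 56 * 56
```

Reconfiguring such a generator amounts to rewriting the contents of `cells`; the surrounding circuit does not change, which is the sense in which the function lives in memory.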

Connections between the CLBs are made using a hierarchy of paths. At the lowest level are the direct connections. These provide a link between each CLB and its immediate neighbours. The direction of these links is fixed so that data will flow from left to right or top to bottom only. The next layer in the hierarchy is provided by the single length lines. These are also intended mainly for the connection of near neighbour CLBs, although they can be used for longer distances at the cost of some extra delay.

Interspersed at regular intervals within the array of CLBs are the Programmable Switch Matrices (PSM). Each single length line runs from one PSM to the next with an optional connection to the CLBs which it passes on the way. At the PSM each line can optionally be connected to its continuation and/or to the lines which run at right angles to it. It is these connections which cause the delay when single length lines are used over longer distances. To avoid the delay there are other lines which do not stop at each PSM. The double length lines stop at alternate PSMs and are arranged in a staggered way such that each CLB can access the same number of them. The quad length lines are similar and, as their name suggests, four times the length of a single length line. However the extra length increases the drive requirements and so these longer lines have their own switch matrices with extra buffering available. Global busses are made possible by the so called "long lines" which run the entire length of the array and are connected to the CLB outputs by three state buffers. Special purpose interconnect includes global clock distribution and local high speed carry propagation paths.
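
To see why the longer lines help, a toy model is enough. The sketch below counts how many programmable switch points a straight route must pass through when it is built from lines of a single type. The function name `switch_points` and the model itself are assumptions made for illustration: it ignores the staggering of double and quad lines, mixed routes, and the actual delay contributed by each switch.

```python
import math

# Line lengths in units of one single length line (single, double, quad as in the text).
LINE_LENGTHS = {"single": 1, "double": 2, "quad": 4}


def switch_points(distance, line_type):
    """Toy count of the switch points joining the lines of a straight route.

    Assumes the route uses one line type only and ignores alignment
    (staggering) constraints and the real delay of each switch.
    """
    if distance < 1:
        raise ValueError("distance must be at least 1")
    lines_needed = math.ceil(distance / LINE_LENGTHS[line_type])
    return lines_needed - 1


for kind in ("single", "double", "quad"):
    print(kind, switch_points(8, kind))
```

Under these assumptions, a route spanning eight single-line lengths crosses seven switch points on single length lines, three on double length lines and one on quad length lines.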

At the edge of the device, a dedicated Input/Output Block (IOB) is provided for each pin. As with the CLBs and interconnection hierarchy, a wide range of options can be programmed using RAM cells, including not only logical properties like the pin direction but also electrical and timing characteristics to allow the device to be used in a wide range of contexts.

The complexity of this device is beyond the scope of manual design, particularly if large devices and complicated applications are involved, so software support is clearly needed. Early FPGAs were designed using traditional schematic capture hardware design tools and this method remains in use. However, for reconfiguration to be fully exploited it needs to be accessible to software designers, and so a new generation of tools is needed.

Software Support?

A unique opportunity for flexible parallel computing is offered with this technology, and should be utilized to make systems as efficient as possible. If re-configurable technology becomes widely adopted, then programmers will wish to write in a way that exploits the possibility of parallelism. One might hope that existing programming languages could provide a basis for software support of FPGA based systems. Unfortunately, the majority of popular programming languages currently used have no constructs to enable the user to program for multi-processor systems. For instance, one of the most popular languages, C++, has no built in support for parallelism. This limits the user to only being able to write programs, and to some extent think, in a sequential way, with one line of code being executed after the next. Functional languages allow parallelism to be expressed naturally, but are often very different conceptually from the sequential programming languages with which most users are familiar, making them difficult to learn and adjust to. With these drawbacks functional programming has been used mainly by a few experts.

Rather than create a new language similar to those described above, some developers have followed the route of extending the capabilities of existing sequential languages, such as C++, by making modifications and adding modules to them. The parallelism has often conflicted with existing components of the language forcing the removal of some functionality and limiting the scope of the improvements envisaged.

One approach to giving languages parallelism (Ada and Java being the main examples) is the use of threads. A thread is a kind of pathway that runs through the program. A purely sequential language would have one thread, i.e. only one path runs through the code at a time. Parallel programs are said to be multithreaded as they may have several of these code paths active at any one time. In a single processor system, this would be implemented by multi-tasking, with control of the processor being switched from one thread to another at different intervals of time, to give the illusion that they are running simultaneously. In a multi-processor system, such as is possible with re-configurable computing, threads could actually be executed in true parallelism on different processors in the system, allowing as many threads to be active as there are processors (and if there aren't enough it is possible a fresh one could even be made on the fly when using an FPGA system). When running on a single processor, a multi-tasking system has the advantage of being able to share most of the data in the system between all the threads. However, for multiple processors such shared memory becomes difficult to realise efficiently when the number of processors grows above about ten. As a consequence of this, highly parallel systems tend to use distributed memory and a Communicating Sequential Processes (CSP) software model. A number of processors have been designed with this kind of architecture in mind, notably the Inmos Transputer and Texas Instruments' 320C40.
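
The difference between sharing data and passing messages can be sketched in a few lines. The example below is only an approximation of the CSP style, written with Python threads and bounded queues standing in for channels; the function names and the `None` end-of-stream marker are assumptions made for illustration. A queue of size one is not a true synchronous rendezvous, and in the standard Python interpreter the threads are typically interleaved rather than run on separate processors.

```python
import queue
import threading


def producer(out_channel, values):
    # Each process owns its data and communicates only by sending messages.
    for value in values:
        out_channel.put(value)
    out_channel.put(None)  # assumed end-of-stream marker


def squarer(in_channel, out_channel):
    while True:
        value = in_channel.get()
        if value is None:
            out_channel.put(None)
            return
        out_channel.put(value * value)


def run_pipeline(values):
    # Bounded queues approximate channels; nothing else is shared between threads.
    a_to_b = queue.Queue(maxsize=1)
    b_to_main = queue.Queue(maxsize=1)
    threads = [
        threading.Thread(target=producer, args=(a_to_b, values)),
        threading.Thread(target=squarer, args=(a_to_b, b_to_main)),
    ]
    for t in threads:
        t.start()
    results = []
    while True:
        item = b_to_main.get()
        if item is None:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results
```

Because the two processes interact only through their channels, each could in principle be moved to its own processor, or its own region of an FPGA, without changing the other.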

Occam, which was designed for the Transputer, is a good example of the way in which CSP can be directly implemented in a language. However it is rather low level in that it lacks the capability for sophisticated data structures. Consequently most Transputer and 320C40 systems use variants of C/C++ with added support for parallel processing. Unfortunately the specific mechanisms added to allow parallelism tend to be of a lower level than the rest of the language and as such require much more work in order to create comparable systems with all the benefits of being concurrent.

One project which is of particular interest, since it has been created with FPGAs in mind, is Handel-C, by the Oxford Hardware Compilation group [3]. It is a language based on CSP algebra and developed from Occam [4], and has been successfully compiled into hardware. Handel-C works by adding Occam constructs to C, thereby creating a new language where parallelism is possible. Although the idea behind Handel-C has remained the same since its conception, it has had numerous difficulties. Some parts of the original C language therefore had to be removed, as they were simply incompatible with the new system. The language has also become more low-level than the original version in order to allow programmers to use the new synchronous semantics to define the timing of programs as well as their behaviour.

None of these languages takes account of a very prominent architectural feature of FPGA design, namely the interconnect system. The interconnect system, as described above, is very well adapted to large amounts of local communication but can only provide a rather limited amount of long distance communication without penalty. One desirable feature of a software support system would be to encourage this local communication by using a graphical notation to provide an explicit metaphor.

A Different Approach

It is apparent from the above that none of these approaches, considered separately, is an appropriate solution, as each has its limitations and flaws. The approach used in the prototype [5], developed by ourselves at Nottingham Trent University, was to make use of a graphical language which we have christened "Graphite" [6, 7]. This was derived from existing software design methodologies such as Yourdon or Ward-Mellor, which allow easier and less costly software development and which will be familiar to many potential users. The language also makes use of process bubbles which can be implemented using an existing sequential language.

The sequential language to be used has not been decided upon yet, as there are a number of possible candidates. C/C++ is a logical choice, due to its popularity and versatility. However, at this point the sequential language used is unimportant as it will not be parallel in its own right. Instead the language will be used to describe the behaviour of sub-programs, or entities, in a larger system.

This method allows the user a better way of perceiving the system and of using linked sub-components, which are intrinsically parallel, to some extent without them even having to think about it. It also enables them to make use of as little or as much concurrency as they wish: if they wanted to program in a sequential way they would simply have to define a single entity and code within that. However the benefits would start to appear when they gain confidence enough to begin splitting the program into various components, connected via drawn data lines carrying various types (like passing parameters to a function that would run concurrently with the main program). Then information starts to be processed in parallel, and the efficiency gains begin to show.
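
The idea of entities joined by data lines can be sketched as follows. This is not Graphite itself, whose notation is graphical; the `Entity` class, the `run_diagram` function and the example diagram are hypothetical names invented here to show how sequential bodies connected by named data lines expose parallelism without the programmer asking for it.

```python
class Entity:
    """Hypothetical stand-in for a process bubble: input lines, one output, sequential body."""

    def __init__(self, name, inputs, body):
        self.name = name
        self.inputs = inputs  # names of the data lines feeding this entity
        self.body = body      # ordinary sequential code


def run_diagram(entities, external_inputs):
    """Fire every entity whose input lines carry data until none remain."""
    lines = dict(external_inputs)  # data line name -> value
    pending = list(entities)
    while pending:
        ready = [e for e in pending if all(i in lines for i in e.inputs)]
        if not ready:
            raise ValueError("unconnected data lines: diagram cannot complete")
        # Entities in `ready` do not depend on one another, so hardware could
        # evaluate them concurrently; here they are simply run in turn.
        for e in ready:
            lines[e.name] = e.body(*(lines[i] for i in e.inputs))
            pending.remove(e)
    return lines


diagram = [
    Entity("scaled", ["x"], lambda x: 2 * x),
    Entity("offset", ["x"], lambda x: x + 1),
    Entity("total", ["scaled", "offset"], lambda s, o: s + o),
]

result = run_diagram(diagram, {"x": 3})
```

In this example `scaled` and `offset` both depend only on `x`, so they are ready at the same moment; a program written as a single entity would simply never produce more than one ready entity at a time.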

The objective of creating this concurrent design and programming tool/language is, of course, to allow programmers to specify a system and then have it implemented in hardware. To achieve this, another question must be considered: how do we tell the hardware what to construct? Even with the design created and the code programmed for each section required, it is useless unless it can be transformed into a form that can be created on the hardware board. The completed program must therefore be able to compile the diagram and code into a hardware description (the prototype can do this to some extent already).

In the past this would have been impossible due to computers not being able to understand schematics and drawings commonly used to specify hardware. Fortunately, a language exists which is designed for describing almost any kind of hardware, called VHDL. VHDL is a large and complex language designed to represent circuits. It is ideal for specifying digital electronic systems, and from it the hardware can be simulated to gauge performance or actually synthesised. Unfortunately VHDL uses rather low level timing constructs, which, although convenient for hardware description, are often a stumbling block for people who have written only in software programming languages. However, it does provide a convenient intermediate form between Graphite and the ultimate hardware realisation of a system. When our tool is fully completed, the designs and code entered by the user will ultimately be compiled into VHDL so that the design could be implemented in hardware from this, on a re-configurable programmable logic board.

VHDL shares many of the attributes of any other programming language, despite being optimised for detailing circuits. It also contains many features for describing hardware, up from the level of logic gates to complete processors and beyond. Once these circuits have been designed they can be imported as parts of a larger system, in much the same way that objects can be used as components of larger objects in C++ through aggregation. The VHDL language is a recognised industry standard, which means that tools exist from all the major FPGA manufacturers to convert VHDL circuit descriptions into a suitable format for their devices.

When a system has been defined in VHDL, a test bench can be created to simulate the circuit in operation. This is done by writing a process that sends input to the virtual circuits and checking that the output is correct and on time. This means that systems can be tested to see if they act as expected and perform to an acceptable standard without the need for actual FPGA circuits. In addition, a software simulation is likely to allow for more detailed debugging with better access to what is going on in the circuit and less risk of losing information when a crash occurs.

The Prototype Tool

Our prototype is a design and programming tool that uses a graphical interface to enable the user to input designs and code which can then be compiled into hardware, by transforming it into the VHDL language. It utilises different symbols to represent different types of entity (e.g. function or state), and draws lines between them to represent the data lines between these entities. It is an event driven WIMP environment. The look and feel of the tool in its present form are illustrated in Figures 1 and 2. Within the design there is a facility for the user to enter the function code for all placed symbols. As with all prototypes, it is not fully functional, but gives a taster of the full program.

The first prototype of this system was implemented as a student project and it was known at the start that there would not be time to implement a fully working version. So it was decided that a partially functional prototype should be created as a 'proof of concept'. As it turned out, the prototype would be much more functional than expected, with much of the interface operational. However, it needs further extension to be complete, and the output is very limited at present, only generating the 'entity' descriptions in VHDL (the target language), whereas more would be needed to implement in hardware.
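
To give a feel for this step, the sketch below generates a VHDL entity declaration from a list of ports. It is not the prototype's code, which is not reproduced in this article; the function name `vhdl_entity`, the port tuple layout and the `adder` example are assumptions made for illustration. An entity declaration describes only the interface of a block, which is why architecture bodies would also be needed before anything could be synthesised.

```python
def vhdl_entity(name, ports):
    """Return VHDL source text declaring the interface of entity `name`.

    `ports` is a list of (port_name, direction, vhdl_type) tuples, where
    direction is "in" or "out". Only the interface is generated.
    """
    lines = [
        "library ieee;",
        "use ieee.std_logic_1164.all;",
        "",
        f"entity {name} is",
    ]
    if ports:
        lines.append("  port (")
        # Port declarations are separated by semicolons; the last has none.
        decls = [f"    {p} : {d} {t}" for p, d, t in ports]
        lines.append(";\n".join(decls))
        lines.append("  );")
    lines.append(f"end {name};")
    return "\n".join(lines)


adder_vhdl = vhdl_entity(
    "adder",
    [
        ("a", "in", "std_logic_vector(3 downto 0)"),
        ("b", "in", "std_logic_vector(3 downto 0)"),
        ("sum", "out", "std_logic_vector(3 downto 0)"),
    ],
)
print(adder_vhdl)
```

Each symbol placed on a diagram would map to one such declaration, with its data lines becoming the ports.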

By examining what the prototype is, and is not, capable of, we have identified improvements that could be made to the product. In the future a full implementation of the compiler will be produced, which will require additional features and corrections. In a fully working version the code input by the user must be compiled to architecture statements, and the diagram design further represented by component statements, etc. As the design is to be implemented in hardware, some kind of checking facility is also needed to ensure that it meets all the rules and requirements for this. There may need to be options to choose which FPGA platform is to be used, so that VHDL output can be tailored to be more efficient on it.

At some point it may be advantageous to be able to reverse engineer some VHDL code into the diagram and code form. This would enable changes to be made at the higher level, without the complexities associated with working in VHDL. The diagram also needs to be able to switch to different levels of abstraction for functions, which may consist of smaller functions (similar to parent-child diagrams in data flow charts). Consistency checks would also be needed if child diagrams are used; ensuring that all data lines entering are connected somewhere, etc.
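
One such consistency check can be sketched very simply. The function below reports data lines that enter a parent bubble but are not read by any entity in its child diagram. The name `unconnected_inputs` and the data layout are assumptions made for illustration; a real check would also need to cover outputs and data types.

```python
def unconnected_inputs(parent_inputs, child_entities):
    """Return the data lines entering a child diagram that no child entity reads.

    `parent_inputs` is the collection of data line names entering the parent
    bubble; `child_entities` maps each child entity name to the lines it reads.
    """
    consumed = set()
    for reads in child_entities.values():
        consumed.update(reads)
    return sorted(set(parent_inputs) - consumed)


dangling = unconnected_inputs(
    {"x", "y", "clk"},
    {"scale": ["x"], "mix": ["x", "y"]},
)
```

In this example `dangling` contains only `clk`, the one line that enters the diagram without being connected anywhere.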

Since the implementation of this prototype, it has been recognised that new constructs are needed, and these would again have to be added to the drawing types (and may require further functions to tailor the settings of these). In addition to the hardware aspect, this methodology works well for designing any parallel system, and so options may be required to disable hardware compilation, and just produce designs/code instead.

Using the Tool

To make full use of the tool in the form proposed the following are required:

  1. A host system on which to run the tool.
  2. A board containing one or more reconfigurable devices together with a connection to the host and the facility to download configuration data as well as communicating with the system while it is running.
  3. Synthesis and place and route software capable of converting VHDL descriptions into a configuration data stream for the target device.

The tool is currently being written in C++ so almost any current workstation or personal computer should satisfy requirement 1. There are a number of manufacturers currently supplying hardware which can satisfy the second requirement. The authors are currently using a board developed under the "ASPIRE" program [8], which contains a version of the ARM processor together with two Xilinx XC4000 series FPGAs. Other suppliers of suitable hardware include Giga Operations Corporation [9]. Software to satisfy the third requirement is normally available from the FPGA manufacturer.

The use of the tool is as follows. First a Graphite program must be created. The Graphite program consists of a hierarchy of diagrams together with supporting text for the lowest level entities within the diagrams. The graphical tool enables the diagrams to be edited whilst maintaining the proper relationships between diagrams and between diagrams and supporting text. The supporting text can also be edited directly using a text editor.

When the program is thought to be complete it would normally be compiled first into a simulation file. The simulator allows the system behaviour to be examined and debugged in a friendly environment with complete access to the internal "working" of the system. It also allows debugging to take place without going through the time consuming hardware synthesis stage.

When the program is running satisfactorily on the simulator, the hardware compilation stage begins. At this point the tool will separate any substantial sequential code sections from the remainder. A suitable compiler will then process these pieces of sequential code. On the present hardware used by the authors these code sections will run on the ARM microprocessor, although in future they will probably be run on processors that are synthesised within the FPGA hardware. The remaining parts of the system will be compiled by the tool into VHDL and passed to the manufacturer's synthesis software. At present this software is not normally capable of creating an optimal configuration from VHDL without human assistance. It is hoped that the tool will be capable of generating extra information from the structure of the Graphite program which can guide the process of synthesis to produce a good result automatically. In the long run the ideal would be to bypass the VHDL stage and integrate the Graphite tool directly into the manufacturer's FPGA software suite. However, in the meantime the VHDL stage allows flexibility.

When the synthesis stage is completed the combination of compiled sequential code and configuration data is loaded into the target hardware and the system can be run for real.

Final Thoughts

When successfully implemented, we hope that the final product will be a step forward in computer programming. As mentioned earlier, performance improvements could be obtained through a successful system, as the computer would become more suited to the tasks required of the particular application. With this added flexibility speeding up the execution of programs, the endless upgrading of hardware and software currently required by a user to stay in line with technology could perhaps be slowed. It is the aim of the student authors to further the development of the prototype mentioned here in the final year of their degree course and afterwards.

Research into the area of re-configurable computing is going on throughout the academic world, so it seems inevitable that a successful solution will be implemented somewhere. With this in mind, and all the changes that re-configurable hardware can bring, we believe the day is not far off when all applications are designed and implemented using languages that will re-configure the hardware.

References

1
Hauser, J., Wawrzynek, J., "GARP: A MIPS Processor with a Reconfigurable Coprocessor", Proceedings of IEEE Workshop on FPGAs for Custom Computing Machines, pp. 12-21, April 1997.

2
Xilinx Inc., http://www.xilinx.com

3
Page, I., "Parameterised Processor Generation", "More FPGAs. International Workshop on Field Programmable Logic and Applications", September 1993, pp. 223-237, http://www.comlab.ox.ac.uk/oucl/users/ian.page/handel/handel-c.html, http://www.sundance.com/handel.htm

4
Page, I., Luk, W., "Compiling Occam into FPGAs", FPGAs, Abingdon EE & CS Books, 1991, pp. 271-283.

5
Clarke, G., Crabb, G., Swan, R., Vernon, P., Wyatt, A., "Re-configurable Computing", Group Project Report, Nottingham Trent University, May 1998.

6
Cant, R.J., "Defining Reconfigurable Graphics Accelerator Systems For Virtual Reality Using The Graphite Language", to appear in: Proceedings of the European Simulation Symposium ESS'98.

7
Langensiepen, C.S., "Graphite - A Scaleable Language For Training Simulator Design", to appear in: Proceedings of the European Simulation Symposium ESS'98.

8
Advanced Risc Machines, http://www.arm.com/Collab/Colinfo/index.html

9
Giga Operations Corporation, http://www.gigaops.com

Author Biographies

Richard Swan and Anthony Wyatt are Computer Studies undergraduates at the Nottingham Trent University who undertook a study of reconfigurable computing as a second year group project. They are currently in industry for their work placement year, and will return in 1999 to complete their degrees.

Richard Cant and Caroline Langensiepen are lecturers in the Department of Computing at Nottingham Trent University.


Copyright 1999 Richard Swan, Anthony Wyatt, Richard Cant, and Caroline Langensiepen

Location: www.acm.org/crossroads/xrds5-3/ntu.html