Concepts
The PACO Idea
Approximate computing is a steadily growing research topic because other performance gains are becoming harder to reach while applications that require less than absolute precision proliferate. The common goal is to trade accuracy or reliability for energy savings, speed-up, or reduced computational latency.
Approximate computing concepts cover a wide range, including:
- completely custom CPU core fabrics with error-rate self-monitoring
- under-volting or overclocking of circuits
- selectively ignoring cache coherency requirements.
The large majority of these approaches remain isolated because they target a very specific trade-off and are not based on a current general-purpose CPU design. As a result, the toolsets around such approximate computing platforms are often outdated and limited.
In contrast, the PACO group decided to extend a current, open, general-purpose CPU/ISA with several promising approximate functional units.
PACO Goal
To provide a platform that allows us and others to:
- learn about hardware approximation techniques by implementing them quickly
- measure the effects on the quality of results and the speed-up achieved
- quickly compile approximate applications with a C/C++ compiler, allowing developers to evaluate results for their general purpose applications
- foster cooperation through open source hardware and software
- experiment with static approximation: the programmer or compiler tries to predict the precision requirements of individual operations and encodes them within instructions. (This contrasts with other approximate computing systems, which regulate precision trade-offs at runtime and require quality metrics for every approximate calculation.)
PACO Approximate Functional Units
These functional units
- reside in the execution stage of the pipeline
- compute arithmetic functions in the most general sense
- behave deterministically, i.e. given the same inputs they always compute the same results
- are addressed using special approximate instructions.
To demonstrate that different approximate functional units can be integrated into the PACO core, we have implemented:
An Approximate ALU
The PACO Approximate ALU is intended for experimenting with different degrees of approximation in approximate applications; it is not designed to provide energy savings or speed-up compared to the standard ALU. The ALU ignores a certain number of least significant bits of its inputs, depending on the degree of approximation specified in the instruction. An extension to the C/C++ languages has also been created that allows both the expected precision of inputs and the minimum precision boundaries of outputs to be specified. The PACO compiler determines the highest degree of approximation that does not violate the output precision boundaries.
The Design Document elaborates on the concept behind approximation using the ALU in detail.
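The bit-ignoring behaviour described above can be illustrated with a small software model. This is only a sketch under assumptions: the function names approx_mask and approx_add, the 32-bit operand width, and the choice of addition as the operation are hypothetical and are not taken from the PACO sources.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical software model of an approximate ALU operation.
 * Names and the 32-bit width are assumptions for illustration only. */

/* Clear the `drop_bits` least significant bits of an operand. */
static uint32_t approx_mask(uint32_t x, unsigned drop_bits)
{
    assert(drop_bits <= 32);
    if (drop_bits == 32)
        return 0; /* shifting a 32-bit value by 32 would be undefined */
    return x & ~((UINT32_C(1) << drop_bits) - 1u);
}

/* Approximate addition: both inputs lose their low bits before the add.
 * The same inputs always give the same result, so the model is deterministic. */
static uint32_t approx_add(uint32_t a, uint32_t b, unsigned drop_bits)
{
    return approx_mask(a, drop_bits) + approx_mask(b, drop_bits);
}
```

With drop_bits set to 0 the model behaves like an exact adder. With drop_bits set to 4, approx_add(1000, 999, 4) yields 1984 instead of 1999, because each operand is truncated to 992.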
A Lookup Table (LUT)
The Lookup Table unit allows an application to replace complex arithmetic functions with segment-wise linear approximations of them (see Fig.2). It is integrated in the CPU pipeline and takes only one cycle in the execution stage of the CPU.
The LUT accepts inputs from up to three registers. It can be configured at runtime to evaluate up to 9 bits from these registers to determine the segment that has to be interpolated. In addition, some input bits can be multiplied by the segment's slope, and finally an offset is added to calculate the result of the lookup instruction. The design of the Lookup Table is shown in Fig.3.
Fig.2: The Lookup Table unit approximates arithmetic functions within segments. In this case values within the segment are linearly interpolated.
Fig.3: Design of the LUT as part of the pipeline.
More detail can be found in the Design Document.
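The select, multiply, and add steps of a lookup can be sketched in software as follows. This is a simplified model under assumptions, not the PACO hardware design: it uses a single 16-bit input instead of up to three registers, 3 selector bits instead of up to 9, and all names (lut_segment, lut_configure, lut_lookup, square_q16) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SEL_BITS 3u                  /* selector bits (the unit allows up to 9) */
#define LOW_BITS 13u                 /* remaining bits of the 16-bit input */
#define NUM_SEGMENTS (1u << SEL_BITS)

/* Hypothetical segment entry: one straight line per segment. */
struct lut_segment {
    int32_t slope;   /* change of the function across the whole segment */
    int32_t offset;  /* function value at the start of the segment */
};

struct lut {
    struct lut_segment seg[NUM_SEGMENTS];
};

/* Example function to approximate: f(x) = x^2 / 2^16. */
static int32_t square_q16(uint32_t x)
{
    return (int32_t)(((uint64_t)x * x) >> 16);
}

/* Configuration step: sample f at the segment borders. */
static void lut_configure(struct lut *t, int32_t (*f)(uint32_t))
{
    assert(t && f);
    for (uint32_t i = 0; i < NUM_SEGMENTS; i++) {
        int32_t start = f(i << LOW_BITS);
        int32_t end = f((i + 1u) << LOW_BITS);
        t->seg[i].offset = start;
        t->seg[i].slope = end - start;
    }
}

/* Lookup: the top bits select the segment, the low bits are multiplied
 * by the segment's slope, and the segment's offset is added. */
static int32_t lut_lookup(const struct lut *t, uint16_t x)
{
    uint32_t sel = (uint32_t)x >> LOW_BITS;
    int32_t low = (int32_t)(x & ((1u << LOW_BITS) - 1u));
    const struct lut_segment *s = &t->seg[sel];
    int64_t scaled = ((int64_t)s->slope * low) / (1 << LOW_BITS);
    return s->offset + (int32_t)scaled;
}
```

After lut_configure(&t, square_q16), a lookup at a segment border such as lut_lookup(&t, 8192) returns the sampled value 1024, while a lookup in the middle of a segment, such as lut_lookup(&t, 12288), returns the interpolated value 2560 where the sampled function gives 2304. Increasing the number of selector bits narrows the segments and reduces this interpolation error.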
PACO Compiler Extensions
To create approximate applications, programmers must be able to control the level of approximation in their code. The compiler must then be capable of translating this code into instructions that control the approximate functional units. This requires not only special annotations but also new instructions.
The PACO Toolchain consists of a Clang front end that handles the special annotations and translates the annotated C code into an Intermediate Representation (IR). LLVM consumes this IR to generate assembly code. Binutils then converts the assembly into machine code that can be loaded and run on the FPGA or an emulator. In addition, the LUT Compiler uses the special annotations to create a configuration for the LUT Core.
It is useful if the compiler can at least try to predict how earlier arithmetic operations affect later ones. This prediction is necessarily limited because small changes can have hard-to-predict effects in some functions.
An extensive description of the compiler can be found in the Design Document.
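One simple way to picture this kind of prediction is worst-case error tracking for an addition whose operands lose their low bits. The sketch below is an illustration under assumptions and does not describe the analysis the PACO compiler actually performs: the names add_error_bound and choose_drop_bits are hypothetical, errors are treated as absolute worst-case bounds, and overflow and wrap-around are ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Worst-case absolute error of an add whose operands each lose
 * `drop_bits` low bits, given error bounds already carried by the
 * operands from earlier approximate operations. */
static uint64_t add_error_bound(uint64_t err_a, uint64_t err_b,
                                unsigned drop_bits)
{
    assert(drop_bits <= 32);
    /* Truncating drop_bits low bits loses at most 2^drop_bits - 1. */
    uint64_t trunc = (UINT64_C(1) << drop_bits) - 1u;
    return err_a + err_b + 2u * trunc;
}

/* Largest number of dropped bits that keeps the result within
 * max_out_err, or -1 if even an exact add would exceed the bound. */
static int choose_drop_bits(uint64_t err_a, uint64_t err_b,
                            uint64_t max_out_err)
{
    int best = -1;
    for (unsigned k = 0; k <= 32; k++) {
        if (add_error_bound(err_a, err_b, k) > max_out_err)
            break; /* the bound only grows with k */
        best = (int)k;
    }
    return best;
}
```

For exact inputs and a permitted output error of 30, choose_drop_bits(0, 0, 30) selects 4 dropped bits. If one operand already carries an error of up to 30 from an earlier operation and the output may deviate by at most 40, only 2 bits can be dropped, which shows how earlier approximations constrain later ones.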