Hardware Acceleration of Deep Neural Network Models on FPGA (Part 1 of 2)

Artificial Intelligence has become pervasive, finding applications in areas that once seemed out of reach. Deep Learning, a subfield of Machine Learning, has become the state-of-the-art approach to many AI problems because of its high accuracy and efficiency. It supports real-time decision making in applications such as Advanced Driver Assistance Systems (ADAS), robotics, autonomous vehicles, industrial automation, and aerospace and defense. Accurate, real-time decisions require a massive amount of data to be processed. Deep Neural Network (DNN) models achieve this by using a large number of neural network layers.

Deep Neural Network

[Figure. Source: freecodecamp.org, "Want to know how Deep Learning works? Here's a quick guide for everyone."]

Deep Neural Networks are the state-of-the-art solution for a variety of applications such as computer vision, speech recognition and natural language processing. An Artificial Neural Network is a mathematical construct that ties together a large number of simple elements, called neurons, each of which performs a simple mathematical computation. A shallow neural network has only three layers: an input layer, one hidden layer and an output layer. A neural network becomes a Deep Neural Network (DNN) as the number of hidden layers increases. Deep Learning can therefore be considered a class of Artificial Neural Networks composed of many processing layers. Deeper networks tend to be more accurate, and accuracy often keeps improving as more layers are added, at the cost of more computation. Some important Deep Neural Network models are the Feed-Forward Neural Network, the Recurrent Neural Network (RNN) and the Convolutional Neural Network (CNN).
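To make the difference between a shallow and a deep network concrete, here is a minimal sketch of a feed-forward pass in plain Python. The layer sizes, the random weights, the ReLU activation and the helper names (dense, make_layer, forward) are all assumptions chosen for illustration; a real model would use trained weights and an optimized library.

```python
import random

def relu(x):
    """Simple activation: pass positive values, clamp negatives to zero."""
    return max(0.0, x)

def dense(inputs, weights, biases, activation):
    """One fully connected layer: each neuron computes activation(w . x + b)."""
    return [
        activation(sum(w * x for w, x in zip(neuron_w, inputs)) + b)
        for neuron_w, b in zip(weights, biases)
    ]

def make_layer(n_in, n_out, rng):
    """Hypothetical layer with random weights and zero biases (untrained)."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def forward(x, layers):
    """Run x through every layer; hidden layers use ReLU, the output is linear."""
    for weights, biases in layers[:-1]:
        x = dense(x, weights, biases, relu)
    weights, biases = layers[-1]
    return dense(x, weights, biases, lambda v: v)

rng = random.Random(0)
# Shallow: input -> one hidden layer -> output.
shallow = [make_layer(4, 8, rng), make_layer(8, 2, rng)]
# Deep: same input and output sizes, but three hidden layers.
deep = [make_layer(4, 8, rng), make_layer(8, 8, rng),
        make_layer(8, 8, rng), make_layer(8, 2, rng)]

sample = [0.5, -1.0, 0.25, 2.0]
print(forward(sample, shallow))
print(forward(sample, deep))
```

The only structural difference between the two networks is the number of hidden layers in the list, which is what separates a shallow network from a deep one in the description above.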

Hardware Accelerators for Deep Neural Networks

Hardware acceleration is the process by which an application offloads computationally intensive tasks to specialised hardware to achieve higher efficiency than a software implementation running on a CPU alone. Achieving accurate results in real time requires better models operating on larger datasets, and the time taken to reach a decision is an important factor. As new Deep Learning models evolve, their structure becomes more complex, so they involve a huge number of operations and parameters and need more computing resources. Three options for hardware accelerators are GPUs, ASICs and FPGAs.
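As a rough illustration of how quickly parameters and operations grow with model size, the sketch below counts the weights, biases and multiply-accumulate (MAC) operations of a fully connected network. The layer sizes are hypothetical examples, and the function names are introduced here only for illustration.

```python
def dense_layer_cost(n_in, n_out):
    """Parameters and MACs for one fully connected layer."""
    params = n_in * n_out + n_out  # one weight per connection plus one bias per neuron
    macs = n_in * n_out            # one multiply-accumulate per connection
    return params, macs

def network_cost(layer_sizes):
    """Total parameters and MACs per inference for a stack of dense layers."""
    total_params = 0
    total_macs = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        params, macs = dense_layer_cost(n_in, n_out)
        total_params += params
        total_macs += macs
    return total_params, total_macs

# Hypothetical layer sizes: a small network and a deeper, wider one.
small = [784, 128, 10]
large = [784, 512, 512, 512, 10]

for name, sizes in (("small", small), ("large", large)):
    params, macs = network_cost(sizes)
    print(f"{name}: {params} parameters, {macs} MACs per inference")
```

Under these assumptions the deeper network needs roughly nine times as many parameters and MACs per inference as the small one, which is the kind of growth that motivates offloading the work to an accelerator.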

GPUs were designed for processing images through massive parallelism, but today they are also used in big data analytics and to accelerate the portions of an application that require high throughput and memory bandwidth. GPUs excel at parallel processing and provide acceleration where the same operations are required many times in rapid succession. However, GPUs consume a large amount of power, which is a challenge for DNN applications that must run on edge devices, especially battery-operated ones. GPUs achieve high throughput by processing inputs in large batches, but this typically comes with high latency, so they are less suitable for latency-critical applications.
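The trade-off between batch size, throughput and latency can be sketched with a toy model. Every number below (fixed overhead per batch, per-sample compute time, arrival interval) is a hypothetical assumption, not a measurement of any real device, and the model ignores many real effects such as memory transfers.

```python
def batch_stats(batch_size, fixed_ms, per_sample_ms, arrival_ms):
    """Toy model of batched inference.

    fixed_ms      : assumed fixed overhead per batch (launch, setup)
    per_sample_ms : assumed compute time per sample
    arrival_ms    : assumed interval between incoming samples
    """
    compute_ms = fixed_ms + batch_size * per_sample_ms
    # Peak throughput if batches are always available, in samples per second.
    peak_throughput = batch_size / compute_ms * 1000.0
    # The first sample in a batch waits for the rest of the batch to arrive.
    wait_ms = (batch_size - 1) * arrival_ms
    worst_latency_ms = wait_ms + compute_ms
    return peak_throughput, worst_latency_ms

# Hypothetical values chosen only to show the trend.
for batch_size in (1, 8, 64):
    throughput, latency = batch_stats(batch_size, fixed_ms=5.0,
                                      per_sample_ms=0.1, arrival_ms=1.0)
    print(f"batch={batch_size:3d}  peak throughput={throughput:8.1f} samples/s  "
          f"worst-case latency={latency:6.1f} ms")
```

In this model, larger batches amortize the fixed overhead and raise peak throughput, while the earliest sample in each batch waits longer, which is why batch-oriented accelerators can struggle with latency-critical workloads.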

ASICs are integrated circuits designed for a particular purpose or application. They are highly optimized in terms of power and performance for that one application. Compared with GPUs, they have less I/O bandwidth, limited memory and fewer other computing resources. Although they can attain moderate performance at low power, the downside is that the development time and cost to realize them are high.

FPGAs can be used to accelerate a portion of an algorithm by assigning the computationally intensive tasks to the programmable logic. They can attain high performance through extensive parallelism while being more energy efficient than GPUs, and they offer a shorter time to market and lower cost than ASICs. Another important feature of FPGAs is their reconfigurability, which GPUs and ASICs do not offer. As deep learning structures keep advancing, reconfigurability is an added advantage.

The following section lists the reasons for considering FPGAs as hardware accelerators.

FPGAs as Hardware Accelerators

Compared with GPUs, ASICs and FPGAs have less I/O bandwidth, limited memory and fewer other computing resources, but they can attain moderate performance at low power. ASICs are optimized for power and performance, but their cost and development time are higher, and they are not flexible. As an alternative to GPUs and ASICs, FPGA-based accelerators are currently used because of the following advantages:

  • FPGAs offer high performance per watt compared with GPUs, making them strong candidates for DNN computation and inference.
  • The architecture is customizable and flexible, so only the resources that are required need to be used.
  • FPGAs provide high throughput through massive parallelism at low latency.
  • FPGAs have on-chip block RAM, which allows faster data transfer than off-chip memory.
  • FPGAs are reconfigurable according to the application, which reduces time to market. As new machine learning algorithms evolve, shorter development time and reconfigurability make FPGAs a better option than ASICs.
  • Apart from power efficiency and throughput, the speed of a DNN deployed on an FPGA can be increased further when the inference algorithm uses low numeric precision in its calculations. For example, quantization converts a 32-bit or 64-bit floating-point network model to fixed point, which reduces the computational cost while maintaining reasonable accuracy.
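The quantization idea in the last point can be sketched as follows. This is a minimal illustration of signed fixed-point quantization, assuming a 16-bit word with 8 fractional bits; the function names and the bit widths are assumptions for this example and do not describe how any particular FPGA toolchain quantizes a model.

```python
def quantize(values, frac_bits=8, total_bits=16):
    """Convert floats to signed fixed-point integers with saturation."""
    scale = 1 << frac_bits
    qmin = -(1 << (total_bits - 1))
    qmax = (1 << (total_bits - 1)) - 1
    return [max(qmin, min(qmax, round(v * scale))) for v in values]

def dequantize(q_values, frac_bits=8):
    """Convert fixed-point integers back to floats."""
    scale = 1 << frac_bits
    return [q / scale for q in q_values]

def fixed_point_dot(qa, qb, frac_bits=8):
    """Dot product using integer multiply-accumulate only.

    The integer accumulator carries 2 * frac_bits fractional bits,
    so it is rescaled once at the end.
    """
    acc = sum(a * b for a, b in zip(qa, qb))
    return acc / (1 << (2 * frac_bits))

# Hypothetical weights and inputs for illustration.
weights = [0.5, -0.25, 1.0, 0.1234]
inputs = [1.0, 1.0, 1.0, 2.0]

q_weights = quantize(weights)
q_inputs = quantize(inputs)

float_result = sum(w * x for w, x in zip(weights, inputs))
fixed_result = fixed_point_dot(q_weights, q_inputs)

print("quantized weights:", q_weights)
print("float dot product:", float_result)
print("fixed dot product:", fixed_result)
```

The inner loop uses only integer multiplications and additions, which map to far less logic than floating-point units. The small difference between the two results comes from rounding each value to the nearest representable fixed-point number.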

On the other hand, one of the main reasons engineers hesitate to adopt FPGAs is the difficulty of programming them. An FPGA is programmed by describing its functionality in a Hardware Description Language (HDL) such as VHDL or Verilog, which is quite different from programming in a conventional language like C or C++.

To reduce this complexity, High-Level Synthesis (HLS) tools exist that synthesize high-level language code into HDL. FPGA vendors and third-party companies have also developed hardware frameworks for implementing inference on FPGAs. Xilinx and Intel each provide their own frameworks aimed at improving inference performance on their devices. Some of these frameworks are OpenCL, Intel’s OpenVINO, Xilinx DNNDK and Xilinx Vitis AI, which we will cover in Part 2 of this blog.

Read Part 2 here
