Floating Point Architectures

By | 29th February 2020

In the previous tutorials, we discussed fixed point architectures. The majority of FPGA based architectures are fixed point based. In a fixed point architecture, the number of bits reserved for the integer part and the fractional part is fixed, which limits the range of numbers that can be represented. This limitation results in truncation error at the output. Accuracy can be improved by increasing the word length, but this in turn increases hardware complexity, and beyond some point a further increase in word length cannot be tolerated. The floating point representation solves this problem.
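The trade-off between word length and truncation error can be seen in a small Python sketch. The function name `to_fixed` and the word lengths used are chosen here only for illustration and are not part of the architectures discussed in this tutorial.

```python
def to_fixed(x, int_bits=4, frac_bits=4):
    """Truncate a non-negative x to an unsigned fixed point word.

    Returns (word, value), where value is what the word actually represents.
    """
    scale = 1 << frac_bits
    word = int(x * scale)                      # bits below 2**-frac_bits are dropped
    word &= (1 << (int_bits + frac_bits)) - 1  # bits above the word length are lost
    return word, word / scale


x = 3.14159
# Truncation error for 4, 8 and 12 fractional bits (4 integer bits each time).
errors = [x - to_fixed(x, 4, frac_bits)[1] for frac_bits in (4, 8, 12)]
print(errors)
```

The error shrinks as fractional bits are added, but every extra bit widens the data path, and a value such as 20.0 still does not fit in 4 integer bits.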

We previously discussed the basics of floating point numbers in Digital System Design Basics. The use of floating point representation in real time implementations is limited because floating point architectures are complex to implement. Due to this higher complexity, floating point architectures are not well suited to rapid prototyping. On the other hand, floating point architectures provide better accuracy than fixed point architectures.

Floating Point Data Format

The objective of this tutorial is to discuss the basics of floating point representation and the basic architectures for floating point arithmetic operations. The two most widely used data formats of the IEEE 754 standard are single precision (32 bits) and double precision (64 bits). Here, however, we design the architectures for a 16-bit format to achieve moderate accuracy with lower resource usage. A floating point number can be represented in binary as

Figure 1: Floating Point Data Format for 16-bit

So, a floating point number has three fields: the sign field ( S , 1 bit), the exponent field ( E , 4 bits) and the mantissa ( M , 11 bits). A bias is added to the exponent before it is stored in the exponent field, in order to differentiate between negative and positive exponents. The decimal equivalent of this representation is

(-1)^{S} \times 1.M \times 2^{(E - bias)}

Here the exponent is 4 bits wide, so the bias is (2^3 - 1) = 7. The exponent field thus ranges from 0 to 15 (no code is reserved for infinity, which is not required here).
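As a minimal sketch of how a 16-bit word in this format maps to a value, the Python function below assumes the layout S | EEEE | MMMMMMMMMMM with a hidden leading 1 and a bias of 7. The name `decode_normal` is hypothetical, and only normal numbers are handled.

```python
BIAS = 7        # 2**3 - 1 for a 4-bit exponent field
MANT_BITS = 11  # width of the mantissa field


def decode_normal(word):
    """Decode a 16-bit word laid out as S | EEEE | MMMMMMMMMMM (normal numbers only)."""
    sign = (word >> 15) & 0x1
    exponent = (word >> MANT_BITS) & 0xF
    mantissa = word & 0x7FF
    significand = 1 + mantissa / (1 << MANT_BITS)  # restore the hidden bit
    return (-1) ** sign * significand * 2.0 ** (exponent - BIAS)


print(decode_normal(0b0_1001_11100000000))  # 7.5
```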

Example: Convert 7.5 to floating point representation.

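7.5 = 0\mathbf{1}11\_100000000000 = 1.111 \times 2^{2} = 0\_1001\_11100000000

Here the bit shown in bold is the hidden bit. The exponent field is 2 + 7 = 9 = 1001, and the mantissa holds the 11 bits that follow the hidden bit.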
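The conversion in this example can be sketched in Python as follows. The helper names `encode_normal` and `fields` are hypothetical, extra mantissa bits are truncated rather than rounded, and only zero and normal numbers are handled.

```python
import math

BIAS = 7        # 2**3 - 1 for a 4-bit exponent field
MANT_BITS = 11  # width of the mantissa field


def encode_normal(x):
    """Pack x into S | EEEE | MMMMMMMMMMM, truncating extra mantissa bits."""
    if x == 0:
        return 0
    sign = 1 if x < 0 else 0
    frac, exp = math.frexp(abs(x))  # abs(x) = frac * 2**exp with 0.5 <= frac < 1
    significand = frac * 2          # now 1 <= significand < 2
    exponent = exp - 1 + BIAS       # biased exponent field
    if not 1 <= exponent <= 15:
        raise ValueError("outside the normal range of this format")
    mantissa = int((significand - 1) * (1 << MANT_BITS))  # drop the hidden bit
    return (sign << 15) | (exponent << MANT_BITS) | mantissa


def fields(word):
    """Format a 16-bit word as sign_exponent_mantissa."""
    bits = format(word, "016b")
    return f"{bits[0]}_{bits[1:5]}_{bits[5:]}"


print(fields(encode_normal(7.5)))  # 0_1001_11100000000
```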

The maximum number that can be represented is

0\_1111\_11111111111 = 1.11111111111 \times 2^{8} = 511.875

The minimum number that can be represented is

0\_0000\_00000000001 = 1 \times 2^{-(7+11)} = 2^{-18}

Both numbers can also be negative. The minimum number above is not normalized, so it is actually the minimum subnormal number. The minimum normalized number is

0\_0001\_00000000000 = 1.00000000000 \times 2^{-6} = 0.015625

Zero is represented as

0\_0000\_00000000000 or 1\_0000\_00000000000
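The limits quoted above can be checked with exact arithmetic. This is a sketch that follows this tutorial's convention for the all-zero exponent field (no hidden bit, scale 2^{-7}); the variable names are chosen only for illustration. IEEE 754 itself scales subnormal numbers by 2^{(1 - bias)}, which would give 2^{-17} for this word length.

```python
from fractions import Fraction

BIAS = 7        # 2**3 - 1 for a 4-bit exponent field
MANT_BITS = 11  # width of the mantissa field

# 0_1111_11111111111: significand 1.11111111111, exponent field 15
largest = (2 - Fraction(1, 2**MANT_BITS)) * 2 ** (15 - BIAS)

# 0_0001_00000000000: significand 1.0, exponent field 1
smallest_normal = Fraction(2) ** (1 - BIAS)

# 0_0000_00000000001: no hidden bit, exponent field 0 (tutorial's convention)
smallest_subnormal = Fraction(1, 2**MANT_BITS) * Fraction(2) ** (0 - BIAS)

print(float(largest))          # 511.875
print(float(smallest_normal))  # 0.015625
print(smallest_subnormal)      # 1/262144, i.e. 2**-18
```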

Floating Point Architectures

All the floating point architectures designed here use a 16-bit data width. They are designed without optimization so that the floating point arithmetic is easy to follow. The architectures are designed to support normal numbers, but they can also support subnormal numbers. The following architectures are discussed to perform different arithmetic operations

Thus, in this tutorial, all the major floating point architectures are discussed and implemented using Verilog structural coding. Rounding is omitted here because it adds more hardware, and it can be treated in the same way as in fixed point.
