In the previous tutorials, we discussed fixed point architectures. The majority of FPGA-based architectures are fixed point. In a fixed point architecture, the number of bits reserved for the integer and fractional parts is fixed, which limits the range of numbers that can be represented. This limitation results in truncation error at the output. Accuracy can be increased by increasing the word length, but this in turn increases hardware complexity, and beyond some point a further increase in word length cannot be tolerated. The floating point representation solves this problem.

Previously, we discussed the basics of floating point numbers in Digital System Design Basics. The use of floating point representation in real-time implementations is limited because floating point architectures are complex to implement. Due to this higher complexity, floating point architectures are not well suited to rapid prototyping. On the other hand, floating point architectures provide better accuracy than fixed point architectures.

**Floating Point Data Format**

The objective of this tutorial is to discuss the basics of floating point representation and the basic architectures for floating point arithmetic operations. According to the IEEE 754 standard, there are two data formats for floating point numbers, viz, **single precision (32 bits)** and **double precision (64 bits)**. Here, however, we will design the architectures for a 16-bit format to achieve moderate accuracy with fewer resources. In binary, a 16-bit floating point number is laid out as a 1-bit sign, followed by a 4-bit exponent and an 11-bit mantissa.

So, a floating point number has three fields, viz, the **sign field** ($ S $), the **exponent field** ($ E $) and the **mantissa** ($ M $). A bias is added to the true exponent before it is stored in the exponent field, so that negative and positive exponents can be differentiated. For a normal number, with $ E $ denoting the stored exponent field, the decimal equivalent of this representation is

$ (-1)^{S}\times 1.M\times 2^{E-bias} $

Here the exponent has 4 bits, so the bias is $ (2^3 - 1) = 7 $. The exponent field thus ranges from 0 to 15 (neglecting infinity, which is not required).
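To make the field layout concrete, here is a minimal Python sketch of a software reference model for the format. It assumes the layout used in this tutorial (1 sign bit, 4 exponent bits, 11 mantissa bits, bias 7) and reads the stored exponent as the true exponent plus the bias, so a normal number has the value $ (-1)^{S}\times 1.M\times 2^{E-7} $. The names `unpack` and `decode_normal` are illustrative only, and zero and subnormal encodings are deliberately left out.

```python
BIAS = 7          # 2**3 - 1 for a 4-bit exponent
MANT_BITS = 11    # stored mantissa bits (the hidden bit is not stored)

def unpack(word):
    """Split a 16-bit word into its (sign, exponent, mantissa) fields."""
    sign = (word >> 15) & 0x1
    exponent = (word >> MANT_BITS) & 0xF
    mantissa = word & 0x7FF
    return sign, exponent, mantissa

def decode_normal(word):
    """Value of a normal number: (-1)^S * 1.M * 2^(E - bias)."""
    sign, exponent, mantissa = unpack(word)
    if exponent == 0:
        raise ValueError("zero and subnormal encodings are not handled in this sketch")
    significand = 1 + mantissa / 2**MANT_BITS   # restore the hidden bit
    return (-1)**sign * significand * 2.0**(exponent - BIAS)

print(decode_normal(0b0_0111_00000000000))  # 1.0
print(decode_normal(0b1_1000_10000000000))  # -3.0
```

An exponent field of 0111 equals the bias, so it encodes a true exponent of 0.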

**Example:** Convert $ 7.5 $ to floating point representation.

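$ 7.5 = 0\mathbf{1}11\_100000000000 = 0\_1001\_11100000000 $

Here the bit shown in bold is the hidden bit. In binary, $ 7.5 = 111.1 = 1.111\times 2^2 $, so the exponent field is $ 2 + 7 = 9 = 1001 $ and the mantissa field holds the 11 bits that follow the hidden bit.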
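The same conversion can be checked in software. The sketch below is one possible way to pack a real number into the 16-bit format, assuming a bias of 7, an 11-bit mantissa, and truncation of any extra mantissa bits. The function name `encode_normal` is hypothetical, and only the normal range is handled.

```python
import math

BIAS = 7
MANT_BITS = 11

def encode_normal(x):
    """Pack a nonzero real x into sign | exponent | mantissa, truncating extra bits."""
    if x == 0:
        raise ValueError("zero is not handled in this sketch")
    sign = 1 if x < 0 else 0
    frac, exp = math.frexp(abs(x))      # abs(x) = frac * 2**exp with 0.5 <= frac < 1
    true_exp = exp - 1                  # rewrite as 1.f * 2**true_exp
    stored_exp = true_exp + BIAS
    if not 1 <= stored_exp <= 15:
        raise ValueError("outside the normal range of this 16-bit format")
    mantissa = int((2 * frac - 1) * 2**MANT_BITS)   # drop the hidden bit, truncate
    return (sign << 15) | (stored_exp << MANT_BITS) | mantissa

word = encode_normal(7.5)
print(f"{word >> 15:01b}_{(word >> 11) & 0xF:04b}_{word & 0x7FF:011b}")  # 0_1001_11100000000
```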

The maximum number that can be represented is

$ 0\_1111\_11111111111 = 1. 11111111111\times 2^8 = 511.875 $

The minimum number that can be represented is

$ 0\_0000\_00000000001 = 1\times 2^{-(7+11)} $

Both numbers can also be negative. The minimum number above is not normalized, so it is actually the minimum subnormal number. If normalized, the minimum number is

$ 0\_0001\_00000000000 = 1. 00000000000\times 2^{-6} = 0.015625 $
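The largest value and the smallest normalized value can be cross-checked numerically. This is a small sketch assuming a bias of 7 and an 11-bit mantissa with a hidden leading 1. The helper `normal_value` is an illustrative name.

```python
BIAS = 7
MANT_BITS = 11

def normal_value(exponent, mantissa):
    """Magnitude of a normal number from its stored exponent and mantissa fields."""
    return (1 + mantissa / 2**MANT_BITS) * 2.0**(exponent - BIAS)

largest = normal_value(0b1111, 0b11111111111)   # all-ones exponent and mantissa
smallest_normal = normal_value(0b0001, 0)       # exponent field 1, mantissa 0
print(largest, smallest_normal)  # 511.875 0.015625
```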

Zero is represented as

$ 0\_0000\_00000000000 $ or $ 1\_0000\_00000000000 $

**Floating Point Architectures**

All the floating point architectures designed here use a 16-bit data width. These architectures are designed without optimization so that the floating point arithmetic is simple to understand. They are designed to support normal numbers but can also support subnormal numbers. The following architectures are discussed for performing the different arithmetic operations:

- Fixed Point to Floating Point Conversion
- Leading Zero Counter
- Floating Point Addition and Subtraction
- Floating Point Multiplication
- Floating Point Division
- Floating Point Square Root
- Floating Point Comparison
- Floating Point to Fixed Point Conversion
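As a rough software reference for the first two items in the list, the sketch below converts a fixed point word to the 16-bit floating point format with the help of a leading zero counter. It is not the hardware architecture itself. It assumes a sign-magnitude fixed point word with 1 sign bit, 3 integer bits and 12 fraction bits (inferred from the $ 7.5 $ example), a bias of 7, truncation instead of rounding, and normal results only. The function names are hypothetical.

```python
def leading_zeros(value, width):
    """Count the zero bits above the most significant 1 in a width-bit value."""
    count = 0
    for bit in range(width - 1, -1, -1):
        if (value >> bit) & 1:
            break
        count += 1
    return count

def fixed_to_float(word):
    """Convert an assumed sign-magnitude fixed point word (1 sign, 3 integer,
    12 fraction bits) to the 16-bit floating point format, normal range only."""
    sign = (word >> 15) & 1
    magnitude = word & 0x7FFF                 # 15-bit magnitude
    if magnitude == 0:
        return sign << 15                     # zero: exponent and mantissa all 0
    lz = leading_zeros(magnitude, 15)
    stored_exp = (2 - lz) + 7                 # a leading 1 at bit 14 has weight 2**2
    if stored_exp < 1:
        raise ValueError("result would be subnormal; not handled in this sketch")
    normalized = (magnitude << lz) & 0x7FFF   # leading 1 is now at bit 14
    mantissa = (normalized >> 3) & 0x7FF      # drop the hidden bit, truncate 3 LSBs
    return (sign << 15) | (stored_exp << 11) | mantissa

print(hex(fixed_to_float(0b0111_100000000000)))  # 0x4f00, i.e. 0_1001_11100000000
```

The leading zero count sets both the exponent and the left shift that normalizes the magnitude, which is why the counter is treated as its own building block.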

Thus, in this tutorial, all the major floating point architectures are discussed and implemented using Verilog structural coding. We have avoided rounding here, as it adds more hardware and can also be considered in fixed point.