Format for representing fixed point numbers
Posted: Wed Sep 30, 2009 2:46 am
				
				Fixed point numbers and number notation: 
In fixed-point numbers, the imaginary binary point plays a significant role in interpreting a number. The digits to the left of the binary point represent the integer part, and the digits to the right of it represent the fractional part. Here we study a format for representing fixed-point numbers, which we will use from here on.
Q notation (QF):
In this notation, the total number of fractional bits names the number format. One bit is assumed for the sign (always the MSB). The format name alone doesn't convey any information about the word length.
The notation QF means a number with F bits dedicated to the fractional part. If W is the word length of the processor, then a QF number has F fractional bits, W - (F + 1) integer bits and 1 sign bit.
For example,
Q15 means 15 fractional bits and one sign bit.
Q14 means 14 fractional bits and one sign bit.
Q1 means 1 fractional bit and one sign bit.
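The bit split described above can be sketched in a few lines of Python. This is only an illustration: the function name `q_layout` and the default word length of 16 are assumptions made for the example, not something fixed by the notation.

```python
def q_layout(f, w=16):
    """Return (sign_bits, integer_bits, fractional_bits) for a QF number.

    f is the number of fractional bits; w is the word length
    (16 is an assumed default for this sketch).
    """
    if not 0 <= f <= w - 1:
        raise ValueError("F must leave room for the sign bit")
    sign_bits = 1                # the MSB is always the sign
    integer_bits = w - (f + 1)   # whatever is left after sign and fraction
    return sign_bits, integer_bits, f


for f in (15, 14, 1):
    s, i, frac = q_layout(f)
    print(f"Q{f}: {s} sign bit, {i} integer bits, {frac} fractional bits")
```

With a 16-bit word, Q15 leaves no integer bits at all, while Q1 leaves 14.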
How to select a Q-point format to represent a float value in fixed-point format
For example, consider the float value 12.435.
Steps involved in selecting the Q-format for this value:
- Calculate the number of bits needed to represent the integer part (QI) of the given float value: 4 bits.
- Calculate the word length needed to represent the float value in fixed point:
- 1 bit for sign representation (S).
- 4 bits for integer representation (QI).
- QF = 10, from the equation discussed in the previous post, ceiling(log2(1/ε)), where for this example ε = 0.001.
- WL = QI + QF + S, therefore WL = (4 + 10 + 1) = 15 bits, to guarantee both range and resolution.
- So the Q-format needed to represent the above float number is Q10.
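The steps above can be sketched as a small Python helper. The name `select_q_format` and its parameters are hypothetical, and the sketch assumes a resolution of 0.001, which is the value for which ceiling(log2(1/resolution)) gives the QF = 10 used in the steps.

```python
import math


def select_q_format(value, resolution):
    """Return (QI, QF, WL) for a float value and a required resolution.

    Hypothetical helper for illustration only.
    """
    # Bits needed for the integer part, e.g. 12 -> 0b1100 -> 4 bits.
    qi = int(abs(value)).bit_length()
    # Fractional bits needed to reach the requested resolution.
    qf = math.ceil(math.log2(1 / resolution))
    sign = 1
    # Word length that guarantees both range and resolution.
    wl = qi + qf + sign
    return qi, qf, wl


qi, qf, wl = select_q_format(12.435, 0.001)
print(f"QI = {qi}, QF = {qf}, WL = {wl} -> Q{qf}")
```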
Now assume that the word length the programmer uses to represent a float value in fixed-point format is WL = 16. (The programmer should make this choice depending on the maximum range of the input float values.)
Note: Float-to-fixed conversion comes at the cost of precision loss.
Consider the above example:
Float value = 12.435
(12)10 -> (1100)2 -> 4 bits are required to represent the integer part.
Sign -> 1 bit for sign representation.
As we have seen before, the number of fractional bits available for the float value gives the Q-point format in which the float value is represented in fixed point.
The Q-format for the above float value is (16 - (4 + 1)), i.e. QF = 11.
So the above float value can be represented in Q11 format.
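To make the Q11 result and the precision loss concrete, here is a minimal Python sketch of the conversion in both directions. The function names are hypothetical, and rounding to the nearest integer is an assumption of this sketch; truncation is another common choice.

```python
def float_to_fixed(x, f, wl=16):
    """Convert a float to a signed QF integer (round to nearest, an assumption)."""
    scaled = round(x * (1 << f))          # multiply by 2^F
    lo = -(1 << (wl - 1))                 # smallest signed value for wl bits
    hi = (1 << (wl - 1)) - 1              # largest signed value for wl bits
    if not lo <= scaled <= hi:
        raise OverflowError("value does not fit in the chosen word length")
    return scaled


def fixed_to_float(n, f):
    """Convert a signed QF integer back to a float (divide by 2^F)."""
    return n / (1 << f)


wl = 16
qi = 4                    # integer bits needed for 12
qf = wl - (qi + 1)        # 16 - (4 + 1) = 11
raw = float_to_fixed(12.435, qf, wl)
back = fixed_to_float(raw, qf)
print(f"Q{qf}: raw = {raw}, back = {back}, error = {back - 12.435}")
```

The value read back differs slightly from 12.435, which is the precision loss mentioned in the note above.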