Ch2.2.2 Floating Point Numbers, Part 1

Floating Point Numbers

Floating point numbers provide a way around the limitations of binary integers.
Floating point numbers are capable of storing noninteger values, such as 2.71828182845905, 3.14159265358979, and 0.25.
How many significant digits are shown below? (Ans = 5)

Floating Point Numbers = R Default

Floating point numbers can store much larger values than R's 32-bit integers, and they are R's default numeric type.

as.integer(2^31)

[1] NA

2^31

[1] 2147483648

(2^31)*(2^31)

[1] 4.611686e+18
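As a rough check of the limits involved, the sketch below compares the largest R integer with what a double can hold. It uses only the base R constant `.Machine`; the variable names `int_max` and `over` are illustrative.

```r
# Largest value an R integer can hold: 2^31 - 1
int_max <- .Machine$integer.max
int_max == 2^31 - 1                      # TRUE

# 2^31 is one past that limit, so the conversion gives NA (with a warning)
over <- suppressWarnings(as.integer(2^31))
is.na(over)                              # TRUE

# Arithmetic on numeric literals is carried out in double precision
typeof(2^31)                             # "double"

# Largest finite double on this platform
.Machine$double.xmax
```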

Floating Point Numbers: Double Precision

There are several standards for floating point.
We focus on double precision, or just double.
It uses twice the storage (64 bits) of the single precision floating point format (32 bits).
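What those 64 bits buy can be inspected from within R. The following is a minimal sketch, assuming a platform with IEEE 754 doubles (the usual case); `eps` and `digits10` are illustrative names.

```r
# Bits in the significand, counting the implicit leading 1
.Machine$double.digits                # 53 for IEEE 754 doubles

# Machine epsilon: the gap between 1 and the next larger double
eps <- .Machine$double.eps
eps == 2^(-52)                        # TRUE for IEEE 754 doubles

# Equivalent number of significant decimal digits
digits10 <- .Machine$double.digits * log10(2)
digits10                              # just under 16
```

This is why the constants quoted earlier, such as 3.14159265358979, are written with about 15 significant digits.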

Floating Point Numbers: Double Precision

Compare with single precision.

Floating Point Numbers: Single Precision

The C float data type is sometimes called single precision.
R can convert a number to a single precision value.
The underlying data type within R is still double precision.
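A short sketch using base R's `as.single()` illustrates the point: the value is only tagged as single precision (for the `.C` and `.Fortran` interfaces), while its storage stays double. The variable `x` is illustrative.

```r
x <- as.single(3.14159265358979)

# The storage type is unchanged
typeof(x)                              # "double"

# as.single() marks the value with a "Csingle" attribute
attr(x, "Csingle")                     # TRUE

# No precision is dropped inside R itself
as.numeric(x) == 3.14159265358979      # TRUE
```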

Example 1: Single Precision

Consider the example shown in the figure.

The 8-bit exponent range is split in half so that both positive and negative exponents can be represented.

(x <- 2^8 - 1)

[1] 255

x/2

[1] 127.5

1.75*2^(-120)

[1] 1.316554e-36

Scientific Notation in Base 10

Scientific notation always starts with a nonzero digit \( d \), where \( 0 < d \leq 9 \), followed by a decimal part \( f \):

\[ x = 8.72 \times 10^{-3} \]

In this example, we can think of the number \( x \) as

\[ \begin{align*} x & = (8 + f) \times 10^{-3} \\ f & = 0.72 \end{align*} \]

The number \( f \) is the fractional part, or mantissa.

Floating Point Form, Base 10

In general,

\[ \begin{align*} x & = (-1)^s10^n(d + f) \\ 0 & < d \leq9 \\ 0 & \leq f < 1 \end{align*} \]
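The decomposition can be tried numerically. The helper below is a sketch, not a robust implementation: the name `sci_parts` is hypothetical, and rounding in `log10()` could misplace values extremely close to a power of 10. It is applied to the negative of the value with \( d = 8 \), \( f = 0.72 \), and \( n = -3 \).

```r
# Split x into sign s, exponent n, leading digit d, and fraction f (base 10)
sci_parts <- function(x) {
  stopifnot(is.finite(x), x != 0)
  s <- as.integer(x < 0)          # sign indicator
  n <- floor(log10(abs(x)))       # power of 10
  m <- abs(x) / 10^n              # d + f, a value in [1, 10)
  d <- floor(m)                   # leading nonzero digit
  c(s = s, n = n, d = d, f = m - d)
}

parts <- sci_parts(-0.00872)
parts

# Rebuild x from its parts: (-1)^s * 10^n * (d + f)
(-1)^parts[["s"]] * 10^parts[["n"]] * (parts[["d"]] + parts[["f"]])
```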

In base 10, the leading digit (here \( d = 8 \)) must be stored.
This is not so in binary, where the only nonzero digit is 1, so \( d = 1 \) is implied and need not be stored.

Floating Point Form, Base 2 (Single)

For binary,

\[ x = (-1)^{s}2^{n}(1 + f) \]

Floating point form, with the exponent stored as the characteristic \( c = n + 127 \):

\[ x = (-1)^s2^{c-127}(1 + f) \]

(x <- 2^8 - 1)

[1] 255

x/2

[1] 127.5
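The bias of 127 and the resulting exponent range can be sketched as follows. The reserved values \( c = 0 \) and \( c = 255 \) come from the IEEE 754 standard rather than from these slides, and the variable names are illustrative.

```r
exp_bits <- 8
bias <- 2^(exp_bits - 1) - 1         # 127, i.e. 127.5 rounded down
bias

# IEEE 754 reserves c = 0 (zero, subnormals) and c = 255 (Inf, NaN),
# so normalized numbers use c = 1, ..., 254
c_range <- c(1, 2^exp_bits - 2)
p_range <- c_range - bias
p_range                              # -126 127
```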

Floating Point Form, Base 2 (Double)

For binary,

\[ x = (-1)^{s} 2^{n} (1 + f) \]

Floating point form, with the exponent stored as the characteristic \( c = n + 1023 \):

\[ x = (-1)^s 2^{c-1023} (1 + f) \]

(x <- 2^11 - 1)

[1] 2047

x/2

[1] 1023.5
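As a sanity check of the double precision form, the sketch below evaluates \( (-1)^s 2^{c-1023}(1+f) \) at the extreme normalized fields and compares with R's own limits. It assumes IEEE 754 doubles with a 52-bit fraction; `fp_value` is a hypothetical helper name.

```r
bias <- 2^(11 - 1) - 1                # 1023

# Value of a normalized number from its sign, characteristic, and fraction
fp_value <- function(s, c_biased, f, bias) {
  (-1)^s * 2^(c_biased - bias) * (1 + f)
}

# Largest finite double: c = 2046 with all 52 fraction bits set
largest <- fp_value(0, 2046, 1 - 2^(-52), bias)
largest == .Machine$double.xmax       # TRUE

# Smallest normalized double: c = 1 with f = 0
smallest <- fp_value(0, 1, 0, bias)
smallest == .Machine$double.xmin      # TRUE
```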

Example 2: Single Precision

Consider the long real form

\[ x = (0)(10100001)(01100...0) \]

Write in floating point form

\[ (-1)^s 2^{c-127} (1+f) \]

We need to find \( s \), \( f \), \( c \), and \( p = c - 127 \).

Example 2: Single Precision

Given

\[ x = (0)(10100001)(01100...0) \]

Identify sign indicator \( s \):

(s <- 0)

[1] 0

Example 2: Single Precision

Given

\[ x = (0)(10100001)(01100...0) \]

Find characteristic \( c \) and power \( p = c - 127 \):

(c <- 1*2^7+0*2^6+1*2^5+0*2^4+0*2^3+0*2^2+0*2^1+1*2^0)

[1] 161

(p <- c - 127)

[1] 34
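The same sum can be written in vectorized form, which scales better than typing each term. This is a sketch with illustrative names (`c_bits`, `c_val`); `strtoi()` is the base R function for parsing a string in a given base.

```r
# Exponent field of Example 2, most significant bit first
c_bits <- c(1, 0, 1, 0, 0, 0, 0, 1)

# Weight each bit by its power of two: 2^7, 2^6, ..., 2^0
c_val <- sum(c_bits * 2^(7:0))
c_val                                 # 161

# Base R can also parse the bit string directly
strtoi("10100001", base = 2)          # 161

p <- c_val - 127
p                                     # 34
```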

Example 2: Single Precision

Given

\[ x = (0)(10100001)(01100...0) \]

Determine mantissa \( f \):

(f <- 0*(1/2)+1*(1/2)^2+1*(1/2)^3+0*(1/2)^4)

[1] 0.375
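A vectorized version of the mantissa sum is sketched below, assuming (as the notation 01100...0 indicates) that all remaining fraction bits are zero. The names `f_bits` and `f_full` are illustrative.

```r
# Leading fraction bits of Example 2
f_bits <- c(0, 1, 1, 0)

# Bit k carries weight (1/2)^k
f <- sum(f_bits * (1/2)^seq_along(f_bits))
f                                     # 0.375

# Padding with zeros to the full 23 bits does not change the value
f_full <- c(f_bits, rep(0, 23 - length(f_bits)))
sum(f_full * (1/2)^(1:23)) == f       # TRUE
```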

Example 2: Single Precision

Thus the floating point form is

\[ \begin{align*} x &= (-1)^0 2^{34} (1+0.375) \Leftarrow \mathrm{Answer} \\ & = 1.375 \times 2^{34} \end{align*} \]

Determine base-10 value for \( x \):

(x <- (-1)^s*2^p*(1+f))

[1] 23622320128
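The three steps can be combined into one helper that decodes a full 32-bit string. This is a minimal sketch: `decode_single` is a hypothetical name, it handles only normalized numbers (characteristic neither 0 nor 255), and the bit string assumes the elided fraction bits are all zero.

```r
# Decode a 32-character string of 0s and 1s as a single precision value
decode_single <- function(bits) {
  b <- as.integer(strsplit(bits, "")[[1]])
  stopifnot(length(b) == 32, all(b %in% c(0L, 1L)))
  s <- b[1]                                  # sign indicator
  c_biased <- sum(b[2:9] * 2^(7:0))          # characteristic
  f <- sum(b[10:32] * (1/2)^(1:23))          # mantissa
  (-1)^s * 2^(c_biased - 127) * (1 + f)
}

# Example 2: sign 0, exponent 10100001, fraction 011 followed by zeros
bits <- paste0("0", "10100001", "011", strrep("0", 20))
decode_single(bits)                          # 23622320128
```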

Example 3: Single Precision

Consider the long real form

\[ x = (1)(01010010)(100100...0) \]

Write in floating point form

\[ (-1)^s 2^{c-127} (1+f) \]

We need to find \( s \), \( f \), \( c \), and \( p = c - 127 \).

Example 3: Single Precision

Given

\[ x = (1)(01010010)(100100...0) \]

Identify sign indicator \( s \):

(s <- 1 )

[1] 1

Example 3: Single Precision

Given \( x = (1)(01010010)(100100...0) \)
Find characteristic \( c \) and power \( p = c - 127 \):

From Example 2: \( \, x = (0)(10100001)(01100...0) \)

c <- 1*2^7+0*2^6+1*2^5+0*2^4+0*2^3+0*2^2+0*2^1+1*2^0

Example 3: Single Precision

Given \( x = (1)(01010010)(100100...0) \)
Find characteristic \( c \) and power \( p = c - 127 \):

(c <- 0*2^7+1*2^6+0*2^5+1*2^4+0*2^3+0*2^2+1*2^1+0*2^0)

[1] 82

(p <- c - 127)

[1] -45

Example 3: Single Precision

Given \( x = (1)(01010010)(100100...0) \)
Determine mantissa \( f \):

From Example 2: \( \, x = (0)(10100001)(01100...0) \)

f <- 0*(1/2)+1*(1/2)^2+1*(1/2)^3+0*(1/2)^4

Example 3: Single Precision

Given \( x = (1)(01010010)(100100...0) \)
Determine mantissa \( f \):

(f <- 1*(1/2)+0*(1/2)^2+0*(1/2)^3+1*(1/2)^4)

[1] 0.5625

Example 3: Single Precision

Thus the floating point form is

\[ \begin{align*} x &= (-1)^1 2^{-45} (1+0.5625) \Leftarrow \mathrm{Answer} \\ & = - 1.5625 \times 2^{-45} \end{align*} \]

Determine base-10 value for \( x \):

(x <- (-1)^s*2^p*(1+f))

[1] -4.440892e-14
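The calculation can also be run in reverse, recovering \( s \), \( p \), \( f \), and \( c \) from the base-10 value. The sketch below uses `log2()` to find the power; it is illustrative only and could misjudge values within rounding error of a power of 2.

```r
x <- -1.5625 * 2^(-45)

s <- as.integer(x < 0)             # sign indicator
p <- floor(log2(abs(x)))           # power of 2
f <- abs(x) / 2^p - 1              # mantissa, 0 <= f < 1
c_biased <- p + 127                # characteristic

c(s = s, p = p, f = f, c = c_biased)
```

The recovered values match the fields read from the bit pattern: \( s = 1 \), \( p = -45 \), \( f = 0.5625 \), and \( c = 82 \).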

Example 4: Book Example

Our book provides this binary representation of 1000:

\[ 1000 = \mathtt{0b1111101000} \times 2^9 \]

Here, \( s=0 \), \( p = 9 \), and \( f = 1111101000 \).
Let's see if this is correct.

\[ x = (-1)^{s}2^{p}(1 + f) \]

Example 4: Book Example

With \( f = 1111101000 \), we have

(f <- 1*(1/2) + 1*(1/2)^2 + 1*(1/2)^3 + 1*(1/2)^4 + 1*(1/2)^5 + 0*(1/2)^6 + 1*(1/2)^7)

[1] 0.9765625

(1+f)*2^9

[1] 1012

The result is 1012, not 1000, so the book is incorrect.

Example 4: Book Example Correction

Our book should have

\[ 1000 = \mathtt{0b111101000} \times 2^9 \]

instead of

\[ 1000 = \mathtt{0b1111101000} \times 2^9 \]

Let's check this next.

\[ x = (-1)^{s}2^{p}(1 + f) \]

Example 4: Book Example Correction

With \( f = 111101000 \), we have

(f <- 1*(1/2) + 1*(1/2)^2 + 1*(1/2)^3 + 1*(1/2)^4 + 0*(1/2)^5 + 1*(1/2)^6 + 0*(1/2)^7)

[1] 0.953125

(1+f)*2^9

[1] 1000

Thus our correction is valid.
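The corrected digits can be cross-checked with base R's `intToBits()`, which returns the bits of an integer least significant first. In this sketch (variable names are illustrative), the ten binary digits of the integer 1000 turn out to be 1111101000, and dropping the leading 1 leaves exactly the corrected fraction 111101000.

```r
# Binary digits of 1000, most significant first
bits <- rev(as.integer(intToBits(1000L))[1:10])
paste(bits, collapse = "")            # "1111101000"

# Drop the implicit leading 1 to obtain the fraction bits
f_bits <- bits[-1]
paste(f_bits, collapse = "")          # "111101000"

f <- sum(f_bits * (1/2)^seq_along(f_bits))
(1 + f) * 2^9                         # 1000
```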

Example 5: Book Example

Our book provides this binary representation of 0.75:

\[ 0.75 = \mathtt{0b0001} \times 2^{-1} \]

Here, \( s=0 \), \( p = -1 \), and \( f = 0001 \).
Let's see if this is correct.

\[ x = (-1)^{s}2^{p}(1 + f) \]

Example 5: Book Example

With \( f = 0001 \), we have

(f <- 0*(1/2) + 0*(1/2)^2 + 0*(1/2)^3 + 1*(1/2)^4 )

[1] 0.0625

(1+f)*2^(-1)

[1] 0.53125

The result is 0.53125, not 0.75, so the book is incorrect.

Example 5: Book Example Correction

Our book should have

\[ 0.75 = \mathtt{0b1000} \times 2^{-1} \]

instead of

\[ 0.75 = \mathtt{0b0001} \times 2^{-1} \]

Let's check this next.

\[ x = (-1)^{s}2^{p}(1 + f) \]

Example 5: Book Example Correction

With \( f = 1000 \), we have

(f <- 1*(1/2) + 0*(1/2)^2 + 0*(1/2)^3 + 0*(1/2)^4)

[1] 0.5

(1+f)*2^(-1)

[1] 0.75

Thus our correction is valid.

Example 5: Book Example

One more comment for this example from the book.

\[ 0.75 = \mathtt{0b1000} \times 2^{-1} \]

The author explains that “there is only one digit included because the initial digit is implicit. Therefore this number is \( 0b0.11 \) in binary.”
For the implicit initial digit, the author is probably referring to the 1 in the \( 1+f \).
For the \( 0b0.11 \), the author is probably referring to

\[ 1 \times \left(\tfrac{1}{2}\right) + 1 \times \left(\tfrac{1}{2}\right)^2 = 0.75 \]
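The binary digits of a fraction can be generated by repeated doubling, which connects the two readings. The helper `frac_bits` below is a hypothetical sketch for values in \( [0, 1) \), not something taken from the book.

```r
# Binary digits of a fraction 0 <= x < 1, found by repeated doubling
frac_bits <- function(x, n) {
  bits <- numeric(n)
  for (i in seq_len(n)) {
    x <- 2 * x
    bits[i] <- floor(x)    # next binary digit (0 or 1)
    x <- x - bits[i]       # keep the remaining fraction
  }
  bits
}

# 0.75 written as a plain binary fraction: 0b0.1100
frac_bits(0.75, 4)                    # 1 1 0 0

# Normalized, 0.75 = (1 + f) * 2^(-1), so f = 0.5 and its bits are 1000
frac_bits(0.75 / 2^(-1) - 1, 4)       # 1 0 0 0
```

The first call gives the digits of \( 0b0.11 \), and the second gives the stored fraction once the implicit leading 1 is factored out.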