suppressPackageStartupMessages(library("tidyverse"))
package 'tidyverse' was built under R version 3.6.3
1. Describe the difference between is.finite(x) and !is.infinite(x).
To find out, try the functions on a numeric vector that includes at least one number and the four special values (NA, NaN, Inf, -Inf).
x <- c(0, NA, NaN, Inf, -Inf)
is.finite(x)
[1] TRUE FALSE FALSE FALSE FALSE
!is.infinite(x)
[1] TRUE TRUE TRUE FALSE FALSE
The is.finite() function considers non-missing numeric values to be finite, and missing (NA), not a number (NaN), positive infinity (Inf), and negative infinity (-Inf) to not be finite. The is.infinite() function behaves slightly differently. It considers Inf and -Inf to be infinite, and everything else, including non-missing numbers, NA, and NaN, to not be infinite.
Results of is.finite() and is.infinite() for numeric and special values.
Value | is.finite() | is.infinite()
1 | TRUE | FALSE
NA | FALSE | FALSE
NaN | FALSE | FALSE
Inf | FALSE | TRUE
-Inf | FALSE | TRUE
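The comparison can also be generated directly in R. The following is a minimal sketch using only base R; the column names and the extra is.na() column are choices made here for illustration. It shows that the two functions disagree exactly on the missing values.

```r
# Values to classify: one ordinary number plus the four special values.
x <- c(1, NA, NaN, Inf, -Inf)

results <- data.frame(
  value = as.character(x),
  is_finite = is.finite(x),
  is_infinite = is.infinite(x),
  is_na = is.na(x)
)
print(results)

# is.finite(x) and !is.infinite(x) differ only where is.na(x) is TRUE,
# that is, for NA and NaN.
disagree <- is.finite(x) != !is.infinite(x)
print(disagree)
```

In other words, is.finite(x) behaves like !is.infinite(x) & !is.na(x) for numeric input.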
2. Read the source code for dplyr::near() (Hint: to see the source code, drop the ()). How does it work?
The source for dplyr::near is:
dplyr::near
function (x, y, tol = .Machine$double.eps^0.5)
{
abs(x - y) < tol
}
<bytecode: 0x000001ef57999068>
<environment: namespace:dplyr>
Instead of checking for exact equality, it checks that two numbers are within a certain tolerance, tol. By default the tolerance is set to the square root of .Machine$double.eps, which is the smallest positive floating point number x such that 1 + x != 1.
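A short example shows why the tolerance matters. The function near_sketch() below is a hypothetical stand-in that copies the body of dplyr::near() shown above, so the example runs with base R alone.

```r
# Hypothetical stand-in for dplyr::near(), copied from the source shown above.
near_sketch <- function(x, y, tol = .Machine$double.eps^0.5) {
  abs(x - y) < tol
}

a <- sqrt(2)^2

# Exact comparison fails because sqrt(2) cannot be stored exactly.
exact <- a == 2
# Comparison within the default tolerance succeeds.
approx <- near_sketch(a, 2)

print(c(exact = exact, approx = approx))
```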
3. A logical vector can take 3 possible values. How many possible values can an integer vector take? How many possible values can a double take? Use Google to do some research.
For integer vectors, R uses a 32-bit representation. This means that it can represent up to
\(2^{32}\) different values with integers. One of these values is set aside for NA_integer_. From the help for integer:
Note that current implementations of R use 32-bit integers for integer vectors, so the range of representable integers is restricted to about +/-2*10^9: doubles can hold much larger integers exactly.
The range of integer values that R can represent in an integer vector is \(\pm (2^{31}-1)\):
.Machine$integer.max
[1] 2147483647
The maximum integer is \(2^{31}-1\) rather than \(2^{32}\) because 1 bit is used to represent the sign (\(+\), \(-\)) and one value is used to represent NA_integer_.
If you try to represent an integer greater than that value, R will return NA values.
.Machine$integer.max + 1L
NAs produced by integer overflow
[1] NA
However, you can represent that value (exactly) with a numeric vector at the cost of about two times the memory.
as.numeric(.Machine$integer.max) + 1
[1] 2147483648
The same is true for the negative of the integer max.
-.Machine$integer.max - 1L
NAs produced by integer overflow
[1] NA
For double vectors, R uses a 64-bit representation. This means that they can take at most \(2^{64}\) distinct values. However, some of those values are allocated to special values such as -Inf, Inf, NA_real_, and NaN. From the help for double:
All R platforms are required to work with values conforming to the IEC 60559 (also known as IEEE 754) standard. This basically works with a precision of 53 bits, and represents to that precision a range of absolute values from about 2e-308 to 2e+308. It also has special values NaN (many of them), plus and minus infinity and plus and minus zero (although R acts as if these are the same). There are also denormal(ized) (or subnormal) numbers with absolute values above or below the range given above but represented to less precision.
The details of floating point representation and arithmetic are complicated, beyond the scope of this question, and better discussed in the references provided below. The double can represent numbers in the range of about \(\pm 2\times10^{308}\) , which is provided in
.Machine$double.xmax
[1] 1.797693e+308
Many other details for the implementation of the double vectors are given in the .Machine variable (and its documentation). These include the base (radix) of doubles,
.Machine$double.base
[1] 2
the number of bits used for the significand (mantissa),
.Machine$double.digits
[1] 53
the number of bits used in the exponent,
.Machine$double.exponent
[1] 11
and the smallest positive numbers x such that 1 + x != 1 (double.eps) and 1 - x != 1 (double.neg.eps),
.Machine$double.eps
[1] 2.220446e-16
.Machine$double.neg.eps
[1] 1.110223e-16
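The meaning of .Machine$double.eps and of the 53-bit significand can be checked directly. This is a small sketch in base R; the variable names are chosen here for illustration.

```r
eps <- .Machine$double.eps

# Adding eps to 1 produces a different double ...
one_plus_eps <- 1 + eps
# ... but adding half of eps rounds back to exactly 1.
one_plus_half_eps <- 1 + eps / 2

# With a 53-bit significand, consecutive integers are exact only up to 2^53.
big <- 2^53
big_plus_one <- big + 1 # not representable, rounds back to 2^53

print(c(one_plus_eps != 1, one_plus_half_eps == 1, big_plus_one == big))
```

All three comparisons should print TRUE on a platform that follows IEEE 754 double precision.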
- Computerphile, “Floating Point Numbers”
- https://en.wikipedia.org/wiki/IEEE_754
- https://en.wikipedia.org/wiki/Double-precision_floating-point_format
- “Floating Point Numbers Why floating point numbers are needed”
- Fabien Sanglard, “Floating Point Numbers: Visually Explained”
- James Howard, “How Many Floating Point Numbers are There?”
- GeeksforGeeks, “Floating Point Representation Basics”
- Chris Hecker, “Lets Go to the (Floating) Point”, Game Developer
- Chua Hock-Chuan, A Tutorial on Data Representation Integers, Floating-point Numbers, and Characters
- John D. Cook, “Anatomy of a floating point number”
- John D. Cook, “Five Tips for Floating Point Programming”
4. Brainstorm at least four functions that allow you to convert a double to an integer. How do they differ? Be precise.
The functions that convert a double to an integer differ in how they deal with the fractional part of the double. There are a variety of rules that could be used to do this.
- Round down, towards \(-\infty\). This is also called taking the floor of a number. This is the method the floor() function uses.
- Round up, towards \(+\infty\). This is also called taking the ceiling. This is the method the ceiling() function uses.
- Round towards zero. This is the method that the trunc() and as.integer() functions use.
- Round away from zero.
- Round to the nearest integer. There are several different methods for handling ties, which are numbers with a fractional part of 0.5.
  - Round half down, towards \(-\infty\).
  - Round half up, towards \(+\infty\).
  - Round half towards zero.
  - Round half away from zero.
  - Round half towards the even integer. This is the method that the round() function uses.
  - Round half towards the odd integer.
# The name round_to_integer is chosen here for illustration.
round_to_integer <- function(x, method) {
  if (method == "round down") {
    floor(x)
  } else if (method == "round up") {
    ceiling(x)
  } else if (method == "round towards zero") {
    trunc(x)
  } else if (method == "round away from zero") {
    sign(x) * ceiling(abs(x))
  } else if (method == "nearest, round half up") {
    floor(x + 0.5)
  } else if (method == "nearest, round half down") {
    ceiling(x - 0.5)
  } else if (method == "nearest, round half towards zero") {
    sign(x) * ceiling(abs(x) - 0.5)
  } else if (method == "nearest, round half away from zero") {
    sign(x) * floor(abs(x) + 0.5)
  } else if (method == "nearest, round half to even") {
    round(x, digits = 0)
  } else if (method == "nearest, round half to odd") {
    case_when(
      # smaller integer is odd - round half down
      floor(x) %% 2 == 1 ~ ceiling(x - 0.5),
      # otherwise, round half up
      TRUE ~ floor(x + 0.5)
    )
  } else if (method == "nearest, round half randomly") {
    # for each element, choose at random whether ties round up or down
    round_half_up <- sample(c(TRUE, FALSE), length(x), replace = TRUE)
    y <- x
    y[round_half_up] <- floor(x[round_half_up] + 0.5)
    y[!round_half_up] <- ceiling(x[!round_half_up] - 0.5)
    y
  } else {
    stop("Unknown method: ", method)
  }
}
tibble(
x = c(
1.8, 1.5, 1.2, 0.8, 0.5, 0.2,
-0.2, -0.5, -0.8, -1.2, -1.5, -1.8
),
`Round down` = floor(x),
`Round up` = ceiling(x),
`Round towards zero` = trunc(x),
`Nearest, round half to even` = round(x)
)
See the Wikipedia articles, Rounding and IEEE floating point for more discussion of these rounding rules.
For rounding, R and many programming languages use the IEEE standard. This method is called “round to nearest, ties to even” (see the documentation for .Machine$double.rounding). This rule rounds ties, numbers with a remainder of 0.5, to the nearest even number. In this rule, half the ties are rounded up, and half are rounded down. For comparison, the following function, round2(), manually rounds ties towards \(+\infty\) (“round half up”); its to_even argument is accepted but not used.
x <- seq(-10, 10, by = 0.5)
round2 <- function(x, to_even = TRUE) {
q <- x %/% 1
r <- x %% 1
q + (r >= 0.5)
}
x <- c(-12.5, -11.5, 11.5, 12.5)
round(x)
[1] -12 -12 12 12
round2(x, to_even = FALSE)
[1] -12 -11 12 13
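For contrast, the “ties to even” rule itself can be written by hand. The function round_half_even() below is a hypothetical helper, not part of base R or of the original solution. It is a sketch that assumes finite inputs and exists only to show the rule that round() applies to exact halves.

```r
# Hypothetical helper: round to nearest integer, ties to the even integer.
round_half_even <- function(x) {
  q <- floor(x) # integer just below (or equal to) x
  r <- x - q    # fractional part in [0, 1)
  # Round up when clearly above the midpoint, or when exactly at the
  # midpoint and the lower integer is odd (so the upper one is even).
  ifelse(r > 0.5 | (r == 0.5 & q %% 2 == 1), q + 1, q)
}

x_ties <- c(-12.5, -11.5, 11.5, 12.5)
rounded <- round_half_even(x_ties)
print(rounded)
```

On these four ties the helper agrees with round(), while round2() does not.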
This rounding method may be different from the one you learned in grade school, which, at least for me, was to always round ties upwards, or, alternatively, away from zero. This rule is called the “round half up” rule. The problem with the “round half up” rule is that it is biased upwards for positive numbers. Rounding to nearest with ties towards even is not. Consider this sequence, which sums to zero.
x <- seq(-100.5, 100.5, by = 1)
x
[1] -100.5 -99.5 -98.5 -97.5 -96.5 -95.5 -94.5 -93.5 -92.5 -91.5 -90.5 -89.5
[13] -88.5 -87.5 -86.5 -85.5 -84.5 -83.5 -82.5 -81.5 -80.5 -79.5 -78.5 -77.5
[25] -76.5 -75.5 -74.5 -73.5 -72.5 -71.5 -70.5 -69.5 -68.5 -67.5 -66.5 -65.5
[37] -64.5 -63.5 -62.5 -61.5 -60.5 -59.5 -58.5 -57.5 -56.5 -55.5 -54.5 -53.5
[49] -52.5 -51.5 -50.5 -49.5 -48.5 -47.5 -46.5 -45.5 -44.5 -43.5 -42.5 -41.5
[61] -40.5 -39.5 -38.5 -37.5 -36.5 -35.5 -34.5 -33.5 -32.5 -31.5 -30.5 -29.5
[73] -28.5 -27.5 -26.5 -25.5 -24.5 -23.5 -22.5 -21.5 -20.5 -19.5 -18.5 -17.5
[85] -16.5 -15.5 -14.5 -13.5 -12.5 -11.5 -10.5 -9.5 -8.5 -7.5 -6.5 -5.5
[97] -4.5 -3.5 -2.5 -1.5 -0.5 0.5 1.5 2.5 3.5 4.5 5.5 6.5
[109] 7.5 8.5 9.5 10.5 11.5 12.5 13.5 14.5 15.5 16.5 17.5 18.5
[121] 19.5 20.5 21.5 22.5 23.5 24.5 25.5 26.5 27.5 28.5 29.5 30.5
[133] 31.5 32.5 33.5 34.5 35.5 36.5 37.5 38.5 39.5 40.5 41.5 42.5
[145] 43.5 44.5 45.5 46.5 47.5 48.5 49.5 50.5 51.5 52.5 53.5 54.5
[157] 55.5 56.5 57.5 58.5 59.5 60.5 61.5 62.5 63.5 64.5 65.5 66.5
[169] 67.5 68.5 69.5 70.5 71.5 72.5 73.5 74.5 75.5 76.5 77.5 78.5
[181] 79.5 80.5 81.5 82.5 83.5 84.5 85.5 86.5 87.5 88.5 89.5 90.5
[193] 91.5 92.5 93.5 94.5 95.5 96.5 97.5 98.5 99.5 100.5
sum(x)
[1] 0
A nice property for a rounding rule is that it preserves that sum. Using “ties towards even”, the sum is still zero. However, “ties towards \(+\infty\)” produces a non-zero number.
sum(x)
[1] 0
sum(round(x))
[1] 0
sum(round2(x))
[1] 101
Rounding rules can have real world impacts. One notable example was that in 1983, the Vancouver stock exchange adjusted its index from 524.811 to 1098.892 to correct for accumulated error due to rounding to three decimal places (see Vancouver Stock Exchange). This site lists several more examples of the dangers of rounding rules.
5. What functions from the readr package allow you to turn a string into a logical, integer, and double vector?
The function parse_logical() parses logical values, which can appear as variations of TRUE/FALSE or 1/0.
parse_logical(c("TRUE", "FALSE", "1", "0", "true", "t", "NA"))
[1] TRUE FALSE TRUE FALSE TRUE TRUE NA
The function parse_integer() parses integer values.
parse_integer(c("1235", "0134", "NA"))
[1] 1235 134 NA
However, if there are any non-numeric characters in the string, including currency symbols, commas, and decimals, parse_integer() will fail to parse that value, returning NA and reporting a parsing failure.
parse_integer(c("1000", "$1,000", "10.00"))
2 parsing failures.
row col expected actual
2 -- an integer $1,000
3 -- no trailing characters .00
[1] 1000 NA NA
attr(,"problems")
The function parse_number() parses numeric values and returns a double vector.
parse_number(c("1.0", "3.5", "$1,000.00", "NA"))
[1] 1.0 3.5 1000.0 NA
Unlike parse_integer(), the function parse_number() is very forgiving about the format of the numbers. It ignores non-numeric characters before and after the number, as well as the grouping mark inside it, as with “$1,000.00” in the example. This allows it to easily parse numeric fields that include currency symbols and comma separators in number strings without any intervention by the user.
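The grouping and decimal marks that parse_number() recognizes come from the locale. As a small sketch, assuming the readr package is installed, a European-style number can be parsed by passing a locale; the input string is made up for illustration.

```r
library(readr)

# European-style number: "." groups thousands, "," marks the decimals.
euro <- parse_number(
  "1.234,56 EUR",
  locale = locale(decimal_mark = ",", grouping_mark = ".")
)
print(euro)
```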
