Describe Heights to ET

# load the dataset
library(dslabs)
data(heights)

# make a table of category proportions
prop.table(table(heights$sex))
## 
##    Female      Male 
## 0.2266667 0.7733333

Cumulative Distribution Function

Every continuous distribution has a cumulative distribution function (CDF). The CDF gives the proportion of the data at or below a given value 𝑎, for every value of 𝑎:

$$F(a) = \mbox{Pr}(x \leq a)$$

The CDF is essential for calculating probabilities related to continuous data. In a continuous dataset, the probability of a specific exact value is not informative because most entries are unique. For example, in the student heights data, only one individual reported a height of 68.8976377952726 inches, but many students rounded similar heights to 69 inches. If we computed exact value probabilities, we would find that being exactly 69 inches is much more likely than being a non-integer exact height, which does not match our understanding that height is continuous. We can instead use the CDF to obtain a useful summary, such as the probability that a student is between 68.5 and 69.5 inches.
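For example, this interval probability can be computed directly from the data. A minimal sketch, assuming the heights dataset loaded above:

# proportion of reported heights between 68.5 and 69.5 inches,
# computed directly from the data
mean(heights$height > 68.5 & heights$height <= 69.5)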

For datasets that are not normal, the CDF can be computed manually by defining a function that calculates the proportion of values at or below a given cutoff. Applying this function to a sequence of values spanning the dataset gives the CDF. Given a dataset my_data, the CDF can be calculated and plotted like this:

my_data <- heights$height    # use the student heights as an example dataset
a <- seq(min(my_data), max(my_data), length = 100)    # sequence of values spanning the dataset
cdf_function <- function(x) {    # computes the proportion of values <= x
   mean(my_data <= x)
}
cdf_values <- sapply(a, cdf_function)
plot(a, cdf_values)

The CDF gives the proportion of data at or below a cutoff 𝑎 . To obtain the proportion of values above 𝑎 , we compute:

$$1 - F(a)$$

To define the proportion of values between 𝑎 and 𝑏 , we compute:

$$F(b) - F(a)$$

Note that the CDF can help compute probabilities. The probability of observing a randomly chosen value between 𝑎 and 𝑏 is equal to the proportion of values between 𝑎 and 𝑏 , which we compute with the CDF.
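As a quick illustration, here are these two quantities computed with the my_data and cdf_function defined in the example above; 70 and 72 inches are arbitrary example cutoffs:

# proportion of values above a = 70, using the empirical CDF from above
1 - cdf_function(70)

# proportion of values between a = 70 and b = 72
cdf_function(72) - cdf_function(70)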

Smooth Density Plots

A further note on histograms

Note that the choice of binwidth can have a large effect on the shape of a histogram. There is no single “correct” choice of binwidth, and you can sometimes gain insight into the data by experimenting with different binwidths, as in the sketch below.
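A minimal sketch of such experimentation, assuming the heights data from above and using ggplot2; the binwidths of 1 and 3 inches are arbitrary choices:

# load ggplot2 and the heights data
library(ggplot2)
library(dslabs)
data(heights)
male_heights <- heights[heights$sex == "Male", ]

# same data, two different binwidths
ggplot(male_heights, aes(height)) + geom_histogram(binwidth = 1)
ggplot(male_heights, aes(height)) + geom_histogram(binwidth = 3)

# a smooth density plot avoids the binwidth choice (but has its own smoothing parameter)
ggplot(male_heights, aes(height)) + geom_density(fill = "grey")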

Normal Distribution

Equation for the normal distribution

The normal distribution is mathematically defined by the following formula for any mean 𝜇 and standard deviation 𝜎 :

$$\mbox{Pr}(a < x < b) = \int_a^b \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \, dx$$
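As an illustration of this formula, the probability over an interval can be obtained by numerically integrating the density; the values of 𝜇 , 𝜎 , and the interval below are arbitrary examples, not estimates from the data:

# numerically integrate the normal density over an example interval
mu <- 69
sigma <- 3.6
normal_density <- function(x) 1 / (sqrt(2 * pi) * sigma) * exp(-0.5 * ((x - mu) / sigma)^2)
integrate(normal_density, lower = 68, upper = 72)$value

# pnorm(), introduced below, gives the same probability
pnorm(72, mu, sigma) - pnorm(68, mu, sigma)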

Code

# define x as vector of male heights
library(tidyverse)
library(dslabs)
data(heights)
index <- heights$sex=="Male"
x <- heights$height[index]

# calculate the mean and standard deviation manually
average <- sum(x)/length(x)
SD <- sqrt(sum((x - average)^2)/length(x))

# built-in mean and sd functions - note that sd() divides by length(x) - 1, so it differs slightly from SD above
average <- mean(x)
SD <- sd(x)
c(average = average, SD = SD)
##   average        SD 
## 69.314755  3.611024
# calculate standard units
z <- scale(x)

# calculate proportion of values within 2 SD of mean
mean(abs(z) < 2)
## [1] 0.9495074

Note about the sd function

The built-in R function sd() calculates the standard deviation, but it divides by length(x) - 1 instead of length(x). When the vector is long, this difference is negligible and you can use the built-in sd() function. Otherwise, you should compute 𝜎 by hand. For this course series, assume that you should use the sd() function unless you are told not to do so.
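A small sketch of the relationship between the two definitions, assuming x is the vector of male heights defined above:

# sd() divides by length(x) - 1; rescale to get the definition that divides by length(x)
n <- length(x)
sd_n_minus_1 <- sd(x)
sd_n <- sd_n_minus_1 * sqrt((n - 1) / n)
c(sd_n_minus_1 = sd_n_minus_1, sd_n = sd_n)    # nearly identical for large n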

Normal Distribution: Standard Units and Z-scores

Standard units

For data that are approximately normal, standard units describe the number of standard deviations an observation is from the mean. Standard units are denoted by the variable 𝑧 and are also known as z-scores.

For any value 𝑥 from a normal distribution with mean 𝜇 and standard deviation 𝜎 , the value in standard units is:

$$z = \frac{x - \mu}{\sigma}$$

Standard units are useful for many reasons. Note that the formula for the normal distribution is simplified by substituting 𝑧 in the exponent:

$$\mbox{Pr}(a < x < b) = \int_a^b \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{1}{2}z^2} \, dx$$

When 𝑧=0 , the normal density is at its maximum, which occurs at the mean 𝜇 . The distribution is symmetric around 𝑧=0 .

The normal distribution of z-scores is called the standard normal distribution and is defined by 𝜇=0 and 𝜎=1 .

Z-scores are useful for quickly evaluating whether an observation is typical or extreme. Z-scores near 0 are typical, z-scores above 2 or below -2 are uncommon, and z-scores above 3 or below -3 are extremely rare.

We will learn more about benchmark z-score values and their corresponding probabilities below.

Code: Converting to standard units

The scale() function converts a numeric vector into z-scores by subtracting the mean and dividing by the standard deviation.

z <- scale(x)

You can compute the proportion of observations that are within 2 standard deviations of the mean like this:

mean(abs(z) < 2)
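The same z-scores can also be computed by hand, which makes the definition explicit; this assumes x is the male height vector defined earlier:

# scale() is equivalent to subtracting the mean and dividing by the standard deviation
z_manual <- (x - mean(x)) / sd(x)
all.equal(as.numeric(scale(x)), z_manual)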

The 68-95-99.7 Rule

The normal distribution is associated with the 68-95-99.7 rule: approximately 68% of observations fall within one standard deviation of the mean, about 95% within two standard deviations, and about 99.7% within three standard deviations.
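A rough empirical check of the rule, using the standardized male heights z computed above (heights are only approximately normal, so the proportions are close but not exact):

# proportion of observations within 1, 2, and 3 SDs of the mean
mean(abs(z) <= 1)    # roughly 0.68
mean(abs(z) <= 2)    # roughly 0.95
mean(abs(z) <= 3)    # roughly 0.997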

The normal CDF and pnorm

Key points

The normal distribution has a mathematically defined CDF which can be computed in R with the function pnorm().

pnorm(a, avg, s) gives the value of the cumulative distribution function 𝐹(𝑎) for the normal distribution defined by average avg and standard deviation s.

We say that a random quantity is normally distributed with average avg and standard deviation s if the approximation pnorm(a, avg, s) holds for all values of a (see the sketch after these key points).

If we are willing to use the normal approximation for height, we can estimate the distribution simply from the mean and standard deviation of our values.

If we treat the continuous height data as discrete exact values, the resulting distribution is not very useful: integer values are far more common than expected because respondents round their heights. This rounding of continuous data is called discretization.

With rounded data, the normal approximation is particularly useful when computing probabilities of intervals of length 1 that include exactly one integer.
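As a sketch of the comparison mentioned above, the empirical CDF of the male heights can be plotted against pnorm() across a grid of values; x is the male height vector used throughout this section:

# compare the empirical CDF of the male heights to the normal CDF from pnorm()
a_grid <- seq(min(x), max(x), length = 100)
empirical_cdf <- sapply(a_grid, function(a) mean(x <= a))
normal_cdf <- pnorm(a_grid, mean(x), sd(x))
plot(a_grid, empirical_cdf, type = "l", xlab = "a", ylab = "F(a)")
lines(a_grid, normal_cdf, lty = 2)    # dashed line: normal approximation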

Code: Using pnorm to calculate probabilities

Given male heights x:

library(tidyverse)
library(dslabs)
data(heights)
x <- heights %>% filter(sex=="Male") %>% pull(height)

We can estimate the probability that a male is taller than 70.5 inches with:

1 - pnorm(70.5, mean(x), sd(x))
## [1] 0.371369
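For comparison, the same proportion can be computed directly from the data (output omitted here):

# empirical proportion of males taller than 70.5 inches
mean(x > 70.5)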

Code: Discretization and the normal approximation

# plot distribution of exact heights in data
plot(prop.table(table(x)), xlab = "a = Height in inches", ylab = "Pr(x = a)")

# probabilities in actual data over length 1 ranges containing an integer
mean(x <= 68.5) - mean(x <= 67.5)
## [1] 0.114532
mean(x <= 69.5) - mean(x <= 68.5)
## [1] 0.1194581
mean(x <= 70.5) - mean(x <= 69.5)
## [1] 0.1219212
# probabilities in normal approximation match well
pnorm(68.5, mean(x), sd(x)) - pnorm(67.5, mean(x), sd(x))
## [1] 0.1031077
pnorm(69.5, mean(x), sd(x)) - pnorm(68.5, mean(x), sd(x))
## [1] 0.1097121
pnorm(70.5, mean(x), sd(x)) - pnorm(69.5, mean(x), sd(x))
## [1] 0.1081743
# probabilities in actual data over other ranges don't match normal approx as well
mean(x <= 70.9) - mean(x <= 70.1)
## [1] 0.02216749
pnorm(70.9, mean(x), sd(x)) - pnorm(70.1, mean(x), sd(x))
## [1] 0.08359562