Histograms and density plots provide excellent summaries of a distribution. But can we summarize even further? We often see the average and standard deviation used as summary statistics: a two number summary! To understand what these summaries are and why they are so widely used, we need to understand the normal distribution.
The normal distribution, also known as the bell curve and as the Gaussian distribution, is one of the most famous mathematical concepts in history. A reason for this is that approximately normal distributions occur in many situations. Examples include gambling winnings, heights, weights, blood pressure, standardized test scores, and experimental measurement errors. Often data visualization is needed to confirm that our data follows a normal distribution.
Here we focus on how the normal distribution helps us summarize data and can be useful in practice.
One way the normal distribution is useful is that it can be used to approximate the distribution of a list of numbers without having access to the entire list. We will demonstrate this with the heights dataset.
Load the heights dataset and create a vector x with just the male heights:
library(dslabs)
data(heights)
x <- heights$height[heights$sex == "Male"]
What proportion of the data is between 69 and 72 inches (taller than 69 but shorter than or equal to 72)? A proportion is between 0 and 1. Use the mean function in your code. Remember that you can use mean to compute the proportion of entries of a logical vector that are TRUE.
mean(x<=72) - mean(x<=69)
## [1] 0.3337438
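To make the mean-of-a-logical-vector idea concrete, here is a minimal sketch on a tiny made-up vector v (hypothetical values, not taken from the heights data). It shows that subtracting two "less than or equal" proportions gives the same answer as testing both conditions at once.

```r
# Hypothetical toy data: four heights in inches (not from the heights dataset)
v <- c(66, 70, 72, 74)

# A comparison returns a logical vector; mean() treats TRUE as 1 and FALSE as 0,
# so the mean is the proportion of entries that are TRUE
prop_difference <- mean(v <= 72) - mean(v <= 69)  # 3/4 - 1/4
prop_interval <- mean(v > 69 & v <= 72)           # 2 of the 4 values fall in (69, 72]

prop_difference
prop_interval
```

Both expressions count the values in the interval (69, 72]; the first does it by subtracting cumulative proportions, which mirrors how pnorm is used later.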
Suppose all you know about the height data from the previous exercise is the average and the standard deviation and that its distribution is approximated by the normal distribution. We can compute the average and standard deviation like this:
library(dslabs)
data(heights)
x <- heights$height[heights$sex == "Male"]
avg <- mean(x)
stdev <- sd(x)
Suppose you have only avg and stdev, with no access to x. Can you approximate the proportion of the data that is between 69 and 72 inches?
Given a normal distribution with a mean mu and standard deviation sigma, you can calculate the proportion of observations less than or equal to a certain value with pnorm(value, mu, sigma). Notice that this is the CDF for the normal distribution. We will learn much more about pnorm later in the course series, but you can also learn more now with ?pnorm.
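As a quick illustration of how pnorm behaves as a CDF, the sketch below uses only base R. The mean of 100 and standard deviation of 15 are hypothetical values chosen for illustration; they do not come from the heights data.

```r
# pnorm(q, mean, sd) gives the proportion of a normal distribution at or below q.
# With no mean or sd supplied, pnorm uses the standard normal (mean 0, sd 1).
half <- pnorm(0, mean = 0, sd = 1)     # half of the distribution lies at or below the mean

# Proportion within one standard deviation of the mean (roughly 0.68)
within_one_sd <- pnorm(1) - pnorm(-1)

# The same interval on a hypothetical scale with mean 100 and sd 15
within_one_sd_scaled <- pnorm(115, 100, 15) - pnorm(85, 100, 15)

half
within_one_sd
within_one_sd_scaled
```

The last two values agree because the proportion depends only on how many standard deviations each endpoint is from the mean.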
Use the normal approximation to estimate the proportion of the data that is between 69 and 72 inches. Note that you can’t use x in your code, only avg and stdev. The pnorm function is helpful here (remember that you can get help by using ?pnorm).
library(dslabs)
data(heights)
x <- heights$height[heights$sex=="Male"]
avg <- mean(x)
stdev <- sd(x)
pnorm(72,avg,stdev) - pnorm(69,avg,stdev)
## [1] 0.3061779
Notice that the approximation calculated in the second question is very close to the exact calculation in the first question. The normal distribution was a useful approximation for this case.
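In symbols, the approximation can be written as below. The notation is introduced here for illustration: X stands for a height, \mu and \sigma for avg and stdev, and \Phi for the CDF of the standard normal distribution, so that pnorm(value, mu, sigma) corresponds to \Phi((value - \mu)/\sigma).

```latex
\Pr(69 < X \le 72) \;\approx\;
\Phi\!\left(\frac{72 - \mu}{\sigma}\right) - \Phi\!\left(\frac{69 - \mu}{\sigma}\right)
```

Only \mu and \sigma appear on the right-hand side, which is why the two-number summary is enough to produce the estimate.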
However, the approximation is not always useful. It tends to break down for the more extreme values, often called the “tails” of the distribution. Let’s look at an example. We can compute the proportion of heights between 79 and 81 inches.
library(dslabs)
data(heights)
x <- heights$height[heights$sex == "Male"]
mean(x > 79 & x <= 81)
Use the normal approximation to estimate the proportion of heights between 79 and 81 inches and save it in an object called approx. Report how many times bigger the actual proportion is compared to the approximation.
library(dslabs)
data(heights)
x <- heights$height[heights$sex == "Male"]
avg <- mean(x)
stdev <- sd(x)
exact <- mean(x > 79 & x <= 81)
approx <- pnorm(81,avg,stdev) - pnorm(79,avg,stdev)
exact/approx
## [1] 1.614261
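To see why this interval counts as a tail, it can help to express the endpoints in standard units, that is, as the number of standard deviations from the average. The sketch below assumes an average of 69 and a standard deviation of 3 (illustrative values borrowed from the NBA exercise in this text, not the actual avg and stdev of x).

```r
# Assumed summary statistics, for illustration only
avg <- 69    # average height in inches (assumption)
stdev <- 3   # standard deviation in inches (assumption)

# Convert the interval endpoints to standard units
z_low <- (79 - avg) / stdev
z_high <- (81 - avg) / stdev

# The normal approximation gives the same answer on either scale
tail_direct <- pnorm(81, avg, stdev) - pnorm(79, avg, stdev)
tail_standard <- pnorm(z_high) - pnorm(z_low)

c(z_low = z_low, z_high = z_high)
tail_direct
tail_standard
```

Under these assumptions both endpoints sit more than three standard deviations above the average, where the normal curve assigns very little probability, so even small departures from normality in the data produce a large ratio between the exact and approximate proportions.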
In the previous exercise we estimated the proportion of seven footers in the NBA using this simple code:
p <- 1 - pnorm(7*12, 69, 3)
N <- round(p * 10^9)
10/N
Repeat the calculations performed in the previous question for LeBron James’ height: 6 feet 8 inches. There are about 150 players, instead of 10, that are at least that tall in the NBA.
Report the estimated proportion of people at least LeBron’s height that are in the NBA.
## Change the solution to previous answer
p <- 1 - pnorm(80, 69, 3)
N <- round(p * 10^9)
150/N
## [1] 0.001220842
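Since the same three steps are repeated for each height, they can be wrapped in a small helper. This is only a sketch: the function name nba_proportion and its argument names are hypothetical, and the defaults (mean 69, standard deviation 3, population 10^9) are simply the values used in the code above.

```r
# Hypothetical helper that repeats the calculation above for any height cutoff
nba_proportion <- function(height_in, n_players, mu = 69, sigma = 3, pop = 10^9) {
  p <- 1 - pnorm(height_in, mu, sigma)  # normal-approximation share at least this tall
  N <- round(p * pop)                   # estimated number of such people in the population
  n_players / N                         # share of them who are NBA players
}

seven_footers <- nba_proportion(7 * 12, 10)   # seven footers, about 10 players
lebron_height <- nba_proportion(80, 150)      # 6 feet 8 inches = 80 inches, about 150 players

seven_footers
lebron_height
```

Keeping mu, sigma, and pop as arguments makes it easy to check how sensitive the estimate is to those assumptions.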
In answering the previous questions, we found that it is not at all rare for a seven footer to become an NBA player.
What would be a fair critique of our calculations?
Practice and talent are what make a great basketball player, not height.
The normal approximation is not appropriate for heights.
*As seen in exercise 3, the normal approximation tends to underestimate the extreme values. It’s possible that there are more seven footers than we predicted.
As seen in exercise 3, the normal approximation tends to overestimate the extreme values. It’s possible that there are fewer seven footers than we predicted.