📖 Case study: describing student heights
A distribution is a function or description that shows the possible values of a variable and how often those values occur.
For categorical variables, the distribution describes the proportions of each category.
A frequency table is the simplest way to show a categorical distribution. Use prop.table() to convert a table of counts to a frequency table.
Barplots display the distribution of categorical variables and are a way to visualize the information in frequency tables.
For continuous numerical data, reporting the frequency of each unique entry is not an effective summary as many or most values are unique. Instead, a distribution function is required.
The cumulative distribution function (CDF) is a function that reports the proportion of data at or below a value a, for all values of a:
\(F(a) = \mbox{Pr}(x \leq a)\)
The proportion of observations between any two values a and b can be computed from the CDF as:
\(F(b) - F(a)\)
A histogram divides the data into non-overlapping bins of the same size and plots the number of values that fall in each bin.
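As a minimal sketch of the binning step, the base R function hist() can return the bin counts without drawing anything. The vector toy_heights and the bin edges below are made-up illustrative values, not the heights dataset.

```r
# Hypothetical toy data (not from the heights dataset)
toy_heights <- c(61.2, 63.5, 64.1, 66.0, 66.8, 67.3, 68.9, 69.4, 70.2, 72.7)

# Non-overlapping bins of equal width (4 inches) covering the data range
breaks <- seq(60, 76, by = 4)

# plot = FALSE returns the binning without drawing the histogram
h <- hist(toy_heights, breaks = breaks, plot = FALSE)

# Number of values falling in each bin
h$counts
```

Every value lands in exactly one bin, so the counts sum to the number of observations.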
# Load the dataset
library(dslabs)
data(heights)
# Make a table of category proportions
prop.table(table(heights$sex))
##
##    Female      Male
## 0.2266667 0.7733333
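The same proportions can be passed to a barplot. This is a small sketch using the base R function barplot() on a made-up vector, toy_sex, which is an assumption for illustration and not the heights dataset.

```r
# Hypothetical toy categorical data (not the heights dataset)
toy_sex <- c("Female", "Male", "Male", "Female", "Male", "Male", "Male", "Female")

# Counts per category, then proportions per category
counts <- table(toy_sex)
props <- prop.table(counts)
props

# Bar heights are the category proportions
barplot(props, ylab = "Proportion")
```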
Every continuous distribution has a cumulative distribution function (CDF). The CDF defines the proportion of the data at or below a given value a, for all values of a:
\(F(a) = \mbox{Pr}(x \leq a)\)
Any continuous dataset has a CDF, not only normal distributions. For example, the male heights data we used in the previous section has this CDF:
As defined above, this plot of the CDF for male heights has height values a on the x-axis and the proportion of students with heights of that value or lower, F(a), on the y-axis.
The CDF is essential for calculating probabilities related to continuous data. In a continuous dataset, the probability of a specific exact value is not informative because most entries are unique. For example, in the student heights data, only one individual reported a height of 68.8976377952726 inches, but many students rounded similar heights to 69 inches. If we computed exact value probabilities, we would find that being exactly 69 inches is much more likely than being a non-integer exact height, which does not match our understanding that height is continuous. We can instead use the CDF to obtain a useful summary, such as the probability that a student is between 68.5 and 69.5 inches.
For datasets that are not normal, the CDF can be calculated manually by defining a function to compute the probability above. This function can then be applied to a range of values across the range of the dataset to calculate a CDF. Given a dataset my_data, the CDF can be calculated and plotted like this:
my_data <- heights$height
# Define range of values spanning the dataset
a <- seq(min(my_data), max(my_data), length.out = 100)
cdf_function <- function(x) {
  # Compute the proportion of values at or below a single cutoff x
  mean(my_data <= x)
}
cdf_values <- sapply(a, cdf_function)
plot(a, cdf_values)
The CDF defines the proportion of data at or below a cutoff a. To define the proportion of values above a, we compute:
\(1 - F(a)\)
To define the proportion of values between a and b, we compute:
\(F(b) - F(a)\)
Note that the CDF can help compute probabilities. The probability of observing a randomly chosen value between a and b is equal to the proportion of values between a and b, which we compute with the CDF.
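As a hedged illustration of these three quantities, base R's ecdf() returns the empirical CDF as a function. The data vector and the cutoffs a and b below are made-up values chosen for the example.

```r
# Hypothetical toy data (not the heights dataset)
toy_heights <- c(61.2, 63.5, 64.1, 66.0, 66.8, 67.3, 68.9, 69.4, 70.2, 72.7)

# ecdf() returns a function giving the proportion of values <= its argument
emp_cdf <- ecdf(toy_heights)

a <- 64
b <- 70

prop_below   <- emp_cdf(a)               # F(a): proportion at or below a
prop_above   <- 1 - emp_cdf(a)           # 1 - F(a): proportion above a
prop_between <- emp_cdf(b) - emp_cdf(a)  # F(b) - F(a): proportion in (a, b]

# The same proportion computed directly from a logical vector
prop_between_direct <- mean(toy_heights > a & toy_heights <= b)
```

Note that F(b) - F(a) counts values strictly greater than a and at most b, which is why the direct check uses `>` on the left and `<=` on the right.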
The standard deviation measures the typical distance between a value and the mean. It is computed as the square root of the average squared distance from the mean.
Calculate the mean using the mean() function.
Calculate the standard deviation using the sd() function or manually.
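One way the manual calculation might look is sketched below on made-up data. Note that sd() divides by n - 1 rather than n, so it differs slightly from the square root of the plain average squared deviation.

```r
# Hypothetical toy data (not the heights dataset)
toy_heights <- c(61.2, 63.5, 64.1, 66.0, 66.8, 67.3, 68.9, 69.4, 70.2, 72.7)

avg <- mean(toy_heights)
n <- length(toy_heights)

# Manual SD: square root of the average squared deviation (divides by n)
sd_manual <- sqrt(mean((toy_heights - avg)^2))

# Built-in SD: divides by n - 1, so it is slightly larger
sd_builtin <- sd(toy_heights)

# The two differ only by the factor sqrt(n / (n - 1))
all.equal(sd_builtin, sd_manual * sqrt(n / (n - 1)))
```

For large datasets the two versions are nearly identical.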
Standard units describe how many standard deviations a value is away from the mean. The z-score is the number of standard deviations σ that an observation x is away from the mean μ:
\(Z = \frac{x - \mu}{\sigma}\)
Compute standard units with the scale() function.
Important: to calculate the proportion of values that meet a certain condition, use the mean() function on a logical vector.
Because TRUE is converted to 1 and FALSE is converted to 0, taking the mean of this vector yields the proportion of TRUE values.
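A short sketch of this idiom, using made-up data and an arbitrary cutoff of 68 inches:

```r
# Hypothetical toy data (not the heights dataset)
toy_heights <- c(61.2, 63.5, 64.1, 66.0, 66.8, 67.3, 68.9, 69.4, 70.2, 72.7)

# Logical vector: TRUE where the condition holds
taller_than_68 <- toy_heights > 68

# sum() counts the TRUE values; mean() gives their proportion
n_taller <- sum(taller_than_68)
prop_taller <- mean(taller_than_68)
prop_taller
```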
The normal distribution is mathematically defined by the following formula for any mean μ and standard deviation σ:
\(\mbox{Pr}(a < x < b) = \int_a^b \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{1}{2}\left( \frac{x-\mu}{\sigma} \right)^2} \, dx\)
For data that are approximately normal, standard units describe the number of standard deviations an observation is from the mean. Standard units are denoted by the variable z and are also known as z-scores.
For any value x from a normal distribution with mean μ and standard deviation σ, the value in standard units is:
\(z = \frac{x-\mu}{\sigma}\)
Standard units are useful for many reasons. Note that the formula for the normal distribution is simplified by substituting z in the exponent:
\(\mbox{Pr}(a < x < b) = \int_a^b \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{1}{2}z^2} \, dx\)
The normal distribution of z-scores is called the standard normal distribution and is defined by μ=0 and σ=1.
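Carrying the substitution all the way through, and assuming the usual change of variables \(dx = \sigma \, dz\) with the limits rescaled accordingly, the same probability can be written entirely in standard units:
\(\mbox{Pr}(a < x < b) = \int_{\frac{a-\mu}{\sigma}}^{\frac{b-\mu}{\sigma}} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}z^2} \, dz\)
The integrand no longer depends on μ or σ, which is why a single standard normal distribution suffices for all normal data.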
Z-scores are useful to quickly evaluate whether an observation is average or extreme. Z-scores near 0 are average. Z-scores above 2 or below -2 are significantly above or below the mean, and z-scores above 3 or below -3 are extremely rare.
We will learn more about benchmark z-score values and their corresponding probabilities below.
The scale function converts a vector of approximately normally distributed values into z-scores.
z <- scale(x)
You can compute the proportion of observations that are within 2 standard deviations of the mean like this:
mean(abs(z) < 2)
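To make the connection to the z-score formula explicit, the sketch below compares scale() with the manual calculation on made-up data. One detail worth knowing is that scale() returns a one-column matrix, so it is converted to a plain vector here.

```r
# Hypothetical toy data (not the heights dataset)
toy_heights <- c(61.2, 63.5, 64.1, 66.0, 66.8, 67.3, 68.9, 69.4, 70.2, 72.7)

# Manual z-scores: subtract the mean, divide by the standard deviation
z_manual <- (toy_heights - mean(toy_heights)) / sd(toy_heights)

# scale() does the same but returns a matrix; convert to a numeric vector
z_scale <- as.numeric(scale(toy_heights))

all.equal(z_manual, z_scale)

# Proportion of observations within 2 standard deviations of the mean
prop_within_2 <- mean(abs(z_manual) < 2)
```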
The normal distribution is associated with the 68-95-99.7 rule. This rule describes the probability of observing events within a certain number of standard deviations of the mean.
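For a normal distribution, the rule states that about 68% of observations lie within 1 standard deviation of the mean, about 95% within 2 standard deviations, and about 99.7% within 3 standard deviations.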
The normal distribution has a mathematically defined CDF, which can be computed in R with the function pnorm().
pnorm(a, avg, s) gives the value of the cumulative distribution function F(a) for the normal distribution defined by average avg and standard deviation s.
A quantity is said to be approximately normally distributed with average avg and standard deviation s if the approximation pnorm(a, avg, s) holds for all values of a.
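The 68-95-99.7 rule can be checked numerically with pnorm(). The sketch below uses the standard normal distribution (the pnorm() defaults of mean 0 and standard deviation 1); the helper name within_k_sd is made up for this example.

```r
# Probability that a standard normal value lies within k SDs of the mean
within_k_sd <- function(k) pnorm(k) - pnorm(-k)

probs <- sapply(1:3, within_k_sd)
round(probs, 3)
## [1] 0.683 0.954 0.997
```

Because standard units remove the dependence on the mean and standard deviation, the same three probabilities apply to any normal distribution.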
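We can estimate the probability that a male is taller than 70.5 inches with:
# Male heights from the heights dataset
x <- heights$height[heights$sex == "Male"]
1 - pnorm(70.5, mean(x), sd(x))
## [1] 0.371369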
# Plot distribution of exact heights in data
plot(prop.table(table(x)), xlab = "a = Height in inches", ylab = "Pr(x = a)")
# Probabilities in actual data over length 1 ranges containing an integer
mean(x <= 68.5) - mean(x <= 67.5)
## [1] 0.114532
mean(x <= 69.5) - mean(x <= 68.5)
## [1] 0.1194581
mean(x <= 70.5) - mean(x <= 69.5)
## [1] 0.1219212
# Probabilities in normal approximation match well
pnorm(68.5, mean(x), sd(x)) - pnorm(67.5, mean(x), sd(x))
## [1] 0.1031077
pnorm(69.5, mean(x), sd(x)) - pnorm(68.5, mean(x), sd(x))
## [1] 0.1097121
pnorm(70.5, mean(x), sd(x)) - pnorm(69.5, mean(x), sd(x))
## [1] 0.1081743
# Probabilities in actual data over other ranges don't match normal approx as well
mean(x <= 70.9) - mean(x <= 70.1)
## [1] 0.02216749
pnorm(70.9, mean(x), sd(x)) - pnorm(70.1, mean(x), sd(x))
## [1] 0.08359562