Key points
A distribution is a function or description that shows the possible values of a variable and how often those values occur.
For categorical variables, the distribution describes the proportions of each category.
A frequency table is the simplest way to show a categorical distribution. Use prop.table() to convert a table of counts to a frequency table. Barplots display the distribution of categorical variables and are a way to visualize the information in frequency tables.
For continuous numerical data, reporting the frequency of each unique entry is not an effective summary as many or most values are unique. Instead, a distribution function is required.
The cumulative distribution function (CDF) is a function that reports the proportion of data below a value 𝑎 for all values of 𝑎: 𝐹(𝑎) = Pr(𝑥 ≤ 𝑎).
The proportion of observations between any two values 𝑎 and 𝑏 can be computed from the CDF as 𝐹(𝑏) − 𝐹(𝑎).
A histogram divides data into non-overlapping bins of the same size and plots the count of values that fall in each bin.
# load the dataset
library(dslabs)
data(heights)
# make a table of category proportions
prop.table(table(heights$sex))
##
## Female Male
## 0.2266667 0.7733333
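The key points above mention barplots as the way to visualize a frequency table. Here is a minimal base R sketch, assuming the heights data loaded above (plotting functions are covered in more depth later):
# visualize the same frequency table as a barplot
barplot(prop.table(table(heights$sex)), ylab = "Proportion")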
Every continuous distribution has a cumulative distribution function (CDF). The CDF defines the proportion of the data below a given value 𝑎 for all values of 𝑎:
𝐹(𝑎)=Pr(𝑥≤𝑎)
The CDF is essential for calculating probabilities related to continuous data. In a continuous dataset, the probability of a specific exact value is not informative because most entries are unique. For example, in the student heights data, only one individual reported a height of 68.8976377952726 inches, but many students rounded similar heights to 69 inches. If we computed exact value probabilities, we would find that being exactly 69 inches is much more likely than being a non-integer exact height, which does not match our understanding that height is continuous. We can instead use the CDF to obtain a useful summary, such as the probability that a student is between 68.5 and 69.5 inches.
For datasets that are not well approximated by the normal distribution, the CDF can be computed empirically by defining a function that returns the proportion of values at or below a given cutoff, then applying that function to a sequence of cutoffs spanning the data. Given a dataset my_data, the CDF can be calculated and plotted like this:
# a <- seq(min(my_data), max(my_data), length = 100)   # define range of values spanning the dataset
# cdf_function <- function(x) {                        # computes the proportion of values at or below x
#   mean(my_data <= x)
# }
# cdf_values <- sapply(a, cdf_function)
# plot(a, cdf_values)
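As a concrete illustration (not part of the original template), the same steps applied to the male heights used later in this section produce the empirical CDF:
library(dslabs)
data(heights)
my_data <- heights$height[heights$sex == "Male"]            # male heights as the example dataset
a <- seq(min(my_data), max(my_data), length = 100)          # cutoffs spanning the data
cdf_values <- sapply(a, function(x) mean(my_data <= x))     # F(a) for each cutoff
plot(a, cdf_values, type = "l", xlab = "a", ylab = "F(a)")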
The CDF defines the proportion of data below a cutoff 𝑎. To define the proportion of values above 𝑎, we compute:
1−𝐹(𝑎)
To define the proportion of values between 𝑎 and 𝑏 , we compute:
𝐹(𝑏)−𝐹(𝑎)
Note that the CDF can help compute probabilities. The probability of observing a randomly chosen value between 𝑎 and 𝑏 is equal to the proportion of values between 𝑎 and 𝑏 , which we compute with the CDF.
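For example, reusing my_data from the sketch above, the proportion of male heights above 70 inches and between 65 and 70 inches can be computed directly from these definitions:
1 - mean(my_data <= 70)                      # proportion above a = 70, i.e. 1 - F(70)
mean(my_data <= 70) - mean(my_data <= 65)    # proportion between 65 and 70, i.e. F(70) - F(65)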
Key points
Smooth density plots can be thought of as histograms where the bin width is extremely or infinitely small. The smoothing function makes estimates of the true continuous trend of the data given the available sample of data points.
The degree of smoothness can be controlled by an argument in the plotting function. (We will learn functions for plotting later; a minimal base R sketch follows this list.)
While the histogram is an assumption-free summary, the smooth density plot is shaped by assumptions and choices you make as a data analyst.
The y-axis is scaled so that the total area under the density curve is 1. This means that interpreting values on the y-axis is not straightforward. To determine the proportion of data between two values, compute the area under the smooth density curve in the region between those values.
An advantage of smooth densities over histograms is that densities are easier to compare visually.
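A minimal base R sketch of a smooth density for the male heights (the adjust argument of density() controls the degree of smoothness; this is just an illustration, the course's plotting functions come later):
library(dslabs)
data(heights)
x <- heights$height[heights$sex == "Male"]
plot(density(x))               # default smoothness
plot(density(x, adjust = 2))   # larger adjust gives a smoother estimate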
A further note on histograms
Note that the choice of binwidth strongly determines the histogram's shape. There is no "correct" choice for binwidth, and you can sometimes gain insights into the data by experimenting with binwidths, as in the sketch below.
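A quick way to experiment, using base R's hist() on the male heights (the breaks argument is treated as a suggestion for the approximate number of bins):
library(dslabs)
data(heights)
x <- heights$height[heights$sex == "Male"]
hist(x, breaks = 10)   # wide bins: coarse shape
hist(x, breaks = 50)   # narrow bins: more detail, but noisier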
Key points
The normal distribution is symmetric, centered at the mean, and completely defined by its mean (𝜇) and standard deviation (𝜎).
The standard deviation is roughly the average distance between a value and the mean (more precisely, the square root of the average squared distance from the mean).
Calculate the mean using the mean() function.
Calculate the standard deviation using the sd() function or manually.
Standard units describe how many standard deviations a value is away from the mean. The z-score, or number of standard deviations an observation 𝑥 is away from the mean 𝜇: 𝑧 = (𝑥 − 𝜇)/𝜎
Compute standard units with the scale() function.
Important: to calculate the proportion of values that meet a certain condition, use the mean() function on a logical vector. Because TRUE is converted to 1 and FALSE is converted to 0, taking the mean of this vector yields the proportion of TRUE.
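A tiny illustration of this last point, with a made-up vector v:
v <- c(2, 7, 1, 9, 3)
v > 4         # logical vector: FALSE TRUE FALSE TRUE FALSE
mean(v > 4)   # proportion of TRUE values: 0.4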
Equation for the normal distribution
The normal distribution is mathematically defined by the following formula for any mean 𝜇 and standard deviation 𝜎:
Pr(a < x < b) = \int_a^b \frac{1}{\sqrt{2\pi}\sigma} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \, dx
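As a sanity check (not from the original notes), this integral can be evaluated numerically with integrate() and dnorm(), which implements the integrand above, and compared with the equivalent CDF computation; mu and sigma here are illustrative values roughly matching the male heights:
mu <- 69.3     # illustrative mean
sigma <- 3.6   # illustrative standard deviation
# Pr(68 < x < 70) by numerically integrating the normal density
integrate(function(x) dnorm(x, mu, sigma), lower = 68, upper = 70)$value
# the same probability via the normal CDF
pnorm(70, mu, sigma) - pnorm(68, mu, sigma)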
Code
# define x as vector of male heights
library(tidyverse)
library(dslabs)
data(heights)
index <- heights$sex=="Male"
x <- heights$height[index]
# calculate the mean and standard deviation manually
average <- sum(x)/length(x)
SD <- sqrt(sum((x - average)^2)/length(x))
# built-in mean and sd functions (note: sd() divides by length(x) - 1; see the note below)
average <- mean(x)
SD <- sd(x)
c(average = average, SD = SD)
## average SD
## 69.314755 3.611024
# calculate standard units
z <- scale(x)
# calculate proportion of values within 2 SD of mean
mean(abs(z) < 2)
## [1] 0.9495074
Note about the sd function
The built-in R function sd() calculates the standard deviation, but it divides by length(x) - 1 instead of length(x). When the length of the vector is large, this difference is negligible and you can use the built-in sd() function. Otherwise, you should compute 𝜎 by hand (or rescale, as in the sketch below). For this course series, assume that you should use the sd() function unless you are told not to do so.
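To see the difference concretely (a sketch using the male heights x defined above):
n <- length(x)
sqrt(sum((x - mean(x))^2) / n)   # divides by n (the SD computed manually above)
sd(x)                            # divides by n - 1
sd(x) * sqrt((n - 1) / n)        # rescaling sd() recovers the divide-by-n version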
Standard units
For data that are approximately normal, standard units describe the number of standard deviations an observation is from the mean. Standard units are denoted by the variable 𝑧 and are also known as z-scores.
For any value 𝑥 from a normal distribution with mean 𝜇 and standard deviation 𝜎, the value in standard units is:
𝑧 = (𝑥 − 𝜇)/𝜎
Standard units are useful for many reasons. Note that the formula for the normal distribution is simplified by substituting 𝑧 in the exponent:
Pr(a < x < b) = \int_a^b \frac{1}{\sqrt{2\pi}\sigma} \, e^{-\frac{1}{2}z^2} \, dx
When 𝑧 = 0, the normal density is at its maximum, which occurs at the mean 𝜇. The function is symmetric around 𝑧 = 0.
The normal distribution of z-scores is called the standard normal distribution and is defined by 𝜇=0 and 𝜎=1 .
Z-scores are useful to quickly evaluate whether an observation is average or extreme. Z-scores near 0 are average. Z-scores above 2 or below -2 are significantly above or below the mean, and z-scores above 3 or below -3 are extremely rare.
We will learn more about benchmark z-score values and their corresponding probabilities below.
Code: Converting to standard units
The scale() function converts a numeric vector into standard units (z-scores) by subtracting the mean and dividing by the standard deviation.
z <- scale(x)
You can compute the proportion of observations that are within 2 standard deviations of the mean like this:
mean(abs(z) < 2)
The 68-95-99.7 Rule
The normal distribution is associated with the 68-95-99.7 rule. This rule describes the probability of observing events within a certain number of standard deviations of the mean.
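A sketch checking the rule both with the standard normal CDF (pnorm(), introduced formally in the next section) and with the z-scores z computed above; the data-based proportions will be close to, but not exactly, the theoretical values:
# theoretical probabilities within 1, 2 and 3 SDs of the mean
pnorm(1) - pnorm(-1)   # approximately 0.68
pnorm(2) - pnorm(-2)   # approximately 0.95
pnorm(3) - pnorm(-3)   # approximately 0.997
# observed proportions for the male height z-scores
mean(abs(z) < 1)
mean(abs(z) < 2)
mean(abs(z) < 3)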
Key points
The normal distribution has a mathematically defined CDF which can be computed in R with the function pnorm().
pnorm(a, avg, s) gives the value of the cumulative distribution function 𝐹(𝑎) for the normal distribution defined by average avg and standard deviation s.
We say that a random quantity is normally distributed with average avg and standard deviation s if the approximation pnorm(a, avg, s) holds for all values of a.
If we are willing to use the normal approximation for height, we can estimate the distribution simply from the mean and standard deviation of our values (a sketch comparing the empirical and approximated CDFs follows this list).
If we treat the reported heights as discrete values rather than continuous ones, tabulating the frequency of each exact value is not very useful: integer values are far more common than expected because people round their heights. This rounding of continuous values is called discretization.
With rounded data, the normal approximation is particularly useful when computing probabilities of intervals of length 1 that include exactly one integer.
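A sketch of the comparison referenced above: plot the empirical CDF of the male heights against the normal approximation across a range of cutoffs.
library(dslabs)
data(heights)
x <- heights$height[heights$sex == "Male"]
a <- seq(60, 80, 0.5)
empirical_cdf <- sapply(a, function(val) mean(x <= val))   # F(a) from the data
normal_cdf <- pnorm(a, mean(x), sd(x))                     # F(a) from the normal approximation
plot(a, empirical_cdf, type = "l", xlab = "a", ylab = "F(a)")
lines(a, normal_cdf, lty = 2)                              # dashed line: normal approximation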
Code: Using pnorm to calculate probabilities
Given male heights x:
library(tidyverse)
library(dslabs)
data(heights)
x <- heights %>% filter(sex=="Male") %>% pull(height)
We can estimate the probability that a male is taller than 70.5 inches with:
1 - pnorm(70.5, mean(x), sd(x))
## [1] 0.371369
Code: Discretization and the normal approximation
# plot distribution of exact heights in data
plot(prop.table(table(x)), xlab = "a = Height in inches", ylab = "Pr(x = a)")
# probabilities in actual data over length 1 ranges containing an integer
mean(x <= 68.5) - mean(x <= 67.5)
## [1] 0.114532
mean(x <= 69.5) - mean(x <= 68.5)
## [1] 0.1194581
mean(x <= 70.5) - mean(x <= 69.5)
## [1] 0.1219212
# probabilities in normal approximation match well
pnorm(68.5, mean(x), sd(x)) - pnorm(67.5, mean(x), sd(x))
## [1] 0.1031077
pnorm(69.5, mean(x), sd(x)) - pnorm(68.5, mean(x), sd(x))
## [1] 0.1097121
pnorm(70.5, mean(x), sd(x)) - pnorm(69.5, mean(x), sd(x))
## [1] 0.1081743
# probabilities in actual data over other ranges don't match normal approx as well
mean(x <= 70.9) - mean(x <= 70.1)
## [1] 0.02216749
pnorm(70.9, mean(x), sd(x)) - pnorm(70.1, mean(x), sd(x))
## [1] 0.08359562