Course 2 - Data Science: Visualization

1.2 - Introduction to Distributions

Describe Heights to ET

🎞 Describe Heights to ET

📖 Case study: describing student heights

📖 Distribution function

📖 Cumulative distribution functions

📖 Histograms

A distribution is a function or description that shows the possible values of a variable and how often those values occur.
For categorical variables, the distribution describes the proportions of each category.
A frequency table is the simplest way to show a categorical distribution. Use prop.table() to convert a table of counts to a frequency table.
Barplots display the distribution of categorical variables and are a way to visualize the information in frequency tables.
For continuous numerical data, reporting the frequency of each unique entry is not an effective summary as many or most values are unique. Instead, a distribution function is required.
The cumulative distribution function (CDF) is a function that reports the proportion of data at or below a value a for all values of a:

\(F(a) = \mbox{Pr}(x \leq a)\)
The proportion of observations between any two values a and b can be computed from the CDF as:

\(F(b) - F(a)\)
A histogram divides data into non-overlapping bins of the same size and plots the number of values that fall in each bin.
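
The binning step behind a histogram can be sketched in base R with hist(). This is a minimal illustration on simulated values: the vector sim_heights and the 1-inch bin width are assumptions made for the example, not part of the course data. Setting plot = FALSE returns the bin counts without drawing anything.

```r
# Hypothetical data: 500 simulated heights in inches (not the course dataset)
set.seed(1)
sim_heights <- rnorm(500, mean = 69, sd = 3)

# Bins of width 1 inch that cover the full range of the data
breaks <- seq(floor(min(sim_heights)), ceiling(max(sim_heights)), by = 1)

# plot = FALSE returns the binning without drawing the histogram
h <- hist(sim_heights, breaks = breaks, plot = FALSE)

h$counts        # number of values falling in each bin
sum(h$counts)   # every value lands in exactly one bin
```

Calling plot(h) afterwards would draw the bars from these counts.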

# Load the dataset
library(dslabs)
data(heights)
# Make a table of category proportions
prop.table(table(heights$sex))

## 
##    Female      Male 
## 0.2266667 0.7733333
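
As a rough sketch of how a frequency table feeds a barplot, the example below uses a small made-up vector (sex here is hypothetical, not the heights data) so that it runs without any packages. barplot() is the base R plotting function.

```r
# Hypothetical categorical variable (not the course dataset)
sex <- c("Female", "Male", "Male", "Male", "Female", "Male", "Male", "Male")

counts <- table(sex)          # counts per category
props <- prop.table(counts)   # proportions per category; they sum to 1
props

# Bar heights are the proportions from the frequency table
barplot(props, ylab = "Proportion")
```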

Cumulative distribution function

Every continuous distribution has a cumulative distribution function (CDF). The CDF defines the proportion of the data at or below a given value a for all values of a:

\(F(a) = \mbox{Pr}(x \leq a)\)

Any continuous dataset has a CDF, not only normal distributions. For example, the male heights data we used in the previous section has this CDF:

[Figure: CDF of male heights]

This plot of the CDF for male heights has height values a on the x-axis and the proportion of students with heights of that value or lower, F(a), on the y-axis.

The CDF is essential for calculating probabilities related to continuous data. In a continuous dataset, the probability of a specific exact value is not informative because most entries are unique. For example, in the student heights data, only one individual reported a height of 68.8976377952726 inches, but many students rounded similar heights to 69 inches. If we computed exact value probabilities, we would find that being exactly 69 inches is much more likely than being a non-integer exact height, which does not match our understanding that height is continuous. We can instead use the CDF to obtain a useful summary, such as the probability that a student is between 68.5 and 69.5 inches.

For datasets that are not normal, the CDF can be calculated manually by defining a function to compute the probability above. This function can then be applied to a range of values across the range of the dataset to calculate a CDF. Given a dataset my_data, the CDF can be calculated and plotted like this:

my_data <- heights$height
# Define range of values spanning the dataset
# (length.out is the full argument name; length only works through partial matching)
a <- seq(min(my_data), max(my_data), length.out = 100)
cdf_function <- function(x) { 
    # Computes prob. for a single value
    mean(my_data <= x)
}
cdf_values <- sapply(a, cdf_function)
plot(a, cdf_values)
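
Base R also ships ecdf() in the stats package, which builds the empirical CDF as a function. The sketch below, run on simulated values standing in for my_data (an assumption for the example), suggests that it agrees with the manual approach above.

```r
# Hypothetical data standing in for my_data (not the course dataset)
set.seed(2)
my_data <- rnorm(200, mean = 69, sd = 3)

# Manual empirical CDF for a single value x
cdf_function <- function(x) mean(my_data <= x)

# ecdf() returns a function; F_hat(a) is the proportion of values <= a
F_hat <- ecdf(my_data)

a <- seq(min(my_data), max(my_data), length.out = 100)
same <- all.equal(F_hat(a), sapply(a, cdf_function))
same   # TRUE when the two approaches agree
```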

The CDF defines the proportion of data at or below a cutoff a. To define the proportion of values above a, we compute:

\(1 - F(a)\)

To define the proportion of values between a and b, we compute:

\(F(b) - F(a)\)

Note that the CDF can help compute probabilities. The probability of observing a randomly chosen value between a and b is equal to the proportion of values between a and b, which we compute with the CDF.
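
The two formulas above can be checked against direct counts. This is a minimal sketch on simulated data; the vector x, the helper F_hat and the cutoffs a and b are all chosen for illustration.

```r
# Hypothetical data (not the course dataset)
set.seed(3)
x <- rnorm(1000, mean = 69, sd = 3)

F_hat <- function(a) mean(x <= a)   # empirical CDF at a single value a

a <- 68
b <- 72

above_a   <- 1 - F_hat(a)           # proportion of values above a
between   <- F_hat(b) - F_hat(a)    # proportion of values in the interval (a, b]

# Both match direct counts on the data
all.equal(above_a, mean(x > a))
all.equal(between, mean(x > a & x <= b))
```

Note that F(b) - F(a) counts values equal to b but not values equal to a, which only matters when the data contain ties at the endpoints.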

Smooth Density Plots

🎞 Smooth Density Plots

📖 Smoothed density

Smooth density plots can be thought of as histograms where the bin width is extremely or infinitely small. The smoothing function makes estimates of the true continuous trend of the data given the available sample of data points.
The degree of smoothness can be controlled by an argument in the plotting function. (We’ll learn functions for plotting later.)
While the histogram is an assumption-free summary, the smooth density plot is shaped by assumptions and choices you make as a data analyst.
The y-axis is scaled so that the area under the density curve sums to 1. This means that interpreting values on the y-axis is not straightforward. To determine the proportion of data in between two values, compute the area under the smooth density curve in the region between those values.
An advantage of smooth densities over histograms is that densities are easier to compare visually.
Note that the choice of binwidth strongly affects the shape of the plot. There is no “correct” choice for binwidth, and you can sometimes gain insights into the data by experimenting with binwidths.
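
One way to see these points in base R is with density(), whose adjust argument scales the default bandwidth. The sketch below uses simulated data and a hypothetical helper, area_between(), that approximates the area under the curve with a simple Riemann sum; it is an illustration, not the plotting function the course introduces later.

```r
# Hypothetical data (not the course dataset)
set.seed(4)
x <- rnorm(1000, mean = 69, sd = 3)

# adjust scales the default bandwidth: larger values give a smoother curve
d_default <- density(x)
d_smooth  <- density(x, adjust = 2)
c(d_default$bw, d_smooth$bw)         # bandwidths actually used

# Approximate the area under a density curve between lo and hi
# with a Riemann sum over the evaluation grid returned by density()
area_between <- function(d, lo, hi) {
  step <- d$x[2] - d$x[1]            # the grid is equally spaced
  sum(d$y[d$x >= lo & d$x <= hi]) * step
}

total_area <- area_between(d_default, -Inf, Inf)   # close to 1
est_prop   <- area_between(d_default, 66, 72)      # estimated proportion in [66, 72]
obs_prop   <- mean(x >= 66 & x <= 72)              # proportion observed in the data
c(total_area, est_prop, obs_prop)
```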

Normal distribution

🎞 Normal Distribution

📖 The normal distribution

The normal distribution:
- Is centered around one value, the mean
- Is symmetric around the mean
- Is defined completely by its mean μ and standard deviation σ
- Always has the same proportion of observations within a given distance of the mean (for example, 95% within 2σ)

The standard deviation measures the typical distance between a value and the mean; formally, it is the square root of the average squared distance from the mean.
Calculate the mean using the mean() function.
Calculate the standard deviation using the sd() function or manually.
Standard units describe how many standard deviations a value is away from the mean. The z-score of an observation x is the number of standard deviations it lies from the mean μ:

\(Z = \frac{x - \mu}{\sigma}\)
Compute standard units with the scale() function.
Important: to calculate the proportion of values that meet a certain condition, use the mean() function on a logical vector.
Because TRUE is converted to 1 and FALSE is converted to 0, taking the mean of this vector yields the proportion of TRUE values.
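
The points about sd() and logical vectors can be sketched as follows. The data are simulated and the cutoff of 72 is arbitrary; the one fact worth noting is that sd() divides by n - 1, so a manual calculation only matches it when it does the same.

```r
# Hypothetical data (not the course dataset)
set.seed(5)
x <- rnorm(500, mean = 69, sd = 3)

average <- mean(x)

# sd() divides by n - 1 (sample standard deviation)
sd_builtin <- sd(x)
sd_manual  <- sqrt(sum((x - average)^2) / (length(x) - 1))
all.equal(sd_builtin, sd_manual)

# Dividing by n instead gives a slightly smaller value
sd_population <- sqrt(mean((x - average)^2))

# A logical vector is coerced to 0/1, so its mean is a proportion
taller <- x > 72
mean(taller)                    # proportion of values above 72
sum(taller) / length(taller)    # the same proportion, computed by counting
```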

- Equation for the normal distribution

The normal distribution is mathematically defined by the following formula for any mean μ and standard deviation σ:

\(\mbox{Pr}(a < x < b) = \int_a^b \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{1}{2}\left( \frac{x-\mu}{\sigma} \right)^2} \, dx\)
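
As a check on the formula above, the integrand can be written out by hand and compared with R's built-in dnorm(), then integrated numerically with integrate(). The values of mu, sigma, a and b are assumptions chosen for illustration.

```r
# Assumed parameters, chosen only for illustration
mu <- 69
sigma <- 3

# The integrand from the formula above
normal_density <- function(x) {
  1 / (sqrt(2 * pi) * sigma) * exp(-0.5 * ((x - mu) / sigma)^2)
}

# It agrees with R's built-in dnorm()
all.equal(normal_density(70), dnorm(70, mean = mu, sd = sigma))

# Pr(a < x < b) by numerical integration of the density
a <- 66
b <- 72
prob <- integrate(normal_density, lower = a, upper = b)$value
prob   # about 0.68, since a and b lie one sigma either side of mu
```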

Normal Distribution: Standard Units and Z-scores

- Standard units

For data that are approximately normal, standard units describe the number of standard deviations an observation is from the mean. Standard units are denoted by the variable z and are also known as z-scores.

For any value x from a normal distribution with mean μ and standard deviation σ, the value in standard units is:

\(z = \frac{x-\mu}{\sigma}\)

Standard units are useful for many reasons. Note that the formula for the normal distribution is simplified by substituting z in the exponent:

\(\mbox{Pr}(a < x < b) = \int_a^b \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{1}{2}z^2} \, dx\)

The normal distribution of z-scores is called the standard normal distribution and is defined by μ=0 and σ=1.

Z-scores are useful to quickly evaluate whether an observation is average or extreme. Z-scores near 0 are average. Z-scores above 2 or below -2 are significantly above or below the mean, and z-scores above 3 or below -3 are extremely rare.

We will learn more about benchmark z-score values and their corresponding probabilities below.

- Code: Converting to standard units

The scale function converts a vector of approximately normally distributed values into z-scores.

z <- scale(x)

You can compute the proportion of observations that are within 2 standard deviations of the mean like this:

mean(abs(z) < 2)
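
A self-contained version of the two lines above, on simulated data (x here is hypothetical). One detail that is easy to miss: scale() returns a one-column matrix, so the sketch wraps it in as.vector() and compares the result with the z-score formula applied directly.

```r
# Hypothetical data (not the course dataset)
set.seed(6)
x <- rnorm(500, mean = 69, sd = 3)

# scale() returns a one-column matrix; as.vector() drops the matrix structure
z <- as.vector(scale(x))

# Same result from the z-score formula applied directly
z_manual <- (x - mean(x)) / sd(x)
all.equal(z, z_manual)

mean(z)            # 0 up to floating-point error
sd(z)              # 1 up to floating-point error
mean(abs(z) < 2)   # proportion within 2 standard deviations of the mean
```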

- The 68-95-99.7 Rule

The normal distribution is associated with the 68-95-99.7 rule. This rule describes the probability of observing events within a certain number of standard deviations of the mean.

ND1

The probability distribution function for the normal distribution is defined such that:

About 68% of observations will be within one standard deviation of the mean, μ±σ. In standard units, this is equivalent to a z-score of \(\mid z \mid \leq 1\).

ND2

About 95% of observations will be within two standard deviations of the mean μ±2σ. In standard units, this is equivalent to a z-score of \(\mid z \mid \leq 2\).

ND3

About 99.7% of observations will be within three standard deviations of the mean μ±3σ. In standard units, this is equivalent to a z-score of \(\mid z \mid \leq 3\).

ND4

We will learn how to compute these exact probabilities in a later section, as well as probabilities for other intervals.
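
For readers who want the numbers now, the three benchmark probabilities can be reproduced with pnorm(), the normal CDF function covered in the next section. The helper within_k() is a name introduced here for illustration.

```r
# Probability within k standard deviations of the mean for a standard normal
within_k <- function(k) pnorm(k) - pnorm(-k)

round(within_k(1:3), 4)
## [1] 0.6827 0.9545 0.9973
```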

The Normal CDF and pnorm

🎞 The Normal CDF and pnorm

📖 Theoretical continuous distributions

The normal distribution has a mathematically defined CDF which can be computed in R with the function pnorm().
pnorm(a, avg, s) gives the value of the cumulative distribution function F(a) for the normal distribution defined by average avg and standard deviation s.
We say that a random quantity is normally distributed with average avg and standard deviation s if its CDF F(a) is well approximated by pnorm(a, avg, s) for all values of a.
If we are willing to use the normal approximation for height, we can estimate the distribution simply from the mean and standard deviation of our values.
If we treat the height data as discrete rather than continuous, the proportions of exact values are not very useful because integer values are more common than expected due to rounding. This is called discretization.
With rounded data, the normal approximation is particularly useful when computing probabilities of intervals of length 1 that include exactly one integer.
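
A minimal sketch of how pnorm() is used, with assumed values for avg, s and a rather than estimates from the heights data. It also shows that converting a to standard units and using the standard normal CDF gives the same answer.

```r
# Assumed parameters, chosen only for illustration
avg <- 69
s <- 3
a <- 70.5

# F(a) for a normal distribution with average avg and standard deviation s
p1 <- pnorm(a, avg, s)

# Same value from the standard normal CDF applied to the z-score of a
p2 <- pnorm((a - avg) / s)
all.equal(p1, p2)

# Probability of an interval of length 1 that contains exactly one integer (70)
p_interval <- pnorm(70.5, avg, s) - pnorm(69.5, avg, s)
p_interval
```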

- Code: Using pnorm to calculate probabilities

library(tidyverse)


library(dslabs)
data(heights)
x <- heights %>% filter(sex=="Male") %>% pull(height)

We can estimate the probability that a male is taller than 70.5 inches with:

1 - pnorm(70.5, mean(x), sd(x))

## [1] 0.371369

- Code: Discretization and the normal approximation

# Plot distribution of exact heights in data
plot(prop.table(table(x)), xlab = "a = Height in inches", ylab = "Pr(x = a)")

# Probabilities in ACTUAL data over length 1 ranges containing an integer
mean(x <= 68.5) - mean(x <= 67.5)

## [1] 0.114532

mean(x <= 69.5) - mean(x <= 68.5)

## [1] 0.1194581

mean(x <= 70.5) - mean(x <= 69.5)

## [1] 0.1219212

# Probabilities in normal approximation match well
pnorm(68.5, mean(x), sd(x)) - pnorm(67.5, mean(x), sd(x))

## [1] 0.1031077

pnorm(69.5, mean(x), sd(x)) - pnorm(68.5, mean(x), sd(x))

## [1] 0.1097121

pnorm(70.5, mean(x), sd(x)) - pnorm(69.5, mean(x), sd(x))

## [1] 0.1081743

# Probabilities in actual data over other ranges don't match normal approx as well
mean(x <= 70.9) - mean(x <= 70.1)

## [1] 0.02216749

pnorm(70.9, mean(x), sd(x)) - pnorm(70.1, mean(x), sd(x))

## [1] 0.08359562
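
The same pattern can be reproduced on simulated data, which may help show that it comes from rounding rather than from anything specific to the heights dataset. In this sketch the sample size, the 80% rounding rate and the helper compare_interval() are all assumptions made for illustration.

```r
# Hypothetical data: simulate normal heights, then round most of them to the
# nearest inch to mimic self-reported values (the 80% rounding rate is assumed)
set.seed(7)
n <- 5000
exact <- rnorm(n, mean = 69, sd = 3)
rounds <- runif(n) < 0.8
reported <- ifelse(rounds, round(exact), exact)

# Compare the observed proportion in (lo, hi] with the normal approximation
compare_interval <- function(lo, hi, x) {
  c(data   = mean(x <= hi) - mean(x <= lo),
    normal = pnorm(hi, mean(x), sd(x)) - pnorm(lo, mean(x), sd(x)))
}

with_integer    <- compare_interval(69.5, 70.5, reported)   # contains 70
without_integer <- compare_interval(70.1, 70.9, reported)   # contains no integer
with_integer      # data and normal approximation are close
without_integer   # data proportion is far below the normal approximation
```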