library(tidyverse)
library(here)
library(visdat)
library(glue)
In Spring 2021, I taught a course entitled “Telling Stories with Data” (TSD), which introduced non-STEM majors to the Tidyverse and basic data visualization and analysis. I had bright students, but ones who typically had NO prior experience with either statistical analysis or computer programming. So TSD was designed as a soft entry, beginner-level guide to working with data.
This required having the students explore and learn about distributions, distributions as probability spaces, the normal distribution, and the Central Limit Theorem (CLT). To people with a background in Quantitative Analysis, this is all remedial. For people with no background, it can be difficult, confusing, and seem even at times semi-magical.
To help the students learn about the normal distribution, the CLT, and functions like pnorm()
, I created a small family of customized graphing functions for one of the natural normal distributions we were were exploring
Female students under the minimum male height:
Male students under the maximum female height:
I wanted a result that would visualize the probability space and print “Zed” score for a given value in the distribution. Essentially, a combination ofpnorm()
with ggplot()
and the appropriate area shaded.
Female students over the minimum male height:
Male students over the maximum female height:
We start with a normal distribution. Human height is well-studied, and we know that for adults it is normally distributed. This both without and after accounting for the two primary biological sexes: but we get more accurate results when accounting for biological sex.
So as our class practice, we always try to work with real world data sets (or near-enough). This data set concerns a population of college students. The data source: Hartmann, K., Krois, J., Waske, B. (2018). “The Standard Normal Distribution.” E-Learning Project SOGA: Statistics and Geospatial Data Analysis. Department of Earth Sciences, Freie Universitaet Berlin. CC-BY-SA 4.0.
students <- read.csv("https://userpage.fu-berlin.de/soga/200/2010_data_sets/students.csv")
students <- students %>%
tibble()
students %>%
glimpse()
## Rows: 8,239
## Columns: 16
## $ stud.id <int> 833917, 898539, 379678, 807564, 383291, 256074, 754591~
## $ name <chr> "Gonzales, Christina", "Lozano, T'Hani", "Williams, Ha~
## $ gender <chr> "Female", "Female", "Female", "Male", "Female", "Male"~
## $ age <int> 19, 19, 22, 19, 21, 19, 21, 21, 18, 18, 22, 18, 23, 20~
## $ height <int> 160, 172, 168, 183, 175, 189, 156, 167, 195, 165, 162,~
## $ weight <dbl> 64.8, 73.0, 70.6, 79.7, 71.4, 85.8, 65.9, 65.7, 94.4, ~
## $ religion <chr> "Muslim", "Other", "Protestant", "Other", "Catholic", ~
## $ nc.score <dbl> 1.91, 1.56, 1.24, 1.37, 1.46, 1.34, 1.11, 2.03, 1.29, ~
## $ semester <chr> "1st", "2nd", "3rd", "2nd", "1st", "2nd", "2nd", "3rd"~
## $ major <chr> "Political Science", "Social Sciences", "Social Scienc~
## $ minor <chr> "Social Sciences", "Mathematics and Statistics", "Math~
## $ score1 <int> NA, NA, 45, NA, NA, NA, NA, 58, 57, NA, 62, 76, 71, 66~
## $ score2 <int> NA, NA, 46, NA, NA, NA, NA, 62, 67, NA, 61, 82, 76, 70~
## $ online.tutorial <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, ~
## $ graduated <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, ~
## $ salary <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 45254.11, NA, ~
Since we are only interested in height, the other variables will be dropped.
stu_height <- students %>%
select(gender, height)
stu_height %>%
group_by(gender) %>%
skimr::skim()
Name | Piped data |
Number of rows | 8239 |
Number of columns | 2 |
_______________________ | |
Column type frequency: | |
numeric | 1 |
________________________ | |
Group variables | gender |
Variable type: numeric
skim_variable | gender | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|---|
height | Female | 0 | 1 | 163.65 | 7.92 | 135 | 158 | 164 | 169 | 193 | ▁▃▇▃▁ |
height | Male | 0 | 1 | 179.07 | 7.99 | 144 | 174 | 179 | 184 | 206 | ▁▂▇▆▁ |
We can also recreate some of the skimr::skim()
results as summary stats.
dist_stats <- stu_height %>%
group_by(gender) %>%
summarize(avg_height = mean(height),
sd_height = sd(height),
min_height = min(height),
max_height = max(height))
dist_stats %>%
knitr::kable( caption = "Summary Stats")
gender | avg_height | sd_height | min_height | max_height |
---|---|---|---|---|
Female | 163.6533 | 7.919726 | 135 | 193 |
Male | 179.0727 | 7.988852 | 144 | 206 |
female_avg <- dist_stats$avg_height[1]
female_sd <- dist_stats$sd_height[1]
male_avg <- dist_stats$avg_height[2]
male_sd <- dist_stats$sd_height[2]
Our distributions, with the mean lines added:
plot_heights <- stu_height %>%
ggplot( aes(x = height, fill = gender) ) +
geom_density(alpha = 0.4) +
labs(x = "Height in CM", y = "Density",
fill = "Gender", title = "College Student Height: Normally Distributed",
subtitle = "8239 observations. Female: 4110. Male: 4129.",
caption = "Data Source: Freie Universität Berlin") +
geom_vline(xintercept = male_avg, lty = 3) +
geom_vline(xintercept = female_avg, lty = 3) +
scale_fill_manual(values = c("red", "blue")) +
theme_classic() +
theme(legend.position = c(.9, .9))
plot_heights ## Main plot to start
Usingrnom()
and dnorm()
, we can we build the density curves with the parameter values for the mean and standard deviation. The result shows that our empirical distributions match closely to idealized normal distributions:
test_sd <- sd(stu_height$height)
test_avg <- mean(stu_height$height)
height_norm <- rnorm(10000, test_avg, test_sd )
height_norm <- height_norm %>% enframe(value = "height", name = NULL)
x_ticks <- seq(120, 220, by = 10)
three_dist <- height_norm %>%
ggplot(aes(x = height)) +
stat_function(fun = dnorm,
args = list(mean = female_avg, sd = female_sd),
color = "red") +
stat_function(fun = dnorm,
args = list(mean = male_avg, sd = male_sd),
color = "blue") +
stat_function(fun = dnorm,
args = list(mean = mean(height_norm$height),
sd = sd(height_norm$height)),
color = "black") +
scale_x_continuous(breaks = x_ticks) +
theme_classic() +
labs(title = "Student Height Distribution",
subtitle = "Black: All; Red: Female; Blue: Male",
y = "Density", x = "Height in CM",
caption = "Data Source: Freie Universität Berlin")
three_dist
So we can proceed with confidence in using the Student Height data, stu_height
, to explore the normal distribution and the CLT.
stu_height %>%
summarize(avg_height = mean(height),
sd_height = sd(height),
min_height = min(height),
max_height = max(height)) %>%
mutate(gender = "All") %>%
select(gender, everything()) %>%
bind_rows(dist_stats) %>%
knitr::kable( caption = "Summary Stats")
gender | avg_height | sd_height | min_height | max_height |
---|---|---|---|---|
All | 171.3808 | 11.077529 | 135 | 206 |
Female | 163.6533 | 7.919726 | 135 | 193 |
Male | 179.0727 | 7.988852 | 144 | 206 |
The functions will take a numeric value: height measured in centimeters. We want to do the following:
To do this, we use an obscure ggplot function: ggplot_build()
. It allows us to access the plot data itself.
So we will create two plots to mine for data. First, a plot of Female Height; second, a plot of Male Height. We will then use in our custom functions the information obtained by building those plots.
female_height <- stu_height %>%
filter(gender == "Female") %>%
mutate(zed = scale(height, center = TRUE) )
female_height %>% head()
## # A tibble: 6 x 3
## gender height zed[,1]
## <chr> <int> <dbl>
## 1 Female 160 -0.461
## 2 Female 172 1.05
## 3 Female 168 0.549
## 4 Female 175 1.43
## 5 Female 156 -0.966
## 6 Female 167 0.423
male_height <- stu_height %>%
filter(gender == "Male") %>%
mutate(zed = scale(height, center = TRUE) )
male_height %>% head()
## # A tibble: 6 x 3
## gender height zed[,1]
## <chr> <int> <dbl>
## 1 Male 183 0.492
## 2 Male 189 1.24
## 3 Male 195 1.99
## 4 Male 185 0.742
## 5 Male 172 -0.885
## 6 Male 182 0.366
So far, so good.
We now build two plots not for display, but to raid the data generated by building the plots themselves. We’re looking at the guts, so to speak, of two density plots.
fh_plot <- female_height %>%
ggplot( aes(x = height ) ) +
geom_density() # Just for the data
fh_area <- ggplot_build(fh_plot)$data[[1]] # grab data
mh_plot <- male_height %>%
ggplot( aes(x = height ) ) +
geom_density() # Just for the data
mh_area <- ggplot_build(mh_plot)$data[[1]] # grab data
fh_area %>% glimpse()
## Rows: 512
## Columns: 18
## $ y <dbl> 7.399859e-05, 7.437072e-05, 7.436777e-05, 7.408919e-05, 7.~
## $ x <dbl> 135.0000, 135.1135, 135.2270, 135.3405, 135.4540, 135.5675~
## $ density <dbl> 7.399859e-05, 7.437072e-05, 7.436777e-05, 7.408919e-05, 7.~
## $ scaled <dbl> 0.001504427, 0.001511993, 0.001511933, 0.001506269, 0.0014~
## $ ndensity <dbl> 0.001504427, 0.001511993, 0.001511933, 0.001506269, 0.0014~
## $ count <dbl> 0.3041342, 0.3056637, 0.3056515, 0.3045066, 0.3024716, 0.2~
## $ n <int> 4110, 4110, 4110, 4110, 4110, 4110, 4110, 4110, 4110, 4110~
## $ flipped_aes <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA~
## $ PANEL <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ group <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1~
## $ ymin <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ ymax <dbl> 7.399859e-05, 7.437072e-05, 7.436777e-05, 7.408919e-05, 7.~
## $ fill <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
## $ weight <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ colour <chr> "black", "black", "black", "black", "black", "black", "bla~
## $ alpha <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
## $ size <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5~
## $ linetype <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
mh_area %>% glimpse()
## Rows: 512
## Columns: 18
## $ y <dbl> 9.818063e-05, 1.012755e-04, 1.039786e-04, 1.062890e-04, 1.~
## $ x <dbl> 144.0000, 144.1213, 144.2427, 144.3640, 144.4853, 144.6067~
## $ density <dbl> 9.818063e-05, 1.012755e-04, 1.039786e-04, 1.062890e-04, 1.~
## $ scaled <dbl> 0.002035178, 0.002099330, 0.002155364, 0.002203256, 0.0022~
## $ ndensity <dbl> 0.002035178, 0.002099330, 0.002155364, 0.002203256, 0.0022~
## $ count <dbl> 0.4053878, 0.4181664, 0.4293277, 0.4388673, 0.4467997, 0.4~
## $ n <int> 4129, 4129, 4129, 4129, 4129, 4129, 4129, 4129, 4129, 4129~
## $ flipped_aes <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA~
## $ PANEL <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ group <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1~
## $ ymin <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ ymax <dbl> 9.818063e-05, 1.012755e-04, 1.039786e-04, 1.062890e-04, 1.~
## $ fill <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
## $ weight <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ colour <chr> "black", "black", "black", "black", "black", "black", "bla~
## $ alpha <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
## $ size <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5~
## $ linetype <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
The area we need to fill is defined by the x, y coordinates. We can ignore the other variables for our purpose here.
We are now ready build our custom plotting functions. Using pnorm()
will give us the probability space as a numeric. Using geom_area(data = subset(fh_area, x < test_height)
will give us the shape up to our cut-off value: the lower tail. Conversely, using geom_area(data = subset(fh_area, x > test_height)
will give us the area to the right of our cut-off value: the upper tail.
# Custom functions --------------------------------------------------------
## female under
female_show_height_under <- function(test_height){
prob_h <- pnorm(test_height, female_avg, female_sd) %>%
round(3)
zed <- female_height %>%
filter(height == test_height) %>%
pull(zed) %>% unique() %>% round(3)
if(test_height < 136) warning ("Choose a value between 136 & 192")
if(test_height > 192 ) warning ("Choose a value between 136 & 192")
female_height %>%
ggplot( aes(x = height, y = ..density.. ) ) +
geom_density( alpha = 0.4, fill = "grey") +
geom_vline(xintercept = test_height, lty = 3, color = "red") +
geom_vline(xintercept = female_avg, lty = 5, color = "grey") +
geom_area(data = subset(fh_area, x < test_height ) ,
aes(x = x, y = y),
fill = "red",
alpha = 0.3) +
theme_classic() +
labs(title = "Female Students: Height Under",
x = "Height in CM", y = "Density",
subtitle = paste("Red area <", {test_height}, "CM",
" | ", "Probability space:", {prob_h},
" | ", "Zed:", {zed} ),
caption = "Data Source: Freie Universität Berlin")
}
#
## female over
female_show_height_over <- function(test_height){
prob_h <- pnorm(test_height, female_avg, female_sd,
lower.tail = FALSE) %>%
round(3)
zed <- female_height %>%
filter(height == test_height) %>%
pull(zed) %>% unique() %>% round(3)
if(test_height < 136) warning ("Choose a value between 136 & 192")
if(test_height > 192 ) warning ("Choose a value between 136 & 192")
female_height %>%
ggplot( aes(x = height, y = ..density.. ) ) +
geom_density( alpha = 0.4, fill = "grey") +
geom_vline(xintercept = test_height, lty = 3, color = "red") +
geom_vline(xintercept = female_avg, lty = 5, color = "grey") +
geom_area(data = subset(fh_area, x > test_height ) ,
aes(x = x, y = y),
fill = "red",
alpha = 0.3) +
theme_classic() +
labs(title = "Female Students: Height Over",
x = "Height in CM", y = "Density",
subtitle = paste("Red area >", {test_height},
"CM", " | ", "Probability space:", {prob_h},
" | ", "Zed:", {zed} ),
caption = "Data Source: Freie Universität Berlin")
}
#
# male under
male_show_height_under <- function(test_height){
prob_h <- pnorm(test_height, male_avg, male_sd) %>%
round(3)
zed <- male_height %>%
filter(height == test_height) %>%
pull(zed) %>% unique() %>% round(3)
if(test_height < 145) warning ("Choose a value between 145 & 201")
if(test_height > 201 ) warning ("Choose a value between 145 & 201")
male_height %>%
ggplot( aes(x = height, y = ..density.. ) ) +
geom_density( alpha = 0.4, fill = "grey") +
geom_vline(xintercept = test_height, lty = 3, color = "blue") +
geom_vline(xintercept = male_avg, lty = 5, color = "grey") +
geom_area(data = subset(mh_area, x < test_height ) ,
aes(x = x, y = y),
fill = "blue",
alpha = 0.3) +
theme_classic() +
labs(title = "Male Students: Height Under",
x = "Height in CM", y = "Density",
subtitle = paste("Blue area <", {test_height},
"CM", " | ",
"Probability space:", {prob_h},
" | ", "Zed:", {zed} ),
caption = "Source: Freie Universität Berlin")
}
#
## male over
male_show_height_over <- function(test_height){
prob_h <- pnorm(test_height, male_avg, male_sd, lower.tail = FALSE) %>%
round(3)
zed <- male_height %>%
filter(height == test_height) %>%
pull(zed) %>% unique() %>% round(3)
if(test_height < 145) warning ("Choose a value between 145 & 201")
if(test_height > 201 ) warning ("Choose a value between 145 & 201")
male_height %>%
ggplot( aes(x = height, y = ..density.. ) ) +
geom_density( alpha = 0.4, fill = "grey") +
geom_vline(xintercept = test_height, lty = 3, color = "blue") +
geom_vline(xintercept = male_avg, lty = 5, color = "grey") +
geom_area(data = subset(mh_area, x > test_height ) ,
aes(x = x, y = y),
fill = "blue",
alpha = 0.3) +
theme_classic() +
labs(title = "Male Students: Height Over",
x = "Height in CM", y = "Density",
subtitle = paste("Blue area >",
{test_height}, "CM",
" | ", "Probability space:",
{prob_h}, " | ", "Zed:", {zed} ),
caption = "Data Source: Freie Universität Berlin")
}
To work the properly, the possible values for each function are the minimum distribution value + 1, or the maximum distribution value - 1. It makes no sense to ask what is the probability of a value outside of the distribution.
We can start with a test_height of 170.
test_height <- 170
# Choose a whole number value between 136 & 192
female_show_height_under(test_height)
female_show_height_over(test_height)
# Choose a whole lnumber value between 145 & 201
male_show_height_under(test_height)
male_show_height_over(test_height)
All good!
We now have our custom functions for classwork and play. Let’s save them and the needed data so that we can load the image into our worksheet. Please note that our worksheet will first load the libraries tidyverse
, here
, and glue
.
remove_these <- c("p5","p6", "p7", "p8",
"mh_plot", "fh_plot",
"students", "three_dist",
"x_ticks", "test_avg",
"test_sd", "test_height")
rm(list = remove_these)
rm(remove_these)
save.image(file = here::here("data",
"tidy_data",
"fun_w_norm.RData"))
The data, functions, this RMD, and the worksheet RMD are freely available at github.com/Thom-J-H/Grab_Bag. Please feel free to improve. Thank you.
Thomas J. Haslam
2021-08-26