library(tidyverse)
library(here)
library(visdat)
library(glue)

Brief Introduction

In Spring 2021, I taught a course entitled “Telling Stories with Data” (TSD), which introduced non-STEM majors to the Tidyverse and basic data visualization and analysis. I had bright students, but ones who typically had NO prior experience with either statistical analysis or computer programming. So TSD was designed as a soft entry, beginner-level guide to working with data.

This required having the students explore and learn about distributions, distributions as probability spaces, the normal distribution, and the Central Limit Theorem (CLT). To people with a background in Quantitative Analysis, this is all remedial. For people with no background, it can be difficult, confusing, and seem even at times semi-magical.

To help the students learn about the normal distribution, the CLT, and functions like pnorm(), I created a small family of customized graphing functions for one of the natural normal distributions we were were exploring

Female students under the minimum male height:

Male students under the maximum female height:

The Graphing Functions

I wanted a result that would visualize the probability space and print “Zed” score for a given value in the distribution. Essentially, a combination ofpnorm() with ggplot() and the appropriate area shaded.

Female students over the minimum male height:

Male students over the maximum female height:

Starting with Data

We start with a normal distribution. Human height is well-studied, and we know that for adults it is normally distributed. This both without and after accounting for the two primary biological sexes: but we get more accurate results when accounting for biological sex.

Student Height Data

So as our class practice, we always try to work with real world data sets (or near-enough). This data set concerns a population of college students. The data source: Hartmann, K., Krois, J., Waske, B. (2018). “The Standard Normal Distribution.” E-Learning Project SOGA: Statistics and Geospatial Data Analysis. Department of Earth Sciences, Freie Universitaet Berlin. CC-BY-SA 4.0.

students <- read.csv("https://userpage.fu-berlin.de/soga/200/2010_data_sets/students.csv") 

students <- students %>% 
  tibble()

students %>% 
  glimpse()
## Rows: 8,239
## Columns: 16
## $ stud.id         <int> 833917, 898539, 379678, 807564, 383291, 256074, 754591~
## $ name            <chr> "Gonzales, Christina", "Lozano, T'Hani", "Williams, Ha~
## $ gender          <chr> "Female", "Female", "Female", "Male", "Female", "Male"~
## $ age             <int> 19, 19, 22, 19, 21, 19, 21, 21, 18, 18, 22, 18, 23, 20~
## $ height          <int> 160, 172, 168, 183, 175, 189, 156, 167, 195, 165, 162,~
## $ weight          <dbl> 64.8, 73.0, 70.6, 79.7, 71.4, 85.8, 65.9, 65.7, 94.4, ~
## $ religion        <chr> "Muslim", "Other", "Protestant", "Other", "Catholic", ~
## $ nc.score        <dbl> 1.91, 1.56, 1.24, 1.37, 1.46, 1.34, 1.11, 2.03, 1.29, ~
## $ semester        <chr> "1st", "2nd", "3rd", "2nd", "1st", "2nd", "2nd", "3rd"~
## $ major           <chr> "Political Science", "Social Sciences", "Social Scienc~
## $ minor           <chr> "Social Sciences", "Mathematics and Statistics", "Math~
## $ score1          <int> NA, NA, 45, NA, NA, NA, NA, 58, 57, NA, 62, 76, 71, 66~
## $ score2          <int> NA, NA, 46, NA, NA, NA, NA, 62, 67, NA, 61, 82, 76, 70~
## $ online.tutorial <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, ~
## $ graduated       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, ~
## $ salary          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 45254.11, NA, ~

Since we are only interested in height, the other variables will be dropped.

stu_height <- students %>% 
  select(gender, height)

stu_height %>% 
  group_by(gender) %>% 
  skimr::skim()
Data summary
Name Piped data
Number of rows 8239
Number of columns 2
_______________________
Column type frequency:
numeric 1
________________________
Group variables gender

Variable type: numeric

skim_variable gender n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
height Female 0 1 163.65 7.92 135 158 164 169 193 ▁▃▇▃▁
height Male 0 1 179.07 7.99 144 174 179 184 206 ▁▂▇▆▁

We can also recreate some of the skimr::skim() results as summary stats.

Summary Stats

dist_stats <- stu_height  %>% 
  group_by(gender) %>% 
  summarize(avg_height = mean(height), 
            sd_height = sd(height),
            min_height = min(height),
            max_height = max(height))

dist_stats %>%
  knitr::kable( caption = "Summary Stats")
Summary Stats
gender avg_height sd_height min_height max_height
Female 163.6533 7.919726 135 193
Male 179.0727 7.988852 144 206
female_avg <- dist_stats$avg_height[1] 
female_sd <- dist_stats$sd_height[1]

male_avg <- dist_stats$avg_height[2] 
male_sd <- dist_stats$sd_height[2]

Distributions Viz

Our distributions, with the mean lines added:

plot_heights <- stu_height %>%
  ggplot( aes(x = height,  fill = gender) ) +
  geom_density(alpha = 0.4) +
  labs(x = "Height in CM", y = "Density", 
       fill = "Gender", title = "College Student Height: Normally Distributed",
       subtitle = "8239 observations. Female: 4110.  Male: 4129.",
       caption = "Data Source: Freie Universität Berlin") +
  geom_vline(xintercept = male_avg,  lty = 3) +
  geom_vline(xintercept = female_avg,  lty = 3) +
  scale_fill_manual(values = c("red", "blue")) +
  theme_classic() +
  theme(legend.position = c(.9, .9))


plot_heights  ## Main plot to start

Parameters Viz

Usingrnom() and dnorm(), we can we build the density curves with the parameter values for the mean and standard deviation. The result shows that our empirical distributions match closely to idealized normal distributions:

test_sd <- sd(stu_height$height)
test_avg <- mean(stu_height$height)

height_norm <- rnorm(10000, test_avg, test_sd )

height_norm <- height_norm %>% enframe(value = "height", name = NULL)

x_ticks <- seq(120, 220, by = 10)

three_dist <- height_norm %>%
  ggplot(aes(x = height)) +
  stat_function(fun = dnorm, 
                args = list(mean = female_avg, sd = female_sd), 
                color = "red") +
  stat_function(fun = dnorm,  
                args = list(mean = male_avg, sd = male_sd),
                color = "blue") +
  stat_function(fun = dnorm,  
                args = list(mean = mean(height_norm$height), 
                            sd = sd(height_norm$height)),
                color = "black") +
  scale_x_continuous(breaks = x_ticks) +
  theme_classic() +
  labs(title = "Student Height Distribution",
       subtitle = "Black: All; Red: Female; Blue: Male",
       y = "Density", x = "Height in CM",
       caption = "Data Source: Freie Universität Berlin")

three_dist 

So we can proceed with confidence in using the Student Height data, stu_height, to explore the normal distribution and the CLT.

 stu_height %>% 
  summarize(avg_height = mean(height),
            sd_height = sd(height),
            min_height = min(height),
            max_height = max(height)) %>%
  mutate(gender = "All") %>%
  select(gender, everything()) %>%
  bind_rows(dist_stats) %>%
  knitr::kable( caption = "Summary Stats")
Summary Stats
gender avg_height sd_height min_height max_height
All 171.3808 11.077529 135 206
Female 163.6533 7.919726 135 193
Male 179.0727 7.988852 144 206

Building the Functions

The functions will take a numeric value: height measured in centimeters. We want to do the following:

  1. Map the value to the “Z” score.
  2. Calculate the probability space associated with that value.
  3. Shade the appropriate area under the curve.

To do this, we use an obscure ggplot function: ggplot_build(). It allows us to access the plot data itself.

So we will create two plots to mine for data. First, a plot of Female Height; second, a plot of Male Height. We will then use in our custom functions the information obtained by building those plots.

Map Value to ZED

female_height <- stu_height  %>% 
  filter(gender == "Female")  %>% 
  mutate(zed = scale(height, center = TRUE) )

female_height %>% head()
## # A tibble: 6 x 3
##   gender height zed[,1]
##   <chr>   <int>   <dbl>
## 1 Female    160  -0.461
## 2 Female    172   1.05 
## 3 Female    168   0.549
## 4 Female    175   1.43 
## 5 Female    156  -0.966
## 6 Female    167   0.423
male_height <- stu_height  %>% 
  filter(gender == "Male")  %>%
  mutate(zed = scale(height, center = TRUE) )

male_height %>% head()
## # A tibble: 6 x 3
##   gender height zed[,1]
##   <chr>   <int>   <dbl>
## 1 Male      183   0.492
## 2 Male      189   1.24 
## 3 Male      195   1.99 
## 4 Male      185   0.742
## 5 Male      172  -0.885
## 6 Male      182   0.366

So far, so good.

Map Value to Area

We now build two plots not for display, but to raid the data generated by building the plots themselves. We’re looking at the guts, so to speak, of two density plots.

fh_plot <- female_height %>%
  ggplot( aes(x = height ) ) +
  geom_density() # Just for the data

fh_area <- ggplot_build(fh_plot)$data[[1]] # grab data



mh_plot <- male_height %>%
  ggplot( aes(x = height ) ) +
  geom_density()  #  Just for the data


mh_area <-  ggplot_build(mh_plot)$data[[1]] # grab data

fh_area %>% glimpse()
## Rows: 512
## Columns: 18
## $ y           <dbl> 7.399859e-05, 7.437072e-05, 7.436777e-05, 7.408919e-05, 7.~
## $ x           <dbl> 135.0000, 135.1135, 135.2270, 135.3405, 135.4540, 135.5675~
## $ density     <dbl> 7.399859e-05, 7.437072e-05, 7.436777e-05, 7.408919e-05, 7.~
## $ scaled      <dbl> 0.001504427, 0.001511993, 0.001511933, 0.001506269, 0.0014~
## $ ndensity    <dbl> 0.001504427, 0.001511993, 0.001511933, 0.001506269, 0.0014~
## $ count       <dbl> 0.3041342, 0.3056637, 0.3056515, 0.3045066, 0.3024716, 0.2~
## $ n           <int> 4110, 4110, 4110, 4110, 4110, 4110, 4110, 4110, 4110, 4110~
## $ flipped_aes <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA~
## $ PANEL       <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ group       <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1~
## $ ymin        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ ymax        <dbl> 7.399859e-05, 7.437072e-05, 7.436777e-05, 7.408919e-05, 7.~
## $ fill        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
## $ weight      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ colour      <chr> "black", "black", "black", "black", "black", "black", "bla~
## $ alpha       <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
## $ size        <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5~
## $ linetype    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
mh_area %>% glimpse()
## Rows: 512
## Columns: 18
## $ y           <dbl> 9.818063e-05, 1.012755e-04, 1.039786e-04, 1.062890e-04, 1.~
## $ x           <dbl> 144.0000, 144.1213, 144.2427, 144.3640, 144.4853, 144.6067~
## $ density     <dbl> 9.818063e-05, 1.012755e-04, 1.039786e-04, 1.062890e-04, 1.~
## $ scaled      <dbl> 0.002035178, 0.002099330, 0.002155364, 0.002203256, 0.0022~
## $ ndensity    <dbl> 0.002035178, 0.002099330, 0.002155364, 0.002203256, 0.0022~
## $ count       <dbl> 0.4053878, 0.4181664, 0.4293277, 0.4388673, 0.4467997, 0.4~
## $ n           <int> 4129, 4129, 4129, 4129, 4129, 4129, 4129, 4129, 4129, 4129~
## $ flipped_aes <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA~
## $ PANEL       <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ group       <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1~
## $ ymin        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ ymax        <dbl> 9.818063e-05, 1.012755e-04, 1.039786e-04, 1.062890e-04, 1.~
## $ fill        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
## $ weight      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ colour      <chr> "black", "black", "black", "black", "black", "black", "bla~
## $ alpha       <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
## $ size        <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5~
## $ linetype    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~

The area we need to fill is defined by the x, y coordinates. We can ignore the other variables for our purpose here.

Visualizing pnorm()

We are now ready build our custom plotting functions. Using pnorm() will give us the probability space as a numeric. Using geom_area(data = subset(fh_area, x < test_height) will give us the shape up to our cut-off value: the lower tail. Conversely, using geom_area(data = subset(fh_area, x > test_height) will give us the area to the right of our cut-off value: the upper tail.

# Custom functions --------------------------------------------------------
## female under
female_show_height_under <- function(test_height){
  
  prob_h <- pnorm(test_height, female_avg, female_sd) %>% 
    round(3)
  
  zed <- female_height %>% 
    filter(height == test_height) %>% 
    pull(zed) %>% unique() %>% round(3)
  
  if(test_height < 136) warning ("Choose a value between 136 & 192")
  if(test_height > 192 ) warning ("Choose a value between 136 & 192")

  female_height %>% 
  ggplot( aes(x = height, y = ..density.. ) ) +
  geom_density( alpha = 0.4, fill = "grey") +
  geom_vline(xintercept = test_height, lty = 3, color = "red") +
  geom_vline(xintercept = female_avg, lty = 5, color = "grey") +
  geom_area(data = subset(fh_area, x < test_height ) , 
            aes(x = x, y = y), 
            fill = "red",
            alpha = 0.3) +
  theme_classic() +
  labs(title = "Female Students: Height Under", 
       x = "Height in CM", y = "Density",
       subtitle = paste("Red area <", {test_height}, "CM",
                        " | ",   "Probability space:", {prob_h},
                        " | ",  "Zed:", {zed}  ),
       caption = "Data Source: Freie Universität Berlin")
}
#

## female over

female_show_height_over <- function(test_height){
  
  prob_h <- pnorm(test_height, female_avg, female_sd, 
                  lower.tail = FALSE) %>% 
    round(3)
  
  zed <- female_height %>% 
    filter(height == test_height) %>% 
    pull(zed) %>% unique() %>% round(3)
  
  if(test_height < 136) warning ("Choose a value between 136 & 192")
  if(test_height > 192 ) warning ("Choose a value between 136 & 192")
  
  female_height %>% 
    ggplot( aes(x = height, y = ..density.. ) ) +
    geom_density( alpha = 0.4, fill = "grey") +
    geom_vline(xintercept = test_height, lty = 3, color = "red") +
    geom_vline(xintercept = female_avg, lty = 5, color = "grey") +
    geom_area(data = subset(fh_area, x > test_height ) , 
              aes(x = x, y = y), 
              fill = "red", 
              alpha = 0.3) +
    theme_classic() +
    labs(title = "Female Students: Height Over",
         x = "Height in CM", y = "Density",
         subtitle = paste("Red area >", {test_height}, 
                          "CM", " | ", "Probability space:", {prob_h}, 
                          " | ",  "Zed:", {zed}  ),
         caption = "Data Source: Freie Universität Berlin")
}
#

# male under
male_show_height_under <- function(test_height){
  
  prob_h <- pnorm(test_height, male_avg, male_sd) %>% 
    round(3)
  
  zed <- male_height %>% 
    filter(height == test_height) %>% 
    pull(zed) %>% unique() %>% round(3)
  
  if(test_height < 145) warning ("Choose a value between 145 & 201")
  if(test_height > 201 ) warning ("Choose a value between 145 & 201")
  
  male_height %>% 
    ggplot( aes(x = height, y = ..density.. ) ) +
    geom_density( alpha = 0.4, fill = "grey") +
    geom_vline(xintercept = test_height, lty = 3, color = "blue") +
    geom_vline(xintercept = male_avg, lty = 5, color = "grey") +
    geom_area(data = subset(mh_area, x < test_height ) , 
              aes(x = x, y = y), 
              fill = "blue",
              alpha = 0.3) +
    theme_classic() +
    labs(title = "Male Students: Height Under", 
         x = "Height in CM", y = "Density",
         subtitle = paste("Blue area <", {test_height}, 
                          "CM", " | ",  
                          "Probability space:", {prob_h}, 
                          " | ",  "Zed:", {zed}   ),
         caption = "Source: Freie Universität Berlin")
}
#

## male over
male_show_height_over <- function(test_height){
  
  prob_h <- pnorm(test_height, male_avg, male_sd, lower.tail = FALSE) %>% 
    round(3)
  
  zed <- male_height %>% 
    filter(height == test_height) %>% 
    pull(zed) %>% unique() %>% round(3)
  
  if(test_height < 145) warning ("Choose a value between 145 & 201")
  if(test_height > 201 ) warning ("Choose a value between 145 & 201")
  
  male_height %>% 
    ggplot( aes(x = height, y = ..density..  ) ) +
    geom_density( alpha = 0.4, fill = "grey") +
    geom_vline(xintercept = test_height, lty = 3, color = "blue") +
    geom_vline(xintercept = male_avg, lty = 5, color = "grey") +
    geom_area(data = subset(mh_area, x > test_height ) , 
              aes(x = x, y = y), 
              fill = "blue", 
              alpha = 0.3) +
    theme_classic() +
    labs(title = "Male Students: Height Over", 
         x = "Height in CM", y = "Density",
         subtitle = paste("Blue area >", 
                          {test_height}, "CM", 
                          " | ", "Probability space:", 
                          {prob_h}, " | ",  "Zed:", {zed}  ),
         caption = "Data Source: Freie Universität Berlin")
}

Test Functions

To work the properly, the possible values for each function are the minimum distribution value + 1, or the maximum distribution value - 1. It makes no sense to ask what is the probability of a value outside of the distribution.

We can start with a test_height of 170.

test_height <- 170

# Choose a whole number value between 136 & 192
female_show_height_under(test_height)

female_show_height_over(test_height)

# Choose a whole lnumber  value between 145 & 201
male_show_height_under(test_height) 

male_show_height_over(test_height) 

All good!

Save

We now have our custom functions for classwork and play. Let’s save them and the needed data so that we can load the image into our worksheet. Please note that our worksheet will first load the libraries tidyverse, here, and glue.

remove_these <- c("p5","p6", "p7", "p8", 
                  "mh_plot", "fh_plot",
                  "students", "three_dist",
                  "x_ticks", "test_avg", 
                  "test_sd", "test_height")

rm(list = remove_these)
rm(remove_these)

save.image(file = here::here("data", 
                             "tidy_data", 
                             "fun_w_norm.RData"))

Free at Github

The data, functions, this RMD, and the worksheet RMD are freely available at github.com/Thom-J-H/Grab_Bag. Please feel free to improve. Thank you.


Thomas J. Haslam
2021-08-26


Session Info