In this lab, you’ll investigate the probability distribution that is most central to statistics: the normal distribution. If you are confident that your data are nearly normal, that opens the door to many powerful statistical methods. Here we’ll use the graphical tools of R to assess the normality of our data and also learn how to generate random numbers from a normal distribution.

Getting Started

Before beginning this lab, make sure you’ve:

Load packages

In this lab, we will explore and visualize the data using the tidyverse suite of packages.

Let’s load the package now.

Creating a reproducible lab report

To create your new lab report, in RStudio, go to New File -> R Markdown… You can also download a lab report template from the Labs page of the class WordPress site.

The data

In the last lab, we worked with data from the 2018 American Community Survey (ACS). We’ll be using the same ACS data in this lab, in addition to the New York Time’s COVID-19 dataset.

Previously, we used the ACS data to examine relationships between median income, race, and region in the US. We’ll focus on median income again in this lab: how is varies across regions, as well as its relationship to COVID-19 severity.

Let’s open both datasets now.

ACS_2018 <- read_csv("acs2018_county_small.csv")

covid_nyt <- read_csv("covid19_county.csv")

In order to analyze these datasets together, we need to merge them into a single dataset. But how?

Notice that these datasets share two common variable - fips and region. A fips code is a unique numeric identifier assigned to every county in the United States. The first two digits of a county fips code tell you which state the county is in. For example, the fips code of every county in Vermont begins with the number 50. The region refers to the geographic region a state is part of (Midwest, South, etc).

Using fips and region, we can merge the ACS_2018 and covid_NYT datasets together. In R, we merge datasets using the join function. There are a few different types of joins to choose from. We’ll be using left_join, which that take all rows in the first dataset and adds matching columns from the second dataset. The two datasets are joined by their common variables fips and region.

ACS_covid <- left_join(ACS_2018, covid_nyt, by = c("fips", "region"))

Let’s read in the data and take a quick peek at the first few rows. Either you can use glimpse like before, or head to do this.

head(ACS_covid)
## # A tibble: 6 x 29
##    fips county_name    STATE COUNTY region pop_total pop_density median_age
##   <dbl> <chr>          <chr> <chr>  <chr>      <dbl>       <dbl>      <dbl>
## 1  1001 Autauga County 01    001    South      55200        92.9       37.8
## 2  1003 Baldwin County 01    003    South     208107       131.        42.8
## 3  1005 Barbour County 01    005    South      25782        29.1       39.9
## 4  1007 Bibb County    01    007    South      22527        36.2       39.9
## 5  1009 Blount County  01    009    South      57645        89.4       40.8
## 6  1011 Bullock County 01    011    South      10352        16.6       39.6
## # … with 21 more variables: pop_white <dbl>, pop_black <dbl>,
## #   pop_highschool <dbl>, pop_ba <dbl>, median_income <dbl>, white_pct <dbl>,
## #   black_pct <dbl>, black_30pct <dbl>, pop_college <dbl>, college_pct <dbl>,
## #   poverty_pct <dbl>, state_fips.x <dbl>, state <chr>, county <chr>,
## #   state_fips.y <dbl>, date <chr>, cases <dbl>, deaths <dbl>,
## #   days_inital <dbl>, cases_3d_avg <dbl>, deaths_3d_avg <dbl>

We see that we now have a dataset that is a combination of the ACS_2018 and covid_NYT dataset. However, it doesn’t contain all the information from the two datasets. Not every county that appears in ACS_2018 appears in covid_nyt. This means that there are some counties in ACS_covid that do not contain COVID-19 data- it’s important to keep in mind that the ACS_covid dataset contains missing values, which may have implications on our ability to run statistical calculations later on.

Visualizing distributions

Let’s dive straight into visualizing the distribution of median_income in the United States. The easiest way for us to visualize the distribution of data is by using a histogram. Remember to adjust the binwidth to one that makes sense for the data at hand. Most annual incomes are in the tens of thousands, so try different bindwidth in the thousands range until you find one you like best.

ggplot(data = ACS_covid, aes(x = median_income)) +
  geom_histogram(binwidth = 2500)
## Warning: Removed 1 rows containing non-finite values (stat_bin).

From this histogram, it looks like the distribution of median incomes in counties across the United States is centered at somewhere around $50,000 per year. It also looks pretty symmetric, but there’s a far-right tail. Also, how symmetric is pretty symmetric? Why does it even matter? We’ll get into the importance of distribution shape in a little bit.

  1. Make a plot to visualize the distribution of median_income in a region (South, Midwest, etc.) of your choice.
    How do the centers, shapes and spreads between your chosen region and the entire US compare?

The normal distribution

In your description of the distributions, did you use words like bell-shaped or normal? It’s tempting to say so when faced with a unimodal symmetric distribution.

To see how accurate that description is, you can plot a normal distribution curve on top of a histogram to see how closely the data follow a normal distribution. This normal curve should have the same mean and standard deviation as the data.

Let’s calculate and store the mean and standard deviations for the US that we’ll need. Note that because there are a few missing values in the ACS_covid dataset, we’ll need to use the na.rm argument in the mean function. This argument tells R to remove a dataset’s missing values from the mean calculation.

us_mean <- mean(ACS_covid$median_income, na.rm = TRUE)
us_sd   <- sd(ACS_covid$median_income, na.rm = TRUE)

Next, you make a density histogram to use as the backdrop and use the lines function to overlay a normal probability curve. The difference between a frequency histogram and a density histogram is that while in a frequency histogram the heights of the bars add up to the total number of observations, in a density histogram the areas of the bars add up to 1. The area of each bar can be calculated as simply the height times the width of the bar. Using a density histogram allows us to properly overlay a normal distribution curve over the histogram since the curve is a normal probability density function that also has area under the curve of 1. Frequency and density histograms both display the same exact shape; they only differ in their y-axis. You can verify this by comparing the frequency histogram you constructed earlier and the density histogram created by the commands below.

ggplot(data = na.omit(ACS_covid), aes(x = median_income)) +
        geom_blank() +
        geom_histogram(binwidth = 2500, aes(y = ..density..)) +
        stat_function(fun = dnorm, args = c(mean = us_mean, sd = us_sd), col = "tomato")

After initializing a blank plot with geom_blank(), the ggplot package (within the tidyverse) allows us to add additional layers. The first layer is a density histogram. The second layer is a statistical function – the density of the normal curve, dnorm. We specify that we want the curve to have the same mean and standard deviation as the column of median age. The argument col simply sets the color for the line to be drawn. If we left it out, the line would be drawn in black.

  1. Using median_income for a region of your choice, does this data follow a nearly normal distribution?

Normal probabilities

Why should you care whether a variable is normally distributed?

It turns out that statisticians know a lot about the normal distribution. Once you decide that a random variable is approximately normal, you can answer all sorts of questions about that variable related to probability. Take, for example, the question of, “What is the probability that a randomly chosen county in the United States has a median income of more that $100,000?”

If we assume that the median incomes of counties are normally distributed (a very close approximation is also okay), we can find this probability by calculating a Z score and consulting a Z table (also called a normal probability table). In R, this is done in one step with the function pnorm().

First, let’s calculate the mean and standard deviation median_income for the entire US. (We’re returning to our original ACS_covid dataset).

us_mean <- mean(ACS_covid$median_income, na.rm = TRUE)
us_sd   <- sd(ACS_covid$median_income, na.rm = TRUE)

Now, we can use the pnorm function to calculate the Z score and corresponding value from the Z table

1 - pnorm(q = 100000, mean = us_mean, sd = us_sd)

Note that the function pnorm() gives the area under the normal curve below a given value, q, with a given mean and standard deviation. Since we’re interested in the probability that a county in the US has a median income greater than $100,000, we have to take one minus that probability.

Assuming a normal distribution has allowed us to calculate a theoretical probability. If we want to calculate the probability empirically, we simply need to determine how many observations fall above $100,000 then divide this number by the total sample size.

Remember that in order to make a fair comparison to our previous calculation, we’ll need to limit our analysis to counties that do not have missing median_income values.

ACS_covid %>% 
  filter(median_income > 100000) %>%
  summarise(percent = n()/ nrow(ACS_covid))

Note the differences in the two probabilities. Using the pnorm function, we predicted the probability that a county had a median income above $100,00 is 0.0002228653. From our manual calculation, we predict that this probability is 0.008541393. That’s a bit different!

The closer that your distribution is to being normal, the more accurate the theoretical probabilities will be. The difference between our theoretical and actual probabilities suggests that the distribution of median incomes in the US is in fact not normal.

COVID-19 impact in high-income counties

Clearly, counties with median incomes greater than $100,000 are rare. Still, COVID-19 has hit almost every corner of the US. Are American’s wealthiest counties immune? Let’s take a closer look at counties in the US with median incomes greater than $100,000.

If we take part of our previous code:

ACS_covid %>% 
  filter(median_income > 100000) %>%
  summarise(number = n())

We can see there are 26 counties in the US with median incomes greater than $100,000. Have these counties been any more or less effected by COVID-19 than the rest of the US?

To start, let’s calculate the average number of COVID-19 deaths per county. Remember when we said missing values in a dataset might cause trouble later on? Now’s that time. R doesn’t know what to do with missing values when calculating a mean. We need to tell it to remove the missing values from its calculation, which we can do by specifying the argument na.rm = TRUE.

ACS_covid %>%
  summarise(mean_death_us = mean(deaths, na.rm = TRUE))

The average number of COVID-19 deaths per county is approximately 34.

Now, let’s calculate the average number of COVID-19 deaths in the 26 richest counties. Do you think think value will be higher or lower than the national average?

ACS_covid %>% 
  filter(median_income > 100000) %>%
  summarise(mean_death_100k = mean(deaths))

The average number of COVID-19 deaths in US counties with median incomes over $100,000 is approximately 245. Is this higher or lower than you expected? What are some reasons that these counties might have a death rate higher than the national average?

  1. Write out two probability questions that you would like to answer about any of the regions in this dataset. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which one had a closer agreement between the two methods?

You might be asking yourself: is the difference between 35 and 245 significant? Does having a median incomes of over $100,000 lead to a county having higher than average death rate?

Economists want answers to these questions too! In our next lab, we’ll introduce one way that we can begin exploring relationships between variables: statistical inference.


More Practice

  1. Examine covid-19 deaths in the Midwest and South. Do they look normal?

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License and was adopted from OpenIntro.org.