This week I’ll focus on several topics: (1) Sampling distributions and standard errors; (2) Bootstrapping; and (3) Confidence intervals. To do so, I’ll go through several data sets:
- The Ten Mile Race: Data from D.C.'s Cherry Blossom Run
- Gone Fishing: Repeated sampling of fish from a fictional lake
- The Wage Gap: Earnings between men and women at a tech firm
And thanks to Professor Scott for providing excellent materials on his webpage.
To refresh from last week, we'll go through the Cherry Blossom Race data set. This race is held in Washington, D.C. every year; these are the results of the 2005 race.
The data frame includes 8,636 observations; the variables we'll use are each runner's net finish time, age, and sex.
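The data set ships with the mosaic package (via mosaicData), which also supplies the do(), plotModel(), and resample() helpers used later. A minimal way to load it and list the variables, assuming mosaic is installed:
# Load mosaic, which attaches mosaicData and the TenMileRace data frame
library(mosaic)
data(TenMileRace)
names(TenMileRace)   # variable names
nrow(TenMileRace)    # 8,636 runners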
To get an idea about the relationship between the net recorded time and age of the runner, let’s plot the data:
# The model aggregating men and women
plot(net~age,data=TenMileRace, col='grey')
lm1 = lm(net~age,data=TenMileRace)
abline(lm1)
summary(lm1)
##
## Call:
## lm(formula = net ~ age, data = TenMileRace)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2720.7 -653.8 -35.7 568.1 4714.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5297.2192 37.6056 140.862 <2e-16 ***
## age 8.1899 0.9806 8.352 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 965.8 on 8634 degrees of freedom
## Multiple R-squared: 0.008014, Adjusted R-squared: 0.007899
## F-statistic: 69.75 on 1 and 8634 DF, p-value: < 2.2e-16
However, gender could also influence a runner's net finish time, so let's fit a separate linear regression for males and for females.
# Now disaggregating
lmM = lm(net~age,data=subset(TenMileRace,sex=="M"))
lmF = lm(net~age,data=subset(TenMileRace,sex=="F"))
coef(lmM)
## (Intercept) age
## 4585.85479 17.56822
coef(lmF)
## (Intercept) age
## 5370.99995 15.96166
We can look at the means of net finish times by gender.
mean(net ~ sex, data=TenMileRace)
## F M
## 5916.398 5280.702
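We can express the same comparison in baseline/offset form by regressing net time on sex alone (a minimal sketch; lm_sex is a name introduced here, not part of the original code):
# Baseline/offset form of the group-mean comparison
lm_sex = lm(net ~ sex, data=TenMileRace)
coef(lm_sex)
# The intercept matches the female mean above; intercept + sexM matches the male mean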
Great! Recall the baseline/offset form: the coefficients of this model are simply a different way of expressing the group means for men and women. Since we observe differences between men and women, we can visualize them by adding an abline for each fitted model (i.e., lm1, lmM, lmF):
# Clearly we get different effects due to age when we disaggregate
plot(net~age,data=TenMileRace, col='grey', pch=19, cex=0.5)
abline(lm1, col='black')
abline(lmM, col='red')
abline(lmF, col='blue')
Next, we’ll look at the main effects of age and gender on finish time. First, I show the intercepts and coefficients, then I visualize the fit of the model.
# We can model this with main effects for age and sex
lm2 = lm(net ~ age + sex, data= TenMileRace)
coef(lm2)
## (Intercept) age sexM
## 5339.15545 16.89362 -726.61948
# A simple way to visualize the fit
plotModel(lm2)
Now that we can visualize the running times for all individuals, with age and sex included in the model, let’s look at the joint influence of the two variables on running times. Recall that an interaction addresses whether the influence of one independent variable is altered by the level of another independent variable.
# With an interaction
lm3 = lm(net ~ age + sex + age:sex, data= TenMileRace)
coef(lm3)
## (Intercept) age sexM age:sexM
## 5370.999953 15.961661 -785.145163 1.606559
# Visualize the fit
plotModel(lm3)
The above coefficients correspond to (1) the influence of age on running times; (2) the influence of sex on running times; and (3) the joint influence of age and sex on running times. That is, does the difference between mean male and mean female running times depend on age?
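As a quick sanity check (a small sketch reusing the objects fitted above), the interaction model reproduces the separate male fit: adding the sexM and age:sexM terms to the female baseline recovers the intercept and slope we saw in coef(lmM).
# The female baseline is coef(lm3)[1:2]; the male fit follows by adding the offsets
coef(lm3)["(Intercept)"] + coef(lm3)["sexM"]   # male intercept, compare coef(lmM)
coef(lm3)["age"] + coef(lm3)["age:sexM"]       # male age slope, compare coef(lmM)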
Lastly, we'll look at an ANOVA table built with the simple_anova utility.
# An ANOVA table
source('http://jgscott.github.io/teaching/r/utils/class_utils.R')
simple_anova(lm3)
## Df R2 R2_improve sd sd_improve pval
## Intercept 1 0.000000 969.66
## age 1 0.008014 0.008014 965.82 3.837 0.00000
## sex 1 0.139363 0.131349 899.66 66.159 0.00000
## age:sex 1 0.139433 0.000070 899.67 -0.015 0.40116
## Residuals 8632
From the table above, it looks as though sex contributes the most to explaining net running time (an improvement of 66.159 in residual standard deviation), followed by age (3.837). The interaction of age and sex is not statistically significant (p = 0.40).
After that motivating example, we'll now go on to look at sampling distributions using the gonefishing data.
This walkthrough looks at fictional data on fictional fish in a fictional lake :D. Let's meet the 800 fish who live in Lake Woebegone.
gonefishing <- read.csv("https://raw.githubusercontent.com/jgscott/learnR/master/gonefishing/gonefishing.csv")
summary(gonefishing)
## length height width weight
## Min. : 5.30 Min. :2.100 Min. :0.900 Min. : 68
## 1st Qu.:10.40 1st Qu.:4.200 1st Qu.:1.600 1st Qu.: 338
## Median :11.90 Median :4.800 Median :1.800 Median : 480
## Mean :11.93 Mean :4.768 Mean :1.792 Mean : 519
## 3rd Qu.:13.20 3rd Qu.:5.300 3rd Qu.:2.000 3rd Qu.: 655
## Max. :20.20 Max. :8.000 Max. :3.000 Max. :2041
If we take a moment to look at the variation in length, height, width, and weight, we can start imagining that these fish come in many different shapes and sizes. Let’s look at a histogram of weights for the entire fish population:
hist(gonefishing$weight, breaks = 20)
mean_weight_pop <- mean(gonefishing$weight)
abline(v = mean_weight_pop, lwd = 4, col = "blue")
The abline marks the population mean weight (518.965 grams), which we calculated with the mean function, and we can see considerable variation around that mean. Next, let's look at the sampling distribution of the sample mean.
Let’s say we go on the fishing trip and catch 30 fish. What would we expect the mean of our sample to be? Let’s use R’s ability to take a random sample in order to simulate this process.
n_fish = 30
#Take a random sample from the population of fish in the lake
fishing_trip <- sample(gonefishing, n_fish)
#Look at the measurements of the first six fish we caught
head(fishing_trip)
## length height width weight orig.id
## 84 15.3 6.2 2.4 1023 84
## 109 11.6 4.6 1.8 569 109
## 323 12.1 4.9 1.7 594 323
## 318 12.3 4.9 1.9 536 318
## 313 9.4 3.7 1.3 254 313
## 314 15.3 6.0 2.3 777 314
Above, the row names tell us which fish (numbered 1 to 800 in the original data set) we happened to catch on this trip. Because the sample is random, your particular fish will be different from the ones shown here. Next, let's compute the mean weight of the fish in our sample:
mean_weight_sample <- mean(fishing_trip$weight)
mean_weight_sample
## [1] 569.8667
The sample mean will be different each time we run this code, since each run generates a new sample mean from a new set of observations. Crucially, these sample means all differ from the true population mean of about 519, which we calculated above.
So that was yesterday’s fishing trip. What about today’s? Let’s say we released the fish back into the lake and we’re interested in taking a fresh sample from the population. We’ll repeat the commands above to get a new sample.
fishing_trip <- sample(gonefishing, n_fish)
mean_weight_sample <- mean(fishing_trip$weight)
mean_weight_sample
## [1] 467.3667
Today's sample mean will be different from yesterday's, and both will differ from the population mean. The next step is to get a sense of how the sample mean varies across lots and lots of samples (i.e., many more than two). To do so, we'll use the do() function from the mosaic package to perform a Monte Carlo simulation. Let's try this:
do(25)*{
fishing_trip <- sample(gonefishing, n_fish)
mean_weight_sample <- mean(fishing_trip$weight)
mean_weight_sample
}
## result
## 1 538.2667
## 2 635.5333
## 3 546.2333
## 4 581.9667
## 5 535.3333
## 6 475.7000
## 7 565.4000
## 8 471.6000
## 9 554.1667
## 10 451.5333
## 11 449.0333
## 12 580.7000
## 13 418.7333
## 14 539.3000
## 15 599.4000
## 16 515.5333
## 17 506.5000
## 18 431.3333
## 19 478.3000
## 20 541.9667
## 21 497.3333
## 22 542.9667
## 23 503.0000
## 24 490.5000
## 25 612.7667
Let's parse the above construct: (1) we took our original code block for drawing a random sample from the population and calculating the mean weight of that sample; (2) we placed that code block in curly braces {}; (3) we told R to repeat that block 25 times via the do(25)* command.
The result above is 25 different sample means, each corresponding to a different random sample from a population. This process is an example of Monte Carlo simulation, wherein a computer is used to simulate a random process. Our code above produced 25 Monte Carlo samples. Next, let’s make two small modifications. First, we’ll do more than 25 Monte Carlo samples and we’ll save the result.
# Save the Monte Carlo output
my_fishing_year <- do(365)*{
fishing_trip <- sample(gonefishing, n_fish)
mean_weight_sample <- mean(fishing_trip$weight)
mean_weight_sample
}
#Examine the first several entries
head(my_fishing_year)
## result
## 1 452.0667
## 2 489.0667
## 3 560.9667
## 4 507.6667
## 5 511.0000
## 6 486.1667
Great! So we see that my_fishing_year is a data frame with one column labeled "result". This column contains 365 sample means, one for each fishing trip (i.e., each a sample of size 30 from the population). Let's look at a histogram of these simulated sample means:
hist(my_fishing_year$result, breaks = 20)
sd(my_fishing_year$result)
## [1] 44.82245
We call this the sampling distribution of the sample mean. The dispersion of this distribution tells us how precisely the mean from any given sample of this size (here, 30 fish) approximates the population mean. The standard deviation of this sampling distribution is a natural measure of that dispersion, and for this reason it is usually called the standard error of the sample mean.
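For comparison (a rough sketch; the simulated value above will vary from run to run), the classical plug-in formula sd/sqrt(n) for the standard error of a sample mean should land close to the simulated value of about 45:
# Plug-in standard error: population SD of weight divided by the square root of the sample size
sd(gonefishing$weight) / sqrt(n_fish)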
So above we learned the basic trick of simulating a sampling distribution. We can now apply that trick to any kind of statistical model. For example, let's say we wanted to fit a model for the weight of a fish versus its volume.
# Define the volume variable and add it to the original data frame
gonefishing$volume <- gonefishing$height * gonefishing$length * gonefishing$width
# Model weight versus volume
plot(weight ~ volume, data = gonefishing)
lm0 <- lm(weight ~ volume, data = gonefishing)
abline(lm0)
coef(lm0)
## (Intercept) volume
## 48.947898 4.240374
For the population, it looks like the slope of the line is about 4.24 grams per cubic inch. And what about for our sample of 30 fish? Let’s execute this block of code a few times and compare the different lines we get.
# Plot the population
plot(weight ~ volume, data = gonefishing)
# Take a sample, show the points, and fit a straight line
n_fish <- 30
fishing_trip <- sample(gonefishing, n_fish)
lm1 <- lm(weight ~ volume, data = fishing_trip)
points(weight ~ volume, data = fishing_trip, pch = 19, col = "orange")
abline(lm1, lwd = 3, col = "orange")
For each Monte Carlo sample, of course we’ll see a slightly different fitted line, which reflects the variability from sample to sample. The line from our sample should be close to the true population line, but they won’t be exactly the same.
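To see this numerically (a small sketch reusing lm0 and the most recent lm1; your sample line will differ), we can stack the two sets of coefficients side by side:
# Compare the sample-based fit to the population fit
rbind(sample = coef(lm1), population = coef(lm0))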
Next, let’s use the same approach as above to look at the sampling distribution of the least-squares estimator from a sample size of 30. This time we’ll collect the intercept and slope of the least-squares line, rather than the sample mean of the weight. We’ll use 365 Monte Carlo samples to simulate a year of fishing trips.
my_fishing_year <- do(365)*{
fishing_trip <- sample(gonefishing, n_fish)
lm1 <- lm(weight ~ volume, data = fishing_trip)
coef(lm1)
}
For this simulation, notice that my_fishing_year now has two columns: the intercept and the slope on the volume variable. To examine the sampling distribution of the slope, we can look at a histogram and compute the standard error.
hist(my_fishing_year$volume)
sd(my_fishing_year$volume)
## [1] 0.3118739
In this case, 365 different fishing trips of 30 fish each yield 365 different estimated slopes, which together form the sampling distribution of the slope; its standard deviation (about 0.31) is the standard error of the slope.
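Another convenient summary of this sampling distribution (a sketch using the simulation output above; your endpoints will vary) is the interval covering the middle 95% of the simulated slopes:
# Middle 95% of the simulated slopes on volume
quantile(my_fishing_year$volume, probs = c(0.025, 0.975))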
Professor Scott includes a bit of syntax for getting fancier with R plots. Here's an example where we superimpose the fitted lines from 100 different samples on a plot of the population. There are also some extra flags passed to the plotting functions to make things more visually appealing.
n_fish <- 30
ghost_grey <- rgb(0.1, 0.1, 0.1, 0.2)
ghost_red <- rgb(0.8, 0.1, 0.1, 0.2)
plot(weight ~ volume, data = gonefishing, pch = 19, col = ghost_grey, las = 1)
abline(lm0, col = "darkgrey")
# Take 100 samples and fit a straight line to each one
for(i in 1:100) {
fishing_trip <- sample(gonefishing, n_fish)
lm1 <- lm(weight ~ volume, data = fishing_trip)
abline(lm1, col = ghost_red)
}
The “fan” of different fitted lines provides a visual depiction of the sampling distribution of the OLS estimator. Note: if you want to see all the different graphical parameters, try typing ?par into the console.
Now on to the next exercise on wage gaps at a large tech firm.
This walk-through will look at whether there is a “wage gap” at a tech firm between male and female employees with similar qualifications. We will use a multiple regression to adjust for the effect of education and experience in evaluating the correlation between an employee’s sex and his/her annual salary. After attempting this exercise, one should be able to (1) Fit a regression model; (2) Correctly interpret the estimated coefficients; (3) Quantify uncertainty about parameters in a multiple-regression model using bootstrapping.
salary <- read.csv("http://jgscott.github.io/teaching/data/salary.csv")
summary(salary)
## Salary Education Experience Months
## Min. :44920 Min. :2.00 Min. : 1.00 Min. : 3.0
## 1st Qu.:52750 1st Qu.:3.00 1st Qu.:13.00 1st Qu.: 26.5
## Median :60290 Median :4.00 Median :20.00 Median : 54.0
## Mean :61034 Mean :3.93 Mean :23.56 Mean : 58.4
## 3rd Qu.:68090 3rd Qu.:5.00 3rd Qu.:35.50 3rd Qu.: 89.5
## Max. :79130 Max. :7.00 Max. :47.00 Max. :119.0
## Sex
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.4884
## 3rd Qu.:1.0000
## Max. :1.0000
We see from the above output that the variables in this data set are:
- Salary: Annual salary in dollars
- Education: Years of post-secondary education
- Experience: Months of experience at this particular company
- Months: Total months of work experience, including all previous jobs
- Sex: Whether the employee is male or female (coded 0 = female, 1 = male)
Let’s first look at the distribution of salary by sex:
mean(Salary ~ Sex, data = salary)
## 0 1
## 62610.45 59381.90
boxplot(Salary ~ Sex, data = salary, names = c("Female", "Male"))
Interestingly, it appears at first glance that women are paid more at this company than men, on average. However, does the story change if we adjust for work experience?
plot(Salary ~ Experience, data = salary)
lm1 <- lm(Salary ~ Experience, data = salary)
coef(lm1)
## (Intercept) Experience
## 52516.6821 361.5327
Before delving into the results, note that we expect more experienced workers to be paid more, all else being equal. With the above plot in mind, how do the residuals (that is, salary adjusted for experience) look when we stratify them by sex?
boxplot(resid(lm1) ~ salary$Sex)
What a difference! Now it looks like men are being paid more than women for an equivalent amount of work experience since men have a positive residual, on average. The story is similar if we look at overall work experience, including jobs prior to the one with this particular company:
plot(Salary ~ Months, data = salary)
lm2 <- lm(Salary ~ Months, data = salary)
coef(lm2)
## (Intercept) Months
## 44807.1515 277.8743
The story in the residuals is similar: The distribution of adjusted salaries for men is shifted upward compared to that for women:
boxplot(resid(lm2) ~ salary$Sex)
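To put a number on the gap visible in these boxplots (a small sketch using base R and the lm2 fit above), we can average the residuals within each sex:
# Mean residual from the total-months model, by sex (0 and 1)
tapply(resid(lm2), salary$Sex, mean)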
To get at the partial relationship between gender and salary, we must fit a multiple-regression model that accounts for (1) experience with the company and (2) total months of professional work. We will also adjust for a third variable: years of post-secondary education. It is straightforward to fit such a model by least squares in R.
lm3 <- lm(Salary ~ Experience + Months + Education + Sex, data = salary)
coef(lm3)
## (Intercept) Experience Months Education Sex
## 39305.7117 122.2467 263.5782 591.0780 2320.5438
According to this model, men are paid $2,320.54 more per year than women with similar levels of education and work experience, both overall and with this particular company.
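One way to read this coefficient (a sketch with hypothetical employees set at roughly the median values from the summary above) is to compare predicted salaries for two workers who differ only in Sex:
# Two hypothetical employees: same education and experience, Sex coded 0 = female, 1 = male
new_emp <- data.frame(Experience = 20, Months = 54, Education = 4, Sex = c(0, 1))
predict(lm3, newdata = new_emp)
diff(predict(lm3, newdata = new_emp))   # equals the Sex coefficient, about $2,320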
We can quantify our uncertainty about this effect via bootstrapping.
# Bootstrap sample
boot3 <- do(5000)*{
lm(Salary ~ Experience + Months + Education + Sex, data = resample(salary))
}
# Histogram of the bootstrap distribution for the Sex coefficient
hist(boot3$Sex)
# Confidence interval
confint(boot3)
## name lower upper level method estimate
## 1 Intercept 34877.1630676 4.479727e+04 0.95 percentile 3.930571e+04
## 2 Experience 42.4248045 1.944757e+02 0.95 percentile 1.222467e+02
## 3 Months 236.4651866 2.910591e+02 0.95 percentile 2.635782e+02
## 4 Education -709.9081839 1.452176e+03 0.95 percentile 5.910780e+02
## 5 Sex 173.2729911 4.390432e+03 0.95 percentile 2.320544e+03
## 6 sigma 1988.4973138 2.977739e+03 0.95 percentile 2.672043e+03
## 7 r.squared 0.9152498 9.662007e-01 0.95 percentile 9.380802e-01
## 8 F 102.5941699 2.715710e+02 0.95 percentile 1.439242e+02
In this case, the bootstrapped confidence interval runs from about $170 to about $4,400. (Each confidence interval will be slightly different because of the Monte Carlo variability inherent to bootstrapping.) This is quite a wide range: we cannot rule out that the wage gap is quite small, but neither can we rule out that it runs into the thousands of dollars.
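The same percentile interval for the Sex coefficient can also be pulled straight from the bootstrap draws (a sketch using the boot3 object above; your endpoints will differ slightly from the confint output):
# 95% percentile interval for the Sex coefficient, computed by hand
quantile(boot3$Sex, probs = c(0.025, 0.975))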