This week I’ll focus on several topics: (1) Sampling distributions and standard errors; (2) Bootstrapping; and (3) Confidence intervals. To do so, I’ll go through several data sets:
- The Ten Mile Race: Data from D.C.'s Cherry Blossom Run
- Gone Fishing: Repeated sampling of fish from a fictional lake
- The Wage Gap: Earnings between men and women at a tech firm
And thanks to Professor Scott for providing excellent materials on his webpage.
To refresh from last week, we'll go through the Cherry Blossom Race data set. This race is held in Washington, D.C. every year; these are the results of the 2005 race.
The data frame includes 8,636 observations; the variables we'll use are each runner's net finish time, age, and sex.
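The data set ships with the mosaic package (via mosaicData), which also supplies the do(), plotModel(), and resample() helpers used later. A minimal way to load it and list the variables, assuming mosaic is installed:
# Load mosaic, which attaches mosaicData and the TenMileRace data frame
library(mosaic)
data(TenMileRace)
names(TenMileRace)   # variable names
nrow(TenMileRace)    # 8,636 runners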
To get an idea about the relationship between the net recorded time and age of the runner, let’s plot the data:
# The model aggregating men and women
plot(net~age,data=TenMileRace, col='grey')
lm1 = lm(net~age,data=TenMileRace)
abline(lm1)
summary(lm1)
##
## Call:
## lm(formula = net ~ age, data = TenMileRace)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2720.7 -653.8 -35.7 568.1 4714.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5297.2192 37.6056 140.862 <2e-16 ***
## age 8.1899 0.9806 8.352 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 965.8 on 8634 degrees of freedom
## Multiple R-squared: 0.008014, Adjusted R-squared: 0.007899
## F-statistic: 69.75 on 1 and 8634 DF, p-value: < 2.2e-16
However, gender could also influence a runner's net finish time, so let's fit a separate linear regression for males and for females.
# Now disaggregating
lmM = lm(net~age,data=subset(TenMileRace,sex=="M"))
lmF = lm(net~age,data=subset(TenMileRace,sex=="F"))
coef(lmM)
## (Intercept) age
## 4585.85479 17.56822
coef(lmF)
## (Intercept) age
## 5370.99995 15.96166
We can look at the means of net finish times by gender.
mean(net ~ sex, data=TenMileRace)
## F M
## 5916.398 5280.702
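We can express the same comparison in baseline/offset form by regressing net time on sex alone (a minimal sketch; lm_sex is a name introduced here, not part of the original code):
# Baseline/offset form of the group-mean comparison
lm_sex = lm(net ~ sex, data=TenMileRace)
coef(lm_sex)
# The intercept matches the female mean above; intercept + sexM matches the male mean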
Great! Recall the baseline/offset form: the coefficients of this model are simply a different way of expressing the group means for men and women. Since we observe differences between men and women, we can visualize them by adding an abline for each fitted model (i.e., lm1, lmM, lmF):
# Clearly we get different effects due to age when we disaggregate
plot(net~age,data=TenMileRace, col='grey', pch=19, cex=0.5)
abline(lm1, col='black')
abline(lmM, col='red')
abline(lmF, col='blue')
Next, we’ll look at the main effects of age and gender on finish time. First, I show the intercepts and coefficients, then I visualize the fit of the model.
# We can model this with main effects for age and sex
lm2 = lm(net ~ age + sex, data= TenMileRace)
coef(lm2)
## (Intercept) age sexM
## 5339.15545 16.89362 -726.61948
# A simple way to visualize the fit
plotModel(lm2)
Now that we can visualize the running times for all individuals, with age and sex included in the model, let’s look at the joint influence of the two variables on running times. Recall that an interaction addresses whether the influence of one independent variable is altered by the level of another independent variable.
# With an interaction
lm3 = lm(net ~ age + sex + age:sex, data= TenMileRace)
coef(lm3)
## (Intercept) age sexM age:sexM
## 5370.999953 15.961661 -785.145163 1.606559
# Visualize the fit
plotModel(lm3)
The above coefficients correspond to (1) the influence of age on running times; (2) the influence of sex on running times; and (3) the joint influence of age and sex on running times. That is, does the difference between mean male and mean female running times depend on age?
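As a quick sanity check (a small sketch reusing the objects fitted above), the interaction model reproduces the separate male fit: adding the sexM and age:sexM terms to the female baseline recovers the intercept and slope we saw in coef(lmM).
# The female baseline is coef(lm3)[1:2]; the male fit follows by adding the offsets
coef(lm3)["(Intercept)"] + coef(lm3)["sexM"]   # male intercept, compare coef(lmM)
coef(lm3)["age"] + coef(lm3)["age:sexM"]       # male age slope, compare coef(lmM)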
Lastly, we'll look at an ANOVA table built with the simple_anova utility.
# An ANOVA table
source('http://jgscott.github.io/teaching/r/utils/class_utils.R')
simple_anova(lm3)
## Df R2 R2_improve sd sd_improve pval
## Intercept 1 0.000000 969.66
## age 1 0.008014 0.008014 965.82 3.837 0.00000
## sex 1 0.139363 0.131349 899.66 66.159 0.00000
## age:sex 1 0.139433 0.000070 899.67 -0.015 0.40116
## Residuals 8632
From the table above, it looks as though sex contributes the most to explaining net running time (an improvement of 66.159 in residual standard deviation), followed by age (3.837). The interaction of age and sex is not statistically significant (p = 0.40).
After that motivating example, we'll now go on to look at sampling distributions using the gonefishing data.
This walkthrough looks at fictional data on fictional fish in a fictional lake :D. Let's meet the 800 fish who live in Lake Woebegone.
gonefishing <- read.csv("https://raw.githubusercontent.com/jgscott/learnR/master/gonefishing/gonefishing.csv")
summary(gonefishing)
## length height width weight
## Min. : 5.30 Min. :2.100 Min. :0.900 Min. : 68
## 1st Qu.:10.40 1st Qu.:4.200 1st Qu.:1.600 1st Qu.: 338
## Median :11.90 Median :4.800 Median :1.800 Median : 480
## Mean :11.93 Mean :4.768 Mean :1.792 Mean : 519
## 3rd Qu.:13.20 3rd Qu.:5.300 3rd Qu.:2.000 3rd Qu.: 655
## Max. :20.20 Max. :8.000 Max. :3.000 Max. :2041
If we take a moment to look at the variation in length, height, width, and weight, we can start imagining that these fish come in many different shapes and sizes. Let’s look at a histogram of weights for the entire fish population:
hist(gonefishing$weight, breaks = 20)
mean_weight_pop <- mean(gonefishing$weight)
abline(v = mean_weight_pop, lwd = 4, col = "blue")
The abline marks the population mean weight (518.965 grams), which we calculated with the mean function, and we can see considerable variation around that mean. Next, let's look at the sampling distribution of the sample mean.
Let’s say we go on the fishing trip and catch 30 fish. What would we expect the mean of our sample to be? Let’s use R’s ability to take a random sample in order to simulate this process.
n_fish = 30
#Take a random sample from the population of fish in the lake
fishing_trip <- sample(gonefishing, n_fish)
#Look at the measurements of the first six fish we caught
head(fishing_trip)
## length height width weight orig.id
## 84 15.3 6.2 2.4 1023 84
## 109 11.6 4.6 1.8 569 109
## 323 12.1 4.9 1.7 594 323
## 318 12.3 4.9 1.9 536 318
## 313 9.4 3.7 1.3 254 313
## 314 15.3 6.0 2.3 777 314
Above, the row names tell us which fish (numbered 1 to 800 in the original data set) we happened to catch on this trip. Because the sample is random, your particular fish will be different from the ones shown here. Next, let's compute the mean weight of the fish in our sample:
mean_weight_sample <- mean(fishing_trip$weight)
mean_weight_sample
## [1] 569.8667
The sample mean will be different each time we run this code, since each run generates a new sample mean from a new set of observations. Crucially, these sample means all differ from the true population mean of about 519, which we calculated above.
So that was yesterday’s fishing trip. What about today’s? Let’s say we released the fish back into the lake and we’re interested in taking a fresh sample from the population. We’ll repeat the commands above to get a new sample.
fishing_trip <- sample(gonefishing, n_fish)
mean_weight_sample <- mean(fishing_trip$weight)
mean_weight_sample
## [1] 467.3667
Today's sample mean will be different from yesterday's, and both will differ from the population mean. The next step is to get a sense of how the sample mean varies across lots and lots of samples (i.e., many more than two). To do so, we'll use the do() function from the mosaic package to perform a Monte Carlo simulation. Let's try this:
do(25)*{
fishing_trip <- sample(gonefishing, n_fish)
mean_weight_sample <- mean(fishing_trip$weight)
mean_weight_sample
}
## result
## 1 538.2667
## 2 635.5333
## 3 546.2333
## 4 581.9667
## 5 535.3333
## 6 475.7000
## 7 565.4000
## 8 471.6000
## 9 554.1667
## 10 451.5333
## 11 449.0333
## 12 580.7000
## 13 418.7333
## 14 539.3000
## 15 599.4000
## 16 515.5333
## 17 506.5000
## 18 431.3333
## 19 478.3000
## 20 541.9667
## 21 497.3333
## 22 542.9667
## 23 503.0000
## 24 490.5000
## 25 612.7667
Let's parse the above construct: (1) we took our original code block for drawing a random sample from the population and calculating the mean weight of that sample; (2) we placed that code block in curly braces {}; (3) we told R to repeat that block 25 times via the do(25)* command.
The result above is 25 different sample means, each corresponding to a different random sample from a population. This process is an example of Monte Carlo simulation, wherein a computer is used to simulate a random process. Our code above produced 25 Monte Carlo samples. Next, let’s make two small modifications. First, we’ll do more than 25 Monte Carlo samples and we’ll save the result.
# Save the Monte Carlo output
my_fishing_year <- do(365)*{
fishing_trip <- sample(gonefishing, n_fish)
mean_weight_sample <- mean(fishing_trip$weight)
mean_weight_sample
}
#Examine the first several entries
head(my_fishing_year)
## result
## 1 452.0667
## 2 489.0667
## 3 560.9667
## 4 507.6667
## 5 511.0000
## 6 486.1667
Great! So we see that my_fishing_year is a data frame with one column labeled "result". This column contains 365 sample means, one for each fishing trip (i.e., each a sample of size 30 from the population). Let's look at a histogram of these simulated sample means:
hist(my_fishing_year$result, breaks = 20)
sd(my_fishing_year$result)
## [1] 44.82245
We call this the sampling distribution of the sample mean. The dispersion of this distribution tells us how precisely the mean from any given sample of this size (here, 30 fish) approximates the population mean. The standard deviation of this sampling distribution is a natural measure of that dispersion, and for this reason it is usually called the standard error of the sample mean.
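For comparison (a rough sketch; the simulated value above will vary from run to run), the classical plug-in formula sd/sqrt(n) for the standard error of a sample mean should land close to the simulated value of about 45:
# Plug-in standard error: population SD of weight divided by the square root of the sample size
sd(gonefishing$weight) / sqrt(n_fish)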
So above we learned the basic trick of simulating a sampling distribution. We can now apply that trick to any kind of statistical model. For example, let's say we wanted to fit a model for the weight of a fish versus its volume.
# Define the volume variable and add it to the original data frame
gonefishing$volume <- gonefishing$height * gonefishing$length * gonefishing$width
# Model weight versus volume
plot(weight ~ volume, data = gonefishing)
lm0 <- lm(weight ~ volume, data = gonefishing)
abline(lm0)
coef(lm0)
## (Intercept) volume
## 48.947898 4.240374
For the population, it looks like the slope of the line is about 4.24 grams per cubic inch. And what about for our sample of 30 fish? Let’s execute this block of code a few times and compare the different lines we get.
# Plot the population
plot(weight ~ volume, data = gonefishing)
# Take a sample, show the points, and fit a straight line
n_fish <- 30
fishing_trip <- sample(gonefishing, n_fish)
lm1 <- lm(weight ~ volume, data = fishing_trip)
points(weight ~ volume, data = fishing_trip, pch = 19, col = "orange")
abline(lm1, lwd = 3, col = "orange")
For each Monte Carlo sample, of course we’ll see a slightly different fitted line, which reflects the variability from sample to sample. The line from our sample should be close to the true population line, but they won’t be exactly the same.
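To see this numerically (a small sketch reusing lm0 and the most recent lm1; your sample line will differ), we can stack the two sets of coefficients side by side:
# Compare the sample-based fit to the population fit
rbind(sample = coef(lm1), population = coef(lm0))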
Next, let’s use the same approach as above to look at the sampling distribution of the least-squares estimator from a sample size of 30. This time we’ll collect the intercept and slope of the least-squares line, rather than the sample mean of the weight. We’ll use 365 Monte Carlo samples to simulate a year of fishing trips.
my_fishing_year <- do(365)*{
fishing_trip <- sample(gonefishing, n_fish)
lm1 <- lm(weight ~ volume, data = fishing_trip)
coef(lm1)
}
For this simulation, notice that my_fishing_year now has two columns: the intercept and the slope on the volume variable. To examine the sampling distribution of the slope, we can look at a histogram and compute the standard error.
hist(my_fishing_year$volume)
sd(my_fishing_year$volume)
## [1] 0.3118739
In this case, 365 different fishing trips of 30 fish each yield 365 different estimated slopes, which together form the sampling distribution of the slope; its standard deviation (about 0.31) is the standard error of the slope.
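Another convenient summary of this sampling distribution (a sketch using the simulation output above; your endpoints will vary) is the interval covering the middle 95% of the simulated slopes:
# Middle 95% of the simulated slopes on volume
quantile(my_fishing_year$volume, probs = c(0.025, 0.975))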
Professor Scott includes a bit of syntax for getting fancier with R plots. Here's an example where we superimpose the fitted lines from 100 different samples on a plot of the population. There are also some extra flags passed to the plotting functions to make things more visually appealing.
n_fish <- 30
ghost_grey <- rgb(0.1, 0.1, 0.1, 0.2)
ghost_red <- rgb(0.8, 0.1, 0.1, 0.2)
plot(weight ~ volume, data = gonefishing, pch = 19, col = ghost_grey, las = 1)
abline(lm0, col = "darkgrey")
# Take 100 samples and fit a straight line to each one
for(i in 1:100) {
fishing_trip <- sample(gonefishing, n_fish)
lm1 <- lm(weight ~ volume, data = fishing_trip)
abline(lm1, col = ghost_red)
}
The “fan” of different fitted lines provides a visual depiction of the sampling distribution of the OLS estimator. Note: if you want to see all the different graphical parameters, try typing ?par into the console.
Now on to the next exercise on wage gaps at a large tech firm.
This walk-through will look at whether there is a “wage gap” at a tech firm between male and female employees with similar qualifications. We will use a multiple regression to adjust for the effect of education and experience in evaluating the correlation between an employee’s sex and his/her annual salary. After attempting this exercise, one should be able to (1) Fit a regression model; (2) Correctly interpret the estimated coefficients; (3) Quantify uncertainty about parameters in a multiple-regression model using bootstrapping.
salary <- read.csv("http://jgscott.github.io/teaching/data/salary.csv")
summary(salary)
## Salary Education Experience Months
## Min. :44920 Min. :2.00 Min. : 1.00 Min. : 3.0
## 1st Qu.:52750 1st Qu.:3.00 1st Qu.:13.00 1st Qu.: 26.5
## Median :60290 Median :4.00 Median :20.00 Median : 54.0
## Mean :61034 Mean :3.93 Mean :23.56 Mean : 58.4
## 3rd Qu.:68090 3rd Qu.:5.00 3rd Qu.:35.50 3rd Qu.: 89.5
## Max. :79130 Max. :7.00 Max. :47.00 Max. :119.0
## Sex
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.4884
## 3rd Qu.:1.0000
## Max. :1.0000
We see from the above output that the variables in this data set are:
- Salary: Annual salary in dollars
- Education: Years of post-secondary education
- Experience: Months of experience at this particular company
- Months: Total months of work experience, including all previous jobs
- Sex: Whether the employee is male or female (coded 0 = female, 1 = male)
Let’s first look at the distribution of salary by sex:
mean(Salary ~ Sex, data = salary)
## 0 1
## 62610.45 59381.90
boxplot(Salary ~ Sex, data = salary, names = c("Female", "Male"))
Interestingly, it appears at first glance that women are paid more at this company than men, on average. However, does the story change if we adjust for work experience?
plot(Salary ~ Experience, data = salary)
lm1 <- lm(Salary ~ Experience, data = salary)
coef(lm1)
## (Intercept) Experience
## 52516.6821 361.5327
Before delving into the results, note that we expect more experienced workers to be paid more, all else being equal. With the above plot in mind, how do the residuals (that is, salary adjusted for experience) look when we stratify them by sex?
boxplot(resid(lm1) ~ salary$Sex)
What a difference! Now it looks like men are being paid more than women for an equivalent amount of work experience since men have a positive residual, on average. The story is similar if we look at overall work experience, including jobs prior to the one with this particular company:
plot(Salary ~ Months, data = salary)
lm2 <- lm(Salary ~ Months, data = salary)
coef(lm2)
## (Intercept) Months
## 44807.1515 277.8743
The story in the residuals is similar: The distribution of adjusted salaries for men is shifted upward compared to that for women:
boxplot(resid(lm2) ~ salary$Sex)
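To put a number on the gap visible in these boxplots (a small sketch using base R and the lm2 fit above), we can average the residuals within each sex:
# Mean residual from the total-months model, by sex (0 and 1)
tapply(resid(lm2), salary$Sex, mean)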
To get at the partial relationship between gender and salary, we must fit a multiple-regression model that accounts for (1) experience with the company and (2) total months of professional work. We will also adjust for a third variable: years of post-secondary education. It is straightforward to fit such a model by least squares in R.
lm3 <- lm(Salary ~ Experience + Months + Education + Sex, data = salary)
coef(lm3)
## (Intercept) Experience Months Education Sex
## 39305.7117 122.2467 263.5782 591.0780 2320.5438
According to this model, men are paid $2,320.54 more per year than women with similar levels of education and work experience, both overall and with this particular company.
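One way to read this coefficient (a sketch with hypothetical employees set at roughly the median values from the summary above) is to compare predicted salaries for two workers who differ only in Sex:
# Two hypothetical employees: same education and experience, Sex coded 0 = female, 1 = male
new_emp <- data.frame(Experience = 20, Months = 54, Education = 4, Sex = c(0, 1))
predict(lm3, newdata = new_emp)
diff(predict(lm3, newdata = new_emp))   # equals the Sex coefficient, about $2,320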
We can quantify our uncertainty about this effect via bootstrapping.
# Bootstrap sample
boot3 <- do(5000)*{
lm(Salary ~ Experience + Months + Education + Sex, data = resample(salary))
}
# Histogram of the bootstrap distribution for the Sex coefficient
hist(boot3$Sex)
# Confidence interval
confint(boot3)
## name lower upper level method estimate
## 1 Intercept 34877.1630676 4.479727e+04 0.95 percentile 3.930571e+04
## 2 Experience 42.4248045 1.944757e+02 0.95 percentile 1.222467e+02
## 3 Months 236.4651866 2.910591e+02 0.95 percentile 2.635782e+02
## 4 Education -709.9081839 1.452176e+03 0.95 percentile 5.910780e+02
## 5 Sex 173.2729911 4.390432e+03 0.95 percentile 2.320544e+03
## 6 sigma 1988.4973138 2.977739e+03 0.95 percentile 2.672043e+03
## 7 r.squared 0.9152498 9.662007e-01 0.95 percentile 9.380802e-01
## 8 F 102.5941699 2.715710e+02 0.95 percentile 1.439242e+02
In this case, the bootstrapped confidence interval runs from about $170 to about $4,400. (Each confidence interval will be slightly different because of the Monte Carlo variability inherent to bootstrapping.) This is quite a wide range: we cannot rule out that the wage gap is quite small, but neither can we rule out that it runs into the thousands of dollars.
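The same percentile interval for the Sex coefficient can also be pulled straight from the bootstrap draws (a sketch using the boot3 object above; your endpoints will differ slightly from the confint output):
# 95% percentile interval for the Sex coefficient, computed by hand
quantile(boot3$Sex, probs = c(0.025, 0.975))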