Each lab in this course will have multiple components. First, there will be a document like the one below, which includes instructions, tutorials, and problems to be addressed in your write-up. Any part of the document below marked with a star, \(\star\), is a problem for your write-up. Second, there will be an R script where you will do all of the computations required for the lab. Third, you will complete a write-up, to be turned in on the Friday of the week in which you did your lab.
Your lab write-up should be typed in RMarkdown. All of your computational work will be done in RStudio Cloud, and both your lab write-up and your R script will be considered when grading your work.
In this lab, we will use R’s built-in tools to compute confidence intervals via the different methods we’ve developed in class. In particular, we will look at criminal sentencing data from Minnesota and attempt to identify racial disparities in sentence length.
As a starting example, we’ll look at Galton’s father-son height data set, beginning with just the fathers’ heights.
library(UsingR)
data("father.son")
fheight <- father.son$fheight # vector of fathers' heights
We can compute a confidence interval for the mean of the fathers’ heights by first storing the relevant quantities as variables.
n.fheight <- length(fheight) # number of observations
mean.fheight <- mean(fheight) # sample mean
sd.fheight <- sd(fheight) # sample standard deviation
First, we compute the 95% \(z\)-confidence interval.
height.z.min <- mean.fheight - qnorm(0.025, lower.tail = FALSE) * (sd.fheight / sqrt(n.fheight))
height.z.max <- mean.fheight + qnorm(0.025, lower.tail = FALSE) * (sd.fheight / sqrt(n.fheight))
c(height.z.min, height.z.max) # z-confidence interval
## [1] 67.52324 67.85095
Then, we compute the 95% \(t\)-confidence interval.
height.t.min <- mean.fheight - qt(0.025, df = n.fheight - 1, lower.tail = FALSE) * (sd.fheight / sqrt(n.fheight))
height.t.max <- mean.fheight + qt(0.025, df = n.fheight - 1, lower.tail = FALSE) * (sd.fheight / sqrt(n.fheight))
c(height.t.min, height.t.max) # t-confidence interval
## [1] 67.52306 67.85114
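Both computations above follow the same pattern, written here with \(\bar{x}\) for the sample mean, \(s\) for the sample standard deviation, and \(n\) for the sample size. The subscript notation for the critical values is an assumption made for this summary and may differ from the notation used in class.

\[
\bar{x} \pm z_{0.025}\,\frac{s}{\sqrt{n}}
\qquad\text{and}\qquad
\bar{x} \pm t_{0.025,\,n-1}\,\frac{s}{\sqrt{n}}
\]

Here \(z_{0.025}\) is the value returned by qnorm(0.025, lower.tail = FALSE) and \(t_{0.025,\,n-1}\) is the value returned by qt(0.025, df = n - 1, lower.tail = FALSE). Note that the \(z\)-interval computed above uses the sample standard deviation \(s\) in place of a known population standard deviation.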
Notice that the two confidence intervals are nearly identical. This is because the sample size is large (\(n=\) 1078), so the \(t\) critical value is almost the same as the \(z\) critical value. The heights are also roughly normally distributed, as illustrated in the histogram below, which supports using either method here.
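library(ggplot2) # provides ggplot(); loaded explicitly in case it is not already attached
ggplot(father.son, aes(x = fheight)) +
geom_histogram(bins = 30, fill = "blue", color = "black") + # bins = 30 is the default, stated to avoid a message
xlab("Fathers' Heights")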
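The near-agreement of the two intervals can also be checked directly from the critical values. The sketch below uses only base R and the sample size quoted above; the variable names and the small-sample comparison with 10 observations are illustrative assumptions, not part of the lab script.

```r
# Critical values for a 95% interval at the sample size quoted above.
n <- 1078
z.crit <- qnorm(0.025, lower.tail = FALSE)           # standard normal critical value
t.crit <- qt(0.025, df = n - 1, lower.tail = FALSE)  # t critical value with n - 1 df

# With a large sample the two are almost equal, so the intervals nearly coincide.
c(z = z.crit, t = t.crit, ratio = t.crit / z.crit)

# For contrast, a hypothetical small sample of 10 observations (df = 9)
# gives a noticeably larger t critical value.
t.small <- qt(0.025, df = 9, lower.tail = FALSE)
c(z = z.crit, t.small = t.small)
```

Because both intervals share the same center and the same standard error, the ratio of the critical values is also the ratio of the interval widths.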
We definitely don’t want to compute every confidence interval by hand. As you probably imagined, R has built-in functions for computing confidence intervals. The function z.test is part of the BSDA package, and t.test is included in base R. In both cases, the confidence interval is stored in the conf.int component of the object that the function returns.
library(BSDA)
z.test(fheight, sigma.x = sd.fheight, conf.level = 0.95)$conf.int
## [1] 67.52324 67.85095
## attr(,"conf.level")
## [1] 0.95
t.test(fheight, conf.level = 0.95)$conf.int
## [1] 67.52306 67.85114
## attr(,"conf.level")
## [1] 0.95
We get the same answers as when we did the computation by hand!
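Rather than comparing printed output by eye, the agreement can be checked programmatically. This is a minimal sketch on simulated data (the vector x is a made-up stand-in, not the father.son data), so it runs without any packages; the names by.hand and built.in are illustrative.

```r
# Simulated stand-in data; NOT the father.son heights.
set.seed(1)
x <- rnorm(50, mean = 68, sd = 3)

# 95% t-interval by hand, following the same steps as above.
n <- length(x)
half.width <- qt(0.025, df = n - 1, lower.tail = FALSE) * sd(x) / sqrt(n)
by.hand <- c(mean(x) - half.width, mean(x) + half.width)

# The same interval from t.test; as.numeric() drops the conf.level attribute.
built.in <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)

# TRUE when the two agree up to floating-point tolerance.
all.equal(by.hand, built.in)
```

The call to all.equal compares with a small numerical tolerance, which is usually a safer check than == for computed decimals.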
The t.test function also allows us to compute confidence intervals for differences between means. We’ll do that to compare the means of the fathers’ heights and the sons’ heights. Notice that t.test has an argument, var.equal, that allows you to specify whether the population variances should be assumed equal or not.
sheight <- father.son$sheight
t.test(fheight, sheight, conf.level = 0.95, var.equal = TRUE)$conf.int
## [1] -1.2317972 -0.7621484
## attr(,"conf.level")
## [1] 0.95
t.test(fheight, sheight, conf.level = 0.95, var.equal = FALSE)$conf.int
## [1] -1.2317973 -0.7621483
## attr(,"conf.level")
## [1] 0.95
So, which of the two confidence intervals for the difference of means above is the proper one to use? Neither! Both intervals assume that the two samples are independent, and these two are definitely not: a son’s height is very likely to depend on his father’s height. This is backed up by the scatter plot below, which shows a positive correlation between the two heights, highlighted by the least squares regression line.
ggplot(father.son, aes(x = fheight, y = sheight)) +
geom_point(alpha = 0.5, color = "blue") +
geom_smooth(method = 'lm', color = 'red', se = FALSE) +
xlab("Fathers' Heights") +
ylab("Sons' Heights")
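One common way to handle dependent pairs like these is a paired interval, which is an ordinary one-sample \(t\)-interval computed on the within-pair differences. The text above does not say which method the course uses, so treat the following as a hedged sketch only. It uses simulated data (the vectors father and son below are made up to be positively correlated and are not the father.son data).

```r
# Simulated, positively correlated pairs; NOT the real father.son data.
set.seed(2)
n <- 200
father <- rnorm(n, mean = 68, sd = 2.7)
son <- 34 + 0.5 * father + rnorm(n, sd = 2.4)

cor(father, son)  # sample correlation between the paired measurements

# A paired interval is the same as a one-sample interval on the differences.
paired.ci <- as.numeric(t.test(father, son, paired = TRUE, conf.level = 0.95)$conf.int)
diff.ci <- as.numeric(t.test(father - son, conf.level = 0.95)$conf.int)
all.equal(paired.ci, diff.ci)

# For comparison: the interval that (wrongly) treats the samples as independent.
indep.ci <- as.numeric(t.test(father, son, conf.level = 0.95)$conf.int)
c(paired.width = diff(paired.ci), independent.width = diff(indep.ci))
```

When the pairs are positively correlated, the differences vary less than the independent-samples formula assumes, so the paired interval tends to be narrower.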
We have access to data (source) from the Minnesota Sentencing Guidelines Commission, which includes sentencing data from Minnesota between 2001 and 2019. The QSIDE Institute provides the cleaned data set, which contains 81 variables and almost 30,000 observations, as well as a codebook. For the purposes of this lab, we are only interested in two of the variables and a subset of the observations.
In particular, in your R Lab script you have access to a further cleaned data set that consists of two variables:
The data in this cleaned set are filtered for those defendants who were charged with Theft at a severity level of 2.
1. (\(\star\)) Compute the 99% confidence interval for the mean sentence length of white defendants and the 99% confidence interval for the mean sentence length of black defendants. In your write-up, explain which method of computing confidence intervals you used, identifying and justifying your assumptions. Note that justification here may require you to show additional computations and/or graphs.
2. (\(\star\)) Compute the 99% confidence interval for the difference of mean sentence lengths of white defendants compared to black defendants. In your write-up, explain which method of computing confidence intervals you used, identifying and justifying your assumptions. Note that justification here may require you to show additional computations and/or graphs.
3. (\(\star\)) In your write-up, explain how we can interpret the confidence interval you found in part 2 above.
4. (\(\star\)) What additional statistical questions could we ask, beyond the questions posed in parts 1 and 2? Use the documentation of the data set to see what other variables are available and use these to inform your questions.