General Instructions

Each lab in this course will have multiple components. First, there will be a document like the one below, which includes instructions, tutorials, and problems to be addressed in your write-up. Any part of the document below marked with a star, \(\star\), is a problem for your write-up. Second, there will be an R script where you will do all of the computations required for the lab. Third, you will complete a write-up to be turned in on the Friday that you did your lab.

Your lab write-up should be typed in RMarkdown. All of your computational work will be done in RStudio Cloud, and both your lab write-up and your R script will be considered when grading your work.

Lab Overview

In this lab, we will use R’s built-in tools to compute confidence intervals via the different methods we’ve developed in class. In particular, we will look at criminal sentencing data from Minnesota and attempt to identify racial disparities in sentence length.

Confidence Intervals the Slow Way

As a starting example, we’ll look at Pearson’s father-son height data set (father.son in the UsingR package), beginning with just the fathers’ heights.

library(UsingR)
data("father.son")

fheight <- father.son$fheight # vector of fathers' heights

We can compute the confidence interval for the mean of the fathers’ heights by first storing the relevant quantities as variables.

n.fheight <- length(fheight) # number of observations
mean.fheight <- mean(fheight) # sample mean
sd.fheight <- sd(fheight) # sample standard deviation

First, we compute the 95% \(z\)-confidence interval.

height.z.min <- mean.fheight - qnorm(0.025, lower.tail = FALSE) * (sd.fheight / sqrt(n.fheight))
height.z.max <- mean.fheight + qnorm(0.025, lower.tail = FALSE) * (sd.fheight / sqrt(n.fheight))

c(height.z.min, height.z.max) # z-confidence interval

## [1] 67.52324 67.85095

Then, we compute the 95% \(t\)-confidence interval.

height.t.min <- mean.fheight - qt(0.025, df = n.fheight - 1, lower.tail = FALSE) * (sd.fheight / sqrt(n.fheight))
height.t.max <- mean.fheight + qt(0.025, df = n.fheight - 1, lower.tail = FALSE) * (sd.fheight / sqrt(n.fheight))

c(height.t.min, height.t.max) # t-confidence interval

## [1] 67.52306 67.85114
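For reference, the two computations above can be written as formulas. The notation here is an assumption, not taken from the lab: \(\bar{x}\), \(s\), and \(n\) stand for mean.fheight, sd.fheight, and n.fheight, while \(z_{0.025}\) and \(t_{0.025,\,n-1}\) are the upper-tail critical values returned by the qnorm and qt calls.

```latex
\[
\bar{x} \pm z_{0.025}\,\frac{s}{\sqrt{n}}
\qquad\text{and}\qquad
\bar{x} \pm t_{0.025,\,n-1}\,\frac{s}{\sqrt{n}}
\]
```

The only difference between the two intervals is the critical value that multiplies the standard error \(s/\sqrt{n}\).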

Notice that the confidence intervals are really quite close together. This reflects the fact that the sample size is large (\(n=\) 1078) and the heights are roughly normally distributed, as illustrated in the histogram below.

library(ggplot2) # provides ggplot(); may not be attached by UsingR

ggplot(father.son, aes(x = fheight)) +
        geom_histogram(fill = "blue", color = "black") +
        xlab("Fathers' Heights")
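One way to see why the \(z\)- and \(t\)-intervals nearly coincide is to compare the critical values directly. The sketch below uses only base R. The variable names are illustrative, and the small-sample case (10 observations) is an added comparison that is not part of the lab.

```r
# Critical values for a 95% interval with n = 1078 (so df = 1077)
n <- 1078
z.crit <- qnorm(0.025, lower.tail = FALSE)
t.crit <- qt(0.025, df = n - 1, lower.tail = FALSE)
c(z = z.crit, t = t.crit)

# The same comparison for a much smaller sample, where the gap is larger
t.small <- qt(0.025, df = 9, lower.tail = FALSE)
c(z = z.crit, t = t.small)
```

With 1077 degrees of freedom the two critical values agree to about two decimal places, so the intervals differ only slightly. With 9 degrees of freedom the \(t\) critical value is noticeably larger, and the \(t\)-interval would be visibly wider.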

Confidence Intervals the Efficient Way

We definitely don’t want to compute every confidence interval by hand. As you might expect, R has built-in functions for computing confidence intervals. The z.test function is part of the BSDA package, and t.test is included in base R. In both cases, the confidence interval is stored in the conf.int component of the object that the function returns.

library(BSDA)

z.test(fheight, sigma.x = sd.fheight, conf.level = 0.95)$conf.int

## [1] 67.52324 67.85095
## attr(,"conf.level")
## [1] 0.95

t.test(fheight, conf.level = 0.95)$conf.int

## [1] 67.52306 67.85114
## attr(,"conf.level")
## [1] 0.95

We get the same answers as when we did the computation by hand!
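The agreement can also be checked programmatically rather than by eye. This is a minimal sketch on simulated data, used only so that it runs without the UsingR package. The names x, margin, by.hand, and built.in are illustrative and do not appear in the lab script.

```r
# Simulated data, used only so this check runs without the UsingR package
set.seed(302)
x <- rnorm(40, mean = 68, sd = 2.7)

# By-hand 95% t-interval, following the same steps as in the previous section
n <- length(x)
margin <- qt(0.025, df = n - 1, lower.tail = FALSE) * sd(x) / sqrt(n)
by.hand <- c(mean(x) - margin, mean(x) + margin)

# Built-in interval, with the conf.level attribute dropped for comparison
built.in <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)

# Compares the two intervals up to floating-point tolerance
all.equal(by.hand, built.in)
```

Using all.equal instead of == avoids spurious mismatches caused by floating-point rounding.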

The t.test function also allows us to compute confidence intervals for differences between means. We’ll do that to compare the means of the fathers’ heights and the sons’ heights. Notice that t.test has an argument that allows you to specify whether the population variances should be assumed equal or not.

sheight <- father.son$sheight

t.test(fheight, sheight, conf.level = 0.95, var.equal = TRUE)$conf.int

## [1] -1.2317972 -0.7621484
## attr(,"conf.level")
## [1] 0.95

t.test(fheight, sheight, conf.level = 0.95, var.equal = FALSE)$conf.int

## [1] -1.2317973 -0.7621483
## attr(,"conf.level")
## [1] 0.95

So, which of the two confidence intervals for the difference of means above is the proper one to use? Neither! Both intervals assume that the two samples are independent, and these two are definitely not: each son’s height is very likely to depend on his father’s height. This is backed up by the scatter plot below, where the least squares regression line shows a positive association between the two heights.

ggplot(father.son, aes(x = fheight, y = sheight)) +
        geom_point(alpha = 0.5, color = "blue") +
        geom_smooth(method = 'lm', color = 'red', se = FALSE) +
        xlab("Fathers' Heights") +
        ylab("Sons' Heights")
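When observations come in natural pairs, one standard alternative is a paired interval, which treats the pairwise differences as a single sample. Whether this is the method intended for this course is an assumption. The sketch below uses simulated data, not the father.son data, and the coefficients used to generate y are arbitrary.

```r
# Simulated paired measurements (NOT the father.son data): each y value
# is built from its matching x value, so the two samples are dependent.
set.seed(302)
x <- rnorm(100, mean = 68, sd = 2.7)
y <- 34 + 0.5 * x + rnorm(100, sd = 2.4)

# Paired 95% interval for the mean difference
paired.ci <- t.test(x, y, paired = TRUE, conf.level = 0.95)$conf.int

# Equivalent one-sample interval on the pairwise differences
diff.ci <- t.test(x - y, conf.level = 0.95)$conf.int

paired.ci
diff.ci
```

The two calls give the same interval, which shows that the paired procedure is a one-sample \(t\)-interval applied to the differences.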

Minnesota Sentencing Data

We have access to data (source) from the Minnesota Sentencing Guidelines Commission, which includes sentencing data from Minnesota between 2001 and 2019. The QSIDE Institute provides the cleaned data set, which contains 81 variables and almost 30,000 observations, as well as a codebook. For the purposes of this lab, we are only interested in two of the variables and a subset of the observations.

In particular, in your R Lab script you have access to a further cleaned data set that consists of two variables:

race: “1” represents White, “2” represents Black
time: the sentence length (in months) given by the judge

The data in this cleaned set are filtered for those defendants who were charged with Theft at a severity level of 2.
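Before computing any intervals, it can help to look at each group separately. The sketch below shows one way to do that with base R. The data frame name sentences and all of its values are made up for illustration, and storing race as character codes is an assumption. Only the column names race and time come from the description above.

```r
# Toy stand-in for the lab data: the name `sentences` and every value
# here are made up for illustration; only the column names match the lab.
set.seed(302)
sentences <- data.frame(
  race = rep(c("1", "2"), times = c(60, 40)),
  time = round(rexp(100, rate = 1 / 15)) + 1
)

# Sentence lengths split into one vector per group
time.by.race <- split(sentences$time, sentences$race)

# Sample size, mean, and standard deviation for each group
sapply(time.by.race, function(x) c(n = length(x), mean = mean(x), sd = sd(x)))
```

Summaries like these, along with a histogram of each group, are the kind of supporting computation that can be used to justify a choice of method.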

(\(\star\)) Compute the 99% confidence interval for the mean sentence length of White defendants and the 99% confidence interval for the mean sentence length of Black defendants. In your write-up, explain which method of computing confidence intervals you used, identifying and justifying your assumptions. Note that justification here may require you to show additional computations and/or graphs.
(\(\star\)) Compute a 99% confidence interval for the difference of mean sentence lengths of White defendants compared to Black defendants. In your write-up, explain which method of computing confidence intervals you used, identifying and justifying your assumptions. Note that justification here may require you to show additional computations and/or graphs.
(\(\star\)) In your write-up, explain how we can interpret the confidence interval you found in part 2 above.
(\(\star\)) What additional statistical questions could we ask, beyond the questions posed in parts 1 and 2? Use the documentation of the data set to see what other variables are available and use these to inform your questions.

Math 302 Lab 3

Ross Sweet