Warm-up/Review

Suppose that we have a random sample from a normal distribution with a known variance.

  1. What is our pencil-and-paper formula for a confidence interval for a mean?

  2. What R functions are necessary/helpful for calculating this confidence interval?

  3. Explain how to use qnorm().

  4. What is our pencil-and-paper formula for a confidence interval for a mean when the variance is unknown? What changes and what stays the same?

  5. Identify the margin of error (ME) that we use to calculate the confidence interval for a mean when variance is known and the ME when variance is unknown.

Quantiles: qnorm() and qt()

To find a quantile from a standard normal distribution, we only needed to give R the desired probability (the area to the left of the quantile). For a t-distribution, we need to add one piece of information: the degrees of freedom.

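To calculate the cutoff for a 95% confidence interval for a mean based on a sample of size 100 (99 degrees of freedom), we put .025 in the lower tail, which gives the negative cutoff; by symmetry, the upper cutoff is the same number with a positive sign. We type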

qt(.025, df = 99)
## [1] -1.984217
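
As a quick check of the ideas above, the sketch below compares the standard normal cutoff with the t cutoff at the same confidence level. It uses only base R; working with the upper-tail probability 0.975 is our choice here, not something the problem requires.

```r
# Upper-tail cutoffs for a 95% confidence interval
z_cut <- qnorm(0.975)          # standard normal: no degrees of freedom needed
t_cut <- qt(0.975, df = 99)    # t-distribution: needs the degrees of freedom

# By symmetry, the lower-tail quantile is the negative of the upper-tail one
t_low <- qt(0.025, df = 99)

c(z = z_cut, t = t_cut, t_lower = t_low)
```

The t cutoff is a little larger than the z cutoff because the t-distribution has heavier tails; the gap shrinks as the degrees of freedom grow.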

Book Problem 7.6

Since this is a 90 percent CI, we need .05 in each tail.

n <- 20
(myq <- qt(.05, df = n - 1, lower.tail = FALSE))
## [1] 1.729133

Thus the quantile from our t-distribution with 19 degrees of freedom is 1.7291328.

The 90 percent confidence interval for the mean sugar content in a half cup of vanilla ice cream is:

(myCI <- 18.05 + c(-1,1) * 5/sqrt(n) * myq)
## [1] 16.11677 19.98323
18.05 + 5/sqrt(n) * myq
## [1] 19.98323
18.05 - 5/sqrt(n) * myq
## [1] 16.11677

The endpoints of the confidence interval are 16.1167707 and 19.9832293. We are 90 percent confident the true mean sugar content in a half cup of vanilla ice cream is between 16.117 and 19.983 grams.
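
If we expect to do this calculation more than once, the steps above can be wrapped in a small function. This is only a sketch: `t_ci_summary` is a hypothetical helper name of our own (it is not in the book or in base R), and it assumes we are given the sample mean, sample standard deviation, and sample size.

```r
# Hypothetical helper (not from the book): t-based CI from summary statistics
t_ci_summary <- function(xbar, s, n, conf = 0.95) {
  alpha <- 1 - conf
  q  <- qt(1 - alpha / 2, df = n - 1)   # positive cutoff
  me <- q * s / sqrt(n)                 # margin of error
  c(lower = xbar - me, upper = xbar + me)
}

# Problem 7.6 inputs: mean 18.05, sd 5, n = 20, 90% confidence
ci_76 <- t_ci_summary(18.05, 5, 20, conf = 0.90)
ci_76
```

With the Problem 7.6 inputs, this should reproduce the interval computed above.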

Practice Problems

From the book: 7.5, 7.7, 7.10, 7.14, 7.16, 7.19

Announcement: Seminar

We have a seminar speaker on Friday. He graduated from St. Thomas in 2015 with degrees in actuarial science and statistics. You can find the poster (and Zoom info) on Canvas under Pages.

Solutions

7.5

The margin of error for this problem is

\[\begin{align} \frac{z_{\alpha/2} \sigma}{\sqrt{n}} \end{align}\]

If we halve the ME, the new ME will be half the old ME. Writing \(n_1\) for the old sample size and \(n_2\) for the new sample size,

\[\begin{align} \frac{z_{\alpha/2} \sigma}{\sqrt{n_2}} = \frac{z_{\alpha/2} \sigma}{2 \sqrt{n_1}} \end{align}\]

Solve for \(n_2\): canceling \(z_{\alpha/2} \sigma\) gives \(\sqrt{n_2} = 2\sqrt{n_1}\), so \(n_2 = 4 n_1\). We need to quadruple the sample size to halve the margin of error. (That is, the new sample size must be 4 times the old sample size.)
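
We can also check the quadrupling result numerically. The sketch below is not part of the book's solution: `me_z` is a hypothetical helper for the z-based margin of error, and the sample size and sigma are made-up values chosen only for illustration.

```r
# Hypothetical helper: z-based margin of error when sigma is known
me_z <- function(n, sigma, conf = 0.95) {
  qnorm(1 - (1 - conf) / 2) * sigma / sqrt(n)
}

n_old <- 50    # illustrative values only
sigma <- 10

# Ratio of the margin of error at 4 times the sample size to the original
ratio <- me_z(4 * n_old, sigma) / me_z(n_old, sigma)
ratio   # 0.5, up to floating-point rounding
```

The ratio does not depend on the values chosen for sigma, the confidence level, or the starting sample size, because everything except the square root of n cancels.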

7.7

n <- 100
xbar <- 120
sdee <- 12
myq <- qt(.05, df = n-1)
lowbd <- xbar + myq*sdee/sqrt(n)
lowbd
## [1] 118.0075

I am 95 percent confident that the mean battery life is no lower than 118.008 hours.

Hint: if you’re confused as to whether you’ve created an upper bound or lower bound, you can check your work. A lower bound must be lower than the sample mean. An upper bound must be greater than the sample mean.
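
The check described in the hint can be done in R. This sketch reuses the summary statistics from 7.7 but uses the positive cutoff (`qt(0.95, ...)`) so the signs are explicit; the variable names are our own.

```r
n <- 100
xbar <- 120
s <- 12

q  <- qt(0.95, df = n - 1)      # positive cutoff with 5% in one tail
se <- s / sqrt(n)               # standard error of the mean

lower_bound <- xbar - q * se    # one-sided 95% lower bound
upper_bound <- xbar + q * se    # one-sided 95% upper bound

# Sanity checks from the hint: both should be TRUE
lower_bound < xbar
upper_bound > xbar
```

If either comparison comes back FALSE, the sign on the cutoff is probably flipped.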

7.10

library(resampledata)
## 
## Attaching package: 'resampledata'
## The following object is masked from 'package:datasets':
## 
##     Titanic
data("Olympics2012")
names(Olympics2012)
## [1] "Name"    "Country" "Age"     "Sex"     "Height"  "Weight"  "Sport"
levels(Olympics2012$Sex) # what indicates woman?
## [1] "F" "M"
Wdata <- subset(Olympics2012, Sex =="F")
nrow(Wdata) #double check I did it right
## [1] 26

I like to start with the basic histogram to make sure nothing is weird.

hist(Wdata$Age)

Then once I know the correct variables were chosen, I rename the axes.

hist(Wdata$Age, xlab = "Age (Years)", main = "Female Olympian Ages")

The histogram looks roughly normal, so we can use the t-based confidence interval. (Remember: the more skewed the data are, the larger the sample size has to be for the sampling distribution of the mean to be approximately normal.)

xbar <- mean(Wdata$Age)
sdee <- sd(Wdata$Age)
n <- nrow(Wdata)
myq <- qt(.025, df = n-1)
(myCI <- xbar + c(1,-1) * myq * sdee/sqrt(n))
## [1] 25.06989 28.69934
round(myCI,2)
## [1] 25.07 28.70

We are 95% confident the mean age for female Olympians is between 25.07 and 28.70 years.

In other words, we are 95% confident that female Olympians are between 25.07 and 28.70 years old, on average.

Alternatively, we can look at the middle bit of output produced by this R shortcut:

t.test(Wdata$Age)
## 
##  One Sample t-test
## 
## data:  Wdata$Age
## t = 30.512, df = 25, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  25.06989 28.69934
## sample estimates:
## mean of x 
##  26.88462
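
If we only want the interval rather than the full printout, we can pull it out of the object that t.test() returns. The sketch below uses simulated ages as a stand-in (an assumption, so that it runs without the resampledata package); the numbers will not match the Olympics data.

```r
set.seed(1)                               # for reproducibility
ages <- rnorm(26, mean = 27, sd = 4.5)    # simulated stand-in, not the real data

# By hand, as in the solution above
n <- length(ages)
q <- qt(0.975, df = n - 1)
by_hand <- mean(ages) + c(-1, 1) * q * sd(ages) / sqrt(n)

# From t.test(): the conf.int component holds just the interval
from_ttest <- t.test(ages)$conf.int

by_hand
as.numeric(from_ttest)
```

The two intervals should agree, which confirms that t.test() is doing the same calculation we did by hand.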

7.14

First, do EDA (exploratory data analysis).

data(Girls2004)
boxplot(Girls2004$Weight ~ Girls2004$Smoker)

Again, make sure the guts of the plot are right before you pretty it up.

boxplot(Girls2004$Weight ~ Girls2004$Smoker, xlab = "Mother Smoked", ylab = "Weight (Grams)")

In this sample, babies born to mothers who smoked tended to weigh less at birth. We can see this both in the plots and in the group-wise means, below.

tapply(Girls2004$Weight, Girls2004$Smoker, FUN = mean )
##       No      Yes 
## 3401.580 3114.636

Find and save the means. (Alternatively, you can subset the data instead of using tapply.)

temp <- tapply(Girls2004$Weight, Girls2004$Smoker, FUN = mean )
xbarno <- temp[1]
xbarsmoke <- temp[2]

Find and save the sample variances.

temp <- tapply(Girls2004$Weight, Girls2004$Smoker, FUN = var )
varno <- temp[1]
varsmoke <- temp[2]

Find and save the sample sizes.

nno <- summary(Girls2004$Smoker)[1]
nsmoke <- summary(Girls2004$Smoker)[2]

Rather than throw this whole CI together at once, let’s take it step by step. First let’s calculate the standard error.

(temp<- varsmoke/nsmoke + varno/nno)
##      Yes 
## 23922.74
(SE <- sqrt(temp))
##      Yes 
## 154.6698

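Now let’s find the quantile. Since we want a one-sided 95% lower bound, we put .05 in one tail. Remember there are a few ways to calculate the df. Using the smaller sample size minus one is the easiest but is conservative: it results in the widest CI. You can instead program Welch’s df.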

df <- min(nno, nsmoke) -1
myq <- qt(.05, df = df)

Next, put it together:

(lowbd <- xbarno - xbarsmoke + myq * SE)
##       No 
## 6.610412

Remember, this is in grams (not pounds!). Therefore, we can say we are 95 percent confident that babies from mothers who didn’t smoke weigh at least 6.61 grams more than babies from mothers who did smoke, on average.

Alternatively, we can use a shortcut in R:

t.test(Weight ~ Smoker, data = Girls2004, alternative = "greater", conf.level = .95)
## 
##  Welch Two Sample t-test
## 
## data:  Weight by Smoker
## t = 1.8552, df = 14.35, p-value = 0.04211
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  14.99055      Inf
## sample estimates:
##  mean in group No mean in group Yes 
##          3401.580          3114.636

This produces a different lower bound because it uses a different (better) df calculation. From this, we can say we are 95 percent confident that babies from mothers who didn’t smoke weigh at least 14.99 grams more than babies from mothers who did smoke, on average.
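
For anyone who wants to program Welch's df, here is one possible sketch. The function name `welch_df` is hypothetical, and the data are simulated (the means, standard deviations, and group sizes are made up for illustration), so the result will not match the 14.35 above.

```r
# Hypothetical helper: Welch-Satterthwaite degrees of freedom
welch_df <- function(v1, n1, v2, n2) {
  a <- v1 / n1    # variance of the first sample mean
  b <- v2 / n2    # variance of the second sample mean
  (a + b)^2 / (a^2 / (n1 - 1) + b^2 / (n2 - 1))
}

set.seed(2)
w_no  <- rnorm(69, mean = 3400, sd = 500)   # simulated weights, not the real data
w_yes <- rnorm(11, mean = 3100, sd = 450)

df_hand <- welch_df(var(w_no), length(w_no), var(w_yes), length(w_yes))
df_r    <- unname(t.test(w_no, w_yes)$parameter)   # df reported by t.test()

c(by_hand = df_hand, t.test = df_r)
```

Welch's df always falls between the smaller sample size minus one and the total sample size minus two, which is why the min-based shortcut gives the wider interval.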

7.16

data("FlightDelays")
names(FlightDelays)
##  [1] "ID"           "Carrier"      "FlightNo"     "Destination"  "DepartTime"  
##  [6] "Day"          "Month"        "FlightLength" "Delay"        "Delayed30"
boxplot(FlightDelays$Delay ~ FlightDelays$Carrier)

t.test(FlightDelays$Delay ~ FlightDelays$Carrier)
## 
##  Welch Two Sample t-test
## 
## data:  FlightDelays$Delay by FlightDelays$Carrier
## t = -3.8255, df = 1843.8, p-value = 0.0001349
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8.903198 -2.868194
## sample estimates:
## mean in group AA mean in group UA 
##         10.09738         15.98308

t.test() subtracts in the order of the factor levels: it takes the mean of the first level minus the mean of the second level. We can check the order of the levels:

levels(FlightDelays$Carrier)
## [1] "AA" "UA"

Therefore, our 95% confidence interval is for the mean for American Airlines minus the mean for United. Since the entire CI is below zero, United tends to be more delayed than American, on average.

We can double check this by checking the sample means:

tapply(FlightDelays$Delay, FlightDelays$Carrier, FUN = mean)
##       AA       UA 
## 10.09738 15.98308

Now that we have that straight in our heads, let’s interpret the CI. We are 95% confident that American Airlines flights are between 2.868 and 8.903 minutes LESS delayed than United flights, on average.

Alternatively, we are 95% confident that United flights are between 2.868 and 8.903 minutes MORE delayed than American flights, on average.
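
If we would rather have the subtraction go the other way, one option is to reorder the factor levels with relevel() before calling t.test(). The sketch below uses a simulated data frame (the name `flights`, its columns, and all of its values are made up for illustration), so the numbers will not match the FlightDelays output.

```r
set.seed(3)
flights <- data.frame(
  carrier = factor(rep(c("AA", "UA"), times = c(60, 40))),
  delay   = c(rnorm(60, mean = 10, sd = 20), rnorm(40, mean = 16, sd = 20))
)

# Default level order is alphabetical: the CI is for mean(AA) - mean(UA)
ci_aa_first <- t.test(delay ~ carrier, data = flights)$conf.int

# Make UA the first level: the CI is now for mean(UA) - mean(AA)
flights$carrier2 <- relevel(flights$carrier, ref = "UA")
ci_ua_first <- t.test(delay ~ carrier2, data = flights)$conf.int

as.numeric(ci_aa_first)
as.numeric(ci_ua_first)
```

The second interval is the first one with the signs flipped and the endpoints swapped, so both describe the same comparison.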

7.19

data(Groceries)
names(Groceries)
## [1] "Product"  "Size"     "Target"   "Walmart"  "Units"    "UnitType"

We have paired data here, so we need to focus on the difference in prices:

diffs <- Groceries$Target - Groceries$Walmart

It doesn’t matter which goes first. Just remember what you’ve done. Now simply create a CI for a single mean.

t.test(diffs)
## 
##  One Sample t-test
## 
## data:  diffs
## t = 0.47046, df = 29, p-value = 0.6415
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -0.1896825  0.3030159
## sample estimates:
##  mean of x 
## 0.05666667

The 95% CI for \(\mu_T - \mu_W\) is from -0.1897 to 0.3030 dollars. We are 95% confident that groceries at Target are between 18.97 cents LESS expensive and 30.30 cents MORE expensive than at Walmart, on average.

Notably, zero is in the CI, meaning it is plausible that the groceries cost the same at the two stores, on average.
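
As a side note, R can do the subtraction for us: calling t.test() on two vectors with paired = TRUE should give the same interval as calling it on the differences. The sketch below uses simulated prices (the vectors `store_a` and `store_b` are made up for illustration), not the Groceries data.

```r
set.seed(4)
store_a <- round(runif(30, min = 1, max = 6), 2)             # simulated prices
store_b <- round(store_a + rnorm(30, mean = 0, sd = 0.5), 2)

# Approach used above: one-sample CI on the differences
ci_diffs <- t.test(store_a - store_b)$conf.int

# Same thing, letting t.test() form the differences
ci_paired <- t.test(store_a, store_b, paired = TRUE)$conf.int

as.numeric(ci_diffs)
as.numeric(ci_paired)
```

Either way, the order of subtraction is first vector minus second vector, so remember which store went first.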

To see whether the outlier is influential, remove it and repeat the process. If the CI changes dramatically, then that outlier was influential.
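
One way to carry out that check is sketched below. It uses simulated differences with a single planted extreme value (an assumption for illustration, not the Groceries data), and it identifies the outlier as the value farthest from the median, which is only one of several reasonable choices.

```r
set.seed(5)
d <- c(rnorm(29, mean = 0, sd = 0.3), 3)   # simulated differences, one extreme value

# CI using all of the differences
ci_all <- t.test(d)$conf.int

# Drop the value farthest from the median and recompute
idx <- which.max(abs(d - median(d)))
ci_trim <- t.test(d[-idx])$conf.int

as.numeric(ci_all)
as.numeric(ci_trim)
```

If the two intervals tell noticeably different stories (for example, one contains zero and the other does not), the outlier is influential and both results are worth reporting.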