CIs_Summary_Extension

So far:

One-sample Z-test
- if standard deviation is known
- for mean
One-sample T-test
- if standard deviation is not known
- for mean
- use $ S $ in place of $ \sigma $
Two-sample T-test
- for difference of means
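
As a reminder, the two one-sample intervals can be written as follows. The symbols $ z^* $ and $ t^*_{n-1} $ (critical values for the chosen confidence level) are notation assumed here, not taken from the slides:

$$
\bar{x} \pm z^* \frac{\sigma}{\sqrt{n}} \quad (\sigma \text{ known}), \qquad \bar{x} \pm t^*_{n-1} \frac{S}{\sqrt{n}} \quad (\sigma \text{ unknown})
$$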

First extension: to two sample, paired

if data is paired
still for the mean

Second extension: For proportions

this is a bit more complicated…but R will hide the unpleasant part
single proportion
difference of proportions
the underlying assumption is that a normal distribution approximates a binomial distribution if the sample size is large enough
R performs this approximation with continuity correction which we saw earlier
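
A minimal sketch of that approximation, using made-up numbers ($ n = 100 $, $ p = 0.5 $, cutoff 45) that are not from the course data. It compares the exact binomial probability with the normal approximation, with and without the continuity correction:

```r
# Hypothetical values chosen only for illustration
n <- 100; p <- 0.5; k <- 45

mu    <- n * p                  # binomial mean
sigma <- sqrt(n * p * (1 - p))  # binomial standard deviation

exact   <- pbinom(k, n, p)            # exact P(X <= k)
plain   <- pnorm(k, mu, sigma)        # normal approximation, no correction
with_cc <- pnorm(k + 0.5, mu, sigma)  # normal approximation with continuity correction

round(c(exact = exact, plain = plain, with_cc = with_cc), 4)
```

The corrected value should sit noticeably closer to the exact one, which is why the correction is worth applying.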

Matched pairs example (Exercise 7.19)

library(resampledata)

Groceries example: are prices at Target and Walmart different?

The plan

To perform a two-sample, paired t-test, we can produce a vector of the differences in prices and run a one-sample t-test on it.
If the confidence interval does NOT contain zero, we have evidence of a difference of means.
try it…
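
For reference, the interval this plan produces is the one-sample t interval applied to the differences $ d_i = \text{Target}_i - \text{Walmart}_i $. The symbols $ \bar{d} $, $ s_d $ and $ t^*_{n-1} $ (mean and standard deviation of the differences, and the t critical value) are notation assumed here:

$$
\bar{d} \pm t^*_{n-1} \frac{s_d}{\sqrt{n}}
$$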

vector of differences and t-test

DiffInPrice <- Groceries$Target-Groceries$Walmart
t.test(DiffInPrice, conf.level=.95)


    One Sample t-test

data:  DiffInPrice
t = 0.47046, df = 29, p-value = 0.6415
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -0.1896825  0.3030159
sample estimates:
 mean of x 
0.05666667

R shortcut

As long as you remember that R does exactly what you tell it to do, it's ok to use an R shortcut:

t.test(Groceries$Target, Groceries$Walmart, conf.level=.95, paired=TRUE)


    Paired t-test

data:  Groceries$Target and Groceries$Walmart
t = 0.47046, df = 29, p-value = 0.6415
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.1896825  0.3030159
sample estimates:
mean of the differences 
             0.05666667
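
The shortcut and the difference-vector approach should agree on any paired data. A small sketch on made-up prices (the vectors store_a and store_b are hypothetical, not the Groceries data) checks this and also rebuilds the interval by hand:

```r
# Made-up paired prices, for illustration only
store_a <- c(2.50, 3.10, 0.99, 4.75, 1.20, 6.40)
store_b <- c(2.45, 3.25, 0.89, 4.50, 1.25, 6.10)

d <- store_a - store_b   # vector of paired differences
n <- length(d)

ci_diff   <- t.test(d, conf.level = 0.95)$conf.int
ci_paired <- t.test(store_a, store_b, conf.level = 0.95, paired = TRUE)$conf.int

# By hand: mean(d) +/- t* * sd(d) / sqrt(n), with n - 1 degrees of freedom
ci_hand <- mean(d) + c(-1, 1) * qt(0.975, df = n - 1) * sd(d) / sqrt(n)

all.equal(as.numeric(ci_diff), as.numeric(ci_paired))  # same interval
all.equal(as.numeric(ci_diff), ci_hand)                # same again
```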

Conclusion

Here, 0 is in the confidence interval, so we cannot conclude there is a statistically significant difference in prices between Target and Walmart.

Analysis as before

As we did earlier, let's remove any extreme outliers.
Here is a box plot of the differences in prices, so we can see where to set the cutoff.

boxplot(Groceries$Target-Groceries$Walmart)

[Figure: box plot of the price differences, Target minus Walmart]

Remove outlier(s)

I'll remove any data points where the difference in price between Target and Walmart is more than $2.
The which command returns the index of the data point(s). If we pass the whole which command to the new vector construction we don't need to know the particular index values.

#which(abs(DiffInPrice)>2)
ModifiedTarget <- Groceries$Target[-which(abs(DiffInPrice)>2)]
ModifiedWalmart <- Groceries$Walmart[-which(abs(DiffInPrice)>2)]
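
One caveat with the -which(...) idiom: if no point exceeds the cutoff, which returns an empty vector and negative indexing then drops everything. A hedged alternative is logical indexing, sketched here on made-up vectors (diffs and prices are hypothetical names):

```r
# Made-up differences with no value beyond the cutoff
diffs  <- c(0.10, -0.25, 0.40, -0.05)
prices <- c(1.99, 2.49, 3.10, 0.89)

drop_idx <- which(abs(diffs) > 2)   # integer(0): nothing to drop

length(prices[-drop_idx])           # 0: every element was removed by mistake
length(prices[abs(diffs) <= 2])     # 4: logical indexing keeps all points
```

With the Groceries data the two approaches give the same result, because one difference does exceed $2.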

Re-run t-test

t.test(ModifiedTarget, ModifiedWalmart, conf.level = .95, paired=TRUE)


    Paired t-test

data:  ModifiedTarget and ModifiedWalmart
t = 2.2038, df = 28, p-value = 0.03593
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.01098743 0.30073671
sample estimates:
mean of the differences 
              0.1558621

Conclusion

With the outlier removed (perhaps a recording error?), the confidence interval no longer contains 0 and we have (weak) evidence of a difference in prices.
Why weak?
- First, zero just barely missed being in the interval.
- Second, the associated $ p $-value (0.036) is significant at the 0.05 level but not at the 0.01 level.
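
Both points can be checked from the printed output alone. This sketch takes the reported statistic t = 2.2038 with 28 degrees of freedom and compares it with the critical values at the two levels:

```r
t_stat <- 2.2038   # from the paired t-test output above
df     <- 28

p_value <- 2 * pt(-abs(t_stat), df)   # two-sided p-value, about 0.036

crit_05 <- qt(1 - 0.05 / 2, df)       # cutoff for the 0.05 level
crit_01 <- qt(1 - 0.01 / 2, df)       # cutoff for the 0.01 level

c(p_value = p_value, crit_05 = crit_05, crit_01 = crit_01)
t_stat > crit_05   # TRUE: significant at 0.05
t_stat > crit_01   # FALSE: not significant at 0.01
```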

Confidence Intervals for proportions

If we want a confidence interval for a proportion (e.g. the proportion of dry wells in a certain area), we can use a version of the Z-test.
The assumption is that a normal distribution approximates a binomial distribution if $ n $ is large enough.
R performs this as a “black box” for us.
The details are on pages 215-216.
R also tries to reduce the approximation error by using a continuity correction.

Example. Exercise 7.27

one proportion

Find a CI for the proportion of students who took the drug and broke out in hives

prop.test(34, 350)$conf

[1] 0.06913692 0.13429260
attr(,"conf.level")
[1] 0.95
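
To peek inside the black box: for a single proportion, prop.test reports a Wilson score interval, by default with a continuity correction. The sketch below is one reading of that method, not R's actual source. The helper name wilson_cc is hypothetical, and the code assumes counts well away from 0 and n.

```r
# Wilson score interval with continuity correction (sketch).
# wilson_cc is a hypothetical helper, not part of base R.
wilson_cc <- function(x, n, conf.level = 0.95) {
  z     <- qnorm((1 + conf.level) / 2)  # normal critical value
  p_hat <- x / n                        # sample proportion
  cc    <- 0.5 / n                      # continuity correction
  a     <- z^2 / (2 * n)

  p_up  <- p_hat + cc
  upper <- (p_up + a + z * sqrt(p_up * (1 - p_up) / n + a / (2 * n))) / (1 + 2 * a)
  p_lo  <- p_hat - cc
  lower <- (p_lo + a - z * sqrt(p_lo * (1 - p_lo) / n + a / (2 * n))) / (1 + 2 * a)

  c(max(lower, 0), min(upper, 1))       # keep the interval inside [0, 1]
}

wilson_cc(34, 350)
```

For these counts it should reproduce the interval printed above to within rounding.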

Find a CI for the proportion of students who took the placebo and broke out in hives

prop.test(56, 350)$conf

[1] 0.1240384 0.2036158
attr(,"conf.level")
[1] 0.95

Notice the two intervals overlap.
So, judging from these two intervals alone, we cannot rule out that the proportion is the same for both groups.

difference of proportions

Instead of estimating $ p $, we estimate $ p_{\text{drug}}-p_{\text{placebo}} $

prop.test(c(34, 56), c(350, 350))$conf

[1] -0.11508783 -0.01062645
attr(,"conf.level")
[1] 0.95
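
For two proportions, the printed interval appears to be the familiar normal-approximation interval for a difference, widened by a continuity correction. A sketch under that assumption (diff_prop_ci is a hypothetical helper, and the correction term is one reading of prop.test's default behaviour, not its actual source):

```r
# Normal-approximation CI for p1 - p2 with continuity correction (sketch).
# diff_prop_ci is a hypothetical helper, not part of base R.
diff_prop_ci <- function(x, n, conf.level = 0.95) {
  z     <- qnorm((1 + conf.level) / 2)       # normal critical value
  p_hat <- x / n                             # the two sample proportions
  delta <- p_hat[1] - p_hat[2]               # estimated difference
  se    <- sqrt(sum(p_hat * (1 - p_hat) / n))
  cc    <- min(0.5, abs(delta) / sum(1 / n)) * sum(1 / n)  # continuity correction
  width <- z * se + cc
  c(max(delta - width, -1), min(delta + width, 1))
}

diff_prop_ci(c(34, 56), c(350, 350))
```

For these counts it should match the interval printed above to within rounding.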

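This time, 0 is not in the CI, so we conclude there is a significant difference between the two proportions.
These results look contradictory, but overlap between two separate intervals is not a formal test of the difference.
We'll go with the second result because it estimates the difference directly, using the data as a whole.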