CIs_Summary_Extension

So far:

  • One sample Z-test: if standard deviation is known
    • for mean
  • One sample T-test:
    • if standard deviation is not known
    • for mean
    • use \( S \) in place of \( \sigma \)
  • Two sample T-test
    • for difference of means
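
As a quick recap of the one-sample intervals above, here is a minimal sketch that builds the Z and t intervals by hand. The data are simulated, and the "known" standard deviation of 2 is an assumption made only for illustration.

```r
# Hypothetical data: 25 draws from a normal population (illustration only)
set.seed(1)
x <- rnorm(25, mean = 10, sd = 2)
n <- length(x)

# Z interval: uses the known sigma (assumed to be 2 here)
sigma <- 2
z_ci <- mean(x) + c(-1, 1) * qnorm(0.975) * sigma / sqrt(n)

# t interval: sigma unknown, so use the sample standard deviation S
t_ci <- mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * sd(x) / sqrt(n)

z_ci
t_ci

# The hand-built t interval should agree with t.test()
all.equal(t_ci, as.numeric(t.test(x, conf.level = 0.95)$conf.int))
```

The only differences between the two intervals are the multiplier (qnorm versus qt) and the use of \( S \) in place of \( \sigma \).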

First extension: to two sample, paired

  • if data is paired
  • still for the mean

Second extension: For proportions

  • this is a bit more complicated…but R will hide the unpleasant part
  • single proportion
  • difference of proportions
  • the underlying assumption is that a normal distribution approximates a binomial distribution if the sample size is large enough
  • R performs this approximation with a continuity correction, which we saw earlier

Matched pairs example (Exercise 7.19)

library(resampledata)
  • Groceries example: are prices at Target and Walmart different?

The plan

  • To perform a two-sample, paired t-test
  • we can produce a vector of the differences in prices and run a one-sample t-test.
  • If the confidence interval does NOT contain zero we have evidence of a difference of means.
  • try it…

vector of differences and t-test

DiffInPrice <- Groceries$Target-Groceries$Walmart
t.test(DiffInPrice, conf.level=.95)

    One Sample t-test

data:  DiffInPrice
t = 0.47046, df = 29, p-value = 0.6415
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -0.1896825  0.3030159
sample estimates:
 mean of x 
0.05666667 

R shortcut

  • As long as you remember that R does exactly what you tell it to do, it's ok to use an R shortcut:
t.test(Groceries$Target, Groceries$Walmart, conf.level=.95, paired=T)

    Paired t-test

data:  Groceries$Target and Groceries$Walmart
t = 0.47046, df = 29, p-value = 0.6415
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.1896825  0.3030159
sample estimates:
mean of the differences 
             0.05666667 
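
The shortcut and the vector-of-differences approach are the same computation. A small sketch with made-up prices makes this easy to check (the vectors store_a and store_b are hypothetical and are not taken from Groceries):

```r
# Hypothetical paired prices for the same six items at two stores
store_a <- c(2.49, 3.19, 0.99, 4.59, 1.79, 5.25)
store_b <- c(2.29, 3.25, 0.89, 4.39, 1.85, 4.99)

# Approach 1: one-sample t-test on the differences
d <- store_a - store_b
one_sample <- t.test(d, conf.level = 0.95)

# Approach 2: paired t-test on the two vectors
paired <- t.test(store_a, store_b, conf.level = 0.95, paired = TRUE)

# Same interval and same t statistic either way
all.equal(as.numeric(one_sample$conf.int), as.numeric(paired$conf.int))
all.equal(unname(one_sample$statistic), unname(paired$statistic))
```

The pairing matters: leaving out paired = TRUE would run an ordinary two-sample test, which ignores the item-by-item matching.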

Conclusion

  • Here, 0 is in the confidence interval, so we cannot conclude there is a statistically significant difference in prices between Target and Walmart.

Analysis as before

  • As we did earlier, let's remove any extreme outliers.
  • Here is a box plot of the differences in prices, so we can see where to set a cutoff for outliers.
boxplot(Groceries$Target-Groceries$Walmart)

[Figure: box plot of the price differences, Target minus Walmart]

Remove outlier(s)

  • I'll remove any data points where the difference in price between Target and Walmart is more than $2.

  • The which command returns the index of each data point that meets the condition. If we pass the whole which command into the new vector construction, we don't need to know the particular index values.

#which(abs(DiffInPrice)>2)
ModifiedTarget <- Groceries$Target[-which(abs(DiffInPrice)>2)]
ModifiedWalmart <- Groceries$Walmart[-which(abs(DiffInPrice)>2)]
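
One caution about the -which() idiom: if no point exceeds the cutoff, which() returns an empty vector, and the negative index then drops every element instead of none. Here is a sketch of the pitfall and of a logical-index alternative, using a small made-up vector d (hypothetical values):

```r
# Hypothetical differences with no value beyond the $2 cutoff
d <- c(0.10, -0.25, 0.40, 1.50)

# Pitfall: which() finds nothing, so -which() selects nothing at all
length(d[-which(abs(d) > 2)])   # 0, every point was dropped

# Safer: keep the points that satisfy the condition
keep <- abs(d) <= 2
d[keep]                         # all four values survive
```

With the grocery data, the same idea would read Groceries$Target[abs(DiffInPrice) <= 2]. The approach used above works here only because at least one difference does exceed $2.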

Re-run t-test

t.test(ModifiedTarget, ModifiedWalmart, conf.level = .95, paired=T)

    Paired t-test

data:  ModifiedTarget and ModifiedWalmart
t = 2.2038, df = 28, p-value = 0.03593
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.01098743 0.30073671
sample estimates:
mean of the differences 
              0.1558621 

Conclusion

  • With the outlier removed (it may be a recording error, for example), the confidence interval no longer contains 0, and we have (weak) evidence of a difference in prices.
  • Why weak?
    • First, zero just barely missed being in the interval.
    • Second, the associated \( p \)-value (0.036) is significant at the 0.05 level but not at the 0.01 level.
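
To see the second point concretely, the interval can be rebuilt from the summary numbers in the output above (mean difference 0.1558621, t = 2.2038, df = 28). This is a sketch from rounded values, so the last digits may differ slightly from what t.test() would report on the data itself:

```r
# Summary values copied from the paired t-test output above
mean_diff <- 0.1558621
t_stat    <- 2.2038
dof       <- 28

# Standard error recovered from t = mean / SE
se <- mean_diff / t_stat

ci95 <- mean_diff + c(-1, 1) * qt(0.975, dof) * se
ci99 <- mean_diff + c(-1, 1) * qt(0.995, dof) * se

ci95   # lower end just above 0
ci99   # lower end below 0, so 0 is inside the 99% interval
```

Asking for more confidence widens the interval enough to include 0, which matches the p-value falling between 0.01 and 0.05.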

Confidence Intervals for proportions

  • If we want a confidence interval for a proportion (e.g., the proportion of dry wells in a certain area), we can use a version of the Z-test
  • The assumption is that a normal distribution approximates a binomial distribution if \( n \) is large enough.
  • R performs this as a “black box” for us
  • The details are on pages 215-216
  • R also tries to mitigate error by using a continuity correction
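
To make the normal approximation less of a black box, here is a minimal sketch of the textbook large-sample interval \( \hat{p} \pm z\sqrt{\hat{p}(1-\hat{p})/n} \), using the counts 34 out of 350 from the exercise that follows. Note that prop.test() does not use exactly this formula for a single proportion (it inverts a score test and applies a continuity correction by default), so its interval will be close but not identical.

```r
x <- 34    # number of "successes"
n <- 350   # sample size
p_hat <- x / n

# Large-sample (Wald) 95% interval based on the normal approximation
z <- qnorm(0.975)
wald_ci <- p_hat + c(-1, 1) * z * sqrt(p_hat * (1 - p_hat) / n)

wald_ci
prop.test(x, n)$conf.int   # R's default interval, for comparison
```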

Example. Exercise 7.27

one proportion

  • Find a CI for the proportion of students who took the drug and broke out in hives
prop.test(34, 350)$conf
[1] 0.06913692 0.13429260
attr(,"conf.level")
[1] 0.95
  • Find a CI for the proportion of students who took the placebo and broke out in hives
prop.test(56, 350)$conf
[1] 0.1240384 0.2036158
attr(,"conf.level")
[1] 0.95
  • Notice the two intervals overlap.
  • So it is possible that the proportion is the same for both groups.

difference of proportions

  • Instead of estimating \( p \), we estimate \( p_{\text{drug}}-p_{\text{placebo}} \)
prop.test(c(34, 56), c(350, 350))$conf
[1] -0.11508783 -0.01062645
attr(,"conf.level")
[1] 0.95
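  • This time, 0 is not in the CI, so we conclude there is a significant effect.
  • These results look contradictory, but overlapping individual intervals do not rule out a significant difference.
  • We'll go with the second result because it estimates the difference directly and considers the data as a whole.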
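
The interval for the difference can also be reproduced by hand. The sketch below assumes the large-sample formula for a difference of proportions, with a continuity correction of \( 0.5\,(1/n_1 + 1/n_2) \) added to the margin of error. That appears to be what prop.test() does by default for these data, but treat it as an assumption to check against the numbers rather than a statement about R's internals:

```r
x <- c(34, 56)     # hives counts: drug, placebo
n <- c(350, 350)   # group sizes
p_hat <- x / n
delta <- p_hat[1] - p_hat[2]

# Margin of error: normal approximation plus continuity correction
se     <- sqrt(sum(p_hat * (1 - p_hat) / n))
margin <- qnorm(0.975) * se + 0.5 * sum(1 / n)

manual_ci <- delta + c(-1, 1) * margin
manual_ci

# Compare with the interval R reports
all.equal(manual_ci, as.numeric(prop.test(x, n)$conf.int))
```

The standard error of the difference is the square root of the sum of the two variances, which is smaller than the sum of the two individual standard errors. That is why two separate intervals can overlap while the interval for the difference excludes 0.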