Data Analysis in Social Science

“Coverage” for a confidence interval – Simulating the “globe tossing exercise”

Introduction: The US Geological Survey estimates the 29% of the Earth’s surface is covered with land. We can use this information and the binomial distribution to simulate our globe tossing excercise.

# Tossing the globe

# Tossing the globe is just another example of a binomial experiment. So
# we can set up a function (similar to what we did when studying the 
# sampling distribution for a proportion) that does the following:
#
# 1. Tosses the globe "n" times and calculates the proportion
#    of times we hit land.
# 2. Calculates the standard error
# 3. Calculates a confidence interval for a given value of "z"
#    (i.e., for a given confidence level).

# Here's our function:
globe_tossing_ci <- function(n, z, pie = .29){
  # Arguments:
  # n   = the sample size
  # z   = the standard normal variate for a given confidence level
  # pie = the population proportion (or probability)
  
  # Toss the globe "n" times and calculate proportion of land
  p <- rbinom(1, n, prob = pie)/n
  
  # Calculate the standard error
  se <- sqrt((p*(1-p))/n)
  
  # Calculate the confidence interval for a given "z"
  low <- p - z*se
  up <- p + z*se
  
  # Return the "point" estimate and confidence interval
  return(c(p, low, up))
}

Next, we can use the “replicate” function to repeatedly sample our globe_tossing_ci to get many different confidence intervals.

sims <- 100
cis <- as.data.frame(t(replicate(sims, globe_tossing_ci(n = 10000, z = 1.96))))
names(cis) <- c("p", "low", "up")

We have everything we need to calculate the “coverage” associated with our confidence interval:

cis$coverage <- ifelse(cis$low <= 0.29 & cis$up >= 0.29, 1, 0)
mean(cis$coverage)

## [1] 0.97

This last step is not necessary, but we can visualize coverage using ggplot2 with the following:

library(ggplot2)
cis$labels <- rownames(cis)
plt <- ggplot(cis, aes(x = labels, y = p)) +
  geom_hline(yintercept = 0.29, color = "red") +
  geom_pointrange(aes(ymax = up, ymin = low, color = coverage), size = 1, shape = 32) +
  geom_point(aes(color = coverage)) +
  scale_y_continuous(limits = c(.21,.38)) +
  ylab("Proportion Land\n") +
  xlab("") +
  coord_flip() +
  theme_bw() +
  theme(
    axis.text.y = element_blank(),
    legend.position="none"
  )
plt

Exercises

How to calculate confidence intervals using the z-distributions for proportion

Assume that we are interested in the proportion of lectures taking part in the upcomming strike. Assume further that we take a random sample of = 150 lectures and estimate the proportion participating to be p = 0.33 . Construct the 95% confience interval using the following formula: p +/- z * sqrt(p(1-p)/n)

Step 1: we start by calculating the standard error

 #sqrt(p(1-p)/n)
 se <- sqrt(0.33*(1 - 0.33)/150)

Step 2: plug in the standard error now the formula is: p +/- z * se, se is standard error The standard error of the sample mean depends on both the standard deviation and the sample size, 0.33 +/- z * se what is the value of z at 95% confidence interval? look in the table with z values, for 95% confidence interval, value of z = 1.96

0.33 - 1.96 * se

## [1] 0.2547503

0.33 + 1.96 * se

## [1] 0.4052497

[0.2547503, 0.4052497]

now, it’s your turn for “confidence levels” of 90 what is the value of z at 90% confidence interval? then replace in the formula

0.33 - 1.645 * se

## [1] 0.266844

0.33 + 1.645 * se

## [1] 0.393156

for “confidence levels” of 99 what is the value of z at 99% confidence interval? then replace in the formula

0.33 - 2.58 * se

## [1] 0.2309468

0.33 + 2.58 * se

## [1] 0.4290532

This was “by hand”. But we actually have functions in R that will calculate the confidence intervals: prop.test()

prop.test(x=50, n=150)

## 
##  1-sample proportions test with continuity correction
## 
## data:  50 out of 150, null probability 0.5
## X-squared = 16.007, df = 1, p-value = 6.312e-05
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.2598209 0.4155318
## sample estimates:
##         p 
## 0.3333333

Note: conf.level default is set at .95

How to calculate confidence intervals using the t-distribution for means

Assume we take a randomm sample of 1,000 students and measure how many hours of sleep they get on average over the course of a week. We find a mean of 8 hours and a standard deviation 2 hours. What is the 90% confidence interval for the mean hours slept?

set.seed(5678)
sleep <- rnorm(1000, mean = 8, sd = 2)

Calculate the sample mean. Note that it is not exactly 8

x_bar = mean(sleep)
x_bar

## [1] 8.008523

formula x bar +/- t * (s/sqrt(n)) Step 1: calculate the standard error

# s/sqrt(n)
 
se <-  sd(sleep)/(sqrt(1000))
x_bar

## [1] 8.008523

# formula is: x_bar +/- t * se

find the t value for a 90% confidence interval we do this using “qt” function

qt(c(.05, .95), df=999)

## [1] -1.64638  1.64638

we replace everything we know in the formula,

x_bar - 1.64638 * se

## [1] 7.903698

x_bar + 1.64638 * se

## [1] 8.113348

# [ 7.883581, 8.133466]

your turn, calculate the following: for “confidence levels” of 95

qt(c(.025, .975), df=999)

## [1] -1.962341  1.962341

x_bar - 1.96 * se

## [1] 7.88373

x_bar + 1.96 * se

## [1] 8.133317

for “confidence levels” of 99

qt(c(.005, .995), df=999)

## [1] -2.58076  2.58076

x_bar - 2.58 * se

## [1] 7.844255

x_bar + 2.58 * se

## [1] 8.172792

This was “by hand”. But we actually have functions in R that will calculate the confidence intervals for us: t.test

t.test(sleep,  conf.level = 0.95)

## 
##  One Sample t-test
## 
## data:  sleep
## t = 125.78, df = 999, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  7.883581 8.133466
## sample estimates:
## mean of x 
##  8.008523

[7.903698 8.113348]

Assume that we take a random sample of 30 students and ask them how many units of alcohol they drink in a typical week. We estimate a sample mean (x bar) of 15.27 and we know that the standard deviation (s) is 6 CALCULATE “BY HAND” in R first, then using prop.test() or t.test()

Step 1: calculate the standard error

Step 2: find the value associated with a 95% confidence interval and n = 30:

double check with the appropriate function

Theresa May. Assume that we take a random sample of 100 potential voters in the U.K. and ask them to rate their feelings toward Theresa May using a “feeling thermometer” that ranges from 0 (extreme negative feelings) to 100 (extreme positive feelings). Assume further that x bar = 45 and we know s = 15. Please construct the 95% confidence interval for feelings towards May. CALCULATE “BY HAND” in R first, then using prop.test() or t.test()

Step 1: calculate the standard error

Step 2: find the value associated with a 95% confidence interval and n = 100:

double check with the appropriate function

The end

Data Analysis in Social Science - lab 5

Raluca Popp

23 February 2018

“Coverage” for a confidence interval – Simulating the “globe tossing exercise”

Exercises

How to calculate confidence intervals using the z-distributions for proportion

How to calculate confidence intervals using the t-distribution for means