Part 1 - answer 5 to 8

In August of 2012, news outlets ranging from the Washington Post to the Huffington Post ran a story about the rise of atheism in America. The source for the story was a poll that asked people, “Irrespective of whether you attend a place of worship or not, would you say you are a religious person, not a religious person or a convinced atheist?”

Preliminary Questions

  1. In the first paragraph, several key findings are reported. Do these percentages appear to be sample statistics (derived from the data sample) or population parameters?

SOLUTION: These are sample statistics because they come from a poll (a sample). They are not known population parameters.

  1. The title of the report is “Global Index of Religiosity and Atheism”. To generalize the report’s findings to the global human population, what must we assume about the sampling method? Does that seem like a reasonable assumption?

SOLUTION: We must assume the survey in each country used a sampling method that is random/representative of that country’s population (and that countries are combined appropriately). This is only partly reasonable because sampling frames, nonresponse, and accessibility can differ by country.

The data

Turn your attention to Table 6 (pages 15 and 16), which reports the sample size and response percentages for all 57 countries. While this is a useful format to summarize the data, we will base our analysis on the original data set of individual responses to the survey. Load this data set into R with the following command.

load(url("http://www.openintro.org/stat/data/atheism.RData"))
  1. What does each row of Table 6 correspond to? What does each row of atheism correspond to?

SOLUTION: Each row of Table 6 summarizes one country (nationality) for a given year (with sample size and response percentages). Each row of atheism is one individual respondent’s answer, along with their country and year.

To investigate the link between these two ways of organizing this data, take a look at the estimated proportion of atheists in the United States. Towards the bottom of Table 6, we see that this is 5%. We can check this number using the atheism data by running the commands below. Make sure you understand what each of the commands below does after running it.

us12 <- filter(atheism, nationality == "United States", year == "2012")
tally(~ response, data=us12, format = "proportion")
## response
##     atheist non-atheist 
##   0.0499002   0.9500998

4 Using a similar series of commands, confirm the calculation of the proportion of atheist responses in our neighboring country of Canada. Does it agree with the percentage of 9% in Table 6?

SOLUTION:

ca12 <- filter(atheism, nationality == "Canada", year == "2012")
tally(~response, data = ca12, format = "proportion")
## response
##     atheist non-atheist 
##  0.08982036  0.91017964

The proportion in Canada in 2012 is about 9%, which agrees with Table 6.

Inference on proportions

The table 6 provides statistics, that is, calculations made from the sample of 51,927 people. What we’d like, though, is insight into the population parameters. You answer the question, “What proportion of people in your sample reported being atheists?” with a statistic; while the question “What proportion of people on earth would report being atheists” is answered with an estimate of the parameter.

A confidence interval

Here is how we’d compute a 95% confidence interval for the proportion of atheists in the United States in 2012.

# 95% CI for proportion of atheists in US (2012)
us95 <- prop.test(~response, data = us12, conf.level = 0.95)
us95$conf.int
## [1] 0.03761982 0.06574456
## attr(,"conf.level")
## [1] 0.95

5 Interpret this confidence interval in the context of the problem.

SOLUTION: We are 95% confident that the true proportion of U.S. adults in 2012 who would answer “convinced atheist” is between 0.038 and 0.066.

6 Write out the conditions for inference to construct a 95% confidence interval for the proportion of atheists in the United States in 2012. Are you confident all conditions are met?

Although formal confidence intervals don’t show up in the report, suggestions of inference appear at the bottom of page 7: “In general, the error margin for surveys of this kind is \(\pm\) 3-5% at 95% confidence”.

  1. Based on the R output, what is the margin of error for the estimate of the proportion of the proportion of atheists in US in 2012?

SOLUTION:

me95 <- (us95$conf.int[2] - us95$conf.int[1]) / 2
me95
## [1] 0.01406237

So the margin of error is about 0.014.

  1. Calculate a 90% confidence interval for the proportion of atheists in the United States in 2012. Does it make sense that this confidence interval would be wider or narrower than the 95% confidence interval we already calculated?

SOLUTION:

# 90% CI
us90 <- prop.test(~response, data = us12, conf.level = 0.90)
us90$conf.int
## [1] 0.03930392 0.06302676
## attr(,"conf.level")
## [1] 0.9
# compare widths
width95 <- diff(us95$conf.int)
width90 <- diff(us90$conf.int)
c(width90 = width90, width95 = width95)
##    width90    width95 
## 0.02372284 0.02812473

The 90% confidence interval is narrower than the 95% interval because it uses a smaller critical value.

Part 2

North Carolina births

In 2014, the state of North Carolina released a large data set containing information on births recorded in this state. This data set is useful to researchers studying the relation between habits and practices of expectant mothers and the birth of their children. We will work with a random sample of observations from this data set.

Exploratory analysis

Load the nc data set into our workspace.

download.file("http://www.openintro.org/stat/data/nc.RData", destfile = "nc.RData")
load("nc.RData")

We have observations on 13 different variables, some categorical and some numerical. The variable descriptions are given below.

variable description
fage father’s age in years.
mage mother’s age in years.
mature maturity status of mother.
weeks length of pregnancy in weeks.
premie whether the birth was classified as premature (premie) or full-term.
visits number of hospital visits during pregnancy.
marital whether mother is married or not married at birth.
gained weight gained by mother during pregnancy in pounds.
weight weight of the baby at birth in pounds.
lowbirthweight whether baby was classified as low birthweight (low) or not (not low).
gender gender of the baby, female or male.
habit status of the mother as a nonsmoker or a smoker.
whitemom whether mom is white or not white.

As a first step in the analysis, we should consider summaries of the data. This can be done using the summary command:

summary(nc)
##       fage            mage            mature        weeks             premie   
##  Min.   :14.00   Min.   :13   mature mom :133   Min.   :20.00   full term:846  
##  1st Qu.:25.00   1st Qu.:22   younger mom:867   1st Qu.:37.00   premie   :152  
##  Median :30.00   Median :27                     Median :39.00   NA's     :  2  
##  Mean   :30.26   Mean   :27                     Mean   :38.33                  
##  3rd Qu.:35.00   3rd Qu.:32                     3rd Qu.:40.00                  
##  Max.   :55.00   Max.   :50                     Max.   :45.00                  
##  NA's   :171                                    NA's   :2                      
##      visits            marital        gained          weight      
##  Min.   : 0.0   married    :386   Min.   : 0.00   Min.   : 1.000  
##  1st Qu.:10.0   not married:613   1st Qu.:20.00   1st Qu.: 6.380  
##  Median :12.0   NA's       :  1   Median :30.00   Median : 7.310  
##  Mean   :12.1                     Mean   :30.33   Mean   : 7.101  
##  3rd Qu.:15.0                     3rd Qu.:38.00   3rd Qu.: 8.060  
##  Max.   :30.0                     Max.   :85.00   Max.   :11.750  
##  NA's   :9                        NA's   :27                      
##  lowbirthweight    gender          habit          whitemom  
##  low    :111    female:503   nonsmoker:873   not white:284  
##  not low:889    male  :497   smoker   :126   white    :714  
##                              NA's     :  1   NA's     :  2  
##                                                             
##                                                             
##                                                             
## 

Plotting the data is a useful first step because it helps us quickly visualize trends, identify strong associations, and develop research questions.

Make a histogram of weeks, the length of each pregnancy in weeks.

histogram(~weeks, data=nc, nint=20)

Part 1 - Inference

The human pregnancies typically last 38 weeks. Use \(\alpha=0.05\) in each case.

Example 1:

Test the hypothesis that the mean duration of pregnancies is not equal to 38 weeks. What is your decision?

tstar <- qt(.975, df=997)
t.test(~ weeks, data=nc, alternative="two.sided", mu=38)
## 
##  One Sample t-test
## 
## data:  weeks
## t = 3.6065, df = 997, p-value = 0.0003258
## alternative hypothesis: true mean is not equal to 38
## 95 percent confidence interval:
##  38.15257 38.51677
## sample estimates:
## mean of x 
##  38.33467

Looking at p-value what is your decision?

Now look at the Confidence Interval and state your decision base on the CI. A two-sided hypothesis test at significance level alpha is equivalent to using a confidence interval at (1-alpha)*100% confidence level and looking to see if the hypothesized value is in the confidence interval. If it is, you cannot reject the null hypothesis. If it isn’t, you can reject the null.

Example 2:

Test the hypothesis that the mean duration of pregnancies is greater than 38 year. What is your decision?

m <- mean(~ weeks, data=nc, na.rm=TRUE)
s <- sd(~ weeks, data=nc, na.rm=TRUE)
tstar <- qt(.95, df=997)
t <- (m-38)/(s/sqrt(998))
t.test(~ weeks, data=nc, alternative="greater", mu=38)
## 
##  One Sample t-test
## 
## data:  weeks
## t = 3.6065, df = 997, p-value = 0.0001629
## alternative hypothesis: true mean is greater than 38
## 95 percent confidence interval:
##  38.18189      Inf
## sample estimates:
## mean of x 
##  38.33467

Now look at the Confidence Interval or rather LCB. What is your decision?

Note: If you don’t know how to calculate one sided bounds, you can use the regular CI (this is just for illustration, because we covered this material).For a one-sided test at significance level alpha you need a confidence interval at (1 - 2alpha)*100% and again check for the hypothesized value in the confidence interval. If it is in the interval you cannot reject the null. In addition, the confidence interval must either be entirely below or above the hypothesized value (depending on situation) to reject the null. You can also obtain a 1-sided confidence interval (in fact called a ‘confidence bound’- this is what we use) - R does this automatically, as seen. What is your decision?

# Method 1
t.test(~weeks,data=nc,
       alternative = "two.sided",
       mu = 38,
       conf.level = 0.90)
## 
##  One Sample t-test
## 
## data:  weeks
## t = 3.6065, df = 997, p-value = 0.0003258
## alternative hypothesis: true mean is not equal to 38
## 90 percent confidence interval:
##  38.18189 38.48745
## sample estimates:
## mean of x 
##  38.33467
# Method 2
m <- mean(~ weeks, data=nc, na.rm=TRUE)
s <- sd(~ weeks, data=nc, na.rm=TRUE)
tstar1 <- qt(.95, df=997)
SE=s/sqrt(998)

c(m -tstar1*SE, m+tstar*SE)  
## [1] 38.18189 38.48745
m +c(-1,1)*tstar1*SE
## [1] 38.18189 38.48745

Example 3:

Test the hypothesis that the mean duration of pregnancies is less than 38 weeks. What is your decision?

tstar <- qt(.05, df=997)
t.test(~ weeks, data=nc, alternative="less", mu=38)
## 
##  One Sample t-test
## 
## data:  weeks
## t = 3.6065, df = 997, p-value = 0.9998
## alternative hypothesis: true mean is less than 38
## 95 percent confidence interval:
##      -Inf 38.48745
## sample estimates:
## mean of x 
##  38.33467

Question 1 - On your own

Use \(\alpha=0.1\) in each case.

  1. Test the hypothesis that the mean age of a mother is less than 27 years old. What is your decision?

SOLUTION: (See the t-test/CI output below.) Interpret the interval as: we are 95% confident the true mean mother’s age in the population is between the lower and upper bounds.

age_ci <- t.test(~mage, data = nc, conf.level = 0.95)
age_ci$conf.int
## [1] 26.61442 27.38558
## attr(,"conf.level")
## [1] 0.95
  1. Calculate a 95% confidence interval for the average age of a mother and interpret it in context.

SOLUTION: Use the p-value from the t-test below. If p < 0.10, reject H0 and conclude the mean mother’s age is less than 27; otherwise, fail to reject.

age_test <- t.test(~mage, data = nc, alternative = "less", mu = 27)
age_test
## 
##  One Sample t-test
## 
## data:  mage
## t = 0, df = 999, p-value = 0.5
## alternative hypothesis: true mean is less than 27
## 95 percent confidence interval:
##     -Inf 27.3235
## sample estimates:
## mean of x 
##        27

Part 2. Test Population Proportions and Counts

Testing one sample proportion to population value - z test for one sample proportion

Example 4: Birth rate for boys in hospital We know that 51.7% of babies born are male in the population. We observed that 313 boys were born to 550 singleton deliveries in one hospital Is this different that would be expected by chance?

y <- 313; n <- 550; phat <- y/n; phat
## [1] 0.5690909
nullp <- 0.517
sdp <- sqrt(nullp*(1-nullp)/n); sdp
## [1] 0.02130775
onesidep <- 1-pnorm(phat, mean=nullp, sd=sdp); onesidep
## [1] 0.007248761
twosidep <- 2*onesidep; twosidep
## [1] 0.01449752

or we can carry out the exact test (not described by the book):

binom.test(y, n, p=nullp)
## 
## 
## 
## data:  y out of 550
## number of successes = 313, number of trials = 550, p-value = 0.01499
## alternative hypothesis: true probability of success is not equal to 0.517
## 95 percent confidence interval:
##  0.5265178 0.6109179
## sample estimates:
## probability of success 
##              0.5690909

What can we conclude for the above example ?

Testing for a difference in proportions - Two sample z-test for a difference in proportions

Example 5: Use data from the NYC Maternal Infant HIV Transmission Study We are given two qualitative variables (AZT use & Transmission) for a prospective study of HIV transmission to infants among 321 mothers. Of the 47 women on AZT, 6 transmitted and of the 274 mothers who did not take AZT, 64 transmitted.

n1 <- 47; y1 <- 6
n2 <- 274; y2 <- 64
ppooled <- (y1+y2)/(n1+n2); ppooled
## [1] 0.2180685
sepooled <- sqrt(ppooled*(1-ppooled)/n1 + ppooled*(1-ppooled)/n2); sepooled
## [1] 0.06519423
z <- (y1/n1 - y2/n2)/sepooled; z
## [1] -1.624639
pval <- 2*(1-pnorm(z, lower.tail = FALSE)); pval
## [1] 0.1042396

Question 2 - On your own

  1. Was AZT effective? Based on what information? Make sure that you read the correct p-value.

SOLUTION: AZT looks beneficial because the observed transmission rate is lower with AZT (6/47 ≈ 13%) than without AZT (64/274 ≈ 23%). The computed z is negative.

The printed pval in the example is two‑sided (about 0.104). For the one‑sided alternative “AZT lowers transmission,” the p‑value is about 0.052 (= 0.104/2), which is not quite significant at 5% but would be significant at 10%.

In the previous example of MTC HIV transmission, we had counts. If instead, we had counts and proportions we could use this code to calculate the standard error of the difference and create a confidence interval.

n1 <- 47; p1 <- 0.13
n2 <- 274; p2 <- 0.23
sediff <- sqrt(p1*(1-p1)/n1 + p2*(1-p2)/n2); sediff
## [1] 0.05525155
(p2 - p1) + c(-1.96, 1.96)*sediff
## [1] -0.00829303  0.20829303
  1. Was AZT effective?

SOLUTION: Using the 95% confidence interval for (p2 − p1) shown above, the interval includes 0, so at the 5% level we do not have strong evidence of a difference. However, the point estimate suggests AZT may reduce transmission.

Part 3. Test Population Menas

Example 6: Acquire the WNBA & NBA datafile

bball_file <- if (file.exists("Basket-7.csv")) "Basket-7.csv" else "http://pmatheson.people.amherst.edu/Basket.csv"
bball <- read.csv(bball_file)
glimpse(bball)
## Rows: 24
## Columns: 5
## $ X        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
## $ PLAYER   <chr> "Cynthia Cooper         ", "Ruthie Bolton-Holifield", "Lisa L…
## $ GENDER   <chr> "Female", "Female", "Female", "Female", "Female", "Female", "…
## $ HEIGHTIN <int> 70, 69, 77, 74, 75, 70, 70, 75, 74, 68, 72, 75, 78, 81, 85, 7…
## $ WEIGHTLB <int> 150, 150, 170, 165, 180, 158, 145, 178, 178, 132, 172, 185, 2…

Take a peek at the data

names (bball)
## [1] "X"        "PLAYER"   "GENDER"   "HEIGHTIN" "WEIGHTLB"
head (bball)
##   X                  PLAYER GENDER HEIGHTIN WEIGHTLB
## 1 1 Cynthia Cooper          Female       70      150
## 2 2 Ruthie Bolton-Holifield Female       69      150
## 3 3 Lisa Leslie             Female       77      170
## 4 4 Wendy Palmer            Female       74      165
## 5 5 Jennifer Gillom         Female       75      180
## 6 6 Andrea Stinson          Female       70      158

Look more closely at the variables

favstats (~HEIGHTIN |GENDER, data=bball)
##   GENDER min    Q1 median Q3 max     mean       sd  n missing
## 1 Female  68 70.00     73 75  77 72.41667 2.937480 12       0
## 2   Male  72 78.75     80 82  85 79.91667 3.423404 12       0
favstats (~WEIGHTLB |GENDER, data=bball)
##   GENDER min     Q1 median    Q3 max     mean       sd  n missing
## 1 Female 132 150.00  167.5 178.0 185 163.5833 16.51698 12       0
## 2   Male 165 215.75  220.0 242.5 256 221.5833 26.58933 12       0
histogram (~WEIGHTLB |GENDER, data=bball)

histogram (~HEIGHTIN |GENDER, data=bball)

Comparing one group mean to population - a one sample t-test

Example 7: Are Women in WNBA taller than US women in general? To answer this question we need to compare the sample of WNBA heights to NHANES mean height which has population values of: (mu=63.75, sd=3.423)

mu=63.75
sd=3.423
womenonlyds <- filter(bball, GENDER=="Female")
xpnorm(c(mu-3*sd, mu-2*sd, mu-sd, mu+sd, mu+2*sd, mu+3*sd), mean=mu, sd=sd)

## [1] 0.001349898 0.022750132 0.158655254 0.841344746 0.977249868 0.998650102
t.test(womenonlyds$HEIGHTIN, alternative="greater", mu=63.75, data=womenonlyds)
## 
##  One Sample t-test
## 
## data:  womenonlyds$HEIGHTIN
## t = 10.22, df = 11, p-value = 2.972e-07
## alternative hypothesis: true mean is greater than 63.75
## 95 percent confidence interval:
##  70.8938     Inf
## sample estimates:
## mean of x 
##  72.41667

Question 3 - On your own

What do we conclude? Are women playing professional basketball taller than American women?

SOLUTION: From the one‑sample t‑test output (female players vs μ = 63.75 inches), the p‑value is extremely small, so we reject H0 and conclude WNBA players are taller on average than U.S. women.

Comparing two groups to each other - a two Sample independent t-test

If people who play basketball tend to be exceptionally tall would we expect women playing professional basketball to be as tall as the men?

Using the bball data we can ask if women in WNBA different in terms of men in NBA?

t.test(HEIGHTIN ~ GENDER, data=bball) # Unpooled
## 
##  Welch Two Sample t-test
## 
## data:  HEIGHTIN by GENDER
## t = -5.7595, df = 21.504, p-value = 9.339e-06
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  -10.204201  -4.795799
## sample estimates:
## mean in group Female   mean in group Male 
##             72.41667             79.91667
t.test(HEIGHTIN ~ GENDER, var.equal=TRUE, data=bball)   # Pooled
## 
##  Two Sample t-test
## 
## data:  HEIGHTIN by GENDER
## t = -5.7595, df = 22, p-value = 8.564e-06
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  -10.200583  -4.799417
## sample estimates:
## mean in group Female   mean in group Male 
##             72.41667             79.91667
bwplot(GENDER ~ HEIGHTIN, data=bball)

(We went over 2 independent t-tests , one assumes equal variance like the test above. If you don’t include var.equal=TRUE you will have Welch test )

Question 4 - On your own

Are there difference in the heights of men and women professional basketball players?

SOLUTION: From the two‑sample t‑test output (HEIGHTIN by GENDER), the p‑value is extremely small, so we conclude there is a difference in average height; men are taller on average than women among professional basketball players.

Comparing Groups when data is paired - Dependent t-test using difference scores

Example 8:

This is a special case of data.

We are not comparing the mean of one group vs. another. We have the same group of people over time with matched data. Paired data can also come from different people (twins, siblings, etc.), the unique aspect is that two numbers belong together.

For this analysis we will be using data from National Education Longitudinal Study (NELS). A nationally representative sample of eighth-graders were first surveyed in the spring of 1988. A sample of these respondents were then resurveyed through four follow-ups in 1990, 1992, 1994, and 2000. On the questionnaire, students reported on a range of topics including: school, work, and home experiences; educational resources and support; the role in education of their parents and peers; neighborhood characteristics; educational and occupational aspirations; and other student perceptions. Additional topics included self-reports on smoking, alcohol and drug use and extracurricular activities. For the three in-school waves of data collection (when most were eighth-graders, sophomores, or seniors), achievement tests in reading, social studies, mathematics and science were administered in addition to the student questionnaire.

Question: Many middle schoolers say their grades don’t matter and they’ll work harder in high school. We can answer this question using data from NELS. For this example we’ll operationally define performance as reading achievement.

Create difference scores between reading achievement scores in 8th and 10th grade for each participant.

nels_file <- if (file.exists("NELS-5.csv")) "NELS-5.csv" else "http://pmatheson.people.amherst.edu/NELS.csv"
educ <- read.csv(nels_file)
educ <- mutate(educ, diff = ACHRDG10 - ACHRDG08); 

favstats (~ACHRDG08, data=educ)
##    min      Q1 median      Q3   max     mean       sd   n missing
##  35.74 49.9875 56.445 63.1325 70.55 56.04906 8.829726 500       0
favstats (~ACHRDG10, data=educ)
##    min      Q1 median      Q3  max     mean       sd   n missing
##  31.53 49.6175 57.545 62.9475 68.8 56.11404 8.304459 500       0
favstats (~diff, data=educ)
##     min      Q1 median     Q3   max    mean       sd   n missing
##  -23.79 -3.7575  0.525 4.2625 15.11 0.06498 6.031412 500       0
densityplot (~diff, data=educ)

histogram (~diff, data=educ)

Question 5 - On your own

Create a confidence interval for the difference and perform the correct hypothesis test - you are back to one - mean t test , with variable diff. Did reading achievement scores differ over time (between 8th and 10th grade)? If so, why? If not, why not?

SOLUTION: (See the paired t‑test output below.) If the 95% CI for the mean difference excludes 0 and the p‑value is small, conclude reading achievement changed between 8th and 10th grade. These measurements are dependent because they are from the same students measured twice.

# CI and test for mean difference (paired via difference scores)
if (!"diff" %in% names(educ)) {
  educ <- mutate(educ, diff = ACHRDG10 - ACHRDG08)
}
t.test(educ$diff, mu = 0, conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  educ$diff
## t = 0.2409, df = 499, p-value = 0.8097
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -0.4649723  0.5949323
## sample estimates:
## mean of x 
##   0.06498