WPA #6: Chapter 11 - One and Two Sample Null Hypothesis Tests

For this WPA, we will use a dataset stored in a tab-separated text file called club.txt. This dataset contains results from a (fictional) survey of 300 attendees at one of three clubs in Konstanz on 8 December 2015. Each row represents one person.

Use the following code chunk to download the tab-separated text file club.txt from http://nathanieldphillips.com/wp-content/uploads/2015/12/club.txt and store it in an object called club.df. Make sure to include the full code in your Markdown document or it may not knit!
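The code chunk itself did not survive in this copy of the assignment. The following is a minimal sketch of what such a chunk could look like. The `header` and `stringsAsFactors` settings and the `tryCatch()` wrapper are assumptions, not part of the original.

```r
# URL given in the assignment text
club.url <- "http://nathanieldphillips.com/wp-content/uploads/2015/12/club.txt"

# Assumptions: the first row holds the column names (header = TRUE) and text
# columns are kept as character strings (stringsAsFactors = FALSE)
club.df <- tryCatch(
  read.table(file = club.url, sep = "\t", header = TRUE,
             stringsAsFactors = FALSE),
  error = function(e) {
    # Report the problem instead of stopping; club.df is NULL in that case
    message("Could not download club.txt: ", conditionMessage(e))
    NULL
  }
)
```

If the download fails, club.df is NULL and the later chunks will not run, so check the message before moving on.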

APA (American Pirate Association) format

The American Pirate Association has strict rules for how to display the result of hypothesis tests. Here are the formats for the three tests you will conduct in this WPA:

t-test: t(df) = XXX, p = YYY.
- Ex) t(9) = 2.50, p = .02
correlation test: r(df) = XXX, p = YYY.
- Ex) r(98) = .12, p = .12
chi-square test: X2(df) = XXX, p = YYY.
- Ex) X2(4) = 2.40, p = .66
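As a rough illustration of these formats, the sketch below builds two of the example strings with base R. The helper names `fmt.p` and `apa.line` are hypothetical and are not part of the assignment or the yarrr package.

```r
# Hypothetical helpers (not part of the assignment or the yarrr package)

# Format a p-value to two decimals and drop the leading zero
fmt.p <- function(p) sub("^0", "", sprintf("%.2f", p))

# Assemble "stat(df) = value, p = p-value"
apa.line <- function(stat, df, value, p) {
  paste0(stat, "(", df, ") = ", sprintf("%.2f", value), ", p = ", fmt.p(p))
}

apa.t   <- apa.line("t", 9, 2.5, 0.02)
apa.chi <- apa.line("X2", 4, 2.4, 0.66)

print(apa.t)    # "t(9) = 2.50, p = .02"
print(apa.chi)  # "X2(4) = 2.40, p = .66"
```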

A friend of mine is convinced that all women are secretly Werewolves. Seriously. Here’s his claim: the longer he’s at a club, the fewer women he sees. Why do women (werewolves) leave so early? He claims that they are drawn to the moonlight outside, so they can’t help but leave (him) before the night is over. Let’s test his claim using our data.

Create a plot (e.g. boxplot or beanplot) showing the distribution of club times for males and females

with(club.df, boxplot(time ~ gender))

Using grouped aggregation (e.g., aggregate or dplyr), calculate the mean number of minutes that men and women stayed at the club(s).

with(club.df, aggregate(time ~ gender, 
                        FUN = mean))

##   gender     time
## 1      F 134.4167
## 2      M 136.7292

Conduct a two-tailed t-test testing whether or not there is a significant difference in the amount of time women and men spend at clubs. Save the result as an object called q1.test

q1.test <- with(club.df, t.test(time ~ gender))
# or
q1.test <- t.test(time ~ gender, 
                  data = club.df)

Write your conclusion in APA format. Be sure to address my friend’s claim that women are werewolves.

q1.test

## 
##  Welch Two Sample t-test
## 
## data:  time by gender
## t = -0.38152, df = 297.55, p-value = 0.7031
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -14.240836   9.615836
## sample estimates:
## mean in group F mean in group M 
##        134.4167        136.7292

# Women and men do not differ significantly in how long they stay at clubs, t(297.55) = -0.38, p = .70. The data give no support for the claim that women (werewolves) leave earlier.
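The numbers in an APA-style sentence can be pulled straight out of the test object rather than copied by hand. The sketch below shows this on simulated data, because it has to run without club.df. The data frame `demo.df` and its values are made up for illustration.

```r
# Simulated stand-in for club.df (made-up numbers, for illustration only)
set.seed(1)
demo.df <- data.frame(
  gender = rep(c("F", "M"), each = 50),
  time   = rnorm(100, mean = 135, sd = 50)
)

demo.test <- t.test(time ~ gender, data = demo.df)

# The pieces needed for "t(df) = XXX, p = YYY"
t.value <- unname(demo.test$statistic)  # t statistic
df      <- unname(demo.test$parameter)  # Welch degrees of freedom
p.value <- demo.test$p.value            # two-tailed p-value

apa.text <- paste0("t(", round(df, 2), ") = ", round(t.value, 2),
                   ", p = ", round(p.value, 2))
print(apa.text)
```

The same three elements (`$statistic`, `$parameter`, `$p.value`) are available on q1.test.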

Do the results change if you only look at people who were at the Blechnerei? Using only the Blechnerei data, repeat the test and write your conclusion in APA format (Hint: Use subset()!)

q1b.test <- with(subset(club.df, club == "Blechnerei"), t.test(time ~ gender))

# or

q1b.test <- t.test(time ~ gender, 
                   data = club.df, 
                   subset = club == "Blechnerei")

q1b.test

## 
##  Welch Two Sample t-test
## 
## data:  time by gender
## t = 0.062752, df = 104.1, p-value = 0.9501
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -20.29240  21.61866
## sample estimates:
## mean in group F mean in group M 
##        140.9180        140.2549

# Even looking only at people who went to Blechnerei, there is no significant difference between males and females, t(104.1) = 0.06, p = .95
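The two ways of restricting the data shown above should give the same test. Here is a small check on simulated data; `demo.df` and its values are made up and stand in for club.df.

```r
# Simulated stand-in for club.df (made-up numbers, for illustration only)
set.seed(2)
demo.df <- data.frame(
  club   = rep(c("Blechnerei", "Kantine"), each = 40),
  gender = rep(c("F", "M"), times = 40),
  time   = rnorm(80, mean = 135, sd = 50)
)

# Form 1: subset the data frame first, then run the test
test.a <- with(subset(demo.df, club == "Blechnerei"), t.test(time ~ gender))

# Form 2: let t.test() do the subsetting through its subset argument
test.b <- t.test(time ~ gender, data = demo.df, subset = club == "Blechnerei")

# Both forms use the same rows, so the statistics agree
same.result <- all.equal(unname(test.a$statistic), unname(test.b$statistic))
print(same.result)
```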

Another friend has other club-related ideas. According to her, if you’re looking to meet a nice lady or gentleman at the club, you should definitely have a few drinks to help you loosen up. Do our data support her claim? Test this by answering the question: Do people who did not leave alone tend to drink more or less than people who did leave alone?

Create a plot (e.g. boxplot or beanplot) showing the distribution of drinks for people that did and did not leave alone

with(club.df, boxplot(drinks ~ leavealone))

Using grouped aggregation (e.g., aggregate or dplyr), calculate the mean number of drinks people had when they went home alone or not alone.

with(club.df, aggregate(drinks ~ leavealone, 
                        FUN = mean))

##   leavealone   drinks
## 1          0 3.577465
## 2          1 4.117904

Conduct a two-tailed t-test testing whether or not there is a significant difference in the amount of drinks people had when they went home alone versus not alone. Save the result as an object called q2.test

q2.test <- with(club.df, t.test(drinks ~ leavealone))
# OR
q2.test <- t.test(drinks ~ leavealone, 
                  data = club.df)

Write your conclusion in APA format.

q2.test

## 
##  Welch Two Sample t-test
## 
## data:  drinks by leavealone
## t = -2.6253, df = 121.18, p-value = 0.009772
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.9479793 -0.1328990
## sample estimates:
## mean in group 0 mean in group 1 
##        3.577465        4.117904

# People who leave alone tend to have more drinks than those who do not leave alone, t(121.18) = -2.63, p < .01

Do the results change if you ignore Males and only test Females? Using only the Female data, repeat the test and write your conclusion in APA format (Hint: Use subset()!)

q2b.test <- with(subset(club.df, gender == "F"), t.test(drinks ~ leavealone))
# OR
q2b.test <- t.test(drinks ~ leavealone, 
                   data = club.df, 
                   subset = gender == "F")
q2b.test

## 
##  Welch Two Sample t-test
## 
## data:  drinks by leavealone
## t = -1.3791, df = 53.466, p-value = 0.1736
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.9844944  0.1821801
## sample estimates:
## mean in group 0 mean in group 1 
##        3.352941        3.754098

# When we only look at females, there is no significant relationship between the number of drinks and leaving alone, t(53.47) = -1.38, p = .17

In a later chapter, we’ll learn how to write custom functions that make a lot of your programming life much easier. For example, you can write a custom function that takes a t.test object as an input, and spits out an APA style conclusion as an output! In this question, we’ll create a function called ‘apa’ that does just that:

First, load the function into R. You can do this in one of two ways. Either re-download the yarrr package (I uploaded the package online earlier today) or execute all the code in the following chunk.

apa <- function(test.object, tails = 2, sig.digits = 2, p.lb = .01) {

  statistic.id <- substr(names(test.object$statistic), start = 1, stop = 1)
  p.value <- test.object$p.value

  if(tails == 1) {p.value <- p.value / 2}

  # Use if/else so that a p-value exactly equal to p.lb is still displayed
  if (p.value < p.lb) {
    p.display <- paste("p < ", p.lb, " (", tails, "-tailed)", sep = "")
  } else {
    p.display <- paste("p = ", round(p.value, sig.digits), " (", tails, "-tailed)", sep = "")
  }


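  # Defaults, so that test types without a dedicated branch below still return a string
  add.par <- ""
  estimate.display <- ""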

  if(grepl("product-moment", test.object$method)) {

    estimate.display <- paste("r = ", round(test.object$estimate, sig.digits), ", ", sep = "")

  }

  if(grepl("Chi", test.object$method)) {

    estimate.display <- ""

    add.par <- paste(", N = ", sum(test.object$observed), sep = "")

  }

  if(grepl("One Sample t-test", test.object$method)) {

    estimate.display <- paste("mean = ", round(test.object$estimate, sig.digits), ", ", sep = "")

  }

  if(grepl("Two Sample t-test", test.object$method)) {

    estimate.display <- paste("mean difference = ", round(test.object$estimate[2] - test.object$estimate[1], sig.digits), ", ", sep = "")

  }




  return(paste(
    estimate.display,
    statistic.id,
    "(",
               round(test.object$parameter, sig.digits),
               add.par,
               ") = ",
               round(test.object$statistic, sig.digits),
               ", ",
               p.display,
               sep = ""
  )
  )

}

Now, try the function on your previous test results from Q1 and Q2 by executing the following two lines of code.

apa(q1.test)

## [1] "mean difference = 2.31, t(297.55) = -0.38, p = 0.7 (2-tailed)"

apa(q2.test)

## [1] "mean difference = 0.54, t(121.18) = -2.63, p < 0.01 (2-tailed)"

Do the results match what you wrote down for your answers to questions 1 and 2?

Yep!
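The apa function halves the p-value when tails = 1. As a sketch of why that works, and when it does not, the example below compares a two-tailed test with a one-tailed test on simulated data. The vectors `x` and `y` are made up. Halving only matches the one-tailed result when the observed difference lies in the predicted direction.

```r
# Simulated data (made-up numbers, for illustration only)
set.seed(3)
x <- rnorm(30, mean = 0)
y <- rnorm(30, mean = 1)   # y is larger on average, so t for x - y is negative

two.tailed <- t.test(x, y)                        # default: two-sided
one.tailed <- t.test(x, y, alternative = "less")  # H1: mean(x) < mean(y)

# When the observed difference is in the predicted direction,
# the one-tailed p-value is half the two-tailed one
halved <- two.tailed$p.value / 2
print(c(one.tailed = one.tailed$p.value, halved = halved))
```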

Yet another friend of mine has some claims about club life. According to her, the main reason people drink at clubs isn’t to loosen up, it’s to stay awake! Is she right? Is there a relationship between the number of drinks a person has and how long they stay at the club?

Create a plot (e.g. scatterplot) showing the relationship between drinks and time.

with(club.df, plot(drinks, time))

Using grouped aggregation (e.g., aggregate or dplyr), calculate the mean number of minutes that people stay at the club for each drink amount.

with(club.df, aggregate(time ~ drinks, 
                        FUN = mean))

##    drinks      time
## 1       0  85.40000
## 2       1 115.84615
## 3       2  97.03226
## 4       3 129.49123
## 5       4 136.85542
## 6       5 144.95522
## 7       6 155.31034
## 8       7 174.63636
## 9       8 194.00000
## 10      9 258.00000

Is the relationship significant? Conduct a correlation test and save the result as an object called q4.test

q4.test <- with(club.df, cor.test(time, drinks))

#OR

q4.test <- cor.test(~ time + drinks, 
                    data = club.df)

Write your result in APA format.

q4.test

## 
##  Pearson's product-moment correlation
## 
## data:  time and drinks
## t = 6.6984, df = 298, p-value = 1.05e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2591255 0.4562998
## sample estimates:
##       cor 
## 0.3617512

# There is a significant positive correlation between the number of drinks a person has and how long they spend at the club, r(298) = .36, p < .01
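cor.test() prints a t statistic, while the APA format for a correlation asks for r. The two are linked by t = r * sqrt(df) / sqrt(1 - r^2). The sketch below checks this with the r and df copied from the q4.test output; the result should land close to the printed t of 6.6984.

```r
# Values copied from the q4.test output above
r  <- 0.3617512
df <- 298

# Convert the correlation into the t statistic that cor.test() reports
t.from.r <- r * sqrt(df) / sqrt(1 - r^2)
print(round(t.from.r, 2))

# Two-tailed p-value from the t distribution with df degrees of freedom
p.from.t <- 2 * pt(-abs(t.from.r), df)
print(p.from.t)
```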

Repeat the test but only for females at Blechnerei. Do you get the same conclusion? Write the results of this test in APA format

q4b.test <- with(subset(club.df, gender == "F" & 
                          club == "Blechnerei"), cor.test(time, drinks))

#OR

q4b.test <- cor.test(~ time + drinks, 
                    data = club.df,
                    subset = gender == "F" & club == "Blechnerei"
                    )

q4b.test

## 
##  Pearson's product-moment correlation
## 
## data:  time and drinks
## t = 2.7597, df = 59, p-value = 0.007695
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.09433171 0.54365162
## sample estimates:
##       cor 
## 0.3381205

# The results are very similar: there is again a significant positive correlation, r(59) = .34, p < .01

I don’t know about Germany, but in the US, we refer to clubs with mostly guys as “sausage fests.” Maybe in Germany you’d call it a Wurstfest. Let’s see if any of the clubs were a Wurstfest on this day.

What is the percentage of Males in each club? (Hint: Calculate a new binary variable called gender.log with 1 meaning Male and 0 meaning Female. Then, use grouped aggregation to calculate the percentage of 1s in gender.log for each club)

club.df$gender.log <- club.df$gender == "M"

genderagg <- with(club.df, aggregate(gender.log ~ club, 
                                     FUN = mean))
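Taking the mean of gender.log works because R treats TRUE as 1 and FALSE as 0 in arithmetic. Here is a tiny sketch with a made-up vector:

```r
# Made-up example: TRUE marks a male guest
is.male <- c(TRUE, FALSE, TRUE, TRUE)

# Logical values count as 1 (TRUE) and 0 (FALSE),
# so the mean is the proportion of TRUEs
prop.male <- mean(is.male)
pct.male  <- 100 * prop.male   # the same value expressed as a percentage

print(prop.male)  # 0.75
print(pct.male)   # 75
```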

Plot the results using a barplot. Set the height argument to be the percentage of males in each club, and set the names argument to be the names of the clubs. (For bonus points, set the color of the bars for any Wurstfests to be “royalblue3”).

If you’re not familiar with barplots, here’s an example of how to use it:

barplot(height = c(5, 3, 6, 3, 1),
        names = 1:5,
        col = "white"
        )

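# Label each bar with its club name; the colors are set by hand,
# with the third bar marked as the Wurstfest
barplot(height = genderagg$gender.log, 
        names = genderagg$club, 
        col = c("gray", "gray", "royalblue3"))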
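Instead of typing the bar colors by hand, they can be derived from the data. The sketch below uses made-up proportions and a hypothetical cut-off of 60% males. Neither the cut-off nor the names `demo.agg` and `wurst.cutoff` come from the assignment.

```r
# Made-up proportions of males, one per club (illustration only)
demo.agg <- data.frame(
  club       = c("Club A", "Club B", "Club C"),
  gender.log = c(0.45, 0.52, 0.70)
)

# Hypothetical cut-off: call a club a Wurstfest if more than 60% of guests are male
wurst.cutoff <- 0.60

# One color per bar, chosen from the proportion in that row
bar.cols <- ifelse(demo.agg$gender.log > wurst.cutoff, "royalblue3", "gray")
print(bar.cols)
```

The resulting vector can be passed to barplot() through its col argument.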

Is there a significant relationship between clubs and gender? Answer this using a chi-square test. Run the test and save the result in an object called q5.test

q5.test <- with(club.df, chisq.test(club, gender))

What is your conclusion in APA format?

q5.test

## 
##  Pearson's Chi-squared test
## 
## data:  club and gender
## X-squared = 13.74, df = 2, p-value = 0.001038

apa(q5.test)

## [1] "X(2, N = 300) = 13.74, p < 0.01 (2-tailed)"

# There is a significant relationship between clubs and gender, X(2, N = 300) = 13.74, p < .01 (2-tailed)
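To see where the chi-square statistic comes from, the sketch below computes it by hand from a table of counts and compares it with chisq.test(). The counts are made up and are not the club.df data.

```r
# Made-up counts of guests by club (rows) and gender (columns)
observed <- matrix(c(30, 20,
                     25, 25,
                     15, 35),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(club = c("A", "B", "C"),
                                   gender = c("F", "M")))

# Expected counts if club and gender were independent
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-square statistic, its degrees of freedom, and the p-value
chi.sq  <- sum((observed - expected)^2 / expected)
df      <- (nrow(observed) - 1) * (ncol(observed) - 1)
p.value <- pchisq(chi.sq, df, lower.tail = FALSE)

# The built-in test should agree with the hand calculation
builtin <- chisq.test(observed)
print(c(by.hand = chi.sq, builtin = unname(builtin$statistic)))
```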

Was there a significant difference between just Kantine and Barry’s? Do the test again using only data from these two clubs. Report your results in APA format.

q5b.test <- with(subset(club.df, club %in% c("Kantine", "Barrys")), chisq.test(club, gender))

apa(q5b.test)

## [1] "X(1, N = 188) = 12.24, p < 0.01 (2-tailed)"

# Even looking only at Kantine and Barrys, there is a significant relationship between clubs and gender, X(1, N = 188) = 12.24, p < .01 (2-tailed). Since we know from before that Barrys has a higher proportion of females, we can conclude that there is a significantly higher proportion of men at Kantine than at Barrys. Kantine is thus a true Wurstfest.

Who is more likely to leave a club alone, Men or Women?

Calculate the percentage of men and women who leave alone.

genderagg <- with(club.df, aggregate(leavealone ~ gender, 
                                     FUN = mean))

Plot the results using a barplot. Set the height argument to be the percentage of people who leave alone for each gender, and set the names argument to be the gender.

barplot(height = genderagg$leavealone, 
        names = genderagg$gender)

Is there a significant relationship between gender and whether or not people leave clubs alone? Answer this using a chi-square test. Run the test and save the result in an object called q6.test

q6.test <- with(club.df, chisq.test(gender, leavealone))

What is your conclusion in APA format?

q6.test

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  gender and leavealone
## X-squared = 0.43293, df = 1, p-value = 0.5106

apa(q6.test)

## [1] "X(1, N = 300) = 0.43, p = 0.51 (2-tailed)"

# There is no significant relationship between gender and whether or not people go home alone, X(1, N = 300) = 0.43, p = 0.51 (2-tailed)
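The output above mentions Yates' continuity correction because chisq.test() applies it by default to 2 x 2 tables. The sketch below shows the effect on a made-up table; the counts are not taken from club.df.

```r
# Made-up 2 x 2 table of gender (rows) by leaving alone (columns)
tab <- matrix(c(20, 40,
                30, 35),
              nrow = 2, byrow = TRUE,
              dimnames = list(gender = c("F", "M"),
                              leavealone = c("0", "1")))

with.yates    <- chisq.test(tab)                   # default: correct = TRUE
without.yates <- chisq.test(tab, correct = FALSE)  # plain Pearson statistic

# The corrected statistic is never larger than the uncorrected one
print(c(with.yates    = unname(with.yates$statistic),
        without.yates = unname(without.yates$statistic)))
```

The correction makes the test slightly more conservative, so the corrected p-value is at least as large as the uncorrected one.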

Does your conclusion hold if you only include people who stayed at the club for more than 60 minutes? Repeat the test on these data and report your conclusions in APA format.

q6b.test <- with(subset(club.df, time > 60), 
                 chisq.test(gender, leavealone))
q6b.test

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  gender and leavealone
## X-squared = 0.88492, df = 1, p-value = 0.3469

apa(q6b.test)

## [1] "X(1, N = 277) = 0.88, p = 0.35 (2-tailed)"

# Even looking only at people who stayed more than 60 minutes, there is no significant relationship between gender and whether or not people go home alone, X(1, N = 277) = 0.88, p = 0.35 (2-tailed)