Hypothesis tests from a student survey

In this WPA, you’ll analyze data from a fictional survey of 100 students. The code I used to generate the data is also available (code to generate wpa6 data).

The data are located in a tab-delimited text file at http://nathanieldphillips.com/wp-content/uploads/2016/11/wpa6.txt

Datafile description

The data file has 100 rows and 12 columns. Here are the columns:

  • sex: string. A string indicating the sex of the participant. “m” = male, “f” = female.

  • age: integer. An integer indicating the age of the participant.

  • major: string. A string indicating the participant’s major

  • haircolor: string. Hair color

  • iq: integer. P’s score on an IQ test.

  • country: string. P’s country of origin

  • logic: numeric. Amount of time it took for a participant to complete a logic problem. Smaller is better.

  • siblings: integer. How many siblings does the P have?

  • multitasking: integer. Participant’s score on a multitasking task. Higher is better.

  • partners: integer. How many sexual partners has the participant had?

  • marijuana: binary. Has the participant ever tried marijuana? 0 = “no”, 1 = “yes”

  • risk: binary. Would the person play a gamble with a 50% chance of losing 20CHF and a 50% chance of earning 20CHF? 0 means the participant would not play the gamble, 1 means they would

Data loading and preparation

  1. Open your class R project. Open a new script and enter your name, date, and the wpa number at the top. Save the script in the R folder in your project working directory as wpa_6_LASTFIRST.R, where LAST and FIRST are your last and first names.

  2. The data are stored in a tab-delimited text file located at http://nathanieldphillips.com/wp-content/uploads/2016/11/wpa6.txt. Using read.table(), load the data into R as a new object called wpa6.df as follows.

# Read data into a new object called wpa6.df
wpa6.df <- read.table(file = "http://nathanieldphillips.com/wp-content/uploads/2016/11/wpa6.txt",
                      header = TRUE,         # There is a header row
                      sep = "\t")            # Data are tab-delimited

  3. Using write.table(), write the data as a text file titled wpa6.txt into the data folder in your project working directory as follows.

# Write wpa6.df to a tab-delimited text file in my data folder.
write.table(wpa6.df,                        # Object to be written
            file = "data/wpa6.txt",         # Put file wpa6.txt in the data folder of my working directory
            sep = "\t")                     # Make data tab-delimited

  4. Using head(), str(), and View(), look at the dataset and make sure that it was loaded correctly. If the data don’t look correct (i.e., if there isn’t a header row, 100 rows, and 12 columns), you didn’t load it correctly!
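
A quick programmatic check can complement the visual inspection. The sketch below uses a hypothetical helper, check_dims(), which is not part of the assignment, and runs it on a small toy data frame so that it works without downloading anything.

```r
# Hypothetical helper (not part of the assignment): TRUE if a data frame
# has the expected number of rows and columns
check_dims <- function(df, n.rows, n.cols) {
  nrow(df) == n.rows && ncol(df) == n.cols
}

# Toy data frame standing in for wpa6.df
toy.df <- data.frame(sex = c("m", "f", "f"),
                     age = c(21, 23, 22))

check_dims(toy.df, n.rows = 3, n.cols = 2)     # Matches the toy data
check_dims(toy.df, n.rows = 100, n.cols = 12)  # Does not match the toy data
```

For the survey data, the equivalent call would be check_dims(wpa6.df, n.rows = 100, n.cols = 12).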

Please write your answers to all hypothesis test questions in proper American Pirate Association (APA) style! If your p-value is less than .01, just write p < .01

Chi-square: X(df) = XXX, p = YYY

t-test: t(df) = XXX, p = YYY

correlation test: r = XXX, t(df) = YYY, p = ZZZZ

For example, here is some output with the appropriate APA conclusion:

library(yarrr)

# Do pirates with headbands have different numbers of tattoos than those
#  who do not wear headbands?
t.test(tattoos ~ headband, 
       data = pirates)
## 
##  Welch Two Sample t-test
## 
## data:  tattoos by headband
## t = -19.313, df = 146.73, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -5.878101 -4.786803
## sample estimates:
##  mean in group no mean in group yes 
##          4.699115         10.031567

Answer: Pirates with headbands have significantly more tattoos on average than those who do not wear headbands: t(146.73) = -19.31, p < .01
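
If you find yourself typing many of these conclusions, the pieces can be pulled straight from the test object. The following is only a sketch: apa_t() is a hypothetical helper (not a yarrr or base R function), and the example uses R's built-in sleep dataset rather than the survey data.

```r
# Hypothetical helper: build an APA-style string from a t.test() result
apa_t <- function(test) {
  p <- test$p.value
  # Follow the rule above: report p < .01 for small p-values
  p.text <- if (p < .01) "p < .01" else paste0("p = ", round(p, 2))
  paste0("t(", round(test$parameter, 2), ") = ",
         round(test$statistic, 2), ", ", p.text)
}

# Example with the built-in sleep dataset (Welch two-sample t-test)
sleep.test <- t.test(extra ~ group, data = sleep)
apa_t(sleep.test)
```

The helper only formats the numbers; you still need to write the sentence that interprets the result.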

t-test(s)

  1. Average IQ in the general population is 100. Do the participants have an IQ different from the general population? Answer this with a one-sample t-test.

  2. A friend of yours claims that students have 2.5 siblings on average. Test this claim with a one-sample t-test.

  3. Do students that have smoked marijuana have different IQ levels than those who have never smoked marijuana? Test this claim with a two-sample t-test (you can either use the vector or the formula notation for a t-test)
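
As a reminder of the syntax, here is a minimal sketch of both kinds of t-test. It uses the built-in mtcars dataset (an assumption made for illustration, not the survey data), so you will need to swap in the wpa6.df columns yourself.

```r
# One-sample t-test: is mean mpg different from a reference value of 20?
one.sample <- t.test(x = mtcars$mpg, mu = 20)

# Two-sample t-test, formula notation: does mpg differ by transmission (am)?
two.formula <- t.test(mpg ~ am, data = mtcars)

# The same two-sample test in vector notation
two.vector <- t.test(x = mtcars$mpg[mtcars$am == 0],
                     y = mtcars$mpg[mtcars$am == 1])

one.sample
two.formula
```

The formula and vector notations give the same test statistic; pick whichever you find easier to read.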

Correlation test(s)

  1. Do students with higher multitasking skills tend to have more romantic partners than those with lower multitasking skills? Test this with a correlation test.

  2. Do people with higher IQs perform faster on the logic test? Answer this question with a correlation test.
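
A correlation test follows the same pattern. The sketch below again uses the built-in mtcars dataset as a stand-in (an assumption for illustration only) and shows where r, the t statistic, and the degrees of freedom live in the result.

```r
# Correlation test, formula notation (note: both variables go on the right of ~)
wt.mpg.test <- cor.test(~ wt + mpg, data = mtcars)

# The same test in vector notation
cor.test(x = mtcars$wt, y = mtcars$mpg)

# Pieces needed for the APA-style conclusion
wt.mpg.test$estimate    # r
wt.mpg.test$statistic   # t
wt.mpg.test$parameter   # df
wt.mpg.test$p.value     # p
```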

chi-square test(s)

  1. Are some majors more popular than others? Answer this question with a one-sample chi-square test.

  2. In general, were students more likely to take a risk than not? Answer this question with a one-sample chi-square test.

  3. Is there a relationship between hair color and students’ academic major? Answer this with a two-sample chi-square test.
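
Both chi-square tests take a table of counts as input. Here is a minimal sketch using the built-in mtcars dataset (an illustrative stand-in, not the survey data).

```r
# One-sample chi-square test: are the cylinder categories equally frequent?
cyl.test <- chisq.test(x = table(mtcars$cyl))

# Two-sample chi-square test: is cylinder count related to transmission type?
# (R may warn that the approximation is poor because some expected counts
#  in this small dataset are low)
cyl.am.test <- chisq.test(x = table(mtcars$cyl, mtcars$am))

cyl.test
cyl.am.test
```

In both cases the table() call does the counting, and chisq.test() compares those counts to what would be expected under the null hypothesis.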

CHECKPOINT!

Anscombe’s Famous data quartet

In the next few questions, we’ll explore Anscombe’s famous data quartet. This famous dataset will show you the dangers of interpreting statistical tests (like a correlation test), without first plotting the data!

  1. Run the following code to create the anscombe.df dataframe. This dataframe contains 4 datasets x1 and y1, x2 and y2, x3 and y3 and x4 and y4:
# JUST COPY, PASTE, AND RUN!

anscombe.df <- data.frame(x1 = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
                          y1 = c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 4.68),
                          x2 = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
                          y2 = c(9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74),
                          x3 = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
                          y3 = c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73),
                          x4 = c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8),
                          y4 = c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89))

  2. Calculate the correlation between x1 and y1, x2 and y2, x3 and y3, and x4 and y4 separately (that is, what is the correlation between x1 and y1? Now, what is the correlation between x2 and y2?, …). What do you notice about the correlation values for each test?

  3. Now run the following code to generate a scatterplot of each data pair. What do you find?

# JUST COPY, PASTE, AND RUN!

# Plot the famous Anscombe quartet
par(mfrow = c(2, 2)) # Create 2 x 2 plotting grid

for (i in 1:4) {   # Loop over datasets
 
  # Assign x and y for current value of i
  
  if (i == 1) {x <- anscombe.df$x1
               y <- anscombe.df$y1} 
  
  if (i == 2) {x <- anscombe.df$x2
               y <- anscombe.df$y2} 
  
  if (i == 3) {x <- anscombe.df$x3
               y <- anscombe.df$y3} 
  
  if (i == 4) {x <- anscombe.df$x4
               y <- anscombe.df$y4} 

  # Create plot (the title shows the current dataset number)
  plot(x = x, y = y, pch = 21, main = paste("Anscombe", i),
       bg = "orange", col = "red",
       xlim = c(0, 20), ylim = c(0, 15))

  # Add regression line
  abline(lm(y ~ x,
            data = data.frame(y, x)),
         col = "blue", lty = 2)

  # Add correlation text
  text(x = 3, y = 12,
       labels = paste0("cor = ", round(cor(x, y), 2)))
  
}

par(mfrow = c(1, 1)) # Reset plotting grid

What you have just seen is the famous Anscombe’s quartet, a dataset designed to show you how important it is to always plot your data before running a statistical test! You can read more on the Wikipedia page here: https://en.wikipedia.org/wiki/Anscombe%27s_quartet
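
R also ships a built-in data frame called anscombe (in the datasets package) with columns x1 to x4 and y1 to y4. Assuming it holds the same values as anscombe.df above, the four correlations can be computed in one step by building the column names with paste0(). This is just a compact sketch, not a replacement for looking at the plots.

```r
# Compute the correlation for each of the four x/y pairs in the
# built-in anscombe data frame
cors <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]],
      anscombe[[paste0("y", i)]])
})
names(cors) <- paste0("set", 1:4)

round(cors, 2)   # Four nearly identical correlations
```

The same paste0() trick could replace the four if statements in the plotting loop.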

You pick the test!

  1. Is there a relationship between whether a student has ever smoked marijuana and his/her decision to accept or reject the risky gamble?

  2. Do males and females have different numbers of sexual partners on average?

  3. Do males and females differ in how likely they are to have smoked marijuana?

  4. Do people who have smoked marijuana have different logic scores on average than those who never have smoked marijuana?

  5. Do people with higher iq scores tend to perform better on the logic test than those with lower iq scores?

More complicated tests

  1. Are Germans more likely than not to have tried marijuana? (Hint: this is a one-sample chi-square test with a subset argument)

  2. Does the relationship between iq and multitasking differ between males and females? Test this by conducting two separate tests – one for males and one for females. Do your conclusions differ?

  3. Does the IQ of people with brown hair differ from that of blondes? (Hint: This is a two-sample t-test that requires you to use the subset argument to tell R which two groups you want to compare)

  4. Only for men from Switzerland, is there a relationship between age and IQ?

  5. Only for people who chose the risky gamble, do people that have smoked marijuana have more sexual partners than those who have never smoked marijuana? In other words, is there a relationship between smoking marijuana and number of sexual partners?

  6. Only for people who chose the risky gamble and have never tried marijuana, is there a relationship between iq and performance on the logic test?
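
These questions all come down to restricting a test to part of the data. The sketch below shows three common ways to do that, using the built-in mtcars dataset as a stand-in (an assumption for illustration; the column names and conditions are not from the survey data).

```r
# Two-sample t-test on two of three groups, using the subset argument
sub.t <- t.test(mpg ~ cyl,
                data = mtcars,
                subset = cyl %in% c(4, 8))   # Keep only 4- and 8-cylinder cars

# Correlation test within one subgroup, using the subset argument
sub.cor <- cor.test(~ wt + mpg,
                    data = mtcars,
                    subset = am == 1)        # Keep only manual cars

# chisq.test() has no subset argument, so subset the data first
eight.df <- subset(mtcars, cyl == 8)
sub.chisq <- chisq.test(x = table(eight.df$am))

sub.t
sub.cor
sub.chisq
```

Several conditions can be combined with & inside the subset, for example subset = am == 1 & cyl == 4.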

Submit!

Save and email your wpa_6_LASTFIRST.R file to me at nathaniel.phillips@unibas.ch. Then, go to https://goo.gl/forms/UblvQ6dvA76veEWu1 to complete the WPA submission form.