In this WPA, you will analyze data from a (again…fake) study on attraction. In the study, 1000 heterosexual University students viewed the Facebook profile of another student (the “target”) of the opposite sex. Based on a target’s profile, each participant made three judgments about the target - intelligence, attractiveness, and dateability. The primary judgement was a dateability rating indicating how dateable the person was on a scale of 0 to 100.

facebook <- read.table("http://nathanieldphillips.com/wp-content/uploads/2016/04/facebook.txt", sep = "\t", header = TRUE)

The data are located in a tab-delimited text file at http://nathanieldphillips.com/wp-content/uploads/2016/04/facebook.txt. Here is how the first few rows of the data should look:

head(facebook)
##   session sex age haircolor university    education shirtless intelligence
## 1       1   m  23     brown   3.Geneva    3.Masters     2.Yes        1.low
## 2       1   m  19    blonde   2.Zurich 1.HighSchool      1.No     2.medium
## 3       1   f  22     brown   2.Zurich  2.Bachelors     2.Yes        1.low
## 4       1   f  22       red   2.Zurich  2.Bachelors      1.No     2.medium
## 5       1   m  23     brown   3.Geneva  2.Bachelors      1.No     2.medium
## 6       1   m  26    blonde   2.Zurich    3.Masters     2.Yes       3.high
##   attractiveness dateability
## 1         3.high          15
## 2       2.medium          44
## 3       2.medium         100
## 4         3.high         100
## 5       2.medium          63
## 6         3.high          76

Datafile description

The data file has 1000 rows and 10 columns. Here are the columns

Data loading and preparation

  1. Open your class R project. Open a new script and enter your name, date, and the wpa number at the top. Save the script in the R folder in your project working directory as wpa_9_LASTFIRST.R, where LAST and FIRST are your last and first names.

  2. The data are stored in a tab–delimited text file located at http://nathanieldphillips.com/wp-content/uploads/2016/04/facebook.txt. Using read.table() load this data into R as a new object called facebook

Understand the data

  1. Look at the first few rows of the dataframe with the head() function to make sure it loaded correctly.
head(facebook)
##   session sex age haircolor university    education shirtless intelligence
## 1       1   m  23     brown   3.Geneva    3.Masters     2.Yes        1.low
## 2       1   m  19    blonde   2.Zurich 1.HighSchool      1.No     2.medium
## 3       1   f  22     brown   2.Zurich  2.Bachelors     2.Yes        1.low
## 4       1   f  22       red   2.Zurich  2.Bachelors      1.No     2.medium
## 5       1   m  23     brown   3.Geneva  2.Bachelors      1.No     2.medium
## 6       1   m  26    blonde   2.Zurich    3.Masters     2.Yes       3.high
##   attractiveness dateability
## 1         3.high          15
## 2       2.medium          44
## 3       2.medium         100
## 4         3.high         100
## 5       2.medium          63
## 6         3.high          76
  1. Using the str() function, look at the structure of the dataframe to make sure everything looks ok
str(facebook)
## 'data.frame':    1000 obs. of  10 variables:
##  $ session       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ sex           : Factor w/ 2 levels "f","m": 2 2 1 1 2 2 1 2 1 1 ...
##  $ age           : int  23 19 22 22 23 26 19 25 22 19 ...
##  $ haircolor     : Factor w/ 3 levels "blonde","brown",..: 2 1 2 3 2 1 2 3 2 1 ...
##  $ university    : Factor w/ 3 levels "1.Basel","2.Zurich",..: 3 2 2 2 3 2 3 2 2 3 ...
##  $ education     : Factor w/ 4 levels "1.HighSchool",..: 3 1 2 2 2 3 1 3 2 1 ...
##  $ shirtless     : Factor w/ 2 levels "1.No","2.Yes": 2 1 2 1 1 2 1 2 2 1 ...
##  $ intelligence  : Factor w/ 3 levels "1.low","2.medium",..: 1 2 1 2 2 3 1 3 2 2 ...
##  $ attractiveness: Factor w/ 3 levels "1.low","2.medium",..: 3 2 2 3 2 3 2 1 2 2 ...
##  $ dateability   : int  15 44 100 100 63 76 61 26 76 40 ...

Custom Functions

  1. Write a function called feed.me() that takes a string food as an argument, and returns the sentence “I love to eat ___“. Try your function by running feed.me("apples").
feed.me <- function(___) {
  
  output <- paste0("I love to eat ", ___)
  
  print(___)
}
feed.me <- function(food) {
  
  output <- paste0("I love to eat ", food)
  
  print(output)
}

feed.me("apples")
## [1] "I love to eat apples"
  1. Write a function called my.mean() that takes a vector x as an argument, and returns the mean of the vector x. Don’t use the mean() function! Use sum() and length()!
my.mean <- function(___) {
  
  result <- sum(___) / length(___)
  
  return(result)
  
}
my.mean <- function(x) {
  
  result <- sum(x) / length(x)
  
  return(result)
  
}
  1. Try your my.mean() function to calculate the mean dateability rating of participants in the facebook dataset and compare the result to the built-in mean() function to make sure you get the same result!
my.mean(facebook$dateability)
## [1] 54.199
mean(facebook$dateability)
## [1] 54.199
  1. Write a function called how.many.na() that takes a vector x as an argument, and returns the number of NA values found in the vector:
how.many.na <- function(x) {
  
  output <- sum(is.na(___))
  
  return(___)
}
how.many.na <- function(x) {
  
  output <- sum(is.na(x))
  
  return(output)
}
  1. Test your how.many.na() function on the age of participants in the facebook dataframe and then on the vector x = c(4, 7, 3, NA, NA, 1)
how.many.na(facebook$age)
## [1] 0
how.many.na(c(4, 7, 3, NA, NA, 1))
## [1] 2
  1. Create a function called my.plot() that takes arguments x and y and returns a customised scatterplot with gridlines and a regression line:
my.plot <- function(x, y) {
  
  plot(x = ___, 
       y = ___, 
       pch = ___,     # look at ?points to see the values of pch!
       col = ___)
  
  grid()   # Add gridlines
  
  # Add a regression line
  abline(lm(___ ~ ___), 
         col = ___)
}
my.plot <- function(x, y) {
  
  plot(x = x, 
       y = y, 
       pch = 16,     # look at ?points to see the values of pch!
       col = "green")
  
  grid()   # Add gridlines
  
  # Add a regression line
  abline(lm(y ~ x), 
         col = "blue")
}
  1. Now test your my.plot() function on the age and dateability of participants in the facebook dataset
my.plot(facebook$age, 
        facebook$dateability)

Loops

  1. Create a loop that prints the squares of integers from 1 to 10:
for(i in ___) {
  
  square.i <- ___
  
  print(square.i)
  
}
for(i in 1:10) {
  
  square.i <- i ^ 2
  
  print(square.i)
  
}
## [1] 1
## [1] 4
## [1] 9
## [1] 16
## [1] 25
## [1] 36
## [1] 49
## [1] 64
## [1] 81
## [1] 100
  1. Using the following template, create a loop that calculates (and prints) the mean dateability rating of students from each university in the facebook dataset
for(university.i in c("1.Basel", "2.Zurich", "3.Geneva")) {
  
  data.i <- facebook$dateability[facebook$university == ___]
  
  output <- paste0("The mean dateability of university ", ___, " is ", ___)
  
  print(___)
  
}
for(university.i in c("1.Basel", "2.Zurich", "3.Geneva")) {
  
  data.i <- facebook$dateability[facebook$university == university.i]
  
  output <- paste0("The mean dateability of university ", university.i, " is ", mean(data.i))
  
  print(output)
  
}
## [1] "The mean dateability of university 1.Basel is 60.7548387096774"
## [1] "The mean dateability of university 2.Zurich is 50.4793388429752"
## [1] "The mean dateability of university 3.Geneva is 52.1131498470948"
  1. Now create a histogram of dateability ratings from each level of intelligence using the following structure
par(mfrow = c(1, 3)) # Set up 1 x 3 plotting grid

for(intelligence.i in c("1.low", "2.medium", "3.high")) {
  
  hist(facebook$dateability[facebook$intelligence == ___],
       main = ___,
       xlab = ___,
       col = ___
       )
  
}

par(mfrow = c(1, 1)) # Reset plotting grid
par(mfrow = c(1, 3)) # Set up 1 x 3 plotting grid

for(intelligence.i in c("1.low", "2.medium", "3.high")) {
  
  hist(facebook$dateability[facebook$intelligence == intelligence.i],
       main = intelligence.i,
       xlab = "Dateability",
       col = "skyblue"
       )
  
}

par(mfrow = c(1, 1)) # Reset plotting grid

CHECKPOINT!

More functions!

  1. Write a function called ttest.apa() that takes a vector x as an argument and returns an apa style conclusion from a one-sample test of x.
ttest.apa <- function(x, mu) {
  
   # Store the one-sample ttest in object a
  
  a <- t.test(x = ___, 
              mu = ___) 
  
  df <- a$parameter  # Get the degrees of freedom
  test.stat <- ___      # Get the test statistic
  p.value <- ___        # Get the p-value
  
  
  # If the test is significant...
  if(p.value <= ___) {
    
    output <- paste0("The test is significant! t(",
                     df, ") = ", test.stat, 
                   ", p = ", p.value, 
                   " (H0 = ", mu, ")")
  }
  
  # If the test is not significant...
    if(p.value > ___) {
    
    output <- ___
    
  }
  
  print(___)  
}
ttest.apa <- function(x, mu) {
  
   # Store the one-sample ttest in object a
  
  a <- t.test(x = x, 
              mu = mu) 
  
  df <- a$parameter  # Get the degrees of freedom
  test.stat <- a$statistic      # Get the test statistic
  p.value <- a$p.value        # Get the p-value
  
  
  # If the test is significant...
  if(p.value <= .05) {
    
    output <- paste0("The test is significant! t(",
                     df, ") = ", test.stat, 
                   ", p = ", p.value, 
                   " (H0 = ", mu, ")")
  }
  
  # If the test is not significant...
    if(p.value > .05) {
    
    output <- paste0("The test is not significant! t(",
                     df, ") = ", test.stat, 
                   ", p = ", p.value, 
                   " (H0 = ", mu, ")")
    
  }
  
  print(output)  
}
  1. Test your ttest.apa() function on the the dateability of participants in the facebook study. Specifically, test if their mean dateability rating is different from 50
ttest.apa(facebook$dateability, 50)
## [1] "The test is significant! t(999) = 4.93261151453317, p = 9.49897937574552e-07 (H0 = 50)"

More loops!

  1. The following dataframe survey contains results from a survey of 5 participants. Each participant was asked 5 questions on a 1-10 likert scale. As you can see, many of the responses are not valid integers from 1-10. Using a loop, create a new dataframe called survey.corrected that converts all invalid values to NA:
survey <- data.frame("p" = c(1, 2, 3, 4, 5),
                     "q1" = c(5, 3, 6, 3, 5),
                     "q2" = c(-1, 4, 3, 6, 11),
                     "q3" = c(6, 22, 4, 6, -5),
                     "q4" = c(6, 3, 4, -2, 4),
                     "q5" = c(1, 1, 900, 1, 2))
survey.corrected <- survey   # COPY SURVEY

for(column.i in ____ ) {  # LOOP OVER COLUMNS
 
  x <- ____   # COPY THE ORIGINAL COLUMN as x
  x[x %in% ___ == FALSE] <- ___  # REPLACE
  
  survey.corrected[,___] <- ___ # ASSIGN x back to survey.correced
  
}
survey.corrected <- survey   # COPY SURVEY

for(column.i in 2:6 ) {  # LOOP OVER COLUMNS
 
  x <- survey[,column.i]   # COPY THE ORIGINAL COLUMN as x
  x[(x %in% 1:10) == FALSE] <- NA  # REPLACE
  
  survey.corrected[,column.i] <- x # ASSIGN x back to survey.correced
  
}

survey.corrected
##   p q1 q2 q3 q4 q5
## 1 1  5 NA  6  6  1
## 2 2  3  4 NA  3  1
## 3 3  6  3  4  4 NA
## 4 4  3  6  6 NA  1
## 5 5  5 NA NA  4  2

Simulation!

  1. What is the probability of getting a significant p-value if the null hypothesis is true? Test this by conducting the following simulation:
p.values <- rep(NA, ___)

for(i in ___) {

x <- rnorm(n = ___, mean = ___, sd = ___)

result <- t.test(___)$___

p.values[___] <- ___

}
p.values <- rep(NA, 100)

for(i in 1:100) {

x <- rnorm(n = 10, mean = 0, sd = 1)

result <- t.test(x)$p.value

p.values[i] <- result

}

hist(p.values)

mean(p.values < .05)
## [1] 0.05
  1. Create a function called psimulation with 4 arguments: sim: the number of simulations, samplesize: the sample size, mu.true: the true mean, and sd.true: the true standard deviation. Your function should repeat the simulation from the previous question with the given arguments. That is, it should calculate sim p-values testing whether samplesize samples from a normal distribution with mean = mu.true and standard deviation = sd.true is significantly different from 0. The function should return a vector of p-values.
psimulation <- function(sim, 
                        samplesize, 
                        mu.true, 
                        sd.true) {
  
  
  
p.values <- rep(NA, sim)

for(i in 1:sim) {

x <- rnorm(n = samplesize, mean = mu.true, sd = sd.true)

result <- t.test(x)$p.value

p.values[i] <- result

}

return(p.values)
  
}


# testing the function

psimulation(sim = 20, samplesize = 20, mu.true = 1, sd.true = 1)
##  [1] 1.055971e-03 4.378150e-09 6.837770e-08 1.195854e-03 5.838443e-02
##  [6] 1.993403e-04 1.804775e-04 4.266576e-04 5.785016e-04 5.137253e-04
## [11] 7.624965e-06 9.271828e-04 2.558004e-03 1.408167e-02 2.325602e-05
## [16] 9.951451e-05 8.273548e-07 9.853791e-04 1.132161e-04 1.104221e-05

Submit!

Save and email your wpa_9_LastFirst.R file to me at nathaniel.phillips@unibas.ch. Then, go to https://goo.gl/forms/UblvQ6dvA76veEWu1 to complete the WPA submission form.