Open a new R script and save it as wpa_X_LastFirst.R (where Last and First are your last and first names). At the top of your script, write the assignment number, your name, and the date in comments. When you answer a task, indicate which task you are answering with appropriate comments.

Analyzing Bar survey data

The following vectors contain (fictional!) data from a survey of 200 people at one of two bars in Basel (Grenzwert and Paddy’s) last Friday night at 3:00am. Each person was asked which kind of cologne they were wearing. After each person answered, a (very busy) researcher recorded how long that person spent talking to people at the bar. The data are stored in 5 vector objects: id, sex, cologne, bar, and time.

Thankfully, you don’t need to type in the data yourself! The objects are stored in an RData file online.

A. Load the vectors into your R session by running the following code.

load(file = url("https://dl.dropboxusercontent.com/u/7618380/wpa2.RData"))

B. The str() function will give you basic information about objects. Get to know the objects (id, sex, cologne, bar, time) by running the str() function on each of the 5 vectors.
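
As a minimal sketch of what str() reports, here it is run on two small made-up vectors. The names toy.id and toy.time are hypothetical stand-ins for illustration, not the survey objects.

```r
# Hypothetical toy vectors standing in for the survey objects
toy.id <- c("g.1", "g.2", "p.1")
toy.time <- c(120, 95, 210)

# str() prints the type, the length, and the first few values of an object
str(toy.id)
str(toy.time)
```

A character vector is reported as chr and a numeric vector as num, so str() is a quick way to check how each object is stored before indexing it.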

Review

  1. How many people were there of each sex? (Hint: use table())
table(sex)
## sex
##   f   m 
## 100 100
  1. What was the mean time?
mean(time)
## [1] 165.055
  1. What was the standard deviation of times?
sd(time)
## [1] 45.79221
  1. Create time.z, a z-score transformation of time. (Hint: a z-score is defined as (x - mean(x)) / sd(x))
time.z <- (time - mean(time)) / sd(time)
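
One way to check a z-score transformation is to confirm that the result has mean 0 and standard deviation 1. The sketch below does this on a made-up vector (toy.time is hypothetical, not the survey data) and compares the result with base R's scale().

```r
# Hypothetical toy times in minutes, not the survey data
toy.time <- c(100, 150, 200, 250)

# Same transformation as in the answer above
toy.z <- (toy.time - mean(toy.time)) / sd(toy.time)

# Z-scores should have mean 0 and standard deviation 1
mean(toy.z)
sd(toy.z)

# scale() centers and scales the same way; it returns a matrix,
# so as.vector() is used to drop the extra attributes
all.equal(toy.z, as.vector(scale(toy.time)))
```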

Numerical Indexing

  1. What was the value of the first time?
time[1]
## [1] 289
  1. What were the sexes of the first five participants?
sex[1:5]
## [1] "f" "f" "f" "m" "f"
  1. What were the colognes of the 10th through the 20th participants?
cologne[10:20]
##  [1] "gio"         "calvinklein" "gio"         "calvinklein" "calvinklein"
##  [6] "calvinklein" "gio"         "gio"         "gio"         "calvinklein"
## [11] "gio"
  1. Which bar did the last participant go to? (Hint: don’t write the index number directly; instead, index the vector using the length() function with the appropriate argument)
bar[length(bar)]
## [1] "paddys"
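
The indexing patterns in this section can be sketched on a short made-up vector (toy.bar is hypothetical). Using length() for the last element keeps the code correct when the vector changes size; tail() is a base R alternative.

```r
# Hypothetical toy vector, not the survey data
toy.bar <- c("grenzwert", "paddys", "paddys", "grenzwert", "paddys")

toy.bar[1]                 # first element
toy.bar[2:4]               # elements 2 through 4
toy.bar[length(toy.bar)]   # last element, whatever the length is
tail(toy.bar, n = 1)       # base R alternative for the last element
```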

Logical Indexing on one variable

  1. a) How many people wore gio? b) How many wore calvinklein?
sum(cologne == "gio")
## [1] 100
sum(cologne == "calvinklein")
## [1] 100
  1. a) How many people went to Grenzwert? b) How many went to Paddys?
sum(bar == "grenzwert")
## [1] 100
sum(bar == "paddys")
## [1] 100
  1. What percent of people went to Grenzwert? (Hint: use mean() combined with a logical vector)
mean(bar == "grenzwert")
## [1] 0.5
  1. How many talking times were longer than 30 minutes?
sum(time > 30)
## [1] 199
  1. What percent of talking times were longer than an hour?
mean(time > 60)
## [1] 0.95
  1. What percent of talking times were longer than 20 minutes but less than 40 minutes?
mean(time > 20 & time < 40)
## [1] 0.03
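
The answers above work because R treats TRUE as 1 and FALSE as 0, so sum() counts the TRUE values and mean() gives their proportion. A minimal sketch on made-up times (toy.time is hypothetical) shows this, including the step from a proportion to a percent.

```r
# Hypothetical toy times in minutes, not the survey data
toy.time <- c(25, 45, 70, 35, 90)

# Logical vector: FALSE TRUE TRUE TRUE TRUE
long.log <- toy.time > 30

n.long <- sum(long.log)      # count of TRUE values: 4
p.long <- mean(long.log)     # proportion of TRUE values: 0.8
pct.long <- p.long * 100     # the same value as a percent: 80

# Two conditions combined with &: only 25 and 35 qualify, so 2 of 5
p.mid <- mean(toy.time > 20 & toy.time < 40)
```

Note that mean() on a logical vector returns a proportion between 0 and 1; multiply by 100 when a percent is asked for.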

Logical indexing and two variables

  1. What were the ids of people who went to grenzwert?
id[bar == "grenzwert"]
##   [1] "g.31"  "g.25"  "g.30"  "g.50"  "g.29"  "g.56"  "g.74"  "g.66" 
##   [9] "g.55"  "g.38"  "g.67"  "g.73"  "g.24"  "g.46"  "g.32"  "g.21" 
##  [17] "g.39"  "g.18"  "g.98"  "g.94"  "g.100" "g.89"  "g.49"  "g.65" 
##  [25] "g.64"  "g.52"  "g.12"  "g.91"  "g.84"  "g.94"  "g.91"  "g.54" 
##  [33] "g.20"  "g.60"  "g.15"  "g.81"  "g.16"  "g.99"  "g.76"  "g.82" 
##  [41] "g.17"  "g.75"  "g.33"  "g.88"  "g.96"  "g.36"  "g.34"  "g.85" 
##  [49] "g.45"  "g.69"  "g.62"  "g.92"  "g.61"  "g.40"  "g.47"  "g.90" 
##  [57] "g.78"  "g.93"  "g.80"  "g.72"  "g.44"  "g.35"  "g.100" "g.37" 
##  [65] "g.95"  "g.99"  "g.57"  "g.63"  "g.87"  "g.83"  "g.42"  "g.13" 
##  [73] "g.79"  "g.86"  "g.22"  "g.68"  "g.97"  "g.92"  "g.58"  "g.11" 
##  [81] "g.98"  "g.43"  "g.19"  "g.26"  "g.51"  "g.93"  "g.27"  "g.71" 
##  [89] "g.77"  "g.95"  "g.28"  "g.96"  "g.48"  "g.14"  "g.70"  "g.41" 
##  [97] "g.59"  "g.97"  "g.23"  "g.53"
  1. What was the mean time of people who went to Grenzwert?
mean(time[bar == "grenzwert"])
## [1] 134.31
  1. What was the mean time of people who went to Paddys?
mean(time[bar == "paddys"])
## [1] 195.8
  1. What was the mean time of people who wore gio?
mean(time[cologne == "gio"])
## [1] 159.98
  1. What was the mean time of people who wore calvinklein?
mean(time[cologne == "calvinklein"])
## [1] 170.13
  1. Based on what you’ve learned, if someone wants to talk for as long as possible, what cologne should they wear?
# They should wear calvinklein!
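
Indexing one vector with a condition on another works because the vectors are aligned by position. The sketch below shows this on made-up data (toy.bar and toy.time are hypothetical), with base R's tapply() as an alternative that computes all group means in one call.

```r
# Hypothetical toy data, aligned by position; not the survey data
toy.bar <- c("grenzwert", "paddys", "grenzwert", "paddys")
toy.time <- c(100, 200, 140, 180)

# One group at a time, as in the answers above
mean(toy.time[toy.bar == "grenzwert"])   # (100 + 140) / 2 = 120
mean(toy.time[toy.bar == "paddys"])      # (200 + 180) / 2 = 190

# All groups at once: apply mean() to toy.time within each level of toy.bar
toy.means <- tapply(toy.time, toy.bar, mean)
toy.means
```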

Changing data in a vector with indexing

In the next questions, we’ll use indexing and assignment to change the values within a vector. Because we don’t want to change the original data, we’ll make all of our adjustments on new vectors.

  1. Create new objects bar.r, cologne.r and time.r that are copies of the original bar, cologne and time objects (Hint: Just assign the existing vectors to new objects)
bar.r <- bar
cologne.r <- cologne
time.r <- time
  1. a) In the bar.r vector, change the "grenzwert" values to "g". b) Now change the "paddys" values to "p"
bar.r[bar == "grenzwert"] <- "g"
bar.r[bar == "paddys"] <- "p"
  1. a) In the cologne.r vector, change the "gio" values to "G". b) Now change the "calvinklein" values to "C"
cologne.r[cologne == "gio"] <- "G"
cologne.r[cologne == "calvinklein"] <- "C"
  1. In the time.r vector, change all time values greater than 280 to 280. Confirm that you did it correctly by calculating the maximum time in time.r
time.r[time > 280] <- 280
max(time.r)
## [1] 280
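
A small sketch of the copy-then-modify pattern, using a made-up vector (toy.time and toy.time.r are hypothetical names). It confirms that the original is untouched and shows pmin() as a base R alternative for capping values.

```r
# Hypothetical toy times in minutes, not the survey data
toy.time <- c(120, 300, 95, 290)

# Assigning to a new name makes a copy; later changes affect only the copy
toy.time.r <- toy.time

# Replace every value above 280 with 280
toy.time.r[toy.time > 280] <- 280

toy.time.r   # 120 280  95 280
toy.time     # still 120 300  95 290

# pmin() takes the element-wise minimum, which caps the values the same way
toy.capped <- pmin(toy.time, 280)
```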

Checkpoint! If you got this far you’re doing great!

Solving a paradox…

  1. Based on what you’ve learned so far, if someone wanted to talk to people as long as possible, what cologne should they wear?
#I already said calvinklein...why are you asking me again?

Let’s see if your prediction holds up!

  1. What was the mean time of people who went to Grenzwert and wore gio?
mean(time[bar == "grenzwert" & cologne == "gio"])
## [1] 145.0111
  1. What was the mean time of people who went to Grenzwert and wore calvinklein?
mean(time[bar == "grenzwert" & cologne == "calvinklein"])
## [1] 38
  1. What was the mean time of people who went to Paddys who wore gio?
mean(time[bar == "paddys" & cologne == "gio"])
## [1] 294.7
  1. What was the mean time of people who went to Paddys who wore calvinklein?
mean(time[bar == "paddys" & cologne == "calvinklein"])
## [1] 184.8111
  1. Based on what you’ve learned now, if someone’s goal is to talk to people as long as possible, what cologne should they wear?
# They should wear gio! Talking times are longer for gio in BOTH bars!
# Even though across bars talking times are longer for calvinklein. Crazy!

You can visualize the data using the following code

# Combine vectors in a dataframe
survey.df <- data.frame(bar, cologne, time)
# Create a pirateplot of the data
yarrr::pirateplot(time ~ cologne + bar, data = survey.df)

What you’ve just seen is an example of Simpson’s Paradox. If you want to learn more, check out the Wikipedia page.

# Yes I definitely looked at the Wikipedia page. Very interesting!
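
The reversal comes from unequal group sizes: each overall mean is an average of the two bar-specific means, weighted by how many people fall in each group. As a sketch, the subgroup means reported above reproduce the overall means if we assume 90 gio and 10 calvinklein wearers at Grenzwert, and the reverse at Paddys. These counts are not stated in the assignment; they are an assumption chosen because they are consistent with the reported means.

```r
# Subgroup means reported in the answers above (minutes)
gio.means <- c(grenzwert = 145.0111, paddys = 294.7)
ck.means  <- c(grenzwert = 38,       paddys = 184.8111)

# ASSUMED group sizes (inferred for illustration, not given in the assignment)
gio.n <- c(grenzwert = 90, paddys = 10)
ck.n  <- c(grenzwert = 10, paddys = 90)

# The overall mean per cologne is a size-weighted average of the bar means
gio.overall <- weighted.mean(gio.means, w = gio.n)
ck.overall  <- weighted.mean(ck.means,  w = ck.n)

round(gio.overall, 2)   # 159.98, matching the overall gio mean
round(ck.overall, 2)    # 170.13, matching the overall calvinklein mean

# Within each bar gio is higher, yet overall calvinklein is higher
gio.means > ck.means
gio.overall < ck.overall
```

Under these assumed counts, most gio wearers are at the bar with shorter talking times and most calvinklein wearers are at the bar with longer ones, which is what pulls the overall means in the opposite direction.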

Some bigger challenges…

  1. What percent of women wore calvinklein?
# Create a vector cologne.w with the colognes of women only
cologne.w <- cologne[sex == "f"]

# What percent were calvinklein?
mean(cologne.w == "calvinklein")
## [1] 0.51
# OR, do it all at once
mean(cologne[sex == "f"] == "calvinklein")
## [1] 0.51
  1. What was the median time of people who went to grenzwert and wore gio but who talked more than 100 minutes?
median(time[bar == "grenzwert" & cologne == "gio" & time > 100])
## [1] 144.5
  1. What percent of participants either went to grenzwert and talked for less than 220 minutes or went to paddys and talked for more than 150 minutes but no longer than 250 minutes?
mean((bar == "grenzwert" & time < 220) | (bar == "paddys" & time > 150 & time <= 250))
## [1] 0.95
  1. Let’s make the calvinklein wearers look better. For all of the calvinklein wearers, add a random sample from a normal distribution with mean 30 and standard deviation 5 to their original talking times.
# Step 1: Create a logical vector of who wore calvinklein
ck.log <- cologne == "calvinklein"

# Step 2: Do assignment!
time[ck.log] <- time[ck.log] + rnorm(n = sum(ck.log), mean = 30, sd = 5)


# OR do it all at once (run this instead of Steps 1 and 2, not after them)
time[cologne == "calvinklein"] <- time[cologne == "calvinklein"] + rnorm(n = sum(cologne == "calvinklein"), mean = 30, sd = 5)
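
The same pattern can be sketched on made-up data (toy.cologne, toy.time, and toy.time.new are hypothetical names). The seed value is arbitrary and is set only so the random draws are repeatable; the result is stored in a new vector so the original stays available for comparison.

```r
# Arbitrary seed so the random draws can be reproduced
set.seed(1)

# Hypothetical toy data, not the survey data
toy.cologne <- c("gio", "calvinklein", "calvinklein", "gio")
toy.time <- c(100, 50, 60, 120)

# Logical vector marking the calvinklein wearers
ck.log <- toy.cologne == "calvinklein"

# Draw exactly one random boost per calvinklein wearer
boost <- rnorm(n = sum(ck.log), mean = 30, sd = 5)

# Add the boosts to a copy, leaving toy.time unchanged
toy.time.new <- toy.time
toy.time.new[ck.log] <- toy.time[ck.log] + boost

# The gio times are untouched
toy.time.new[!ck.log] == toy.time[!ck.log]
```

Setting n to sum(ck.log) matters: the number of draws has to match the number of values being replaced. Applying the update a second time would add a second boost to the same people.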

Submit!

Save and email your wpa_X_LastFirst.R file to me at nathaniel.phillips@unibas.ch. Then, go to https://goo.gl/forms/UblvQ6dvA76veEWu1 to complete the WPA submission form.