Open a new R script and save it as `wpa_X_LastFirst.R` (where Last and First are your last and first names). At the top of your script, write the assignment number, your name, and the date in comments. When you answer a task, indicate which task you are answering with appropriate comments.

# Analyzing Bar survey data

The following vectors contain (fictional!) data from a survey of 200 people, each at one of two bars in Basel (Grenzwert and Paddy’s), last Friday night at 3:00am. Each person was asked which kind of cologne they were wearing. After answering this question, a (very busy) researcher recorded how long each person spent talking to people at the bar. The data are stored in the following 5 vector objects:

- `id`: An id indicating the participant in the form `x.n` (where `x` is the name of the bar the participant was at, and `n` is a random indexing number)
- `sex`: The person’s sex: `m` (male) or `f` (female)
- `cologne`: Which cologne did the person wear? `gio` or `calvinklein`
- `bar`: Where the person went out: `grenzwert` or `paddys`
- `time`: The amount of time the person spent talking to people in minutes

Thankfully, you don’t need to type in the data yourself! The objects are stored in an *RData* file online.

A. Load the vectors into your R session by running the following code.

`load(file = url("https://dl.dropboxusercontent.com/u/7618380/wpa2.RData"))`

B. The `str()` function will give you basic information about objects. Get to know the objects (`id`, `sex`, `cologne`, `bar`, `time`) by running the `str()` function on each of the 5 vectors.
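As a rough illustration of what `str()` reports, here is a minimal sketch on two made-up vectors. The names `toy.time` and `toy.bar` are hypothetical and are not part of the survey data.

```r
# Hypothetical toy vectors, NOT the survey data
toy.time <- c(120, 95, 210)
toy.bar <- c("grenzwert", "paddys", "paddys")

# str() prints the type, the length, and the first few values of an object
str(toy.time)
str(toy.bar)
```

The first call should describe a numeric vector of length 3 and the second a character vector of length 3.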

## Review

- How many people were there of each sex? (Hint: use `table()`)

`table(sex)`

```
## sex
##   f   m
## 100 100
```

- What was the mean time?

`mean(time)`

`## [1] 165.055`

- What was the standard deviation of times?

`sd(time)`

`## [1] 45.79221`

- Create `time.z`, a z-score transformation of time. (Hint: the z-score is defined as `(x - mean(x)) / sd(x)`)

`time.z <- (time - mean(time)) / sd(time)`
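One way to sanity-check a z-score transformation is to confirm that the result has a mean of 0 and a standard deviation of 1. The sketch below does this on a made-up vector (`toy.time` is hypothetical, not the survey data).

```r
# Hypothetical toy vector, NOT the survey data
toy.time <- c(100, 150, 200, 250)

# Same z-score formula as in the task
toy.z <- (toy.time - mean(toy.time)) / sd(toy.time)

# z-scores should have mean 0 and standard deviation 1 (up to rounding error)
mean(toy.z)
sd(toy.z)
```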

## Numerical Indexing

- What was the value of the first time?

`time[1]`

`## [1] 289`

- What were the sexes of the first five participants?

`sex[1:5]`

`## [1] "f" "f" "f" "m" "f"`

- What were the colognes of the 10th through the 20th participants?

`cologne[10:20]`

```
## [1] "gio" "calvinklein" "gio" "calvinklein" "calvinklein"
## [6] "calvinklein" "gio" "gio" "gio" "calvinklein"
## [11] "gio"
```

- Which bar did the last participant go to? (Hint: don’t write the indexing number directly; instead, index the vector using the `length()` function with the appropriate argument)

`bar[length(bar)]`

`## [1] "paddys"`

## Logical Indexing on one variable

- a) How many people wore gio? b) How many wore calvinklein?

`sum(cologne == "gio")`

`## [1] 100`

`sum(cologne == "calvinklein")`

`## [1] 100`

- a) How many people went to Grenzwert? b) How many went to Paddys?

`sum(bar == "grenzwert")`

`## [1] 100`

`sum(bar == "paddys")`

`## [1] 100`

- What percent of people went to Grenzwert? (Hint: use `mean()` combined with a logical vector)

`mean(bar == "grenzwert")`

`## [1] 0.5`
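The hint works because R treats `TRUE` as 1 and `FALSE` as 0 in arithmetic, so `sum()` of a logical vector counts the `TRUE` values and `mean()` gives their proportion. Multiplying by 100 turns that proportion into a percentage. A small sketch on a made-up vector (`toy.bar` and `at.grenzwert` are hypothetical names):

```r
# Hypothetical toy vector, NOT the survey data
toy.bar <- c("grenzwert", "paddys", "grenzwert", "grenzwert")

# The comparison returns a logical vector
at.grenzwert <- toy.bar == "grenzwert"   # TRUE FALSE TRUE TRUE

sum(at.grenzwert)          # counts the TRUE values: 3
mean(at.grenzwert)         # proportion of TRUE values: 0.75
100 * mean(at.grenzwert)   # the same value as a percentage: 75
```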

- How many talking times were longer than 30 minutes?

`sum(time > 30)`

`## [1] 199`

- What percent of talking times were longer than an hour?

`mean(time > 60)`

`## [1] 0.95`

- What percent of talking times were longer than 20 minutes but less than 40 minutes?

`mean(time > 20 & time < 40)`

`## [1] 0.03`

## Logical indexing and two variables

- What were the ids of people who went to grenzwert?

`id[bar == "grenzwert"]`

```
## [1] "g.31" "g.25" "g.30" "g.50" "g.29" "g.56" "g.74" "g.66"
## [9] "g.55" "g.38" "g.67" "g.73" "g.24" "g.46" "g.32" "g.21"
## [17] "g.39" "g.18" "g.98" "g.94" "g.100" "g.89" "g.49" "g.65"
## [25] "g.64" "g.52" "g.12" "g.91" "g.84" "g.94" "g.91" "g.54"
## [33] "g.20" "g.60" "g.15" "g.81" "g.16" "g.99" "g.76" "g.82"
## [41] "g.17" "g.75" "g.33" "g.88" "g.96" "g.36" "g.34" "g.85"
## [49] "g.45" "g.69" "g.62" "g.92" "g.61" "g.40" "g.47" "g.90"
## [57] "g.78" "g.93" "g.80" "g.72" "g.44" "g.35" "g.100" "g.37"
## [65] "g.95" "g.99" "g.57" "g.63" "g.87" "g.83" "g.42" "g.13"
## [73] "g.79" "g.86" "g.22" "g.68" "g.97" "g.92" "g.58" "g.11"
## [81] "g.98" "g.43" "g.19" "g.26" "g.51" "g.93" "g.27" "g.71"
## [89] "g.77" "g.95" "g.28" "g.96" "g.48" "g.14" "g.70" "g.41"
## [97] "g.59" "g.97" "g.23" "g.53"
```

- What was the mean time of people who went to Grenzwert?

`mean(time[bar == "grenzwert"])`

`## [1] 134.31`

- What was the mean time of people who went to Paddys?

`mean(time[bar == "paddys"])`

`## [1] 195.8`

- What was the mean time of people who wore gio?

`mean(time[cologne == "gio"])`

`## [1] 159.98`

- What was the mean time of people who wore calvinklein?

`mean(time[cologne == "calvinklein"])`

`## [1] 170.13`

- Based on what you’ve learned, if someone wants to talk as much (I should have said as *long*) as possible, what cologne should they wear?

`# They should wear calvinklein!`

## Changing data in a vector with indexing

In the next questions, we’ll use indexing and assignment to change the values within a vector. Because we don’t want to change the original data, we’ll make all of our adjustments on new vectors.

- Create new objects `bar.r`, `cologne.r` and `time.r` that are copies of the original `bar`, `cologne` and `time` objects (Hint: Just assign the existing vectors to new objects)

```
bar.r <- bar
cologne.r <- cologne
time.r <- time
```
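Assigning a vector to a new name gives you an independent copy: changing the copy leaves the original untouched. The sketch below illustrates this on made-up data (`toy.bar` and `toy.bar.r` are hypothetical names, not the survey vectors).

```r
# Hypothetical toy vector, NOT the survey data
toy.bar <- c("grenzwert", "paddys", "grenzwert")
toy.bar.r <- toy.bar   # make a copy

# Build the logical index from the original, assign into the copy
toy.bar.r[toy.bar == "grenzwert"] <- "g"

toy.bar.r   # "g" "paddys" "g"
toy.bar     # unchanged: "grenzwert" "paddys" "grenzwert"
```

Indexing with the original vector (rather than the copy) means the condition still works even after some values in the copy have been recoded.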

- a) In the `bar.r` vector, change the `"grenzwert"` values to `"g"`. b) Now change the `"paddys"` values to `"p"`.

```
bar.r[bar == "grenzwert"] <- "g"
bar.r[bar == "paddys"] <- "p"
```

- a) In the `cologne.r` vector, change the `"gio"` values to `"G"`. b) Now change the `"calvinklein"` values to `"C"`.

```
cologne.r[cologne == "gio"] <- "G"
cologne.r[cologne == "calvinklein"] <- "C"
```

- In the `time.r` vector, change all time values greater than 280 to 280. Confirm that you did it correctly by calculating the maximum time in `time.r`.

```
time.r[time > 280] <- 280
max(time.r)
```

`## [1] 280`
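Capping values this way can also be written with base R's `pmin()`, which takes the element-wise minimum. This is an alternative sketch, not the approach the task asks for, and it uses a made-up vector (`toy.time`, `capped.a`, and `capped.b` are hypothetical names).

```r
# Hypothetical toy vector, NOT the survey data
toy.time <- c(150, 289, 300, 95)

# Approach from the task: logical indexing plus assignment
capped.a <- toy.time
capped.a[toy.time > 280] <- 280

# Alternative sketch: pmin() returns the smaller of each value and 280
capped.b <- pmin(toy.time, 280)

identical(capped.a, capped.b)   # TRUE
max(capped.a)                   # 280
```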

**Checkpoint!** If you got this far you’re doing great!

## Solving a paradox…

- Based on what you’ve learned so far, if someone wanted to talk to people as long as possible, what cologne should they wear?

`#I already said calvinklein...why are you asking me again?`

*Let’s see if your prediction holds up!*

- What was the mean time of people who went to Grenzwert and wore gio?

`mean(time[bar == "grenzwert" & cologne == "gio"])`

`## [1] 145.0111`

- What was the mean time of people who went to Grenzwert and wore calvinklein?

`mean(time[bar == "grenzwert" & cologne == "calvinklein"])`

`## [1] 38`

- What was the mean time of people who went to Paddys who wore gio?

`mean(time[bar == "paddys" & cologne == "gio"])`

`## [1] 294.7`

- What was the mean time of people who went to Paddys who wore calvinklein?

`mean(time[bar == "paddys" & cologne == "calvinklein"])`

`## [1] 184.8111`

- Based on what you’ve learned now, if someone’s goal is to talk to people as long as possible, what cologne should they wear?

```
# They should wear gio! Talking times are longer for gio in BOTH bars!
# Even though across bars talking times are longer for calvinklein. Crazy!
```

You can visualize the data using the following code:

```
# Combine vectors in a dataframe
survey.df <- data.frame(bar, cologne, time)
# Create a pirateplot of the data (requires the yarrr package to be installed)
yarrr::pirateplot(time ~ cologne + bar, data = survey.df)
```

What you’ve just seen is an example of **Simpson’s Paradox**. If you want to learn more, check out the Wikipedia page.

`# Yes I definitely looked at the Wikipedia page. Very interesting!`
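The reversal can be traced to unequal subgroup sizes. The sketch below rebuilds the overall mean for each cologne as a size-weighted average of the subgroup means reported above. The subgroup sizes (90 and 10) are an assumption: they are not stated in the assignment, but they add up to 100 per cologne and reproduce the reported overall means of 159.98 and 170.13.

```r
# Subgroup means reported in the answers above
gio.means <- c(grenzwert = 145.0111, paddys = 294.7)
ck.means  <- c(grenzwert = 38,       paddys = 184.8111)

# ASSUMED subgroup sizes (not given in the assignment)
gio.n <- c(grenzwert = 90, paddys = 10)
ck.n  <- c(grenzwert = 10, paddys = 90)

# Overall mean per cologne = size-weighted average of its subgroup means
weighted.mean(gio.means, gio.n)   # about 159.98
weighted.mean(ck.means, ck.n)     # about 170.13
```

Under these assumed sizes, most gio wearers were at Grenzwert, where talking times are short for everyone, which pulls the overall gio mean down even though gio wearers talk longer within each bar.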

## Some bigger challenges…

- What percent of women wore calvinklein?

```
# Create a vector cologne.w with the colognes of women only
cologne.w <- cologne[sex == "f"]
# What percent were calvinklein?
mean(cologne.w == "calvinklein")
```

`## [1] 0.51`

```
# OR, do it all at once
mean(cologne[sex == "f"] == "calvinklein")
```

`## [1] 0.51`

- What was the median time of people who went to grenzwert and wore gio but who talked more than 100 minutes?

`median(time[bar == "grenzwert" & cologne == "gio" & time > 100])`

`## [1] 144.5`

- What percent of participants *either* went to grenzwert and talked for less than 220 minutes *or* went to paddys and talked for more than 150 minutes but no longer than 250 minutes?

`mean((bar == "grenzwert" & time < 220) | (bar == "paddys" & time > 150 & time < 250))`

`## [1] 0.95`
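To see how a compound condition like this evaluates element by element, here is a small sketch on made-up data (`toy.bar`, `toy.time`, and `cond` are hypothetical names). In R, `&` binds more tightly than `|`, so the parentheses are not strictly required, but they make the two alternatives easier to read.

```r
# Hypothetical toy vectors, NOT the survey data
toy.bar  <- c("grenzwert", "grenzwert", "paddys", "paddys")
toy.time <- c(100, 250, 200, 120)

# Either (grenzwert and under 220) or (paddys and between 150 and 250)
cond <- (toy.bar == "grenzwert" & toy.time < 220) |
  (toy.bar == "paddys" & toy.time > 150 & toy.time < 250)

cond         # TRUE FALSE TRUE FALSE
mean(cond)   # 0.5
```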

- Let’s make the calvinklein wearers look better. For all of the calvinklein wearers, add a random sample from a normal distribution with mean 30 and standard deviation 5 to their original talking times.

```
# Step 1: Create a logical vector of who wore calvinklein
ck.log <- cologne == "calvinklein"
# Step 2: Do assignment!
time[ck.log] <- time[ck.log] + rnorm(n = sum(ck.log), mean = 30, sd = 5)
# OR do it all at once (run this INSTEAD of Steps 1 and 2, not in addition,
# otherwise the random bonus is added twice)
time[cologne == "calvinklein"] <- time[cologne == "calvinklein"] + rnorm(n = sum(cologne == "calvinklein"), mean = 30, sd = 5)
```
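Because `rnorm()` draws new random values on every call, the adjusted times will differ from run to run. If you want reproducible results, one option is to call `set.seed()` first. The sketch below shows this on made-up data and writes to a copy so the toy original stays intact (the vector names and the seed value 100 are arbitrary, hypothetical choices).

```r
# Hypothetical toy vectors, NOT the survey data
toy.cologne <- c("gio", "calvinklein", "calvinklein", "gio")
toy.time <- c(150, 40, 185, 300)

set.seed(100)   # arbitrary seed so the random draws can be reproduced

# Logical vector marking the calvinklein wearers
ck.log <- toy.cologne == "calvinklein"

# One random bonus per calvinklein wearer
bump <- rnorm(n = sum(ck.log), mean = 30, sd = 5)

# Add the bonus only to the calvinklein wearers, working on a copy
toy.time.new <- toy.time
toy.time.new[ck.log] <- toy.time[ck.log] + bump

toy.time.new
```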

## Submit!

Save and email your `wpa_X_LastFirst.R` file to me at nathaniel.phillips@unibas.ch. Then, go to https://goo.gl/forms/UblvQ6dvA76veEWu1 to complete the WPA submission form.