This is problem set #2, in which we hope you will practice using the tidyr and dplyr packages. Along with the tutorial we used in class, there are some great cheat sheets from RStudio.

The data set

This data set comes from a replication of Janiszewski and Uy (2008), who investigated whether the precision of the anchor for a price influences the amount of adjustment. (We use these data briefly in the tutorial.)

In the data frame, the Input.condition variable represents the experimental condition (under the rounded anchor, the rounded anchor, over the rounded anchor). Input.price1, Input.price2, and Input.price3 are the anchors for the Answer.plasma_cost, Answer.dog_cost, and Answer.sushi_cost items, respectively.

Part 1: Making these data tidy

Load the tidyverse package and the data

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.3.3
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Warning: package 'ggplot2' was built under R version 3.3.3
## Warning: package 'tibble' was built under R version 3.3.3
## Warning: package 'tidyr' was built under R version 3.3.3
## Warning: package 'readr' was built under R version 3.3.3
## Warning: package 'purrr' was built under R version 3.3.3
## Warning: package 'dplyr' was built under R version 3.3.3
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
d <- read.csv("data/janiszewski_rep_cleaned.csv")

This data frame is in wide format - that means that each row is a participant and there are multiple observations per participant. These data are not tidy.

To make this data tidy, we’ll do some cleanup. First, remove the columns you don’t need, using the verb select.

HINT: ?select and the examples of helper functions will help you be efficient.

d.tidy <- select(d, WorkerId, contains("Answer"), contains("Input"))
head(d.tidy)
##   WorkerId Answer.dog_cost Answer.plasma_cost Answer.sushi_cost
## 1        1            2300               4800               8.7
## 2        2            2450               4850               9.0
## 3        3             800               1200               8.0
## 4        4            2000               4200               9.0
## 5        5            2000               4500               8.5
## 6        6            1600               4600               8.0
##   Input.condition Input.price1 Input.price2 Input.price3
## 1            over         5012         2508         9.36
## 2            over         5012         2508         9.36
## 3            over         5012         2508         9.36
## 4            over         5012         2508         9.36
## 5            over         5012         2508         9.36
## 6            over         5012         2508         9.36
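As a minimal sketch of how the select helpers behave, the example below uses a small made-up data frame. The name `toy`, the `HITId` column, and all values are hypothetical and only mimic the naming pattern of `d`.

```r
library(dplyr)

# Hypothetical toy data frame that mimics the column naming pattern of `d`
toy <- data.frame(
  WorkerId = 1:2,
  HITId = c("a", "b"),                 # stands in for a column we do not need
  Answer.dog_cost = c(2300, 2450),
  Input.condition = c("over", "under"),
  Input.price1 = c(5012, 4988)         # made-up anchor values
)

# starts_with() matches a prefix; contains() would match the text anywhere
kept <- toy %>% select(WorkerId, starts_with("Answer"), starts_with("Input"))
names(kept)

# A leading minus drops the named column and keeps everything else
dropped <- toy %>% select(-HITId)
names(dropped)
```

Both calls return the same four columns here, so which form is more efficient depends on whether it is shorter to list what you keep or what you drop.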

Try renaming some variables using rename. A good naming scheme is:

Try using the %>% operator as well. So you will be “piping” d %>% rename(...).

d.tidy <- d.tidy %>% rename(Answer_dog_cost = Answer.dog_cost,
                       Answer_plasma_cost = Answer.plasma_cost,
                       Answer_sushi_cost = Answer.sushi_cost,
                       Input_condition = Input.condition,
                       Input_price1 = Input.price1,
                       Input_price2 = Input.price2,
                       Input_price3 = Input.price3)
head(d.tidy)
##   WorkerId Answer_dog_cost Answer_plasma_cost Answer_sushi_cost
## 1        1            2300               4800               8.7
## 2        2            2450               4850               9.0
## 3        3             800               1200               8.0
## 4        4            2000               4200               9.0
## 5        5            2000               4500               8.5
## 6        6            1600               4600               8.0
##   Input_condition Input_price1 Input_price2 Input_price3
## 1            over         5012         2508         9.36
## 2            over         5012         2508         9.36
## 3            over         5012         2508         9.36
## 4            over         5012         2508         9.36
## 5            over         5012         2508         9.36
## 6            over         5012         2508         9.36

OK, now for the tricky part. Use the verb gather to turn this into a tidy data frame.

HINT: look for online examples!

d.tidy <- d.tidy %>% gather(key = item, value = response, contains("Answer"))
head(d.tidy)
##   WorkerId Input_condition Input_price1 Input_price2 Input_price3
## 1        1            over         5012         2508         9.36
## 2        2            over         5012         2508         9.36
## 3        3            over         5012         2508         9.36
## 4        4            over         5012         2508         9.36
## 5        5            over         5012         2508         9.36
## 6        6            over         5012         2508         9.36
##              item response
## 1 Answer_dog_cost     2300
## 2 Answer_dog_cost     2450
## 3 Answer_dog_cost      800
## 4 Answer_dog_cost     2000
## 5 Answer_dog_cost     2000
## 6 Answer_dog_cost     1600
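In tidyr 1.0.0 and later, gather is superseded by pivot_longer. The sketch below compares the two on a hypothetical toy data frame (the name `toy` and its values are made up), assuming a tidyr version that provides both functions.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide data: one row per participant
toy <- data.frame(
  WorkerId = 1:2,
  Input_condition = c("over", "under"),
  Answer_dog_cost = c(2300, 2000),
  Answer_sushi_cost = c(8.7, 9.0)
)

# gather(): stacks the Answer columns one after another
long_gather <- toy %>%
  gather(key = item, value = response, contains("Answer"))

# pivot_longer(): same reshaping, but rows are ordered participant by participant
long_pivot <- toy %>%
  pivot_longer(cols = contains("Answer"),
               names_to = "item", values_to = "response")
```

The two results contain the same item/response pairs and differ only in row order and in the class of the returned object (pivot_longer returns a tibble).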

Now spread these data back into a wide format data frame.

d.wide <- d.tidy %>% spread(item, response)
head(d.wide)
##   WorkerId Input_condition Input_price1 Input_price2 Input_price3
## 1        1            over         5012         2508         9.36
## 2        2            over         5012         2508         9.36
## 3        3            over         5012         2508         9.36
## 4        4            over         5012         2508         9.36
## 5        5            over         5012         2508         9.36
## 6        6            over         5012         2508         9.36
##   Answer_dog_cost Answer_plasma_cost Answer_sushi_cost
## 1            2300               4800               8.7
## 2            2450               4850               9.0
## 3             800               1200               8.0
## 4            2000               4200               9.0
## 5            2000               4500               8.5
## 6            1600               4600               8.0
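The reverse step can be sketched the same way. In tidyr 1.0.0 and later, pivot_wider is the successor to spread. The data frame `toy_long` and its values are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical long data: one row per participant x item
toy_long <- data.frame(
  WorkerId = c(1, 1, 2, 2),
  item = c("Answer_dog_cost", "Answer_sushi_cost",
           "Answer_dog_cost", "Answer_sushi_cost"),
  response = c(2300, 8.7, 2000, 9.0)
)

# spread(): each distinct value of `item` becomes its own column
wide_spread <- toy_long %>% spread(item, response)

# pivot_wider(): the same reshaping with explicitly named arguments
wide_pivot <- toy_long %>%
  pivot_wider(names_from = item, values_from = response)
```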

Part 2: Manipulating the data

NOTE: If you generally use the plyr package, note that plyr and dplyr do not play nicely together, so things like the rename function won’t work unless you load dplyr after plyr.
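One way to sidestep the load-order problem is to call the function with an explicit namespace prefix. This is a sketch on a hypothetical one-column data frame, not a required part of the problem set.

```r
library(dplyr)

# Hypothetical toy data frame
toy <- data.frame(Answer.dog_cost = c(2300, 2450))

# dplyr::rename() always calls the dplyr version, whichever package was
# attached last. dplyr expects new_name = old_name, whereas plyr::rename()
# expects a named character vector of the form c(old_name = "new_name").
renamed <- toy %>% dplyr::rename(dog_cost = Answer.dog_cost)
names(renamed)
```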

As we said in class, a good thing to do is always to check histograms of the response variable. Do that now, using either regular base graphics (hist) or ggplot. What can you conclude?

ggplot(d.tidy, aes(response)) + geom_histogram() + facet_grid(item ~ .)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing non-finite values (stat_bin).

All of the items are on different scales. Because sushi’s values and variance are so much smaller than those of the other items, it is hard to show all three on a single shared axis. I’ll visualize each individually.

ggplot(d.tidy %>% filter(item == 'Answer_sushi_cost'), aes(response)) + geom_histogram()
## Warning: package 'bindrcpp' was built under R version 3.3.3
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing non-finite values (stat_bin).

ggplot(d.tidy %>% filter(item == 'Answer_plasma_cost'), aes(response)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(d.tidy %>% filter(item == 'Answer_dog_cost'), aes(response)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
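An alternative to the separate plots above is to let each facet have its own axes. The sketch below uses facet_wrap with scales = "free" on a hypothetical toy data frame. The bin count of 10 is an arbitrary choice for illustration.

```r
library(ggplot2)

# Hypothetical long-format data with two items on very different scales
toy <- data.frame(
  item = rep(c("Answer_dog_cost", "Answer_sushi_cost"), each = 4),
  response = c(2300, 2450, 800, 2000, 8.7, 9.0, 8.0, 8.5)
)

# scales = "free" gives every panel its own x and y range
p <- ggplot(toy, aes(response)) +
  geom_histogram(bins = 10) +
  facet_wrap(~ item, scales = "free")
p
```

Setting bins (or binwidth) explicitly also silences the `stat_bin()` message about the default of 30 bins.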

OK, now we turn to the actual data analysis. We’ll be using dplyr verbs to filter, group, mutate, and summarise the data.

Start by using summarise to compute the grand mean response. (Note that this is the same as taking the grand mean directly - the value will come later. Right now we’re just learning the syntax of that verb.)

grand_mean <- d.tidy %>% summarise(mean = mean(response, na.rm = TRUE))

grand_mean
##       mean
## 1 2022.557
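To make the equivalence concrete, the sketch below computes the same mean with summarise and with base R on a hypothetical four-row data frame. The only difference is the shape of the result.

```r
library(dplyr)

# Hypothetical responses, including one missing value
toy <- data.frame(response = c(2300, 8.7, NA, 2000))

# summarise() returns a one-row data frame with a column named `mean`
via_summarise <- toy %>% summarise(mean = mean(response, na.rm = TRUE))

# mean() on the column returns a bare number
via_base <- mean(toy$response, na.rm = TRUE)

via_summarise$mean == via_base
```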

This is a great time to get comfortable with the %>% operator. In brief, %>% allows you to pipe data from one function to another. So if you would have written:

d <- function(d, other_stuff)

you can now write:

d <- d %>% function(other_stuff)

That doesn’t seem like much, but it’s cool when you can replace:

d <- function1(d, other_stuff)
d <- function2(d, lots_of_other_stuff, more_stuff)
d <- function3(d, yet_more_stuff)

with

d <- d %>%
  function1(other_stuff) %>%
  function2(lots_of_other_stuff, more_stuff) %>%
  function3(yet_more_stuff)

In other words, you get to make a clean list of the things you want to do and chain them together without a lot of intermediate assignments.
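Here is a concrete version of that pattern with real dplyr verbs, as a sketch on a hypothetical toy data frame (the column `response_k` is made up for the example).

```r
library(dplyr)

# Hypothetical long-format data
toy <- data.frame(
  item = c("dog", "dog", "sushi", "sushi"),
  response = c(2300, 2000, 8.7, 9.0)
)

# Step-by-step version with intermediate assignments
step <- filter(toy, item == "dog")
step <- mutate(step, response_k = response / 1000)
step <- summarise(step, mean_k = mean(response_k))

# Equivalent piped version: each verb receives the previous result
# as its first argument
piped <- toy %>%
  filter(item == "dog") %>%
  mutate(response_k = response / 1000) %>%
  summarise(mean_k = mean(response_k))
```

Both versions give the same one-row result, but the piped version reads top to bottom as a list of steps.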

Let’s use that capacity to combine summarise with group_by, which allows us to break up our summary into groups. Try grouping by item and condition and taking means using summarise, chaining these two verbs with %>%.

group_mean <- d.tidy %>% 
        group_by(Input_condition, item) %>%
        summarise(mean = mean(response, na.rm = TRUE))

group_mean
## # A tibble: 9 x 3
## # Groups:   Input_condition [?]
##   Input_condition               item        mean
##            <fctr>              <chr>       <dbl>
## 1            over    Answer_dog_cost 1898.300000
## 2            over Answer_plasma_cost 4300.333000
## 3            over  Answer_sushi_cost    8.322414
## 4         rounded    Answer_dog_cost 1884.482414
## 5         rounded Answer_plasma_cost 4091.655172
## 6         rounded  Answer_sushi_cost    7.955517
## 7           under    Answer_dog_cost 1906.964286
## 8           under Answer_plasma_cost 4018.357143
## 9           under  Answer_sushi_cost    7.742500

Are there condition differences?

Based on only the means, it generally seems that participants in the over condition estimate item prices higher than participants in the rounded condition, and those in the rounded condition estimate higher prices than those in the under condition. This does not hold for the dog item, where the under condition has the highest mean. Additional analysis, such as confidence intervals, would be needed to determine whether there is a statistically significant difference between conditions, but from the means alone, it does seem like there are condition differences.
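The confidence intervals mentioned above could be computed with the same group_by and summarise pattern. This is a rough sketch, assuming a t-based 95% interval is appropriate (roughly normal responses within each cell). The toy data and the column names `mean_resp`, `se`, `ci_lower`, and `ci_upper` are hypothetical.

```r
library(dplyr)

# Hypothetical long-format data with one missing response
toy <- data.frame(
  condition = rep(c("over", "under"), each = 4),
  item = "dog",
  response = c(2300, 2450, 800, 2000, 1900, 2100, 2000, NA)
)

ci <- toy %>%
  group_by(condition, item) %>%
  summarise(
    n = sum(!is.na(response)),                  # non-missing observations
    mean_resp = mean(response, na.rm = TRUE),
    se = sd(response, na.rm = TRUE) / sqrt(n),  # standard error of the mean
    ci_lower = mean_resp - qt(0.975, df = n - 1) * se,
    ci_upper = mean_resp + qt(0.975, df = n - 1) * se
  )
ci
```

Overlapping intervals between conditions would suggest that the differences in means are not reliable, although a formal test would be needed to say so with confidence.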

How are we going to plot condition differences? They are fundamentally different magnitudes from one another. Really we need the size of the deviation from the anchor, which means we need the anchor value. Let’s go back to the data and add that in.

Take a look at these two complex piped expressions. You don’t have to modify them, but see what is being done here with gather, separate and spread. Run each part (e.g. the first verb, the first two verbs, etc.) and after doing each, look at head(d.tidy) to see what they do.

# clean up 
d.tidy <- d %>% 
  select(WorkerId, Input.condition, 
         starts_with("Answer"), 
         starts_with("Input")) %>%
  rename(workerid = WorkerId,
         condition = Input.condition,          
         plasma_anchor = Input.price1,
         dog_anchor = Input.price2,
         sushi_anchor = Input.price3,
         dog_cost = Answer.dog_cost,
         plasma_cost = Answer.plasma_cost, 
         sushi_cost = Answer.sushi_cost) 

# now do the gathering and spreading
d.tidy <- d.tidy %>%
  gather(name, cost, 
         dog_anchor, plasma_anchor, sushi_anchor, 
         dog_cost, plasma_cost, sushi_cost) %>%
  separate(name, c("item", "type"), "_") %>%
  spread(type, cost) 
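To see what each verb in the second expression contributes, here is a sketch that runs the same three steps one at a time on a hypothetical two-participant, two-item data frame. The values, including the anchors for the under condition, are made up.

```r
library(dplyr)
library(tidyr)

# Hypothetical data in the shape produced by the clean-up step
toy <- data.frame(
  workerid = 1:2,
  condition = c("over", "under"),
  dog_anchor = c(2508, 2492),
  dog_cost = c(2300, 2000),
  sushi_anchor = c(9.36, 8.64),
  sushi_cost = c(8.7, 9.0)
)

# Step 1: one row per participant x original column;
# `name` holds strings such as "dog_anchor"
step1 <- toy %>%
  gather(name, cost, dog_anchor, sushi_anchor, dog_cost, sushi_cost)

# Step 2: split `name` at "_" into item ("dog") and type ("anchor" or "cost")
step2 <- step1 %>% separate(name, c("item", "type"), "_")

# Step 3: one row per participant x item, with `anchor` and `cost`
# as separate columns
step3 <- step2 %>% spread(type, cost)
step3
```

The end result is tidy in the sense that matters for the next step: each row pairs an estimate with the anchor it was made against.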

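Now we can do the same thing as before but look at the relative difference between anchor and estimate. Let’s do this two ways: first by computing the absolute percent change from the anchor, and second by computing z-scores within each item.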

To do the first, use the mutate verb to add a percent change column, then compute the same summary as before.

pcts <- d.tidy %>% 
        mutate(pct_change = abs(cost - anchor) / anchor) %>%
        group_by(condition, item) %>%
        summarise(pct = mean(pct_change, na.rm = TRUE))
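As a worked example of the percent change formula, take the first participant shown in the head output earlier: a plasma estimate of 4800 against an anchor of 5012 (Input.price1, which the clean-up code maps to plasma_anchor).

```r
# Values taken from the first row of the head() output above
anchor <- 5012
cost <- 4800

# Absolute deviation from the anchor, as a proportion of the anchor
pct_change <- abs(cost - anchor) / anchor
round(pct_change, 4)   # 212 / 5012, about 0.0423
```

Because of abs(), this measure records how far the estimate moved from the anchor but not in which direction.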

To do the second, you will need to group once by item, then to ungroup and do the same thing as before. NOTE: you can use group_by(…, add=FALSE) to set new grouping levels, also.

HINT: scale(x) returns a complicated data structure that doesn’t play nicely with dplyr. try scale(x)[,1] to get what you need.
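A quick sketch of why the hint is needed: scale returns a one-column matrix with extra attributes, and indexing the first column gives back a plain numeric vector. The vector `x` is a hypothetical example.

```r
# Hypothetical numeric vector
x <- c(2300, 2450, 800, 2000)

# scale() returns a matrix carrying "scaled:center" and "scaled:scale" attributes
z_matrix <- scale(x)
is.matrix(z_matrix)

# Taking the first column drops the matrix structure and the attributes
z_vector <- scale(x)[, 1]
is.matrix(z_vector)

# The values are the usual z-scores: (x - mean) / standard deviation
all.equal(z_vector, (x - mean(x)) / sd(x))
```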

z.scores <- d.tidy %>%
        group_by(item) %>%
        mutate(z = scale(abs(anchor - cost) / anchor)[,1]) %>%
        group_by(condition, add = TRUE) %>%
        summarise(z = mean(z, na.rm = TRUE))
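In dplyr 1.0.0 and later, the add argument of group_by was renamed to .add, and the old spelling is deprecated. The sketch below repeats the z-score pipeline with the newer spelling on a hypothetical toy data frame, assuming a dplyr version of at least 1.0.0.

```r
library(dplyr)

# Hypothetical tidy data: one row per participant x item
toy <- data.frame(
  item = rep(c("dog", "sushi"), each = 4),
  condition = rep(c("over", "under"), times = 4),
  anchor = rep(c(2508, 9.36), each = 4),
  cost = c(2300, 2000, 2450, 1600, 8.7, 9.0, 8.0, 8.5)
)

z_toy <- toy %>%
  group_by(item) %>%                       # standardize within each item
  mutate(z = scale(abs(anchor - cost) / anchor)[, 1]) %>%
  group_by(condition, .add = TRUE) %>%     # now grouped by item AND condition
  summarise(z = mean(z, na.rm = TRUE))
z_toy
```

Because the z-scores are computed within item before the condition grouping is added, the condition means for each item are on a common scale and can be compared across items.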

OK, now here comes the end: we’re going to plot the differences and see if anything happened. First the percent change:

ggplot(pcts, aes(item, pct, fill = condition)) + geom_bar(stat = "identity", position = "dodge")

and the z-scores:

ggplot(z.scores, aes(item, z, fill = condition)) + geom_bar(stat = "identity", position = "dodge")

What do you see in this replication?

The original paper finds that more precise anchors lead to less adjustment away from the anchor, and that rounded anchors lead to estimates farther from the anchor. So the rounded condition is expected to have the highest percent change and the highest z-score. From the z-score and percent change graphs, there is no clear pattern in the data that would lead us to confirm the hypothesis. The rounded condition has the highest percent change in only two of the three items, which may seem like partial support, but in the z-scores, none of the items show the rounded condition with the highest value. The replication did not find the same effects as the original paper.
