This is problem set #1, in which we hope you will practice the packages tidyr and dplyr. There are some great cheat sheets from RStudio.

The data set

This data set comes from a replication of Janiszewski and Uy (2008), who investigated whether the precision of the anchor for a price influences the amount of adjustment.

In the data frame, the Input.condition variable represents the experimental condition (under the rounded anchor, the rounded anchor, over the rounded anchor). Input.price1, Input.price2, and Input.price3 are the anchors for the Answer.plasma_cost, Answer.dog_cost, and Answer.sushi_cost items, respectively.

Preliminaries

I pretty much always clear the workspace and load the same basic helper functions before starting an analysis.

library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats

Note that I’m using a “relative” path (the “helper”) rather than an absolute path (e.g. “/Users/mcfrank/code/projects/etc…”). The relative path means that someone else can run your code by changing to the right directory, while the absolute path will force someone else to make trivial changes every time they want to run it.

Part 1: Data cleaning

The first part of this exercise actually just consists of getting the data in a format usable for analysis. This is not trivial. Let’s try it:

d <- read.csv("data/janiszewski_rep_exercise.csv")

Fine, right? Why can’t we go forward with the analysis?

HINT: try computing some summary statistics for the different items. Also, are there any participants that did the task more than once?

#str(d)
#head(d)
#dim(d)

summary(d$Input.price1)
## 4,988 5,000 5,012 
##    30    30    30
summary(d$Answer.dog_cost)
##         1000         1500         1600         1658         1687 
##            6            6            2            1            1 
##         1700         1749         1750         1790         1800 
##            2            1            2            1            2 
##         1850         1875         1900         1950         1990 
##            1            1            4            1            1 
##        2,000         2000         2100         2192         2200 
##            1           27            2            1            8 
##         2250         2292         2299         2300         2325 
##            3            1            1            3            1 
##         2350         2400         2450         2482      2499.99 
##            1            1            2            1            1 
##         2500          500          800 five hundred 
##            1            1            1            1
summary(d$Answer.plasma_cost)
##    0.45    1200    1500    2000    2500    2999    3000    3200    3250 
##       1       1       1       1       4       1       3       2       1 
##    3258    3458    3499    3500    3800   4,000    4000    4120    4200 
##       1       1       1       6       1       1       9       1       2 
##    4300    4350    4400    4495    4500    4540    4578    4600    4650 
##       2       1       1       1      16       1       1       1       1 
##    4675    4699    4700    4750    4800    4849    4850    4888    4899 
##       1       1       5       2       4       1       1       2       1 
##    4900    4968    4995    4997    4999 4999.99    5000 
##       1       1       2       1       2       1       1
summary(d$Answer.sushi_cost)
##           6.5   6.78   6.95   6.99      7   7,75   7.02   7.35    7.5 
##      1      1      1      2      3      3      1      1      1     10 
##   7.56   7.58   7.69   7.75    7.8   7.89   7.99      8   8.25   8.46 
##      1      1      1      3      2      2      9     18      2      1 
##   8.49    8.5   8.69    8.7   8.95   8.99      9    9.3 ehight 
##      1     12      1      1      2      4      3      1      1
#These variables should be stored as numeric variables, not factors!

#Check which participants did task more than once 
d_dups <- d %>%
  group_by(WorkerId) %>% 
    filter(n()>1) %>% 
    summarize(n=n())

d_dups
## # A tibble: 2 × 2
##         WorkerId     n
##           <fctr> <int>
## 1 A1VYX5VKZ0CTWU     3
## 2  A5IUO2T5MZQUZ     2
#We see that two participants did the task multiple times...

Fix the data file programmatically, i.e., write code that transforms the unclean data frame into a clean data frame.

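#Convert the cost columns from factor to numeric.
#Go through as.character() first: as.numeric() on a factor returns the
#internal level codes, not the values. Entries that are not plain numerals
#(e.g. "2,000", "five hundred") become NA and still need to be fixed.
d$Answer.dog_cost <- as.numeric(as.character(d$Answer.dog_cost))
d$Answer.plasma_cost <- as.numeric(as.character(d$Answer.plasma_cost))
d$Answer.sushi_cost <- as.numeric(as.character(d$Answer.sushi_cost))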
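One way to handle the messy entries seen in the summaries above is a small cleaning function. The following is only a sketch under stated assumptions: `clean_cost` and the `words` lookup are hypothetical names, reading "ehight" as 8 is a guess about what the participant meant, and the comma rule (a comma followed by exactly three digits is a thousands separator, any other comma is a decimal mark) is an assumption that happens to fit values like "2,000" and "7,75".

```r
# Hypothetical helper: turn a messy factor/character column into numbers.
clean_cost <- function(x) {
  x <- trimws(as.character(x))

  # Assumed hand-written lookup for answers that were spelled out in words.
  words <- c("five hundred" = "500", "ehight" = "8")
  hit <- x %in% names(words)
  x[hit] <- words[x[hit]]

  # A comma followed by exactly three digits is treated as a thousands separator.
  x <- gsub(",(?=\\d{3}(\\D|$))", "", x, perl = TRUE)
  # Any remaining comma is treated as a decimal mark.
  x <- gsub(",", ".", x, fixed = TRUE)

  # Anything still unparseable (e.g. an empty string) becomes NA.
  suppressWarnings(as.numeric(x))
}

example <- clean_cost(c("2,000", "2000", "7,75", "five hundred", "2499.99", ""))
print(example)
```

Applied to the exercise data, this could be used as `d$Answer.dog_cost <- clean_cost(d$Answer.dog_cost)`, and likewise for the other cost and price columns. Whether each rule is appropriate should be checked against the raw values first.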

Part 2: Making these data tidy

Now let’s start with the cleaned data, so that we are all beginning from the same place.

d <- read.csv("data/janiszewski_rep_cleaned.csv")

This data frame is in wide format - that means that each row is a participant and there are multiple observations per participant. This data is not tidy.

To make this data tidy, we’ll do some cleanup. First, remove the columns you don’t need, using the verb select.

HINT: ?select and the examples of helper functions will help you be efficient.

d.tidy <- d %>%
  select(WorkerId,
         Input.condition,
         Input.price1,
         Input.price2,
         Input.price3,
         Answer.dog_cost,
         Answer.plasma_cost,
         Answer.sushi_cost)

Try renaming some variables using rename. A good naming scheme is:

Try using the %>% operator as well. So you will be “piping” d %>% rename(...).

d.tidy <- d.tidy %>%
  rename(input_condition = Input.condition,
         input_plasma = Input.price1,
         input_dog = Input.price2,
         input_sushi = Input.price3,
         answer_dog = Answer.dog_cost,
         answer_plasma = Answer.plasma_cost,
         answer_sushi = Answer.sushi_cost) 

#Check naming
str(d.tidy)
## 'data.frame':    87 obs. of  8 variables:
##  $ WorkerId       : Factor w/ 87 levels "A10316ZXDCW4TT",..: 5 20 41 58 23 22 33 36 81 63 ...
##  $ input_condition: Factor w/ 3 levels "over","rounded",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ input_plasma   : int  5012 5012 5012 5012 5012 5012 5012 5012 5012 5012 ...
##  $ input_dog      : int  2508 2508 2508 2508 2508 2508 2508 2508 2508 2508 ...
##  $ input_sushi    : num  9.36 9.36 9.36 9.36 9.36 9.36 9.36 9.36 9.36 9.36 ...
##  $ answer_dog     : num  2300 2450 800 2000 2000 1600 1750 2200 2000 1000 ...
##  $ answer_plasma  : num  4800 4850 1200 4200 4500 4600 4650 4800 4500 3000 ...
##  $ answer_sushi   : num  8.7 9 8 9 8.5 8 6.95 8 8.99 9 ...

OK, now for the tricky part. Use the verb gather to turn this into a tidy data frame.

HINT: look for online examples!

d.tidy <- d.tidy %>%
  gather(item,cost,
         input_plasma,
         input_dog,
         input_sushi,
         answer_dog,
         answer_plasma,
         answer_sushi) %>%
  separate(item,
           into = c("item","type"),
           sep = "_")

Bonus problem: spread these data back into a wide format data frame.

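#Recombine the two key columns into the original column names,
#then spread the cost values back out, one column per name.
d.wide <- d.tidy %>%
  unite(name, item, type, sep = "_") %>%
  spread(name, cost)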

Part 3: Manipulating the data using dplyr

NOTE: If you generally use the plyr package, note that plyr and dplyr do not play nicely together, so things like the rename function won’t work unless you load dplyr after plyr.

As we said in class, a good thing to do is always to check histograms of the response variable. Do that now, using either regular base graphics or ggplot. What can you conclude?

ggplot(d.tidy,aes(cost)) +
  geom_histogram(bins = 30)
## Warning: Removed 1 rows containing non-finite values (stat_bin).

From this histogram, we see that the range of responses is 0 to ~5000, and the majority of the responses for cost lie close to 0. There are also a few peak frequencies throughout the chart, possibly consistent with the other anchor points.

Try also using the dplyr distinct function to remove the duplicate participants from the raw csv file that you discovered in part 1.

d.raw <- read.csv("data/janiszewski_rep_exercise.csv")
d.unique.subs <- d.raw %>%
  distinct(WorkerId, .keep_all = TRUE)
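To see what distinct does with `.keep_all = TRUE`, here is a minimal sketch on made-up data (the worker IDs and answers below are invented for illustration). It keeps the first row it encounters for each WorkerId and drops the later ones.

```r
library(dplyr)

# Toy data: w1 and w2 each appear twice (values are made up).
toy <- data.frame(
  WorkerId = c("w1", "w2", "w1", "w3", "w2"),
  answer   = c(10, 20, 30, 40, 50),
  stringsAsFactors = FALSE
)

# Keep one row per WorkerId, retaining all other columns.
first_only <- toy %>%
  distinct(WorkerId, .keep_all = TRUE)

print(first_only)
```

Note that "first" here means first in row order, so this only keeps each participant's first attempt if the file happens to be sorted chronologically, which is an assumption worth checking.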

OK, now we turn to the actual data analysis. We’ll be using dplyr verbs to filter, group, mutate, and summarise the data.

Start by using summarise to compute the grand mean cost. (Note that this is the same as taking the grand mean directly; the value of summarise will become clear later. Right now we’re just learning the syntax of that verb.)

d.grandmean <- d.tidy %>%
  na.omit() %>%
  summarise(grand_mean = mean(cost))

This is a great time to get comfortable with the %>% operator. In brief, %>% allows you to pipe data from one function to another. So if you would have written:

d <- function(d, other_stuff)

you can now write:

d <- d %>% function(other_stuff)

That doesn’t seem like much, but it’s cool when you can replace:

d <- function1(d, other_stuff)
d <- function2(d, lots_of_other_stuff, more_stuff)
d <- function3(d, yet_more_stuff)

with

d <- d %>%
  function1(other_stuff) %>%
  function2(lots_of_other_stuff, more_stuff) %>%
  function3(yet_more_stuff)

In other words, you get to make a clean list of the things you want to do and chain them together without a lot of intermediate assignments.
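As a concrete illustration of the equivalence, here is a small sketch using R's built-in mtcars data rather than the exercise data. The step-by-step version and the piped version should give the same result.

```r
library(dplyr)

# Step by step, with intermediate assignments.
step1 <- filter(mtcars, cyl == 4)
step2 <- summarise(step1, mean_mpg = mean(mpg))

# The same two operations chained with the pipe.
piped <- mtcars %>%
  filter(cyl == 4) %>%
  summarise(mean_mpg = mean(mpg))

# TRUE if both approaches agree.
same <- isTRUE(all.equal(step2, piped))
print(same)
```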

Let’s use that capacity to combine summarise with group_by, which allows us to break up our summary into groups. Try grouping by item and condition and taking means using summarise, chaining these two verbs with %>%.

d.means <- d.tidy %>%
  na.omit() %>% #remove rows with NA
  group_by(item,input_condition) %>%
  summarise(mean = mean(cost))

d.means
## Source: local data frame [6 x 3]
## Groups: item [?]
## 
##     item input_condition     mean
##    <chr>          <fctr>    <dbl>
## 1 answer            over 2092.139
## 2 answer         rounded 1994.698
## 3 answer           under 1977.688
## 4  input            over 2509.787
## 5  input         rounded 2503.000
## 6  input           under 2496.213

OK, it’s looking like there are maybe some differences between conditions, but how are we going to plot these? They are fundamentally different magnitudes from one another.

Really we need the size of the deviation from the anchor, which means we need the anchor value. Let’s go back to the data and add that in.

Take a look at this complex expression. You don’t have to modify it, but see what is being done here with gather, separate and spread. Run each part (e.g. the first verb, the first two verbs, etc.) and after doing each, look at head(d.tidy) to see what they do.

d.tidy <- d %>% 
  select(WorkerId, Input.condition, 
         starts_with("Answer"), 
         starts_with("Input")) %>%
  rename(workerid = WorkerId,
         condition = Input.condition,          
         plasma_anchor = Input.price1,
         dog_anchor = Input.price2,
         sushi_anchor = Input.price3,
         dog_cost = Answer.dog_cost,
         plasma_cost = Answer.plasma_cost, 
         sushi_cost = Answer.sushi_cost) %>%
  gather(name, cost, 
         dog_anchor, plasma_anchor, sushi_anchor, 
         dog_cost, plasma_cost, sushi_cost) %>%
  separate(name, c("item", "type"), "_") %>%
  spread(type, cost) 
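To make the gather, separate, and spread steps easier to follow, here is a sketch of the same sequence on a tiny made-up data frame. The participant IDs and numbers are invented, and only one item is included to keep the output small.

```r
library(dplyr)
library(tidyr)

# Toy wide data: one row per participant (values are made up).
wide <- data.frame(
  id         = c("p1", "p2"),
  dog_anchor = c(100, 200),
  dog_cost   = c(90, 150),
  stringsAsFactors = FALSE
)

# gather: stack the two value columns into name/value pairs (4 rows).
# separate: split "dog_anchor" into item = "dog" and type = "anchor".
long <- wide %>%
  gather(name, value, dog_anchor, dog_cost) %>%
  separate(name, into = c("item", "type"), sep = "_")

# spread: give each type its own column again, one row per id and item.
tidy <- long %>%
  spread(type, value)

print(long)
print(tidy)
```

After the final step, each row holds both the anchor and the cost for one participant and item, which is what makes a per-row comparison between the two possible.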

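Now we can do the same thing as before but look at the relative difference between anchor and estimate. Let’s do this two ways: first as the absolute percent change from the anchor, and second as z-scores computed within each item.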

To do the first, use the mutate verb to add a percent change column, then compute the same summary as before.

pcts <- d.tidy %>% 
  na.omit() %>%
  mutate(pct_change = abs(((cost - anchor)/anchor) * 100)) %>%
  group_by(item,condition) %>%
  summarise(mean_pct_change = mean(pct_change))
pcts
## Source: local data frame [9 x 3]
## Groups: item [?]
## 
##     item condition mean_pct_change
##    <chr>    <fctr>           <dbl>
## 1    dog      over        24.31021
## 2    dog   rounded        24.62070
## 3    dog     under        23.47655
## 4 plasma      over        14.19926
## 5 plasma   rounded        18.16690
## 6 plasma     under        19.43951
## 7  sushi      over        11.08532
## 8  sushi   rounded        11.60536
## 9  sushi     under        10.38773

To do the second, you will need to group once by item, then to ungroup and do the same thing as before. NOTE: you can use group_by(…, add=FALSE) to set new grouping levels, also.

HINT: scale(x) returns a complicated data structure that doesn’t play nicely with dplyr. try scale(x)[,1] to get what you need.
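The hint can be sketched on made-up data as follows. The toy data frame and its column names are invented for illustration. scale(x) returns a one-column matrix, and taking `[, 1]` pulls out a plain numeric vector that mutate can store; because the data are still grouped when mutate runs, the standardization happens separately within each item.

```r
library(dplyr)

# Toy data: two items on very different scales (values are made up).
toy <- data.frame(
  item = rep(c("a", "b"), each = 3),
  x    = c(1, 2, 3, 10, 20, 30),
  stringsAsFactors = FALSE
)

# scale() on a vector returns a matrix, not a vector.
is_mat <- is.matrix(scale(toy$x))

# Standardize within item, then drop the grouping.
z <- toy %>%
  group_by(item) %>%
  mutate(z = scale(x)[, 1]) %>%
  ungroup()

print(z)
```

Both items end up with the same z-scores here even though their raw values differ by a factor of ten, which is the point of standardizing within item.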

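pcts_z <- d.tidy %>% 
  na.omit() %>%
  # NOTE: ungroup() runs before scale(), so the z-scores below are computed
  # over all items pooled, not within item. To standardize within item,
  # move ungroup() to after the scale() step.
  group_by(item) %>%
  ungroup(item) %>%
  mutate(pct = abs(((cost - anchor)/anchor) * 100)) %>%
  mutate(pct = scale(pct)[,1]) %>%
  group_by(item,condition) %>%
  summarise(pct = mean(pct))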
pcts_z
## Source: local data frame [9 x 3]
## Groups: item [?]
## 
##     item condition         pct
##    <chr>    <fctr>       <dbl>
## 1    dog      over  0.45062321
## 2    dog   rounded  0.47112969
## 3    dog     under  0.39556518
## 4 plasma      over -0.21714651
## 5 plasma   rounded  0.04489291
## 6 plasma     under  0.12894183
## 7  sushi      over -0.42280425
## 8  sushi   rounded -0.38845852
## 9  sushi     under -0.46887614

OK, now here comes the end: we’re going to plot the differences and see if anything happened. First the percent change:

ggplot(pcts,aes(item,mean_pct_change, fill = condition)) +
  geom_bar(position = "dodge",stat="identity")

and the z-scores:

ggplot(pcts_z,aes(item,pct,fill = condition)) +
  geom_bar(position = "dodge",stat="identity")

Oh well. This replication didn’t seem to work out straightforwardly.

END