Part 1: Preprocessing

First, let’s load the tidyverse:

library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Warning: package 'tidyr' was built under R version 3.3.2
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
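The conflicts listed above mean that dplyr's filter() and lag() now mask the stats versions. If you ever need the masked time-series functions, qualify them with the package name:

# stats::filter is a moving-average/time-series filter, not row filtering:
stats::filter(presidents, rep(1/3, 3))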

Next, we load the data and clean it, removing duplicate (or triplicate) submissions and nonsensical or incorrectly formatted values.

d <- read_csv("problem_sets/data/janiszewski_rep_exercise.csv")
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   Reward = col_double(),
##   MaxAssignments = col_integer(),
##   AssignmentDurationInSeconds = col_integer(),
##   AutoApprovalDelayInSeconds = col_integer(),
##   NumberOfSimilarHITs = col_integer(),
##   WorkTimeInSeconds = col_integer(),
##   Input.price1 = col_number(),
##   Input.price2 = col_number(),
##   Input.price3 = col_double(),
##   Answer.plasma_cost = col_number()
## )
## See spec(...) for full column specifications.
d <- d %>%
  filter(WorkerId != "A1VYX5VKZ0CTWU",
         WorkerId != "A5IUO2T5MZQUZ",
         Answer.sushi_cost != "ehight",
         Answer.sushi_cost != "7,75",
         Answer.dog_cost != "five hundred",
         Answer.dog_cost != "2,000") %>%
  mutate(Answer.dog_cost = as.numeric(Answer.dog_cost),
         Answer.sushi_cost = as.numeric(Answer.sushi_cost))
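As a quick sanity check (a sketch, using the columns above), we can confirm that no duplicated workers remain:

# Should return zero rows after the cleaning above
d %>% count(WorkerId) %>% filter(n > 1)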

Part 2: Tidy Data

We just cleaned the data, but for consistency with the rest of the assignment, we’ll now switch to using the provided pre-cleaned data.

d <- read.csv("problem_sets/data/janiszewski_rep_cleaned.csv")

Now we remove the columns we don't need using select, and drop any rows with NAs using na.omit.

vars_of_interest <- c("Input.condition", "Input.price1", "Input.price2",
                      "Input.price3", "Answer.dog_cost", "Answer.sushi_cost",
                      "Answer.plasma_cost")
d.tidy <- d %>% select(one_of(vars_of_interest)) %>% na.omit()
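As a pipe-friendly alternative to na.omit, tidyr's drop_na() with no arguments does the same row-dropping (a sketch, assuming your tidyr is recent enough to have drop_na()):

d.tidy <- d %>% select(one_of(vars_of_interest)) %>% drop_na()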

Next we will rename our columns to be consistent with the following scheme, as laid out in the assignment:

* consistent in case
* consistent in using "." or "_" as a separator ("_" is usually preferred)
* concise but still comprehensible to others

d.tidy <- d.tidy %>%
  rename(condition = Input.condition,
         plasma_anchor = Input.price1,
         dog_anchor = Input.price2,
         sushi_anchor = Input.price3,
         dog_cost = Answer.dog_cost,
         sushi_cost = Answer.sushi_cost,
         plasma_cost = Answer.plasma_cost)

Having renamed our data, we gather it into tidy format.

d.gathered <- d.tidy %>%
  mutate(index=1:n()) %>%
  gather(measure_name, measure_value, dog_cost, sushi_cost, plasma_cost)

Since we’ve given each participant an index, we can now spread the data back into wide format:

d.wide <- d.gathered %>% spread(measure_name, measure_value)
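A quick round-trip check (a sketch, using the names above): the only column difference from d.tidy should be the added index, and no rows should be lost.

setdiff(names(d.wide), names(d.tidy))  # should be just "index"
nrow(d.wide) == nrow(d.tidy)           # should be TRUE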

Part 3: Manipulating the data using dplyr

First, let’s plot some histograms to investigate:

qplot(measure_value, 
      fill = condition,
      facets = measure_name ~ ., # facets = measure_name ~ condition,
      data = d.gathered,
      binwidth=300)

One thing we can conclude is that sushi_cost is so much lower than the other costs that, on a shared x-axis, this type of histogram is uninformative for it. The other thing to note is that the three conditions seem to draw from similar distributions, although under has the highest count and over the lowest in every panel.
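One way to make the sushi_cost panel legible is to give each item its own x-axis. A sketch using ggplot() directly, since qplot's facets argument doesn't support per-panel free scales:

ggplot(d.gathered, aes(measure_value, fill = condition)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ measure_name, ncol = 1, scales = "free_x")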

As an aside, we can also use distinct to remove duplicate submissions (this is so useful!):

d.raw <- read.csv("problem_sets/data/janiszewski_rep_exercise.csv")
d.unique.subs <- d.raw %>% distinct(WorkerId, .keep_all = TRUE)  # .keep_all keeps all columns

Now on to analysis. First: the mean of every value in the data set:

mean(d.gathered$measure_value)
## [1] 2014.98
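The dplyr equivalent, which will generalize once we add grouping:

d.gathered %>% summarize(avg = mean(measure_value))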

Let’s combine summarise with group_by, which allows us to break our summary into groups: here we group by condition and item, take means with summarise, and chain the two verbs with %>%.

d.gathered %>%
  group_by(condition, measure_name) %>%
  summarize(avg = mean(measure_value, na.rm = TRUE))
## Source: local data frame [9 x 3]
## Groups: condition [?]
## 
##   condition measure_name         avg
##      <fctr>        <chr>       <dbl>
## 1      over     dog_cost 1894.793103
## 2      over  plasma_cost 4310.689310
## 3      over   sushi_cost    8.322414
## 4   rounded     dog_cost 1884.482414
## 5   rounded  plasma_cost 4091.655172
## 6   rounded   sushi_cost    7.955517
## 7     under     dog_cost 1906.964286
## 8     under  plasma_cost 4018.357143
## 9     under   sushi_cost    7.742500
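For a more compact view, we can spread that summary so each item gets its own column (a sketch reusing the pipe above):

d.gathered %>%
  group_by(condition, measure_name) %>%
  summarize(avg = mean(measure_value, na.rm = TRUE)) %>%
  spread(measure_name, avg)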

Next we’ll look for deviation from the anchor, using this long chain of pipes, which redoes all of the tidying above in one fell swoop:

d.tidy <- d %>% 
  select(WorkerId, Input.condition, 
         starts_with("Answer"), 
         starts_with("Input")) %>%
  rename(workerid = WorkerId,
         condition = Input.condition,          
         plasma_anchor = Input.price1,
         dog_anchor = Input.price2,
         sushi_anchor = Input.price3,
         dog_cost = Answer.dog_cost,
         plasma_cost = Answer.plasma_cost, 
         sushi_cost = Answer.sushi_cost) %>%
  gather(name, cost, 
         dog_anchor, plasma_anchor, sushi_anchor, 
         dog_cost, plasma_cost, sushi_cost) %>%
  separate(name, c("item", "type"), "_") %>%
  spread(type, cost) 
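The result has one row per worker-item pair, with separate anchor and cost columns; glimpse() shows the shape:

glimpse(d.tidy)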

Next we’re going to use the tidy format to look at the relative difference between anchor and estimate, in two ways:

* by computing absolute percentage change in price, and
* by computing z-scores over items.

For the first, we mutate to add a percent-change column, then compute the same grouped summary as before.

pcts <- d.tidy %>% 
  mutate(pct_change = 100 * abs(anchor - cost) / cost) %>%
  group_by(condition, item) %>% 
  summarize(avg = mean(pct_change, na.rm = TRUE))

To do the second, we first group by item to compute z-scores within each item, then ungroup, regroup by condition and item, and summarise as before. (Note: you can also use group_by(..., add = FALSE) to set new grouping levels.)

A hint from the pset: scale(x) returns a matrix with scaling attributes that doesn’t play nicely with dplyr; use scale(x)[,1] to get a plain numeric vector.
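To see why, compare the two on a toy vector (a minimal sketch):

x <- c(1, 2, 3)
scale(x)       # 1-column matrix carrying "scaled:center" and "scaled:scale" attributes
scale(x)[, 1]  # plain numeric vector: -1, 0, 1

Now the z-scores: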

z.scores <- d.tidy %>%
  group_by(item) %>%
  mutate(z=scale(cost)[,1]) %>%
  ungroup() %>%
  group_by(condition, item) %>%
  summarize(z=mean(z, na.rm=TRUE))

Finally, we graph everything. (Note that the graphing code given in the pset did not work for me, because passing stat to qplot is deprecated.)

ggplot(pcts, aes(x=item, y=avg, fill=condition)) + geom_bar(stat="identity", position="dodge")

# qplot(item, avg, fill=condition, 
#       position="dodge",
#       stat="identity", geom="bar", 
#       data=pcts)
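In newer ggplot2 (2.2.0+, an assumption about your installed version), geom_col() is shorthand for geom_bar(stat = "identity"), so the plot above could also be written:

ggplot(pcts, aes(x = item, y = avg, fill = condition)) +
  geom_col(position = "dodge")  # geom_col implies stat = "identity"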

and the z-scores:

ggplot(z.scores, aes(x=item, y=z, fill=condition)) + geom_bar(stat="identity", position="dodge")

# qplot(item, z, fill=condition, 
#       position="dodge",
#       stat="identity", geom="bar", 
#       data=z.scores)