First, let's load the tidyverse:
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Warning: package 'tidyr' was built under R version 3.3.2
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
library(dplyr)
Next, we load the data and clean it, removing duplicate (or triplicate) submissions and nonsensical or incorrectly-formatted values.
d <- read_csv("problem_sets/data/janiszewski_rep_exercise.csv")
## Parsed with column specification:
## cols(
## .default = col_character(),
## Reward = col_double(),
## MaxAssignments = col_integer(),
## AssignmentDurationInSeconds = col_integer(),
## AutoApprovalDelayInSeconds = col_integer(),
## NumberOfSimilarHITs = col_integer(),
## WorkTimeInSeconds = col_integer(),
## Input.price1 = col_number(),
## Input.price2 = col_number(),
## Input.price3 = col_double(),
## Answer.plasma_cost = col_number()
## )
## See spec(...) for full column specifications.
# Drop the workers with duplicate (or triplicate) submissions
d <- d %>% filter(WorkerId != "A1VYX5VKZ0CTWU")
d <- d %>% filter(WorkerId != "A5IUO2T5MZQUZ")
# Drop responses that are nonsensical or incorrectly formatted
d <- d %>% filter(Answer.sushi_cost != "ehight")
d <- d %>% filter(Answer.sushi_cost != "7,75")
d <- d %>% filter(Answer.dog_cost != "five hundred")
d <- d %>% filter(Answer.dog_cost != "2,000")
# Convert the remaining character responses to numbers
d$Answer.dog_cost <- as.numeric(d$Answer.dog_cost)
d$Answer.sushi_cost <- as.numeric(d$Answer.sushi_cost)
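To illustrate why the badly formatted rows have to go before the conversion, here is a minimal sketch on made-up data. The `toy` data frame and its values are hypothetical; only the column name mirrors the real file. It shows that as.numeric() turns any string it cannot parse into NA, which can also be used to filter such rows without naming each one.

```r
library(dplyr)

# Hypothetical responses, not taken from the real data set
toy <- data.frame(Answer.sushi_cost = c("8", "ehight", "7,75", "9.50"),
                  stringsAsFactors = FALSE)

# as.numeric() returns NA (with a warning) for strings it cannot parse
parsed <- suppressWarnings(as.numeric(toy$Answer.sushi_cost))

# Keep only the rows whose value parses cleanly, then convert the column
toy_clean <- toy %>%
  filter(!is.na(suppressWarnings(as.numeric(Answer.sushi_cost)))) %>%
  mutate(Answer.sushi_cost = as.numeric(Answer.sushi_cost))
```

Filtering on parse failure is more general than listing bad values by hand, but it also silently drops anything unexpected, so it is worth inspecting the dropped rows first.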
We just cleaned the data, but for consistency with the rest of the assignment, we'll now switch to using the provided pre-cleaned data.
d <- read.csv("problem_sets/data/janiszewski_rep_cleaned.csv")
Now we use select to keep only the columns we need, and na.omit to drop any rows with NAs.
vars_of_interest<- c("Input.condition", "Input.price1", "Input.price2", "Input.price3", "Answer.dog_cost", "Answer.sushi_cost", "Answer.plasma_cost")
d.tidy <- d %>% select(one_of(vars_of_interest)) %>% na.omit
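The division of labour between the two verbs can be seen on a small made-up example (the `toy` data frame and `keep` vector are hypothetical): select() works on columns, na.omit() works on rows. Newer tidyselect versions recommend all_of() or any_of() in place of one_of(), but one_of() is used here to match the code above.

```r
library(dplyr)

# Hypothetical data: one NA in column a, plus a column b we do not need
toy <- data.frame(id = 1:3,
                  a = c(1, NA, 3),
                  b = c("x", "y", "z"),
                  stringsAsFactors = FALSE)

keep <- c("id", "a")

toy_small <- toy %>%
  select(one_of(keep)) %>%  # keeps columns id and a, drops b
  na.omit()                 # drops the row where a is NA
```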
Next we will rename our data to be consistent with the following scheme, as laid out in the assignment:
* consistent with case
* consistent with "." or "_" ("_" is usually preferred)
* as concise as will be comprehensible to others
d.tidy <- d.tidy %>% rename(condition=Input.condition, plasma_anchor=Input.price1, dog_anchor=Input.price2, sushi_anchor=Input.price3, dog_cost=Answer.dog_cost, sushi_cost=Answer.sushi_cost, plasma_cost=Answer.plasma_cost)
Having renamed our data, we gather it into tidy format.
d.gathered <- d.tidy %>%
mutate(index=1:n()) %>%
gather(measure_name, measure_value, dog_cost, sushi_cost, plasma_cost)
Since we've given each participant an index, we can now put the data back into wide format:
d.wide <- d.gathered %>% spread(measure_name, measure_value)
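The role of the index column is easiest to see on a toy round trip. In this sketch the data are made up, with two participants in the same condition; without `index`, spread() would have no way to tell their rows apart and would stop with an error about duplicate identifiers.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: two participants who share a condition
toy <- data.frame(condition = c("over", "over"),
                  dog_cost = c(1900, 2000),
                  sushi_cost = c(8, 7),
                  stringsAsFactors = FALSE)

# Wide to long: one row per participant per measure
toy_long <- toy %>%
  mutate(index = 1:n()) %>%
  gather(measure_name, measure_value, dog_cost, sushi_cost)

# Long back to wide: index keeps the two participants separate
toy_wide <- toy_long %>%
  spread(measure_name, measure_value)
```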
First, let's plot some histograms to investigate:
qplot(measure_value,
fill = condition,
facets = measure_name ~ ., # facets = measure_name ~ condition,
data = d.gathered,
binwidth=300)
For one thing, sushi_cost is so much lower than the other costs that a histogram with a shared x-axis and a binwidth of 300 is not informative for it. The other thing to note is that the three conditions seem to draw from the same distribution, but under has the highest counts and over has the lowest in all cases.
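One possible way around the scale problem, sketched here on simulated data rather than the real responses, is to give each facet its own x-axis and to set the number of bins instead of a fixed binwidth. The `toy` data frame and its distribution parameters are assumptions made up for illustration.

```r
library(ggplot2)

set.seed(1)

# Simulated values on two very different scales
toy <- data.frame(
  measure_name = rep(c("dog_cost", "sushi_cost"), each = 50),
  measure_value = c(rnorm(50, mean = 1900, sd = 300),
                    rnorm(50, mean = 8, sd = 2)),
  condition = rep(c("over", "under"), times = 50)
)

# scales = "free_x" lets each panel choose its own x-axis range
p <- ggplot(toy, aes(x = measure_value, fill = condition)) +
  geom_histogram(bins = 20) +
  facet_wrap(~ measure_name, scales = "free_x")

print(p)
```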
Aside: we can also use distinct to remove duplicates (this is so useful!):
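d.raw <- read.csv("problem_sets/data/janiszewski_rep_exercise.csv")
# Use the freshly loaded raw data; .keep_all = TRUE keeps the other columns of each worker's first row
d.unique.subs <- d.raw %>% distinct(WorkerId, .keep_all = TRUE)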
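A small made-up example shows what distinct() returns. In current dplyr versions, naming a column keeps only that column unless .keep_all = TRUE is given, in which case the first row for each value is kept in full. The `toy` data frame is hypothetical.

```r
library(dplyr)

# Hypothetical submissions: worker "A" submitted twice
toy <- data.frame(WorkerId = c("A", "A", "B"),
                  answer = c(1, 2, 3),
                  stringsAsFactors = FALSE)

# Only the WorkerId column survives
ids_only <- toy %>% distinct(WorkerId)

# The first row per worker survives, with all columns
first_rows <- toy %>% distinct(WorkerId, .keep_all = TRUE)
```

Note that this keeps each worker's first submission, which is a different choice from dropping repeat submitters entirely.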
Now on to analysis. First: the mean of every value in the data set:
mean(d.gathered$measure_value)
## [1] 2014.98
Let's combine summarise with group_by, which allows us to break up our summary into groups. Try grouping by item and condition and taking means using summarise, chaining these two verbs with %>%.
d.gathered %>% group_by(condition, measure_name) %>% summarize(avg=mean(measure_value, na.rm=TRUE))
## Source: local data frame [9 x 3]
## Groups: condition [?]
##
## condition measure_name avg
## <fctr> <chr> <dbl>
## 1 over dog_cost 1894.793103
## 2 over plasma_cost 4310.689310
## 3 over sushi_cost 8.322414
## 4 rounded dog_cost 1884.482414
## 5 rounded plasma_cost 4091.655172
## 6 rounded sushi_cost 7.955517
## 7 under dog_cost 1906.964286
## 8 under plasma_cost 4018.357143
## 9 under sushi_cost 7.742500
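The same pattern on a tiny made-up data frame makes it easy to check the arithmetic by hand. The values below are hypothetical; the point is that each (condition, measure_name) pair gets its own mean and that na.rm = TRUE skips missing values within a group.

```r
library(dplyr)

# Hypothetical long-format data with one missing value
toy <- data.frame(
  condition = c("over", "over", "over", "under", "under", "under"),
  measure_name = c("dog_cost", "dog_cost", "sushi_cost",
                   "dog_cost", "dog_cost", "sushi_cost"),
  measure_value = c(10, NA, 8, 30, 50, 6),
  stringsAsFactors = FALSE
)

toy_means <- toy %>%
  group_by(condition, measure_name) %>%
  summarize(avg = mean(measure_value, na.rm = TRUE)) %>%
  ungroup()  # drop the leftover grouping by condition
```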
Next we'll look for deviation from the anchor, using this long pipeline, which does all the data tidying above in one fell swoop:
d.tidy <- d %>%
select(WorkerId, Input.condition,
starts_with("Answer"),
starts_with("Input")) %>%
rename(workerid = WorkerId,
condition = Input.condition,
plasma_anchor = Input.price1,
dog_anchor = Input.price2,
sushi_anchor = Input.price3,
dog_cost = Answer.dog_cost,
plasma_cost = Answer.plasma_cost,
sushi_cost = Answer.sushi_cost) %>%
gather(name, cost,
dog_anchor, plasma_anchor, sushi_anchor,
dog_cost, plasma_cost, sushi_cost) %>%
separate(name, c("item", "type"), "_") %>%
spread(type, cost)
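The gather, separate, spread sequence is easier to follow on a couple of rows. This sketch uses made-up values and already-renamed columns (the `toy` data frame is hypothetical): names like dog_anchor are split at the underscore into an item and a type, and the type is then spread back into anchor and cost columns.

```r
library(dplyr)
library(tidyr)

# Hypothetical data with one anchor and one cost column per item
toy <- data.frame(workerid = c("w1", "w2"),
                  dog_anchor = c(2000, 2135),
                  dog_cost = c(1800, 1900),
                  sushi_anchor = c(9, 8.64),
                  sushi_cost = c(8, 7),
                  stringsAsFactors = FALSE)

toy_tidy <- toy %>%
  gather(name, cost,
         dog_anchor, dog_cost, sushi_anchor, sushi_cost) %>%
  separate(name, c("item", "type"), "_") %>%  # "dog_anchor" -> "dog", "anchor"
  spread(type, cost)                          # one anchor and one cost column
```

The result has one row per worker per item, which is the shape the percent-change and z-score computations need.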
Next we're going to use the tidy format to look at relative difference between anchor and estimate:
* By computing absolute percentage change in price, and
* By computing z-scores over items.
In the first, we mutate to add a percent change column, then compute the same summary as before.
pcts <- d.tidy %>%
mutate(pct_change = 100*abs(anchor-cost)/cost ) %>%
group_by(condition, item) %>% summarize(avg=mean(pct_change, na.rm=TRUE))
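The choice of denominator matters for this measure. The code above divides by the participant's estimate (cost); dividing by the anchor is another common convention and gives different numbers. The sketch below compares the two on made-up values; the `toy` data frame and the column names `pct_vs_cost` and `pct_vs_anchor` are hypothetical.

```r
library(dplyr)

# Hypothetical anchor and estimates for a single item
toy <- data.frame(item = c("dog", "dog"),
                  anchor = c(2000, 2000),
                  cost = c(1600, 2500),
                  stringsAsFactors = FALSE)

toy_pct <- toy %>%
  mutate(pct_vs_cost = 100 * abs(anchor - cost) / cost,      # as in the code above
         pct_vs_anchor = 100 * abs(anchor - cost) / anchor)  # alternative convention
```

Here an estimate of 1600 against an anchor of 2000 is a 25% change relative to the estimate but a 20% change relative to the anchor.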
To do the second, we will group once by item, then ungroup and do the same thing as before. NOTE: you can also use group_by(..., add=FALSE) to set new grouping levels.
HINT: scale(x) returns a matrix (with centering and scaling attributes) rather than a plain vector, which doesn't play nicely with dplyr. Try scale(x)[,1] to get what you need.
z.scores <- d.tidy %>%
group_by(item) %>%
mutate(z=scale(cost)[,1]) %>%
ungroup %>%
group_by(condition, item) %>%
summarize(z=mean(z, na.rm=TRUE))
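To see what the grouped z-scoring does, here is a minimal sketch on made-up costs (the `toy` data frame is hypothetical). Each item is standardised against its own mean and standard deviation, so items on very different price scales end up comparable.

```r
library(dplyr)

# Hypothetical costs on two very different scales
toy <- data.frame(item = rep(c("dog", "sushi"), each = 3),
                  cost = c(1000, 2000, 3000, 6, 8, 10),
                  stringsAsFactors = FALSE)

# scale() returns a one-column matrix, hence the [, 1]
returns_matrix <- is.matrix(scale(toy$cost))

toy_z <- toy %>%
  group_by(item) %>%
  mutate(z = scale(cost)[, 1]) %>%  # z-score within each item
  ungroup()
```

Both items come out as -1, 0, 1 even though the raw costs differ by orders of magnitude.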
Finally, we graph everything (note that the code to graph in the pset did not work for me because the use of stat in qplot is deprecated):
ggplot(pcts, aes(x=item, y=avg, fill=condition)) + geom_bar(stat="identity", position="dodge")
# qplot(item, avg, fill=condition,
# position="dodge",
# stat="identity", geom="bar",
# data=pcts)
and the z-scores:
ggplot(z.scores, aes(x=item, y=z, fill=condition)) + geom_bar(stat="identity", position="dodge")
# qplot(item, z, fill=condition,
# position="dodge",
# stat="identity", geom="bar",
# data=z.scores)