#load packages needed for this assignment
library(tidyverse)

## Warning: package 'tidyverse' was built under R version 3.3.2

## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr

## Warning: package 'ggplot2' was built under R version 3.3.2

## Warning: package 'tidyr' was built under R version 3.3.2

## Conflicts with tidy packages ----------------------------------------------

## filter(): dplyr, stats
## lag():    dplyr, stats

Part 1: Data cleaning

First, we look at the raw data.

d = read.csv("~/Desktop/PSYC254/data/problem_sets/data/janiszewski_rep_exercise.csv")
str(d)

## 'data.frame':    90 obs. of  34 variables:
##  $ HITId                      : Factor w/ 3 levels "261WKUDD8XMBJ8COBQ261TO6T1XNCJ",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ HITTypeId                  : Factor w/ 1 level "2DVGTP9RA7S5C4ALLOJOM70JMHCRW9": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Title                      : Factor w/ 1 level "How much is it worth? Quick 2 minute survey.": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Description                : Factor w/ 1 level "A quick two minute survey asking about prices. DO THIS HIT ONLY ONCE.": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Keywords                   : Factor w/ 1 level "survey, quick, prices": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Reward                     : num  0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
##  $ CreationTime               : Factor w/ 1 level "Wed Jan 25 18:13:24 GMT 2012": 1 1 1 1 1 1 1 1 1 1 ...
##  $ MaxAssignments             : int  30 30 30 30 30 30 30 30 30 30 ...
##  $ RequesterAnnotation        : Factor w/ 1 level "Department:Other": 1 1 1 1 1 1 1 1 1 1 ...
##  $ AssignmentDurationInSeconds: int  600 600 600 600 600 600 600 600 600 600 ...
##  $ AutoApprovalDelayInSeconds : int  604800 604800 604800 604800 604800 604800 604800 604800 604800 604800 ...
##  $ Expiration                 : Factor w/ 1 level "Wed Feb 01 18:13:24 GMT 2012": 1 1 1 1 1 1 1 1 1 1 ...
##  $ NumberOfSimilarHITs        : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ LifetimeInSeconds          : logi  NA NA NA NA NA NA ...
##  $ AssignmentId               : Factor w/ 90 levels "206OZFYQS1CS94XU9H4SBIC1G86JMR",..: 8 9 10 11 13 20 28 29 30 34 ...
##  $ WorkerId                   : Factor w/ 87 levels "A10316ZXDCW4TT",..: 5 20 41 58 23 22 33 36 81 63 ...
##  $ AssignmentStatus           : Factor w/ 1 level "Submitted": 1 1 1 1 1 1 1 1 1 1 ...
##  $ AcceptTime                 : Factor w/ 90 levels "Fri Jan 27 00:38:58 GMT 2012",..: 61 54 67 40 6 49 60 16 84 75 ...
##  $ SubmitTime                 : Factor w/ 90 levels "Fri Jan 27 00:40:50 GMT 2012",..: 61 54 66 40 6 49 60 16 84 75 ...
##  $ AutoApprovalTime           : Factor w/ 90 levels "Thu Feb 02 01:34:34 PST 2012",..: 43 34 48 20 76 29 42 86 66 57 ...
##  $ ApprovalTime               : logi  NA NA NA NA NA NA ...
##  $ RejectionTime              : logi  NA NA NA NA NA NA ...
##  $ RequesterFeedback          : logi  NA NA NA NA NA NA ...
##  $ WorkTimeInSeconds          : int  23 228 89 145 232 86 93 54 106 110 ...
##  $ LifetimeApprovalRate       : Factor w/ 1 level "0% (0/0)": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Last30DaysApprovalRate     : Factor w/ 1 level "0% (0/0)": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Last7DaysApprovalRate      : Factor w/ 1 level "0% (0/0)": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Input.condition            : Factor w/ 3 levels "over","rounded",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Input.price1               : Factor w/ 3 levels "4,988","5,000",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Input.price2               : Factor w/ 3 levels "2,492","2,500",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Input.price3               : num  9.36 9.36 9.36 9.36 9.36 9.36 9.36 9.36 9.36 9.36 ...
##  $ Answer.dog_cost            : Factor w/ 34 levels "1000","1500",..: 24 28 33 17 16 3 8 20 17 1 ...
##  $ Answer.plasma_cost         : Factor w/ 43 levels "0.45","1200",..: 32 34 2 18 23 26 27 32 23 7 ...
##  $ Answer.sushi_cost          : Factor w/ 29 levels "","6.5","6.78",..: 24 27 18 27 22 18 4 18 26 27 ...

In this format, the data is probably going to be difficult or unwieldy to work with. This is because a) it has a lot of extraneous info that we need to clean out first (i.e. information Qualtrics automatically gives you that isn’t necessarily pertinent) and b) it’s in wide format (i.e. multiple observations per participant per row) and we need long format and c) any potential empty cells, or duplications haven’t been cleaned out.

Part 2: Making data tidy

Next, we take a look at the cleaned dataset.

d = read.csv("~/Desktop/PSYC254/data/problem_sets/data/janiszewski_rep_cleaned.csv")
str(d)

## 'data.frame':    87 obs. of  34 variables:
##  $ HITId                      : Factor w/ 3 levels "261WKUDD8XMBJ8COBQ261TO6T1XNCJ",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ HITTypeId                  : Factor w/ 1 level "2DVGTP9RA7S5C4ALLOJOM70JMHCRW9": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Title                      : Factor w/ 1 level "How much is it worth? Quick 2 minute survey.": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Description                : Factor w/ 1 level "A quick two minute survey asking about prices. DO THIS HIT ONLY ONCE.": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Keywords                   : Factor w/ 1 level "survey, quick, prices": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Reward                     : num  0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
##  $ CreationTime               : Factor w/ 1 level "Wed Jan 25 18:13:24 GMT 2012": 1 1 1 1 1 1 1 1 1 1 ...
##  $ MaxAssignments             : int  30 30 30 30 30 30 30 30 30 30 ...
##  $ RequesterAnnotation        : Factor w/ 1 level "Department:Other": 1 1 1 1 1 1 1 1 1 1 ...
##  $ AssignmentDurationInSeconds: int  600 600 600 600 600 600 600 600 600 600 ...
##  $ AutoApprovalDelayInSeconds : int  604800 604800 604800 604800 604800 604800 604800 604800 604800 604800 ...
##  $ Expiration                 : Factor w/ 1 level "Wed Feb 01 18:13:24 GMT 2012": 1 1 1 1 1 1 1 1 1 1 ...
##  $ NumberOfSimilarHITs        : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ LifetimeInSeconds          : logi  NA NA NA NA NA NA ...
##  $ AssignmentId               : Factor w/ 87 levels "206OZFYQS1CS94XU9H4SBIC1G86JMR",..: 7 8 9 10 12 19 27 28 29 33 ...
##  $ WorkerId                   : Factor w/ 87 levels "A10316ZXDCW4TT",..: 5 20 41 58 23 22 33 36 81 63 ...
##  $ AssignmentStatus           : Factor w/ 1 level "Submitted": 1 1 1 1 1 1 1 1 1 1 ...
##  $ AcceptTime                 : Factor w/ 87 levels "Fri Jan 27 00:38:58 GMT 2012",..: 58 52 64 40 6 47 57 16 81 72 ...
##  $ SubmitTime                 : Factor w/ 87 levels "Fri Jan 27 00:40:50 GMT 2012",..: 58 52 63 40 6 47 57 16 81 72 ...
##  $ AutoApprovalTime           : Factor w/ 87 levels "Thu Feb 02 01:34:34 PST 2012",..: 40 32 45 20 73 27 39 83 63 54 ...
##  $ ApprovalTime               : logi  NA NA NA NA NA NA ...
##  $ RejectionTime              : logi  NA NA NA NA NA NA ...
##  $ RequesterFeedback          : logi  NA NA NA NA NA NA ...
##  $ WorkTimeInSeconds          : int  23 228 89 145 232 86 93 54 106 110 ...
##  $ LifetimeApprovalRate       : Factor w/ 1 level "0% (0/0)": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Last30DaysApprovalRate     : Factor w/ 1 level "0% (0/0)": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Last7DaysApprovalRate      : Factor w/ 1 level "0% (0/0)": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Input.condition            : Factor w/ 3 levels "over","rounded",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Input.price1               : int  5012 5012 5012 5012 5012 5012 5012 5012 5012 5012 ...
##  $ Input.price2               : int  2508 2508 2508 2508 2508 2508 2508 2508 2508 2508 ...
##  $ Input.price3               : num  9.36 9.36 9.36 9.36 9.36 9.36 9.36 9.36 9.36 9.36 ...
##  $ Answer.dog_cost            : num  2300 2450 800 2000 2000 1600 1750 2200 2000 1000 ...
##  $ Answer.plasma_cost         : num  4800 4850 1200 4200 4500 4600 4650 4800 4500 3000 ...
##  $ Answer.sushi_cost          : num  8.7 9 8 9 8.5 8 6.95 8 8.99 9 ...

d.tidy = select(d, WorkerId, Input.condition:Input.price3, Answer.dog_cost:Answer.sushi_cost)
d.tidy = d.tidy %>%
  rename(condition=Input.condition) %>%
  rename(dog_anchor=Input.price1) %>%
  rename(plasma_anchor=Input.price2) %>%
  rename(sushi_anchor=Input.price3)%>%
  rename(dog_cost=Answer.dog_cost) %>%
  rename(plasma_cost=Answer.plasma_cost) %>%
  rename(sushi_cost=Answer.sushi_cost)%>%
  mutate(WorkerId=1:n())

d.tidy = d.tidy %>%
  gather(name, cost,
         dog_anchor, plasma_anchor, sushi_anchor,
         dog_cost, plasma_cost, sushi_cost) %>%
  separate(name, c("item", "type"), "_") %>%
  spread(type, cost)

We select only the variables of interest (subject ID, anchoring condition, anchoring price, and participant’s bet), and then we rename the variables into something more easily understood. Finally, we convert the data from wide format to long format for ease of analysis. This is the form that is most convenient to run analyses in.

Part 3: Manipulating data using dplyr

Using the cleaned dataset from Part 2 above, we plot the data to see, visually, whether or not it looks like the effect replicated.

ggplot(d.tidy, aes(x=item, y=cost, group=condition,fill=condition)) +
     geom_bar(stat="identity", position="dodge")

## Warning: Removed 1 rows containing missing values (geom_bar).

This graph is pretty difficult to interpret because the three items are on different scales. If we clean the data some more to get standardized outcome variables, we may be able to more easily compare the three types of items and the responses that participants gave for them.

Here, we go through the process of manipulating the data to more easily compare the mean price given for each item in each condition.

d.raw = read.csv("~/Desktop/PSYC254/data/problem_sets/data/janiszewski_rep_exercise.csv",stringsAsFactors = F)
d.unique.subs = (d.raw %>% 
                   group_by(WorkerId) %>% 
                   filter(row_number() == 1))

m_grand = d.unique.subs %>% 
  group_by(Input.condition) %>% 
  summarise(m_dog = mean(as.numeric(Answer.dog_cost), na.rm = TRUE), m_plasma = mean(as.numeric(Answer.plasma_cost), na.rm = TRUE), m_sushi = mean(as.numeric(Answer.sushi_cost), na.rm = TRUE))

## Warning in mean(as.numeric(c("1000", "2000", "2192", "2000", "2000",
## "2100", : NAs introduced by coercion

## Warning in mean(as.numeric(c("1000", "2000", "2192", "2000", "2000",
## "2100", : NAs introduced by coercion

## Warning in mean(as.numeric(c("1000", "2000", "2192", "2000", "2000",
## "2100", : NAs introduced by coercion

## Warning in mean(as.numeric(c("2500", "2500", "4888", "4400", "0.45",
## "4300", : NAs introduced by coercion

## Warning in mean(as.numeric(c("7.5", "7.8", "7.5", "7.99", "8", "7.99",
## "7.8", : NAs introduced by coercion

## Warning in mean(as.numeric(c("7.5", "7.8", "7.5", "7.99", "8", "7.99",
## "7.8", : NAs introduced by coercion

## Warning in mean(as.numeric(c("7.5", "7.8", "7.5", "7.99", "8", "7.99",
## "7.8", : NAs introduced by coercion

print(m_grand)

## # A tibble: 3 × 4
##   Input.condition    m_dog m_plasma  m_sushi
##             <chr>    <dbl>    <dbl>    <dbl>
## 1            over 1894.793 4300.333 8.342857
## 2         rounded 1884.482 4094.929 7.953929
## 3           under 1959.074 3857.659 7.742500

m_grand2 = d.tidy %>%
  group_by(condition) %>%
  summarize(mean(cost, na.rm=TRUE))
print(m_grand2)

## # A tibble: 3 × 2
##   condition `mean(cost, na.rm = TRUE)`
##      <fctr>                      <dbl>
## 1      over                   2092.139
## 2   rounded                   1994.698
## 3     under                   1977.688

First, we ensure that the data we look at is only data for unique participants, removing any duplicates. Then, we calculate the mean price that participants reported for each item type in a table. We can see that the predicted effect seems to replicate for plasma and sushi (i.e. those in the over condition gave higher prices than those in the rounded condition, who in turn gave higher prices than those in the under condition), but not dogs. For the dogs, we can see that the average price in the over and rounded conditions is approximately comparable, but the price in the under condition is far larger - directly contrary to the predicted effect. This suggests that the replication may not be entirely successful. However, when we look at the prices overall across items, we see that the effect seems to replicate. However, since the three items use wildly different scales, this may confound what we see when we look at the raw data.

In order to better examine this effect, we then clean the data further and standardize the responses against the anchor values in two ways.

d.tidy <- d %>% 
  select(WorkerId, Input.condition, 
         starts_with("Answer"), 
         starts_with("Input")) %>%
  rename(workerid = WorkerId,
         condition = Input.condition,          
         plasma_anchor = Input.price1,
         dog_anchor = Input.price2,
         sushi_anchor = Input.price3,
         dog_cost = Answer.dog_cost,
         plasma_cost = Answer.plasma_cost, 
         sushi_cost = Answer.sushi_cost) %>%
  gather(name, cost, 
         dog_anchor, plasma_anchor, sushi_anchor, 
         dog_cost, plasma_cost, sushi_cost) %>%
  separate(name, c("item", "type"), "_") %>%
  spread(type, cost)

First, we standardize using percent change.

ps = d.tidy %>%
  mutate(p_change = ((anchor-cost)/anchor)*100) #percent change (decrease)

p_table = ps %>%
  group_by(condition) %>%
  summarise(m_dog = mean(p_change[item == "dog"], na.rm=T), m_plasma = mean(p_change[item=="plasma"],na.rm=T), m_sushi=mean(p_change[item=="sushi"],na.rm=T))
print(p_table)

## # A tibble: 3 × 4
##   condition    m_dog m_plasma  m_sushi
##      <fctr>    <dbl>    <dbl>    <dbl>
## 1      over 24.31021 14.19926 11.08532
## 2   rounded 24.62070 18.16690 11.60536
## 3     under 23.47655 19.43951 10.38773

p_table.a = ps %>%
  group_by(condition) %>%
  summarize(mean(p_change, na.rm=T))
print(p_table.a)

## # A tibble: 3 × 2
##   condition `mean(p_change, na.rm = T)`
##      <fctr>                       <dbl>
## 1      over                    16.59279
## 2   rounded                    18.13099
## 3     under                    17.76793

When calculating the percent change (here, percent decrease from the anchor price), we see that alghough the percent decrease seems to get larger for plasma (the expected effect), the same is not true for dogs and sushi, both of which have smaller percent decrease for the under condition, contrary to expectations. Further, overall, when looking at the percent decrease across items, we find that the under condition produces the smallest percent decrease, suggesting that the effect doesn’t cleanly replicate.

Next, we examine what results look like when we examine cost as a z-score.

ps1 = d.tidy %>%
  mutate(z=scale(cost)[,1]) #z score

p_table2 = ps1 %>%
  group_by(condition) %>%
  summarise(m_dog = mean(z[item == "dog"],na.rm=T), m_plasma = mean(z[item=="plasma"],na.rm=T), m_sushi=mean(z[item=="sushi"],na.rm=T))
print(p_table2)

## # A tibble: 3 × 4
##   condition       m_dog m_plasma   m_sushi
##      <fctr>       <dbl>    <dbl>     <dbl>
## 1      over -0.07003565 1.283835 -1.135294
## 2   rounded -0.07782373 1.166217 -1.135501
## 3     under -0.06515216 1.124903 -1.135621

p_table2.a = ps1 %>%
  group_by(condition) %>%
  summarize(mean(z, na.rm=T))
print(p_table2.a)

## # A tibble: 3 × 2
##   condition `mean(z, na.rm = T)`
##      <fctr>                <dbl>
## 1      over           0.03921865
## 2   rounded          -0.01570251
## 3     under          -0.02528979

As z-scores, we see again that the effect does not appear to hold. For both sushi and dogs, there doesn’t appear to be a big, appreciable change in cost based on condition, and again, for dogs, the over and rounded conditions don’t seem to be appreciably different. Again, the plasma condition does show the expected effect - the z scores get smaller when comparing over to rounded to under. When looking at the results across items, it does look like the results point to the effect holding, but the differences between the three conditions are so small that we can’t be confident in this.

Finally, we graph the standardized data to again visualize the results, but this time with more intepretable information.

ps.a = ps %>%
  group_by(item, condition) %>%
  summarise(means=mean(p_change, na.rm=T))

ggplot(ps.a, aes(x=item, y=means, group=condition,fill=condition)) +
     geom_bar(stat="identity", position="dodge")

ps1.a = ps1 %>%
  group_by(item, condition) %>%
  summarise(means=mean(z, na.rm=T))

ggplot(ps1.a, aes(x=item, y=means, group=condition,fill=condition)) +
     geom_bar(stat="identity", position="dodge")

In both cases, when standardized, the effect does not seem to replicate perfectly. Although plasma shows effects in the expected direction when using both methods of standardization, the effect for dogs and sushi is not what we expected. Whereas it was unclear with the unstandardized scores whether or not the effect replicated because of the different scales the three different items had, when standardized it is clear that the replication is not perfect.

PSET 1

Vivian Xiao

1/27/2017

Part 1: Data cleaning

Part 2: Making data tidy

Part 3: Manipulating data using dplyr