WPA #5

1. Download the dataframe pirate_survey_witherrors.txt from http://nathanieldphillips.com/wp-content/uploads/2015/05/pirate_survey_witherrors.txt. The data are stored in a tab-separated text file with headers. Load the dataframe into an object called pirates.errors. Because it’s tab-separated, use sep = "\t".

pirates.errors <- read.table("http://nathanieldphillips.com/wp-content/uploads/2015/05/pirate_survey_witherrors.txt", sep = "\t", header = TRUE, stringsAsFactors = FALSE)
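As a minimal sketch of what the sep and header arguments do, the same call can be tried on a small inline string instead of the URL. The three-row excerpt and the name toy.pirates are made up for illustration and are not part of the survey file.

```r
# Hypothetical three-row excerpt standing in for the real survey file
raw.text <- "sex\tage\theadband\nmale\t25\tyes\nfemale\t31\tno\nother\t28\tyes"

toy.pirates <- read.table(text = raw.text,
                          sep = "\t",               # fields are separated by tabs
                          header = TRUE,            # first line holds the column names
                          stringsAsFactors = FALSE) # keep text columns as character

str(toy.pirates) # check column names and types before doing anything else
```

Looking at str() first makes it easy to spot a column that was read as text although it should be numeric, which is often a sign of bad entries.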

Let’s clean up the dataframe. Some of the values don’t seem to be appropriate. For example, when I look at the column sex, I see some bad values. For each of the columns, try to figure out which values are appropriate (hint: use table()), and recode all inappropriate values as NA.

For example, for the column sex you can recode all values that are not equal to “male”, “female” or “other” to NA using the following code. Using this general logic, try fixing any other columns with bad values.

pirates.errors$sex[!(pirates.errors$sex %in% c("male", "female", "other"))] <- NA
#clean up "sex"-column: every value other than "male", "female" or "other" becomes NA

pirates.errors$headband[!(pirates.errors$headband %in% c("yes", "no"))] <- NA
#clean up "headband"-column: only "yes" and "no" are valid


pirates.errors$age[!(pirates.errors$age %in% seq(0,100,1))] <- NA
#clean up age-column

summary(pirates.errors$tattoos)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     -10       8      10    4025      12 1000000       4

pirates.errors$tattoos[!(pirates.errors$tattoos %in% seq(0,20,1))] <- NA
#clean up tattoos-column

#...
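The cleaning logic above can be sketched on a toy dataframe: table() shows which values occur, %in% handles columns with a fixed set of labels, and a range check handles numeric columns. The toy data and the upper bound of 100 tattoos are assumptions made for this illustration only.

```r
# Hypothetical toy data with a few deliberately bad entries
toy <- data.frame(sex = c("male", "female", "yes please", "other", "m"),
                  tattoos = c(10, -10, 8, 1000000, 12),
                  stringsAsFactors = FALSE)

table(toy$sex, useNA = "ifany") # lists every distinct value and how often it occurs

# Character column: keep only the allowed labels
valid.sex <- c("male", "female", "other")
toy$sex[!(toy$sex %in% valid.sex)] <- NA

# Numeric column: a range check also catches non-integer bad values;
# the upper bound of 100 is an arbitrary assumption for this sketch
toy$tattoos[!is.na(toy$tattoos) & (toy$tattoos < 0 | toy$tattoos > 100)] <- NA

toy
```

Compared with %in% seq(0, 20, 1), the range check does not silently discard plausible values that happen to fall outside a hand-picked list.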

To answer the following questions, you need to have a dataset without errors. If you had trouble correcting the data, you can download a version of the data without errors at http://nathanieldphillips.com/wp-content/uploads/2015/05/pirate_survey_noerrors.txt

pirates.noerrors <- read.table("http://nathanieldphillips.com/wp-content/uploads/2015/05/pirate_survey_noerrors.txt", sep = "\t", header = TRUE, stringsAsFactors = FALSE)

A fellow pirate captain wants to know if there is a relationship between the sex of my pirates and the number of treasure chests they have found. Using aggregate(), figure out the mean number of treasure chests found by males, females, and other.

aggregate(tchests.found ~ sex, # DV is tchests.found, IV is sex; formula left unnamed so it works across R versions
          FUN = mean, # Calculate the mean 
          na.rm = TRUE, # Ignore NA values when calculating the mean
          data = pirates.noerrors 
          )

##      sex tchests.found
## 1 female      7.353319
## 2   male      7.128049
## 3  other      8.048780

#visualize results
aggregated.data <- aggregate(tchests.found ~ sex, # formula left unnamed so it works across R versions
                             FUN = mean, 
                             na.rm = TRUE, 
                             data = pirates.noerrors)
barplot(height = aggregated.data$tchests.found, names.arg = aggregated.data$sex)
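The pattern can be checked by hand on a tiny example. The toy dataframe below is invented; only the column names mirror the survey. Note that the formula method of aggregate() drops rows with missing values by default (na.action = na.omit), so the sketch works without na.rm.

```r
# Hypothetical toy data; the column names mirror the survey
toy <- data.frame(sex = c("male", "male", "female", "female", "other"),
                  tchests.found = c(4, 6, 7, NA, 9))

# One row per level of sex, holding the mean of tchests.found
toy.means <- aggregate(tchests.found ~ sex, data = toy, FUN = mean)
toy.means
```

The result is an ordinary dataframe, so its columns can be passed straight to barplot() as in the code above.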

Each pirate only uses one kind of sword, and their speed with their preferred sword is recorded in sword.speed. Which sword types tend to have the fastest (i.e., smallest) sword speed? Test this by calculating the median sword.speed for each sword type.

aggregate(sword.speed ~ sword.type, # formula left unnamed so it works across R versions
          FUN = median, # Calculate the median
          na.rm = TRUE, # Ignore NA values when calculating the median
          data = pirates.noerrors 
          )

##   sword.type sword.speed
## 1     banana   2.5859139
## 2    cutlass   0.4848266
## 3      sabre   1.7393120
## 4   scimitar   1.7559671

#visualize results
aggregated.data <- aggregate(sword.speed ~ sword.type, # formula left unnamed so it works across R versions
                             FUN = median, 
                             na.rm = TRUE, 
                             data = pirates.noerrors)
barplot(height = aggregated.data$sword.speed, names.arg = aggregated.data$sword.type)
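To read off the fastest sword type without scanning the table by eye, the aggregated result can be sorted by its median. This is a small sketch on invented numbers; the names toy.medians and fastest are hypothetical.

```r
# Hypothetical toy data; smaller sword.speed means a faster sword
toy <- data.frame(sword.type = c("cutlass", "cutlass", "sabre", "sabre",
                                 "banana", "banana", "banana"),
                  sword.speed = c(0.4, 0.6, 1.5, 1.9, 2.0, 2.6, 9.0))

toy.medians <- aggregate(sword.speed ~ sword.type, data = toy, FUN = median)

# Sort so that the smallest (fastest) median comes first
toy.medians <- toy.medians[order(toy.medians$sword.speed), ]
toy.medians

fastest <- as.character(toy.medians$sword.type[1])
fastest
```

The median is used rather than the mean because a single very slow pirate (9.0 in the banana group here) would otherwise pull the group value up.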

Is there a relationship between whether or not a pirate wears a headband and their speed with their sword? Test this in two ways.

First, calculate the median sword speed, separately for each level of headband use using aggregate(). What is your conclusion?

Second, calculate the median sword speed for all combinations of both headband AND sword.type. That is, calculate the median sword.speed for headband-wearers who use cutlasses, headband-wearers who use sabres, … headband-nonwearers who use cutlasses, headband-nonwearers who use sabres… (hint: include two independent variables in the formula argument to aggregate()). Does your conclusion change? If so, what do you think is going on?

#1 calculate the median sword speed, separately for each level of headband use using aggregate().

aggregate(sword.speed ~ headband, # formula left unnamed so it works across R versions
          FUN = median, # Calculate the median
          na.rm = TRUE, # Ignore NA values when calculating the median
          data = pirates.noerrors 
          )

##   headband sword.speed
## 1       no   1.0780988
## 2      yes   0.5375353

#conclusion: Pirates who use a headband tend to move their swords faster. 

#2 calculate the median sword speed for all combinations of both headband AND sword.type. That is, calculate the median sword.speed for headband-wearers who use cutlasses, headband-wearers who use sabres, … headband-nonwearers who use cutlasses, headband-nonwearers who use sabres… (hint: include two independent variables in the formula argument to aggregate()).

aggregate(sword.speed ~ sex + sword.type + headband, # splits by sex in addition to sword.type and headband
          FUN = median, # Calculate the median
          na.rm = TRUE, # Ignore NA values when calculating the median
          data = pirates.noerrors 
          )

##       sex sword.type headband sword.speed
## 1  female     banana       no   2.1053095
## 2    male     banana       no   2.1928123
## 3   other     banana       no   1.6016872
## 4  female    cutlass       no   0.4812121
## 5    male    cutlass       no   0.3408127
## 6  female      sabre       no   1.1823583
## 7    male      sabre       no   1.2412301
## 8   other      sabre       no   0.5307804
## 9  female   scimitar       no   0.9705695
## 10   male   scimitar       no   0.5063881
## 11  other   scimitar       no   1.4368908
## 12 female     banana      yes   4.2625317
## 13   male     banana      yes  14.1127041
## 14  other     banana      yes   1.4867682
## 15 female    cutlass      yes   0.4602335
## 16   male    cutlass      yes   0.5018827
## 17  other    cutlass      yes   0.6131719
## 18 female      sabre      yes   2.0247799
## 19   male      sabre      yes   2.9170651
## 20  other      sabre      yes   1.2586082
## 21 female   scimitar      yes   4.8914395
## 22   male   scimitar      yes   5.1367420
## 23  other   scimitar      yes   8.1321056

#I had trouble working out how to compare headband-wearers and non-wearers directly. The table above is also split by sex; using only headband and sword.type as independent variables (sword.speed ~ headband + sword.type) should give one row per combination of the two.
#This could probably also be done with the dplyr package.
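A sketch of the two-variable version on invented data may help with the "what is going on" part of the question. In the real output above, headband-wearers look faster overall, yet within banana, sabre and scimitar users their medians are larger, which would fit a situation where headband-wearers mostly use the fast cutlass. The toy numbers below are made up to reproduce that pattern and say nothing about the actual survey.

```r
# Hypothetical toy data: most headband-wearers use the fast cutlass
toy <- data.frame(
  headband    = c("yes", "yes", "yes", "yes", "no", "no", "no", "no"),
  sword.type  = c("cutlass", "cutlass", "cutlass", "banana",
                  "cutlass", "banana", "banana", "banana"),
  sword.speed = c(0.5, 0.6, 0.7, 3.0, 0.4, 2.0, 2.2, 2.4))

# One independent variable: headband-wearers have the smaller median
overall <- aggregate(sword.speed ~ headband, data = toy, FUN = median)
overall

# Two independent variables: one row per headband x sword.type combination
by.type <- aggregate(sword.speed ~ headband + sword.type, data = toy, FUN = median)
by.type
```

In this toy case the overall comparison and the within-sword-type comparison point in opposite directions, because sword type is unevenly distributed across the two headband groups.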

Does a pirate’s favorite pirate say anything about them? Do pirates whose favorite pirate is Hook have more tattoos on average or a faster sword speed than those whose favorite pirate is Blackbeard? Using dplyr, create the following aggregated dataframe which shows aggregated data depending on the pirates’ favorite pirate. Here are the four basic steps:

First, define the dataframe, THEN… (%>%)
Second, group the data by favorite.pirate, THEN… (%>%)
Third, use the summarise function to tell R you will summarise the data.
Fourth, define each of the new summary statistics (then close the summarise function).

require(dplyr)

## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

pirates.fave <- pirates.noerrors %>%
  group_by(favorite.pirate) %>% # Define the grouping variable favourite pirate
  summarise(frequency = length(favorite.pirate),
          tattoos.mean = mean(tattoos, na.rm = T), 
          swords.speed.median = median(sword.speed, na.rm = T)
          )

pirates.fave #yeeeha

## Source: local data frame [6 x 4]
## 
##   favorite.pirate frequency tattoos.mean swords.speed.median
## 1        Anicetus       120     9.100000           0.4776414
## 2      Blackbeard       100     9.620000           0.7313608
## 3      Edward Low       114     9.342105           0.5371732
## 4            Hook       115     9.713043           0.6046085
## 5    Jack Sparrow       453     9.607064           0.5391035
## 6      Lewis Scot        98     9.081633           0.5539311

#Pirates whose favourite pirate is "Hook" tend to have the most tattoos (mean 9.71, vs. 9.62 for "Blackbeard")
#Pirates whose favourite pirate is "Blackbeard" have the largest median sword.speed (0.73), i.e. they are the slowest; "Hook" fans are faster (0.60)
#Pirates whose favourite pirate is "Lewis Scot" appear to have the fewest tattoos
#Jack Sparrow is very popular. :-)
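The four steps can be sketched on a toy dataframe, assuming dplyr is installed. The data are invented, and n() is used here as dplyr's own group-size counter in place of length(favorite.pirate); both give the same count.

```r
suppressPackageStartupMessages(library(dplyr))

# Hypothetical toy data; the column names mirror the survey
toy <- data.frame(favorite.pirate = c("Hook", "Hook", "Blackbeard",
                                      "Blackbeard", "Blackbeard"),
                  tattoos = c(10, 12, 8, NA, 10),
                  sword.speed = c(0.5, 0.7, 0.6, 0.8, 1.0))

toy.fave <- toy %>%                               # 1. define the dataframe
  group_by(favorite.pirate) %>%                   # 2. group by favorite pirate
  summarise(frequency = n(),                      # 3./4. one summary row per group
            tattoos.mean = mean(tattoos, na.rm = TRUE),
            sword.speed.median = median(sword.speed))

toy.fave
```

Unlike aggregate() with a formula, summarise() does not drop missing values on its own, so na.rm = TRUE is needed for the tattoos column here.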

Is there a relationship between a pirate’s age and whether or not they went to Jack Sparrow’s School of Fashion and Piratry? Solve this in 3 steps.

First, create a logical vector indicating whether or not a pirate went to Jack Sparrow’s School of Fashion and Piratry.
Second, use aggregate() to calculate the percentage of pirates who went to Jack Sparrow’s School of Fashion and Piratry as a function of age.
Third, create a scatterplot showing the results.

#1 create logical vector
college.log <- pirates.noerrors$college == "JSSFP"
head(college.log) # show only the first few values rather than all 1000

## [1]  TRUE FALSE FALSE FALSE FALSE FALSE

#I think there is something wrong with this, but I cannot figure out what the issue is. 

aggregate(age ~ college.log, # formula left unnamed so it works across R versions
          FUN = mean, # Calculate the mean
          na.rm = TRUE, # Ignore NA values when calculating the mean
          data = pirates.noerrors 
          )

##   college.log     age
## 1       FALSE 24.4113
## 2        TRUE 33.3168

#This is not right either: it gives the mean age for each group, not the percentage of JSSFP pirates at each age. 

plot(x = pirates.noerrors$age, y = college.log)

plot(x = college.log, y = pirates.noerrors$age)

require(dplyr)

pirates.college <- pirates.noerrors %>%
  group_by(college) %>% # Define the grouping variable college
  summarise(frequency = length(college),
          age.mean = mean(age, na.rm = T) 
          )

pirates.college

## Source: local data frame [2 x 3]
## 
##   college frequency age.mean
## 1    CCCC       637  24.4113
## 2   JSSFP       363  33.3168

#I'm sorry, I actually have no clue what I'm supposed to do here or how to calculate the percentage. From the table above, 363 of the 1000 pirates (36.3%) went to JSSFP, and they are older on average (33.3 vs. 24.4 years), but to be honest, I have no idea how to express this as a function of age.
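One possible way to get the percentage as a function of age is to swap the two sides of the formula: the logical vector becomes the dependent variable and age the independent variable. The mean of a logical vector is the proportion of TRUE values, so multiplying by 100 gives a percentage. The sketch below uses invented data; the column name jssfp and the object jssfp.by.age are hypothetical.

```r
# Hypothetical toy data; the column names mirror the survey
toy <- data.frame(age = c(20, 20, 20, 20, 30, 30, 30, 30, 40, 40),
                  college = c("CCCC", "CCCC", "CCCC", "JSSFP",
                              "CCCC", "CCCC", "JSSFP", "JSSFP",
                              "JSSFP", "JSSFP"),
                  stringsAsFactors = FALSE)

# Step 1: logical vector, stored as a column so aggregate() can find it in data
toy$jssfp <- toy$college == "JSSFP"

# Step 2: the mean of TRUE/FALSE per age is the proportion who went to JSSFP
jssfp.by.age <- aggregate(jssfp ~ age, data = toy, FUN = mean)
jssfp.by.age$percent <- 100 * jssfp.by.age$jssfp
jssfp.by.age

# Step 3: scatterplot of percentage against age
plot(x = jssfp.by.age$age, y = jssfp.by.age$percent,
     xlab = "Age", ylab = "% who went to JSSFP", ylim = c(0, 100))
```

With the real data the same idea would read aggregate(college.log ~ age, data = pirates.noerrors, FUN = mean), giving one row per age instead of one row per college.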


Rebekka Herz

May 2015