1. Download the dataframe pirate_survey_witherrors.txt from http://nathanieldphillips.com/wp-content/uploads/2015/05/pirate_survey_witherrors.txt. The data are stored in a tab-separated text file with headers. Load the dataframe into an object called pirates.errors. Because it’s tab-separated, use sep = "\t".
pirates.errors <- read.table("http://nathanieldphillips.com/wp-content/uploads/2015/05/pirate_survey_witherrors.txt", sep = "\t", header = TRUE, stringsAsFactors = FALSE)
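A minimal sketch of how sep and header interact in read.table(). The three-line file written here is a hypothetical stand-in for the survey file, not its real contents:

```r
# Hypothetical miniature of a tab-separated file with a header line
tmp <- tempfile(fileext = ".txt")
writeLines(c("sex\tage\theadband",
             "male\t25\tyes",
             "female\t31\tno"),
           tmp)

# sep = "\t" splits fields on tabs; header = TRUE uses line 1 as column names
toy <- read.table(tmp, sep = "\t", header = TRUE, stringsAsFactors = FALSE)
str(toy)

# Without header = TRUE the names line is read as an ordinary data row,
# so the columns get default names and one extra row appears
toy.nohead <- read.table(tmp, sep = "\t", stringsAsFactors = FALSE)
names(toy.nohead)
nrow(toy.nohead)
```

Checking names() and str() right after loading is a quick way to see whether the header line was picked up.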
For example, for the column sex you can recode all values that are not equal to “male”, “female” or “other” to NA using the following code. Using this general logic, try fixing any other columns with bad values.
pirates.errors$sex[!(pirates.errors$sex %in% c("male", "female", "other"))] <- NA
#clean up "sex"-column
pirates.errors$headband[!(pirates.errors$headband %in% c("yes", "no"))] <- NA
#clean up "headband"-column
pirates.errors$age[!(pirates.errors$age %in% seq(0,100,1))] <- NA
#clean up age-column
summary(pirates.errors$tattoos)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -10 8 10 4025 12 1000000 4
pirates.errors$tattoos[!(pirates.errors$tattoos %in% seq(0,20,1))] <- NA
#clean up tattoos-column
#...
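The recoding logic above can be sketched on a small example. The toy data frame, its bad entries, and the 0 to 100 bounds for tattoos are assumptions chosen for illustration, not values from the survey:

```r
# Hypothetical toy data with a few deliberately bad entries
toy <- data.frame(
  sex     = c("male", "female", "other", "yes please", "femal"),
  tattoos = c(8, -10, 12, 1000000, 10),
  stringsAsFactors = FALSE
)

# Categorical column: anything outside the allowed set becomes NA
valid.sex <- c("male", "female", "other")
toy$sex[!(toy$sex %in% valid.sex)] <- NA

# Numeric column: a range check flags impossible values without
# having to list every allowed number (bounds are an assumption)
bad.tattoos <- is.na(toy$tattoos) | toy$tattoos < 0 | toy$tattoos > 100
toy$tattoos[bad.tattoos] <- NA

toy
```

A range check is one alternative to %in% seq(...) for numeric columns; which bounds count as plausible is a judgment call for each column.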
To answer the following questions, you need to have a dataset without errors. If you had trouble correcting the data, you can download a version of the data without errors at http://nathanieldphillips.com/wp-content/uploads/2015/05/pirate_survey_noerrors.txt
pirates.noerrors <- read.table("http://nathanieldphillips.com/wp-content/uploads/2015/05/pirate_survey_noerrors.txt", sep = "\t", header = TRUE, stringsAsFactors = FALSE)
aggregate(tchests.found ~ sex, # DV is tchests.found, IV is sex of pirates
  FUN = mean, # Calculate the mean
  na.rm = TRUE, # Ignore NA values when calculating the mean
  data = pirates.noerrors
)
## sex tchests.found
## 1 female 7.353319
## 2 male 7.128049
## 3 other 8.048780
#visualize results
aggregated.data <- aggregate(tchests.found ~ sex,
  FUN = mean,
  na.rm = TRUE,
  data = pirates.noerrors)
barplot(height = aggregated.data$tchests.found, names.arg = aggregated.data$sex)
aggregate(sword.speed ~ sword.type,
  FUN = median, # Calculate the median
  na.rm = TRUE, # Ignore NA values when calculating the median
  data = pirates.noerrors
)
## sword.type sword.speed
## 1 banana 2.5859139
## 2 cutlass 0.4848266
## 3 sabre 1.7393120
## 4 scimitar 1.7559671
#visualize results
aggregated.data <- aggregate(sword.speed ~ sword.type,
  FUN = median,
  na.rm = TRUE,
  data = pirates.noerrors)
barplot(height = aggregated.data$sword.speed, names.arg = aggregated.data$sword.type)
First, calculate the median sword speed, separately for each level of headband use using aggregate(). What is your conclusion?
Second, calculate the median sword speed for all combinations of both headband AND sword.type. That is, calculate the median sword.speed for headband-wearers who use cutlasses, headband-wearers who use sabres, … headband-nonwearers who use cutlasses, headband-nonwearers who use sabres… (hint: include two independent variables in the formula argument to aggregate()). Does your conclusion change? If so, what do you think is going on?
#1 calculate the median sword speed, separately for each level of headband use using aggregate().
aggregate(sword.speed ~ headband,
  FUN = median, # Calculate the median
  na.rm = TRUE, # Ignore NA values when calculating the median
  data = pirates.noerrors
)
## headband sword.speed
## 1 no 1.0780988
## 2 yes 0.5375353
#conclusion: Pirates who wear a headband have a lower median sword.speed (0.54) than pirates who do not (1.08).
#2 calculate the median sword speed for all combinations of both headband AND sword.type. That is, calculate the median sword.speed for headband-wearers who use cutlasses, headband-wearers who use sabres, … headband-nonwearers who use cutlasses, headband-nonwearers who use sabres… (hint: include two independent variables in the formula argument to aggregate()). Note that the call below additionally splits the groups by sex.
aggregate(sword.speed ~ sex + sword.type + headband,
  FUN = median, # Calculate the median
  na.rm = TRUE, # Ignore NA values when calculating the median
  data = pirates.noerrors
)
## sex sword.type headband sword.speed
## 1 female banana no 2.1053095
## 2 male banana no 2.1928123
## 3 other banana no 1.6016872
## 4 female cutlass no 0.4812121
## 5 male cutlass no 0.3408127
## 6 female sabre no 1.1823583
## 7 male sabre no 1.2412301
## 8 other sabre no 0.5307804
## 9 female scimitar no 0.9705695
## 10 male scimitar no 0.5063881
## 11 other scimitar no 1.4368908
## 12 female banana yes 4.2625317
## 13 male banana yes 14.1127041
## 14 other banana yes 1.4867682
## 15 female cutlass yes 0.4602335
## 16 male cutlass yes 0.5018827
## 17 other cutlass yes 0.6131719
## 18 female sabre yes 2.0247799
## 19 male sabre yes 2.9170651
## 20 other sabre yes 1.2586082
## 21 female scimitar yes 4.8914395
## 22 male scimitar yes 5.1367420
## 23 other scimitar yes 8.1321056
#I had trouble finding out how to group headband-wearers and non-wearers to compare them. My guess is that you need to split up wearers and non-wearers and then run the aggregate function with y = sword.speed and x1 = sex and x2 = sword.type.
#You could also use the dplyr package, I guess.
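A minimal sketch of the two-variable comparison the exercise asks for, using a made-up toy data frame rather than the survey data. The toy values are chosen so that most headband wearers use the low-speed sword type, which is one way an overall comparison and a within-group comparison can point in opposite directions:

```r
# Hypothetical toy data: most headband wearers use cutlasses
toy <- data.frame(
  headband    = c("yes", "yes", "yes", "yes", "no", "no", "no", "no"),
  sword.type  = c("cutlass", "cutlass", "cutlass", "sabre",
                  "cutlass", "sabre", "sabre", "sabre"),
  sword.speed = c(0.5, 0.6, 0.7, 3.0, 0.3, 1.0, 1.2, 1.4),
  stringsAsFactors = FALSE
)

# One independent variable: median sword.speed per headband level
by.headband <- aggregate(sword.speed ~ headband, data = toy, FUN = median)

# Two independent variables: one row per headband x sword.type combination
by.both <- aggregate(sword.speed ~ headband + sword.type,
                     data = toy, FUN = median)

by.headband
by.both
```

In this toy example the headband wearers have the lower median overall but the higher median within each sword type, because the groups differ in which swords they use. Whether the same pattern holds in the survey data has to be checked on the real output.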
First, define the dataframe, THEN… (%>%)
Second, group the data by favorite.pirate, THEN… (%>%)
Third, use the summarise function to tell R you will summarise the data.
Fourth, define each of the new summary statistics (then close the summarise function).
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
pirates.fave <- pirates.noerrors %>%
  group_by(favorite.pirate) %>% # Define the grouping variable favourite pirate
  summarise(frequency = length(favorite.pirate),
            tattoos.mean = mean(tattoos, na.rm = T),
            swords.speed.median = median(sword.speed, na.rm = T)
  )
pirates.fave #yeeeha
## Source: local data frame [6 x 4]
##
## favorite.pirate frequency tattoos.mean swords.speed.median
## 1 Anicetus 120 9.100000 0.4776414
## 2 Blackbeard 100 9.620000 0.7313608
## 3 Edward Low 114 9.342105 0.5371732
## 4 Hook 115 9.713043 0.6046085
## 5 Jack Sparrow 453 9.607064 0.5391035
## 6 Lewis Scot 98 9.081633 0.5539311
#Pirates whose favourite pirate is "Hook" tend to have the most tattoos
#Pirates whose favourite pirate is "Blackbeard" seem to have the highest sword speed
#Pirates whose favourite pirate is "Lewis Scot" appear to have the fewest tattoos
#Jack Sparrow is very popular. :-)
First, create a logical vector indicating whether or not a pirate went to Jack Sparrow’s School of Fashion and Piratry. Second, use aggregate() to calculate the percentage of pirates who went to Jack Sparrow’s School of Fashion and Piratry as a function of age. Third, create a scatterplot showing the results
#1 create logical vector
college.log <- pirates.noerrors$college == "JSSFP"
college.log
## [1] TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE TRUE FALSE
## [23] FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE
## [45] FALSE FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE
## [56] FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE TRUE
## [67] FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE TRUE FALSE FALSE
## [78] FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE
## [89] FALSE TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [100] FALSE FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE
## [111] FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE FALSE FALSE
## [122] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE
## [133] FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE TRUE FALSE
## [144] FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
## [155] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE
## [166] FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE FALSE TRUE FALSE
## [177] TRUE TRUE FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE TRUE
## [188] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE
## [199] TRUE FALSE FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE
## [210] FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE
## [221] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [232] FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
## [243] FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## [254] FALSE FALSE FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE
## [265] TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE
## [276] FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [287] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [298] FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE
## [309] TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE
## [320] FALSE TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE FALSE FALSE
## [331] TRUE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE
## [342] TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE TRUE FALSE
## [353] TRUE FALSE TRUE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
## [364] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE
## [375] TRUE FALSE TRUE FALSE FALSE FALSE TRUE TRUE TRUE FALSE TRUE
## [386] FALSE FALSE TRUE TRUE TRUE FALSE TRUE TRUE FALSE FALSE FALSE
## [397] FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE TRUE FALSE
## [408] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [419] FALSE FALSE TRUE FALSE TRUE FALSE TRUE FALSE FALSE FALSE TRUE
## [430] FALSE TRUE FALSE TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE
## [441] FALSE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE FALSE TRUE
## [452] FALSE TRUE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## [463] FALSE TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE
## [474] FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
## [485] FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE FALSE FALSE
## [496] FALSE TRUE FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE TRUE
## [507] TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE
## [518] FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE
## [529] FALSE TRUE TRUE FALSE TRUE FALSE TRUE FALSE FALSE FALSE TRUE
## [540] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [551] FALSE TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [562] FALSE TRUE TRUE FALSE TRUE FALSE TRUE FALSE TRUE TRUE FALSE
## [573] TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## [584] TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
## [595] FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE
## [606] FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
## [617] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE TRUE FALSE FALSE
## [628] TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE FALSE
## [639] TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE FALSE TRUE TRUE
## [650] TRUE TRUE TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [661] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [672] FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
## [683] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [694] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## [705] FALSE TRUE TRUE FALSE FALSE FALSE TRUE FALSE TRUE FALSE TRUE
## [716] FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE
## [727] FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE
## [738] FALSE TRUE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [749] TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
## [760] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## [771] FALSE TRUE FALSE FALSE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
## [782] TRUE FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
## [793] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE
## [804] FALSE FALSE TRUE FALSE TRUE FALSE TRUE TRUE FALSE FALSE TRUE
## [815] FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
## [826] FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [837] FALSE FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## [848] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [859] TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## [870] TRUE FALSE TRUE FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE
## [881] FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE
## [892] TRUE FALSE TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
## [903] TRUE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE TRUE
## [914] TRUE FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE TRUE FALSE
## [925] TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
## [936] FALSE TRUE TRUE FALSE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
## [947] FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE TRUE
## [958] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [969] FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE FALSE
## [980] FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [991] TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE FALSE FALSE
#I think there's something wrong with this but cannot figure out what the issue is.
aggregate(age ~ college.log,
  FUN = mean, # Calculate the mean
  na.rm = TRUE, # Ignore NA values when calculating the mean
  data = pirates.noerrors
)
## college.log age
## 1 FALSE 24.4113
## 2 TRUE 33.3168
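#this is not what the exercise asks for: it gives the mean age per college group, not the percentage of JSSFP pirates per age. For that, college.log would have to be the dependent variable and age the independent variable.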
plot(x = pirates.noerrors$age, y = college.log)
plot(x = college.log, y = pirates.noerrors$age)
require(dplyr)
pirates.college <- pirates.noerrors %>%
  group_by(college) %>% # Define the grouping variable college
  summarise(frequency = length(college),
            age.mean = mean(age, na.rm = T)
  )
pirates.college
## Source: local data frame [2 x 3]
##
## college frequency age.mean
## 1 CCCC 637 24.4113
## 2 JSSFP 363 33.3168
#I'm sorry, I actually have no clue how I'm supposed to calculate the percentage as a function of age. The table above only shows that 363 of the 1000 pirates (36.3%) went to JSSFP and that they are older on average (33.3 vs. 24.4 years), but to be honest, I have no idea how to go on from there.
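One possible reading of the three steps in this exercise, sketched on made-up data. The toy data frame and the column name jssfp are assumptions for illustration. The key point is that the mean of a logical vector is the proportion of TRUE values, so the logical vector goes on the left of the formula and age on the right:

```r
# Hypothetical toy data standing in for pirates.noerrors
toy <- data.frame(
  age     = c(20, 20, 20, 20, 30, 30, 30, 30, 40, 40),
  college = c("CCCC", "CCCC", "CCCC", "JSSFP",
              "CCCC", "CCCC", "JSSFP", "JSSFP",
              "JSSFP", "JSSFP"),
  stringsAsFactors = FALSE
)

# 1. Logical vector: TRUE if the pirate went to JSSFP
toy$jssfp <- toy$college == "JSSFP"

# 2. Proportion of TRUE values for each age, converted to a percentage
jssfp.by.age <- aggregate(jssfp ~ age, data = toy, FUN = mean)
jssfp.by.age$percent <- 100 * jssfp.by.age$jssfp

# 3. Scatterplot of the percentage against age
plot(x = jssfp.by.age$age, y = jssfp.by.age$percent,
     xlab = "Age", ylab = "% who went to JSSFP")

jssfp.by.age
```

With the survey data, the same pattern would use pirates.noerrors in place of toy; the result then has one row per distinct age.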