To access R and RStudio, which are installed on the Saint Ann’s server, go to http://rstudio.saintannsny.org:8787/ and log in with your Saint Ann’s email address.
Past labs:
http://rpubs.com/jcross/titanic_plot
http://rpubs.com/jcross/intro_to_R
Today (and probably tomorrow) we’ll be learning to filter, arrange, select, mutate and even summarize data using the dplyr package in R.
If you find yourself wanting to review this material or find out more, I encourage you to read Garrett Grolemund and Hadley Wickham’s online textbook *R for Data Science (http://r4ds.had.co.nz/)*. This lab practices the material in sections 5.1 through 5.6.
Loading packages
.libPaths("/home/rstudioshared/shared_files/packages")
library(dplyr)
library(ggplot2)
Loading and Viewing the data
The titanic data is saved as a .csv file (comma separated values) in a shared folder on the Saint Ann’s server. You can load it into R and view it using the code below.
titanic <- read.csv("/home/rstudioshared/shared_files/data/titanic_train.csv")
View(titanic)
How did the youngest passengers on the Titanic fare? We can look at all passengers who were under the age of 1:
titanic %>% filter(Age < 1)
Note that “%>%”, which is called a “pipe”, is a way of passing along whatever you have created so far and then doing something additional to it. This is ultimately quite useful. We can write lines of code with multiple pipes in order to do a series of things (one after the other) to our data. You can read the pipe operator as “and then”. In the code above, we start with the titanic data “and then” we filter by age.
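To see what the pipe is doing, here is a small sketch on a made-up data frame (the name `toy` and its values are my own invention for illustration, not part of the Titanic data). The piped version and the version without a pipe should give the same result.

```r
library(dplyr)

# A made-up data frame for illustration only (not the real Titanic data)
toy <- data.frame(Name = c("A", "B", "C", "D"),
                  Age = c(0.5, 30, 0.75, 45),
                  Survived = c(1, 0, 1, 0))

# With the pipe: start with toy "and then" filter by age
piped <- toy %>% filter(Age < 1)

# Without the pipe: the data frame is simply the first argument to filter()
unpiped <- filter(toy, Age < 1)

# Check whether the two approaches return the same rows
identical(piped, unpiped)
```

The pipe version tends to be easier to read once you chain several steps together, because the steps appear in the order they happen.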
…and what about the very oldest passenger on the boat?
titanic %>% filter(Age == max(Age, na.rm=TRUE))
# note: "na.rm=TRUE" tells max() to ignore passengers whose Age is not available ("NA")
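Here is a quick sketch of why “na.rm=TRUE” matters, using a made-up vector of ages (the name `ages` and its values are just for illustration):

```r
# Made-up ages, one of which is missing (NA)
ages <- c(22, 38, NA, 71)

# Without na.rm, a single unknown value makes the maximum unknown too
max(ages)

# With na.rm = TRUE, the NA is ignored and the maximum of the known ages is returned
max(ages, na.rm = TRUE)
```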
…and the oldest woman on the boat?
titanic %>% filter(Sex=="female") %>% filter(Age == max(Age, na.rm=TRUE))
[Notice the use of two pipes in the code above. We start with the entire data set, then we filter by Sex, and then we filter by age.]
Let’s take a look at the largest group of siblings:
titanic %>% filter(SibSp == max(SibSp, na.rm=TRUE))
It might seem inexplicable that every Sage child has 8 siblings when there are only 7 kids listed here, but remember that this is only a sample of the data. We can do a quick search to see if there are any other members of the Sage family in our data set.
titanic %>% filter(grepl('Sage,', Name))
[Note: “grepl” is a useful function. It returns a series of trues and falses, in this case depending on whether it found the string “Sage,” within the passenger’s name. The filter function then returns only the rows where grepl returned “TRUE”.]
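If you want to see grepl on its own, here is a small sketch with made-up names (the data frame `toy` and the names in it are invented for illustration and are not taken from the data set). It also hints at why the comma is part of the search string.

```r
library(dplyr)

# Made-up names for illustration only
toy <- data.frame(Name = c("Sage, Mr. Example",
                           "Smith, Miss. Example",
                           "Sagebrush, Mr. Example"))

# grepl returns one TRUE or FALSE per name
matches <- grepl('Sage,', toy$Name)
matches

# filter keeps only the rows where grepl returned TRUE
sage_rows <- toy %>% filter(grepl('Sage,', Name))
sage_rows
```

Searching for 'Sage,' with the comma skips the made-up “Sagebrush” passenger, whose surname merely starts with the same letters.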
By the way, the main characters in the Titanic movie (Jack and Rose) were fictional so there’s no use searching for them. The character played by Kathy Bates (“The Unsinkable Molly Brown”) was real and is listed in our training data as Margaret Tobin Brown.
If I want to find young people who died (Why? I don’t know!), I could do either of the following which will return all passengers younger than 5 who died.
titanic %>% filter(Age < 5, Survived ==0)
titanic %>% filter(Age < 5 & Survived ==0)
Question: Above, I wrote a line of code to find the oldest woman on the boat. Why doesn’t the following line of code do that?
titanic %>% filter(Sex=="female" & Age==max(Age, na.rm=TRUE))
If I wanted to return the youngest 5% of those who died and the oldest 2% of those who survived, I could try:
titanic %>% filter(Survived == 0) %>% filter(Age < quantile(Age, 0.05, na.rm=TRUE))
titanic %>% filter(Survived == 1) %>% filter(Age > quantile(Age, 0.98, na.rm=TRUE))
And, if I want the oldest 1% and the youngest 1%:
titanic %>% filter(Age < quantile(Age, 0.01, na.rm=TRUE) | Age > quantile(Age, 0.99, na.rm=TRUE))
# The "|" symbol can be read as "or"
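To make the quantile and “or” logic a little more concrete, here is a sketch on a made-up vector of ages (the names `ages`, `low`, `high` and `extreme` are mine, and I use 10% cutoffs only because the vector is so short):

```r
# Made-up ages for illustration, including one missing value
ages <- c(1, 5, 20, 30, 40, 50, 60, 80, NA)

# The cutoffs below which 10% and above which 10% of the known ages fall
low <- quantile(ages, 0.10, na.rm = TRUE)
high <- quantile(ages, 0.90, na.rm = TRUE)

# TRUE if an age is below the low cutoff OR above the high cutoff;
# the missing age stays NA
extreme <- ages < low | ages > high
extreme
```

When a condition like this is used inside filter, rows where the condition comes out as NA are dropped along with the rows where it is FALSE.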
Questions:
Who is the youngest person in passenger class 1 who died?
What is the name of the oldest male in passenger class 3 who survived?
I could have answered the questions above (perhaps more easily) using the top_n function:
titanic %>% filter(Pclass == 1, Survived==0) %>% top_n(-1, Age)
titanic %>% filter(Pclass == 3, Survived==1, Sex=="male") %>% top_n(1, Age)
I can also find the 10 most expensive and 5 cheapest tickets with the following code:
titanic %>% top_n(10, Fare)
titanic %>% top_n(-5, Fare)
Question: Why do these searches return more than 10 and 5 results, respectively?
If I want to be more fastidious and come up with an ordered top 10 list of ticket prices I could do the following:
titanic %>% top_n(10, Fare) %>% arrange(desc(Fare))
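A side note: as far as I know, in newer versions of dplyr (1.0.0 and later) top_n has been superseded by slice_max and slice_min. If your version of dplyr has them, something like the sketch below should work as an alternative. The data frame `toy` is made up for illustration.

```r
library(dplyr)

# Made-up tickets and fares for illustration only
toy <- data.frame(Ticket = c("a", "b", "c", "d", "e"),
                  Fare = c(10, 50, 45, 7, 30))

# The top_n approach used in this lab, followed by arrange to order the result
old_way <- toy %>% top_n(2, Fare) %>% arrange(desc(Fare))

# The slice_max approach (assumes dplyr 1.0.0 or later is installed)
new_way <- toy %>% slice_max(Fare, n = 2)

old_way
new_way
```

Similarly, slice_min(Fare, n = 5) would play the role of top_n(-5, Fare).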
I can use “mutate” to create new variables. For instance, we know the number of siblings each passenger had on the boat and the number of parents/children each passenger had but what if I want to know how many people had no family at all? First, I’ll calculate the number of family members for each person and then I’ll find everyone who had no family members.
titanic %>% mutate(num_family_members = SibSp + Parch) %>% filter(num_family_members==0)
Note that there’s nothing stopping me from combining my data transformation functions with the graphing functions we learned in the last lab:
titanic %>% mutate(num_family_members = SibSp + Parch) %>%
ggplot(aes(num_family_members,Fare, color=Sex)) + geom_point() + facet_wrap(~Survived)
titanic %>% mutate(log10Fare = log(Fare,10), num_family_members = SibSp + Parch) %>%
ggplot(aes(num_family_members,log10Fare, color=Sex)) + geom_point() + facet_wrap(~Survived)
Lastly, I can select only the columns I want and save my creations as new variables.
titanic %>% mutate(num_family_members = SibSp + Parch) %>%
filter(Age < 2) %>%
select(Survived, num_family_members, Fare)
titanic_youngsters <- titanic %>% mutate(num_family_members = SibSp + Parch) %>%
filter(Age < 2) %>%
select(Survived, num_family_members, Fare)
View(titanic_youngsters)
This is where things get exciting – although it might take us a while to see the full power of summaries.
Let’s start by finding the average (that is, the mean) family size of passengers, the average Fare and the rate of survival:
titanic %>% mutate(num_family_members = SibSp + Parch) %>%
summarize(mean_fam_size = mean(num_family_members), mean_Fare = mean(Fare), SurvivalRate = mean(Survived))
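It may not be obvious why mean(Survived) gives a survival rate. Assuming Survived is coded as 1 for passengers who survived and 0 for those who died (which is how the filters earlier in this lab treat it), the mean of the column is just the fraction of 1s. A tiny sketch with a made-up vector:

```r
# Made-up survival codes: 1 = survived, 0 = died
survived <- c(1, 0, 0, 1, 0)

# The mean of a 0/1 vector is the proportion of 1s
rate_from_mean <- mean(survived)

# The same thing computed "by hand": number of survivors divided by number of passengers
rate_by_hand <- sum(survived) / length(survived)

rate_from_mean
rate_by_hand
```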
This starts to get interesting when I do summaries for subgroups. I’m also going to calculate the number of people (“num_passengers”) in each subgroup:
titanic %>% mutate(num_family_members = SibSp + Parch) %>%
group_by(Pclass) %>%
summarize(num_passengers=n(), mean_fam_size = mean(num_family_members), mean_Fare = mean(Fare), SurvivalRate = mean(Survived))
titanic %>% mutate(num_family_members = SibSp + Parch) %>%
group_by(Sex) %>%
summarize(num_passengers=n(),mean_fam_size = mean(num_family_members), mean_Fare = mean(Fare), SurvivalRate = mean(Survived))
titanic %>% mutate(num_family_members = SibSp + Parch) %>%
group_by(Pclass, Sex) %>%
summarize(num_passengers=n(),mean_fam_size = mean(num_family_members), mean_Fare = mean(Fare), SurvivalRate = mean(Survived))
titanic %>% mutate(num_family_members = SibSp + Parch, age.group = cut(Age, breaks=seq(0,90,10))) %>%
group_by(Sex,age.group) %>%
summarize(num_passengers=n(),mean_fam_size = mean(num_family_members), mean_Fare = mean(Fare), SurvivalRate = mean(Survived))
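The last chunk above uses cut to turn Age into age groups. Here is a small sketch of what cut does with the same breaks, applied to a made-up vector of ages (the names `ages` and `age_group` are just for illustration):

```r
# Made-up ages for illustration, including one missing value
ages <- c(5, 10, 35, NA)

# The breaks are 0, 10, 20, ..., 90, so each age lands in a ten-year bin
age_group <- cut(ages, breaks = seq(0, 90, 10))

# Show the bin label assigned to each age
as.character(age_group)
# "(0,10]" "(0,10]" "(30,40]" NA
```

A label like “(0,10]” means older than 0 and up to and including 10, which is why the 10-year-old lands in the first bin. A missing age gets a missing age group.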
We can also plot our summaries! Take your time and see if you can dissect everything that is going on in each of the following two chunks of code.
titanic %>% mutate(num_family_members = SibSp + Parch, age.group = cut(Age, breaks=seq(0,90,10))) %>%
group_by(Sex,age.group) %>%
summarize(num_passengers=n(),SurvivalRate = mean(Survived)) %>%
filter(!is.na(age.group)) %>%
ggplot(aes(age.group, SurvivalRate, color=Sex, size=num_passengers)) + geom_point() +
ggtitle("Titanic Survival Rates by Age and Sex")
titanic %>% mutate(num_family_members = SibSp + Parch, age.group = cut(Age, breaks=seq(0,90,10))) %>%
group_by(Sex,age.group, Pclass) %>%
summarize(num_passengers=n(),SurvivalRate = mean(Survived)) %>%
filter(!is.na(age.group)) %>%
ggplot(aes(age.group, SurvivalRate, color=Sex, size=num_passengers)) + geom_point() + facet_wrap(~Pclass) +
ggtitle("Titanic Survival Rates by Age, Sex and Passenger Class")
Create one graph based on the titanic data that you think is interesting or revealing. This can be a graph based on raw data or a graph based on a summary of the data. Export the graph and email it to me (jcross@saintannsny.org) along with the code that you used to create the graph. In class next week, I will ask your classmates to predict what the graph will look like based on the code and ask you to explain what you see in the graph.
Please let me know if you’d like to borrow a Chromebook to work on this assignment (or anything else) outside of class.
There’s a good chance that you will encounter inexplicable error messages. I encourage you to email me for help and to include your code and the error message. You might also try doing a Google search for the error message and seeing what you find.