Log in: http://rstudio.saintannsny.org:8787/

This lab: http://rpubs.com/jfcross4/114999

functions: http://rpubs.com/jfcross4/110739

dplyr: http://rpubs.com/jfcross4/110757

tidyr: http://rpubs.com/jfcross4/111819

Loading packages

.libPaths("/home/rstudioshared")
library(dplyr)
library(ggplot2)

Loading and Viewing the data

The Titanic data is saved as a .csv (comma-separated values) file in a shared folder on the Saint Ann’s server. You can load it into R and view it using the code below.

titanic <- read.csv("/home/rstudioshared/shared_files/titanic_train.csv")
View(titanic)

This data is from a Kaggle competition. You can find a description of the dataset and of each variable at https://www.kaggle.com/c/titanic/data and https://www.kaggle.com/c/titanic.

Titanic Questions (use the dplyr lab!)

Example: What was the average age of males and females on the Titanic?

titanic %>% group_by(Sex) %>% summarise(mean(Age))
titanic %>% group_by(Sex) %>% summarise(mean(Age, na.rm=TRUE))

Notice how the first line returns “NA”. This is because ages are not listed for some of the passengers, and the mean of a set of values that includes a missing value is itself missing. The second line adds na.rm=TRUE inside the mean function, which removes the NA values first and then calculates the mean of the ages that are known.
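
To see the effect of na.rm on a small scale, here is a minimal sketch using a made-up four-row data frame. The name toy and the column name mean_age are illustrative assumptions, not part of the Titanic data.

```r
library(dplyr)

# Made-up data for illustration; one male age is missing (NA)
toy <- data.frame(
  Sex = c("female", "female", "male", "male"),
  Age = c(20, 30, 40, NA)
)

# Without na.rm, any group that contains an NA gets an NA mean
with_na <- toy %>% group_by(Sex) %>% summarise(mean_age = mean(Age))

# With na.rm = TRUE, missing ages are dropped before averaging
without_na <- toy %>% group_by(Sex) %>% summarise(mean_age = mean(Age, na.rm = TRUE))

print(with_na)     # female 25, male NA
print(without_na)  # female 25, male 40
```

Naming the summary column (mean_age = ...) is optional, but it gives the result a tidier header than the default.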

  1. Calculate the survival rate for males and females.
  2. Calculate the survival rate by passenger class (Pclass).

Plotting the Titanic

The ggplot2 package has two main functions: qplot, for quicker, easier plotting, and ggplot, which gives you finer control of the plot. The gg in ggplot stands for “grammar of graphics”, and the intent of this lab is for you to begin to learn some of the grammatical structures. Let’s start with qplot.

qplot(Age, Fare, data=titanic)
qplot(Age, Fare, data=titanic, colour=Pclass)
qplot(Age, Fare, data=titanic, colour=as.factor(Pclass))

Notice how all three of the calls above start by stating that you want a graph of Fare versus Age. By default, this creates a scatterplot. The second and third lines add that you want to color your data points by passenger class (Pclass). On the second line, R treats Pclass as a continuous variable and uses a continuous shading scheme. By adding as.factor() on the third line, you’ve told R to treat passenger class as a set of categories, so each class gets its own distinct color.
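
The difference between a number and a factor can be checked directly. This is a small sketch with a made-up vector (pclass) standing in for the Pclass column:

```r
# Made-up values standing in for the Pclass column
pclass <- c(1, 3, 2, 3, 1)

class(pclass)                  # "numeric": a continuous quantity

pclass_f <- as.factor(pclass)  # convert to categories
class(pclass_f)                # "factor"
levels(pclass_f)               # "1" "2" "3": the distinct categories
```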

We can also create multiple graphs at a time, using facets. The following graphs split passengers up by gender and port of embarkation and create a scatterplot for each group.

qplot(Age, Fare, data=titanic, colour=as.factor(Pclass), facets=~Sex+Embarked)
qplot(Age, Fare, data=titanic, colour=as.factor(Pclass), facets=Sex~Embarked)
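
For comparison, here is a sketch of how the same two layouts might be written with ggplot, assuming the one-sided formula corresponds to facet_wrap and the two-sided formula to facet_grid. The data frame toy is made up for illustration and only mimics a few columns of the Titanic data.

```r
library(ggplot2)

# Made-up data standing in for the titanic data frame
toy <- data.frame(
  Age = c(22, 38, 26, 35, 54, 2, 27, 14),
  Fare = c(7.25, 71.28, 7.92, 53.1, 51.86, 21.08, 11.13, 30.07),
  Pclass = c(3, 1, 3, 1, 1, 3, 3, 2),
  Sex = c("male", "female", "female", "female", "male", "male", "female", "female"),
  Embarked = c("S", "C", "S", "S", "S", "S", "S", "C")
)

p <- ggplot(toy, aes(Age, Fare, colour = as.factor(Pclass))) + geom_point()

# One panel for each Sex/Embarked combination that occurs in the data
p_wrap <- p + facet_wrap(~ Sex + Embarked)

# A full grid: one row per Sex, one column per Embarked
p_grid <- p + facet_grid(Sex ~ Embarked)

print(p_wrap)
print(p_grid)
```

The wrapped version only draws panels for combinations that exist, while the grid version keeps every row and column, so it can contain empty panels.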

Challenge: create a graph of Fare versus Age (as above) but with points colored by whether the passengers survived. Then try creating a grid of such graphs with passengers split by gender and passenger class.

Now with ggplot

Try each of the following:

ggplot(titanic, aes(x=Age, y=Fare)) + geom_point()
ggplot(titanic, aes(x=Age, y=Fare)) + geom_hex()
ggplot(titanic, aes(x=Age, y=Fare)) + geom_smooth()
ggplot(titanic, aes(x=Age, y=Fare)) + geom_line()
ggplot(titanic, aes(x=Age, y=Fare)) + geom_density2d()

In each case, we stated which variables we wanted to use within aes (for “aesthetics”) and then added a layer stating the type of plot that we would like.

You can also save part of the plotting command in a variable and/or add multiple layers. The last line below hideously combines hexbins, a smooth curve and points, while making the bin shading lighter (alpha=0.3), the line red and thick (color="red", lwd=2) and increasing the size of the points (size=3).

g <- ggplot(titanic, aes(Age, Fare))
g + geom_hex() + geom_smooth()
g + geom_point() + geom_smooth()
g + geom_hex(alpha=0.3) + geom_smooth(color="red", lwd=2) + geom_point(size=3) 
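
Layers are drawn in the order they are added, so later layers sit on top of earlier ones. The sketch below uses a made-up data frame (toy) and ggplot2's layer_data() function to peek at what one layer will draw; treat it as an illustration rather than part of the lab's required steps.

```r
library(ggplot2)

# Made-up data for illustration
toy <- data.frame(Age = c(10, 20, 30, 40), Fare = c(15, 8, 30, 12))

# Save the shared part of the plot (data and aesthetics) once
g <- ggplot(toy, aes(Age, Fare))

# Each + adds a layer: the line is drawn first, the points on top of it
p <- g + geom_line(colour = "red") + geom_point(size = 3)

# layer_data() returns the data ggplot2 computed for a given layer;
# layer 2 here is the point layer
points_layer <- layer_data(p, 2)
print(points_layer[, c("x", "y", "size")])
```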

Here are five univariate plots: two histograms and three density plots, which are similar to histograms but use a smooth curve rather than bins.

g <- ggplot(titanic, aes(Age))
g + geom_histogram()
g + geom_histogram(aes(fill=Sex))
g + geom_density()
g + geom_density(aes(fill=Sex))
g + geom_density(aes(fill=Sex), alpha=0.3)
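
By default geom_histogram() uses 30 bins and prints a message suggesting that you pick your own bin width. A possible way to do that is sketched below with made-up ages (toy); binwidth = 10 and boundary = 0 are illustrative choices, not values the lab prescribes.

```r
library(ggplot2)

# Made-up ages for illustration
toy <- data.frame(Age = c(2, 14, 22, 26, 27, 35, 38, 54))

# binwidth = 10 makes ten-year-wide bins; boundary = 0 lines the bin
# edges up with 0, 10, 20, ...
h <- ggplot(toy, aes(Age)) + geom_histogram(binwidth = 10, boundary = 0)
print(h)

# The counts behind the bars
bin_counts <- layer_data(h, 1)$count
print(bin_counts)
```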

You can also create boxplots and bar charts:

ggplot(titanic, aes(Sex, Age)) + geom_boxplot()
ggplot(titanic, aes(as.factor(Pclass), fill=Sex))+geom_bar()
ggplot(titanic, aes(as.factor(Pclass), fill=as.factor(Survived)))+geom_bar()
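
The heights of the bars in geom_bar() are counts of rows in each group. If you want to see the same numbers as a table, dplyr's count() is one way to get them. The data frame toy below is made up for illustration:

```r
library(dplyr)

# Made-up data standing in for the Pclass and Survived columns
toy <- data.frame(
  Pclass = c(1, 1, 2, 3, 3, 3),
  Survived = c(1, 0, 1, 0, 0, 1)
)

# One row per class/outcome combination; n is the number of passengers
counts <- toy %>% count(Pclass, Survived)
print(counts)
```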

You can also change the title and axis labels. Below, we’ll save the chart as a variable and then add a new title and axis labels to that variable:

barchart <- ggplot(titanic, aes(as.factor(Pclass), fill=as.factor(Survived)))+geom_bar()

barchart+xlab("Passenger Class")+ylab("Number of Passengers")+ggtitle("Survival by Passenger Class")+scale_fill_discrete(name = "", labels = c("Died", "Survived"))

You can save your graph within the plot window or save your most recent plot from within your code as shown below:

ggsave("titanic_barchart.png", width = 5, height = 5)

You should now see titanic_barchart.png within your files, and you can click on it, export it or include it in a document that you create from within R (more on this later).
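
ggsave() saves the most recently displayed plot unless you tell it otherwise. If you have stored a plot in a variable, you can name it explicitly with the plot argument. The sketch below uses a made-up data frame (toy) and a temporary file path so that it can run anywhere; in the lab you would pass an ordinary file name instead.

```r
library(ggplot2)

# Made-up data for illustration
toy <- data.frame(Pclass = c(1, 1, 2, 3, 3, 3))
barchart <- ggplot(toy, aes(as.factor(Pclass))) + geom_bar()

# tempfile() gives a throwaway path; this is only for the example
out_file <- tempfile(fileext = ".png")

# plot = names the plot to save, rather than relying on the last one shown
ggsave(out_file, plot = barchart, width = 5, height = 5)

file.exists(out_file)
```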

Try creating a couple of graphs based on the Titanic data, editing the title and axis labels and then saving them.

Create New Variables

In a couple of places in this lab, we have seen that it’s more useful to treat passenger class and survival as factors (categories) than as continuous numbers. We can use dplyr to create new variables in which passenger class and survival are stored as factors, along with a new variable that splits passengers into groups based on their age (this may prove useful later). Try using the mutate function as shown below and then examine a summary of the data set.

titanic <- titanic %>% mutate(Pclass.factor = as.factor(Pclass), 
                              Survived.factor = as.factor(Survived),
                              age.group = cut(Age, breaks=seq(0,90,10))
                              )
summary(titanic)
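
To see what cut() does with those breaks, here is a small sketch on a made-up vector of ages. By default each interval excludes its left endpoint and includes its right one, so an age of exactly 10 lands in (0,10], and a missing age stays missing.

```r
# Made-up ages for illustration, including one missing value
ages <- c(5, 10, 25, 80, NA)

# Breaks at 0, 10, 20, ..., 90 give nine ten-year groups
groups <- cut(ages, breaks = seq(0, 90, 10))

as.character(groups)  # "(0,10]" "(0,10]" "(20,30]" "(70,80]" NA
nlevels(groups)       # 9
```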