Plotting the Titanic

Log in: http://rstudio.saintannsny.org:8787/

This lab: http://rpubs.com/jcross/titanic_plot

Yesterday’s lab: http://rpubs.com/jcross/intro_to_R

Loading packages

This code first points R to where it will find packages and then loads two packages. Packages are simply bundles of functions (and sometimes data). The dplyr package helps us rearrange data while the ggplot2 package helps us graph data.

.libPaths("/home/rstudioshared/shared_files/packages")
library(dplyr)
library(ggplot2)

Loading and Viewing the data

The Titanic data is saved as a .csv (comma-separated values) file in a shared folder on the Saint Ann’s server. You can load it into R and view it using the code below.

titanic <- read.csv("/home/rstudioshared/shared_files/data/titanic_train.csv")
View(titanic)
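
Besides View, a few base R functions give a quick text summary of a data frame. The sketch below uses a tiny made-up data frame (the values are invented and only the column names come from the Titanic file) so that it runs anywhere; swap in `titanic` to inspect the real data.

```r
# Toy stand-in for the real data; the values are invented for illustration.
toy <- read.csv(text = "Pclass,Sex,Age,Fare
3,male,22,7.25
1,female,38,71.28
3,female,26,7.93")

dim(toy)      # number of rows, then number of columns
names(toy)    # column names
head(toy, 2)  # first two rows
str(toy)      # the type of each column
```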

Take a few minutes to look at this data. It contains information on 891 of the people who were on board the Titanic. This data is from a Kaggle competition. Kaggle is a website where people build models to make predictions and compete in prediction contests. You can see a description of the data set and of each of its variables here: https://www.kaggle.com/c/titanic/data and here: https://www.kaggle.com/c/titanic.

Plotting the Titanic

We’re going to start exploring this data by graphing it using the ggplot2 package.

The ggplot2 package has two main functions: qplot for quicker, easier plotting and ggplot, which gives you finer control of the plot. The gg in ggplot stands for “grammar of graphics”, and the intent of this lab is for you to begin to learn some of the grammatical structures. Let’s start with qplot. (Newer versions of ggplot2, 3.4.0 and later, mark qplot as deprecated, so you may see a warning when you run it.)

qplot(Age, Fare, data=titanic)

qplot(Age, Fare, data=titanic, colour=Pclass)

Both lines above create a scatter plot, which is what qplot draws by default when given an x and a y variable. The second line colors the data points by passenger class (Pclass). However, it treats Pclass as a continuous variable and uses a continuous shading scheme (with a shade for every value between 1 and 3). In truth, Pclass only takes the values 1, 2 and 3 (there’s no such thing as a passenger class of 2.397!), so it makes more sense to think of passenger class as a categorical variable than a continuous one or, in the language of R, as a “factor” variable rather than a “numeric” variable. We can address this by adding new variables to our data set.
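
To see the difference between a numeric and a factor variable outside of a plot, here is a minimal sketch using a handful of invented passenger classes (not taken from the data set):

```r
# Passenger classes stored as plain numbers (invented sample values).
pclass <- c(3, 1, 3, 2, 3)

class(pclass)    # "numeric": R treats these as points on a continuous scale
mean(pclass)     # 2.4, which is not a real passenger class

# The same values converted to a factor.
pclass_f <- as.factor(pclass)
class(pclass_f)  # "factor"
levels(pclass_f) # "1" "2" "3": the only categories that exist
```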

Creating New Variables and Variable Types

We can use dplyr to create new variables. The following code creates three new variables: two in which passenger class and survival are treated as factor variables (passengers either lived or died, so “Survived” can’t take on the value 0.8) and a third that splits passengers into groups based on their age (this may prove useful later).

titanic <- titanic %>% mutate(Pclass.factor = as.factor(Pclass), 
                              Survived.factor = as.factor(Survived),
                              age.group = cut(Age, breaks=seq(0,90,10))
                              )
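
The cut function may be new, so here is a small sketch of what it does with the same breaks, applied to a few invented ages. Each group excludes its lower end and includes its upper end, and a missing age stays missing.

```r
# Invented sample ages, including one missing value.
ages <- c(0.5, 10, 10.5, 35, NA)

# Ten break points (0, 10, ..., 90) define nine groups.
groups <- cut(ages, breaks = seq(0, 90, 10))

groups          # age 10 lands in (0,10], while 10.5 lands in (10,20]
table(groups)   # how many ages fall in each group (NA is not counted)
```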

Now, let’s look at a summary of our modified data set. Notice the different way that R summarizes “Survived” and “Survived.factor”.

summary(titanic)
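
The contrast is easier to see on a single small vector. This sketch uses five invented survival values rather than the real column:

```r
# Invented values: 1 = survived, 0 = died.
survived <- c(0, 1, 1, 0, 0)

# As numbers: minimum, quartiles, mean and maximum.
# The mean of a 0/1 variable is the share of 1s (here 0.4).
summary(survived)

# As a factor: a count of each category.
factor_counts <- summary(as.factor(survived))
factor_counts
```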

We can also repeat the graph we made above, this time using passenger class as a factor variable.

qplot(Age, Fare, data=titanic, colour=Pclass.factor)

Which passenger class paid the highest fares?
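
One way to check what you see in the graph is to summarise fares by class with dplyr. This is only a sketch: the `toy` data frame and its fares are invented, and the names `median_fare` and `max_fare` are chosen here for illustration. Replace `toy` with `titanic` to run it on the real data.

```r
library(dplyr)

# Invented fares for illustration only.
toy <- data.frame(
  Pclass.factor = factor(c(1, 1, 2, 2, 3, 3)),
  Fare          = c(80, 60, 20, 14, 8, 7)
)

# One row per passenger class, with the median and maximum fare.
fare_by_class <- toy %>%
  group_by(Pclass.factor) %>%
  summarise(median_fare = median(Fare), max_fare = max(Fare))

fare_by_class
```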

We can also create multiple graphs at a time, using facets. The following graphs split passengers up by port of embarkation (and, in the second graph, by port and sex) and create a scatter plot for each group. We see that almost all of the passengers who came aboard in Queenstown, Ireland (Embarked is “Q”) were in passenger class 3.

qplot(Age, Fare, data=titanic, colour=Pclass.factor, facets=.~Embarked)

qplot(Age, Fare, data=titanic, colour=Pclass.factor, facets=Sex~Embarked)

Challenge:

Create a graph of Fare versus Age (as above) but with points colored by whether the passengers survived. Then try creating a grid of such graphs with passengers split by gender and passenger class. Based on this graph, how would you describe who lived and who died on the Titanic?

Now with ggplot

Our options expand when we use ggplot instead of qplot. First, to replicate the graph we made above we can use:

ggplot(titanic, aes(x=Age, y=Fare, color=Pclass.factor)) + geom_point() + facet_grid(Sex~Embarked)

In the code above, we stated which variables we wanted to use within aes (for “aesthetics”), then added a layer (geom_point) stating the type of plot that we would like, and finally added a facet specification (facet_grid) describing how we’d like to slice up the data.
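
The same kind of plot can be built one piece at a time, which makes the role of each part easier to see. This is a sketch on an invented six-row data frame; the object names `base`, `points` and `faceted` are chosen here for illustration.

```r
library(ggplot2)

# Invented values; the column names mirror the Titanic data.
toy <- data.frame(
  Age  = c(22, 38, 26, 35, 54, 2),
  Fare = c(7.25, 71.28, 7.93, 53.10, 51.86, 21.08),
  Sex  = c("male", "female", "female", "female", "male", "male")
)

base    <- ggplot(toy, aes(x = Age, y = Fare))  # data and aesthetics only, nothing drawn yet
points  <- base + geom_point()                  # add a layer of points
faceted <- points + facet_grid(. ~ Sex)         # split into one panel per sex

faceted  # typing the name of a plot object draws it
```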

Here are some other types of graphs that we can make using the same data. Take a moment to try to understand what each of these graphs is showing. (Note that geom_hex relies on the hexbin package being installed.)

ggplot(titanic, aes(x=Age, y=Fare)) + geom_hex()
ggplot(titanic, aes(x=Age, y=Fare)) + geom_smooth()
ggplot(titanic, aes(x=Age, y=Fare)) + geom_smooth(method="lm")
ggplot(titanic, aes(x=Age, y=Fare)) + geom_density2d()

We can stack layers on top of layers. For instance (you can ignore all of the red warnings, for now):

ggplot(titanic, aes(x=Age, y=Fare, color=Pclass.factor)) + 
  geom_smooth(method="lm") + geom_point() +facet_grid(.~Embarked)

You can also assign a graph to a variable and/or add multiple “layers”. Try the following lines one at a time. The third line below hideously combines hexagonal bins, a smooth curve and points, while making the bin shading lighter (alpha=0.3), the line red and thick (color=“red”, lwd=2) and increasing the size of the points (size=3).

g <- ggplot(titanic, aes(Age, Fare))

g + geom_hex() + geom_smooth()

g + geom_point() + geom_smooth()

g + geom_hex(alpha=0.3) + geom_smooth(color="red", lwd=2) + geom_point(size=3)

Here are five univariate plots: two histograms and three density plots, which are similar to histograms but use a smooth curve rather than bins. Try each of the following, one at a time, and make sure you understand what each graph is showing.

g <- ggplot(titanic, aes(Age))
g + geom_histogram()
g + geom_histogram(aes(fill=Sex))
g + geom_density()
g + geom_density(aes(fill=Sex))
g + geom_density(aes(fill=Sex), alpha=0.3)
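
A histogram simply counts how many values fall in each bin. The sketch below, on ten invented ages, sets the bin width to 10 years and then computes the counts by hand with cut and table; the bar heights should match those counts.

```r
library(ggplot2)

# Invented ages, none of which sits exactly on a bin edge.
toy <- data.frame(Age = c(2, 4, 19, 22, 24, 26, 35, 38, 54, 61))

# Bins of width 10 starting at 0: 0-10, 10-20, and so on.
ggplot(toy, aes(Age)) + geom_histogram(binwidth = 10, boundary = 0)

# The counts behind the bars, computed by hand.
counts <- table(cut(toy$Age, breaks = seq(0, 70, 10)))
counts
```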

You can also create bar charts:

ggplot(titanic, aes(Pclass.factor, fill=Sex))+geom_bar()
ggplot(titanic, aes(Pclass.factor, fill=Survived.factor))+geom_bar()
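
The height of each colored bar segment is a count of passengers in that group. A minimal sketch of the same counts as a table, using six invented passengers (replace the `toy` columns with the `titanic` ones for the real numbers):

```r
# Invented passengers; Survived.factor uses 1 = survived, 0 = died.
toy <- data.frame(
  Pclass.factor   = factor(c(1, 1, 2, 3, 3, 3)),
  Survived.factor = factor(c(1, 0, 1, 0, 0, 1))
)

# Rows are passenger classes, columns are survival outcomes.
counts <- table(toy$Pclass.factor, toy$Survived.factor)
counts

# Share who died or survived within each class (each row sums to 1).
shares <- prop.table(counts, margin = 1)
shares
```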

The following line creates a box plot. The top of each “box” shows the 75th percentile value (meaning that 75% of the values are lower), the bottom shows the 25th percentile value and the line through the box shows the median (or 50th percentile value). The difference between the 25th percentile and the 75th percentile (in other words, the height of the box) is called the interquartile range (or IQR).

ggplot(titanic, aes(Sex, Age)) + geom_boxplot()
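
The numbers that a box plot draws can also be computed directly. This sketch uses nine invented ages; if you try it on a real column that contains missing values (NA), add na.rm = TRUE to both calls.

```r
# Invented sample ages, already sorted for readability.
ages <- c(4, 19, 22, 26, 30, 35, 38, 54, 61)

# 25th percentile, median and 75th percentile: the bottom, middle line and top of the box.
box <- quantile(ages, c(0.25, 0.5, 0.75))
box

# Interquartile range: the height of the box.
IQR(ages)
```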

We can add a title and axis labels. Below, we’ll save a chart as a variable and then add a title, axis labels and legend labels to that variable:

barchart <- ggplot(titanic, aes(Pclass.factor, fill=Survived.factor))+geom_bar()

barchart + xlab("Passenger Class") + ylab("Number of Passengers") +
  ggtitle("Survival by Passenger Class") +
  scale_fill_discrete(name = "", labels = c("Died", "Survived"))

You can save your graph within the plot window or save your most recent plot from within your code as shown below:

ggsave("titanic_barchart.png", width = 5, height = 5)

You should now see titanic_barchart.png within your files and you can click on it, export it or include it in a document that you create from within R (more on this later).

Try creating a couple of graphs based on this Titanic data, editing the title and axis labels, and then saving them in your folder.

You can read more about graphing using ggplot here: http://r4ds.had.co.nz/data-visualisation.html
