One of the most important steps in data analysis is the initial exploration. To move from obtaining data to applying statistical models and learning algorithms, one must first get a sense of what is going on in the data. Skipping this step may lead models and algorithms in a completely unreasonable direction.
In this short article we will look at the Titanic data and do some initial exploration with it. We seek to uncover relationships between attributes of passengers in order to predict whether they survived. In Machine Learning, this type of problem is called a classification problem. You can obtain a copy of this data set here.
First, let's take a quick glimpse of the features (attributes) recorded for each passenger. The glimpse function in the dplyr package outputs this information nicely for us.
library(dplyr)    # for tbl_df(), glimpse(), and the pipe
library(ggplot2)  # for the plots that follow
titanic_data <- tbl_df(read.csv(titanic_file, header = TRUE))
titanic_data %>% glimpse()
## Observations: 891
## Variables: 12
## $ PassengerId (int) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ Survived (int) 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0,...
## $ Pclass (int) 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3,...
## $ Name (fctr) Braund, Mr. Owen Harris, Cumings, Mrs. John Bradl...
## $ Sex (fctr) male, female, female, female, male, male, male, m...
## $ Age (dbl) 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, ...
## $ SibSp (int) 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4,...
## $ Parch (int) 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1,...
## $ Ticket (fctr) A/5 21171, PC 17599, STON/O2. 3101282, 113803, 37...
## $ Fare (dbl) 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, ...
## $ Cabin (fctr) , C85, , C123, , , E46, , , , G6, C103, , , , , ,...
## $ Embarked (fctr) S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q...
Let's look at some initial visualizations of the numeric features in the data, starting with Age. Note that the Age column contains NA values. We can drop these for purposes of visualization, but proper care is needed to deal with this problem; in particular, note how much data is affected:
# proportion of Age values that are missing
sum(is.na(titanic_data$Age))/length(titanic_data$Age)
## [1] 0.1986532
# drop the rows with an NA in the Age column
titanic_data <- titanic_data[!is.na(titanic_data$Age),]
So about 20% of the Age values are missing.
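Dropping a fifth of the rows is a blunt fix; a common alternative is imputation. Here is a minimal sketch of median imputation, shown for illustration only and not used in the rest of this article (titanic_raw is a hypothetical name, standing for the data before the deletion step above):
# alternative to deletion (illustration only): re-read the data and fill
# each missing Age with the median of the observed ages
titanic_raw <- tbl_df(read.csv(titanic_file, header = TRUE))
titanic_raw$Age[is.na(titanic_raw$Age)] <- median(titanic_raw$Age, na.rm = TRUE)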
Let's first look at a histogram of the ages of passengers on board. It is intuitive to believe that Age is a good predictor of survival, and in fact it turns out to be. But before diving into that, let's highlight the importance of proper visualization.
Below is the histogram of ages for the passengers.
ggplot(data = titanic_data, aes(x = Age)) + geom_histogram(binwidth = 10, na.rm = TRUE, fill = "wheat", color = "black")
As statisticians we are always on the lookout for data that resembles a normal distribution, or some kind of bell curve, and at first glance this looks like it might pass as one. Before drawing any conclusions, though, note the control we have over the plot above. Let's look at the same plot, but this time change the binwidth from 10 down to 1, as sketched below.
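Each plot in that sequence is the same call as above with a smaller binwidth; the final one, at binwidth = 1, is:
# same histogram as above, with binwidth reduced from 10 to 1
ggplot(data = titanic_data, aes(x = Age)) + geom_histogram(binwidth = 1, na.rm = TRUE, fill = "wheat", color = "black")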
By the end of this sequence of histograms, we can see that the data in fact takes a completely different shape. It turns out that this is the best our histogram can do at approximating the true distribution of this data.
We instead turn to kernel density estimation to approximate the distribution with a continuous curve. You can learn more about this technique here: Kernel Density Estimation. To do this we use a built-in ggplot function.
ggplot(data = titanic_data, aes(x = Age)) + geom_density(na.rm = TRUE)
The curve produced by kernel density estimation shows some interesting features. This estimate will be close to the actual distribution when the amount of data is large, and in fact we have over 700 observations.
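How smooth the estimate looks depends on the bandwidth, which geom_density exposes through its adjust argument. A quick sketch of the effect (the value here is illustrative, not part of the original analysis):
# adjust scales the default bandwidth: 0.5 gives a wigglier curve,
# 2 over-smooths; the default is 1
ggplot(data = titanic_data, aes(x = Age)) + geom_density(na.rm = TRUE, adjust = 0.5)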
So now recall that the goal of using this data is to predict whether a passenger survived. As mentioned before, this is known as a classification problem. We can use densities to classify an unseen observation: if we can detect a significant difference between two distributions, we can assign the observation to one of the two.
The visualization that follows shows this:
# cbPalette is not defined in the original post; this is the standard
# colour-blind-friendly palette it most likely refers to
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
x <- seq(-10, 10, by = .1)
dist1 <- ggplot(data = data.frame(x = x, y = dnorm(x, 0, 1)), aes(x, y)) + geom_line(col = cbPalette[3])
dist1 + geom_line(data = data.frame(x = x, y = dnorm(x, 4, 1)), aes(x, y), col = cbPalette[2]) + geom_vline(xintercept = 2, col = cbPalette[4]) + geom_point(aes(x = 0, y = 0), col = "red")
In the situation above we have two distributions that are quite different, so we can define a separating line that indicates: points to the left belong to blue, and points to the right belong to yellow. We classify the new red point with label blue, since it falls to the left of the green separating line. Our goal in the next set of visualizations is to distinguish two distributions that lead to the two possible conclusions, survived or did not survive. Note that we are not conducting any proper classification method, and thus cannot conclude anything solely from the plots; rather, we would like to build intuition about how the data behaves.
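For the two normals above the separating line need not be eyeballed: with equal variances, the densities of N(0, 1) and N(4, 1) intersect exactly at the midpoint of the means, x = 2, which is where the green line was drawn. A quick numerical check (a sketch, not part of the original analysis):
# find where the two densities are equal; for equal-variance normals this
# is the midpoint of the means, so uniroot returns 2 (up to tolerance)
uniroot(function(x) dnorm(x, 0, 1) - dnorm(x, 4, 1), interval = c(0, 4))$root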
Let's move our exploration toward exploiting this prior knowledge.
First, let's see if a passenger's age can lead to a conclusion about survival. We will plot two distributions, one for survivors and one for those who did not survive, and observe whether there is some significant difference.
# add a readable Y/N label for survival, and set colour-blind-friendly colours
titanic_data <- transform(titanic_data, Survived_c = ifelse(Survived == 1, "Y", "N"))
cc <- scale_colour_manual(values = cbPalette[2:4])
ages_survived <- ggplot(data = titanic_data, aes(x = Age, col = Survived_c)) + geom_density() + cc
ages_survived
It is obvious that we cannot draw a line that separates the two distributions… yet. Note that the densities have other little mounds. Take, for example, the curve for survivors at around age 5: there is a fairly significant mound there. Why might that be? Children? Further investigation along these lines will indeed show that being a child is a significant predictor of survival, as the quick check below suggests. Another thing to note is that the jagged portions of the densities could reflect hidden interactions with other features; see if you can spot them as we break the ages down further and further.
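The claim about children is easy to sanity-check with a quick count. A minimal sketch using dplyr (the age cutoff of 10 is an arbitrary illustrative choice, not from the original):
# compare survival rates for young children versus everyone else
titanic_data %>%
  group_by(Child = Age <= 10) %>%
  summarise(n = n(), survival_rate = mean(Survived))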
Let's build upon what we have above, but this time include whether the passenger was male or female.
ages_survived_sex <- ggplot(data = titanic_data, aes(x = Age, col = Survived_c)) + geom_density() + facet_grid(Sex ~ .) + cc
ages_survived_sex
Once again it would be naive to try to draw a separating line at this point, but we do notice some separation beginning to happen.
Look specifically at the males in the plot above (the bottom panel):
ggplot(data = titanic_data[titanic_data$Sex == 'male',], aes(x = Age, col = Survived_c)) + geom_density() + geom_vline(xintercept = 13, col = cbPalette[2])
We can comfortably say that males aged around 13 and under were much more likely to have survived.
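This, too, can be checked numerically; a minimal sketch (the cutoff of 13 follows the line drawn above):
# survival rate for males aged 13 and under versus older males
titanic_data %>%
  filter(Sex == 'male') %>%
  group_by(Young = Age <= 13) %>%
  summarise(n = n(), survival_rate = mean(Survived))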
One last predictor that is intuitively important is passenger class. We can see a lot of separation beginning to occur in first class below.
ages_survived_class <- ggplot(data = titanic_data, aes(x = Age, col = Survived_c)) + geom_density() + facet_grid(Pclass ~ .) + cc
ages_survived_class
As a final visualization, let's combine the previous two plots.
ggplot(data = titanic_data, aes(x = Age, col = Survived_c)) + geom_density() + facet_grid(Pclass ~ Sex) + cc
We can see that the densities for females do not show the separation we hoped for, but those for males may lead us toward adequate predictions.
Having gone through several visualizations, we can more comfortably apply certain models or algorithms. Again, note that the point of these initial explorations is not to settle on a classification, but rather to gain intuition about the data, and possibly to reject a later result that does not make sense. There are many classification techniques we can apply here; I will explore these in further articles.
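As a small preview of what such a model might look like, here is a minimal logistic regression on the features we explored. This is an illustrative sketch only, not a tuned model from this article:
# a minimal logistic regression on the features explored above;
# Pclass is treated as numeric here, though a factor is also reasonable
fit <- glm(Survived ~ Age + Sex + Pclass, data = titanic_data, family = binomial)
summary(fit)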