Data exploration is an important part of data analysis. Visualisation makes it easier to interpret patterns, find stories and gain insights. This document provides guidance on doing a preliminary analysis of a dataset using ggplot2. ggplot2 is part of the tidyverse, an ecosystem of packages designed with common APIs and a shared philosophy.
For this analysis, I will be using the Kaggle Titanic training dataset to analyse the survival rate, the rescue priorities during the disaster and the age distribution of the passengers. I will not be covering all types of graphs.
library(tidyverse)
library(ggplot2)
library(dplyr)
library(ggthemes)
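# stringsAsFactors = TRUE reproduces the Factor columns shown below (the default is FALSE since R 4.0)
titanic <- read.csv("titanic.csv", header = TRUE, stringsAsFactors = TRUE)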
str(titanic)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
## $ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
Basically, ggplot() takes two main arguments:
data, a data frame from the data source, and
aes, the mapping.
The mapping, technically called the aesthetic mapping, describes how variables in the data are mapped to visual properties of geoms.
To keep it simple, suppose we visualise only two variables; we then have to supply the variables for the x and y coordinates of the graph. If we supply only a variable for x, the bar geom cleverly uses the record count as the y variable by default. However, ggplot always needs to know what data to use for x. It is our task to supply it.
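The default count can be checked directly. The following is a minimal sketch on a small made-up data frame (toy is hypothetical, not the real Titanic data); it uses layer_data() to read back the values that geom_bar() computed for the y axis.

```r
library(ggplot2)

# Hypothetical toy data standing in for the Titanic file (not real records).
toy <- data.frame(Pclass = factor(c(1, 1, 2, 3, 3, 3)))

# Only x is mapped; geom_bar() uses stat = "count" to compute the bar heights.
p_count <- ggplot(toy, aes(x = Pclass)) + geom_bar()

# layer_data() returns the data ggplot2 computed for the first layer,
# including a count column with one value per bar.
bar_counts <- layer_data(p_count)$count
print(bar_counts)
```

Each value in bar_counts is the number of rows for one passenger class, which is exactly what the bar heights show.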
aes has many attributes such as alpha, colour, fill, group, linetype, size and weight.
In this analysis, I use fill to display the data in different colours according to its category. Used on its own, fill accepts a colour name or hex code, for example “red” or “#1380A1”, and paints everything in that single colour. However, I need more than one colour, so fill has to be mapped to a variable inside aes(). The important thing to note is that the mapped variable should be a factor to get one distinct colour per category; a numeric variable is treated as a continuous scale. If the raw data to be displayed is not a factor, we have to convert it to one.
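The difference between these uses of fill can be sketched on a small made-up data frame (toy and the p_* names are hypothetical). The sketch counts how many bar segments each plot would draw: a numeric fill leaves one bar per class, while a factor fill splits each bar by outcome.

```r
library(ggplot2)

# Hypothetical toy data (not real Titanic records).
toy <- data.frame(
  Pclass   = c(1, 1, 2, 2, 3, 3, 3, 3),
  Survived = c(1, 0, 1, 0, 1, 0, 0, 0)
)

# Constant colour: fill is set outside aes(), so every bar gets the same colour.
p_const <- ggplot(toy, aes(x = Pclass)) + geom_bar(fill = "#1380A1")

# Numeric Survived mapped to fill: treated as continuous, so bars are not split.
p_numeric <- ggplot(toy, aes(x = Pclass, fill = Survived)) + geom_bar()

# Factor Survived mapped to fill: one stacked segment per class and outcome.
p_factor <- ggplot(toy, aes(x = Pclass, fill = factor(Survived))) + geom_bar()

# Count the rectangles each plot would draw. Recent ggplot2 versions warn that
# the continuous fill was dropped; the warning is suppressed here on purpose.
n_numeric <- nrow(suppressWarnings(layer_data(p_numeric)))
n_factor  <- nrow(layer_data(p_factor))
print(c(numeric_fill = n_numeric, factor_fill = n_factor))
```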
Exploration starts here. Let’s use the bar graph.
Firstly, I want to see the survival rate during the Titanic disaster by passenger class, Pclass.
The survival rate should compare both survivors and non-survivors. Survived has a data type of integer, as we saw when we checked the data types in the titanic dataset previously. Let’s see how the graph looks if we pass it to fill without first converting it to a factor.
ggplot(titanic, aes(x=Pclass, fill=Survived)) +
theme_bw() +
geom_bar() +
labs( y="Number of Passengers", x="Passenger Class", title="Titanic Survival Rate by Passenger Class")
This is what I ended up with when fill received non-factor data. Does the graph give you meaningful information? The bars are not split by outcome, so the graph cannot show how many passengers were rescued and how many lost their lives. This is not what I wanted. It took me some time to realise the cause.
This is what we need to do.
titanic$Survived <- as.factor(titanic$Survived)
titanic$Pclass <- as.factor(titanic$Pclass)
class(titanic$Survived)
## [1] "factor"
class(titanic$Pclass)
## [1] "factor"
Once the data is converted to a factor for fill, the graph gives more meaningful information.
ggplot(titanic, aes(x=Pclass, fill=Survived)) +
geom_bar() +
labs( y="Number of Passengers", x="Ticket Class", title="Titanic Survival Rate by Passenger Class")
We can make out that the first class has the most survivors, about 140 passengers, while the third class has the most deaths among the three classes.
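The bar graph shows counts rather than proportions. As a rough sketch of how the share of survivors within each class could be computed, the following uses dplyr on a small made-up data frame (toy, rates and share are hypothetical names, and the numbers are not real Titanic figures).

```r
library(dplyr)

# Hypothetical toy data (not real Titanic records).
toy <- data.frame(
  Pclass   = factor(c(1, 1, 1, 2, 2, 3, 3, 3, 3)),
  Survived = factor(c(1, 1, 0, 1, 0, 1, 0, 0, 0))
)

# Count passengers per class and outcome, then divide by the class total.
rates <- toy %>%
  count(Pclass, Survived) %>%
  group_by(Pclass) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

print(rates)
```

In a plot, geom_bar(position = "fill") draws the same within-class shares instead of stacked counts.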
One of the most powerful features of the R plotting package ggplot2 is the creation of multi-panel plots. With a single function, facet_wrap() or facet_grid(), a plot can be split into more than one panel.
facet_wrap() arranges the panels into rows and columns and chooses a layout that fits the number of panels.
All we need to do is append facet_wrap() to the existing ggplot call, passing the variable we wish to group by. This time, I want to explore the survival data by gender.
ggplot(titanic, aes(x=Sex, fill=Survived)) +
theme_bw() +
geom_bar() +
labs( y="Number of Passengers", title="Titanic Survival Rate by Gender by Passenger Class")+
facet_wrap(~Pclass)
We can make out from the plot that females were given first priority in the rescue plan. The number of female survivors was higher than the number of male survivors in all three passenger classes.
With facet_wrap(), multi-variable analysis is achieved in one single plot showing
Pclass (1, 2, 3)
Gender (male, female)
Survived (0 = False, 1 = True).
Let me use facet_grid() this time, add one more variable, Age, and also change the bar graph to a histogram.
Both facet_wrap() and facet_grid() can accept more than one variable. This is a very good feature of ggplot. It means the facets lay out the panels in rows and columns according to the values of those variables.
ggplot(titanic, aes(x=Age, fill=Survived))+
geom_histogram(bins=20, colour="#1380A1") +
labs(title="Survival Rate by Gender", y="Number of passengers", subtitle = "Distribution by age, gender and ticket class", caption="Author: Hnin")+
theme_bw() +
facet_grid(Sex~Pclass, scales="free")
The graph shows the survival pattern across three variables: age, sex and passenger class.
We can conclude that the approach of giving first priority to women and children was followed in the evacuation process.
Men appear to have been given much lower priority in the rescue operation.
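The row and column layout that the facets produce can also be inspected without drawing the plot. This is a minimal sketch on made-up data (toy, base, grid_layout and wrap_layout are hypothetical names); it assumes that the object returned by ggplot_build() exposes the panel table as $layout$layout, which is the case in the ggplot2 3.x releases I am aware of.

```r
library(ggplot2)

# Hypothetical toy data (not real Titanic records).
toy <- data.frame(
  Sex    = factor(c("female", "male", "female", "male", "female", "male")),
  Pclass = factor(c(1, 1, 2, 2, 3, 3)),
  Age    = c(29, 40, 35, 54, 4, 22)
)

base <- ggplot(toy, aes(x = Age)) + geom_histogram(bins = 5)

# facet_grid(rows ~ cols): one row per Sex, one column per Pclass.
grid_layout <- ggplot_build(base + facet_grid(Sex ~ Pclass))$layout$layout

# facet_wrap(~ var): one panel per Pclass, wrapped into a layout chosen by ggplot2.
wrap_layout <- ggplot_build(base + facet_wrap(~Pclass))$layout$layout

# Each row of the layout table describes one panel and its position.
print(grid_layout[, c("PANEL", "ROW", "COL", "Sex", "Pclass")])
```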
Next, I want to visualise the age distribution of fatalities and survivors.
ggplot(data = titanic, aes(x=Pclass, y=Age)) +
geom_boxplot(alpha=0.7) +
geom_point(size= 1, colour='#1380A1') +
geom_jitter(aes(colour = Survived)) +
labs(title="Age Distribution by Passenger Class on Titanic", subtitle = "Males on board were more senior than female", caption="Author: Hnin") +
xlab("Passenger Class") +
ylab("Age(years)")+
theme_light() +
facet_wrap(~Sex)
Here, the combination of geom_boxplot() and geom_point() shows the distribution of the data more explicitly. On top of that, facet_wrap() groups the boxplots by gender.
To make the points more presentable, I use geom_jitter() to spread them out. Without it, multiple points land in the same place and overlap each other, which causes overplotting. Jittering is useful on non-continuous data: by moving the points into the surrounding white space, it makes single points visible.
Please note that jittering does not help when there is no available space around the overlapping points. To tackle the overplotting issue, we may consider
* using smaller points
* using transparency, alpha
* using an appropriate binwidth
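The first two options can be combined with jittering in a single layer. The following is a minimal sketch on made-up data (toy, p_jitter and jittered are hypothetical names, and the width, alpha and seed values are arbitrary choices); position_jitter() is used so that the jitter is horizontal only and reproducible.

```r
library(ggplot2)

# Hypothetical toy data with many identical points (not real Titanic records).
toy <- data.frame(
  Pclass = factor(c(1, 1, 1, 2, 2, 3, 3, 3)),
  Age    = c(30, 30, 30, 25, 25, 22, 22, 22)
)

p_jitter <- ggplot(toy, aes(x = Pclass, y = Age)) +
  geom_boxplot(alpha = 0.7) +
  # Small, semi-transparent points; jitter sideways only so ages stay exact.
  geom_jitter(
    size = 1, alpha = 0.5,
    position = position_jitter(width = 0.2, height = 0, seed = 1)
  )

# Data of the second layer (the points) after the jitter has been applied.
jittered <- layer_data(p_jitter, 2)
print(jittered[, c("x", "y")])
```

Setting height = 0 keeps the Age values unchanged, so the points still read correctly against the y axis while being spread out within each class.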
Exploratory data analysis (EDA) does not require a formal process with rigid procedures. During the initial phase of EDA, it is up to us to decide what information we want to explore and investigate, and what ideas and questions we have in mind. To visualise the answers to our questions, ggplot makes it simple to create complex plots.
R for Data Science (https://r4ds.had.co.nz/)