Data Import

Let’s use the read.csv() function to import our CSV file and the <- assignment operator to store the result as a data frame.

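# Option 1: choose the file interactively
# my_titanic <- read.csv(file.choose())
# Option 2: read from a fixed path (adjust to where train.csv lives on your machine)
my_titanic <- read.csv("/Users/johnhonanc/Downloads/train.csv")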
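CSV files often store missing values as empty fields, which read.csv() keeps as empty strings in character columns instead of NA. As a minimal sketch, assuming you want blanks treated as missing at import time, the na.strings argument can be used. The temporary file and the `toy` data frame below are made-up illustrations, not part of the Titanic data.

```r
# Write a tiny made-up CSV to a temporary file (illustrative data only)
tmp <- tempfile(fileext = ".csv")
writeLines(c("Id,Age,Cabin",
             "1,22,",
             "2,NA,C85",
             "3,26,"), tmp)

# Treat both empty fields and the literal string "NA" as missing
toy <- read.csv(tmp, na.strings = c("", "NA"))

str(toy)
colSums(is.na(toy))  # number of missing values in each column
```

The same argument could be added to the read.csv() call above if blank entries should count as missing.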

Now I’m going to view the structure of the data frame using the str() function.

str(my_titanic)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

Now let’s look at the first and last 6 rows to get an idea of what the data looks like.

head(my_titanic)
tail(my_titanic)

You might notice there are some NA values in the ‘Age’ variable. Let’s get some summary statistics to tell us more.

summary(my_titanic)
##   PassengerId       Survived          Pclass          Name          
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891        
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
##  Median :446.0   Median :0.0000   Median :3.000   Mode  :character  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309                     
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000                     
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000                     
##                                                                     
##      Sex                 Age            SibSp           Parch       
##  Length:891         Min.   : 0.42   Min.   :0.000   Min.   :0.0000  
##  Class :character   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000  
##  Mode  :character   Median :28.00   Median :0.000   Median :0.0000  
##                     Mean   :29.70   Mean   :0.523   Mean   :0.3816  
##                     3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000  
##                     Max.   :80.00   Max.   :8.000   Max.   :6.0000  
##                     NA's   :177                                     
##     Ticket               Fare           Cabin             Embarked        
##  Length:891         Min.   :  0.00   Length:891         Length:891        
##  Class :character   1st Qu.:  7.91   Class :character   Class :character  
##  Mode  :character   Median : 14.45   Mode  :character   Mode  :character  
##                     Mean   : 32.20                                        
##                     3rd Qu.: 31.00                                        
##                     Max.   :512.33                                        
## 
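summary() reports NA counts for numeric columns only, so empty strings in character columns such as Cabin go unnoticed. A rough sketch of counting both per column follows. The `toy` data frame uses made-up values and stands in for my_titanic.

```r
# Toy data frame with made-up values, standing in for my_titanic
toy <- data.frame(
  Age   = c(22, NA, 26, NA),
  Cabin = c("", "C85", "", "C123"),
  stringsAsFactors = FALSE
)

# Number of NA values in each column
na_counts <- colSums(is.na(toy))

# Number of empty strings in each column; !is.na() keeps NA out of the comparison
blank_counts <- sapply(toy, function(col) sum(!is.na(col) & col == ""))

na_counts
blank_counts
```

Running the same two expressions on my_titanic would show which columns hide blanks that summary() does not flag.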

So we can see that the ‘Age’ variable is a problem: 177 of its 891 values (about 20%) are NA. We can either remove those rows or fill the gaps in, by researching the missing ages or imputing them. Here we remove them.

titanic_no_na <- my_titanic[!is.na(my_titanic$Age), ]

We’ve now created a new data frame called ‘titanic_no_na’, which keeps only the rows where Age is not NA (714 of the original 891).
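Dropping rows is not the only option. As one possible alternative, and not what this walkthrough does, the missing ages could be filled with the median of the observed ages. The sketch below uses a made-up `toy` data frame. The column names Age_was_missing and Age_filled are hypothetical.

```r
# Toy data frame with made-up ages, standing in for my_titanic
toy <- data.frame(Age = c(22, NA, 26, 35, NA))

# Record which rows were missing before filling, so the information is not lost
toy$Age_was_missing <- is.na(toy$Age)

# Median of the observed (non-missing) ages
age_median <- median(toy$Age, na.rm = TRUE)

# Keep observed ages and use the median where Age is NA
toy$Age_filled <- ifelse(is.na(toy$Age), age_median, toy$Age)

toy
```

Median imputation keeps every row but pulls the age distribution toward its centre, so plots such as a histogram of age will show an artificial spike at the median.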

Let’s turn some categorical variables into factors. Sex is stored as character, while Pclass and Survived are stored as integers.

titanic_no_na$Sex <- as.factor(titanic_no_na$Sex)
titanic_no_na$Pclass <- as.factor(titanic_no_na$Pclass)
titanic_no_na$Survived <- as.factor(titanic_no_na$Survived)
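as.factor() keeps the raw codes as level names, so Survived ends up with levels "0" and "1". A small sketch of attaching readable labels with factor() instead, assuming the usual coding of 0 = did not survive and 1 = survived. The values and the labels "Died" and "Survived" are illustrative choices.

```r
# Made-up values in the same 0/1 coding as the Survived column
survived_raw <- c(0, 1, 1, 0)

# Map the codes to readable labels; levels and labels are matched by position
survived <- factor(survived_raw,
                   levels = c(0, 1),
                   labels = c("Died", "Survived"))

levels(survived)
table(survived)
```

Readable labels carry through to plot axes and tables, which makes the later charts easier to interpret.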

Now let’s try some simple visualisations. First, how do the passenger ages look?

hist(titanic_no_na$Age)

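Next we’ll see how survival breaks down by class. Did first class passengers generally fare better than third class passengers? With two factors, plot() draws a spine plot: bar width reflects the size of each group and the shading shows the share in each survival category.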

plot(titanic_no_na$Pclass, titanic_no_na$Survived)

Let’s do the same for sex.

plot(titanic_no_na$Sex, titanic_no_na$Survived)

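So far it looks like women and first class passengers stood a much better chance of survival than men and third class passengers. These plots consider sex and class one at a time, so the combined effect still needs checking.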
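To look at sex and class in combination, one option is a table of survival rates for each pairing. The sketch below uses a made-up `toy` data frame, so its numbers say nothing about the real passengers. It also assumes Survived is still numeric 0/1. On titanic_no_na, where Survived is now a factor, it would need converting back first, for example with as.numeric(as.character(...)).

```r
# Toy data frame with made-up values, standing in for the Titanic data
toy <- data.frame(
  Sex      = factor(c("female", "female", "male", "male", "female", "male")),
  Pclass   = factor(c(1, 3, 1, 3, 1, 3)),
  Survived = c(1, 0, 0, 0, 1, 1)
)

# The mean of a 0/1 variable is the proportion of 1s, i.e. the survival rate.
# tapply() computes it for every Sex x Pclass combination.
rates <- tapply(toy$Survived, list(Sex = toy$Sex, Pclass = toy$Pclass), mean)

rates
```

Each cell of the result is the share of survivors in one sex and class group, which is the comparison the conclusion above relies on.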