Let’s use the read.csv() function to import our CSV file and the <- assignment operator to store the result as a data frame.
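# Option 1: pick the file interactively
# my_titanic <- read.csv(file.choose())
# Option 2: read from a fixed path (adjust to wherever train.csv lives on your machine)
my_titanic <- read.csv("/Users/johnhonanc/Downloads/train.csv")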
Now I’m going to view the structure of the data frame using the str() function.
str(my_titanic)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
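Notice that Cabin starts with "" rather than NA: by default read.csv() turns blank numeric fields into NA but leaves blank text fields as empty strings. Here is a minimal sketch of that behaviour, using a made-up three-row CSV (not the real train.csv) and the na.strings argument of read.csv(). The names csv_text, default_read and strict_read are just illustrative.

```r
# Tiny made-up CSV standing in for train.csv (values are illustrative only)
csv_text <- "PassengerId,Age,Cabin
1,22,
2,38,C85
3,,
"

# Default behaviour: blank numeric fields become NA, blank text fields stay ""
default_read <- read.csv(text = csv_text)

# Treat empty strings as missing values too
strict_read <- read.csv(text = csv_text, na.strings = c("", "NA"))

sum(is.na(default_read$Cabin))  # blank cabins are not counted as NA here
sum(is.na(strict_read$Cabin))   # blank cabins are now NA
```

Whether blank cabins should count as missing is a judgment call, so this is an option to consider rather than a required step.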
Now let’s look at the top and bottom 6 rows to get an idea of what the data looks like.
head(my_titanic)
tail(my_titanic)
You might notice there are some NA values in the ‘Age’ variable. Let’s get some summary statistics to tell us more.
summary(my_titanic)
## PassengerId Survived Pclass Name
## Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median :446.0 Median :0.0000 Median :3.000 Mode :character
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Sex Age SibSp Parch
## Length:891 Min. : 0.42 Min. :0.000 Min. :0.0000
## Class :character 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000
## Mode :character Median :28.00 Median :0.000 Median :0.0000
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Ticket Fare Cabin Embarked
## Length:891 Min. : 0.00 Length:891 Length:891
## Class :character 1st Qu.: 7.91 Class :character Class :character
## Mode :character Median : 14.45 Mode :character Mode :character
## Mean : 32.20
## 3rd Qu.: 31.00
## Max. :512.33
##
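summary() reports NA counts only for numeric columns, and it does not flag empty strings at all. A quick way to count both per column is sketched below on a small made-up data frame called toy (illustrative values, not the real data); the same calls should work on my_titanic.

```r
# Made-up stand-in for my_titanic (illustrative values only)
toy <- data.frame(
  Age   = c(22, NA, 26, NA, 35),
  Fare  = c(7.25, 71.28, 7.92, 53.10, 8.05),
  Cabin = c("", "C85", "", "C123", ""),
  stringsAsFactors = FALSE
)

# Number of NA values in each column
na_counts <- colSums(is.na(toy))

# Number of empty strings in each column (NA comparisons are ignored)
blank_counts <- sapply(toy, function(col) sum(col == "", na.rm = TRUE))

na_counts
blank_counts
```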
So we can see that there is an issue with the ‘Age’ variable: it contains 177 NA values. We can either remove those rows or research the missing ages and fill them in. Here we’ll remove them.
titanic_no_na <- my_titanic[!is.na(my_titanic$Age), ]
We’ve now created a new data frame called ‘titanic_no_na’ which contains only the rows where Age is not NA.
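If you would rather fill the gaps in than drop the rows, one simple option is to replace each missing age with the median of the known ages. This is only a sketch of that alternative on a made-up data frame (toy, toy_filled and age_median are illustrative names), and it is not what the rest of this walkthrough uses.

```r
# Made-up ages standing in for my_titanic$Age (illustrative values only)
toy <- data.frame(Age = c(22, 38, NA, 35, NA, 54))

# Work on a copy so the original stays untouched
toy_filled <- toy

# Median of the known ages, ignoring NA
age_median <- median(toy$Age, na.rm = TRUE)

# Overwrite only the missing entries
toy_filled$Age[is.na(toy_filled$Age)] <- age_median

toy_filled
```

Median filling keeps every row but pulls the age distribution towards the middle, so a histogram of filled ages will show an artificial spike at the median.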
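Let’s turn some categorical variables into factors. Sex is stored as character, while Pclass and Survived are stored as integers even though they represent categories.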
titanic_no_na$Sex <- as.factor(titanic_no_na$Sex)
titanic_no_na$Pclass <- as.factor(titanic_no_na$Pclass)
titanic_no_na$Survived <- as.factor(titanic_no_na$Survived)
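Two things are worth knowing about factors built from 0/1 codes. factor() can attach readable labels, and calling as.numeric() directly on a factor returns the internal level codes rather than the original numbers. The sketch below uses a made-up vector (survived_raw is an illustrative name, and the labels "Died" and "Survived" are my own choice).

```r
# Made-up 0/1 codes standing in for the Survived column
survived_raw <- c(0, 1, 1, 0)

# Same conversion as above, but with readable labels
survived_labelled <- factor(survived_raw, levels = c(0, 1),
                            labels = c("Died", "Survived"))
levels(survived_labelled)

# Pitfall: as.numeric() on a factor gives level codes (1, 2), not 0/1
survived_factor <- as.factor(survived_raw)
codes <- as.numeric(survived_factor)

# Going through character first recovers the original values
recovered <- as.numeric(as.character(survived_factor))
```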
Now let’s try some simple visualisations. First, how do the passenger ages look?
hist(titanic_no_na$Age)
Next we’ll see how survival breaks down by class: did first-class passengers generally fare better than third-class passengers?
plot(titanic_no_na$Pclass, titanic_no_na$Survived)
Let’s do the same for sex.
plot(titanic_no_na$Sex, titanic_no_na$Survived)
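When both arguments are factors, plot() draws the proportions of one factor within each level of the other. To read the actual numbers behind such a plot, table() and prop.table() can be used. A minimal sketch on made-up data (toy is illustrative, and the values are not from the real data set):

```r
# Made-up stand-in for titanic_no_na (illustrative values only)
toy <- data.frame(
  Pclass   = factor(c(1, 1, 1, 3, 3, 3, 3, 3)),
  Survived = factor(c(1, 1, 0, 0, 0, 1, 0, 0))
)

# Counts of each class/outcome combination
counts <- table(toy$Pclass, toy$Survived)

# Proportions within each class (each row sums to 1)
row_props <- prop.table(counts, margin = 1)

counts
row_props
```

On the real data, the same two calls with titanic_no_na$Pclass (or titanic_no_na$Sex) and titanic_no_na$Survived should give the survival proportions per group.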
So far, taking the two plots together, it looks like female passengers and first-class passengers stood a much better chance of survival than male passengers and third-class passengers.
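The two plots look at sex and class separately. To look at the combination directly, one option is a survival rate for every sex/class pair using tapply(). This is a sketch on made-up data (toy and rates are illustrative names, and the values are not real results); note that Survived has to be numeric 0/1 here, not a factor, so that mean() gives a proportion.

```r
# Made-up stand-in for the Titanic data (illustrative values only)
toy <- data.frame(
  Sex      = c("female", "female", "male", "male", "female", "male"),
  Pclass   = c(1, 1, 1, 3, 3, 3),
  Survived = c(1, 1, 0, 0, 1, 0)  # numeric 0/1, not a factor
)

# Mean of a 0/1 column is the proportion of 1s, i.e. the survival rate
rates <- tapply(toy$Survived, list(toy$Sex, toy$Pclass), mean)

rates
```

With the real data, my_titanic still holds Survived as an integer, or the factor in titanic_no_na can be converted back with as.numeric(as.character(titanic_no_na$Survived)).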