Installs and loads dplyr package. This can be performed in the console but for demonstrative purposes it is included in the script
install.packages("dplyr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Writes the dataset onto a dataframe. The dataset must be uploaded to the workspace beforehand.
titanic_data <- read.csv("Titanic-Dataset.csv")
We use multiple functions to give us some basic insights and overview into the data. View displays the entire dataset in a tabular format. Summary provides key statistical information about each column such as the mean, median, max and min values. str gives us the data type of each column.
head(titanic_data)
summary(titanic_data)
## PassengerId Survived Pclass Name
## Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median :446.0 Median :0.0000 Median :3.000 Mode :character
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Sex Age SibSp Parch
## Length:891 Min. : 0.42 Min. :0.000 Min. :0.0000
## Class :character 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000
## Mode :character Median :28.00 Median :0.000 Median :0.0000
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Ticket Fare Cabin Embarked
## Length:891 Min. : 0.00 Length:891 Length:891
## Class :character 1st Qu.: 7.91 Class :character Class :character
## Mode :character Median : 14.45 Mode :character Mode :character
## Mean : 32.20
## 3rd Qu.: 31.00
## Max. :512.33
##
str(titanic_data)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
Shows the column names of the dataset
names(titanic_data)
## [1] "PassengerId" "Survived" "Pclass" "Name" "Sex"
## [6] "Age" "SibSp" "Parch" "Ticket" "Fare"
## [11] "Cabin" "Embarked"
names(titanic_data) <- tolower(names(titanic_data))
names(titanic_data)
## [1] "passengerid" "survived" "pclass" "name" "sex"
## [6] "age" "sibsp" "parch" "ticket" "fare"
## [11] "cabin" "embarked"
This section returns the number of missing values in each column.
colSums(is.na(titanic_data))
## passengerid survived pclass name sex age
## 0 0 0 0 0 177
## sibsp parch ticket fare cabin embarked
## 0 0 0 0 0 0
We can see that the only column with missing values is age, with 177 missing values. To address this, we replace each of the missing age values with the median age value. This is done by using dplyr’s mutate function in conjunction with an ifelse statement where if the value in the age column is missing, then replace it with the median age value, else keep it as is
median(titanic_data$age, na.rm = TRUE)
## [1] 28
titanic_data <- titanic_data %>%
mutate(age = ifelse(is.na(age), median(age, na.rm = TRUE), age))
The median age was 28 so all 177 missing values in the age column were replaced with 28. Now we use the colSums function again to ensure all missing values were replaced
colSums(is.na(titanic_data)) #verified all missing data has been filled
## passengerid survived pclass name sex age
## 0 0 0 0 0 0
## sibsp parch ticket fare cabin embarked
## 0 0 0 0 0 0
As we can see, there are no more missing values so this operation was successful
We convert appropriate columns to factor data type, to ensure these columns are properly represented as categorical data in statistical analysis
titanic_data <- titanic_data %>%
mutate(survived = as.factor(survived),
pclass = as.factor(pclass),
sex = as.factor(sex),
embarked = as.factor(embarked)
)
In the embarked column, there are two entries with blank values. These are not counted as missing values by R, but are missing values for all intents and purposes, so we manually filter them out as follows
titanic_data <- titanic_data %>%
filter(embarked != "")
The Cabin Column mostly contains blank values, as such we drop the column entirely as it will not offer any meaningful conclusions in our analysis
titanic_data <- titanic_data %>% select(-cabin)
This is a final check of the data to ensure it has been sufficiently cleaned
head(titanic_data)
Now that the data has been cleaned, we write it to a new .csv file so that it can be exported and used for further analysis or visualisations on R or other programs.
write.csv(titanic_data, "Cleaned_Titanic_Data.csv", row.names = FALSE)