Installs and loads dplyr package. This can be performed in the console but for demonstrative purposes it is included in the script

install.packages("dplyr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Writes the dataset onto a dataframe. The dataset must be uploaded to the workspace beforehand.

titanic_data <- read.csv("Titanic-Dataset.csv")

We use multiple functions to give us some basic insights and overview into the data. View displays the entire dataset in a tabular format. Summary provides key statistical information about each column such as the mean, median, max and min values. str gives us the data type of each column.

head(titanic_data)
summary(titanic_data)
##   PassengerId       Survived          Pclass          Name          
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891        
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
##  Median :446.0   Median :0.0000   Median :3.000   Mode  :character  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309                     
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000                     
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000                     
##                                                                     
##      Sex                 Age            SibSp           Parch       
##  Length:891         Min.   : 0.42   Min.   :0.000   Min.   :0.0000  
##  Class :character   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000  
##  Mode  :character   Median :28.00   Median :0.000   Median :0.0000  
##                     Mean   :29.70   Mean   :0.523   Mean   :0.3816  
##                     3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000  
##                     Max.   :80.00   Max.   :8.000   Max.   :6.0000  
##                     NA's   :177                                     
##     Ticket               Fare           Cabin             Embarked        
##  Length:891         Min.   :  0.00   Length:891         Length:891        
##  Class :character   1st Qu.:  7.91   Class :character   Class :character  
##  Mode  :character   Median : 14.45   Mode  :character   Mode  :character  
##                     Mean   : 32.20                                        
##                     3rd Qu.: 31.00                                        
##                     Max.   :512.33                                        
## 
str(titanic_data)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

Shows the column names of the dataset

names(titanic_data)
##  [1] "PassengerId" "Survived"    "Pclass"      "Name"        "Sex"        
##  [6] "Age"         "SibSp"       "Parch"       "Ticket"      "Fare"       
## [11] "Cabin"       "Embarked"
names(titanic_data) <- tolower(names(titanic_data))
names(titanic_data)
##  [1] "passengerid" "survived"    "pclass"      "name"        "sex"        
##  [6] "age"         "sibsp"       "parch"       "ticket"      "fare"       
## [11] "cabin"       "embarked"

This section returns the number of missing values in each column.

colSums(is.na(titanic_data)) 
## passengerid    survived      pclass        name         sex         age 
##           0           0           0           0           0         177 
##       sibsp       parch      ticket        fare       cabin    embarked 
##           0           0           0           0           0           0

We can see that the only column with missing values is age, with 177 missing values. To address this, we replace each of the missing age values with the median age value. This is done by using dplyr’s mutate function in conjunction with an ifelse statement where if the value in the age column is missing, then replace it with the median age value, else keep it as is

median(titanic_data$age, na.rm = TRUE)
## [1] 28
titanic_data <- titanic_data %>%
  mutate(age = ifelse(is.na(age), median(age, na.rm = TRUE), age))

The median age was 28 so all 177 missing values in the age column were replaced with 28. Now we use the colSums function again to ensure all missing values were replaced

colSums(is.na(titanic_data)) #verified all missing data has been filled
## passengerid    survived      pclass        name         sex         age 
##           0           0           0           0           0           0 
##       sibsp       parch      ticket        fare       cabin    embarked 
##           0           0           0           0           0           0

As we can see, there are no more missing values so this operation was successful

We convert appropriate columns to factor data type, to ensure these columns are properly represented as categorical data in statistical analysis

titanic_data <- titanic_data %>% 
  mutate(survived = as.factor(survived), 
         pclass = as.factor(pclass), 
         sex = as.factor(sex),
         embarked = as.factor(embarked)
         )

In the embarked column, there are two entries with blank values. These are not counted as missing values by R, but are missing values for all intents and purposes, so we manually filter them out as follows

titanic_data <- titanic_data %>%
  filter(embarked != "")

The Cabin Column mostly contains blank values, as such we drop the column entirely as it will not offer any meaningful conclusions in our analysis

titanic_data <- titanic_data %>% select(-cabin)

This is a final check of the data to ensure it has been sufficiently cleaned

head(titanic_data)

Now that the data has been cleaned, we write it to a new .csv file so that it can be exported and used for further analysis or visualisations on R or other programs.

write.csv(titanic_data, "Cleaned_Titanic_Data.csv", row.names = FALSE)