Introduction

Data Source

The Kaggle Titanic dataset records passengers who were aboard the Titanic, including who did and did not survive. The data is split into two files: the training data includes the survival outcome, while the test data omits it. This analysis uses the dataset from https://www.kaggle.com/competitions/titanic/data, specifically the train.csv file.

Column

PassengerId: Unique ID of each passenger.
Survived: Survival status. 0 means the passenger did not survive and 1 means the passenger survived.
Pclass: Ticket class of the passenger: 1 (first), 2 (second), or 3 (third).
Name: Name of the passenger.
Sex: Sex of the passenger.
Age: Age of the passenger in years.
SibSp: Number of siblings and spouses travelling with the passenger.
Parch: Number of parents and children travelling with the passenger.
Ticket: Ticket number of the passenger.
Fare: Fare paid by the passenger.
Cabin: Cabin number of the passenger.
Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).

Library

library(caret)
library(dplyr)
library(gtools)
library(GGally)

Load Data

titanic <- read.csv("titanic/train.csv")

Data Inspection

head(titanic)

##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
## 6                                    Moran, Mr. James   male  NA     0     0
##             Ticket    Fare Cabin Embarked
## 1        A/5 21171  7.2500              S
## 2         PC 17599 71.2833   C85        C
## 3 STON/O2. 3101282  7.9250              S
## 4           113803 53.1000  C123        S
## 5           373450  8.0500              S
## 6           330877  8.4583              Q

tail(titanic)

##     PassengerId Survived Pclass                                     Name    Sex
## 886         886        0      3     Rice, Mrs. William (Margaret Norton) female
## 887         887        0      2                    Montvila, Rev. Juozas   male
## 888         888        1      1             Graham, Miss. Margaret Edith female
## 889         889        0      3 Johnston, Miss. Catherine Helen "Carrie" female
## 890         890        1      1                    Behr, Mr. Karl Howell   male
## 891         891        0      3                      Dooley, Mr. Patrick   male
##     Age SibSp Parch     Ticket   Fare Cabin Embarked
## 886  39     0     5     382652 29.125              Q
## 887  27     0     0     211536 13.000              S
## 888  19     0     0     112053 30.000   B42        S
## 889  NA     1     2 W./C. 6607 23.450              S
## 890  26     0     0     111369 30.000  C148        C
## 891  32     0     0     370376  7.750              Q

dim(titanic)

## [1] 891  12

colnames(titanic)

##  [1] "PassengerId" "Survived"    "Pclass"      "Name"        "Sex"        
##  [6] "Age"         "SibSp"       "Parch"       "Ticket"      "Fare"       
## [11] "Cabin"       "Embarked"

The inspection confirms the structure described earlier. The train.csv file contains 891 rows and 12 columns. Each row describes one passenger: ID, survival status, ticket class, name, sex, age, number of siblings/spouses aboard, number of parents/children aboard, ticket number, fare, cabin, and port of embarkation. The head and tail output also shows that some Age values are NA and that many Cabin values are blank.

Data Cleansing

Data Type

glimpse(titanic)

## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex         <chr> "male", "female", "female", "female", "male", "male", "mal…
## $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin       <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
## $ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…

The output shows that several variables are stored as integers or characters but are better treated as factors: Survived, Pclass, Sex, SibSp, Parch, and Embarked. Converting them to factors lets us treat them as categorical features during exploratory data analysis (EDA), so that summaries report counts per category instead of numeric statistics or plain text.

titanic <- titanic %>%
  mutate(Survived = as.factor(Survived),
         Pclass = as.factor(Pclass),
         Sex = as.factor(Sex),
         SibSp = as.factor(SibSp),
         Parch = as.factor(Parch),
         Embarked = as.factor(Embarked))
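
The same conversion can be written without naming each column twice. The following is a minimal base R sketch, not the code used in this analysis. The data frame sample_rows is a small stand-in built from a few columns of the head(titanic) output, and factor_cols is a name chosen here for illustration.

```r
# Small stand-in for the Titanic data (values copied from head(titanic) above).
sample_rows <- data.frame(
  Survived = c(0L, 1L, 1L, 1L, 0L, 0L),
  Pclass   = c(3L, 1L, 3L, 1L, 3L, 3L),
  Sex      = c("male", "female", "female", "female", "male", "male"),
  Fare     = c(7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583),
  stringsAsFactors = FALSE
)

# Columns to treat as categorical (hypothetical helper name).
factor_cols <- c("Survived", "Pclass", "Sex")

# Convert every listed column to a factor in one step.
sample_rows[factor_cols] <- lapply(sample_rows[factor_cols], as.factor)

# Check the resulting class of each column.
col_classes <- sapply(sample_rows, class)
print(col_classes)
```

Keeping the column names in one vector makes it easier to add or remove a categorical variable later, at the cost of being slightly less explicit than the mutate() call.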

glimpse(titanic)

## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived    <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass      <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex         <fct> male, female, female, female, male, male, male, male, fema…
## $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp       <fct> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch       <fct> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin       <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
## $ Embarked    <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, S, C…

The PassengerId variable contains no missing values, but it serves only as an identifier and carries no information about a passenger's characteristics. We therefore remove it from the dataset to streamline the analysis.

titanic <- titanic %>%
  select(-PassengerId)

glimpse(titanic)

## Rows: 891
## Columns: 11
## $ Survived <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0…
## $ Pclass   <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3, 2…
## $ Name     <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Flore…
## $ Sex      <fct> male, female, female, female, male, male, male, male, female,…
## $ Age      <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55,…
## $ SibSp    <fct> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0, 0…
## $ Parch    <fct> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0, 0…
## $ Ticket   <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37345…
## $ Fare     <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625, 21…
## $ Cabin    <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C103…
## $ Embarked <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, S, C, S…

Missing Value

anyNA(titanic)

## [1] TRUE

The dataset contains missing values (NA), which can significantly impact the quality and reliability of data analysis. To ensure accurate and reliable insights from the dataset, it is essential to address these missing values through imputation or data cleaning techniques. Next, we will check which variables contain missing values to determine the scope of data cleaning and imputation needed.

colSums(is.na(titanic))

## Survived   Pclass     Name      Sex      Age    SibSp    Parch   Ticket 
##        0        0        0        0      177        0        0        0 
##     Fare    Cabin Embarked 
##        0        0        0

The Age column has 177 missing values; no other column contains NA.
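
is.na() only detects values stored as NA. The glimpse() output above shows Cabin values stored as empty strings (""), which is.na() does not count. The sketch below, which is an addition and not part of the original analysis, counts both kinds of gaps on a stand-in data frame built from the head(titanic) rows. The names sample_rows, na_counts, and empty_counts are chosen here for illustration.

```r
# Stand-in rows copied from head(titanic) above.
sample_rows <- data.frame(
  Age      = c(22, 38, 26, 35, 35, NA),
  Cabin    = c("", "C85", "", "C123", "", ""),
  Embarked = c("S", "C", "S", "S", "S", "Q"),
  stringsAsFactors = FALSE
)

# Count true NA values per column.
na_counts <- colSums(is.na(sample_rows))

# Count empty strings per column (NA entries are not counted as empty).
empty_counts <- sapply(sample_rows, function(col) {
  sum(!is.na(col) & as.character(col) == "")
})

print(na_counts)
print(empty_counts)
```

If blank strings should be treated as missing from the start, one option is the na.strings argument of read.csv(), for example na.strings = c("", "NA"). Doing so would change the counts reported in this section, so it is mentioned here only as an alternative.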

There are two possible treatments:
- Remove the entire Age column, so that no rows are lost.
- Remove only the rows with a missing Age value, so that Age can still be used in the analysis.

titanic_no_age <- titanic %>% 
  select(-Age)
titanic_clean <- na.omit(titanic) # Only rows with missing values removed

colSums(is.na(titanic_no_age))

## Survived   Pclass     Name      Sex    SibSp    Parch   Ticket     Fare 
##        0        0        0        0        0        0        0        0 
##    Cabin Embarked 
##        0        0

colSums(is.na(titanic_clean))

## Survived   Pclass     Name      Sex      Age    SibSp    Parch   Ticket 
##        0        0        0        0        0        0        0        0 
##     Fare    Cabin Embarked 
##        0        0        0
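
Imputation, mentioned earlier as an option, is a third treatment that keeps both the Age column and all rows. The sketch below is an assumption-laden illustration and not the approach taken in this analysis: it fills missing ages with the median of the observed ages. The vector ages holds the Age values from the head() and tail() output above; age_median and ages_imputed are names chosen here.

```r
# Age values copied from the head(titanic) and tail(titanic) output above.
ages <- c(22, 38, 26, 35, 35, NA, 39, 27, 19, NA, 26, 32)

# Median of the observed (non-missing) ages.
age_median <- median(ages, na.rm = TRUE)

# Replace each missing age with that median; observed ages are unchanged.
ages_imputed <- ifelse(is.na(ages), age_median, ages)

print(age_median)
print(ages_imputed)
```

Median imputation preserves the row count but narrows the spread of Age, so histograms and boxplots of an imputed column would look more concentrated around the median than the real distribution.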

glimpse(titanic_no_age)

## Rows: 891
## Columns: 10
## $ Survived <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0…
## $ Pclass   <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3, 2…
## $ Name     <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Flore…
## $ Sex      <fct> male, female, female, female, male, male, male, male, female,…
## $ SibSp    <fct> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0, 0…
## $ Parch    <fct> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0, 0…
## $ Ticket   <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37345…
## $ Fare     <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625, 21…
## $ Cabin    <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C103…
## $ Embarked <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, S, C, S…

Data Explanation

Brief Overview of the data

summary(titanic_no_age)

##  Survived Pclass      Name               Sex      SibSp   Parch  
##  0:549    1:216   Length:891         female:314   0:608   0:678  
##  1:342    2:184   Class :character   male  :577   1:209   1:118  
##           3:491   Mode  :character                2: 28   2: 80  
##                                                   3: 16   3:  5  
##                                                   4: 18   4:  4  
##                                                   5:  5   5:  5  
##                                                   8:  7   6:  1  
##     Ticket               Fare           Cabin           Embarked
##  Length:891         Min.   :  0.00   Length:891          :  2   
##  Class :character   1st Qu.:  7.91   Class :character   C:168   
##  Mode  :character   Median : 14.45   Mode  :character   Q: 77   
##                     Mean   : 32.20                      S:644   
##                     3rd Qu.: 31.00                              
##                     Max.   :512.33                              
##

summary(titanic_clean)

##  Survived Pclass      Name               Sex           Age        SibSp  
##  0:424    1:186   Length:714         female:261   Min.   : 0.42   0:471  
##  1:290    2:173   Class :character   male  :453   1st Qu.:20.12   1:183  
##           3:355   Mode  :character                Median :28.00   2: 25  
##                                                   Mean   :29.70   3: 12  
##                                                   3rd Qu.:38.00   4: 18  
##                                                   Max.   :80.00   5:  5  
##                                                                   8:  0  
##  Parch      Ticket               Fare           Cabin           Embarked
##  0:521   Length:714         Min.   :  0.00   Length:714          :  2   
##  1:110   Class :character   1st Qu.:  8.05   Class :character   C:130   
##  2: 68   Mode  :character   Median : 15.74   Mode  :character   Q: 28   
##  3:  5                      Mean   : 34.69                      S:554   
##  4:  4                      3rd Qu.: 33.38                              
##  5:  5                      Max.   :512.33                              
##  6:  1
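
In the summary of titanic_clean, the SibSp level 8 is still listed with a count of 0. na.omit() removes rows but keeps every factor level, even one that no longer occurs. The sketch below shows the effect and one way to drop such levels with droplevels(). It uses made-up toy values rather than the Titanic data, and the names toy and toy_clean are hypothetical.

```r
# Toy data (made up for illustration): the only row with SibSp = 8 has a missing Age.
toy <- data.frame(
  SibSp = factor(c(1, 1, 0, 8, 0), levels = c(0, 1, 8)),
  Age   = c(22, 38, 26, NA, 35)
)

# Removing rows with NA keeps the now-unused level "8".
toy_clean <- na.omit(toy)
levels_before <- levels(toy_clean$SibSp)
print(table(toy_clean$SibSp))

# droplevels() removes factor levels that have no remaining observations.
toy_clean$SibSp <- droplevels(toy_clean$SibSp)
levels_after <- levels(toy_clean$SibSp)
print(table(toy_clean$SibSp))
```

Whether to drop unused levels is a judgment call: keeping them documents that the category existed in the full data, while dropping them keeps tables and plots free of empty groups.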

Univariate Analysis

hist(titanic_clean$Fare, breaks=20)

The Fare histogram shows that the frequency of passengers decreases as the fare increases. Most passengers purchased tickets at lower prices, while higher-priced tickets were much less common. The summary statistics agree: the mean fare (34.69) is more than twice the median (15.74), which is typical of a right-skewed distribution.

boxplot(titanic_clean$Fare)

The boxplot shows a noticeable number of outliers above the upper whisker of the Fare variable. These outliers may have several causes, such as scalpers reselling tickets at exorbitant prices or passengers who acquired their tickets through auctions, although the data itself does not confirm either explanation.
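
The points a boxplot draws as outliers can also be listed numerically. The sketch below is an addition to the original analysis: it uses boxplot.stats() from base R, which applies the same rule as boxplot() (points more than 1.5 times the hinge spread beyond the hinges). The vector fares holds the Fare values from the head() and tail() output above, so it is only a 12-row illustration; fare_stats and upper_fence are names chosen here.

```r
# Fare values copied from the head(titanic) and tail(titanic) output above.
fares <- c(7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583,
           29.125, 13, 30, 23.45, 30, 7.75)

# boxplot.stats() returns the five boxplot statistics and the outlying points.
fare_stats <- boxplot.stats(fares)

# Upper fence: upper hinge plus 1.5 times the distance between the hinges.
hinge_spread <- fare_stats$stats[4] - fare_stats$stats[2]
upper_fence <- fare_stats$stats[4] + 1.5 * hinge_spread

print(upper_fence)
print(fare_stats$out)  # values beyond the fences
```

Applied to titanic_clean$Fare, the same call would list every fare drawn as an individual point in the boxplot, which makes it possible to inspect those passengers directly.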

hist(titanic_clean$Age, breaks=20)

The Age histogram shows relatively few passengers under the age of 20, which suggests that infants and children made up a small share of those on board. The highest frequencies fall between roughly 20 and 40 years, consistent with the quartiles in the summary (20.12 and 38.00), and the frequency declines gradually at older ages. The distribution is right-skewed, so older passengers were comparatively rare.

boxplot(titanic_clean$Age)

Based on the boxplot and the summary statistics:
- Outliers appear at the upper end of the age distribution, roughly above the age of 60. They represent a small number of unusually old passengers.
- The average age of passengers with a recorded age is approximately 29.7 years.
- The dataset includes infants: the lowest recorded age is 0.42 years, so at least one baby was on board.

These insights into the age distribution provide a preliminary understanding of the passengers’ demographics, including the presence of older individuals and the inclusion of infants on the ship. Further analysis can explore how age might have influenced survival rates or other aspects of the Titanic tragedy.
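
As a starting point for that further analysis, survival rates can be compared across age groups. The sketch below is a hedged illustration only: it uses the 12 rows shown in the head() and tail() output, far too few to draw conclusions, and the age bands (0-20, 21-40, 41+) are an arbitrary choice made here. The names age_group and survival_rate are hypothetical.

```r
# Age and Survived values copied from the head(titanic) and tail(titanic) output.
age      <- c(22, 38, 26, 35, 35, NA, 39, 27, 19, NA, 26, 32)
survived <- c(0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0)

# Bin ages into bands; intervals are (0,20], (20,40], (40,Inf].
age_group <- cut(age, breaks = c(0, 20, 40, Inf),
                 labels = c("0-20", "21-40", "41+"))

# Mean of a 0/1 variable is the proportion of survivors in each band.
# Rows with a missing age fall in no band and are left out.
survival_rate <- tapply(survived, age_group, mean)

print(table(age_group, useNA = "ifany"))
print(survival_rate)
```

A band with no passengers yields NA rather than 0, which keeps "no data" distinct from "nobody survived".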

Bivariate Analysis

plot(x = titanic_clean$Age, y = titanic_clean$Fare, 
     main="Scatter Plot Age vs. Fare",
     xlab="Age",
     ylab="Fare"
)

In the Age vs. Fare scatter plot, there is no clear linear relationship between Age and Fare. Most points sit at low fares across the whole age range, and the few very high fares are not concentrated at any particular age. This suggests that a passenger's age says little about the fare paid.

correlation <- cor(titanic_clean$Age, titanic_clean$Fare)
correlation

## [1] 0.09606669

The correlation coefficient between Age and Fare is approximately 0.0961, which suggests a very weak positive relationship between the two variables. This indicates that as a passenger’s age increases, there is a slight tendency for their fare to also increase, but the relationship is not strong.
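
The call above works because titanic_clean has no missing ages. On data that still contains NA, cor() returns NA unless told how to handle missing pairs. The sketch below, an addition to the original analysis, shows the use argument of cor() and also computes Spearman's rank correlation, which is less sensitive to the extreme fares seen in the boxplot. The vectors hold the Age and Fare values from the head() and tail() output, so the resulting coefficients describe only those 12 rows.

```r
# Age and Fare values copied from the head(titanic) and tail(titanic) output.
age  <- c(22, 38, 26, 35, 35, NA, 39, 27, 19, NA, 26, 32)
fare <- c(7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583,
          29.125, 13, 30, 23.45, 30, 7.75)

# With the default settings, any NA in the inputs makes the result NA.
cor_default <- cor(age, fare)

# use = "complete.obs" drops pairs where either value is missing.
cor_pearson <- cor(age, fare, use = "complete.obs")

# Spearman's method correlates ranks instead of raw values.
cor_spearman <- cor(age, fare, use = "complete.obs", method = "spearman")

print(c(default = cor_default, pearson = cor_pearson, spearman = cor_spearman))
```

Comparing the Pearson and Spearman values on the full data would indicate whether the weak relationship reported above is driven by a few extreme fares or holds for the ranks as well.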

Conclusion

Data Origin and Variable Transformation: We sourced our dataset from Kaggle’s Titanic competition, specifically using the train.csv file. To prepare the data for analysis and modeling, we transformed several variables into factors. These variables include Survived, Pclass, Sex, SibSp, Parch, and Embarked. This transformation allows us to utilize these attributes effectively when building models.
Handling Missing Values: During our analysis, we discovered that the Age variable contained 177 missing values. Addressing these missing values is imperative to maintain data integrity and ensure the accuracy of our analytical results.
Outliers in Fare Variable: Our analysis revealed potential outliers in the Fare variable. These outliers might be due to factors such as scalpers reselling tickets at inflated prices, but the data does not confirm this. Investigating the underlying reasons for these outliers is essential for a more comprehensive analysis.
Age and Fare Relationship: Our analysis included a scatter plot comparing the Age and Fare variables. The scatter plot did not show a strong relationship between age and fare, which is consistent with the low correlation coefficient of approximately 0.0961.

Exploratory Data Analysis Titanic

Husna Aydadenta

2023-10-19