STAT3622 Project

We work with the well-known “Titanic - Machine Learning from Disaster” dataset, which is available on Kaggle, a popular platform for machine learning. The primary objective is to predict whether a passenger survived the Titanic disaster from variables such as “pclass”, “sex”, “age”, “sibsp”,…

The data come in two sets, train and test. The objective is to build a model on the train set and then use it to predict the survival status of the passengers in the test set.

Variable Definitions and Keys
Variable	Definition	Key
survival	Survival	0 = No, 1 = Yes
pclass	Ticket class	1 = 1st, 2 = 2nd, 3 = 3rd
sex	Sex
Age	Age in years
sibsp	# of siblings / spouses aboard the Titanic
parch	# of parents / children aboard the Titanic
ticket	Ticket number
fare	Passenger fare
cabin	Cabin number
embarked	Port of Embarkation	C = Cherbourg, Q = Queenstown, S = Southampton

Exploratory data analysis

In this section, the main focus is to find the interesting features of the dataset and to identify which of them are effective for predicting survival.

1. Handle the null values in the dataset.

We first count the null values in the train and test datasets. (Table 2)

Number of null values in Train and Test data-set
	Test	Train
PassengerId	0	0
Pclass	0	0
Name	0	0
Sex	0	0
Age	86	177
SibSp	0	0
Parch	0	0
Ticket	0	0
Fare	1	0
Cabin	0	0
Embarked	0	0

From the table above we can see that “Age” has many missing values in both datasets (177 in train and 86 in test), and that “Fare” has one missing value in test.
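The null count can be sketched as follows. This is a minimal illustration in plain Python rather than the code used for the report; the function name `count_missing`, the flag `treat_empty_as_missing`, and the toy rows are all assumptions made for the example. The flag is there because a blank string is not a null, so a column made of blank strings would report zero nulls under a strict check.

```python
# Hypothetical toy rows; the values are made up for illustration only.
rows = [
    {"Age": 22.0, "Fare": 7.25, "Cabin": ""},
    {"Age": None, "Fare": 71.28, "Cabin": "C85"},
    {"Age": None, "Fare": None, "Cabin": ""},
]


def count_missing(rows, treat_empty_as_missing=False):
    """Return {column: number of missing entries} over a list of dict rows."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            # None always counts; blank strings count only when requested.
            missing = value is None or (treat_empty_as_missing and value == "")
            counts[column] = counts.get(column, 0) + int(missing)
    return counts


strict_counts = count_missing(rows)
loose_counts = count_missing(rows, treat_empty_as_missing=True)
```

Comparing `strict_counts` with `loose_counts` shows which columns hide blank entries that a plain null check does not report.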

Next we plot a bar chart showing the probability of survival for different age groups. (Table 1)

From the bar plot we can see that the age groups (0.34, 8.38] and (72, 80.1] have relatively high survival probabilities, while the other groups have similar survival probabilities.

Observe the age group distribution.

Plot a histogram to observe the distribution for the age variable.
The histogram shows that a large share of the passengers are between 15 and 40 years old, and that few passengers are older than 60. To make the data more representative, we group the passengers into age groups instead of using their individual ages; these groups give a better picture of the distribution of the Age variable. And at the same time the null values in the train

Age is cut at the breakpoints (0, 5, 10, 15, 20, 25, 30, 35, 40, 50, 60, 100), which gives 11 intervals.
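The grouping by breakpoints can be sketched as below. Two choices here are assumptions, not taken from the report: the intervals are closed on the right, as in (0, 5], and a missing age is mapped to a hypothetical "unknown" label. The names `BREAKS` and `age_group` are illustrative.

```python
from bisect import bisect_left

# Breakpoints as listed in the text.
BREAKS = [0, 5, 10, 15, 20, 25, 30, 35, 40, 50, 60, 100]


def age_group(age, breaks=BREAKS):
    """Return the label '(lo, hi]' of the interval containing age.

    Missing or out-of-range ages get the placeholder label 'unknown'
    (an assumption made for this sketch).
    """
    if age is None or not (breaks[0] < age <= breaks[-1]):
        return "unknown"
    i = bisect_left(breaks, age)  # index of the first breakpoint >= age
    return f"({breaks[i - 1]}, {breaks[i]}]"


groups = [age_group(a) for a in (0.42, 5, 5.5, 80, None)]
```

With right-closed intervals an age of exactly 5 falls in (0, 5], while 5.5 falls in (5, 10].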

Investigate the relationship between “Sex” and “Survived”

We also want to know the survival probability of male and female passengers in the Titanic disaster.

Sex ~ Survival relationship
Sex	mean_survived
female	0.7420382
male	0.1889081

We can observe from the table above that the survival probability for females (about 0.74) is far higher than for males (about 0.19).
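A group-wise survival rate like the one in the table is the mean of the 0/1 Survived indicator within each group. A minimal sketch, assuming rows stored as dictionaries; the helper name `survival_rate_by` and the toy rows are made up for illustration and do not reproduce the figures above.

```python
from collections import defaultdict


def survival_rate_by(rows, key):
    """Return {group value: mean of 'Survived'} for the given grouping key."""
    totals = defaultdict(lambda: [0, 0])  # group -> [survivors, passengers]
    for row in rows:
        totals[row[key]][0] += row["Survived"]
        totals[row[key]][1] += 1
    return {group: s / n for group, (s, n) in totals.items()}


# Hypothetical toy rows, not taken from the Titanic data.
toy = [
    {"Sex": "female", "Survived": 1},
    {"Sex": "female", "Survived": 1},
    {"Sex": "female", "Survived": 0},
    {"Sex": "male", "Survived": 0},
    {"Sex": "male", "Survived": 0},
    {"Sex": "male", "Survived": 0},
    {"Sex": "male", "Survived": 1},
]

rates = survival_rate_by(toy, "Sex")
```

The same helper works for any categorical column, for example `survival_rate_by(toy, "Pclass")` once the rows carry that key.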

Embark survival probability

Embark and Survival
Embarked	Survival.Rate	Mean.Fare
	1.0000000	80.00000
C	0.5535714	59.95414
Q	0.3896104	13.27603
S	0.3369565	27.07981

From the analysis above we can see that the survival rate at port C is higher than at the other two ports, and that the mean ticket price at port C is the highest of the three. This observation could indicate that wealthier passengers were more likely to survive the disaster. The row with a blank Embarked value covers passengers with no recorded port.
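The per-port summary combines two aggregates over the same groups. The sketch below is one way to compute it in plain Python; the function name `summarise_by_port`, the "(blank)" label for an empty port, and the toy rows are assumptions for illustration only.

```python
from statistics import mean


def summarise_by_port(rows):
    """Return {port: {'survival_rate': ..., 'mean_fare': ...}}."""
    groups = {}
    for row in rows:
        # Keep passengers without a recorded port in their own group.
        port = row["Embarked"] or "(blank)"
        groups.setdefault(port, []).append(row)
    return {
        port: {
            "survival_rate": mean(r["Survived"] for r in members),
            "mean_fare": mean(r["Fare"] for r in members),
        }
        for port, members in groups.items()
    }


# Hypothetical toy rows, not taken from the Titanic data.
toy = [
    {"Embarked": "C", "Survived": 1, "Fare": 60.0},
    {"Embarked": "C", "Survived": 0, "Fare": 40.0},
    {"Embarked": "S", "Survived": 0, "Fare": 10.0},
    {"Embarked": "", "Survived": 1, "Fare": 80.0},
]

summary = summarise_by_port(toy)
```

Reporting the blank group separately, rather than dropping it, keeps very small groups visible so that an extreme rate based on a handful of passengers is not over-interpreted.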

Repeat the procedure for SibSp, Parch and Cabin

From the line plots and the bar plot, we can tell that the survival rate is higher for certain values of SibSp and Parch. However, the bar plot of Cabin against Survived lacks detail, so we take a closer look at the Cabin data to decide whether to drop this variable.
“Cabin” has 148 unique values and most of them appear only once, so the raw values are too sparse to be informative on their own. However, the first letter of each “Cabin” value represents an area of the ship and the number after the letter identifies the cabin within that area, so a simple modification of the data is needed.
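One way to carry out such a modification is to keep only the leading letter of each value. The sketch below is an assumption about the approach, not the report's own code; the function name `cabin_area` and the placeholder letter "U" for blank entries are hypothetical.

```python
def cabin_area(cabin):
    """Return the leading letter of a cabin value, or 'U' when it is blank."""
    cabin = (cabin or "").strip()
    return cabin[0] if cabin else "U"  # 'U' is a made-up label for unknown


# Illustrative values in the same 'letter + number' shape as the Cabin column.
cabins = ["C85", "", "C123", "E46", None, "B57 B59 B63 B66"]
areas = [cabin_area(c) for c in cabins]
distinct_areas = sorted(set(areas))
```

Entries that list several cabins are reduced to the letter of the first one, which is a simplification worth checking against the data before relying on it.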

After the modification, the number of unique values for “Cabin” is 148.

Handle the name variable

The name variable contains both the title and the name of the person, but we are only interested in the title.

After the modification, there are 891 and some of the titles appear only once or twice, so to make the titles more representative we replace those rare ones with “other”.
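The title extraction and the merging of rare titles can be sketched as follows. The sketch assumes names in the form "Surname, Title. Given names"; the pattern, the helper names `extract_title` and `collapse_rare`, the threshold parameter `min_count`, and the example names are all illustrative rather than the report's own.

```python
import re
from collections import Counter

# Capture the text between the first comma and the following full stop.
TITLE_PATTERN = re.compile(r",\s*([^.]+)\.")


def extract_title(name):
    """Return the title in 'Surname, Title. Given names', or 'other'."""
    match = TITLE_PATTERN.search(name)
    return match.group(1).strip() if match else "other"


def collapse_rare(titles, min_count=3):
    """Replace titles seen fewer than min_count times with 'other'."""
    counts = Counter(titles)
    return [t if counts[t] >= min_count else "other" for t in titles]


# Hypothetical names, written in the assumed format.
names = [
    "Doe, Mr. John", "Doe, Mrs. Jane", "Roe, Mr. Richard", "Poe, Mr. Edgar",
    "Moe, Dr. Max", "Loe, Mrs. Ann", "Noe, Mrs. Eve",
]
titles = [extract_title(n) for n in names]
collapsed = collapse_rare(titles)
```

The default `min_count=3` mirrors the rule of merging titles that appear only once or twice; a different cut-off may suit the full data better.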

The final counts for each unique title are given in the table below.

Number counts for each title
title	numbers
Mr	757
Mrs	197
Miss	260
Master	61
other	34

## PassengerId      Pclass        Name         Sex       SibSp       Parch 
##           0           0           0           0           0           0 
##      Ticket        Fare       Cabin    Embarked   age_group 
##           0           1           0           0           0