We are working with a widely known dataset called “Titanic - Machine Learning from Disaster” that is available on Kaggle, a popular platform for machine learning. The primary objective of this dataset is to predict whether a passenger survived the Titanic disaster based on various variables such as “pclass”, “sex”, “age”, “sibsp”,…

The dataset comes in two parts, a train set and a test set. The objective is to build a model on the train set and then predict the survival status of each passenger from the information in the test set.

The table below lists the variable names and their meanings (Table 1).

Variable Definitions and Keys

| Variable | Definition | Key |
|----------|------------|-----|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| Age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

Exploratory data analysis

In this section, the main focus is to find the interesting characteristics of the dataset and to identify the features that are most useful for predicting survival.

1. Handle the null values in the dataset.

We first count the null values in the train and test datasets (Table 2).

Number of null values in Train and Test dataset

| Variable | Test | Train |
|----------|------|-------|
| PassengerId | 0 | 0 |
| Pclass | 0 | 0 |
| Name | 0 | 0 |
| Sex | 0 | 0 |
| Age | 86 | 177 |
| SibSp | 0 | 0 |
| Parch | 0 | 0 |
| Ticket | 0 | 0 |
| Fare | 1 | 0 |
| Cabin | 0 | 0 |
| Embarked | 0 | 0 |

The table shows that “Age” has many missing values in both datasets (86 in test, 177 in train), and that “Fare” has a single missing value in the test set.
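Counts like these can be produced with a few lines of base R. The sketch below is only an illustration: the two small data frames are made-up stand-ins, not the Kaggle files, and the object names are assumptions.

```r
# Toy stand-ins for the train and test sets (hypothetical values, not the Kaggle data)
train <- data.frame(Age = c(22, NA, 38, NA), Fare = c(7.25, 71.28, 8.05, 53.10))
test  <- data.frame(Age = c(NA, 47, 62),     Fare = c(7.83, NA, 9.69))

# is.na() flags each missing cell; colSums() adds the flags up per column
null_counts <- data.frame(
  Test  = colSums(is.na(test)),
  Train = colSums(is.na(train))
)
print(null_counts)
```

Note that `is.na()` does not flag empty strings, so a text column whose blanks were read in as `""` would report zero missing values with this approach.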

Next, we plot a bar chart of the survival probability for different age groups (Figure 1).

The bar plot shows that the age groups (0.34, 8.38] and (72, 80.1] have relatively higher survival probabilities, while all the other groups have similar survival probabilities.

Observe the age group distribution.


Plot a histogram to observe the distribution for the age variable.
The histogram shows that a large share of the passengers fall within the age range [15, 40], and that few passengers are older than 60. Rather than relying on individual ages, we can group the passengers into distinct age groups, which give a clearer picture of the distribution of the Age variable. And at the same time the null values in the train

The ages are binned using the break points (0, 5, 10, 15, 20, 25, 30, 35, 40, 50, 60, 100), which define 11 intervals.
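As a minimal sketch, the binning can be done with base R's `cut()`. The ages below are made up, and the handling of missing ages (a separate "unknown" level) is an assumption for illustration, since the text does not spell out how they were treated.

```r
# Break points quoted in the text
breaks <- c(0, 5, 10, 15, 20, 25, 30, 35, 40, 50, 60, 100)

# Hypothetical ages, including one missing value
age <- c(0.42, 4, 22, 38, 40, 71, NA)

# cut() assigns each age to a right-closed interval such as (20,25]
age_group <- cut(age, breaks = breaks)

# Assumption: missing ages are kept as their own "unknown" level
age_group <- addNA(age_group)
levels(age_group)[is.na(levels(age_group))] <- "unknown"

print(table(age_group))
```

Because the intervals are closed on the right, an age of exactly 40 falls in (35,40] rather than (40,50].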

Investigate the relationship between “Sex” and “Survived”

We also want to know the survival probability of male and female passengers in the Titanic disaster.

Sex ~ Survival relationship

| Sex | mean_survived |
|-----|---------------|
| female | 0.7420382 |
| male | 0.1889081 |

The table shows that the survival probability for female passengers (about 0.74) is far higher than for male passengers (about 0.19).
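A per-group survival rate of this kind is just the mean of the 0/1 `Survived` column within each group. The following sketch uses a small made-up data frame (not the Kaggle data) and base R's `aggregate()`; the object names are assumptions.

```r
# Hypothetical mini data set (not the Kaggle data)
passengers <- data.frame(
  Sex      = c("female", "female", "male", "male", "male", "female"),
  Survived = c(1, 1, 0, 0, 1, 0)
)

# The mean of a 0/1 outcome within a group is that group's survival rate
sex_survival <- aggregate(Survived ~ Sex, data = passengers, FUN = mean)
names(sex_survival)[2] <- "mean_survived"
print(sex_survival)
```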

Survival probability by port of embarkation


Embark and Survival

| Embarked | Survival.Rate | Mean.Fare |
|----------|---------------|-----------|
| (blank) | 1.0000000 | 80.00000 |
| C | 0.5535714 | 59.95414 |
| Q | 0.3896104 | 13.27603 |
| S | 0.3369565 | 27.07981 |

The analysis shows that the survival rate at port C is higher than at the other ports, and that the mean ticket price at port C is the highest of the three. This could indicate that wealthier passengers were more likely to survive the disaster.
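A summary with both a survival rate and a mean fare per port can be sketched with a single `aggregate()` call. The data frame below is invented for illustration, and the output column names simply mirror the table headings.

```r
# Hypothetical mini data set (not the Kaggle data)
passengers <- data.frame(
  Embarked = c("C", "C", "Q", "S", "S", "S"),
  Survived = c(1, 0, 0, 1, 0, 0),
  Fare     = c(80, 40, 10, 30, 20, 25)
)

# cbind() on the left-hand side summarises both columns per port in one call
embark_summary <- aggregate(cbind(Survived, Fare) ~ Embarked,
                            data = passengers, FUN = mean)
names(embark_summary) <- c("Embarked", "Survival.Rate", "Mean.Fare")
print(embark_summary)
```

Putting the two columns side by side makes it easier to see whether ports with higher fares also show higher survival rates.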

Repeat the procedure for SibSp, Parch and Cabin



From the line plots and bar plot, we can tell that for certain values of SibSp and Parch the survival rate is higher. However, the bar plot of Cabin against Survived lacks detail, so we take a closer look at the Cabin data to decide whether to drop this variable.
“Cabin” has 148 unique values and most of them appear only once, so the raw values are too sparse to be informative. However, the first letter of each “Cabin” value represents a particular area of the ship and the number after the letter identifies the individual cabin, so a simple modification of the data is needed.
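One way to make such a modification is to keep only the leading letter of each value. The sketch below uses made-up cabin strings; labelling blank entries as "U" (unknown) is an assumption for illustration, not something the text specifies.

```r
# Hypothetical cabin values, including a blank entry and a multi-cabin entry
cabin <- c("C85", "C123", "E46", "", "B57 B59 B63")

# Keep only the first character, i.e. the letter that marks the area of the ship
cabin_area <- substr(cabin, 1, 1)

# Assumption: blank cabins get their own label instead of being dropped
cabin_area[cabin_area == ""] <- "U"

print(table(cabin_area))
```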

After the modification, the number of unique values for “Cabin” is 148.

Handle the name variable


The name variable contains both the title and the name of the person, but we are only interested in the title.

After the modification, there are 891, and some of them appear only once or twice, so to make those titles more representative we replace them with “other”.
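A minimal sketch of the title extraction, assuming names follow a "Surname, Title. Given names" pattern. The names below are invented, and the list of titles to keep is taken from the final table rather than computed from counts.

```r
# Hypothetical names in the assumed "Surname, Title. Given names" format
name <- c("Smith, Mr. John",
          "Doe, Mrs. Jane (Mary Roe)",
          "Brown, Miss. Anna",
          "Jones, Master. Tom",
          "Garcia, Don. Luis")

# Capture the text between the first comma and the following period
title <- sub("^[^,]*, *([^.]+)\\..*$", "\\1", name)

# Assumption: every title outside this list is treated as rare
common <- c("Mr", "Mrs", "Miss", "Master")
title[!title %in% common] <- "other"

print(table(title))
```

On the full data, the rare titles could instead be identified from `table(title)` by picking the levels with very low counts.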

The final counts for each unique title are given in the table below.
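Number counts for each title

| title | numbers |
|-------|---------|
| Mr | 757 |
| Mrs | 197 |
| Miss | 260 |
| Master | 61 |
| other | 34 |

Null values per column after preprocessing:

## PassengerId      Pclass        Name         Sex       SibSp       Parch 
##           0           0           0           0           0           0 
##      Ticket        Fare       Cabin    Embarked   age_group 
##           0           1           0           0           0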

Apart from the single missing Fare value, no null values remain and the data is ready to use. This concludes the EDA.

# Load required packages
library(caret)  # provides dummyVars()

# One-hot encode the combined data; `data_` (train rows followed by test rows)
# and `num` (the number of train rows) are assumed to be defined earlier
one_hot <- predict(dummyVars(~ ., data = data_), newdata = data_)

# Split the encoded matrix back into train and test data frames
train_one <- data.frame(one_hot[1:num, ])
test_one <- data.frame(one_hot[(num + 1):nrow(one_hot), ])

# Fit logistic regression model
model <- glm(train$Survived ~ ., data = train_one, family = binomial(link = "logit"))

# Predict survival probabilities for the test data
pred_probs <- predict(model, newdata = test_one, type = "response")

Score = 0.73923, which indicates that this simple logistic regression predicted about 73.9% of the test outcomes correctly.
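The score is an accuracy, so the predicted probabilities have to be turned into 0/1 labels before they can be compared with the true outcomes. The sketch below shows that step on made-up numbers; the 0.5 cutoff and the `truth` vector are assumptions for illustration, since the cutoff used is not stated and the real test labels are not part of the dataset.

```r
# Hypothetical predicted probabilities and true labels (illustration only)
pred_probs <- c(0.91, 0.12, 0.55, 0.30, 0.78)
truth      <- c(1, 0, 0, 0, 1)

# Assumption: a passenger is predicted to survive when the probability exceeds 0.5
pred_class <- as.integer(pred_probs > 0.5)

# Accuracy is the share of predictions that match the true labels
accuracy <- mean(pred_class == truth)
print(accuracy)
```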