We are working with the widely known “Titanic - Machine Learning from Disaster” dataset, available on Kaggle, a popular machine-learning platform. The primary objective is to predict whether a passenger survived the Titanic disaster based on variables such as “pclass”, “sex”, “age”, “sibsp”, …
The dataset comes in two parts: train and test. The goal is to build a model on the train set and then predict passengers’ survival status from the information in the test set.
Below is a table showing the variable names and their corresponding meanings (Table 1).

| Variable | Definition | Key |
|---|---|---|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| Age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
In this section, the main focus is on exploring the dataset and identifying its most informative features. The table below counts the missing values per variable in the test and train sets.

| Variable | Test | Train |
|---|---|---|
| PassengerId | 0 | 0 |
| Pclass | 0 | 0 |
| Name | 0 | 0 |
| Sex | 0 | 0 |
| Age | 86 | 177 |
| SibSp | 0 | 0 |
| Parch | 0 | 0 |
| Ticket | 0 | 0 |
| Fare | 1 | 0 |
| Cabin | 0 | 0 |
| Embarked | 0 | 0 |
From the table above, we can easily observe that the “Age” variable has many missing values in both sets.
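As a sanity check, this table can be reproduced directly in R. A minimal sketch, assuming the two Kaggle CSVs are read in as-is (the `train`/`test` names are assumptions; blanks in Cabin and Embarked then stay as empty strings rather than NA, which matches the zero counts above):

```r
# Read the Kaggle CSVs as-is; blank strings are kept as "" rather than NA
train <- read.csv("train.csv")
test  <- read.csv("test.csv")

# Count NA values per column in each set
colSums(is.na(test))
colSums(is.na(train))
```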
Now we can plot a bar chart showing the probability of survival for different age groups (Figure 1). From the bar plot we can easily see that the age groups (0.34, 8.38] and (72, 80.1] have relatively higher survival probabilities, while all the other groups have similar survival probabilities.
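The interval labels quoted above match what `cut()` produces with ten equal-width bins, so the plot can be reproduced with a sketch like the following (the exact plotting code is an assumption):

```r
# Bin Age into 10 equal-width groups and plot mean survival per bin
age_bins <- cut(train$Age, breaks = 10)
surv_by_age <- tapply(train$Survived, age_bins, mean)
barplot(surv_by_age,
        xlab = "Age group", ylab = "Survival probability",
        las = 2)
```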
We plot a histogram to observe the distribution of the age variable. From the histogram, we can infer that a significant portion of the passengers fall within the age range [15, 40] and that passengers older than 60 are scarce. To make the data more representative, we can categorize the passengers into distinct age groups instead of relying solely on their individual ages. These age groups give a better depiction of the distribution of the Age variable, and at the same time the null values in the train and test sets can be handled as part of the grouping. Eleven groups were used for this dataset, with breakpoints at (0, 5, 10, 15, 20, 25, 30, 35, 40, 50, 60, 100).
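A minimal sketch of this grouping, applied to the combined train and test features so both sets share the same levels (the `data_` name matches the modeling code later; giving missing ages their own “unknown” level is an assumption, consistent with age_group showing zero missing values in the output further below):

```r
# Combine train (minus the target) and test so both get identical levels
data_ <- rbind(train[, setdiff(names(train), "Survived")], test)

# Bin Age using the breakpoints listed above
breaks <- c(0, 5, 10, 15, 20, 25, 30, 35, 40, 50, 60, 100)
data_$age_group <- cut(data_$Age, breaks = breaks)

# Assumption: give missing ages their own level instead of leaving NA
data_$age_group <- addNA(data_$age_group)
levels(data_$age_group)[is.na(levels(data_$age_group))] <- "unknown"

# Drop the raw Age column in favour of the grouped version
data_$Age <- NULL
```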
We also want to know the survival probabilities for males and females in the Titanic disaster.
| Sex | mean_survived |
|---|---|
| female | 0.7420382 |
| male | 0.1889081 |
We can observe from the table above that the survival probability for females is far higher than for males.
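A sketch of how this table can be computed (`aggregate` is one of several equivalent ways):

```r
# Mean survival rate by sex on the training set
aggregate(Survived ~ Sex, data = train, FUN = mean)
```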
| Embarked | Survival.Rate | Mean.Fare |
|---|---|---|
| (blank) | 1.0000000 | 80.00000 |
| C | 0.5535714 | 59.95414 |
| Q | 0.3896104 | 13.27603 |
| S | 0.3369565 | 27.07981 |
From the above analysis we can see that the survival rate at port C is higher than at the other ports, and the mean ticket price at port C is the highest of the three. This could indicate that wealthier passengers were more likely to survive the disaster. (The first row corresponds to passengers whose port of embarkation is blank in the data.)
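A sketch of the computation behind this table; grouping on the raw Embarked column keeps the blank entries as their own group, which appears as the unlabeled first row:

```r
# Survival rate and mean fare by port of embarkation
aggregate(cbind(Survived, Fare) ~ Embarked, data = train, FUN = mean)
```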
From the line plots and bar plots, we can tell that for certain values of SibSp and Parch, the survival rate is actually higher.
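The rates behind those plots can be tabulated with a short sketch:

```r
# Survival rate for each observed value of SibSp and of Parch
tapply(train$Survived, train$SibSp, mean)
tapply(train$Survived, train$Parch, mean)
```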
However, the bar plot of Cabin against Survived seems to lack detail, so we take a closer look at the Cabin data to decide whether to keep or drop this variable.
The number of unique values for “Cabin” is 148, and most values appear only once, so the raw variable is not statistically meaningful. However, the first letter of each “Cabin” value denotes a deck area on the ship and the number after it is the cabin number, so a simple modification of the data is needed.
After the modification, “Cabin” is reduced to a small set of deck letters.
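A minimal sketch of that modification (keeping only the leading deck letter; mapping blank cabins to “unknown” is an assumption):

```r
# Reduce Cabin to its leading deck letter; blank entries become "unknown"
data_$Cabin <- substr(data_$Cabin, 1, 1)
data_$Cabin[data_$Cabin == ""] <- "unknown"
```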
We can see that the Name variable contains both a title and the person’s name, but we are only interested in the title.
After the modification there is one title per passenger, and some titles appear only once or twice, so to make the titles more representative we replace those rare titles with “other”.
So the final counts for each unique title are given in the table below.

| title | count |
|---|---|
| Mr | 757 |
| Mrs | 197 |
| Miss | 260 |
| Master | 61 |
| other | 34 |
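A sketch of the title extraction; the regular expression assumes every name follows the “Lastname, Title. Firstname” pattern, and whether the exact counts above need extra mappings (e.g. “Mme” to “Mrs”) is not shown in the original:

```r
# Pull out the word between the comma and the period in each name
data_$title <- sub(".*, *([^.]+)\\..*", "\\1", data_$Name)

# Collapse rare titles into "other"
common <- c("Mr", "Mrs", "Miss", "Master")
data_$title[!(data_$title %in% common)] <- "other"
table(data_$title)
```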
After these preprocessing steps, the remaining missing-value counts per column are:

```
## PassengerId     Pclass       Name        Sex      SibSp      Parch
##           0          0          0          0          0          0
##      Ticket       Fare      Cabin   Embarked  age_group
##           0          1          0          0          0
```
```r
# Load required packages
library(caret)  # provides dummyVars()

# Fit a one-hot encoder on the combined train/test features
# (data_ is the combined preprocessed feature set; num is the number of training rows)
num <- nrow(train)
one_hot <- predict(dummyVars(~ ., data = data_), newdata = data_)

# Split the encoded matrix back into train and test parts
train_one <- data.frame(one_hot[1:num, ])
test_one  <- data.frame(one_hot[(num + 1):nrow(one_hot), ])

# Fit a logistic regression model
model <- glm(train$Survived ~ ., data = train_one, family = binomial(link = "logit"))

# Predict survival probabilities for the test data
pred_probs <- predict(model, newdata = test_one, type = "response")
```
The Kaggle score is 0.73923, which indicates that this simple logistic regression predicted 73.9% of the test outcomes correctly.
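For completeness, a sketch of turning the predicted probabilities into the Kaggle submission format (the 0.5 threshold, the NA fallback, and the file name are assumptions):

```r
# Threshold probabilities at 0.5; fall back to 0 for any NA prediction
# (e.g. the one test passenger with a missing Fare)
pred_class <- ifelse(pred_probs > 0.5, 1, 0)
pred_class[is.na(pred_class)] <- 0

# Write the standard PassengerId/Survived submission file
submission <- data.frame(PassengerId = test$PassengerId, Survived = pred_class)
write.csv(submission, "submission.csv", row.names = FALSE)
```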