Introduction

This report analyzes the Titanic dataset from Kaggle and builds a machine learning model to predict which passengers survived.


Set working directory

setwd("C:/Users/tdjpr/Downloads")

Read in the Titanic dataset

titantic_train <- read.csv("train.csv")
titantic_test <- read.csv("test.csv")

Explore the Data

The code in this section explores the datasets and prepares them for training and testing. The training dataset is titantic_train and the testing dataset is titantic_test.

# Preview the titantic_train dataset
head(titantic_train)
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
## 6                                    Moran, Mr. James   male  NA     0     0
##             Ticket    Fare Cabin Embarked
## 1        A/5 21171  7.2500              S
## 2         PC 17599 71.2833   C85        C
## 3 STON/O2. 3101282  7.9250              S
## 4           113803 53.1000  C123        S
## 5           373450  8.0500              S
## 6           330877  8.4583              Q

How is the data structured in the dataset? What are the variables and observations?

# Check the structure of the data
str(titantic_train)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

The structure of the dataset shows that Age contains NA values. The next block counts the missing values in each column of titantic_train and then cleans up the data.

# Check missing values
colSums(is.na(titantic_train))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0
# Fill missing Age values with the median age
titantic_train$Age[is.na(titantic_train$Age)] <- median(titantic_train$Age, na.rm = TRUE)

# Convert 'Sex' column: male -> 0, female -> 1
titantic_train$Sex <- ifelse(titantic_train$Sex == "male", 0, 1)

# Check the first few rows
head(titantic_train)                          
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   0  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer)   1  38     1     0
## 3                              Heikkinen, Miss. Laina   1  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel)   1  35     1     0
## 5                            Allen, Mr. William Henry   0  35     0     0
## 6                                    Moran, Mr. James   0  28     0     0
##             Ticket    Fare Cabin Embarked
## 1        A/5 21171  7.2500              S
## 2         PC 17599 71.2833   C85        C
## 3 STON/O2. 3101282  7.9250              S
## 4           113803 53.1000  C123        S
## 5           373450  8.0500              S
## 6           330877  8.4583              Q
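
The NA check above reports 0 missing values for Cabin, although the preview shows blank Cabin entries. With its default settings, read.csv keeps blank fields in character columns as empty strings rather than NA, so is.na() does not count them. A minimal sketch of a check that counts both, using a small made-up data frame (the toy values and the helper name count_missing are illustrative assumptions, not part of the report):

```r
# Toy data frame standing in for titantic_train (values are illustrative only)
toy <- data.frame(
  Age      = c(22, NA, 26, 35),
  Cabin    = c("", "C85", "", "C123"),
  Embarked = c("S", "C", "", "S"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count a value as missing if it is NA or,
# for character columns, an empty string
count_missing <- function(df) {
  sapply(df, function(col) {
    if (is.character(col)) sum(is.na(col) | col == "") else sum(is.na(col))
  })
}

missing_counts <- count_missing(toy)
print(missing_counts)
```

The same helper could be called on titantic_train directly to see how many Cabin entries are blank.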

Which variables help determine whether a passenger survived the Titanic? The next block selects the variables used to predict passenger survival and defines the target variable.

# Selecting Relevant Columns
titantic_train <- titantic_train[, c("PassengerId", "Survived", "Pclass", "Sex", "Age")]

# Store the Survived column as the target variable
target <- titantic_train$Survived

How is the data divided for model building? The following code splits the data into training and testing sets.

# Load caret for createDataPartition()
library(caret)

# Set seed for reproducibility
set.seed(123)

# Split the data (80% train, 20% test)
trainIndex <- createDataPartition(titantic_train$Survived, p = 0.8, list = FALSE)
trainData <- titantic_train[trainIndex, ]
testData <- titantic_train[-trainIndex, ]

Which model will be used to predict Titanic survival? The next block fits a decision tree model to the training data. Because the formula is Survived ~ ., every remaining column, including PassengerId, is available to the tree as a predictor.

# Load rpart for the tree model and rpart.plot for plotting it
library(rpart)
library(rpart.plot)

# Build the decision tree model
tree_model <- rpart(Survived ~ ., data = trainData, method = "class")

# Visualize the decision tree
rpart.plot(tree_model, extra = 104, tweak = 1.9)
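
The split above creates a held-out testData set with known Survived labels, which could be used to estimate how well the tree generalizes. The following is a minimal sketch of that idea on synthetic data, not the report's actual evaluation: the toy data frame, the base R split with sample(), and the explicit predictor list are all assumptions made so the example runs on its own.

```r
library(rpart)

set.seed(123)

# Synthetic stand-in for the prepared training data (not real Titanic values)
n <- 200
toy <- data.frame(
  Pclass = sample(1:3, n, replace = TRUE),
  Sex    = sample(0:1, n, replace = TRUE),
  Age    = round(runif(n, 1, 70))
)

# Make survival depend mostly on Sex so the tree has a signal to find
p_survive    <- ifelse(toy$Sex == 1, 0.8, 0.2)
toy$Survived <- factor(as.integer(runif(n) < p_survive), levels = c(0, 1))

# Simple 80/20 split with base R (the report uses caret::createDataPartition)
idx       <- sample(seq_len(n), size = 0.8 * n)
toy_train <- toy[idx, ]
toy_test  <- toy[-idx, ]

# Fit a classification tree on explicitly named predictors
fit <- rpart(Survived ~ Pclass + Sex + Age, data = toy_train, method = "class")

# Predict on the held-out rows and compare with the known labels
pred     <- predict(fit, newdata = toy_test, type = "class")
conf_mat <- table(Predicted = pred, Actual = toy_test$Survived)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

print(conf_mat)
print(accuracy)
```

Applied to the report's objects, the same pattern would call predict() on tree_model with newdata = testData and tabulate the result against testData$Survived.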

How does the Titanic test dataset fit into the model? First, titantic_test must be processed in the same way as titantic_train.

# Handle missing values in the titantic_test dataset
titantic_test$Age[is.na(titantic_test$Age)] <- median(titantic_test$Age, na.rm = TRUE)

# Convert 'Sex' column: male -> 0, female -> 1
titantic_test$Sex <- ifelse(titantic_test$Sex == "male", 0, 1)

# Select relevant columns
titantic_test <- titantic_test[, c("PassengerId", "Pclass", "Sex", "Age")]
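
The test data repeats the same three clean-up steps as the training data, except that missing ages are filled with the test set's own median. One common alternative is to wrap the steps in a single function and reuse the training median for both datasets, so that test rows are filled with the same value the model was trained on. The sketch below assumes a hypothetical helper named prepare_titanic and made-up toy rows; it is not the method the report uses.

```r
# Hypothetical helper: apply the same clean-up to any Titanic-style data frame.
# age_fill is passed in so the caller decides which median to use.
prepare_titanic <- function(df, age_fill, keep) {
  df$Age[is.na(df$Age)] <- age_fill          # fill missing ages
  df$Sex <- ifelse(df$Sex == "male", 0, 1)   # male -> 0, female -> 1
  df[, keep, drop = FALSE]                   # keep only the chosen columns
}

# Toy inputs standing in for titantic_train and titantic_test
toy_train <- data.frame(
  PassengerId = 1:4,
  Survived    = c(0, 1, 1, 0),
  Pclass      = c(3, 1, 3, 2),
  Sex         = c("male", "female", "female", "male"),
  Age         = c(22, 38, NA, 30)
)
toy_test <- data.frame(
  PassengerId = 5:6,
  Pclass      = c(3, 1),
  Sex         = c("male", "female"),
  Age         = c(NA, 40)
)

# Compute the median once, from the training data only
train_median <- median(toy_train$Age, na.rm = TRUE)

train_ready <- prepare_titanic(toy_train, train_median,
                               c("PassengerId", "Survived", "Pclass", "Sex", "Age"))
test_ready  <- prepare_titanic(toy_test, train_median,
                               c("PassengerId", "Pclass", "Sex", "Age"))

print(test_ready)
```

Keeping the steps in one function also makes it harder for the two datasets to drift apart if the clean-up changes later.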

Based on the titantic_test variables, who is predicted to survive? The final block predicts survival for the passengers in titantic_test.

# Make predictions using the titantic_test dataset
predictions <- predict(tree_model, newdata = titantic_test, type = "class")

# Assign the predictions to the test dataset
titantic_test$Survived <- predictions

# Build the submission table
# Select only the PassengerId and Survived columns
submission <- titantic_test[, c("PassengerId", "Survived")]

Conclusion

This report processed the Titanic dataset, trained a decision tree model, and predicted survival outcomes for the test dataset. The model predicts that about 79% of the test passengers did not survive and about 21% survived.

# Count occurrences of each category in Survived column
survival_count <- table(titantic_test$Survived)

# Convert counts to percentages
survival_percentages <- prop.table(survival_count) * 100

# Print results
print(round(survival_percentages, 1))
## 
##    0    1 
## 78.9 21.1

Save to CSV File

# Save to CSV file
write.csv(submission, "titanic_submission.csv", row.names = FALSE)
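
Before uploading, it can help to read the file back and confirm that it has the expected shape. The sketch below does this with a made-up three-row submission written to a temporary file; the toy values and the specific checks are assumptions chosen for illustration.

```r
# Toy submission standing in for the real one (values are illustrative only)
toy_submission <- data.frame(
  PassengerId = 892:894,
  Survived    = factor(c(0, 1, 0))
)

# Write to a temporary file so the example does not touch the working directory
out_path <- tempfile(fileext = ".csv")
write.csv(toy_submission, out_path, row.names = FALSE)

# Read the file back and check its columns and values
check <- read.csv(out_path)
stopifnot(identical(names(check), c("PassengerId", "Survived")))
stopifnot(all(check$Survived %in% c(0, 1)))
stopifnot(!anyNA(check))

print(check)
```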