This report analyzes the Titanic dataset from Kaggle and builds a machine learning model to predict which passengers survived.
library(caret)       # createDataPartition()
library(rpart)       # rpart()
library(rpart.plot)  # rpart.plot()
setwd("C:/Users/tdjpr/Downloads")
titantic_train <- read.csv("train.csv")
titantic_test <- read.csv("test.csv")
The next set of code explores the datasets and prepares them for training and testing. The training dataset is titantic_train and the testing dataset is titantic_test.
# Preview the titantic_train dataset
head(titantic_train)
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp Parch
## 1 Braund, Mr. Owen Harris male 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 5 Allen, Mr. William Henry male 35 0 0
## 6 Moran, Mr. James male NA 0 0
## Ticket Fare Cabin Embarked
## 1 A/5 21171 7.2500 S
## 2 PC 17599 71.2833 C85 C
## 3 STON/O2. 3101282 7.9250 S
## 4 113803 53.1000 C123 S
## 5 373450 8.0500 S
## 6 330877 8.4583 Q
How is the dataset structured? What are the variables and observations?
# Check the structure of the data
str(titantic_train)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
After looking at the structure of the dataset, I noticed that some variables contain NA values (Age, for example). In the next set of code I check for missing values and clean up the titantic_train dataset.
# Check missing values
colSums(is.na(titantic_train))
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 0 0
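One caveat about the count above: `is.na()` only detects true NA values, while the `str()` output shows Cabin holding empty strings (`""`). The sketch below uses a small made-up data frame (the `toy` object and its values are assumptions for illustration, not rows from the Kaggle file) to show how blank strings can be counted alongside NA values.

```r
# Toy data frame mimicking the blank Cabin entries seen in the str() output
toy <- data.frame(
  Age   = c(22, 38, NA, 35),
  Cabin = c("", "C85", "", "C123"),
  stringsAsFactors = FALSE
)

# is.na() only sees true NA values, so Cabin looks complete
na_counts <- colSums(is.na(toy))

# Counting NA or empty strings reveals the blank Cabin entries
blank_counts <- sapply(toy, function(col) sum(is.na(col) | col == ""))

print(na_counts)
print(blank_counts)
```

Since Cabin is dropped later in this report, the blanks do not affect the model here. If blank strings should be treated as missing from the start, one option is the `na.strings` argument of `read.csv`, for example `na.strings = c("", "NA")`.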
# Fill missing Age values with the median age
titantic_train$Age[is.na(titantic_train$Age)] <- median(titantic_train$Age, na.rm = TRUE)
# Convert 'Sex' column: male -> 0, female -> 1
titantic_train$Sex <- ifelse(titantic_train$Sex == "male", 0, 1)
# Check the first few rows
head(titantic_train)
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp Parch
## 1 Braund, Mr. Owen Harris 0 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) 1 38 1 0
## 3 Heikkinen, Miss. Laina 1 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) 1 35 1 0
## 5 Allen, Mr. William Henry 0 35 0 0
## 6 Moran, Mr. James 0 28 0 0
## Ticket Fare Cabin Embarked
## 1 A/5 21171 7.2500 S
## 2 PC 17599 71.2833 C85 C
## 3 STON/O2. 3101282 7.9250 S
## 4 113803 53.1000 C123 S
## 5 373450 8.0500 S
## 6 330877 8.4583 Q
Which variables will help determine whether a passenger survived the Titanic? In the next code I identify the variables that help predict passenger survival and define the target variable.
# Selecting Relevant Columns
titantic_train <- titantic_train[, c("PassengerId", "Survived", "Pclass", "Sex", "Age")]
# Define the target variable (Survived)
target <- titantic_train$Survived
Will the code output the Titanic survival data? In the following code I split the data into training and testing sets.
# Set seed for reproducibility
set.seed(123)
# Split the data (80% train, 20% test)
trainIndex <- createDataPartition(titantic_train$Survived, p = 0.8, list = FALSE)
trainData <- titantic_train[trainIndex, ]
testData <- titantic_train[-trainIndex, ]
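To illustrate what a class-balanced 80/20 split looks like, here is a minimal base-R sketch of stratified sampling. It is not caret's implementation; it only mirrors the idea of sampling within each outcome class, which is what `createDataPartition` aims for when the outcome is a factor. The `survived` vector is made-up data standing in for titantic_train$Survived.

```r
set.seed(123)

# Toy outcome vector standing in for titantic_train$Survived (assumed values)
survived <- c(rep(0, 60), rep(1, 40))

# Sample 80% of the row indices within each outcome class
train_idx <- unlist(lapply(split(seq_along(survived), survived), function(rows) {
  rows[sample.int(length(rows), size = round(0.8 * length(rows)))]
}))

train_y <- survived[train_idx]
test_y  <- survived[-train_idx]

# Both subsets keep the same share of survivors as the full vector
print(prop.table(table(train_y)))
print(prop.table(table(test_y)))
```

Running `prop.table(table(trainData$Survived))` and the same call on testData is a quick way to check how balanced the report's own split turned out.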
Which model will be used to predict Titanic survival? In the next set of code I use a decision tree model to predict survival.
# Build the decision tree model
tree_model <- rpart(Survived ~ ., data = trainData, method = "class")
# Visualize the decision tree
rpart.plot(tree_model, extra = 104, tweak = 1.9)
rpart.plot(tree_model, extra = 104, tweak = 1.9)
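The held-out testData rows can be used to estimate how well the tree generalizes before predicting on the Kaggle test file. The sketch below shows one way to do that with a confusion matrix and an accuracy score. It runs on synthetic data (the `toy` objects and the simulated outcome are assumptions, not the Titanic data), and it lists the predictors explicitly in the formula, because `Survived ~ .` would also treat PassengerId as a predictor.

```r
library(rpart)

# Hypothetical stand-in for trainData/testData so the sketch runs on its own
set.seed(123)
n <- 200
toy <- data.frame(
  Pclass = sample(1:3, n, replace = TRUE),
  Sex    = sample(0:1, n, replace = TRUE),
  Age    = round(runif(n, 1, 70))
)
# Synthetic outcome: Sex == 1 and lower Pclass numbers raise the survival odds
toy$Survived <- rbinom(n, 1, plogis(-1.5 + 2 * toy$Sex - 0.5 * (toy$Pclass - 2)))

# Simple random 80/20 split of the toy rows
idx       <- sample(n, size = round(0.8 * n))
toy_train <- toy[idx, ]
toy_test  <- toy[-idx, ]

# Fit a classification tree on the named predictors only
toy_model <- rpart(Survived ~ Pclass + Sex + Age, data = toy_train, method = "class")

# Predict on the held-out rows and compare with the known labels
holdout_pred <- predict(toy_model, newdata = toy_test, type = "class")
conf_matrix  <- table(Predicted = holdout_pred, Actual = toy_test$Survived)
accuracy     <- mean(as.character(holdout_pred) == as.character(toy_test$Survived))

print(conf_matrix)
print(round(accuracy, 3))
```

With the report's own objects, the same comparison would use tree_model, testData, and testData$Survived in place of the toy versions.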
How does the Titanic test dataset fit into the model? First, I need to process the titantic_test dataset the same way as the titantic_train dataset.
# Handle missing values in the titantic_test dataset
titantic_test$Age[is.na(titantic_test$Age)] <- median(titantic_test$Age, na.rm = TRUE)
# Convert 'Sex' column: male -> 0, female -> 1
titantic_test$Sex <- ifelse(titantic_test$Sex == "male", 0, 1)
# Select relevant columns
titantic_test <- titantic_test[, c("PassengerId", "Pclass", "Sex", "Age")]
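Because the train and test files go through the same three steps (impute Age, encode Sex, select columns), the steps could be wrapped in one function so the two datasets cannot drift apart. The following is a minimal sketch; `prep_titanic` and its `age_fill` argument are hypothetical names introduced here, and the `toy` rows are made up.

```r
# Hypothetical helper: applies the report's cleaning steps to a Titanic-style data frame
prep_titanic <- function(df, age_fill = NULL) {
  # Default to the median Age of the data frame being cleaned
  if (is.null(age_fill)) {
    age_fill <- median(df$Age, na.rm = TRUE)
  }
  df$Age[is.na(df$Age)] <- age_fill
  # Encode Sex: male -> 0, female -> 1
  df$Sex <- ifelse(df$Sex == "male", 0, 1)
  # Keep only the modelling columns that are present (Survived is absent in test data)
  keep <- intersect(c("PassengerId", "Survived", "Pclass", "Sex", "Age"), names(df))
  df[, keep, drop = FALSE]
}

# Toy rows standing in for a few lines of test.csv (assumed values)
toy <- data.frame(
  PassengerId = 1:4,
  Pclass      = c(3, 1, 3, 2),
  Name        = c("A", "B", "C", "D"),
  Sex         = c("male", "female", "female", "male"),
  Age         = c(22, NA, 26, 40)
)

toy_clean <- prep_titanic(toy)
print(toy_clean)
```

The optional `age_fill` argument would also allow the training median to be reused for the test data, for example `prep_titanic(test_df, age_fill = train_median)`, which is a design choice rather than what this report does.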
Based on the titantic_test dataset variables, who will survive? The final set of code predicts survival for the passengers in the titantic_test dataset.
# Make predictions using the titantic_test dataset
predictions <- predict(tree_model, newdata = titantic_test, type = "class")
# Assign the predictions to the test dataset
titantic_test$Survived <- predictions
# Build the submission table with only the PassengerId and Survived columns
submission <- titantic_test[, c("PassengerId", "Survived")]
This report processed the Titanic dataset, trained a decision tree model, and predicted survival outcomes for the test dataset. Based on the predictions, about 79% of the test passengers are predicted not to survive and about 21% are predicted to survive.
# Count occurrences of each category in Survived column
survival_count <- table(titantic_test$Survived)
# Convert counts to percentages
survival_percentages <- prop.table(survival_count) * 100
# Print results
print(round(survival_percentages, 1))
##
## 0 1
## 78.9 21.1
# Save to CSV file
write.csv(submission, "titanic_submission.csv", row.names = FALSE)
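Before uploading, it can help to read the saved file back and confirm it has the expected shape. The sketch below does this on a made-up submission written to a temporary file; `toy_submission`, `tmp_path`, and the check names are assumptions for illustration.

```r
# Toy submission standing in for the report's `submission` data frame (made-up IDs)
toy_submission <- data.frame(
  PassengerId = 1:4,
  Survived    = factor(c(0, 1, 0, 0), levels = c(0, 1))
)

# Write to a temporary file and read it back
tmp_path <- tempfile(fileext = ".csv")
write.csv(toy_submission, tmp_path, row.names = FALSE)
check <- read.csv(tmp_path)

# Basic sanity checks on the file contents
has_expected_cols <- identical(names(check), c("PassengerId", "Survived"))
only_binary       <- all(check$Survived %in% c(0, 1))
no_missing        <- !anyNA(check)
same_row_count    <- nrow(check) == nrow(toy_submission)

print(c(has_expected_cols, only_binary, no_missing, same_row_count))
```

For the real file, the same checks would read "titanic_submission.csv" and compare the row count against nrow(titantic_test).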