Linear regression is one of the most elementary machine learning techniques for predicting an outcome from a set of inputs. In this example, we are going to predict the survival of Titanic passengers using a close relative of linear regression: logistic regression. Logistic regression is a generalised linear model that is well suited to binary classification problems. In this document, we will briefly go through the data, perform data preprocessing (where applicable), build and evaluate the models, and finally visualise the model.
The Titanic dataset is made up of 887 rows and 9 attributes. Each row represents a passenger, and the columns describe different attributes of each passenger. Using the str() function, the information for each attribute is summarised below.
set.seed(116) # Set seed for reproducibility
library(tidyverse) # Data manipulation
## -- Attaching packages ------------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1 v purrr 0.3.3
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 1.0.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ---------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(MASS) # stepwise regression package
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
titan <- read.csv('D:/Dataset/Titanic/titanic.csv',header = T)
str(titan)
## 'data.frame': 887 obs. of 9 variables:
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : Factor w/ 887 levels "Capt. Edward Gifford Crosby",..: 602 823 172 814 733 464 700 33 842 839 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 27 54 2 27 14 ...
## $ Siblings.Spouses.Aboard: int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parents.Children.Aboard: int 0 0 0 0 0 0 0 1 2 0 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
The dataset is checked for missing values and blank rows before any data manipulation is done; none are found. The number of unique values within each column is checked next to verify whether any additional data manipulation is required.
colSums(is.na(titan) | titan =="")
## Survived Pclass Name
## 0 0 0
## Sex Age Siblings.Spouses.Aboard
## 0 0 0
## Parents.Children.Aboard Fare Embarked
## 0 0 0
sapply(titan, function(x) length(unique(x)))
## Survived Pclass Name
## 2 3 887
## Sex Age Siblings.Spouses.Aboard
## 2 89 7
## Parents.Children.Aboard Fare Embarked
## 7 248 3
Based on these results and the output from the previous section (str()), we can conclude that:
‘Name’ is recorded as a factor with 887 unique values, i.e. one level per passenger. We should remove this column before further analysis, as it provides no information the regression model can generalise from.
‘Survived’, ‘Pclass’, ‘Sex’ and ‘Embarked’ each have only 2-3 unique values, so they are categorical variables. ‘Pclass’ in particular is stored as an integer; we can use the as.factor() function to convert it (re-applying as.factor() to ‘Sex’ and ‘Embarked’, which are already factors, is harmless). ‘Survived’ is left as a 0/1 integer, which glm() accepts directly for a binomial model.
drop <- c('Name')
titan1 <- titan[,!(names(titan) %in% drop)]
# Convert categorical columns to factors
titan1$Pclass <- as.factor(titan1$Pclass)
titan1$Sex <- as.factor(titan1$Sex)
titan1$Embarked <- as.factor(titan1$Embarked)
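To confirm that the conversions took effect, str() can be re-run on the cleaned data (output omitted here):
# Pclass, Sex and Embarked should now show up as factors
str(titan1)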
The data is divided into training and test sets using a 75:25 ratio.
index <- floor(0.75 * nrow(titan1))        # size of the training set
titan1 <- titan1[sample(1:nrow(titan1)), ] # shuffle the rows
train <- titan1[1:index, ]
test <- titan1[(index + 1):nrow(titan1), ] # remaining rows, no overlap with train
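A quick sanity check on the split sizes: with 887 rows, this should yield 665 training rows and 222 test rows.
nrow(train) # 665
nrow(test)  # 222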
Using the training set, a simple logistic regression model is built with sex as the only predictor of survival. From the summary output of the model, the fitted equation is log(odds) = 1.0986 - 2.5478*(Male). The probability of a male or female passenger surviving the disaster can then be computed as shown after the summary below:
titan_glm <- glm(Survived ~ Sex, data = train, family = 'binomial')
summary(titan_glm)
##
## Call:
## glm(formula = Survived ~ Sex, family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6651 -0.6494 -0.6494 0.7585 1.8221
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.0986 0.1432 7.671 1.71e-14 ***
## Sexmale -2.5478 0.1912 -13.327 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 899.75 on 664 degrees of freedom
## Residual deviance: 686.40 on 663 degrees of freedom
## AIC: 690.4
##
## Number of Fisher Scoring iterations: 4
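Plugging the fitted coefficients into the logistic function (plogis() in base R) converts these log-odds into probabilities:
# Convert log-odds to probabilities: p = 1 / (1 + e^-x)
plogis(1.0986)          # female: ~0.75
plogis(1.0986 - 2.5478) # male:   ~0.19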
The predicted survival status on the test dataset is compared with the actual survival status of each passenger, giving the model an accuracy of about 79%.
# Test for accuracy
predict_sex_survived <- predict(titan_glm,newdata = test,type = 'response')
# Since Survived can only be 1 or 0, threshold the predicted probability at 0.5
predict_sex_survived <- ifelse(predict_sex_survived>0.5,1,0)
error_1 <- mean(predict_sex_survived!=test$Survived)
accuracy_1 <- 1-error_1
accuracy_1
## [1] 0.7882883
The 100% stacked plot is consistent with our findings: the majority of the passengers who survived were female, and the majority of those who did not make it were male. This is also consistent with the ‘women and children first’ myth.
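The plot itself is not reproduced here, but a minimal ggplot2 sketch of such a 100% stacked bar chart (using the training data) would be:
# 100% stacked bars: proportion of survivors within each sex
ggplot(train, aes(x = Sex, fill = factor(Survived))) +
  geom_bar(position = 'fill') +
  labs(y = 'Proportion', fill = 'Survived')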
An accuracy score of 79% is not bad as a start. Let’s see whether we can improve on it by including more attributes. To determine which attributes are significant predictors of survival, we can either hand-pick them based on their p-values or use a stepwise selection algorithm.
Two models are built: one with all the variables, and another with only the significant predictors selected by stepwise regression using AIC as the criterion. Model 2 is eventually selected as it produces a lower AIC.
# titan_null: intercept-only model, useful for pseudo-R2 computation or as the lower scope for stepwise regression
titan_null <- glm(Survived~1,data=train,family = binomial)
# model 1 - complete model
titan_logistic_complete <- glm(Survived~., data=train, family = binomial)
summary(titan_logistic_complete)
##
## Call:
## glm(formula = Survived ~ ., family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9108 -0.5942 -0.3698 0.6091 2.5255
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.385104 0.575846 7.615 2.64e-14 ***
## Pclass2 -1.214855 0.362074 -3.355 0.000793 ***
## Pclass3 -2.338527 0.364495 -6.416 1.40e-10 ***
## Sexmale -2.822213 0.232174 -12.156 < 2e-16 ***
## Age -0.042543 0.009161 -4.644 3.42e-06 ***
## Siblings.Spouses.Aboard -0.490668 0.128586 -3.816 0.000136 ***
## Parents.Children.Aboard -0.057766 0.136886 -0.422 0.673027
## Fare 0.003483 0.002933 1.187 0.235060
## EmbarkedQ -0.330097 0.409562 -0.806 0.420257
## EmbarkedS -0.274931 0.273462 -1.005 0.314717
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 899.75 on 664 degrees of freedom
## Residual deviance: 575.93 on 655 degrees of freedom
## AIC: 595.93
##
## Number of Fisher Scoring iterations: 5
# Model 2 - Stepwise regression starting from full model
titan_logistic_stepwise <- titan_logistic_complete %>% stepAIC(direction='both',trace = FALSE)
summary(titan_logistic_stepwise)
##
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + Siblings.Spouses.Aboard,
## family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7999 -0.5657 -0.3655 0.6087 2.4772
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.469241 0.483415 9.245 < 2e-16 ***
## Pclass2 -1.451172 0.315450 -4.600 4.22e-06 ***
## Pclass3 -2.605555 0.301405 -8.645 < 2e-16 ***
## Sexmale -2.825326 0.226529 -12.472 < 2e-16 ***
## Age -0.043885 0.009023 -4.864 1.15e-06 ***
## Siblings.Spouses.Aboard -0.481862 0.123558 -3.900 9.62e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 899.75 on 664 degrees of freedom
## Residual deviance: 578.67 on 659 degrees of freedom
## AIC: 590.67
##
## Number of Fisher Scoring iterations: 5
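As a side note, exponentiating the coefficients of model 2 turns the log-odds into odds ratios, which are often easier to interpret (e.g. being male multiplies the odds of survival by roughly exp(-2.83) ≈ 0.06):
# Express coefficients as odds ratios
exp(coef(titan_logistic_stepwise))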
# compare AIC for the two candidate models
AIC(titan_logistic_complete,titan_logistic_stepwise)
## df AIC
## titan_logistic_complete 10 595.9270
## titan_logistic_stepwise 6 590.6708
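The null model defined earlier can also be put to use here: McFadden's pseudo-R2 compares the log-likelihood of model 2 with that of the intercept-only model. From the deviances reported above, 1 - 578.67/899.75 gives roughly 0.36.
# McFadden's pseudo-R^2 = 1 - logLik(model) / logLik(null model)
1 - as.numeric(logLik(titan_logistic_stepwise)) / as.numeric(logLik(titan_null))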
Using model 2, we can then compute the survival probabilities and the accuracy on the test set. An accuracy of about 79% is obtained, essentially the same as the sex-only model on this particular test split.
# Check for accuracy using test dataset
predict_2 <- predict(titan_logistic_stepwise,newdata = test,type = 'response')
# Since Survived can only be 1 or 0, threshold the predicted probability at 0.5
predict_2 <- ifelse(predict_2>0.5,1,0)
error_2 <- mean(predict_2!=test$Survived)
accuracy_2 <- 1-error_2
accuracy_2
## [1] 0.7882883
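A single accuracy figure hides where the model goes wrong; as an optional check, a confusion matrix cross-tabulates predictions against actual outcomes:
# Confusion matrix on the test set
table(Predicted = predict_2, Actual = test$Survived)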
In this section, we are going to showcase different ways to visualise the logistic regression model. To visualise the probabilities, a sigmoid curve is used: the logistic function maps any real-valued log-odds x to a probability 1/(1 + e^-x), which always lies between 0 and 1.
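As a quick illustration of that mapping, base R's plogis() can be plotted directly:
# The sigmoid squashes the whole real line into (0, 1)
curve(plogis(x), from = -6, to = 6, xlab = 'Log-odds', ylab = 'Probability')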
The curve can further be divided by category as shown below.
# logistic regression plot - sigmoid curve - predicted vs actual - By Pclass
predicted.data <- data.frame(probability.of.survival = titan_logistic_stepwise$fitted.values, Survived = train$Survived, Pclass = train$Pclass)
predicted.data <- predicted.data[order(predicted.data$probability.of.survival,decreasing = FALSE),]
ggplot(data=predicted.data,aes(x=probability.of.survival,y=Survived,col=Pclass))+
geom_point(alpha=0.8,shape=1,stroke=1)+
xlab('Predicted probability of surviving titanic')+
ylab('Survived')+
stat_smooth(method="glm", method.args=list(family="binomial"), se=FALSE,fullrange = TRUE)+
theme(legend.position = 'bottom',plot.title = element_text(hjust = 0.5))+
ggtitle('Survival by Pclass')
# logistic regression plot - sigmoid curve - predicted vs actual - By Gender
predicted.data <- data.frame(probability.of.survival = titan_logistic_stepwise$fitted.values, Survived = train$Survived, Sex = train$Sex)
predicted.data <- predicted.data[order(predicted.data$probability.of.survival,decreasing = FALSE),]
ggplot(data=predicted.data,aes(x=probability.of.survival,y=Survived,col=Sex))+
geom_point(alpha=0.8,shape=1,stroke=1)+
xlab('Predicted probability of surviving titanic')+
ylab('Survived')+
stat_smooth(method="glm", method.args=list(family="binomial"), se=FALSE,fullrange = TRUE)+
theme(legend.position = 'bottom',plot.title = element_text(hjust = 0.5))+
ggtitle('Survival by Gender')
Another way to visualise the logistic regression model is with a jitter plot. Using the same data and categories as the sigmoid curves above, the jitter plots are shown below:
# glm plot by Sex
jitter = rnorm(nrow(train), sd = 0.1)
plot(titan_logistic_stepwise$fitted.values, train$Survived + jitter,
     xlim = 0:1, ylim = c(-0.5, 1.5), axes = FALSE,
     xlab = "Predicted probability", ylab = "",
     col = c('#F8766D','#00BFC4')[train$Sex], pch = 16)
axis(1)
axis(2, at = c(0, 1), labels = c("0 (Died)", "1 (Survived)"))
legend("topleft",pch = 19,col=c('#F8766D','#00BFC4'),c("Female", "Male"),cex = 0.8,bty="o")
# glm plot by Pclass
set.seed(1)
jitter = rnorm(nrow(train), sd = 0.10)
plot(titan_logistic_stepwise$fitted.values, train$Survived + jitter,
     xlim = 0:1, ylim = c(-0.5, 1.5), axes = FALSE,
     xlab = "Predicted probability", ylab = "",
     col = c('#F8766D','#00BFC4','#8c8c8c')[train$Pclass], pch = 16)
axis(1)
axis(2, at = c(0, 1), labels = c("0 (Died)", "1 (Survived)"))
legend("topleft",bg='transparent',pch = 19,col=c('#F8766D','#00BFC4','#8c8c8c'),c('Class 1','Class 2','Class 3'),horiz = T,cex = 0.8,bty="o")