library(tidyr)
library(readr)
library(stringr)
library(dplyr)
library(Hmisc)
library(outliers)
library(InformationValue)
The survival of passengers on the Titanic
This project uses the Titanic training data from Kaggle.com. The dataset describes 891 passengers with 12 variables (889 passengers remain after the cleaning steps that follow).
# Read the data, treating empty strings as missing values
titanic_train <- read.csv("/Users/Vidya/Downloads/train.csv", header = TRUE, na.strings = c(""))
# Structure and dimensions of the data frame
str(titanic_train)
'data.frame': 891 obs. of 12 variables:
$ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
$ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
$ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ Name : Factor w/ 891 levels \Abbing
View(titanic_train)
# Count the NA values in the whole data frame
sum(is.na(titanic_train))
[1] 866
# NA values in the Cabin column alone
sum(is.na(titanic_train$Cabin))
[1] 687
# NA values per column
sapply(titanic_train, function(x) sum(is.na(x)))
PassengerId Survived Pclass Name Sex Age SibSp Parch
0 0 0 0 0 177 0 0
Ticket Fare Cabin Embarked
0 0 687 2
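As a quick cross-check, the per-column counts add up to the overall total reported earlier (177 + 687 + 2 = 866). The same check can be sketched on a small made-up data frame; `toy` is hypothetical and is not the Titanic file, and base R `colSums()` is used here as an equivalent to the `sapply()` call:

```r
# Hypothetical toy data, not the Titanic file
toy <- data.frame(
  Age      = c(22, NA, 26, NA),
  Cabin    = c(NA, "C85", NA, NA),
  Embarked = c("S", "C", NA, "S"),
  stringsAsFactors = FALSE
)

# Per-column NA counts, equivalent to sapply(toy, function(x) sum(is.na(x)))
na_per_column <- colSums(is.na(toy))

# Overall NA count for the whole data frame
na_total <- sum(is.na(toy))

print(na_per_column)
print(na_total)
```

The overall count should always equal the sum of the per-column counts, which is a cheap sanity check before deciding which columns to drop or impute.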
# Number of distinct values per column
sapply(titanic_train, function(x) length(unique(x)))
PassengerId Survived Pclass Name Sex Age SibSp Parch
891 2 3 891 2 89 7 7
Ticket Fare Cabin Embarked
681 248 148 4
The Cabin variable has 687 missing values out of 891 observations, so it is dropped from the training dataset. The PassengerId column is dropped as well, because it only records a unique identifier for each passenger and carries no predictive information. The new training data is a subset of the original dataset.
# Keep columns 2 to 10 and 12, which drops PassengerId (1) and Cabin (11)
titanic_subset <- subset(titanic_train, select = c(2:10, 12))
View(titanic_subset)
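Selecting by position works as long as the column order never changes. A minimal sketch of a name-based alternative is given here, run on a hypothetical two-row frame (`toy`) that only mimics the 12 column names of the Kaggle file:

```r
# Hypothetical toy frame with the same 12 column names as the Kaggle file
toy <- data.frame(
  PassengerId = 1:2, Survived = c(0L, 1L), Pclass = c(3L, 1L),
  Name = c("A", "B"), Sex = c("male", "female"), Age = c(22, 38),
  SibSp = c(1L, 1L), Parch = c(0L, 0L), Ticket = c("t1", "t2"),
  Fare = c(7.25, 71.28), Cabin = c(NA, "C85"), Embarked = c("S", "C"),
  stringsAsFactors = FALSE
)

# Drop the two columns by name
drop_cols <- c("PassengerId", "Cabin")
by_name <- toy[, !(names(toy) %in% drop_cols)]

# Position-based selection, as used in the notebook
by_position <- subset(toy, select = c(2:10, 12))

# Both approaches should keep the same ten columns
print(names(by_name))
print(isTRUE(all.equal(by_name, by_position)))
```

The name-based form states the intent directly and keeps working if columns are reordered upstream.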
The Age variable has 177 missing values. These are imputed with the mean of the observed ages. The two rows with a missing Embarked value are then removed, which leaves 889 rows.
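# Replace missing ages with the mean of the observed ages
titanic_subset$Age[is.na(titanic_subset$Age)] <- mean(titanic_subset$Age, na.rm = TRUE)
# Drop the rows where Embarked is missing
titanic_subset <- titanic_subset[!is.na(titanic_subset$Embarked), ]
# Reset the row names after removing rows
rownames(titanic_subset) <- NULL
str(titanic_subset)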
'data.frame': 889 obs. of 10 variables:
$ Survived: int 0 1 1 1 0 0 0 0 1 1 ...
$ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ Name : Factor w/ 891 levels \Abbing
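The imputation and row-removal steps can be traced on a small example. This is a minimal sketch on made-up values (`toy` is hypothetical), mirroring the same base R calls:

```r
# Hypothetical toy data, not the Titanic file
toy <- data.frame(
  Age      = c(20, NA, 40, NA, 30),
  Embarked = c("S", "C", NA, "Q", "S"),
  stringsAsFactors = FALSE
)

# Mean of the observed ages only
age_mean <- mean(toy$Age, na.rm = TRUE)

# Fill the missing ages with that mean
toy$Age[is.na(toy$Age)] <- age_mean

# Remove rows with a missing port of embarkation and reset row names
toy <- toy[!is.na(toy$Embarked), ]
rownames(toy) <- NULL

print(toy)
```

Mean imputation keeps the column mean unchanged but reduces its variance, since every filled-in value sits exactly at the centre of the distribution.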
# Class balance of the outcome (0 = died, 1 = survived)
table(titanic_subset$Survived)
0 1
549 340
# Split into a training set (first 800 rows) and a test set (remaining 89 rows)
train <- titanic_subset[1:800, ]
test <- titanic_subset[801:889, ]
# Logistic regression of survival on sex and passenger class
model <- glm(Survived ~ Sex + Pclass, family = binomial(link = "logit"), data = train)
# Alternative with all predictors:
# model <- glm(Survived ~ ., family = binomial(link = "logit"), data = train)
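The split above takes the first 800 rows in file order. If the rows were ordered in some meaningful way, a random split would be a safer choice. The following is a sketch of that alternative technique, not what the notebook does; the seed value and the `toy` frame are arbitrary assumptions:

```r
# Arbitrary seed, chosen only so the split is reproducible
set.seed(42)

# Hypothetical stand-in with the same number of rows as titanic_subset
toy <- data.frame(id = 1:889)

# Draw 800 row indices at random without replacement
train_idx <- sample(nrow(toy), size = 800)

train_toy <- toy[train_idx, , drop = FALSE]
test_toy  <- toy[-train_idx, , drop = FALSE]

print(c(train = nrow(train_toy), test = nrow(test_toy)))
```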
summary(model)
Call:
glm(formula = Survived ~ Sex + Pclass, family = binomial(link = "logit"),
data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2060 -0.6949 -0.4491 0.6652 2.1653
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.2870 0.3158 10.409 <2e-16 ***
Sexmale -2.6937 0.1951 -13.809 <2e-16 ***
Pclass -0.9456 0.1123 -8.418 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1065.39 on 799 degrees of freedom
Residual deviance: 741.77 on 797 degrees of freedom
AIC: 747.77
Number of Fisher Scoring iterations: 4
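The coefficients are on the log-odds scale, so they are easier to read after exponentiating. The sketch below copies the three estimates from the summary above and converts them to odds ratios and to fitted probabilities for two example passengers. The helper `survival_prob()` is a hypothetical name introduced only for illustration:

```r
# Estimates copied from the summary output above
b <- c(intercept = 3.2870, sex_male = -2.6937, pclass = -0.9456)

# Odds ratios: multiplicative change in the odds of survival
odds_ratios <- exp(b[c("sex_male", "pclass")])

# Hypothetical helper: fitted survival probability for one passenger profile
survival_prob <- function(is_male, pclass) {
  eta <- b[["intercept"]] + b[["sex_male"]] * is_male + b[["pclass"]] * pclass
  plogis(eta)  # inverse logit
}

p_female_first <- survival_prob(is_male = 0, pclass = 1)
p_male_third   <- survival_prob(is_male = 1, pclass = 3)

print(round(odds_ratios, 3))
print(round(c(female_first = p_female_first, male_third = p_male_third), 3))
```

Under this model, being male multiplies the odds of survival by roughly 0.07, and each step down in class (for example from first to second) multiplies them by roughly 0.39, because Pclass enters the model as a numeric variable.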
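# Predicted survival probabilities for the test set
predicted <- plogis(predict(model, test))
# Probability cutoff chosen by optimalCutoff()
optCutOff <- optimalCutoff(test$Survived, predicted)[1]
# Misclassification error on the test set at that cutoff
misClassError(test$Survived, predicted, threshold = optCutOff)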
[1] 0.2022
plotROC(test$Survived, predicted)
confusionMatrix(test$Survived, predicted, threshold = optCutOff)
0 1
0 55 17
1 1 16
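The misclassification error of 0.2022 can be recovered directly from these four counts: (17 + 1) / 89. The sketch below does that arithmetic in base R. It assumes that columns hold the actual classes and rows the predicted classes; that orientation matters for sensitivity and specificity but not for the error rate:

```r
# Counts copied from the confusion matrix above.
# Assumption: columns are actual classes, rows are predicted classes.
cm <- matrix(
  c(55, 1, 17, 16), nrow = 2,
  dimnames = list(predicted = c("0", "1"), actual = c("0", "1"))
)

# Share of test passengers that were classified incorrectly
misclass <- (cm["0", "1"] + cm["1", "0"]) / sum(cm)

# Share of actual survivors predicted as survivors
sensitivity <- cm["1", "1"] / sum(cm[, "1"])

# Share of actual non-survivors predicted as non-survivors
specificity <- cm["0", "0"] / sum(cm[, "0"])

print(round(c(misclass = misclass, sensitivity = sensitivity,
              specificity = specificity), 4))
```

Under that assumed orientation, the model identifies almost all non-survivors correctly but catches only about half of the survivors at this cutoff.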