The Titanic Dataset is commonly used in many data sciences courses. This is an example of using R to develop a Logistic Regression model to predict survivability.
The presentation uses an R Markdown Notebook.
Read in the Titanic Dataset
TitanicData<-read.csv("c:/efrayer/R_Projects/Titanic/data/titanicpassengers_ERF.csv")
library(plyr)
Take a quick review of the data
summary(TitanicData)
Name Survived Passenger.Class
Connolly, Miss. Kate : 2 No :809 Min. :1.000
Kelly, Mr. James : 2 Yes:500 1st Qu.:2.000
Abbing, Mr. Anthony : 1 Median :3.000
Abbott, Master. Eugene Joseph : 1 Mean :2.295
Abbott, Mr. Rossmore Edward : 1 3rd Qu.:3.000
Abbott, Mrs. Stanton (Rosa Hunt): 1 Max. :3.000
(Other) :1301
Sex Age Siblings.and.Spouses
female:466 Min. : 0.1667 Min. :0.0000
male :843 1st Qu.:21.0000 1st Qu.:0.0000
Median :28.0000 Median :0.0000
Mean :29.8811 Mean :0.4989
3rd Qu.:39.0000 3rd Qu.:1.0000
Max. :80.0000 Max. :8.0000
NA's :263
Parents.and.Children Fare Port
Min. :0.000 Min. : 0.000 : 2
1st Qu.:0.000 1st Qu.: 7.896 C:270
Median :0.000 Median : 14.454 Q:123
Mean :0.385 Mean : 33.295 S:914
3rd Qu.:0.000 3rd Qu.: 31.275
Max. :9.000 Max. :512.329
NA's :1
Home...Destination
:564
New York, NY : 64
London : 14
Montreal, PQ : 10
Cornwall / Akron, OH: 9
Paris, France : 9
(Other) :639
The remainder of the code comes from using R Studio and Rattle.
crs$input <- c("Passenger.Class", "Sex", "Age",
"Siblings.and.Spouses", "Parents.and.Children",
"Fare", "Port")
crs$numeric <- c("Passenger.Class", "Age",
"Siblings.and.Spouses", "Parents.and.Children",
"Fare")
crs$categoric <- c("Sex", "Port")
crs$target <- "Survived"
crs$risk <- NULL
crs$ident <- NULL
crs$ignore <- c("Name", "Home...Destination")
crs$weights <- NULL
crs$glm <- glm(Survived ~ .,
data=crs$dataset[, c(crs$input, crs$target)],
family=binomial(link="logit"))
summary(crs$glm)
Call:
glm(formula = Survived ~ ., family = binomial(link = "logit"),
data = crs$dataset[, c(crs$input, crs$target)])
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5726 -0.6846 -0.4058 0.6456 2.5376
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 5.213902 0.508562 10.252 < 2e-16 ***
Passenger.Class -1.009327 0.133464 -7.563 3.95e-14 ***
Sexmale -2.608938 0.179304 -14.550 < 2e-16 ***
Age -0.037687 0.006634 -5.681 1.34e-08 ***
Siblings.and.Spouses -0.348021 0.108435 -3.209 0.00133 **
Parents.and.Children 0.049846 0.104193 0.478 0.63236
Fare 0.000463 0.001934 0.239 0.81084
PortQ -1.446477 0.445468 -3.247 0.00117 **
PortS -0.679066 0.211566 -3.210 0.00133 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1409.99 on 1042 degrees of freedom
Residual deviance: 954.75 on 1034 degrees of freedom
(266 observations deleted due to missingness)
AIC: 972.75
Number of Fisher Scoring iterations: 5
library(ROCR)
# ROC Curve: requires the ggplot2 package.
library(ggplot2, quietly=TRUE)
# Generate an ROC Curve for the glm model on titanicpassengers_ERF.csv [**train**].
crs$pr <- predict(crs$glm,
type = "response",
newdata = crs$dataset[,c(crs$input, crs$target)])
# Remove observations with missing target.
no.miss <- na.omit(crs$dataset[,c(crs$input, crs$target)]$Survived)
miss.list <- attr(no.miss, "na.action")
attributes(no.miss) <- NULL
if (length(miss.list))
{
pred <- prediction(crs$pr[-miss.list], no.miss)
} else
{
pred <- prediction(crs$pr, no.miss)
}
pe <- performance(pred, "tpr", "fpr")
au <- performance(pred, "auc")@y.values[[1]]
pd <- data.frame(fpr=unlist(pe@x.values), tpr=unlist(pe@y.values))
p <- ggplot(pd, aes(x=fpr, y=tpr))
p <- p + geom_line(colour="red")
p <- p + xlab("False Positive Rate") + ylab("True Positive Rate")
p <- p + ggtitle("ROC Curve Linear titanicpassengers_ERF.csv [**train**] Survived")
p <- p + theme(plot.title=element_text(size=10))
p <- p + geom_line(data=data.frame(), aes(x=c(0,1), y=c(0,1)), colour="grey")
p <- p + annotate("text", x=0.50, y=0.00, hjust=0, vjust=0, size=5,
label=paste("AUC =", round(au, 2)))
print(p)
# Calculate the area under the curve for the plot.
# Remove observations with missing target.
no.miss <- na.omit(crs$dataset[,c(crs$input, crs$target)]$Survived)
miss.list <- attr(no.miss, "na.action")
attributes(no.miss) <- NULL
if (length(miss.list))
{
pred <- prediction(crs$pr[-miss.list], no.miss)
} else
{
pred <- prediction(crs$pr, no.miss)
}
performance(pred, "auc")
An object of class "performance"
Slot "x.name":
[1] "None"
Slot "y.name":
[1] "Area under the ROC curve"
Slot "alpha.name":
[1] "none"
Slot "x.values":
list()
Slot "y.values":
[[1]]
[1] 0.8503217
Slot "alpha.values":
list()