library(car)
library(Rcpp) #required package for Amelia
library(gplots) #required package for Amelia
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
library(Amelia) # missing-data visualization (missmap)
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.4, built: 2015-12-05)
## ## Copyright (C) 2005-2017 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
library(ROCR) # gplots, a required package for ROCR, is already loaded above
library(ggplot2) #required package for caret
library(lattice) #required package for caret
library(caret)
This project uses logistic regression to predict the binary outcome of the Challenger mission given the temperature at launch.
The data were collected from 23 shuttle launches prior to the actual Challenger launch in 1986. distress_ct is the dependent variable, indicating the number of O-ring distress events per flight. temperature, field_check_pressure, and flight_num are the independent variables.
#Load the dataset
launch <- read.csv("http://www.sci.csueastbay.edu/~esuess/classes/Statistics_6620/Presentations/ml10/challenger.csv")
#data preview
knitr::kable(head(launch))
| distress_ct | temperature | field_check_pressure | flight_num |
|---|---|---|---|
| 0 | 66 | 50 | 1 |
| 1 | 70 | 50 | 2 |
| 0 | 69 | 50 | 3 |
| 0 | 68 | 50 | 4 |
| 0 | 67 | 50 | 5 |
| 0 | 72 | 50 | 6 |
The shuttle had 6 O-rings responsible for sealing joints of the rocket booster and cold temperatures could make the component more brittle and susceptible to improper sealing.
The correlation between distress_ct and temperature is -0.51, indicating a moderately strong negative linear association. The negative sign implies that an increase in temperature is associated with a decrease in the number of distress events.
cor<-cor(launch$distress_ct,launch$temperature)
cor
## [1] -0.5111264
cor2<-cov(launch$temperature,launch$distress_ct)/(sd(launch$temperature)*sd(launch$distress_ct)) # manual calculation of Pearson's correlation; alternatively, use R's cor()
cor2
## [1] -0.5111264
Next, check whether there are any missing values in the dataset. The Amelia package provides a visual map of missing data per feature.
missmap(launch, main = "Missing values vs observed")
'NA' is sometimes used in a dataset instead of null to represent missing values. sapply() is used to check for 'NA' values; it can also count the number of unique values in each feature. Here, distress_ct takes 3 distinct values, temperature 16, and so on.
# gives number of missing values in each column
sapply(launch,function(x) sum(is.na(x)))
## distress_ct temperature field_check_pressure
## 0 0 0
## flight_num
## 0
# gives number of unique values in each column
sapply(launch, function(x) length(unique(x)))
## distress_ct temperature field_check_pressure
## 3 16 3
## flight_num
## 23
Since distress_ct ranges from 0 to 3, it needs to be converted into a binary outcome in order to fit a logistic regression model.
launch$distress_ct = ifelse(launch$distress_ct<1,0,1)
launch$distress_ct
## [1] 0 1 0 0 0 0 0 0 1 1 1 0 0 1 0 0 0 0 0 0 1 0 1
90% of the observations are used as training data and the remaining 10% as testing data.
# set up training and test data sets
set.seed(123)
indx = sample(1:nrow(launch), as.integer(0.9*nrow(launch)))
indx
## [1] 7 18 9 22 20 1 21 15 17 23 13 6 8 12 19 11 2 16 14 4
launch_train = launch[indx,]
launch_test = launch[-indx,]
launch_train_labels = launch[indx,1]
launch_test_labels = launch[-indx,1]
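As a side note, caret (loaded above) can produce a stratified split that preserves the outcome's class balance. A minimal sketch, where idx2, launch_train2, and launch_test2 are hypothetical names (output not shown):
set.seed(123)
# Sketch: stratified 90/10 split with caret's createDataPartition
idx2 <- createDataPartition(launch$distress_ct, p = 0.9, list = FALSE)
launch_train2 <- launch[idx2, ]
launch_test2 <- launch[-idx2, ]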
Generalized linear models (GLMs) are a broad class of models that includes linear regression, ANOVA, logistic regression, and others, so the family parameter of glm() specifies which model to fit. For logistic regression, the coefficient estimates are on the log-odds scale: the log odds of the response, distress_ct, is modeled as a linear combination of the independent variables (features).
# fit the logistic regression model, with all predictor variables
model <- glm(distress_ct ~.,family=binomial(link='logit'),data=launch_train)
summary(model)
##
## Call:
## glm(formula = distress_ct ~ ., family = binomial(link = "logit"),
## data = launch_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1065 -0.6821 -0.4934 0.3896 1.9623
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 12.987949 8.029206 1.618 0.106
## temperature -0.213503 0.113066 -1.888 0.059 .
## field_check_pressure 0.004371 0.018642 0.234 0.815
## flight_num 0.018237 0.188557 0.097 0.923
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 24.435 on 19 degrees of freedom
## Residual deviance: 17.233 on 16 degrees of freedom
## AIC: 25.233
##
## Number of Fisher Scoring iterations: 5
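Because the estimates are log odds, exponentiating them converts them to odds ratios, which are often easier to interpret. A minimal sketch using the fitted model above (output not shown):
# Sketch: log-odds coefficients as odds ratios, with Wald confidence intervals
exp(cbind(OR = coef(model), confint.default(model)))
For example, exp(-0.2135) is about 0.81, so each additional degree of temperature multiplies the estimated odds of distress by roughly 0.81.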
To better assess the significance of each predictor, anova() with the chi-square test shows how the model fit changes as each predictor is added sequentially; a better-fitting model has lower residual deviance. Here, adding temperature produces the largest decrease in residual deviance relative to the NULL model, while field_check_pressure and flight_num do not decrease the residual deviance significantly further.
anova(model, test="Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: distress_ct
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 19 24.435
## temperature 1 6.7716 18 17.663 0.009262 **
## field_check_pressure 1 0.4201 17 17.243 0.516880
## flight_num 1 0.0095 16 17.233 0.922418
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
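The same conclusion can be reached by fitting the nested models explicitly and passing both to anova(). A minimal sketch, where model_temp and model_full are hypothetical names (output not shown):
# Sketch: direct likelihood-ratio comparison of nested models
model_full <- glm(distress_ct ~ ., family = binomial(link = 'logit'), data = launch_train)
model_temp <- glm(distress_ct ~ temperature, family = binomial(link = 'logit'), data = launch_train)
anova(model_temp, model_full, test = "Chisq")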
Refit and evaluate the model with temperature as the only predictor.
# drop the insignificant predictors, alpha = 0.10
model <- glm(distress_ct ~ temperature,family=binomial(link='logit'),data=launch_train)
summary(model)
##
## Call:
## glm(formula = distress_ct ~ temperature, family = binomial(link = "logit"),
## data = launch_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0777 -0.7629 -0.4259 0.4435 2.1232
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 13.7233 7.1901 1.909 0.0563 .
## temperature -0.2116 0.1044 -2.026 0.0427 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 24.435 on 19 degrees of freedom
## Residual deviance: 17.663 on 18 degrees of freedom
## AIC: 21.663
##
## Number of Fisher Scoring iterations: 5
anova(model, test="Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: distress_ct
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 19 24.435
## temperature 1 6.7716 18 17.663 0.009262 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
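The simplified model can also estimate the probability of distress at a given launch temperature, where type="response" returns a probability. The forecast temperature for the Challenger launch was around 31°F, far below any launch in the data, so this is an extrapolation. A minimal sketch (output not shown):
# Sketch: predicted probability of distress at a 31-degree launch (extrapolation)
predict(model, newdata = data.frame(temperature = 31), type = "response")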
type='response' makes predict() return the probability of distress for each tested observation, a value between 0 and 1. To get a meaningful binary prediction of distress_ct (whether the shuttle would be in distress or not), the probability values need to be converted to 0 or 1: a probability greater than 0.5 is counted as 1, and 0 otherwise.
# check Accuracy
fitted.results <- predict(model,newdata=launch_test,type='response')
fitted.results <- ifelse(fitted.results > 0.5,1,0)
head(fitted.results)
## 3 5 10
## 0 0 1
Calculate the misclassification error rate of the model on the testing data's distress_ct. The accuracy rate is 100%.
misClasificError <- mean(fitted.results != launch_test$distress_ct)
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 1"
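A quick cross-tabulation of predicted against actual labels shows where any errors would fall. A minimal sketch (output not shown):
# Sketch: predicted vs. actual labels on the test set
table(predicted = fitted.results, actual = launch_test$distress_ct)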
Another way to evaluate model performance is the Receiver Operating Characteristic (ROC) curve, a graph that simultaneously displays the true positive rate (sensitivity) and the false positive rate (1 - specificity). The overall performance of a classifier is summarized by the area under the curve (AUC); the larger the AUC, the better the classifier. Here, the ROC curve is a straight line that does not convey a meaningful message about the model's accuracy, probably because the training and testing datasets are too small.
p <- predict(model, newdata=launch_test, type="response")
pr <- prediction(p, launch_test$distress_ct)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
abline(0,1, lwd=2, lty=2)
The AUC value can be extracted using performance(); an AUC of 1 corresponds to a perfect classifier.
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 1
Since this dataset is very small (only 23 observations), bootstrapping is a reasonable way to improve the model by resampling the observations multiple times with replacement.
start.time <- Sys.time()
ctrl <- trainControl(method = "boot632", number = 23, savePredictions = TRUE)
mod_fit <- train(distress_ct ~ ., data=launch,method="glm", family="binomial", trControl = ctrl, tuneLength = 5)
## Warning in train.default(x, y, weights = w, ...): You are trying to do
## regression and your outcome only has two possible values Are you trying to
## do classification? If so, use a 2 level factor as your outcome column.
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
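The first warning appears because distress_ct is stored as a numeric 0/1 column, so caret treats the task as regression. Converting the outcome to a two-level factor would frame it as classification instead. A minimal sketch, where launch_fct and mod_fit2 are hypothetical names:
# Sketch: rerun training with a 2-level factor outcome for classification behavior
launch_fct <- launch
# label the classes so they are valid R names ("no" = 0, "yes" = 1)
launch_fct$distress_ct <- factor(launch_fct$distress_ct, levels = c(0, 1), labels = c("no", "yes"))
mod_fit2 <- train(distress_ct ~ ., data = launch_fct, method = "glm",
                  family = "binomial", trControl = ctrl)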
The values predicted on the testing dataset are probabilities between 0 and 1, so they need to be converted to binary values. 3 out of 3 observations in the testing dataset are correctly predicted.
pred <- predict(mod_fit, newdata=launch_test)
pred = ifelse(pred > 0.5,1,0)
confusionMatrix(data=pred, launch_test_labels)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2 0
## 1 0 1
##
## Accuracy : 1
## 95% CI : (0.2924, 1)
## No Information Rate : 0.6667
## P-Value [Acc > NIR] : 0.2963
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.6667
## Detection Rate : 0.6667
## Detection Prevalence : 0.6667
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : 0
##
The time it takes to run the bootstrapping is about 0.6 seconds.
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
## Time difference of 0.5905659 secs
The AUC is still 1. The prediction accuracy of the original model and the improved model is the same, but bootstrapping makes fuller use of the small dataset by repeatedly resampling it with replacement, which is an advantage given how few observations the dataset contains.
pr <- prediction(pred, launch_test$distress_ct)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 1
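Note that prediction() is given the thresholded 0/1 labels above, so the ROC reflects only a single cutoff. A minimal sketch that passes the raw predicted values instead, where p_boot and pr_boot are hypothetical names (output not shown):
# Sketch: AUC from the bootstrapped model's raw (unthresholded) predictions
p_boot <- predict(mod_fit, newdata = launch_test)
pr_boot <- prediction(p_boot, launch_test$distress_ct)
performance(pr_boot, measure = "auc")@y.values[[1]]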