Libraries Used

library(car)
library(Rcpp)   #required package for Amelia
library(gplots) #required package for Amelia
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
library(Amelia) #missing-data visualization
## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.4, built: 2015-12-05)
## ## Copyright (C) 2005-2017 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
library(ROCR)   #gplots, already loaded above, is also required for ROCR
library(ggplot2) #required package for caret
library(lattice) #required package for caret
library(caret)

Objective

This project uses logistic regression to predict the binary distress outcome of the Challenger shuttle mission given the temperature at launch.

Step 1: Data Exploration

The data were collected from 23 successful shuttle launches prior to the Challenger launch in 1986. distress_ct is the dependent variable, indicating the number of distress events, such as O-ring failures that could lead to shuttle disintegration. temperature, field_check_pressure, and flight_num are the independent variables.

#Load the dataset
launch <- read.csv("http://www.sci.csueastbay.edu/~esuess/classes/Statistics_6620/Presentations/ml10/challenger.csv")

#data preview
knitr::kable(head(launch))

| distress_ct| temperature| field_check_pressure| flight_num|
|-----------:|-----------:|--------------------:|----------:|
|           0|          66|                   50|          1|
|           1|          70|                   50|          2|
|           0|          69|                   50|          3|
|           0|          68|                   50|          4|
|           0|          67|                   50|          5|
|           0|          72|                   50|          6|

The shuttle had 6 O-rings responsible for sealing the joints of the rocket boosters, and cold temperatures could make these components more brittle and susceptible to improper sealing.

The correlation between distress_ct and temperature is -0.51, which indicates a moderately strong negative linear association. The negative sign implies that an increase in temperature is associated with a decrease in the number of distress events.

cor1<-cor(launch$distress_ct,launch$temperature) #avoid masking base R's cor() function
cor1
## [1] -0.5111264
cor2<-cov(launch$temperature,launch$distress_ct)/(sd(launch$temperature)*sd(launch$distress_ct)) #manual calculation of Pearson's correlation; alternatively, use R's cor()
cor2
## [1] -0.5111264

Next, check whether there are any missing values in the dataset. The Amelia package provides a visual map of missing data for each feature.

missmap(launch, main = "Missing values vs observed")

'NA' is sometimes used in a dataset instead of null to represent missing values. sapply() is used to check for NA values; it can also be used to count the number of unique values in each feature. Here, distress_ct takes 3 distinct values, temperature takes 16, and so on.

# gives number of missing values in each column
sapply(launch,function(x) sum(is.na(x)))
##          distress_ct          temperature field_check_pressure 
##                    0                    0                    0 
##           flight_num 
##                    0
# gives number of unique values in each column
sapply(launch, function(x) length(unique(x)))
##          distress_ct          temperature field_check_pressure 
##                    3                   16                    3 
##           flight_num 
##                   23

Step 2: Logistic Regression

Since distress_ct is a count of distress events rather than a binary variable, it needs to be converted into a binary outcome (0 = no distress events, 1 = at least one) in order to fit a logistic regression model.

launch$distress_ct = ifelse(launch$distress_ct<1,0,1)
launch$distress_ct
##  [1] 0 1 0 0 0 0 0 0 1 1 1 0 0 1 0 0 0 0 0 0 1 0 1

90% of the observations are used as training data and the remaining 10% as testing data.

# set up training and test data sets
set.seed(123)
indx = sample(1:nrow(launch), as.integer(0.9*nrow(launch)))
indx
##  [1]  7 18  9 22 20  1 21 15 17 23 13  6  8 12 19 11  2 16 14  4
launch_train = launch[indx,]
launch_test = launch[-indx,]

launch_train_labels = launch[indx,1]
launch_test_labels = launch[-indx,1]   

Generalized linear models (GLMs) are a broad class of models that includes linear regression, ANOVA, logistic regression, and others, so the family parameter of glm() specifies which model to fit. The coefficient estimates from the model are log odds of the response variable, distress_ct: in a logistic GLM, the log odds of the dependent variable (class) are modeled as a linear combination of the independent variables (features).
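
In formula terms, with p denoting the probability that distress_ct = 1, the model being fit is:

log(p / (1 - p)) = b0 + b1*temperature + b2*field_check_pressure + b3*flight_num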

# fit the logistic regression model, with all predictor variables
model <- glm(distress_ct ~.,family=binomial(link='logit'),data=launch_train)
summary(model)
## 
## Call:
## glm(formula = distress_ct ~ ., family = binomial(link = "logit"), 
##     data = launch_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1065  -0.6821  -0.4934   0.3896   1.9623  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)  
## (Intercept)          12.987949   8.029206   1.618    0.106  
## temperature          -0.213503   0.113066  -1.888    0.059 .
## field_check_pressure  0.004371   0.018642   0.234    0.815  
## flight_num            0.018237   0.188557   0.097    0.923  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 24.435  on 19  degrees of freedom
## Residual deviance: 17.233  on 16  degrees of freedom
## AIC: 25.233
## 
## Number of Fisher Scoring iterations: 5
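
Because the estimates are on the log-odds scale, exponentiating them gives odds ratios, which are easier to interpret. A minimal sketch using the fitted model above:

# odds ratios: multiplicative change in the odds of distress
# per one-unit increase in each predictor
exp(coef(model))

For temperature, exp(-0.2135) is about 0.81, i.e., each additional degree Fahrenheit is associated with roughly a 19% decrease in the odds of a distress event.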

To better assess the significance of each predictor, anova() with the chi-square test option is used to see how the model fit changes as each predictor is added. The better model is the one with the lower residual deviance. Here, adding temperature produces the greatest decrease in residual deviance relative to the NULL model, while adding field_check_pressure and flight_num does not significantly decrease the residual deviance beyond temperature.

anova(model, test="Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: distress_ct
## 
## Terms added sequentially (first to last)
## 
## 
##                      Df Deviance Resid. Df Resid. Dev Pr(>Chi)   
## NULL                                    19     24.435            
## temperature           1   6.7716        18     17.663 0.009262 **
## field_check_pressure  1   0.4201        17     17.243 0.516880   
## flight_num            1   0.0095        16     17.233 0.922418   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
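
The chi-square p-value for temperature can be reproduced by hand from the deviance drop shown in the table:

# likelihood-ratio test by hand: deviance drop from NULL when adding temperature;
# reproduces the Pr(>Chi) value of 0.009262 shown above
pchisq(24.435 - 17.663, df = 1, lower.tail = FALSE)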

Refit and evaluate the model with temperature as the only predictor.

# drop the insignificant predictors, alpha = 0.10
model <- glm(distress_ct ~ temperature,family=binomial(link='logit'),data=launch_train)
summary(model)
## 
## Call:
## glm(formula = distress_ct ~ temperature, family = binomial(link = "logit"), 
##     data = launch_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0777  -0.7629  -0.4259   0.4435   2.1232  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  13.7233     7.1901   1.909   0.0563 .
## temperature  -0.2116     0.1044  -2.026   0.0427 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 24.435  on 19  degrees of freedom
## Residual deviance: 17.663  on 18  degrees of freedom
## AIC: 21.663
## 
## Number of Fisher Scoring iterations: 5
anova(model, test="Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: distress_ct
## 
## Terms added sequentially (first to last)
## 
## 
##             Df Deviance Resid. Df Resid. Dev Pr(>Chi)   
## NULL                           19     24.435            
## temperature  1   6.7716        18     17.663 0.009262 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
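
As a sanity check of what the single-predictor model implies, we can plug in a cold launch temperature. The 31 degrees F used below is the widely reported temperature at the actual 1986 launch, not a value from this dataset:

# predicted probability of distress at 31 degrees F (illustrative input)
predict(model, newdata = data.frame(temperature = 31), type = "response")
# equivalently plogis(13.7233 - 0.2116 * 31), approximately 0.999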

Step 3: Evaluating Model Performance

type='response' makes predict() return the predicted probability, between 0 and 1, for each test observation. To get a meaningful binary prediction of distress_ct (whether the shuttle would experience distress or not), these probabilities need to be converted to 0 or 1. Hence, if the predicted probability is greater than 0.5, it is counted as 1, and 0 otherwise.

# check Accuracy
fitted.results <- predict(model,newdata=launch_test,type='response')
fitted.results <- ifelse(fitted.results > 0.5,1,0)
head(fitted.results)
##  3  5 10 
##  0  0  1

Calculate the error rate of the model's predictions of distress_ct on the testing data. The accuracy rate is 100%.

misClassError <- mean(fitted.results != launch_test$distress_ct)
print(paste('Accuracy', 1 - misClassError))
## [1] "Accuracy 1"

Another way to evaluate model performance is the Receiver Operating Characteristic (ROC) curve, a graph that simultaneously displays the true positive rate (sensitivity) and the false positive rate (1 - specificity) across classification thresholds. The overall performance of a classifier is summarized by the area under the curve (AUC); the larger the AUC, the better the classifier. Here, the ROC curve is a straight line that does not convey a meaningful picture of model accuracy, probably because the training and testing datasets are too small.

p <- predict(model, newdata=launch_test, type="response")
pr <- prediction(p, launch_test$distress_ct)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
abline(0,1, lwd=2, lty=2)

The AUC value can be extracted using performance(); an AUC of 1 corresponds to a perfect classifier.

auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 1

Step 4: Model Improvement Using Bootstrapping

Since this dataset is very small (only 23 observations), bootstrapping is a reasonable way to improve the model by resampling the observations multiple times with replacement.

start.time <- Sys.time()
ctrl <- trainControl(method = "boot632", number = 23, savePredictions = TRUE)
mod_fit <- train(distress_ct ~ ., data=launch,method="glm", family="binomial", trControl = ctrl, tuneLength = 5)
## Warning in train.default(x, y, weights = w, ...): You are trying to do
## regression and your outcome only has two possible values Are you trying to
## do classification? If so, use a 2 level factor as your outcome column.
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
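
The first warning appears because distress_ct is numeric, so caret treats the task as regression. One way to avoid it, sketched here with an illustrative factor recoding (the rest of the analysis keeps the numeric outcome):

# recode the outcome as a two-level factor so caret runs classification
launch_f <- launch
launch_f$distress_ct <- factor(launch_f$distress_ct, levels = c(0, 1),
                               labels = c("no", "yes"))
mod_fit_f <- train(distress_ct ~ ., data = launch_f, method = "glm",
                   family = "binomial", trControl = ctrl)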

The prediction values for the testing dataset are probabilities between 0 and 1, so they need to be converted into binary values. All 3 observations in the testing dataset are correctly predicted.

pred <- predict(mod_fit, newdata=launch_test)
pred = ifelse(pred > 0.5,1,0)
confusionMatrix(data=pred, launch_test_labels)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 2 0
##          1 0 1
##                                      
##                Accuracy : 1          
##                  95% CI : (0.2924, 1)
##     No Information Rate : 0.6667     
##     P-Value [Acc > NIR] : 0.2963     
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.6667     
##          Detection Rate : 0.6667     
##    Detection Prevalence : 0.6667     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : 0          
## 
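
Note: more recent versions of caret require both arguments of confusionMatrix() to be factors with the same levels; if the call above errors, wrapping both in factor() is a reasonable fix:

confusionMatrix(data = factor(pred, levels = c(0, 1)),
                reference = factor(launch_test_labels, levels = c(0, 1)))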

The time it takes to run the bootstrapping step is about 0.6 seconds.

end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
## Time difference of 0.5905659 secs

AUC is still 1. The prediction accuracy of the original model and the improved model is the same, but bootstrapping has the advantage of reusing the small sample through repeated resampling with replacement, which yields more stable estimates of model performance. That makes it the better approach here, since the number of observations in this dataset is small.

pr <- prediction(pred, launch_test$distress_ct)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 1
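
A caveat: the AUC above is computed from the thresholded 0/1 predictions rather than the raw model output, which makes the ROC curve degenerate. A sketch using the unthresholded predictions instead:

# use raw predicted values (before the 0.5 cutoff) so the ROC reflects
# the full ranking of test observations
p_boot <- predict(mod_fit, newdata = launch_test)
pr_raw <- prediction(p_boot, launch_test$distress_ct)
performance(pr_raw, measure = "auc")@y.values[[1]]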