The dataset contains information of the Challenger space shuttle that was crashed soon after launched in January 28, 1986. It was originally presented in the 1989 Journal of American Statistical Association: Risk analysis of the space shuttle: pre-Challenger prediction of failure.
The dataset contains 23 observations with information of 23 previous successful shuttle launches respectively prior to the actual launch in 1986. There are four features (columns) with distress_ct representing the dependent outcome for number of distress events such as shuttle disintegration; and temperature, field_check_pressure and fight_num as independent predictors, in which temperature, the pressure and the shuttle life that is proportional to its flight number are assumed to be affecting its distress events. All features are numerical.
launch = read.csv('http://www.sci.csueastbay.edu/~esuess/classes/Statistics_6620/Presentations/ml10/challenger.csv')
str(launch)
## 'data.frame': 23 obs. of 4 variables:
## $ distress_ct : int 0 1 0 0 0 0 0 0 1 1 ...
## $ temperature : int 66 70 69 68 67 72 73 70 57 63 ...
## $ field_check_pressure: int 50 50 50 50 50 50 100 100 200 200 ...
## $ flight_num : int 1 2 3 4 5 6 7 8 9 10 ...
Since a logistics regression analysis will be performed on the data, in which required the dependent outcome to be a binomial result for either 0 or 1. It is reasonable that the distress_ct to be either yes or no for the final distress value. The original distress_ct outcome ranging from 0 to 3 is then converted using an ifelse() function to be either 0: there’s not a distress event, or 1: there is a distress event. The original values in the distress_ct vector that are equal or greater then 1 would be converted to 1 to imply it was an distress event (regardless of the number events), and everything else that is less than 1 will be convert to 0 as to imply it was not a distress event.
launch$distress_ct = ifelse(launch$distress_ct<1,0,1)
launch$distress_ct
## [1] 0 1 0 0 0 0 0 0 1 1 1 0 0 1 0 0 0 0 0 0 1 0 1
This is a small dataset with only 23 observations, and thus it is appropriated to randomly take 90% of the overall dataset observations to be the training data and leave the rest of 10% of the overall data to be the tested data. Similarly, since the first column is the target variable/outcome and will be used for comparison with the prediction from the logistics regression model, it is splitted in a similar fashion proportional the the trained and tested dataset.
set.seed(123)
indx = sample(1:nrow(launch), as.integer(0.9*nrow(launch)))
indx
## [1] 7 18 9 22 20 1 21 15 17 23 13 6 8 12 19 11 2 16 14 4
launch_train = launch[indx,]
launch_test = launch[-indx,]
launch_train_labels = launch[indx,1]
launch_test_labels = launch[-indx,1]
Using the Amelia package and its missmap() function, it provides a visual image of whether or how much missing data for each features of the dataset. Fortunately, the challenger dataset is small enough with none missing data that should be worried for data cleaning before modeling.
library(Amelia)
## Warning: package 'Amelia' was built under R version 3.3.3
## Loading required package: Rcpp
## Warning: package 'Rcpp' was built under R version 3.3.3
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.4, built: 2015-12-05)
## ## Copyright (C) 2005-2017 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
missmap(launch, main = "Missing values vs observed")
A sapply() funcion is used to specifically look for any ‘NA’ values or numbers of unique values in each feature of the dataset. Based on the result below, there’s no missing value in any features, and the number of unique values is presented for each feature in the second part. It shows that there’s only two different values for distress_ct (that were converted to be 0 or 1 perviously), 3 different values for pressue, and each observation contain different value in the flight_num.
sapply(launch,function(x) sum(is.na(x)))
## distress_ct temperature field_check_pressure
## 0 0 0
## flight_num
## 0
sapply(launch, function(x) length(unique(x)))
## distress_ct temperature field_check_pressure
## 2 16 3
## flight_num
## 23
A glm() funcation can be used for various types of regression analysis. However, defining the family parameter to be ‘logit’ in the binomial() function, it would be modeled in a logistic regression fashion. the glm() builds a logistic regression model with the training data, using distress_ct feature as the dependent response, and the ‘.’ indicates using the rest of the features in the dataset as independent predictors for the model.
Simply typing the object name can give information of the model parameter coefficients estimates, while using a summary() function for the model object can provide detail information includes null & residuals deviance, AIC value (for model selection), and residuals deviance range, as well as coefficient estimates, standard error of coefficient estimates, and p-values for each features. The coefficients estimats indicates the log odds of the outcome (distress_ct) for each unit increase of the predictor variable, since only the log odds of the dependent variable is a linear combination of the all independent variable/features.
Based on this logistic regression model, one unit increase in temperature would be resulted in the creased in log odds of 0.39 for distress_ct level; one unit increased in pressure would be resulted in the increased in log odds of 0.027 for distress_ct level, and one unit increased in flight_num would be resulted in the decreased in the log odds of 0.28 for distress_ct level.
model <- glm(distress_ct ~.,family=binomial(link='logit'),data=launch_train)
model
##
## Call: glm(formula = distress_ct ~ ., family = binomial(link = "logit"),
## data = launch_train)
##
## Coefficients:
## (Intercept) temperature field_check_pressure
## 12.987949 -0.213503 0.004371
## flight_num
## 0.018237
##
## Degrees of Freedom: 19 Total (i.e. Null); 16 Residual
## Null Deviance: 24.43
## Residual Deviance: 17.23 AIC: 25.23
summary(model)
##
## Call:
## glm(formula = distress_ct ~ ., family = binomial(link = "logit"),
## data = launch_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1065 -0.6821 -0.4934 0.3896 1.9623
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 12.987949 8.029206 1.618 0.106
## temperature -0.213503 0.113066 -1.888 0.059 .
## field_check_pressure 0.004371 0.018642 0.234 0.815
## flight_num 0.018237 0.188557 0.097 0.923
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 24.435 on 19 degrees of freedom
## Residual deviance: 17.233 on 16 degrees of freedom
## AIC: 25.233
##
## Number of Fisher Scoring iterations: 5
Furthermore, using the anova() function for the logistic regression model along with a ‘chi-square’ test paramter allows to compared null and residuals (error) deviance for each predictor term and see how fit the model is by adding each term. The null model contains only the intercept, and the wider the gap in residual deviance value between two preditor terms is better, in which it indicates a greater decreased in residual deviance and thus confirm a statistical significance of adding that term.
Here, adding the temperature predictor has the greatest decrease in residual deviance compared to the null model (only intercept), and its p-value shows that the temperature predictor is significant at the 0.05 level and that adding it will greatly decrease the residual deviance and make it a better model. On the other hand, adding both pressure or the flight_num features do not decreased residuals deviance a lot as compared to the one with temperature, and their p-values both indicates whether adding these two terms will not affect much of the residual deviance, and thus would not make the model any better.
anova(model, test="Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: distress_ct
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 19 24.435
## temperature 1 6.7716 18 17.663 0.009262 **
## field_check_pressure 1 0.4201 17 17.243 0.516880
## flight_num 1 0.0095 16 17.233 0.922418
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The logistic regression model is applied on the tested dataset to make prediction for the distress_ct feature outcome, with a type = ‘response’ will set the final prediction to be a probability value for each tested observation between 0 to 1. However, since the purpose of the logistic regression is to make binomial prediction for the distress_ct outcome – whether the shuttle would be undergone to distress (1) or not (0). The fitted.results, which is a vector contains the probability for distress_ct, would be then manually converted into 0 or 1, with probability > 0.5 being 1 and probabaility < 0.5 being 0 for the distress_ct outcome.
fitted.results <- predict(model,newdata=launch_test,type='response')
head(fitted.results)
## 3 5 10
## 0.1868595 0.2675543 0.6442206
fitted.results <- ifelse(fitted.results > 0.5,1,0)
head(fitted.results)
## 3 5 10
## 0 0 1
Here, the vector of prediction on the tested data distress_ct from the model is compared to the actual binomial value of the distress_ct in the tested data. The error rate is calculated by dividing the number of wrong prediction (when the predicted value is not the same compared to the actual distress_ct value in the tested dataset) over the total observations in the tested data. And the accuracy is later calculated by 1 - error rate.
The error rate for binomial distress_ct prediciton is 0%, and the accuracy rate for it is 100%.
misClasificError <- mean(fitted.results != launch_test$distress_ct)
misClasificError
## [1] 0
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 1"
Using the ROCR package to generate ROC (Receiver Operating Characteristic) curve to plot the true positive rate (TPR, aka the sensitivity) against the false positive rate (FPR, aka 1- specificity) of the classifier of the target feature (the distress_ct). The ROC curve is showing a very wired line which does not convey any message of the data at all; it shows that it can only goes up to 0.5 TPR at 0 FPR, and consistently being 0.5 TPR for FPR ranging from 0 to 1 before the TPR can go up again. The problem associated with such bad ROC curve maybe due to this extremely small trained and tested dataset with only 20 and 3 observations, respectively.
library(ROCR)
## Warning: package 'ROCR' was built under R version 3.3.3
## Loading required package: gplots
## Warning: package 'gplots' was built under R version 3.3.3
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
p <- predict(model, newdata=launch_test, type="response")
pr <- prediction(p, launch_test$distress_ct)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
abline(0,1, lwd = 2, lty = 2)
Again, using performance() function to the prediction object, pr, it performs a predictor evalutation on the ROC curve to examine how good the model is in predicitng the classifier using the logistic regression model. The AUC (area under the curve) of the ROC curve is 1, which is equivalent to a perfect classifier, and it indicates that this is an outstanding model at identifying positive values.
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 1
Although the current model performs well, but one model improvement is the used of bootstrapping on the original data. Due to the fact of a small dataset to begin with for 23 observations, it is reasonable to use bootstrapping to resampled the same number of observations multiple times from the original dataset with replacement in order to estimate the properties of a larger set that is representative to the original dataset.
Using the traincontrol() function from the caret package with method = ‘boot632’ for bootstrapping as the train control for training the model. Everything else is the same after the training step. Again, the values that are predicted from the tested dataset using the trained model is in probability value between 0 to 1, and will need to be converted into a binomial value 0 or 1 using ifelse() statement. The confusionMatrix() shows that 3 out of 3 observations in the tested dataset are correclty predicted, and the accuracy for the model is thus 100%.
library(caret)
## Warning: package 'caret' was built under R version 3.3.3
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 3.3.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.3.3
start.time <- Sys.time()
ctrl <- trainControl(method = "boot632", number = 23, savePredictions = TRUE)
mod_fit <- train(distress_ct ~ ., data=launch, method="glm", family="binomial",
trControl = ctrl, tuneLength = 5)
## Warning in train.default(x, y, weights = w, ...): You are trying to do
## regression and your outcome only has two possible values Are you trying to
## do classification? If so, use a 2 level factor as your outcome column.
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
pred <- predict(mod_fit, newdata=launch_test)
pred
## 3 5 10
## 0.1244379 0.1697841 0.7536046
pred = ifelse(pred > 0.5,1,0)
confusionMatrix(data=pred, launch_test_labels)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2 0
## 1 0 1
##
## Accuracy : 1
## 95% CI : (0.2924, 1)
## No Information Rate : 0.6667
## P-Value [Acc > NIR] : 0.2963
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.6667
## Detection Rate : 0.6667
## Detection Prevalence : 0.6667
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : 0
##
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
## Time difference of 0.7642701 secs
The ROC curve for the bootstrapping train control for the model is also the same as a perfect classifier. This also indicates that using a bootstrapping will ensure a good model that identifies positive future values.
pr <- prediction(pred, launch_test$distress_ct)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
abline(0,1, lwd = 2, lty = 2)
The AUC of the ROC curve for the bootstrapping train controled model is 1. And it indicates that using bootstrapping method modeling will make a outstanding model at identifying future positive values.
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 1
Conclusion: Logistic regression is applied on the data of the catastrophic challenger space shuttle incident. Because the dataset is very small containing only 23 observations, a model is first built by the logistic regression, and were then re-modeled using the boostrapping method for model improvement. The model alone or with bootstrapping method are just as good. But a better model can be always generated with a larger dataset.