The data has been collected and is ready to be analyzed.
launch <- read.csv("http://www.sci.csueastbay.edu/~esuess/classes/Statistics_6620/Presentations/ml10/challenger.csv")
# examine the launch data
str(launch)
'data.frame': 23 obs. of 4 variables:
$ distress_ct : int 0 1 0 0 0 0 0 0 1 1 ...
$ temperature : int 66 70 69 68 67 72 73 70 57 63 ...
$ field_check_pressure: int 50 50 50 50 50 50 100 100 200 200 ...
$ flight_num : int 1 2 3 4 5 6 7 8 9 10 ...
First, recode the distress_ct variable into 0 and 1, with 1 representing at least one failure during a launch.
launch$distress_ct = ifelse(launch$distress_ct<1,0,1)
launch$distress_ct
[1] 0 1 0 0 0 0 0 0 1 1 1 0 0 1 0 0 0 0 0 0 1 0 1
Set up training and test data sets.
indx = sample(1:nrow(launch), as.integer(0.9*nrow(launch)))
indx # randomized row indices; 90% of the rows go to the training set
[1] 7 11 22 21 3 18 17 14 16 20 5 1 6 2 12 19 10 4 23 15
launch_train = launch[indx,]
launch_test = launch[-indx,]
launch_train_labels = launch[indx,1] # labels from the first column: distress_ct is the categorical dependent variable
launch_test_labels = launch[-indx,1]
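Because the split is random, a small test set may end up without both classes, which matters for the ROC code at the end of this analysis. A quick sanity check on the label distributions (calling set.seed() before the sample() call above would also make the split reproducible):
table(launch_train_labels)  # counts of 0/1 among the training labels
table(launch_test_labels)   # counts of 0/1 among the test labels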
Check whether there are any missing values:
library(Amelia)
Loading required package: Rcpp
package ‘Rcpp’ was built under R version 3.3.2
##
## Amelia II: Multiple Imputation
## (Version 1.7.4, built: 2015-12-05)
## Copyright (C) 2005-2017 James Honaker, Gary King and Matthew Blackwell
## Refer to http://gking.harvard.edu/amelia/ for more information
##
missmap(launch, main = "Missing values vs observed")
Number of missing values in each column
sapply(launch,function(x) sum(is.na(x)))
distress_ct temperature field_check_pressure flight_num
          0           0                    0          0
Number of unique values in each column
sapply(launch, function(x) length(unique(x)))
distress_ct temperature field_check_pressure flight_num
          3          16                    3         23
Fit the logistic regression model with all predictor variables.
model <- glm(distress_ct ~.,family=binomial(link='logit'),data=launch_train)
model
Call: glm(formula = distress_ct ~ ., family = binomial(link = "logit"),
data = launch_train)
Coefficients:
 (Intercept)  temperature  field_check_pressure  flight_num
   11.307353    -0.196882              0.007464    0.017586
Degrees of Freedom: 19 Total (i.e. Null); 16 Residual
Null Deviance: 24.43
Residual Deviance: 17.62 AIC: 25.62
summary(model)
Call:
glm(formula = distress_ct ~ ., family = binomial(link = "logit"),
data = launch_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.1239 -0.6218 -0.4997 0.4215 2.0906
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 11.307353 7.567090 1.494 0.1351
temperature -0.196882 0.108010 -1.823 0.0683 .
field_check_pressure 0.007464 0.017755 0.420 0.6742
flight_num 0.017586 0.182003 0.097 0.9230
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 24.435 on 19 degrees of freedom
Residual deviance: 17.617 on 16 degrees of freedom
AIC: 25.617
Number of Fisher Scoring iterations: 4
anova(model, test="Chisq")
Analysis of Deviance Table
Model: binomial, link: logit
Response: distress_ct
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 19 24.435
temperature 1 5.7525 18 18.682 0.01646 *
field_check_pressure 1 1.0557 17 17.626 0.30421
flight_num 1 0.0094 16 17.617 0.92263
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Only temperature is significant in the glm and anova output.
Drop the insignificant predictors (alpha = 0.10).
model <- glm(distress_ct~temperature,family=binomial(link='logit'),data=launch_train)
model
Call: glm(formula = distress_ct ~ temperature, family = binomial(link = "logit"),
data = launch_train)
Coefficients:
(Intercept) temperature
12.4677 -0.1951
Degrees of Freedom: 19 Total (i.e. Null); 18 Residual
Null Deviance: 24.43
Residual Deviance: 18.68 AIC: 22.68
summary(model)
Call:
glm(formula = distress_ct ~ temperature, family = binomial(link = "logit"),
data = launch_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.0092 -0.8104 -0.4453 0.5260 2.1326
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 12.4677 6.8659 1.816 0.0694 .
temperature -0.1951 0.1009 -1.934 0.0531 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 24.435 on 19 degrees of freedom
Residual deviance: 18.682 on 18 degrees of freedom
AIC: 22.682
Number of Fisher Scoring iterations: 5
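To interpret the fitted coefficients on the probability scale, the linear predictor can be pushed through the logistic link by hand. A minimal sketch using the estimates reported above (the temperature of 65 degrees F is only an illustrative input, not a value taken from the analysis):
b0 <- 12.4677             # intercept estimate from the summary above
b1 <- -0.1951             # temperature coefficient from the summary above
temp <- 65                # illustrative launch temperature in degrees F
log_odds <- b0 + b1 * temp
1 / (1 + exp(-log_odds))  # logistic link: estimated P(distress), roughly 0.45 here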
anova(model, test="Chisq")
Analysis of Deviance Table
Model: binomial, link: logit
Response: distress_ct
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 19 24.435
temperature 1 5.7525 18 18.682 0.01646 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
From the anova test, adding temperature significantly reduces the deviance (p = 0.016), so the reduced model fits better than the null model.
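The p-value in the anova table is just the upper-tail chi-square probability of the drop in deviance from the null model, and it can be reproduced by hand from the deviances reported above:
pchisq(24.435 - 18.682, df = 1, lower.tail = FALSE)  # drop of about 5.75 on 1 df, p approximately 0.0165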
Check Accuracy
fitted.results <- predict(model,newdata=launch_test,type='response')
fitted.results <- ifelse(fitted.results > 0.5,1,0)
misClassificError <- mean(fitted.results != launch_test$distress_ct)
print(paste('Accuracy', 1 - misClassificError))
[1] "Accuracy 1"
The misclassification error is 0, which indicates the model performs well on this test set (though with so few test observations, this accuracy estimate should be taken with caution).
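Beyond the single accuracy number, a confusion matrix shows how the predictions break down against the true labels on the test set:
table(predicted = fitted.results, actual = launch_test$distress_ct)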
ROC Method: Because this data set is so small, it is possible that the test data set does not contain both 0 and 1 values. If that happens, the code below will not run. And since the test data set is so small, the ROC curve is not very informative here, but the code is provided:
library(ROCR)
p <- predict(model, newdata=launch_test, type="response")
pr <- prediction(p, launch_test$distress_ct)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
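One way to guard against the single-class problem mentioned above is to check the test labels before building the ROC curve; a minimal sketch:
if (length(unique(launch_test$distress_ct)) == 2) {
  library(ROCR)
  p <- predict(model, newdata = launch_test, type = "response")
  pr <- prediction(p, launch_test$distress_ct)
  plot(performance(pr, measure = "tpr", x.measure = "fpr"))  # ROC curve
} else {
  message("Test set contains only one class; the ROC curve cannot be computed.")
}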