The data have been collected and are ready to be analyzed.
launch <- read.csv("http://www.sci.csueastbay.edu/~esuess/classes/Statistics_6620/Presentations/ml10/challenger.csv")
# examine the launch data
str(launch)
'data.frame': 23 obs. of 4 variables:
$ distress_ct : int 0 1 0 0 0 0 0 0 1 1 ...
$ temperature : int 66 70 69 68 67 72 73 70 57 63 ...
$ field_check_pressure: int 50 50 50 50 50 50 100 100 200 200 ...
$ flight_num : int 1 2 3 4 5 6 7 8 9 10 ...
First, recode the distress_ct variable as 0 or 1, where 1 represents at least one failure during a launch.
launch$distress_ct = ifelse(launch$distress_ct<1,0,1)
launch$distress_ct
[1] 0 1 0 0 0 0 0 0 1 1 1 0 0 1 0 0 0 0 0 0 1 0 1
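The recoding rule can be checked on a small vector. This is a minimal sketch that assumes a made-up count vector, counts, rather than the actual launch data:

```r
# Hypothetical distress counts, for illustration only (not the launch data)
counts <- c(0, 1, 0, 2, 3, 0)

# Any count of 1 or more becomes 1; zero stays 0
recoded <- ifelse(counts < 1, 0, 1)

# as.integer(counts > 0) gives the same 0/1 coding
recoded_alt <- as.integer(counts > 0)

recoded
```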
Set up training and test data sets
indx = sample(1:nrow(launch), as.integer(0.9*nrow(launch)))
indx # randomly sampled row indices covering 90% of the data
[1] 22 16 14 12 18 1 23 9 3 4 21 15 19 13 17 7 11 2 6 20
launch_train = launch[indx,]
launch_test = launch[-indx,]
launch_train_labels = launch[indx,1] # labels come from the first column: distress_ct is the categorical dependent variable
launch_test_labels = launch[-indx,1]
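Because sample() draws different rows on every run, the split above changes each time the code is executed. A possible way to make it repeatable is to set a seed first. The sketch below uses a toy data frame and an arbitrary seed, both of which are assumptions for illustration:

```r
# Toy data frame standing in for launch (hypothetical values)
toy <- data.frame(y = rep(c(0, 1), length.out = 23), x = 1:23)

set.seed(123)  # arbitrary seed, chosen only so the split is repeatable
idx <- sample(seq_len(nrow(toy)), as.integer(0.9 * nrow(toy)))

toy_train <- toy[idx, ]
toy_test  <- toy[-idx, ]

nrow(toy_train)  # 20 rows, since as.integer(0.9 * 23) is 20
nrow(toy_test)   # the remaining 3 rows
```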
Check whether there are any missing values:
library(Amelia)
missmap(launch, main = "Missing values vs observed")
Number of missing values in each column
sapply(launch,function(x) sum(is.na(x)))
distress_ct temperature field_check_pressure flight_num
0 0 0 0
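The same per-column count can also be obtained with base R's colSums(). A short sketch on a made-up data frame that does contain missing values:

```r
# Hypothetical data frame with two missing values, for illustration only
toy <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))

# Per-column count of missing values
na_counts <- colSums(is.na(toy))
na_counts

# TRUE if any value in the data frame is missing
anyNA(toy)
```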
Number of unique values in each column
sapply(launch, function(x) length(unique(x)))
distress_ct temperature field_check_pressure flight_num
2 16 3 23
Fit the logistic regression model with all predictor variables
model <- glm(distress_ct ~.,family=binomial(link='logit'),data=launch_train)
model
Call: glm(formula = distress_ct ~ ., family = binomial(link = "logit"),
data = launch_train)
Coefficients:
(Intercept) temperature field_check_pressure flight_num
12.815439 -0.212388 0.004734 0.021998
Degrees of Freedom: 19 Total (i.e. Null); 16 Residual
Null Deviance: 24.43
Residual Deviance: 17.19 AIC: 25.19
summary(model)
Call:
glm(formula = distress_ct ~ ., family = binomial(link = "logit"),
data = launch_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.1195 -0.6415 -0.4842 0.3850 1.9638
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 12.815439 8.023342 1.597 0.1102
temperature -0.212388 0.113289 -1.875 0.0608 .
field_check_pressure 0.004734 0.018554 0.255 0.7986
flight_num 0.021998 0.188979 0.116 0.9073
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 24.435 on 19 degrees of freedom
Residual deviance: 17.189 on 16 degrees of freedom
AIC: 25.189
Number of Fisher Scoring iterations: 5
anova(model, test="Chisq")
Analysis of Deviance Table
Model: binomial, link: logit
Response: distress_ct
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 19 24.435
temperature 1 6.6572 18 17.777 0.009875 **
field_check_pressure 1 0.5742 17 17.203 0.448577
flight_num 1 0.0138 16 17.189 0.906545
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Only temperature is significant in the glm and anova output.
Drop the predictors that are not significant at alpha = 0.10
model <- glm(distress_ct~temperature,family=binomial(link='logit'),data=launch_train)
model
Call: glm(formula = distress_ct ~ temperature, family = binomial(link = "logit"),
data = launch_train)
Coefficients:
(Intercept) temperature
13.535 -0.209
Degrees of Freedom: 19 Total (i.e. Null); 18 Residual
Null Deviance: 24.43
Residual Deviance: 17.78 AIC: 21.78
summary(model)
Call:
glm(formula = distress_ct ~ temperature, family = binomial(link = "logit"),
data = launch_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.0690 -0.7770 -0.4267 0.4543 2.1226
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 13.5354 7.1173 1.902 0.0572 .
temperature -0.2090 0.1035 -2.019 0.0434 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 24.435 on 19 degrees of freedom
Residual deviance: 17.777 on 18 degrees of freedom
AIC: 21.777
Number of Fisher Scoring iterations: 5
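The coefficients are on the log-odds scale, so they are easier to read after transformation. The sketch below copies the two estimates from the summary output and derives the odds ratio, the predicted probability, and the temperature at which the predicted probability is 0.5. The names b0, b1, and p_distress, and the example temperatures, are chosen here for illustration:

```r
# Coefficients copied from the summary output of the temperature-only model
b0 <- 13.5354
b1 <- -0.2090

# Multiplicative change in the odds of distress per one-unit rise in temperature
odds_ratio <- exp(b1)

# Predicted probability of distress at a given temperature
p_distress <- function(temp) plogis(b0 + b1 * temp)

# Temperature at which the predicted probability equals 0.5
boundary <- -b0 / b1

round(odds_ratio, 3)
round(boundary, 1)
p_distress(c(55, 65, 75))
```

An odds ratio below 1 means the odds of distress fall as temperature rises, and launches colder than the boundary temperature are classified as 1 under a 0.5 cutoff.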
anova(model, test="Chisq")
Analysis of Deviance Table
Model: binomial, link: logit
Response: distress_ct
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 19 24.435
temperature 1 6.6572 18 17.777 0.009875 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
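The analysis of deviance shows that adding temperature reduces the deviance from 24.435 to 17.777, a significant improvement over the null model (p = 0.009875).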
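The Pr(>Chi) value in the table comes from comparing the drop in deviance with a chi-squared distribution. A minimal sketch that reproduces it from the numbers in the table:

```r
# Drop in deviance from adding temperature, taken from the anova table
dev_drop <- 6.6572
df_drop  <- 1

# Upper-tail chi-squared probability, the quantity reported as Pr(>Chi)
p_value <- pchisq(dev_drop, df = df_drop, lower.tail = FALSE)
p_value
```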
Check Accuracy
fitted.results <- predict(model,newdata=launch_test,type='response')
fitted.results <- ifelse(fitted.results > 0.5,1,0)
misClasificError <- mean(fitted.results != launch_test$distress_ct)
print(paste('Accuracy',1-misClasificError))
[1] "Accuracy 1"
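The misclassification error is 0, so every test case was predicted correctly. However, the test set contains only 3 observations, so this is weak evidence of how well the model generalizes.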
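A confusion matrix shows which kinds of errors lie behind an accuracy figure. The sketch below uses made-up probabilities and labels, not the launch data, and fixes the factor levels so the table stays 2 x 2 even if one class is never predicted:

```r
# Hypothetical predicted probabilities and true labels, for illustration only
probs  <- c(0.10, 0.80, 0.35, 0.60, 0.20)
actual <- c(0, 1, 1, 1, 0)

# Same 0.5 cutoff as in the accuracy check; fixed levels keep the table square
pred     <- factor(ifelse(probs > 0.5, 1, 0), levels = c(0, 1))
actual_f <- factor(actual, levels = c(0, 1))

# Rows are predictions, columns are true labels
conf_mat <- table(predicted = pred, actual = actual_f)
conf_mat

# Accuracy is the share of counts on the diagonal
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy
```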
ROC Method:
Because this data set is so small, the test set may not contain both 0 and 1 values, in which case the code will not run. With so few test observations the ROC curve is not informative here, but the code is provided for reference.
library(ROCR)
p <- predict(model, newdata=launch_test, type="response")
pr <- prediction(p,launch_test_labels)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
[1] 1
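The single-class problem mentioned above can be guarded against before calling prediction(). A minimal sketch, where has_both_classes is a hypothetical helper and not part of ROCR:

```r
# Hypothetical helper: TRUE only when both classes appear in the labels
has_both_classes <- function(labels) {
  length(unique(labels)) == 2
}

has_both_classes(c(0, 0, 0))  # FALSE: only one class present
has_both_classes(c(0, 1, 0))  # TRUE: both classes present
```

The ROCR calls could then be wrapped in if (has_both_classes(launch_test_labels)) { ... } so that a one-class test set skips the ROC step instead of stopping with an error.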