QUESTION 1
(A) Model written in terms of odds:
\(π = \frac{β_0 +β_1x_1 +β_2x_2+β_3x_3}{1 - e^(β_0 +β_1x_1 +β_2x_2+β_3x_3)}\)
Fitting model:
model1 <- glm(HIRE~., family = binomial(), data = discrim)
summary(model1)
##
## Call:
## glm(formula = HIRE ~ ., family = binomial(), data = discrim)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3250 -0.4130 -0.1040 0.1403 2.2511
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.7790 4.1117 -2.378 0.0174 *
## EDUC6 2.7254 1.8275 1.491 0.1359
## EDUC8 4.5865 2.4728 1.855 0.0636 .
## EXP 0.9138 0.4394 2.079 0.0376 *
## GENDER1 5.5328 2.6390 2.097 0.0360 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 35.165 on 27 degrees of freedom
## Residual deviance: 14.632 on 23 degrees of freedom
## AIC: 24.632
##
## Number of Fisher Scoring iterations: 7
(B)
\(β_{EXP}\): This model suggests that applicants see an increase in probability of 0.91% of being hired with each additional year of experience, if all else is held the same.
\(β_{GENDER}\): This model suggests that applicants that are males have a 5.5% higher chance of being hired than their female counterparts, if all else is held the same.
\(β_{EDUC6}\): This model suggests that applicants that have 6 years of education have a 2.7% higher chance of being hired than their counterparts with four years of education, if all else is held the same.
(C)
predict(model1, newdata = testC, type = "response")
## John Jennifer
## 0.576113139 0.005346398
John’s probability of being hired is 0.576113139 while Jennifer’s probability of being hired is 0.005346398.
(D)
Creating a confusion matrix.
We are going to say for every value with a prediction value greater than 0.5 our model is suggesting they were hired. For those with values below that threshold, they were not.
predicted <- predict(model1, discrim, type = "response")
for(i in 1:length(predicted)) {
if (predicted[i] > 0.5) {
predicted[i] <- 1
print(i)
} else {
predicted[i] <- 0
}
}
## [1] 3
## [1] 4
## [1] 10
## [1] 15
## [1] 17
## [1] 23
## [1] 27
## [1] 28
predicted <- as.numeric(predicted)
discrim$HIRE <- as.numeric(discrim$HIRE - 1)
## Warning in Ops.factor(discrim$HIRE, 1): '-' not meaningful for factors
confusionMatrix(discrim$HIRE, predicted)
Accuracy: 0.8929
Misclassifcation Error:
1 - 0.8929
## [1] 0.1071
QUESTION 2
iceData <- read.table("PONDICE.txt", header= T)
iceData$icetype <- as.factor(iceData$icetype)
(A) Pi for this logistic regression analysis is the probability that the ice is landfast ice.
(B) Pairwise Model:
$log() = log(β_0 +β_1x_1 +β_2x_2+β_3x_3) $
(C)
iceDataModified <- filter(iceData, iceData$icetype != "First-Year")
iceDataNew <- iceDataModified
attach(iceDataNew)
iceModel1 <- glm(icetype ~ depth + broadband.alb + visible.alb, family = binomial(), data = iceDataNew)
summary(iceModel1)
##
## Call:
## glm(formula = icetype ~ depth + broadband.alb + visible.alb,
## family = binomial(), data = iceDataNew)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9827 -1.0447 0.6726 0.9721 2.5334
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.2962 0.4339 -0.683 0.495
## depth -4.1282 0.8093 -5.101 3.38e-07 ***
## broadband.alb -47.1231 7.4766 -6.303 2.92e-10 ***
## visible.alb 31.1435 4.7298 6.585 4.56e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 575.31 on 415 degrees of freedom
## Residual deviance: 504.86 on 412 degrees of freedom
## AIC: 512.86
##
## Number of Fisher Scoring iterations: 5
π = -0.2962 + depth(-4.1282) + broadband.alb(-47.1231) + visible.alb(31.1435)
(D)
We are going to test our model’s adequacy by creating a confusion matrix and testing the misclassification error.
predicted <- predict(iceModel1, iceDataNew, type = "response")
for(i in 1:length(predicted)) {
if (predicted[i] > 0.5) {
predicted[i] <- 0
#print(i)
} else {
predicted[i] <- 1
}
}
#predicted <- as.factor(predicted)
iceDataNew$icetype <- ifelse(iceDataNew$icetype == "Landfast",1, 0)
confusionMatrix(iceDataNew$icetype, predicted)
Accuracy: 0.707
Misclassifcation Error:
1 - 0.707
## [1] 0.293
(E) \(log(β_0 +β_1x_1 +β_2x_2+β_3x_3+β_4x_1x_2+β_5x_1x_3+β_6x_2x_3)\)
(F)
iceModel2 <- glm(icetype ~ (depth + broadband.alb + visible.alb)^2, family = binomial(), data = iceDataModified)
summary(iceModel2)
##
## Call:
## glm(formula = icetype ~ (depth + broadband.alb + visible.alb)^2,
## family = binomial(), data = iceDataModified)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0781 -0.9473 0.5913 0.9060 2.3337
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.097 1.444 -4.223 2.42e-05 ***
## depth 2.996 3.009 0.996 0.319
## broadband.alb -10.558 12.509 -0.844 0.399
## visible.alb 39.691 7.726 5.137 2.79e-07 ***
## depth:broadband.alb -50.490 39.560 -1.276 0.202
## depth:visible.alb 6.145 24.063 0.255 0.798
## broadband.alb:visible.alb -56.242 12.056 -4.665 3.08e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 575.31 on 415 degrees of freedom
## Residual deviance: 472.66 on 409 degrees of freedom
## AIC: 486.66
##
## Number of Fisher Scoring iterations: 6
head(iceDataModified)
π = -6.097 + depth(2.996) + broadband.alb(-10.558) + visible.alb(39.691) + depth(broadband.alb)(-50.49) + depth(visible.alb)(6.145) + broadband.alb(visible.alb)(-56.242)
(G)
anova(iceModel1,iceModel2)
The residuals for model 2 are slightly lower, but not enough to have statistically significant difference.