STAT4360

QUESTION 1

(A) Model written in terms of odds:

$π = \frac{β_0 +β_1x_1 +β_2x_2+β_3x_3}{1 - e^(β_0 +β_1x_1 +β_2x_2+β_3x_3)}$

Fitting model:

model1 <- glm(HIRE~., family = binomial(), data = discrim)

summary(model1)

## 
## Call:
## glm(formula = HIRE ~ ., family = binomial(), data = discrim)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3250  -0.4130  -0.1040   0.1403   2.2511  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  -9.7790     4.1117  -2.378   0.0174 *
## EDUC6         2.7254     1.8275   1.491   0.1359  
## EDUC8         4.5865     2.4728   1.855   0.0636 .
## EXP           0.9138     0.4394   2.079   0.0376 *
## GENDER1       5.5328     2.6390   2.097   0.0360 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 35.165  on 27  degrees of freedom
## Residual deviance: 14.632  on 23  degrees of freedom
## AIC: 24.632
## 
## Number of Fisher Scoring iterations: 7

(B)

$β_{EXP}$: This model suggests that applicants see an increase in probability of 0.91% of being hired with each additional year of experience, if all else is held the same.

$β_{GENDER}$: This model suggests that applicants that are males have a 5.5% higher chance of being hired than their female counterparts, if all else is held the same.

$β_{EDUC6}$: This model suggests that applicants that have 6 years of education have a 2.7% higher chance of being hired than their counterparts with four years of education, if all else is held the same.

(C)

predict(model1, newdata = testC, type = "response")

##        John    Jennifer 
## 0.576113139 0.005346398

John’s probability of being hired is 0.576113139 while Jennifer’s probability of being hired is 0.005346398.

(D)

Creating a confusion matrix.

We are going to say for every value with a prediction value greater than 0.5 our model is suggesting they were hired. For those with values below that threshold, they were not.

predicted <- predict(model1, discrim, type = "response")

for(i in 1:length(predicted)) {
    if (predicted[i] > 0.5) {
        predicted[i] <- 1 
        print(i)
    } else {
        predicted[i] <- 0
    }
}

## [1] 3
## [1] 4
## [1] 10
## [1] 15
## [1] 17
## [1] 23
## [1] 27
## [1] 28

predicted <- as.numeric(predicted)
discrim$HIRE <- as.numeric(discrim$HIRE - 1)

## Warning in Ops.factor(discrim$HIRE, 1): '-' not meaningful for factors

confusionMatrix(discrim$HIRE, predicted)

Accuracy: 0.8929

Misclassifcation Error:

1 - 0.8929

## [1] 0.1071

QUESTION 2

iceData <- read.table("PONDICE.txt", header= T)
iceData$icetype <- as.factor(iceData$icetype)

(A) Pi for this logistic regression analysis is the probability that the ice is landfast ice.

(B) Pairwise Model:

$log() = log(β_0 +β_1x_1 +β_2x_2+β_3x_3) $

(C)

iceDataModified <- filter(iceData, iceData$icetype != "First-Year")

iceDataNew <- iceDataModified

attach(iceDataNew)
iceModel1 <- glm(icetype ~ depth + broadband.alb + visible.alb, family = binomial(), data = iceDataNew)
summary(iceModel1)

## 
## Call:
## glm(formula = icetype ~ depth + broadband.alb + visible.alb, 
##     family = binomial(), data = iceDataNew)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9827  -1.0447   0.6726   0.9721   2.5334  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -0.2962     0.4339  -0.683    0.495    
## depth          -4.1282     0.8093  -5.101 3.38e-07 ***
## broadband.alb -47.1231     7.4766  -6.303 2.92e-10 ***
## visible.alb    31.1435     4.7298   6.585 4.56e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 575.31  on 415  degrees of freedom
## Residual deviance: 504.86  on 412  degrees of freedom
## AIC: 512.86
## 
## Number of Fisher Scoring iterations: 5

π = -0.2962 + depth(-4.1282) + broadband.alb(-47.1231) + visible.alb(31.1435)

(D)

We are going to test our model’s adequacy by creating a confusion matrix and testing the misclassification error.

predicted <- predict(iceModel1, iceDataNew, type = "response")

for(i in 1:length(predicted)) {
    if (predicted[i] > 0.5) {
        predicted[i] <- 0 
        #print(i)
    } else {
        predicted[i] <- 1
    }
}

#predicted <- as.factor(predicted)

iceDataNew$icetype <- ifelse(iceDataNew$icetype == "Landfast",1, 0)

confusionMatrix(iceDataNew$icetype, predicted)

Accuracy: 0.707

Misclassifcation Error:

1 - 0.707

## [1] 0.293

(E) $log(β_0 +β_1x_1 +β_2x_2+β_3x_3+β_4x_1x_2+β_5x_1x_3+β_6x_2x_3)$

(F)

iceModel2 <- glm(icetype ~ (depth + broadband.alb + visible.alb)^2, family = binomial(), data = iceDataModified)
summary(iceModel2)

## 
## Call:
## glm(formula = icetype ~ (depth + broadband.alb + visible.alb)^2, 
##     family = binomial(), data = iceDataModified)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0781  -0.9473   0.5913   0.9060   2.3337  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -6.097      1.444  -4.223 2.42e-05 ***
## depth                        2.996      3.009   0.996    0.319    
## broadband.alb              -10.558     12.509  -0.844    0.399    
## visible.alb                 39.691      7.726   5.137 2.79e-07 ***
## depth:broadband.alb        -50.490     39.560  -1.276    0.202    
## depth:visible.alb            6.145     24.063   0.255    0.798    
## broadband.alb:visible.alb  -56.242     12.056  -4.665 3.08e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 575.31  on 415  degrees of freedom
## Residual deviance: 472.66  on 409  degrees of freedom
## AIC: 486.66
## 
## Number of Fisher Scoring iterations: 6

head(iceDataModified)

π = -6.097 + depth(2.996) + broadband.alb(-10.558) + visible.alb(39.691) + depth(broadband.alb)(-50.49) + depth(visible.alb)(6.145) + broadband.alb(visible.alb)(-56.242)

(G)

anova(iceModel1,iceModel2)

The residuals for model 2 are slightly lower, but not enough to have statistically significant difference.

STAT4360

Max Whalen

12/3/2021