This homework is for Sarah, Cheyanne, Lukas, and Cameron

TITANIC

library(stableGR)
## Loading required package: mcmcse
## mcmcse: Monte Carlo Standard Errors for MCMC
## Version 1.4-1 created on 2020-01-29.
## copyright (c) 2012, James M. Flegal, University of California, Riverside
##                     John Hughes, University of Colorado, Denver
##                     Dootika Vats, University of Warwick
##                     Ning Dai, University of Minnesota
##  For citation information, type citation("mcmcse").
##  Type help("mcmcse-package") to get started.
library(faraway)
data(titanic.complete)
titanic <- titanic.complete
head(titanic)
##   Survived Pclass    Sex Age SibSp Parch    Fare Embarked
## 1        0      3   male  22     1     0  7.2500        S
## 2        1      1 female  38     1     0 71.2833        C
## 3        1      3 female  26     0     0  7.9250        S
## 4        1      1 female  35     1     0 53.1000        S
## 5        0      3   male  35     0     0  8.0500        S
## 7        0      1   male  54     0     0 51.8625        S
attach(titanic)

1 - Exploratory Data Analysis

In this section we will look at the proportion of suvivors on the titanic based on both sex and class. We would expect that there would be more female survivors, and that the proportion of survivors from each class would decrease as class decreased.
Below we look at the survivors based solely on sex, and can see that more females survived than males, which matches what we would expect from our knowledge of the titanic.
female <- sum(with(titanic, Sex == "female"))
femaleSurvive <- sum(with(titanic, Sex == "female" & Survived == 1))
femaleSurviveProp <- femaleSurvive/female

male <- sum(with(titanic, Sex == "male"))
maleSurvive <- sum(with(titanic, Sex == "male" & Survived == 1))
maleSurviveProp <- maleSurvive/male

sexPropMatrix <- matrix(data = round(c(femaleSurviveProp, 1 - femaleSurviveProp, maleSurviveProp, 1 - maleSurviveProp), 3), nrow = 2, byrow = TRUE)
rownames(sexPropMatrix) <- c("Female", "Male")
colnames(sexPropMatrix) <- c("Survived", "Did Not Survive")
sexPropMatrix
##        Survived Did Not Survive
## Female    0.753           0.247
## Male      0.205           0.795
Below we look at the survivors based solely on class, and can see that first class had the highest proportion of survivors and third class had the lowest, which also matches what we expected.
class1 <- sum(with(titanic, Pclass == 1))
class1Survive <- sum(with(titanic, Pclass == 1 & Survived == 1))
class1SurviveProp <- class1Survive/class1

class2 <- sum(with(titanic, Pclass == 2))
class2Survive <- sum(with(titanic, Pclass == 2 & Survived == 1))
class2SurviveProp <- class2Survive/class2

class3 <- sum(with(titanic, Pclass == 3))
class3Survive <- sum(with(titanic, Pclass == 3 & Survived == 1))
class3SurviveProp <- class3Survive/class3

classPropMatrix <- matrix(data = round(c(class1SurviveProp, 1 - class1SurviveProp, class2SurviveProp, 1 - class2SurviveProp, class3SurviveProp, 1 - class3SurviveProp), 3), nrow = 3, byrow = TRUE)
rownames(classPropMatrix) <- c("1st Class", "2nd Class", "3rd Class")
colnames(classPropMatrix) <- c("Survived", "Did Not Survive")
classPropMatrix
##           Survived Did Not Survive
## 1st Class    0.652           0.348
## 2nd Class    0.480           0.520
## 3rd Class    0.239           0.761

2 - Logistic Regression No Predictors

a) Logistic Regression Model: log(p/(1-p)) = -.38677
b) The odds of survival for a passenger on the titanic is 0.6792453  
c) There is a .4044944 probability of survival for a passenger on the titanic.
d) We are 95% confident that the log odds of survival for a passenger on the titanic is between -.5372197 and -.2377555
e) We are 95% confident that the odds of survival for a passenger on the titanic lies between 0.5843707 and 0.7883955 
myresp <- cbind(titanic$Survived, 1 - titanic$Survived)
tmod0 <- glm(myresp ~ 1, family = binomial)
summary(tmod0)
## 
## Call:
## glm(formula = myresp ~ 1, family = binomial)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.018  -1.018  -1.018   1.345   1.345  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.38677    0.07636  -5.065 4.08e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 960.9  on 711  degrees of freedom
## Residual deviance: 960.9  on 711  degrees of freedom
## AIC: 962.9
## 
## Number of Fisher Scoring iterations: 4
exp(coef(tmod0))
## (Intercept) 
##   0.6792453
ilogit(coef(tmod0))
## (Intercept) 
##   0.4044944
#-.38677 + c(-1, 1)*.07636*1.96 doing by hand is less accurate
confint(tmod0, level = .95)
## Waiting for profiling to be done...
##      2.5 %     97.5 % 
## -0.5372197 -0.2377555
exp(confint(tmod0, level = .95))
## Waiting for profiling to be done...
##     2.5 %    97.5 % 
## 0.5843707 0.7883955

3 - Logistic Regression Children v Adults

Logistic Regression Equation: log(p/(1-p)) = .1596 - .6536*I(Adult)
The odds of survival for children are 1.1730769, which is found by exponentiating the intercept. 
The odds of survival for adults is .6101808, which is found by exponentiating the sum of the intercept plus the coefficient associated with being an adult. 
The odds of survival for children are higher than adults, which makes sense because women and children were prioritized for getting on the rafts.
From this regression equation, the odds of survival for a child are 1.922399 times higher than that of an adult.
titanic$Adult<-ifelse(titanic$Age>=18.0, 1, 0)
myresp <- cbind(titanic$Survived, 1 - titanic$Survived)
tmod1 <- glm(myresp~Adult, family = binomial, data = titanic)
summary(tmod1)
## 
## Call:
## glm(formula = myresp ~ Adult, family = binomial, data = titanic)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2459  -0.9761  -0.9761   1.3931   1.3931  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   0.1596     0.1887   0.846  0.39769   
## Adult        -0.6536     0.2067  -3.162  0.00157 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 960.90  on 711  degrees of freedom
## Residual deviance: 950.87  on 710  degrees of freedom
## AIC: 954.87
## 
## Number of Fisher Scoring iterations: 4
exp(coef(tmod1))
## (Intercept)       Adult 
##   1.1730769   0.5201833
exp(-coef(tmod1))
## (Intercept)       Adult 
##    0.852459    1.922399
exp(.1596 - .6536)
## [1] 0.6101808

4 - Logistic Regression Women v Men

Logistic Regression Equation: log(p/(1-p)) = 1.1141 - 2.4676*I(Male)
The odds of survival for women are 3.046875, which is found by exponentiating the intercept since women are included in the intercept.
The odds of survival for men are .2583345, which is found by exponentiating the sum of the intercept and the coefficient for being a man. 
It makes sense that the odds of survival for women are higher, since women and children were prioritized on the titanic.
From this regression equation, we can interpret that the odds of survival for women are 11.7943548 times higher than that of a man. (As they should be.) 
tmod2 <- glm(myresp~Sex, family = binomial, data = titanic)
summary(tmod2)
## 
## Call:
## glm(formula = myresp ~ Sex, family = binomial, data = titanic)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6721  -0.6779  -0.6779   0.7534   1.7795  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.1141     0.1441   7.734 1.04e-14 ***
## Sexmale      -2.4676     0.1852 -13.327  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 960.90  on 711  degrees of freedom
## Residual deviance: 749.57  on 710  degrees of freedom
## AIC: 753.57
## 
## Number of Fisher Scoring iterations: 4
exp(coef(tmod2))
## (Intercept)     Sexmale 
##  3.04687500  0.08478632
exp(-coef(tmod2))
## (Intercept)     Sexmale 
##   0.3282051  11.7943548
exp(1.1141 - 2.4676)
## [1] 0.2583345

5 - Logistic Regression for Class

Logistic Regression Equation: log(p/(1-p)) = .6286 - .7096*I(Pclass == 2) - 1.7844*I(Pclass == 3)
The odds of survival for a passenger in first class is 1.875, for passengers in second class is .9221937, and for passengers in third class is .3148056. This means that odds of survival decreased as class status decreased, which makes sense because they were classist then (bad).
From this regression equation, we can say that the odds of survival for a passenger in first class would be 2.0331325 times higher than that of a passenger in second class, and 5.9558824 times higher than that of a passenger in third class. Odds of survival are best for those in first class, and worst in third class.
tmod3 <- glm(myresp~Pclass, family = binomial, data = titanic)
summary(tmod3)
## 
## Call:
## glm(formula = myresp ~ Pclass, family = binomial, data = titanic)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4533  -0.7399  -0.7399   0.9246   1.6908  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.6286     0.1548   4.061 4.88e-05 ***
## Pclass2      -0.7096     0.2171  -3.269  0.00108 ** 
## Pclass3      -1.7844     0.1986  -8.987  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 960.90  on 711  degrees of freedom
## Residual deviance: 868.11  on 709  degrees of freedom
## AIC: 874.11
## 
## Number of Fisher Scoring iterations: 4
exp(coef(tmod3))
## (Intercept)     Pclass2     Pclass3 
##   1.8750000   0.4918519   0.1679012
exp(-coef(tmod3))
## (Intercept)     Pclass2     Pclass3 
##   0.5333333   2.0331325   5.9558824
exp(.6286 - .7096)
## [1] 0.9221937
exp(.6286 - 1.7844)
## [1] 0.3148056

6 - Logistic Regression All Three

Logistic Regression Equation: log(p/(1-p)) = 3.3126 - 1.003*I(Pclass == 2) - 2.2045*I(Pclass == 3) - 2.5248*I(Sex == Male) - 1.0691*I(Adult)
For this equation, we can interpret the intercept of 3.3126 to be the log odds of survival for a female child in first class. It would be interesting to see if female children actually had a better odds of survival than male children, or if this is just what worked for our model.
After accounting for Sex and Adult status, the odds of survival for a first class passenger are about 2.726 higher than a second class passenger. After accounting for Sex and Adult status, the odds of survival for a first class passenger are about 9.066 times higher than a third class passenger. After accounting for class and Adult status, the odds of survival for a Female are about 12.489 times higher than a Male. After accounting for class and Sex, the odds of survival for a Child are about 2.913 times higher than an Adult.
Sarah: As a 2nd class adult female (probably), my odds of survival are about 3.457
Cameron: As the captain of the Titanic, this model can not accurately predict odds of survival for Cameron.
Lukas: As a 1st class (he lives in Edina) adult male, his odds of survival are about .755. Apparently money is not as important as your sex.
Cheyanne: As a frugal adult female, her odds of survival are about 1.04. Maybe she should have spared those extra pennies for a higher class like her lavish counterpart, Sarah.
tmod4 <- glm(myresp~Pclass + Sex + Adult, family = binomial, data = titanic)
summary(tmod4)
## 
## Call:
## glm(formula = myresp ~ Pclass + Sex + Adult, family = binomial, 
##     data = titanic)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5878  -0.6992  -0.3999   0.7128   2.2653  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   3.3126     0.3603   9.194  < 2e-16 ***
## Pclass2      -1.0030     0.2606  -3.848 0.000119 ***
## Pclass3      -2.2045     0.2516  -8.762  < 2e-16 ***
## Sexmale      -2.5248     0.2061 -12.253  < 2e-16 ***
## Adult        -1.0691     0.2706  -3.951 7.78e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 960.90  on 711  degrees of freedom
## Residual deviance: 656.18  on 707  degrees of freedom
## AIC: 666.18
## 
## Number of Fisher Scoring iterations: 4
exp(coef(tmod4))
## (Intercept)     Pclass2     Pclass3     Sexmale       Adult 
##  27.4575025   0.3667926   0.1103029   0.0800730   0.3433302
exp(-coef(tmod4))
## (Intercept)     Pclass2     Pclass3     Sexmale       Adult 
##  0.03641992  2.72633617  9.06594453 12.48860447  2.91264782
exp(3.3126 - 1.003 - 1.0691)
## [1] 3.457342
exp(3.3126 - 2.5248 - 1.0691)
## [1] 0.7548019
exp(3.3126 - 2.2045 - 1.0691)
## [1] 1.03977

7 - Interaction Terms

We will be testing several interaction terms to determine whether or not to add them to our model. To build our model, we will begin with our previous model, the model containing the three variables: class, sex, and adult.

For our interaction terms, we will be looking at 4 different terms, the interactions between fare and age, embarked and class, sex and adult, and class and age.

To determine whether or not to add an interaction term to our model, we will be using a drop in deviance test. We will add one variable at a time that has the lowest pvalue, as long as that variable meets a significance level of .01.
Round 1:
From our first round of testing below, we can see that mod3, the model that adds the interaction term of sex and adult, has the most significant improvement over our current model, with a pvalue of .000447. Perhaps we were right earlier in our thoughts that a female child wouldn't have much of a difference in odds of survival from a male child, since all children were grouped together.
currentmod <- glm(myresp ~ Pclass + Sex + Adult, family = binomial, data = titanic) #has 4 degrees of freedom less than null

mod1 <- glm(myresp~Pclass + Sex + Adult + Fare*Age, family = binomial, data = titanic) #has 7 degrees of freedom less than null
mod2 <- glm(myresp~Pclass + Sex + Adult + Embarked*Pclass, family = binomial, data = titanic) #has 10 degrees of freedom less than null
mod3 <- glm(myresp~Pclass + Sex + Adult + Sex*Adult, family = binomial, data = titanic) #has 5 degrees of freedom less than null
mod4 <- glm(myresp~Pclass + Sex + Adult + Pclass*Age, family = binomial, data = titanic) #has 7 degrees of freedom less than null

deviances <- c(anova(currentmod, mod1)[4], anova(currentmod, mod2)[4], anova(currentmod, mod3)[4], anova(currentmod, mod4)[4])
deviances
## $Deviance
## [1]       NA 14.55122
## 
## $Deviance
## [1]       NA 5.669349
## 
## $Deviance
## [1]       NA 12.32349
## 
## $Deviance
## [1]       NA 12.69438
#summary(mod1)
#summary(mod2)
#summary(mod3)
#summary(mod4)

pchisq(14.55122, df = 3, lower.tail = FALSE) #mod1
## [1] 0.002243245
pchisq(5.669349, df = 6, lower.tail = FALSE) #mod2
## [1] 0.4612294
pchisq(12.32349, df = 1, lower.tail = FALSE) #mod3
## [1] 0.0004472936
pchisq(12.69438, df = 3, lower.tail = FALSE) #mod4
## [1] 0.005346376
Round 2:
From this second round of testing, we can see that mod7, the model that adds the interaction term of Pclass and Age as a predictor, has the most significant drop in deviance, with a pvalue of .0021. Since this is below our significance level of .01, we will add this variable to our model.
currentmod <- glm(myresp~Pclass + Sex + Adult + Sex*Adult, family = binomial, data = titanic) #has 5 degrees of freedom less than null

mod5 <- glm(myresp~Pclass + Sex + Adult + Sex*Adult + Fare*Age, family = binomial, data = titanic) #has 8 degrees of freedom less than null
mod6 <- glm(myresp~Pclass + Sex + Adult + Sex*Adult + Embarked*Pclass, family = binomial, data = titanic) #has 11 degrees of freedom less than null
mod7 <- glm(myresp~Pclass + Sex + Adult + Sex*Adult + Pclass*Age, family = binomial, data = titanic) #has 8 degrees of freedom less than null


deviances <- c(anova(currentmod, mod5)[4], anova(currentmod, mod6)[4], anova(currentmod, mod7)[4])
deviances
## $Deviance
## [1]       NA 14.68968
## 
## $Deviance
## [1]      NA 8.16316
## 
## $Deviance
## [1]       NA 11.10345
#summary(mod5)
#summary(mod6)
#summary(mod7)

pchisq(11.10345, df = 3, lower.tail = FALSE) #mod5
## [1] 0.01117943
pchisq(8.16316, df = 6, lower.tail = FALSE) #mod6
## [1] 0.2263919
pchisq(14.68968, df = 3, lower.tail = FALSE) #mod7
## [1] 0.002101991
Round 3:
From our third round of testing, we find that mod8, the model that adds the interaction term of Fare and Age, is the most significant, with a pvalue of .04398. This is NOT below our pvalue of .01, so we will not add this to our model.
currentmod <- glm(myresp~Pclass + Sex + Adult + Sex*Adult + Pclass*Age, family = binomial, data = titanic) #has 8 degrees of freedom less than null

mod8 <- glm(myresp~Pclass + Sex + Adult + Sex*Adult + Pclass*Age + Fare*Age, family = binomial, data = titanic) #has 10 degrees of freedom less than null
mod9 <- glm(myresp~Pclass + Sex + Adult + Sex*Adult + Pclass*Age + Embarked*Pclass, family = binomial, data = titanic) #has 14 degrees of freedom less than null


deviances <- c(anova(currentmod, mod8)[4], anova(currentmod, mod9)[4])
deviances
## $Deviance
## [1]       NA 6.248075
## 
## $Deviance
## [1]       NA 8.427475
#summary(mod8)
#summary(mod9)

pchisq(6.248075, df = 2, lower.tail = FALSE) #mod8
## [1] 0.04397924
pchisq(8.427475, df = 6, lower.tail = FALSE) #mod9
## [1] 0.2084276
After performing our drop in deviance tests and using forward selection, we land on the final model that uses class, sex, adult, the interaction term between sex and adult, and the interaction term between class and age. Summaries of this model are below, along with the intercepts exponentiated to be used to understand the odds of survival. We will not be interpreting since the question did not specify to interpret.
summary(currentmod)
## 
## Call:
## glm(formula = myresp ~ Pclass + Sex + Adult + Sex * Adult + Pclass * 
##     Age, family = binomial, data = titanic)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5109  -0.7378  -0.3683   0.6078   2.4507  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    3.162288   0.683755   4.625 3.75e-06 ***
## Pclass2       -0.891375   0.755570  -1.180 0.238104    
## Pclass3       -2.678733   0.704185  -3.804 0.000142 ***
## Sexmale       -1.218935   0.431959  -2.822 0.004774 ** 
## Adult          0.463126   0.465889   0.994 0.320189    
## Age           -0.026907   0.013537  -1.988 0.046856 *  
## Sexmale:Adult -1.658309   0.494824  -3.351 0.000804 ***
## Pclass2:Age   -0.015267   0.020606  -0.741 0.458747    
## Pclass3:Age    0.004205   0.020209   0.208 0.835191    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 960.90  on 711  degrees of freedom
## Residual deviance: 632.76  on 703  degrees of freedom
## AIC: 650.76
## 
## Number of Fisher Scoring iterations: 5
exp(coef(currentmod))
##   (Intercept)       Pclass2       Pclass3       Sexmale         Adult 
##    23.6245927     0.4100914     0.0686501     0.2955448     1.5890340 
##           Age Sexmale:Adult   Pclass2:Age   Pclass3:Age 
##     0.9734520     0.1904608     0.9848488     1.0042134
exp(-coef(currentmod))
##   (Intercept)       Pclass2       Pclass3       Sexmale         Adult 
##    0.04232877    2.43848094   14.56662096    3.38358196    0.62931317 
##           Age Sexmale:Adult   Pclass2:Age   Pclass3:Age 
##    1.02727205    5.25042459    1.01538428    0.99580429

BREAST CANCER

library(faraway)
data(wbca)
head(wbca)
##   Class Adhes BNucl Chrom Epith Mitos NNucl Thick UShap USize
## 1     1     1     1     3     2     1     1     5     1     1
## 2     1     5    10     3     7     1     2     5     4     4
## 3     1     1     2     3     2     1     1     3     1     1
## 4     1     1     4     3     3     1     7     6     8     8
## 5     1     3     1     3     2     1     1     4     1     1
## 6     0     8    10     9     7     1     7     8    10    10

1- All Variables

Below we fit a model with all variables being used to predict the odds of a tumor being malignant. Since a tumor being benign is associated with 1 and a tumor being malignant is associated with 0, these models predict the odds of a tumor being benign. We will be using this wording throughout the following problems.
myresp <- cbind(wbca$Class, 1 - wbca$Class)
cmod0 <- glm(myresp ~Adhes + BNucl + Chrom + Epith + Mitos + NNucl + Thick + UShap + USize, data = wbca, family = binomial)
summary(cmod0)
## 
## Call:
## glm(formula = myresp ~ Adhes + BNucl + Chrom + Epith + Mitos + 
##     NNucl + Thick + UShap + USize, family = binomial, data = wbca)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.48282  -0.01179   0.04739   0.09678   3.06425  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 11.16678    1.41491   7.892 2.97e-15 ***
## Adhes       -0.39681    0.13384  -2.965  0.00303 ** 
## BNucl       -0.41478    0.10230  -4.055 5.02e-05 ***
## Chrom       -0.56456    0.18728  -3.014  0.00257 ** 
## Epith       -0.06440    0.16595  -0.388  0.69795    
## Mitos       -0.65713    0.36764  -1.787  0.07387 .  
## NNucl       -0.28659    0.12620  -2.271  0.02315 *  
## Thick       -0.62675    0.15890  -3.944 8.01e-05 ***
## UShap       -0.28011    0.25235  -1.110  0.26699    
## USize        0.05718    0.23271   0.246  0.80589    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 881.388  on 680  degrees of freedom
## Residual deviance:  89.464  on 671  degrees of freedom
## AIC: 109.46
## 
## Number of Fisher Scoring iterations: 8

2 - Describe Relationships

We used the coefficient from the previous problem to exponentiate and interpret. 
After accounting for all other variables, an increase in one unit of thickness is associated with a multiplicative change of .5343255 in the odds of a tumor being benign. To make this more understandable, this means that a tumor is less likely to be benign as it increases in thickness.
Another way to interpret would be that, holding all else constant, a tumor is 1.87 times more likely to be benign for each decrease in one unit of thickness.
exp(-.62675)
## [1] 0.5343255
exp(.62675)
## [1] 1.871518

3 - Backwards Elimination

We are doing automatic backwards elimination with the drop in deviance test at a significance level of .05, dropping one variable at a time.
Round 1:
For our first round of dropping variables, we find that USize is the least significant variable, with a pvalue of .808. We will drop this variable and create a new model.
currentmod <- glm(myresp ~Adhes + BNucl + Chrom + Epith + Mitos + NNucl + Thick + UShap + USize, data = wbca, family = binomial)

drop1(currentmod, test = "Chisq")
## Single term deletions
## 
## Model:
## myresp ~ Adhes + BNucl + Chrom + Epith + Mitos + NNucl + Thick + 
##     UShap + USize
##        Df Deviance    AIC     LRT  Pr(>Chi)    
## <none>      89.464 109.46                      
## Adhes   1   98.844 116.84  9.3798  0.002194 ** 
## BNucl   1  109.000 127.00 19.5363 9.871e-06 ***
## Chrom   1   99.841 117.84 10.3767  0.001276 ** 
## Epith   1   89.613 107.61  0.1487  0.699763    
## Mitos   1   93.551 111.55  4.0868  0.043220 *  
## NNucl   1   95.204 113.20  5.7401  0.016582 *  
## Thick   1  110.239 128.24 20.7744 5.167e-06 ***
## UShap   1   90.627 108.63  1.1628  0.280887    
## USize   1   89.523 107.52  0.0592  0.807802    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Round 2:
For our second round of dropping variables, we find that Epith is the least significant variable, with a pvalue of .710. We will drop this variable and create a new model.
currentmod <- glm(myresp ~ Adhes + BNucl + Chrom + Epith + Mitos + NNucl + UShap + Thick, data = wbca, family = binomial)

drop1(currentmod, test = "Chisq")
## Single term deletions
## 
## Model:
## myresp ~ Adhes + BNucl + Chrom + Epith + Mitos + NNucl + UShap + 
##     Thick
##        Df Deviance    AIC     LRT  Pr(>Chi)    
## <none>      89.523 107.52                      
## Adhes   1   99.042 115.04  9.5185  0.002034 ** 
## BNucl   1  109.064 125.06 19.5405 9.849e-06 ***
## Chrom   1  100.153 116.15 10.6296  0.001113 ** 
## Epith   1   89.662 105.66  0.1384  0.709888    
## Mitos   1   93.552 109.55  4.0284  0.044740 *  
## NNucl   1   95.231 111.23  5.7076  0.016891 *  
## UShap   1   91.355 107.36  1.8313  0.175972    
## Thick   1  110.465 126.47 20.9421 4.734e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Round 3:
For our third round of dropping variables, we find that UShap is the least significant variable, with a pvalue of .136. We will drop this variable and create a new model.
currentmod <- glm(myresp ~ Adhes + BNucl + Chrom + Mitos + NNucl + UShap + Thick, data = wbca, family = binomial)

drop1(currentmod, test = "Chisq")
## Single term deletions
## 
## Model:
## myresp ~ Adhes + BNucl + Chrom + Mitos + NNucl + UShap + Thick
##        Df Deviance    AIC     LRT  Pr(>Chi)    
## <none>      89.662 105.66                      
## Adhes   1  100.126 114.13 10.4647 0.0012168 ** 
## BNucl   1  109.762 123.76 20.1000 7.350e-06 ***
## Chrom   1  100.844 114.84 11.1825 0.0008257 ***
## Mitos   1   93.714 107.71  4.0520 0.0441183 *  
## NNucl   1   95.853 109.85  6.1913 0.0128379 *  
## UShap   1   91.884 105.88  2.2227 0.1359951    
## Thick   1  110.632 124.63 20.9705 4.664e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Round 4:
For our fourth round of dropping variables, we find that Mitos is the least significant variable, with a pvalue of .032. This is not above our significance level of .05, so we will not drop this variable, and currentmod is our final model.
currentmod <- glm(myresp ~ Adhes + BNucl + Chrom + Mitos + NNucl + Thick, data = wbca, family = binomial)

drop1(currentmod, test = "Chisq")
## Single term deletions
## 
## Model:
## myresp ~ Adhes + BNucl + Chrom + Mitos + NNucl + Thick
##        Df Deviance    AIC    LRT  Pr(>Chi)    
## <none>      91.884 105.88                     
## Adhes   1  105.473 117.47 13.588 0.0002276 ***
## BNucl   1  124.813 136.81 32.929 9.558e-09 ***
## Chrom   1  109.699 121.70 17.815 2.435e-05 ***
## Mitos   1   96.494 108.49  4.609 0.0317983 *  
## NNucl   1  103.711 115.71 11.826 0.0005840 ***
## Thick   1  130.842 142.84 38.957 4.332e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

4 - Regression Equation

From our automatic regression, we land on the model with the following equation:
log(p/(1-p)) = 11.47296 - 0.44534*Adhes - 0.47410*BNucl - 0.63311*Chrom - 0.68165*Mitos - 0.36241*NNucl - 0.72247*Thick
This question did not ask for an interpretation, so we will not interpret.
finalmod <- glm(myresp ~ Adhes + BNucl + Chrom + Mitos + NNucl + Thick, data = wbca, family = binomial)
summary(finalmod)
## 
## Call:
## glm(formula = myresp ~ Adhes + BNucl + Chrom + Mitos + NNucl + 
##     Thick, family = binomial, data = wbca)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2980  -0.0082   0.0452   0.0972   3.3435  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 11.47296    1.37152   8.365  < 2e-16 ***
## Adhes       -0.44534    0.12897  -3.453 0.000554 ***
## BNucl       -0.47410    0.09545  -4.967 6.80e-07 ***
## Chrom       -0.63311    0.16952  -3.735 0.000188 ***
## Mitos       -0.68165    0.35264  -1.933 0.053239 .  
## NNucl       -0.36241    0.11391  -3.181 0.001465 ** 
## Thick       -0.72247    0.14737  -4.902 9.46e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 881.388  on 680  degrees of freedom
## Residual deviance:  91.884  on 674  degrees of freedom
## AIC: 105.88
## 
## Number of Fisher Scoring iterations: 8
coef(finalmod)
## (Intercept)       Adhes       BNucl       Chrom       Mitos       NNucl 
##  11.4729593  -0.4453444  -0.4740987  -0.6331072  -0.6816490  -0.3624117 
##       Thick 
##  -0.7224688

5 - Lack of Fit Test

Below we conduct a lack of fit test at a significance level of .05. We find a pvalue of 1.446742e-06, which means we do have evidence of lack of fit in our model. We will have to do more digging to figure out why.
teststat <- summary(finalmod)$deviance
wdf <- nrow(esdcomp) - length(coef(finalmod))
pchisq(teststat, df = wdf, lower.tail= FALSE)
## [1] 1.446742e-06

6 - Outliers

Looking at our residuals.... it is not looking good for us. For the halfnorm plot, we would like to see a straight line, but in fact ours goes crazy. 244 and 419 are very very big outliers, as you can see they are called out. Not looking good. 
halfnorm(residuals(finalmod))

7 - Deviance Residuals

From the plot below, we see that we are more likely to have vertical variance as log odds heads to zero. When log odds is equal to 0, that would mean p = 1-p, so we are uncertain as to whether the tumor is malignant or benign. It would make sense to see the graph diverging near this value because it could really go either way if the tumor is malignant or benign. As our log odds get more negative or more positive, we are more certain about whether or not the tumor is malignant, so our deviance residuals are closer to zero. Overall, this graph makes sense.
plot(residuals(finalmod, type = "deviance") ~ predict(finalmod, type="link"), xlab="log odds", ylab="deviance residuals")

8 - Predicting Odds

Using our final reduced model, we find that the odds of being benign with these stats are 112.1963. This means that it is highly likely that this tumor is benign. 
newdat <- data.frame(Adhes=1, BNucl= 1,Chrom= 3, Epith=2,Mitos=1,
NNucl= 1,Thick=4,UShap=1,USize=1)
exp(predict(finalmod, newdata = newdat, se=TRUE)$fit)
##        1 
## 112.1973

9 - 95% Prediction Interval

If we use our reduced model, we are 95% confident that the probability this person's tumor is benign is between 97.35% and 99.71%
predict.out <- predict(finalmod, newdata = newdat, se=TRUE)
pt.est <- predict.out$fit
se.est <- predict.out$se.fit
logoddsCI<-pt.est+c(-1,1)*1.96*se.est
ilogit(logoddsCI)
## [1] 0.9734871 0.9970917