Homework 2

Problem 1:

Q. Using the wbca data in the faraway package. Fit a binary logistic regression model with Class as the response and the other variables as predictors. Comment on the model deviance and tests for the coefficients. Attempt model selection using the step function and comment on any reduction that takes place.

A. In the following work you’ll find models that present the best possible data interpretation based on the collection of data we were presented. An initial binary logistic regression of all variables was done allowing us to look at the baseline set data and their influence on the determination of malignancy. Further evaluation of the regression model was done using the step method to find the combination of variables that most influence development of a tumor. That further step [pun intended] finds that the combination: Adhes + BNucl + Chrom + Mitos + NNucl + Thick + UShap, best predicts cell malignancy. Our primary gauge of confidence in the model strength, AIC, reduces through this process from an initial value of 109.46 to 105.66 in the final model. This is a small reduction, but a reduction none the less.

The post-model improvement in deviance, from Null = 881.4 to Residual = 89.66 indicates that we see an almost ten-fold improvement in model accuracy from a null observation using only the intercept. This gives us confidence in the accuracy of the model.

Testing all variables we arrive at coefficients for each that range from 8.01x10^-5 (0.0000801) for Thick to 0.80589 for USize. The coefficients indicate the extent to which each category influences the outcome. The smaller the coefficient, the more influence it has on the outcome. So Thick influences the model the most and USize the least with the others’ influence falling somewhere between.

The individual measurements’ coefficients indicate to what extent we see a change in out model. For example, in the initial model, Adhes has a coefficient of -.39681. This tells us that for every 1 unit of Adhes measurement, in this case a scale of 1-10, there is a change of -.39681 in the log odds of malignancy. Various measurements have their own associated coefficients in both the initial model and the final step-wise model. Likewise with standard error. Each value describes the associated measurements’ variability of the coefficient estimate.

What we can conclude from this data is that there are certain measurements that influence the determination of malignancy more than others. As a result, we can rely on only these data points moving forward to save the time and expense of gathering additional but ultimately unhelpful data. We expect to save ten quadrillion dollars through the reduction in superfluous data. We will rule the world.

m1<-glm(Class~., family="binomial",data=wbca) summary(m1)

## ## Call: ## glm(formula = Class ~ ., family = "binomial", data = wbca) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -2.48282 -0.01179 0.04739 0.09678 3.06425 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 11.16678 1.41491 7.892 2.97e-15 *** ## Adhes -0.39681 0.13384 -2.965 0.00303 ** ## BNucl -0.41478 0.10230 -4.055 5.02e-05 *** ## Chrom -0.56456 0.18728 -3.014 0.00257 ** ## Epith -0.06440 0.16595 -0.388 0.69795 ## Mitos -0.65713 0.36764 -1.787 0.07387 . ## NNucl -0.28659 0.12620 -2.271 0.02315 * ## Thick -0.62675 0.15890 -3.944 8.01e-05 *** ## UShap -0.28011 0.25235 -1.110 0.26699 ## USize 0.05718 0.23271 0.246 0.80589 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 881.388 on 680 degrees of freedom ## Residual deviance: 89.464 on 671 degrees of freedom ## AIC: 109.46 ## ## Number of Fisher Scoring iterations: 8

step(m1,direction="both",scale=0)

## Start: AIC=109.46 ## Class ~ Adhes + BNucl + Chrom + Epith + Mitos + NNucl + Thick + ## UShap + USize ## ## Df Deviance AIC ## - USize 1 89.523 107.52 ## - Epith 1 89.613 107.61 ## - UShap 1 90.627 108.63 ## <none> 89.464 109.46 ## - Mitos 1 93.551 111.55 ## - NNucl 1 95.204 113.20 ## - Adhes 1 98.844 116.84 ## - Chrom 1 99.841 117.84 ## - BNucl 1 109.000 127.00 ## - Thick 1 110.239 128.24 ## ## Step: AIC=107.52 ## Class ~ Adhes + BNucl + Chrom + Epith + Mitos + NNucl + Thick + ## UShap ## ## Df Deviance AIC ## - Epith 1 89.662 105.66 ## - UShap 1 91.355 107.36 ## <none> 89.523 107.52 ## + USize 1 89.464 109.46 ## - Mitos 1 93.552 109.55 ## - NNucl 1 95.231 111.23 ## - Adhes 1 99.042 115.04 ## - Chrom 1 100.153 116.15 ## - BNucl 1 109.064 125.06 ## - Thick 1 110.465 126.47 ## ## Step: AIC=105.66 ## Class ~ Adhes + BNucl + Chrom + Mitos + NNucl + Thick + UShap ## ## Df Deviance AIC ## <none> 89.662 105.66 ## - UShap 1 91.884 105.88 ## + Epith 1 89.523 107.52 ## + USize 1 89.613 107.61 ## - Mitos 1 93.714 107.71 ## - NNucl 1 95.853 109.85 ## - Adhes 1 100.126 114.13 ## - Chrom 1 100.844 114.84 ## - BNucl 1 109.762 123.76 ## - Thick 1 110.632 124.63

## ## Call: glm(formula = Class ~ Adhes + BNucl + Chrom + Mitos + NNucl + ## Thick + UShap, family = "binomial", data = wbca) ## ## Coefficients: ## (Intercept) Adhes BNucl Chrom Mitos NNucl ## 11.0333 -0.3984 -0.4192 -0.5679 -0.6456 -0.2915 ## Thick UShap ## -0.6216 -0.2541 ## ## Degrees of Freedom: 680 Total (i.e. Null); 673 Residual ## Null Deviance: 881.4 ## Residual Deviance: 89.66 AIC: 105.7

m2<-glm(Class~BNucl+Thick+Chrom+Adhes, family="binomial",data=wbca) summary(m2)

## ## Call: ## glm(formula = Class ~ BNucl + Thick + Chrom + Adhes, family = "binomial", ## data = wbca) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -2.52772 -0.01571 0.04759 0.12653 2.89239 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 10.9174 1.1823 9.234 < 2e-16 *** ## BNucl -0.5043 0.0952 -5.297 1.17e-07 *** ## Thick -0.8837 0.1394 -6.337 2.34e-10 *** ## Chrom -0.7906 0.1689 -4.680 2.86e-06 *** ## Adhes -0.4736 0.1221 -3.879 0.000105 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 881.39 on 680 degrees of freedom ## Residual deviance: 111.38 on 676 degrees of freedom ## AIC: 121.38 ## ## Number of Fisher Scoring iterations: 8

step(m2)

## Start: AIC=121.38 ## Class ~ BNucl + Thick + Chrom + Adhes ## ## Df Deviance AIC ## <none> 111.38 121.38 ## - Adhes 1 129.19 137.19 ## - Chrom 1 143.96 151.96 ## - BNucl 1 151.59 159.59 ## - Thick 1 185.84 193.84

## ## Call: glm(formula = Class ~ BNucl + Thick + Chrom + Adhes, family = "binomial", ## data = wbca) ## ## Coefficients: ## (Intercept) BNucl Thick Chrom Adhes ## 10.9174 -0.5043 -0.8836 -0.7906 -0.4736 ## ## Degrees of Freedom: 680 Total (i.e. Null); 676 Residual ## Null Deviance: 881.4 ## Residual Deviance: 111.4 AIC: 121.4

Problem 2:

Q. Using the aflatoxin data in the faraway package. Fit a logistic regression model for the number of animals with liver cancer as a function of the dose. Comment on the statistical significance of the model. Calculate the predicted probability of liver cancer for a dose of 25 ppb. Can you find a 95% confidence interval for this value. If so report it and give any insights you may have.

A. A logistic regression model of the presented data shows rather straightforward predictions of liver cancer in lab animals based on a given dose of aflatoxin. With excellent confidence scores in AIC and residual deviance we can accurately predict the probability of an occurrence of cancer with a given dosage.

Using the presented data and our resulting model, we can predict with certainty that a dose of 25ppb will result in malignancy with the probability of .3134978, or, about 31.35% of the time. We can state that probability with 95% confidence. And, in fact, given the simplicity of the data we know that for any given dose a specific probability exists along the shown prediction curve.

m3<-glm(cbind(tumor,total-tumor)~dose,family="binomial",data=aflatoxin) summary(m3)

## ## Call: ## glm(formula = cbind(tumor, total - tumor) ~ dose, family = "binomial", ## data = aflatoxin) ## ## Deviance Residuals: ## 1 2 3 4 5 6 ## -1.2995 0.7959 -0.4814 0.4174 -0.1629 0.3774 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -3.03604 0.48226 -6.295 3.07e-10 *** ## dose 0.09009 0.01456 6.189 6.04e-10 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 116.524 on 5 degrees of freedom ## Residual deviance: 2.897 on 4 degrees of freedom ## AIC: 17.685 ## ## Number of Fisher Scoring iterations: 5

plot(tumor/total~dose,aflatoxin,xlim=c(1,100),ylim=c(0,1), ylab="Percent of Tumors", xlab="Dose")

m4<-ilogit(predict(m3, newdata=data.frame(dose=25))) m4

## 1 ## 0.3134978

quantile(m4, c(0.025, 0.975))

## 2.5% 97.5% ## 0.3134978 0.3134978

plot(tumor/total~dose,aflatoxin,xlim=c(1,100),ylim=c(0,1), ylab="Percent of Tumors", xlab="Dose") x<-seq(1,100,1) lines(x,ilogit(predict(m3,newdata=data.frame(dose=x))))

END

Homework 2

Len Hancock

1/30/23

Problem 1:

Problem 2: