Sirds <- read.csv("~/math242/C7 SIRDS.csv")
reg <- glm(survival~birthweight,Sirds,family = binomial)
summary(reg)
##
## Call:
## glm(formula = survival ~ birthweight, family = binomial, data = Sirds)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6889 -0.9136 -0.5716 0.9200 1.9037
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.6008 1.1805 -3.050 0.00229 **
## birthweight 1.7408 0.5786 3.009 0.00262 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 68.994 on 49 degrees of freedom
## Residual deviance: 56.975 on 48 degrees of freedom
## AIC: 60.975
##
## Number of Fisher Scoring iterations: 3
This model gives the logistic equation: probability of survival = e^(-3.6008 + 1.7408(birthweight)) / (1 + e^(-3.6008 + 1.7408(birthweight))). The estimated odds ratio is e^1.7408 = 5.702, meaning that each additional kilogram of birthweight multiplies the odds of survival by a factor of about 5.702.
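As a quick check of the equation above, the fitted probability and the odds ratio can be computed directly from the printed coefficients. This is a minimal sketch: the helper names `p_survival` and `odds` and the two example birthweights are assumptions chosen for illustration, not values taken from the data.

```r
# Coefficients copied from the summary(reg) printout
b0 <- -3.6008   # intercept
b1 <-  1.7408   # birthweight slope (change in log-odds per kilogram)

# Hypothetical helper: fitted survival probability at a given birthweight
p_survival <- function(birthweight) plogis(b0 + b1 * birthweight)

# Hypothetical helper: convert a probability to odds
odds <- function(p) p / (1 - p)

odds_ratio <- exp(b1)     # multiplicative change in odds per extra kilogram

p1 <- p_survival(1.5)     # illustrative birthweights, not from the data
p2 <- p_survival(2.5)
print(round(odds_ratio, 3))
print(round(odds(p2) / odds(p1), 3))   # same ratio for any one-kilogram step
```

The second printed value matches the first because a one-unit change in the predictor always multiplies the odds, not the probability, by the same factor.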
1.7408 + qnorm(.95, mean=0, sd= 1)* .5786
## [1] 2.692512
1.7408 - qnorm(.95, mean=0, sd= 1)* .5786
## [1] 0.7890877
Using this printout we can build a confidence interval for the odds ratio. We take the birthweight coefficient and add and subtract a z-score times its standard error, then raise e to each endpoint. Because qnorm(.95) returns z = 1.645, which is the multiplier for a two-sided 90% interval, the result is a 90% confidence interval for the odds ratio of 2.20 to 14.77. A 95% interval would use qnorm(.975) instead.
## c)
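The interval from part b) can be reproduced in one step, which also makes the link between the confidence level and the qnorm() argument explicit. This is a sketch only: `wald_or_ci` is a hypothetical helper name, and the inputs are the birthweight estimate and standard error from the printout.

```r
# Hypothetical helper: Wald confidence interval for an odds ratio
wald_or_ci <- function(estimate, se, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)   # two-sided critical value
  exp(estimate + c(lower = -1, upper = 1) * z * se)
}

ci90 <- wald_or_ci(1.7408, 0.5786, level = 0.90)  # z = qnorm(.95)
ci95 <- wald_or_ci(1.7408, 0.5786, level = 0.95)  # z = qnorm(.975)
print(round(ci90, 2))
print(round(ci95, 2))
```

The 95% interval is wider than the 90% interval, but both exclude 1.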
1-pchisq(12.019, 1)
## [1] 0.0005266095
The 90% confidence interval in part b does not contain 1, so by the Wald test (z = 3.009, p = .00262) we reject the null hypothesis that birthweight does not affect survival. In addition, our logistic model has a null deviance of 68.994 and a residual deviance of 56.975, so the G statistic is 68.994 - 56.975 = 12.019 on 1 degree of freedom. Its p-value (.0005) leads us to reject the null hypothesis: adding birthweight improves the fit by more than we would expect by chance. This is the likelihood ratio test.
## d)
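The likelihood ratio test from part c) only needs the two deviances and their degrees of freedom, so it can be wrapped in a small function. This is a minimal sketch; `lr_test` is a hypothetical helper name, and the inputs are the values printed by summary(reg).

```r
# Hypothetical helper: likelihood ratio (drop-in-deviance) test from printed deviances
lr_test <- function(dev_reduced, df_reduced, dev_full, df_full) {
  G  <- dev_reduced - dev_full   # drop in deviance
  df <- df_reduced - df_full     # number of parameters added
  c(G = G, df = df, p.value = pchisq(G, df, lower.tail = FALSE))
}

# Null model versus the birthweight model
lr_sirds <- lr_test(68.994, 49, 56.975, 48)
print(signif(lr_sirds, 4))
```

The same function applies to any pair of nested models, as long as the reduced model is listed first.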
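library(dplyr)             # provides %>% and rename()
library(ResourceSelection) # provides hoslem.test()
# Attach the model's stored residuals and fitted probabilities to the data
Sirds2 <- cbind(Sirds, reg$residuals, reg$fitted.values)
Sirds2 <- Sirds2 %>%
  rename(res = "reg$residuals", Fit = "reg$fitted.values")
hoslem.test(Sirds2$survival,Sirds2$Fit)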
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: Sirds2$survival, Sirds2$Fit
## X-squared = 8.0472, df = 8, p-value = 0.4289
The Hosmer-Lemeshow goodness-of-fit test gives a p-value (.4289) that leads us to fail to reject the null hypothesis that the model fits. This could mean a few things: the model may be a good fit, or we may simply not have enough data to show that it is not. We only have a sample size of 50, and this test is suggested for samples of around 400, so we need more data to draw a firm conclusion about the fit.
## f)
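To make the Hosmer-Lemeshow test from part d) less of a black box, here is a rough sketch of how the statistic is built: observations are grouped by fitted probability, and observed and expected counts are compared in each group. The data are simulated (an assumption, not the SIRDS file), `hl_test` is a hypothetical helper, and its grouping details may differ from those used by hoslem.test().

```r
set.seed(1)
# Simulated (hypothetical) data, not the SIRDS data set
n <- 400
x <- runif(n, 1, 3)
y <- rbinom(n, 1, plogis(-3.6 + 1.74 * x))
fit <- glm(y ~ x, family = binomial)
p <- fitted(fit)

# Hypothetical helper: Hosmer-Lemeshow statistic with g quantile groups
hl_test <- function(y, p, g = 10) {
  breaks <- quantile(p, probs = seq(0, 1, length.out = g + 1))
  grp  <- cut(p, breaks = breaks, include.lowest = TRUE)
  n_g  <- tapply(y, grp, length)   # group sizes
  obs1 <- tapply(y, grp, sum)      # observed events per group
  exp1 <- tapply(p, grp, sum)      # expected events per group
  # Chi-square contributions from events and non-events in each group
  stat <- sum((obs1 - exp1)^2 * (1 / exp1 + 1 / (n_g - exp1)))
  df <- g - 2
  c(statistic = stat, df = df, p.value = pchisq(stat, df, lower.tail = FALSE))
}

hl <- hl_test(y, p)
print(round(hl, 4))
```

With only 50 observations and 10 groups, each group holds about 5 infants, which is one way to see why the test has little power in the SIRDS sample.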
library(rms) # provides lrm()
lrm(survival~birthweight,Sirds)
## Logistic Regression Model
##
## lrm(formula = survival ~ birthweight, data = Sirds)
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 50 LR chi2 12.02 R2 0.286 C 0.762
## 0 27 d.f. 1 g 1.315 Dxy 0.523
## 1 23 Pr(> chi2) 0.0005 gr 3.724 gamma 0.525
## max |deriv| 2e-08 gp 0.274 tau-a 0.265
## Brier 0.196
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept -3.6008 1.1806 -3.05 0.0023
## birthweight 1.7408 0.5786 3.01 0.0026
##
This model gives a Somers’ D (Dxy) of .523. Dxy is the number of concordant pairs minus the number of discordant pairs, divided by the number of pairs of infants with different outcomes, so a value of .523 means that concordant pairs clearly outnumber discordant pairs. This value is good, since we want more concordant pairs, but it could be better. The Goodman-Kruskal gamma measures the same thing but leaves tied pairs out of the denominator, which is why its value, .525, is slightly higher. Kendall’s tau-a is the number of concordant pairs minus the number of discordant pairs divided by all possible pairs, including pairs with the same outcome. This value is therefore naturally lower; we observed .265, which as before is good but could be better.
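The three rank measures differ only in their denominators, which is easy to see on a tiny example. The outcomes and fitted probabilities below are made up for illustration (they are not from the SIRDS data), and the variable names are assumptions.

```r
# Toy data (hypothetical): observed outcomes and fitted probabilities
y <- c(0, 0, 1, 1, 0, 1)
p <- c(0.2, 0.4, 0.4, 0.7, 0.5, 0.9)

# Compare every event (y = 1) with every non-event (y = 0)
d <- outer(p[y == 1], p[y == 0], "-")
conc <- sum(d > 0)    # event has the higher fitted probability
disc <- sum(d < 0)    # event has the lower fitted probability
tied <- sum(d == 0)   # equal fitted probabilities

n <- length(y)
somers_d <- (conc - disc) / (conc + disc + tied)  # pairs with different outcomes
gk_gamma <- (conc - disc) / (conc + disc)         # tied pairs dropped
tau_a    <- (conc - disc) / choose(n, 2)          # all possible pairs
print(round(c(Dxy = somers_d, gamma = gk_gamma, tau_a = tau_a), 3))

# Dxy can also be recovered from the C index: Dxy = 2 * (C - 0.5)
print(2 * (0.762 - 0.5))   # C = 0.762 from the lrm printout
```

The last line gives 0.524, which agrees with the printed Dxy of 0.523 up to the rounding of C.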
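library(ggplot2)
# Histogram of the residuals stored in reg$residuals
Sirds2 %>%
  ggplot(aes(res)) +
  geom_histogram(fill = "cyan", bins = 100)
# Histogram of the deviance residuals
res <- residuals(reg,type = "deviance")
Sirds3 <- cbind(Sirds, res)
Sirds3 %>%
  ggplot(aes(res)) +
  geom_histogram(fill = "cyan", bins = 100)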
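These histograms show two kinds of residuals. The first histogram plots the values stored in reg$residuals, which for a glm object are the working residuals rather than standardized Pearson residuals (Pearson residuals would come from residuals(reg, type = "pearson")). The second histogram plots the deviance residuals.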
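Since the different residual types are easy to confuse, the sketch below extracts each one from a small fitted model and checks what `$residuals` actually contains. The data are simulated stand-ins (an assumption, since the SIRDS file is not available here), and the variable names are chosen for illustration.

```r
set.seed(42)
# Simulated (hypothetical) data standing in for the SIRDS file
x <- runif(50, 1, 3)
y <- rbinom(50, 1, plogis(-3.6 + 1.74 * x))
fit <- glm(y ~ x, family = binomial)

working      <- residuals(fit, type = "working")
pearson      <- residuals(fit, type = "pearson")
deviance_res <- residuals(fit, type = "deviance")
std_pearson  <- rstandard(fit, type = "pearson")  # standardized Pearson residuals

# fit$residuals holds the working residuals
print(all.equal(unname(fit$residuals), unname(working)))

# Pearson residuals by hand: (y - p) / sqrt(p * (1 - p))
p_hat <- fitted(fit)
pearson_by_hand <- (y - p_hat) / sqrt(p_hat * (1 - p_hat))
print(all.equal(unname(pearson), unname(pearson_by_hand)))

# Squared deviance residuals add up to the residual deviance
print(all.equal(sum(deviance_res^2), deviance(fit)))
```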
Tattoos <- read.csv("~/math242/C7 Tatoos.csv")
reg <- glm(removal~method+gender+depth,Tattoos,family = binomial)
summary(reg)
##
## Call:
## glm(formula = removal ~ method + gender + depth, family = binomial,
## data = Tattoos)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7505 -0.9462 -0.5671 0.9395 1.7425
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.1139 0.7658 0.149 0.88180
## method 0.6993 0.6241 1.121 0.26246
## gender 0.4754 0.7320 0.649 0.51606
## depth -1.8601 0.6133 -3.033 0.00242 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 76.228 on 54 degrees of freedom
## Residual deviance: 64.575 on 51 degrees of freedom
## AIC: 72.575
##
## Number of Fisher Scoring iterations: 4
e <- exp(1) # Euler's number; e^x below is equivalent to exp(x)
e^(.6993 + qnorm(.975, mean = 0, sd= 1)*.6241)
## [1] 6.838125
e^(.6993 - qnorm(.975, mean = 0, sd= 1)*.6241)
## [1] 0.5921984
e^(.4754 + qnorm(.975, mean=0, sd=1)*.7320)
## [1] 6.753721
e^(.4754 - qnorm(.975, mean=0, sd=1)*.7320)
## [1] 0.3831634
e^(-1.8601 + qnorm(.975, mean=0, sd=1)*.6133)
## [1] 0.5178581
e^(-1.8601 - qnorm(.975, mean=0, sd=1)*.6133)
## [1] 0.04678719
1-pchisq(76.228-64.575, 3)
## [1] 0.008671482
Using the Wald test, depth appears to be the only significant variable because it is the only one whose 95% confidence interval for the odds ratio (0.047 to 0.518) does not include 1. An interval that excludes 1 means that a change in depth is associated with a change in the odds of removal. Using the likelihood ratio test, the drop in deviance gives a G statistic of 76.228 - 64.575 = 11.653 on 3 degrees of freedom. Its p-value (.00867) leads us to reject the null hypothesis that none of the variables is related to removal and conclude that at least one variable is statistically significant.
## b)
reg2 <- glm(removal~gender+depth,Tattoos,family = binomial)
summary(reg2)
##
## Call:
## glm(formula = removal ~ gender + depth, family = binomial, data = Tattoos)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6123 -0.8541 -0.6914 0.7978 1.7597
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.4927 0.6897 0.714 0.47502
## gender 0.4888 0.7291 0.670 0.50254
## depth -1.8020 0.5989 -3.009 0.00262 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 76.228 on 54 degrees of freedom
## Residual deviance: 65.863 on 52 degrees of freedom
## AIC: 71.863
##
## Number of Fisher Scoring iterations: 4
e^(.4888 + qnorm(.975, mean=0, sd=1)*.7291)
## [1] 6.806035
e^(.4888 - qnorm(.975, mean=0, sd=1)*.7291)
## [1] 0.3905459
e^(-1.8020 + qnorm(.975, mean=0, sd=1)*.5989)
## [1] 0.5335633
e^(-1.8020 - qnorm(.975, mean=0, sd=1)*.5989)
## [1] 0.05100547
1- pchisq(76.228-65.863, 2)
## [1] 0.005613954
Using the Wald test, we have evidence that depth is significant because its confidence interval for the odds ratio (0.051 to 0.534) does not contain 1, meaning that a change in depth is associated with a change in the odds of removal. Using the likelihood ratio test, the drop in deviance gives a G statistic of 76.228 - 65.863 = 10.365 on 2 degrees of freedom. Its p-value (.0056) leads us to reject the null hypothesis and conclude that at least one of our variables affects the probability of removal.
## c)
1- pchisq(65.863- 64.575, 1)
## [1] 0.2564169
Using the drop-in-deviance test between the first and second models, we conclude that method is not significant. The drop in deviance of 65.863 - 64.575 = 1.288 on 1 degree of freedom gives a p-value (.2564) that leads us to fail to reject the null hypothesis that method does not affect the probability of removal once gender and depth are in the model.
## d)
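The drop-in-deviance comparison from part c) can also be run directly on two fitted models with anova(), which avoids copying deviances by hand. The sketch below uses simulated data with the same variable names as the tattoo models; the data, the 0/1 coding of the predictors, and the effect sizes are all assumptions for illustration.

```r
set.seed(7)
# Simulated (hypothetical) data with the same variable names as the tattoo models
n <- 55
sim <- data.frame(
  method = rbinom(n, 1, 0.5),
  gender = rbinom(n, 1, 0.5),
  depth  = rbinom(n, 1, 0.5)
)
sim$removal <- rbinom(n, 1, plogis(0.5 - 1.8 * sim$depth))

full    <- glm(removal ~ method + gender + depth, data = sim, family = binomial)
reduced <- glm(removal ~ gender + depth, data = sim, family = binomial)

# Manual drop-in-deviance test, as in the text
G <- deviance(reduced) - deviance(full)
p_manual <- pchisq(G, df = 1, lower.tail = FALSE)

# Same test via anova() on the nested models
tab <- anova(reduced, full, test = "Chisq")
p_anova <- tab[2, "Pr(>Chi)"]
print(c(manual = p_manual, anova = p_anova))
```

Both routes give the same p-value; anova() simply reads the deviances and degrees of freedom from the fitted objects.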
reg3 <- glm(removal~method, Tattoos,family = binomial)
summary(reg3)
##
## Call:
## glm(formula = removal ~ method, family = binomial, data = Tattoos)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.256 -1.256 -1.026 1.101 1.337
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.3677 0.4336 -0.848 0.396
## method 0.5500 0.5570 0.988 0.323
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 76.228 on 54 degrees of freedom
## Residual deviance: 75.242 on 53 degrees of freedom
## AIC: 79.242
##
## Number of Fisher Scoring iterations: 4
e^(.55 + qnorm(.975, mean = 0, sd= 1)*.557)
## [1] 5.16394
e^(.55 - qnorm(.975, mean = 0, sd= 1)*.557)
## [1] 0.5817585
1- pchisq(76.228- 75.242, 1)
## [1] 0.320722
This is a logistic model with method as the only explanatory variable. Both the Wald test and the likelihood ratio test indicate that it is a poor model. The Wald test shows that the confidence interval for the method odds ratio (0.58 to 5.16) contains 1, so changing the value of method may leave the odds of removal unaffected. The likelihood ratio test gives a drop in deviance with a p-value (.321) that leads us to fail to reject the null hypothesis that method has no effect on the probability of removal.
## e)
In determining why method is not a significant variable for explaining the probability of removal, the approach used in part c is better than the one in part d for two reasons. First, it is a lot less work because it uses only one statistical test. Second, part c compares two models that we know are predictive of the probability of removal, which more accurately shows that method does not contribute predictive power to the successful models. In part d, by contrast, we cannot say that the model is predictive, and we can only compare the method-only model to the restricted (intercept-only) model. In conclusion, the approach from part c provides more conclusive evidence that method does not affect the probability of removal.
Cancer <- read.csv("~/math242/C7 Cancer2.csv")
Cancer <- Cancer %>%
mutate(Radius2 = (Radius)^2, RadCon= Radius*Concave)
reg <- glm(Malignant.~Radius+Concave+Radius2+RadCon, Cancer, family= binomial)
summary(reg)
##
## Call:
## glm(formula = Malignant. ~ Radius + Concave + Radius2 + RadCon,
## family = binomial, data = Cancer)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2962 -0.2579 -0.1203 0.1095 2.9112
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.3563 7.7838 -1.202 0.229
## Radius 0.3460 3.7795 0.092 0.927
## Concave 6.5378 3.0317 2.157 0.031 *
## Radius2 0.3515 0.4646 0.757 0.449
## RadCon -0.8061 0.7493 -1.076 0.282
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 751.44 on 568 degrees of freedom
## Residual deviance: 222.31 on 564 degrees of freedom
## AIC: 232.31
##
## Number of Fisher Scoring iterations: 7
e^(.346 + qnorm(.975, mean=0, sd=1)*3.7795)
## [1] 2330.141
e^(.346 - qnorm(.975, mean=0, sd= 1)*3.7795)
## [1] 0.0008573332
e^(6.5378 + qnorm(.975, mean=0, sd=1)*3.0317)
## [1] 262977.3
e^(6.5378 - qnorm(.975, mean=0, sd= 1)*3.0317)
## [1] 1.814441
e^(.3515 + qnorm(.975, mean=0, sd=1)*.4646)
## [1] 3.53283
e^(.3515 - qnorm(.975, mean=0, sd= 1)*.4646)
## [1] 0.5717238
e^(-.8061 + qnorm(.975, mean=0, sd=1)*.7493)
## [1] 1.939637
e^(-.8061 - qnorm(.975, mean=0, sd= 1)*.7493)
## [1] 0.1028276
1- pchisq(751.44-222.31, 4)
## [1] 0
With this model, we use the Wald test and the likelihood ratio test to determine the significance of the variables. The Wald test indicates that Concave is the only significant predictor, as it is the only variable whose confidence interval for the odds ratio (1.81 to 262977.3) does not include 1, meaning that a change in concavity is associated with a change in the odds of malignancy. The likelihood ratio test gives a G statistic equal to the null deviance minus the residual deviance, 751.44 - 222.31 = 529.13, on 4 degrees of freedom. Its p-value is close to 0, meaning that at least one variable in this model is a significant predictor.
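The printed p-value of exactly 0 is a rounding effect: pchisq() returns a value so close to 1 that subtracting it from 1 leaves nothing. A small sketch of how to get the upper tail directly, using the deviances from the printout:

```r
G  <- 751.44 - 222.31   # null deviance minus residual deviance
df <- 568 - 564         # four predictors added to the null model

p_subtract <- 1 - pchisq(G, df)                  # rounds to exactly 0
p_upper    <- pchisq(G, df, lower.tail = FALSE)  # upper tail computed directly
print(p_subtract)
print(p_upper)
```

The second value is tiny but positive, so the conclusion is unchanged; the p-value is just reported more honestly as "very small" rather than zero.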
reg2 <- glm(Malignant.~Radius+Concave+RadCon, Cancer, family= binomial)
summary(reg2)
##
## Call:
## glm(formula = Malignant. ~ Radius + Concave + RadCon, family = binomial,
## data = Cancer)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2951 -0.2601 -0.1062 0.1445 2.9431
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -15.0357 2.4698 -6.088 1.14e-09 ***
## Radius 3.1889 0.6050 5.271 1.36e-07 ***
## Concave 6.4551 3.0460 2.119 0.0341 *
## RadCon -0.7784 0.7472 -1.042 0.2976
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 751.44 on 568 degrees of freedom
## Residual deviance: 222.91 on 565 degrees of freedom
## AIC: 230.91
##
## Number of Fisher Scoring iterations: 7
e^(3.1889 + qnorm(.975, mean=0, sd=1)*.6050)
## [1] 79.41428
e^(3.1889 - qnorm(.975, mean=0, sd= 1)*.6050)
## [1] 7.412159
e^(6.4551 + qnorm(.975, mean=0, sd=1)*3.0460)
## [1] 248985.6
e^(6.4551 - qnorm(.975, mean=0, sd= 1)*3.0460)
## [1] 1.624256
e^(-.7784 + qnorm(.975, mean=0, sd=1)*.7472)
## [1] 1.985926
e^(-.7784 - qnorm(.975, mean=0, sd= 1)*.7472)
## [1] 0.1061518
1- pchisq(751.44- 222.91, 3)
## [1] 0
1- pchisq(222.91-222.31, 1)
## [1] 0.438578
With this model, we again use the Wald test and the likelihood ratio test to determine the significance of the variables. The Wald test indicates that Radius and Concave are the only significant predictors, as they are the only variables whose confidence intervals for the odds ratio do not include 1. This means that the odds of malignancy are affected by changes in concavity and by changes in radius. The likelihood ratio test gives a G statistic equal to the null deviance minus the residual deviance, 751.44 - 222.91 = 528.53, on 3 degrees of freedom. Its p-value is close to 0, meaning that at least one variable in this model is a significant predictor. We also ran a drop-in-deviance test from the first model to the second and found no evidence that the Radius squared variable is significant, so it should not be included in the model. The test gives a p-value of .438578, which leads us to conclude that this model does not fit significantly worse than the previous one. This model also has two significant variables and one fewer parameter, and its AIC is lower (230.91 versus 232.31). Therefore we should keep this model and not include Radius squared.
Cancer %>%
ggplot(aes(Radius,Radius2))+
geom_point()
Cancer2 <- Cancer %>%
select(Radius, Radius2)
cor(Cancer2)
## Radius Radius2
## Radius 1.000000 0.988119
## Radius2 0.988119 1.000000
This scatterplot and correlation table show that Radius and Radius squared (Radius*Radius) are highly correlated (r = .988119), which is no surprise.
I believe that Radius is important in this model because when we removed Radius squared, the p-value for Radius became extremely small (1.36e-07). It had a high p-value in part A because multicollinearity inflates the standard errors, which makes it hard to tell which of the correlated variables, if any, affect the response. Once we removed one of the two variables, we saw that the other is actually very significant.
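One rough way to gauge how much the correlation between Radius and Radius squared inflates the standard error is the variance inflation factor, 1 / (1 - r^2). This is only a heuristic here: the formula comes from linear regression with two predictors, while these models are logistic and contain other terms, so the comparison below is a sanity check rather than an exact result. All numbers are taken from the printouts.

```r
r <- 0.988119                 # cor(Radius, Radius2) from the correlation table
vif <- 1 / (1 - r^2)          # variance inflation factor for two predictors
se_ratio <- 3.7795 / 0.6050   # SE of Radius with Radius2 in the model / without it
print(round(c(VIF = vif, sqrt_VIF = sqrt(vif), observed_SE_ratio = se_ratio), 2))
```

The square root of the VIF (about 6.5) is close to the observed ratio of standard errors (about 6.2), which is consistent with collinearity being the main reason Radius looked insignificant in the first model.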
reg3 <- glm(Malignant.~Radius+Concave, Cancer, family= binomial)
summary(reg3)
##
## Call:
## glm(formula = Malignant. ~ Radius + Concave, family = binomial,
## data = Cancer)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3698 -0.2839 -0.1327 0.1160 2.8343
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -13.1320 1.4932 -8.795 < 2e-16 ***
## Radius 2.7175 0.3663 7.418 1.19e-13 ***
## Concave 3.3192 0.3545 9.362 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 751.44 on 568 degrees of freedom
## Residual deviance: 224.02 on 566 degrees of freedom
## AIC: 230.02
##
## Number of Fisher Scoring iterations: 7
e^(2.7175 + qnorm(.975, mean=0, sd=1)*.3663)
## [1] 31.04491
e^(2.7175 - qnorm(.975, mean=0, sd= 1)*.3663)
## [1] 7.385844
e^(3.3192 + qnorm(.975, mean=0, sd=1)*.3545)
## [1] 55.3683
e^(3.3192 - qnorm(.975, mean=0, sd= 1)*.3545)
## [1] 13.79619
1- pchisq(751.44- 224.02, 2)
## [1] 0
1- pchisq(224.02- 222.91, 1)
## [1] 0.2920819
With this model, we again use the Wald test and the likelihood ratio test to determine the significance of the variables. The Wald test indicates that Radius and Concave are both significant predictors, as both have confidence intervals for the odds ratio that do not include 1. This means that the odds of malignancy are affected by changes in concavity and by changes in radius. The likelihood ratio test gives a G statistic equal to the null deviance minus the residual deviance, 751.44 - 224.02 = 527.42, on 2 degrees of freedom. Its p-value is close to 0, meaning that at least one variable in this model is a significant predictor. We also ran a drop-in-deviance test from the previous model to this one and found no evidence that the Radius*Concave interaction (RadCon) is significant, so it should not be included in the model. The test gives a p-value of .2920819, which leads us to conclude that this model does not fit significantly worse than the previous one. This model still has two significant variables, but it drops a nonsignificant one, and its AIC is slightly lower (230.02 versus 230.91). Therefore we should keep this model and not include the Radius*Concave interaction.
reg4 <- glm(Malignant.~Concave, Cancer, family= binomial)
summary(reg4)
##
## Call:
## glm(formula = Malignant. ~ Concave, family = binomial, data = Cancer)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0030 -0.3361 -0.3361 0.5375 2.4091
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.8455 0.2360 -12.06 <2e-16 ***
## Concave 4.7070 0.3069 15.34 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 751.44 on 568 degrees of freedom
## Residual deviance: 323.34 on 567 degrees of freedom
## AIC: 327.34
##
## Number of Fisher Scoring iterations: 5
e^(4.7070 + qnorm(.975, mean=0, sd=1)*.3069)
## [1] 202.0495
e^(4.7070 - qnorm(.975, mean=0, sd= 1)*.3069)
## [1] 60.67229
1- pchisq(751.44- 323.34, 1)
## [1] 0
1- pchisq(323.34- 224.02, 1)
## [1] 0
With this model, we again use the Wald test and the likelihood ratio test to determine the significance of the variables. The Wald test indicates that the Concave variable is statistically significant, as its confidence interval for the odds ratio (60.67 to 202.05) does not contain 1, which means that a change in concavity is associated with a change in the odds of malignancy. The likelihood ratio test gives a G statistic equal to the null deviance minus the residual deviance, 751.44 - 323.34 = 428.10, on 1 degree of freedom. Its p-value is close to 0, meaning that Concave is a significant predictor. We also ran a drop-in-deviance test from the previous model to this one and found evidence that the Radius variable is significant and should be included in the model. The drop in deviance of 323.34 - 224.02 = 99.32 on 1 degree of freedom gives a p-value close to 0, which leads us to conclude that this model fits significantly worse than the previous one. Therefore we should keep Radius as a variable in our model because it adds predictive power.
summary(reg3)
##
## Call:
## glm(formula = Malignant. ~ Radius + Concave, family = binomial,
## data = Cancer)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3698 -0.2839 -0.1327 0.1160 2.8343
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -13.1320 1.4932 -8.795 < 2e-16 ***
## Radius 2.7175 0.3663 7.418 1.19e-13 ***
## Concave 3.3192 0.3545 9.362 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 751.44 on 568 degrees of freedom
## Residual deviance: 224.02 on 566 degrees of freedom
## AIC: 230.02
##
## Number of Fisher Scoring iterations: 7
In conclusion, we decided that the best logistic model is:
Probability of Malignant = e^(-13.1320 + 2.7175(Radius) + 3.3192(Concave)) / (1 + e^(-13.1320 + 2.7175(Radius) + 3.3192(Concave)))
We chose this model because the Wald test showed that all of its variables are statistically significant, and the likelihood ratio test showed that at least one of its variables is significant. In addition, we ran drop-in-deviance tests to decide which variables were not significant and could be dropped, and which were significant and should be kept. This model has the largest number of significant variables without including any nonsignificant ones. It also has the lowest AIC (230.02) of all the models we tested, and it avoids the multicollinearity between Radius and Radius squared that obscured the results of the first model.
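To see the model comparison in one place, the AIC values can be rebuilt from the printed residual deviances. This sketch relies on the fact that, for a binary (0/1) response, the AIC reported by glm() equals the residual deviance plus twice the number of estimated coefficients; the data frame and its column names are assumptions made for illustration.

```r
# Residual deviance and number of estimated coefficients, from the printouts
models <- data.frame(
  model = c("Radius + Concave + Radius2 + RadCon",
            "Radius + Concave + RadCon",
            "Radius + Concave",
            "Concave"),
  deviance = c(222.31, 222.91, 224.02, 323.34),
  k = c(5, 4, 3, 2),            # coefficients, including the intercept
  stringsAsFactors = FALSE
)
models$AIC <- models$deviance + 2 * models$k

print(models[order(models$AIC), ])
best <- models$model[which.min(models$AIC)]
print(best)
```

The rebuilt values match the AIC lines in the four summaries, and the Radius + Concave model comes out lowest.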