This question is based on one of the data sets discussed in an unpublished manuscript by Powell, T. and Sheather, S. (2008) entitled “A Theory of Extreme Competition”. According to Powell and Sheather:
This paper develops a model of competitive performance when populations compete …. We present a theoretical framework … and empirical tests in chess and … nationalpageants. The findings show that performance in these domains is substantially predictable from a few observable features of population and economic geography.
In this question, we shall consider data from the Miss America pageant, which was founded in Atlantic City in 1921, and 81 pageants have been conducted through 2008. In particular, we will develop a logistic regression model for the proportion of top ten finalists for each US state for the years 2000 to 2008. According to Powell and Sheather:
Eligibility for the Miss America pageant is limited to never-married female U.S. citizens between the ages of 17 and 24. To measure population size, we obtained data for this demographic segment for each U.S. state and the District of Columbia from the 2000 U.S. Census. As a measure of participation inducements, we obtained data on the number of qualifying pageants conducted in each state, on the assumption that qualifying pageants reflect state-level infrastructure and resource commitments. As a geographic measure, we used the latitude and longitude of each state capital and Washington DC, on the assumption that state locations convey information about the regional cultural geography of beauty pageants (in particular, beauty pageants are widely believed to receive greater cultural support south of the Mason-Dixon line). To measure search efficacy, we obtained data on the total land and water area (in square miles) for each state and the District of Columbia, on the assumption that search is more difficult over larger geographic areas.
They consider the following outcome variable and potential predictor variables:
Y = Number of times each US state (and the District of Columbia) has produced a top ten finalist for the years 20000-2008.
\(x_1\) = log(population size)
\(x_2\) = log(average number of contestants in each state’s final qualifying pageant each year between 2002 and 2007)
\(x_3\) = log(geographic area of each state and the District of Columbia)
\(x_4\) = Latitude of each state capitol
\(x_5\) = Longitude of each state capitol
The data can be found on the course web site in the file MissAmericato2008.txt.
url1 <- 'https://raw.githubusercontent.com/jcp9010/MSDA/master/CUNY%20621/Week%208/MissAmericato2008.txt'
miss_america <- read.table(url1, header=TRUE)
head(miss_america)
## abbreviation Top10 LogPopulation LogContestants LogTotalArea Latitude
## 1 AL 6 11.9249 3.895894 10.8670 32.3833
## 2 AK 0 9.8011 2.708050 13.4049 58.3667
## 3 AZ 0 12.0543 2.862201 11.6439 33.4333
## 4 AR 4 11.2702 3.766997 10.8814 34.7333
## 5 CA 5 14.0005 3.935740 12.0058 38.5167
## 6 CO 0 11.8820 2.944439 11.5530 39.7500
## Longitude
## 1 86.367
## 2 134.583
## 3 112.017
## 4 92.233
## 5 121.500
## 6 104.867
?Logistic Model? Y is a discrete continuous variable. Will create a linear regression
america_lm <- lm(Top10 ~ . - abbreviation, data=miss_america)
step_lm <- step(america_lm, direction="backward")
## Start: AIC=28.16
## Top10 ~ (abbreviation + LogPopulation + LogContestants + LogTotalArea +
## Latitude + Longitude) - abbreviation
##
## Df Sum of Sq RSS AIC
## - Longitude 1 0.4273 70.443 26.472
## - LogTotalArea 1 2.0571 72.073 27.638
## <none> 70.016 28.162
## - Latitude 1 4.5706 74.586 29.387
## - LogPopulation 1 6.1647 76.180 30.465
## - LogContestants 1 13.2496 83.265 35.000
##
## Step: AIC=26.47
## Top10 ~ LogPopulation + LogContestants + LogTotalArea + Latitude
##
## Df Sum of Sq RSS AIC
## - LogTotalArea 1 1.7088 72.152 25.694
## <none> 70.443 26.472
## - LogPopulation 1 5.7553 76.198 28.477
## - Latitude 1 5.9023 76.345 28.575
## - LogContestants 1 12.8294 83.272 33.005
##
## Step: AIC=25.69
## Top10 ~ LogPopulation + LogContestants + Latitude
##
## Df Sum of Sq RSS AIC
## <none> 72.152 25.694
## - LogPopulation 1 5.3107 77.462 27.316
## - Latitude 1 8.6232 80.775 29.452
## - LogContestants 1 11.2229 83.375 31.067
step_lm
##
## Call:
## lm(formula = Top10 ~ LogPopulation + LogContestants + Latitude,
## data = miss_america)
##
## Coefficients:
## (Intercept) LogPopulation LogContestants Latitude
## -4.20678 0.41111 1.39373 -0.08128
summary(step_lm)
##
## Call:
## lm(formula = Top10 ~ LogPopulation + LogContestants + Latitude,
## data = miss_america)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5770 -0.8059 -0.1069 0.7957 2.8228
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.20678 2.84135 -1.481 0.14540
## LogPopulation 0.41111 0.22103 1.860 0.06916 .
## LogContestants 1.39373 0.51547 2.704 0.00951 **
## Latitude -0.08128 0.03430 -2.370 0.02194 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.239 on 47 degrees of freedom
## Multiple R-squared: 0.5163, Adjusted R-squared: 0.4855
## F-statistic: 16.72 on 3 and 47 DF, p-value: 1.571e-07
Using Marginal Plots
library(car)
## Warning: package 'car' was built under R version 3.4.3
avPlots(step_lm, id.n=2)
Looks like observation 12 and 2 are likely leveraging the Latitude variable. Though, none of the points seem to go beyond the Cook’s distance threshold.
par(mfrow=c(2,2))
plot(step_lm)
Let’s remove 12 and 2.
miss_america_1 <- miss_america[-c(2,12),]
america_lm_2 <- lm(Top10 ~ . - abbreviation, data=miss_america_1)
step_lm2 <- step(america_lm_2, direction="backward")
## Start: AIC=26.75
## Top10 ~ (abbreviation + LogPopulation + LogContestants + LogTotalArea +
## Latitude + Longitude) - abbreviation
##
## Df Sum of Sq RSS AIC
## - Longitude 1 0.1999 66.409 24.897
## - LogTotalArea 1 1.0962 67.306 25.554
## - Latitude 1 2.3470 68.556 26.456
## <none> 66.210 26.749
## - LogPopulation 1 6.5021 72.712 29.339
## - LogContestants 1 11.8360 78.046 32.808
##
## Step: AIC=24.9
## Top10 ~ LogPopulation + LogContestants + LogTotalArea + Latitude
##
## Df Sum of Sq RSS AIC
## - Latitude 1 2.5611 68.970 24.751
## <none> 66.409 24.897
## - LogTotalArea 1 3.0858 69.495 25.122
## - LogPopulation 1 7.8522 74.262 28.373
## - LogContestants 1 11.6643 78.074 30.826
##
## Step: AIC=24.75
## Top10 ~ LogPopulation + LogContestants + LogTotalArea
##
## Df Sum of Sq RSS AIC
## <none> 68.970 24.751
## - LogTotalArea 1 3.9951 72.966 25.510
## - LogPopulation 1 7.8813 76.852 28.053
## - LogContestants 1 25.4010 94.371 38.115
step_lm2
##
## Call:
## lm(formula = Top10 ~ LogPopulation + LogContestants + LogTotalArea,
## data = miss_america_1)
##
## Coefficients:
## (Intercept) LogPopulation LogContestants LogTotalArea
## -8.4173 0.5197 2.0410 -0.2220
summary(step_lm2)
##
## Call:
## lm(formula = Top10 ~ LogPopulation + LogContestants + LogTotalArea,
## data = miss_america_1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6294 -0.6670 -0.0260 0.6398 3.4173
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.4173 2.2450 -3.749 0.000504 ***
## LogPopulation 0.5197 0.2292 2.268 0.028202 *
## LogContestants 2.0410 0.5013 4.071 0.000187 ***
## LogTotalArea -0.2220 0.1375 -1.614 0.113411
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.238 on 45 degrees of freedom
## Multiple R-squared: 0.5228, Adjusted R-squared: 0.491
## F-statistic: 16.43 on 3 and 45 DF, p-value: 2.365e-07
avPlots(step_lm2, id.n=1)
Perhaps better, not perfect. 9 seems to be an outlier here. Could repeat the same exercise or leave it.
**Dr. Hans Riedwyl, a statistician at the University of Berne was asked by local authorities to analyze data on Swiss Bank notes. In particular, the statistician was asked to develop a model to predict whether a particular banknote is counterfeit (y = 0) or genuine (y = 1) based on the following physical measurements (in millimeters) of 100 genuine and 100 counterfeit Swiss Bank notes:**
Length = length of the banknote
Left = length of the left edge of the banknote
Right = length of the right edge of the banknote
Top = distance from the image to the top edge
Bottom = distance from the image to the bottom edge
Diagonal = length of the diagnoal
The data were originally reported in Flury and Riedwyl (1988) and they can be found in alr3 library and on the book web site in the file banknote.txt. Figure 8.20 contains a plot of Bottom and Diagonal with different symbols for the two values of y.
url2 <- 'https://raw.githubusercontent.com/jcp9010/MSDA/master/CUNY%20621/Week%208/banknote.txt'
banknote <- read.table(url2, header = TRUE)
head(banknote)
## Length Left Right Bottom Top Diagonal Y
## 1 214.8 131.0 131.1 9.0 9.7 141.0 0
## 2 214.6 129.7 129.7 8.1 9.5 141.7 0
## 3 214.8 129.7 129.7 8.7 9.6 142.2 0
## 4 214.8 129.7 129.6 7.5 10.4 142.0 0
## 5 215.0 129.6 129.7 10.4 7.7 141.8 0
## 6 215.7 130.8 130.5 9.0 10.1 141.4 0
bank_glm <- glm(Y ~ Bottom + Diagonal, family="binomial", data=banknote)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(bank_glm)
##
## Call:
## glm(formula = Y ~ Bottom + Diagonal, family = "binomial", data = banknote)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.51e-04 -2.00e-08 0.00e+00 2.00e-08 6.58e-04
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 99422.9 5433597.5 0.018 0.985
## Bottom 688.7 37796.3 0.018 0.985
## Diagonal -751.8 41093.2 -0.018 0.985
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2.7726e+02 on 199 degrees of freedom
## Residual deviance: 1.0424e-06 on 197 degrees of freedom
## AIC: 6
##
## Number of Fisher Scoring iterations: 25
test_banknote <- banknote[,-7]
round(predict(bank_glm, test_banknote, type="response"),1)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108
## 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
## 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 199 200
## 1 1
After looking at the predictions above and the actual data, they are exactly the same.
require(ggplot2)
## Loading required package: ggplot2
ggplot(banknote, aes(x=Diagonal, y=Bottom, color=Y)) + geom_point()
The dataset wbca comes from a study of breast cancer in Wisconsin. There are 681 cases of potentially cancerous tumors of which 238 are actually malignant. Determining whether a tumor is really malignant is traditionally determined by an invasive surgical procedure. The purpose of this study was to determine whether a new procedure called fine needle aspiration, which draws only a small sample of tissue, could be effective in determining tumor status.
Class as the response and the other nine variables as predictors. Report the residual deviance and associated degrees of freedom. Can this information be used to determine if this model fits the data?library(faraway)
##
## Attaching package: 'faraway'
## The following objects are masked from 'package:car':
##
## logit, vif
data(wbca, package="faraway")
head(wbca)
## Class Adhes BNucl Chrom Epith Mitos NNucl Thick UShap USize
## 1 1 1 1 3 2 1 1 5 1 1
## 2 1 5 10 3 7 1 2 5 4 4
## 3 1 1 2 3 2 1 1 3 1 1
## 4 1 1 4 3 3 1 7 6 8 8
## 5 1 3 1 3 2 1 1 4 1 1
## 6 0 8 10 9 7 1 7 8 10 10
step function.)glm_mod <- glm(Class ~ ., family="binomial", data = wbca)
glm_mod_2 <- step(glm_mod, direction="backward")
## Start: AIC=109.46
## Class ~ Adhes + BNucl + Chrom + Epith + Mitos + NNucl + Thick +
## UShap + USize
##
## Df Deviance AIC
## - USize 1 89.523 107.52
## - Epith 1 89.613 107.61
## - UShap 1 90.627 108.63
## <none> 89.464 109.46
## - Mitos 1 93.551 111.55
## - NNucl 1 95.204 113.20
## - Adhes 1 98.844 116.84
## - Chrom 1 99.841 117.84
## - BNucl 1 109.000 127.00
## - Thick 1 110.239 128.24
##
## Step: AIC=107.52
## Class ~ Adhes + BNucl + Chrom + Epith + Mitos + NNucl + Thick +
## UShap
##
## Df Deviance AIC
## - Epith 1 89.662 105.66
## - UShap 1 91.355 107.36
## <none> 89.523 107.52
## - Mitos 1 93.552 109.55
## - NNucl 1 95.231 111.23
## - Adhes 1 99.042 115.04
## - Chrom 1 100.153 116.15
## - BNucl 1 109.064 125.06
## - Thick 1 110.465 126.47
##
## Step: AIC=105.66
## Class ~ Adhes + BNucl + Chrom + Mitos + NNucl + Thick + UShap
##
## Df Deviance AIC
## <none> 89.662 105.66
## - UShap 1 91.884 105.88
## - Mitos 1 93.714 107.71
## - NNucl 1 95.853 109.85
## - Adhes 1 100.126 114.13
## - Chrom 1 100.844 114.84
## - BNucl 1 109.762 123.76
## - Thick 1 110.632 124.63
glm_mod_2
##
## Call: glm(formula = Class ~ Adhes + BNucl + Chrom + Mitos + NNucl +
## Thick + UShap, family = "binomial", data = wbca)
##
## Coefficients:
## (Intercept) Adhes BNucl Chrom Mitos
## 11.0333 -0.3984 -0.4192 -0.5679 -0.6456
## NNucl Thick UShap
## -0.2915 -0.6216 -0.2541
##
## Degrees of Freedom: 680 Total (i.e. Null); 673 Residual
## Null Deviance: 881.4
## Residual Deviance: 89.66 AIC: 105.7
Residual Deviance is 89.66 with 673 Degrees of Freedom, AIC 105.7.
test_set <- data.frame(1,1,3,2,1,1,4,1,1)
colnames(test_set) <- c("Adhes", "BNucl", "Chrom","Epith", "Mitos","NNucl","Thick","UShap","USize")
a <- predict(glm_mod_2, test_set, type="response", se.fit=TRUE)
a
## $fit
## 1
## 0.9921115
##
## $se.fit
## 1
## 0.004551129
##
## $residual.scale
## [1] 1
print(paste0("2 SD above mean: ", a$fit + 2 * a$se.fit))
## [1] "2 SD above mean: 1.00121374555702"
print(paste0("2 SD below mean: ", a$fit - 2 * a$se.fit))
## [1] "2 SD below mean: 0.983009231045569"
# Working with model a (the stepwise "backwards" model)
library(caret)
## Warning: package 'caret' was built under R version 3.4.3
## Loading required package: lattice
##
## Attaching package: 'lattice'
## The following object is masked from 'package:faraway':
##
## melanoma
## Warning in as.POSIXlt.POSIXct(Sys.time()): unknown timezone 'zone/tz/2018c.
## 1.0/zoneinfo/America/New_York'
Train <- createDataPartition(wbca$Class, p=0.7, list=FALSE)
training <- wbca[Train, ]
testing <- wbca[-Train, ]
pred1 <- round(predict(glm_mod_2, newdata=testing,type="response"),3)
pred11 <- ifelse(pred1 < 0.5, 0, 1)
ans <- confusionMatrix(pred11,testing$Class,positive="0")
ans
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 74 3
## 1 2 125
##
## Accuracy : 0.9755
## 95% CI : (0.9437, 0.992)
## No Information Rate : 0.6275
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9477
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9737
## Specificity : 0.9766
## Pos Pred Value : 0.9610
## Neg Pred Value : 0.9843
## Prevalence : 0.3725
## Detection Rate : 0.3627
## Detection Prevalence : 0.3775
## Balanced Accuracy : 0.9751
##
## 'Positive' Class : 0
##
pred21 <- ifelse(pred1 < 0.9, 0, 1)
ans1 <- confusionMatrix(pred21,testing$Class,positive="0")
ans1
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 76 6
## 1 0 122
##
## Accuracy : 0.9706
## 95% CI : (0.9371, 0.9891)
## No Information Rate : 0.6275
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.9381
## Mcnemar's Test P-Value : 0.04123
##
## Sensitivity : 1.0000
## Specificity : 0.9531
## Pos Pred Value : 0.9268
## Neg Pred Value : 1.0000
## Prevalence : 0.3725
## Detection Rate : 0.3725
## Detection Prevalence : 0.4020
## Balanced Accuracy : 0.9766
##
## 'Positive' Class : 0
##
Please see above for similar example.
The National Institute of Diabetes and Digestive and Kidney Diseases conducted a study on 768 adult female Pima Indians living near Phoenix. The purpose of the study was to investigate factors related to diabetes. The data may be found in the dataset pima.
data(pima, package="faraway")
head(pima)
## pregnant glucose diastolic triceps insulin bmi diabetes age test
## 1 6 148 72 35 0 33.6 0.627 50 1
## 2 1 85 66 29 0 26.6 0.351 31 0
## 3 8 183 64 0 0 23.3 0.672 32 1
## 4 1 89 66 23 94 28.1 0.167 21 0
## 5 0 137 40 35 168 43.1 2.288 33 1
## 6 5 116 74 0 0 25.6 0.201 30 0
summary(pima)
## pregnant glucose diastolic triceps
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
## Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## insulin bmi diabetes age
## Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00
## 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00
## Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
## Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## test
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
require(ggplot2)
require(reshape2)
## Loading required package: reshape2
## Warning: package 'reshape2' was built under R version 3.4.3
pima_melt <- melt(pima)
## No id variables; using all as measure variables
ggplot(pima_melt, aes(x=factor(variable), y=value)) + geom_boxplot() + facet_grid(~variable, scale="free")
ggplot(pima_melt, aes(x=value)) + geom_histogram(bins=50) + facet_wrap(~variable, scale="free")
Some of the data is skewed to the right i.e. pregnant, diabetes, age.
pima1 <- pima[,!names(pima) %in% c("pregnant", "diabetes", "age")]
pima1 <- cbind(pima1, l_preg = log(pima$pregnant), l_diabetes = log(pima$diabetes), l_age = log(pima$age))
head(pima1)
## glucose diastolic triceps insulin bmi test l_preg l_diabetes l_age
## 1 148 72 35 0 33.6 1 1.791759 -0.4668087 3.912023
## 2 85 66 29 0 26.6 0 0.000000 -1.0469691 3.433987
## 3 183 64 0 0 23.3 1 2.079442 -0.3974969 3.465736
## 4 89 66 23 94 28.1 0 0.000000 -1.7897615 3.044522
## 5 137 40 35 168 43.1 1 -Inf 0.8276781 3.496508
## 6 116 74 0 0 25.6 0 1.609438 -1.6044504 3.401197
melt_pima1 <- melt(pima1)
## No id variables; using all as measure variables
ggplot(melt_pima1, aes(x=value)) + geom_histogram(bins=50) + facet_wrap(~variable, scale="free")
## Warning: Removed 111 rows containing non-finite values (stat_bin).
Diabetes transformation appears much more normal. What about the others?
BoxCox?
# Reference: https://stackoverflow.com/questions/46485024/how-to-use-boxcoxtrans-function-in-r
library(MASS)
## Warning: package 'MASS' was built under R version 3.4.3
# create a list with the BoxCox objects
g <- apply(pima, 2, BoxCoxTrans)
g
## $pregnant
## Box-Cox Transformation
##
## 768 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 3.000 3.845 6.000 17.000
##
## Lambda could not be estimated; no transformation is applied
##
##
## $glucose
## Box-Cox Transformation
##
## 768 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 99.0 117.0 120.9 140.2 199.0
##
## Lambda could not be estimated; no transformation is applied
##
##
## $diastolic
## Box-Cox Transformation
##
## 768 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 62.00 72.00 69.11 80.00 122.00
##
## Lambda could not be estimated; no transformation is applied
##
##
## $triceps
## Box-Cox Transformation
##
## 768 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 23.00 20.54 32.00 99.00
##
## Lambda could not be estimated; no transformation is applied
##
##
## $insulin
## Box-Cox Transformation
##
## 768 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 30.5 79.8 127.2 846.0
##
## Lambda could not be estimated; no transformation is applied
##
##
## $bmi
## Box-Cox Transformation
##
## 768 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 27.30 32.00 31.99 36.60 67.10
##
## Lambda could not be estimated; no transformation is applied
##
##
## $diabetes
## Box-Cox Transformation
##
## 768 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0780 0.2437 0.3725 0.4719 0.6262 2.4200
##
## Largest/Smallest: 31
## Sample Skewness: 1.91
##
## Estimated Lambda: -0.1
## With fudge factor, Lambda = 0 will be used for transformations
##
##
## $age
## Box-Cox Transformation
##
## 768 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 21.00 24.00 29.00 33.24 41.00 81.00
##
## Largest/Smallest: 3.86
## Sample Skewness: 1.13
##
## Estimated Lambda: -1.1
##
##
## $test
## Box-Cox Transformation
##
## 768 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.349 1.000 1.000
##
## Lambda could not be estimated; no transformation is applied
This function confirms that the log transformation of diabetes is the correct one. We will perform one more transformation as recommended above, the inverse of ‘age’.
pima2 <- pima[,!names(pima) %in% c("diabetes", "age")]
pima2 <- cbind(pima2, l_diabetes = log(pima$diabetes), inv_age = 1/(pima$age))
head(pima2)
## pregnant glucose diastolic triceps insulin bmi test l_diabetes
## 1 6 148 72 35 0 33.6 1 -0.4668087
## 2 1 85 66 29 0 26.6 0 -1.0469691
## 3 8 183 64 0 0 23.3 1 -0.3974969
## 4 1 89 66 23 94 28.1 0 -1.7897615
## 5 0 137 40 35 168 43.1 1 0.8276781
## 6 5 116 74 0 0 25.6 0 -1.6044504
## inv_age
## 1 0.02000000
## 2 0.03225806
## 3 0.03125000
## 4 0.04761905
## 5 0.03030303
## 6 0.03333333
melt_pima2 <- melt(pima2)
## No id variables; using all as measure variables
ggplot(melt_pima2, aes(x=value)) + geom_histogram(bins=50) + facet_wrap(~variable, scale="free")
dm_glm <- step(glm(test ~ ., family="binomial", data=pima2), direction="backward")
## Start: AIC=730.44
## test ~ pregnant + glucose + diastolic + triceps + insulin + bmi +
## l_diabetes + inv_age
##
## Df Deviance AIC
## - triceps 1 712.46 728.46
## - insulin 1 713.91 729.91
## <none> 712.44 730.44
## - pregnant 1 719.36 735.36
## - diastolic 1 720.32 736.32
## - inv_age 1 722.64 738.64
## - l_diabetes 1 725.62 741.62
## - bmi 1 751.55 767.55
## - glucose 1 821.91 837.91
##
## Step: AIC=728.46
## test ~ pregnant + glucose + diastolic + insulin + bmi + l_diabetes +
## inv_age
##
## Df Deviance AIC
## - insulin 1 714.10 728.10
## <none> 712.46 728.46
## - pregnant 1 719.39 733.39
## - diastolic 1 720.45 734.45
## - inv_age 1 722.67 736.67
## - l_diabetes 1 725.87 739.87
## - bmi 1 756.49 770.49
## - glucose 1 824.11 838.11
##
## Step: AIC=728.1
## test ~ pregnant + glucose + diastolic + bmi + l_diabetes + inv_age
##
## Df Deviance AIC
## <none> 714.10 728.10
## - pregnant 1 721.20 733.20
## - diastolic 1 722.47 734.47
## - inv_age 1 725.55 737.55
## - l_diabetes 1 726.53 738.53
## - bmi 1 756.76 768.76
## - glucose 1 829.31 841.31
summary(dm_glm)
##
## Call:
## glm(formula = test ~ pregnant + glucose + diastolic + bmi + l_diabetes +
## inv_age, family = "binomial", data = pima2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4620 -0.6993 -0.3909 0.7078 2.8100
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.063155 0.872816 -5.801 6.59e-09 ***
## pregnant 0.088742 0.033634 2.638 0.008328 **
## glucose 0.032757 0.003434 9.540 < 2e-16 ***
## diastolic -0.014860 0.005190 -2.863 0.004194 **
## bmi 0.087050 0.014323 6.077 1.22e-09 ***
## l_diabetes 0.519868 0.149491 3.478 0.000506 ***
## inv_age -42.089307 12.501627 -3.367 0.000761 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 993.48 on 767 degrees of freedom
## Residual deviance: 714.10 on 761 degrees of freedom
## AIC: 728.1
##
## Number of Fisher Scoring iterations: 5
quantile(pima2$bmi)
## 0% 25% 50% 75% 100%
## 0.0 27.3 32.0 36.6 67.1
first_quant <- subset(pima2, pima2$bmi <= 27.3)
third_quant <- subset(pima2, pima2$bmi >= 32.0 & pima2$bmi <= 36.6)
first_quant_odds <- sum(first_quant$test == 1)/(length(first_quant$test))
third_quant_odds <- sum(third_quant$test == 1)/(length(third_quant$test))
print(paste0("Odds of Testing Positive for 1st Quartile: ", first_quant_odds))
## [1] "Odds of Testing Positive for 1st Quartile: 0.103092783505155"
print(paste0("Odds of Testing Positive for 3rd Quartile: ", third_quant_odds))
## [1] "Odds of Testing Positive for 3rd Quartile: 0.449275362318841"
# Straight from the data
pos_dm <- subset(pima, pima$test == 1)
neg_dm <- subset(pima, pima$test == 0)
par(mfrow=c(1,2))
boxplot(pos_dm$diastolic, main="Positive for DM")
boxplot(neg_dm$diastolic, main="Negative for DM")
pos_sum <- summary(pos_dm)
print("Positive for DM, statistics for diastolic blood pressures: ")
## [1] "Positive for DM, statistics for diastolic blood pressures: "
pos_sum[,3]
##
## "Min. : 0.00 " "1st Qu.: 66.00 " "Median : 74.00 "
##
## "Mean : 70.82 " "3rd Qu.: 82.00 " "Max. :114.00 "
neg_sum <- summary(neg_dm)
print("Negative for DM, statistics for diastolic blood pressures: ")
## [1] "Negative for DM, statistics for diastolic blood pressures: "
neg_sum[,3]
##
## "Min. : 0.00 " "1st Qu.: 62.00 " "Median : 70.00 "
##
## "Mean : 68.18 " "3rd Qu.: 78.00 " "Max. :122.00 "
The data suggests they are quite similar. In the regression model, diastolic is statistically significant. These are two different meanings. Diastoic in regression means if there’s significant in the way it impacts the target variable, whereas the previous question only talks about the descriptive statistics
# Reference: https://stats.stackexchange.com/questions/121490/interpretation-of-plot-glm-model
# Reference: https://stats.stackexchange.com/questions/45050/diagnostics-for-logistic-regression
# Reference: https://stat.ethz.ch/R-manual/R-devel/library/boot/html/glm.diag.plots.html
library(boot)
##
## Attaching package: 'boot'
## The following object is masked from 'package:lattice':
##
## melanoma
## The following objects are masked from 'package:faraway':
##
## logit, melanoma
## The following object is masked from 'package:car':
##
## logit
glm.diag.plots(dm_glm)
test_set <- data.frame(1,99,64,22,76,27,0.25,25)
colnames(test_set) <- c("pregnant", "glucose", "diastolic","triceps", "insulin","bmi","diabetes","age")
test_set <- cbind(test_set, l_diabetes = log(test_set$diabetes), inv_age = 1/(test_set$age))
test_set <- test_set[,!names(test_set) %in% c('diabetes','age')]
b <- predict(dm_glm, test_set, type="response", se.fit=TRUE)
b
## $fit
## 1
## 0.0608607
##
## $se.fit
## 1
## 0.01189065
##
## $residual.scale
## [1] 1
print(paste0("2 SD above mean: ",b$fit + 2 * b$se.fit))
## [1] "2 SD above mean: 0.0846419968308864"
print(paste0("2 SD below mean: ",b$fit - 2 * b$se.fit))
## [1] "2 SD below mean: 0.037079409523968"