MARR 8.2

This question is based on one of the data sets discussed in an unpublished manuscript by Powell, T. and Sheather, S. (2008) entitled “A Theory of Extreme Competition”. According to Powell and Sheather:

This paper develops a model of competitive performance when populations compete …. We present a theoretical framework … and empirical tests in chess and … national pageants. The findings show that performance in these domains is substantially predictable from a few observable features of population and economic geography.

In this question, we shall consider data from the Miss America pageant, which was founded in Atlantic City in 1921; 81 pageants had been conducted through 2008. In particular, we will develop a logistic regression model for the proportion of top ten finalists for each US state for the years 2000 to 2008. According to Powell and Sheather:

Eligibility for the Miss America pageant is limited to never-married female U.S. citizens between the ages of 17 and 24. To measure population size, we obtained data for this demographic segment for each U.S. state and the District of Columbia from the 2000 U.S. Census. As a measure of participation inducements, we obtained data on the number of qualifying pageants conducted in each state, on the assumption that qualifying pageants reflect state-level infrastructure and resource commitments. As a geographic measure, we used the latitude and longitude of each state capital and Washington DC, on the assumption that state locations convey information about the regional cultural geography of beauty pageants (in particular, beauty pageants are widely believed to receive greater cultural support south of the Mason-Dixon line). To measure search efficacy, we obtained data on the total land and water area (in square miles) for each state and the District of Columbia, on the assumption that search is more difficult over larger geographic areas.

They consider the following outcome variable and potential predictor variables:

  1. Y = Number of times each US state (and the District of Columbia) has produced a top ten finalist for the years 2000-2008.

  2. \(x_1\) = log(population size)

  3. \(x_2\) = log(average number of contestants in each state’s final qualifying pageant each year between 2002 and 2007)

  4. \(x_3\) = log(geographic area of each state and the District of Columbia)

  5. \(x_4\) = Latitude of each state capital

  6. \(x_5\) = Longitude of each state capital

The data can be found on the course web site in the file MissAmericato2008.txt.

url1 <- 'https://raw.githubusercontent.com/jcp9010/MSDA/master/CUNY%20621/Week%208/MissAmericato2008.txt'
miss_america <- read.table(url1, header=TRUE)
head(miss_america)
##   abbreviation Top10 LogPopulation LogContestants LogTotalArea Latitude
## 1           AL     6       11.9249       3.895894      10.8670  32.3833
## 2           AK     0        9.8011       2.708050      13.4049  58.3667
## 3           AZ     0       12.0543       2.862201      11.6439  33.4333
## 4           AR     4       11.2702       3.766997      10.8814  34.7333
## 5           CA     5       14.0005       3.935740      12.0058  38.5167
## 6           CO     0       11.8820       2.944439      11.5530  39.7500
##   Longitude
## 1    86.367
## 2   134.583
## 3   112.017
## 4    92.233
## 5   121.500
## 6   104.867
  (a) Develop a logistic regression model that predicts y from \(x_1, x_2, x_3, x_4\) and \(x_5\) such that each of the predictors is significant at least at the 5% level. Use marginal model plots to check the validity of the full model and the final model (if it is different from the full model).

A logistic model? Here Y is a discrete count (the number of top-ten finishes per state) rather than a binary outcome, so as a first pass we fit an ordinary linear regression and reduce it by backward selection.
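For reference, the binomial logistic model the question literally asks for could be fit roughly as sketched below. This is only a sketch: it assumes each state had nine qualifying opportunities (one pageant per year, 2000-2008), so the response is the proportion of those nine pageants in which the state placed in the top ten.

# Sketch (not part of the analysis below): binomial logistic regression for the
# proportion of top-ten finishes, assuming 9 pageants per state over 2000-2008.
n_pageants <- 9
miss_glm <- glm(cbind(Top10, n_pageants - Top10) ~ LogPopulation + LogContestants +
                  LogTotalArea + Latitude + Longitude,
                family = binomial, data = miss_america)
summary(miss_glm)
car::mmps(miss_glm)  # marginal model plots for the fitted GLM, as the question requests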

america_lm <- lm(Top10 ~ . - abbreviation, data=miss_america)
step_lm <- step(america_lm, direction="backward")
## Start:  AIC=28.16
## Top10 ~ (abbreviation + LogPopulation + LogContestants + LogTotalArea + 
##     Latitude + Longitude) - abbreviation
## 
##                  Df Sum of Sq    RSS    AIC
## - Longitude       1    0.4273 70.443 26.472
## - LogTotalArea    1    2.0571 72.073 27.638
## <none>                        70.016 28.162
## - Latitude        1    4.5706 74.586 29.387
## - LogPopulation   1    6.1647 76.180 30.465
## - LogContestants  1   13.2496 83.265 35.000
## 
## Step:  AIC=26.47
## Top10 ~ LogPopulation + LogContestants + LogTotalArea + Latitude
## 
##                  Df Sum of Sq    RSS    AIC
## - LogTotalArea    1    1.7088 72.152 25.694
## <none>                        70.443 26.472
## - LogPopulation   1    5.7553 76.198 28.477
## - Latitude        1    5.9023 76.345 28.575
## - LogContestants  1   12.8294 83.272 33.005
## 
## Step:  AIC=25.69
## Top10 ~ LogPopulation + LogContestants + Latitude
## 
##                  Df Sum of Sq    RSS    AIC
## <none>                        72.152 25.694
## - LogPopulation   1    5.3107 77.462 27.316
## - Latitude        1    8.6232 80.775 29.452
## - LogContestants  1   11.2229 83.375 31.067
step_lm
## 
## Call:
## lm(formula = Top10 ~ LogPopulation + LogContestants + Latitude, 
##     data = miss_america)
## 
## Coefficients:
##    (Intercept)   LogPopulation  LogContestants        Latitude  
##       -4.20678         0.41111         1.39373        -0.08128
summary(step_lm)
## 
## Call:
## lm(formula = Top10 ~ LogPopulation + LogContestants + Latitude, 
##     data = miss_america)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5770 -0.8059 -0.1069  0.7957  2.8228 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    -4.20678    2.84135  -1.481  0.14540   
## LogPopulation   0.41111    0.22103   1.860  0.06916 . 
## LogContestants  1.39373    0.51547   2.704  0.00951 **
## Latitude       -0.08128    0.03430  -2.370  0.02194 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.239 on 47 degrees of freedom
## Multiple R-squared:  0.5163, Adjusted R-squared:  0.4855 
## F-statistic: 16.72 on 3 and 47 DF,  p-value: 1.571e-07

Using added-variable plots (avPlots from the car package) to examine the contribution of each predictor in the final model:

library(car)
## Warning: package 'car' was built under R version 3.4.3
avPlots(step_lm, id.n=2)

  (b) Identify any leverage points in the final model developed in (a). Decide if they are “bad” leverage points.

Observations 2 and 12 appear to be leverage points, influencing the Latitude coefficient in particular. However, neither point exceeds the Cook’s distance threshold in the diagnostic plots below, so they do not appear to be clearly “bad” leverage points.

par(mfrow=c(2,2))
plot(step_lm)

Let’s remove observations 2 and 12 and refit.

  (c) Interpret the regression coefficients of the final model developed in (a).
miss_america_1 <- miss_america[-c(2,12),]

america_lm_2 <- lm(Top10 ~ . - abbreviation, data=miss_america_1)
step_lm2 <- step(america_lm_2, direction="backward")
## Start:  AIC=26.75
## Top10 ~ (abbreviation + LogPopulation + LogContestants + LogTotalArea + 
##     Latitude + Longitude) - abbreviation
## 
##                  Df Sum of Sq    RSS    AIC
## - Longitude       1    0.1999 66.409 24.897
## - LogTotalArea    1    1.0962 67.306 25.554
## - Latitude        1    2.3470 68.556 26.456
## <none>                        66.210 26.749
## - LogPopulation   1    6.5021 72.712 29.339
## - LogContestants  1   11.8360 78.046 32.808
## 
## Step:  AIC=24.9
## Top10 ~ LogPopulation + LogContestants + LogTotalArea + Latitude
## 
##                  Df Sum of Sq    RSS    AIC
## - Latitude        1    2.5611 68.970 24.751
## <none>                        66.409 24.897
## - LogTotalArea    1    3.0858 69.495 25.122
## - LogPopulation   1    7.8522 74.262 28.373
## - LogContestants  1   11.6643 78.074 30.826
## 
## Step:  AIC=24.75
## Top10 ~ LogPopulation + LogContestants + LogTotalArea
## 
##                  Df Sum of Sq    RSS    AIC
## <none>                        68.970 24.751
## - LogTotalArea    1    3.9951 72.966 25.510
## - LogPopulation   1    7.8813 76.852 28.053
## - LogContestants  1   25.4010 94.371 38.115
step_lm2
## 
## Call:
## lm(formula = Top10 ~ LogPopulation + LogContestants + LogTotalArea, 
##     data = miss_america_1)
## 
## Coefficients:
##    (Intercept)   LogPopulation  LogContestants    LogTotalArea  
##        -8.4173          0.5197          2.0410         -0.2220
summary(step_lm2)
## 
## Call:
## lm(formula = Top10 ~ LogPopulation + LogContestants + LogTotalArea, 
##     data = miss_america_1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6294 -0.6670 -0.0260  0.6398  3.4173 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -8.4173     2.2450  -3.749 0.000504 ***
## LogPopulation    0.5197     0.2292   2.268 0.028202 *  
## LogContestants   2.0410     0.5013   4.071 0.000187 ***
## LogTotalArea    -0.2220     0.1375  -1.614 0.113411    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.238 on 45 degrees of freedom
## Multiple R-squared:  0.5228, Adjusted R-squared:  0.491 
## F-statistic: 16.43 on 3 and 45 DF,  p-value: 2.365e-07
avPlots(step_lm2, id.n=1)

This refit is perhaps better, though not perfect; observation 9 now looks like an outlier, and we could repeat the same exercise or leave it. Interpreting the coefficients of this model: holding the other predictors fixed, a one-unit increase in log(population) is associated with about 0.52 additional top-ten finishes, a one-unit increase in log(average number of contestants) with about 2.04 additional finishes, and a one-unit increase in log(geographic area) with about 0.22 fewer finishes.

MARR 8.6

Dr. Hans Riedwyl, a statistician at the University of Berne, was asked by local authorities to analyze data on Swiss bank notes. In particular, the statistician was asked to develop a model to predict whether a particular banknote is counterfeit (y = 0) or genuine (y = 1) based on the following physical measurements (in millimeters) of 100 genuine and 100 counterfeit Swiss bank notes:

  1. Length = length of the banknote

  2. Left = length of the left edge of the banknote

  3. Right = length of the right edge of the banknote

  4. Top = distance from the image to the top edge

  5. Bottom = distance from the image to the bottom edge

  6. Diagonal = length of the diagonal

The data were originally reported in Flury and Riedwyl (1988) and can be found in the alr3 library and on the book web site in the file banknote.txt. Figure 8.20 contains a plot of Bottom and Diagonal with different symbols for the two values of y.

url2 <- 'https://raw.githubusercontent.com/jcp9010/MSDA/master/CUNY%20621/Week%208/banknote.txt'
banknote <- read.table(url2, header = TRUE)
head(banknote)
##   Length  Left Right Bottom  Top Diagonal Y
## 1  214.8 131.0 131.1    9.0  9.7    141.0 0
## 2  214.6 129.7 129.7    8.1  9.5    141.7 0
## 3  214.8 129.7 129.7    8.7  9.6    142.2 0
## 4  214.8 129.7 129.6    7.5 10.4    142.0 0
## 5  215.0 129.6 129.7   10.4  7.7    141.8 0
## 6  215.7 130.8 130.5    9.0 10.1    141.4 0
  (a) Fit a logistic regression model using just the last two predictor variables listed above (i.e., Bottom and Diagonal). R will give warnings, including “fitted probabilities numerically 0 or 1 occurred”.
bank_glm <- glm(Y ~ Bottom + Diagonal, family="binomial", data=banknote)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(bank_glm)
## 
## Call:
## glm(formula = Y ~ Bottom + Diagonal, family = "binomial", data = banknote)
## 
## Deviance Residuals: 
##       Min         1Q     Median         3Q        Max  
## -5.51e-04  -2.00e-08   0.00e+00   2.00e-08   6.58e-04  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)   99422.9  5433597.5   0.018    0.985
## Bottom          688.7    37796.3   0.018    0.985
## Diagonal       -751.8    41093.2  -0.018    0.985
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2.7726e+02  on 199  degrees of freedom
## Residual deviance: 1.0424e-06  on 197  degrees of freedom
## AIC: 6
## 
## Number of Fisher Scoring iterations: 25
  (b) Compare the predicted values of y from the model in (a) with the actual values of y and show that they coincide. This is a consequence of the fact that the residual deviance is zero to many decimal places. Looking at Figure 8.20, we see that the two predictors completely separate the counterfeit (y = 0) and genuine (y = 1) banknotes - thus producing a perfect logistic fit with zero residual deviance. A number of authors, including Atkinson and Riani (2000, p.251), comment that for perfect logistic fits, the estimates of the \(\beta\)s approach infinity and the z-values approach zero.
test_banknote <- banknote[,-7]
round(predict(bank_glm, test_banknote, type="response"),1)
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18 
##   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 
##  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36 
##   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 
##  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54 
##   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 
##  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72 
##   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 
##  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90 
##   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 
##  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108 
##   0   0   0   0   0   0   0   0   0   0   1   1   1   1   1   1   1   1 
## 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 199 200 
##   1   1

Comparing the predictions above with the actual values of y, they coincide exactly.
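A quick programmatic check of this (a sketch) could be:

# Sketch: confirm that the fitted classes match the observed Y exactly
pred_class <- as.numeric(predict(bank_glm, type = "response") > 0.5)
table(predicted = pred_class, actual = banknote$Y)
all(pred_class == banknote$Y)  # TRUE, since the two predictors separate the classes perfectly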

require(ggplot2)
## Loading required package: ggplot2
ggplot(banknote, aes(x=Diagonal, y=Bottom, color=Y)) + geom_point()

ELMR 2.2

The dataset wbca comes from a study of breast cancer in Wisconsin. There are 681 cases of potentially cancerous tumors of which 238 are actually malignant. Determining whether a tumor is really malignant is traditionally determined by an invasive surgical procedure. The purpose of this study was to determine whether a new procedure called fine needle aspiration, which draws only a small sample of tissue, could be effective in determining tumor status.

  (a) Fit a binomial regression with Class as the response and the other nine variables as predictors. Report the residual deviance and associated degrees of freedom. Can this information be used to determine if this model fits the data?
library(faraway)
## 
## Attaching package: 'faraway'
## The following objects are masked from 'package:car':
## 
##     logit, vif
data(wbca, package="faraway")
head(wbca)
##   Class Adhes BNucl Chrom Epith Mitos NNucl Thick UShap USize
## 1     1     1     1     3     2     1     1     5     1     1
## 2     1     5    10     3     7     1     2     5     4     4
## 3     1     1     2     3     2     1     1     3     1     1
## 4     1     1     4     3     3     1     7     6     8     8
## 5     1     3     1     3     2     1     1     4     1     1
## 6     0     8    10     9     7     1     7     8    10    10
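The full binomial fit requested in (a) is the model that step() starts from below; a minimal sketch for reporting its residual deviance and degrees of freedom follows. Because the responses are ungrouped binary observations, the residual deviance does not follow a chi-squared distribution, so it cannot be used on its own to judge whether the model fits the data.

# Sketch: fit the full model and report residual deviance and df
wbca_full <- glm(Class ~ ., family = "binomial", data = wbca)
deviance(wbca_full)     # about 89.46
df.residual(wbca_full)  # 671 (681 cases minus 10 estimated parameters)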
  (b) Use AIC as the criterion to determine the best subset of variables. (Use the step function.)
glm_mod <- glm(Class ~ ., family="binomial", data = wbca)
glm_mod_2 <- step(glm_mod, direction="backward")
## Start:  AIC=109.46
## Class ~ Adhes + BNucl + Chrom + Epith + Mitos + NNucl + Thick + 
##     UShap + USize
## 
##         Df Deviance    AIC
## - USize  1   89.523 107.52
## - Epith  1   89.613 107.61
## - UShap  1   90.627 108.63
## <none>       89.464 109.46
## - Mitos  1   93.551 111.55
## - NNucl  1   95.204 113.20
## - Adhes  1   98.844 116.84
## - Chrom  1   99.841 117.84
## - BNucl  1  109.000 127.00
## - Thick  1  110.239 128.24
## 
## Step:  AIC=107.52
## Class ~ Adhes + BNucl + Chrom + Epith + Mitos + NNucl + Thick + 
##     UShap
## 
##         Df Deviance    AIC
## - Epith  1   89.662 105.66
## - UShap  1   91.355 107.36
## <none>       89.523 107.52
## - Mitos  1   93.552 109.55
## - NNucl  1   95.231 111.23
## - Adhes  1   99.042 115.04
## - Chrom  1  100.153 116.15
## - BNucl  1  109.064 125.06
## - Thick  1  110.465 126.47
## 
## Step:  AIC=105.66
## Class ~ Adhes + BNucl + Chrom + Mitos + NNucl + Thick + UShap
## 
##         Df Deviance    AIC
## <none>       89.662 105.66
## - UShap  1   91.884 105.88
## - Mitos  1   93.714 107.71
## - NNucl  1   95.853 109.85
## - Adhes  1  100.126 114.13
## - Chrom  1  100.844 114.84
## - BNucl  1  109.762 123.76
## - Thick  1  110.632 124.63
glm_mod_2
## 
## Call:  glm(formula = Class ~ Adhes + BNucl + Chrom + Mitos + NNucl + 
##     Thick + UShap, family = "binomial", data = wbca)
## 
## Coefficients:
## (Intercept)        Adhes        BNucl        Chrom        Mitos  
##     11.0333      -0.3984      -0.4192      -0.5679      -0.6456  
##       NNucl        Thick        UShap  
##     -0.2915      -0.6216      -0.2541  
## 
## Degrees of Freedom: 680 Total (i.e. Null);  673 Residual
## Null Deviance:       881.4 
## Residual Deviance: 89.66     AIC: 105.7

The reduced model has residual deviance 89.66 on 673 degrees of freedom, with AIC 105.7.

  (c) Use the reduced model to predict the outcome for a new patient with predictor variables 1, 1, 3, 2, 1, 1, 4, 1, 1 (same order as above). Give a confidence interval for your prediction.
test_set <- data.frame(1,1,3,2,1,1,4,1,1)
colnames(test_set) <- c("Adhes", "BNucl", "Chrom","Epith", "Mitos","NNucl","Thick","UShap","USize")
a <- predict(glm_mod_2, test_set, type="response", se.fit=TRUE)
a
## $fit
##         1 
## 0.9921115 
## 
## $se.fit
##           1 
## 0.004551129 
## 
## $residual.scale
## [1] 1
print(paste0("2 SD above mean: ", a$fit + 2 * a$se.fit))
## [1] "2 SD above mean: 1.00121374555702"
print(paste0("2 SD below mean: ", a$fit - 2 * a$se.fit))
## [1] "2 SD below mean: 0.983009231045569"
  (d) Suppose that a cancer is classified as benign if p > 0.5 and malignant if p < 0.5. Compute the number of errors of both types that will be made if this method is applied to the current data with the reduced model.
# Working with model a (the stepwise "backwards" model)
library(caret)
## Warning: package 'caret' was built under R version 3.4.3
## Loading required package: lattice
## 
## Attaching package: 'lattice'
## The following object is masked from 'package:faraway':
## 
##     melanoma
## Warning in as.POSIXlt.POSIXct(Sys.time()): unknown timezone 'zone/tz/2018c.
## 1.0/zoneinfo/America/New_York'
Train <- createDataPartition(wbca$Class, p=0.7, list=FALSE)
training <- wbca[Train, ]
testing <- wbca[-Train, ]

pred1 <- round(predict(glm_mod_2, newdata=testing,type="response"),3)
pred11 <- ifelse(pred1 < 0.5, 0, 1)
ans <- confusionMatrix(pred11,testing$Class,positive="0")
ans
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  74   3
##          1   2 125
##                                          
##                Accuracy : 0.9755         
##                  95% CI : (0.9437, 0.992)
##     No Information Rate : 0.6275         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.9477         
##  Mcnemar's Test P-Value : 1              
##                                          
##             Sensitivity : 0.9737         
##             Specificity : 0.9766         
##          Pos Pred Value : 0.9610         
##          Neg Pred Value : 0.9843         
##              Prevalence : 0.3725         
##          Detection Rate : 0.3627         
##    Detection Prevalence : 0.3775         
##       Balanced Accuracy : 0.9751         
##                                          
##        'Positive' Class : 0              
## 
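The confusion matrix above is computed on a random 30% hold-out. The question can also be read as asking for the error counts when the 0.5 cutoff is applied to the full data set with the reduced model; a sketch of that tabulation:

# Sketch: both types of error on the full data with the 0.5 cutoff
full_class <- ifelse(fitted(glm_mod_2) < 0.5, 0, 1)
table(predicted = full_class, actual = wbca$Class)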
  (e) Suppose we change the cutoff to 0.9 so that p < 0.9 is classified as malignant and p > 0.9 as benign. Compute the number of errors in this case. Discuss the issues in determining the cutoff.
pred21 <- ifelse(pred1 < 0.9, 0, 1)
ans1 <- confusionMatrix(pred21,testing$Class,positive="0")
ans1
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  76   6
##          1   0 122
##                                           
##                Accuracy : 0.9706          
##                  95% CI : (0.9371, 0.9891)
##     No Information Rate : 0.6275          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.9381          
##  Mcnemar's Test P-Value : 0.04123         
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.9531          
##          Pos Pred Value : 0.9268          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.3725          
##          Detection Rate : 0.3725          
##    Detection Prevalence : 0.4020          
##       Balanced Accuracy : 0.9766          
##                                           
##        'Positive' Class : 0               
## 
  (f) It is usually misleading to use the same data to fit a model and test its predictive ability. To investigate this, split the data into two parts: assign every third observation to a test set and the remaining two thirds of the data to a training set. Use the training set to determine the model and the test set to assess its predictive performance. Compare the outcome to the previously obtained results.

Please see the caret-based split above for a similar example; a sketch of the exact split described in the question follows.
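A minimal sketch, refitting the previously selected predictors on the training two-thirds (strictly, variable selection should be redone on the training set):

# Sketch: every third observation to the test set, the rest to training
test_idx   <- seq(3, nrow(wbca), by = 3)
wbca_train <- wbca[-test_idx, ]
wbca_test  <- wbca[test_idx, ]
train_glm <- glm(Class ~ Adhes + BNucl + Chrom + Mitos + NNucl + Thick + UShap,
                 family = "binomial", data = wbca_train)
test_pred <- ifelse(predict(train_glm, wbca_test, type = "response") < 0.5, 0, 1)
table(predicted = test_pred, actual = wbca_test$Class)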

ELMR 2.3

The National Institute of Diabetes and Digestive and Kidney Diseases conducted a study on 768 adult female Pima Indians living near Phoenix. The purpose of the study was to investigate factors related to diabetes. The data may be found in the dataset pima.

data(pima, package="faraway")
head(pima)
##   pregnant glucose diastolic triceps insulin  bmi diabetes age test
## 1        6     148        72      35       0 33.6    0.627  50    1
## 2        1      85        66      29       0 26.6    0.351  31    0
## 3        8     183        64       0       0 23.3    0.672  32    1
## 4        1      89        66      23      94 28.1    0.167  21    0
## 5        0     137        40      35     168 43.1    2.288  33    1
## 6        5     116        74       0       0 25.6    0.201  30    0
  (a) Perform simple graphical and numerical summaries of the data. Can you find any obvious irregularities in the data? If you do, take appropriate steps to correct the problems.
summary(pima)
##     pregnant         glucose        diastolic         triceps     
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     insulin           bmi           diabetes           age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780   Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437   1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725   Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719   Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200   Max.   :81.00  
##       test      
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000
require(ggplot2)
require(reshape2)
## Loading required package: reshape2
## Warning: package 'reshape2' was built under R version 3.4.3
pima_melt <- melt(pima)
## No id variables; using all as measure variables
ggplot(pima_melt, aes(x=factor(variable), y=value)) + geom_boxplot() + facet_grid(~variable, scale="free")

ggplot(pima_melt, aes(x=value)) + geom_histogram(bins=50) + facet_wrap(~variable, scale="free")

Some of the variables are skewed to the right (e.g. pregnant, diabetes, age). Note also that glucose, diastolic, triceps, insulin, and bmi all have minimum values of zero, which is not physically plausible and suggests missing values coded as 0.
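If we wanted to treat those zeros as missing, one way could be the sketch below (not carried through the analysis that follows):

# Sketch: recode the physically impossible zeros as NA (not used further below)
zero_vars <- c("glucose", "diastolic", "triceps", "insulin", "bmi")
pima_na <- pima
pima_na[zero_vars] <- lapply(pima_na[zero_vars], function(x) replace(x, x == 0, NA))
summary(pima_na[zero_vars])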

pima1 <- pima[,!names(pima) %in% c("pregnant", "diabetes", "age")]
pima1 <- cbind(pima1, l_preg = log(pima$pregnant), l_diabetes = log(pima$diabetes), l_age = log(pima$age))
head(pima1)
##   glucose diastolic triceps insulin  bmi test   l_preg l_diabetes    l_age
## 1     148        72      35       0 33.6    1 1.791759 -0.4668087 3.912023
## 2      85        66      29       0 26.6    0 0.000000 -1.0469691 3.433987
## 3     183        64       0       0 23.3    1 2.079442 -0.3974969 3.465736
## 4      89        66      23      94 28.1    0 0.000000 -1.7897615 3.044522
## 5     137        40      35     168 43.1    1     -Inf  0.8276781 3.496508
## 6     116        74       0       0 25.6    0 1.609438 -1.6044504 3.401197
melt_pima1 <- melt(pima1)
## No id variables; using all as measure variables
ggplot(melt_pima1, aes(x=value)) + geom_histogram(bins=50) + facet_wrap(~variable, scale="free")
## Warning: Removed 111 rows containing non-finite values (stat_bin).

The log transform of diabetes looks much closer to normal. (Note that log(pregnant) is -Inf for women with zero pregnancies, which is why ggplot reports removing 111 non-finite values; log(pregnant + 1) would avoid this.) What about the other variables?

Can a Box-Cox analysis suggest better transformations?

# Reference: https://stackoverflow.com/questions/46485024/how-to-use-boxcoxtrans-function-in-r
library(MASS)
## Warning: package 'MASS' was built under R version 3.4.3
# create a list of Box-Cox transformation objects, one per column
# (BoxCoxTrans comes from the caret package loaded earlier, not MASS)
g <- apply(pima, 2, BoxCoxTrans)
g
## $pregnant
## Box-Cox Transformation
## 
## 768 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   3.000   3.845   6.000  17.000 
## 
## Lambda could not be estimated; no transformation is applied
## 
## 
## $glucose
## Box-Cox Transformation
## 
## 768 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    99.0   117.0   120.9   140.2   199.0 
## 
## Lambda could not be estimated; no transformation is applied
## 
## 
## $diastolic
## Box-Cox Transformation
## 
## 768 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   62.00   72.00   69.11   80.00  122.00 
## 
## Lambda could not be estimated; no transformation is applied
## 
## 
## $triceps
## Box-Cox Transformation
## 
## 768 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00   23.00   20.54   32.00   99.00 
## 
## Lambda could not be estimated; no transformation is applied
## 
## 
## $insulin
## Box-Cox Transformation
## 
## 768 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0    30.5    79.8   127.2   846.0 
## 
## Lambda could not be estimated; no transformation is applied
## 
## 
## $bmi
## Box-Cox Transformation
## 
## 768 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   27.30   32.00   31.99   36.60   67.10 
## 
## Lambda could not be estimated; no transformation is applied
## 
## 
## $diabetes
## Box-Cox Transformation
## 
## 768 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0780  0.2437  0.3725  0.4719  0.6262  2.4200 
## 
## Largest/Smallest: 31 
## Sample Skewness: 1.91 
## 
## Estimated Lambda: -0.1 
## With fudge factor, Lambda = 0 will be used for transformations
## 
## 
## $age
## Box-Cox Transformation
## 
## 768 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   21.00   24.00   29.00   33.24   41.00   81.00 
## 
## Largest/Smallest: 3.86 
## Sample Skewness: 1.13 
## 
## Estimated Lambda: -1.1 
## 
## 
## $test
## Box-Cox Transformation
## 
## 768 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.349   1.000   1.000 
## 
## Lambda could not be estimated; no transformation is applied

Box-Cox cannot estimate \(\lambda\) for the columns that contain zeros (it requires strictly positive data), which is why no transformation is suggested for most variables. For diabetes the estimated \(\lambda\) is approximately 0, i.e. a log transformation, confirming the choice above; for age the estimated \(\lambda\) is about -1.1, so we will also apply a reciprocal (1/age) transformation.

pima2 <- pima[,!names(pima) %in% c("diabetes", "age")]
pima2 <- cbind(pima2, l_diabetes = log(pima$diabetes), inv_age = 1/(pima$age))
head(pima2)
##   pregnant glucose diastolic triceps insulin  bmi test l_diabetes
## 1        6     148        72      35       0 33.6    1 -0.4668087
## 2        1      85        66      29       0 26.6    0 -1.0469691
## 3        8     183        64       0       0 23.3    1 -0.3974969
## 4        1      89        66      23      94 28.1    0 -1.7897615
## 5        0     137        40      35     168 43.1    1  0.8276781
## 6        5     116        74       0       0 25.6    0 -1.6044504
##      inv_age
## 1 0.02000000
## 2 0.03225806
## 3 0.03125000
## 4 0.04761905
## 5 0.03030303
## 6 0.03333333
melt_pima2 <- melt(pima2)
## No id variables; using all as measure variables
ggplot(melt_pima2, aes(x=value)) + geom_histogram(bins=50) + facet_wrap(~variable, scale="free")

  (b) Fit a model with the result of the diabetes test as the response and all the other variables as predictors. Can you tell whether this model fits the data?
dm_glm <- step(glm(test ~ ., family="binomial", data=pima2), direction="backward")
## Start:  AIC=730.44
## test ~ pregnant + glucose + diastolic + triceps + insulin + bmi + 
##     l_diabetes + inv_age
## 
##              Df Deviance    AIC
## - triceps     1   712.46 728.46
## - insulin     1   713.91 729.91
## <none>            712.44 730.44
## - pregnant    1   719.36 735.36
## - diastolic   1   720.32 736.32
## - inv_age     1   722.64 738.64
## - l_diabetes  1   725.62 741.62
## - bmi         1   751.55 767.55
## - glucose     1   821.91 837.91
## 
## Step:  AIC=728.46
## test ~ pregnant + glucose + diastolic + insulin + bmi + l_diabetes + 
##     inv_age
## 
##              Df Deviance    AIC
## - insulin     1   714.10 728.10
## <none>            712.46 728.46
## - pregnant    1   719.39 733.39
## - diastolic   1   720.45 734.45
## - inv_age     1   722.67 736.67
## - l_diabetes  1   725.87 739.87
## - bmi         1   756.49 770.49
## - glucose     1   824.11 838.11
## 
## Step:  AIC=728.1
## test ~ pregnant + glucose + diastolic + bmi + l_diabetes + inv_age
## 
##              Df Deviance    AIC
## <none>            714.10 728.10
## - pregnant    1   721.20 733.20
## - diastolic   1   722.47 734.47
## - inv_age     1   725.55 737.55
## - l_diabetes  1   726.53 738.53
## - bmi         1   756.76 768.76
## - glucose     1   829.31 841.31
summary(dm_glm)
## 
## Call:
## glm(formula = test ~ pregnant + glucose + diastolic + bmi + l_diabetes + 
##     inv_age, family = "binomial", data = pima2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4620  -0.6993  -0.3909   0.7078   2.8100  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -5.063155   0.872816  -5.801 6.59e-09 ***
## pregnant      0.088742   0.033634   2.638 0.008328 ** 
## glucose       0.032757   0.003434   9.540  < 2e-16 ***
## diastolic    -0.014860   0.005190  -2.863 0.004194 ** 
## bmi           0.087050   0.014323   6.077 1.22e-09 ***
## l_diabetes    0.519868   0.149491   3.478 0.000506 ***
## inv_age     -42.089307  12.501627  -3.367 0.000761 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 993.48  on 767  degrees of freedom
## Residual deviance: 714.10  on 761  degrees of freedom
## AIC: 728.1
## 
## Number of Fisher Scoring iterations: 5
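As in ELMR 2.2, the response consists of ungrouped binary observations, so the residual deviance (714.1 on 761 degrees of freedom) is not chi-squared distributed and does not by itself tell us whether the model fits. A rough check (a sketch) is to compare observed and mean predicted positive rates within deciles of the fitted probabilities:

# Sketch: observed vs. mean predicted positive rate within deciles of fitted probability
phat <- fitted(dm_glm)
bins <- cut(phat, quantile(phat, 0:10 / 10), include.lowest = TRUE)
cbind(observed  = tapply(pima2$test, bins, mean),
      predicted = tapply(phat, bins, mean))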
  (c) What is the difference in the odds of testing positive for diabetes for a woman with a BMI at the first quartile compared with a woman at the third quartile, assuming that all other factors are held constant? Give a confidence interval for this difference.
quantile(pima2$bmi)
##   0%  25%  50%  75% 100% 
##  0.0 27.3 32.0 36.6 67.1
first_quant <- subset(pima2, pima2$bmi <= 27.3)
third_quant <- subset(pima2, pima2$bmi >= 32.0 & pima2$bmi <= 36.6)
first_quant_odds <- sum(first_quant$test == 1)/(length(first_quant$test))
third_quant_odds <- sum(third_quant$test == 1)/(length(third_quant$test))

print(paste0("Odds of Testing Positive for 1st Quartile: ", first_quant_odds))
## [1] "Odds of Testing Positive for 1st Quartile: 0.103092783505155"
print(paste0("Odds of Testing Positive for 3rd Quartile: ", third_quant_odds))
## [1] "Odds of Testing Positive for 3rd Quartile: 0.449275362318841"
  (d) Do women who test positive have higher diastolic blood pressures? Is the diastolic blood pressure significant in the regression model? Explain the distinction between the two questions and discuss why the answers are only apparently contradictory.
# Straight from the data
pos_dm <- subset(pima, pima$test == 1)
neg_dm <- subset(pima, pima$test == 0)

par(mfrow=c(1,2))
boxplot(pos_dm$diastolic, main="Positive for DM")
boxplot(neg_dm$diastolic, main="Negative for DM")

pos_sum <- summary(pos_dm)
print("Positive for DM, statistics for diastolic blood pressures: ")
## [1] "Positive for DM, statistics for diastolic blood pressures: "
pos_sum[,3]
##                                                          
## "Min.   :  0.00  " "1st Qu.: 66.00  " "Median : 74.00  " 
##                                                          
## "Mean   : 70.82  " "3rd Qu.: 82.00  " "Max.   :114.00  "
neg_sum <- summary(neg_dm)
print("Negative for DM, statistics for diastolic blood pressures: ")
## [1] "Negative for DM, statistics for diastolic blood pressures: "
neg_sum[,3]
##                                                          
## "Min.   :  0.00  " "1st Qu.: 62.00  " "Median : 70.00  " 
##                                                          
## "Mean   : 68.18  " "3rd Qu.: 78.00  " "Max.   :122.00  "

The raw data suggest that women who test positive have somewhat higher diastolic blood pressures (mean 70.8 versus 68.2). In the regression model, diastolic is also statistically significant, but with a negative coefficient. These answer different questions: the marginal comparison describes the unadjusted association between diastolic pressure and the test result, whereas the regression coefficient describes the effect of diastolic pressure after adjusting for the other predictors. The two results are therefore only apparently contradictory.

  (e) Perform diagnostics on the regression model, reporting any potential violations and any suggested improvements to the model.
# Reference: https://stats.stackexchange.com/questions/121490/interpretation-of-plot-glm-model
# Reference: https://stats.stackexchange.com/questions/45050/diagnostics-for-logistic-regression
# Reference: https://stat.ethz.ch/R-manual/R-devel/library/boot/html/glm.diag.plots.html
library(boot)
## 
## Attaching package: 'boot'
## The following object is masked from 'package:lattice':
## 
##     melanoma
## The following objects are masked from 'package:faraway':
## 
##     logit, melanoma
## The following object is masked from 'package:car':
## 
##     logit
glm.diag.plots(dm_glm)

  (f) Predict the outcome for a woman with predictor values 1, 99, 64, 22, 76, 27, 0.25, 25 (same order as in the dataset). Give a confidence interval for your prediction.
test_set <- data.frame(1,99,64,22,76,27,0.25,25)
colnames(test_set) <- c("pregnant", "glucose", "diastolic","triceps", "insulin","bmi","diabetes","age")
test_set <- cbind(test_set, l_diabetes = log(test_set$diabetes), inv_age = 1/(test_set$age))
test_set <- test_set[,!names(test_set) %in% c('diabetes','age')]

b <- predict(dm_glm, test_set, type="response", se.fit=TRUE)
b
## $fit
##         1 
## 0.0608607 
## 
## $se.fit
##          1 
## 0.01189065 
## 
## $residual.scale
## [1] 1
print(paste0("2 SD above mean: ",b$fit + 2 * b$se.fit))
## [1] "2 SD above mean: 0.0846419968308864"
print(paste0("2 SD below mean: ",b$fit - 2 * b$se.fit))
## [1] "2 SD below mean: 0.037079409523968"