ELMR 3.2

The ships dataset found in the MASS package gives the number of damage incidents and aggregate months of service for different types of ships broken down by year of construction and period of operation. Develop a model for the rate of incidents, describing the effect of the important predictors.

library(MASS)
## Warning: package 'MASS' was built under R version 3.4.3
data(ships, package="MASS")
head(ships)
##   type year period service incidents
## 1    A   60     60     127         0
## 2    A   60     75      63         0
## 3    A   65     60    1095         3
## 4    A   65     75    1095         4
## 5    A   70     60    1512         6
## 6    A   70     75    3353        18

This is likely a Poisson Regression. Poisson distributions arive naturally when the time between events is indepedent and identially exponentially distributed. We count the number of events in a given time period.

ships_poisson <- glm(incidents ~ ., family=poisson, data=ships)
ships_poisson
## 
## Call:  glm(formula = incidents ~ ., family = poisson, data = ships)
## 
## Coefficients:
## (Intercept)        typeB        typeC        typeD        typeE  
##  -5.7061338    0.8134618   -1.2045808   -0.8595236   -0.2225557  
##        year       period      service  
##   0.0451919    0.0605457    0.0000597  
## 
## Degrees of Freedom: 39 Total (i.e. Null);  32 Residual
## Null Deviance:       730.3 
## Residual Deviance: 174   AIC: 287.9
summary(ships_poisson)
## 
## Call:
## glm(formula = incidents ~ ., family = poisson, data = ships)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.1013  -1.9648  -0.5380   0.9899   4.6212  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.706e+00  1.221e+00  -4.673 2.96e-06 ***
## typeB        8.135e-01  2.023e-01   4.021 5.79e-05 ***
## typeC       -1.205e+00  3.275e-01  -3.679 0.000234 ***
## typeD       -8.595e-01  2.875e-01  -2.989 0.002795 ** 
## typeE       -2.226e-01  2.348e-01  -0.948 0.343173    
## year         4.519e-02  1.341e-02   3.370 0.000752 ***
## period       6.055e-02  8.945e-03   6.768 1.30e-11 ***
## service      5.970e-05  7.016e-06   8.509  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 730.25  on 39  degrees of freedom
## Residual deviance: 174.00  on 32  degrees of freedom
## AIC: 287.86
## 
## Number of Fisher Scoring iterations: 6

ELMR 3.4

The dataset africa gives information about the number of military coups in sub-Saharan Africa and various political and geographical information. Develop a simple but well-fitting model for the number of coups. Give an interpretation of the effect of the variables you include in your model on the response.

data(africa, package="faraway")
head(africa)
##          miltcoup oligarchy pollib parties pctvote popn size numelec
## Angola          0         0      2      38      NA  9.7 1247       0
## Benin           5         7      1      34   45.68  4.6  113       8
## Botswana        0         0     NA       7   20.30  1.2  582       5
## Burkina         6        13      2      62   17.50  8.8  274       5
## Burundi         2        13      2      10   34.39  5.3   28       3
## Cameroon        0         0      2      34   30.30 11.6  475      14
##          numregim
## Angola          1
## Benin           3
## Botswana        1
## Burkina         3
## Burundi         3
## Cameroon        3
africa_poisson <- glm(miltcoup ~ ., family=poisson, data=africa)
africa_poisson
## 
## Call:  glm(formula = miltcoup ~ ., family = poisson, data = africa)
## 
## Coefficients:
## (Intercept)    oligarchy       pollib      parties      pctvote  
##   -0.510269     0.073081    -0.712978     0.030774     0.013872  
##        popn         size      numelec     numregim  
##    0.009343    -0.000190    -0.016078     0.191735  
## 
## Degrees of Freedom: 35 Total (i.e. Null);  27 Residual
##   (11 observations deleted due to missingness)
## Null Deviance:       65.94 
## Residual Deviance: 28.67     AIC: 111.5
summary(africa_poisson)
## 
## Call:
## glm(formula = miltcoup ~ ., family = poisson, data = africa)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3443  -0.9542  -0.2587   0.3905   1.6953  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -0.5102693  0.9053301  -0.564  0.57301   
## oligarchy    0.0730814  0.0345958   2.112  0.03465 * 
## pollib      -0.7129779  0.2725635  -2.616  0.00890 **
## parties      0.0307739  0.0111873   2.751  0.00595 **
## pctvote      0.0138722  0.0097526   1.422  0.15491   
## popn         0.0093429  0.0065950   1.417  0.15658   
## size        -0.0001900  0.0002485  -0.765  0.44447   
## numelec     -0.0160783  0.0654842  -0.246  0.80605   
## numregim     0.1917349  0.2292890   0.836  0.40303   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 65.945  on 35  degrees of freedom
## Residual deviance: 28.668  on 27  degrees of freedom
##   (11 observations deleted due to missingness)
## AIC: 111.48
## 
## Number of Fisher Scoring iterations: 6

ELMR 3.5

The dvisits data comes from the Australian Health Survey of 1977 - 78 and consist of 5190 single adults where young and old have been oversampled.

data(dvisits, package="faraway")
head(dvisits)
##   sex  age  agesq income levyplus freepoor freerepa illness actdays hscore
## 1   1 0.19 0.0361   0.55        1        0        0       1       4      1
## 2   1 0.19 0.0361   0.45        1        0        0       1       2      1
## 3   0 0.19 0.0361   0.90        0        0        0       3       0      0
## 4   0 0.19 0.0361   0.15        0        0        0       1       0      0
## 5   0 0.19 0.0361   0.45        0        0        0       2       5      1
## 6   1 0.19 0.0361   0.35        0        0        0       5       1      9
##   chcond1 chcond2 doctorco nondocco hospadmi hospdays medicine prescrib
## 1       0       0        1        0        0        0        1        1
## 2       0       0        1        0        0        0        2        1
## 3       0       0        1        0        1        4        2        1
## 4       0       0        1        0        0        0        0        0
## 5       1       0        1        0        0        0        3        1
## 6       1       0        1        0        0        0        1        1
##   nonpresc
## 1        0
## 2        1
## 3        1
## 4        0
## 5        2
## 6        0
  1. Build a Poisson regression model with doctorco as the response and sex, age, agesq, income, levyplus, freepoor, freerepa, illness, actdays, hscore, chcond1, and chcond2 as possible predictor variables. Considering the deviance of this model, does this model fit the data?
doctor_poisson <- glm(doctorco ~ sex + age + agesq + income + levyplus + freepoor + freerepa + illness + actdays + hscore + chcond1 + chcond2, family=poisson, data = dvisits)
doctor_poisson
## 
## Call:  glm(formula = doctorco ~ sex + age + agesq + income + levyplus + 
##     freepoor + freerepa + illness + actdays + hscore + chcond1 + 
##     chcond2, family = poisson, data = dvisits)
## 
## Coefficients:
## (Intercept)          sex          age        agesq       income  
##    -2.22385      0.15688      1.05630     -0.84870     -0.20532  
##    levyplus     freepoor     freerepa      illness      actdays  
##     0.12319     -0.44006      0.07980      0.18695      0.12685  
##      hscore      chcond1      chcond2  
##     0.03008      0.11409      0.14116  
## 
## Degrees of Freedom: 5189 Total (i.e. Null);  5177 Residual
## Null Deviance:       5635 
## Residual Deviance: 4380  AIC: 6737
summary(doctor_poisson)
## 
## Call:
## glm(formula = doctorco ~ sex + age + agesq + income + levyplus + 
##     freepoor + freerepa + illness + actdays + hscore + chcond1 + 
##     chcond2, family = poisson, data = dvisits)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9170  -0.6862  -0.5743  -0.4839   5.7005  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.223848   0.189816 -11.716   <2e-16 ***
## sex          0.156882   0.056137   2.795   0.0052 ** 
## age          1.056299   1.000780   1.055   0.2912    
## agesq       -0.848704   1.077784  -0.787   0.4310    
## income      -0.205321   0.088379  -2.323   0.0202 *  
## levyplus     0.123185   0.071640   1.720   0.0855 .  
## freepoor    -0.440061   0.179811  -2.447   0.0144 *  
## freerepa     0.079798   0.092060   0.867   0.3860    
## illness      0.186948   0.018281  10.227   <2e-16 ***
## actdays      0.126846   0.005034  25.198   <2e-16 ***
## hscore       0.030081   0.010099   2.979   0.0029 ** 
## chcond1      0.114085   0.066640   1.712   0.0869 .  
## chcond2      0.141158   0.083145   1.698   0.0896 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 5634.8  on 5189  degrees of freedom
## Residual deviance: 4379.5  on 5177  degrees of freedom
## AIC: 6737.1
## 
## Number of Fisher Scoring iterations: 6

Reference: https://stats.stackexchange.com/questions/108995/interpreting-residual-and-null-deviance-in-glm-r The residual deviance is quite high. Probably not the best fit.

  1. Plot the residuals and the fitted data - why are there lines of observations on the plot?
par(mfrow=c(2,2))
plot(doctor_poisson)

There are lines because the responses are discrete continuous numbers.

  1. Use backward elimination with a critical p-value of 5% to reduce the model as much as possible. Report your model.
step(doctor_poisson, direction="backward")
## Start:  AIC=6737.08
## doctorco ~ sex + age + agesq + income + levyplus + freepoor + 
##     freerepa + illness + actdays + hscore + chcond1 + chcond2
## 
##            Df Deviance    AIC
## - agesq     1   4380.1 6735.7
## - freerepa  1   4380.3 6735.8
## - age       1   4380.6 6736.2
## <none>          4379.5 6737.1
## - chcond2   1   4382.4 6738.0
## - chcond1   1   4382.5 6738.0
## - levyplus  1   4382.5 6738.1
## - income    1   4385.0 6740.5
## - freepoor  1   4386.2 6741.8
## - sex       1   4387.4 6743.0
## - hscore    1   4388.1 6743.7
## - illness   1   4481.8 6837.4
## - actdays   1   4917.1 7272.7
## 
## Step:  AIC=6735.7
## doctorco ~ sex + age + income + levyplus + freepoor + freerepa + 
##     illness + actdays + hscore + chcond1 + chcond2
## 
##            Df Deviance    AIC
## - freerepa  1   4381.0 6734.5
## <none>          4380.1 6735.7
## - age       1   4383.0 6736.5
## - chcond1   1   4383.2 6736.8
## - levyplus  1   4383.3 6736.9
## - chcond2   1   4383.5 6737.0
## - income    1   4385.0 6738.6
## - freepoor  1   4386.8 6740.4
## - sex       1   4388.0 6741.5
## - hscore    1   4389.1 6742.7
## - illness   1   4481.9 6835.4
## - actdays   1   4917.1 7270.7
## 
## Step:  AIC=6734.53
## doctorco ~ sex + age + income + levyplus + freepoor + illness + 
##     actdays + hscore + chcond1 + chcond2
## 
##            Df Deviance    AIC
## <none>          4381.0 6734.5
## - levyplus  1   4383.4 6735.0
## - chcond1   1   4384.3 6735.9
## - chcond2   1   4384.7 6736.3
## - income    1   4386.7 6738.2
## - age       1   4387.1 6738.7
## - freepoor  1   4389.1 6740.6
## - sex       1   4389.5 6741.0
## - hscore    1   4390.2 6741.8
## - illness   1   4482.7 6834.2
## - actdays   1   4917.6 7269.2
## 
## Call:  glm(formula = doctorco ~ sex + age + income + levyplus + freepoor + 
##     illness + actdays + hscore + chcond1 + chcond2, family = poisson, 
##     data = dvisits)
## 
## Coefficients:
## (Intercept)          sex          age       income     levyplus  
##    -2.08906      0.16200      0.35513     -0.19981      0.08369  
##    freepoor      illness      actdays       hscore      chcond1  
##    -0.46960      0.18610      0.12661      0.03112      0.12110  
##     chcond2  
##     0.15889  
## 
## Degrees of Freedom: 5189 Total (i.e. Null);  5179 Residual
## Null Deviance:       5635 
## Residual Deviance: 4381  AIC: 6735

Stepwise does not appear to be much better.

  1. What sort of person would be predicted to visit the doctor the most under your selected model?

age, income, hscore, actdays and illness are statistically significant, which makes sense.

  1. For the last person in the dataset, compute the predicted probability distribution for their visits to the doctor, i.e., give the probability they visit 0,1,2, etc. times.
predict(doctor_poisson, dvisits[5190,], type="response")
##      5190 
## 0.1533837

The mean amount of visits to the doctor for patient 5190 would be 0.16 visits. We will set \(\lambda = 0.153\)

Reference: https://en.wikipedia.org/wiki/Poisson_distribution

print(paste0("Probability of 0 doctor's visits: ", dpois(0, lambda = 0.153)))
## [1] "Probability of 0 doctor's visits: 0.858129721811394"
print(paste0("Probability of 1 doctor's visits: ", dpois(1, lambda = 0.153)))
## [1] "Probability of 1 doctor's visits: 0.131293847437143"
print(paste0("Probability of 2 doctor's visits: ", dpois(2, lambda = 0.153)))
## [1] "Probability of 2 doctor's visits: 0.0100439793289415"
print(paste0("Probability of 3 doctor's visits: ", dpois(3, lambda = 0.153)))
## [1] "Probability of 3 doctor's visits: 0.000512242945776013"
print(paste0("Probability of 4 doctor's visits: ", dpois(4, lambda = 0.153)))
## [1] "Probability of 4 doctor's visits: 1.95932926759325e-05"
print(paste0("Probability of 5 doctor's visits: ", dpois(5, lambda = 0.153)))
## [1] "Probability of 5 doctor's visits: 5.99554755883536e-07"
# Reference: http://users.stat.ufl.edu/~presnell/Courses/sta4504-2000sp/R/Transcripts/simple-poisson.Rt
round(dpois(0:5, 0.153),3)
## [1] 0.858 0.131 0.010 0.001 0.000 0.000
  1. Fit a comparable (Gaussian) linear model and graphically compare the fits. Describe how they differ.
doctor_lm <- lm(doctorco ~ sex + age + agesq + income + levyplus + freepoor + freerepa + illness + actdays + hscore + chcond1 + chcond2, data=dvisits)
doctor_lm
## 
## Call:
## lm(formula = doctorco ~ sex + age + agesq + income + levyplus + 
##     freepoor + freerepa + illness + actdays + hscore + chcond1 + 
##     chcond2, data = dvisits)
## 
## Coefficients:
## (Intercept)          sex          age        agesq       income  
##    0.027632     0.033811     0.203201    -0.062103    -0.057323  
##    levyplus     freepoor     freerepa      illness      actdays  
##    0.035179    -0.103314     0.033241     0.059946     0.103192  
##      hscore      chcond1      chcond2  
##    0.016976     0.004384     0.041617
summary(doctor_lm)
## 
## Call:
## lm(formula = doctorco ~ sex + age + agesq + income + levyplus + 
##     freepoor + freerepa + illness + actdays + hscore + chcond1 + 
##     chcond2, data = dvisits)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1352 -0.2588 -0.1435 -0.0433  7.0327 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.027632   0.072220   0.383  0.70202    
## sex          0.033811   0.021604   1.565  0.11764    
## age          0.203201   0.410016   0.496  0.62020    
## agesq       -0.062103   0.458716  -0.135  0.89231    
## income      -0.057323   0.033089  -1.732  0.08326 .  
## levyplus     0.035179   0.024882   1.414  0.15748    
## freepoor    -0.103314   0.052471  -1.969  0.04901 *  
## freerepa     0.033241   0.038157   0.871  0.38371    
## illness      0.059946   0.008357   7.173 8.39e-13 ***
## actdays      0.103192   0.003657  28.216  < 2e-16 ***
## hscore       0.016976   0.005190   3.271  0.00108 ** 
## chcond1      0.004384   0.023740   0.185  0.85349    
## chcond2      0.041617   0.035863   1.160  0.24592    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7139 on 5177 degrees of freedom
## Multiple R-squared:  0.2018, Adjusted R-squared:    0.2 
## F-statistic: 109.1 on 12 and 5177 DF,  p-value: < 2.2e-16
predict(doctor_lm, dvisits[5190,])
##      5190 
## 0.1606531

It appears that it isn’t likely to be too different.

ELMR 3.7

The dataset esdcomp was recorded on 44 doctors working in an emergency service at a hospital to study the factors affecting the number of complaints received. Build a model for the number of complaints received and write a report on your conclusions.

data(esdcomp, package="faraway")
head(esdcomp)
##   visits complaints residency gender revenue   hours
## 1   2014          2         Y      F  263.03 1287.25
## 2   3091          3         N      M  334.94 1588.00
## 3    879          1         Y      M  206.42  705.25
## 4   1780          1         N      M  226.32 1005.50
## 5   3646         11         N      M  288.91 1667.25
## 6   2690          1         N      M  275.94 1517.75
complaints_poisson <- glm(complaints ~ ., family=poisson, data=esdcomp)
complaints_poisson
## 
## Call:  glm(formula = complaints ~ ., family = poisson, data = esdcomp)
## 
## Coefficients:
## (Intercept)       visits   residencyY      genderM      revenue  
##  -0.0803448    0.0009499   -0.2319740    0.1122391   -0.0033827  
##       hours  
##  -0.0001569  
## 
## Degrees of Freedom: 43 Total (i.e. Null);  38 Residual
## Null Deviance:       89.45 
## Residual Deviance: 49.99     AIC: 184.8
summary(complaints_poisson)
## 
## Call:
## glm(formula = complaints ~ ., family = poisson, data = esdcomp)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8989  -0.9193  -0.3835   0.4981   1.8221  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -0.0803448  1.1542122  -0.070  0.94450   
## visits       0.0009499  0.0003386   2.806  0.00502 **
## residencyY  -0.2319740  0.2029388  -1.143  0.25301   
## genderM      0.1122391  0.2235043   0.502  0.61554   
## revenue     -0.0033827  0.0041553  -0.814  0.41560   
## hours       -0.0001569  0.0006634  -0.237  0.81298   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 89.447  on 43  degrees of freedom
## Residual deviance: 49.995  on 38  degrees of freedom
## AIC: 184.77
## 
## Number of Fisher Scoring iterations: 5
a <- step(complaints_poisson, direction="backward")
## Start:  AIC=184.77
## complaints ~ visits + residency + gender + revenue + hours
## 
##             Df Deviance    AIC
## - hours      1   50.051 182.83
## - gender     1   50.251 183.03
## - revenue    1   50.665 183.44
## - residency  1   51.319 184.10
## <none>           49.995 184.78
## - visits     1   57.568 190.35
## 
## Step:  AIC=182.83
## complaints ~ visits + residency + gender + revenue
## 
##             Df Deviance    AIC
## - gender     1   50.404 181.18
## - revenue    1   50.739 181.52
## - residency  1   51.321 182.10
## <none>           50.051 182.83
## - visits     1   75.947 206.73
## 
## Step:  AIC=181.18
## complaints ~ visits + residency + revenue
## 
##             Df Deviance    AIC
## - revenue    1   50.882 179.66
## - residency  1   52.109 180.89
## <none>           50.404 181.18
## - visits     1   76.427 205.21
## 
## Step:  AIC=179.66
## complaints ~ visits + residency
## 
##             Df Deviance    AIC
## <none>           50.882 179.66
## - residency  1   54.246 181.03
## - visits     1   83.303 210.08
a
## 
## Call:  glm(formula = complaints ~ visits + residency, family = poisson, 
##     data = esdcomp)
## 
## Coefficients:
## (Intercept)       visits   residencyY  
##  -0.7274574    0.0008101   -0.3121729  
## 
## Degrees of Freedom: 43 Total (i.e. Null);  41 Residual
## Null Deviance:       89.45 
## Residual Deviance: 50.88     AIC: 179.7
summary(a)
## 
## Call:
## glm(formula = complaints ~ visits + residency, family = poisson, 
##     data = esdcomp)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9264  -0.9142  -0.3832   0.6496   1.9869  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.7274574  0.3972932  -1.831   0.0671 .  
## visits       0.0008101  0.0001429   5.668 1.45e-08 ***
## residencyY  -0.3121729  0.1725024  -1.810   0.0703 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 89.447  on 43  degrees of freedom
## Residual deviance: 50.882  on 41  degrees of freedom
## AIC: 179.66
## 
## Number of Fisher Scoring iterations: 5