The ships
dataset found in the MASS
package gives the number of damage incidents and aggregate months of service for different types of ships broken down by year of construction and period of operation. Develop a model for the rate of incidents, describing the effect of the important predictors.
library(MASS)
## Warning: package 'MASS' was built under R version 3.4.3
data(ships, package="MASS")
head(ships)
## type year period service incidents
## 1 A 60 60 127 0
## 2 A 60 75 63 0
## 3 A 65 60 1095 3
## 4 A 65 75 1095 4
## 5 A 70 60 1512 6
## 6 A 70 75 3353 18
This is likely a Poisson Regression. Poisson distributions arive naturally when the time between events is indepedent and identially exponentially distributed. We count the number of events in a given time period.
ships_poisson <- glm(incidents ~ ., family=poisson, data=ships)
ships_poisson
##
## Call: glm(formula = incidents ~ ., family = poisson, data = ships)
##
## Coefficients:
## (Intercept) typeB typeC typeD typeE
## -5.7061338 0.8134618 -1.2045808 -0.8595236 -0.2225557
## year period service
## 0.0451919 0.0605457 0.0000597
##
## Degrees of Freedom: 39 Total (i.e. Null); 32 Residual
## Null Deviance: 730.3
## Residual Deviance: 174 AIC: 287.9
summary(ships_poisson)
##
## Call:
## glm(formula = incidents ~ ., family = poisson, data = ships)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.1013 -1.9648 -0.5380 0.9899 4.6212
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.706e+00 1.221e+00 -4.673 2.96e-06 ***
## typeB 8.135e-01 2.023e-01 4.021 5.79e-05 ***
## typeC -1.205e+00 3.275e-01 -3.679 0.000234 ***
## typeD -8.595e-01 2.875e-01 -2.989 0.002795 **
## typeE -2.226e-01 2.348e-01 -0.948 0.343173
## year 4.519e-02 1.341e-02 3.370 0.000752 ***
## period 6.055e-02 8.945e-03 6.768 1.30e-11 ***
## service 5.970e-05 7.016e-06 8.509 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 730.25 on 39 degrees of freedom
## Residual deviance: 174.00 on 32 degrees of freedom
## AIC: 287.86
##
## Number of Fisher Scoring iterations: 6
The dataset africa
gives information about the number of military coups in sub-Saharan Africa and various political and geographical information. Develop a simple but well-fitting model for the number of coups. Give an interpretation of the effect of the variables you include in your model on the response.
data(africa, package="faraway")
head(africa)
## miltcoup oligarchy pollib parties pctvote popn size numelec
## Angola 0 0 2 38 NA 9.7 1247 0
## Benin 5 7 1 34 45.68 4.6 113 8
## Botswana 0 0 NA 7 20.30 1.2 582 5
## Burkina 6 13 2 62 17.50 8.8 274 5
## Burundi 2 13 2 10 34.39 5.3 28 3
## Cameroon 0 0 2 34 30.30 11.6 475 14
## numregim
## Angola 1
## Benin 3
## Botswana 1
## Burkina 3
## Burundi 3
## Cameroon 3
africa_poisson <- glm(miltcoup ~ ., family=poisson, data=africa)
africa_poisson
##
## Call: glm(formula = miltcoup ~ ., family = poisson, data = africa)
##
## Coefficients:
## (Intercept) oligarchy pollib parties pctvote
## -0.510269 0.073081 -0.712978 0.030774 0.013872
## popn size numelec numregim
## 0.009343 -0.000190 -0.016078 0.191735
##
## Degrees of Freedom: 35 Total (i.e. Null); 27 Residual
## (11 observations deleted due to missingness)
## Null Deviance: 65.94
## Residual Deviance: 28.67 AIC: 111.5
summary(africa_poisson)
##
## Call:
## glm(formula = miltcoup ~ ., family = poisson, data = africa)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3443 -0.9542 -0.2587 0.3905 1.6953
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.5102693 0.9053301 -0.564 0.57301
## oligarchy 0.0730814 0.0345958 2.112 0.03465 *
## pollib -0.7129779 0.2725635 -2.616 0.00890 **
## parties 0.0307739 0.0111873 2.751 0.00595 **
## pctvote 0.0138722 0.0097526 1.422 0.15491
## popn 0.0093429 0.0065950 1.417 0.15658
## size -0.0001900 0.0002485 -0.765 0.44447
## numelec -0.0160783 0.0654842 -0.246 0.80605
## numregim 0.1917349 0.2292890 0.836 0.40303
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 65.945 on 35 degrees of freedom
## Residual deviance: 28.668 on 27 degrees of freedom
## (11 observations deleted due to missingness)
## AIC: 111.48
##
## Number of Fisher Scoring iterations: 6
The dvisits
data comes from the Australian Health Survey of 1977 - 78 and consist of 5190 single adults where young and old have been oversampled.
data(dvisits, package="faraway")
head(dvisits)
## sex age agesq income levyplus freepoor freerepa illness actdays hscore
## 1 1 0.19 0.0361 0.55 1 0 0 1 4 1
## 2 1 0.19 0.0361 0.45 1 0 0 1 2 1
## 3 0 0.19 0.0361 0.90 0 0 0 3 0 0
## 4 0 0.19 0.0361 0.15 0 0 0 1 0 0
## 5 0 0.19 0.0361 0.45 0 0 0 2 5 1
## 6 1 0.19 0.0361 0.35 0 0 0 5 1 9
## chcond1 chcond2 doctorco nondocco hospadmi hospdays medicine prescrib
## 1 0 0 1 0 0 0 1 1
## 2 0 0 1 0 0 0 2 1
## 3 0 0 1 0 1 4 2 1
## 4 0 0 1 0 0 0 0 0
## 5 1 0 1 0 0 0 3 1
## 6 1 0 1 0 0 0 1 1
## nonpresc
## 1 0
## 2 1
## 3 1
## 4 0
## 5 2
## 6 0
doctorco
as the response and sex
, age
, agesq
, income
, levyplus
, freepoor
, freerepa
, illness
, actdays
, hscore
, chcond1
, and chcond2
as possible predictor variables. Considering the deviance of this model, does this model fit the data?doctor_poisson <- glm(doctorco ~ sex + age + agesq + income + levyplus + freepoor + freerepa + illness + actdays + hscore + chcond1 + chcond2, family=poisson, data = dvisits)
doctor_poisson
##
## Call: glm(formula = doctorco ~ sex + age + agesq + income + levyplus +
## freepoor + freerepa + illness + actdays + hscore + chcond1 +
## chcond2, family = poisson, data = dvisits)
##
## Coefficients:
## (Intercept) sex age agesq income
## -2.22385 0.15688 1.05630 -0.84870 -0.20532
## levyplus freepoor freerepa illness actdays
## 0.12319 -0.44006 0.07980 0.18695 0.12685
## hscore chcond1 chcond2
## 0.03008 0.11409 0.14116
##
## Degrees of Freedom: 5189 Total (i.e. Null); 5177 Residual
## Null Deviance: 5635
## Residual Deviance: 4380 AIC: 6737
summary(doctor_poisson)
##
## Call:
## glm(formula = doctorco ~ sex + age + agesq + income + levyplus +
## freepoor + freerepa + illness + actdays + hscore + chcond1 +
## chcond2, family = poisson, data = dvisits)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9170 -0.6862 -0.5743 -0.4839 5.7005
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.223848 0.189816 -11.716 <2e-16 ***
## sex 0.156882 0.056137 2.795 0.0052 **
## age 1.056299 1.000780 1.055 0.2912
## agesq -0.848704 1.077784 -0.787 0.4310
## income -0.205321 0.088379 -2.323 0.0202 *
## levyplus 0.123185 0.071640 1.720 0.0855 .
## freepoor -0.440061 0.179811 -2.447 0.0144 *
## freerepa 0.079798 0.092060 0.867 0.3860
## illness 0.186948 0.018281 10.227 <2e-16 ***
## actdays 0.126846 0.005034 25.198 <2e-16 ***
## hscore 0.030081 0.010099 2.979 0.0029 **
## chcond1 0.114085 0.066640 1.712 0.0869 .
## chcond2 0.141158 0.083145 1.698 0.0896 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 5634.8 on 5189 degrees of freedom
## Residual deviance: 4379.5 on 5177 degrees of freedom
## AIC: 6737.1
##
## Number of Fisher Scoring iterations: 6
Reference: https://stats.stackexchange.com/questions/108995/interpreting-residual-and-null-deviance-in-glm-r The residual deviance is quite high. Probably not the best fit.
par(mfrow=c(2,2))
plot(doctor_poisson)
There are lines because the responses are discrete continuous numbers.
step(doctor_poisson, direction="backward")
## Start: AIC=6737.08
## doctorco ~ sex + age + agesq + income + levyplus + freepoor +
## freerepa + illness + actdays + hscore + chcond1 + chcond2
##
## Df Deviance AIC
## - agesq 1 4380.1 6735.7
## - freerepa 1 4380.3 6735.8
## - age 1 4380.6 6736.2
## <none> 4379.5 6737.1
## - chcond2 1 4382.4 6738.0
## - chcond1 1 4382.5 6738.0
## - levyplus 1 4382.5 6738.1
## - income 1 4385.0 6740.5
## - freepoor 1 4386.2 6741.8
## - sex 1 4387.4 6743.0
## - hscore 1 4388.1 6743.7
## - illness 1 4481.8 6837.4
## - actdays 1 4917.1 7272.7
##
## Step: AIC=6735.7
## doctorco ~ sex + age + income + levyplus + freepoor + freerepa +
## illness + actdays + hscore + chcond1 + chcond2
##
## Df Deviance AIC
## - freerepa 1 4381.0 6734.5
## <none> 4380.1 6735.7
## - age 1 4383.0 6736.5
## - chcond1 1 4383.2 6736.8
## - levyplus 1 4383.3 6736.9
## - chcond2 1 4383.5 6737.0
## - income 1 4385.0 6738.6
## - freepoor 1 4386.8 6740.4
## - sex 1 4388.0 6741.5
## - hscore 1 4389.1 6742.7
## - illness 1 4481.9 6835.4
## - actdays 1 4917.1 7270.7
##
## Step: AIC=6734.53
## doctorco ~ sex + age + income + levyplus + freepoor + illness +
## actdays + hscore + chcond1 + chcond2
##
## Df Deviance AIC
## <none> 4381.0 6734.5
## - levyplus 1 4383.4 6735.0
## - chcond1 1 4384.3 6735.9
## - chcond2 1 4384.7 6736.3
## - income 1 4386.7 6738.2
## - age 1 4387.1 6738.7
## - freepoor 1 4389.1 6740.6
## - sex 1 4389.5 6741.0
## - hscore 1 4390.2 6741.8
## - illness 1 4482.7 6834.2
## - actdays 1 4917.6 7269.2
##
## Call: glm(formula = doctorco ~ sex + age + income + levyplus + freepoor +
## illness + actdays + hscore + chcond1 + chcond2, family = poisson,
## data = dvisits)
##
## Coefficients:
## (Intercept) sex age income levyplus
## -2.08906 0.16200 0.35513 -0.19981 0.08369
## freepoor illness actdays hscore chcond1
## -0.46960 0.18610 0.12661 0.03112 0.12110
## chcond2
## 0.15889
##
## Degrees of Freedom: 5189 Total (i.e. Null); 5179 Residual
## Null Deviance: 5635
## Residual Deviance: 4381 AIC: 6735
Stepwise does not appear to be much better.
age
, income
, hscore
, actdays
and illness
are statistically significant, which makes sense.
predict(doctor_poisson, dvisits[5190,], type="response")
## 5190
## 0.1533837
The mean amount of visits to the doctor for patient 5190 would be 0.16 visits. We will set \(\lambda = 0.153\)
Reference: https://en.wikipedia.org/wiki/Poisson_distribution
print(paste0("Probability of 0 doctor's visits: ", dpois(0, lambda = 0.153)))
## [1] "Probability of 0 doctor's visits: 0.858129721811394"
print(paste0("Probability of 1 doctor's visits: ", dpois(1, lambda = 0.153)))
## [1] "Probability of 1 doctor's visits: 0.131293847437143"
print(paste0("Probability of 2 doctor's visits: ", dpois(2, lambda = 0.153)))
## [1] "Probability of 2 doctor's visits: 0.0100439793289415"
print(paste0("Probability of 3 doctor's visits: ", dpois(3, lambda = 0.153)))
## [1] "Probability of 3 doctor's visits: 0.000512242945776013"
print(paste0("Probability of 4 doctor's visits: ", dpois(4, lambda = 0.153)))
## [1] "Probability of 4 doctor's visits: 1.95932926759325e-05"
print(paste0("Probability of 5 doctor's visits: ", dpois(5, lambda = 0.153)))
## [1] "Probability of 5 doctor's visits: 5.99554755883536e-07"
# Reference: http://users.stat.ufl.edu/~presnell/Courses/sta4504-2000sp/R/Transcripts/simple-poisson.Rt
round(dpois(0:5, 0.153),3)
## [1] 0.858 0.131 0.010 0.001 0.000 0.000
doctor_lm <- lm(doctorco ~ sex + age + agesq + income + levyplus + freepoor + freerepa + illness + actdays + hscore + chcond1 + chcond2, data=dvisits)
doctor_lm
##
## Call:
## lm(formula = doctorco ~ sex + age + agesq + income + levyplus +
## freepoor + freerepa + illness + actdays + hscore + chcond1 +
## chcond2, data = dvisits)
##
## Coefficients:
## (Intercept) sex age agesq income
## 0.027632 0.033811 0.203201 -0.062103 -0.057323
## levyplus freepoor freerepa illness actdays
## 0.035179 -0.103314 0.033241 0.059946 0.103192
## hscore chcond1 chcond2
## 0.016976 0.004384 0.041617
summary(doctor_lm)
##
## Call:
## lm(formula = doctorco ~ sex + age + agesq + income + levyplus +
## freepoor + freerepa + illness + actdays + hscore + chcond1 +
## chcond2, data = dvisits)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1352 -0.2588 -0.1435 -0.0433 7.0327
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.027632 0.072220 0.383 0.70202
## sex 0.033811 0.021604 1.565 0.11764
## age 0.203201 0.410016 0.496 0.62020
## agesq -0.062103 0.458716 -0.135 0.89231
## income -0.057323 0.033089 -1.732 0.08326 .
## levyplus 0.035179 0.024882 1.414 0.15748
## freepoor -0.103314 0.052471 -1.969 0.04901 *
## freerepa 0.033241 0.038157 0.871 0.38371
## illness 0.059946 0.008357 7.173 8.39e-13 ***
## actdays 0.103192 0.003657 28.216 < 2e-16 ***
## hscore 0.016976 0.005190 3.271 0.00108 **
## chcond1 0.004384 0.023740 0.185 0.85349
## chcond2 0.041617 0.035863 1.160 0.24592
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7139 on 5177 degrees of freedom
## Multiple R-squared: 0.2018, Adjusted R-squared: 0.2
## F-statistic: 109.1 on 12 and 5177 DF, p-value: < 2.2e-16
predict(doctor_lm, dvisits[5190,])
## 5190
## 0.1606531
It appears that it isn’t likely to be too different.
The dataset esdcomp
was recorded on 44 doctors working in an emergency service at a hospital to study the factors affecting the number of complaints received. Build a model for the number of complaints received and write a report on your conclusions.
data(esdcomp, package="faraway")
head(esdcomp)
## visits complaints residency gender revenue hours
## 1 2014 2 Y F 263.03 1287.25
## 2 3091 3 N M 334.94 1588.00
## 3 879 1 Y M 206.42 705.25
## 4 1780 1 N M 226.32 1005.50
## 5 3646 11 N M 288.91 1667.25
## 6 2690 1 N M 275.94 1517.75
complaints_poisson <- glm(complaints ~ ., family=poisson, data=esdcomp)
complaints_poisson
##
## Call: glm(formula = complaints ~ ., family = poisson, data = esdcomp)
##
## Coefficients:
## (Intercept) visits residencyY genderM revenue
## -0.0803448 0.0009499 -0.2319740 0.1122391 -0.0033827
## hours
## -0.0001569
##
## Degrees of Freedom: 43 Total (i.e. Null); 38 Residual
## Null Deviance: 89.45
## Residual Deviance: 49.99 AIC: 184.8
summary(complaints_poisson)
##
## Call:
## glm(formula = complaints ~ ., family = poisson, data = esdcomp)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8989 -0.9193 -0.3835 0.4981 1.8221
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.0803448 1.1542122 -0.070 0.94450
## visits 0.0009499 0.0003386 2.806 0.00502 **
## residencyY -0.2319740 0.2029388 -1.143 0.25301
## genderM 0.1122391 0.2235043 0.502 0.61554
## revenue -0.0033827 0.0041553 -0.814 0.41560
## hours -0.0001569 0.0006634 -0.237 0.81298
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 89.447 on 43 degrees of freedom
## Residual deviance: 49.995 on 38 degrees of freedom
## AIC: 184.77
##
## Number of Fisher Scoring iterations: 5
a <- step(complaints_poisson, direction="backward")
## Start: AIC=184.77
## complaints ~ visits + residency + gender + revenue + hours
##
## Df Deviance AIC
## - hours 1 50.051 182.83
## - gender 1 50.251 183.03
## - revenue 1 50.665 183.44
## - residency 1 51.319 184.10
## <none> 49.995 184.78
## - visits 1 57.568 190.35
##
## Step: AIC=182.83
## complaints ~ visits + residency + gender + revenue
##
## Df Deviance AIC
## - gender 1 50.404 181.18
## - revenue 1 50.739 181.52
## - residency 1 51.321 182.10
## <none> 50.051 182.83
## - visits 1 75.947 206.73
##
## Step: AIC=181.18
## complaints ~ visits + residency + revenue
##
## Df Deviance AIC
## - revenue 1 50.882 179.66
## - residency 1 52.109 180.89
## <none> 50.404 181.18
## - visits 1 76.427 205.21
##
## Step: AIC=179.66
## complaints ~ visits + residency
##
## Df Deviance AIC
## <none> 50.882 179.66
## - residency 1 54.246 181.03
## - visits 1 83.303 210.08
a
##
## Call: glm(formula = complaints ~ visits + residency, family = poisson,
## data = esdcomp)
##
## Coefficients:
## (Intercept) visits residencyY
## -0.7274574 0.0008101 -0.3121729
##
## Degrees of Freedom: 43 Total (i.e. Null); 41 Residual
## Null Deviance: 89.45
## Residual Deviance: 50.88 AIC: 179.7
summary(a)
##
## Call:
## glm(formula = complaints ~ visits + residency, family = poisson,
## data = esdcomp)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9264 -0.9142 -0.3832 0.6496 1.9869
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.7274574 0.3972932 -1.831 0.0671 .
## visits 0.0008101 0.0001429 5.668 1.45e-08 ***
## residencyY -0.3121729 0.1725024 -1.810 0.0703 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 89.447 on 43 degrees of freedom
## Residual deviance: 50.882 on 41 degrees of freedom
## AIC: 179.66
##
## Number of Fisher Scoring iterations: 5