library(reshape2)
library(ggplot2)
library(MASS)
Retrieve data from csv
shipAccidentData=read.csv("E:\\Data Science\\Statistics\\Poisson Distribution\\ShipAccidents.csv")
Summary of the dataframe
summary(shipAccidentData)
type construction
Length:40 Length:40
Class :character Class :character
Mode :character Mode :character
operation service incidents
Length:40 Min. : 0.0 Min. : 0.0
Class :character 1st Qu.: 175.8 1st Qu.: 0.0
Mode :character Median : 782.0 Median : 2.0
Mean : 4089.3 Mean : 8.9
3rd Qu.: 2078.5 3rd Qu.:11.0
Max. :44882.0 Max. :58.0
EDA
(defincidenttab=xtabs(~shipAccidentData$incidents))
shipAccidentData$incidents
0 1 2 3 4 5 6 7 11 12 18 29 39 44 53 58
14 5 2 1 2 1 2 2 2 2 2 1 1 1 1 1
ggplot(shipAccidentData,aes(incidents)) +
geom_histogram(binwidth=.5, position="dodge",color="blue") +
ggtitle("Figure 1: Number of incidents considering all types of ships together")

(defIncidenttab=xtabs(~shipAccidentData$incidents+shipAccidentData$type))
shipAccidentData$type
shipAccidentData$incidents A B C D E
0 3 1 2 5 3
1 0 0 4 0 1
2 0 0 1 1 0
3 1 0 0 0 0
4 1 0 0 1 0
5 0 0 0 0 1
6 1 0 1 0 0
7 0 0 0 0 2
11 1 0 0 1 0
12 0 1 0 0 1
18 1 1 0 0 0
29 0 1 0 0 0
39 0 1 0 0 0
44 0 1 0 0 0
53 0 1 0 0 0
58 0 1 0 0 0
ggplot(shipAccidentData, aes(incidents, fill = type)) +
geom_histogram(binwidth=.5, position="dodge") +
ggtitle("Figure 2: Number of incidents dividing ships in 5 types")

boxplot(shipAccidentData$service~shipAccidentData$incidents,
xlab="No of incidents",ylab="Service",col=seq(2,17))
legend("bottomright", inset=.02,title="No of incidents",
c("0","1","2","3","4","5","6","7","11","12","18","29","39","44","53","58"),fill=seq(2,17),horiz=TRUE,cex=0.8)
title("Figure3: Service Distribution")

with(shipAccidentData, tapply(incidents, type, function(x) {
sprintf("M (Var) = %1.2f (%1.2f)", mean(x), var(x))
}))
A
"M (Var) = 5.25 (40.79)"
B
"M (Var) = 31.62 (419.70)"
C
"M (Var) = 1.50 (3.71)"
D
"M (Var) = 2.12 (14.98)"
E
"M (Var) = 4.00 (20.00)"
There is significant difference between mean and variance
Step 1A. Regression Step [Negative Binomial Regression]
accidentCountModelPP = glm.nb(incidents ~ type+service, data = shipAccidentData)
summary(accidentCountModelPP)
Call:
glm.nb(formula = incidents ~ type + service, data = shipAccidentData,
init.theta = 0.7379181957, link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8652 -1.3839 -0.2367 0.3237 1.7345
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.477e+00 4.436e-01 3.329 0.000872
typeB 4.770e-01 7.962e-01 0.599 0.549135
typeC -1.143e+00 6.701e-01 -1.705 0.088148
typeD -8.125e-01 6.532e-01 -1.244 0.213597
typeE -1.846e-01 6.307e-01 -0.293 0.769744
service 8.100e-05 3.117e-05 2.598 0.009364
(Intercept) ***
typeB
typeC .
typeD
typeE
service **
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(0.7379) family taken to be 1)
Null deviance: 82.572 on 39 degrees of freedom
Residual deviance: 44.656 on 34 degrees of freedom
AIC: 222.78
Number of Fisher Scoring iterations: 1
Theta: 0.738
Std. Err.: 0.242
2 x log-likelihood: -208.781
AIC(accidentCountModelPP)
[1] 222.7812
BIC(accidentCountModelPP)
[1] 234.6033
Step 1B. Verification Step [Negative Binomial Regression]
#Now that we have the model, we can use it to predict values
# and check the correctness of the model.(Train)
accidentCountPP.predict = predict(accidentCountModelPP, shipAccidentData, type="response")
# We can compare with the observed and predicted values (Train)
dfPP <- data.frame(seq(1:dim(shipAccidentData)[1]),shipAccidentData$incidents,accidentCountPP.predict)
colnames(dfPP)=c('Index','incident','predict')
dfPP
Step 2A. Regression Step [Zero inflatted Poisson]
accidentCountModelZI = zeroinfl(incidents ~ type+service,data=shipAccidentData,dist="poisson",link="logit")
glm.fit: fitted probabilities numerically 0 or 1 occurredsystem is computationally singular: reciprocal condition number = 3.66118e-98FALSE
summary(accidentCountModelZI)
Call:
zeroinfl(formula = incidents ~ type + service,
data = shipAccidentData, dist = "poisson",
link = "logit")
Pearson residuals:
Min 1Q Median 3Q
-2.974e+00 -7.410e-01 -1.433e-08 2.046e-01
Max
3.279e+00
Count model coefficients (poisson with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.093e+00 NA NA NA
typeB 1.093e+00 NA NA NA
typeC -1.571e+00 NA NA NA
typeD -3.812e-01 NA NA NA
typeE -2.560e-01 NA NA NA
service 1.884e-05 NA NA NA
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 65.0504 NA NA NA
typeB 28.1912 NA NA NA
typeC -29.3956 NA NA NA
typeD 45.4410 NA NA NA
typeE 25.7062 NA NA NA
service -0.3511 NA NA NA
Number of iterations in BFGS optimization: 26
Log-likelihood: -92.05 on 12 Df
AIC(accidentCountModelZI)
[1] 208.094
BIC(accidentCountModelZI)
[1] 228.3605
Step 2B. Verification Step [Zero inflatted Poisson]
# Now that we have the model, we can use it to predict values
# and check the correctness of the model.
accidentCountZI.predict = predict(accidentCountModelZI,shipAccidentData, type ="response")
# We can compare with the observed and predicted values (Train)
dfZI <- data.frame(seq(1:dim(shipAccidentData)[1]),shipAccidentData$incidents,accidentCountZI.predict)
colnames(dfZI)=c('Index','incident','predict')
dfZI
From the above histogram figure we can see that incidents count data has an excess of zero counts and we also found that the mean and variance of the count data for each type is not sam, hence we performed Zero Inflated regression and Negative Binomial regression respectively.
Zero Inflated regression model gives lesser AIC and BIC values compared to Negative Binomial regression model, so we can say that here for ship accident dataset Zero Inflated regression is the best fit model.
LS0tDQp0aXRsZTogIlNoaXAgYWNjaWRlbnQgYW5hbHlzaXMiDQpvdXRwdXQ6IGh0bWxfbm90ZWJvb2sNCi0tLQ0KDQoNCmBgYHtyfQ0KbGlicmFyeShyZXNoYXBlMikNCmxpYnJhcnkoZ2dwbG90MikNCmxpYnJhcnkoTUFTUykNCmBgYA0KDQojIyBSZXRyaWV2ZSBkYXRhIGZyb20gY3N2DQpgYGB7cn0NCnNoaXBBY2NpZGVudERhdGE9cmVhZC5jc3YoIkU6XFxEYXRhIFNjaWVuY2VcXFN0YXRpc3RpY3NcXFBvaXNzb24gRGlzdHJpYnV0aW9uXFxTaGlwQWNjaWRlbnRzLmNzdiIpDQpgYGANCg0KIyMgRGV0YWlsIGluZm9ybWF0aW9uIG9mIHRoZSBkYXRhc2V0DQpgYGB7cn0NCnN0cihzaGlwQWNjaWRlbnREYXRhKQ0KYGBgDQojIyBTdW1tYXJ5IG9mIHRoZSBkYXRhZnJhbWUNCg0KYGBge3J9DQpzdW1tYXJ5KHNoaXBBY2NpZGVudERhdGEpDQpgYGANCiMjIEVEQQ0KYGBge3J9DQoNCihkZWZpbmNpZGVudHRhYj14dGFicyh+c2hpcEFjY2lkZW50RGF0YSRpbmNpZGVudHMpKQ0KDQpnZ3Bsb3Qoc2hpcEFjY2lkZW50RGF0YSxhZXMoaW5jaWRlbnRzKSkgKyANCiAgZ2VvbV9oaXN0b2dyYW0oYmlud2lkdGg9LjUsIHBvc2l0aW9uPSJkb2RnZSIsY29sb3I9ImJsdWUiKSArIA0KICBnZ3RpdGxlKCJGaWd1cmUgMTogTnVtYmVyIG9mIGluY2lkZW50cyBjb25zaWRlcmluZyBhbGwgdHlwZXMgb2Ygc2hpcHMgdG9nZXRoZXIiKQ0KDQpgYGANCg0KYGBge3J9DQooZGVmSW5jaWRlbnR0YWI9eHRhYnMofnNoaXBBY2NpZGVudERhdGEkaW5jaWRlbnRzK3NoaXBBY2NpZGVudERhdGEkdHlwZSkpDQpnZ3Bsb3Qoc2hpcEFjY2lkZW50RGF0YSwgYWVzKGluY2lkZW50cywgZmlsbCA9IHR5cGUpKSArDQogIGdlb21faGlzdG9ncmFtKGJpbndpZHRoPS41LCBwb3NpdGlvbj0iZG9kZ2UiKSArIA0KICBnZ3RpdGxlKCJGaWd1cmUgMjogTnVtYmVyIG9mIGluY2lkZW50cyBkaXZpZGluZyBzaGlwcyBpbiA1IHR5cGVzIikNCg0KYGBgDQoNCmBgYHtyfQ0KYm94cGxvdChzaGlwQWNjaWRlbnREYXRhJHNlcnZpY2V+c2hpcEFjY2lkZW50RGF0YSRpbmNpZGVudHMsDQogICAgICAgIHhsYWI9Ik5vIG9mIGluY2lkZW50cyIseWxhYj0iU2VydmljZSIsY29sPXNlcSgyLDE3KSkNCmxlZ2VuZCgiYm90dG9tcmlnaHQiLCBpbnNldD0uMDIsdGl0bGU9Ik5vIG9mIGluY2lkZW50cyIsDQogICAgICAgYygiMCIsIjEiLCIyIiwiMyIsIjQiLCI1IiwiNiIsIjciLCIxMSIsIjEyIiwiMTgiLCIyOSIsIjM5IiwiNDQiLCI1MyIsIjU4IiksZmlsbD1zZXEoMiwxNyksaG9yaXo9VFJVRSxjZXg9MC44KQ0KdGl0bGUoIkZpZ3VyZTM6IFNlcnZpY2UgRGlzdHJpYnV0aW9uIikNCg0KYGBgDQpgYGB7cn0NCndpdGgoc2hpcEFjY2lkZW50RGF0YSwgdGFwcGx5KGluY2lkZW50cywgdHlwZSwgZnVuY3Rpb24oeCkgew0KICBzcHJpbnRmKCJNIChWYXIpID0gJTEuMmYgKCUxLjJmKSIsIG1lYW4oeCksIHZhcih4KSkNCn0pKQ0KYGBgDQojIyBUaGVyZSBpcyAgc2lnbmlmaWNhbnQgZGlmZmVyZW5jZSBiZXR3ZWVuIG1lYW4gYW5kIHZhcmlhbmNlIA0KDQojIyBTdGVwIDFBLiBSZWdyZXNzaW9uIFN0ZXAgW05lZ2F0aXZlIEJpbm9taWFsIFJlZ3Jlc3Npb25dDQpgYGB7cn0NCmFjY2lkZW50Q291bnRNb2RlbFBQID0gZ2xtLm5iKGluY2lkZW50cyB+IHR5cGUrc2VydmljZSwgZGF0YSA9IHNoaXBBY2NpZGVudERhdGEpDQpzdW1tYXJ5KGFjY2lkZW50Q291bnRNb2RlbFBQKQ0KQUlDKGFjY2lkZW50Q291bnRNb2RlbFBQKQ0KQklDKGFjY2lkZW50Q291bnRNb2RlbFBQKQ0KYGBgDQoNCiMjIFN0ZXAgMUIuIFZlcmlmaWNhdGlvbiBTdGVwIFtOZWdhdGl2ZSBCaW5vbWlhbCBSZWdyZXNzaW9uXQ0KYGBge3J9DQojTm93IHRoYXQgd2UgaGF2ZSB0aGUgbW9kZWwsIHdlIGNhbiB1c2UgaXQgdG8gcHJlZGljdCB2YWx1ZXMgDQojIGFuZCBjaGVjayB0aGUgY29ycmVjdG5lc3Mgb2YgdGhlIG1vZGVsLihUcmFpbikgDQphY2NpZGVudENvdW50UFAucHJlZGljdCA9IHByZWRpY3QoYWNjaWRlbnRDb3VudE1vZGVsUFAsIHNoaXBBY2NpZGVudERhdGEsIHR5cGU9InJlc3BvbnNlIikNCiMgV2UgY2FuIGNvbXBhcmUgd2l0aCB0aGUgb2JzZXJ2ZWQgYW5kIHByZWRpY3RlZCB2YWx1ZXMgKFRyYWluKQ0KZGZQUCA8LSBkYXRhLmZyYW1lKHNlcSgxOmRpbShzaGlwQWNjaWRlbnREYXRhKVsxXSksc2hpcEFjY2lkZW50RGF0YSRpbmNpZGVudHMsYWNjaWRlbnRDb3VudFBQLnByZWRpY3QpDQpjb2xuYW1lcyhkZlBQKT1jKCdJbmRleCcsJ2luY2lkZW50JywncHJlZGljdCcpDQpkZlBQDQpgYGANCg0KIyMgIFN0ZXAgMkEuIFJlZ3Jlc3Npb24gU3RlcCBbWmVybyBpbmZsYXR0ZWQgUG9pc3Nvbl0NCmBgYHtyfQ0KbGlicmFyeShwc2NsKQ0KYWNjaWRlbnRDb3VudE1vZGVsWkkgPSB6ZXJvaW5mbChpbmNpZGVudHMgfiB0eXBlK3NlcnZpY2UsZGF0YT1zaGlwQWNjaWRlbnREYXRhLGRpc3Q9InBvaXNzb24iLGxpbms9ImxvZ2l0IikNCnN1bW1hcnkoYWNjaWRlbnRDb3VudE1vZGVsWkkpDQpBSUMoYWNjaWRlbnRDb3VudE1vZGVsWkkpDQpCSUMoYWNjaWRlbnRDb3VudE1vZGVsWkkpDQpgYGANCg0KIyMgU3RlcCAyQi4gVmVyaWZpY2F0aW9uIFN0ZXAgW1plcm8gaW5mbGF0dGVkIFBvaXNzb25dDQpgYGB7cn0NCiMgTm93IHRoYXQgd2UgaGF2ZSB0aGUgbW9kZWwsIHdlIGNhbiB1c2UgaXQgdG8gcHJlZGljdCB2YWx1ZXMgDQojIGFuZCBjaGVjayB0aGUgY29ycmVjdG5lc3Mgb2YgdGhlIG1vZGVsLiANCmFjY2lkZW50Q291bnRaSS5wcmVkaWN0ID0gcHJlZGljdChhY2NpZGVudENvdW50TW9kZWxaSSxzaGlwQWNjaWRlbnREYXRhLCB0eXBlID0icmVzcG9uc2UiKQ0KIyBXZSBjYW4gY29tcGFyZSB3aXRoIHRoZSBvYnNlcnZlZCBhbmQgcHJlZGljdGVkIHZhbHVlcyAoVHJhaW4pDQpkZlpJIDwtIGRhdGEuZnJhbWUoc2VxKDE6ZGltKHNoaXBBY2NpZGVudERhdGEpWzFdKSxzaGlwQWNjaWRlbnREYXRhJGluY2lkZW50cyxhY2NpZGVudENvdW50WkkucHJlZGljdCkNCmNvbG5hbWVzKGRmWkkpPWMoJ0luZGV4JywnaW5jaWRlbnQnLCdwcmVkaWN0JykNCmRmWkkNCmBgYA0KDQoNCiMjIyBGcm9tIHRoZSBhYm92ZSBoaXN0b2dyYW0gZmlndXJlIHdlIGNhbiBzZWUgdGhhdCBpbmNpZGVudHMgY291bnQgZGF0YSBoYXMgYW4gZXhjZXNzIG9mIHplcm8gY291bnRzIGFuZCB3ZSBhbHNvIGZvdW5kIHRoYXQgdGhlIG1lYW4gYW5kIHZhcmlhbmNlIG9mIHRoZSBjb3VudCBkYXRhIGZvciBlYWNoIHR5cGUgaXMgbm90IHNhbSwgaGVuY2Ugd2UgcGVyZm9ybWVkIFplcm8gSW5mbGF0ZWQgcmVncmVzc2lvbiBhbmQgTmVnYXRpdmUgQmlub21pYWwgcmVncmVzc2lvbiByZXNwZWN0aXZlbHkuDQojIyMgWmVybyBJbmZsYXRlZCByZWdyZXNzaW9uIG1vZGVsIGdpdmVzIGxlc3NlciBBSUMgYW5kIEJJQyB2YWx1ZXMgY29tcGFyZWQgdG8gTmVnYXRpdmUgQmlub21pYWwgcmVncmVzc2lvbiBtb2RlbCwgc28gd2UgY2FuIHNheSB0aGF0IGhlcmUgZm9yIHNoaXAgYWNjaWRlbnQgZGF0YXNldCBaZXJvIEluZmxhdGVkIHJlZ3Jlc3Npb24gaXMgdGhlIGJlc3QgZml0IG1vZGVsLiANCg0K