Download package and renames collumn. The collumn “Average.Monthly.Temperature” has been renamed to “Temperature”. After examine the data, we consider to use generalrised linear regression model becasue the reponse variable, incidents is a count data (Discrete and there are many number of value).
pacman::p_load(plyr,tidyverse, reshape2,magrittr,data.table,ggplot2,lubridate,RcolorBrewer,gdata,MASS,Rmarkdown)
## Installing package into 'C:/Users/t_car/OneDrive/Documents/R/win-library/3.4'
## (as 'lib' is unspecified)
## Warning: package 'RcolorBrewer' is not available (for R version 3.4.3)
## Warning: Perhaps you meant 'RColorBrewer' ?
## Bioconductor version 3.6 (BiocInstaller 1.28.0), ?biocLite for help
## A new version of Bioconductor is available after installing the most
## recent version of R; see http://bioconductor.org/install
## Warning in p_install(package, character.only = TRUE, ...):
## Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
## logical.return = TRUE, : there is no package called 'RcolorBrewer'
## Installing package into 'C:/Users/t_car/OneDrive/Documents/R/win-library/3.4'
## (as 'lib' is unspecified)
## Warning: package 'Rmarkdown' is not available (for R version 3.4.3)
## Warning: Perhaps you meant 'rmarkdown' ?
## Bioconductor version 3.6 (BiocInstaller 1.28.0), ?biocLite for help
## A new version of Bioconductor is available after installing the most
## recent version of R; see http://bioconductor.org/install
## Warning in p_install(package, character.only = TRUE, ...):
## Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
## logical.return = TRUE, : there is no package called 'Rmarkdown'
## Warning in pacman::p_load(plyr, tidyverse, reshape2, magrittr, data.table, : Failed to install/load:
## RcolorBrewer, Rmarkdown
setwd('C:/Users/t_car/OneDrive/Documents/MTech/5. EB5102')
DIR = '3. Day_3'
file1 = 'Data.csv'
# Load data
data = read.csv(file.path(DIR, file1))
#Examine Data
str(data) # Check observation, data type and example of data.
## 'data.frame': 54 obs. of 4 variables:
## $ Incidents : int 0 1 0 0 1 0 3 1 0 0 ...
## $ PMD : int 100 202 310 406 497 594 702 803 896 997 ...
## $ Monthly.rainfall : int 220 200 197 199 174 172 174 175 165 200 ...
## $ Average.Monthly.Temperature: int 28 28 29 30 30 32 32 32 30 29 ...
names(data) # Check the variables name
## [1] "Incidents" "PMD"
## [3] "Monthly.rainfall" "Average.Monthly.Temperature"
head(data,15) # Shows example of data 15 rows
## Incidents PMD Monthly.rainfall Average.Monthly.Temperature
## 1 0 100 220 28
## 2 1 202 200 28
## 3 0 310 197 29
## 4 0 406 199 30
## 5 1 497 174 30
## 6 0 594 172 32
## 7 3 702 174 32
## 8 1 803 175 32
## 9 0 896 165 30
## 10 0 997 200 29
## 11 1 1106 230 28
## 12 5 1215 220 28
## 13 2 1321 200 28
## 14 4 1420 198 29
## 15 2 1512 153 29
data <- plyr::rename(data, c("Average.Monthly.Temperature" = "Temperature"))
To confirm that we choose the right model, I have conduct initial visualization to see weather the model fit data,or not? and also verify the model assumptions.
From the density plot, it seems to be poision distribution. Response variables contian integer which no minus value and the cruve seems to be possion distribution.
ggplot(data, aes(x=Incidents)) +
geom_density(color="darkblue", fill="lightblue",alpha=0.4)
So, we fit the model by using poission regression.
The results of the model indicates that only “PMD” is sinificant, the Monthly.rainfall and Temperature are not significant.
Residual Deviance: 115.9 : The results same as the I have sone on spss and JMP, deviance is 115.9. The data do not fit model well. AIC: 289.5 : Akike value is 280.5
Coeffficint(Beta) (Intercept) : -1.1896315 data\(PMD : 0.0005988 data\)Monthly.rainfall : 0.0005463 data$Temperature : 0.0311325
summary(m1 <- glm(Incidents ~ data$PMD+data$Monthly.rainfall+data$Temperature,
data, family=poisson))
##
## Call:
## glm(formula = Incidents ~ data$PMD + data$Monthly.rainfall +
## data$Temperature, family = poisson, data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1312 -1.3110 -0.2644 0.7166 3.8717
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.190e+00 1.210e+00 -0.983 0.325
## data$PMD 5.988e-04 4.182e-05 14.319 <2e-16 ***
## data$Monthly.rainfall 5.463e-04 2.674e-03 0.204 0.838
## data$Temperature 3.113e-02 3.173e-02 0.981 0.327
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 387.38 on 53 degrees of freedom
## Residual deviance: 115.85 on 50 degrees of freedom
## AIC: 289.5
##
## Number of Fisher Scoring iterations: 5
m1
##
## Call: glm(formula = Incidents ~ data$PMD + data$Monthly.rainfall +
## data$Temperature, family = poisson, data = data)
##
## Coefficients:
## (Intercept) data$PMD data$Monthly.rainfall
## -1.1896315 0.0005988 0.0005463
## data$Temperature
## 0.0311325
##
## Degrees of Freedom: 53 Total (i.e. Null); 50 Residual
## Null Deviance: 387.4
## Residual Deviance: 115.9 AIC: 289.5
Print predictive value to examine the predictive results with original data. However, I don’t allow to print for reporting.
#print=data.frame(data,pred=m1$fitted)
#print
The real data always not met the equidispersion assumption for poisson model. So, this step is help to check weather the data meet the assumption or not? By fitting the model and get teh dispersion veriable, if the dispersion is grether than 1, it indicate there isover-dispersion. So, we need to use the negativve binomail model to handle the issue.
Based on the model result, it indicated that residual deviance/(n-p) = 2.296474. It has over-dispersion issue.
model.disp = glm(Incidents ~ data$PMD+data$Monthly.rainfall+data$Temperature, data, family=quasipoisson(link=log))
summary(model.disp)
##
## Call:
## glm(formula = Incidents ~ data$PMD + data$Monthly.rainfall +
## data$Temperature, family = quasipoisson(link = log), data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1312 -1.3110 -0.2644 0.7166 3.8717
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.190e+00 1.834e+00 -0.649 0.519
## data$PMD 5.988e-04 6.338e-05 9.449 1.05e-12 ***
## data$Monthly.rainfall 5.463e-04 4.053e-03 0.135 0.893
## data$Temperature 3.113e-02 4.809e-02 0.647 0.520
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 2.296474)
##
## Null deviance: 387.38 on 53 degrees of freedom
## Residual deviance: 115.85 on 50 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 5
summary(model.disp)$dispersion
## [1] 2.296474
Due to over-dispersion issue, I hace perform the negative binomal model instead of poisson. The model results indicate that only PMD is sinificant, so the PMD increase appear to increase the number of incidents.
Residual Deviance: 63.289 : The nagetive binomail model fit the data better than poisson AIC: 274.61 : Akike value also decrease to 274.61 which is better.
summary(m1 <- glm.nb(Incidents ~ data$PMD+data$Monthly.rainfall+data$Temperature, data = data))
##
## Call:
## glm.nb(formula = Incidents ~ data$PMD + data$Monthly.rainfall +
## data$Temperature, data = data, init.theta = 5.795792259,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6387 -0.7827 -0.2086 0.4723 2.2622
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.770e-01 2.090e+00 -0.276 0.783
## data$PMD 5.897e-04 5.962e-05 9.891 <2e-16 ***
## data$Monthly.rainfall 5.296e-04 4.362e-03 0.121 0.903
## data$Temperature 1.204e-02 5.443e-02 0.221 0.825
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(5.7958) family taken to be 1)
##
## Null deviance: 186.586 on 53 degrees of freedom
## Residual deviance: 63.289 on 50 degrees of freedom
## AIC: 274.61
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 5.80
## Std. Err.: 2.55
##
## 2 x log-likelihood: -264.605
Again, I ran the model by drop the insinificant value. The results indicates the deviance same as the full model but the AIC of this model is better.
summary(m1 <- glm.nb(Incidents ~ data$PMD, data = data))
##
## Call:
## glm.nb(formula = Incidents ~ data$PMD, data = data, init.theta = 5.75560483,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6249 -0.7826 -0.2300 0.4903 2.1966
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.205e-01 2.233e-01 -0.54 0.59
## data$PMD 5.911e-04 5.901e-05 10.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(5.7556) family taken to be 1)
##
## Null deviance: 185.99 on 53 degrees of freedom
## Residual deviance: 63.18 on 52 degrees of freedom
## AIC: 270.66
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 5.76
## Std. Err.: 2.51
##
## 2 x log-likelihood: -264.659