Download package and renames collumn. The collumn “Average.Monthly.Temperature” has been renamed to “Temperature”. After examine the data, we consider to use generalrised linear regression model becasue the reponse variable, incidents is a count data (Discrete and there are many number of value).

pacman::p_load(plyr,tidyverse, reshape2,magrittr,data.table,ggplot2,lubridate,RcolorBrewer,gdata,MASS,Rmarkdown)
## Installing package into 'C:/Users/t_car/OneDrive/Documents/R/win-library/3.4'
## (as 'lib' is unspecified)
## Warning: package 'RcolorBrewer' is not available (for R version 3.4.3)
## Warning: Perhaps you meant 'RColorBrewer' ?
## Bioconductor version 3.6 (BiocInstaller 1.28.0), ?biocLite for help
## A new version of Bioconductor is available after installing the most
##   recent version of R; see http://bioconductor.org/install
## Warning in p_install(package, character.only = TRUE, ...):
## Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
## logical.return = TRUE, : there is no package called 'RcolorBrewer'
## Installing package into 'C:/Users/t_car/OneDrive/Documents/R/win-library/3.4'
## (as 'lib' is unspecified)
## Warning: package 'Rmarkdown' is not available (for R version 3.4.3)
## Warning: Perhaps you meant 'rmarkdown' ?
## Bioconductor version 3.6 (BiocInstaller 1.28.0), ?biocLite for help
## A new version of Bioconductor is available after installing the most
##   recent version of R; see http://bioconductor.org/install
## Warning in p_install(package, character.only = TRUE, ...):
## Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
## logical.return = TRUE, : there is no package called 'Rmarkdown'
## Warning in pacman::p_load(plyr, tidyverse, reshape2, magrittr, data.table, : Failed to install/load:
## RcolorBrewer, Rmarkdown
setwd('C:/Users/t_car/OneDrive/Documents/MTech/5. EB5102')
DIR = '3. Day_3'
file1 = 'Data.csv'
# Load data
data = read.csv(file.path(DIR, file1))
#Examine Data
str(data)     # Check observation, data type and example of data. 
## 'data.frame':    54 obs. of  4 variables:
##  $ Incidents                  : int  0 1 0 0 1 0 3 1 0 0 ...
##  $ PMD                        : int  100 202 310 406 497 594 702 803 896 997 ...
##  $ Monthly.rainfall           : int  220 200 197 199 174 172 174 175 165 200 ...
##  $ Average.Monthly.Temperature: int  28 28 29 30 30 32 32 32 30 29 ...
names(data)   # Check the variables name
## [1] "Incidents"                   "PMD"                        
## [3] "Monthly.rainfall"            "Average.Monthly.Temperature"
head(data,15) # Shows example of data 15 rows 
##    Incidents  PMD Monthly.rainfall Average.Monthly.Temperature
## 1          0  100              220                          28
## 2          1  202              200                          28
## 3          0  310              197                          29
## 4          0  406              199                          30
## 5          1  497              174                          30
## 6          0  594              172                          32
## 7          3  702              174                          32
## 8          1  803              175                          32
## 9          0  896              165                          30
## 10         0  997              200                          29
## 11         1 1106              230                          28
## 12         5 1215              220                          28
## 13         2 1321              200                          28
## 14         4 1420              198                          29
## 15         2 1512              153                          29
data <- plyr::rename(data, c("Average.Monthly.Temperature" = "Temperature"))

To confirm that we choose the right model, I have conduct initial visualization to see weather the model fit data,or not? and also verify the model assumptions.

From the density plot, it seems to be poision distribution. Response variables contian integer which no minus value and the cruve seems to be possion distribution.

ggplot(data, aes(x=Incidents)) +
  geom_density(color="darkblue", fill="lightblue",alpha=0.4)

So, we fit the model by using poission regression.

The results of the model indicates that only “PMD” is sinificant, the Monthly.rainfall and Temperature are not significant.

Residual Deviance: 115.9 : The results same as the I have sone on spss and JMP, deviance is 115.9. The data do not fit model well. AIC: 289.5 : Akike value is 280.5

Coeffficint(Beta) (Intercept) : -1.1896315 data\(PMD : 0.0005988 data\)Monthly.rainfall : 0.0005463 data$Temperature : 0.0311325

summary(m1 <- glm(Incidents ~ data$PMD+data$Monthly.rainfall+data$Temperature,
                  data, family=poisson))
## 
## Call:
## glm(formula = Incidents ~ data$PMD + data$Monthly.rainfall + 
##     data$Temperature, family = poisson, data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1312  -1.3110  -0.2644   0.7166   3.8717  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -1.190e+00  1.210e+00  -0.983    0.325    
## data$PMD               5.988e-04  4.182e-05  14.319   <2e-16 ***
## data$Monthly.rainfall  5.463e-04  2.674e-03   0.204    0.838    
## data$Temperature       3.113e-02  3.173e-02   0.981    0.327    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 387.38  on 53  degrees of freedom
## Residual deviance: 115.85  on 50  degrees of freedom
## AIC: 289.5
## 
## Number of Fisher Scoring iterations: 5
m1
## 
## Call:  glm(formula = Incidents ~ data$PMD + data$Monthly.rainfall + 
##     data$Temperature, family = poisson, data = data)
## 
## Coefficients:
##           (Intercept)               data$PMD  data$Monthly.rainfall  
##            -1.1896315              0.0005988              0.0005463  
##      data$Temperature  
##             0.0311325  
## 
## Degrees of Freedom: 53 Total (i.e. Null);  50 Residual
## Null Deviance:       387.4 
## Residual Deviance: 115.9     AIC: 289.5

Print predictive value to examine the predictive results with original data. However, I don’t allow to print for reporting.

#print=data.frame(data,pred=m1$fitted)
#print

The real data always not met the equidispersion assumption for poisson model. So, this step is help to check weather the data meet the assumption or not? By fitting the model and get teh dispersion veriable, if the dispersion is grether than 1, it indicate there isover-dispersion. So, we need to use the negativve binomail model to handle the issue.

Based on the model result, it indicated that residual deviance/(n-p) = 2.296474. It has over-dispersion issue.

model.disp = glm(Incidents ~ data$PMD+data$Monthly.rainfall+data$Temperature, data, family=quasipoisson(link=log))
summary(model.disp)
## 
## Call:
## glm(formula = Incidents ~ data$PMD + data$Monthly.rainfall + 
##     data$Temperature, family = quasipoisson(link = log), data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1312  -1.3110  -0.2644   0.7166   3.8717  
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -1.190e+00  1.834e+00  -0.649    0.519    
## data$PMD               5.988e-04  6.338e-05   9.449 1.05e-12 ***
## data$Monthly.rainfall  5.463e-04  4.053e-03   0.135    0.893    
## data$Temperature       3.113e-02  4.809e-02   0.647    0.520    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 2.296474)
## 
##     Null deviance: 387.38  on 53  degrees of freedom
## Residual deviance: 115.85  on 50  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 5
summary(model.disp)$dispersion
## [1] 2.296474

Due to over-dispersion issue, I hace perform the negative binomal model instead of poisson. The model results indicate that only PMD is sinificant, so the PMD increase appear to increase the number of incidents.

Residual Deviance: 63.289 : The nagetive binomail model fit the data better than poisson AIC: 274.61 : Akike value also decrease to 274.61 which is better.

summary(m1 <- glm.nb(Incidents ~ data$PMD+data$Monthly.rainfall+data$Temperature, data = data))
## 
## Call:
## glm.nb(formula = Incidents ~ data$PMD + data$Monthly.rainfall + 
##     data$Temperature, data = data, init.theta = 5.795792259, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6387  -0.7827  -0.2086   0.4723   2.2622  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -5.770e-01  2.090e+00  -0.276    0.783    
## data$PMD               5.897e-04  5.962e-05   9.891   <2e-16 ***
## data$Monthly.rainfall  5.296e-04  4.362e-03   0.121    0.903    
## data$Temperature       1.204e-02  5.443e-02   0.221    0.825    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(5.7958) family taken to be 1)
## 
##     Null deviance: 186.586  on 53  degrees of freedom
## Residual deviance:  63.289  on 50  degrees of freedom
## AIC: 274.61
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  5.80 
##           Std. Err.:  2.55 
## 
##  2 x log-likelihood:  -264.605

Again, I ran the model by drop the insinificant value. The results indicates the deviance same as the full model but the AIC of this model is better.

summary(m1 <- glm.nb(Incidents ~ data$PMD, data = data))
## 
## Call:
## glm.nb(formula = Incidents ~ data$PMD, data = data, init.theta = 5.75560483, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6249  -0.7826  -0.2300   0.4903   2.1966  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.205e-01  2.233e-01   -0.54     0.59    
## data$PMD     5.911e-04  5.901e-05   10.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(5.7556) family taken to be 1)
## 
##     Null deviance: 185.99  on 53  degrees of freedom
## Residual deviance:  63.18  on 52  degrees of freedom
## AIC: 270.66
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  5.76 
##           Std. Err.:  2.51 
## 
##  2 x log-likelihood:  -264.659