Home Assigment 2- Generalized Linear Models

This work is the second home assigment of the Transport Demand Modelling course of the PhD in Transportation Systems of the Instituto Superior Técnico

Introduction

Traffic crashes may occur due to three main factors: human, vehicles, and the environment. The environment factor is usually the most studied by transportation engineers, given it includes infraestructure, operational characteristics and road conditions (i.e weather, (Xi et al. 2016)). This factor have an impact in both frequency and severity of traffic crashes, and modelling it is important given that future events will occur under the same conditions as in the past situations (Yakar, 2015).

To model crash frequency, Different specifications of econometric models have been proposed to evaluate the relevant factors involving characteristics of traffic factors, control, highways, environment, and involved users. Several studies used the Poisson distribution, concluding that it is adequate to model the crash frequency, although the phenomenon of over-dispersion may occur. Other discrete distributions like the Negative Binomial are also widely used (Lord, 2006). Poisson and Negative Binomial models are suitable econometric approaches for crash assessment and forecasting under certain conditions and are still being used in the literature (Zou, 2018). In general, the Poisson models are more appropriate for homogeneous conditions, while the negative binomial models behave better when the conditions are heterogeneous (Lord, 2006). Lord and Mannering (2010) emphasize that the Poisson distribution is appropriate if mean and variance are equal. When this assumption is substantially violated, the NB distribution can provide an improvement over the Poisson distribution.

There is vast literature discussing the modelling approach for crash frequency. In this sense, the review provided by Lord and Mannering (2010) is really useful to select the most appropriate model to apply, based on its pros, cons, and more important the characteristics of the data gathered (sample size, over/under-dispersion, sample mean, amount of zeros) and estimation process complexity.

In this work, we present a case study of road accidents estimation, based on a sub-sample of a dataset collected for Ana Fernandes’ PhD Thesis which aimed to develop a maintenance model based on the costs generated by road accidents.

The objective is to define the most suitable model for estimating crash frequency for the given dataset. We estimated different models with different specifications and present a comparison, then we suggest the “best” model, based on statistical analysis and our criteria.

Methods

The data used in this home assignment is based on a sub-sample of a dataset extracted from Ana Fernandes’ PhD Thesis, which aimed to develop a maintenance model based on the costs generated by road accidents. The dataset contains data about 157 road sections (all from highways and 1km long) and the registered road accidents occurred between 1997 and 2002.

Data

The variables considered in this work were:

Variables	Description	Type
NumAcc9702	Number of road accidents between 1997 and 2002	Discrete
IFI	International friction index
TMDAveícdia	Average daily traffic
pHeav	Percentage heavy vehicles
AS	Average speed
PUrb	Percentage of the segment’s length belonging to an urban area. The higher the percentage is, the lower is the circulation speed of cars and the more they are exposed to potential conflicts with vehicles or pedestrians. A higher number of emergency breaks are expected also.
PRCross	Percentage of the segment’s length under the influence of intersections. The higher the influence of intersections (measured in terms of length of segments in the vicinity to intersections), the higher potential conflicts can be expected as intersections are obvious locations for conflicts to occur.
pTExt	Length (m) of the segment with road curves. The higher the proportion of the segment with road curves, the less favourable are the conditions for driving.
CDR	Class of the most unfavourable curve radius in the segment (a segment can have several curves with different radius). Road curve radius were categorized in 4 classes from 0 to 4: Class 0 ó R>1000m; Class 1 ó 750m<R<R<R
CI	Longitudinal inclination class. A categorical variable was created indicating if the segment has a gradient (CI=1); if it is levelled (CI=0); or if there is a mix of gradient and level parts (CI=0,5). CI =0 is better than (CI=0,5) that is better than CI=1.
AP	Average annual precipitation (mm
AcAADT	Accumulated Annual Average Daily Traffic 106 vehicles

Exploratory analysis

As a good modelling practice, our first step is to conduct an exploratory analysis of the data, view the magnitudes of each variable, the quartils, check for any missing data and outliers.

Based on this, we summarized the dataset charactistics using the skim function:

Data summary
Name	df1
Number of rows	157
Number of columns	12
_______________________
Column type frequency:
numeric	12
________________________
Group variables	None

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
NumAcc9702	1	5.82	8.19	0.00	2.00	3.00	6.00	64.00	▇▁▁▁▁
IFI	1	31.27	8.34	10.00	25.00	35.00	37.00	45.00	▂▂▂▇▂
TMDA	1	9072.61	5555.74	4700.00	5600.00	7500.00	9500.00	29200.00	▇▁▁▁▁
pHeav	1	0.10	0.06	0.03	0.08	0.09	0.11	0.26	▃▇▁▁▁
AS	1	85.31	4.52	81.00	81.00	83.00	88.00	94.00	▇▁▃▁▂
pUrb	1	0.09	0.27	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▁
pRCross	1	0.13	0.21	0.00	0.00	0.00	0.20	1.00	▇▁▁▁▁
Ptext	1	299.49	251.29	0.00	0.00	300.00	500.00	1000.00	▇▅▅▂▁
CDR	1	1.65	1.42	0.00	0.00	2.00	3.00	4.00	▇▁▅▅▂
CI	1	0.44	0.50	0.00	0.00	0.00	1.00	1.00	▇▁▁▁▆
AP	1	811.10	399.49	490.00	530.00	627.00	793.00	1738.00	▇▃▂▁▂
AcAADT	1	16.68	0.46	16.15	16.32	16.61	16.85	17.97	▇▇▂▁▂

An histogram, is an useful plot to observe the behavior of each variable, for example We can observe that almost all segments were in rural areas.:

A fundamental check we must conduct is to look for outliers, analyze them and decide wether we should remove them or not:

df_no_outliers <- df1
boxplot(df1, las = 2)

outlier <- function(x){
  quant <- quantile(x, probs=c(0.25, 0.75))
  caps <- quantile(x, probs=c(0.05, 0.95))
  H <- 1.5* IQR(x, na.rm = TRUE)
  x[x < (quant[1] - H)] <- caps[1]
  x[x > (quant[2] + H)] <- caps[2]
  return(x)
}

df_no_outliers$TMDA = outlier(df_no_outliers$TMDA)
df_no_outliers$AP = outlier(df_no_outliers$AP)
df_no_outliers$AcAADT = outlier(df_no_outliers$AcAADT)

boxplot(df_no_outliers, las = 2)

df <-df_no_outliers

We decided to apply the “outlier” function based on the IQR method to remove the outliers of 4 variables, presented in the chunk.

Given variables CPR and CI are categorical, we used the factor function to indicate R this condition. Letting R know that this should not be treated as a numeric variable, but as a categorical nominal one.

## 
##  Leveled Gradient 
##       88       69

## 
##      R>1000m 750m<R<1000m  500m<R<750m  500m<R<300m       R<300m 
##           57            9           38           38           15

Then, we ran the Skim function again to check that the variables were transformed:

skim(df)

Data summary
Name	df
Number of rows	157
Number of columns	12
_______________________
Column type frequency:
factor	2
numeric	10
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
CDR	0	1	FALSE	5	R>1: 57, 500: 38, 500: 38, R<3: 15
CI	0	1	FALSE	2	Lev: 88, Gra: 69

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
NumAcc9702	1	5.82	8.19	0.00	2.00	3.00	6.00	64.00	▇▁▁▁▁
IFI	1	31.27	8.34	10.00	25.00	35.00	37.00	45.00	▂▂▂▇▂
TMDA	1	9394.27	6063.28	4700.00	5600.00	7500.00	9500.00	24500.00	▇▂▁▁▂
pHeav	1	0.10	0.06	0.03	0.08	0.09	0.11	0.26	▃▇▁▁▁
AS	1	85.31	4.52	81.00	81.00	83.00	88.00	94.00	▇▁▃▁▂
pUrb	1	0.09	0.27	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▁
pRCross	1	0.13	0.21	0.00	0.00	0.00	0.20	1.00	▇▁▁▁▁
Ptext	1	299.49	251.29	0.00	0.00	300.00	500.00	1000.00	▇▅▅▂▁
AP	1	811.10	399.49	490.00	530.00	627.00	793.00	1738.00	▇▃▂▁▂
AcAADT	1	16.68	0.45	16.15	16.32	16.61	16.85	17.80	▇▇▃▁▂

Preliminary hypothesis

Based on the aforementioned, we formulated the following preliminary hypothesis regarding the influence of each variable in the Number of Accidents:

-IFI: It is expected that with a higher IFI the number of crashes in the segment would decrease.

-TMDAveicdia: If there are more vehicles passing by there are more interactions between them (that can eventually generate crashes). Then, we expect that as this variable grows, the number of crashes will too.

-pHeav: The presence of heavy vehicles in the traffic may influence an increased sense of caution and alert in other drivers, along with speed reduction which, hence, it can have a negative impact in the occurrence of accidents.

-AS: Speed have been proved as one the most important factors influencing crash frequency and severity, the higher the speed, the higher the number of accidents.

-pUrb: It is expected that in the road sections that cross urban areas, the probability of crashes tend to increase because the interactions between road users also increase, and there are more sudden stops and speed changes. (Arevalo-Támara, et. al. 2020)

-pRCross: Crosses are mainly related to urban areas, then we make the same hypothesis as the previous variable.

-Ptext: When the road section does not have many turn extensions, the drivers are able to increase the speed with a certain degree of confidence which, in turn, may increase number of accidents (because of the high speeds). However, the existence of turn extensions combined with the lack of caution and responsibility of the drivers can also result in crashes.

-CDR: With a more favourable radius, i.e., larger radius, it is expected that accidents would decrease.

-CI: Slope is related with visibility distance (vertical curves), but this could increase (the higher the slope the lower the visibility distance if there are vertical curves) or decrease the number of crashes (in hilly areas drivers are more cautions, hence less crashes (Arevalo et. al.,2020)),

-AP: Under extreme conditions of heavy rain, the friction and visibility decrease. Thus, it is expectable that more crashes may occur. On the other hand, the drivers may be more cautions and alert, therefore reducing the number of accidents.We think this variable would be more significant if we were assesing crashes SEVERITY and not frequency.

-AcAADT: is an offset of TMDA.

Modeling framework

A generalized linear model (GLM) is a generalization of ordinary linear regression that allows to model variables that have error distribution other than a normal distribution like Gaussian distribution.

Generalized linear models are suitable for this data, because:

-The dependent variable is not continuous -The effect of independent variables may not be linear -The expected value of the errors terms might not be 0.

Given the nature of the data, we stimated poisson, over-dispersed poisson, negative binomial and zero-inflated models. For each one of them, we checked the assumptions for its suitability. However, results of not suitable models were presented for comparison purposes.

These are validated methods for crash frequency analisys and their theory is clearly explained in Washington, Karlaftis & Mannering (2003).

Poisson regression

Poisson regression is useful when predicting an outcome variable representing counts from a set of continuous predictor variables. This distribution can be thought of as the number of occurrences of an event of interest in a fixed period of time and is appropriate for variables with non-negative integer values. If you have overdispersion (see if residual deviance is much larger than degrees of freedom), you may want to use quasipoisson instead of Poisson.

Quasi-Poisson

A simple remedy for overdispersed count data is to introduce a dispersion parameter into the Poisson model, so that the conditional variance of the response is now V (Yi|ηi) = φμi. If φ > 1, therefore, the conditional variance of Y increases more rapidly than its mean. There is no exponential family corresponding to this specification, and the resulting GLM does not imply a specific probability distribution for the response variable.

Negative binomial model

This distribution can be thought of as the number of trials required to observe k successes and is appropriate for variables with non-negative integer values. If a data value is a non-integer, less than 0, or missing, then the corresponding case is not used in the analysis.

Results

First, we checked the distribution of the dependant variable (NumAcc9702). The mean is not equal to the variance, hence we do not have a poisson distribution:

var(df$NumAcc9702)/mean(df$NumAcc9702)

## [1] 11.54742

Regarding overdispersion, after computing the coefficient of variance we obtain a value of 11.6, which means we have overdispersion. A negative binomial would fit better. This is confirmed by applying the Maximum Likelihood ratio test, which gives us a p-value of 2.8e-153

Poisson Model

The first count data model estimated was the Poisson, we included this model for comparison purposes only, given that, as mentioned before the data does not meet the requirements for this type of model. Offset is the variable that is used to denote the exposure period in the Poisson regression. Two models are presented:

	Poiss1			Poiss2
Variable	Estimate	std. Error	Statistic	Estimate	std. Error	Statistic
(Intercept)	4.75 ^***	1.03	4.62	4.11 ^***	0.10	42.29
IFI	-0.06 ^***	0.01	-9.54	-0.08 ^***	0.00	-22.60
CI [Gradient]	0.28 ^***	0.08	3.39
AS	-0.05 ^***	0.01	-3.44
pHeav	-5.75 ^***	0.54	-10.57
pUrb	0.80 ^***	0.10	7.84
Observations	157			157
R² Nagelkerke	0.992			0.964
AIC	838.024			1084.253
log-Likelihood	-413.012			-540.126
p<0.05 p<0.01 * p<0.001

In the first model “Poiss1”, all explanatory variables are highly significant, however, AS presents a counter-intuitive sign, it indicates indicates that the higher the average speed of the segment, less accidents would occur, this is not reasonable given that there is a vast literature proving that high speeds generate more and more severe crashes. For the other variables, the effect is consistent with our a-priory hypothesis.

On the other hand, as expected, “Poiss1” obviously does not fit the data. For 152 degrees of freedom, the chi-squared is:

qchisq(0.95, df.residual(Poiss1))    #Chi squared for the given degrees of freedom

## [1] 180.6756

deviance(Poiss1)

## [1] 366.3815

pr <- residuals(Poiss1,"pearson")
sum(pr^2)  #Pearson´s chi-squared

## [1] 367.0736

And both, the deviance and the Pearson´s chi-squared are both above 500.

The second model, “Poiss2” containing only the IFI variable, is also significant but “Poiss1” has a higher Pseudo R squared and a higher likelihood ratio, meaning that “Poiss1” is the best model between the two.

Overdispersed poisson (ODP)

As over-dispersion is extra-confirmed, we wanted to try assuming that the variance is proportional rather than equal to the mean, so we estimated a scale (phi) parameter dividing Pearson´s Chi-squared by its d.f.

phi <- sum(pr^2)/df.residual(Poiss1)

round(c(phi,sqrt(phi)),4)

## [1] 2.4310 1.5592

This means we should multiply the standard errors by multiplying by 2.01. For this, we used the quasipoisson family of the glm function, we estimated a model using the same variables as Poiss1, all t-statistics decreased, the estimates are the same but the standard errors increased:

we did the same for the model Poiss2 (with only one explanatory variable: IFI)

	ODP1			ODP2
Variable	Estimate	std. Error	Statistic	Estimate	std. Error	Statistic
(Intercept)	4.75 ^**	1.60	2.96	4.11 ^***	0.20	20.11
IFI	-0.06 ^***	0.01	-6.12	-0.08 ^***	0.01	-10.75
CI [Gradient]	0.28 ^*	0.13	2.17
AS	-0.05 ^*	0.02	-2.20
pHeav	-5.75 ^***	0.85	-6.78
pUrb	0.80 ^***	0.16	5.03
Observations	157			157
R² Nagelkerke	0.992			0.964
p<0.05 p<0.01 * p<0.001

We can see that in both cases the stimates are exactly the same but the standard errors increased the double.

Negative binomial (NBM)

Negative binomial is a good alternative when there is over-dispersed data, and it is expected it will provide an improvement over the Poisson. We estimated the model with the same predictors:

This model has lower pseudo r squared compared to the Poisson. The variables are significant and the signs expected, except for the variable speed, we checked the dataset and realized that the AS variable had values only between 81 and 94 km/h, which means there is not an important variability among this variable, and it could be the explanation of the counter-intuitive value, hence, we decided to exclude this variable and present a new nbm3:

We obtained a simpler model, with coherent signs and significant variables, the pseudo r squared is a bit lower and so is the log likelihood, however, this was expected given it has fewer variables, being simpler and considering the over-dispersion of the data, we believe this is a good model.

	Nbm1			Nbm2			Nbm3
Variable	Estimate	std. Error	Statistic	Estimate	std. Error	Statistic	Estimate	std. Error	Statistic
(Intercept)	5.45 ^***	1.37	3.97	4.14 ^***	0.24	17.34	1.24 ^***	0.27	4.53
IFI	-0.06 ^***	0.01	-6.12	-0.09 ^***	0.01	-10.96	-0.07 ^***	0.01	-8.99
CI [Gradient]	0.27 ^*	0.13	2.08
AS	-0.06 ^**	0.02	-3.15
pHeav	-5.43 ^***	0.89	-6.11				-5.08 ^***	0.87	-5.86
pUrb	0.73 ^***	0.21	3.54				0.65 ^**	0.21	3.11
Observations	157			157			157
R² Nagelkerke	0.870			0.672			0.841
AIC	759.604			803.585			767.104
log-Likelihood	-372.802			-398.792			-378.552
p<0.05 p<0.01 * p<0.001

To compare the ODP and NB models, we plotted the Mean-Variance relationship:

This plot shows that both the ODP and the NBM are good fot the bulk of the data but the ODP starts failing as values increase, the NBM being a quadratic, rise faster and is better for high values. This means, the NBM provides a better description of the data than the ODP.

Zero-inflated poisson model (ZIM) (Bonus track)

As is common in crashes databases, there are many segments with zero reported accidents:

zobs <- df$NumAcc9702 == 0

zpoi <- exp(-exp(predict(Poiss1))) # or dpois(0,exp(predict(mp)))

c(obs=mean(zobs), poi=mean(zpoi))

##        obs        poi 
## 0.11464968 0.06256741

This tell us that 11% of the segments being assessed had zero accidents, but the Poisson models predicts that only 6% would not have crashes, this means this model understimate the possibility of having no crashes.

This model creates two populations, the first with an “always zero” probaility (π) and the second where the data has a Poisson distribuition with mean μ. Hence, we estimated a zero-inflated model as follows:

so we stimate it as follows:

We see that the only nearly significant factor (besides the intercept) in the “always zero” class is the IFI, this means that the higher the IFI the higher the probability that zero crashes occur, this is a really important finding to support investments in the road maintenance.

	Zero-inflated
Variable	Estimate	std. Error	Statistic
Count Model
(Intercept)	1.16 ^***	0.16	7.27
IFI	-0.07 ^***	0.01	-12.58
pHeav	-5.17 ^***	0.50	-10.24
pUrb	0.71 ^***	0.10	7.29
Zero-Inflated Model
(Intercept)	-10.25 ^***	2.97	-3.45
IFI	0.13	0.07	1.92
pHeav	5.86	11.08	0.53
pUrb	-114.38	22176.97	-0.01
Observations	157
R² / R² adjusted	0.950 / 0.949
AIC	849.062
log-Likelihood	-416.531
p<0.05 p<0.01 * p<0.001

To check if our actually solves the excess of zeros issue we compute:

 pr <- predict(mzip,type="zero")  # π

mu <- predict(mzip,type="count") # μ

zip <- pr + (1-pr)*exp(-mu) # also predict(mzip,type="prob")[,1]

mean(zip)

## [1] 0.1158482

The value is pretty close to number of zeros in our dataset, this means, this solves the problem of the excess of zeroes.

Best Model comparison

In this section, we present the “best” models

we acknowledge the data is not suitable for a Poisson model due to overdispersion, but between the the NBM and the ZIM, which is the most appropiate? we did the model comparison with the AIC and BIC.

aic <- data.frame(nbm3 = AIC(nbm3), mzip = AIC(mzip))

bic <- data.frame(nbm3 = BIC(nbm3), mzip = BIC(mzip))

print(aic)

##      nbm3     mzip
## 1 767.104 849.0617

print(bic)

##       nbm3     mzip
## 1 782.3853 873.5117

For both AIC and BIC, the nbm3 shows lower values, meaning it is the best model for our data.

Elasticities for the best model

Elasticities show us the true impact a variable has into the dependant variable, it shows the sensibility or the effect that a 1% change in the explanatory variable has into the dependant:

##   variable  elasticity
## 1      IFI -0.37981772
## 2    pHeav -0.08474754
## 3     Purb  0.01058752

This shows that if the IFI and heavy vehicles increase by 1%, the number of crashes on the segment will decrease by 0.37 and 0.08 respectively and would increase by 0.01 if the percentaje of urban area also increases by 1%.

Conclusions

In this work we estimated several count data models, we tried different specifications and “played” with the variables to see if a transformation would increase the model fitness, however, gladly the variables without any transformation performed well. The analysis and comparison between the different types of regressions allowed us to better understand the pros and cons of each, and being able to select the “best” for our data.

An important conclusion, is that the Poisson model had really good significance and pseudo r squared, however this it would have been a mistake to select this model because the data was overdispersed, which means, we cannot only analyze the model output and decide, we have to check if the data meets the assumptions or requirements that the modelling technique to be used needs to meet.

The IFI proved to be the most important variable in all cases, confirmed not only by the model´s estimates but also by the “always zero” class in the ZIM and by the elasticities computed for the NBM.

Also, this is our first attempt with Markdown, and we realized how powerful, useful and pragmatic this tool is.

References

Yakar F. Identification of Accident-Prone Road Sections by Using Relative Frequency Method. Promet - Traffic &Transportation. 2015;27: 539-47. Available from: doi:10.7307/ptt.v27i6.1609

Xi J, Zhao Z, Li W, Wang Q. A Traffic Accident Causation Analysis Method Based on AHP-Apriori. Procedia Engineering. 2016;137: 680-7. Available from: doi:10.1016/j.proeng.2016.01.305

Lord D. Modeling motor vehicle crashes using Poisson-gamma models: Examining the effects of low sample mean values and small sample size on the estimation of the fixed dispersion parameter. Accident Analysis & Prevention. 2006;38: 751-66. Available from: doi:10.1016/ j.aap.2006.02.001

Zou Y, Ash JE, Park B-J, Lord D, Wu L. Empirical Bayes estimates of finite mixture of negative binomial regression models and its application to highway safety. Journal of Applied Statistics. 2018;45: 1652-69. Available from: doi:10.1080/02664763.2017.1389863

Lord D, Mannering F. The statistical analysis of crash-frequency data: A review and assessment of methodological alternatives. Transportation Research Part A: Policy and Practice. 2010;44: 291-305. Available from: doi:10.1016/j.tra.2010.02.001

https://data.princeton.edu/wws509/r/overdispersion

Washington S, Karlaftis M, Mannering F, Statistical and econometric methods for transportation data analysis, 2003