COGS589 - Final Project

Mersin Trip Data Analysis

1. Introduction

In this project, Mersin Trip Data will be examined and number of trips will be analyzed using a linear model that can be useful in forecasting future number of trips. The data set used in this project consists of the data collected while preparing Mersin Transport Master Plan (MUAP) which was completed in 2010. A variable within the data set will be examined as the dependent variable (DV) of the study and among eight independent variables (IVs) the meaningful ones will be used as predictors.

2. Background

2.1. Transport Master Plan

A transport master plan is a collection of thorough, effective, and interrelated plans that establish the framework for a modern and integrated transportation system. It is founded on feasibility studies that establish constructability, cost-benefit analyses that establish economic viability, and a funding model that incorporates analyses of the economic and social benefits. A transportation master plan covers both land and sea, comprising various transport modes such as light rail, heavy rail, bus rapid transit, tram, taxi, ferry, and water taxi. The goal of a transport master plan is to achieve the ideal transport modal share while also improving the road network and connectivity. In 2008, it became mandatory for municipalities of cities with a population of over 100,000 to prepare a 15-year transportation master plan and update the plan every 5 years.

2.2. Four-Stage Transport Model

The classical four-stage transport model consists of the following stages:

Trip generation model
Trip distribution model
Modal split model
Trip assignment model

The four-stage transport model can be observed in the following diagram:

In the trip generation model, the number of trips by purpose and mode is determined. The determination of the number of trips is done through surveys. The study area is divided into zones based on population density. Usually using stratified sampling, certain number of households are surveyed in each zone. In the surveys, the participants in a household are asked about the trips they’ve made within the last 24 hours. Extra information is collected regarding the trips, such as purpose, origin, destination, mode, departure time, and arrival time. The survey results are compared with traffic flow surveys called cordon and screenline surveys carried out on the actual network.

In the trip distribution model, with the existing trip data an origin-destination matrix is constructed indicating how many trips are made from each zone to every other zone. Based on the available growth factors, the origin-destination matrix is forecasted for the target year.

In the modal split model, utility functions are constructed for all available transport modes. Many different factors are considered in the utility functions, such as travel time components (e.g., waiting, walking, line-haul, transfer, access, egress), travel costs (e.g., ticket fares, fuel, parking), income, car ownership, accessibility, convenience, comfort, security, family size, residential density. Overall, the factors can be grouped into three, namely, trip characteristics, transportation system characteristics, and trip-maker characteristics. Using the utility function of all modes, a multinomial logit model is constructed to determine the probability of the selection of mode alternatives, hence the modal choice.

In the trip assignment model, the trips classified by origin, destination, and mode are assigned to actual routes on the transportation network. In this step, the travel times are aimed to be minimized and volume/capacity ratio is aimed to keep at consistent levels. Although speed is a major indicator of service quality, it doesn’t give a concrete idea on how the capacity of the road is utilized. Density -with a unit of passenger cars per kilometer per lane- is a better indicator of congested/uncongested flow, and Level of Service (LOS) ranging from A to F corresponding to different density levels is the best measure for the capacity utilization of a road. While LOS A describes free-flow, LOS F describes complete congestion. Most of the time, capacity or near-capacity volumes are aimed in transportation planning, which correspond to LOS C and D. Aiming for very low volumes lead to the underutilization of the capacity of the roads, which is not desirable. The following flow vs. density diagram shows the traffic flow for below and above road capacity. The density level at the road capacity called the critical density. The unit of flow is passenger cars per hour per lane whereas the unit of density is passenger cars per kilometer per lane.

2.3. Trip Purposes

As stated in the explanation of the trip generation model, trips are categorized in terms of purpose. In general, they can be classified into two: Home-Based (HB) trips are ones in which home is either the origin or the destination, and Non-Home-Based (NHB) trips are ones in which neither end is home.

Home-based trips may have different purposes related to many different reasons such as, work, school, shopping, social, and recreational. Work and school trips are considered to be mandatory trips whereas the others are considered optional trips. In short, home-based trips can be categorized into HBW (Home-Based Work Trips), HBS (Home-Based School Trips), and HBO (Home-Based Other Trips)

A trip consists of two ends, namely, production and attraction. Trip production corresponds to the home end of HB trips or the origin of NHB trips. Trip attraction corresponds to the non-home end of HB trips or the destination of NHB trips. Trip generation can be visualized with the following diagram:

3. Aim of the Study

The main focus of this study is to analyzed the data collected for MUAP, more specifically the number of generated trips for a selected trip purpose. Importing the data set, the variables can be examined closely.

mersin_trip_data <- read.spss('Mersin Trip Data.sav', to.data.frame = TRUE)

The data set consists of 93 observations of 23 variables. The variables are as follows:

colnames(mersin_trip_data)

##  [1] "ZONENO"      "ID"          "AREA"        "DIST_CODE"   "DIST_NAME"  
##  [6] "ZONE_NAME"   "POP_08"      "STD_SCHOOL"  "TOT_AUTO"    "TOT_VEHICLE"
## [11] "TOT_STUDENT" "TOT_WORKER"  "AVG_INCOME"  "ZONE_CODE"   "EMPLOY"     
## [16] "HBW_PROD"    "HBW_ATTR"    "HBS_PROD"    "HBS_ATTR"    "HBO_PROD"   
## [21] "HBO_ATTR"    "NHB_PROD"    "NHB_ATTR"

The labels of the variables are also present in the data set, which can provide insights on the abbreviations.

attr(mersin_trip_data, "variable.labels")

##                                 ZONENO                                     ID 
##                          "Zone Number"                            "ID Number" 
##                                   AREA                              DIST_CODE 
##                        "Zonal Area M2"                     "Code of District" 
##                              DIST_NAME                              ZONE_NAME 
##                     "Name of District"                   "Name of Local Zone" 
##                                 POP_08                             STD_SCHOOL 
##                    "Population (2008)"        "Number of Students at Schools" 
##                               TOT_AUTO                            TOT_VEHICLE 
##                   "Total Automobiles "                       "Total Vehicles" 
##                            TOT_STUDENT                             TOT_WORKER 
## "Total Number of Students at the Zone"  "Total Number of Workers at the Zone" 
##                             AVG_INCOME                              ZONE_CODE 
##                       "Average Income"                            "Zone Code" 
##                                 EMPLOY                               HBW_PROD 
##                           "Employment"           "Home-Based Work Production" 
##                               HBW_ATTR                               HBS_PROD 
##           "Home-Based Work Attraction"         "Home-Based School Production" 
##                               HBS_ATTR                               HBO_PROD 
##         "Home-Based School Attraction"          "Home-Based Other Production" 
##                               HBO_ATTR                               NHB_PROD 
##          "Home-Based Other Attraction"            "Non-Home-Based Production" 
##                               NHB_ATTR 
##            "Non-Home-Based Attraction"

In this project, Home-Based School Trip Production (HBS_PROD) will be used as the dependent variable. As a reminder, as it is a home-based trip production, it corresponds to the number of trips generated from home to school or from school to home.

Among eight independent variables, meaningful ones will be used as the predictors in a linear model. It can be hypothesized that HBS trip production heavily depends on the total number of students at the zone. TOT_STUDENT corresponds to the number of students that reside in that zone, but the students don’t necessarily go to a school within the same zone. As HBS_PROD concerns the home-end of the trips, this should be the most important predictor, at least in theory.

Similarly, number of students at schools can contribute significantly to the model, but not as much as TOT_STUDENT, again in theory. It corresponds to the number of students at schools in a zone, but the students don’t necessarily reside within the same zone. Still, school trips tend to be shorter as parents are more willing to send their kids to closer schools most of the time, so students tend to reside within the same zone as their schools are in.

Population can affect HBS_PROD indirectly as higher population corresponds to more students in a zone. Total number of automobiles or vehicles should not affect HBS_PROD as the students have to go to school regardless of the transport mode, either with private car, public transport, service, or by walking. Income or employment shouldn’t affect HBS_PROD.

4. Preparatory Work

First and foremost, a subset of the data set with relevant dependent and independent variables is created.

trips <- mersin_trip_data %>% select(all_of(study_variables)) # all_of() function is used because of the error related to selection with external vectors
study_variables

## [1] "HBS_PROD"    "POP_08"      "STD_SCHOOL"  "TOT_AUTO"    "TOT_VEHICLE"
## [6] "TOT_STUDENT" "TOT_WORKER"  "AVG_INCOME"  "EMPLOY"

Next, it is observed that the data set does not contain any NA values.

any(is.na(trips))

## [1] FALSE

The correlation among variables can give an idea on where to start. In order to examine the correlations among variables, a correlation matrix is constructed.

rcorr(as.matrix(trips))

##             HBS_PROD POP_08 STD_SCHOOL TOT_AUTO TOT_VEHICLE TOT_STUDENT
## HBS_PROD        1.00   0.87       0.42     0.44        0.52        0.90
## POP_08          0.87   1.00       0.45     0.54        0.64        0.94
## STD_SCHOOL      0.42   0.45       1.00     0.23        0.28        0.47
## TOT_AUTO        0.44   0.54       0.23     1.00        0.95        0.49
## TOT_VEHICLE     0.52   0.64       0.28     0.95        1.00        0.58
## TOT_STUDENT     0.90   0.94       0.47     0.49        0.58        1.00
## TOT_WORKER      0.69   0.81       0.37     0.60        0.66        0.76
## AVG_INCOME      0.17   0.16       0.08     0.70        0.65        0.17
## EMPLOY          0.07   0.06       0.35     0.04        0.02        0.04
##             TOT_WORKER AVG_INCOME EMPLOY
## HBS_PROD          0.69       0.17   0.07
## POP_08            0.81       0.16   0.06
## STD_SCHOOL        0.37       0.08   0.35
## TOT_AUTO          0.60       0.70   0.04
## TOT_VEHICLE       0.66       0.65   0.02
## TOT_STUDENT       0.76       0.17   0.04
## TOT_WORKER        1.00       0.17   0.15
## AVG_INCOME        0.17       1.00  -0.08
## EMPLOY            0.15      -0.08   1.00
## 
## n= 93 
## 
## 
## P
##             HBS_PROD POP_08 STD_SCHOOL TOT_AUTO TOT_VEHICLE TOT_STUDENT
## HBS_PROD             0.0000 0.0000     0.0000   0.0000      0.0000     
## POP_08      0.0000          0.0000     0.0000   0.0000      0.0000     
## STD_SCHOOL  0.0000   0.0000            0.0293   0.0069      0.0000     
## TOT_AUTO    0.0000   0.0000 0.0293              0.0000      0.0000     
## TOT_VEHICLE 0.0000   0.0000 0.0069     0.0000               0.0000     
## TOT_STUDENT 0.0000   0.0000 0.0000     0.0000   0.0000                 
## TOT_WORKER  0.0000   0.0000 0.0002     0.0000   0.0000      0.0000     
## AVG_INCOME  0.1034   0.1252 0.4725     0.0000   0.0000      0.1033     
## EMPLOY      0.5132   0.5999 0.0007     0.6904   0.8331      0.7313     
##             TOT_WORKER AVG_INCOME EMPLOY
## HBS_PROD    0.0000     0.1034     0.5132
## POP_08      0.0000     0.1252     0.5999
## STD_SCHOOL  0.0002     0.4725     0.0007
## TOT_AUTO    0.0000     0.0000     0.6904
## TOT_VEHICLE 0.0000     0.0000     0.8331
## TOT_STUDENT 0.0000     0.1033     0.7313
## TOT_WORKER             0.1065     0.1396
## AVG_INCOME  0.1065                0.4248
## EMPLOY      0.1396     0.4248

The correlation matrices show how strongly each pair of variables are correlated. Pearson correlation coefficient and the significance can be examined for each pair. Based on the initial observation of the correlation matrices, Home-Based School Trip Production seems to have a very strong relationship with the total number of students at the zone and population.

HBS_PROD & TOT_STUDENT pair has a Pearson correlation coefficient of 0.90 with a significance of 0.00. Likewise, HBS_PROD & POP_08 pair has a Pearson correlation coefficient of 0.87 with a significance of 0.00. These very high correlation coefficients imply the existence of very strong correlations, which is in line with the initial hypotheses.

The next strongest correlation with HBS_PROD can be observed for TOT_WORKER with a Pearson correlation coefficient of 0.69 with a significance of 0.00. This indicates that the number of home-based school trips increases relatively strongly with increasing number of workers, which may not be expected at first. This can be due to the fact that TOT_WORKER is correlated with POP_08, which is also correlated with TOT_STUDENT.

Surprisingly enough, the number of students at schools is not that strongly correlated with HBS Production, even though it is still significant at the 0.05 level. It is much less correlated with HBS_PROD than initially hypothesized. It is not that strange, however, as it implies that many students reside in a different zone than their schools are in, which can be expected considering that Mersin is a metropolitan area.

Rest of the independent variables except employment seem to have a very weak relationship with HBS Production. TOT_AUTO and TOT_VEHICLE are still significantly correlated to HBS Production at the 0.05 level, which again can be linked to correlation with the total population. Last but not least, employment does not seem to show any relationship with HBS Production at all with a Pearson correlation coefficient of 0.07 with a significance of 0.51.

The strongest correlation between any variable pair is observed between TOT_AUTO and TOT_VEHICLE with a Pearson correlation coefficient of 0.95, which is highly anticipated as the total number of automobiles constitutes a part of the total number of vehicles. The weakest correlation between any variable pair is observed between TOT_AUTO and EMPLOYMENT with a Pearson correlation coefficient of 0.02.

Although the correlation matrices give the clearest idea on which variables are correlated, a correlogram can be constructed to visualize the relationships among variables.

gg_sc <- function(data, mapping, ...){
   p <- ggplot(data = data, mapping = mapping) +
     geom_point(size = 0.7, color ='darkcyan') +
     stat_smooth(geom = 'line', method = 'lm', alpha = 0.6, linewidth = 1, color = 'firebrick', ...)
   return(p)
}

ggpairs(trips, axisLabels = 'none',
        diag = list(continuous = wrap("densityDiag", alpha = 0.5, color = 'darkslateblue', fill = 'darkslateblue')),
        lower = list(continuous = gg_sc))

The correlogram confirms the initial observation based on correlation values. The regression lines seem to fit very well for HBS_PRODS & TOT_STUDENT and HBS_PRODS & POP_08 pairs.

5. Initial Linear Model

It is the most sensible to start with TOT_STUDENT, the independent variable that has the strongest correlation with HBS_PROD. First, HBS_PROD vs. TOT_STUDENT is plotted as seen below.

trips %>% ggplot(aes(x = TOT_STUDENT, y = HBS_PROD)) +
  geom_point(shape = 21, size = 2.5, stroke = 1.5, alpha = 0.6, fill  = 'darkslategray1') +
  labs(y ='Home-Based School Trip Production\n', x = '\nTotal Number of Students') +
  scale_x_continuous(breaks = seq(0,7000,1000), limits = c(0,6500)) +
  scale_y_continuous(breaks = seq(0,12000,2000), limits = c(0,10000)) +
  theme_minimal()

Next, a linear model is constructed. TOT_STUDENT is the only predictor in the initial model.

trip_mdl_1 <- lm(HBS_PROD ~ TOT_STUDENT, data = trips)

The plot of the model is given as:

trips %>% ggplot(aes(x = TOT_STUDENT, y = HBS_PROD)) +
  geom_point(shape = 21, size = 2.5, stroke = 1.5, alpha = 0.6, fill  = 'darkslategray1') +
  geom_smooth(method = 'lm', col = 'blue4') +
  labs(y ='Home-Based School Trip Production\n', x = '\nTotal Number of Students') +
  scale_x_continuous(breaks = seq(0,7000,1000), limits = c(0,6500)) +
  scale_y_continuous(breaks = seq(0,12000,2000), limits = c(0,10000)) +
  theme_minimal()

It can be observed that there is a very strong relationship between these variables, in line with the previous observations. Nonetheless, the coefficient of determination, R², should be examined in order to make a meaningful comment about the goodness of fit. The summary of the model is given as:

summary(trip_mdl_1)

## 
## Call:
## lm(formula = HBS_PROD ~ TOT_STUDENT, data = trips)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6525.6  -385.2   -17.7   557.1  2191.6 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 269.12502  194.79312   1.382     0.17    
## TOT_STUDENT   1.54365    0.07856  19.651   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1026 on 91 degrees of freedom
## Multiple R-squared:  0.8093, Adjusted R-squared:  0.8072 
## F-statistic: 386.1 on 1 and 91 DF,  p-value: < 2.2e-16

5.1. Coefficient Estimates

As seen in the model summary, the intercept is 269.13, which corresponds to the value of HBS_PROD for TOT_STUDENT = 0. The coefficient estimates for TOT_STUDENT is given as 1.54, which corresponds to the slope of the line. Overall, the equation of the regression line is

HBS_PRODUCTION = 269.13 + 1.54 TOT_STUDENT

5.2. Coefficient of Determination

The coefficient of determination, R² is found out to be 0.8093, which implies that the model explains 80.93% of the variance of the data points, which indicates a really good fit.

5.3. F-statistic

F-statistic is found out to be 386.1 with a p-value smaller than 0.05. Hence, the null hypothesis that all the model parameters are equal to zero is rejected. This result implies that as all of the model parameters are not zero, there exists a significant relationship between HBS_PROD and TOT_STUDENT, and the regression model is proper.

5.4. t-statistics

t-statistic for β₀, i.e., the coefficient estimates for the intercept, is found out to be 1.382 with a p-value larger than 0.05. Hence, the null hypothesis that β₀ = 0 can’t be rejected. This result implies that the regression line almost goes through the origin.

t-statistic for β₁, i.e., the coefficient estimates for TOT_STUDENT, is found out to be 19.651 with a p-value smaller than 0.05. Hence, the null hypothesis that β₁ = 0 is rejected. This result implies that HBS_PROD is not constant, and it can be explained through TOT_STUDENT, thus the regression model is proper.

5.5. Normality & Homoscedasticity Assumptions

Next, normality and homoscedasticity assumptions are examined. In order to do so, three graphs of residuals are plotted side-by-side, namely, histogram, Q-Q plot, and scatter plot.

par(mfrow = c(1, 3))
res_1 <- residuals(trip_mdl_1)
hist(res_1, col = 'cadetblue2', main = '', xlab = 'Residuals')
qqnorm(res_1, main = '')
qqline(res_1, col = 'red')
plot(fitted(trip_mdl_1), res_1, ylab = 'Residuals', xlab = 'Fitted Values')

First, the histogram displays a distribution close to normal, but not exactly. Scale of the graph also matters, but the distribution seems to have a positive kurtosis. Second, the Q-Q plot shows a good normality for the majority of the residuals, but there exist some outliers. Third, the scatter plot does not seem to show constant variance, i.e., homoscedasticity. Overall, the normality and homoscedasticity of residuals don’t seem to be exactly satisfied.

Just to be sure, Anderson-Darling Test can be carried out to test the normality of the residuals.

ad.test(res_1)

## 
##  Anderson-Darling normality test
## 
## data:  res_1
## A = 2.5005, p-value = 2.301e-06

p-value is found out to be smaller than 0.05. Hence, the null hypothesis that the residuals are normally distributed is rejected.

Similarly, Breusch–Pagan Test can be carried out to test the homoscedasticity of the residuals.

bptest(trip_mdl_1)

## 
##  studentized Breusch-Pagan test
## 
## data:  trip_mdl_1
## BP = 17.471, df = 1, p-value = 2.917e-05

p-value is found out to be smaller than 0.05. Hence, the null hypothesis that the residuals have constant variance is rejected.

Still, the model seems to have a relatively good coefficient of determination. In the next sections, more variables will be used in an attempt to improve the model.

6. Updated Regression Model

Population has the highest correlation with HBS Production after Total Number of Students; thus, it can be added to the model next.

trip_mdl_2 <- lm(HBS_PROD ~ TOT_STUDENT + POP_08, data = trips)

The summary of the second model is given as:

summary(trip_mdl_2)

## 
## Call:
## lm(formula = HBS_PROD ~ TOT_STUDENT + POP_08, data = trips)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6397.4  -435.7   -35.5   494.8  2279.4 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 115.66051  212.20469   0.545   0.5871    
## TOT_STUDENT   1.18964    0.21921   5.427 4.81e-07 ***
## POP_08        0.10234    0.05926   1.727   0.0876 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1015 on 90 degrees of freedom
## Multiple R-squared:  0.8154, Adjusted R-squared:  0.8113 
## F-statistic: 198.8 on 2 and 90 DF,  p-value: < 2.2e-16

6.1. Coefficient Estimates

As seen in the model summary, the intercept is 115.66, which corresponds to the value of HBS_PROD for TOT_STUDENT = POP_08 = 0. The coefficient estimates for TOT_STUDENT and POP_08 are given as 1.19 and 0.10, respectively. Overall, the equation of the regression line is

HBS_PRODUCTION = 115.66 + 1.19 TOT_STUDENT + 0.10 POP_08

There is already a significant drop in the coefficient estimates for the second predictor, POP_08. A small coefficient implies that an increase in POP_08 would not result in a significant increase in HBS_PROD.

6.2. Coefficient of Determination

The coefficient of determination, R² is found out to be 0.8154, which implies that the model explains 81.54% of the variance of the data points. The goodness of fit seems to increase very slightly. Considering that the increase in R² isn’t even 1%, the second variable POP_08 does not contribute significantly to the model. Still, F-test should be carried out to compare both models more clearly.

Adjusted R² value is found out to be 0.8113, which is very close to R². Adjusted R² is adjusted according to the model size, to see how much the variables contribute to the model overall. In the case that many predictors that don’t contribute significantly to the model are used, the adjusted R² would be very low, even if the R² value itself is high. Hence, in order to check whether the model is susceptible to overfitting, adjusted R² value should be examined. In this case, as only two predictors are used, there is no sign of overfitting.

6.3. F-statistic

F-statistic is found out to be 198.8 with a p-value smaller than 0.05. Hence, the null hypothesis that all the model parameters are equal to zero is rejected. This result implies that as all of the model parameters are not zero, there exists a significant relationship between the target and the predictors.

6.4. t-statistics

t-statistic for β₀ is found out to be 0.545 with a p-value larger than 0.05. Hence, the null hypothesis that β₀ = 0 can’t be rejected. This result implies that the regression line still almost goes through the origin.

t-statistic for β₁ is found out to be 5.427 with a p-value smaller than 0.05. Hence, the null hypothesis that β₁ = 0 is rejected.

t-statistic for β₂ is found out to be 1.727 with a p-value larger than 0.05 but smaller than 0.1. Hence, for a level of significance of 0.05 the null hypothesis that β₂ = 0 is accepted, but for a level of significance of 0.1 the null hypothesis that β₂ = 0 is rejected. In short, the coefficient of the second predictor, POP_08, can be assumed to be equal to 0 depending on the level of significance.

6.5. ANOVA Test

As expected, the addition of the second predictor increased the coefficient of determination, i.e., the goodness of fit. Still, it can be tested whether the addition of the second predictor improves the model significantly or not with an ANOVA test.

anova(trip_mdl_1, trip_mdl_2)

## Analysis of Variance Table
## 
## Model 1: HBS_PROD ~ TOT_STUDENT
## Model 2: HBS_PROD ~ TOT_STUDENT + POP_08
##   Res.Df      RSS Df Sum of Sq      F  Pr(>F)  
## 1     91 95787786                              
## 2     90 92714969  1   3072817 2.9828 0.08758 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

F-statistic is found out to be 2.98 with a p-value larger than 0.05 but smaller than 0.1. Hence, the null hypothesis that the added variable does not improve the model significantly is accepted for a level of significance of 0.05, but rejected for a level of significance of 0.1. Again, depending on the level of significance, the addition of the second predictor may or may not be meaningful. In any case, adding a third predictor most probably will not be meaningful.

6.6. Normality & Homoscedasticity Assumptions

Next, normality and homoscedasticity assumptions are examined. In order to do so, three graphs of residuals are plotted side-by-side, namely, histogram, Q-Q plot, and scatter plot.

par(mfrow = c(1, 3))
res_2 <- residuals(trip_mdl_2)
hist(res_2, col = 'cadetblue2', main = '', xlab = 'Residuals')
qqnorm(res_2, main = '')
qqline(res_2, col = 'red')
plot(fitted(trip_mdl_2), res_2, ylab = 'Residuals', xlab = 'Fitted Values')

It is not expected to see noteworthy changes in the normality and homoscedasticity of residuals. The residuals seem to carry the same characteristics as before. The normality and homoscedasticity of residuals still don’t seem to be satisfied.

Anderson-Darling Test can be carried out to test the normality of the residuals.

ad.test(res_2)

## 
##  Anderson-Darling normality test
## 
## data:  res_2
## A = 2.7357, p-value = 6.08e-07

p-value is found out to be smaller than 0.05. Hence, the null hypothesis that the residuals are normally distributed is rejected.

Breusch–Pagan Test can be carried out to test the homoscedasticity of the residuals.

bptest(trip_mdl_2)

## 
##  studentized Breusch-Pagan test
## 
## data:  trip_mdl_2
## BP = 17.992, df = 2, p-value = 0.0001239

p-value is found out to be smaller than 0.05. Hence, the null hypothesis that the residuals have constant variance is rejected.

7. Further Regression Models

Using more variables most likely won’t improve the model significantly. Still, some of them will be examined in this section.

7.1. Three Predictors

Starting with TOT_WORKER, the third strongest correlated variable with HBS_PROD, another model is constructed.

trip_mdl_3 <- lm(HBS_PROD ~ TOT_STUDENT + POP_08 + TOT_WORKER, data = trips)

The summary of the third model is given as:

summary(trip_mdl_3)

## 
## Call:
## lm(formula = HBS_PROD ~ TOT_STUDENT + POP_08 + TOT_WORKER, data = trips)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6355.9  -369.2   -38.2   495.9  2319.8 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 162.52934  224.56726   0.724   0.4711    
## TOT_STUDENT   1.18835    0.21992   5.404 5.41e-07 ***
## POP_08        0.12197    0.06656   1.832   0.0702 .  
## TOT_WORKER   -0.09418    0.14368  -0.655   0.5138    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1018 on 89 degrees of freedom
## Multiple R-squared:  0.8163, Adjusted R-squared:  0.8101 
## F-statistic: 131.8 on 3 and 89 DF,  p-value: < 2.2e-16

As seen in the model summary, there is a significant drop in the t-statistic for the third coefficient estimate. The corresponding p-value is 0.51, which implies that β₃ can be accepted to be zero. The model doesn’t seem to be improved so far. Likewise, there seems to be a very slight improvement in the R² value, which is found out to be 0.8163.

Carrying out ANOVA test next,

anova(trip_mdl_2, trip_mdl_3)

## Analysis of Variance Table
## 
## Model 1: HBS_PROD ~ TOT_STUDENT + POP_08
## Model 2: HBS_PROD ~ TOT_STUDENT + POP_08 + TOT_WORKER
##   Res.Df      RSS Df Sum of Sq      F Pr(>F)
## 1     90 92714969                           
## 2     89 92269510  1    445459 0.4297 0.5138

It can be observed in the test summary that the F-statistic is 0.43 and the corresponding p-value is 0.51. This result implies that the addition of TOT_WORKER does not improve the model significantly.

7.2. Eight Predictors

The addition of the third predictor does not yield a significantly better model. Nonetheless, the model can be constructed using all of the available predictors to see how much they contribute to the model.

trip_mdl_4 <- lm(HBS_PROD ~ ., data = trips)

The summary of the fourth model is given as:

summary(trip_mdl_4)

## 
## Call:
## lm(formula = HBS_PROD ~ ., data = trips)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6360.2  -382.4   -27.6   487.4  2033.0 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -242.51735  386.76037  -0.627   0.5323    
## POP_08         0.14804    0.07146   2.072   0.0414 *  
## STD_SCHOOL    -0.04269    0.06526  -0.654   0.5149    
## TOT_AUTO      -0.10785    0.56341  -0.191   0.8487    
## TOT_VEHICLE   -0.24298    0.54797  -0.443   0.6586    
## TOT_STUDENT    1.17741    0.23069   5.104 2.04e-06 ***
## TOT_WORKER    -0.05505    0.16273  -0.338   0.7360    
## AVG_INCOME     0.42029    0.34806   1.208   0.2306    
## EMPLOY         0.04105    0.03894   1.054   0.2949    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1032 on 84 degrees of freedom
## Multiple R-squared:  0.822,  Adjusted R-squared:  0.805 
## F-statistic: 48.48 on 8 and 84 DF,  p-value: < 2.2e-16

The intercept is found out to be -242.52, which doesn’t make any sense considering that the target is HBS_PROD, which corresponds to some number of trips produced. t-tests carried out for all variables indicate that only the coefficient estimates for TOT_STUDENT and POP_08 are non-zero for a level of significance of 0.1, as expected. The R² value is found out to be 0.822, which seems to be almost equal to the R² value of the third model. A slight drop in the adjusted R² value can be observed.

Carrying out ANOVA test next,

anova(trip_mdl_2, trip_mdl_4)

## Analysis of Variance Table
## 
## Model 1: HBS_PROD ~ TOT_STUDENT + POP_08
## Model 2: HBS_PROD ~ POP_08 + STD_SCHOOL + TOT_AUTO + TOT_VEHICLE + TOT_STUDENT + 
##     TOT_WORKER + AVG_INCOME + EMPLOY
##   Res.Df      RSS Df Sum of Sq      F Pr(>F)
## 1     90 92714969                           
## 2     84 89409873  6   3305096 0.5175 0.7935

Again, the result implies that the addition of all variables does not improve the model significantly.

8. Conclusion

The dependent variable HBS_PROD of Mersin Transportation Planning Study data set is examined in this study. The relationship of it with the independent variables in the data set is investigated and a linear regression model is developed. The calibrated regression model developed in this study includes TOT_STUDENT (Total Number of Students at the Zone) and POP_08 (Population, 2008) as independent variables. The model is able to explain the 81.54% of the variance in the data.

It is also found out that TOT_STUDENT has a great impact on the model, which is expected as the total number of students should have a significant influence on the Home-Based School Trip Production. All in all, this model can be easily applied to future trip forecasts using given student and population data.

COGS589 - Final Project

Berke Kaan Ülgen

Due 2023-01-15

Mersin Trip Data Analysis

1. Introduction

2. Background

2.1. Transport Master Plan

2.2. Four-Stage Transport Model

2.3. Trip Purposes

3. Aim of the Study

4. Preparatory Work

5. Initial Linear Model

5.1. Coefficient Estimates

5.2. Coefficient of Determination

5.3. F-statistic

5.4. t-statistics

5.5. Normality & Homoscedasticity Assumptions

6. Updated Regression Model

6.1. Coefficient Estimates

6.2. Coefficient of Determination

6.3. F-statistic

6.4. t-statistics

6.5. ANOVA Test

6.6. Normality & Homoscedasticity Assumptions

7. Further Regression Models

7.1. Three Predictors

7.2. Eight Predictors

8. Conclusion