Introduction

In 2018, SEPTA experienced its lowest ridership in nearly 20 years, losing nearly 14 million bus trips from 2017 to 2018. This is troubling, but not unique among North American cities. Transit ridership has been steadily declining since 2014 across the United States and Canada. Still, Philadelphia had one of the largest declines in bus ridership over the last year, more than Los Angeles, Dallas, Chicago & Boston.

Multiple studies have evaluated the factors that affect transit ridership, such as external factors like income to gas prices to personal preferences. Others have looked at internal factors that transit agencies and cities control like fare prices and level of service.

Our analysis builds off of this growing body of research to look at the spatial relationship between bus stop distances. We feel this is a topic of great important as Philadelphia grapples with strategies to improve ridership, and embarks on a comprehensive bus network redesign in the coming years.

Our predictive model is meant to show how distance between SEPTA’s bus stops affects ridership. We built this model on Route 21, which runs east-west on Chestnut and Walnut streets, between 69th Street Transportation Center and Penn’s Landing, and tested it along Route 16, which runs north-south between Cheltenham and City Hall on Broad Street. We chose these routes because of their relatively high ridership and service between high-density neighborhoods and job centers. We also chose lines that have different socio-economic and ethnic ridership profiles. Ultimately, this ended up complicating our test model results, but provides some important insights about modeling spatial factors along SEPTA’s bus system, which serves a large and diverse geographic area.

Literature Review

Many studies have explored the determinants of transit ridership, looking at a wide array of internal and external factors that influence ridership.

Internal factors relate to decisions, policies and conditions that are determined directly by the transit agency or municipality. Fare pricing is one of the most frequently analyzed internal factor influencing transit ridership. But improvements in service (frequency, coverage, reliability) are often more important than fare prices in determining ridership.

External factors, on the other hand, related to broader economic and geographic conditions like gas prices and unemployment rates. Studies have found some external factors to be more significant than others. For example, population size and employment rate have demonstrated statistical significance across multiple studies.

Many studies of internal influences find out of all the factors transit operators do control are quality of service, quantity of transit service, and fare pricing. A very recent study from Transportation of McGill Research found reduction in quantity in bus service in particular as a key determinants of ridership. This was found to be statistically significant among 25 North American cities.

When it comes to spatial factors, studies find that housing density and employment density are the most important spatial variables for determining transit demand. However, while spatial factors like these are used in studies to explain transit ridership variation, the colinearity among spatial variables and with other socio-economic factors raise more questions about causes and effects and their relative influence ridership.

Methodology

Data Gathering

2017 American Community Survey (ACS) and Longitudinal Employer-Household Dynamic (LEHD) datasets were used to gain demographic and employment data at the census block group level for the City of Philadelphia. This data was then mapped in ArcMap in preparation for a spatial join. Bus stop and average weekday ridership data was also acquired from the Southeastern Pennsylvania Transportation Authority (SEPTA). The bus stop points were then projected onto a map of the City of Philadelphia with census block groups from the Open Data Philly site. With the three datasets from the ACS, LEHD, and SEPTA projected onto ArcMap, we are ready to conduct our analysis.

GIS Analysis

We used the editor toolbar to create lines and placed the vertices of these lines where the bus stops were located for Route 21 and Route 16. We then used the Split Line at Vertices tool to divide the route line into smaller segments. After doing this, we calculate the geometry of each line segment and associate the length to a bus stop point using the spatial join function. Every bus stop now has a distance in feet of how close it is to the nearest bus stop. Next, we merge the Longitudinal Employer-Household Dynamic Dataset with the American Community Survey Dataset to have one shapefile with all the census block groups and the demographics associated with each. We then need to calculate our buffers.

Scholars still debate whether Euclidean Distance or Network Analyst is more effective when analyzing spatial components in a model. We decided to use the network analyst tool in GIS because we want to consider only areas that pedestrians can traverse. Since travelling across space is not ubiquitous, we must consider areas where people may not have access such as private property. We created a network dataset using the street network of Philadelphia acquired from the Philly Open Data site and then created a series of three buffers of 300, 600, and 900 feet that define how difficult it is to access a bus stop with the existing street grid. The map is displayed in figure x. We used the 900 foot buffer around the bus stop which has a radius of less than a quarter mile to represent the catchment area of the stop.

Each 900 meter buffer has a unique ID that matches that of the bus stop points. We use spatial join to join the buffers to all the census block groups that it intersects and then average out all the data of those block groups. For instance, if a 900 meter buffer around a stop intersects 3 census block groups than ArcMap would average all of the demographic data from the intersected block groups and add them into the row for each buffer. After doing this, we used a join based on the unique ID attribute column to transfer the demographic data from the buffers to the bus stop points. The attribute table of our bus stop points dataset finally has all of the data needed to conduct our analysis. We can now transfer the data into R.

R Analysis

In R we will visualize the distributions of our data using histograms, density, and scatter plots. We will then create a correlation matrix, analyze the means of all of our variables, and map the ridership numbers to gain a better idea of how strong our model is and how well it will perform. If we see that the independent variables are too correlated, than we will choose to not include them in our final mode. After running all these preliminary tests, we will perform several Ordinary Least Squares (OLS) Regressions and compare them using both a VIF test and an Anova test with all the models. We will then choose the model that best predicts our data and plot the residuals with a line of best of fit. This will tell us how well our original model with the Route 21 variables as our training set predicted ridership on the Route 16 bus line, helping us understand the generalizability of our model. If the predicted and observed variables are close, than we can have more confidence in our final model.

boundingBox <- st_read("http://data.phl.opendata.arcgis.com/datasets/405ec3da942d4e20869d4e1449a2be48_0.geojson")%>% 
  st_transform(crs = 6318)
cg <- st_read("http://data.phl.opendata.arcgis.com/datasets/1eed3c9b6d3c4561aaa62e1fc2dd81c4_0.geojson")%>% 
  st_transform(crs = 6318)
busroutes <- st_read("C:/Users/AnthonyJ/Box/Spring 2019/PlanningByNumbers/FinalProject/Stop Data/BusRoutes.shp")%>% 
  st_transform(crs = 6318)
mpd <- c("OBJECTID_12", "Weekday_Bo", "Route", "Latitude", "Longitude")


mp <- c %>%
  dplyr::select(mpd)

tmap_mode("view")
# make spatial
dat_sf <- st_as_sf(mp, coords = c("Longitude", "Latitude"), crs = 6318) 
xyz <- st_read("C:/Users/AnthonyJ/Box/Spring 2019/PlanningByNumbers/FinalProject/Route_21_16_Dist.shp")

Network Analyst Map

im <- load.image("C:/Users/AnthonyJ/Box/Spring 2019/PlanningByNumbers/FinalProject/Final_R/PBN-Route21Buffer-UpdatedBase -4-29.jpg")
plot(im,  axes=FALSE)

Data Analysis

One of the primary methods to anticipate how our model will perform involves analyzing the data that we have. Or dependent variable is ridership and our 12 independent variables are population density, average household size, median age, male population, female population, unemployment rate, percentage white, percentage black, percentage asian, percentage hispanic, median household income, and distance. Throughout the analysis, we will study the distribution of each of these variables and attempt to determine “the leanest and meanest” mode, the one that will best predict ridership.

When we look at the relationship between ridership and population density, race, income, and age in the four plots side by side, we see that there is no clear relationship that can be extrapolated except for the first plot, plot A, that shows the relationship between Boardings and Population Density. It shows a strong positive relationship. As population density increases, ridership increases. I would expect the coefficient on this variable to be positive and will definitely include this in our model.

The distribution of riders on the Route 16 and 21 is right skewed. Most stops have less than 100 passengers boarding. The route 21 has a more even spread than the Route 16, however. As the histogram shows, there are much higher instances of lower ridership on Route 16 than on Route 21. The average ridership on the route 16 is 39 people while the 21 has an average ridership of 78, much higher, however, the 21 ridership is much more evenly distributed. The map also allows us to visually analyze the distribution. When we look at the density plot and see that there is a stop with over 300 riders, we can further study this on our map with plotted ridership points. We see that the major hotspot is the Olney Transportation Center along the Route 16 corridor on Broad Street. This makes sense as the area around the Olney Transportation Center includes Philadelphia’s largest school, Central High School, a major hospital, Einstein Medical Center, and is very close to two stops on the Broad Street Line, Olney and Fern Rock Transportation Center.

The distributions of observations between the two routes vary greatly which may introduce some challenges when trying to test the generalizability of our model. The two routes may be too different to compare. It is also important to note that these variables describe the demographics of the residents in these census tracts and do not necessarily represent the bus passengers. Bus riders may use the observed stop as a transfer point to another destination. When looking at median household income, we see that the Route 16 serves on average a lower income group than Route 21. The average income along the Route 16 corridor is $31,000 while the Route 21 corridor has an average income of $43,000, more than $10,000 greater. Route 21 also serves populations with incomes over $100,000 while the maximum income route 16 serves is around $80,000. This also makes sense when we consider the geography of Philadelphia. North Philadelphia residents tend to have lower incomes than Center City and West Philadelphia residents.

The unemployment rate exhibits an even stronger contrast. The modes are on opposite sides of the histogram and the unemployment rate near the stops of Route 16 are double that of Route 21. The unemployment rate along the Route 21 corridor averages around 8% while that of Route 16 averages around 15%. The Route 16 unemployment rate is also left skewed while the route 21 is right skewed. This tells us that the unemployment rate will not be a good measure to use in our model if we want it to be generalizable. The rate on route 21 may be too unique since the route 21 traverses the region’s employment center, Philadelphia’s central business district.

The population density for Route 21 also contrasts with that of Route 16. The Route 21 corridor exhibits an average density of almost double that of Route 16. The mean population density for Route 16 is 19,000 people while that of Route 21 is 31,528 people. The density, however varies greatly along Route 21 with many stops being in a region with a population density of over 40,000 people per square mile. Route 16, in contrast, does not have any bus stops with an average population density of greater than 40,000 people per square mile.

One of the primary reasons why we chose this project was to see whether the spacing between bus stops has an effect on ridership. Although bus stop consolidation has become a common method that transit agencies use to increase the efficiency of their bus networks, not much research has been done on the impact of stop spacing on ridership. Using GIS, we can measure the distance between stops and integrate this variable into a model that attempts to predict ridership. Most stops on Route 21 are spaced less than 800 feet apart with three outliers spaced at 1000, 1100, and 1600 feet respectively while those on Route 16 are spaced up to 1200 feet apart. The average stop spacing for the Route 16 is 657 feet while that of the Route 21 is 540 feet. It would make sense that the Route 21 has smaller stop spacing because the bus route runs through the central business district and stops on every 500 foot block.

Generally, as one moves further away from Center City, the blocks become larger, population densities decrease, and bus stop spacing increases. Moreover, the ridership map illustrates this. Most of the Route 16 bus stops overlap with the Broad Street line. SEPTA offers overlapping service like this because the Broad Street line does not have elevators and is not accessible according to the Federal standards of the Americans with Disabilities Act. Since the distances between bus stops do not tend to have much variation, it may be difficult to identify a relationship between the factor and ridership.

Employment density and the unemployment rate tell a similar story. The regions that the two routes serve contrast greatly. Route 16 has many more stop regions with lower employment densities than Route 16. Except for three outlying observations, employment density does not exceed 5000 for Route 16 while half of the observations exhibit an employment density greater than that along Route 21. The average employment density of the regions surrounding Route 21 at 4,748 people per square mile is more than half that of Route 16 at 1,856.

Our observations do not show a particular relationship between the percentage of females and ridership. The areas with the most riders have more observations above 50% female than below 50%. The average is definitely over 50% but only by a few percentage points. When looking at race, most of the stops tends to be in a region with a higher percentage of Black residents. Many stops along Route 21 have a percentage of Black residents greater than 75% however as ridership increases, there tends to be more white residents in the vicinity of the bus stop. When considering the relationship between distance and ridership, we observe that the stop with the highest ridership along Route 21 is more than 1000 feet further from the nearest stop. This may show that as we increase stop spacing distances, ridership will increase. It is important to note, however, that most observations that have ridership between 0 and 300 passengers have stop spacing of less than 800 feet. The plot that illustrates the relationship between distance and ridership also has points that change color on a gradient based on income. No clear relationship can be extrapolated from this relationship. Observations with average incomes range greatly along the Route 21 corridor. We may have too little observations or stops along Route 21 to have any statistically significant relationship in our model between income and ridership.

The correlation plot shows that some of our independent variables are strongly correlated with each other. A value closer to 1 and -1 means that there is a strong level of correlation and suggests that we should not include the two variables in the model together because the effect of one may interfere with the effect of another on ridership. For instance, both the relationship between hispanic and female, white and household size, and white and black are highly correlated since the color is dark red (closer to -1). Thus, we will not include white in our final models because we do not want it to interfere with the black population percentage nor the household size. We will take note when including hispanic and female.

kable(head(nc1_21 %>% dplyr::select(exp_vars), 10)) %>% 
  kable_styling(bootstrap_options = "striped",position = "left") %>%
  scroll_box(width = "100%", height = "100%")
Population Density Household Size Age Female Unemployment Rate White Black Asian Hispanic City Hall Median HH Income Boardings Distance Employment Density
20209.00 2.370000 47.68000 54.96702 21.14000 5.4204452 72.60923 17.1269580 2.761748 0 28157.40 84 587.268 167.0000
28797.10 2.315000 43.10000 56.61568 24.57500 2.0076482 93.15488 1.4913958 2.562142 0 26126.75 74 587.268 250.7500
29394.20 2.392000 41.16000 56.86965 24.56000 2.0298608 93.30649 1.4259352 2.617011 0 25713.20 40 436.128 207.2000
21747.30 2.552000 37.20000 57.21860 26.40000 0.8360522 94.67781 0.9787928 2.671289 0 25361.40 203 569.767 201.2000
20007.33 2.555000 37.35000 57.07109 22.87500 0.6028636 95.20221 0.6531022 2.888721 0 24973.00 45 572.298 300.7500
26103.11 2.605714 36.12857 56.96452 23.44286 0.6241787 94.34954 0.7227332 2.874507 0 23405.71 60 562.726 181.5714
26391.38 2.596000 36.54000 56.94444 22.58000 0.7653061 94.13265 0.5385488 2.607710 0 23213.60 129 545.713 165.2000
25800.41 2.492500 36.26250 55.70833 17.90000 1.2083333 93.84722 0.9027778 2.194444 0 25324.38 143 573.566 166.1250
26913.16 2.465714 36.04286 55.25907 15.94286 1.3412396 93.19497 1.1153466 2.385995 0 27148.43 84 586.642 137.7143
26163.47 2.437143 35.91429 55.21899 17.11429 1.3235093 93.47796 0.9687543 2.687952 0 28131.29 64 613.542 136.7143

Exploring the data

theme_set(theme_bw())  # pre-set the bw theme.

g <- ggplot(c1_21, aes(Boardings, `Population Density`))
g2 <- g + geom_jitter(width = .5, size=1) +
  geom_point(color='red') +
  labs(subtitle="Measuring Ridership with Population Density", 
       y="Population Density", 
       x="Ridership", 
       title="Boardings and Population Density")
gf <- ggplot(c1_21, aes(Boardings, `Black`))
gf2 <-gf + geom_jitter(width = .5, size=1) +
   geom_point(color='blue') +
  labs(subtitle="Measuring Ridership with Race", 
       y="Percent Black", 
       x="Ridership", 
       title="Boardings and Race")
gs <- ggplot(c1_21, aes(Boardings, `Median HH Income`))
gs2 <-gf + geom_jitter(width = .5, size=1) +
   geom_point(color='purple') +
  labs(subtitle="Measuring Ridership with Income", 
       y="Median Household Income", 
       x="Ridership", 
       title="Boardings and Income")
gv <- ggplot(c1_21, aes(Boardings, `Age`))
gv2 <-gf + geom_jitter(width = .5, size=1) +
   geom_point(color='green') +
  labs(subtitle="Measuring Ridership with Age", 
       y="Median Age", 
       x="Ridership", 
       title="Boardings and Age")
k <- plot_grid(g2, gf2, gs2, gv2, labels = "AUTO")
k

Histogram, Density Curve, and Mean Comparison

Ridership

ggplot(c1, aes(x=Weekday_Bo)) + 
 geom_histogram(aes(y=..density..), colour="black", fill="white")+
 geom_density(alpha=.2, fill="#f781bf") +
  labs(title="Ridership",x="Passengers", y = "Frequency")+
  facet_grid(Route ~ .)

mu1 <- ddply(c1, "Route", summarise, Mean=mean(Weekday_Bo))
kable(mu1) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
Route Mean
16 38.78102
21 77.90090

Median Income

ggplot(c1, aes(x=Avg_MEDHINC_CY)) + 
 geom_histogram(aes(y=..density..), colour="black", fill="white")+
 geom_density(alpha=.2, fill="#b3cde3") +
  labs(title="Median Household Income",x="Income ($)", y = "Frequency")+
  facet_grid(Route ~ .)

mu1 <- ddply(c1, "Route", summarise, Mean=mean(Avg_MEDHINC_CY))
kable(mu1) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
Route Mean
16 31384.76
21 42717.85

Unemployment Rate

ggplot(c1, aes(x=Avg_UNEMPRT_CY)) + 
 geom_histogram(aes(y=..density..), colour="black", fill="white")+
 geom_density(alpha=.2, fill="#ccebc5") +
  labs(title="Unemployment Rate",x="Unemployment Rate", y = "Frequency")+
  facet_grid(Route ~ .)

mu1 <- ddply(c1, "Route", summarise, Mean=mean(Avg_UNEMPRT_CY))
kable(mu1) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
Route Mean
16 15.116496
21 7.940581

Population Density

ggplot(c1, aes(x=Avg_POPDENS_CY)) + 
 geom_histogram(aes(y=..density..), colour="black", fill="white")+
 geom_density(alpha=.2, fill="#fed9a6") +
  labs(title="Population Density",x="Population", y = "Frequency")+
  facet_grid(Route ~ .)

mu1 <- ddply(c1, "Route", summarise, Mean=mean(Avg_POPDENS_CY))
kable(mu1) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
Route Mean
16 18988.54
21 31528.46

Stop Spacing

ggplot(c1, aes(x=Distance)) + 
 geom_histogram(aes(y=..density..), colour="black", fill="white")+
 geom_density(alpha=.2, fill="#fbb4ae") +
  labs(title="Distance between Bus Stops",x="Stop Spacing (Feet)", y = "Frequency")+
  facet_grid(Route ~ .)

library(plyr)
mu1 <- ddply(c1, "Route", summarise, Mean=mean(Distance))
kable(mu1) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
Route Mean
16 657.5715
21 540.6033

Employment Density

ggplot(cn, aes(x=`Employment Density`)) + 
 geom_histogram(aes(y=..density..), colour="black", fill="white")+
 geom_density(alpha=.2, fill="#8dd3c7") +
  labs(title="Employment Density",x="Employment Density", y = "Frequency")+
  facet_grid(Route ~ .)

library(plyr)
mu1 <- ddply(cn, "Route", summarise, Mean=mean(`Employment Density`))
kable(mu1) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
Route Mean
16 1856.613
21 4747.579

Other Variable Relationships

Female

df <- data.frame(c1_21$Boardings, c1_21$Female)
ggplot(df, aes(c1_21$Boardings, y = value, color = variable)) + 
    geom_point(aes(y = c1_21$Female, col = "Female"))+
   labs(title="Population and Ridership by Gender",x="Ridership (Boardings)", y = "Percentage Female")

Race

df <- data.frame(c1_21$Boardings, c1_21$White, c1_21$Black, c1_21$Hispanic, c1_21$Asian)
ggplot(df, aes(c1_21$Boardings, y = value, color = variable)) + 
    geom_point(aes(y = c1_21$White, col = "White")) + 
    geom_point(aes(y = c1_21$Black, col = "Black"))+
  geom_point(aes(y = c1_21$Hispanic, col = "Hispanic"))+
  geom_point(aes(y = c1_21$Asian, col = "Asian"))+
   labs(title="Population and Ridership by Race",x="Ridership (Boardings)", y = "Percentage")

Plotting Distance and Ridership by Income

ggplot(c2_21, aes(Weekday_Bo, y = Distance, color = Avg_MEDHINC_CY)) + 
    geom_point(aes(y = Distance, size = 0.05)) + 
   labs(title="Distance and Ridership",x="Ridership (Boardings)", y = "Distance (ft)")

Correlation among Coefficients

Correlation plot

corr_vars <- nc1_21 %>% 
  dplyr::select(., exp_vars)

corr_tab <- pcor(corr_vars, method = "pearson")
corr2 <- cor(corr_vars)
cp <- corrplot(corr2, method = "color", type = "lower")

#corrplot(cp, method = "color")

Correlation matrix

corr_tab_estimate <- as.data.frame(corr_tab[["estimate"]]) %>% 
  setNames(exp_vars) %>% 
  cbind(exp_vars) %>% 
  dplyr::select(Var1 = exp_vars, everything()) %>% 
  gather(Var2, correlation, 2:13, - Var1) %>% 
  arrange(desc(abs(correlation)))

kable(corr2)%>% 
  kable_styling(bootstrap_options = "striped", position = "left")  %>%
  scroll_box(width = "100%", height = "500px")
Population Density Household Size Age Female Unemployment Rate White Black Asian Hispanic City Hall Median HH Income Boardings Distance Employment Density
Population Density 1.0000000 -0.3564799 0.0840427 -0.2890156 -0.2827185 0.4167922 -0.3413005 0.0397346 0.3439112 0.1533723 0.2722968 0.4436696 -0.2085687 0.4236355
Household Size -0.3564799 1.0000000 0.0479918 0.5838406 0.7252728 -0.8642103 0.8502428 -0.4694642 -0.6753127 -0.1420310 -0.6446483 -0.1933128 -0.0194388 -0.7861714
Age 0.0840427 0.0479918 1.0000000 0.2161379 0.2622267 0.0441104 0.2590048 -0.7341436 -0.3964738 -0.0331414 0.5715285 0.0449619 -0.0978845 0.1451460
Female -0.2890156 0.5838406 0.2161379 1.0000000 0.7349952 -0.7117141 0.7318553 -0.4242339 -0.8806718 -0.1596939 -0.4199029 -0.0814487 0.1456828 -0.5247251
Unemployment Rate -0.2827185 0.7252728 0.2622267 0.7349952 1.0000000 -0.7938732 0.8336772 -0.5343371 -0.6872453 -0.0582371 -0.4919217 -0.0684308 0.0453861 -0.4970135
White 0.4167922 -0.8642103 0.0441104 -0.7117141 -0.7938732 1.0000000 -0.9340213 0.4049096 0.6419463 0.1080283 0.7841476 0.1886877 -0.0508261 0.7395163
Black -0.3413005 0.8502428 0.2590048 0.7318553 0.8336772 -0.9340213 1.0000000 -0.7038986 -0.7330520 -0.0857207 -0.5370101 -0.1385473 -0.0101847 -0.6292564
Asian 0.0397346 -0.4694642 -0.7341436 -0.4242339 -0.5343371 0.4049096 -0.7038986 1.0000000 0.5722387 0.0008457 -0.1775227 -0.0126912 0.1433213 0.1492164
Hispanic 0.3439112 -0.6753127 -0.3964738 -0.8806718 -0.6872453 0.6419463 -0.7330520 0.5722387 1.0000000 0.2016883 0.2409348 0.1718966 -0.1140825 0.6099144
City Hall 0.1533723 -0.1420310 -0.0331414 -0.1596939 -0.0582371 0.1080283 -0.0857207 0.0008457 0.2016883 1.0000000 0.0762711 0.2098627 -0.0542444 0.2929010
Median HH Income 0.2722968 -0.6446483 0.5715285 -0.4199029 -0.4919217 0.7841476 -0.5370101 -0.1775227 0.2409348 0.0762711 1.0000000 0.1229779 -0.0896320 0.6590519
Boardings 0.4436696 -0.1933128 0.0449619 -0.0814487 -0.0684308 0.1886877 -0.1385473 -0.0126912 0.1718966 0.2098627 0.1229779 1.0000000 -0.0670764 0.3274057
Distance -0.2085687 -0.0194388 -0.0978845 0.1456828 0.0453861 -0.0508261 -0.0101847 0.1433213 -0.1140825 -0.0542444 -0.0896320 -0.0670764 1.0000000 -0.1356008
Employment Density 0.4236355 -0.7861714 0.1451460 -0.5247251 -0.4970135 0.7395163 -0.6292564 0.1492164 0.6099144 0.2929010 0.6590519 0.3274057 -0.1356008 1.0000000

Findings

Our first model regresses ridership with stop spacing or distance, median household income, population density, the white population, the black population, the asian population, the hispanic population, age, household size, and a dummy for whether the stop is near city hall. We find two statistically significant variables to be population and employment density. For every 1000 people that live in the vicinity of a bus stop, ridership increases by 2. Furthermore, for every 1000 jobs in the bus stop vicinity, there will be an addition 4 bus passengers. The adjusted R squared or the percentage that our model explains ridership is 0.210 or 21%. 21% of the variation of ridership is explained by this model if we want to use the faulty interpretation of the adjusted R-squared. Even though R squared corrects for the amount of independent variables in our model, it still may be misleading because some variables may influence R-squared more than others. It is also important to note that employment density and population density are not correlated as we saw earlier in our correlation pyramid and matrix. This model shows that they are both statistically significant but interestingly, our results vary when we add categorical variables.

On our first model, the population variables of White, Black, Asian, and Hispanic are not significant because they both may be correlated with each other and are raw numbers rather than percentages. The second model changes the white and black population independent variable into percentages of the total population to see whether race plays a role in ridership. We find that we fail to reject the null that the coefficient of the new race outputs are statistically significant from 0.

The second model we use includes categorical variables for income, age, and household size. They divide those predictors into four bins or quantiles with an equal amount of observations in each. The adjusted R-squared decreases by very little 0.01% meaning that the new inputs with categorical variables do not explain ridership more however the only statistically significant variable is population density. An increase in population density of 1000 will increase ridership by 3 passengers. The effect of population density on ridership increases in this model from the last one. In addition, employment density is no longer significant the coefficient is half the value as it was in the first regression. This makes me skeptical about the validity of the employment density model. Even though the two variables of population and employment density are not correlated, they may still be interacting.

When we run our third model with just our statistically significant initial independent variables of population and employment density, we find that the adjusted R-squared is actually slightly less by a tenth of a percentage point. It explains 20% of the variation according to the adjusted R-squared. We find that adding the categorical variables does not have much of an effect on our model since when we remove them, the adjusted r-squared does not change much. In addition, the employment density independent variable is statistically significant but with a different coefficient than in our first model. We run a VIF test under the regression to measure multicollinearity further but find population and employment density are not correlated.

Our final model has an adjusted R-squared of 0.189 and explains about 18.9% of the variation. We have chosen to only include population density because all the other variables are statistically insignificant. Moreover, employment density seems to be an unreliable predictor since it varied so much and was insignificant in our second model. It also appeared to have the same effect as population density which we do not believe is coincidental. An increase in the population of 1000 around a bus stop will increases ridership by 3 passengers.

These results must be taken with a grain of salt because they may not be endogenous. The bus stops may have been chosen because transportation planners knew that a higher population and employment density would lead to more riders. They may have also chosen to locate stops closer to each other because they know that there would be a higher demand for a particular route.

The ANOVA table compares our last two models that include either population and employment density or just population density. The test shows that our final model is not statistically significant from the third regression, the one with population density and employment density. The second model’s p-values are greater than 0.05 meaning that we fail to reject the null at the 95% confidence interval. This implies that it does not matter as much which model we use amongst the third and fourth ones.

When we try to use our final model or training set with the Route 21 data to predict ridership on the Route 16, our test set, the results show that our final model cannot predict ridership on the other route well. This is not surprising given the contrasts we observed amongst the independent variables above. Our final model does not illustrate a steady positive relationship on the test set like it does on the training set. A linear positive slope of 1 shows that the predictions are closer to the observations but the test set shows a relationship of the residuals with a slope closer to 0. It appears to be a horizontal line even though the observed ridership experiences variation in ridership amongst stops in our test set. For instance, when ridership is predicted to be 100 passengers on Route 1, the observed number of riders is less than 10.

Running the Regressions

Regression 1

ride1 <- lm(`Boardings` ~ `Distance` + `Median HH Income` + `Population Density` + `White Population` +
              `Black Population` + `Asian Population` + `Hispanic Population` + 
              `Age` + `Household Size` + `City Hall` + `Employment Density`, data = c1_21)
stargazer(ride1, type = 'html')
Dependent variable:
Boardings
Distance 0.018
(0.034)
Median HH Income -0.001
(0.001)
Population Density 0.002**
(0.001)
White Population -0.054
(0.067)
Black Population -0.090*
(0.051)
Asian Population -0.025
(0.061)
Hispanic Population 0.050
(0.280)
Age 1.176
(1.803)
Household Size 7.760
(35.849)
City Hall 43.600
(43.424)
Employment Density 0.004**
(0.002)
Constant 52.504
(119.729)
Observations 111
R2 0.289
Adjusted R2 0.210
Residual Std. Error 56.526 (df = 99)
F Statistic 3.660*** (df = 11; 99)
Note: p<0.1; p<0.05; p<0.01

Regression 2

ride2 <- lm(`Boardings` ~ `Distance` + as.factor(`Income Quantile`) + `Population Density` + (`Population Density`)^2  + `White`
              + `Black`   + `Employment Density` + `Unemployment Rate` +
              as.factor(`Age Quantile`) + as.factor(`Household Size Quantile`), data = nc1_21)
stargazer(ride2, type = 'html')
Dependent variable:
Boardings
Distance 0.022
(0.035)
as.factor(Income Quantile)2 6.030
(25.943)
as.factor(Income Quantile)3 -28.064
(27.992)
as.factor(Income Quantile)4 29.715
(43.438)
Population Density 0.003***
(0.001)
White -0.483
(1.274)
Black -0.544
(1.002)
Employment Density 0.002
(0.002)
Unemployment Rate 2.508
(2.621)
as.factor(Age Quantile)2 16.138
(29.445)
as.factor(Age Quantile)3 0.524
(33.063)
as.factor(Age Quantile)4 -34.090
(35.395)
as.factor(Household Size Quantile)2 18.441
(21.625)
as.factor(Household Size Quantile)3 5.352
(29.964)
as.factor(Household Size Quantile)4 19.613
(37.674)
Constant -16.899
(81.295)
Observations 111
R2 0.316
Adjusted R2 0.209
Residual Std. Error 56.580 (df = 95)
F Statistic 2.933*** (df = 15; 95)
Note: p<0.1; p<0.05; p<0.01
VIF
VIF <- car::vif(ride2)
kable(VIF) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
GVIF Df GVIF^(1/(2*Df))
Distance 1.234181 1 1.110937
as.factor(Income Quantile) 53.435000 3 1.940756
Population Density 2.315601 1 1.521710
White 41.612843 1 6.450802
Black 42.135599 1 6.491194
Employment Density 4.207488 1 2.051216
Unemployment Rate 10.063028 1 3.172228
as.factor(Age Quantile) 23.405804 3 1.691300
as.factor(Household Size Quantile) 35.227949 3 1.810567

Regression 3

ride3 <- lm(`Boardings` ~ `Distance` + `Population Density` + (`Population Density`)^2  +
             `Employment Density`, data = nc1_21)
stargazer(ride3, type = 'html')
Dependent variable:
Boardings
Distance 0.013
(0.033)
Population Density 0.002***
(0.001)
Employment Density 0.002*
(0.001)
Constant -7.614
(26.596)
Observations 111
R2 0.222
Adjusted R2 0.200
Residual Std. Error 56.889 (df = 107)
F Statistic 10.161*** (df = 3; 107)
Note: p<0.1; p<0.05; p<0.01
VIF
VIF <- car::vif(ride3)
kable(VIF) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
x
Distance 1.048461
Population Density 1.254285
Employment Density 1.222196

Regression 4 (final model)

ride4 <- lm(`Boardings` ~ `Population Density`, data = nc1_21)
stargazer(ride4, type = 'html')
Dependent variable:
Boardings
Population Density 0.003***
(0.0005)
Constant -2.704
(16.517)
Observations 111
R2 0.197
Adjusted R2 0.189
Residual Std. Error 57.259 (df = 109)
F Statistic 26.714*** (df = 1; 109)
Note: p<0.1; p<0.05; p<0.01

Testing the Final Model

an<-anova(ride3, ride4)
kable(an) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
Res.Df RSS Df Sum of Sq F Pr(>F)
107 346292.5 NA NA NA NA
109 357360.9 -2 -11068.43 1.710003 0.1857705
po1 <- ggplot(dd, aes(fit, Boardings)) + 
    geom_point(aes(size = 0.05, col = "blue"), show.legend = FALSE)+
    geom_smooth(method = "lm", col = "green") +
    guides(fill=FALSE) +
   labs(subtitle="Produced by Jackie and Anthony ",title="(Training) Predicted vs. Observed",x="Predicted", y = "Observed")
po2 <- ggplot(nc1_16, aes(fit, Boardings)) + 
    geom_point(aes(size = 0.05, col = "blue"), show.legend = FALSE)+
    geom_smooth(method = "lm", col = "orange") +
    guides(fill=FALSE) +
   labs(subtitle="Produced by Jackie and Anthony ",title="(Test) Predicted vs. Observed",x="Predicted", y = "Observed")

plot_grid(po1, po2, labels = "AUTO")

Discussion

The primary reason why we believe our training model did not perform well with our test set is because of the differences in the distribution amongst our independent variables as expressed above. In addition, we have very little observations. With only 111 observations, it is difficult to have statistically significant coefficients. Finally, there are many more factors that affect ridership that are not included in our model. Our model, for instance, does not include destination data but only factors in the characteristics of the origin or where the rider begins their trip. We do not know where the passenger is going and why they are conducting their trip. If we had that data, we could better understand the rider’s behavior and could perhaps gain a better understanding of ridership fluctuations. For instance, people may take the bus more along the Route 16 corridor but they may not be taking the Route 16 bus. We decided to include the bus route lines in pink on our initial map to demonstrate this point. There are many bus routes that intersect Broad Street so residents of the area may be taking the bus but not commuting in a North-South fashion.

Conclusion

Although our final model shows that only population density is significant and that stop spacing has no statistically significant effect on ridership with our data, we have developed a methodology that can be used to predict ridership with a few more variables. We hope to be able to add more observations and refine our methodology next year. Furthermore, since it is difficult to gain ridership data from SEPTA, this project can be used as a base to understand the difficulty of modeling transit behavior in Philadelphia. Since the city exhibits a high degree of inequality, the contrasting demographic characteristics prove to be a challenge for social scientists and researchers in the region. It is especially important to analyze the variation within our independent variables in such an environment. We hope to continue conducting this type of research in order to help Philadelphia and cities just like it to understand travel behavior and enhance the efficiency of commuting.

Endnotes

  1. Federal Transportation Authority, “The National Transit Database” available via https://www.transit.dot.gov/ntd
  2. Boisjoly, Geneviève. Invest in the ride: A 14 year longitudinal analysis of the determinants of public transport ridership in 25 North American cities. (May 2018) via http://tram.mcgill.ca/Research/Publications/Transit_Ridership_overtime.pdf
  3. Laughlin, Jason. “SEPTA keeps bleeding bus riders. It may take years to stanch the losses.” The Philadelphia Inquirer. March 26, 2019. https://www.philly.com/transportation/septa-bus-ridership-transit-loss-20190326.html
  4. Taylor, B., Fink, C., 2009. The factors influencing transit ridership: a review and analysis of the ridership literature. Retrieved from Los Angeles, C.A. via http://www.reconnectingamerica.org/assets/Uploads/ridersipfactors.pdf
  5. Taylor, B., Fink, C., 2009
  6. Boisjoly, Geneviève. Invest in the ride: A 14 year longitudinal analysis of the determinants of public transport ridership in 25 North American cities.
  7. Nelson & Nygaard (1996) and Pushkarev & Zupan (1977)
  8. Taylor, B., Fink, C., 2009 ##The End