DAT301-Project-1

2022-11-20

Abstract

The purpose of this project is to find a more effective approach to city planning that reduces high-congestion areas in New York City. Regression modeling is used to provide a clear basis for deciding where to apply resources to improve traffic flow, based on date, time of day, traffic direction, and street.

About the Data

The data is provided by the New York City Department of Transportation (NYC DOT), which uses Automated Traffic Recorders (ATR) to collect traffic sample volume counts at bridge crossings and roadways.

https://data.cityofnewyork.us/Transportation/Automated-Traffic-Volume-Counts/7ym2-wayt

Limitations of the Project

Due to limited resources and computing power, the data set has been reduced to one year of data from a single street. With this limitation in place, the expectation is to create a model that describes the use case for an individual street and to develop metrics that can be applied when analyzing further street locations.

Columns of the Data Set

Column Name	Description	Type
RequestId	A unique ID that is generated for each count request.	Number
Boro	Lists which of the five administrative divisions of New York City the location is within, written as a word	Plain Text
Yr	The two digit year portion of the date when the count was conducted.	Number
M	The two digit month portion of the date when the count was conducted.	Number
D	The two digit day portion of the date when the count was conducted.	Number
HH	The two digit hour portion of the time when the count was conducted.	Number
MM	The two digit start minute portion of the time when the count was conducted.	Number
Vol	The total count collected within a 15-minute increment.	Number
SegmentID	The ID that identifies each segment of a street in the LION street network version 14.	Number
WktGeom	A text markup language for representing vector geometry objects on a map and spatial reference systems of spatial objects.	Plain Text
street	The ‘On Street’ where the count took place.	Plain Text
fromSt	The ‘From Street’ where the count took place.	Plain Text
toSt	The ‘To Street’ where the count took place.	Plain Text
Direction	The text-based direction of traffic where the count took place.	Plain Text

Data Wrangling and Cleaning

For the problem we are trying to solve, we can remove several columns from the data. The data set contains over 20 million rows, so to conserve resources we will remove the columns listed below.

Column Name	Reason for Removal
WktGeom	This data is too specific. We are trying to describe upticks in traffic congestion based on metrics that are similar between locations to pinpoint streets that need to be addressed.
RequestId	We are not trying to study requests but to perform aggregates on the data provided.
Boro	We are not performing an audit based on the borough that reported the data.
SegmentID	This data is too specific. We are only trying to pinpoint specific streets.
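The removal itself is not shown in the slides. A minimal sketch of how it could be done in base R follows, assuming the column names from the column table; the data frame `traffic` and its values are made up for illustration and are not the project's data.

```r
# Toy data frame with the same column names as the data set (values are invented).
traffic <- data.frame(
  RequestId = c(1, 2),
  Boro      = c("Manhattan", "Manhattan"),
  Yr = c(2022, 2022), M = c(1, 1), D = c(5, 5), HH = c(7, 7), MM = c(0, 15),
  Vol       = c(120, 135),
  SegmentID = c(1001, 1001),
  WktGeom   = c("POINT (0 0)", "POINT (0 0)"),
  street    = c("BROADWAY", "BROADWAY"),
  fromSt    = c("A ST", "A ST"),
  toSt      = c("B ST", "B ST"),
  Direction = c("NB", "NB"),
  stringsAsFactors = FALSE
)

# Columns listed for removal in the table above.
drop_cols <- c("WktGeom", "RequestId", "Boro", "SegmentID")

# Keep every column whose name is not in drop_cols.
traffic_small <- traffic[, !(names(traffic) %in% drop_cols), drop = FALSE]

names(traffic_small)
```

Selecting by name rather than by position keeps the step valid even if the column order of the source file changes.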

Data Wrangling and Cleaning (Cont.)

The data set is too large for a simple regression (at least in the context of this course), so the data is filtered by year and by a specified street.

Year Examined: 2022
Street Examined: Broadway

To make the timestamps easier to work with, the Yr, M, D, HH, and MM columns are combined into a single datetime column instead of storing the date in separate integer columns.

# filter data based on the year and street listed above (filter() comes from dplyr)
library(dplyr)
df = filter(data.frame(df), Yr == 2022, street == "BROADWAY")
# combine the separate date parts into one string, then parse it as a UTC datetime
df$date = paste(df$Yr,"-",df$M,"-",df$D," ",df$HH,":",df$MM,sep="")
df$date = as.POSIXct(df$date, format="%Y-%m-%d %H:%M", tz="UTC")
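As an alternative sketch, base R's `ISOdatetime()` can build the same kind of timestamp directly from the numeric parts, without pasting a string first. The data frame `parts` and its values are hypothetical and only illustrate the conversion.

```r
# Toy rows with the same date-part columns as the data set (values are invented).
parts <- data.frame(Yr = c(2022, 2022), M = c(1, 11), D = c(5, 20),
                    HH = c(7, 18), MM = c(0, 45))

# Build the timestamp from the numeric parts; seconds are fixed at 0.
parts$date <- ISOdatetime(parts$Yr, parts$M, parts$D,
                          parts$HH, parts$MM, 0, tz = "UTC")

format(parts$date, "%Y-%m-%d %H:%M")
# "2022-01-05 07:00" "2022-11-20 18:45"
```

Either approach yields a POSIXct column; this one avoids relying on how single-digit months and hours are rendered in the intermediate string.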

Data Wrangling and Cleaning - Code

The data is cleaned and stored as a newly processed CSV file. This code was run before the presentation, because running it requires the grader to download a data set of roughly 3 GB.

# Write the cleaned and filtered data to a new CSV file
write.csv(df,"broadway.csv",row.names = FALSE)

Using the Data for Stepwise Regression

A stepwise linear regression in both directions (forward and backward) was used to identify possible predictors of the outcome Vol from the remaining candidates (fromSt, toSt, Direction, date).

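# Load the cleaned Broadway data written in the previous step
df <- read.csv("./broadway.csv")
# Lower bound of the search: intercept-only model
intercept_only <- lm(Vol ~ 1, data=df)
# Upper bound of the search: every remaining column as a predictor
all <- lm(Vol ~ ., data=df)
# Stepwise selection in both directions, scored by AIC
both <- step(intercept_only, direction='both', scope=formula(all), trace=0)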
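To illustrate how this search behaves, here is a small self-contained sketch on synthetic data. The data frame `toy`, its column values, and the seed are all assumptions made for the example; by construction only `toSt` influences `Vol`, so the search should keep it.

```r
set.seed(301)  # arbitrary seed so the toy data is reproducible

n <- 300
toy <- data.frame(
  toSt      = factor(sample(c("A ST", "B ST", "C ST"), n, replace = TRUE)),
  Direction = factor(sample(c("NB", "SB"), n, replace = TRUE)),
  noise     = rnorm(n)
)
# Vol depends strongly on toSt only; Direction and noise are unrelated by construction.
toy$Vol <- 100 + 50 * as.integer(toy$toSt) + rnorm(n, sd = 10)

null_model <- lm(Vol ~ 1, data = toy)   # lower bound of the search
full_model <- lm(Vol ~ ., data = toy)   # upper bound of the search

# Start from the intercept-only model and add or drop terms by AIC.
selected <- step(null_model, direction = "both",
                 scope = formula(full_model), trace = 0)

# Terms kept by the search
attr(terms(selected), "term.labels")
```

Because the selection is driven by AIC, an unrelated column can occasionally be retained by chance, which is one reason the selected terms are treated here as candidates rather than confirmed predictors.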

Interpreting the Data

Using the forward/backward stepwise search, the regression model reduced the variables to two remaining candidates, the columns toSt and Direction, which are to be explored in future studies.

> both$anova
         Step Df   Deviance Resid. Df Resid. Dev      AIC
1             NA         NA     11423   26952097 88721.93
2      + toSt -5 3925088.29     11418   23027009 86933.87
3 + Direction -2   74889.03     11416   22952120 86900.65
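As a rough reading of the table above, the share of variation explained can be worked out from the residual deviances. This is a sketch that only reuses the numbers printed in `both$anova`; the variable names are introduced for illustration.

```r
# Residual deviances copied from the both$anova table above.
dev_null  <- 26952097  # intercept-only model
dev_final <- 22952120  # after adding toSt and Direction

# For lm, the deviance is the residual sum of squares,
# so this ratio is the R-squared of the final model.
r_squared <- 1 - dev_final / dev_null

# Share of the total deviance removed by toSt alone (first step of the table).
share_toSt <- 3925088.29 / dev_null

round(c(r_squared = r_squared, share_toSt = share_toSt), 3)
```

Under this reading, the two selected predictors account for roughly 15% of the variation in Vol, almost all of it from toSt.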

Regression Model of toSt

Regression Model of Direction

Interpreting the Data

Based on the stepwise regression and the regression models fitted against the candidate predictors, we cannot safely conclude that traffic congestion on Broadway can be predicted from this data set. Adding toSt and Direction lowers the residual deviance only from 26,952,097 to 22,952,120, a reduction of about 15%. Further research is required to develop a more worthwhile predictive model.

Run-Book

To accommodate the size of the data, the code in the presentation (except for the code that produces the graphs) has not been executed. Should the grader wish to test the code, a separate Runbook has been provided.

Sources

https://data.cityofnewyork.us/Transportation/Automated-Traffic-Volume-Counts/7ym2-wayt

https://www.kaggle.com/datasets/aadimator/nyc-automated-traffic-volume-counts