Is the volume of traffic greater during PM hours in New York
City between 2013 and 2021? In order to answer this question, I
will use a dataset provided by NYC OpenData, which contains traffic
sample volume counts at various bridge crossings and roadways throughout
the city. This analysis will focus on those sample counts, as well as
the time those samples were collected. As this is an analysis of the
entirety of New York City, the specific locations at which each sample
was collected will not be included. Each sample count is separated into
hour-long increments, and these increments will be separated by AM and
PM hours, and then added up to create 12-hour totals rather than hourly
totals. Whether an hour-long span is considered AM or PM will be decided
by the start of the span rather than the end. This means that the spans
of 11AM-12PM will be considered an AM hour, and 11PM-12PM will be
considered a PM hour. The dataset can be found
here.
I will begin my data analysis by performing the necessary EDA and
cleaning. I will start by cleaning the column names so that they are
easier to use, making sure all my columns are in the format they need to
be in, and dealing with NAs by removing the year 2012 from my analysis,
given that is where the vast majority of NAs resided. I will then make
two new dataframes, separating the AM and PM hours. From there, I will
calculate 12-hour totals of traffic volume counts by street in order to
have it all in one column. After doing that, I will create a boxplot to
visually compare the distribution between the AM and PM hours. I will
also add histograms of both the AM and PM traffic totals in order to
visualize the shape of their distributions.
#loading my dataset and libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(lubridate)
traffic <- read.csv("trafficcounts.csv")
head(traffic)
## ID SegmentID Roadway.Name From To Direction Date
## 1 1 15540 BEACH STREET UNION PLACE VAN DUZER STREET NB 01/09/2012
## 2 2 15540 BEACH STREET UNION PLACE VAN DUZER STREET NB 01/10/2012
## 3 3 15540 BEACH STREET UNION PLACE VAN DUZER STREET NB 01/11/2012
## 4 4 15540 BEACH STREET UNION PLACE VAN DUZER STREET NB 01/12/2012
## 5 5 15540 BEACH STREET UNION PLACE VAN DUZER STREET NB 01/13/2012
## 6 6 15540 BEACH STREET UNION PLACE VAN DUZER STREET NB 01/14/2012
## X12.00.1.00.AM X1.00.2.00AM X2.00.3.00AM X3.00.4.00AM X4.00.5.00AM
## 1 20 10 11 14 13
## 2 21 16 8 6 13
## 3 27 14 6 5 12
## 4 22 7 7 8 11
## 5 31 17 7 5 13
## 6 42 27 21 18 21
## X5.00.6.00AM X6.00.7.00AM X7.00.8.00AM X8.00.9.00AM X9.00.10.00AM
## 1 20 34 66 100 52
## 2 13 31 70 67 45
## 3 16 34 75 69 71
## 4 12 33 75 89 66
## 5 28 29 68 84 64
## 6 13 17 18 46 53
## X10.00.11.00AM X11.00.12.00PM X12.00.1.00PM X1.00.2.00PM X2.00.3.00PM
## 1 68 85 85 94 104
## 2 57 67 73 95 102
## 3 67 70 90 89 115
## 4 70 60 105 103 71
## 5 83 89 88 113 113
## 6 29 0 NA NA NA
## X3.00.4.00PM X4.00.5.00PM X5.00.6.00PM X6.00.7.00PM X7.00.8.00PM X8.00.9.00PM
## 1 105 147 120 91 83 74
## 2 98 133 131 95 73 70
## 3 115 130 143 106 89 68
## 4 127 122 144 122 76 64
## 5 126 133 135 102 106 58
## 6 NA NA NA NA NA NA
## X9.00.10.00PM X10.00.11.00PM X11.00.12.00AM
## 1 49 42 42
## 2 63 42 35
## 3 64 56 43
## 4 58 64 43
## 5 58 55 54
## 6 NA NA NA
str(traffic)
## 'data.frame': 42756 obs. of 31 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ SegmentID : int 15540 15540 15540 15540 15540 15540 15540 15540 15540 15540 ...
## $ Roadway.Name : chr "BEACH STREET" "BEACH STREET" "BEACH STREET" "BEACH STREET" ...
## $ From : chr "UNION PLACE" "UNION PLACE" "UNION PLACE" "UNION PLACE" ...
## $ To : chr "VAN DUZER STREET" "VAN DUZER STREET" "VAN DUZER STREET" "VAN DUZER STREET" ...
## $ Direction : chr "NB" "NB" "NB" "NB" ...
## $ Date : chr "01/09/2012" "01/10/2012" "01/11/2012" "01/12/2012" ...
## $ X12.00.1.00.AM: int 20 21 27 22 31 42 27 26 32 24 ...
## $ X1.00.2.00AM : int 10 16 14 7 17 27 12 16 16 12 ...
## $ X2.00.3.00AM : num 11 8 6 7 7 21 12 11 8 7 ...
## $ X3.00.4.00AM : num 14 6 5 8 5 18 4 13 9 18 ...
## $ X4.00.5.00AM : int 13 13 12 11 13 21 22 16 15 11 ...
## $ X5.00.6.00AM : num 20 13 16 12 28 13 27 27 26 23 ...
## $ X6.00.7.00AM : int 34 31 34 33 29 17 66 59 63 61 ...
## $ X7.00.8.00AM : num 66 70 75 75 68 18 154 156 169 146 ...
## $ X8.00.9.00AM : num 100 67 69 89 84 46 155 177 178 177 ...
## $ X9.00.10.00AM : int 52 45 71 66 64 53 138 131 148 128 ...
## $ X10.00.11.00AM: int 68 57 67 70 83 29 105 107 139 117 ...
## $ X11.00.12.00PM: int 85 67 70 60 89 0 124 108 131 111 ...
## $ X12.00.1.00PM : int 85 73 90 105 88 NA 140 122 126 134 ...
## $ X1.00.2.00PM : int 94 95 89 103 113 NA 120 131 137 134 ...
## $ X2.00.3.00PM : int 104 102 115 71 113 NA 165 192 178 171 ...
## $ X3.00.4.00PM : int 105 98 115 127 126 NA 197 180 194 184 ...
## $ X4.00.5.00PM : int 147 133 130 122 133 NA 152 161 168 157 ...
## $ X5.00.6.00PM : int 120 131 143 144 135 NA 174 171 160 167 ...
## $ X6.00.7.00PM : int 91 95 106 122 102 NA 128 120 143 148 ...
## $ X7.00.8.00PM : int 83 73 89 76 106 NA 95 96 114 112 ...
## $ X8.00.9.00PM : int 74 70 68 64 58 NA 87 90 78 86 ...
## $ X9.00.10.00PM : int 49 63 64 58 58 NA 73 70 64 77 ...
## $ X10.00.11.00PM: int 42 42 56 64 55 NA 57 63 49 63 ...
## $ X11.00.12.00AM: int 42 35 43 43 54 NA 42 49 45 51 ...
#cleaning the column names
names(traffic) <- gsub("[(). \\-]", "_", names(traffic))
names(traffic) <- tolower(names(traffic))
names(traffic) <- gsub("^x", "", names(traffic))
names(traffic) <- gsub("_00", "", names(traffic))
names(traffic)[names(traffic) == "12_1_am"] <- "12_1am"
names(traffic) <- ifelse(
names(traffic) == "11_12pm", "11_12am",
ifelse(names(traffic) == "11_12am", "11_12pm", names(traffic)))
#making the date column be in date format
traffic <- traffic %>%
mutate(date = mdy(date))
#creating a year column from the date column
traffic <- traffic %>%
mutate(year = year(date))
str(traffic)
## 'data.frame': 42756 obs. of 32 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ segmentid : int 15540 15540 15540 15540 15540 15540 15540 15540 15540 15540 ...
## $ roadway_name: chr "BEACH STREET" "BEACH STREET" "BEACH STREET" "BEACH STREET" ...
## $ from : chr "UNION PLACE" "UNION PLACE" "UNION PLACE" "UNION PLACE" ...
## $ to : chr "VAN DUZER STREET" "VAN DUZER STREET" "VAN DUZER STREET" "VAN DUZER STREET" ...
## $ direction : chr "NB" "NB" "NB" "NB" ...
## $ date : Date, format: "2012-01-09" "2012-01-10" ...
## $ 12_1am : int 20 21 27 22 31 42 27 26 32 24 ...
## $ 1_2am : int 10 16 14 7 17 27 12 16 16 12 ...
## $ 2_3am : num 11 8 6 7 7 21 12 11 8 7 ...
## $ 3_4am : num 14 6 5 8 5 18 4 13 9 18 ...
## $ 4_5am : int 13 13 12 11 13 21 22 16 15 11 ...
## $ 5_6am : num 20 13 16 12 28 13 27 27 26 23 ...
## $ 6_7am : int 34 31 34 33 29 17 66 59 63 61 ...
## $ 7_8am : num 66 70 75 75 68 18 154 156 169 146 ...
## $ 8_9am : num 100 67 69 89 84 46 155 177 178 177 ...
## $ 9_10am : int 52 45 71 66 64 53 138 131 148 128 ...
## $ 10_11am : int 68 57 67 70 83 29 105 107 139 117 ...
## $ 11_12am : int 85 67 70 60 89 0 124 108 131 111 ...
## $ 12_1pm : int 85 73 90 105 88 NA 140 122 126 134 ...
## $ 1_2pm : int 94 95 89 103 113 NA 120 131 137 134 ...
## $ 2_3pm : int 104 102 115 71 113 NA 165 192 178 171 ...
## $ 3_4pm : int 105 98 115 127 126 NA 197 180 194 184 ...
## $ 4_5pm : int 147 133 130 122 133 NA 152 161 168 157 ...
## $ 5_6pm : int 120 131 143 144 135 NA 174 171 160 167 ...
## $ 6_7pm : int 91 95 106 122 102 NA 128 120 143 148 ...
## $ 7_8pm : int 83 73 89 76 106 NA 95 96 114 112 ...
## $ 8_9pm : int 74 70 68 64 58 NA 87 90 78 86 ...
## $ 9_10pm : int 49 63 64 58 58 NA 73 70 64 77 ...
## $ 10_11pm : int 42 42 56 64 55 NA 57 63 49 63 ...
## $ 11_12pm : int 42 35 43 43 54 NA 42 49 45 51 ...
## $ year : num 2012 2012 2012 2012 2012 ...
#checking amount of NAs
sum(is.na(traffic))
## [1] 3080
#almost all the NAs were from 2012, so I decided to remove that year from my analysis
traffic_clean <- traffic %>%
filter(year != 2012)
#rechecking amount of NAs without 2012
sum(is.na(traffic_clean))
## [1] 44
#seeing how many values are present each year
traffic_clean %>%
group_by(year) %>%
summarise(n = n())
## # A tibble: 9 × 2
## year n
## <dbl> <int>
## 1 2013 1287
## 2 2014 6147
## 3 2015 5921
## 4 2016 3169
## 5 2017 5163
## 6 2018 1048
## 7 2019 5841
## 8 2020 5643
## 9 2021 486
#separating the am and pm values into individual dataframes
traffic_am <- traffic_clean[, !grepl("pm", names(traffic_clean))]
traffic_pm <- traffic_clean[, !grepl("am", names(traffic_clean))]
#calculating 12-hour totals of traffic counts for both AM and PM, removing NAs since there's such a minuscule amount of them
traffic_am <- traffic_am %>%
mutate(total = rowSums(traffic_am[ , 8:19], na.rm = TRUE))
traffic_pm <- traffic_pm %>%
mutate(total = rowSums(traffic_pm[ , 8:19], na.rm = TRUE))
head(traffic_am)
## id segmentid roadway_name from to direction date 12_1am
## 1 1 2153 HUGUENOT AVE WOODROW RD STAFFORD AVE NB 2013-02-02 106
## 2 1 2153 HUGUENOT AVE WOODROW RD STAFFORD AVE NB 2013-02-03 109
## 3 1 2153 HUGUENOT AVE WOODROW RD STAFFORD AVE NB 2013-02-04 36
## 4 1 2153 HUGUENOT AVE WOODROW RD STAFFORD AVE NB 2013-02-05 42
## 5 1 2153 HUGUENOT AVE WOODROW RD STAFFORD AVE NB 2013-02-06 35
## 6 1 2153 HUGUENOT AVE WOODROW RD STAFFORD AVE NB 2013-02-07 33
## 1_2am 2_3am 3_4am 4_5am 5_6am 6_7am 7_8am 8_9am 9_10am 10_11am 11_12am year
## 1 74 45 29 29 45 71 145 213 278 387 335 2013
## 2 74 55 37 26 25 47 74 111 204 249 351 2013
## 3 28 11 16 32 108 168 418 493 263 282 307 2013
## 4 28 16 12 34 109 193 397 499 241 255 294 2013
## 5 38 12 14 31 98 195 372 490 297 260 283 2013
## 6 26 14 22 31 100 171 382 460 273 215 263 2013
## total
## 1 1757
## 2 1362
## 3 2162
## 4 2120
## 5 2125
## 6 1990
head(traffic_pm)
## id segmentid from to direction date 12_1pm 1_2pm 2_3pm
## 1 1 2153 WOODROW RD STAFFORD AVE NB 2013-02-02 406 411 371
## 2 1 2153 WOODROW RD STAFFORD AVE NB 2013-02-03 374 350 308
## 3 1 2153 WOODROW RD STAFFORD AVE NB 2013-02-04 304 328 426
## 4 1 2153 WOODROW RD STAFFORD AVE NB 2013-02-05 310 382 393
## 5 1 2153 WOODROW RD STAFFORD AVE NB 2013-02-06 334 324 383
## 6 1 2153 WOODROW RD STAFFORD AVE NB 2013-02-07 254 303 151
## 3_4pm 4_5pm 5_6pm 6_7pm 7_8pm 8_9pm 9_10pm 10_11pm 11_12pm year total
## 1 398 324 394 379 329 249 197 187 169 2013 5421
## 2 291 313 253 242 217 210 144 125 79 2013 4545
## 3 425 419 469 425 358 224 185 132 74 2013 5478
## 4 479 441 476 446 424 305 219 171 76 2013 5825
## 5 384 391 369 401 338 236 186 123 76 2013 5224
## 6 409 422 364 356 339 275 225 169 104 2013 5130
#noting the time of day for boxplot
traffic_am$Time <- "AM"
traffic_pm$Time <- "PM"
#creating a new dataframe with just the 12-hour totals and the time of day so I can use ggplot
traffic_boxplot <- rbind(traffic_am[, c("total", "Time")], traffic_pm[, c("total", "Time")])
#creating the boxplot (code format is taken from the Descriptive Statistics assignment)
ggplot(traffic_boxplot, aes(Time, total, fill = Time)) +
geom_boxplot() +
labs(title = "Daily AM and PM Traffic Volume Totals",
x = "Time of Day", y = "Daily Traffic Volume (per street)") +
theme_minimal()
This boxplot compares the total daily AM and PM traffic volumes
across all New York City streets included in this dataset. PM traffic
shows higher median volumes and a wider spread than AM traffic,
indicating more drivers on the road during these hours. The large number
of outliers showcases that certain streets experience significantly
higher traffic than others.
#creating the histogram for AM and PM (code format is taken from the Descriptive Statistics assignment)
ggplot(traffic_am, aes(x = total)) +
geom_histogram(binwidth = 1000, fill = "blue", color = "black") +
labs(title = "AM Traffic Totals", x = "Daily Traffic Totals (per street)", y = "Count") +
theme_minimal()
ggplot(traffic_pm, aes(x = total)) +
geom_histogram(binwidth = 1000, fill = "red", color = "black") +
labs(title = "PM Traffic Totals", x = "Daily Traffic Totals (per street)", y = "Count") +
theme_minimal()
Both histograms show that AM and PM traffic volume are
right-skewed, with most streets experiencing low to moderate traffic and
a small number of streets showing extremely high volumes. The
distribution during the PM hours is shifted further right than the
distribution during the AM hours, indicating higher traffic totals in
the afternoon.
Hypotheses
\(H_0\): \(\mu_{PM}\) = \(\mu_{AM}\)
\(H_a\): \(\mu_{PM}\) > \(\mu_{AM}\)
Where,
\(\mu_{PM}\) = mean traffic during PM hours
\(\mu_{AM}\) = mean traffic during AM hours
#conducting the two-sample t-test
t.test(traffic_pm$total, traffic_am$total, alternative = "greater")
##
## Welch Two Sample t-test
##
## data: traffic_pm$total and traffic_am$total
## t = 78.421, df = 63354, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 4069.378 Inf
## sample estimates:
## mean of x mean of y
## 8351.106 4194.544
I will evaluate this hypothesis test at the 5% significance
level (α = 0.05). Because the p-value (2.2e-16) is less than the
significance level (0.05), we reject the null hypothesis. There is
sufficient evidence to conclude that PM traffic volumes are greater on
average than AM traffic volumes in New York City over this time span.
This analysis compared New York City’s AM and PM traffic volumes
using daily totals derived from hourly traffic counts. The boxplot and
histograms indicated that PM traffic totals were greater and more
variable than the AM totals, with some roadways producing significantly
higher values. This trend was then confirmed using a right-tailed
two-sample t-test. The p-value was much smaller than the significance
level, supporting the hypothesis that mean traffic during PM hours was
greater than mean traffic during AM hours. This aligns with expected
commuter behavior. These results imply that roads in New York City are
more crowded later in the day, which could have an impact on traffic
control and urban planning. In order to better understand traffic
behavior in New York City, more research may be necessary. For instance,
examining the traffic patterns of specific roads may help identify the
places that are most responsible for the variations. We could also
identify additional factors that account for the daily fluctuations not
examined in this analysis by considering holidays or weather conditions.
GREP: Pattern matching and replacement. RDocumentation. (n.d.). https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/grep
NYC OpenData. (2022, June 2). Traffic Volume Counts https://catalog.data.gov/dataset/traffic-volume-counts