Introduction:

Is the volume of traffic greater during PM hours in New York City between 2013 and 2021? In order to answer this question, I will use a dataset provided by NYC OpenData, which contains traffic sample volume counts at various bridge crossings and roadways throughout the city. This analysis will focus on those sample counts, as well as the time those samples were collected. As this is an analysis of the entirety of New York City, the specific locations at which each sample was collected will not be included. Each sample count is separated into hour-long increments, and these increments will be separated by AM and PM hours, and then added up to create 12-hour totals rather than hourly totals. Whether an hour-long span is considered AM or PM will be decided by the start of the span rather than the end. This means that the spans of 11AM-12PM will be considered an AM hour, and 11PM-12PM will be considered a PM hour. The dataset can be found here.

Data Analysis

I will begin my data analysis by performing the necessary EDA and cleaning. I will start by cleaning the column names so that they are easier to use, making sure all my columns are in the format they need to be in, and dealing with NAs by removing the year 2012 from my analysis, given that is where the vast majority of NAs resided. I will then make two new dataframes, separating the AM and PM hours. From there, I will calculate 12-hour totals of traffic volume counts by street in order to have it all in one column. After doing that, I will create a boxplot to visually compare the distribution between the AM and PM hours. I will also add histograms of both the AM and PM traffic totals in order to visualize the shape of their distributions.

#loading my dataset and libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(lubridate)

traffic <- read.csv("trafficcounts.csv")
head(traffic)
##   ID SegmentID Roadway.Name        From               To Direction       Date
## 1  1     15540 BEACH STREET UNION PLACE VAN DUZER STREET        NB 01/09/2012
## 2  2     15540 BEACH STREET UNION PLACE VAN DUZER STREET        NB 01/10/2012
## 3  3     15540 BEACH STREET UNION PLACE VAN DUZER STREET        NB 01/11/2012
## 4  4     15540 BEACH STREET UNION PLACE VAN DUZER STREET        NB 01/12/2012
## 5  5     15540 BEACH STREET UNION PLACE VAN DUZER STREET        NB 01/13/2012
## 6  6     15540 BEACH STREET UNION PLACE VAN DUZER STREET        NB 01/14/2012
##   X12.00.1.00.AM X1.00.2.00AM X2.00.3.00AM X3.00.4.00AM X4.00.5.00AM
## 1             20           10           11           14           13
## 2             21           16            8            6           13
## 3             27           14            6            5           12
## 4             22            7            7            8           11
## 5             31           17            7            5           13
## 6             42           27           21           18           21
##   X5.00.6.00AM X6.00.7.00AM X7.00.8.00AM X8.00.9.00AM X9.00.10.00AM
## 1           20           34           66          100            52
## 2           13           31           70           67            45
## 3           16           34           75           69            71
## 4           12           33           75           89            66
## 5           28           29           68           84            64
## 6           13           17           18           46            53
##   X10.00.11.00AM X11.00.12.00PM X12.00.1.00PM X1.00.2.00PM X2.00.3.00PM
## 1             68             85            85           94          104
## 2             57             67            73           95          102
## 3             67             70            90           89          115
## 4             70             60           105          103           71
## 5             83             89            88          113          113
## 6             29              0            NA           NA           NA
##   X3.00.4.00PM X4.00.5.00PM X5.00.6.00PM X6.00.7.00PM X7.00.8.00PM X8.00.9.00PM
## 1          105          147          120           91           83           74
## 2           98          133          131           95           73           70
## 3          115          130          143          106           89           68
## 4          127          122          144          122           76           64
## 5          126          133          135          102          106           58
## 6           NA           NA           NA           NA           NA           NA
##   X9.00.10.00PM X10.00.11.00PM X11.00.12.00AM
## 1            49             42             42
## 2            63             42             35
## 3            64             56             43
## 4            58             64             43
## 5            58             55             54
## 6            NA             NA             NA
str(traffic)
## 'data.frame':    42756 obs. of  31 variables:
##  $ ID            : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ SegmentID     : int  15540 15540 15540 15540 15540 15540 15540 15540 15540 15540 ...
##  $ Roadway.Name  : chr  "BEACH STREET" "BEACH STREET" "BEACH STREET" "BEACH STREET" ...
##  $ From          : chr  "UNION PLACE" "UNION PLACE" "UNION PLACE" "UNION PLACE" ...
##  $ To            : chr  "VAN DUZER STREET" "VAN DUZER STREET" "VAN DUZER STREET" "VAN DUZER STREET" ...
##  $ Direction     : chr  "NB" "NB" "NB" "NB" ...
##  $ Date          : chr  "01/09/2012" "01/10/2012" "01/11/2012" "01/12/2012" ...
##  $ X12.00.1.00.AM: int  20 21 27 22 31 42 27 26 32 24 ...
##  $ X1.00.2.00AM  : int  10 16 14 7 17 27 12 16 16 12 ...
##  $ X2.00.3.00AM  : num  11 8 6 7 7 21 12 11 8 7 ...
##  $ X3.00.4.00AM  : num  14 6 5 8 5 18 4 13 9 18 ...
##  $ X4.00.5.00AM  : int  13 13 12 11 13 21 22 16 15 11 ...
##  $ X5.00.6.00AM  : num  20 13 16 12 28 13 27 27 26 23 ...
##  $ X6.00.7.00AM  : int  34 31 34 33 29 17 66 59 63 61 ...
##  $ X7.00.8.00AM  : num  66 70 75 75 68 18 154 156 169 146 ...
##  $ X8.00.9.00AM  : num  100 67 69 89 84 46 155 177 178 177 ...
##  $ X9.00.10.00AM : int  52 45 71 66 64 53 138 131 148 128 ...
##  $ X10.00.11.00AM: int  68 57 67 70 83 29 105 107 139 117 ...
##  $ X11.00.12.00PM: int  85 67 70 60 89 0 124 108 131 111 ...
##  $ X12.00.1.00PM : int  85 73 90 105 88 NA 140 122 126 134 ...
##  $ X1.00.2.00PM  : int  94 95 89 103 113 NA 120 131 137 134 ...
##  $ X2.00.3.00PM  : int  104 102 115 71 113 NA 165 192 178 171 ...
##  $ X3.00.4.00PM  : int  105 98 115 127 126 NA 197 180 194 184 ...
##  $ X4.00.5.00PM  : int  147 133 130 122 133 NA 152 161 168 157 ...
##  $ X5.00.6.00PM  : int  120 131 143 144 135 NA 174 171 160 167 ...
##  $ X6.00.7.00PM  : int  91 95 106 122 102 NA 128 120 143 148 ...
##  $ X7.00.8.00PM  : int  83 73 89 76 106 NA 95 96 114 112 ...
##  $ X8.00.9.00PM  : int  74 70 68 64 58 NA 87 90 78 86 ...
##  $ X9.00.10.00PM : int  49 63 64 58 58 NA 73 70 64 77 ...
##  $ X10.00.11.00PM: int  42 42 56 64 55 NA 57 63 49 63 ...
##  $ X11.00.12.00AM: int  42 35 43 43 54 NA 42 49 45 51 ...
#cleaning the column names
names(traffic) <- gsub("[(). \\-]", "_", names(traffic))
names(traffic) <- tolower(names(traffic))
names(traffic) <- gsub("^x", "", names(traffic))
names(traffic) <- gsub("_00", "", names(traffic))
names(traffic)[names(traffic) == "12_1_am"] <- "12_1am"
names(traffic) <- ifelse(
  names(traffic) == "11_12pm", "11_12am",
  ifelse(names(traffic) == "11_12am", "11_12pm", names(traffic)))

#making the date column be in date format
traffic <- traffic %>%
  mutate(date = mdy(date))

#creating a year column from the date column
traffic <- traffic %>%
  mutate(year = year(date))

str(traffic)
## 'data.frame':    42756 obs. of  32 variables:
##  $ id          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ segmentid   : int  15540 15540 15540 15540 15540 15540 15540 15540 15540 15540 ...
##  $ roadway_name: chr  "BEACH STREET" "BEACH STREET" "BEACH STREET" "BEACH STREET" ...
##  $ from        : chr  "UNION PLACE" "UNION PLACE" "UNION PLACE" "UNION PLACE" ...
##  $ to          : chr  "VAN DUZER STREET" "VAN DUZER STREET" "VAN DUZER STREET" "VAN DUZER STREET" ...
##  $ direction   : chr  "NB" "NB" "NB" "NB" ...
##  $ date        : Date, format: "2012-01-09" "2012-01-10" ...
##  $ 12_1am      : int  20 21 27 22 31 42 27 26 32 24 ...
##  $ 1_2am       : int  10 16 14 7 17 27 12 16 16 12 ...
##  $ 2_3am       : num  11 8 6 7 7 21 12 11 8 7 ...
##  $ 3_4am       : num  14 6 5 8 5 18 4 13 9 18 ...
##  $ 4_5am       : int  13 13 12 11 13 21 22 16 15 11 ...
##  $ 5_6am       : num  20 13 16 12 28 13 27 27 26 23 ...
##  $ 6_7am       : int  34 31 34 33 29 17 66 59 63 61 ...
##  $ 7_8am       : num  66 70 75 75 68 18 154 156 169 146 ...
##  $ 8_9am       : num  100 67 69 89 84 46 155 177 178 177 ...
##  $ 9_10am      : int  52 45 71 66 64 53 138 131 148 128 ...
##  $ 10_11am     : int  68 57 67 70 83 29 105 107 139 117 ...
##  $ 11_12am     : int  85 67 70 60 89 0 124 108 131 111 ...
##  $ 12_1pm      : int  85 73 90 105 88 NA 140 122 126 134 ...
##  $ 1_2pm       : int  94 95 89 103 113 NA 120 131 137 134 ...
##  $ 2_3pm       : int  104 102 115 71 113 NA 165 192 178 171 ...
##  $ 3_4pm       : int  105 98 115 127 126 NA 197 180 194 184 ...
##  $ 4_5pm       : int  147 133 130 122 133 NA 152 161 168 157 ...
##  $ 5_6pm       : int  120 131 143 144 135 NA 174 171 160 167 ...
##  $ 6_7pm       : int  91 95 106 122 102 NA 128 120 143 148 ...
##  $ 7_8pm       : int  83 73 89 76 106 NA 95 96 114 112 ...
##  $ 8_9pm       : int  74 70 68 64 58 NA 87 90 78 86 ...
##  $ 9_10pm      : int  49 63 64 58 58 NA 73 70 64 77 ...
##  $ 10_11pm     : int  42 42 56 64 55 NA 57 63 49 63 ...
##  $ 11_12pm     : int  42 35 43 43 54 NA 42 49 45 51 ...
##  $ year        : num  2012 2012 2012 2012 2012 ...
#checking amount of NAs
sum(is.na(traffic))
## [1] 3080
#almost all the NAs were from 2012, so I decided to remove that year from my analysis
traffic_clean <- traffic %>%
  filter(year != 2012)

#rechecking amount of NAs without 2012
sum(is.na(traffic_clean))
## [1] 44
#seeing how many values are present each year 
traffic_clean %>%
  group_by(year) %>%
  summarise(n = n())
## # A tibble: 9 × 2
##    year     n
##   <dbl> <int>
## 1  2013  1287
## 2  2014  6147
## 3  2015  5921
## 4  2016  3169
## 5  2017  5163
## 6  2018  1048
## 7  2019  5841
## 8  2020  5643
## 9  2021   486
#separating the am and pm values into individual dataframes
traffic_am <- traffic_clean[, !grepl("pm", names(traffic_clean))]
traffic_pm <- traffic_clean[, !grepl("am", names(traffic_clean))]

#calculating 12-hour totals of traffic counts for both AM and PM, removing NAs since there's such a minuscule amount of them
traffic_am <- traffic_am %>%
  mutate(total = rowSums(traffic_am[ , 8:19], na.rm = TRUE))

traffic_pm <- traffic_pm %>%
  mutate(total = rowSums(traffic_pm[ , 8:19], na.rm = TRUE))

head(traffic_am)
##   id segmentid roadway_name       from           to direction       date 12_1am
## 1  1      2153 HUGUENOT AVE WOODROW RD STAFFORD AVE        NB 2013-02-02    106
## 2  1      2153 HUGUENOT AVE WOODROW RD STAFFORD AVE        NB 2013-02-03    109
## 3  1      2153 HUGUENOT AVE WOODROW RD STAFFORD AVE        NB 2013-02-04     36
## 4  1      2153 HUGUENOT AVE WOODROW RD STAFFORD AVE        NB 2013-02-05     42
## 5  1      2153 HUGUENOT AVE WOODROW RD STAFFORD AVE        NB 2013-02-06     35
## 6  1      2153 HUGUENOT AVE WOODROW RD STAFFORD AVE        NB 2013-02-07     33
##   1_2am 2_3am 3_4am 4_5am 5_6am 6_7am 7_8am 8_9am 9_10am 10_11am 11_12am year
## 1    74    45    29    29    45    71   145   213    278     387     335 2013
## 2    74    55    37    26    25    47    74   111    204     249     351 2013
## 3    28    11    16    32   108   168   418   493    263     282     307 2013
## 4    28    16    12    34   109   193   397   499    241     255     294 2013
## 5    38    12    14    31    98   195   372   490    297     260     283 2013
## 6    26    14    22    31   100   171   382   460    273     215     263 2013
##   total
## 1  1757
## 2  1362
## 3  2162
## 4  2120
## 5  2125
## 6  1990
head(traffic_pm)
##   id segmentid       from           to direction       date 12_1pm 1_2pm 2_3pm
## 1  1      2153 WOODROW RD STAFFORD AVE        NB 2013-02-02    406   411   371
## 2  1      2153 WOODROW RD STAFFORD AVE        NB 2013-02-03    374   350   308
## 3  1      2153 WOODROW RD STAFFORD AVE        NB 2013-02-04    304   328   426
## 4  1      2153 WOODROW RD STAFFORD AVE        NB 2013-02-05    310   382   393
## 5  1      2153 WOODROW RD STAFFORD AVE        NB 2013-02-06    334   324   383
## 6  1      2153 WOODROW RD STAFFORD AVE        NB 2013-02-07    254   303   151
##   3_4pm 4_5pm 5_6pm 6_7pm 7_8pm 8_9pm 9_10pm 10_11pm 11_12pm year total
## 1   398   324   394   379   329   249    197     187     169 2013  5421
## 2   291   313   253   242   217   210    144     125      79 2013  4545
## 3   425   419   469   425   358   224    185     132      74 2013  5478
## 4   479   441   476   446   424   305    219     171      76 2013  5825
## 5   384   391   369   401   338   236    186     123      76 2013  5224
## 6   409   422   364   356   339   275    225     169     104 2013  5130
#noting the time of day for boxplot
traffic_am$Time <- "AM"
traffic_pm$Time <- "PM"

#creating a new dataframe with just the 12-hour totals and the time of day so I can use ggplot
traffic_boxplot <- rbind(traffic_am[, c("total", "Time")], traffic_pm[, c("total", "Time")])

#creating the boxplot (code format is taken from the Descriptive Statistics assignment)
ggplot(traffic_boxplot, aes(Time, total, fill = Time)) +
  geom_boxplot() +
  labs(title = "Daily AM and PM Traffic Volume Totals",
       x = "Time of Day", y = "Daily Traffic Volume (per street)") +
  theme_minimal()


This boxplot compares the total daily AM and PM traffic volumes across all New York City streets included in this dataset. PM traffic shows higher median volumes and a wider spread than AM traffic, indicating more drivers on the road during these hours. The large number of outliers showcases that certain streets experience significantly higher traffic than others.

#creating the histogram for AM and PM (code format is taken from the Descriptive Statistics assignment)
ggplot(traffic_am, aes(x = total)) +
  geom_histogram(binwidth = 1000, fill = "blue", color = "black") +
  labs(title = "AM Traffic Totals", x = "Daily Traffic Totals (per street)", y = "Count") +
  theme_minimal()

ggplot(traffic_pm, aes(x = total)) +
  geom_histogram(binwidth = 1000, fill = "red", color = "black") +
  labs(title = "PM Traffic Totals", x = "Daily Traffic Totals (per street)", y = "Count") +
  theme_minimal()


Both histograms show that AM and PM traffic volume are right-skewed, with most streets experiencing low to moderate traffic and a small number of streets showing extremely high volumes. The distribution during the PM hours is shifted further right than the distribution during the AM hours, indicating higher traffic totals in the afternoon.

Statistical Analysis

Hypotheses

\(H_0\): \(\mu_{PM}\) = \(\mu_{AM}\)
\(H_a\): \(\mu_{PM}\) > \(\mu_{AM}\)

Where,

\(\mu_{PM}\) = mean traffic during PM hours

\(\mu_{AM}\) = mean traffic during AM hours

#conducting the two-sample t-test
t.test(traffic_pm$total, traffic_am$total, alternative = "greater")
## 
##  Welch Two Sample t-test
## 
## data:  traffic_pm$total and traffic_am$total
## t = 78.421, df = 63354, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  4069.378      Inf
## sample estimates:
## mean of x mean of y 
##  8351.106  4194.544


I will evaluate this hypothesis test at the 5% significance level (α = 0.05). Because the p-value (2.2e-16) is less than the significance level (0.05), we reject the null hypothesis. There is sufficient evidence to conclude that PM traffic volumes are greater on average than AM traffic volumes in New York City over this time span.

Conclusion

This analysis compared New York City’s AM and PM traffic volumes using daily totals derived from hourly traffic counts. The boxplot and histograms indicated that PM traffic totals were greater and more variable than the AM totals, with some roadways producing significantly higher values. This trend was then confirmed using a right-tailed two-sample t-test. The p-value was much smaller than the significance level, supporting the hypothesis that mean traffic during PM hours was greater than mean traffic during AM hours. This aligns with expected commuter behavior. These results imply that roads in New York City are more crowded later in the day, which could have an impact on traffic control and urban planning. In order to better understand traffic behavior in New York City, more research may be necessary. For instance, examining the traffic patterns of specific roads may help identify the places that are most responsible for the variations. We could also identify additional factors that account for the daily fluctuations not examined in this analysis by considering holidays or weather conditions.

References

GREP: Pattern matching and replacement. RDocumentation. (n.d.). https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/grep
NYC OpenData. (2022, June 2). Traffic Volume Counts https://catalog.data.gov/dataset/traffic-volume-counts