My data set contains details of every traffic collision occurring on county and local roadways within Montgomery County. I plan to explore the number of collisions that occur in each month of the year and which agency reports each collision. I want to find out whether any month has a disproportionate number of collisions and consider potential reasons, such as holidays. I also want to find out whether there is any correlation between the number of collisions and the hour of the day.
This data set was collected using the Automated Crash Reporting System of the Maryland State Police and provided to the public by Montgomery County, MD.
Rows: 97458 Columns: 44
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (38): Report Number, Local Case Number, Agency Name, ACRS Report Type, C...
dbl (6): Mile Point, Lane Number, Number of Lanes, Distance, Latitude, Long...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Linear regression model using hours of the day to predict the number of collisions
lm_model <- lm(total_collisions ~ hour, data = hourly_crashes)
summary(lm_model)
Call:
lm(formula = total_collisions ~ hour, data = hourly_crashes)
Residuals:
    Min      1Q  Median      3Q     Max
-3487.5 -1563.3   109.3  1434.6  2520.1
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2281.65     777.00   2.936  0.00764 **
hour          142.33      54.38   2.617  0.01573 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1844 on 22 degrees of freedom
Multiple R-squared: 0.2374, Adjusted R-squared: 0.2028
F-statistic: 6.85 on 1 and 22 DF, p-value: 0.01573
plot(lm_model)
The equation for my model is: Total Collisions = B0 + B1 * Hour + e. With the estimated coefficients, the fitted line is: Total Collisions = 2281.65 + 142.33 * Hour.
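To make the equation concrete, here is a minimal sketch that plugs the estimated coefficients from the summary output into the model. The helper name `predict_collisions` is hypothetical, and the sketch assumes hours are coded 0 to 23, which the output does not confirm.

```r
# Coefficients copied from the summary(lm_model) output.
b0 <- 2281.65  # intercept: predicted collisions when hour = 0
b1 <- 142.33   # slope: change in predicted collisions per one-hour step

# Hypothetical helper: predicted total collisions for a given hour.
# Assumes hour is coded 0-23; adjust if the data uses 1-24.
predict_collisions <- function(hour) b0 + b1 * hour

# Predictions for midnight, 8 AM, and 5 PM.
predict_collisions(c(0, 8, 17))
```

With the real model object, `predict(lm_model, newdata = data.frame(hour = c(0, 8, 17)))` should give the same values up to rounding of the coefficients.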
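The adjusted R^2 value of 0.2028 indicates that the hour of the day accounts for only about 20.28% of the variation in total collisions, so most of the variation is left unexplained by this model.
With a p-value of 0.01573, which is below the 0.05 significance level, I reject the null hypothesis that the slope is zero and conclude that there is a statistically significant linear relationship between the hour of the day and the total number of collisions. The slope estimate suggests roughly 142 more collisions for each later hour of the day, on average.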
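As a cross-check on the summary output, the sketch below recomputes the p-value, the adjusted R-squared, and the F statistic from the other reported numbers. It uses only the rounded values printed in the output, so small differences in the last decimal place are expected.

```r
# Values copied from the summary output.
t_value  <- 2.617   # t statistic for the hour coefficient
df_resid <- 22      # residual degrees of freedom
r2       <- 0.2374  # multiple R-squared
n        <- df_resid + 2  # 24 observations (two coefficients estimated)

# Two-sided p-value for the slope from the t distribution.
p_value <- 2 * pt(-abs(t_value), df = df_resid)

# Adjusted R-squared for a model with one predictor.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - 2)

# With a single predictor, the F statistic is the squared t statistic.
f_stat <- t_value^2

round(c(p_value = p_value, adj_r2 = adj_r2, f_stat = f_stat), 4)
```

This also shows why the p-value on the F-statistic line matches the p-value for `hour`: with one predictor the two tests are equivalent.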
Calculate the number of monthly crashes
Converting the date/time format to only months in a separate column
crashes2$month <- month(as.POSIXct(crashes2$crashdatetime, format = "%m/%d/%Y %I:%M:%S %p"))
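For illustration, here is a small base R sketch of the same conversion on made-up timestamps, without `month()`. The sample strings are invented, and the sketch assumes an English locale, since parsing `%p` (AM/PM) is locale-dependent.

```r
# Made-up timestamps in the same "%m/%d/%Y %I:%M:%S %p" layout.
stamps <- c("03/15/2024 02:30:00 PM", "11/02/2024 11:05:10 AM")

# Parse the strings; tz is fixed so the result does not depend on local settings.
parsed <- as.POSIXct(stamps, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# Extract the month number (1-12) and the hour on a 24-hour clock (0-23).
month_num <- as.integer(format(parsed, "%m"))
hour_num  <- as.integer(format(parsed, "%H"))

# month.abb is a built-in vector of abbreviated month names.
data.frame(stamps, month_num, month_name = month.abb[month_num], hour_num)
```

The same `format(parsed, "%H")` step is one way to get an hour column for the hourly analysis.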
Calculate the total crashes per month by each agency
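One way this count could be produced is sketched below in base R. This is an assumption for illustration, not necessarily the code used to build `monthly_crashes_nona`; the data frame `crashes_demo` and its rows are made up, and only the column names `month`, `agencyname`, and `total_crashes` come from the report.

```r
# Made-up rows standing in for the cleaned crash table.
crashes_demo <- data.frame(
  month = c(1, 1, 1, 2, 2, 10),
  agencyname = c("Montgomery County Police", "Montgomery County Police",
                 "Rockville Police", "Montgomery County Police", NA,
                 "Gaithersburg Police")
)

# Drop rows where the agency is missing.
crashes_demo_nona <- crashes_demo[!is.na(crashes_demo$agencyname), ]

# Give each row a count of 1, then sum the counts per month/agency pair.
monthly_demo <- aggregate(
  total_crashes ~ month + agencyname,
  data = transform(crashes_demo_nona, total_crashes = 1),
  FUN = sum
)

monthly_demo
```

Converting `month` to a factor before plotting, for example `factor(month, levels = 1:12)`, keeps the months in calendar order on a discrete axis.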
ggplot(monthly_crashes_nona, aes(month, total_crashes, fill = agencyname)) +
  geom_bar(stat = "identity", alpha = 0.5) +
  scale_x_discrete(labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")) +
  scale_fill_brewer(palette = "Dark2", name = "Police Agencies",
                    labels = c("Gaithersburg Police", "MD National Capital Park Police",
                               "Montgomery County Police", "Rockville Police",
                               "Takoma Park Police")) +
  labs(x = "Months", y = "Total Collisions",
       title = "Total Collisions per Month in 2024",
       caption = "Data Provided By: Montgomery County, MD") +
  theme_classic()
Summary Essay
A.
The data set originally came with headers that were capitalized and contained spaces and slashes. I cleaned them up by lowercasing all the headers and removing the spaces. I initially planned to leave the slashes in, but they caused trouble in my code, so I removed them as well.
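The cleaning steps described here can be sketched in base R as follows. The header "Crash Date/Time" is an assumed original spelling inferred from the cleaned name `crashdatetime`; the other two headers appear in the column specification.

```r
# Example headers: two from the column specification, one assumed.
raw_names <- c("Report Number", "Agency Name", "Crash Date/Time")

# Lowercase everything, then strip spaces and slashes in one pass.
clean_names <- gsub("[ /]", "", tolower(raw_names))

clean_names
```

On a real data frame the same expression would be applied to `names(df)`, for example `names(df) <- gsub("[ /]", "", tolower(names(df)))`, where `df` is a placeholder name.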
B.
The visualization shows the total collisions per month in Montgomery County, MD, during 2024. The graph suggests that the fall months have the most collisions. This is what I expected, because the sun sets earlier in the fall, which means people drive in the dark more often than they do in the summer. I wonder whether leaves on the ground during the fall also affect the number of collisions. I'd imagine that roads covered in leaves, especially wet leaves, make driving riskier than roads with no leaves.
C.
I noticed that the data set included location coordinates consisting of both longitude and latitude. It would have been interesting to plot all the collisions on a map, though I’m not sure how I would do that.
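One possible starting point is a simple scatter plot of the coordinates with ggplot2, sketched below. The points are made up for illustration, and the column names `latitude` and `longitude` are assumptions about what the cleaned headers would be.

```r
library(ggplot2)

# Made-up points roughly inside Montgomery County, for illustration only.
crashes_demo <- data.frame(
  latitude  = c(39.08, 39.14, 39.00, NA, 39.15),
  longitude = c(-77.15, -77.20, -77.03, -77.10, -77.27)
)

# Keep only rows where both coordinates are present.
has_coords  <- !is.na(crashes_demo$latitude) & !is.na(crashes_demo$longitude)
crashes_map <- crashes_demo[has_coords, ]

# Plot each collision as a semi-transparent point.
# coord_quickmap() sets an aspect ratio that approximates a map projection.
collision_map <- ggplot(crashes_map, aes(x = longitude, y = latitude)) +
  geom_point(alpha = 0.4, size = 1) +
  coord_quickmap() +
  labs(x = "Longitude", y = "Latitude",
       title = "Collision locations (illustrative points)") +
  theme_classic()

print(collision_map)
```

This draws points without any road or boundary background. Adding a base map would need an additional mapping package, which is beyond this sketch.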