DATA-110 Final Project
knitr::include_graphics("car_crash.jpg")
(Ringo Chiu, REUTERS, Whittier, California, November 16, 2022)
Introduction
The data comes directly from Los Angeles, California Open Data. The Los Angeles Police Department (LAPD) collected the information from crash reports over a span of 10 years, and it was corroborated by those involved in or witnesses to the car accidents. The data includes 18 variables, each giving detailed information from the reports on about 622k individual car crashes. Although there are many ways to analyze this data, I decided to focus on the patterns in traffic collisions in Los Angeles. Some key variables I will use are Area Name, Victim Sex, Victim Age, Time Occurred, and Number of Incidents. I chose this topic because many of us can relate to driving and witnessing car crashes. Some may see crashes as normal, but it is important to take them seriously and look at the problem through a different lens to see what is causing it.
Load Libraries
library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.5.3
Warning: package 'ggplot2' was built under R version 4.5.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(dplyr)
library(lubridate)
library(janitor)
Warning: package 'janitor' was built under R version 4.5.3
Attaching package: 'janitor'
The following objects are masked from 'package:stats':
chisq.test, fisher.test
Car_crash <- read_csv("car_crash_data.csv")
Rows: 621677 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (11): Date Reported, Date Occurred, Area Name, Crime Code Description, M...
dbl (7): DR Number, Time Occurred, Area ID, Reporting District, Crime Code,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Data Cleaning
Car_crash <- Car_crash %>%
  clean_names()
car_clean <- Car_crash %>%
  select(date_occurred, time_occurred, area_name, victim_age, victim_sex, premise_description, location)
#Adds focus on variables I will use. Removes any unnecessary ones.
car_clean <- car_clean %>%
mutate(
date_occurred = mdy(date_occurred),
hour_occurred = floor(time_occurred / 100)
)
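As a quick check of the mutate() step above, here is a minimal sketch of how dividing a military time by 100 and rounding down gives the hour. The sample values are made up for illustration. It uses base R's as.Date() in place of lubridate's mdy() so that it runs without extra packages, and it assumes the dates are stored as month/day/year text, which is the order mdy() expects.

```r
# Made-up sample values for illustration (not rows from the LAPD data).
sample_times <- c(5, 930, 1445, 2359)   # military time stored as numbers
sample_dates <- c("11/16/2022", "01/02/2015")

# Dividing by 100 and rounding down keeps only the hour part.
sample_hours <- floor(sample_times / 100)
print(sample_hours)   # 0 9 14 23

# Base R stand-in for mdy(): parse text written in month/day/year order.
parsed_dates <- as.Date(sample_dates, format = "%m/%d/%Y")
print(parsed_dates)
```

A time such as 5 (00:05) becomes hour 0, so accidents just after midnight fall in the earliest hour rather than the latest.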
#Converts dates to a proper date format and creates a new variable that extracts the hour from military time.
Statistical Analysis
regression<-lm(
victim_age ~ hour_occurred + victim_sex + area_name,
data = car_clean
)
summary(regression)
Call:
lm(formula = victim_age ~ hour_occurred + victim_sex + area_name,
data = car_clean)
Residuals:
Min 1Q Median 3Q Max
-35.150 -13.064 -2.966 9.949 65.986
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.853253 16.549972 1.260 0.20766
hour_occurred -0.076821 0.003863 -19.885 < 2e-16 ***
victim_sexF 19.978553 16.549843 1.207 0.22737
victim_sexH 21.423362 16.601436 1.290 0.19689
victim_sexM 21.888314 16.549827 1.323 0.18598
victim_sexN 22.489522 17.553703 1.281 0.20013
victim_sexX 15.595395 16.551486 0.942 0.34607
area_nameCentral 1.205506 0.138042 8.733 < 2e-16 ***
area_nameDevonshire 2.079168 0.136239 15.261 < 2e-16 ***
area_nameFoothill 1.132246 0.149641 7.566 3.84e-14 ***
area_nameHarbor 1.057182 0.146997 7.192 6.40e-13 ***
area_nameHollenbeck 0.653772 0.149057 4.386 1.15e-05 ***
area_nameHollywood -1.744923 0.135459 -12.882 < 2e-16 ***
area_nameMission 0.638587 0.139483 4.578 4.69e-06 ***
area_nameN Hollywood 0.107977 0.131888 0.819 0.41296
area_nameNewton -1.181093 0.131737 -8.966 < 2e-16 ***
area_nameNortheast 1.203654 0.138812 8.671 < 2e-16 ***
area_nameOlympic -0.067995 0.131458 -0.517 0.60499
area_namePacific 1.223567 0.133635 9.156 < 2e-16 ***
area_nameRampart -0.465801 0.147788 -3.152 0.00162 **
area_nameSoutheast -0.856455 0.140267 -6.106 1.02e-09 ***
area_nameSouthwest -0.088976 0.126884 -0.701 0.48316
area_nameTopanga 2.463598 0.139521 17.658 < 2e-16 ***
area_nameVan Nuys 1.429971 0.133895 10.680 < 2e-16 ***
area_nameWest LA 1.704634 0.133207 12.797 < 2e-16 ***
area_nameWest Valley 2.638926 0.135404 19.489 < 2e-16 ***
area_nameWilshire -0.294441 0.128672 -2.288 0.02212 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 16.55 on 529250 degrees of freedom
(92400 observations deleted due to missingness)
Multiple R-squared: 0.009601, Adjusted R-squared: 0.009552
F-statistic: 197.3 on 26 and 529250 DF, p-value: < 2.2e-16
ggplot(car_clean, aes(x = hour_occurred, y = victim_age)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm", color = "blue") +
  labs(title = "Victim Age by Hour of Accident",
       x = "Hour",
       y = "Victim Age") +
  theme_dark()
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 88194 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 88194 rows containing missing values or values outside the scale range
(`geom_point()`).
I ran a multiple linear regression to investigate whether the hour in which the accident occurred, the sex of the victim, and the area name were related to victim age. Overall, the model was statistically significant (p < 0.0001). The hour of the accident had a negative relationship with victim age (about -0.08 years per hour), which means that victims of accidents later in the day tended to be slightly younger. Many areas in Los Angeles were also significant predictors of victim age. The sex of the victim, however, was not statistically significant. Based on the adjusted R-squared, the model explained about 1% of the variation in victim age.
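Because the sample is so large, a relationship can be statistically significant while still explaining very little of the variation. The sketch below illustrates this with simulated data, not the LAPD data. The slope of -0.077 and the residual spread of 16.5 are borrowed from the regression output, while the sample size, the intercept, the even spread of hours, and the variable names are assumptions chosen for illustration.

```r
set.seed(110)  # arbitrary seed so the simulation is repeatable

n <- 100000                                  # assumed sample size
sim_hour <- sample(0:23, n, replace = TRUE)  # hours spread evenly (an assumption)

# Slope and noise level borrowed from the regression output; the intercept of 42 is assumed.
sim_age <- 42 - 0.077 * sim_hour + rnorm(n, mean = 0, sd = 16.5)

sim_model <- lm(sim_age ~ sim_hour)
sim_summary <- summary(sim_model)

# p-value for the hour coefficient and the share of variation explained
hour_p_value <- sim_summary$coefficients["sim_hour", "Pr(>|t|)"]
r_squared <- sim_summary$r.squared
print(hour_p_value)
print(r_squared)

# Predicted change in average age between hour 0 and hour 23
print(-0.077 * 23)
```

In a run like this, the hour coefficient should come out significant while R-squared stays close to zero, and the predicted difference between midnight and 11 PM is under two years of age.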
The visualization above illustrates the relationship between the time of the accident and the victim's age. The fitted line shows a negative relationship, meaning that victims of accidents later in the day tend to be younger. Each dot represents a single collision.
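With several hundred thousand dots, the direction of the line can be hard to confirm by eye. One way to check it is to average victim age for each hour. This is a minimal sketch on made-up rows, using base R's aggregate() rather than anything from the original analysis.

```r
# Made-up rows for illustration (not the LAPD data).
toy_crashes <- data.frame(
  hour_occurred = c(8, 8, 17, 17, 23, 23),
  victim_age    = c(45, 51, 38, 42, 24, NA)
)

# Average victim age for each hour.
# The formula method of aggregate() drops rows with missing values by default.
mean_age_by_hour <- aggregate(victim_age ~ hour_occurred,
                              data = toy_crashes, FUN = mean)
print(mean_age_by_hour)
```

If the averages fall as the hour rises, that agrees with the negative slope of the fitted line.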
Visualization 1
car_clean %>%
filter(victim_sex %in% c("M","F")) %>%
ggplot(aes(x= area_name, fill = victim_sex)) +
geom_bar()+
coord_flip()+
labs(
title = "Traffic Collisions in L.A by Area (2010 - 2026)",
x = "Region",
y = "Accidents",
fill = "Sex"
)+
  theme_dark()
#Created a simple bar graph but used coord_flip to allow for easier user readability. Focused on 2 sexes to reduce clutter.
The visualization above displays the number of traffic collisions in each Los Angeles police area. The color fill separates the accidents that men and women had over the 10 years the LAPD reported. It shows that men had a higher number of accidents than women in every area. The areas with the highest number of accidents are 77th Street, Southwest, and Wilshire. These areas contain heavily trafficked roads, which may help explain the high number of accidents.
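The ranking of areas can be checked with a count table in addition to reading it off the bars. The sketch below uses made-up rows, not the LAPD data, and base R's table() to count collisions per area and per sex after keeping only "M" and "F", as the filter() step does.

```r
# Made-up rows for illustration (not the LAPD data).
toy_crashes <- data.frame(
  area_name  = c("77th Street", "77th Street", "77th Street",
                 "Southwest", "Southwest", "Wilshire"),
  victim_sex = c("M", "F", "M", "M", "X", "F")
)

# Keep only the two sexes shown in the chart.
toy_mf <- toy_crashes[toy_crashes$victim_sex %in% c("M", "F"), ]

# Collisions per area, largest first.
area_counts <- sort(table(toy_mf$area_name), decreasing = TRUE)
print(area_counts)

# Collisions per area split by sex (the numbers behind the stacked bars).
print(table(toy_mf$area_name, toy_mf$victim_sex))
```

On the real data, the same two tables would give the exact counts behind each bar and each color segment.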
Visualization 2
The attached Tableau visualization shows traffic accidents across Los Angeles police areas. Each point represents a geographic reporting area in Los Angeles, and the size of each point reflects the number of collisions. As noted before, 77th Street and Southwest reported high numbers of traffic collisions over the 10+ years. The visualization is interactive and gives specific detail on the count of accidents. Filters on victim sex were added to allow for a more accurate analysis.
Conclusion
The visualizations displayed the patterns of traffic accidents in Los Angeles. The first helped show the relationship between time of day and victim age: victims of accidents later in the day tend to be younger. The other two visualizations gave detailed information on the areas where this problem happens most and where the police department may have to improve its safety and surveillance protocols. Although the visualizations worked well, I tried to show more of the difference between the number of accidents in each region using the size marks in Tableau, but I could not find a way to do so. I was also hoping to create a graph showing the specific places where accidents occurred, for example streets, highways, and parking lots. Overall, though the project did not go as expected, it turned out well, and I learned the importance of making changes along the way during data analysis.
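One possible starting point for a graph of specific places is the premise_description column kept during cleaning. The sketch below counts collisions by premise type on made-up rows. The category labels are hypothetical and may not match the values in the LAPD data, and the chart uses base R's barplot() rather than ggplot2 so that it runs without extra packages.

```r
# Made-up rows; the premise labels are hypothetical examples.
toy_premises <- data.frame(
  premise_description = c("STREET", "STREET", "PARKING LOT",
                          "STREET", "FREEWAY", "PARKING LOT")
)

# Count collisions for each premise type, most common first.
premise_counts <- sort(table(toy_premises$premise_description),
                       decreasing = TRUE)
print(premise_counts)

# Horizontal bar chart of the counts; horiz = TRUE plays the role of coord_flip().
barplot(premise_counts, horiz = TRUE, las = 1,
        main = "Collisions by Premise Type (toy data)")
```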