DATA-110 Final Project

Author

Isaac Cuellar

knitr::include_graphics("car_crash.jpg")

(Ringo Chiu, REUTERS, Whittier California, November 16, 2022)

Introduction

The data comes directly from the Los Angeles, California Open Data portal. The Los Angeles Police Department (LAPD) collected the information from crash reports over a span of 10 years, and the reports were corroborated by the people involved in, or witnesses to, the car accidents. The data set has 18 variables and about 622k rows, each giving detailed information from the report on an individual car crash. Although there are many ways to analyze this data, I decided to focus on patterns in traffic collisions in Los Angeles. The key variables I will use are Area Name, Victim Sex, Victim Age, Time Occurred, and Number of Incidents. I chose this topic because many of us drive and have witnessed car crashes. Some may see crashes as normal, but it is important to take them seriously and to look at the problem through a different lens to see what contributes to it.

Load Libraries

library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.5.3
Warning: package 'ggplot2' was built under R version 4.5.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(dplyr)
library(lubridate)
library(janitor)
Warning: package 'janitor' was built under R version 4.5.3

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test
Car_crash <- read_csv("car_crash_data.csv")
Rows: 621677 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (11): Date Reported, Date Occurred, Area Name, Crime Code Description, M...
dbl  (7): DR Number, Time Occurred, Area ID, Reporting District, Crime Code,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data Cleaning

Car_crash <- Car_crash %>% 
  clean_names()
car_clean <- Car_crash %>% 
  select(date_occurred, time_occurred, area_name, victim_age, victim_sex, premise_description, location)
# Keep only the variables used in the analysis and drop the rest.
car_clean <- car_clean %>% 
  mutate(
    date_occurred = mdy(date_occurred),
    hour_occurred = floor(time_occurred / 100)
  )
# Convert the dates to Date format and create a new variable holding the hour (0-23) taken from the 24-hour HHMM time.
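
As a quick check of the two conversions above, the sketch below applies them to a few made-up rows. The data frame `toy` and its values are hypothetical and are not taken from the LAPD file; the sketch assumes, as the cleaning step does, that `time_occurred` is a number in 24-hour HHMM form and that dates are month/day/year text.

```r
library(dplyr)
library(lubridate)

# Hypothetical rows for illustration only (not from the LAPD data)
toy <- data.frame(
  date_occurred = c("01/15/2015", "12/31/2018", "07/04/2012"),
  time_occurred = c(30, 1345, 2359),  # 00:30, 13:45, 23:59 in HHMM form
  stringsAsFactors = FALSE
)

toy_clean <- toy %>%
  mutate(
    date_occurred = mdy(date_occurred),         # character -> Date
    hour_occurred = floor(time_occurred / 100)  # e.g. 1345 -> 13
  )

toy_clean$hour_occurred
```

Dividing by 100 and rounding down drops the minutes, so 30, 1345, and 2359 become hours 0, 13, and 23.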

Statistical Analysis

regression <- lm(
  victim_age ~ hour_occurred + victim_sex + area_name,
  data = car_clean
)
summary(regression)

Call:
lm(formula = victim_age ~ hour_occurred + victim_sex + area_name, 
    data = car_clean)

Residuals:
    Min      1Q  Median      3Q     Max 
-35.150 -13.064  -2.966   9.949  65.986 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)          20.853253  16.549972   1.260  0.20766    
hour_occurred        -0.076821   0.003863 -19.885  < 2e-16 ***
victim_sexF          19.978553  16.549843   1.207  0.22737    
victim_sexH          21.423362  16.601436   1.290  0.19689    
victim_sexM          21.888314  16.549827   1.323  0.18598    
victim_sexN          22.489522  17.553703   1.281  0.20013    
victim_sexX          15.595395  16.551486   0.942  0.34607    
area_nameCentral      1.205506   0.138042   8.733  < 2e-16 ***
area_nameDevonshire   2.079168   0.136239  15.261  < 2e-16 ***
area_nameFoothill     1.132246   0.149641   7.566 3.84e-14 ***
area_nameHarbor       1.057182   0.146997   7.192 6.40e-13 ***
area_nameHollenbeck   0.653772   0.149057   4.386 1.15e-05 ***
area_nameHollywood   -1.744923   0.135459 -12.882  < 2e-16 ***
area_nameMission      0.638587   0.139483   4.578 4.69e-06 ***
area_nameN Hollywood  0.107977   0.131888   0.819  0.41296    
area_nameNewton      -1.181093   0.131737  -8.966  < 2e-16 ***
area_nameNortheast    1.203654   0.138812   8.671  < 2e-16 ***
area_nameOlympic     -0.067995   0.131458  -0.517  0.60499    
area_namePacific      1.223567   0.133635   9.156  < 2e-16 ***
area_nameRampart     -0.465801   0.147788  -3.152  0.00162 ** 
area_nameSoutheast   -0.856455   0.140267  -6.106 1.02e-09 ***
area_nameSouthwest   -0.088976   0.126884  -0.701  0.48316    
area_nameTopanga      2.463598   0.139521  17.658  < 2e-16 ***
area_nameVan Nuys     1.429971   0.133895  10.680  < 2e-16 ***
area_nameWest LA      1.704634   0.133207  12.797  < 2e-16 ***
area_nameWest Valley  2.638926   0.135404  19.489  < 2e-16 ***
area_nameWilshire    -0.294441   0.128672  -2.288  0.02212 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.55 on 529250 degrees of freedom
  (92400 observations deleted due to missingness)
Multiple R-squared:  0.009601,  Adjusted R-squared:  0.009552 
F-statistic: 197.3 on 26 and 529250 DF,  p-value: < 2.2e-16
ggplot(car_clean, aes(x = hour_occurred, y = victim_age)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm", color = "blue") +
  labs(title = "Victim Age by Hour of Accident",
       x = "Hour",
       y = "Victim Age") +
  theme_dark()
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 88194 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 88194 rows containing missing values or values outside the scale range
(`geom_point()`).

I ran a multiple linear regression to investigate whether the hour at which the accident occurred, the sex of the victim, and the area name were related to victim age. Overall, the model was statistically significant (F = 197.3, p < 0.0001). The hour of the accident had a small negative relationship with victim age: each additional hour later in the day was associated with a victim about 0.08 years younger, holding sex and area constant. Many LAPD areas were also significant predictors of victim age relative to the reference area. None of the individual victim sex categories, however, was statistically significant. Although the model was significant, it explained only about 1% of the variation in victim age (adjusted R-squared = 0.0096), so with more than 529,000 observations the significance largely reflects the sample size rather than a strong effect.
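
To put the hour coefficient in context, the short calculation below scales the reported estimate across a full day. It uses only numbers copied from the regression output above; the variable names are illustrative and not part of the original analysis.

```r
# Values copied from the summary(regression) output
hour_coef <- -0.076821  # change in expected victim age per one-hour increase
adj_r2    <- 0.009552   # adjusted R-squared

# Expected age difference between hour 23 and hour 0, other predictors held fixed
age_diff_full_day <- hour_coef * (23 - 0)

# Share of the variance in victim age explained by the model, as a percentage
pct_explained <- adj_r2 * 100

round(age_diff_full_day, 2)  # -1.77
round(pct_explained, 2)      # 0.96
```

Under the fitted model, the expected victim age at hour 23 is less than two years lower than at hour 0, which is small next to the residual standard error of 16.55 years.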

The visualization above illustrates the relationship between the hour of the accident and the victim's age. Each dot represents a single collision. The fitted line, a simple linear fit of age on hour, slopes slightly downward, meaning that victims of accidents later in the day tend to be slightly younger.

Visualization 1

car_clean %>% 
  filter(victim_sex %in% c("M","F")) %>% 
  ggplot(aes(x = area_name, fill = victim_sex)) + 
  geom_bar()+
  coord_flip()+
  labs(
    title = "Traffic Collisions in L.A by Area (2010 - 2026)",
    x = "Region",
    y = "Accidents",
    fill = "Sex"
  )+
  theme_dark()

# Created a simple bar graph and used coord_flip() so the area names are easier to read. Limited the chart to two sex categories to reduce clutter.

The visualization above displays the number of traffic collisions in each LAPD area of Los Angeles. The fill color separates the accidents involving men and women over the 10 years the LAPD reported. It shows that men have a higher number of accidents in every area. The areas with the highest number of accidents are 77th Street, Southwest, and Wilshire. These areas contain heavily trafficked roads, which may help explain their high number of accidents.
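
The ranking read off the bar chart can also be checked as a table. The sketch below shows one way to do that with dplyr, using a handful of made-up rows. The data frame `toy` and its values are hypothetical; with the real data, `car_clean` would take its place.

```r
library(dplyr)

# Hypothetical rows for illustration only (not from the LAPD data)
toy <- data.frame(
  area_name  = c("77th Street", "77th Street", "77th Street",
                 "Southwest", "Southwest", "Wilshire"),
  victim_sex = c("M", "F", "M", "M", "X", "F"),
  stringsAsFactors = FALSE
)

area_counts <- toy %>%
  filter(victim_sex %in% c("M", "F")) %>%  # same filter as the bar chart
  count(area_name, victim_sex) %>%         # collisions per area and sex
  group_by(area_name) %>%
  mutate(area_total = sum(n)) %>%          # total per area across both sexes
  ungroup() %>%
  arrange(desc(area_total), area_name, victim_sex)

area_counts
```

Sorting by `area_total` puts the busiest areas first, and the `n` column shows how each area's total splits between the two sexes.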

Visualization 2

https://public.tableau.com/views/FinalProjectVisualization_17784712003150/Sheet1?:language=en-US&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_link

The linked Tableau visualization shows traffic accidents across Los Angeles police areas. Each point represents a geographic reporting area in Los Angeles, and the size of each point reflects the number of collisions. As noted before, 77th Street and Southwest reported high numbers of traffic collisions over the 10+ years. The visualization is interactive, so it gives specific detail on the count of accidents. Filters on victim sex were added to allow a more precise analysis.

Conclusion

The visualizations displayed patterns of traffic accidents in Los Angeles. The first showed a weak negative relationship between the hour of the accident and victim age, with younger individuals slightly more likely to be involved in accidents later in the day. The other two visualizations gave detailed information on the areas where the problem is most common and where the police department may need to improve its safety and surveillance protocols. Although the visualizations worked well, I tried to show more contrast between the number of accidents in each region using the size marks in Tableau and could not find a way to do so. I had also hoped to create a graph showing the specific kinds of places where accidents occurred, for example streets, highways, and parking lots. Overall, although the project did not go as expected, it turned out well, and I learned the importance of adapting along the way during data analysis.