Final Project Part 1

DSA406_001_SP25_FP1_ryalsaid

Author

Rommie Alsaidi

Published

February 18, 2025

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

1. Import dataset using read_csv

rad <- read_csv("data/Road Accident Data.csv") 
Rows: 307973 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (16): Accident_Index, Accident Date, Month, Day_of_Week, Junction_Contr...
dbl   (6): Year, Latitude, Longitude, Number_of_Casualties, Number_of_Vehicl...
time  (1): Time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
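The message above suggests specifying column types at import. A minimal sketch of what that could look like follows. It reads a hypothetical three-row sample passed as literal text rather than the real file, and the choice of factor and integer types for these columns is an assumption, not something the original import does.

```r
library(readr)

# Hypothetical three-row sample standing in for "data/Road Accident Data.csv"
sample_csv <- I(paste(
  "Accident_Index,Accident_Severity,Speed_limit",
  "200901BS70001,Serious,30",
  "200901BS70002,Serious,30",
  "200901BS70003,Slight,30",
  sep = "\n"
))

# Supplying col_types sets each column's class and silences the
# column-specification message printed by read_csv()
rad_sample <- read_csv(
  sample_csv,
  col_types = cols(
    Accident_Index    = col_character(),
    Accident_Severity = col_factor(),   # levels taken in order of appearance
    Speed_limit       = col_integer()
  )
)

str(rad_sample)
```

For the full file, the same `col_types` argument could be passed alongside the file path, with `.default = col_guess()` left in place for the columns not listed.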
2. Inspect dataset
head(rad) # returns the first six rows to preview columns and values
# A tibble: 6 × 23
  Accident_Index `Accident Date` Month Day_of_Week  Year Junction_Control       
  <chr>          <chr>           <chr> <chr>       <dbl> <chr>                  
1 200901BS70001  1/1/2021        Jan   Thursday     2021 Give way or uncontroll…
2 200901BS70002  1/5/2021        Jan   Monday       2021 Give way or uncontroll…
3 200901BS70003  1/4/2021        Jan   Sunday       2021 Give way or uncontroll…
4 200901BS70004  1/5/2021        Jan   Monday       2021 Auto traffic signal    
5 200901BS70005  1/6/2021        Jan   Tuesday      2021 Auto traffic signal    
6 200901BS70006  1/1/2021        Jan   Thursday     2021 Give way or uncontroll…
# ℹ 17 more variables: Junction_Detail <chr>, Accident_Severity <chr>,
#   Latitude <dbl>, Light_Conditions <chr>, `Local_Authority_(District)` <chr>,
#   Carriageway_Hazards <chr>, Longitude <dbl>, Number_of_Casualties <dbl>,
#   Number_of_Vehicles <dbl>, Police_Force <chr>,
#   Road_Surface_Conditions <chr>, Road_Type <chr>, Speed_limit <dbl>,
#   Time <time>, Urban_or_Rural_Area <chr>, Weather_Conditions <chr>,
#   Vehicle_Type <chr>
str(rad) # returns the structure of the dataset, including variable names and types
spc_tbl_ [307,973 × 23] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Accident_Index            : chr [1:307973] "200901BS70001" "200901BS70002" "200901BS70003" "200901BS70004" ...
 $ Accident Date             : chr [1:307973] "1/1/2021" "1/5/2021" "1/4/2021" "1/5/2021" ...
 $ Month                     : chr [1:307973] "Jan" "Jan" "Jan" "Jan" ...
 $ Day_of_Week               : chr [1:307973] "Thursday" "Monday" "Sunday" "Monday" ...
 $ Year                      : num [1:307973] 2021 2021 2021 2021 2021 ...
 $ Junction_Control          : chr [1:307973] "Give way or uncontrolled" "Give way or uncontrolled" "Give way or uncontrolled" "Auto traffic signal" ...
 $ Junction_Detail           : chr [1:307973] "T or staggered junction" "Crossroads" "T or staggered junction" "T or staggered junction" ...
 $ Accident_Severity         : chr [1:307973] "Serious" "Serious" "Slight" "Serious" ...
 $ Latitude                  : num [1:307973] 51.5 51.5 51.5 51.5 51.5 ...
 $ Light_Conditions          : chr [1:307973] "Daylight" "Daylight" "Daylight" "Daylight" ...
 $ Local_Authority_(District): chr [1:307973] "Kensington and Chelsea" "Kensington and Chelsea" "Kensington and Chelsea" "Kensington and Chelsea" ...
 $ Carriageway_Hazards       : chr [1:307973] "None" "None" "None" "None" ...
 $ Longitude                 : num [1:307973] -0.201 -0.199 -0.18 -0.203 -0.173 ...
 $ Number_of_Casualties      : num [1:307973] 1 11 1 1 1 3 1 1 2 1 ...
 $ Number_of_Vehicles        : num [1:307973] 2 2 2 2 2 2 2 1 1 1 ...
 $ Police_Force              : chr [1:307973] "Metropolitan Police" "Metropolitan Police" "Metropolitan Police" "Metropolitan Police" ...
 $ Road_Surface_Conditions   : chr [1:307973] "Dry" "Wet or damp" "Dry" "Frost or ice" ...
 $ Road_Type                 : chr [1:307973] "One way street" "Single carriageway" "Single carriageway" "Single carriageway" ...
 $ Speed_limit               : num [1:307973] 30 30 30 30 30 30 30 30 30 30 ...
 $ Time                      : 'hms' num [1:307973] 15:11:00 10:59:00 14:19:00 08:10:00 ...
  ..- attr(*, "units")= chr "secs"
 $ Urban_or_Rural_Area       : chr [1:307973] "Urban" "Urban" "Urban" "Urban" ...
 $ Weather_Conditions        : chr [1:307973] "Fine no high winds" "Fine no high winds" "Fine no high winds" "Other" ...
 $ Vehicle_Type              : chr [1:307973] "Car" "Taxi/Private hire car" "Taxi/Private hire car" "Motorcycle over 500cc" ...
 - attr(*, "spec")=
  .. cols(
  ..   Accident_Index = col_character(),
  ..   `Accident Date` = col_character(),
  ..   Month = col_character(),
  ..   Day_of_Week = col_character(),
  ..   Year = col_double(),
  ..   Junction_Control = col_character(),
  ..   Junction_Detail = col_character(),
  ..   Accident_Severity = col_character(),
  ..   Latitude = col_double(),
  ..   Light_Conditions = col_character(),
  ..   `Local_Authority_(District)` = col_character(),
  ..   Carriageway_Hazards = col_character(),
  ..   Longitude = col_double(),
  ..   Number_of_Casualties = col_double(),
  ..   Number_of_Vehicles = col_double(),
  ..   Police_Force = col_character(),
  ..   Road_Surface_Conditions = col_character(),
  ..   Road_Type = col_character(),
  ..   Speed_limit = col_double(),
  ..   Time = col_time(format = ""),
  ..   Urban_or_Rural_Area = col_character(),
  ..   Weather_Conditions = col_character(),
  ..   Vehicle_Type = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 
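head() and str() show names and types but not how many values are missing. One possible extra inspection step is sketched below on a hypothetical toy tibble (the values are made up for illustration); applied to rad, the same pipeline would give one row per column.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with two missing values planted on purpose
toy <- tibble(
  Accident_Severity = c("Serious", "Slight", NA),
  Speed_limit       = c(30, NA, 30),
  Road_Type         = c("One way street", "Single carriageway", "Single carriageway")
)

# Count NA values per column, then reshape to one row per column
na_counts <- toy |>
  summarise(across(everything(), ~ sum(is.na(.x)))) |>
  pivot_longer(everything(), names_to = "column", values_to = "n_missing") |>
  arrange(desc(n_missing))

na_counts
```

Note that a text value such as "None" in Carriageway_Hazards is an ordinary string, so it would not be counted as missing by this check.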

Dataset Description

Our dataset has 307,973 rows and 23 columns. Most columns are character (16); six are numeric (double) and one, Time, is a time column. Accident_Index serves as a unique identifier. Its values are quite long, but with over 300,000 rows it is probably better to keep it than to build our own key. The dataset also appears to come from the UK: although we did not verify this, the term “carriageway” and the mention of “Kensington and Chelsea” both point there.
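Whether Accident_Index really is unique can be checked before relying on it as a key. A minimal sketch, using a hypothetical toy tibble in which one index is repeated on purpose:

```r
library(dplyr)

# Hypothetical toy data; the third row repeats an index deliberately
toy <- tibble(
  Accident_Index       = c("200901BS70001", "200901BS70002", "200901BS70002"),
  Number_of_Casualties = c(1, 11, 1)
)

# TRUE only if every Accident_Index value appears exactly once
index_is_unique <- n_distinct(toy$Accident_Index) == nrow(toy)

# List any repeated index values together with their row counts
duplicated_ids <- toy |>
  count(Accident_Index, name = "n_rows") |>
  filter(n_rows > 1)

index_is_unique
duplicated_ids
```

Running the same two steps on rad would show whether a separate key is needed after all.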

Source of the data

Kaggle: https://www.kaggle.com/datasets/xavierberge/road-accident-dataset?select=Road+Accident+Data.csv

What is the dataset about?

The dataset contains road accident data classified by type of injury, type of vehicle involved, type of road, type of geographical area, time of day, and road conditions.

What are your motivations for exploring this dataset?

This subject was not my first choice, but I have not been able to find a dataset that fits what I originally wanted to do. I have, however, been in a car accident before, and it shook me up for a while. I won’t say it was hard to drive afterwards, but it definitely sits in the back of my mind when I do. Understanding the common causes of car accidents could help ease my mind.

What questions do you want to answer? (broad)

What attributes tend to be most associated with severe car accidents?

Hypothesis

Severe car accidents are more common at junctions in rainy weather than in any other situation.
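One way this hypothesis could be examined is to compare the share of severe accidents across combinations of rain and junction. The sketch below runs on a hypothetical toy tibble. The labels "Raining no high winds" and "Not at junction or within 20 metres" are assumed category values that would need checking against the real data, and treating every severity other than "Slight" as severe is also an assumption.

```r
library(dplyr)
library(stringr)

# Hypothetical toy rows; the rain and non-junction labels are assumptions
toy <- tibble(
  Accident_Severity  = c("Serious", "Slight", "Slight",
                         "Serious", "Slight", "Slight"),
  Weather_Conditions = c("Raining no high winds", "Raining no high winds",
                         "Fine no high winds", "Fine no high winds",
                         "Fine no high winds", "Raining no high winds"),
  Junction_Detail    = c("Crossroads", "T or staggered junction", "Crossroads",
                         "Not at junction or within 20 metres",
                         "Not at junction or within 20 metres",
                         "Not at junction or within 20 metres")
)

severity_by_group <- toy |>
  mutate(
    # Flag rows whose weather label mentions rain
    rainy = str_detect(Weather_Conditions, regex("rain", ignore_case = TRUE)),
    # Flag rows that are not labelled as away from a junction
    at_junction = !str_detect(Junction_Detail,
                              regex("^not at junction", ignore_case = TRUE)),
    # Assumption: anything other than "Slight" counts as severe
    severe = Accident_Severity != "Slight"
  ) |>
  group_by(rainy, at_junction) |>
  summarise(
    n_accidents  = n(),
    share_severe = mean(severe),
    .groups = "drop"
  )

severity_by_group
```

Reporting both the count and the share matters here: "more common" could refer to either, and a group with many accidents can still have a low proportion of severe ones.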

Biases

One bias is that I have been in a car accident myself, so I may lean towards explanations that match my own situation. I also bring prior driving experience to the analysis: I already believe that rain and intersections create volatile driving situations.

Data Dictionary

| Variable Name | Class | Continuity | Description | Suggested R Functions |
|---|---|---|---|---|
| Accident_Index | Character | Discrete | Unique identifier for each accident | colnames(), unique() |
| `Accident Date` | Character | Discrete | Date when the accident occurred | as.Date(), summary() |
| Month | Factor | Discrete | Month in which the accident occurred (e.g., Jan, Feb) | table(), levels() |
| Day_of_Week | Factor | Discrete | Day of the week when the accident occurred | table(), barplot(table()) |
| Year | Integer | Discrete | Year when the accident occurred | unique(), hist() |
| Junction_Control | Factor | Discrete | Type of junction control at the accident location | table(), prop.table(table()) |
| Junction_Detail | Factor | Discrete | Description of the junction where the accident occurred | table(), unique() |
| Accident_Severity | Factor | Discrete | Severity of the accident (e.g., Slight, Serious) | table(), barplot(table()) |
| Latitude | Numeric | Continuous | Latitude coordinate of the accident location | summary(), range() |
| Longitude | Numeric | Continuous | Longitude coordinate of the accident location | summary(), range() |
| Light_Conditions | Factor | Discrete | Light conditions during the accident (e.g., Daylight, Darkness) | table(), pie(table()) |
| `Local_Authority_(District)` | Factor | Discrete | Local authority district where the accident occurred | table(), length(unique()) |
| Carriageway_Hazards | Factor | Discrete | Any hazards present on the carriageway | table(), which.max(table()) |
| Number_of_Casualties | Integer | Discrete | Number of casualties in the accident | summary(), boxplot() |
| Number_of_Vehicles | Integer | Discrete | Number of vehicles involved in the accident | summary(), hist() |
| Police_Force | Factor | Discrete | Police force responsible for the accident location | table(), sort(table()) |
| Road_Surface_Conditions | Factor | Discrete | Surface condition of the road at the accident location | table(), prop.table(table()) |
| Road_Type | Factor | Discrete | Type of road where the accident occurred | table(), summary() |
| Speed_limit | Integer | Discrete | Speed limit on the road where the accident occurred | summary(), hist() |
| Time | Character | Discrete | Time when the accident occurred (HH:MM) | substr(), strptime(Time, format="%H:%M") |
| Urban_or_Rural_Area | Factor | Discrete | Indicates if the accident occurred in an urban or rural area | table(), barplot(table()) |
| Weather_Conditions | Factor | Discrete | Weather conditions during the accident (e.g., Fine, Rain, Fog) | table(), prop.table(table()) |
| Vehicle_Type | Factor | Discrete | Type of vehicle involved in the accident | table(), unique() |
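The dictionary lists several columns as Factor or Integer, while the str() output shows that read_csv imported them as character and double (and Time as an hms time). A minimal sketch of converting a few columns to the dictionary's classes follows, on a hypothetical toy tibble built from the first rows shown by head(). The month/day/year date order is an assumption, based on every sample date falling in the rows labelled Jan.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data mirroring a few columns of rad
toy <- tibble(
  `Accident Date`      = c("1/1/2021", "1/5/2021", "1/4/2021"),
  Month                = c("Jan", "Jan", "Jan"),
  Accident_Severity    = c("Serious", "Serious", "Slight"),
  Number_of_Casualties = c(1, 11, 1)
)

toy_typed <- toy |>
  mutate(
    # Assumption: dates are written month/day/year
    accident_date        = mdy(`Accident Date`),
    # month.abb is base R's "Jan".."Dec", which keeps months in calendar order
    Month                = factor(Month, levels = month.abb),
    Accident_Severity    = factor(Accident_Severity),
    Number_of_Casualties = as.integer(Number_of_Casualties)
  )

str(toy_typed)
```

After a conversion like this, functions from the dictionary such as levels() return the category labels instead of NULL.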