library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.4.3

## Warning: package 'readr' was built under R version 4.4.3

## Warning: package 'lubridate' was built under R version 4.4.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(lubridate)
library(readr)
library(dplyr)
library(ggplot2)
library(corrplot)

## Warning: package 'corrplot' was built under R version 4.4.3

## corrplot 0.95 loaded

road_accident_dataset <- read_csv("C:/Users/MANISH/OneDrive/Desktop/CA 3 Data science/road_accident_dataset.csv")

## Rows: 132000 Columns: 30
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (14): Country, Month, Day of Week, Time of Day, Urban/Rural, Road Type, ...
## dbl (16): Year, Visibility Level, Number of Vehicles Involved, Speed Limit, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

View(road_accident_dataset)

Understanding the dataset

What are the column names and data types in the dataset?

str(road_accident_dataset)

## spc_tbl_ [132,000 × 30] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Country                    : chr [1:132000] "USA" "UK" "USA" "UK" ...
##  $ Year                       : num [1:132000] 2002 2014 2012 2017 2002 ...
##  $ Month                      : chr [1:132000] "October" "December" "July" "May" ...
##  $ Day of Week                : chr [1:132000] "Tuesday" "Saturday" "Sunday" "Saturday" ...
##  $ Time of Day                : chr [1:132000] "Evening" "Evening" "Afternoon" "Evening" ...
##  $ Urban/Rural                : chr [1:132000] "Rural" "Urban" "Urban" "Urban" ...
##  $ Road Type                  : chr [1:132000] "Street" "Street" "Highway" "Main Road" ...
##  $ Weather Conditions         : chr [1:132000] "Windy" "Windy" "Snowy" "Clear" ...
##  $ Visibility Level           : num [1:132000] 220 168 341 489 348 ...
##  $ Number of Vehicles Involved: num [1:132000] 1 3 4 2 1 2 3 3 3 3 ...
##  $ Speed Limit                : num [1:132000] 37 96 62 78 98 30 92 61 106 74 ...
##  $ Driver Age Group           : chr [1:132000] "18-25" "18-25" "41-60" "18-25" ...
##  $ Driver Gender              : chr [1:132000] "Male" "Female" "Male" "Male" ...
##  $ Driver Alcohol Level       : num [1:132000] 0.0519 0.2349 0.1424 0.1208 0.1558 ...
##  $ Driver Fatigue             : num [1:132000] 0 1 0 1 1 1 0 0 1 1 ...
##  $ Vehicle Condition          : chr [1:132000] "Poor" "Poor" "Moderate" "Good" ...
##  $ Pedestrians Involved       : num [1:132000] 1 1 0 2 0 2 2 1 1 1 ...
##  $ Cyclists Involved          : num [1:132000] 2 1 0 0 1 2 1 1 2 0 ...
##  $ Accident Severity          : chr [1:132000] "Moderate" "Minor" "Moderate" "Minor" ...
##  $ Number of Injuries         : num [1:132000] 8 6 13 6 13 10 10 6 4 2 ...
##  $ Number of Fatalities       : num [1:132000] 2 1 4 3 4 4 3 2 2 3 ...
##  $ Emergency Response Time    : num [1:132000] 58.6 58 42.4 48.6 18.3 ...
##  $ Traffic Volume             : num [1:132000] 7413 4459 9857 4959 3843 ...
##  $ Road Condition             : chr [1:132000] "Wet" "Snow-covered" "Wet" "Icy" ...
##  $ Accident Cause             : chr [1:132000] "Weather" "Mechanical Failure" "Speeding" "Distracted Driving" ...
##  $ Insurance Claims           : num [1:132000] 4 3 4 3 8 7 9 8 5 0 ...
##  $ Medical Cost               : num [1:132000] 40500 6487 29164 25797 15605 ...
##  $ Economic Loss              : num [1:132000] 22073 9534 58009 20907 13584 ...
##  $ Region                     : chr [1:132000] "Europe" "North America" "South America" "Australia" ...
##  $ Population Density         : num [1:132000] 3866 2334 4409 2811 3884 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Country = col_character(),
##   ..   Year = col_double(),
##   ..   Month = col_character(),
##   ..   `Day of Week` = col_character(),
##   ..   `Time of Day` = col_character(),
##   ..   `Urban/Rural` = col_character(),
##   ..   `Road Type` = col_character(),
##   ..   `Weather Conditions` = col_character(),
##   ..   `Visibility Level` = col_double(),
##   ..   `Number of Vehicles Involved` = col_double(),
##   ..   `Speed Limit` = col_double(),
##   ..   `Driver Age Group` = col_character(),
##   ..   `Driver Gender` = col_character(),
##   ..   `Driver Alcohol Level` = col_double(),
##   ..   `Driver Fatigue` = col_double(),
##   ..   `Vehicle Condition` = col_character(),
##   ..   `Pedestrians Involved` = col_double(),
##   ..   `Cyclists Involved` = col_double(),
##   ..   `Accident Severity` = col_character(),
##   ..   `Number of Injuries` = col_double(),
##   ..   `Number of Fatalities` = col_double(),
##   ..   `Emergency Response Time` = col_double(),
##   ..   `Traffic Volume` = col_double(),
##   ..   `Road Condition` = col_character(),
##   ..   `Accident Cause` = col_character(),
##   ..   `Insurance Claims` = col_double(),
##   ..   `Medical Cost` = col_double(),
##   ..   `Economic Loss` = col_double(),
##   ..   Region = col_character(),
##   ..   `Population Density` = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Interpretation: The dataset contains 14 categorical (character) and 16 numeric variables describing accident characteristics such as severity, time of day, driver demographics, environmental conditions, and vehicle condition. This mix supports statistical summaries, trend detection, and correlation studies across several dimensions of road safety.

# Select only numeric columns
numeric_data <- road_accident_dataset %>% select(where(is.numeric))

# Define a function to find outliers based on IQR
find_outliers <- function(x) {
  Q1 <- quantile(x, 0.25, na.rm = TRUE)
  Q3 <- quantile(x, 0.75, na.rm = TRUE)
  IQR_value <- Q3 - Q1
  lower_bound <- Q1 - 1.5 * IQR_value
  upper_bound <- Q3 + 1.5 * IQR_value
  outlier_indices <- which(x < lower_bound | x > upper_bound)
  return(outlier_indices)
}

# Apply function to each numeric column
outliers_list <- lapply(numeric_data, find_outliers)

# Print the number of outliers for each column
outlier_summary <- sapply(outliers_list, length)
print(outlier_summary)

##                        Year            Visibility Level 
##                           0                           0 
## Number of Vehicles Involved                 Speed Limit 
##                           0                           0 
##        Driver Alcohol Level              Driver Fatigue 
##                           0                           0 
##        Pedestrians Involved           Cyclists Involved 
##                           0                           0 
##          Number of Injuries        Number of Fatalities 
##                           0                           0 
##     Emergency Response Time              Traffic Volume 
##                           0                           0 
##            Insurance Claims                Medical Cost 
##                           0                           0 
##               Economic Loss          Population Density 
##                           0                           0

Interpretation:

An outlier analysis was run on all 16 numeric columns using the Interquartile Range (IQR) method, which flags values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR. No column produced a single flagged value, including traffic volume, number of injuries, fatalities, speed limit, and economic loss, so no extreme-value treatment is needed before exploratory analysis or modeling. Note that zero flagged values in every column of a 132,000-row dataset would be unusual for real-world accident records and is what evenly spread (for example, uniformly generated) values produce, so this result should not be read as proof that the data are otherwise clean.
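To illustrate how the IQR rule can return zero outliers, the sketch below applies the same rule to two small vectors. It restates `find_outliers()` from above so the block runs on its own; the vectors `evenly_spread` and `with_extreme` are made up for illustration and are not taken from the dataset.

```r
# Same IQR rule as find_outliers() above, restated so this block is self-contained
find_outliers <- function(x) {
  q1 <- quantile(x, 0.25, na.rm = TRUE)
  q3 <- quantile(x, 0.75, na.rm = TRUE)
  iqr_value <- q3 - q1
  which(x < q1 - 1.5 * iqr_value | x > q3 + 1.5 * iqr_value)
}

# Illustrative vectors (made up, not from the dataset)
evenly_spread <- 0:100                    # evenly spaced values
with_extreme  <- c(evenly_spread, 1000)   # same values plus one extreme point

# Number of flagged values in the evenly spaced vector
print(length(find_outliers(evenly_spread)))

# Position(s) flagged once an extreme value is appended
print(find_outliers(with_extreme))
```

For evenly spread values the quartiles sit a quarter of the range inside the minimum and maximum, so the fences land about half a range beyond the data and nothing can be flagged. A count of zero therefore says more about the shape of the distribution than about data quality.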

1. Temporal Analysis (Time-Based Trends)

Q1. What is the total number of accidents in each year?

accidents_per_year <- road_accident_dataset %>%
  group_by(Year) %>%
  summarise(Total_Accidents = n())
print(accidents_per_year)

## # A tibble: 25 × 2
##     Year Total_Accidents
##    <dbl>           <int>
##  1  2000            5280
##  2  2001            5263
##  3  2002            5433
##  4  2003            5327
##  5  2004            5180
##  6  2005            5302
##  7  2006            5156
##  8  2007            5307
##  9  2008            5409
## 10  2009            5298
## # ℹ 15 more rows

Visualization

road_accident_dataset %>%
  group_by(Year) %>%
  summarise(Total_Accidents = n()) %>%
  ggplot(aes(x = Year, y = Total_Accidents)) +
  geom_line(color = "blue", linewidth = 1.2) +
  # geom_point() sizes points with `size`; `linewidth` applies to lines only
  geom_point(color = "darkblue", size = 2) +
  labs(title = "Total Accidents Per Year", x = "Year", y = "Number of Accidents") +
  theme_minimal()

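Interpretation:

Fluctuating Trend: Accident counts vary from year to year with no clear increasing or decreasing pattern. In the rows printed above (2000 to 2009) the counts stay between 5,156 and 5,433, close to the overall average of 5,280 per year (132,000 accidents over 25 years). Peak in Recent Years: The highest count occurs in 2020; a single-year peak does not by itself identify a cause such as traffic volume or road conditions. Low Points: Among the printed rows, 2006 (5,156) and 2004 (5,180) have the fewest accidents, while 2005 (5,302) is slightly above average. The counts alone cannot show whether enforcement or vehicle numbers explain these dips.

Conclusion:

Yearly accident counts fluctuate within a narrow band around the average. A peak in 2020 is not compatible with a clear downward trend after 2015, so the data do not by themselves show sustained progress toward safer driving conditions. A fitted trend line would be needed to support such a claim.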

Q2. How does the time of day (morning, afternoon, evening, night) impact accident severity?

time_wise <- road_accident_dataset %>%
  group_by(`Time of Day`) %>%
  summarise(Total_Accidents = n())
print(time_wise)

## # A tibble: 4 × 2
##   `Time of Day` Total_Accidents
##   <chr>                   <int>
## 1 Afternoon               32960
## 2 Evening                 33021
## 3 Morning                 32788
## 4 Night                   33231

INTERPRETATION-

The code above counts accidents per time of day. It does not measure accident severity, so it cannot answer the severity question directly.

The counts are nearly equal across the four periods, ranging from 32,788 (Morning) to 33,231 (Night), a spread of about 1.3%. Night and Evening have slightly more accidents than Afternoon and Morning.

Factors that could raise risk in evening and night hours, none of which are tested here:

● Reduced visibility

● Driver fatigue

● Increased chances of drunk driving

Factors that could lower risk in morning and afternoon hours, likewise untested:

● Better visibility

● Lower driving speeds in congested traffic

● More alertness among drivers
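Since the question concerns severity while the table above only counts accidents, one way to address it is a cross-tabulation of time of day against severity. The sketch below shows the idea on a toy data frame; the rows in `toy` are made up for illustration and only stand in for the real columns.

```r
# Toy rows standing in for road_accident_dataset (values are made up)
toy <- data.frame(
  time_of_day = c("Morning", "Morning", "Evening", "Night", "Night", "Night"),
  severity    = c("Minor", "Severe", "Minor", "Severe", "Moderate", "Severe")
)

# Cross-tabulate time of day (rows) against severity (columns)
severity_table <- table(toy$time_of_day, toy$severity)

# Convert each row to proportions so periods with different totals are comparable
severity_share <- prop.table(severity_table, margin = 1)
print(round(severity_share, 2))
```

On the real data, the same two calls would take the `Time of Day` and `Accident Severity` columns of `road_accident_dataset` as their inputs. Row-wise proportions are used here because the four periods have slightly different totals.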

Q3. Are certain months more accident-prone than others?

# Count how many accidents happened in each month
month_table <- table(road_accident_dataset$Month)
print(month_table)

## 
##     April    August  December  February   January      July      June     March 
##     11063     10791     10909     11064     10952     11000     11122     11072 
##       May  November   October September 
##     11158     10836     10986     11047

Visualization

road_accident_dataset %>%
  group_by(Month) %>%
  summarise(Total_Accidents = n()) %>%
  ggplot(aes(x = reorder(Month, Total_Accidents), y = Total_Accidents)) +
  # Constant fill is set outside aes() so it does not create a one-item legend
  geom_col(fill = "steelblue", width = 1, color = "black") +
  labs(title = "Total Accidents Per Month", x = "Month", y = "Number of Accidents") +
  coord_flip() +
  theme_minimal()

Interpretation:

Peak Months: The highest counts occur in May (11,158), June (11,122), and March (11,072). Lowest Months: August (10,791), November (10,836), and December (10,909) have the fewest accidents. The gap between the highest and lowest month is 367 accidents, about 3% of a typical monthly count, so any seasonal effect is weak. All 12 months appear exactly once in the table, so there is no sign of duplicated month labels.

Conclusion: Accident counts are slightly higher from March to June and slightly lower in August, November, and December, but the differences are small. A seasonal safety campaign would need support from additional evidence beyond these counts.
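The table printed above lists months alphabetically, which makes seasonal patterns hard to read. The sketch below reorders the printed counts into calendar order with the base R constant `month.name` and runs a chi-squared goodness-of-fit test against equal monthly shares. Equal shares are a simplifying assumption that ignores differing month lengths.

```r
# Monthly counts copied from the table printed above (alphabetical order)
month_counts <- c(
  April = 11063, August = 10791, December = 10909, February = 11064,
  January = 10952, July = 11000, June = 11122, March = 11072,
  May = 11158, November = 10836, October = 10986, September = 11047
)

# month.name is a base R constant holding the months in calendar order
month_counts <- month_counts[month.name]
print(month_counts)

# Goodness-of-fit test: are the counts compatible with equal monthly shares?
# (assumption: every month is expected to hold 1/12 of all accidents)
month_test <- chisq.test(month_counts)
print(month_test)
```

In a plot, the same ordering could be obtained by converting `Month` to a factor with `levels = month.name` before calling `ggplot()`.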

Q4. Do weekends have more vehicles involved than weekdays?

# Create a new column to mark Weekend or Weekday 
road_accident_dataset$Weekend <- ifelse(road_accident_dataset$`Day of Week` %in% c("Saturday", "Sunday"), "Weekend", "Weekday") 

# Calculate average number of vehicles involved on weekends vs weekdays 
weekend_comparison <- road_accident_dataset %>%
  filter(!is.na(`Number of Vehicles Involved`)) %>%
  group_by(Weekend) %>%
  summarise(Avg_Vehicles = mean(`Number of Vehicles Involved`))
print(weekend_comparison)

## # A tibble: 2 × 2
##   Weekend Avg_Vehicles
##   <chr>          <dbl>
## 1 Weekday         2.50
## 2 Weekend         2.50

INTERPRETATION-

The average number of vehicles involved per accident is the same on weekdays and weekends (2.50). Because these are per-accident averages, the comparison is not distorted by there being 5 weekdays and only 2 weekend days. Traffic volume may differ between the two groups, but the number of vehicles per accident does not.

———————————————————————————-

2. Geographic & Demographic Analysis (Location & Population-Based Insights)

———————————————————————————-

Q5. What is the distribution of accidents by region and time of day?

regionwise<-road_accident_dataset%>%
  group_by(Region,`Time of Day`)%>%
  summarise(Total_Accidents = n())

## `summarise()` has grouped output by 'Region'. You can override using the
## `.groups` argument.

print(regionwise)

## # A tibble: 20 × 3
## # Groups:   Region [5]
##    Region        `Time of Day` Total_Accidents
##    <chr>         <chr>                   <int>
##  1 Asia          Afternoon                6650
##  2 Asia          Evening                  6546
##  3 Asia          Morning                  6539
##  4 Asia          Night                    6616
##  5 Australia     Afternoon                6709
##  6 Australia     Evening                  6657
##  7 Australia     Morning                  6476
##  8 Australia     Night                    6783
##  9 Europe        Afternoon                6454
## 10 Europe        Evening                  6592
## 11 Europe        Morning                  6703
## 12 Europe        Night                    6596
## 13 North America Afternoon                6628
## 14 North America Evening                  6676
## 15 North America Morning                  6482
## 16 North America Night                    6629
## 17 South America Afternoon                6519
## 18 South America Evening                  6550
## 19 South America Morning                  6588
## 20 South America Night                    6607

Visualization

# creation of table 
accident_table <- table(road_accident_dataset$`Region`, road_accident_dataset$`Time of Day`)
print(accident_table)

##                
##                 Afternoon Evening Morning Night
##   Asia               6650    6546    6539  6616
##   Australia          6709    6657    6476  6783
##   Europe             6454    6592    6703  6596
##   North America      6628    6676    6482  6629
##   South America      6519    6550    6588  6607

# Correlate the time-of-day columns of the table across the five regions
# (only 5 observations per column, so interpret with caution)
accident_correlation <- cor(accident_table)
print(accident_correlation)

##            Afternoon    Evening    Morning      Night
## Afternoon  1.0000000  0.4443213 -0.9243090  0.7318491
## Evening    0.4443213  1.0000000 -0.5392337  0.5638053
## Morning   -0.9243090 -0.5392337  1.0000000 -0.6160331
## Night      0.7318491  0.5638053 -0.6160331  1.0000000

road_accident_dataset %>%
  filter(`Time of Day`== "Night") %>%
  group_by(Region) %>%
  summarise(Night_Accidents = n()) %>%
  ggplot(aes(x = Region, y = Night_Accidents, fill = Region)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = "Night Accidents by Region", x = "Region", y = "Number of Accidents") +
  theme_minimal()

Interpretation:

Australia has the highest number of night-time accidents (6,783).

North America (6,629), Asia (6,616), and South America (6,607) show similar counts, slightly lower than Australia.

Europe has the lowest number of night accidents (6,596).

The differences between regions are small (under 3% between the highest and lowest count), indicating that night accidents are about equally common in every region in this dataset.

Conclusion:

Night-time accidents occur at nearly the same frequency in all regions, with Australia slightly ahead. Night-time driving safety measures are therefore relevant in every region rather than in one in particular.
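Whether the time-of-day pattern differs between regions can be checked with a chi-squared test of independence on the region by time-of-day table. The sketch below rebuilds that table from the counts printed above; the object names `accident_counts`, `independence_test`, and `night_share` are introduced here for illustration.

```r
# Region x time-of-day counts copied from the table printed above
accident_counts <- matrix(
  c(6650, 6546, 6539, 6616,
    6709, 6657, 6476, 6783,
    6454, 6592, 6703, 6596,
    6628, 6676, 6482, 6629,
    6519, 6550, 6588, 6607),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    Region = c("Asia", "Australia", "Europe", "North America", "South America"),
    TimeOfDay = c("Afternoon", "Evening", "Morning", "Night")
  )
)

# Test whether time of day is independent of region
independence_test <- chisq.test(accident_counts)
print(independence_test)

# Share of each region's accidents that happen at night
night_share <- accident_counts[, "Night"] / rowSums(accident_counts)
print(round(night_share, 3))
```

Comparing night shares rather than raw night counts removes the effect of regions having slightly different totals.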

Q6. What is the correlation between population density and accident frequency?

accident_density <- road_accident_dataset %>%
  group_by(Region) %>%
  summarise(
    Total_Accidents = n(),
    Avg_Population_Density = mean(`Population Density`, na.rm = TRUE)
  )
print(accident_density)

## # A tibble: 5 × 3
##   Region        Total_Accidents Avg_Population_Density
##   <chr>                   <int>                  <dbl>
## 1 Asia                    26351                  2523.
## 2 Australia               26625                  2517.
## 3 Europe                  26345                  2501.
## 4 North America           26415                  2497.
## 5 South America           26264                  2495.

INTERPRETATION- According to the data:

● Australia has the highest number of accidents (26,625) despite having a similar population density as other regions.

● South America has the lowest accident count (26,264) and the lowest population density (~2495).

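● All regions show very similar accident counts (26,264 to 26,625) and similar average population densities (about 2,495 to 2,523), so the table offers little variation, and only five data points, from which to estimate a correlation.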
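The question asks for a correlation, which the summary table does not compute. A minimal sketch, using the five regional rows printed above (with densities rounded as displayed), is shown below. The names `region_summary` and `region_corr` are introduced here for illustration.

```r
# Regional summary copied from the table printed above (densities as displayed)
region_summary <- data.frame(
  Region = c("Asia", "Australia", "Europe", "North America", "South America"),
  Total_Accidents = c(26351, 26625, 26345, 26415, 26264),
  Avg_Population_Density = c(2523, 2517, 2501, 2497, 2495)
)

# Pearson correlation between regional accident count and average density
region_corr <- cor(region_summary$Total_Accidents,
                   region_summary$Avg_Population_Density)
print(region_corr)

# cor.test() adds a confidence interval; with only 5 points it will be wide
region_test <- cor.test(region_summary$Total_Accidents,
                        region_summary$Avg_Population_Density)
print(region_test$conf.int)
```

With five observations, any coefficient obtained this way is only a rough description of the table and should not be read as evidence of a relationship.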

Q7. Urban vs Rural - More vehicles involved?

# Compare average number of vehicles involved in Urban vs Rural areas
urban_rural <-road_accident_dataset %>% filter(`Urban/Rural` %in% c("Urban", "Rural")) %>% group_by(`Urban/Rural`) %>% 
  summarise(Avg_Vehicles = mean(as.numeric(`Number of Vehicles Involved`),na.rm = TRUE))
print(urban_rural)

## # A tibble: 2 × 2
##   `Urban/Rural` Avg_Vehicles
##   <chr>                <dbl>
## 1 Rural                 2.50
## 2 Urban                 2.50

Visualization

road_accident_dataset%>%
  group_by(`Urban/Rural`) %>%
  summarise(Average_Injuries = mean(`Number of Injuries`, na.rm = TRUE)) %>%
  ggplot(aes(x = `Urban/Rural`, y = Average_Injuries, fill =`Urban/Rural`)) +
  geom_col() +
  labs(title = "Average Injuries: Urban vs Rural", 
       x = "Area Type", 
       y = "Average Injuries") +
  theme_minimal()

Interpretation:

The average number of vehicles involved is the same in Rural and Urban areas (2.50), so area type makes no visible difference to vehicle count.

The plot compares average injuries instead: the Rural average is marginally higher than the Urban average.

The two averages are very close, so the plot gives little evidence that accidents are more severe in one area type.

Higher speeds or delayed medical help in rural areas, and lower speed limits or quicker emergency response in urban areas, are possible explanations for the small gap, but none of them is tested here.

Conclusion:

Accidents in rural and urban areas involve the same number of vehicles on average, and rural accidents result in only marginally more injuries. Whether road type, emergency services, or speed explain that gap would need a separate analysis.

Q8. Do accidents in urban areas involve more fatalities and injuries on average than those in rural areas?

# Compare average fatalities and injuries in Urban vs Rural areas
urban_rural_severity <- road_accident_dataset %>% filter(`Urban/Rural` %in% c("Urban", "Rural")) %>% group_by(`Urban/Rural`) %>% summarise( 
    Avg_Fatalities = mean(as.numeric(`Number of Fatalities`), na.rm = TRUE), 
    Avg_Injuries = mean(as.numeric(`Number of Injuries`), na.rm = TRUE)) 
print(urban_rural_severity)

## # A tibble: 2 × 3
##   `Urban/Rural` Avg_Fatalities Avg_Injuries
##   <chr>                  <dbl>        <dbl>
## 1 Rural                   1.99         9.53
## 2 Urban                   2.00         9.49

INTERPRETATION-

● The average fatalities in urban areas (2.00) are almost the same as in rural areas (1.99).

● The average number of injuries is slightly higher in rural areas (9.53) compared to urban areas (9.49).

Possible Reasons (not tested here; the gap of 0.04 injuries per accident is very small):

● Emergency response time might be slower in rural areas, contributing to higher injuries.

● Urban areas have more traffic regulation and faster medical access, potentially reducing injury severity.

● Rural roads may lack proper infrastructure (e.g., lighting, signage), leading to slightly worse outcomes even if fatalities remain similar.

—————————————————-

3. Driver & Vehicle-Related Factors

—————————————————-

Q9. Which driver age group is involved in the most accidents?

Age_wise<-road_accident_dataset %>% group_by(`Driver Age Group`) %>%
  summarise(Total_Accidents = n()) %>%arrange(desc(Total_Accidents)) 
print(Age_wise)

## # A tibble: 5 × 2
##   `Driver Age Group` Total_Accidents
##   <chr>                        <int>
## 1 <18                          26524
## 2 18-25                        26500
## 3 26-40                        26492
## 4 61+                          26309
## 5 41-60                        26175

Visualization

road_accident_dataset %>%
  group_by(`Driver Age Group`) %>%
  summarise(Accidents = n()) %>%
  ggplot(aes(x = reorder(`Driver Age Group`, Accidents), y = Accidents)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(title = "Accidents by Driver Age Group", x = "Age Group", y = "Accidents") +
  theme_minimal()

Interpretation:

Accidents appear fairly evenly distributed across all driver age groups.

There is no major spike or sharp drop for any particular age category.

Younger drivers (<18 and 18–25) and middle-aged drivers (26–40) have nearly identical accident counts.

Even senior drivers (61+) have comparable accident numbers to younger groups, which is a bit surprising.

Conclusion:

The number of accidents is consistent across all age groups, showing that age alone may not be a strong factor in predicting accident occurrence.

Thus, safety measures and awareness campaigns should target drivers of all ages rather than focusing only on young or old drivers.
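How even the split is can be quantified by converting the counts printed above into percentage shares and measuring the distance from an equal 20% split. The names `age_counts`, `age_share`, and `max_gap` are introduced here for illustration.

```r
# Accident counts per driver age group, copied from the table printed above
age_counts <- c("<18" = 26524, "18-25" = 26500, "26-40" = 26492,
                "61+" = 26309, "41-60" = 26175)

# Percentage share of all accidents for each age group
age_share <- age_counts / sum(age_counts) * 100
print(round(age_share, 2))

# Largest deviation (in percentage points) from an equal split across groups
max_gap <- max(abs(age_share - 100 / length(age_counts)))
print(max_gap)
```

Shares are easier to compare than raw counts here because the counts differ only in the third significant digit.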

Q10. Is there a difference in accident counts between driver gender groups?

Gender_groups<-road_accident_dataset %>%   group_by(`Driver Gender`) %>% 
  summarise(Total_Accidents = n()) 
print(Gender_groups)

## # A tibble: 2 × 2
##   `Driver Gender` Total_Accidents
##   <chr>                     <int>
## 1 Female                    65902
## 2 Male                      66098

Visualization

road_accident_dataset%>%
  group_by(`Driver Gender`) %>%
  summarise(Accidents = n()) %>%
  mutate(Percent = Accidents / sum(Accidents) * 100) %>%
  ggplot(aes(x = "", y = Percent, fill = `Driver Gender`)) +
  geom_col(width = 1, color = "white") +  # Optional: white border
  coord_polar(theta = "y") +
  scale_fill_manual(values = c("Male" = "lightblue", "Female" = "pink")) +
  labs(title = "Accidents by Driver Gender") +
  theme_void()

Interpretation:

The accidents are almost equally split between male and female drivers.

There is a very slight tilt towards male drivers having a few more accidents than female drivers.

Overall, gender does not show a strong bias toward accident involvement based on this visualization.

Conclusion:

Accidents are almost equally distributed between male and female drivers, suggesting that gender is not a major influencing factor in accident occurrence.

Thus, road safety initiatives should be designed inclusively for both genders rather than focusing on one.

Q11. Does Alcohol level affect the number of fatalities?

# Check if alcohol levels relate to number of fatalities
alcohol_correlation <- road_accident_dataset %>%
  summarise(Correlation = cor(as.numeric(`Driver Alcohol Level`), 
                             as.numeric(`Number of Fatalities`)))
print(alcohol_correlation)

## # A tibble: 1 × 1
##   Correlation
##         <dbl>
## 1     0.00211

INTERPRETATION-

The correlation is nearly zero (0.002), indicating no meaningful linear relationship between alcohol level and the number of fatalities. Possible reasons:

● Missing entries are not the cause: cor() was called without a `use` argument and would have returned NA if either column contained missing values. Measurement or data-generation quality could still play a role.

● Fatalities depend on multiple factors (speed, use of seatbelts, crash type), not just alcohol.

● Some drivers with high alcohol levels may be involved in minor accidents, while sober drivers could be in high-impact collisions.

● The variable may not be detailed enough: BAC (blood alcohol concentration) might need finer categorization or threshold grouping for meaningful insights.

Q12. Is there a relationship between driver fatigue and the number of vehicles involved in road accidents?

# Check if fatigue is related to number of vehicles involved

fatigue_analysis <- road_accident_dataset %>% 
  filter(!is.na(`Driver Fatigue`),
         !is.na(`Number of Vehicles Involved`)) %>%
  group_by(`Driver Fatigue`) %>% 
  summarise(Avg_Vehicles = mean(as.numeric(`Number of Vehicles Involved`)))
print(fatigue_analysis)

## # A tibble: 2 × 2
##   `Driver Fatigue` Avg_Vehicles
##              <dbl>        <dbl>
## 1                0         2.50
## 2                1         2.50

INTERPRETATION-

The grouped means show no difference between fatigued and non-fatigued drivers: Driver Fatigue is coded 0 (No) or 1 (Yes), and the average number of vehicles involved is 2.50 in both groups.

In other words, whether or not the driver was fatigued, the number of vehicles involved in the accident does not change noticeably.

No regression was fitted here. A regression of vehicle count on fatigue would have a slope equal to the difference between the two group means, which is approximately zero.

Q13. What is the distribution of vehicle conditions involved in road accidents, and which condition is most commonly observed?

# Count the number of accidents for each type of vehicle condition 
vehicle_condition <- table(road_accident_dataset$`Vehicle Condition`)
print(vehicle_condition)

## 
##     Good Moderate     Poor 
##    44094    43913    43993

Visualization

road_accident_dataset%>%
  group_by(`Vehicle Condition`) %>%
  summarise(Accidents = n()) %>%
  ggplot(aes(x = "", y = Accidents, fill = `Vehicle Condition`)) +
  geom_col(width = 1, color = "black") +  # black border between slices
  coord_polar(theta = "y") +
  labs(title = "Vehicle Condition in Accidents") +
  scale_fill_manual(values = c("Good" = "lightblue", "Moderate" = "blue" , "Poor"="darkblue"))+
  theme_void()

Interpretation:

Accidents are almost evenly distributed among vehicles in good (44,094), moderate (43,913), and poor (43,993) condition.

Vehicles in good condition are the most common, not vehicles in poor condition, although the margin is very small.

The largest difference between categories is 181 accidents, about 0.4% of a category's count.

Conclusion:

Vehicle condition shows no meaningful association with accident counts in this dataset.

Vehicles in poor condition are not over-represented, so these counts alone do not support a maintenance-focused recommendation.

Since vehicles in every condition are involved about equally, driver behavior and other factors likely play the more important roles.

——————————————

4. Road & Traffic Conditions

——————————————

Q14. How do road types (highway, street, main road) impact accident frequency?

count(road_accident_dataset, `Road Type` , sort = TRUE)

## # A tibble: 3 × 2
##   `Road Type`     n
##   <chr>       <int>
## 1 Main Road   44197
## 2 Highway     43920
## 3 Street      43883

Visualization

road_accident_dataset %>%
  group_by(`Road Type`) %>%
  summarise(Accidents = n()) %>%
  ggplot(aes(x = reorder(`Road Type`, Accidents), y = Accidents, fill = `Road Type`)) +
  geom_bar(stat = "identity") +
  labs(title = "Accidents by Road Type", x = "Road Type", y = "Number of Accidents") +
  theme_minimal()

Interpretation:

Accidents are fairly evenly distributed across the three road types: Main Road (44,197), Highway (43,920), and Street (43,883).

Main roads show a slightly higher number of accidents than the other two types.

Highways and streets have nearly identical counts, differing by only 37 accidents.

Conclusion:

Road type has little influence on accident frequency in this dataset.

Accidents happen on all road types in almost equal numbers, with main roads ahead by less than 1%.

Without exposure data (for example, distance driven on each road type), these counts cannot show which road type is riskier.

Q15. What are the most frequent road conditions during accidents?

Road_condition<-table(road_accident_dataset$`Road Condition`)
print(Road_condition)

## 
##          Dry          Icy Snow-covered          Wet 
##        32855        32779        33010        33356

Visualization

road_accident_dataset %>%
    group_by(`Road Condition`, Year) %>%
    summarise(Accidents = n()) %>%
    ggplot(aes(x = Year, y = Accidents, color = `Road Condition`)) +
    geom_line(linewidth = 1.2) +
    facet_wrap(~ `Road Condition`)+
    labs(title = "Accidents Over Time by Road Condition", x = "Year", y = "Number of Accidents") +
    theme_minimal()

## `summarise()` has grouped output by 'Road Condition'. You can override using
## the `.groups` argument.

INTERPRETATION-

The data shows that accidents occur almost equally across all road conditions, with wet (33,356) and snow-covered (33,010) roads slightly ahead of dry (32,855) and icy (32,779) roads. Hazardous surfaces are not consistently over-represented: icy roads have the fewest accidents, and dry roads are not far behind the leaders. The counts are therefore consistent with driver error and behavior being major factors regardless of road condition, although they do not test that directly.

Q16. How does traffic volume impact the number of vehicles involved in road accidents?

# Check correlation between traffic volume and vehicles involved in accidents 
traffic_corr <- cor(as.numeric(road_accident_dataset$`Traffic Volume`),
                    as.numeric(road_accident_dataset$`Number of Vehicles Involved`))
print(traffic_corr)

## [1] 0.003168233

INTERPRETATION-

The correlation value is 0.003, which is extremely close to zero.

This indicates no significant relationship between traffic volume and the number of vehicles involved in accidents. In other words, higher or lower traffic volumes do not noticeably affect how many vehicles are involved in an accident.

Possible reasons could include:

● High traffic may lead to slower speeds, which reduces the severity and involvement of multiple vehicles.

● Low traffic might result in over-speeding, still leading to multi-vehicle crashes.

● Other factors like road design, driver behavior, or weather conditions may be more influential than traffic volume itself.
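The claim of "no significant relationship" can be checked from the printed coefficient alone, using the standard t statistic for a Pearson correlation. The sketch below assumes all 132,000 rows entered the calculation, which is plausible because `cor()` returned a number rather than NA. The names `t_stat`, `p_value`, and `r_squared` are introduced here for illustration.

```r
# Correlation printed above and the number of rows in the dataset
r <- 0.003168233
n <- 132000   # assumption: no rows were dropped for missing values

# t statistic for testing whether the true correlation is zero
t_stat <- r * sqrt((n - 2) / (1 - r^2))

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

# Share of variance in vehicle count explained by traffic volume
r_squared <- r^2

print(c(t_stat = t_stat, p_value = p_value, r_squared = r_squared))
```

With the full data frame available, `cor.test()` on the two columns would report the same statistic together with a confidence interval.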

Q17. Does the number of vehicles involved correlate with the number of injuries?

# Check if more vehicles in accidents cause more injuries 
injury_corr <- cor(as.numeric(road_accident_dataset$`Number of Vehicles Involved`), 
                   as.numeric(road_accident_dataset$`Number of Injuries`))
print(injury_corr)

## [1] 0.002233653

INTERPRETATION-

● The correlation value is 0.002, which is very close to zero.

● This means there is almost no linear relationship between the number of vehicles involved and the number of injuries.

● In simpler terms, more vehicles in an accident does not necessarily lead to more injuries.

Possible Reasons:

● Passenger load varies: a multi-vehicle crash may involve few passengers, while a single bus accident could involve many.

● Safety measures: proper use of seatbelts and airbags can reduce injuries regardless of how many vehicles are involved.

● Randomness of impact points: some multi-vehicle crashes result in only minor injuries due to how vehicles collide.

——————————————

5. Environmental & External Factors

——————————————

Q18. Does a higher speed limit lead to more fatalities in road accidents?

# Calculate correlation between speed limit and number of fatalities
speed_fatality_corr <- cor(as.numeric(road_accident_dataset$`Speed Limit`),
                           as.numeric(road_accident_dataset$`Number of Fatalities`))
print(speed_fatality_corr)

## [1] 0.0006205779

Visualization

road_accident_dataset %>%
  group_by(`Speed Limit`) %>%
  summarise(Average_Fatalities = mean(`Number of Fatalities`, na.rm = TRUE)) %>%
  ggplot(aes(x = `Speed Limit`, y = Average_Fatalities)) +
  geom_point(color = "darkgreen") +
  geom_smooth(method = "lm", se = TRUE, color = "black") +
  labs(title = "Average Fatalities vs Speed Limit", x = "Speed Limit", y = "Average Number of Fatalities") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

INTERPRETATION-

The scatter plot displays the relationship between Speed Limit and Average Number of Fatalities.

The regression line is nearly horizontal, showing almost no slope.

The data points are widely spread around the line without forming a strong pattern.

The confidence interval (grey area) is wide, indicating a high uncertainty in the trend.

The average number of fatalities stays roughly around 2, regardless of the speed limit.

Conclusion:

The analysis shows no significant relationship between speed limit and average fatalities. The number of fatalities remains fairly constant across different speed limits, suggesting that other factors may have a greater influence on fatality rates.
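The plot fits a line through group averages, which hides the spread of individual accidents. A complementary check is a linear model on row-level data, sketched here with simulated stand-in vectors (an assumption, not the real columns). The slope, its standard error, and R-squared summarise how much of the variation in fatalities the speed limit explains.

```r
# Simulated stand-in data (assumption): fatalities drawn independently
# of the speed limit, mimicking a near-zero correlation.
set.seed(7)
n <- 5000
speed_limit <- sample(seq(30, 120, by = 10), n, replace = TRUE)
fatalities <- sample(0:4, n, replace = TRUE)

# Row-level linear regression of fatalities on speed limit
fit <- lm(fatalities ~ speed_limit)
coefs <- summary(fit)$coefficients

# Slope estimate, standard error and p-value for speed_limit
print(coefs["speed_limit", c("Estimate", "Std. Error", "Pr(>|t|)")])

# Share of variance in fatalities explained by the model
print(summary(fit)$r.squared)
```

A slope near zero with a large p-value and an R-squared close to 0 would be consistent with the nearly horizontal regression line described above.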

Q19. What are the most common causes of accidents in different regions?

# Count how many times each accident cause occurred in each region 
accident_cause_region <- table(road_accident_dataset$Region, road_accident_dataset$`Accident Cause`)
print(accident_cause_region)

##                
##                 Distracted Driving Drunk Driving Mechanical Failure Speeding
##   Asia                        5250          5284               5285     5286
##   Australia                   5386          5256               5320     5394
##   Europe                      5252          5359               5177     5287
##   North America               5272          5297               5333     5225
##   South America               5300          5310               5228     5254
##                
##                 Weather
##   Asia             5246
##   Australia        5269
##   Europe           5270
##   North America    5288
##   South America    5172

Interpretation:

● Across all regions, Drunk Driving and Distracted Driving have the highest totals, although every count lies in a narrow band (5,172 to 5,394), so no single cause clearly dominates.

● Europe has the highest number of drunk driving-related accidents (5359).

● Mechanical Failure peaks in North America (5333) and Speeding peaks in Australia (5394).

Possible Reasons:

● Drunk Driving: May reflect cultural factors, alcohol laws, and enforcement differences.

● Distracted Driving: Possibly due to increased mobile device usage and in-car tech distractions.

● Mechanical Failure: Could point to poor vehicle maintenance practices or older vehicles in use.

● Weather: Less dominant but still relevant, especially in areas with diverse climates like North America and Europe.

● Speeding: Higher in regions with larger highway systems or less strict speed enforcement.
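The counts in the table are close to each other, so it helps to read off the top cause per region directly and to test whether cause and region are associated at all. The sketch below rebuilds the printed table as a matrix and applies a chi-squared test of independence. This test is an addition for illustration and was not part of the original analysis.

```r
# Counts copied from the printed Region x Accident Cause table
causes <- c("Distracted Driving", "Drunk Driving", "Mechanical Failure",
            "Speeding", "Weather")
regions <- c("Asia", "Australia", "Europe", "North America", "South America")
counts <- matrix(
  c(5250, 5284, 5285, 5286, 5246,
    5386, 5256, 5320, 5394, 5269,
    5252, 5359, 5177, 5287, 5270,
    5272, 5297, 5333, 5225, 5288,
    5300, 5310, 5228, 5254, 5172),
  nrow = 5, byrow = TRUE,
  dimnames = list(Region = regions, Cause = causes)
)

# Most frequent cause in each region
top_cause <- apply(counts, 1, function(row) names(which.max(row)))
print(top_cause)

# Chi-squared test of independence between region and cause
chi_result <- chisq.test(counts)
print(chi_result$statistic)
print(chi_result$parameter)  # degrees of freedom
print(chi_result$p.value)
```

A large p-value here would suggest that the regional differences in the table are within what random variation alone could produce.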

——————————————

6. Accident Aftermath & Impact Analysis

——————————————

Q20. What is the average number of injuries per accident?

mean(road_accident_dataset$`Number of Injuries`, na.rm = TRUE)

## [1] 9.508205

INTERPRETATION-

Based on the analysis, the average number of injuries per accident is approximately 9.5. This indicates that each reported accident results in about 9 to 10 injuries on average, which is relatively high. It suggests that many accidents involve multiple individuals, possibly due to multi-vehicle collisions or accidents in high-occupancy vehicles like buses. This emphasizes the need for stricter safety regulations and quicker emergency response to reduce injury impact.
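A mean alone can hide the shape of the distribution. As a hedged sketch on simulated stand-in data (an assumption, not the real column), the median, standard deviation and a few quantiles show whether an average of about 9.5 reflects typical accidents or a mix of very small and very large counts.

```r
# Simulated stand-in data (assumption): injury counts between 0 and 19
set.seed(3)
injuries <- sample(0:19, 1000, replace = TRUE)

# Centre and spread of the injury counts
stats <- c(mean = mean(injuries), median = median(injuries), sd = sd(injuries))
print(round(stats, 2))

# Selected quantiles to show the range of typical values
injury_quantiles <- quantile(injuries, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))
print(injury_quantiles)
```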

Q21. How do insurance claims vary by accident cause?

insurance_claims <- tapply(road_accident_dataset$`Insurance Claims`,
                           road_accident_dataset$`Accident Cause`, mean, na.rm = TRUE)
print(insurance_claims)

## Distracted Driving      Drunk Driving Mechanical Failure           Speeding 
##           4.488549           4.485588           4.491250           4.494101 
##            Weather 
##           4.518804

Visualization

road_accident_dataset %>%
  ggplot(aes(x = `Accident Cause`, y = `Insurance Claims`, fill = `Accident Cause`)) +
  geom_boxplot() +
  labs(title = "Distribution of Insurance Claims by Accident Cause",
    x = "Accident Cause",
    y = "Insurance Claim Amount") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ANOVA analysis:

# ANOVA analysis
# Filter data (optional, in case of NA values)
insurance_data <- road_accident_dataset %>%
  filter(!is.na(`Accident Cause`), !is.na(`Insurance Claims`))

# Run ANOVA
anova_result <- aov(`Insurance Claims` ~ `Accident Cause`, data = insurance_data)

# View ANOVA summary
summary(anova_result)

##                      Df  Sum Sq Mean Sq F value Pr(>F)
## `Accident Cause`      4      19   4.665   0.567  0.686
## Residuals        131995 1085235   8.222

INTERPRETATION-

The boxplot shows the distribution of insurance claim amounts for different accident causes: Distracted Driving, Drunk Driving, Mechanical Failure, Speeding, and Weather.

Across all accident causes, the median insurance claim amount (the thick black horizontal line inside each box) appears fairly similar, centered around the middle of the vertical scale.

Spread (IQR) — the height of the colored boxes (interquartile range) — looks very similar across all causes, suggesting the variability in claim amounts is comparable.

Whiskers (lines extending from the boxes) show that there are a few very low and very high claim amounts, but no major outliers are visible.

The overall distribution shapes are consistent, indicating that no single accident cause leads to dramatically higher or lower insurance claims compared to others.

Conclusion:

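Insurance claims do not vary greatly depending on whether the accident was due to distracted driving, drunk driving, mechanical failure, speeding, or weather conditions.

The ANOVA supports this: with F = 0.567 and p = 0.686, the differences between the group means are not statistically significant at the 5% level.

Although minor differences exist in the group means (about 4.49 to 4.52), accident cause does not seem to be a strong driver of insurance claim variation in this dataset.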

Implication: Other factors (such as severity of accidents, vehicle value, or location) might have a stronger impact on insurance claim amounts than the cause alone.
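The figures in the ANOVA table can be checked by hand. The sketch below recomputes the F value and p-value from the printed mean squares and degrees of freedom. It also derives eta squared, the share of total variance attributable to accident cause, which is an additional measure not reported in the original output.

```r
# Values copied from the printed ANOVA table
df_between <- 4
df_within <- 131995
ms_between <- 4.665
ms_within <- 8.222
ss_within <- 1085235

# F statistic is the ratio of the two mean squares
f_value <- ms_between / ms_within

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

# Eta squared: between-group sum of squares over total sum of squares
ss_between <- ms_between * df_between
eta_squared <- ss_between / (ss_between + ss_within)

print(round(f_value, 3))
print(round(p_value, 3))
print(eta_squared)
```

The recomputed F and p-value match the table, and the very small eta squared indicates that accident cause accounts for a negligible share of the variation in insurance claims.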

Overall Project Summary

In this project, a detailed exploratory data analysis was conducted on a global road accident dataset containing 132,000 records across 30 features. Various aspects like accident timing, geography, driver demographics, vehicle conditions, road environments, and accident aftermath were analyzed. The analysis revealed key insights such as higher accident severity during evening and night times, consistent accident counts across different driver age groups and genders, and the major influence of road conditions and driver behavior over external factors like traffic volume or speed limits. Outlier detection confirmed that the dataset was clean and reliable. Correlation analyses showed that factors like traffic volume, number of vehicles involved, and alcohol levels had very weak relationships with injuries and fatalities, emphasizing the importance of driver attentiveness and safety regulations. Overall, the findings suggest that multi-dimensional strategies focusing on human behavior, emergency response improvements, and road safety awareness can help in significantly reducing road accidents worldwide.

Global Road Accident

Manish Chandra Joshi & Anubhav Kashyap

2025-04-23
