library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'readr' was built under R version 4.4.3
## Warning: package 'lubridate' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(readr)
library(dplyr)
library(ggplot2)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.4.3
## corrplot 0.95 loaded
road_accident_dataset <- read_csv("C:/Users/MANISH/OneDrive/Desktop/CA 3 Data science/road_accident_dataset.csv")
## Rows: 132000 Columns: 30
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (14): Country, Month, Day of Week, Time of Day, Urban/Rural, Road Type, ...
## dbl (16): Year, Visibility Level, Number of Vehicles Involved, Speed Limit, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(road_accident_dataset)

Understanding the dataset

What are the column names and data types in the dataset?

str(road_accident_dataset) 
## spc_tbl_ [132,000 × 30] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Country                    : chr [1:132000] "USA" "UK" "USA" "UK" ...
##  $ Year                       : num [1:132000] 2002 2014 2012 2017 2002 ...
##  $ Month                      : chr [1:132000] "October" "December" "July" "May" ...
##  $ Day of Week                : chr [1:132000] "Tuesday" "Saturday" "Sunday" "Saturday" ...
##  $ Time of Day                : chr [1:132000] "Evening" "Evening" "Afternoon" "Evening" ...
##  $ Urban/Rural                : chr [1:132000] "Rural" "Urban" "Urban" "Urban" ...
##  $ Road Type                  : chr [1:132000] "Street" "Street" "Highway" "Main Road" ...
##  $ Weather Conditions         : chr [1:132000] "Windy" "Windy" "Snowy" "Clear" ...
##  $ Visibility Level           : num [1:132000] 220 168 341 489 348 ...
##  $ Number of Vehicles Involved: num [1:132000] 1 3 4 2 1 2 3 3 3 3 ...
##  $ Speed Limit                : num [1:132000] 37 96 62 78 98 30 92 61 106 74 ...
##  $ Driver Age Group           : chr [1:132000] "18-25" "18-25" "41-60" "18-25" ...
##  $ Driver Gender              : chr [1:132000] "Male" "Female" "Male" "Male" ...
##  $ Driver Alcohol Level       : num [1:132000] 0.0519 0.2349 0.1424 0.1208 0.1558 ...
##  $ Driver Fatigue             : num [1:132000] 0 1 0 1 1 1 0 0 1 1 ...
##  $ Vehicle Condition          : chr [1:132000] "Poor" "Poor" "Moderate" "Good" ...
##  $ Pedestrians Involved       : num [1:132000] 1 1 0 2 0 2 2 1 1 1 ...
##  $ Cyclists Involved          : num [1:132000] 2 1 0 0 1 2 1 1 2 0 ...
##  $ Accident Severity          : chr [1:132000] "Moderate" "Minor" "Moderate" "Minor" ...
##  $ Number of Injuries         : num [1:132000] 8 6 13 6 13 10 10 6 4 2 ...
##  $ Number of Fatalities       : num [1:132000] 2 1 4 3 4 4 3 2 2 3 ...
##  $ Emergency Response Time    : num [1:132000] 58.6 58 42.4 48.6 18.3 ...
##  $ Traffic Volume             : num [1:132000] 7413 4459 9857 4959 3843 ...
##  $ Road Condition             : chr [1:132000] "Wet" "Snow-covered" "Wet" "Icy" ...
##  $ Accident Cause             : chr [1:132000] "Weather" "Mechanical Failure" "Speeding" "Distracted Driving" ...
##  $ Insurance Claims           : num [1:132000] 4 3 4 3 8 7 9 8 5 0 ...
##  $ Medical Cost               : num [1:132000] 40500 6487 29164 25797 15605 ...
##  $ Economic Loss              : num [1:132000] 22073 9534 58009 20907 13584 ...
##  $ Region                     : chr [1:132000] "Europe" "North America" "South America" "Australia" ...
##  $ Population Density         : num [1:132000] 3866 2334 4409 2811 3884 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Country = col_character(),
##   ..   Year = col_double(),
##   ..   Month = col_character(),
##   ..   `Day of Week` = col_character(),
##   ..   `Time of Day` = col_character(),
##   ..   `Urban/Rural` = col_character(),
##   ..   `Road Type` = col_character(),
##   ..   `Weather Conditions` = col_character(),
##   ..   `Visibility Level` = col_double(),
##   ..   `Number of Vehicles Involved` = col_double(),
##   ..   `Speed Limit` = col_double(),
##   ..   `Driver Age Group` = col_character(),
##   ..   `Driver Gender` = col_character(),
##   ..   `Driver Alcohol Level` = col_double(),
##   ..   `Driver Fatigue` = col_double(),
##   ..   `Vehicle Condition` = col_character(),
##   ..   `Pedestrians Involved` = col_double(),
##   ..   `Cyclists Involved` = col_double(),
##   ..   `Accident Severity` = col_character(),
##   ..   `Number of Injuries` = col_double(),
##   ..   `Number of Fatalities` = col_double(),
##   ..   `Emergency Response Time` = col_double(),
##   ..   `Traffic Volume` = col_double(),
##   ..   `Road Condition` = col_character(),
##   ..   `Accident Cause` = col_character(),
##   ..   `Insurance Claims` = col_double(),
##   ..   `Medical Cost` = col_double(),
##   ..   `Economic Loss` = col_double(),
##   ..   Region = col_character(),
##   ..   `Population Density` = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Interpretation: The dataset contains both categorical and numeric variables (14 character and 16 numeric columns) describing accident characteristics such as severity, time of day, driver demographics, environmental conditions, and vehicle condition. This mix supports statistical summaries, trend detection, and correlation studies across multiple dimensions of road safety.


# Select only numeric columns
numeric_data <- road_accident_dataset %>% select(where(is.numeric))

# Define a function to find outliers based on the IQR rule
find_outliers <- function(x) {
  Q1 <- quantile(x, 0.25, na.rm = TRUE)
  Q3 <- quantile(x, 0.75, na.rm = TRUE)
  IQR_value <- Q3 - Q1
  lower_bound <- Q1 - 1.5 * IQR_value
  upper_bound <- Q3 + 1.5 * IQR_value
  outlier_indices <- which(x < lower_bound | x > upper_bound)
  return(outlier_indices)
}

#Apply function to each numeric column
outliers_list <- lapply(numeric_data, find_outliers)

#Print the number of outliers for each column
outlier_summary <- sapply(outliers_list, length)
print(outlier_summary)
##                        Year            Visibility Level 
##                           0                           0 
## Number of Vehicles Involved                 Speed Limit 
##                           0                           0 
##        Driver Alcohol Level              Driver Fatigue 
##                           0                           0 
##        Pedestrians Involved           Cyclists Involved 
##                           0                           0 
##          Number of Injuries        Number of Fatalities 
##                           0                           0 
##     Emergency Response Time              Traffic Volume 
##                           0                           0 
##            Insurance Claims                Medical Cost 
##                           0                           0 
##               Economic Loss          Population Density 
##                           0                           0

Interpretation:

An outlier check was run on all 16 numeric columns using the interquartile range (IQR) rule, which flags values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR. No values were flagged in any column, including traffic volume, number of injuries, fatalities, speed limit, and economic loss. This means no extreme-value treatment is needed before exploratory analysis or modeling. It does not by itself show that the data are error-free: the rule only detects values far from the middle 50% of a column, so columns whose values fill a bounded range fairly evenly (or a balanced binary flag such as Driver Fatigue) typically produce no flags at all.
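As a minimal sketch of what the 1.5 × IQR rule does and does not flag, the helper below returns the fences together with the outlier count. The name `iqr_fences` and the toy vectors are illustrative assumptions, not part of the dataset or the original analysis.

```r
# Illustrative helper (hypothetical name): return the IQR fences and the
# number of values that fall outside them.
iqr_fences <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  lower <- q[1] - 1.5 * iqr
  upper <- q[2] + 1.5 * iqr
  list(lower = lower, upper = upper,
       n_outliers = sum(x < lower | x > upper, na.rm = TRUE))
}

# Toy data: evenly spread values produce no flags...
even_values <- 1:100
res_even <- iqr_fences(even_values)

# ...while one extreme value added to the same data is flagged.
res_extreme <- iqr_fences(c(even_values, 1000))

print(res_even$n_outliers)
print(res_extreme$n_outliers)
```

Printing `lower` and `upper` for each real column would show how far the observed minimum and maximum sit from the fences.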



2. Geographic & Demographic Analysis (Location & Population-Based Insights)

———————————————————————————-

Q5. What is the distribution of accidents by region and time of day?

regionwise<-road_accident_dataset%>%
  group_by(Region,`Time of Day`)%>%
  summarise(Total_Accidents = n())
## `summarise()` has grouped output by 'Region'. You can override using the
## `.groups` argument.
print(regionwise)
## # A tibble: 20 × 3
## # Groups:   Region [5]
##    Region        `Time of Day` Total_Accidents
##    <chr>         <chr>                   <int>
##  1 Asia          Afternoon                6650
##  2 Asia          Evening                  6546
##  3 Asia          Morning                  6539
##  4 Asia          Night                    6616
##  5 Australia     Afternoon                6709
##  6 Australia     Evening                  6657
##  7 Australia     Morning                  6476
##  8 Australia     Night                    6783
##  9 Europe        Afternoon                6454
## 10 Europe        Evening                  6592
## 11 Europe        Morning                  6703
## 12 Europe        Night                    6596
## 13 North America Afternoon                6628
## 14 North America Evening                  6676
## 15 North America Morning                  6482
## 16 North America Night                    6629
## 17 South America Afternoon                6519
## 18 South America Evening                  6550
## 19 South America Morning                  6588
## 20 South America Night                    6607

Visualization

# creation of table 
accident_table <- table(road_accident_dataset$`Region`, road_accident_dataset$`Time of Day`)
print(accident_table)
##                
##                 Afternoon Evening Morning Night
##   Asia               6650    6546    6539  6616
##   Australia          6709    6657    6476  6783
##   Europe             6454    6592    6703  6596
##   North America      6628    6676    6482  6629
##   South America      6519    6550    6588  6607
# correlate the time-of-day columns across the five regions (only 5 data points per column, so interpret with caution)
accident_correlation <- cor(accident_table)
print(accident_correlation)
##            Afternoon    Evening    Morning      Night
## Afternoon  1.0000000  0.4443213 -0.9243090  0.7318491
## Evening    0.4443213  1.0000000 -0.5392337  0.5638053
## Morning   -0.9243090 -0.5392337  1.0000000 -0.6160331
## Night      0.7318491  0.5638053 -0.6160331  1.0000000
road_accident_dataset %>%
  filter(`Time of Day`== "Night") %>%
  group_by(Region) %>%
  summarise(Night_Accidents = n()) %>%
  ggplot(aes(x = Region, y = Night_Accidents, fill = Region)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = "Night Accidents by Region", x = "Region", y = "Number of Accidents") +
  theme_minimal()

Interpretation:

Australia has the highest number of night-time accidents (6,783).

North America (6,629), Asia (6,616), and South America (6,607) have very similar counts, slightly below Australia.

Europe has the lowest count (6,596), although it trails South America by only 11 accidents.

The gap between the highest and lowest region is 187 accidents, under 3% of any region's night-time total, so the regional differences are small.

Conclusion:

Night-time accidents are spread almost evenly across regions, with Australia slightly ahead. On this evidence, night-time driving safety is a shared concern rather than a problem specific to one region.
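To judge whether the small regional differences in the Region by Time of Day table exceed chance variation, one option is a chi-square test of independence. The sketch below rebuilds the table from the printed counts; the object names are illustrative, and the test is a suggested check rather than part of the original analysis.

```r
# Counts copied from the Region x Time of Day table printed above.
accident_counts <- matrix(
  c(6650, 6546, 6539, 6616,
    6709, 6657, 6476, 6783,
    6454, 6592, 6703, 6596,
    6628, 6676, 6482, 6629,
    6519, 6550, 6588, 6607),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    Region = c("Asia", "Australia", "Europe", "North America", "South America"),
    TimeOfDay = c("Afternoon", "Evening", "Morning", "Night")
  )
)

# Chi-square test of independence between region and time of day.
independence_test <- chisq.test(accident_counts)
print(independence_test)

# Row percentages: share of each region's accidents by time of day.
print(round(100 * prop.table(accident_counts, margin = 1), 1))
```

With the data loaded, `chisq.test(accident_table)` would run the same test directly on the existing table object.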


Q6. What is the correlation between population density and accident frequency?

accident_density <- road_accident_dataset %>%
  group_by(Region) %>%
  summarise(Total_Accidents = n(),
            Avg_Population_Density = mean(`Population Density`, na.rm = TRUE))
print(accident_density)
## # A tibble: 5 × 3
##   Region        Total_Accidents Avg_Population_Density
##   <chr>                   <int>                  <dbl>
## 1 Asia                    26351                  2523.
## 2 Australia               26625                  2517.
## 3 Europe                  26345                  2501.
## 4 North America           26415                  2497.
## 5 South America           26264                  2495.

INTERPRETATION- According to the data:

● Australia has the highest number of accidents (26,625) despite having a similar population density as other regions.

● South America has the lowest accident count (26,264) and the lowest population density (~2495).

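● All regions show very similar accident counts (around 26,000+) and closely ranged population densities (between 2495–2523).

● With only five regional averages that differ by about 1%, the table on its own does not establish a meaningful correlation between population density and accident frequency.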
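The question asks for a correlation, which the summary table does not compute. A minimal sketch of a region-level check is given here, using the rounded values printed in the table. With five points the estimate is very unstable, so this is an illustration of the calculation, not a finding.

```r
# Values copied from the printed summary (densities are rounded as displayed).
accident_density <- data.frame(
  Region = c("Asia", "Australia", "Europe", "North America", "South America"),
  Total_Accidents = c(26351, 26625, 26345, 26415, 26264),
  Avg_Population_Density = c(2523, 2517, 2501, 2497, 2495)
)

# Pearson correlation across the five regions, with its confidence interval.
density_test <- cor.test(accident_density$Avg_Population_Density,
                         accident_density$Total_Accidents)
print(density_test$estimate)
print(density_test$conf.int)
```

The width of the confidence interval is the figure to look at: an interval spanning most of the range from -1 to 1 means the five averages cannot settle the question.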


Q7. Urban vs Rural - More vehicles involved?

# Compare average number of vehicles involved in Urban vs Rural areas
urban_rural <-road_accident_dataset %>% filter(`Urban/Rural` %in% c("Urban", "Rural")) %>% group_by(`Urban/Rural`) %>% 
  summarise(Avg_Vehicles = mean(as.numeric(`Number of Vehicles Involved`),na.rm = TRUE))
print(urban_rural) 
## # A tibble: 2 × 2
##   `Urban/Rural` Avg_Vehicles
##   <chr>                <dbl>
## 1 Rural                 2.50
## 2 Urban                 2.50

Visualization

road_accident_dataset%>%
  group_by(`Urban/Rural`) %>%
  summarise(Average_Injuries = mean(`Number of Injuries`, na.rm = TRUE)) %>%
  ggplot(aes(x = `Urban/Rural`, y = Average_Injuries, fill =`Urban/Rural`)) +
  geom_col() +
  labs(title = "Average Injuries: Urban vs Rural", 
       x = "Area Type", 
       y = "Average Injuries") +
  theme_minimal()

Interpretation:

The average number of vehicles involved is the same in Urban and Rural areas (2.50 in both), so neither area type involves more vehicles per accident.

The chart plots a different measure, average injuries per accident, which is marginally higher in Rural areas than in Urban areas.

Both area types show high and nearly identical average injury numbers, so the gap is too small to treat as a real difference without a formal test.

If the gap is real, it could reflect higher rural speeds or slower medical help, or lower urban speed limits and quicker emergency response, but the data shown here do not test these explanations.

Conclusion:

Urban and Rural accidents involve the same number of vehicles on average, and rural accidents result in only marginally more injuries. Factors such as road type, emergency services, and speed might influence injury severity, but this comparison alone does not show it.


Q8. Do accidents in urban areas involve more fatalities and injuries on average than those in rural areas?

# Compare average fatalities and injuries in Urban vs Rural areas
urban_rural_severity <- road_accident_dataset %>% filter(`Urban/Rural` %in% c("Urban", "Rural")) %>% group_by(`Urban/Rural`) %>% summarise( 
    Avg_Fatalities = mean(as.numeric(`Number of Fatalities`), na.rm = TRUE), 
    Avg_Injuries = mean(as.numeric(`Number of Injuries`), na.rm = TRUE)) 
print(urban_rural_severity) 
## # A tibble: 2 × 3
##   `Urban/Rural` Avg_Fatalities Avg_Injuries
##   <chr>                  <dbl>        <dbl>
## 1 Rural                   1.99         9.53
## 2 Urban                   2.00         9.49

INTERPRETATION-

● The average fatalities in urban areas (2.00) are almost the same as in rural areas (1.99).

● The average number of injuries is slightly higher in rural areas (9.53) than in urban areas (9.49), a difference of 0.04 injuries per accident (under 0.5%).

● Urban accidents therefore do not involve more fatalities or injuries on average; the two area types are practically identical on both measures.

Possible Reasons (if the small injury gap is real):

● Emergency response time might be slower in rural areas, contributing to higher injuries.

● Urban areas have more traffic regulation and faster medical access, potentially reducing injury severity.

● Rural roads may lack proper infrastructure (e.g., lighting, signage), leading to slightly worse outcomes even if fatalities remain similar.
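Whether a gap this small is distinguishable from noise can be checked with a two-sample test. The sketch below shows the shape of a Welch t-test on simulated stand-in data; the `toy` data frame and its values are invented for illustration and are NOT the accident dataset.

```r
set.seed(42)

# Simulated stand-in data (hypothetical values, not the real records).
toy <- data.frame(
  area = rep(c("Urban", "Rural"), each = 500),
  injuries = sample(0:19, 1000, replace = TRUE)
)

# Welch two-sample t-test: does mean injuries differ between area types?
welch <- t.test(injuries ~ area, data = toy)
print(welch$p.value)
print(welch$conf.int)

# On the real data the equivalent call would be (assuming only the two
# levels "Urban" and "Rural" are present):
# t.test(`Number of Injuries` ~ `Urban/Rural`, data = road_accident_dataset)
```

A confidence interval that includes zero would indicate that the data do not separate the two group means.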


—————————————————-

4. Road & Traffic Conditions

——————————————

Q14. How do road types (highway, street, main road) impact accident frequency?

count(road_accident_dataset, `Road Type` , sort = TRUE) 
## # A tibble: 3 × 2
##   `Road Type`     n
##   <chr>       <int>
## 1 Main Road   44197
## 2 Highway     43920
## 3 Street      43883

Visualization

 road_accident_dataset %>%
    group_by(`Road Type`) %>%
    summarise(Accidents = n()) %>%
    ggplot(aes(x = reorder(`Road Type`, Accidents), y = Accidents, fill = `Road Type`)) +
    geom_bar(stat = "identity") +
    labs(title = "Accidents by Road Type", x = "Road Type", y = "Number of Accidents") +
    theme_minimal()

Interpretation:

Accidents are almost evenly distributed across the three road types: Main Road (44,197), Highway (43,920), and Street (43,883).

Main roads show the highest count, but the gap between the highest and lowest category is only 314 accidents, about 0.7% of a category's total.

Highways and streets have nearly identical counts, differing by 37 accidents.

Conclusion:

Road type has little visible effect on accident frequency in this dataset.

Each road type accounts for roughly one third of the 132,000 records, so no road type stands out as markedly riskier.

Because the counts are not adjusted for traffic exposure (for example, distance driven on each road type), they describe where accidents were recorded rather than the risk of driving on a given road type.
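A chi-square goodness-of-fit test is one way to check whether the three road type counts differ from an even split. The sketch below uses the counts printed by `count()` above; the object names are illustrative, and the test is offered as an optional check rather than part of the original analysis.

```r
# Counts copied from the Road Type frequency table printed above.
road_type_counts <- c("Main Road" = 44197, "Highway" = 43920, "Street" = 43883)

# Goodness-of-fit test against equal expected frequencies (the default).
gof <- chisq.test(road_type_counts)
print(gof)
```

A large p-value here would mean the observed counts are consistent with accidents being equally likely on each road type.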


Q15. What are the most frequent road conditions during accidents?

Road_condition<-table(road_accident_dataset$`Road Condition`)
print(Road_condition) 
## 
##          Dry          Icy Snow-covered          Wet 
##        32855        32779        33010        33356

Visualization

road_accident_dataset %>%
    group_by(`Road Condition`, Year) %>%
    summarise(Accidents = n()) %>%
    ggplot(aes(x = Year, y = Accidents, color = `Road Condition`)) +
    geom_line(linewidth = 1.2) +
    facet_wrap(~ `Road Condition`)+
    labs(title = "Accidents Over Time by Road Condition", x = "Year", y = "Number of Accidents") +
    theme_minimal()
## `summarise()` has grouped output by 'Road Condition'. You can override using
## the `.groups` argument.

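INTERPRETATION-

Accidents occur almost equally across all road conditions. Wet roads are the most frequent (33,356), followed by snow-covered (33,010), dry (32,855), and icy (32,779). Each condition accounts for roughly a quarter of all records, and the gap between the most and least frequent condition is 577 accidents. Hazardous surfaces therefore do not show a clearly higher accident frequency than dry roads: icy roads actually have the lowest count. The near-equal share of dry-road accidents suggests that factors other than the road surface, such as driver error and behavior, play a major role.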
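Expressing the road condition counts as percentage shares makes the size of the differences easier to read. This is a small sketch using the counts printed above; the variable names are illustrative.

```r
# Counts copied from the Road Condition table printed above.
road_condition_counts <- c(Dry = 32855, Icy = 32779,
                           `Snow-covered` = 33010, Wet = 33356)

# Percentage share of each condition, largest first.
shares <- 100 * road_condition_counts / sum(road_condition_counts)
print(round(sort(shares, decreasing = TRUE), 2))

# Spread between the most and least frequent condition, in percentage points.
spread <- max(shares) - min(shares)
print(round(spread, 2))
```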


Q16. How does traffic volume impact the number of vehicles involved in road accidents?

# Check correlation between traffic volume and vehicles involved in accidents 
traffic_corr <- cor(as.numeric(road_accident_dataset$`Traffic Volume`),
                    as.numeric(road_accident_dataset$`Number of Vehicles Involved`))
print(traffic_corr) 
## [1] 0.003168233

INTERPRETATION-

The correlation value is 0.003, which is extremely close to zero.

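This indicates essentially no linear relationship between traffic volume and the number of vehicles involved in accidents. In other words, higher or lower traffic volumes do not noticeably affect how many vehicles are involved in an accident. Pearson correlation only measures linear association, so a non-linear pattern would not show up in this number.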

Possible reasons could include:

● High traffic may lead to slower speeds, which reduces the severity and involvement of multiple vehicles.

● Low traffic might result in over-speeding, still leading to multi-vehicle crashes.

● Other factors like road design, driver behavior, or weather conditions may be more influential than traffic volume itself.
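One way to check whether a coefficient this small differs from zero is to convert it to a t statistic. The sketch below does that for the printed value, assuming n = 132,000 complete pairs (the row count reported when the file was read; `cor()` would have returned NA if either column had missing values). The helper name `cor_p_value` is illustrative.

```r
# Illustrative helper (hypothetical name): two-sided p-value for a Pearson
# correlation r computed from n complete pairs.
cor_p_value <- function(r, n) {
  t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
  p <- 2 * pt(-abs(t_stat), df = n - 2)
  list(t = t_stat, p = p)
}

# Correlation printed above; n assumes all 132,000 rows are complete.
traffic_check <- cor_p_value(r = 0.003168233, n = 132000)
print(traffic_check)
```

With the data loaded, `cor.test()` on the two columns reports the same test along with a confidence interval.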


Q17. Does the number of vehicles involved correlate with the number of injuries?

# Check if more vehicles in accidents cause more injuries 
injury_corr <- cor(as.numeric(road_accident_dataset$`Number of Vehicles Involved`), 
                   as.numeric(road_accident_dataset$`Number of Injuries`))
print(injury_corr) 
## [1] 0.002233653

INTERPRETATION-

● The correlation value is 0.002, which is very close to zero.

● This means there is almost no relationship between the number of vehicles involved and the number of injuries.

● In simpler terms, more vehicles in an accident does not necessarily lead to more injuries.

Possible Reasons:

● Passenger load varies: A multi-vehicle crash may involve few passengers, while a single bus accident could involve many.

● Safety measures: Proper use of seatbelts and airbags can reduce injuries regardless of how many vehicles are involved.

● Randomness of impact points: Some multi-vehicle crashes result in only minor injuries due to how vehicles collide.


——————————————

5. Environmental & External Factors

——————————————

Q18. Does a higher speed limit lead to more fatalities in road accidents?

# Calculate correlation between speed limit and number of fatalities
speed_fatality_corr <- cor(as.numeric(road_accident_dataset$`Speed Limit`), as.numeric(road_accident_dataset$`Number of Fatalities`)) 
print(speed_fatality_corr) 
## [1] 0.0006205779

Visualization

road_accident_dataset%>%
  group_by(`Speed Limit`) %>%
  summarise(Average_Fatalities = mean(`Number of Fatalities`, na.rm = TRUE)) %>%
  ggplot(aes(x = `Speed Limit`, y = Average_Fatalities)) +
  geom_point(color = "darkgreen") +
  geom_smooth(method = "lm", se = TRUE, color = "black") +
  labs(title = "Average Fatalities vs Speed Limit", x = "Speed Limit", y = "Average Number of Fatalities") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

INTERPRETATION-

The scatter plot displays the relationship between Speed Limit and Average Number of Fatalities.

The regression line is nearly horizontal, showing almost no slope.

The data points are widely spread around the line without forming a strong pattern.

The confidence interval (grey area) is wide, indicating a high uncertainty in the trend.

The average number of fatalities stays roughly around 2, regardless of the speed limit.

Conclusion:

The analysis shows no significant relationship between speed limit and average fatalities. The number of fatalities remains fairly constant across different speed limits, suggesting that other factors may have a greater influence on fatality rates.


Q19.What are the most common causes of accidents in different regions?

# Count how many times each accident cause occurred in each region 
accident_cause_region <- table(road_accident_dataset$Region, road_accident_dataset$`Accident Cause`)
print(accident_cause_region) 
##                
##                 Distracted Driving Drunk Driving Mechanical Failure Speeding
##   Asia                        5250          5284               5285     5286
##   Australia                   5386          5256               5320     5394
##   Europe                      5252          5359               5177     5287
##   North America               5272          5297               5333     5225
##   South America               5300          5310               5228     5254
##                
##                 Weather
##   Asia             5246
##   Australia        5269
##   Europe           5270
##   North America    5288
##   South America    5172

Interpretation:

● Every cause accounts for a similar number of accidents in every region: all 25 counts fall between 5,172 and 5,394.

● The most common cause differs by region: Speeding in Asia (5,286) and Australia (5,394), Drunk Driving in Europe (5,359) and South America (5,310), and Mechanical Failure in North America (5,333).

● Europe has the highest number of drunk driving-related accidents (5,359), and Australia has the highest counts for both Speeding (5,394) and Distracted Driving (5,386).

● These leads are narrow. In Asia, for example, the top three causes differ by only two accidents.

Possible Reasons (not tested by this table):

● Drunk Driving: May reflect cultural factors, alcohol laws, and enforcement differences.

● Distracted Driving: Possibly due to increased mobile device usage and in-car tech distractions.

● Mechanical Failure: Could point to poor vehicle maintenance practices or older vehicles in use.

● Weather: The least frequent cause overall, with the highest counts in North America (5,288) and Europe (5,270).

● Speeding: Could be higher in regions with larger highway systems or less strict speed enforcement.
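The most common cause in each region can be read off the table programmatically instead of by eye. The sketch below rebuilds the table from the printed counts; the object names `top_cause` and `row_range` are illustrative.

```r
# Counts copied from the Region x Accident Cause table printed above.
accident_cause_region <- matrix(
  c(5250, 5284, 5285, 5286, 5246,
    5386, 5256, 5320, 5394, 5269,
    5252, 5359, 5177, 5287, 5270,
    5272, 5297, 5333, 5225, 5288,
    5300, 5310, 5228, 5254, 5172),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    Region = c("Asia", "Australia", "Europe", "North America", "South America"),
    Cause = c("Distracted Driving", "Drunk Driving", "Mechanical Failure",
              "Speeding", "Weather")
  )
)

# Most frequent cause per region (the column with the largest count in each row).
top_cause <- colnames(accident_cause_region)[apply(accident_cause_region, 1, which.max)]
names(top_cause) <- rownames(accident_cause_region)
print(top_cause)

# Gap between the most and least frequent cause within each region.
row_range <- apply(accident_cause_region, 1, function(counts) max(counts) - min(counts))
print(row_range)
```

With the data loaded, the same two `apply()` calls work on the `accident_cause_region` table object directly.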


——————————————

6. Accident Aftermath & Impact Analysis

——————————————

Q20. What is the average number of injuries per accident?

mean(road_accident_dataset$`Number of Injuries`, na.rm = TRUE) 
## [1] 9.508205

INTERPRETATION-

Based on the analysis, the average number of injuries per accident is approximately 9.5. This indicates that each reported accident results in about 9 to 10 injuries on average, which is relatively high. It suggests that many accidents involve multiple individuals, possibly due to multi-vehicle collisions or accidents in high-occupancy vehicles like buses. This emphasizes the need for stricter safety regulations and quicker emergency response to reduce injury impact.


Q21. How do insurance claims vary by accident cause?

insurance_claims<-tapply(road_accident_dataset$`Insurance Claims`,
                         road_accident_dataset$`Accident Cause`,mean, na.rm = TRUE) 
print(insurance_claims)
## Distracted Driving      Drunk Driving Mechanical Failure           Speeding 
##           4.488549           4.485588           4.491250           4.494101 
##            Weather 
##           4.518804

Visualization

road_accident_dataset %>%
  ggplot(aes(x = `Accident Cause`, y = `Insurance Claims`, fill = `Accident Cause`)) +
  geom_boxplot() +
  labs(title = "Distribution of Insurance Claims by Accident Cause",
    x = "Accident Cause",
    y = "Insurance Claim Amount") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 

ANOVA analysis:

# ANOVA analysis
# Filter data (optional, in case of NA values)
insurance_data <- road_accident_dataset %>%
  filter(!is.na(`Accident Cause`), !is.na(`Insurance Claims`))

# Run ANOVA
anova_result <- aov(`Insurance Claims` ~ `Accident Cause`, data = insurance_data)

# View ANOVA summary
summary(anova_result)
##                      Df  Sum Sq Mean Sq F value Pr(>F)
## `Accident Cause`      4      19   4.665   0.567  0.686
## Residuals        131995 1085235   8.222

INTERPRETATION-

The boxplot shows the distribution of insurance claim amounts for different accident causes: Distracted Driving, Drunk Driving, Mechanical Failure, Speeding, and Weather.

Across all accident causes, the median insurance claim amount (the thick black horizontal line inside each box) appears fairly similar, centered around the middle of the vertical scale.

Spread (IQR) — the height of the colored boxes (interquartile range) — looks very similar across all causes, suggesting the variability in claim amounts is comparable.

Whiskers (lines extending from the boxes) show that there are a few very low and very high claim amounts, but no major outliers are visible.

The overall distribution shapes are consistent, indicating that no single accident cause leads to dramatically higher or lower insurance claims compared to others.

Conclusion:

Insurance claims do not vary meaningfully by accident cause. The group means range only from 4.49 (Drunk Driving) to 4.52 (Weather), and the one-way ANOVA finds no statistically significant difference between causes (F = 0.567, p = 0.686).

Accident cause is therefore not a driver of variation in insurance claims in this dataset.

Implication: Other factors (such as severity of accidents, vehicle value, or location) might have a stronger impact on insurance claims than the cause alone.
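The ANOVA table also allows a rough effect size to be computed. The sketch below recomputes F and its p-value from the printed mean squares and derives eta squared (the share of variance in claims explained by cause). It relies on the rounded values shown in the summary, so the results are approximate.

```r
# Values copied from the printed ANOVA summary (rounded as displayed).
df_between <- 4
df_resid   <- 131995
ms_between <- 4.665
ms_resid   <- 8.222
ss_resid   <- 1085235

# Between-group sum of squares recovered from the mean square.
ss_between <- ms_between * df_between

# F statistic and its upper-tail p-value.
f_stat  <- ms_between / ms_resid
p_value <- pf(f_stat, df_between, df_resid, lower.tail = FALSE)

# Eta squared: proportion of total variance attributable to accident cause.
eta_sq <- ss_between / (ss_between + ss_resid)

print(c(F = f_stat, p = p_value, eta_sq = eta_sq))
```

An eta squared close to zero is in line with the overlapping boxplots: cause explains almost none of the variation in claims.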


Overall Project Summary

In this project, a detailed exploratory data analysis was conducted on a global road accident dataset containing 132,000 records across 30 features. Various aspects like accident timing, geography, driver demographics, vehicle conditions, road environments, and accident aftermath were analyzed. The analysis revealed key insights such as higher accident severity during evening and night times, consistent accident counts across different driver age groups and genders, and near-uniform accident counts across regions, road types, and road conditions. IQR-based outlier detection flagged no extreme values in any numeric column. Correlation analyses showed that factors like traffic volume, number of vehicles involved, speed limits, and alcohol levels had very weak linear relationships with injuries and fatalities, which suggests that factors not captured by these variables, such as driver attentiveness, may matter more. Overall, the findings suggest that multi-dimensional strategies focusing on human behavior, emergency response improvements, and road safety awareness can help reduce road accidents worldwide.