library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'lubridate' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(readr)
library(dplyr)
library(ggplot2)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.4.3
## corrplot 0.95 loaded
road_accident_dataset <- read_csv("D:/Rstudi0_Project/road_accident_dataset.csv")
## Rows: 132000 Columns: 30
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (14): Country, Month, Day of Week, Time of Day, Urban/Rural, Road Type, ...
## dbl (16): Year, Visibility Level, Number of Vehicles Involved, Speed Limit, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(road_accident_dataset)
Q: What are the column names and data types in the dataset?
str(road_accident_dataset)
## spc_tbl_ [132,000 × 30] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Country : chr [1:132000] "USA" "UK" "USA" "UK" ...
## $ Year : num [1:132000] 2002 2014 2012 2017 2002 ...
## $ Month : chr [1:132000] "October" "December" "July" "May" ...
## $ Day of Week : chr [1:132000] "Tuesday" "Saturday" "Sunday" "Saturday" ...
## $ Time of Day : chr [1:132000] "Evening" "Evening" "Afternoon" "Evening" ...
## $ Urban/Rural : chr [1:132000] "Rural" "Urban" "Urban" "Urban" ...
## $ Road Type : chr [1:132000] "Street" "Street" "Highway" "Main Road" ...
## $ Weather Conditions : chr [1:132000] "Windy" "Windy" "Snowy" "Clear" ...
## $ Visibility Level : num [1:132000] 220 168 341 489 348 ...
## $ Number of Vehicles Involved: num [1:132000] 1 3 4 2 1 2 3 3 3 3 ...
## $ Speed Limit : num [1:132000] 37 96 62 78 98 30 92 61 106 74 ...
## $ Driver Age Group : chr [1:132000] "18-25" "18-25" "41-60" "18-25" ...
## $ Driver Gender : chr [1:132000] "Male" "Female" "Male" "Male" ...
## $ Driver Alcohol Level : num [1:132000] 0.0519 0.2349 0.1424 0.1208 0.1558 ...
## $ Driver Fatigue : num [1:132000] 0 1 0 1 1 1 0 0 1 1 ...
## $ Vehicle Condition : chr [1:132000] "Poor" "Poor" "Moderate" "Good" ...
## $ Pedestrians Involved : num [1:132000] 1 1 0 2 0 2 2 1 1 1 ...
## $ Cyclists Involved : num [1:132000] 2 1 0 0 1 2 1 1 2 0 ...
## $ Accident Severity : chr [1:132000] "Moderate" "Minor" "Moderate" "Minor" ...
## $ Number of Injuries : num [1:132000] 8 6 13 6 13 10 10 6 4 2 ...
## $ Number of Fatalities : num [1:132000] 2 1 4 3 4 4 3 2 2 3 ...
## $ Emergency Response Time : num [1:132000] 58.6 58 42.4 48.6 18.3 ...
## $ Traffic Volume : num [1:132000] 7413 4459 9857 4959 3843 ...
## $ Road Condition : chr [1:132000] "Wet" "Snow-covered" "Wet" "Icy" ...
## $ Accident Cause : chr [1:132000] "Weather" "Mechanical Failure" "Speeding" "Distracted Driving" ...
## $ Insurance Claims : num [1:132000] 4 3 4 3 8 7 9 8 5 0 ...
## $ Medical Cost : num [1:132000] 40500 6487 29164 25797 15605 ...
## $ Economic Loss : num [1:132000] 22073 9534 58009 20907 13584 ...
## $ Region : chr [1:132000] "Europe" "North America" "South America" "Australia" ...
## $ Population Density : num [1:132000] 3866 2334 4409 2811 3884 ...
## - attr(*, "spec")=
## .. cols(
## .. Country = col_character(),
## .. Year = col_double(),
## .. Month = col_character(),
## .. `Day of Week` = col_character(),
## .. `Time of Day` = col_character(),
## .. `Urban/Rural` = col_character(),
## .. `Road Type` = col_character(),
## .. `Weather Conditions` = col_character(),
## .. `Visibility Level` = col_double(),
## .. `Number of Vehicles Involved` = col_double(),
## .. `Speed Limit` = col_double(),
## .. `Driver Age Group` = col_character(),
## .. `Driver Gender` = col_character(),
## .. `Driver Alcohol Level` = col_double(),
## .. `Driver Fatigue` = col_double(),
## .. `Vehicle Condition` = col_character(),
## .. `Pedestrians Involved` = col_double(),
## .. `Cyclists Involved` = col_double(),
## .. `Accident Severity` = col_character(),
## .. `Number of Injuries` = col_double(),
## .. `Number of Fatalities` = col_double(),
## .. `Emergency Response Time` = col_double(),
## .. `Traffic Volume` = col_double(),
## .. `Road Condition` = col_character(),
## .. `Accident Cause` = col_character(),
## .. `Insurance Claims` = col_double(),
## .. `Medical Cost` = col_double(),
## .. `Economic Loss` = col_double(),
## .. Region = col_character(),
## .. `Population Density` = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
Interpretation:
The dataset contains 14 categorical (character) and 16 numeric variables describing accident characteristics such as severity, time of day, driver demographics, environmental conditions, and vehicle condition. This mix supports statistical analysis, trend detection, and correlation studies across multiple dimensions of road safety.
Q: Are there any outliers present in the dataset?
# Select only the numeric columns
numeric_data <- road_accident_dataset %>% select(where(is.numeric))
# Define a function that returns the indices of outliers based on the IQR rule
find_outliers <- function(x) {
  Q1 <- quantile(x, 0.25, na.rm = TRUE)
  Q3 <- quantile(x, 0.75, na.rm = TRUE)
  IQR_value <- Q3 - Q1
  lower_bound <- Q1 - 1.5 * IQR_value
  upper_bound <- Q3 + 1.5 * IQR_value
  outlier_indices <- which(x < lower_bound | x > upper_bound)
  return(outlier_indices)
}
# Apply the function to each numeric column
outliers_list <- lapply(numeric_data, find_outliers)
# Print the number of outliers for each column
outlier_summary <- sapply(outliers_list, length)
print(outlier_summary)
## Year Visibility Level
## 0 0
## Number of Vehicles Involved Speed Limit
## 0 0
## Driver Alcohol Level Driver Fatigue
## 0 0
## Pedestrians Involved Cyclists Involved
## 0 0
## Number of Injuries Number of Fatalities
## 0 0
## Emergency Response Time Traffic Volume
## 0 0
## Insurance Claims Medical Cost
## 0 0
## Economic Loss Population Density
## 0 0
Interpretation:
An outlier analysis was conducted on all 16 numeric columns using the Interquartile Range (IQR) method. No values were flagged in any column, including traffic volume, number of injuries, fatalities, speed limit, and economic loss. This means that no value lies more than 1.5 × IQR beyond the first or third quartile. It does not by itself show that the dataset is clean: several columns are bounded counts or indicators (for example, Driver Fatigue takes only the values 0 and 1), for which the IQR rule can rarely flag anything, and other issues such as missing values, duplicates, or inconsistent category labels need separate checks. For extreme values specifically, no additional cleaning appears necessary before exploratory analysis or modeling.
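To put a zero outlier count in context, it can help to print the IQR fences next to each column's observed range. The following is a minimal sketch: `iqr_summary` is a hypothetical helper name, and the two input vectors are made-up toy data, not values from the dataset.

```r
# Hypothetical helper: report the IQR fences alongside the observed range.
iqr_summary <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  lower <- q[1] - 1.5 * iqr
  upper <- q[2] + 1.5 * iqr
  c(lower_fence = lower,
    upper_fence = upper,
    min = min(x, na.rm = TRUE),
    max = max(x, na.rm = TRUE),
    n_outliers = sum(x < lower | x > upper, na.rm = TRUE))
}

# Toy data (illustrative only)
skewed  <- c(1, 2, 3, 4, 100)    # one extreme value
bounded <- rep(0:1, times = 50)  # a 0/1 indicator, similar to a fatigue flag

skewed_summary  <- iqr_summary(skewed)
bounded_summary <- iqr_summary(bounded)
print(rbind(skewed = skewed_summary, bounded = bounded_summary))
```

The 0/1 vector can never produce an outlier under this rule because its fences fall outside the only two possible values. On the real data, something like t(sapply(numeric_data, iqr_summary)) would give one row per numeric column.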
———————————————————————————-
———————————————————————————-
Q.1 What are the total accident counts over the years?
accidents_per_year <- road_accident_dataset %>% group_by(Year) %>% summarise(Total_Accidents = n())
print(accidents_per_year)
## # A tibble: 25 × 2
## Year Total_Accidents
## <dbl> <int>
## 1 2000 5280
## 2 2001 5263
## 3 2002 5433
## 4 2003 5327
## 5 2004 5180
## 6 2005 5302
## 7 2006 5156
## 8 2007 5307
## 9 2008 5409
## 10 2009 5298
## # ℹ 15 more rows
Visualization
road_accident_dataset %>%
  group_by(Year) %>%
  summarise(Total_Accidents = n()) %>%
  ggplot(aes(x = Year, y = Total_Accidents)) +
  geom_line(color = "blue", linewidth = 1.2) +
  # geom_point() is sized with `size`; `linewidth` applies to lines only
  geom_point(color = "darkblue", size = 2) +
  labs(title = "Total Accidents Per Year", x = "Year", y = "Number of Accidents") +
  theme_minimal()
Interpretation:
Fluctuating trend: Yearly counts vary within a narrow band (5,156 to 5,433 among the ten years printed above), with no clear increasing or decreasing pattern. Peak: The plot shows the highest count in 2020, but the counts alone do not identify a cause such as traffic volume or road conditions. Low points: Among the printed years, 2006 (5,156) and 2004 (5,180) have the fewest accidents, while 2005 (5,302) is close to the average; again, the data do not show whether enforcement or vehicle numbers played a role.
Conclusion:
Yearly accident counts fluctuate around the mean without a sustained upward or downward trend. The data therefore show neither a clear improvement nor a clear deterioration in road safety over the period.
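One way to check for a trend is to fit a straight line to the yearly counts and look at the slope. The sketch below uses only the ten yearly counts printed above (2000 to 2009), so it says nothing about later years; the object names `yearly` and `trend_fit` are illustrative.

```r
# Yearly counts copied from the ten rows printed above (2000-2009 only)
yearly <- data.frame(
  Year = 2000:2009,
  Total_Accidents = c(5280, 5263, 5433, 5327, 5180,
                      5302, 5156, 5307, 5409, 5298)
)

# Linear trend: the slope is the average change in accidents per year
trend_fit <- lm(Total_Accidents ~ Year, data = yearly)
slope <- unname(coef(trend_fit)["Year"])
print(slope)
print(summary(trend_fit)$coefficients)
```

A slope that is small relative to the year-to-year swings (which exceed 100 accidents here) indicates no meaningful trend. For the full series, lm(Total_Accidents ~ Year, data = accidents_per_year) would use all 25 years.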
Q.2 What is the average number of accidents per year?
average_accidents<-mean(accidents_per_year$Total_Accidents)
print(average_accidents)
## [1] 5280
Interpretation:
Based on the data set, the average number of accidents per year is 5280.
● This value represents the total number of recorded accidents divided by the number of years included in our data set.
● It helps give a big-picture view of how frequently accidents occur on an annual basis.
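The average can be cross-checked by hand from two figures reported earlier: the 132,000 rows read by read_csv() and the 25 yearly rows in accidents_per_year. A minimal check, with illustrative variable names:

```r
total_records <- 132000  # rows reported when the CSV was read
n_years <- 25            # number of rows in accidents_per_year

average_per_year <- total_records / n_years
print(average_per_year)  # 5280
```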
Q.3 How does the time of day (morning, afternoon, evening, night) impact accident severity?
time_wise<-road_accident_dataset%>% group_by(`Time of Day`) %>% summarise(Total_Accidents = n())
print(time_wise)
## # A tibble: 4 × 2
## `Time of Day` Total_Accidents
## <chr> <int>
## 1 Afternoon 32960
## 2 Evening 33021
## 3 Morning 32788
## 4 Night 33231
INTERPRETATION-
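The code above counts accidents by time of day; it does not measure severity. The counts are nearly equal across the four periods, ranging from 32,788 (Morning) to 33,231 (Night), a difference of 443 accidents (about 1.4%).
Night and evening have marginally more accidents than morning and afternoon. Factors commonly associated with higher risk at night include:
● Reduced visibility
● Driver fatigue
● Increased chances of drunk driving
Morning and afternoon accidents are marginally less frequent. Factors commonly associated with lower daytime risk include:
● Better visibility
● Lower driving speeds in congested traffic
● More alertness among drivers
The differences here are too small to attribute to any of these factors, and a claim about severity would require comparing `Accident Severity` across times of day.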
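Answering the severity question directly calls for a cross-tabulation of severity against time of day. The sketch below shows the shape of that calculation on made-up toy data; the column names and the "Severe" level are assumptions for illustration (only "Minor" and "Moderate" are visible in the output above).

```r
# Toy data (illustrative only, not taken from the dataset)
toy <- data.frame(
  time_of_day = c("Morning", "Morning", "Night", "Night", "Evening",
                  "Evening", "Afternoon", "Afternoon", "Night", "Morning"),
  severity    = c("Minor", "Severe", "Severe", "Moderate", "Minor",
                  "Moderate", "Minor", "Severe", "Minor", "Moderate")
)

# Counts of each severity level within each time of day
severity_counts <- table(toy$time_of_day, toy$severity)

# Row proportions: share of each severity level per time of day (rows sum to 1)
severity_share <- prop.table(severity_counts, margin = 1)
print(round(severity_share, 2))
```

On the real data, the same table could be built from the `Time of Day` and `Accident Severity` columns, and chisq.test() on the count table would test whether severity and time of day are independent.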
Q4. Are certain months more accident-prone than others?
# Count how many accidents happened in each month
month_table <- table(road_accident_dataset$Month)
print(month_table)
##
## April August December February January July June March
## 11063 10791 10909 11064 10952 11000 11122 11072
## May November October September
## 11158 10836 10986 11047
Visualization
road_accident_dataset %>%
group_by(Month) %>%
summarise(Total_Accidents = n()) %>%
ggplot(aes(x = reorder(Month, Total_Accidents), y = Total_Accidents, fill = "No of accident")) +
geom_bar(stat = "identity", width = 1, color = "black") +
labs(title = "Total Accidents Per Month", x = "Month", y = "Number of Accidents") +
coord_flip()+
theme_minimal()
Interpretation:
Peak months: May (11,158), June (11,122), and March (11,072) have the most accidents. Lowest months: August (10,791), November (10,836), and December (10,909) have the fewest. The gap between the highest and lowest month is 367 accidents, about 3% of a typical monthly count, so the seasonal pattern is weak. The frequency table lists each of the 12 months exactly once, so there is no sign of duplicated month labels.
Conclusion:
Accidents are slightly more frequent from February through June (each above 11,050) and slightly less frequent in August, November, and December, but monthly counts differ by only about 3%. Safety campaigns timed to spring and early summer could be considered, although the data show no strongly accident-prone month.
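Because months differ in length, raw monthly totals slightly favor 31-day months. The sketch below reorders the counts chronologically and divides by an assumed number of days per month (February taken as 28.25 to allow for leap years, which is an assumption, not something stated in the data). The result is a relative rate, not an actual daily count, since the totals span many years.

```r
# Monthly counts copied from the frequency table above
counts <- c(April = 11063, August = 10791, December = 10909, February = 11064,
            January = 10952, July = 11000, June = 11122, March = 11072,
            May = 11158, November = 10836, October = 10986, September = 11047)

# Reorder chronologically using the built-in month.name vector
counts <- counts[month.name]

# Assumed average days per month
days <- c(31, 28.25, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
per_day <- counts / days
print(round(per_day, 1))

busiest_total   <- names(which.max(counts))
busiest_per_day <- names(which.max(per_day))
print(c(total = busiest_total, per_day = busiest_per_day))
```

In the plot, factor(Month, levels = month.name) on the x-axis would show the months in calendar order rather than sorted by count.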
Q5. Do weekends have more vehicles involved than weekdays?
# Create a new column to mark Weekend or Weekday
road_accident_dataset$Weekend <- ifelse(road_accident_dataset$`Day of Week` %in% c("Saturday", "Sunday"), "Weekend", "Weekday")
# Calculate average number of vehicles involved on weekends vs weekdays
weekend_comparison<- road_accident_dataset %>% filter(!is.na(`Number of Vehicles Involved`)) %>%group_by(Weekend) %>% summarise(Avg_Vehicles = mean(`Number of Vehicles Involved`))
print(weekend_comparison)
## # A tibble: 2 × 2
## Weekend Avg_Vehicles
## <chr> <dbl>
## 1 Weekday 2.50
## 2 Weekend 2.50
Interpretation-
The average number of vehicles involved in accidents is the same for both weekdays and weekends (2.5). This suggests that, while traffic volume may vary, the number of vehicles per accident does not differ between the two. Because the comparison uses per-accident averages, the unequal number of days (2 weekend days versus 5 weekdays) does not bias it, although pooling all weekdays together could mask differences between individual days.
———————————————————————————-
———————————————————————————-
Q6. What is the distribution of accidents by region and time of day?
regionwise<-road_accident_dataset%>%
group_by(Region,`Time of Day`)%>%
summarise(Total_Accidents = n())
## `summarise()` has grouped output by 'Region'. You can override using the
## `.groups` argument.
print(regionwise)
## # A tibble: 20 × 3
## # Groups: Region [5]
## Region `Time of Day` Total_Accidents
## <chr> <chr> <int>
## 1 Asia Afternoon 6650
## 2 Asia Evening 6546
## 3 Asia Morning 6539
## 4 Asia Night 6616
## 5 Australia Afternoon 6709
## 6 Australia Evening 6657
## 7 Australia Morning 6476
## 8 Australia Night 6783
## 9 Europe Afternoon 6454
## 10 Europe Evening 6592
## 11 Europe Morning 6703
## 12 Europe Night 6596
## 13 North America Afternoon 6628
## 14 North America Evening 6676
## 15 North America Morning 6482
## 16 North America Night 6629
## 17 South America Afternoon 6519
## 18 South America Evening 6550
## 19 South America Morning 6588
## 20 South America Night 6607
Visualization
road_accident_dataset %>%
filter(`Time of Day` == "Night") %>%
group_by(Region) %>%
summarise(Night_Accidents = n()) %>%
ggplot(aes(x = Region, y = Night_Accidents, fill = Region)) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(title = "Night Accidents by Region", x = "Region", y = "Number of Accidents") +
theme_minimal()
Interpretation
Australia has the highest number of night-time accidents (6,783).
North America (6,629), Asia (6,616), and South America (6,607) show similar counts, slightly lower than Australia.
Europe has the lowest count (6,596), although it is within 35 accidents of the other three regions.
The differences between regions are small (under 3% between the highest and the lowest), indicating that night accidents are similarly common in every region.
Conclusion:
Night-time accidents occur at similar levels in all regions, with Australia slightly ahead.
This highlights the importance of improving night-time driving safety measures globally.
Q7. What is the correlation between population density and accident frequency?
accident_density <- road_accident_dataset %>% group_by(Region) %>% summarise(Total_Accidents = n(),
Avg_Population_Density = mean(`Population Density`, na.rm = TRUE) )
print(accident_density)
## # A tibble: 5 × 3
## Region Total_Accidents Avg_Population_Density
## <chr> <int> <dbl>
## 1 Asia 26351 2523.
## 2 Australia 26625 2517.
## 3 Europe 26345 2501.
## 4 North America 26415 2497.
## 5 South America 26264 2495.
INTERPRETATION-
According to the data:
● Australia has the highest number of accidents (26,625) despite having a similar population density as other regions.
● South America has the lowest accident count (26,264) and the lowest population density (~2495).
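● All regions show very similar accident counts (around 26,000+) and closely ranged population densities (between 2495–2523). With so little variation and only five regions, the table gives no evidence of a meaningful relationship between population density and accident frequency.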
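The question asks for a correlation, which can be computed from the five region-level rows printed above. This is a rough sketch only: the densities are the rounded values from the table, and with five data points the confidence interval is necessarily very wide. The object names are illustrative.

```r
# Region-level values copied from the table above (densities are rounded)
region_summary <- data.frame(
  Region = c("Asia", "Australia", "Europe", "North America", "South America"),
  Total_Accidents = c(26351, 26625, 26345, 26415, 26264),
  Avg_Population_Density = c(2523, 2517, 2501, 2497, 2495)
)

# Pearson correlation with a significance test and confidence interval
density_test <- cor.test(region_summary$Avg_Population_Density,
                         region_summary$Total_Accidents)
print(density_test$estimate)  # Pearson r
print(density_test$p.value)
print(density_test$conf.int)  # very wide with only 5 regions
```

A record-level analysis is not possible with cor() alone, because each row is one accident and accident frequency only exists after grouping.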
Q8. Urban vs Rural - More vehicles involved?
# Compare average number of vehicles involved in Urban vs Rural areas
urban_rural <-road_accident_dataset %>% filter(`Urban/Rural` %in% c("Urban", "Rural")) %>% group_by(`Urban/Rural`) %>%
summarise(Avg_Vehicles = mean(as.numeric(`Number of Vehicles Involved`),na.rm = TRUE))
print(urban_rural)
## # A tibble: 2 × 2
## `Urban/Rural` Avg_Vehicles
## <chr> <dbl>
## 1 Rural 2.50
## 2 Urban 2.50
Visualization
road_accident_dataset%>%
group_by(`Urban/Rural`) %>%
summarise(Average_Injuries = mean(`Number of Injuries`, na.rm = TRUE)) %>%
ggplot(aes(x = `Urban/Rural`, y = Average_Injuries, fill =`Urban/Rural`)) +
geom_col() +
labs(title = "Average Injuries: Urban vs Rural",
x = "Area Type",
y = "Average Injuries") +
theme_minimal()
Interpretation:
The average number of vehicles involved is the same in Urban and Rural areas (2.50), so neither area type involves more vehicles per accident.
The plot compares average injuries instead: the average is marginally higher in Rural areas than in Urban areas.
Both area types show high and very close average injury numbers, indicating that accidents in both are similarly severe.
Rural accidents possibly involve higher speeds or delayed medical help, and urban accidents may benefit from lower speed limits or quicker emergency response, but a gap this small gives little support for either explanation.
Conclusion:
Urban and rural accidents involve the same average number of vehicles, and rural accidents result in only marginally more injuries. Factors like road type, emergency services, and speed might influence injury severity, but their effect is not visible in these averages.
Q9. Do accidents in urban areas involve more fatalities and injuries on average than those in rural areas?
# Compare average fatalities and injuries in Urban vs Rural areas
urban_rural_severity <- road_accident_dataset %>% filter(`Urban/Rural` %in% c("Urban", "Rural")) %>% group_by(`Urban/Rural`) %>% summarise(
Avg_Fatalities = mean(as.numeric(`Number of Fatalities`), na.rm = TRUE),
Avg_Injuries = mean(as.numeric(`Number of Injuries`), na.rm = TRUE))
print(urban_rural_severity)
## # A tibble: 2 × 3
## `Urban/Rural` Avg_Fatalities Avg_Injuries
## <chr> <dbl> <dbl>
## 1 Rural 1.99 9.53
## 2 Urban 2.00 9.49
INTERPRETATION-
● The average fatalities in urban areas (2.00) are almost the same as in rural areas (1.99).
● The average number of injuries is slightly higher in rural areas (9.53) compared to urban areas (9.49).
Possible Reasons:
● Emergency response time might be slower in rural areas, contributing to higher injuries.
● Urban areas have more traffic regulation and faster medical access, potentially reducing injury severity.
● Rural roads may lack proper infrastructure (e.g., lighting, signage), leading to slightly worse outcomes even if fatalities remain similar.
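Whether a gap this small is distinguishable from noise can be checked with a two-sample test. The sketch below runs a Welch t-test on simulated toy data (Poisson counts with a mean of 9.5, chosen only to resemble the scale of the injuries column); it illustrates the call and is not a result for the real dataset.

```r
set.seed(42)  # reproducible toy data

# Simulated injuries per accident (illustrative only, not the real data)
toy <- data.frame(
  area     = rep(c("Urban", "Rural"), each = 200),
  injuries = rpois(400, lambda = 9.5)
)

# Welch two-sample t-test: do mean injuries differ between area types?
injury_test <- t.test(injuries ~ area, data = toy)
print(injury_test$estimate)  # group means
print(injury_test$p.value)
```

On the real data, the same call could be applied to `Number of Injuries` and `Urban/Rural` after filtering to the two area types, as done in the code above.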
—————————————————-
——————————————
Q15. How do road types (highway, street, main road) impact accident frequency?
count(road_accident_dataset, `Road Type` , sort = TRUE)
## # A tibble: 3 × 2
## `Road Type` n
## <chr> <int>
## 1 Main Road 44197
## 2 Highway 43920
## 3 Street 43883
Visualization
road_accident_dataset %>%
group_by(`Road Type`) %>%
summarise(Accidents = n()) %>%
ggplot(aes(x = reorder(`Road Type`, Accidents), y = Accidents, fill = `Road Type`)) +
geom_bar(stat = "identity") +
labs(title = "Accidents by Road Type", x = "Road Type", y = "Number of Accidents") +
theme_minimal()
Interpretation:
Accidents are almost evenly distributed across the three road types: Main Road (44,197), Highway (43,920), and Street (43,883).
Main roads show slightly more accidents than the other two types.
Highways and streets have nearly identical counts, differing by only 37 accidents.
Conclusion:
Road type has little influence on accident frequency in this dataset.
The largest gap, between main roads and streets, is 314 accidents, or about 0.7% of a typical category count.
Since each road type accounts for roughly a third of all accidents, factors other than road type, such as driver behavior, may matter more.
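The road-type counts printed above can be tested against an even three-way split with a chi-squared goodness-of-fit test. This is a minimal sketch using the three counts from the output; the variable names are illustrative.

```r
# Counts copied from the count() output above
road_type_counts <- c("Main Road" = 44197, Highway = 43920, Street = 43883)

# Goodness-of-fit test against equal expected frequencies
even_split_test <- chisq.test(road_type_counts)
chi_sq <- unname(even_split_test$statistic)
print(chi_sq)
print(even_split_test$p.value)
```

A large p-value here means the observed counts are compatible with accidents being spread equally across road types.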
Q16. What are the most frequent road conditions during accidents?
Road_condition<-table(road_accident_dataset$`Road Condition`)
print(Road_condition)
##
## Dry Icy Snow-covered Wet
## 32855 32779 33010 33356
Visualization
road_accident_dataset %>%
group_by(`Road Condition`, Year) %>%
summarise(Accidents = n()) %>%
ggplot(aes(x = Year, y = Accidents, color = `Road Condition`)) +
geom_line(linewidth = 1.2) +
facet_wrap(~ `Road Condition`)+
labs(title = "Accidents Over Time by Road Condition", x = "Year", y = "Number of Accidents") +
theme_minimal()
## `summarise()` has grouped output by 'Road Condition'. You can override using
## the `.groups` argument.
INTERPRETATION-
The data show that accidents occur almost equally across all road conditions, with wet (33,356) and snow-covered (33,010) roads slightly ahead of dry (32,855) and icy (32,779) roads. Hazardous surfaces are therefore not consistently associated with more accidents: icy roads have the fewest, and dry roads are not far behind wet ones. This suggests that driver error and behavior remain major factors in accidents regardless of road condition.
Q17. How does traffic volume impact the number of vehicles involved in road accidents?
# Check correlation between traffic volume and vehicles involved in accidents
traffic_corr <- cor(as.numeric(road_accident_dataset$`Traffic Volume`),
as.numeric(road_accident_dataset$`Number of Vehicles Involved`))
print(traffic_corr)
## [1] 0.003168233
INTERPRETATION-
The correlation value is 0.003, which is extremely close to zero.
This indicates no significant relationship between traffic volume and the number of vehicles involved in accidents. In other words, higher or lower traffic volumes do not noticeably affect how many vehicles are involved in an accident.
Possible reasons could include:
● High traffic may lead to slower speeds, which reduces the severity and involvement of multiple vehicles.
● Low traffic might result in over-speeding, still leading to multi-vehicle crashes.
● Other factors like road design, driver behavior, or weather conditions may be more influential than traffic volume itself.
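The claim of no significant relationship can be backed by converting the correlation into a t statistic and p-value, using the standard formula t = r * sqrt((n - 2) / (1 - r^2)). The sketch below uses the printed correlation and the 132,000 rows; since cor() returned a number without a `use` argument, neither column contains missing values, so n is the full row count.

```r
r <- 0.003168233  # correlation printed above
n <- 132000       # number of rows (no NAs, otherwise cor() would return NA)

# t statistic and two-sided p-value for H0: true correlation is zero
t_stat  <- r * sqrt((n - 2) / (1 - r^2))
p_value <- 2 * pt(-abs(t_stat), df = n - 2)
print(c(t = t_stat, p = p_value))
```

The same result is available directly from cor.test() on the two columns, which also reports a confidence interval.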
Q18. Does the number of vehicles involved correlate with the number of injuries?
# Check if more vehicles in accidents cause more injuries
injury_corr <- cor(as.numeric(road_accident_dataset$`Number of Vehicles Involved`),
as.numeric(road_accident_dataset$`Number of Injuries`))
print(injury_corr)
## [1] 0.002233653
INTERPRETATION-
● The correlation value is 0.002, which is very close to zero.
● This means there is almost no relationship between the number of vehicles involved and the number of injuries.
● In simpler terms, more vehicles in an accident does not necessarily lead to more injuries.
Possible Reasons:
● Passenger load varies: a multi-vehicle crash may involve few passengers, while a single bus accident could involve many.
● Safety measures: proper use of seatbelts and airbags can reduce injuries regardless of how many vehicles are involved.
● Randomness of impact points: some multi-vehicle crashes result in only minor injuries due to how the vehicles collide.
——————————————
——————————————
Q19. Does a higher speed limit lead to more fatalities in road accidents?
# Calculate correlation between speed limit and number of fatalities
speed_fatality_corr <- cor(as.numeric(road_accident_dataset$`Speed Limit`), as.numeric(road_accident_dataset$`Number of Fatalities`))
print(speed_fatality_corr)
## [1] 0.0006205779
Visualization
road_accident_dataset%>%
group_by(`Speed Limit`) %>%
summarise(Average_Fatalities = mean(`Number of Fatalities`, na.rm = TRUE)) %>%
ggplot(aes(x = `Speed Limit`, y = Average_Fatalities)) +
geom_point(color = "darkgreen") +
geom_smooth(method = "lm", se = TRUE, color = "black") +
labs(title = "Average Fatalities vs Speed Limit", x = "Speed Limit", y = "Average Number of Fatalities") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
INTERPRETATION-
The scatter plot displays the relationship between Speed Limit and Average Number of Fatalities.
The regression line is nearly horizontal, showing almost no slope.
The data points are widely spread around the line without forming a strong pattern.
The confidence interval (grey area) is wide, indicating a high uncertainty in the trend.
The average number of fatalities stays roughly around 2, regardless of the speed limit.
Conclusion:
The analysis shows no significant relationship between speed limit and average fatalities. The number of fatalities remains fairly constant across different speed limits, suggesting that other factors may have a greater influence on fatality rates.
Q20. What are the most common causes of accidents in different regions?
# Count how many times each accident cause occurred in each region
accident_cause_region <- table(road_accident_dataset$Region, road_accident_dataset$`Accident Cause`)
print(accident_cause_region)
##
## Distracted Driving Drunk Driving Mechanical Failure Speeding
## Asia 5250 5284 5285 5286
## Australia 5386 5256 5320 5394
## Europe 5252 5359 5177 5287
## North America 5272 5297 5333 5225
## South America 5300 5310 5228 5254
##
## Weather
## Asia 5246
## Australia 5269
## Europe 5270
## North America 5288
## South America 5172
Interpretation:
● Across all regions, the five causes occur at similar frequencies (roughly 5,170 to 5,400 accidents per region and cause); no single cause clearly dominates.
● Europe has the highest number of drunk driving-related accidents (5,359).
● Mechanical Failure peaks in North America (5,333) and Speeding in Australia (5,394), but the margins over other regions are small.
Possible Reasons:
● Drunk Driving: May reflect cultural factors, alcohol laws, and enforcement differences.
● Distracted Driving: Possibly due to increased mobile device usage and in-car tech distractions.
● Mechanical Failure: Could point to poor vehicle maintenance practices or older vehicles in use.
● Weather: Less dominant but still relevant, especially in areas with diverse climates like North America and Europe.
● Speeding: Higher in regions with larger highway systems or less strict speed enforcement.
——————————————
——————————————
Q21. What is the average number of injuries per accident?
mean(road_accident_dataset$`Number of Injuries`, na.rm = TRUE)
## [1] 9.508205
INTERPRETATION-
Based on the analysis, the average number of injuries per accident is approximately 9.5. This indicates that each reported accident results in about 9 to 10 injuries on average, which is relatively high. It suggests that many accidents involve multiple individuals, possibly due to multi-vehicle collisions or accidents in high-occupancy vehicles like buses. This emphasizes the need for stricter safety regulations and quicker emergency response to reduce injury impact.
Q22. How do insurance claims vary by accident cause?
insurance_claims<-tapply(road_accident_dataset$`Insurance Claims`,
road_accident_dataset$`Accident Cause`,mean, na.rm = TRUE)
print(insurance_claims)
## Distracted Driving Drunk Driving Mechanical Failure Speeding
## 4.488549 4.485588 4.491250 4.494101
## Weather
## 4.518804
Visualization
road_accident_dataset %>%
ggplot(aes(x = `Accident Cause`, y = `Insurance Claims`, fill = `Accident Cause`)) +
geom_boxplot() +
labs(title = "Distribution of Insurance Claims by Accident Cause",
x = "Accident Cause",
y = "Insurance Claim Amount") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
ANOVA analysis:
# Filter data (optional, in case of NA values)
insurance_data <- road_accident_dataset %>%
filter(!is.na(`Accident Cause`), !is.na(`Insurance Claims`))
#Run ANOVA
anova_result <- aov(`Insurance Claims` ~ `Accident Cause`, data = insurance_data)
#View ANOVA summary
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## `Accident Cause` 4 19 4.665 0.567 0.686
## Residuals 131995 1085235 8.222
INTERPRETATION-
The boxplot shows the distribution of insurance claim amounts for different accident causes: Distracted Driving, Drunk Driving, Mechanical Failure, Speeding, and Weather.
Across all accident causes, the median insurance claim amount (the thick black horizontal line inside each box) appears fairly similar, centered around the middle of the vertical scale.
Spread (IQR) — the height of the colored boxes (interquartile range) — looks very similar across all causes, suggesting the variability in claim amounts is comparable.
Whiskers (lines extending from the boxes) show that there are a few very low and very high claim amounts, but no major outliers are visible.
The overall distribution shapes are consistent, indicating that no single accident cause leads to dramatically higher or lower insurance claims compared to others.
Conclusion:
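Insurance claims do not vary greatly depending on whether the accident was due to distracted driving, drunk driving, mechanical failure, speeding, or weather conditions.
The one-way ANOVA supports this: F(4, 131995) = 0.567, p = 0.686, so the differences in mean claims between causes (4.49 to 4.52) are not statistically significant.
Although minor differences exist, accident cause does not seem to be a strong driver of insurance claim variation in this dataset.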
Implication: Other factors (such as severity of accidents, vehicle value, or location) might have a stronger impact on insurance claim amounts than the cause alone.
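The ANOVA table above can be read by hand: the F value is the ratio of the two mean squares, and the p-value is the upper tail of the F distribution at that ratio. The sketch below reproduces both from the printed figures and adds eta squared (the share of variance explained by accident cause), which is approximate because the printed sum of squares is rounded. Variable names are illustrative.

```r
# Values copied from the ANOVA summary above
ss_between <- 19
df_between <- 4
ms_between <- 4.665
ss_within  <- 1085235
df_within  <- 131995
ms_within  <- 8.222

# F is the ratio of between-group to within-group mean squares
f_value <- ms_between / ms_within

# p-value: probability of an F at least this large if all group means are equal
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

# Eta squared: share of total variance explained by accident cause (approximate)
eta_squared <- ss_between / (ss_between + ss_within)
print(c(F = f_value, p = p_value, eta_sq = eta_squared))
```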
In this project, a detailed exploratory data analysis was conducted on a global road accident dataset containing 132,000 records across 30 features. Various aspects like accident timing, geography, driver demographics, vehicle conditions, road environments, and accident aftermath were analyzed.
The analysis showed that accident counts are nearly uniform across time of day, month, region, road type, and road condition, and are consistent across driver age groups and genders. External factors such as traffic volume and speed limits showed no meaningful relationship with the number of vehicles involved or with fatalities.
Outlier detection with the IQR rule flagged no extreme values. Correlation analyses showed that factors like traffic volume, number of vehicles involved, and alcohol levels had very weak relationships with injuries and fatalities, so none of the measured variables stands out as a strong predictor of accident outcomes.
Overall, the findings suggest that multi-dimensional strategies focusing on human behavior, emergency response improvements, and road safety awareness can help in significantly reducing road accidents worldwide.
——————————————————————————————————————————————————————————