\[ \]:
# Read CSV
df.road <- read.csv("dft-road-casualty-statistics-casualty-provisional-mid-year-unvalidated-2023.csv")
# View the structure of the dataset
str(df.road)
'data.frame': 62674 obs. of 19 variables:
$ collision_index : chr "2023010419171" "2023010419183" "2023010419183" "2023010419189" ...
$ collision_year : int 2023 2023 2023 2023 2023 2023 2023 2023 2023 2023 ...
$ collision_reference : chr "010419171" "010419183" "010419183" "010419189" ...
$ vehicle_reference : int 1 2 3 1 2 2 1 1 1 1 ...
$ casualty_reference : int 1 1 2 1 1 1 1 1 1 1 ...
$ casualty_class : int 3 1 2 1 1 1 3 1 1 3 ...
$ sex_of_casualty : int 2 1 2 1 1 1 1 1 1 1 ...
$ age_of_casualty : int 20 25 38 50 34 24 65 22 20 33 ...
$ age_band_of_casualty : int 4 5 7 8 6 5 9 5 4 6 ...
$ casualty_severity : int 3 3 3 3 3 3 3 3 3 3 ...
$ pedestrian_location : int 5 0 0 0 0 0 4 0 0 8 ...
$ pedestrian_movement : int 1 0 0 0 0 0 4 0 0 1 ...
$ car_passenger : int 0 0 2 0 0 0 0 0 0 0 ...
$ bus_or_coach_passenger : int 0 0 0 0 0 0 0 0 0 0 ...
$ pedestrian_road_maintenance_worker: int 0 0 0 0 0 0 0 0 0 0 ...
$ casualty_type : int 0 9 9 9 1 3 0 1 9 0 ...
$ casualty_home_area_type : int 1 1 -1 1 1 -1 1 2 1 1 ...
$ casualty_imd_decile : int 10 3 -1 5 2 -1 5 10 3 8 ...
$ lsoa_of_casualty : chr "E01030370" "E01001546" "-1" "E01002443" ...
From the 19 variables, we note that majority of the data types are integers. However, these integers are represented by categorical data hence we continue cleaning the data by defining the variables.
\[ \]:
summary(df.road)
collision_index collision_year collision_reference vehicle_reference
Length:62674 Min. :2023 Length:62674 Min. : 1.000
Class :character 1st Qu.:2023 Class :character 1st Qu.: 1.000
Mode :character Median :2023 Mode :character Median : 1.000
Mean :2023 Mean : 1.468
3rd Qu.:2023 3rd Qu.: 2.000
Max. :2023 Max. :992.000
casualty_reference casualty_class sex_of_casualty age_of_casualty
Min. : 1.000 Min. :1.000 Min. :-1.000 Min. : -1.00
1st Qu.: 1.000 1st Qu.:1.000 1st Qu.: 1.000 1st Qu.: 22.00
Median : 1.000 Median :1.000 Median : 1.000 Median : 34.00
Mean : 1.375 Mean :1.491 Mean : 1.358 Mean : 36.95
3rd Qu.: 1.000 3rd Qu.:2.000 3rd Qu.: 2.000 3rd Qu.: 51.00
Max. :70.000 Max. :3.000 Max. : 9.000 Max. :102.00
age_band_of_casualty casualty_severity pedestrian_location pedestrian_movement
Min. :-1.000 Min. :1.000 Min. :-1.0000 Min. :0.0000
1st Qu.: 5.000 1st Qu.:3.000 1st Qu.: 0.0000 1st Qu.:0.0000
Median : 6.000 Median :3.000 Median : 0.0000 Median :0.0000
Mean : 6.315 Mean :2.785 Mean : 0.8087 Mean :0.6634
3rd Qu.: 8.000 3rd Qu.:3.000 3rd Qu.: 0.0000 3rd Qu.:0.0000
Max. :11.000 Max. :3.000 Max. :10.0000 Max. :9.0000
car_passenger bus_or_coach_passenger pedestrian_road_maintenance_worker
Min. :-1.0000 Min. :-1.00000 Min. :-1.00000
1st Qu.: 0.0000 1st Qu.: 0.00000 1st Qu.: 0.00000
Median : 0.0000 Median : 0.00000 Median : 0.00000
Mean : 0.2246 Mean : 0.06234 Mean : 0.03721
3rd Qu.: 0.0000 3rd Qu.: 0.00000 3rd Qu.: 0.00000
Max. : 9.0000 Max. : 9.00000 Max. : 2.00000
casualty_type casualty_home_area_type casualty_imd_decile
Min. :-1.000 Min. :-1.00 Min. :-1.00
1st Qu.: 1.000 1st Qu.: 1.00 1st Qu.: 2.00
Median : 9.000 Median : 1.00 Median : 4.00
Mean : 9.508 Mean : 1.05 Mean : 4.26
3rd Qu.: 9.000 3rd Qu.: 1.00 3rd Qu.: 7.00
Max. :98.000 Max. : 3.00 Max. :10.00
lsoa_of_casualty
Length:62674
Class :character
Mode :character
Proceeding with this we check for missing values
\[ \]:
sum(is.na(df.road))
0
At this point, we have noted there are no missing values within the dataset.
Removing lines with -1 values as the value is unknown and will impact modelling
\[ \]:
# Install and load the dplyr package
install.packages("dplyr")
library(dplyr)
# Specify the columns to check for -1
columns_to_check <- c("vehicle_reference", "sex_of_casualty", "pedestrian_road_maintenance_worker",
"pedestrian_location", "lsoa_of_casualty", "casualty_type",
"casualty_imd_decile", "casualty_home_area_type", "car_passenger",
"age_of_casualty", "age_band_of_casualty")
# Filter out rows with -1 in any of the specified columns
df.road_clean <- df.road %>%
filter(!if_any(all_of(columns_to_check), ~ . == -1) & sex_of_casualty != 9)
# Print the first few rows of the cleaned data frame to verify
head(df.road_clean)
str(df.road_clean)
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
collision_index | collision_year | collision_reference | vehicle_reference | casualty_reference | casualty_class | sex_of_casualty | age_of_casualty | age_band_of_casualty | casualty_severity | pedestrian_location | pedestrian_movement | car_passenger | bus_or_coach_passenger | pedestrian_road_maintenance_worker | casualty_type | casualty_home_area_type | casualty_imd_decile | lsoa_of_casualty | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <int> | <chr> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <chr> | |
1 | 2023010419171 | 2023 | 010419171 | 1 | 1 | 3 | 2 | 20 | 4 | 3 | 5 | 1 | 0 | 0 | 0 | 0 | 1 | 10 | E01030370 |
2 | 2023010419183 | 2023 | 010419183 | 2 | 1 | 1 | 1 | 25 | 5 | 3 | 0 | 0 | 0 | 0 | 0 | 9 | 1 | 3 | E01001546 |
3 | 2023010419189 | 2023 | 010419189 | 1 | 1 | 1 | 1 | 50 | 8 | 3 | 0 | 0 | 0 | 0 | 0 | 9 | 1 | 5 | E01002443 |
4 | 2023010419191 | 2023 | 010419191 | 2 | 1 | 1 | 1 | 34 | 6 | 3 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 2 | E01004679 |
5 | 2023010419198 | 2023 | 010419198 | 1 | 1 | 3 | 1 | 65 | 9 | 3 | 4 | 4 | 0 | 0 | 0 | 0 | 1 | 5 | E01023593 |
6 | 2023010419201 | 2023 | 010419201 | 1 | 1 | 1 | 1 | 22 | 5 | 3 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 10 | E01026413 |
'data.frame': 52065 obs. of 19 variables:
$ collision_index : chr "2023010419171" "2023010419183" "2023010419189" "2023010419191" ...
$ collision_year : int 2023 2023 2023 2023 2023 2023 2023 2023 2023 2023 ...
$ collision_reference : chr "010419171" "010419183" "010419189" "010419191" ...
$ vehicle_reference : int 1 2 1 2 1 1 1 1 1 1 ...
$ casualty_reference : int 1 1 1 1 1 1 1 1 1 1 ...
$ casualty_class : int 3 1 1 1 3 1 1 3 1 1 ...
$ sex_of_casualty : int 2 1 1 1 1 1 1 1 2 1 ...
$ age_of_casualty : int 20 25 50 34 65 22 20 33 58 40 ...
$ age_band_of_casualty : int 4 5 8 6 9 5 4 6 9 7 ...
$ casualty_severity : int 3 3 3 3 3 3 3 3 3 3 ...
$ pedestrian_location : int 5 0 0 0 4 0 0 8 0 0 ...
$ pedestrian_movement : int 1 0 0 0 4 0 0 1 0 0 ...
$ car_passenger : int 0 0 0 0 0 0 0 0 0 0 ...
$ bus_or_coach_passenger : int 0 0 0 0 0 0 0 0 0 0 ...
$ pedestrian_road_maintenance_worker: int 0 0 0 0 0 0 0 0 0 0 ...
$ casualty_type : int 0 9 9 1 0 1 9 0 9 9 ...
$ casualty_home_area_type : int 1 1 1 1 1 2 1 1 1 1 ...
$ casualty_imd_decile : int 10 3 5 2 5 10 3 8 5 2 ...
$ lsoa_of_casualty : chr "E01030370" "E01001546" "E01002443" "E01004679" ...
\[ \]:
# Get the total number of lines in the dataset
total_lines <- nrow(df.road)
print(total_lines)
# Get the total number of lines in the clean dataset
total_lines_clean <- nrow(df.road_clean)
print(total_lines_clean)
# Divide the total number of lines by new dataset to see how many % has been removed
result <- 1-(total_lines_clean/total_lines)
# Print the result
print(result)
[1] 62674
[1] 52065
[1] 0.1692727
4.1.1 Defining the variables
The variables defined are age, casualty home area, casualty classes and also gender of casualty. They were defined to help us proceed to data visualisation.
Age
\[ \]:
# Define a function to map age bands to intervals
map_age_band <- function(age_band) {
if (age_band == 1) {
return("0 - 5")
} else if (age_band <= 10 && age_band > 0) {
lower_bound <- (age_band - 1) * 5 + 1
upper_bound <- age_band * 5
return(paste(lower_bound, "-", upper_bound))
} else if (age_band == -1) {
return("Unknown")
} else {
return("> 50")
}
}
# Add a new column using mutate
df.road_clean <- df.road_clean %>%
mutate(age_interval = sapply(age_band_of_casualty, map_age_band))
head(df.road_clean)
collision_index | collision_year | collision_reference | vehicle_reference | casualty_reference | casualty_class | sex_of_casualty | age_of_casualty | age_band_of_casualty | casualty_severity | pedestrian_location | pedestrian_movement | car_passenger | bus_or_coach_passenger | pedestrian_road_maintenance_worker | casualty_type | casualty_home_area_type | casualty_imd_decile | lsoa_of_casualty | age_interval | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <int> | <chr> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <chr> | <chr> | |
1 | 2023010419171 | 2023 | 010419171 | 1 | 1 | 3 | 2 | 20 | 4 | 3 | 5 | 1 | 0 | 0 | 0 | 0 | 1 | 10 | E01030370 | 16 - 20 |
2 | 2023010419183 | 2023 | 010419183 | 2 | 1 | 1 | 1 | 25 | 5 | 3 | 0 | 0 | 0 | 0 | 0 | 9 | 1 | 3 | E01001546 | 21 - 25 |
3 | 2023010419189 | 2023 | 010419189 | 1 | 1 | 1 | 1 | 50 | 8 | 3 | 0 | 0 | 0 | 0 | 0 | 9 | 1 | 5 | E01002443 | 36 - 40 |
4 | 2023010419191 | 2023 | 010419191 | 2 | 1 | 1 | 1 | 34 | 6 | 3 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 2 | E01004679 | 26 - 30 |
5 | 2023010419198 | 2023 | 010419198 | 1 | 1 | 3 | 1 | 65 | 9 | 3 | 4 | 4 | 0 | 0 | 0 | 0 | 1 | 5 | E01023593 | 41 - 45 |
6 | 2023010419201 | 2023 | 010419201 | 1 | 1 | 1 | 1 | 22 | 5 | 3 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 10 | E01026413 | 21 - 25 |
Casualty Home Area
\[ \]:
# Define a function to classify casualty home area
casualty_home.area_desc <- function(casualty_home_area_type) {
if (casualty_home_area_type == 1) {
return("Urban")
} else if (casualty_home_area_type == 2) {
return("Semi-urban")
} else if (casualty_home_area_type == 3) {
return("Rural")
} else {
return("Other")
}
}
# Apply the function to create a new column
df.road_clean <- df.road_clean %>%
mutate(casualty_home.area_desc = sapply(casualty_home_area_type, casualty_home.area_desc))
# Display the first few rows to check the result
head(df.road_clean)
collision_index | collision_year | collision_reference | vehicle_reference | casualty_reference | casualty_class | sex_of_casualty | age_of_casualty | age_band_of_casualty | casualty_severity | ⋯ | pedestrian_movement | car_passenger | bus_or_coach_passenger | pedestrian_road_maintenance_worker | casualty_type | casualty_home_area_type | casualty_imd_decile | lsoa_of_casualty | age_interval | casualty_home.area_desc | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <int> | <chr> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | ⋯ | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <chr> | <chr> | <chr> | |
1 | 2023010419171 | 2023 | 010419171 | 1 | 1 | 3 | 2 | 20 | 4 | 3 | ⋯ | 1 | 0 | 0 | 0 | 0 | 1 | 10 | E01030370 | 16 - 20 | Urban |
2 | 2023010419183 | 2023 | 010419183 | 2 | 1 | 1 | 1 | 25 | 5 | 3 | ⋯ | 0 | 0 | 0 | 0 | 9 | 1 | 3 | E01001546 | 21 - 25 | Urban |
3 | 2023010419189 | 2023 | 010419189 | 1 | 1 | 1 | 1 | 50 | 8 | 3 | ⋯ | 0 | 0 | 0 | 0 | 9 | 1 | 5 | E01002443 | 36 - 40 | Urban |
4 | 2023010419191 | 2023 | 010419191 | 2 | 1 | 1 | 1 | 34 | 6 | 3 | ⋯ | 0 | 0 | 0 | 0 | 1 | 1 | 2 | E01004679 | 26 - 30 | Urban |
5 | 2023010419198 | 2023 | 010419198 | 1 | 1 | 3 | 1 | 65 | 9 | 3 | ⋯ | 4 | 0 | 0 | 0 | 0 | 1 | 5 | E01023593 | 41 - 45 | Urban |
6 | 2023010419201 | 2023 | 010419201 | 1 | 1 | 1 | 1 | 22 | 5 | 3 | ⋯ | 0 | 0 | 0 | 0 | 1 | 2 | 10 | E01026413 | 21 - 25 | Semi-urban |
Casualty classes
\[ \]:
# Define a function to classify casualty classes
casualty_class_desc <- function(casualty_class) {
if (casualty_class == 1) {
return("Driver")
} else if (casualty_class == 2) {
return("Passenger")
} else if (casualty_class == 3) {
return("Pedestrian")
} else {
return("Other")
}
}
# Apply the function to create a new column
df.road_clean <- df.road_clean %>%
mutate(casualty_class_desc = sapply(casualty_class, casualty_class_desc))
# Define a function to classify casualty severity
casualty_severity_desc <- function(casualty_severity) {
if (casualty_severity == 1) {
return("Fatal")
} else if (casualty_severity == 2) {
return("Serious")
} else if (casualty_severity == 3) {
return("Slight")
} else {
return("Other")
}
}
# Apply the function to create a new column
df.road_clean <- df.road_clean %>%
mutate(casualty_severity_desc = sapply(casualty_severity, casualty_severity_desc))
# Define a function to classify casualty gender
casualty_sex_desc <- function(sex_of_casualty) {
if (sex_of_casualty == 1) {
return("Male")
} else if (sex_of_casualty == 2) {
return("Female")
} else {
return("Unknown")
}
}
# Apply the function to create a new column
df.road_clean <- df.road_clean %>%
mutate(casualty_sex_desc = sapply(sex_of_casualty, casualty_sex_desc))
# Define a function to classify car passenger
car_passenger_desc <- function(car_passenger) {
if (car_passenger == 0) {
return("Yes - Car")
} else {
return("No - Others")
}
}
# Apply the function to create a new column
df.road_clean <- df.road_clean %>%
mutate(car_passenger_desc = sapply(car_passenger, car_passenger_desc))
# Display the first few rows to check the result
head(df.road_clean)
# Check the summary of the cleaned data frame
str(df.road_clean)
collision_index | collision_year | collision_reference | vehicle_reference | casualty_reference | casualty_class | sex_of_casualty | age_of_casualty | age_band_of_casualty | casualty_severity | ⋯ | casualty_type | casualty_home_area_type | casualty_imd_decile | lsoa_of_casualty | age_interval | casualty_home.area_desc | casualty_class_desc | casualty_severity_desc | casualty_sex_desc | car_passenger_desc | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <int> | <chr> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | ⋯ | <int> | <int> | <int> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | |
1 | 2023010419171 | 2023 | 010419171 | 1 | 1 | 3 | 2 | 20 | 4 | 3 | ⋯ | 0 | 1 | 10 | E01030370 | 16 - 20 | Urban | Pedestrian | Slight | Female | Yes - Car |
2 | 2023010419183 | 2023 | 010419183 | 2 | 1 | 1 | 1 | 25 | 5 | 3 | ⋯ | 9 | 1 | 3 | E01001546 | 21 - 25 | Urban | Driver | Slight | Male | Yes - Car |
3 | 2023010419189 | 2023 | 010419189 | 1 | 1 | 1 | 1 | 50 | 8 | 3 | ⋯ | 9 | 1 | 5 | E01002443 | 36 - 40 | Urban | Driver | Slight | Male | Yes - Car |
4 | 2023010419191 | 2023 | 010419191 | 2 | 1 | 1 | 1 | 34 | 6 | 3 | ⋯ | 1 | 1 | 2 | E01004679 | 26 - 30 | Urban | Driver | Slight | Male | Yes - Car |
5 | 2023010419198 | 2023 | 010419198 | 1 | 1 | 3 | 1 | 65 | 9 | 3 | ⋯ | 0 | 1 | 5 | E01023593 | 41 - 45 | Urban | Pedestrian | Slight | Male | Yes - Car |
6 | 2023010419201 | 2023 | 010419201 | 1 | 1 | 1 | 1 | 22 | 5 | 3 | ⋯ | 1 | 2 | 10 | E01026413 | 21 - 25 | Semi-urban | Driver | Slight | Male | Yes - Car |
'data.frame': 52065 obs. of 25 variables:
$ collision_index : chr "2023010419171" "2023010419183" "2023010419189" "2023010419191" ...
$ collision_year : int 2023 2023 2023 2023 2023 2023 2023 2023 2023 2023 ...
$ collision_reference : chr "010419171" "010419183" "010419189" "010419191" ...
$ vehicle_reference : int 1 2 1 2 1 1 1 1 1 1 ...
$ casualty_reference : int 1 1 1 1 1 1 1 1 1 1 ...
$ casualty_class : int 3 1 1 1 3 1 1 3 1 1 ...
$ sex_of_casualty : int 2 1 1 1 1 1 1 1 2 1 ...
$ age_of_casualty : int 20 25 50 34 65 22 20 33 58 40 ...
$ age_band_of_casualty : int 4 5 8 6 9 5 4 6 9 7 ...
$ casualty_severity : int 3 3 3 3 3 3 3 3 3 3 ...
$ pedestrian_location : int 5 0 0 0 4 0 0 8 0 0 ...
$ pedestrian_movement : int 1 0 0 0 4 0 0 1 0 0 ...
$ car_passenger : int 0 0 0 0 0 0 0 0 0 0 ...
$ bus_or_coach_passenger : int 0 0 0 0 0 0 0 0 0 0 ...
$ pedestrian_road_maintenance_worker: int 0 0 0 0 0 0 0 0 0 0 ...
$ casualty_type : int 0 9 9 1 0 1 9 0 9 9 ...
$ casualty_home_area_type : int 1 1 1 1 1 2 1 1 1 1 ...
$ casualty_imd_decile : int 10 3 5 2 5 10 3 8 5 2 ...
$ lsoa_of_casualty : chr "E01030370" "E01001546" "E01002443" "E01004679" ...
$ age_interval : chr "16 - 20" "21 - 25" "36 - 40" "26 - 30" ...
$ casualty_home.area_desc : chr "Urban" "Urban" "Urban" "Urban" ...
$ casualty_class_desc : chr "Pedestrian" "Driver" "Driver" "Driver" ...
$ casualty_severity_desc : chr "Slight" "Slight" "Slight" "Slight" ...
$ casualty_sex_desc : chr "Female" "Male" "Male" "Male" ...
$ car_passenger_desc : chr "Yes - Car" "Yes - Car" "Yes - Car" "Yes - Car" ...
Gender of casualty
\[ \]:
4.1.2 Data Visualisation
\[ \]:
hist(df.road_clean$age_of_casualty,
main = "Distribution of Casualty Age",
xlab = "Age",
col = "skyblue",
border = "white")
# Convert age_interval to a factor variable with custom levels
df.road_clean$age_interval <- factor(df.road_clean$age_interval, levels = c("0 - 5", "6 - 10", "11 - 15", "16 - 20", "21 - 25", "26 - 30", "31 - 35", "36 - 40", "41 - 45", "46 - 50", "> 50"))
# Tabulate the counts of each age interval
age_interval_counts <- table(df.road_clean$age_interval)
# Create a bar plot
barplot(age_interval_counts,
main = "Distribution of Age Intervals",
xlab = "Age Intervals",
ylab = "Count",
col = "skyblue",
border = "white",
las = 3) # Rotate x-axis labels by 90 degrees
From the above histogram, we note that majority of the road casualities below to adults from 20s to 40s years old.
\[ \]:
# Create a table of casualty class descriptions
casualty_class_desc <- c("Driver", "Passenger", "Pedestrian", "Other")
class_counts <- table(df.road_clean$casualty_class_desc)
# Bar plot of casualty class descriptions
barplot(class_counts,
main = "Frequency of Casualty Types",
xlab = "Casualty Class",
ylab = "Frequency",
col = "skyblue",
border ="white",
names.arg = names(class_counts))
# Load necessary libraries
library(dplyr)
library(RColorBrewer)
# Calculate frequency of each casualty class description
class_counts <- table(df.road_clean$casualty_class_desc)
# Define a pastel color palette
pastel_palette <- brewer.pal(n = 5, name = "Pastel1")
# Create a pie chart
pie(class_counts,
main = "Distribution of Casualty Class Descriptions",
col = pastel_palette,
cex = 0.8,
labels = paste(names(class_counts), ": ", class_counts))
# Add a legend
legend("topright",
legend = names(class_counts),
fill = pastel_palette,
cex = 0.8,
title = "Casualty Class Desc.")
The dataset is mostly made up of casualities of drivers, followed up by passenger and finally pedestrian.
\[ \]:
# Show a table of casualty severity codes and their descriptions
severity_table <- table(df.road_clean$casualty_severity, df.road_clean$casualty_severity_desc)
severity_table
barplot(table(df.road_clean$casualty_class),
main = "Frequency of Casualty Classes",
xlab = "Casualty Class",
ylab = "Frequency",
col = "skyblue")
boxplot(df.road_clean$age_of_casualty ~ df.road_clean$casualty_severity,
main = "Age Distribution by Casualty Severity",
xlab = "Casualty Severity",
ylab = "Age",
col = "skyblue")
severity_by_class <- table(df.road_clean$casualty_class, df.road_clean$casualty_severity)
barplot(severity_by_class,
main = "Casualty Severity by Casualty Class",
xlab = "Casualty Class",
ylab = "Frequency",
col = c("skyblue", "salmon","purple"),
legend = rownames(severity_by_class))
Fatal Serious Slight
1 551 0 0
2 0 9768 0
3 0 0 41746
\[ \]:
# Count the frequency of each combination of casualty severity and gender
severity_gender_counts <- table(df.road_clean$casualty_sex_desc, df.road_clean$casualty_severity_desc)
# Plot the bar chart
barplot(severity_gender_counts,
beside = TRUE, # To plot bars side by side
main = "Casualty Severity by Gender",
xlab = "Frequency",
ylab = "Casualty Severity",
col = c("skyblue", "salmon","purple"), # Specify colors for bars
legend = rownames(severity_gender_counts), # Add legend
args.legend = list(title = "Gender")) # Add legend title
\[ \]:
# Install the gplots package if not already installed
install.packages("gplots")
# Load the gplots package
library(gplots)
# Filter out non-numeric columns
numeric_df <- df.road_clean[sapply(df.road_clean, is.numeric)]
# Remove columns with zero variance
numeric_df <- numeric_df[, sapply(numeric_df, function(x) var(x, na.rm = TRUE) != 0)]
# Calculate correlation matrix
correlation_matrix <- cor(numeric_df, use = "pairwise.complete.obs")
# Create the heatmap with correlation values
heatmap.2(correlation_matrix,
main = "Correlation Heatmap of Numerical Variables",
col = cm.colors(256),
trace = "none",
symm = TRUE,
density.info = "none",
key = TRUE,
keysize = 1.2,
key.title = NA,
key.xlab = "Correlation",
cexRow = 0.8,
cexCol = 0.8,
labRow = names(correlation_matrix),
labCol = names(correlation_matrix),
notecol = "black",
notecex = 0.8,
margins = c(10, 10))
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
also installing the dependencies ‘bitops’, ‘gtools’, ‘caTools’
Attaching package: ‘gplots’
The following object is masked from ‘package:stats’:
lowess
Based on heatmap above, we can see strong relationships between the following:
pedestrian_location and casualty class which indicates that the type of driver and the location of the pedestrian movement at the time of accident.A positive correlation between pedestrian_location and casualty_class suggests that the severity of casualties (reflected by casualty_class) is related to where the pedestrian was located when the accident happened.
The heatmap likely indicates a correlation between the severity of the casualty (reflected by casualty_class) and whether there was a car passenger involved in the accident (represented by car_passenger). A positive correlation between these variables could suggest that accidents with more severe casualties (higher casualty_class values) are also more likely to involve car passengers. There are a few possible explanations for this: - Cars with more occupants (including passengers) could be involved in higher-impact collisions that cause more severe injuries. - Passengers might also be vulnerable to injuries in severe accidents, depending on the seating position, type of collision, etc.
\[ \]:
str(df.road_clean)
summary(df.road_clean)
'data.frame': 52065 obs. of 25 variables:
$ collision_index : chr "2023010419171" "2023010419183" "2023010419189" "2023010419191" ...
$ collision_year : int 2023 2023 2023 2023 2023 2023 2023 2023 2023 2023 ...
$ collision_reference : chr "010419171" "010419183" "010419189" "010419191" ...
$ vehicle_reference : int 1 2 1 2 1 1 1 1 1 1 ...
$ casualty_reference : int 1 1 1 1 1 1 1 1 1 1 ...
$ casualty_class : int 3 1 1 1 3 1 1 3 1 1 ...
$ sex_of_casualty : int 2 1 1 1 1 1 1 1 2 1 ...
$ age_of_casualty : int 20 25 50 34 65 22 20 33 58 40 ...
$ age_band_of_casualty : int 4 5 8 6 9 5 4 6 9 7 ...
$ casualty_severity : int 3 3 3 3 3 3 3 3 3 3 ...
$ pedestrian_location : int 5 0 0 0 4 0 0 8 0 0 ...
$ pedestrian_movement : int 1 0 0 0 4 0 0 1 0 0 ...
$ car_passenger : int 0 0 0 0 0 0 0 0 0 0 ...
$ bus_or_coach_passenger : int 0 0 0 0 0 0 0 0 0 0 ...
$ pedestrian_road_maintenance_worker: int 0 0 0 0 0 0 0 0 0 0 ...
$ casualty_type : int 0 9 9 1 0 1 9 0 9 9 ...
$ casualty_home_area_type : int 1 1 1 1 1 2 1 1 1 1 ...
$ casualty_imd_decile : int 10 3 5 2 5 10 3 8 5 2 ...
$ lsoa_of_casualty : chr "E01030370" "E01001546" "E01002443" "E01004679" ...
$ age_interval : Factor w/ 11 levels "0 - 5","6 - 10",..: 4 5 8 6 9 5 4 6 9 7 ...
$ casualty_home.area_desc : chr "Urban" "Urban" "Urban" "Urban" ...
$ casualty_class_desc : chr "Pedestrian" "Driver" "Driver" "Driver" ...
$ casualty_severity_desc : chr "Slight" "Slight" "Slight" "Slight" ...
$ casualty_sex_desc : chr "Female" "Male" "Male" "Male" ...
$ car_passenger_desc : chr "Yes - Car" "Yes - Car" "Yes - Car" "Yes - Car" ...
collision_index collision_year collision_reference vehicle_reference
Length:52065 Min. :2023 Length:52065 Min. : 1.00
Class :character 1st Qu.:2023 Class :character 1st Qu.: 1.00
Mode :character Median :2023 Mode :character Median : 1.00
Mean :2023 Mean : 1.47
3rd Qu.:2023 3rd Qu.: 2.00
Max. :2023 Max. :992.00
casualty_reference casualty_class sex_of_casualty age_of_casualty
Min. : 1.000 Min. :1.000 Min. :1.000 Min. : 0.00
1st Qu.: 1.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.: 23.00
Median : 1.000 Median :1.000 Median :1.000 Median : 35.00
Mean : 1.342 Mean :1.456 Mean :1.382 Mean : 38.06
3rd Qu.: 1.000 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.: 51.00
Max. :70.000 Max. :3.000 Max. :2.000 Max. :102.00
age_band_of_casualty casualty_severity pedestrian_location pedestrian_movement
Min. : 1.000 Min. :1.000 Min. : 0.0000 Min. :0.0000
1st Qu.: 5.000 1st Qu.:3.000 1st Qu.: 0.0000 1st Qu.:0.0000
Median : 6.000 Median :3.000 Median : 0.0000 Median :0.0000
Mean : 6.519 Mean :2.791 Mean : 0.7703 Mean :0.6357
3rd Qu.: 8.000 3rd Qu.:3.000 3rd Qu.: 0.0000 3rd Qu.:0.0000
Max. :11.000 Max. :3.000 Max. :10.0000 Max. :9.0000
car_passenger bus_or_coach_passenger pedestrian_road_maintenance_worker
Min. :0.0000 Min. :-1.00000 Min. :0.0000
1st Qu.:0.0000 1st Qu.: 0.00000 1st Qu.:0.0000
Median :0.0000 Median : 0.00000 Median :0.0000
Mean :0.2087 Mean : 0.05232 Mean :0.0362
3rd Qu.:0.0000 3rd Qu.: 0.00000 3rd Qu.:0.0000
Max. :9.0000 Max. : 4.00000 Max. :2.0000
casualty_type casualty_home_area_type casualty_imd_decile
Min. : 0.000 Min. :1.000 Min. : 1.000
1st Qu.: 1.000 1st Qu.:1.000 1st Qu.: 2.000
Median : 9.000 Median :1.000 Median : 5.000
Mean : 9.523 Mean :1.281 Mean : 4.918
3rd Qu.: 9.000 3rd Qu.:1.000 3rd Qu.: 7.000
Max. :98.000 Max. :3.000 Max. :10.000
lsoa_of_casualty age_interval casualty_home.area_desc casualty_class_desc
Length:52065 26 - 30:11035 Length:52065 Length:52065
Class :character 31 - 35: 8441 Class :character Class :character
Mode :character 36 - 40: 6827 Mode :character Mode :character
21 - 25: 5695
16 - 20: 5338
41 - 45: 5193
(Other): 9536
casualty_severity_desc casualty_sex_desc car_passenger_desc
Length:52065 Length:52065 Length:52065
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
4.1.3 Summary of data cleaning
- We have removed negative age which contributed 2.2% of the entire population.
- We have added variable definitions for easy reference.
- We have noted column bus_or_car_passengers is similar to column car_passengers hence we have dropped bus_or_car_passengers to avoid repetition in data.
Question: How can the class of casualties in road accidents be predicted based on accident characteristics, vehicle information, and other casualty demographics?
casualty_class value:
- 1 - driver
- 2 - passenger
- 3 - pedestrian
\[ \]:
# Install the package for splitting the train set and test set from the cleaned dataset
install.packages("caret")
library(caret)
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
also installing the dependencies ‘listenv’, ‘parallelly’, ‘future’, ‘globals’, ‘shape’, ‘future.apply’, ‘numDeriv’, ‘progressr’, ‘SQUAREM’, ‘diagram’, ‘lava’, ‘prodlim’, ‘proxy’, ‘iterators’, ‘Rcpp’, ‘clock’, ‘gower’, ‘hardhat’, ‘ipred’, ‘timeDate’, ‘e1071’, ‘foreach’, ‘ModelMetrics’, ‘plyr’, ‘pROC’, ‘recipes’, ‘reshape2’
Loading required package: ggplot2
Loading required package: lattice
\[ \]:
# Create a subset for modeling
modelSet <- subset(df.road_clean, select = c(vehicle_reference, casualty_class, sex_of_casualty, age_of_casualty,
age_band_of_casualty, casualty_severity, pedestrian_location, pedestrian_movement, car_passenger, bus_or_coach_passenger,
pedestrian_road_maintenance_worker, casualty_home_area_type, casualty_imd_decile))
# Set random seed for reproducibility
set.seed(123)
# Create training set index with 70% of the data
trainIndex <- createDataPartition(modelSet$casualty_class, p = 0.7, list = FALSE)
# Split the data into training and testing sets based on the index
trainSet <- modelSet[trainIndex, ]
testSet <- modelSet[-trainIndex, ]
# View the number of rows in each dataset
nrow(trainSet)
nrow(testSet)
36446
15619
\[ \]:
# Install the package for random forest modeling
install.packages("randomForest")
library(randomForest)
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
randomForest 4.7-1.1
Type rfNews() to see new features/changes/bug fixes.
Attaching package: ‘randomForest’
The following object is masked from ‘package:ggplot2’:
margin
The following object is masked from ‘package:dplyr’:
combine
\[ \]:
# Define the predictors
predictors <- setdiff(names(trainSet), "casualty_class")
# Train the Random Forest model
# Convert the data to a data.frame for randomForest function
trainSet <- as.data.frame(trainSet)
testSet <- as.data.frame(testSet)
# Create the formula for the Random Forest model
formula <- as.formula(paste("casualty_class", "~", paste(predictors, collapse = " + ")))
# Train the model
set.seed(123) # For reproducibility
rf_model <- randomForest(formula, data = trainSet, ntree = 500, mtry = floor(sqrt(length(predictors))), importance = TRUE)
# Summarize the model
print(rf_model)
# Make predictions on the test set
predictions <- predict(rf_model, testSet)
Warning message in randomForest.default(m, y, ...):
“The response has five or fewer unique values. Are you sure you want to do regression?”
Call:
randomForest(formula = formula, data = trainSet, ntree = 500, mtry = floor(sqrt(length(predictors))), importance = TRUE)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 3
Mean of squared residuals: 0.01283855
% Var explained: 97.58
\[ \]:
# Evaluate the model
# Calculate Mean Squared Error (MSE)
actuals <- testSet[["casualty_class"]]
mse <- mean((predictions - actuals)^2)
# Calculate Root Mean Squared Error (RMSE)
rmse <- sqrt(mse)
# Calculate R-squared
ss_total <- sum((actuals - mean(actuals))^2)
ss_residual <- sum((actuals - predictions)^2)
r_squared <- 1 - (ss_residual / ss_total)
# Print evaluation metrics
cat("Mean Squared Error:", mse, "\n")
cat("Root Mean Squared Error:", rmse, "\n")
cat("R-squared:", r_squared, "\n")
Mean Squared Error: 0.01388033
Root Mean Squared Error: 0.1178148
R-squared: 0.9737804
Low MSE and Low RMSE: These metrics indicate that the prediction errors are small on average, meaning the model's predictions are very close to the actual values.
High R-squared: This indicates that the model explains a large proportion of the variability in the target variable, suggesting a very good fit.
In the context of predictive modeling, achieving low Mean Squared Error (MSE) and low Root Mean Squared Error (RMSE) are indicative of the model making predictions that are very close to the actual values on average. MSE measures the average squared difference between predicted values and actual values, while RMSE provides a measure of the average magnitude of these prediction errors in the same units as the target variable. When both MSE and RMSE are low, it signifies that the model’s predictions are accurate and have minimal deviation from the true values.
Additionally, a high R-squared (R²) value indicates that the model can explain a large proportion of the variability observed in the target variable. R² ranges from 0 to 1, with 1 indicating that the model perfectly predicts the target variable based on the predictors. A high R-squared suggests that the model fits the data well, capturing a significant portion of the variance in the target variable and thereby demonstrating its efficacy in explaining the relationships between predictors and the outcome.
Together, low MSE, low RMSE, and high R-squared collectively signify that the predictive model performs well in terms of accuracy and explanatory power, making it a reliable tool for making predictions and understanding the underlying relationships within the data.
\[ \]:
# Install ggplot2 for visualization
install.packages("ggplot2")
library(ggplot2)
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
\[ \]:
# Density Plot of Predicted and Actual Values
plot(density(actuals),
main = "Density Plot of Actual vs Predicted Values",
xlab = "Values",
col = "salmon", lwd = 2)
lines(density(predictions), col = "purple", lwd = 2)
legend("topright", legend = c("Actual", "Predicted"), col = c("salmon", "purple"), lwd = 2)
# Box Plot of Predicted and Actual Values
boxplot(actuals, predictions,
names = c("Actual", "Predicted"),
main = "Box Plot of Actual vs Predicted Values",
col = c("salmon", "purple"))
# Density Plot of Residuals
residuals <- actuals - predictions
ggplot(data.frame(residuals), aes(x = residuals)) +
geom_density(fill = "purple", alpha = 0.5) +
labs(title = "Density Plot of Residuals",
x = "Residuals",
y = "Density")
4.2.1 Output Discussion
1. Density Plot of Actual vs. Predicted Values
The peaks and shapes of both densities are very similar, which means that the predicted values follow the same distribution as the actual values.
Small deviations may exist, but the model captures the overall underlying distribution of the actual values well.
2. Box Plot of Actual vs. Predicted Values
Both distributions appear similar, indicating that the predictions are closely aligned with the actual values.
There are no extreme outliers, suggesting that the model's predictions are consistent with the actual data.
3. Density Plot of Residuals
The residuals are centered around zero, indicating that the model does not have significant bias.
The peak at zero is sharp, suggesting that most of the residuals are small, which is a good sign of model accuracy.
The spread of the residuals is relatively narrow, indicating that the model's predictions are consistently close to the actual values.
These visualizations collectively provide insights into the model’s performance in predicting casualty classes in road accidents, showcasing its ability to closely match actual data distributions and minimize prediction errors effectively.
Question: Can machine learning techniques effectively predict casualty severity, and which variables are most predictive in the model?
For a classification model for multinomial classification problem. Multinomial logistic regressioncan be used.
\[ \]:
\[ \]:
# Load necessary libraries
if(!require(caret)) {
install.packages("caret")
library(caret)
}
if(!require(nnet)) {
install.packages("nnet")
library(nnet)
}
Loading required package: e1071
Loading required package: tidyverse
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ lubridate 1.9.3 ✔ tibble 3.2.1
✔ purrr 1.0.2 ✔ tidyr 1.3.1
✔ readr 2.1.5
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ purrr::%||%() masks base::%||%()
✖ randomForest::combine() masks dplyr::combine()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
✖ purrr::lift() masks caret::lift()
✖ randomForest::margin() masks ggplot2::margin()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Loading required package: gbm
Warning message in library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE, :
“there is no package called ‘gbm’”
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
Loaded gbm 2.1.9
This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
Loading required package: nnet
\[ \]:
head(df.road_clean)
collision_index | collision_year | collision_reference | vehicle_reference | casualty_reference | casualty_class | sex_of_casualty | age_of_casualty | age_band_of_casualty | casualty_severity | ⋯ | casualty_type | casualty_home_area_type | casualty_imd_decile | lsoa_of_casualty | age_interval | casualty_home.area_desc | casualty_class_desc | casualty_severity_desc | casualty_sex_desc | car_passenger_desc | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <int> | <chr> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | ⋯ | <int> | <int> | <int> | <chr> | <fct> | <chr> | <chr> | <chr> | <chr> | <chr> | |
1 | 2023010419171 | 2023 | 010419171 | 1 | 1 | 3 | 2 | 20 | 4 | 3 | ⋯ | 0 | 1 | 10 | E01030370 | 16 - 20 | Urban | Pedestrian | Slight | Female | Yes - Car |
2 | 2023010419183 | 2023 | 010419183 | 2 | 1 | 1 | 1 | 25 | 5 | 3 | ⋯ | 9 | 1 | 3 | E01001546 | 21 - 25 | Urban | Driver | Slight | Male | Yes - Car |
3 | 2023010419189 | 2023 | 010419189 | 1 | 1 | 1 | 1 | 50 | 8 | 3 | ⋯ | 9 | 1 | 5 | E01002443 | 36 - 40 | Urban | Driver | Slight | Male | Yes - Car |
4 | 2023010419191 | 2023 | 010419191 | 2 | 1 | 1 | 1 | 34 | 6 | 3 | ⋯ | 1 | 1 | 2 | E01004679 | 26 - 30 | Urban | Driver | Slight | Male | Yes - Car |
5 | 2023010419198 | 2023 | 010419198 | 1 | 1 | 3 | 1 | 65 | 9 | 3 | ⋯ | 0 | 1 | 5 | E01023593 | 41 - 45 | Urban | Pedestrian | Slight | Male | Yes - Car |
6 | 2023010419201 | 2023 | 010419201 | 1 | 1 | 1 | 1 | 22 | 5 | 3 | ⋯ | 1 | 2 | 10 | E01026413 | 21 - 25 | Semi-urban | Driver | Slight | Male | Yes - Car |
\[ \]:
# Drop columns that won't be used for prediction
columns_to_drop <- c('collision_index', 'collision_reference', 'lsoa_of_casualty',
'age_interval', 'casualty_home.area_desc', 'casualty_class_desc',
'casualty_severity_desc', 'casualty_sex_desc', 'car_passenger_desc')
df.road_clean_modified <- df.road_clean %>% select(-one_of(columns_to_drop))
# Convert the outcome variable to a factor if it is not already.
df.road_clean_modified$casualty_severity <- as.factor(df.road_clean_modified$casualty_severity)
\[ \]:
# Split data into training and testing sets
set.seed(123)
train_index <- sample(seq_len(nrow(df.road_clean_modified)), size = 0.7*nrow(df.road_clean_modified))
train_data <- df.road_clean_modified[train_index, ]
test_data <- df.road_clean_modified[-train_index, ]
\[ \]:
# Train the multinomial logistic regression model
multinom_model <- multinom(casualty_severity ~ ., data = train_data)
# weights: 51 (32 variable)
initial value 40038.924861
iter 10 value 20200.784998
iter 20 value 19995.322055
iter 30 value 19252.683059
iter 40 value 18961.170221
iter 50 value 18958.209543
iter 60 value 18947.713461
final value 18947.521949
converged
\[ \]:
# Make predictions
predictions <- predict(multinom_model, newdata = test_data)
\[ \]:
# Evaluate the model
conf_matrix <- confusionMatrix(predictions, test_data$casualty_severity)
print(conf_matrix)
Confusion Matrix and Statistics
Reference
Prediction 1 2 3
1 0 0 0
2 2 6 7
3 163 2924 12516
Overall Statistics
Accuracy : 0.8018
95% CI : (0.7954, 0.808)
No Information Rate : 0.8018
P-Value [Acc > NIR] : 0.5128
Kappa : 0.0026
Mcnemar's Test P-Value : <2e-16
Statistics by Class:
Class: 1 Class: 2 Class: 3
Sensitivity 0.00000 0.0020478 0.999441
Specificity 1.00000 0.9992907 0.002585
Pos Pred Value NaN 0.4000000 0.802153
Neg Pred Value 0.98944 0.8126001 0.533333
Prevalence 0.01056 0.1876040 0.801831
Detection Rate 0.00000 0.0003842 0.801383
Detection Prevalence 0.00000 0.0009604 0.999040
Balanced Accuracy 0.50000 0.5006692 0.501013
\[ \]:
# Overall accuracy
accuracy <- conf_matrix$overall['Accuracy']
cat("Accuracy: ", accuracy, "\n")
# Class-specific metrics
class_metrics <- conf_matrix$byClass
# Macro-averaged metrics
macro_avg_precision <- mean(class_metrics[, 'Precision'], na.rm = TRUE)
macro_avg_recall <- mean(class_metrics[, 'Recall'], na.rm = TRUE)
macro_avg_f1 <- mean(class_metrics[, 'F1'], na.rm = TRUE)
macro_avg_specificity <- mean(class_metrics[, 'Specificity'], na.rm = TRUE)
cat("Macro-Averaged Precision: ", macro_avg_precision, "\n")
cat("Macro-Averaged Recall: ", macro_avg_recall, "\n")
cat("Macro-Averaged F1 Score: ", macro_avg_f1, "\n")
cat("Macro-Averaged Specificity: ", macro_avg_specificity, "\n")
Accuracy: 0.8017672
Macro-Averaged Precision: 0.6010767
Macro-Averaged Recall: 0.3338296
Macro-Averaged F1 Score: 0.4470349
Macro-Averaged Specificity: 0.6672918
4.3.1 Evaluation of Classification Model
Referring to confusion matrix, the model performs poorly for Class 1 with no correct predictions. Class 2 has some correct predictions but is also misclassified frequently. Class 3 is predicted well due to the high prevalence in the dataset.
The model achieved an accuracy of 80.18% indicating a high overall correctness in its predictions, but this might be skewed by the high prevalence of Class 3.
Kappa is very low (0.0026), indicating poor agreement beyond what would be expected by chance.
The macro-averaged precision of 0.6010767 indicates that, on average, the model's precision across all classes is about 60.11%. A precision of 60.11% suggests that when the model predicts a certain class, it is correct 60.11% of the time on average.
The recall, at 33.38%, indicates that the model correctly identifies only a third of the actual instances for each class on average, highlighting a significant shortfall in sensitivity.
The F1 score, which balances precision and recall, stands at 44.70%, reflecting the overall harmonic mean of these two metrics and underscoring the trade-off between them.
The specificity of 66.73% shows how well the model correctly identifies cases that do not belong to a particular class. Overall, these results suggest that although the model is generally good at making correct predictions, it has difficulty in correctly identifying all the actual instances of each class.
In conclusion, the model has low recall and moderate F1 score, despite having high accuracy and moderate precision. This might be due to class imbalance where one class is much more frequent than the others in the training data. More complex algorithms such as random forest and gradient boosting machine models with class weight adjustment might be more suitable for the prediction.