url <- "https://www.iihs.org/topics/fatality-statistics/detail/yearly-snapshot"
process_iihs_tables <-
function(url) {
# Load required libraries
library(rvest)
library(dplyr)
# Read the HTML from the URL
html_content <- read_html(url)
# Extract all tables from the HTML
tables <- html_content %>% html_table()
# Convert each table into a data frame
list_of_dfs <- lapply(tables, function(x) as.data.frame(x))
# Loop through each data frame and apply transformations
for (i in seq_along(list_of_dfs)) {
# Extract the data frame from the list
df <- list_of_dfs[[i]]
# Get the first column name (use it to name the new data frame)
column_header <- colnames(df)[1]
# Promote the first row to the column names
new_colnames <- as.character(unlist(df[1, ]))
colnames(df) <- new_colnames
# Remove the first row, as it has now been promoted to the header
df <- df[-1, ]
# Dynamically assign the new data frame to a variable named after the column header
assign(column_header, df, envir = .GlobalEnv) # Assign in the global environment
# Optionally print to confirm
print(paste("Created dataframe:", column_header))
}
}IIHS Yearly Snapshot Analysis
Time Series Analysis of Crash Data
What is being collected?
I am using data scraped from HTML tables on the IIHS’s website. The IIHS (Insurance Institute for Highway Safety) compiles safety data on vehicles and is a trusted name in the field of car safety. Today many car brands use the IIHS’s top safety picks and ratings as a way to brag about there car’s reliability or safety. In the data set I compiled tables from their yearly snapshot of car crashes from around the country. This data set takes aggregated data from 1975-2022 allowing for a detailed analysis of macro-level time-series data on car crashes for almost the past 50 years.
The IIHS being a trusted source for car safety, has an open policy on data collection. They have an free and active API, but in order to get access to this they require the user to be apart of company. The IIHS with very clear guidelines for harvesting data allows for almost all user agents to harvest the data on their many webpages. Using the webpage Yearly snapshot, I harvested summary tables that will allow for an in depth time-series analysis.
How will it be collected?
Using tidyverse grammar, the pipe from the magrittr package and the rvest package I have created functions that allow users to input a URL and then output a list of data frames, with the name assigned to them.
The structure of these tables makes them weird to work with. Inside the table there is the title of the table which is formatted like an excel row that has been merged and centered. When the data frame is loaded without the column naming sequence the column header for each column will be what is supposed to be the name of the data frame. This function creates 14 data-frames that can now be worked with individually. But, many of the tables have a similar issue where there is a hierarchical structure to the naming sequence. In order to tackle this issue I created a second function which allows the user to input the table with a first row that has should be in the column header and, now adds it to the naming convention of the data-frame. It also removes any data-collection information from the table, so the new table can be used for analysis.
concatenate_and_remove_rows <- function(df) {
# Iterate through each column in the data frame
new_colnames <- sapply(seq_along(df), function(col_idx) {
# Get the column name
col_name <- colnames(df)[col_idx]
# Get the first value in the column
first_value <- as.character(df[1, col_idx])
# Concatenate the column name and first value
paste0(col_name, "_", first_value)
})
# Update the column names of the data frame
colnames(df) <- new_colnames
# Remove the second row
if (nrow(df) >= 2) {
df <- df[-1, ] # Remove the second row
}
# Check if the last row has the same value in every column before removing it
if (nrow(df) >= 1) {
last_row <- as.character(unlist(df[nrow(df), ])) # Convert the last row to a character vector
if (length(unique(last_row)) == 1) { # Check if all values in the last row are the same
df <- df[-nrow(df), ] # Remove the last row
}
}
return(df)
}Since most of the data-frames where measuring by year, this made it easy to join the tables on the Year column. Using the SQL like joins in the dplyr package I was able to utilize the specific abilities of inner and outer joins to build a cohesive data frame which has time-series data with over 35 metrics!
Analysis
With all of this wonderful time series data we can now put some visuals together that allow us too see driving trends over time!
Based on this we can see that car-related deaths have gone down overall, but there has been a recent rise. What could the causes of this be, could it be new waves of carelessness or is there is a flaw in recent car design?
Seems like speeding isn’t the cause of this, as us American and our need for speed hasn’t been causing more trouble than it already has…
Other than Covid we can see that we have been driving more and more as a population. This could be a cause of this recent spike, but if the increase in travel was the reason for more deaths than we would have noticed a constant upward trend.
At least one positive from this is that the youngest and most vulnerable population is not being affected by this new wave of car-related deaths.
Well this is definitely something that is interesting. Maybe the cause of this is rise in car deathsi s because people are deciding to use those useless things called legs instead of gas-maxxing and expressing their freedom by driving a truck lifted taller than there head to take on the rugged terrain they face daily (that road with that pothole that make the car shake a little).
Conclusion
There is definitely a lot to learn from this small but compact dataset but I hope that this code can lead to your own analysis!