IIHS Yearly Snapshot Analysis

Time Series Analysis of Crash Data

Author

Ben Shilling

What is being collected?

I am using data scraped from HTML tables on the IIHS’s website. The IIHS (Insurance Institute for Highway Safety) compiles safety data on vehicles and is a trusted name in the field of car safety. Today many car brands use the IIHS’s top safety picks and ratings as a way to brag about there car’s reliability or safety. In the data set I compiled tables from their yearly snapshot of car crashes from around the country. This data set takes aggregated data from 1975-2022 allowing for a detailed analysis of macro-level time-series data on car crashes for almost the past 50 years.

The IIHS being a trusted source for car safety, has an open policy on data collection. They have an free and active API, but in order to get access to this they require the user to be apart of company. The IIHS with very clear guidelines for harvesting data allows for almost all user agents to harvest the data on their many webpages. Using the webpage Yearly snapshot, I harvested summary tables that will allow for an in depth time-series analysis.

How will it be collected?

Using tidyverse grammar, the pipe from the magrittr package and the rvest package I have created functions that allow users to input a URL and then output a list of data frames, with the name assigned to them.

url <- "https://www.iihs.org/topics/fatality-statistics/detail/yearly-snapshot"

process_iihs_tables <- 
  function(url) {
    # Load required libraries
    library(rvest)
    library(dplyr)
    
    # Read the HTML from the URL
    html_content <- read_html(url)
    
    # Extract all tables from the HTML
    tables <- html_content %>% html_table()
    
    # Convert each table into a data frame
    list_of_dfs <- lapply(tables, function(x) as.data.frame(x))
    
    # Loop through each data frame and apply transformations
    for (i in seq_along(list_of_dfs)) {
      # Extract the data frame from the list
      df <- list_of_dfs[[i]]
      
      # Get the first column name (use it to name the new data frame)
      column_header <- colnames(df)[1]
      
      # Promote the first row to the column names
      new_colnames <- as.character(unlist(df[1, ]))
      colnames(df) <- new_colnames
      
      # Remove the first row, as it has now been promoted to the header
      df <- df[-1, ]
      
      # Dynamically assign the new data frame to a variable named after the column header
      assign(column_header, df, envir = .GlobalEnv)  # Assign in the global environment
      
      # Optionally print to confirm
      print(paste("Created dataframe:", column_header))
    }
  }

The structure of these tables makes them weird to work with. Inside the table there is the title of the table which is formatted like an excel row that has been merged and centered. When the data frame is loaded without the column naming sequence the column header for each column will be what is supposed to be the name of the data frame. This function creates 14 data-frames that can now be worked with individually. But, many of the tables have a similar issue where there is a hierarchical structure to the naming sequence. In order to tackle this issue I created a second function which allows the user to input the table with a first row that has should be in the column header and, now adds it to the naming convention of the data-frame. It also removes any data-collection information from the table, so the new table can be used for analysis.

concatenate_and_remove_rows <- function(df) {
  # Iterate through each column in the data frame
  new_colnames <- sapply(seq_along(df), function(col_idx) {
    # Get the column name
    col_name <- colnames(df)[col_idx]
    
    # Get the first value in the column
    first_value <- as.character(df[1, col_idx])
    
    # Concatenate the column name and first value
    paste0(col_name, "_", first_value)
  })
  
  # Update the column names of the data frame
  colnames(df) <- new_colnames
  
  # Remove the second row
  if (nrow(df) >= 2) {
    df <- df[-1, ] # Remove the second row
  }
  
  # Check if the last row has the same value in every column before removing it
  if (nrow(df) >= 1) {
    last_row <- as.character(unlist(df[nrow(df), ])) # Convert the last row to a character vector
    if (length(unique(last_row)) == 1) { # Check if all values in the last row are the same
      df <- df[-nrow(df), ] # Remove the last row
    }
  }
  
  return(df)
}

Since most of the data-frames where measuring by year, this made it easy to join the tables on the Year column. Using the SQL like joins in the dplyr package I was able to utilize the specific abilities of inner and outer joins to build a cohesive data frame which has time-series data with over 35 metrics!

Analysis

With all of this wonderful time series data we can now put some visuals together that allow us too see driving trends over time!

Based on this we can see that car-related deaths have gone down overall, but there has been a recent rise. What could the causes of this be, could it be new waves of carelessness or is there is a flaw in recent car design?

Seems like speeding isn’t the cause of this, as us American and our need for speed hasn’t been causing more trouble than it already has…

Other than Covid we can see that we have been driving more and more as a population. This could be a cause of this recent spike, but if the increase in travel was the reason for more deaths than we would have noticed a constant upward trend.

At least one positive from this is that the youngest and most vulnerable population is not being affected by this new wave of car-related deaths.

Well this is definitely something that is interesting. Maybe the cause of this is rise in car deathsi s because people are deciding to use those useless things called legs instead of gas-maxxing and expressing their freedom by driving a truck lifted taller than there head to take on the rugged terrain they face daily (that road with that pothole that make the car shake a little).

Conclusion

There is definitely a lot to learn from this small but compact dataset but I hope that this code can lead to your own analysis!

--- title: "IIHS Yearly Snapshot Analysis" subtitle: "Time Series Analysis of Crash Data" author: "Ben Shilling" # Author name toc: true # Generates an automatic table of contents. format: # Options related to formatting. html: # Options related to HTML output. code-tools: TRUE # Allow the code tools option showing in the output. embed-resources: TRUE # Embeds all components into a single HTML file. execute: # Options related to the execution of code chunks. warning: FALSE # FALSE: Code chunk warnings are hidden by default. message: FALSE # FALSE: Code chunk messages are hidden by default. echo: FALSE # TRUE: Show all code in the output. --- ```{r} #| label: Library and Data Load #| include: false library(readr) # bring csv's into data frame library(knitr) # for those lovely html tables IIHS_Crash_Summary <- read_csv("https://myxavier-my.sharepoint.com/:x:/g/personal/shillingb1_xavier_edu/EUcG91Teo5lOmzUXACYMLj4BWqVbCZNATVrDcw2udUqRaA?download=1") ``` ## What is being collected? I am using data scraped from HTML tables on the IIHS's website. The IIHS (Insurance Institute for Highway Safety) compiles safety data on vehicles and is a trusted name in the field of car safety. Today many car brands use the IIHS's top safety picks and ratings as a way to brag about there car's reliability or safety. In the data set I compiled tables from their yearly snapshot of car crashes from around the country. This data set takes aggregated data from 1975-2022 allowing for a detailed analysis of macro-level time-series data on car crashes for almost the past 50 years. The IIHS being a trusted source for car safety, has an open policy on data collection. They have an free and active API, but in order to get access to this they require the user to be apart of company. The IIHS with very clear guidelines for harvesting data allows for almost all user agents to harvest the data on their many webpages. Using the webpage [Yearly snapshot](https://www.iihs.org/topics/fatality-statistics/detail/yearly-snapshot), I harvested summary tables that will allow for an in depth time-series analysis. ## How will it be collected? ```{r} #| label: Packages #| include: true library(tidyverse) # for all things tidy! library(magrittr) # it's not a %>% library(rvest) # scythe and harvester of the HTML file ``` Using tidyverse grammar, the pipe from the magrittr package and the rvest package I have created functions that allow users to input a URL and then output a list of data frames, with the name assigned to them. ```{r} #| label: data frame creation #| echo: true url <- "https://www.iihs.org/topics/fatality-statistics/detail/yearly-snapshot" process_iihs_tables <- function(url) { # Load required libraries library(rvest) library(dplyr) # Read the HTML from the URL html_content <- read_html(url) # Extract all tables from the HTML tables <- html_content %>% html_table() # Convert each table into a data frame list_of_dfs <- lapply(tables, function(x) as.data.frame(x)) # Loop through each data frame and apply transformations for (i in seq_along(list_of_dfs)) { # Extract the data frame from the list df <- list_of_dfs[[i]] # Get the first column name (use it to name the new data frame) column_header <- colnames(df)[1] # Promote the first row to the column names new_colnames <- as.character(unlist(df[1, ])) colnames(df) <- new_colnames # Remove the first row, as it has now been promoted to the header df <- df[-1, ] # Dynamically assign the new data frame to a variable named after the column header assign(column_header, df, envir = .GlobalEnv) # Assign in the global environment # Optionally print to confirm print(paste("Created dataframe:", column_header)) } } ``` The structure of these tables makes them weird to work with. Inside the table there is the title of the table which is formatted like an excel row that has been merged and centered. When the data frame is loaded without the column naming sequence the column header for each column will be what is supposed to be the name of the data frame. This function creates 14 data-frames that can now be worked with individually. But, many of the tables have a similar issue where there is a hierarchical structure to the naming sequence. In order to tackle this issue I created a second function which allows the user to input the table with a first row that has should be in the column header and, now adds it to the naming convention of the data-frame. It also removes any data-collection information from the table, so the new table can be used for analysis. ```{r} #| label: column header fix #| echo: true concatenate_and_remove_rows <- function(df) { # Iterate through each column in the data frame new_colnames <- sapply(seq_along(df), function(col_idx) { # Get the column name col_name <- colnames(df)[col_idx] # Get the first value in the column first_value <- as.character(df[1, col_idx]) # Concatenate the column name and first value paste0(col_name, "_", first_value) }) # Update the column names of the data frame colnames(df) <- new_colnames # Remove the second row if (nrow(df) >= 2) { df <- df[-1, ] # Remove the second row } # Check if the last row has the same value in every column before removing it if (nrow(df) >= 1) { last_row <- as.character(unlist(df[nrow(df), ])) # Convert the last row to a character vector if (length(unique(last_row)) == 1) { # Check if all values in the last row are the same df <- df[-nrow(df), ] # Remove the last row } } return(df) } ``` Since most of the data-frames where measuring by year, this made it easy to join the tables on the Year column. Using the SQL like joins in the dplyr package I was able to utilize the specific abilities of inner and outer joins to build a cohesive data frame which has time-series data with over 35 metrics! ## Analysis With all of this wonderful time series data we can now put some visuals together that allow us too see driving trends over time! ```{r} #| label: Vehicle Deaths by Year #| include: true IIHS_Crash_Summary %>% ggplot(aes(x = Year, y = Deaths)) + geom_line(color = "blue", size = 1) + # Line with custom color and size geom_point(color = "red", size = 2) + # Points to highlight each year labs( title = "Vehicle Deaths by Year", x = "Year", y = "Deaths" ) + theme_minimal() + # Apply a minimal theme for a clean look theme( plot.title = element_text(hjust = 0.5, size = 14), # Center title axis.title = element_text(size = 12) # Adjust axis title size ) + scale_y_continuous( limits = c(20000, 60000), # Set y-axis range from 0 to 60000 breaks = seq(20000, 60000, by = 10000) # Add breaks at intervals of 10000 ) ``` Based on this we can see that car-related deaths have gone down overall, but there has been a recent rise. What could the causes of this be, could it be new waves of carelessness or is there is a flaw in recent car design? ```{r} #| label: Speeding deaths by Year #| include: true IIHS_Crash_Summary %>% filter(Year >= 2013) %>% ggplot(aes(x = Year, y = `Speeding-related_Number`)) + geom_line(color = "blue", size = 1) + # Line with custom color and size geom_point(color = "red", size = 2) + # Points to highlight each year labs( title = "Speeding Deaths by Year", x = "Year", y = "Speeding Deaths" ) + theme_minimal() + # Apply a minimal theme for a clean look theme( plot.title = element_text(hjust = 0.5, size = 14), # Center title axis.title = element_text(size = 12) # Adjust axis title size ) + scale_y_continuous( limits = c(0, 20000), # Set y-axis range from 0 to 60000 breaks = seq(0,20000, by = 5000) # Add breaks at intervals of 10000 ) + scale_x_continuous( breaks = seq(2013,2022,by = 1) ) ``` Seems like speeding isn't the cause of this, as us American and our need for speed hasn't been causing more trouble than it already has... ```{r} #| label: Miles Travelled by Americans #| include: true IIHS_Crash_Summary %>% ggplot(aes(x = Year, y = `Miles traveled (millions)`)) + geom_line(color = "blue", size = 1) + # Line with custom color and size geom_point(color = "red", size = 2) + # Points to highlight each year labs( title = "Miles Travlled by Year", x = "Miles Travelled", y = "Speeding Deaths" ) + theme_minimal() + # Apply a minimal theme for a clean look theme( plot.title = element_text(hjust = 0.5, size = 14), # Center title axis.title = element_text(size = 12) # Adjust axis title size ) ``` Other than Covid we can see that we have been driving more and more as a population. This could be a cause of this recent spike, but if the increase in travel was the reason for more deaths than we would have noticed a constant upward trend. ```{r} #| label: Deaths of less than 13 #| include: true IIHS_Crash_Summary %>% ggplot(aes(x = Year, y = `<13 years_Number`)) + geom_line(color = "blue", size = 1) + # Line with custom color and size geom_point(color = "red", size = 2) + # Points to highlight each year labs( title = "Young Children's Vehicle Deaths by Year", x = "Deaths", y = "Speeding Deaths" ) + theme_minimal() + # Apply a minimal theme for a clean look theme( plot.title = element_text(hjust = 0.5, size = 14), # Center title axis.title = element_text(size = 12) # Adjust axis title size ) ``` At least one positive from this is that the youngest and most vulnerable population is not being affected by this new wave of car-related deaths. ```{r} #| label: Pedestrian Deaths #| include: true IIHS_Crash_Summary %>% ggplot(aes(x = Year, y = Pedestrians_Number)) + geom_line(color = "blue", size = 1) + # Line with custom color and size geom_point(color = "red", size = 2) + # Points to highlight each year labs( title = "Pedestrian Deaths by Year", x = "Deaths", y = "Pedestrian Deaths" ) + theme_minimal() + # Apply a minimal theme for a clean look theme( plot.title = element_text(hjust = 0.5, size = 14), # Center title axis.title = element_text(size = 12) # Adjust axis title size ) ``` Well this is definitely something that is interesting. Maybe the cause of this is rise in car deathsi s because people are deciding to use those useless things called legs instead of gas-maxxing and expressing their freedom by driving a truck lifted taller than there head to take on the rugged terrain they face daily (that road with that pothole that make the car shake a little). ## Conclusion There is definitely a lot to learn from this small but compact dataset but I hope that this code can lead to your own analysis!