Many of you will have read articles from the FiveThirtyEight website, which often uses data to support narratives on a wide variety of subjects.
538 have recently made available many of the datasets they have used as an R package, and I plan to look at some where there appears to be the potential not only to replicate their findings but also to extend them.
The first I have chosen, "Which State Has The Worst Drivers?", was written in October 2014 by Mona Chalabi, then one of their star data journalists, also infamous for the Vagina Dispatches series for the Guardian.
The data originates from the National Highway Traffic Safety Administration (NHTSA) website.
As always, the libraries and any obligatory files need to be loaded.
(The htmltools package is an addition to help overcome CSS issues arising from using the crosstalk package with blogdown. Props to htmlwidgets king Kenton Russell for his assistance.)
library(htmltools)
library(plotly)
library(fivethirtyeight)
library(crosstalk)
library(tabulizer)
library(DT)
library(tidyverse)
# a simple table linking stateID to state name
states <- read_csv("data/states.csv") %>%
# improve column names
select(state = State, stateID = Abbreviation)
Let us first explore the supplied data; do any necessary wrangling; attempt to replicate the charts in Plotly; and use the filter control of the crosstalk package (which enables inter-widget interactivity for htmlwidgets) so that the plots can be shown more concisely.
Here is the first of the charts from the original article.
Let’s look at the data in the package.
df <- bad_drivers
glimpse(df)
Observations: 51
Variables: 8
$ state <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "Califo...
$ num_drivers <dbl> 18.8, 18.1, 18.6, 22.4, 12.0, 13.6, 10.8, 16.2, 5.9...
$ perc_speeding <int> 39, 41, 35, 18, 35, 37, 46, 38, 34, 21, 19, 54, 36,...
$ perc_alcohol <int> 30, 25, 28, 26, 28, 28, 36, 30, 27, 29, 25, 41, 29,...
$ perc_not_distracted <int> 96, 90, 84, 94, 91, 79, 87, 87, 100, 92, 95, 82, 85...
$ perc_no_previous <int> 80, 94, 96, 95, 89, 95, 82, 99, 100, 94, 93, 87, 98...
$ insurance_premiums <dbl> 784.55, 1053.48, 899.47, 827.34, 878.41, 835.50, 10...
$ losses <dbl> 145.08, 133.93, 110.35, 142.39, 165.63, 139.91, 167...
The data is preprocessed to give rounded percentages, whilst the charts are in actual numbers of fatalities per billion miles travelled. We could search for the raw data, but for present purposes we just need to recalculate approximate counts from the percentages.
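The recalculation is simply the total rate multiplied by the rounded percentage. As a quick worked example, using the Alabama figures shown in the glimpse above (18.8 drivers in fatal collisions per billion miles, 39% of them speeding):

# approximate speeding-related rate for Alabama, per billion miles
18.8 * 39 / 100
[1] 7.332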
However, to act on the data it will be much easier to transform the dataframe to a narrow (tidy) rather than wide form. In addition, it would be nice to have the chart shown in a meaningful order rather than alphabetically by state.
df_gather <- df %>%
# restrict to columns required
select(1:6) %>%
# use a function within the tidyr package (part of the tidyverse)
gather(cause, pc, -c(state, num_drivers)) %>%
# calculate an approximate actual count
mutate(value = round(num_drivers * pc / 100, 1)) %>%
# add a stateID column for use in the map
left_join(states)
# In order to improve the data display, state needs to be changed from a character
# to a factor, ordered by value. unique() is needed because each state appears
# once per cause and factor levels must not be duplicated
## NB more work required on ordering by an individual cause (see below)
df_gather$state <-
  factor(df_gather$state,
         levels = unique(df_gather$state[order(df_gather$value)]))
df_gather
We now have the df in the format required for our first chart. As I mentioned earlier, I want to use just one aspect of the much wider potential of the crosstalk package. I anticipate utilizing this more fully in future posts, but beware that it is still in beta.
# this enables the filter to interact with chart and, later, map
sd <- SharedData$new(df_gather)
fs <- filter_select(
id = "cause",
label = "Cause",
sharedData = sd,
group = ~ cause,
allLevels = FALSE,
multiple = FALSE
)
## this is needed as crosstalk does not work nicely with bootstrap, apparently
fs_nobootstrap <- fs
attr(fs_nobootstrap, "html_dependencies") <- Filter(
function(dep) {dep$name != "bootstrap"},
attr(fs_nobootstrap, "html_dependencies")
)
myChart <- sd %>%
# need to set a larger than default height to encompass all states
plot_ly(
x = ~ num_drivers,
y = ~ state,
height = 800
) %>%
# first put in all fatalities
add_bars(color = I("#f2dfa8"), name = "Total", width=0.8) %>%
# then those according to cause
add_bars(width=0.2,
x = ~ value,
y = ~ state,
color = I("#F6B900"),
name = "Cause",
hoverinfo="text",
text=~paste0(pc,"%")
) %>%
# make it an overlay barchart and improve the look
layout(
barmode = "overlay",
title = "Drivers involved in Fatal collisions by State - 2012",
xaxis = list(title = "Fatal collisions per billion miles"),
yaxis = list(title = ""),
margin = list(l = 120)
) %>%
config(displayModeBar = F, showLink = F)
## combine the selector and chart
tagList(
fs_nobootstrap,
myChart
)
So, some progress. There is now a more compact display with an interactive element on hover, and the states are no longer in alphabetical order, though they are not ordered by the rate for each individual cause. It would also be useful if the selector showed a default value.
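One possible refinement, sketched here rather than applied above, would be to re-level the state factor by the rate for a single cause (speeding is used purely as an illustration) before creating the SharedData object and rebuilding the chart:

# order states by the approximate speeding-related rate (one row per state for this cause)
speeding_levels <- df_gather %>%
  filter(cause == "perc_speeding") %>%
  arrange(value) %>%
  .[["state"]] %>%
  as.character()

df_gather$state <- factor(as.character(df_gather$state), levels = speeding_levels)
# sd and myChart would then need to be recreated from the re-ordered data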
Now, how about extending the original article with some maps? Plotly offers a choropleth option.
map <- sd %>%
plot_ly(
type = "choropleth",
locations = ~ stateID,
locationmode = "USA-states",
z = ~ value,
reversescale = TRUE
) %>% layout(geo = list(scope = "usa"))
tagList(
fs_nobootstrap,
map
)
Combining the chart and map would make sense, and let’s throw in a table for good measure. If going this route, a dashboard might be a more sensible option.
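As a rough sketch (rather than a polished dashboard), the same SharedData object can also drive a DT table, so the filter, chart, map and table would all stay in sync:

# DT is crosstalk-aware, so passing the SharedData object links the table to the other widgets
myTable <- DT::datatable(
  sd,
  rownames = FALSE,
  options = list(pageLength = 10)
)

tagList(
  fs_nobootstrap,
  myChart,
  map,
  myTable
)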
I found it difficult to locate the raw data that the fivethirtyeight dataset was based on, so, in lieu of that, and because it is a good-news story, I have mapped the considerable reduction in fatalities per mile travelled over the past forty years.
The NHTSA data is available in the form of downloadable PDFs. A year ago this would have been enough for me to say, well, ‘Enough’, but at least two packages I am aware of, pdftools and tabulizer, attempt to address the problem of extracting data from PDFs.
Here I have used tabulizer. Downloading the data takes a few seconds.
Because there is an interactive element involved in setting the part of the page to collect the data from, I have commented out that process and supply the cleaned data from a local file.
# Ideally we would use the extract_tables function, but the result needs further cleaning
crash_data <-
extract_tables(
"https://crashstats.nhtsa.dot.gov/Api/Public/ViewPublication/812293",
pages = 5,
method = "data.frame"
)
head(crash_data[[1]])
# Luckily there is an interactive option where a rectangle can be drawn around the displayed image. I know we just want the first 4 columns
# I have written its output to a file
# crash_data <- extract_areas("https://crashstats.nhtsa.dot.gov/Api/Public/ViewPublication/812293", pages = 5)
# df <- data.frame(crash_data)
# glimpse(df)
# Observations: 51
# Variables: 5
# $ X1 <fctr> Alabama , Alaska , Arizona , Arkansas , California , C...
# $ X2 <fctr> 3.63 , 4.38 , 4.19 , 4.01 , 3.09 , 3.50 , 2.13 , 3.37 ...
# $ X3 <fctr> 1.92 , 1.45 , 1.97 , 2.05 , 1.32 , 1.26 , 0.88 , 1.40 ...
# $ X4 <fctr> 1.31 , 1.05 , 1.40 , 1.49 , 0.94 , 1.03 , 0.92 , 1.06 ...
# $ X5 <fctr> 1.25, 1.50, 1.23, 1.37, 0.92, 1.00, 0.80, 1.26, 0.65, ...
# write_csv(df,"../../data/crash_data1975_2014.csv")
# Here we get the data back from the local file
# The package, correctly, gauges that we want the state column to be
# of class character with the rest numeric
df <- read_csv("data/crash_data1975_2014.csv")
glimpse(df)
Observations: 51
Variables: 5
$ X1 <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut", "De...
$ X2 <dbl> 3.63, 4.38, 4.19, 4.01, 3.09, 3.50, 2.13, 3.37, 2.27, 3.24, 3.46, 3.47, 4.78, 3.56, 3.02...
$ X3 <dbl> 1.92, 1.45, 1.97, 2.05, 1.32, 1.26, 0.88, 1.40, 1.29, 1.75, 1.52, 1.39, 1.85, 1.27, 1.31...
$ X4 <dbl> 1.31, 1.05, 1.40, 1.49, 0.94, 1.03, 0.92, 1.06, 0.57, 1.25, 1.08, 1.01, 1.34, 0.94, 1.00...
$ X5 <dbl> 1.25, 1.50, 1.23, 1.37, 0.92, 1.00, 0.80, 1.26, 0.65, 1.24, 1.04, 0.93, 1.15, 0.88, 0.94...
# just need to enter more informative column names
names(df) <- c('state', 'yr_1975', 'yr_2005', 'yr_2013', 'yr_2014')
# and again put into tidy version and add state abbreviations
df_gather <- df %>%
gather(year, rate, -state) %>%
left_join(states)
glimpse(df_gather)
Observations: 204
Variables: 4
$ state <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut"...
$ year <chr> "yr_1975", "yr_1975", "yr_1975", "yr_1975", "yr_1975", "yr_1975", "yr_1975", "yr_19...
$ rate <dbl> 3.63, 4.38, 4.19, 4.01, 3.09, 3.50, 2.13, 3.37, 2.27, 3.24, 3.46, 3.47, 4.78, 3.56,...
$ stateID <chr> "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "DC", "FL", "GA", "HI", "ID", "IL",...
We can now map the changes over time for three of the years. Note the key inputs zmin and zmax, which ensure the same color scale applies to all maps.
# a layout in 3 columns, e.g. bscols(widths = c(4, 4, 4), ...), would be ideal here (see note below)
map_1975 <- df_gather %>%
filter(year=="yr_1975") %>%
# create a choropleth map
plot_geo(locationmode="USA-states") %>%
add_trace(z=~rate, color = ~rate, colors = "Reds",
# ensure that color scale extends beyond the values in individual map
zmin=0,zmax=6,
locations = ~stateID,
showscale=TRUE) %>%
layout(geo=list(scope="usa"),
title="1975") %>%
config(displayModeBar = F, showLink = F)
map_2005 <- df_gather %>%
filter(year=="yr_2005") %>%
plot_geo(locationmode="USA-states") %>%
add_trace(z=~rate, color = ~rate, colors = "Reds",
zmin=0,zmax=6,
locations = ~stateID,
showscale=TRUE) %>%
layout(geo=list(scope="usa"),
       title="2005") %>%
  config(displayModeBar = F, showLink = F)
map_2014 <- df_gather %>%
filter(year=="yr_2014") %>%
plot_geo(locationmode="USA-states") %>%
add_trace(z=~rate, color = ~rate, colors = "Reds",
zmin=0,zmax=6,
locations = ~stateID,
# add a scale
showscale=TRUE) %>%
layout(geo=list(scope="usa"),
       title="2014") %>%
  config(displayModeBar = F, showLink = F)
tagList(
map_1975,
map_2005,
map_2014
)
Ideally, these maps would be side by side, but this is difficult within a blog format. Please note that there is a much larger time gap between the first two maps than the latter two. Hover over individual states and you can see that by 2005 pretty well all states had their rate as low as the best state thirty years prior.
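For completeness, the commented-out bscols call earlier hints at how a side-by-side layout could be attempted, along these lines (not run here, as the bootstrap dependency it brings in may clash with the blog theme):

# lay the three maps out in three equal-width columns using crosstalk::bscols
bscols(
  widths = c(4, 4, 4),
  map_1975,
  map_2005,
  map_2014
)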
Let’s try one map showing the % decline in rates.
df %>%
left_join(states) %>%
mutate(change = round(100 * yr_2014 / yr_1975 - 100)) %>%
plot_geo(locationmode = "USA-states") %>%
add_trace(
z = ~ change,
color = ~ change,
colors = "Greens",
# zmin=0,zmax=6,
locations = ~ stateID,
showscale = TRUE,
reversescale = TRUE
) %>%
layout(
  geo = list(scope = "usa"),
  title = "% change in rate of Fatalities 1975 to 2014"
) %>%
config(displayModeBar = F, showLink = F)
Joining, by = "state"
Obviously, there has been a sharp downward trend in fatalities, and even in absolute terms current levels are a third less than the peak of 1972.
Clearly, several factors, including better cars and roads, and awareness and legislation regarding alcohol and drugs, will have played a part. There is a host of data on the NHTSA site that could be used for further analyses.