Using one or more TidyVerse packages and any dataset from fivethirtyeight.com or Kaggle, the task is to create a programming sample “vignette” that demonstrates the use of one or more capabilities of the selected TidyVerse package.
Let's load the tidyverse package first. It includes the readr, dplyr, tidyr, ggplot2, stringr, tibble, forcats and purrr packages.
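The call that loads the package does not appear in the text above. A minimal sketch, assuming the tidyverse meta-package is already installed; `tidyverse_packages()` is used here only to list what the meta-package bundles, and `bundled` is an illustrative variable name:

```r
# Attach the core tidyverse packages (readr, dplyr, tidyr, ggplot2, ...)
library(tidyverse)

# List every package that belongs to the tidyverse meta-package
bundled <- tidyverse_packages()
print(bundled)
```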
We’re going to load a dataset from fivethirtyeight.com to show the tidyverse packages at work. The dataset covers America’s bad drivers: for every state, it reports drivers involved in fatal collisions, along with car insurance premiums and insurer losses.
The first step is to read the bad-drivers data from the GitHub repository. The data contains the fields listed in the column specification below:
# define URL for bad drivers data
theURL <- 'https://raw.githubusercontent.com/fivethirtyeight/data/master/bad-drivers/bad-drivers.csv'
# read data
bad_drivers <- read_csv(theURL)
## Parsed with column specification:
## cols(
## State = col_character(),
## `Number of drivers involved in fatal collisions per billion miles` = col_double(),
## `Percentage Of Drivers Involved In Fatal Collisions Who Were Speeding` = col_double(),
## `Percentage Of Drivers Involved In Fatal Collisions Who Were Alcohol-Impaired` = col_double(),
## `Percentage Of Drivers Involved In Fatal Collisions Who Were Not Distracted` = col_double(),
## `Percentage Of Drivers Involved In Fatal Collisions Who Had Not Been Involved In Any Previous Accidents` = col_double(),
## `Car Insurance Premiums ($)` = col_double(),
## `Losses incurred by insurance companies for collisions per insured driver ($)` = col_double()
## )
head(bad_drivers)
## # A tibble: 6 x 8
## State `Number of driv… `Percentage Of … `Percentage Of … `Percentage Of …
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Alab… 18.8 39 30 96
## 2 Alas… 18.1 41 25 90
## 3 Ariz… 18.6 35 28 84
## 4 Arka… 22.4 18 26 94
## 5 Cali… 12 35 28 91
## 6 Colo… 13.6 37 28 79
## # … with 3 more variables: `Percentage Of Drivers Involved In Fatal Collisions
## # Who Had Not Been Involved In Any Previous Accidents` <dbl>, `Car Insurance
## # Premiums ($)` <dbl>, `Losses incurred by insurance companies for collisions
## # per insured driver ($)` <dbl>
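read_csv() guessed the column types above and reported them. As a sketch of how that guess could be stated explicitly instead, the example below passes `col_types` to read_csv(). The two-row file and the names `sample_path` and `sample_drivers` are made up for illustration so the example runs without network access, and only two of the eight columns are included:

```r
library(readr)

# Hypothetical two-row sample standing in for the real CSV
sample_path <- tempfile(fileext = ".csv")
writeLines(c(
  "State,Car Insurance Premiums ($)",
  "Alabama,784.55",
  "Alaska,1053.48"
), sample_path)

# Declare the expected type of each column instead of letting readr guess
sample_drivers <- read_csv(
  sample_path,
  col_types = cols(
    State = col_character(),
    `Car Insurance Premiums ($)` = col_double()
  )
)
print(sample_drivers)
```

Declaring the types up front means a change in the source file, such as a numeric column arriving as text, surfaces as a parsing problem rather than a silent change of type.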
Next, we rename the columns, replacing the long column names with shorter ones.
# rename columns
colnames(bad_drivers) <- c("STATE",
                           "DRIVERS_INVOLVED",
                           "PERC_DRIVERS_SPEED",
                           "PERC_DRIVERS_ALCHO",
                           "PERC_DRIVERS_NOT_DIST",
                           "PERC_DRIVERS_NO_ACC",
                           "INS_PREM",
                           "LOSS_INSCOMP")
glimpse(bad_drivers)
## Observations: 51
## Variables: 8
## $ STATE <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "Ca…
## $ DRIVERS_INVOLVED <dbl> 18.8, 18.1, 18.6, 22.4, 12.0, 13.6, 10.8, 16.2,…
## $ PERC_DRIVERS_SPEED <dbl> 39, 41, 35, 18, 35, 37, 46, 38, 34, 21, 19, 54,…
## $ PERC_DRIVERS_ALCHO <dbl> 30, 25, 28, 26, 28, 28, 36, 30, 27, 29, 25, 41,…
## $ PERC_DRIVERS_NOT_DIST <dbl> 96, 90, 84, 94, 91, 79, 87, 87, 100, 92, 95, 82…
## $ PERC_DRIVERS_NO_ACC <dbl> 80, 94, 96, 95, 89, 95, 82, 99, 100, 94, 93, 87…
## $ INS_PREM <dbl> 784.55, 1053.48, 899.47, 827.34, 878.41, 835.50…
## $ LOSS_INSCOMP <dbl> 145.08, 133.93, 110.35, 142.39, 165.63, 139.91,…
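The renaming above assigns all names by position with base R's colnames(). As an alternative sketch in tidyverse style, dplyr's rename() pairs each new name with the old one explicitly, so the result does not depend on column order. The one-row tibble and the names `sample_long` and `sample_short` are illustrative only:

```r
library(dplyr)
library(tibble)

# Hypothetical one-row tibble with two of the original long column names
sample_long <- tibble(
  State = "Alabama",
  `Car Insurance Premiums ($)` = 784.55
)

# rename() takes new_name = old_name pairs
sample_short <- sample_long %>%
  rename(
    STATE = State,
    INS_PREM = `Car Insurance Premiums ($)`
  )
print(names(sample_short))
```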
As you may have noticed, the columns PERC_DRIVERS_SPEED, PERC_DRIVERS_ALCHO, PERC_DRIVERS_NOT_DIST and PERC_DRIVERS_NO_ACC are percentages of DRIVERS_INVOLVED. In the next step we use mutate() to create the new columns DRIVERS_SPEED, DRIVERS_ALCHO, DRIVERS_NOT_DIST and DRIVERS_NO_ACC by taking the given percentage of the DRIVERS_INVOLVED column.
# create new columns, e.g. DRIVERS_SPEED = (DRIVERS_INVOLVED*PERC_DRIVERS_SPEED)/100
bad_drivers <- bad_drivers %>%
  mutate(DRIVERS_SPEED=(DRIVERS_INVOLVED*PERC_DRIVERS_SPEED)/100) %>%
  mutate(DRIVERS_ALCHO=(DRIVERS_INVOLVED*PERC_DRIVERS_ALCHO)/100) %>%
  mutate(DRIVERS_NOT_DIST=(DRIVERS_INVOLVED*PERC_DRIVERS_NOT_DIST)/100) %>%
  mutate(DRIVERS_NO_ACC=(DRIVERS_INVOLVED*PERC_DRIVERS_NO_ACC)/100)
glimpse(bad_drivers)
## Observations: 51
## Variables: 12
## $ STATE <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "Ca…
## $ DRIVERS_INVOLVED <dbl> 18.8, 18.1, 18.6, 22.4, 12.0, 13.6, 10.8, 16.2,…
## $ PERC_DRIVERS_SPEED <dbl> 39, 41, 35, 18, 35, 37, 46, 38, 34, 21, 19, 54,…
## $ PERC_DRIVERS_ALCHO <dbl> 30, 25, 28, 26, 28, 28, 36, 30, 27, 29, 25, 41,…
## $ PERC_DRIVERS_NOT_DIST <dbl> 96, 90, 84, 94, 91, 79, 87, 87, 100, 92, 95, 82…
## $ PERC_DRIVERS_NO_ACC <dbl> 80, 94, 96, 95, 89, 95, 82, 99, 100, 94, 93, 87…
## $ INS_PREM <dbl> 784.55, 1053.48, 899.47, 827.34, 878.41, 835.50…
## $ LOSS_INSCOMP <dbl> 145.08, 133.93, 110.35, 142.39, 165.63, 139.91,…
## $ DRIVERS_SPEED <dbl> 7.332, 7.421, 6.510, 4.032, 4.200, 5.032, 4.968…
## $ DRIVERS_ALCHO <dbl> 5.640, 4.525, 5.208, 5.824, 3.360, 3.808, 3.888…
## $ DRIVERS_NOT_DIST <dbl> 18.048, 16.290, 15.624, 21.056, 10.920, 10.744,…
## $ DRIVERS_NO_ACC <dbl> 15.040, 17.014, 17.856, 21.280, 10.680, 12.920,…
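To make the arithmetic concrete, the sketch below recomputes two of the derived columns for the first two states, using values copied from the glimpse() output above. It also shows that a single mutate() call can create several columns at once. The tibble name `sample_rows` is illustrative:

```r
library(dplyr)
library(tibble)

# First two rows of the renamed data, copied from the glimpse() output
sample_rows <- tibble(
  STATE = c("Alabama", "Alaska"),
  DRIVERS_INVOLVED = c(18.8, 18.1),
  PERC_DRIVERS_SPEED = c(39, 41),
  PERC_DRIVERS_ALCHO = c(30, 25)
)

# One mutate() call can define several new columns
sample_rows <- sample_rows %>%
  mutate(
    DRIVERS_SPEED = DRIVERS_INVOLVED * PERC_DRIVERS_SPEED / 100,
    DRIVERS_ALCHO = DRIVERS_INVOLVED * PERC_DRIVERS_ALCHO / 100
  )
print(sample_rows)
```

For Alabama, 18.8 * 39 / 100 = 7.332 and 18.8 * 30 / 100 = 5.640, which agrees with the first entries of DRIVERS_SPEED and DRIVERS_ALCHO in the output above.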
In this step we draw a stacked bar plot with ggplot(), with states on the X axis and DRIVERS_SPEED and DRIVERS_INVOLVED stacked together on the Y axis. To achieve this, we first use select() to get the required columns, then use gather() to reshape DRIVERS_INVOLVED and DRIVERS_SPEED into long format, and finally use ggplot() to draw the stacked bar plot. Because DRIVERS_SPEED is itself a share of DRIVERS_INVOLVED, the total height of each bar is the sum of the two values.
bad_drivers %>%
  select(STATE, DRIVERS_INVOLVED, DRIVERS_SPEED) %>%
  gather(type, value, DRIVERS_INVOLVED:DRIVERS_SPEED) %>%
  ggplot(., aes(x = STATE, y = value, fill = type)) +
  geom_bar(position = "stack", stat = "identity") +
  scale_fill_manual(values = c("red", "darkred")) +
  ylab("Drivers involved in Fatal collision while Speeding") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
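In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(). The sketch below shows what the same reshaping step might look like with the newer function, assuming such a tidyr version is installed. The two-row tibble and the names `sample_wide` and `sample_long` are illustrative, with values taken from the output shown earlier:

```r
library(dplyr)
library(tidyr)
library(tibble)

# Two rows in wide format: one column per measure
sample_wide <- tibble(
  STATE = c("Alabama", "Alaska"),
  DRIVERS_INVOLVED = c(18.8, 18.1),
  DRIVERS_SPEED = c(7.332, 7.421)
)

# Reshape to long format: one row per state and measure
sample_long <- sample_wide %>%
  pivot_longer(
    cols = c(DRIVERS_INVOLVED, DRIVERS_SPEED),
    names_to = "type",
    values_to = "value"
  )
print(sample_long)
```

The result has the same `type` and `value` columns that the gather() call produces, so it can be piped into the same ggplot() code.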
Similarly, the next stacked plot has states on the X axis and DRIVERS_ALCHO and DRIVERS_INVOLVED stacked together on the Y axis. As before, we use select() to get the required columns, gather() to reshape DRIVERS_INVOLVED and DRIVERS_ALCHO into long format, and ggplot() to draw the stacked bar plot.
bad_drivers %>%
  select(STATE, DRIVERS_INVOLVED, DRIVERS_ALCHO) %>%
  gather(type, value, DRIVERS_INVOLVED:DRIVERS_ALCHO) %>%
  ggplot(., aes(x = STATE, y = value, fill = type)) +
  geom_bar(position = "stack", stat = "identity") +
  scale_fill_manual(values = c("green", "darkgreen")) +
  ylab("Drivers involved in Fatal collision while Alcohol-Impaired") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
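Stacking a subset on top of its own total makes each bar taller than DRIVERS_INVOLVED. One possible alternative, offered as a sketch rather than as the vignette's method, is to stack the alcohol-impaired drivers with the remaining drivers so that each bar's height equals DRIVERS_INVOLVED. `DRIVERS_OTHER`, `stack_data` and `alcohol_plot` are hypothetical names introduced here, and pivot_longer() assumes tidyr 1.0.0 or later:

```r
library(dplyr)
library(tidyr)
library(ggplot2)
library(tibble)

# Two rows copied from the output shown earlier
sample_rows <- tibble(
  STATE = c("Alabama", "Alaska"),
  DRIVERS_INVOLVED = c(18.8, 18.1),
  DRIVERS_ALCHO = c(5.640, 4.525)
)

# DRIVERS_OTHER (hypothetical): drivers who were not alcohol-impaired
stack_data <- sample_rows %>%
  mutate(DRIVERS_OTHER = DRIVERS_INVOLVED - DRIVERS_ALCHO) %>%
  select(STATE, DRIVERS_ALCHO, DRIVERS_OTHER) %>%
  pivot_longer(-STATE, names_to = "type", values_to = "value")

# geom_col() stacks by default, so the two segments sum to DRIVERS_INVOLVED
alcohol_plot <- ggplot(stack_data, aes(x = STATE, y = value, fill = type)) +
  geom_col() +
  scale_fill_manual(values = c("green", "darkgreen")) +
  ylab("Drivers involved in fatal collisions per billion miles")
```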
The next stacked plot has states on the X axis and DRIVERS_NOT_DIST and DRIVERS_INVOLVED stacked together on the Y axis. Again, we use select() to get the required columns, gather() to reshape DRIVERS_INVOLVED and DRIVERS_NOT_DIST into long format, and ggplot() to draw the stacked bar plot.
bad_drivers %>%
  select(STATE, DRIVERS_INVOLVED, DRIVERS_NOT_DIST) %>%
  gather(type, value, DRIVERS_INVOLVED:DRIVERS_NOT_DIST) %>%
  ggplot(., aes(x = STATE, y = value, fill = type)) +
  geom_bar(position = "stack", stat = "identity") +
  scale_fill_manual(values = c("lightyellow", "yellow")) +
  ylab("Drivers involved in Fatal collision not distracted") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
The last stacked plot has states on the X axis and DRIVERS_NO_ACC and DRIVERS_INVOLVED stacked together on the Y axis. Once more, we use select() to get the required columns, gather() to reshape DRIVERS_INVOLVED and DRIVERS_NO_ACC into long format, and ggplot() to draw the stacked bar plot.
bad_drivers %>%
  select(STATE, DRIVERS_INVOLVED, DRIVERS_NO_ACC) %>%
  gather(type, value, DRIVERS_INVOLVED:DRIVERS_NO_ACC) %>%
  ggplot(., aes(x = STATE, y = value, fill = type)) +
  geom_bar(position = "stack", stat = "identity") +
  scale_fill_manual(values = c("blue", "darkblue")) +
  ylab("Drivers involved in Fatal collision with no previous accidents") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
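The four stacked plots differ only in the column selected, the fill colours and the axis label. As a sketch of how that repetition could be factored out, the function below wraps the shared pipeline. `plot_stacked()` and its arguments are hypothetical and not part of the original vignette; the two-row tibble is illustrative, with values taken from the output shown earlier:

```r
library(dplyr)
library(tidyr)
library(ggplot2)
library(tibble)

# plot_stacked() is a hypothetical helper wrapping the pipeline used above
plot_stacked <- function(data, count_col, fill_colours, y_label) {
  data %>%
    # all_of() selects the column whose name is stored in count_col
    select(STATE, DRIVERS_INVOLVED, all_of(count_col)) %>%
    # reshape every column except STATE into type/value pairs
    gather(type, value, -STATE) %>%
    ggplot(aes(x = STATE, y = value, fill = type)) +
    geom_bar(position = "stack", stat = "identity") +
    scale_fill_manual(values = fill_colours) +
    ylab(y_label) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
}

# Usage with a two-row sample
sample_rows <- tibble(
  STATE = c("Alabama", "Alaska"),
  DRIVERS_INVOLVED = c(18.8, 18.1),
  DRIVERS_SPEED = c(7.332, 7.421)
)
speed_plot <- plot_stacked(
  sample_rows, "DRIVERS_SPEED",
  c("red", "darkred"),
  "Drivers involved in Fatal collision while Speeding"
)
```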
The plot below shows STATE against INS_PREM, drawn as a bar plot with ggplot().
bad_drivers %>%
  ggplot(., aes(x = STATE, y = INS_PREM)) +
  geom_bar(position = "stack", stat = "identity") +
  ylab("Car Insurance Premium") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
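By default the states appear in alphabetical order, which makes it hard to compare premiums. As a sketch of one way to use forcats here, fct_reorder() can turn STATE into a factor ordered by INS_PREM so the bars run from lowest to highest premium. The three-row tibble and the names `sample_prem`, `ordered_prem` and `premium_plot` are illustrative, with values taken from the output shown earlier:

```r
library(dplyr)
library(forcats)
library(ggplot2)
library(tibble)

# Three rows copied from the glimpse() output
sample_prem <- tibble(
  STATE = c("Alabama", "Alaska", "Arizona"),
  INS_PREM = c(784.55, 1053.48, 899.47)
)

# Reorder the factor levels of STATE by ascending premium
ordered_prem <- sample_prem %>%
  mutate(STATE = fct_reorder(STATE, INS_PREM))

# geom_col() is shorthand for geom_bar(stat = "identity")
premium_plot <- ggplot(ordered_prem, aes(x = STATE, y = INS_PREM)) +
  geom_col() +
  ylab("Car Insurance Premium") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```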
Here we used several tidyverse packages and their functions to explore the bad drivers dataset. For complete details, refer to https://www.tidyverse.org/.
Resources: