TidyVerse CREATE assignment

Your task here is to Create an Example. Using one or more TidyVerse packages, and any dataset from fivethirtyeight.com or Kaggle, create a programming sample “vignette” that demonstrates how to use one or more of the capabilities of the selected TidyVerse package with your selected dataset.

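Getting started

Let's load the tidyverse first. The core tidyverse includes the readr, dplyr, tidyr, ggplot2, stringr, tibble, forcats and purrr packages. The gganimate, gifski and ggthemes packages are not part of the core tidyverse and are loaded separately.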

library(readr)

## Warning: package 'readr' was built under R version 4.2.2

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.2.2

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)

## Warning: package 'tidyr' was built under R version 4.2.2

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.2.2

library(stringr)

## Warning: package 'stringr' was built under R version 4.2.2

library(gganimate)

## Warning: package 'gganimate' was built under R version 4.2.2

library(gifski)

## Warning: package 'gifski' was built under R version 4.2.2

library(ggthemes)

## Warning: package 'ggthemes' was built under R version 4.2.2

library(forcats)

## Warning: package 'forcats' was built under R version 4.2.2

library(tibble)
library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.2.2

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──

## ✔ purrr 0.3.5     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

We’ll load a dataset from fivethirtyeight.com to demonstrate how the tidyverse packages work. The data describes drivers involved in fatal collisions, by US state. The first step is to read the bad-drivers data from the GitHub repository. The file has 51 rows and the following eight fields:

State
Number of drivers involved in fatal collisions per billion miles
Percentage of drivers involved in fatal collisions who were speeding
Percentage of drivers involved in fatal collisions who were alcohol-impaired
Percentage of drivers involved in fatal collisions who were not distracted
Percentage of drivers involved in fatal collisions who had not been involved in any previous accidents
Car insurance premiums ($)
Losses incurred by insurance companies for collisions per insured driver ($)

Reading data with the readr package

The read_csv() function, from the readr package, reads flat files with comma-separated values.

# define URL for bad drivers data
theURL <- 'https://raw.githubusercontent.com/IvanGrozny88/TidyVerse/main/bad-drivers_csv.csv'

# read data
bad_drivers <- read_csv(theURL)

## Rows: 51 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): state
## dbl (7): number of drivers involved in fatal collisions per billion miles, p...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(bad_drivers)

## # A tibble: 6 × 8
##   state      number of drivers…¹ perce…² perce…³ perce…⁴ perce…⁵ car i…⁶ losse…⁷
##   <chr>                    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 alabama                   18.8      39      30      96      80    785.    145.
## 2 alaska                    18.1      41      25      90      94   1053.    134.
## 3 arizona                   18.6      35      28      84      96    899.    110.
## 4 arkansas                  22.4      18      26      94      95    827.    142.
## 5 california                12        35      28      91      89    878.    166.
## 6 colorado                  13.6      37      28      79      95    836.    140.
## # … with abbreviated variable names
## #   ¹`number of drivers involved in fatal collisions per billion miles`,
## #   ²`percentage of drivers involved in fatal collisions who were speeding`,
## #   ³`percentage of drivers involved in fatal collisions who were alcohol-impaired`,
## #   ⁴`percentage of drivers involved in fatal collisions who were not distracted`,
## #   ⁵`percentage of drivers involved in fatal collisions who had not been involved in any previous accidents`,
## #   ⁶`car insurance premiums ($)`, …
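The message printed by read_csv() above suggests specifying the column types. As a minimal sketch of that option, the example below reads a small inline sample instead of the real file. The sample rows and the use of I() for literal data (which assumes readr 2.0 or later) are illustrative assumptions, not part of the original analysis.

```r
library(readr)

# Hypothetical two-row sample that mimics two columns of the real file.
# I() marks the string as literal data rather than a file path (readr >= 2.0).
sample_csv <- I("state,car insurance premiums ($)\nalabama,784.55\nalaska,1053.48\n")

# Declaring the column types up front documents the expected schema
# and suppresses the column-specification message.
sample <- read_csv(
  sample_csv,
  col_types = cols(
    state = col_character(),
    `car insurance premiums ($)` = col_double()
  )
)

sample
```

The same col_types argument could be passed when reading from theURL, with one entry per column of the full file.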

Next, we replace the long column names with shorter ones.

The glimpse() function, available from the tibble package, shows every column in a data frame along with its type and first few values.

# rename columns
colnames(bad_drivers) <- c("STATE", 
                           "DRIVERS_INVOLVED", 
                           "PERC_DRIVERS_SPEED", 
                           "PERC_DRIVERS_ALCHO", 
                           "PERC_DRIVERS_NOT_DIST", 
                           "PERC_DRIVERS_NO_ACC", 
                           "INS_PREM", 
                           "LOSS_INSCOMP")

glimpse(bad_drivers)

## Rows: 51
## Columns: 8
## $ STATE                 <chr> "alabama", "alaska", "arizona", "arkansas", "cal…
## $ DRIVERS_INVOLVED      <dbl> 18.8, 18.1, 18.6, 22.4, 12.0, 13.6, 10.8, 16.2, …
## $ PERC_DRIVERS_SPEED    <dbl> 39, 41, 35, 18, 35, 37, 46, 38, 34, 21, 19, 54, …
## $ PERC_DRIVERS_ALCHO    <dbl> 30, 25, 28, 26, 28, 28, 36, 30, 27, 29, 25, 41, …
## $ PERC_DRIVERS_NOT_DIST <dbl> 96, 90, 84, 94, 91, 79, 87, 87, 100, 92, 95, 82,…
## $ PERC_DRIVERS_NO_ACC   <dbl> 80, 94, 96, 95, 89, 95, 82, 99, 100, 94, 93, 87,…
## $ INS_PREM              <dbl> 784.55, 1053.48, 899.47, 827.34, 878.41, 835.50,…
## $ LOSS_INSCOMP          <dbl> 145.08, 133.93, 110.35, 142.39, 165.63, 139.91, …
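Assigning to colnames() depends on the column order. As an alternative sketch, dplyr's rename() maps each new name to an old one explicitly, so the order of the columns does not matter. The two-column tibble below is a hypothetical stand-in for the full data.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in with two of the original long column names.
raw <- tibble(
  state = c("alabama", "alaska"),
  `car insurance premiums ($)` = c(784.55, 1053.48)
)

# rename() uses new_name = old_name; names with spaces need backticks.
renamed <- raw %>%
  rename(
    STATE = state,
    INS_PREM = `car insurance premiums ($)`
  )

glimpse(renamed)
```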

Manipulate and visualize data using dplyr, tidyr and ggplot2

As the column descriptions show, PERC_DRIVERS_SPEED, PERC_DRIVERS_ALCHO, PERC_DRIVERS_NOT_DIST and PERC_DRIVERS_NO_ACC are percentages of DRIVERS_INVOLVED. We will use mutate() to turn each percentage into a rate on the same scale as DRIVERS_INVOLVED, by multiplying it by DRIVERS_INVOLVED and dividing by 100.

The mutate() function, from the dplyr package, adds new variables and preserves existing ones.

# create new column DRIVERS_SPEED which will be (DRIVERS_INVOLVED*PERC_DRIVERS_SPEED)/100
bad_drivers <- bad_drivers %>% 
  mutate(DRIVERS_SPEED=(DRIVERS_INVOLVED*PERC_DRIVERS_SPEED)/100) %>% 
  mutate(DRIVERS_ALCHO=(DRIVERS_INVOLVED*PERC_DRIVERS_ALCHO)/100) %>% 
  mutate(DRIVERS_NOT_DIST=(DRIVERS_INVOLVED*PERC_DRIVERS_NOT_DIST)/100) %>% 
  mutate(DRIVERS_NO_ACC=(DRIVERS_INVOLVED*PERC_DRIVERS_NO_ACC)/100)

glimpse(bad_drivers)

## Rows: 51
## Columns: 12
## $ STATE                 <chr> "alabama", "alaska", "arizona", "arkansas", "cal…
## $ DRIVERS_INVOLVED      <dbl> 18.8, 18.1, 18.6, 22.4, 12.0, 13.6, 10.8, 16.2, …
## $ PERC_DRIVERS_SPEED    <dbl> 39, 41, 35, 18, 35, 37, 46, 38, 34, 21, 19, 54, …
## $ PERC_DRIVERS_ALCHO    <dbl> 30, 25, 28, 26, 28, 28, 36, 30, 27, 29, 25, 41, …
## $ PERC_DRIVERS_NOT_DIST <dbl> 96, 90, 84, 94, 91, 79, 87, 87, 100, 92, 95, 82,…
## $ PERC_DRIVERS_NO_ACC   <dbl> 80, 94, 96, 95, 89, 95, 82, 99, 100, 94, 93, 87,…
## $ INS_PREM              <dbl> 784.55, 1053.48, 899.47, 827.34, 878.41, 835.50,…
## $ LOSS_INSCOMP          <dbl> 145.08, 133.93, 110.35, 142.39, 165.63, 139.91, …
## $ DRIVERS_SPEED         <dbl> 7.332, 7.421, 6.510, 4.032, 4.200, 5.032, 4.968,…
## $ DRIVERS_ALCHO         <dbl> 5.640, 4.525, 5.208, 5.824, 3.360, 3.808, 3.888,…
## $ DRIVERS_NOT_DIST      <dbl> 18.048, 16.290, 15.624, 21.056, 10.920, 10.744, …
## $ DRIVERS_NO_ACC        <dbl> 15.040, 17.014, 17.856, 21.280, 10.680, 12.920, …
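The four mutate() calls above repeat the same formula. One possible way to express it once is dplyr's across(), sketched below on two rows copied from the output above. The generated column names (the "_RATE" suffix) are an assumption of this sketch and differ from the names used in the rest of this vignette.

```r
library(dplyr)
library(tibble)

# Two rows taken from the glimpse() output, with two of the percentage columns.
drivers <- tribble(
  ~STATE,    ~DRIVERS_INVOLVED, ~PERC_DRIVERS_SPEED, ~PERC_DRIVERS_ALCHO,
  "alabama", 18.8,              39,                  30,
  "alaska",  18.1,              41,                  25
)

# Apply the same percentage-to-rate formula to every PERC_ column.
# .names controls the new column names; "_RATE" is a hypothetical suffix.
rates <- drivers %>%
  mutate(across(
    starts_with("PERC_"),
    ~ DRIVERS_INVOLVED * .x / 100,
    .names = "{.col}_RATE"
  ))

rates
```

For alabama this gives 18.8 * 39 / 100 = 7.332 for speeding, which matches the DRIVERS_SPEED value shown in the output above.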

In this step we create a stacked bar plot with ggplot(), with states on the X axis and DRIVERS_SPEED and DRIVERS_INVOLVED stacked together on the Y axis. We first use select() to keep the necessary columns, then use gather() to reshape DRIVERS_INVOLVED and DRIVERS_SPEED into long format, and finally build the stacked bar plot with ggplot().

The dplyr package’s select() function keeps only the variables we name. The tidyr package’s gather() function collapses several columns into key-value pairs, repeating the values of the remaining columns as needed. gather() still works, but tidyr now recommends pivot_longer() for new code. Every ggplot2 plot starts with a call to ggplot().

bad_drivers %>% 
  select(STATE, DRIVERS_INVOLVED, DRIVERS_SPEED) %>% 
  gather(type, value, DRIVERS_INVOLVED:DRIVERS_SPEED) %>% 
  ggplot(., aes(x = STATE,y = value, fill = type)) + 
  geom_bar(position = "stack", stat="identity") + 
  scale_fill_manual(values = c("red", "darkred")) + 
  ylab("Drivers involved in Fatal collision while Speeding") + 
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
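For reference, the reshaping step done with gather() can also be written with tidyr's pivot_longer(). The sketch below uses two rows copied from the output above rather than the full data set.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Two rows taken from the glimpse() output above.
drivers <- tribble(
  ~STATE,    ~DRIVERS_INVOLVED, ~DRIVERS_SPEED,
  "alabama", 18.8,              7.332,
  "alaska",  18.1,              7.421
)

# Collapse the two measure columns into a "type" / "value" pair,
# the same layout that gather(type, value, ...) produces.
long <- drivers %>%
  pivot_longer(
    cols = c(DRIVERS_INVOLVED, DRIVERS_SPEED),
    names_to = "type",
    values_to = "value"
  )

long
```

The rows come out in a different order than with gather() (grouped by state rather than by type), which does not affect the resulting plot.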

The next plot follows the same pattern, with states on the X axis and DRIVERS_ALCHO and DRIVERS_INVOLVED stacked together on the Y axis. We use select() to keep the necessary columns, gather() to reshape DRIVERS_INVOLVED and DRIVERS_ALCHO into long format, and ggplot() to create the stacked bar plot.

bad_drivers %>% 
  select(STATE, DRIVERS_INVOLVED, DRIVERS_ALCHO) %>% 
  gather(type, value, DRIVERS_INVOLVED:DRIVERS_ALCHO) %>% 
  ggplot(., aes(x = STATE,y = value, fill = type)) + 
  geom_bar(position = "stack", stat="identity") + 
  scale_fill_manual(values = c("green", "darkgreen")) + 
  ylab("Drivers involved in Fatal collision while Alcohol-Impaired") + 
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))

The next stacked plot has states on the X axis and DRIVERS_NOT_DIST and DRIVERS_INVOLVED stacked together on the Y axis. Again, we use select() to keep the necessary columns, gather() to reshape DRIVERS_INVOLVED and DRIVERS_NOT_DIST into long format, and ggplot() to create the stacked bar plot.

bad_drivers %>% 
  select(STATE, DRIVERS_INVOLVED, DRIVERS_NOT_DIST) %>% 
  gather(type, value, DRIVERS_INVOLVED:DRIVERS_NOT_DIST) %>% 
  ggplot(., aes(x = STATE,y = value, fill = type)) + 
  geom_bar(position = "stack", stat="identity") + 
  scale_fill_manual(values = c("lightyellow", "yellow")) + 
  ylab("Drivers involved in Fatal collision not distracted") + 
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))

The last stacked plot has states on the X axis and DRIVERS_NO_ACC and DRIVERS_INVOLVED stacked together on the Y axis. We use select() to keep the necessary columns, gather() to reshape DRIVERS_INVOLVED and DRIVERS_NO_ACC into long format, and ggplot() to create the stacked bar plot.

bad_drivers %>% 
  select(STATE, DRIVERS_INVOLVED, DRIVERS_NO_ACC) %>% 
  gather(type, value, DRIVERS_INVOLVED:DRIVERS_NO_ACC) %>% 
  ggplot(., aes(x = STATE,y = value, fill = type)) + 
  geom_bar(position = "stack", stat="identity") + 
  scale_fill_manual(values = c("blue", "darkblue")) + 
  ylab("Drivers involved in Fatal collision with no previous accident") + 
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
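The four stacked plots above differ only in the second column, the fill colors and the axis label. As a sketch of how the repetition could be factored out, the hypothetical helper plot_stacked() below (not part of any package) wraps the shared pipeline. It uses pivot_longer() and geom_col() in place of gather() and geom_bar(stat = "identity"), which is an assumption of this sketch rather than the code used above.

```r
library(dplyr)
library(tidyr)
library(ggplot2)
library(tibble)

# Hypothetical helper: stacks DRIVERS_INVOLVED with one other column per state.
#   data        data frame with STATE, DRIVERS_INVOLVED and the column in subset_col
#   subset_col  name of the second column, as a string
#   fill_colors two colors, matched to the two series in alphabetical order
#   y_label     label for the Y axis
plot_stacked <- function(data, subset_col, fill_colors, y_label) {
  data %>%
    select(STATE, DRIVERS_INVOLVED, all_of(subset_col)) %>%
    pivot_longer(-STATE, names_to = "type", values_to = "value") %>%
    ggplot(aes(x = STATE, y = value, fill = type)) +
    geom_col(position = "stack") +
    scale_fill_manual(values = fill_colors) +
    ylab(y_label) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
}

# Two rows taken from the glimpse() output above.
drivers <- tribble(
  ~STATE,    ~DRIVERS_INVOLVED, ~DRIVERS_SPEED,
  "alabama", 18.8,              7.332,
  "alaska",  18.1,              7.421
)

p <- plot_stacked(
  drivers,
  subset_col  = "DRIVERS_SPEED",
  fill_colors = c("red", "darkred"),
  y_label     = "Drivers involved in Fatal collision while Speeding"
)
p
```

With the full bad_drivers data, each of the four plots would then be a single call with a different subset_col, color pair and label.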

The bar plot below, created with ggplot(), shows INS_PREM for each STATE.

bad_drivers %>% 
  ggplot(., aes(x = STATE,y = INS_PREM)) + 
  geom_bar(position = "stack", stat="identity") + 
  ylab("Car Insurance Premium") + 
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
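The forcats and stringr packages were loaded at the start but are not used above. As an optional sketch, they could make this plot easier to read by capitalizing the state names and ordering the bars by premium. The three rows below are copied from the output above; the horizontal layout via coord_flip() is a choice made for this sketch.

```r
library(dplyr)
library(forcats)
library(stringr)
library(ggplot2)
library(tibble)

# Three rows taken from the glimpse() output above.
premiums <- tribble(
  ~STATE,    ~INS_PREM,
  "alabama", 784.55,
  "alaska",  1053.48,
  "arizona", 899.47
)

ordered <- premiums %>%
  mutate(
    STATE = str_to_title(STATE),          # "alabama" becomes "Alabama"
    STATE = fct_reorder(STATE, INS_PREM)  # factor levels sorted by premium, ascending
  )

levels(ordered$STATE)

# Horizontal bars, sorted by premium.
p <- ggplot(ordered, aes(x = STATE, y = INS_PREM)) +
  geom_col() +
  coord_flip() +
  ylab("Car Insurance Premium")
p
```

Here the factor levels come out as Alabama, Arizona, Alaska, which is the order of increasing premium in this sample.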

Conclusion

We’ve covered several tidyverse packages and some of their functions while examining the bad-drivers dataset. For details on the complete set of packages, see https://www.tidyverse.org/.

https://www.tidyverse.org/packages/

https://fivethirtyeight.com/features/which-state-has-the-worst-drivers/

IvanTikhonov

2022-11-11