Your task here is to create an example. Using one or more tidyverse packages, and any dataset from fivethirtyeight.com or Kaggle, create a programming sample “vignette” that demonstrates how to use one or more of the capabilities of the selected tidyverse package with your selected dataset.
Getting started
Let’s load the tidyverse packages first. The core tidyverse includes the readr, dplyr, tidyr, ggplot2, stringr, tibble, forcats, and purrr packages.
library(readr)
## Warning: package 'readr' was built under R version 4.2.2
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.2.2
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.2
library(stringr)
## Warning: package 'stringr' was built under R version 4.2.2
library(gganimate)
## Warning: package 'gganimate' was built under R version 4.2.2
library(gifski)
## Warning: package 'gifski' was built under R version 4.2.2
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 4.2.2
library(forcats)
## Warning: package 'forcats' was built under R version 4.2.2
library(tibble)
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.2
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ purrr 0.3.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
We’ll load a dataset from fivethirtyeight.com to demonstrate how tidyverse functions work. The dataset describes drivers involved in fatal collisions in each of the 50 U.S. states and the District of Columbia (51 rows). The first step is to read the bad-drivers data from the GitHub repository. The data contains the following fields:
- State
- Number of drivers involved in fatal collisions per billion miles
- Percentage of drivers involved in fatal collisions who were speeding
- Percentage of drivers involved in fatal collisions who were alcohol-impaired
- Percentage of drivers involved in fatal collisions who were not distracted
- Percentage of drivers involved in fatal collisions who had not been involved in any previous accidents
- Car insurance premiums ($)
- Insurance company losses from collisions per insured driver ($)
Read data using the readr package
The read_csv() function from the readr package reads flat files with comma-separated values.
# define URL for bad drivers data
theURL <- 'https://raw.githubusercontent.com/IvanGrozny88/TidyVerse/main/bad-drivers_csv.csv'
# read data
bad_drivers <- read_csv(theURL)
## Rows: 51 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): state
## dbl (7): number of drivers involved in fatal collisions per billion miles, p...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(bad_drivers)
## # A tibble: 6 × 8
## state number of drivers…¹ perce…² perce…³ perce…⁴ perce…⁵ car i…⁶ losse…⁷
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 alabama 18.8 39 30 96 80 785. 145.
## 2 alaska 18.1 41 25 90 94 1053. 134.
## 3 arizona 18.6 35 28 84 96 899. 110.
## 4 arkansas 22.4 18 26 94 95 827. 142.
## 5 california 12 35 28 91 89 878. 166.
## 6 colorado 13.6 37 28 79 95 836. 140.
## # … with abbreviated variable names
## # ¹`number of drivers involved in fatal collisions per billion miles`,
## # ²`percentage of drivers involved in fatal collisions who were speeding`,
## # ³`percentage of drivers involved in fatal collisions who were alcohol-impaired`,
## # ⁴`percentage of drivers involved in fatal collisions who were not distracted`,
## # ⁵`percentage of drivers involved in fatal collisions who had not been involved in any previous accidents`,
## # ⁶`car insurance premiums ($)`, …
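The column-specification message printed by read_csv() can be avoided by declaring the column types up front. The sketch below is a minimal illustration rather than part of the original analysis: it reads a two-row excerpt of the data from an inline string instead of the URL, and it assumes readr 2.0 or later, where literal data is wrapped in I(). The names csv_text and sample_drivers are made up for this example.

```r
library(readr)

# Two rows copied from the head() output above, with only the first two columns
csv_text <- paste0(
  "state,number of drivers involved in fatal collisions per billion miles\n",
  "alabama,18.8\n",
  "alaska,18.1\n"
)

# Declare the types explicitly: state is text, every other column is numeric.
# I() marks the string as literal data (assumes readr >= 2.0).
sample_drivers <- read_csv(
  I(csv_text),
  col_types = cols(state = col_character(), .default = col_double())
)

sample_drivers
```

The same col_types argument can be passed when reading from theURL, which makes the expected schema explicit and fails loudly if a column cannot be parsed as declared.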
Next, we replace the long column names with shorter ones.
The glimpse() function, available through the tibble package (and re-exported by dplyr), shows every column in a data frame along with its type and first few values.
# rename columns
colnames(bad_drivers) <- c("STATE",
                           "DRIVERS_INVOLVED",
                           "PERC_DRIVERS_SPEED",
                           "PERC_DRIVERS_ALCHO",
                           "PERC_DRIVERS_NOT_DIST",
                           "PERC_DRIVERS_NO_ACC",
                           "INS_PREM",
                           "LOSS_INSCOMP")
glimpse(bad_drivers)
## Rows: 51
## Columns: 8
## $ STATE <chr> "alabama", "alaska", "arizona", "arkansas", "cal…
## $ DRIVERS_INVOLVED <dbl> 18.8, 18.1, 18.6, 22.4, 12.0, 13.6, 10.8, 16.2, …
## $ PERC_DRIVERS_SPEED <dbl> 39, 41, 35, 18, 35, 37, 46, 38, 34, 21, 19, 54, …
## $ PERC_DRIVERS_ALCHO <dbl> 30, 25, 28, 26, 28, 28, 36, 30, 27, 29, 25, 41, …
## $ PERC_DRIVERS_NOT_DIST <dbl> 96, 90, 84, 94, 91, 79, 87, 87, 100, 92, 95, 82,…
## $ PERC_DRIVERS_NO_ACC <dbl> 80, 94, 96, 95, 89, 95, 82, 99, 100, 94, 93, 87,…
## $ INS_PREM <dbl> 784.55, 1053.48, 899.47, 827.34, 878.41, 835.50,…
## $ LOSS_INSCOMP <dbl> 145.08, 133.93, 110.35, 142.39, 165.63, 139.91, …
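Assigning to colnames() depends on the columns arriving in a fixed order. As an alternative sketch, dplyr's rename() maps each new name to an old one explicitly, so a reordered file would not silently mislabel columns. The tibble raw_drivers below is a made-up two-column excerpt used only for illustration.

```r
library(dplyr)

# Two rows and two columns copied from the output above
raw_drivers <- tibble(
  state = c("alabama", "alaska"),
  `number of drivers involved in fatal collisions per billion miles` = c(18.8, 18.1)
)

# rename() takes new_name = old_name pairs; backticks quote names with spaces
renamed_drivers <- raw_drivers %>%
  rename(
    STATE = state,
    DRIVERS_INVOLVED = `number of drivers involved in fatal collisions per billion miles`
  )

names(renamed_drivers)
```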
Manipulate and visualize data using dplyr, tidyr, and ggplot2
As we have seen, PERC_DRIVERS_SPEED, PERC_DRIVERS_ALCHO, PERC_DRIVERS_NOT_DIST, and PERC_DRIVERS_NO_ACC are percentages of DRIVERS_INVOLVED. We will use mutate() to create new columns that hold the corresponding share of DRIVERS_INVOLVED for each percentage.
The mutate() function from the dplyr package adds new variables and preserves existing ones.
# create new column DRIVERS_SPEED which will be (DRIVERS_INVOLVED*PERC_DRIVERS_SPEED)/100
bad_drivers <- bad_drivers %>%
  mutate(DRIVERS_SPEED=(DRIVERS_INVOLVED*PERC_DRIVERS_SPEED)/100) %>%
  mutate(DRIVERS_ALCHO=(DRIVERS_INVOLVED*PERC_DRIVERS_ALCHO)/100) %>%
  mutate(DRIVERS_NOT_DIST=(DRIVERS_INVOLVED*PERC_DRIVERS_NOT_DIST)/100) %>%
  mutate(DRIVERS_NO_ACC=(DRIVERS_INVOLVED*PERC_DRIVERS_NO_ACC)/100)
glimpse(bad_drivers)
## Rows: 51
## Columns: 12
## $ STATE <chr> "alabama", "alaska", "arizona", "arkansas", "cal…
## $ DRIVERS_INVOLVED <dbl> 18.8, 18.1, 18.6, 22.4, 12.0, 13.6, 10.8, 16.2, …
## $ PERC_DRIVERS_SPEED <dbl> 39, 41, 35, 18, 35, 37, 46, 38, 34, 21, 19, 54, …
## $ PERC_DRIVERS_ALCHO <dbl> 30, 25, 28, 26, 28, 28, 36, 30, 27, 29, 25, 41, …
## $ PERC_DRIVERS_NOT_DIST <dbl> 96, 90, 84, 94, 91, 79, 87, 87, 100, 92, 95, 82,…
## $ PERC_DRIVERS_NO_ACC <dbl> 80, 94, 96, 95, 89, 95, 82, 99, 100, 94, 93, 87,…
## $ INS_PREM <dbl> 784.55, 1053.48, 899.47, 827.34, 878.41, 835.50,…
## $ LOSS_INSCOMP <dbl> 145.08, 133.93, 110.35, 142.39, 165.63, 139.91, …
## $ DRIVERS_SPEED <dbl> 7.332, 7.421, 6.510, 4.032, 4.200, 5.032, 4.968,…
## $ DRIVERS_ALCHO <dbl> 5.640, 4.525, 5.208, 5.824, 3.360, 3.808, 3.888,…
## $ DRIVERS_NOT_DIST <dbl> 18.048, 16.290, 15.624, 21.056, 10.920, 10.744, …
## $ DRIVERS_NO_ACC <dbl> 15.040, 17.014, 17.856, 21.280, 10.680, 12.920, …
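To make the arithmetic concrete, here is a small worked sketch on three rows copied from the output above. It also shows that several new columns can be created in a single mutate() call, which is equivalent to chaining one call per column. The tibble name sample_drivers is made up for this example.

```r
library(dplyr)

# First three rows of the relevant columns, taken from the glimpse() output
sample_drivers <- tibble(
  STATE = c("alabama", "alaska", "arizona"),
  DRIVERS_INVOLVED = c(18.8, 18.1, 18.6),
  PERC_DRIVERS_SPEED = c(39, 41, 35),
  PERC_DRIVERS_ALCHO = c(30, 25, 28)
)

# One mutate() call can define several columns at once
sample_rates <- sample_drivers %>%
  mutate(
    DRIVERS_SPEED = DRIVERS_INVOLVED * PERC_DRIVERS_SPEED / 100,
    DRIVERS_ALCHO = DRIVERS_INVOLVED * PERC_DRIVERS_ALCHO / 100
  )

# For alabama: 18.8 * 39 / 100 = 7.332, matching the DRIVERS_SPEED value above
sample_rates
```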
In this step we create a stacked bar plot using ggplot(), with states on the X axis and DRIVERS_SPEED and DRIVERS_INVOLVED stacked on the Y axis. We first use select() to keep the necessary columns, then use gather() to reshape DRIVERS_INVOLVED and DRIVERS_SPEED into long format, and finally draw the stacked bar plot with ggplot(). Because DRIVERS_SPEED is a part of DRIVERS_INVOLVED, the total height of each bar is the sum of the two values rather than DRIVERS_INVOLVED alone.
The select() function from the dplyr package keeps only the variables we name. The gather() function from the tidyr package collapses several columns into key-value pairs, duplicating the other columns as needed. Every ggplot2 plot starts with a call to ggplot().
bad_drivers %>%
  select(STATE, DRIVERS_INVOLVED, DRIVERS_SPEED) %>%
  gather(type, value, DRIVERS_INVOLVED:DRIVERS_SPEED) %>%
  ggplot(., aes(x = STATE,y = value, fill = type)) +
  geom_bar(position = "stack", stat="identity") +
  scale_fill_manual(values = c("red", "darkred")) +
  ylab("Drivers involved in Fatal collision while Speeding") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
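The reshaping step above can also be sketched with pivot_longer(), which the tidyr documentation recommends over gather() for new code. The variation below is an assumption about what a reader might want, not the original author's plot: it stacks the speeding share on top of the non-speeding remainder, so each bar's total height equals DRIVERS_INVOLVED. The column DRIVERS_NOT_SPEED and the object names are hypothetical and introduced only for this example.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Three rows copied from the glimpse() output above
sample_drivers <- tibble(
  STATE = c("alabama", "alaska", "arizona"),
  DRIVERS_INVOLVED = c(18.8, 18.1, 18.6),
  DRIVERS_SPEED = c(7.332, 7.421, 6.510)
)

long_drivers <- sample_drivers %>%
  # Hypothetical helper column: drivers involved who were not speeding
  mutate(DRIVERS_NOT_SPEED = DRIVERS_INVOLVED - DRIVERS_SPEED) %>%
  select(STATE, DRIVERS_SPEED, DRIVERS_NOT_SPEED) %>%
  # pivot_longer() is the current replacement for gather()
  pivot_longer(cols = -STATE, names_to = "type", values_to = "value")

# geom_col() is shorthand for geom_bar(stat = "identity")
speed_plot <- ggplot(long_drivers, aes(x = STATE, y = value, fill = type)) +
  geom_col(position = "stack") +
  scale_fill_manual(values = c("red", "darkred")) +
  ylab("Drivers involved in fatal collisions per billion miles") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```

Calling print(speed_plot) draws the chart. With this layout the two segments of each bar are disjoint groups, so the stacked height can be read directly as the overall rate.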
The next stacked plot is built the same way, with states on the X axis
and DRIVERS_ALCHO and DRIVERS_INVOLVED stacked on the Y axis. We first
use select() to keep the necessary columns, then gather() to reshape
DRIVERS_INVOLVED and DRIVERS_ALCHO into long format, and finally
ggplot() to draw the stacked bar plot.
bad_drivers %>%
  select(STATE, DRIVERS_INVOLVED, DRIVERS_ALCHO) %>%
  gather(type, value, DRIVERS_INVOLVED:DRIVERS_ALCHO) %>%
  ggplot(., aes(x = STATE,y = value, fill = type)) +
  geom_bar(position = "stack", stat="identity") +
  scale_fill_manual(values = c("green", "darkgreen")) +
  ylab("Drivers involved in Fatal collision while Alcohol-Impaired") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
The next stacked plot has states on the X axis and DRIVERS_NOT_DIST and
DRIVERS_INVOLVED stacked on the Y axis. We first use select() to keep
the necessary columns, then gather() to reshape DRIVERS_INVOLVED and
DRIVERS_NOT_DIST into long format, and finally ggplot() to draw the
stacked bar plot.
bad_drivers %>%
  select(STATE, DRIVERS_INVOLVED, DRIVERS_NOT_DIST) %>%
  gather(type, value, DRIVERS_INVOLVED:DRIVERS_NOT_DIST) %>%
  ggplot(., aes(x = STATE,y = value, fill = type)) +
  geom_bar(position = "stack", stat="identity") +
  scale_fill_manual(values = c("lightyellow", "yellow")) +
  ylab("Drivers involved in Fatal collision not distracted") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
The next stacked plot has states on the X axis and DRIVERS_NO_ACC and
DRIVERS_INVOLVED stacked on the Y axis. We first use select() to keep
the necessary columns, then gather() to reshape DRIVERS_INVOLVED and
DRIVERS_NO_ACC into long format, and finally ggplot() to draw the
stacked bar plot.
bad_drivers %>%
  select(STATE, DRIVERS_INVOLVED, DRIVERS_NO_ACC) %>%
  gather(type, value, DRIVERS_INVOLVED:DRIVERS_NO_ACC) %>%
  ggplot(., aes(x = STATE,y = value, fill = type)) +
  geom_bar(position = "stack", stat="identity") +
  scale_fill_manual(values = c("blue", "darkblue")) +
  ylab("Drivers involved in Fatal collision no pre accident") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
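The four stacked plots differ only in the column being compared, the fill colors, and the Y-axis label, so the shared steps can be wrapped in a function. The sketch below is one possible way to do that. The helper names reshape_pair() and plot_stacked() are hypothetical, and the sketch uses pivot_longer() in place of gather(), so treat it as an illustration rather than the author's code.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical helper: keep STATE, DRIVERS_INVOLVED, and one comparison column,
# then reshape to long format. all_of() selects a column named by a string.
reshape_pair <- function(data, subset_col) {
  data %>%
    select(STATE, DRIVERS_INVOLVED, all_of(subset_col)) %>%
    pivot_longer(cols = -STATE, names_to = "type", values_to = "value")
}

# Hypothetical helper: draw the stacked bar plot for one comparison column
plot_stacked <- function(data, subset_col, colors, y_label) {
  ggplot(reshape_pair(data, subset_col),
         aes(x = STATE, y = value, fill = type)) +
    geom_bar(position = "stack", stat = "identity") +
    scale_fill_manual(values = colors) +
    ylab(y_label) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
}

# Three rows copied from the glimpse() output above
sample_drivers <- tibble(
  STATE = c("alabama", "alaska", "arizona"),
  DRIVERS_INVOLVED = c(18.8, 18.1, 18.6),
  DRIVERS_SPEED = c(7.332, 7.421, 6.510),
  DRIVERS_ALCHO = c(5.640, 4.525, 5.208)
)

speed_long <- reshape_pair(sample_drivers, "DRIVERS_SPEED")
speed_plot <- plot_stacked(sample_drivers, "DRIVERS_SPEED",
                           c("red", "darkred"),
                           "Drivers involved in Fatal collision while Speeding")
alcohol_plot <- plot_stacked(sample_drivers, "DRIVERS_ALCHO",
                             c("green", "darkgreen"),
                             "Drivers involved in Fatal collision while Alcohol-Impaired")
```

Each plot is then one function call, and a change to the theme or geometry only has to be made in one place.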
The bar plot of STATE vs. INS_PREM below was created using the
ggplot() function.
bad_drivers %>%
  ggplot(., aes(x = STATE,y = INS_PREM)) +
  geom_bar(position = "stack", stat="identity") +
  ylab("Car Insurance Premium") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
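By default the bars appear in alphabetical order of STATE. If ranking states by premium is the goal, the forcats package loaded earlier offers fct_reorder(), which orders a factor's levels by another variable. The sketch below applies it to three rows from the output above. The object names are made up, and sorting the bars is a suggested variation rather than something the original plot does.

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Three rows copied from the glimpse() output above
sample_premiums <- tibble(
  STATE = c("alabama", "alaska", "arizona"),
  INS_PREM = c(784.55, 1053.48, 899.47)
)

# fct_reorder() turns STATE into a factor whose levels are sorted by INS_PREM
ordered_premiums <- sample_premiums %>%
  mutate(STATE = fct_reorder(STATE, INS_PREM))

premium_plot <- ggplot(ordered_premiums, aes(x = STATE, y = INS_PREM)) +
  geom_col() +
  ylab("Car Insurance Premium") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

levels(ordered_premiums$STATE)
```

ggplot2 draws discrete axes in factor-level order, so the bars in premium_plot run from the lowest premium to the highest.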
Conclusion
We covered several tidyverse packages and their functions while
examining the bad-drivers dataset. For full details, see
https://www.tidyverse.org/.
https://www.tidyverse.org/packages/
https://fivethirtyeight.com/features/which-state-has-the-worst-drivers/