Overview

Using one or more tidyverse packages and any dataset from fivethirtyeight.com or Kaggle, create a programming sample “vignette” that demonstrates how to use one or more of the capabilities of the selected tidyverse package with your selected dataset. For this vignette I selected a COVID dataset from Kaggle. The original data comes from The New York Times.

Load required libraries

The first step is to install and load the libraries needed to read the data from the New York Times Git repository.

Loading the ‘tidyverse’ library attaches these packages by default: ggplot2, purrr, tibble, dplyr, tidyr, stringr, readr, and forcats. My focus is on ggplot2 and dplyr.

We will use the read_csv() function from the readr package when we load the dataset. Note the underscore: read.csv() is the base R function, not the readr one.

Load the dataset

The raw data is a time series of cumulative counts of coronavirus cases and deaths in the United States, at the state and county level.
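As a minimal sketch of the loading step, the code below reads a small CSV with read_csv(). The column layout is taken from the outputs shown in this vignette. The object name covid_county and the temporary file are assumptions made so that the example runs offline. In practice the path would point at the downloaded county-level file.

```r
library(readr)   # provides read_csv()
library(dplyr)   # provides glimpse() and the verbs used later

# Assumption: a three-row sample written to a temporary file stands in for
# the real county-level CSV. The rows are copied from this vignette's output.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "date,county,state,fips,cases,deaths",
  "2020-10-24,Autauga,Alabama,1001,2048,31",
  "2020-10-24,Baldwin,Alabama,1003,6637,69",
  "2020-10-24,Barbour,Alabama,1005,1031,9"
), path)

# col_types = cols() lets readr guess the column types without printing them
covid_county <- read_csv(path, col_types = cols())

glimpse(covid_county)
```

read_csv() returns a tibble and parses the ISO-formatted date column as a Date, which makes later comparisons such as date == max(date) straightforward.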

Data exploration using dplyr

  1. The filter() function in dplyr

Description: Use filter() to choose rows/cases where conditions are true. Unlike base subsetting with [, rows where the condition evaluates to NA are dropped.
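The following sketch shows one way filter() could produce the latest snapshot. It assumes the filter condition is "date equals the most recent date", which the text implies but does not state. The 2020-10-24 rows are copied from this vignette. The 2020-10-23 row is invented only so that there is something to filter out.

```r
library(dplyr)

# Toy data: the 2020-10-24 rows come from this vignette's output.
# The 2020-10-23 row is made up for illustration and is not real data.
covid_county <- tibble(
  date   = as.Date(c("2020-10-23", "2020-10-24", "2020-10-24", "2020-10-24")),
  county = c("Autauga", "Autauga", "Baldwin", "Barbour"),
  state  = "Alabama",
  fips   = c(1001, 1001, 1003, 1005),
  cases  = c(2000, 2048, 6637, 1031),  # 2000 is an invented value
  deaths = c(30, 31, 69, 9)            # 30 is an invented value
)

# Keep only the rows recorded on the most recent date
covid_county_latest <- covid_county %>%
  filter(date == max(date))

covid_county_latest
```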

date        county   state    fips  cases  deaths
2020-10-24  Autauga  Alabama  1001   2048      31
2020-10-24  Baldwin  Alabama  1003   6637      69
2020-10-24  Barbour  Alabama  1005   1031       9
2020-10-24  Bibb     Alabama  1007    828      14
2020-10-24  Blount   Alabama  1009   1925      25

We now have the latest county-level COVID data in the dataset covid_county_latest.

  2. The arrange() and group_by() functions in dplyr

Description: arrange() orders tbl rows by an expression involving its variables. Most data operations are done on groups defined by variables. group_by() takes an existing tbl and converts it into a grouped tbl where operations are performed “by group”. ungroup() removes grouping.
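A minimal sketch of sorting within groups, assuming the goal is to order counties by case count inside each state. The rows are copied from this vignette, and the name covid_sorted is hypothetical. By default arrange() ignores grouping, so .by_group = TRUE is what makes the sort happen within each state.

```r
library(dplyr)

# Rows copied from this vignette's outputs (all dated 2020-10-24).
# Only one state appears here, so the grouping is for illustration.
covid_county_latest <- tibble(
  county = c("Autauga", "Baldwin", "Jefferson", "Mobile", "Madison"),
  state  = "Alabama",
  cases  = c(2048, 6637, 23129, 16849, 9280),
  deaths = c(31, 69, 377, 315, 96)
)

covid_sorted <- covid_county_latest %>%
  group_by(state) %>%
  arrange(desc(cases), .by_group = TRUE) %>%  # highest case count first, per state
  ungroup()                                   # drop the grouping once sorted

covid_sorted
```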

date        county      state    fips  cases  deaths
2020-10-24  Jefferson   Alabama  1073  23129     377
2020-10-24  Mobile      Alabama  1097  16849     315
2020-10-24  Tuscaloosa  Alabama  1125  10296     140
2020-10-24  Montgomery  Alabama  1101  10197     197
2020-10-24  Madison     Alabama  1089   9280      96

Within each state, the counties are now sorted by COVID cases from highest to lowest.

  3. The select() and rename() functions in dplyr

Description: Choose or rename variables from a tbl. select() keeps only the variables you mention; rename() keeps all variables.
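As a sketch of this step, the code below keeps four columns and renames two of them to match the column names in the output shown in this section. The object names covid_sorted and covid_renamed are assumptions. The rows are copied from this vignette.

```r
library(dplyr)

# Rows copied from this vignette's sorted output
covid_sorted <- tibble(
  date   = as.Date("2020-10-24"),
  county = c("Jefferson", "Mobile", "Tuscaloosa"),
  state  = "Alabama",
  fips   = c(1073, 1097, 1125),
  cases  = c(23129, 16849, 10296),
  deaths = c(377, 315, 140)
)

covid_renamed <- covid_sorted %>%
  select(county, state, cases, deaths) %>%            # drops date and fips
  rename(covid_cases = cases, covid_deaths = deaths)  # new_name = old_name

covid_renamed
```

select() also accepts new_name = old_name pairs, so the two calls could be merged into one. Keeping them separate shows each verb on its own.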

county      state    covid_cases  covid_deaths
Jefferson   Alabama        23129           377
Mobile      Alabama        16849           315
Tuscaloosa  Alabama        10296           140
Montgomery  Alabama        10197           197
Madison     Alabama         9280            96

We selected only the required columns and renamed them so that they are more intuitive.

  4. The summarise() function in dplyr

Description: Create one or more scalar variables summarizing the variables of an existing tbl. Tbls with groups created by group_by() will result in one row in the output for each group. Tbls with no groups will result in one row.
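A minimal sketch of collapsing county rows into one row per state. The Alabama rows are copied from this vignette. The "Example State" rows are placeholders invented so that there are two groups, and the name covid_state is hypothetical. With only three Alabama counties, the Alabama total here is partial and not the real state total.

```r
library(dplyr)

covid_renamed <- tibble(
  county       = c("Jefferson", "Mobile", "Tuscaloosa", "County X", "County Y"),
  state        = c("Alabama", "Alabama", "Alabama", "Example State", "Example State"),
  covid_cases  = c(23129, 16849, 10296, 100, 200),  # last two values are invented
  covid_deaths = c(377, 315, 140, 1, 2)             # last two values are invented
)

covid_state <- covid_renamed %>%
  group_by(state) %>%
  summarise(
    US_cases  = sum(covid_cases),   # one total per state
    US_deaths = sum(covid_deaths)
  ) %>%
  arrange(desc(US_cases))           # largest totals first

covid_state
```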

state       US_cases  US_deaths
California    906644      17345
Texas         906033      17998
Florida       776243      16416
New York      498568      33049
Illinois      376034       9765

Note that here we obtained COVID cases by state by combining several functions: arrange(), group_by() and summarise().

  5. The mutate() function in dplyr

Description: mutate() adds new variables and preserves existing ones; transmute() adds new variables and drops existing ones. Both functions preserve the number of rows of the input. New variables overwrite existing variables of the same name.
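The sketch below adds the two derived columns with mutate(). The state totals are copied from this vignette. mortality_rate is computed as deaths per 100 cases. The formula for cases_density is an assumption, namely each state's share of all cases in the data, in percent. Because this sample holds only five states, its cases_density values will not match the table in this section, which was presumably computed over every state.

```r
library(dplyr)

# State totals copied from the summarise() output in this vignette
covid_state <- tibble(
  state     = c("California", "Texas", "Florida", "New York", "Illinois"),
  US_cases  = c(906644, 906033, 776243, 498568, 376034),
  US_deaths = c(17345, 17998, 16416, 33049, 9765)
)

covid_state <- covid_state %>%
  mutate(
    # Deaths per 100 cases
    mortality_rate = round(US_deaths / US_cases * 100, 1),
    # Assumed definition: share of all cases in this data, in percent
    cases_density  = round(US_cases / sum(US_cases) * 100, 1)
  )

covid_state
```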

state       US_cases  US_deaths  mortality_rate  cases_density
California    906644      17345             1.9           10.5
Texas         906033      17998             2.0           10.5
Florida       776243      16416             2.1            9.0
New York      498568      33049             6.6            5.8
Illinois      376034       9765             2.6            4.4

The mortality rate, defined as deaths per case and shown here as a percentage, is a better metric for understanding the impact of the COVID pandemic.

Visualization

  1. The ggplot() function in ggplot2

Description: ggplot() initializes a ggplot object. It can be used to declare the input data frame for a graphic and to specify the set of plot aesthetics intended to be common throughout all subsequent layers unless specifically overridden.
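As a rough sketch, cases by state could be drawn as a bar chart ordered from highest to lowest. The choice of geom_col(), the axis labels and the object name p_cases are assumptions, since the original plot code is not shown. The state totals are copied from this vignette.

```r
library(ggplot2)

# State totals copied from this vignette's output
covid_state <- data.frame(
  state    = c("California", "Texas", "Florida", "New York", "Illinois"),
  US_cases = c(906644, 906033, 776243, 498568, 376034)
)

# reorder(state, -US_cases) orders the bars by case count, highest first
p_cases <- ggplot(covid_state, aes(x = reorder(state, -US_cases), y = US_cases)) +
  geom_col() +
  labs(x = "State", y = "Cumulative cases", title = "COVID cases by state")

p_cases
```

The data and the aesthetics are declared once in ggplot(), so any layer added afterwards inherits them unless it overrides them.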

The graph shows COVID cases by state, ordered from highest to lowest. Let’s now plot the mortality rate to see where the cases were more deadly.

  2. The geom_smooth() function in ggplot2

Description: Aids the eye in seeing patterns in the presence of overplotting. geom_smooth() and stat_smooth() are effectively aliases: they both use the same arguments. Use stat_smooth() if you want to display the results with a non-standard geom.
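A minimal sketch of a scatter plot with a fitted trend line. The axis assignment (case density on x, mortality rate on y), the linear method and the name p_trend are assumptions, because the original plot code is not shown. The values are copied from the mutate() output in this vignette. With only five states the fitted line is illustrative and says nothing reliable about the full data.

```r
library(ggplot2)

# Values copied from the mutate() output in this vignette
covid_state <- data.frame(
  state          = c("California", "Texas", "Florida", "New York", "Illinois"),
  mortality_rate = c(1.9, 2.0, 2.1, 6.6, 2.6),
  cases_density  = c(10.5, 10.5, 9.0, 5.8, 4.4)
)

p_trend <- ggplot(covid_state, aes(x = cases_density, y = mortality_rate)) +
  geom_point() +
  # method = "lm" fits a straight line; se = FALSE hides the confidence band
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

p_trend
```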

Conclusion

Three clear outliers may be obscuring the relationship between mortality rate and case density.

After removing the outliers, the relationship looks more linear: the COVID mortality rate is higher in states with higher case density.

