Overview
Using one or more TidyVerse packages, and any dataset from fivethirtyeight.com or Kaggle, create a programming sample “vignette” that demonstrates how to use one or more of the capabilities of the selected TidyVerse package with your selected dataset. In this case I selected the COVID dataset from Kaggle. The original data comes from The New York Times.
Load required libraries
Step 1 is to install and load the libraries required to extract the data from the NY Times GitHub repository.
The core packages attached by library(tidyverse) are ggplot2, purrr, tibble, dplyr, tidyr, stringr, readr, and forcats. My focus is on ggplot2 and dplyr.
We will use the read_csv() function from the readr package to load the dataset. (Base R's read.csv() is a different function and is not part of readr.)
Load the dataset
Raw data looks like this:
It is a time series of cumulative counts of coronavirus cases and deaths in the United States, at the state and county level.
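A minimal sketch of the loading step, assuming the readr package is installed. The helper name load_covid, the data frame name covid_county, and the URL in the comment are assumptions and do not come from the original vignette; the URL is where I believe the NY Times county file lives on GitHub.

```r
library(readr)  # also attached by library(tidyverse)

# Hypothetical helper: read the county-level CSV from a local path or a URL.
# col_types = cols() lets readr guess the column types without printing them.
load_covid <- function(path) {
  read_csv(path, col_types = cols())
}

# ASSUMPTION: location of the NY Times county file (not confirmed by the source).
# covid_county <- load_covid(
#   "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"
# )
```

Wrapping the call in a small function keeps the source location in one place, so the same code works with a downloaded Kaggle copy of the file.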
Data exploration using dplyr
- filter() function in dplyr
Description: Use filter() to choose rows/cases where conditions are true. Unlike base subsetting with [, rows where the condition evaluates to NA are dropped.
| date | county | state | fips | cases | deaths |
|---|---|---|---|---|---|
| 2020-10-24 | Autauga | Alabama | 1001 | 2048 | 31 |
| 2020-10-24 | Baldwin | Alabama | 1003 | 6637 | 69 |
| 2020-10-24 | Barbour | Alabama | 1005 | 1031 | 9 |
| 2020-10-24 | Bibb | Alabama | 1007 | 828 | 14 |
| 2020-10-24 | Blount | Alabama | 1009 | 1925 | 25 |
We now have the latest county-level COVID data in the dataset covid_county_latest.
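A minimal sketch of how covid_county_latest might be built with filter(). The raw data frame name covid_county is hypothetical, and the 2020-10-23 row is made up purely so that there is something to filter out; the 2020-10-24 rows are taken from the table above.

```r
library(dplyr)

# Small stand-in for the raw data. The 2020-10-23 row is illustrative only.
covid_county <- tibble(
  date   = as.Date(c("2020-10-23", "2020-10-24", "2020-10-24")),
  county = c("Autauga", "Autauga", "Baldwin"),
  state  = c("Alabama", "Alabama", "Alabama"),
  fips   = c(1001, 1001, 1003),
  cases  = c(2030, 2048, 6637),
  deaths = c(31, 31, 69)
)

# Keep only the rows for the most recent date in the data.
covid_county_latest <- covid_county %>%
  filter(date == max(date))
```

Because the counts are cumulative, the rows for the latest date already carry the running totals, so no further aggregation over time is needed.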
- arrange() and group_by() functions in dplyr
Description: arrange() orders tbl rows by an expression involving its variables. Most data operations are done on groups defined by variables. group_by() takes an existing tbl and converts it into a grouped tbl where operations are performed “by group”. ungroup() removes grouping.
| date | county | state | fips | cases | deaths |
|---|---|---|---|---|---|
| 2020-10-24 | Jefferson | Alabama | 1073 | 23129 | 377 |
| 2020-10-24 | Mobile | Alabama | 1097 | 16849 | 315 |
| 2020-10-24 | Tuscaloosa | Alabama | 1125 | 10296 | 140 |
| 2020-10-24 | Montgomery | Alabama | 1101 | 10197 | 197 |
| 2020-10-24 | Madison | Alabama | 1089 | 9280 | 96 |
Within each state, the counties are now sorted from the highest to the lowest number of COVID cases.
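One way to get this ordering, sketched under assumptions: the result name covid_sorted is hypothetical, the Alabama rows come from the tables above, and the "State X" rows are invented so that the example has a second group.

```r
library(dplyr)

covid_county_latest <- tibble(
  county = c("Autauga", "Jefferson", "Mobile", "County A", "County B"),
  state  = c("Alabama", "Alabama", "Alabama", "State X", "State X"),
  cases  = c(2048, 23129, 16849, 10, 50),   # "State X" values are illustrative
  deaths = c(31, 377, 315, 0, 1)
)

# Group by state, then sort by cases (descending) inside each group.
# .by_group = TRUE makes arrange() sort by the grouping variable first.
covid_sorted <- covid_county_latest %>%
  group_by(state) %>%
  arrange(desc(cases), .by_group = TRUE) %>%
  ungroup()
```

Without .by_group = TRUE, arrange() ignores the grouping and sorts the whole table by cases alone.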
- select() and rename() functions in dplyr
Description: Choose or rename variables from a tbl. select() keeps only the variables you mention; rename() keeps all variables.
| county | state | covid_cases | covid_deaths |
|---|---|---|---|
| Jefferson | Alabama | 23129 | 377 |
| Mobile | Alabama | 16849 | 315 |
| Tuscaloosa | Alabama | 10296 | 140 |
| Montgomery | Alabama | 10197 | 197 |
| Madison | Alabama | 9280 | 96 |
We selected only the required columns and renamed them so that they are more intuitive to understand.
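A short sketch of this step. The input and output names (covid_sorted, covid_renamed) are hypothetical; the column names and values are the ones shown in the tables above.

```r
library(dplyr)

covid_sorted <- tibble(
  date   = as.Date(c("2020-10-24", "2020-10-24")),
  county = c("Jefferson", "Mobile"),
  state  = c("Alabama", "Alabama"),
  fips   = c(1073, 1097),
  cases  = c(23129, 16849),
  deaths = c(377, 315)
)

# select() drops date and fips; rename() uses new_name = old_name.
covid_renamed <- covid_sorted %>%
  select(county, state, cases, deaths) %>%
  rename(covid_cases = cases, covid_deaths = deaths)
```

The two calls could also be merged, since select() accepts the same new_name = old_name syntax.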
- summarise() function in dplyr
Description: Create one or more scalar variables summarizing the variables of an existing tbl. Tbls with groups created by group_by() will result in one row in the output for each group. Tbls with no groups will result in one row.
| state | US_cases | US_deaths |
|---|---|---|
| California | 906644 | 17345 |
| Texas | 906033 | 17998 |
| Florida | 776243 | 16416 |
| New York | 498568 | 33049 |
| Illinois | 376034 | 9765 |
Note that here we obtained COVID cases by state by combining several functions: group_by(), summarise(), and arrange().
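The combination might look like the sketch below. The input name covid_renamed is hypothetical, the Alabama rows are from the earlier tables, and the "State X" row is invented to give a second group; the output columns US_cases and US_deaths match the table above.

```r
library(dplyr)

covid_renamed <- tibble(
  county       = c("Jefferson", "Mobile", "County A"),
  state        = c("Alabama", "Alabama", "State X"),
  covid_cases  = c(23129, 16849, 50),   # "State X" values are illustrative
  covid_deaths = c(377, 315, 1)
)

# One output row per state, sorted from most to fewest cases.
covid_state <- covid_renamed %>%
  group_by(state) %>%
  summarise(
    US_cases  = sum(covid_cases),
    US_deaths = sum(covid_deaths),
    .groups   = "drop"   # return an ungrouped tibble
  ) %>%
  arrange(desc(US_cases))
```

If the real data contains missing counts, sum(..., na.rm = TRUE) would be needed to avoid NA totals.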
- mutate() function in dplyr
Description: mutate() adds new variables and preserves existing ones; transmute() adds new variables and drops existing ones. Both functions preserve the number of rows of the input. New variables overwrite existing variables of the same name.
| state | US_cases | US_deaths | mortality_rate | cases_density |
|---|---|---|---|---|
| California | 906644 | 17345 | 1.9 | 10.5 |
| Texas | 906033 | 17998 | 2.0 | 10.5 |
| Florida | 776243 | 16416 | 2.1 | 9.0 |
| New York | 498568 | 33049 | 6.6 | 5.8 |
| Illinois | 376034 | 9765 | 2.6 | 4.4 |
The mortality rate, defined as deaths per 100 cases, is a better metric than raw counts for understanding the impact of the COVID pandemic.
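A sketch of how the two derived columns could be computed with mutate(). The mortality_rate formula follows the definition above and reproduces the values in the table. The cases_density formula is an assumption: the source does not define it, but the tabulated values look like each state's percentage share of all US cases.

```r
library(dplyr)

# State totals as shown in the summarise() table.
covid_state <- tibble(
  state     = c("California", "Texas", "Florida", "New York", "Illinois"),
  US_cases  = c(906644, 906033, 776243, 498568, 376034),
  US_deaths = c(17345, 17998, 16416, 33049, 9765)
)

covid_state <- covid_state %>%
  mutate(
    # Deaths per 100 cases, rounded to one decimal place.
    mortality_rate = round(US_deaths / US_cases * 100, 1),
    # ASSUMPTION: share of all cases in this data frame, in percent.
    cases_density  = round(US_cases / sum(US_cases) * 100, 1)
  )
```

With only these five states in the data frame, cases_density will differ from the table, because the shares there are presumably taken over every state.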
Visualization
- ggplot() function in ggplot2
Description: ggplot() initializes a ggplot object. It can be used to declare the input data frame for a graphic and to specify the set of plot aesthetics intended to be common throughout all subsequent layers unless specifically overridden.
# Horizontal bar chart of case density by state, largest value at the top.
covid_state %>%
  mutate(state = fct_reorder(state, cases_density)) %>%
  ggplot(aes(x = state, y = cases_density)) +
  geom_col(color = "black") +  # same as geom_bar(stat = "identity")
  geom_text(aes(label = paste0(cases_density, "%")),
            color = "orange", hjust = 1, vjust = 0.3, size = 3) +
  coord_flip() +
  xlab("") + ylab("") +
  ggtitle("COVID Density Rate in the US States (as of Oct 24 '20)") +
  # In ggplot2 3.4.0 and later, use linewidth = 1 instead of size = 1.
  theme(axis.line = element_line(colour = "black", size = 1, linetype = "solid"),
        plot.title = element_text(lineheight = .8, face = "bold"))

The graph shows the COVID case density of each state, ordered from highest to lowest. Let’s now plot the mortality rate to see where the cases were more deadly.
# Horizontal bar chart of mortality rate by state, largest value at the top.
covid_state %>%
  mutate(state = fct_reorder(state, mortality_rate)) %>%
  ggplot(aes(x = state, y = mortality_rate)) +
  geom_col(color = "black") +  # same as geom_bar(stat = "identity")
  geom_text(aes(label = paste0(mortality_rate, "%")),
            color = "orange", hjust = 1, vjust = 0.3, size = 3) +
  coord_flip() +
  xlab("") + ylab("") +
  ggtitle("COVID Mortality Rate in the US States (as of Oct 24 '20)") +
  # In ggplot2 3.4.0 and later, use linewidth = 1 instead of size = 1.
  theme(axis.line = element_line(colour = "black", size = 1, linetype = "solid"),
        plot.title = element_text(lineheight = .8, face = "bold"))

- geom_smooth() function in ggplot2
Description: Aids the eye in seeing patterns in the presence of overplotting. geom_smooth() and stat_smooth() are effectively aliases: they both use the same arguments. Use stat_smooth() if you want to display the results with a non-standard geom.
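The plotting code for this step did not survive in this copy of the vignette. The sketch below shows one plausible use of geom_smooth(): a scatter plot of mortality rate against case density with a fitted straight line. The choice of axes, the linear method, and the object name p are assumptions; the data values are the five rows from the mutate() table.

```r
library(ggplot2)

covid_state <- data.frame(
  state          = c("California", "Texas", "Florida", "New York", "Illinois"),
  mortality_rate = c(1.9, 2.0, 2.1, 6.6, 2.6),
  cases_density  = c(10.5, 10.5, 9.0, 5.8, 4.4)
)

# Points show the states; geom_smooth() overlays a linear fit (an assumed choice).
p <- ggplot(covid_state, aes(x = cases_density, y = mortality_rate)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  xlab("Case density (%)") +
  ylab("Mortality rate (%)")

# Run print(p) to draw the plot.
```

Leaving out method lets geom_smooth() pick a smoother itself, which gives a flexible curve rather than a straight line on a data set this small.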

Conclusion
There are three clear outliers that may be obscuring the relationship between mortality rate and case density.
After removing the outliers, the relationship looks more linear: the COVID mortality rate is higher in states with higher COVID case density.
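The source does not say how the outliers were identified or removed. As an illustration only, the sketch below applies the common 1.5 × IQR rule to mortality_rate; the rule, the name covid_state_trimmed, and the use of just the five states from the mutate() table are all assumptions.

```r
library(dplyr)

covid_state <- tibble(
  state          = c("California", "Texas", "Florida", "New York", "Illinois"),
  mortality_rate = c(1.9, 2.0, 2.1, 6.6, 2.6),
  cases_density  = c(10.5, 10.5, 9.0, 5.8, 4.4)
)

# ASSUMPTION: treat values beyond 1.5 x IQR from the quartiles as outliers.
q     <- quantile(covid_state$mortality_rate, c(0.25, 0.75))
fence <- 1.5 * (q[[2]] - q[[1]])

covid_state_trimmed <- covid_state %>%
  filter(mortality_rate >= q[[1]] - fence,
         mortality_rate <= q[[2]] + fence)
```

With these five rows, only New York falls outside the fences. On the full set of states the rule may flag a different set than the three outliers mentioned above.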

