library(tidyverse)
library(ggplot2)
library(readxl)
library(here)
library(dplyr)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)Challenge 5 Instructions
Challenge Overview
Today’s challenge is to:
- read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
- tidy data (as needed, including sanity checks)
- mutate variables as needed (including sanity checks)
- create at least two univariate visualizations
- try to make them “publication” ready
- Explain why you choose the specific graph type
- Create at least one bivariate visualization
- try to make them “publication” ready
- Explain why you choose the specific graph type
R Graph Gallery is a good starting point for thinking about what information is conveyed in standard graph types, and includes example R code.
(be sure to only include the category tags for the data you use!)
Read in data
Read in one (or more) of the following datasets, using the correct R package and command.
- cereal.csv ⭐
- Total_cost_for_top_15_pathogens_2018.xlsx ⭐
- Australian Marriage ⭐⭐
- AB_NYC_2019.csv ⭐⭐⭐
- StateCounty2012.xls ⭐⭐⭐
- Public School Characteristics ⭐⭐⭐⭐
- USA Households ⭐⭐⭐⭐⭐
Briefly describe the data
Tidy Data (as needed)
Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.
Are there any variables that require mutation to be usable in your analysis stream? For example, do you need to calculate new values in order to graph them? Can string values be represented numerically? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?
Document your work here.
Univariate Visualizations
Bivariate Visualization(s)
Any additional comments?
Solutions
Reading the Data
The working directory for RStudio has been set such that “Total_cost_for_top_15_pathogens_2018.xlsx” can be found at the root of the working directory using the setwd() method.
pathogens <- read_excel(here("Total_cost_for_top_15_pathogens_2018.xlsx"), col_types = "text")
pathogens# A tibble: 27 × 3
Total cost of foodborne illness estimates for 15 leading foodbo…¹ ...2 ...3
<chr> <chr> <chr>
1 <NA> <NA> <NA>
2 <NA> Mean… <NA>
3 <NA> <NA> <NA>
4 <NA> Cases Cost
5 Campylobacter spp. (all species) 8450… 2181…
6 Clostridium perfringens 9659… 3842…
7 Cryptosporidium spp. (all species) 57616 5839…
8 Cyclospora cayetanensis 1140… 2571…
9 Listeria monocytogenes 1591 3189…
10 Norovirus 5461… 2566…
# ℹ 17 more rows
# ℹ abbreviated name:
# ¹`Total cost of foodborne illness estimates for 15 leading foodborne pathogens`
Data Description
High Level Description
The data set comprises of 27 rows with 3 columns.
pathogens# A tibble: 27 × 3
Total cost of foodborne illness estimates for 15 leading foodbo…¹ ...2 ...3
<chr> <chr> <chr>
1 <NA> <NA> <NA>
2 <NA> Mean… <NA>
3 <NA> <NA> <NA>
4 <NA> Cases Cost
5 Campylobacter spp. (all species) 8450… 2181…
6 Clostridium perfringens 9659… 3842…
7 Cryptosporidium spp. (all species) 57616 5839…
8 Cyclospora cayetanensis 1140… 2571…
9 Listeria monocytogenes 1591 3189…
10 Norovirus 5461… 2566…
# ℹ 17 more rows
# ℹ abbreviated name:
# ¹`Total cost of foodborne illness estimates for 15 leading foodborne pathogens`
The data set has a total of 3 <chr> type columns. The Total cost of foodborne illness estimates for 15 leading foodborne pathogens variable contains the name of the pathogens and should be renamed to a better column name. The ...2 and ...3 variables contain the number of cases and the cost expended to treat those illnesses respectively and should also be renamed. Each case represents the number of observed cases and the cost incurred to treat that illness.
How was the Data likely collected?
From the sheet, the dataset is collected as a part of the research - “Annual Cost of Illness and Quality-Adjusted Life Year Losses in the United States Due to 14 Foodborne Pathogens” by Hoffmann et. al. The data set updates a previous research carried out by the Economic Research Service (ERS) in 2013. On further research, the data set appears to have been downloaded from the ERS, USDA website.
Tidying the Data
The data set in its current form contains redundant rows. Additionally, the columns are incorrectly named and have incorrect data type.
The following query reads data only in the range “A6:C20”, renames the variables and mutates the “Cost” column to remove the leading “$” and the commas “,” and change the data type to numeric.
pathogens_clean <- read_excel(here("Total_cost_for_top_15_pathogens_2018.xlsx"),
range = "A6:C20",
col_names = c("Pathogen", "Cases", "Cost")) %>%
mutate(across(Cost, ~as.numeric(str_replace_all(., "[\\$,]", ""))))
pathogens_clean# A tibble: 15 × 3
Pathogen Cases Cost
<chr> <dbl> <dbl>
1 Campylobacter spp. (all species) 8.45e5 2.18e9
2 Clostridium perfringens 9.66e5 3.84e8
3 Cryptosporidium spp. (all species) 5.76e4 5.84e7
4 Cyclospora cayetanensis 1.14e4 2.57e6
5 Listeria monocytogenes 1.59e3 3.19e9
6 Norovirus 5.46e6 2.57e9
7 Salmonella (non-typhoidal species) 1.03e6 4.14e9
8 Shigella (all species) 1.31e5 1.59e8
9 Shiga toxin-producing Escherichia coli O157 (STEC O157) 6.32e4 3.11e8
10 non-O157 Shiga toxin-producing Escherichia coli (STEC non-O157) 1.13e5 3.17e7
11 Toxoplasma gondii 8.67e4 3.74e9
12 Vibrio parahaemolyticus 3.47e4 4.57e7
13 Vibrio vulnificus 9.6 e1 3.59e8
14 Vibrio non-cholera species other than V. parahaemolyticus and … 1.76e4 8.17e7
15 Yersinia enterocolitica 9.77e4 3.13e8
pathogens_clean now contains 15 observations corresponding to the 15 pathogens and three variables corresponding to the pathogen name, number of cases and the cost incurred respectively.
Sanity Check
Our mutated and tidyed data can be sanity checked for correctness by summing the “Cases” and “Cost” columns and matching the “Total for all 15 pathogens” observation in the original pathogens dataframe.
pathogens_clean %>%
summarize(total_cases = sum(Cases),
total_cost = sum(Cost))# A tibble: 1 × 2
total_cases total_cost
<dbl> <dbl>
1 8914713 17571792712.
pathogens[21,]# A tibble: 1 × 3
Total cost of foodborne illness estimates for 15 leading foodbor…¹ ...2 ...3
<chr> <chr> <chr>
1 Total for all 15 pathogens 8914… 1757…
# ℹ abbreviated name:
# ¹`Total cost of foodborne illness estimates for 15 leading foodborne pathogens`
Reading the excel results in formatting issues in the pathogens dataframe for the “Costs” variable, however the excel values match the sum totals. Therefore, our mutations are sanity checked.
Univariate Visualizations
Density Plots
pathogens_clean %>%
ggplot(aes(x = Cases)) +
geom_density(fill = "lightblue", color = "navy") +
labs(title = "Density Plot of Cases", x = "Cases", y = "Density")pathogens_clean %>%
ggplot(aes(x = Cost)) +
geom_density(fill = "lightgreen", color = "darkgreen") +
labs(title = "Density Plot of Cost", x = "Cost", y = "Density")Why Choose a Density Plot?
The y-axis of a density plot represents the estimated probability density of the data at different points along the x-axis. Through a density plot one can estimate how likely values are to occur within a particular range. In the plots above the probablity density of the Cost and Cases have been estimated. The density is calculated using kernel density estimation.
Histogram Plots
pathogens_clean %>%
ggplot(aes(Cost)) +
geom_histogram() +
labs(title = "Histogam Plot of Cost")pathogens_clean %>%
ggplot(aes(Cases)) +
geom_histogram() +
labs(title = "Histogam Plot of Cases")Why Choose a Histogram Plot?
A histogram plot automatically cuts the variable in bins and counts the number of data point per bin. Since the bars are side-by-side, histogram plots allow for easier comparison among variable values. Additionally, outlier detection is also easier by detecting bars that extend beyond the usual range of values.
Bivariate Visualizations
Bar Charts
The following bar charts represent the distribution of “Cases” and “Costs” respectively for each pathogen.
pathogens_clean %>%
ggplot(aes(x = Pathogen, y = Cases)) +
geom_bar(stat = "identity", fill = "skyblue") +
labs(title = "Cases by Pathogen", x = "Pathogen", y = "Cases") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))From the plot, we observe that “Norovirus” has the highest number of cases.
pathogens_clean %>%
ggplot(aes(x = Pathogen, y = Cost)) +
geom_bar(stat = "identity", fill = "lightcoral") +
labs(title = "Cost by Pathogen", x = "Pathogen", y = "Cost (in USD)") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))From the plot, we observe that “Salmonella (non-typhoidal species)” has the higest cost associated.
Why Choose a Bar Chart?
A bar chart is suited to display the relationship between a numeric variable (like “Cost” or “Cases”) and a categorical variable (like “Pathogens”). The height of each bar is proportional to the value of the numeric variable. Since there are significant differences in the values for different pathogens in the case of both the variables, a bar chart is apt for ranking.
Point Plots
The following point plots represent the distribution of “Cases” and “Costs” respectively for each pathogen.
pathogens_clean %>% ggplot(aes(x = Pathogen, y = Cases)) +
geom_point(position = "jitter", color = "darkred") +
labs(title = "Point Plot of Cases by Pathogen", x = "Pathogen", y = "Cases") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))pathogens_clean %>% ggplot(aes(x = Pathogen, y = Cost)) +
geom_point(position = "jitter", color = "darkblue") +
labs(title = "Point Plot of Cost by Pathogen", x = "Pathogen", y = "Cost") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))Why choose a Point Plot?
Point plots display individual data points allowing direct visualization of values associated with each Pathogen. This is particularly useful when only one value per category exists, as it provides a clear representation of the data. Jittering prevents overlapping points, making it easier to observe patterns and density in the data.
Scatter Plot
pathogens_clean %>%
ggplot(aes(x = Cases, y = Cost)) +
geom_point(color = "purple") +
labs(title = "Scatter Plot of Cases vs. Cost", x = "Cases", y = "Cost")The above plot provides a correlation of “Cost” against “Cases”.
Why Choose a Scatter Plot?
A Scatter plot is well suited to display the relationship between 2 numeric variables. Each dot represents a Pathogen. Scatter plots additionally help determine the trend in relationship between two variables - positive, negative or no correlation. Scatter plots may also reveal data clusters or groups of points that share similar charactersistics as can be seen from the lower left of the plot.