Check the data type/class of hospitalizations. Generate a table or
histogram of the hospitalizations. Leave a brief note about what you see.
## Check class / type of hospitalizations column
class(Outbreaks$hospitalizations)
## [1] "numeric"
## Create a frequency table
table(Outbreaks$hospitalizations)
##
##     0     1     2     3     4     5     6     7     8     9    10    11    12
## 36762  6248  2941  1353   691   460   300   190   136    86    98    87    50
##    13    14    15    16    17    18    19    20    21    22    23    24    25
##    39    34    29    20    20    16    16    11    12    13    13    14     8
##    26    27    28    29    30    31    32    33    34    35    36    37    38
##     5     6     8    10     8     7     5     4     5     7    10     1     3
##    39    40    41    43    44    45    47    48    49    50    52    53    54
##     3     2     2     4     1     6     1     2     2     2     6     2     2
##    55    56    57    58    60    62    64    68    70    71    72    76    88
##     2     3     1     2     2     1     2     1     1     3     1     1     1
##    94   101   103   104   108   109   124   129   133   143   145   166   167
##     2     1     1     1     1     1     1     2     1     1     1     1     1
##   179   200   204   308
##     1     1     1     1
## Histogram of hospitalizations
hist(Outbreaks$hospitalizations,
     col = "chocolate4", border = "bisque1",
     main = "Histogram of Hospitalizations",
     xlab = "Number of Hospitalizations"
)
Answer:
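The hospitalizations column is numeric. The frequency table and histogram
show that most outbreaks resulted in 0 to 3 hospitalizations (36762 had
none), while a few outbreaks resulted in a very large number of
hospitalizations, up to 308 in a single outbreak. The distribution is
strongly right-skewed.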
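The note above can be backed with a few numbers. The following is a minimal sketch on a made-up vector (`hosp` is hypothetical toy data, not the real `Outbreaks$hospitalizations`), showing one way to quantify "mostly small counts with a long right tail":

```r
# Hypothetical toy counts, NOT the real Outbreaks data
hosp <- c(0, 0, 0, 1, 0, 2, 0, 3, 0, 1, 0, 0, 45, 0, 1)

# Share of outbreaks with 3 or fewer hospitalizations
share_low <- mean(hosp <= 3, na.rm = TRUE)

# For right-skewed counts the mean sits above the median
hosp_median <- median(hosp, na.rm = TRUE)
hosp_mean <- mean(hosp, na.rm = TRUE)
hosp_max <- max(hosp, na.rm = TRUE)

print(c(share_low = share_low, median = hosp_median,
        mean = hosp_mean, max = hosp_max))
```

Applied to the real column, the same calls would likely show a median far below the maximum, which matches the shape of the histogram.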
Check how many missing values are in etiology and calculate what percent
of the dataset that is. Include the percentage using inline code.
# Count missing values in etiology (both NA and empty strings are treated as missing)
n_total <- nrow(Outbreaks)
number_missing_etiology <- sum(is.na(Outbreaks$etiology) | Outbreaks$etiology == "", na.rm = TRUE)
percent_missing_etiology <- (number_missing_etiology / n_total) * 100
Answer:
There are 14647 missing values in etiology out of 57649 rows, which is
25.41% of the dataset.
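As a small illustration of the calculation above, here is a sketch on a made-up vector (`etiology` below is hypothetical toy data standing in for `Outbreaks$etiology`). In R Markdown, an inline expression is written in the prose as `` `r round(percent_missing_etiology, 2)` ``; the sketch builds the same kind of rounded value as a string.

```r
# Hypothetical toy column, NOT the real Outbreaks$etiology
etiology <- c("Norovirus", NA, "", "Salmonella", NA, "E. coli", "", "Norovirus")

# Treat both NA and empty strings as missing
is_missing <- is.na(etiology) | etiology == ""
n_missing <- sum(is_missing)
pct_missing <- 100 * n_missing / length(etiology)

# Round to two decimals and keep trailing zeros for display
pct_label <- format(round(pct_missing, 2), nsmall = 2)
print(pct_label)
```

Keeping the computed values in named objects means the sentence in the report updates on its own if the data change.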
Create a new dataset by filtering/subsetting the outbreaks dataset to only include outbreaks from Minnesota. Name it appropriately. How many outbreaks are in this new dataset?
outbreaks_in_MN <- Outbreaks %>% dplyr::filter(state == "Minnesota")
Answer:
There are 2680 outbreaks in the Minnesota dataset.
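The count in the answer is the number of rows in the filtered data frame. A minimal base R sketch of that step, using a made-up data frame (`toy` and its values are hypothetical, not the real `Outbreaks` data) and `subset()` in place of `dplyr::filter()`:

```r
# Hypothetical toy data standing in for the Outbreaks data frame
toy <- data.frame(
  state = c("Minnesota", "Iowa", "Minnesota", "Ohio", "Minnesota"),
  illnesses = c(12, 4, 30, 7, 3),
  stringsAsFactors = FALSE
)

# Keep only the Minnesota rows; like dplyr::filter(), subset() drops
# rows where the condition evaluates to NA
toy_mn <- subset(toy, state == "Minnesota")

# Each row is one outbreak, so the row count is the number of outbreaks
n_mn <- nrow(toy_mn)
print(n_mn)
```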
Filter and subset the Minnesota-only dataset to only include outbreaks whose primary mode of transmission was food, with a confirmed etiology status, which had at least one hospitalization, and which occurred in the past 10 years. Reduce the dataset to only the variables year, month, etiology, illnesses, and hospitalizations. What year had the highest mean illnesses and what year had the highest mean hospitalizations? Hint: tapply(outcome, year, summary)
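final_dataset <- Outbreaks %>%
  dplyr::filter(
    state == "Minnesota",                # Minnesota outbreaks only
    primary_mode == "Food",              # food as the primary mode of transmission
    !is.na(etiology) & etiology != "",   # etiology recorded (not NA, not empty)
    hospitalizations > 0,                # at least one hospitalization
    year > 2015                          # 2016 onward
  ) %>%
  # Keep only the variables needed for the yearly summaries
  dplyr::select(year, month, etiology, illnesses, hospitalizations)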
# Means by year using base R tapply
ill_mean_by_year <- with(final_dataset, tapply(illnesses, year, mean, na.rm = TRUE))
hosp_mean_by_year <- with(final_dataset, tapply(hospitalizations, year, mean, na.rm = TRUE))
# Year(s) with the highest means (ties handled)
top_ill_years <- names(ill_mean_by_year)[ill_mean_by_year == max(ill_mean_by_year, na.rm = TRUE)]
top_hosp_years <- names(hosp_mean_by_year)[hosp_mean_by_year == max(hosp_mean_by_year, na.rm = TRUE)]
# Values for reporting
top_ill_mean <- max(ill_mean_by_year, na.rm = TRUE)
top_hosp_mean <- max(hosp_mean_by_year, na.rm = TRUE)
Answer:
Highest mean illnesses year(s): 2018 (mean = 58.14)
Highest mean hospitalizations year(s): 2018 (mean = 2.43)
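To make the `tapply()` approach from the hint concrete, here is a minimal sketch on a made-up data frame (`toy` and all its values are hypothetical, not the real Minnesota subset). It computes per-year means and shows why comparing against `max()` returns every tied year, whereas `which.max()` would return only the first.

```r
# Hypothetical toy data, NOT the real Minnesota subset
toy <- data.frame(
  year = c(2016, 2016, 2017, 2017, 2018, 2018),
  illnesses = c(10, 20, 5, 15, 40, 60),
  hospitalizations = c(1, 3, 2, 2, 1, 2)
)

# Mean of each outcome within each year
ill_mean <- tapply(toy$illnesses, toy$year, mean, na.rm = TRUE)
hosp_mean <- tapply(toy$hospitalizations, toy$year, mean, na.rm = TRUE)

# The form given in the hint: a full summary() per year
ill_summary <- tapply(toy$illnesses, toy$year, summary)

# Year(s) with the highest mean; ties are all returned
top_ill <- names(ill_mean)[ill_mean == max(ill_mean)]
top_hosp <- names(hosp_mean)[hosp_mean == max(hosp_mean)]

print(top_ill)
print(top_hosp)
```

In this toy data 2016 and 2017 tie on mean hospitalizations, so `top_hosp` holds both years.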