Question 1, Part A: Setup (hidden)

Question 1, Part B: Read in Dataset From My Path (hidden)

Question 2: Hospitalizations

Check the data type/class of hospitalizations. Generate a table or histogram of the hospitalizations. Leave a brief note about what you see.

## Check class / type of hospitalizations column
class(Outbreaks$hospitalizations)
## [1] "numeric"
## Create a frequency table
table(Outbreaks$hospitalizations)
## 
##     0     1     2     3     4     5     6     7     8     9    10    11    12 
## 36762  6248  2941  1353   691   460   300   190   136    86    98    87    50 
##    13    14    15    16    17    18    19    20    21    22    23    24    25 
##    39    34    29    20    20    16    16    11    12    13    13    14     8 
##    26    27    28    29    30    31    32    33    34    35    36    37    38 
##     5     6     8    10     8     7     5     4     5     7    10     1     3 
##    39    40    41    43    44    45    47    48    49    50    52    53    54 
##     3     2     2     4     1     6     1     2     2     2     6     2     2 
##    55    56    57    58    60    62    64    68    70    71    72    76    88 
##     2     3     1     2     2     1     2     1     1     3     1     1     1 
##    94   101   103   104   108   109   124   129   133   143   145   166   167 
##     2     1     1     1     1     1     1     2     1     1     1     1     1 
##   179   200   204   308 
##     1     1     1     1
## Histogram of hospitalizations
hist(Outbreaks$hospitalizations, col = "chocolate4", border = "bisque1",
    main = "Histogram of Hospitalizations",
    xlab = "Number of Hospitalizations"
)

Answer:
The hospitalizations column is numeric. Most outbreaks resulted in 0 to 3 hospitalizations, but a few resulted in a very large number (up to 308), so the distribution is strongly right-skewed.
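
The skew described above can also be put into numbers. The following is a minimal sketch on a hypothetical toy vector (`toy_hosp` is an assumption standing in for `Outbreaks$hospitalizations`, not the real data); it reports a summary, the share of outbreaks with 0 to 3 hospitalizations, and the largest value.

```r
# Hypothetical toy counts standing in for Outbreaks$hospitalizations
toy_hosp <- c(0, 0, 0, 0, 1, 1, 2, 3, 12, 150, NA)

# Five-number summary plus mean; a mean well above the median suggests right skew
summary(toy_hosp)

# Share of non-missing outbreaks with 0 to 3 hospitalizations
share_low <- mean(toy_hosp <= 3, na.rm = TRUE)

# Largest single value, to show how far the tail extends
max_hosp <- max(toy_hosp, na.rm = TRUE)
```

Replacing `toy_hosp` with the real column would give the same three quantities for the full dataset.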

Question 3: Missing values in etiology

Check how many missing values are in etiology and calculate what percent of the dataset that is. Include the percentage using inline code.

# Count missing values in etiology (both NA and empty strings are treated as missing)
n_total <- nrow(Outbreaks)
number_missing_etiology <- sum(is.na(Outbreaks$etiology) | Outbreaks$etiology == "", na.rm = TRUE)
percent_missing_etiology <- (number_missing_etiology / n_total) * 100

Answer:
There are 14647 missing values in etiology out of 57649 rows, which is 25.41% of the dataset.
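
Because the inline code itself is not visible in the knitted output, here is a small sketch of how the count and percentage could be computed and reported. The data frame `toy_etiology` and its values are hypothetical; the commented line shows one way the value might be embedded in R Markdown text.

```r
# Hypothetical toy data; the real column is Outbreaks$etiology
toy_etiology <- data.frame(
  etiology = c("Norovirus", NA, "", "Salmonella"),
  stringsAsFactors = FALSE
)

# A value is treated as missing if it is NA or an empty string
is_missing  <- is.na(toy_etiology$etiology) | toy_etiology$etiology == ""
n_missing   <- sum(is_missing, na.rm = TRUE)
pct_missing <- 100 * n_missing / nrow(toy_etiology)

# In the .Rmd text, the value can be embedded with inline code such as:
#   `r round(pct_missing, 2)`% of the dataset
sentence <- sprintf(
  "%d of %d rows (%.2f%%) have a missing etiology.",
  n_missing, nrow(toy_etiology), pct_missing
)
```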

Question 4: New Dataset

Create a new dataset by filtering/subsetting the outbreaks dataset to only include outbreaks from Minnesota. Name it appropriately. How many outbreaks are in this new dataset?

  outbreaks_in_MN <- Outbreaks %>% dplyr::filter(state == "Minnesota")

Answer:
There are 2680 outbreaks in the Minnesota dataset.
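
The count itself comes from the number of rows in the filtered dataset. As a minimal sketch, the example below uses base R `subset()` in place of `dplyr::filter()` so that it runs without extra packages; `toy_states` and its values are hypothetical.

```r
# Hypothetical toy data standing in for Outbreaks
toy_states <- data.frame(
  state     = c("Minnesota", "Iowa", "Minnesota", NA),
  illnesses = c(5, 12, 3, 8)
)

# Base R equivalent of dplyr::filter(state == "Minnesota");
# subset() drops rows where the condition is NA, as filter() does
toy_mn <- subset(toy_states, state == "Minnesota")

# Each row is one outbreak, so the row count is the number of outbreaks
n_mn <- nrow(toy_mn)
```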

Question 5: Food Truck Outbreaks in Minnesota

Filter and subset the Minnesota-only dataset to only include outbreaks whose primary mode of transmission was food, with a confirmed etiology status, which had at least one hospitalization, and which occurred in the past 10 years. Reduce the dataset to only the variables year, month, etiology, illnesses, and hospitalizations. What year had the highest mean illnesses and what year had the highest mean hospitalizations? Hint: tapply(outcome, year, summary)

final_dataset <- Outbreaks %>%
  dplyr::filter(
    state == "Minnesota",
    primary_mode == "Food",
    !is.na(etiology) & etiology != "",  # keep rows with a recorded etiology
    hospitalizations > 0,               # at least one hospitalization
    year > 2015                         # 2016 onward
  ) %>%
  dplyr::select(year, month, etiology, illnesses, hospitalizations)


# Means by year using base R tapply
ill_mean_by_year  <- with(final_dataset, tapply(illnesses, year, mean, na.rm = TRUE))
hosp_mean_by_year <- with(final_dataset, tapply(hospitalizations, year, mean, na.rm = TRUE))

# Year(s) with the highest means (ties handled)
top_ill_years  <- names(ill_mean_by_year)[ill_mean_by_year == max(ill_mean_by_year,  na.rm = TRUE)]
top_hosp_years <- names(hosp_mean_by_year)[hosp_mean_by_year == max(hosp_mean_by_year, na.rm = TRUE)]

# Values for reporting
top_ill_mean  <- max(ill_mean_by_year,  na.rm = TRUE)
top_hosp_mean <- max(hosp_mean_by_year, na.rm = TRUE)

Answer:
Highest mean illnesses year(s): 2018 (mean = 58.14)

Highest mean hospitalizations year(s): 2018 (mean = 2.43)
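
To illustrate how `tapply()` produces these per-year figures, here is a minimal sketch on hypothetical toy data (`toy_outbreaks` and its values are assumptions, not the Minnesota results). It computes the mean by year, picks the year with the highest mean, and also shows the `summary` form from the hint.

```r
# Hypothetical toy data with the same column names as final_dataset
toy_outbreaks <- data.frame(
  year             = c(2016, 2016, 2017, 2017, 2018),
  illnesses        = c(10, 20, 40, NA, 30),
  hospitalizations = c(1, 3, 2, 2, 5)
)

# Mean illnesses per year; na.rm = TRUE skips the missing 2017 value
ill_means <- with(toy_outbreaks, tapply(illnesses, year, mean, na.rm = TRUE))

# Year(s) whose mean equals the maximum; ties would return several names
top_year <- names(ill_means)[ill_means == max(ill_means)]

# The hint's form: one summary() result per year, returned as a list
ill_summaries <- with(toy_outbreaks, tapply(illnesses, year, summary))
```

The same pattern applies to `hospitalizations` by swapping the first argument of `tapply()`.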