DS Labs Assignment

Author

M Madinko

Intro

I used the us_contagious_diseases dataset from the dslabs package. It contains official public health reports for several contagious diseases in the United States, spanning several decades. My analysis focuses specifically on rubella, using data from the 48 contiguous states and the District of Columbia (Alaska and Hawaii are filtered out in the wrangling step).

library(RColorBrewer)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("dslabs")
Warning: package 'dslabs' was built under R version 4.5.3
#data(package = "dslabs")
data("us_contagious_diseases")  
unique(us_contagious_diseases$disease)
[1] Hepatitis A Measles     Mumps       Pertussis   Polio       Rubella    
[7] Smallpox   
Levels: Hepatitis A Measles Mumps Pertussis Polio Rubella Smallpox
unique(us_contagious_diseases$state)
 [1] Alabama              Alaska               Arizona             
 [4] Arkansas             California           Colorado            
 [7] Connecticut          Delaware             District Of Columbia
[10] Florida              Georgia              Hawaii              
[13] Idaho                Illinois             Indiana             
[16] Iowa                 Kansas               Kentucky            
[19] Louisiana            Maine                Maryland            
[22] Massachusetts        Michigan             Minnesota           
[25] Mississippi          Missouri             Montana             
[28] Nebraska             Nevada               New Hampshire       
[31] New Jersey           New Mexico           New York            
[34] North Carolina       North Dakota         Ohio                
[37] Oklahoma             Oregon               Pennsylvania        
[40] Rhode Island         South Carolina       South Dakota        
[43] Tennessee            Texas                Utah                
[46] Vermont              Virginia             Washington          
[49] West Virginia        Wisconsin            Wyoming             
51 Levels: Alabama Alaska Arizona Arkansas California Colorado ... Wyoming

Data Wrangling

Keep only rubella and exclude the other diseases.

Filter out Alaska and Hawaii.

Create four categories by region: Northeast, Midwest, South, and West.

Compute the rubella rate per 10,000 people, scaled to a full 52-week year: count / population × 10,000 × 52 / weeks_reporting.

Draw a vertical line at 1969, the year the rubella vaccine was introduced in the US.
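The rate step in the list above can be checked by hand with a single row. The sketch below uses the Alabama 1966 figures from the dataset (112 cases, population 3,345,787, 31 reporting weeks); the variable names are only for illustration.

```r
# Alabama, 1966 (values as reported in us_contagious_diseases)
count           <- 112
population      <- 3345787
weeks_reporting <- 31

# Cases per 10,000 people, scaled up from 31 reporting weeks to a 52-week year
rate <- count / population * 10000 / (weeks_reporting / 52)
round(rate, 4)
```

Dividing by weeks_reporting / 52 is the same as multiplying by 52 / weeks_reporting, so states that reported for only part of the year are not understated.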

disease1 <- us_contagious_diseases |>
  filter(disease == "Rubella" & !state %in% c("Hawaii", "Alaska")) |>
  mutate(region = case_when(
    state %in% c("Connecticut", "Maine", "Massachusetts", "New Hampshire", "Rhode Island", "Vermont", "New Jersey", "New York", "Pennsylvania") ~ "Northeast",
    state %in% c("Illinois", "Indiana", "Iowa", "Kansas", "Michigan", "Minnesota", "Missouri", "Nebraska", "North Dakota", "Ohio", "South Dakota", "Wisconsin") ~ "Midwest",
    state %in% c("Alabama", "Arkansas", "Delaware", "District Of Columbia", "Florida", "Georgia", "Kentucky", "Louisiana", "Maryland", "Mississippi", "North Carolina", "Oklahoma", "South Carolina", "Tennessee", "Texas", "Virginia", "West Virginia") ~ "South",
    TRUE ~ "West")) |>
  mutate(rate = count / population * 10000 / (weeks_reporting / 52))
head(disease1)
  disease   state year weeks_reporting count population region      rate
1 Rubella Alabama 1966              31   112    3345787  South 0.5615150
2 Rubella Alabama 1967              27   214    3364130  South 1.2251255
3 Rubella Alabama 1968              33   404    3386068  South 1.8800746
4 Rubella Alabama 1969              36   136    3412450  South 0.5756698
5 Rubella Alabama 1970              51   380    3444165  South 1.1249490
6 Rubella Alabama 1971              51   226    3481798  South 0.6618172
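The region column above depends on hand-typed state lists, so a quick cross-check may be worthwhile. As a rough sketch, the South list can be compared with base R's built-in state.name and state.region vectors. These cover only the 50 states, so the District of Columbia is left out of the comparison, and the assumption here is that state.region follows the same Census grouping the report uses.

```r
# South list copied from the case_when() call, minus the District of Columbia
south <- c("Alabama", "Arkansas", "Delaware", "Florida", "Georgia",
           "Kentucky", "Louisiana", "Maryland", "Mississippi",
           "North Carolina", "Oklahoma", "South Carolina", "Tennessee",
           "Texas", "Virginia", "West Virginia")

# States that base R assigns to the South
builtin_south <- state.name[state.region == "South"]

# States present in one list but not the other; an empty vector means they agree
mismatch <- union(setdiff(south, builtin_south), setdiff(builtin_south, south))
mismatch
```

The same comparison could be repeated for the Northeast and Midwest lists. Note that state.region labels the Midwest as "North Central".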

Heatmap of regional rubella rates in the USA

library(RColorBrewer)

ggplot(disease1, aes(x = year, y = region, fill = rate)) +
  geom_tile(color = "black") +
  scale_x_continuous(expand = c(0,0)) +
  scale_fill_gradientn(colors = brewer.pal(9, "Reds"), trans = "sqrt") +
  geom_vline(xintercept = 1969, col = "skyblue", linewidth = 1.5) +
  theme_minimal() +
  labs(title = "Regional Rates of Rubella in the US",
       caption = "Source: Tycho Project",
       x = "",
       y = "")
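One thing to watch in the plot above: disease1 has one row per state and year, so with y = region several state tiles are drawn on top of each other in the same cell and only the last one is visible. A possible way around this is to collapse the data to one rate per region and year before plotting. The sketch below uses a small made-up data frame (toy) with the same columns as disease1. The pooled-rate definition and the choice to drop rows with zero reporting weeks are assumptions, not part of the original analysis.

```r
library(dplyr)

# Toy data with made-up numbers: two states per region in a single year
toy <- data.frame(
  state           = c("A", "B", "C", "D"),
  region          = c("South", "South", "West", "West"),
  year            = c(1966, 1966, 1966, 1966),
  weeks_reporting = c(52, 26, 52, 0),
  count           = c(100, 50, 30, 0),
  population      = c(1e6, 1e6, 2e6, 1e6)
)

regional <- toy |>
  filter(weeks_reporting > 0) |>                          # avoid dividing by zero weeks
  mutate(annual_count = count * 52 / weeks_reporting) |>  # scale to a full year
  group_by(region, year) |>
  summarise(rate = sum(annual_count) / sum(population) * 10000,
            .groups = "drop")                             # pooled cases per 10,000

regional
```

Applied to disease1 instead of toy, the resulting table has exactly one row per region and year, so it could be passed to the same ggplot() call with aes(x = year, y = region, fill = rate).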