DS Labs Assignment

Author

M Madinko

Intro

I used the us_contagious_diseases dataset from the dslabs package. It contains official public health reports for seven contagious diseases in the United States, covering all 50 states and the District of Columbia over several decades. My analysis focuses on rubella.

library(RColorBrewer)
library(tidyverse)
library("dslabs")
#data(package = "dslabs")
data("us_contagious_diseases")  
unique(us_contagious_diseases$disease)
[1] Hepatitis A Measles     Mumps       Pertussis   Polio       Rubella    
[7] Smallpox   
Levels: Hepatitis A Measles Mumps Pertussis Polio Rubella Smallpox
unique(us_contagious_diseases$state)
 [1] Alabama              Alaska               Arizona             
 [4] Arkansas             California           Colorado            
 [7] Connecticut          Delaware             District Of Columbia
[10] Florida              Georgia              Hawaii              
[13] Idaho                Illinois             Indiana             
[16] Iowa                 Kansas               Kentucky            
[19] Louisiana            Maine                Maryland            
[22] Massachusetts        Michigan             Minnesota           
[25] Mississippi          Missouri             Montana             
[28] Nebraska             Nevada               New Hampshire       
[31] New Jersey           New Mexico           New York            
[34] North Carolina       North Dakota         Ohio                
[37] Oklahoma             Oregon               Pennsylvania        
[40] Rhode Island         South Carolina       South Dakota        
[43] Tennessee            Texas                Utah                
[46] Vermont              Virginia             Washington          
[49] West Virginia        Wisconsin            Wyoming             
51 Levels: Alabama Alaska Arizona Arkansas California Colorado ... Wyoming

Data Wrangling

Keep only rubella and exclude all other diseases.

Filter out Alaska and Hawaii.

Create a region variable with four categories: Northeast, Midwest, South, and West.

Compute the annualized rubella rate per 10,000 people, adjusted for incomplete reporting: count / population * 10,000 * 52 / weeks_reporting.

Draw a vertical line at 1969, the year the rubella vaccine was introduced.

disease1 <- us_contagious_diseases |>
  filter(disease == "Rubella" & !state %in% c("Hawaii", "Alaska")) |>
  mutate(region = case_when(
    state %in% c("Connecticut", "Maine", "Massachusetts", "New Hampshire", "Rhode Island", "Vermont", "New Jersey", "New York", "Pennsylvania") ~ "Northeast",
    state %in% c("Illinois", "Indiana", "Iowa", "Kansas", "Michigan", "Minnesota", "Missouri", "Nebraska", "North Dakota", "Ohio", "South Dakota", "Wisconsin") ~ "Midwest",
    state %in% c("Alabama", "Arkansas", "Delaware", "District Of Columbia", "Florida", "Georgia", "Kentucky", "Louisiana", "Maryland", "Mississippi", "North Carolina", "Oklahoma", "South Carolina", "Tennessee", "Texas", "Virginia", "West Virginia") ~ "South",
    TRUE ~ "West")) |>
  mutate(rate = count / population * 10000 / (weeks_reporting / 52))
head(disease1)
  disease   state year weeks_reporting count population region      rate
1 Rubella Alabama 1966              31   112    3345787  South 0.5615150
2 Rubella Alabama 1967              27   214    3364130  South 1.2251255
3 Rubella Alabama 1968              33   404    3386068  South 1.8800746
4 Rubella Alabama 1969              36   136    3412450  South 0.5756698
5 Rubella Alabama 1970              51   380    3444165  South 1.1249490
6 Rubella Alabama 1971              51   226    3481798  South 0.6618172
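
As a sanity check on the rate formula, the first two rows of the output above can be reproduced by hand. This is a minimal sketch: `annualized_rate` is a hypothetical helper name introduced here for illustration and is not part of dslabs or the tidyverse.

```r
# Hypothetical helper (not from any package): cases per 10,000 people,
# scaled up to a full 52-week year when fewer weeks were reported.
annualized_rate <- function(count, population, weeks_reporting) {
  count / population * 10000 * 52 / weeks_reporting
}

# Alabama 1966 and 1967, values copied from the head(disease1) output
annualized_rate(
  count           = c(112, 214),
  population      = c(3345787, 3364130),
  weeks_reporting = c(31, 27)
)
# approximately 0.5615 and 1.2251, matching the rate column
```

Multiplying by 52 / weeks_reporting is algebraically the same as dividing by (weeks_reporting / 52), which is the form used in the mutate() call. Note that a row with weeks_reporting equal to 0 would produce NaN or Inf, so such rows may need to be filtered out first.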

Heatmap of Regional Rubella Rates in the USA

library(RColorBrewer)

ggplot(disease1, aes(x = year, y = region, fill = rate)) +
  geom_tile(color = "black") +
  scale_x_continuous(expand = c(0,0)) +
  scale_fill_gradientn(colors = brewer.pal(9, "Reds"), trans = "sqrt") +
  geom_vline(xintercept = 1969, color = "skyblue", linewidth = 1.5) +
  theme_minimal() +
  labs(title = "Regional Rates of Rubella in the US",
       caption = "Source: Tycho Project",
       x = "", 
       y = "")
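
One caveat with this plot: disease1 still has one row per state and year, so every region-year cell receives several tiles drawn on top of one another, and only the last one drawn is visible. A possible refinement is to aggregate to one row per region and year before plotting. The sketch below uses a small made-up data frame (`toy`, not values from the dataset) and a pooled rate, which is only one of several reasonable ways to combine states; averaging the state rates would be another.

```r
library(dplyr)

# Toy stand-in for disease1: made-up values, for illustration only
toy <- data.frame(
  state           = c("A", "B", "C"),
  region          = c("South", "South", "West"),
  year            = c(1966, 1966, 1966),
  weeks_reporting = c(52, 26, 52),
  count           = c(100, 50, 10),
  population      = c(1e6, 1e6, 5e5)
)

regional <- toy |>
  filter(weeks_reporting > 0) |>   # avoid dividing by zero
  group_by(region, year) |>
  summarise(
    # annualize each state's count, then pool across the region
    rate = sum(count * 52 / weeks_reporting) / sum(population) * 10000,
    .groups = "drop"
  )

regional
# one row per region and year: South = 1.0, West = 0.2
```

Applying the same filter(), group_by(), and summarise() steps to disease1 and passing the result to the ggplot() call would give exactly one tile per region and year.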

Observation

I filtered the dataset to keep only observations related to rubella and excluded Alaska and Hawaii, since they are not part of the contiguous United States. I then used mutate() with case_when() to create a new categorical variable that groups states into four regions: Northeast, Midwest, South, and West. Finally, I visualized the data as a heatmap in which color intensity represents the infection rate, and I added a vertical line at 1969 to mark the introduction of the rubella vaccine, making it easier to compare trends before and after vaccination.