DS Lab Assignment

Author

Jude E. Abban

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggrepel)
library(ggthemes)
library(dslabs)
library(plotly)

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout
data(package="dslabs")

Choosing the data

data(us_contagious_diseases)

Filtering and Grouping

Keep the years from 1950 onward and the rows with at least 50 weeks of reporting, then compute a rate per 100,000 people. Also group the states by the region they’re in.
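As a small worked example of the filter and rate calculation, here is a sketch on made-up numbers. The `toy` table and its values are hypothetical and only stand in for the real dataset; the column names follow the ones used in this assignment.

```r
library(dplyr)

# Hypothetical toy rows standing in for us_contagious_diseases
toy <- tibble(
  state = c("A", "B", "C"),
  year = c(1949, 1960, 1960),
  weeks_reporting = c(52, 52, 30),
  count = c(100, 250, 40),
  population = c(1e6, 5e6, 2e6)
)

toy_clean <- toy |>
  # "A" is dropped for its year, "C" for too few reporting weeks
  filter(year >= 1950, weeks_reporting >= 50) |>
  # Cases per 100,000 residents
  mutate(rate_per_100k = count / population * 100000)

toy_clean
```

Only state "B" passes both conditions, and its rate is 250 / 5,000,000 × 100,000 = 5 cases per 100,000 people.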

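clean_data <- us_contagious_diseases |>
  # Keep 1950 onward, and only rows with at least 50 reporting weeks
  filter(year >= 1950, weeks_reporting >= 50) |>
  # Cases per 100,000 residents, so states of different sizes are comparable
  mutate(rate_per_100k = (count / population) * 100000) |>
  # Assign each state to one of four regions; anything unlisted becomes "Other"
  mutate(region = case_when(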
    # Northeast
    state %in% c("Connecticut", "Maine", "Massachusetts", "New Hampshire", 
                 "Rhode Island", "Vermont", "New Jersey", "New York", 
                 "Pennsylvania") ~ "Northeast",
    
    # Midwest
    state %in% c("Illinois", "Indiana", "Michigan", "Ohio", "Wisconsin",
                 "Iowa", "Kansas", "Minnesota", "Missouri", "Nebraska",
                 "North Dakota", "South Dakota") ~ "Midwest",
    
    # South
    state %in% c("Delaware", "Florida", "Georgia", "Maryland", "North Carolina",
                 "South Carolina", "Virginia", "West Virginia", "Alabama",
                 "Kentucky", "Mississippi", "Tennessee", "Arkansas",
                 "Louisiana", "Oklahoma", "Texas") ~ "South",
    
    # West
    state %in% c("Arizona", "Colorado", "Idaho", "Montana", "Nevada",
                 "New Mexico", "Utah", "Wyoming", "Alaska", "California",
                 "Hawaii", "Oregon", "Washington") ~ "West",
    
    TRUE ~ "Other"
  ))
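The hand-typed state lists above can be cross-checked against a lookup table. The sketch below is an alternative, not the method used in this assignment: it builds the lookup from base R's built-in `state.name` and `state.region` vectors and reports any name that falls through to "Other". The `toy_states` input is hypothetical, and the assumption that the real data may contain names outside the 50 states (such as the District of Columbia) should be checked against the dataset itself.

```r
library(dplyr)

# Base R ships state.name and state.region for the 50 states.
# state.region calls the Midwest "North Central", so rename it here.
region <- as.character(state.region)
region[region == "North Central"] <- "Midwest"
region_lookup <- tibble(state = state.name, region = region)

# Hypothetical input: two states plus a name the lookup does not cover
toy_states <- tibble(state = c("Ohio", "Texas", "District Of Columbia"))

toy_regions <- toy_states |>
  left_join(region_lookup, by = "state") |>
  # Names with no match get the same fallback label as the case_when() above
  mutate(region = coalesce(region, "Other"))

# List the names that ended up in "Other"
unmatched <- toy_regions |>
  filter(region == "Other") |>
  pull(state)

unmatched
```

Running the same `filter(region == "Other")` check on `clean_data` would show whether any rows in the real data were left without a region.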
final_plot <- clean_data |> 
  ggplot(aes(x = rate_per_100k, 
             y = population/10^6, 
             label = disease, 
             text = paste(
               "Disease:", disease,
               "<br>State:", state,
               "<br>Year:", year,
               "<br>Region:", region,
               "<br>Rate per 100k:", round(rate_per_100k, 2),
               "<br>Population (millions):", round(population/10^6, 2)
             ))) +
  geom_point(aes(color = disease), size = 2, alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE, lty = 2, color = "gray") +  
  scale_x_log10("Rate per 100,000 population (log scale)") +
  scale_y_log10("Population in millions (log scale)") +
  ggtitle("Contagious Disease Rates by State Population") +
  scale_color_discrete(name = "Disease") +
  theme_bw()

ggplotly(final_plot, tooltip = "text")
`geom_smooth()` using formula = 'y ~ x'

Reflection

For this assignment, I used the us_contagious_diseases dataset from the dslabs package, which tracks contagious disease cases by state, year, and disease. I kept the years from 1950 onward with at least 50 weeks of reporting, then calculated a rate per 100,000 people so that case counts are easier to compare across states of different sizes. I also added a region variable, using case_when() to group the states into four regions. The graph is a scatterplot with disease rate on the x-axis and state population on the y-axis, both on log scales. Points are colored by disease, and the region appears in the hover text. A dashed trend line is added with geom_smooth(), but it does not show in the interactive plot. This is probably not caused by plotly itself but by the per-point text aesthetic, which puts each point in its own group so that no line can be fitted. The plot is made interactive with ggplotly(), so users can hover over points to see details such as state, year, and rate.
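One possible way to get the trend line to show is to give the smoother an explicit group, so that a unique hover text per point no longer splits the data. The sketch below uses made-up data; the `toy` table, its values, and the `tip` column are hypothetical, and this is a suggestion rather than what the assignment code does.

```r
library(ggplot2)

# Hypothetical toy data; each row gets its own tooltip string
toy <- data.frame(
  rate = c(1, 2, 4, 8, 16, 32),
  pop = c(0.5, 1, 2, 3, 6, 12),
  disease = rep(c("Measles", "Mumps"), 3)
)
toy$tip <- paste("Row", seq_len(nrow(toy)))

p_fixed <- ggplot(toy, aes(x = rate, y = pop, text = tip)) +
  geom_point(aes(color = disease)) +
  # group = 1 makes the smoother fit a single line through all points
  geom_smooth(aes(group = 1), method = "lm", formula = y ~ x,
              se = FALSE, linetype = 2, color = "gray") +
  scale_x_log10() +
  scale_y_log10()

# Layer 2 is the smoother; it should now contain fitted values
smooth_rows <- nrow(layer_data(p_fixed, 2))
smooth_rows > 0
```

The same `aes(group = 1)` could be added to the `geom_smooth()` call in `final_plot` before passing it to `ggplotly(final_plot, tooltip = "text")`.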