Week 7 DSLabs Dataset Assignment

Author

Maisha Subin

# loading necessary packages
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(plotly) # for interactivity

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout
library(viridis)
Loading required package: viridisLite
# loading DSLabs dataset
library(dslabs)
data("us_contagious_diseases")
unique(us_contagious_diseases$disease) # for listing different diseases recorded in the dataset
[1] Hepatitis A Measles     Mumps       Pertussis   Polio       Rubella    
[7] Smallpox   
Levels: Hepatitis A Measles Mumps Pertussis Polio Rubella Smallpox
# Creating a filtered polio data set with an annualized incidence rate
polio_data <- us_contagious_diseases %>%
  filter(disease == "Polio" & !state %in% c("Hawaii", "Alaska") & 
           !is.na(year) & !is.na(count) & !is.na(population) & !is.na(weeks_reporting)) %>%
  mutate(rate = ifelse(weeks_reporting > 0, count / population * 10000 / (weeks_reporting / 52), NA)) %>%
  filter(!is.na(rate)) # Rate is cases per 10,000 people, scaled to a full 52-week year; rows with 0 reporting weeks become NA and are dropped
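The rate formula in the pipeline above can be checked in isolation. The sketch below wraps the same arithmetic in a helper named `annualized_rate`, which is a hypothetical name introduced only for illustration, and runs it on made-up inputs rather than values from the dataset.

```r
# Hypothetical helper mirroring the rate formula used in the pipeline above.
# `per` is the population base (10,000 here); 52 is the number of weeks in a year.
annualized_rate <- function(count, population, weeks_reporting, per = 10000) {
  ifelse(weeks_reporting > 0,
         count / population * per / (weeks_reporting / 52),
         NA_real_)
}

# Made-up inputs: 100 cases, 1,000,000 people, 26 of 52 weeks reported.
half_year <- annualized_rate(100, 1e6, 26)  # cases per 10,000, scaled up to a full year
no_weeks  <- annualized_rate(100, 1e6, 0)   # NA, because no weeks were reported

half_year
no_weeks
```

Dividing by `weeks_reporting / 52` scales a partially reported year up to a full year, so states with incomplete reporting are not understated relative to states that reported every week.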
p <- ggplot(polio_data, aes(x = year, y = rate, color = state, group = state)) +
  geom_line() +  # Plot the line for each state
  geom_vline(xintercept = 1955, linetype = "dashed", color = "blue") +  # Vertical line at the introduction of the vaccine
  theme_dark() +  # Dark theme for a dark background
  labs(title = "Polio Incidence Over Time by State",
       x = "Year", y = "Polio Incidence Rate (per 10,000 people)",
       caption = "Source: Tycho Project") +
  theme(
    legend.position = "right",  # Move the legend to the right
    legend.box = "vertical",  # Arrange legend items vertically
    legend.key.size = unit(0.4, "cm"),  # Adjust the size of the legend keys
    legend.text = element_text(size = 8),  # Adjust legend text size for better readability
    axis.text.x = element_text(angle = 45, hjust = 1)  # Rotate x-axis labels for readability
  ) +
  scale_color_viridis(discrete = TRUE) +  # Using viridis colors (discrete scale)
  guides(color = guide_legend(ncol = 3))  # Arrange the legend entries in three columns
p <- ggplotly(p)
p

Week 7 DSLabs Notes

I used the “us_contagious_diseases” dataset from dslabs. I wanted to include all US states except Hawaii and Alaska, because I was not getting the desired results when they were included. I also wanted to see which states recorded the highest and lowest rates, so I used plotly to make the graph interactive and inspect specific values. I wanted to use highcharter, but had difficulty implementing it. I created a new data set called “polio_data” with no NAs and used the calculation provided by Professor Saidi to get meaningful numbers to plot. For the graph, I used a dark theme and the viridis color palette for the state colors, and placed the legend on the right side. I added a vertical line at 1955 to compare the rates before and after the introduction of the vaccine in the United States on April 12, 1955.

With plotly I can see the highest polio rates reported over the years for different states. Some interesting points to note: Nebraska and South Dakota had two of the highest rates reported in 1952, considered the worst outbreak year across the country, at 15.72 and 15.16 per 10,000 people respectively. Pennsylvania, South Carolina, and even New York were among the states that maintained low rates in the 1950s.
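Hovering over the plotly graph is one way to find the highest rates in a year. The same ranking could also be computed directly. The sketch below is a minimal example, assuming a data frame with `state`, `year`, and `rate` columns like `polio_data`. The helper name `top_states` is hypothetical, and the toy values are made up purely to show the call shape.

```r
library(dplyr)

# Hypothetical helper: return the n states with the highest rate in a given year.
top_states <- function(data, target_year, n = 5) {
  data %>%
    filter(year == target_year) %>%
    slice_max(rate, n = n, with_ties = FALSE) %>%
    select(state, year, rate)
}

# Toy data with made-up values (not taken from us_contagious_diseases).
toy <- data.frame(
  state = c("A", "B", "C", "A", "B", "C"),
  year  = c(1952, 1952, 1952, 1956, 1956, 1956),
  rate  = c(3.0, 9.5, 1.2, 0.8, 1.1, 0.3)
)

top_1952 <- top_states(toy, 1952, n = 2)
top_1952
```

With the real data, a call such as `top_states(polio_data, 1952)` should list the states with the highest 1952 rates, and `slice_min()` could be swapped in for `slice_max()` to find the lowest.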