DS LABS

Author

Djeneba Kounta

Load the library

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dslabs)
library(plotly)

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout
library(viridis)
Loading required package: viridisLite

First, I loaded the libraries needed to work with the dataset: tidyverse, which includes dplyr for data manipulation and ggplot2 for visualization; dslabs, which provides the U.S. Contagious Diseases dataset; plotly, which makes the graph interactive; and viridis, which supplies the color palette. With plotly, hovering over a point on the graph shows details such as the year, the disease, and the number of cases.

Load the data

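# install.packages("dslabs")  # run once if dslabs is not installed yet
library("dslabs")
# List the datasets that ship with dslabs
data(package = "dslabs")
# List the scripts that were used to build those datasets
list.files(system.file("script", package = "dslabs"))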
 [1] "make-admissions.R"                   
 [2] "make-brca.R"                         
 [3] "make-brexit_polls.R"                 
 [4] "make-calificaciones.R"               
 [5] "make-death_prob.R"                   
 [6] "make-divorce_margarine.R"            
 [7] "make-gapminder-rdas.R"               
 [8] "make-greenhouse_gases.R"             
 [9] "make-historic_co2.R"                 
[10] "make-mice_weights.R"                 
[11] "make-mnist_127.R"                    
[12] "make-mnist_27.R"                     
[13] "make-movielens.R"                    
[14] "make-murders-rda.R"                  
[15] "make-na_example-rda.R"               
[16] "make-nyc_regents_scores.R"           
[17] "make-olive.R"                        
[18] "make-outlier_example.R"              
[19] "make-polls_2008.R"                   
[20] "make-polls_us_election_2016.R"       
[21] "make-pr_death_counts.R"              
[22] "make-reported_heights-rda.R"         
[23] "make-research_funding_rates.R"       
[24] "make-stars.R"                        
[25] "make-temp_carbon.R"                  
[26] "make-tissue-gene-expression.R"       
[27] "make-trump_tweets.R"                 
[28] "make-weekly_us_contagious_diseases.R"
[29] "save-gapminder-example-csv.R"        
data("us_contagious_diseases")

Next, I looked at what the dslabs package offers and selected the dataset I wanted to use, us_contagious_diseases (U.S. Contagious Diseases). I loaded it with data() to proceed with my analysis.
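Before aggregating, it can help to check what the dataset actually contains. The snippet below is a minimal sketch of such a check, assuming dslabs is installed. It only prints the structure, the disease levels, and the range of years, and is not part of the original analysis.

```r
library(dslabs)
data("us_contagious_diseases")

# Column names and types of the raw, state-level data
str(us_contagious_diseases)

# Which diseases are covered, and over which years
levels(us_contagious_diseases$disease)
range(us_contagious_diseases$year)
```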

Clean the data

disease_total <- us_contagious_diseases |>
  group_by(year, disease) |>
  summarise(total_count = sum(count), .groups = "drop")

head(disease_total)
# A tibble: 6 × 3
   year disease  total_count
  <dbl> <fct>          <dbl>
1  1928 Measles       483337
2  1928 Polio           4756
3  1928 Smallpox       36470
4  1929 Measles       339061
5  1929 Polio           2746
6  1929 Smallpox       38389

Third, I cleaned the data by creating a new data frame, disease_total, which groups the records by year and disease and sums the case counts. Summarising this way collapses the state-level rows, so the ‘state’ column no longer appears. This step allowed me to analyze the evolution of contagious diseases across the United States as a whole from 1928 to 2010.
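To see why the state column disappears, the same group_by() and summarise() pattern can be tried on a small made-up table. The data frame toy and its values are hypothetical and exist only for this illustration.

```r
library(dplyr)

# Hypothetical toy data: two states reporting one disease in two years
toy <- tibble(
  state   = c("A", "B", "A", "B"),
  year    = c(2000, 2000, 2001, 2001),
  disease = c("Measles", "Measles", "Measles", "Measles"),
  count   = c(10, 5, 3, 7)
)

# Sum the counts within each year/disease pair; columns that are neither
# grouping variables nor summaries (here, state) are dropped
toy_total <- toy |>
  group_by(year, disease) |>
  summarise(total_count = sum(count), .groups = "drop")

toy_total
```

The result has one row per year and disease, with the two state rows for each year added together.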

Create a graph

p <- ggplot(disease_total, aes(x = year, y = total_count, color = disease)) +
  geom_line(linewidth = 0.2) +  # linewidth replaces size for lines since ggplot2 3.4.0
  geom_point(size = 1) +
  labs(
    title = "Disease Counts in the USA from 1928 to 2010",
    x = "Year",
    y = "Total Case Count",
    color = "Diseases"
  ) +
  xlim(1928, 2012) +
  scale_y_log10() +
  scale_color_viridis_d(option = "plasma") +
  theme_minimal()

ggplotly(p)

I built the graph from three variables: ‘year’, ‘total_count’, and ‘disease’. I added geom_point() and geom_line() to show the trends clearly, and included a descriptive title. To spread the values out and keep the smaller counts from clustering at the bottom of the plot, I applied a log10 scale to the y-axis. For the color scheme, I used the viridis “plasma” palette to ensure clarity and accessibility.
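The effect of the log10 scale can be checked with a little arithmetic. The sketch below uses the three 1928 totals from the head(disease_total) output shown earlier and compares where each value would sit on a linear axis and on a log10 axis. The vector name counts is introduced here only for illustration.

```r
# 1928 totals copied from the head(disease_total) output
counts <- c(Measles = 483337, Polio = 4756, Smallpox = 36470)

# Linear axis: each value as a fraction of the largest one.
# Polio ends up at roughly 1% of the measles height.
round(counts / max(counts), 3)

# Log10 axis: the same values after the transformation,
# now spread over a much narrower range
round(log10(counts), 2)
```

One caveat worth keeping in mind: a total of zero has no finite position on a log axis, since log10(0) is -Inf.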

Essay

To begin my analysis, I first loaded the libraries I needed: tidyverse, dslabs, plotly, and viridis.

tidyverse gave me access to useful functions for data manipulation.

dslabs allowed me to access the dataset.

plotly helped make my graph interactive.

viridis provided a clean and accessible color palette.

I then selected the dataset I wanted to work with from the dslabs package, called “U.S. Contagious Diseases.”

During the cleaning process, I created a new data frame to structure the data. I grouped the data by year and by disease and summed the case counts, in order to conduct a global analysis of how contagious diseases evolved in the United States from 1928 to 2010.

For the visualization, I selected three key variables:

x-axis: year

y-axis: total number of cases

color: disease

I adjusted the size of the points, added a title to the graph, and set the limits of the x-axis. I also decided to use a log10 scale on the y-axis to prevent the smaller values from being too concentrated at the bottom of the graph. This made it easier to view the overall trends and better distinguish the differences between values.

Finally, I chose the minimal theme for a clean look, and I used Plotly because I believe it helps avoid confusion and allows users to clearly see the exact values of the data by interacting directly with the graph.
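As a possible refinement of the hover behaviour, the tooltip text can be written by hand. The sketch below is one way to do it, not what the report above does: it uses a hypothetical three-row data frame, the unofficial text aesthetic that ggplotly() recognises, and the tooltip argument of ggplotly().

```r
library(ggplot2)
library(plotly)

# Hypothetical toy data, for illustration only
toy <- data.frame(
  year        = c(2000, 2001, 2002),
  total_count = c(1500, 320, 45),
  disease     = "Measles"
)

# ggplot2 warns that `text` is an unknown aesthetic; ggplotly() still
# picks it up and uses it as the hover label
p_toy <- ggplot(toy, aes(x = year, y = total_count, color = disease)) +
  geom_point(aes(text = paste0(disease, " (", year, "): ", total_count, " cases"))) +
  scale_y_log10() +
  theme_minimal()

# Show only the custom text when hovering
fig <- ggplotly(p_toy, tooltip = "text")
fig
```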