DS_labs assignment

Author

Asma Abbas

Loading the right libraries and dataset

library(dslabs)

Warning: package 'dslabs' was built under R version 4.4.3

library(tidyverse)

Warning: package 'tidyverse' was built under R version 4.4.3

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

data("us_contagious_diseases")

I picked the contagious diseases dataset for a couple of reasons. For one, I think it has a good number of easily understandable variables to work with. I’m also already familiar with it because I used it back in stats class. Honestly, though, I don’t think I did it justice then, so I thought it would be good to go back and work with it again now that I have a somewhat better understanding of what I’m actually doing.

Filtering data

I want to make a graph that shows the number of Hepatitis A cases over the years, but exclusively in the DMV area (District of Columbia, Maryland, and Virginia).

dmv_hepatitisA <- us_contagious_diseases |>
  filter(state %in% c("District Of Columbia", "Maryland", "Virginia"), disease == "Hepatitis A") |>
  filter(!is.na(count))

All I did here was filter the rows down to the ones I wanted to use: the three DMV states and the disease (Hepatitis A). After that I filtered out any rows where the count was missing.
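A quick way to confirm the filter kept what was intended is to summarise the result by state. This is only a sketch of a sanity check, not part of the original analysis; the summary column names (rows, first_year, last_year) are my own. It is worth doing because %in% quietly returns no rows for a state name that is spelled or capitalized differently from the dataset.

```r
library(dslabs)
library(dplyr)

data("us_contagious_diseases")

# Same filter as above, repeated so this chunk runs on its own
dmv_hepatitisA <- us_contagious_diseases |>
  filter(state %in% c("District Of Columbia", "Maryland", "Virginia"),
         disease == "Hepatitis A") |>
  filter(!is.na(count))

# One row per state that survived the filter, with the span of years covered
dmv_hepatitisA |>
  group_by(state) |>
  summarise(rows = n(),
            first_year = min(year),
            last_year = max(year))
```

If one of the three states is missing from this table, the name in the filter probably does not match the one used in the dataset.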

Making a graph

options(scipen=999)
ggplot(dmv_hepatitisA, aes(x = year, y = count, color = population)) +
  geom_point(size = 4, alpha = 0.6) +
  scale_color_gradient(low="lightblue", high = "darkblue" ) +
  labs(
    title = "Hepatitis A Cases in the DMV Region over the Years",
    subtitle = "Data from the US Contagious Diseases Dataset",
    x = "Year",
    y = "Number of Hepatitis A Cases",
    color = "Population") +
  theme_light() +
  theme(legend.position = "right")

Here I made a fairly standard scatterplot showing three variables. The x axis is the year, the y axis is the case count, and the color scale shows each state’s population. I used ggplot and started by mapping the variables to the axes and to color inside aes(). I used the geom_point settings to adjust the dots: I made them bigger and slightly more transparent because some of them were blending into one another. Then I added the title and a subtitle naming the source (usually I put the source in a caption, but I liked how this looked better, and I don’t think it makes a difference). When I was creating the color scale, the population values in the legend were showing up in scientific notation. I wanted them written out in full, so I used options(scipen=999) (found through a Google search, cited below). Initially I had put it somewhere else in the chunk, but it wasn’t running, so I moved it to the top and it worked. I wanted to add commas, but wasn’t sure how to do so.
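For the commas, one possible approach is to give the color scale a labelling function through its labels argument. The sketch below assumes the scales package (installed alongside ggplot2) and uses scales::label_comma(); the object name p is just a placeholder of mine. With labels formatted this way, the legend should no longer need options(scipen=999), though leaving it in does no harm.

```r
library(dslabs)
library(dplyr)
library(ggplot2)

data("us_contagious_diseases")

# Same filtered data as above, repeated so this chunk runs on its own
dmv_hepatitisA <- us_contagious_diseases |>
  filter(state %in% c("District Of Columbia", "Maryland", "Virginia"),
         disease == "Hepatitis A") |>
  filter(!is.na(count))

p <- ggplot(dmv_hepatitisA, aes(x = year, y = count, color = population)) +
  geom_point(size = 4, alpha = 0.6) +
  # labels = scales::label_comma() writes legend values with thousands separators
  scale_color_gradient(low = "lightblue", high = "darkblue",
                       labels = scales::label_comma()) +
  labs(
    title = "Hepatitis A Cases in the DMV Region over the Years",
    subtitle = "Data from the US Contagious Diseases Dataset",
    x = "Year",
    y = "Number of Hepatitis A Cases",
    color = "Population") +
  theme_light()

p
```

The same labels argument works on scale_y_continuous() if the axis values ever get large enough to need commas too.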

Response:

-In your RMD file / Rpubs document, be sure you describe in a paragraph what dataset you have used and document how you have created your graph. If you choose to use one of the datasets from the examples in the tutorial, be sure I can clearly understand what you have created that is meaningfully different. Render your document to Rpubs and submit the link in the assignment dropbox.

As I said previously, I chose this dataset because it was one I’ve worked with before, and being familiar with the variables was helpful. That being said, I still think it’s pretty interesting. I would like to see a dataset with a more up-to-date set of diseases, as it seems most of these have died out or become rare (which is a good thing!). The dataset essentially lines up several notable diseases across the United States and gives yearly statistics for each state. Variables like count and weeks_reporting show how many cases were recorded and how many weeks of the year they were reported for. The variables I picked for examination were count (number of cases), year, and population, all pertaining to Hepatitis A. I feel like out of all the diseases, aside from Polio, this one has more recent occurrences. I also thought that, to make it more local and personal, it’d be good to see it for the DMV states.

I figured the best way to show them all together would be a scatterplot. After deciding that, I did minimal cleaning and went into creating the plot. I didn’t face much difficulty with this plot, aside from trying to make the gradient not look ugly and removing the scientific notation. I think the plot is interesting to analyze, and it makes me wonder how it would look if I were to plot other diseases from the dataset.
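Because the raw counts depend on how many people live in each state and on how many weeks were reported, one follow-up I could try is converting the counts to a rate. The sketch below is an assumption on my part and not something used in the plot above: it defines rate as cases per 100,000 residents scaled to a full 52-week year, and the names dmv_hepatitisA_rate and rate are made up for illustration.

```r
library(dslabs)
library(dplyr)

data("us_contagious_diseases")

dmv_hepatitisA_rate <- us_contagious_diseases |>
  filter(state %in% c("District Of Columbia", "Maryland", "Virginia"),
         disease == "Hepatitis A",
         !is.na(count),
         !is.na(population),
         weeks_reporting > 0) |>   # avoid dividing by zero reported weeks
  # Assumed definition: cases per 100,000 people, adjusted to a 52-week year
  mutate(rate = count / population * 100000 * 52 / weeks_reporting)

head(dmv_hepatitisA_rate)
```

Plotting rate on the y axis instead of count would make the three states easier to compare directly.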

Sources used to help:

Stack Overflow user. (2014, September 20). How to prevent scientific notation in R? Stack Overflow. https://stackoverflow.com/questions/25946047/how-to-prevent-scientific-notation-in-r

For removing scientific notation ^

Wickham, H., Chang, W., Henry, L., Pedersen, T. L., Takahashi, K., Wilke, C., … & RStudio. (n.d.). scale_gradient: Create a gradient colour scale. ggplot2. https://ggplot2.tidyverse.org/reference/scale_gradient.html

The color grading was in our notes, but I also looked at this while I was messing around with the scale.