── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("dslabs")
# List all data sets in the dslabs package
data(package = "dslabs")
# Load the data
data("us_contagious_diseases")
# Preview the first rows (pass the object itself, not its name as a string)
head(us_contagious_diseases)
state1 <- disease_nona1 |>
  group_by(state, region) |>
  summarize(
    # Total number of disease cases over all years
    total_count = sum(count),
    # Average population represents each state's size
    avg_population = mean(population)
  )
`summarise()` has regrouped the output.
ℹ Summaries were computed grouped by state and region.
ℹ Output is grouped by state.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(state, region))` for per-operation grouping
(`?dplyr::dplyr_by`) instead.
Calculate the overall disease rate for the country
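The commented-out geom_abline() call in the plotting code refers to a value r that is not defined anywhere in this document. Below is a minimal sketch of how such a rate could be computed, assuming r means total cases per million people. The table state1_demo is an invented stand-in for state1, so the numbers are illustrative only.

```r
library(dplyr)

# Hypothetical stand-in for state1; the values are invented for illustration
state1_demo <- tibble(
  state = c("A", "B", "C"),
  total_count = c(1000, 4000, 5000),
  avg_population = c(1e6, 2e6, 2e6)
)

# Assumed definition: cases summed over all states, per million people
r <- state1_demo |>
  summarize(rate = sum(total_count) / sum(avg_population) * 10^6) |>
  pull(rate)

r
# 10000 cases / 5 million people = 2000 cases per million

# With count on x and population (in millions) on y, y = x / r,
# so on log-log axes the line has slope 1 and intercept -log10(r)
intercept <- -log10(r)
```

Under this assumed definition, a state sitting exactly on the line would have the same cases-per-million figure as the country as a whole.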
# Themes for better visualization
library(ggthemes)
# Avoids overlapping labels
library(ggrepel)
# Set the dslabs theme to improve the visualization
ds_theme_set()
ggplot(state1, aes(x = total_count, y = avg_population / 10^6, label = state)) +
  # Manual reference line (disabled)
  # geom_abline(slope = 1, intercept = -log10(r), linetype = 2, col = "blue") +
  geom_point(aes(color = region), size = 3, alpha = 0.9) +
  # nudge_x: horizontal shift; segment.colour: color of the line from label to point;
  # box.padding: space around labels
  geom_text_repel(nudge_x = 0.005, segment.colour = "orange", size = 3, box.padding = 0.3) +
  scale_x_log10("Contagious Diseases Count (log scale)") +
  scale_y_log10("State Population (millions, log scale)") +
  ggtitle("U.S. Contagious Diseases vs Population by State") +
  # A single colour scale; the legend title matches the mapped variable (region)
  scale_color_brewer(name = "Region", palette = "Set2") +
  # theme_minimal(base_size = 14, base_family = "serif") +
  # Automatic reference line: linear fit, drawn dashed
  geom_smooth(method = lm, se = FALSE, linetype = 2, color = "lightgreen", linewidth = 1)
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: The following aesthetics were dropped during statistical transformation: label.
ℹ This can happen when ggplot fails to infer the correct grouping structure in
the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
variable into a factor?
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_text_repel()`).
Summary
I chose the U.S. Contagious Diseases dataset to explore the relationship between disease counts and population across U.S. states. First, I loaded the dataset and filtered out NA values. I also grouped the states into four regions: Northeast, South, North Central, and West. Then I replaced each state’s name with its abbreviation. I grouped the data by state and region, then calculated the total number of disease cases across all years and the average population for each state. Because I treated each state’s population as roughly stable over time, I used the average population to represent each state’s size.
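The cleaning steps described above are not shown in the code, which starts from disease_nona1. Below is a minimal sketch of one way they could be done, assuming base R's built-in state.name, state.abb, and state.region vectors supply the abbreviations and the four regions. The table demo is an invented stand-in for us_contagious_diseases, and this is not necessarily how disease_nona1 was actually built.

```r
library(dplyr)

# Hypothetical stand-in for us_contagious_diseases; values are invented
demo <- tibble(
  state = c("Alabama", "Alabama", "Ohio", "Ohio"),
  count = c(10, NA, 30, 40),
  population = c(100, 110, 200, 210)
)

disease_demo <- demo |>
  # Drop rows with a missing count or population
  filter(!is.na(count), !is.na(population)) |>
  mutate(
    # Look up the region from the full state name first...
    region = state.region[match(state, state.name)],
    # ...then overwrite the name with its two-letter abbreviation
    state = state.abb[match(state, state.name)]
  )

disease_demo
```

Any name that is missing from state.name (District of Columbia, for example, is not in it) would come out as NA and would need separate handling.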
For the visualization, I created a scatterplot. The x-axis is the total disease count and the y-axis is the average population, both on a log scale because the two variables span a wide range of values. Each point represents a state, and each color a region. I added state abbreviations as labels using geom_text_repel to avoid overlapping text and improve readability. I also added a reference line with geom_smooth, which fits a straight line to the points on the log-log scale and serves as a rough guide to the country’s overall disease rate. I left the x-axis tick labels in their default format, since simplifying them would have required additional formatting code in scale_x_log10.
The graph shows a clear positive relationship between population size and total disease cases. The North Central and Northeast regions appear to have higher disease counts, while the West region has the lowest. The South region has a relatively moderate disease count.
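On the x-axis tick labels mentioned above: the labels argument of scale_x_log10 may be a less involved route than expected. Below is a minimal sketch, assuming the scales package is available (ggplot2 depends on it) and using invented demo data rather than state1.

```r
library(ggplot2)
library(scales)

# Hypothetical data standing in for state1; values are invented
demo <- data.frame(
  total_count = c(1e3, 1e4, 1e5, 1e6),
  pop_millions = c(0.5, 1, 5, 20)
)

# label_comma() returns a function that formats numbers with thousands separators
fmt <- label_comma()
fmt(c(1000, 1e6))
# "1,000" "1,000,000"

p <- ggplot(demo, aes(x = total_count, y = pop_millions)) +
  geom_point() +
  # Pass the formatter to labels so ticks are written out in full
  scale_x_log10("Contagious Diseases Count (log scale)", labels = fmt) +
  scale_y_log10("State Population (millions, log scale)")

# print(p) draws the plot
```

The same labels argument could be added to the existing scale_x_log10 call without changing anything else in the plot.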