My dataset is from the Centers for Disease Control and Prevention (CDC) and focuses on depression and related health conditions across ZIP Code Tabulation Areas (ZCTA5) in the New England region of the U.S. It includes three quantitative variables, each measured as a percentage: depression prevalence, lack of sleep prevalence, and lack of emotional support prevalence. I also created categorical variables by grouping depression and population values into levels such as low, medium, and high, which makes patterns easier to compare and to show in the visualizations.
For data cleaning, I first used filter() to remove rows with missing values in DEPRESSION_CrudePrev, SLEEP_CrudePrev, TotalPopulation, EMOTIONSPT_CrudePrev, Geolocation, and ZCTA5. Then I kept the top 300 locations by TotalPopulation by sorting the data in descending order and keeping the first 300 rows.
I chose this dataset mainly because it is GIS-friendly, which allows me to map the data easily. It can also help me understand how depression is associated with different factors, including lack of sleep, population, and lack of emotional support.
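The grouping into levels could be sketched roughly as follows. This is only an illustration on invented values, not the code used for this report: the toy table, the depression cut points (assumed from the map legend labels), and the population breaks are all assumptions.

```r
library(dplyr)

# Toy data standing in for the CDC table (values are invented for illustration)
toy <- tibble(
  ZCTA5 = c("00001", "00002", "00003", "00004"),
  DEPRESSION_CrudePrev = c(18.5, 22.0, 27.3, 31.1),
  TotalPopulation = c(12000, 35000, 8000, 60000)
)

toy_grouped <- toy |>
  mutate(
    # Cut points assumed from the legend labels "<20%", "20-25%", "25-30%", ">30%"
    depression_level = case_when(
      DEPRESSION_CrudePrev < 20 ~ "<20%",
      DEPRESSION_CrudePrev < 25 ~ "20-25%",
      DEPRESSION_CrudePrev <= 30 ~ "25-30%",
      DEPRESSION_CrudePrev > 30 ~ ">30%",
      TRUE ~ "Others"  # catches missing values
    ),
    # Hypothetical population breaks; the report's actual breaks are not shown
    pop_group = cut(
      TotalPopulation,
      breaks = c(0, 10000, 40000, Inf),
      labels = c("low", "medium", "high")
    )
  )

toy_grouped
```

case_when() checks its conditions in order, so each row gets the first label whose condition is true, while cut() turns a numeric column into an ordered set of bins.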
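A minimal sketch of these two cleaning steps, assuming a small invented table in place of the CDC file (the `toy` data and the `n_keep` name are hypothetical, and only two of the columns are checked for missing values to keep the example short):

```r
library(dplyr)

# Toy table with invented values; one row has a missing prevalence
toy <- tibble(
  ZCTA5 = c("00001", "00002", "00003", "00004", "00005"),
  TotalPopulation = c(500, 9000, 3000, 7000, 1000),
  DEPRESSION_CrudePrev = c(21.0, NA, 24.5, 19.8, 26.1)
)

n_keep <- 3  # the report keeps 300; 3 is used here so the toy result is easy to read

toy_top <- toy |>
  # drop rows with missing values in the columns of interest
  filter(!is.na(DEPRESSION_CrudePrev), !is.na(TotalPopulation)) |>
  # sort so the largest populations come first
  arrange(desc(TotalPopulation)) |>
  # keep only the first n_keep rows
  slice_head(n = n_keep)

toy_top
```

Filtering before sorting means a location with a large population but a missing prevalence value does not take up one of the top slots.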
This topic is meaningful to me because mental health is closely connected to everyday life, including stress, sleep, and social support. Understanding these relationships through data helps me see how broader social factors may influence individual well-being.
Load the dataset and filter NAs
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
Rows: 1139 Columns: 84
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (42): ZCTA5, ACCESS2_Crude95CI, ARTHRITIS_Crude95CI, BINGE_Crude95CI, BP...
dbl (40): ACCESS2_CrudePrev, ARTHRITIS_CrudePrev, BINGE_CrudePrev, BPHIGH_Cr...
num (2): TotalPopulation, TotalPop18plus
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View(place)
# filter all NAs
place_nona <- place |>
  filter(
    !is.na(DEPRESSION_CrudePrev),
    !is.na(SLEEP_CrudePrev),
    !is.na(TotalPopulation),
    !is.na(EMOTIONSPT_CrudePrev),
    !is.na(Geolocation),
    !is.na(ZCTA5)
  )
place_nona
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
options(scipen = 999)
p1 <- ggplot(pop_group, aes(x = DEPRESSION_CrudePrev, y = SLEEP_CrudePrev, color = pop_group)) +
  # point size depends on TotalPopulation
  geom_point(aes(size = TotalPopulation), alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    title = "Depression vs Lack of Sleep Prevalence by Population Size in New England Region",
    x = "Depression(%)",
    y = "Lack of Sleep(%)",
    color = "Population Group",
    size = "Total Population",
    caption = "Source: Centers for Disease Control and Prevention (CDC)"
  ) +
  # set color palette
  scale_color_manual(values = c("#A8E6CF", "#FFD3B6", "#FFAAA5", "#D5AAFF", "#FFECB3", "#B5EABA", "#FFE0B2")) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))
# convert the static ggplot into an interactive plotly widget
p1 <- ggplotly(p1)
`geom_smooth()` using formula = 'y ~ x'
p1
Separate Geolocation (lat, long) into two columns: lat and long
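One possible way to do this split is sketched below. It assumes each Geolocation value is text of the form "(lat, long)", as the heading suggests; the toy values are invented, and if the file instead stores the coordinates in another layout (for example "POINT (long lat)"), the regular expression and the column order would need to change.

```r
library(dplyr)
library(tidyr)

# Toy rows with invented coordinates in the assumed "(lat, long)" text format
toy <- tibble(
  ZCTA5 = c("00001", "00002"),
  Geolocation = c("(42.36, -71.06)", "(43.66, -70.26)")
)

toy_coords <- toy |>
  tidyr::extract(
    Geolocation,
    into = c("lat", "long"),
    # two capture groups: the number before the comma and the number after it
    regex = "\\(\\s*(-?[0-9.]+)\\s*,\\s*(-?[0-9.]+)\\s*\\)",
    convert = TRUE  # turn the captured text into numeric columns
  )

toy_coords
```

Naming the new columns lat and long lets leaflet pick them up automatically as latitude and longitude.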
# assign 5 colors to the 5 depression levels
pal <- colorFactor(
  palette = c("#00C9A7", "#FF9671", "#FF6F91", "#845EC2", "#FFC75F"),
  levels = c("<20%", "20-25%", "25-30%", ">30%", "Others")
)
leaflet(depression_level) |>
  setView(lng = -70, lat = 43.5, zoom = 9) |>
  addProviderTiles("Esri.WorldStreetMap") |>
  addCircles(
    stroke = FALSE,
    # size of the spot scales with lack of emotional support
    radius = ~sqrt(EMOTIONSPT_CrudePrev * 100000),
    fillColor = ~pal(depression_level),
    fillOpacity = 0.7,
    popup = popup1
  ) |>
  addLegend(
    position = "bottomright",
    pal = pal,
    values = ~depression_level,
    title = "Depression Level VS Lack of Emotional Support in New England Region"
  )
Assuming "long" and "lat" are longitude and latitude, respectively
Essay
For this assignment, I created two plots: a scatterplot and a geographic map.
From the scatter plot, there is a slight positive relationship between depression and lack of sleep prevalence. As depression increases, sleep issues also tend to increase, although the relationship is not very strong. The points are widely spread, suggesting variability across locations.
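One way to put a number on "slight positive relationship" is a correlation coefficient and the slope of a fitted line. The sketch below uses invented values rather than the CDC data, so the variable names and numbers are purely illustrative; with the real table, the two prevalence columns would take their place.

```r
# Invented values standing in for the two prevalence columns
depression <- c(18.2, 20.1, 22.5, 24.0, 26.3, 28.7)
short_sleep <- c(33.0, 36.5, 32.8, 37.1, 35.0, 38.2)

# Pearson correlation: between -1 and 1, positive when both rise together
r <- cor(depression, short_sleep)

# Least-squares line, the same kind of fit that method = "lm" draws
fit <- lm(short_sleep ~ depression)

# Change in lack of sleep (percentage points) per 1-point change in depression
slope <- unname(coef(fit)["depression"])

r
slope
```

A value of r close to 0 with a positive sign would match a weak upward trend with widely scattered points.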
Population size does not show a clear pattern in the scatter plot. Larger and smaller populations are mixed in the graph, indicating that population alone may not strongly influence the relationship between depression and sleep.
The map provides additional geographic insight. Areas with different depression levels are spread across the region rather than being concentrated in only urban or high-population areas. The size of the points represents the level of lack of emotional support, where larger circles indicate more lack of support. These larger circles are distributed across both higher and lower depression areas, suggesting that emotional support does not have a simple or direct relationship with depression levels.
One interesting observation is that some locations with relatively high emotional support still show higher depression levels, while some areas with lower support do not have the highest depression levels. This indicates that emotional support alone may not fully explain the level of depression, and other factors such as access to healthcare may also affect depression levels.
A limitation of this visualization is that overlapping points on the map make it difficult to clearly distinguish all locations. Additionally, more variables (such as income or healthcare access) could have been included to provide deeper insight.
Citation
All code is based on previous Data 110 handouts, homework, and notes.