Project 2 Depression

Author

Qian He

Introduction:

My dataset is from the Centers for Disease Control and Prevention (CDC) and focuses on depression and related health conditions across ZIP Code Tabulation Areas (ZCTA5) in the New England region of the U.S. It includes three quantitative variables, each a percentage: depression prevalence, lack of sleep prevalence, and lack of emotional support prevalence. I also created categorical variables by grouping depression and population values into ranges (for example, 20-25% depression prevalence or 10k-20k residents) to compare patterns and show them on the visualizations more easily.

For data cleaning, I first used filter() to remove rows with missing values in DEPRESSION_CrudePrev, SLEEP_CrudePrev, TotalPopulation, EMOTIONSPT_CrudePrev, Geolocation, and ZCTA5. Then I kept the 300 most populous locations by sorting the data by TotalPopulation in descending order and taking the first 300 rows.

I chose this dataset mainly because it is GIS-friendly, which makes it easy to map. It also helps me understand how depression is associated with different factors, including lack of sleep, population size, and lack of emotional support.

This topic is meaningful to me because mental health is closely connected to everyday life, including stress, sleep, and social support. Understanding these relationships through data helps me see how broader social factors may influence individual well-being.

Load the dataset and filter out NAs

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

setwd("/Users/macbook/Desktop")
place <- read_csv("PLACES_ZCTA_Data_(GIS_Friendly_Format),_2024_release.csv")

Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)

Rows: 1139 Columns: 84
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (42): ZCTA5, ACCESS2_Crude95CI, ARTHRITIS_Crude95CI, BINGE_Crude95CI, BP...
dbl (40): ACCESS2_CrudePrev, ARTHRITIS_CrudePrev, BINGE_CrudePrev, BPHIGH_Cr...
num  (2): TotalPopulation, TotalPop18plus

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

#View(place)

# drop rows with missing values in the variables used below
place_nona <- place |>
  filter(!is.na(DEPRESSION_CrudePrev),
         !is.na(SLEEP_CrudePrev),
         !is.na(TotalPopulation),
         !is.na(EMOTIONSPT_CrudePrev),
         !is.na(Geolocation),
         !is.na(ZCTA5))
  
place_nona

# A tibble: 1,138 × 84
   ZCTA5 TotalPopulation TotalPop18plus ACCESS2_CrudePrev ACCESS2_Crude95CI
   <chr>           <dbl>          <dbl>             <dbl> <chr>            
 1 01001           16984          14019               4.5 ( 3.8,  5.3)     
 2 01002           27558          23853               4.6 ( 3.9,  5.4)     
 3 01003           13253          13181               4.1 ( 3.4,  4.7)     
 4 01005            4900           3889               4.9 ( 3.8,  6.4)     
 5 01007           15423          12248               3.5 ( 2.9,  4.3)     
 6 01008            1317           1099               3.9 ( 3.1,  4.9)     
 7 01009             980            779               4.9 ( 3.6,  6.5)     
 8 01010            3711           3021               3.4 ( 2.5,  4.4)     
 9 01011            1201           1020               3.9 ( 3.1,  4.8)     
10 01012             519            440               3.4 ( 2.5,  4.6)     
# ℹ 1,128 more rows
# ℹ 79 more variables: ARTHRITIS_CrudePrev <dbl>, ARTHRITIS_Crude95CI <chr>,
#   BINGE_CrudePrev <dbl>, BINGE_Crude95CI <chr>, BPHIGH_CrudePrev <dbl>,
#   BPHIGH_Crude95CI <chr>, BPMED_CrudePrev <dbl>, BPMED_Crude95CI <chr>,
#   CANCER_CrudePrev <dbl>, CANCER_Crude95CI <chr>, CASTHMA_CrudePrev <dbl>,
#   CASTHMA_Crude95CI <chr>, CHD_CrudePrev <dbl>, CHD_Crude95CI <chr>,
#   CHECKUP_CrudePrev <dbl>, CHECKUP_Crude95CI <chr>, …

Filter the 300 most populous places for plotting

# find the 300 most populous places (ZCTA5)
poptop300 <- place_nona |>
  arrange(desc(TotalPopulation)) |>
  head(300)

poptop300

# A tibble: 300 × 84
   ZCTA5 TotalPopulation TotalPop18plus ACCESS2_CrudePrev ACCESS2_Crude95CI
   <chr>           <dbl>          <dbl>             <dbl> <chr>            
 1 02301           69418          52511               8.8 ( 7.6,  9.9)     
 2 02148           66274          54715               5.8 ( 5.0,  6.7)     
 3 02151           62252          49581              10.2 ( 8.8, 11.7)     
 4 02169           62140          52547               5   ( 4.3,  5.8)     
 5 02155           61483          53078               4.2 ( 3.6,  4.8)     
 6 02360           61134          50341               3.9 ( 3.3,  4.5)     
 7 01960           54155          44645               5.1 ( 4.3,  5.8)     
 8 01841           53654          39222              20.3 (17.6, 23.1)     
 9 01844           52970          41825               7.5 ( 6.5,  8.5)     
10 02780           52299          41595               6.5 ( 5.6,  7.5)     
# ℹ 290 more rows
# ℹ 79 more variables: ARTHRITIS_CrudePrev <dbl>, ARTHRITIS_Crude95CI <chr>,
#   BINGE_CrudePrev <dbl>, BINGE_Crude95CI <chr>, BPHIGH_CrudePrev <dbl>,
#   BPHIGH_Crude95CI <chr>, BPMED_CrudePrev <dbl>, BPMED_Crude95CI <chr>,
#   CANCER_CrudePrev <dbl>, CANCER_Crude95CI <chr>, CASTHMA_CrudePrev <dbl>,
#   CASTHMA_Crude95CI <chr>, CHD_CrudePrev <dbl>, CHD_Crude95CI <chr>,
#   CHECKUP_CrudePrev <dbl>, CHECKUP_Crude95CI <chr>, …

# keep only the rows for those 300 places (ZCTA5), in the original row order
pop300 <- place_nona |>
  filter(ZCTA5 %in% poptop300$ZCTA5)
pop300

# A tibble: 300 × 84
   ZCTA5 TotalPopulation TotalPop18plus ACCESS2_CrudePrev ACCESS2_Crude95CI
   <chr>           <dbl>          <dbl>             <dbl> <chr>            
 1 01001           16984          14019               4.5 ( 3.8,  5.3)     
 2 01002           27558          23853               4.6 ( 3.9,  5.4)     
 3 01003           13253          13181               4.1 ( 3.4,  4.7)     
 4 01007           15423          12248               3.5 ( 2.9,  4.3)     
 5 01013           23656          18521               9.1 ( 7.7, 10.5)     
 6 01020           29781          24349               7.1 ( 6.1,  8.3)     
 7 01027           17896          15089               4.6 ( 3.7,  5.5)     
 8 01028           16417          13065               3.5 ( 2.9,  4.3)     
 9 01040           38238          29727              14.8 (12.8, 16.7)     
10 01056           21002          17568               5.6 ( 4.6,  6.7)     
# ℹ 290 more rows
# ℹ 79 more variables: ARTHRITIS_CrudePrev <dbl>, ARTHRITIS_Crude95CI <chr>,
#   BINGE_CrudePrev <dbl>, BINGE_Crude95CI <chr>, BPHIGH_CrudePrev <dbl>,
#   BPHIGH_Crude95CI <chr>, BPMED_CrudePrev <dbl>, BPMED_Crude95CI <chr>,
#   CANCER_CrudePrev <dbl>, CANCER_Crude95CI <chr>, CASTHMA_CrudePrev <dbl>,
#   CASTHMA_Crude95CI <chr>, CHD_CrudePrev <dbl>, CHD_Crude95CI <chr>,
#   CHECKUP_CrudePrev <dbl>, CHECKUP_Crude95CI <chr>, …
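The two steps above (sort and take the first 300 rows, then filter the original table by ZCTA5) could probably be collapsed into one call. The sketch below is only an illustration on a made-up four-row table, not the real CDC file: the `toy` data frame and its values are hypothetical, and `slice_max()` is an alternative to the approach used in this report. Note that `slice_max()` returns rows sorted by population, whereas the `%in%` filter keeps the original row order.

```r
library(dplyr)

# Hypothetical toy data standing in for place_nona (values are made up)
toy <- tibble(
  ZCTA5 = c("A", "B", "C", "D"),
  TotalPopulation = c(500, 100, 900, 300)
)

# One-step version: keep the n most populous rows
top2 <- toy |>
  slice_max(TotalPopulation, n = 2, with_ties = FALSE)

# Two-step version used in this report: same rows, original row order
top2_original_order <- toy |>
  filter(ZCTA5 %in% top2$ZCTA5)

top2$ZCTA5                 # sorted by population, largest first
top2_original_order$ZCTA5  # same places, in the order they appear in toy
```

With `with_ties = FALSE`, exactly `n` rows come back even when several places share the cutoff population.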

Create the scatterplot

# turn population into a categorical variable
pop_group <- pop300 |>
  mutate(pop_group=case_when(
    TotalPopulation >10000 & TotalPopulation <=20000 ~"10k-20k",
    TotalPopulation >20000 & TotalPopulation <=30000 ~"20k-30k",
    TotalPopulation >30000 & TotalPopulation <=40000 ~"30k-40k",
    TotalPopulation >40000 & TotalPopulation <=50000 ~"40k-50k",
    TotalPopulation >50000 & TotalPopulation <=60000 ~"50k-60k",
    TotalPopulation >60000 & TotalPopulation <=70000 ~"60k-70k",
    TRUE~"Others"))
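As a cross-check on the grouping above, the same bins can be sketched with base R's `cut()`. This is only an alternative illustration, assuming the same right-closed ranges as the `case_when()` rules; the `pop` vector is hypothetical test input chosen to hit the bin edges.

```r
# Hypothetical population values, including the bin edges
pop <- c(9500, 10000, 15000, 20000, 20001, 69418)

breaks <- c(10000, 20000, 30000, 40000, 50000, 60000, 70000)
labels <- c("10k-20k", "20k-30k", "30k-40k", "40k-50k", "50k-60k", "60k-70k")

# right = TRUE makes each interval (lower, upper], matching "> lower & <= upper"
grp <- as.character(cut(pop, breaks = breaks, labels = labels, right = TRUE))

# Values outside every interval come back as NA; label them like the TRUE ~ "Others" rule
grp[is.na(grp)] <- "Others"

data.frame(pop, grp)
```

A value of exactly 10000 falls outside the first bin under both approaches, so it lands in "Others".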

library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

options(scipen = 999)
p1<- ggplot(pop_group,aes(x=DEPRESSION_CrudePrev,y=SLEEP_CrudePrev,color=pop_group))+
  # point size scales with TotalPopulation
  geom_point(aes(size=TotalPopulation),alpha=0.7)+
  geom_smooth(method="lm",se=FALSE,color="red")+
  labs(title="Depression vs Lack of Sleep Prevalence by Population Size in New England Region",
       x="Depression(%)",
       y="Lack of Sleep(%)",
       color="Population Group",
       size="Total Population",
       caption="Source: Centers for Disease Control and Prevention (CDC) ")+
  #set color palette
  scale_color_manual(values=c("#A8E6CF","#FFD3B6","#FFAAA5","#D5AAFF","#FFECB3","#B5EABA","#FFE0B2"))+
  theme_minimal()+
  theme(plot.title=element_text(hjust = 0.5))
  
p1 <-ggplotly(p1)

`geom_smooth()` using formula = 'y ~ x'

p1

Separate Geolocation (long, lat) into two columns: long and lat

latlong <- pop_group |>
  mutate(Geolocation = str_replace_all(Geolocation, "POINT |\\(|\\)", "")) |>
  separate(Geolocation, into = c("long", "lat"), sep = " ", convert = TRUE)

head(latlong)

# A tibble: 6 × 86
  ZCTA5 TotalPopulation TotalPop18plus ACCESS2_CrudePrev ACCESS2_Crude95CI
  <chr>           <dbl>          <dbl>             <dbl> <chr>            
1 01001           16984          14019               4.5 ( 3.8,  5.3)     
2 01002           27558          23853               4.6 ( 3.9,  5.4)     
3 01003           13253          13181               4.1 ( 3.4,  4.7)     
4 01007           15423          12248               3.5 ( 2.9,  4.3)     
5 01013           23656          18521               9.1 ( 7.7, 10.5)     
6 01020           29781          24349               7.1 ( 6.1,  8.3)     
# ℹ 81 more variables: ARTHRITIS_CrudePrev <dbl>, ARTHRITIS_Crude95CI <chr>,
#   BINGE_CrudePrev <dbl>, BINGE_Crude95CI <chr>, BPHIGH_CrudePrev <dbl>,
#   BPHIGH_Crude95CI <chr>, BPMED_CrudePrev <dbl>, BPMED_Crude95CI <chr>,
#   CANCER_CrudePrev <dbl>, CANCER_Crude95CI <chr>, CASTHMA_CrudePrev <dbl>,
#   CASTHMA_Crude95CI <chr>, CHD_CrudePrev <dbl>, CHD_Crude95CI <chr>,
#   CHECKUP_CrudePrev <dbl>, CHECKUP_Crude95CI <chr>,
#   CHOLSCREEN_CrudePrev <dbl>, CHOLSCREEN_Crude95CI <chr>, …
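To make the parsing step easier to follow, here is a minimal sketch on a single made-up value. It assumes the Geolocation column is stored as text in the form `POINT (longitude latitude)`, which is what the regular expression above expects; the coordinates in `toy` are hypothetical.

```r
library(tibble)
library(stringr)
library(tidyr)

# One hypothetical row in the assumed "POINT (long lat)" format
toy <- tibble(ZCTA5 = "00000", Geolocation = "POINT (-72.6 42.1)")

parsed <- toy |>
  # strip the "POINT " prefix and both parentheses, leaving "-72.6 42.1"
  mutate(Geolocation = str_replace_all(Geolocation, "POINT |\\(|\\)", "")) |>
  # split on the space; longitude comes first in this format
  separate(Geolocation, into = c("long", "lat"), sep = " ", convert = TRUE)

parsed
```

Because `convert = TRUE` is set, `long` and `lat` come out as numbers rather than text, which is what leaflet needs.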

Create a map

# leaflet()
library(leaflet)
leaflet(data=latlong) |>
  setView(lng=-70,lat=43.5,zoom=8) |>
  addProviderTiles("Esri.WorldStreetMap") |>
  addCircles(
    #stroke=FALSE,
    lng=~long,
    lat=~lat,
    radius=~sqrt(TotalPopulation),
    color="#E63946",
    fillColor="#ffe066",
    fillOpacity=0.5)

Refine map to include a mouse-click tooltip

#create popup content
popup1 <- paste0(
  "<b>ZCTA5:</b> ", pop_group$ZCTA5, "<br>",
  "<b>DEPRESSION(%):</b> ", pop_group$DEPRESSION_CrudePrev, "<br>",
  "<b>LACK OF SLEEP(%):</b> ", pop_group$SLEEP_CrudePrev, "<br>",
  "<b>LACK OF EMOTIONAL SUPPORT(%):</b> ", pop_group$EMOTIONSPT_CrudePrev, "<br>",
  # the population group is a size range, not a percentage
  "<b>POPULATION GROUP:</b> ", pop_group$pop_group, "<br>")

# turn the depression variable into a categorical variable
depression_level <- latlong |>
  mutate(depression_level=case_when(
     DEPRESSION_CrudePrev <=20 ~"<20%",
     DEPRESSION_CrudePrev>20 & DEPRESSION_CrudePrev <=25 ~"20-25%",
     DEPRESSION_CrudePrev>25 & DEPRESSION_CrudePrev <= 30~"25-30%",
     DEPRESSION_CrudePrev>30 ~">30%",
     TRUE~"Others")) |>
  #let the legend show colors in order
  mutate(depression_level=factor(depression_level,levels = c("<20%","20-25%","25-30%",">30%","Others")))

depression_level

# A tibble: 300 × 87
   ZCTA5 TotalPopulation TotalPop18plus ACCESS2_CrudePrev ACCESS2_Crude95CI
   <chr>           <dbl>          <dbl>             <dbl> <chr>            
 1 01001           16984          14019               4.5 ( 3.8,  5.3)     
 2 01002           27558          23853               4.6 ( 3.9,  5.4)     
 3 01003           13253          13181               4.1 ( 3.4,  4.7)     
 4 01007           15423          12248               3.5 ( 2.9,  4.3)     
 5 01013           23656          18521               9.1 ( 7.7, 10.5)     
 6 01020           29781          24349               7.1 ( 6.1,  8.3)     
 7 01027           17896          15089               4.6 ( 3.7,  5.5)     
 8 01028           16417          13065               3.5 ( 2.9,  4.3)     
 9 01040           38238          29727              14.8 (12.8, 16.7)     
10 01056           21002          17568               5.6 ( 4.6,  6.7)     
# ℹ 290 more rows
# ℹ 82 more variables: ARTHRITIS_CrudePrev <dbl>, ARTHRITIS_Crude95CI <chr>,
#   BINGE_CrudePrev <dbl>, BINGE_Crude95CI <chr>, BPHIGH_CrudePrev <dbl>,
#   BPHIGH_Crude95CI <chr>, BPMED_CrudePrev <dbl>, BPMED_Crude95CI <chr>,
#   CANCER_CrudePrev <dbl>, CANCER_Crude95CI <chr>, CASTHMA_CrudePrev <dbl>,
#   CASTHMA_Crude95CI <chr>, CHD_CrudePrev <dbl>, CHD_Crude95CI <chr>,
#   CHECKUP_CrudePrev <dbl>, CHECKUP_Crude95CI <chr>, …

# assign one color to each of the 5 depression levels
pal <- colorFactor(palette = c( "#00C9A7", "#FF9671","#FF6F91","#845EC2","#FFC75F"),
                   levels=c("<20%","20-25%","25-30%",">30%","Others"))

leaflet(depression_level) |>
  setView(lng=-70,lat=43.5,zoom=9) |>
  addProviderTiles("Esri.WorldStreetMap") |>
  addCircles(
    stroke=FALSE,
    # circle size reflects lack of emotional support
    radius=~sqrt(EMOTIONSPT_CrudePrev*100000),
    fillColor=~pal(depression_level),
    fillOpacity=0.7,
    popup=popup1) |>
  addLegend(
    position = "bottomright",
    pal = pal,
    values = ~depression_level,
    title = "Depression Level VS Lack of Emotional Support in New England Region")

Assuming "long" and "lat" are longitude and latitude, respectively

Essay

For this assignment, I created two plots: a scatterplot and a geographic map.

From the scatter plot, there is a slight positive relationship between depression and lack of sleep prevalence. As depression increases, sleep issues also tend to increase, although the relationship is not very strong. The points are widely spread, suggesting variability across locations.
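The strength of this relationship could be checked numerically rather than by eye. The sketch below shows one way to do that with a correlation coefficient and a simple linear model. The `toy` data frame is made up for illustration and is not taken from the CDC data; with the real data, the columns would be `DEPRESSION_CrudePrev` and `SLEEP_CrudePrev` from `pop_group`.

```r
# Hypothetical values standing in for depression and lack-of-sleep prevalence (%)
toy <- data.frame(
  depression = c(18, 20, 22, 24, 26, 28),
  sleep      = c(30, 33, 31, 35, 34, 37)
)

# Pearson correlation: sign gives direction, magnitude gives strength
r <- cor(toy$depression, toy$sleep)

# Same straight-line fit that geom_smooth(method = "lm") draws
fit <- lm(sleep ~ depression, data = toy)
slope <- coef(fit)[["depression"]]

r      # between -1 and 1
slope  # change in sleep (%) per 1-point change in depression (%)
```

A correlation close to 0 would support describing the relationship as weak, while the slope shows how much lack of sleep changes for each additional percentage point of depression.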

Population size does not show a clear pattern in the scatter plot. Larger and smaller populations are mixed in the graph, indicating that population alone may not strongly influence the relationship between depression and sleep.

The map provides additional geographic insight. Areas with different depression levels are spread across the region rather than being concentrated in only urban or high-population areas. The size of the points represents the level of lack of emotional support, where larger circles indicate more lack of support. These larger circles are distributed across both higher and lower depression areas, suggesting that emotional support does not have a simple or direct relationship with depression levels.

One interesting observation is that some locations where few adults report lacking emotional support still show higher depression levels, while some areas where more adults lack support do not have the highest depression levels. This suggests that emotional support alone may not fully explain depression levels, and that other factors, such as access to healthcare, may also play a role.

A limitation of this visualization is that overlapping points on the map make it difficult to clearly distinguish all locations. Additionally, more variables (such as income or healthcare access) could have been included to provide deeper insight.
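One possible way to reduce the overlap is marker clustering, which leaflet supports through the `clusterOptions` argument. The sketch below was not used in this project and is only an assumption about how it might look; the `toy` coordinates and labels are hypothetical. It uses `addCircleMarkers()` instead of `addCircles()`, so the radius is in pixels rather than meters.

```r
library(leaflet)

# Hypothetical points that sit close together (made-up coordinates)
toy <- data.frame(
  long  = c(-71.06, -71.07, -71.05),
  lat   = c(42.36, 42.36, 42.35),
  label = c("place a", "place b", "place c")
)

m <- leaflet(toy) |>
  addProviderTiles("Esri.WorldStreetMap") |>
  addCircleMarkers(
    lng = ~long,
    lat = ~lat,
    radius = 6,
    popup = ~label,
    # nearby markers are grouped into one cluster until the user zooms in
    clusterOptions = markerClusterOptions()
  )

m
```

Clustering trades detail for readability: the individual circles only appear after zooming in, so the size encoding for emotional support would be less visible at the regional view.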

Citation

All code references are based on previous Data 110 handouts, homework, and notes.