Californian Depression

Author

Asher Scott

Intro: The dataset I chose for this project covers depression rates in California from 2012 to 2018. Its variables include year, percent, and strata (population categories such as race and income). One variable of particular interest is frequency, or how often depression was reported within each of these categories. Source: https://letsgethealthy.ca.gov/

In lines 12-17 I loaded the tidyverse library, which covers everything needed for this project. I then set my working directory and read in my CSV from my Data 110 folder.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
setwd("/Users/asherscott/Desktop/Data 110")
AD <- read_csv("adultD.csv")
Rows: 161 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Strata, Strata_Name
dbl (6): Year, Frequency, Weighted_Frequency, Percent, Lower_95_CL, Upper_95_CL

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(AD)
# A tibble: 6 × 8
   Year Strata      Strata_Name Frequency Weighted_Frequency Percent Lower_95_CL
  <dbl> <chr>       <chr>           <dbl>              <dbl>   <dbl>       <dbl>
1  2012 Total       Total            1920                 NA   11.7        11.1 
2  2012 Sex         Male              561            1116664    8.12        7.32
3  2012 Sex         Female           1359            2163108   15.2        14.3 
4  2012 Race-Ethni… White            1314            1806371   14.6        13.7 
5  2012 Race-Ethni… Black              97             222022   13.5        10.4 
6  2012 Race-Ethni… Hispanic          412             923174    9.98        8.91
# ℹ 1 more variable: Upper_95_CL <dbl>
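
The absolute path in setwd only works on my own computer. A minimal sketch of an alternative, assuming the CSV sits in the project folder, is to pass a relative path to read_csv and declare the column types up front, which also quiets the column specification message. To keep the example self-contained, it reads a small inline excerpt (a few columns and two rows copied from the head() output above) through I() instead of the real file; the name AD_demo is only for illustration.

```r
library(readr)

# Inline excerpt standing in for adultD.csv
# (subset of columns; values copied from the head(AD) output)
csv_text <- I("Year,Strata,Strata_Name,Frequency,Percent
2012,Sex,Male,561,8.12
2012,Sex,Female,1359,15.2")

# With the real file this would be read_csv("adultD.csv", col_types = ...)
AD_demo <- read_csv(
  csv_text,
  col_types = cols(
    Year = col_double(),
    Strata = col_character(),
    Strata_Name = col_character(),
    Frequency = col_double(),
    Percent = col_double()
  )
)

AD_demo
```

Because the types are stated explicitly, a column that unexpectedly contains text would show up as a parsing problem rather than silently changing type.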
model1 <- lm(Frequency ~ Weighted_Frequency + Percent + Lower_95_CL + Upper_95_CL, data = AD)
summary(model1)

Call:
lm(formula = Frequency ~ Weighted_Frequency + Percent + Lower_95_CL + 
    Upper_95_CL, data = AD)

Residuals:
    Min      1Q  Median      3Q     Max 
-344.75  -57.20   -8.89   41.89  519.14 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         6.476e+01  3.084e+01   2.100   0.0374 *  
Weighted_Frequency  3.251e-04  2.050e-05  15.853   <2e-16 ***
Percent             1.925e+03  2.561e+03   0.752   0.4534    
Lower_95_CL        -9.267e+02  1.281e+03  -0.724   0.4705    
Upper_95_CL        -9.853e+02  1.280e+03  -0.770   0.4427    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 112.4 on 149 degrees of freedom
  (7 observations deleted due to missingness)
Multiple R-squared:  0.8455,    Adjusted R-squared:  0.8413 
F-statistic: 203.8 on 4 and 149 DF,  p-value: < 2.2e-16

In lines 29-34 I cleaned up my data by using select() to keep the variables I wanted and then filter() with str_detect() to keep only the rows whose Strata value relates to income.

AD2 <- AD %>%
  select(Year, Strata, Strata_Name, Frequency) %>%
  filter(str_detect(Strata, "Income"))
summary(AD2)
      Year         Strata          Strata_Name          Frequency    
 Min.   :2012   Length:42          Length:42          Min.   :106.0  
 1st Qu.:2013   Class :character   Class :character   1st Qu.:155.5  
 Median :2015   Mode  :character   Mode  :character   Median :194.5  
 Mean   :2015                                         Mean   :244.5  
 3rd Qu.:2017                                         3rd Qu.:281.5  
 Max.   :2018                                         Max.   :642.0  

Lines 40-50 contain the code I used to make my graph. I used geom_area to compare income ranges and frequency of depression from 2012 to 2018 in California. With geom_hline I added a white dashed line to represent the mean Frequency for this dataset. I chose 244.5 as the yintercept because that was the mean Frequency reported by the summary function. Finally, I chose theme_dark so that the dark background would contrast better with the light colors.

ggplot(AD2, aes(x = Year, y = Frequency, fill = Strata_Name)) +
  geom_area(color = "black", linewidth = 0.7) +
  # Dashed white line at the mean Frequency taken from summary(AD2)
  geom_hline(yintercept = 244.5, linetype = "dashed", color = "white", linewidth = 1) +
  scale_fill_brewer(palette = "Set3") +
  labs(title = "Depression by Median Income in California",
       x = "Year",
       y = "Frequency",
       caption = "Source: Let's Get Healthy California") +
  theme_dark()
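
Rather than typing 244.5 by hand, the mean can be computed from the data and passed to geom_hline, so the line stays correct if the filter ever changes. The sketch below uses a small made-up data frame (toy, with hypothetical groups and values) in place of AD2. It also draws lines instead of a stacked area: geom_area stacks the groups by default, so the height of each band is cumulative, whereas unstacked lines can each be read directly against the mean.

```r
library(ggplot2)

# Hypothetical toy data standing in for AD2 (values are made up for illustration)
toy <- data.frame(
  Year = rep(2012:2014, times = 2),
  Strata_Name = rep(c("Group A", "Group B"), each = 3),
  Frequency = c(100, 120, 110, 200, 220, 210)
)

# Compute the mean from the data instead of hard-coding it
mean_freq <- mean(toy$Frequency, na.rm = TRUE)

# One unstacked line per group, with the overall mean as a dashed reference line
p <- ggplot(toy, aes(x = Year, y = Frequency, color = Strata_Name)) +
  geom_line(linewidth = 0.7) +
  geom_hline(yintercept = mean_freq, linetype = "dashed",
             color = "white", linewidth = 1) +
  labs(x = "Year", y = "Frequency") +
  theme_dark()

p
```

With the real data, toy would be replaced by AD2, and mean_freq should come out close to the 244.5 reported by summary.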

One of the ways I cleaned up this dataset was by editing names that contained gaps, like strata name, so that R could read them, and by using the select and filter functions to remove unnecessary variables and rows. Other than that, there wasn't much else for me to clean. Interesting takeaways are the sharp dips in 2014 and again in 2016. The data also shows that both the lowest (<20k) and highest (100k+) income groups have the highest frequency of depression. My guess is that those earning under 20k are depressed because they are struggling to get by, while those earning over 100k are depressed because of demanding jobs. The group with the lowest frequency, and the only one below the mean frequency line, is the 75k-99k demographic. I suspect this is because they have an above-average income without an intense job. I wish I could have added a plotly extension that showed Race-Ethnic Category, but it kept crashing.
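
The comparison of each income group against the overall mean can also be checked numerically instead of being read off the graph. This is a minimal sketch using dplyr on a made-up tibble (toy, with hypothetical group names and values), not my actual results; the column names mean_frequency and below_overall are chosen just for this example.

```r
library(dplyr)

# Hypothetical toy data standing in for AD2 (values are made up for illustration)
toy <- tibble(
  Year = rep(2012:2013, times = 3),
  Strata_Name = rep(c("Group A", "Group B", "Group C"), each = 2),
  Frequency = c(300, 320, 150, 170, 250, 270)
)

# Mean across every row, the same quantity the dashed line represents
overall_mean <- mean(toy$Frequency)

# Mean per group, flagged by whether it falls below the overall mean
group_means <- toy %>%
  group_by(Strata_Name) %>%
  summarise(mean_frequency = mean(Frequency), .groups = "drop") %>%
  mutate(below_overall = mean_frequency < overall_mean) %>%
  arrange(desc(mean_frequency))

group_means
```

Running the same pipeline on AD2 would list the income groups from highest to lowest mean Frequency and mark which ones sit below the overall mean.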