Intro: The dataset I chose for this project covers depression in California from 2012 to 2018. Its variables include Year, Percent, and Strata (population categories such as sex, race, and income). One variable I focus on is Frequency, the number of depression cases counted within each category. Source: https://letsgethealthy.ca.gov/
In lines 12-17 I loaded the libraries needed for this project, set my working directory, and read in my CSV from my Data 110 folder.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Rows: 161 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Strata, Strata_Name
dbl (6): Year, Frequency, Weighted_Frequency, Percent, Lower_95_CL, Upper_95_CL
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
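The column specification message above can be avoided by declaring the column types when the file is read. This is a minimal sketch, assuming readr 2.x. The real file path is not shown in this report, so the example reads two rows typed inline (values copied from the head(AD) output) instead of the actual CSV; `csv_text` and `AD_demo` are names made up for the illustration.

```r
library(readr)

# Stand-in for the real CSV (assumption: the actual file path is not shown).
# I() tells read_csv to treat the string as literal data, not a file name.
csv_text <- I("Year,Strata,Strata_Name,Frequency,Percent
2012,Sex,Male,561,8.12
2012,Sex,Female,1359,15.2")

# Declaring the column types up front suppresses the
# "Column specification" message printed by read_csv.
AD_demo <- read_csv(
  csv_text,
  col_types = cols(
    Year = col_double(),
    Strata = col_character(),
    Strata_Name = col_character(),
    Frequency = col_double(),
    Percent = col_double()
  )
)

AD_demo
```

Passing `show_col_types = FALSE` instead would also quiet the message, but spelling out the types makes the expected structure of the file explicit.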
head(AD)
# A tibble: 6 × 8
Year Strata Strata_Name Frequency Weighted_Frequency Percent Lower_95_CL
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 2012 Total Total 1920 NA 11.7 11.1
2 2012 Sex Male 561 1116664 8.12 7.32
3 2012 Sex Female 1359 2163108 15.2 14.3
4 2012 Race-Ethni… White 1314 1806371 14.6 13.7
5 2012 Race-Ethni… Black 97 222022 13.5 10.4
6 2012 Race-Ethni… Hispanic 412 923174 9.98 8.91
# ℹ 1 more variable: Upper_95_CL <dbl>
Call:
lm(formula = Frequency ~ Weighted_Frequency + Percent + Lower_95_CL +
Upper_95_CL, data = AD)
Residuals:
Min 1Q Median 3Q Max
-344.75 -57.20 -8.89 41.89 519.14
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.476e+01 3.084e+01 2.100 0.0374 *
Weighted_Frequency 3.251e-04 2.050e-05 15.853 <2e-16 ***
Percent 1.925e+03 2.561e+03 0.752 0.4534
Lower_95_CL -9.267e+02 1.281e+03 -0.724 0.4705
Upper_95_CL -9.853e+02 1.280e+03 -0.770 0.4427
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 112.4 on 149 degrees of freedom
(7 observations deleted due to missingness)
Multiple R-squared: 0.8455, Adjusted R-squared: 0.8413
F-statistic: 203.8 on 4 and 149 DF, p-value: < 2.2e-16
In lines 29-34 I cleaned up my data: I used the select function to keep only the variables I wanted, then used filter with str_detect to keep only the rows whose Strata value contains "Income".
AD2 <- AD %>%
  select(Year, Strata, Strata_Name, Frequency) %>%
  filter(str_detect(Strata, "Income"))
summary(AD2)
Year Strata Strata_Name Frequency
Min. :2012 Length:42 Length:42 Min. :106.0
1st Qu.:2013 Class :character Class :character 1st Qu.:155.5
Median :2015 Mode :character Mode :character Median :194.5
Mean :2015 Mean :244.5
3rd Qu.:2017 3rd Qu.:281.5
Max. :2018 Max. :642.0
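To show what the select and filter(str_detect()) steps do, here is a small sketch on made-up data. The `toy` tibble, its Strata labels, and its numbers are hypothetical and are not taken from the real dataset; only the column names match.

```r
library(dplyr)
library(stringr)

# Hypothetical toy rows (illustrative values, not the real data).
toy <- tibble(
  Year = c(2012, 2012, 2012, 2013),
  Strata = c("Total", "Sex", "Income", "Income"),
  Strata_Name = c("Total", "Male", "Group A", "Group B"),
  Frequency = c(1000, 400, 300, 250),
  Percent = c(10.0, 8.0, 12.0, 9.5)
)

toy_income <- toy %>%
  # select() keeps only the named columns, so Percent is dropped.
  select(Year, Strata, Strata_Name, Frequency) %>%
  # str_detect() returns TRUE where Strata contains "Income";
  # filter() keeps only those rows.
  filter(str_detect(Strata, "Income"))

toy_income
```

Because str_detect matches a pattern anywhere in the string, it also works when the real Strata label is longer than just "Income".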
Lines 40-50 contain the code I used to make my graph. I used geom_area to compare the frequency of depression across income ranges in California from 2012 to 2018. I added a white dashed horizontal line with geom_hline to represent the mean Frequency for this dataset, setting yintercept to 244.5 because that was the mean Frequency reported by the summary function. Finally, I chose theme_dark() so that the dark background would contrast better with the light colors.
ggplot(AD2, aes(x = Year, y = Frequency, fill = Strata_Name)) +
  geom_area(color = "black", linewidth = 0.7) +
  geom_hline(yintercept = 244.5, linetype = "dashed", color = "white", linewidth = 1) +
  scale_fill_brewer(palette = "Set3") +
  labs(title = "Depression by Median Income in California",
       x = "Year",
       y = "Frequency",
       caption = "Source: Lets Get Healthy California") +
  theme_dark()
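Instead of typing 244.5 by hand, the mean can be computed from the data and passed to geom_hline, so the line stays correct if the data changes. This is a minimal sketch, assuming ggplot2 3.4 or later (for the `linewidth` argument). The `toy` data frame and its group names are hypothetical stand-ins for AD2.

```r
library(ggplot2)

# Hypothetical stand-in for AD2 with two made-up income groups.
toy <- data.frame(
  Year = rep(2012:2014, times = 2),
  Strata_Name = rep(c("Group A", "Group B"), each = 3),
  Frequency = c(100, 120, 110, 200, 180, 190)
)

# Compute the mean once and reuse it, rather than hard-coding the number.
mean_freq <- mean(toy$Frequency, na.rm = TRUE)

p <- ggplot(toy, aes(x = Year, y = Frequency, fill = Strata_Name)) +
  geom_area(color = "black", linewidth = 0.7) +
  # Dashed white reference line at the computed mean.
  geom_hline(yintercept = mean_freq, linetype = "dashed",
             color = "white", linewidth = 1) +
  theme_dark()

p
```

With the real data, replacing `toy` with AD2 should give the same line as the hard-coded value, as long as the summary mean was computed on the same rows.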
To clean this dataset, I edited column names that contained gaps, such as strata name, so that R could read them, and I used the select and filter functions to remove the variables and rows I did not need. Other than that, there wasn't much else for me to clean. Interesting takeaways are the sharp dips in 2014 and again in 2016. The data also shows that the lowest (<20k) and highest (100k+) income groups have the highest frequencies of depression. My guess is that those earning under 20k are depressed because they are struggling to get by, while those earning 100k or more are depressed because of demanding jobs. The group with the lowest frequency, and the only one below the mean frequency line, is the 75k-99k demographic. I suspect this is because they have an above-average income without an intense job. I wish I could have added a plotly extension that showed the Race-Ethnic category, but it kept crashing.
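The comparison of each income group against the overall mean can also be checked numerically instead of only by eye. This is a rough sketch using group_by and summarise; the `toy` tibble, its group names, and its values are hypothetical, and with the real data AD2 would take its place.

```r
library(dplyr)

# Hypothetical toy data: three made-up groups, two years each.
toy <- tibble(
  Strata_Name = rep(c("Group A", "Group B", "Group C"), each = 2),
  Frequency = c(300, 320, 150, 170, 260, 240)
)

# Overall mean across every row, the same quantity summary() reports.
overall_mean <- mean(toy$Frequency)

group_means <- toy %>%
  group_by(Strata_Name) %>%
  # One row per group with that group's mean Frequency.
  summarise(mean_frequency = mean(Frequency), .groups = "drop") %>%
  # Flag groups whose mean sits above the overall mean.
  mutate(above_overall_mean = mean_frequency > overall_mean) %>%
  arrange(desc(mean_frequency))

group_means
```

A table like this makes it easy to state which groups fall above or below the mean without reading values off the plot.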