# load relevant libraries and read-in the data
suppressMessages(library(tidyverse))
tx_big_df <- read_csv("PLACES__Local_Data_for_Better_Health__Place_Data_2023_release_20240429.csv", show_col_types = FALSE)DATA 110 FINAL
Sunshine, Sleep, and Leisure Time: Their Correlation with Mental Health Metrics
Does adding more beach to one’s life correlate with better mental health?
Introduction
Does adding more beach to one’s life correlate with better mental health? When trying to answer this question, is it better to limit the data set to a state like Texas - a large state with a range of annual sunshine duration across 5 cities - or would the breadth of a nationwide sample lead to more comprehensive answers?
Beginning in 2020, the CDC began including estimates of lifetime diagnoses of depression in their PLACES data in addition to measures of current (low mental health in the past 14 days) depression(1). This project explores that data set(2). PLACES provides “model-based, population-level analysis and community estimates of health measures to all counties, places (incorporated and census designated places), census tracts, and ZIP Code Tabulation Areas (ZCTAs) across the United States.(3)” The primary data sources are the CDC’s Behavioral Risk Factor Surveillance System (BRFSS), Census 2010 population counts, annual Census population estimates and the 5-year American Community Survey (ACS)(6).
The measures explored are: MHLTH: Mental health not good for >=14 days DEPRESSION: Depression among adults aged >=18 years LPA: No leisure-time physical activity among adults aged >=18 years SLEEP: Sleeping less than 7 hours among adults aged >=18 years SunHours: The duration of sunshine annually for a specific location
N.B.: Any results from this project can only speak to correlation. As the causes of depression are still being researched, as is a universal cure-all, it is impossible to say that a specific element is what caused a bout of depression.
Why not keep the data set to just Texas? Apart from a school trip decades ago, I have no first hand experience with Texas. Also, while Texas does have a range of sunshine data, the United States as a whole has an even broader range.
There are prescription produce programs in the US that provide free fruit and vegetables to patients(4). A strong, documented relationship between sun and mood is a step towards prescription beach programs.
TX - Data Preparation
After reading in the gargantuan csv, I filtered to keep the observations regarding mental health, sleep, lack of physical activity and depression diagnoses. I also limited the data set to Texas, as annual sunshine hours duration range from 2,577.90 hours in Houston to 3,762.50 hours in El Paso(5).
measure_ids_keep <- c("MHLTH", "SLEEP", "LPA", "DEPRESSION")
tx_cities <- c("Houston", "San Antonio", "Austin", "Dallas", "El Paso")
tx_filtered_df <- tx_big_df %>%
filter(MeasureId %in% measure_ids_keep) %>%
filter(StateAbbr == "TX") %>%
filter(LocationName %in% tx_cities)The sunshine duration data was scraped from Wikipedia. I now need to add the sunshine data to the current data frame.
# load the sunshine data into r
tx_sunshine_df <- read_csv("tx_sunshine_rates.csv", show_col_types = FALSE)
# merging the data
tx_final_df <- left_join(tx_filtered_df, tx_sunshine_df, by = "LocationName")Finally, I format the location coordinates column, list all the variables and keep only the columns of interest.
tx_tidy__df <- tidyr::extract(tx_final_df, Geolocation, into = c('Lat', 'Long'),
regex = "POINT \\((-?\\d+\\.\\d+)\\s+(-?\\d+\\.\\d+)\\)")
head(tx_tidy__df)# A tibble: 6 × 23
Year StateAbbr StateDesc LocationName DataSource Category Measure
<dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 2021 TX Texas Austin BRFSS Health Risk Behavio… No lei…
2 2021 TX Texas Austin BRFSS Health Outcomes Depres…
3 2021 TX Texas Austin BRFSS Health Risk Behavio… No lei…
4 2021 TX Texas Austin BRFSS Health Status Mental…
5 2020 TX Texas Austin BRFSS Health Risk Behavio… Sleepi…
6 2021 TX Texas Austin BRFSS Health Status Mental…
# ℹ 16 more variables: Data_Value_Unit <chr>, Data_Value_Type <chr>,
# Data_Value <dbl>, Data_Value_Footnote_Symbol <chr>,
# Data_Value_Footnote <chr>, Low_Confidence_Limit <dbl>,
# High_Confidence_Limit <dbl>, TotalPopulation <dbl>, Lat <chr>, Long <chr>,
# LocationID <chr>, CategoryID <chr>, MeasureId <chr>, DataValueTypeID <chr>,
# Short_Question_Text <chr>, SunHours <dbl>
# checking column names
names(tx_tidy__df) [1] "Year" "StateAbbr"
[3] "StateDesc" "LocationName"
[5] "DataSource" "Category"
[7] "Measure" "Data_Value_Unit"
[9] "Data_Value_Type" "Data_Value"
[11] "Data_Value_Footnote_Symbol" "Data_Value_Footnote"
[13] "Low_Confidence_Limit" "High_Confidence_Limit"
[15] "TotalPopulation" "Lat"
[17] "Long" "LocationID"
[19] "CategoryID" "MeasureId"
[21] "DataValueTypeID" "Short_Question_Text"
[23] "SunHours"
# specifying columns by number
tx_tidy <- tx_tidy__df %>%
select(1, 2, 4, 6:10, 15:17, 19, 20, 22, 23)
head(tx_tidy)# A tibble: 6 × 15
Year StateAbbr LocationName Category Measure Data_Value_Unit Data_Value_Type
<dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 2021 TX Austin Health R… No lei… % Age-adjusted p…
2 2021 TX Austin Health O… Depres… % Crude prevalen…
3 2021 TX Austin Health R… No lei… % Crude prevalen…
4 2021 TX Austin Health S… Mental… % Crude prevalen…
5 2020 TX Austin Health R… Sleepi… % Age-adjusted p…
6 2021 TX Austin Health S… Mental… % Age-adjusted p…
# ℹ 8 more variables: Data_Value <dbl>, TotalPopulation <dbl>, Lat <chr>,
# Long <chr>, CategoryID <chr>, MeasureId <chr>, Short_Question_Text <chr>,
# SunHours <dbl>
Finally, we are limiting the data to age-adjusted prevalence.
tx_tidy <- tx_tidy %>%
filter(Data_Value_Type == "Age-adjusted prevalence") TX - Data Analysis
Graph 1
# loading relevant library for themes
library(ggthemes)Warning: package 'ggthemes' was built under R version 4.3.3
tx_1 <- ggplot(data = tx_tidy, aes(x = SunHours, y = Data_Value, color = MeasureId)) +
geom_point(aes(shape = MeasureId), size = 3) + facet_wrap(~ LocationName) + # separate plot for each city
labs(title = "Relationship Between Sunshine Hours and Health Metrics by City",
x = "Annual Sunshine Hours",
y = "Health Metric %",
caption = "Source: CDC PLACES") +
theme_solarized()
tx_1 + scale_color_brewer(palette = "Dark2")While the information displayed is intriguing, it does not lend itself to statistical analysis. The extra hours of sunlight in El Paso do not seem to affect the number of the population getting less than 7 hours of a sleep a night - Houston, with the lowest annual hours of sunshine, has a similar sleep-deprived population segment. Though El Paso has 46% more sunlight than Houston, it still has more people with a lifetime diagnosis of depression than Houston. Looking at the faceted plot, Austin (2,643.7 hours sun) seems easiest on the mind - lowest number of sleep- and leisure-deprived people and also the lowest number of people reporting a low mood in the past 2 weeks.
tx_2 <- ggplot(data = tx_tidy, aes(x = SunHours, y = Data_Value, color = MeasureId)) +
geom_point() +
facet_wrap(~ MeasureId) +
labs(title = "Relationship Between Sunshine Hours and Health Metrics by Measure",
x = "Annual Sunshine Hours",
y = "% Population Experiencing Measure",
caption = "Source: CDC PLACES",
color = "Measure") +
theme_solarized()
tx_2 + scale_color_brewer(palette = "Dark2")Graph, Analytical
tx_3 <- ggplot(data = tx_tidy, aes(x = SunHours, y = Data_Value, color = MeasureId)) +
geom_point() +
geom_smooth(method = "lm", formula = y ~ x, se = FALSE, aes(color = MeasureId, linetype = MeasureId)) +
facet_wrap(~ MeasureId) +
labs(title = "Sunshine Hours vs. Health Metrics by Measure",
x = "Annual Sunshine Hours",
y = "% Population Experiencing Measure",
caption = "Source: CDC PLACES") +
theme_solarized()
tx_3 + scale_color_brewer(palette = "Dark2")Interactivity would show how Austin sticks out compared to the other 4 TX cities in this data set. Eyeballing the regression lines above, it does not seem that annual sunshine duration is a good predictor of any the mental health related measures in this data set, especially once you remove Austin. Are any of the measures a good predictor for the others?
# specifying columns to keep for wide data frame
wide_tx_tidy <- tx_tidy %>%
select(3, Data_Value, Data_Value_Unit, TotalPopulation, MeasureId, SunHours)
head(wide_tx_tidy)# A tibble: 6 × 6
LocationName Data_Value Data_Value_Unit TotalPopulation MeasureId SunHours
<chr> <dbl> <chr> <dbl> <chr> <dbl>
1 Austin 19.4 % 806804 LPA 2644.
2 Austin 29.7 % 806804 SLEEP 2644.
3 Austin 15.2 % 806804 MHLTH 2644.
4 Austin 21.1 % 806804 DEPRESSION 2644.
5 Dallas 21.6 % 1197688 DEPRESSION 2850.
6 Dallas 30.4 % 1197688 LPA 2850.
# pivoting to wide format for more analysis
analysis_data_wide <- wide_tx_tidy %>%
pivot_wider(names_from = MeasureId, values_from = Data_Value)TX - Statistical Analysis
# linear regression for Depression as a function of LPA and Sleep
lm_depression <- lm(DEPRESSION ~ LPA + SLEEP, data = analysis_data_wide)
summary(lm_depression)
Call:
lm(formula = DEPRESSION ~ LPA + SLEEP, data = analysis_data_wide)
Residuals:
1 2 3 4 5
-0.03713 -0.30566 1.52031 -1.24429 0.06677
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.8797 10.7984 0.174 0.878
LPA -0.3539 0.2499 -1.416 0.292
SLEEP 0.8796 0.4565 1.927 0.194
Residual standard error: 1.407 on 2 degrees of freedom
Multiple R-squared: 0.6524, Adjusted R-squared: 0.3047
F-statistic: 1.877 on 2 and 2 DF, p-value: 0.3476
# linear regression for Mental Health as a function of LPA and Sleep
lm_mhlth <- lm(MHLTH ~ LPA + SLEEP, data = analysis_data_wide)
summary(lm_mhlth)
Call:
lm(formula = MHLTH ~ LPA + SLEEP, data = analysis_data_wide)
Residuals:
1 2 3 4 5
-0.007833 -0.079416 0.136029 -0.068949 0.020169
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.26263 0.94053 6.659 0.0218 *
LPA 0.04979 0.02177 2.287 0.1494
SLEEP 0.26866 0.03976 6.758 0.0212 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1225 on 2 degrees of freedom
Multiple R-squared: 0.9905, Adjusted R-squared: 0.9809
F-statistic: 103.8 on 2 and 2 DF, p-value: 0.00954
The results are startling.
Depression: My model is neither a good fit (only 30% of the variability is explained by the model) nor statistically significant (the p-value - 0.3476 - is higher than the conventional significant value, 0.05.)
Low Mental Health vs a lack of physical activity and lack of sleep: The model is an amazing fit. The R-squared of 0.9809 means the model explains nearly all of the variability. The p-value is very small at 0.00954.
The following equation is in regard to populations and is in percentage points.
Statistical Equation, TX:
Mental Health = 6.262(intercept) + 0.049 (LPA) + 0.269(sleep)
According to the equation, if 0% of a population reported sleeping less than 7 hours and 0% reported a lack of leisure-time physical activity, the expected low mental health rate would be 6.262%. The equation can only be used for ranges in the data set, however.
A lack of sleep has a much stronger correlation to low mental health than a lack of leisure-time physical activity.
lm_mh_sleep <- lm(MHLTH ~ SLEEP, data = analysis_data_wide)
summary(lm_mh_sleep)
Call:
lm(formula = MHLTH ~ SLEEP, data = analysis_data_wide)
Residuals:
1 2 3 4 5
-0.11426 -0.02261 0.25270 0.05270 -0.16853
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.18068 1.26225 4.104 0.02618 *
SLEEP 0.34120 0.03724 9.163 0.00275 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1903 on 3 degrees of freedom
Multiple R-squared: 0.9655, Adjusted R-squared: 0.954
F-statistic: 83.96 on 1 and 3 DF, p-value: 0.002748
When I focus on the relationship between mental health and sleep however, the R-squared value is slightly smaller meaning the model explains only 95.4% of the variability.
lm_sun_mh <- lm(MHLTH ~ SunHours, data = analysis_data_wide)
summary(lm_sun_mh)
Call:
lm(formula = MHLTH ~ SunHours, data = analysis_data_wide)
Residuals:
1 2 3 4 5
-1.42563 0.39627 -0.04984 0.19932 0.87987
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.562e+01 2.946e+00 5.304 0.0131 *
SunHours 3.792e-04 1.007e-03 0.377 0.7315
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.001 on 3 degrees of freedom
Multiple R-squared: 0.04516, Adjusted R-squared: -0.2731
F-statistic: 0.1419 on 1 and 3 DF, p-value: 0.7315
p-value: 0.7315 (very non-significant) R-squared: -0.2731
I have never received a negative R-squared value before. Wow. The sun is having no discernible effect on this data. Will the trend continue with a larger data set?
US - Data Preparation
We will repeat the data preparation process, this time for the US as a whole.
# taking the original large csv and focusing on the top 10 and bottom 10 sunny states
high_sun_states <- c("AZ", "NV", "TX", "CA", "NM", "FL", "CO", "OK", "HI", "UT")
low_sun_states <- c("TN", "IL", "WI", "IN", "MI", "OR", "OH", "WA", "AK", "PA")
filtered_big_df <- tx_big_df %>%
filter(MeasureId %in% measure_ids_keep) %>%
filter(StateAbbr %in% c(high_sun_states, low_sun_states)) # creating the list of locales to focus on by LocationID
locales <-c("0254920", "0485540", "0644000", "0820000", "1245000", "1714000", "2622000", "3240000", "3502000", "3916000", "4055000", "4159000", "4261000", "4824000", "4752006", "4967000", "5363000", "5553000", "1836003", "1571550")filtered_locales <- filtered_big_df %>%
filter(LocationID %in% locales)I now need to add the sunshine data to the current data frame.
# load the sunshine data into r
us_sunshine_df <- read_csv("C:/Users/bombshellnoir/Dropbox (Personal)/00000 Montgomery College/DATA 110/edited-sunshine_rates.csv", show_col_types = FALSE)
# merging the data
us_final_df <- left_join(filtered_locales, us_sunshine_df, by = "LocationID")us_tidy_df <- tidyr::extract(us_final_df, Geolocation, into = c('Lat', 'Long'),
regex = "POINT \\((-?\\d+\\.\\d+)\\s+(-?\\d+\\.\\d+)\\)")
head(us_tidy_df)# A tibble: 6 × 23
Year StateAbbr StateDesc LocationName DataSource Category Measure
<dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 2020 AK Alaska Nome BRFSS Health Risk Behavio… Sleepi…
2 2021 AK Alaska Nome BRFSS Health Status Mental…
3 2020 AK Alaska Nome BRFSS Health Risk Behavio… Sleepi…
4 2021 AK Alaska Nome BRFSS Health Risk Behavio… No lei…
5 2021 AK Alaska Nome BRFSS Health Status Mental…
6 2021 AK Alaska Nome BRFSS Health Outcomes Depres…
# ℹ 16 more variables: Data_Value_Unit <chr>, Data_Value_Type <chr>,
# Data_Value <dbl>, Data_Value_Footnote_Symbol <chr>,
# Data_Value_Footnote <chr>, Low_Confidence_Limit <dbl>,
# High_Confidence_Limit <dbl>, TotalPopulation <dbl>, Lat <chr>, Long <chr>,
# LocationID <chr>, CategoryID <chr>, MeasureId <chr>, DataValueTypeID <chr>,
# Short_Question_Text <chr>, SunHours <dbl>
names(us_tidy_df) [1] "Year" "StateAbbr"
[3] "StateDesc" "LocationName"
[5] "DataSource" "Category"
[7] "Measure" "Data_Value_Unit"
[9] "Data_Value_Type" "Data_Value"
[11] "Data_Value_Footnote_Symbol" "Data_Value_Footnote"
[13] "Low_Confidence_Limit" "High_Confidence_Limit"
[15] "TotalPopulation" "Lat"
[17] "Long" "LocationID"
[19] "CategoryID" "MeasureId"
[21] "DataValueTypeID" "Short_Question_Text"
[23] "SunHours"
# selecting only columns of interest
us_tidy <- us_tidy_df %>%
select(1, 2, 4, 6:10, 15:17, 19, 20, 22, 23)
head(us_tidy)# A tibble: 6 × 15
Year StateAbbr LocationName Category Measure Data_Value_Unit Data_Value_Type
<dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 2020 AK Nome Health R… Sleepi… % Age-adjusted p…
2 2021 AK Nome Health S… Mental… % Crude prevalen…
3 2020 AK Nome Health R… Sleepi… % Crude prevalen…
4 2021 AK Nome Health R… No lei… % Crude prevalen…
5 2021 AK Nome Health S… Mental… % Age-adjusted p…
6 2021 AK Nome Health O… Depres… % Crude prevalen…
# ℹ 8 more variables: Data_Value <dbl>, TotalPopulation <dbl>, Lat <chr>,
# Long <chr>, CategoryID <chr>, MeasureId <chr>, Short_Question_Text <chr>,
# SunHours <dbl>
# confirming the remaining column names
names(us_tidy) [1] "Year" "StateAbbr" "LocationName"
[4] "Category" "Measure" "Data_Value_Unit"
[7] "Data_Value_Type" "Data_Value" "TotalPopulation"
[10] "Lat" "Long" "CategoryID"
[13] "MeasureId" "Short_Question_Text" "SunHours"
# limiting to age-related prevalence
us_tidy <- us_tidy %>%
filter(Data_Value_Type == "Age-adjusted prevalence") US - Data Analysis
Graph, Regression Lines
#
us_1 <- ggplot(data = us_tidy, aes(x = SunHours, y = Data_Value, color = MeasureId)) +
geom_point() +
geom_smooth(method = "lm", formula = y ~ x, se = FALSE, aes(color = MeasureId, linetype = MeasureId)) +
facet_wrap(~ MeasureId) +
labs(title = "Relationship Between Sunshine Hours and Health Metrics by Measure",
x = "Annual Sunshine Hours",
y = "% Population Experiencing Measure",
caption = "Source: CDC PLACES") +
theme_minimal()
us_1 + scale_color_brewer(palette = "Dark2")Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 5 rows containing missing values or values outside the scale range
(`geom_point()`).
# specifying columns to keep for wide data frame, US data frame
wide_us_tidy <- us_tidy %>%
select(2, 3, Data_Value, Data_Value_Unit, TotalPopulation, MeasureId, SunHours)
head(wide_us_tidy)# A tibble: 6 × 7
StateAbbr LocationName Data_Value Data_Value_Unit TotalPopulation MeasureId
<chr> <chr> <dbl> <chr> <dbl> <chr>
1 AK Nome 31.8 % 3598 SLEEP
2 AK Nome 15.5 % 3598 MHLTH
3 AK Nome 21.6 % 3598 LPA
4 AK Nome 18.2 % 3598 DEPRESSION
5 AZ Yuma 19.3 % 93189 DEPRESSION
6 AZ Yuma 36.3 % 93189 SLEEP
# ℹ 1 more variable: SunHours <dbl>
# pivoting to wide format for more analysis, US data
us_wide <- wide_us_tidy %>%
pivot_wider(names_from = MeasureId, values_from = Data_Value)# linear regression for Depression as a function of LPA and Sleep
lm_us_depression <- lm(DEPRESSION ~ LPA + SLEEP, data = us_wide)
summary(lm_us_depression)
Call:
lm(formula = DEPRESSION ~ LPA + SLEEP, data = us_wide)
Residuals:
Min 1Q Median 3Q Max
-6.935 -2.142 1.146 2.609 4.346
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 33.3057 5.4273 6.137 1.43e-05 ***
LPA 0.4562 0.2229 2.047 0.0574 .
SLEEP -0.6736 0.2487 -2.709 0.0155 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.472 on 16 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-squared: 0.3152, Adjusted R-squared: 0.2296
F-statistic: 3.682 on 2 and 16 DF, p-value: 0.04838
# linear regression for Mental Health as a function of LPA and Sleep
lm_us_mhlth <- lm(MHLTH ~ LPA + SLEEP, data = us_wide)
summary(lm_us_mhlth)
Call:
lm(formula = MHLTH ~ LPA + SLEEP, data = us_wide)
Residuals:
Min 1Q Median 3Q Max
-3.6616 -0.7326 -0.0201 1.1847 1.7859
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.49299 2.38485 6.077 1.6e-05 ***
LPA 0.28783 0.09793 2.939 0.00962 **
SLEEP -0.14027 0.10927 -1.284 0.21752
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.526 on 16 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-squared: 0.4196, Adjusted R-squared: 0.347
F-statistic: 5.783 on 2 and 16 DF, p-value: 0.01288
The analysis suggests that these are not good models. Perhaps a relationship is clearer when comparing just two variables.
lm_us_sleep_depression <- lm(DEPRESSION ~ SLEEP, data = us_wide)
summary(lm_us_sleep_depression)
Call:
lm(formula = DEPRESSION ~ SLEEP, data = us_wide)
Residuals:
Min 1Q Median 3Q Max
-7.9828 -1.9179 0.8999 2.5519 4.8146
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 30.8664 5.7705 5.349 5.31e-05 ***
SLEEP -0.2704 0.1654 -1.634 0.121
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.784 on 17 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-squared: 0.1358, Adjusted R-squared: 0.08497
F-statistic: 2.671 on 1 and 17 DF, p-value: 0.1205
The equation: Depression rates = 30.8664 - 0.2704(Sleep)
The low R-squared and high p-value make it clear this is not a meaningful model. Can we use sunshine hours to predict the mental health value?
lm_us_sun_mh <- lm(MHLTH ~ SunHours, data = us_wide)
summary(lm_us_sun_mh)
Call:
lm(formula = MHLTH ~ SunHours, data = us_wide)
Residuals:
Min 1Q Median 3Q Max
-4.1522 -1.2898 0.0634 1.1204 3.6369
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.691e+01 2.234e+00 7.568 1.13e-06 ***
SunHours -1.819e-05 7.667e-04 -0.024 0.981
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.997 on 16 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 3.519e-05, Adjusted R-squared: -0.06246
F-statistic: 0.0005631 on 1 and 16 DF, p-value: 0.9814
There is a negative R-squared value! And such a high p-value. There is no correlation shown by the data between mental health and annual sunshine duration.
tab_us_tidy <- us_tidy_df %>%
select(1:4, 6:10, 15:17, 19, 20, 22, 23)
head(tab_us_tidy)# A tibble: 6 × 16
Year StateAbbr StateDesc LocationName Category Measure Data_Value_Unit
<dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 2020 AK Alaska Nome Health Risk Be… Sleepi… %
2 2021 AK Alaska Nome Health Status Mental… %
3 2020 AK Alaska Nome Health Risk Be… Sleepi… %
4 2021 AK Alaska Nome Health Risk Be… No lei… %
5 2021 AK Alaska Nome Health Status Mental… %
6 2021 AK Alaska Nome Health Outcomes Depres… %
# ℹ 9 more variables: Data_Value_Type <chr>, Data_Value <dbl>,
# TotalPopulation <dbl>, Lat <chr>, Long <chr>, CategoryID <chr>,
# MeasureId <chr>, Short_Question_Text <chr>, SunHours <dbl>
# limiting to age-related prevalence
tab_us_tidy <- tab_us_tidy %>%
filter(Data_Value_Type == "Age-adjusted prevalence")# specifying columns to keep for wide data frame, US data frame
wide_tab_us_tidy <- tab_us_tidy %>%
select(2:4, Data_Value, Data_Value_Unit, TotalPopulation, MeasureId, SunHours)
head(wide_tab_us_tidy)# A tibble: 6 × 8
StateAbbr StateDesc LocationName Data_Value Data_Value_Unit TotalPopulation
<chr> <chr> <chr> <dbl> <chr> <dbl>
1 AK Alaska Nome 31.8 % 3598
2 AK Alaska Nome 15.5 % 3598
3 AK Alaska Nome 21.6 % 3598
4 AK Alaska Nome 18.2 % 3598
5 AZ Arizona Yuma 19.3 % 93189
6 AZ Arizona Yuma 36.3 % 93189
# ℹ 2 more variables: MeasureId <chr>, SunHours <dbl>
# pivoting to wide format for more analysis, US data
tab_wide <- wide_tab_us_tidy %>%
pivot_wider(names_from = MeasureId, values_from = Data_Value)#write_csv(tab_wide, "tab_wide.csv")Visualizations
Graph 2
RECAP: The measures explored are: MHLTH: Mental health not good for >=14 days DEPRESSION: Depression among adults aged >=18 years LPA: No leisure-time physical activity among adults aged >=18 years SLEEP: Sleeping less than 7 hours among adults aged >=18 years SunHours: The duration of sunshine annually for a specific location
# load relevant library
suppressWarnings(suppressMessages(library(highcharter)))
us_viz <- wide_us_tidy %>%
hchart("scatter",
hcaes(x = SunHours,
y = Data_Value,
group = factor(MeasureId, labels = c("DEPRESSION", "Lack LPA", "Low MHLTH", "Low SLEEP")))) %>%
hc_add_theme(hc_theme_google()) %>%
hc_title(text = "Relationship Between Sunshine Hours and Health Metrics by Measure",
align = "right",
verticalAlign = "top") %>%
hc_caption(text = "Source: CDC PLACES<br>LPA = Leisure-Time Physical Activity") %>%
hc_xAxis(title = list(text = "Annual Sunshine Hours")) %>%
hc_yAxis(title = list(text = "% of Population Reporting Measure")) %>%
hc_tooltip(
headerFormat = "<b>Measure:</b> {point.series.name}<br>",
pointFormat = "State: {point.StateAbbr}<br>Prevalence: {point.y:,.0f}%<br>Sunshine Hours: {point.x}"
) %>%
hc_legend(verticalAlign = "middle",
layout = "vertical",
align = "right",
borderWidth = 1,
title = list(text = "Measures"))
us_vizThe graph looks at the states with the top 10 highest annual sunshine hours donation and the state with the 10 lowest. The missing 30 states are likely in the gap between 2500 and 3000 sunshine hours. The graph does make it clear that going without adequate sleep is common.
Tableau Plot
Graph 3
The Tableau interactive graph shows that the states with the highest rates of sunshine are in the Southwest. As you mouse over each state, you can see details about the metrics used and the specific location in the state where the sunshine hours were captured. I wish Tableau did not update the heat map as you filter for sunshine hours.
Visit this interactive plot.
Project Conclusion
My analysis does not support my hypothesis that rates of depression and/or low mental health are linked to sunshine duration. I believe a different approach (perhaps using a different statistical model?) to the question might yield a different answer. Maybe the survey replies were collected in winter months whereas the sunshine data is an average over the entire year. 2020 and 2021 changed people’s habits; perhaps a non-pandemic year would have different data for sleep, leisure time and/or low moods.
The data, at the state level, does show that low sleep and low moods go hand in hand. Doctors should prescribe naps.
Citations
National, State-Level, and County-Level Prevalence Estimates of Adults Aged ≥18 Years Self-Reporting a Lifetime Diagnosis of Depression — United States, 2020. MMWR Morb Mortal Wkly Rep 2023;72:644–650. DOI: http://dx.doi.org/10.15585/mmwr.mm7224a1
https://data.cdc.gov/500-Cities-Places/PLACES-Local-Data-for-Better-Health-Place-Data-202/eav7-hnsx/about_data
CDC PLACES: Local Data for Better Health https://www.cdc.gov/places
At Bread for the City, doctors prescribe fresh fruit and veggies https://www.washingtonpost.com/dc-md-va/2023/11/22/produce-rx-bread-city/
List of Cities By Sunshine Duration https://en.wikipedia.org/wiki/List_of_cities_by_sunshine_duration#North_America
Methodology - PLACES: Local Data for Better Health https://www.cdc.gov/places/methodology/index.html