DATA 110 FINAL

Author

Annet Isa

Sunshine, Sleep, and Leisure Time: Their Correlation with Mental Health Metrics

Does adding more beach to one’s life correlate with better mental health?

Introduction

Does adding more beach to one’s life correlate with better mental health? When trying to answer this question, is it better to limit the data set to a state like Texas - a large state with a range of annual sunshine duration across 5 cities - or would the breadth of a nationwide sample lead to more comprehensive answers?

Beginning in 2020, the CDC began including estimates of lifetime diagnoses of depression in their PLACES data in addition to measures of current (low mental health in the past 14 days) depression(1). This project explores that data set(2). PLACES provides “model-based, population-level analysis and community estimates of health measures to all counties, places (incorporated and census designated places), census tracts, and ZIP Code Tabulation Areas (ZCTAs) across the United States.(3)”

The measures explored are: MHLTH: Mental health not good for >=14 days DEPRESSION: Depression among adults aged >=18 years LPA: No leisure-time physical activity among adults aged >=18 years SLEEP: Sleeping less than 7 hours among adults aged >=18 years SunHours: The duration of sunshine annually for a specific location

N.B.: Any results from this project can only speak to correlation. As the causes of depression are still being researched, as is a universal cure-all, it is impossible to say that a specific element is what caused a bout of depression.

Why not keep the data set to just Texas? Apart from a school trip decades ago, I have no first hand experience with Texas. Also, while Texas does have a range of sunshine data, the United States as a whole has an even broader range.

There are prescription produce programs in the US that provide free fruit and vegetables to patients(4). A strong, documented relationship between sun and mood is a step towards prescription beach programs.

TX - Data Preparation

After reading in the gargantuan csv, I filtered to keep the observations regarding mental health, sleep, lack of physical activity and depression diagnoses. I also limited the data set to Texas, as annual sunshine hours duration range from 2,577.90 hours in Houston to 3,762.50 hours in El Paso(5).

# load relevant libraries and read-in the data
suppressMessages(library(tidyverse))
tx_big_df <- read_csv("PLACES__Local_Data_for_Better_Health__Place_Data_2023_release_20240429.csv", show_col_types = FALSE)

measure_ids_keep <- c("MHLTH", "SLEEP", "LPA", "DEPRESSION")

tx_cities <- c("Houston", "San Antonio", "Austin", "Dallas", "El Paso")

tx_filtered_df <- tx_big_df %>%
  filter(MeasureId %in% measure_ids_keep) %>%
  filter(StateAbbr == "TX") %>%
  filter(LocationName %in% tx_cities)

The sunshine duration data was scraped from Wikipedia. I now need to add the sunshine data to the current data frame.

# load the sunshine data into r
tx_sunshine_df <- read_csv("tx_sunshine_rates.csv", show_col_types = FALSE)

# merging the data
tx_final_df <- left_join(tx_filtered_df, tx_sunshine_df, by = "LocationName")

Finally, I format the location coordinates column, list all the variables and keep only the columns of interest.

tx_tidy__df <- tidyr::extract(tx_final_df, Geolocation, into = c('Lat', 'Long'), 
               regex = "POINT \\((-?\\d+\\.\\d+)\\s+(-?\\d+\\.\\d+)\\)")
head(tx_tidy__df)

# A tibble: 6 × 23
   Year StateAbbr StateDesc LocationName DataSource Category             Measure
  <dbl> <chr>     <chr>     <chr>        <chr>      <chr>                <chr>  
1  2021 TX        Texas     Austin       BRFSS      Health Risk Behavio… No lei…
2  2021 TX        Texas     Austin       BRFSS      Health Outcomes      Depres…
3  2021 TX        Texas     Austin       BRFSS      Health Risk Behavio… No lei…
4  2021 TX        Texas     Austin       BRFSS      Health Status        Mental…
5  2020 TX        Texas     Austin       BRFSS      Health Risk Behavio… Sleepi…
6  2021 TX        Texas     Austin       BRFSS      Health Status        Mental…
# ℹ 16 more variables: Data_Value_Unit <chr>, Data_Value_Type <chr>,
#   Data_Value <dbl>, Data_Value_Footnote_Symbol <chr>,
#   Data_Value_Footnote <chr>, Low_Confidence_Limit <dbl>,
#   High_Confidence_Limit <dbl>, TotalPopulation <dbl>, Lat <chr>, Long <chr>,
#   LocationID <chr>, CategoryID <chr>, MeasureId <chr>, DataValueTypeID <chr>,
#   Short_Question_Text <chr>, SunHours <dbl>

# checking column names
names(tx_tidy__df)

 [1] "Year"                       "StateAbbr"                 
 [3] "StateDesc"                  "LocationName"              
 [5] "DataSource"                 "Category"                  
 [7] "Measure"                    "Data_Value_Unit"           
 [9] "Data_Value_Type"            "Data_Value"                
[11] "Data_Value_Footnote_Symbol" "Data_Value_Footnote"       
[13] "Low_Confidence_Limit"       "High_Confidence_Limit"     
[15] "TotalPopulation"            "Lat"                       
[17] "Long"                       "LocationID"                
[19] "CategoryID"                 "MeasureId"                 
[21] "DataValueTypeID"            "Short_Question_Text"       
[23] "SunHours"

# specifying columns by number
tx_tidy <- tx_tidy__df %>%
  select(1, 2, 4, 6:10, 15:17, 19, 20, 22, 23)
head(tx_tidy)

# A tibble: 6 × 15
   Year StateAbbr LocationName Category  Measure Data_Value_Unit Data_Value_Type
  <dbl> <chr>     <chr>        <chr>     <chr>   <chr>           <chr>          
1  2021 TX        Austin       Health R… No lei… %               Age-adjusted p…
2  2021 TX        Austin       Health O… Depres… %               Crude prevalen…
3  2021 TX        Austin       Health R… No lei… %               Crude prevalen…
4  2021 TX        Austin       Health S… Mental… %               Crude prevalen…
5  2020 TX        Austin       Health R… Sleepi… %               Age-adjusted p…
6  2021 TX        Austin       Health S… Mental… %               Age-adjusted p…
# ℹ 8 more variables: Data_Value <dbl>, TotalPopulation <dbl>, Lat <chr>,
#   Long <chr>, CategoryID <chr>, MeasureId <chr>, Short_Question_Text <chr>,
#   SunHours <dbl>

Finally, we are limiting the data to age-adjusted prevalence.

tx_tidy <- tx_tidy %>%
  filter(Data_Value_Type == "Age-adjusted prevalence")

TX - Data Analysis

Graph 1

# loading relevant library for themes
library(ggthemes)

Warning: package 'ggthemes' was built under R version 4.3.3

tx_1 <- ggplot(data = tx_tidy, aes(x = SunHours, y = Data_Value, color = MeasureId)) +
  geom_point(aes(shape = MeasureId), size = 3) +   facet_wrap(~ LocationName) +  # separate plot for each city
  labs(title = "Relationship Between Sunshine Hours and Health Metrics by City",
       x = "Annual Sunshine Hours",
       y = "Health Metric %",
       caption = "Source: CDC PLACES") +
  theme_solarized() 

tx_1 + scale_color_brewer(palette = "Dark2")

While the information displayed is intriguing, it does not lend itself to statistical analysis. The extra hours of sunlight in El Paso do not seem to affect the number of the population getting less than 7 hours of a sleep a night - Houston, with the lowest annual hours of sunshine, has a similar sleep-deprived population segment. Though El Paso has 46% more sunlight than Houston, it still has more people with a lifetime diagnosis of depression than Houston. Looking at the faceted plot, Austin (2,643.7 hours sun) seems easiest on the mind - lowest number of sleep- and leisure-deprived people and also the lowest number of people reporting a low mood in the past 2 weeks.

tx_2 <- ggplot(data = tx_tidy, aes(x = SunHours, y = Data_Value, color = MeasureId)) +
  geom_point() +  
  facet_wrap(~ MeasureId) +  
  labs(title = "Relationship Between Sunshine Hours and Health Metrics by Measure",
       x = "Annual Sunshine Hours",
       y = "% Population Experiencing Measure",
       caption = "Source: CDC PLACES",
       color = "Measure")  +
  theme_solarized() 

tx_2 + scale_color_brewer(palette = "Dark2")

Graph, Analytical

tx_3 <- ggplot(data = tx_tidy, aes(x = SunHours, y = Data_Value, color = MeasureId)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE,  aes(color = MeasureId, linetype = MeasureId)) +  
  facet_wrap(~ MeasureId) +
  labs(title = "Sunshine Hours vs. Health Metrics by Measure",
       x = "Annual Sunshine Hours",
       y = "% Population Experiencing Measure",
       caption = "Source: CDC PLACES") +
  theme_solarized()

tx_3 + scale_color_brewer(palette = "Dark2")

Interactivity would show how Austin sticks out compared to the other 4 TX cities in this data set. Eyeballing the regression lines above, it does not seem that annual sunshine duration is a good predictor of any the mental health related measures in this data set, especially once you remove Austin. Are any of the measures a good predictor for the others?

# specifying columns to keep for wide data frame
wide_tx_tidy <- tx_tidy %>%
  select(3, Data_Value, Data_Value_Unit, TotalPopulation, MeasureId, SunHours)
head(wide_tx_tidy)

# A tibble: 6 × 6
  LocationName Data_Value Data_Value_Unit TotalPopulation MeasureId  SunHours
  <chr>             <dbl> <chr>                     <dbl> <chr>         <dbl>
1 Austin             19.4 %                        806804 LPA           2644.
2 Austin             29.7 %                        806804 SLEEP         2644.
3 Austin             15.2 %                        806804 MHLTH         2644.
4 Austin             21.1 %                        806804 DEPRESSION    2644.
5 Dallas             21.6 %                       1197688 DEPRESSION    2850.
6 Dallas             30.4 %                       1197688 LPA           2850.

# pivoting to wide format for more analysis
analysis_data_wide <- wide_tx_tidy %>%
  pivot_wider(names_from = MeasureId, values_from = Data_Value)

TX - Statistical Analysis

# linear regression for Depression as a function of LPA and Sleep
lm_depression <- lm(DEPRESSION ~ LPA + SLEEP, data = analysis_data_wide)
summary(lm_depression)


Call:
lm(formula = DEPRESSION ~ LPA + SLEEP, data = analysis_data_wide)

Residuals:
       1        2        3        4        5 
-0.03713 -0.30566  1.52031 -1.24429  0.06677 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.8797    10.7984   0.174    0.878
LPA          -0.3539     0.2499  -1.416    0.292
SLEEP         0.8796     0.4565   1.927    0.194

Residual standard error: 1.407 on 2 degrees of freedom
Multiple R-squared:  0.6524,    Adjusted R-squared:  0.3047 
F-statistic: 1.877 on 2 and 2 DF,  p-value: 0.3476

# linear regression for Mental Health as a function of LPA and Sleep
lm_mhlth <- lm(MHLTH ~ LPA + SLEEP, data = analysis_data_wide)
summary(lm_mhlth)


Call:
lm(formula = MHLTH ~ LPA + SLEEP, data = analysis_data_wide)

Residuals:
        1         2         3         4         5 
-0.007833 -0.079416  0.136029 -0.068949  0.020169 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  6.26263    0.94053   6.659   0.0218 *
LPA          0.04979    0.02177   2.287   0.1494  
SLEEP        0.26866    0.03976   6.758   0.0212 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1225 on 2 degrees of freedom
Multiple R-squared:  0.9905,    Adjusted R-squared:  0.9809 
F-statistic: 103.8 on 2 and 2 DF,  p-value: 0.00954

The results are startling.

Depression: My model is neither a good fit (only 30% of the variability is explained by the model) nor statistically significant (the p-value - 0.3476 - is higher than the conventional significant value, 0.05.)

Low Mental Health vs a lack of physical activity and lack of sleep: The model is an amazing fit. The R-squared of 0.9809 means the model explains nearly all of the variability. The p-value is very small at 0.00954.

The following equation is in regard to populations and is in percentage points.

Statistical Equation, TX:

Mental Health = 6.262(intercept) + 0.049 (LPA) + 0.269(sleep)

According to the equation, if 0% of a population reported sleeping less than 7 hours and 0% reported a lack of leisure-time physical activity, the expected low mental health rate would be 6.262%. The equation can only be used for ranges in the data set, however.

A lack of sleep has a much stronger correlation to low mental health than a lack of leisure-time physical activity.

lm_mh_sleep <- lm(MHLTH ~ SLEEP, data = analysis_data_wide)
summary(lm_mh_sleep)


Call:
lm(formula = MHLTH ~ SLEEP, data = analysis_data_wide)

Residuals:
       1        2        3        4        5 
-0.11426 -0.02261  0.25270  0.05270 -0.16853 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  5.18068    1.26225   4.104  0.02618 * 
SLEEP        0.34120    0.03724   9.163  0.00275 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1903 on 3 degrees of freedom
Multiple R-squared:  0.9655,    Adjusted R-squared:  0.954 
F-statistic: 83.96 on 1 and 3 DF,  p-value: 0.002748

When I focus on the relationship between mental health and sleep however, the R-squared value is slightly smaller meaning the model explains only 95.4% of the variability.

lm_sun_mh <- lm(MHLTH ~ SunHours, data = analysis_data_wide)
summary(lm_sun_mh)


Call:
lm(formula = MHLTH ~ SunHours, data = analysis_data_wide)

Residuals:
       1        2        3        4        5 
-1.42563  0.39627 -0.04984  0.19932  0.87987 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
(Intercept) 1.562e+01  2.946e+00   5.304   0.0131 *
SunHours    3.792e-04  1.007e-03   0.377   0.7315  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.001 on 3 degrees of freedom
Multiple R-squared:  0.04516,   Adjusted R-squared:  -0.2731 
F-statistic: 0.1419 on 1 and 3 DF,  p-value: 0.7315

p-value: 0.7315 (very non-significant) R-squared: -0.2731

I have never received a negative R-squared value before. Wow. The sun is having no discernible effect on this data. Will the trend continue with a larger data set?

US - Data Preparation

We will repeat the data preparation process, this time for the US as a whole.

# taking the original large csv and focusing on the top 10 and bottom 10 sunny states
high_sun_states <- c("AZ", "NV", "TX", "CA", "NM", "FL", "CO", "OK", "HI", "UT")

low_sun_states <- c("TN", "IL", "WI", "IN", "MI", "OR", "OH", "WA", "AK", "PA")

filtered_big_df <- tx_big_df %>%
  filter(MeasureId %in% measure_ids_keep) %>%
  filter(StateAbbr %in% c(high_sun_states, low_sun_states))

# creating the list of locales to focus on by LocationID
locales <-c("0254920",  "0485540", "0644000", "0820000", "1245000", "1714000", "2622000", "3240000", "3502000", "3916000", "4055000", "4159000", "4261000", "4824000", "4752006", "4967000", "5363000", "5553000", "1836003", "1571550")

filtered_locales <- filtered_big_df %>%
  filter(LocationID %in% locales)

I now need to add the sunshine data to the current data frame.

# load the sunshine data into r
us_sunshine_df <- read_csv("C:/Users/bombshellnoir/Dropbox (Personal)/00000 Montgomery College/DATA 110/edited-sunshine_rates.csv", show_col_types = FALSE)

# merging the data
us_final_df <- left_join(filtered_locales, us_sunshine_df, by = "LocationID")

us_tidy_df <- tidyr::extract(us_final_df, Geolocation, into = c('Lat', 'Long'), 
               regex = "POINT \\((-?\\d+\\.\\d+)\\s+(-?\\d+\\.\\d+)\\)")
head(us_tidy_df)

# A tibble: 6 × 23
   Year StateAbbr StateDesc LocationName DataSource Category             Measure
  <dbl> <chr>     <chr>     <chr>        <chr>      <chr>                <chr>  
1  2020 AK        Alaska    Nome         BRFSS      Health Risk Behavio… Sleepi…
2  2021 AK        Alaska    Nome         BRFSS      Health Status        Mental…
3  2020 AK        Alaska    Nome         BRFSS      Health Risk Behavio… Sleepi…
4  2021 AK        Alaska    Nome         BRFSS      Health Risk Behavio… No lei…
5  2021 AK        Alaska    Nome         BRFSS      Health Status        Mental…
6  2021 AK        Alaska    Nome         BRFSS      Health Outcomes      Depres…
# ℹ 16 more variables: Data_Value_Unit <chr>, Data_Value_Type <chr>,
#   Data_Value <dbl>, Data_Value_Footnote_Symbol <chr>,
#   Data_Value_Footnote <chr>, Low_Confidence_Limit <dbl>,
#   High_Confidence_Limit <dbl>, TotalPopulation <dbl>, Lat <chr>, Long <chr>,
#   LocationID <chr>, CategoryID <chr>, MeasureId <chr>, DataValueTypeID <chr>,
#   Short_Question_Text <chr>, SunHours <dbl>

names(us_tidy_df)

 [1] "Year"                       "StateAbbr"                 
 [3] "StateDesc"                  "LocationName"              
 [5] "DataSource"                 "Category"                  
 [7] "Measure"                    "Data_Value_Unit"           
 [9] "Data_Value_Type"            "Data_Value"                
[11] "Data_Value_Footnote_Symbol" "Data_Value_Footnote"       
[13] "Low_Confidence_Limit"       "High_Confidence_Limit"     
[15] "TotalPopulation"            "Lat"                       
[17] "Long"                       "LocationID"                
[19] "CategoryID"                 "MeasureId"                 
[21] "DataValueTypeID"            "Short_Question_Text"       
[23] "SunHours"

# selecting only columns of interest
us_tidy <- us_tidy_df %>%
  select(1, 2, 4, 6:10, 15:17, 19, 20, 22, 23)
head(us_tidy)

# A tibble: 6 × 15
   Year StateAbbr LocationName Category  Measure Data_Value_Unit Data_Value_Type
  <dbl> <chr>     <chr>        <chr>     <chr>   <chr>           <chr>          
1  2020 AK        Nome         Health R… Sleepi… %               Age-adjusted p…
2  2021 AK        Nome         Health S… Mental… %               Crude prevalen…
3  2020 AK        Nome         Health R… Sleepi… %               Crude prevalen…
4  2021 AK        Nome         Health R… No lei… %               Crude prevalen…
5  2021 AK        Nome         Health S… Mental… %               Age-adjusted p…
6  2021 AK        Nome         Health O… Depres… %               Crude prevalen…
# ℹ 8 more variables: Data_Value <dbl>, TotalPopulation <dbl>, Lat <chr>,
#   Long <chr>, CategoryID <chr>, MeasureId <chr>, Short_Question_Text <chr>,
#   SunHours <dbl>

# confirming the remaining column names
names(us_tidy)

 [1] "Year"                "StateAbbr"           "LocationName"       
 [4] "Category"            "Measure"             "Data_Value_Unit"    
 [7] "Data_Value_Type"     "Data_Value"          "TotalPopulation"    
[10] "Lat"                 "Long"                "CategoryID"         
[13] "MeasureId"           "Short_Question_Text" "SunHours"

# limiting to age-related prevalence
us_tidy <- us_tidy %>%
  filter(Data_Value_Type == "Age-adjusted prevalence")

US - Data Analysis

Graph, Regression Lines

#
us_1 <- ggplot(data = us_tidy, aes(x = SunHours, y = Data_Value, color = MeasureId)) +
  geom_point() +  
   geom_smooth(method = "lm", formula = y ~ x, se = FALSE, aes(color = MeasureId, linetype = MeasureId)) +  
  facet_wrap(~ MeasureId) +
  labs(title = "Relationship Between Sunshine Hours and Health Metrics by Measure",
       x = "Annual Sunshine Hours",
       y = "% Population Experiencing Measure",
       caption = "Source: CDC PLACES") +
  theme_minimal()

us_1 + scale_color_brewer(palette = "Dark2")

Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_smooth()`).

Warning: Removed 5 rows containing missing values or values outside the scale range
(`geom_point()`).

# specifying columns to keep for wide data frame, US data frame
wide_us_tidy <- us_tidy %>%
  select(2, 3, Data_Value, Data_Value_Unit, TotalPopulation, MeasureId, SunHours)
head(wide_us_tidy)

# A tibble: 6 × 7
  StateAbbr LocationName Data_Value Data_Value_Unit TotalPopulation MeasureId 
  <chr>     <chr>             <dbl> <chr>                     <dbl> <chr>     
1 AK        Nome               31.8 %                          3598 SLEEP     
2 AK        Nome               15.5 %                          3598 MHLTH     
3 AK        Nome               21.6 %                          3598 LPA       
4 AK        Nome               18.2 %                          3598 DEPRESSION
5 AZ        Yuma               19.3 %                         93189 DEPRESSION
6 AZ        Yuma               36.3 %                         93189 SLEEP     
# ℹ 1 more variable: SunHours <dbl>

# pivoting to wide format for more analysis, US data
us_wide <- wide_us_tidy %>%
  pivot_wider(names_from = MeasureId, values_from = Data_Value)

# linear regression for Depression as a function of LPA and Sleep
lm_us_depression <- lm(DEPRESSION ~ LPA + SLEEP, data = us_wide)
summary(lm_us_depression)


Call:
lm(formula = DEPRESSION ~ LPA + SLEEP, data = us_wide)

Residuals:
   Min     1Q Median     3Q    Max 
-6.935 -2.142  1.146  2.609  4.346 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  33.3057     5.4273   6.137 1.43e-05 ***
LPA           0.4562     0.2229   2.047   0.0574 .  
SLEEP        -0.6736     0.2487  -2.709   0.0155 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.472 on 16 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.3152,    Adjusted R-squared:  0.2296 
F-statistic: 3.682 on 2 and 16 DF,  p-value: 0.04838

# linear regression for Mental Health as a function of LPA and Sleep
lm_us_mhlth <- lm(MHLTH ~ LPA + SLEEP, data = us_wide)
summary(lm_us_mhlth)


Call:
lm(formula = MHLTH ~ LPA + SLEEP, data = us_wide)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.6616 -0.7326 -0.0201  1.1847  1.7859 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 14.49299    2.38485   6.077  1.6e-05 ***
LPA          0.28783    0.09793   2.939  0.00962 ** 
SLEEP       -0.14027    0.10927  -1.284  0.21752    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.526 on 16 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.4196,    Adjusted R-squared:  0.347 
F-statistic: 5.783 on 2 and 16 DF,  p-value: 0.01288

The analysis suggests that these are not good models. Perhaps a relationship is clearer when comparing just two variables.

lm_us_sleep_depression <- lm(DEPRESSION ~ SLEEP, data = us_wide)
summary(lm_us_sleep_depression)


Call:
lm(formula = DEPRESSION ~ SLEEP, data = us_wide)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.9828 -1.9179  0.8999  2.5519  4.8146 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  30.8664     5.7705   5.349 5.31e-05 ***
SLEEP        -0.2704     0.1654  -1.634    0.121    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.784 on 17 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.1358,    Adjusted R-squared:  0.08497 
F-statistic: 2.671 on 1 and 17 DF,  p-value: 0.1205

The equation: Depression rates = 30.8664 - 0.2704(Sleep)

The low R-squared and high p-value make it clear this is not a meaningful model. Can we use sunshine hours to predict the mental health value?

lm_us_sun_mh <- lm(MHLTH ~ SunHours, data = us_wide)
summary(lm_us_sun_mh)


Call:
lm(formula = MHLTH ~ SunHours, data = us_wide)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.1522 -1.2898  0.0634  1.1204  3.6369 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.691e+01  2.234e+00   7.568 1.13e-06 ***
SunHours    -1.819e-05  7.667e-04  -0.024    0.981    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.997 on 16 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  3.519e-05, Adjusted R-squared:  -0.06246 
F-statistic: 0.0005631 on 1 and 16 DF,  p-value: 0.9814

There is a negative R-squared value! And such a high p-value. There is no correlation shown by the data between mental health and annual sunshine duration.

tab_us_tidy <- us_tidy_df %>%
  select(1:4, 6:10, 15:17, 19, 20, 22, 23)
head(tab_us_tidy)

# A tibble: 6 × 16
   Year StateAbbr StateDesc LocationName Category        Measure Data_Value_Unit
  <dbl> <chr>     <chr>     <chr>        <chr>           <chr>   <chr>          
1  2020 AK        Alaska    Nome         Health Risk Be… Sleepi… %              
2  2021 AK        Alaska    Nome         Health Status   Mental… %              
3  2020 AK        Alaska    Nome         Health Risk Be… Sleepi… %              
4  2021 AK        Alaska    Nome         Health Risk Be… No lei… %              
5  2021 AK        Alaska    Nome         Health Status   Mental… %              
6  2021 AK        Alaska    Nome         Health Outcomes Depres… %              
# ℹ 9 more variables: Data_Value_Type <chr>, Data_Value <dbl>,
#   TotalPopulation <dbl>, Lat <chr>, Long <chr>, CategoryID <chr>,
#   MeasureId <chr>, Short_Question_Text <chr>, SunHours <dbl>

# limiting to age-related prevalence
tab_us_tidy <- tab_us_tidy %>%
  filter(Data_Value_Type == "Age-adjusted prevalence")

# specifying columns to keep for wide data frame, US data frame
wide_tab_us_tidy <- tab_us_tidy %>%
  select(2:4, Data_Value, Data_Value_Unit, TotalPopulation, MeasureId, SunHours)
head(wide_tab_us_tidy)

# A tibble: 6 × 8
  StateAbbr StateDesc LocationName Data_Value Data_Value_Unit TotalPopulation
  <chr>     <chr>     <chr>             <dbl> <chr>                     <dbl>
1 AK        Alaska    Nome               31.8 %                          3598
2 AK        Alaska    Nome               15.5 %                          3598
3 AK        Alaska    Nome               21.6 %                          3598
4 AK        Alaska    Nome               18.2 %                          3598
5 AZ        Arizona   Yuma               19.3 %                         93189
6 AZ        Arizona   Yuma               36.3 %                         93189
# ℹ 2 more variables: MeasureId <chr>, SunHours <dbl>

# pivoting to wide format for more analysis, US data
tab_wide <- wide_tab_us_tidy %>%
  pivot_wider(names_from = MeasureId, values_from = Data_Value)

#write_csv(tab_wide, "tab_wide.csv")

Visualizations

Graph 2

RECAP: The measures explored are: MHLTH: Mental health not good for >=14 days DEPRESSION: Depression among adults aged >=18 years LPA: No leisure-time physical activity among adults aged >=18 years SLEEP: Sleeping less than 7 hours among adults aged >=18 years SunHours: The duration of sunshine annually for a specific location

# load relevant library
suppressWarnings(suppressMessages(library(highcharter)))


us_viz <- wide_us_tidy %>%
  hchart("scatter",
         hcaes(x = SunHours,
               y = Data_Value,
               group = factor(MeasureId, labels = c("DEPRESSION", "Lack LPA", "Low MHLTH", "Low SLEEP")))) %>%
  hc_add_theme(hc_theme_google()) %>%
  hc_title(text = "Relationship Between Sunshine Hours and Health Metrics by Measure",
           align = "right",
           verticalAlign = "top") %>%
  hc_caption(text = "Source: CDC PLACES<br>LPA = Leisure-Time Physical Activity") %>%
  hc_xAxis(title = list(text = "Annual Sunshine Hours")) %>%
  hc_yAxis(title = list(text = "% of Population Reporting Measure")) %>%
  hc_tooltip(
    headerFormat = "<b>Measure:</b> {point.series.name}<br>",  
    pointFormat = "State:  {point.StateAbbr}<br>Prevalence: {point.y:,.0f}%<br>Sunshine Hours: {point.x}"  
  ) %>%
  hc_legend(verticalAlign = "middle",
            layout = "vertical",
            align = "right",
            borderWidth = 1,
            title = list(text = "Measures"))

us_viz

The graph looks at the states with the top 10 highest annual sunshine hours donation and the state with the 10 lowest. The missing 30 states are likely in the gap between 2500 and 3000 sunshine hours. The graph does make it clear that going without adequate sleep is common.

Tableau Plot

Graph 3

The Tableau interactive graph shows that the states with the highest rates of sunshine are in the Southwest. As you mouse over each state, you can see details about the metrics used and the specific location in the state where the sunshine hours were captured. I wish Tableau did not update the heat map as you filter for sunshine hours.

Visit this interactive plot.

Project Conclusion

My analysis does not support my hypothesis that rates of depression and/or low mental health are linked to sunshine duration. I believe a different approach (perhaps using a different statistical model?) to the question might yield a different answer. Maybe the survey replies were collected in winter months whereas the sunshine data is an average over the entire year. 2020 and 2021 changed people’s habits; perhaps a non-pandemic year would have different data for sleep, leisure time and/or low moods.

The data, at the state level, does show that low sleep and low moods go hand in hand. Doctors should prescribe naps.

Citations

National, State-Level, and County-Level Prevalence Estimates of Adults Aged ≥18 Years Self-Reporting a Lifetime Diagnosis of Depression — United States, 2020. MMWR Morb Mortal Wkly Rep 2023;72:644–650. DOI: http://dx.doi.org/10.15585/mmwr.mm7224a1
https://data.cdc.gov/500-Cities-Places/PLACES-Local-Data-for-Better-Health-Place-Data-202/eav7-hnsx/about_data
CDC PLACES: Local Data for Better Health https://www.cdc.gov/places
At Bread for the City, doctors prescribe fresh fruit and veggies https://www.washingtonpost.com/dc-md-va/2023/11/22/produce-rx-bread-city/
List of Cities By Sunshine Duration https://en.wikipedia.org/wiki/List_of_cities_by_sunshine_duration#North_America