library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
setwd("C:/Users/Mulut/Desktop/Classes/Data101/projects/Final project")
gss2010 <- read_csv("gss2010.csv")
## Rows: 2044 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): degree, grass
## dbl (3): hrsrelax, mntlhlth, hrs1
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
How do hours worked per week, hours of relaxation, educational attainment, and attitudes toward marijuana legalization affect the number of days an individual reports poor mental health in the past 30 days?
This project explores how work-life factors and social characteristics relate to self-reported mental health in the United States. Mental health is an important part of overall well-being, and people’s stress levels and emotional struggles can be influenced by working long hours, having limited time to relax, and broader life conditions connected to education and beliefs. In this project, I focus on whether hours worked per week, hours available to relax, educational attainment, and attitudes toward marijuana legalization help explain variation in reported mental health difficulty.
The dataset used is the 2010 General Social Survey (gss2010), which includes 2,044 observations and 5 variables. Each row represents one survey respondent, and the dataset includes measures of work time, relaxation time, education level, mental health, and opinions about marijuana legalization. Because the dataset includes both quantitative and categorical variables, it works well for a multiple regression analysis. The goal of this project is to build a multiple linear regression model to see which predictors are most strongly associated with the number of days in the past 30 days when mental health was not good.
To begin, I loaded the dataset and examined its structure with head() and str(). This confirmed the variable types and showed where missing values occur. Because the dataset contains many NA values, I built a cleaned dataset by selecting only the variables needed for the model, converting the categorical variables (education and marijuana legalization opinion) to factors, and keeping only rows with valid, non-missing values for mental health days, hours worked, and hours of relaxation.
Next, I generated summary statistics for mental health days, hours worked, and hours of relaxation, along with a summary of mental health days by education level.
The variables used in this project are:
mntlhlth – number of days in the past 30 days mental health was not good (quantitative, outcome variable)
hrs1 – hours worked per week (quantitative)
hrsrelax – hours available to relax after an average work day (quantitative)
degree – educational attainment/degree (categorical)
grass – opinion about legalizing marijuana (categorical)
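The cleaning step described above depends on how R treats NA in comparisons. The following is a minimal sketch, assuming a small made-up data frame (`toy` and its values are hypothetical, not GSS data). It uses base R's subset() as a stand-in; like dplyr's filter(), it keeps only rows where the condition is TRUE, so rows where the condition evaluates to NA are dropped.

```r
# Hypothetical toy data; values are invented for illustration only.
toy <- data.frame(
  mntlhlth = c(3, NA, 0, 31, 10),
  hrs1     = c(55, 40, NA, 45, 48),
  grass    = c(NA, "LEGAL", "LEGAL", "NOT LEGAL", NA)
)

# Count missing values per column before cleaning.
na_counts <- colSums(is.na(toy))
print(na_counts)

# A comparison with NA yields NA, and subset() keeps only rows where the
# condition is TRUE, so rows with missing or out-of-range values are dropped.
toy_clean <- subset(toy, mntlhlth >= 0 & mntlhlth <= 30 & hrs1 >= 0)
print(toy_clean)

# Missing values in columns that were not part of the condition remain.
remaining_na <- colSums(is.na(toy_clean))
print(remaining_na)
```

In this sketch the rows with a missing or out-of-range numeric value disappear, while missing `grass` values survive the cleaning step and would only be dropped later, when a model that uses `grass` is fit.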
head(gss2010)
## # A tibble: 6 × 5
## hrsrelax mntlhlth hrs1 degree grass
## <dbl> <dbl> <dbl> <chr> <chr>
## 1 2 3 55 BACHELOR <NA>
## 2 4 6 45 BACHELOR LEGAL
## 3 NA NA NA LT HIGH SCHOOL <NA>
## 4 NA NA NA LT HIGH SCHOOL NOT LEGAL
## 5 NA NA NA LT HIGH SCHOOL NOT LEGAL
## 6 NA NA NA LT HIGH SCHOOL LEGAL
str(gss2010)
## spc_tbl_ [2,044 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ hrsrelax: num [1:2044] 2 4 NA NA NA NA 3 NA 0 5 ...
## $ mntlhlth: num [1:2044] 3 6 NA NA NA NA 0 NA 0 10 ...
## $ hrs1 : num [1:2044] 55 45 NA NA NA NA 45 NA 40 48 ...
## $ degree : chr [1:2044] "BACHELOR" "BACHELOR" "LT HIGH SCHOOL" "LT HIGH SCHOOL" ...
## $ grass : chr [1:2044] NA "LEGAL" NA "NOT LEGAL" ...
## - attr(*, "spec")=
## .. cols(
## .. hrsrelax = col_double(),
## .. mntlhlth = col_double(),
## .. hrs1 = col_double(),
## .. degree = col_character(),
## .. grass = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
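# Keep the model variables, encode the categorical predictors as factors,
# and keep only rows with valid numeric values.
gss_clean <- gss2010 |>
  select(mntlhlth, hrs1, hrsrelax, degree, grass) |>
  mutate(
    degree = factor(degree),
    grass = factor(grass)
  ) |>
  filter(
    # filter() also drops rows where a condition is NA, so missing
    # values in the three numeric variables are removed here.
    mntlhlth >= 0, mntlhlth <= 30,
    hrs1 >= 0,
    hrsrelax >= 0
  )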
summary(gss_clean$mntlhlth)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 3.798 4.000 30.000
summary(gss_clean$hrs1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 35.00 40.00 40.64 50.00 89.00
summary(gss_clean$hrsrelax)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 3.000 3.696 5.000 24.000
degree_summary <- gss_clean |>
  group_by(degree) |>
  summarise(
    n = n(),
    mean_mntlhlth = mean(mntlhlth, na.rm = TRUE),
    median_mntlhlth = median(mntlhlth, na.rm = TRUE),
    sd_mntlhlth = sd(mntlhlth, na.rm = TRUE),
    min_mntlhlth = min(mntlhlth, na.rm = TRUE),
    max_mntlhlth = max(mntlhlth, na.rm = TRUE)
  )
degree_summary
## # A tibble: 5 × 7
## degree n mean_mntlhlth median_mntlhlth sd_mntlhlth min_mntlhlth
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 BACHELOR 247 2.68 0 5.26 0
## 2 GRADUATE 153 2.39 0 4.98 0
## 3 HIGH SCHOOL 530 4.57 0 8.15 0
## 4 JUNIOR COLLEGE 92 3.60 0 6.46 0
## 5 LT HIGH SCHOOL 117 4.65 0 9.10 0
## # ℹ 1 more variable: max_mntlhlth <dbl>
Because my research question involves a quantitative outcome variable (mntlhlth) and multiple predictors, a multiple linear regression model is appropriate. This model allows me to see how each variable relates to mental health days while controlling for the others.
model <- lm(mntlhlth ~ hrs1 + hrsrelax + degree + grass, data = gss_clean)
summary(model)
##
## Call:
## lm(formula = mntlhlth ~ hrs1 + hrsrelax + degree + grass, data = gss_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7700 -4.2599 -2.3731 0.9443 27.5201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.375787 1.129652 2.988 0.002906 **
## hrs1 -0.004321 0.018358 -0.235 0.814004
## hrsrelax -0.244024 0.103783 -2.351 0.018992 *
## degreeGRADUATE 0.071433 0.925814 0.077 0.938522
## degreeHIGH SCHOOL 2.437399 0.686084 3.553 0.000408 ***
## degreeJUNIOR COLLEGE 2.455282 1.082616 2.268 0.023646 *
## degreeLT HIGH SCHOOL 1.846799 1.004810 1.838 0.066503 .
## grassNOT LEGAL -0.115215 0.537401 -0.214 0.830305
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.977 on 682 degrees of freedom
## (449 observations deleted due to missingness)
## Multiple R-squared: 0.0327, Adjusted R-squared: 0.02277
## F-statistic: 3.293 on 7 and 682 DF, p-value: 0.001881
The regression results indicate that some predictors have meaningful associations with the number of days individuals report poor mental health in the past 30 days, while others do not. Because grass is missing for 449 of the 1,139 respondents in the cleaned data, the model is fit on the remaining 690 observations. The coefficient for hours worked per week is −0.0043, meaning that each additional hour worked per week is associated with a decrease of about 0.004 days of poor mental health, holding all other variables constant. However, this association is not statistically significant (p = 0.81), so hours worked does not meaningfully predict mental health in this model.
In contrast, hours of relaxation shows a statistically significant association (p = 0.019). The coefficient for hours of relaxation is −0.244, meaning that for each additional hour available to relax after an average workday, the expected number of days of poor mental health decreases by about 0.24 days, holding all other variables constant. This suggests that having more time to relax is associated with better mental health outcomes.
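To make the size and uncertainty of the relaxation coefficient concrete, the sketch below rebuilds a 95% confidence interval from the numbers printed in the regression output. It assumes only the reported estimate, standard error, and residual degrees of freedom; on the fitted object, confint(model) would give the same kind of interval directly. The four-hour comparison is an illustrative choice, not something from the analysis.

```r
# Estimate, standard error, and residual degrees of freedom copied from the
# summary(model) output for hrsrelax.
estimate <- -0.244024
std_error <- 0.103783
df_resid <- 682

# Two-sided 95% interval: estimate +/- t critical value * standard error.
t_crit <- qt(0.975, df = df_resid)
ci <- estimate + c(-1, 1) * t_crit * std_error
names(ci) <- c("lower", "upper")
print(round(ci, 3))

# Implied difference in expected poor-mental-health days for a respondent with
# four more hours of relaxation, other predictors held fixed (illustrative).
diff_4_hours <- 4 * estimate
print(diff_4_hours)
```

The whole interval lies below zero, which matches the significance test, but its width shows that the per-hour association is estimated imprecisely.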
Education also plays an important role. The baseline education group is respondents with a bachelor's degree, because BACHELOR is the first factor level. Compared to this group, individuals with a high school degree report about 2.44 more days of poor mental health per month (p < 0.001), and those with a junior college degree report about 2.46 more days (p = 0.024); both differences are statistically significant. Individuals with less than a high school education also report about 1.85 more days, though this difference is only marginally significant (p = 0.067). In contrast, respondents with a graduate degree do not differ significantly from the baseline group in reported mental health days.
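The baseline group for a factor predictor is determined by the order of its levels. The sketch below, using only the five degree labels that appear in the data, shows why BACHELOR is the reference category under R's default settings and how relevel() could pick a different one. Choosing HIGH SCHOOL as the alternative reference is a hypothetical example, not part of the original analysis.

```r
# Degree labels as they appear in the data.
degree_labels <- c("BACHELOR", "GRADUATE", "HIGH SCHOOL",
                   "JUNIOR COLLEGE", "LT HIGH SCHOOL")

# factor() orders levels alphabetically by default; the first level is the
# reference category in lm()'s default treatment coding.
degree <- factor(degree_labels)
baseline <- levels(degree)[1]
print(baseline)

# relevel() moves a chosen level to the front, making it the new reference.
degree_hs <- relevel(degree, ref = "HIGH SCHOOL")
new_baseline <- levels(degree_hs)[1]
print(levels(degree_hs))
```

Changing the reference level does not change the model's fit; it only changes which group the other coefficients are compared against.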
The coefficient for attitudes toward marijuana legalization indicates that individuals who believe marijuana should not be legal report slightly fewer days of poor mental health (about 0.12 fewer) than the baseline group, those who believe it should be legal, but this difference is very small and not statistically significant (p = 0.83).
The multiple R-squared of 0.0327 indicates that the model explains about 3.3% of the variation in reported poor mental health days; the adjusted R-squared, which penalizes for the number of predictors, is 0.0228. While this is a small proportion, it is not unusual for mental health outcomes, which are influenced by many unobserved psychological, social, and environmental factors. The overall F-test (p = 0.0019) indicates that the regression is statistically significant as a whole, meaning that the predictors jointly explain more of the variation in mental health than a model with no predictors. Overall, the results suggest that relaxation time and educational attainment are more strongly associated with mental health outcomes than hours worked, highlighting the importance of recovery time and social factors in understanding mental health.
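The fit statistics in the regression output are linked by simple formulas, which can be checked by hand. The sketch below uses only numbers printed by summary(model): the multiple R-squared, the residual degrees of freedom (682), and the number of estimated coefficients (8, so 690 observations were used). Because the inputs are rounded, the results match the printed values only approximately.

```r
r_squared <- 0.0327   # Multiple R-squared from summary(model)
n_obs <- 690          # 682 residual df + 8 estimated coefficients
n_pred <- 7           # slope coefficients (excluding the intercept)

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_pred - 1)

# Overall F statistic and its p-value on (n_pred, n_obs - n_pred - 1) df.
f_stat <- (r_squared / n_pred) / ((1 - r_squared) / (n_obs - n_pred - 1))
p_value <- pf(f_stat, df1 = n_pred, df2 = n_obs - n_pred - 1,
              lower.tail = FALSE)

print(c(adj_r_squared = adj_r_squared, f_stat = f_stat, p_value = p_value))
```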
plot(gss_clean$hrs1, gss_clean$mntlhlth,
     xlab = "Hours Worked Per Week",
     ylab = "Days Mental Health Not Good",
     main = "Linearity Check: Hours Worked vs Mental Health")
abline(lm(mntlhlth ~ hrs1, data = gss_clean), lwd = 2)
plot(gss_clean$hrsrelax, gss_clean$mntlhlth,
     xlab = "Hours to Relax",
     ylab = "Days Mental Health Not Good",
     main = "Linearity Check: Relaxation Hours vs Mental Health")
abline(lm(mntlhlth ~ hrsrelax, data = gss_clean), lwd = 2)
plot(resid(model),
     type = "b",
     main = "Residual Plot for Multiple Linear Regression",
     ylab = "Residuals")
abline(h = 0, lty = 2)
par(mfrow = c(2,2))
plot(model)
par(mfrow = c(1,1))
cor(gss_clean[, c("hrs1", "hrsrelax")], use = "complete.obs")
## hrs1 hrsrelax
## hrs1 1.0000000 -0.1962775
## hrsrelax -0.1962775 1.0000000
The relationship between hours worked per week and days of poor mental health shows a very weak linear trend, which is consistent with the non-significant regression coefficient. In contrast, hours of relaxation exhibits a clearer negative linear relationship, indicating that increased relaxation time is associated with fewer days of poor mental health. The Residuals vs. Fitted plot shows some curvature and increasing spread of residuals at higher fitted values, suggesting that while the linear model captures general trends, it does not fully represent the relationship across all ranges of predicted mental health outcomes.
The residual plot indexed by observation order shows residuals fluctuating around zero with no clear pattern over the index. Because row order in a cross-sectional survey carries little information, this is only a weak check, but it is consistent with the independence assumption and with the design of the General Social Survey, where each observation represents a different individual. However, the plot also reveals several large positive residuals, indicating outliers where the model substantially underpredicts the number of days of poor mental health for some respondents.
Homoscedasticity was evaluated using the Residuals vs. Fitted, Scale–Location, and residual index plots. Across these diagnostics, the spread of residuals increases as fitted values increase, indicating heteroscedasticity. This means the model predicts mental health outcomes less consistently for individuals with higher predicted numbers of poor mental health days. As a result, the usual standard errors of the coefficients may be biased, and significance tests should be interpreted with caution.
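One common response to heteroscedasticity is to report heteroscedasticity-robust standard errors alongside the usual ones. This was not done in the analysis above; the following is only a sketch of the idea, computing the White (HC0) sandwich estimator by hand in base R on simulated data. The variables `x` and `y` and all simulation settings are hypothetical.

```r
set.seed(1)

# Simulated data (hypothetical): the noise spread grows with x, so the
# constant-variance assumption is violated by construction.
n <- 200
x <- runif(n, 0, 10)
y <- 1 + 0.5 * x + rnorm(n, sd = 0.5 + 0.3 * x)
fit <- lm(y ~ x)

# White (HC0) sandwich estimator: (X'X)^-1 X' diag(e^2) X (X'X)^-1.
X <- model.matrix(fit)
e <- resid(fit)
bread <- solve(crossprod(X))
meat <- crossprod(X * e)  # equals t(X) %*% diag(e^2) %*% X
robust_vcov <- bread %*% meat %*% bread

# Compare the usual standard errors with the robust ones.
classical_se <- sqrt(diag(vcov(fit)))
robust_se <- sqrt(diag(robust_vcov))
print(cbind(classical_se, robust_se))
```

Applied to the regression in this report, the same calculation would use `model` in place of `fit`; whether the conclusions would change is not something this sketch can show.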
Normality of residuals was assessed using the Normal Q–Q plot, which shows clear deviations from the reference line in the upper tail. This indicates that residuals are not normally distributed and that extreme values of poor mental health days occur more frequently than expected under normality. This non-normality is likely driven by the right-skewed and bounded nature of the outcome variable, which ranges from 0 to 30 days. While this violates the normality assumption, the relatively large sample size helps mitigate its impact on the reliability of the regression estimates.
Multicollinearity was assessed by examining correlations among the predictors. The correlation between the two numeric predictors, hours worked per week and hours of relaxation, was −0.196, indicating a weak negative relationship that is well below commonly used thresholds for concern. This suggests that multicollinearity is minimal and that the predictors do not strongly depend on one another. Categorical variables were included as factors and handled through dummy variable coding in the regression model, and there is no evidence that multicollinearity meaningfully affects the stability or interpretation of the regression coefficients.
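The reported correlation can be translated into a variance inflation factor (VIF), a common way to express how much collinearity inflates a coefficient's variance. The sketch below considers only the two numeric predictors and ignores the categorical ones, which is a simplifying assumption; a full VIF would regress each predictor on all of the others.

```r
# Correlation between hrs1 and hrsrelax, copied from the cor() output.
r <- -0.1962775

# With two predictors, the R-squared from regressing one on the other is r^2,
# so the variance inflation factor is 1 / (1 - r^2).
vif_two_numeric <- 1 / (1 - r^2)
print(round(vif_two_numeric, 3))
```

A value this close to 1 means the correlation between the two numeric predictors inflates their coefficient variances by only a few percent.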
The results of this project show that time use and educational attainment are related to differences in reported mental health outcomes in the United States. Among the time-use predictors, only hours of relaxation has a statistically significant association with mental health, with each additional hour of relaxation associated with about 0.24 fewer days of poor mental health. In contrast, hours worked per week does not show a statistically significant relationship with mental health once other factors are taken into account. Education also appears to matter: respondents with a high school or junior college degree report significantly more days of poor mental health than those with a bachelor's degree, the baseline group. Overall, the adjusted R-squared value of approximately 0.023 indicates that these predictors explain only a small portion of the variation in mental health outcomes, which is expected given that mental health is influenced by many psychological, social, and environmental factors not captured in this model.
However, the diagnostic checks reveal important limitations. The Residuals vs. Fitted and Scale–Location plots show increasing spread in the residuals at higher fitted values, indicating heteroscedasticity. The Normal Q–Q plot shows clear deviations from normality in the upper tail, reflecting the right-skewed and bounded nature of the outcome variable and the presence of individuals reporting unusually high numbers of poor mental health days. While these violations do not invalidate the regression model, they suggest that mental health outcomes are difficult to model with a simple linear structure and that the results should be interpreted with caution.
Future work could improve this analysis by incorporating additional variables such as income, age, marital status, physical health, and social support, all of which are likely to influence mental health outcomes. Including interaction terms, such as between education and relaxation time, could reveal whether the benefits of relaxation differ across social groups. Despite its limitations, this project provides a valuable first look at how time use and education relate to mental health and highlights important directions for deeper and more detailed analysis.
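As an illustration of the interaction idea mentioned above, the sketch below fits a model in which the relaxation slope is allowed to differ by education group. The data are simulated and entirely hypothetical (variable names `relax`, `educ`, and `days`, the two groups, and the slopes are all invented for the example), so the output says nothing about the GSS respondents.

```r
set.seed(42)

# Simulated, hypothetical data; not GSS values.
n <- 300
relax <- runif(n, 0, 8)
educ <- factor(sample(c("BACHELOR", "HIGH SCHOOL"), n, replace = TRUE))

# By construction the relaxation slope is steeper for the HIGH SCHOOL group.
slope <- ifelse(educ == "HIGH SCHOOL", -0.8, -0.2)
days <- 6 + slope * relax + rnorm(n, sd = 1)

# relax * educ expands to relax + educ + relax:educ.
fit_int <- lm(days ~ relax * educ)
coef_names <- names(coef(fit_int))
print(coef(summary(fit_int)))
```

The `relax:educHIGH SCHOOL` coefficient estimates how much the relaxation slope in the HIGH SCHOOL group differs from the slope in the reference group.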
Dataset: 2010 General Social Survey (gss2010), obtained from openintro.org. Original source: US 2010 General Social Survey.