We are importing the tidyverse package and we are merely glimpsing at the data, the data contains one numerical age variable and many categorical variable. There are 751 observations and each case represents an employer’s responses to various questions regarding mental health and their industry’s attitude towards mental health. This data collected from respondents answering questions on a survey so the study is observational. My research question was does age played an influence in respondents seeking treatment.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.7
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
data <-read.csv("https://raw.githubusercontent.com/AldataSci/Data606Project/main/survey.csv",header=TRUE)
head(data)
## Timestamp Age Gender Country state self_employed
## 1 2014-08-27 11:29:31 37 Female United States IL <NA>
## 2 2014-08-27 11:29:37 44 M United States IN <NA>
## 3 2014-08-27 11:29:44 32 Male Canada <NA> <NA>
## 4 2014-08-27 11:29:46 31 Male United Kingdom <NA> <NA>
## 5 2014-08-27 11:30:22 31 Male United States TX <NA>
## 6 2014-08-27 11:31:22 33 Male United States TN <NA>
## family_history treatment work_interfere no_employees remote_work
## 1 No Yes Often 6-25 No
## 2 No No Rarely More than 1000 No
## 3 No No Rarely 6-25 No
## 4 Yes Yes Often 26-100 No
## 5 No No Never 100-500 Yes
## 6 Yes No Sometimes 6-25 No
## tech_company benefits care_options wellness_program seek_help anonymity
## 1 Yes Yes Not sure No Yes Yes
## 2 No Don't know No Don't know Don't know Don't know
## 3 Yes No No No No Don't know
## 4 Yes No Yes No No No
## 5 Yes Yes No Don't know Don't know Don't know
## 6 Yes Yes Not sure No Don't know Don't know
## leave mental_health_consequence phys_health_consequence
## 1 Somewhat easy No No
## 2 Don't know Maybe No
## 3 Somewhat difficult No No
## 4 Somewhat difficult Yes Yes
## 5 Don't know No No
## 6 Don't know No No
## coworkers supervisor mental_health_interview phys_health_interview
## 1 Some of them Yes No Maybe
## 2 No No No No
## 3 Yes Yes Yes Yes
## 4 Some of them No Maybe Maybe
## 5 Some of them Yes Yes Yes
## 6 Yes Yes No Maybe
## mental_vs_physical obs_consequence comments
## 1 Yes No <NA>
## 2 Don't know No <NA>
## 3 No No <NA>
## 4 No Yes <NA>
## 5 Don't know No <NA>
## 6 Don't know No <NA>
Let’s took a look at the distribution of the ages of the respondents. First I had to clean up the data since we had negative values for ages. Then I looked at the distribution of the histogram.
Clean_data <- data %>%
filter(Age > 18 & Age < 100)
The distribution looks left-skewed where the majority of respondents are in their 20 to 40s.
ggplot(Clean_data,aes(x=Age)) +
geom_histogram(bins=35) +
labs(
title ="Distribution of the Age of Respondents"
)
summary(Clean_data$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19.00 27.00 31.00 32.16 36.00 72.00
Here I made a box plot between those who sought out treatment and those who didn’t. The boxplot marked those that are in their older years as outline.
## Boxplot marked the older ages as outlier and most of the data is cocentrated arounds 20s to 40s..
ggplot(Clean_data,mapping=aes(x=Age,y=treatment)) +
geom_boxplot() +
labs(
title="Relationship between Ages and Seeking Help in the US"
)
Since my research question focused on seeing if age played an influence if people seeker out treatment I’ve decided to do a logistic regression analysis since my explanatory variable or independent variable are the age of the respondents and the y variable is two level categorical variable which is yes and no.
We can assume that the observations are independent since the data came from a survey where each respondents answered the survey once. The logistic regression is the dependent variable is binary where the response is yes or No. For linearity we can assume that the independent variable are linearly related to the log odds.
Clean_data$treatment<-as.factor(Clean_data$treatment)
model <- glm(treatment ~ Age,data=Clean_data,family = binomial(link="logit"))
probabilites <- predict(model,type="response")
logit <- log(probabilites/(1-probabilites))
Calculating the probability of the model we can see that the model predicts there is a fairly linear relationships between the log odds of seeking treatments and the age of the respondents.
ggplot(Clean_data,aes(logit,Age)) +
geom_line(col="red") +
geom_point(col="blue") +
theme_bw() +
labs(
title = "Calculating log odds probablities of seeking treatment"
)
Using the logistical regression model, according to the model the formula is: the log odds of a person seeking treatment is: log(p/(1-p)) = -0.632247 + 0.020374 * Age So if I am interpreting this correctly as you get older you are 0.02 likely to sought out treatment.. or the odds of a person seeking treatment is 0.531 + 1.02* Age
summary(model)
##
## Call:
## glm(formula = treatment ~ Age, family = binomial(link = "logit"),
## data = Clean_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4818 -1.1771 0.9704 1.1690 1.2831
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.632247 0.261601 -2.417 0.0157 *
## Age 0.020374 0.007954 2.562 0.0104 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1724.4 on 1243 degrees of freedom
## Residual deviance: 1717.7 on 1242 degrees of freedom
## AIC: 1721.7
##
## Number of Fisher Scoring iterations: 3
Since the regression table doesn’t give us a R2 I enlisted the help of stat quest to help calculate the psuedo R^2 and the p-value for the model. Calculating the psuedo R^2 we get a value of 0.03 in other words 0.01 of people seeking treatment is explained by age. Thank you to Stat quest for helping me get the R squared value for the regression model.
## Calculating the Pseudo R^2 and the p-value:
## calculating McFadden's Psuedo R^2
## pull the log-likelihood of the null model out of the logistic varaible by getting the value for the null deviance and dividing by -2 (null deviance is the total sum of squares)
ll.null <- model$null.deviance/-2
## pull the log-likelihood for the fancy model out of the logistic variable by getting the value for the residual deviance and dividing by -2 (residual deviance is the residual sum of squares)
## So the 0.01 of the people seeking treatment is barely explained by age..
ll.proposed <- model$deviance/-2
(ll.null-ll.proposed)/ll.null
## [1] 0.003856443
Prediction <- data.frame(
probab_of_treatment=model$fitted.values,
treatment=Clean_data$treatment
)
Prediction <- Prediction[order(Prediction$probab_of_treatment,decreasing = FALSE),]
Prediction$rank <- 1:nrow(Prediction)
ggplot(data=Prediction,aes(x=rank,y=probab_of_treatment)) +
geom_point(aes(color=treatment)) +
xlab("Rank of Respondents") +
ylab("predicted probablity of seeking treatment")
Using Age as a predictor to predict mental health treatment wasn’t really a good predictor for predicting user mental health outreach. This supports the idea that age and mental health outreach aren’t really correlated.Even thought the p-value was less than 0.05 it isn’t the most important value since the p-value is heavily influenced by the sample size. Moving forward I would mostly likely use other predictors to predict the relationship between mental health outreach like influence of gender on mental health. Consequently, the survey results may not be indicative of the true population of the tech workers and respondents are more likely to answer the survey if they are interested in mental health.
Links to the sources that has helped me out for calculating the R squared and the value.