Data 606 Project

Reading in the Data

We are importing the tidyverse package and we are merely glimpsing at the data, the data contains one numerical age variable and many categorical variable. There are 751 observations and each case represents an employer’s responses to various questions regarding mental health and their industry’s attitude towards mental health. This data collected from respondents answering questions on a survey so the study is observational. My research question was does age played an influence in respondents seeking treatment.

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.7
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.1     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

data <-read.csv("https://raw.githubusercontent.com/AldataSci/Data606Project/main/survey.csv",header=TRUE)
head(data)

##             Timestamp Age Gender        Country state self_employed
## 1 2014-08-27 11:29:31  37 Female  United States    IL          <NA>
## 2 2014-08-27 11:29:37  44      M  United States    IN          <NA>
## 3 2014-08-27 11:29:44  32   Male         Canada  <NA>          <NA>
## 4 2014-08-27 11:29:46  31   Male United Kingdom  <NA>          <NA>
## 5 2014-08-27 11:30:22  31   Male  United States    TX          <NA>
## 6 2014-08-27 11:31:22  33   Male  United States    TN          <NA>
##   family_history treatment work_interfere   no_employees remote_work
## 1             No       Yes          Often           6-25          No
## 2             No        No         Rarely More than 1000          No
## 3             No        No         Rarely           6-25          No
## 4            Yes       Yes          Often         26-100          No
## 5             No        No          Never        100-500         Yes
## 6            Yes        No      Sometimes           6-25          No
##   tech_company   benefits care_options wellness_program  seek_help  anonymity
## 1          Yes        Yes     Not sure               No        Yes        Yes
## 2           No Don't know           No       Don't know Don't know Don't know
## 3          Yes         No           No               No         No Don't know
## 4          Yes         No          Yes               No         No         No
## 5          Yes        Yes           No       Don't know Don't know Don't know
## 6          Yes        Yes     Not sure               No Don't know Don't know
##                leave mental_health_consequence phys_health_consequence
## 1      Somewhat easy                        No                      No
## 2         Don't know                     Maybe                      No
## 3 Somewhat difficult                        No                      No
## 4 Somewhat difficult                       Yes                     Yes
## 5         Don't know                        No                      No
## 6         Don't know                        No                      No
##      coworkers supervisor mental_health_interview phys_health_interview
## 1 Some of them        Yes                      No                 Maybe
## 2           No         No                      No                    No
## 3          Yes        Yes                     Yes                   Yes
## 4 Some of them         No                   Maybe                 Maybe
## 5 Some of them        Yes                     Yes                   Yes
## 6          Yes        Yes                      No                 Maybe
##   mental_vs_physical obs_consequence comments
## 1                Yes              No     <NA>
## 2         Don't know              No     <NA>
## 3                 No              No     <NA>
## 4                 No             Yes     <NA>
## 5         Don't know              No     <NA>
## 6         Don't know              No     <NA>

Exploratory Data Analysis:

Let’s took a look at the distribution of the ages of the respondents. First I had to clean up the data since we had negative values for ages. Then I looked at the distribution of the histogram.

Clean_data <- data %>%
  filter(Age > 18 & Age < 100)

The distribution looks left-skewed where the majority of respondents are in their 20 to 40s.

ggplot(Clean_data,aes(x=Age)) +
  geom_histogram(bins=35) +
  labs(
    title ="Distribution of the Age of Respondents"
)

summary(Clean_data$Age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   19.00   27.00   31.00   32.16   36.00   72.00

Here I made a box plot between those who sought out treatment and those who didn’t. The boxplot marked those that are in their older years as outline.

## Boxplot marked the older ages as outlier and most of the data is cocentrated arounds 20s to 40s.. 
ggplot(Clean_data,mapping=aes(x=Age,y=treatment)) +
  geom_boxplot() +
  labs(
    title="Relationship between Ages and Seeking Help in the US"
  )

Data Analysis:

Since my research question focused on seeing if age played an influence if people seeker out treatment I’ve decided to do a logistic regression analysis since my explanatory variable or independent variable are the age of the respondents and the y variable is two level categorical variable which is yes and no.

Verifying the conditions of logistic regression:

We can assume that the observations are independent since the data came from a survey where each respondents answered the survey once. The logistic regression is the dependent variable is binary where the response is yes or No. For linearity we can assume that the independent variable are linearly related to the log odds.

Clean_data$treatment<-as.factor(Clean_data$treatment)
model <- glm(treatment ~ Age,data=Clean_data,family = binomial(link="logit"))

probabilites <- predict(model,type="response")
logit <- log(probabilites/(1-probabilites))

Calculating the probability of the model we can see that the model predicts there is a fairly linear relationships between the log odds of seeking treatments and the age of the respondents.

ggplot(Clean_data,aes(logit,Age)) + 
  geom_line(col="red") +
  geom_point(col="blue") + 
  theme_bw() +
  labs(
    title = "Calculating log odds probablities of seeking treatment"
  )

Interpreting the glm model

Using the logistical regression model, according to the model the formula is: the log odds of a person seeking treatment is: log(p/(1-p)) = -0.632247 + 0.020374 * Age So if I am interpreting this correctly as you get older you are 0.02 likely to sought out treatment.. or the odds of a person seeking treatment is 0.531 + 1.02* Age

summary(model)

## 
## Call:
## glm(formula = treatment ~ Age, family = binomial(link = "logit"), 
##     data = Clean_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4818  -1.1771   0.9704   1.1690   1.2831  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.632247   0.261601  -2.417   0.0157 *
## Age          0.020374   0.007954   2.562   0.0104 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1724.4  on 1243  degrees of freedom
## Residual deviance: 1717.7  on 1242  degrees of freedom
## AIC: 1721.7
## 
## Number of Fisher Scoring iterations: 3

Calculating the Pseudo R^2 and the p-value(McFadden’s R2)

Since the regression table doesn’t give us a R2 I enlisted the help of stat quest to help calculate the psuedo R^2 and the p-value for the model. Calculating the psuedo R^2 we get a value of 0.03 in other words 0.01 of people seeking treatment is explained by age. Thank you to Stat quest for helping me get the R squared value for the regression model.

## Calculating the Pseudo R^2 and the p-value:
## calculating McFadden's Psuedo R^2

## pull the log-likelihood of the null model out of the logistic varaible by getting the value for the null deviance and dividing by -2 (null deviance is the total sum of squares)
ll.null <- model$null.deviance/-2

## pull the log-likelihood for the fancy model out of the logistic variable by getting the value for the residual deviance and dividing by -2 (residual deviance is the residual sum of squares)

## So the 0.01 of the people seeking treatment is barely explained by age.. 

ll.proposed <- model$deviance/-2


(ll.null-ll.proposed)/ll.null

## [1] 0.003856443

Plotting the Data:

Prediction <- data.frame(
  probab_of_treatment=model$fitted.values,
  treatment=Clean_data$treatment
)

Prediction <- Prediction[order(Prediction$probab_of_treatment,decreasing = FALSE),]
Prediction$rank <- 1:nrow(Prediction)

ggplot(data=Prediction,aes(x=rank,y=probab_of_treatment)) +
  geom_point(aes(color=treatment)) +
  xlab("Rank of Respondents") +
  ylab("predicted probablity of seeking treatment")

Conclusion:

Using Age as a predictor to predict mental health treatment wasn’t really a good predictor for predicting user mental health outreach. This supports the idea that age and mental health outreach aren’t really correlated.Even thought the p-value was less than 0.05 it isn’t the most important value since the p-value is heavily influenced by the sample size. Moving forward I would mostly likely use other predictors to predict the relationship between mental health outreach like influence of gender on mental health. Consequently, the survey results may not be indicative of the true population of the tech workers and respondents are more likely to answer the survey if they are interested in mental health.

Links to the sources that has helped me out for calculating the R squared and the value.

https://www.youtube.com/watch?v=C4N3_XJJ-jU&t=656s