606 Project

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.7
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.1     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

data <- read.csv("https://raw.githubusercontent.com/AldataSci/Data606Project/main/survey.csv", header = TRUE)


data <- data %>%
  select("Age","Gender","Country",
         "state","seek_help", "treatment","benefits","mental_vs_physical","remote_work") %>%
  filter(Country=="United States")

## How to measure the independent variable and the dependent variable???

## Clean the data: keep only plausible ages (older than 18 and younger than 100)
Clean3 <- data %>%
  filter(Age > 18 & Age < 100)

summary(Clean3$Age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   19.00   28.00   32.00   33.19   38.00   72.00

ggplot(Clean3,aes(x=Age)) +
  geom_histogram() +
  labs(
    title ="Distribution of the Age of Respondents"
)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
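The message above goes away once a bin width is set explicitly. The sketch below is illustrative only: the `ages` data frame is made up so the example runs on its own, and the 5-year bin width is an assumed choice, not one taken from the original analysis.

```r
library(ggplot2)

# Hypothetical ages, used only so the example is self-contained
ages <- data.frame(Age = c(19, 23, 27, 28, 31, 32, 32, 35, 38, 44, 51, 72))

# binwidth = 5 groups respondents into 5-year age bands
p <- ggplot(ages, aes(x = Age)) +
  geom_histogram(binwidth = 5) +
  labs(title = "Distribution of the Age of Respondents")
print(p)
```

With the survey data, the same `binwidth` argument could be added to the `geom_histogram()` call on `Clean3`.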

ggplot(Clean3,aes(x=treatment)) + 
  geom_bar(fill="blue") + 
  labs(
    title="Number of Respondents Who Did and Did Not Seek Treatment"
  )

## Looking at the relationship between age and whether respondents sought treatment
ggplot(Clean3,mapping=aes(x=Age,y=treatment)) +
  geom_boxplot() +
  labs(
    title="Relationship between Age and Seeking Treatment in the US"
  )

## Dummy coding the categorical treatment variable so I can do regression analysis
## Using ifelse so that a "Yes" answer gets a value of 1 and any other answer gets 0

Clean3$Yes_treatment <- ifelse(Clean3$treatment == "Yes",1,0)

head(Clean3)

##   Age Gender       Country state  seek_help treatment   benefits
## 1  37 Female United States    IL        Yes       Yes        Yes
## 2  44      M United States    IN Don't know        No Don't know
## 3  31   Male United States    TX Don't know        No        Yes
## 4  33   Male United States    TN Don't know        No        Yes
## 5  35 Female United States    MI         No       Yes         No
## 6  42 Female United States    IL         No       Yes        Yes
##   mental_vs_physical remote_work Yes_treatment
## 1                Yes          No             1
## 2         Don't know          No             0
## 3         Don't know         Yes             0
## 4         Don't know          No             0
## 5         Don't know         Yes             1
## 6                 No          No             1
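The dummy coding above can be checked on a small made-up vector. This is a sketch, not part of the original analysis: the `answers` vector is hypothetical, and it shows that `ifelse()` maps "Yes" to 1, any other answer to 0, and leaves missing values as `NA`.

```r
# Hypothetical answers, not taken from the survey data
answers <- c("Yes", "No", "Yes", NA)

# 1 when the respondent answered "Yes", 0 otherwise; NA stays NA
coded <- ifelse(answers == "Yes", 1, 0)

# Cross-tabulate the original and coded values to confirm the mapping
print(table(answers, coded, useNA = "ifany"))
```

Running the same kind of cross-tabulation on `Clean3$treatment` and `Clean3$Yes_treatment` would confirm the coding on the real data.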

Linear Regression Model:

Linear regression assumes that the relationship between two variables x and y can be modeled by a straight line. Here y is a 0/1 indicator of whether the respondent sought treatment for their mental health and x is the respondent's age.

Interpretation: The fitted line is y = 0.477 + 0.002 * x. The slope for age is not statistically significant (p = 0.387). The multiple R-squared is 0.001 (adjusted R-squared: -0.000336), so age explains only about 0.1% of the variation in seeking treatment, which is effectively none. In other words, this is a poor model.

Model <- lm(Yes_treatment ~ Age,data=Clean3) 
summary(Model)

## 
## Call:
## lm(formula = Yes_treatment ~ Age, data = Clean3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6128 -0.5419  0.4310  0.4561  0.4832 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.477236   0.081924   5.825  8.5e-09 ***
## Age         0.002085   0.002406   0.866    0.387    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4983 on 741 degrees of freedom
## Multiple R-squared:  0.001012,   Adjusted R-squared:  -0.000336 
## F-statistic: 0.7508 on 1 and 741 DF,  p-value: 0.3865
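To see why the adjusted R-squared can be negative, it can be recomputed from the values printed in the summary above. This is a sketch under the assumption that the number of observations is the residual degrees of freedom plus the two estimated coefficients; the variable names are illustrative.

```r
r2 <- 0.001012         # Multiple R-squared from the summary output
df_resid <- 741        # residual degrees of freedom from the summary output
p <- 1                 # number of predictors (Age)
n <- df_resid + p + 1  # implied number of observations

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid
print(round(adj_r2, 6))
```

Because the multiple R-squared is so close to 0, the penalty for the single predictor is enough to push the adjusted value slightly below 0.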

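Here is a scatterplot of the data with the fitted regression line. Because the outcome only takes the values 0 and 1, the points fall on two horizontal bands, and the nearly flat line reflects the weak relationship with age.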

plot(Clean3$Age,Clean3$Yes_treatment)
abline(lm(Yes_treatment ~ Age,data=Clean3))

Statistical Method #2: T-test

T-test analysis:

I did the following hypothesis testing:

The null hypothesis is that the true difference in mean age between the two groups is equal to 0. The alternative hypothesis is that the true difference in means is not equal to 0.

The p-value is 0.39, which is not less than 0.05, so the difference is not statistically significant. We fail to reject the null hypothesis: the data do not provide evidence that the mean age differs between respondents who sought treatment and those who did not. This is not the same as showing that the difference is exactly 0.

We are 95% confident that the true difference in mean age between the two groups (sought treatment minus did not) lies between -0.61 and 1.58 years. Because this interval contains 0, it is consistent with the test result.

## Run a t-test to compare the mean age of respondents who did and did not seek treatment
Yes_data <- Clean3 %>%
  select(Age,treatment) %>%
  filter(treatment=="Yes")
No_data <- Clean3 %>%
  select(Age,treatment) %>%
  filter(treatment=="No")

t.test(Yes_data$Age,No_data$Age)

## 
##  Welch Two Sample t-test
## 
## data:  Yes_data$Age and No_data$Age
## t = 0.86795, df = 720.5, p-value = 0.3857
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.6126933  1.5837214
## sample estimates:
## mean of x mean of y 
##  33.41133  32.92582
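The quantities quoted in the interpretation can be pulled directly from the object that `t.test()` returns. The sketch below uses simulated ages, so `yes_age` and `no_age` are hypothetical stand-ins for `Yes_data$Age` and `No_data$Age`, and its numbers will not match the survey output.

```r
set.seed(606)  # arbitrary seed so the simulated example is reproducible

# Simulated ages; stand-ins for the two treatment groups
yes_age <- rnorm(50, mean = 33, sd = 7)
no_age  <- rnorm(50, mean = 33, sd = 7)

# Welch two-sample t-test (the default for t.test with two vectors)
res <- t.test(yes_age, no_age)

p_value <- res$p.value
ci <- as.numeric(res$conf.int)  # 95% CI for mean(yes_age) - mean(no_age)

# The interval contains 0 exactly when the test fails to reject at the 5% level
contains_zero <- ci[1] <= 0 && ci[2] >= 0
print(c(p_value = p_value, lower = ci[1], upper = ci[2]))
print(contains_zero)
```

The same fields (`p.value`, `conf.int`) are available on the result of `t.test(Yes_data$Age, No_data$Age)`.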

Conclusion:

This analysis mattered to me because I am personally invested in mental health, especially in the tech industry, and I wanted to know whether there was a relationship between age and seeking treatment for mental health. In the regression analysis, age explained essentially none of the variation in seeking treatment, and the t-test found no significant difference in mean age between the two groups. Together these results suggest that age is not associated with seeking treatment in this sample. One limitation is that the survey could not capture the whole population. People interested in mental health issues were probably more likely to respond than people with no interest in them, so the respondents may be a convenience sample, which could heavily influence the results.

Al Haque

4/19/2022
