library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.7
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
data <- read.csv("https://raw.githubusercontent.com/AldataSci/Data606Project/main/survey.csv", header = TRUE)
data <- data %>%
  select("Age", "Gender", "Country",
         "state", "seek_help", "treatment", "benefits", "mental_vs_physical", "remote_work") %>%
  filter(Country == "United States")
## Open question: how should the independent and dependent variables be measured?
## Clean the data: keep only ages strictly between 18 and 100 to drop implausible values
Clean3 <- data %>%
  filter(Age > 18 & Age < 100)
summary(Clean3$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19.00 28.00 32.00 33.19 38.00 72.00
ggplot(Clean3, aes(x = Age)) +
  geom_histogram() +
  labs(
    title = "Distribution of the Age of Respondents"
  )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(Clean3, aes(x = treatment)) +
  geom_bar(fill = "blue") +
  labs(
    title = "Number of Respondents by Treatment Status"
  )
## Looking at the relationship between age and whether respondents sought treatment
ggplot(Clean3, mapping = aes(x = Age, y = treatment)) +
  geom_boxplot() +
  labs(
    title = "Relationship between Age and Seeking Treatment in the US"
  )
## Dummy coding the categorical treatment variable so it can be used in a regression
## Using ifelse so that a "Yes" answer is coded as 1 and any other answer as 0
Clean3$Yes_treatment <- ifelse(Clean3$treatment == "Yes",1,0)
head(Clean3)
## Age Gender Country state seek_help treatment benefits
## 1 37 Female United States IL Yes Yes Yes
## 2 44 M United States IN Don't know No Don't know
## 3 31 Male United States TX Don't know No Yes
## 4 33 Male United States TN Don't know No Yes
## 5 35 Female United States MI No Yes No
## 6 42 Female United States IL No Yes Yes
## mental_vs_physical remote_work Yes_treatment
## 1 Yes No 1
## 2 Don't know No 0
## 3 Don't know Yes 0
## 4 Don't know No 0
## 5 Don't know Yes 1
## 6 No No 1
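The dummy coding step above can be illustrated on a small vector. This is a minimal sketch with hypothetical toy responses (not rows from the survey); the names `treatment`, `yes_treatment`, and `share_yes` are used here only for illustration.

```r
# Hypothetical toy responses, not taken from the survey
treatment <- c("Yes", "No", "No", "Yes", "Yes")

# Same ifelse() pattern as above: "Yes" becomes 1, everything else becomes 0
yes_treatment <- ifelse(treatment == "Yes", 1, 0)

# The mean of a 0/1 indicator is the share of "Yes" answers
share_yes <- mean(yes_treatment)

print(yes_treatment)
print(share_yes)
```

Because the indicator only takes the values 0 and 1, a regression on it models the share of respondents who answered "Yes".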
Linear regression assumes that the relationship between two variables x and y can be modeled by a straight line.
Interpretation: The fitted line is y = 0.477 + 0.002 * x, where y is the 0/1 indicator for whether the respondent sought treatment for their mental health and x is the respondent's age. The multiple R-squared is 0.001012 (adjusted R-squared: -0.000336), so age explains only about 0.1% of the variation in seeking treatment, and the slope for age is not statistically significant (p = 0.387). In other words, this is a poor model.
Model <- lm(Yes_treatment ~ Age,data=Clean3)
summary(Model)
##
## Call:
## lm(formula = Yes_treatment ~ Age, data = Clean3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.6128 -0.5419 0.4310 0.4561 0.4832
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.477236 0.081924 5.825 8.5e-09 ***
## Age 0.002085 0.002406 0.866 0.387
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4983 on 741 degrees of freedom
## Multiple R-squared: 0.001012, Adjusted R-squared: -0.000336
## F-statistic: 0.7508 on 1 and 741 DF, p-value: 0.3865
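The negative adjusted R-squared in the summary above can be reproduced from the multiple R-squared with the standard adjustment formula. The sketch below uses only the rounded numbers printed in the output, so the result agrees up to rounding; the variable names are illustrative.

```r
r2 <- 0.001012   # Multiple R-squared reported by summary(Model)
n  <- 741 + 2    # residual degrees of freedom plus the two estimated coefficients
p  <- 1          # number of predictors (Age)

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(round(adj_r2, 6))
```

The adjusted value drops below zero because the raw R-squared is smaller than the penalty for including one predictor, which is another sign that age adds essentially nothing to the model.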
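The plot below shows the data with the fitted regression line. Because Yes_treatment only takes the values 0 and 1, the points fall in two horizontal bands, and the fitted line is nearly flat.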
plot(Clean3$Age,Clean3$Yes_treatment)
abline(lm(Yes_treatment ~ Age,data=Clean3))
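To put the fitted line in concrete terms, the sketch below evaluates it at the youngest and oldest ages in the cleaned data (19 and 72, from the age summary). It uses the rounded coefficients printed in the model summary rather than the fitted `Model` object, so the values are approximate.

```r
intercept <- 0.477236   # (Intercept) estimate from summary(Model)
slope     <- 0.002085   # Age estimate from summary(Model)
ages      <- c(19, 72)  # minimum and maximum age in Clean3

# Fitted share of respondents seeking treatment at each age
fitted_share <- intercept + slope * ages

print(round(fitted_share, 3))
```

Across the whole age range the fitted share moves from roughly 0.52 to roughly 0.63, and since the slope is not statistically significant, even this small change may be noise.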
T-test analysis:
I tested the following hypotheses:
The null hypothesis is that the true difference in mean age between the two groups is equal to 0. The alternative hypothesis is that the true difference in mean age is not equal to 0.
The p-value is 0.386, which is not less than 0.05, so the difference in mean age between the two groups is not statistically significant. We therefore fail to reject the null hypothesis: the data do not provide evidence that the mean age of respondents who sought treatment differs from that of respondents who did not.
We are 95% confident that the true difference in mean age between respondents who sought treatment and those who did not lies between -0.61 and 1.58 years. Because this interval contains 0, it is consistent with the test result.
## Running a t-test to compare the mean age of the two groups: those who sought treatment and those who did not
Yes_data <- Clean3 %>%
  select(Age, treatment) %>%
  filter(treatment == "Yes")
No_data <- Clean3 %>%
  select(Age, treatment) %>%
  filter(treatment == "No")
t.test(Yes_data$Age,No_data$Age)
##
## Welch Two Sample t-test
##
## data: Yes_data$Age and No_data$Age
## t = 0.86795, df = 720.5, p-value = 0.3857
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.6126933 1.5837214
## sample estimates:
## mean of x mean of y
## 33.41133 32.92582
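The confidence interval in the output above can be rebuilt from the other printed quantities, which makes clear that it is an interval for the difference in mean age. This sketch assumes only the rounded values shown in the t-test output; the standard error is backed out from the t statistic, so the result agrees up to rounding.

```r
mean_yes <- 33.41133   # mean age, sought treatment (mean of x)
mean_no  <- 32.92582   # mean age, did not seek treatment (mean of y)
t_stat   <- 0.86795    # Welch t statistic
welch_df <- 720.5      # Welch degrees of freedom

diff_means <- mean_yes - mean_no
se_diff    <- diff_means / t_stat    # because t = difference / standard error
t_crit     <- qt(0.975, welch_df)    # two-sided 95% critical value

# 95% confidence interval for the difference in mean age
ci <- diff_means + c(-1, 1) * t_crit * se_diff

print(round(ci, 2))
```

The interval is centered on a difference of about half a year and spans zero, which matches the non-significant p-value.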
This analysis mattered to me because I am personally invested in mental health, especially in the tech industry, and I wanted to know whether there is a relationship between age and mental health. In the regression analysis, age explained almost none of the variation in seeking treatment (R-squared of about 0.001), and the t-test found no significant difference in mean age between respondents who sought treatment and those who did not. This suggests that, in this sample, age is not associated with seeking treatment; it does not show that age has no effect on mental health in general. The main limitation is that the survey could not capture the whole population. People interested in mental health issues were probably more likely to respond than people with no interest in them, so the respondents likely form a convenience sample, which may heavily influence these results.