Reading and Cleaning the Data

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.7
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.1     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
data <- read.csv("https://raw.githubusercontent.com/AldataSci/Data606Project/main/survey.csv", header = TRUE)


data <- data %>%
  select("Age","Gender","Country",
         "state","seek_help", "treatment","benefits","mental_vs_physical","remote_work") %>%
  filter(Country=="United States")

Research Question:

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.

Does the frequency of respondents reaching out for help vary by age?

Cases:

What are the cases, and how many are there?

There are 751 observations, and each case represents one respondent's answers to various questions regarding mental health.
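The case count can be confirmed by counting rows after the country filter. The sketch below uses a small made-up data frame (`toy`, which is hypothetical and not the survey data) and base R so that it runs without downloading anything. On the real data, `nrow(data)` after the `filter()` step should reproduce the figure quoted above.

```r
# Hypothetical toy data standing in for the survey; all values are made up.
toy <- data.frame(
  Age     = c(25, 31, 40, 29),
  Country = c("United States", "Canada", "United States", "United States"),
  stringsAsFactors = FALSE
)

# Keep only respondents from the United States, mirroring the filter() step.
us_only <- toy[toy$Country == "United States", ]

# Each row is one case, so the row count is the number of cases.
n_cases <- nrow(us_only)
n_cases
```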

Data Collection:

Describe the method of data collection.

The dataset was found on Kaggle, but the original data were collected by Open Sourcing Mental Illness (OSMI) and can be downloaded from its website.

Type of Study

What type of study is this (observational/experiment)?

The data were collected from respondents answering questions on a survey, so the study is observational.

Data Source

The dataset is available on Kaggle; the link is here.

Response

The response variable is the frequency of respondents reaching out for help, and it is categorical.

Explanatory

The explanatory variable is the age of the respondents, and it is numerical.
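With a categorical response and a numerical explanatory variable, one way to look at the relationship is to summarize age within each response category. This is a minimal sketch on made-up data: the `toy` data frame and its values are hypothetical, and only the column names mirror the survey's `Age` and `seek_help`.

```r
# Hypothetical toy data; values are made up for illustration only.
toy <- data.frame(
  Age       = c(24, 30, 36, 28, 41, 33),
  seek_help = c("Yes", "No", "Yes", "No", "Don't know", "Yes"),
  stringsAsFactors = FALSE
)

# Mean age within each response category.
mean_age_by_response <- tapply(toy$Age, toy$seek_help, mean)
mean_age_by_response

# Number of respondents in each category, for context.
table(toy$seek_help)
```

A side-by-side boxplot, for example `boxplot(Age ~ seek_help, data = data)`, would show the same comparison graphically.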

Relevant Summary Statistics

summary(data$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -29.00   27.50   32.00   33.33   37.50  329.00
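The minimum of -29 and the maximum of 329 cannot be real ages, so those entries are worth handling before the analysis. The following is a minimal sketch, assuming a plausible range of 18 to 100 years. The cutoffs are an assumption and not part of the original analysis, and `ages` is a made-up vector rather than the survey column.

```r
# Hypothetical ages, including the two impossible extremes seen in the summary.
ages <- c(-29, 23, 27, 32, 38, 45, 329)

# Assumed plausible range for survey respondents; adjust as needed.
min_age <- 18
max_age <- 100

# Flag values inside the range, then keep only those.
plausible  <- ages >= min_age & ages <= max_age
clean_ages <- ages[plausible]

sum(!plausible)      # number of entries dropped
summary(clean_ages)  # summary without the impossible values
```

With dplyr, the same idea applied to the survey would look like `filter(data, Age >= 18, Age <= 100)`.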
ggplot(data, aes(x = Age)) +
  geom_histogram() +
  labs(title = "Age of Respondents") +
  xlim(0, 100)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 3 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).

ggplot(data, aes(x = seek_help)) +
  geom_bar() +
  labs(title = "Would Respondents Seek Help?", x = "Seek Help")
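The bar chart shows counts per category; the same information can be tabulated as counts and proportions. This sketch uses a made-up response vector (`responses` is hypothetical, not the survey column).

```r
# Hypothetical responses; values are made up for illustration only.
responses <- c("Yes", "No", "No", "Don't know", "Yes", "No", "No", "Yes")

# Count of respondents in each category.
counts <- table(responses)

# Share of respondents in each category (sums to 1).
props <- prop.table(counts)

counts
round(props, 3)
```

On the survey data, `responses` would be replaced with `data$seek_help`.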