1. Variable Selection & Research Question:

(a) Identify one categorical variable (IV)

The categorical variable for this analysis is marij_month, which asks respondents whether they used marijuana in the past 30 days (Yes or No).

(b) Identify one continuous variable (DV)

The continuous variable for this analysis is k6score, which represents a person's risk for serious mental illness. Higher scores indicate higher risk.

(c) Hypothesis

I hypothesize that there is a relationship between marij_month and k6score: a respondent's risk for serious mental illness depends on whether they used marijuana in the past 30 days.

2. Data Prep:

(a) Load Packages & Import Data

Load the necessary packages, then import the data into R and name it Drug_Use_Health_Data.

library(readr)
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
Drug_Use_Health_Data = read_csv("/Users/sakif/Downloads/SOC333_NSDUH_2016.csv")
## 
## ── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────
## cols(
##   .default = col_character(),
##   Nervous = col_double(),
##   Hopeless = col_double(),
##   Restless = col_double(),
##   Effort = col_double(),
##   Sad = col_double(),
##   Worthless = col_double(),
##   k6score = col_double()
## )
## ℹ Use `spec()` for the full column specifications.
head(Drug_Use_Health_Data)
## # A tibble: 6 x 20
##   sexident Nervous Hopeless Restless Effort   Sad Worthless k6score k6category
##   <chr>      <dbl>    <dbl>    <dbl>  <dbl> <dbl>     <dbl>   <dbl> <chr>     
## 1 <NA>          NA       NA       NA     NA    NA        NA      NA <NA>      
## 2 Straight       0        0        0     NA     0         0      NA <NA>      
## 3 Straight       2        1        1      0     0         0       4 Low Risk  
## 4 <NA>          NA       NA       NA     NA    NA        NA      NA <NA>      
## 5 Straight       1        3        2      2     1         2      11 MMD       
## 6 Straight       2        1        1      2     1         1       8 MMD       
## # … with 11 more variables: marij_month <chr>, cocaine_month <chr>,
## #   crack_month <chr>, heroin_month <chr>, hallucinogen_month <chr>,
## #   inhalant_month <chr>, meth_month <chr>, painrelieve_month <chr>,
## #   tranq_month <chr>, stimulant_month <chr>, sedative_month <chr>

(b) Data Filtering & Storing

Keep only the k6score and marij_month columns, and drop respondents with a missing k6score. Store the filtered data in a new object called Marijuana.

Marijuana = Drug_Use_Health_Data %>%
  select(k6score, marij_month) %>%
  filter(!is.na(k6score))

Marijuana
## # A tibble: 42,927 x 2
##    k6score marij_month
##      <dbl> <chr>      
##  1       4 No         
##  2      11 No         
##  3       8 No         
##  4       0 No         
##  5       1 No         
##  6       4 No         
##  7       0 No         
##  8       0 No         
##  9       0 No         
## 10       0 No         
## # … with 42,917 more rows
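
The filter above removes only rows with a missing k6score. As a minimal sketch, assuming a small made-up data frame (toy_data and toy_clean are hypothetical names, not the NSDUH data), rows missing either variable could be dropped in one step:

```r
library(dplyr)

# Hypothetical toy data standing in for Drug_Use_Health_Data
toy_data <- data.frame(
  k6score     = c(4, NA, 8, 0, 11),
  marij_month = c("No", "Yes", NA, "No", "Yes")
)

# Keep the two analysis variables, then drop rows missing either one
toy_clean <- toy_data %>%
  select(k6score, marij_month) %>%
  filter(!is.na(k6score), !is.na(marij_month))

toy_clean
```

In this toy example, the row with a missing k6score and the row with a missing marij_month are both removed.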

3. Comparison of Means:

(a) Table

Compare the mean of the continuous variable (k6score) between the two groups (Yes and No).

Marijuana %>%
  group_by(marij_month) %>%
  summarise(Avg_Marijuana = mean(k6score))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
##   marij_month Avg_Marijuana
##   <chr>               <dbl>
## 1 No                   4.16
## 2 Yes                  6.43
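
A mean is easier to judge next to the spread and size of each group. The sketch below uses made-up scores (toy and toy_summary are hypothetical names) to show how the standard deviation and group size could be added to the same summarise() call:

```r
library(dplyr)

# Hypothetical toy data: four "No" respondents and three "Yes" respondents
toy <- data.frame(
  k6score     = c(0, 2, 4, 6, 3, 9, 12),
  marij_month = c("No", "No", "No", "No", "Yes", "Yes", "Yes")
)

# Mean, standard deviation, and number of respondents per group
toy_summary <- toy %>%
  group_by(marij_month) %>%
  summarise(
    Avg_k6 = mean(k6score),
    SD_k6  = sd(k6score),
    N      = n(),
    .groups = "drop"  # also silences the ungrouping message
  )

toy_summary
```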

(b) Visualization

Visualize the mean of the continuous variable (k6score) for the two groups.

Marijuana %>%
  group_by(marij_month) %>%
  summarise(Avg_Marijuana = mean(k6score)) %>%
  ggplot()+
  geom_col(aes(x = marij_month, y = Avg_Marijuana, fill = marij_month))
## `summarise()` ungrouping output (override with `.groups` argument)

(c) Interpretation

The visualization shows that respondents who used marijuana in the past 30 days have a higher mean k6score (6.43) than respondents who did not (4.16). On average, past-month marijuana users are therefore at higher risk for serious mental illness.

4. Comparison of Distributions:

(a) Visualization

Visualize the distribution of responses to the continuous variable (k6score) with a separate histogram for each of the two groups.

Marijuana %>%
  ggplot()+
  geom_histogram(aes(x = k6score, fill = marij_month)) +
  facet_wrap(~marij_month)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
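
The message above suggests picking a binwidth. Because k6score is a sum of whole-number item scores, one bar per score (binwidth = 1) is a natural choice. The sketch below uses simulated scores (toy is hypothetical, and the Poisson draws are only an assumption to get plausible-looking counts); scales = "free_y" is an optional design choice for groups of very different sizes:

```r
library(ggplot2)

set.seed(1)

# Hypothetical simulated scores, capped at 24 (six items scored 0 to 4)
toy <- data.frame(
  k6score     = pmin(c(rpois(200, 4), rpois(100, 6)), 24),
  marij_month = rep(c("No", "Yes"), times = c(200, 100))
)

# One bar per whole-number score, one panel per group
p <- ggplot(toy) +
  geom_histogram(aes(x = k6score, fill = marij_month), binwidth = 1) +
  facet_wrap(~marij_month, scales = "free_y") +
  labs(x = "k6score", y = "Number of respondents")

p
```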

(b) Interpretation

The x-axis of each histogram is the k6score, and the y-axis is the number of respondents with that score. The No panel shows respondents who did not use marijuana in the past 30 days, and the Yes panel shows respondents who did. In both panels, a score of 0 means the respondent scored 0 on all six items (Nervous, Hopeless, Restless, Effort, Sad, Worthless), and a higher score means a higher risk for serious mental illness; for example, a respondent with a score of 10 reported more distress than a respondent with a score of 0. Given the group means (6.43 for Yes versus 4.16 for No), the Yes distribution should sit further toward the higher scores than the No distribution, which is consistent with past-month marijuana users having a higher risk for serious mental illness.

5. Sampling Distribution & T-test:

(a) Sampling Distribution

Produce two new data objects: one that contains only the first group (Yes) and one that contains only the second group (No). For each group, draw 10,000 samples of 40 respondents and calculate the mean of the continuous variable (k6score) for each of those 10,000 samples. Store these 10,000 means in new objects, then plot both sampling distributions on the same axes.

Yes = Marijuana %>%
  filter(marij_month == "Yes")

No = Marijuana %>%
  filter(marij_month == "No")

Yes_Sample_Dist = replicate(10000, sample(Yes$k6score, 40) %>%
  mean(na.rm = TRUE)) %>%
  data.frame() %>%
  rename("mean" = 1)

No_Sample_Dist = replicate(10000, sample(No$k6score, 40) %>%
  mean(na.rm = TRUE)) %>%
  data.frame() %>%
  rename("mean" = 1)

ggplot()+
  geom_histogram(data = Yes_Sample_Dist, aes(x = mean), fill = "red") +
  geom_histogram(data = No_Sample_Dist, aes(x = mean), fill = "blue")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
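
To see why the sampling distributions are useful, it helps to check two properties: the sample means center on the group mean, and their spread is close to the standard deviation divided by the square root of the sample size. The sketch below checks both on a made-up population of scores (toy_scores and toy_means are hypothetical, and the uniform draw is only an assumption for illustration):

```r
set.seed(42)

# Hypothetical population of 5,000 scores between 0 and 24
toy_scores <- sample(0:24, 5000, replace = TRUE)

# 10,000 samples of 40 scores each, keeping the mean of every sample
toy_means <- replicate(10000, mean(sample(toy_scores, 40)))

# The mean of the sample means should be close to the population mean
c(population_mean = mean(toy_scores), mean_of_sample_means = mean(toy_means))

# The spread of the sample means should be close to sd / sqrt(n)
c(theoretical_se = sd(toy_scores) / sqrt(40), sd_of_sample_means = sd(toy_means))
```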

(b) T-test

Below are the results of the t-test. It tells us whether the difference in means between the two groups, whose sampling distributions are approximately normal, is statistically significant.

t.test(k6score ~ marij_month, data = Marijuana)
## 
##  Welch Two Sample t-test
## 
## data:  k6score by marij_month
## t = -28.099, df = 6078.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.434468 -2.116930
## sample estimates:
##  mean in group No mean in group Yes 
##          4.155773          6.431472
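
The t statistic in a Welch test is the difference in group means divided by its standard error, with the first group alphabetically (here No) minus the second (Yes), which is why t is negative above. The sketch below reproduces that calculation by hand on simulated data (toy_no and toy_yes are hypothetical, and the normal draws are only an assumption) and compares it with t.test():

```r
set.seed(7)

# Hypothetical simulated scores for two groups of different sizes
toy_no  <- rnorm(60, mean = 4, sd = 4)
toy_yes <- rnorm(30, mean = 6, sd = 5)

# Per-group variance of the mean
v_no  <- var(toy_no)  / length(toy_no)
v_yes <- var(toy_yes) / length(toy_yes)

# Welch t statistic: difference in means over its standard error
t_manual <- (mean(toy_no) - mean(toy_yes)) / sqrt(v_no + v_yes)

# Welch-Satterthwaite degrees of freedom
df_manual <- (v_no + v_yes)^2 /
  (v_no^2 / (length(toy_no) - 1) + v_yes^2 / (length(toy_yes) - 1))

# t.test() runs the Welch test by default (var.equal = FALSE)
res <- t.test(toy_no, toy_yes)

c(t_manual = t_manual, t_from_t.test = unname(res$statistic))
c(df_manual = df_manual, df_from_t.test = unname(res$parameter))
```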

(c) Interpret

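There is a statistically significant difference in mean k6score between respondents who used marijuana in the past 30 days (Yes, mean 6.43) and respondents who did not (No, mean 4.16): t = -28.099, p < 2.2e-16. The 95 percent confidence interval for the difference (No minus Yes) runs from -2.43 to -2.12, so the mean k6score of the Yes group is estimated to be about 2.1 to 2.4 points higher. This supports the hypothesis that risk for serious mental illness is related to marijuana use in the past 30 days.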