The categorical variable for this analysis is marij_month, which records whether the respondent used marijuana in the past 30 days.
The continuous variable for this analysis is k6score, which represents a person's risk for serious mental illness.
I hypothesize that there is a relationship between marij_month and k6score: a respondent's risk for serious mental illness depends on whether they used marijuana in the past 30 days.
Load the necessary packages, then import the data into R and name it Drug_Use_Health_Data.
library(readr)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Drug_Use_Health_Data = read_csv("/Users/sakif/Downloads/SOC333_NSDUH_2016.csv")
##
## ── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────
## cols(
## .default = col_character(),
## Nervous = col_double(),
## Hopeless = col_double(),
## Restless = col_double(),
## Effort = col_double(),
## Sad = col_double(),
## Worthless = col_double(),
## k6score = col_double()
## )
## ℹ Use `spec()` for the full column specifications.
head(Drug_Use_Health_Data)
## # A tibble: 6 x 20
## sexident Nervous Hopeless Restless Effort Sad Worthless k6score k6category
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 <NA> NA NA NA NA NA NA NA <NA>
## 2 Straight 0 0 0 NA 0 0 NA <NA>
## 3 Straight 2 1 1 0 0 0 4 Low Risk
## 4 <NA> NA NA NA NA NA NA NA <NA>
## 5 Straight 1 3 2 2 1 2 11 MMD
## 6 Straight 2 1 1 2 1 1 8 MMD
## # … with 11 more variables: marij_month <chr>, cocaine_month <chr>,
## # crack_month <chr>, heroin_month <chr>, hallucinogen_month <chr>,
## # inhalant_month <chr>, meth_month <chr>, painrelieve_month <chr>,
## # tranq_month <chr>, stimulant_month <chr>, sedative_month <chr>
Keep only the k6score and marij_month columns, and drop respondents whose k6score is missing. Store the filtered data in a new object called Marijuana.
Marijuana = Drug_Use_Health_Data %>%
  select(k6score, marij_month) %>%
  filter(!is.na(k6score))
Marijuana
## # A tibble: 42,927 x 2
## k6score marij_month
## <dbl> <chr>
## 1 4 No
## 2 11 No
## 3 8 No
## 4 0 No
## 5 1 No
## 6 4 No
## 7 0 No
## 8 0 No
## 9 0 No
## 10 0 No
## # … with 42,917 more rows
Compare the mean of the continuous variable (k6score) between the two groups.
Marijuana %>%
  group_by(marij_month) %>%
  summarise(Avg_Marijuana = mean(k6score))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
## marij_month Avg_Marijuana
## <chr> <dbl>
## 1 No 4.16
## 2 Yes 6.43
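A mean on its own does not show how large or how spread out each group is. The following is a minimal sketch, not part of the original analysis, that reports the group size and standard deviation next to the mean. It uses a small hypothetical data frame named toy in place of Marijuana and only base R functions, so it runs without the data file.

```r
# Hypothetical toy data standing in for the Marijuana object (not NSDUH values)
toy <- data.frame(
  marij_month = c("No", "No", "No", "No", "Yes", "Yes", "Yes"),
  k6score     = c(0, 4, 8, 4, 11, 6, 4)
)

# Split the scores into one vector per level of marij_month
groups <- split(toy$k6score, toy$marij_month)

# Group size, mean, and standard deviation for each group
group_summary <- data.frame(
  marij_month = names(groups),
  n           = sapply(groups, length),
  mean        = sapply(groups, mean),
  sd          = sapply(groups, sd),
  row.names   = NULL
)
print(group_summary)
```

With the real data, the same columns could be added inside the existing summarise() call using n() and sd(k6score).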
Visualize the mean of the continuous variable for the two groups.
Marijuana %>%
  group_by(marij_month) %>%
  summarise(Avg_Marijuana = mean(k6score)) %>%
  ggplot() +
  geom_col(aes(x = marij_month, y = Avg_Marijuana, fill = marij_month))
## `summarise()` ungrouping output (override with `.groups` argument)
The visualization shows that respondents who used marijuana in the past 30 days have a higher mean k6score (6.43) than respondents who did not (4.16), that is, a higher average risk for serious mental illness.
Visualize the distribution of the continuous variable with a separate histogram for each group.
Marijuana %>%
  ggplot() +
  geom_histogram(aes(x = k6score, fill = marij_month)) +
  facet_wrap(~marij_month)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The x-axis of each histogram is the k6score, which is the sum of the six symptom items (Nervous, Hopeless, Restless, Effort, Sad, Worthless). The y-axis is the number of respondents with that score. The No panel shows respondents who did not use marijuana in the past 30 days, and the Yes panel shows respondents who did. A score of 0 means the respondent reported none of the six symptoms, and higher scores indicate a higher risk for serious mental illness. Because marij_month only records whether a respondent used marijuana, not on how many days, the histograms cannot show how risk changes with the number of days of use. What they can show is whether scores in the Yes group are shifted toward higher values than scores in the No group, which would be consistent with the higher mean found for the Yes group.
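The stat_bin() message suggests choosing a bin width. Since k6score takes whole-number values, one bin per score (binwidth = 1 in geom_histogram) amounts to counting respondents at each score. The sketch below illustrates that idea with a hypothetical data frame named toy and base R only. It also converts the counts to within-group proportions, which makes the two panels easier to compare if the groups differ in size.

```r
# Hypothetical toy data standing in for the Marijuana object (not NSDUH values)
toy <- data.frame(
  marij_month = c("No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"),
  k6score     = c(0, 0, 4, 8, 4, 6, 6, 11)
)

# Count respondents at each score within each group (one bin per integer score)
counts <- table(toy$marij_month, toy$k6score)

# Convert counts to within-group proportions so that each row sums to 1
props <- prop.table(counts, margin = 1)
print(round(props, 2))
```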
Produce two new data objects: one containing only the first group (Yes) and one containing only the second group (No). For each group, draw 10,000 samples of 40 respondents and calculate the mean k6score of each sample. Store these 10,000 means in new objects, then plot the two sampling distributions together.
# Split the data into one object per group
Yes = Marijuana %>%
  filter(marij_month == "Yes")
No = Marijuana %>%
  filter(marij_month == "No")
# Draw 10,000 samples of 40 respondents per group and keep each sample's mean
Yes_Sample_Dist = replicate(10000, sample(Yes$k6score, 40) %>%
                              mean(na.rm = TRUE)) %>%
  data.frame() %>%
  rename("mean" = 1)
No_Sample_Dist = replicate(10000, sample(No$k6score, 40) %>%
                             mean(na.rm = TRUE)) %>%
  data.frame() %>%
  rename("mean" = 1)
ggplot() +
  geom_histogram(data = Yes_Sample_Dist, aes(x = mean), fill = "red") +
  geom_histogram(data = No_Sample_Dist, aes(x = mean), fill = "blue")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
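One way to check that the sampling distributions behave as expected is to compare the spread of the sample means with the standard error predicted by the central limit theorem, which is the standard deviation of the scores divided by the square root of the sample size. The sketch below does this for a simulated, right-skewed set of scores. The population vector, the Poisson distribution, and the seed are all assumptions made for illustration and are not taken from the NSDUH data.

```r
set.seed(333)  # arbitrary seed so the sketch is reproducible

# Hypothetical right-skewed integer scores standing in for one group's k6score
population <- rpois(5000, lambda = 5)

# 10,000 samples of 40 respondents, keeping the mean of each sample
sample_means <- replicate(10000, mean(sample(population, 40)))

# The standard deviation of the sample means should be close to
# sd(population) / sqrt(40), even though the scores themselves are skewed
observed_se <- sd(sample_means)
expected_se <- sd(population) / sqrt(40)
print(c(observed = observed_se, expected = expected_se))
```

Running the same comparison on Yes_Sample_Dist$mean and sd(Yes$k6score) / sqrt(40) would apply the check to the real data.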
Below are the results of the t-test. It tests whether the difference between the two group means is statistically significant, which is appropriate here because the sampling distributions of the means are approximately normal.
t.test(k6score ~ marij_month, data = Marijuana)
##
## Welch Two Sample t-test
##
## data: k6score by marij_month
## t = -28.099, df = 6078.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.434468 -2.116930
## sample estimates:
## mean in group No mean in group Yes
## 4.155773 6.431472
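There is a statistically significant difference in mean k6score between respondents who used marijuana in the past 30 days (Yes, 6.43) and those who did not (No, 4.16): t = -28.099, p-value < 2.2e-16. The 95 percent confidence interval for the difference (No minus Yes) runs from -2.43 to -2.12, so the Yes group scores about 2.1 to 2.4 points higher on average. This supports the hypothesis that marij_month and k6score are related.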
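The t statistic and the non-integer degrees of freedom in the output come from the Welch version of the test, which does not assume equal variances in the two groups. As a rough sketch, the two values can be reproduced by hand and compared with t.test(). The scores below are simulated, and the names no_scores, yes_scores, and toy are hypothetical, so the numbers will not match the NSDUH results.

```r
set.seed(333)  # arbitrary seed so the sketch is reproducible

# Hypothetical scores for the two groups (simulated, not NSDUH values)
no_scores  <- rpois(400, lambda = 4)
yes_scores <- rpois(100, lambda = 6)

# Squared standard error of each group mean
v_no  <- var(no_scores)  / length(no_scores)
v_yes <- var(yes_scores) / length(yes_scores)

# Welch t statistic: difference in means (No minus Yes) over its standard error
t_manual <- (mean(no_scores) - mean(yes_scores)) / sqrt(v_no + v_yes)

# Welch-Satterthwaite approximation to the degrees of freedom
df_manual <- (v_no + v_yes)^2 /
  (v_no^2 / (length(no_scores) - 1) + v_yes^2 / (length(yes_scores) - 1))

# Same test through t.test(), using the formula interface from the analysis
toy <- data.frame(
  k6score     = c(no_scores, yes_scores),
  marij_month = rep(c("No", "Yes"), times = c(400, 100))
)
fit <- t.test(k6score ~ marij_month, data = toy)

print(c(t_manual = t_manual, t_builtin = unname(fit$statistic)))
print(c(df_manual = df_manual, df_builtin = unname(fit$parameter)))
```

The sign of t is negative whenever the No mean is smaller than the Yes mean, because t.test() subtracts the groups in alphabetical order.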