Variable Selection & Research Question

  • Categorical variable: marij_month, whether the participant used marijuana within the past 30 days (users vs. non-users)
  • Continuous variable: k6score, the participant's risk of serious mental illness, scored from 0 to 24
  • Using the attached data, I will investigate the risk of serious mental illness for participants who have or have not used marijuana within the past 30 days. I hypothesize that participants who used marijuana within the past 30 days will have a significantly higher risk than participants who did not.

Data Prep

Load

library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
userdata<-read.csv('/Volumes/FLASHDRIVE/Data 333/Skills drill 3? data.csv')
str(userdata)
## 'data.frame':    57146 obs. of  20 variables:
##  $ sexident          : chr  NA "Straight" "Straight" NA ...
##  $ Nervous           : int  NA 0 2 NA 1 2 0 NA NA NA ...
##  $ Hopeless          : int  NA 0 1 NA 3 1 0 NA NA NA ...
##  $ Restless          : int  NA 0 1 NA 2 1 0 NA NA NA ...
##  $ Effort            : int  NA NA 0 NA 2 2 0 NA NA NA ...
##  $ Sad               : int  NA 0 0 NA 1 1 0 NA NA NA ...
##  $ Worthless         : int  NA 0 0 NA 2 1 0 NA NA NA ...
##  $ k6score           : int  NA NA 4 NA 11 8 0 NA NA NA ...
##  $ k6category        : chr  NA NA "Low Risk" NA ...
##  $ marij_month       : chr  "No" "Yes" "No" "No" ...
##  $ cocaine_month     : chr  "No" "No" "No" "No" ...
##  $ crack_month       : chr  "No" "No" "No" "No" ...
##  $ heroin_month      : chr  "No" "No" "No" "No" ...
##  $ hallucinogen_month: chr  "No" "No" "No" "No" ...
##  $ inhalant_month    : chr  "No" "No" "No" "No" ...
##  $ meth_month        : chr  "No" "No" "No" "No" ...
##  $ painrelieve_month : chr  "No" "No" "No" "No" ...
##  $ tranq_month       : chr  "No" "No" "No" "No" ...
##  $ stimulant_month   : chr  "No" "No" "No" "No" ...
##  $ sedative_month    : chr  "No" "No" "No" "No" ...
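
The str() output above shows NA values in k6score, so it may help to see how many usable scores each group has before comparing means. The following is a minimal sketch of one way to check that. The data frame `toy` and its values are hypothetical stand-ins, not the survey data; with the real data, `userdata` would take its place.

```r
library(dplyr)

# Hypothetical stand-in for userdata (made-up values, same column names)
toy <- data.frame(
  marij_month = c("No", "Yes", "No", "No", "Yes", NA),
  k6score     = c(NA, NA, 4, 11, 8, 0)
)

# Count rows, missing scores, and usable scores within each group
missing_summary <- toy %>%
  group_by(marij_month) %>%
  summarize(
    n_rows    = n(),
    n_missing = sum(is.na(k6score)),
    n_usable  = sum(!is.na(k6score))
  )

print(missing_summary)
```

Rows where marij_month itself is NA show up as their own group, which is why the later steps filter to "Yes" and "No" only.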

Comparison of Means

Table

userdata%>%
  filter(marij_month %in% c("Yes","No")) %>%
  group_by(marij_month) %>%
  summarize(k6score =mean(k6score ,na.rm=TRUE))
## # A tibble: 2 x 2
##   marij_month k6score
## * <chr>         <dbl>
## 1 No             4.16
## 2 Yes            6.43
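
The table above reports only the group means. A possible extension, sketched below, adds the group size, standard deviation, and standard error so the means can be read in context. The data frame `toy` and the column names `n`, `mean_score`, `sd_score`, and `se_score` are my own assumptions for illustration and are not part of the original analysis.

```r
library(dplyr)

# Hypothetical stand-in for userdata (made-up values)
toy <- data.frame(
  marij_month = rep(c("No", "Yes"), each = 4),
  k6score     = c(0, 2, 4, NA, 3, 6, 9, 6)
)

group_summary <- toy %>%
  # keep the two groups of interest and drop missing scores up front
  filter(marij_month %in% c("Yes", "No"), !is.na(k6score)) %>%
  group_by(marij_month) %>%
  summarize(
    n          = n(),               # usable scores in the group
    mean_score = mean(k6score),
    sd_score   = sd(k6score),
    se_score   = sd_score / sqrt(n) # standard error of the mean
  )

print(group_summary)
```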

Visualization

userdata%>%
  filter(marij_month %in% c("Yes","No")) %>%
  group_by(marij_month) %>%
  summarize(k6score =mean(k6score ,na.rm=TRUE)) %>%
  ggplot(aes(x=marij_month,y=k6score,fill=marij_month)) +
  geom_col() +
  theme_classic()+
  theme(plot.title = element_text(hjust = 0.5),plot.subtitle =element_text(hjust = 0.5)) +
  theme(axis.text.x=element_text(angle=90, hjust=1), legend.position = "right")+
  labs(x="Marijuana Use w/n the Past 30 Days", y="Mean Risk for Serious Mental Illness (k6score)", title="Monthly Marijuana Use & Risk for Serious Mental Illness", subtitle="Anika Lewis") +
scale_fill_manual("Use w/n the past 30 days",values =c("Yes"="dark green","No"="black"))

Interpretation

  • This bar chart is a visualization of the mean risk for serious mental illness according to whether or not the participant used marijuana within the past 30 days. The mean risk for participants who said yes is around 6.4/24. Participants who said no have a mean risk of around 4.2/24. A difference of a little over 2 points exists in potential risk, but considering that the scale goes up to 24, the difference here seems quite small.
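
Judging the difference against the 24-point scale is one way to size it up. Another common way, which is not part of the original analysis, is a standardized effect size such as Cohen's d, which divides the difference in means by the pooled standard deviation. The sketch below is a minimal version under that assumption. The function name `cohens_d` and the score vectors are hypothetical.

```r
# Cohen's d: difference in means divided by the pooled standard deviation
cohens_d <- function(x, y) {
  x <- x[!is.na(x)]  # drop missing scores
  y <- y[!is.na(y)]
  pooled_var <- ((length(x) - 1) * var(x) + (length(y) - 1) * var(y)) /
    (length(x) + length(y) - 2)
  (mean(x) - mean(y)) / sqrt(pooled_var)
}

# Made-up scores for illustration only
yes_scores <- c(3, 6, 9, 6, NA)
no_scores  <- c(0, 2, 4, 2)

d <- cohens_d(yes_scores, no_scores)
print(d)
```

With the real data, the inputs would be the k6score values for the "Yes" and "No" groups.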

Comparison of Distributions

Visualization

userdata%>%
  filter(marij_month %in% c("Yes","No")) %>%
  filter(!is.na(k6score)) %>%   # drop only rows with a missing k6score, not rows with NA in other columns
  ggplot(aes(x=k6score,fill=marij_month)) +
  geom_histogram() +
  facet_wrap(~marij_month) +
  theme_classic()+
  theme(plot.title = element_text(hjust = 0.5),plot.subtitle =element_text(hjust = 0.5)) +
  theme(axis.text.x=element_text(angle=90, hjust=1), legend.position = "right")+
  labs(x="Risk for Serious Mental Illness", y="Count",title="Monthly Marijuana Use & Risk for Serious Mental Illness", subtitle="Anika Lewis") +
scale_fill_manual("Use w/n the past 30 days",values =c("Yes"="dark green","No"="black"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Interpretation

  • These histograms show the distribution of participants who answered “yes” or “no” across bins that indicate the risk for serious mental illness. From this we can see why the mean risk for the “no” participants is around 4. The tallest bin is at a risk of 0, and the bins with the second and third highest number of participants are at 5 and 10. In the histogram for the participants who answered “yes”, there appear to be fewer participants in general. The distribution seems similar to that of the “no” participants, with the most participants, maybe slightly fewer, at a risk of 0, and then fewer participants at a risk of 5 or 10. This might explain why the means are so similar.
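
Because the two groups differ in size, raw counts make the shapes hard to compare. One option, sketched below, is to plot the share of each group in every bin instead of the count. This is an assumption about what would help, not what the original chart does. The data frame `toy` is simulated with rbinom() purely for illustration, and `after_stat()` needs ggplot2 3.3.0 or later.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed so the simulated data is reproducible

# Simulated stand-in for userdata: unequal group sizes, scores from 0 to 24
toy <- data.frame(
  marij_month = rep(c("No", "Yes"), times = c(300, 60)),
  k6score     = c(rbinom(300, 24, 0.17), rbinom(60, 24, 0.27))
)

# With binwidth = 1, density equals the share of the group in each bin
p <- ggplot(toy, aes(x = k6score, fill = marij_month)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 1) +
  facet_wrap(~marij_month) +
  theme_classic() +
  labs(x = "Risk for Serious Mental Illness", y = "Share of group") +
  scale_fill_manual("Use w/n the past 30 days",
                    values = c("Yes" = "darkgreen", "No" = "black"))

print(p)
```

Setting binwidth = 1 also gives every whole-number score its own bin, which avoids the `stat_bin()` message about the default of 30 bins.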

Sampling Distribution & T-test

Sampling Distribution

yes_data<-userdata%>%
  filter(marij_month=="Yes")
no_data<-userdata%>%
  filter(marij_month=="No")
sample(yes_data$k6score,40)%>%
  mean(na.rm=TRUE)
## [1] 6.027778
sample(no_data$k6score,40)%>%
  mean(na.rm=TRUE)
## [1] 4.964286
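# Sampling distribution of the mean for the "Yes" group:
# 10,000 samples of 40 participants each
replicate(10000,
          sample(yes_data$k6score,40)%>%
  mean(na.rm=TRUE)
  )%>%
  data.frame()%>%
  rename("mean"=1) %>%
  ggplot()+
  geom_histogram(aes(x=mean),fill="dark green")+   # same color as "Yes" in the earlier charts
  theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Sampling distribution of the mean for the "No" group
replicate(10000,
          sample(no_data$k6score,40)%>%
  mean(na.rm=TRUE)
  )%>%
  data.frame()%>%
  rename("mean"=1) %>%
  ggplot()+
  geom_histogram(aes(x=mean),fill="black")+   # same color as "No" in the earlier charts
  theme_classic()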
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
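
In the code above, missing scores are removed after each sample of 40 is drawn, so the number of scores behind each mean can be less than 40. A possible variation, sketched below, drops the missing scores first and sets a seed so the draws can be repeated. The vector `yes_scores` is simulated and the seed value is arbitrary. Both are assumptions for illustration only.

```r
set.seed(333)  # arbitrary seed so the draws can be reproduced

# Simulated stand-in for yes_data$k6score, including some missing values
yes_scores <- c(rbinom(500, 24, 0.27), rep(NA, 50))

# Drop missing scores BEFORE sampling so every sample has exactly 40 scores
yes_scores <- yes_scores[!is.na(yes_scores)]

# Sampling distribution of the mean: 10,000 samples of size 40
sample_means <- replicate(10000, mean(sample(yes_scores, 40)))

# Center and spread of the sampling distribution
print(c(mean = mean(sample_means), sd = sd(sample_means)))

# Rough theoretical standard error for comparison
print(sd(yes_scores) / sqrt(40))
```

The mean of the sample means should land close to the mean of the full group, and their spread should be close to the standard error.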

T-test

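# Keep only the two variables needed for the test
# (named ttest_data so it is not confused with userdata)
ttest_data<-userdata%>%
  select(marij_month,k6score)%>%
  filter(marij_month %in% c("Yes","No"))
t.test(k6score~marij_month,data=ttest_data)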
## 
##  Welch Two Sample t-test
## 
## data:  k6score by marij_month
## t = -28.099, df = 6078.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.434468 -2.116930
## sample estimates:
##  mean in group No mean in group Yes 
##          4.155773          6.431472

Interpretation

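  • The p-value given by this t-test (p < 2.2e-16) is less than .05, so the difference in mean k6score between the two groups is statistically significant. The 95 percent confidence interval for the difference (No minus Yes) runs from about -2.43 to -2.12, meaning that participants who used marijuana within the past 30 days score roughly 2.1 to 2.4 points higher on average. This supports a significant relationship between whether or not a participant uses marijuana and their risk of serious mental illness.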
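
To show where the t and df values in a Welch test come from, the sketch below computes them by hand and compares them with t.test(). The function name `welch_by_hand` and the score vectors are hypothetical and only for illustration. The first argument plays the role of the "No" group so the sign matches the output above.

```r
# Welch two-sample t statistic and degrees of freedom, computed by hand
welch_by_hand <- function(x, y) {
  x <- x[!is.na(x)]  # drop missing scores
  y <- y[!is.na(y)]
  vx <- var(x) / length(x)  # squared standard error of each group mean
  vy <- var(y) / length(y)
  t_stat <- (mean(x) - mean(y)) / sqrt(vx + vy)
  # Welch-Satterthwaite approximation for the degrees of freedom
  df <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
  c(t = t_stat, df = df)
}

# Made-up scores for illustration only
no_scores  <- c(0, 2, 4, 2, NA, 1)
yes_scores <- c(3, 6, 9, 6, 5)

by_hand  <- welch_by_hand(no_scores, yes_scores)
built_in <- t.test(no_scores, yes_scores)  # Welch test is the default

print(by_hand)
print(c(t = unname(built_in$statistic), df = unname(built_in$parameter)))
```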