# load data
data1 <-read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/drug-use-by-age/drug-use-by-age.csv")
View(data1)
# 1. Introduction
# My research question is "What is the drug use habit for different groups of age." The drug use is on the rise in US and 23.5 million Americans
# are addicted to alcohol and drugs. That's approximately one in every 10 Americans over the age of 12. By knowing the drug habit of different age group,
# we could target our prevention and treatment measures more accurately.
# 2. Data
# 1) The population of the NSDUH series is the general civilian population aged 12 and older in the United States. Questions include age at first use, as well as lifetime, annual, and past-month usage for the following drugs: alcohol, marijuana, cocaine (including crack), hallucinogens, heroin, inhalants, tobacco, pain relievers, tranquilizers, stimulants, and sedatives.The survey covers substance abuse treatment history and perceived need for treatment, and includes questions from the Diagnostic and Statistical Manual (DSM) of Mental Disorders that allow diagnostic criteria to be applied. Respondents were also asked about personal and family income sources and amounts, health care access and coverage, illegal activities and arrest record, problems resulting from the use of drugs, perceptions of risks, and needle-sharing. Demographic data include gender, race, age, ethnicity, educational level, job status, income level, veteran status, household composition, and population density.
nrow(data1)
## [1] 17
sum(data1$n)
## [1] 55268
# 2) There are 55268 cases in this data collection.
# 3) Because I am interested in the drug use for different groups of age, I would like study the relationship between age_group, alcohol_use incluing alcohol_frequency and marijuana-use including marijuana-frequency.
# alcohol-use indicates that Percentage of those in an age group who used alcohol in the past 12 months; alcohol-frequency indicates Median number of times a user in an age group used alcohol in the past 12 months; marijuana-use indicates Percentage of those in an age group who used marijuana in the past 12 months; and marijuana-frequency indicates Median number of times a user in an age group used marijuana in the past 12 months.
# 4) It is a observational study. The sample design was conducted by using computer-assisted personal interviews and audio computer-assisted self interviews in 2012.
# 5) The survey sample has employed a 50-state design with an independent, multistage area probability sample for each of the 50 states and the District of Columbia.The samples was independent of observation, hense I think it would be appropriately generalized.
# 6) Exploring causality relationship would require more statistical analysis.
# 3. Exploratory data analysis:
# I am trying to find out the trend of drug habit in different age groups.
summary(data1)
## age n alcohol.use alcohol.frequency
## 12 : 1 Min. :2223 Min. : 3.90 Min. : 3.00
## 13 : 1 1st Qu.:2469 1st Qu.:40.10 1st Qu.:10.00
## 14 : 1 Median :2798 Median :64.60 Median :48.00
## 15 : 1 Mean :3251 Mean :55.43 Mean :33.35
## 16 : 1 3rd Qu.:3058 3rd Qu.:77.50 3rd Qu.:52.00
## 17 : 1 Max. :7391 Max. :84.20 Max. :52.00
## (Other):11
## marijuana.use marijuana.frequency cocaine.use cocaine.frequency
## Min. : 1.10 Min. : 4.00 Min. :0.000 5.0 :6
## 1st Qu.: 8.70 1st Qu.:30.00 1st Qu.:0.500 5.5 :2
## Median :20.80 Median :52.00 Median :2.000 8.0 :2
## Mean :18.92 Mean :42.94 Mean :2.176 - :1
## 3rd Qu.:28.40 3rd Qu.:52.00 3rd Qu.:4.000 1.0 :1
## Max. :34.00 Max. :72.00 Max. :4.900 15.0 :1
## (Other):4
## crack.use crack.frequency heroin.use heroin.frequency
## Min. :0.0000 - :3 Min. :0.0000 - : 1
## 1st Qu.:0.0000 5.0 :2 1st Qu.:0.1000 1.0 : 1
## Median :0.4000 6.0 :2 Median :0.2000 120.0 : 1
## Mean :0.2941 1.0 :1 Mean :0.3529 180.0 : 1
## 3rd Qu.:0.5000 10.0 :1 3rd Qu.:0.6000 2.0 : 1
## Max. :0.6000 15.0 :1 Max. :1.1000 280.0 : 1
## (Other):7 (Other):11
## hallucinogen.use hallucinogen.frequency inhalant.use
## Min. :0.100 Min. : 2.000 Min. :0.000
## 1st Qu.:0.600 1st Qu.: 3.000 1st Qu.:0.600
## Median :3.200 Median : 3.000 Median :1.400
## Mean :3.394 Mean : 8.412 Mean :1.388
## 3rd Qu.:5.200 3rd Qu.: 4.000 3rd Qu.:2.000
## Max. :8.600 Max. :52.000 Max. :3.000
##
## inhalant.frequency pain.releiver.use pain.releiver.frequency
## 4.0 :5 Min. : 0.600 Min. : 7.00
## 2.0 :2 1st Qu.: 3.900 1st Qu.:12.00
## 3.0 :2 Median : 6.200 Median :12.00
## - :1 Mean : 6.271 Mean :14.71
## 10.0 :1 3rd Qu.: 9.000 3rd Qu.:15.00
## 12.0 :1 Max. :10.000 Max. :36.00
## (Other):5
## oxycontin.use oxycontin.frequency tranquilizer.use
## Min. :0.0000 12.0 :2 Min. :0.200
## 1st Qu.:0.4000 13.5 :2 1st Qu.:1.400
## Median :1.1000 - :1 Median :3.500
## Mean :0.9353 17.5 :1 Mean :2.806
## 3rd Qu.:1.4000 20.0 :1 3rd Qu.:4.200
## Max. :1.7000 24.5 :1 Max. :5.400
## (Other):9
## tranquilizer.frequency stimulant.use stimulant.frequency
## Min. : 4.50 Min. :0.000 Min. : 2.00
## 1st Qu.: 6.00 1st Qu.:0.600 1st Qu.: 7.00
## Median :10.00 Median :1.800 Median : 10.00
## Mean :11.74 Mean :1.918 Mean : 31.15
## 3rd Qu.:11.00 3rd Qu.:3.000 3rd Qu.: 12.00
## Max. :52.00 Max. :4.100 Max. :364.00
##
## meth.use meth.frequency sedative.use sedative.frequency
## Min. :0.0000 - :2 Min. :0.0000 Min. : 3.00
## 1st Qu.:0.2000 12.0 :2 1st Qu.:0.2000 1st Qu.: 6.50
## Median :0.4000 30.0 :2 Median :0.3000 Median : 10.00
## Mean :0.3824 10.5 :1 Mean :0.2824 Mean : 19.38
## 3rd Qu.:0.6000 104.0 :1 3rd Qu.:0.4000 3rd Qu.: 17.50
## Max. :0.9000 105.0 :1 Max. :0.5000 Max. :104.00
## (Other):8
# It shows a summary of data and properties of each variable.
plot(data1$alcohol.use ~ data1$age,
main = "Alcohol Use",
xlab = "Age Group", ylab = "Alcohol Use %")

plot(data1$alcohol.frequency ~ data1$age,
main = "Alcohol Frequency",
xlab = "Age Group", ylab = "Alcohol Frequency")

plot(data1$marijuana.use ~ data1$age,
main = "Marijuana Use",
xlab = "Age Group", ylab = "Marijuana Use %")

plot(data1$marijuana.frequency ~ data1$age,
main = "Marijuana Frequency",
xlab = "Age Group", ylab = "Marijuana Frequency")

# 4. Inference:
# I am trying to find out if there is a relationship between alcohol use and marijuana use?
reg1 <- lm(data1$alcohol.use~data1$marijuana.use)
plot(data1$alcohol.use~data1$marijuana.use,
main = "alcohol use vs. marijuana use",
xlab = "marijuana use %",
ylab = "alcohol use %")
abline(reg1)

cor(data1$alcohol.use, data1$marijuana.use)
## [1] 0.5941651
summary(reg1)
##
## Call:
## lm(formula = data1$alcohol.use ~ data1$marijuana.use)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.729 -20.105 -5.862 19.690 30.953
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.1598 10.3606 2.911 0.0108 *
## data1$marijuana.use 1.3354 0.4668 2.861 0.0119 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.33 on 15 degrees of freedom
## Multiple R-squared: 0.353, Adjusted R-squared: 0.3099
## F-statistic: 8.185 on 1 and 15 DF, p-value: 0.0119
# Equation: y(alcohol.use) = 30.1598 + 1.3354 * x(marijuana.use)
# To assess whether the linear model is reliable, we need to check for conditions for (1) linearity, (2) nearly normal residuals, (3) constant variability and (4) independent observations. And based on sample design, we can assume that samples are independent observations.
plot(reg1$residuals ~ data1$marijuana.use)
abline(h = 0, lty = 3)

hist(reg1$residuals)

qqnorm(reg1$residuals)
qqline(reg1$residuals)

reg2 <- lm(data1$alcohol.frequency~data1$marijuana.frequency)
plot(data1$alcohol.frequency~data1$marijuana.frequency,
main = "alcohol frequency vs. marijuana frequency",
xlab = "marijuana frequency",
ylab = "alcohol frequency")
abline(reg2)

cor(data1$alcohol.frequency, data1$marijuana.frequency)
## [1] 0.8187685
summary(reg2)
##
## Call:
## lm(formula = data1$alcohol.frequency ~ data1$marijuana.frequency)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.9641 -10.3477 -0.7925 10.0359 25.2452
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.4663 8.0010 -0.933 0.366
## data1$marijuana.frequency 0.9506 0.1721 5.523 5.84e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.64 on 15 degrees of freedom
## Multiple R-squared: 0.6704, Adjusted R-squared: 0.6484
## F-statistic: 30.51 on 1 and 15 DF, p-value: 5.844e-05
# Equation: y(alcohol.frequency) = -7.4663 + 0.9506 * X(marijuana.frequency)
# To assess whether the linear model is reliable, we need to check for conditions for (1) linearity, (2) nearly normal residuals, (3) constant variability and (4) independent observations.
plot(reg2$residuals ~ data1$marijuana.frequency)
abline(h = 0, lty = 3)

hist(reg2$residuals)

qqnorm(reg2$residuals)
qqline(reg2$residuals)

# 5. Conclusion:
# According to the study, we can see a trend of a increase in alcohol use percentage with age growing at a peak 80% of age 22- 23. After age 22-23, there is a decrease in alcohol use percentage when age keep growing. (Alcohol_use vs. Age Group)
# My study also shows a increasing trend of alcohol frequency when age growing, and it peaked at 50 at age 21 and stays the same. (Alcohol_frequency vs. Age Group)
# The plot (Marijuana_use vs. Age) indicates a increase in marijuana use percentage when age grows from 12 to 20 at a peak of 35%. After age 21, there is a decrease in marijuana use as age grows.
# For the trend of marijuana frequency, there is a increase when age grows until age 20. And after age 20, there is no obivious trend in marijuana frequency. (Marijuana_frequency vs. Age)
# There is a positive relationship between alcohol use and marijuana use (alcohol use vs. marijuana use) and a positive relationship between alcohol frequency and marijuana frequency (alcohol frequency vs. marijuana frequency)
# Reference:
# 1. http://www.drugfree.org/new-data-show-millions-of-americans-with-alcohol-and-drug-addiction-could-benefit-from-health-care-r/
# 2. http://www.icpsr.umich.edu/icpsrweb/ICPSR/series/64#title