# load data
data1 <-read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/drug-use-by-age/drug-use-by-age.csv") 
View(data1)
# 1. Introduction
# My research question is "What is the drug use habit for different groups of age." The drug use is on the rise in US and 23.5 million Americans
# are addicted to alcohol and drugs. That's approximately one in every 10 Americans over the age of 12. By knowing the drug habit of different age group,
# we could target our prevention and treatment measures more accurately. 

# 2. Data
# 1) The population of the NSDUH series is the general civilian population aged 12 and older in the United States. Questions include age at first use, as well as lifetime, annual, and past-month usage for the following drugs: alcohol, marijuana, cocaine (including crack), hallucinogens, heroin, inhalants, tobacco, pain relievers, tranquilizers, stimulants, and sedatives.The survey covers substance abuse treatment history and perceived need for treatment, and includes questions from the Diagnostic and Statistical Manual (DSM) of Mental Disorders that allow diagnostic criteria to be applied. Respondents were also asked about personal and family income sources and amounts, health care access and coverage, illegal activities and arrest record, problems resulting from the use of drugs, perceptions of risks, and needle-sharing. Demographic data include gender, race, age, ethnicity, educational level, job status, income level, veteran status, household composition, and population density.
nrow(data1)
## [1] 17
sum(data1$n)
## [1] 55268
# 2) There are 55268 cases in this data collection.
# 3) Because I am interested in the drug use for different groups of age, I would like study the relationship between age_group, alcohol_use incluing alcohol_frequency and marijuana-use including marijuana-frequency.
# alcohol-use indicates that Percentage of those in an age group who used alcohol in the past 12 months; alcohol-frequency indicates Median number of times a user in an age group used alcohol in the past 12 months; marijuana-use indicates Percentage of those in an age group who used marijuana in the past 12 months; and marijuana-frequency indicates Median number of times a user in an age group used marijuana in the past 12 months.
# 4) It is a observational study. The sample design was conducted by using computer-assisted personal interviews and audio computer-assisted self interviews in 2012.
# 5) The survey sample has employed a 50-state design with an independent, multistage area probability sample for each of the 50 states and the District of Columbia.The samples was independent of observation, hense I think it would be appropriately generalized.
# 6) Exploring causality relationship would require more statistical analysis.

# 3. Exploratory data analysis:
# I am trying to find out the trend of drug habit in different age groups.
summary(data1)
##       age           n         alcohol.use    alcohol.frequency
##  12     : 1   Min.   :2223   Min.   : 3.90   Min.   : 3.00    
##  13     : 1   1st Qu.:2469   1st Qu.:40.10   1st Qu.:10.00    
##  14     : 1   Median :2798   Median :64.60   Median :48.00    
##  15     : 1   Mean   :3251   Mean   :55.43   Mean   :33.35    
##  16     : 1   3rd Qu.:3058   3rd Qu.:77.50   3rd Qu.:52.00    
##  17     : 1   Max.   :7391   Max.   :84.20   Max.   :52.00    
##  (Other):11                                                   
##  marijuana.use   marijuana.frequency  cocaine.use    cocaine.frequency
##  Min.   : 1.10   Min.   : 4.00       Min.   :0.000   5.0    :6        
##  1st Qu.: 8.70   1st Qu.:30.00       1st Qu.:0.500   5.5    :2        
##  Median :20.80   Median :52.00       Median :2.000   8.0    :2        
##  Mean   :18.92   Mean   :42.94       Mean   :2.176   -      :1        
##  3rd Qu.:28.40   3rd Qu.:52.00       3rd Qu.:4.000   1.0    :1        
##  Max.   :34.00   Max.   :72.00       Max.   :4.900   15.0   :1        
##                                                      (Other):4        
##    crack.use      crack.frequency   heroin.use     heroin.frequency
##  Min.   :0.0000   -      :3       Min.   :0.0000   -      : 1      
##  1st Qu.:0.0000   5.0    :2       1st Qu.:0.1000   1.0    : 1      
##  Median :0.4000   6.0    :2       Median :0.2000   120.0  : 1      
##  Mean   :0.2941   1.0    :1       Mean   :0.3529   180.0  : 1      
##  3rd Qu.:0.5000   10.0   :1       3rd Qu.:0.6000   2.0    : 1      
##  Max.   :0.6000   15.0   :1       Max.   :1.1000   280.0  : 1      
##                   (Other):7                        (Other):11      
##  hallucinogen.use hallucinogen.frequency  inhalant.use  
##  Min.   :0.100    Min.   : 2.000         Min.   :0.000  
##  1st Qu.:0.600    1st Qu.: 3.000         1st Qu.:0.600  
##  Median :3.200    Median : 3.000         Median :1.400  
##  Mean   :3.394    Mean   : 8.412         Mean   :1.388  
##  3rd Qu.:5.200    3rd Qu.: 4.000         3rd Qu.:2.000  
##  Max.   :8.600    Max.   :52.000         Max.   :3.000  
##                                                         
##  inhalant.frequency pain.releiver.use pain.releiver.frequency
##  4.0    :5          Min.   : 0.600    Min.   : 7.00          
##  2.0    :2          1st Qu.: 3.900    1st Qu.:12.00          
##  3.0    :2          Median : 6.200    Median :12.00          
##  -      :1          Mean   : 6.271    Mean   :14.71          
##  10.0   :1          3rd Qu.: 9.000    3rd Qu.:15.00          
##  12.0   :1          Max.   :10.000    Max.   :36.00          
##  (Other):5                                                   
##  oxycontin.use    oxycontin.frequency tranquilizer.use
##  Min.   :0.0000   12.0   :2           Min.   :0.200   
##  1st Qu.:0.4000   13.5   :2           1st Qu.:1.400   
##  Median :1.1000   -      :1           Median :3.500   
##  Mean   :0.9353   17.5   :1           Mean   :2.806   
##  3rd Qu.:1.4000   20.0   :1           3rd Qu.:4.200   
##  Max.   :1.7000   24.5   :1           Max.   :5.400   
##                   (Other):9                           
##  tranquilizer.frequency stimulant.use   stimulant.frequency
##  Min.   : 4.50          Min.   :0.000   Min.   :  2.00     
##  1st Qu.: 6.00          1st Qu.:0.600   1st Qu.:  7.00     
##  Median :10.00          Median :1.800   Median : 10.00     
##  Mean   :11.74          Mean   :1.918   Mean   : 31.15     
##  3rd Qu.:11.00          3rd Qu.:3.000   3rd Qu.: 12.00     
##  Max.   :52.00          Max.   :4.100   Max.   :364.00     
##                                                            
##     meth.use      meth.frequency  sedative.use    sedative.frequency
##  Min.   :0.0000   -      :2      Min.   :0.0000   Min.   :  3.00    
##  1st Qu.:0.2000   12.0   :2      1st Qu.:0.2000   1st Qu.:  6.50    
##  Median :0.4000   30.0   :2      Median :0.3000   Median : 10.00    
##  Mean   :0.3824   10.5   :1      Mean   :0.2824   Mean   : 19.38    
##  3rd Qu.:0.6000   104.0  :1      3rd Qu.:0.4000   3rd Qu.: 17.50    
##  Max.   :0.9000   105.0  :1      Max.   :0.5000   Max.   :104.00    
##                   (Other):8
# It shows a summary of data and properties of each variable.
plot(data1$alcohol.use ~ data1$age,
     main = "Alcohol Use",
     xlab = "Age Group", ylab = "Alcohol Use %")

plot(data1$alcohol.frequency ~ data1$age,
     main = "Alcohol Frequency",
     xlab = "Age Group", ylab = "Alcohol Frequency")

plot(data1$marijuana.use ~ data1$age,
     main = "Marijuana Use",
     xlab = "Age Group", ylab = "Marijuana Use %")

plot(data1$marijuana.frequency ~ data1$age,
     main = "Marijuana Frequency",
     xlab = "Age Group", ylab = "Marijuana Frequency")

# 4. Inference:
# I am trying to find out if there is a relationship between alcohol use and marijuana use?
reg1 <- lm(data1$alcohol.use~data1$marijuana.use)
plot(data1$alcohol.use~data1$marijuana.use,
     main = "alcohol use vs. marijuana use",
     xlab = "marijuana use %",
     ylab = "alcohol use %")
abline(reg1)

cor(data1$alcohol.use, data1$marijuana.use)
## [1] 0.5941651
summary(reg1)
## 
## Call:
## lm(formula = data1$alcohol.use ~ data1$marijuana.use)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.729 -20.105  -5.862  19.690  30.953 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)  
## (Intercept)          30.1598    10.3606   2.911   0.0108 *
## data1$marijuana.use   1.3354     0.4668   2.861   0.0119 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.33 on 15 degrees of freedom
## Multiple R-squared:  0.353,  Adjusted R-squared:  0.3099 
## F-statistic: 8.185 on 1 and 15 DF,  p-value: 0.0119
# Equation: y(alcohol.use) = 30.1598 + 1.3354 * x(marijuana.use)
# To assess whether the linear model is reliable, we need to check for conditions for (1) linearity, (2) nearly normal residuals, (3) constant variability and (4) independent observations. And based on sample design, we can assume that samples are independent observations.
plot(reg1$residuals ~ data1$marijuana.use)
abline(h = 0, lty = 3)  

hist(reg1$residuals)

qqnorm(reg1$residuals)
qqline(reg1$residuals)

reg2 <- lm(data1$alcohol.frequency~data1$marijuana.frequency)
plot(data1$alcohol.frequency~data1$marijuana.frequency,
     main = "alcohol frequency vs. marijuana frequency",
     xlab = "marijuana frequency",
     ylab = "alcohol frequency")
abline(reg2)

cor(data1$alcohol.frequency, data1$marijuana.frequency)
## [1] 0.8187685
summary(reg2)
## 
## Call:
## lm(formula = data1$alcohol.frequency ~ data1$marijuana.frequency)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.9641 -10.3477  -0.7925  10.0359  25.2452 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                -7.4663     8.0010  -0.933    0.366    
## data1$marijuana.frequency   0.9506     0.1721   5.523 5.84e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.64 on 15 degrees of freedom
## Multiple R-squared:  0.6704, Adjusted R-squared:  0.6484 
## F-statistic: 30.51 on 1 and 15 DF,  p-value: 5.844e-05
# Equation: y(alcohol.frequency) = -7.4663 + 0.9506 * X(marijuana.frequency)
# To assess whether the linear model is reliable, we need to check for conditions for (1) linearity, (2) nearly normal residuals, (3) constant variability and (4) independent observations.
plot(reg2$residuals ~ data1$marijuana.frequency)
abline(h = 0, lty = 3)  

hist(reg2$residuals)

qqnorm(reg2$residuals)
qqline(reg2$residuals)

# 5. Conclusion:
# According to the study, we can see a trend of a increase in alcohol use percentage with age growing at a peak 80% of age 22- 23. After age 22-23, there is a decrease in alcohol use percentage when age keep growing. (Alcohol_use vs. Age Group)
# My study also shows a increasing trend of alcohol frequency when age growing, and it peaked at 50 at age 21 and stays the same. (Alcohol_frequency vs. Age Group)
# The plot (Marijuana_use vs. Age) indicates a increase in marijuana use percentage when age grows from 12 to 20 at a peak of 35%. After age 21, there is a decrease in marijuana use as age grows.
# For the trend of marijuana frequency, there is a increase when age grows until age 20. And after age 20, there is no obivious trend in marijuana frequency. (Marijuana_frequency vs. Age)
# There is a positive relationship between alcohol use and marijuana use (alcohol use vs. marijuana use) and a positive relationship between alcohol frequency and marijuana frequency (alcohol frequency vs. marijuana frequency)



# Reference: 
# 1. http://www.drugfree.org/new-data-show-millions-of-americans-with-alcohol-and-drug-addiction-could-benefit-from-health-care-r/
# 2. http://www.icpsr.umich.edu/icpsrweb/ICPSR/series/64#title