Drug Use by Age

Age Groups Pictures

Abstract

In the analysis, the data review drug use by age groups over the past 12 months. Drug use has among individuals has risen over the years between prescription and illicit drug. In the article, “How Baby Boomers Get High” compares the overall usage between younger counterparts and individuals over 50. I will investigate if medical drugs have a higher usage among all age groups than illicit drugs.

Required Libraries

Install packages and libraries to load data and perform analysis.

#install.packages('tinytex')
library(tinytex)
library(readr)
library(ggplot2)
library(tidyverse)
library(reactable)
library(reshape2)
library(tidyr)

Load Data

Retrieve csv file from Github Data Source

# load data
drug_use_by_age <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/drug-use-by-age/drug-use-by-age.csv")

Type of study

This is an observational study to provide evidence of a naturally occurring association between variables.

Cases or Observational Units

The data evolved from an article How Baby Boomers Get High and covers 13 drugs across 17 age groups.

Each case studies 13 drug types among 17 age groups in relation to percentage usage and the median number of times a user has used a drug in the past 12 months. The dataset is accessible on Five Thirty Eight Github with a last commit on April 22, 2015. The dataset on drug use by age includes 28 variables and 17 observations.

dim(drug_use_by_age)
## [1] 17 28
Original Data Set
reactable(drug_use_by_age) # original data set
Header (variable) Definition
alcohol-use Percentage of those in an age group who used alcohol in the past 12 months
alcohol-frequency Median number of times a user in an age group used alcohol in the past 12 months
marijuana-use Percentage of those in an age group who used marijuana in the past 12 months
marijuana-frequency Median number of times a user in an age group used marijuana in the past 12 months
cocaine-use Percentage of those in an age group who used cocaine in the past 12 months
cocaine-frequency Median number of times a user in an age group used cocaine in the past 12 months
crack-use Percentage of those in an age group who used crack in the past 12 months
crack-frequency Median number of times a user in an age group used crack in the past 12 months
heroin-use Percentage of those in an age group who used heroin in the past 12 months
heroin-frequency Median number of times a user in an age group used heroin in the past 12 months
hallucinogen-use Percentage of those in an age group who used hallucinogens in the past 12 months
hallucinogen-frequency Median number of times a user in an age group used hallucinogens in the past 12 months
inhalant-use Percentage of those in an age group who used inhalants in the past 12 months
inhalant-frequency Median number of times a user in an age group used inhalants in the past 12 months
pain-releiver-use Percentage of those in an age group who used pain relievers in the past 12 months
pain-releiver-frequency Median number of times a user in an age group used pain relievers in the past 12 months
oxycontin-use Percentage of those in an age group who used oxycontin in the past 12 months
oxycontin-frequency Median number of times a user in an age group used oxycontin in the past 12 months
tranquilizer-use Percentage of those in an age group who used tranquilizer in the past 12 months
tranquilizer-frequency Median number of times a user in an age group used tranquilizer in the past 12 months
stimulant-use Percentage of those in an age group who used stimulants in the past 12 months
stimulant-frequency Median number of times a user in an age group used stimulants in the past 12 months
meth-use Percentage of those in an age group who used meth in the past 12 months
meth-frequency Median number of times a user in an age group used meth in the past 12 months
sedative-use Percentage of those in an age group who used sedatives in the past 12 months
sedative-frequency Median number of times a user in an age group used sedatives in the past 12 months

Research question

Does the frequency and use of medical-use, prescription drugs, have a higher usage than illicit-use, non-prescription drugs, in an age group?

Dependent Variable

The response variables are the number of users n (quantitative) and drug use (qualitative).

Independent Variable

The independent variable is the age and is qualitative.

Evaluation Process

Clean-up Dataset

Data set was manually formatted to exclude the frequency variables and converted the percentage-use to real numbers for the creation of age group clusters. The clean data set was then loaded into the GitHub repository.

drug_use_clean <- read_csv("https://raw.githubusercontent.com/candrewxs/Project_Proposal_D606/main/drug-use-clean.csv")
head(drug_use_clean, 10)
## # A tibble: 10 x 15
##    age       n `Alcohol Real` `Marijuana Real` `Cocaine Real` `Crack Real`
##    <chr> <dbl>          <dbl>            <dbl>          <dbl>        <dbl>
##  1 12     2798           109.             30.8           2.80         0   
##  2 13     2757           234.             93.7           2.76         0   
##  3 14     2792           505.            243.            2.79         0   
##  4 15     2956           863.            429.           14.8          2.96
##  5 16     3058          1226.            688.           30.6          0   
##  6 17     3038          1498.            851.           60.8          3.04
##  7 18     2469          1449.            832.           79.0          9.88
##  8 19     2223          1436.            742.           91.1         11.1 
##  9 20     2271          1583.            772.          111.          13.6 
## 10 21     2354          1959.            777.          113.          11.8 
## # ... with 9 more variables: Heroin Real <dbl>, Hallucinogen Real <dbl>,
## #   Inhalant Real <dbl>, Pain Reliever Real <dbl>, Oxycontin Real <dbl>,
## #   Tranq Real <dbl>, Stimulant Real <dbl>, Meth Real <dbl>,
## #   Sedative Real <dbl>

Convert data to a dataframe, changed column designation to as.numeric, and replaced blank row with zero’s.

drug_use <- as.data.frame(drug_use_clean, stringsAsFactors = FALSE)
drug_use <- as.data.frame(apply(drug_use, 2, as.numeric))
drug_use[is.na(drug_use)] <- 0

Create three new clusters for age groups between 12 to 21 as: 12-15, 16-18, and 19-21.

drug_use[1,] <- drug_use[1,] + drug_use[2,] + drug_use[3,] + drug_use[4,]
drug_use[5,] <- drug_use[5,] + drug_use[6,] + drug_use[7,]
drug_use[8,] <- drug_use[8,] + drug_use[9,] + drug_use[10,]

The drug_use-by_age dataset contains 10 observations instead of 17 age groups. The combined age groups are as follows:12-15, 16-18, 19-21 and then we’ll change age column from character to factor for analysis testing.

drug_use_age <- drug_use[-c(2:4,6,7,9,10), ]
  
drug_use_age$age <- c("12-15", "16-18", "19-21", "22-23", "24-25", "26-29", 
                      "30-34", "35-49", "50-64", "65+")
drug_use_age$age <- as.factor(drug_use_age$age) #change column to factor
reactable(drug_use_age) # new data set

Visualization of dataset - sample size (n) by age

ggplot(drug_use_age, aes(x = age, y = n, fill = n)) +
  geom_col() +
  scale_fill_continuous() +
  labs(y="sample size (n)", 
       title = "Drug Use Sample Size by Age") +
  theme(plot.title=element_text(hjust = 0.5))

Summary Statistics


Initial Data Analysis

View the univariate summary information for anything unusual or unexpected for any data entry error. The summary information provides a close look at the minimum and maximum values of each variable. There are variables whose minimum values are zero and I decided to keep the value because it is a valid respondent input for drug use response.

summary(drug_use_age)
##       age          n          Alcohol Real  Marijuana Real     Cocaine Real   
##  12-15  :1   Min.   : 2448   Min.   :1207   Min.   :  29.38   Min.   :  0.00  
##  16-18  :1   1st Qu.: 3129   1st Qu.:2145   1st Qu.: 488.93   1st Qu.: 41.52  
##  19-21  :1   Median : 4649   Median :3226   Median : 782.35   Median : 97.48  
##  22-23  :1   Mean   : 5527   Mean   :3237   Mean   :1003.89   Mean   :119.48  
##  24-25  :1   3rd Qu.: 7255   3rd Qu.:4121   3rd Qu.:1288.38   3rd Qu.:180.32  
##  26-29  :1   Max.   :11303   Max.   :5543   Max.   :2370.74   Max.   :315.41  
##  (Other):4                                                                    
##    Crack Real     Heroin Real     Hallucinogen Real Inhalant Real   
##  Min.   : 0.00   Min.   : 0.000   Min.   :  2.448   Min.   :  0.00  
##  1st Qu.:11.11   1st Qu.: 8.407   1st Qu.: 46.148   1st Qu.: 12.53  
##  Median :15.01   Median :13.635   Median :106.491   Median : 29.45  
##  Mean   :17.64   Mean   :19.560   Mean   :170.462   Mean   : 69.63  
##  3rd Qu.:23.39   3rd Qu.:28.096   3rd Qu.:235.222   3rd Qu.: 85.37  
##  Max.   :36.95   Max.   :51.777   Max.   :507.534   Max.   :260.19  
##                                                                     
##  Pain Reliever Real Oxycontin Real     Tranq Real      Stimulant Real  
##  Min.   : 14.69     Min.   :  0.00   Min.   :  4.896   Min.   :  0.00  
##  1st Qu.:181.26     1st Qu.: 23.07   1st Qu.: 99.362   1st Qu.: 41.16  
##  Median :352.01     Median : 35.95   Median :125.403   Median : 70.49  
##  Mean   :341.07     Mean   : 49.59   Mean   :152.487   Mean   :100.09  
##  3rd Qu.:456.32     3rd Qu.: 74.94   3rd Qu.:204.684   3rd Qu.:156.93  
##  Max.   :674.97     Max.   :118.14   Max.   :307.806   Max.   :260.71  
##                                                                        
##    Meth Real     Sedative Real  
##  Min.   : 0.00   Min.   : 0.00  
##  1st Qu.:12.20   1st Qu.: 9.24  
##  Median :15.28   Median :10.98  
##  Mean   :20.79   Mean   :15.26  
##  3rd Qu.:31.16   3rd Qu.:24.36  
##  Max.   :43.45   Max.   :31.18  
## 

Check the columns for missing values

sum(is.na(drug_use_age)) # checking for missing values
## [1] 0

Visualizations - Univariate Plots

The histogram of alcohol shows the drug data density and the accompanying boxplot shows the outliners. The variable, alcohol real, shows a peak between 2000 and 3000 respondents, with a right skewed unimodal distribution.

In an alternative variable,pain-reliever real, the histogram and boxplot also provide insights on its data density relative to the percentage of those in an age group who used alcohol in the past 12 months. The shape has a multimodal distribution, prominent peaks between 0 to 100 and 300 to 500 respondents with no outliners.

attach(drug_use_age)

par(mfrow=c(2,2))
hist(`Alcohol Real`)
boxplot(`Alcohol Real`, horizontal = TRUE)
hist(`Pain Reliever Real`)
boxplot(`Pain Reliever Real`, horizontal = TRUE)
Variable Distribution & Outliner View

Variable Distribution & Outliner View


Bivariable Scatterplot showing two quantitative variables

In this section we are going to create a bivariate plot to investigate the relationships between two different parameters.

Correlation Matrix

The correlation matrix was used to show a linear relationship between variables. The correlation, R, is perfectly linear when the relationship value is either -1 or 1.

In the plot, the direction of the relationship between 2 variables mostly have a positive correlation indicating if a variable increases the other one increases and vice-versa.


Positive Relationship between Variables Correlation Value
Pain Reliever Real ~ Marijuana Real 0.970
Oxycontin Real ~ Marijuana Real 0.978
Tranq Real ~ Marijuana Real 0.979
Stimulant Real ~ Hallucinogen-Real 0.988
Meth Real ~ Tranq Real 0.984


Variables with a correlation close to 0 indicates that the two variables are independent, where one variable increases and the other variable may remain unchanged.


Independent Relationship between Variables Correlation Value
Heroin Real ~ n 0.073
Inhalant Real ~ Heroin Real 0.084



ANOVA - Analysis of Variance

I will use the anova function to compute the F-ratio and the p-value. The function takes as argument a model (a linear regression model in this case) where the dependent variable y is the drug usage value and the independent variable x is age (group cluster).

Drug Group Classification:

Medical Use Drug Names
Presciption Pain-Reliever, Oxycontin, Tranq, Stimulant, Sedative


Illicit Use Drug Names
Non-Prescription Alcohol, Marijuana, Cocaine, Crack, Heroin, Hallucinogen, Inhalant, Meth



Research question I want to know whether the percentage of drug use for prescribed, medically accepted drugs, have a higher usage than illicit, non-medical drugs, in an age group.

reactable(drug_use_clean)
Null Hypothesis:

\(H_{0}: \mu_{1} = \mu_{2} = \mu_{3} = \mu_{\infty}\)

Null Hypothesis, \(H_{0}\): The respondents in an age group who used medical-use drugs and illicit-use drugs in the past 12 months is statistically the same.


Alternative Hypothesis

\(H_{A}: \mu_{1} \neq \mu_{2} \neq \mu_{3} \neq \mu_{\infty}\)

Alternative Hypothesis, \(H_{A}\): The respondents in an age group who used medical-use drugs and illicit-use in the past 12 months is statistically different and it could be greater or less.


All Drugs

Create new data set with variables for medical-use and illicit-use drugs based on respondents usage. Implementing ANOVA requires that the drug values be in one column (the x column) and that the measurements be in another column (the y column). This requires that we use the long version of our table, dat.long, where the x column is labeled age and the y column is labeled Value. The first few rows of dat.long look like this:

# data set contains all drugs (medical & non-medical)
#dc1 <- select(drug_use_age, age, n, ends_with("-use") )
dc1.long <- gather(drug_use_age, "Drugs", value = "Value", -age, -n)
reactable(dc1.long)

Visualize the variability across each factor in boxplots for the ten different age groups suggesting differences in respondents within an age group who used medical and illicit drugs in the past 12 months.

agebp <- boxplot(Value ~ age, dc1.long, 
                 col = rainbow(ncol(drug_use_age)), 
                 main = "Drug Use Outliners among Age Groups", 
                 las = 2, pars=list(outcol="red"))

Measure Use - Number of respondents in an age group who used medical-use drugs and illicit-use drugs in the past 12 months.

# data set contains all drugs (medical & non-medical)
anova(lm(Value ~ age, dc1.long))
## Analysis of Variance Table
## 
## Response: Value
##            Df    Sum Sq Mean Sq F value Pr(>F)
## age         9   5338270  593141   0.615 0.7823
## Residuals 120 115729402  964412

Results: The null hypothesis, \(H_{0}\), is supported since the probability or p-value is greater than the significance level (Pr > \(\alpha = 0.05\)). \(H_{0}\): The respondents in an age group who used medical-use drugs and illicit-use drugs in the past 12 months is statistically the same.

Other Considerations: - The Sum Sq value is large indicating two or more of the levels are significantly different from one another. - Since F is lower than one, the levels between age group are not different suggesting that observed differences in mean values are not significant.


Fit Linear Regression Model

cFit <- lm(Value ~ age, dc1.long)

#influence(Fit) #silence - lengthy results

summary(cFit)
## 
## Call:
## lm(formula = Value ~ age, data = dc1.long)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -718.0 -446.3 -235.2  -96.8 4998.0 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   275.96     272.37   1.013    0.313
## age16-18      396.48     385.19   1.029    0.305
## age19-21      467.10     385.19   1.213    0.228
## age22-23      250.50     385.19   0.650    0.517
## age24-25      206.45     385.19   0.536    0.593
## age26-29      -20.23     385.19  -0.053    0.958
## age30-34      -29.88     385.19  -0.078    0.938
## age35-49      269.27     385.19   0.699    0.486
## age50-64      -30.32     385.19  -0.079    0.937
## age65+       -179.17     385.19  -0.465    0.643
## 
## Residual standard error: 982 on 120 degrees of freedom
## Multiple R-squared:  0.04409,    Adjusted R-squared:  -0.0276 
## F-statistic: 0.615 on 9 and 120 DF,  p-value: 0.7823
# require R-squared value to determine the percentage of the variance in medical drug and illicit drug use that is explained by age group

Results: The variance in age group explains 4.409% of the variance in the number of respondents who used either medical or illicit drugs.


You can check these assumptions graphically by plotting the results of the ANOVA analysis.

cOP <- par(mfrow=c(1,2))
plot(lm(Value ~ age, dc1.long), 1:2)

par(cOP)

Independent Variable - Drug Type

Visualize the variability across each factor in boxplots for 13 different drugs differences in percentage of those in an age group who used medical and illicit drugs in the past 12 months. The top three drug types among all age groups are: Alcohol, Marijuana and Pain-Reliever.

drugbp <- boxplot(Value ~ Drugs, dc1.long, 
        col = rainbow(ncol(drug_use_age)), 
        horizontal = FALSE, xlab = "", 
        main = "Drug Type Usage Among Respondents", 
        las = 2, cex.axis = 0.5) 

Hypothesis Test on Age and Drugs

Measure Use - respondents in a age group and drug type who used medical-use drugs and illicit-use drugs in the past 12 months.

# data set contains all drugs (medical & non-medical)
anova(lm(Value ~ age + Drugs, dc1.long))
## Analysis of Variance Table
## 
## Response: Value
##            Df   Sum Sq Mean Sq F value    Pr(>F)    
## age         9  5338270  593141  3.1079  0.002324 ** 
## Drugs      12 95117810 7926484 41.5330 < 2.2e-16 ***
## Residuals 108 20611592  190848                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Results: The null hypothesis, \(H_{0}\), is rejected since the probability or p-value for age and drugs are less than the significance level (Pr < \(\alpha = 0.05\)). \(H_{A}\): The percentage of those in an age group who used medical-use drugs and illicit-use in the past 12 months is statistically different and they could be greater or less`.

Other Considerations: - The Sum Sq value is large indicating two or more of the levels are significantly different from one another. - Since F is much large than one, the levels between age group are different suggesting that observed differences in mean values are significant contributors.
Fit Linear Regression Model

cFit2 <- lm(Value ~ age + Drugs, dc1.long)

#influence(Fit) #silence - lengthy results

summary(cFit2)
## 
## Call:
## lm(formula = Value ~ age + Drugs, data = dc1.long)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1717.74  -144.21    -5.25   142.41  2170.21 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              3103.77     179.71  17.271  < 2e-16 ***
## age16-18                  396.48     171.35   2.314  0.02257 *  
## age19-21                  467.10     171.35   2.726  0.00748 ** 
## age22-23                  250.50     171.35   1.462  0.14667    
## age24-25                  206.45     171.35   1.205  0.23090    
## age26-29                  -20.23     171.35  -0.118  0.90622    
## age30-34                  -29.88     171.35  -0.174  0.86191    
## age35-49                  269.27     171.35   1.571  0.11900    
## age50-64                  -30.32     171.35  -0.177  0.85989    
## age65+                   -179.17     171.35  -1.046  0.29807    
## DrugsCocaine Real       -3117.32     195.37 -15.956  < 2e-16 ***
## DrugsCrack Real         -3219.16     195.37 -16.477  < 2e-16 ***
## DrugsHallucinogen Real  -3066.33     195.37 -15.695  < 2e-16 ***
## DrugsHeroin Real        -3217.23     195.37 -16.467  < 2e-16 ***
## DrugsInhalant Real      -3167.16     195.37 -16.211  < 2e-16 ***
## DrugsMarijuana Real     -2232.90     195.37 -11.429  < 2e-16 ***
## DrugsMeth Real          -3216.01     195.37 -16.461  < 2e-16 ***
## DrugsOxycontin Real     -3187.20     195.37 -16.314  < 2e-16 ***
## DrugsPain Reliever Real -2895.73     195.37 -14.822  < 2e-16 ***
## DrugsSedative Real      -3221.53     195.37 -16.489  < 2e-16 ***
## DrugsStimulant Real     -3136.70     195.37 -16.055  < 2e-16 ***
## DrugsTranq Real         -3084.30     195.37 -15.787  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 436.9 on 108 degrees of freedom
## Multiple R-squared:  0.8298, Adjusted R-squared:  0.7966 
## F-statistic: 25.07 on 21 and 108 DF,  p-value: < 2.2e-16
# require R-squared value to determine the percentage of the variance in medical drug and illicit drug use that is explained by age group

Results: The age group and drug group variances explains 82.98% variance in respondents medical-use and illicit-use in an age group in the past 12 months. Since p-value is less than 0.5, the model as a whole is statistically significant.

You can check these assumptions graphically by plotting the results of the ANOVA analysis.

The residuals vs fitted plot shows the unexplained variances, the red line representing the mean of the residuals is non-linear and slightly centered around zero.
In the Normal Q-Q, the real residuals mostly follow a one-to-one line with the the theoretical residuals.
cOP2 <- par(mfrow=c(1,2))
plot(lm(Value ~ age + Drugs, dc1.long), 1:2)

par(cOP2)



CONCLUSION

Does the frequency and use of medical-use, prescription drugs, have a higher usage than illicit-use, non-prescription drugs, in an age group?


NO, there is no statistical evidence to state that frequency and usage of medical drugs is higher than than illicit drugs in an age group. In the correlation matrix, more positive relationships were discovered between medical drug use and age groups than illicit drug-use. However, in the ANOVA test the relationships were not statistically significant. The traditional significant level of α = 0.05 was used to evaluate the hypothesis based on the F-ratio and p-value. The first ANOVA analysis, used only one predictor variable age to determine the response variable Value or respondents and p-value supported the H0 hypothesis of commonality between variables. The second ANOVA analysis, used two predictor variables, age and drug type and supported the HA of statistical difference between variables on drug use in an age group.



Therefore, there is no strong evidence that respondents use of medical drugs is higher than illicit drugs in an age group.

GitHub RPubs

References:

R Multiple Plot Using par() Function. (2017, November 23). DataMentor. https://www.datamentor.io/r-programming/subplot/

Correlation coefficient and correlation test in R. (n.d.). Stats and R. https://statsandr.com/blog/correlation-coefficient-and-correlation-test-in-r/

Reading a Regression Table: A Guide for Students. (n.d.). Steven v. Miller. http://svmiller.com/blog/2014/08/reading-a-regression-table-a-guide-for-students/

ANOVA. (n.d.). Mgimond.github.io. Retrieved December 7, 2021, from https://mgimond.github.io/Stats-in-R/ANOVA.html