Age Groups Pictures
In the analysis, the data review drug use by age groups over the past 12 months. Drug use has among individuals has risen over the years between prescription and illicit drug. In the article, “How Baby Boomers Get High” compares the overall usage between younger counterparts and individuals over 50. I will investigate if medical drugs have a higher usage among all age groups than illicit drugs.
Install packages and libraries to load data and perform analysis.
#install.packages('tinytex')
library(tinytex)
library(readr)
library(ggplot2)
library(tidyverse)
library(reactable)
library(reshape2)
library(tidyr)
Retrieve csv file from Github Data Source
# load data
drug_use_by_age <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/drug-use-by-age/drug-use-by-age.csv")
Data is collected by National Survey on Drug Use and Health from the Substance Abuse and Mental Health Data Archive.
Data was posted on FiveThirtyEight GitHub and retrieved from National Survey on Drug Use and Health from the Substance Abuse and Mental Health Data Archive
This is an observational study to provide evidence of a naturally occurring association between variables.
The data evolved from an article How Baby Boomers Get High and covers 13 drugs across 17 age groups.
Each case studies 13 drug types among 17 age groups in relation to percentage usage and the median number of times a user has used a drug in the past 12 months. The dataset is accessible on Five Thirty Eight Github with a last commit on April 22, 2015. The dataset on drug use by age includes 28 variables and 17 observations.
dim(drug_use_by_age)
## [1] 17 28
reactable(drug_use_by_age) # original data set
| Header (variable) | Definition |
|---|---|
alcohol-use |
Percentage of those in an age group who used alcohol in the past 12 months |
alcohol-frequency |
Median number of times a user in an age group used alcohol in the past 12 months |
marijuana-use |
Percentage of those in an age group who used marijuana in the past 12 months |
marijuana-frequency |
Median number of times a user in an age group used marijuana in the past 12 months |
cocaine-use |
Percentage of those in an age group who used cocaine in the past 12 months |
cocaine-frequency |
Median number of times a user in an age group used cocaine in the past 12 months |
crack-use |
Percentage of those in an age group who used crack in the past 12 months |
crack-frequency |
Median number of times a user in an age group used crack in the past 12 months |
heroin-use |
Percentage of those in an age group who used heroin in the past 12 months |
heroin-frequency |
Median number of times a user in an age group used heroin in the past 12 months |
hallucinogen-use |
Percentage of those in an age group who used hallucinogens in the past 12 months |
hallucinogen-frequency |
Median number of times a user in an age group used hallucinogens in the past 12 months |
inhalant-use |
Percentage of those in an age group who used inhalants in the past 12 months |
inhalant-frequency |
Median number of times a user in an age group used inhalants in the past 12 months |
pain-releiver-use |
Percentage of those in an age group who used pain relievers in the past 12 months |
pain-releiver-frequency |
Median number of times a user in an age group used pain relievers in the past 12 months |
oxycontin-use |
Percentage of those in an age group who used oxycontin in the past 12 months |
oxycontin-frequency |
Median number of times a user in an age group used oxycontin in the past 12 months |
tranquilizer-use |
Percentage of those in an age group who used tranquilizer in the past 12 months |
tranquilizer-frequency |
Median number of times a user in an age group used tranquilizer in the past 12 months |
stimulant-use |
Percentage of those in an age group who used stimulants in the past 12 months |
stimulant-frequency |
Median number of times a user in an age group used stimulants in the past 12 months |
meth-use |
Percentage of those in an age group who used meth in the past 12 months |
meth-frequency |
Median number of times a user in an age group used meth in the past 12 months |
sedative-use |
Percentage of those in an age group who used sedatives in the past 12 months |
sedative-frequency |
Median number of times a user in an age group used sedatives in the past 12 months |
Does the frequency and use of medical-use, prescription drugs, have a higher usage than illicit-use, non-prescription drugs, in an age group?
The response variables are the number of users n (quantitative) and drug use (qualitative).
The independent variable is the age and is qualitative.
Data set was manually formatted to exclude the frequency variables and converted the percentage-use to real numbers for the creation of age group clusters. The clean data set was then loaded into the GitHub repository.
drug_use_clean <- read_csv("https://raw.githubusercontent.com/candrewxs/Project_Proposal_D606/main/drug-use-clean.csv")
head(drug_use_clean, 10)
## # A tibble: 10 x 15
## age n `Alcohol Real` `Marijuana Real` `Cocaine Real` `Crack Real`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 12 2798 109. 30.8 2.80 0
## 2 13 2757 234. 93.7 2.76 0
## 3 14 2792 505. 243. 2.79 0
## 4 15 2956 863. 429. 14.8 2.96
## 5 16 3058 1226. 688. 30.6 0
## 6 17 3038 1498. 851. 60.8 3.04
## 7 18 2469 1449. 832. 79.0 9.88
## 8 19 2223 1436. 742. 91.1 11.1
## 9 20 2271 1583. 772. 111. 13.6
## 10 21 2354 1959. 777. 113. 11.8
## # ... with 9 more variables: Heroin Real <dbl>, Hallucinogen Real <dbl>,
## # Inhalant Real <dbl>, Pain Reliever Real <dbl>, Oxycontin Real <dbl>,
## # Tranq Real <dbl>, Stimulant Real <dbl>, Meth Real <dbl>,
## # Sedative Real <dbl>
Convert data to a dataframe, changed column designation to as.numeric, and replaced blank row with zero’s.
drug_use <- as.data.frame(drug_use_clean, stringsAsFactors = FALSE)
drug_use <- as.data.frame(apply(drug_use, 2, as.numeric))
drug_use[is.na(drug_use)] <- 0
Create three new clusters for age groups between 12 to 21 as: 12-15, 16-18, and 19-21.
drug_use[1,] <- drug_use[1,] + drug_use[2,] + drug_use[3,] + drug_use[4,]
drug_use[5,] <- drug_use[5,] + drug_use[6,] + drug_use[7,]
drug_use[8,] <- drug_use[8,] + drug_use[9,] + drug_use[10,]
The drug_use-by_age dataset contains 10 observations instead of 17 age groups. The combined age groups are as follows:12-15, 16-18, 19-21 and then we’ll change age column from character to factor for analysis testing.
drug_use_age <- drug_use[-c(2:4,6,7,9,10), ]
drug_use_age$age <- c("12-15", "16-18", "19-21", "22-23", "24-25", "26-29",
"30-34", "35-49", "50-64", "65+")
drug_use_age$age <- as.factor(drug_use_age$age) #change column to factor
reactable(drug_use_age) # new data set
Visualization of dataset - sample size (n) by age
ggplot(drug_use_age, aes(x = age, y = n, fill = n)) +
geom_col() +
scale_fill_continuous() +
labs(y="sample size (n)",
title = "Drug Use Sample Size by Age") +
theme(plot.title=element_text(hjust = 0.5))
View the univariate summary information for anything unusual or unexpected for any data entry error. The summary information provides a close look at the minimum and maximum values of each variable. There are variables whose minimum values are zero and I decided to keep the value because it is a valid respondent input for drug use response.
summary(drug_use_age)
## age n Alcohol Real Marijuana Real Cocaine Real
## 12-15 :1 Min. : 2448 Min. :1207 Min. : 29.38 Min. : 0.00
## 16-18 :1 1st Qu.: 3129 1st Qu.:2145 1st Qu.: 488.93 1st Qu.: 41.52
## 19-21 :1 Median : 4649 Median :3226 Median : 782.35 Median : 97.48
## 22-23 :1 Mean : 5527 Mean :3237 Mean :1003.89 Mean :119.48
## 24-25 :1 3rd Qu.: 7255 3rd Qu.:4121 3rd Qu.:1288.38 3rd Qu.:180.32
## 26-29 :1 Max. :11303 Max. :5543 Max. :2370.74 Max. :315.41
## (Other):4
## Crack Real Heroin Real Hallucinogen Real Inhalant Real
## Min. : 0.00 Min. : 0.000 Min. : 2.448 Min. : 0.00
## 1st Qu.:11.11 1st Qu.: 8.407 1st Qu.: 46.148 1st Qu.: 12.53
## Median :15.01 Median :13.635 Median :106.491 Median : 29.45
## Mean :17.64 Mean :19.560 Mean :170.462 Mean : 69.63
## 3rd Qu.:23.39 3rd Qu.:28.096 3rd Qu.:235.222 3rd Qu.: 85.37
## Max. :36.95 Max. :51.777 Max. :507.534 Max. :260.19
##
## Pain Reliever Real Oxycontin Real Tranq Real Stimulant Real
## Min. : 14.69 Min. : 0.00 Min. : 4.896 Min. : 0.00
## 1st Qu.:181.26 1st Qu.: 23.07 1st Qu.: 99.362 1st Qu.: 41.16
## Median :352.01 Median : 35.95 Median :125.403 Median : 70.49
## Mean :341.07 Mean : 49.59 Mean :152.487 Mean :100.09
## 3rd Qu.:456.32 3rd Qu.: 74.94 3rd Qu.:204.684 3rd Qu.:156.93
## Max. :674.97 Max. :118.14 Max. :307.806 Max. :260.71
##
## Meth Real Sedative Real
## Min. : 0.00 Min. : 0.00
## 1st Qu.:12.20 1st Qu.: 9.24
## Median :15.28 Median :10.98
## Mean :20.79 Mean :15.26
## 3rd Qu.:31.16 3rd Qu.:24.36
## Max. :43.45 Max. :31.18
##
Check the columns for missing values
sum(is.na(drug_use_age)) # checking for missing values
## [1] 0
The histogram of alcohol shows the drug data density and the accompanying boxplot shows the outliners. The variable, alcohol real, shows a peak between 2000 and 3000 respondents, with a right skewed unimodal distribution.
In an alternative variable,pain-reliever real, the histogram and boxplot also provide insights on its data density relative to the percentage of those in an age group who used alcohol in the past 12 months. The shape has a multimodal distribution, prominent peaks between 0 to 100 and 300 to 500 respondents with no outliners.
attach(drug_use_age)
par(mfrow=c(2,2))
hist(`Alcohol Real`)
boxplot(`Alcohol Real`, horizontal = TRUE)
hist(`Pain Reliever Real`)
boxplot(`Pain Reliever Real`, horizontal = TRUE)
Variable Distribution & Outliner View
In this section we are going to create a bivariate plot to investigate the relationships between two different parameters.
Correlation Matrix
The correlation matrix was used to show a linear relationship between variables. The correlation, R, is perfectly linear when the relationship value is either -1 or 1.
positive correlation indicating if a variable increases the other one increases and vice-versa.
| Positive Relationship between Variables | Correlation Value |
|---|---|
| Pain Reliever Real ~ Marijuana Real | 0.970 |
| Oxycontin Real ~ Marijuana Real | 0.978 |
| Tranq Real ~ Marijuana Real | 0.979 |
| Stimulant Real ~ Hallucinogen-Real | 0.988 |
| Meth Real ~ Tranq Real | 0.984 |
Variables with a correlation close to 0 indicates that the two variables are independent, where one variable increases and the other variable may remain unchanged.
| Independent Relationship between Variables | Correlation Value |
|---|---|
| Heroin Real ~ n | 0.073 |
| Inhalant Real ~ Heroin Real | 0.084 |
I will use the anova function to compute the F-ratio and the p-value. The function takes as argument a model (a linear regression model in this case) where the dependent variable y is the drug usage value and the independent variable x is age (group cluster).
Drug Group Classification:
| Medical Use | Drug Names |
|---|---|
Presciption |
Pain-Reliever, Oxycontin, Tranq, Stimulant, Sedative |
| Illicit Use | Drug Names |
|---|---|
Non-Prescription |
Alcohol, Marijuana, Cocaine, Crack, Heroin, Hallucinogen, Inhalant, Meth |
Research question I want to know whether the percentage of drug use for prescribed, medically accepted drugs, have a higher usage than illicit, non-medical drugs, in an age group.
reactable(drug_use_clean)
Null Hypothesis:
\(H_{0}: \mu_{1} = \mu_{2} = \mu_{3} = \mu_{\infty}\)
Null Hypothesis, \(H_{0}\): The respondents in an age group who used medical-use drugs and illicit-use drugs in the past 12 months is statistically the same.
\(H_{A}: \mu_{1} \neq \mu_{2} \neq \mu_{3} \neq \mu_{\infty}\)
Alternative Hypothesis, \(H_{A}\): The respondents in an age group who used medical-use drugs and illicit-use in the past 12 months is statistically different and it could be greater or less.
Create new data set with variables for medical-use and illicit-use drugs based on respondents usage. Implementing ANOVA requires that the drug values be in one column (the x column) and that the measurements be in another column (the y column). This requires that we use the long version of our table, dat.long, where the x column is labeled age and the y column is labeled Value. The first few rows of dat.long look like this:
# data set contains all drugs (medical & non-medical)
#dc1 <- select(drug_use_age, age, n, ends_with("-use") )
dc1.long <- gather(drug_use_age, "Drugs", value = "Value", -age, -n)
reactable(dc1.long)
Visualize the variability across each factor in boxplots for the ten different age groups suggesting differences in respondents within an age group who used medical and illicit drugs in the past 12 months.
agebp <- boxplot(Value ~ age, dc1.long,
col = rainbow(ncol(drug_use_age)),
main = "Drug Use Outliners among Age Groups",
las = 2, pars=list(outcol="red"))
Measure Use - Number of respondents in an age group who used medical-use drugs and illicit-use drugs in the past 12 months.
# data set contains all drugs (medical & non-medical)
anova(lm(Value ~ age, dc1.long))
## Analysis of Variance Table
##
## Response: Value
## Df Sum Sq Mean Sq F value Pr(>F)
## age 9 5338270 593141 0.615 0.7823
## Residuals 120 115729402 964412
Results: The null hypothesis, \(H_{0}\), is supported since the probability or p-value is greater than the significance level (Pr > \(\alpha = 0.05\)). \(H_{0}\): The respondents in an age group who used medical-use drugs and illicit-use drugs in the past 12 months is statistically the same.
Other Considerations: - The Sum Sq value is large indicating two or more of the levels are significantly different from one another. - Since F is lower than one, the levels between age group are not different suggesting that observed differences in mean values are not significant.
Fit Linear Regression Model
cFit <- lm(Value ~ age, dc1.long)
#influence(Fit) #silence - lengthy results
summary(cFit)
##
## Call:
## lm(formula = Value ~ age, data = dc1.long)
##
## Residuals:
## Min 1Q Median 3Q Max
## -718.0 -446.3 -235.2 -96.8 4998.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 275.96 272.37 1.013 0.313
## age16-18 396.48 385.19 1.029 0.305
## age19-21 467.10 385.19 1.213 0.228
## age22-23 250.50 385.19 0.650 0.517
## age24-25 206.45 385.19 0.536 0.593
## age26-29 -20.23 385.19 -0.053 0.958
## age30-34 -29.88 385.19 -0.078 0.938
## age35-49 269.27 385.19 0.699 0.486
## age50-64 -30.32 385.19 -0.079 0.937
## age65+ -179.17 385.19 -0.465 0.643
##
## Residual standard error: 982 on 120 degrees of freedom
## Multiple R-squared: 0.04409, Adjusted R-squared: -0.0276
## F-statistic: 0.615 on 9 and 120 DF, p-value: 0.7823
# require R-squared value to determine the percentage of the variance in medical drug and illicit drug use that is explained by age group
Results: The variance in age group explains 4.409% of the variance in the number of respondents who used either medical or illicit drugs.
You can check these assumptions graphically by plotting the results of the ANOVA analysis.
cOP <- par(mfrow=c(1,2))
plot(lm(Value ~ age, dc1.long), 1:2)
par(cOP)
Visualize the variability across each factor in boxplots for 13 different drugs differences in percentage of those in an age group who used medical and illicit drugs in the past 12 months. The top three drug types among all age groups are: Alcohol, Marijuana and Pain-Reliever.
drugbp <- boxplot(Value ~ Drugs, dc1.long,
col = rainbow(ncol(drug_use_age)),
horizontal = FALSE, xlab = "",
main = "Drug Type Usage Among Respondents",
las = 2, cex.axis = 0.5)
Measure Use - respondents in a age group and drug type who used medical-use drugs and illicit-use drugs in the past 12 months.
# data set contains all drugs (medical & non-medical)
anova(lm(Value ~ age + Drugs, dc1.long))
## Analysis of Variance Table
##
## Response: Value
## Df Sum Sq Mean Sq F value Pr(>F)
## age 9 5338270 593141 3.1079 0.002324 **
## Drugs 12 95117810 7926484 41.5330 < 2.2e-16 ***
## Residuals 108 20611592 190848
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Results: The null hypothesis, \(H_{0}\), is rejected since the probability or p-value for age and drugs are less than the significance level (Pr < \(\alpha = 0.05\)). \(H_{A}\): The percentage of those in an age group who used medical-use drugs and illicit-use in the past 12 months is statistically different and they could be greater or less`.
Other Considerations: - The Sum Sq value is large indicating two or more of the levels are significantly different from one another. - Since F is much large than one, the levels between age group are different suggesting that observed differences in mean values are significant contributors.
Fit Linear Regression Model
cFit2 <- lm(Value ~ age + Drugs, dc1.long)
#influence(Fit) #silence - lengthy results
summary(cFit2)
##
## Call:
## lm(formula = Value ~ age + Drugs, data = dc1.long)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1717.74 -144.21 -5.25 142.41 2170.21
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3103.77 179.71 17.271 < 2e-16 ***
## age16-18 396.48 171.35 2.314 0.02257 *
## age19-21 467.10 171.35 2.726 0.00748 **
## age22-23 250.50 171.35 1.462 0.14667
## age24-25 206.45 171.35 1.205 0.23090
## age26-29 -20.23 171.35 -0.118 0.90622
## age30-34 -29.88 171.35 -0.174 0.86191
## age35-49 269.27 171.35 1.571 0.11900
## age50-64 -30.32 171.35 -0.177 0.85989
## age65+ -179.17 171.35 -1.046 0.29807
## DrugsCocaine Real -3117.32 195.37 -15.956 < 2e-16 ***
## DrugsCrack Real -3219.16 195.37 -16.477 < 2e-16 ***
## DrugsHallucinogen Real -3066.33 195.37 -15.695 < 2e-16 ***
## DrugsHeroin Real -3217.23 195.37 -16.467 < 2e-16 ***
## DrugsInhalant Real -3167.16 195.37 -16.211 < 2e-16 ***
## DrugsMarijuana Real -2232.90 195.37 -11.429 < 2e-16 ***
## DrugsMeth Real -3216.01 195.37 -16.461 < 2e-16 ***
## DrugsOxycontin Real -3187.20 195.37 -16.314 < 2e-16 ***
## DrugsPain Reliever Real -2895.73 195.37 -14.822 < 2e-16 ***
## DrugsSedative Real -3221.53 195.37 -16.489 < 2e-16 ***
## DrugsStimulant Real -3136.70 195.37 -16.055 < 2e-16 ***
## DrugsTranq Real -3084.30 195.37 -15.787 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 436.9 on 108 degrees of freedom
## Multiple R-squared: 0.8298, Adjusted R-squared: 0.7966
## F-statistic: 25.07 on 21 and 108 DF, p-value: < 2.2e-16
# require R-squared value to determine the percentage of the variance in medical drug and illicit drug use that is explained by age group
Results: The age group and drug group variances explains 82.98% variance in respondents medical-use and illicit-use in an age group in the past 12 months. Since p-value is less than 0.5, the model as a whole is statistically significant.
You can check these assumptions graphically by plotting the results of the ANOVA analysis.
cOP2 <- par(mfrow=c(1,2))
plot(lm(Value ~ age + Drugs, dc1.long), 1:2)
par(cOP2)
Does the frequency and use of medical-use, prescription drugs, have a higher usage than illicit-use, non-prescription drugs, in an age group?
NO, there is no statistical evidence to state that frequency and usage of medical drugs is higher than than illicit drugs in an age group. In the correlation matrix, more positive relationships were discovered between medical drug use and age groups than illicit drug-use. However, in the ANOVA test the relationships were not statistically significant. The traditional significant level of α = 0.05 was used to evaluate the hypothesis based on the F-ratio and p-value. The first ANOVA analysis, used only one predictor variable age to determine the response variable Value or respondents and p-value supported the H0 hypothesis of commonality between variables. The second ANOVA analysis, used two predictor variables, age and drug type and supported the HA of statistical difference between variables on drug use in an age group.
Therefore, there is no strong evidence that respondents use of medical drugs is higher than illicit drugs in an age group.
GitHub RPubsR Multiple Plot Using par() Function. (2017, November 23). DataMentor. https://www.datamentor.io/r-programming/subplot/
Correlation coefficient and correlation test in R. (n.d.). Stats and R. https://statsandr.com/blog/correlation-coefficient-and-correlation-test-in-r/
Reading a Regression Table: A Guide for Students. (n.d.). Steven v. Miller. http://svmiller.com/blog/2014/08/reading-a-regression-table-a-guide-for-students/
ANOVA. (n.d.). Mgimond.github.io. Retrieved December 7, 2021, from https://mgimond.github.io/Stats-in-R/ANOVA.html