For Project 1, I used a data set from the CORGIS CSV data sets. It has 53 columns covering every state in the USA, the year, and the population of three age groups in each state, along with columns that record the totals (in thousands of people) and the rates of substances used in the past month and past year. I want to focus on the rate of marijuana use per 1,000 people in the past year among 12 to 17 year olds, because even where marijuana is legal, people in this age group still should not have it. I chose to look at Maine, Nevada, Washington, Oregon, and the District of Columbia (DC is not a state, but it is in this data set). The data was collected from each state as part of the National Survey on Drug Use and Health (NSDUH) study (source).
Loading Libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Loading in Data Set
Drugs <- read_csv('drugs.csv')
Rows: 867 Columns: 53
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): State
dbl (52): Year, Population.12-17, Population.18-25, Population.26+, Totals.A...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(Drugs)
# A tibble: 6 × 53
State Year `Population.12-17` `Population.18-25` `Population.26+`
<chr> <dbl> <dbl> <dbl> <dbl>
1 Alabama 2002 380805 499453 2812905
2 Alaska 2002 69400 62791 368460
3 Arizona 2002 485521 602265 3329482
4 Arkansas 2002 232986 302029 1687337
5 California 2002 3140739 3919577 21392421
6 Colorado 2002 385648 493921 2798960
# ℹ 48 more variables: `Totals.Alcohol.Use Disorder Past Year.12-17` <dbl>,
# `Totals.Alcohol.Use Disorder Past Year.18-25` <dbl>,
# `Totals.Alcohol.Use Disorder Past Year.26+` <dbl>,
# `Rates.Alcohol.Use Disorder Past Year.12-17` <dbl>,
# `Rates.Alcohol.Use Disorder Past Year.18-25` <dbl>,
# `Rates.Alcohol.Use Disorder Past Year.26+` <dbl>,
# `Totals.Alcohol.Use Past Month.12-17` <dbl>, …
Checking for NAs
sum(is.na(Drugs))
[1] 0
Filtering the data to the States I want to look at
LMS <- Drugs |>
  filter(State %in% c("Maine", "Nevada", "Oregon", "Washington", "District of Columbia"))
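The models below use per-state data frames named DC, ME, NV, OR, and WA, but the code that creates them is not shown. A minimal sketch of how they could be built with filter() follows. It assumes they come from the filtered LMS data; the small `LMS_toy` data frame and its `rate` column are made-up stand-ins so the snippet runs without drugs.csv.

```r
library(dplyr)

# Toy stand-in for LMS: one row per state, values invented for illustration only
LMS_toy <- data.frame(
  State = c("Maine", "Nevada", "Oregon", "Washington", "District of Columbia"),
  Year  = 2002,
  rate  = c(1, 2, 3, 4, 5)
)

# One data frame per state, named to match the data = ... arguments used in lm()
DC <- LMS_toy |> filter(State == "District of Columbia")
ME <- LMS_toy |> filter(State == "Maine")
NV <- LMS_toy |> filter(State == "Nevada")
OR <- LMS_toy |> filter(State == "Oregon")
WA <- LMS_toy |> filter(State == "Washington")

print(DC)
```

With the real data, each of these would hold one row per survey year for that state.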
Linear Analysis for the District of Columbia
lm_DC <- lm(`Rates.Marijuana.Used Past Year.12-17` ~ Year, data = DC)
lm_DC
Call:
lm(formula = `Rates.Marijuana.Used Past Year.12-17` ~ Year, data = DC)
Coefficients:
(Intercept) Year
-4190.913 2.163
Based on the coefficients the formula is:
Rates.Marijuana.Used Past Year.12-17 = 2.163(year) - 4190.913
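The formula suggests that each year the “Rates.Marijuana.Used Past Year.12-17” value (per 1,000 people) increases by about 2.163, and that -4190.913 is what the rate would be if the year were 0. Year 0 is far outside the 2002 to 2018 data, so the intercept has no practical meaning on its own.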
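To make the DC equation easier to read, it helps to evaluate it at years that are actually in the data instead of at year 0. The sketch below does that with the coefficients copied from the summary output; `fitted_rate_dc` is a hypothetical helper name, and because the printed coefficients are rounded, the results are approximate.

```r
# Coefficients copied from the lm_DC output (rounded as printed)
intercept_dc <- -4190.9132
slope_dc     <- 2.1628

# Hypothetical helper: fitted rate per 1,000 people aged 12-17 for a given year
fitted_rate_dc <- function(year) intercept_dc + slope_dc * year

# Evaluate the line at the first and last years in the data set
years  <- c(2002, 2018)
fitted <- fitted_rate_dc(years)
print(round(fitted, 1))

# The change across the 16 years is simply 16 times the slope
print(round(diff(fitted), 1))
```

Read this way, the line rises from roughly 139 to roughly 174 per 1,000 over the period, which is easier to interpret than the intercept.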
sum_DC <- summary(lm_DC)
sum_DC
Call:
lm(formula = `Rates.Marijuana.Used Past Year.12-17` ~ Year, data = DC)
Residuals:
Min 1Q Median 3Q Max
-31.3162 -11.3469 0.4375 12.4656 31.2754
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4190.9132 1754.9530 -2.388 0.0305 *
Year 2.1628 0.8731 2.477 0.0256 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 17.64 on 15 degrees of freedom
Multiple R-squared: 0.2903, Adjusted R-squared: 0.243
F-statistic: 6.136 on 1 and 15 DF, p-value: 0.02564
Based on the summary of the linear model, the p-value is 0.02564 (about 0.03), which makes the trend statistically significant at the 0.05 level. The adjusted R-squared is 0.243, which means that about 24% of the variation in the rate is explained by Year.
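The p-value in the summary comes from the t value for Year and the residual degrees of freedom. As a rough check, assuming the usual two-sided t-test that summary() reports for lm coefficients, it can be recomputed from the printed numbers:

```r
# Values copied from the Year row of the DC summary output
t_value  <- 2.477
df_resid <- 15

# Two-sided p-value: probability of a |t| at least this large under no trend
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)
print(round(p_value, 4))
```

The result should agree with the reported 0.0256 up to rounding of the t value.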
Linear Analysis for Maine
lm_ME <- lm(`Rates.Marijuana.Used Past Year.12-17` ~ Year, data = ME)
lm_ME
Call:
lm(formula = `Rates.Marijuana.Used Past Year.12-17` ~ Year, data = ME)
Coefficients:
(Intercept) Year
1996.0001 -0.9107
Based on the coefficients the formula is:
Rates.Marijuana.Used Past Year.12-17 = -0.9107(year) + 1996.0001
The formula suggests that each year the “Rates.Marijuana.Used Past Year.12-17” value decreases by about 0.9107, and that 1996.0001 is what the rate would be if the year were 0.
sum_ME <- summary(lm_ME)
sum_ME
Call:
lm(formula = `Rates.Marijuana.Used Past Year.12-17` ~ Year, data = ME)
Residuals:
Min 1Q Median 3Q Max
-25.335 -8.955 -1.949 11.287 16.987
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1996.0001 1219.4729 1.637 0.122
Year -0.9107 0.6067 -1.501 0.154
Residual standard error: 12.25 on 15 degrees of freedom
Multiple R-squared: 0.1306, Adjusted R-squared: 0.07263
F-statistic: 2.253 on 1 and 15 DF, p-value: 0.1541
Based on the summary of the linear model, the p-value is 0.1541, so the trend is not statistically significant at the 0.05 level. The adjusted R-squared is 0.073, which means that only about 7.3% of the variation in the rate is explained by Year.
Linear Analysis for Nevada
lm_NV <- lm(`Rates.Marijuana.Used Past Year.12-17` ~ Year, data = NV)
lm_NV
Call:
lm(formula = `Rates.Marijuana.Used Past Year.12-17` ~ Year, data = NV)
Coefficients:
(Intercept) Year
2437.102 -1.135
Based on the coefficients the formula is:
Rates.Marijuana.Used Past Year.12-17 = -1.135(year) + 2437.102
The formula suggests that each year the “Rates.Marijuana.Used Past Year.12-17” value decreases by about 1.135, and that 2437.102 is what the rate would be if the year were 0.
sum_NV <- summary(lm_NV)
sum_NV
Call:
lm(formula = `Rates.Marijuana.Used Past Year.12-17` ~ Year, data = NV)
Residuals:
Min 1Q Median 3Q Max
-21.306 -7.915 -4.394 8.188 32.661
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2437.1019 1506.6739 1.618 0.127
Year -1.1354 0.7496 -1.515 0.151
Residual standard error: 15.14 on 15 degrees of freedom
Multiple R-squared: 0.1327, Adjusted R-squared: 0.07484
F-statistic: 2.294 on 1 and 15 DF, p-value: 0.1506
Based on the summary of the linear model, the p-value is 0.1506, so the trend is not statistically significant at the 0.05 level. The adjusted R-squared is 0.075, which means that only about 7.5% of the variation in the rate is explained by Year.
Linear Analysis for Oregon
lm_OR <- lm(`Rates.Marijuana.Used Past Year.12-17` ~ Year, data = OR)
lm_OR
Call:
lm(formula = `Rates.Marijuana.Used Past Year.12-17` ~ Year, data = OR)
Coefficients:
(Intercept) Year
648.7550 -0.2352
Based on the coefficients the formula is:
Rates.Marijuana.Used Past Year.12-17 = -0.2352(year) + 648.7550
The formula suggests that each year the “Rates.Marijuana.Used Past Year.12-17” value decreases by about 0.2352, and that 648.7550 is what the rate would be if the year were 0.
sum_OR <- summary(lm_OR)
sum_OR
Call:
lm(formula = `Rates.Marijuana.Used Past Year.12-17` ~ Year, data = OR)
Residuals:
Min 1Q Median 3Q Max
-16.1175 -4.6506 -0.7477 7.2091 18.9930
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 648.7550 974.6279 0.666 0.516
Year -0.2352 0.4849 -0.485 0.635
Residual standard error: 9.794 on 15 degrees of freedom
Multiple R-squared: 0.01544, Adjusted R-squared: -0.0502
F-statistic: 0.2352 on 1 and 15 DF, p-value: 0.6347
Based on the summary of the linear model, the p-value is 0.6347, so the trend is not statistically significant at the 0.05 level. The adjusted R-squared is -0.0502. A negative adjusted R-squared means that Year explains essentially none of the variation in the rate.
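A negative adjusted R-squared can look like a mistake, so here is a short sketch of where it comes from. It uses the standard adjustment formula, 1 - (1 - R²)(n - 1)/(n - p - 1), with n = 17 observations (15 residual degrees of freedom plus 2 estimated coefficients) and p = 1 predictor. The function name `adjusted_r2` is my own.

```r
# Standard adjusted R-squared formula; n = observations, p = predictors
adjusted_r2 <- function(r2, n, p = 1) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Multiple R-squared values copied from the DC and Oregon summaries
adj_dc <- adjusted_r2(0.2903, n = 17)
adj_or <- adjusted_r2(0.01544, n = 17)

print(round(adj_dc, 3))
print(round(adj_or, 4))
```

Because the adjustment penalizes the model for using a predictor, a multiple R-squared very close to 0 (as for Oregon) gets pushed slightly below 0.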
Linear Analysis for Washington
lm_WA <- lm(`Rates.Marijuana.Used Past Year.12-17` ~ Year, data = WA)
lm_WA
Call:
lm(formula = `Rates.Marijuana.Used Past Year.12-17` ~ Year, data = WA)
Coefficients:
(Intercept) Year
-714.3206 0.4328
Based on the coefficients the formula is:
Rates.Marijuana.Used Past Year.12-17 = 0.4328(year) - 714.3206
The formula suggests that each year the “Rates.Marijuana.Used Past Year.12-17” value increases by about 0.4328, and that -714.3206 is what the rate would be if the year were 0.
sum_WA <- summary(lm_WA)
sum_WA
Call:
lm(formula = `Rates.Marijuana.Used Past Year.12-17` ~ Year, data = WA)
Residuals:
Min 1Q Median 3Q Max
-22.327 -8.519 -1.257 8.377 24.306
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -714.3206 1376.8101 -0.519 0.611
Year 0.4328 0.6850 0.632 0.537
Residual standard error: 13.84 on 15 degrees of freedom
Multiple R-squared: 0.02593, Adjusted R-squared: -0.03901
F-statistic: 0.3992 on 1 and 15 DF, p-value: 0.537
Based on the summary of the linear model, the p-value is 0.537, so the trend is not statistically significant at the 0.05 level. The adjusted R-squared is -0.03901. A negative adjusted R-squared means that Year explains essentially none of the variation in the rate.
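To compare the five models side by side, the slopes and p-values reported above can be collected into one small table. This is only a convenience sketch: the numbers are copied by hand from the summaries, and the `trend_summary` name and the 0.05 cutoff are my own choices.

```r
# Slopes and p-values for Year, copied from the five summaries above
trend_summary <- data.frame(
  state   = c("District of Columbia", "Maine", "Nevada", "Oregon", "Washington"),
  slope   = c(2.1628, -0.9107, -1.1354, -0.2352, 0.4328),
  p_value = c(0.0256, 0.154, 0.151, 0.635, 0.537)
)

# Direction of the fitted trend and whether it passes the 0.05 cutoff
trend_summary$direction   <- ifelse(trend_summary$slope > 0, "increasing", "decreasing")
trend_summary$significant <- trend_summary$p_value < 0.05

# Sort so the strongest evidence of a trend comes first
print(trend_summary[order(trend_summary$p_value), ])
```

Laid out like this, it is easier to see that DC and Washington have positive slopes and that only DC's trend is significant at the 0.05 level.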
P2 <- LMS |>
  ggplot(aes(x = Year, y = `Rates.Marijuana.Used Past Year.12-17`, color = State, size = `Totals.Marijuana.Used Past Year.12-17`)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE, show.legend = FALSE) +
  guides() +
  theme_bw() +
  scale_color_manual(values = c(Maine = "blue", Nevada = 'lightskyblue3', Oregon = 'pink', `Washington` = 'gold', `District of Columbia` = 'red')) +
  theme(legend.position = "right", legend.key = element_rect(fill = "white", colour = "black")) +
  labs(title = "Trends in Marijuana Usage Rate From 2002 to 2018 (Age 12-17)", x = "Year", y = "Rate Marijuana Used", color = "States", size = "Total Users (in thousands)", caption = "Source: NSDUH")
P2
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
`geom_smooth()` using formula = 'y ~ x'
Warning: The following aesthetics were dropped during statistical transformation: size.
ℹ This can happen when ggplot fails to infer the correct grouping structure in
the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
variable into a factor?
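The two warnings above appear because the size aesthetic is set in the main ggplot() call, so geom_smooth() inherits it even though it only makes sense for the points. One possible way to avoid them is to map size inside geom_point() only. The sketch below shows the idea on a small made-up data frame (`toy`, with invented `rate` and `total` columns), not on the real data.

```r
library(ggplot2)

# Made-up data for illustration only: two groups observed over four years
toy <- data.frame(
  Year  = rep(2002:2005, times = 2),
  State = rep(c("A", "B"), each = 4),
  rate  = c(140, 150, 145, 160, 120, 118, 121, 115),
  total = c(5, 6, 5, 7, 10, 9, 10, 8)
)

# size is mapped only in geom_point(), so the trend lines never receive it
p <- ggplot(toy, aes(x = Year, y = rate, color = State)) +
  geom_point(aes(size = total)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, show.legend = FALSE) +
  theme_bw()

print(p)
```

Giving the formula explicitly in geom_smooth() also keeps ggplot2 from printing the message about which formula it used.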
IL <- ggplot(LMS, aes(x = Year, y = `Rates.Marijuana.Used Past Year.12-17`)) +
  geom_point() +
  geom_smooth(color = 'skyblue2') +
  facet_wrap(~ State)
IL
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
First, I loaded the library and the data set. Next, I checked for missing data by summing the missing values in the data set with is.na(). Thankfully, everything is accounted for. Then I used filter on the data set to keep only the five states I wanted (Washington, the District of Columbia [DC], Maine, Nevada, and Oregon). I also created an individual data set for each state by using filter again, to get a better look at each individual trend. After cleaning, I explored the linear equation, p-value, and R-squared for each state, with the Year column as the independent variable and the rate of marijuana use in the 12 to 17 age group as the dependent variable. I noticed that the slopes are negative for every state except the District of Columbia and Washington, and that only the DC trend is statistically significant. I created a scatter plot to get a visual of the slopes, with the year on the x-axis, the rate of marijuana use on the y-axis, the total number of users (in thousands) as the size of the dots, and the states as the color.
What stood out to me was how steep the slope was for DC. Recreational marijuana use was not legal in DC until 2015, which could help explain why the DC slope is steep, whereas the other four states had already legalized medical marijuana before the data set starts in 2002. I wanted to bring attention to minors using marijuana.
I am happy with the end results of my analysis. In the future, I would like to add a heat map of the types of substances to see which substance the 12 to 17 age group used the most.
Sources:
Where Marijuana is legal in the US https://mjbizdaily.com/map-of-us-marijuana-legalization-by-state/