Project_1

Author

Jonathan_RH

Introduction

For Project 1, I used a data set from the CORGIS CSV data sets collection. It has 53 columns covering every state in the USA, the year, and the population of three age groups for each state, along with columns that record the totals (in thousands of people) and rates of substance use in the past month and past year. I want to focus on the rate of marijuana use per 1,000 people in the past year among 12- to 17-year-olds, because even where marijuana is legal, this age group should not have access to it. I chose to look at five jurisdictions: Maine, Nevada, Washington, Oregon, and the District of Columbia (DC is not a state, but it is in this data set). The data was collected from each state as part of the NSDUH (source) study.

Loading Libraries

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading in Data Set

Drugs <- read_csv('drugs.csv')

Rows: 867 Columns: 53
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): State
dbl (52): Year, Population.12-17, Population.18-25, Population.26+, Totals.A...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(Drugs)

# A tibble: 6 × 53
  State       Year `Population.12-17` `Population.18-25` `Population.26+`
  <chr>      <dbl>              <dbl>              <dbl>            <dbl>
1 Alabama     2002             380805             499453          2812905
2 Alaska      2002              69400              62791           368460
3 Arizona     2002             485521             602265          3329482
4 Arkansas    2002             232986             302029          1687337
5 California  2002            3140739            3919577         21392421
6 Colorado    2002             385648             493921          2798960
# ℹ 48 more variables: `Totals.Alcohol.Use Disorder Past Year.12-17` <dbl>,
#   `Totals.Alcohol.Use Disorder Past Year.18-25` <dbl>,
#   `Totals.Alcohol.Use Disorder Past Year.26+` <dbl>,
#   `Rates.Alcohol.Use Disorder Past Year.12-17` <dbl>,
#   `Rates.Alcohol.Use Disorder Past Year.18-25` <dbl>,
#   `Rates.Alcohol.Use Disorder Past Year.26+` <dbl>,
#   `Totals.Alcohol.Use Past Month.12-17` <dbl>, …

Checking for NAs

sum(is.na(Drugs))

[1] 0
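
The single total above confirms that nothing is missing anywhere in the data set. Had the total been non-zero, a per-column count would show where the gaps are. The following is a minimal sketch on a small made-up data frame; `toy` and all of its values are hypothetical and are not taken from drugs.csv.

```r
# Hypothetical toy data, NOT from drugs.csv
toy <- data.frame(
  State = c("Maine", "Nevada", "Oregon"),
  Year  = c(2002, NA, 2004),
  Rate  = c(150.2, 148.7, NA)
)

# Total number of missing cells, same idea as sum(is.na(Drugs))
total_na <- sum(is.na(toy))

# Missing cells per column, to locate the gaps
na_by_column <- colSums(is.na(toy))

total_na
na_by_column
```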

Filtering the data to the States I want to look at

LMS <- Drugs |>
  filter(State %in% c("Maine","Nevada","Oregon","Washington","District of Columbia"))

Filter data for each state for the linear models

DC <- Drugs |>
  filter(State == "District of Columbia")

ME <- Drugs |>
  filter(State == "Maine")

NV <- Drugs |>
  filter(State == "Nevada")

OR <- Drugs |>
  filter(State == "Oregon")

WA <- Drugs |>
  filter(State == "Washington")
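
Creating one object per state works well for five states. As an alternative, the subsets and their models can be built in one pass with base R's `split()` and `lapply()`. This is only a sketch on made-up data: `toy`, the state labels "A" and "B", and the `Rate` column are hypothetical stand-ins, not the real columns or values.

```r
# Hypothetical toy data: two made-up states, five years each
toy <- data.frame(
  State = rep(c("A", "B"), each = 5),
  Year  = rep(2002:2006, times = 2),
  Rate  = c(10, 12, 14, 16, 18,   # state A rises by exactly 2 per year
            30, 29, 28, 27, 26)   # state B falls by exactly 1 per year
)

# One data frame per state, stored in a named list
by_state <- split(toy, toy$State)

# Fit Rate ~ Year separately within each state
models <- lapply(by_state, function(d) lm(Rate ~ Year, data = d))

# Pull the Year slope out of every model
slopes <- sapply(models, function(m) coef(m)[["Year"]])
slopes
```

Keeping the models in a named list makes it easy to compare slopes side by side, at the cost of slightly less readable code than one named object per state.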

Linear Analysis for DC

lm_DC <- lm(`Rates.Marijuana.Used Past Year.12-17`~ Year, data = DC)

lm_DC


Call:
lm(formula = `Rates.Marijuana.Used Past Year.12-17` ~ Year, data = DC)

Coefficients:
(Intercept)         Year  
  -4190.913        2.163

Based on the coefficients the formula is:

Rates.Marijuana.Used Past Year.12-17 = 2.163(year) - 4190.913

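The formula suggests that for every additional year, “Rates.Marijuana.Used Past Year.12-17” increases by 2.163, and that -4190.913 is what the rate would be if the year were 0. Year 0 is far outside the range of the data, so the intercept is an extrapolation rather than a meaningful rate.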
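
To make the equation more concrete, it can be evaluated at years that are actually in the data. The sketch below plugs the coefficients reported in the lm_DC output into the formula; `fitted_rate` is a helper name introduced here for illustration. With these rounded coefficients the fitted rate is roughly 139 in 2002 and roughly 174 in 2018.

```r
# Coefficients as reported in the lm_DC output above
intercept <- -4190.9132
slope     <- 2.1628

# Fitted rate for a given year, from the reported equation
fitted_rate <- function(year) intercept + slope * year

fitted_rate(0)      # the intercept itself: not a meaningful rate
fitted_rate(2002)   # first year in the data
fitted_rate(2018)   # last year in the data

# Change implied by the line over the 16 years from 2002 to 2018
fitted_rate(2018) - fitted_rate(2002)
```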

sum_DC <- summary(lm_DC)
sum_DC


Call:
lm(formula = `Rates.Marijuana.Used Past Year.12-17` ~ Year, data = DC)

Residuals:
     Min       1Q   Median       3Q      Max 
-31.3162 -11.3469   0.4375  12.4656  31.2754 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept) -4190.9132  1754.9530  -2.388   0.0305 *
Year            2.1628     0.8731   2.477   0.0256 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 17.64 on 15 degrees of freedom
Multiple R-squared:  0.2903,    Adjusted R-squared:  0.243 
F-statistic: 6.136 on 1 and 15 DF,  p-value: 0.02564

Based on the summary of the linear model, the p-value for the slope is 0.02564 (about 0.03), which is below 0.05, so the trend is statistically significant. The adjusted R-squared is 0.243, which means that about 24% of the variation in the rate is explained by Year.
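
The p-value and R-squared do not have to be read off the printed summary; they can be pulled from the summary object directly. The sketch below shows this on made-up data: `toy`, `Rate`, and all of the numbers are hypothetical, not values from drugs.csv.

```r
# Hypothetical toy data, NOT from drugs.csv
toy <- data.frame(
  Year = 2002:2009,
  Rate = c(140, 152, 147, 160, 155, 171, 166, 178)
)

fit     <- lm(Rate ~ Year, data = toy)
fit_sum <- summary(fit)

# Coefficient table with columns Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(fit_sum)

slope_p <- coef_table["Year", "Pr(>|t|)"]  # p-value for the Year slope
r2      <- fit_sum$r.squared               # multiple R-squared
adj_r2  <- fit_sum$adj.r.squared           # adjusted R-squared

c(slope_p = slope_p, r2 = r2, adj_r2 = adj_r2)
```

Extracting the values this way avoids copying errors when the same statistics are reported for several states.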

Linear Analysis for Maine

lm_ME <- lm(`Rates.Marijuana.Used Past Year.12-17`~Year, data = ME)

lm_ME


Call:
lm(formula = `Rates.Marijuana.Used Past Year.12-17` ~ Year, data = ME)

Coefficients:
(Intercept)         Year  
  1996.0001      -0.9107

Based on the coefficients the formula is:

Rates.Marijuana.Used Past Year.12-17 = -0.9107(year) + 1996.0001

The formula suggests that for every additional year, “Rates.Marijuana.Used Past Year.12-17” decreases by 0.9107, and that 1996.0001 is what the rate would be if the year were 0.
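
Because year 0 is so far from the data, an intercept such as 1996.0001 is hard to interpret. One common option is to measure time from the first year in the data, so that the intercept becomes the fitted rate in that year while the slope stays the same. The sketch below illustrates this on made-up data: `toy`, `Rate`, and the numbers are hypothetical, and the choice of 2002 as the reference year is an assumption based on the first year shown in the data set.

```r
# Hypothetical toy data, NOT from drugs.csv
toy <- data.frame(
  Year = 2002:2009,
  Rate = c(160, 158, 161, 155, 157, 152, 154, 150)
)

# Original parameterization: the intercept is the fitted rate at year 0
fit_raw <- lm(Rate ~ Year, data = toy)

# Centered parameterization: the intercept is the fitted rate in 2002
fit_centered <- lm(Rate ~ I(Year - 2002), data = toy)

coef(fit_raw)
coef(fit_centered)
```

Both models describe the same line, so the slope, p-value, and R-squared are unchanged; only the meaning of the intercept differs.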

sum_ME <- summary(lm_ME)
sum_ME


Call:
lm(formula = `Rates.Marijuana.Used Past Year.12-17` ~ Year, data = ME)

Residuals:
    Min      1Q  Median      3Q     Max 
-25.335  -8.955  -1.949  11.287  16.987 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 1996.0001  1219.4729   1.637    0.122
Year          -0.9107     0.6067  -1.501    0.154

Residual standard error: 12.25 on 15 degrees of freedom
Multiple R-squared:  0.1306,    Adjusted R-squared:  0.07263 
F-statistic: 2.253 on 1 and 15 DF,  p-value: 0.1541

Based on the summary of the linear model, the p-value for the slope is 0.1541, which is above 0.05, so the trend is not statistically significant. The adjusted R-squared is 0.073, which means that about 7.3% of the variation in the rate is explained by Year.

Linear Analysis for Nevada

lm_NV <- lm(`Rates.Marijuana.Used Past Year.12-17`~Year, data = NV)

lm_NV


Call:
lm(formula = `Rates.Marijuana.Used Past Year.12-17` ~ Year, data = NV)

Coefficients:
(Intercept)         Year  
   2437.102       -1.135

Based on the coefficients the formula is:

Rates.Marijuana.Used Past Year.12-17 = -1.135(year) + 2437.102

The formula suggests that for every additional year, “Rates.Marijuana.Used Past Year.12-17” decreases by 1.135, and that 2437.102 is what the rate would be if the year were 0.

sum_NV <- summary(lm_NV)
sum_NV


Call:
lm(formula = `Rates.Marijuana.Used Past Year.12-17` ~ Year, data = NV)

Residuals:
    Min      1Q  Median      3Q     Max 
-21.306  -7.915  -4.394   8.188  32.661 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 2437.1019  1506.6739   1.618    0.127
Year          -1.1354     0.7496  -1.515    0.151

Residual standard error: 15.14 on 15 degrees of freedom
Multiple R-squared:  0.1327,    Adjusted R-squared:  0.07484 
F-statistic: 2.294 on 1 and 15 DF,  p-value: 0.1506

Based on the summary of the linear model, the p-value for the slope is 0.1506, which is above 0.05, so the trend is not statistically significant. The adjusted R-squared is 0.075, which means that about 7.5% of the variation in the rate is explained by Year.

Linear Analysis for Oregon

lm_OR <- lm(`Rates.Marijuana.Used Past Year.12-17`~Year, data = OR)

lm_OR


Call:
lm(formula = `Rates.Marijuana.Used Past Year.12-17` ~ Year, data = OR)

Coefficients:
(Intercept)         Year  
   648.7550      -0.2352

Based on the coefficients the formula is:

Rates.Marijuana.Used Past Year.12-17 = -0.2352(year) + 648.7550

The formula suggests that for every additional year, “Rates.Marijuana.Used Past Year.12-17” decreases by 0.2352, and that 648.7550 is what the rate would be if the year were 0.

sum_OR<- summary(lm_OR)
sum_OR


Call:
lm(formula = `Rates.Marijuana.Used Past Year.12-17` ~ Year, data = OR)

Residuals:
     Min       1Q   Median       3Q      Max 
-16.1175  -4.6506  -0.7477   7.2091  18.9930 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 648.7550   974.6279   0.666    0.516
Year         -0.2352     0.4849  -0.485    0.635

Residual standard error: 9.794 on 15 degrees of freedom
Multiple R-squared:  0.01544,   Adjusted R-squared:  -0.0502 
F-statistic: 0.2352 on 1 and 15 DF,  p-value: 0.6347

Based on the summary of the linear model, the p-value for the slope is 0.6347, which is above 0.05, so the trend is not statistically significant. The adjusted R-squared is -0.0502; a negative value means that Year explains essentially none of the variation in the rate.
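
A negative adjusted R-squared can look like an error, but it follows directly from the standard adjustment formula, which penalizes the multiple R-squared for the number of predictors. The sketch below reproduces the reported values from the multiple R-squared in the summaries; `adjusted_r2` is a helper name introduced here, and n = 17 is inferred from the 15 residual degrees of freedom with one predictor.

```r
# Adjusted R-squared from multiple R-squared, n observations, p predictors
adjusted_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# 15 residual degrees of freedom with one predictor implies n = 17
n <- 17
p <- 1

adjusted_r2(0.2903,  n, p)  # DC:     about 0.243
adjusted_r2(0.01544, n, p)  # Oregon: about -0.0502
```

When the multiple R-squared is very close to 0, the penalty pushes the adjusted value slightly below 0.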

Linear Analysis for Washington

lm_WA <- lm(`Rates.Marijuana.Used Past Year.12-17`~Year, data = WA)

lm_WA


Call:
lm(formula = `Rates.Marijuana.Used Past Year.12-17` ~ Year, data = WA)

Coefficients:
(Intercept)         Year  
  -714.3206       0.4328

Based on the coefficients the formula is:

Rates.Marijuana.Used Past Year.12-17 = 0.4328(year) - 714.3206

The formula suggests that for every additional year, “Rates.Marijuana.Used Past Year.12-17” increases by 0.4328, and that -714.3206 is what the rate would be if the year were 0.

sum_WA <- summary(lm_WA)
sum_WA


Call:
lm(formula = `Rates.Marijuana.Used Past Year.12-17` ~ Year, data = WA)

Residuals:
    Min      1Q  Median      3Q     Max 
-22.327  -8.519  -1.257   8.377  24.306 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -714.3206  1376.8101  -0.519    0.611
Year           0.4328     0.6850   0.632    0.537

Residual standard error: 13.84 on 15 degrees of freedom
Multiple R-squared:  0.02593,   Adjusted R-squared:  -0.03901 
F-statistic: 0.3992 on 1 and 15 DF,  p-value: 0.537

Based on the summary of the linear model, the p-value for the slope is 0.537, which is above 0.05, so the trend is not statistically significant. The adjusted R-squared is -0.03901; a negative value means that Year explains essentially none of the variation in the rate.

P2 <- LMS |>
  ggplot(aes(x = Year, y = `Rates.Marijuana.Used Past Year.12-17`, color = State)) +
  # Map size on the points only, so geom_smooth() does not inherit it
  geom_point(aes(size = `Totals.Marijuana.Used Past Year.12-17`)) +
  # One straight-line fit per state, without confidence bands
  geom_smooth(method = "lm", se = FALSE, show.legend = FALSE) +
  theme_bw() +
  scale_color_manual(values = c(Maine = "blue",
                                Nevada = "lightskyblue3",
                                Oregon = "pink",
                                Washington = "gold",
                                `District of Columbia` = "red"
                                )) +
  theme(legend.position = "right",
        legend.key = element_rect(fill = "white", colour = "black")) +
  labs(title = "Trends in Marijuana Usage Rate From 2002 to 2018 (Age 12-17)",
       x = "Year",
       y = "Rate of Marijuana Use (per 1,000 people)",
       color = "States",
       size = "Total Users (in thousands)",
       caption = "Source: NSDUH")

P2

`geom_smooth()` using formula = 'y ~ x'

IL <-
  ggplot(LMS, aes(x = Year , y = `Rates.Marijuana.Used Past Year.12-17`)) +
  geom_point() +
  geom_smooth(color='skyblue2') +
  facet_wrap(~ State)
IL

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

First, I loaded the library and the data set. Next, I checked for missing data by summing the missing values in the data set with is.na(). Thankfully, everything is accounted for. Then, I used filter() to keep only the five jurisdictions I wanted (Washington, the District of Columbia [DC], Maine, Nevada, and Oregon). I also created an individual data set for each one by using filter() again, to get a better look at each individual trend. After cleaning, I explored the linear equation, p-value, and R-squared for each data set, with the Year column as the independent variable and the rate of marijuana use in the 12 to 17 age group as the dependent variable. I noticed that the slopes are negative for every state except the District of Columbia and Washington, and that only DC's slope is statistically significant. I created a scatter plot to visualize the slopes, with the year on the x-axis, the rate of marijuana use on the y-axis, the total number of users as the size of the dots, and the state as the color.

What stood out to me was how steep the slope was for DC. Recreational marijuana use was not legal in DC until 2015, which could help explain why DC's slope is steep. The other four states had legalized medical marijuana before the data set starts in 2002, although they did not legalize recreational use until 2012 or later. I wanted to bring attention to minors using marijuana.

I am happy with the end results of my analysis. In the future, I would like to add a heat map of the types of substances to see which substance the 12-17 age group used the most.

Sources:

Where Marijuana is legal in the US https://mjbizdaily.com/map-of-us-marijuana-legalization-by-state/