Final Paper

STEP 1 - load the data set from 2023 National Survey on Drug Use and Health (NSDUH) into R Studio.

library(readxl)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(pastecs)

## 
## Attaching package: 'pastecs'
## 
## The following objects are masked from 'package:dplyr':
## 
##     first, last
## 
## The following object is masked from 'package:tidyr':
## 
##     extract

library(lmtest)

## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

library(MASS)

## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select

load('NSDUH_2023.Rdata')

let’s review this data set before narrowing the selection by using STR

str(puf2023_102124)

## tibble [56,705 × 2,636] (S3: tbl_df/tbl/data.frame)
##  $ QUESTID2    : num [1:56705] 1e+07 1e+07 1e+07 1e+07 1e+07 ...
##   ..- attr(*, "label")= chr "RESPONDENT IDENTIFICATION"
##  $ FILEDATE    : chr [1:56705] "10/21/2024" "10/21/2024" "10/21/2024" "10/21/2024" ...
##   ..- attr(*, "label")= chr "CREATION DATE OF THE DATA FILE"
##  $ ANALWT2_C   : num [1:56705] 3276 15630 4018 10712 8195 ...
##   ..- attr(*, "label")= chr "FIN PRSN-LEVEL SMPLE WGHT 2"
##  $ VESTR_C     : num [1:56705] 40031 40021 40043 40030 40023 ...
##   ..- attr(*, "label")= chr "VARIANCE STRATUM"
##  $ VEREP       : num [1:56705] 2 2 1 2 2 1 1 1 2 1 ...
##   ..- attr(*, "label")= chr "VARIANCE PRIMARY SAMPLING UNIT"
##   ..- attr(*, "format.sas")= chr "Z"
##  $ PDEN10      : num [1:56705] 2 2 2 2 2 2 1 2 1 2 ...
##   ..- attr(*, "label")= chr "POPULATION DENSITY 2010 - THREE LEVELS"
##   ..- attr(*, "format.sas")= chr "PDEN10FMT"
##  $ COUTYP4     : num [1:56705] 2 3 2 2 2 2 1 2 1 2 ...
##   ..- attr(*, "label")= chr "COUNTY METRO/NONMETRO STATUS (2013 3-LEVEL)"
##   ..- attr(*, "format.sas")= chr "COUTYP4FMT"
##  $ MAIIN102    : num [1:56705] 2 2 2 2 2 2 2 2 2 2 ...
##   ..- attr(*, "label")= chr "MAJORITY AMER INDIAN AREA INDICATOR FOR SEGMENT"
##   ..- attr(*, "format.sas")= chr "MAIIN102FMT"
##  $ AIIND102    : num [1:56705] 2 2 2 2 2 2 2 2 2 2 ...
##   ..- attr(*, "label")= chr "AMER INDIAN AREA INDICATOR"
##   ..- attr(*, "format.sas")= chr "AIIND102FMT"
##  $ AGE3        : num [1:56705] 10 9 9 1 10 4 8 11 8 2 ...
##   ..- attr(*, "label")= chr "RECODE - FINAL EDITED AGE"
##   ..- attr(*, "format.sas")= chr "AGE3FMT"
##  $ SERVICE     : num [1:56705] 2 1 2 99 2 2 2 2 2 99 ...
##   ..- attr(*, "label")= chr "EVER BEEN IN THE US ARMED FORCES"
##   ..- attr(*, "format.sas")= chr "SERVICEFMT"
##  $ MILSTAT     : num [1:56705] 99 3 99 99 99 99 99 99 99 99 ...
##   ..- attr(*, "label")= chr "CURRENT MILITARY STATUS"
##   ..- attr(*, "format.sas")= chr "MILSTATFMT"
##  $ ACTDEVER    : num [1:56705] 99 1 99 99 99 99 99 99 99 99 ...
##   ..- attr(*, "label")= chr "EVER ON ACTIVE DUTY IN US MILITARY/RESERV"
##   ..- attr(*, "format.sas")= chr "ACTDEVERFMT"
##  $ ACTD2001    : num [1:56705] 99 1 99 99 99 99 99 99 99 99 ...
##   ..- attr(*, "label")= chr "SEPTEMBER 2001 OR LATER"
##   ..- attr(*, "format.sas")= chr "ACTD2001FMT"
##  $ ACTD9001    : num [1:56705] 99 2 99 99 99 99 99 99 99 99 ...
##   ..- attr(*, "label")= chr "AUG 1990 TO AUG 2001 (INCLUDING PERSIAN GULF WAR)"
##   ..- attr(*, "format.sas")= chr "ACTD9001FMT"
##  $ ACTD7590    : num [1:56705] 99 2 99 99 99 99 99 99 99 99 ...
##   ..- attr(*, "label")= chr "MAY 1975 TO JULY 1990"
##   ..- attr(*, "format.sas")= chr "ACTD7590FMT"
##  $ ACTDVIET    : num [1:56705] 99 2 99 99 99 99 99 99 99 99 ...
##   ..- attr(*, "label")= chr "VIETNAM ERA (MARCH 1961 TO APRIL 1975)"
##   ..- attr(*, "format.sas")= chr "ACTDVIETFMT"
##  $ ACTDPRIV    : num [1:56705] 99 2 99 99 99 99 99 99 99 99 ...
##   ..- attr(*, "label")= chr "ANY TIME BEFORE VIETNAM ERA (MARCH 1961)"
##   ..- attr(*, "format.sas")= chr "ACTDPRIVFMT"
##  $ COMBATPY    : num [1:56705] 99 2 99 99 99 99 99 99 99 99 ...
##   ..- attr(*, "label")= chr "EVER IN COMBAT ZONE ON ACTIVE DUTY"
##   ..- attr(*, "format.sas")= chr "COMBATPYFMT"
##  $ NOMARR2     : num [1:56705] 1 99 2 99 1 99 99 99 99 99 ...
##   ..- attr(*, "label")= chr "NUMBER OF TIMES MARRIED"
##   ..- attr(*, "format.sas")= chr "NOMARR2FMT"
##  $ HEALTH      : num [1:56705] 2 2 3 3 1 3 3 2 1 5 ...
##   ..- attr(*, "label")= chr "OVERALL HEALTH"
##   ..- attr(*, "format.sas")= chr "HEALTHFMT"
##  $ MOVSINPYR2  : num [1:56705] 0 0 0 0 0 0 0 0 0 1 ...
##   ..- attr(*, "label")= chr "# TIMES MOVED PAST 12 MONTHS- RECODED"
##   ..- attr(*, "format.sas")= chr "MOVSINPYR2FMT"
##  $ SEXATRACT2  : num [1:56705] 1 2 2 1 1 1 1 1 1 2 ...
##   ..- attr(*, "label")= chr "SEXUAL ATTRACTION"
##   ..- attr(*, "format.sas")= chr "SEXATRACT2FMT"
##  $ SEXIDENT22  : num [1:56705] 1 3 1 1 1 1 1 1 1 3 ...
##   ..- attr(*, "label")= chr "SEXUAL IDENTITY"
##   ..- attr(*, "format.sas")= chr "SEXIDENT22FMT"
##  $ SPEAKENGL   : num [1:56705] 1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "label")= chr "HOW WELL SPEAK ENGLISH"
##   ..- attr(*, "format.sas")= chr "SPEAKENGLFMT"
##  $ LVLDIFSEE2  : num [1:56705] 1 1 2 1 1 1 1 1 2 1 ...
##   ..- attr(*, "label")= chr "LEVEL OF DIFFICULTY SEEING EVEN WITH GLASSES"
##   ..- attr(*, "format.sas")= chr "LVLDIFSEE2FMT"
##  $ LVLDIFHEAR2 : num [1:56705] 1 1 3 1 1 2 2 1 1 1 ...
##   ..- attr(*, "label")= chr "LEVEL OF DIFFICULTY HEARING EVEN WITH HEARING AIDS"
##   ..- attr(*, "format.sas")= chr "LVLDIFHEAR2FMT"
##  $ LVLDIFWALK2 : num [1:56705] 1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "label")= chr "LEVEL OF DIFFICULTY WALKING OR CLIMBING STEPS"
##   ..- attr(*, "format.sas")= chr "LVLDIFWALK2FMT"
##  $ LVLDIFMEM2  : num [1:56705] 1 2 2 1 1 1 1 1 1 1 ...
##   ..- attr(*, "label")= chr "LEVEL OF DIFFICULTY REMEMBERING OR CONCENTRATING"
##   ..- attr(*, "format.sas")= chr "LVLDIFMEM2FMT"
##  $ LVLDIFCARE2 : num [1:56705] 1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "label")= chr "LEVEL OF DIFFICULTY WITH SELF-CARE"
##   ..- attr(*, "format.sas")= chr "LVLDIFCARE2FMT"
##  $ LVLDIFCOMM2 : num [1:56705] 1 1 1 1 1 1 1 1 1 3 ...
##   ..- attr(*, "label")= chr "LEVEL OF DIFFICULTY COMMUNICATING USING YOUR USUAL LANGUAGE"
##   ..- attr(*, "format.sas")= chr "LVLDIFCOMM2FMT"
##  $ IRSEX       : num [1:56705] 1 1 2 1 1 1 1 1 2 2 ...
##   ..- attr(*, "label")= chr "SEX AT BIRTH - IMPUTATION REVISED"
##   ..- attr(*, "format.sas")= chr "IRSEXFMT"
##  $ IRMARIT     : num [1:56705] 1 4 1 99 1 4 4 4 4 4 ...
##   ..- attr(*, "label")= chr "IMPUTATION REVISED MARITAL STATUS"
##   ..- attr(*, "format.sas")= chr "IRMARITFMT"
##  $ IIMARIT     : num [1:56705] 1 1 1 9 1 1 1 1 1 1 ...
##   ..- attr(*, "label")= chr "IMPUTATION INDICATOR MARITAL STATUS"
##   ..- attr(*, "format.sas")= chr "IIMARITFMT"
##  $ IREDUHIGHST2: num [1:56705] 9 8 8 2 11 8 11 11 11 5 ...
##   ..- attr(*, "label")= chr "EDUCATION - RECODED IMPUTATION REVISED"
##   ..- attr(*, "format.sas")= chr "IREDUHIGHST2FMT"
##  $ IIEDUHIGHST2: num [1:56705] 1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "label")= chr "EDUCATION INDICATOR - IMPUTATION INDICATOR"
##   ..- attr(*, "format.sas")= chr "IIEDUHIGHST2FMT"
##  $ CATAGE      : num [1:56705] 4 4 4 1 4 2 3 4 3 1 ...
##   ..- attr(*, "label")= chr "RC-AGE CATEGORY"
##   ..- attr(*, "format.sas")= chr "CATAGEFMT"
##  $ CATAG2      : num [1:56705] 3 3 3 1 3 2 3 3 3 1 ...
##   ..- attr(*, "label")= chr "RC-AGE CATEGORY RECODE (3 LEVELS)"
##   ..- attr(*, "format.sas")= chr "CATAG2FMT"
##  $ CATAG3      : num [1:56705] 5 4 4 1 5 2 3 5 3 1 ...
##   ..- attr(*, "label")= chr "RC-AGE CATEGORY RECODE (5 LEVELS)"
##   ..- attr(*, "format.sas")= chr "CATAG3FMT"
##  $ CATAG6      : num [1:56705] 5 4 4 1 5 2 3 6 3 1 ...
##   ..- attr(*, "label")= chr "RC-AGE CATEGORY RECODE (6 LEVELS)"
##   ..- attr(*, "format.sas")= chr "CATAG6FMT"
##  $ CATAG7      : num [1:56705] 7 7 7 1 7 4 6 7 6 2 ...
##   ..- attr(*, "label")= chr "RC-AGE CATEGORY RECODE (7 LEVELS)"
##   ..- attr(*, "format.sas")= chr "CATAG7FMT"
##  $ PREGAGE2    : num [1:56705] 4 3 4 4 4 2 3 4 3 1 ...
##   ..- attr(*, "label")= chr "RC-PREGNANCY AGE CATEGORY RECODE"
##   ..- attr(*, "format.sas")= chr "PREGAGE2FMT"
##  $ DRVINAGE    : num [1:56705] 2 2 2 3 2 1 2 2 2 3 ...
##   ..- attr(*, "label")= chr "RC-DUI AGE CATEGORY RECODE"
##   ..- attr(*, "format.sas")= chr "DRVINAGEFMT"
##  $ DRVINDETAG  : num [1:56705] 3 3 3 4 3 1 3 3 3 4 ...
##   ..- attr(*, "label")= chr "RC-DUI DETAILED AGE CATEGORY RECODE"
##   ..- attr(*, "format.sas")= chr "DRVINDETAGFMT"
##  $ SEXAGE      : num [1:56705] 5 5 5 1 5 3 5 5 5 2 ...
##   ..- attr(*, "label")= chr "RC-COMBINED GENDER BY AGE CATEGORY INDICATOR"
##   ..- attr(*, "format.sas")= chr "SEXAGEFMT"
##  $ NEWRACE2    : num [1:56705] 7 1 4 7 6 1 1 1 5 1 ...
##   ..- attr(*, "label")= chr "RC-RACE/HISPANICITY RECODE (7 LEVELS)"
##   ..- attr(*, "format.sas")= chr "NEWRACE2FMT"
##  $ SEXRACE     : num [1:56705] 5 1 7 5 7 1 1 1 7 2 ...
##   ..- attr(*, "label")= chr "RC-COMBINED GENDER BY RACE INDICATOR"
##   ..- attr(*, "format.sas")= chr "SEXRACEFMT"
##  $ EDUHIGHCAT  : num [1:56705] 3 2 2 5 4 2 4 4 4 5 ...
##   ..- attr(*, "label")= chr "RC-EDUCATION CATEGORIES"
##   ..- attr(*, "format.sas")= chr "EDUHIGHCATFMT"
##  $ HEALTH2     : num [1:56705] 2 2 3 3 1 3 3 2 1 4 ...
##   ..- attr(*, "label")= chr "RC-OVERALL HEALTH RECODE"
##   ..- attr(*, "format.sas")= chr "HEALTH2FMT"
##  $ EDUSCHLGO   : num [1:56705] 2 2 2 1 2 2 2 2 2 1 ...
##   ..- attr(*, "label")= chr "NOW GOING TO SCHOOL"
##   ..- attr(*, "format.sas")= chr "EDUSCHLGOFMT"
##  $ EDUSCHGRD2  : num [1:56705] 99 99 99 3 99 99 99 99 99 6 ...
##   ..- attr(*, "label")= chr "WHAT GRADE IN NOW/WILL BE IN"
##   ..- attr(*, "format.sas")= chr "EDUSCHGRD2FMT"
##  $ EDUFULPAR   : num [1:56705] 99 99 99 1 99 99 99 99 99 1 ...
##   ..- attr(*, "label")= chr "FULL OR PART TIME"
##   ..- attr(*, "format.sas")= chr "EDUFULPARFMT"
##  $ EDUSCKMON   : num [1:56705] 99 99 99 4 99 99 99 99 99 0 ...
##   ..- attr(*, "label")= chr "HOW MANY DAYS MISSED SCHOOL FROM SICK"
##   ..- attr(*, "format.sas")= chr "EDUSCKMONFMT"
##  $ EDUSCKEST   : num [1:56705] 99 99 99 99 99 99 99 99 99 99 ...
##   ..- attr(*, "label")= chr "BEST ESTIMATE # DAYS MISSED BECAUSE SICK OR INURED"
##   ..- attr(*, "format.sas")= chr "EDUSCKESTFMT"
##  $ EDUSCKCOM   : num [1:56705] 99 99 99 4 99 99 99 99 99 0 ...
##   ..- attr(*, "label")= chr "RC - HOW MANY DAYS MISSED SCHOOL FROM SICK (COMBINED)"
##   ..- attr(*, "format.sas")= chr "EDUSCKCOMFMT"
##  $ EDUSKPMON   : num [1:56705] 99 99 99 0 99 99 99 99 99 0 ...
##   ..- attr(*, "label")= chr "HOW MANY DAYS MISSED SCHOOL FROM SKIPPING"
##   ..- attr(*, "format.sas")= chr "EDUSKPMONFMT"
##  $ EDUSKPEST   : num [1:56705] 99 99 99 99 99 99 99 99 99 99 ...
##   ..- attr(*, "label")= chr "BEST ESTIMATE # DAYS MISSED BECAUSE SKIPPED OR CUT"
##   ..- attr(*, "format.sas")= chr "EDUSKPESTFMT"
##  $ EDUSKPCOM   : num [1:56705] 99 99 99 0 99 99 99 99 99 0 ...
##   ..- attr(*, "label")= chr "RC - HOW MANY DAYS MISSED SCHOOL FROM SKIPPING (COMBINED)"
##   ..- attr(*, "format.sas")= chr "EDUSKPCOMFMT"
##  $ MILTFAMLY   : num [1:56705] 2 1 2 1 2 2 2 2 2 2 ...
##   ..- attr(*, "label")= chr "ANY IMMEDIATE FAMILY CURRENTLY IN US MILITARY"
##   ..- attr(*, "format.sas")= chr "MILTFAMLYFMT"
##  $ MILTSPPAR   : num [1:56705] 99 2 99 97 99 99 99 99 99 99 ...
##   ..- attr(*, "label")= chr "SPOUSE OR PARTNER IN MILITARY"
##   ..- attr(*, "format.sas")= chr "MILTSPPARFMT"
##  $ MILTPARNT   : num [1:56705] 99 2 99 97 99 99 99 99 99 99 ...
##   ..- attr(*, "label")= chr "FATHER OR MOTHER IN MILITARY"
##   ..- attr(*, "format.sas")= chr "MILTPARNTFMT"
##  $ MILTCHLDR   : num [1:56705] 99 2 99 97 99 99 99 99 99 99 ...
##   ..- attr(*, "label")= chr "SON OR DAUGHTER IN MILITARY"
##   ..- attr(*, "format.sas")= chr "MILTCHLDRFMT"
##  $ MILTSIBLN   : num [1:56705] 99 1 99 97 99 99 99 99 99 99 ...
##   ..- attr(*, "label")= chr "SISTER OR BROTHER IN MILITARY"
##   ..- attr(*, "format.sas")= chr "MILTSIBLNFMT"
##  $ ENRLCOLLFT2 : num [1:56705] 3 3 3 3 3 2 3 3 3 3 ...
##   ..- attr(*, "label")= chr "RC-COLLEGE ENROLLMENT-FULL TIME"
##   ..- attr(*, "format.sas")= chr "ENRLCOLLFT2FMT"
##  $ ENRLCOLLST2 : num [1:56705] 5 5 5 5 5 3 5 5 5 5 ...
##   ..- attr(*, "label")= chr "RC-COLLEGE ENROLLMENT"
##   ..- attr(*, "format.sas")= chr "ENRLCOLLST2FMT"
##  $ WRKSTATWK2  : num [1:56705] 8 1 9 99 1 3 1 2 1 9 ...
##   ..- attr(*, "label")= chr "WORK SITUATION IN PAST WEEK - RECODE"
##   ..- attr(*, "format.sas")= chr "WRKSTATWK2FMT"
##  $ WRKDPSTWK   : num [1:56705] 2 1 2 99 1 2 1 1 1 2 ...
##   ..- attr(*, "label")= chr "WORK AT JOB LAST WEEK"
##   ..- attr(*, "format.sas")= chr "WRKDPSTWKFMT"
##  $ WRKHADJOB   : num [1:56705] 2 99 2 99 99 1 99 99 99 2 ...
##   ..- attr(*, "label")= chr "DID YOU HAVE A JOB OR BUSINESS"
##   ..- attr(*, "format.sas")= chr "WRKHADJOBFMT"
##  $ WRKDHRSWK2  : num [1:56705] 999 50 999 999 40 999 40 16 15 999 ...
##   ..- attr(*, "label")= chr "HOW MANY HOURS WORKED LAST WEEK"
##   ..- attr(*, "format.sas")= chr "WRKDHRSWK2FMT"
##  $ WRK35WKUS   : num [1:56705] 99 1 99 99 1 1 1 2 1 99 ...
##   ..- attr(*, "label")= chr "USUALLY WORK 35 OR MORE HRS PER WEEK"
##   ..- attr(*, "format.sas")= chr "WRK35WKUSFMT"
##  $ WRKRSNNOT   : num [1:56705] 999 999 999 999 999 1 999 999 999 999 ...
##   ..- attr(*, "label")= chr "REASON FOR NOT WORKING AT JOB - PAST WK"
##   ..- attr(*, "format.sas")= chr "WRKRSNNOTFMT"
##  $ WRKRSNJOB   : num [1:56705] 5 999 8 999 999 999 999 999 999 1 ...
##   ..- attr(*, "label")= chr "REASON FOR NOT HAVING A JOB - PAST WK"
##   ..- attr(*, "format.sas")= chr "WRKRSNJOBFMT"
##  $ WRKEFFORT   : num [1:56705] 99 99 99 99 99 99 99 99 99 2 ...
##   ..- attr(*, "label")= chr "PAST 30 DAYS, MAKE EFFORT TO FIND WORK"
##   ..- attr(*, "format.sas")= chr "WRKEFFORTFMT"
##  $ WRKDPSTYR   : num [1:56705] 2 99 2 99 99 99 99 99 99 2 ...
##   ..- attr(*, "label")= chr "PAST 12 MOS, WORKED AT ANY JOB"
##   ..- attr(*, "format.sas")= chr "WRKDPSTYRFMT"
##  $ WRKSELFEM   : num [1:56705] 99 1 99 99 1 2 2 2 1 99 ...
##   ..- attr(*, "label")= chr "PAST 12 MOS, SELF EMPLOYED"
##   ..- attr(*, "format.sas")= chr "WRKSELFEMFMT"
##  $ WRKNUMJOB2  : num [1:56705] 99 1 99 99 1 3 4 1 1 99 ...
##   ..- attr(*, "label")= chr "PAST 12 MOS, HOW MANY EMPLOYERS"
##   ..- attr(*, "format.sas")= chr "WRKNUMJOB2FMT"
##  $ WRKNJBPYR   : num [1:56705] 99 2 99 99 2 1 1 2 2 99 ...
##   ..- attr(*, "label")= chr "PAST 12 MOS, TIME WITH NO JOB"
##   ..- attr(*, "format.sas")= chr "WRKNJBPYRFMT"
##  $ WRKNJBWKS   : num [1:56705] 99 99 99 99 99 7 2 99 99 99 ...
##   ..- attr(*, "label")= chr "HOW MANY WKS W/O JOB IN PAST 12 MOS"
##   ..- attr(*, "format.sas")= chr "WRKNJBWKSFMT"
##  $ WRKLASTYR2  : num [1:56705] 2017 9999 9991 9999 9999 ...
##   ..- attr(*, "label")= chr "WHAT YEAR LAST WORKED"
##   ..- attr(*, "format.sas")= chr "WRKLASTYR2FMT"
##  $ WRKSICKMO   : num [1:56705] 99 0 99 99 0 0 0 0 0 99 ...
##   ..- attr(*, "label")= chr "# DAYS MISSED WORK FOR INJURY/ILLNESS PAST 30 DAYS"
##   ..- attr(*, "format.sas")= chr "WRKSICKMOFMT"
##  $ WRKSKIPMO   : num [1:56705] 99 0 99 99 0 0 0 0 0 99 ...
##   ..- attr(*, "label")= chr "# DAYS SKIPPED WORK PAST 30 DAYS"
##   ..- attr(*, "format.sas")= chr "WRKSKIPMOFMT"
##  $ WRKDRGPOL   : num [1:56705] 99 2 99 99 2 1 1 1 2 99 ...
##   ..- attr(*, "label")= chr "WORKPLACE HAVE WRITTEN POLICY DRUG/ALCOHOL USE"
##   ..- attr(*, "format.sas")= chr "WRKDRGPOLFMT"
##  $ WRKDRGALB   : num [1:56705] 99 99 99 99 99 3 3 3 99 99 ...
##   ..- attr(*, "label")= chr "DID POLICY COVER ALCOHOL, DRUGS, OR  BOTH"
##   ..- attr(*, "format.sas")= chr "WRKDRGALBFMT"
##  $ WRKDRGEDU   : num [1:56705] 99 2 99 99 2 1 2 2 2 99 ...
##   ..- attr(*, "label")= chr "AT WORK, GIVEN EDUCATION ON DRUGS/ALC"
##   ..- attr(*, "format.sas")= chr "WRKDRGEDUFMT"
##  $ WRKDRGHLP   : num [1:56705] 99 2 99 99 2 2 2 1 2 99 ...
##   ..- attr(*, "label")= chr "ANY ASSISTANCE PROGRAM OFFERED THROUGH WRK"
##   ..- attr(*, "format.sas")= chr "WRKDRGHLPFMT"
##  $ WRKTSTALC   : num [1:56705] 99 2 99 99 2 1 2 2 2 99 ...
##   ..- attr(*, "label")= chr "WORKPLACE TESTS FOR ALCOHOL USAGE"
##   ..- attr(*, "format.sas")= chr "WRKTSTALCFMT"
##  $ WRKTSTDRG   : num [1:56705] 99 2 99 99 2 2 2 94 2 99 ...
##   ..- attr(*, "label")= chr "WORKPLACE TESTS FOR DRUG USAGE"
##   ..- attr(*, "format.sas")= chr "WRKTSTDRGFMT"
##  $ WRKTSTHIR   : num [1:56705] 99 99 99 99 99 1 99 98 99 99 ...
##   ..- attr(*, "label")= chr "TEST FOR DRUG/ALC AS HIRING PROCESS"
##   ..- attr(*, "format.sas")= chr "WRKTSTHIRFMT"
##  $ WRKTSTRDM   : num [1:56705] 99 99 99 99 99 1 99 98 99 99 ...
##   ..- attr(*, "label")= chr "TEST ON RANDOM BASIS"
##   ..- attr(*, "format.sas")= chr "WRKTSTRDMFMT"
##  $ WRKTST1ST   : num [1:56705] 99 99 99 99 99 2 99 98 99 99 ...
##   ..- attr(*, "label")= chr "WHAT HAPPENS FIRST TIME CAUGHT"
##   ..- attr(*, "format.sas")= chr "WRKTST1STFMT"
##  $ WRKOKPREH   : num [1:56705] 99 3 99 99 3 2 2 3 3 99 ...
##   ..- attr(*, "label")= chr "WOULD YOU WORK FOR EMP DOES DRUG TEST PRE-HIRE"
##   ..- attr(*, "format.sas")= chr "WRKOKPREHFMT"
##  $ WRKOKRAND   : num [1:56705] 99 3 99 99 2 2 2 3 3 99 ...
##   ..- attr(*, "label")= chr "WOULD YOU WORK FOR EMP DOES DRUG/ALC TEST RANDOMLY"
##   ..- attr(*, "format.sas")= chr "WRKOKRANDFMT"
##  $ IRWRKSTAT   : num [1:56705] 4 1 4 99 1 1 1 2 1 4 ...
##   ..- attr(*, "label")= chr "EMPLOYMENT STATUS - IMPUTATION REVISED"
##   ..- attr(*, "format.sas")= chr "IRWRKSTATFMT"
##  $ IIWRKSTAT   : num [1:56705] 1 1 1 9 1 1 1 1 1 1 ...
##   ..- attr(*, "label")= chr "EMPLOYMENT STATUS - IMPUTATION INDICATOR"
##   ..- attr(*, "format.sas")= chr "IIWRKSTATFMT"
##  $ II2WRKSTAT  : num [1:56705] 1 1 1 9 1 1 1 1 1 1 ...
##   ..- attr(*, "label")= chr "EMPLOYMENT STATUS - DETAILED IMPUTATION INDICATOR"
##   ..- attr(*, "format.sas")= chr "II2WRKSTATFMT"
##  $ IRWRKSTAT18 : num [1:56705] 4 1 4 99 1 1 1 2 1 99 ...
##   ..- attr(*, "label")= chr "EMPLOYMENT STATUS 18+ - IMPUTATION REVISED"
##   ..- attr(*, "format.sas")= chr "IRWRKSTAT18FMT"
##  $ IIWRKSTAT18 : num [1:56705] 1 1 1 9 1 1 1 1 1 9 ...
##   ..- attr(*, "label")= chr "EMPLOYMENT STATUS 18+ - IMPUTATION INDICATOR"
##   ..- attr(*, "format.sas")= chr "IIWRKSTAT18FMT"
##  $ II2WRKST18  : num [1:56705] 1 1 1 9 1 1 1 1 1 9 ...
##   ..- attr(*, "label")= chr "EMPLOYMENT STATUS 18+ - DETAILED IMPUTATION INDICATOR"
##   ..- attr(*, "format.sas")= chr "II2WRKST18FMT"
##  $ EDFAM18     : num [1:56705] 0 0 0 0 0 0 1 1 0 0 ...
##   ..- attr(*, "label")= chr "RC-EDITED INDICATOR: FAMILY IN HH 18 OR OLDER"
##   ..- attr(*, "format.sas")= chr "EDFAM18FMT"
##   [list output truncated]

STEP 2 - create smaller data set. STEP 2a - select just the four variables: ALCYRTOT,ALCBNG30D, IMPGOUTM, IMPYDAYS

select <- dplyr::select

AlcoholMHlarge<-select(puf2023_102124, ALCYRTOT,ALCBNG30D,IMPGOUTM,IMPYDAYS)

summary(AlcoholMHlarge)

##     ALCYRTOT       ALCBNG30D        IMPGOUTM        IMPYDAYS    
##  Min.   :  1.0   Min.   : 0.00   Min.   : 1.00   Min.   :  0.0  
##  1st Qu.: 36.0   1st Qu.: 1.00   1st Qu.:99.00   1st Qu.:  1.0  
##  Median :200.0   Median :91.00   Median :99.00   Median :999.0  
##  Mean   :466.2   Mean   :54.14   Mean   :97.03   Mean   :554.4  
##  3rd Qu.:991.0   3rd Qu.:93.00   3rd Qu.:99.00   3rd Qu.:999.0  
##  Max.   :998.0   Max.   :98.00   Max.   :99.00   Max.   :999.0

str(AlcoholMHlarge)

## tibble [56,705 × 4] (S3: tbl_df/tbl/data.frame)
##  $ ALCYRTOT : num [1:56705] 342 84 991 991 24 36 180 993 24 991 ...
##   ..- attr(*, "label")= chr "TOTAL # OF DAYS USED ALCOHOL IN PAST 12 MOS"
##   ..- attr(*, "format.sas")= chr "ALCYRTOTFMT"
##  $ ALCBNG30D: num [1:56705] 7 5 91 91 0 0 1 93 0 91 ...
##   ..- attr(*, "label")= chr "# DAYS HAD FOUR/FIVE OR MORE DRINKS PAST 30 DYS"
##   ..- attr(*, "format.sas")= chr "ALCBNG30DFMT"
##  $ IMPGOUTM : num [1:56705] 99 99 99 99 99 99 99 99 99 99 ...
##   ..- attr(*, "label")= chr "EMOTIONAL PROBLEMS KEEP YOU FROM LEAVING HOUSE"
##   ..- attr(*, "format.sas")= chr "IMPGOUTMFMT"
##  $ IMPYDAYS : num [1:56705] 999 6 120 999 999 20 10 999 0 999 ...
##   ..- attr(*, "label")= chr "HOW MANY DAY IN PAST YR YOU WERE UNABLE TO WORK"
##   ..- attr(*, "format.sas")= chr "IMPYDAYSFMT"

Interesting that there are no “zero days” for ALCYRTOT. This means that no one of the 56,705 observations said they drank zero days in the last 12 months. All observations drank at least one day. IMPGOUTM is actually a categorical variable (i.e. YES OR NO).

STEP 2b - remove observations with more than 365 days for ALCYRTOT and IMPYDAYS and remove observations with more than 30 days for ALCBNG30D and IMPGOUTM.

AlcoholMH1 <- AlcoholMHlarge %>% 
  filter(ALCYRTOT >= 0 & ALCYRTOT <= 366 &
         IMPYDAYS >= 0 & IMPYDAYS <= 366 &
         ALCBNG30D >= 0 & ALCBNG30D <= 31 &
         IMPGOUTM >= 0 & IMPGOUTM <= 31)

summary(AlcoholMH1)

##     ALCYRTOT        ALCBNG30D         IMPGOUTM       IMPYDAYS     
##  Min.   :  1.00   Min.   : 0.000   Min.   :1.00   Min.   :  0.00  
##  1st Qu.: 12.00   1st Qu.: 0.000   1st Qu.:1.00   1st Qu.:  3.00  
##  Median : 48.00   Median : 1.000   Median :1.00   Median : 30.00  
##  Mean   : 83.47   Mean   : 2.462   Mean   :1.21   Mean   : 82.11  
##  3rd Qu.:126.00   3rd Qu.: 2.000   3rd Qu.:1.00   3rd Qu.:120.00  
##  Max.   :364.00   Max.   :30.000   Max.   :2.00   Max.   :365.00

There are no N/As to remove from this data set. All four variables have values for all the remaining observations.

Let’s explore this data now…

ggplot(AlcoholMH1, aes(x=ALCYRTOT)) + 
  geom_histogram(bins = 40, fill = "steelblue", color = "black") +
  labs(title = "Distribution of Days Drinking in the last year")

ggplot(AlcoholMH1, aes(x=ALCBNG30D)) + 
  geom_histogram(bins = 40, fill = "steelblue", color = "black") +
  labs(title = "Distribution of Days Binge Drinking in the last month")

ggplot(AlcoholMH1, aes(x=IMPYDAYS)) + 
  geom_histogram(bins = 40, fill = "steelblue", color = "black") +
  labs(title = "Distribution of Days Missed work in the last year")

I left out IMPGOUTM because it’s categorical and not continuous. All three of these variables are extremely skewed as is.

ggplot(AlcoholMH1, aes(x =factor(IMPGOUTM))) + 
  geom_bar(fill = "steelblue") +
  labs(title = "Frequency of Days Could not leave the house due to emotional problems", x="Yes or No")

stat.desc(AlcoholMH1$ALCYRTOT)

##                         x
## nbr.val        420.000000
## nbr.null         0.000000
## nbr.na           0.000000
## min              1.000000
## max            364.000000
## range          363.000000
## sum          35059.000000
## median          48.000000
## mean            83.473810
## SE.mean          4.319218
## CI.mean.0.95     8.490036
## var           7835.371627
## std.dev         88.517635
## coef.var         1.060424

stat.desc(AlcoholMH1$IMPYDAYS)

##                         x
## nbr.val        420.000000
## nbr.null        77.000000
## nbr.na           0.000000
## min              0.000000
## max            365.000000
## range          365.000000
## sum          34486.000000
## median          30.000000
## mean            82.109524
## SE.mean          5.317133
## CI.mean.0.95    10.451578
## var          11874.198000
## std.dev        108.968794
## coef.var         1.327115

I wanted a way to more easily look at the number of days drinking, so let’s mutate the numbers into a category by creating a new variable.

alcohol_new <- AlcoholMH1 %>%
  mutate(Type_of_Drinker = case_when(
    ALCYRTOT == 0 ~ "Non-drinker",
    ALCYRTOT <= 52 ~ "Occasional",
    ALCYRTOT <= 156 ~ "Regular",
    ALCYRTOT <= 260 ~ "Frequent",
    TRUE ~ "Very Frequent"
  ))

table(alcohol_new$Type_of_Drinker)

## 
##      Frequent    Occasional       Regular Very Frequent 
##            52           235           118            15

Let’s see what this new category looks like

ggplot(alcohol_new, aes(x=Type_of_Drinker)) + 
  geom_bar(fill = "steelblue", color = "black") +
  labs(title = "Types of Drinkers")

ggplot(alcohol_new, aes(x = Type_of_Drinker, y = IMPYDAYS)) +
  geom_boxplot(fill = "steelblue") +
  labs(title = "Days Missed work by Type of Drinker",
       x = "Type of Drinker",
       y = "Days of Missed Work")

This looks like “Very Frequent” drinkers has a higher mean of Missed Days of work than the other three types of drinkers.

alcohol_new %>%
  group_by(Type_of_Drinker) %>%
  summarize(
    mean_impairment_days = mean(IMPYDAYS, na.rm = TRUE),
    n = n()
  ) %>%
  arrange(factor(Type_of_Drinker))

## # A tibble: 4 × 3
##   Type_of_Drinker mean_impairment_days     n
##   <chr>                          <dbl> <int>
## 1 Frequent                        93.1    52
## 2 Occasional                      81.2   235
## 3 Regular                         74.3   118
## 4 Very Frequent                  120.     15

cor(AlcoholMH1$ALCYRTOT, AlcoholMH1$IMPYDAYS)

## [1] 0.04650247

cor(AlcoholMH1$ALCBNG30D, AlcoholMH1$IMPYDAYS)

## [1] 0.138859

The Correlation shows the statistical correlation between number of days drinking in the past 12 months to number of days of missed work in the past 12 months is 0.046 which is closer to 0 which means not much of a correlation. I’m surprised by this, but we’ll keep looking at the data. The second one is a little less weak, with 0.1388 for Days Binge Drinking and Days of missed work.

Let’s plot these two variables and see what that looks like.

ggplot(alcohol_new, aes(x=ALCYRTOT,y=IMPYDAYS)) + geom_point() + geom_smooth(color='red') +
         labs(title="Alcohol Use and Days Unable to Work (untransformed data)", 
         x="Total # Days Used Alcohol in Past 12 Months",
         y="# Days Unable to Work in Past 12 Months")

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

This visually shows a slight increase in days unable to work as days of alcohol use increases.

ggplot(alcohol_new, aes(x=ALCBNG30D,y=IMPYDAYS)) + geom_point() + geom_smooth(method= "lm", color='red') +
         labs(title="Days Binging and Days Unable to Work (untransformed data)", 
         x="Total # Days Binge Drinking in Past 30 Days",
         y="# Days Unable to Work in Past 12 Months")

## `geom_smooth()` using formula = 'y ~ x'

Next, I’ll create a linear model “lm()” from the variables, with a dependent variable (IMPYDAYS) as the outcome. I called it Alcoholmodel_1.

Alcoholmodel_1<-lm(IMPYDAYS~ALCYRTOT+ALCBNG30D+Type_of_Drinker,data=alcohol_new)

Next I can run the assumptions (just like in Homework 7)

Check the following assumptions: a) Linearity (plot and raintest)

plot(Alcoholmodel_1,which=1)

raintest(Alcoholmodel_1)

## 
##  Rainbow test
## 
## data:  Alcoholmodel_1
## Rain = 0.98645, df1 = 210, df2 = 204, p-value = 0.5393

Above is the plot of this lm. The fitted line is not straight and therefore the data is not linear.

Next, I ran the Rainbow test which has a p-value of 0.5393 which is not low, and therefore not statistically significant. This means that the # of total days drinking (ALCYRTOT) and the # of days binging (ALCBNG30D) and Type of Drinker do not have statistically significant effect on # of days missed work (IMPYDAYS)

I do want to try it just with the ALCYRTOT and ALCBNG30D and see if that’s different.

Create a new linear model with only three variables:

Alcoholmodel_2<-lm(IMPYDAYS~ALCYRTOT+ALCBNG30D,data=alcohol_new)

plot(Alcoholmodel_2,which=1)

raintest(Alcoholmodel_2)

## 
##  Rainbow test
## 
## data:  Alcoholmodel_2
## Rain = 0.98924, df1 = 210, df2 = 207, p-value = 0.5312

This does not look any better, let me try with just ALCBNG30D and IMPYDAYS

Alcoholmodel_3<-lm(IMPYDAYS~ALCBNG30D,data=alcohol_new)

plot(Alcoholmodel_3,which=1)

raintest(Alcoholmodel_3)

## 
##  Rainbow test
## 
## data:  Alcoholmodel_3
## Rain = 0.99424, df1 = 210, df2 = 208, p-value = 0.5167

Looks better, graph is not straight by rainbow test provides a a p-value that is likely to be linear.

I’ll continue with other assumptions testing on the linear model with all four variables (Alcoholmodel_1) b) Independence of errors (durbin-watson)

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following object is masked from 'package:purrr':
## 
##     some

durbinWatsonTest(Alcoholmodel_1)

##  lag Autocorrelation D-W Statistic p-value
##    1     -0.02550154      2.048397   0.626
##  Alternative hypothesis: rho != 0

Next we look at the Durbin Watson test, which brings about the p-value of 0.954 which is high and means the errors are independent. Also, the DW Statistic is basically 2, which is good news.

Homoscedasticity (plot, bptest)

plot(Alcoholmodel_1,which=3)

bptest(Alcoholmodel_1)

## 
##  studentized Breusch-Pagan test
## 
## data:  Alcoholmodel_1
## BP = 5.7037, df = 5, p-value = 0.3361

This plot has a fairly straight line but flows at an angle. The points are not evenly distributed around the line. This would demonstrate a violation of Homoscedasticity.

The second command for the Breusch-Pagan test shows a result of p-value of 0.1797 which means the null hypothesis cannot be rejected and the model is not homoscedastic.

Normality of residuals (QQ plot, shapiro test)

plot(Alcoholmodel_1,which=2)

There is significant deviation from the dotted line, which means residuals are not evenly distributed.

shapiro.test(Alcoholmodel_1$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  Alcoholmodel_1$residuals
## W = 0.79672, p-value < 2.2e-16

The p-value is WAY WAY below 0.05, which means that the residuals are not normal.

No multicolinarity (VIF, cor)

vif(Alcoholmodel_1)

##                      GVIF Df GVIF^(1/(2*Df))
## ALCYRTOT        12.766232  1        3.572986
## ALCBNG30D        1.408807  1        1.186932
## Type_of_Drinker 12.688377  3        1.527218

All three variables are close to 1.00. meaning they are NOT related to another variable.

And I’ll run the correlation again.

cor(AlcoholMH1)

##               ALCYRTOT   ALCBNG30D     IMPGOUTM    IMPYDAYS
## ALCYRTOT   1.000000000  0.53162182 -0.009641016  0.04650247
## ALCBNG30D  0.531621816  1.00000000 -0.067479458  0.13885904
## IMPGOUTM  -0.009641016 -0.06747946  1.000000000 -0.27127344
## IMPYDAYS   0.046502472  0.13885904 -0.271273440  1.00000000

Does this model (Alcoholmodel_1) meet the assumptions for linear model? NO!

Because:

Linearity (plot and raintest) - MAYBE i.e. line is not straigt but p-value is above 0.50 at 0.67.
Independence of errors (durbin-watson) - YES MEETS has p-value of 0.676 which is high and means the errors are independent and the DW Statistic is basically 2.
Homoscedasticity (plot, bptest) - DOES NOT MEET Breusch-Pagan test shows a result of p-value of 0.1797 and the line is not straight.
Normality of residuals (QQ plot, shapiro test) - DOES NOT MEET THIS ASSUMPTION. Graph is not in line and the Shapiro-Wilk normality test of p-value < 2.2e-16.
No multicolinarity (VIF, cor) - YES MEETS, since they are not correlated.

I cannot use a log transformation since there are legitimate zeroes as values in these variable. I must try square root as a transformation here. I will only square root the dependent variable, and will leave the independent variables as they are.

Alcohol_multiple<-lm(sqrt(IMPYDAYS)~(ALCYRTOT)+(ALCBNG30D),data=alcohol_new)

summary(Alcohol_multiple)

## 
## Call:
## lm(formula = sqrt(IMPYDAYS) ~ (ALCYRTOT) + (ALCBNG30D), data = alcohol_new)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.638 -5.016 -1.256  3.876 12.915 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.617408   0.398679  16.598   <2e-16 ***
## ALCYRTOT    -0.002972   0.003871  -0.768   0.4431    
## ALCBNG30D    0.181917   0.070938   2.564   0.0107 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.941 on 417 degrees of freedom
## Multiple R-squared:  0.01667,    Adjusted R-squared:  0.01196 
## F-statistic: 3.535 on 2 and 417 DF,  p-value: 0.03002

This looks much better. Now I’ll check the assumptions.

plot(Alcohol_multiple,which=1)

raintest(Alcohol_multiple)

## 
##  Rainbow test
## 
## data:  Alcohol_multiple
## Rain = 0.98626, df1 = 210, df2 = 207, p-value = 0.5399

This new data set (Alcohol_multiple) is looking better! This line is a little more straight. Raintest is still p-value of 0.5399. Let’s run more assumptions here.

durbinWatsonTest(Alcohol_multiple)

##  lag Autocorrelation D-W Statistic p-value
##    1     -0.03491551      2.066002   0.488
##  Alternative hypothesis: rho != 0

plot(Alcohol_multiple,which=3)

bptest(Alcohol_multiple)

## 
##  studentized Breusch-Pagan test
## 
## data:  Alcohol_multiple
## BP = 1.9381, df = 2, p-value = 0.3795

Normality of Residuals

plot(Alcohol_multiple,which=2)

shapiro.test(Alcohol_multiple$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  Alcohol_multiple$residuals
## W = 0.92193, p-value = 5.847e-14

vif(Alcohol_multiple)

##  ALCYRTOT ALCBNG30D 
##  1.393965  1.393965

VIF shows both variables close to 0 which means they are not strongly correlated to other variables.

COR can’t run on a model, but can run it on the dataframe which it was above.

Linearity (plot and raintest) - YES MEETS line is much closer to straight and p-value is 0.539.
Independence of errors (durbin-watson) - YES MEETS has p-value of 0.488 which is high and means the errors are independent and the DW Statistic is basically 2.
Homoscedasticity (plot, bptest) - YES MEETS Breusch-Pagan test shows a result of p-value of 0.379 and the plot line is much more straight in the residuals graph.
Normality of residuals (QQ plot, shapiro test) - MAYBE. Graph is showing much more accurate, with the two ends trailing off as it should and the Shapiro-Wilk normality test of p-value = 5.847e-14 which means it is not normal.
No multicolinarity (VIF, cor) - YES MEETS, since they are not correlated since both values are way under 10.

This looks like a model that now meets the assumptions. Let’s look at summary again.

summary(Alcohol_multiple)

## 
## Call:
## lm(formula = sqrt(IMPYDAYS) ~ (ALCYRTOT) + (ALCBNG30D), data = alcohol_new)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.638 -5.016 -1.256  3.876 12.915 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.617408   0.398679  16.598   <2e-16 ***
## ALCYRTOT    -0.002972   0.003871  -0.768   0.4431    
## ALCBNG30D    0.181917   0.070938   2.564   0.0107 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.941 on 417 degrees of freedom
## Multiple R-squared:  0.01667,    Adjusted R-squared:  0.01196 
## F-statistic: 3.535 on 2 and 417 DF,  p-value: 0.03002

So number of days in the month of binge drinking is significant in relation to number of days of missed work. The number of days total drinking alcohol is not significant. Adjusted R Squared tells me the transformed number of days drinking (ALCYRTOT+ALCBNG30D) explains 1% of IMPYDAYS.

Final Paper

Megan Janzen

2025-05-04