STEP 1 - load the data set from 2023 National Survey on Drug Use and Health (NSDUH) into R Studio.

library(readxl)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(pastecs)
## 
## Attaching package: 'pastecs'
## 
## The following objects are masked from 'package:dplyr':
## 
##     first, last
## 
## The following object is masked from 'package:tidyr':
## 
##     extract
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(MASS)
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
load('NSDUH_2023.Rdata')

let’s review this data set before narrowing the selection by using STR

str(puf2023_102124)
## tibble [56,705 × 2,636] (S3: tbl_df/tbl/data.frame)
##  $ QUESTID2    : num [1:56705] 1e+07 1e+07 1e+07 1e+07 1e+07 ...
##   ..- attr(*, "label")= chr "RESPONDENT IDENTIFICATION"
##  $ FILEDATE    : chr [1:56705] "10/21/2024" "10/21/2024" "10/21/2024" "10/21/2024" ...
##   ..- attr(*, "label")= chr "CREATION DATE OF THE DATA FILE"
##  $ ANALWT2_C   : num [1:56705] 3276 15630 4018 10712 8195 ...
##   ..- attr(*, "label")= chr "FIN PRSN-LEVEL SMPLE WGHT 2"
##  $ VESTR_C     : num [1:56705] 40031 40021 40043 40030 40023 ...
##   ..- attr(*, "label")= chr "VARIANCE STRATUM"
##  $ VEREP       : num [1:56705] 2 2 1 2 2 1 1 1 2 1 ...
##   ..- attr(*, "label")= chr "VARIANCE PRIMARY SAMPLING UNIT"
##   ..- attr(*, "format.sas")= chr "Z"
##  $ PDEN10      : num [1:56705] 2 2 2 2 2 2 1 2 1 2 ...
##   ..- attr(*, "label")= chr "POPULATION DENSITY 2010 - THREE LEVELS"
##   ..- attr(*, "format.sas")= chr "PDEN10FMT"
##  $ COUTYP4     : num [1:56705] 2 3 2 2 2 2 1 2 1 2 ...
##   ..- attr(*, "label")= chr "COUNTY METRO/NONMETRO STATUS (2013 3-LEVEL)"
##   ..- attr(*, "format.sas")= chr "COUTYP4FMT"
##  $ MAIIN102    : num [1:56705] 2 2 2 2 2 2 2 2 2 2 ...
##   ..- attr(*, "label")= chr "MAJORITY AMER INDIAN AREA INDICATOR FOR SEGMENT"
##   ..- attr(*, "format.sas")= chr "MAIIN102FMT"
##  $ AIIND102    : num [1:56705] 2 2 2 2 2 2 2 2 2 2 ...
##   ..- attr(*, "label")= chr "AMER INDIAN AREA INDICATOR"
##   ..- attr(*, "format.sas")= chr "AIIND102FMT"
##  $ AGE3        : num [1:56705] 10 9 9 1 10 4 8 11 8 2 ...
##   ..- attr(*, "label")= chr "RECODE - FINAL EDITED AGE"
##   ..- attr(*, "format.sas")= chr "AGE3FMT"
##  $ SERVICE     : num [1:56705] 2 1 2 99 2 2 2 2 2 99 ...
##   ..- attr(*, "label")= chr "EVER BEEN IN THE US ARMED FORCES"
##   ..- attr(*, "format.sas")= chr "SERVICEFMT"
##  $ MILSTAT     : num [1:56705] 99 3 99 99 99 99 99 99 99 99 ...
##   ..- attr(*, "label")= chr "CURRENT MILITARY STATUS"
##   ..- attr(*, "format.sas")= chr "MILSTATFMT"
##  $ ACTDEVER    : num [1:56705] 99 1 99 99 99 99 99 99 99 99 ...
##   ..- attr(*, "label")= chr "EVER ON ACTIVE DUTY IN US MILITARY/RESERV"
##   ..- attr(*, "format.sas")= chr "ACTDEVERFMT"
##  $ ACTD2001    : num [1:56705] 99 1 99 99 99 99 99 99 99 99 ...
##   ..- attr(*, "label")= chr "SEPTEMBER 2001 OR LATER"
##   ..- attr(*, "format.sas")= chr "ACTD2001FMT"
##  $ ACTD9001    : num [1:56705] 99 2 99 99 99 99 99 99 99 99 ...
##   ..- attr(*, "label")= chr "AUG 1990 TO AUG 2001 (INCLUDING PERSIAN GULF WAR)"
##   ..- attr(*, "format.sas")= chr "ACTD9001FMT"
##  $ ACTD7590    : num [1:56705] 99 2 99 99 99 99 99 99 99 99 ...
##   ..- attr(*, "label")= chr "MAY 1975 TO JULY 1990"
##   ..- attr(*, "format.sas")= chr "ACTD7590FMT"
##  $ ACTDVIET    : num [1:56705] 99 2 99 99 99 99 99 99 99 99 ...
##   ..- attr(*, "label")= chr "VIETNAM ERA (MARCH 1961 TO APRIL 1975)"
##   ..- attr(*, "format.sas")= chr "ACTDVIETFMT"
##  $ ACTDPRIV    : num [1:56705] 99 2 99 99 99 99 99 99 99 99 ...
##   ..- attr(*, "label")= chr "ANY TIME BEFORE VIETNAM ERA (MARCH 1961)"
##   ..- attr(*, "format.sas")= chr "ACTDPRIVFMT"
##  $ COMBATPY    : num [1:56705] 99 2 99 99 99 99 99 99 99 99 ...
##   ..- attr(*, "label")= chr "EVER IN COMBAT ZONE ON ACTIVE DUTY"
##   ..- attr(*, "format.sas")= chr "COMBATPYFMT"
##  $ NOMARR2     : num [1:56705] 1 99 2 99 1 99 99 99 99 99 ...
##   ..- attr(*, "label")= chr "NUMBER OF TIMES MARRIED"
##   ..- attr(*, "format.sas")= chr "NOMARR2FMT"
##  $ HEALTH      : num [1:56705] 2 2 3 3 1 3 3 2 1 5 ...
##   ..- attr(*, "label")= chr "OVERALL HEALTH"
##   ..- attr(*, "format.sas")= chr "HEALTHFMT"
##  $ MOVSINPYR2  : num [1:56705] 0 0 0 0 0 0 0 0 0 1 ...
##   ..- attr(*, "label")= chr "# TIMES MOVED PAST 12 MONTHS- RECODED"
##   ..- attr(*, "format.sas")= chr "MOVSINPYR2FMT"
##  $ SEXATRACT2  : num [1:56705] 1 2 2 1 1 1 1 1 1 2 ...
##   ..- attr(*, "label")= chr "SEXUAL ATTRACTION"
##   ..- attr(*, "format.sas")= chr "SEXATRACT2FMT"
##  $ SEXIDENT22  : num [1:56705] 1 3 1 1 1 1 1 1 1 3 ...
##   ..- attr(*, "label")= chr "SEXUAL IDENTITY"
##   ..- attr(*, "format.sas")= chr "SEXIDENT22FMT"
##  $ SPEAKENGL   : num [1:56705] 1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "label")= chr "HOW WELL SPEAK ENGLISH"
##   ..- attr(*, "format.sas")= chr "SPEAKENGLFMT"
##  $ LVLDIFSEE2  : num [1:56705] 1 1 2 1 1 1 1 1 2 1 ...
##   ..- attr(*, "label")= chr "LEVEL OF DIFFICULTY SEEING EVEN WITH GLASSES"
##   ..- attr(*, "format.sas")= chr "LVLDIFSEE2FMT"
##  $ LVLDIFHEAR2 : num [1:56705] 1 1 3 1 1 2 2 1 1 1 ...
##   ..- attr(*, "label")= chr "LEVEL OF DIFFICULTY HEARING EVEN WITH HEARING AIDS"
##   ..- attr(*, "format.sas")= chr "LVLDIFHEAR2FMT"
##  $ LVLDIFWALK2 : num [1:56705] 1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "label")= chr "LEVEL OF DIFFICULTY WALKING OR CLIMBING STEPS"
##   ..- attr(*, "format.sas")= chr "LVLDIFWALK2FMT"
##  $ LVLDIFMEM2  : num [1:56705] 1 2 2 1 1 1 1 1 1 1 ...
##   ..- attr(*, "label")= chr "LEVEL OF DIFFICULTY REMEMBERING OR CONCENTRATING"
##   ..- attr(*, "format.sas")= chr "LVLDIFMEM2FMT"
##  $ LVLDIFCARE2 : num [1:56705] 1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "label")= chr "LEVEL OF DIFFICULTY WITH SELF-CARE"
##   ..- attr(*, "format.sas")= chr "LVLDIFCARE2FMT"
##  $ LVLDIFCOMM2 : num [1:56705] 1 1 1 1 1 1 1 1 1 3 ...
##   ..- attr(*, "label")= chr "LEVEL OF DIFFICULTY COMMUNICATING USING YOUR USUAL LANGUAGE"
##   ..- attr(*, "format.sas")= chr "LVLDIFCOMM2FMT"
##  $ IRSEX       : num [1:56705] 1 1 2 1 1 1 1 1 2 2 ...
##   ..- attr(*, "label")= chr "SEX AT BIRTH - IMPUTATION REVISED"
##   ..- attr(*, "format.sas")= chr "IRSEXFMT"
##  $ IRMARIT     : num [1:56705] 1 4 1 99 1 4 4 4 4 4 ...
##   ..- attr(*, "label")= chr "IMPUTATION REVISED MARITAL STATUS"
##   ..- attr(*, "format.sas")= chr "IRMARITFMT"
##  $ IIMARIT     : num [1:56705] 1 1 1 9 1 1 1 1 1 1 ...
##   ..- attr(*, "label")= chr "IMPUTATION INDICATOR MARITAL STATUS"
##   ..- attr(*, "format.sas")= chr "IIMARITFMT"
##  $ IREDUHIGHST2: num [1:56705] 9 8 8 2 11 8 11 11 11 5 ...
##   ..- attr(*, "label")= chr "EDUCATION - RECODED IMPUTATION REVISED"
##   ..- attr(*, "format.sas")= chr "IREDUHIGHST2FMT"
##  $ IIEDUHIGHST2: num [1:56705] 1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "label")= chr "EDUCATION INDICATOR - IMPUTATION INDICATOR"
##   ..- attr(*, "format.sas")= chr "IIEDUHIGHST2FMT"
##  $ CATAGE      : num [1:56705] 4 4 4 1 4 2 3 4 3 1 ...
##   ..- attr(*, "label")= chr "RC-AGE CATEGORY"
##   ..- attr(*, "format.sas")= chr "CATAGEFMT"
##  $ CATAG2      : num [1:56705] 3 3 3 1 3 2 3 3 3 1 ...
##   ..- attr(*, "label")= chr "RC-AGE CATEGORY RECODE (3 LEVELS)"
##   ..- attr(*, "format.sas")= chr "CATAG2FMT"
##  $ CATAG3      : num [1:56705] 5 4 4 1 5 2 3 5 3 1 ...
##   ..- attr(*, "label")= chr "RC-AGE CATEGORY RECODE (5 LEVELS)"
##   ..- attr(*, "format.sas")= chr "CATAG3FMT"
##  $ CATAG6      : num [1:56705] 5 4 4 1 5 2 3 6 3 1 ...
##   ..- attr(*, "label")= chr "RC-AGE CATEGORY RECODE (6 LEVELS)"
##   ..- attr(*, "format.sas")= chr "CATAG6FMT"
##  $ CATAG7      : num [1:56705] 7 7 7 1 7 4 6 7 6 2 ...
##   ..- attr(*, "label")= chr "RC-AGE CATEGORY RECODE (7 LEVELS)"
##   ..- attr(*, "format.sas")= chr "CATAG7FMT"
##  $ PREGAGE2    : num [1:56705] 4 3 4 4 4 2 3 4 3 1 ...
##   ..- attr(*, "label")= chr "RC-PREGNANCY AGE CATEGORY RECODE"
##   ..- attr(*, "format.sas")= chr "PREGAGE2FMT"
##  $ DRVINAGE    : num [1:56705] 2 2 2 3 2 1 2 2 2 3 ...
##   ..- attr(*, "label")= chr "RC-DUI AGE CATEGORY RECODE"
##   ..- attr(*, "format.sas")= chr "DRVINAGEFMT"
##  $ DRVINDETAG  : num [1:56705] 3 3 3 4 3 1 3 3 3 4 ...
##   ..- attr(*, "label")= chr "RC-DUI DETAILED AGE CATEGORY RECODE"
##   ..- attr(*, "format.sas")= chr "DRVINDETAGFMT"
##  $ SEXAGE      : num [1:56705] 5 5 5 1 5 3 5 5 5 2 ...
##   ..- attr(*, "label")= chr "RC-COMBINED GENDER BY AGE CATEGORY INDICATOR"
##   ..- attr(*, "format.sas")= chr "SEXAGEFMT"
##  $ NEWRACE2    : num [1:56705] 7 1 4 7 6 1 1 1 5 1 ...
##   ..- attr(*, "label")= chr "RC-RACE/HISPANICITY RECODE (7 LEVELS)"
##   ..- attr(*, "format.sas")= chr "NEWRACE2FMT"
##  $ SEXRACE     : num [1:56705] 5 1 7 5 7 1 1 1 7 2 ...
##   ..- attr(*, "label")= chr "RC-COMBINED GENDER BY RACE INDICATOR"
##   ..- attr(*, "format.sas")= chr "SEXRACEFMT"
##  $ EDUHIGHCAT  : num [1:56705] 3 2 2 5 4 2 4 4 4 5 ...
##   ..- attr(*, "label")= chr "RC-EDUCATION CATEGORIES"
##   ..- attr(*, "format.sas")= chr "EDUHIGHCATFMT"
##  $ HEALTH2     : num [1:56705] 2 2 3 3 1 3 3 2 1 4 ...
##   ..- attr(*, "label")= chr "RC-OVERALL HEALTH RECODE"
##   ..- attr(*, "format.sas")= chr "HEALTH2FMT"
##  $ EDUSCHLGO   : num [1:56705] 2 2 2 1 2 2 2 2 2 1 ...
##   ..- attr(*, "label")= chr "NOW GOING TO SCHOOL"
##   ..- attr(*, "format.sas")= chr "EDUSCHLGOFMT"
##  $ EDUSCHGRD2  : num [1:56705] 99 99 99 3 99 99 99 99 99 6 ...
##   ..- attr(*, "label")= chr "WHAT GRADE IN NOW/WILL BE IN"
##   ..- attr(*, "format.sas")= chr "EDUSCHGRD2FMT"
##  $ EDUFULPAR   : num [1:56705] 99 99 99 1 99 99 99 99 99 1 ...
##   ..- attr(*, "label")= chr "FULL OR PART TIME"
##   ..- attr(*, "format.sas")= chr "EDUFULPARFMT"
##  $ EDUSCKMON   : num [1:56705] 99 99 99 4 99 99 99 99 99 0 ...
##   ..- attr(*, "label")= chr "HOW MANY DAYS MISSED SCHOOL FROM SICK"
##   ..- attr(*, "format.sas")= chr "EDUSCKMONFMT"
##  $ EDUSCKEST   : num [1:56705] 99 99 99 99 99 99 99 99 99 99 ...
##   ..- attr(*, "label")= chr "BEST ESTIMATE # DAYS MISSED BECAUSE SICK OR INURED"
##   ..- attr(*, "format.sas")= chr "EDUSCKESTFMT"
##  $ EDUSCKCOM   : num [1:56705] 99 99 99 4 99 99 99 99 99 0 ...
##   ..- attr(*, "label")= chr "RC - HOW MANY DAYS MISSED SCHOOL FROM SICK (COMBINED)"
##   ..- attr(*, "format.sas")= chr "EDUSCKCOMFMT"
##  $ EDUSKPMON   : num [1:56705] 99 99 99 0 99 99 99 99 99 0 ...
##   ..- attr(*, "label")= chr "HOW MANY DAYS MISSED SCHOOL FROM SKIPPING"
##   ..- attr(*, "format.sas")= chr "EDUSKPMONFMT"
##  $ EDUSKPEST   : num [1:56705] 99 99 99 99 99 99 99 99 99 99 ...
##   ..- attr(*, "label")= chr "BEST ESTIMATE # DAYS MISSED BECAUSE SKIPPED OR CUT"
##   ..- attr(*, "format.sas")= chr "EDUSKPESTFMT"
##  $ EDUSKPCOM   : num [1:56705] 99 99 99 0 99 99 99 99 99 0 ...
##   ..- attr(*, "label")= chr "RC - HOW MANY DAYS MISSED SCHOOL FROM SKIPPING (COMBINED)"
##   ..- attr(*, "format.sas")= chr "EDUSKPCOMFMT"
##  $ MILTFAMLY   : num [1:56705] 2 1 2 1 2 2 2 2 2 2 ...
##   ..- attr(*, "label")= chr "ANY IMMEDIATE FAMILY CURRENTLY IN US MILITARY"
##   ..- attr(*, "format.sas")= chr "MILTFAMLYFMT"
##  $ MILTSPPAR   : num [1:56705] 99 2 99 97 99 99 99 99 99 99 ...
##   ..- attr(*, "label")= chr "SPOUSE OR PARTNER IN MILITARY"
##   ..- attr(*, "format.sas")= chr "MILTSPPARFMT"
##  $ MILTPARNT   : num [1:56705] 99 2 99 97 99 99 99 99 99 99 ...
##   ..- attr(*, "label")= chr "FATHER OR MOTHER IN MILITARY"
##   ..- attr(*, "format.sas")= chr "MILTPARNTFMT"
##  $ MILTCHLDR   : num [1:56705] 99 2 99 97 99 99 99 99 99 99 ...
##   ..- attr(*, "label")= chr "SON OR DAUGHTER IN MILITARY"
##   ..- attr(*, "format.sas")= chr "MILTCHLDRFMT"
##  $ MILTSIBLN   : num [1:56705] 99 1 99 97 99 99 99 99 99 99 ...
##   ..- attr(*, "label")= chr "SISTER OR BROTHER IN MILITARY"
##   ..- attr(*, "format.sas")= chr "MILTSIBLNFMT"
##  $ ENRLCOLLFT2 : num [1:56705] 3 3 3 3 3 2 3 3 3 3 ...
##   ..- attr(*, "label")= chr "RC-COLLEGE ENROLLMENT-FULL TIME"
##   ..- attr(*, "format.sas")= chr "ENRLCOLLFT2FMT"
##  $ ENRLCOLLST2 : num [1:56705] 5 5 5 5 5 3 5 5 5 5 ...
##   ..- attr(*, "label")= chr "RC-COLLEGE ENROLLMENT"
##   ..- attr(*, "format.sas")= chr "ENRLCOLLST2FMT"
##  $ WRKSTATWK2  : num [1:56705] 8 1 9 99 1 3 1 2 1 9 ...
##   ..- attr(*, "label")= chr "WORK SITUATION IN PAST WEEK - RECODE"
##   ..- attr(*, "format.sas")= chr "WRKSTATWK2FMT"
##  $ WRKDPSTWK   : num [1:56705] 2 1 2 99 1 2 1 1 1 2 ...
##   ..- attr(*, "label")= chr "WORK AT JOB LAST WEEK"
##   ..- attr(*, "format.sas")= chr "WRKDPSTWKFMT"
##  $ WRKHADJOB   : num [1:56705] 2 99 2 99 99 1 99 99 99 2 ...
##   ..- attr(*, "label")= chr "DID YOU HAVE A JOB OR BUSINESS"
##   ..- attr(*, "format.sas")= chr "WRKHADJOBFMT"
##  $ WRKDHRSWK2  : num [1:56705] 999 50 999 999 40 999 40 16 15 999 ...
##   ..- attr(*, "label")= chr "HOW MANY HOURS WORKED LAST WEEK"
##   ..- attr(*, "format.sas")= chr "WRKDHRSWK2FMT"
##  $ WRK35WKUS   : num [1:56705] 99 1 99 99 1 1 1 2 1 99 ...
##   ..- attr(*, "label")= chr "USUALLY WORK 35 OR MORE HRS PER WEEK"
##   ..- attr(*, "format.sas")= chr "WRK35WKUSFMT"
##  $ WRKRSNNOT   : num [1:56705] 999 999 999 999 999 1 999 999 999 999 ...
##   ..- attr(*, "label")= chr "REASON FOR NOT WORKING AT JOB - PAST WK"
##   ..- attr(*, "format.sas")= chr "WRKRSNNOTFMT"
##  $ WRKRSNJOB   : num [1:56705] 5 999 8 999 999 999 999 999 999 1 ...
##   ..- attr(*, "label")= chr "REASON FOR NOT HAVING A JOB - PAST WK"
##   ..- attr(*, "format.sas")= chr "WRKRSNJOBFMT"
##  $ WRKEFFORT   : num [1:56705] 99 99 99 99 99 99 99 99 99 2 ...
##   ..- attr(*, "label")= chr "PAST 30 DAYS, MAKE EFFORT TO FIND WORK"
##   ..- attr(*, "format.sas")= chr "WRKEFFORTFMT"
##  $ WRKDPSTYR   : num [1:56705] 2 99 2 99 99 99 99 99 99 2 ...
##   ..- attr(*, "label")= chr "PAST 12 MOS, WORKED AT ANY JOB"
##   ..- attr(*, "format.sas")= chr "WRKDPSTYRFMT"
##  $ WRKSELFEM   : num [1:56705] 99 1 99 99 1 2 2 2 1 99 ...
##   ..- attr(*, "label")= chr "PAST 12 MOS, SELF EMPLOYED"
##   ..- attr(*, "format.sas")= chr "WRKSELFEMFMT"
##  $ WRKNUMJOB2  : num [1:56705] 99 1 99 99 1 3 4 1 1 99 ...
##   ..- attr(*, "label")= chr "PAST 12 MOS, HOW MANY EMPLOYERS"
##   ..- attr(*, "format.sas")= chr "WRKNUMJOB2FMT"
##  $ WRKNJBPYR   : num [1:56705] 99 2 99 99 2 1 1 2 2 99 ...
##   ..- attr(*, "label")= chr "PAST 12 MOS, TIME WITH NO JOB"
##   ..- attr(*, "format.sas")= chr "WRKNJBPYRFMT"
##  $ WRKNJBWKS   : num [1:56705] 99 99 99 99 99 7 2 99 99 99 ...
##   ..- attr(*, "label")= chr "HOW MANY WKS W/O JOB IN PAST 12 MOS"
##   ..- attr(*, "format.sas")= chr "WRKNJBWKSFMT"
##  $ WRKLASTYR2  : num [1:56705] 2017 9999 9991 9999 9999 ...
##   ..- attr(*, "label")= chr "WHAT YEAR LAST WORKED"
##   ..- attr(*, "format.sas")= chr "WRKLASTYR2FMT"
##  $ WRKSICKMO   : num [1:56705] 99 0 99 99 0 0 0 0 0 99 ...
##   ..- attr(*, "label")= chr "# DAYS MISSED WORK FOR INJURY/ILLNESS PAST 30 DAYS"
##   ..- attr(*, "format.sas")= chr "WRKSICKMOFMT"
##  $ WRKSKIPMO   : num [1:56705] 99 0 99 99 0 0 0 0 0 99 ...
##   ..- attr(*, "label")= chr "# DAYS SKIPPED WORK PAST 30 DAYS"
##   ..- attr(*, "format.sas")= chr "WRKSKIPMOFMT"
##  $ WRKDRGPOL   : num [1:56705] 99 2 99 99 2 1 1 1 2 99 ...
##   ..- attr(*, "label")= chr "WORKPLACE HAVE WRITTEN POLICY DRUG/ALCOHOL USE"
##   ..- attr(*, "format.sas")= chr "WRKDRGPOLFMT"
##  $ WRKDRGALB   : num [1:56705] 99 99 99 99 99 3 3 3 99 99 ...
##   ..- attr(*, "label")= chr "DID POLICY COVER ALCOHOL, DRUGS, OR  BOTH"
##   ..- attr(*, "format.sas")= chr "WRKDRGALBFMT"
##  $ WRKDRGEDU   : num [1:56705] 99 2 99 99 2 1 2 2 2 99 ...
##   ..- attr(*, "label")= chr "AT WORK, GIVEN EDUCATION ON DRUGS/ALC"
##   ..- attr(*, "format.sas")= chr "WRKDRGEDUFMT"
##  $ WRKDRGHLP   : num [1:56705] 99 2 99 99 2 2 2 1 2 99 ...
##   ..- attr(*, "label")= chr "ANY ASSISTANCE PROGRAM OFFERED THROUGH WRK"
##   ..- attr(*, "format.sas")= chr "WRKDRGHLPFMT"
##  $ WRKTSTALC   : num [1:56705] 99 2 99 99 2 1 2 2 2 99 ...
##   ..- attr(*, "label")= chr "WORKPLACE TESTS FOR ALCOHOL USAGE"
##   ..- attr(*, "format.sas")= chr "WRKTSTALCFMT"
##  $ WRKTSTDRG   : num [1:56705] 99 2 99 99 2 2 2 94 2 99 ...
##   ..- attr(*, "label")= chr "WORKPLACE TESTS FOR DRUG USAGE"
##   ..- attr(*, "format.sas")= chr "WRKTSTDRGFMT"
##  $ WRKTSTHIR   : num [1:56705] 99 99 99 99 99 1 99 98 99 99 ...
##   ..- attr(*, "label")= chr "TEST FOR DRUG/ALC AS HIRING PROCESS"
##   ..- attr(*, "format.sas")= chr "WRKTSTHIRFMT"
##  $ WRKTSTRDM   : num [1:56705] 99 99 99 99 99 1 99 98 99 99 ...
##   ..- attr(*, "label")= chr "TEST ON RANDOM BASIS"
##   ..- attr(*, "format.sas")= chr "WRKTSTRDMFMT"
##  $ WRKTST1ST   : num [1:56705] 99 99 99 99 99 2 99 98 99 99 ...
##   ..- attr(*, "label")= chr "WHAT HAPPENS FIRST TIME CAUGHT"
##   ..- attr(*, "format.sas")= chr "WRKTST1STFMT"
##  $ WRKOKPREH   : num [1:56705] 99 3 99 99 3 2 2 3 3 99 ...
##   ..- attr(*, "label")= chr "WOULD YOU WORK FOR EMP DOES DRUG TEST PRE-HIRE"
##   ..- attr(*, "format.sas")= chr "WRKOKPREHFMT"
##  $ WRKOKRAND   : num [1:56705] 99 3 99 99 2 2 2 3 3 99 ...
##   ..- attr(*, "label")= chr "WOULD YOU WORK FOR EMP DOES DRUG/ALC TEST RANDOMLY"
##   ..- attr(*, "format.sas")= chr "WRKOKRANDFMT"
##  $ IRWRKSTAT   : num [1:56705] 4 1 4 99 1 1 1 2 1 4 ...
##   ..- attr(*, "label")= chr "EMPLOYMENT STATUS - IMPUTATION REVISED"
##   ..- attr(*, "format.sas")= chr "IRWRKSTATFMT"
##  $ IIWRKSTAT   : num [1:56705] 1 1 1 9 1 1 1 1 1 1 ...
##   ..- attr(*, "label")= chr "EMPLOYMENT STATUS - IMPUTATION INDICATOR"
##   ..- attr(*, "format.sas")= chr "IIWRKSTATFMT"
##  $ II2WRKSTAT  : num [1:56705] 1 1 1 9 1 1 1 1 1 1 ...
##   ..- attr(*, "label")= chr "EMPLOYMENT STATUS - DETAILED IMPUTATION INDICATOR"
##   ..- attr(*, "format.sas")= chr "II2WRKSTATFMT"
##  $ IRWRKSTAT18 : num [1:56705] 4 1 4 99 1 1 1 2 1 99 ...
##   ..- attr(*, "label")= chr "EMPLOYMENT STATUS 18+ - IMPUTATION REVISED"
##   ..- attr(*, "format.sas")= chr "IRWRKSTAT18FMT"
##  $ IIWRKSTAT18 : num [1:56705] 1 1 1 9 1 1 1 1 1 9 ...
##   ..- attr(*, "label")= chr "EMPLOYMENT STATUS 18+ - IMPUTATION INDICATOR"
##   ..- attr(*, "format.sas")= chr "IIWRKSTAT18FMT"
##  $ II2WRKST18  : num [1:56705] 1 1 1 9 1 1 1 1 1 9 ...
##   ..- attr(*, "label")= chr "EMPLOYMENT STATUS 18+ - DETAILED IMPUTATION INDICATOR"
##   ..- attr(*, "format.sas")= chr "II2WRKST18FMT"
##  $ EDFAM18     : num [1:56705] 0 0 0 0 0 0 1 1 0 0 ...
##   ..- attr(*, "label")= chr "RC-EDITED INDICATOR: FAMILY IN HH 18 OR OLDER"
##   ..- attr(*, "format.sas")= chr "EDFAM18FMT"
##   [list output truncated]

STEP 2 - create smaller data set. STEP 2a - select just the four variables: ALCYRTOT,ALCBNG30D, IMPGOUTM, IMPYDAYS

select <- dplyr::select

AlcoholMHlarge<-select(puf2023_102124, ALCYRTOT,ALCBNG30D,IMPGOUTM,IMPYDAYS)

summary(AlcoholMHlarge)
##     ALCYRTOT       ALCBNG30D        IMPGOUTM        IMPYDAYS    
##  Min.   :  1.0   Min.   : 0.00   Min.   : 1.00   Min.   :  0.0  
##  1st Qu.: 36.0   1st Qu.: 1.00   1st Qu.:99.00   1st Qu.:  1.0  
##  Median :200.0   Median :91.00   Median :99.00   Median :999.0  
##  Mean   :466.2   Mean   :54.14   Mean   :97.03   Mean   :554.4  
##  3rd Qu.:991.0   3rd Qu.:93.00   3rd Qu.:99.00   3rd Qu.:999.0  
##  Max.   :998.0   Max.   :98.00   Max.   :99.00   Max.   :999.0
str(AlcoholMHlarge)
## tibble [56,705 × 4] (S3: tbl_df/tbl/data.frame)
##  $ ALCYRTOT : num [1:56705] 342 84 991 991 24 36 180 993 24 991 ...
##   ..- attr(*, "label")= chr "TOTAL # OF DAYS USED ALCOHOL IN PAST 12 MOS"
##   ..- attr(*, "format.sas")= chr "ALCYRTOTFMT"
##  $ ALCBNG30D: num [1:56705] 7 5 91 91 0 0 1 93 0 91 ...
##   ..- attr(*, "label")= chr "# DAYS HAD FOUR/FIVE OR MORE DRINKS PAST 30 DYS"
##   ..- attr(*, "format.sas")= chr "ALCBNG30DFMT"
##  $ IMPGOUTM : num [1:56705] 99 99 99 99 99 99 99 99 99 99 ...
##   ..- attr(*, "label")= chr "EMOTIONAL PROBLEMS KEEP YOU FROM LEAVING HOUSE"
##   ..- attr(*, "format.sas")= chr "IMPGOUTMFMT"
##  $ IMPYDAYS : num [1:56705] 999 6 120 999 999 20 10 999 0 999 ...
##   ..- attr(*, "label")= chr "HOW MANY DAY IN PAST YR YOU WERE UNABLE TO WORK"
##   ..- attr(*, "format.sas")= chr "IMPYDAYSFMT"

Interesting that there are no “zero days” for ALCYRTOT or for IMPGOUTM. This means that no one of the 56,705 observations said they drank zero days in the last 12 months and no one said they had zero days that they could not leave the house. All observations drank at least one day and could not leave the house at least one day in the last 30 days.

STEP 2b - remove observations with more than 365 days for ALCYRTOT and IMPYDAYS and remove observations with more than 30 days for ALCBNG30D and IMPGOUTM.

AlcoholMH1 <- AlcoholMHlarge %>% 
  filter(ALCYRTOT >= 0 & ALCYRTOT <= 366 &
         IMPYDAYS >= 0 & IMPYDAYS <= 366 &
         ALCBNG30D >= 0 & ALCBNG30D <= 31 &
         IMPGOUTM >= 0 & IMPGOUTM <= 31)

summary(AlcoholMH1)
##     ALCYRTOT        ALCBNG30D         IMPGOUTM       IMPYDAYS     
##  Min.   :  1.00   Min.   : 0.000   Min.   :1.00   Min.   :  0.00  
##  1st Qu.: 12.00   1st Qu.: 0.000   1st Qu.:1.00   1st Qu.:  3.00  
##  Median : 48.00   Median : 1.000   Median :1.00   Median : 30.00  
##  Mean   : 83.47   Mean   : 2.462   Mean   :1.21   Mean   : 82.11  
##  3rd Qu.:126.00   3rd Qu.: 2.000   3rd Qu.:1.00   3rd Qu.:120.00  
##  Max.   :364.00   Max.   :30.000   Max.   :2.00   Max.   :365.00

There are no N/As to remove from this data set. All four variables have values for all the remaining observations.

Let’s explore this data now…

hist(AlcoholMH1$ALCYRTOT)

hist(AlcoholMH1$IMPYDAYS)

hist(AlcoholMH1$ALCBNG30D)

I left out IMPGOUTM because it’s categorical and not continuous. All three of these variables are extremely skewed as is.

stat.desc(AlcoholMH1$ALCYRTOT)
##                         x
## nbr.val        420.000000
## nbr.null         0.000000
## nbr.na           0.000000
## min              1.000000
## max            364.000000
## range          363.000000
## sum          35059.000000
## median          48.000000
## mean            83.473810
## SE.mean          4.319218
## CI.mean.0.95     8.490036
## var           7835.371627
## std.dev         88.517635
## coef.var         1.060424
stat.desc(AlcoholMH1$IMPYDAYS)
##                         x
## nbr.val        420.000000
## nbr.null        77.000000
## nbr.na           0.000000
## min              0.000000
## max            365.000000
## range          365.000000
## sum          34486.000000
## median          30.000000
## mean            82.109524
## SE.mean          5.317133
## CI.mean.0.95    10.451578
## var          11874.198000
## std.dev        108.968794
## coef.var         1.327115
cor(AlcoholMH1$ALCYRTOT, AlcoholMH1$IMPYDAYS)
## [1] 0.04650247
cor(AlcoholMH1)
##               ALCYRTOT   ALCBNG30D     IMPGOUTM    IMPYDAYS
## ALCYRTOT   1.000000000  0.53162182 -0.009641016  0.04650247
## ALCBNG30D  0.531621816  1.00000000 -0.067479458  0.13885904
## IMPGOUTM  -0.009641016 -0.06747946  1.000000000 -0.27127344
## IMPYDAYS   0.046502472  0.13885904 -0.271273440  1.00000000

The Correlation is showing that none of these variables have a super high correlation, but the closest are # of days drinking (ALCYRTOT) and # of days binge drinking (ALCBNG30D) but that’s at 0.53 so it’s an overall positive relationship. Also, the statistical correlation between number of days drinking in the past 12 months to number of days of missed work in the past 12 months is 0.046 which is closer to 0 which means not much of a correlation. I’m surprised by this, but we’ll keep looking at the data.

Let’s plot these two variables and see what that looks like.

ggplot(AlcoholMH1, aes(x=ALCYRTOT,y=IMPYDAYS)) + geom_point() + geom_smooth(color='red') +
         labs(title="Alcohol Use and Days Unable to Work (untransformed data)", 
         x="Total # Days Used Alcohol in Past 12 Months",
         y="# Days Unable to Work in Past 12 Months")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

I’m not sure this is the correct line to add to this graph. But it appears to show that there is an overall positive relationship between the two variables…as days of alcohol use increase, then days of missed work increases too.

Next, I’ll create a linear model “lm()” from the variables, with a continuous dependent variable (IMPYDAYS) as the outcome. I called it Alcoholmodel_1.

Alcoholmodel_1<-lm(IMPYDAYS~ALCYRTOT+ALCBNG30D+IMPGOUTM,data=AlcoholMH1)

Next I can run the assumptions (just like in Homework 7)

Check the following assumptions: a) Linearity (plot and raintest)

plot(Alcoholmodel_1,which=1)

raintest(Alcoholmodel_1)
## 
##  Rainbow test
## 
## data:  Alcoholmodel_1
## Rain = 0.94052, df1 = 210, df2 = 206, p-value = 0.6708

Above is the plot of this lm. The fitted line is not straight and therefore the data is not linear.

Next, I ran the Rainbow test which has a p-value of 0.6708 which is not low, and therefore not statistically significant. This means that the # of total days drinking (ALCYRTOT) and the # of days binging (ALCBNG30D) and # of days unable to leave the house (IMPGOUTM) do not have statistically significant effect on # of days missed work (IMPYDAYS)

I do want to try it just with the ALCYRTOT and ALCBNG30D and see if that’s different.

Create a new linear model with only three variables:

Alcoholmodel_2<-lm(IMPYDAYS~ALCYRTOT+ALCBNG30D,data=AlcoholMH1)

plot(Alcoholmodel_2,which=1)

raintest(Alcoholmodel_2)
## 
##  Rainbow test
## 
## data:  Alcoholmodel_2
## Rain = 0.98924, df1 = 210, df2 = 207, p-value = 0.5312

This does not look any better, let me try with just ALCBNG30D and IMPYDAYS

Alcoholmodel_3<-lm(IMPYDAYS~ALCBNG30D,data=AlcoholMH1)

plot(Alcoholmodel_3,which=1)

raintest(Alcoholmodel_3)
## 
##  Rainbow test
## 
## data:  Alcoholmodel_3
## Rain = 0.99424, df1 = 210, df2 = 208, p-value = 0.5167

Looks better, graph is not straight by rainbow test provides a a p-value that is likely to be linear.

I’ll continue with other assumptions testing on the linear model with all four variables (Alcoholmodel_1) b) Independence of errors (durbin-watson)

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
durbinWatsonTest(Alcoholmodel_1)
##  lag Autocorrelation D-W Statistic p-value
##    1    -0.005573745      2.006889   0.954
##  Alternative hypothesis: rho != 0

Next we look at the Durbin Watson test, which brings about the p-value of 0.954 which is high and means the errors are independent. Also, the DW Statistic is basically 2, which is good news.

  1. Homoscedasticity (plot, bptest)
plot(Alcoholmodel_1,which=3)

bptest(Alcoholmodel_1)
## 
##  studentized Breusch-Pagan test
## 
## data:  Alcoholmodel_1
## BP = 12.633, df = 3, p-value = 0.005501

This plot has a fairly straight line but flows at an angle. The points are not evenly distributed around the line. This would demonstrate a violation of Homoscedasticity.

The second command for the Breusch-Pagan test shows a result of p-value of 0.0055 which means the null hypothesis cannot be rejected and the model is not homoscedastic.

  1. Normality of residuals (QQ plot, shapiro test)
plot(Alcoholmodel_1,which=2)

There is significant deviation from the dotted line, which means residuals are not evenly distributed.

shapiro.test(Alcoholmodel_1$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  Alcoholmodel_1$residuals
## W = 0.83694, p-value < 2.2e-16

The p-value is WAY WAY below 0.05, which means that the residuals are not normal.

  1. No multicolinarity (VIF, cor)
vif(Alcoholmodel_1)
##  ALCYRTOT ALCBNG30D  IMPGOUTM 
##  1.395309  1.401562  1.005543

All three variables are close to 1.00. meaning they are NOT related to another variable.

And I’ll run the correlation again.

cor(AlcoholMH1)
##               ALCYRTOT   ALCBNG30D     IMPGOUTM    IMPYDAYS
## ALCYRTOT   1.000000000  0.53162182 -0.009641016  0.04650247
## ALCBNG30D  0.531621816  1.00000000 -0.067479458  0.13885904
## IMPGOUTM  -0.009641016 -0.06747946  1.000000000 -0.27127344
## IMPYDAYS   0.046502472  0.13885904 -0.271273440  1.00000000

Does this model (Alcoholmodel_1) meet the assumptions for linear model? NO! Because: a) Linearity (plot and raintest) - MAYBE i.e. line is not straigt but p-value is above 0.50 at 0.67. b) Independence of errors (durbin-watson) - YES MEETS has p-value of 0.954 which is high and means the errors are independent and the DW Statistic is basically 2. c) Homoscedasticity (plot, bptest) - DOES NOT MEET Breusch-Pagan test shows a result of p-value of 0.0055 and the line is not straight. d) Normality of residuals (QQ plot, shapiro test) - DOES NOT MEET THIS ASSUMPTION. Graph is not in line and the Shapiro-Wilk normality test of p-value < 2.2e-16. e) No multicolinarity (VIF, cor) - YES MEETS, since they are not correlated.

I cannot use a log transformation since there are legitimate zeroes as values in these variable. I must try square root as a transformation here.

Alcohol_multiple<-lm(sqrt(IMPYDAYS)~sqrt(ALCYRTOT)+sqrt(ALCBNG30D),data=AlcoholMH1)

summary(Alcohol_multiple)
## 
## Call:
## lm(formula = sqrt(IMPYDAYS) ~ sqrt(ALCYRTOT) + sqrt(ALCBNG30D), 
##     data = AlcoholMH1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.620 -5.211 -1.454  3.835 13.125 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      6.62849    0.56452  11.742   <2e-16 ***
## sqrt(ALCYRTOT)  -0.05405    0.07375  -0.733   0.4640    
## sqrt(ALCBNG30D)  0.61966    0.28496   2.175   0.0302 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.955 on 417 degrees of freedom
## Multiple R-squared:  0.01192,    Adjusted R-squared:  0.007182 
## F-statistic: 2.515 on 2 and 417 DF,  p-value: 0.08205

This looks much better. Now I’ll check the assumptions.

plot(Alcohol_multiple,which=1)

raintest(Alcohol_multiple)
## 
##  Rainbow test
## 
## data:  Alcohol_multiple
## Rain = 0.98301, df1 = 210, df2 = 207, p-value = 0.5493

This new data set (Alcohol_multiple) is looking better! This line is much more straight. Raintest is still value of 0.549. Let’s run more assumptions here.

durbinWatsonTest(Alcohol_multiple)
##  lag Autocorrelation D-W Statistic p-value
##    1      -0.0289724       2.05368   0.564
##  Alternative hypothesis: rho != 0
plot(Alcohol_multiple,which=3)

bptest(Alcohol_multiple)
## 
##  studentized Breusch-Pagan test
## 
## data:  Alcohol_multiple
## BP = 1.1515, df = 2, p-value = 0.5623

Normality of Residuals

plot(Alcohol_multiple,which=2)

shapiro.test(Alcohol_multiple$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  Alcohol_multiple$residuals
## W = 0.91857, p-value = 2.725e-14
vif(Alcohol_multiple)
##  sqrt(ALCYRTOT) sqrt(ALCBNG30D) 
##        1.430094        1.430094

VIF shows close to 0 which means they are not strongly correlated to other variables.

COR can’t run on a model, but can run it on the dataframe which it was above.

  1. Linearity (plot and raintest) - YES MEETS line is much closer to straight and p-value is 0.549.
  2. Independence of errors (durbin-watson) - YES MEETS has p-value of 0.564 which is high and means the errors are independent and the DW Statistic is basically 2.
  3. Homoscedasticity (plot, bptest) - YES MEETS Breusch-Pagan test shows a result of p-value of 0.562 and the line is much more straight in the residuals graph.
  4. Normality of residuals (QQ plot, shapiro test) - MAYBE. Graph is showing much more accurate, with the two ends trailing off as it should and the Shapiro-Wilk normality test of p-value = 2.725e-14 which means it is not normal.
  5. No multicolinarity (VIF, cor) - YES MEETS, since they are not correlated since both values are way under 10.

This looks like a model that now meets the assumptiosn. Let’s look at summary again.

summary(Alcohol_multiple)
## 
## Call:
## lm(formula = sqrt(IMPYDAYS) ~ sqrt(ALCYRTOT) + sqrt(ALCBNG30D), 
##     data = AlcoholMH1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.620 -5.211 -1.454  3.835 13.125 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      6.62849    0.56452  11.742   <2e-16 ***
## sqrt(ALCYRTOT)  -0.05405    0.07375  -0.733   0.4640    
## sqrt(ALCBNG30D)  0.61966    0.28496   2.175   0.0302 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.955 on 417 degrees of freedom
## Multiple R-squared:  0.01192,    Adjusted R-squared:  0.007182 
## F-statistic: 2.515 on 2 and 417 DF,  p-value: 0.08205

So number of days in the month of binge drinking is significant in relation to number of days of missed work. The number of days total drinking alcohol in the last year is not significant. Adjusted R Squared tells me the transformed number of days drinking (ALCYRTOT+ALCBNG30D) explains .7% of IMPYDAYS.

DATA EXPLORATION GAPS I’m having trouble running the T-Test or create a Box plot that is useful for this. Maybe more independent variables that have a two categorical value would be helpful…could be something like employed or not employed, or experience with past mental health (yes or no)? What other ways can I explore this transformed data? It doesn’t create a dataframe, so I’m limited to run stat.desc, for example.