STEP 1 - load the data set from 2023 National Survey on Drug Use and Health (NSDUH) into R Studio.
library(readxl)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(pastecs)
##
## Attaching package: 'pastecs'
##
## The following objects are masked from 'package:dplyr':
##
## first, last
##
## The following object is masked from 'package:tidyr':
##
## extract
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
load('NSDUH_2023.Rdata')
let’s review this data set before narrowing the selection by using STR
str(puf2023_102124)
## tibble [56,705 × 2,636] (S3: tbl_df/tbl/data.frame)
## $ QUESTID2 : num [1:56705] 1e+07 1e+07 1e+07 1e+07 1e+07 ...
## ..- attr(*, "label")= chr "RESPONDENT IDENTIFICATION"
## $ FILEDATE : chr [1:56705] "10/21/2024" "10/21/2024" "10/21/2024" "10/21/2024" ...
## ..- attr(*, "label")= chr "CREATION DATE OF THE DATA FILE"
## $ ANALWT2_C : num [1:56705] 3276 15630 4018 10712 8195 ...
## ..- attr(*, "label")= chr "FIN PRSN-LEVEL SMPLE WGHT 2"
## $ VESTR_C : num [1:56705] 40031 40021 40043 40030 40023 ...
## ..- attr(*, "label")= chr "VARIANCE STRATUM"
## $ VEREP : num [1:56705] 2 2 1 2 2 1 1 1 2 1 ...
## ..- attr(*, "label")= chr "VARIANCE PRIMARY SAMPLING UNIT"
## ..- attr(*, "format.sas")= chr "Z"
## $ PDEN10 : num [1:56705] 2 2 2 2 2 2 1 2 1 2 ...
## ..- attr(*, "label")= chr "POPULATION DENSITY 2010 - THREE LEVELS"
## ..- attr(*, "format.sas")= chr "PDEN10FMT"
## $ COUTYP4 : num [1:56705] 2 3 2 2 2 2 1 2 1 2 ...
## ..- attr(*, "label")= chr "COUNTY METRO/NONMETRO STATUS (2013 3-LEVEL)"
## ..- attr(*, "format.sas")= chr "COUTYP4FMT"
## $ MAIIN102 : num [1:56705] 2 2 2 2 2 2 2 2 2 2 ...
## ..- attr(*, "label")= chr "MAJORITY AMER INDIAN AREA INDICATOR FOR SEGMENT"
## ..- attr(*, "format.sas")= chr "MAIIN102FMT"
## $ AIIND102 : num [1:56705] 2 2 2 2 2 2 2 2 2 2 ...
## ..- attr(*, "label")= chr "AMER INDIAN AREA INDICATOR"
## ..- attr(*, "format.sas")= chr "AIIND102FMT"
## $ AGE3 : num [1:56705] 10 9 9 1 10 4 8 11 8 2 ...
## ..- attr(*, "label")= chr "RECODE - FINAL EDITED AGE"
## ..- attr(*, "format.sas")= chr "AGE3FMT"
## $ SERVICE : num [1:56705] 2 1 2 99 2 2 2 2 2 99 ...
## ..- attr(*, "label")= chr "EVER BEEN IN THE US ARMED FORCES"
## ..- attr(*, "format.sas")= chr "SERVICEFMT"
## $ MILSTAT : num [1:56705] 99 3 99 99 99 99 99 99 99 99 ...
## ..- attr(*, "label")= chr "CURRENT MILITARY STATUS"
## ..- attr(*, "format.sas")= chr "MILSTATFMT"
## $ ACTDEVER : num [1:56705] 99 1 99 99 99 99 99 99 99 99 ...
## ..- attr(*, "label")= chr "EVER ON ACTIVE DUTY IN US MILITARY/RESERV"
## ..- attr(*, "format.sas")= chr "ACTDEVERFMT"
## $ ACTD2001 : num [1:56705] 99 1 99 99 99 99 99 99 99 99 ...
## ..- attr(*, "label")= chr "SEPTEMBER 2001 OR LATER"
## ..- attr(*, "format.sas")= chr "ACTD2001FMT"
## $ ACTD9001 : num [1:56705] 99 2 99 99 99 99 99 99 99 99 ...
## ..- attr(*, "label")= chr "AUG 1990 TO AUG 2001 (INCLUDING PERSIAN GULF WAR)"
## ..- attr(*, "format.sas")= chr "ACTD9001FMT"
## $ ACTD7590 : num [1:56705] 99 2 99 99 99 99 99 99 99 99 ...
## ..- attr(*, "label")= chr "MAY 1975 TO JULY 1990"
## ..- attr(*, "format.sas")= chr "ACTD7590FMT"
## $ ACTDVIET : num [1:56705] 99 2 99 99 99 99 99 99 99 99 ...
## ..- attr(*, "label")= chr "VIETNAM ERA (MARCH 1961 TO APRIL 1975)"
## ..- attr(*, "format.sas")= chr "ACTDVIETFMT"
## $ ACTDPRIV : num [1:56705] 99 2 99 99 99 99 99 99 99 99 ...
## ..- attr(*, "label")= chr "ANY TIME BEFORE VIETNAM ERA (MARCH 1961)"
## ..- attr(*, "format.sas")= chr "ACTDPRIVFMT"
## $ COMBATPY : num [1:56705] 99 2 99 99 99 99 99 99 99 99 ...
## ..- attr(*, "label")= chr "EVER IN COMBAT ZONE ON ACTIVE DUTY"
## ..- attr(*, "format.sas")= chr "COMBATPYFMT"
## $ NOMARR2 : num [1:56705] 1 99 2 99 1 99 99 99 99 99 ...
## ..- attr(*, "label")= chr "NUMBER OF TIMES MARRIED"
## ..- attr(*, "format.sas")= chr "NOMARR2FMT"
## $ HEALTH : num [1:56705] 2 2 3 3 1 3 3 2 1 5 ...
## ..- attr(*, "label")= chr "OVERALL HEALTH"
## ..- attr(*, "format.sas")= chr "HEALTHFMT"
## $ MOVSINPYR2 : num [1:56705] 0 0 0 0 0 0 0 0 0 1 ...
## ..- attr(*, "label")= chr "# TIMES MOVED PAST 12 MONTHS- RECODED"
## ..- attr(*, "format.sas")= chr "MOVSINPYR2FMT"
## $ SEXATRACT2 : num [1:56705] 1 2 2 1 1 1 1 1 1 2 ...
## ..- attr(*, "label")= chr "SEXUAL ATTRACTION"
## ..- attr(*, "format.sas")= chr "SEXATRACT2FMT"
## $ SEXIDENT22 : num [1:56705] 1 3 1 1 1 1 1 1 1 3 ...
## ..- attr(*, "label")= chr "SEXUAL IDENTITY"
## ..- attr(*, "format.sas")= chr "SEXIDENT22FMT"
## $ SPEAKENGL : num [1:56705] 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "label")= chr "HOW WELL SPEAK ENGLISH"
## ..- attr(*, "format.sas")= chr "SPEAKENGLFMT"
## $ LVLDIFSEE2 : num [1:56705] 1 1 2 1 1 1 1 1 2 1 ...
## ..- attr(*, "label")= chr "LEVEL OF DIFFICULTY SEEING EVEN WITH GLASSES"
## ..- attr(*, "format.sas")= chr "LVLDIFSEE2FMT"
## $ LVLDIFHEAR2 : num [1:56705] 1 1 3 1 1 2 2 1 1 1 ...
## ..- attr(*, "label")= chr "LEVEL OF DIFFICULTY HEARING EVEN WITH HEARING AIDS"
## ..- attr(*, "format.sas")= chr "LVLDIFHEAR2FMT"
## $ LVLDIFWALK2 : num [1:56705] 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "label")= chr "LEVEL OF DIFFICULTY WALKING OR CLIMBING STEPS"
## ..- attr(*, "format.sas")= chr "LVLDIFWALK2FMT"
## $ LVLDIFMEM2 : num [1:56705] 1 2 2 1 1 1 1 1 1 1 ...
## ..- attr(*, "label")= chr "LEVEL OF DIFFICULTY REMEMBERING OR CONCENTRATING"
## ..- attr(*, "format.sas")= chr "LVLDIFMEM2FMT"
## $ LVLDIFCARE2 : num [1:56705] 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "label")= chr "LEVEL OF DIFFICULTY WITH SELF-CARE"
## ..- attr(*, "format.sas")= chr "LVLDIFCARE2FMT"
## $ LVLDIFCOMM2 : num [1:56705] 1 1 1 1 1 1 1 1 1 3 ...
## ..- attr(*, "label")= chr "LEVEL OF DIFFICULTY COMMUNICATING USING YOUR USUAL LANGUAGE"
## ..- attr(*, "format.sas")= chr "LVLDIFCOMM2FMT"
## $ IRSEX : num [1:56705] 1 1 2 1 1 1 1 1 2 2 ...
## ..- attr(*, "label")= chr "SEX AT BIRTH - IMPUTATION REVISED"
## ..- attr(*, "format.sas")= chr "IRSEXFMT"
## $ IRMARIT : num [1:56705] 1 4 1 99 1 4 4 4 4 4 ...
## ..- attr(*, "label")= chr "IMPUTATION REVISED MARITAL STATUS"
## ..- attr(*, "format.sas")= chr "IRMARITFMT"
## $ IIMARIT : num [1:56705] 1 1 1 9 1 1 1 1 1 1 ...
## ..- attr(*, "label")= chr "IMPUTATION INDICATOR MARITAL STATUS"
## ..- attr(*, "format.sas")= chr "IIMARITFMT"
## $ IREDUHIGHST2: num [1:56705] 9 8 8 2 11 8 11 11 11 5 ...
## ..- attr(*, "label")= chr "EDUCATION - RECODED IMPUTATION REVISED"
## ..- attr(*, "format.sas")= chr "IREDUHIGHST2FMT"
## $ IIEDUHIGHST2: num [1:56705] 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "label")= chr "EDUCATION INDICATOR - IMPUTATION INDICATOR"
## ..- attr(*, "format.sas")= chr "IIEDUHIGHST2FMT"
## $ CATAGE : num [1:56705] 4 4 4 1 4 2 3 4 3 1 ...
## ..- attr(*, "label")= chr "RC-AGE CATEGORY"
## ..- attr(*, "format.sas")= chr "CATAGEFMT"
## $ CATAG2 : num [1:56705] 3 3 3 1 3 2 3 3 3 1 ...
## ..- attr(*, "label")= chr "RC-AGE CATEGORY RECODE (3 LEVELS)"
## ..- attr(*, "format.sas")= chr "CATAG2FMT"
## $ CATAG3 : num [1:56705] 5 4 4 1 5 2 3 5 3 1 ...
## ..- attr(*, "label")= chr "RC-AGE CATEGORY RECODE (5 LEVELS)"
## ..- attr(*, "format.sas")= chr "CATAG3FMT"
## $ CATAG6 : num [1:56705] 5 4 4 1 5 2 3 6 3 1 ...
## ..- attr(*, "label")= chr "RC-AGE CATEGORY RECODE (6 LEVELS)"
## ..- attr(*, "format.sas")= chr "CATAG6FMT"
## $ CATAG7 : num [1:56705] 7 7 7 1 7 4 6 7 6 2 ...
## ..- attr(*, "label")= chr "RC-AGE CATEGORY RECODE (7 LEVELS)"
## ..- attr(*, "format.sas")= chr "CATAG7FMT"
## $ PREGAGE2 : num [1:56705] 4 3 4 4 4 2 3 4 3 1 ...
## ..- attr(*, "label")= chr "RC-PREGNANCY AGE CATEGORY RECODE"
## ..- attr(*, "format.sas")= chr "PREGAGE2FMT"
## $ DRVINAGE : num [1:56705] 2 2 2 3 2 1 2 2 2 3 ...
## ..- attr(*, "label")= chr "RC-DUI AGE CATEGORY RECODE"
## ..- attr(*, "format.sas")= chr "DRVINAGEFMT"
## $ DRVINDETAG : num [1:56705] 3 3 3 4 3 1 3 3 3 4 ...
## ..- attr(*, "label")= chr "RC-DUI DETAILED AGE CATEGORY RECODE"
## ..- attr(*, "format.sas")= chr "DRVINDETAGFMT"
## $ SEXAGE : num [1:56705] 5 5 5 1 5 3 5 5 5 2 ...
## ..- attr(*, "label")= chr "RC-COMBINED GENDER BY AGE CATEGORY INDICATOR"
## ..- attr(*, "format.sas")= chr "SEXAGEFMT"
## $ NEWRACE2 : num [1:56705] 7 1 4 7 6 1 1 1 5 1 ...
## ..- attr(*, "label")= chr "RC-RACE/HISPANICITY RECODE (7 LEVELS)"
## ..- attr(*, "format.sas")= chr "NEWRACE2FMT"
## $ SEXRACE : num [1:56705] 5 1 7 5 7 1 1 1 7 2 ...
## ..- attr(*, "label")= chr "RC-COMBINED GENDER BY RACE INDICATOR"
## ..- attr(*, "format.sas")= chr "SEXRACEFMT"
## $ EDUHIGHCAT : num [1:56705] 3 2 2 5 4 2 4 4 4 5 ...
## ..- attr(*, "label")= chr "RC-EDUCATION CATEGORIES"
## ..- attr(*, "format.sas")= chr "EDUHIGHCATFMT"
## $ HEALTH2 : num [1:56705] 2 2 3 3 1 3 3 2 1 4 ...
## ..- attr(*, "label")= chr "RC-OVERALL HEALTH RECODE"
## ..- attr(*, "format.sas")= chr "HEALTH2FMT"
## $ EDUSCHLGO : num [1:56705] 2 2 2 1 2 2 2 2 2 1 ...
## ..- attr(*, "label")= chr "NOW GOING TO SCHOOL"
## ..- attr(*, "format.sas")= chr "EDUSCHLGOFMT"
## $ EDUSCHGRD2 : num [1:56705] 99 99 99 3 99 99 99 99 99 6 ...
## ..- attr(*, "label")= chr "WHAT GRADE IN NOW/WILL BE IN"
## ..- attr(*, "format.sas")= chr "EDUSCHGRD2FMT"
## $ EDUFULPAR : num [1:56705] 99 99 99 1 99 99 99 99 99 1 ...
## ..- attr(*, "label")= chr "FULL OR PART TIME"
## ..- attr(*, "format.sas")= chr "EDUFULPARFMT"
## $ EDUSCKMON : num [1:56705] 99 99 99 4 99 99 99 99 99 0 ...
## ..- attr(*, "label")= chr "HOW MANY DAYS MISSED SCHOOL FROM SICK"
## ..- attr(*, "format.sas")= chr "EDUSCKMONFMT"
## $ EDUSCKEST : num [1:56705] 99 99 99 99 99 99 99 99 99 99 ...
## ..- attr(*, "label")= chr "BEST ESTIMATE # DAYS MISSED BECAUSE SICK OR INURED"
## ..- attr(*, "format.sas")= chr "EDUSCKESTFMT"
## $ EDUSCKCOM : num [1:56705] 99 99 99 4 99 99 99 99 99 0 ...
## ..- attr(*, "label")= chr "RC - HOW MANY DAYS MISSED SCHOOL FROM SICK (COMBINED)"
## ..- attr(*, "format.sas")= chr "EDUSCKCOMFMT"
## $ EDUSKPMON : num [1:56705] 99 99 99 0 99 99 99 99 99 0 ...
## ..- attr(*, "label")= chr "HOW MANY DAYS MISSED SCHOOL FROM SKIPPING"
## ..- attr(*, "format.sas")= chr "EDUSKPMONFMT"
## $ EDUSKPEST : num [1:56705] 99 99 99 99 99 99 99 99 99 99 ...
## ..- attr(*, "label")= chr "BEST ESTIMATE # DAYS MISSED BECAUSE SKIPPED OR CUT"
## ..- attr(*, "format.sas")= chr "EDUSKPESTFMT"
## $ EDUSKPCOM : num [1:56705] 99 99 99 0 99 99 99 99 99 0 ...
## ..- attr(*, "label")= chr "RC - HOW MANY DAYS MISSED SCHOOL FROM SKIPPING (COMBINED)"
## ..- attr(*, "format.sas")= chr "EDUSKPCOMFMT"
## $ MILTFAMLY : num [1:56705] 2 1 2 1 2 2 2 2 2 2 ...
## ..- attr(*, "label")= chr "ANY IMMEDIATE FAMILY CURRENTLY IN US MILITARY"
## ..- attr(*, "format.sas")= chr "MILTFAMLYFMT"
## $ MILTSPPAR : num [1:56705] 99 2 99 97 99 99 99 99 99 99 ...
## ..- attr(*, "label")= chr "SPOUSE OR PARTNER IN MILITARY"
## ..- attr(*, "format.sas")= chr "MILTSPPARFMT"
## $ MILTPARNT : num [1:56705] 99 2 99 97 99 99 99 99 99 99 ...
## ..- attr(*, "label")= chr "FATHER OR MOTHER IN MILITARY"
## ..- attr(*, "format.sas")= chr "MILTPARNTFMT"
## $ MILTCHLDR : num [1:56705] 99 2 99 97 99 99 99 99 99 99 ...
## ..- attr(*, "label")= chr "SON OR DAUGHTER IN MILITARY"
## ..- attr(*, "format.sas")= chr "MILTCHLDRFMT"
## $ MILTSIBLN : num [1:56705] 99 1 99 97 99 99 99 99 99 99 ...
## ..- attr(*, "label")= chr "SISTER OR BROTHER IN MILITARY"
## ..- attr(*, "format.sas")= chr "MILTSIBLNFMT"
## $ ENRLCOLLFT2 : num [1:56705] 3 3 3 3 3 2 3 3 3 3 ...
## ..- attr(*, "label")= chr "RC-COLLEGE ENROLLMENT-FULL TIME"
## ..- attr(*, "format.sas")= chr "ENRLCOLLFT2FMT"
## $ ENRLCOLLST2 : num [1:56705] 5 5 5 5 5 3 5 5 5 5 ...
## ..- attr(*, "label")= chr "RC-COLLEGE ENROLLMENT"
## ..- attr(*, "format.sas")= chr "ENRLCOLLST2FMT"
## $ WRKSTATWK2 : num [1:56705] 8 1 9 99 1 3 1 2 1 9 ...
## ..- attr(*, "label")= chr "WORK SITUATION IN PAST WEEK - RECODE"
## ..- attr(*, "format.sas")= chr "WRKSTATWK2FMT"
## $ WRKDPSTWK : num [1:56705] 2 1 2 99 1 2 1 1 1 2 ...
## ..- attr(*, "label")= chr "WORK AT JOB LAST WEEK"
## ..- attr(*, "format.sas")= chr "WRKDPSTWKFMT"
## $ WRKHADJOB : num [1:56705] 2 99 2 99 99 1 99 99 99 2 ...
## ..- attr(*, "label")= chr "DID YOU HAVE A JOB OR BUSINESS"
## ..- attr(*, "format.sas")= chr "WRKHADJOBFMT"
## $ WRKDHRSWK2 : num [1:56705] 999 50 999 999 40 999 40 16 15 999 ...
## ..- attr(*, "label")= chr "HOW MANY HOURS WORKED LAST WEEK"
## ..- attr(*, "format.sas")= chr "WRKDHRSWK2FMT"
## $ WRK35WKUS : num [1:56705] 99 1 99 99 1 1 1 2 1 99 ...
## ..- attr(*, "label")= chr "USUALLY WORK 35 OR MORE HRS PER WEEK"
## ..- attr(*, "format.sas")= chr "WRK35WKUSFMT"
## $ WRKRSNNOT : num [1:56705] 999 999 999 999 999 1 999 999 999 999 ...
## ..- attr(*, "label")= chr "REASON FOR NOT WORKING AT JOB - PAST WK"
## ..- attr(*, "format.sas")= chr "WRKRSNNOTFMT"
## $ WRKRSNJOB : num [1:56705] 5 999 8 999 999 999 999 999 999 1 ...
## ..- attr(*, "label")= chr "REASON FOR NOT HAVING A JOB - PAST WK"
## ..- attr(*, "format.sas")= chr "WRKRSNJOBFMT"
## $ WRKEFFORT : num [1:56705] 99 99 99 99 99 99 99 99 99 2 ...
## ..- attr(*, "label")= chr "PAST 30 DAYS, MAKE EFFORT TO FIND WORK"
## ..- attr(*, "format.sas")= chr "WRKEFFORTFMT"
## $ WRKDPSTYR : num [1:56705] 2 99 2 99 99 99 99 99 99 2 ...
## ..- attr(*, "label")= chr "PAST 12 MOS, WORKED AT ANY JOB"
## ..- attr(*, "format.sas")= chr "WRKDPSTYRFMT"
## $ WRKSELFEM : num [1:56705] 99 1 99 99 1 2 2 2 1 99 ...
## ..- attr(*, "label")= chr "PAST 12 MOS, SELF EMPLOYED"
## ..- attr(*, "format.sas")= chr "WRKSELFEMFMT"
## $ WRKNUMJOB2 : num [1:56705] 99 1 99 99 1 3 4 1 1 99 ...
## ..- attr(*, "label")= chr "PAST 12 MOS, HOW MANY EMPLOYERS"
## ..- attr(*, "format.sas")= chr "WRKNUMJOB2FMT"
## $ WRKNJBPYR : num [1:56705] 99 2 99 99 2 1 1 2 2 99 ...
## ..- attr(*, "label")= chr "PAST 12 MOS, TIME WITH NO JOB"
## ..- attr(*, "format.sas")= chr "WRKNJBPYRFMT"
## $ WRKNJBWKS : num [1:56705] 99 99 99 99 99 7 2 99 99 99 ...
## ..- attr(*, "label")= chr "HOW MANY WKS W/O JOB IN PAST 12 MOS"
## ..- attr(*, "format.sas")= chr "WRKNJBWKSFMT"
## $ WRKLASTYR2 : num [1:56705] 2017 9999 9991 9999 9999 ...
## ..- attr(*, "label")= chr "WHAT YEAR LAST WORKED"
## ..- attr(*, "format.sas")= chr "WRKLASTYR2FMT"
## $ WRKSICKMO : num [1:56705] 99 0 99 99 0 0 0 0 0 99 ...
## ..- attr(*, "label")= chr "# DAYS MISSED WORK FOR INJURY/ILLNESS PAST 30 DAYS"
## ..- attr(*, "format.sas")= chr "WRKSICKMOFMT"
## $ WRKSKIPMO : num [1:56705] 99 0 99 99 0 0 0 0 0 99 ...
## ..- attr(*, "label")= chr "# DAYS SKIPPED WORK PAST 30 DAYS"
## ..- attr(*, "format.sas")= chr "WRKSKIPMOFMT"
## $ WRKDRGPOL : num [1:56705] 99 2 99 99 2 1 1 1 2 99 ...
## ..- attr(*, "label")= chr "WORKPLACE HAVE WRITTEN POLICY DRUG/ALCOHOL USE"
## ..- attr(*, "format.sas")= chr "WRKDRGPOLFMT"
## $ WRKDRGALB : num [1:56705] 99 99 99 99 99 3 3 3 99 99 ...
## ..- attr(*, "label")= chr "DID POLICY COVER ALCOHOL, DRUGS, OR BOTH"
## ..- attr(*, "format.sas")= chr "WRKDRGALBFMT"
## $ WRKDRGEDU : num [1:56705] 99 2 99 99 2 1 2 2 2 99 ...
## ..- attr(*, "label")= chr "AT WORK, GIVEN EDUCATION ON DRUGS/ALC"
## ..- attr(*, "format.sas")= chr "WRKDRGEDUFMT"
## $ WRKDRGHLP : num [1:56705] 99 2 99 99 2 2 2 1 2 99 ...
## ..- attr(*, "label")= chr "ANY ASSISTANCE PROGRAM OFFERED THROUGH WRK"
## ..- attr(*, "format.sas")= chr "WRKDRGHLPFMT"
## $ WRKTSTALC : num [1:56705] 99 2 99 99 2 1 2 2 2 99 ...
## ..- attr(*, "label")= chr "WORKPLACE TESTS FOR ALCOHOL USAGE"
## ..- attr(*, "format.sas")= chr "WRKTSTALCFMT"
## $ WRKTSTDRG : num [1:56705] 99 2 99 99 2 2 2 94 2 99 ...
## ..- attr(*, "label")= chr "WORKPLACE TESTS FOR DRUG USAGE"
## ..- attr(*, "format.sas")= chr "WRKTSTDRGFMT"
## $ WRKTSTHIR : num [1:56705] 99 99 99 99 99 1 99 98 99 99 ...
## ..- attr(*, "label")= chr "TEST FOR DRUG/ALC AS HIRING PROCESS"
## ..- attr(*, "format.sas")= chr "WRKTSTHIRFMT"
## $ WRKTSTRDM : num [1:56705] 99 99 99 99 99 1 99 98 99 99 ...
## ..- attr(*, "label")= chr "TEST ON RANDOM BASIS"
## ..- attr(*, "format.sas")= chr "WRKTSTRDMFMT"
## $ WRKTST1ST : num [1:56705] 99 99 99 99 99 2 99 98 99 99 ...
## ..- attr(*, "label")= chr "WHAT HAPPENS FIRST TIME CAUGHT"
## ..- attr(*, "format.sas")= chr "WRKTST1STFMT"
## $ WRKOKPREH : num [1:56705] 99 3 99 99 3 2 2 3 3 99 ...
## ..- attr(*, "label")= chr "WOULD YOU WORK FOR EMP DOES DRUG TEST PRE-HIRE"
## ..- attr(*, "format.sas")= chr "WRKOKPREHFMT"
## $ WRKOKRAND : num [1:56705] 99 3 99 99 2 2 2 3 3 99 ...
## ..- attr(*, "label")= chr "WOULD YOU WORK FOR EMP DOES DRUG/ALC TEST RANDOMLY"
## ..- attr(*, "format.sas")= chr "WRKOKRANDFMT"
## $ IRWRKSTAT : num [1:56705] 4 1 4 99 1 1 1 2 1 4 ...
## ..- attr(*, "label")= chr "EMPLOYMENT STATUS - IMPUTATION REVISED"
## ..- attr(*, "format.sas")= chr "IRWRKSTATFMT"
## $ IIWRKSTAT : num [1:56705] 1 1 1 9 1 1 1 1 1 1 ...
## ..- attr(*, "label")= chr "EMPLOYMENT STATUS - IMPUTATION INDICATOR"
## ..- attr(*, "format.sas")= chr "IIWRKSTATFMT"
## $ II2WRKSTAT : num [1:56705] 1 1 1 9 1 1 1 1 1 1 ...
## ..- attr(*, "label")= chr "EMPLOYMENT STATUS - DETAILED IMPUTATION INDICATOR"
## ..- attr(*, "format.sas")= chr "II2WRKSTATFMT"
## $ IRWRKSTAT18 : num [1:56705] 4 1 4 99 1 1 1 2 1 99 ...
## ..- attr(*, "label")= chr "EMPLOYMENT STATUS 18+ - IMPUTATION REVISED"
## ..- attr(*, "format.sas")= chr "IRWRKSTAT18FMT"
## $ IIWRKSTAT18 : num [1:56705] 1 1 1 9 1 1 1 1 1 9 ...
## ..- attr(*, "label")= chr "EMPLOYMENT STATUS 18+ - IMPUTATION INDICATOR"
## ..- attr(*, "format.sas")= chr "IIWRKSTAT18FMT"
## $ II2WRKST18 : num [1:56705] 1 1 1 9 1 1 1 1 1 9 ...
## ..- attr(*, "label")= chr "EMPLOYMENT STATUS 18+ - DETAILED IMPUTATION INDICATOR"
## ..- attr(*, "format.sas")= chr "II2WRKST18FMT"
## $ EDFAM18 : num [1:56705] 0 0 0 0 0 0 1 1 0 0 ...
## ..- attr(*, "label")= chr "RC-EDITED INDICATOR: FAMILY IN HH 18 OR OLDER"
## ..- attr(*, "format.sas")= chr "EDFAM18FMT"
## [list output truncated]
STEP 2 - create smaller data set. STEP 2a - select just the four variables: ALCYRTOT,ALCBNG30D, IMPGOUTM, IMPYDAYS
select <- dplyr::select
AlcoholMHlarge<-select(puf2023_102124, ALCYRTOT,ALCBNG30D,IMPGOUTM,IMPYDAYS)
summary(AlcoholMHlarge)
## ALCYRTOT ALCBNG30D IMPGOUTM IMPYDAYS
## Min. : 1.0 Min. : 0.00 Min. : 1.00 Min. : 0.0
## 1st Qu.: 36.0 1st Qu.: 1.00 1st Qu.:99.00 1st Qu.: 1.0
## Median :200.0 Median :91.00 Median :99.00 Median :999.0
## Mean :466.2 Mean :54.14 Mean :97.03 Mean :554.4
## 3rd Qu.:991.0 3rd Qu.:93.00 3rd Qu.:99.00 3rd Qu.:999.0
## Max. :998.0 Max. :98.00 Max. :99.00 Max. :999.0
str(AlcoholMHlarge)
## tibble [56,705 × 4] (S3: tbl_df/tbl/data.frame)
## $ ALCYRTOT : num [1:56705] 342 84 991 991 24 36 180 993 24 991 ...
## ..- attr(*, "label")= chr "TOTAL # OF DAYS USED ALCOHOL IN PAST 12 MOS"
## ..- attr(*, "format.sas")= chr "ALCYRTOTFMT"
## $ ALCBNG30D: num [1:56705] 7 5 91 91 0 0 1 93 0 91 ...
## ..- attr(*, "label")= chr "# DAYS HAD FOUR/FIVE OR MORE DRINKS PAST 30 DYS"
## ..- attr(*, "format.sas")= chr "ALCBNG30DFMT"
## $ IMPGOUTM : num [1:56705] 99 99 99 99 99 99 99 99 99 99 ...
## ..- attr(*, "label")= chr "EMOTIONAL PROBLEMS KEEP YOU FROM LEAVING HOUSE"
## ..- attr(*, "format.sas")= chr "IMPGOUTMFMT"
## $ IMPYDAYS : num [1:56705] 999 6 120 999 999 20 10 999 0 999 ...
## ..- attr(*, "label")= chr "HOW MANY DAY IN PAST YR YOU WERE UNABLE TO WORK"
## ..- attr(*, "format.sas")= chr "IMPYDAYSFMT"
Interesting that there are no “zero days” for ALCYRTOT. This means that no one of the 56,705 observations said they drank zero days in the last 12 months. All observations drank at least one day. IMPGOUTM is actually a categorical variable (i.e. YES OR NO).
STEP 2b - remove observations with more than 365 days for ALCYRTOT and IMPYDAYS and remove observations with more than 30 days for ALCBNG30D and IMPGOUTM.
AlcoholMH1 <- AlcoholMHlarge %>%
filter(ALCYRTOT >= 0 & ALCYRTOT <= 366 &
IMPYDAYS >= 0 & IMPYDAYS <= 366 &
ALCBNG30D >= 0 & ALCBNG30D <= 31 &
IMPGOUTM >= 0 & IMPGOUTM <= 31)
summary(AlcoholMH1)
## ALCYRTOT ALCBNG30D IMPGOUTM IMPYDAYS
## Min. : 1.00 Min. : 0.000 Min. :1.00 Min. : 0.00
## 1st Qu.: 12.00 1st Qu.: 0.000 1st Qu.:1.00 1st Qu.: 3.00
## Median : 48.00 Median : 1.000 Median :1.00 Median : 30.00
## Mean : 83.47 Mean : 2.462 Mean :1.21 Mean : 82.11
## 3rd Qu.:126.00 3rd Qu.: 2.000 3rd Qu.:1.00 3rd Qu.:120.00
## Max. :364.00 Max. :30.000 Max. :2.00 Max. :365.00
There are no N/As to remove from this data set. All four variables have values for all the remaining observations.
Let’s explore this data now…
ggplot(AlcoholMH1, aes(x=ALCYRTOT)) +
geom_histogram(bins = 40, fill = "steelblue", color = "black") +
labs(title = "Distribution of Days Drinking in the last year")
ggplot(AlcoholMH1, aes(x=ALCBNG30D)) +
geom_histogram(bins = 40, fill = "steelblue", color = "black") +
labs(title = "Distribution of Days Binge Drinking in the last month")
ggplot(AlcoholMH1, aes(x=IMPYDAYS)) +
geom_histogram(bins = 40, fill = "steelblue", color = "black") +
labs(title = "Distribution of Days Missed work in the last year")
I left out IMPGOUTM because it’s categorical and not continuous. All three of these variables are extremely skewed as is.
ggplot(AlcoholMH1, aes(x =factor(IMPGOUTM))) +
geom_bar(fill = "steelblue") +
labs(title = "Frequency of Days Could not leave the house due to emotional problems", x="Yes or No")
stat.desc(AlcoholMH1$ALCYRTOT)
## x
## nbr.val 420.000000
## nbr.null 0.000000
## nbr.na 0.000000
## min 1.000000
## max 364.000000
## range 363.000000
## sum 35059.000000
## median 48.000000
## mean 83.473810
## SE.mean 4.319218
## CI.mean.0.95 8.490036
## var 7835.371627
## std.dev 88.517635
## coef.var 1.060424
stat.desc(AlcoholMH1$IMPYDAYS)
## x
## nbr.val 420.000000
## nbr.null 77.000000
## nbr.na 0.000000
## min 0.000000
## max 365.000000
## range 365.000000
## sum 34486.000000
## median 30.000000
## mean 82.109524
## SE.mean 5.317133
## CI.mean.0.95 10.451578
## var 11874.198000
## std.dev 108.968794
## coef.var 1.327115
I wanted a way to more easily look at the number of days drinking, so let’s mutate the numbers into a category by creating a new variable.
alcohol_new <- AlcoholMH1 %>%
mutate(Type_of_Drinker = case_when(
ALCYRTOT == 0 ~ "Non-drinker",
ALCYRTOT <= 52 ~ "Occasional",
ALCYRTOT <= 156 ~ "Regular",
ALCYRTOT <= 260 ~ "Frequent",
TRUE ~ "Very Frequent"
))
table(alcohol_new$Type_of_Drinker)
##
## Frequent Occasional Regular Very Frequent
## 52 235 118 15
Let’s see what this new category looks like
ggplot(alcohol_new, aes(x=Type_of_Drinker)) +
geom_bar(fill = "steelblue", color = "black") +
labs(title = "Types of Drinkers")
ggplot(alcohol_new, aes(x = Type_of_Drinker, y = IMPYDAYS)) +
geom_boxplot(fill = "steelblue") +
labs(title = "Days Missed work by Type of Drinker",
x = "Type of Drinker",
y = "Days of Missed Work")
This looks like “Very Frequent” drinkers has a higher mean of Missed
Days of work than the other three types of drinkers.
alcohol_new %>%
group_by(Type_of_Drinker) %>%
summarize(
mean_impairment_days = mean(IMPYDAYS, na.rm = TRUE),
n = n()
) %>%
arrange(factor(Type_of_Drinker))
## # A tibble: 4 × 3
## Type_of_Drinker mean_impairment_days n
## <chr> <dbl> <int>
## 1 Frequent 93.1 52
## 2 Occasional 81.2 235
## 3 Regular 74.3 118
## 4 Very Frequent 120. 15
cor(AlcoholMH1$ALCYRTOT, AlcoholMH1$IMPYDAYS)
## [1] 0.04650247
cor(AlcoholMH1$ALCBNG30D, AlcoholMH1$IMPYDAYS)
## [1] 0.138859
The Correlation shows the statistical correlation between number of days drinking in the past 12 months to number of days of missed work in the past 12 months is 0.046 which is closer to 0 which means not much of a correlation. I’m surprised by this, but we’ll keep looking at the data. The second one is a little less weak, with 0.1388 for Days Binge Drinking and Days of missed work.
Let’s plot these two variables and see what that looks like.
ggplot(alcohol_new, aes(x=ALCYRTOT,y=IMPYDAYS)) + geom_point() + geom_smooth(color='red') +
labs(title="Alcohol Use and Days Unable to Work (untransformed data)",
x="Total # Days Used Alcohol in Past 12 Months",
y="# Days Unable to Work in Past 12 Months")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
This visually shows a slight increase in days unable to work as days of
alcohol use increases.
ggplot(alcohol_new, aes(x=ALCBNG30D,y=IMPYDAYS)) + geom_point() + geom_smooth(method= "lm", color='red') +
labs(title="Days Binging and Days Unable to Work (untransformed data)",
x="Total # Days Binge Drinking in Past 30 Days",
y="# Days Unable to Work in Past 12 Months")
## `geom_smooth()` using formula = 'y ~ x'
Next, I’ll create a linear model “lm()” from the variables, with a dependent variable (IMPYDAYS) as the outcome. I called it Alcoholmodel_1.
Alcoholmodel_1<-lm(IMPYDAYS~ALCYRTOT+ALCBNG30D+Type_of_Drinker,data=alcohol_new)
Next I can run the assumptions (just like in Homework 7)
Check the following assumptions: a) Linearity (plot and raintest)
plot(Alcoholmodel_1,which=1)
raintest(Alcoholmodel_1)
##
## Rainbow test
##
## data: Alcoholmodel_1
## Rain = 0.98645, df1 = 210, df2 = 204, p-value = 0.5393
Above is the plot of this lm. The fitted line is not straight and therefore the data is not linear.
Next, I ran the Rainbow test which has a p-value of 0.5393 which is not low, and therefore not statistically significant. This means that the # of total days drinking (ALCYRTOT) and the # of days binging (ALCBNG30D) and Type of Drinker do not have statistically significant effect on # of days missed work (IMPYDAYS)
I do want to try it just with the ALCYRTOT and ALCBNG30D and see if that’s different.
Create a new linear model with only three variables:
Alcoholmodel_2<-lm(IMPYDAYS~ALCYRTOT+ALCBNG30D,data=alcohol_new)
plot(Alcoholmodel_2,which=1)
raintest(Alcoholmodel_2)
##
## Rainbow test
##
## data: Alcoholmodel_2
## Rain = 0.98924, df1 = 210, df2 = 207, p-value = 0.5312
This does not look any better, let me try with just ALCBNG30D and IMPYDAYS
Alcoholmodel_3<-lm(IMPYDAYS~ALCBNG30D,data=alcohol_new)
plot(Alcoholmodel_3,which=1)
raintest(Alcoholmodel_3)
##
## Rainbow test
##
## data: Alcoholmodel_3
## Rain = 0.99424, df1 = 210, df2 = 208, p-value = 0.5167
Looks better, graph is not straight by rainbow test provides a a p-value that is likely to be linear.
I’ll continue with other assumptions testing on the linear model with all four variables (Alcoholmodel_1) b) Independence of errors (durbin-watson)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
durbinWatsonTest(Alcoholmodel_1)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.02550154 2.048397 0.626
## Alternative hypothesis: rho != 0
Next we look at the Durbin Watson test, which brings about the p-value of 0.954 which is high and means the errors are independent. Also, the DW Statistic is basically 2, which is good news.
plot(Alcoholmodel_1,which=3)
bptest(Alcoholmodel_1)
##
## studentized Breusch-Pagan test
##
## data: Alcoholmodel_1
## BP = 5.7037, df = 5, p-value = 0.3361
This plot has a fairly straight line but flows at an angle. The points are not evenly distributed around the line. This would demonstrate a violation of Homoscedasticity.
The second command for the Breusch-Pagan test shows a result of p-value of 0.1797 which means the null hypothesis cannot be rejected and the model is not homoscedastic.
plot(Alcoholmodel_1,which=2)
There is significant deviation from the dotted line, which means
residuals are not evenly distributed.
shapiro.test(Alcoholmodel_1$residuals)
##
## Shapiro-Wilk normality test
##
## data: Alcoholmodel_1$residuals
## W = 0.79672, p-value < 2.2e-16
The p-value is WAY WAY below 0.05, which means that the residuals are not normal.
vif(Alcoholmodel_1)
## GVIF Df GVIF^(1/(2*Df))
## ALCYRTOT 12.766232 1 3.572986
## ALCBNG30D 1.408807 1 1.186932
## Type_of_Drinker 12.688377 3 1.527218
All three variables are close to 1.00. meaning they are NOT related to another variable.
And I’ll run the correlation again.
cor(AlcoholMH1)
## ALCYRTOT ALCBNG30D IMPGOUTM IMPYDAYS
## ALCYRTOT 1.000000000 0.53162182 -0.009641016 0.04650247
## ALCBNG30D 0.531621816 1.00000000 -0.067479458 0.13885904
## IMPGOUTM -0.009641016 -0.06747946 1.000000000 -0.27127344
## IMPYDAYS 0.046502472 0.13885904 -0.271273440 1.00000000
Does this model (Alcoholmodel_1) meet the assumptions for linear model? NO!
Because:
Linearity (plot and raintest) - MAYBE i.e. line is not straigt but p-value is above 0.50 at 0.67.
Independence of errors (durbin-watson) - YES MEETS has p-value of 0.676 which is high and means the errors are independent and the DW Statistic is basically 2.
Homoscedasticity (plot, bptest) - DOES NOT MEET Breusch-Pagan test shows a result of p-value of 0.1797 and the line is not straight.
Normality of residuals (QQ plot, shapiro test) - DOES NOT MEET THIS ASSUMPTION. Graph is not in line and the Shapiro-Wilk normality test of p-value < 2.2e-16.
No multicolinarity (VIF, cor) - YES MEETS, since they are not correlated.
I cannot use a log transformation since there are legitimate zeroes as values in these variable. I must try square root as a transformation here. I will only square root the dependent variable, and will leave the independent variables as they are.
Alcohol_multiple<-lm(sqrt(IMPYDAYS)~(ALCYRTOT)+(ALCBNG30D),data=alcohol_new)
summary(Alcohol_multiple)
##
## Call:
## lm(formula = sqrt(IMPYDAYS) ~ (ALCYRTOT) + (ALCBNG30D), data = alcohol_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.638 -5.016 -1.256 3.876 12.915
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.617408 0.398679 16.598 <2e-16 ***
## ALCYRTOT -0.002972 0.003871 -0.768 0.4431
## ALCBNG30D 0.181917 0.070938 2.564 0.0107 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.941 on 417 degrees of freedom
## Multiple R-squared: 0.01667, Adjusted R-squared: 0.01196
## F-statistic: 3.535 on 2 and 417 DF, p-value: 0.03002
This looks much better. Now I’ll check the assumptions.
plot(Alcohol_multiple,which=1)
raintest(Alcohol_multiple)
##
## Rainbow test
##
## data: Alcohol_multiple
## Rain = 0.98626, df1 = 210, df2 = 207, p-value = 0.5399
This new data set (Alcohol_multiple) is looking better! This line is a little more straight. Raintest is still p-value of 0.5399. Let’s run more assumptions here.
durbinWatsonTest(Alcohol_multiple)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.03491551 2.066002 0.488
## Alternative hypothesis: rho != 0
plot(Alcohol_multiple,which=3)
bptest(Alcohol_multiple)
##
## studentized Breusch-Pagan test
##
## data: Alcohol_multiple
## BP = 1.9381, df = 2, p-value = 0.3795
Normality of Residuals
plot(Alcohol_multiple,which=2)
shapiro.test(Alcohol_multiple$residuals)
##
## Shapiro-Wilk normality test
##
## data: Alcohol_multiple$residuals
## W = 0.92193, p-value = 5.847e-14
vif(Alcohol_multiple)
## ALCYRTOT ALCBNG30D
## 1.393965 1.393965
VIF shows both variables close to 0 which means they are not strongly correlated to other variables.
COR can’t run on a model, but can run it on the dataframe which it was above.
This looks like a model that now meets the assumptions. Let’s look at summary again.
summary(Alcohol_multiple)
##
## Call:
## lm(formula = sqrt(IMPYDAYS) ~ (ALCYRTOT) + (ALCBNG30D), data = alcohol_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.638 -5.016 -1.256 3.876 12.915
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.617408 0.398679 16.598 <2e-16 ***
## ALCYRTOT -0.002972 0.003871 -0.768 0.4431
## ALCBNG30D 0.181917 0.070938 2.564 0.0107 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.941 on 417 degrees of freedom
## Multiple R-squared: 0.01667, Adjusted R-squared: 0.01196
## F-statistic: 3.535 on 2 and 417 DF, p-value: 0.03002
So number of days in the month of binge drinking is significant in relation to number of days of missed work. The number of days total drinking alcohol is not significant. Adjusted R Squared tells me the transformed number of days drinking (ALCYRTOT+ALCBNG30D) explains 1% of IMPYDAYS.