STEP 1 - load the data set from 2023 National Survey on Drug Use and Health (NSDUH) into R Studio.
library(readxl)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(pastecs)
##
## Attaching package: 'pastecs'
##
## The following objects are masked from 'package:dplyr':
##
## first, last
##
## The following object is masked from 'package:tidyr':
##
## extract
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
load('NSDUH_2023.Rdata')
let’s review this data set before narrowing the selection by using STR
str(puf2023_102124)
## tibble [56,705 × 2,636] (S3: tbl_df/tbl/data.frame)
## $ QUESTID2 : num [1:56705] 1e+07 1e+07 1e+07 1e+07 1e+07 ...
## ..- attr(*, "label")= chr "RESPONDENT IDENTIFICATION"
## $ FILEDATE : chr [1:56705] "10/21/2024" "10/21/2024" "10/21/2024" "10/21/2024" ...
## ..- attr(*, "label")= chr "CREATION DATE OF THE DATA FILE"
## $ ANALWT2_C : num [1:56705] 3276 15630 4018 10712 8195 ...
## ..- attr(*, "label")= chr "FIN PRSN-LEVEL SMPLE WGHT 2"
## $ VESTR_C : num [1:56705] 40031 40021 40043 40030 40023 ...
## ..- attr(*, "label")= chr "VARIANCE STRATUM"
## $ VEREP : num [1:56705] 2 2 1 2 2 1 1 1 2 1 ...
## ..- attr(*, "label")= chr "VARIANCE PRIMARY SAMPLING UNIT"
## ..- attr(*, "format.sas")= chr "Z"
## $ PDEN10 : num [1:56705] 2 2 2 2 2 2 1 2 1 2 ...
## ..- attr(*, "label")= chr "POPULATION DENSITY 2010 - THREE LEVELS"
## ..- attr(*, "format.sas")= chr "PDEN10FMT"
## $ COUTYP4 : num [1:56705] 2 3 2 2 2 2 1 2 1 2 ...
## ..- attr(*, "label")= chr "COUNTY METRO/NONMETRO STATUS (2013 3-LEVEL)"
## ..- attr(*, "format.sas")= chr "COUTYP4FMT"
## $ MAIIN102 : num [1:56705] 2 2 2 2 2 2 2 2 2 2 ...
## ..- attr(*, "label")= chr "MAJORITY AMER INDIAN AREA INDICATOR FOR SEGMENT"
## ..- attr(*, "format.sas")= chr "MAIIN102FMT"
## $ AIIND102 : num [1:56705] 2 2 2 2 2 2 2 2 2 2 ...
## ..- attr(*, "label")= chr "AMER INDIAN AREA INDICATOR"
## ..- attr(*, "format.sas")= chr "AIIND102FMT"
## $ AGE3 : num [1:56705] 10 9 9 1 10 4 8 11 8 2 ...
## ..- attr(*, "label")= chr "RECODE - FINAL EDITED AGE"
## ..- attr(*, "format.sas")= chr "AGE3FMT"
## $ SERVICE : num [1:56705] 2 1 2 99 2 2 2 2 2 99 ...
## ..- attr(*, "label")= chr "EVER BEEN IN THE US ARMED FORCES"
## ..- attr(*, "format.sas")= chr "SERVICEFMT"
## $ MILSTAT : num [1:56705] 99 3 99 99 99 99 99 99 99 99 ...
## ..- attr(*, "label")= chr "CURRENT MILITARY STATUS"
## ..- attr(*, "format.sas")= chr "MILSTATFMT"
## $ ACTDEVER : num [1:56705] 99 1 99 99 99 99 99 99 99 99 ...
## ..- attr(*, "label")= chr "EVER ON ACTIVE DUTY IN US MILITARY/RESERV"
## ..- attr(*, "format.sas")= chr "ACTDEVERFMT"
## $ ACTD2001 : num [1:56705] 99 1 99 99 99 99 99 99 99 99 ...
## ..- attr(*, "label")= chr "SEPTEMBER 2001 OR LATER"
## ..- attr(*, "format.sas")= chr "ACTD2001FMT"
## $ ACTD9001 : num [1:56705] 99 2 99 99 99 99 99 99 99 99 ...
## ..- attr(*, "label")= chr "AUG 1990 TO AUG 2001 (INCLUDING PERSIAN GULF WAR)"
## ..- attr(*, "format.sas")= chr "ACTD9001FMT"
## $ ACTD7590 : num [1:56705] 99 2 99 99 99 99 99 99 99 99 ...
## ..- attr(*, "label")= chr "MAY 1975 TO JULY 1990"
## ..- attr(*, "format.sas")= chr "ACTD7590FMT"
## $ ACTDVIET : num [1:56705] 99 2 99 99 99 99 99 99 99 99 ...
## ..- attr(*, "label")= chr "VIETNAM ERA (MARCH 1961 TO APRIL 1975)"
## ..- attr(*, "format.sas")= chr "ACTDVIETFMT"
## $ ACTDPRIV : num [1:56705] 99 2 99 99 99 99 99 99 99 99 ...
## ..- attr(*, "label")= chr "ANY TIME BEFORE VIETNAM ERA (MARCH 1961)"
## ..- attr(*, "format.sas")= chr "ACTDPRIVFMT"
## $ COMBATPY : num [1:56705] 99 2 99 99 99 99 99 99 99 99 ...
## ..- attr(*, "label")= chr "EVER IN COMBAT ZONE ON ACTIVE DUTY"
## ..- attr(*, "format.sas")= chr "COMBATPYFMT"
## $ NOMARR2 : num [1:56705] 1 99 2 99 1 99 99 99 99 99 ...
## ..- attr(*, "label")= chr "NUMBER OF TIMES MARRIED"
## ..- attr(*, "format.sas")= chr "NOMARR2FMT"
## $ HEALTH : num [1:56705] 2 2 3 3 1 3 3 2 1 5 ...
## ..- attr(*, "label")= chr "OVERALL HEALTH"
## ..- attr(*, "format.sas")= chr "HEALTHFMT"
## $ MOVSINPYR2 : num [1:56705] 0 0 0 0 0 0 0 0 0 1 ...
## ..- attr(*, "label")= chr "# TIMES MOVED PAST 12 MONTHS- RECODED"
## ..- attr(*, "format.sas")= chr "MOVSINPYR2FMT"
## $ SEXATRACT2 : num [1:56705] 1 2 2 1 1 1 1 1 1 2 ...
## ..- attr(*, "label")= chr "SEXUAL ATTRACTION"
## ..- attr(*, "format.sas")= chr "SEXATRACT2FMT"
## $ SEXIDENT22 : num [1:56705] 1 3 1 1 1 1 1 1 1 3 ...
## ..- attr(*, "label")= chr "SEXUAL IDENTITY"
## ..- attr(*, "format.sas")= chr "SEXIDENT22FMT"
## $ SPEAKENGL : num [1:56705] 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "label")= chr "HOW WELL SPEAK ENGLISH"
## ..- attr(*, "format.sas")= chr "SPEAKENGLFMT"
## $ LVLDIFSEE2 : num [1:56705] 1 1 2 1 1 1 1 1 2 1 ...
## ..- attr(*, "label")= chr "LEVEL OF DIFFICULTY SEEING EVEN WITH GLASSES"
## ..- attr(*, "format.sas")= chr "LVLDIFSEE2FMT"
## $ LVLDIFHEAR2 : num [1:56705] 1 1 3 1 1 2 2 1 1 1 ...
## ..- attr(*, "label")= chr "LEVEL OF DIFFICULTY HEARING EVEN WITH HEARING AIDS"
## ..- attr(*, "format.sas")= chr "LVLDIFHEAR2FMT"
## $ LVLDIFWALK2 : num [1:56705] 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "label")= chr "LEVEL OF DIFFICULTY WALKING OR CLIMBING STEPS"
## ..- attr(*, "format.sas")= chr "LVLDIFWALK2FMT"
## $ LVLDIFMEM2 : num [1:56705] 1 2 2 1 1 1 1 1 1 1 ...
## ..- attr(*, "label")= chr "LEVEL OF DIFFICULTY REMEMBERING OR CONCENTRATING"
## ..- attr(*, "format.sas")= chr "LVLDIFMEM2FMT"
## $ LVLDIFCARE2 : num [1:56705] 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "label")= chr "LEVEL OF DIFFICULTY WITH SELF-CARE"
## ..- attr(*, "format.sas")= chr "LVLDIFCARE2FMT"
## $ LVLDIFCOMM2 : num [1:56705] 1 1 1 1 1 1 1 1 1 3 ...
## ..- attr(*, "label")= chr "LEVEL OF DIFFICULTY COMMUNICATING USING YOUR USUAL LANGUAGE"
## ..- attr(*, "format.sas")= chr "LVLDIFCOMM2FMT"
## $ IRSEX : num [1:56705] 1 1 2 1 1 1 1 1 2 2 ...
## ..- attr(*, "label")= chr "SEX AT BIRTH - IMPUTATION REVISED"
## ..- attr(*, "format.sas")= chr "IRSEXFMT"
## $ IRMARIT : num [1:56705] 1 4 1 99 1 4 4 4 4 4 ...
## ..- attr(*, "label")= chr "IMPUTATION REVISED MARITAL STATUS"
## ..- attr(*, "format.sas")= chr "IRMARITFMT"
## $ IIMARIT : num [1:56705] 1 1 1 9 1 1 1 1 1 1 ...
## ..- attr(*, "label")= chr "IMPUTATION INDICATOR MARITAL STATUS"
## ..- attr(*, "format.sas")= chr "IIMARITFMT"
## $ IREDUHIGHST2: num [1:56705] 9 8 8 2 11 8 11 11 11 5 ...
## ..- attr(*, "label")= chr "EDUCATION - RECODED IMPUTATION REVISED"
## ..- attr(*, "format.sas")= chr "IREDUHIGHST2FMT"
## $ IIEDUHIGHST2: num [1:56705] 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "label")= chr "EDUCATION INDICATOR - IMPUTATION INDICATOR"
## ..- attr(*, "format.sas")= chr "IIEDUHIGHST2FMT"
## $ CATAGE : num [1:56705] 4 4 4 1 4 2 3 4 3 1 ...
## ..- attr(*, "label")= chr "RC-AGE CATEGORY"
## ..- attr(*, "format.sas")= chr "CATAGEFMT"
## $ CATAG2 : num [1:56705] 3 3 3 1 3 2 3 3 3 1 ...
## ..- attr(*, "label")= chr "RC-AGE CATEGORY RECODE (3 LEVELS)"
## ..- attr(*, "format.sas")= chr "CATAG2FMT"
## $ CATAG3 : num [1:56705] 5 4 4 1 5 2 3 5 3 1 ...
## ..- attr(*, "label")= chr "RC-AGE CATEGORY RECODE (5 LEVELS)"
## ..- attr(*, "format.sas")= chr "CATAG3FMT"
## $ CATAG6 : num [1:56705] 5 4 4 1 5 2 3 6 3 1 ...
## ..- attr(*, "label")= chr "RC-AGE CATEGORY RECODE (6 LEVELS)"
## ..- attr(*, "format.sas")= chr "CATAG6FMT"
## $ CATAG7 : num [1:56705] 7 7 7 1 7 4 6 7 6 2 ...
## ..- attr(*, "label")= chr "RC-AGE CATEGORY RECODE (7 LEVELS)"
## ..- attr(*, "format.sas")= chr "CATAG7FMT"
## $ PREGAGE2 : num [1:56705] 4 3 4 4 4 2 3 4 3 1 ...
## ..- attr(*, "label")= chr "RC-PREGNANCY AGE CATEGORY RECODE"
## ..- attr(*, "format.sas")= chr "PREGAGE2FMT"
## $ DRVINAGE : num [1:56705] 2 2 2 3 2 1 2 2 2 3 ...
## ..- attr(*, "label")= chr "RC-DUI AGE CATEGORY RECODE"
## ..- attr(*, "format.sas")= chr "DRVINAGEFMT"
## $ DRVINDETAG : num [1:56705] 3 3 3 4 3 1 3 3 3 4 ...
## ..- attr(*, "label")= chr "RC-DUI DETAILED AGE CATEGORY RECODE"
## ..- attr(*, "format.sas")= chr "DRVINDETAGFMT"
## $ SEXAGE : num [1:56705] 5 5 5 1 5 3 5 5 5 2 ...
## ..- attr(*, "label")= chr "RC-COMBINED GENDER BY AGE CATEGORY INDICATOR"
## ..- attr(*, "format.sas")= chr "SEXAGEFMT"
## $ NEWRACE2 : num [1:56705] 7 1 4 7 6 1 1 1 5 1 ...
## ..- attr(*, "label")= chr "RC-RACE/HISPANICITY RECODE (7 LEVELS)"
## ..- attr(*, "format.sas")= chr "NEWRACE2FMT"
## $ SEXRACE : num [1:56705] 5 1 7 5 7 1 1 1 7 2 ...
## ..- attr(*, "label")= chr "RC-COMBINED GENDER BY RACE INDICATOR"
## ..- attr(*, "format.sas")= chr "SEXRACEFMT"
## $ EDUHIGHCAT : num [1:56705] 3 2 2 5 4 2 4 4 4 5 ...
## ..- attr(*, "label")= chr "RC-EDUCATION CATEGORIES"
## ..- attr(*, "format.sas")= chr "EDUHIGHCATFMT"
## $ HEALTH2 : num [1:56705] 2 2 3 3 1 3 3 2 1 4 ...
## ..- attr(*, "label")= chr "RC-OVERALL HEALTH RECODE"
## ..- attr(*, "format.sas")= chr "HEALTH2FMT"
## $ EDUSCHLGO : num [1:56705] 2 2 2 1 2 2 2 2 2 1 ...
## ..- attr(*, "label")= chr "NOW GOING TO SCHOOL"
## ..- attr(*, "format.sas")= chr "EDUSCHLGOFMT"
## $ EDUSCHGRD2 : num [1:56705] 99 99 99 3 99 99 99 99 99 6 ...
## ..- attr(*, "label")= chr "WHAT GRADE IN NOW/WILL BE IN"
## ..- attr(*, "format.sas")= chr "EDUSCHGRD2FMT"
## $ EDUFULPAR : num [1:56705] 99 99 99 1 99 99 99 99 99 1 ...
## ..- attr(*, "label")= chr "FULL OR PART TIME"
## ..- attr(*, "format.sas")= chr "EDUFULPARFMT"
## $ EDUSCKMON : num [1:56705] 99 99 99 4 99 99 99 99 99 0 ...
## ..- attr(*, "label")= chr "HOW MANY DAYS MISSED SCHOOL FROM SICK"
## ..- attr(*, "format.sas")= chr "EDUSCKMONFMT"
## $ EDUSCKEST : num [1:56705] 99 99 99 99 99 99 99 99 99 99 ...
## ..- attr(*, "label")= chr "BEST ESTIMATE # DAYS MISSED BECAUSE SICK OR INURED"
## ..- attr(*, "format.sas")= chr "EDUSCKESTFMT"
## $ EDUSCKCOM : num [1:56705] 99 99 99 4 99 99 99 99 99 0 ...
## ..- attr(*, "label")= chr "RC - HOW MANY DAYS MISSED SCHOOL FROM SICK (COMBINED)"
## ..- attr(*, "format.sas")= chr "EDUSCKCOMFMT"
## $ EDUSKPMON : num [1:56705] 99 99 99 0 99 99 99 99 99 0 ...
## ..- attr(*, "label")= chr "HOW MANY DAYS MISSED SCHOOL FROM SKIPPING"
## ..- attr(*, "format.sas")= chr "EDUSKPMONFMT"
## $ EDUSKPEST : num [1:56705] 99 99 99 99 99 99 99 99 99 99 ...
## ..- attr(*, "label")= chr "BEST ESTIMATE # DAYS MISSED BECAUSE SKIPPED OR CUT"
## ..- attr(*, "format.sas")= chr "EDUSKPESTFMT"
## $ EDUSKPCOM : num [1:56705] 99 99 99 0 99 99 99 99 99 0 ...
## ..- attr(*, "label")= chr "RC - HOW MANY DAYS MISSED SCHOOL FROM SKIPPING (COMBINED)"
## ..- attr(*, "format.sas")= chr "EDUSKPCOMFMT"
## $ MILTFAMLY : num [1:56705] 2 1 2 1 2 2 2 2 2 2 ...
## ..- attr(*, "label")= chr "ANY IMMEDIATE FAMILY CURRENTLY IN US MILITARY"
## ..- attr(*, "format.sas")= chr "MILTFAMLYFMT"
## $ MILTSPPAR : num [1:56705] 99 2 99 97 99 99 99 99 99 99 ...
## ..- attr(*, "label")= chr "SPOUSE OR PARTNER IN MILITARY"
## ..- attr(*, "format.sas")= chr "MILTSPPARFMT"
## $ MILTPARNT : num [1:56705] 99 2 99 97 99 99 99 99 99 99 ...
## ..- attr(*, "label")= chr "FATHER OR MOTHER IN MILITARY"
## ..- attr(*, "format.sas")= chr "MILTPARNTFMT"
## $ MILTCHLDR : num [1:56705] 99 2 99 97 99 99 99 99 99 99 ...
## ..- attr(*, "label")= chr "SON OR DAUGHTER IN MILITARY"
## ..- attr(*, "format.sas")= chr "MILTCHLDRFMT"
## $ MILTSIBLN : num [1:56705] 99 1 99 97 99 99 99 99 99 99 ...
## ..- attr(*, "label")= chr "SISTER OR BROTHER IN MILITARY"
## ..- attr(*, "format.sas")= chr "MILTSIBLNFMT"
## $ ENRLCOLLFT2 : num [1:56705] 3 3 3 3 3 2 3 3 3 3 ...
## ..- attr(*, "label")= chr "RC-COLLEGE ENROLLMENT-FULL TIME"
## ..- attr(*, "format.sas")= chr "ENRLCOLLFT2FMT"
## $ ENRLCOLLST2 : num [1:56705] 5 5 5 5 5 3 5 5 5 5 ...
## ..- attr(*, "label")= chr "RC-COLLEGE ENROLLMENT"
## ..- attr(*, "format.sas")= chr "ENRLCOLLST2FMT"
## $ WRKSTATWK2 : num [1:56705] 8 1 9 99 1 3 1 2 1 9 ...
## ..- attr(*, "label")= chr "WORK SITUATION IN PAST WEEK - RECODE"
## ..- attr(*, "format.sas")= chr "WRKSTATWK2FMT"
## $ WRKDPSTWK : num [1:56705] 2 1 2 99 1 2 1 1 1 2 ...
## ..- attr(*, "label")= chr "WORK AT JOB LAST WEEK"
## ..- attr(*, "format.sas")= chr "WRKDPSTWKFMT"
## $ WRKHADJOB : num [1:56705] 2 99 2 99 99 1 99 99 99 2 ...
## ..- attr(*, "label")= chr "DID YOU HAVE A JOB OR BUSINESS"
## ..- attr(*, "format.sas")= chr "WRKHADJOBFMT"
## $ WRKDHRSWK2 : num [1:56705] 999 50 999 999 40 999 40 16 15 999 ...
## ..- attr(*, "label")= chr "HOW MANY HOURS WORKED LAST WEEK"
## ..- attr(*, "format.sas")= chr "WRKDHRSWK2FMT"
## $ WRK35WKUS : num [1:56705] 99 1 99 99 1 1 1 2 1 99 ...
## ..- attr(*, "label")= chr "USUALLY WORK 35 OR MORE HRS PER WEEK"
## ..- attr(*, "format.sas")= chr "WRK35WKUSFMT"
## $ WRKRSNNOT : num [1:56705] 999 999 999 999 999 1 999 999 999 999 ...
## ..- attr(*, "label")= chr "REASON FOR NOT WORKING AT JOB - PAST WK"
## ..- attr(*, "format.sas")= chr "WRKRSNNOTFMT"
## $ WRKRSNJOB : num [1:56705] 5 999 8 999 999 999 999 999 999 1 ...
## ..- attr(*, "label")= chr "REASON FOR NOT HAVING A JOB - PAST WK"
## ..- attr(*, "format.sas")= chr "WRKRSNJOBFMT"
## $ WRKEFFORT : num [1:56705] 99 99 99 99 99 99 99 99 99 2 ...
## ..- attr(*, "label")= chr "PAST 30 DAYS, MAKE EFFORT TO FIND WORK"
## ..- attr(*, "format.sas")= chr "WRKEFFORTFMT"
## $ WRKDPSTYR : num [1:56705] 2 99 2 99 99 99 99 99 99 2 ...
## ..- attr(*, "label")= chr "PAST 12 MOS, WORKED AT ANY JOB"
## ..- attr(*, "format.sas")= chr "WRKDPSTYRFMT"
## $ WRKSELFEM : num [1:56705] 99 1 99 99 1 2 2 2 1 99 ...
## ..- attr(*, "label")= chr "PAST 12 MOS, SELF EMPLOYED"
## ..- attr(*, "format.sas")= chr "WRKSELFEMFMT"
## $ WRKNUMJOB2 : num [1:56705] 99 1 99 99 1 3 4 1 1 99 ...
## ..- attr(*, "label")= chr "PAST 12 MOS, HOW MANY EMPLOYERS"
## ..- attr(*, "format.sas")= chr "WRKNUMJOB2FMT"
## $ WRKNJBPYR : num [1:56705] 99 2 99 99 2 1 1 2 2 99 ...
## ..- attr(*, "label")= chr "PAST 12 MOS, TIME WITH NO JOB"
## ..- attr(*, "format.sas")= chr "WRKNJBPYRFMT"
## $ WRKNJBWKS : num [1:56705] 99 99 99 99 99 7 2 99 99 99 ...
## ..- attr(*, "label")= chr "HOW MANY WKS W/O JOB IN PAST 12 MOS"
## ..- attr(*, "format.sas")= chr "WRKNJBWKSFMT"
## $ WRKLASTYR2 : num [1:56705] 2017 9999 9991 9999 9999 ...
## ..- attr(*, "label")= chr "WHAT YEAR LAST WORKED"
## ..- attr(*, "format.sas")= chr "WRKLASTYR2FMT"
## $ WRKSICKMO : num [1:56705] 99 0 99 99 0 0 0 0 0 99 ...
## ..- attr(*, "label")= chr "# DAYS MISSED WORK FOR INJURY/ILLNESS PAST 30 DAYS"
## ..- attr(*, "format.sas")= chr "WRKSICKMOFMT"
## $ WRKSKIPMO : num [1:56705] 99 0 99 99 0 0 0 0 0 99 ...
## ..- attr(*, "label")= chr "# DAYS SKIPPED WORK PAST 30 DAYS"
## ..- attr(*, "format.sas")= chr "WRKSKIPMOFMT"
## $ WRKDRGPOL : num [1:56705] 99 2 99 99 2 1 1 1 2 99 ...
## ..- attr(*, "label")= chr "WORKPLACE HAVE WRITTEN POLICY DRUG/ALCOHOL USE"
## ..- attr(*, "format.sas")= chr "WRKDRGPOLFMT"
## $ WRKDRGALB : num [1:56705] 99 99 99 99 99 3 3 3 99 99 ...
## ..- attr(*, "label")= chr "DID POLICY COVER ALCOHOL, DRUGS, OR BOTH"
## ..- attr(*, "format.sas")= chr "WRKDRGALBFMT"
## $ WRKDRGEDU : num [1:56705] 99 2 99 99 2 1 2 2 2 99 ...
## ..- attr(*, "label")= chr "AT WORK, GIVEN EDUCATION ON DRUGS/ALC"
## ..- attr(*, "format.sas")= chr "WRKDRGEDUFMT"
## $ WRKDRGHLP : num [1:56705] 99 2 99 99 2 2 2 1 2 99 ...
## ..- attr(*, "label")= chr "ANY ASSISTANCE PROGRAM OFFERED THROUGH WRK"
## ..- attr(*, "format.sas")= chr "WRKDRGHLPFMT"
## $ WRKTSTALC : num [1:56705] 99 2 99 99 2 1 2 2 2 99 ...
## ..- attr(*, "label")= chr "WORKPLACE TESTS FOR ALCOHOL USAGE"
## ..- attr(*, "format.sas")= chr "WRKTSTALCFMT"
## $ WRKTSTDRG : num [1:56705] 99 2 99 99 2 2 2 94 2 99 ...
## ..- attr(*, "label")= chr "WORKPLACE TESTS FOR DRUG USAGE"
## ..- attr(*, "format.sas")= chr "WRKTSTDRGFMT"
## $ WRKTSTHIR : num [1:56705] 99 99 99 99 99 1 99 98 99 99 ...
## ..- attr(*, "label")= chr "TEST FOR DRUG/ALC AS HIRING PROCESS"
## ..- attr(*, "format.sas")= chr "WRKTSTHIRFMT"
## $ WRKTSTRDM : num [1:56705] 99 99 99 99 99 1 99 98 99 99 ...
## ..- attr(*, "label")= chr "TEST ON RANDOM BASIS"
## ..- attr(*, "format.sas")= chr "WRKTSTRDMFMT"
## $ WRKTST1ST : num [1:56705] 99 99 99 99 99 2 99 98 99 99 ...
## ..- attr(*, "label")= chr "WHAT HAPPENS FIRST TIME CAUGHT"
## ..- attr(*, "format.sas")= chr "WRKTST1STFMT"
## $ WRKOKPREH : num [1:56705] 99 3 99 99 3 2 2 3 3 99 ...
## ..- attr(*, "label")= chr "WOULD YOU WORK FOR EMP DOES DRUG TEST PRE-HIRE"
## ..- attr(*, "format.sas")= chr "WRKOKPREHFMT"
## $ WRKOKRAND : num [1:56705] 99 3 99 99 2 2 2 3 3 99 ...
## ..- attr(*, "label")= chr "WOULD YOU WORK FOR EMP DOES DRUG/ALC TEST RANDOMLY"
## ..- attr(*, "format.sas")= chr "WRKOKRANDFMT"
## $ IRWRKSTAT : num [1:56705] 4 1 4 99 1 1 1 2 1 4 ...
## ..- attr(*, "label")= chr "EMPLOYMENT STATUS - IMPUTATION REVISED"
## ..- attr(*, "format.sas")= chr "IRWRKSTATFMT"
## $ IIWRKSTAT : num [1:56705] 1 1 1 9 1 1 1 1 1 1 ...
## ..- attr(*, "label")= chr "EMPLOYMENT STATUS - IMPUTATION INDICATOR"
## ..- attr(*, "format.sas")= chr "IIWRKSTATFMT"
## $ II2WRKSTAT : num [1:56705] 1 1 1 9 1 1 1 1 1 1 ...
## ..- attr(*, "label")= chr "EMPLOYMENT STATUS - DETAILED IMPUTATION INDICATOR"
## ..- attr(*, "format.sas")= chr "II2WRKSTATFMT"
## $ IRWRKSTAT18 : num [1:56705] 4 1 4 99 1 1 1 2 1 99 ...
## ..- attr(*, "label")= chr "EMPLOYMENT STATUS 18+ - IMPUTATION REVISED"
## ..- attr(*, "format.sas")= chr "IRWRKSTAT18FMT"
## $ IIWRKSTAT18 : num [1:56705] 1 1 1 9 1 1 1 1 1 9 ...
## ..- attr(*, "label")= chr "EMPLOYMENT STATUS 18+ - IMPUTATION INDICATOR"
## ..- attr(*, "format.sas")= chr "IIWRKSTAT18FMT"
## $ II2WRKST18 : num [1:56705] 1 1 1 9 1 1 1 1 1 9 ...
## ..- attr(*, "label")= chr "EMPLOYMENT STATUS 18+ - DETAILED IMPUTATION INDICATOR"
## ..- attr(*, "format.sas")= chr "II2WRKST18FMT"
## $ EDFAM18 : num [1:56705] 0 0 0 0 0 0 1 1 0 0 ...
## ..- attr(*, "label")= chr "RC-EDITED INDICATOR: FAMILY IN HH 18 OR OLDER"
## ..- attr(*, "format.sas")= chr "EDFAM18FMT"
## [list output truncated]
STEP 2 - create smaller data set. STEP 2a - select just the four variables: ALCYRTOT,ALCBNG30D, IMPGOUTM, IMPYDAYS
select <- dplyr::select
AlcoholMHlarge<-select(puf2023_102124, ALCYRTOT,ALCBNG30D,IMPGOUTM,IMPYDAYS)
summary(AlcoholMHlarge)
## ALCYRTOT ALCBNG30D IMPGOUTM IMPYDAYS
## Min. : 1.0 Min. : 0.00 Min. : 1.00 Min. : 0.0
## 1st Qu.: 36.0 1st Qu.: 1.00 1st Qu.:99.00 1st Qu.: 1.0
## Median :200.0 Median :91.00 Median :99.00 Median :999.0
## Mean :466.2 Mean :54.14 Mean :97.03 Mean :554.4
## 3rd Qu.:991.0 3rd Qu.:93.00 3rd Qu.:99.00 3rd Qu.:999.0
## Max. :998.0 Max. :98.00 Max. :99.00 Max. :999.0
str(AlcoholMHlarge)
## tibble [56,705 × 4] (S3: tbl_df/tbl/data.frame)
## $ ALCYRTOT : num [1:56705] 342 84 991 991 24 36 180 993 24 991 ...
## ..- attr(*, "label")= chr "TOTAL # OF DAYS USED ALCOHOL IN PAST 12 MOS"
## ..- attr(*, "format.sas")= chr "ALCYRTOTFMT"
## $ ALCBNG30D: num [1:56705] 7 5 91 91 0 0 1 93 0 91 ...
## ..- attr(*, "label")= chr "# DAYS HAD FOUR/FIVE OR MORE DRINKS PAST 30 DYS"
## ..- attr(*, "format.sas")= chr "ALCBNG30DFMT"
## $ IMPGOUTM : num [1:56705] 99 99 99 99 99 99 99 99 99 99 ...
## ..- attr(*, "label")= chr "EMOTIONAL PROBLEMS KEEP YOU FROM LEAVING HOUSE"
## ..- attr(*, "format.sas")= chr "IMPGOUTMFMT"
## $ IMPYDAYS : num [1:56705] 999 6 120 999 999 20 10 999 0 999 ...
## ..- attr(*, "label")= chr "HOW MANY DAY IN PAST YR YOU WERE UNABLE TO WORK"
## ..- attr(*, "format.sas")= chr "IMPYDAYSFMT"
Interesting that there are no “zero days” for ALCYRTOT or for IMPGOUTM. This means that no one of the 56,705 observations said they drank zero days in the last 12 months and no one said they had zero days that they could not leave the house. All observations drank at least one day and could not leave the house at least one day in the last 30 days.
STEP 2b - remove observations with more than 365 days for ALCYRTOT and IMPYDAYS and remove observations with more than 30 days for ALCBNG30D and IMPGOUTM.
AlcoholMH1 <- AlcoholMHlarge %>%
filter(ALCYRTOT >= 0 & ALCYRTOT <= 366 &
IMPYDAYS >= 0 & IMPYDAYS <= 366 &
ALCBNG30D >= 0 & ALCBNG30D <= 31 &
IMPGOUTM >= 0 & IMPGOUTM <= 31)
summary(AlcoholMH1)
## ALCYRTOT ALCBNG30D IMPGOUTM IMPYDAYS
## Min. : 1.00 Min. : 0.000 Min. :1.00 Min. : 0.00
## 1st Qu.: 12.00 1st Qu.: 0.000 1st Qu.:1.00 1st Qu.: 3.00
## Median : 48.00 Median : 1.000 Median :1.00 Median : 30.00
## Mean : 83.47 Mean : 2.462 Mean :1.21 Mean : 82.11
## 3rd Qu.:126.00 3rd Qu.: 2.000 3rd Qu.:1.00 3rd Qu.:120.00
## Max. :364.00 Max. :30.000 Max. :2.00 Max. :365.00
There are no N/As to remove from this data set. All four variables have values for all the remaining observations.
Let’s explore this data now…
hist(AlcoholMH1$ALCYRTOT)
hist(AlcoholMH1$IMPYDAYS)
hist(AlcoholMH1$ALCBNG30D)
I left out IMPGOUTM because it’s categorical and not continuous. All
three of these variables are extremely skewed as is.
stat.desc(AlcoholMH1$ALCYRTOT)
## x
## nbr.val 420.000000
## nbr.null 0.000000
## nbr.na 0.000000
## min 1.000000
## max 364.000000
## range 363.000000
## sum 35059.000000
## median 48.000000
## mean 83.473810
## SE.mean 4.319218
## CI.mean.0.95 8.490036
## var 7835.371627
## std.dev 88.517635
## coef.var 1.060424
stat.desc(AlcoholMH1$IMPYDAYS)
## x
## nbr.val 420.000000
## nbr.null 77.000000
## nbr.na 0.000000
## min 0.000000
## max 365.000000
## range 365.000000
## sum 34486.000000
## median 30.000000
## mean 82.109524
## SE.mean 5.317133
## CI.mean.0.95 10.451578
## var 11874.198000
## std.dev 108.968794
## coef.var 1.327115
cor(AlcoholMH1$ALCYRTOT, AlcoholMH1$IMPYDAYS)
## [1] 0.04650247
cor(AlcoholMH1)
## ALCYRTOT ALCBNG30D IMPGOUTM IMPYDAYS
## ALCYRTOT 1.000000000 0.53162182 -0.009641016 0.04650247
## ALCBNG30D 0.531621816 1.00000000 -0.067479458 0.13885904
## IMPGOUTM -0.009641016 -0.06747946 1.000000000 -0.27127344
## IMPYDAYS 0.046502472 0.13885904 -0.271273440 1.00000000
The Correlation is showing that none of these variables have a super high correlation, but the closest are # of days drinking (ALCYRTOT) and # of days binge drinking (ALCBNG30D) but that’s at 0.53 so it’s an overall positive relationship. Also, the statistical correlation between number of days drinking in the past 12 months to number of days of missed work in the past 12 months is 0.046 which is closer to 0 which means not much of a correlation. I’m surprised by this, but we’ll keep looking at the data.
Let’s plot these two variables and see what that looks like.
ggplot(AlcoholMH1, aes(x=ALCYRTOT,y=IMPYDAYS)) + geom_point() + geom_smooth(color='red') +
labs(title="Alcohol Use and Days Unable to Work (untransformed data)",
x="Total # Days Used Alcohol in Past 12 Months",
y="# Days Unable to Work in Past 12 Months")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
I’m not sure this is the correct line to add to this graph. But it
appears to show that there is an overall positive relationship between
the two variables…as days of alcohol use increase, then days of missed
work increases too.
Next, I’ll create a linear model “lm()” from the variables, with a continuous dependent variable (IMPYDAYS) as the outcome. I called it Alcoholmodel_1.
Alcoholmodel_1<-lm(IMPYDAYS~ALCYRTOT+ALCBNG30D+IMPGOUTM,data=AlcoholMH1)
Next I can run the assumptions (just like in Homework 7)
Check the following assumptions: a) Linearity (plot and raintest)
plot(Alcoholmodel_1,which=1)
raintest(Alcoholmodel_1)
##
## Rainbow test
##
## data: Alcoholmodel_1
## Rain = 0.94052, df1 = 210, df2 = 206, p-value = 0.6708
Above is the plot of this lm. The fitted line is not straight and therefore the data is not linear.
Next, I ran the Rainbow test which has a p-value of 0.6708 which is not low, and therefore not statistically significant. This means that the # of total days drinking (ALCYRTOT) and the # of days binging (ALCBNG30D) and # of days unable to leave the house (IMPGOUTM) do not have statistically significant effect on # of days missed work (IMPYDAYS)
I do want to try it just with the ALCYRTOT and ALCBNG30D and see if that’s different.
Create a new linear model with only three variables:
Alcoholmodel_2<-lm(IMPYDAYS~ALCYRTOT+ALCBNG30D,data=AlcoholMH1)
plot(Alcoholmodel_2,which=1)
raintest(Alcoholmodel_2)
##
## Rainbow test
##
## data: Alcoholmodel_2
## Rain = 0.98924, df1 = 210, df2 = 207, p-value = 0.5312
This does not look any better, let me try with just ALCBNG30D and IMPYDAYS
Alcoholmodel_3<-lm(IMPYDAYS~ALCBNG30D,data=AlcoholMH1)
plot(Alcoholmodel_3,which=1)
raintest(Alcoholmodel_3)
##
## Rainbow test
##
## data: Alcoholmodel_3
## Rain = 0.99424, df1 = 210, df2 = 208, p-value = 0.5167
Looks better, graph is not straight by rainbow test provides a a p-value that is likely to be linear.
I’ll continue with other assumptions testing on the linear model with all four variables (Alcoholmodel_1) b) Independence of errors (durbin-watson)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
durbinWatsonTest(Alcoholmodel_1)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.005573745 2.006889 0.954
## Alternative hypothesis: rho != 0
Next we look at the Durbin Watson test, which brings about the p-value of 0.954 which is high and means the errors are independent. Also, the DW Statistic is basically 2, which is good news.
plot(Alcoholmodel_1,which=3)
bptest(Alcoholmodel_1)
##
## studentized Breusch-Pagan test
##
## data: Alcoholmodel_1
## BP = 12.633, df = 3, p-value = 0.005501
This plot has a fairly straight line but flows at an angle. The points are not evenly distributed around the line. This would demonstrate a violation of Homoscedasticity.
The second command for the Breusch-Pagan test shows a result of p-value of 0.0055 which means the null hypothesis cannot be rejected and the model is not homoscedastic.
plot(Alcoholmodel_1,which=2)
There is significant deviation from the dotted line, which means
residuals are not evenly distributed.
shapiro.test(Alcoholmodel_1$residuals)
##
## Shapiro-Wilk normality test
##
## data: Alcoholmodel_1$residuals
## W = 0.83694, p-value < 2.2e-16
The p-value is WAY WAY below 0.05, which means that the residuals are not normal.
vif(Alcoholmodel_1)
## ALCYRTOT ALCBNG30D IMPGOUTM
## 1.395309 1.401562 1.005543
All three variables are close to 1.00. meaning they are NOT related to another variable.
And I’ll run the correlation again.
cor(AlcoholMH1)
## ALCYRTOT ALCBNG30D IMPGOUTM IMPYDAYS
## ALCYRTOT 1.000000000 0.53162182 -0.009641016 0.04650247
## ALCBNG30D 0.531621816 1.00000000 -0.067479458 0.13885904
## IMPGOUTM -0.009641016 -0.06747946 1.000000000 -0.27127344
## IMPYDAYS 0.046502472 0.13885904 -0.271273440 1.00000000
Does this model (Alcoholmodel_1) meet the assumptions for linear model? NO! Because: a) Linearity (plot and raintest) - MAYBE i.e. line is not straigt but p-value is above 0.50 at 0.67. b) Independence of errors (durbin-watson) - YES MEETS has p-value of 0.954 which is high and means the errors are independent and the DW Statistic is basically 2. c) Homoscedasticity (plot, bptest) - DOES NOT MEET Breusch-Pagan test shows a result of p-value of 0.0055 and the line is not straight. d) Normality of residuals (QQ plot, shapiro test) - DOES NOT MEET THIS ASSUMPTION. Graph is not in line and the Shapiro-Wilk normality test of p-value < 2.2e-16. e) No multicolinarity (VIF, cor) - YES MEETS, since they are not correlated.
I cannot use a log transformation since there are legitimate zeroes as values in these variable. I must try square root as a transformation here.
Alcohol_multiple<-lm(sqrt(IMPYDAYS)~sqrt(ALCYRTOT)+sqrt(ALCBNG30D),data=AlcoholMH1)
summary(Alcohol_multiple)
##
## Call:
## lm(formula = sqrt(IMPYDAYS) ~ sqrt(ALCYRTOT) + sqrt(ALCBNG30D),
## data = AlcoholMH1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.620 -5.211 -1.454 3.835 13.125
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.62849 0.56452 11.742 <2e-16 ***
## sqrt(ALCYRTOT) -0.05405 0.07375 -0.733 0.4640
## sqrt(ALCBNG30D) 0.61966 0.28496 2.175 0.0302 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.955 on 417 degrees of freedom
## Multiple R-squared: 0.01192, Adjusted R-squared: 0.007182
## F-statistic: 2.515 on 2 and 417 DF, p-value: 0.08205
This looks much better. Now I’ll check the assumptions.
plot(Alcohol_multiple,which=1)
raintest(Alcohol_multiple)
##
## Rainbow test
##
## data: Alcohol_multiple
## Rain = 0.98301, df1 = 210, df2 = 207, p-value = 0.5493
This new data set (Alcohol_multiple) is looking better! This line is much more straight. Raintest is still value of 0.549. Let’s run more assumptions here.
durbinWatsonTest(Alcohol_multiple)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.0289724 2.05368 0.564
## Alternative hypothesis: rho != 0
plot(Alcohol_multiple,which=3)
bptest(Alcohol_multiple)
##
## studentized Breusch-Pagan test
##
## data: Alcohol_multiple
## BP = 1.1515, df = 2, p-value = 0.5623
Normality of Residuals
plot(Alcohol_multiple,which=2)
shapiro.test(Alcohol_multiple$residuals)
##
## Shapiro-Wilk normality test
##
## data: Alcohol_multiple$residuals
## W = 0.91857, p-value = 2.725e-14
vif(Alcohol_multiple)
## sqrt(ALCYRTOT) sqrt(ALCBNG30D)
## 1.430094 1.430094
VIF shows close to 0 which means they are not strongly correlated to other variables.
COR can’t run on a model, but can run it on the dataframe which it was above.
This looks like a model that now meets the assumptiosn. Let’s look at summary again.
summary(Alcohol_multiple)
##
## Call:
## lm(formula = sqrt(IMPYDAYS) ~ sqrt(ALCYRTOT) + sqrt(ALCBNG30D),
## data = AlcoholMH1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.620 -5.211 -1.454 3.835 13.125
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.62849 0.56452 11.742 <2e-16 ***
## sqrt(ALCYRTOT) -0.05405 0.07375 -0.733 0.4640
## sqrt(ALCBNG30D) 0.61966 0.28496 2.175 0.0302 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.955 on 417 degrees of freedom
## Multiple R-squared: 0.01192, Adjusted R-squared: 0.007182
## F-statistic: 2.515 on 2 and 417 DF, p-value: 0.08205
So number of days in the month of binge drinking is significant in relation to number of days of missed work. The number of days total drinking alcohol in the last year is not significant. Adjusted R Squared tells me the transformed number of days drinking (ALCYRTOT+ALCBNG30D) explains .7% of IMPYDAYS.
DATA EXPLORATION GAPS I’m having trouble running the T-Test or create a Box plot that is useful for this. Maybe more independent variables that have a two categorical value would be helpful…could be something like employed or not employed, or experience with past mental health (yes or no)? What other ways can I explore this transformed data? It doesn’t create a dataframe, so I’m limited to run stat.desc, for example.