Statistical Programming in Health Data: HINTS 2024 Take-Home Test


Task 1: Environment Setup & Data Import

Importing the 2024 Hints Data into R

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Hints <- read_csv("C:/Users/her_n/OneDrive/Documents/MPH Program/Spring 2026/PUBH422 Statistical Planning for Health Data/Tests/Exam 2/hints7_public data_2024.csv") #importing Hints 2024 data into R
## Rows: 7278 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (15): HHID, Age, BirthSex, MaritalStatus, AgeGrpB, EducA, HHInc, TotalHo...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
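
The column-specification message above can be avoided by declaring the column types at import. The following is a minimal sketch, assuming a hypothetical three-row extract held in `csv_text` rather than the real file; reading HHID as character is an illustrative choice for an identifier, not something the assignment requires.

```r
library(readr)

# Hypothetical three-row extract; the real file has 15 columns.
csv_text <- "HHID,Age,BMI
72100001,69,26.3
72100005,62,25
72100014,-9,24"

toy <- read_csv(
  I(csv_text),                  # I() marks the string as literal data, not a path
  col_types = cols(
    HHID = col_character(),     # treat the household ID as an identifier
    .default = col_double()     # every other column is numeric
  )
)

toy
```

Declaring the types up front also makes the import fail loudly if a column ever arrives in an unexpected format.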

Printing the Hints data

Hints #printing the imported data
## # A tibble: 7,278 × 15
##       HHID   Age BirthSex MaritalStatus AgeGrpB EducA HHInc TotalHousehold   BMI
##      <dbl> <dbl>    <dbl>         <dbl>   <dbl> <dbl> <dbl>          <dbl> <dbl>
##  1  7.21e7    69        2             1       4     4     6              2  26.3
##  2  7.21e7    62        1             1       3     3     6              2  25  
##  3  7.21e7    34        1             1       1     4     5              5  24  
##  4  7.21e7    65        2             1       4     3     4              3  35.2
##  5  7.21e7    64        1             6       3     4     1              1  29  
##  6  7.21e7    64        2             1       3     4     6              2  27.3
##  7  7.21e7    26        1             6       1     4     4              1  30.7
##  8  7.21e7    85        2             4       5     2     3              2  30.3
##  9  7.21e7    32        1             6       1     3     5              1  26.1
## 10  7.21e7    -9        3             6      -9     4     1             -9  26.6
## # ℹ 7,268 more rows
## # ℹ 6 more variables: smokeStat <dbl>, RaceEthn5 <dbl>, phq4 <dbl>,
## #   Exercise <dbl>, ECigUse <dbl>, AvgDrinksPerWeek <dbl>

Displaying the Dimensions of the Hints Data

dim(Hints) #displaying the dimensions of data; 7278 observations, 15 variables
## [1] 7278   15

The data contain 7,278 observations and 15 variables.

Displaying the summary of the Hints Data

summary(Hints) #summary of the imported data 
##       HHID               Age            BirthSex       MaritalStatus   
##  Min.   :72100001   Min.   : -9.00   Min.   :-9.0000   Min.   :-9.000  
##  1st Qu.:72108592   1st Qu.: 36.00   1st Qu.: 1.0000   1st Qu.: 1.000  
##  Median :72117023   Median : 55.00   Median : 1.0000   Median : 2.000  
##  Mean   :72251562   Mean   : 50.38   Mean   : 0.7236   Mean   : 2.031  
##  3rd Qu.:72325380   3rd Qu.: 69.00   3rd Qu.: 2.0000   3rd Qu.: 4.000  
##  Max.   :72836009   Max.   :102.00   Max.   : 3.0000   Max.   : 6.000  
##     AgeGrpB           EducA            HHInc       TotalHousehold 
##  Min.   :-9.000   Min.   :-9.000   Min.   :-9.00   Min.   :-9.00  
##  1st Qu.: 2.000   1st Qu.: 2.000   1st Qu.: 1.00   1st Qu.: 1.00  
##  Median : 3.000   Median : 3.000   Median : 4.00   Median : 2.00  
##  Mean   : 2.128   Mean   : 2.335   Mean   : 2.36   Mean   : 1.56  
##  3rd Qu.: 4.000   3rd Qu.: 4.000   3rd Qu.: 6.00   3rd Qu.: 3.00  
##  Max.   : 5.000   Max.   : 4.000   Max.   : 6.00   Max.   :11.00  
##       BMI          smokeStat        RaceEthn5            phq4       
##  Min.   :-9.00   Min.   :-9.000   Min.   :-9.0000   Min.   :-9.000  
##  1st Qu.:23.20   1st Qu.: 2.000   1st Qu.: 1.0000   1st Qu.: 0.000  
##  Median :27.10   Median : 3.000   Median : 1.0000   Median : 1.000  
##  Mean   :26.44   Mean   : 1.803   Mean   : 0.7595   Mean   : 1.592  
##  3rd Qu.:31.90   3rd Qu.: 3.000   3rd Qu.: 3.0000   3rd Qu.: 3.000  
##  Max.   :66.60   Max.   : 3.000   Max.   : 5.0000   Max.   :12.000  
##     Exercise         ECigUse       AvgDrinksPerWeek
##  Min.   :  -9.0   Min.   :-9.000   Min.   :-9.000  
##  1st Qu.:   0.0   1st Qu.: 3.000   1st Qu.: 0.000  
##  Median :  90.0   Median : 3.000   Median : 0.000  
##  Mean   : 168.7   Mean   : 1.826   Mean   : 1.533  
##  3rd Qu.: 210.0   3rd Qu.: 3.000   3rd Qu.: 2.000  
##  Max.   :6300.0   Max.   : 3.000   Max.   :75.000

Displaying the first 6 rows of the Hints data

head(Hints) #displaying the first 6 rows (observations)
## # A tibble: 6 × 15
##       HHID   Age BirthSex MaritalStatus AgeGrpB EducA HHInc TotalHousehold   BMI
##      <dbl> <dbl>    <dbl>         <dbl>   <dbl> <dbl> <dbl>          <dbl> <dbl>
## 1 72100001    69        2             1       4     4     6              2  26.3
## 2 72100005    62        1             1       3     3     6              2  25  
## 3 72100014    34        1             1       1     4     5              5  24  
## 4 72100019    65        2             1       4     3     4              3  35.2
## 5 72100025    64        1             6       3     4     1              1  29  
## 6 72100026    64        2             1       3     4     6              2  27.3
## # ℹ 6 more variables: smokeStat <dbl>, RaceEthn5 <dbl>, phq4 <dbl>,
## #   Exercise <dbl>, ECigUse <dbl>, AvgDrinksPerWeek <dbl>

Task 2: Data Cleaning & Value Labeling

Identifying NAs and negative values within the dataset

Determining the structure overview of the Hints data

str(Hints) #this fn provides an overview of the structure of the imported data, showing details such as column names, data types (numeric, character, factor), and the number of rows and columns.
## spc_tbl_ [7,278 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ HHID            : num [1:7278] 72100001 72100005 72100014 72100019 72100025 ...
##  $ Age             : num [1:7278] 69 62 34 65 64 64 26 85 32 -9 ...
##  $ BirthSex        : num [1:7278] 2 1 1 2 1 2 1 2 1 3 ...
##  $ MaritalStatus   : num [1:7278] 1 1 1 1 6 1 6 4 6 6 ...
##  $ AgeGrpB         : num [1:7278] 4 3 1 4 3 3 1 5 1 -9 ...
##  $ EducA           : num [1:7278] 4 3 4 3 4 4 4 2 3 4 ...
##  $ HHInc           : num [1:7278] 6 6 5 4 1 6 4 3 5 1 ...
##  $ TotalHousehold  : num [1:7278] 2 2 5 3 1 2 1 2 1 -9 ...
##  $ BMI             : num [1:7278] 26.3 25 24 35.2 29 27.3 30.7 30.3 26.1 26.6 ...
##  $ smokeStat       : num [1:7278] 2 2 3 2 2 3 3 3 1 3 ...
##  $ RaceEthn5       : num [1:7278] 1 1 3 1 1 5 2 1 1 3 ...
##  $ phq4            : num [1:7278] 0 0 4 4 11 1 8 0 0 0 ...
##  $ Exercise        : num [1:7278] 225 180 240 120 0 150 80 90 360 0 ...
##  $ ECigUse         : num [1:7278] 3 3 3 3 1 3 3 3 1 3 ...
##  $ AvgDrinksPerWeek: num [1:7278] 12.5 15 7.5 2 12 0.5 1 4 0 -9 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   HHID = col_double(),
##   ..   Age = col_double(),
##   ..   BirthSex = col_double(),
##   ..   MaritalStatus = col_double(),
##   ..   AgeGrpB = col_double(),
##   ..   EducA = col_double(),
##   ..   HHInc = col_double(),
##   ..   TotalHousehold = col_double(),
##   ..   BMI = col_double(),
##   ..   smokeStat = col_double(),
##   ..   RaceEthn5 = col_double(),
##   ..   phq4 = col_double(),
##   ..   Exercise = col_double(),
##   ..   ECigUse = col_double(),
##   ..   AvgDrinksPerWeek = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Determining if Hints has NAs within the data

sum(is.na(Hints)) #summing the number of NAs in the dataset 
## [1] 0

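The original dataset contains no NA values. Missing responses appear to be stored as negative codes instead: in the summary above, every variable except HHID has a minimum of -9.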

Determining if Hints has any negative values

summary(Hints) #confirming which variables have negative values 
##       HHID               Age            BirthSex       MaritalStatus   
##  Min.   :72100001   Min.   : -9.00   Min.   :-9.0000   Min.   :-9.000  
##  1st Qu.:72108592   1st Qu.: 36.00   1st Qu.: 1.0000   1st Qu.: 1.000  
##  Median :72117023   Median : 55.00   Median : 1.0000   Median : 2.000  
##  Mean   :72251562   Mean   : 50.38   Mean   : 0.7236   Mean   : 2.031  
##  3rd Qu.:72325380   3rd Qu.: 69.00   3rd Qu.: 2.0000   3rd Qu.: 4.000  
##  Max.   :72836009   Max.   :102.00   Max.   : 3.0000   Max.   : 6.000  
##     AgeGrpB           EducA            HHInc       TotalHousehold 
##  Min.   :-9.000   Min.   :-9.000   Min.   :-9.00   Min.   :-9.00  
##  1st Qu.: 2.000   1st Qu.: 2.000   1st Qu.: 1.00   1st Qu.: 1.00  
##  Median : 3.000   Median : 3.000   Median : 4.00   Median : 2.00  
##  Mean   : 2.128   Mean   : 2.335   Mean   : 2.36   Mean   : 1.56  
##  3rd Qu.: 4.000   3rd Qu.: 4.000   3rd Qu.: 6.00   3rd Qu.: 3.00  
##  Max.   : 5.000   Max.   : 4.000   Max.   : 6.00   Max.   :11.00  
##       BMI          smokeStat        RaceEthn5            phq4       
##  Min.   :-9.00   Min.   :-9.000   Min.   :-9.0000   Min.   :-9.000  
##  1st Qu.:23.20   1st Qu.: 2.000   1st Qu.: 1.0000   1st Qu.: 0.000  
##  Median :27.10   Median : 3.000   Median : 1.0000   Median : 1.000  
##  Mean   :26.44   Mean   : 1.803   Mean   : 0.7595   Mean   : 1.592  
##  3rd Qu.:31.90   3rd Qu.: 3.000   3rd Qu.: 3.0000   3rd Qu.: 3.000  
##  Max.   :66.60   Max.   : 3.000   Max.   : 5.0000   Max.   :12.000  
##     Exercise         ECigUse       AvgDrinksPerWeek
##  Min.   :  -9.0   Min.   :-9.000   Min.   :-9.000  
##  1st Qu.:   0.0   1st Qu.: 3.000   1st Qu.: 0.000  
##  Median :  90.0   Median : 3.000   Median : 0.000  
##  Mean   : 168.7   Mean   : 1.826   Mean   : 1.533  
##  3rd Qu.: 210.0   3rd Qu.: 3.000   3rd Qu.: 2.000  
##  Max.   :6300.0   Max.   : 3.000   Max.   :75.000

The summary shows that every variable except HHID has a minimum of -9, so negative values are present throughout the dataset.
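
Reading minimums off `summary()` shows which variables contain negative values but not how many. A minimal sketch of a per-column count, assuming a hypothetical four-row data frame named `toy` in which -9 stands in for the negative code:

```r
# Hypothetical miniature of the data; -9 stands in for a negative code.
toy <- data.frame(
  Age  = c(69, 62, -9, 34),
  BMI  = c(26.3, -9, 24.0, 35.2),
  phq4 = c(0, 4, -9, -9)
)

# toy < 0 is a logical matrix; colSums() counts the TRUE entries per column.
neg_counts <- colSums(toy < 0)
neg_counts
```

The same call on the full data would show how many rows each variable contributes to the rows removed in the next step.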

Removing the negative values within the dataset

library(dplyr) #loading in the dplyr package to access the filter function 

Hints_clean <- Hints %>%
  filter( #removing the negative values within each variable 
    HHID > 0,
    Age > 0,
    BirthSex > 0,
    MaritalStatus > 0,
    AgeGrpB > 0,
    EducA > 0,
    HHInc > 0,
    TotalHousehold > 0,
    BMI >= 0,
    smokeStat > 0,
    RaceEthn5 > 0,
    phq4 > 0,
    Exercise >= 0,
    ECigUse > 0,
    AvgDrinksPerWeek > 0
  )

Displaying the dimension of the clean data after negative values were removed

dim(Hints_clean) #displaying the dimensions of data; 1645 observations, 15 variables
## [1] 1645   15

After filtering, the clean data has 1,645 observations and 15 variables.

Confirming the negative values were removed from the clean data

summary(Hints_clean) #confirming negative values were removed from the clean dataset
##       HHID               Age           BirthSex     MaritalStatus  
##  Min.   :72100014   Min.   :18.00   Min.   :1.000   Min.   :1.000  
##  1st Qu.:72108806   1st Qu.:35.00   1st Qu.:1.000   1st Qu.:1.000  
##  Median :72116927   Median :48.00   Median :1.000   Median :2.000  
##  Mean   :72247179   Mean   :49.47   Mean   :1.387   Mean   :2.989  
##  3rd Qu.:72325148   3rd Qu.:64.00   3rd Qu.:2.000   3rd Qu.:6.000  
##  Max.   :72836006   Max.   :93.00   Max.   :3.000   Max.   :6.000  
##     AgeGrpB          EducA           HHInc       TotalHousehold  
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   : 1.000  
##  1st Qu.:2.000   1st Qu.:3.000   1st Qu.:3.000   1st Qu.: 1.000  
##  Median :2.000   Median :4.000   Median :4.000   Median : 2.000  
##  Mean   :2.554   Mean   :3.374   Mean   :4.154   Mean   : 2.375  
##  3rd Qu.:3.000   3rd Qu.:4.000   3rd Qu.:6.000   3rd Qu.: 3.000  
##  Max.   :5.000   Max.   :4.000   Max.   :6.000   Max.   :11.000  
##       BMI          smokeStat       RaceEthn5          phq4       
##  Min.   :10.80   Min.   :1.000   Min.   :1.000   Min.   : 1.000  
##  1st Qu.:23.70   1st Qu.:2.000   1st Qu.:1.000   1st Qu.: 2.000  
##  Median :27.40   Median :3.000   Median :1.000   Median : 3.000  
##  Mean   :28.54   Mean   :2.452   Mean   :1.799   Mean   : 3.785  
##  3rd Qu.:32.00   3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.: 5.000  
##  Max.   :63.10   Max.   :3.000   Max.   :5.000   Max.   :12.000  
##     Exercise         ECigUse      AvgDrinksPerWeek
##  Min.   :   0.0   Min.   :1.000   Min.   : 0.250  
##  1st Qu.:  30.0   1st Qu.:3.000   1st Qu.: 0.750  
##  Median :  90.0   Median :3.000   Median : 2.500  
##  Mean   : 168.1   Mean   :2.685   Mean   : 5.784  
##  3rd Qu.: 210.0   3rd Qu.:3.000   3rd Qu.: 7.500  
##  Max.   :5040.0   Max.   :3.000   Max.   :75.000

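The summary confirms that no negative values remain. The minimum PHQ-4 score is now 1 and the minimum AvgDrinksPerWeek is 0.25: because the filter used strict > 0 conditions for these variables, respondents with a value of 0 were removed along with the negative codes.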
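
If zeros on variables such as phq4 or AvgDrinksPerWeek are meant to be kept, one possible alternative is a single non-negativity condition applied to every column. The sketch below is not the filter used in this report; it runs on a hypothetical five-row tibble named `toy` and contrasts the two approaches.

```r
library(dplyr)

# Hypothetical rows: two fully valid, three with a -9 code somewhere.
toy <- tibble(
  Age              = c(69, 62, -9, 34, 50),
  phq4             = c(0, 4, 1, -9, 2),
  AvgDrinksPerWeek = c(0, 2.5, 1, 0, -9)
)

# Strict conditions: also drops the valid row whose phq4 and drinks are 0.
toy_strict <- toy %>%
  filter(Age > 0, phq4 > 0, AvgDrinksPerWeek > 0)

# Non-negativity across all columns: drops only rows with a negative code.
toy_clean <- toy %>%
  filter(if_all(everything(), ~ .x >= 0))

nrow(toy_strict)
nrow(toy_clean)
```

Whether a zero is a valid response or should be excluded depends on the codebook for each variable, so the condition may need to differ by column.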

Confirming if there are any NAs within the clean data after negative values were removed

sum(is.na(Hints_clean)) #confirming if there are any NAs after removing the negatives 
## [1] 0

This further confirms that no NAs are present after the negative values were removed.

Recoding qualitative variables as factors

#Recoding the qualitative variables as factors
Hints_clean <- Hints_clean %>%
  mutate(
    BirthSex=factor(BirthSex, # each variable was recoded to have each level be matched with its corresponding label 
                    levels = c(1,2),
                    labels = c("Male","Female")),
    MaritalStatus=factor(MaritalStatus,
                         levels = c(1,2,3,4,5,6),
                         labels = c("Married","Living as Married","Divorced","Widowed","Separated","Single")),
    Education=factor(EducA,
                     levels = c(1,2,3,4),
                     labels = c("Less than High School","High School","Some College","College Graduate or More")),
    HouseholdInc=factor(HHInc,
                        levels = c(1,2,3,4,5,6),
                        labels = c("~<$20,000","~$20,000-<$35,000","~$35,000-<$50,000","~$50,000-<$75,000","~$75,000-<$100,000","~>=$100,000")),
    SmokingStatus=factor(smokeStat,
                         levels = c(1,2,3),
                         labels = c("Current","Former","Never")),
    RaceEthn=factor(RaceEthn5,
                    levels = c(1,2,3,4),
                    labels = c("Non-Hispanic White","Non-Hispanic Black/AA","Hispanic","Non-Hispanic Asian")),
    ECigUse=factor(ECigUse,
                   levels = c(1,2,3),
                   labels = c("Current","Former","Never"))
  )

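Each qualitative variable was converted to a factor whose levels carry the corresponding labels. BirthSex, MaritalStatus, and ECigUse were overwritten in place, while Education, HouseholdInc, SmokingStatus, and RaceEthn were added as new columns alongside the original numeric EducA, HHInc, smokeStat, and RaceEthn5.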

Confirming if there are any NA values after the recoding

sum(is.na(Hints_clean)) #confirming how many NAs I currently have after changing to factors and relabeling 
## [1] 77

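After recoding, the data contained 77 NAs. These arise because factor() returns NA for any value that is not listed in levels: BirthSex includes the code 3 and RaceEthn5 includes the code 5, and neither code was given a label.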
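
The way `factor()` produces NAs for unlisted codes can be seen on a small example. This is a minimal sketch with a hypothetical vector `birth_sex_codes`; the labels mirror the recoding above, and the code 3 is deliberately left out of `levels`.

```r
# Hypothetical codes; 3 has no entry in levels below.
birth_sex_codes <- c(1, 2, 3, 1)

birth_sex <- factor(birth_sex_codes,
                    levels = c(1, 2),
                    labels = c("Male", "Female"))

# The unlisted code becomes NA rather than raising an error.
sum(is.na(birth_sex))

# Cross-tabulating codes against labels shows which code was lost.
table(birth_sex_codes, birth_sex, useNA = "ifany")
```

Running `colSums(is.na(Hints_clean))` after the recoding step would show the same thing per column on the real data.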

Removing NAs from Clean data

Hints_clean <- Hints_clean %>% #removed the remaining 77 NAs from the data 
  tidyr::drop_na()

Confirming NAs and negative values were removed from clean data

sum(is.na(Hints_clean)) #confirming how many NAs are present after removing them from the data 
## [1] 0
summary(Hints_clean) #confirming the only values present within the cleaned dataset fall within the codebook
##       HHID               Age          BirthSex             MaritalStatus
##  Min.   :72100014   Min.   :18.00   Male  :975   Married          :686  
##  1st Qu.:72108949   1st Qu.:35.00   Female:595   Living as Married:122  
##  Median :72117151   Median :48.00                Divorced         :209  
##  Mean   :72249378   Mean   :49.66                Widowed          : 97  
##  3rd Qu.:72325213   3rd Qu.:64.00                Separated        : 32  
##  Max.   :72836006   Max.   :93.00                Single           :424  
##     AgeGrpB          EducA           HHInc       TotalHousehold  
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   : 1.000  
##  1st Qu.:2.000   1st Qu.:3.000   1st Qu.:3.000   1st Qu.: 1.000  
##  Median :2.000   Median :4.000   Median :4.000   Median : 2.000  
##  Mean   :2.568   Mean   :3.384   Mean   :4.176   Mean   : 2.368  
##  3rd Qu.:3.000   3rd Qu.:4.000   3rd Qu.:6.000   3rd Qu.: 3.000  
##  Max.   :5.000   Max.   :4.000   Max.   :6.000   Max.   :11.000  
##       BMI         smokeStat       RaceEthn5          phq4       
##  Min.   :10.8   Min.   :1.000   Min.   :1.000   Min.   : 1.000  
##  1st Qu.:23.7   1st Qu.:2.000   1st Qu.:1.000   1st Qu.: 2.000  
##  Median :27.3   Median :3.000   Median :1.000   Median : 3.000  
##  Mean   :28.5   Mean   :2.458   Mean   :1.653   Mean   : 3.756  
##  3rd Qu.:31.9   3rd Qu.:3.000   3rd Qu.:2.000   3rd Qu.: 5.000  
##  Max.   :63.1   Max.   :3.000   Max.   :4.000   Max.   :12.000  
##     Exercise         ECigUse     AvgDrinksPerWeek
##  Min.   :   0.0   Current: 112   Min.   : 0.250  
##  1st Qu.:  30.0   Former : 262   1st Qu.: 0.750  
##  Median :  90.0   Never  :1196   Median : 2.500  
##  Mean   : 163.9                  Mean   : 5.776  
##  3rd Qu.: 210.0                  3rd Qu.: 7.500  
##  Max.   :3600.0                  Max.   :75.000  
##                     Education               HouseholdInc SmokingStatus
##  Less than High School   : 63   ~<$20,000         :186   Current:206  
##  High School             :173   ~$20,000-<$35,000 :159   Former :439  
##  Some College            :432   ~$35,000-<$50,000 :186   Never  :925  
##  College Graduate or More:902   ~$50,000-<$75,000 :262                
##                                 ~$75,000-<$100,000:216                
##                                 ~>=$100,000       :561                
##                   RaceEthn  
##  Non-Hispanic White   :985  
##  Non-Hispanic Black/AA:209  
##  Hispanic             :312  
##  Non-Hispanic Asian   : 64  
##                             
## 

The sum demonstrated that the NAs were successfully removed from the cleaned data and the summary demonstrated each level was recoded to its corresponding label.

Displaying the dimension of the clean data after negative and NA values were removed

dim(Hints_clean) #displaying the dimensions of data; 1570 observations, 19 variables
## [1] 1570   19

The clean data now has 1,570 observations and 19 variables: dropping the rows with NA values removed 75 observations, and the recoding step added four new factor columns to the original 15.


Task 3: Summary Statistics for Quantitative Variables

Loading the summarytools package. It can be attached on its own; plyr is not attached here because loading it after dplyr masks dplyr functions such as mutate() and summarise().

library(summarytools)
## 
## Attaching package: 'summarytools'
## The following object is masked from 'package:tibble':
## 
##     view

Selecting quantitative variables within the clean data.

Quant_Hints_Clean <- Hints_clean %>% 
  select("Age","TotalHousehold","BMI","phq4","Exercise","AvgDrinksPerWeek") #selecting only quantitative data in Hints_Clean dataset

names(Quant_Hints_Clean) #confirming the correct columns were chosen 
## [1] "Age"              "TotalHousehold"   "BMI"              "phq4"            
## [5] "Exercise"         "AvgDrinksPerWeek"
sum(is.na(Quant_Hints_Clean)) #confirming if there are any NA values within the data
## [1] 0

The correct quantitative variables were successfully selected and there are no NAs within the data.

Descriptive statistics were generated for the identified quantitative variables.

descr(Quant_Hints_Clean, #using descr() to generate N, mean, median, std dev, min and max
      headings = TRUE,
      stats = "common"
      )
## Descriptive Statistics  
## Quant_Hints_Clean  
## N: 1570  
## 
##                       Age   AvgDrinksPerWeek       BMI   Exercise      phq4   TotalHousehold
## --------------- --------- ------------------ --------- ---------- --------- ----------------
##            Mean     49.66               5.78     28.50     163.95      3.76             2.37
##         Std.Dev     17.08               8.99      6.98     256.96      2.86             1.34
##             Min     18.00               0.25     10.80       0.00      1.00             1.00
##          Median     48.00               2.50     27.30      90.00      3.00             2.00
##             Max     93.00              75.00     63.10    3600.00     12.00            11.00
##         N.Valid   1570.00            1570.00   1570.00    1570.00   1570.00          1570.00
##               N   1570.00            1570.00   1570.00    1570.00   1570.00          1570.00
##       Pct.Valid    100.00             100.00    100.00     100.00    100.00           100.00

The cleaned sample consists of 1,570 participants with a mean age of 49.66 years (SD 17.08) and a range of 18 to 93 years, indicating a largely middle-aged sample with a wide age spread. The mean Body Mass Index is 28.50 (SD 6.98), which falls in the overweight category (25.0 to 29.9) and suggests that a substantial portion of the sample may be at increased risk for weight-related health conditions. Participants reported a mean of 5.78 alcoholic drinks per week (SD 8.99) against a median of 2.50 and a maximum of 75, so alcohol consumption is strongly right-skewed. Exercise is also highly variable: the mean is 163.95 minutes per week, but the median is only 90 and the standard deviation is 256.96, which points to a small number of very active respondents. The mean PHQ-4 score of 3.76 (median 3) indicates relatively low to moderate levels of anxiety and depression symptoms overall. Both the alcohol and PHQ-4 figures describe only respondents with nonzero values, because the cleaning step kept rows with phq4 > 0 and AvgDrinksPerWeek > 0. Lastly, the mean household size was 2.37 individuals, reflecting relatively small households.
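
The statistics reported by descr() can be cross-checked with base R. The following is a minimal sketch on a hypothetical five-row data frame named `toy`; the row names are chosen to resemble the descr() output and are otherwise arbitrary.

```r
# Hypothetical values for two quantitative variables.
toy <- data.frame(
  Age = c(18, 35, 48, 64, 93),
  BMI = c(10.8, 23.7, 27.3, 31.9, 63.1)
)

# Apply the same set of summary functions to every column.
stats <- sapply(toy, function(x) {
  c(Mean = mean(x), Std.Dev = sd(x), Min = min(x),
    Median = median(x), Max = max(x))
})

round(stats, 2)
```

Comparing the mean with the median in this table is a quick way to spot the skew discussed above.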


Task 4: Frequency Tables for Qualitative Variables

freq(Hints_clean$BirthSex,report.nas = FALSE, cumul = FALSE) #frequency table for BirthSex Variable 
## Frequencies  
## Hints_clean$BirthSex  
## Type: Factor  
## 
##                Freq        %
## ------------ ------ --------
##         Male    975    62.10
##       Female    595    37.90
##        Total   1570   100.00
freq(Hints_clean$MaritalStatus, report.nas = FALSE, cumul = FALSE) #frequency table for MaritalStatus Variable 
## Frequencies  
## Hints_clean$MaritalStatus  
## Type: Factor  
## 
##                           Freq        %
## ----------------------- ------ --------
##                 Married    686    43.69
##       Living as Married    122     7.77
##                Divorced    209    13.31
##                 Widowed     97     6.18
##               Separated     32     2.04
##                  Single    424    27.01
##                   Total   1570   100.00
freq(Hints_clean$Education, report.nas = FALSE, cumul = FALSE) #frequency table for Education Variable 
## Frequencies  
## Hints_clean$Education  
## Type: Factor  
## 
##                                  Freq        %
## ------------------------------ ------ --------
##          Less than High School     63     4.01
##                    High School    173    11.02
##                   Some College    432    27.52
##       College Graduate or More    902    57.45
##                          Total   1570   100.00
freq(Hints_clean$HouseholdInc, report.nas = FALSE, cumul = FALSE) #frequency table for HouseholdInc Variable 
## Frequencies  
## Hints_clean$HouseholdInc  
## Type: Factor  
## 
##                            Freq        %
## ------------------------ ------ --------
##                ~<$20,000    186    11.85
##        ~$20,000-<$35,000    159    10.13
##        ~$35,000-<$50,000    186    11.85
##        ~$50,000-<$75,000    262    16.69
##       ~$75,000-<$100,000    216    13.76
##              ~>=$100,000    561    35.73
##                    Total   1570   100.00
freq(Hints_clean$SmokingStatus, report.nas = FALSE, cumul = FALSE) #frequency table for SmokingStatus Variable 
## Frequencies  
## Hints_clean$SmokingStatus  
## Type: Factor  
## 
##                 Freq        %
## ------------- ------ --------
##       Current    206    13.12
##        Former    439    27.96
##         Never    925    58.92
##         Total   1570   100.00
freq(Hints_clean$RaceEthn, report.nas = FALSE, cumul = FALSE) #frequency table for RaceEthnicity Variable 
## Frequencies  
## Hints_clean$RaceEthn  
## Type: Factor  
## 
##                               Freq        %
## --------------------------- ------ --------
##          Non-Hispanic White    985    62.74
##       Non-Hispanic Black/AA    209    13.31
##                    Hispanic    312    19.87
##          Non-Hispanic Asian     64     4.08
##                       Total   1570   100.00
freq(Hints_clean$ECigUse, report.nas = FALSE, cumul = FALSE) #frequency table for ECigUse Variable 
## Frequencies  
## Hints_clean$ECigUse  
## Type: Factor  
## 
##                 Freq        %
## ------------- ------ --------
##       Current    112     7.13
##        Former    262    16.69
##         Never   1196    76.18
##         Total   1570   100.00

The cleaned HINTS sample is predominantly male: 62.10% of participants were male and 37.90% were female. The largest marital status group was married (43.69%), followed by single (27.01%) and divorced (13.31%); the remaining participants were living as married (7.77%), widowed (6.18%), or separated (2.04%). Most participants were college graduates or more (57.45%), 27.52% had some college education, 11.02% had a high school education, and 4.01% had less than a high school education. Household income was spread more evenly across categories, although the largest group (35.73%) reported approximately $100,000 or more. A further 16.69% reported approximately $50,000 to <$75,000 and 13.76% reported approximately $75,000 to <$100,000; the <$20,000 and $35,000 to <$50,000 categories each accounted for 11.85%, and 10.13% reported approximately $20,000 to <$35,000. The majority of participants had never smoked cigarettes (58.92%), followed by former smokers (27.96%) and current smokers (13.12%). Most participants identified as Non-Hispanic White (62.74%), followed by Hispanic (19.87%), Non-Hispanic Black/African American (13.31%), and Non-Hispanic Asian (4.08%). As with smoking status, most participants had never used e-cigarettes (76.18%), while 16.69% were former users and 7.13% were current users.
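
The percentages in the frequency tables follow directly from the counts. As a minimal sketch, the SmokingStatus counts from the table above are entered by hand into a named vector (`smoking_counts` is a name introduced here for illustration) and converted with base R's prop.table():

```r
# Counts copied from the SmokingStatus frequency table above.
smoking_counts <- c(Current = 206, Former = 439, Never = 925)

# prop.table() divides each count by the total; scale to percent and round.
smoking_pct <- round(100 * prop.table(smoking_counts), 2)
smoking_pct
```

On the real data, `prop.table(table(Hints_clean$SmokingStatus))` would give the same proportions without retyping the counts.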


Task 5: Visualizing Quantitative Variables

Histogram & Density Plot

library(ggplot2) #loading ggplot2

sum(is.na(Hints_clean$BMI)) #confirming if there are any NAs in BMI
## [1] 0
sum(is.na(Hints_clean$BirthSex)) #confirming if there are any NAs in BirthSex
## [1] 0
table(Hints_clean$BirthSex,useNA = "ifany") #confirming if there are any NAs in BirthSex
## 
##   Male Female 
##    975    595
ggplot(Hints_clean, aes(x=BMI,fill = BirthSex)) + #indicating that the x axis will be BMI but the values of BirthSex will fill the bins
  geom_histogram(aes(y = after_stat(density)),
                 alpha = 0.5, position = "identity", bins = 30) + 
  geom_density(alpha = 0.2) + 
  labs(title = "BMI Distribution by Sex", #title of the graph will be BMI Distribution by Sex 
       x = "BMI", #x axis will be labeled as BMI
       y = "Density") #y axis will be Density

While BMI distributions are similar for both sexes, females tend to cluster more around the average BMI range, whereas males show slightly more variability and higher extreme values.

Boxplot

ggplot(Hints_clean, aes(x=SmokingStatus, y=phq4, fill = SmokingStatus)) + #x will be the different smoking status levels, y will be the phq4 values, and each box will be filled according to smoking status
  geom_boxplot() + #boxplot is indicated 
  labs(title = "PHQ-4 by Smoking Status", x="Smoking Status", y="PHQ-4 Score") #graph, x axis, and y axis title names 

Current smokers appear to have a higher median PHQ-4 score, suggesting this group has greater anxiety/depression symptoms.
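
A boxplot comparison can be backed by the group medians themselves. The following is a minimal sketch on a hypothetical six-row data frame named `toy`; the values are invented for illustration and do not come from the HINTS data.

```r
# Hypothetical PHQ-4 scores for two respondents per smoking group.
toy <- data.frame(
  SmokingStatus = factor(
    c("Current", "Current", "Former", "Former", "Never", "Never"),
    levels = c("Current", "Former", "Never")
  ),
  phq4 = c(6, 4, 3, 2, 1, 3)
)

# tapply() splits phq4 by group and applies median() to each piece.
group_medians <- tapply(toy$phq4, toy$SmokingStatus, median)
group_medians
```

Reporting the medians alongside the plot lets a reader check the visual impression against numbers.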

Scatterplot

ggplot(Hints_clean, aes(x=Age, y=AvgDrinksPerWeek)) + #x axis will be age and y will be avg drinks per week 
  geom_point() + #indicating points will be used to identify each observation
  geom_smooth(method="lm") + #adds a fitted linear regression line with its confidence band
  labs(title="Age vs Alcohol Consumption", x="Age", y="Average Drinks Per Week") #graph, x axis, and y axis title names 
## `geom_smooth()` using formula = 'y ~ x'

The scatterplot shows the relationship between age and the average number of alcoholic drinks consumed per week. Overall, there appears to be a very weak positive relationship, as indicated by the slightly upward-sloping trend line. This suggests that, on average, alcohol consumption increases slightly with age; however, the relationship is not strong. Most participants, regardless of age, report low levels of alcohol consumption, with a high concentration of values clustered near zero to 10 drinks per week. There are also several outliers, particularly among middle-aged individuals, who report much higher levels of drinking. The wide spread of points across all age groups indicates high variability, meaning age alone is not a strong predictor of alcohol consumption in this sample.
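
The strength of a linear relationship like this one can be quantified with a correlation coefficient and the slope of the fitted line. This is a minimal sketch on a hypothetical six-row data frame named `toy`; the numbers are invented and say nothing about the actual HINTS result.

```r
# Hypothetical ages and weekly drink counts.
toy <- data.frame(
  Age              = c(20, 30, 40, 50, 60, 70),
  AvgDrinksPerWeek = c(2, 1, 6, 3, 2, 5)
)

# Pearson correlation: sign gives direction, magnitude gives strength.
r <- cor(toy$Age, toy$AvgDrinksPerWeek)

# The same straight line that geom_smooth(method = "lm") draws.
fit <- lm(AvgDrinksPerWeek ~ Age, data = toy)
slope <- coef(fit)[["Age"]]

r
slope
```

A correlation near zero together with a small slope would support describing the relationship as very weak.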


Task 6: Visualizing Qualitative Variables

Bar Plot (Counts)

sum(is.na(Hints_clean$RaceEthn)) #confirming there are no NAs in RaceEthn
## [1] 0
ggplot(Hints_clean, aes(x=RaceEthn)) + #x axis will be RaceEthn
  geom_bar() + # bar chart is indicated 
  labs(title="Race/Ethnicity Distribution", x= "Race/Ethnicity", y="Count") #graph, x axis, and y axis title names 

The majority of participants identified as Non-Hispanic White, while those who identified as Hispanic were the second most common group. Non-Hispanic Black/African Americans came in third, while Non-Hispanic Asians made up the smallest proportion.

Bar Plot (Proportions)

sum(is.na(Hints_clean$MaritalStatus)) #confirming no NAs in Marital Status
## [1] 0
sum(is.na(Hints_clean$Education)) #confirming no NAs in Education
## [1] 0
ggplot(Hints_clean, aes(x=Education, fill=MaritalStatus)) + #x axis will be education level and marital status will fill the bars 
  geom_bar(position="fill") + #stacks the bars and rescales each to height 1 so the segments show proportions
  labs(title="Marital Status by Education", y="Proportion") + #provides title and the label for y axis
  theme(axis.text.x = element_text(size = 9, angle = 45, hjust = 1)) #adjusts labels on x axis

The stacked bar plot shows that the mix of marital statuses varies across education levels. Higher education groups appear to have a greater proportion of married individuals, while lower education groups have more single individuals, which points to a possible relationship between education and marital status.
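
The proportions drawn by a filled bar chart are row proportions of a two-way table. A minimal sketch, assuming a hypothetical five-row data frame named `toy` with only two levels per variable:

```r
# Hypothetical education and marital status for five respondents.
toy <- data.frame(
  Education     = factor(c("High School", "High School",
                           "College", "College", "College")),
  MaritalStatus = factor(c("Single", "Married",
                           "Married", "Married", "Single"))
)

# margin = 1 makes each row (education level) sum to 1.
row_props <- prop.table(table(toy$Education, toy$MaritalStatus), margin = 1)
round(row_props, 2)
```

Each row of this table corresponds to one bar in the plot, with the columns giving the heights of its segments.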

Faceted Plot

sum(is.na(Hints_clean$Exercise)) #confirming if there are any NAs in Exercise
## [1] 0
sum(is.na(Hints_clean$ECigUse))#confirming if there are any NAs in ECigUse
## [1] 0
ggplot(Hints_clean, aes(x=Exercise)) + #x axis will be exercise 
  geom_histogram(bins = 25) + #draws a histogram that groups the exercise values into 25 bins
  facet_wrap(~ECigUse) + #telling R to create a mini graph for each e-cigarette use group
  labs(title="Exercise by E-Cigarette Use", #chart title
       x = "Exercise (Minutes)", #x axis title
       y = "Count") #y axis title

Across all groups, most people engage in relatively low amounts of exercise, with only a small proportion reporting very high activity levels. There are no dramatic differences between e-cigarette use groups, though never users may be slightly more concentrated at lower exercise levels.


Task 7: Cross-Tabulations

library(summarytools)

sum(is.na(Hints_clean$BirthSex)) #confirming no NAs in BirthSex
## [1] 0
sum(is.na(Hints_clean$HouseholdInc)) #confirming no NAs in HouseholdInc
## [1] 0
ctable(Hints_clean$BirthSex, Hints_clean$HouseholdInc) #creating cross-tabulation between BirthSex and Household Income 
## Cross-Tabulation, Row Proportions  
## BirthSex * HouseholdInc  
## Data Frame: Hints_clean  
## 
## ---------- -------------- ------------- ------------------- ------------------- ------------------- -------------------- ------------- ---------------
##              HouseholdInc     ~<$20,000   ~$20,000-<$35,000   ~$35,000-<$50,000   ~$50,000-<$75,000   ~$75,000-<$100,000   ~>=$100,000           Total
##   BirthSex                                                                                                                                            
##       Male                  120 (12.3%)         108 (11.1%)         120 (12.3%)         165 (16.9%)          139 (14.3%)   323 (33.1%)    975 (100.0%)
##     Female                   66 (11.1%)          51 ( 8.6%)          66 (11.1%)          97 (16.3%)           77 (12.9%)   238 (40.0%)    595 (100.0%)
##      Total                  186 (11.8%)         159 (10.1%)         186 (11.8%)         262 (16.7%)          216 (13.8%)   561 (35.7%)   1570 (100.0%)
## ---------- -------------- ------------- ------------------- ------------------- ------------------- -------------------- ------------- ---------------
chisq.test(table(Hints_clean$BirthSex,Hints_clean$HouseholdInc)) #running chi square test to determine if there is a significant assoc. btwn BirthSex and Household Inc
## 
##  Pearson's Chi-squared test
## 
## data:  table(Hints_clean$BirthSex, Hints_clean$HouseholdInc)
## X-squared = 8.6446, df = 5, p-value = 0.1241

A chi-square test of independence gave a p-value of 0.1241. Because the p-value is greater than 0.05, we fail to reject the null hypothesis: there is no statistically significant evidence of an association between birth sex and household income in this sample.
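
The test can be rerun directly from the counts printed in the cross-tabulation, which also makes it easy to check the expected-count assumption and to attach an effect size. The sketch below is one way to do this. The object names (`obs`, `res`, `min_expected`, `cramers_v`) are illustrative, and Cramér's V is an addition that was not part of the original analysis.

```r
# Observed counts copied from the cross-tabulation above
obs <- matrix(
  c(120, 108, 120, 165, 139, 323,   # Male
     66,  51,  66,  97,  77, 238),  # Female
  nrow = 2, byrow = TRUE,
  dimnames = list(
    BirthSex = c("Male", "Female"),
    HouseholdInc = c("<$20,000", "$20,000-<$35,000", "$35,000-<$50,000",
                     "$50,000-<$75,000", "$75,000-<$100,000", ">=$100,000")
  )
)

res <- chisq.test(obs)            # Pearson chi-square test of independence
res$expected                      # expected counts under independence
min_expected <- min(res$expected) # common rule of thumb: all expected counts at least 5

# Cramer's V as an effect size: sqrt(X^2 / (n * (min(rows, cols) - 1)))
n <- sum(obs)
cramers_v <- sqrt(unname(res$statistic) / (n * (min(dim(obs)) - 1)))
cramers_v
```

A small Cramér's V would reinforce the conclusion above: any association between birth sex and household income is weak as well as non-significant.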


Task 8: Correlation Analysis

quant_vars <- Hints_clean[,c("Age","BMI","phq4","Exercise")] #selecting the identified quantitative variables 

sum(is.na(quant_vars)) #confirming if there are any NAs within the selected variables 
## [1] 0
cor_quant_vars <- cor(quant_vars, use = "complete.obs") #obtain correlations of the quantitative variables 

library(corrplot) #loading corrplot to use 
## corrplot 0.95 loaded
corrplot(cor_quant_vars, method = "circle") #visualizing the correlation matrix as a circle-based correlogram

The strongest positive correlation, though very weak, is between BMI and PHQ-4 (r = 0.08), suggesting that higher BMI is slightly associated with greater psychological distress. The strongest negative correlation, also weak, is between age and PHQ-4 (r = -0.18), indicating that younger individuals tend to report more mental health symptoms. The directions of these relationships are generally consistent with public health expectations, but with correlations this weak, neither variable explains much of the variation in PHQ-4 scores.
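
Reading the strongest pairs off a plot is error-prone, so it can help to rank them programmatically and to test a pair of interest. The sketch below uses four columns of R's built-in `mtcars` dataset as stand-in data (not the HINTS variables), and `rank_pairs()` is a hypothetical helper written for this illustration.

```r
# Stand-in data: four numeric columns from the built-in mtcars dataset (NOT the HINTS data)
cm <- cor(mtcars[, c("mpg", "hp", "wt", "qsec")])

# Hypothetical helper: keep each variable pair once (upper triangle) and sort by r
rank_pairs <- function(cor_mat) {
  idx <- which(upper.tri(cor_mat), arr.ind = TRUE)
  pairs <- data.frame(
    var1 = rownames(cor_mat)[idx[, "row"]],
    var2 = colnames(cor_mat)[idx[, "col"]],
    r    = cor_mat[idx]
  )
  pairs[order(pairs$r), ] # most negative first, most positive last
}

ranked <- rank_pairs(cm)
ranked

# Significance test and 95% confidence interval for one pair
ct <- cor.test(mtcars$mpg, mtcars$wt)
ct$estimate
ct$conf.int
```

Passing `cor_quant_vars` to the same helper would list the HINTS pairs in order, and `cor.test()` on `Hints_clean$Age` and `Hints_clean$phq4` would show whether the weak negative correlation is distinguishable from zero.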


Task 9: Executive Summary & Reflection

This analysis used the 2024 HINTS public dataset to examine demographics, health behaviors, and mental health indicators. The sample included individuals across diverse ages, income levels, and educational backgrounds. Current smokers reported higher levels of anxiety and depression, suggesting a greater mental health burden in that group. BMI distributions were similar between the sexes, although females clustered more closely around the average BMI, whereas males showed slightly greater variability and higher extreme values. A chi-square test found no statistically significant association between birth sex and household income, so there is insufficient evidence to conclude that the two are related in this sample. Correlation analysis showed a weak positive relationship between BMI and PHQ-4 score (r = 0.08) and a weak negative relationship between age and PHQ-4 score (r = -0.18), suggesting that higher BMI and younger age may be associated with slightly worse mental health. Overall, these findings point to the burden of poor mental health and the importance of mental health interventions. Limitations include the cross-sectional design, which prevents causal conclusions, and missing data, which reduced the sample size available for analysis.