PUBH 422 Test 2

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this

## Day 1: Data Preparation & Initial Exploration

# Task 1
setwd("C:/Users/kevinroldan13/OneDrive - Cal State Fullerton/Desktop/PUBH 422/PUBH 422 Test 2") # Set working directory

hints <- read.csv("hints7_public data_2024.csv", header = TRUE) # Loaded dataset with 7278 rows/observations and 15 columns/variables


dim(hints) #printed dimensions of dataset, 7278 observations and 15 variables

## [1] 7278   15

head(hints) #printed the first 6 rows of the dataset

##       HHID Age BirthSex MaritalStatus AgeGrpB EducA HHInc TotalHousehold  BMI
## 1 72100001  69        2             1       4     4     6              2 26.3
## 2 72100005  62        1             1       3     3     6              2 25.0
## 3 72100014  34        1             1       1     4     5              5 24.0
## 4 72100019  65        2             1       4     3     4              3 35.2
## 5 72100025  64        1             6       3     4     1              1 29.0
## 6 72100026  64        2             1       3     4     6              2 27.3
##   smokeStat RaceEthn5 phq4 Exercise ECigUse AvgDrinksPerWeek
## 1         2         1    0      225       3             12.5
## 2         2         1    0      180       3             15.0
## 3         3         3    4      240       3              7.5
## 4         2         1    4      120       3              2.0
## 5         2         1   11        0       1             12.0
## 6         3         5    1      150       3              0.5

summary (hints) #printed the preliminary statistical summary of dataset

##       HHID               Age            BirthSex       MaritalStatus   
##  Min.   :72100001   Min.   : -9.00   Min.   :-9.0000   Min.   :-9.000  
##  1st Qu.:72108592   1st Qu.: 36.00   1st Qu.: 1.0000   1st Qu.: 1.000  
##  Median :72117023   Median : 55.00   Median : 1.0000   Median : 2.000  
##  Mean   :72251562   Mean   : 50.38   Mean   : 0.7236   Mean   : 2.031  
##  3rd Qu.:72325380   3rd Qu.: 69.00   3rd Qu.: 2.0000   3rd Qu.: 4.000  
##  Max.   :72836009   Max.   :102.00   Max.   : 3.0000   Max.   : 6.000  
##     AgeGrpB           EducA            HHInc       TotalHousehold 
##  Min.   :-9.000   Min.   :-9.000   Min.   :-9.00   Min.   :-9.00  
##  1st Qu.: 2.000   1st Qu.: 2.000   1st Qu.: 1.00   1st Qu.: 1.00  
##  Median : 3.000   Median : 3.000   Median : 4.00   Median : 2.00  
##  Mean   : 2.128   Mean   : 2.335   Mean   : 2.36   Mean   : 1.56  
##  3rd Qu.: 4.000   3rd Qu.: 4.000   3rd Qu.: 6.00   3rd Qu.: 3.00  
##  Max.   : 5.000   Max.   : 4.000   Max.   : 6.00   Max.   :11.00  
##       BMI          smokeStat        RaceEthn5            phq4       
##  Min.   :-9.00   Min.   :-9.000   Min.   :-9.0000   Min.   :-9.000  
##  1st Qu.:23.20   1st Qu.: 2.000   1st Qu.: 1.0000   1st Qu.: 0.000  
##  Median :27.10   Median : 3.000   Median : 1.0000   Median : 1.000  
##  Mean   :26.44   Mean   : 1.803   Mean   : 0.7595   Mean   : 1.592  
##  3rd Qu.:31.90   3rd Qu.: 3.000   3rd Qu.: 3.0000   3rd Qu.: 3.000  
##  Max.   :66.60   Max.   : 3.000   Max.   : 5.0000   Max.   :12.000  
##     Exercise         ECigUse       AvgDrinksPerWeek
##  Min.   :  -9.0   Min.   :-9.000   Min.   :-9.000  
##  1st Qu.:   0.0   1st Qu.: 3.000   1st Qu.: 0.000  
##  Median :  90.0   Median : 3.000   Median : 0.000  
##  Mean   : 168.7   Mean   : 1.826   Mean   : 1.533  
##  3rd Qu.: 210.0   3rd Qu.: 3.000   3rd Qu.: 2.000  
##  Max.   :6300.0   Max.   : 3.000   Max.   :75.000

# Task 2 Data Cleaning & Value Labeling
str(hints) #displays the internal structure of the dataframe through concise summary

## 'data.frame':    7278 obs. of  15 variables:
##  $ HHID            : int  72100001 72100005 72100014 72100019 72100025 72100026 72100027 72100028 72100031 72100035 ...
##  $ Age             : int  69 62 34 65 64 64 26 85 32 -9 ...
##  $ BirthSex        : int  2 1 1 2 1 2 1 2 1 3 ...
##  $ MaritalStatus   : int  1 1 1 1 6 1 6 4 6 6 ...
##  $ AgeGrpB         : int  4 3 1 4 3 3 1 5 1 -9 ...
##  $ EducA           : int  4 3 4 3 4 4 4 2 3 4 ...
##  $ HHInc           : int  6 6 5 4 1 6 4 3 5 1 ...
##  $ TotalHousehold  : int  2 2 5 3 1 2 1 2 1 -9 ...
##  $ BMI             : num  26.3 25 24 35.2 29 27.3 30.7 30.3 26.1 26.6 ...
##  $ smokeStat       : int  2 2 3 2 2 3 3 3 1 3 ...
##  $ RaceEthn5       : int  1 1 3 1 1 5 2 1 1 3 ...
##  $ phq4            : int  0 0 4 4 11 1 8 0 0 0 ...
##  $ Exercise        : int  225 180 240 120 0 150 80 90 360 0 ...
##  $ ECigUse         : int  3 3 3 3 1 3 3 3 1 3 ...
##  $ AvgDrinksPerWeek: num  12.5 15 7.5 2 12 0.5 1 4 0 -9 ...

sum(is.na(hints)) #counts the number of missing values in the dataframe

## [1] 0

#Output shows no missing values were identified in the dataframe

sum(hints<0, na.rm=TRUE) #Detects negative values in dataset and gives count

## [1] 8700

hints_clean <- hints[
  hints$Age >= 0 & #reference each column because R does not detect it as columns
  hints$BirthSex > 0 &
  hints$MaritalStatus > 0 & #>0 removes negative values
  hints$AgeGrpB > 0 &
  hints$EducA > 0 &
  hints$HHInc > 0 &
  hints$TotalHousehold > 0 &
  hints$BMI >= 0 &
  hints$smokeStat > 0 &
  hints$RaceEthn5 > 0 &
  hints$phq4 >= 0 &
  hints$Exercise >= 0 &
  hints$ECigUse > 0 &
  hints$AvgDrinksPerWeek >= 0, ] #cleans data by excluding the negative values in dataframe and stores in new df

summary(hints_clean) #summary verifies no negative values are present, clean dataset is now 5557 observations and 15 variables

##       HHID               Age            BirthSex     MaritalStatus  
##  Min.   :72100001   Min.   : 18.00   Min.   :1.000   Min.   :1.000  
##  1st Qu.:72108616   1st Qu.: 39.00   1st Qu.:1.000   1st Qu.:1.000  
##  Median :72116908   Median : 56.00   Median :1.000   Median :2.000  
##  Mean   :72247272   Mean   : 54.14   Mean   :1.417   Mean   :2.802  
##  3rd Qu.:72325132   3rd Qu.: 69.00   3rd Qu.:2.000   3rd Qu.:4.000  
##  Max.   :72836009   Max.   :100.00   Max.   :3.000   Max.   :6.000  
##     AgeGrpB          EducA          HHInc      TotalHousehold        BMI       
##  Min.   :1.000   Min.   :1.00   Min.   :1.00   Min.   : 1.000   Min.   : 7.80  
##  1st Qu.:2.000   1st Qu.:3.00   1st Qu.:2.00   1st Qu.: 1.000   1st Qu.:24.00  
##  Median :3.000   Median :4.00   Median :4.00   Median : 2.000   Median :27.50  
##  Mean   :2.893   Mean   :3.25   Mean   :3.95   Mean   : 2.412   Mean   :28.85  
##  3rd Qu.:4.000   3rd Qu.:4.00   3rd Qu.:6.00   3rd Qu.: 3.000   3rd Qu.:32.30  
##  Max.   :5.000   Max.   :4.00   Max.   :6.00   Max.   :11.000   Max.   :66.60  
##    smokeStat       RaceEthn5          phq4           Exercise     
##  Min.   :1.000   Min.   :1.000   Min.   : 0.000   Min.   :   0.0  
##  1st Qu.:2.000   1st Qu.:1.000   1st Qu.: 0.000   1st Qu.:  16.0  
##  Median :3.000   Median :1.000   Median : 1.000   Median :  90.0  
##  Mean   :2.549   Mean   :1.865   Mean   : 2.225   Mean   : 181.1  
##  3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.: 4.000   3rd Qu.: 225.0  
##  Max.   :3.000   Max.   :5.000   Max.   :12.000   Max.   :5040.0  
##     ECigUse      AvgDrinksPerWeek
##  Min.   :1.000   Min.   : 0.000  
##  1st Qu.:3.000   1st Qu.: 0.000  
##  Median :3.000   Median : 0.250  
##  Mean   :2.791   Mean   : 2.873  
##  3rd Qu.:3.000   3rd Qu.: 2.500  
##  Max.   :3.000   Max.   :75.000

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

hints_clean <- hints_clean %>%
  filter(BirthSex %in% c(1, 2) | is.na(BirthSex)) %>% #Kept only valid BirthSex values 1 or 2
  mutate(BirthSex = factor(BirthSex, #Recoded Birthsex into labeled categories.
                         levels = c(1, 2), #Assigns levels 1- Male and 2-Female
                         labels = c("Male", "Female"))) #Converted BirthSex from numeric to labeled categories:
 
hints_clean <- hints_clean %>%
  mutate(MaritalStatus = factor(MaritalStatus, #Recoded MaritalStatus into labeled factor variable
                         levels = c(1, 2, 3, 4, 5, 6),  #Defined original coding structure (1–6)
                         labels = c("Married", "Living as married", "Divorced", "Widowed",                                                          "Separated", "Single"))) #Converted numeric codes into readable categories                      
hints_clean <- hints_clean %>%
  mutate(EducA = factor(EducA, #Recoded Education variable into ordered categories
                         levels = c(1, 2, 3, 4), #Education levels from lowest to highest attainment
                         labels = c("Less than High School", "High School", 
                         "Some College", "College Graduate or More"))) #

hints_clean <- hints_clean %>%
  mutate(HHInc = factor(HHInc, #Recoded Household Income into income brackets
                         levels = c(1, 2, 3, 4, 5, 6), #Assigned Income categories (1–6)
                         labels = c("<$20,000", "$20,000-<$35,000", "$35,000-<$50,000",
                         "$50,000-<$75,000", "$75,000-<100,000", "≥$100,000")))
                         
hints_clean <- hints_clean %>% 
  mutate(smokeStat = factor(smokeStat, #Recoded smoking status into categorical variable
                         levels = c(1, 2, 3), #Assigned smoking categories
                         labels = c("Current", "Former", "Never")))
                         
hints_clean <- hints_clean %>%
  mutate(RaceEthn5 = factor(RaceEthn5, #Recoded race/ethnicity into labeled categories
                         levels = c(1, 2, 3, 4, 5), #Assigned Race/ethnicity catergories
                         labels = c("~Non-Hispanic White", 
                         "~Non-Hispanic Black or African American", "~Hispanic",
                         "~Non-Hispanic Asian", "~Non-Hispanic Other")))
                         

hints_clean <- hints_clean %>%
  mutate(ECigUse = factor(ECigUse, #Recoded e-cigarette use into categorical variable
                         levels = c(1, 2, 3), #Assigned ECigUse categories
                         labels = c("Current", "Former", "Never")))
                         
summary(hints_clean) #Generated summary statistics for cleaned dataset

##       HHID               Age           BirthSex              MaritalStatus 
##  Min.   :72100001   Min.   : 18.00   Male  :3269   Married          :2595  
##  1st Qu.:72108618   1st Qu.: 39.00   Female:2261   Living as married: 336  
##  Median :72116882   Median : 56.00                 Divorced         : 781  
##  Mean   :72246943   Mean   : 54.19                 Widowed          : 474  
##  3rd Qu.:72325100   3rd Qu.: 69.00                 Separated        : 119  
##  Max.   :72836009   Max.   :100.00                 Single           :1225  
##     AgeGrpB                           EducA                   HHInc     
##  Min.   :1.000   Less than High School   : 298   <$20,000        : 816  
##  1st Qu.:2.000   High School             : 827   $20,000-<$35,000: 658  
##  Median :3.000   Some College            :1585   $35,000-<$50,000: 675  
##  Mean   :2.896   College Graduate or More:2820   $50,000-<$75,000: 928  
##  3rd Qu.:4.000                                   $75,000-<100,000: 708  
##  Max.   :5.000                                   ≥$100,000       :1745  
##  TotalHousehold        BMI          smokeStat   
##  Min.   : 1.000   Min.   :10.20   Current: 541  
##  1st Qu.: 1.000   1st Qu.:24.00   Former :1410  
##  Median : 2.000   Median :27.50   Never  :3579  
##  Mean   : 2.411   Mean   :28.87                 
##  3rd Qu.: 3.000   3rd Qu.:32.38                 
##  Max.   :11.000   Max.   :66.60                 
##                                    RaceEthn5         phq4       
##  ~Non-Hispanic White                    :3127   Min.   : 0.000  
##  ~Non-Hispanic Black or African American: 798   1st Qu.: 0.000  
##  ~Hispanic                              :1085   Median : 1.000  
##  ~Non-Hispanic Asian                    : 306   Mean   : 2.222  
##  ~Non-Hispanic Other                    : 214   3rd Qu.: 4.000  
##                                                 Max.   :12.000  
##     Exercise         ECigUse     AvgDrinksPerWeek
##  Min.   :   0.0   Current: 280   Min.   : 0.00   
##  1st Qu.:  16.0   Former : 596   1st Qu.: 0.00   
##  Median : 100.0   Never  :4654   Median : 0.25   
##  Mean   : 181.3                  Mean   : 2.88   
##  3rd Qu.: 225.0                  3rd Qu.: 2.50   
##  Max.   :5040.0                  Max.   :75.00

## Day 2: Descriptive Statistics

# Task 3: Summary Statistics for Quantitative Variables
library(summarytools) #loads summarytools package

## Warning: package 'summarytools' was built under R version 4.5.3

quant_var <- hints_clean[, c("Age", #Subset includes all rows and only quant columns
                                    "TotalHousehold",
                                    "BMI",
                                    "phq4",
                                    "Exercise",
                                    "AvgDrinksPerWeek")] 
#Created a smaller subset for quantitative variables from the cleaned hints dataset which includes 5557 observations of 6 variables

descr(quant_var, #runs descriptive statistics for quant variables subset
      stats = c("n.valid", "mean", "med", "sd", "min", "max")) #lists the stats needed

## Descriptive Statistics  
## quant_var  
## N: 5530  
## 
##                     Age   AvgDrinksPerWeek       BMI   Exercise      phq4   TotalHousehold
## ------------- --------- ------------------ --------- ---------- --------- ----------------
##       N.Valid   5530.00            5530.00   5530.00    5530.00   5530.00          5530.00
##          Mean     54.19               2.88     28.87     181.33      2.22             2.41
##        Median     56.00               0.25     27.50     100.00      1.00             2.00
##       Std.Dev     17.76               6.56      6.96     310.31      2.98             1.38
##           Min     18.00               0.00     10.20       0.00      0.00             1.00
##           Max    100.00              75.00     66.60    5040.00     12.00            11.00

# The descriptive statistics indicate that there are 5,557 participants in the dataset. Participants' age ranges from 18-100, with the average age of 54. The number of individuals in their households ranged from 1-11, with the average being 2.41. Participants averaged 2.22 on the PHQ-4 scores which suggests low-levels of psychological distress. When looking at participant behaviors: On average, participants engaged in 181 minutes of moderate exercise. However, this can be skewed by extreme values, so median is 90 minutes. Their BMI ranges from 7.80 to 66.60 with the average being 28.85 indicating the average person is in the overweight range. The average participant reported 2.87 drinks per week, but median is 0.25 meaning that most individuals reported very few drinks at 0.25.

# Task 4: Frequency Tables for Qualitative Variables
qual_var <- hints_clean[, c("BirthSex", #Subset includes all rows and only qual columns
                            "MaritalStatus",
                            "EducA",
                            "HHInc",
                            "smokeStat",
                            "RaceEthn5",
                            "ECigUse")]

                            
#Created a smaller subset for qualitative variables from the cleaned hints dataset which includes 5557 observations of 7 variables

freq(qual_var$BirthSex, valid.col=TRUE) #valid.col=TRUE to display only valid percents

## Frequencies  
## qual_var$BirthSex  
## Type: Factor  
## 
##                Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ------------ ------ --------- -------------- --------- --------------
##         Male   3269     59.11          59.11     59.11          59.11
##       Female   2261     40.89         100.00     40.89         100.00
##         <NA>      0                               0.00         100.00
##        Total   5530    100.00         100.00    100.00         100.00

freq(qual_var$MaritalStatus, valid.col=TRUE)

## Frequencies  
## qual_var$MaritalStatus  
## Type: Factor  
## 
##                           Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------------------- ------ --------- -------------- --------- --------------
##                 Married   2595     46.93          46.93     46.93          46.93
##       Living as married    336      6.08          53.00      6.08          53.00
##                Divorced    781     14.12          67.12     14.12          67.12
##                 Widowed    474      8.57          75.70      8.57          75.70
##               Separated    119      2.15          77.85      2.15          77.85
##                  Single   1225     22.15         100.00     22.15         100.00
##                    <NA>      0                               0.00         100.00
##                   Total   5530    100.00         100.00    100.00         100.00

freq(qual_var$EducA, valid.col=TRUE)

## Frequencies  
## qual_var$EducA  
## Type: Factor  
## 
##                                  Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ------------------------------ ------ --------- -------------- --------- --------------
##          Less than High School    298      5.39           5.39      5.39           5.39
##                    High School    827     14.95          20.34     14.95          20.34
##                   Some College   1585     28.66          49.01     28.66          49.01
##       College Graduate or More   2820     50.99         100.00     50.99         100.00
##                           <NA>      0                               0.00         100.00
##                          Total   5530    100.00         100.00    100.00         100.00

freq(qual_var$HHInc, valid.col=TRUE)

## Frequencies  
## qual_var$HHInc  
## Type: Factor  
## 
##                          Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ---------------------- ------ --------- -------------- --------- --------------
##               <$20,000    816     14.76          14.76     14.76          14.76
##       $20,000-<$35,000    658     11.90          26.65     11.90          26.65
##       $35,000-<$50,000    675     12.21          38.86     12.21          38.86
##       $50,000-<$75,000    928     16.78          55.64     16.78          55.64
##       $75,000-<100,000    708     12.80          68.44     12.80          68.44
##              ≥$100,000   1745     31.56         100.00     31.56         100.00
##                   <NA>      0                               0.00         100.00
##                  Total   5530    100.00         100.00    100.00         100.00

freq(qual_var$smokeStat, valid.col=TRUE)

## Frequencies  
## qual_var$smokeStat  
## Type: Factor  
## 
##                 Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ------------- ------ --------- -------------- --------- --------------
##       Current    541      9.78           9.78      9.78           9.78
##        Former   1410     25.50          35.28     25.50          35.28
##         Never   3579     64.72         100.00     64.72         100.00
##          <NA>      0                               0.00         100.00
##         Total   5530    100.00         100.00    100.00         100.00

freq(qual_var$RaceEthn5, valid.col=TRUE)

## Frequencies  
## qual_var$RaceEthn5  
## Type: Factor  
## 
##                                                 Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## --------------------------------------------- ------ --------- -------------- --------- --------------
##                           ~Non-Hispanic White   3127     56.55          56.55     56.55          56.55
##       ~Non-Hispanic Black or African American    798     14.43          70.98     14.43          70.98
##                                     ~Hispanic   1085     19.62          90.60     19.62          90.60
##                           ~Non-Hispanic Asian    306      5.53          96.13      5.53          96.13
##                           ~Non-Hispanic Other    214      3.87         100.00      3.87         100.00
##                                          <NA>      0                               0.00         100.00
##                                         Total   5530    100.00         100.00    100.00         100.00

freq(qual_var$ECigUse, valid.col=TRUE)

## Frequencies  
## qual_var$ECigUse  
## Type: Factor  
## 
##                 Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ------------- ------ --------- -------------- --------- --------------
##       Current    280      5.06           5.06      5.06           5.06
##        Former    596     10.78          15.84     10.78          15.84
##         Never   4654     84.16         100.00     84.16         100.00
##          <NA>      0                               0.00         100.00
##         Total   5530    100.00         100.00    100.00         100.00

#created frequency table for each qual variable

#The frequency tables for the qualitative variables indicate that, from 5,557 participants, more than half were Male at birth (58.83%) with 40.89% being Female at birth. In regards to race/ethnicity, more than half of participants reported being Non-Hispanic White (56.31%), followed by Hispanic participants at 19.76%. For marital status, nearly half of participants were married (46.75%), followed by 22.30% single participants. The most common education level was college graduate or more (50.89%), while only about 5% of participants had an education level less than high school. This may help explain why the most common household income was $100,000 or more (31.44%). Regarding health behaviors, an overwhelming majority of 64.75% of participants reported never smoking and an even higher 84.15% of participants reported never using electronic cigarettes.

## Day 3: Data Visualization

# Task 5: Visualizing Quantitative Variables
library(ggplot2) #loaded ggplot2 package

# 1- Histogram & Density Plot
ggplot(hints_clean, aes(x = BMI, fill = BirthSex)) + 
  geom_histogram(aes(y = ..density..), position = "identity", alpha = 0.5, bins = 30) + #Converted counts into density scale
  geom_density(alpha = 0.7, color = "black") +
  labs(
    title = "BMI Distribution by Birth Sex", #Titled chart
    x = "BMI", #Labeled x-axis as BMI
    y = "Density", #Labeled y-axis as Density
    fill = "Birth Sex" #Labeled legend for gender categories
  ) +
  theme_minimal()

## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

#The ouput shows a roughly bell-shaped distribution. Both distributions are right-skewed potentially due to extremely high BMI vlaues. The BMI peak for both sexes are between 20-25 with females concentrated more around the average and males extending to higher BMI ranges. The right tale for males extends further. Therefore, higher BMI's are more common in males while females are closer to the dataset average BMI.

#2- Boxplot
ggplot(hints_clean, aes(x = smokeStat, y = phq4, fill = smokeStat)) +
  geom_boxplot() +
  labs(
    title = "PHQ-4 Scores by Smoking Status", #Titled chart
    x = "Smoking Status", #Labeled x-axis as Smoking Status
    y = "PHQ-4 Score" #Labeled y-axis as PHQ-4 Score
  ) +
  theme_minimal() +
  theme(legend.position = "none")

#The output shows that current smokers group have the highest median PHQ‑4 score.The median line inside the red box (current smokers) sits noticeably higher than the medians for both former and never smokers, and the line for former and never smokers are similar. Therefore, on average, current smokers report more psychological distress than the other two groups.

#3- Scatterplot
ggplot(hints_clean, aes(x = Age, y = AvgDrinksPerWeek)) +
  geom_point(alpha = 0.5, color = "steelblue") +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  labs(
    title = "Relationship Between Age and Avg Drinks per Week", #Titled Chart
    x = "Age", #Labeled x-axis
    y = "Average Drinks per Week" #Labeled y-axis
  ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

#The output shows no apparent relationship between average drinks per week and age. The trend line is flat suggesting that drinking levels stays the same across all age groups. There are extreme outliers among all ages and they are not concentrated in a single age group.


# Task 6

#1- Bar Plot (Counts)
ggplot(hints_clean, aes(x = RaceEthn5, fill = RaceEthn5)) +
  geom_bar() +
  labs(
    title = "Participant Counts by Race/Ethnicity", #Titled chart
    x = "Race/Ethnicity", #Labeled x-axis
    y = "Count" #Labeled y-axis
  ) +
  theme_minimal() +
  theme(legend.position = "none")

#The output shows a bar plot with non-Hispanic white category makes up the largest group followed by Hispanic participants. Non-Hispanic white participants make up a considerable amount of the dataset sample, with roughly 3 times the size of their next counterpart- Hispanic.
  
#2- Bar Plot (Proportions)
ggplot(hints_clean, aes(x = EducA, fill = MaritalStatus)) +
  geom_bar(position = "fill") +  # proportion within each education level
  scale_y_continuous(labels = scales::percent_format()) +
  labs(
    title = "Proportion of Marital Status by Education Level", #Titled chart
    x = "Education Level", #Labeled x-axis
    y = "Proportion", #Labeled y-axis
    fill = "Marital Status"
  ) +
  theme_minimal()

#The output shows a bar plot where as education rises the proportion of marriage increases. Surprisnly there are higher rates of being widowed among individuals with lower education. Higher education levels are associated with a greater proportion of married individuals.

#3- Faceted Plot
ggplot(hints_clean, aes(x = Exercise)) +
  geom_histogram(bins = 40, fill = "lightgreen", color = "black") +
  labs(
    title = "Exercise Minutes Distribution by E-Cig Use", #Titled chart
    x = "Exercise Minutes", #Labeled x-axis
    y = "Count" #Labeled y-axis
  ) +
  facet_wrap(~ECigUse) +
  theme_minimal()

#The output shows Never e-cigarette users as having a large concentration of individuals with low exercise minutes. However, the bars have to wide of a range due to extreme exercise minutes outliers.

## Day 4: Bivariate Analysis & Correlation

# Task 7
ctable(hints_clean$BirthSex, hints_clean$HHInc, #Created a cross-tabulation (contingency table) between Birth Sex and Household Income
       prop = "r",       #included row percentages
       total = TRUE,     #included row and column totals
       style = "rmarkdown")

## ### Cross-Tabulation, Row Proportions  
## #### BirthSex * HHInc  
## **Data Frame:** hints_clean  
## 
## |          |       |             |                  |                  |                  |                  |              |               |
## |---------:|------:|------------:|-----------------:|-----------------:|-----------------:|-----------------:|-------------:|--------------:|
## |          | HHInc |    <$20,000 | $20,000-<$35,000 | $35,000-<$50,000 | $50,000-<$75,000 | $75,000-<100,000 |    ≥$100,000 |         Total |
## | BirthSex |       |             |                  |                  |                  |                  |              |               |
## |     Male |       | 544 (16.6%) |      437 (13.4%) |      426 (13.0%) |      531 (16.2%) |      415 (12.7%) |  916 (28.0%) | 3269 (100.0%) |
## |   Female |       | 272 (12.0%) |      221 ( 9.8%) |      249 (11.0%) |      397 (17.6%) |      293 (13.0%) |  829 (36.7%) | 2261 (100.0%) |
## |    Total |       | 816 (14.8%) |      658 (11.9%) |      675 (12.2%) |      928 (16.8%) |      708 (12.8%) | 1745 (31.6%) | 5530 (100.0%) |

chisq_test <- chisq.test(table(hints_clean$BirthSex, hints_clean$HHInc)) #Created a contingency table and ran a Chi-square test of independence

chisq_test #Printed result of chi-square test

## 
##  Pearson's Chi-squared test
## 
## data:  table(hints_clean$BirthSex, hints_clean$HHInc)
## X-squared = 71.328, df = 5, p-value = 5.421e-14

#ouput shows p-value of 5.21e-14, indicating p-value is extremely small and there is an association between birth sex and household income in this dataset. This suggests that the distribution of income levels differs between males and females.

# Task 8
corr_vars <- hints_clean[, c("Age", "BMI", "phq4", "Exercise")] #created subset of quant variables for correlation

cor_matrix <- cor(corr_vars, use = "complete.obs")  #excluded missing

cor_matrix

##                   Age         BMI        phq4     Exercise
## Age       1.000000000 -0.03574842 -0.24099932  0.009073485
## BMI      -0.035748415  1.00000000  0.07680228 -0.090328301
## phq4     -0.240999315  0.07680228  1.00000000 -0.059596341
## Exercise  0.009073485 -0.09032830 -0.05959634  1.000000000

library(corrplot) #loaded corrplot package

## Warning: package 'corrplot' was built under R version 4.5.3

## corrplot 0.95 loaded

corrplot(cor_matrix, method = "color", type = "upper", #Created a correlation plot (heatmap visualization) from a correlation matrix object
         addCoef.col = "black", tl.col = "black", tl.srt = 45)

#The strongest positive correlation are between BMI and phq4 values at 0.08. Suggesting a slight positive relationship between BMI and phq4 levels; as BMI increases so do phq4 (psychological stress). The strongest negative correlation relationship is between Age and pohq4 levels at -0.24. This suggests that as age increases, phq4 levels slightly decrease. Meaning as someone ages, their psychological distress levels can decrease.

## Day 5: Synthesis & Report Finalization

#Task 9: Executive Summary & Reflection

#The HINTS 2024 sample includes 7,278 participants. After cleaning up from negative and missing values, there are a total of 5,530 participants. The dataset examines demographic information such as age, birth sex, BMI, marital status, education level, household income. It also studies behavioral variables like cigarette and e-cigarette status, average drinks per week, and exercise. These participants range from ages 18-100 with the average age being 54. There is a slightly higher percentage of male participants than female participants.  One striking finding was the large sample of Non-Hispanic White participants, at nearly 3 times their next counterpart. Another striking find is the extreme outliers in exercise, BMI, and average drinks per week. Something I found interesting was the large percentage of being widowed among participants with an educational level less than high school. One notable correlation is the relationship between age and phq4 levels; as age increases psychological distress levels lower. This may reflect greater emotional resilience or more stable life circumstances among older adults. A notable association is the relationship between education and marital status. As higher educational attainment is achieved, marriage rates increase. This might suggest education is associated with higher relationship stability. The negative relationship between age and phq-4 scores suggests that younger adults experience higher levels of psychological distress. From a public health perspective, this highlights the importance of targeted mental‑health outreach and prevention for younger populations. A limitation is the large number of Non Hispanic White participants, which suggests that this group may have been oversampled. Therefore, the findings may not fully generalize to the broader U.S. population.

PUBH 422 Test 2

Kevin Roldan

2026-04-08

R Markdown