Test 2 RMD

2026-04-10

2024 Health Information National Trends Survey (HINTS) Analysis

Task 1: Environment Setup & Data Import

hints <- read.csv("C:/Users/Karin/Downloads/hints7_public data_2024.csv", header = TRUE) # downloads dataset
hints <- as.data.frame(hints) # converts data to data frame and names 'hints
dim(hints) # examines dimensions of data set 
## [1] 7278   15
head(hints) # lists first six rows of data frame 
##       HHID Age BirthSex MaritalStatus AgeGrpB EducA HHInc TotalHousehold  BMI
## 1 72100001  69        2             1       4     4     6              2 26.3
## 2 72100005  62        1             1       3     3     6              2 25.0
## 3 72100014  34        1             1       1     4     5              5 24.0
## 4 72100019  65        2             1       4     3     4              3 35.2
## 5 72100025  64        1             6       3     4     1              1 29.0
## 6 72100026  64        2             1       3     4     6              2 27.3
##   smokeStat RaceEthn5 phq4 Exercise ECigUse AvgDrinksPerWeek
## 1         2         1    0      225       3             12.5
## 2         2         1    0      180       3             15.0
## 3         3         3    4      240       3              7.5
## 4         2         1    4      120       3              2.0
## 5         2         1   11        0       1             12.0
## 6         3         5    1      150       3              0.5
summary (hints) # provides statistical summary of dataframe 
##       HHID               Age            BirthSex       MaritalStatus   
##  Min.   :72100001   Min.   : -9.00   Min.   :-9.0000   Min.   :-9.000  
##  1st Qu.:72108592   1st Qu.: 36.00   1st Qu.: 1.0000   1st Qu.: 1.000  
##  Median :72117023   Median : 55.00   Median : 1.0000   Median : 2.000  
##  Mean   :72251562   Mean   : 50.38   Mean   : 0.7236   Mean   : 2.031  
##  3rd Qu.:72325380   3rd Qu.: 69.00   3rd Qu.: 2.0000   3rd Qu.: 4.000  
##  Max.   :72836009   Max.   :102.00   Max.   : 3.0000   Max.   : 6.000  
##     AgeGrpB           EducA            HHInc       TotalHousehold 
##  Min.   :-9.000   Min.   :-9.000   Min.   :-9.00   Min.   :-9.00  
##  1st Qu.: 2.000   1st Qu.: 2.000   1st Qu.: 1.00   1st Qu.: 1.00  
##  Median : 3.000   Median : 3.000   Median : 4.00   Median : 2.00  
##  Mean   : 2.128   Mean   : 2.335   Mean   : 2.36   Mean   : 1.56  
##  3rd Qu.: 4.000   3rd Qu.: 4.000   3rd Qu.: 6.00   3rd Qu.: 3.00  
##  Max.   : 5.000   Max.   : 4.000   Max.   : 6.00   Max.   :11.00  
##       BMI          smokeStat        RaceEthn5            phq4       
##  Min.   :-9.00   Min.   :-9.000   Min.   :-9.0000   Min.   :-9.000  
##  1st Qu.:23.20   1st Qu.: 2.000   1st Qu.: 1.0000   1st Qu.: 0.000  
##  Median :27.10   Median : 3.000   Median : 1.0000   Median : 1.000  
##  Mean   :26.44   Mean   : 1.803   Mean   : 0.7595   Mean   : 1.592  
##  3rd Qu.:31.90   3rd Qu.: 3.000   3rd Qu.: 3.0000   3rd Qu.: 3.000  
##  Max.   :66.60   Max.   : 3.000   Max.   : 5.0000   Max.   :12.000  
##     Exercise         ECigUse       AvgDrinksPerWeek
##  Min.   :  -9.0   Min.   :-9.000   Min.   :-9.000  
##  1st Qu.:   0.0   1st Qu.: 3.000   1st Qu.: 0.000  
##  Median :  90.0   Median : 3.000   Median : 0.000  
##  Mean   : 168.7   Mean   : 1.826   Mean   : 1.533  
##  3rd Qu.: 210.0   3rd Qu.: 3.000   3rd Qu.: 2.000  
##  Max.   :6300.0   Max.   : 3.000   Max.   :75.000

Task 2: Data Cleaning & Value Labeling

str(hints)
## 'data.frame':    7278 obs. of  15 variables:
##  $ HHID            : int  72100001 72100005 72100014 72100019 72100025 72100026 72100027 72100028 72100031 72100035 ...
##  $ Age             : int  69 62 34 65 64 64 26 85 32 -9 ...
##  $ BirthSex        : int  2 1 1 2 1 2 1 2 1 3 ...
##  $ MaritalStatus   : int  1 1 1 1 6 1 6 4 6 6 ...
##  $ AgeGrpB         : int  4 3 1 4 3 3 1 5 1 -9 ...
##  $ EducA           : int  4 3 4 3 4 4 4 2 3 4 ...
##  $ HHInc           : int  6 6 5 4 1 6 4 3 5 1 ...
##  $ TotalHousehold  : int  2 2 5 3 1 2 1 2 1 -9 ...
##  $ BMI             : num  26.3 25 24 35.2 29 27.3 30.7 30.3 26.1 26.6 ...
##  $ smokeStat       : int  2 2 3 2 2 3 3 3 1 3 ...
##  $ RaceEthn5       : int  1 1 3 1 1 5 2 1 1 3 ...
##  $ phq4            : int  0 0 4 4 11 1 8 0 0 0 ...
##  $ Exercise        : int  225 180 240 120 0 150 80 90 360 0 ...
##  $ ECigUse         : int  3 3 3 3 1 3 3 3 1 3 ...
##  $ AvgDrinksPerWeek: num  12.5 15 7.5 2 12 0.5 1 4 0 -9 ...
sum(is.na(hints)) # checks total number of NA in data 
## [1] 0
any(hints<0, na.rm = TRUE) # checks if there are any negative variables 
## [1] TRUE
sum(apply(hints<0, 1, any)) # provides total number of negative variables 
## [1] 1721
hints_clean <- hints[!apply (hints <0, 1, any),] # removes any row with atleast 1 negative variable 
dim(hints_clean) # examines dimensions of new clean data frame 
## [1] 5557   15
#Recode 'Marital Status' to factor
hints_clean <- hints_clean %>%
  mutate(MaritalStatus = factor(MaritalStatus,
                           levels = c(1, 2, 3, 4, 5, 6),
                           labels = c("Married", "living as married", "Divorced", "Widowed", "Separated", "Single")))
#Recode 'EducA' to factor
hints_clean <- hints_clean %>%
  mutate(EducA = factor(EducA,
                        levels = c(1, 2, 3, 4),
                        labels = c("Less than High School", "High School", "Some College", "College Graduate or more")))
#Recode 'HHIinc' to factor
hints_clean <- hints_clean %>%
  mutate(HHInc = factor(HHInc,
                        levels = c(1, 2, 3, 4, 5, 6),
                        labels = c("<$20,000", "$20,000-<$35,000", "$35,000-<$50,000", "$50,000-<$75,000", "$75,000-<100,000", "≥$100,000")))
#Recode 'smokeStat' to factor
hints_clean <- hints_clean %>%
  mutate(smokeStat = factor(smokeStat,
                                levels = c(1, 2, 3),
                                labels = c("Current", "Former", "Never")))
#Recode 'RaceEthn5' to factor
hints_clean <- hints_clean %>%
  mutate(RaceEthn5 = factor(RaceEthn5,
                        levels = c(1, 2, 3, 4, 5),
                        labels = c("Non-Hispanic White", "Non-Hispanic Black or African American", "Hispanic", "Non-Hispanic Asian", "Non-Hispanic Other")))     
#Recode 'ECigUse' to factor
hints_clean <- hints_clean %>%
  mutate(ECigUse = factor(ECigUse,
                            levels = c(1, 2, 3),
                            labels = c("Current", "Former", "Never"))) 
hints_clean <- hints_clean[hints_clean$BirthSex %in% c(1,2),] # only keeps BirthSex Rows with 1 or 2 (no 3)

#Recode 'BirthSex' to factor
hints_clean <- hints_clean %>%
  mutate(BirthSex = factor(BirthSex,
                           levels = c(1, 2),
                           labels = c("Male", "Female")))
summary(hints_clean) # provides statistical summary of 'hints_clean' data 
##       HHID               Age           BirthSex              MaritalStatus 
##  Min.   :72100001   Min.   : 18.00   Male  :3269   Married          :2595  
##  1st Qu.:72108618   1st Qu.: 39.00   Female:2261   living as married: 336  
##  Median :72116882   Median : 56.00                 Divorced         : 781  
##  Mean   :72246943   Mean   : 54.19                 Widowed          : 474  
##  3rd Qu.:72325100   3rd Qu.: 69.00                 Separated        : 119  
##  Max.   :72836009   Max.   :100.00                 Single           :1225  
##     AgeGrpB                           EducA                   HHInc     
##  Min.   :1.000   Less than High School   : 298   <$20,000        : 816  
##  1st Qu.:2.000   High School             : 827   $20,000-<$35,000: 658  
##  Median :3.000   Some College            :1585   $35,000-<$50,000: 675  
##  Mean   :2.896   College Graduate or more:2820   $50,000-<$75,000: 928  
##  3rd Qu.:4.000                                   $75,000-<100,000: 708  
##  Max.   :5.000                                   ≥$100,000       :1745  
##  TotalHousehold        BMI          smokeStat   
##  Min.   : 1.000   Min.   :10.20   Current: 541  
##  1st Qu.: 1.000   1st Qu.:24.00   Former :1410  
##  Median : 2.000   Median :27.50   Never  :3579  
##  Mean   : 2.411   Mean   :28.87                 
##  3rd Qu.: 3.000   3rd Qu.:32.38                 
##  Max.   :11.000   Max.   :66.60                 
##                                   RaceEthn5         phq4       
##  Non-Hispanic White                    :3127   Min.   : 0.000  
##  Non-Hispanic Black or African American: 798   1st Qu.: 0.000  
##  Hispanic                              :1085   Median : 1.000  
##  Non-Hispanic Asian                    : 306   Mean   : 2.222  
##  Non-Hispanic Other                    : 214   3rd Qu.: 4.000  
##                                                Max.   :12.000  
##     Exercise         ECigUse     AvgDrinksPerWeek
##  Min.   :   0.0   Current: 280   Min.   : 0.00   
##  1st Qu.:  16.0   Former : 596   1st Qu.: 0.00   
##  Median : 100.0   Never  :4654   Median : 0.25   
##  Mean   : 181.3                  Mean   : 2.88   
##  3rd Qu.: 225.0                  3rd Qu.: 2.50   
##  Max.   :5040.0                  Max.   :75.00

Task 3: Summary Statistics for Quantitative Variables

# creates new data set with only quantitative variables 
variables_quantitative <- hints_clean [,c("Age","TotalHousehold","BMI","phq4","Exercise","AvgDrinksPerWeek")]

descr(variables_quantitative, # initiates descriptive of 'variables_quantitative'
      stats= c("n.valid","mean","med","sd","min","max")) # indicates which statistics to include in the table
## Descriptive Statistics  
## variables_quantitative  
## N: 5530  
## 
##                     Age   AvgDrinksPerWeek       BMI   Exercise      phq4   TotalHousehold
## ------------- --------- ------------------ --------- ---------- --------- ----------------
##       N.Valid   5530.00            5530.00   5530.00    5530.00   5530.00          5530.00
##          Mean     54.19               2.88     28.87     181.33      2.22             2.41
##        Median     56.00               0.25     27.50     100.00      1.00             2.00
##       Std.Dev     17.76               6.56      6.96     310.31      2.98             1.38
##           Min     18.00               0.00     10.20       0.00      0.00             1.00
##           Max    100.00              75.00     66.60    5040.00     12.00            11.00

The age range for the Hint 2024 survey is from 18 years to 100 years old. The average age of the sample is 54.19. The survey examined a few factors including: average drinks per week, BMI, exercise, PHQ-4 (anxiety and depression measurement), and total number of individuals living in one household. The average BMI of participants in 2024 was 28.87. As for household size, the maximum amount of individuals living in one household was 11 and the minimum was 1 person. The max number of drinks per week was 75 drinks, while the minimum was 0 drinks. The highest PHQ-4 score was 12, while the lowest was 0. On average, the sample of survey participants has a PHQ-4 score of 2.22. Lastly, on average, the participants of the survey exercised for 181.33 minutes, with a low of 0 and a high of 5,040 minutes.

Task 4: Frequency Table for Qualitative Variables

# creates new dataset with only qualitative variables 
variables_qualitative <- hints_clean[,c("BirthSex","MaritalStatus","EducA","HHInc","smokeStat","RaceEthn5","ECigUse")]

freq(variables_qualitative) # initiates a frequency table of 'variables_qualitative' 
## Frequencies  
## variables_qualitative$BirthSex  
## Type: Factor  
## 
##                Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ------------ ------ --------- -------------- --------- --------------
##         Male   3269     59.11          59.11     59.11          59.11
##       Female   2261     40.89         100.00     40.89         100.00
##         <NA>      0                               0.00         100.00
##        Total   5530    100.00         100.00    100.00         100.00
## 
## variables_qualitative$MaritalStatus  
## Type: Factor  
## 
##                           Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------------------- ------ --------- -------------- --------- --------------
##                 Married   2595     46.93          46.93     46.93          46.93
##       living as married    336      6.08          53.00      6.08          53.00
##                Divorced    781     14.12          67.12     14.12          67.12
##                 Widowed    474      8.57          75.70      8.57          75.70
##               Separated    119      2.15          77.85      2.15          77.85
##                  Single   1225     22.15         100.00     22.15         100.00
##                    <NA>      0                               0.00         100.00
##                   Total   5530    100.00         100.00    100.00         100.00
## 
## variables_qualitative$EducA  
## Type: Factor  
## 
##                                  Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ------------------------------ ------ --------- -------------- --------- --------------
##          Less than High School    298      5.39           5.39      5.39           5.39
##                    High School    827     14.95          20.34     14.95          20.34
##                   Some College   1585     28.66          49.01     28.66          49.01
##       College Graduate or more   2820     50.99         100.00     50.99         100.00
##                           <NA>      0                               0.00         100.00
##                          Total   5530    100.00         100.00    100.00         100.00
## 
## variables_qualitative$HHInc  
## Type: Factor  
## 
##                          Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ---------------------- ------ --------- -------------- --------- --------------
##               <$20,000    816     14.76          14.76     14.76          14.76
##       $20,000-<$35,000    658     11.90          26.65     11.90          26.65
##       $35,000-<$50,000    675     12.21          38.86     12.21          38.86
##       $50,000-<$75,000    928     16.78          55.64     16.78          55.64
##       $75,000-<100,000    708     12.80          68.44     12.80          68.44
##              ≥$100,000   1745     31.56         100.00     31.56         100.00
##                   <NA>      0                               0.00         100.00
##                  Total   5530    100.00         100.00    100.00         100.00
## 
## variables_qualitative$smokeStat  
## Type: Factor  
## 
##                 Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ------------- ------ --------- -------------- --------- --------------
##       Current    541      9.78           9.78      9.78           9.78
##        Former   1410     25.50          35.28     25.50          35.28
##         Never   3579     64.72         100.00     64.72         100.00
##          <NA>      0                               0.00         100.00
##         Total   5530    100.00         100.00    100.00         100.00
## 
## variables_qualitative$RaceEthn5  
## Type: Factor  
## 
##                                                Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## -------------------------------------------- ------ --------- -------------- --------- --------------
##                           Non-Hispanic White   3127     56.55          56.55     56.55          56.55
##       Non-Hispanic Black or African American    798     14.43          70.98     14.43          70.98
##                                     Hispanic   1085     19.62          90.60     19.62          90.60
##                           Non-Hispanic Asian    306      5.53          96.13      5.53          96.13
##                           Non-Hispanic Other    214      3.87         100.00      3.87         100.00
##                                         <NA>      0                               0.00         100.00
##                                        Total   5530    100.00         100.00    100.00         100.00
## 
## variables_qualitative$ECigUse  
## Type: Factor  
## 
##                 Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ------------- ------ --------- -------------- --------- --------------
##       Current    280      5.06           5.06      5.06           5.06
##        Former    596     10.78          15.84     10.78          15.84
##         Never   4654     84.16         100.00     84.16         100.00
##          <NA>      0                               0.00         100.00
##         Total   5530    100.00         100.00    100.00         100.00

Of the participants, about 59.11% were Female and 40.89% were Male. The majority of participants identified as non-Hispanic White at 56.55%.46.94% were Married, while 22.15% were single. 50.99% of the HINTS 2024 participants were college graduates or higher. About 14.76% reported making less than $20,000, while 31.56% reported making over $100,000. A majority reported never using cigarettes (64.72%) or e-cigarettes (84.16%).

Task 5: Visualizing Quantitative Variables

ggplot(hints_clean, aes(x=BMI, fill=BirthSex))+ # creates ggplot with x as 'BMI' and fill as 'BirthSex'
  geom_density(alpha=0.5)+ # sets transparency of fill to 0.5 
  labs(title= "Density of BMI", # tiles ggplot to 'Density of BMI'
       x= "BMI", # titles the x-axis 'Density'
       y= "Density") # title the y-axis 'BMI' 

ggplot (hints_clean, aes (x= smokeStat, y= phq4, fill=smokeStat ))+ # creates ggplot of phq4 based on smoking status
  geom_boxplot()+ # initiates a boxplot 
  labs(title="PHQ4 and Smoking Status Boxplot") # adds title to boxplot

ggplot (hints_clean, aes(x=Age, y=AvgDrinksPerWeek))+ # initiates ggplot with two variables 
  geom_point()+ # indicates ggplot as a scatter plot 
  geom_smooth(method=lm)+ # adds linear regression line 
  labs(title= "Age and Average Drinks Per Week Scatter Plot") # adds a title to the plot 
## `geom_smooth()` using formula = 'y ~ x'

Density of BMI: The density plot illustrates the density of BMI for both Male and Female. For both Male and Female, the density plot is right skewed. This means most people have lower values, while less people are on the higher range of BMI scale. The distribution for both sexes is similar.

PHQ-4 and Smoking Status Box plot: Individuals who identified as current smokers has the highest median PHQ-4 score. Thus, indicating that they have greater symptoms of anxiety or depression compared to former smokers or people who have never smoked.

Age and Average Drinks Per Week: The linear regression line has no slope indiccating that there is no correlation between age and average drinks per week.

Task 6: Visualizing Qualitative Variables

ggplot(hints_clean, aes(x=RaceEthn5, fill=RaceEthn5, size=2))+ # initiates ggplot to compare with 'RaceEthn5' variable
  geom_bar()+ # indicates plot as a bar plot 
  theme(axis.text.x = element_text(size=10, angle=45, hjust=1))+ # adjusts x-axis titles 
  labs(title= "HINTS 2024 Race and Ethnicity Bar Plot") # adds title to plot 

ggplot(hints_clean, aes(x = EducA, fill = MaritalStatus)) + # creates plot with x-axis as 'EduA' and fill with 'MaritalStatus'
  geom_bar(position = "fill")+ # indicates plot as a bar chart 
  labs (title= "Proportion of EducA by MaritalStatus", y= "Proportion")+ # adds title and indicates y axis as proportion 
  theme(axis.text.x = element_text(size=10, angle=45, hjust=1)) # adjusts the x-axis labels 

ggplot (hints_clean, aes (x= ECigUse, y= Exercise, fill=ECigUse ))+ # creates ggplot of 'exercise' based on 'eciguse'
  geom_boxplot()+ # indicates plot as boxplot
  labs(title= "Exercise Minutes by ECig Use BoxPlot") # gives title to box plot

Race and Ethnicity: Most individuals who completed the survey identified as Non-Hispanic White followed by Hispanics.

Proportion of Educational Attainment by Marital Status: As educational attainment increases so does the proportion of participants who are married. Additionally, the proportion of single participants across all educational attainments is approximately the same.

Exercise Minutes by E-Cigarette Use Box Plot: The box plot shows that all three groups (e.g., current, former, and never) of e-cigarette users have a similar median of exercise minutes. Overall, the three different groups have very similar distributions of exercise.

Task 7: Cross-Tabulation

cross_table_BirthSex_HHInc <- ctable( #initiates cross tab table 
  x=hints_clean$BirthSex, # indicates x-axis as 'BirthSex'
  y=hints_clean$HHInc, # indicates y-axis as 'HHINc'
  prop = "c") # show proportions for each column 

print(cross_table_BirthSex_HHInc)
## Cross-Tabulation, Column Proportions  
## BirthSex * HHInc  
## Data Frame: hints_clean  
## 
## ---------- ------- -------------- ------------------ ------------------ ------------------ ------------------ --------------- ---------------
##              HHInc       <$20,000   $20,000-<$35,000   $35,000-<$50,000   $50,000-<$75,000   $75,000-<100,000       ≥$100,000           Total
##   BirthSex                                                                                                                                   
##       Male           544 ( 66.7%)       437 ( 66.4%)       426 ( 63.1%)       531 ( 57.2%)       415 ( 58.6%)    916 ( 52.5%)   3269 ( 59.1%)
##     Female           272 ( 33.3%)       221 ( 33.6%)       249 ( 36.9%)       397 ( 42.8%)       293 ( 41.4%)    829 ( 47.5%)   2261 ( 40.9%)
##      Total           816 (100.0%)       658 (100.0%)       675 (100.0%)       928 (100.0%)       708 (100.0%)   1745 (100.0%)   5530 (100.0%)
## ---------- ------- -------------- ------------------ ------------------ ------------------ ------------------ --------------- ---------------
contingency_table_BirthSex_HHInc <- table (hints_clean$BirthSex, hints_clean$HHInc) # create a contingency table for 'BirthSex' and 'HHInc'
chi_square_BirthSex_HHInc <- chisq.test(contingency_table_BirthSex_HHInc) # runs a chi square test for 'BirthSex' and 'HHInc' 
print(chi_square_BirthSex_HHInc) # print chi square test result 
## 
##  Pearson's Chi-squared test
## 
## data:  contingency_table_BirthSex_HHInc
## X-squared = 71.328, df = 5, p-value = 5.421e-14

The p-value equals 5.421e-14 indicating that Household Income and Birth Sex have a strong association to each other due to statistical significance.

Task 8: Correlation Analysis

quant_variables <- hints_clean[,c("Age","BMI", "phq4","Exercise")] # creates new data set with needed variables 
corrplot(cor(quant_variables), method="color") # creates corr plot with 4 quantitative variables 

Correlation Analysis: Of the four quantitative variables, the strongest positive correlation is between PHQ-4 Scores and BMI. Thus, indicating that as BMI increases so does PHQ4 (anxiety and depression symptoms). From a public health perspective, this does make sense because an individual may experience body image issues which could lead to depression and/or anxiety. The strongest negative correlation is Age increase than PHQ-4 decreases. This correlation does align, because younger individuals are likely to face more pressures that can lead to anxiety. For example a young adult who is in college, may experience depressive symptoms due to not having a plan or increased anxiety from assignment/tests. As a person gets older, they may experience these situations less than a younger person or be able to identify and manage their symptoms better.

Narrative Summary

A data analysis of the 2024 Hints Information National Trends Survey (HINTS). The purpose of the HINTS is to collect nationally representative data on individuals’ knowledge, attitudes, and use of health-related information. This analysis examined both quantitative and qualitative variables. The sample size was 5,530. The age range of the sample was 18 to 100 years. Most of the sample identified as non-Hispanic White. One key finding was that both sexes have similar distributions of BMI. Second, it was age and the average number of drinks per week was uncorrelated. Lastly, another key finding was that individuals who identified as current smokers had higher median PHQ-4 scores than former smokers or people who never smoked. Understanding that smoking is used to cope with stress and anxiety could guide future public health interventions to decrease smoking prevalence. One limitation of the analysis was having to remove all variables with NA values in the sample, because it reduced the sample size, which could potentially affect the statistical analysis results.