hints <- read.csv("C:/Users/Karin/Downloads/hints7_public data_2024.csv", header = TRUE) # downloads dataset
hints <- as.data.frame(hints) # converts data to data frame and names 'hints
dim(hints) # examines dimensions of data set
## [1] 7278 15
head(hints) # lists first six rows of data frame
## HHID Age BirthSex MaritalStatus AgeGrpB EducA HHInc TotalHousehold BMI
## 1 72100001 69 2 1 4 4 6 2 26.3
## 2 72100005 62 1 1 3 3 6 2 25.0
## 3 72100014 34 1 1 1 4 5 5 24.0
## 4 72100019 65 2 1 4 3 4 3 35.2
## 5 72100025 64 1 6 3 4 1 1 29.0
## 6 72100026 64 2 1 3 4 6 2 27.3
## smokeStat RaceEthn5 phq4 Exercise ECigUse AvgDrinksPerWeek
## 1 2 1 0 225 3 12.5
## 2 2 1 0 180 3 15.0
## 3 3 3 4 240 3 7.5
## 4 2 1 4 120 3 2.0
## 5 2 1 11 0 1 12.0
## 6 3 5 1 150 3 0.5
summary (hints) # provides statistical summary of dataframe
## HHID Age BirthSex MaritalStatus
## Min. :72100001 Min. : -9.00 Min. :-9.0000 Min. :-9.000
## 1st Qu.:72108592 1st Qu.: 36.00 1st Qu.: 1.0000 1st Qu.: 1.000
## Median :72117023 Median : 55.00 Median : 1.0000 Median : 2.000
## Mean :72251562 Mean : 50.38 Mean : 0.7236 Mean : 2.031
## 3rd Qu.:72325380 3rd Qu.: 69.00 3rd Qu.: 2.0000 3rd Qu.: 4.000
## Max. :72836009 Max. :102.00 Max. : 3.0000 Max. : 6.000
## AgeGrpB EducA HHInc TotalHousehold
## Min. :-9.000 Min. :-9.000 Min. :-9.00 Min. :-9.00
## 1st Qu.: 2.000 1st Qu.: 2.000 1st Qu.: 1.00 1st Qu.: 1.00
## Median : 3.000 Median : 3.000 Median : 4.00 Median : 2.00
## Mean : 2.128 Mean : 2.335 Mean : 2.36 Mean : 1.56
## 3rd Qu.: 4.000 3rd Qu.: 4.000 3rd Qu.: 6.00 3rd Qu.: 3.00
## Max. : 5.000 Max. : 4.000 Max. : 6.00 Max. :11.00
## BMI smokeStat RaceEthn5 phq4
## Min. :-9.00 Min. :-9.000 Min. :-9.0000 Min. :-9.000
## 1st Qu.:23.20 1st Qu.: 2.000 1st Qu.: 1.0000 1st Qu.: 0.000
## Median :27.10 Median : 3.000 Median : 1.0000 Median : 1.000
## Mean :26.44 Mean : 1.803 Mean : 0.7595 Mean : 1.592
## 3rd Qu.:31.90 3rd Qu.: 3.000 3rd Qu.: 3.0000 3rd Qu.: 3.000
## Max. :66.60 Max. : 3.000 Max. : 5.0000 Max. :12.000
## Exercise ECigUse AvgDrinksPerWeek
## Min. : -9.0 Min. :-9.000 Min. :-9.000
## 1st Qu.: 0.0 1st Qu.: 3.000 1st Qu.: 0.000
## Median : 90.0 Median : 3.000 Median : 0.000
## Mean : 168.7 Mean : 1.826 Mean : 1.533
## 3rd Qu.: 210.0 3rd Qu.: 3.000 3rd Qu.: 2.000
## Max. :6300.0 Max. : 3.000 Max. :75.000
str(hints)
## 'data.frame': 7278 obs. of 15 variables:
## $ HHID : int 72100001 72100005 72100014 72100019 72100025 72100026 72100027 72100028 72100031 72100035 ...
## $ Age : int 69 62 34 65 64 64 26 85 32 -9 ...
## $ BirthSex : int 2 1 1 2 1 2 1 2 1 3 ...
## $ MaritalStatus : int 1 1 1 1 6 1 6 4 6 6 ...
## $ AgeGrpB : int 4 3 1 4 3 3 1 5 1 -9 ...
## $ EducA : int 4 3 4 3 4 4 4 2 3 4 ...
## $ HHInc : int 6 6 5 4 1 6 4 3 5 1 ...
## $ TotalHousehold : int 2 2 5 3 1 2 1 2 1 -9 ...
## $ BMI : num 26.3 25 24 35.2 29 27.3 30.7 30.3 26.1 26.6 ...
## $ smokeStat : int 2 2 3 2 2 3 3 3 1 3 ...
## $ RaceEthn5 : int 1 1 3 1 1 5 2 1 1 3 ...
## $ phq4 : int 0 0 4 4 11 1 8 0 0 0 ...
## $ Exercise : int 225 180 240 120 0 150 80 90 360 0 ...
## $ ECigUse : int 3 3 3 3 1 3 3 3 1 3 ...
## $ AvgDrinksPerWeek: num 12.5 15 7.5 2 12 0.5 1 4 0 -9 ...
sum(is.na(hints)) # checks total number of NA in data
## [1] 0
any(hints<0, na.rm = TRUE) # checks if there are any negative variables
## [1] TRUE
sum(apply(hints<0, 1, any)) # provides total number of negative variables
## [1] 1721
hints_clean <- hints[!apply (hints <0, 1, any),] # removes any row with atleast 1 negative variable
dim(hints_clean) # examines dimensions of new clean data frame
## [1] 5557 15
#Recode 'Marital Status' to factor
hints_clean <- hints_clean %>%
mutate(MaritalStatus = factor(MaritalStatus,
levels = c(1, 2, 3, 4, 5, 6),
labels = c("Married", "living as married", "Divorced", "Widowed", "Separated", "Single")))
#Recode 'EducA' to factor
hints_clean <- hints_clean %>%
mutate(EducA = factor(EducA,
levels = c(1, 2, 3, 4),
labels = c("Less than High School", "High School", "Some College", "College Graduate or more")))
#Recode 'HHIinc' to factor
hints_clean <- hints_clean %>%
mutate(HHInc = factor(HHInc,
levels = c(1, 2, 3, 4, 5, 6),
labels = c("<$20,000", "$20,000-<$35,000", "$35,000-<$50,000", "$50,000-<$75,000", "$75,000-<100,000", "≥$100,000")))
#Recode 'smokeStat' to factor
hints_clean <- hints_clean %>%
mutate(smokeStat = factor(smokeStat,
levels = c(1, 2, 3),
labels = c("Current", "Former", "Never")))
#Recode 'RaceEthn5' to factor
hints_clean <- hints_clean %>%
mutate(RaceEthn5 = factor(RaceEthn5,
levels = c(1, 2, 3, 4, 5),
labels = c("Non-Hispanic White", "Non-Hispanic Black or African American", "Hispanic", "Non-Hispanic Asian", "Non-Hispanic Other")))
#Recode 'ECigUse' to factor
hints_clean <- hints_clean %>%
mutate(ECigUse = factor(ECigUse,
levels = c(1, 2, 3),
labels = c("Current", "Former", "Never")))
hints_clean <- hints_clean[hints_clean$BirthSex %in% c(1,2),] # only keeps BirthSex Rows with 1 or 2 (no 3)
#Recode 'BirthSex' to factor
hints_clean <- hints_clean %>%
mutate(BirthSex = factor(BirthSex,
levels = c(1, 2),
labels = c("Male", "Female")))
summary(hints_clean) # provides statistical summary of 'hints_clean' data
## HHID Age BirthSex MaritalStatus
## Min. :72100001 Min. : 18.00 Male :3269 Married :2595
## 1st Qu.:72108618 1st Qu.: 39.00 Female:2261 living as married: 336
## Median :72116882 Median : 56.00 Divorced : 781
## Mean :72246943 Mean : 54.19 Widowed : 474
## 3rd Qu.:72325100 3rd Qu.: 69.00 Separated : 119
## Max. :72836009 Max. :100.00 Single :1225
## AgeGrpB EducA HHInc
## Min. :1.000 Less than High School : 298 <$20,000 : 816
## 1st Qu.:2.000 High School : 827 $20,000-<$35,000: 658
## Median :3.000 Some College :1585 $35,000-<$50,000: 675
## Mean :2.896 College Graduate or more:2820 $50,000-<$75,000: 928
## 3rd Qu.:4.000 $75,000-<100,000: 708
## Max. :5.000 ≥$100,000 :1745
## TotalHousehold BMI smokeStat
## Min. : 1.000 Min. :10.20 Current: 541
## 1st Qu.: 1.000 1st Qu.:24.00 Former :1410
## Median : 2.000 Median :27.50 Never :3579
## Mean : 2.411 Mean :28.87
## 3rd Qu.: 3.000 3rd Qu.:32.38
## Max. :11.000 Max. :66.60
## RaceEthn5 phq4
## Non-Hispanic White :3127 Min. : 0.000
## Non-Hispanic Black or African American: 798 1st Qu.: 0.000
## Hispanic :1085 Median : 1.000
## Non-Hispanic Asian : 306 Mean : 2.222
## Non-Hispanic Other : 214 3rd Qu.: 4.000
## Max. :12.000
## Exercise ECigUse AvgDrinksPerWeek
## Min. : 0.0 Current: 280 Min. : 0.00
## 1st Qu.: 16.0 Former : 596 1st Qu.: 0.00
## Median : 100.0 Never :4654 Median : 0.25
## Mean : 181.3 Mean : 2.88
## 3rd Qu.: 225.0 3rd Qu.: 2.50
## Max. :5040.0 Max. :75.00
# creates new data set with only quantitative variables
variables_quantitative <- hints_clean [,c("Age","TotalHousehold","BMI","phq4","Exercise","AvgDrinksPerWeek")]
descr(variables_quantitative, # initiates descriptive of 'variables_quantitative'
stats= c("n.valid","mean","med","sd","min","max")) # indicates which statistics to include in the table
## Descriptive Statistics
## variables_quantitative
## N: 5530
##
## Age AvgDrinksPerWeek BMI Exercise phq4 TotalHousehold
## ------------- --------- ------------------ --------- ---------- --------- ----------------
## N.Valid 5530.00 5530.00 5530.00 5530.00 5530.00 5530.00
## Mean 54.19 2.88 28.87 181.33 2.22 2.41
## Median 56.00 0.25 27.50 100.00 1.00 2.00
## Std.Dev 17.76 6.56 6.96 310.31 2.98 1.38
## Min 18.00 0.00 10.20 0.00 0.00 1.00
## Max 100.00 75.00 66.60 5040.00 12.00 11.00
The age range for the Hint 2024 survey is from 18 years to 100 years old. The average age of the sample is 54.19. The survey examined a few factors including: average drinks per week, BMI, exercise, PHQ-4 (anxiety and depression measurement), and total number of individuals living in one household. The average BMI of participants in 2024 was 28.87. As for household size, the maximum amount of individuals living in one household was 11 and the minimum was 1 person. The max number of drinks per week was 75 drinks, while the minimum was 0 drinks. The highest PHQ-4 score was 12, while the lowest was 0. On average, the sample of survey participants has a PHQ-4 score of 2.22. Lastly, on average, the participants of the survey exercised for 181.33 minutes, with a low of 0 and a high of 5,040 minutes.
# creates new dataset with only qualitative variables
variables_qualitative <- hints_clean[,c("BirthSex","MaritalStatus","EducA","HHInc","smokeStat","RaceEthn5","ECigUse")]
freq(variables_qualitative) # initiates a frequency table of 'variables_qualitative'
## Frequencies
## variables_qualitative$BirthSex
## Type: Factor
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ------------ ------ --------- -------------- --------- --------------
## Male 3269 59.11 59.11 59.11 59.11
## Female 2261 40.89 100.00 40.89 100.00
## <NA> 0 0.00 100.00
## Total 5530 100.00 100.00 100.00 100.00
##
## variables_qualitative$MaritalStatus
## Type: Factor
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ----------------------- ------ --------- -------------- --------- --------------
## Married 2595 46.93 46.93 46.93 46.93
## living as married 336 6.08 53.00 6.08 53.00
## Divorced 781 14.12 67.12 14.12 67.12
## Widowed 474 8.57 75.70 8.57 75.70
## Separated 119 2.15 77.85 2.15 77.85
## Single 1225 22.15 100.00 22.15 100.00
## <NA> 0 0.00 100.00
## Total 5530 100.00 100.00 100.00 100.00
##
## variables_qualitative$EducA
## Type: Factor
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ------------------------------ ------ --------- -------------- --------- --------------
## Less than High School 298 5.39 5.39 5.39 5.39
## High School 827 14.95 20.34 14.95 20.34
## Some College 1585 28.66 49.01 28.66 49.01
## College Graduate or more 2820 50.99 100.00 50.99 100.00
## <NA> 0 0.00 100.00
## Total 5530 100.00 100.00 100.00 100.00
##
## variables_qualitative$HHInc
## Type: Factor
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ---------------------- ------ --------- -------------- --------- --------------
## <$20,000 816 14.76 14.76 14.76 14.76
## $20,000-<$35,000 658 11.90 26.65 11.90 26.65
## $35,000-<$50,000 675 12.21 38.86 12.21 38.86
## $50,000-<$75,000 928 16.78 55.64 16.78 55.64
## $75,000-<100,000 708 12.80 68.44 12.80 68.44
## ≥$100,000 1745 31.56 100.00 31.56 100.00
## <NA> 0 0.00 100.00
## Total 5530 100.00 100.00 100.00 100.00
##
## variables_qualitative$smokeStat
## Type: Factor
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ------------- ------ --------- -------------- --------- --------------
## Current 541 9.78 9.78 9.78 9.78
## Former 1410 25.50 35.28 25.50 35.28
## Never 3579 64.72 100.00 64.72 100.00
## <NA> 0 0.00 100.00
## Total 5530 100.00 100.00 100.00 100.00
##
## variables_qualitative$RaceEthn5
## Type: Factor
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## -------------------------------------------- ------ --------- -------------- --------- --------------
## Non-Hispanic White 3127 56.55 56.55 56.55 56.55
## Non-Hispanic Black or African American 798 14.43 70.98 14.43 70.98
## Hispanic 1085 19.62 90.60 19.62 90.60
## Non-Hispanic Asian 306 5.53 96.13 5.53 96.13
## Non-Hispanic Other 214 3.87 100.00 3.87 100.00
## <NA> 0 0.00 100.00
## Total 5530 100.00 100.00 100.00 100.00
##
## variables_qualitative$ECigUse
## Type: Factor
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ------------- ------ --------- -------------- --------- --------------
## Current 280 5.06 5.06 5.06 5.06
## Former 596 10.78 15.84 10.78 15.84
## Never 4654 84.16 100.00 84.16 100.00
## <NA> 0 0.00 100.00
## Total 5530 100.00 100.00 100.00 100.00
Of the participants, about 59.11% were Female and 40.89% were Male. The majority of participants identified as non-Hispanic White at 56.55%.46.94% were Married, while 22.15% were single. 50.99% of the HINTS 2024 participants were college graduates or higher. About 14.76% reported making less than $20,000, while 31.56% reported making over $100,000. A majority reported never using cigarettes (64.72%) or e-cigarettes (84.16%).
ggplot(hints_clean, aes(x=BMI, fill=BirthSex))+ # creates ggplot with x as 'BMI' and fill as 'BirthSex'
geom_density(alpha=0.5)+ # sets transparency of fill to 0.5
labs(title= "Density of BMI", # tiles ggplot to 'Density of BMI'
x= "BMI", # titles the x-axis 'Density'
y= "Density") # title the y-axis 'BMI'
ggplot (hints_clean, aes (x= smokeStat, y= phq4, fill=smokeStat ))+ # creates ggplot of phq4 based on smoking status
geom_boxplot()+ # initiates a boxplot
labs(title="PHQ4 and Smoking Status Boxplot") # adds title to boxplot
ggplot (hints_clean, aes(x=Age, y=AvgDrinksPerWeek))+ # initiates ggplot with two variables
geom_point()+ # indicates ggplot as a scatter plot
geom_smooth(method=lm)+ # adds linear regression line
labs(title= "Age and Average Drinks Per Week Scatter Plot") # adds a title to the plot
## `geom_smooth()` using formula = 'y ~ x'
Density of BMI: The density plot illustrates the density of BMI for both Male and Female. For both Male and Female, the density plot is right skewed. This means most people have lower values, while less people are on the higher range of BMI scale. The distribution for both sexes is similar.
PHQ-4 and Smoking Status Box plot: Individuals who identified as current smokers has the highest median PHQ-4 score. Thus, indicating that they have greater symptoms of anxiety or depression compared to former smokers or people who have never smoked.
Age and Average Drinks Per Week: The linear regression line has no slope indiccating that there is no correlation between age and average drinks per week.
ggplot(hints_clean, aes(x=RaceEthn5, fill=RaceEthn5, size=2))+ # initiates ggplot to compare with 'RaceEthn5' variable
geom_bar()+ # indicates plot as a bar plot
theme(axis.text.x = element_text(size=10, angle=45, hjust=1))+ # adjusts x-axis titles
labs(title= "HINTS 2024 Race and Ethnicity Bar Plot") # adds title to plot
ggplot(hints_clean, aes(x = EducA, fill = MaritalStatus)) + # creates plot with x-axis as 'EduA' and fill with 'MaritalStatus'
geom_bar(position = "fill")+ # indicates plot as a bar chart
labs (title= "Proportion of EducA by MaritalStatus", y= "Proportion")+ # adds title and indicates y axis as proportion
theme(axis.text.x = element_text(size=10, angle=45, hjust=1)) # adjusts the x-axis labels
ggplot (hints_clean, aes (x= ECigUse, y= Exercise, fill=ECigUse ))+ # creates ggplot of 'exercise' based on 'eciguse'
geom_boxplot()+ # indicates plot as boxplot
labs(title= "Exercise Minutes by ECig Use BoxPlot") # gives title to box plot
Race and Ethnicity: Most individuals who completed the survey identified as Non-Hispanic White followed by Hispanics.
Proportion of Educational Attainment by Marital Status: As educational attainment increases so does the proportion of participants who are married. Additionally, the proportion of single participants across all educational attainments is approximately the same.
Exercise Minutes by E-Cigarette Use Box Plot: The box plot shows that all three groups (e.g., current, former, and never) of e-cigarette users have a similar median of exercise minutes. Overall, the three different groups have very similar distributions of exercise.
cross_table_BirthSex_HHInc <- ctable( #initiates cross tab table
x=hints_clean$BirthSex, # indicates x-axis as 'BirthSex'
y=hints_clean$HHInc, # indicates y-axis as 'HHINc'
prop = "c") # show proportions for each column
print(cross_table_BirthSex_HHInc)
## Cross-Tabulation, Column Proportions
## BirthSex * HHInc
## Data Frame: hints_clean
##
## ---------- ------- -------------- ------------------ ------------------ ------------------ ------------------ --------------- ---------------
## HHInc <$20,000 $20,000-<$35,000 $35,000-<$50,000 $50,000-<$75,000 $75,000-<100,000 ≥$100,000 Total
## BirthSex
## Male 544 ( 66.7%) 437 ( 66.4%) 426 ( 63.1%) 531 ( 57.2%) 415 ( 58.6%) 916 ( 52.5%) 3269 ( 59.1%)
## Female 272 ( 33.3%) 221 ( 33.6%) 249 ( 36.9%) 397 ( 42.8%) 293 ( 41.4%) 829 ( 47.5%) 2261 ( 40.9%)
## Total 816 (100.0%) 658 (100.0%) 675 (100.0%) 928 (100.0%) 708 (100.0%) 1745 (100.0%) 5530 (100.0%)
## ---------- ------- -------------- ------------------ ------------------ ------------------ ------------------ --------------- ---------------
contingency_table_BirthSex_HHInc <- table (hints_clean$BirthSex, hints_clean$HHInc) # create a contingency table for 'BirthSex' and 'HHInc'
chi_square_BirthSex_HHInc <- chisq.test(contingency_table_BirthSex_HHInc) # runs a chi square test for 'BirthSex' and 'HHInc'
print(chi_square_BirthSex_HHInc) # print chi square test result
##
## Pearson's Chi-squared test
##
## data: contingency_table_BirthSex_HHInc
## X-squared = 71.328, df = 5, p-value = 5.421e-14
The p-value equals 5.421e-14 indicating that Household Income and Birth Sex have a strong association to each other due to statistical significance.
quant_variables <- hints_clean[,c("Age","BMI", "phq4","Exercise")] # creates new data set with needed variables
corrplot(cor(quant_variables), method="color") # creates corr plot with 4 quantitative variables
Correlation Analysis: Of the four quantitative variables, the strongest positive correlation is between PHQ-4 Scores and BMI. Thus, indicating that as BMI increases so does PHQ4 (anxiety and depression symptoms). From a public health perspective, this does make sense because an individual may experience body image issues which could lead to depression and/or anxiety. The strongest negative correlation is Age increase than PHQ-4 decreases. This correlation does align, because younger individuals are likely to face more pressures that can lead to anxiety. For example a young adult who is in college, may experience depressive symptoms due to not having a plan or increased anxiety from assignment/tests. As a person gets older, they may experience these situations less than a younger person or be able to identify and manage their symptoms better.
A data analysis of the 2024 Hints Information National Trends Survey (HINTS). The purpose of the HINTS is to collect nationally representative data on individuals’ knowledge, attitudes, and use of health-related information. This analysis examined both quantitative and qualitative variables. The sample size was 5,530. The age range of the sample was 18 to 100 years. Most of the sample identified as non-Hispanic White. One key finding was that both sexes have similar distributions of BMI. Second, it was age and the average number of drinks per week was uncorrelated. Lastly, another key finding was that individuals who identified as current smokers had higher median PHQ-4 scores than former smokers or people who never smoked. Understanding that smoking is used to cope with stress and anxiety could guide future public health interventions to decrease smoking prevalence. One limitation of the analysis was having to remove all variables with NA values in the sample, because it reduced the sample size, which could potentially affect the statistical analysis results.