library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Hints <- read_csv("C:/Users/her_n/OneDrive/Documents/MPH Program/Spring 2026/PUBH422 Statistical Planning for Health Data/Tests/Exam 2/hints7_public data_2024.csv") #importing Hints 2024 data into R
## Rows: 7278 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (15): HHID, Age, BirthSex, MaritalStatus, AgeGrpB, EducA, HHInc, TotalHo...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Hints #printing the imported data
## # A tibble: 7,278 × 15
## HHID Age BirthSex MaritalStatus AgeGrpB EducA HHInc TotalHousehold BMI
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.21e7 69 2 1 4 4 6 2 26.3
## 2 7.21e7 62 1 1 3 3 6 2 25
## 3 7.21e7 34 1 1 1 4 5 5 24
## 4 7.21e7 65 2 1 4 3 4 3 35.2
## 5 7.21e7 64 1 6 3 4 1 1 29
## 6 7.21e7 64 2 1 3 4 6 2 27.3
## 7 7.21e7 26 1 6 1 4 4 1 30.7
## 8 7.21e7 85 2 4 5 2 3 2 30.3
## 9 7.21e7 32 1 6 1 3 5 1 26.1
## 10 7.21e7 -9 3 6 -9 4 1 -9 26.6
## # ℹ 7,268 more rows
## # ℹ 6 more variables: smokeStat <dbl>, RaceEthn5 <dbl>, phq4 <dbl>,
## # Exercise <dbl>, ECigUse <dbl>, AvgDrinksPerWeek <dbl>
dim(Hints) #displaying the dimensions of data; 7278 observations, 15 variables
## [1] 7278 15
There are 7,278 observations and 15 variables in the data.
summary(Hints) #summary of the imported data
## HHID Age BirthSex MaritalStatus
## Min. :72100001 Min. : -9.00 Min. :-9.0000 Min. :-9.000
## 1st Qu.:72108592 1st Qu.: 36.00 1st Qu.: 1.0000 1st Qu.: 1.000
## Median :72117023 Median : 55.00 Median : 1.0000 Median : 2.000
## Mean :72251562 Mean : 50.38 Mean : 0.7236 Mean : 2.031
## 3rd Qu.:72325380 3rd Qu.: 69.00 3rd Qu.: 2.0000 3rd Qu.: 4.000
## Max. :72836009 Max. :102.00 Max. : 3.0000 Max. : 6.000
## AgeGrpB EducA HHInc TotalHousehold
## Min. :-9.000 Min. :-9.000 Min. :-9.00 Min. :-9.00
## 1st Qu.: 2.000 1st Qu.: 2.000 1st Qu.: 1.00 1st Qu.: 1.00
## Median : 3.000 Median : 3.000 Median : 4.00 Median : 2.00
## Mean : 2.128 Mean : 2.335 Mean : 2.36 Mean : 1.56
## 3rd Qu.: 4.000 3rd Qu.: 4.000 3rd Qu.: 6.00 3rd Qu.: 3.00
## Max. : 5.000 Max. : 4.000 Max. : 6.00 Max. :11.00
## BMI smokeStat RaceEthn5 phq4
## Min. :-9.00 Min. :-9.000 Min. :-9.0000 Min. :-9.000
## 1st Qu.:23.20 1st Qu.: 2.000 1st Qu.: 1.0000 1st Qu.: 0.000
## Median :27.10 Median : 3.000 Median : 1.0000 Median : 1.000
## Mean :26.44 Mean : 1.803 Mean : 0.7595 Mean : 1.592
## 3rd Qu.:31.90 3rd Qu.: 3.000 3rd Qu.: 3.0000 3rd Qu.: 3.000
## Max. :66.60 Max. : 3.000 Max. : 5.0000 Max. :12.000
## Exercise ECigUse AvgDrinksPerWeek
## Min. : -9.0 Min. :-9.000 Min. :-9.000
## 1st Qu.: 0.0 1st Qu.: 3.000 1st Qu.: 0.000
## Median : 90.0 Median : 3.000 Median : 0.000
## Mean : 168.7 Mean : 1.826 Mean : 1.533
## 3rd Qu.: 210.0 3rd Qu.: 3.000 3rd Qu.: 2.000
## Max. :6300.0 Max. : 3.000 Max. :75.000
head(Hints) #displaying the first 6 rows (observations) of the data
## # A tibble: 6 × 15
## HHID Age BirthSex MaritalStatus AgeGrpB EducA HHInc TotalHousehold BMI
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 72100001 69 2 1 4 4 6 2 26.3
## 2 72100005 62 1 1 3 3 6 2 25
## 3 72100014 34 1 1 1 4 5 5 24
## 4 72100019 65 2 1 4 3 4 3 35.2
## 5 72100025 64 1 6 3 4 1 1 29
## 6 72100026 64 2 1 3 4 6 2 27.3
## # ℹ 6 more variables: smokeStat <dbl>, RaceEthn5 <dbl>, phq4 <dbl>,
## # Exercise <dbl>, ECigUse <dbl>, AvgDrinksPerWeek <dbl>
Examining the structure of the Hints data.
str(Hints) #overview of the imported data's structure: column names, data types, and dimensions; all 15 columns were read as numeric, including the coded categorical variables
## spc_tbl_ [7,278 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ HHID : num [1:7278] 72100001 72100005 72100014 72100019 72100025 ...
## $ Age : num [1:7278] 69 62 34 65 64 64 26 85 32 -9 ...
## $ BirthSex : num [1:7278] 2 1 1 2 1 2 1 2 1 3 ...
## $ MaritalStatus : num [1:7278] 1 1 1 1 6 1 6 4 6 6 ...
## $ AgeGrpB : num [1:7278] 4 3 1 4 3 3 1 5 1 -9 ...
## $ EducA : num [1:7278] 4 3 4 3 4 4 4 2 3 4 ...
## $ HHInc : num [1:7278] 6 6 5 4 1 6 4 3 5 1 ...
## $ TotalHousehold : num [1:7278] 2 2 5 3 1 2 1 2 1 -9 ...
## $ BMI : num [1:7278] 26.3 25 24 35.2 29 27.3 30.7 30.3 26.1 26.6 ...
## $ smokeStat : num [1:7278] 2 2 3 2 2 3 3 3 1 3 ...
## $ RaceEthn5 : num [1:7278] 1 1 3 1 1 5 2 1 1 3 ...
## $ phq4 : num [1:7278] 0 0 4 4 11 1 8 0 0 0 ...
## $ Exercise : num [1:7278] 225 180 240 120 0 150 80 90 360 0 ...
## $ ECigUse : num [1:7278] 3 3 3 3 1 3 3 3 1 3 ...
## $ AvgDrinksPerWeek: num [1:7278] 12.5 15 7.5 2 12 0.5 1 4 0 -9 ...
## - attr(*, "spec")=
## .. cols(
## .. HHID = col_double(),
## .. Age = col_double(),
## .. BirthSex = col_double(),
## .. MaritalStatus = col_double(),
## .. AgeGrpB = col_double(),
## .. EducA = col_double(),
## .. HHInc = col_double(),
## .. TotalHousehold = col_double(),
## .. BMI = col_double(),
## .. smokeStat = col_double(),
## .. RaceEthn5 = col_double(),
## .. phq4 = col_double(),
## .. Exercise = col_double(),
## .. ECigUse = col_double(),
## .. AvgDrinksPerWeek = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
sum(is.na(Hints)) #summing the number of NAs in the dataset
## [1] 0
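The original dataset contains no explicit NAs. However, missing responses appear to be stored as negative codes (for example, Age = -9 in row 10 of the printed data), which is.na() does not detect.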
summary(Hints) #confirming which variables have negative values
## HHID Age BirthSex MaritalStatus
## Min. :72100001 Min. : -9.00 Min. :-9.0000 Min. :-9.000
## 1st Qu.:72108592 1st Qu.: 36.00 1st Qu.: 1.0000 1st Qu.: 1.000
## Median :72117023 Median : 55.00 Median : 1.0000 Median : 2.000
## Mean :72251562 Mean : 50.38 Mean : 0.7236 Mean : 2.031
## 3rd Qu.:72325380 3rd Qu.: 69.00 3rd Qu.: 2.0000 3rd Qu.: 4.000
## Max. :72836009 Max. :102.00 Max. : 3.0000 Max. : 6.000
## AgeGrpB EducA HHInc TotalHousehold
## Min. :-9.000 Min. :-9.000 Min. :-9.00 Min. :-9.00
## 1st Qu.: 2.000 1st Qu.: 2.000 1st Qu.: 1.00 1st Qu.: 1.00
## Median : 3.000 Median : 3.000 Median : 4.00 Median : 2.00
## Mean : 2.128 Mean : 2.335 Mean : 2.36 Mean : 1.56
## 3rd Qu.: 4.000 3rd Qu.: 4.000 3rd Qu.: 6.00 3rd Qu.: 3.00
## Max. : 5.000 Max. : 4.000 Max. : 6.00 Max. :11.00
## BMI smokeStat RaceEthn5 phq4
## Min. :-9.00 Min. :-9.000 Min. :-9.0000 Min. :-9.000
## 1st Qu.:23.20 1st Qu.: 2.000 1st Qu.: 1.0000 1st Qu.: 0.000
## Median :27.10 Median : 3.000 Median : 1.0000 Median : 1.000
## Mean :26.44 Mean : 1.803 Mean : 0.7595 Mean : 1.592
## 3rd Qu.:31.90 3rd Qu.: 3.000 3rd Qu.: 3.0000 3rd Qu.: 3.000
## Max. :66.60 Max. : 3.000 Max. : 5.0000 Max. :12.000
## Exercise ECigUse AvgDrinksPerWeek
## Min. : -9.0 Min. :-9.000 Min. :-9.000
## 1st Qu.: 0.0 1st Qu.: 3.000 1st Qu.: 0.000
## Median : 90.0 Median : 3.000 Median : 0.000
## Mean : 168.7 Mean : 1.826 Mean : 1.533
## 3rd Qu.: 210.0 3rd Qu.: 3.000 3rd Qu.: 2.000
## Max. :6300.0 Max. : 3.000 Max. :75.000
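The summary shows that every variable except HHID has a minimum of -9. An age or BMI of -9 is not a valid measurement, so these negative codes are treated as missing values and removed in the next step.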
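Reading minimums off summary() works, but counting the negative codes per column shows how much data each variable would lose. The following is a minimal sketch using base R on a hypothetical toy data frame (`toy` and its values are invented for illustration, with -9 standing in for the missing-value code); it is not part of the original analysis.

```r
# Hypothetical toy data; -9 stands in for the negative missing-value code.
toy <- data.frame(
  Age  = c(34, -9, 52, 61),
  BMI  = c(26.3, 25.0, -9, -9),
  phq4 = c(0, 3, 4, 1)
)

# `toy < 0` yields a logical matrix; colSums() counts the TRUEs per column.
neg_counts <- colSums(toy < 0)
neg_counts

# Show only the columns that actually contain negative codes.
neg_counts[neg_counts > 0]
```

Applied to the real data, the same call would be `colSums(Hints < 0)`.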
library(dplyr) #dplyr is already attached by library(tidyverse); loaded explicitly here for filter()
Hints_clean <- Hints %>%
  filter( #keeping only rows with no negative (missing-value) codes
    HHID > 0,
    Age > 0,
    BirthSex > 0,
    MaritalStatus > 0,
    AgeGrpB > 0,
    EducA > 0,
    HHInc > 0,
    TotalHousehold > 0,
    BMI >= 0,
    smokeStat > 0,
    RaceEthn5 > 0,
    phq4 > 0, #note: this also drops valid PHQ-4 scores of 0
    Exercise >= 0,
    ECigUse > 0,
    AvgDrinksPerWeek > 0 #note: this also drops participants reporting 0 drinks per week
  )
dim(Hints_clean) #displaying the dimensions of data; 1645 observations, 15 variables
## [1] 1645 15
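The filtered data has 1,645 observations and 15 variables. The drop from 7,278 rows is not due to negative codes alone: because the filter used phq4 > 0 and AvgDrinksPerWeek > 0, participants with a valid PHQ-4 score of 0 or with 0 drinks per week were also excluded.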
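To illustrate how much the choice between `> 0` and `>= 0` matters, here is a hedged sketch on a hypothetical toy data frame (the data and the names `toy`, `toy_nonneg`, and `toy_strict` are assumptions made for this example). It contrasts a filter that removes only negative codes with one that also removes zeros. It uses dplyr's `if_all()`, available in dplyr 1.0.4 and later.

```r
library(dplyr)

# Hypothetical toy data; -9 stands in for the negative missing-value code.
toy <- data.frame(
  Age              = c(34, -9, 52, 61, 45),
  phq4             = c(0, 3, -9, 5, 2),
  AvgDrinksPerWeek = c(0, 2.5, 1, 0, 3)
)

# Remove only rows containing a negative code; valid zeros are kept.
toy_nonneg <- toy %>%
  filter(if_all(everything(), ~ .x >= 0))

# Strictly positive filter: rows with a zero score or zero drinks are lost too.
toy_strict <- toy %>%
  filter(Age > 0, phq4 > 0, AvgDrinksPerWeek > 0)

nrow(toy_nonneg)
nrow(toy_strict)
```

Whether zeros should be kept depends on the research question; the sketch only shows that the two filters select different samples.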
summary(Hints_clean) #confirming negative values were removed from the clean dataset
## HHID Age BirthSex MaritalStatus
## Min. :72100014 Min. :18.00 Min. :1.000 Min. :1.000
## 1st Qu.:72108806 1st Qu.:35.00 1st Qu.:1.000 1st Qu.:1.000
## Median :72116927 Median :48.00 Median :1.000 Median :2.000
## Mean :72247179 Mean :49.47 Mean :1.387 Mean :2.989
## 3rd Qu.:72325148 3rd Qu.:64.00 3rd Qu.:2.000 3rd Qu.:6.000
## Max. :72836006 Max. :93.00 Max. :3.000 Max. :6.000
## AgeGrpB EducA HHInc TotalHousehold
## Min. :1.000 Min. :1.000 Min. :1.000 Min. : 1.000
## 1st Qu.:2.000 1st Qu.:3.000 1st Qu.:3.000 1st Qu.: 1.000
## Median :2.000 Median :4.000 Median :4.000 Median : 2.000
## Mean :2.554 Mean :3.374 Mean :4.154 Mean : 2.375
## 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:6.000 3rd Qu.: 3.000
## Max. :5.000 Max. :4.000 Max. :6.000 Max. :11.000
## BMI smokeStat RaceEthn5 phq4
## Min. :10.80 Min. :1.000 Min. :1.000 Min. : 1.000
## 1st Qu.:23.70 1st Qu.:2.000 1st Qu.:1.000 1st Qu.: 2.000
## Median :27.40 Median :3.000 Median :1.000 Median : 3.000
## Mean :28.54 Mean :2.452 Mean :1.799 Mean : 3.785
## 3rd Qu.:32.00 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.: 5.000
## Max. :63.10 Max. :3.000 Max. :5.000 Max. :12.000
## Exercise ECigUse AvgDrinksPerWeek
## Min. : 0.0 Min. :1.000 Min. : 0.250
## 1st Qu.: 30.0 1st Qu.:3.000 1st Qu.: 0.750
## Median : 90.0 Median :3.000 Median : 2.500
## Mean : 168.1 Mean :2.685 Mean : 5.784
## 3rd Qu.: 210.0 3rd Qu.:3.000 3rd Qu.: 7.500
## Max. :5040.0 Max. :3.000 Max. :75.000
The summary confirms that the negative values were removed: every minimum is now 0 or greater.
sum(is.na(Hints_clean)) #confirming if there are any NAs after removing the negatives
## [1] 0
This confirms that no NAs are present in the data after the negative values were removed.
#Recoding the qualitative variables as factors
#BirthSex, MaritalStatus, and ECigUse are overwritten in place; Education, HouseholdInc,
#SmokingStatus, and RaceEthn are added as new columns next to the original numeric codes
Hints_clean <- Hints_clean %>%
  mutate(
    BirthSex = factor(BirthSex, #each code in `levels` is matched with its label; codes not listed become NA
                      levels = c(1,2),
                      labels = c("Male","Female")),
    MaritalStatus = factor(MaritalStatus,
                           levels = c(1,2,3,4,5,6),
                           labels = c("Married","Living as Married","Divorced","Widowed","Separated","Single")),
    Education = factor(EducA,
                       levels = c(1,2,3,4),
                       labels = c("Less than High School","High School","Some College","College Graduate or More")),
    HouseholdInc = factor(HHInc,
                          levels = c(1,2,3,4,5,6),
                          labels = c("~<$20,000","~$20,000-<$35,000","~$35,000-<$50,000","~$50,000-<$75,000","~$75,000-<$100,000","~>=$100,000")),
    SmokingStatus = factor(smokeStat,
                           levels = c(1,2,3),
                           labels = c("Current","Former","Never")),
    RaceEthn = factor(RaceEthn5,
                      levels = c(1,2,3,4),
                      labels = c("Non-Hispanic White","Non-Hispanic Black/AA","Hispanic","Non-Hispanic Asian")),
    ECigUse = factor(ECigUse,
                     levels = c(1,2,3),
                     labels = c("Current","Former","Never"))
  )
Each categorical variable was converted to a factor so that every numeric code listed in levels displays its corresponding label.
sum(is.na(Hints_clean)) #confirming how many NAs I currently have after changing to factors and relabeling
## [1] 77
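After recoding, the data contained 77 NAs. factor() converts any value not listed in levels to NA, and the earlier summary shows codes that were left unlabelled: BirthSex had a maximum of 3 (only 1 and 2 were labelled) and RaceEthn5 had a maximum of 5 (only 1 through 4 were labelled).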
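The way factor() produces NAs for unlisted codes can be seen in a small base R sketch. The vectors below are hypothetical and only mimic the situation described above (a code of 3 with labels for 1 and 2, and a code of 5 with levels 1 through 4); `colSums(is.na())` then shows which column the NAs come from.

```r
# Hypothetical codes: 1 and 2 are labelled, 3 is not listed in `levels`.
sex_code <- c(1, 2, 3, 1, 2)
sex <- factor(sex_code, levels = c(1, 2), labels = c("Male", "Female"))

# The unlisted code 3 silently becomes NA.
sum(is.na(sex))

# Per-column NA counts locate the source of the NAs in a data frame.
toy <- data.frame(
  sex  = sex,
  race = factor(c(1, 5, 2, 5, 4), levels = 1:4)  # code 5 is not listed
)
na_by_col <- colSums(is.na(toy))
na_by_col
```

Running `colSums(is.na(Hints_clean))` right after the recoding step would show in the same way which factor columns hold the 77 NAs.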
Hints_clean <- Hints_clean %>% #dropping every row that contains at least one of the 77 NAs
  tidyr::drop_na()
sum(is.na(Hints_clean)) #confirming how many NAs are present after removing them from the data
## [1] 0
summary(Hints_clean) #confirming the only values present within the cleaned dataset fall within the codebook
## HHID Age BirthSex MaritalStatus
## Min. :72100014 Min. :18.00 Male :975 Married :686
## 1st Qu.:72108949 1st Qu.:35.00 Female:595 Living as Married:122
## Median :72117151 Median :48.00 Divorced :209
## Mean :72249378 Mean :49.66 Widowed : 97
## 3rd Qu.:72325213 3rd Qu.:64.00 Separated : 32
## Max. :72836006 Max. :93.00 Single :424
## AgeGrpB EducA HHInc TotalHousehold
## Min. :1.000 Min. :1.000 Min. :1.000 Min. : 1.000
## 1st Qu.:2.000 1st Qu.:3.000 1st Qu.:3.000 1st Qu.: 1.000
## Median :2.000 Median :4.000 Median :4.000 Median : 2.000
## Mean :2.568 Mean :3.384 Mean :4.176 Mean : 2.368
## 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:6.000 3rd Qu.: 3.000
## Max. :5.000 Max. :4.000 Max. :6.000 Max. :11.000
## BMI smokeStat RaceEthn5 phq4
## Min. :10.8 Min. :1.000 Min. :1.000 Min. : 1.000
## 1st Qu.:23.7 1st Qu.:2.000 1st Qu.:1.000 1st Qu.: 2.000
## Median :27.3 Median :3.000 Median :1.000 Median : 3.000
## Mean :28.5 Mean :2.458 Mean :1.653 Mean : 3.756
## 3rd Qu.:31.9 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.: 5.000
## Max. :63.1 Max. :3.000 Max. :4.000 Max. :12.000
## Exercise ECigUse AvgDrinksPerWeek
## Min. : 0.0 Current: 112 Min. : 0.250
## 1st Qu.: 30.0 Former : 262 1st Qu.: 0.750
## Median : 90.0 Never :1196 Median : 2.500
## Mean : 163.9 Mean : 5.776
## 3rd Qu.: 210.0 3rd Qu.: 7.500
## Max. :3600.0 Max. :75.000
## Education HouseholdInc SmokingStatus
## Less than High School : 63 ~<$20,000 :186 Current:206
## High School :173 ~$20,000-<$35,000 :159 Former :439
## Some College :432 ~$35,000-<$50,000 :186 Never :925
## College Graduate or More:902 ~$50,000-<$75,000 :262
## ~$75,000-<$100,000:216
## ~>=$100,000 :561
## RaceEthn
## Non-Hispanic White :985
## Non-Hispanic Black/AA:209
## Hispanic :312
## Non-Hispanic Asian : 64
##
##
The sum confirms that no NAs remain in the cleaned data, and the summary shows each factor level displayed with its corresponding label.
dim(Hints_clean) #displaying the dimensions of data; 1570 observations, 19 variables
## [1] 1570 19
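The cleaned data has 1,570 observations and 19 variables. Dropping the rows with NAs removed 75 observations (1,645 to 1,570), and the variable count rose from 15 to 19 because the recoding step added four new factor columns (Education, HouseholdInc, SmokingStatus, RaceEthn) alongside the original numeric EducA, HHInc, smokeStat, and RaceEthn5.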
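Loading the packages used for descriptive statistics. Attaching plyr after dplyr masks several dplyr functions (see the message below), and the summarytools functions used here (descr(), freq(), ctable()) do not need plyr to be attached with library().
library(plyr) #attached after dplyr, which triggers the masking warning below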
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, mutate, rename, summarise, summarize
## The following object is masked from 'package:purrr':
##
## compact
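library(dplyr) #dplyr is already attached, so this call does not reorder the search path or undo the masking by plyr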
library(summarytools)
##
## Attaching package: 'summarytools'
## The following object is masked from 'package:tibble':
##
## view
Quant_Hints_Clean <- Hints_clean %>%
  select("Age","TotalHousehold","BMI","phq4","Exercise","AvgDrinksPerWeek") #selecting only the quantitative variables in the Hints_clean dataset
names(Quant_Hints_Clean) #confirming the correct columns were chosen
## [1] "Age" "TotalHousehold" "BMI" "phq4"
## [5] "Exercise" "AvgDrinksPerWeek"
sum(is.na(Quant_Hints_Clean)) #confirming if there are any NA values within the data
## [1] 0
The correct quantitative variables were successfully selected and there are no NAs within the data.
descr(Quant_Hints_Clean, #using descr() to generate N, mean, median, std dev, min and max
headings = TRUE,
stats = "common"
)
## Descriptive Statistics
## Quant_Hints_Clean
## N: 1570
##
## Age AvgDrinksPerWeek BMI Exercise phq4 TotalHousehold
## --------------- --------- ------------------ --------- ---------- --------- ----------------
## Mean 49.66 5.78 28.50 163.95 3.76 2.37
## Std.Dev 17.08 8.99 6.98 256.96 2.86 1.34
## Min 18.00 0.25 10.80 0.00 1.00 1.00
## Median 48.00 2.50 27.30 90.00 3.00 2.00
## Max 93.00 75.00 63.10 3600.00 12.00 11.00
## N.Valid 1570.00 1570.00 1570.00 1570.00 1570.00 1570.00
## N 1570.00 1570.00 1570.00 1570.00 1570.00 1570.00
## Pct.Valid 100.00 100.00 100.00 100.00 100.00 100.00
The cleaned sample consists of 1,570 participants. The mean age is 49.66 years (SD = 17.08), with ages ranging from 18 to 93, indicating a predominantly middle-aged sample with a wide age range. The mean BMI is 28.50 (SD = 6.98), which falls within the overweight category (25.0 to 29.9) and suggests that a substantial portion of the sample may be at increased risk for weight-related health conditions. Participants reported a mean of 5.78 alcoholic drinks per week (SD = 8.99), with a median of 2.50 and a maximum of 75, indicating a right-skewed distribution with considerable variability. Exercise also varied widely: the mean was 163.95 minutes per week, but the standard deviation (256.96) exceeds the mean and the median is only 90, so a small number of very active participants pull the mean upward. The mean PHQ-4 score of 3.76 (SD = 2.86) indicates relatively low to moderate levels of anxiety and depression symptoms overall. Because the filtering step kept only rows with phq4 > 0 and AvgDrinksPerWeek > 0 (both minimums in the table are above zero), the PHQ-4 and alcohol figures describe participants with at least some symptoms and at least some alcohol use, not the full HINTS sample. Lastly, the mean household size was 2.37 people (SD = 1.34), reflecting relatively small households.
freq(Hints_clean$BirthSex,report.nas = FALSE, cumul = FALSE) #frequency table for BirthSex Variable
## Frequencies
## Hints_clean$BirthSex
## Type: Factor
##
## Freq %
## ------------ ------ --------
## Male 975 62.10
## Female 595 37.90
## Total 1570 100.00
freq(Hints_clean$MaritalStatus, report.nas = FALSE, cumul = FALSE) #frequency table for MaritalStatus Variable
## Frequencies
## Hints_clean$MaritalStatus
## Type: Factor
##
## Freq %
## ----------------------- ------ --------
## Married 686 43.69
## Living as Married 122 7.77
## Divorced 209 13.31
## Widowed 97 6.18
## Separated 32 2.04
## Single 424 27.01
## Total 1570 100.00
freq(Hints_clean$Education, report.nas = FALSE, cumul = FALSE) #frequency table for Education Variable
## Frequencies
## Hints_clean$Education
## Type: Factor
##
## Freq %
## ------------------------------ ------ --------
## Less than High School 63 4.01
## High School 173 11.02
## Some College 432 27.52
## College Graduate or More 902 57.45
## Total 1570 100.00
freq(Hints_clean$HouseholdInc, report.nas = FALSE, cumul = FALSE) #frequency table for HouseholdInc Variable
## Frequencies
## Hints_clean$HouseholdInc
## Type: Factor
##
## Freq %
## ------------------------ ------ --------
## ~<$20,000 186 11.85
## ~$20,000-<$35,000 159 10.13
## ~$35,000-<$50,000 186 11.85
## ~$50,000-<$75,000 262 16.69
## ~$75,000-<$100,000 216 13.76
## ~>=$100,000 561 35.73
## Total 1570 100.00
freq(Hints_clean$SmokingStatus, report.nas = FALSE, cumul = FALSE) #frequency table for SmokingStatus Variable
## Frequencies
## Hints_clean$SmokingStatus
## Type: Factor
##
## Freq %
## ------------- ------ --------
## Current 206 13.12
## Former 439 27.96
## Never 925 58.92
## Total 1570 100.00
freq(Hints_clean$RaceEthn, report.nas = FALSE, cumul = FALSE) #frequency table for RaceEthnicity Variable
## Frequencies
## Hints_clean$RaceEthn
## Type: Factor
##
## Freq %
## --------------------------- ------ --------
## Non-Hispanic White 985 62.74
## Non-Hispanic Black/AA 209 13.31
## Hispanic 312 19.87
## Non-Hispanic Asian 64 4.08
## Total 1570 100.00
freq(Hints_clean$ECigUse, report.nas = FALSE, cumul = FALSE) #frequency table for ECigUse Variable
## Frequencies
## Hints_clean$ECigUse
## Type: Factor
##
## Freq %
## ------------- ------ --------
## Current 112 7.13
## Former 262 16.69
## Never 1196 76.18
## Total 1570 100.00
In the cleaned sample, 62.10% of participants were male and 37.90% were female. The largest marital status group was married (43.69%), followed by single (27.01%) and divorced (13.31%); the remaining participants were living as married (7.77%), widowed (6.18%), or separated (2.04%). Most participants were college graduates or had more education (57.45%), 27.52% had some college education, 11.02% had a high school education, and 4.01% had less than a high school education. For combined household income, the largest group earned approximately $100,000 or more (35.73%); 16.69% earned approximately $50,000 to under $75,000, 13.76% earned $75,000 to under $100,000, 11.85% each earned less than $20,000 and $35,000 to under $50,000, and 10.13% earned $20,000 to under $35,000. The majority of participants had never smoked cigarettes (58.92%), followed by former smokers (27.96%) and current smokers (13.12%). Most participants identified as Non-Hispanic White (62.74%), followed by Hispanic (19.87%), Non-Hispanic Black/African American (13.31%), and Non-Hispanic Asian (4.08%). As with smoking status, most participants had never used e-cigarettes (76.18%), while 16.69% were former users and 7.13% were current users.
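The percentages reported by freq() are simply each count divided by the total. As a quick worked check, the BirthSex counts from the frequency table above (975 and 595) can be converted to percentages in base R; the object names `counts` and `pct` are chosen for this sketch only.

```r
# Counts taken from the BirthSex frequency table above.
counts <- c(Male = 975, Female = 595)

# Percentage = 100 * count / total, rounded to two decimals as in freq().
pct <- round(100 * counts / sum(counts), 2)

sum(counts)
pct
```

With a factor column rather than fixed counts, `round(100 * prop.table(table(x)), 2)` performs the same calculation.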
library(ggplot2) #loading ggplot2
sum(is.na(Hints_clean$BMI)) #confirming if there are any NAs in BMI
## [1] 0
sum(is.na(Hints_clean$BirthSex)) #confirming if there are any NAs in BirthSex
## [1] 0
table(Hints_clean$BirthSex,useNA = "ifany") #confirming if there are any NAs in BirthSex
##
## Male Female
## 975 595
ggplot(Hints_clean, aes(x=BMI,fill = BirthSex)) + #indicating that the x axis will be BMI but the values of BirthSex will fill the bins
geom_histogram(aes(y = after_stat(density)),
alpha = 0.5, position = "identity", bins = 30) +
geom_density(alpha = 0.2) +
labs(title = "BMI Distribution by Sex", #title of the graph will be BMI Distribution by Sex
x = "BMI", #x axis will be labeled as BMI
y = "Density") #y axis will be Density
While BMI distributions are similar for both sexes, females tend to cluster more around the average BMI range, whereas males show slightly more variability and higher extreme values.
ggplot(Hints_clean, aes(x=SmokingStatus, y=phq4, fill = SmokingStatus)) + #x will be the smoking status levels, y will be the phq4 values, and each box will be filled with a color by smoking status
geom_boxplot() + #boxplot is indicated
labs(title = "PHQ-4 by Smoking Status", x="Smoking Status", y="PHQ-4 Score") #graph, x axis, and y axis title names
Current smokers appear to have a higher median PHQ-4 score than the other groups, suggesting greater anxiety and depression symptoms among current smokers.
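A boxplot comparison can be backed up with the group medians themselves. The sketch below uses a hypothetical toy data frame (the values are invented and do not come from HINTS) to compute medians per group with tapply(). The Kruskal-Wallis test shown at the end is an optional formal check that was not part of the original analysis.

```r
# Hypothetical toy data, invented for illustration only.
toy <- data.frame(
  SmokingStatus = factor(
    rep(c("Current", "Former", "Never"), each = 3),
    levels = c("Current", "Former", "Never")
  ),
  phq4 = c(4, 6, 8, 2, 3, 5, 1, 2, 3)
)

# Median PHQ-4 score within each smoking status group.
medians <- tapply(toy$phq4, toy$SmokingStatus, median)
medians

# Optional: rank-based test of whether the groups differ in distribution.
kw <- kruskal.test(phq4 ~ SmokingStatus, data = toy)
kw$p.value
```

On the real data, the equivalent call would be `tapply(Hints_clean$phq4, Hints_clean$SmokingStatus, median)`.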
ggplot(Hints_clean, aes(x=Age, y=AvgDrinksPerWeek)) + #x axis will be age and y will be avg drinks per week
geom_point() + #indicating points will be used to identify each observation
geom_smooth(method="lm") + #adding a fitted linear regression line (with its confidence band) to show the overall trend
labs(title="Age vs Alcohol Consumption", x="Age", y="Average Drinks Per Week") #graph, x axis, and y axis title names
## `geom_smooth()` using formula = 'y ~ x'
The scatterplot shows the relationship between age and the average number of alcoholic drinks consumed per week. Overall, there appears to be a very weak positive relationship, as indicated by the slightly upward-sloping trend line. This suggests that, on average, alcohol consumption increases slightly with age; however, the relationship is not strong. Most participants, regardless of age, report low levels of alcohol consumption, with a high concentration of values clustered near zero to 10 drinks per week. There are also several outliers, particularly among middle-aged individuals, who report much higher levels of drinking. The wide spread of points across all age groups indicates high variability, meaning age alone is not a strong predictor of alcohol consumption in this sample.
sum(is.na(Hints_clean$RaceEthn)) #confirming there are no NAs in RaceEthn
## [1] 0
ggplot(Hints_clean, aes(x=RaceEthn)) + #x axis will be RaceEthn
geom_bar() + # bar chart is indicated
labs(title="Race/Ethnicity Distribution", x= "Race/Ethnicity", y="Count") #graph, x axis, and y axis title names
The majority of participants identified as Non-Hispanic White, while those who identified as Hispanic were the second most common group. Non-Hispanic Black/African Americans came in third, while Non-Hispanic Asians made up the smallest proportion.
sum(is.na(Hints_clean$MaritalStatus)) #confirming no NAs in Marital Status
## [1] 0
sum(is.na(Hints_clean$Education)) #confirming no NAs in Education
## [1] 0
ggplot(Hints_clean, aes(x=Education, fill=MaritalStatus)) + #x axis will be education level and marital status will fill the bars
geom_bar(position="fill") + #stacked bar chart scaled so each bar sums to 1, showing proportions instead of counts
labs(title="Marital Status by Education", y="Proportion") + #provides title and the label for y axis
theme(axis.text.x = element_text(size = 9, angle = 45, hjust = 1)) #adjusts labels on x axis
The stacked bar plot shows the proportion of each marital status within each education level. Higher education groups appear to have a greater proportion of married individuals, while lower education groups have more single individuals, suggesting a possible relationship between education and marital status.
sum(is.na(Hints_clean$Exercise)) #confirming if there are any NAs in Exercise
## [1] 0
sum(is.na(Hints_clean$ECigUse))#confirming if there are any NAs in ECigUse
## [1] 0
ggplot(Hints_clean, aes(x=Exercise)) + #x axis will be exercise
geom_histogram(bins = 25) + #drawing a histogram that distributes the data across 25 bins
facet_wrap(~ECigUse) + #telling R to create a mini graph for each e-cigarette use group
labs(title="Exercise by E-Cigarette Use", #chart title
x = "Exercise (Minutes)", #x axis title
y = "Count") #y axis title
Across all groups, most people engage in relatively low amounts of exercise, with only a small proportion reporting very high activity levels. There are no dramatic differences between e-cigarette use groups, though never users may be slightly more concentrated at lower exercise levels.
library(summarytools)
sum(is.na(Hints_clean$BirthSex)) #confirming no NAs in BirthSex
## [1] 0
sum(is.na(Hints_clean$HouseholdInc)) #confirming no NAs in HouseholdInc
## [1] 0
ctable(Hints_clean$BirthSex, Hints_clean$HouseholdInc) #creating cross-tabulation between BirthSex and Household Income
## Cross-Tabulation, Row Proportions
## BirthSex * HouseholdInc
## Data Frame: Hints_clean
##
## ---------- -------------- ------------- ------------------- ------------------- ------------------- -------------------- ------------- ---------------
## HouseholdInc ~<$20,000 ~$20,000-<$35,000 ~$35,000-<$50,000 ~$50,000-<$75,000 ~$75,000-<$100,000 ~>=$100,000 Total
## BirthSex
## Male 120 (12.3%) 108 (11.1%) 120 (12.3%) 165 (16.9%) 139 (14.3%) 323 (33.1%) 975 (100.0%)
## Female 66 (11.1%) 51 ( 8.6%) 66 (11.1%) 97 (16.3%) 77 (12.9%) 238 (40.0%) 595 (100.0%)
## Total 186 (11.8%) 159 (10.1%) 186 (11.8%) 262 (16.7%) 216 (13.8%) 561 (35.7%) 1570 (100.0%)
## ---------- -------------- ------------- ------------------- ------------------- ------------------- -------------------- ------------- ---------------
chisq.test(table(Hints_clean$BirthSex,Hints_clean$HouseholdInc)) #running chi square test to determine if there is a significant assoc. btwn BirthSex and Household Inc
##
## Pearson's Chi-squared test
##
## data: table(Hints_clean$BirthSex, Hints_clean$HouseholdInc)
## X-squared = 8.6446, df = 5, p-value = 0.1241
The chi-square test of independence returned a p-value of 0.1241 (X-squared = 8.6446, df = 5). Since the p-value is greater than 0.05, we fail to reject the null hypothesis: there is insufficient evidence of a statistically significant association between birth sex and household income.
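The test can be reproduced directly from the counts printed in the cross-tabulation above, which also makes it easy to check the usual assumption that expected cell counts are at least 5. This is a sketch: the object names `tab` and `res` are chosen for this example, and the counts are copied from the ctable() output.

```r
# Counts copied from the BirthSex x HouseholdInc cross-tabulation above.
inc_levels <- c("<$20,000", "$20,000-<$35,000", "$35,000-<$50,000",
                "$50,000-<$75,000", "$75,000-<$100,000", ">=$100,000")
tab <- matrix(
  c(120, 108, 120, 165, 139, 323,   # Male
     66,  51,  66,  97,  77, 238),  # Female
  nrow = 2, byrow = TRUE,
  dimnames = list(BirthSex = c("Male", "Female"), HouseholdInc = inc_levels)
)

res <- chisq.test(tab)
res

# Expected counts under independence; all should be at least 5.
round(res$expected, 1)
min(res$expected)
```

The statistic, degrees of freedom, and p-value should match the output reported above, since the input table is the same.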
quant_vars <- Hints_clean[,c("Age","BMI","phq4","Exercise")] #selecting the identified quantitative variables
sum(is.na(quant_vars)) #confirming if there are any NAs within the selected variables
## [1] 0
cor_quant_vars <- cor(quant_vars, use = "complete.obs") #obtain correlations of the quantitative variables
library(corrplot) #loading corrplot to use
## corrplot 0.95 loaded
corrplot(cor_quant_vars, method = "circle") #visualizing the correlations as a correlogram in which circle size and color encode each coefficient
The strongest positive correlation is between BMI and PHQ-4 (r = 0.08), which is very weak; it hints that higher BMI may be associated with slightly greater psychological distress. The strongest negative correlation is between age and PHQ-4 (r = -0.18), a weak relationship indicating that younger participants tend to report somewhat more mental health symptoms. The directions of these relationships are generally consistent with public health expectations, but all of the correlations are weak.
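Because the plot encodes coefficients only as circles, it can help to print the rounded matrix and to test an individual correlation. The sketch below runs on simulated data (generated with `set.seed()`; the variable relationships are invented and do not reproduce the HINTS results) and uses `cor()` and `cor.test()` from base R's stats package.

```r
set.seed(1)

# Simulated stand-in data; the relationships here are invented.
n <- 200
sim <- data.frame(
  Age = round(runif(n, 18, 90)),
  BMI = rnorm(n, mean = 28, sd = 6)
)
sim$phq4 <- pmax(0, pmin(12, round(6 - 0.04 * sim$Age + rnorm(n, 0, 2.5))))

# Pearson correlation matrix, rounded for reading off values.
r_mat <- round(cor(sim), 2)
r_mat

# Test of a single pair, with a confidence interval for r.
ct <- cor.test(sim$Age, sim$phq4)
ct$estimate
ct$conf.int
```

For the report itself, `round(cor_quant_vars, 2)` would print the coefficients that the text quotes.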
This analysis used the 2024 HINTS public dataset to examine demographics, health behaviors, and mental health indicators. The cleaned sample included participants across a wide range of ages, income levels, and educational backgrounds. Current smokers appeared to have a higher median PHQ-4 score, suggesting a greater burden of anxiety and depression symptoms in that group. BMI distributions were similar between the sexes, although females clustered more closely around the average BMI, whereas males showed slightly greater variability and higher extreme values. A chi-square test found no statistically significant association between birth sex and household income (p = 0.1241), so there is insufficient evidence to conclude that the two are related in this sample. Correlation analysis showed only weak relationships: a very weak positive correlation between BMI and PHQ-4 (r = 0.08) and a weak negative correlation between age and PHQ-4 (r = -0.18). Overall, these findings suggest that mental health is related to health behaviors such as smoking and underscore the importance of mental health interventions. Limitations include the cross-sectional design, which prevents causal conclusions, and the data cleaning, which reduced the sample from 7,278 to 1,570 participants by removing missing-value codes, unlabelled categories, and participants with a PHQ-4 score of 0 or 0 drinks per week.