This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this
## Day 1: Data Preparation & Initial Exploration
# Task 1
setwd("C:/Users/kevinroldan13/OneDrive - Cal State Fullerton/Desktop/PUBH 422/PUBH 422 Test 2") # Set working directory
hints <- read.csv("hints7_public data_2024.csv", header = TRUE) # Loaded dataset with 7278 rows/observations and 15 columns/variables
dim(hints) #printed dimensions of dataset, 7278 observations and 15 variables
## [1] 7278 15
head(hints) #printed the first 6 rows of the dataset
## HHID Age BirthSex MaritalStatus AgeGrpB EducA HHInc TotalHousehold BMI
## 1 72100001 69 2 1 4 4 6 2 26.3
## 2 72100005 62 1 1 3 3 6 2 25.0
## 3 72100014 34 1 1 1 4 5 5 24.0
## 4 72100019 65 2 1 4 3 4 3 35.2
## 5 72100025 64 1 6 3 4 1 1 29.0
## 6 72100026 64 2 1 3 4 6 2 27.3
## smokeStat RaceEthn5 phq4 Exercise ECigUse AvgDrinksPerWeek
## 1 2 1 0 225 3 12.5
## 2 2 1 0 180 3 15.0
## 3 3 3 4 240 3 7.5
## 4 2 1 4 120 3 2.0
## 5 2 1 11 0 1 12.0
## 6 3 5 1 150 3 0.5
summary (hints) #printed the preliminary statistical summary of dataset
## HHID Age BirthSex MaritalStatus
## Min. :72100001 Min. : -9.00 Min. :-9.0000 Min. :-9.000
## 1st Qu.:72108592 1st Qu.: 36.00 1st Qu.: 1.0000 1st Qu.: 1.000
## Median :72117023 Median : 55.00 Median : 1.0000 Median : 2.000
## Mean :72251562 Mean : 50.38 Mean : 0.7236 Mean : 2.031
## 3rd Qu.:72325380 3rd Qu.: 69.00 3rd Qu.: 2.0000 3rd Qu.: 4.000
## Max. :72836009 Max. :102.00 Max. : 3.0000 Max. : 6.000
## AgeGrpB EducA HHInc TotalHousehold
## Min. :-9.000 Min. :-9.000 Min. :-9.00 Min. :-9.00
## 1st Qu.: 2.000 1st Qu.: 2.000 1st Qu.: 1.00 1st Qu.: 1.00
## Median : 3.000 Median : 3.000 Median : 4.00 Median : 2.00
## Mean : 2.128 Mean : 2.335 Mean : 2.36 Mean : 1.56
## 3rd Qu.: 4.000 3rd Qu.: 4.000 3rd Qu.: 6.00 3rd Qu.: 3.00
## Max. : 5.000 Max. : 4.000 Max. : 6.00 Max. :11.00
## BMI smokeStat RaceEthn5 phq4
## Min. :-9.00 Min. :-9.000 Min. :-9.0000 Min. :-9.000
## 1st Qu.:23.20 1st Qu.: 2.000 1st Qu.: 1.0000 1st Qu.: 0.000
## Median :27.10 Median : 3.000 Median : 1.0000 Median : 1.000
## Mean :26.44 Mean : 1.803 Mean : 0.7595 Mean : 1.592
## 3rd Qu.:31.90 3rd Qu.: 3.000 3rd Qu.: 3.0000 3rd Qu.: 3.000
## Max. :66.60 Max. : 3.000 Max. : 5.0000 Max. :12.000
## Exercise ECigUse AvgDrinksPerWeek
## Min. : -9.0 Min. :-9.000 Min. :-9.000
## 1st Qu.: 0.0 1st Qu.: 3.000 1st Qu.: 0.000
## Median : 90.0 Median : 3.000 Median : 0.000
## Mean : 168.7 Mean : 1.826 Mean : 1.533
## 3rd Qu.: 210.0 3rd Qu.: 3.000 3rd Qu.: 2.000
## Max. :6300.0 Max. : 3.000 Max. :75.000
# Task 2 Data Cleaning & Value Labeling
str(hints) #displays the internal structure of the dataframe through concise summary
## 'data.frame': 7278 obs. of 15 variables:
## $ HHID : int 72100001 72100005 72100014 72100019 72100025 72100026 72100027 72100028 72100031 72100035 ...
## $ Age : int 69 62 34 65 64 64 26 85 32 -9 ...
## $ BirthSex : int 2 1 1 2 1 2 1 2 1 3 ...
## $ MaritalStatus : int 1 1 1 1 6 1 6 4 6 6 ...
## $ AgeGrpB : int 4 3 1 4 3 3 1 5 1 -9 ...
## $ EducA : int 4 3 4 3 4 4 4 2 3 4 ...
## $ HHInc : int 6 6 5 4 1 6 4 3 5 1 ...
## $ TotalHousehold : int 2 2 5 3 1 2 1 2 1 -9 ...
## $ BMI : num 26.3 25 24 35.2 29 27.3 30.7 30.3 26.1 26.6 ...
## $ smokeStat : int 2 2 3 2 2 3 3 3 1 3 ...
## $ RaceEthn5 : int 1 1 3 1 1 5 2 1 1 3 ...
## $ phq4 : int 0 0 4 4 11 1 8 0 0 0 ...
## $ Exercise : int 225 180 240 120 0 150 80 90 360 0 ...
## $ ECigUse : int 3 3 3 3 1 3 3 3 1 3 ...
## $ AvgDrinksPerWeek: num 12.5 15 7.5 2 12 0.5 1 4 0 -9 ...
sum(is.na(hints)) #counts the number of missing values in the dataframe
## [1] 0
#Output shows no missing values were identified in the dataframe
sum(hints<0, na.rm=TRUE) #Detects negative values in dataset and gives count
## [1] 8700
hints_clean <- hints[
hints$Age >= 0 & #reference each column because R does not detect it as columns
hints$BirthSex > 0 &
hints$MaritalStatus > 0 & #>0 removes negative values
hints$AgeGrpB > 0 &
hints$EducA > 0 &
hints$HHInc > 0 &
hints$TotalHousehold > 0 &
hints$BMI >= 0 &
hints$smokeStat > 0 &
hints$RaceEthn5 > 0 &
hints$phq4 >= 0 &
hints$Exercise >= 0 &
hints$ECigUse > 0 &
hints$AvgDrinksPerWeek >= 0, ] #cleans data by excluding the negative values in dataframe and stores in new df
summary(hints_clean) #summary verifies no negative values are present, clean dataset is now 5557 observations and 15 variables
## HHID Age BirthSex MaritalStatus
## Min. :72100001 Min. : 18.00 Min. :1.000 Min. :1.000
## 1st Qu.:72108616 1st Qu.: 39.00 1st Qu.:1.000 1st Qu.:1.000
## Median :72116908 Median : 56.00 Median :1.000 Median :2.000
## Mean :72247272 Mean : 54.14 Mean :1.417 Mean :2.802
## 3rd Qu.:72325132 3rd Qu.: 69.00 3rd Qu.:2.000 3rd Qu.:4.000
## Max. :72836009 Max. :100.00 Max. :3.000 Max. :6.000
## AgeGrpB EducA HHInc TotalHousehold BMI
## Min. :1.000 Min. :1.00 Min. :1.00 Min. : 1.000 Min. : 7.80
## 1st Qu.:2.000 1st Qu.:3.00 1st Qu.:2.00 1st Qu.: 1.000 1st Qu.:24.00
## Median :3.000 Median :4.00 Median :4.00 Median : 2.000 Median :27.50
## Mean :2.893 Mean :3.25 Mean :3.95 Mean : 2.412 Mean :28.85
## 3rd Qu.:4.000 3rd Qu.:4.00 3rd Qu.:6.00 3rd Qu.: 3.000 3rd Qu.:32.30
## Max. :5.000 Max. :4.00 Max. :6.00 Max. :11.000 Max. :66.60
## smokeStat RaceEthn5 phq4 Exercise
## Min. :1.000 Min. :1.000 Min. : 0.000 Min. : 0.0
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.: 0.000 1st Qu.: 16.0
## Median :3.000 Median :1.000 Median : 1.000 Median : 90.0
## Mean :2.549 Mean :1.865 Mean : 2.225 Mean : 181.1
## 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.: 4.000 3rd Qu.: 225.0
## Max. :3.000 Max. :5.000 Max. :12.000 Max. :5040.0
## ECigUse AvgDrinksPerWeek
## Min. :1.000 Min. : 0.000
## 1st Qu.:3.000 1st Qu.: 0.000
## Median :3.000 Median : 0.250
## Mean :2.791 Mean : 2.873
## 3rd Qu.:3.000 3rd Qu.: 2.500
## Max. :3.000 Max. :75.000
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
hints_clean <- hints_clean %>%
filter(BirthSex %in% c(1, 2) | is.na(BirthSex)) %>% #Kept only valid BirthSex values 1 or 2
mutate(BirthSex = factor(BirthSex, #Recoded Birthsex into labeled categories.
levels = c(1, 2), #Assigns levels 1- Male and 2-Female
labels = c("Male", "Female"))) #Converted BirthSex from numeric to labeled categories:
hints_clean <- hints_clean %>%
mutate(MaritalStatus = factor(MaritalStatus, #Recoded MaritalStatus into labeled factor variable
levels = c(1, 2, 3, 4, 5, 6), #Defined original coding structure (1–6)
labels = c("Married", "Living as married", "Divorced", "Widowed", "Separated", "Single"))) #Converted numeric codes into readable categories
hints_clean <- hints_clean %>%
mutate(EducA = factor(EducA, #Recoded Education variable into ordered categories
levels = c(1, 2, 3, 4), #Education levels from lowest to highest attainment
labels = c("Less than High School", "High School",
"Some College", "College Graduate or More"))) #
hints_clean <- hints_clean %>%
mutate(HHInc = factor(HHInc, #Recoded Household Income into income brackets
levels = c(1, 2, 3, 4, 5, 6), #Assigned Income categories (1–6)
labels = c("<$20,000", "$20,000-<$35,000", "$35,000-<$50,000",
"$50,000-<$75,000", "$75,000-<100,000", "≥$100,000")))
hints_clean <- hints_clean %>%
mutate(smokeStat = factor(smokeStat, #Recoded smoking status into categorical variable
levels = c(1, 2, 3), #Assigned smoking categories
labels = c("Current", "Former", "Never")))
hints_clean <- hints_clean %>%
mutate(RaceEthn5 = factor(RaceEthn5, #Recoded race/ethnicity into labeled categories
levels = c(1, 2, 3, 4, 5), #Assigned Race/ethnicity catergories
labels = c("~Non-Hispanic White",
"~Non-Hispanic Black or African American", "~Hispanic",
"~Non-Hispanic Asian", "~Non-Hispanic Other")))
hints_clean <- hints_clean %>%
mutate(ECigUse = factor(ECigUse, #Recoded e-cigarette use into categorical variable
levels = c(1, 2, 3), #Assigned ECigUse categories
labels = c("Current", "Former", "Never")))
summary(hints_clean) #Generated summary statistics for cleaned dataset
## HHID Age BirthSex MaritalStatus
## Min. :72100001 Min. : 18.00 Male :3269 Married :2595
## 1st Qu.:72108618 1st Qu.: 39.00 Female:2261 Living as married: 336
## Median :72116882 Median : 56.00 Divorced : 781
## Mean :72246943 Mean : 54.19 Widowed : 474
## 3rd Qu.:72325100 3rd Qu.: 69.00 Separated : 119
## Max. :72836009 Max. :100.00 Single :1225
## AgeGrpB EducA HHInc
## Min. :1.000 Less than High School : 298 <$20,000 : 816
## 1st Qu.:2.000 High School : 827 $20,000-<$35,000: 658
## Median :3.000 Some College :1585 $35,000-<$50,000: 675
## Mean :2.896 College Graduate or More:2820 $50,000-<$75,000: 928
## 3rd Qu.:4.000 $75,000-<100,000: 708
## Max. :5.000 ≥$100,000 :1745
## TotalHousehold BMI smokeStat
## Min. : 1.000 Min. :10.20 Current: 541
## 1st Qu.: 1.000 1st Qu.:24.00 Former :1410
## Median : 2.000 Median :27.50 Never :3579
## Mean : 2.411 Mean :28.87
## 3rd Qu.: 3.000 3rd Qu.:32.38
## Max. :11.000 Max. :66.60
## RaceEthn5 phq4
## ~Non-Hispanic White :3127 Min. : 0.000
## ~Non-Hispanic Black or African American: 798 1st Qu.: 0.000
## ~Hispanic :1085 Median : 1.000
## ~Non-Hispanic Asian : 306 Mean : 2.222
## ~Non-Hispanic Other : 214 3rd Qu.: 4.000
## Max. :12.000
## Exercise ECigUse AvgDrinksPerWeek
## Min. : 0.0 Current: 280 Min. : 0.00
## 1st Qu.: 16.0 Former : 596 1st Qu.: 0.00
## Median : 100.0 Never :4654 Median : 0.25
## Mean : 181.3 Mean : 2.88
## 3rd Qu.: 225.0 3rd Qu.: 2.50
## Max. :5040.0 Max. :75.00
## Day 2: Descriptive Statistics
# Task 3: Summary Statistics for Quantitative Variables
library(summarytools) #loads summarytools package
## Warning: package 'summarytools' was built under R version 4.5.3
quant_var <- hints_clean[, c("Age", #Subset includes all rows and only quant columns
"TotalHousehold",
"BMI",
"phq4",
"Exercise",
"AvgDrinksPerWeek")]
#Created a smaller subset for quantitative variables from the cleaned hints dataset which includes 5557 observations of 6 variables
descr(quant_var, #runs descriptive statistics for quant variables subset
stats = c("n.valid", "mean", "med", "sd", "min", "max")) #lists the stats needed
## Descriptive Statistics
## quant_var
## N: 5530
##
## Age AvgDrinksPerWeek BMI Exercise phq4 TotalHousehold
## ------------- --------- ------------------ --------- ---------- --------- ----------------
## N.Valid 5530.00 5530.00 5530.00 5530.00 5530.00 5530.00
## Mean 54.19 2.88 28.87 181.33 2.22 2.41
## Median 56.00 0.25 27.50 100.00 1.00 2.00
## Std.Dev 17.76 6.56 6.96 310.31 2.98 1.38
## Min 18.00 0.00 10.20 0.00 0.00 1.00
## Max 100.00 75.00 66.60 5040.00 12.00 11.00
# The descriptive statistics indicate that there are 5,557 participants in the dataset. Participants' age ranges from 18-100, with the average age of 54. The number of individuals in their households ranged from 1-11, with the average being 2.41. Participants averaged 2.22 on the PHQ-4 scores which suggests low-levels of psychological distress. When looking at participant behaviors: On average, participants engaged in 181 minutes of moderate exercise. However, this can be skewed by extreme values, so median is 90 minutes. Their BMI ranges from 7.80 to 66.60 with the average being 28.85 indicating the average person is in the overweight range. The average participant reported 2.87 drinks per week, but median is 0.25 meaning that most individuals reported very few drinks at 0.25.
# Task 4: Frequency Tables for Qualitative Variables
qual_var <- hints_clean[, c("BirthSex", #Subset includes all rows and only qual columns
"MaritalStatus",
"EducA",
"HHInc",
"smokeStat",
"RaceEthn5",
"ECigUse")]
#Created a smaller subset for qualitative variables from the cleaned hints dataset which includes 5557 observations of 7 variables
freq(qual_var$BirthSex, valid.col=TRUE) #valid.col=TRUE to display only valid percents
## Frequencies
## qual_var$BirthSex
## Type: Factor
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ------------ ------ --------- -------------- --------- --------------
## Male 3269 59.11 59.11 59.11 59.11
## Female 2261 40.89 100.00 40.89 100.00
## <NA> 0 0.00 100.00
## Total 5530 100.00 100.00 100.00 100.00
freq(qual_var$MaritalStatus, valid.col=TRUE)
## Frequencies
## qual_var$MaritalStatus
## Type: Factor
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ----------------------- ------ --------- -------------- --------- --------------
## Married 2595 46.93 46.93 46.93 46.93
## Living as married 336 6.08 53.00 6.08 53.00
## Divorced 781 14.12 67.12 14.12 67.12
## Widowed 474 8.57 75.70 8.57 75.70
## Separated 119 2.15 77.85 2.15 77.85
## Single 1225 22.15 100.00 22.15 100.00
## <NA> 0 0.00 100.00
## Total 5530 100.00 100.00 100.00 100.00
freq(qual_var$EducA, valid.col=TRUE)
## Frequencies
## qual_var$EducA
## Type: Factor
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ------------------------------ ------ --------- -------------- --------- --------------
## Less than High School 298 5.39 5.39 5.39 5.39
## High School 827 14.95 20.34 14.95 20.34
## Some College 1585 28.66 49.01 28.66 49.01
## College Graduate or More 2820 50.99 100.00 50.99 100.00
## <NA> 0 0.00 100.00
## Total 5530 100.00 100.00 100.00 100.00
freq(qual_var$HHInc, valid.col=TRUE)
## Frequencies
## qual_var$HHInc
## Type: Factor
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ---------------------- ------ --------- -------------- --------- --------------
## <$20,000 816 14.76 14.76 14.76 14.76
## $20,000-<$35,000 658 11.90 26.65 11.90 26.65
## $35,000-<$50,000 675 12.21 38.86 12.21 38.86
## $50,000-<$75,000 928 16.78 55.64 16.78 55.64
## $75,000-<100,000 708 12.80 68.44 12.80 68.44
## ≥$100,000 1745 31.56 100.00 31.56 100.00
## <NA> 0 0.00 100.00
## Total 5530 100.00 100.00 100.00 100.00
freq(qual_var$smokeStat, valid.col=TRUE)
## Frequencies
## qual_var$smokeStat
## Type: Factor
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ------------- ------ --------- -------------- --------- --------------
## Current 541 9.78 9.78 9.78 9.78
## Former 1410 25.50 35.28 25.50 35.28
## Never 3579 64.72 100.00 64.72 100.00
## <NA> 0 0.00 100.00
## Total 5530 100.00 100.00 100.00 100.00
freq(qual_var$RaceEthn5, valid.col=TRUE)
## Frequencies
## qual_var$RaceEthn5
## Type: Factor
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## --------------------------------------------- ------ --------- -------------- --------- --------------
## ~Non-Hispanic White 3127 56.55 56.55 56.55 56.55
## ~Non-Hispanic Black or African American 798 14.43 70.98 14.43 70.98
## ~Hispanic 1085 19.62 90.60 19.62 90.60
## ~Non-Hispanic Asian 306 5.53 96.13 5.53 96.13
## ~Non-Hispanic Other 214 3.87 100.00 3.87 100.00
## <NA> 0 0.00 100.00
## Total 5530 100.00 100.00 100.00 100.00
freq(qual_var$ECigUse, valid.col=TRUE)
## Frequencies
## qual_var$ECigUse
## Type: Factor
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ------------- ------ --------- -------------- --------- --------------
## Current 280 5.06 5.06 5.06 5.06
## Former 596 10.78 15.84 10.78 15.84
## Never 4654 84.16 100.00 84.16 100.00
## <NA> 0 0.00 100.00
## Total 5530 100.00 100.00 100.00 100.00
#created frequency table for each qual variable
#The frequency tables for the qualitative variables indicate that, from 5,557 participants, more than half were Male at birth (58.83%) with 40.89% being Female at birth. In regards to race/ethnicity, more than half of participants reported being Non-Hispanic White (56.31%), followed by Hispanic participants at 19.76%. For marital status, nearly half of participants were married (46.75%), followed by 22.30% single participants. The most common education level was college graduate or more (50.89%), while only about 5% of participants had an education level less than high school. This may help explain why the most common household income was $100,000 or more (31.44%). Regarding health behaviors, an overwhelming majority of 64.75% of participants reported never smoking and an even higher 84.15% of participants reported never using electronic cigarettes.
## Day 3: Data Visualization
# Task 5: Visualizing Quantitative Variables
library(ggplot2) #loaded ggplot2 package
# 1- Histogram & Density Plot
ggplot(hints_clean, aes(x = BMI, fill = BirthSex)) +
geom_histogram(aes(y = ..density..), position = "identity", alpha = 0.5, bins = 30) + #Converted counts into density scale
geom_density(alpha = 0.7, color = "black") +
labs(
title = "BMI Distribution by Birth Sex", #Titled chart
x = "BMI", #Labeled x-axis as BMI
y = "Density", #Labeled y-axis as Density
fill = "Birth Sex" #Labeled legend for gender categories
) +
theme_minimal()
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
#The ouput shows a roughly bell-shaped distribution. Both distributions are right-skewed potentially due to extremely high BMI vlaues. The BMI peak for both sexes are between 20-25 with females concentrated more around the average and males extending to higher BMI ranges. The right tale for males extends further. Therefore, higher BMI's are more common in males while females are closer to the dataset average BMI.
#2- Boxplot
ggplot(hints_clean, aes(x = smokeStat, y = phq4, fill = smokeStat)) +
geom_boxplot() +
labs(
title = "PHQ-4 Scores by Smoking Status", #Titled chart
x = "Smoking Status", #Labeled x-axis as Smoking Status
y = "PHQ-4 Score" #Labeled y-axis as PHQ-4 Score
) +
theme_minimal() +
theme(legend.position = "none")
#The output shows that current smokers group have the highest median PHQ‑4 score.The median line inside the red box (current smokers) sits noticeably higher than the medians for both former and never smokers, and the line for former and never smokers are similar. Therefore, on average, current smokers report more psychological distress than the other two groups.
#3- Scatterplot
ggplot(hints_clean, aes(x = Age, y = AvgDrinksPerWeek)) +
geom_point(alpha = 0.5, color = "steelblue") +
geom_smooth(method = "lm", se = TRUE, color = "red") +
labs(
title = "Relationship Between Age and Avg Drinks per Week", #Titled Chart
x = "Age", #Labeled x-axis
y = "Average Drinks per Week" #Labeled y-axis
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
#The output shows no apparent relationship between average drinks per week and age. The trend line is flat suggesting that drinking levels stays the same across all age groups. There are extreme outliers among all ages and they are not concentrated in a single age group.
# Task 6
#1- Bar Plot (Counts)
ggplot(hints_clean, aes(x = RaceEthn5, fill = RaceEthn5)) +
geom_bar() +
labs(
title = "Participant Counts by Race/Ethnicity", #Titled chart
x = "Race/Ethnicity", #Labeled x-axis
y = "Count" #Labeled y-axis
) +
theme_minimal() +
theme(legend.position = "none")
#The output shows a bar plot with non-Hispanic white category makes up the largest group followed by Hispanic participants. Non-Hispanic white participants make up a considerable amount of the dataset sample, with roughly 3 times the size of their next counterpart- Hispanic.
#2- Bar Plot (Proportions)
ggplot(hints_clean, aes(x = EducA, fill = MaritalStatus)) +
geom_bar(position = "fill") + # proportion within each education level
scale_y_continuous(labels = scales::percent_format()) +
labs(
title = "Proportion of Marital Status by Education Level", #Titled chart
x = "Education Level", #Labeled x-axis
y = "Proportion", #Labeled y-axis
fill = "Marital Status"
) +
theme_minimal()
#The output shows a bar plot where as education rises the proportion of marriage increases. Surprisnly there are higher rates of being widowed among individuals with lower education. Higher education levels are associated with a greater proportion of married individuals.
#3- Faceted Plot
ggplot(hints_clean, aes(x = Exercise)) +
geom_histogram(bins = 40, fill = "lightgreen", color = "black") +
labs(
title = "Exercise Minutes Distribution by E-Cig Use", #Titled chart
x = "Exercise Minutes", #Labeled x-axis
y = "Count" #Labeled y-axis
) +
facet_wrap(~ECigUse) +
theme_minimal()
#The output shows Never e-cigarette users as having a large concentration of individuals with low exercise minutes. However, the bars have to wide of a range due to extreme exercise minutes outliers.
## Day 4: Bivariate Analysis & Correlation
# Task 7
ctable(hints_clean$BirthSex, hints_clean$HHInc, #Created a cross-tabulation (contingency table) between Birth Sex and Household Income
prop = "r", #included row percentages
total = TRUE, #included row and column totals
style = "rmarkdown")
## ### Cross-Tabulation, Row Proportions
## #### BirthSex * HHInc
## **Data Frame:** hints_clean
##
## | | | | | | | | | |
## |---------:|------:|------------:|-----------------:|-----------------:|-----------------:|-----------------:|-------------:|--------------:|
## | | HHInc | <$20,000 | $20,000-<$35,000 | $35,000-<$50,000 | $50,000-<$75,000 | $75,000-<100,000 | ≥$100,000 | Total |
## | BirthSex | | | | | | | | |
## | Male | | 544 (16.6%) | 437 (13.4%) | 426 (13.0%) | 531 (16.2%) | 415 (12.7%) | 916 (28.0%) | 3269 (100.0%) |
## | Female | | 272 (12.0%) | 221 ( 9.8%) | 249 (11.0%) | 397 (17.6%) | 293 (13.0%) | 829 (36.7%) | 2261 (100.0%) |
## | Total | | 816 (14.8%) | 658 (11.9%) | 675 (12.2%) | 928 (16.8%) | 708 (12.8%) | 1745 (31.6%) | 5530 (100.0%) |
chisq_test <- chisq.test(table(hints_clean$BirthSex, hints_clean$HHInc)) #Created a contingency table and ran a Chi-square test of independence
chisq_test #Printed result of chi-square test
##
## Pearson's Chi-squared test
##
## data: table(hints_clean$BirthSex, hints_clean$HHInc)
## X-squared = 71.328, df = 5, p-value = 5.421e-14
#ouput shows p-value of 5.21e-14, indicating p-value is extremely small and there is an association between birth sex and household income in this dataset. This suggests that the distribution of income levels differs between males and females.
# Task 8
corr_vars <- hints_clean[, c("Age", "BMI", "phq4", "Exercise")] #created subset of quant variables for correlation
cor_matrix <- cor(corr_vars, use = "complete.obs") #excluded missing
cor_matrix
## Age BMI phq4 Exercise
## Age 1.000000000 -0.03574842 -0.24099932 0.009073485
## BMI -0.035748415 1.00000000 0.07680228 -0.090328301
## phq4 -0.240999315 0.07680228 1.00000000 -0.059596341
## Exercise 0.009073485 -0.09032830 -0.05959634 1.000000000
library(corrplot) #loaded corrplot package
## Warning: package 'corrplot' was built under R version 4.5.3
## corrplot 0.95 loaded
corrplot(cor_matrix, method = "color", type = "upper", #Created a correlation plot (heatmap visualization) from a correlation matrix object
addCoef.col = "black", tl.col = "black", tl.srt = 45)
#The strongest positive correlation are between BMI and phq4 values at 0.08. Suggesting a slight positive relationship between BMI and phq4 levels; as BMI increases so do phq4 (psychological stress). The strongest negative correlation relationship is between Age and pohq4 levels at -0.24. This suggests that as age increases, phq4 levels slightly decrease. Meaning as someone ages, their psychological distress levels can decrease.
## Day 5: Synthesis & Report Finalization
#Task 9: Executive Summary & Reflection
#The HINTS 2024 sample includes 7,278 participants. After cleaning up from negative and missing values, there are a total of 5,530 participants. The dataset examines demographic information such as age, birth sex, BMI, marital status, education level, household income. It also studies behavioral variables like cigarette and e-cigarette status, average drinks per week, and exercise. These participants range from ages 18-100 with the average age being 54. There is a slightly higher percentage of male participants than female participants. One striking finding was the large sample of Non-Hispanic White participants, at nearly 3 times their next counterpart. Another striking find is the extreme outliers in exercise, BMI, and average drinks per week. Something I found interesting was the large percentage of being widowed among participants with an educational level less than high school. One notable correlation is the relationship between age and phq4 levels; as age increases psychological distress levels lower. This may reflect greater emotional resilience or more stable life circumstances among older adults. A notable association is the relationship between education and marital status. As higher educational attainment is achieved, marriage rates increase. This might suggest education is associated with higher relationship stability. The negative relationship between age and phq-4 scores suggests that younger adults experience higher levels of psychological distress. From a public health perspective, this highlights the importance of targeted mental‑health outreach and prevention for younger populations. A limitation is the large number of Non Hispanic White participants, which suggests that this group may have been oversampled. Therefore, the findings may not fully generalize to the broader U.S. population.