Analyzing 2020 HINTS Data by wrangling, descriptive analysis, function writing, univariate visualization, integration, reporting, and submission.
Wrangling data: Loaded the 2020 Hints Data set into R
library(readr)
HINTSData_2020_clean <- read_csv("C:/Users/her_n/OneDrive/Documents/MPH Program/Spring 2026/PUBH422 Statistical Planning for Health Data/Lectures/Datasets/HINTSData_2020_clean.csv")
## Rows: 2402 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): PersonID
## dbl (16): HHID, QualityCare, HealthInsurance, Age, BirthGender, FullTimeOcc_...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The Variables Age, Average Drinks Per Week, Weekly Minutes Moderate Exercise, BMI, Quality Care, Birth Gender, Smoke Status, and Race & Ethnicity were extracted from the dataset
Group7Variables <- HINTSData_2020_clean[,c("Age", #extracted the variables assigned to our group
"AvgDrinksPerWeek",
"WeeklyMinutesModerateExercise",
"BMI",
"QualityCare",
"BirthGender",
"smokeStat",
"RaceEthn5")]
Confirmed the identified variables were successfully extracted.
head(Group7Variables) #printed the first couple of columns to make sure the correct variables were extracted
## # A tibble: 6 × 8
## Age AvgDrinksPerWeek WeeklyMinutesModerateEx…¹ BMI QualityCare BirthGender
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 59 14 1200 20.7 1 2
## 2 61 0 0 35.4 4 1
## 3 31 2 180 23 2 2
## 4 48 1 60 35.1 2 2
## 5 70 0 0 29.2 3 1
## 6 55 2 540 18.4 3 2
## # ℹ abbreviated name: ¹WeeklyMinutesModerateExercise
## # ℹ 2 more variables: smokeStat <dbl>, RaceEthn5 <dbl>
Confirmed none of our variables were missing information.
colSums(is.na(Group7Variables)) #confirmed none of our columns have missing information
## Age AvgDrinksPerWeek
## 0 0
## WeeklyMinutesModerateExercise BMI
## 0 0
## QualityCare BirthGender
## 0 0
## smokeStat RaceEthn5
## 0 0
Statistical 5-number Summary, mean, standard deviation, and variance of continuous variables.
my_summary_func <- function(data) {
continuous_vars <- data[, c("Age",
"AvgDrinksPerWeek",
"WeeklyMinutesModerateExercise",
"BMI")]
cat("\n----- Five-Number Summary -----\n")
print(summary(continuous_vars))
cat("\n----- Means -----\n")
print(colMeans(continuous_vars, na.rm = TRUE))
cat("\n----- Standard Deviations -----\n")
print(apply(continuous_vars, 2, sd, na.rm = TRUE))
cat("\n----- Variances -----\n")
print(apply(continuous_vars, 2, var, na.rm = TRUE))
}
A function was created to produce the 5 number summary, mean, standard deviation, and variances
my_summary_func(Group7Variables)
##
## ----- Five-Number Summary -----
## Age AvgDrinksPerWeek WeeklyMinutesModerateExercise
## Min. : 18.00 Min. : 0.000 Min. : 0.0
## 1st Qu.: 43.00 1st Qu.: 0.000 1st Qu.: 0.0
## Median : 58.00 Median : 0.000 Median : 90.0
## Mean : 55.59 Mean : 3.358 Mean : 161.8
## 3rd Qu.: 68.00 3rd Qu.: 4.000 3rd Qu.: 210.0
## Max. :100.00 Max. :70.000 Max. :4620.0
## BMI
## Min. :10.90
## 1st Qu.:24.00
## Median :27.50
## Mean :28.61
## 3rd Qu.:31.90
## Max. :73.80
##
## ----- Means -----
## Age AvgDrinksPerWeek
## 55.591174 3.358035
## WeeklyMinutesModerateExercise BMI
## 161.781848 28.607036
##
## ----- Standard Deviations -----
## Age AvgDrinksPerWeek
## 16.579024 6.588142
## WeeklyMinutesModerateExercise BMI
## 271.189253 6.538369
##
## ----- Variances -----
## Age AvgDrinksPerWeek
## 274.86403 43.40362
## WeeklyMinutesModerateExercise BMI
## 73543.61087 42.75028
Age demonstrates that most participants are middle aged, with the mean being 55 years old. Most participants drink 0-3 drinks on average per week however, the average was shifted due to outliers. The median exercise was 90 minutes per week. The average BMI indicated that people are slightly overweight.
Dyplr Descriptive Package for Smoking Status Variable
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Group7Variables %>%
group_by(smokeStat) %>%
summarise(
across(
c(Age,
AvgDrinksPerWeek,
WeeklyMinutesModerateExercise,
BMI),
list(
Min = ~min(.),
Q1 = ~quantile(., 0.25),
Median = ~median(.),
Mean = ~mean(.),
Q3 = ~quantile(., 0.75),
Max = ~max(.),
SD = ~sd(.),
Variance = ~var(.)
),
.names = "{.col}_{.fn}"
)
)
## # A tibble: 3 × 33
## smokeStat Age_Min Age_Q1 Age_Median Age_Mean Age_Q3 Age_Max Age_SD
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 19 44 57 54.5 65 87 14.5
## 2 2 21 52 64 61.4 72 98 14.8
## 3 3 18 39 55 53.4 66 100 17.0
## # ℹ 25 more variables: Age_Variance <dbl>, AvgDrinksPerWeek_Min <dbl>,
## # AvgDrinksPerWeek_Q1 <dbl>, AvgDrinksPerWeek_Median <dbl>,
## # AvgDrinksPerWeek_Mean <dbl>, AvgDrinksPerWeek_Q3 <dbl>,
## # AvgDrinksPerWeek_Max <dbl>, AvgDrinksPerWeek_SD <dbl>,
## # AvgDrinksPerWeek_Variance <dbl>, WeeklyMinutesModerateExercise_Min <dbl>,
## # WeeklyMinutesModerateExercise_Q1 <dbl>,
## # WeeklyMinutesModerateExercise_Median <dbl>, …
A tibble 3x33 was created that contains 3 rows (smoking groups) with 33 columns: (1 group variable (SmokeStat), 4 continuous variables & 8 stats). The youngest person that is a current smoker: 19, while the oldest is a former smoker: 21. The max age is a person who never smoked: 100, while the lowest max age was for a current smoker: 87. The oldest group on average was former (2) smokers, suggesting that quitting typically occurs later in life. The group that never smoked had the more age variability with their SD=17. current and never smokers have similar mean ages.
Skimr Descriptive Package: produce a statistical summary for the continous variables
library(skimr)
skim(Group7Variables[, c("Age",
"AvgDrinksPerWeek",
"WeeklyMinutesModerateExercise",
"BMI")])
| Name | …[] |
| Number of rows | 2402 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Age | 0 | 1 | 55.59 | 16.58 | 18.0 | 43 | 58.0 | 68.0 | 100.0 | ▃▅▇▅▁ |
| AvgDrinksPerWeek | 0 | 1 | 3.36 | 6.59 | 0.0 | 0 | 0.0 | 4.0 | 70.0 | ▇▁▁▁▁ |
| WeeklyMinutesModerateExercise | 0 | 1 | 161.78 | 271.19 | 0.0 | 0 | 90.0 | 210.0 | 4620.0 | ▇▁▁▁▁ |
| BMI | 0 | 1 | 28.61 | 6.54 | 10.9 | 24 | 27.5 | 31.9 | 73.8 | ▂▇▁▁▁ |
Writing Functions
my_summary_func <- function(data) {
continuous_vars <- data[, c("Age",
"AvgDrinksPerWeek",
"WeeklyMinutesModerateExercise",
"BMI")]
cat("----- Five-Number Summary -----\n")
print(summary(continuous_vars))
cat("\n----- Means -----\n")
print(colMeans(continuous_vars))
cat("\n----- Standard Deviations -----\n")
print(apply(continuous_vars, 2, sd))
cat("\n----- Variances -----\n")
print(apply(continuous_vars, 2, var))
}
Run the function
my_summary_func(Group7Variables)
## ----- Five-Number Summary -----
## Age AvgDrinksPerWeek WeeklyMinutesModerateExercise
## Min. : 18.00 Min. : 0.000 Min. : 0.0
## 1st Qu.: 43.00 1st Qu.: 0.000 1st Qu.: 0.0
## Median : 58.00 Median : 0.000 Median : 90.0
## Mean : 55.59 Mean : 3.358 Mean : 161.8
## 3rd Qu.: 68.00 3rd Qu.: 4.000 3rd Qu.: 210.0
## Max. :100.00 Max. :70.000 Max. :4620.0
## BMI
## Min. :10.90
## 1st Qu.:24.00
## Median :27.50
## Mean :28.61
## 3rd Qu.:31.90
## Max. :73.80
##
## ----- Means -----
## Age AvgDrinksPerWeek
## 55.591174 3.358035
## WeeklyMinutesModerateExercise BMI
## 161.781848 28.607036
##
## ----- Standard Deviations -----
## Age AvgDrinksPerWeek
## 16.579024 6.588142
## WeeklyMinutesModerateExercise BMI
## 271.189253 6.538369
##
## ----- Variances -----
## Age AvgDrinksPerWeek
## 274.86403 43.40362
## WeeklyMinutesModerateExercise BMI
## 73543.61087 42.75028
Age demonstrates that most participants are middle aged, with the mean being 55 years old. Most participants drink 0-3 drinks on average per week however, the average was shifted due to outliers. The median exercise was 90 minutes per week. The average BMI indicated that people are slightly overweight.
Barplot of Quality Care
barplot(table(Group7Variables$QualityCare),
main = "Distribution of QualityCare",
col = "lightblue",
names.arg = c("Excellent", "Very Good", "Good", "Fair","Poor"),
xlab = "Quality of Care Rating",
ylab = "Count")
Most respondents reported receiving good quality care. Only a small proportion reported poor quality care. The distribution suggests overall positive perception of care quality.
Barplot of Race and Ethnicity
barplot(table(Group7Variables$RaceEthn5),
main = "Distribution of Race/Ethnicity",
col = "lightgreen",
names.arg = c("Non-Hisp White", "Non-Hisp Black/AA", "Hispanic", "Non-Hisp Asian","Non-Hisp Other"),
xlab = "Race/Ethnicty",
ylab = "Count",
cex.names = 0.75,
cex.lab = 1.2)
The sample consists mostly of Non-Hispanic White, with smaller proportions of Non-Hispanic Asian and Non-Hispanic Other participants.
Histogram of Age
hist(Group7Variables$Age,
main = "Distribution of Age",
xlab="Age",
col = "pink")
Age appears normally distributed in 60 years old and skewed right.
Histogram of BMI
hist(Group7Variables$BMI,
main = "Distribution of BMI",
xlab="BMI",
col = "orange")
BMI appears slightly right-skewed, with most values between 20-30.
Barplot of QUality Care using ggplot2
library(ggplot2)
ggplot(Group7Variables, aes(x = factor(QualityCare,
levels = c(1,2,3,4,5),
labels = c("Excellent",
"Very Good",
"Good",
"Fair",
"Poor"))))+
geom_bar(fill = "lightblue") +
xlab("Quality Care Rating") +
ggtitle("Distribution of QualityCare")
Most respondents reported receiving good quality care. Only a small proportion reported poor quality care. The distribution suggests overall positive perception of care quality.
Barplot of Race & Ethnicity using ggplot2
ggplot(Group7Variables, aes(x = factor(RaceEthn5,
levels = c(1,2,3,4,5),
labels = c("Non-Hisp White",
"Non-Hisp Black/AA",
"Hispanic",
"Non-Hisp Asian",
"Non-Hisp Other")))) +
geom_bar(fill = "lightgreen") +
xlab("Race/Ethnicity Category") +
ggtitle("Distribution of RaceEthn5")
The sample consists mostly of Non-Hispanic White, with smaller proportions of Non-Hispanic Asian and Non-Hispanic Other participants.
Histogram of Age using ggplot2
ggplot(Group7Variables, aes(x = Age)) +
geom_histogram(fill = "pink", bins = 30) +
ggtitle("Distribution of Age") +
xlab("Age")
Age appears normally distributed in 60 years old and skewed right.
Histogram of BMI using ggplot2
ggplot(Group7Variables, aes(x = BMI)) +
geom_histogram(fill = "orange", bins = 30) +
ggtitle("Distribution of BMI") +
xlab("BMI")
BMI appears slightly right-skewed, with most values between 20-30.
Single R Function for barplots
my_visual_func_I <- function(data, varname){
barplot(table(data[[varname]]),
main = paste("Distribution of", varname),
col = "lightblue",
xlab = varname)
}
my_visual_func_I(Group7Variables, "QualityCare")
Most respondents reported receiving good quality care. Only a small proportion reported poor quality care. The distribution suggests overall positive perception of care quality.
my_visual_func_I(Group7Variables, "RaceEthn5")
The sample is mostly in group 1(Non-Hispanic White), with smaller proportions of in group 4(Non-Hispanic Asian)and 5(Non-Hispanic Other).
Single R Function for Histograms
my_visual_func_II <- function(data, varname){
hist(data[[varname]],
main = paste("Distribution of", varname),
col = "orange",
xlab = varname)
}
my_visual_func_II(Group7Variables, "Age")
Age appears normally distributed in 60 years old and skewed right.
my_visual_func_II(Group7Variables, "BMI")
BMI appears slightly right-skewed, with most values between 20-30.