R Markdown

Introduction

Analyzing 2020 HINTS Data by wrangling, descriptive analysis, function writing, univariate visualization, integration, reporting, and submission.

Nancy’s Contribution

Wrangling data: Loaded the 2020 Hints Data set into R

library(readr)
HINTSData_2020_clean <- read_csv("C:/Users/her_n/OneDrive/Documents/MPH Program/Spring 2026/PUBH422 Statistical Planning for Health Data/Lectures/Datasets/HINTSData_2020_clean.csv")
## Rows: 2402 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): PersonID
## dbl (16): HHID, QualityCare, HealthInsurance, Age, BirthGender, FullTimeOcc_...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The Variables Age, Average Drinks Per Week, Weekly Minutes Moderate Exercise, BMI, Quality Care, Birth Gender, Smoke Status, and Race & Ethnicity were extracted from the dataset

Group7Variables <- HINTSData_2020_clean[,c("Age", #extracted the variables assigned to our group
                                           "AvgDrinksPerWeek",
                                           "WeeklyMinutesModerateExercise",
                                           "BMI",
                                           "QualityCare",
                                           "BirthGender",
                                           "smokeStat",
                                           "RaceEthn5")]

Confirmed the identified variables were successfully extracted.

head(Group7Variables) #printed the first couple of columns to make sure the correct variables were extracted
## # A tibble: 6 × 8
##     Age AvgDrinksPerWeek WeeklyMinutesModerateEx…¹   BMI QualityCare BirthGender
##   <dbl>            <dbl>                     <dbl> <dbl>       <dbl>       <dbl>
## 1    59               14                      1200  20.7           1           2
## 2    61                0                         0  35.4           4           1
## 3    31                2                       180  23             2           2
## 4    48                1                        60  35.1           2           2
## 5    70                0                         0  29.2           3           1
## 6    55                2                       540  18.4           3           2
## # ℹ abbreviated name: ¹​WeeklyMinutesModerateExercise
## # ℹ 2 more variables: smokeStat <dbl>, RaceEthn5 <dbl>

Confirmed none of our variables were missing information.

colSums(is.na(Group7Variables)) #confirmed none of our columns have missing information 
##                           Age              AvgDrinksPerWeek 
##                             0                             0 
## WeeklyMinutesModerateExercise                           BMI 
##                             0                             0 
##                   QualityCare                   BirthGender 
##                             0                             0 
##                     smokeStat                     RaceEthn5 
##                             0                             0

Alyssa’s Contribution

Statistical 5-number Summary, mean, standard deviation, and variance of continuous variables.

my_summary_func <- function(data) {
  
  continuous_vars <- data[, c("Age",
                              "AvgDrinksPerWeek",
                              "WeeklyMinutesModerateExercise",
                              "BMI")]
  
  cat("\n----- Five-Number Summary -----\n")
  print(summary(continuous_vars))
  
  cat("\n----- Means -----\n")
  print(colMeans(continuous_vars, na.rm = TRUE))
  
  cat("\n----- Standard Deviations -----\n")
  print(apply(continuous_vars, 2, sd, na.rm = TRUE))
  
  cat("\n----- Variances -----\n")
  print(apply(continuous_vars, 2, var, na.rm = TRUE))
}

A function was created to produce the 5 number summary, mean, standard deviation, and variances

my_summary_func(Group7Variables)
## 
## ----- Five-Number Summary -----
##       Age         AvgDrinksPerWeek WeeklyMinutesModerateExercise
##  Min.   : 18.00   Min.   : 0.000   Min.   :   0.0               
##  1st Qu.: 43.00   1st Qu.: 0.000   1st Qu.:   0.0               
##  Median : 58.00   Median : 0.000   Median :  90.0               
##  Mean   : 55.59   Mean   : 3.358   Mean   : 161.8               
##  3rd Qu.: 68.00   3rd Qu.: 4.000   3rd Qu.: 210.0               
##  Max.   :100.00   Max.   :70.000   Max.   :4620.0               
##       BMI       
##  Min.   :10.90  
##  1st Qu.:24.00  
##  Median :27.50  
##  Mean   :28.61  
##  3rd Qu.:31.90  
##  Max.   :73.80  
## 
## ----- Means -----
##                           Age              AvgDrinksPerWeek 
##                     55.591174                      3.358035 
## WeeklyMinutesModerateExercise                           BMI 
##                    161.781848                     28.607036 
## 
## ----- Standard Deviations -----
##                           Age              AvgDrinksPerWeek 
##                     16.579024                      6.588142 
## WeeklyMinutesModerateExercise                           BMI 
##                    271.189253                      6.538369 
## 
## ----- Variances -----
##                           Age              AvgDrinksPerWeek 
##                     274.86403                      43.40362 
## WeeklyMinutesModerateExercise                           BMI 
##                   73543.61087                      42.75028

Age demonstrates that most participants are middle aged, with the mean being 55 years old. Most participants drink 0-3 drinks on average per week however, the average was shifted due to outliers. The median exercise was 90 minutes per week. The average BMI indicated that people are slightly overweight.


Dyplr Descriptive Package for Smoking Status Variable

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
Group7Variables %>%
  group_by(smokeStat) %>%
  summarise(
    across(
      c(Age,
        AvgDrinksPerWeek,
        WeeklyMinutesModerateExercise,
        BMI),
      list(
        Min = ~min(.),
        Q1 = ~quantile(., 0.25),
        Median = ~median(.),
        Mean = ~mean(.),
        Q3 = ~quantile(., 0.75),
        Max = ~max(.),
        SD = ~sd(.),
        Variance = ~var(.)
      ),
      .names = "{.col}_{.fn}"
    )
  )
## # A tibble: 3 × 33
##   smokeStat Age_Min Age_Q1 Age_Median Age_Mean Age_Q3 Age_Max Age_SD
##       <dbl>   <dbl>  <dbl>      <dbl>    <dbl>  <dbl>   <dbl>  <dbl>
## 1         1      19     44         57     54.5     65      87   14.5
## 2         2      21     52         64     61.4     72      98   14.8
## 3         3      18     39         55     53.4     66     100   17.0
## # ℹ 25 more variables: Age_Variance <dbl>, AvgDrinksPerWeek_Min <dbl>,
## #   AvgDrinksPerWeek_Q1 <dbl>, AvgDrinksPerWeek_Median <dbl>,
## #   AvgDrinksPerWeek_Mean <dbl>, AvgDrinksPerWeek_Q3 <dbl>,
## #   AvgDrinksPerWeek_Max <dbl>, AvgDrinksPerWeek_SD <dbl>,
## #   AvgDrinksPerWeek_Variance <dbl>, WeeklyMinutesModerateExercise_Min <dbl>,
## #   WeeklyMinutesModerateExercise_Q1 <dbl>,
## #   WeeklyMinutesModerateExercise_Median <dbl>, …

A tibble 3x33 was created that contains 3 rows (smoking groups) with 33 columns: (1 group variable (SmokeStat), 4 continuous variables & 8 stats). The youngest person that is a current smoker: 19, while the oldest is a former smoker: 21. The max age is a person who never smoked: 100, while the lowest max age was for a current smoker: 87. The oldest group on average was former (2) smokers, suggesting that quitting typically occurs later in life. The group that never smoked had the more age variability with their SD=17. current and never smokers have similar mean ages.


Skimr Descriptive Package: produce a statistical summary for the continous variables

library(skimr)
skim(Group7Variables[, c("Age",
                         "AvgDrinksPerWeek",
                         "WeeklyMinutesModerateExercise",
                         "BMI")])
Data summary
Name …[]
Number of rows 2402
Number of columns 4
_______________________
Column type frequency:
numeric 4
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Age 0 1 55.59 16.58 18.0 43 58.0 68.0 100.0 ▃▅▇▅▁
AvgDrinksPerWeek 0 1 3.36 6.59 0.0 0 0.0 4.0 70.0 ▇▁▁▁▁
WeeklyMinutesModerateExercise 0 1 161.78 271.19 0.0 0 90.0 210.0 4620.0 ▇▁▁▁▁
BMI 0 1 28.61 6.54 10.9 24 27.5 31.9 73.8 ▂▇▁▁▁

Writing Functions

my_summary_func <- function(data) {
  
  continuous_vars <- data[, c("Age",
                              "AvgDrinksPerWeek",
                              "WeeklyMinutesModerateExercise",
                              "BMI")]
  
  cat("----- Five-Number Summary -----\n")
  print(summary(continuous_vars))
  
  cat("\n----- Means -----\n")
  print(colMeans(continuous_vars))
  
  cat("\n----- Standard Deviations -----\n")
  print(apply(continuous_vars, 2, sd))
  
  cat("\n----- Variances -----\n")
  print(apply(continuous_vars, 2, var))
}

Run the function

my_summary_func(Group7Variables)
## ----- Five-Number Summary -----
##       Age         AvgDrinksPerWeek WeeklyMinutesModerateExercise
##  Min.   : 18.00   Min.   : 0.000   Min.   :   0.0               
##  1st Qu.: 43.00   1st Qu.: 0.000   1st Qu.:   0.0               
##  Median : 58.00   Median : 0.000   Median :  90.0               
##  Mean   : 55.59   Mean   : 3.358   Mean   : 161.8               
##  3rd Qu.: 68.00   3rd Qu.: 4.000   3rd Qu.: 210.0               
##  Max.   :100.00   Max.   :70.000   Max.   :4620.0               
##       BMI       
##  Min.   :10.90  
##  1st Qu.:24.00  
##  Median :27.50  
##  Mean   :28.61  
##  3rd Qu.:31.90  
##  Max.   :73.80  
## 
## ----- Means -----
##                           Age              AvgDrinksPerWeek 
##                     55.591174                      3.358035 
## WeeklyMinutesModerateExercise                           BMI 
##                    161.781848                     28.607036 
## 
## ----- Standard Deviations -----
##                           Age              AvgDrinksPerWeek 
##                     16.579024                      6.588142 
## WeeklyMinutesModerateExercise                           BMI 
##                    271.189253                      6.538369 
## 
## ----- Variances -----
##                           Age              AvgDrinksPerWeek 
##                     274.86403                      43.40362 
## WeeklyMinutesModerateExercise                           BMI 
##                   73543.61087                      42.75028

Age demonstrates that most participants are middle aged, with the mean being 55 years old. Most participants drink 0-3 drinks on average per week however, the average was shifted due to outliers. The median exercise was 90 minutes per week. The average BMI indicated that people are slightly overweight.


Bi’s Contribution

Barplot of Quality Care

barplot(table(Group7Variables$QualityCare),
        main = "Distribution of QualityCare",
        col = "lightblue", 
        names.arg = c("Excellent", "Very Good", "Good", "Fair","Poor"),
        xlab = "Quality of Care Rating",
        ylab = "Count")

Most respondents reported receiving good quality care. Only a small proportion reported poor quality care. The distribution suggests overall positive perception of care quality.


Barplot of Race and Ethnicity

barplot(table(Group7Variables$RaceEthn5),
        main = "Distribution of Race/Ethnicity",
        col = "lightgreen", 
        names.arg = c("Non-Hisp White", "Non-Hisp Black/AA", "Hispanic", "Non-Hisp Asian","Non-Hisp Other"),
        xlab = "Race/Ethnicty",
        ylab = "Count",
        cex.names = 0.75,
        cex.lab = 1.2)

The sample consists mostly of Non-Hispanic White, with smaller proportions of Non-Hispanic Asian and Non-Hispanic Other participants.


Histogram of Age

hist(Group7Variables$Age,
     main = "Distribution of Age",
     xlab="Age",
     col = "pink")

Age appears normally distributed in 60 years old and skewed right.


Histogram of BMI

hist(Group7Variables$BMI,
     main = "Distribution of BMI",
     xlab="BMI",
     col = "orange")

BMI appears slightly right-skewed, with most values between 20-30.


Barplot of QUality Care using ggplot2

library(ggplot2)
ggplot(Group7Variables, aes(x = factor(QualityCare,
                                       levels = c(1,2,3,4,5),
                                       labels = c("Excellent",
                                                  "Very Good",
                                                  "Good",
                                                  "Fair",
                                                  "Poor"))))+
  geom_bar(fill = "lightblue") +
  xlab("Quality Care Rating") +
  ggtitle("Distribution of QualityCare")

Most respondents reported receiving good quality care. Only a small proportion reported poor quality care. The distribution suggests overall positive perception of care quality.


Barplot of Race & Ethnicity using ggplot2

ggplot(Group7Variables, aes(x = factor(RaceEthn5,
                                       levels = c(1,2,3,4,5),
                                       labels = c("Non-Hisp White",
                                                  "Non-Hisp Black/AA",
                                                  "Hispanic",
                                                  "Non-Hisp Asian",
                                                  "Non-Hisp Other")))) +
  geom_bar(fill = "lightgreen") +
  xlab("Race/Ethnicity Category") +
  ggtitle("Distribution of RaceEthn5")

The sample consists mostly of Non-Hispanic White, with smaller proportions of Non-Hispanic Asian and Non-Hispanic Other participants.


Histogram of Age using ggplot2

ggplot(Group7Variables, aes(x = Age)) +
  geom_histogram(fill = "pink", bins = 30) +
  ggtitle("Distribution of Age") +
  xlab("Age")

Age appears normally distributed in 60 years old and skewed right.


Histogram of BMI using ggplot2

ggplot(Group7Variables, aes(x = BMI)) +
  geom_histogram(fill = "orange", bins = 30) +
  ggtitle("Distribution of BMI") +
  xlab("BMI")

BMI appears slightly right-skewed, with most values between 20-30.


Single R Function for barplots

my_visual_func_I <- function(data, varname){
  barplot(table(data[[varname]]),
          main = paste("Distribution of", varname),
          col = "lightblue",
          xlab = varname)
}
my_visual_func_I(Group7Variables, "QualityCare")

Most respondents reported receiving good quality care. Only a small proportion reported poor quality care. The distribution suggests overall positive perception of care quality.

my_visual_func_I(Group7Variables, "RaceEthn5")

The sample is mostly in group 1(Non-Hispanic White), with smaller proportions of in group 4(Non-Hispanic Asian)and 5(Non-Hispanic Other).


Single R Function for Histograms

my_visual_func_II <- function(data, varname){
  hist(data[[varname]],
       main = paste("Distribution of", varname),
       col = "orange",
       xlab = varname)
}
my_visual_func_II(Group7Variables, "Age")

Age appears normally distributed in 60 years old and skewed right.

my_visual_func_II(Group7Variables, "BMI")

BMI appears slightly right-skewed, with most values between 20-30.