Introduction

This analysis used data from the 2021–2023 National Health and Nutrition Examination Survey (NHANES) to examine whether Glycohemoglobin (HbA1c) levels differ significantly among racial and ethnic groups in the sample. Three NHANES files were used: Demographics (DEMO), Glycohemoglobin (GHB), and Body Measures (BMX). These datasets were merged and cleaned to prepare for analysis. The quantitative variables were Body Mass Index (BMI), age, Glycohemoglobin (HbA1c), and waist circumference. The qualitative variables were sex, education level, race/ethnicity, and marital status. Each variable was summarized, and several plots were created to illustrate relationships among variables in the dataset. Finally, a one-way ANOVA was conducted to determine whether mean Glycohemoglobin levels differed significantly among the racial and ethnic groups.

Data Import

To test the hypothesis, three datasets from the 2021–2023 National Health and Nutrition Examination Survey (NHANES) were used: the Body Measures (BMX), Demographics (DEMO), and Glycohemoglobin (GHB) files. These datasets were imported into R and served as the foundation for merging, cleaning, and preparing the analytic dataset used in subsequent statistical analyses.

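## LOAD the packages that provide the functions called below ##
library(haven)        # read_xpt()
library(dplyr)        # %>%, inner_join(), select(), rename(), mutate(), summarise()
library(tidyr)        # drop_na()
library(summarytools) # descr(), freq()
library(ggplot2)      # ggplot() and geoms
library(corrplot)     # corrplot()

## IMPORT NHANES 'Demographic' Data ##
nhanes_demo <- read_xpt("C:/Users/Karin/OneDrive/Spring 2026/PUBH 422/Final Project/DEMO_L (4).xpt")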

View(nhanes_demo) # view 'Demographic' Data 

## IMPORT NHANES 'Glycohemoglobin' Data ##
nhanes_ghb <- read_xpt("C:/Users/Karin/OneDrive/Spring 2026/PUBH 422/Final Project/GHB_L.xpt")

View(nhanes_ghb) # View 'Glycohemoglobin' Data

## IMPORT NHANES 'Body Measures' Data ##
nhanes_bmx <- read_xpt("C:/Users/Karin/OneDrive/Spring 2026/PUBH 422/Final Project/BMX_L.xpt")

View (nhanes_bmx) # View 'Body Measures' Data
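
The absolute paths used for the import only resolve on one machine. A minimal sketch of a more portable pattern follows. It assumes the .xpt files sit in a hypothetical `data` folder inside the project; the folder name, the file name `DEMO_L.xpt`, and the helper `xpt_path()` are illustrative assumptions, not part of the original project.

```r
# Hypothetical folder holding the downloaded NHANES .xpt files (an assumption).
data_dir <- "data"

# Illustrative helper: build a path to one file inside that folder.
xpt_path <- function(file_name, dir = data_dir) {
  file.path(dir, file_name)
}

demo_file <- xpt_path("DEMO_L.xpt")

# Only attempt the import when the file is actually present.
if (file.exists(demo_file)) {
  nhanes_demo <- haven::read_xpt(demo_file)
} else {
  message("File not found: ", demo_file)
}
```

Checking `file.exists()` first gives a clear message about a missing file rather than an error from the reader.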

Merge Data Sets and Select Variables

Following import, the three NHANES datasets were merged on the shared respondent sequence number (SEQN). Because inner joins were used, only participants whose SEQN appeared in all three datasets were retained. From this merged file, a subset of variables relevant to the study hypothesis was selected for analysis. The quantitative variables were Glycohemoglobin (%), age (years), Body Mass Index (BMI), and waist circumference (cm). The qualitative variables were sex, race/ethnicity, education level, and marital status.

## MERGE the three data sets, matching participants by 'SEQN' ##
merged_nhanes <- nhanes_demo %>% 
  inner_join(nhanes_ghb, by = "SEQN") %>%
  inner_join(nhanes_bmx, by = "SEQN")

View (merged_nhanes) # View 'merged_nhanes' data set 

## EXTRACT SEQN, four quantitative variables, and four qualitative variables for data analysis ##
variable_nhanes <- merged_nhanes %>%
  select(SEQN, 
         LBXGH, RIDAGEYR, BMXBMI, BMXWAIST, 
         RIAGENDR, RIDRETH3, DMDEDUC2, DMDMARTZ)

str(variable_nhanes) # Confirm there are 9 columns in the new data set

## tibble [7,199 × 9] (S3: tbl_df/tbl/data.frame)
##  $ SEQN    : num [1:7199] 130378 130379 130380 130386 130387 ...
##   ..- attr(*, "label")= chr "Respondent sequence number"
##  $ LBXGH   : num [1:7199] 5.6 5.6 6.2 5.1 5.9 4.9 5.5 5.9 5.4 5.6 ...
##   ..- attr(*, "label")= chr "Glycohemoglobin (%)"
##  $ RIDAGEYR: num [1:7199] 43 66 44 34 68 27 59 31 33 74 ...
##   ..- attr(*, "label")= chr "Age in years at screening"
##  $ BMXBMI  : num [1:7199] 27 33.5 29.7 30.2 42.6 43.7 28 46 38.9 43 ...
##   ..- attr(*, "label")= chr "Body Mass Index (kg/m**2)"
##  $ BMXWAIST: num [1:7199] 98.3 114.7 93.5 106.1 122 ...
##   ..- attr(*, "label")= chr "Waist Circumference (cm)"
##  $ RIAGENDR: num [1:7199] 1 1 2 1 2 2 1 2 2 2 ...
##   ..- attr(*, "label")= chr "Gender"
##  $ RIDRETH3: num [1:7199] 6 3 2 1 3 4 3 3 3 3 ...
##   ..- attr(*, "label")= chr "Race/Hispanic origin w/ NH Asian"
##  $ DMDEDUC2: num [1:7199] 5 5 3 4 5 4 5 3 3 5 ...
##   ..- attr(*, "label")= chr "Education level - Adults 20+"
##  $ DMDMARTZ: num [1:7199] 1 1 1 1 3 1 1 1 3 1 ...
##   ..- attr(*, "label")= chr "Marital status"
##  - attr(*, "label")= chr "Demographic Variables and Sample Weights"
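
The effect of the inner joins in the merge step can be seen on a small example. The sketch below uses three made-up tables in place of the NHANES files; the values are invented for illustration only. It shows that only sequence numbers present in every table survive, and that `anti_join()` can list the participants dropped at a given step.

```r
library(dplyr)

# Toy tables standing in for the three NHANES files (values are made up).
demo <- tibble(SEQN = c(1, 2, 3, 4), RIDAGEYR = c(43, 66, 44, 34))
ghb  <- tibble(SEQN = c(1, 2, 4),    LBXGH    = c(5.6, 5.6, 5.1))
bmx  <- tibble(SEQN = c(2, 3, 4),    BMXBMI   = c(33.5, 29.7, 30.2))

# Same join pattern as the merge step: keep only matching SEQN values.
merged <- demo %>%
  inner_join(ghb, by = "SEQN") %>%
  inner_join(bmx, by = "SEQN")

# Only SEQN 2 and 4 appear in all three tables, so only they remain.
merged$SEQN

# anti_join() returns the rows of `demo` with no match in `ghb`.
dropped_by_ghb <- anti_join(demo, ghb, by = "SEQN")
dropped_by_ghb$SEQN
```

Comparing `nrow()` before and after each join on the real data would show how many participants each file excluded.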

Clean Data

After creating the dataset with the selected variables, the data were cleaned by removing all participants with a missing value (NA) on any selected variable. This complete-case approach ensured that every descriptive and inferential result was based on the same set of participants with full data. Because education level (DMDEDUC2) is recorded only for adults aged 20 and older, this step also restricted the sample to adults.

anyNA(variable_nhanes) # Examines for NA's in data set

## [1] TRUE

clean_nhanes <- variable_nhanes %>% drop_na() # remove all rows with NA in the data set 

anyNA(clean_nhanes) # confirms there are no NAs left in the data set

## [1] FALSE
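
Before dropping incomplete rows, it can help to see which variables account for the missing values. The following is a minimal base R sketch on made-up data (the data frame `toy` is an illustrative stand-in for `variable_nhanes`): it counts missing values per column and compares row counts before and after complete-case filtering.

```r
# Toy data standing in for variable_nhanes (values are made up).
toy <- data.frame(
  SEQN     = c(1, 2, 3, 4, 5),
  LBXGH    = c(5.6, NA, 6.2, 5.1, 5.9),
  BMXBMI   = c(27.0, 33.5, NA, 30.2, 42.6),
  DMDEDUC2 = c(5, 5, 3, NA, NA)
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(toy))
na_per_column

# Keep rows with no missing values; these are the rows tidyr::drop_na() keeps.
toy_complete <- toy[complete.cases(toy), ]

# Row counts before and after, to report how many participants were excluded.
c(before = nrow(toy), after = nrow(toy_complete))
```

Reporting these counts alongside the cleaned data makes the size and source of the exclusions explicit.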

Rename Variables

Once the data were cleaned, the quantitative and qualitative variables were renamed to improve clarity and ensure consistent identification throughout the analysis.

## RENAME 4 quantitative variables ##
clean_nhanes <- clean_nhanes %>%
  rename( 
    HbA1c = LBXGH,
    Age = RIDAGEYR,
    BMI = BMXBMI,
    Waist = BMXWAIST
    )

## RENAME 4 qualitative variables ##
clean_nhanes <- clean_nhanes %>%
  rename(
    Sex = RIAGENDR,
    Race_Ethn = RIDRETH3,
    Education = DMDEDUC2,
    Marital_Stat = DMDMARTZ
  )

summary (clean_nhanes) # examines summary to confirm names were changed

##       SEQN            HbA1c             Age             BMI       
##  Min.   :130378   Min.   : 3.200   Min.   :20.00   Min.   :11.10  
##  1st Qu.:133298   1st Qu.: 5.200   1st Qu.:39.00   1st Qu.:24.70  
##  Median :136366   Median : 5.500   Median :57.00   Median :28.40  
##  Mean   :136326   Mean   : 5.778   Mean   :53.82   Mean   :29.71  
##  3rd Qu.:139287   3rd Qu.: 5.900   3rd Qu.:68.00   3rd Qu.:33.40  
##  Max.   :142310   Max.   :17.100   Max.   :80.00   Max.   :69.90  
##      Waist            Sex          Race_Ethn      Education      Marital_Stat  
##  Min.   : 60.0   Min.   :1.000   Min.   :1.00   Min.   :1.000   Min.   : 1.00  
##  1st Qu.: 89.0   1st Qu.:1.000   1st Qu.:3.00   1st Qu.:3.000   1st Qu.: 1.00  
##  Median : 99.6   Median :2.000   Median :3.00   Median :4.000   Median : 1.00  
##  Mean   :101.1   Mean   :1.548   Mean   :3.29   Mean   :3.869   Mean   : 1.71  
##  3rd Qu.:111.4   3rd Qu.:2.000   3rd Qu.:3.00   3rd Qu.:5.000   3rd Qu.: 2.00  
##  Max.   :187.0   Max.   :2.000   Max.   :7.00   Max.   :9.000   Max.   :99.00

Recode Qualitative Variables to Factor

The final step in preparing the dataset was recoding the qualitative variables as factor variables. Using the NHANES codebooks, the appropriate response categories were identified, and participants who selected “Don’t Know” or “Refused” were removed. This ensured that all categorical variables were clearly defined and that the dataset contained only valid responses for analysis.

#Recode 'Sex' to factor
clean_nhanes <- clean_nhanes %>%
  mutate(Sex = factor(Sex,
                      levels = c(1,2),
                      labels = c("Male", "Female")))

#Recode 'Race_Ethn' to factor
clean_nhanes <- clean_nhanes %>%
  mutate(Race_Ethn = factor(Race_Ethn,
                      levels = c(1,2, 3, 4, 6, 7),
                      labels = c("Mexican American", "Other Hispanic", "Non-Hispanic White", "Non-Hispanic Black", "Non-Hispanic Asian", "Other Race")))

clean_nhanes <- clean_nhanes[clean_nhanes$Education %in% c(1, 2, 3, 4, 5), ] # remove 'Refused' and 'Don't Know' responses from the Education variable

#Recode 'Education' to factor
clean_nhanes <- clean_nhanes %>%
  mutate(Education = factor(Education, 
                            levels = c(1,2,3,4,5),
                            labels = c("Less than 9th grade","9th-11th grade","Highschool Grad or Equivalent","Some College or AA degree", "College Graduate or above")))

clean_nhanes <- clean_nhanes[clean_nhanes$Marital_Stat %in% c(1, 2, 3), ] # remove 'Refused' and 'Don't Know' responses from the Marital Status variable

#Recode 'Marital_Stat' to factor
clean_nhanes <- clean_nhanes %>%
  mutate(Marital_Stat = factor(Marital_Stat, 
                            levels = c(1,2,3),
                            labels = c("Married/Living with Partner","Widowed/Divorced/Seperated","Never Married")))

summary(clean_nhanes) # examine summary to confirm all qualitative variables were converted to factor

##       SEQN            HbA1c             Age             BMI       
##  Min.   :130378   Min.   : 3.200   Min.   :20.00   Min.   :11.10  
##  1st Qu.:133303   1st Qu.: 5.200   1st Qu.:39.00   1st Qu.:24.70  
##  Median :136370   Median : 5.500   Median :57.00   Median :28.40  
##  Mean   :136330   Mean   : 5.779   Mean   :53.82   Mean   :29.71  
##  3rd Qu.:139289   3rd Qu.: 5.900   3rd Qu.:68.00   3rd Qu.:33.40  
##  Max.   :142310   Max.   :17.100   Max.   :80.00   Max.   :69.90  
##      Waist           Sex                    Race_Ethn   
##  Min.   : 60.0   Male  :2473   Mexican American  : 370  
##  1st Qu.: 89.0   Female:3006   Other Hispanic    : 568  
##  Median : 99.6                 Non-Hispanic White:3277  
##  Mean   :101.1                 Non-Hispanic Black: 621  
##  3rd Qu.:111.4                 Non-Hispanic Asian: 296  
##  Max.   :187.0                 Other Race        : 347  
##                          Education                         Marital_Stat 
##  Less than 9th grade          : 257   Married/Living with Partner:3041  
##  9th-11th grade               : 413   Widowed/Divorced/Seperated :1353  
##  Highschool Grad or Equivalent:1133   Never Married              :1085  
##  Some College or AA degree    :1675                                     
##  College Graduate or above    :2001                                     
##
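
The reason for filtering out non-substantive codes before recoding can be shown on a small vector. In the sketch below the codes are made up, and 77 and 99 are used as stand-ins for "Refused" and "Don't Know". Any code not listed in `levels` is silently converted to NA by `factor()`, so tabulating the raw codes first shows what would be lost.

```r
# Toy marital-status codes; 77 and 99 stand in for non-substantive responses.
codes <- c(1, 2, 3, 1, 77, 99)

# Tabulate the raw codes first, including any NA values.
table(codes, useNA = "ifany")

# Codes not listed in `levels` silently become NA.
recoded <- factor(codes,
                  levels = c(1, 2, 3),
                  labels = c("Married/Living with Partner",
                             "Widowed/Divorced/Separated",
                             "Never Married"))

# Count how many values the recode turned into NA.
sum(is.na(recoded))
```

Filtering with `%in%` before calling `factor()`, as the report does, removes those rows explicitly rather than leaving hidden NA values in the factor.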

Descriptive Statistics of Quantitative Variables

To begin the analysis, descriptive statistics were generated for the four quantitative variables: age, Body Mass Index (BMI), Glycohemoglobin (%), and waist circumference. After data cleaning, the final sample comprised 5,479 participants. The mean age was approximately 54 years (range 20 to 80), which suits the study because the hypothesis concerns Glycohemoglobin levels among adults. The mean BMI was 29.71 (range 11.10 to 69.90). Glycohemoglobin, the outcome of interest for the hypothesis, had a mean of 5.78% and ranged from 3.20% to 17.10%. Waist circumference, included because of its close relationship with BMI, averaged 101.10 centimeters. Together, these measures provide context for evaluating how demographic factors relate to health indicators such as Glycohemoglobin.

quant_variables <- clean_nhanes %>%
  select(HbA1c,
          Age,
          BMI,
          Waist) # creates a new table with only quantitative variables 


descr( #initiates descriptive statistics function 
  quant_variables, # selects quantitative variables 
  stats = c("n.valid", "mean", "med", "sd", "min", "max") # list stats to include in the table 
)

## Descriptive Statistics  
## quant_variables  
## Label: Demographic Variables and Sample Weights  
## N: 5479  
## 
##                     Age       BMI     HbA1c     Waist
## ------------- --------- --------- --------- ---------
##       N.Valid   5479.00   5479.00   5479.00   5479.00
##          Mean     53.82     29.71      5.78    101.10
##        Median     57.00     28.40      5.50     99.60
##       Std.Dev     17.00      7.08      1.10     16.74
##           Min     20.00     11.10      3.20     60.00
##           Max     80.00     69.90     17.10    187.00

Descriptive Statistics of HbA1c Levels Across Racial/Ethnic Participants

A descriptive statistics table of HbA1c levels was created to examine differences across racial and ethnic groups. Non-Hispanic Black and Mexican American participants had the highest mean HbA1c (6.01 each), and Non-Hispanic Black participants also had the highest median (5.7). Non-Hispanic White participants had the lowest mean (5.69) and median (5.5).

## generate descriptive statistics for Race/Ethnicity and HbA1c Levels ##

clean_nhanes %>%
  group_by(Race_Ethn) %>%
  summarise(
    mean = mean(HbA1c, na.rm = TRUE),
    sd = sd(HbA1c, na.rm = TRUE),
    min = min(HbA1c, na.rm = TRUE),
    max = max(HbA1c, na.rm = TRUE),
    median = median(HbA1c, na.rm = TRUE),
  )

## # A tibble: 6 × 6
##   Race_Ethn           mean    sd   min   max median
##   <fct>              <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 Mexican American    6.01 1.39    4.2  13.7    5.6
## 2 Other Hispanic      5.82 1.12    3.8  13.5    5.6
## 3 Non-Hispanic White  5.69 0.981   3.2  17.1    5.5
## 4 Non-Hispanic Black  6.01 1.29    3.6  14.8    5.7
## 5 Non-Hispanic Asian  5.80 1.04    4.2  11      5.6
## 6 Other Race          5.91 1.28    4.3  14.6    5.6
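
Group means are easier to interpret next to the group sizes. The sketch below repeats the same `group_by()` and `summarise()` pattern on made-up data and adds `n()` to report the number of participants per group; the group labels "A" and "B" and all values are illustrative assumptions.

```r
library(dplyr)

# Toy data: two made-up groups with three HbA1c values each.
toy <- tibble(
  Race_Ethn = factor(c("A", "A", "A", "B", "B", "B")),
  HbA1c     = c(5.0, 5.5, 6.0, 5.5, 6.0, 7.0)
)

# Same summary as in the report, with the group size added.
group_stats <- toy %>%
  group_by(Race_Ethn) %>%
  summarise(
    n      = n(),
    mean   = mean(HbA1c),
    median = median(HbA1c),
    sd     = sd(HbA1c)
  )

group_stats
```

With the report's data, the `n` column would match the race/ethnicity counts shown in the factor summary.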

Frequency Table of Categorical Variables

Frequency tables were then created for each qualitative variable: sex, race/ethnicity, education level, and marital status. Of the 5,479 participants in the sample, about 55% were female and 45% were male. For race and ethnicity, the largest group identified as Non-Hispanic White (59.81%), followed by Non-Hispanic Black (11.33%), Other Hispanic (10.37%), Mexican American (6.75%), Other Race (6.33%), and Non-Hispanic Asian (5.40%).

In terms of education, about 67% of the sample had at least some college education: 30.57% had some college or an associate degree, and 36.52% were college graduates or above. For marital status, 55.50% reported being married or living with a partner, 24.69% were widowed, divorced, or separated, and 19.80% had never married. These qualitative variables describe the overall composition of the sample. The most important for the hypothesis, however, is race/ethnicity.

cat_variables <- clean_nhanes %>% 
  select(Sex,
         Race_Ethn,
         Education,
         Marital_Stat) # creates a new data set with only categorical variables to run freq tables

freq(cat_variables$Sex) # generates 'Sex' frequency table

## Frequencies  
## cat_variables$Sex  
## Type: Factor  
## 
##                Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ------------ ------ --------- -------------- --------- --------------
##         Male   2473     45.14          45.14     45.14          45.14
##       Female   3006     54.86         100.00     54.86         100.00
##         <NA>      0                               0.00         100.00
##        Total   5479    100.00         100.00    100.00         100.00

freq(cat_variables$Race_Ethn) # generates 'Race_Ethn' frequency table

## Frequencies  
## cat_variables$Race_Ethn  
## Type: Factor  
## 
##                            Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ------------------------ ------ --------- -------------- --------- --------------
##         Mexican American    370      6.75           6.75      6.75           6.75
##           Other Hispanic    568     10.37          17.12     10.37          17.12
##       Non-Hispanic White   3277     59.81          76.93     59.81          76.93
##       Non-Hispanic Black    621     11.33          88.26     11.33          88.26
##       Non-Hispanic Asian    296      5.40          93.67      5.40          93.67
##               Other Race    347      6.33         100.00      6.33         100.00
##                     <NA>      0                               0.00         100.00
##                    Total   5479    100.00         100.00    100.00         100.00

freq(cat_variables$Education) # generates 'Education' frequency table

## Frequencies  
## cat_variables$Education  
## Type: Factor  
## 
##                                       Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------------------------------- ------ --------- -------------- --------- --------------
##                 Less than 9th grade    257      4.69           4.69      4.69           4.69
##                      9th-11th grade    413      7.54          12.23      7.54          12.23
##       Highschool Grad or Equivalent   1133     20.68          32.91     20.68          32.91
##           Some College or AA degree   1675     30.57          63.48     30.57          63.48
##           College Graduate or above   2001     36.52         100.00     36.52         100.00
##                                <NA>      0                               0.00         100.00
##                               Total   5479    100.00         100.00    100.00         100.00

freq(cat_variables$Marital_Stat) # generates 'Marital_Stat' frequency table

## Frequencies  
## cat_variables$Marital_Stat  
## Type: Factor  
## 
##                                     Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## --------------------------------- ------ --------- -------------- --------- --------------
##       Married/Living with Partner   3041     55.50          55.50     55.50          55.50
##        Widowed/Divorced/Seperated   1353     24.69          80.20     24.69          80.20
##                     Never Married   1085     19.80         100.00     19.80         100.00
##                              <NA>      0                               0.00         100.00
##                             Total   5479    100.00         100.00    100.00         100.00
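
The percentages quoted in the text can be rechecked directly from the counts. As a small worked example, the sketch below copies the education counts from the frequency table and recomputes the category percentages and the combined share with at least some college education; the object names are illustrative.

```r
# Education counts copied from the frequency table.
education_counts <- c(
  "Less than 9th grade"           = 257,
  "9th-11th grade"                = 413,
  "Highschool Grad or Equivalent" = 1133,
  "Some College or AA degree"     = 1675,
  "College Graduate or above"     = 2001
)

# Percent of the sample in each category.
education_pct <- round(100 * education_counts / sum(education_counts), 2)
education_pct

# Share with at least some college education (the last two categories).
some_college_or_more <- sum(education_counts[4:5]) / sum(education_counts)
round(100 * some_college_or_more, 2)
```

The last two categories together account for 3,676 of 5,479 participants, or roughly 67% of the sample.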

Univariate Plots

Two univariate plots were developed to show the distribution of key variables in the sample. The first is a bar plot of the race and ethnicity breakdown of the full sample. It shows that the majority of participants identified as Non-Hispanic White, followed by Non-Hispanic Black and Other Hispanic. Understanding the racial and ethnic makeup of the sample matters because it relates directly to the hypothesis. The second is a density plot of BMI for males and females. Both distributions were right-skewed: most BMI values were concentrated at the lower end, with a smaller number of very high values. The female curve was wider than the male curve, suggesting greater variability in BMI among females. Together, these plots illustrate the racial and ethnic composition of the sample and the overall BMI distribution, both of which inform the interpretation of the results.

### Plot for Quantitative Variable (BMI) ###
ggplot(clean_nhanes, aes(x = BMI, fill = Sex)) + # creates ggplot with x as 'BMI' and fill as 'Sex'
  geom_density(alpha = 0.5) + # sets transparency of fill to 0.5
  labs(title = "NHANES 2021-2023, Density of BMI", # titles the plot 'Density of BMI'
       x = "BMI", # titles the x-axis 'BMI'
       y = "Density") + # titles the y-axis 'Density'
  theme(
    plot.title = element_text(hjust = 0.5) # centers the title
  ) +
  scale_fill_manual(
    values = c("Male" = "blue", "Female" = "pink") # sets the fill color for each sex
  )

### Plot for Categorical Variable (Race_Ethn) ###        
ggplot(clean_nhanes, aes(x = Race_Ethn, fill = Race_Ethn)) + # initiates ggplot with the 'Race_Ethn' variable
  geom_bar() + # indicates plot as a bar plot
  theme(
    axis.text.x = element_text(size = 10, angle = 45, hjust = 1), # angles the x-axis labels
    plot.title = element_text(hjust = 0.5) # centers the title
  ) +
  labs(title = "NHANES 2021-2023 Race and Ethnicity Bar Plot", # titles the plot
       x = "Race/Ethnicity", # names the x-axis
       y = "Count", # names the y-axis
       fill = "Race/Ethnicity") # names the legend

Bivariate Plots

Two bivariate plots were then developed. The first is a box plot of Glycohemoglobin levels by racial and ethnic group. It indicates that Non-Hispanic Black participants have a higher median Glycohemoglobin level than the other groups, while Non-Hispanic White participants have the lowest median. These differences point to potential disparities in blood sugar control across racial and ethnic populations. The second plot is a scatter plot of Glycohemoglobin against BMI. The linear regression line has a slight upward slope, indicating a weak positive association: as BMI increases, Glycohemoglobin tends to increase slightly. The relationship is not strong, however, suggesting that BMI alone does not strongly predict Glycohemoglobin levels.

### Box Plot of Glycohemoglobin Across Race/Ethnicity Groups ###
ggplot(clean_nhanes, aes(x = Race_Ethn, y = HbA1c, fill = Race_Ethn)) + # creates ggplot of 'HbA1c' by 'Race_Ethn'
  geom_boxplot()+  # indicates plot as boxplot
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  labs(title= "NHANES 2021-2023, Glycohemoglobin (%) by Race/Ethnicity Boxplot", # names the plot 
       x="Race/Ethnicity", # names the x-axis 
       y="Glycohemoglobin (%)", # names the y axis
       fill="Race/Ethnicity") # names the legend

### Scatterplot of BMI and Glycohemoglobin ###
ggplot (clean_nhanes, aes(x=BMI, y=HbA1c))+ # initiates ggplot with two variables 
  geom_point()+ # indicates ggplot as a scatter plot 
  geom_smooth(method=lm)+ # adds linear regression line 
  labs(title= "NHANES 2021-2023, BMI and Glycohemoglobin (%) Scatter Plot")+ # adds title to the plot 
  theme(plot.title=element_text(hjust=0.5) # centers the title of the plot 
        )

## `geom_smooth()` using formula = 'y ~ x'
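
A description such as "weak positive correlation" can be backed by a number. The sketch below shows one way to do that with `cor.test()` and `lm()` from base R. It runs on simulated values, not the NHANES data; the seed, sample size, and coefficients are arbitrary assumptions chosen only to produce a weak positive relationship.

```r
set.seed(422)  # arbitrary seed so the simulated example is reproducible

# Simulated BMI and HbA1c values with a weak built-in positive relationship.
n     <- 500
bmi   <- rnorm(n, mean = 30, sd = 7)
hba1c <- 5 + 0.03 * bmi + rnorm(n, mean = 0, sd = 1)

# Pearson correlation with a confidence interval and p-value.
bmi_cor <- cor.test(bmi, hba1c, method = "pearson")
bmi_cor$estimate   # correlation coefficient r
bmi_cor$conf.int   # 95% confidence interval for r

# Slope of the least-squares line, the same line geom_smooth(method = lm) draws.
fit <- lm(hba1c ~ bmi)
coef(fit)[["bmi"]]
```

Applied to the report's data, the same calls would take `clean_nhanes$BMI` and `clean_nhanes$HbA1c` as inputs.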

Multifaceted Scatterplots

Next, faceted scatter plots were created to observe relationships between quantitative variables across different groups (sex and race/ethnicity). The first plot shows the relationship between BMI and Glycohemoglobin, with one panel per sex. For both males and females, the linear regression line has a slight upward slope, indicating a weak positive association: as BMI increases, Glycohemoglobin tends to increase slightly. The plot also suggests that females have greater variability in both BMI and Glycohemoglobin, which may reflect other contributing factors such as metabolism. Overall, this plot supports the finding of a weak positive relationship between Glycohemoglobin and BMI.

The second faceted scatter plot shows the relationship between Glycohemoglobin (%) and age across racial and ethnic groups. Each panel represents one group, allowing a visual comparison of trends. Across all groups, the regression lines slope slightly upward, indicating a weak positive association between age and Glycohemoglobin: as age increases, Glycohemoglobin tends to rise slightly.

## Multifaceted Scatterplot of BMI by Glycohemoglobin, separated by Sex ##

ggplot(clean_nhanes, aes(x = BMI, y = HbA1c)) + # creates a scatterplot for BMI and HbA1c
  geom_point(aes(color = Sex), size = 3) + # adds scatterplot points, colored by Sex
  geom_smooth(method = lm, se = FALSE) + # adds linear regression line to scatterplot
  facet_wrap(~ Sex) + # creates two separate panels by sex
  labs(
    title = "Scatter Plot of Glycohemoglobin vs BMI by Sex", # titles the scatterplot 
    x = "BMI", # titles the x-axis 
    y = "Glycohemoglobin (%)", # titles the y axis 
    color = "Sex" # titles the legend 
  ) +
  theme_minimal() + # applies minimal theme 
  theme(
    plot.title = element_text(hjust = 0.5) # centers the title of the plot 
  )

## `geom_smooth()` using formula = 'y ~ x'

## Multifaceted Scatterplot of Glycohemoglobin by Age, separated by Race/Ethnicity##

ggplot(clean_nhanes, aes(x = Age, y = HbA1c)) + # creates a scatterplot for Age and HbA1c
  geom_point(aes(color = Race_Ethn), size = 3) + # adds scatterplot points, colored by Race/Ethnicity
  geom_smooth(method = lm, se = FALSE) + # adds linear regression line to scatterplot
  facet_wrap(~ Race_Ethn) + # creates one panel per Race/Ethnicity group
  labs(
    title = "Scatter Plot of Glycohemoglobin vs Age by Race/Ethnicity", # titles the scatterplot 
    x = "Age", # titles the x-axis 
    y = "Glycohemoglobin (%)", # titles the y axis 
    color = "Race/Ethnicity" # titles the legend 
  ) +
  theme_minimal() + # applies minimal theme 
  theme(
    plot.title = element_text(hjust = 0.5) # centers the title of the plot 
  )

## `geom_smooth()` using formula = 'y ~ x'

Correlation Plot

The final plot created was a correlation plot designed to illustrate which quantitative variables had the strongest relationships. The plot shows that BMI and waist circumference have the strongest positive correlation, which is expected since both measure aspects of body composition. The next strongest correlation is between age and Glycohemoglobin, indicating that Glycohemoglobin levels tend to increase slightly with age. Overall, this correlation plot is useful for identifying which quantitative variables share the strongest associations within the dataset.

## create a correlation plot of multiple variables ##
corrplot(cor(quant_variables), method="color", type="upper") # creates corr plot with 4 quantitative variables
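
A color-only correlation plot shows the pattern but not the coefficients. The sketch below prints the underlying matrix and picks out the strongest off-diagonal pair. It uses simulated data in which waist circumference is deliberately built from BMI; all values and coefficients are assumptions for illustration, not NHANES results.

```r
set.seed(1)  # arbitrary seed for a reproducible simulation

# Simulated stand-in for quant_variables; Waist is constructed from BMI.
n   <- 200
bmi <- rnorm(n, mean = 30, sd = 7)
sim <- data.frame(
  HbA1c = rnorm(n, mean = 5.8, sd = 1),
  Age   = rnorm(n, mean = 54, sd = 17),
  BMI   = bmi,
  Waist = 40 + 2 * bmi + rnorm(n, mean = 0, sd = 5)
)

# The numeric matrix that corrplot() would color.
cor_matrix <- cor(sim)
round(cor_matrix, 2)

# Blank out the diagonal, then locate the largest absolute correlation.
off_diag <- cor_matrix
diag(off_diag) <- NA
strongest <- which(abs(off_diag) == max(abs(off_diag), na.rm = TRUE),
                   arr.ind = TRUE)
rownames(cor_matrix)[strongest[1, ]]
```

With the report's data, `round(cor(quant_variables), 2)` would give the coefficients behind each colored cell.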

Analysis of Variance - ANOVA

For the final analysis, a one-way ANOVA was conducted to test the hypothesis. ANOVA determines whether there are statistically significant differences in mean values among three or more groups. In this analysis, mean Glycohemoglobin levels were compared across the six racial and ethnic groups. The test returned F(5, 5473) = 14.94 with a p-value of 1.36e-14, indicating a statistically significant difference in mean HbA1c levels among the racial and ethnic groups in the sample.

# Compute the analysis of variance
anova_race <- aov(HbA1c ~ Race_Ethn, data = clean_nhanes)

# Summary of the analysis
summary(anova_race)

##               Df Sum Sq Mean Sq F value   Pr(>F)    
## Race_Ethn      5     89  17.727   14.94 1.36e-14 ***
## Residuals   5473   6495   1.187                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
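
Statistical significance in a sample this large says little about the size of the difference. As a rough worked example, the sketch below rebuilds the F statistic and an eta-squared effect size from the rounded values printed in the ANOVA table. Because the inputs are rounded, the results are approximate, and the object names are illustrative.

```r
# Values copied from the printed ANOVA table (already rounded there).
df_between <- 5
df_within  <- 5473
ms_between <- 17.727
ms_within  <- 1.187

# F is the ratio of the between-group to the within-group mean square.
f_value <- ms_between / ms_within

# Rebuild the sums of squares from mean squares and degrees of freedom.
ss_between <- ms_between * df_between
ss_within  <- ms_within * df_within

# Eta squared: share of total HbA1c variance associated with group membership.
eta_squared <- ss_between / (ss_between + ss_within)

round(c(F = f_value, eta_squared = eta_squared), 3)
```

On these figures, race/ethnicity accounts for a little over 1% of the variance in HbA1c, so the group differences are statistically clear but small relative to the variation within groups.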

Conclusion

This analysis found a statistically significant difference in mean HbA1c levels across racial and ethnic groups. Descriptively, Non-Hispanic Black participants had higher mean and median HbA1c levels than Non-Hispanic White participants. The ANOVA itself shows only that at least one group mean differs; it does not identify which specific groups differ from one another.
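
Identifying which groups differ would require a post hoc comparison, which was not part of this analysis. One common option is Tukey's HSD, available in base R as `TukeyHSD()`. The sketch below illustrates it on simulated data; the three groups, their means, and the seed are made-up assumptions, and the output says nothing about the NHANES sample.

```r
set.seed(2026)  # arbitrary seed for a reproducible simulation

# Simulated data: three made-up groups with slightly different means.
sim <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 50)),
  hba1c = c(rnorm(50, mean = 5.6, sd = 1),
            rnorm(50, mean = 5.8, sd = 1),
            rnorm(50, mean = 6.4, sd = 1))
)

# One-way ANOVA, the same model form as in the report.
fit <- aov(hba1c ~ group, data = sim)

# Tukey's HSD: each pairwise difference with an adjusted interval and p-value.
tukey <- TukeyHSD(fit)
tukey$group
```

For the report's model, the equivalent call would be `TukeyHSD(anova_race)`, which would return one row for each of the 15 pairs of racial and ethnic groups.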

Karinna Alatorre_Final Project

2026-05-05