Age Vs. Insulin

Author

Table 44

Published

April 30, 2026


1 R Setup

Code
# !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
# !!! Do not modify the code in this chunk! It sets the root directory of   !!!
# !!! your project, loads R libraries with additional capabilities, defines !!!
# !!! constants, and sets environment variables.                            !!!
# !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

## project root
suppressMessages(here::i_am(path = "CorrelationExample.qmd"))

## libraries
library(here)
suppressMessages(library(tidyverse))
library(broom)
library(gt)
suppressMessages(library(summarytools))
Warning in fun(libname, pkgname): couldn't connect to display ":0"
Code
library(WorkshopData)
library(WorkshopHelpers)
suppressMessages(library(ggpp))

## constants

## environment
set.seed(123)

2 Experimental Design

  • This first section of the Big-Data Biostats Workshop calls for medical knowledge and creativity.
  • Use your your medical knowledge to formulate an interesting and meaningful scientific question related to health and/or nutrition that can be addressed with variables from our curated 2013-2014 NHANES data set (see the separate Workshop Instructions document).
  • To keep this workshop manageable, you need to stick to one of three basic statistical tests:
    • A correlation test [R function cor.test()] with two continuous variables.
    • A chi-square test [R function chisq.test()] with two categorical variables.
    • A t-test [R function t.test()] with one categorical and one continuous variable.
  • For the t-test, the categorical variable must be binary; for the chi-square test, more than two categories each will be harder to interpret.
  • You should not need more than half a dozen variables–the subject identifier SEQN, a pair of variables for the statistical test, and possibly some additional variables for filtering, grouping, or other data transformations.
  • You may have to cycle several times through this Experimental Design section until you have a workable scientific question with matching variables.
  • After you edit the code chunk in the fourth question according to the instructions and run it, your RStudio Environment tab should contain the variable ‘chosenNhanesVariables’.
  • After you edit the code chunk in the fifth question according to the instructions and run it, your RStudio Environment tab should contain the variable ‘dfExpDesign’.
  • You have about 30 min to fill in the seven questions for this section.
  • We will have a room vote for the best project at the end, and the winning group gets a prize.

2.1 What is your overall scientific question?

How insulin and age correlate.

2.2 Why is this question interesting and meaningful?

There are implications for diabetes and other insulin dependent disorders.

2.3 What specific hypothesis will you test?

We hypothesize that insulin increases as you age.

2.4 Which NHANES variables from which 2013-2014 NHANES Data Files will you use?

SEQN DEMO_H, RIDAGEYR DEMO_H, LBXIN INS_H

Code
# !!! you must adapt the first line of code for your chosen NHANES variables !!!
# !!! Keep "SEQN DEMO_H" but replace the other variable/data-file pairs with two or more of your choices !!!
# !!! if you need help ask your facilitators !!!

chosenNhanesVariables <- c("SEQN DEMO_H", "RIDAGEYR DEMO_H", "LBXIN INS_H")

# !!! Uncomment the remaining code lines of this chunk but otherwise leave them as they are !!!

# validating the chosen NHANES variables and creating a data frame of the experimental design
dfExpDesign <-
  WorkshopHelpers::validateInputVariables(variablesWithDataFiles = chosenNhanesVariables)

# tabulating the chosen and validated NHANES variables of the experimental design
gt::gt(data = dfExpDesign) |>
  gt::tab_header(title = "Chosen NHANES Variables") |>
  gt::opt_row_striping(row_striping = TRUE)
Chosen NHANES Variables
Variable VariableDescription DataFile DataFileDescription ReleaseCycle
SEQN Respondent sequence number. DEMO_H Demographic Variables and Sample Weights 2013-2014
RIDAGEYR Age in years of the participant at the time of screening. Individuals 80 and over are topcoded at 80 years of age. DEMO_H Demographic Variables and Sample Weights 2013-2014
LBXIN Insulin (uU/mL) INS_H Insulin 2013-2014

2.5 Briefly describe your chosen variables on the basis of the NHANES codebook!

The variables are age and insulin.

Code
# !!! Uncomment the code lines of this chunk but otherwise leave them as they are !!!

# showing the NHANES codebooks for the chosen and validated variables
WorkshopHelpers::showCodebooks(df = dfExpDesign)
SEQN - Respondent sequence number
Names Values
Data File Name DEMO_H
Variable Name SEQN
SAS Label Respondent sequence number
English Text Respondent sequence number.
Target Both males and females 0 YEARS - 150 YEARS
Hard Edits NA
CDC URL https://wwwn.cdc.gov/Nchs/data/nhanes/Public/2013/DataFiles/DEMO_H.htm#SEQN
Codes
RIDAGEYR - Age in years at screening
Names Values
Data File Name DEMO_H
Variable Name RIDAGEYR
SAS Label Age in years at screening
English Text Age in years of the participant at the time of screening. Individuals 80 and over are topcoded at 80 years of age.
Target Both males and females 0 YEARS - 150 YEARS
Hard Edits NA
CDC URL https://wwwn.cdc.gov/Nchs/data/nhanes/Public/2013/DataFiles/DEMO_H.htm#RIDAGEYR
Code or Value Value Description Count Cumulative Skip to Item
0 to 79 Range of Values 9823 9823 NA
80 80 years of age and over 352 10175 NA
. Missing 0 10175 NA
LBXIN - Insulin (uU/mL)
Names Values
Data File Name INS_H
Variable Name LBXIN
SAS Label Insulin (uU/mL)
English Text Insulin (uU/mL)
Target Both males and females 12 YEARS - 150 YEARS
Hard Edits 0 to 9999.99
CDC URL https://wwwn.cdc.gov/Nchs/data/nhanes/Public/2013/DataFiles/INS_H.htm#LBXIN
Code or Value Value Description Count Cumulative Skip to Item
0.14 to 682.48 Range of Values 3093 3093 NA
. Missing 236 3329 NA

2.6 Which R function will you use for which statistical test?

We plan on using a t-test.

2.7 What significance level are you choosing as the threshold for your test?

We are using p<0.05 as the significance threshold for our test.


3 Data Import

  • This section imports your chosen NHANES Data Files, extracts your chosen NHANES variables, combines them by subject identifier, and performs some initial data cleaning.
  • You should not need to modify the code chunk other than uncommenting the code lines.
  • Carefully inspect the resulting summary table to verify a successful import and briefly state your observations.
  • Your RStudio Environment tab should now contain the dataframe ‘dfData’ with SEQN and your other chosen variables as columns and several thousand subjects as rows.
  • Completing his section should take no more than 5 min.
Code
# !!! Uncomment the code lines of this chunk but otherwise leave them as they are !!!
 
# # importing the data for the chosen NHANES variables in dfExpDesign into 
# #   the dataframe dfData
dfData <- WorkshopHelpers::importNHANESdata(df = dfExpDesign)
# 
# # summarizing the imported data in dfData for review
summarytools::dfSummary(x = dfData, graph.col = FALSE, style = "grid", plain.ascii = FALSE, silent = TRUE, max.distinct.values = 20)

3.0.1 Data Frame Summary

3.0.1.1 dfData

Dimensions: 10175 x 3
Duplicates: 0

No Variable Stats / Values Freqs (% of Valid) Valid Missing
1 SEQN
[numeric]
Mean (sd) : 78644 (2937.4)
min < med < max:
73557 < 78644 < 83731
IQR (CV) : 5087 (0)
10175 distinct values 10175
(100.0%)
0
(0.0%)
2 RIDAGEYR
[numeric]
Mean (sd) : 31.5 (24.4)
min < med < max:
0 < 26 < 80
IQR (CV) : 42 (0.8)
81 distinct values 10175
(100.0%)
0
(0.0%)
3 LBXIN
[numeric]
Mean (sd) : 13.5 (18.6)
min < med < max:
0.1 < 9.5 < 682.5
IQR (CV) : 9.3 (1.4)
1745 distinct values 3093
(30.4%)
7082
(69.6%)

3.0.2 Data Frame Summary

3.0.2.1 dfData

Dimensions: 10175 x 3
Duplicates: 0

No Variable Stats / Values Freqs (% of Valid) Valid Missing
1 SEQN
[numeric]
Mean (sd) : 78644 (2937.4)
min < med < max:
73557 < 78644 < 83731
IQR (CV) : 5087 (0)
10175 distinct values 10175
(100.0%)
0
(0.0%)
2 RIDAGEYR
[numeric]
Mean (sd) : 31.5 (24.4)
min < med < max:
0 < 26 < 80
IQR (CV) : 42 (0.8)
81 distinct values 10175
(100.0%)
0
(0.0%)
3 LBXIN
[numeric]
Mean (sd) : 13.5 (18.6)
min < med < max:
0.1 < 9.5 < 682.5
IQR (CV) : 9.3 (1.4)
1745 distinct values 3093
(30.4%)
7082
(69.6%)

4 Data Wrangling

  • This section “wrangles” your dfData dataframe into shape so that your data meet the requirements of your statistical test later on.
  • The specifics depend on the variables you picked; you definitely will have to edit the code chunk here.
  • Some examples of many possible wrangling steps are …
    • Transforming with the mutate() function:
      • Reduce a multi-level categorical variable down to two levels
      • Convert a continuous variable to categorical with two levels
      • Convert a categorical variable into a factor (a special type of categorical variable)
      • Combine multiple variables into one
    • Dropping rows (subjects) with the filter() function:
      • Remove subjects with missing values
      • Remove subjects with characteristics you want to exclude
  • Don’t hesitate to ask your friendly facilitators for feedback and advice.
  • Carefully inspect the resulting summary table to verify successful wrangling and briefly state your observations.
  • Your RStudio Environment tab should now contain the dataframe ‘dfDataWrangled’.
  • Completing his section should take about 15 min.

4.1 What, if any, variable transformations are needed for your statistical test and why?

4.2 Wrangle your data. What has changed?

Code
# !!! you must adapt this code chunk for your chosen NHANES variables !!!
# !!! uncomment the code lines you need to modify and/or add new code !!!
# !!! you may want to ask your facilitators for help !!!

# # wrangling the data in dfData in a series or 'pipe' of steps connected with '|>'
# dfDataWrangled <- dfData |>
#   
#   #### 1. Transforming variables with mutate() ####
#   # - here we collapse the four blood pressure measurements into one by 
#   #   calculating their median and ignoring missing values (NAs)
#   # - transforming across multiple columns requires rowwise() before and ungroup() after
#   # - the resulting BPXSY column will be used in the correlation test
#   dplyr::rowwise() |> 
#   dplyr::mutate(BPXSY = median(BPXSY1, BPXSY2, BPXSY3, BPXSY4, na.rm = TRUE)) |> 
#   dplyr::ungroup() |> 
# 
#   #### 2. Filtering Rows with filter() ####
#   # - now we keep only rows that don't have any a missing value (NA) in the two 
#   #   columns that we need for the correlation test
#   # - The is.na() function flags missing values, and the exclamation mark negates.
FilteredData <- dplyr::filter(dfData, !is.na(LBXIN))
# 
# # reviewing the wrangled dataframe dfDataWrangled
summarytools::dfSummary(x = FilteredData, graph.col = FALSE, style = "grid", plain.ascii = FALSE, silent = TRUE, max.distinct.values = 20)

4.2.1 Data Frame Summary

4.2.1.1 FilteredData

Dimensions: 3093 x 3
Duplicates: 0

No Variable Stats / Values Freqs (% of Valid) Valid Missing
1 SEQN
[numeric]
Mean (sd) : 78692.8 (2915.2)
min < med < max:
73559 < 78707 < 83727
IQR (CV) : 5028 (0)
3093 distinct values 3093
(100.0%)
0
(0.0%)
2 RIDAGEYR
[numeric]
Mean (sd) : 43.2 (20.6)
min < med < max:
12 < 43 < 80
IQR (CV) : 36 (0.5)
69 distinct values 3093
(100.0%)
0
(0.0%)
3 LBXIN
[numeric]
Mean (sd) : 13.5 (18.6)
min < med < max:
0.1 < 9.5 < 682.5
IQR (CV) : 9.3 (1.4)
1745 distinct values 3093
(100.0%)
0
(0.0%)

|==============================================================================| 100% ### Data Frame Summary
#### FilteredData
Dimensions: 3093 x 3
Duplicates: 0

No Variable Stats / Values Freqs (% of Valid) Valid Missing
1 SEQN
[numeric]
Mean (sd) : 78692.8 (2915.2)
min < med < max:
73559 < 78707 < 83727
IQR (CV) : 5028 (0)
3093 distinct values 3093
(100.0%)
0
(0.0%)
2 RIDAGEYR
[numeric]
Mean (sd) : 43.2 (20.6)
min < med < max:
12 < 43 < 80
IQR (CV) : 36 (0.5)
69 distinct values 3093
(100.0%)
0
(0.0%)
3 LBXIN
[numeric]
Mean (sd) : 13.5 (18.6)
min < med < max:
0.1 < 9.5 < 682.5
IQR (CV) : 9.3 (1.4)
1745 distinct values 3093
(100.0%)
0
(0.0%)

5 Exploratory Data Analysis

  • Exploratory Data Analysis (EDA) serves to investigate and analyze data before or even without formal statistical tests (Tukey 1977).
  • Instead, EDA relies heavily on descriptive statistics and especially graphical visualizations.
  • For this workshop, we use EDA for two purposes:
    • As a sanity check before testing–are there any odd patterns in the data, such as outliers, multimodal distributions, etc.?
    • Can we get at least a partial answer for our scientific question from a suitable plot?
  • Adapt the code chunks below for your variables and comment on their output.
  • Completing his section should take about 15 min.

5.1 Tabulation

Code
# !!! if you use R to generate a table then your code goes into this chunk !!!

5.2 Visualization

Code
# !!! you must adapt this code chunk for your chosen NHANES variables !!!
# !!! uncomment the code lines you need to modify and/or add new code !!!

# ## simple univariate plot
# # histogram of age
# # choose meaningful nr of breaks: 73 consecutive values per summary table
hist(FilteredData$RIDAGEYR, breaks = 72)

Code
# !!! you must adapt this code chunk for your chosen NHANES variables !!!
# !!! uncomment the code lines you need to modify and/or add new code !!!

# ## simple univariate plot
# # histogram of blood pressure
# # choose meaningful nr of breaks: range from 66 to 228 per summary table
hist(FilteredData$LBXIN, breaks = 81)

Code
# !!! you must adapt this code chunk for your chosen NHANES variables !!!
# !!! uncomment the code lines you need to modify and/or add new code !!!

# ## simple bivariate plot
# 
# # # scatter plot of blood pressure against age
# Create plot
plot(
  x = FilteredData$LBXIN,
  y = FilteredData$RIDAGEYR,
  xlim = c(0, 200),
  xlab = "Insulin",
  ylab = "Age",
  main = "Insulin Level Vs. Age"
)

# Add regression line
model <- lm(RIDAGEYR ~ LBXIN, data = FilteredData)
abline(model, col = "red", lwd = 2)

# Compute correlation test
test <- cor.test(FilteredData$LBXIN, FilteredData$RIDAGEYR)

# Format values
r_val <- round(test$estimate, 2)
p_val <- signif(test$p.value, 3)

# Add box with r and p
legend(
  "topright",
  legend = paste0("r = ", r_val, "\np = ", p_val),
  bty = "o"   # change to "o" if you want a box outline
)


6 Statistical Testing

  • Finally!
  • Edit the code chunk below to match your chosen variables.
  • Explain your approach and interpret the output from the test
Code
# !!! you must adapt this code chunk for your chosen NHANES variables !!!
# !!! uncomment the code lines you need to modify and/or add new code !!!

# # # correlation test
cor.test(FilteredData$LBXIN, FilteredData$RIDAGEYR)

    Pearson's product-moment correlation

data:  FilteredData$LBXIN and FilteredData$RIDAGEYR
t = -0.23942, df = 3091, p-value = 0.8108
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.03954458  0.03094272
sample estimates:
         cor 
-0.004306283 

7 Summary and Conclusions

Indicated by the p value on the pearson correlation test there is likely no significant correlation between insulin and age. There was one outlier value which was included in the correlation test that is not seen on the graph with an insulin value of 682 at age 47.

  • State the conclusions of your statistical analysis in reference to your specific hypothesis and your overall scientific question from the Experimental Design section.
  • Place your conclusions in a clinical or otherwise greater context.
  • Finally, make a nice figure that visually summarizes your data, statistics, and broader conclusions and helps you tell your story during the presentation. Combine whatever tools work for you–R, Powerpoint, Photoshop, etc. If you create a PNG or JPEG file, move it to your project directory (the same folder as this .qmd file) and have it rendered in your answer with a reference like ‘![caption](filename)’ (omit the backslashes!); see the example with the AHA blood pressure categories in file CorrelationsExample.qmd. Ask your facilitators if you need help.
  • VERY IMPORTANT: Upload both your source and rendered documents (.qmd and .html) to InteDashboard/LEO/OneDrive (TBD) for the in-class presentations at the end of the Workshop.
Code
# !!! if you use R to make your nice figure then your code goes into this chunk !!!

8 Notes

Because of the time constraints of this educational activity, we have ignored the fact that NHANES is a weighted survey. For an actual research project, you MUST incorporate the weights into your analysis, e.g., with the survey package.

Similarly, we have skipped model diagnostics after the statistical test, i.e., checking whether the data meet the assumptions of your correlation test, chi-square test, or t-test. For an actual research project, you MUST perform this essential step after your statistical testing.


9 References

Give full citations if you cited any references; otherwise delete this section.