Age Vs. Insulin

Author

Table 44

Published

April 30, 2026

1 R Setup

Code

# !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
# !!! Do not modify the code in this chunk! It sets the root directory of   !!!
# !!! your project, loads R libraries with additional capabilities, defines !!!
# !!! constants, and sets environment variables.                            !!!
# !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

## project root
suppressMessages(here::i_am(path = "CorrelationExample.qmd"))

## libraries
library(here)
suppressMessages(library(tidyverse))
library(broom)
library(gt)
suppressMessages(library(summarytools))

Warning in fun(libname, pkgname): couldn't connect to display ":0"

Code

library(WorkshopData)
library(WorkshopHelpers)
suppressMessages(library(ggpp))

## constants

## environment
set.seed(123)

2 Experimental Design

Instructions for Experimental Design

This first section of the Big-Data Biostats Workshop calls for medical knowledge and creativity.
Use your your medical knowledge to formulate an interesting and meaningful scientific question related to health and/or nutrition that can be addressed with variables from our curated 2013-2014 NHANES data set (see the separate Workshop Instructions document).
To keep this workshop manageable, you need to stick to one of three basic statistical tests:
- A correlation test [R function cor.test()] with two continuous variables.
- A chi-square test [R function chisq.test()] with two categorical variables.
- A t-test [R function t.test()] with one categorical and one continuous variable.
For the t-test, the categorical variable must be binary; for the chi-square test, more than two categories each will be harder to interpret.
You should not need more than half a dozen variables–the subject identifier SEQN, a pair of variables for the statistical test, and possibly some additional variables for filtering, grouping, or other data transformations.
You may have to cycle several times through this Experimental Design section until you have a workable scientific question with matching variables.
After you edit the code chunk in the fourth question according to the instructions and run it, your RStudio Environment tab should contain the variable ‘chosenNhanesVariables’.
After you edit the code chunk in the fifth question according to the instructions and run it, your RStudio Environment tab should contain the variable ‘dfExpDesign’.
You have about 30 min to fill in the seven questions for this section.
We will have a room vote for the best project at the end, and the winning group gets a prize.

2.1 What is your overall scientific question?

How insulin and age correlate.

2.2 Why is this question interesting and meaningful?

There are implications for diabetes and other insulin dependent disorders.

2.3 What specific hypothesis will you test?

We hypothesize that insulin increases as you age.

2.4 Which NHANES variables from which 2013-2014 NHANES Data Files will you use?

SEQN DEMO_H, RIDAGEYR DEMO_H, LBXIN INS_H

Code

# !!! you must adapt the first line of code for your chosen NHANES variables !!!
# !!! Keep "SEQN DEMO_H" but replace the other variable/data-file pairs with two or more of your choices !!!
# !!! if you need help ask your facilitators !!!

chosenNhanesVariables <- c("SEQN DEMO_H", "RIDAGEYR DEMO_H", "LBXIN INS_H")

# !!! Uncomment the remaining code lines of this chunk but otherwise leave them as they are !!!

# validating the chosen NHANES variables and creating a data frame of the experimental design
dfExpDesign <-
  WorkshopHelpers::validateInputVariables(variablesWithDataFiles = chosenNhanesVariables)

# tabulating the chosen and validated NHANES variables of the experimental design
gt::gt(data = dfExpDesign) |>
  gt::tab_header(title = "Chosen NHANES Variables") |>
  gt::opt_row_striping(row_striping = TRUE)

Variable	VariableDescription	DataFile	DataFileDescription	ReleaseCycle
Chosen NHANES Variables
SEQN	Respondent sequence number.	DEMO_H	Demographic Variables and Sample Weights	2013-2014
RIDAGEYR	Age in years of the participant at the time of screening. Individuals 80 and over are topcoded at 80 years of age.	DEMO_H	Demographic Variables and Sample Weights	2013-2014
LBXIN	Insulin (uU/mL)	INS_H	Insulin	2013-2014

2.5 Briefly describe your chosen variables on the basis of the NHANES codebook!

The variables are age and insulin.

Code

# !!! Uncomment the code lines of this chunk but otherwise leave them as they are !!!

# showing the NHANES codebooks for the chosen and validated variables
WorkshopHelpers::showCodebooks(df = dfExpDesign)

Names	Values
SEQN - Respondent sequence number
Data File Name	DEMO_H
Variable Name	SEQN
SAS Label	Respondent sequence number
English Text	Respondent sequence number.
Target	Both males and females 0 YEARS - 150 YEARS
Hard Edits	NA
CDC URL	https://wwwn.cdc.gov/Nchs/data/nhanes/Public/2013/DataFiles/DEMO_H.htm#SEQN

Codes

Names	Values
RIDAGEYR - Age in years at screening
Data File Name	DEMO_H
Variable Name	RIDAGEYR
SAS Label	Age in years at screening
English Text	Age in years of the participant at the time of screening. Individuals 80 and over are topcoded at 80 years of age.
Target	Both males and females 0 YEARS - 150 YEARS
Hard Edits	NA
CDC URL	https://wwwn.cdc.gov/Nchs/data/nhanes/Public/2013/DataFiles/DEMO_H.htm#RIDAGEYR

Code or Value	Value Description	Count	Cumulative	Skip to Item
0 to 79	Range of Values	9823	9823	NA
80	80 years of age and over	352	10175	NA
.	Missing	0	10175	NA

Names	Values
LBXIN - Insulin (uU/mL)
Data File Name	INS_H
Variable Name	LBXIN
SAS Label	Insulin (uU/mL)
English Text	Insulin (uU/mL)
Target	Both males and females 12 YEARS - 150 YEARS
Hard Edits	0 to 9999.99
CDC URL	https://wwwn.cdc.gov/Nchs/data/nhanes/Public/2013/DataFiles/INS_H.htm#LBXIN

Code or Value	Value Description	Count	Cumulative	Skip to Item
0.14 to 682.48	Range of Values	3093	3093	NA
.	Missing	236	3329	NA

2.6 Which R function will you use for which statistical test?

We plan on using a t-test.

2.7 What significance level are you choosing as the threshold for your test?

We are using p<0.05 as the significance threshold for our test.

3 Data Import

Instructions for Data Import

This section imports your chosen NHANES Data Files, extracts your chosen NHANES variables, combines them by subject identifier, and performs some initial data cleaning.
You should not need to modify the code chunk other than uncommenting the code lines.
Carefully inspect the resulting summary table to verify a successful import and briefly state your observations.
Your RStudio Environment tab should now contain the dataframe ‘dfData’ with SEQN and your other chosen variables as columns and several thousand subjects as rows.
Completing his section should take no more than 5 min.

Code

# !!! Uncomment the code lines of this chunk but otherwise leave them as they are !!!
 
# # importing the data for the chosen NHANES variables in dfExpDesign into 
# #   the dataframe dfData
dfData <- WorkshopHelpers::importNHANESdata(df = dfExpDesign)
# 
# # summarizing the imported data in dfData for review
summarytools::dfSummary(x = dfData, graph.col = FALSE, style = "grid", plain.ascii = FALSE, silent = TRUE, max.distinct.values = 20)

3.0.1 Data Frame Summary

3.0.1.1 dfData

Dimensions: 10175 x 3
Duplicates: 0

No	Variable	Stats / Values	Freqs (% of Valid)	Valid	Missing
1	SEQN [numeric]	Mean (sd) : 78644 (2937.4) min < med < max: 73557 < 78644 < 83731 IQR (CV) : 5087 (0)	10175 distinct values	10175 (100.0%)	0 (0.0%)
2	RIDAGEYR [numeric]	Mean (sd) : 31.5 (24.4) min < med < max: 0 < 26 < 80 IQR (CV) : 42 (0.8)	81 distinct values	10175 (100.0%)	0 (0.0%)
3	LBXIN [numeric]	Mean (sd) : 13.5 (18.6) min < med < max: 0.1 < 9.5 < 682.5 IQR (CV) : 9.3 (1.4)	1745 distinct values	3093 (30.4%)	7082 (69.6%)

3.0.2 Data Frame Summary

3.0.2.1 dfData

Dimensions: 10175 x 3
Duplicates: 0

No	Variable	Stats / Values	Freqs (% of Valid)	Valid	Missing
1	SEQN [numeric]	Mean (sd) : 78644 (2937.4) min < med < max: 73557 < 78644 < 83731 IQR (CV) : 5087 (0)	10175 distinct values	10175 (100.0%)	0 (0.0%)
2	RIDAGEYR [numeric]	Mean (sd) : 31.5 (24.4) min < med < max: 0 < 26 < 80 IQR (CV) : 42 (0.8)	81 distinct values	10175 (100.0%)	0 (0.0%)
3	LBXIN [numeric]	Mean (sd) : 13.5 (18.6) min < med < max: 0.1 < 9.5 < 682.5 IQR (CV) : 9.3 (1.4)	1745 distinct values	3093 (30.4%)	7082 (69.6%)

4 Data Wrangling

Instructions for Data Wrangling

This section “wrangles” your dfData dataframe into shape so that your data meet the requirements of your statistical test later on.
The specifics depend on the variables you picked; you definitely will have to edit the code chunk here.
Some examples of many possible wrangling steps are …
- Transforming with the mutate() function:
  - Reduce a multi-level categorical variable down to two levels
  - Convert a continuous variable to categorical with two levels
  - Convert a categorical variable into a factor (a special type of categorical variable)
  - Combine multiple variables into one
- Dropping rows (subjects) with the filter() function:
  - Remove subjects with missing values
  - Remove subjects with characteristics you want to exclude
Don’t hesitate to ask your friendly facilitators for feedback and advice.
Carefully inspect the resulting summary table to verify successful wrangling and briefly state your observations.
Your RStudio Environment tab should now contain the dataframe ‘dfDataWrangled’.
Completing his section should take about 15 min.

4.1 What, if any, variable transformations are needed for your statistical test and why?

4.2 Wrangle your data. What has changed?

Code

# !!! you must adapt this code chunk for your chosen NHANES variables !!!
# !!! uncomment the code lines you need to modify and/or add new code !!!
# !!! you may want to ask your facilitators for help !!!

# # wrangling the data in dfData in a series or 'pipe' of steps connected with '|>'
# dfDataWrangled <- dfData |>
#   
#   #### 1. Transforming variables with mutate() ####
#   # - here we collapse the four blood pressure measurements into one by 
#   #   calculating their median and ignoring missing values (NAs)
#   # - transforming across multiple columns requires rowwise() before and ungroup() after
#   # - the resulting BPXSY column will be used in the correlation test
#   dplyr::rowwise() |> 
#   dplyr::mutate(BPXSY = median(BPXSY1, BPXSY2, BPXSY3, BPXSY4, na.rm = TRUE)) |> 
#   dplyr::ungroup() |> 
# 
#   #### 2. Filtering Rows with filter() ####
#   # - now we keep only rows that don't have any a missing value (NA) in the two 
#   #   columns that we need for the correlation test
#   # - The is.na() function flags missing values, and the exclamation mark negates.
FilteredData <- dplyr::filter(dfData, !is.na(LBXIN))
# 
# # reviewing the wrangled dataframe dfDataWrangled
summarytools::dfSummary(x = FilteredData, graph.col = FALSE, style = "grid", plain.ascii = FALSE, silent = TRUE, max.distinct.values = 20)

4.2.1 Data Frame Summary

4.2.1.1 FilteredData

Dimensions: 3093 x 3
Duplicates: 0

No	Variable	Stats / Values	Freqs (% of Valid)	Valid
1	SEQN [numeric]	Mean (sd) : 78692.8 (2915.2) min < med < max: 73559 < 78707 < 83727 IQR (CV) : 5028 (0)	3093 distinct values	3093 (100.0%)
2	RIDAGEYR [numeric]	Mean (sd) : 43.2 (20.6) min < med < max: 12 < 43 < 80 IQR (CV) : 36 (0.5)	69 distinct values	3093 (100.0%)
3	LBXIN [numeric]	Mean (sd) : 13.5 (18.6) min < med < max: 0.1 < 9.5 < 682.5 IQR (CV) : 9.3 (1.4)	1745 distinct values	3093 (100.0%)

|==============================================================================| 100% ### Data Frame Summary
#### FilteredData
Dimensions: 3093 x 3
Duplicates: 0

No	Variable	Stats / Values	Freqs (% of Valid)	Valid
1	SEQN [numeric]	Mean (sd) : 78692.8 (2915.2) min < med < max: 73559 < 78707 < 83727 IQR (CV) : 5028 (0)	3093 distinct values	3093 (100.0%)
2	RIDAGEYR [numeric]	Mean (sd) : 43.2 (20.6) min < med < max: 12 < 43 < 80 IQR (CV) : 36 (0.5)	69 distinct values	3093 (100.0%)
3	LBXIN [numeric]	Mean (sd) : 13.5 (18.6) min < med < max: 0.1 < 9.5 < 682.5 IQR (CV) : 9.3 (1.4)	1745 distinct values	3093 (100.0%)

5 Exploratory Data Analysis

Instructions for Exploratory Data Analysis

Exploratory Data Analysis (EDA) serves to investigate and analyze data before or even without formal statistical tests (Tukey 1977).
Instead, EDA relies heavily on descriptive statistics and especially graphical visualizations.
For this workshop, we use EDA for two purposes:
- As a sanity check before testing–are there any odd patterns in the data, such as outliers, multimodal distributions, etc.?
- Can we get at least a partial answer for our scientific question from a suitable plot?
Adapt the code chunks below for your variables and comment on their output.
Completing his section should take about 15 min.

5.1 Tabulation

Code

# !!! if you use R to generate a table then your code goes into this chunk !!!

5.2 Visualization

Code

# !!! you must adapt this code chunk for your chosen NHANES variables !!!
# !!! uncomment the code lines you need to modify and/or add new code !!!

# ## simple univariate plot
# # histogram of age
# # choose meaningful nr of breaks: 73 consecutive values per summary table
hist(FilteredData$RIDAGEYR, breaks = 72)

Code

# !!! you must adapt this code chunk for your chosen NHANES variables !!!
# !!! uncomment the code lines you need to modify and/or add new code !!!

# ## simple univariate plot
# # histogram of blood pressure
# # choose meaningful nr of breaks: range from 66 to 228 per summary table
hist(FilteredData$LBXIN, breaks = 81)

Code

# !!! you must adapt this code chunk for your chosen NHANES variables !!!
# !!! uncomment the code lines you need to modify and/or add new code !!!

# ## simple bivariate plot
# 
# # # scatter plot of blood pressure against age
# Create plot
plot(
  x = FilteredData$LBXIN,
  y = FilteredData$RIDAGEYR,
  xlim = c(0, 200),
  xlab = "Insulin",
  ylab = "Age",
  main = "Insulin Level Vs. Age"
)

# Add regression line
model <- lm(RIDAGEYR ~ LBXIN, data = FilteredData)
abline(model, col = "red", lwd = 2)

# Compute correlation test
test <- cor.test(FilteredData$LBXIN, FilteredData$RIDAGEYR)

# Format values
r_val <- round(test$estimate, 2)
p_val <- signif(test$p.value, 3)

# Add box with r and p
legend(
  "topright",
  legend = paste0("r = ", r_val, "\np = ", p_val),
  bty = "o"   # change to "o" if you want a box outline
)

6 Statistical Testing

Instructions for Statistical Testing

Finally!
Edit the code chunk below to match your chosen variables.
Explain your approach and interpret the output from the test

Code

# !!! you must adapt this code chunk for your chosen NHANES variables !!!
# !!! uncomment the code lines you need to modify and/or add new code !!!

# # # correlation test
cor.test(FilteredData$LBXIN, FilteredData$RIDAGEYR)


    Pearson's product-moment correlation

data:  FilteredData$LBXIN and FilteredData$RIDAGEYR
t = -0.23942, df = 3091, p-value = 0.8108
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.03954458  0.03094272
sample estimates:
         cor 
-0.004306283

7 Summary and Conclusions

Indicated by the p value on the pearson correlation test there is likely no significant correlation between insulin and age. There was one outlier value which was included in the correlation test that is not seen on the graph with an insulin value of 682 at age 47.

Instructions for Summary and Conclusions

State the conclusions of your statistical analysis in reference to your specific hypothesis and your overall scientific question from the Experimental Design section.
Place your conclusions in a clinical or otherwise greater context.
Finally, make a nice figure that visually summarizes your data, statistics, and broader conclusions and helps you tell your story during the presentation. Combine whatever tools work for you–R, Powerpoint, Photoshop, etc. If you create a PNG or JPEG file, move it to your project directory (the same folder as this .qmd file) and have it rendered in your answer with a reference like ‘![caption](filename)’ (omit the backslashes!); see the example with the AHA blood pressure categories in file CorrelationsExample.qmd. Ask your facilitators if you need help.
VERY IMPORTANT: Upload both your source and rendered documents (.qmd and .html) to InteDashboard/LEO/OneDrive (TBD) for the in-class presentations at the end of the Workshop.

Code

# !!! if you use R to make your nice figure then your code goes into this chunk !!!

8 Notes

Because of the time constraints of this educational activity, we have ignored the fact that NHANES is a weighted survey. For an actual research project, you MUST incorporate the weights into your analysis, e.g., with the survey package.

Similarly, we have skipped model diagnostics after the statistical test, i.e., checking whether the data meet the assumptions of your correlation test, chi-square test, or t-test. For an actual research project, you MUST perform this essential step after your statistical testing.

9 References

Instructions for References

Give full citations if you cited any references; otherwise delete this section.

1 R Setup

2 Experimental Design

2.1 What is your overall scientific question?

2.2 Why is this question interesting and meaningful?

2.3 What specific hypothesis will you test?

2.4 Which NHANES variables from which 2013-2014 NHANES Data Files will you use?

2.5 Briefly describe your chosen variables on the basis of the NHANES codebook!

2.6 Which R function will you use for which statistical test?

2.7 What significance level are you choosing as the threshold for your test?

3 Data Import

3.0.1 Data Frame Summary

3.0.1.1 dfData

3.0.2 Data Frame Summary

3.0.2.1 dfData

4 Data Wrangling

4.1 What, if any, variable transformations are needed for your statistical test and why?

4.2 Wrangle your data. What has changed?

4.2.1 Data Frame Summary

4.2.1.1 FilteredData

5 Exploratory Data Analysis

5.1 Tabulation

5.2 Visualization

6 Statistical Testing

7 Summary and Conclusions

8 Notes

9 References

Copyright