# !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!# !!! Do not modify the code in this chunk! It sets the root directory of !!!# !!! your project, loads R libraries with additional capabilities, defines !!!# !!! constants, and sets environment variables. !!!# !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!## project rootsuppressMessages(here::i_am(path ="CorrelationExample.qmd"))## librarieslibrary(here)suppressMessages(library(tidyverse))library(broom)library(gt)suppressMessages(library(summarytools))
Warning in fun(libname, pkgname): couldn't connect to display ":0"
This first section of the Big-Data Biostats Workshop calls for medical knowledge and creativity.
Use your your medical knowledge to formulate an interesting and meaningful scientific question related to health and/or nutrition that can be addressed with variables from our curated 2013-2014 NHANES data set (see the separate Workshop Instructions document).
To keep this workshop manageable, you need to stick to one of three basic statistical tests:
A correlation test [R function cor.test()] with two continuous variables.
A chi-square test [R function chisq.test()] with two categorical variables.
A t-test [R function t.test()] with one categorical and one continuous variable.
For the t-test, the categorical variable must be binary; for the chi-square test, more than two categories each will be harder to interpret.
You should not need more than half a dozen variables–the subject identifier SEQN, a pair of variables for the statistical test, and possibly some additional variables for filtering, grouping, or other data transformations.
You may have to cycle several times through this Experimental Design section until you have a workable scientific question with matching variables.
After you edit the code chunk in the fourth question according to the instructions and run it, your RStudio Environment tab should contain the variable ‘chosenNhanesVariables’.
After you edit the code chunk in the fifth question according to the instructions and run it, your RStudio Environment tab should contain the variable ‘dfExpDesign’.
You have about 30 min to fill in the seven questions for this section.
We will have a room vote for the best project at the end, and the winning group gets a prize.
2.1 What is your overall scientific question?
How insulin and age correlate.
2.2 Why is this question interesting and meaningful?
There are implications for diabetes and other insulin dependent disorders.
2.3 What specific hypothesis will you test?
We hypothesize that insulin increases as you age.
2.4 Which NHANES variables from which 2013-2014 NHANES Data Files will you use?
SEQN DEMO_H, RIDAGEYR DEMO_H, LBXIN INS_H
Code
# !!! you must adapt the first line of code for your chosen NHANES variables !!!# !!! Keep "SEQN DEMO_H" but replace the other variable/data-file pairs with two or more of your choices !!!# !!! if you need help ask your facilitators !!!chosenNhanesVariables <-c("SEQN DEMO_H", "RIDAGEYR DEMO_H", "LBXIN INS_H")# !!! Uncomment the remaining code lines of this chunk but otherwise leave them as they are !!!# validating the chosen NHANES variables and creating a data frame of the experimental designdfExpDesign <- WorkshopHelpers::validateInputVariables(variablesWithDataFiles = chosenNhanesVariables)# tabulating the chosen and validated NHANES variables of the experimental designgt::gt(data = dfExpDesign) |> gt::tab_header(title ="Chosen NHANES Variables") |> gt::opt_row_striping(row_striping =TRUE)
Chosen NHANES Variables
Variable
VariableDescription
DataFile
DataFileDescription
ReleaseCycle
SEQN
Respondent sequence number.
DEMO_H
Demographic Variables and Sample Weights
2013-2014
RIDAGEYR
Age in years of the participant at the time of screening. Individuals 80 and over are topcoded at 80 years of age.
DEMO_H
Demographic Variables and Sample Weights
2013-2014
LBXIN
Insulin (uU/mL)
INS_H
Insulin
2013-2014
2.5 Briefly describe your chosen variables on the basis of the NHANES codebook!
The variables are age and insulin.
Code
# !!! Uncomment the code lines of this chunk but otherwise leave them as they are !!!# showing the NHANES codebooks for the chosen and validated variablesWorkshopHelpers::showCodebooks(df = dfExpDesign)
2.6 Which R function will you use for which statistical test?
We plan on using a t-test.
2.7 What significance level are you choosing as the threshold for your test?
We are using p<0.05 as the significance threshold for our test.
3 Data Import
NoteInstructions for Data Import
This section imports your chosen NHANES Data Files, extracts your chosen NHANES variables, combines them by subject identifier, and performs some initial data cleaning.
You should not need to modify the code chunk other than uncommenting the code lines.
Carefully inspect the resulting summary table to verify a successful import and briefly state your observations.
Your RStudio Environment tab should now contain the dataframe ‘dfData’ with SEQN and your other chosen variables as columns and several thousand subjects as rows.
Completing his section should take no more than 5 min.
Code
# !!! Uncomment the code lines of this chunk but otherwise leave them as they are !!!# # importing the data for the chosen NHANES variables in dfExpDesign into # # the dataframe dfDatadfData <- WorkshopHelpers::importNHANESdata(df = dfExpDesign)# # # summarizing the imported data in dfData for reviewsummarytools::dfSummary(x = dfData, graph.col =FALSE, style ="grid", plain.ascii =FALSE, silent =TRUE, max.distinct.values =20)
3.0.1 Data Frame Summary
3.0.1.1 dfData
Dimensions: 10175 x 3 Duplicates: 0
No
Variable
Stats / Values
Freqs (% of Valid)
Valid
Missing
1
SEQN
[numeric]
Mean (sd) : 78644 (2937.4)
min < med < max:
73557 < 78644 < 83731
IQR (CV) : 5087 (0)
10175 distinct values
10175
(100.0%)
0
(0.0%)
2
RIDAGEYR
[numeric]
Mean (sd) : 31.5 (24.4)
min < med < max:
0 < 26 < 80
IQR (CV) : 42 (0.8)
81 distinct values
10175
(100.0%)
0
(0.0%)
3
LBXIN
[numeric]
Mean (sd) : 13.5 (18.6)
min < med < max:
0.1 < 9.5 < 682.5
IQR (CV) : 9.3 (1.4)
1745 distinct values
3093
(30.4%)
7082
(69.6%)
3.0.2 Data Frame Summary
3.0.2.1 dfData
Dimensions: 10175 x 3 Duplicates: 0
No
Variable
Stats / Values
Freqs (% of Valid)
Valid
Missing
1
SEQN
[numeric]
Mean (sd) : 78644 (2937.4)
min < med < max:
73557 < 78644 < 83731
IQR (CV) : 5087 (0)
10175 distinct values
10175
(100.0%)
0
(0.0%)
2
RIDAGEYR
[numeric]
Mean (sd) : 31.5 (24.4)
min < med < max:
0 < 26 < 80
IQR (CV) : 42 (0.8)
81 distinct values
10175
(100.0%)
0
(0.0%)
3
LBXIN
[numeric]
Mean (sd) : 13.5 (18.6)
min < med < max:
0.1 < 9.5 < 682.5
IQR (CV) : 9.3 (1.4)
1745 distinct values
3093
(30.4%)
7082
(69.6%)
4 Data Wrangling
NoteInstructions for Data Wrangling
This section “wrangles” your dfData dataframe into shape so that your data meet the requirements of your statistical test later on.
The specifics depend on the variables you picked; you definitely will have to edit the code chunk here.
Some examples of many possible wrangling steps are …
Transforming with the mutate() function:
Reduce a multi-level categorical variable down to two levels
Convert a continuous variable to categorical with two levels
Convert a categorical variable into a factor (a special type of categorical variable)
Combine multiple variables into one
Dropping rows (subjects) with the filter() function:
Remove subjects with missing values
Remove subjects with characteristics you want to exclude
Don’t hesitate to ask your friendly facilitators for feedback and advice.
Carefully inspect the resulting summary table to verify successful wrangling and briefly state your observations.
Your RStudio Environment tab should now contain the dataframe ‘dfDataWrangled’.
Completing his section should take about 15 min.
4.1 What, if any, variable transformations are needed for your statistical test and why?
4.2 Wrangle your data. What has changed?
Code
# !!! you must adapt this code chunk for your chosen NHANES variables !!!# !!! uncomment the code lines you need to modify and/or add new code !!!# !!! you may want to ask your facilitators for help !!!# # wrangling the data in dfData in a series or 'pipe' of steps connected with '|>'# dfDataWrangled <- dfData |># # #### 1. Transforming variables with mutate() ##### # - here we collapse the four blood pressure measurements into one by # # calculating their median and ignoring missing values (NAs)# # - transforming across multiple columns requires rowwise() before and ungroup() after# # - the resulting BPXSY column will be used in the correlation test# dplyr::rowwise() |> # dplyr::mutate(BPXSY = median(BPXSY1, BPXSY2, BPXSY3, BPXSY4, na.rm = TRUE)) |> # dplyr::ungroup() |> # # #### 2. Filtering Rows with filter() ##### # - now we keep only rows that don't have any a missing value (NA) in the two # # columns that we need for the correlation test# # - The is.na() function flags missing values, and the exclamation mark negates.FilteredData <- dplyr::filter(dfData, !is.na(LBXIN))# # # reviewing the wrangled dataframe dfDataWrangledsummarytools::dfSummary(x = FilteredData, graph.col =FALSE, style ="grid", plain.ascii =FALSE, silent =TRUE, max.distinct.values =20)
4.2.1 Data Frame Summary
4.2.1.1 FilteredData
Dimensions: 3093 x 3 Duplicates: 0
No
Variable
Stats / Values
Freqs (% of Valid)
Valid
Missing
1
SEQN
[numeric]
Mean (sd) : 78692.8 (2915.2)
min < med < max:
73559 < 78707 < 83727
IQR (CV) : 5028 (0)
3093 distinct values
3093
(100.0%)
0
(0.0%)
2
RIDAGEYR
[numeric]
Mean (sd) : 43.2 (20.6)
min < med < max:
12 < 43 < 80
IQR (CV) : 36 (0.5)
69 distinct values
3093
(100.0%)
0
(0.0%)
3
LBXIN
[numeric]
Mean (sd) : 13.5 (18.6)
min < med < max:
0.1 < 9.5 < 682.5
IQR (CV) : 9.3 (1.4)
1745 distinct values
3093
(100.0%)
0
(0.0%)
|==============================================================================| 100% ### Data Frame Summary
#### FilteredData Dimensions: 3093 x 3 Duplicates: 0
No
Variable
Stats / Values
Freqs (% of Valid)
Valid
Missing
1
SEQN
[numeric]
Mean (sd) : 78692.8 (2915.2)
min < med < max:
73559 < 78707 < 83727
IQR (CV) : 5028 (0)
3093 distinct values
3093
(100.0%)
0
(0.0%)
2
RIDAGEYR
[numeric]
Mean (sd) : 43.2 (20.6)
min < med < max:
12 < 43 < 80
IQR (CV) : 36 (0.5)
69 distinct values
3093
(100.0%)
0
(0.0%)
3
LBXIN
[numeric]
Mean (sd) : 13.5 (18.6)
min < med < max:
0.1 < 9.5 < 682.5
IQR (CV) : 9.3 (1.4)
1745 distinct values
3093
(100.0%)
0
(0.0%)
5 Exploratory Data Analysis
NoteInstructions for Exploratory Data Analysis
Exploratory Data Analysis (EDA) serves to investigate and analyze data before or even without formal statistical tests (Tukey 1977).
Instead, EDA relies heavily on descriptive statistics and especially graphical visualizations.
For this workshop, we use EDA for two purposes:
As a sanity check before testing–are there any odd patterns in the data, such as outliers, multimodal distributions, etc.?
Can we get at least a partial answer for our scientific question from a suitable plot?
Adapt the code chunks below for your variables and comment on their output.
Completing his section should take about 15 min.
5.1 Tabulation
Code
# !!! if you use R to generate a table then your code goes into this chunk !!!
5.2 Visualization
Code
# !!! you must adapt this code chunk for your chosen NHANES variables !!!# !!! uncomment the code lines you need to modify and/or add new code !!!# ## simple univariate plot# # histogram of age# # choose meaningful nr of breaks: 73 consecutive values per summary tablehist(FilteredData$RIDAGEYR, breaks =72)
Code
# !!! you must adapt this code chunk for your chosen NHANES variables !!!# !!! uncomment the code lines you need to modify and/or add new code !!!# ## simple univariate plot# # histogram of blood pressure# # choose meaningful nr of breaks: range from 66 to 228 per summary tablehist(FilteredData$LBXIN, breaks =81)
Code
# !!! you must adapt this code chunk for your chosen NHANES variables !!!# !!! uncomment the code lines you need to modify and/or add new code !!!# ## simple bivariate plot# # # # scatter plot of blood pressure against age# Create plotplot(x = FilteredData$LBXIN,y = FilteredData$RIDAGEYR,xlim =c(0, 200),xlab ="Insulin",ylab ="Age",main ="Insulin Level Vs. Age")# Add regression linemodel <-lm(RIDAGEYR ~ LBXIN, data = FilteredData)abline(model, col ="red", lwd =2)# Compute correlation testtest <-cor.test(FilteredData$LBXIN, FilteredData$RIDAGEYR)# Format valuesr_val <-round(test$estimate, 2)p_val <-signif(test$p.value, 3)# Add box with r and plegend("topright",legend =paste0("r = ", r_val, "\np = ", p_val),bty ="o"# change to "o" if you want a box outline)
6 Statistical Testing
NoteInstructions for Statistical Testing
Finally!
Edit the code chunk below to match your chosen variables.
Explain your approach and interpret the output from the test
Code
# !!! you must adapt this code chunk for your chosen NHANES variables !!!# !!! uncomment the code lines you need to modify and/or add new code !!!# # # correlation testcor.test(FilteredData$LBXIN, FilteredData$RIDAGEYR)
Pearson's product-moment correlation
data: FilteredData$LBXIN and FilteredData$RIDAGEYR
t = -0.23942, df = 3091, p-value = 0.8108
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.03954458 0.03094272
sample estimates:
cor
-0.004306283
7 Summary and Conclusions
Indicated by the p value on the pearson correlation test there is likely no significant correlation between insulin and age. There was one outlier value which was included in the correlation test that is not seen on the graph with an insulin value of 682 at age 47.
NoteInstructions for Summary and Conclusions
State the conclusions of your statistical analysis in reference to your specific hypothesis and your overall scientific question from the Experimental Design section.
Place your conclusions in a clinical or otherwise greater context.
Finally, make a nice figure that visually summarizes your data, statistics, and broader conclusions and helps you tell your story during the presentation. Combine whatever tools work for you–R, Powerpoint, Photoshop, etc. If you create a PNG or JPEG file, move it to your project directory (the same folder as this .qmd file) and have it rendered in your answer with a reference like ‘’ (omit the backslashes!); see the example with the AHA blood pressure categories in file CorrelationsExample.qmd. Ask your facilitators if you need help.
VERY IMPORTANT: Upload both your source and rendered documents (.qmd and .html) to InteDashboard/LEO/OneDrive (TBD) for the in-class presentations at the end of the Workshop.
Code
# !!! if you use R to make your nice figure then your code goes into this chunk !!!
8 Notes
Because of the time constraints of this educational activity, we have ignored the fact that NHANES is a weighted survey. For an actual research project, you MUST incorporate the weights into your analysis, e.g., with the survey package.
Similarly, we have skipped model diagnostics after the statistical test, i.e., checking whether the data meet the assumptions of your correlation test, chi-square test, or t-test. For an actual research project, you MUST perform this essential step after your statistical testing.
9 References
NoteInstructions for References
Give full citations if you cited any references; otherwise delete this section.