Last Ran: 2021-05-18

knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(summarytools)
library(magrittr)
library(corrplot)
library(FactoMineR)
library(factoextra)

Task: Look for basic correlations pertaining to smoking, alcohol usage, and general demographics using Health History Questionnaire (HHQ).

e.g., How does smoking history relate to the other measures of health history and demographics? What about alcohol intake?

Pending ISSUES:
  • 30574 Check V02 height

Step 1. Import Baseline Demographics & Health History.

This data is reported via the COG1 Session. We could also use the NCI Dietary questionnaire, since this includes questions about alcohol usage. Although, we'll start at Health History Questionnaire (HHQ). Tips for finding other useful variables are integraded below.

I have already used the RedCap 'Data Export Reports & Stats' tab to download this data and place it on the server under the following: "/Volumes/IGNITE_Imaging/QC_Output/R_IGNITE/RedCap/PRE/Data/"

setwd("/Volumes/IGNITE_Imaging/QC_Output/R_IGNITE/RedCap/PRE/Data/")
FILE<-list.files("/Volumes/IGNITE_Imaging/QC_Output/R_IGNITE/RedCap/PRE/Data/", pattern="*.r")
PATH<-"/Volumes/IGNITE_Imaging/QC_Output/R_IGNITE/RedCap/PRE/Data/"
data<-paste(PATH, FILE, sep="")

source(data)

Step 2.

Assuming that the data has already been appropriately entered & compared within RedCap, we will remove the form entries ending in --1 and --2.

data$Compared<-str_detect(data$record_id, pattern = "--1")
data<-data[!(data$Compared==TRUE),]

data$Merged_RedCap<-str_detect(data$record_id, pattern = "--2")
data<-data[!(data$Merged_RedCap==TRUE),]

#data$record_id<-droplevels(data$record_id) #droplevels() to remove --* levels from 'record_id' 

Step 3.

Next, we'll use plyr/dplyr/tidyverse's 'filter' command to filter-in only those rows (particpants) which have been randomized (i.e., randomization_complete.factor=="Complete").

data <- data %>% 
  filter(randomization_complete.factor=="Complete") 

Step 4.

Get rid of columns where for ALL rows the value is NA. This should narrow our database to 500 particpants (rows).

data <- data[,colSums(is.na(data))<nrow(data)]

Step 6. BMI Group Calculation & Remove outlier/error'd entry (30574)

data$vo2_data_height_m<-data$vo2_data_height/100
data$BMI<-data$vo2_data_weight/(data$vo2_data_height_m*data$vo2_data_height_m)
data$BMI<-as.numeric(data$BMI)
data$BMI_group<-if_else(data$BMI<18.5, 1, as.numeric(data$BMI)) #under
data$BMI_group<-if_else( (data$BMI_group>1 & data$BMI_group<24.9)  ,2 , as.numeric(data$BMI_group)) #healthy
data$BMI_group<-if_else(data$BMI_group<29.9  & data$BMI_group>2 ,3, as.numeric(data$BMI_group)) #over
data$BMI_group<-if_else(data$BMI_group<34.9 & data$BMI_group>3 ,4, as.numeric(data$BMI_group)) #obese I
data$BMI_group<-if_else(data$BMI_group<39.9 & data$BMI_group>4 ,5, as.numeric(data$BMI_group)) #obese II
data$BMI_group<-if_else(data$BMI_group<49.9 & data$BMI_group>5 ,6, as.numeric(data$BMI_group)) #obese III
data$BMI_group<-if_else(data$BMI_group>50 ,7, as.numeric(data$BMI_group)) #OUTLIER?
data$BMI_group<-as.character(data$BMI_group)

data$BMI_GROUP<-if_else(data$BMI_group=="1", "Underweight", data$BMI_group)
data$BMI_GROUP<-if_else(data$BMI_group=="2", "Healthy", data$BMI_GROUP)
data$BMI_GROUP<-if_else(data$BMI_group=="3", "Overweight", data$BMI_GROUP)
data$BMI_GROUP<-if_else(data$BMI_group=="4", "Obese I", data$BMI_GROUP)
data$BMI_GROUP<-if_else(data$BMI_group=="5", "Obese II", data$BMI_GROUP)
data$BMI_GROUP<-if_else(data$BMI_group=="6", "Obese III", data$BMI_GROUP)
data$BMI_GROUP<-if_else(data$BMI_group=="7", "ERROR", data$BMI_GROUP)
data$BMI_GROUP<-factor(data$BMI_GROUP, levels=c("Underweight", "Healthy", "Overweight","Obese I", "Obese II", "Obese III"))

TMP<-data[!(data$BMI_GROUP=="ERROR"),] 
TMP<-TMP[complete.cases(TMP$BMI_GROUP),] 
TMP$BMI_GROUP<-as.factor(TMP$BMI_GROUP)

BMI_GROUP
history_cig_history.factor Underweight Healthy Overweight Obese I Obese II Obese III Total
Yes 0 ( 0.0% ) 3 ( 21.4% ) 6 ( 42.9% ) 1 ( 7.1% ) 1 ( 7.1% ) 3 ( 21.4% ) 14 ( 100.0% )
No 1 ( 0.2% ) 95 ( 19.1% ) 189 ( 38.0% ) 124 ( 24.9% ) 57 ( 11.5% ) 31 ( 6.2% ) 497 ( 100.0% )
Total 1 ( 0.2% ) 98 ( 19.2% ) 195 ( 38.2% ) 125 ( 24.5% ) 58 ( 11.4% ) 34 ( 6.7% ) 511 ( 100.0% )
 Χ2 = 6.8616   df = 5   p = .2311

Generated by summarytools 0.9.9 (R version 3.6.1)
2021-05-18

BMI_GROUP
screen_race.factor Underweight Healthy Overweight Obese I Obese II Obese III Total
African American / Black 0 ( 0.0% ) 8 ( 11.8% ) 19 ( 27.9% ) 19 ( 27.9% ) 12 ( 17.6% ) 10 ( 14.7% ) 68 ( 100.0% )
Asian 0 ( 0.0% ) 3 ( 33.3% ) 4 ( 44.4% ) 1 ( 11.1% ) 1 ( 11.1% ) 0 ( 0.0% ) 9 ( 100.0% )
Caucasian / White 1 ( 0.2% ) 82 ( 19.5% ) 170 ( 40.5% ) 101 ( 24.0% ) 45 ( 10.7% ) 21 ( 5.0% ) 420 ( 100.0% )
American Indian or Alaska Native 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% )
Native Hawaiian or other Pacific Islander 0 ( 0.0% ) 1 ( 100.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 1 ( 100.0% )
Subject Refused to Answer 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% )
Bi-racial 0 ( 0.0% ) 4 ( 66.7% ) 1 ( 16.7% ) 1 ( 16.7% ) 0 ( 0.0% ) 0 ( 0.0% ) 6 ( 100.0% )
Other 0 ( 0.0% ) 0 ( 0.0% ) 1 ( 14.3% ) 3 ( 42.9% ) 0 ( 0.0% ) 3 ( 42.9% ) 7 ( 100.0% )
Total 1 ( 0.2% ) 98 ( 19.2% ) 195 ( 38.2% ) 125 ( 24.5% ) 58 ( 11.4% ) 34 ( 6.7% ) 511 ( 100.0% )
 Χ2 = NaN   df = 35   p = NaN

Generated by summarytools 0.9.9 (R version 3.6.1)
2021-05-18

Smoker Demographics health history -

WOW I forgot that nearly half of america used to smoke!! Crazy!

Data Frame Summary

prior_smoke_yrs.tbl

Dimensions: 513 x 8
Duplicates: 316
Variable Stats / Values Freqs (% of Valid) Graph
screen_site.factor [factor] 1. Pitt 2. Kansas 3. Northeastern
186(36.3%)
181(35.3%)
146(28.5%)
rand_age_group.factor [factor] 1. < =72 2. >72
402(78.4%)
111(21.6%)
rand_sex.factor [factor] 1. Male 2. Female
147(28.7%)
366(71.3%)
screen_race.factor [factor] 1. African American / Black 2. Asian 3. Caucasian / White 4. American Indian or Alaska 5. Native Hawaiian or other 6. Subject Refused to Answer 7. Bi-racial 8. Other
68(13.3%)
9(1.8%)
422(82.3%)
0(0.0%)
1(0.2%)
0(0.0%)
6(1.2%)
7(1.4%)
history_cig_history.factor [factor] 1. Yes 2. No
14(2.7%)
499(97.3%)
smoke_other.factor [factor] 1. Yes 2. No
15(2.9%)
496(97.1%)
prior_smoke.factor [factor] 1. Yes 2. No
219(43.9%)
280(56.1%)
prior_smoke_yrs [numeric] Mean (sd) : 17 (12.4) min < med < max: 0.2 < 15 < 59 IQR (CV) : 17.8 (0.7) 40 distinct values

Generated by summarytools 0.9.9 (R version 3.6.1)
2021-05-18

Descriptive Statistics

prior_smoke_yrs by screen_site.factor

Data Frame: prior_smoke_yrs.tbl
N: 186
Mean Std.Dev Min Median Max N.Valid Pct.Valid
Pitt 20.19 12.61 2.00 20.00 50.00 92 49.46
Kansas 16.12 12.69 0.17 11.00 59.00 61 33.70
Northeastern 13.48 10.73 1.00 10.00 50.00 65 44.52

Generated by summarytools 0.9.9 (R version 3.6.1)
2021-05-18

Data Frame Summary

Smokers.table

Dimensions: 14 x 6
Duplicates: 0
Variable Label Stats / Values Freqs (% of Valid) Graph Missing
rand_treat.factor [factor] 1. Group 1: 150 mins of aero 2. Group 2: 225 mins of aero 3. Group 3: Stretch and Tone
6(42.9%)
4(28.6%)
4(28.6%)
0 (0.0%)
rand_age_group.factor [factor] 1. < =72 2. >72
12(85.7%)
2(14.3%)
0 (0.0%)
rand_sex.factor [factor] 1. Male 2. Female
2(14.3%)
12(85.7%)
0 (0.0%)
smoke_day [labelled, numeric] 31A) How many packs do you smoke per day? Mean (sd) : 0.5 (0.4) min < med < max: 0 < 0.4 < 1.5 IQR (CV) : 0.2 (0.9) 9 distinct values 0 (0.0%)
yrs_smoke [labelled, numeric] 31B) For how many years have you smoked? Mean (sd) : 40.4 (13.2) min < med < max: 15 < 45 < 55 IQR (CV) : 20 (0.3)
15:1(7.1%)
20:1(7.1%)
30:3(21.4%)
40:1(7.1%)
41:1(7.1%)
49:1(7.1%)
50:3(21.4%)
52:1(7.1%)
54:1(7.1%)
55:1(7.1%)
0 (0.0%)
vo2_data_beta.factor [factor] 1. Yes 2. No
3(21.4%)
11(78.6%)
0 (0.0%)

Generated by summarytools 0.9.9 (R version 3.6.1)
2021-05-18

Step X. Chi Squared - Smoking health history

data %$% 
  ctable(history_cig_history.factor, history_breath_sudden.factor,
         chisq = TRUE, OR = TRUE, RR=TRUE,
         headings = FALSE) %>%
  print(method = "render")
history_breath_sudden.factor
history_cig_history.factor Yes No Total
Yes 2 ( 14.3% ) 12 ( 85.7% ) 14 ( 100.0% )
No 12 ( 2.4% ) 487 ( 97.6% ) 499 ( 100.0% )
Total 14 ( 2.7% ) 499 ( 97.3% ) 513 ( 100.0% )
 Χ2 = 3.4572   df = 1   p = .0630
O.R. (95% C.I.) = 6.76  (1.36 - 33.59)
R.R. (95% C.I.) = 5.94  (1.47 - 24.08)

Generated by summarytools 0.9.9 (R version 3.6.1)
2021-05-18

data$vo2_data_test_speed<-as.factor(data$vo2_data_test_speed)
data %$% 
  ctable(history_cig_history.factor, vo2_data_test_speed,
   chisq = TRUE, OR = TRUE, RR=TRUE,
         headings = FALSE) %>%
  print(method = "render")
vo2_data_test_speed
history_cig_history.factor 1.5 1.7 1.8 1.9 2 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 4 Total
Yes 1 ( 7.1% ) 0 ( 0.0% ) 1 ( 7.1% ) 0 ( 0.0% ) 3 ( 21.4% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 1 ( 7.1% ) 0 ( 0.0% ) 6 ( 42.9% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 2 ( 14.3% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 0 ( 0.0% ) 14 ( 100.0% )
No 9 ( 1.8% ) 12 ( 2.4% ) 5 ( 1.0% ) 5 ( 1.0% ) 22 ( 4.4% ) 7 ( 1.4% ) 20 ( 4.0% ) 15 ( 3.0% ) 16 ( 3.2% ) 37 ( 7.4% ) 13 ( 2.6% ) 46 ( 9.2% ) 43 ( 8.6% ) 13 ( 2.6% ) 66 ( 13.2% ) 15 ( 3.0% ) 32 ( 6.4% ) 27 ( 5.4% ) 10 ( 2.0% ) 77 ( 15.4% ) 3 ( 0.6% ) 2 ( 0.4% ) 3 ( 0.6% ) 1 ( 0.2% ) 499 ( 100.0% )
Total 10 ( 1.9% ) 12 ( 2.3% ) 6 ( 1.2% ) 5 ( 1.0% ) 25 ( 4.9% ) 7 ( 1.4% ) 20 ( 3.9% ) 15 ( 2.9% ) 16 ( 3.1% ) 37 ( 7.2% ) 13 ( 2.5% ) 46 ( 9.0% ) 44 ( 8.6% ) 13 ( 2.5% ) 72 ( 14.0% ) 15 ( 2.9% ) 32 ( 6.2% ) 27 ( 5.3% ) 12 ( 2.3% ) 77 ( 15.0% ) 3 ( 0.6% ) 2 ( 0.4% ) 3 ( 0.6% ) 1 ( 0.2% ) 513 ( 100.0% )
 Χ2 = 41.4629   df = 23   p = .0105

Generated by summarytools 0.9.9 (R version 3.6.1)
2021-05-18

data %$% 
  ctable(history_cig_history.factor, vo2sum_criteria1.factor,
         chisq = TRUE, OR = TRUE, RR=TRUE,
         headings = FALSE) %>%
  print(method = "render")
vo2sum_criteria1.factor
history_cig_history.factor Yes No <NA> Total
Yes 9 ( 64.3% ) 5 ( 35.7% ) 0 ( 0.0% ) 14 ( 100.0% )
No 404 ( 81.0% ) 94 ( 18.8% ) 1 ( 0.2% ) 499 ( 100.0% )
Total 413 ( 80.5% ) 99 ( 19.3% ) 1 ( 0.2% ) 513 ( 100.0% )
 Χ2 = 1.5136   df = 1   p = .2186
O.R. (95% C.I.) = 0.42  (0.14 - 1.28)
R.R. (95% C.I.) = 0.79  (0.54 - 1.17)

Generated by summarytools 0.9.9 (R version 3.6.1)
2021-05-18

...
...
...

There are a number of ways in which we can extract variables from a huge database. Since we know what we're looking for (Health History Questionnaire), we'll start with those columns. Using the RedCap codebook, we can see that all health history items start with a similar string of text "history_". We'll use tidyverse commands starts_with(), embedded within the select() call of the data.

For more information on Tidyverse: https://hbctraining.github.io/Intro-to-R/lessons/08_intro_tidyverse.html

We can additionally use some commands to make sure that any columns which include information about smoking/alcohol are selected. The contains() command will extract any columns which contain a matching string of text. Thus, we'll use "contains("smoke")" & "contains("alcohol")" within our select call was well.

The first 4 variables included in the select command below are needed in later steps. We've also included 'starts_with("vo2")' to get some other demographics (weight, height, etc.)

Let's specfically look at the HHQ. Since there is no definitive guide for scoring this questionnaire, we need some data reduction techneques. My first inclination tells me to use exploratory factor analysis. Irregardless, it is good to have a background in the following R commands/packages:

PCA (FactoMineR & factoextra packages) corrplot (corrplot package) factanal (stats package)

Best online expamples: http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/112-pca-principal-component-analysis-essentials/

HHQ<-data %>% select(  record_id, 
                       starts_with("history_"))
HHQ<-HHQ[,c(2:19)]
HHQ<-HHQ[(complete.cases(HHQ)),]


HHQ.pca<-PCA(HHQ, graph = FALSE)
fviz_eig(HHQ.pca, addlabels = TRUE) ## MAN thats a shit scree plot

var <- get_pca_var(HHQ.pca)

par(mfrow=c(1,2))
fviz_pca_var(HHQ.pca, col.var = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), 
             repel = TRUE )

corrplot(var$cos2, is.corr=FALSE)

par(mfrow=c(1,2))

fviz_contrib(HHQ.pca, choice = "var", axes = 1)

fviz_contrib(HHQ.pca, choice = "var", axes = 2)