Work done by Francisco Bischoff and João Almeida.
Using the database osa_simulated_523.csv which includes data from 523 patients who were referred to sleep consultation for risk of obstructive sleep apnea.
a) Present a descriptive analysis of the dataset.
all variables are categorical and have the frequency shown in the graph.
cats <- inspect_cat(data)
cats.ix <- sort(sort(cats$common_pcnt, index.return = TRUE, decreasing = T)$ix, index.return = TRUE)$ix
lab <- names(cats$levels)
lab <- paste(sprintf("%02d", cats.ix), lab)
names(cats$levels) <- lab
show_plot(cats)we created a univariate logistic regression in order to assess the significance of each variable on their own and their impact on predicting the outcome
l.coef <- vector()
l.sign <- vector()
for (var in seq(length(data) - 1)) {
l.model <- glm(OSA ~ ., family = binomial(link = "logit"), data = data[, c(1, var + 1)])
l.results <- summary(l.model)
l.coef[var] <- l.results$coefficients[2, 1]
l.sign[var] <- l.results$coefficients[2, 4]
}
model.list <- data.frame(variables = names(data)[2:39], coefficients = l.coef, significance = l.sign)
model.list$below30pcnt <- ifelse(model.list$significance < 0.3, TRUE, FALSE)
model.list[model.list$significance < 0.3, ]Checking the most significant and impactful variables
OSA False True
Race
African 152 7
Caucasian 12 342
Call: xtabs(formula = ~Race + OSA, data = data, subset = NULL)
Number of cases in table: 513
Number of factors: 2
Test for independence of all factors:
Chisq = 428.9, df = 1, p-value = 2.784e-95
OSA False True
Decreased.Libido
False 62 2
True 102 350
Call: xtabs(formula = ~Decreased.Libido + OSA, data = data, subset = NULL)
Number of cases in table: 516
Number of factors: 2
Test for independence of all factors:
Chisq = 142.78, df = 1, p-value = 6.574e-33
b) Identify covariates significantly associated with the outcome “OSA”, presenting the effect of that association (e.g. OR) and the corresponding significance (e.g. p-value or 95%CI).
Using forward and Backward stepwise selection, we look for the best associated variables with the outcome OSA.
For practical purposes, we show here an example of the process with the first 12 variables.
model1 <- glm(OSA ~ ., family = binomial(link = "logit"), data = data[, c(1, 2:13)])
k1 <- blr_step_p_backward(model1, prem = 0.1)Backward Elimination Method
---------------------------
Candidate Terms:
1 . Gender
2 . Depression
3 . Headaches
4 . Decreased.Libido
5 . Smoking
6 . Alcohol
7 . Sedatives
8 . Humor
9 . Age.Cat
10 . MI
11 . Hypothyroidism
12 . Coffee
We are eliminating variables based on p value...
Variables Removed:
✖ Depression
✖ Coffee
✖ MI
✖ Sedatives
✖ Alcohol
✖ Humor
✖ Smoking
No more variables satisfy the condition of p value = 0.1
Final Model Output
------------------
✔ Creating model overview.
✔ Creating response profile.
✔ Extracting maximum likelihood estimates.
✔ Estimating concordant and discordant pairs.
Model Overview
------------------------------------------------------------------------
Data Set Resp Var Obs. Df. Model Df. Residual Convergence
------------------------------------------------------------------------
data OSA 523 497 490 TRUE
------------------------------------------------------------------------
Response Summary
--------------------------------------------------------
Outcome Frequency Outcome Frequency
--------------------------------------------------------
0 166 1 355
--------------------------------------------------------
Maximum Likelihood Estimates
--------------------------------------------------------------------------
Parameter DF Estimate Std. Error z value Pr(>|z|)
--------------------------------------------------------------------------
(Intercept) 1 -6.2506 0.8976 -6.9638 0.0000
GenderMale 1 1.1256 0.2954 3.8099 1e-04
HeadachesTrue 1 0.6098 0.2851 2.1391 0.0324
Decreased.LibidoTrue 1 4.4926 0.7689 5.8433 0.0000
Age.Cat40-54 1 0.8739 0.4151 2.1054 0.0353
Age.Cat55-69 1 2.2918 0.4262 5.3772 0.0000
Age.Cat>=70 1 3.2296 0.5417 5.9621 0.0000
HypothyroidismTrue 1 0.6460 0.2712 2.3820 0.0172
--------------------------------------------------------------------------
Association of Predicted Probabilities and Observed Responses
---------------------------------------------------------------
% Concordant 0.8494 Somers' D 0.7351
% Discordant 0.1297 Gamma 0.7197
% Tied 0.0209 Tau-a 0.3135
Pairs 53901 c 0.8599
---------------------------------------------------------------
So our method was to apply the method exemplified above to the full dataset and variables. Due to problems like glm.fit: algorithm did not converge and glm.fit: fitted probabilities numerically 0 or 1 occurred, we could not use the whole set of variables in one pass, so we used multiple passes, removing variables, until we got a unique model.
After all the calculations were done (calculation time: 28.634999 s) The variables we found highly associated with the OSA outcome are the following:
Forward stepwise : AF, Decreased.Libido, Race
Model Overview
------------------------------------------------------------------------
Data Set Resp Var Obs. Df. Model Df. Residual Convergence
------------------------------------------------------------------------
data OSA 501 500 497 TRUE
------------------------------------------------------------------------
Response Summary
--------------------------------------------------------
Outcome Frequency Outcome Frequency
--------------------------------------------------------
0 162 1 339
--------------------------------------------------------
Maximum Likelihood Estimates
--------------------------------------------------------------------------
Parameter DF Estimate Std. Error z value Pr(>|z|)
--------------------------------------------------------------------------
(Intercept) 1 -12.0985 1.6842 -7.1837 0.0000
Decreased.LibidoTrue 1 8.4038 1.4672 5.7278 0.0000
RaceCaucasian 1 9.2286 1.1781 7.8332 0.0000
AFTrue 1 3.7382 0.8471 4.4130 0.0000
--------------------------------------------------------------------------
Odds Ratio Estimates
----------------------------------------------------------------------
Effects Estimate 95% Wald Conf. Limit
----------------------------------------------------------------------
Decreased.LibidoTrue 4464.0921 385.2467 148734.1006
RaceCaucasian 10183.7721 1514.5293 219256.8015
AFTrue 42.0217 8.6387 258.5099
----------------------------------------------------------------------
Association of Predicted Probabilities and Observed Responses
---------------------------------------------------------------
% Concordant 0.9888 Somers' D 0.9979
% Discordant 0.0010 Gamma 0.9877
% Tied 0.0102 Tau-a 0.4331
Pairs 54918 c 0.9939
---------------------------------------------------------------
Backward stepwise (p-value: remove >0.10): Conc.decrease, Daytime.Sleepiness, Decreased.Libido, GE.Reflux, PHT, Race, VehicleCrashes
Model Overview
------------------------------------------------------------------------
Data Set Resp Var Obs. Df. Model Df. Residual Convergence
------------------------------------------------------------------------
data OSA 485 484 477 TRUE
------------------------------------------------------------------------
Response Summary
--------------------------------------------------------
Outcome Frequency Outcome Frequency
--------------------------------------------------------
0 157 1 328
--------------------------------------------------------
Maximum Likelihood Estimates
----------------------------------------------------------------------------
Parameter DF Estimate Std. Error z value Pr(>|z|)
----------------------------------------------------------------------------
(Intercept) 1 -13.4282 3.3932 -3.9574 1e-04
Decreased.LibidoTrue 1 10.4685 2.6555 3.9422 1e-04
RaceCaucasian 1 12.7373 2.7638 4.6086 0.0000
Daytime.SleepinessTrue 1 -3.7971 1.9008 -1.9976 0.0458
GE.RefluxTrue 1 6.0412 2.0033 3.0156 0.0026
Conc.decreaseTrue 1 -6.4543 2.2963 -2.8107 0.0049
VehicleCrashesTrue 1 3.4229 1.8814 1.8193 0.0689
PHTTrue 1 6.3001 1.9974 3.1542 0.0016
----------------------------------------------------------------------------
Odds Ratio Estimates
-----------------------------------------------------------------------------
Effects Estimate 95% Wald Conf. Limit
-----------------------------------------------------------------------------
Decreased.LibidoTrue 35188.0840 608.0737 42958071.3815
RaceCaucasian 340213.1960 5690.6647 652278401.2327
Daytime.SleepinessTrue 0.0224 3e-04 0.6876
GE.RefluxTrue 420.3814 17.7781 55256.3055
Conc.decreaseTrue 0.0016 0.0000 0.0972
VehicleCrashesTrue 30.6586 1.1636 2624.1873
PHTTrue 544.6205 13.1280 57070.3793
-----------------------------------------------------------------------------
Association of Predicted Probabilities and Observed Responses
---------------------------------------------------------------
% Concordant 0.9972 Somers' D 0.9957
% Discordant 0.0022 Gamma 0.9951
% Tied 6e-04 Tau-a 0.4366
Pairs 51496 c 0.9975
---------------------------------------------------------------
c) Build a multivariable logistic regression model, using any appropriate covariate selection method.
Besides the variables selected by stepwise algorithm, we added the following variables as we find them important due to clinical logic reasoning: Alcohol due to its muscular relaxing properties. So, providing a larger muscular relaxation, it can enhance the effects of the OSA. That said, the final list consisted of:
Race, Decreased.Libido, AF, Daytime.Sleepiness, GE.Reflux, Conc.decrease, VehicleCrashes, PHT, Alcohol
First, we used the blorr package to build a model.
covariates <- unique(c(forward$predictors, backward$predictors, "Alcohol"))
preds <- which(names(data) %in% covariates)
data3 <- data[, c(1, preds)]
data3 <- dplyr::tbl_df(data3[complete.cases(data3), ])
data3_sim <- data_sim[, c(1, preds)]
data3_sim <- dplyr::tbl_df(data3_sim[complete.cases(data3_sim), ])
model <- glm(OSA ~ ., family = binomial(link = "logit"), data = data3)
def_model <- blr_regress(model, odd_conf_limit = TRUE)✔ Creating model overview.
✔ Creating response profile.
✔ Extracting maximum likelihood estimates.
✔ Computing odds ratio estimates.
✔ Estimating concordant and discordant pairs.
Model Overview
------------------------------------------------------------------------
Data Set Resp Var Obs. Df. Model Df. Residual Convergence
------------------------------------------------------------------------
data OSA 474 473 464 TRUE
------------------------------------------------------------------------
Response Summary
--------------------------------------------------------
Outcome Frequency Outcome Frequency
--------------------------------------------------------
0 156 1 318
--------------------------------------------------------
Maximum Likelihood Estimates
----------------------------------------------------------------------------
Parameter DF Estimate Std. Error z value Pr(>|z|)
----------------------------------------------------------------------------
(Intercept) 1 -15.2534 4.3881 -3.4761 5e-04
Decreased.LibidoTrue 1 9.8521 2.7616 3.5675 4e-04
AlcoholTrue 1 2.1614 1.4782 1.4622 0.1437
RaceCaucasian 1 13.3183 3.2672 4.0764 0.0000
AFTrue 1 1.7261 2.2544 0.7656 0.4439
Daytime.SleepinessTrue 1 -2.9885 2.4691 -1.2104 0.2261
GE.RefluxTrue 1 5.2490 2.9092 1.8043 0.0712
Conc.decreaseTrue 1 -7.0191 2.7020 -2.5977 0.0094
VehicleCrashesTrue 1 3.3304 2.0149 1.6529 0.0984
PHTTrue 1 6.3048 1.9863 3.1741 0.0015
----------------------------------------------------------------------------
Odds Ratio Estimates
-----------------------------------------------------------------------------
Effects Estimate 95% Wald Conf. Limit
-----------------------------------------------------------------------------
Decreased.LibidoTrue 18998.5739 318.3644 29733682.4229
AlcoholTrue 8.6830 0.6638 348.1881
RaceCaucasian 608211.8862 6076.0510 7871669728.7739
AFTrue 5.6185 0.0743 1416.4247
Daytime.SleepinessTrue 0.0504 3e-04 5.7313
GE.RefluxTrue 190.3748 1.6801 561034.0901
Conc.decreaseTrue 9e-04 0.0000 0.0835
VehicleCrashesTrue 27.9482 0.7739 3166.5777
PHTTrue 547.1886 11.7915 49719.5039
-----------------------------------------------------------------------------
Association of Predicted Probabilities and Observed Responses
---------------------------------------------------------------
% Concordant 0.9993 Somers' D 0.9987
% Discordant 7e-04 Gamma 0.9986
% Tied 1e-04 Tau-a 0.4419
Pairs 49608 c 0.9993
---------------------------------------------------------------
Then we used the caret package to build the model as well. The metric aimed for was ROC and evaluation was used leave-one-out and leave-group-out (aka train test)
## LOO
control <- trainControl(method = "LOOCV", classProbs = TRUE, savePredictions = TRUE)
fit.glm <- train(OSA ~ ., data = data3, method = "glm", metric = "ROC", trControl = control)
fit.glm$finalModel
Call: NULL
Coefficients:
(Intercept) Decreased.LibidoTrue AlcoholTrue RaceCaucasian AFTrue
-15.253 9.852 2.161 13.318 1.726
Daytime.SleepinessTrue GE.RefluxTrue Conc.decreaseTrue VehicleCrashesTrue PHTTrue
-2.989 5.249 -7.019 3.330 6.305
Degrees of Freedom: 473 Total (i.e. Null); 464 Residual
Null Deviance: 600.6
Residual Deviance: 22.58 AIC: 42.58
### train test
control2 <- trainControl(method = "LGOCV", classProbs = TRUE, repeats = 3, p = 0.25, savePredictions = TRUE, summaryFunction = twoClassSummary)
fit.glm2 <- train(OSA ~ ., data = data3, method = "glm", metric = "ROC", trControl = control2)
fit.glm2$finalModel
Call: NULL
Coefficients:
(Intercept) Decreased.LibidoTrue AlcoholTrue RaceCaucasian AFTrue
-15.253 9.852 2.161 13.318 1.726
Daytime.SleepinessTrue GE.RefluxTrue Conc.decreaseTrue VehicleCrashesTrue PHTTrue
-2.989 5.249 -7.019 3.330 6.305
Degrees of Freedom: 473 Total (i.e. Null); 464 Residual
Null Deviance: 600.6
Residual Deviance: 22.58 AIC: 42.58
d) Evaluate the derived model, using any appropriate model evaluation strategy, presenting estimates of the generalizability of using the model as diagnostic predictor for new patients.
Then we plotted the roc curves for all models.
# leave one out
predobj <- prediction(predictions = fit.glm$pred$True, labels = fit.glm$pred$obs)
perf <- performance(predobj, measure = "tpr", x.measure = "fpr")
plot(perf) ROC Sens Spec
0.9837929 0.9807692 0.9811321
# train test
predobj2 <- prediction(predictions = fit.glm2$pred$True, labels = fit.glm2$pred$obs)
perf2 <- performance(predobj2, measure = "tpr", x.measure = "fpr")
plot(perf2) ROC Sens Spec
0.9920217 0.9634188 0.9805042
e) Comment on how different your approach and final solution would need to be if applied to the larger osa_simulated.csv dataset, discussing on the implications for traditional analysis of having “too large” datasets.
For large datasets, a few things should be handled different. First the evaluation of the significance in large data sets tend to be significance, since it is impacted by row numbers. That said, semi-automatic models like forward and backward stepwise selections can be difficult to apply as is to such datasets. However, there are different metrics and statistic method to overcome this, like AIC and BIC metrics. Furthermore, the metric used for assessing quality - the Leave One Out is not very big data friendly, since it is computational demanding. Nevertheless, there are a few others metrics for assessing model quality and generalization capabilities.
Last but not least, it is in these scenarios that data scientists can be of great importance, taking into account the domain they are working on, adding variables and creating new ones based on their knowledge of the problem trying to be solved.