This chapter is a part of the project “Machine Learning in Medicine” with participation of 12 members who comes from different backgrounds.
Via this project, we hope fulfill the following objectives:
The chapter 7 “Factor Analysis and Partial Least Squares for Complex-Data Reduction (250 patients)” from “Machine Learning in Medicine Cookbook - Tome 1 of Ton J. Cleophas and Aeilko H. Zwinderman” has suggested to use the Factor Analysis and Partial Least Squares to predict drug efficacy…
(Viết típ)
From a large number of independent variables, we need to train a model to predict drug efficacy that is composed of 4 dependent variables.
In this report, we present our methodology in 2 parts: 1. Build a linear regression based on the whole dataset and on the reduced dataset by factor analysis 2. Build a structural model using Partial Least Squares Path Modelling
“Twelve highly expressed genes are used to predict drug efficacy” Highly expressed genes : G1, G2, G3, G4 , G16, G17, G18, G19 and G24, G25, G26, G27 Drug efficacy: O1, O2, O3, O4
library(foreign)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(caret)
library(plsdepot)
library(psych)
library(plspm)
library(reshape)
rm(list=ls())
file <- "https://cdn.rawgit.com/tkvit/MLM_project/master/Chap16_FA_PLS/optscalingfactorplscanonical.sav"
df <- read.spss(file,to.data.frame=TRUE)
raw <- df #backup data
df <- df[,-17]
predictors <- df[,1:12]
outcomes <- df [,13:16]
Look at the data
head(df)
## geneone genetwo genethree genefour genesixteen geneseventeen
## 1 8 8 9 5 7 10
## 2 9 9 10 9 8 8
## 3 9 8 8 8 8 9
## 4 8 9 8 9 6 7
## 5 10 10 8 10 9 10
## 6 7 8 8 8 8 7
## geneeighteen genenineteen genetwentyfour genetwentyfive genetwentysix
## 1 5 6 9 9 6
## 2 7 8 8 9 8
## 3 7 8 9 8 9
## 4 6 4 6 6 5
## 5 10 8 8 9 9
## 6 6 5 7 8 8
## genetwentyseven outcomeone outcometwo outcomethree outcomefour
## 1 6 6 7 6 7
## 2 8 8 7 8 7
## 3 9 9 8 8 8
## 4 5 7 7 7 6
## 5 9 8 8 8 7
## 6 7 7 6 6 7
# For FA analysis, we decided to work with the sum of the four outcomes variables.
summaryoutcome <- data.frame(summaryoutcome = apply(outcomes,1,sum))
# Reduction data with pca with rotation
fit <- principal(predictors,nfactors = 3,rotate = "varimax")
#Dataframe with the latent predictors variables
df_vf <- data.frame(cbind(df,FA1 = fit$scores[, 1], FA2 = fit$scores[, 2], FA3 = fit$scores[, 3],summaryoutcome))
In this section, we apply a linear regression model on two cases: the whole dataset and the reduced dataset using factor analysis. For this analysis, we decided to work with the sum of the four outcomes variables. The new outcome variable is called ‘summaryoutcome’
#Lm on raw dataset
df_temp <- select(df_vf,geneone:genetwentyseven,summaryoutcome)
mod1_lin <- lm(summaryoutcome ~., data = df_temp)
summary(mod1_lin)
##
## Call:
## lm(formula = summaryoutcome ~ ., data = df_temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.4316 -1.9233 0.2772 2.2117 7.8250
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.29257 1.47510 2.232 0.026543 *
## geneone -0.12207 0.18900 -0.646 0.519001
## genetwo 0.28696 0.22481 1.276 0.203047
## genethree 0.37010 0.22775 1.625 0.105489
## genefour 0.06308 0.19637 0.321 0.748314
## genesixteen 0.76381 0.17163 4.450 1.32e-05 ***
## geneseventeen 0.83512 0.19790 4.220 3.48e-05 ***
## geneeighteen 0.08777 0.15139 0.580 0.562638
## genenineteen 0.57612 0.15359 3.751 0.000221 ***
## genetwentyfour 0.40268 0.14591 2.760 0.006236 **
## genetwentyfive 0.02789 0.14081 0.198 0.843171
## genetwentysix 0.31987 0.14217 2.250 0.025373 *
## genetwentyseven -0.27523 0.13316 -2.067 0.039830 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.642 on 237 degrees of freedom
## Multiple R-squared: 0.7288, Adjusted R-squared: 0.7151
## F-statistic: 53.07 on 12 and 237 DF, p-value: < 2.2e-16
At first we need to test the reliability of the dataset using the test-retest reliability. The Cronbach alpha indicates the internal consistency of a variable. It should be higher than 0.80.
#Test-retest reliability
a<- psych::alpha(predictors)$alpha.drop
print(a)
## raw_alpha std.alpha G6(smc) average_r S/N
## geneone 0.9022553 0.9037372 0.9177891 0.4604731 9.388232
## genetwo 0.8953719 0.8963375 0.9091673 0.4401094 8.646694
## genethree 0.8948120 0.8955384 0.9103046 0.4379983 8.572893
## genefour 0.9040677 0.9059329 0.9202038 0.4668143 9.630709
## genesixteen 0.8957594 0.8983816 0.9126559 0.4455850 8.840734
## geneseventeen 0.8963856 0.8982057 0.9125954 0.4451094 8.823728
## geneeighteen 0.8993559 0.9012474 0.9165995 0.4534518 9.126313
## genenineteen 0.8948362 0.8974701 0.9147251 0.4431298 8.753255
## genetwentyfour 0.8929730 0.8962084 0.9118027 0.4397671 8.634692
## genetwentyfive 0.9024913 0.9050741 0.9212400 0.4643170 9.534532
## genetwentysix 0.8941496 0.8970246 0.9120615 0.4419376 8.711059
## genetwentyseven 0.9029756 0.9049215 0.9195379 0.4638754 9.517620
## alpha se
## geneone 0.008838139
## genetwo 0.009411006
## genethree 0.009479066
## genefour 0.008736403
## genesixteen 0.009451036
## geneseventeen 0.009376683
## geneeighteen 0.009069112
## genenineteen 0.009548438
## genetwentyfour 0.009844989
## genetwentyfive 0.008839683
## genetwentysix 0.009739096
## genetwentyseven 0.008751473
All the Cronbach alpha values are higher than 80%. The dataset is reliable.
Next step, we reduced the independent variables to 3 factors called FA1, FA2 and FA3.
fit <- principal(predictors,nfactors = 3,rotate = "varimax")
print(fit)
## Principal Components Analysis
## Call: principal(r = predictors, nfactors = 3, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
## RC1 RC2 RC3 h2 u2 com
## geneone 0.21 0.81 0.14 0.72 0.28 1.2
## genetwo 0.55 0.68 0.07 0.77 0.23 1.9
## genethree 0.62 0.61 0.06 0.77 0.23 2.0
## genefour 0.03 0.76 0.37 0.71 0.29 1.5
## genesixteen 0.86 0.16 0.09 0.77 0.23 1.1
## geneseventeen 0.65 0.22 0.34 0.58 0.42 1.8
## geneeighteen 0.53 0.30 0.32 0.47 0.53 2.3
## genenineteen 0.75 0.27 0.17 0.66 0.34 1.4
## genetwentyfour 0.66 0.10 0.54 0.73 0.27 2.0
## genetwentyfive 0.22 0.23 0.70 0.59 0.41 1.4
## genetwentysix 0.69 0.08 0.49 0.72 0.28 1.8
## genetwentyseven 0.19 0.16 0.82 0.74 0.26 1.2
##
## RC1 RC2 RC3
## SS loadings 3.72 2.40 2.11
## Proportion Var 0.31 0.20 0.18
## Cumulative Var 0.31 0.51 0.69
## Proportion Explained 0.45 0.29 0.26
## Cumulative Proportion 0.45 0.74 1.00
##
## Mean item complexity = 1.6
## Test of the hypothesis that 3 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.07
## with the empirical chi square 159.81 with prob < 1.5e-18
##
## Fit based upon off diagonal values = 0.98
The useful information in this step: the latent factor 1 strongly correlates with the genes 16–19 (called PC2), the latent factor 2 with the genes 1–4 (PC1), and the latent factor 3 with the genes 24–27 (PC3).
#Lm with factor analysis
df_temp <- select(df_vf,FA1:FA3,summaryoutcome)
mod2_lin <- lm(summaryoutcome ~., data = df_temp)
summary(mod2_lin)
##
## Call:
## lm(formula = summaryoutcome ~ ., data = df_temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.8192 -1.9699 0.2148 2.3344 9.1462
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.3320 0.2309 118.379 < 2e-16 ***
## FA1 5.2893 0.2313 22.863 < 2e-16 ***
## FA2 1.7508 0.2313 7.568 7.59e-13 ***
## FA3 1.5282 0.2313 6.605 2.43e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.651 on 246 degrees of freedom
## Multiple R-squared: 0.7171, Adjusted R-squared: 0.7137
## F-statistic: 207.9 on 3 and 246 DF, p-value: < 2.2e-16
In this section, we want to use Partial Least Square Path Modelling that is both regression model and structural model. The advantage of PLS-PM is that the model can take into account the relationships between the latent variables and also the relationship of each latent variable with its block of indicators.
The PLS_PM model consists of two sub models: the inner model and the outer model….(Bổ sung sau)
Now we are going to set up the PLS Path modelling model
Step 1: Construct the path matrix. In this case, we suppose that the group of genes will affect the drug efficacy and these relations are one way. The matrix is a binary matrix. The column affects the rows.
The figure below illustrates the relation between the blocks: PC1(gene 1-4), PC2 (genes 16-19), PC3 (genes 24-27) will affect the block drug efficacy (O1-O4)
# Observation of FA analysis suggest the construction path matrix
PC1 = c(0, 0, 0, 0)
PC2 = c(0, 0, 0, 0)
PC3 = c(0, 0, 0, 0)
efficacy = c(1, 1, 1, 0)
# matrix (by row binding)
foo_path = rbind(PC1,PC2,PC3,efficacy)
innerplot(foo_path, box.size = 0.1)
# add column names (optional)
colnames(foo_path) = rownames(foo_path)
Step 2 Now construct the outer model
#outer model
foo_blocks = list(1:4,5:8, 9:12, 13:16)
# modes (rflective blocks)
foo_modes = c("A","A","A","A")
Step 3 The model is run with bootstrap 500 times
foo_pls1 = plspm(df, foo_path, foo_blocks, modes = foo_modes,
boot.val = TRUE, br = 500)
We need to check the following points to validate the model: - The unidimensionality - The internal loadings of an indicator - The cross-loadings of an indicator with the other latent variables
The unidimensionality is verified based on the cronbach alpha values and other coefficients.
foo_pls1$unidim
## Mode MVs C.alpha DG.rho eig.1st eig.2nd
## PC1 A 4 0.8375080 0.8922575 2.703187 0.6379821
## PC2 A 4 0.8357228 0.8904442 2.681531 0.6006842
## PC3 A 4 0.8194216 0.8813711 2.604721 0.5734549
## efficacy A 4 0.8940109 0.9267006 3.039999 0.4220001
The internal loadings of an indicator indicate the correlation of each gene to the block. Based on the plot, we can see the force of each gene to their corresponding block. The weakest variables are genefour for PC1, geneeighteen for PC2, genetwentyseven for PC3. We remember these genes for the next step.
# check outer model
foo_pls1$outer_model
## name block weight loading communality redundancy
## 1 geneone PC1 0.2374020 0.7851267 0.6164239 0.0000000
## 2 genetwo PC1 0.3603526 0.9083122 0.8250311 0.0000000
## 3 genethree PC1 0.3890395 0.8935310 0.7983977 0.0000000
## 4 genefour PC1 0.2073838 0.6687013 0.4471615 0.0000000
## 5 genesixteen PC2 0.3432759 0.8554509 0.7317962 0.0000000
## 6 geneseventeen PC2 0.3094545 0.8400499 0.7056838 0.0000000
## 7 geneeighteen PC2 0.2466070 0.7595231 0.5768753 0.0000000
## 8 genenineteen PC2 0.3182192 0.8141664 0.6628669 0.0000000
## 9 genetwentyfour PC3 0.3896540 0.8948269 0.8007152 0.0000000
## 10 genetwentyfive PC3 0.2346505 0.7135171 0.5091067 0.0000000
## 11 genetwentysix PC3 0.3856708 0.8879260 0.7884126 0.0000000
## 12 genetwentyseven PC3 0.2025939 0.6982084 0.4874949 0.0000000
## 13 outcomeone efficacy 0.3212617 0.9147755 0.8368142 0.5871557
## 14 outcometwo efficacy 0.3049914 0.9110843 0.8300747 0.5824269
## 15 outcomethree efficacy 0.2450601 0.8325635 0.6931619 0.4863612
## 16 outcomefour efficacy 0.2722327 0.8236215 0.6783524 0.4759701
plot(foo_pls1, what = "weights")
Now we verify the cross loadings values. What we want is that the loadings values of of each block with itself must be stronger the ones with other blocks.
# check cross loadings
foo_pls1$crossloadings
## name block PC1 PC2 PC3 efficacy
## 1 geneone PC1 0.7851267 0.4240536 0.4031129 0.3934530
## 2 genetwo PC1 0.9083122 0.5997834 0.5404322 0.5972208
## 3 genethree PC1 0.8935310 0.6555677 0.5512869 0.6447656
## 4 genefour PC1 0.6687013 0.4131683 0.3599533 0.3437041
## 5 genesixteen PC2 0.5541554 0.8554509 0.5808967 0.7476556
## 6 geneseventeen PC2 0.5107802 0.8400499 0.5710330 0.6739924
## 7 geneeighteen PC2 0.4979131 0.7595231 0.4822294 0.5371123
## 8 genenineteen PC2 0.5859186 0.8141664 0.5994290 0.6930826
## 9 genetwentyfour PC3 0.5249670 0.6516726 0.8948269 0.6624763
## 10 genetwentyfive PC3 0.4266934 0.4037936 0.7135171 0.3989443
## 11 genetwentysix PC3 0.5161384 0.6413022 0.8879260 0.6557040
## 12 genetwentyseven PC3 0.3742854 0.4431464 0.6982084 0.3444425
## 13 outcomeone efficacy 0.5996060 0.8020112 0.6630708 0.9147755
## 14 outcometwo efficacy 0.5765422 0.7944103 0.5891718 0.9110843
## 15 outcomethree efficacy 0.4672861 0.6228682 0.4848062 0.8325635
## 16 outcomefour efficacy 0.5427362 0.6084722 0.5983804 0.8236215
We use ggplot2 to better visualize the crossloadings library(ggplot2) library(reshape)
Now everything is verified. We can look at the inner model. We got 70% of explained variance and the goodness of fit is 0.69.
# check inner model
foo_pls1$inner_summary
## Type R2 Block_Communality Mean_Redundancy AVE
## PC1 Exogenous 0.000000 0.6717535 0.0000000 0.6717535
## PC2 Exogenous 0.000000 0.6693056 0.0000000 0.6693056
## PC3 Exogenous 0.000000 0.6464324 0.0000000 0.6464324
## efficacy Endogenous 0.701656 0.7596008 0.5329785 0.7596008
plot(foo_pls1)
foo_pls1$gof
## [1] 0.6941746
We got 0.69 as goodness of fit. We wonder wether doing a variable selection can improve the GoF. Based on what we obseved in the outer model.
We made 3 cases:
#pls2
foo_blocks = list(1:3,c(5,6,8), 9:11, 13:16)
foo_pls2 = plspm(df, foo_path, foo_blocks, modes = foo_modes,
boot.val = TRUE, br = 500)
#pls3
foo_blocks = list(1:3,c(5,6,8), 9:12, 13:16)
foo_pls3 = plspm(df, foo_path, foo_blocks, modes = foo_modes,
boot.val = TRUE, br = 500)
# check inner model
foo_pls1$inner_summary
## Type R2 Block_Communality Mean_Redundancy AVE
## PC1 Exogenous 0.000000 0.6717535 0.0000000 0.6717535
## PC2 Exogenous 0.000000 0.6693056 0.0000000 0.6693056
## PC3 Exogenous 0.000000 0.6464324 0.0000000 0.6464324
## efficacy Endogenous 0.701656 0.7596008 0.5329785 0.7596008
foo_pls2$inner_summary
## Type R2 Block_Communality Mean_Redundancy AVE
## PC1 Exogenous 0.0000000 0.7679859 0.000000 0.7679859
## PC2 Exogenous 0.0000000 0.7265386 0.000000 0.7265386
## PC3 Exogenous 0.0000000 0.7185128 0.000000 0.7185128
## efficacy Endogenous 0.7228401 0.7596590 0.549112 0.7596590
foo_pls3$inner_summary
## Type R2 Block_Communality Mean_Redundancy AVE
## PC1 Exogenous 0.0000000 0.7679856 0.0000000 0.7679856
## PC2 Exogenous 0.0000000 0.7265386 0.0000000 0.7265386
## PC3 Exogenous 0.0000000 0.6464359 0.0000000 0.6464359
## efficacy Endogenous 0.7170512 0.7596614 0.5447161 0.7596614
foo_pls1$gof
## [1] 0.6941746
foo_pls2$gof
## [1] 0.7335616
foo_pls3$gof
## [1] 0.71952
Conclusion: The % variance is slightly better with the variable selection. However, we observed a gain of GoF by excluding some gene.
In this chapter, we extracted the information from Factor analysis and applied it into the Partial Least Square Path Modelling. The PLS-PM model allowed us to take into account of multiple outcome variables and to see the relationship between latent variables and their indicators. The last important point is we use the insight given by PLS-PM to do a variable selection which allowed to improve the goodness of fit.
(Sửa và bổ sung sau)