Machine Learning in Medicine project

This chapter is part of the project “Machine Learning in Medicine”, carried out by 12 members who come from different backgrounds.

Through this project, we hope to fulfill the following objectives:

Introduction

Chapter 7, “Factor Analysis and Partial Least Squares for Complex-Data Reduction (250 patients)”, from “Machine Learning in Medicine - Cookbook”, Volume 1, by Ton J. Cleophas and Aeilko H. Zwinderman, suggests using Factor Analysis and Partial Least Squares to predict drug efficacy…

(To be continued)

Objectives

From a large number of independent variables, we need to train a model to predict drug efficacy, which is composed of four dependent variables.

In this report, we present our methodology in two parts:

  1. Build a linear regression on the whole dataset and on the dataset reduced by factor analysis
  2. Build a structural model using Partial Least Squares Path Modelling

Dataset

“Twelve highly expressed genes are used to predict drug efficacy.”

  • Highly expressed genes: G1, G2, G3, G4, G16, G17, G18, G19, G24, G25, G26, G27
  • Drug efficacy: O1, O2, O3, O4

0. Load data

library(foreign)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(caret)
library(plsdepot)
library(psych)
library(plspm)
library(reshape)

rm(list=ls())
file <- "https://cdn.rawgit.com/tkvit/MLM_project/master/Chap16_FA_PLS/optscalingfactorplscanonical.sav"
# Read the SPSS file of the 250 patients into a data frame
df <- read.spss(file, to.data.frame = TRUE)
raw <- df # backup of the raw data
df <- df[, -17]           # drop the 17th column, which is not used in this analysis
predictors <- df[, 1:12]  # the twelve gene expression variables
outcomes <- df[, 13:16]   # the four drug efficacy variables

Look at the data

head(df)
##   geneone genetwo genethree genefour genesixteen geneseventeen
## 1       8       8         9        5           7            10
## 2       9       9        10        9           8             8
## 3       9       8         8        8           8             9
## 4       8       9         8        9           6             7
## 5      10      10         8       10           9            10
## 6       7       8         8        8           8             7
##   geneeighteen genenineteen genetwentyfour genetwentyfive genetwentysix
## 1            5            6              9              9             6
## 2            7            8              8              9             8
## 3            7            8              9              8             9
## 4            6            4              6              6             5
## 5           10            8              8              9             9
## 6            6            5              7              8             8
##   genetwentyseven outcomeone outcometwo outcomethree outcomefour
## 1               6          6          7            6           7
## 2               8          8          7            8           7
## 3               9          9          8            8           8
## 4               5          7          7            7           6
## 5               9          8          8            8           7
## 6               7          7          6            6           7

1. Prepare the final dataset

# For the FA analysis, we work with the sum of the four outcome variables
summaryoutcome <- data.frame(summaryoutcome = apply(outcomes, 1, sum))
# Reduce the twelve predictors to three factors with PCA and varimax rotation
fit <- principal(predictors, nfactors = 3, rotate = "varimax")
# Data frame combining the original variables, the factor scores (FA1-FA3) and the summary outcome
df_vf <- data.frame(cbind(df, FA1 = fit$scores[, 1], FA2 = fit$scores[, 2], FA3 = fit$scores[, 3], summaryoutcome))

2. Linear regression with and without factor analysis

In this section, we apply a linear regression model in two cases: on the whole dataset and on the dataset reduced by factor analysis. For this analysis, we decided to work with the sum of the four outcome variables; the new outcome variable is called ‘summaryoutcome’.

Linear regression on the whole dataset

# Linear regression on the full set of gene predictors
df_temp <- select(df_vf,geneone:genetwentyseven,summaryoutcome)
mod1_lin <- lm(summaryoutcome ~., data = df_temp)
summary(mod1_lin)
## 
## Call:
## lm(formula = summaryoutcome ~ ., data = df_temp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.4316  -1.9233   0.2772   2.2117   7.8250 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3.29257    1.47510   2.232 0.026543 *  
## geneone         -0.12207    0.18900  -0.646 0.519001    
## genetwo          0.28696    0.22481   1.276 0.203047    
## genethree        0.37010    0.22775   1.625 0.105489    
## genefour         0.06308    0.19637   0.321 0.748314    
## genesixteen      0.76381    0.17163   4.450 1.32e-05 ***
## geneseventeen    0.83512    0.19790   4.220 3.48e-05 ***
## geneeighteen     0.08777    0.15139   0.580 0.562638    
## genenineteen     0.57612    0.15359   3.751 0.000221 ***
## genetwentyfour   0.40268    0.14591   2.760 0.006236 ** 
## genetwentyfive   0.02789    0.14081   0.198 0.843171    
## genetwentysix    0.31987    0.14217   2.250 0.025373 *  
## genetwentyseven -0.27523    0.13316  -2.067 0.039830 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.642 on 237 degrees of freedom
## Multiple R-squared:  0.7288, Adjusted R-squared:  0.7151 
## F-statistic: 53.07 on 12 and 237 DF,  p-value: < 2.2e-16

Linear regression on the reduced dataset

First, we test the reliability of the predictors (test-retest reliability). Cronbach's alpha indicates the internal consistency of a set of items; it should be higher than 0.80.

# Test-retest reliability: Cronbach's alpha when each item is dropped in turn
a <- psych::alpha(predictors)$alpha.drop
print(a)
##                 raw_alpha std.alpha   G6(smc) average_r      S/N
## geneone         0.9022553 0.9037372 0.9177891 0.4604731 9.388232
## genetwo         0.8953719 0.8963375 0.9091673 0.4401094 8.646694
## genethree       0.8948120 0.8955384 0.9103046 0.4379983 8.572893
## genefour        0.9040677 0.9059329 0.9202038 0.4668143 9.630709
## genesixteen     0.8957594 0.8983816 0.9126559 0.4455850 8.840734
## geneseventeen   0.8963856 0.8982057 0.9125954 0.4451094 8.823728
## geneeighteen    0.8993559 0.9012474 0.9165995 0.4534518 9.126313
## genenineteen    0.8948362 0.8974701 0.9147251 0.4431298 8.753255
## genetwentyfour  0.8929730 0.8962084 0.9118027 0.4397671 8.634692
## genetwentyfive  0.9024913 0.9050741 0.9212400 0.4643170 9.534532
## genetwentysix   0.8941496 0.8970246 0.9120615 0.4419376 8.711059
## genetwentyseven 0.9029756 0.9049215 0.9195379 0.4638754 9.517620
##                    alpha se
## geneone         0.008838139
## genetwo         0.009411006
## genethree       0.009479066
## genefour        0.008736403
## genesixteen     0.009451036
## geneseventeen   0.009376683
## geneeighteen    0.009069112
## genenineteen    0.009548438
## genetwentyfour  0.009844989
## genetwentyfive  0.008839683
## genetwentysix   0.009739096
## genetwentyseven 0.008751473

All Cronbach's alpha values are higher than 0.80, so the dataset is considered reliable.
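
As a quick programmatic check (a minimal sketch, not part of the original report, reusing the object a computed above), we can confirm that every alpha-if-item-dropped value exceeds the 0.80 threshold:

# smallest alpha-if-item-dropped value, and a logical check against the 0.80 threshold
min(a$raw_alpha)
all(a$raw_alpha > 0.80)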

Next, we reduce the twelve independent variables to three factors, called FA1, FA2 and FA3.

fit <- principal(predictors,nfactors = 3,rotate = "varimax")
print(fit)
## Principal Components Analysis
## Call: principal(r = predictors, nfactors = 3, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
##                  RC1  RC2  RC3   h2   u2 com
## geneone         0.21 0.81 0.14 0.72 0.28 1.2
## genetwo         0.55 0.68 0.07 0.77 0.23 1.9
## genethree       0.62 0.61 0.06 0.77 0.23 2.0
## genefour        0.03 0.76 0.37 0.71 0.29 1.5
## genesixteen     0.86 0.16 0.09 0.77 0.23 1.1
## geneseventeen   0.65 0.22 0.34 0.58 0.42 1.8
## geneeighteen    0.53 0.30 0.32 0.47 0.53 2.3
## genenineteen    0.75 0.27 0.17 0.66 0.34 1.4
## genetwentyfour  0.66 0.10 0.54 0.73 0.27 2.0
## genetwentyfive  0.22 0.23 0.70 0.59 0.41 1.4
## genetwentysix   0.69 0.08 0.49 0.72 0.28 1.8
## genetwentyseven 0.19 0.16 0.82 0.74 0.26 1.2
## 
##                        RC1  RC2  RC3
## SS loadings           3.72 2.40 2.11
## Proportion Var        0.31 0.20 0.18
## Cumulative Var        0.31 0.51 0.69
## Proportion Explained  0.45 0.29 0.26
## Cumulative Proportion 0.45 0.74 1.00
## 
## Mean item complexity =  1.6
## Test of the hypothesis that 3 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.07 
##  with the empirical chi square  159.81  with prob <  1.5e-18 
## 
## Fit based upon off diagonal values = 0.98

The useful information from this step: latent factor 1 (RC1) correlates most strongly with genes 16–19 (the block we will call PC2), latent factor 2 (RC2) with genes 1–4 (PC1), and latent factor 3 (RC3) with genes 24–27 (PC3).
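
To make this block structure easier to read, the rotated loadings can be printed sorted and with small values suppressed (a minimal sketch, not part of the original analysis):

# sorted loadings, hiding values below 0.5, to highlight which genes belong to which component
print(fit$loadings, cutoff = 0.5, sort = TRUE)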

# Linear regression on the three latent factors (FA1-FA3)
df_temp <- select(df_vf,FA1:FA3,summaryoutcome)
mod2_lin <- lm(summaryoutcome ~., data = df_temp)
summary(mod2_lin)
## 
## Call:
## lm(formula = summaryoutcome ~ ., data = df_temp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.8192  -1.9699   0.2148   2.3344   9.1462 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  27.3320     0.2309 118.379  < 2e-16 ***
## FA1           5.2893     0.2313  22.863  < 2e-16 ***
## FA2           1.7508     0.2313   7.568 7.59e-13 ***
## FA3           1.5282     0.2313   6.605 2.43e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.651 on 246 degrees of freedom
## Multiple R-squared:  0.7171, Adjusted R-squared:  0.7137 
## F-statistic: 207.9 on 3 and 246 DF,  p-value: < 2.2e-16

3. PLS Path Modelling

In this section, we use Partial Least Squares Path Modelling (PLS-PM), which is both a regression model and a structural model. The advantage of PLS-PM is that it takes into account the relationships between the latent variables as well as the relationship of each latent variable with its block of indicators.

The PLS-PM model consists of two sub-models: the inner model and the outer model… (to be completed later)

Now we set up the PLS Path Modelling model.

Step 1: Construct the path matrix. In this case, we suppose that the groups of genes affect the drug efficacy and that these relations are one-way. The path matrix is a binary matrix in which the columns affect the rows: a 1 in a cell means that the latent variable of that column affects the latent variable of that row.

The figure below illustrates the relations between the blocks: PC1 (genes 1–4), PC2 (genes 16–19) and PC3 (genes 24–27) affect the drug efficacy block (O1–O4).

# The observations from the FA analysis suggest the structure of the path matrix
PC1 = c(0, 0, 0, 0)
PC2 = c(0, 0, 0, 0)
PC3 = c(0, 0, 0, 0)
efficacy = c(1, 1, 1, 0)  # efficacy is affected by PC1, PC2 and PC3
# path matrix (by row binding)
foo_path = rbind(PC1, PC2, PC3, efficacy)
# plot the inner model
innerplot(foo_path, box.size = 0.1)

# add column names (optional)
colnames(foo_path) = rownames(foo_path)

Step 2: Construct the outer model. Each block is defined by the column indices of its indicators in the data frame.

# outer model: column indices of the indicators for PC1, PC2, PC3 and efficacy
foo_blocks = list(1:4, 5:8, 9:12, 13:16)
# modes (reflective blocks)
foo_modes = c("A", "A", "A", "A")

Step 3: Run the model with 500 bootstrap resamples.

foo_pls1 = plspm(df, foo_path, foo_blocks, modes = foo_modes,
                 boot.val = TRUE, br = 500)

Validation of the PLS-PM model

We need to check the following points to validate the model:

  • The unidimensionality of each block
  • The loadings of each indicator on its own block
  • The cross-loadings of each indicator on the other latent variables

The unidimensionality is verified based on Cronbach's alpha (C.alpha), the Dillon-Goldstein rho (DG.rho) and the first and second eigenvalues of each block.

foo_pls1$unidim
##          Mode MVs   C.alpha    DG.rho  eig.1st   eig.2nd
## PC1         A   4 0.8375080 0.8922575 2.703187 0.6379821
## PC2         A   4 0.8357228 0.8904442 2.681531 0.6006842
## PC3         A   4 0.8194216 0.8813711 2.604721 0.5734549
## efficacy    A   4 0.8940109 0.9267006 3.039999 0.4220001

The loadings indicate the correlation of each indicator with its block. Based on the plot, we can see the strength of each gene within its corresponding block. The weakest variables are genefour for PC1, geneeighteen for PC2 and genetwentyseven for PC3; we keep these genes in mind for the variable-selection step.

# check outer model
foo_pls1$outer_model
##               name    block    weight   loading communality redundancy
## 1          geneone      PC1 0.2374020 0.7851267   0.6164239  0.0000000
## 2          genetwo      PC1 0.3603526 0.9083122   0.8250311  0.0000000
## 3        genethree      PC1 0.3890395 0.8935310   0.7983977  0.0000000
## 4         genefour      PC1 0.2073838 0.6687013   0.4471615  0.0000000
## 5      genesixteen      PC2 0.3432759 0.8554509   0.7317962  0.0000000
## 6    geneseventeen      PC2 0.3094545 0.8400499   0.7056838  0.0000000
## 7     geneeighteen      PC2 0.2466070 0.7595231   0.5768753  0.0000000
## 8     genenineteen      PC2 0.3182192 0.8141664   0.6628669  0.0000000
## 9   genetwentyfour      PC3 0.3896540 0.8948269   0.8007152  0.0000000
## 10  genetwentyfive      PC3 0.2346505 0.7135171   0.5091067  0.0000000
## 11   genetwentysix      PC3 0.3856708 0.8879260   0.7884126  0.0000000
## 12 genetwentyseven      PC3 0.2025939 0.6982084   0.4874949  0.0000000
## 13      outcomeone efficacy 0.3212617 0.9147755   0.8368142  0.5871557
## 14      outcometwo efficacy 0.3049914 0.9110843   0.8300747  0.5824269
## 15    outcomethree efficacy 0.2450601 0.8325635   0.6931619  0.4863612
## 16     outcomefour efficacy 0.2722327 0.8236215   0.6783524  0.4759701
plot(foo_pls1, what = "weights")

Now we verify the cross-loadings. What we want is that the loading of each indicator on its own block is stronger than its loadings on the other blocks.

# check cross loadings
foo_pls1$crossloadings
##               name    block       PC1       PC2       PC3  efficacy
## 1          geneone      PC1 0.7851267 0.4240536 0.4031129 0.3934530
## 2          genetwo      PC1 0.9083122 0.5997834 0.5404322 0.5972208
## 3        genethree      PC1 0.8935310 0.6555677 0.5512869 0.6447656
## 4         genefour      PC1 0.6687013 0.4131683 0.3599533 0.3437041
## 5      genesixteen      PC2 0.5541554 0.8554509 0.5808967 0.7476556
## 6    geneseventeen      PC2 0.5107802 0.8400499 0.5710330 0.6739924
## 7     geneeighteen      PC2 0.4979131 0.7595231 0.4822294 0.5371123
## 8     genenineteen      PC2 0.5859186 0.8141664 0.5994290 0.6930826
## 9   genetwentyfour      PC3 0.5249670 0.6516726 0.8948269 0.6624763
## 10  genetwentyfive      PC3 0.4266934 0.4037936 0.7135171 0.3989443
## 11   genetwentysix      PC3 0.5161384 0.6413022 0.8879260 0.6557040
## 12 genetwentyseven      PC3 0.3742854 0.4431464 0.6982084 0.3444425
## 13      outcomeone efficacy 0.5996060 0.8020112 0.6630708 0.9147755
## 14      outcometwo efficacy 0.5765422 0.7944103 0.5891718 0.9110843
## 15    outcomethree efficacy 0.4672861 0.6228682 0.4848062 0.8325635
## 16     outcomefour efficacy 0.5427362 0.6084722 0.5983804 0.8236215

We can use ggplot2 and reshape (both loaded at the beginning of this report) to better visualise the cross-loadings.
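
For instance, a minimal sketch (not part of the original output) melts the cross-loadings table to long format and draws one bar panel per latent variable:

# reshape the cross-loadings: one row per (indicator, latent variable) pair
xloads <- melt(foo_pls1$crossloadings, id.vars = c("name", "block"),
               variable_name = "LV")
# bar chart of the loadings of every indicator, one panel per latent variable
ggplot(xloads, aes(x = name, y = value, fill = block)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ LV) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))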

Now that these checks are done, we can look at the inner model. The model explains about 70% of the variance of the efficacy block (R² = 0.70) and the goodness of fit is 0.69.

# check inner model
foo_pls1$inner_summary
##                Type       R2 Block_Communality Mean_Redundancy       AVE
## PC1       Exogenous 0.000000         0.6717535       0.0000000 0.6717535
## PC2       Exogenous 0.000000         0.6693056       0.0000000 0.6693056
## PC3       Exogenous 0.000000         0.6464324       0.0000000 0.6464324
## efficacy Endogenous 0.701656         0.7596008       0.5329785 0.7596008
plot(foo_pls1)

foo_pls1$gof
## [1] 0.6941746

Final step: variable selection

We obtained a goodness of fit of 0.69. We wonder whether a variable selection, based on what we observed in the outer model, can improve the GoF.

We compare three cases:

  • pls1: the model with all the variables (GoF already known = 0.69)
  • pls2: remove genefour, geneeighteen and genetwentyseven
  • pls3: remove genefour and geneeighteen
# pls2: drop genefour (col 4), geneeighteen (col 7) and genetwentyseven (col 12)
foo_blocks = list(1:3, c(5, 6, 8), 9:11, 13:16)
foo_pls2 = plspm(df, foo_path, foo_blocks, modes = foo_modes,
                 boot.val = TRUE, br = 500)

# pls3: drop genefour (col 4) and geneeighteen (col 7)
foo_blocks = list(1:3, c(5, 6, 8), 9:12, 13:16)
foo_pls3 = plspm(df, foo_path, foo_blocks, modes = foo_modes,
                 boot.val = TRUE, br = 500)
# check inner model
foo_pls1$inner_summary
##                Type       R2 Block_Communality Mean_Redundancy       AVE
## PC1       Exogenous 0.000000         0.6717535       0.0000000 0.6717535
## PC2       Exogenous 0.000000         0.6693056       0.0000000 0.6693056
## PC3       Exogenous 0.000000         0.6464324       0.0000000 0.6464324
## efficacy Endogenous 0.701656         0.7596008       0.5329785 0.7596008
foo_pls2$inner_summary
##                Type        R2 Block_Communality Mean_Redundancy       AVE
## PC1       Exogenous 0.0000000         0.7679859        0.000000 0.7679859
## PC2       Exogenous 0.0000000         0.7265386        0.000000 0.7265386
## PC3       Exogenous 0.0000000         0.7185128        0.000000 0.7185128
## efficacy Endogenous 0.7228401         0.7596590        0.549112 0.7596590
foo_pls3$inner_summary
##                Type        R2 Block_Communality Mean_Redundancy       AVE
## PC1       Exogenous 0.0000000         0.7679856       0.0000000 0.7679856
## PC2       Exogenous 0.0000000         0.7265386       0.0000000 0.7265386
## PC3       Exogenous 0.0000000         0.6464359       0.0000000 0.6464359
## efficacy Endogenous 0.7170512         0.7596614       0.5447161 0.7596614
foo_pls1$gof
## [1] 0.6941746
foo_pls2$gof
## [1] 0.7335616
foo_pls3$gof
## [1] 0.71952
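
For readability, these figures can be gathered in one small comparison table (a minimal sketch, assuming the three fitted models above are still in the workspace):

# collect the R2 of the efficacy block and the GoF of the three candidate models
comparison <- data.frame(
  model = c("pls1 (all genes)",
            "pls2 (without genefour, geneeighteen, genetwentyseven)",
            "pls3 (without genefour, geneeighteen)"),
  R2_efficacy = c(foo_pls1$inner_summary["efficacy", "R2"],
                  foo_pls2$inner_summary["efficacy", "R2"],
                  foo_pls3$inner_summary["efficacy", "R2"]),
  GoF = c(foo_pls1$gof, foo_pls2$gof, foo_pls3$gof)
)
comparison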

Conclusion: the explained variance (R²) is only slightly better with the variable selection, but we observe a clear gain in GoF by excluding these genes (from 0.69 for pls1 to 0.73 for pls2).

General Conclusion

In this chapter, we extracted information from the factor analysis and applied it to Partial Least Squares Path Modelling. The PLS-PM model allowed us to take multiple outcome variables into account and to examine the relationships between the latent variables and their indicators. Finally, we used the insight given by PLS-PM to perform a variable selection, which improved the goodness of fit.

(To be revised and completed later)

References