1. General Functions

Function to impute Numerical and Double values by Mean

impute_by_mean <- function(impute_data) {   
# Imputation of Numerical and Double columns using the mean of the column

for(i in 1:ncol(impute_data)){
  if(is.numeric(impute_data[[i]]) || is.double(impute_data[[i]]) ){
    na <- is.na(impute_data[[i]])
    impute_data[na, i] <- mean(impute_data[[i]], na.rm = TRUE) 
  }
  
}
  return (impute_data)
}

Function to impute Numerical and Double values by Mode

impute_by_mode <- function(impute_data) {   
# Imputation of Categorical variables using the mode of the column

for(i in 1:ncol(impute_data)){
  if(is.factor(impute_data[[i]]) ){
    na <- is.na(impute_data[[i]])
    impute_data[na, i] <-  i_mode(impute_data[[i]]) #check median as well
  }
}
  return (impute_data)
}

i_mode <- function(x) {                                     
  unique_x <- unique(x)
  mode <- unique_x[which.max(tabulate(match(x, unique_x)))]
  mode
}

# simple example of a plotting function
colorHist <- function(dvector, dcolor){
    hist(
        dvector,
        col = dcolor,
        border = 'white'
    )
}
#   inspect the function
colorHist

## function(dvector, dcolor){
##     hist(
##         dvector,
##         col = dcolor,
##         border = 'white'
##     )
## }

Introduction to Data set Introduction to Data set and Research Question

The data set is considered to answer the research question - ’Predict In-hospitality Mortality rate for the patients”

df <- read.csv("D:/MSc Brunel/CS5810 HPC/dataset.csv")
names(df)

##  [1] "encounter_id"                  "patient_id"                   
##  [3] "hospital_id"                   "age"                          
##  [5] "bmi"                           "elective_surgery"             
##  [7] "ethnicity"                     "gender"                       
##  [9] "height"                        "icu_admit_source"             
## [11] "icu_id"                        "icu_stay_type"                
## [13] "icu_type"                      "pre_icu_los_days"             
## [15] "weight"                        "apache_2_diagnosis"           
## [17] "apache_3j_diagnosis"           "apache_post_operative"        
## [19] "arf_apache"                    "gcs_eyes_apache"              
## [21] "gcs_motor_apache"              "gcs_unable_apache"            
## [23] "gcs_verbal_apache"             "heart_rate_apache"            
## [25] "intubated_apache"              "map_apache"                   
## [27] "resprate_apache"               "temp_apache"                  
## [29] "ventilated_apache"             "d1_diasbp_max"                
## [31] "d1_diasbp_min"                 "d1_diasbp_noninvasive_max"    
## [33] "d1_diasbp_noninvasive_min"     "d1_heartrate_max"             
## [35] "d1_heartrate_min"              "d1_mbp_max"                   
## [37] "d1_mbp_min"                    "d1_mbp_noninvasive_max"       
## [39] "d1_mbp_noninvasive_min"        "d1_resprate_max"              
## [41] "d1_resprate_min"               "d1_spo2_max"                  
## [43] "d1_spo2_min"                   "d1_sysbp_max"                 
## [45] "d1_sysbp_min"                  "d1_sysbp_noninvasive_max"     
## [47] "d1_sysbp_noninvasive_min"      "d1_temp_max"                  
## [49] "d1_temp_min"                   "h1_diasbp_max"                
## [51] "h1_diasbp_min"                 "h1_diasbp_noninvasive_max"    
## [53] "h1_diasbp_noninvasive_min"     "h1_heartrate_max"             
## [55] "h1_heartrate_min"              "h1_mbp_max"                   
## [57] "h1_mbp_min"                    "h1_mbp_noninvasive_max"       
## [59] "h1_mbp_noninvasive_min"        "h1_resprate_max"              
## [61] "h1_resprate_min"               "h1_spo2_max"                  
## [63] "h1_spo2_min"                   "h1_sysbp_max"                 
## [65] "h1_sysbp_min"                  "h1_sysbp_noninvasive_max"     
## [67] "h1_sysbp_noninvasive_min"      "d1_glucose_max"               
## [69] "d1_glucose_min"                "d1_potassium_max"             
## [71] "d1_potassium_min"              "apache_4a_hospital_death_prob"
## [73] "apache_4a_icu_death_prob"      "aids"                         
## [75] "cirrhosis"                     "diabetes_mellitus"            
## [77] "hepatic_failure"               "immunosuppression"            
## [79] "leukemia"                      "lymphoma"                     
## [81] "solid_tumor_with_metastasis"   "apache_3j_bodysystem"         
## [83] "apache_2_bodysystem"           "X"                            
## [85] "hospital_death"

We rename some of the Apache variables for a shorter and more sensible name

names(df)[10]="ICU.admis" 
names(df)[11]="ICU.id"
names(df)[12]="ICU.stay.type"
names(df)[13]="ICU.type"
names(df)[14]="Pre.ICU.loss.days"
names(df)[16]="Apa2.diagnosis"
names(df)[17]="Apa3.diagnosis"
names(df)[18]="Apa.post.op"
names(df)[19]="Arf.apache"
names(df)[20]="Eyes.apache"
names(df)[21]="Motor.apache"
names(df)[22]="Unable.apache"
names(df)[23]="Verbal.apache"
names(df)[24]="Heart_rate.apache"
names(df)[25]="Intubated.apache"
names(df)[26]="Map.apache"
names(df)[27]="Respi_rate.apache"
names(df)[28]="Temp.apache"
names(df)[29]="Ventilated.apache"

Data Cleaning and Preparation

2.1 Feature Selection - Subsetting by removal of irrelevant and/or redundant features

Some of the Id columns does not contribute to the analysis of research question - ’Predict In hospitality Mortality rate for the patients”. Similarly some of the columns are redundant. 1. Identified the set of Apache 3 Variables 2. Identified a set of other variables.(cirrhosis,leukemia,lymphoma,solid_tumor_with_metastasis etc) 3. Identified a set of demographic variable ( age, bmi, weight, height, ethnicity, gender etc) 4. Identifies some redundant variables(weight and bmi are redundant and bmi is calculated based on height and weight, apache_3j_diagnosis and apache_3j_bodysystem and apache_2_bodysystem is same as apache_2_diagnosis ,represent the same thing in different form) 5. Identified some irrelevant columns like id’s - and icu_admit_source, icu_stay_type

# Columns names that are either redundant or Not required/meaningless/empty for analysis
remove.col<- c("encounter_id", "patient_id", "hospital_id", "icu_admit_source", "icu_id", "icu_stay_type", "weight", "height", "X", "apache_3j_diagnosis", "apache_2_diagnosis")

#Prepare data with relevant columns 
initial.df <-df[, !(colnames(df) %in% remove.col)]
names(df)

Convert variables that sees to be categorical, which are defined otherwise

str(initial.df)
category.column <- c("elective_surgery", "ethnicity" , "gender","ICU.type",  "Apa.post.op" , "Arf.apache", "aids", "cirrhosis", "diabetes_mellitus", "hepatic_failure", "immunosuppression" ,"leukemia", "lymphoma", "solid_tumor_with_metastasis", "apache_3j_bodysystem", "hospital_death", "Eyes.apache", "Motor.apache", "Unable.apache", "Verbal.apache",
"Intubated.apache", "Ventilated.apache", "apache_2_bodysystem")

initial.df[ ,category.column] <- lapply(initial.df[,category.column] , factor)
str(initial.df)

2.2 Data Cleaning

Before moving on to determining the specifics of multiple imputation, explore and see the pattern of missing data in the dataset.

(colMeans(is.na(initial.df)))*100 #calculate the mean of each column and multiply by 100 for percentages

The percent of NA in the data set is not very alarming (<less than 5%) .However they are handled using mean and mode in next section

Imputation of the data sets - Using mean for ‘Numerical’ as well as ‘Double’ variables in the data set

initial.df <- impute_by_mean(initial.df)

Mode imputation for ‘Categorical’ variables in the data set

initial.df <- impute_by_mode(initial.df)
describe(initial.df)

Additional cleaning is required for some categorical and numerical variables when the data set was described ’

# 3 Distinct values are seen for 'Gender' variable
nrow(initial.df[initial.df$gender == '', ])

## [1] 25

# So replace them with mode
initial.df$gender[initial.df$gender == ''] <- 'M'

# Empty values are observed in 'ethnicity' variable
nrow(initial.df[initial.df$ethnicity == '', ])

## [1] 1395

# So replace them with mode
initial.df$ethnicity[initial.df$ethnicity == ''] <- 'Caucasian'

# there are some empty '' rows for 'apache_3j_bodysystem'
nrow(initial.df[initial.df$apache_3j_bodysystem == '', ])

## [1] 1662

# So replace them with mode
initial.df$apache_3j_bodysystem[initial.df$apache_3j_bodysystem == ''] <- 'Cardiovascular'

# there are some empty '' rows for 'apache_2_bodysystem'
nrow(initial.df[initial.df$apache_2_bodysystem == '', ])

## [1] 1662

# So replace them with mode
initial.df$apache_2_bodysystem[initial.df$apache_2_bodysystem == ''] <- 'Neurologic'

# Identified different spellings 'apache_2_bodysystem'
initial.df$apache_2_bodysystem[initial.df$apache_2_bodysystem == "Undefined diagnoses"] <- "Undefined Diagnoses" 

# there are some -1 rows in apache_4a_hospital_death_prob, so treating them as NA's
initial.df$apache_4a_hospital_death_prob[initial.df$apache_4a_hospital_death_prob == -1] <- mean(initial.df$apache_4a_hospital_death_prob)

# there are some -1 rows in pre_icu_los_days, so treating them as NA's
#initial.df$pre_icu_los_days[initial.df$pre_icu_los_days <= -1 ] <- mean(initial.df$pre_icu_los_days)

initial.df$pre_icu_los_days[initial.df$pre_icu_los_days < 0 ] # what to do with this variable? they are -ve and 640 observations?

## NULL

# does not seem to add to the research question? Drop this one from the df

initial.df <- initial.df[,-7]

Describe the final cleaned data set

describe(initial.df)

Exploratory data analysis - Statistically and Visually 3.1 Exploratory data analysis - Statistically

summary(initial.df)

There are 91713 observations and 74 variables in the ‘initial.df’ ( including all variables for demography, Apache3 and others co-variate)

The variables (both numerical and Categorical) are represented appropriately now, with plausible values and levels. However, variables like “pre_icu_los_days”, “d1_heartrate_min”, “d1_spo2_max”, “d1_spo2_min”, “h1_resprate_min”, “h1_spo2_max”,“h1_spo2_min” shows 0 value. - Report thses in some way Whether or not these variables can have 0 values, is a matter of SME.

3.2 Exploratory data analysis - Group Summaries

3.2.1 Aggregate mean for the apache3 based hospital_death_prob and other factors(hepatic_failure,immunosuppression,cancer..etc)

3.2.2 - In-hospital Deaths relationship with other Apache3 variables

3.2.3 Aggregate mean for the apache3 based hospital_death_prob with apache3 params

# Aggregate mean for the apache3 based hospital_death_prob and apache3 variables
in.hospital.death.prob <-
  aggregate(
    apache_4a_hospital_death_prob,
    list(
      GCs_Eyes = Eyes.apache,
      GCS_Motot = Motor.apache,
      GCS_Unable = Unable.apache ,
      GCS1_verbal = Verbal.apache,
      apache_grp = apache_3j_bodysystem,
      intube_Apache = Intubated.apache,
      venti_apache = Ventilated.apache,
      death = hospital_death
    ),
    mean
  )

3.2.4 - In-hospital Deaths relationship with other Apache3 variables

in.hospital.death.prob

in.hospital.death.prob <-  in.hospital.death.prob[in.hospital.death.prob$x > 0.5, ]

in.hospital.death.prob[in.hospital.death.prob$death ==1 ,]

 
in.hospital.death.prob[order(in.hospital.death.prob$x), ]# order to find max

The output of the aggregate shows that the in-hospital deaths are higher when apache3 parameters(intubated_apache,ventilated_apache,GCS param(excluding GCS1-Unable)) are true.

3.2.5 Correlation apache3 based hospital_death_prob with Apache3 D1 param

cor.test.columns <- c("Heart_rate.apache","Respi_rate.apache","Temp.apache", "d1_diasbp_max" , "d1_sysbp_noninvasive_max", "d1_sysbp_noninvasive_min", "d1_temp_max","d1_temp_min", "d1_resprate_min","d1_resprate_max","d1_spo2_max",
                     "d1_spo2_min","d1_sysbp_max", "d1_sysbp_min", "d1_heartrate_max","d1_heartrate_min", "d1_mbp_max", "d1_mbp_min",
                     "d1_mbp_noninvasive_max", "d1_diasbp_noninvasive_min","d1_diasbp_noninvasive_max","apache_4a_hospital_death_prob")

d1.cor.df <- initial.df[, cor.test.columns]
in.hospital.mortality.d1 <- cor(d1.cor.df[cor.test.columns])
in.hospital.mortality.d1


corrplot(in.hospital.mortality.d1)

From the above out put , the following apache3 D1 variables seems to impact the in-hospital deaths - d1_sysbp_noninvasive_min, d1_temp_min,d1_spo2_min,d1_sysbp_min,d1_mbp_min,d1_diasbp_noninvasive_min

3.2.6 Correlation apache3 based hospital_death_prob with Apache3 H1 param

cor.h1.columns <- c("h1_diasbp_max","h1_diasbp_min","h1_diasbp_noninvasive_max", "h1_diasbp_noninvasive_min" , "h1_heartrate_max", "h1_heartrate_min", "h1_mbp_max","h1_mbp_min", "h1_mbp_noninvasive_max","h1_mbp_noninvasive_min","h1_resprate_max",
                     "h1_resprate_min","h1_spo2_max", "h1_spo2_min", "h1_sysbp_max","h1_sysbp_min", "h1_sysbp_noninvasive_max", "h1_sysbp_noninvasive_min", "d1_glucose_max", "d1_glucose_min","d1_potassium_max","d1_potassium_min","apache_4a_hospital_death_prob")

h1.cor.df <- initial.df[, cor.h1.columns]
in.hospital.mortality.h1 <- cor(h1.cor.df[cor.h1.columns])
in.hospital.mortality.h1
corrplot(in.hospital.mortality.h1)

From the above out put , the following apache3 H1 variables seems to impact the in-hospital deaths - h1_diasbp_min , h1_diasbp_noninvasive_min, h1_mbp_min, h1_mbp_noninvasive_min, h1_resprate_max, h1_sysbp_min, h1_sysbp_noninvasive_min

The relationships between the selected variables will be explored visually next

3.2 Exploratory data analysis - Visually

Checking the normal distributions for the variables

par(mfrow=c(2,2))
hist(d1.cor.df)

hist(h1.cor.df)

# Relation of in-hospital deaths with demographic variable - age/bmi/ethnicity/gender

bmi.ethnic<-ggplot(initial.df, aes(x=ethnicity, y=bmi,fill=hospital_death)) + geom_boxplot()
bmi.ethnic

age.ethnic<-ggplot(initial.df, aes(x=ethnicity, y=age,fill=hospital_death)) + geom_boxplot()
age.ethnic

#Relation of in-hospital deaths with demographic varaible - age/bmi/ethicnicity/gender

bmi.gender <- ggplot(initial.df, aes(x=gender, y=bmi,fill=hospital_death)) + geom_boxplot()
bmi.gender

age.gender <- ggplot(initial.df, aes(x=gender, y=age,fill=hospital_death)) + geom_boxplot()
age.gender

# Relation of in-hospital deaths with elective surgery planned, icu_type and apache_post_operative

age.surgery <- ggplot(initial.df, aes(x=elective_surgery, y=age,fill=hospital_death)) + geom_boxplot() #Some outliers observered
age.surgery  #No much impact is seen on the elective surgery on in-hospital deaths

age.icu <- ggplot(initial.df, aes(x=ICU.type, y=age,fill=hospital_death)) + geom_boxplot() #Some outliers observered
age.icu # No much impact is seen on the icu type on in-hospital deaths

age.po <- ggplot(initial.df, aes(x=Apa.post.op, y=age,fill=hospital_death)) + geom_boxplot() #Some outlines observered
age.po #Post operative status for higher ages results in same number of in-hospital and non-hospital (NR)

prob.apache4<-ggplot(initial.df, aes(x=age, y=apache_4a_hospital_death_prob,fill=hospital_death)) + geom_boxplot()#Higher in-hospital death are observed with higher apache_4a_hospital_death_prob
prob.apache4

Relation with In-hospital deaths with GCS apache 3 param

p1<-ggplot(initial.df, aes(x=hospital_death, y=apache_4a_hospital_death_prob,fill=Eyes.apache)) + geom_boxplot()
p1 # the box plot indicate that higher in-hospital deaths occur when patients have gcs_eyes_apache=2

p2<-ggplot(initial.df, aes(x=hospital_death, y=apache_4a_hospital_death_prob,fill=Motor.apache)) + geom_boxplot()
p2 #that higher in-hospital deaths occur when patients have gcs_motor_apache=1

p3<-ggplot(initial.df, aes(x=hospital_death, y=apache_4a_hospital_death_prob,fill=Unable.apache)) + geom_boxplot()
p3#that Same in-hospital deaths occur when patients have gcs_unable_apache, No impact

p4<-ggplot(initial.df, aes(x=hospital_death, y=apache_4a_hospital_death_prob,fill=Verbal.apache)) + geom_boxplot()
p4#that higher in-hospital deaths occur when patients have gcs_verbal_apache=1

Conclusion : From the above EDA analysis, its concluded that the in-hospital mortality of patients may be determined by the following 24 variables : 1. age 2. gender 3. bmi 4. ethnicity 5.apache_4a_hospital_death_prob 6. Apache GCS varaites 7.heart_rate_apache 8.intubated_apache 9.temp_apache 10.resprate_apache 11.ventilated_apache 12. Apache D1 param : d1_sysbp_noninvasive_min, d1_temp_min,d1_spo2_min,d1_sysbp_min,d1_mbp_min,d1_diasbp_noninvasive_min 13. Apache H1 param : h1_diasbp_min , h1_diasbp_noninvasive_min, h1_mbp_min, h1_mbp_noninvasive_min, h1_resprate_max, h1_sysbp_min, h1_sysbp_noninvasive_min

# Create the data frame with final columns 

all.columns <-
  c(
    "age",
    "gender",
    "bmi",
    "ethnicity",
    "apache_4a_hospital_death_prob",
    "Eyes.apache",
    "motor.apache",
    "Verbal.apache",
    "Heart_rate.apache",
    "Intubated.apache",
    "Temp.apache",
    "Respi_rate.apache",
    "Ventilated.apache",
    "d1_sysbp_noninvasive_min",
    "d1_temp_min",
    "d1_spo2_min",
    "d1_sysbp_min",
    "d1_mbp_min",
    "d1_diasbp_noninvasive_min",
    "h1_diasbp_min",
    "h1_diasbp_noninvasive_min",
    "h1_mbp_min",
    "h1_mbp_noninvasive_min",
    "h1_resprate_max",
    "h1_sysbp_min",
    "h1_sysbp_noninvasive_min",
    "hospital_death",
    "apache_3j_bodysystem"
  )

#Prepare data with relevant columns 
final.df <-initial.df[ , names(initial.df) %in% all.columns]
names(final.df)

##  [1] "age"                           "bmi"                          
##  [3] "ethnicity"                     "gender"                       
##  [5] "Eyes.apache"                   "Verbal.apache"                
##  [7] "Heart_rate.apache"             "Intubated.apache"             
##  [9] "Respi_rate.apache"             "Temp.apache"                  
## [11] "Ventilated.apache"             "d1_diasbp_noninvasive_min"    
## [13] "d1_mbp_min"                    "d1_spo2_min"                  
## [15] "d1_sysbp_min"                  "d1_sysbp_noninvasive_min"     
## [17] "d1_temp_min"                   "h1_diasbp_min"                
## [19] "h1_diasbp_noninvasive_min"     "h1_mbp_min"                   
## [21] "h1_mbp_noninvasive_min"        "h1_resprate_max"              
## [23] "h1_sysbp_min"                  "h1_sysbp_noninvasive_min"     
## [25] "apache_4a_hospital_death_prob" "apache_3j_bodysystem"         
## [27] "hospital_death"

nrow(final.df)

## [1] 91713

Multiple Component Analysis

PCA does not performs well with categorical variables. Eliminating the factor levels for the data set would be an option, but it may detract value to our interpretation by removing variables that may have explanatory force. We will instead used fMDA, which it can be seen as a generalization of PCA when using a mix of categorical and numerical variables.

##Before running FAMD on the whole dataset, we reduce it size to a manageable level. 
set.seed(1000)
sample <- sample(c(TRUE,FALSE), nrow(final.df), replace=TRUE, prob = c(0.55,0.45))
data.for.fmda.055<- final.df[sample, ]
data.for.fmda.045<- final.df[!sample,]
final.df.fmda <- FAMD(data.for.fmda.055, graph=TRUE) # note that running this analysis takes R up tp 10' to complete

## Warning: ggrepel: 50555 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

## Warning: ggrepel: 28 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

## Warning: ggrepel: 50555 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

## Warning: ggrepel: 14 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

## Warning: ggrepel: 10 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

a <- fviz_eig(final.df.fmda,  
              choice='eigenvalue', 
              geom='line') 
b <- fviz_eig(final.df.fmda)

grid.arrange(a, b, ncol=2)

The scree plot on the left indicates that the first 3 principal dimensions accounts for more variance than each of the original variables. The second scree plot indicate that this three account for just around 28% of the variance (17, 7 and 4 percent). This suggest a complex dataset, maybe because their relationships are not linear or other variables that may explain better the variance, have not been included.

## Split quantitative and qualitative variables
split <- splitmix(final.df)

## FAMD
res.pcamix <- PCAmix(X.quanti=split$X.quanti,  
                     X.quali=split$X.quali, rename.level = TRUE)

## Plotting
plot(res.pcamix, choice="cor")

Interpretation of the correlation circle.

What is the square loading? As the name suggests, is the square of the factor loading of each variable. Factor loading describes the correlation between each variable and the princial component (or dimension) that contains it.

Why we need square loadings? For an individual variable, the square loading describes the proportion of variance captured by a particular PD. For each variable, the sum of its squared loading across all PDs equals to 1.

One way to depict this relationship is using correlation circles, which plot only continuous variables using their loadings for any given two PDs as coordinates.

In this case, dimension 1 and 2 (i.e., PD 1 and PD 2) together capture information contained in all 5 non invasive procedures variables, SP02 levels amongst others.

Note that the close to the circle the variables are the more important are to interpreting the PDs involved. For instance, the non invasive variables are close to the circle (1) while ‘age’ is much less important.

Additionally, variables in opposite sides are inversely correlated. For instance, it makes sense that non invasive procedures are inversely correlated to a higher hospital death probability: the less invasive the procedure the less likely to require longer hospital stays and maybe less hospital infection complications (or medical mistakes)

In addition, the square loading graph provides a more detailed view of the variables and its loading (from 0 to 1 = x^2 + y^2 =1 = circle )

Squared Loading Plot

The following plot allow us to visualize qualitative and quantitative variables together. Compared to the correlation circle, this graph would show just x and y axis.

## Plot
## Colour indiv obs by their squared loading
p <- fviz_famd_var(final.df.fmda, 'var', 
                   axes = c(1, 2),
                   col.var = 'cos2')

## Add the supplementary variable, Churn
fviz_add(p, final.df.fmda$var$coord, cex=0.2,
         col.var = 'cos2')

Interpretation All the non invasive variables are more correlated to PD1 than PD2 The rest of the variables (apache, prob, age, etc) are more correlated to PD2 than PD1

In addition, on each (almost a cluster!) group those variables further away from the origin have the highest square loading and are more important in explaining the variance of PD1 and PD2

Variable contribution

Note that while factor loading and squared loading explains how well a PD describes variation captures in a variable, CONTRIBUTION describes how much a variable account for the variation captured in a PD.

## Import libraries
library(FactoMineR)
library(factoextra)
library(gridExtra)


## Plots
a <- fviz_contrib(final.df.fmda, choice = "var", axes = 1)

b <- fviz_contrib(final.df.fmda, choice = "var", axes = 2)

grid.arrange(a, b, nrow=1)

#fviz_contrib(res.famd, choice = "var", axes = 1:2)

Clearly, the top contributors on each PD are: on PD1

Varimax Rotation

By applying what is called a varimax rotation, a small number of variables will be highly correlated to each PD.

split <- splitmix(final.df)
## FAMD
res.pcamix <- PCAmix(X.quanti=split$X.quanti,  
                     X.quali=split$X.quali, rename.level = TRUE)

## Apply varimax rotation to the first two PCs
res.pcarot <- PCArot(res.pcamix, dim=2,
                     graph=FALSE)

## Plot
plot(res.pcarot, choice="sqload", 
     coloring.var=TRUE, axes=c(1, 2))

rotated.values <- res.pcarot$quali
rotated.values

## $coord
##                         dim1.rot     dim2.rot
## ethnicity            0.022401147 0.0015226622
## gender               0.008106845 0.0005848549
## Eyes.apache          0.005162853 0.5961342275
## Verbal.apache        0.006373153 0.6188090057
## Intubated.apache     0.008123457 0.3973926479
## Ventilated.apache    0.013310539 0.5150740580
## apache_3j_bodysystem 0.099258634 0.0381583318
## hospital_death       0.026314008 0.2233951881

Clustering

## k means clustering

set.seed (234)

data.for.kmeans <- final.df[,-c(3,4,5,6,8,11,27,26)] #k means works for continuous variables, hence we discard the non numeric values first 
set.seed(236)

sample <- sample(c(TRUE,FALSE), nrow(data.for.kmeans), replace=TRUE, prob = c(0.6,0.4))# we reduce the size of our dataset to make calculations faster

reduced.data.for.k.means.06 <- data.for.kmeans[sample, ]
reduced.data.for.kmeans.04 <- data.for.kmeans[!sample,]

apache.k.means. <- kmeans(scale(reduced.data.for.k.means.06), centers = 3, nstart = 25 )



fviz_cluster(
  apache.k.means.,
  na.rm = TRUE,
  data = reduced.data.for.k.means.06,
  palette = c("#2E9FDF", "#00AFBB", "#E7B800"),
  geom = "point",
  ellipse.type = "convex",
  ggtheme = theme_bw()
)

print(apache.k.means.)

The graph shows that despite being able to separate the data in different clusters, our data is homogeneous. It does not seems to be a clear divide between clusters. This goes along with our FAMD (factor analysis of mixed data) where our combined first two principal components explained a relatively low percentage of the variances. This suggested that the data set is too complex, or that other variables with more explanatory variables are missing from our dataset.

However, we have a identified a set of variables (22 plus one target variable) inside our data frame that it can be used as an starting point to create supervised models.

"age",  "gender", "bmi", "ethnicity", "apache_4a_hospital_death_prob", "Eyes.apache", "motor.apache", "Verbal.apache", "Heart_rate.apache",  "Intubated.apache","Temp.apache", "Respi_rate.apache", "Ventilated.apache", "d1_temp_min",  "d1_spo2_min", "d1_sysbp_min", "d1_mbp_min", "h1_diasbp_min", "h1_mbp_min",  "h1_resprate_max", "h1_sysbp_min", "apache_3j_bodysystem", plus our dependant variable:"hospital_death"

What has changed? Based on our correlation graphs (code line 404) we discharged the variables that were highly correlated: in this case the non invasive variables. The goal here was to avoid multicollinearity: multicollinearity will produce high variance on the estimation of the individual coefficients that we are trying to predict, making the interpretation of our model,less accurate (and probably unrealistic). To further support our variable selection these had very little contribution to each dimension (code line 606): between 2.5% and 0.1% contribution (after Varimax rotation). We will however, further investigate multicollinearity in our model using VIF test.

From this point we can either construct binary regression models, neural networks or others.

CS5801 Coursework

Pooja Malhotra

2132489

2132489@brunel.ac.uk

Mario Ureta

2115480@brunel.ac.uk

Distributed Data Analysis

04 March, 2023

1. General Functions

2.2 Data Cleaning

3.2 Exploratory data analysis - Group Summaries