Loading required libraries and data

library(car)
## Loading required package: carData
library(ggplot2)
library(MASS)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
d <- read.csv('thrill.csv')

Columns in the dataset

names(d)
##  [1] "Age"           "Sex"           "DM2"           "HTN"          
##  [5] "CAD"           "Dur_CKD"       "eGFR"          "Hypotension"  
##  [9] "Duration"      "Calcificatio0" "Thrill"

Is there redundancy?

Are all the variables independent of each other? If not, we can exclude some of them from plotting. If two variables show a strong correlation, we can remove at least one of them! Let's look at the correlations variable by variable.
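As a quick numeric check (a sketch, not part of the original analysis), we can compute the pairwise correlation matrix of the first 10 columns, assuming they are all numeric as the later call to scale() implies; the 0.8 cut-off used to flag strongly correlated pairs is arbitrary.

# Pairwise correlations of the predictor columns (assumed numeric)
cm <- cor(d[, 1:10], use = "pairwise.complete.obs")
round(cm, 2)

# Flag pairs whose absolute correlation exceeds an arbitrary 0.8 cut-off
high <- which(abs(cm) > 0.8 & upper.tri(cm), arr.ind = TRUE)
data.frame(var1 = rownames(cm)[high[, 1]],
           var2 = colnames(cm)[high[, 2]],
           r    = round(cm[high], 2))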

We draw a scatterplot matrix of a few selected variables, columns 3 to 8

scatterplotMatrix(d[,3:8])

Now we focus on two particular variables that show something like a linear trend in the scatterplot matrix. Even their histograms look similar

scatterplotMatrix(d[,3:4])

Standardising and adding a prediction column

We standardise all numeric variables: each column is centred to mean zero and scaled by its standard deviation.

ds <- as.data.frame(scale(d[,1:10]))
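As a sanity check (not part of the original analysis), every scaled column should now have a mean of roughly 0 and a standard deviation of 1:

# Each standardised column should have mean ~0 and sd ~1
round(colMeans(ds), 3)
round(apply(ds, 2, sd), 3)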

We add back the outcome field 'thrill' from the original data, kept unscaled

ds$thrill <- d$Thrill

Now the data looks like this

names(ds)
##  [1] "Age"           "Sex"           "DM2"           "HTN"          
##  [5] "CAD"           "Dur_CKD"       "eGFR"          "Hypotension"  
##  [9] "Duration"      "Calcificatio0" "thrill"

We can try a 3D plot with three variables, coloured by the two outcome groups ('thrill' and 'no thrill'). But that is about as far as we can go with plotting.

ds_plot <- ds
ds_plot$thrill <- as.factor(ds$thrill)
plot_ly(ds_plot,x=~Dur_CKD,y=~eGFR, z=~Duration,color=~thrill)
## No trace type specified:
##   Based on info supplied, a 'scatter3d' trace seems appropriate.
##   Read more about this trace type -> https://plot.ly/r/reference/#scatter3d
## No scatter3d mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
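As a static alternative (a sketch, not in the original analysis), the same three standardised variables can be viewed two at a time with ggplot2, which is already loaded; here the third variable is mapped to point size.

# 2D view: Dur_CKD vs eGFR, coloured by thrill, point size by Duration
ggplot(ds_plot, aes(x = Dur_CKD, y = eGFR, colour = thrill, size = Duration)) +
  geom_point(alpha = 0.7)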

Finding principal components

Variables that are well correlated with each other can be 'merged' into one (a 'principal component'), so that we can work in fewer dimensions.

pca <- prcomp(ds)
summary(pca)
## Importance of components:
##                          PC1    PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     1.883 1.4301 1.1655 0.9249 0.87643 0.74119 0.72315
## Proportion of Variance 0.348 0.2008 0.1334 0.0840 0.07542 0.05394 0.05135
## Cumulative Proportion  0.348 0.5488 0.6822 0.7662 0.84159 0.89553 0.94687
##                            PC8     PC9    PC10      PC11
## Standard deviation     0.52252 0.43476 0.28113 1.212e-16
## Proportion of Variance 0.02681 0.01856 0.00776 0.000e+00
## Cumulative Proportion  0.97368 0.99224 1.00000 1.000e+00

This returns 11 'principal components' (new variables) to work with. However, we can usually keep just the first 3 or 4; the remaining ones carry so little variance that they can be ignored. We plot the principal components by their variance

screeplot(pca,type='lines',main="Principal components")
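As a numeric complement to the scree plot (a sketch only), we can also read off the cumulative proportion of variance from the component standard deviations and count how many components reach a chosen threshold; the 75% figure below is arbitrary.

# Cumulative proportion of variance explained by the components
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
round(cum_var, 3)

# Smallest number of components reaching an arbitrary 75% of total variance
which(cum_var >= 0.75)[1]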

At the third component, the variance flattens out, which means we can safely use the first three principal components for prediction. Let's have a look at the first principal component

pca$rotation[,1]
##           Age           Sex           DM2           HTN           CAD 
##   0.340109800   0.254447417   0.450861145   0.407931344   0.210877768 
##       Dur_CKD          eGFR   Hypotension      Duration Calcificatio0 
##   0.008236888  -0.123515079  -0.338620516  -0.259001825   0.450861145 
##        thrill 
##   0.071296960

This lists the coefficient (loading) for each variable, showing that the first 'principal component' is a weighted combination ('merge') of all of them.
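To make the 'merged variable' idea concrete, each patient's PC1 score is just this weighted sum of the (centred) columns; a minimal sketch, using only the objects already created above.

# PC1 score for each patient = weighted sum of the centred variables
pc1_manual <- scale(ds, center = pca$center, scale = FALSE) %*% pca$rotation[, 1]

# This should match the scores prcomp() has already computed in pca$x
all.equal(as.numeric(pc1_manual), as.numeric(pca$x[, 1]))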

Linear discriminant analysis

This function finds a linear combination of the variables (a 'linear discriminant', similar in spirit to a principal component) that best separates the groups of the outcome column (i.e. thrill)

lda <- lda(thrill ~ .,data=ds)
## Warning in lda.default(x, grouping, ...): variables are collinear
lda
## Call:
## lda(thrill ~ ., data = ds)
## 
## Prior probabilities of groups:
##         0         1 
## 0.2307692 0.7692308 
## 
## Group means:
##            Age        Sex        DM2         HTN         CAD    Dur_CKD
## 0 -0.020703019 -0.7626739 -0.3699512 -0.25217814 -0.17225909  0.4296552
## 1  0.006210906  0.2288022  0.1109853  0.07565344  0.05167773 -0.1288965
##         eGFR Hypotension      Duration Calcificatio0
## 0  0.5590151   0.7655590 -0.0026259975    -0.3699512
## 1 -0.1677045  -0.2296677  0.0007877993     0.1109853
## 
## Coefficients of linear discriminants:
##                       LD1
## Age           -0.62154665
## Sex            0.95985708
## DM2            0.33737551
## HTN           -0.47900922
## CAD            0.08510143
## Dur_CKD       -0.38372768
## eGFR          -0.08903322
## Hypotension   -0.85682963
## Duration       0.54423338
## Calcificatio0  0.33737551

This lists the coefficient for each variable, used for discrimination. We now generate predictions on the same dataset and plot a histogram of the discriminant scores for each group

lda_preds <- predict(lda, ds)
ldahist(data = lda_preds$x[,1], g=ds$thrill)

We have achieved some separation between the 'thrill' and 'no thrill' groups.
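To put a rough number on that separation (a sketch, evaluated on the same data we trained on, so it is an optimistic in-sample check), we can cross-tabulate the LDA class predictions against the observed thrill values.

# In-sample confusion matrix: predicted class vs observed thrill
table(Predicted = lda_preds$class, Observed = ds$thrill)

# Overall in-sample accuracy
mean(lda_preds$class == ds$thrill)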