library(car)
## Loading required package: carData
library(ggplot2)
library(MASS)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
d <- read.csv('thrill.csv')
Columns in the set
names(d)
## [1] "Age" "Sex" "DM2" "HTN"
## [5] "CAD" "Dur_CKD" "eGFR" "Hypotension"
## [9] "Duration" "Calcificatio0" "Thrill"
Are all the variables of independent of each other? If not, we can exclude them from plotting. If two variables show good correlation, we can remove one of them at least! Let’s see correlation variable wise.
We scatterplot a few selected variables, columns 3 to 8
scatterplotMatrix(d[,3:8])
Now we focus on two particular variables which kind of show a linear trend in the scatter plot. Even their histogram looks similar
scatterplotMatrix(d[,3:4])
We scale down all variables to their standard deviations.
ds <- as.data.frame(scale(d[,1:10]))
We add the field ‘thrill’
ds$thrill <- d$Thrill
Now the data looks like
names(ds)
## [1] "Age" "Sex" "DM2" "HTN"
## [5] "CAD" "Dur_CKD" "eGFR" "Hypotension"
## [9] "Duration" "Calcificatio0" "thrill"
We can try a 3d plot with 3 variables, demarcated by two colors (‘thrill’ and ‘not thrill’). But that is as far as we can go with plotting.
ds_plot <- ds
ds_plot$thrill <- as.factor(ds$thrill)
plot_ly(ds_plot,x=~Dur_CKD,y=~eGFR, z=~Duration,color=~thrill)
## No trace type specified:
## Based on info supplied, a 'scatter3d' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#scatter3d
## No scatter3d mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
Variables which are well correlated between them can be ‘merged’ into one (a ‘principal component’), so that we can plot in lesser dimensions.
pca <- prcomp(ds)
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.883 1.4301 1.1655 0.9249 0.87643 0.74119 0.72315
## Proportion of Variance 0.348 0.2008 0.1334 0.0840 0.07542 0.05394 0.05135
## Cumulative Proportion 0.348 0.5488 0.6822 0.7662 0.84159 0.89553 0.94687
## PC8 PC9 PC10 PC11
## Standard deviation 0.52252 0.43476 0.28113 1.212e-16
## Proportion of Variance 0.02681 0.01856 0.00776 0.000e+00
## Cumulative Proportion 0.97368 0.99224 1.00000 1.000e+00
This returns 11 ‘principal components’ (new variables) to plot with. However, we can usually pick first 3 or 4, others show variation small enough to be ignored. We plot the principal components by their variance
screeplot(pca,type='lines',main="Principal components")
At the third component, the variance flattens out. Which means that we can safely use the first three PCAs for prediction. Let’s have a look at the first principal component
pca$rotation[,1]
## Age Sex DM2 HTN CAD
## 0.340109800 0.254447417 0.450861145 0.407931344 0.210877768
## Dur_CKD eGFR Hypotension Duration Calcificatio0
## 0.008236888 -0.123515079 -0.338620516 -0.259001825 0.450861145
## thrill
## 0.071296960
This lists the coefficients against each variable, which means the firss ‘principal component’ is a merged variable.
This function uses PCAs to predict outout of a column (i.e thrill)
lda <- lda(thrill ~ .,data=ds)
## Warning in lda.default(x, grouping, ...): variables are collinear
lda
## Call:
## lda(thrill ~ ., data = ds)
##
## Prior probabilities of groups:
## 0 1
## 0.2307692 0.7692308
##
## Group means:
## Age Sex DM2 HTN CAD Dur_CKD
## 0 -0.020703019 -0.7626739 -0.3699512 -0.25217814 -0.17225909 0.4296552
## 1 0.006210906 0.2288022 0.1109853 0.07565344 0.05167773 -0.1288965
## eGFR Hypotension Duration Calcificatio0
## 0 0.5590151 0.7655590 -0.0026259975 -0.3699512
## 1 -0.1677045 -0.2296677 0.0007877993 0.1109853
##
## Coefficients of linear discriminants:
## LD1
## Age -0.62154665
## Sex 0.95985708
## DM2 0.33737551
## HTN -0.47900922
## CAD 0.08510143
## Dur_CKD -0.38372768
## eGFR -0.08903322
## Hypotension -0.85682963
## Duration 0.54423338
## Calcificatio0 0.33737551
This lists the coefficients against each variable, used for discrimination. We now generate predictions against the same dataset and plot a discrimination histogram
lda_preds <- predict(lda, ds)
ldahist(data = lda_preds$x[,1], g=ds$thrill)
We have achieved some distinction between ‘thrill’ and ‘non thrill’