We will conduct an analysis of the UCI Machine Learning Repository heart data set, which is often used to demonstrate classification methods. The target variable is usually AtheroscleroticHDisease, which indicates the presence or absence of pathologies of the blood vessels that supply the heart muscle itself. We will do something slightly different here and demonstrate several unsupervised machine learning methods to perform a thorough exploratory analysis. Exploratory analysis is essential for any serious data analytic work: it develops an intuition about the data, identifies the most important independent variables and suggests the most appropriate confirmatory analyses and hypothesis tests.
Note, we defer printing the source code until the end of the document, except where we have modified the data; in those cases the code is shown so the reader can understand the actions in context.
We will proceed in two phases:

1. A classical exploratory analysis: data cleaning, missing-value treatment, and the distributions of and associations among the variables.
2. A more advanced exploration using unsupervised machine learning: principal components analysis, multiple correspondence analysis and distance-based clustering.
The data dictionary provides the following information:
| name | type | notes |
|---|---|---|
| AtheroscleroticHDisease | factor | Presence of heart disease |
| Age | integer | Age in years |
| Sex | factor | The two accepted levels at the time of data collection |
| ChestPain | factor | Presence/type of chest pain |
| RestBP | integer | Resting blood pressure mm Hg |
| Chol | integer | Serum cholesterol mg/dl |
| Fbs | factor | Fasting blood sugar |
| RestECG | factor | Resting electrocardiograph results |
| MaxHR | integer | Maximum heart rate during exercise |
| ExAng | factor | Exercise induced angina |
| Oldpeak | factor | ST depression exercise relative to rest |
| Slope | factor | Slope of peak ST exercise segment |
| Ca | small integer | Number of major vessels under fluoroscopy |
| Thal | factor | No description given |
The data is imported from a CSV file. There are 303 rows and 15 columns. Below are the head and the summary.
## # A tibble: 303 x 15
## X Age Sex ChestPain RestBP Chol Fbs RestECG MaxHR ExAng
## <int> <int> <int> <fct> <int> <int> <int> <int> <int> <int>
## 1 1 63 1 typical 145 233 1 2 150 0
## 2 2 67 1 asymptom… 160 286 0 2 108 1
## 3 3 67 1 asymptom… 120 229 0 2 129 1
## 4 4 37 1 nonangin… 130 250 0 0 187 0
## 5 5 41 0 nontypic… 130 204 0 2 172 0
## 6 6 56 1 nontypic… 120 236 0 0 178 0
## 7 7 62 0 asymptom… 140 268 0 2 160 0
## 8 8 57 0 asymptom… 120 354 0 0 163 1
## 9 9 63 1 asymptom… 130 254 0 2 147 0
## 10 10 53 1 asymptom… 140 203 1 2 155 1
## # … with 293 more rows, and 5 more variables: Oldpeak <dbl>, Slope <int>,
## # Ca <int>, Thal <fct>, AtheroscleroticHDisease <fct>
## X Age Sex ChestPain
## Min. : 1.0 Min. :29.00 Min. :0.0000 asymptomatic:144
## 1st Qu.: 76.5 1st Qu.:48.00 1st Qu.:0.0000 nonanginal : 86
## Median :152.0 Median :56.00 Median :1.0000 nontypical : 50
## Mean :152.0 Mean :54.44 Mean :0.6799 typical : 23
## 3rd Qu.:227.5 3rd Qu.:61.00 3rd Qu.:1.0000
## Max. :303.0 Max. :77.00 Max. :1.0000
##
## RestBP Chol Fbs RestECG
## Min. : 94.0 Min. :126.0 Min. :0.0000 Min. :0.0000
## 1st Qu.:120.0 1st Qu.:211.0 1st Qu.:0.0000 1st Qu.:0.0000
## Median :130.0 Median :241.0 Median :0.0000 Median :1.0000
## Mean :131.7 Mean :246.7 Mean :0.1485 Mean :0.9901
## 3rd Qu.:140.0 3rd Qu.:275.0 3rd Qu.:0.0000 3rd Qu.:2.0000
## Max. :200.0 Max. :564.0 Max. :1.0000 Max. :2.0000
##
## MaxHR ExAng Oldpeak Slope
## Min. : 71.0 Min. :0.0000 Min. :0.00 Min. :1.000
## 1st Qu.:133.5 1st Qu.:0.0000 1st Qu.:0.00 1st Qu.:1.000
## Median :153.0 Median :0.0000 Median :0.80 Median :2.000
## Mean :149.6 Mean :0.3267 Mean :1.04 Mean :1.601
## 3rd Qu.:166.0 3rd Qu.:1.0000 3rd Qu.:1.60 3rd Qu.:2.000
## Max. :202.0 Max. :1.0000 Max. :6.20 Max. :3.000
##
## Ca Thal AtheroscleroticHDisease
## Min. :0.0000 fixed : 18 No :164
## 1st Qu.:0.0000 normal :166 Yes:139
## Median :0.0000 reversable:117
## Mean :0.6722 NA's : 2
## 3rd Qu.:1.0000
## Max. :3.0000
## NA's :4
There are several obvious problems with the raw data set. X is an index column and should be removed. AtheroscleroticHDisease is too much to type, so we will shorten the name. Several variables are coded as numeric but are in fact factors; Sex, for example. The documentation here is informative and guides the following corrections.
# recoding
heart <- heart %>% select(-X) %>%
rename(HDisease = AtheroscleroticHDisease) %>%
mutate(Sex = factor(ifelse(Sex == 1, "M", "F"))
, Fbs = factor(ifelse(Fbs == 1, ">120", "<=120"))
, RestECG = factor(ifelse(RestECG == 0, "normal"
, "abnormal")) # there are only 4 valued at 1, we'll reduce to just two levels
, ExAng = factor(ifelse(ExAng == 1, "Yes", "No"))
, Slope = factor(ifelse(Slope == 1, "down", "level")) # there are few valued at 3, we'll reduce to just two levels
, Ca = factor(Ca, ordered = TRUE))
# identify numerical and categorical variables for later use
classes <- sapply(heart, function(x) class(x)[1])
num_vars <- names(classes)[classes %in% c("integer", "numeric")]
cat_vars <- names(classes)[!(classes %in% c("integer", "numeric"))]
Note, if we were doing any predictive modeling, we should consider performing a train/test split here and then building a pipeline for the following steps. This avoids leaking information from the test set into the training set, e.g. via missing value imputation. In this case, we will skip such a step for the sake of simplicity.
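For illustration only, a minimal sketch of what such a split could look like; createDataPartition() is from the caret package (loaded later for the confusion matrices), and the 80/20 proportion is an arbitrary choice:

library(caret)
# hypothetical stratified train/test split - not executed in this analysis
split_idx <- createDataPartition(heart$HDisease, p = 0.8, list = FALSE)
heart_train <- heart[split_idx, ]
heart_test <- heart[-split_idx, ]
# scaling and imputation parameters would then be estimated on heart_train only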
There is a very small number of missing values, which we would like to impute using row-wise information. The figure demonstrates that no instance is missing values in both of the variables involved. Non-parametric nearest-neighbours imputation is a reasonable choice, as it makes minimal distributional assumptions. This will be executed shortly.
Next, we show a kernel density plot of each numeric variable.
## Skew
## Age RestBP Chol MaxHR Oldpeak
## -0.2069951 0.6990596 1.1242853 -0.5321391 1.2571761
A visual inspection of the above plots reveals skew and non-normality in the Oldpeak and Chol variables. This is confirmed by a statistical test for skew, where values >1 are considered severe. The skew in the Chol variable appears to be caused by an outlier. Briefly researching this matter online, it is apparent that a cholesterol reading >200 is already considered extremely high, and the problematic reading in our data set is nearly 600. The Oldpeak variable, on the other hand, is systematically skewed. Power transformations are unsuitable for this variable because of the prevalence of zero values, so it will be left as is.
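To make the zero-prevalence point concrete (the quartiles above already show that at least a quarter of Oldpeak values are exactly zero), a quick check:

# proportion of exact zeros in Oldpeak; log and Box-Cox transforms
# require strictly positive input, so they are ruled out here
mean(heart$Oldpeak == 0)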
We will perform the following actions:

1. Set the outlying Chol reading to missing.
2. Rescale each numeric variable to the interval [0, 1].
3. Impute the missing values (the new Chol NA plus the original NAs in Ca and Thal) using kNN.
# modified min/max functions
minval <- function(x) min(x, na.rm = TRUE)
maxval <- function(x) max(x, na.rm = TRUE)
# change outlier to missing
heart$Chol[which.max(heart$Chol)] <- NA
# scale to [0, 1]; (x - min) / (max - min) also covers variables with a negative minimum
for (nv in num_vars) {
this_nv <- pull(heart, nv)
heart[[nv]] <- (this_nv - minval(this_nv)) /
(maxval(this_nv) - minval(this_nv))
}
# impute missing - VIM package. Median is the default function. Pick a moderately large k
heart <- as_tibble(kNN(heart, c("Chol", "Ca", "Thal"), imp_var = FALSE, k=11))
# for later use
heart_num <- select(heart, num_vars)
heart_cat <- select(heart, cat_vars)
## Recoded data set
## # A tibble: 303 x 14
## Age Sex ChestPain RestBP Chol Fbs RestECG MaxHR ExAng Oldpeak
## <dbl> <fct> <fct> <dbl> <dbl> <fct> <fct> <dbl> <fct> <dbl>
## 1 0.708 M typical 0.481 0.368 >120 abnorm… 0.603 No 0.371
## 2 0.792 M asymptom… 0.623 0.550 <=120 abnorm… 0.282 Yes 0.242
## 3 0.792 M asymptom… 0.245 0.354 <=120 abnorm… 0.443 Yes 0.419
## 4 0.167 M nonangin… 0.340 0.426 <=120 normal 0.885 No 0.565
## 5 0.25 F nontypic… 0.340 0.268 <=120 abnorm… 0.771 No 0.226
## 6 0.562 M nontypic… 0.245 0.378 <=120 normal 0.817 No 0.129
## 7 0.688 F asymptom… 0.434 0.488 <=120 abnorm… 0.679 No 0.581
## 8 0.583 F asymptom… 0.245 0.784 <=120 normal 0.702 Yes 0.0968
## 9 0.708 M asymptom… 0.340 0.440 <=120 abnorm… 0.580 No 0.226
## 10 0.5 M asymptom… 0.434 0.265 >120 abnorm… 0.641 Yes 0.5
## # … with 293 more rows, and 4 more variables: Slope <fct>, Ca <ord>,
## # Thal <fct>, HDisease <fct>
## Age Sex ChestPain RestBP
## Min. :0.0000 F: 97 asymptomatic:144 Min. :0.0000
## 1st Qu.:0.3958 M:206 nonanginal : 86 1st Qu.:0.2453
## Median :0.5625 nontypical : 50 Median :0.3396
## Mean :0.5300 typical : 23 Mean :0.3556
## 3rd Qu.:0.6667 3rd Qu.:0.4340
## Max. :1.0000 Max. :1.0000
## Chol Fbs RestECG MaxHR ExAng
## Min. :0.0000 <=120:258 abnormal:152 Min. :0.0000 No :204
## 1st Qu.:0.2921 >120 : 45 normal :151 1st Qu.:0.4771 Yes: 99
## Median :0.3952 Median :0.6260
## Mean :0.4112 Mean :0.6001
## 3rd Qu.:0.5103 3rd Qu.:0.7252
## Max. :1.0000 Max. :1.0000
## Oldpeak Slope Ca Thal HDisease
## Min. :0.0000 down :142 0:180 fixed : 18 No :164
## 1st Qu.:0.0000 level:161 1: 65 normal :167 Yes:139
## Median :0.1290 2: 38 reversable:118
## Mean :0.1677 3: 20
## 3rd Qu.:0.2581
## Max. :1.0000
We would now like to determine which variables might be good predictors of atherosclerotic heart disease and also look for collinearity or correlation among the predictors. A simple bivariate correlation analysis is followed by a scatterplot matrix conditioned on each level of HDisease.
Careful inspection of the splom shows a clear elevation of Oldpeak in the HDisease = Yes group. There is a potentially interesting interaction between Age and MaxHR, the pair with the strongest correlation: MaxHR is more strongly correlated with Age in the HDisease = No group, whereas in the HDisease = Yes group MaxHR is somewhat lower at every Age. The HDisease = Yes group is also drawn from a narrower Age band, suggesting that a particular age range could carry higher risk. The remaining continuous variables appear less informative.
We now generate a boxplot for each of the five continuous variables, conditioned on each of the factor variables. The figures that follow need to be assessed row by row.
A visual inspection of the boxplots indicates that there may be some interaction between ChestPain and the three informative variables identified previously (Age, MaxHR and Oldpeak). The same is true of ExAng (exercise induced angina), Slope, Ca and Thal. We will not perform any statistical tests at this stage: fishing for p-values with so many tests requires careful attention to the false discovery rate, as well as assumption checking and diagnostics for an unwieldy number of variable combinations. It is sufficient to develop an intuition about the data and its interactions.
Distributions among categorical variables can be assessed according to counts and proportions. The following Fourfold and Mosaic visualisations implicitly perform significance tests by means of shading residuals. Fourfold plots are suited to visualising pairs of binary variables, while mosaics can handle factors with more levels.
We can see that the presence of atherosclerotic heart disease is significantly associated with males, while a third-order interaction exists such that those who are generally free from chest pain but suffer from exercise-induced angina are at significantly elevated risk of atherosclerotic heart disease.
In this four-way analysis, we can see that downward slope is most associated with the HDisease = No group and level slope with HDisease = Yes. Similarly, one or more arteries visible under fluoroscopy (Ca) are most strongly associated with HDisease = Yes. A third-order interaction indicates that a combination of level Slope and reversable Thal is very strongly associated with HDisease = Yes, with fixed Thal also being strongly indicative. Mosaics like this may look daunting at first, but quickly become intuitive with a little practice. They visualise a recursive partition. The tile areas are proportional to the count at each intersection and are shaded blue/red for positive/negative association compared to the independence assumption (i.e. compared to a non-significant \(\chi^2\) test).
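As a minimal illustration of what the shading encodes, we can inspect the standardised Pearson residuals behind a single pair of variables (a sketch only; the mosaic itself is built from the full four-way table):

# the tile shading reflects residuals from an independence model, e.g. for one pair:
tab <- with(heart, table(HDisease, Slope))
chisq.test(tab)$stdres # large positive/negative residuals drive the strong blue/red tiles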
We continue with a more advanced exploration of this data set using various clustering techniques.
The first method we will explore is principal components analysis (PCA) on the scaled continuous variables. PCA identifies orthogonal projections of multivariate data that capture most of the variation in the first components.
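Formally, for a column-centred data matrix \(X\), the first loading vector solves \(w_1 = \arg\max_{\lVert w \rVert = 1} \operatorname{Var}(Xw)\), and each subsequent loading solves the same problem subject to orthogonality with its predecessors.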
## Standard deviations (1, .., p=5):
## [1] 0.2423081 0.1803313 0.1635028 0.1514241 0.1241092
##
## Rotation (n x k) = (5 x 5):
## PC1 PC2 PC3 PC4 PC5
## Age -0.6025078 0.3040866 -0.4588691 0.2202864 0.5342553
## RestBP -0.3156248 0.4112466 0.4659294 0.5758516 -0.4272727
## Chol -0.1891233 0.6063421 0.1498822 -0.7479366 -0.1212758
## MaxHR 0.4950871 0.3663388 0.4205950 0.1508718 0.6488623
## Oldpeak -0.5064450 -0.4863668 0.6107243 -0.1941582 0.3102890
## Importance of components:
## PC1 PC2 PC3 PC4 PC5
## Standard deviation 0.2423 0.1803 0.1635 0.1514 0.12411
## Proportion of Variance 0.3756 0.2081 0.1710 0.1467 0.09855
## Cumulative Proportion 0.3756 0.5837 0.7548 0.9014 1.00000
We can see from the output summary that Age and Oldpeak are loaded onto PC1 in the opposite direction to MaxHR, indicating a negative correlation. The other PCs can generally be interpreted in terms of further relationships among these variables, some of which we have already seen in the bivariate correlation analysis. It seems from the cumulative variance measure and the scree plot that projecting these five numeric features onto less interpretable principal components does not offer any obvious gain; it still takes four components to capture most of the variance, nearly as many as the raw variables. Nevertheless, we can make a biplot to better understand the multivariate relationships.
Multiple correspondence analysis (MCA) is a dimension reduction and clustering method for categorical counts. The relationships among several or many categorical variables can be mapped in two dimensions, giving fairly intuitive results. The most important benefit is that categorical variables are implicitly recoded into non-arbitrary numeric values. This makes them available to any method that only accepts real-valued inputs, such as distance-based methods like k-means or hierarchical clustering.
##
## Principal inertias (eigenvalues):
##
## dim value % cum% scree plot
## 1 0.059785 83.5 83.5 ************************
## 2 0.000902 1.3 84.8
## 3 0.000424 0.6 85.4
## 4 0.000309 0.4 85.8
## 5 2.0e-05 0.0 85.8
## 6 1.1e-05 0.0 85.9
...
The results of this analysis are indeed rather interesting and confirm the findings of the initial exploratory analysis. The plot can be interpreted by identifying attributes that have moved in the same direction from the origin (0, 0), with particular interest in those that have clustered close together. We can see the specific values of ChestPain, Slope, Ca, ExAng and Sex that are associated with the presence or absence of heart disease. We also have confirmation that Fbs is much less correlated with HDisease, and RestECG only moderately so. If we were going on to do predictive modeling, this analysis would provide a strong case for excluding them. In fact, we will re-run this analysis accordingly.
##
## Principal inertias (eigenvalues):
##
## dim value % cum% scree plot
## 1 0.101162 88.3 88.3 *************************
## 2 0.000779 0.7 88.9
## 3 0.000492 0.4 89.4
## 4 0.000210 0.2 89.6
## 5 1.9e-05 0.0 89.6
## -------- -----
## Total: 0.114614
...
We are not displaying the arrows here because the labels are rather crowded and hard enough to read without the visual clutter. We can see that the clusters are very pronounced, and we can identify a handful of less informative attributes. For example, Thal=fixed and ChestPain=typical represent small minorities in the distributions of their respective variables.
We can augment this analysis by discretizing the numeric variables and including them. This is done by binning each variable at its median into low and high levels.
hilo <- function(x, ind) {
cut(x, breaks = quantile(pull(heart, ind), probs = c(0, 0.5, 1)),
labels = c("lo", "hi"),
include.lowest = TRUE # otherwise the minimum value itself becomes NA
)
}
binned <- as_tibble(sapply(num_vars
, function(ind) sapply(select(heart, ind)
, hilo
, ind = ind)))
names(binned) <- num_vars
heart_bins <- bind_cols(heart_cat, binned)
##
## Principal inertias (eigenvalues):
##
## dim value % cum% scree plot
## 1 0.055812 74.5 74.5 ***********************
## 2 0.001959 2.6 77.1 *
## 3 0.000991 1.3 78.4
## 4 0.000532 0.7 79.2
## 5 0.000155 0.2 79.4
## 6 5.0e-05 0.1 79.4
## 7 2.1e-05 0.0 79.5
## -------- -----
## Total: 0.074905
##
...
This plot takes some time to interpret, but a careful look further confirms the results of our previous analyses. The lowest MaxHR values are associated with heart disease and the highest values with absence of disease. Increasing Age is generally associated with presence of heart disease. Increasing Chol is also associated, but only very moderately - the attribute levels are not widely separated on the horizontal dimension. The same is true for RestBP. Let us re-run this once more, excluding these two less correlated variables. We also exclude the HDisease label, as we do not want it to influence the resulting values.
##
## Principal inertias (eigenvalues):
##
## dim value % cum% scree plot
## 1 0.062239 70.4 70.4 ************************
## 2 0.001719 1.9 72.4 *
## 3 0.000888 1.0 73.4
## 4 0.000390 0.4 73.8
## 5 0.000266 0.3 74.1
## 6 6.5e-05 0.1 74.2
## 7 2.5e-05 0.0 74.2
## -------- -----
## Total: 0.088394
...
## $rows
## Dim1 Dim2
## <NA> NA NA
##
## $cols
## Dim1 Dim2
## Sex:F 0.19462044 0.10607208
## Sex:M -0.07452537 -0.05075994
## ChestPain:asymptomatic -0.27815173 -0.01145707
## ChestPain:nonanginal 0.25163360 0.03656958
## ChestPain:nontypical 0.45002573 -0.05274889
## ChestPain:typical 0.01031257 0.04094717
## ExAng:No 0.20571452 0.01902297
...
We can finish with this multiple correspondence analysis, which produces two very well differentiated clusters of attributes. This result gives us plenty of reason to believe that we could infer a diagnostic model from the data set. Because nearly all the inertia (variance) is captured on the first dimension, we can use the one-dimensional coordinate values as numerical proxies for the categorical attributes in the analyses that follow.
# res is the plot object created by (res <- plot(...))
# the last 6 rows are the lo/hi levels of the three continuous vars; we just want to convert the categorical levels to real values.
coords <- data.frame(res$cols, heart_mca$factors)[-(20:25), ]
# We don't need the HDisease feature as we will see how well the clusters match these labels.
# We'll add a column of missing values for each of the other categorical variables
heart_real <- select(heart_num, -Chol, -RestBP)
for(cv in names(select(heart_cat, -HDisease))) { # Fbs and RestECG were removed already
heart_real <- mutate(heart_real, !!cv := NA_real_) # numeric NA, ready to receive the Dim1 values
}
# Now we'll insert the value from Dim 1 for each level of each factor
for(i in 1:nrow(coords)) {
fac <- as.character(coords$factor[i])
lev <- as.character(coords$level[i])
heart_real <- mutate(heart_real, !!fac := ifelse(pull(heart, fac) == lev, coords$Dim1[i], pull(heart_real, fac)))
}
Now that we have selected our variables of interest and converted categorical values to non-arbitrary real values, we can continue with distance-based clustering. The following visualisation is a map of all the pairwise Euclidean distances between the points.
This visualisation provides a useful confirmation of the clustering tendency in this data. We can also compute the Hopkins statistic H, which tests the null hypothesis that the data come from a uniform distribution and is distributed as \(H \sim \mathit{Beta}(n, n)\), where n is the number of samples used to calculate the statistic. A value of n equal to 5-10% of the data set is recommended.
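In one common formulation (consistent with the lower-tail test below), \(H = \frac{\sum_i w_i^d}{\sum_i u_i^d + \sum_i w_i^d}\) in \(d\) dimensions, where the \(u_i\) are nearest-neighbour distances from uniformly simulated points to the data and the \(w_i\) are nearest-neighbour distances from sampled data points to the rest of the data; clustered data make the \(w_i\) small, pushing H towards the lower tail.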
n <- nrow(heart_real)/10
H <- hopkins(heart_real, n = n, header = TRUE)$H
H < qbeta(0.05, n, n) # significant result if TRUE
## [1] TRUE
K-medoids searches for k archetypal or representative instances, called medoids, that act as the cluster centres. Each non-medoid instance is assigned to its nearest medoid. The algorithm proceeds by swapping medoid and non-medoid points, accepting any swap that decreases the sum of a dissimilarity function. K-medoids is less sensitive to outliers than the classic K-means method, so it is often favoured. Another reason to prefer K-medoids is that the number k can be estimated using the silhouette method. Given what we already know about how these variables associate with heart disease, we could assume two clusters is the correct number, but it is still worth checking by visual inspection.
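For reference, the silhouette width of instance \(i\) is \(s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}\), where \(a(i)\) is the mean dissimilarity of \(i\) to the other members of its own cluster and \(b(i)\) is the mean dissimilarity to the members of the nearest other cluster; the candidate k with the largest average silhouette width is preferred.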
With confidence we can run the clustering with k=2:
## Age MaxHR Oldpeak Sex ChestPain ExAng
## [1,] 0.5416667 0.4656489 0.1935484 -0.07452537 -0.2781517 -0.3827575
## [2,] 0.4375000 0.7022901 0.0000000 -0.07452537 0.2516336 0.2057145
## Slope Ca Thal
## [1,] -0.2444348 -0.1640730 -0.2839171
## [2,] 0.3125888 0.1899808 0.2750060
##
## No Yes
## 1 25 114
## 2 139 25
Recall that HDisease was included neither in the real-valued data conversion nor in the clustering process. Cross-tabulating the cluster assignments against the labels from the original data set shows a “confusion matrix-like” result, indicating that the two clusters have captured the association with heart disease among the variables of interest. The cluster labels are arbitrary in this unsupervised learning process, so we can assign the appropriate label to each cluster id and get a full suite of diagnostics:
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 139 25
## Yes 25 114
##
## Accuracy : 0.835
## 95% CI : (0.7883, 0.875)
## No Information Rate : 0.5413
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6677
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8476
## Specificity : 0.8201
## Pos Pred Value : 0.8476
## Neg Pred Value : 0.8201
## Prevalence : 0.5413
## Detection Rate : 0.4587
## Detection Prevalence : 0.5413
## Balanced Accuracy : 0.8339
##
## 'Positive' Class : No
##
We can see that if we assigned new, unlabelled points to their nearest cluster medoid, we might expect an accuracy of around 0.83.
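A minimal sketch of that assignment rule; assign_to_medoid is a hypothetical helper, and it assumes a new observation has already been recoded and scaled into the same space as heart_real:

# hypothetical nearest-medoid assignment for a new, preprocessed observation
assign_to_medoid <- function(x, medoids) {
distances <- apply(medoids, 1, function(m) sqrt(sum((x - m)^2))) # Euclidean distance to each medoid
which.min(distances) # id of the nearest medoid, i.e. the assigned cluster
}
# e.g. assign_to_medoid(unlist(heart_real[1, ]), pam2_clus$medoids)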
We can do something similar with hierarchical clustering. After some experimentation with different linkage methods, “complete” linkage was chosen, as it produced two distinct clusters with similar numbers of members.
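A sketch of the kind of comparison involved (the set of linkage methods shown is illustrative):

# compare the two-cluster split sizes across linkage methods;
# per the text, "complete" produced the most evenly sized, distinct clusters
for (m in c("single", "average", "ward.D2", "complete")) {
print(table(cutree(hclust(dist(heart_real), method = m), k = 2)))
}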
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 122 18
## Yes 42 121
##
## Accuracy : 0.802
## 95% CI : (0.7526, 0.8454)
## No Information Rate : 0.5413
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6064
##
## Mcnemar's Test P-Value : 0.002985
##
## Sensitivity : 0.7439
## Specificity : 0.8705
## Pos Pred Value : 0.8714
## Neg Pred Value : 0.7423
## Prevalence : 0.5413
## Detection Rate : 0.4026
## Detection Prevalence : 0.4620
## Balanced Accuracy : 0.8072
##
## 'Positive' Class : No
##
Again, we can speculate on the accuracy we would get when predicting with cluster membership alone. This yields a slightly lower overall accuracy but an increased specificity. Since the 'positive' class here is No, specificity is the rate at which diseased patients are correctly identified, so this method might be more suitable when designing a diagnostic tool.
We can plot dendrograms of hierarchical clusters and colour the leaves by any factor variable. This provides a visual inspection of how the factor is distributed between the clusters, starting with the HDisease variable that we might want to classify at some future time. Again, recall that this “target variable” was not used by the clustering algorithm, yet we can see that it is well separated into the two clusters. Note, the colours are arbitrary; what is interesting is how well they separate into prevalent groupings between the two clusters.
We have performed a thorough exploratory analysis of the UCI heart data and used unsupervised machine learning methods to provide a visual intuition of important structure in the data. This enabled us to reduce the dimension of the problem by removing non-informative features. Finally, we were able to engineer a single, highly informative cluster-membership feature.
Here you can find the source code.
library(knitr)
library(tibble)
library(readr)
library(dplyr)
library(lattice)
library(ggplot2)
library(corrplot)
library(vcd)
library(psych)
library(car)
library(VIM)
library(ca)
library(factoextra)
library(cluster)
library(clustertend)
library(caret)
library(ape)
opts_chunk$set(warning = FALSE
, message = FALSE
, echo = FALSE
)
opts_template$set(
fig.wide = list(fig.height = 4.5, fig.width = 8, fig.align='center')
, fig.wideX = list(fig.height = 3, fig.width = 9, fig.align='center')
, fig.relaxed = list(fig.height = 6, fig.width = 8, fig.align='center')
, fig.hugetile = list(fig.height = 7, fig.width = 7, fig.align='center')
, fig.bigtile = list(fig.height = 5, fig.width = 5, fig.align='center')
, fig.tile = list(fig.height = 3, fig.width = 3, fig.align='center')
)
hook_output <- knit_hooks$get("output")
knit_hooks$set(output = function(x, options) {
lines <- options$output.lines
if (is.null(lines)) {
return(hook_output(x, options)) # pass to default hook
}
x <- unlist(strsplit(x, "\n"))
more <- "..."
if (length(lines)==1) { # first n lines
if (length(x) > lines) {
# truncate the output, but add ....
x <- c(head(x, lines), more)
}
} else {
x <- c(more, x[lines], more)
}
# paste these lines together
x <- paste(c(x, ""), collapse = "\n")
hook_output(x, options)
})
par(mar = c(4,3,3,1))
set.seed(142136)
# graphing themes
source("HeartTheme.R")
# load data
heart <- as_tibble(read.csv("Heart.csv"))
variables_df <- data.frame(name = c("AtheroscleroticHDisease"
, "Age", "Sex", "ChestPain"
, "RestBP", "Chol", "Fbs"
, "RestECG", "MaxHR"
, "ExAng", "Oldpeak"
, "Slope", "Ca", "Thal")
, type = c("factor", "integer"
, "factor", "factor"
, "integer", "integer"
, "factor", "factor"
, "integer", "factor"
, "factor", "factor"
, "small integer", "factor")
, notes = c("Presence of heart disease"
, "Age in years", "The two accepted levels at the time of data collection"
, "Presence/type of chest pain"
, "Resting blood pressure mm Hg"
, "Serum cholesterol mg/dl"
, "Fasting blood sugar"
, "Resting electrocardiograph results"
, "Maximum heart rate during exercise"
, "Exercise induced angina"
, "ST depression exercise relative to rest"
, "Slope of peak ST exercise segment"
, "Number of major vessels under fluoroscopy"
, "No description given"
))
kable(variables_df)
heart
summary(heart)
# recoding
heart <- heart %>% select(-X) %>%
rename(HDisease = AtheroscleroticHDisease) %>%
mutate(Sex = factor(ifelse(Sex == 1, "M", "F"))
, Fbs = factor(ifelse(Fbs == 1, ">120", "<=120"))
, RestECG = factor(ifelse(RestECG == 0, "normal"
, "abnormal")) # there are only 4 valued at 1, we'll reduce to just two levels
, ExAng = factor(ifelse(ExAng == 1, "Yes", "No"))
, Slope = factor(ifelse(Slope == 1, "down", "level")) # there are few valued at 3, we'll reduce to just two levels
, Ca = factor(Ca, ordered = TRUE))
# identify numerical and categorical variables for later use
classes <- sapply(heart, function(x) class(x)[1])
num_vars <- names(classes)[classes %in% c("integer", "numeric")]
cat_vars <- names(classes)[!(classes %in% c("integer", "numeric"))]
image(is.na(select(heart, Ca, Thal))
, main = "Missing Values"
, xlab = ""
, ylab = "Ca, Thal"
, xaxt = "n"
, yaxt = "n"
, bty = "n"
, col = c(myPal[4]
, myPalDark[5])
)
for (nv in num_vars) {
densityplot(~pull(heart, nv)
, xlab = nv
, par.settings = myLatticeTheme) %>%
print()
}
cat("Skew")
sapply(num_vars, function(ind) skew(pull(heart, ind)))
# modified min/max functions
minval <- function(x) min(x, na.rm = TRUE)
maxval <- function(x) max(x, na.rm = TRUE)
# change outlier to missing
heart$Chol[which.max(heart$Chol)] <- NA
# scale to [0, 1]; (x - min) / (max - min) also covers variables with a negative minimum
for (nv in num_vars) {
this_nv <- pull(heart, nv)
heart[[nv]] <- (this_nv - minval(this_nv)) /
(maxval(this_nv) - minval(this_nv))
}
# impute missing - VIM package. Median is the default function. Pick a moderately large k
heart <- as_tibble(kNN(heart, c("Chol", "Ca", "Thal"), imp_var = FALSE, k=11))
# for later use
heart_num <- select(heart, num_vars)
heart_cat <- select(heart, cat_vars)
cat("Recoded data set")
heart
summary(heart)
corrplot(cor(heart_num), order="AOE", type="upper"
, col = myPal.rangeDiv(10)
, tl.pos="d", tl.cex = 1, tl.col = myPalDark[1]
, method = "number", number.cex = 1.5)
corrplot(cor(heart_num), add=TRUE, type = "lower"
, col = myPal.rangeDiv(10)
, method = "ellipse", order = "AOE", diag = FALSE
, tl.pos="n", cl.pos = "n")
trel <- myLatticeTheme
trel$plot.symbol$col <- myPal[1]
splom(~heart_num | HDisease
, data = heart
, diag.panel = function(x, ...){
yrng <- current.panel.limits()$ylim
d <- density(x, na.rm=TRUE)
d$y <- with(d, yrng[1] + 0.95 * diff(yrng) * y / max(y) )
panel.lines(d)
diag.panel.splom(x, ...)
}
, panel = function(x, y, ...){
panel.xyplot(x, y, ..., alpha = 0.4)
panel.loess(x, y, ..., col = myPalDark[2], lwd = 3)
}
, main = "Scatterplot Matrix by HDisease Group"
, xlab = ""
, layout = c(2, 1)
, pscales = 0
, par.settings = trel)
cols <- myPal.rangeDiv(5)
names(cols) <- num_vars
myBoxPlots <- function(nv, cv) {
fmla <- as.formula(paste(nv, " ~ ", cv))
boxplot(fmla, data = heart, col = cols[nv])
}
par(mfrow=c(1,5))
for (cv in cat_vars){
for (nv in num_vars) {
myBoxPlots(nv, cv)
}
}
par(mfrow=c(1,1))
fourfold(with(heart, table(HDisease, Sex))
, color = myFourFold)
fourfold(with(heart, table(HDisease, ExAng, ChestPain))
, color = myFourFold)
mosaic(with(heart, table(HDisease, Slope, Ca, Thal))
, gp = shading_hsv
, gp_args = list(h = myShading["h", ]
, s = c(0.75, 0)
, v = 1
, lty = 1:2
, line_col = c(myPalDark[5], myPalDark[4])
)
)
heart_pca <- prcomp(heart_num, scale. = FALSE)
heart_pca
summary(heart_pca)
plot(heart_pca, main = "scree plot for heart pca")
biplot(heart_pca
, xlim = c(-0.2, 0.2), ylim = c(-0.2, 0.2)
, col = c(myPal[2], myPalDark[5]), cex=c(0.5, 1.5))
heart_mca <- mjca(heart_cat)
summary(heart_mca)
plot(heart_mca, map = "symbiplot"
, arrows = c(FALSE, TRUE)
, mass = c(FALSE, TRUE)
, contrib = c("none", "relative")
, col = c(myPalDark[5], myPalDark[4]))
heart_cat <- select(heart_cat, -Fbs, -RestECG)
heart_mca <- mjca(heart_cat)
summary(heart_mca)
plot(heart_mca, map = "symbiplot"
, mass = c(FALSE, TRUE)
, contrib = c("none", "relative")
, col = c(myPalDark[5], myPalDark[4]))
hilo <- function(x, ind) {
cut(x, breaks = quantile(pull(heart, ind), probs = c(0, 0.5, 1)),
labels = c("lo", "hi"),
include.lowest = TRUE # otherwise the minimum value itself becomes NA
)
}
binned <- as_tibble(sapply(num_vars
, function(ind) sapply(select(heart, ind)
, hilo
, ind = ind)))
names(binned) <- num_vars
heart_bins <- bind_cols(heart_cat, binned)
heart_mca <- mjca(heart_bins)
summary(heart_mca)
plot(heart_mca, map = "symmetric"
, mass = c(FALSE, TRUE)
, contrib = c("none", "relative")
, col = c(myPalDark[5], myPalDark[4]))
heart_mca <- mjca(select(heart_bins, -Chol, -RestBP, -HDisease))
summary(heart_mca)
(res <- plot(heart_mca, map = "symmetric"
, mass = c(FALSE, TRUE)
, contrib = c("none", "relative")
, col = c(myPalDark[5], myPalDark[4])))
# res is the plot object created by (res <- plot(...))
# the last 6 rows are the lo/hi levels of the three continuous vars; we just want to convert the categorical levels to real values.
coords <- data.frame(res$cols, heart_mca$factors)[-(20:25), ]
# We don't need the HDisease feature as we will see how well the clusters match these labels.
# We'll add a column of missing values for each of the other categorical variables
heart_real <- select(heart_num, -Chol, -RestBP)
for(cv in names(select(heart_cat, -HDisease))) { # Fbs and RestECG were removed already
heart_real <- mutate(heart_real, !!cv := NA_real_) # numeric NA, ready to receive the Dim1 values
}
# Now we'll insert the value from Dim 1 for each level of each factor
for(i in 1:nrow(coords)) {
fac <- as.character(coords$factor[i])
lev <- as.character(coords$level[i])
heart_real <- mutate(heart_real, !!fac := ifelse(pull(heart, fac) == lev, coords$Dim1[i], pull(heart_real, fac)))
}
distance <- get_dist(heart_real)
myPalDiv <- myPal.rangeDiv(3)
fviz_dist(distance, show_labels = FALSE
, gradient = list(low = myPalDiv[1], mid = myPalDiv[2], high = myPalDiv[3]))
n <- nrow(heart_real)/10
H <- hopkins(heart_real, n = n, header = TRUE)$H
H < qbeta(0.05, n, n) # significant result if TRUE
fviz_nbclust(heart_real, pam, method = "silhouette"
, k.max = 5, linecolor = myPalDark[4]) +
myGgTheme()
pam2_clus <- pam(heart_real, k = 2, diss = FALSE, metric = "euclidean")
pam2_clus$medoids # inspect the medoid coordinates in the recoded space
table(pam2_clus$cluster, heart$HDisease)
cm <- confusionMatrix(factor(ifelse(pam2_clus$cluster == 1, "Yes", "No")), heart$HDisease)
cm
fviz_cluster(pam2_clus, heart_real
, ellipse.type = "convex"
, geom = "point"
, palette = c(myPalDark[4], myPalDark[5])) +
myGgTheme()
hclus <- hclust(dist(heart_real)
, method="complete")
num_clus <- 2
cm <- confusionMatrix(factor(ifelse(cutree(hclus, num_clus) == 1, "Yes", "No")), heart$HDisease)
cm
hclusPlot <- function(x, n, df, colourBy) {
labelColours <- myPalContrasts[factor(df[[colourBy]])]
# phylogenetic tree plot
plot(as.phylo(x)
, direction = "downwards"
, tip.color = labelColours
, main = paste(x$method, "method\nleaf colours by" , colourBy))
rect.hclust(x, k=n, border = "red")
}
for (nm in names(heart_real)[!names(heart_real) %in% c("Age", "MaxHR")]) {
hclusPlot(hclus, num_clus, heart, nm)
}