library(skimr)
library(tidyverse)
library(caret) # For featureplot, classification report
library(corrplot) # For correlation matrix and PCA contributionplots
library(AppliedPredictiveModeling)
library(mice) # For data imputation
library(VIM) # For missing data visualization
library(gridExtra) # For grid plots
library(pROC) # For AUC calculations
library(ROCR) # For ROC and AUC plots
library(dendextend) # For dendrograms
library(factoextra) # For PCA plots
library(e1071) # For SVM
This dataset is composed of real patient responses to two questionnaires related to ADHD and Mood Disorder, along with a variety of demographic, abuse and drug use variables. For each questionnaire, the responses to individual questions are provided along with total scores. Links to the actual questions are provided below:
The first part of this work will make use of unsupervised learning techniques such as Principal Component Analysis (PCA) and clustering in an attempt to discover structures in the data. The second part will explore support vector machines in a supervised learning exercise to predict whether an individual has attempted suicide.
The dataset is composed of 54 variables and 175 observations. The data is coded as numeric, and 33 observations have some level of missing data. A summary of the variable distributions is provided below:
adhd_raw <- read.csv('https://raw.githubusercontent.com/maelillien/data622/main/hw4/adhd_data.csv', header = TRUE)
adhd <- adhd_raw
head(adhd)
adhd[adhd==''] <- NA
skim(adhd %>% select(-c(ADHD.Q1:ADHD.Q18, MD.Q1a:MD.Q3)))
| Name | adhd %>% select(-c(ADHD.Q… |
| Number of rows | 175 |
| Number of columns | 21 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 20 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Initial | 0 | 1 | 2 | 3 | 0 | 109 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Age | 0 | 1.00 | 39.47 | 11.17 | 18 | 29.5 | 42 | 48.0 | 69 | ▆▅▇▅▁ |
| Sex | 0 | 1.00 | 1.43 | 0.50 | 1 | 1.0 | 1 | 2.0 | 2 | ▇▁▁▁▆ |
| Race | 0 | 1.00 | 1.64 | 0.69 | 1 | 1.0 | 2 | 2.0 | 6 | ▇▁▁▁▁ |
| ADHD.Total | 0 | 1.00 | 34.32 | 16.70 | 0 | 21.0 | 33 | 47.5 | 72 | ▃▆▇▆▂ |
| MD.TOTAL | 0 | 1.00 | 10.02 | 4.81 | 0 | 6.5 | 11 | 14.0 | 17 | ▃▃▆▇▇ |
| Alcohol | 4 | 0.98 | 1.35 | 1.39 | 0 | 0.0 | 1 | 3.0 | 3 | ▇▂▁▁▆ |
| THC | 4 | 0.98 | 0.81 | 1.27 | 0 | 0.0 | 0 | 1.5 | 3 | ▇▁▁▁▃ |
| Cocaine | 4 | 0.98 | 1.09 | 1.39 | 0 | 0.0 | 0 | 3.0 | 3 | ▇▁▁▁▅ |
| Stimulants | 4 | 0.98 | 0.12 | 0.53 | 0 | 0.0 | 0 | 0.0 | 3 | ▇▁▁▁▁ |
| Sedative.hypnotics | 4 | 0.98 | 0.12 | 0.54 | 0 | 0.0 | 0 | 0.0 | 3 | ▇▁▁▁▁ |
| Opioids | 4 | 0.98 | 0.39 | 0.99 | 0 | 0.0 | 0 | 0.0 | 3 | ▇▁▁▁▁ |
| Court.order | 5 | 0.97 | 0.09 | 0.28 | 0 | 0.0 | 0 | 0.0 | 1 | ▇▁▁▁▁ |
| Education | 9 | 0.95 | 11.90 | 2.17 | 6 | 11.0 | 12 | 13.0 | 19 | ▁▅▇▂▁ |
| Hx.of.Violence | 11 | 0.94 | 0.24 | 0.43 | 0 | 0.0 | 0 | 0.0 | 1 | ▇▁▁▁▂ |
| Disorderly.Conduct | 11 | 0.94 | 0.73 | 0.45 | 0 | 0.0 | 1 | 1.0 | 1 | ▃▁▁▁▇ |
| Suicide | 13 | 0.93 | 0.30 | 0.46 | 0 | 0.0 | 0 | 1.0 | 1 | ▇▁▁▁▃ |
| Abuse | 14 | 0.92 | 1.33 | 2.12 | 0 | 0.0 | 0 | 2.0 | 7 | ▇▂▁▁▁ |
| Non.subst.Dx | 22 | 0.87 | 0.44 | 0.68 | 0 | 0.0 | 0 | 1.0 | 2 | ▇▁▃▁▁ |
| Subst.Dx | 23 | 0.87 | 1.14 | 0.93 | 0 | 0.0 | 1 | 2.0 | 3 | ▆▇▁▅▂ |
| Psych.meds. | 118 | 0.33 | 0.96 | 0.80 | 0 | 0.0 | 1 | 2.0 | 2 | ▇▁▇▁▆ |
The dataset is modified to include an EducationLevel categorical variable derived from the numerical Education variable, which represents years of schooling. The Abuse column is unfolded into 3 binary variables indicating the occurrence of the 3 types of abuse. The original Abuse variable is dropped.
We work with multiple subsets of the data in subsequent parts of this report. Some analyses make use of the entire set of questionnaire responses while others use only the total scores.
# Renaming variables
adhd <- rename(adhd, ADHDTotal = ADHD.Total, MDTotal = MD.TOTAL, Sedatives = Sedative.hypnotics, CourtOrder = Court.order,
Violence = Hx.of.Violence, Conduct = Disorderly.Conduct, NonSubstDX = Non.subst.Dx, SubstDX = Subst.Dx, PsychMeds = Psych.meds.)
# Drop Initial column
adhd <- adhd %>% select(-c(Initial))
# Shift sex variable to 0 and 1
adhd$Sex <- adhd$Sex-1
# Re-coding Education Variable, College = Education > 12, HS = Education > 8 & <= 12, MS = Education <= 8
adhd$EducationLevel <- ifelse(adhd$Education <= 8, 1, ifelse(adhd$Education <= 12 & adhd$Education > 8, 2, ifelse(adhd$Education > 12, 3, 99)))
# Creating new variables based on type of abuse
adhd$AbuseP <- as.numeric(adhd$Abuse == 1 | adhd$Abuse == 4 | adhd$Abuse == 5 | adhd$Abuse == 7)
adhd$AbuseS <- as.numeric(adhd$Abuse == 2 | adhd$Abuse == 4 | adhd$Abuse == 6 | adhd$Abuse == 7)
adhd$AbuseE <- as.numeric(adhd$Abuse == 3 | adhd$Abuse == 5 | adhd$Abuse == 6 | adhd$Abuse == 7)
adhd <- adhd %>% select(-c(Abuse))
# Forming data subsets: full set of variables or reduced (totals only)
adhd.full <- adhd
adhd.red <- adhd %>% select(c(Age, Sex, Race, ADHDTotal, MDTotal, Alcohol:AbuseE))
# Set all characters to numeric
adhd.red <- mutate_all(adhd.red, function(x) as.numeric(as.character(x)))
adhd.full <- mutate_all(adhd.full, function(x) as.numeric(as.character(x)))
The dataset contains a few missing values. The PsychMeds variable mostly contained missing values and was dropped entirely. A few observations were quite sparse and only contained the basic demographic and questionnaire score columns. To avoid biasing the dataset with imputed values, we preferred to drop all observations with missing values. The resulting dataset contains 33 fewer observations, with 142 complete rows and 19 columns.
mice_plot <- aggr(adhd.red, col=c('#F8766D','#00BFC4'), numbers=TRUE, sortVars=TRUE, labels=names(adhd.red), cex.axis=.7, gap=3, ylab=c("Missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## PsychMeds 0.67428571
## SubstDX 0.13142857
## NonSubstDX 0.12571429
## AbuseP 0.08000000
## AbuseS 0.08000000
## AbuseE 0.08000000
## Suicide 0.07428571
## Violence 0.06285714
## Conduct 0.06285714
## Education 0.05142857
## EducationLevel 0.05142857
## CourtOrder 0.02857143
## Alcohol 0.02285714
## THC 0.02285714
## Cocaine 0.02285714
## Stimulants 0.02285714
## Sedatives 0.02285714
## Opioids 0.02285714
## Age 0.00000000
## Sex 0.00000000
## Race 0.00000000
## ADHDTotal 0.00000000
## MDTotal 0.00000000
# Discard PsychMeds
adhd.red.complete <- adhd.red %>% select(-c(PsychMeds))
adhd.full.complete <- adhd.full %>% select(-c(PsychMeds))
# Keep only complete cases
adhd.red.complete <- adhd.red.complete[complete.cases(adhd.red.complete),]
adhd.full.complete <- adhd.full.complete[complete.cases(adhd.full.complete),]
The plots below explore the distributions of subsets of the data.
The distributions of the responses to the ADHD questionnaire are displayed below. Some variables are unimodal, others bimodal; some are centered and others skewed. Reference should be made to the survey questions in order to gain deeper insight into these distributions.
For the Mood Disorder questionnaire, the majority of the binary responses were 1. The classes are fairly balanced except for a few questions. Q2 is particularly unbalanced, and the distribution of the responses to Q3 is highly skewed toward the left tail.
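As a hedged sketch, the histogram grid described above could be produced along the following lines (the original plotting code is not shown, so the bin count and layout are assumptions); the MD questions can be plotted the same way by selecting MD.Q1a:MD.Q3.
# Hedged sketch: faceted histograms of the ADHD question responses
adhd.full %>%
  select(ADHD.Q1:ADHD.Q18) %>%
  gather(question, response) %>%        # long format: one row per answer
  ggplot(aes(response)) +
  geom_histogram(bins = 5) +            # responses sit on a small ordinal scale
  facet_wrap(~ question, scales = "free_y") +
  theme_minimal()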
A few observations on the binary and ordinal variables:
Clustering refers to a broad set of techniques for finding subgroups, or clusters, in a dataset. We seek to partition observations into distinct groups so that the observations within each group are quite similar to each other, while observations in different groups are quite different from each other. The most popular clustering approaches are K-means and Hierarchical Clustering (HC). While the former requires a pre-specified number of clusters k, the latter does not. HC is a bottom-up or agglomerative clustering approach which results in an upside-down tree representation, built from the leaves and combined into clusters up to the trunk. Clusters are identified by horizontal cuts across the dendrogram.
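As a minimal sketch of these ideas on synthetic data (all names below are illustrative), K-means needs k up front, while an hclust tree can be cut either into a fixed number of clusters k or at a fixed dissimilarity height h via cutree:
# Minimal sketch on synthetic data: K-means vs hierarchical clustering
set.seed(42)
toy <- matrix(rnorm(60), ncol = 2)            # 30 illustrative observations
kmeans(toy, centers = 3)$cluster              # K-means requires a pre-specified k
hc_toy <- hclust(dist(toy), method = "complete")
cutree(hc_toy, k = 3)                         # cut into a fixed number of clusters
cutree(hc_toy, h = 2)                         # or cut at a fixed dissimilarity height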
In this section, we explore the use of Hierarchical Clustering on two portions of the data. The first uses only the questionnaire responses to ADHD while the second uses the total questionnaire scores for both surveys as well as the other variables (demographic, drugs, abuse, etc). The latter is referred to as the ‘reduced’ dataset.
Clustering typically requires the variables to be scaled in order to avoid giving more weight to variables measured on a larger range of values. However, when all the variables under consideration are measured on the same scale, which is the case when only comparing survey responses, it can be appropriate to leave the variables unscaled.
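For reference, a quick sketch of what scale does to the reduced dataset used later (each column is standardized to mean 0 and standard deviation 1 before distances are computed):
# Sketch: scale() standardizes each column to mean 0 and sd 1
zs <- scale(adhd.red.complete)
round(colMeans(zs)[1:5], 10)       # column means are ~0 after scaling
round(apply(zs, 2, sd)[1:5], 10)   # column standard deviations are 1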
With HC, the concept of dissimilarity between a pair of observations needs to be extended to a pair of groups of observations. This extension is achieved with the notion of linkage, which defines the dissimilarity between two groups of observations. The resulting dendrogram heavily depends on the choice of linkage. The most popular linkages are complete and average because they tend to result in more balanced clusters.
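One informal way to compare linkage choices before committing to one is the agglomerative coefficient; the sketch below assumes the cluster package (not loaded above) is available:
# Hedged sketch: agglomerative coefficient per linkage (closer to 1 = stronger structure)
library(cluster)
linkages <- c("average", "single", "complete", "ward")
sapply(linkages, function(m) agnes(scale(adhd.red.complete), method = m)$ac)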
Using only the individual unscaled responses to the ADHD Questionnaire, we obtain the following dendrogram structure using complete linkage. In this case, complete linkage provided the best balancing and a cutoff into 3 clusters looked appropriate. In order to gain insight into these clusters, we need to look at the distribution of the variables within each of them.
For these 3 clusters, we can make the following observations:
We can establish a ranking for this clustering based on the monotonic rise in meanADHD and meanMD across clusters. From least to most severe: Cluster 3, Cluster 2, Cluster 1. With this information, a treatment plan could be developed to focus on Clusters 2 and 1, with the latter requiring special attention.
hc0 <- adhd.full %>% select(ADHD.Q1:ADHD.Q18) %>% dist(method = "euclidean") %>% hclust(method = "complete")
dend0 <- hc0 %>% as.dendrogram
sub_grp0 <- cutree(dend0, k = 3, order_clusters_as_data = TRUE)
dend0 %>% set("branches_k_color", k = 3) %>% set("labels", "") %>% plot(main = "Hierarchical Clustering ADHD Questionnaire")
adhd.full %>%
mutate(cluster = sub_grp0) %>%
group_by(cluster) %>%
summarise(meanAge = mean(Age), meanMD = mean(MDTotal), meanADHD = mean(ADHDTotal), count = n()) %>%
gather(var,value,meanAge:count) %>%
ggplot(aes(cluster,value,fill=cluster)) +
geom_col() + facet_grid(var ~ ., scales="free_y") +
geom_text(aes(label=round(value,1)), vjust=1.6, color="white", size=3.5) +
ggtitle('ADHD Questionnaire Cluster Distribution') +
theme_minimal()
For a contrasting analysis, we looked at clustering based on the dataset which included only the total questionnaire scores and dropped observations with missing values. The variables were scaled to balance out the contribution of the high values for scores and age. We explore the use of complete, average and Ward linkages.
Unlike the previous clustering, adding the other variables results in a less balanced dendrogram. In this case, most clusters contain only a few observations and two large clusters contain the majority of the data. We can make a few additional observations:
Based on the variable distributions, it might be tempting to group Clusters 4 and 5 together since they represent a cohort with similar meanAge, meanMD and meanADHD scores. However, hierarchical clustering considers more variables than the aforementioned few, and this kind of grouping would violate the measure of similarity as determined by the y axis of the dendrogram.
A proposed ADHD and Mood Disorder ranking in 4 levels from least severe to most severe is: Cluster 2; Clusters 1, 4 & 5; Clusters 3 (older) & 6 (younger).
hc1 <- adhd.red.complete %>% scale %>% dist(method = "euclidean") %>% hclust(method = "complete")
dend1 <- hc1 %>% as.dendrogram
sub_grp1 <- cutree(dend1, k = 6)
dend1 %>% set("branches_k_color", k = 6) %>% set("labels", "") %>% plot(main = "Hierarchical Clustering Reduced Dataset + Complete Linkage")
adhd.red.complete %>%
mutate(cluster = sub_grp1) %>%
group_by(cluster) %>%
summarise(meanAge = mean(Age), meanMD = mean(MDTotal), meanADHD = mean(ADHDTotal), count = n()) %>%
gather(var,value,meanAge:count) %>%
ggplot(aes(cluster,value,fill=cluster)) +
geom_col() + facet_grid(var ~ ., scales="free_y") +
geom_text(aes(label=round(value,1)), vjust=1.6, color="white", size=3.5) +
ggtitle('Reduced Dataset + Complete Linkage Cluster Distribution') +
theme_minimal()
Average linkage completely changes the dendrogram representation. A cut at k = 5 seemed appropriate. Similarly to complete linkage, one large cluster is identified, this time containing the vast majority of the observations. This obscures the analysis beyond what we can establish about the small clusters.
hc2 <- adhd.red.complete %>% scale %>% dist(method = "euclidean") %>% hclust(method = 'average')
dend2 <- hc2 %>% as.dendrogram
sub_grp2 <- cutree(dend2, k = 5)
dend2 %>% set("branches_k_color", k = 5) %>% set("labels", "") %>% plot(main = "Hierarchical Clustering Reduced Dataset + Average Linkage")
adhd.red.complete %>%
mutate(cluster = sub_grp2) %>%
group_by(cluster) %>%
summarise(meanAge = mean(Age), meanMD = mean(MDTotal), meanADHD = mean(ADHDTotal), count = n()) %>%
gather(var,value,meanAge:count) %>%
ggplot(aes(cluster,value,fill=cluster)) +
geom_col() + facet_grid(var ~ ., scales="free_y") +
geom_text(aes(label=round(value,1)), vjust=1.6, color="white", size=3.5) +
ggtitle('Reduced Dataset + Average Linkage Cluster Distribution') +
theme_minimal()
While average linkage is a popular option, the resulting clustering above was somewhat unattractive. Here we consider another linkage method to expand the analysis. Ward linkage makes use of Ward’s minimum variance criterion, which seeks to minimize the total within-cluster variance. The resulting dendrogram is visually attractive and more balanced than before. A cutoff at k = 3 clusters seems appropriate. This grouping is more similar to the one that relied only on the ADHD questionnaire responses. Here, Cluster 3 is the oldest group but also has the lowest meanADHD and meanMD scores. Clusters 1 and 2 are less obvious to differentiate since Cluster 2 has the highest meanADHD score but a lower meanMD score than Cluster 1.
hc3 <- adhd.red.complete %>% scale %>% dist(method = "euclidean") %>% hclust(method = 'ward.D')
dend3 <- hc3 %>% as.dendrogram
sub_grp3 <- cutree(dend3, k = 3)
dend3 %>% set("branches_k_color", k = 3) %>% set("labels", "") %>% plot(main = "Hierarchical Clustering Reduced Dataset + Ward Linkage")
adhd.red.complete %>%
mutate(cluster = sub_grp3) %>%
group_by(cluster) %>%
summarise(meanAge = mean(Age), meanMD = mean(MDTotal), meanADHD = mean(ADHDTotal), count = n()) %>%
gather(var,value,meanAge:count) %>%
ggplot(aes(cluster,value,fill=cluster)) +
geom_col() + facet_grid(var ~ ., scales="free_y") +
geom_text(aes(label=round(value,1)), vjust=1.6, color="white", size=3.5) +
ggtitle('Reduced Dataset + Ward Linkage Cluster Distribution') +
theme_minimal()
We use a tanglegram to obtain a side-by-side comparison of the clusters obtained using the different linkage methods. Here we only compare the complete and average linkage clusterings, which are similar in terms of cluster imbalance. The first thing to notice is the consistency of groupings at the higher ends of the hierarchies. This is shown by the nearly horizontal ribbons linking the two clusterings, indicating that the selected observations end up in the corresponding cluster representation. Naturally, a large number of observations find correspondence in the opposing largest clusters.
dl <- dendlist(
dend1 %>%
set("labels_col", k=6) %>%
set("branches_lty", 1) %>%
set("branches_k_color", k = 6),
dend2 %>%
set("labels_col", k=5) %>%
set("branches_lty", 1) %>%
set("branches_k_color", k = 5)
)
# Plot them together
tanglegram(dl,
common_subtrees_color_lines = TRUE, highlight_distinct_edges = TRUE, highlight_branches_lwd=FALSE,
margin_inner=7,
lwd=2,
show_labels = FALSE
)
title("Complete Linkage vs Average")The comparison of complete linkage against ward linkage is interesting since the latter contains only half the number of clusters. It might be worth studying how the patients in the middle cluster of the ward representation (lowest ADHD and MD) find correspondence in the complete linkage clusters. Similarly, we could study in more detail how individuals in Clusters 3 and 6 (using complete linkage) which were small but extreme clusters end up grouped in the ward representation which had less defined levels of severity in terms of ADHD and MD.
dl <- dendlist(
dend1 %>%
set("labels_col", k=6) %>%
set("branches_lty", 1) %>%
set("branches_k_color", k = 6),
dend3 %>%
set("labels_col", k=3) %>%
set("branches_lty", 1) %>%
set("branches_k_color", k = 3)
)
# Plot them together
tanglegram(dl,
common_subtrees_color_lines = TRUE, highlight_distinct_edges = TRUE, highlight_branches_lwd=FALSE,
margin_inner=7,
lwd=2,
show_labels = FALSE
)
title("Complete Linkage vs Ward")Principal Comnponent Analysis (PCA) is a dimensionality reduction technique where a dataset is transformed to use p eigenvectors of the covariance matrix instead of the original number of predictors n, where p < n. The number of eigenvectors p is selected by looking at the sorted eigenvalues and determining a threshold percentage of variance explained and the resulting p.
The method seeks to project the data into a lower dimensional space where each axis (or principal component) captures the most variability in the data subject to the condition of being uncorrelated to the other axes. This last condition is important for dimensionality reduction in the sense that large datasets can contain many correlated variables which hold no additional information.
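The uncorrelatedness of the components can be verified directly; a minimal sketch using the ADHD question responses:
# Sketch: principal component scores are mutually uncorrelated by construction
pr <- adhd.full %>% select(ADHD.Q1:ADHD.Q18) %>% prcomp(scale = TRUE)
round(cor(pr$x)[1:4, 1:4], 10)   # off-diagonal correlations are ~0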
An eigenvalue > 1 indicates that a PC accounts for more variance than one of the original variables in standardized data. This is commonly used as a cutoff point for which PCs are retained, and it holds true only when the data are standardized. We can also limit the number of components to the number that accounts for a certain fraction of the total variance, for example 70%.
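Both cutoffs can be computed by hand from a prcomp fit; a short sketch on the standardized ADHD responses:
# Sketch: Kaiser rule and cumulative variance computed by hand
pr  <- adhd.full %>% select(ADHD.Q1:ADHD.Q18) %>% prcomp(scale = TRUE)
eig <- pr$sdev^2                          # eigenvalues of the correlation matrix
sum(eig > 1)                              # Kaiser rule: PCs with eigenvalue > 1
which(cumsum(eig) / sum(eig) >= 0.70)[1]  # first PC reaching 70% cumulative variance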
This section focuses on the subset of the data containing only the individual responses to the ADHD questionnaire. The table below displays the first 10 eigenvalues obtained from the decomposition. Following from the cutoff description above, our selection of dimensions can be based on the number of scaled eigenvalues that are greater than 1 or on a certain percentage of cumulative variance explained. Another way to select the number of PCs to consider is to study the scree plot provided below, which is simply a visual representation of the variance explained by each component. We typically look for an elbow in the plot to make our selection. In our case, the scree plot elbow occurs at Dimensions = 2 and the eigenvalue threshold at Dimensions = 3, which account for 59.2% and 64.8% respectively.
adhd.full %>% select(ADHD.Q1:ADHD.Q18) %>%
prcomp(scale = TRUE) %>% get_eigenvalue() %>% head(10)
adhd.full %>% select(ADHD.Q1:ADHD.Q18) %>%
prcomp(scale = FALSE) %>%
fviz_eig(addlabels = TRUE)
We can study the individual contributions of each questionnaire response to the principal components using the plot below. We observe that the contributions of the variables to the first principal component (Dim. 1) are roughly equivalent. However, Q16 bears a lot of weight in the second dimension, as does Q5 in the third.
res.pca.adhd <- adhd.full %>% select(ADHD.Q1:ADHD.Q18) %>% prcomp(scale = FALSE)
corrplot(get_pca_var(res.pca.adhd)$contrib, is.corr=FALSE)
The cumulative contribution across the first 3 dimensions is summarized below. Q5 bears the most overall importance, followed by Q16 and Q4. It is worth diving into the actual questionnaire to look up what each question is asking and see if any insights can be drawn to explain the variability. The questions are listed below.
adhd.full %>% select(ADHD.Q1:ADHD.Q18) %>%
prcomp(scale = FALSE) %>%
fviz_contrib(choice = "var", axes = 1:3, top = 10)We can visualize these contribution in 2 dimensions using the first two PCs as shown below. On this plot, we look at the groupings and directions of the vectors. Positively correlated variables are grouped together. Negatively correlated variables are positioned on opposite sides of the plot origin. The distance between variables and the origin measures the contribution of the variables. We make the following observations:
fviz_pca_var(res.pca.adhd,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
Using the reduced dataset, we can use PCA to extract additional information about the patients. Since we are using total scores for this section, the variables must be scaled in order to avoid any overweighted contributions. We find that it takes 9 dimensions to explain approximately 70% of the variance. The scree plot has no obvious elbow which can be used to determine a cutoff. Using the eigenvalue cutoff at Dimension 8, we can still capture 66% of the variance.
adhd.red.complete %>%
prcomp(scale = TRUE) %>% get_eigenvalue() %>% head(10)
adhd.red.complete %>%
prcomp(scale = TRUE) %>%
fviz_eig(addlabels = TRUE)
From the correlation and variable contribution plots below, we can make a few observations:
res.pca <- adhd.red.complete %>% prcomp(scale = TRUE)
corrplot(get_pca_var(res.pca)$contrib, is.corr=FALSE)
Cumulatively for PC1 to PC8, we can observe below that Education and its derivative EducationLevel are the largest individual contributors to the principal components, followed by SubstDX, Age and Race. The ADHD and Mood Disorder questionnaires contribute roughly equally to the first 8 dimensions.
adhd.red.complete %>%
prcomp(scale = TRUE) %>%
fviz_contrib(choice = "var", axes = 1:8, top = 10)The 2D representation of the variable contributions to PC1 and PC2 are shown below. We note the following:
In summary, the most insightful observations from this PCA analysis might be that Suicide and Abuse are generally correlated with both ADHD and Mood Disorder. Additionally, Mood Disorder is more closely correlated with substance-related abuse (including Alcohol and THC), Violence and Conduct than ADHD is.
fviz_pca_var(res.pca,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
The biplot representation below displays the same information as above but overlays the individual patient numbers. This view allows the selection of individual observations for a deeper dive.
fviz_pca_biplot(res.pca,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
The primary goal of Support Vector Machines (SVM) is to determine a decision boundary between classes where the separating margin is maximized. There can be an infinite number of decision boundaries between classes. The points from each class closest to the decision boundary are called Support Vectors. SVM looks to find an optimal decision boundary such that Support Vectors from each class delineate a maximal margin subject to some constraints. Classification of a new unseen point is determined by which side of the decision boundary the point resides on.
For data that is not linearly separable or has overlapping classes, SVM uses kernel functions (the so-called kernel trick) to implicitly transform the feature space from n dimensions to a higher number of dimensions in order to achieve better separation of classes. The decision boundary in this higher-dimensional space is a separating hyperplane.
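As a hedged illustration on synthetic data (all names below are illustrative), a radial kernel can separate a circular pattern that no linear boundary can:
# Sketch: kernel trick on a toy, non-linearly-separable (circular) pattern
set.seed(1)
toy <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
toy$y <- factor(ifelse(toy$x1^2 + toy$x2^2 > 1.5, 1, 0))  # class depends on radius
fit_lin <- svm(y ~ ., data = toy, kernel = "linear", cost = 1)
fit_rbf <- svm(y ~ ., data = toy, kernel = "radial", cost = 1)
mean(fitted(fit_lin) == toy$y)  # near chance for the linear boundary
mean(fitted(fit_rbf) == toy$y)  # substantially higher with the radial kernel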
The strategy that we will use for SVM is to compare several models, specifically with and without Principal Component Analysis dimensionality reduction. For each model that is run, we tune the model parameters via a grid search of kernel type (linear, polynomial, radial, sigmoid), cost (2^1 through 2^8) and epsilon (0 to 1 in steps of 0.1).
The models use 10-fold cross-validation to prevent overfitting and arrive at the overall best parameters and accuracy rates below. Since SVM is a supervised learning method, we require a dependent variable to model; in this case, it will be the Suicide feature.
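For reference, the 10-fold cross-validation can be made explicit through tune.control (this is also e1071's default resampling); a sketch mirroring the tune calls below:
# Sketch: making the 10-fold CV used by tune() explicit
ctrl <- tune.control(sampling = "cross", cross = 10)
# e.g. tune(svm, Suicide ~ ., data = adhd_base_training,
#           ranges = list(kernel = c("linear", "radial")), tunecontrol = ctrl)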
The first approach builds an SVM model without any mathematical dimensionality reduction.
### Base Data
The initial SVM model will take the entire dataset as given. The ADHD and Mood Disorder total columns have been removed as they are collinear with some or all of the individual questions. In addition, Psych Meds has been removed because of the large number of incomplete values for this feature.
adhd_raw2 <- adhd_raw %>% select(-MD.TOTAL, -ADHD.Total, -Initial, -Psych.meds.)
adhd_raw2 <- adhd_raw2[complete.cases(adhd_raw2), ]
adhd_raw2$Suicide <- as.factor(adhd_raw2$Suicide)
set.seed(55)
trainIndex <- createDataPartition(adhd_raw2$Suicide, p = .8, list = FALSE, times = 1)
adhd_base_training <- adhd_raw2[ trainIndex,]
adhd_base_testing <- adhd_raw2[-trainIndex,]
The resulting model based on the base version of the dataset is shown below. With a radial kernel and C=2 we reach 71.9% training accuracy but a lower testing accuracy of 60.7%.
svm_base_m <- tune(svm, Suicide ~., data = adhd_base_training, ranges=list(
kernel=c("linear", "polynomial", "radial", "sigmoid"),
cost=2^(1:8),
epsilon = seq(0,1,0.1)))
svm_bm_best <- svm_base_m$best.model
svm_base_pred <- predict(svm_bm_best, newdata = adhd_base_testing, type="class")
svm_base_cm <- confusionMatrix(svm_base_pred, adhd_base_testing$Suicide)
base_method <- cbind(Reduction_Method = "Base Data", svm_base_m$best.parameters, Training_Accuracy = 1-svm_base_m$best.performance, Testing_Accuracy = sum(diag(svm_base_cm$table))/sum(svm_base_cm$table),
Features=ncol(adhd_base_training)-1,
Cumulative_Proportion=100)
base_method
For the next SVM model, we take the dataset with the ADHD and Mood Disorder questions removed. The totals for each remain within the dataset. In essence we have reduced dimensionality here, but we’ve done so with simple feature aggregation.
adhd.complete2 <- adhd.red.complete
adhd.complete2$Suicide <- as.factor(adhd.complete2$Suicide)
set.seed(55)
trainIndex <- createDataPartition(adhd.complete2$Suicide, p = .8, list = FALSE, times = 1)
adhd.training <- adhd.complete2[ trainIndex,]
adhd.testing <- adhd.complete2[-trainIndex,]
Prediction on the reduced dataset yields similar accuracy results to the above, with a small gain in training accuracy.
svm_m <- tune(svm, Suicide ~., data = adhd.training, ranges=list(
kernel=c("linear", "polynomial", "radial", "sigmoid"),
cost=2^(1:8),
epsilon = seq(0,1,0.1)))
svm_m_best <- svm_m$best.model
svm_pred <- predict(svm_m_best, newdata = adhd.testing, type="class")
svm_cm <- confusionMatrix(svm_pred, adhd.testing$Suicide)
noquestions_method <- cbind(Reduction_Method = "Reduced Dataset", svm_m$best.parameters, Training_Accuracy = 1-svm_m$best.performance, Testing_Accuracy = sum(diag(svm_cm$table))/sum(svm_cm$table),
Features=ncol(adhd.testing)-1,
Cumulative_Proportion=100)
noquestions_method
In the PCA reduction approach, we run SVM models by first reducing the dimensionality of the feature set. The reduced feature space is then passed to an SVM model for prediction of the Suicide class.
adhd_raw2 <- adhd_raw %>% select(-MD.TOTAL, -ADHD.Total, -Initial, -Psych.meds.)
adhd_raw2 <- adhd_raw2[complete.cases(adhd_raw2), ]
adhd_raw2$Suicide <- as.factor(adhd_raw2$Suicide)
set.seed(55)
trainIndex <- createDataPartition(adhd_raw2$Suicide, p = .8, list = FALSE, times = 1)
adhd_pca.training <- adhd_raw2[ trainIndex,]
adhd_pca.testing <- adhd_raw2[-trainIndex,]
head(adhd_raw2)
We run PCA with the inputs centered and scaled, since PCA performs a transformation of the feature space and unscaled variables would dominate it. PCA transforms the original feature space into 49 principal components, with 95% of the variance in the dataset cumulatively explained by 34 principal components. Since we scaled the original features, we also check the first component for normality and outliers; the Q-Q plot below looks fairly normal.
svm_pca_m <- prcomp(select(adhd_pca.training, -Suicide), center = TRUE, scale = TRUE)
summary(svm_pca_m)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 3.4453 2.2528 1.55906 1.37007 1.34635 1.28866 1.27728
## Proportion of Variance 0.2422 0.1036 0.04961 0.03831 0.03699 0.03389 0.03329
## Cumulative Proportion 0.2422 0.3458 0.39543 0.43374 0.47073 0.50462 0.53792
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 1.22380 1.19497 1.16886 1.08459 1.05487 1.04846 1.03414
## Proportion of Variance 0.03057 0.02914 0.02788 0.02401 0.02271 0.02243 0.02183
## Cumulative Proportion 0.56848 0.59762 0.62551 0.64951 0.67222 0.69466 0.71648
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 1.00751 0.96060 0.95823 0.91873 0.87611 0.8400 0.8340
## Proportion of Variance 0.02072 0.01883 0.01874 0.01723 0.01566 0.0144 0.0142
## Cumulative Proportion 0.73720 0.75603 0.77477 0.79199 0.80766 0.8221 0.8363
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 0.81062 0.78359 0.76540 0.69345 0.67529 0.64590 0.63526
## Proportion of Variance 0.01341 0.01253 0.01196 0.00981 0.00931 0.00851 0.00824
## Cumulative Proportion 0.84966 0.86219 0.87415 0.88396 0.89327 0.90178 0.91002
## PC29 PC30 PC31 PC32 PC33 PC34 PC35
## Standard deviation 0.62827 0.60068 0.59058 0.57195 0.56669 0.54155 0.52992
## Proportion of Variance 0.00806 0.00736 0.00712 0.00668 0.00655 0.00599 0.00573
## Cumulative Proportion 0.91808 0.92544 0.93256 0.93923 0.94579 0.95177 0.95750
## PC36 PC37 PC38 PC39 PC40 PC41 PC42
## Standard deviation 0.5050 0.48908 0.47771 0.44431 0.42168 0.39567 0.37971
## Proportion of Variance 0.0052 0.00488 0.00466 0.00403 0.00363 0.00319 0.00294
## Cumulative Proportion 0.9627 0.96759 0.97225 0.97628 0.97990 0.98310 0.98604
## PC43 PC44 PC45 PC46 PC47 PC48 PC49
## Standard deviation 0.36480 0.34923 0.32573 0.31662 0.2968 0.27565 0.24184
## Proportion of Variance 0.00272 0.00249 0.00217 0.00205 0.0018 0.00155 0.00119
## Cumulative Proportion 0.98876 0.99125 0.99341 0.99546 0.9973 0.99881 1.00000
qqnorm(svm_pca_m[["x"]][,1])
The PCA biplot of the first two principal components shows a large amount of overlap between the two Suicide classes and poor separability given just these two components.
fviz_pca_biplot(svm_pca_m, label="var",
habillage = adhd_pca.training$Suicide,
addEllipses = TRUE, palette = "jco")
To determine the number of principal components (factors/features) to keep, we look at three established methods:
The Kaiser rule states that the optimal factor reduction is achieved by selecting principal components that have an eigenvalue greater than 1, since an eigenvalue of 1 contains the same information as a single standardized variable.
Below are all dimensions/principal components that have an eigenvalue greater than 1:
#Kaiser rule: select PCs with eigenvalues of at least 1
reduced_dim <- get_eigenvalue(svm_pca_m) %>% filter(eigenvalue > 1)
reduced_dim
We create the training dataset from the dimensions retained by the Kaiser rule, combined with the respective Suicide class:
adhd_pca.training_kaiser <- cbind(as.data.frame(svm_pca_m$x[,c(1:nrow(reduced_dim))]), Suicide = adhd_pca.training$Suicide)
head(adhd_pca.training_kaiser)
We rotate the test data using the predict function, applying the same rotation as the training data and keeping only the components retained by the Kaiser rule:
adhd_pca.testing_kaiser <- cbind(as.data.frame(predict(svm_pca_m, newdata = select(adhd_pca.testing, -Suicide))[,c(1:nrow(reduced_dim))]),Suicide = adhd_pca.testing$Suicide)
head(adhd_pca.testing_kaiser)
Below are the optimal tuned hyperparameters with training and testing accuracy:
svm_pca_kaiser <- tune(svm, Suicide ~., data = adhd_pca.training_kaiser, ranges=list(
kernel=c("linear", "polynomial", "radial", "sigmoid"),
cost=2^(1:8),
epsilon = seq(0,1,0.1)))
svm_pca_kaiser_best <- svm_pca_kaiser$best.model
svm_pca_kaiser_pred <- predict(svm_pca_kaiser_best, newdata = adhd_pca.testing_kaiser, type="class")
svm_pca_kaiser_cm <- confusionMatrix(svm_pca_kaiser_pred, adhd_pca.testing_kaiser$Suicide)
kaiser_method <- cbind(Reduction_Method = "Kaiser Rule",svm_pca_kaiser$best.parameters, Training_Accuracy = 1-svm_pca_kaiser$best.performance,
Testing_Accuracy = (sum(diag(svm_pca_kaiser_cm$table))/sum(svm_pca_kaiser_cm$table)),
Features=ncol(adhd_pca.testing_kaiser)-1,
Cumulative_Proportion=max(reduced_dim["cumulative.variance.percent"]))
kaiser_method
The scree plot method suggests keeping the components up to the point where the plot forms an “elbow”. The scree plot shows that after the 3rd principal component, additional components successively diminish in explaining the variance, and the 3rd component is also where the “elbow” forms. Therefore, according to the scree method, in this case we should keep the first 3 principal components and discard the rest.
fviz_eig(svm_pca_m, addlabels = T)
We create the training dataset from the dimensions retained by the scree method, combined with the respective Suicide class:
adhd_pca.training_scree <- cbind(as.data.frame(svm_pca_m$x[,c(1:3)]), Suicide = adhd_pca.training$Suicide)
head(adhd_pca.training_scree)
We rotate the test data using the predict function, applying the same rotation as the training data and keeping only the components retained by the scree method:
adhd_pca.testing_scree <- cbind(as.data.frame(predict(svm_pca_m, newdata = select(adhd_pca.testing, -Suicide))[,1:3]),Suicide = adhd_pca.testing$Suicide)
head(adhd_pca.testing_scree)
Below are the optimal tuned hyperparameters with training and testing accuracy:
svm_pca_scree <- tune(svm, Suicide ~., data = adhd_pca.training_scree, ranges=list(
kernel=c("linear", "polynomial", "radial", "sigmoid"),
cost=2^(1:8),
epsilon = seq(0,1,0.1)))
svm_pca_scree_best <- svm_pca_scree$best.model
svm_pca_scree_pred <- predict(svm_pca_scree_best, newdata = adhd_pca.testing_scree, type="class")
svm_pca_scree_cm <- confusionMatrix(svm_pca_scree_pred, adhd_pca.testing_scree$Suicide)
scree_method <- cbind(Reduction_Method = "Scree Method", svm_pca_scree$best.parameters, Training_Accuracy = 1-svm_pca_scree$best.performance, Testing_Accuracy = sum(diag(svm_pca_scree_cm$table))/sum(svm_pca_scree_cm$table),
Features=ncol(adhd_pca.testing_scree)-1,
Cumulative_Proportion=(get_eigenvalue(svm_pca_m) %>% select(cumulative.variance.percent))[3,])
scree_method
Another method of component selection is ensuring that a given fraction of the variance, commonly at least two-thirds, is explained by the reduced feature set. In other words, keep the principal components that cumulatively explain the target fraction of the variation and discard the rest. In this case we opt to keep at least 90% of the cumulative variance explained, which reduces the feature set to 27 dimensions.
reduced_dim_90 <- get_eigenvalue(svm_pca_m) %>% filter(cumulative.variance.percent < 91)
reduced_dim_90
We create the training dataset from the dimensions retained by the proportion of variance explained method, combined with the respective Suicide class:
adhd_pca.training_90var <- cbind(as.data.frame(svm_pca_m$x[,c(1:nrow(reduced_dim_90))]), Suicide = adhd_pca.training$Suicide)
head(adhd_pca.training_90var)
We rotate the test data using the predict function, applying the same rotation as the training data and keeping only the components retained by the proportion of variance explained method:
adhd_pca.testing_90var <- cbind(as.data.frame(predict(svm_pca_m, newdata = select(adhd_pca.testing, -Suicide))[,c(1:nrow(reduced_dim_90))]),Suicide = adhd_pca.testing$Suicide)
head(adhd_pca.testing_90var)
Below are the optimal tuned hyperparameters with training and testing accuracy:
svm_pca_90var <- tune(svm, Suicide ~., data = adhd_pca.training_90var, ranges=list(
kernel=c("linear", "polynomial", "radial", "sigmoid"),
cost=2^(1:8),
epsilon = seq(0,1,0.1)))
svm_pca_90var_best <- svm_pca_90var$best.model
svm_pca_90var_pred <- predict(svm_pca_90var_best, newdata = adhd_pca.testing_90var, type="class")
svm_pca_90var_cm <- confusionMatrix(svm_pca_90var_pred, adhd_pca.testing_90var$Suicide)
method_90var <- cbind(Reduction_Method = "Proportion of Variance Explained", svm_pca_90var$best.parameters, Training_Accuracy = 1-svm_pca_90var$best.performance, Testing_Accuracy = sum(diag(svm_pca_90var_cm$table))/sum(svm_pca_90var_cm$table),
Features=ncol(adhd_pca.testing_90var)-1,
Cumulative_Proportion=max(reduced_dim_90["cumulative.variance.percent"]))
method_90var
It is interesting to see that the Questions Removed (reduced dataset) SVM model had the lowest training accuracy but the highest testing accuracy, with the two fairly close together; this represents the least overfit model. While the Questions Removed model produced the best accuracy, in practice this kind of manual aggregation may not always be possible, and it may be computationally preferable to reduce the dataset when working with a large set of features. In that scenario the scree method has the best results with the smallest number of principal components and would be the model of choice, as it is the most efficient.
# Combine the per-method results into one comparison table
svm_models <- rbind(base_method, noquestions_method, kaiser_method, scree_method, method_90var)
svm_models
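As an optional convenience, the comparison table can be ordered by testing accuracy (using the column names as constructed above):
# Sketch: order the model comparison by testing accuracy, best first
svm_models[order(-svm_models$Testing_Accuracy), ]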