library(skimr)
library(tidyverse)
library(caret) # For featureplot, classification report
library(corrplot) # For correlation matrix and PCA contributionplots
library(AppliedPredictiveModeling)
library(mice) # For data imputation
library(VIM) # For missing data visualization
library(gridExtra) # For grid plots
library(pROC) # For AUC calculations
library(ROCR) # For ROC and AUC plots
library(dendextend) # For dendrograms
library(factoextra) # For PCA plots
library(e1071) # For SVM
This dataset is composed of real patient responses to two questionnaires related to ADHD and Mood Disorder, along with a variety of demographic, abuse and drug use variables. For each questionnaire, the responses to individual questions are provided along with total scores. Links to the actual questions are provided below:
The first part of this work will make use of unsupervised learning techniques such as Principal Component Analysis (PCA) and clustering in an attempt to discover structures in the data. The second part will explore support vector machines in a supervised learning exercise to predict whether an individual has attempted suicide.
The dataset is composed of 54 variables and 175 observations. The data is coded as numeric, and 33 observations have some level of missing data. A summary of the variable distributions is provided below:
adhd_raw <- read.csv('https://raw.githubusercontent.com/maelillien/data622/main/hw4/adhd_data.csv', header = TRUE)
adhd <- adhd_raw
head(adhd)
adhd[adhd==''] <- NA
skim(adhd %>% select(-c(ADHD.Q1:ADHD.Q18, MD.Q1a:MD.Q3)))
| Name | adhd %>% select(-c(ADHD.Q… |
| Number of rows | 175 |
| Number of columns | 21 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 20 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Initial | 0 | 1 | 2 | 3 | 0 | 109 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Age | 0 | 1.00 | 39.47 | 11.17 | 18 | 29.5 | 42 | 48.0 | 69 | ▆▅▇▅▁ |
| Sex | 0 | 1.00 | 1.43 | 0.50 | 1 | 1.0 | 1 | 2.0 | 2 | ▇▁▁▁▆ |
| Race | 0 | 1.00 | 1.64 | 0.69 | 1 | 1.0 | 2 | 2.0 | 6 | ▇▁▁▁▁ |
| ADHD.Total | 0 | 1.00 | 34.32 | 16.70 | 0 | 21.0 | 33 | 47.5 | 72 | ▃▆▇▆▂ |
| MD.TOTAL | 0 | 1.00 | 10.02 | 4.81 | 0 | 6.5 | 11 | 14.0 | 17 | ▃▃▆▇▇ |
| Alcohol | 4 | 0.98 | 1.35 | 1.39 | 0 | 0.0 | 1 | 3.0 | 3 | ▇▂▁▁▆ |
| THC | 4 | 0.98 | 0.81 | 1.27 | 0 | 0.0 | 0 | 1.5 | 3 | ▇▁▁▁▃ |
| Cocaine | 4 | 0.98 | 1.09 | 1.39 | 0 | 0.0 | 0 | 3.0 | 3 | ▇▁▁▁▅ |
| Stimulants | 4 | 0.98 | 0.12 | 0.53 | 0 | 0.0 | 0 | 0.0 | 3 | ▇▁▁▁▁ |
| Sedative.hypnotics | 4 | 0.98 | 0.12 | 0.54 | 0 | 0.0 | 0 | 0.0 | 3 | ▇▁▁▁▁ |
| Opioids | 4 | 0.98 | 0.39 | 0.99 | 0 | 0.0 | 0 | 0.0 | 3 | ▇▁▁▁▁ |
| Court.order | 5 | 0.97 | 0.09 | 0.28 | 0 | 0.0 | 0 | 0.0 | 1 | ▇▁▁▁▁ |
| Education | 9 | 0.95 | 11.90 | 2.17 | 6 | 11.0 | 12 | 13.0 | 19 | ▁▅▇▂▁ |
| Hx.of.Violence | 11 | 0.94 | 0.24 | 0.43 | 0 | 0.0 | 0 | 0.0 | 1 | ▇▁▁▁▂ |
| Disorderly.Conduct | 11 | 0.94 | 0.73 | 0.45 | 0 | 0.0 | 1 | 1.0 | 1 | ▃▁▁▁▇ |
| Suicide | 13 | 0.93 | 0.30 | 0.46 | 0 | 0.0 | 0 | 1.0 | 1 | ▇▁▁▁▃ |
| Abuse | 14 | 0.92 | 1.33 | 2.12 | 0 | 0.0 | 0 | 2.0 | 7 | ▇▂▁▁▁ |
| Non.subst.Dx | 22 | 0.87 | 0.44 | 0.68 | 0 | 0.0 | 0 | 1.0 | 2 | ▇▁▃▁▁ |
| Subst.Dx | 23 | 0.87 | 1.14 | 0.93 | 0 | 0.0 | 1 | 2.0 | 3 | ▆▇▁▅▂ |
| Psych.meds. | 118 | 0.33 | 0.96 | 0.80 | 0 | 0.0 | 1 | 2.0 | 2 | ▇▁▇▁▆ |
The dataset is modified to include an EducationLevel categorical variable derived from the numerical Education variable, which represents years of schooling. The Abuse column is unfolded into 3 binary variables indicating the occurrence of the 3 types of abuse. The original Abuse variable is dropped.
We work with multiple subsets of the data in subsequent parts of this report. Some analyses make use of the entire set of questionnaire responses while others use only the total scores.
# Renaming variables
adhd <- rename(adhd, ADHDTotal = ADHD.Total, MDTotal = MD.TOTAL, Sedatives = Sedative.hypnotics, CourtOrder = Court.order,
Violence = Hx.of.Violence, Conduct = Disorderly.Conduct, NonSubstDX = Non.subst.Dx, SubstDX = Subst.Dx, PsychMeds = Psych.meds.)
# Drop Initial column
adhd <- adhd %>% select(-c(Initial))
# Shift sex variable to 0 and 1
adhd$Sex <- adhd$Sex-1
# Re-coding Education Variable, College = Education > 12, HS = Education > 8 & <= 12, MS = Education <= 8
adhd$EducationLevel <- ifelse(adhd$Education <= 8, 1, ifelse(adhd$Education <= 12 & adhd$Education > 8, 2, ifelse(adhd$Education > 12, 3, 99)))
# Creating new variables based on type of abuse
adhd$AbuseP <- as.numeric(adhd$Abuse == 1 | adhd$Abuse == 4 | adhd$Abuse == 5 | adhd$Abuse == 7)
adhd$AbuseS <- as.numeric(adhd$Abuse == 2 | adhd$Abuse == 4 | adhd$Abuse == 6 | adhd$Abuse == 7)
adhd$AbuseE <- as.numeric(adhd$Abuse == 3 | adhd$Abuse == 5 | adhd$Abuse == 6 | adhd$Abuse == 7)
adhd <- adhd %>% select(-c(Abuse))
# Forming data subsets: full set of variables or reduced (totals only)
adhd.full <- adhd
adhd.red <- adhd %>% select(c(Age, Sex, Race, ADHDTotal, MDTotal, Alcohol:AbuseE))
# Set all characters to numeric
adhd.red <- mutate_all(adhd.red, function(x) as.numeric(as.character(x)))
adhd.full <- mutate_all(adhd.full, function(x) as.numeric(as.character(x)))
The dataset contains a few missing values. The PsychMeds variable mostly contained missing values and was dropped entirely. A few observations were quite sparse and only contained the basic demographic and questionnaire score columns. To avoid biasing the dataset with imputed values, we preferred to drop all observations with missing values. The resulting dataset contains 33 fewer observations, with 142 complete rows and 19 columns.
mice_plot <- aggr(adhd.red, col=c('#F8766D','#00BFC4'), numbers=TRUE, sortVars=TRUE, labels=names(adhd.red), cex.axis=.7, gap=3, ylab=c("Missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## PsychMeds 0.67428571
## SubstDX 0.13142857
## NonSubstDX 0.12571429
## AbuseP 0.08000000
## AbuseS 0.08000000
## AbuseE 0.08000000
## Suicide 0.07428571
## Violence 0.06285714
## Conduct 0.06285714
## Education 0.05142857
## EducationLevel 0.05142857
## CourtOrder 0.02857143
## Alcohol 0.02285714
## THC 0.02285714
## Cocaine 0.02285714
## Stimulants 0.02285714
## Sedatives 0.02285714
## Opioids 0.02285714
## Age 0.00000000
## Sex 0.00000000
## Race 0.00000000
## ADHDTotal 0.00000000
## MDTotal 0.00000000
# Discard PsychMeds
adhd.red.complete <- adhd.red %>% select(-c(PsychMeds))
adhd.full.complete <- adhd.full %>% select(-c(PsychMeds))
# Keep only complete cases
adhd.red.complete <- adhd.red.complete[complete.cases(adhd.red.complete),]
adhd.full.complete <- adhd.full.complete[complete.cases(adhd.full.complete),]
The plots below explore the distributions of subsets of the data.
The distributions of the responses to the ADHD questionnaire are displayed below. Some variables are unimodal, others bimodal; some are centered and others skewed. Reference should be made to the survey questions in order to gain deeper insight into these distributions.
For the Mood Disorder questionnaire, the majority of the binary responses were 1. The classes are fairly balanced except for a few questions. Q2 is particularly unbalanced, and the distribution of the responses to Q3 is highly skewed toward the left tail.
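As a hedged sketch, the histogram grid described above could be produced along the following lines (the original plotting code is not shown, so the bin count and layout are assumptions); the MD questions can be plotted the same way by selecting MD.Q1a:MD.Q3.
# Hedged sketch: faceted histograms of the ADHD question responses
adhd.full %>%
  select(ADHD.Q1:ADHD.Q18) %>%
  gather(question, response) %>%        # long format: one row per answer
  ggplot(aes(response)) +
  geom_histogram(bins = 5) +            # responses sit on a small ordinal scale
  facet_wrap(~ question, scales = "free_y") +
  theme_minimal()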
A few observations on the binary and ordinal variables:
Clustering refers to a broad set of techniques for finding subgroups, or clusters, in a dataset. We seek to partition observations into distinct groups so that the observations within each group are quite similar to each other, while observations in different groups are quite different from each other. The most popular clustering approaches are K-means and Hierarchical Clustering (HC). While the former requires a pre-specified number of clusters k, the latter does not. HC is a bottom-up or agglomerative clustering approach which results in an upside-down tree representation, built from the leaves and combined into clusters up to the trunk. Clusters are identified by horizontal cuts across the dendrogram.
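As a minimal sketch of these ideas on synthetic data (all names below are illustrative), K-means needs k up front, while an hclust tree can be cut either into a fixed number of clusters k or at a fixed dissimilarity height h via cutree:
# Minimal sketch on synthetic data: K-means vs hierarchical clustering
set.seed(42)
toy <- matrix(rnorm(60), ncol = 2)            # 30 illustrative observations
kmeans(toy, centers = 3)$cluster              # K-means requires a pre-specified k
hc_toy <- hclust(dist(toy), method = "complete")
cutree(hc_toy, k = 3)                         # cut into a fixed number of clusters
cutree(hc_toy, h = 2)                         # or cut at a fixed dissimilarity height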
In this section, we explore the use of Hierarchical Clustering on two portions of the data. The first uses only the questionnaire responses to ADHD while the second uses the total questionnaire scores for both surveys as well as the other variables (demographic, drugs, abuse, etc). The latter is referred to as the ‘reduced’ dataset.
Clustering typically requires the variables to be scaled in order to avoid giving more weight to variables measured on a larger range of values. However, when all the variables under consideration are measured on the same scale, which is the case when only comparing survey responses, it can be appropriate to leave the variables unscaled.
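For reference, a quick sketch of what scale does to the reduced dataset used later (each column is standardized to mean 0 and standard deviation 1 before distances are computed):
# Sketch: scale() standardizes each column to mean 0 and sd 1
zs <- scale(adhd.red.complete)
round(colMeans(zs)[1:5], 10)       # column means are ~0 after scaling
round(apply(zs, 2, sd)[1:5], 10)   # column standard deviations are 1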
With HC, the concept of dissimilarity between a pair of observations needs to be extended to a pair of groups of observations. This extension is achieved with the notion of linkage, which defines the dissimilarity between two groups of observations. The resulting dendrogram heavily depends on the choice of linkage. The most popular linkages are complete and average because they tend to result in more balanced clusters.
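One informal way to compare linkage choices before committing to one is the agglomerative coefficient; the sketch below assumes the cluster package (not loaded above) is available:
# Hedged sketch: agglomerative coefficient per linkage (closer to 1 = stronger structure)
library(cluster)
linkages <- c("average", "single", "complete", "ward")
sapply(linkages, function(m) agnes(scale(adhd.red.complete), method = m)$ac)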
Using only the individual unscaled responses to the ADHD Questionnaire, we obtain the following dendrogram structure using complete linkage. In this case, complete linkage provided the best balancing and a cutoff into 3 clusters looked appropriate. In order to gain insight into these clusters, we need to look at the distribution of the variables within each of them.
For these 3 clusters, we can make the following observations:
We can establish a ranking for this clustering based on the monotonic rise in meanADHD and meanMD across clusters. From least to most severe: Cluster 3, Cluster 2, Cluster 1. With this information, a treatment plan could be developed to focus on Clusters 2 and 1, with the latter requiring special attention.
hc0 <- adhd.full %>% select(ADHD.Q1:ADHD.Q18) %>% dist(method = "euclidean") %>% hclust(method = "complete")
dend0 <- hc0 %>% as.dendrogram
sub_grp0 <- cutree(dend0, k = 3, order_clusters_as_data = TRUE)
dend0 %>% set("branches_k_color", k = 3) %>% set("labels", "") %>% plot(main = "Hierarchical Clustering ADHD Questionnaire")
adhd.full %>%
mutate(cluster = sub_grp0) %>%
group_by(cluster) %>%
summarise(meanAge = mean(Age), meanMD = mean(MDTotal), meanADHD = mean(ADHDTotal), count = n()) %>%
gather(var,value,meanAge:count) %>%
ggplot(aes(cluster,value,fill=cluster)) +
geom_col() + facet_grid(var ~ ., scales="free_y") +
geom_text(aes(label=round(value,1)), vjust=1.6, color="white", size=3.5) +
ggtitle('ADHD Questionnaire Cluster Distribution') +
theme_minimal()
For a contrasting analysis, we looked at clustering based on the dataset which included only the total questionnaire scores and dropped observations with missing values. The variables were scaled to balance out the contribution of the high values for scores and age. We explore the use of complete, average and Ward linkages.
Unlike the previous clustering, adding the other variables results in a less balanced dendrogram. In this case, most clusters contain only a few observations and two large clusters contain the majority of the data. We can make a few additional observations:
Based on the variable distributions, it might be tempting to group Clusters 4 and 5 together since they represent a cohort with similar meanAge, meanMD and meanADHD scores. However, hierarchical clustering considers more variables than the aforementioned few, and this kind of grouping would violate the measure of similarity as determined by the y axis of the dendrogram.
A proposed ADHD and Mood Disorder ranking in 4 levels from least severe to most severe is: Cluster 2; Clusters 1, 4 & 5; Clusters 3 (older) & 6 (younger).
hc1 <- adhd.red.complete %>% scale %>% dist(method = "euclidean") %>% hclust(method = "complete")
dend1 <- hc1 %>% as.dendrogram
sub_grp1 <- cutree(dend1, k = 6)
dend1 %>% set("branches_k_color", k = 6) %>% set("labels", "") %>% plot(main = "Hierarchical Clustering Reduced Dataset + Complete Linkage")
adhd.red.complete %>%
mutate(cluster = sub_grp1) %>%
group_by(cluster) %>%
summarise(meanAge = mean(Age), meanMD = mean(MDTotal), meanADHD = mean(ADHDTotal), count = n()) %>%
gather(var,value,meanAge:count) %>%
ggplot(aes(cluster,value,fill=cluster)) +
geom_col() + facet_grid(var ~ ., scales="free_y") +
geom_text(aes(label=round(value,1)), vjust=1.6, color="white", size=3.5) +
ggtitle('Reduced Dataset + Complete Linkage Cluster Distribution') +
theme_minimal()
Average linkage completely changes the dendrogram representation. A cut at k = 5 seemed appropriate. Similarly to complete linkage, one large cluster is identified, this time containing the vast majority of the observations. This obscures the analysis beyond what we can establish about the small clusters.
hc2 <- adhd.red.complete %>% scale %>% dist(method = "euclidean") %>% hclust(method = 'average')
dend2 <- hc2 %>% as.dendrogram
sub_grp2 <- cutree(dend2, k = 5)
dend2 %>% set("branches_k_color", k = 5) %>% set("labels", "") %>% plot(main = "Hierarchical Clustering Reduced Dataset + Average Linkage")
adhd.red.complete %>%
mutate(cluster = sub_grp2) %>%
group_by(cluster) %>%
summarise(meanAge = mean(Age), meanMD = mean(MDTotal), meanADHD = mean(ADHDTotal), count = n()) %>%
gather(var,value,meanAge:count) %>%
ggplot(aes(cluster,value,fill=cluster)) +
geom_col() + facet_grid(var ~ ., scales="free_y") +
geom_text(aes(label=round(value,1)), vjust=1.6, color="white", size=3.5) +
ggtitle('Reduced Dataset + Average Linkage Cluster Distribution') +
theme_minimal()
While average linkage is a popular option, the resulting clustering above was somewhat unattractive. Here we consider another linkage method to expand the analysis. Ward linkage makes use of Ward’s minimum variance criterion, which seeks to minimize the total within-cluster variance. The resulting dendrogram is visually attractive and more balanced than before. A cutoff at k = 3 clusters seems appropriate. This grouping is more similar to the one that relied only on the ADHD questionnaire responses. Here, Cluster 3 is the oldest group but also has the lowest meanADHD and meanMD scores. Clusters 1 and 2 are less obvious to differentiate since Cluster 2 has the highest meanADHD score but a lower meanMD score than Cluster 1.
hc3 <- adhd.red.complete %>% scale %>% dist(method = "euclidean") %>% hclust(method = 'ward.D')
dend3 <- hc3 %>% as.dendrogram
sub_grp3 <- cutree(dend3, k = 3)
dend3 %>% set("branches_k_color", k = 3) %>% set("labels", "") %>% plot(main = "Hierarchical Clustering Reduced Dataset + Ward Linkage")
adhd.red.complete %>%
mutate(cluster = sub_grp3) %>%
group_by(cluster) %>%
summarise(meanAge = mean(Age), meanMD = mean(MDTotal), meanADHD = mean(ADHDTotal), count = n()) %>%
gather(var,value,meanAge:count) %>%
ggplot(aes(cluster,value,fill=cluster)) +
geom_col() + facet_grid(var ~ ., scales="free_y") +
geom_text(aes(label=round(value,1)), vjust=1.6, color="white", size=3.5) +
ggtitle('Reduced Dataset + Ward Linkage Cluster Distribution') +
theme_minimal()
We use a tanglegram to obtain a side-by-side comparison of the clusters obtained using the different linkage methods. Here we only compare the complete and average linkage clusterings, which are similar in terms of cluster imbalance. The first thing to notice is the consistency of groupings at the higher ends of the hierarchies. This is shown by the nearly horizontal ribbons linking the two clusterings, indicating that the selected observations end up in the corresponding cluster representation. Naturally, a large number of observations find correspondence in the opposing largest clusters.
dl <- dendlist(
dend1 %>%
set("labels_col", k=6) %>%
set("branches_lty", 1) %>%
set("branches_k_color", k = 6),
dend2 %>%
set("labels_col", k=5) %>%
set("branches_lty", 1) %>%
set("branches_k_color", k = 5)
)
# Plot them together
tanglegram(dl,
common_subtrees_color_lines = TRUE, highlight_distinct_edges = TRUE, highlight_branches_lwd=FALSE,
margin_inner=7,
lwd=2,
show_labels = FALSE
)
title("Complete Linkage vs Average")The comparison of complete linkage against ward linkage is interesting since the latter contains only half the number of clusters. It might be worth studying how the patients in the middle cluster of the ward representation (lowest ADHD and MD) find correspondence in the complete linkage clusters. Similarly, we could study in more detail how individuals in Clusters 3 and 6 (using complete linkage) which were small but extreme clusters end up grouped in the ward representation which had less defined levels of severity in terms of ADHD and MD.
dl <- dendlist(
dend1 %>%
set("labels_col", k=6) %>%
set("branches_lty", 1) %>%
set("branches_k_color", k = 6),
dend3 %>%
set("labels_col", k=3) %>%
set("branches_lty", 1) %>%
set("branches_k_color", k = 3)
)
# Plot them together
tanglegram(dl,
common_subtrees_color_lines = TRUE, highlight_distinct_edges = TRUE, highlight_branches_lwd=FALSE,
margin_inner=7,
lwd=2,
show_labels = FALSE
)
title("Complete Linkage vs Ward")Principal Comnponent Analysis (PCA) is a dimensionality reduction technique where a dataset is transformed to use p eigenvectors of the covariance matrix instead of the original number of predictors n, where p < n. The number of eigenvectors p is selected by looking at the sorted eigenvalues and determining a threshold percentage of variance explained and the resulting p.
The method seeks to project the data into a lower dimensional space where each axis (or principal component) captures the most variability in the data subject to the condition of being uncorrelated to the other axes. This last condition is important for dimensionality reduction in the sense that large datasets can contain many correlated variables which hold no additional information.
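The uncorrelatedness of the components can be verified directly; a minimal sketch using the ADHD question responses:
# Sketch: principal component scores are mutually uncorrelated by construction
pr <- adhd.full %>% select(ADHD.Q1:ADHD.Q18) %>% prcomp(scale = TRUE)
round(cor(pr$x)[1:4, 1:4], 10)   # off-diagonal correlations are ~0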
An eigenvalue > 1 indicates that a PC accounts for more variance than one of the original variables in standardized data. This is commonly used as a cutoff point for which PCs are retained, and it holds true only when the data are standardized. We can also limit the number of components to the number that accounts for a certain fraction of the total variance, for example 70%.
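Both cutoffs can be computed by hand from a prcomp fit; a short sketch on the standardized ADHD responses:
# Sketch: Kaiser rule and cumulative variance computed by hand
pr  <- adhd.full %>% select(ADHD.Q1:ADHD.Q18) %>% prcomp(scale = TRUE)
eig <- pr$sdev^2                          # eigenvalues of the correlation matrix
sum(eig > 1)                              # Kaiser rule: PCs with eigenvalue > 1
which(cumsum(eig) / sum(eig) >= 0.70)[1]  # first PC reaching 70% cumulative variance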
This section focuses on the subset of the data containing only the individual responses to the ADHD questionnaire. The table below displays the first 10 eigenvalues obtained from the decomposition. Following from the cutoff description above, our selection of dimensions can be based on the number of scaled eigenvalues that are greater than 1 or on a certain percentage of cumulative variance explained. Another way to select the number of PCs to consider is to study the scree plot provided below, which is simply a visual representation of the variance explained by each component. We typically look for an elbow in the plot to make our selection. In our case, the scree plot elbow occurs at Dimensions = 2 and the eigenvalue threshold at Dimensions = 3, which account for 59.2% and 64.8% respectively.
adhd.full %>% select(ADHD.Q1:ADHD.Q18) %>%
prcomp(scale = TRUE) %>% get_eigenvalue() %>% head(10)
adhd.full %>% select(ADHD.Q1:ADHD.Q18) %>%
prcomp(scale = FALSE) %>%
fviz_eig(addlabels = TRUE)
We can study the individual contributions of each questionnaire response to the principal components using the plot below. We observe that the contributions of the variables to the first principal component (Dim. 1) are roughly equivalent. However, Q16 bears a lot of weight in the second dimension, as does Q5 in the third.
res.pca.adhd <- adhd.full %>% select(ADHD.Q1:ADHD.Q18) %>% prcomp(scale = FALSE)
corrplot(get_pca_var(res.pca.adhd)$contrib, is.corr=FALSE)
The cumulative contribution across the first 3 dimensions is summarized below. Q5 bears the most overall importance, followed by Q16 and Q4. It is worth diving into the actual questionnaire to look up what each question is asking and see if any insights can be drawn to explain the variability. The questions are listed below.
adhd.full %>% select(ADHD.Q1:ADHD.Q18) %>%
prcomp(scale = FALSE) %>%
fviz_contrib(choice = "var", axes = 1:3, top = 10)We can visualize these contribution in 2 dimensions using the first two PCs as shown below. On this plot, we look at the groupings and directions of the vectors. Positively correlated variables are grouped together. Negatively correlated variables are positioned on opposite sides of the plot origin. The distance between variables and the origin measures the contribution of the variables. We make the following observations:
fviz_pca_var(res.pca.adhd,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
Using the reduced dataset, we can use PCA to extract additional information about the patients. Since we are using total scores for this section, the variables must be scaled in order to avoid any overweighted contributions. We find that it takes 9 dimensions to explain approximately 70% of the variance. The scree plot has no obvious elbow which can be used to determine a cutoff. Using the eigenvalue cutoff at Dimension 8, we can still capture 66% of the variance.
adhd.red.complete %>%
prcomp(scale = TRUE) %>% get_eigenvalue() %>% head(10)
adhd.red.complete %>%
prcomp(scale = TRUE) %>%
fviz_eig(addlabels = TRUE)
From the correlation and variable contribution plots below, we can make a few observations:
res.pca <- adhd.red.complete %>% prcomp(scale = TRUE)
corrplot(get_pca_var(res.pca)$contrib, is.corr=FALSE)
Cumulatively for PC1 to PC8, we can observe below that Education and its derivative EducationLevel are the largest individual contributors to the principal components, followed by SubstDX, Age and Race. The ADHD and Mood Disorder questionnaires contribute roughly equally to the first 8 dimensions.
adhd.red.complete %>%
prcomp(scale = TRUE) %>%
fviz_contrib(choice = "var", axes = 1:8, top = 10)The 2D representation of the variable contributions to PC1 and PC2 are shown below. We note the following:
In summary, the most insightful observations from this PCA analysis might be that Suicide and Abuse are generally correlated with both ADHD and Mood Disorder. Additionally, Mood Disorder is more closely correlated with substance-related abuse (including Alcohol and THC), Violence and Conduct than ADHD is.
fviz_pca_var(res.pca,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
The biplot representation below displays the same information as above but overlays the individual patient numbers. This view allows the selection of individual observations for a deeper dive.
fviz_pca_biplot(res.pca,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
The primary goal of Support Vector Machines (SVM) is to determine a decision boundary between classes where the separating margin is maximized. There can be an infinite number of decision boundaries between classes. The points from each class closest to the decision boundary are called Support Vectors. SVM looks to find an optimal decision boundary such that Support Vectors from each class delineate a maximal margin subject to some constraints. Classification of a new unseen point is determined by which side of the decision boundary the point resides on.
For data that is not linearly separable or has overlapping classes, SVM uses kernel functions (the so-called kernel trick) to implicitly transform the feature space from n dimensions to a higher number of dimensions in order to achieve better separation of classes. The decision boundary in this higher-dimensional space is a separating hyperplane.
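As a hedged illustration on synthetic data (all names below are illustrative), a radial kernel can separate a circular pattern that no linear boundary can:
# Sketch: kernel trick on a toy, non-linearly-separable (circular) pattern
set.seed(1)
toy <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
toy$y <- factor(ifelse(toy$x1^2 + toy$x2^2 > 1.5, 1, 0))  # class depends on radius
fit_lin <- svm(y ~ ., data = toy, kernel = "linear", cost = 1)
fit_rbf <- svm(y ~ ., data = toy, kernel = "radial", cost = 1)
mean(fitted(fit_lin) == toy$y)  # near chance for the linear boundary
mean(fitted(fit_rbf) == toy$y)  # substantially higher with the radial kernel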
The strategy that we will use for SVM is to compare several models, specifically with and without Principal Component Analysis dimensionality reduction. For each model that is run, we tune the model parameters via a grid search of kernel type (linear, polynomial, radial, sigmoid), cost (2^1 through 2^8) and epsilon (0 to 1 in steps of 0.1).
The models use 10-fold cross-validation to prevent overfitting and arrive at the overall best parameters and accuracy rates below. Since SVM is a supervised learning method, we require a dependent variable to model; in this case, it will be the Suicide feature.
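For reference, the 10-fold cross-validation can be made explicit through tune.control (this is also e1071's default resampling); a sketch mirroring the tune calls below:
# Sketch: making the 10-fold CV used by tune() explicit
ctrl <- tune.control(sampling = "cross", cross = 10)
# e.g. tune(svm, Suicide ~ ., data = adhd_base_training,
#           ranges = list(kernel = c("linear", "radial")), tunecontrol = ctrl)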
The first approach builds an SVM model without any mathematical dimensionality reduction.
### Base Data
The initial SVM model will take the entire dataset as given. The ADHD and Mood Disorder total columns have been removed as they are collinear with some or all of the individual questions. In addition, Psych Meds has been removed because of the large number of incomplete values for this feature.
adhd_raw2 <- adhd_raw %>% select(-MD.TOTAL, -ADHD.Total, -Initial, -Psych.meds.)
adhd_raw2 <- adhd_raw2[complete.cases(adhd_raw2), ]
adhd_raw2$Suicide <- as.factor(adhd_raw2$Suicide)
set.seed(55)
trainIndex <- createDataPartition(adhd_raw2$Suicide, p = .8, list = FALSE, times = 1)
adhd_base_training <- adhd_raw2[ trainIndex,]
adhd_base_testing <- adhd_raw2[-trainIndex,]
The resulting model based on the base version of the dataset is shown below. With a radial kernel and C=2 we reach 71.9% training accuracy but a lower testing accuracy of 60.7%.
svm_base_m <- tune(svm, Suicide ~., data = adhd_base_training, ranges=list(
kernel=c("linear", "polynomial", "radial", "sigmoid"),
cost=2^(1:8),
epsilon = seq(0,1,0.1)))
svm_bm_best <- svm_base_m$best.model
svm_base_pred <- predict(svm_bm_best, newdata = adhd_base_testing, type="class")
svm_base_cm <- confusionMatrix(svm_base_pred, adhd_base_testing$Suicide)
base_method <- cbind(Reduction_Method = "Base Data", svm_base_m$best.parameters, Training_Accuracy = 1-svm_base_m$best.performance, Testing_Accuracy = sum(diag(svm_base_cm$table))/sum(svm_base_cm$table),
Features=ncol(adhd_base_training)-1,
Cumulative_Proportion=100)
base_method
For the next SVM model, we take the dataset with the ADHD and Mood Disorder questions removed. The totals for each remain within the dataset. In essence we have reduced dimensionality here, but we’ve done so with simple feature aggregation.
adhd.complete2 <- adhd.red.complete
adhd.complete2$Suicide <- as.factor(adhd.complete2$Suicide)
set.seed(55)
trainIndex <- createDataPartition(adhd.complete2$Suicide, p = .8, list = FALSE, times = 1)
adhd.training <- adhd.complete2[ trainIndex,]
adhd.testing <- adhd.complete2[-trainIndex,]
Prediction on the reduced dataset yields similar accuracy results to the above, with a small gain in training accuracy.
svm_m <- tune(svm, Suicide ~., data = adhd.training, ranges=list(
kernel=c("linear", "polynomial", "radial", "sigmoid"),
cost=2^(1:8),
epsilon = seq(0,1,0.1)))
svm_m_best <- svm_m$best.model
svm_pred <- predict(svm_m_best, newdata = adhd.testing, type="class")
svm_cm <- confusionMatrix(svm_pred, adhd.testing$Suicide)
noquestions_method <- cbind(Reduction_Method = "Reduced Dataset", svm_m$best.parameters, Training_Accuracy = 1-svm_m$best.performance, Testing_Accuracy = sum(diag(svm_cm$table))/sum(svm_cm$table),
Features=ncol(adhd.testing)-1,
Cumulative_Proportion=100)
noquestions_method
In the PCA reduction approach, we run SVM models by first reducing the dimensionality of the feature set. The reduced feature space is then passed to an SVM model for prediction of the Suicide class.
adhd_raw2 <- adhd_raw %>% select(-MD.TOTAL, -ADHD.Total, -Initial, -Psych.meds.)
adhd_raw2 <- adhd_raw2[complete.cases(adhd_raw2), ]
adhd_raw2$Suicide <- as.factor(adhd_raw2$Suicide)
set.seed(55)
trainIndex <- createDataPartition(adhd_raw2$Suicide, p = .8, list = FALSE, times = 1)
adhd_pca.training <- adhd_raw2[ trainIndex,]
adhd_pca.testing <- adhd_raw2[-trainIndex,]
head(adhd_raw2)
We run PCA with the inputs centered and scaled, since PCA performs a transformation of the feature space and unscaled variables would dominate it. PCA transforms the original feature space into 49 principal components, with 95% of the variance in the dataset cumulatively explained by 34 principal components. Since we scaled the original features, we also check the first component for normality and outliers; the Q-Q plot below looks fairly normal.
svm_pca_m <- prcomp(select(adhd_pca.training, -Suicide), center = TRUE, scale = TRUE)
summary(svm_pca_m)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 3.4453 2.2528 1.55906 1.37007 1.34635 1.28866 1.27728
## Proportion of Variance 0.2422 0.1036 0.04961 0.03831 0.03699 0.03389 0.03329
## Cumulative Proportion 0.2422 0.3458 0.39543 0.43374 0.47073 0.50462 0.53792
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 1.22380 1.19497 1.16886 1.08459 1.05487 1.04846 1.03414
## Proportion of Variance 0.03057 0.02914 0.02788 0.02401 0.02271 0.02243 0.02183
## Cumulative Proportion 0.56848 0.59762 0.62551 0.64951 0.67222 0.69466 0.71648
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 1.00751 0.96060 0.95823 0.91873 0.87611 0.8400 0.8340
## Proportion of Variance 0.02072 0.01883 0.01874 0.01723 0.01566 0.0144 0.0142
## Cumulative Proportion 0.73720 0.75603 0.77477 0.79199 0.80766 0.8221 0.8363
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 0.81062 0.78359 0.76540 0.69345 0.67529 0.64590 0.63526
## Proportion of Variance 0.01341 0.01253 0.01196 0.00981 0.00931 0.00851 0.00824
## Cumulative Proportion 0.84966 0.86219 0.87415 0.88396 0.89327 0.90178 0.91002
## PC29 PC30 PC31 PC32 PC33 PC34 PC35
## Standard deviation 0.62827 0.60068 0.59058 0.57195 0.56669 0.54155 0.52992
## Proportion of Variance 0.00806 0.00736 0.00712 0.00668 0.00655 0.00599 0.00573
## Cumulative Proportion 0.91808 0.92544 0.93256 0.93923 0.94579 0.95177 0.95750
## PC36 PC37 PC38 PC39 PC40 PC41 PC42
## Standard deviation 0.5050 0.48908 0.47771 0.44431 0.42168 0.39567 0.37971
## Proportion of Variance 0.0052 0.00488 0.00466 0.00403 0.00363 0.00319 0.00294
## Cumulative Proportion 0.9627 0.96759 0.97225 0.97628 0.97990 0.98310 0.98604
## PC43 PC44 PC45 PC46 PC47 PC48 PC49
## Standard deviation 0.36480 0.34923 0.32573 0.31662 0.2968 0.27565 0.24184
## Proportion of Variance 0.00272 0.00249 0.00217 0.00205 0.0018 0.00155 0.00119
## Cumulative Proportion 0.98876 0.99125 0.99341 0.99546 0.9973 0.99881 1.00000
qqnorm(svm_pca_m[["x"]][,1])
The PCA biplot of the first two principal components shows a large amount of overlap between the two Suicide classes and poor separability given just these two components.
fviz_pca_biplot(svm_pca_m, label="var",
habillage = adhd_pca.training$Suicide,
addEllipses = TRUE, palette = "jco")
To determine the number of principal components (factors/features) to keep, we look at three established methods:
The Kaiser rule states that the optimal factor reduction is achieved by selecting principal components that have an eigenvalue greater than 1, since an eigenvalue of 1 contains the same information as a single standardized variable.
Below are all dimensions/principal components that have an eigenvalue greater than 1:
#Kaiser rule: select PCs with eigenvalues of at least 1
reduced_dim <- get_eigenvalue(svm_pca_m) %>% filter(eigenvalue > 1)
reduced_dim
We create the training dataset from the dimensions retained by the Kaiser rule, combined with the respective Suicide class:
adhd_pca.training_kaiser <- cbind(as.data.frame(svm_pca_m$x[,c(1:nrow(reduced_dim))]), Suicide = adhd_pca.training$Suicide)
head(adhd_pca.training_kaiser)
We rotate the test data using the predict function, applying the same rotation as the training data and keeping only the components retained by the Kaiser rule:
adhd_pca.testing_kaiser <- cbind(as.data.frame(predict(svm_pca_m, newdata = select(adhd_pca.testing, -Suicide))[,c(1:nrow(reduced_dim))]),Suicide = adhd_pca.testing$Suicide)
head(adhd_pca.testing_kaiser)
Below are the optimal tuned hyperparameters with training and testing accuracy:
svm_pca_kaiser <- tune(svm, Suicide ~., data = adhd_pca.training_kaiser, ranges=list(
kernel=c("linear", "polynomial", "radial", "sigmoid"),
cost=2^(1:8),
epsilon = seq(0,1,0.1)))
svm_pca_kaiser_best <- svm_pca_kaiser$best.model
svm_pca_kaiser_pred <- predict(svm_pca_kaiser_best, newdata = adhd_pca.testing_kaiser, type="class")
svm_pca_kaiser_cm <- confusionMatrix(svm_pca_kaiser_pred, adhd_pca.testing_kaiser$Suicide)
kaiser_method <- cbind(Reduction_Method = "Kaiser Rule",svm_pca_kaiser$best.parameters, Training_Accuracy = 1-svm_pca_kaiser$best.performance,
Testing_Accuracy = (sum(diag(svm_pca_kaiser_cm$table))/sum(svm_pca_kaiser_cm$table)),
Features=ncol(adhd_pca.testing_kaiser)-1,
Cumulative_Proportion=max(reduced_dim["cumulative.variance.percent"]))
kaiser_method
The scree plot method suggests keeping the components up to the point where the plot forms an “elbow”. The scree plot shows that after the 3rd principal component, additional components successively diminish in explaining the variance, and the 3rd component is also where the “elbow” forms. Therefore, according to the scree method, in this case we should keep the first 3 principal components and discard the rest.
fviz_eig(svm_pca_m, addlabels = T)
We create the training dataset from the dimensions retained by the scree method, combined with the respective Suicide class:
adhd_pca.training_scree <- cbind(as.data.frame(svm_pca_m$x[,c(1:3)]), Suicide = adhd_pca.training$Suicide)
head(adhd_pca.training_scree)
We rotate the test data using the predict function, applying the same rotation as the training data and keeping only the components retained by the scree method:
adhd_pca.testing_scree <- cbind(as.data.frame(predict(svm_pca_m, newdata = select(adhd_pca.testing, -Suicide))[,1:3]),Suicide = adhd_pca.testing$Suicide)
head(adhd_pca.testing_scree)
Below are the optimal tuned hyperparameters with training and testing accuracy:
svm_pca_scree <- tune(svm, Suicide ~., data = adhd_pca.training_scree, ranges=list(
kernel=c("linear", "polynomial", "radial", "sigmoid"),
cost=2^(1:8),
epsilon = seq(0,1,0.1)))
svm_pca_scree_best <- svm_pca_scree$best.model
svm_pca_scree_pred <- predict(svm_pca_scree_best, newdata = adhd_pca.testing_scree, type="class")
svm_pca_scree_cm <- confusionMatrix(svm_pca_scree_pred, adhd_pca.testing_scree$Suicide)
scree_method <- cbind(Reduction_Method = "Scree Method", svm_pca_scree$best.parameters, Training_Accuracy = 1-svm_pca_scree$best.performance, Testing_Accuracy = sum(diag(svm_pca_scree_cm$table))/sum(svm_pca_scree_cm$table),
Features=ncol(adhd_pca.testing_scree)-1,
Cumulative_Proportion=(get_eigenvalue(svm_pca_m) %>% select(cumulative.variance.percent))[3,])
scree_method
Another method of component selection is ensuring that a given fraction of the variance, commonly at least two-thirds, is explained by the reduced feature set. In other words, keep the principal components that cumulatively explain the target fraction of the variation and discard the rest. In this case we opt to keep at least 90% of the cumulative variance explained, which reduces the feature set to 27 dimensions.
reduced_dim_90 <- get_eigenvalue(svm_pca_m) %>% filter(cumulative.variance.percent < 91)
reduced_dim_90
We create the training dataset from the dimensions retained by the proportion of variance explained method, combined with the respective Suicide class:
adhd_pca.training_90var <- cbind(as.data.frame(svm_pca_m$x[,c(1:nrow(reduced_dim_90))]), Suicide = adhd_pca.training$Suicide)
head(adhd_pca.training_90var)
We rotate the test data using the predict function, applying the same rotation as the training data and keeping only the components retained by the proportion of variance explained method:
adhd_pca.testing_90var <- cbind(as.data.frame(predict(svm_pca_m, newdata = select(adhd_pca.testing, -Suicide))[,c(1:nrow(reduced_dim_90))]),Suicide = adhd_pca.testing$Suicide)
head(adhd_pca.testing_90var)
Below are the optimal tuned hyperparameters with training and testing accuracy:
svm_pca_90var <- tune(svm, Suicide ~., data = adhd_pca.training_90var, ranges=list(
kernel=c("linear", "polynomial", "radial", "sigmoid"),
cost=2^(1:8),
epsilon = seq(0,1,0.1)))
svm_pca_90var_best <- svm_pca_90var$best.model
svm_pca_90var_pred <- predict(svm_pca_90var_best, newdata = adhd_pca.testing_90var, type="class")
svm_pca_90var_cm <- confusionMatrix(svm_pca_90var_pred, adhd_pca.testing_90var$Suicide)
method_90var <- cbind(Reduction_Method = "Proportion of Variance Explained", svm_pca_90var$best.parameters, Training_Accuracy = 1-svm_pca_90var$best.performance, Testing_Accuracy = sum(diag(svm_pca_90var_cm$table))/sum(svm_pca_90var_cm$table),
Features=ncol(adhd_pca.testing_90var)-1,
Cumulative_Proportion=max(reduced_dim_90["cumulative.variance.percent"]))
method_90var
It is interesting to see that the Questions Removed (reduced dataset) SVM model had the lowest training accuracy but the highest testing accuracy, with the two fairly close together; this represents the least overfit model. While the Questions Removed model produced the best accuracy, in practice this kind of manual aggregation may not always be possible, and it may be computationally preferable to reduce the dataset when working with a large set of features. In that scenario the scree method has the best results with the smallest number of principal components and would be the model of choice, as it is the most efficient.
# Combine the per-method results into one comparison table
svm_models <- rbind(base_method, noquestions_method, kaiser_method, scree_method, method_90var)
svm_models
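As an optional convenience, the comparison table can be ordered by testing accuracy (using the column names as constructed above):
# Sketch: order the model comparison by testing accuracy, best first
svm_models[order(-svm_models$Testing_Accuracy), ]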