The purpose of this assignment was to explore Clustering, Principal Component Analysis, and Support Vector Machines.
Clustering is used as a means of grouping a set of observations so that those in the same group are more similar to one another than to those in other groups. Principal Component Analysis is used as a means of variable reduction in which we aim to preserve as much useful information as possible. In short, we trade a little accuracy for simplicity. Support Vector Machines, on the other hand, are a supervised learning technique used for classification and regression analysis. We're going to apply them as a means of classification based on one of the variables in our set (Suicide).
We’ll perform a relatively in-depth exploratory data analysis (EDA) to better familiarize ourselves with the data at hand and then we’ll explore each of the aforementioned methods.
We'll group our patients via k-means / hierarchical clustering and profile these groups based on shared characteristics, focus upon a subset of our data (ie. ADHD or mood disorder) in applying Principal Component Analysis, and apply a Support Vector Machine model to classify whether or not a patient has attempted suicide.
……………………………………………………………………..
The data at hand is based on a real life research project and deals with a number of measures of mental health and well-being - attention deficit hyperactivity disorder (ADHD), mood / mental disorder (MD), substance abuse, etc.
Being that our exploration and application of methods can take on many forms and head in a nearly infinite number of directions, the room for creativity is without bounds.
We start by loading in our data and utilizing built-in head() and dim() functions for an early read.
#Load in data
#data <- read.csv("https://raw.githubusercontent.com/SubhalaxmiRout002/Data-622-Group-5/main/HW%204/ADHD_data.csv")
data <- read.csv("ADHD_data.csv")
#head(data)
#dim(data)
From our initial read of our dataset's dimensions and first 6 entries, we observe that we're dealing with a 175 observation x 54 variable dataset. Our initial dataset is broad and may be difficult to manage / make sense of. With the motivation of clarification in mind, we thus break our larger dataset into subsets:
We may end up merging these subsets, pulling individual variables, or exploring individual subsets later but this provides a means of effectively recognizing where related variables lie while simultaneously providing smaller, more manageable dataframes to explore and work upon.
Once these dataframes are created, we see that adhd and md have total score variables whereas sub does not, even though its data (across variables) maintains a consistent scoring system. Thus, we take it upon ourselves to add this variable for the sake of consistency:
genl <- data[c(1:4,23,39)]
#head(genl) #drop initial, add sub and hstry (all vars or "total score")
#we may want to use these separately (ie. for Q2)
adhd <- data[c(5:23)]
md <- data[c(24:39)]
sub <- data[c(40:45)]
hstry <- data[c(46:54)]
sub$SUB.Total <- rowSums(sub) #add Total column to sub
#Finalize genl dataframe (for EDA)
genl$SUB.Total <- sub$SUB.Total #add SUB.Total to genl
genl <- genl[c(-1)] #drop Initial
genl <- cbind(genl, hstry) #add history variables back in with general
head(genl)
## Age Sex Race ADHD.Total MD.TOTAL SUB.Total Court.order Education
## 1 24 1 1 40 15 3 1 11
## 2 48 2 1 55 14 0 0 14
## 3 51 2 1 31 5 0 0 12
## 4 43 1 1 45 13 4 0 12
## 5 34 1 1 48 7 2 1 9
## 6 39 2 1 55 14 1 0 11
## Hx.of.Violence Disorderly.Conduct Suicide Abuse Non.subst.Dx Subst.Dx
## 1 0 1 1 0 2 0
## 2 0 0 1 4 1 0
## 3 0 0 0 6 2 0
## 4 0 0 1 7 2 0
## 5 1 1 1 0 2 0
## 6 0 1 1 2 0 0
## Psych.meds.
## 1 2
## 2 1
## 3 1
## 4 2
## 5 0
## 6 0
We subset our dataframes (as noted above), add a “total” column for sub, and finalize our genl dataframe by merging the total column for sub, dropping the Initial variable, and adding all hstry variables back in.
We lean on the data dictionary (provided with our .xlsx file) to outline the variables at hand:
- Age: quantitative variable representing the individual's age.
- Sex: categorical variable representing the individual's sex (male-1, female-2).
- Race: categorical variable representing the individual's race (white-1, african american-2, hispanic-3, asian-4, native american-5, other / missing data-6).
- ADHD.Total: quantitative variable representing the individual's total self-report score on the ADHD scale. It's the cumulative score for 18 questions with a scoring metric of never-0, rarely-1, sometimes-2, often-3, and very often-4.
- MD.TOTAL: quantitative variable representing the individual's total self-report score for the mental disorder questions. It's the cumulative score for 15 questions with a scoring metric of no-0, yes-1; question 3: no problem-0, minor-1, moderate-2, and serious-3.
- SUB.Total: quantitative variable representing the individual's total self-report score for substance abuse related questions. It's the cumulative score across 6 categories with a scoring metric of no use-0, use-1, abuse-2, and dependence-3.
- Court.order: categorical variable representing whether the individual's case was court ordered (No-0, Yes-1).
- Education: categorical variable representing the individual's level of education (1-12 grade, 13+ college).
- Hx.of.Violence: categorical variable representing whether the individual has a history of violence (No-0, Yes-1).
- Disorderly.Conduct: categorical variable representing whether the individual has a record of disorderly conduct (No-0, Yes-1).
- Suicide: categorical variable representing whether the individual has attempted suicide in the past (No-0, Yes-1).
- Abuse: categorical variable representing whether the individual has a history of abuse (No-0, Physical (P)-1, Sexual (S)-2, Emotional (E)-3, P&S-4, P&E-5, S&E-6, P&S&E-7).
- Non.subst.Dx: categorical variable representing whether the individual has a non-substance diagnosis (0 - none; 1 - one; 2 - more than one).
- Subst.Dx: categorical variable representing the individual's substance diagnoses (0 - none; 1 - one substance-related; 2 - two; 3 - three or more).
- Psych.meds.: categorical variable representing whether the individual has been prescribed psychiatric medication (0 - none; 1 - one psychotropic med; 2 - more than one psychotropic med).

The resulting dataframe (and variables) are shown above, and at this point we're prepared to more deeply explore our genl dataset.
To do so, we utilize the built-in summary() and glimpse() functions:
## Age Sex Race ADHD.Total MD.TOTAL
## Min. :18.00 Min. :1.000 Min. :1.00 Min. : 0.00 Min. : 0.00
## 1st Qu.:29.50 1st Qu.:1.000 1st Qu.:1.00 1st Qu.:21.00 1st Qu.: 6.50
## Median :42.00 Median :1.000 Median :2.00 Median :33.00 Median :11.00
## Mean :39.47 Mean :1.434 Mean :1.64 Mean :34.32 Mean :10.02
## 3rd Qu.:48.00 3rd Qu.:2.000 3rd Qu.:2.00 3rd Qu.:47.50 3rd Qu.:14.00
## Max. :69.00 Max. :2.000 Max. :6.00 Max. :72.00 Max. :17.00
##
## SUB.Total Court.order Education Hx.of.Violence
## Min. : 0.000 Min. :0.00000 Min. : 6.0 Min. :0.0000
## 1st Qu.: 3.000 1st Qu.:0.00000 1st Qu.:11.0 1st Qu.:0.0000
## Median : 3.000 Median :0.00000 Median :12.0 Median :0.0000
## Mean : 3.883 Mean :0.08823 Mean :11.9 Mean :0.2439
## 3rd Qu.: 6.000 3rd Qu.:0.00000 3rd Qu.:13.0 3rd Qu.:0.0000
## Max. :13.000 Max. :1.00000 Max. :19.0 Max. :1.0000
## NA's :4 NA's :5 NA's :9 NA's :11
## Disorderly.Conduct Suicide Abuse Non.subst.Dx
## Min. :0.0000 Min. :0.0000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000
## Median :1.0000 Median :0.0000 Median :0.000 Median :0.0000
## Mean :0.7256 Mean :0.3025 Mean :1.329 Mean :0.4379
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :7.000 Max. :2.0000
## NA's :11 NA's :13 NA's :14 NA's :22
## Subst.Dx Psych.meds.
## Min. :0.000 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:0.0000
## Median :1.000 Median :1.0000
## Mean :1.138 Mean :0.9649
## 3rd Qu.:2.000 3rd Qu.:2.0000
## Max. :3.000 Max. :2.0000
## NA's :23 NA's :118
## Rows: 175
## Columns: 15
## $ Age <int> 24, 48, 51, 43, 34, 39, 41, 48, 44, 27, 44, 56, 53…
## $ Sex <int> 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 1, 1, 2, 1, 2, 2, 2,…
## $ Race <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ ADHD.Total <int> 40, 55, 31, 45, 48, 55, 54, 41, 56, 56, 42, 38, 31…
## $ MD.TOTAL <int> 15, 14, 5, 13, 7, 14, 9, 7, 12, 11, 16, 0, 11, 10,…
## $ SUB.Total <dbl> 3, 0, 0, 4, 2, 1, 9, 0, 2, 3, 10, 0, 2, 0, 4, 3, 0…
## $ Court.order <int> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ Education <int> 11, 14, 12, 12, 9, 11, 12, 16, 12, 9, 12, 18, 12, …
## $ Hx.of.Violence <int> 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0,…
## $ Disorderly.Conduct <int> 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0,…
## $ Suicide <int> 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ Abuse <int> 0, 4, 6, 7, 0, 2, 4, 0, 0, 2, 4, 2, 0, 0, 0, 5, 1,…
## $ Non.subst.Dx <int> 2, 1, 2, 2, 2, 0, 1, 2, 1, 1, 1, 2, 0, 1, 0, 2, 2,…
## $ Subst.Dx <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 2, 2, 0, 0, 0, 1, 1, 0,…
## $ Psych.meds. <int> 2, 1, 1, 2, 0, 0, 1, 2, 1, 0, 0, 1, 0, 0, 0, 2, 1,…
From the above outputs we can extend that:
- The average age of respondents is ~40, there are slightly more male than female respondents, and the average education level is ~12 (finishing high school).
- The average ADHD.Total was ~34, the average MD.TOTAL was ~10, and the average SUB.Total was ~4.
- Disorderly.Conduct appears to be more common than Hx.of.Violence, which in turn is more common than Court.order.
- Suicide attempts and some form of abuse (ie. physical) appear to be common.
- The average Non.subst.Dx was ~0.4 while the average Subst.Dx was ~1.1, giving the impression that we're dealing with more substance diagnoses than non-substance diagnoses.
- Psych.meds is missing the majority of its values, and NA values will have to be accounted for (in general).

Based on this information, we drop Psych.meds, drop remaining NA values, and observe the effect on our dataset:
genl <- genl[c(-15)] #drop Psych.meds
genl <- genl %>% tibble() %>% drop_na() #drop NA values
#revisit genl
dim(genl) #175 - 142 = 33 dropped rows
## [1] 142 14
## Age Sex Race ADHD.Total
## 0 0 0 0
## MD.TOTAL SUB.Total Court.order Education
## 0 0 0 0
## Hx.of.Violence Disorderly.Conduct Suicide Abuse
## 0 0 0 0
## Non.subst.Dx Subst.Dx
## 0 0
Our dataset has been reduced from 175 observations x 15 variables to 142 observations x 14 variables and from numerous NA values across multiple variables to no NA values.
With NA values dealt with, we ensure each variable is of the proper type and then visualize our variable distributions based on whether they're categorical or numeric:
#convert features to factor
genl$Sex <- as.factor(genl$Sex)
genl$Race <- as.factor(genl$Race)
genl$Court.order <- as.factor(genl$Court.order)
genl$Education <- as.factor(genl$Education)
genl$Hx.of.Violence <- as.factor(genl$Hx.of.Violence)
genl$Disorderly.Conduct <- as.factor(genl$Disorderly.Conduct)
genl$Suicide <- as.factor(genl$Suicide)
genl$Abuse <- as.factor(genl$Abuse)
genl$Non.subst.Dx <- as.factor(genl$Non.subst.Dx)
genl$Subst.Dx <- as.factor(genl$Subst.Dx)
#convert features to numeric
genl$Age <- as.numeric(genl$Age)
genl$ADHD.Total <- as.numeric(genl$ADHD.Total)
genl$MD.TOTAL <- as.numeric(genl$MD.TOTAL)
genl$SUB.Total <- as.numeric(genl$SUB.Total)
#head(genl) #verify conversions
#visualize categorical distributions as a table
fig1 <- inspectdf::inspect_imb(genl)
fig1
## # A tibble: 10 x 4
## col_name value pcnt cnt
## <chr> <chr> <dbl> <int>
## 1 Court.order 0 92.3 131
## 2 Hx.of.Violence 0 73.9 105
## 3 Disorderly.Conduct 1 70.4 100
## 4 Suicide 0 68.3 97
## 5 Non.subst.Dx 0 64.8 92
## 6 Abuse 0 63.4 90
## 7 Race 2 55.6 79
## 8 Sex 1 54.2 77
## 9 Education 12 38.7 55
## 10 Subst.Dx 1 38.7 55
From the categorical variable table above we see that ~92% of respondents' cases were not court ordered, ~74% of respondents do not have a history of violence, ~70% of respondents have a history of disorderly conduct, ~68% of respondents have not attempted suicide, ~65% of respondents do not have a non-substance diagnosis, ~63% of respondents do not have a history of abuse, ~56% of respondents are African American, ~54% of respondents are male, ~39% of respondents have grade 12 (high school) as their highest education level, and ~39% of respondents have one substance-related diagnosis. The fact that Education and Subst.Dx have the same pcnt values is interesting and worth noting.
Moving on to our numeric distributions:
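The histogram chunk itself isn't echoed in the report; below is a minimal sketch of how these distributions might have been produced, assuming the same inspectdf package already used for the categorical table:

#Visualize numeric distributions as histograms (sketch, not the original chunk)
genl %>%
  inspectdf::inspect_num() %>%
  inspectdf::show_plot()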
From the histograms above we see that:
- ADHD.Total has a left-skewed distribution with a peak ~45.
- Age appears to be multimodal with peaks ~25, ~35, and ~50.
- MD.TOTAL has a non-normal distribution with its mass concentrated above 15 (at the far right).
- SUB.Total appears to have a multimodal distribution with its largest peak at ~3.

With insight gained via variable distributions, we move on to observing the pair plot and correlation matrix for numeric variables:
#Correlation matrix for numeric variables
genl_corr <- genl %>%
select_if(is.numeric) %>%
cor()
corrplot(genl_corr, title="Numeric Variable Correlation", type = "lower", mar=c(0,0,1,0))
The pair plot and correlation matrix highlight the fact that ADHD.Total and MD.TOTAL appear to have strong correlation. We'll note this fact while carrying both variables forward.
……………………………………………………………………..
We elected to utilize hierarchical clustering, being that it's more adaptive than k-means clustering. For k-means clustering, we have to pre-define our k, whereas for hierarchical clustering both the value of k and the location of our centroids (the mid-point of all data points in a group) are determined as part of the process.
We proceed by computing the distances between all data points (using Euclidean distance), clustering based on centroid distance and the variance in each cluster, and then plotting the corresponding dendrogram:
set.seed(123)
#compute distances between all data points (using euclidean distance)
distances <- dist(genl, method = 'euclidean')
#cluster based on distances using centroid distance as well as variance in clusters
clusterHealth <- hclust(distances, method = 'ward.D') #ward.D accounts for centroid distance + variance
#plot the dendrogram
plot(clusterHealth)
Based on our dendrogram, it appears that the ideal number of clusters is 3, 4, or 5. We proceed with 5 clusters, being that it's the highest of the three and we'd like to better represent the variation we observed during our exploratory data analysis. We want to better capture the distinct characteristics of our clustered groups.
We "cut the tree" and demarcate our groups based on the desired number of clusters, calculate each numeric variable's mean value for each cluster, and then output the resulting table:
clusterGroups <- cutree(clusterHealth, 5)
#clusterGroups #verify the breakdown per observation
#Mean value for numeric variables
#tapply(genl$Age, clusterGroups, mean, simplify=TRUE) #avg age for each cluster
#tapply(genl$ADHD.Total, clusterGroups, mean) #avg ADHD score for each cluster
#tapply(genl$MD.TOTAL, clusterGroups, mean) #avg MD score for each cluster
#tapply(genl$SUB.Total, clusterGroups, mean) #avg SUB score for each cluster
#Extract corresponding values (to place in easy to understand form):
avg_age <- c(27.06, 48.95, 41.42, 47.4, 25.22)
avg_adhd <- c(49.34, 53.52, 34.12, 15.93, 6.56)
avg_md <- c(12.84, 12.71, 10.58, 7.27, 4.11)
avg_sub <- c(4.13, 3.28, 4.1, 4.17, 3.44)
#Build kable table
group_names <- c("Group 1", "Group 2", "Group 3", "Group 4", "Group 5")
df <- data.frame(group_names, avg_age, avg_adhd, avg_md, avg_sub)
df %>%
kbl() %>%
kable_styling(latex_options = "striped")
| group_names | avg_age | avg_adhd | avg_md | avg_sub |
|---|---|---|---|---|
| Group 1 | 27.06 | 49.34 | 12.84 | 4.13 |
| Group 2 | 48.95 | 53.52 | 12.71 | 3.28 |
| Group 3 | 41.42 | 34.12 | 10.58 | 4.10 |
| Group 4 | 47.40 | 15.93 | 7.27 | 4.17 |
| Group 5 | 25.22 | 6.56 | 4.11 | 3.44 |
Based on the output above we'd profile our groups as follows: Group 1 is young (~27) with high ADHD and MD scores; Group 2 is older (~49) with the highest ADHD score and a similarly high MD score; Group 3 is middle-aged (~41) with moderate scores; Group 4 is older (~47) with a low ADHD score and a moderate MD score; and Group 5 is young (~25) with the lowest ADHD and MD scores.
It was interesting to profile our groups and observe that there appeared to be 2 tiers in the earlier and later years. It was also interesting to see that all groups had a relatively low incidence of substance use / dependency. Due to its apparent low impact as a differentiating factor for our clusters it was not discussed.
In the Appendix, we’ve also provided code (with a more succinct write up) for our application of K-means clustering analysis to the genl (general) dataset.
……………………………………………………………………..
Principal Component Analysis (PCA) is a way of reducing the dimensions of a given dataset by extracting new features from the original features present in the dataset. The "new" variables after PCA are all uncorrelated with one another. We elected to perform Principal Component Analysis on the ADHD self-report columns, the MD (mood disorder) questionnaire, and the entire dataset.
We start with the Attention Deficit Hyperactivity Disorder (ADHD) symptom variables. During EDA, we’d subset datasets based on question type and thus we proceed with the adhd dataset.
There are 19 variables in the adhd dataset: 18 pertaining to self-reported ADHD question responses and 1 containing the total score for that row. All ADHD questions capture categorical responses (never-0, rarely-1, sometimes-2, often-3, and very often-4), whereas ADHD.Total is a quantitative (sum) variable representing the individual's total self-report score on the ADHD scale.
We verify that our data does not have missing values:
## ADHD.Q1 ADHD.Q2 ADHD.Q3 ADHD.Q4 ADHD.Q5 ADHD.Q6 ADHD.Q7
## 0 0 0 0 0 0 0
## ADHD.Q8 ADHD.Q9 ADHD.Q10 ADHD.Q11 ADHD.Q12 ADHD.Q13 ADHD.Q14
## 0 0 0 0 0 0 0
## ADHD.Q15 ADHD.Q16 ADHD.Q17 ADHD.Q18 ADHD.Total
## 0 0 0 0 0
Missing values would disturb the implementation of PCA (where we’d have to drop / impute corresponding values), and we confirm that our data has 0 missing values across all variables.
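The check itself isn't echoed; a minimal sketch of what likely produced the counts above:

#Count missing values per ADHD variable (sketch)
colSums(is.na(adhd))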
We then perform PCA via the prcomp() function and review the corresponding statistics:
# Perform PCA and review summary statistics
pca_adhd <- prcomp(adhd, scale. = TRUE)
summary(pca_adhd)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 3.2092 1.16479 1.00701 0.89915 0.85060 0.77076 0.75157
## Proportion of Variance 0.5421 0.07141 0.05337 0.04255 0.03808 0.03127 0.02973
## Cumulative Proportion 0.5421 0.61345 0.66683 0.70938 0.74746 0.77872 0.80845
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.70763 0.66793 0.64292 0.63647 0.60782 0.59497 0.5375
## Proportion of Variance 0.02635 0.02348 0.02175 0.02132 0.01944 0.01863 0.0152
## Cumulative Proportion 0.83481 0.85829 0.88004 0.90136 0.92081 0.93944 0.9546
## PC15 PC16 PC17 PC18 PC19
## Standard deviation 0.52570 0.46267 0.4529 0.40571 0.04007
## Proportion of Variance 0.01455 0.01127 0.0108 0.00866 0.00008
## Cumulative Proportion 0.96919 0.98046 0.9912 0.99992 1.00000
From the Importance of components, we can interpret:
- Standard deviation as a measure of variability within a single component (ie. the deviation of data within PC1).
- Proportion of Variance as a measure of the component's share of the dataset's total variability (ie. the Proportion of Variance row sums to 1.00).
- Cumulative Proportion as the cumulative variability represented by all principal components up to and including the current column.

From these definitions we can extend that the first 3 components (PC1, PC2, and PC3) represent ~66.7% of the variability in the data, and the remaining components add only minor increments to our representative capability (all the way to PC19).
Next, we visit the scree plot:
If we recall from statistics: \(var = sd^2\), where \(var\) represents variance and \(sd\) represents standard deviation.
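The scree plot code isn't echoed; a minimal sketch using factoextra (used elsewhere in this report):

#Scree plot: variance explained by each principal component (sketch)
factoextra::fviz_eig(pca_adhd, addlabels = TRUE)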
From the scree plot above, we confirm our earlier finding that PC1, PC2, and PC3 appear to be enough to represent our larger dataset.
To explore how our variables (original and PCA) relate to one another, we bind PC1, PC2, and PC3 to our original adhd dataset, calculate variable correlation and output the result as a table:
# combine datasets
adhd_df2 <- cbind(adhd, pca_adhd$x)
# %>% column_spec(which(ADHD_df_2[,19] > 0.69) , color = 'white', background = 'maroon')
# cor plot
cor(adhd, adhd_df2[, 20:22]) %>% kableExtra::kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), latex_options="scale_down")
|  | PC1 | PC2 | PC3 |
|---|---|---|---|
| ADHD.Q1 | 0.6980086 | -0.2205533 | 0.3379874 |
| ADHD.Q2 | 0.7087355 | -0.3229617 | 0.2203357 |
| ADHD.Q3 | 0.6730084 | -0.3613418 | 0.1294906 |
| ADHD.Q4 | 0.7363824 | -0.3583042 | 0.1760306 |
| ADHD.Q5 | 0.7279360 | -0.1324912 | -0.4384661 |
| ADHD.Q6 | 0.6490774 | 0.0323714 | -0.3655759 |
| ADHD.Q7 | 0.7393470 | -0.1582908 | 0.0755233 |
| ADHD.Q8 | 0.8025831 | -0.1275705 | 0.1186556 |
| ADHD.Q9 | 0.7958069 | -0.0225645 | 0.0772789 |
| ADHD.Q10 | 0.7919919 | -0.0589730 | 0.1039090 |
| ADHD.Q11 | 0.7411649 | -0.1609594 | -0.1413902 |
| ADHD.Q12 | 0.6869949 | 0.2492497 | 0.1772878 |
| ADHD.Q13 | 0.7627239 | -0.0183180 | -0.4475655 |
| ADHD.Q14 | 0.7185091 | 0.0224583 | -0.3740732 |
| ADHD.Q15 | 0.6382790 | 0.3882041 | 0.0003570 |
| ADHD.Q16 | 0.6736272 | 0.5359116 | 0.1555591 |
| ADHD.Q17 | 0.6547685 | 0.4287821 | 0.1110080 |
| ADHD.Q18 | 0.7104675 | 0.4124523 | 0.1129293 |
| ADHD.Total | 0.9988446 | 0.0127319 | -0.0120590 |
From the correlation table we can extend that:
- All variables have a strong positive correlation with PC1, and it appears that PC1 is nearly identical to ADHD.Total based on their correlation of ~1.00.
- ADHD.Total has a slight positive correlation with PC2.
- ADHD.Total has a slight negative correlation with PC3.

The correlations above provide an indication as to the makeup of our principal components. Variables with a stronger positive correlation contribute more to a component, while those with a stronger negative correlation contribute in the opposite direction.
From our correlation table, we move on to the corresponding biplot.
When our first two PCs explain a significant portion of the variance in our data (which is the case for our application), we can make use of a bi-plot.
A bi-plot overlays the scores plot, a projection of observations onto the span of the first two PCs, and a loadings plot, a projection of the variable vectors onto the span of the PCs, in a single graphic:
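The biplot calls aren't echoed; a minimal sketch using base R's biplot(), whose choices argument selects the pair of components plotted:

#Biplots of the first three principal components, pair by pair (sketch)
biplot(pca_adhd, choices = c(1, 2)) #PC1 vs. PC2
biplot(pca_adhd, choices = c(2, 3)) #PC2 vs. PC3
biplot(pca_adhd, choices = c(1, 3)) #PC1 vs. PC3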
Points are projected observations while vectors are projected variables. In the above plot, the red arrows represent the eigenvectors of each variable helping us to interpret the relationship between principal components while the numbers in black represent our observations.
With regard to reading the plot, the left and bottom axes are used to read PCA scores (of the dots) while the top and right axes are used to read how strongly each vector influences the PC.
From the above biplot, we can extend that:
* the observations appear to play a larger role in the makeup of PC1 than that of PC2,
* ADHD.Tot has the most weight for PC1, and
* Q15:Q18 carry a lot of positive weight while Q2:Q4 carry negative weight for PC2.
Next, we move on to the bi-plot for PC2 and PC3:
From the PC2-PC3 biplot we can extend that some observations play a larger role in PC2 vs. PC3 (ie. 60) while others play a larger role in PC3 vs. PC2 (ie. 144), and that the variables split their weight between PC2 and PC3.

Finally, we explore the bi-plot for PC1 and PC3:
The PC1-PC3 bi-plot has a similar layout to our first: the observations appear to play a larger role in the makeup of PC1 than that of PC3, and ADHD.Tot again carries a lot of weight for PC1, mirroring what we saw with PC1 and PC2.

These plots are useful, but as a possible improvement we'll explore varimax.
Varimax rotation is used to simplify the expression of a particular sub-space. Coordinates remain unchanged while the variance of the squared loadings is maximized.
We apply the varimax() function to the rotation variable of our principal components and plot the correlation of corresponding loadings:
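The call itself isn't echoed; a minimal sketch consistent with the var_adhd object referenced below:

#Varimax rotation of the PCA loadings (sketch)
var_adhd <- varimax(pca_adhd$rotation)
var_adhd$loadings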
##
## Loadings:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 PC13 PC14 PC15
## ADHD.Q1 -1
## ADHD.Q2 1
## ADHD.Q3 1
## ADHD.Q4 -1
## ADHD.Q5 -1
## ADHD.Q6 -1
## ADHD.Q7 -1
## ADHD.Q8
## ADHD.Q9 1
## ADHD.Q10 -1
## ADHD.Q11 -1
## ADHD.Q12 1
## ADHD.Q13
## ADHD.Q14
## ADHD.Q15 -1
## ADHD.Q16 1
## ADHD.Q17 -1
## ADHD.Q18 -1
## ADHD.Total
## PC16 PC17 PC18 PC19
## ADHD.Q1
## ADHD.Q2
## ADHD.Q3
## ADHD.Q4
## ADHD.Q5
## ADHD.Q6
## ADHD.Q7
## ADHD.Q8 1
## ADHD.Q9
## ADHD.Q10
## ADHD.Q11
## ADHD.Q12
## ADHD.Q13 -1
## ADHD.Q14 -1
## ADHD.Q15
## ADHD.Q16
## ADHD.Q17
## ADHD.Q18
## ADHD.Total -1
##
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
## SS loadings 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
## Proportion Var 0.053 0.053 0.053 0.053 0.053 0.053 0.053 0.053 0.053 0.053
## Cumulative Var 0.053 0.105 0.158 0.211 0.263 0.316 0.368 0.421 0.474 0.526
## PC11 PC12 PC13 PC14 PC15 PC16 PC17 PC18 PC19
## SS loadings 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
## Proportion Var 0.053 0.053 0.053 0.053 0.053 0.053 0.053 0.053 0.053
## Cumulative Var 0.579 0.632 0.684 0.737 0.789 0.842 0.895 0.947 1.000
#var_adhd$loadings #corrplot is more succinct, easier to interpret
corrplot(var_adhd$loadings[,1:3], is.corr = FALSE)
We elect to interpret the correlation plot (over the loadings output) due to its clarity and succinctness. From the above output, we can see that:
- PC1 is strongly (negatively) correlated with ADHD.Q18,
- PC2 is strongly (positively) correlated with ADHD.Q16, and
- PC3 is strongly (negatively) correlated with ADHD.Q5.

Although these results conflict with our earlier plot interpretations, one cannot argue with the simplicity of concise code. Varimax provided in 2 lines of code what took multiple plots, iterations, and interpretations to arrive at via biplots: the strongest correlate for each principal component.
In the Appendix, we’ve also provided code (with a more succinct write up) for our application of principal component analysis to the md (mood disorder) and genl (general) datasets.
……………………………………………………………………..
A Support Vector Machine (SVM) is a supervised machine learning algorithm that creates a hyperplane / decision boundary based on labeled data. Support vectors are the data points closest to the hyperplane, that influence the hyperplane’s location and orientation. Using support vectors, we maximize the margin, the space between prospective classes, by ensuring that our classifier is equidistant from the closest observation in each class.
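As a toy illustration of these ideas (hypothetical, simulated data, not the assignment data), a minimal sketch fitting a linear-kernel SVM with e1071:

#Toy illustration: fit a linear SVM and inspect its support vectors (hypothetical data)
set.seed(1)
toy <- data.frame(x1 = rnorm(40), x2 = rnorm(40),
                  y = factor(rep(c(0, 1), each = 20)))
toy[toy$y == 1, c("x1", "x2")] <- toy[toy$y == 1, c("x1", "x2")] + 1.5 #shift class 1
fit <- e1071::svm(y ~ ., data = toy, kernel = "linear", cost = 1)
summary(fit) #reports the number of support vectors
plot(fit, toy) #decision boundary with support vectors marked

With that intuition in place, we return to preparing the assignment data.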
First, we identify NAs using vis_miss from the visdat package. This graph shows the percentage of missing values in each column, and the bottom section shows the total percentage of missing values in the data.
We can see that the most NAs are in the Psych.meds. column. We replace NAs with the mode (and, for a few columns, the mean) via functions we create and apply to the dataset:
mode <- function(df, ...) {
  #Return the most frequent value (the statistical mode) as a character string
  tb <- table(df)
  which(tb == max(tb)) %>% sort %>% `[`(., 1) %>% names
}
impute <- function(df, fun) {
  #Replace NAs in a vector with the value returned by fun (ie. mode or mean)
  fval <- fun(df, na.rm = TRUE)
  y <- ifelse(is.na(df), fval, df)
  return(y)
}
data_all <- data %>%
select(-Initial) %>%
mutate(
Alcohol = impute(Alcohol, mode),
THC = impute(THC, mode),
Cocaine = impute(Cocaine, mode),
Stimulants = impute(Stimulants, mode),
Sedative.hypnotics = impute(Sedative.hypnotics, mean),
Opioids = impute(Opioids, mean),
Court.order = impute(Court.order, mean),
Education = impute(Education, mode),
Hx.of.Violence = impute(Hx.of.Violence, mode),
Disorderly.Conduct = impute(Disorderly.Conduct, mode),
Suicide = impute(Suicide, mode),
Abuse = impute(Abuse, mode),
Non.subst.Dx = impute(Non.subst.Dx, mode),
Subst.Dx = impute(Subst.Dx, mode),
Psych.meds. = impute(Psych.meds., mode)
) %>%
mutate_if(is.character, as.numeric)
visdat::vis_miss(data_all)
No missing values remain in the dataset. We can now perform EDA on the suicide variable.
The graphs below show the suicide percentage across different categories.
We plot suicide % over categories such as Age, Sex, Education, Cocaine, Hx.of.Violence, Disorderly.Conduct, Abuse, Non.subst.Dx, Subst.Dx, and Psych.meds.
p <- data_all %>% select(c('Age','Suicide'))
setDT(p)[Age <10, agegroup := "0-10"]
p[Age >=10 & Age <20, agegroup := "10-20"]
p[Age >=20 & Age <30, agegroup := "20-30"]
p[Age >=30 & Age <40, agegroup := "30-40"]
p[Age >=40 & Age <50, agegroup := "40-50"]
p[Age >=50 & Age <60, agegroup := "50-60"]
p[Age >=60 & Age <70, agegroup := "60-70"]
PlotXTabs2(p, agegroup, Suicide, results.subtitle = FALSE, title ="Suicide Attempt Percentage by Age Group")
The 30-40 age group has the highest suicide attempt percentage, with the 40-50 age group second highest.
p <- data_all %>% select(c('Sex','Suicide'))
PlotXTabs2(p, Sex, Suicide, results.subtitle = FALSE, title ="Suicide Attempt Percentage by Sex")
Females attempted suicide more than males.
p <- data_all %>% select(c('Race','Suicide'))
PlotXTabs2(p, Race, Suicide, results.subtitle = FALSE, title ="Suicide Attempt Percentage by Race")
Suicide attempts by race: an interesting point here is that Hispanics show 0% suicide because there is only 1 observation in that category. Similarly, the "other or missing data" category has only 2 observations, so, due to lack of data, it shows 50% suicide.
p <- data_all %>% select(c('Education','Suicide'))
PlotXTabs2(p, Education, Suicide, results.subtitle = FALSE, title ="Suicide Attempt Percentage by Level of Education")
More highly educated respondents were less likely to have attempted suicide than others.
p <- data_all %>% select(c('Hx.of.Violence','Suicide'))
PlotXTabs2(p, Hx.of.Violence, Suicide, results.subtitle = FALSE, title ="Suicide Attempt Percentage by Violence")
Respondents with a history of violence attempted suicide more than those without.
p <- data_all %>% select(c('Disorderly.Conduct','Suicide'))
PlotXTabs2(p, Disorderly.Conduct, Suicide, results.subtitle = FALSE, title ="Suicide Attempt Percentage by Disorderly Conduct")
This is a little surprising: respondents without disorderly conduct attempted suicide more than those with disorderly conduct.
p <- data_all %>% select(c('Abuse','Suicide'))
PlotXTabs2(p, Abuse, Suicide, results.subtitle = FALSE, title ="Suicide Attempt Percentage by Abuse")
Suicide attempts by abuse category: the remaining abuse categories have fewer observations and lower percentages.
p <- data_all %>% select(c('Non.subst.Dx','Suicide'))
PlotXTabs2(p, Non.subst.Dx, Suicide, results.subtitle = FALSE, title ="Suicide Attempt Percentage by Non subst Dx")
Non-substance-related Dx: the suicide attempt percentage is highest for those with one non-substance Dx and lowest for those with none.
p <- data_all %>% select(c('Subst.Dx','Suicide'))
PlotXTabs2(p, Subst.Dx, Suicide, results.subtitle = FALSE, title ="Suicide Attempt Percentage by Subst Dx")
Substance-related Dx: the suicide attempt percentage is highest for those with two or three substance-related diagnoses and lowest for those with one.
p <- data_all %>% select(c('Psych.meds.','Suicide'))
PlotXTabs2(p, Psych.meds., Suicide, results.subtitle = FALSE, title ="Suicide Attempt Percentage by Psych.meds.")
Suicide attempts by psychiatric meds:
By use of support vectors our SVM is capable of finding the optimal decision boundary to distinctly classify our data. In our case, we’re going to model suicide attempts from the genl dataset with a radial and a linear decision boundary. We’ll then interpret the performance of each model side-by-side.
To start, we re-familiarize ourselves with the data at hand and our dependent variable (Suicide):
#Ensure it's the data of interest:
genl <- q3_data
#Refamiliarize ourselves with the genl dataset:
dim(genl) #175 - 142 = 33 dropped rows
## [1] 142 14
## # A tibble: 6 x 14
## Age Sex Race ADHD.Total MD.TOTAL SUB.Total Court.order Education
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 24 1 1 40 15 3 1 11
## 2 48 2 1 55 14 0 0 14
## 3 51 2 1 31 5 0 0 12
## 4 43 1 1 45 13 4 0 12
## 5 34 1 1 48 7 2 1 9
## 6 39 2 1 55 14 1 0 11
## # … with 6 more variables: Hx.of.Violence <int>, Disorderly.Conduct <int>,
## # Suicide <int>, Abuse <int>, Non.subst.Dx <int>, Subst.Dx <int>
## Var1 Freq
## 1 0 97
## 2 1 45
We're dealing with 142 observations x 14 variables - 10 of type factor, 4 of type double - and a dependent variable Suicide, where 97/142 (68.3%) of patients have not attempted suicide and 45/142 (31.7%) have.
We'll narrow our dataset further by exploring our independent variables' correlation with Suicide and with one another, and then verify our assumptions with the Boruta function for feature ranking and selection.
We start by measuring the strength of correlation to Suicide for independent variables:
#Compute proportion of missing data per variable
v <- colnames(genl)
incomplete <- function(x) sum(!complete.cases(x)) / 142
Missing_Data <- sapply(genl[v], incomplete)
#head(missing) #verify
#Compute correlation between each variable and Suicide
target_corr <- function(x, y) cor(y, x, use = "na.or.complete")
Suicide_Correlation <- sapply(genl[v], target_corr, y=genl$Suicide)
#head(c_target) #verify
#Bind and output Missing Data and Correlation with Target
MDCS <- data.frame(cbind(Missing_Data, Suicide_Correlation))
MDCS %>%
kbl(caption = "Proportion of Missing Data vs. Correlation with Suicide") %>%
kable_minimal()
|  | Missing_Data | Suicide_Correlation |
|---|---|---|
| Age | 0 | -0.0815414 |
| Sex | 0 | 0.1640972 |
| Race | 0 | -0.1025970 |
| ADHD.Total | 0 | 0.1337074 |
| MD.TOTAL | 0 | 0.2859463 |
| SUB.Total | 0 | 0.1543758 |
| Court.order | 0 | 0.1423456 |
| Education | 0 | -0.1011286 |
| Hx.of.Violence | 0 | 0.1129186 |
| Disorderly.Conduct | 0 | -0.0228881 |
| Suicide | 0 | 1.0000000 |
| Abuse | 0 | 0.3107638 |
| Non.subst.Dx | 0 | 0.0238396 |
| Subst.Dx | 0 | 0.1698773 |
From the above table we can extend that:
- Age, Disorderly.Conduct, and Non.subst.Dx have a weak correlation with Suicide, and
- Abuse, MD.TOTAL, and Subst.Dx have the strongest correlation with Suicide.

From this, we can see the 3 variables we may keep as well as those we may drop. For the remaining variables, we'll explore further to guide our decision.
We measure the correlation between variables to observe multicollinearity:
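The plot code isn't echoed; a minimal sketch, assuming dplyr 1.0+ (factors must be coerced back to numeric before cor()):

#Correlation among all variables, coercing factors to numeric (sketch)
genl %>%
  mutate(across(everything(), as.numeric)) %>%
  cor() %>%
  corrplot(type = "lower")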
Multicollinearity is a concern, and we may eliminate a larger portion of variables as a result. We'll proceed with only variables with strong predictive promise and / or unique value added.
Based on these guidelines, it appears that we should proceed with at least Abuse, MD.Total, and Subst.Dx, while for Sex, Court.order, and Hx.of.Violence we'll decide after consulting the Boruta output.
To address weak target correlation and multicollinearity, we consider excluding 7-10 variables: Age, Race, Education, Disorderly.Conduct, and Non.subst.Dx due to weak correlation with the target variable; ADHD.Total and SUB.Total due to multicollinearity; and Sex, Court.order, and Hx.of.Violence pending our Boruta output.
Before doing so, we utilize the Boruta function for feature ranking and selection:
# Perform Boruta search
boruta_output <- Boruta(Suicide ~ ., data=genl, doTrace=0, maxRuns = 1000)
#Get significant variables including tentatives
boruta_signif <- getSelectedAttributes(boruta_output, withTentative = TRUE)
#print(boruta_signif)
# Plot variable importance
plot(boruta_output, cex.axis=.7, las=2, xlab="", main="Variable Importance")
From the above plot, we see that only two variables (Abuse and MD.TOTAL) carry much importance when we consider Suicide, but in order to offer a more comprehensive representation, we'll also include the next 3 variables (Race, Subst.Dx, and Court.order). It's worth noting that all variables score relatively low in feature importance, which may be an early indication of difficulty in predicting Suicide.
We select the 5 features noted above, convert categorical features to type factor and then verify the dataset that we’re proceeding to SVM model-building with:
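The selection chunk isn't echoed; a minimal sketch consistent with the output below and the select_df object used later:

#Keep the 5 chosen predictors plus Suicide; convert categoricals to factors (sketch)
select_df <- genl %>%
  select(Race, MD.TOTAL, Court.order, Suicide, Abuse, Subst.Dx) %>%
  mutate(across(c(Race, Court.order, Suicide, Abuse, Subst.Dx), as.factor))
head(select_df)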
## # A tibble: 6 x 6
## Race MD.TOTAL Court.order Suicide Abuse Subst.Dx
## <fct> <int> <fct> <fct> <fct> <fct>
## 1 1 15 1 1 0 0
## 2 1 14 0 1 4 0
## 3 1 5 0 0 6 0
## 4 1 13 0 1 7 0
## 5 1 7 1 1 0 0
## 6 1 14 0 1 2 0
We observe 5 independent variables (1 numeric, 4 factor) and find that our data is of the proper form, aside from the fact that we have numerous independent variables of type factor that we may have to convert to "one hot" dummy variables (ie. a unique variable for each level of the factor, with a 1 where that level is present).
We proceed to train-test split our data (for later assessment), convert categorical variables into "one hot" dummy variables, train our model, and consult the summary statistics:
To improve the model, we can tune the SVM by changing the parameters cost, gamma, and the kernel function. We use tune.svm from the e1071 package; the function accepts many parameters, of which we use 3: kernel, cost, and gamma.
Here we create 3 models using radial, linear, and polynomial kernels.
#Set seed (replicability)
set.seed(123)
#Perform train-test Split
train_split <- initial_split(select_df, prop = 0.75)
training <- training(train_split)
test <- testing(train_split)
results <- test$Suicide #hold actual test values for later comparison
test$Suicide <- NA #set initial test values to NA, to then be predicted
#Convert categorical variables into dummy variables (one hot, via dummyVars)
dummies <- dummyVars(~ ., data=training[,-4]) #Omit col 4 (Suicide) from dummy var generation
d_train <- as.data.frame(predict(dummies, training[,-4]))
d_train$Suicide <- training$Suicide
dummies <- dummyVars(~ ., data=test[,-4])
d_test <- as.data.frame(predict(dummies, test[,-4]))
d_test$Suicide <- test$Suicide
#head(d_train) #verify dummy variables
#Set gamma list and use variable cost to determine optimal model parameters (for SVM modeling)
gammalist <- c(0.005,0.01,0.015,0.02,0.025,0.03,0.035,0.04,0.045,0.05)
#1st SVM application: radial kernel
svm_radial <- tune.svm(Suicide ~., data=d_train, kernel='radial', cost=2^(-1:5), gamma = gammalist)
# summary(svm_radial)
# summary(svm_radial$best.model)
svm1 <- predict(svm_radial$best.model, d_test[,-20])
#confusionMatrix(svm1, results)
#2nd SVM application: linear kernel
svm_linear <- tune.svm(Suicide ~., data=d_train, kernel='linear')
# summary(svm_linear)
# summary(svm_linear$best.model)
svm2 <- predict(svm_linear$best.model, d_test[,-20])
#confusionMatrix(svm2, results)
#3rd SVM application: polynomial kernel
svm_polynomial <- tune.svm(Suicide ~., data=d_train, kernel='polynomial', cost=2^(-1:5), gamma = gammalist)
svm3 <- predict(svm_polynomial$best.model, d_test[,-20])
#Convert performance statistics to tabular form (for output / interpretation):
AccuracySVM1 <- confusionMatrix(svm1, results)$overall['Accuracy']
AccuracySVM2 <- confusionMatrix(svm2, results)$overall['Accuracy']
AccuracySVM3 <- confusionMatrix(svm3, results)$overall['Accuracy']
SVM_Model_1 <- confusionMatrix(svm1, results)$byClass
SVM_Model_1 <- data.frame(SVM_Model_1)
SVM_Model_1 <- rbind("Accuracy" = AccuracySVM1, SVM_Model_1)
SVM_Model_2 <- confusionMatrix(svm2, results)$byClass
SVM_Model_2 <- data.frame(SVM_Model_2)
SVM_Model_2 <- rbind("Accuracy" = AccuracySVM2, SVM_Model_2)
SVM_Model_3 <- confusionMatrix(svm3, results)$byClass
SVM_Model_3 <- data.frame(SVM_Model_3)
SVM_Model_3 <- rbind("Accuracy" = AccuracySVM3, SVM_Model_3)
tabularview <- data.frame(SVM_Model_1, SVM_Model_2, SVM_Model_3)
tabularview %>% kableExtra::kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), latex_options="scale_down")
|  | SVM_Model_1 | SVM_Model_2 | SVM_Model_3 |
|---|---|---|---|
| Accuracy | 0.6285714 | 0.6285714 | 0.7428571 |
| Sensitivity | 0.6428571 | 0.6785714 | 0.8214286 |
| Specificity | 0.5714286 | 0.4285714 | 0.4285714 |
| Pos Pred Value | 0.8571429 | 0.8260870 | 0.8518519 |
| Neg Pred Value | 0.2857143 | 0.2500000 | 0.3750000 |
| Precision | 0.8571429 | 0.8260870 | 0.8518519 |
| Recall | 0.6428571 | 0.6785714 | 0.8214286 |
| F1 | 0.7346939 | 0.7450980 | 0.8363636 |
| Prevalence | 0.8000000 | 0.8000000 | 0.8000000 |
| Detection Rate | 0.5142857 | 0.5428571 | 0.6571429 |
| Detection Prevalence | 0.6000000 | 0.6571429 | 0.7714286 |
| Balanced Accuracy | 0.6071429 | 0.5535714 | 0.6250000 |
The radial and linear models share the same Accuracy and Prevalence, whereas SVM_Model_1 (radial) outperformed SVM_Model_2 (linear) in Specificity, Pos Pred Value, Neg Pred Value, Precision, and Balanced Accuracy, while SVM_Model_2 (linear) outperformed SVM_Model_1 (radial) in Sensitivity, Recall, F1, Detection Rate, and Detection Prevalence. SVM_Model_3 (polynomial) outperformed the other 2 models, with higher Accuracy, Sensitivity, Recall, F1, and Balanced Accuracy.
The performance was rather close, but being that we're dealing with an imbalanced set (97/142, or 68.3%, of patients have not attempted suicide), we use Balanced Accuracy as our deciding factor: between the radial and linear kernels we elect SVM_Model_1 (radial), while SVM_Model_3 (polynomial) achieves the highest Balanced Accuracy overall.
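The confusion matrices below were presumably produced along these lines (a sketch using caret's confusionMatrix):

#Confusion matrices for the radial and linear models (sketch)
confusionMatrix(svm1, results)$table
confusionMatrix(svm2, results)$table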
SVM_Model_1 and SVM_Model_2:
## Reference
## Prediction 0 1
## 0 18 3
## 1 10 4
## Reference
## Prediction 0 1
## 0 19 4
## 1 9 3
Both SVM_Model_1 and SVM_Model_2 have a higher class error when predicting suicide and a lower error when predicting no suicide.
SVM_Model_3:
## Reference
## Prediction 0 1
## 0 23 4
## 1 5 3
The confusion matrix shows SVM_Model_3 is a reasonable model for both the suicide and non-suicide classes, with an accuracy of 74%. The misclassifications number 4 for suicide and 5 for no suicide. There is definitely scope to improve the performance of the model. For this dataset we aggregated the question columns into totals and did not impute NAs, which might be a reason for the low accuracy. In the Appendix we perform 6 more SVM models (across 3 kernels): one set using the entire dataset and another using the PCA output.
In the Appendix, we've also provided code (with a more succinct write-up) for our application of support vector machines to the entire dataset (~86% Accuracy) and to our principal components (~74% Accuracy).
It was nice to learn, via application, the effect of kernel selection. For a simpler model, one we believe is likely to be linearly separable, it may be best to start with a linear kernel; when the data appears to be non-linear (which we believed to be the case here), we can start with a non-linear application and then compare back to a linear model to test our assumptions. In this case the radial kernel model did not perform leaps-and-bounds better than the linear kernel model, but it did outperform it once we accounted for the fact that we were dealing with unbalanced data.
It appears that both models were able to relatively accurately predict non-suicide for those who had not attempted suicide in the past, but each had a more difficult time with patients who had. If the model were advanced further, this might be a case where accurately predicting positive (suicide) and remediating the situation (ie. providing help to the patient) could supersede the model's shortcomings elsewhere, but this is a very sensitive topic and the model would absolutely have to be tested to a much deeper degree and brought to a much higher level of accuracy to even consider such an idea.
In general though, we were surprised by the low accuracy of our SVM models given the "clout" surrounding Support Vector Machines and their ability to accurately predict. The reasons for the lower performance may have been the null values we dropped (maybe there was valuable data therein), the variables we elected to focus upon, the way in which we applied the models, the nature of the data itself, or the simple fact that it's difficult to predict something like suicide because humans are creatures of nuance and with that comes a great level of unpredictability.
We later confirmed, with code included in the Appendix, that Accuracy could be improved via inclusion of stronger (ie. principal components) or more features (ie. the entire dataset).
With the entire dataset we get better accuracy, and this could likely be improved further with different machine learning algorithms. In the future we may include QDA and LDA to see how those models perform.
……………………………………………………………………..
In completing this assignment, reference was made to the following:
……………………………………………………………………..
There are 15 columns for MD. We will follow the same process that we used for ADHD.
## MD.Q1a MD.Q1b MD.Q1c MD.Q1d MD.Q1e MD.Q1f MD.Q1g MD.Q1h MD.Q1i MD.Q1j MD.Q1k
## 1 1 1 1 1 0 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1 1 0 0
## 3 0 0 0 0 1 1 1 0 0 0 0
## 4 1 1 0 0 1 1 1 1 1 0 0
## 5 0 1 0 1 0 1 1 0 0 0 0
## 6 0 1 0 1 1 1 1 1 1 1 1
## MD.Q1L MD.Q1m MD.Q2 MD.Q3
## 1 0 1 1 3
## 2 1 0 1 3
## 3 0 0 0 2
## 4 1 1 1 3
## 5 0 0 1 2
## 6 1 0 1 3
## MD.Q1a MD.Q1b MD.Q1c MD.Q1d MD.Q1e MD.Q1f MD.Q1g MD.Q1h MD.Q1i MD.Q1j MD.Q1k
## 0 0 0 0 0 0 0 0 0 0 0
## MD.Q1L MD.Q1m MD.Q2 MD.Q3
## 0 0 0 0
From the summary above we see no NAs in the MD_df dataset.
The MD.Q3 column's scale is different from the other columns, so to account for the scale difference we set scale. = TRUE.
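A minimal sketch consistent with the pca_md object used below:

#PCA on the MD questionnaire; standardize since MD.Q3 is on a different scale (sketch)
pca_md <- prcomp(MD_df, scale. = TRUE)
summary(pca_md)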
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.3857 1.3474 0.9721 0.94891 0.91563 0.82515 0.79246
## Proportion of Variance 0.3794 0.1210 0.0630 0.06003 0.05589 0.04539 0.04187
## Cumulative Proportion 0.3794 0.5004 0.5635 0.62348 0.67937 0.72476 0.76663
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.77044 0.74286 0.72814 0.70302 0.65162 0.5835 0.55685
## Proportion of Variance 0.03957 0.03679 0.03535 0.03295 0.02831 0.0227 0.02067
## Cumulative Proportion 0.80620 0.84299 0.87834 0.91128 0.93959 0.9623 0.98296
## PC15
## Standard deviation 0.50553
## Proportion of Variance 0.01704
## Cumulative Proportion 1.00000
From the PCA summary, PC1 explains ~38% of the variance in the data, PC1-PC2 cumulatively explain ~50%, PC1-PC3 explain ~56%, and so on.
We plot the scree plot to display the same; the scree plot's y-axis is the variance, which is simply the square of the SD.
From the scree plot, PC1, PC2, PC3, and PC4 together describe ~62% of the variance in the data.
Per https://www.originlab.com/doc/Origin-Help/PrincipleComp-Analysis, if the relationships between variables are weak, PCA does not work well to reduce the data. For the MD data more than 3 principal components are needed to describe the data, so we're not sure how well PCA will work.
We create another dataset by using cbind on the original MD data along with PC1, PC2, PC3, and PC4.
# combine dataset
md_df_2 <- cbind(MD_df, pca_md$x)
# cor plot
cor(MD_df, md_df_2[, 16:19]) %>% kableExtra::kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), latex_options="scale_down")
|  | PC1 | PC2 | PC3 | PC4 |
|---|---|---|---|---|
| MD.Q1a | -0.6888686 | -0.1322144 | 0.0229725 | -0.3776545 |
| MD.Q1b | -0.6421749 | -0.3857186 | 0.1381774 | 0.1459230 |
| MD.Q1c | -0.4439895 | 0.4544153 | -0.4763749 | -0.0977568 |
| MD.Q1d | -0.5655577 | 0.0049581 | 0.0644164 | -0.0442116 |
| MD.Q1e | -0.6391312 | 0.2391336 | -0.1887304 | 0.1745264 |
| MD.Q1f | -0.6668512 | -0.2825469 | 0.1009022 | 0.2032375 |
| MD.Q1g | -0.6818994 | -0.3687898 | 0.0536512 | 0.1911240 |
| MD.Q1h | -0.6193596 | 0.4973240 | 0.0584859 | 0.2126773 |
| MD.Q1i | -0.5873223 | 0.4451927 | -0.1405849 | 0.2802170 |
| MD.Q1j | -0.5954591 | 0.4356557 | 0.1328873 | 0.0517699 |
| MD.Q1k | -0.5118310 | 0.4082574 | 0.5106816 | -0.2784104 |
| MD.Q1L | -0.7087339 | -0.1724858 | 0.1944088 | -0.4085494 |
| MD.Q1m | -0.5682204 | -0.0522540 | -0.4267685 | -0.4284171 |
| MD.Q2 | -0.7328573 | -0.2802758 | 0.0969890 | 0.2440009 |
| MD.Q3 | -0.5118573 | -0.5064956 | -0.3385010 | 0.0527350 |
From the correlation table above, we can see that every MD variable is negatively correlated with PC1, with MD.Q2 showing the strongest correlation (~ -0.73).
In the bi-plots, the red arrows represent the eigenvectors of each variable; these plots help us to interpret the relationship between the principal components and the original variables.
From the PC1-PC2 biplot, observations appear to have low values on MD.Q3 and MD.Q1h. The PC2-PC3 and PC1-PC3 biplots are similarly difficult to read, and much of their interpretation is ambiguous. We need a better approach to examine these relationships.
##
## Loadings:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 PC13 PC14 PC15
## MD.Q1a 1
## MD.Q1b 1
## MD.Q1c 1
## MD.Q1d 1
## MD.Q1e -1
## MD.Q1f 1
## MD.Q1g -1
## MD.Q1h -1
## MD.Q1i 1
## MD.Q1j -1
## MD.Q1k 1
## MD.Q1L -1
## MD.Q1m -1
## MD.Q2 -1
## MD.Q3 1
##
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
## SS loadings 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
## Proportion Var 0.067 0.067 0.067 0.067 0.067 0.067 0.067 0.067 0.067 0.067
## Cumulative Var 0.067 0.133 0.200 0.267 0.333 0.400 0.467 0.533 0.600 0.667
## PC11 PC12 PC13 PC14 PC15
## SS loadings 1.000 1.000 1.000 1.000 1.000
## Proportion Var 0.067 0.067 0.067 0.067 0.067
## Cumulative Var 0.733 0.800 0.867 0.933 1.000
The varimax loadings can be interpreted as follows: the PCA of MD shows the first principal component holds ~38% of the variance in the data. To capture ~90% of the variance, we need to include 11 of the principal components, a reduction from 15 features to 11.
We next perform PCA using the entire dataset, excluding the individual ADHD and MD question columns and instead taking ADHD.Total and MD.TOTAL into consideration.
Here we again use the built-in R function prcomp() to perform the Principal Component Analysis.
PCA can be applied only to numerical data. We have a few categorical variables in this dataset that need to be converted to a numerical encoding.
## Age Sex Race ADHD.Total
## "integer" "integer" "integer" "integer"
## MD.TOTAL SUB.Total Court.order Education
## "integer" "numeric" "integer" "integer"
## Hx.of.Violence Disorderly.Conduct Suicide Abuse
## "integer" "integer" "integer" "integer"
## Non.subst.Dx Subst.Dx
## "integer" "integer"
As we can see, there are only 4 numeric data types and the rest are factors. We need to convert all the factors into numerics.
cols.num <- c("Sex","Race","Court.order","Education","Hx.of.Violence", "Disorderly.Conduct", "Suicide","Abuse","Non.subst.Dx","Subst.Dx")
genl[cols.num] <- sapply(genl[cols.num],as.numeric)
## Age Sex Race ADHD.Total
## "integer" "numeric" "numeric" "integer"
## MD.TOTAL SUB.Total Court.order Education
## "integer" "numeric" "numeric" "numeric"
## Hx.of.Violence Disorderly.Conduct Suicide Abuse
## "numeric" "numeric" "numeric" "numeric"
## Non.subst.Dx Subst.Dx
## "numeric" "numeric"
We show the percentage of variance explained by each principal component.
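The chunk isn't echoed; a minimal sketch consistent with the result.pca object used below:

#PCA on the converted genl data and scree plot of variance explained (sketch)
result.pca <- prcomp(genl, scale. = TRUE)
fviz_eig(result.pca, addlabels = TRUE)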
Individuals with a similar profile are grouped together.
# Graph of individuals
fviz_pca_ind(result.pca,
col.ind = "cos2", # Color by the quality of representation
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
Positively correlated variables point to the same side of the plot; negatively correlated variables point to opposite sides of the graph.
# Graph of variables
fviz_pca_var(result.pca,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
# Biplot of individuals and variables
fviz_pca_biplot(result.pca, repel = TRUE,
col.var = "#2E9FDF", # Variables color
col.ind = "#696969" # Individuals color
)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.6316 1.4760 1.1923 1.11534 1.06328 1.00875 0.94316
## Proportion of Variance 0.1902 0.1556 0.1015 0.08886 0.08075 0.07268 0.06354
## Cumulative Proportion 0.1902 0.3458 0.4473 0.53617 0.61693 0.68961 0.75315
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.85085 0.83870 0.7812 0.69752 0.6372 0.5916 0.41911
## Proportion of Variance 0.05171 0.05024 0.0436 0.03475 0.0290 0.0250 0.01255
## Cumulative Proportion 0.80486 0.85511 0.8987 0.93345 0.9625 0.9875 1.00000
PCA of the General dataset: the first principal component holds ~19% of the variance in the data. To capture ~93% of the variance, we need to include 11 of the principal components, a reduction from 14 features to 11.
……………………………………………………………………..
In k-means clustering, we first create new, unlabeled points that serve as "centroids", compute the Euclidean distance between each centroid and the surrounding observations, and use those distances for category assignment. Each centroid is then recomputed as the average of all the points assigned to its category. Finally, the algorithm is re-run with these new centroids, ending when there is no change in centroid position between one run of the clustering algorithm and the next.
To perform the cluster analysis, we first need to select the optimal number of clusters; there are several methods for doing so.
We’ll use the following R packages:
factoextra to determine the optimal number of clusters for a given clustering method and for data visualization, and NbClust for computing about 30 methods at once in order to find the optimal number of clusters.
For fviz_nbclust, the main parameters are the data, the clustering function (ie. kmeans), and the method used to estimate the optimal number of clusters ("wss", "silhouette", or "gap_stat").
We standardize the data, centering by the mean and scaling by the standard deviation.
# standardize the data
means <- apply(genl,2,mean)
sds <- apply(genl,2,sd)
nor = scale(genl,center=means,scale=sds)
# Elbow method
fviz_nbclust(nor, kmeans, method = "wss") +
geom_vline(xintercept = 6, linetype = 2)+
labs(subtitle = "Elbow method")# Silhouette method
fviz_nbclust(nor, kmeans, method = "silhouette")+
labs(subtitle = "Silhouette method")# Gap statistic
# nboot = 50 to keep the function speedy.
# recommended value: nboot= 500 for your analysis.
# Use verbose = FALSE to hide computing progression.
set.seed(140)
fviz_nbclust(nor, kmeans, nstart = 25, method = "gap_stat", nboot = 50)+
labs(subtitle = "Gap statistic method")The disadvantage of elbow and average silhouette methods is that, they measure a global clustering characteristic only. A more sophisticated method is to use the gap statistic which provides a statistical procedure to formalize the elbow/silhouette heuristic in order to estimate the optimal number of clusters.
According to these observations, it’s possible to define k = 5 as the optimal number of clusters in the data.
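The clustering call isn't echoed; a minimal sketch consistent with the kmeans_res object and the 5-cluster output below:

#Fit k-means with k = 5 on the standardized data (sketch)
set.seed(140)
kmeans_res <- kmeans(nor, centers = 5, nstart = 25)
kmeans_res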
## K-means clustering with 5 clusters of sizes 11, 47, 30, 24, 30
##
## Cluster means:
## Age Sex Race ADHD.Total MD.TOTAL SUB.Total
## 1 -0.6569935740 0.1754249 -0.3842857 0.02634379 0.1301821 0.54162019
## 2 -0.0002945166 -0.4474302 0.2545898 -0.48604331 -0.2163302 0.05965279
## 3 0.3244546883 0.2845213 0.1498511 0.42753297 -0.3387180 -1.08642639
## 4 -0.1334416906 1.0012237 -0.3662406 0.53686606 0.5563087 0.70092025
## 5 0.0236577174 -0.4488487 -0.1148113 -0.10521735 0.1848549 0.23364008
## Court.order Education Hx.of.Violence Disorderly.Conduct Suicide
## 1 3.4387824 -0.350659310 0.4403721 0.64578809 0.48949560
## 2 -0.2887527 -0.307035361 -0.5915229 0.18123941 -0.54200720
## 3 -0.2887527 0.701004304 -0.5915229 -1.10091493 -0.25036946
## 4 -0.2887527 -0.108316402 -0.4023422 -0.08200484 1.10604981
## 5 -0.2887527 -0.004754037 1.6786462 0.64578809 0.03519249
## Abuse Non.subst.Dx Subst.Dx
## 1 0.2884947 0.11673855 0.16124445
## 2 -0.5021573 -0.61105397 0.14289936
## 3 0.1810423 1.01574893 -1.07460319
## 4 0.7884472 -0.06964165 0.88375228
## 5 -0.1308684 -0.04552186 0.08460274
##
## Clustering vector:
## [1] 1 3 3 3 1 4 5 5 5 4 5 3 2 3 1 3 3 5 3 3 4 5 3 4 3 1 3 4 2 2 5 2 4 4 2 5 5
## [38] 4 2 4 5 2 2 4 4 5 5 4 2 4 4 5 4 1 1 2 2 2 5 1 3 5 3 2 2 5 3 3 2 3 3 3 3 3
## [75] 4 5 1 3 5 2 4 5 3 4 3 3 3 3 2 2 2 2 5 5 1 2 2 2 2 2 2 2 2 4 2 5 5 2 2 2 2
## [112] 2 2 4 1 5 2 2 2 5 2 2 2 5 2 4 5 3 3 4 2 2 5 5 5 1 2 4 4 2 2 3
##
## Within cluster sum of squares by cluster:
## [1] 141.3976 350.9247 334.8022 230.6030 251.2763
## (between_SS / total_SS = 33.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
# cluster visualization
fviz_cluster(kmeans_res, data = genl,
ellipse.type = "convex",
ggtheme = theme_classic())
We calculate each numeric variable's mean value for each cluster and then output the resulting table:
aggregate(genl, by=list(cluster=kmeans_res$cluster), mean) %>% kableExtra::kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), latex_options="scale_down")
| cluster | Age | Sex | Race | ADHD.Total | MD.TOTAL | SUB.Total | Court.order | Education | Hx.of.Violence | Disorderly.Conduct | Suicide | Abuse | Non.subst.Dx | Subst.Dx |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 32.18182 | 1.545454 | 1.363636 | 35.27273 | 10.909091 | 5.545454 | 1 | 11.18182 | 0.4545455 | 1.0000000 | 0.5454545 | 1.8181818 | 0.5454545 | 1.272727 |
| 2 | 39.53191 | 1.234043 | 1.765957 | 26.68085 | 9.276596 | 4.170213 | 0 | 11.27660 | 0.0000000 | 0.7872340 | 0.0638298 | 0.2127660 | 0.0425532 | 1.255319 |
| 3 | 43.16667 | 1.600000 | 1.700000 | 42.00000 | 8.700000 | 0.900000 | 0 | 13.46667 | 0.0000000 | 0.2000000 | 0.2000000 | 1.6000000 | 1.1666667 | 0.100000 |
| 4 | 38.04167 | 1.958333 | 1.375000 | 43.83333 | 12.916667 | 6.000000 | 0 | 11.70833 | 0.0833333 | 0.6666667 | 0.8333333 | 2.8333333 | 0.4166667 | 1.958333 |
| 5 | 39.80000 | 1.233333 | 1.533333 | 33.06667 | 11.166667 | 4.666667 | 0 | 11.93333 | 1.0000000 | 1.0000000 | 0.3333333 | 0.9666667 | 0.4333333 | 1.200000 |
Based on the output above we'd profile our groups as follows: cluster 1 is the youngest and entirely court ordered; cluster 2 shows the lowest ADHD scores and little history of violence, abuse, or suicide; cluster 3 is the oldest and most educated, with the lowest substance scores; cluster 4 is largely female, with the highest ADHD, MD, abuse, and suicide values; and cluster 5 is marked by a universal history of violence and disorderly conduct.
……………………………………………………………………..
Below, we re-visited our dataset with a different approach to compare-contrast the resulting Accuracies. The results were an improvement :)
We will perform SVM with the entire dataset, then create another SVM model on the reduced dimensions produced by PCA of the genl dataset, and compare the resulting metrics of these models to find the best one.
Below we create 3 models using 3 kernels: radial, linear, and polynomial.
data_all_new <- data_all %>% select(-Suicide, everything()) #move Suicide to the last column
#Set seed (replicability)
set.seed(700)
#Perform train-test Split
intrain <- createDataPartition(data_all_new$Suicide, p = 0.75, list = FALSE)
training_all <- data_all_new[intrain,]
test_all <- data_all_new[-intrain,]
results <- test_all$Suicide #hold actual test values for later comparison
test_all$Suicide <- NA #mask the response in the test set (predictions below drop the last column)
gammalist <- c(0.005,0.01,0.015,0.02,0.025,0.03,0.035,0.04,0.045,0.05)
tune.out_radial <- tune.svm(as.factor(Suicide) ~., data=training_all,
kernel='radial', cost=2^(1:5), gamma = gammalist)
tune.out_linear <- tune.svm(as.factor(Suicide) ~., data=training_all,
kernel='linear')
tune.out_poly <- tune.svm(as.factor(Suicide) ~., data=training_all,
kernel='polynomial', cost=2^(1:5), gamma = gammalist)
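Before scoring the tuned models, it can be helpful to confirm which hyperparameters cross-validation selected. A brief sketch using e1071's tune interface (shown for the radial and linear fits; the same accessors apply to the polynomial fit):
summary(tune.out_radial) #best cost / gamma and cross-validated error for the radial kernel
tune.out_radial$best.parameters #selected hyperparameters, available directly
tune.out_linear$best.performance #lowest cross-validated error for the linear kernel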
svm4 <- predict(tune.out_radial$best.model, test_all[,-dim(test_all)[2]])
svm5 <- predict(tune.out_linear$best.model, test_all[,-dim(test_all)[2]])
svm6 <- predict(tune.out_poly$best.model, test_all[,-dim(test_all)[2]])
#Convert performance statistics to tabular form (for output / interpretation):
AccuracySVM4 <- confusionMatrix(svm4, as.factor(results))$overall['Accuracy']
AccuracySVM5 <- confusionMatrix(svm5, as.factor(results))$overall['Accuracy']
AccuracySVM6 <- confusionMatrix(svm6, as.factor(results))$overall['Accuracy']
SVM_Model_4 <- confusionMatrix(svm4, as.factor(results))$byClass
SVM_Model_4 <- data.frame(SVM_Model_4)
SVM_Model_4 <- rbind("Accuracy" = AccuracySVM4, SVM_Model_4)
SVM_Model_5 <- confusionMatrix(svm5, as.factor(results))$byClass
SVM_Model_5 <- data.frame(SVM_Model_5)
SVM_Model_5 <- rbind("Accuracy" = AccuracySVM5, SVM_Model_5)
SVM_Model_6 <- confusionMatrix(svm6, as.factor(results))$byClass
SVM_Model_6 <- data.frame(SVM_Model_6)
SVM_Model_6 <- rbind("Accuracy" = AccuracySVM6, SVM_Model_6)
tabularview <- data.frame(SVM_Model_4, SVM_Model_5, SVM_Model_6)
tabularview %>% kableExtra::kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), latex_options="scale_down")
| | SVM_Model_4 | SVM_Model_5 | SVM_Model_6 |
|---|---|---|---|
| Accuracy | 0.8604651 | 0.8372093 | 0.8604651 |
| Sensitivity | 0.9189189 | 0.8918919 | 1.0000000 |
| Specificity | 0.5000000 | 0.5000000 | 0.0000000 |
| Pos Pred Value | 0.9189189 | 0.9166667 | 0.8604651 |
| Neg Pred Value | 0.5000000 | 0.4285714 | NaN |
| Precision | 0.9189189 | 0.9166667 | 0.8604651 |
| Recall | 0.9189189 | 0.8918919 | 1.0000000 |
| F1 | 0.9189189 | 0.9041096 | 0.9250000 |
| Prevalence | 0.8604651 | 0.8604651 | 0.8604651 |
| Detection Rate | 0.7906977 | 0.7674419 | 0.8604651 |
| Detection Prevalence | 0.8604651 | 0.8372093 | 1.0000000 |
| Balanced Accuracy | 0.7094595 | 0.6959459 | 0.5000000 |
SVM_Model_4 (radial) and SVM_Model_6 (polynomial) share the same Accuracy, and SVM_Model_4 matches or outperforms SVM_Model_5 (linear) on every metric, including Sensitivity, Recall, Detection Rate, F1, Precision, Neg Pred Value, and Balanced Accuracy. SVM_Model_6, despite its Accuracy, shows a NaN Neg Pred Value and zero Specificity, so it is not a good fit for predicting suicide. Let's look at the confusion matrices and interpret how each model works.
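The raw prediction-versus-reference tables below can be pulled from caret's confusionMatrix() output; a minimal sketch of the presumed call for the first model:
confusionMatrix(svm4, as.factor(results))$table #rows = predictions, columns = actuals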
SVM_Model_4
## Reference
## Prediction 0 1
## 0 34 3
## 1 3 3
This model performed well at predicting both classes; the total misclassification count is 6 (3 false positives and 3 false negatives).
SVM_Model_5
## Reference
## Prediction 0 1
## 0 33 3
## 1 4 3
This model also performed well on the no-suicide class but slightly worse overall, with a total misclassification count of 7.
SVM_Model_6
## Reference
## Prediction 0 1
## 0 37 6
## 1 0 0
This model performed perfectly on the no-suicide class but failed entirely on the suicide class, never predicting it once. It is definitely not a good model for this task, since it cannot identify the suicide class at all.
From the varimax plot we found that we need to include 11 principal components to capture more than 90% of the variance in the data. Guided by the PCA, we exclude 'ADHD.Total', 'Non.subst.Dx', and 'Subst.Dx' from the dataset (a quick sanity check of the variance threshold is sketched after the data-preparation code below).
We split the data into train and test sets, fit and tune the models, predict on the test data, and compute performance metrics to assess each model.
genl_2 <- data[c(2:4,23,39)]
genl_2$SUB.Total <- sub$SUB.Total
genl_2 <- cbind(genl_2, hstry)
# imputation
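# NOTE (assumption): impute() is taken to be Hmisc::impute, with `mode` a statistical-mode
# helper defined earlier in the document; base R's mode() returns the storage mode instead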
genl_3 <- genl_2 %>%
mutate(
SUB.Total = impute(SUB.Total, mode),
Court.order = impute(Court.order, mode),
Education = impute(Education, mode),
Hx.of.Violence = impute(Hx.of.Violence, mode),
Disorderly.Conduct = impute(Disorderly.Conduct, mode),
Suicide = impute(Suicide, mode),
Abuse = impute(Abuse, mode),
Non.subst.Dx = impute(Non.subst.Dx, mode),
Subst.Dx = impute(Subst.Dx, mode),
Psych.meds. = impute(Psych.meds., mode)
) %>%
mutate_if(is.character, as.numeric)
genl_svm <- genl_3 %>% select(-c('ADHD.Total', 'Non.subst.Dx', 'Subst.Dx')) %>% select(-Suicide, everything()) #drop PCA-flagged variables; move Suicide to the last column
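As referenced above, a quick sanity check of the 90%-variance claim could look like the following sketch. It assumes the imputed genl_3 is fully numeric with no remaining NAs; the exact data frame used for the original PCA is an assumption:
pca_res <- prcomp(genl_3, center = TRUE, scale. = TRUE) #PCA on the standardized data
cum_var <- cumsum(pca_res$sdev^2) / sum(pca_res$sdev^2) #cumulative proportion of variance explained
which(cum_var >= 0.90)[1] #number of components needed to reach 90% variance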
#Set seed (replicability)
set.seed(900)
#Perform train-test Split
intrain <- createDataPartition(genl_svm$Suicide, p = 0.75, list = FALSE)
training_not_all <- genl_svm[intrain,]
test_not_all <- genl_svm[-intrain,]
results_not_all <- test_not_all$Suicide #hold actual test values for later comparison
test_not_all$Suicide <- NA
gammalist <- c(0.005,0.01,0.015,0.02,0.025,0.03,0.035,0.04,0.045,0.05)
tune.out_radial_7 <- tune.svm(as.factor(Suicide) ~., data=training_not_all,
kernel='radial', cost=2^(1:5), gamma = gammalist)
tune.out_linear_8 <- tune.svm(as.factor(Suicide) ~., data=training_not_all,
kernel='linear')
tune.out_poly_9 <- tune.svm(as.factor(Suicide) ~., data=training_not_all,
kernel='polynomial', cost=2^(1:5), gamma = gammalist)
svm7 <- predict(tune.out_radial_7$best.model, test_not_all[,-dim(test_not_all)[2]])
svm8 <- predict(tune.out_linear_8$best.model, test_not_all[,-dim(test_not_all)[2]])
svm9 <- predict(tune.out_poly_9$best.model, test_not_all[,-dim(test_not_all)[2]])
#Convert performance statistics to tabular form (for output / interpretation):
AccuracySVM7 <- confusionMatrix(svm7, as.factor(results_not_all))$overall['Accuracy']
AccuracySVM8 <- confusionMatrix(svm8, as.factor(results_not_all))$overall['Accuracy']
AccuracySVM9 <- confusionMatrix(svm9, as.factor(results_not_all))$overall['Accuracy']
SVM_Model_7 <- confusionMatrix(svm7, as.factor(results_not_all))$byClass
SVM_Model_7 <- data.frame(SVM_Model_7)
SVM_Model_7 <- rbind("Accuracy" = AccuracySVM7, SVM_Model_7)
SVM_Model_8 <- confusionMatrix(svm8, as.factor(results_not_all))$byClass
SVM_Model_8 <- data.frame(SVM_Model_8)
SVM_Model_8 <- rbind("Accuracy" = AccuracySVM8, SVM_Model_8)
SVM_Model_9 <- confusionMatrix(svm9, as.factor(results_not_all))$byClass
SVM_Model_9 <- data.frame(SVM_Model_9)
SVM_Model_9 <- rbind("Accuracy" = AccuracySVM9, SVM_Model_9)
tabularview <- data.frame(SVM_Model_7, SVM_Model_8, SVM_Model_9)
tabularview %>% kableExtra::kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), latex_options="scale_down")
| | SVM_Model_7 | SVM_Model_8 | SVM_Model_9 |
|---|---|---|---|
| Accuracy | 0.7441860 | 0.6744186 | 0.7441860 |
| Sensitivity | 0.9655172 | 1.0000000 | 1.0000000 |
| Specificity | 0.2857143 | 0.0000000 | 0.2142857 |
| Pos Pred Value | 0.7368421 | 0.6744186 | 0.7250000 |
| Neg Pred Value | 0.8000000 | NaN | 1.0000000 |
| Precision | 0.7368421 | 0.6744186 | 0.7250000 |
| Recall | 0.9655172 | 1.0000000 | 1.0000000 |
| F1 | 0.8358209 | 0.8055556 | 0.8405797 |
| Prevalence | 0.6744186 | 0.6744186 | 0.6744186 |
| Detection Rate | 0.6511628 | 0.6744186 | 0.6744186 |
| Detection Prevalence | 0.8837209 | 1.0000000 | 0.9302326 |
| Balanced Accuracy | 0.6256158 | 0.5000000 | 0.6071429 |
We extend the following commentary based on the output above:
SVM_Model_7 and SVM_Model_9 have higher accuracy than SVM_Model_8. Let's look at each confusion matrix and interpret the results.
SVM_Model_7
## Reference
## Prediction 0 1
## 0 28 10
## 1 1 4
SVM_Model_7 has an accuracy of 74%; it makes 11 misclassifications in total, most of them in the suicide class. It predicts the no-suicide class well but is not a good model for the suicide class.
SVM_Model_8
## Reference
## Prediction 0 1
## 0 29 14
## 1 0 0
SVM_Model_8 has an accuracy of 67%; it makes 14 misclassifications in total and gets every suicide-class observation wrong, predicting the no-suicide class for all of them. This is not a good model to go with.
SVM_Model_9
## Reference
## Prediction 0 1
## 0 29 11
## 1 0 3
SVM_Model_9 has the same accuracy as SVM_Model_7. From the confusion matrix we see 11 misclassifications in total, all in the suicide class. The model predicts the no-suicide class with 100% accuracy but performs poorly on the suicide class.
We fit nine SVM models in total. The genl dataset without PCA-based reduction gave better results than the PCA-reduced genl data, and the models trained on the entire dataset performed best of all. Among all nine, SVM_Model_4 (radial kernel, full dataset) is the best model.