The purpose of this assignment was to explore Clustering, Principal Component Analysis, and Support Vector Machines.
Clustering is used as a means of grouping a set of observations in a way that those in the same group are more similar to one another than those in other groups. Principal Component Analysis is used to a means of variable reduction where it’s our aim to preserve as much useful information as possible. In short, we trade a little accuracy for simplicity. Support Vector Machines, on another note, are a supervised learning technique used for classification and regression analysis. We’re going to apply it as a means of classification based on one of the variables in our set (suicide).
We’ll perform a relatively in-depth exploratory data analysis (EDA) to better familiarize ourselves with the data at hand and then we’ll explore each of the aforementioned methods.
Grouping our patients via k-means / hierarchical clustering and profiling these groups based on shared characteristics, focusing upon a subset of our data (ie. ADHD or mood disorder) in applying Principal Component Analysis, and applying a Support Vector Machine model based on whether or not a patient has attempted suicide.
……………………………………………………………………..
The data at hand is based on a real life research project and deals with a number of measures of mental health and well-being - attention deficit hyperactivity disorder (ADHD), mood / mental disorder (MD), substance abuse, etc.
Being that our exploration and application of methods can take on many forms and head in a nearly infinite number of directions, the room for creativity is without bounds.
We start by loading in our data and utilizing built-in head() and dim() functions for an early read.
#Load in data
<- read.csv("https://raw.githubusercontent.com/SubhalaxmiRout002/Data-622-Group-5/main/HW%204/ADHD_data.csv")
data #data <- read.csv("ADHD_data.csv")
head(data)
## Initial Age Sex Race ADHD.Q1 ADHD.Q2 ADHD.Q3 ADHD.Q4 ADHD.Q5 ADHD.Q6 ADHD.Q7
## 1 JA 24 1 1 1 1 4 2 3 1 1
## 2 LA 48 2 1 3 3 4 4 5 2 2
## 3 MD 51 2 1 2 1 2 1 3 3 3
## 4 RD 43 1 1 3 3 2 2 4 3 2
## 5 RB 34 1 1 4 4 2 4 4 2 3
## 6 SB 39 2 1 2 3 1 4 3 2 3
## ADHD.Q8 ADHD.Q9 ADHD.Q10 ADHD.Q11 ADHD.Q12 ADHD.Q13 ADHD.Q14 ADHD.Q15
## 1 3 2 4 2 4 1 0 3
## 2 3 2 4 1 4 2 4 4
## 3 2 0 1 2 0 2 2 3
## 4 4 4 2 3 1 3 3 1
## 5 4 4 2 4 1 3 2 1
## 6 4 4 2 4 2 4 4 3
## ADHD.Q16 ADHD.Q17 ADHD.Q18 ADHD.Total MD.Q1a MD.Q1b MD.Q1c MD.Q1d MD.Q1e
## 1 1 3 4 40 1 1 1 1 0
## 2 3 1 4 55 1 1 1 1 1
## 3 2 1 1 31 0 0 0 0 1
## 4 2 1 2 45 1 1 0 0 1
## 5 2 1 1 48 0 1 0 1 0
## 6 4 3 3 55 0 1 0 1 1
## MD.Q1f MD.Q1g MD.Q1h MD.Q1i MD.Q1j MD.Q1k MD.Q1L MD.Q1m MD.Q2 MD.Q3 MD.TOTAL
## 1 1 1 1 1 1 1 0 1 1 3 15
## 2 1 1 1 1 0 0 1 0 1 3 14
## 3 1 1 0 0 0 0 0 0 0 2 5
## 4 1 1 1 1 0 0 1 1 1 3 13
## 5 1 1 0 0 0 0 0 0 1 2 7
## 6 1 1 1 1 1 1 1 0 1 3 14
## Alcohol THC Cocaine Stimulants Sedative.hypnotics Opioids Court.order
## 1 1 1 1 0 0 0 1
## 2 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0
## 4 1 1 1 1 0 0 0
## 5 1 1 0 0 0 0 1
## 6 1 0 0 0 0 0 0
## Education Hx.of.Violence Disorderly.Conduct Suicide Abuse Non.subst.Dx
## 1 11 0 1 1 0 2
## 2 14 0 0 1 4 1
## 3 12 0 0 0 6 2
## 4 12 0 0 1 7 2
## 5 9 1 1 1 0 2
## 6 11 0 1 1 2 0
## Subst.Dx Psych.meds.
## 1 0 2
## 2 0 1
## 3 0 1
## 4 0 2
## 5 0 0
## 6 0 0
dim(data)
## [1] 175 54
From our initial read of our dataset’s dimensions and first 6 entries, we observe that we’re dealing with a 175 observation x 54 variable dataset. Our initial dataset is broad and may be difficult to manage / make sense of. With the motivation of clarification in mind, we thus break our larger dataset into subsets:
We may end up merging these subsets, pulling individual variables, or exploring individual subsets later but this provides a means of effectively recognizing where related variables lie while simultaneously providing smaller, more manageable dataframes to explore and work upon.
Once these dataframes are created, we see that adhd and md have total score variables whereas sub does not, even though its data (across variables) maintains a consistent scoring system. Thus, we take it upon ourselves to add this variable for sake of consistency:
<- data[c(1:4,23,39)]
genl #head(genl) #drop initial, add sub and hstry (all vars or "total score")
#we may want to use these separately (ie. for Q2)
<- data[c(5:23)]
adhd <- data[c(24:39)]
md <- data[c(40:45)]
sub <- data[c(46:54)]
hstry $SUB.Total <- rowSums(sub) #add Total column to sub
sub#Finalize genl dataframe (for EDA)
$SUB.Total <- sub$SUB.Total #add SUB.Total to genl
genl<- genl[c(-1)] #drop Initial
genl <- cbind(genl, hstry) #add history variables back in with general
genl head(genl)
## Age Sex Race ADHD.Total MD.TOTAL SUB.Total Court.order Education
## 1 24 1 1 40 15 3 1 11
## 2 48 2 1 55 14 0 0 14
## 3 51 2 1 31 5 0 0 12
## 4 43 1 1 45 13 4 0 12
## 5 34 1 1 48 7 2 1 9
## 6 39 2 1 55 14 1 0 11
## Hx.of.Violence Disorderly.Conduct Suicide Abuse Non.subst.Dx Subst.Dx
## 1 0 1 1 0 2 0
## 2 0 0 1 4 1 0
## 3 0 0 0 6 2 0
## 4 0 0 1 7 2 0
## 5 1 1 1 0 2 0
## 6 0 1 1 2 0 0
## Psych.meds.
## 1 2
## 2 1
## 3 1
## 4 2
## 5 0
## 6 0
We subset our dataframes (as noted above), add a “total” column for sub, and finalize our genl dataframe by merging the total column for sub, dropping the Initial
variable, and adding all hstry variables back in.
We lean on the data dictionary (provided with our .xlsx file) to outline the variables at hand:
Age
: quantitative variable representative of the individual’s age.Sex
: categorical variable representative of the individual’s sex (male-1, female-2).Race
: categorical variable representative of the individual’s race (white-1, african american-2, hispanic-3, asian-4, native american-5, other / missing data-6).ADHD.Total
: quantitative variable representative of the individual’s total self-report score for ADHD scale. It’s the cumulative score for 18 questions with a scoring metric of never-0, rarely-1, sometimes-2, often-3, and very often-4.MD.TOTAL
: quantitative variable representative of the individual’s total self-report score for mental disorder questions. It’s the cumulative score for 15 questions with a scoring metric of no-0, yes-1; question 3: no problem-0, minor-1, moderate-2, and serious-3.SUB.Total
: quantitative variable representative of the individual’s total self-report score for substance abuse related questions. It’s the cumulative score across 6 categories with a scoring metric of no use-0, use-1, abuse-2, and dependence-3.Court.order
: categorical variable representative of whether the individual’s case was court ordered (No-0, Yes-1).Education
: categorical variable representative of the individual’s level of education (1-12 grade, 13+ college).Hx.of.Violence
: categorical variable representative of whether the individual has a history of violence (No-0, Yes-1).Disorderly.Conduct
: categorical variable representative of whether the individual has a record of disorderly conduct (No-0, Yes-1).Suicide
: categorical variable representative of whether the individual has attempted suicide in the past (No-0, Yes-1).Abuse
: categorical variable representative of whether the individual has a history of abuse (No-0, Physical (P)-1, Sexual (S)-2, Emotional (E)-3, P&S-4, P&E-5, S&E-6, P&S&E-7).Non.subst.Dx
: categorical variable representative of whether the individual has a non-substance diagonosis (0 – none; 1 – one; 2 – More than one).Subst.Dx
: categorical variable representative of whether the individual has a substance diagonosis (0 – none; 1 – one Substance-related; 2 – two; 3 – three or more).Psych.meds.
: categorical variable representative of whether the individual has been prescribed psychiatric medication (0 – none; 1 – one psychotropic med; 2 – more than one psychotropic med).The resulting dataframe (and variables) are shown above and at this point we’re prepared to more deeply explore our genl dataset.
To do so, we utilize the built-in summary() and glimpse() functions:
summary(genl)
## Age Sex Race ADHD.Total MD.TOTAL
## Min. :18.00 Min. :1.000 Min. :1.00 Min. : 0.00 Min. : 0.00
## 1st Qu.:29.50 1st Qu.:1.000 1st Qu.:1.00 1st Qu.:21.00 1st Qu.: 6.50
## Median :42.00 Median :1.000 Median :2.00 Median :33.00 Median :11.00
## Mean :39.47 Mean :1.434 Mean :1.64 Mean :34.32 Mean :10.02
## 3rd Qu.:48.00 3rd Qu.:2.000 3rd Qu.:2.00 3rd Qu.:47.50 3rd Qu.:14.00
## Max. :69.00 Max. :2.000 Max. :6.00 Max. :72.00 Max. :17.00
##
## SUB.Total Court.order Education Hx.of.Violence
## Min. : 0.000 Min. :0.00000 Min. : 6.0 Min. :0.0000
## 1st Qu.: 3.000 1st Qu.:0.00000 1st Qu.:11.0 1st Qu.:0.0000
## Median : 3.000 Median :0.00000 Median :12.0 Median :0.0000
## Mean : 3.883 Mean :0.08823 Mean :11.9 Mean :0.2439
## 3rd Qu.: 6.000 3rd Qu.:0.00000 3rd Qu.:13.0 3rd Qu.:0.0000
## Max. :13.000 Max. :1.00000 Max. :19.0 Max. :1.0000
## NA's :4 NA's :5 NA's :9 NA's :11
## Disorderly.Conduct Suicide Abuse Non.subst.Dx
## Min. :0.0000 Min. :0.0000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000
## Median :1.0000 Median :0.0000 Median :0.000 Median :0.0000
## Mean :0.7256 Mean :0.3025 Mean :1.329 Mean :0.4379
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :7.000 Max. :2.0000
## NA's :11 NA's :13 NA's :14 NA's :22
## Subst.Dx Psych.meds.
## Min. :0.000 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:0.0000
## Median :1.000 Median :1.0000
## Mean :1.138 Mean :0.9649
## 3rd Qu.:2.000 3rd Qu.:2.0000
## Max. :3.000 Max. :2.0000
## NA's :23 NA's :118
glimpse(genl)
## Rows: 175
## Columns: 15
## $ Age <int> 24, 48, 51, 43, 34, 39, 41, 48, 44, 27, 44, 56, 53,~
## $ Sex <int> 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 1, 1, 2, 1, 2, 2, 2, ~
## $ Race <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ ADHD.Total <int> 40, 55, 31, 45, 48, 55, 54, 41, 56, 56, 42, 38, 31,~
## $ MD.TOTAL <int> 15, 14, 5, 13, 7, 14, 9, 7, 12, 11, 16, 0, 11, 10, ~
## $ SUB.Total <dbl> 3, 0, 0, 4, 2, 1, 9, 0, 2, 3, 10, 0, 2, 0, 4, 3, 0,~
## $ Court.order <int> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ~
## $ Education <int> 11, 14, 12, 12, 9, 11, 12, 16, 12, 9, 12, 18, 12, 1~
## $ Hx.of.Violence <int> 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, ~
## $ Disorderly.Conduct <int> 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, ~
## $ Suicide <int> 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ~
## $ Abuse <int> 0, 4, 6, 7, 0, 2, 4, 0, 0, 2, 4, 2, 0, 0, 0, 5, 1, ~
## $ Non.subst.Dx <int> 2, 1, 2, 2, 2, 0, 1, 2, 1, 1, 1, 2, 0, 1, 0, 2, 2, ~
## $ Subst.Dx <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 2, 2, 0, 0, 0, 1, 1, 0, ~
## $ Psych.meds. <int> 2, 1, 1, 2, 0, 0, 1, 2, 1, 0, 0, 1, 0, 0, 0, 2, 1, ~
From the above outputs we can extend that:
Age
of respondents is ~40, there are slightly more male than female respondents, and the average education level ~12 (finishing high school).ADHD,Total
was ~34, the average MD.TOTAL
was ~10, and the averge SUB.Total
was ~4.Disorderly.Conduct
appears to be more common than Hx.of.Violence
which in turn is more common than Court.order
.Suicide
and some form of abuse appears to be common (ie. Physical).Non.subst.Dx
was ~0.4 while the average Subst.Dx
was ~1.1 giving the impression that we’re dealing with more substance diagnoses that non-substance diagnoses.Psych.meds
is missing the majority of its values and NA values will have to be accounted for (in general).Based on this information, we drop Psych.meds
, drop remaining NA values, and observe the affect on our dataset:
<- genl[c(-15)] #drop Psych.meds
genl <- genl %>% tibble() %>% drop_na() #drop NA values
genl #revisit genl
dim(genl) #175 - 142 = 33 dropped rows
## [1] 142 14
colSums(is.na(genl)) #verify no NAs
## Age Sex Race ADHD.Total
## 0 0 0 0
## MD.TOTAL SUB.Total Court.order Education
## 0 0 0 0
## Hx.of.Violence Disorderly.Conduct Suicide Abuse
## 0 0 0 0
## Non.subst.Dx Subst.Dx
## 0 0
#save dataset for Q3 (in case of "tampering" during Q1, Q2):
<- genl q3_data
Our dataset has been reduced from 175 observations x 15 variables to 142 observations x 14 variables and from numerous NA values across multiple variables to no NA values.
With NA values dealt with, we ensure each variable’s of the proper type and then visualize our variable distributions based on whether they’re categorical or numeric:
#convert features to factor
$Sex <- as.factor(genl$Sex)
genl$Race <- as.factor(genl$Race)
genl$Court.order <- as.factor(genl$Court.order)
genl$Education <- as.factor(genl$Education)
genl$Hx.of.Violence <- as.factor(genl$Hx.of.Violence)
genl$Disorderly.Conduct <- as.factor(genl$Disorderly.Conduct)
genl$Suicide <- as.factor(genl$Suicide)
genl$Abuse <- as.factor(genl$Abuse)
genl$Non.subst.Dx <- as.factor(genl$Non.subst.Dx)
genl$Subst.Dx <- as.factor(genl$Subst.Dx)
genl#convert features to numeric
$Age <- as.numeric(genl$Age)
genl$ADHD.Total <- as.numeric(genl$ADHD.Total)
genl$MD.TOTAL <- as.numeric(genl$MD.TOTAL)
genl$SUB.Total <- as.numeric(genl$SUB.Total)
genl#head(genl) #verify conversions
#visualize categorical distributions as a table
<- inspectdf::inspect_imb(genl)
fig1 fig1
## # A tibble: 10 x 4
## col_name value pcnt cnt
## <chr> <chr> <dbl> <int>
## 1 Court.order 0 92.3 131
## 2 Hx.of.Violence 0 73.9 105
## 3 Disorderly.Conduct 1 70.4 100
## 4 Suicide 0 68.3 97
## 5 Non.subst.Dx 0 64.8 92
## 6 Abuse 0 63.4 90
## 7 Race 2 55.6 79
## 8 Sex 1 54.2 77
## 9 Education 12 38.7 55
## 10 Subst.Dx 1 38.7 55
From the categorical variable table above we see that ~92% of respondent’s cases were not court ordered, ~74% of respondents do not have a history of violence, ~70% of respondents have a history of disorderly conduct, ~68% of respondents have not attempted suicide, ~65% of respondents do not have a non-substance diagnosis, ~63% of respondents do not have a history of abuse, ~56% of respondents are African American, ~54% of respondents are male, ~39% of respondents have completed at least high school, and ~39% of respondents have one substance related diagnosis. The fact that Education
and Subst.Dx
have the same pcnt values is interesting and worth noting.
Moving on to our numeric distributions:
#visualize numeric distributions
<- inspectdf::inspect_num(genl) %>%
fig2 show_plot()
fig2
From the histograms above we see that:
ADHD.Total
has a left skewed normal distribution with a peak ~45.Age
appears to be multimodal with peaks ~25, ~35 and ~50.MD.TOTAL
has a non-normal distribution with a peak concentrated > 15 (at the far right).SUB.Total
appears to have a multi-modal distribution with its largest peak at ~3.With insight gained via variable distributions, we move on to observing the pair plot and correlation matrix for numeric variables:
#paired plot of numeric variables
pairs(genl %>% select_if(is.numeric))
#Correlation matrix for numeric variables
<- genl %>%
genl_corr select_if(is.numeric) %>%
cor()
corrplot(genl_corr, title="Numeric Variable Correlation",type = "lower", mar=c(0,0,1,0))
The pair plot and correlation matrix highlight the fact that ADHD.Total
and MD.TOTAL
appear to have strong correlation. We’ll note this fact, while carrying both variables forward.
……………………………………………………………………..
We elected to utilize hierarchical clustering being that it’s more adaptive than k-means clustering. For k-means clustering, we have to pre-define our k. Whereas for hierarchical clustering both the value of k and the location of our centroids (mid-point between all data points in the group) is determined as a part of the process.
We proceed by computing the distances between all data points (using Euclidean distance), clustering based on distances
based on centroid distance and the variance in each cluster, and then plotting the corresponding dendrogram:
set.seed(123)
#compute distances between all data points (using euclidean distance)
<- dist(genl, method = 'euclidean')
distances #cluster based on distances using centroid distance as well as variance in clusters
<- hclust(distances, method = 'ward.D') #ward.D accounts for centroid distance + variance
clusterHealth #plot the dendrogram
plot(clusterHealth)
Based on our dendrogram, it appears that the ideal number of clusters is either 3, 4, or 5. We proceed with 5 clusters being that it’s the higher of the three and we’d like to better represent the variation we observed during our exploratory data analysis. We want to better capture the distinct characteristics of our clustered groups.
We “cut the tree” and demarcate our groups based on the desired number of clusters, calculate each numeric variables’ mean value for each cluster, and then output the resulting table:
<- cutree(clusterHealth, 5)
clusterGroups #clusterGroups #verify the breakdown per observation
#Mean value for numeric variables
#tapply(genl$Age, clusterGroups, mean, simplify=TRUE) #avg age for each cluster
#tapply(genl$ADHD.Total, clusterGroups, mean) #avg ADHD score for each cluster
#tapply(genl$MD.TOTAL, clusterGroups, mean) #avg MD score for each cluster
#tapply(genl$SUB.Total, clusterGroups, mean) #avg SUB score for each cluster
#Extract corresponding values (to place in easy to understand form):
<- c(27.06, 48.95, 41.42, 47.4, 25.22)
avg_age <- c(49.34, 53.52, 34.12, 15.93, 6.56)
avg_adhd <- c(12.84, 12.71, 10.58, 7.27, 4.11)
avg_md <- c(4.13, 3.28, 4.1, 4.17, 3.44)
avg_sub #Build kable table
<- c("Group 1", "Group 2", "Group 3", "Group 4", "Group 5")
group_names <- data.frame(group_names, avg_age, avg_adhd, avg_md, avg_sub)
df %>%
df kbl() %>%
kable_styling(latex_options = "striped")
group_names | avg_age | avg_adhd | avg_md | avg_sub |
---|---|---|---|---|
Group 1 | 27.06 | 49.34 | 12.84 | 4.13 |
Group 2 | 48.95 | 53.52 | 12.71 | 3.28 |
Group 3 | 41.42 | 34.12 | 10.58 | 4.10 |
Group 4 | 47.40 | 15.93 | 7.27 | 4.17 |
Group 5 | 25.22 | 6.56 | 4.11 | 3.44 |
Based on the output above we’d profile our groups as follows:
It was interesting to profile our groups and observe that there appeared to be 2 tiers in the earlier and later years. It was also interesting to see that all groups had a relatively low incidence of substance use / dependency. Due to its apparent low impact as a differentiating factor for our clusters it was not discussed.
In the Appendix, we’ve also provided code (with a more succinct write up) for our application of K-means clustering analysis to the genl
(general) dataset.
……………………………………………………………………..
Principal Component Analysis (PCA) is a way of reducing the dimensions of a given dataset by extracting new features from the original features present in the dataset. The “new” variables after PCA are all independent of one another. We elected to perform Principal Component Analysis on the ADHD self-report columns, MD (mood disorder) questionnaire, and entire dataset.
We start with the Attention Deficit Hyperactivity Disorder (ADHD) symptom variables. During EDA, we’d subset datasets based on question type and thus we proceed with the adhd
dataset.
There are 19 variables in the adhd
dataset, 18 pertaining to self-reported ADHD question responses and 1 the total score for that row. All ADHD questions capture categorical responses: never-0, rarely-1, sometimes-2, often-3, and very often-4 whereas the ADHD.Total
is a quantitative (sum) variable representative of the individual’s total self-report score for ADHD scale.
We verify that our data does not have missing values:
# Check for NAs
colSums(is.na(adhd))
## ADHD.Q1 ADHD.Q2 ADHD.Q3 ADHD.Q4 ADHD.Q5 ADHD.Q6 ADHD.Q7
## 0 0 0 0 0 0 0
## ADHD.Q8 ADHD.Q9 ADHD.Q10 ADHD.Q11 ADHD.Q12 ADHD.Q13 ADHD.Q14
## 0 0 0 0 0 0 0
## ADHD.Q15 ADHD.Q16 ADHD.Q17 ADHD.Q18 ADHD.Total
## 0 0 0 0 0
Missing values would disturb the implementation of PCA (where we’d have to drop / impute corresponding values), and we confirm that our data has 0 missing values across all variables.
We then perform PCA via pcrcomp() function and review corresponding statistics:
# Perform PCA and review summary statistics
<- prcomp(adhd, scale. = TRUE)
pca_adhd summary(pca_adhd)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 3.2092 1.16479 1.00701 0.89915 0.85060 0.77076 0.75157
## Proportion of Variance 0.5421 0.07141 0.05337 0.04255 0.03808 0.03127 0.02973
## Cumulative Proportion 0.5421 0.61345 0.66683 0.70938 0.74746 0.77872 0.80845
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.70763 0.66793 0.64292 0.63647 0.60782 0.59497 0.5375
## Proportion of Variance 0.02635 0.02348 0.02175 0.02132 0.01944 0.01863 0.0152
## Cumulative Proportion 0.83481 0.85829 0.88004 0.90136 0.92081 0.93944 0.9546
## PC15 PC16 PC17 PC18 PC19
## Standard deviation 0.52570 0.46267 0.4529 0.40571 0.04007
## Proportion of Variance 0.01455 0.01127 0.0108 0.00866 0.00008
## Cumulative Proportion 0.96919 0.98046 0.9912 0.99992 1.00000
From the Importance of components
, we can interpret:
Standard deviation
as a measure of variability across a single component (ie. deviation of data within PC1
)Proportion of Variance
as a measure of the component’s variability / representative capability for the entire dataset (ie. the sum of the Proportion of Variance
row = 1.00).Cumulative Proportion
as a measure of the cumulative variability represented by principal components up until the component column we’re on.From these definitions we can extend that the first 3 components (PC1
, PC2
, and PC3
) represent ~ 66.7% of the data and remaining components add but minor increments to our representative capability (all the way until PC19
).
Next, we visit the scree plot:
# scree plot
plot(pca_adhd, type = 'l', col = 'blue')
If we recall from statistics: \(var = sd^2\), where \(var\) represents variance and \(sd\) represents standard deviation.
From the scree plot above, we confirm our earlier finding that PC1, PC2, and PC3 appear to be enough to represent our larger dataset.
To explore how our variables (original and PCA) relate to one another, we bind PC1
, PC2
, and PC3
to our original adhd
dataset, calculate variable correlation and output the result as a table:
# combine datasets
<- cbind(adhd, pca_adhd$x)
adhd_df2 # %>% column_spec(which(ADHD_df_2[,19] > 0.69) , color = 'white', background = 'maroon')
# cor plot
cor(adhd, adhd_df2[, 20:22]) %>% kableExtra::kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),latex_options="scale_down")
PC1 | PC2 | PC3 | |
---|---|---|---|
ADHD.Q1 | 0.6980086 | -0.2205533 | 0.3379874 |
ADHD.Q2 | 0.7087355 | -0.3229617 | 0.2203357 |
ADHD.Q3 | 0.6730084 | -0.3613418 | 0.1294906 |
ADHD.Q4 | 0.7363824 | -0.3583042 | 0.1760306 |
ADHD.Q5 | 0.7279360 | -0.1324912 | -0.4384661 |
ADHD.Q6 | 0.6490774 | 0.0323714 | -0.3655759 |
ADHD.Q7 | 0.7393470 | -0.1582908 | 0.0755233 |
ADHD.Q8 | 0.8025831 | -0.1275705 | 0.1186556 |
ADHD.Q9 | 0.7958069 | -0.0225645 | 0.0772789 |
ADHD.Q10 | 0.7919919 | -0.0589730 | 0.1039090 |
ADHD.Q11 | 0.7411649 | -0.1609594 | -0.1413902 |
ADHD.Q12 | 0.6869949 | 0.2492497 | 0.1772878 |
ADHD.Q13 | 0.7627239 | -0.0183180 | -0.4475655 |
ADHD.Q14 | 0.7185091 | 0.0224583 | -0.3740732 |
ADHD.Q15 | 0.6382790 | 0.3882041 | 0.0003570 |
ADHD.Q16 | 0.6736272 | 0.5359116 | 0.1555591 |
ADHD.Q17 | 0.6547685 | 0.4287821 | 0.1110080 |
ADHD.Q18 | 0.7104675 | 0.4124523 | 0.1129293 |
ADHD.Total | 0.9988446 | 0.0127319 | -0.0120590 |
From the correlation table we can extend that:
PC1
and it appears that PC1
is identical to ADHD.Total
based on their correlation of ~1.00.ADHD.Total
have a slight positive correlation with PC2
.ADHD.Total
has a slight negative correlation with with PC3
.The correlations above provide indication as to the makeup of our principal components. Those with a stronger positive correlation are more a component while those with a stronger negative correlation are less a component.
From our correlation table, we move on to the corresponding biplot.
When our first two PCs explain a significant portion of the variance in our data (which is the case for our application), we can make use of a bi-plot.
A bi-plot overlays the scores plot, a projection of observations onto the span of the first two PCs, and a loadings plot, a projection of the variable vectors onto the span of the PCs, in a single graphic:
# Bi-plot of PC1 and PC2
biplot(pca_adhd, choices = c(1,2), scale = 0)
Points are projected observations while vectors are projected variables. In the above plot, the red arrows represent the eigenvectors of each variable helping us to interpret the relationship between principal components while the numbers in black represent our observations.
With regard to reading the plot, the left and bottom axes are used to read PCA scores (of the dots) while the top and right axes are used to read how strongly each vector influences the PC.
From the above biplot, we can extend that: * the observations appear to play a larger role in the makeup of PC1
than that of PC2
, * ADHD.Tot
has the most weight for PC1
and * Q15:Q18 carry a lot of positive weight while Q2:Q4 carry negative weight for PC2
Next, we move on to the bi-plot for PC2 and PC3:
# Bi-plot of PC2 and PC3
biplot(pca_adhd, choices = c(2,3), scale = 0)
From the PC2
-PC3
biplot we can extend that:
PC2
vs. PC3
(ie. 60) while others play a larger role in PC3
vs. PC2
(ie. 144),PC2
andPC3
Finally, we explore the bi-plot for PC1 and PC3:
# Bi-plot of PC1 and PC3
biplot(pca_adhd, choices = c(1,3), scale = 0)
The PC1
-PC3
bi-plot has a similar layout to our first:
PC1
than that of PC2
,ADHD.Tot
has a lot of weight for PC1
andPC2
.These plots are useful, but as a possible improvement we’ll explore varimax.
Varimax rotation is used to simplify the expression of a particular sub-space. Coordinates remain unchanged while the sum of squared loadings are maximized.
We apply the varimax() function to the rotation variable of our principal components and plot the correlation of corresponding loadings:
# apply varimax
<- varimax(pca_adhd$rotation)
var_adhd $loadings var_adhd
##
## Loadings:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 PC13 PC14 PC15
## ADHD.Q1 -1
## ADHD.Q2 1
## ADHD.Q3 1
## ADHD.Q4 -1
## ADHD.Q5 -1
## ADHD.Q6 -1
## ADHD.Q7 -1
## ADHD.Q8
## ADHD.Q9 1
## ADHD.Q10 -1
## ADHD.Q11 -1
## ADHD.Q12 1
## ADHD.Q13
## ADHD.Q14
## ADHD.Q15 -1
## ADHD.Q16 1
## ADHD.Q17 -1
## ADHD.Q18 -1
## ADHD.Total
## PC16 PC17 PC18 PC19
## ADHD.Q1
## ADHD.Q2
## ADHD.Q3
## ADHD.Q4
## ADHD.Q5
## ADHD.Q6
## ADHD.Q7
## ADHD.Q8 1
## ADHD.Q9
## ADHD.Q10
## ADHD.Q11
## ADHD.Q12
## ADHD.Q13 -1
## ADHD.Q14 -1
## ADHD.Q15
## ADHD.Q16
## ADHD.Q17
## ADHD.Q18
## ADHD.Total -1
##
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
## SS loadings 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
## Proportion Var 0.053 0.053 0.053 0.053 0.053 0.053 0.053 0.053 0.053 0.053
## Cumulative Var 0.053 0.105 0.158 0.211 0.263 0.316 0.368 0.421 0.474 0.526
## PC11 PC12 PC13 PC14 PC15 PC16 PC17 PC18 PC19
## SS loadings 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
## Proportion Var 0.053 0.053 0.053 0.053 0.053 0.053 0.053 0.053 0.053
## Cumulative Var 0.579 0.632 0.684 0.737 0.789 0.842 0.895 0.947 1.000
#var_adhd$loadings #corrplot is more succinct, easier to interpret
corrplot(var_adhd$loadings[,1:3], is.corr = FALSE)
We elect to interpret the correlation plot (over the loadings output) due to its clarity and succinctness. From the above output, we can see that:
PC1
is strongly (negatively) correlated with ADHD.Q18
,PC2
is strongly (positively) correlated with ADHD.Q16
, andPC3
is strongly (negatively) correlated with ADHD.Q5
Although the results conflict with our earlier plot interpretations, one cannot argue with the simplicity of concise code. Varimax provided in 2 lines of code what it took multiple plots, iterations, and interpretations to come to when referring to biplots: the strongest correlate for each principal component.
In the Appendix, we’ve also provided code (with a more succinct write up) for our application of principal component analysis to the md
(mood disorder) and genl
(general) datasets.
……………………………………………………………………..
A Support Vector Machine (SVM) is a supervised machine learning algorithm that creates a hyperplane / decision boundary based on labeled data. Support vectors are the data points closest to the hyperplane, that influence the hyperplane’s location and orientation. Using support vectors, we maximize the margin, the space between prospective classes, by ensuring that our classifier is equidistant from the closest observation in each class.
First we need to identify NAs using vis_miss
from visdat package. This graph shows the missing values of corresponding columns in percentage. In bottom section shows total percentage of missing value present in the data.
::vis_miss(data) visdat
We can see maximum NAs availabe in Psych.meds column. We replace NAs with Mode. Created function and pass the dataset into it.
<- function(df, ...) {
mode <- table(df)
tb which(tb == max(tb)) %>% sort %>% `[`(., 1) %>% names
}<- function(df, fun) {
impute <- fun(df, na.rm = TRUE)
fval <- ifelse(is.na(df), fval, df)
y return(y)
}<- data %>%
data_all select(-Initial) %>%
mutate(
Alcohol = impute(Alcohol, mode),
THC = impute(THC, mode),
Cocaine = impute(Cocaine, mode),
Stimulants = impute(Stimulants, mode),
Sedative.hypnotics = impute(Sedative.hypnotics, mean),
Opioids = impute(Opioids, mean),
Court.order = impute(Court.order, mean),
Education = impute(Education, mode),
Hx.of.Violence = impute(Hx.of.Violence, mode),
Disorderly.Conduct = impute(Disorderly.Conduct, mode),
Suicide = impute(Suicide, mode),
Abuse = impute(Abuse, mode),
Non.subst.Dx = impute(Non.subst.Dx, mode),
Subst.Dx = impute(Subst.Dx, mode),
Psych.meds. = impute(Psych.meds., mode)
%>%
) mutate_if(is.character, as.numeric)
::vis_miss(data_all) visdat
No missing value availab in th dataset. We can perform EDA on suicide analysis.
Below graphs shows the Suicide pecentage over different category.
Plot, suicide % over other category such as Age, Sex, Education, Cocaine, Hx.of.Violence, Disorderly.Conduct, Abuse, Non.subst.Dx, Subst.Dx, and Psych.meds.
<- data_all %>% select(c('Age','Suicide'))
p setDT(p)[Age <10, agegroup := "0-10"]
>=10 & Age <20, agegroup := "10-20"]
p[Age >=20 & Age <30, agegroup := "20-30"]
p[Age >=30 & Age <40, agegroup := "30-40"]
p[Age >=40 & Age <50, agegroup := "40-50"]
p[Age >=50 & Age <60, agegroup := "50-60"]
p[Age >=60 & Age <70, agegroup := "60-70"]
p[Age PlotXTabs2(p, agegroup, Suicide, results.subtitle = FALSE, title ="Suicide commit Pecentage by Age Group")
Agegroup (30-40) committed more suicide and agegroup (40-50) is the second highest.
Age group by suicide commit:
<- data_all %>% select(c('Sex','Suicide'))
p PlotXTabs2(p, Sex, Suicide, results.subtitle = FALSE, title ="Suicide commit Pecentage by Sex")
females attempted suicide more than males.
<- data_all %>% select(c('Race','Suicide'))
p PlotXTabs2(p, Race, Suicide, results.subtitle = FALSE, title ="Suicide commit Pecentage by Race")
Suicide committed by Race:
Interesting point here, Hispanic
no suicide % because there is only 1 observation for this category. Similarly for Other or missing data
category having 2 observations are present, due to lack of data others showing 50% suicide.
<- data_all %>% select(c('Education','Suicide'))
p PlotXTabs2(p, Education, Suicide, results.subtitle = FALSE, title ="Suicide commit Pecentage by level of Education")
Higher educated less likely committed suicide than others.
<- data_all %>% select(c('Hx.of.Violence','Suicide'))
p PlotXTabs2(p, Hx.of.Violence, Suicide, results.subtitle = FALSE, title ="Suicide commit Pecentage by Violence")
Violant people commited more suicide than non-violant.
<- data_all %>% select(c('Disorderly.Conduct','Suicide'))
p PlotXTabs2(p, Disorderly.Conduct, Suicide, results.subtitle = FALSE, title ="Suicide commit Pecentage by Disorderly conduct")
This is little surprising, non Disorderly conduct committed more suicide than Disorderly Conduct.
<- data_all %>% select(c('Abuse','Suicide'))
p PlotXTabs2(p, Abuse, Suicide, results.subtitle = FALSE, title ="Suicide commit Pecentage by Abuse")
Suicide commit by Abuse category:
Other abuse category are less.
<- data_all %>% select(c('Non.subst.Dx','Suicide'))
p PlotXTabs2(p, Non.subst.Dx, Suicide, results.subtitle = FALSE, title ="Suicide commit Pecentage by Non subst Dx")
Non-substance related Dx : Suicide commit Non-subset Dx one is highest and Non-substance related Dx none is lowest.
<- data_all %>% select(c('Subst.Dx','Suicide'))
p PlotXTabs2(p, Subst.Dx, Suicide, results.subtitle = FALSE, title ="Suicide commit Pecentage by Subst Dx")
Substance related Dx : Suicide commit substance related Dx two and three are highest and substance related Dx one is lowest.
<- data_all %>% select(c('Psych.meds.','Suicide'))
p PlotXTabs2(p, Psych.meds., Suicide, results.subtitle = FALSE, title ="Suicide commit Pecentage by Psych.meds.")
Suicide commit by Psychiatric Meds:
By use of support vectors our SVM is capable of finding the optimal decision boundary to distinctly classify our data. In our case, we’re going to model suicide attempts from the genl
dataset with a radial and a linear decision boundary. We’ll then interpret the performance of each model side-by-side.
To start, we re-familiarize ourselves with the data at hand and our dependent variable (Suicide
):
#Ensure it's the data of interest:
<- q3_data
genl #Refamiliarize ourselves with the genl dataset:
dim(genl) #175 - 142 = 33 dropped rows
## [1] 142 14
head(genl)
## # A tibble: 6 x 14
## Age Sex Race ADHD.Total MD.TOTAL SUB.Total Court.order Education
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 24 1 1 40 15 3 1 11
## 2 48 2 1 55 14 0 0 14
## 3 51 2 1 31 5 0 0 12
## 4 43 1 1 45 13 4 0 12
## 5 34 1 1 48 7 2 1 9
## 6 39 2 1 55 14 1 0 11
## # ... with 6 more variables: Hx.of.Violence <int>, Disorderly.Conduct <int>,
## # Suicide <int>, Abuse <int>, Non.subst.Dx <int>, Subst.Dx <int>
#Visualize suicide attempt counts:
as.data.frame(table(genl$Suicide))
## Var1 Freq
## 1 0 97
## 2 1 45
We’re dealing with 142 observations x 14 variables - 10 of type factors, 4 of type double, and a dependent variable Suicide
where 97/142 (68.3%) of patients have not attempted suicide and 45/142 (31.7%) of patients have.
We’ll narrow our dataset further by exploring our independent variable’s correlation with Suicide
, with one another, and then verify our assumptions with the Boruta function for feature ranking and selection.
We start by measuring the strength of correlation to Suicide
for independent variables:
#Compute proportion of missing data per variable
<- colnames(genl)
v <- function(x) sum(!complete.cases(x)) / 142
incomplete <- sapply(genl[v], incomplete)
Missing_Data #head(missing) #verify
#Compute correlation between each variable and Suicide
<- function(x, y) cor(y, x, use = "na.or.complete")
target_corr <- sapply(genl[v], target_corr, y=genl$Suicide)
Suicide_Correlation #head(c_target) #verify
#Bind and output Missing Data and Correlation with Target
<- data.frame(cbind(Missing_Data, Suicide_Correlation))
MDCS %>%
MDCS kbl(caption = "Proportion of Missing Data vs. Correlation with Suicide") %>%
kable_minimal()
Missing_Data | Suicide_Correlation | |
---|---|---|
Age | 0 | -0.0815414 |
Sex | 0 | 0.1640972 |
Race | 0 | -0.1025970 |
ADHD.Total | 0 | 0.1337074 |
MD.TOTAL | 0 | 0.2859463 |
SUB.Total | 0 | 0.1543758 |
Court.order | 0 | 0.1423456 |
Education | 0 | -0.1011286 |
Hx.of.Violence | 0 | 0.1129186 |
Disorderly.Conduct | 0 | -0.0228881 |
Suicide | 0 | 1.0000000 |
Abuse | 0 | 0.3107638 |
Non.subst.Dx | 0 | 0.0238396 |
Subst.Dx | 0 | 0.1698773 |
From the above table we can extend that:
Age
, Disorderly.Conduct
, and Non.subst.Dx
have a weak correlation with Suicide
Abuse
, MD.Total
, and Subst.Dx
have the strongest correlation with Suicide
,Suicide
.From this, we can see the 3 variables we may keep as well as those we may drop. For the remaining variables, we’ll explore further to guide our decision.
We measure the correlation between variables to observe multicollinearity:
#Utilize custom-built correlation matrix generation function
plot_corr_matrix(genl, 0.3)
Multicollinearity is a concern and that we may eliminate a larger portion of variables as a result. We’ll proceed with only variables with strong predictive promise and / or unique value added.
Based on these guidelines, it appears that we should proceed with at least Abuse
, MD.Total
, Subst.Dx
, whereas variables Sex
, Court.order
, and Hx.of.Violence
we’ll decide after consulting the Boruta output.
To address weak target correlation and multicollinearity we consider the exclusion of 7-10 variables : Age
, Race
, Education
, Disorderly.Conduct
, and Non.subst.Dx
due to weak correlation with the target variable, ADHD.Total
and SUB.Total
due to multicollinearity, and Sex
, Court.order
, and Hx.of.Violence
we’ll decide after consulting our Boruta output.
Before doing so, we utilize the Boruta function for feature ranking and selection:
# Perform Boruta search
<- Boruta(Suicide ~ ., data=genl, doTrace=0, maxRuns = 1000)
boruta_output #Get significant variables including tentatives
<- getSelectedAttributes(boruta_output, withTentative = TRUE)
boruta_signif #print(boruta_signif)
# Plot variable importance
plot(boruta_output, cex.axis=.7, las=2, xlab="", main="Variable Importance")
From the above plot, we see that only two variables (Abuse
and MD.TOTAL
) carry much importance when we consider Suicide
but in order to offer a more comprehensive representation, we’ll also include the next 3 variables (Race
, Subst.Dx
, and Court.order
). It’s worth noting that all variables score relatively low in feature importance which may be an early indication of difficulty in predicting Suicide
.
We select the 5 features noted above, convert categorical features to type factor and then verify the dataset that we’re proceeding to SVM model-building with:
## # A tibble: 6 x 6
## Race MD.TOTAL Court.order Suicide Abuse Subst.Dx
## <fct> <int> <fct> <fct> <fct> <fct>
## 1 1 15 1 1 0 0
## 2 1 14 0 1 4 0
## 3 1 5 0 0 6 0
## 4 1 13 0 1 7 0
## 5 1 7 1 1 0 0
## 6 1 14 0 1 2 0
We observe 5 independent variables (1 numeric, 4 factor) and find that our data is of the proper form aside from the fact that we have numerous independent variables of type factor that we may have to convert to “one hot” dummie variables (ie. unique variables for each level of the factor variable where a 1 is present).
We proceed to train-test split our data (for later assessment), convert categorical variables into “one hot” dummie variables, train our model, and consult the summary statistics:
To improve model can tune the SVM by changing the parameters Cost, gamma and the kernel function. Use svm.tune from package e1071
. This function can accepts many parameters, here we use 3. They are kernel, cost, and gamma.
Here we create 3 models using radial, linear and polynomial.
#Set set (replicability)
set.seed(123)
#Perform train-test Split
<- initial_split(select_df, prop = 0.75)
train_split <- training(train_split)
training <- testing(train_split)
test <- test$Suicide #hold actual test values for later comparison
results $Suicide <- NA #set initial test values to NA, to then be predicted
test#Convert categorical variables into dummie variables (one hot or dummyvars)
<- dummyVars(~ ., data=training[,-4]) #Omit col 4 (Suicide) from dummy var generation
dummies <- as.data.frame(predict(dummies, training[,-4]))
d_train $Suicide <- training$Suicide
d_train<- dummyVars(~ ., data=test[,-4])
dummies <- as.data.frame(predict(dummies, test[,-4]))
d_test $Suicide <- test$Suicide
d_test#head(d_train) #verify dummie variables
#Set gamma list and use variable cost to determine optimal model parameters (for SVM modeling)
<- c(0.005,0.01,0.015,0.02,0.025,0.03,0.035,0.04,0.045,0.05)
gammalist #1st SVM application: radial kernel
<- tune.svm(Suicide ~., data=d_train, kernel='radial', cost=2^(-1:5), gamma = gammalist)
svm_radial # summary(svm_radial)
# summary(svm_radial$best.model)
<- predict(svm_radial$best.model, d_test[,-20])
svm1 #confusionMatrix(svm1, results)
#2nd SVM application: linear kernel
<- tune.svm(Suicide ~., data=d_train, kernel='linear')
svm_linear # summary(svm_linear)
# summary(svm_linear$best.model)
<- predict(svm_linear$best.model, d_test[,-20])
svm2 #confusionMatrix(svm2, results)
#3rd SVM application: polynomial kernel
<- tune.svm(Suicide ~., data=d_train, kernel='polynomial', cost=2^(-1:5), gamma = gammalist)
svm_polynomial <- predict(svm_polynomial$best.model, d_test[,-20])
svm3 #Convert performance statistics to tabular form (for output / interpretation):
<- confusionMatrix(svm1, results)$overall['Accuracy']
AccuracySVM1 <- confusionMatrix(svm2, results)$overall['Accuracy']
AccuracySVM2 <- confusionMatrix(svm3, results)$overall['Accuracy']
AccuracySVM3 <- confusionMatrix(svm1, results)$byClass
SVM_Model_1 <- data.frame(SVM_Model_1)
SVM_Model_1 <- rbind("Accuracy" = AccuracySVM1, SVM_Model_1)
SVM_Model_1 <- confusionMatrix(svm2, results)$byClass
SVM_Model_2 <- data.frame(SVM_Model_2)
SVM_Model_2 <- rbind("Accuracy" = AccuracySVM2, SVM_Model_2)
SVM_Model_2 <- confusionMatrix(svm3, results)$byClass
SVM_Model_3 <- data.frame(SVM_Model_3)
SVM_Model_3 <- rbind("Accuracy" = AccuracySVM3, SVM_Model_3)
SVM_Model_3 <- data.frame(SVM_Model_1, SVM_Model_2, SVM_Model_3)
tabularview %>% kableExtra::kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),latex_options="scale_down") tabularview
SVM_Model_1 | SVM_Model_2 | SVM_Model_3 | |
---|---|---|---|
Accuracy | 0.6285714 | 0.6285714 | 0.7428571 |
Sensitivity | 0.6428571 | 0.6785714 | 0.8214286 |
Specificity | 0.5714286 | 0.4285714 | 0.4285714 |
Pos Pred Value | 0.8571429 | 0.8260870 | 0.8518519 |
Neg Pred Value | 0.2857143 | 0.2500000 | 0.3750000 |
Precision | 0.8571429 | 0.8260870 | 0.8518519 |
Recall | 0.6428571 | 0.6785714 | 0.8214286 |
F1 | 0.7346939 | 0.7450980 | 0.8363636 |
Prevalence | 0.8000000 | 0.8000000 | 0.8000000 |
Detection Rate | 0.5142857 | 0.5428571 | 0.6571429 |
Detection Prevalence | 0.6000000 | 0.6571429 | 0.7714286 |
Balanced Accuracy | 0.6071429 | 0.5535714 | 0.6250000 |
The models share the same Accuracy and Prevalence, whereas SVM_Model_1
(radial) outperformed SVM_Model_2
(linear) in Specificity, Pos Pred Value, Neg Pred Value, Precision, and Balanced Accuracy while SVM_Model_2
(linear) outperformed SVM_Model_1
(radial) for Sensitivity, Recall, F1, Detection Rate, and Detection Prevalence. SVM_Model_3
outperformed than other 2 models. This model have high accuracy, specificity, recall and f1 than other 2 models.
The performance was rather close but being that we’re dealing with an imbalanced set (97/142 (68.3%) of patients have not attempted suicide), we use the Balanced Accuracy as our deciding factor and elect SVM_Model_1
(radial).
confusionMatrix(svm1, results)$table
## Reference
## Prediction 0 1
## 0 18 3
## 1 10 4
confusionMatrix(svm2, results)$table
## Reference
## Prediction 0 1
## 0 19 4
## 1 9 3
Both the models SVM_model_1
and SVM_model_2
have higher class error in predicting no suicide and less error in predicting suicide.
confusionMatrix(svm3, results)$table
## Reference
## Prediction 0 1
## 0 23 4
## 1 5 3
The confusion matrix shows SVM_model_3
is a good model for suicide and non-suicide with accuracy of 74%. The miscallsification happens for suicide is 4 and no suicide 5. There is definately scope to improve the performance of the model. This dataset we did the aggregate of the columns and did not impute NAs, this might be the reason of low accuracy. In the appendix section we perform 6 SVM models (including 3 kernels), one using entire dataset and another using PCA output.
In the Appendix, we’ve also provided code (with a more succinct write up) for our application of support vector machines to the entire dataset (~86% Accuracy) and our principal components ~74% Accuracy).
It was nice to learn via application the effect of kernel selection. For a simpler model and one we believe is likely to be linearly separable, it may be best to start with a linear kernel, whereas when the data appears to be non-linear (which we believed to be the case here), we can start with a non-linear application and then compare back to a linear model to test our assumptions. In this case the radial kernel model did not perform leaps-and-bounds better than the linear kernel model but it did outperform it when we considered the fact that we were dealing with unbalanced data.
It appears that both models were able to relatively accurately predict suicide for those who had attempted suicide in the past but each had a difficult time predicting non-suicidal patients. If the model were advanced further, this might be a case where accurately predicting positive (suicide) and remediating the situation (ie. providing help to the patient) could supercede the model’s inability to predict non-suicide but this is a very sensitive topic and the model would absolutely have to best tested to a much deeper degree and brought to a much higher level of accuracy to even consider such an idea.
In general though, we were surprised by the low accuracy of our SVM models provided the “clout” surrounding Support Vector Machines and their ability to accurately predict. The reason for a lower performance may have been because the null values we’d dropped (maybe there was valuable data therein), the variables we elected to focus upon, the way in which we applied the models, the nature of the data itself, or the simple fact that it’s difficult to predict something like suicide because humans are creatures of nuance and with that comes a great level of unpredictability.
We later confirmed, with code included in the Appendix, that Accuracy could be improved via inclusion of stronger (ie. principal components) or more features (ie. the entire dataset).
With entire dataset we get better accuracy, this can also be improved with differnt machine learning algo. In future we may include QDA and LDA to see how the model works.
……………………………………………………………………..
In completing this assignment, reference was made to the following:
……………………………………………………………………..
There are columns 15 columns for MD. we will follow the same process that we used for ADHD.
<- data %>% select(24:38)
MD_df head(MD_df)
## MD.Q1a MD.Q1b MD.Q1c MD.Q1d MD.Q1e MD.Q1f MD.Q1g MD.Q1h MD.Q1i MD.Q1j MD.Q1k
## 1 1 1 1 1 0 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1 1 0 0
## 3 0 0 0 0 1 1 1 0 0 0 0
## 4 1 1 0 0 1 1 1 1 1 0 0
## 5 0 1 0 1 0 1 1 0 0 0 0
## 6 0 1 0 1 1 1 1 1 1 1 1
## MD.Q1L MD.Q1m MD.Q2 MD.Q3
## 1 0 1 1 3
## 2 1 0 1 3
## 3 0 0 0 2
## 4 1 1 1 3
## 5 0 0 1 2
## 6 1 0 1 3
# check for NAs
colSums(is.na(MD_df))
## MD.Q1a MD.Q1b MD.Q1c MD.Q1d MD.Q1e MD.Q1f MD.Q1g MD.Q1h MD.Q1i MD.Q1j MD.Q1k
## 0 0 0 0 0 0 0 0 0 0 0
## MD.Q1L MD.Q1m MD.Q2 MD.Q3
## 0 0 0 0
Above summary we see no NAs available in the MD_df
dataset.
In MD.Q3 column scale is different than other columns, so to avoid scale factor set scale = True.
# Compute principal component analysis.
<- prcomp(MD_df, scale = TRUE)
pca_md summary(pca_md)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.3857 1.3474 0.9721 0.94891 0.91563 0.82515 0.79246
## Proportion of Variance 0.3794 0.1210 0.0630 0.06003 0.05589 0.04539 0.04187
## Cumulative Proportion 0.3794 0.5004 0.5635 0.62348 0.67937 0.72476 0.76663
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.77044 0.74286 0.72814 0.70302 0.65162 0.5835 0.55685
## Proportion of Variance 0.03957 0.03679 0.03535 0.03295 0.02831 0.0227 0.02067
## Cumulative Proportion 0.80620 0.84299 0.87834 0.91128 0.93959 0.9623 0.98296
## PC15
## Standard deviation 0.50553
## Proportion of Variance 0.01704
## Cumulative Proportion 1.00000
From pca summary, PC1 shows 37% variance in the data, PC2 shows 50% variance in the data, PC3 shows 56% variance in the data and so on.
Plot the scree plot to disply the same, scree plot y-axis is variance nothing but the square of the SD.
# scree plot
plot(pca_md, type = 'l', col = 'blue')
From scree plot PC1, PC2, PC3, and PC4 are describes 62% variance in the data.
From ‘https://www.originlab.com/doc/Origin-Help/PrincipleComp-Analysis’, found if the relationship is weak between variables, PCA does not work well to reduce data. For MD data more than 3 PC components describe the data, not sure how well PCA will work.
Create another dataset use cbind
with original MD data along with PC1, PC2, PC3, and PC4.
# combine dataset
<- cbind(MD_df, pca_md$x)
md_df_2 # cor plot
cor(MD_df, md_df_2[, 16:19]) %>% kableExtra::kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),latex_options="scale_down")
PC1 | PC2 | PC3 | PC4 | |
---|---|---|---|---|
MD.Q1a | -0.6888686 | -0.1322144 | 0.0229725 | -0.3776545 |
MD.Q1b | -0.6421749 | -0.3857186 | 0.1381774 | 0.1459230 |
MD.Q1c | -0.4439895 | 0.4544153 | -0.4763749 | -0.0977568 |
MD.Q1d | -0.5655577 | 0.0049581 | 0.0644164 | -0.0442116 |
MD.Q1e | -0.6391312 | 0.2391336 | -0.1887304 | 0.1745264 |
MD.Q1f | -0.6668512 | -0.2825469 | 0.1009022 | 0.2032375 |
MD.Q1g | -0.6818994 | -0.3687898 | 0.0536512 | 0.1911240 |
MD.Q1h | -0.6193596 | 0.4973240 | 0.0584859 | 0.2126773 |
MD.Q1i | -0.5873223 | 0.4451927 | -0.1405849 | 0.2802170 |
MD.Q1j | -0.5954591 | 0.4356557 | 0.1328873 | 0.0517699 |
MD.Q1k | -0.5118310 | 0.4082574 | 0.5106816 | -0.2784104 |
MD.Q1L | -0.7087339 | -0.1724858 | 0.1944088 | -0.4085494 |
MD.Q1m | -0.5682204 | -0.0522540 | -0.4267685 | -0.4284171 |
MD.Q2 | -0.7328573 | -0.2802758 | 0.0969890 | 0.2440009 |
MD.Q3 | -0.5118573 | -0.5064956 | -0.3385010 | 0.0527350 |
Above cor table we can see
In Bi-plot red arrows represent the eigenvectors of each variable. This plot helps us to interpret the relationship between principal components.
# Bi-plot of PC1 and PC2
biplot(pca_md, choices = c(1,2), scale = 0)
From (PC1 and PC2 biplot) observations have low MD.Q3, MD.Q1h
# Bi-plot of PC2 and PC3
biplot(pca_md, choices = c(2,3), scale = 0)
(PC2 and PC3 biplot)
# Bi-plot of PC1 and PC3
biplot(pca_md, choices = c(3,4), scale = 0)
From (PC1 and PC3 biplot)
We need better approach to examine the relationship.
# apply varimax
<- varimax(pca_md$rotation)
var_md $loadings var_md
##
## Loadings:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 PC13 PC14 PC15
## MD.Q1a 1
## MD.Q1b 1
## MD.Q1c 1
## MD.Q1d 1
## MD.Q1e -1
## MD.Q1f 1
## MD.Q1g -1
## MD.Q1h -1
## MD.Q1i 1
## MD.Q1j -1
## MD.Q1k 1
## MD.Q1L -1
## MD.Q1m -1
## MD.Q2 -1
## MD.Q3 1
##
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
## SS loadings 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
## Proportion Var 0.067 0.067 0.067 0.067 0.067 0.067 0.067 0.067 0.067 0.067
## Cumulative Var 0.067 0.133 0.200 0.267 0.333 0.400 0.467 0.533 0.600 0.667
## PC11 PC12 PC13 PC14 PC15
## SS loadings 1.000 1.000 1.000 1.000 1.000
## Proportion Var 0.067 0.067 0.067 0.067 0.067
## Cumulative Var 0.733 0.800 0.867 0.933 1.000
corrplot(var_md$loadings[,1:4], is.corr = FALSE)
Varimax loading interprets :
The PCA of MD shows first principle component have 38% of the variance in the data. To get 90% of the variance, we need to include 11 of the principal components. This reduction reduce features 15 to 11.
PCA performs using entire dataset, exclude indivislau ADHD and MD column and take Total.ADHD and MD in to consideration.
Here we are using built-in R function prccomp() to perform the Principal Component Analysis.
PCA can be applied only on numerical data. We have few Categorical variables in this dataset that need to be converted to numberical category.
# Check the data class
sapply(genl, class)
## Age Sex Race ADHD.Total
## "integer" "integer" "integer" "integer"
## MD.TOTAL SUB.Total Court.order Education
## "integer" "numeric" "integer" "integer"
## Hx.of.Violence Disorderly.Conduct Suicide Abuse
## "integer" "integer" "integer" "integer"
## Non.subst.Dx Subst.Dx
## "integer" "integer"
As we can see there are only 4 numeric data types and rest are factors. We need to convert all the factors into numeric.
<- c("Sex","Race","Court.order","Education","Hx.of.Violence", "Disorderly.Conduct", "Suicide","Abuse","Non.subst.Dx","Subst.Dx")
cols.num <- sapply(genl[cols.num],as.numeric) genl[cols.num]
# Check to see if all our variables in numeric.
sapply(genl, class)
## Age Sex Race ADHD.Total
## "integer" "numeric" "numeric" "integer"
## MD.TOTAL SUB.Total Court.order Education
## "integer" "numeric" "numeric" "numeric"
## Hx.of.Violence Disorderly.Conduct Suicide Abuse
## "numeric" "numeric" "numeric" "numeric"
## Non.subst.Dx Subst.Dx
## "numeric" "numeric"
# Compute principal component analysis.
<- prcomp(genl, scale = TRUE) result.pca
Show the percentage of variances explained by each principal component.
# Visualize eigenvalues (scree plot).
fviz_eig(result.pca)
Individuals with a similar profile are grouped together.
# Graph of individuals
fviz_pca_ind(result.pca,
col.ind = "cos2", # Color by the quality of representation
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
Positive correlated variables point to the same side of the plot. Negative correlated variables point to opposite sides of the graph.
# Graph of variables
fviz_pca_var(result.pca,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
# Biplot of individuals and variables
fviz_pca_biplot(result.pca, repel = TRUE,
col.var = "#2E9FDF", # Variables color
col.ind = "#696969" # Individuals color
)
summary(result.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.6316 1.4760 1.1923 1.11534 1.06328 1.00875 0.94316
## Proportion of Variance 0.1902 0.1556 0.1015 0.08886 0.08075 0.07268 0.06354
## Cumulative Proportion 0.1902 0.3458 0.4473 0.53617 0.61693 0.68961 0.75315
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.85085 0.83870 0.7812 0.69752 0.6372 0.5916 0.41911
## Proportion of Variance 0.05171 0.05024 0.0436 0.03475 0.0290 0.0250 0.01255
## Cumulative Proportion 0.80486 0.85511 0.8987 0.93345 0.9625 0.9875 1.00000
<- varimax(result.pca$rotation)
var_genl corrplot(var_genl$loadings[,1:11], is.corr = FALSE)
PCA of General dataset
The PCA of genl shows first principle component have 19% of the variance in the data. To get 93% of the variance, we need to include 11 of the principal components. This reduction reduce features from 14 to 11.
……………………………………………………………………..
In k-means clustering, we first create unnamed/new observations that serve as “centroids”, which are then used to compute the Euclidean distance between the centroid and the surrounding observations, which are then used for category assignment. The average of all of the points in each category are then recomputed for the centroid based on this average. Finally, the algorithm is re-run using these new centroids with the process ending when there is no change in centroid position between one run of the clustering algorithm and another.
To perform cluster analysis, first need to select the optimal clusters to do that there are some methods.
We’ll use the following R packages:
factoextra
to determine the optimal number clusters for a given clustering methods and for data visualization. NbClust
for computing about 30 methods at once, in order to find the optimal number of clusters.
fviz_nbclust below are the parameter description:
Standadize the data, set mean and standard deviation.
# standadize the data
<- apply(genl,2,mean)
means <- apply(genl,2,sd)
sds = scale(genl,center=means,scale=sds)
nor # Elbow method
fviz_nbclust(nor, kmeans, method = "wss") +
geom_vline(xintercept = 6, linetype = 2)+
labs(subtitle = "Elbow method")
# Silhouette method
fviz_nbclust(nor, kmeans, method = "silhouette")+
labs(subtitle = "Silhouette method")
# Gap statistic
# nboot = 50 to keep the function speedy.
# recommended value: nboot= 500 for your analysis.
# Use verbose = FALSE to hide computing progression.
set.seed(140)
fviz_nbclust(nor, kmeans, nstart = 25, method = "gap_stat", nboot = 50)+
labs(subtitle = "Gap statistic method")
The disadvantage of elbow and average silhouette methods is that, they measure a global clustering characteristic only. A more sophisticated method is to use the gap statistic which provides a statistical procedure to formalize the elbow/silhouette heuristic in order to estimate the optimal number of clusters.
According to these observations, it’s possible to define k = 5 as the optimal number of clusters in the data.
set.seed(500)
<- kmeans(nor, 5, nstart = 25)
kmeans_res kmeans_res
## K-means clustering with 5 clusters of sizes 11, 47, 30, 24, 30
##
## Cluster means:
## Age Sex Race ADHD.Total MD.TOTAL SUB.Total
## 1 -0.6569935740 0.1754249 -0.3842857 0.02634379 0.1301821 0.54162019
## 2 -0.0002945166 -0.4474302 0.2545898 -0.48604331 -0.2163302 0.05965279
## 3 0.3244546883 0.2845213 0.1498511 0.42753297 -0.3387180 -1.08642639
## 4 -0.1334416906 1.0012237 -0.3662406 0.53686606 0.5563087 0.70092025
## 5 0.0236577174 -0.4488487 -0.1148113 -0.10521735 0.1848549 0.23364008
## Court.order Education Hx.of.Violence Disorderly.Conduct Suicide
## 1 3.4387824 -0.350659310 0.4403721 0.64578809 0.48949560
## 2 -0.2887527 -0.307035361 -0.5915229 0.18123941 -0.54200720
## 3 -0.2887527 0.701004304 -0.5915229 -1.10091493 -0.25036946
## 4 -0.2887527 -0.108316402 -0.4023422 -0.08200484 1.10604981
## 5 -0.2887527 -0.004754037 1.6786462 0.64578809 0.03519249
## Abuse Non.subst.Dx Subst.Dx
## 1 0.2884947 0.11673855 0.16124445
## 2 -0.5021573 -0.61105397 0.14289936
## 3 0.1810423 1.01574893 -1.07460319
## 4 0.7884472 -0.06964165 0.88375228
## 5 -0.1308684 -0.04552186 0.08460274
##
## Clustering vector:
## [1] 1 3 3 3 1 4 5 5 5 4 5 3 2 3 1 3 3 5 3 3 4 5 3 4 3 1 3 4 2 2 5 2 4 4 2 5 5
## [38] 4 2 4 5 2 2 4 4 5 5 4 2 4 4 5 4 1 1 2 2 2 5 1 3 5 3 2 2 5 3 3 2 3 3 3 3 3
## [75] 4 5 1 3 5 2 4 5 3 4 3 3 3 3 2 2 2 2 5 5 1 2 2 2 2 2 2 2 2 4 2 5 5 2 2 2 2
## [112] 2 2 4 1 5 2 2 2 5 2 2 2 5 2 4 5 3 3 4 2 2 5 5 5 1 2 4 4 2 2 3
##
## Within cluster sum of squares by cluster:
## [1] 141.3976 350.9247 334.8022 230.6030 251.2763
## (between_SS / total_SS = 33.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
# cluster visualization
fviz_cluster(kmeans_res, data = genl,
ellipse.type = "convex",
ggtheme = theme_classic())
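One note: the clusters were fit on the standardized matrix nor, so a variant that plots against nor rather than the unscaled genl keeps the underlying PCA projection consistent with the fitted model:
# same visualization, on the standardized data the model was actually fit to
fviz_cluster(kmeans_res, data = nor,
ellipse.type = "convex",
ggtheme = theme_classic())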
Calculate each numeric variable’s mean value for each cluster, and then output the resulting table:
aggregate(genl, by=list(cluster=kmeans_res$cluster), mean) %>% kableExtra::kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),latex_options="scale_down")
cluster | Age | Sex | Race | ADHD.Total | MD.TOTAL | SUB.Total | Court.order | Education | Hx.of.Violence | Disorderly.Conduct | Suicide | Abuse | Non.subst.Dx | Subst.Dx |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 32.18182 | 1.545454 | 1.363636 | 35.27273 | 10.909091 | 5.545454 | 1 | 11.18182 | 0.4545455 | 1.0000000 | 0.5454545 | 1.8181818 | 0.5454545 | 1.272727 |
2 | 39.53191 | 1.234043 | 1.765957 | 26.68085 | 9.276596 | 4.170213 | 0 | 11.27660 | 0.0000000 | 0.7872340 | 0.0638298 | 0.2127660 | 0.0425532 | 1.255319 |
3 | 43.16667 | 1.600000 | 1.700000 | 42.00000 | 8.700000 | 0.900000 | 0 | 13.46667 | 0.0000000 | 0.2000000 | 0.2000000 | 1.6000000 | 1.1666667 | 0.100000 |
4 | 38.04167 | 1.958333 | 1.375000 | 43.83333 | 12.916667 | 6.000000 | 0 | 11.70833 | 0.0833333 | 0.6666667 | 0.8333333 | 2.8333333 | 0.4166667 | 1.958333 |
5 | 39.80000 | 1.233333 | 1.533333 | 33.06667 | 11.166667 | 4.666667 | 0 | 11.93333 | 1.0000000 | 1.0000000 | 0.3333333 | 0.9666667 | 0.4333333 | 1.200000 |
Based on the output above, we’d profile our groups by their cluster means: for example, cluster 4 shows the highest average ADHD total, MD total, and suicide rate, while cluster 3 has the highest education level and the lowest substance-use total.
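To push the profiling further, the cluster labels can be attached back to the unscaled data; a short sketch (genl_clustered is a name of our choosing):
# attach cluster labels to the unscaled data for ad hoc profiling
genl_clustered <- genl %>% mutate(cluster = factor(kmeans_res$cluster))
# e.g., cross-tabulate suicide attempts by cluster
table(genl_clustered$cluster, genl_clustered$Suicide)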
……………………………………………………………………..
Below, we revisit our dataset with a different approach to compare and contrast the resulting accuracies. The results were an improvement :)
We will first perform SVM on the entire dataset, then create further SVM models on the reduced dimensions suggested by our PCA of the genl dataset, and compare the resulting metrics of these models to find the best one.
Below we create 3 models using 3 kernels: radial, linear, and polynomial.
data_all_new <- data_all %>% select(-Suicide, everything())
# Set seed (replicability)
set.seed(700)
# Perform train-test split
intrain <- createDataPartition(data_all_new$Suicide, p = 0.75, list = FALSE)
training_all <- data_all_new[intrain,]
test_all <- data_all_new[-intrain,]
results <- test_all$Suicide # hold actual test values for later comparison
test_all$Suicide <- NA
gammalist <- c(0.005, 0.01, 0.015, 0.02, 0.025, 0.03, 0.035, 0.04, 0.045, 0.05)
tune.out_radial <- tune.svm(as.factor(Suicide) ~ ., data = training_all,
                            kernel = 'radial', cost = 2^(1:5), gamma = gammalist)
tune.out_linear <- tune.svm(as.factor(Suicide) ~ ., data = training_all,
                            kernel = 'linear')
tune.out_poly <- tune.svm(as.factor(Suicide) ~ ., data = training_all,
                          kernel = 'polynomial', cost = 2^(1:5), gamma = gammalist)
# predict on the test set with the Suicide column (the last column) dropped
svm4 <- predict(tune.out_radial$best.model, test_all[, -dim(test_all)[2]])
svm5 <- predict(tune.out_linear$best.model, test_all[, -dim(test_all)[2]])
svm6 <- predict(tune.out_poly$best.model, test_all[, -dim(test_all)[2]])
# Convert performance statistics to tabular form (for output / interpretation):
AccuracySVM4 <- confusionMatrix(svm4, as.factor(results))$overall['Accuracy']
AccuracySVM5 <- confusionMatrix(svm5, as.factor(results))$overall['Accuracy']
AccuracySVM6 <- confusionMatrix(svm6, as.factor(results))$overall['Accuracy']
SVM_Model_4 <- confusionMatrix(svm4, as.factor(results))$byClass
SVM_Model_4 <- data.frame(SVM_Model_4)
SVM_Model_4 <- rbind("Accuracy" = AccuracySVM4, SVM_Model_4)
SVM_Model_5 <- confusionMatrix(svm5, as.factor(results))$byClass
SVM_Model_5 <- data.frame(SVM_Model_5)
SVM_Model_5 <- rbind("Accuracy" = AccuracySVM5, SVM_Model_5)
SVM_Model_6 <- confusionMatrix(svm6, as.factor(results))$byClass
SVM_Model_6 <- data.frame(SVM_Model_6)
SVM_Model_6 <- rbind("Accuracy" = AccuracySVM6, SVM_Model_6)
tabularview <- data.frame(SVM_Model_4, SVM_Model_5, SVM_Model_6)
tabularview %>% kableExtra::kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), latex_options = "scale_down")
Metric | SVM_Model_4 | SVM_Model_5 | SVM_Model_6 |
---|---|---|---|
Accuracy | 0.8604651 | 0.8372093 | 0.8604651 |
Sensitivity | 0.9189189 | 0.8918919 | 1.0000000 |
Specificity | 0.5000000 | 0.5000000 | 0.0000000 |
Pos Pred Value | 0.9189189 | 0.9166667 | 0.8604651 |
Neg Pred Value | 0.5000000 | 0.4285714 | NaN |
Precision | 0.9189189 | 0.9166667 | 0.8604651 |
Recall | 0.9189189 | 0.8918919 | 1.0000000 |
F1 | 0.9189189 | 0.9041096 | 0.9250000 |
Prevalence | 0.8604651 | 0.8604651 | 0.8604651 |
Detection Rate | 0.7906977 | 0.7674419 | 0.8604651 |
Detection Prevalence | 0.8604651 | 0.8372093 | 1.0000000 |
Balanced Accuracy | 0.7094595 | 0.6959459 | 0.5000000 |
SVM_Model_4 (radial) and SVM_Model_6 (polynomial) share the same Accuracy, and SVM_Model_4 outperformed SVM_Model_5 (linear) on Sensitivity, Pos Pred Value, Neg Pred Value, Precision, Recall, F1, Detection Rate, and Balanced Accuracy. Although SVM_Model_6 matches SVM_Model_4 on Accuracy, its Neg Pred Value is NaN and its Specificity is 0, so it is not a good fit for predicting suicide. Let’s look at the confusion matrices and interpret how each model works.
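First, a quick note on that NaN: with class “0” treated as the positive class, Neg Pred Value is TN / (TN + FN), and because SVM_Model_6 never predicts class “1” the denominator is zero; a small check:
cm6 <- confusionMatrix(svm6, as.factor(results))$table
# prediction row "1" is empty, so NPV = 0 / 0 = NaN
cm6["1", "1"] / sum(cm6["1", ])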
confusionMatrix(svm4, as.factor(results))$table
## Reference
## Prediction 0 1
## 0 34 3
## 1 3 3
This model performed well for both classes, predicting yes and no with only 6 total misclassifications (3 in each class).
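The misclassification count can also be read off programmatically as the sum of the off-diagonal cells; a small sketch:
cm4 <- confusionMatrix(svm4, as.factor(results))$table
sum(cm4) - sum(diag(cm4)) # 6 misclassified observations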
confusionMatrix(svm5, as.factor(results))$table
## Reference
## Prediction 0 1
## 0 33 3
## 1 4 3
This model matched SVM_Model_4 on the suicide class but was slightly worse on the no-suicide class; the total misclassification count is 7.
confusionMatrix(svm6, as.factor(results))$table
## Reference
## Prediction 0 1
## 0 37 6
## 1 0 0
This model performed perfectly on the no-suicide class but failed completely on the suicide class, never predicting a single positive. It is definitely not a good model for this task because it cannot identify the suicide class at all.
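One possible remedy, beyond what we tuned above, is to penalize errors on the minority class more heavily via e1071’s class.weights argument; the weights below are illustrative, not tuned:
# a weighted polynomial SVM as a sketch of handling the class imbalance
svm_weighted <- svm(as.factor(Suicide) ~ ., data = training_all,
                    kernel = 'polynomial',
                    class.weights = c('0' = 1, '1' = 5)) # illustrative weights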
From the varimax loadings plot we found that capturing above 90% of the variance in the data requires 11 principal components. We are therefore going to exclude ADHD.Total, Non.subst.Dx, and Subst.Dx from the dataset.
We split the data into train and test sets, fit the models, predict on the test data, and calculate the resulting metrics to assess model performance.
genl_2 <- data[c(2:4,23,39)]
genl_2$SUB.Total <- sub$SUB.Total
genl_2 <- cbind(genl_2, hstry)
# imputation
genl_3 <- genl_2 %>%
mutate(
SUB.Total = impute(SUB.Total, mode),
Court.order = impute(Court.order, mode),
Education = impute(Education, mode),
Hx.of.Violence = impute(Hx.of.Violence, mode),
Disorderly.Conduct = impute(Disorderly.Conduct, mode),
Suicide = impute(Suicide, mode),
Abuse = impute(Abuse, mode),
Non.subst.Dx = impute(Non.subst.Dx, mode),
Subst.Dx = impute(Subst.Dx, mode),
Psych.meds. = impute(Psych.meds., mode)
) %>%
mutate_if(is.character, as.numeric)
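A quick sanity check that the imputation left no missing values behind:
# count remaining NAs per column; all should be zero after imputation
colSums(is.na(genl_3))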
genl_svm <- genl_3 %>% select(-c('ADHD.Total', 'Non.subst.Dx', 'Subst.Dx')) %>% select(-Suicide, everything())
# Set seed (replicability)
set.seed(900)
# Perform train-test split
intrain <- createDataPartition(genl_svm$Suicide, p = 0.75, list = FALSE)
training_not_all <- genl_svm[intrain,]
test_not_all <- genl_svm[-intrain,]
results_not_all <- test_not_all$Suicide # hold actual test values for later comparison
test_not_all$Suicide <- NA
gammalist <- c(0.005, 0.01, 0.015, 0.02, 0.025, 0.03, 0.035, 0.04, 0.045, 0.05)
tune.out_radial_7 <- tune.svm(as.factor(Suicide) ~ ., data = training_not_all,
                              kernel = 'radial', cost = 2^(1:5), gamma = gammalist)
tune.out_linear_8 <- tune.svm(as.factor(Suicide) ~ ., data = training_not_all,
                              kernel = 'linear')
tune.out_poly_9 <- tune.svm(as.factor(Suicide) ~ ., data = training_not_all,
                            kernel = 'polynomial', cost = 2^(1:5), gamma = gammalist)
svm7 <- predict(tune.out_radial_7$best.model, test_not_all[, -dim(test_not_all)[2]])
svm8 <- predict(tune.out_linear_8$best.model, test_not_all[, -dim(test_not_all)[2]])
svm9 <- predict(tune.out_poly_9$best.model, test_not_all[, -dim(test_not_all)[2]])
# Convert performance statistics to tabular form (for output / interpretation):
AccuracySVM7 <- confusionMatrix(svm7, as.factor(results_not_all))$overall['Accuracy']
AccuracySVM8 <- confusionMatrix(svm8, as.factor(results_not_all))$overall['Accuracy']
AccuracySVM9 <- confusionMatrix(svm9, as.factor(results_not_all))$overall['Accuracy']
SVM_Model_7 <- confusionMatrix(svm7, as.factor(results_not_all))$byClass
SVM_Model_7 <- data.frame(SVM_Model_7)
SVM_Model_7 <- rbind("Accuracy" = AccuracySVM7, SVM_Model_7)
SVM_Model_8 <- confusionMatrix(svm8, as.factor(results_not_all))$byClass
SVM_Model_8 <- data.frame(SVM_Model_8)
SVM_Model_8 <- rbind("Accuracy" = AccuracySVM8, SVM_Model_8)
SVM_Model_9 <- confusionMatrix(svm9, as.factor(results_not_all))$byClass
SVM_Model_9 <- data.frame(SVM_Model_9)
SVM_Model_9 <- rbind("Accuracy" = AccuracySVM9, SVM_Model_9)
tabularview <- data.frame(SVM_Model_7, SVM_Model_8, SVM_Model_9)
tabularview %>% kableExtra::kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), latex_options = "scale_down")
Metric | SVM_Model_7 | SVM_Model_8 | SVM_Model_9 |
---|---|---|---|
Accuracy | 0.7441860 | 0.6744186 | 0.7441860 |
Sensitivity | 0.9655172 | 1.0000000 | 1.0000000 |
Specificity | 0.2857143 | 0.0000000 | 0.2142857 |
Pos Pred Value | 0.7368421 | 0.6744186 | 0.7250000 |
Neg Pred Value | 0.8000000 | NaN | 1.0000000 |
Precision | 0.7368421 | 0.6744186 | 0.7250000 |
Recall | 0.9655172 | 1.0000000 | 1.0000000 |
F1 | 0.8358209 | 0.8055556 | 0.8405797 |
Prevalence | 0.6744186 | 0.6744186 | 0.6744186 |
Detection Rate | 0.6511628 | 0.6744186 | 0.6744186 |
Detection Prevalence | 0.8837209 | 1.0000000 | 0.9302326 |
Balanced Accuracy | 0.6256158 | 0.5000000 | 0.6071429 |
We extend the following commentary based on the above output: SVM_Model_7 and SVM_Model_9 have higher accuracy than SVM_Model_8. Let’s look at each confusion matrix and interpret the results.
confusionMatrix(svm7, as.factor(results_not_all))$table
## Reference
## Prediction 0 1
## 0 28 10
## 1 1 4
SVM_Model_7 has 74% accuracy, with 11 total misclassifications, most of them in the suicide class. It predicts the no-suicide class well but is not a good model for the suicide class.
confusionMatrix(svm8, as.factor(results_not_all))$table
## Reference
## Prediction 0 1
## 0 29 14
## 1 0 0
SVM_Model_8 has 67% accuracy, with 14 total misclassifications, and it predicts the suicide class completely incorrectly: every suicide observation is misclassified. This is not a good model to go with.
confusionMatrix(svm9, as.factor(results_not_all))$table
## Reference
## Prediction 0 1
## 0 29 11
## 1 0 3
SVM_Model_9 has the same accuracy as SVM_Model_7. From the confusion matrix we see 11 total misclassifications, again concentrated in the suicide class. This model predicts the no-suicide class with 100% accuracy but performs poorly on the suicide class.
We fit a total of nine SVM models. The genl dataset without PCA-based reduction gave better results than the PCA-reduced genl model, but the models trained on the entire dataset performed best of all.
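For a compact side-by-side, the six accuracies computed in this section can be gathered into one frame (the model labels are our own shorthand):
data.frame(
model = c("SVM_4 radial (all)", "SVM_5 linear (all)", "SVM_6 poly (all)",
          "SVM_7 radial (genl, reduced)", "SVM_8 linear (genl, reduced)",
          "SVM_9 poly (genl, reduced)"),
accuracy = c(AccuracySVM4, AccuracySVM5, AccuracySVM6,
             AccuracySVM7, AccuracySVM8, AccuracySVM9)
)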