The purpose of this assignment was to explore Clustering, Principal Component Analysis, and Support Vector Machines.
Clustering is used as a means of grouping a set of observations so that those in the same group are more similar to one another than to those in other groups. Principal Component Analysis is used as a means of variable reduction in which we aim to preserve as much useful information as possible. In short, we trade a little accuracy for simplicity. Support Vector Machines, on the other hand, are a supervised learning technique used for classification and regression analysis. We're going to apply them as a means of classification based on one of the variables in our set (Suicide).
We’ll perform a relatively in-depth exploratory data analysis (EDA) to better familiarize ourselves with the data at hand and then we’ll explore each of the aforementioned methods.
We'll group our patients via k-means / hierarchical clustering and profile these groups based on shared characteristics, focus upon a subset of our data (ie. ADHD or mood disorder) in applying Principal Component Analysis, and apply a Support Vector Machine model to classify whether or not a patient has attempted suicide.
……………………………………………………………………..
The data at hand is based on a real life research project and deals with a number of measures of mental health and well-being - attention deficit hyperactivity disorder (ADHD), mood / mental disorder (MD), substance abuse, etc.
Being that our exploration and application of methods can take on many forms and head in a nearly infinite number of directions, the room for creativity is without bounds.
We start by loading in our data and utilizing built-in head() and dim() functions for an early read.
#Load in data
#data <- read.csv("https://raw.githubusercontent.com/SubhalaxmiRout002/Data-622-Group-5/main/HW%204/ADHD_data.csv")
data <- read.csv("ADHD_data.csv")
#head(data)
#dim(data)
From our initial read of our dataset's dimensions and first 6 entries, we observe that we're dealing with a 175 observation x 54 variable dataset. Our initial dataset is broad and may be difficult to manage / make sense of. With the motivation of clarification in mind, we thus break our larger dataset into subsets:
We may end up merging these subsets, pulling individual variables, or exploring individual subsets later but this provides a means of effectively recognizing where related variables lie while simultaneously providing smaller, more manageable dataframes to explore and work upon.
Once these dataframes are created, we see that adhd and md have total score variables whereas sub does not, even though its data (across variables) maintains a consistent scoring system. Thus, we take it upon ourselves to add this variable for the sake of consistency:
genl <- data[c(1:4,23,39)]
#head(genl) #drop initial, add sub and hstry (all vars or "total score")
#we may want to use these separately (ie. for Q2)
adhd <- data[c(5:23)]
md <- data[c(24:39)]
sub <- data[c(40:45)]
hstry <- data[c(46:54)]
sub$SUB.Total <- rowSums(sub) #add Total column to sub
#Finalize genl dataframe (for EDA)
genl$SUB.Total <- sub$SUB.Total #add SUB.Total to genl
genl <- genl[c(-1)] #drop Initial
genl <- cbind(genl, hstry) #add history variables back in with general
head(genl)
## Age Sex Race ADHD.Total MD.TOTAL SUB.Total Court.order Education
## 1 24 1 1 40 15 3 1 11
## 2 48 2 1 55 14 0 0 14
## 3 51 2 1 31 5 0 0 12
## 4 43 1 1 45 13 4 0 12
## 5 34 1 1 48 7 2 1 9
## 6 39 2 1 55 14 1 0 11
## Hx.of.Violence Disorderly.Conduct Suicide Abuse Non.subst.Dx Subst.Dx
## 1 0 1 1 0 2 0
## 2 0 0 1 4 1 0
## 3 0 0 0 6 2 0
## 4 0 0 1 7 2 0
## 5 1 1 1 0 2 0
## 6 0 1 1 2 0 0
## Psych.meds.
## 1 2
## 2 1
## 3 1
## 4 2
## 5 0
## 6 0
We subset our dataframes (as noted above), add a “total” column for sub, and finalize our genl dataframe by merging the total column for sub, dropping the Initial variable, and adding all hstry variables back in.
We lean on the data dictionary (provided with our .xlsx file) to outline the variables at hand:
- Age: quantitative variable representing the individual's age.
- Sex: categorical variable representing the individual's sex (male-1, female-2).
- Race: categorical variable representing the individual's race (white-1, african american-2, hispanic-3, asian-4, native american-5, other / missing data-6).
- ADHD.Total: quantitative variable representing the individual's total self-report score on the ADHD scale. It's the cumulative score for 18 questions with a scoring metric of never-0, rarely-1, sometimes-2, often-3, and very often-4.
- MD.TOTAL: quantitative variable representing the individual's total self-report score for the mental disorder questions. It's the cumulative score for 15 questions with a scoring metric of no-0, yes-1; question 3: no problem-0, minor-1, moderate-2, and serious-3.
- SUB.Total: quantitative variable representing the individual's total self-report score for substance abuse related questions. It's the cumulative score across 6 categories with a scoring metric of no use-0, use-1, abuse-2, and dependence-3.
- Court.order: categorical variable representing whether the individual's case was court ordered (No-0, Yes-1).
- Education: categorical variable representing the individual's level of education (1-12 grade, 13+ college).
- Hx.of.Violence: categorical variable representing whether the individual has a history of violence (No-0, Yes-1).
- Disorderly.Conduct: categorical variable representing whether the individual has a record of disorderly conduct (No-0, Yes-1).
- Suicide: categorical variable representing whether the individual has attempted suicide in the past (No-0, Yes-1).
- Abuse: categorical variable representing whether the individual has a history of abuse (No-0, Physical (P)-1, Sexual (S)-2, Emotional (E)-3, P&S-4, P&E-5, S&E-6, P&S&E-7).
- Non.subst.Dx: categorical variable representing whether the individual has a non-substance diagnosis (0 - none; 1 - one; 2 - more than one).
- Subst.Dx: categorical variable representing the individual's substance diagnoses (0 - none; 1 - one substance-related; 2 - two; 3 - three or more).
- Psych.meds.: categorical variable representing whether the individual has been prescribed psychiatric medication (0 - none; 1 - one psychotropic med; 2 - more than one psychotropic med).

The resulting dataframe (and variables) are shown above, and at this point we're prepared to more deeply explore our genl dataset.
To do so, we utilize the built-in summary() and glimpse() functions:
## Age Sex Race ADHD.Total MD.TOTAL
## Min. :18.00 Min. :1.000 Min. :1.00 Min. : 0.00 Min. : 0.00
## 1st Qu.:29.50 1st Qu.:1.000 1st Qu.:1.00 1st Qu.:21.00 1st Qu.: 6.50
## Median :42.00 Median :1.000 Median :2.00 Median :33.00 Median :11.00
## Mean :39.47 Mean :1.434 Mean :1.64 Mean :34.32 Mean :10.02
## 3rd Qu.:48.00 3rd Qu.:2.000 3rd Qu.:2.00 3rd Qu.:47.50 3rd Qu.:14.00
## Max. :69.00 Max. :2.000 Max. :6.00 Max. :72.00 Max. :17.00
##
## SUB.Total Court.order Education Hx.of.Violence
## Min. : 0.000 Min. :0.00000 Min. : 6.0 Min. :0.0000
## 1st Qu.: 3.000 1st Qu.:0.00000 1st Qu.:11.0 1st Qu.:0.0000
## Median : 3.000 Median :0.00000 Median :12.0 Median :0.0000
## Mean : 3.883 Mean :0.08823 Mean :11.9 Mean :0.2439
## 3rd Qu.: 6.000 3rd Qu.:0.00000 3rd Qu.:13.0 3rd Qu.:0.0000
## Max. :13.000 Max. :1.00000 Max. :19.0 Max. :1.0000
## NA's :4 NA's :5 NA's :9 NA's :11
## Disorderly.Conduct Suicide Abuse Non.subst.Dx
## Min. :0.0000 Min. :0.0000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000
## Median :1.0000 Median :0.0000 Median :0.000 Median :0.0000
## Mean :0.7256 Mean :0.3025 Mean :1.329 Mean :0.4379
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :7.000 Max. :2.0000
## NA's :11 NA's :13 NA's :14 NA's :22
## Subst.Dx Psych.meds.
## Min. :0.000 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:0.0000
## Median :1.000 Median :1.0000
## Mean :1.138 Mean :0.9649
## 3rd Qu.:2.000 3rd Qu.:2.0000
## Max. :3.000 Max. :2.0000
## NA's :23 NA's :118
## Rows: 175
## Columns: 15
## $ Age <int> 24, 48, 51, 43, 34, 39, 41, 48, 44, 27, 44, 56, 53…
## $ Sex <int> 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 1, 1, 2, 1, 2, 2, 2,…
## $ Race <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ ADHD.Total <int> 40, 55, 31, 45, 48, 55, 54, 41, 56, 56, 42, 38, 31…
## $ MD.TOTAL <int> 15, 14, 5, 13, 7, 14, 9, 7, 12, 11, 16, 0, 11, 10,…
## $ SUB.Total <dbl> 3, 0, 0, 4, 2, 1, 9, 0, 2, 3, 10, 0, 2, 0, 4, 3, 0…
## $ Court.order <int> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ Education <int> 11, 14, 12, 12, 9, 11, 12, 16, 12, 9, 12, 18, 12, …
## $ Hx.of.Violence <int> 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0,…
## $ Disorderly.Conduct <int> 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0,…
## $ Suicide <int> 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ Abuse <int> 0, 4, 6, 7, 0, 2, 4, 0, 0, 2, 4, 2, 0, 0, 0, 5, 1,…
## $ Non.subst.Dx <int> 2, 1, 2, 2, 2, 0, 1, 2, 1, 1, 1, 2, 0, 1, 0, 2, 2,…
## $ Subst.Dx <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 2, 2, 0, 0, 0, 1, 1, 0,…
## $ Psych.meds. <int> 2, 1, 1, 2, 0, 0, 1, 2, 1, 0, 0, 1, 0, 0, 0, 2, 1,…
From the above outputs we can extend that:
- The average age of respondents is ~40, there are slightly more male than female respondents, and the average education level is ~12 (finishing high school).
- The average ADHD.Total was ~34, the average MD.TOTAL was ~10, and the average SUB.Total was ~4.
- Disorderly.Conduct appears to be more common than Hx.of.Violence, which in turn is more common than Court.order.
- Suicide attempts and some form of abuse (ie. physical) appear to be common.
- The average Non.subst.Dx was ~0.4 while the average Subst.Dx was ~1.1, giving the impression that we're dealing with more substance diagnoses than non-substance diagnoses.
- Psych.meds is missing the majority of its values, and NA values will have to be accounted for (in general).

Based on this information, we drop Psych.meds, drop remaining NA values, and observe the effect on our dataset:
genl <- genl[c(-15)] #drop Psych.meds
genl <- genl %>% tibble() %>% drop_na() #drop NA values
#revisit genl
dim(genl) #175 - 142 = 33 dropped rows
## [1] 142 14
## Age Sex Race ADHD.Total
## 0 0 0 0
## MD.TOTAL SUB.Total Court.order Education
## 0 0 0 0
## Hx.of.Violence Disorderly.Conduct Suicide Abuse
## 0 0 0 0
## Non.subst.Dx Subst.Dx
## 0 0
Our dataset has been reduced from 175 observations x 15 variables to 142 observations x 14 variables and from numerous NA values across multiple variables to no NA values.
With NA values dealt with, we ensure each variable is of the proper type and then visualize our variable distributions based on whether they're categorical or numeric:
#convert features to factor
genl$Sex <- as.factor(genl$Sex)
genl$Race <- as.factor(genl$Race)
genl$Court.order <- as.factor(genl$Court.order)
genl$Education <- as.factor(genl$Education)
genl$Hx.of.Violence <- as.factor(genl$Hx.of.Violence)
genl$Disorderly.Conduct <- as.factor(genl$Disorderly.Conduct)
genl$Suicide <- as.factor(genl$Suicide)
genl$Abuse <- as.factor(genl$Abuse)
genl$Non.subst.Dx <- as.factor(genl$Non.subst.Dx)
genl$Subst.Dx <- as.factor(genl$Subst.Dx)
#convert features to numeric
genl$Age <- as.numeric(genl$Age)
genl$ADHD.Total <- as.numeric(genl$ADHD.Total)
genl$MD.TOTAL <- as.numeric(genl$MD.TOTAL)
genl$SUB.Total <- as.numeric(genl$SUB.Total)
#head(genl) #verify conversions
#visualize categorical distributions as a table
fig1 <- inspectdf::inspect_imb(genl)
fig1
## # A tibble: 10 x 4
## col_name value pcnt cnt
## <chr> <chr> <dbl> <int>
## 1 Court.order 0 92.3 131
## 2 Hx.of.Violence 0 73.9 105
## 3 Disorderly.Conduct 1 70.4 100
## 4 Suicide 0 68.3 97
## 5 Non.subst.Dx 0 64.8 92
## 6 Abuse 0 63.4 90
## 7 Race 2 55.6 79
## 8 Sex 1 54.2 77
## 9 Education 12 38.7 55
## 10 Subst.Dx 1 38.7 55
From the categorical variable table above we see that ~92% of respondents' cases were not court ordered, ~74% of respondents do not have a history of violence, ~70% of respondents have a history of disorderly conduct, ~68% of respondents have not attempted suicide, ~65% of respondents do not have a non-substance diagnosis, ~63% of respondents do not have a history of abuse, ~56% of respondents are African American, ~54% of respondents are male, ~39% of respondents have grade 12 (high school) as their highest education level, and ~39% of respondents have one substance-related diagnosis. The fact that Education and Subst.Dx have the same pcnt values is interesting and worth noting.
Moving on to our numeric distributions:
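The histogram chunk itself isn't echoed in the report; below is a minimal sketch of how these distributions might have been produced, assuming the same inspectdf package already used for the categorical table:

#Visualize numeric distributions as histograms (sketch, not the original chunk)
genl %>%
  inspectdf::inspect_num() %>%
  inspectdf::show_plot()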
From the histograms above we see that:
- ADHD.Total has a left-skewed distribution with a peak ~45.
- Age appears to be multimodal with peaks ~25, ~35, and ~50.
- MD.TOTAL has a non-normal distribution with its mass concentrated above 15 (at the far right).
- SUB.Total appears to have a multimodal distribution with its largest peak at ~3.

With insight gained via variable distributions, we move on to observing the pair plot and correlation matrix for numeric variables:
#Correlation matrix for numeric variables
genl_corr <- genl %>%
select_if(is.numeric) %>%
cor()
corrplot(genl_corr, title="Numeric Variable Correlation", type = "lower", mar=c(0,0,1,0))
The pair plot and correlation matrix highlight the fact that ADHD.Total and MD.TOTAL appear to have strong correlation. We'll note this fact while carrying both variables forward.
……………………………………………………………………..
We elected to utilize hierarchical clustering, being that it's more adaptive than k-means clustering. For k-means clustering, we have to pre-define our k, whereas for hierarchical clustering both the value of k and the location of our centroids (the mid-point of all data points in a group) are determined as part of the process.
We proceed by computing the distances between all data points (using Euclidean distance), clustering based on centroid distance and the variance in each cluster, and then plotting the corresponding dendrogram:
set.seed(123)
#compute distances between all data points (using euclidean distance)
distances <- dist(genl, method = 'euclidean')
#cluster based on distances using centroid distance as well as variance in clusters
clusterHealth <- hclust(distances, method = 'ward.D') #ward.D accounts for centroid distance + variance
#plot the dendrogram
plot(clusterHealth)
Based on our dendrogram, it appears that the ideal number of clusters is 3, 4, or 5. We proceed with 5 clusters, being that it's the highest of the three and we'd like to better represent the variation we observed during our exploratory data analysis. We want to better capture the distinct characteristics of our clustered groups.
We "cut the tree" and demarcate our groups based on the desired number of clusters, calculate each numeric variable's mean value for each cluster, and then output the resulting table:
clusterGroups <- cutree(clusterHealth, 5)
#clusterGroups #verify the breakdown per observation
#Mean value for numeric variables
#tapply(genl$Age, clusterGroups, mean, simplify=TRUE) #avg age for each cluster
#tapply(genl$ADHD.Total, clusterGroups, mean) #avg ADHD score for each cluster
#tapply(genl$MD.TOTAL, clusterGroups, mean) #avg MD score for each cluster
#tapply(genl$SUB.Total, clusterGroups, mean) #avg SUB score for each cluster
#Extract corresponding values (to place in easy to understand form):
avg_age <- c(27.06, 48.95, 41.42, 47.4, 25.22)
avg_adhd <- c(49.34, 53.52, 34.12, 15.93, 6.56)
avg_md <- c(12.84, 12.71, 10.58, 7.27, 4.11)
avg_sub <- c(4.13, 3.28, 4.1, 4.17, 3.44)
#Build kable table
group_names <- c("Group 1", "Group 2", "Group 3", "Group 4", "Group 5")
df <- data.frame(group_names, avg_age, avg_adhd, avg_md, avg_sub)
df %>%
kbl() %>%
kable_styling(latex_options = "striped")
| group_names | avg_age | avg_adhd | avg_md | avg_sub |
|---|---|---|---|---|
| Group 1 | 27.06 | 49.34 | 12.84 | 4.13 |
| Group 2 | 48.95 | 53.52 | 12.71 | 3.28 |
| Group 3 | 41.42 | 34.12 | 10.58 | 4.10 |
| Group 4 | 47.40 | 15.93 | 7.27 | 4.17 |
| Group 5 | 25.22 | 6.56 | 4.11 | 3.44 |
Based on the output above we'd profile our groups as follows: Group 1 is young (~27) with high ADHD and MD scores; Group 2 is older (~49) with the highest ADHD score and a similarly high MD score; Group 3 is middle-aged (~41) with moderate scores; Group 4 is older (~47) with a low ADHD score and a moderate MD score; and Group 5 is young (~25) with the lowest ADHD and MD scores.
It was interesting to profile our groups and observe that there appeared to be 2 tiers in the earlier and later years. It was also interesting to see that all groups had a relatively low incidence of substance use / dependency. Due to its apparent low impact as a differentiating factor for our clusters it was not discussed.
In the Appendix, we’ve also provided code (with a more succinct write up) for our application of K-means clustering analysis to the genl (general) dataset.
……………………………………………………………………..
Principal Component Analysis (PCA) is a way of reducing the dimensions of a given dataset by extracting new features from the original features present in the dataset. The "new" variables after PCA are all uncorrelated with one another. We elected to perform Principal Component Analysis on the ADHD self-report columns, the MD (mood disorder) questionnaire, and the entire dataset.
We start with the Attention Deficit Hyperactivity Disorder (ADHD) symptom variables. During EDA, we’d subset datasets based on question type and thus we proceed with the adhd dataset.
There are 19 variables in the adhd dataset: 18 pertaining to self-reported ADHD question responses and 1 containing the total score for that row. All ADHD questions capture categorical responses (never-0, rarely-1, sometimes-2, often-3, and very often-4), whereas ADHD.Total is a quantitative (sum) variable representing the individual's total self-report score on the ADHD scale.
We verify that our data does not have missing values:
## ADHD.Q1 ADHD.Q2 ADHD.Q3 ADHD.Q4 ADHD.Q5 ADHD.Q6 ADHD.Q7
## 0 0 0 0 0 0 0
## ADHD.Q8 ADHD.Q9 ADHD.Q10 ADHD.Q11 ADHD.Q12 ADHD.Q13 ADHD.Q14
## 0 0 0 0 0 0 0
## ADHD.Q15 ADHD.Q16 ADHD.Q17 ADHD.Q18 ADHD.Total
## 0 0 0 0 0
Missing values would disturb the implementation of PCA (where we’d have to drop / impute corresponding values), and we confirm that our data has 0 missing values across all variables.
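The check itself isn't echoed; a minimal sketch of what likely produced the counts above:

#Count missing values per ADHD variable (sketch)
colSums(is.na(adhd))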
We then perform PCA via the prcomp() function and review the corresponding statistics:
# Perform PCA and review summary statistics
pca_adhd <- prcomp(adhd, scale. = TRUE)
summary(pca_adhd)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 3.2092 1.16479 1.00701 0.89915 0.85060 0.77076 0.75157
## Proportion of Variance 0.5421 0.07141 0.05337 0.04255 0.03808 0.03127 0.02973
## Cumulative Proportion 0.5421 0.61345 0.66683 0.70938 0.74746 0.77872 0.80845
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.70763 0.66793 0.64292 0.63647 0.60782 0.59497 0.5375
## Proportion of Variance 0.02635 0.02348 0.02175 0.02132 0.01944 0.01863 0.0152
## Cumulative Proportion 0.83481 0.85829 0.88004 0.90136 0.92081 0.93944 0.9546
## PC15 PC16 PC17 PC18 PC19
## Standard deviation 0.52570 0.46267 0.4529 0.40571 0.04007
## Proportion of Variance 0.01455 0.01127 0.0108 0.00866 0.00008
## Cumulative Proportion 0.96919 0.98046 0.9912 0.99992 1.00000
From the Importance of components, we can interpret:
- Standard deviation as a measure of variability within a single component (ie. the deviation of data within PC1).
- Proportion of Variance as a measure of the component's share of the dataset's total variability (ie. the Proportion of Variance row sums to 1.00).
- Cumulative Proportion as the cumulative variability represented by all principal components up to and including the current column.

From these definitions we can extend that the first 3 components (PC1, PC2, and PC3) represent ~66.7% of the variability in the data, and the remaining components add only minor increments to our representative capability (all the way to PC19).
Next, we visit the scree plot:
If we recall from statistics: \(var = sd^2\), where \(var\) represents variance and \(sd\) represents standard deviation.
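The scree plot code isn't echoed; a minimal sketch using factoextra (used elsewhere in this report):

#Scree plot: variance explained by each principal component (sketch)
factoextra::fviz_eig(pca_adhd, addlabels = TRUE)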
From the scree plot above, we confirm our earlier finding that PC1, PC2, and PC3 appear to be enough to represent our larger dataset.
To explore how our variables (original and PCA) relate to one another, we bind PC1, PC2, and PC3 to our original adhd dataset, calculate variable correlation and output the result as a table:
# combine datasets
adhd_df2 <- cbind(adhd, pca_adhd$x)
# %>% column_spec(which(ADHD_df_2[,19] > 0.69) , color = 'white', background = 'maroon')
# cor plot
cor(adhd, adhd_df2[, 20:22]) %>% kableExtra::kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), latex_options="scale_down")
|  | PC1 | PC2 | PC3 |
|---|---|---|---|
| ADHD.Q1 | 0.6980086 | -0.2205533 | 0.3379874 |
| ADHD.Q2 | 0.7087355 | -0.3229617 | 0.2203357 |
| ADHD.Q3 | 0.6730084 | -0.3613418 | 0.1294906 |
| ADHD.Q4 | 0.7363824 | -0.3583042 | 0.1760306 |
| ADHD.Q5 | 0.7279360 | -0.1324912 | -0.4384661 |
| ADHD.Q6 | 0.6490774 | 0.0323714 | -0.3655759 |
| ADHD.Q7 | 0.7393470 | -0.1582908 | 0.0755233 |
| ADHD.Q8 | 0.8025831 | -0.1275705 | 0.1186556 |
| ADHD.Q9 | 0.7958069 | -0.0225645 | 0.0772789 |
| ADHD.Q10 | 0.7919919 | -0.0589730 | 0.1039090 |
| ADHD.Q11 | 0.7411649 | -0.1609594 | -0.1413902 |
| ADHD.Q12 | 0.6869949 | 0.2492497 | 0.1772878 |
| ADHD.Q13 | 0.7627239 | -0.0183180 | -0.4475655 |
| ADHD.Q14 | 0.7185091 | 0.0224583 | -0.3740732 |
| ADHD.Q15 | 0.6382790 | 0.3882041 | 0.0003570 |
| ADHD.Q16 | 0.6736272 | 0.5359116 | 0.1555591 |
| ADHD.Q17 | 0.6547685 | 0.4287821 | 0.1110080 |
| ADHD.Q18 | 0.7104675 | 0.4124523 | 0.1129293 |
| ADHD.Total | 0.9988446 | 0.0127319 | -0.0120590 |
From the correlation table we can extend that:
- All variables have a strong positive correlation with PC1, and it appears that PC1 is nearly identical to ADHD.Total based on their correlation of ~1.00.
- ADHD.Total has a slight positive correlation with PC2.
- ADHD.Total has a slight negative correlation with PC3.

The correlations above provide an indication as to the makeup of our principal components. Variables with a stronger positive correlation contribute more to a component, while those with a stronger negative correlation contribute in the opposite direction.
From our correlation table, we move on to the corresponding biplot.
When our first two PCs explain a significant portion of the variance in our data (which is the case for our application), we can make use of a bi-plot.
A bi-plot overlays the scores plot, a projection of observations onto the span of the first two PCs, and a loadings plot, a projection of the variable vectors onto the span of the PCs, in a single graphic:
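The biplot calls aren't echoed; a minimal sketch using base R's biplot(), whose choices argument selects the pair of components plotted:

#Biplots of the first three principal components, pair by pair (sketch)
biplot(pca_adhd, choices = c(1, 2)) #PC1 vs. PC2
biplot(pca_adhd, choices = c(2, 3)) #PC2 vs. PC3
biplot(pca_adhd, choices = c(1, 3)) #PC1 vs. PC3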
Points are projected observations while vectors are projected variables. In the above plot, the red arrows represent the eigenvectors of each variable helping us to interpret the relationship between principal components while the numbers in black represent our observations.
With regard to reading the plot, the left and bottom axes are used to read PCA scores (of the dots) while the top and right axes are used to read how strongly each vector influences the PC.
From the above biplot, we can extend that:
* the observations appear to play a larger role in the makeup of PC1 than that of PC2,
* ADHD.Tot has the most weight for PC1, and
* Q15:Q18 carry a lot of positive weight while Q2:Q4 carry negative weight for PC2.
Next, we move on to the bi-plot for PC2 and PC3:
From the PC2-PC3 biplot we can extend that some observations play a larger role in PC2 vs. PC3 (ie. 60) while others play a larger role in PC3 vs. PC2 (ie. 144), and that the variables split their weight between PC2 and PC3.

Finally, we explore the bi-plot for PC1 and PC3:
The PC1-PC3 bi-plot has a similar layout to our first: the observations appear to play a larger role in the makeup of PC1 than that of PC3, and ADHD.Tot again carries a lot of weight for PC1, mirroring what we saw with PC1 and PC2.

These plots are useful, but as a possible improvement we'll explore varimax.
Varimax rotation is used to simplify the expression of a particular sub-space. Coordinates remain unchanged while the variance of the squared loadings is maximized.
We apply the varimax() function to the rotation variable of our principal components and plot the correlation of corresponding loadings:
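The call itself isn't echoed; a minimal sketch consistent with the var_adhd object referenced below:

#Varimax rotation of the PCA loadings (sketch)
var_adhd <- varimax(pca_adhd$rotation)
var_adhd$loadings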
##
## Loadings:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 PC13 PC14 PC15
## ADHD.Q1 -1
## ADHD.Q2 1
## ADHD.Q3 1
## ADHD.Q4 -1
## ADHD.Q5 -1
## ADHD.Q6 -1
## ADHD.Q7 -1
## ADHD.Q8
## ADHD.Q9 1
## ADHD.Q10 -1
## ADHD.Q11 -1
## ADHD.Q12 1
## ADHD.Q13
## ADHD.Q14
## ADHD.Q15 -1
## ADHD.Q16 1
## ADHD.Q17 -1
## ADHD.Q18 -1
## ADHD.Total
## PC16 PC17 PC18 PC19
## ADHD.Q1
## ADHD.Q2
## ADHD.Q3
## ADHD.Q4
## ADHD.Q5
## ADHD.Q6
## ADHD.Q7
## ADHD.Q8 1
## ADHD.Q9
## ADHD.Q10
## ADHD.Q11
## ADHD.Q12
## ADHD.Q13 -1
## ADHD.Q14 -1
## ADHD.Q15
## ADHD.Q16
## ADHD.Q17
## ADHD.Q18
## ADHD.Total -1
##
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
## SS loadings 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
## Proportion Var 0.053 0.053 0.053 0.053 0.053 0.053 0.053 0.053 0.053 0.053
## Cumulative Var 0.053 0.105 0.158 0.211 0.263 0.316 0.368 0.421 0.474 0.526
## PC11 PC12 PC13 PC14 PC15 PC16 PC17 PC18 PC19
## SS loadings 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
## Proportion Var 0.053 0.053 0.053 0.053 0.053 0.053 0.053 0.053 0.053
## Cumulative Var 0.579 0.632 0.684 0.737 0.789 0.842 0.895 0.947 1.000
#var_adhd$loadings #corrplot is more succinct, easier to interpret
corrplot(var_adhd$loadings[,1:3], is.corr = FALSE)
We elect to interpret the correlation plot (over the loadings output) due to its clarity and succinctness. From the above output, we can see that:
- PC1 is strongly (negatively) correlated with ADHD.Q18,
- PC2 is strongly (positively) correlated with ADHD.Q16, and
- PC3 is strongly (negatively) correlated with ADHD.Q5.

Although these results conflict with our earlier plot interpretations, one cannot argue with the simplicity of concise code. Varimax provided in 2 lines of code what took multiple plots, iterations, and interpretations to arrive at via biplots: the strongest correlate for each principal component.
In the Appendix, we’ve also provided code (with a more succinct write up) for our application of principal component analysis to the md (mood disorder) and genl (general) datasets.
……………………………………………………………………..
A Support Vector Machine (SVM) is a supervised machine learning algorithm that creates a hyperplane / decision boundary based on labeled data. Support vectors are the data points closest to the hyperplane, that influence the hyperplane’s location and orientation. Using support vectors, we maximize the margin, the space between prospective classes, by ensuring that our classifier is equidistant from the closest observation in each class.
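As a toy illustration of these ideas (hypothetical, simulated data, not the assignment data), a minimal sketch fitting a linear-kernel SVM with e1071:

#Toy illustration: fit a linear SVM and inspect its support vectors (hypothetical data)
set.seed(1)
toy <- data.frame(x1 = rnorm(40), x2 = rnorm(40),
                  y = factor(rep(c(0, 1), each = 20)))
toy[toy$y == 1, c("x1", "x2")] <- toy[toy$y == 1, c("x1", "x2")] + 1.5 #shift class 1
fit <- e1071::svm(y ~ ., data = toy, kernel = "linear", cost = 1)
summary(fit) #reports the number of support vectors
plot(fit, toy) #decision boundary with support vectors marked

With that intuition in place, we return to preparing the assignment data.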
First, we identify NAs using vis_miss from the visdat package. This graph shows the percentage of missing values in each column, and the bottom section shows the total percentage of missing values in the data.
We can see that the most NAs are in the Psych.meds. column. We replace NAs with the mode (and, for a few columns, the mean) via functions we create and apply to the dataset:
mode <- function(df, ...) {
  #Return the most frequent value (the statistical mode) as a character string
  tb <- table(df)
  which(tb == max(tb)) %>% sort %>% `[`(., 1) %>% names
}
impute <- function(df, fun) {
  #Replace NAs in a vector with the value returned by fun (ie. mode or mean)
  fval <- fun(df, na.rm = TRUE)
  y <- ifelse(is.na(df), fval, df)
  return(y)
}
data_all <- data %>%
select(-Initial) %>%
mutate(
Alcohol = impute(Alcohol, mode),
THC = impute(THC, mode),
Cocaine = impute(Cocaine, mode),
Stimulants = impute(Stimulants, mode),
Sedative.hypnotics = impute(Sedative.hypnotics, mean),
Opioids = impute(Opioids, mean),
Court.order = impute(Court.order, mean),
Education = impute(Education, mode),
Hx.of.Violence = impute(Hx.of.Violence, mode),
Disorderly.Conduct = impute(Disorderly.Conduct, mode),
Suicide = impute(Suicide, mode),
Abuse = impute(Abuse, mode),
Non.subst.Dx = impute(Non.subst.Dx, mode),
Subst.Dx = impute(Subst.Dx, mode),
Psych.meds. = impute(Psych.meds., mode)
) %>%
mutate_if(is.character, as.numeric)
visdat::vis_miss(data_all)
No missing values remain in the dataset. We can now perform EDA on the suicide variable.
The graphs below show the suicide percentage across different categories.
We plot suicide % over categories such as Age, Sex, Education, Cocaine, Hx.of.Violence, Disorderly.Conduct, Abuse, Non.subst.Dx, Subst.Dx, and Psych.meds.
p <- data_all %>% select(c('Age','Suicide'))
setDT(p)[Age <10, agegroup := "0-10"]
p[Age >=10 & Age <20, agegroup := "10-20"]
p[Age >=20 & Age <30, agegroup := "20-30"]
p[Age >=30 & Age <40, agegroup := "30-40"]
p[Age >=40 & Age <50, agegroup := "40-50"]
p[Age >=50 & Age <60, agegroup := "50-60"]
p[Age >=60 & Age <70, agegroup := "60-70"]
PlotXTabs2(p, agegroup, Suicide, results.subtitle = FALSE, title ="Suicide Attempt Percentage by Age Group")
The 30-40 age group has the highest suicide attempt percentage, with the 40-50 age group second highest.
p <- data_all %>% select(c('Sex','Suicide'))
PlotXTabs2(p, Sex, Suicide, results.subtitle = FALSE, title ="Suicide Attempt Percentage by Sex")
Females attempted suicide more than males.
p <- data_all %>% select(c('Race','Suicide'))
PlotXTabs2(p, Race, Suicide, results.subtitle = FALSE, title ="Suicide Attempt Percentage by Race")
Suicide attempts by race: an interesting point here is that Hispanics show 0% suicide because there is only 1 observation in that category. Similarly, the "other or missing data" category has only 2 observations, so, due to lack of data, it shows 50% suicide.
p <- data_all %>% select(c('Education','Suicide'))
PlotXTabs2(p, Education, Suicide, results.subtitle = FALSE, title ="Suicide Attempt Percentage by Level of Education")
More highly educated respondents were less likely to have attempted suicide than others.
p <- data_all %>% select(c('Hx.of.Violence','Suicide'))
PlotXTabs2(p, Hx.of.Violence, Suicide, results.subtitle = FALSE, title ="Suicide Attempt Percentage by Violence")
Respondents with a history of violence attempted suicide more than those without.
p <- data_all %>% select(c('Disorderly.Conduct','Suicide'))
PlotXTabs2(p, Disorderly.Conduct, Suicide, results.subtitle = FALSE, title ="Suicide Attempt Percentage by Disorderly Conduct")
This is a little surprising: respondents without disorderly conduct attempted suicide more than those with disorderly conduct.
p <- data_all %>% select(c('Abuse','Suicide'))
PlotXTabs2(p, Abuse, Suicide, results.subtitle = FALSE, title ="Suicide Attempt Percentage by Abuse")
Suicide attempts by abuse category: the remaining abuse categories have fewer observations and lower percentages.
p <- data_all %>% select(c('Non.subst.Dx','Suicide'))
PlotXTabs2(p, Non.subst.Dx, Suicide, results.subtitle = FALSE, title ="Suicide Attempt Percentage by Non subst Dx")
Non-substance-related Dx: the suicide attempt percentage is highest for those with one non-substance Dx and lowest for those with none.
p <- data_all %>% select(c('Subst.Dx','Suicide'))
PlotXTabs2(p, Subst.Dx, Suicide, results.subtitle = FALSE, title ="Suicide Attempt Percentage by Subst Dx")
Substance-related Dx: the suicide attempt percentage is highest for those with two or three substance-related diagnoses and lowest for those with one.
p <- data_all %>% select(c('Psych.meds.','Suicide'))
PlotXTabs2(p, Psych.meds., Suicide, results.subtitle = FALSE, title ="Suicide Attempt Percentage by Psych.meds.")
Suicide attempts by psychiatric meds:
By use of support vectors our SVM is capable of finding the optimal decision boundary to distinctly classify our data. In our case, we’re going to model suicide attempts from the genl dataset with a radial and a linear decision boundary. We’ll then interpret the performance of each model side-by-side.
To start, we re-familiarize ourselves with the data at hand and our dependent variable (Suicide):
#Ensure it's the data of interest:
genl <- q3_data
#Refamiliarize ourselves with the genl dataset:
dim(genl) #175 - 142 = 33 dropped rows
## [1] 142 14
## # A tibble: 6 x 14
## Age Sex Race ADHD.Total MD.TOTAL SUB.Total Court.order Education
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 24 1 1 40 15 3 1 11
## 2 48 2 1 55 14 0 0 14
## 3 51 2 1 31 5 0 0 12
## 4 43 1 1 45 13 4 0 12
## 5 34 1 1 48 7 2 1 9
## 6 39 2 1 55 14 1 0 11
## # … with 6 more variables: Hx.of.Violence <int>, Disorderly.Conduct <int>,
## # Suicide <int>, Abuse <int>, Non.subst.Dx <int>, Subst.Dx <int>
## Var1 Freq
## 1 0 97
## 2 1 45
We're dealing with 142 observations x 14 variables - 10 of type factor, 4 of type double - and a dependent variable Suicide, where 97/142 (68.3%) of patients have not attempted suicide and 45/142 (31.7%) have.
We'll narrow our dataset further by exploring our independent variables' correlation with Suicide and with one another, and then verify our assumptions with the Boruta function for feature ranking and selection.
We start by measuring the strength of correlation to Suicide for independent variables:
#Compute proportion of missing data per variable
v <- colnames(genl)
incomplete <- function(x) sum(!complete.cases(x)) / 142
Missing_Data <- sapply(genl[v], incomplete)
#head(missing) #verify
#Compute correlation between each variable and Suicide
target_corr <- function(x, y) cor(y, x, use = "na.or.complete")
Suicide_Correlation <- sapply(genl[v], target_corr, y=genl$Suicide)
#head(c_target) #verify
#Bind and output Missing Data and Correlation with Target
MDCS <- data.frame(cbind(Missing_Data, Suicide_Correlation))
MDCS %>%
kbl(caption = "Proportion of Missing Data vs. Correlation with Suicide") %>%
kable_minimal()
|  | Missing_Data | Suicide_Correlation |
|---|---|---|
| Age | 0 | -0.0815414 |
| Sex | 0 | 0.1640972 |
| Race | 0 | -0.1025970 |
| ADHD.Total | 0 | 0.1337074 |
| MD.TOTAL | 0 | 0.2859463 |
| SUB.Total | 0 | 0.1543758 |
| Court.order | 0 | 0.1423456 |
| Education | 0 | -0.1011286 |
| Hx.of.Violence | 0 | 0.1129186 |
| Disorderly.Conduct | 0 | -0.0228881 |
| Suicide | 0 | 1.0000000 |
| Abuse | 0 | 0.3107638 |
| Non.subst.Dx | 0 | 0.0238396 |
| Subst.Dx | 0 | 0.1698773 |
From the above table we can extend that:
- Age, Disorderly.Conduct, and Non.subst.Dx have a weak correlation with Suicide, and
- Abuse, MD.TOTAL, and Subst.Dx have the strongest correlation with Suicide.

From this, we can see the 3 variables we may keep as well as those we may drop. For the remaining variables, we'll explore further to guide our decision.
We measure the correlation between variables to observe multicollinearity:
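The plot code isn't echoed; a minimal sketch, assuming dplyr 1.0+ (factors must be coerced back to numeric before cor()):

#Correlation among all variables, coercing factors to numeric (sketch)
genl %>%
  mutate(across(everything(), as.numeric)) %>%
  cor() %>%
  corrplot(type = "lower")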
Multicollinearity is a concern, and we may eliminate a larger portion of variables as a result. We'll proceed with only variables with strong predictive promise and / or unique value added.
Based on these guidelines, it appears that we should proceed with at least Abuse, MD.Total, and Subst.Dx, while for Sex, Court.order, and Hx.of.Violence we'll decide after consulting the Boruta output.
To address weak target correlation and multicollinearity, we consider excluding 7-10 variables: Age, Race, Education, Disorderly.Conduct, and Non.subst.Dx due to weak correlation with the target variable; ADHD.Total and SUB.Total due to multicollinearity; and Sex, Court.order, and Hx.of.Violence pending our Boruta output.
Before doing so, we utilize the Boruta function for feature ranking and selection:
# Perform Boruta search
boruta_output <- Boruta(Suicide ~ ., data=genl, doTrace=0, maxRuns = 1000)
#Get significant variables including tentatives
boruta_signif <- getSelectedAttributes(boruta_output, withTentative = TRUE)
#print(boruta_signif)
# Plot variable importance
plot(boruta_output, cex.axis=.7, las=2, xlab="", main="Variable Importance")
From the above plot, we see that only two variables (Abuse and MD.TOTAL) carry much importance when we consider Suicide, but in order to offer a more comprehensive representation, we'll also include the next 3 variables (Race, Subst.Dx, and Court.order). It's worth noting that all variables score relatively low in feature importance, which may be an early indication of difficulty in predicting Suicide.
We select the 5 features noted above, convert categorical features to type factor and then verify the dataset that we’re proceeding to SVM model-building with:
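The selection chunk isn't echoed; a minimal sketch consistent with the output below and the select_df object used later:

#Keep the 5 chosen predictors plus Suicide; convert categoricals to factors (sketch)
select_df <- genl %>%
  select(Race, MD.TOTAL, Court.order, Suicide, Abuse, Subst.Dx) %>%
  mutate(across(c(Race, Court.order, Suicide, Abuse, Subst.Dx), as.factor))
head(select_df)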
## # A tibble: 6 x 6
## Race MD.TOTAL Court.order Suicide Abuse Subst.Dx
## <fct> <int> <fct> <fct> <fct> <fct>
## 1 1 15 1 1 0 0
## 2 1 14 0 1 4 0
## 3 1 5 0 0 6 0
## 4 1 13 0 1 7 0
## 5 1 7 1 1 0 0
## 6 1 14 0 1 2 0
We observe 5 independent variables (1 numeric, 4 factor) and find that our data is of the proper form, aside from the fact that we have numerous independent variables of type factor that we may have to convert to "one hot" dummy variables (ie. a unique variable for each level of the factor, with a 1 where that level is present).
We proceed to train-test split our data (for later assessment), convert categorical variables into "one hot" dummy variables, train our model, and consult the summary statistics:
To improve the model, we can tune the SVM by changing the parameters cost, gamma, and the kernel function. We use tune.svm from the e1071 package; the function accepts many parameters, of which we use 3: kernel, cost, and gamma.
Here we create 3 models using radial, linear, and polynomial kernels.
#Set seed (replicability)
set.seed(123)
#Perform train-test Split
train_split <- initial_split(select_df, prop = 0.75)
training <- training(train_split)
test <- testing(train_split)
results <- test$Suicide #hold actual test values for later comparison
test$Suicide <- NA #set initial test values to NA, to then be predicted
#Convert categorical variables into dummy variables (one hot, via dummyVars)
dummies <- dummyVars(~ ., data=training[,-4]) #Omit col 4 (Suicide) from dummy var generation
d_train <- as.data.frame(predict(dummies, training[,-4]))
d_train$Suicide <- training$Suicide
dummies <- dummyVars(~ ., data=test[,-4])
d_test <- as.data.frame(predict(dummies, test[,-4]))
d_test$Suicide <- test$Suicide
#head(d_train) #verify dummy variables
#Set gamma list and use variable cost to determine optimal model parameters (for SVM modeling)
gammalist <- c(0.005,0.01,0.015,0.02,0.025,0.03,0.035,0.04,0.045,0.05)
#1st SVM application: radial kernel
svm_radial <- tune.svm(Suicide ~., data=d_train, kernel='radial', cost=2^(-1:5), gamma = gammalist)
# summary(svm_radial)
# summary(svm_radial$best.model)
svm1 <- predict(svm_radial$best.model, d_test[,-20])
#confusionMatrix(svm1, results)
#2nd SVM application: linear kernel
svm_linear <- tune.svm(Suicide ~., data=d_train, kernel='linear')
# summary(svm_linear)
# summary(svm_linear$best.model)
svm2 <- predict(svm_linear$best.model, d_test[,-20])
#confusionMatrix(svm2, results)
#3rd SVM application: polynomial kernel
svm_polynomial <- tune.svm(Suicide ~., data=d_train, kernel='polynomial', cost=2^(-1:5), gamma = gammalist)
svm3 <- predict(svm_polynomial$best.model, d_test[,-20])
#Convert performance statistics to tabular form (for output / interpretation):
AccuracySVM1 <- confusionMatrix(svm1, results)$overall['Accuracy']
AccuracySVM2 <- confusionMatrix(svm2, results)$overall['Accuracy']
AccuracySVM3 <- confusionMatrix(svm3, results)$overall['Accuracy']
SVM_Model_1 <- confusionMatrix(svm1, results)$byClass
SVM_Model_1 <- data.frame(SVM_Model_1)
SVM_Model_1 <- rbind("Accuracy" = AccuracySVM1, SVM_Model_1)
SVM_Model_2 <- confusionMatrix(svm2, results)$byClass
SVM_Model_2 <- data.frame(SVM_Model_2)
SVM_Model_2 <- rbind("Accuracy" = AccuracySVM2, SVM_Model_2)
SVM_Model_3 <- confusionMatrix(svm3, results)$byClass
SVM_Model_3 <- data.frame(SVM_Model_3)
SVM_Model_3 <- rbind("Accuracy" = AccuracySVM3, SVM_Model_3)
tabularview <- data.frame(SVM_Model_1, SVM_Model_2, SVM_Model_3)
tabularview %>% kableExtra::kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), latex_options="scale_down")
|  | SVM_Model_1 | SVM_Model_2 | SVM_Model_3 |
|---|---|---|---|
| Accuracy | 0.6285714 | 0.6285714 | 0.7428571 |
| Sensitivity | 0.6428571 | 0.6785714 | 0.8214286 |
| Specificity | 0.5714286 | 0.4285714 | 0.4285714 |
| Pos Pred Value | 0.8571429 | 0.8260870 | 0.8518519 |
| Neg Pred Value | 0.2857143 | 0.2500000 | 0.3750000 |
| Precision | 0.8571429 | 0.8260870 | 0.8518519 |
| Recall | 0.6428571 | 0.6785714 | 0.8214286 |
| F1 | 0.7346939 | 0.7450980 | 0.8363636 |
| Prevalence | 0.8000000 | 0.8000000 | 0.8000000 |
| Detection Rate | 0.5142857 | 0.5428571 | 0.6571429 |
| Detection Prevalence | 0.6000000 | 0.6571429 | 0.7714286 |
| Balanced Accuracy | 0.6071429 | 0.5535714 | 0.6250000 |
The radial and linear models share the same Accuracy and Prevalence, whereas SVM_Model_1 (radial) outperformed SVM_Model_2 (linear) in Specificity, Pos Pred Value, Neg Pred Value, Precision, and Balanced Accuracy, while SVM_Model_2 (linear) outperformed SVM_Model_1 (radial) in Sensitivity, Recall, F1, Detection Rate, and Detection Prevalence. SVM_Model_3 (polynomial) outperformed the other 2 models, with higher Accuracy, Sensitivity, Recall, F1, and Balanced Accuracy.
The performance was rather close, but being that we're dealing with an imbalanced set (97/142, or 68.3%, of patients have not attempted suicide), we use Balanced Accuracy as our deciding factor: between the radial and linear kernels we elect SVM_Model_1 (radial), while SVM_Model_3 (polynomial) achieves the highest Balanced Accuracy overall.
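The confusion matrices below were presumably produced along these lines (a sketch using caret's confusionMatrix):

#Confusion matrices for the radial and linear models (sketch)
confusionMatrix(svm1, results)$table
confusionMatrix(svm2, results)$table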
SVM_Model_1 and SVM_Model_2:
## Reference
## Prediction 0 1
## 0 18 3
## 1 10 4
## Reference
## Prediction 0 1
## 0 19 4
## 1 9 3
Both SVM_Model_1 and SVM_Model_2 have a higher class error when predicting suicide and a lower error when predicting no suicide.
SVM_Model_3:
## Reference
## Prediction 0 1
## 0 23 4
## 1 5 3
The confusion matrix shows SVM_Model_3 is a reasonable model for both the suicide and non-suicide classes, with an accuracy of 74%. The misclassifications number 4 for suicide and 5 for no suicide. There is definitely scope to improve the performance of the model. For this dataset we aggregated the question columns into totals and did not impute NAs, which might be a reason for the low accuracy. In the Appendix we perform 6 more SVM models (across 3 kernels): one set using the entire dataset and another using the PCA output.
In the Appendix, we've also provided code (with a more succinct write-up) for our application of support vector machines to the entire dataset (~86% Accuracy) and to our principal components (~74% Accuracy).
It was nice to learn, via application, the effect of kernel selection. For a simpler model, one we believe is likely to be linearly separable, it may be best to start with a linear kernel; when the data appears to be non-linear (which we believed to be the case here), we can start with a non-linear application and then compare back to a linear model to test our assumptions. In this case the radial kernel model did not perform leaps-and-bounds better than the linear kernel model, but it did outperform it once we accounted for the fact that we were dealing with unbalanced data.
It appears that both models were able to relatively accurately predict non-suicide for those who had not attempted suicide in the past, but each had a more difficult time with patients who had. If the model were advanced further, this might be a case where accurately predicting positive (suicide) and remediating the situation (ie. providing help to the patient) could supersede the model's shortcomings elsewhere, but this is a very sensitive topic and the model would absolutely have to be tested to a much deeper degree and brought to a much higher level of accuracy to even consider such an idea.
In general though, we were surprised by the low accuracy of our SVM models given the "clout" surrounding Support Vector Machines and their ability to accurately predict. The reasons for the lower performance may have been the null values we dropped (maybe there was valuable data therein), the variables we elected to focus upon, the way in which we applied the models, the nature of the data itself, or the simple fact that it's difficult to predict something like suicide because humans are creatures of nuance and with that comes a great level of unpredictability.
We later confirmed, with code included in the Appendix, that Accuracy could be improved via inclusion of stronger (ie. principal components) or more features (ie. the entire dataset).
With the entire dataset we get better accuracy, and this could likely be improved further with different machine learning algorithms. In the future we may include QDA and LDA to see how those models perform.
……………………………………………………………………..
In completing this assignment, reference was made to the following:
……………………………………………………………………..
There are 15 columns for MD. We will follow the same process that we used for ADHD.
## MD.Q1a MD.Q1b MD.Q1c MD.Q1d MD.Q1e MD.Q1f MD.Q1g MD.Q1h MD.Q1i MD.Q1j MD.Q1k
## 1 1 1 1 1 0 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1 1 0 0
## 3 0 0 0 0 1 1 1 0 0 0 0
## 4 1 1 0 0 1 1 1 1 1 0 0
## 5 0 1 0 1 0 1 1 0 0 0 0
## 6 0 1 0 1 1 1 1 1 1 1 1
## MD.Q1L MD.Q1m MD.Q2 MD.Q3
## 1 0 1 1 3
## 2 1 0 1 3
## 3 0 0 0 2
## 4 1 1 1 3
## 5 0 0 1 2
## 6 1 0 1 3
## MD.Q1a MD.Q1b MD.Q1c MD.Q1d MD.Q1e MD.Q1f MD.Q1g MD.Q1h MD.Q1i MD.Q1j MD.Q1k
## 0 0 0 0 0 0 0 0 0 0 0
## MD.Q1L MD.Q1m MD.Q2 MD.Q3
## 0 0 0 0
From the summary above we see no NAs in the MD_df dataset.
The MD.Q3 column's scale is different from the other columns, so to account for the scale difference we set scale. = TRUE.
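A minimal sketch consistent with the pca_md object used below:

#PCA on the MD questionnaire; standardize since MD.Q3 is on a different scale (sketch)
pca_md <- prcomp(MD_df, scale. = TRUE)
summary(pca_md)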
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.3857 1.3474 0.9721 0.94891 0.91563 0.82515 0.79246
## Proportion of Variance 0.3794 0.1210 0.0630 0.06003 0.05589 0.04539 0.04187
## Cumulative Proportion 0.3794 0.5004 0.5635 0.62348 0.67937 0.72476 0.76663
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.77044 0.74286 0.72814 0.70302 0.65162 0.5835 0.55685
## Proportion of Variance 0.03957 0.03679 0.03535 0.03295 0.02831 0.0227 0.02067
## Cumulative Proportion 0.80620 0.84299 0.87834 0.91128 0.93959 0.9623 0.98296
## PC15
## Standard deviation 0.50553
## Proportion of Variance 0.01704
## Cumulative Proportion 1.00000
From the PCA summary, PC1 explains ~38% of the variance in the data, PC1-PC2 cumulatively explain ~50%, PC1-PC3 explain ~56%, and so on.
We plot the scree plot to display the same; the scree plot's y-axis is the variance, which is simply the square of the SD.
From the scree plot, PC1, PC2, PC3, and PC4 together describe ~62% of the variance in the data.
Per https://www.originlab.com/doc/Origin-Help/PrincipleComp-Analysis, if the relationships between variables are weak, PCA does not work well to reduce the data. For the MD data more than 3 principal components are needed to describe the data, so we're not sure how well PCA will work.
We create another dataset by using cbind on the original MD data along with PC1, PC2, PC3, and PC4.
# combine dataset
md_df_2 <- cbind(MD_df, pca_md$x)
# cor plot
cor(MD_df, md_df_2[, 16:19]) %>% kableExtra::kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), latex_options="scale_down")
|  | PC1 | PC2 | PC3 | PC4 |
|---|---|---|---|---|
| MD.Q1a | -0.6888686 | -0.1322144 | 0.0229725 | -0.3776545 |
| MD.Q1b | -0.6421749 | -0.3857186 | 0.1381774 | 0.1459230 |
| MD.Q1c | -0.4439895 | 0.4544153 | -0.4763749 | -0.0977568 |
| MD.Q1d | -0.5655577 | 0.0049581 | 0.0644164 | -0.0442116 |
| MD.Q1e | -0.6391312 | 0.2391336 | -0.1887304 | 0.1745264 |
| MD.Q1f | -0.6668512 | -0.2825469 | 0.1009022 | 0.2032375 |
| MD.Q1g | -0.6818994 | -0.3687898 | 0.0536512 | 0.1911240 |
| MD.Q1h | -0.6193596 | 0.4973240 | 0.0584859 | 0.2126773 |
| MD.Q1i | -0.5873223 | 0.4451927 | -0.1405849 | 0.2802170 |
| MD.Q1j | -0.5954591 | 0.4356557 | 0.1328873 | 0.0517699 |
| MD.Q1k | -0.5118310 | 0.4082574 | 0.5106816 | -0.2784104 |
| MD.Q1L | -0.7087339 | -0.1724858 | 0.1944088 | -0.4085494 |
| MD.Q1m | -0.5682204 | -0.0522540 | -0.4267685 | -0.4284171 |
| MD.Q2 | -0.7328573 | -0.2802758 | 0.0969890 | 0.2440009 |
| MD.Q3 | -0.5118573 | -0.5064956 | -0.3385010 | 0.0527350 |
From the correlation table above, we can see that every MD variable is negatively correlated with PC1, with MD.Q2 showing the strongest correlation (~ -0.73).
In the bi-plots, the red arrows represent the eigenvectors of each variable; these plots help us to interpret the relationship between the principal components and the original variables.
From the PC1-PC2 biplot, observations appear to have low values on MD.Q3 and MD.Q1h. The PC2-PC3 and PC1-PC3 biplots are similarly difficult to read, and much of their interpretation is ambiguous. We need a better approach to examine these relationships.
##
## Loadings:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 PC13 PC14 PC15
## MD.Q1a 1
## MD.Q1b 1
## MD.Q1c 1
## MD.Q1d 1
## MD.Q1e -1
## MD.Q1f 1
## MD.Q1g -1
## MD.Q1h -1
## MD.Q1i 1
## MD.Q1j -1
## MD.Q1k 1
## MD.Q1L -1
## MD.Q1m -1
## MD.Q2 -1
## MD.Q3 1
##
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
## SS loadings 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
## Proportion Var 0.067 0.067 0.067 0.067 0.067 0.067 0.067 0.067 0.067 0.067
## Cumulative Var 0.067 0.133 0.200 0.267 0.333 0.400 0.467 0.533 0.600 0.667
## PC11 PC12 PC13 PC14 PC15
## SS loadings 1.000 1.000 1.000 1.000 1.000
## Proportion Var 0.067 0.067 0.067 0.067 0.067
## Cumulative Var 0.733 0.800 0.867 0.933 1.000
The varimax loadings can be interpreted as follows: the PCA of MD shows the first principal component holds ~38% of the variance in the data. To capture ~90% of the variance, we need to include 11 of the principal components, a reduction from 15 features to 11.
We next perform PCA using the entire dataset, excluding the individual ADHD and MD question columns and instead taking ADHD.Total and MD.TOTAL into consideration.
Here we again use the built-in R function prcomp() to perform the Principal Component Analysis.
PCA can be applied only to numerical data. We have a few categorical variables in this dataset that need to be converted to a numerical encoding.
## Age Sex Race ADHD.Total
## "integer" "integer" "integer" "integer"
## MD.TOTAL SUB.Total Court.order Education
## "integer" "numeric" "integer" "integer"
## Hx.of.Violence Disorderly.Conduct Suicide Abuse
## "integer" "integer" "integer" "integer"
## Non.subst.Dx Subst.Dx
## "integer" "integer"
As we can see, there are only 4 numeric data types and the rest are factors. We need to convert all the factors into numerics.
cols.num <- c("Sex","Race","Court.order","Education","Hx.of.Violence", "Disorderly.Conduct", "Suicide","Abuse","Non.subst.Dx","Subst.Dx")
genl[cols.num] <- sapply(genl[cols.num],as.numeric)
## Age Sex Race ADHD.Total
## "integer" "numeric" "numeric" "integer"
## MD.TOTAL SUB.Total Court.order Education
## "integer" "numeric" "numeric" "numeric"
## Hx.of.Violence Disorderly.Conduct Suicide Abuse
## "numeric" "numeric" "numeric" "numeric"
## Non.subst.Dx Subst.Dx
## "numeric" "numeric"
We show the percentage of variance explained by each principal component.
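The chunk isn't echoed; a minimal sketch consistent with the result.pca object used below:

#PCA on the converted genl data and scree plot of variance explained (sketch)
result.pca <- prcomp(genl, scale. = TRUE)
fviz_eig(result.pca, addlabels = TRUE)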
Individuals with a similar profile are grouped together.
# Graph of individuals
fviz_pca_ind(result.pca,
col.ind = "cos2", # Color by the quality of representation
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
Positively correlated variables point to the same side of the plot; negatively correlated variables point to opposite sides of the graph.
# Graph of variables
fviz_pca_var(result.pca,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
# Biplot of individuals and variables
fviz_pca_biplot(result.pca, repel = TRUE,
col.var = "#2E9FDF", # Variables color
col.ind = "#696969" # Individuals color
)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.6316 1.4760 1.1923 1.11534 1.06328 1.00875 0.94316
## Proportion of Variance 0.1902 0.1556 0.1015 0.08886 0.08075 0.07268 0.06354
## Cumulative Proportion 0.1902 0.3458 0.4473 0.53617 0.61693 0.68961 0.75315
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.85085 0.83870 0.7812 0.69752 0.6372 0.5916 0.41911
## Proportion of Variance 0.05171 0.05024 0.0436 0.03475 0.0290 0.0250 0.01255
## Cumulative Proportion 0.80486 0.85511 0.8987 0.93345 0.9625 0.9875 1.00000
PCA of the General dataset: the first principal component holds ~19% of the variance in the data. To capture ~93% of the variance, we need to include 11 of the principal components, a reduction from 14 features to 11.
……………………………………………………………………..
In k-means clustering, we first create new, unlabeled points that serve as "centroids", compute the Euclidean distance between each centroid and the surrounding observations, and use those distances for category assignment. Each centroid is then recomputed as the average of all the points assigned to its category. Finally, the algorithm is re-run with these new centroids, ending when there is no change in centroid position between one run of the clustering algorithm and the next.
To perform the cluster analysis, we first need to select the optimal number of clusters; there are several methods for doing so.
We’ll use the following R packages:
factoextra to determine the optimal number of clusters for a given clustering method and for data visualization, and NbClust for computing about 30 methods at once in order to find the optimal number of clusters.
For fviz_nbclust, the main parameters are the data, the clustering function (ie. kmeans), and the method used to estimate the optimal number of clusters ("wss", "silhouette", or "gap_stat").
We standardize the data, centering by the mean and scaling by the standard deviation.
# standardize the data
means <- apply(genl,2,mean)
sds <- apply(genl,2,sd)
nor = scale(genl,center=means,scale=sds)
# Elbow method
fviz_nbclust(nor, kmeans, method = "wss") +
geom_vline(xintercept = 6, linetype = 2)+
labs(subtitle = "Elbow method")# Silhouette method
fviz_nbclust(nor, kmeans, method = "silhouette")+
labs(subtitle = "Silhouette method")# Gap statistic
# nboot = 50 to keep the function speedy.
# recommended value: nboot= 500 for your analysis.
# Use verbose = FALSE to hide computing progression.
set.seed(140)
fviz_nbclust(nor, kmeans, nstart = 25, method = "gap_stat", nboot = 50)+
labs(subtitle = "Gap statistic method")The disadvantage of elbow and average silhouette methods is that, they measure a global clustering characteristic only. A more sophisticated method is to use the gap statistic which provides a statistical procedure to formalize the elbow/silhouette heuristic in order to estimate the optimal number of clusters.
According to these observations, it’s possible to define k = 5 as the optimal number of clusters in the data.
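The clustering call isn't echoed; a minimal sketch consistent with the kmeans_res object and the 5-cluster output below:

#Fit k-means with k = 5 on the standardized data (sketch)
set.seed(140)
kmeans_res <- kmeans(nor, centers = 5, nstart = 25)
kmeans_res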
## K-means clustering with 5 clusters of sizes 11, 47, 30, 24, 30
##
## Cluster means:
## Age Sex Race ADHD.Total MD.TOTAL SUB.Total
## 1 -0.6569935740 0.1754249 -0.3842857 0.02634379 0.1301821 0.54162019
## 2 -0.0002945166 -0.4474302 0.2545898 -0.48604331 -0.2163302 0.05965279
## 3 0.3244546883 0.2845213 0.1498511 0.42753297 -0.3387180 -1.08642639
## 4 -0.1334416906 1.0012237 -0.3662406 0.53686606 0.5563087 0.70092025
## 5 0.0236577174 -0.4488487 -0.1148113 -0.10521735 0.1848549 0.23364008
## Court.order Education Hx.of.Violence Disorderly.Conduct Suicide
## 1 3.4387824 -0.350659310 0.4403721 0.64578809 0.48949560
## 2 -0.2887527 -0.307035361 -0.5915229 0.18123941 -0.54200720
## 3 -0.2887527 0.701004304 -0.5915229 -1.10091493 -0.25036946
## 4 -0.2887527 -0.108316402 -0.4023422 -0.08200484 1.10604981
## 5 -0.2887527 -0.004754037 1.6786462 0.64578809 0.03519249
## Abuse Non.subst.Dx Subst.Dx
## 1 0.2884947 0.11673855 0.16124445
## 2 -0.5021573 -0.61105397 0.14289936
## 3 0.1810423 1.01574893 -1.07460319
## 4 0.7884472 -0.06964165 0.88375228
## 5 -0.1308684 -0.04552186 0.08460274
##
## Clustering vector:
## [1] 1 3 3 3 1 4 5 5 5 4 5 3 2 3 1 3 3 5 3 3 4 5 3 4 3 1 3 4 2 2 5 2 4 4 2 5 5
## [38] 4 2 4 5 2 2 4 4 5 5 4 2 4 4 5 4 1 1 2 2 2 5 1 3 5 3 2 2 5 3 3 2 3 3 3 3 3
## [75] 4 5 1 3 5 2 4 5 3 4 3 3 3 3 2 2 2 2 5 5 1 2 2 2 2 2 2 2 2 4 2 5 5 2 2 2 2
## [112] 2 2 4 1 5 2 2 2 5 2 2 2 5 2 4 5 3 3 4 2 2 5 5 5 1 2 4 4 2 2 3
##
## Within cluster sum of squares by cluster:
## [1] 141.3976 350.9247 334.8022 230.6030 251.2763
## (between_SS / total_SS = 33.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
# cluster visualization
fviz_cluster(kmeans_res, data = genl,
ellipse.type = "convex",
ggtheme = theme_classic())
We calculate each numeric variable's mean value for each cluster and then output the resulting table:
aggregate(genl, by=list(cluster=kmeans_res$cluster), mean) %>% kableExtra::kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), latex_options="scale_down")
| cluster | Age | Sex | Race | ADHD.Total | MD.TOTAL | SUB.Total | Court.order | Education | Hx.of.Violence | Disorderly.Conduct | Suicide | Abuse | Non.subst.Dx | Subst.Dx |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 32.18182 | 1.545454 | 1.363636 | 35.27273 | 10.909091 | 5.545454 | 1 | 11.18182 | 0.4545455 | 1.0000000 | 0.5454545 | 1.8181818 | 0.5454545 | 1.272727 |
| 2 | 39.53191 | 1.234043 | 1.765957 | 26.68085 | 9.276596 | 4.170213 | 0 | 11.27660 | 0.0000000 | 0.7872340 | 0.0638298 | 0.2127660 | 0.0425532 | 1.255319 |
| 3 | 43.16667 | 1.600000 | 1.700000 | 42.00000 | 8.700000 | 0.900000 | 0 | 13.46667 | 0.0000000 | 0.2000000 | 0.2000000 | 1.6000000 | 1.1666667 | 0.100000 |
| 4 | 38.04167 | 1.958333 | 1.375000 | 43.83333 | 12.916667 | 6.000000 | 0 | 11.70833 | 0.0833333 | 0.6666667 | 0.8333333 | 2.8333333 | 0.4166667 | 1.958333 |
| 5 | 39.80000 | 1.233333 | 1.533333 | 33.06667 | 11.166667 | 4.666667 | 0 | 11.93333 | 1.0000000 | 1.0000000 | 0.3333333 | 0.9666667 | 0.4333333 | 1.200000 |
Based on the output above we'd profile our groups as follows: cluster 1 is the youngest and entirely court ordered; cluster 2 shows the lowest ADHD scores and little history of violence, abuse, or suicide; cluster 3 is the oldest and most educated, with the lowest substance scores; cluster 4 is largely female, with the highest ADHD, MD, abuse, and suicide values; and cluster 5 is marked by a universal history of violence and disorderly conduct.
……………………………………………………………………..
Below, we re-visited our dataset with a different approach to compare-contrast the resulting Accuracies. The results were an improvement :)
We will perform SVM with the entire dataset, then create another SVM model on the reduced dimensions produced by PCA of the genl dataset, and compare the resulting metrics of these models to find the best one.
Below we create 3 models using 3 kernels: radial, linear, and polynomial.
data_all_new <- data_all %>% select(-Suicide, everything()) #move Suicide to the last column
#Set seed (replicability)
set.seed(700)
#Perform train-test Split
intrain <- createDataPartition(data_all_new$Suicide, p = 0.75, list = FALSE)
training_all <- data_all_new[intrain,]
test_all <- data_all_new[-intrain,]
results <- test_all$Suicide #hold actual test values for later comparison
test_all$Suicide <- NA #mask the response in the test set (predictions below drop the last column)
gammalist <- c(0.005,0.01,0.015,0.02,0.025,0.03,0.035,0.04,0.045,0.05)
tune.out_radial <- tune.svm(as.factor(Suicide) ~., data=training_all,
kernel='radial', cost=2^(1:5), gamma = gammalist)
tune.out_linear <- tune.svm(as.factor(Suicide) ~., data=training_all,
kernel='linear')
tune.out_poly <- tune.svm(as.factor(Suicide) ~., data=training_all,
kernel='polynomial', cost=2^(1:5), gamma = gammalist)
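Before scoring the tuned models, it can be helpful to confirm which hyperparameters cross-validation selected. A brief sketch using e1071's tune interface (shown for the radial and linear fits; the same accessors apply to the polynomial fit):
summary(tune.out_radial) #best cost / gamma and cross-validated error for the radial kernel
tune.out_radial$best.parameters #selected hyperparameters, available directly
tune.out_linear$best.performance #lowest cross-validated error for the linear kernel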
svm4 <- predict(tune.out_radial$best.model, test_all[,-dim(test_all)[2]])
svm5 <- predict(tune.out_linear$best.model, test_all[,-dim(test_all)[2]])
svm6 <- predict(tune.out_poly$best.model, test_all[,-dim(test_all)[2]])
#Convert performance statistics to tabular form (for output / interpretation):
AccuracySVM4 <- confusionMatrix(svm4, as.factor(results))$overall['Accuracy']
AccuracySVM5 <- confusionMatrix(svm5, as.factor(results))$overall['Accuracy']
AccuracySVM6 <- confusionMatrix(svm6, as.factor(results))$overall['Accuracy']
SVM_Model_4 <- confusionMatrix(svm4, as.factor(results))$byClass
SVM_Model_4 <- data.frame(SVM_Model_4)
SVM_Model_4 <- rbind("Accuracy" = AccuracySVM4, SVM_Model_4)
SVM_Model_5 <- confusionMatrix(svm5, as.factor(results))$byClass
SVM_Model_5 <- data.frame(SVM_Model_5)
SVM_Model_5 <- rbind("Accuracy" = AccuracySVM5, SVM_Model_5)
SVM_Model_6 <- confusionMatrix(svm6, as.factor(results))$byClass
SVM_Model_6 <- data.frame(SVM_Model_6)
SVM_Model_6 <- rbind("Accuracy" = AccuracySVM6, SVM_Model_6)
tabularview <- data.frame(SVM_Model_4, SVM_Model_5, SVM_Model_6)
tabularview %>% kableExtra::kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), latex_options="scale_down")
| | SVM_Model_4 | SVM_Model_5 | SVM_Model_6 |
|---|---|---|---|
| Accuracy | 0.8604651 | 0.8372093 | 0.8604651 |
| Sensitivity | 0.9189189 | 0.8918919 | 1.0000000 |
| Specificity | 0.5000000 | 0.5000000 | 0.0000000 |
| Pos Pred Value | 0.9189189 | 0.9166667 | 0.8604651 |
| Neg Pred Value | 0.5000000 | 0.4285714 | NaN |
| Precision | 0.9189189 | 0.9166667 | 0.8604651 |
| Recall | 0.9189189 | 0.8918919 | 1.0000000 |
| F1 | 0.9189189 | 0.9041096 | 0.9250000 |
| Prevalence | 0.8604651 | 0.8604651 | 0.8604651 |
| Detection Rate | 0.7906977 | 0.7674419 | 0.8604651 |
| Detection Prevalence | 0.8604651 | 0.8372093 | 1.0000000 |
| Balanced Accuracy | 0.7094595 | 0.6959459 | 0.5000000 |
SVM_Model_4 (radial) and SVM_Model_6 (polynomial) share the same Accuracy, and SVM_Model_4 matches or outperforms SVM_Model_5 (linear) on every metric, including Sensitivity, Recall, Detection Rate, F1, Precision, Neg Pred Value, and Balanced Accuracy. SVM_Model_6, despite its Accuracy, shows a NaN Neg Pred Value and zero Specificity, so it is not a good fit for predicting suicide. Let's look at the confusion matrices and interpret how each model works.
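The raw prediction-versus-reference tables below can be pulled from caret's confusionMatrix() output; a minimal sketch of the presumed call for the first model:
confusionMatrix(svm4, as.factor(results))$table #rows = predictions, columns = actuals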
SVM_Model_4
## Reference
## Prediction 0 1
## 0 34 3
## 1 3 3
This model performed well at predicting both classes; the total misclassification count is 6 (3 false positives and 3 false negatives).
SVM_Model_5
## Reference
## Prediction 0 1
## 0 33 3
## 1 4 3
This model also performed well on the no-suicide class but slightly worse overall, with a total misclassification count of 7.
SVM_Model_6
## Reference
## Prediction 0 1
## 0 37 6
## 1 0 0
This model performed perfectly on the no-suicide class but failed entirely on the suicide class, never predicting it once. It is definitely not a good model for this task, since it cannot identify the suicide class at all.
From the varimax plot we found that we need to include 11 principal components to capture more than 90% of the variance in the data. Guided by the PCA, we exclude 'ADHD.Total', 'Non.subst.Dx', and 'Subst.Dx' from the dataset (a quick sanity check of the variance threshold is sketched after the data-preparation code below).
We split the data into train and test sets, fit and tune the models, predict on the test data, and compute performance metrics to assess each model.
genl_2 <- data[c(2:4,23,39)]
genl_2$SUB.Total <- sub$SUB.Total
genl_2 <- cbind(genl_2, hstry)
# imputation
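# NOTE (assumption): impute() is taken to be Hmisc::impute, with `mode` a statistical-mode
# helper defined earlier in the document; base R's mode() returns the storage mode instead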
genl_3 <- genl_2 %>%
mutate(
SUB.Total = impute(SUB.Total, mode),
Court.order = impute(Court.order, mode),
Education = impute(Education, mode),
Hx.of.Violence = impute(Hx.of.Violence, mode),
Disorderly.Conduct = impute(Disorderly.Conduct, mode),
Suicide = impute(Suicide, mode),
Abuse = impute(Abuse, mode),
Non.subst.Dx = impute(Non.subst.Dx, mode),
Subst.Dx = impute(Subst.Dx, mode),
Psych.meds. = impute(Psych.meds., mode)
) %>%
mutate_if(is.character, as.numeric)
genl_svm <- genl_3 %>% select(-c('ADHD.Total', 'Non.subst.Dx', 'Subst.Dx')) %>% select(-Suicide, everything()) #drop PCA-flagged variables; move Suicide to the last column
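As referenced above, a quick sanity check of the 90%-variance claim could look like the following sketch. It assumes the imputed genl_3 is fully numeric with no remaining NAs; the exact data frame used for the original PCA is an assumption:
pca_res <- prcomp(genl_3, center = TRUE, scale. = TRUE) #PCA on the standardized data
cum_var <- cumsum(pca_res$sdev^2) / sum(pca_res$sdev^2) #cumulative proportion of variance explained
which(cum_var >= 0.90)[1] #number of components needed to reach 90% variance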
#Set seed (replicability)
set.seed(900)
#Perform train-test Split
intrain <- createDataPartition(genl_svm$Suicide, p = 0.75, list = FALSE)
training_not_all <- genl_svm[intrain,]
test_not_all <- genl_svm[-intrain,]
results_not_all <- test_not_all$Suicide #hold actual test values for later comparison
test_not_all$Suicide <- NA
gammalist <- c(0.005,0.01,0.015,0.02,0.025,0.03,0.035,0.04,0.045,0.05)
tune.out_radial_7 <- tune.svm(as.factor(Suicide) ~., data=training_not_all,
kernel='radial', cost=2^(1:5), gamma = gammalist)
tune.out_linear_8 <- tune.svm(as.factor(Suicide) ~., data=training_not_all,
kernel='linear')
tune.out_poly_9 <- tune.svm(as.factor(Suicide) ~., data=training_not_all,
kernel='polynomial', cost=2^(1:5), gamma = gammalist)
svm7 <- predict(tune.out_radial_7$best.model, test_not_all[,-dim(test_not_all)[2]])
svm8 <- predict(tune.out_linear_8$best.model, test_not_all[,-dim(test_not_all)[2]])
svm9 <- predict(tune.out_poly_9$best.model, test_not_all[,-dim(test_not_all)[2]])
#Convert performance statistics to tabular form (for output / interpretation):
AccuracySVM7 <- confusionMatrix(svm7, as.factor(results_not_all))$overall['Accuracy']
AccuracySVM8 <- confusionMatrix(svm8, as.factor(results_not_all))$overall['Accuracy']
AccuracySVM9 <- confusionMatrix(svm9, as.factor(results_not_all))$overall['Accuracy']
SVM_Model_7 <- confusionMatrix(svm7, as.factor(results_not_all))$byClass
SVM_Model_7 <- data.frame(SVM_Model_7)
SVM_Model_7 <- rbind("Accuracy" = AccuracySVM7, SVM_Model_7)
SVM_Model_8 <- confusionMatrix(svm8, as.factor(results_not_all))$byClass
SVM_Model_8 <- data.frame(SVM_Model_8)
SVM_Model_8 <- rbind("Accuracy" = AccuracySVM8, SVM_Model_8)
SVM_Model_9 <- confusionMatrix(svm9, as.factor(results_not_all))$byClass
SVM_Model_9 <- data.frame(SVM_Model_9)
SVM_Model_9 <- rbind("Accuracy" = AccuracySVM9, SVM_Model_9)
tabularview <- data.frame(SVM_Model_7, SVM_Model_8, SVM_Model_9)
tabularview %>% kableExtra::kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), latex_options="scale_down")
| | SVM_Model_7 | SVM_Model_8 | SVM_Model_9 |
|---|---|---|---|
| Accuracy | 0.7441860 | 0.6744186 | 0.7441860 |
| Sensitivity | 0.9655172 | 1.0000000 | 1.0000000 |
| Specificity | 0.2857143 | 0.0000000 | 0.2142857 |
| Pos Pred Value | 0.7368421 | 0.6744186 | 0.7250000 |
| Neg Pred Value | 0.8000000 | NaN | 1.0000000 |
| Precision | 0.7368421 | 0.6744186 | 0.7250000 |
| Recall | 0.9655172 | 1.0000000 | 1.0000000 |
| F1 | 0.8358209 | 0.8055556 | 0.8405797 |
| Prevalence | 0.6744186 | 0.6744186 | 0.6744186 |
| Detection Rate | 0.6511628 | 0.6744186 | 0.6744186 |
| Detection Prevalence | 0.8837209 | 1.0000000 | 0.9302326 |
| Balanced Accuracy | 0.6256158 | 0.5000000 | 0.6071429 |
We extend the following commentary based on the output above:
SVM_Model_7 and SVM_Model_9 have higher accuracy than SVM_Model_8. Let's look at each confusion matrix and interpret the results.
SVM_Model_7
## Reference
## Prediction 0 1
## 0 28 10
## 1 1 4
SVM_Model_7 has an accuracy of 74%; it makes 11 misclassifications in total, most of them in the suicide class. It predicts the no-suicide class well but is not a good model for the suicide class.
SVM_Model_8
## Reference
## Prediction 0 1
## 0 29 14
## 1 0 0
SVM_Model_8 has an accuracy of 67%; it makes 14 misclassifications in total and gets every suicide-class observation wrong, predicting the no-suicide class for all of them. This is not a good model to go with.
SVM_Model_9
## Reference
## Prediction 0 1
## 0 29 11
## 1 0 3
SVM_Model_9 has the same accuracy as SVM_Model_7. From the confusion matrix we see 11 misclassifications in total, all in the suicide class. The model predicts the no-suicide class with 100% accuracy but performs poorly on the suicide class.
We fit nine SVM models in total. The genl dataset without PCA-based reduction gave better results than the PCA-reduced genl data, and the models trained on the entire dataset performed best of all. Among all nine, SVM_Model_4 (radial kernel, full dataset) is the best model.