For this assignment, we will be working with a very interesting mental health dataset from a real-life research project. All identifying information, of course, has been removed. The attached spreadsheet has the data (the tab name “Data”). The data dictionary is given in the second tab. You can get as creative as you want. The assignment is designed to really get you to think about how you could use different methods.
Please use a clustering method to find clusters of patients here. Whether you choose to use k-means clustering or hierarchical clustering is up to you as long as you reason through your work. Can you come up with creative names for the profiles you found? (
Let’s explore using Principal Component Analysis on this dataset. You will note that there are different types of questions in the dataset: column: E-W: ADHD self-report; column X – AM: mood disorders questionnaire, column AN-AS: Individual Substance Misuse; etc. Please reason through your work as you decide on which sets of variables you want to use to conduct Principal Component Analysis.
Assume you are modeling whether a patient attempted suicide (column AX). Please use support vector machine to model this. You might want to consider reducing the number of variables or somehow use extracted information from the variables. This can be a really fun modeling task!
HW4 for Machine Learning 622 This is Version6
library(skimr)
library(tidymodels) ## -- Attaching packages -------------------------------------- tidymodels 0.1.3 --
## v broom 0.7.6 v recipes 0.1.16
## v dials 0.0.9 v rsample 0.1.0
## v dplyr 1.0.6 v tibble 3.1.1
## v ggplot2 3.3.3 v tidyr 1.1.3
## v infer 0.5.4 v tune 0.1.5
## v modeldata 0.1.0 v workflows 0.2.2
## v parsnip 0.1.5 v workflowsets 0.0.2
## v purrr 0.3.4 v yardstick 0.0.8
## -- Conflicts ----------------------------------------- tidymodels_conflicts() --
## x purrr::discard() masks scales::discard()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x recipes::step() masks stats::step()
## * Use tidymodels_prefer() to resolve common conflicts.
library(visdat) ## missing visuallibrary(dplyr)
library(tidyr)
library(ggplot2)
library(tidyverse)
library(caret) # For featureplot, classification report
library(corrplot) # For correlation matrix
library(AppliedPredictiveModeling)
library(mice) # For data imputation
library(VIM) # For missing data visualization
library(gridExtra) # For grid plots
library(pROC) # For AUC calculations
library(ROCR) # For ROC and AUC plots
library(dendextend)
library(factoextra)library(inspectdf)
library(corrplot)
library(stats)
library(kableExtra)##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
library(kernlab) # SVM methodology##
## Attaching package: 'kernlab'
## The following object is masked from 'package:purrr':
##
## cross
## The following object is masked from 'package:ggplot2':
##
## alpha
## The following object is masked from 'package:scales':
##
## alpha
library(e1071) # SVM methodology##
## Attaching package: 'e1071'
## The following object is masked from 'package:tune':
##
## tune
## The following object is masked from 'package:rsample':
##
## permutations
library(data.table)
library(CGPfunctions)## Registered S3 methods overwritten by 'lme4':
## method from
## cooks.distance.influence.merMod car
## influence.merMod car
## dfbeta.influence.merMod car
## dfbetas.influence.merMod car
ADHD is a neurological disorder that can present itself in adolescence and adulthood. Some individuals can outgrow ADHD but around 30% will continue to have ADHD throughout their adulthood.
The Adult ADHD Self-Report Scale Symptom Checklist is a self-reported questionnaire used to assist in the diagnosis of adult ADHD.
The Mood Disorder Questionnaire is a self-report questionnaire designed to help detect bipolar disorder. It focuses on symptoms of hypomania and mania, which are the mood states that separate bipolar disorders from other types of depression and mood disorder. It has 5 main questions, and the first question has 13 parts, for a total of 17 questions.
Details of the items an answers for the questionnaires are included in link below
https://counsellingresource.com/quizzes/adhd-testing/adhd-asrs/
https://www.apaservices.org/practice/reimbursement/health-registry/self-reporting-sympton-scale.pdf
https://www.healthyplace.com/depression/bipolar/mood-disorders-questionnaire#:~:text=The%20Mood%20Disorder%20Questionnaire%20%28MDQ%29%20was%20developed%20by,-%20it%20is%20not%20considered%20a%20diagnostic%20tool.
adhd_hw4_notReduced <- read.csv ('adhd_data.csv', header = TRUE) #175*54head(adhd_hw4_notReduced,5)tail(adhd_hw4_notReduced,6)adhd_hw4_notReduced[adhd_hw4_notReduced==''] <- NA
nrow(is.na(adhd_hw4_notReduced))## [1] 175
head(adhd_hw4_notReduced,3) skim(adhd_hw4_notReduced)| Name | adhd_hw4_notReduced |
| Number of rows | 175 |
| Number of columns | 54 |
| _______________________ | |
| Column type frequency: | |
| factor | 1 |
| numeric | 53 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Initial | 0 | 1 | FALSE | 109 | DB: 5, CM: 4, DJ: 4, JM: 4 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Age | 0 | 1.00 | 39.47 | 11.17 | 18 | 29.5 | 42 | 48.0 | 69 | <U+2586><U+2585><U+2587><U+2585><U+2581> |
| Sex | 0 | 1.00 | 1.43 | 0.50 | 1 | 1.0 | 1 | 2.0 | 2 | <U+2587><U+2581><U+2581><U+2581><U+2586> |
| Race | 0 | 1.00 | 1.64 | 0.69 | 1 | 1.0 | 2 | 2.0 | 6 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| ADHD.Q1 | 0 | 1.00 | 1.70 | 1.29 | 0 | 1.0 | 2 | 3.0 | 4 | <U+2587><U+2587><U+2587><U+2586><U+2583> |
| ADHD.Q2 | 0 | 1.00 | 1.91 | 1.25 | 0 | 1.0 | 2 | 3.0 | 4 | <U+2585><U+2587><U+2587><U+2586><U+2585> |
| ADHD.Q3 | 0 | 1.00 | 1.91 | 1.27 | 0 | 1.0 | 2 | 3.0 | 4 | <U+2585><U+2587><U+2587><U+2586><U+2585> |
| ADHD.Q4 | 0 | 1.00 | 2.10 | 1.34 | 0 | 1.0 | 2 | 3.0 | 4 | <U+2585><U+2585><U+2587><U+2585><U+2586> |
| ADHD.Q5 | 0 | 1.00 | 2.26 | 1.44 | 0 | 1.0 | 3 | 3.0 | 5 | <U+2587><U+2585><U+2587><U+2586><U+2581> |
| ADHD.Q6 | 0 | 1.00 | 1.91 | 1.31 | 0 | 1.0 | 2 | 3.0 | 4 | <U+2586><U+2585><U+2587><U+2587><U+2583> |
| ADHD.Q7 | 0 | 1.00 | 1.83 | 1.19 | 0 | 1.0 | 2 | 3.0 | 4 | <U+2583><U+2587><U+2587><U+2583><U+2583> |
| ADHD.Q8 | 0 | 1.00 | 2.14 | 1.29 | 0 | 1.0 | 2 | 3.0 | 4 | <U+2583><U+2587><U+2587><U+2587><U+2586> |
| ADHD.Q9 | 0 | 1.00 | 1.91 | 1.32 | 0 | 1.0 | 2 | 3.0 | 4 | <U+2586><U+2587><U+2587><U+2587><U+2585> |
| ADHD.Q10 | 0 | 1.00 | 2.12 | 1.23 | 0 | 1.0 | 2 | 3.0 | 4 | <U+2582><U+2587><U+2587><U+2586><U+2585> |
| ADHD.Q11 | 0 | 1.00 | 2.27 | 1.24 | 0 | 1.0 | 2 | 3.0 | 4 | <U+2582><U+2586><U+2587><U+2587><U+2586> |
| ADHD.Q12 | 0 | 1.00 | 1.29 | 1.21 | 0 | 0.0 | 1 | 2.0 | 4 | <U+2587><U+2587><U+2586><U+2582><U+2582> |
| ADHD.Q13 | 0 | 1.00 | 2.37 | 1.23 | 0 | 1.5 | 2 | 3.0 | 4 | <U+2582><U+2585><U+2587><U+2587><U+2586> |
| ADHD.Q14 | 0 | 1.00 | 2.25 | 1.35 | 0 | 1.0 | 2 | 3.0 | 4 | <U+2585><U+2585><U+2587><U+2587><U+2586> |
| ADHD.Q15 | 0 | 1.00 | 1.63 | 1.39 | 0 | 0.0 | 1 | 3.0 | 4 | <U+2587><U+2586><U+2586><U+2585><U+2583> |
| ADHD.Q16 | 0 | 1.00 | 1.70 | 1.38 | 0 | 1.0 | 1 | 3.0 | 4 | <U+2586><U+2587><U+2586><U+2583><U+2585> |
| ADHD.Q17 | 0 | 1.00 | 1.53 | 1.29 | 0 | 0.0 | 1 | 2.0 | 4 | <U+2587><U+2587><U+2587><U+2583><U+2583> |
| ADHD.Q18 | 0 | 1.00 | 1.47 | 1.30 | 0 | 0.0 | 1 | 2.0 | 4 | <U+2587><U+2587><U+2586><U+2583><U+2583> |
| ADHD.Total | 0 | 1.00 | 34.32 | 16.70 | 0 | 21.0 | 33 | 47.5 | 72 | <U+2583><U+2586><U+2587><U+2586><U+2582> |
| MD.Q1a | 0 | 1.00 | 0.55 | 0.50 | 0 | 0.0 | 1 | 1.0 | 1 | <U+2586><U+2581><U+2581><U+2581><U+2587> |
| MD.Q1b | 0 | 1.00 | 0.57 | 0.50 | 0 | 0.0 | 1 | 1.0 | 1 | <U+2586><U+2581><U+2581><U+2581><U+2587> |
| MD.Q1c | 0 | 1.00 | 0.54 | 0.50 | 0 | 0.0 | 1 | 1.0 | 1 | <U+2587><U+2581><U+2581><U+2581><U+2587> |
| MD.Q1d | 0 | 1.00 | 0.58 | 0.49 | 0 | 0.0 | 1 | 1.0 | 1 | <U+2586><U+2581><U+2581><U+2581><U+2587> |
| MD.Q1e | 0 | 1.00 | 0.55 | 0.50 | 0 | 0.0 | 1 | 1.0 | 1 | <U+2586><U+2581><U+2581><U+2581><U+2587> |
| MD.Q1f | 0 | 1.00 | 0.70 | 0.46 | 0 | 0.0 | 1 | 1.0 | 1 | <U+2583><U+2581><U+2581><U+2581><U+2587> |
| MD.Q1g | 0 | 1.00 | 0.72 | 0.45 | 0 | 0.0 | 1 | 1.0 | 1 | <U+2583><U+2581><U+2581><U+2581><U+2587> |
| MD.Q1h | 0 | 1.00 | 0.56 | 0.50 | 0 | 0.0 | 1 | 1.0 | 1 | <U+2586><U+2581><U+2581><U+2581><U+2587> |
| MD.Q1i | 0 | 1.00 | 0.59 | 0.49 | 0 | 0.0 | 1 | 1.0 | 1 | <U+2586><U+2581><U+2581><U+2581><U+2587> |
| MD.Q1j | 0 | 1.00 | 0.39 | 0.49 | 0 | 0.0 | 0 | 1.0 | 1 | <U+2587><U+2581><U+2581><U+2581><U+2585> |
| MD.Q1k | 0 | 1.00 | 0.49 | 0.50 | 0 | 0.0 | 0 | 1.0 | 1 | <U+2587><U+2581><U+2581><U+2581><U+2587> |
| MD.Q1L | 0 | 1.00 | 0.58 | 0.49 | 0 | 0.0 | 1 | 1.0 | 1 | <U+2586><U+2581><U+2581><U+2581><U+2587> |
| MD.Q1m | 0 | 1.00 | 0.49 | 0.50 | 0 | 0.0 | 0 | 1.0 | 1 | <U+2587><U+2581><U+2581><U+2581><U+2587> |
| MD.Q2 | 0 | 1.00 | 0.72 | 0.45 | 0 | 0.0 | 1 | 1.0 | 1 | <U+2583><U+2581><U+2581><U+2581><U+2587> |
| MD.Q3 | 0 | 1.00 | 2.01 | 1.07 | 0 | 1.0 | 2 | 3.0 | 3 | <U+2582><U+2582><U+2581><U+2585><U+2587> |
| MD.TOTAL | 0 | 1.00 | 10.02 | 4.81 | 0 | 6.5 | 11 | 14.0 | 17 | <U+2583><U+2583><U+2586><U+2587><U+2587> |
| Alcohol | 4 | 0.98 | 1.35 | 1.39 | 0 | 0.0 | 1 | 3.0 | 3 | <U+2587><U+2582><U+2581><U+2581><U+2586> |
| THC | 4 | 0.98 | 0.81 | 1.27 | 0 | 0.0 | 0 | 1.5 | 3 | <U+2587><U+2581><U+2581><U+2581><U+2583> |
| Cocaine | 4 | 0.98 | 1.09 | 1.39 | 0 | 0.0 | 0 | 3.0 | 3 | <U+2587><U+2581><U+2581><U+2581><U+2585> |
| Stimulants | 4 | 0.98 | 0.12 | 0.53 | 0 | 0.0 | 0 | 0.0 | 3 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| Sedative.hypnotics | 4 | 0.98 | 0.12 | 0.54 | 0 | 0.0 | 0 | 0.0 | 3 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| Opioids | 4 | 0.98 | 0.39 | 0.99 | 0 | 0.0 | 0 | 0.0 | 3 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| Court.order | 5 | 0.97 | 0.09 | 0.28 | 0 | 0.0 | 0 | 0.0 | 1 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| Education | 9 | 0.95 | 11.90 | 2.17 | 6 | 11.0 | 12 | 13.0 | 19 | <U+2581><U+2585><U+2587><U+2582><U+2581> |
| Hx.of.Violence | 11 | 0.94 | 0.24 | 0.43 | 0 | 0.0 | 0 | 0.0 | 1 | <U+2587><U+2581><U+2581><U+2581><U+2582> |
| Disorderly.Conduct | 11 | 0.94 | 0.73 | 0.45 | 0 | 0.0 | 1 | 1.0 | 1 | <U+2583><U+2581><U+2581><U+2581><U+2587> |
| Suicide | 13 | 0.93 | 0.30 | 0.46 | 0 | 0.0 | 0 | 1.0 | 1 | <U+2587><U+2581><U+2581><U+2581><U+2583> |
| Abuse | 14 | 0.92 | 1.33 | 2.12 | 0 | 0.0 | 0 | 2.0 | 7 | <U+2587><U+2582><U+2581><U+2581><U+2581> |
| Non.subst.Dx | 22 | 0.87 | 0.44 | 0.68 | 0 | 0.0 | 0 | 1.0 | 2 | <U+2587><U+2581><U+2583><U+2581><U+2581> |
| Subst.Dx | 23 | 0.87 | 1.14 | 0.93 | 0 | 0.0 | 1 | 2.0 | 3 | <U+2586><U+2587><U+2581><U+2585><U+2582> |
| Psych.meds. | 118 | 0.33 | 0.96 | 0.80 | 0 | 0.0 | 1 | 2.0 | 2 | <U+2587><U+2581><U+2587><U+2581><U+2586> |
# Selecting only variables not containing Qs
adhd_hw4 <- adhd_hw4_notReduced %>% select(c(Age, Sex, Race, ADHD.Total, MD.TOTAL, Alcohol:Psych.meds.))
### 175 * 20VR
## Using Reduced onesadhd_hw4_notReduced.new = apply(adhd_hw4_notReduced, 2, function(x) as.numeric(as.character(x))) %>%
data.frame()## Warning in FUN(newX[, i], ...): NAs introduced by coercion
##NA introduced by coersion
adhd_hw4_notReduced.long<- stack(adhd_hw4_notReduced.new)
head(adhd_hw4_notReduced.long,3)# https://rpubs.com/ghh2001/699273
ggplot(data = adhd_hw4_notReduced.long,
aes(x=as.factor(values),
fill=as.factor(values) )) +
geom_bar(stat='count', width=1) +
facet_wrap(~ind) nearZeroVar(adhd_hw4_notReduced, names=TRUE, saveMetrics = T)library(visdat)
visdat::vis_miss(adhd_hw4)mice_plot <- aggr(adhd_hw4_notReduced, col=c('#F8766D','#00BFC4'), numbers=TRUE, sortVars=TRUE, labels=names(adhd_hw4), cex.axis=.7, gap=3, ylab=c("Missing data","Pattern"))##
## Variables sorted by number of missings:
## Variable Count
## Hx.of.Violence 0.67428571
## Education 0.13142857
## Court.order 0.12571429
## Opioids 0.08000000
## Sedative.hypnotics 0.07428571
## Cocaine 0.06285714
## Stimulants 0.06285714
## THC 0.05142857
## Alcohol 0.02857143
## Psych.meds. 0.02285714
## Age 0.02285714
## Sex 0.02285714
## Race 0.02285714
## ADHD.Total 0.02285714
## MD.TOTAL 0.02285714
## Age 0.00000000
## Sex 0.00000000
## Race 0.00000000
## ADHD.Total 0.00000000
## MD.TOTAL 0.00000000
## Alcohol 0.00000000
## THC 0.00000000
## Cocaine 0.00000000
## Stimulants 0.00000000
## Sedative.hypnotics 0.00000000
## Opioids 0.00000000
## Court.order 0.00000000
## Education 0.00000000
## Hx.of.Violence 0.00000000
## Disorderly.Conduct 0.00000000
## Suicide 0.00000000
## Abuse 0.00000000
## Non.subst.Dx 0.00000000
## Subst.Dx 0.00000000
## Psych.meds. 0.00000000
## Age 0.00000000
## Sex 0.00000000
## Race 0.00000000
## ADHD.Total 0.00000000
## MD.TOTAL 0.00000000
## Alcohol 0.00000000
## THC 0.00000000
## Cocaine 0.00000000
## Stimulants 0.00000000
## Sedative.hypnotics 0.00000000
## Opioids 0.00000000
## Court.order 0.00000000
## Education 0.00000000
## Hx.of.Violence 0.00000000
## Disorderly.Conduct 0.00000000
## Suicide 0.00000000
## Abuse 0.00000000
## Non.subst.Dx 0.00000000
## Subst.Dx 0.00000000
We can see that the psychiatric medicine usage has the most missing data. Actually more than 70% of the people report missing on this variable. For future analysis we will exclude it.
Although alcohol usage have also a relative high number of missing comma we decided to keep this variable due to its clinical importance all the question near. We will impute the missing once using the mice package. Substance usage and non substance usage also have 15% of Messi, and we decide not to include this variable in future analysis. Abuse variable have some missing, and we will impute it. Surprisingly, there is almost no missing data in individual answers.
Surprisingly, there is almost no missing data in individual answers.
missing.total.long<-
adhd_hw4 %>%
mutate (total=n()) %>%
group_by(Suicide) %>%
mutate (missingcnt.total=n() )%>%
select (Suicide,missingcnt.total) %>%
unique() %>%
arrange(-missingcnt.total)
head(missing.total.long, 8)cntplot<- ## no print after the assignment
ggplot(data = missing.total.long,
aes(x=reorder(Suicide, missingcnt.total),
y =missingcnt.total,
fill =Suicide)) +
geom_bar(stat='identity') +
theme ( axis.title.x = element_text(size=10),
axis.text.x = element_text(size=8, angle=45, hjust=1, vjust=1)
) +
ggtitle ('Sum of Missing Numbers, by Suicide')
missing.proport<-
adhd_hw4 %>% mutate (total=n()) %>%
group_by(Suicide) %>%
mutate (missing.cnt=n(), Proportion=missing.cnt/total) %>%
select (Suicide, Proportion) %>%
unique() %>%
arrange(-Proportion)propplot<-
ggplot(data = missing.proport,
aes(x=reorder(Suicide, Proportion),
y =Proportion,
fill = Suicide)) +
geom_bar(stat='identity') +
theme ( axis.title.x = element_text(size=10),
axis.text.x = element_text(size=8, angle=45, hjust=1, vjust=1) ) +
ggtitle ('Proportion of Missing, by Suicide')library(gridExtra)
par(mfrow=c(1,2))
grid.arrange(cntplot, propplot, top = textGrob("Histogram of Suicide "))Missing numbers by Suicide status yes/no follows similar pattern.
Because the nature of the categorical variable, when it’s needed for computation, it had been transformed into dummy coding variables. Given the already large amount of numbers of variables, this categorical variable transformation introduces lot more dimensions. So, dimension reduction is such an important nature in categorical data analysis. We decide 1st to create a data set that exclude the dividual answers to the questionnaires. There was no missing data anyway in individual answers, which makes the total scores of both questionnaires very valid. So we will use the total scores, and get rid of, or not considering the individual answers, unless it is really necessary. This is the first type of dimension reduction, using empirical knowledge. That has made the number of variables from 54 to 20.
# Selecting only variables not containing Qs
dim(adhd_hw4)## [1] 175 20
### 175 * 20VR
## Using Reduced ones## Using Reduced ones
adhd_hw4.new = apply(adhd_hw4, 2, function(x) as.numeric(as.character(x))) %>%
data.frame()
##NA introduced by coersion
adhd_hw4.long<- stack(adhd_hw4.new)
head(adhd_hw4.long,3)# https://rpubs.com/ghh2001/699273
ggplot(data = adhd_hw4.long,
aes(x=as.factor(values),
fill=as.factor(values) )) +
geom_bar(stat='count', width=1) +
facet_wrap(~ind) Because the nature of the categorical variable, when it’s needed for computation, it had been transformed into dummy coding variables. Given the already large amount of numbers of variables, this categorical variable transformation introduces lot more dimensions. So, dimension reduction is such an important nature in categorical data analysis. We decide 1st to create a data set that exclude the dividual answers to the questionnaires. There was no missing data anyway in individual answers, which makes the total scores of both questionnaires very valid. So we will use the total scores, and get rid of, or not considering the individual answers, unless it is really necessary.
This is the first type of dimension reduction, using empirical knowledge. That has made the number of variables from 54 to 20.
Next we exclude the variables of psychiatric medicine usage, which has 70% missing. And another two variables describing the substance usage and non substance usage. They both have large amount of missing, and they do not contribute heavily into the questionnaire in terms of clinical meaning. That further reduces the dimension, now the data has 17 varaiables.
The character Variables need to be numeric, PCA and clustering computations are based on distance.
adhd_hw4 <- mutate_all(adhd_hw4, function(x) as.numeric(as.character(x)))table(adhd_hw4$Education, useNA='ifany')##
## 6 7 8 9 10 11 12 13 14 15 16 17 18 19 <NA>
## 2 2 5 12 12 23 67 15 14 1 7 2 3 1 9
table(adhd_hw4$Abuse, useNA='ifany')##
## 0 1 2 3 4 5 6 7 <NA>
## 101 8 20 4 6 10 4 8 14
adhd_hw4_wosub_psymeds <- adhd_hw4 %>% select (Age:'Abuse')
impute <- mice (data=adhd_hw4_wosub_psymeds, print=FALSE)
df_imputed_mice <-complete(impute)
adhd_hw4_mice_imputed<-df_imputed_mice
head(adhd_hw4_mice_imputed) ## 17 VRdim(adhd_hw4_mice_imputed) ##175 *17## [1] 175 17
mice_plot <- aggr(adhd_hw4_mice_imputed, col=c('#F8766D','#00BFC4'), numbers=TRUE, sortVars=TRUE, labels=names(adhd_hw4_mice_imputed), cex.axis=.7, gap=3, ylab=c("Missing data","Pattern"))##
## Variables sorted by number of missings:
## Variable Count
## Age 0
## Sex 0
## Race 0
## ADHD.Total 0
## MD.TOTAL 0
## Alcohol 0
## THC 0
## Cocaine 0
## Stimulants 0
## Sedative.hypnotics 0
## Opioids 0
## Court.order 0
## Education 0
## Hx.of.Violence 0
## Disorderly.Conduct 0
## Suicide 0
## Abuse 0
## no missing anymoreFor the rest of the missing variables, we used the mice package to imputed.. We check the missing patterns before imputation and after imputation.
dim(adhd_hw4)## [1] 175 20
# Discard PsychMeds
adhd_hw4_mice_imputed <- adhd_hw4 %>% select(-c(Psych.meds.))dim(adhd_hw4_mice_imputed)## [1] 175 19
## 142*19 now## now, it is 175 * 19 nowskim(adhd_hw4_mice_imputed)| Name | adhd_hw4_mice_imputed |
| Number of rows | 175 |
| Number of columns | 19 |
| _______________________ | |
| Column type frequency: | |
| numeric | 19 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Age | 0 | 1.00 | 39.47 | 11.17 | 18 | 29.5 | 42 | 48.0 | 69 | <U+2586><U+2585><U+2587><U+2585><U+2581> |
| Sex | 0 | 1.00 | 1.43 | 0.50 | 1 | 1.0 | 1 | 2.0 | 2 | <U+2587><U+2581><U+2581><U+2581><U+2586> |
| Race | 0 | 1.00 | 1.64 | 0.69 | 1 | 1.0 | 2 | 2.0 | 6 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| ADHD.Total | 0 | 1.00 | 34.32 | 16.70 | 0 | 21.0 | 33 | 47.5 | 72 | <U+2583><U+2586><U+2587><U+2586><U+2582> |
| MD.TOTAL | 0 | 1.00 | 10.02 | 4.81 | 0 | 6.5 | 11 | 14.0 | 17 | <U+2583><U+2583><U+2586><U+2587><U+2587> |
| Alcohol | 4 | 0.98 | 1.35 | 1.39 | 0 | 0.0 | 1 | 3.0 | 3 | <U+2587><U+2582><U+2581><U+2581><U+2586> |
| THC | 4 | 0.98 | 0.81 | 1.27 | 0 | 0.0 | 0 | 1.5 | 3 | <U+2587><U+2581><U+2581><U+2581><U+2583> |
| Cocaine | 4 | 0.98 | 1.09 | 1.39 | 0 | 0.0 | 0 | 3.0 | 3 | <U+2587><U+2581><U+2581><U+2581><U+2585> |
| Stimulants | 4 | 0.98 | 0.12 | 0.53 | 0 | 0.0 | 0 | 0.0 | 3 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| Sedative.hypnotics | 4 | 0.98 | 0.12 | 0.54 | 0 | 0.0 | 0 | 0.0 | 3 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| Opioids | 4 | 0.98 | 0.39 | 0.99 | 0 | 0.0 | 0 | 0.0 | 3 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| Court.order | 5 | 0.97 | 0.09 | 0.28 | 0 | 0.0 | 0 | 0.0 | 1 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| Education | 9 | 0.95 | 11.90 | 2.17 | 6 | 11.0 | 12 | 13.0 | 19 | <U+2581><U+2585><U+2587><U+2582><U+2581> |
| Hx.of.Violence | 11 | 0.94 | 0.24 | 0.43 | 0 | 0.0 | 0 | 0.0 | 1 | <U+2587><U+2581><U+2581><U+2581><U+2582> |
| Disorderly.Conduct | 11 | 0.94 | 0.73 | 0.45 | 0 | 0.0 | 1 | 1.0 | 1 | <U+2583><U+2581><U+2581><U+2581><U+2587> |
| Suicide | 13 | 0.93 | 0.30 | 0.46 | 0 | 0.0 | 0 | 1.0 | 1 | <U+2587><U+2581><U+2581><U+2581><U+2583> |
| Abuse | 14 | 0.92 | 1.33 | 2.12 | 0 | 0.0 | 0 | 2.0 | 7 | <U+2587><U+2582><U+2581><U+2581><U+2581> |
| Non.subst.Dx | 22 | 0.87 | 0.44 | 0.68 | 0 | 0.0 | 0 | 1.0 | 2 | <U+2587><U+2581><U+2583><U+2581><U+2581> |
| Subst.Dx | 23 | 0.87 | 1.14 | 0.93 | 0 | 0.0 | 1 | 2.0 | 3 | <U+2586><U+2587><U+2581><U+2585><U+2582> |
adhd_hw4_mice_imputed$Suicide <-as.numeric(adhd_hw4_mice_imputed$Suicide)
adhd_hw4_mice_imputed2<-adhd_hw4_mice_imputed
skim(adhd_hw4_mice_imputed2)| Name | adhd_hw4_mice_imputed2 |
| Number of rows | 175 |
| Number of columns | 19 |
| _______________________ | |
| Column type frequency: | |
| numeric | 19 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Age | 0 | 1.00 | 39.47 | 11.17 | 18 | 29.5 | 42 | 48.0 | 69 | <U+2586><U+2585><U+2587><U+2585><U+2581> |
| Sex | 0 | 1.00 | 1.43 | 0.50 | 1 | 1.0 | 1 | 2.0 | 2 | <U+2587><U+2581><U+2581><U+2581><U+2586> |
| Race | 0 | 1.00 | 1.64 | 0.69 | 1 | 1.0 | 2 | 2.0 | 6 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| ADHD.Total | 0 | 1.00 | 34.32 | 16.70 | 0 | 21.0 | 33 | 47.5 | 72 | <U+2583><U+2586><U+2587><U+2586><U+2582> |
| MD.TOTAL | 0 | 1.00 | 10.02 | 4.81 | 0 | 6.5 | 11 | 14.0 | 17 | <U+2583><U+2583><U+2586><U+2587><U+2587> |
| Alcohol | 4 | 0.98 | 1.35 | 1.39 | 0 | 0.0 | 1 | 3.0 | 3 | <U+2587><U+2582><U+2581><U+2581><U+2586> |
| THC | 4 | 0.98 | 0.81 | 1.27 | 0 | 0.0 | 0 | 1.5 | 3 | <U+2587><U+2581><U+2581><U+2581><U+2583> |
| Cocaine | 4 | 0.98 | 1.09 | 1.39 | 0 | 0.0 | 0 | 3.0 | 3 | <U+2587><U+2581><U+2581><U+2581><U+2585> |
| Stimulants | 4 | 0.98 | 0.12 | 0.53 | 0 | 0.0 | 0 | 0.0 | 3 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| Sedative.hypnotics | 4 | 0.98 | 0.12 | 0.54 | 0 | 0.0 | 0 | 0.0 | 3 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| Opioids | 4 | 0.98 | 0.39 | 0.99 | 0 | 0.0 | 0 | 0.0 | 3 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| Court.order | 5 | 0.97 | 0.09 | 0.28 | 0 | 0.0 | 0 | 0.0 | 1 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| Education | 9 | 0.95 | 11.90 | 2.17 | 6 | 11.0 | 12 | 13.0 | 19 | <U+2581><U+2585><U+2587><U+2582><U+2581> |
| Hx.of.Violence | 11 | 0.94 | 0.24 | 0.43 | 0 | 0.0 | 0 | 0.0 | 1 | <U+2587><U+2581><U+2581><U+2581><U+2582> |
| Disorderly.Conduct | 11 | 0.94 | 0.73 | 0.45 | 0 | 0.0 | 1 | 1.0 | 1 | <U+2583><U+2581><U+2581><U+2581><U+2587> |
| Suicide | 13 | 0.93 | 0.30 | 0.46 | 0 | 0.0 | 0 | 1.0 | 1 | <U+2587><U+2581><U+2581><U+2581><U+2583> |
| Abuse | 14 | 0.92 | 1.33 | 2.12 | 0 | 0.0 | 0 | 2.0 | 7 | <U+2587><U+2582><U+2581><U+2581><U+2581> |
| Non.subst.Dx | 22 | 0.87 | 0.44 | 0.68 | 0 | 0.0 | 0 | 1.0 | 2 | <U+2587><U+2581><U+2583><U+2581><U+2581> |
| Subst.Dx | 23 | 0.87 | 1.14 | 0.93 | 0 | 0.0 | 1 | 2.0 | 3 | <U+2586><U+2587><U+2581><U+2585><U+2582> |
# https://www.rdocumentation.org/packages/caret/versions/6.0-88/topics/dummyVars
library(caret)
adhd_dummy_coded <- dummyVars(~ 0 + ., drop2nd=TRUE, data = adhd_hw4_mice_imputed)
adhd_dummy_coded <- data.frame(predict(adhd_dummy_coded, newdata = adhd_hw4_mice_imputed))
skim(adhd_dummy_coded)| Name | adhd_dummy_coded |
| Number of rows | 175 |
| Number of columns | 19 |
| _______________________ | |
| Column type frequency: | |
| numeric | 19 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Age | 0 | 1.00 | 39.47 | 11.17 | 18 | 29.5 | 42 | 48.0 | 69 | <U+2586><U+2585><U+2587><U+2585><U+2581> |
| Sex | 0 | 1.00 | 1.43 | 0.50 | 1 | 1.0 | 1 | 2.0 | 2 | <U+2587><U+2581><U+2581><U+2581><U+2586> |
| Race | 0 | 1.00 | 1.64 | 0.69 | 1 | 1.0 | 2 | 2.0 | 6 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| ADHD.Total | 0 | 1.00 | 34.32 | 16.70 | 0 | 21.0 | 33 | 47.5 | 72 | <U+2583><U+2586><U+2587><U+2586><U+2582> |
| MD.TOTAL | 0 | 1.00 | 10.02 | 4.81 | 0 | 6.5 | 11 | 14.0 | 17 | <U+2583><U+2583><U+2586><U+2587><U+2587> |
| Alcohol | 4 | 0.98 | 1.35 | 1.39 | 0 | 0.0 | 1 | 3.0 | 3 | <U+2587><U+2582><U+2581><U+2581><U+2586> |
| THC | 4 | 0.98 | 0.81 | 1.27 | 0 | 0.0 | 0 | 1.5 | 3 | <U+2587><U+2581><U+2581><U+2581><U+2583> |
| Cocaine | 4 | 0.98 | 1.09 | 1.39 | 0 | 0.0 | 0 | 3.0 | 3 | <U+2587><U+2581><U+2581><U+2581><U+2585> |
| Stimulants | 4 | 0.98 | 0.12 | 0.53 | 0 | 0.0 | 0 | 0.0 | 3 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| Sedative.hypnotics | 4 | 0.98 | 0.12 | 0.54 | 0 | 0.0 | 0 | 0.0 | 3 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| Opioids | 4 | 0.98 | 0.39 | 0.99 | 0 | 0.0 | 0 | 0.0 | 3 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| Court.order | 5 | 0.97 | 0.09 | 0.28 | 0 | 0.0 | 0 | 0.0 | 1 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| Education | 9 | 0.95 | 11.90 | 2.17 | 6 | 11.0 | 12 | 13.0 | 19 | <U+2581><U+2585><U+2587><U+2582><U+2581> |
| Hx.of.Violence | 11 | 0.94 | 0.24 | 0.43 | 0 | 0.0 | 0 | 0.0 | 1 | <U+2587><U+2581><U+2581><U+2581><U+2582> |
| Disorderly.Conduct | 11 | 0.94 | 0.73 | 0.45 | 0 | 0.0 | 1 | 1.0 | 1 | <U+2583><U+2581><U+2581><U+2581><U+2587> |
| Suicide | 13 | 0.93 | 0.30 | 0.46 | 0 | 0.0 | 0 | 1.0 | 1 | <U+2587><U+2581><U+2581><U+2581><U+2583> |
| Abuse | 14 | 0.92 | 1.33 | 2.12 | 0 | 0.0 | 0 | 2.0 | 7 | <U+2587><U+2582><U+2581><U+2581><U+2581> |
| Non.subst.Dx | 22 | 0.87 | 0.44 | 0.68 | 0 | 0.0 | 0 | 1.0 | 2 | <U+2587><U+2581><U+2583><U+2581><U+2581> |
| Subst.Dx | 23 | 0.87 | 1.14 | 0.93 | 0 | 0.0 | 1 | 2.0 | 3 | <U+2586><U+2587><U+2581><U+2585><U+2582> |
adhd.complete <- adhd_hw4[complete.cases(adhd_hw4),]
# adhd.complete_gh[complete.cases(adhd.complete_gh),]
adhd_dummy_coded.pca1 <- prcomp(adhd.complete, center = TRUE,scale. = TRUE)
summary(adhd_dummy_coded.pca1)## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0240 1.5778 1.33626 1.30608 1.20322 1.19376 1.03029
## Proportion of Variance 0.2048 0.1245 0.08928 0.08529 0.07239 0.07125 0.05307
## Cumulative Proportion 0.2048 0.3293 0.41857 0.50386 0.57625 0.64750 0.70058
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.9890 0.93238 0.84251 0.80468 0.77489 0.65478 0.62473
## Proportion of Variance 0.0489 0.04347 0.03549 0.03238 0.03002 0.02144 0.01951
## Cumulative Proportion 0.7495 0.79294 0.82844 0.86081 0.89083 0.91227 0.93178
## PC15 PC16 PC17 PC18 PC19 PC20
## Standard deviation 0.58799 0.54370 0.53019 0.43023 0.40329 0.30679
## Proportion of Variance 0.01729 0.01478 0.01405 0.00925 0.00813 0.00471
## Cumulative Proportion 0.94907 0.96385 0.97791 0.98716 0.99529 1.00000
# Error in colMeans(x, na.rm = TRUE) : 'x' must be numerichead(adhd_hw4_mice_imputed)Data must be scaled
adhd.scaled1 <- scale(adhd_dummy_coded)
# colMeans(x, na.rm = TRUE) : 'x' must be numericd <- dist(adhd.scaled1, method = "euclidean") # Dissimilarity matrix created
hc1 <- hclust(d, method = "complete" ) ###Complete Linkagesub_grp <- cutree(hc1, k = 4)
table(sub_grp)## sub_grp
## 1 2 3 4
## 155 12 6 2
adhd_hw4_mice_imputed %>%
mutate(cluster = sub_grp) %>%
headadhd_hw4_mice_imputed %>%
mutate(cluster = sub_grp) %>%
group_by(cluster) %>%
summarise(count = n(), meanADHD = mean(ADHD.Total),meanMD = mean(MD.TOTAL) )We used a hierarchial clustering to investigate the distance between all data points. The dendrogram R printed accordingly. There seems to be one overall hierarchial cluster, followed by three to four sub clusters, followed by about 10 clusters as the third tier. We decided to print out the top 6 clusters, or groups. Shown in above figures.
First, we run the PCA on the reduced data (not including the original individual answers to each and very question). The overall score of HDHD and MD are used stead.
adhd.complete <- adhd_hw4 [complete.cases(adhd_hw4), ]data.pca1 <- adhd.complete
res.pca1 <- prcomp(data.pca1, scale = TRUE)# Results for Variables
res.var1 <- get_pca_var(res.pca1)
res.var1$contrib[,1:3] # Contributions to the PCs## Dim.1 Dim.2 Dim.3
## Age 0.111134487 3.484781e+00 3.77084594
## Sex 0.001040411 5.275915e+00 6.88816108
## Race 0.639321986 9.658102e+00 6.96493208
## ADHD.Total 0.102028232 2.373347e+00 22.25624320
## MD.TOTAL 6.070475727 5.552536e-02 17.85120157
## Alcohol 15.522614627 4.098862e-02 0.01468759
## THC 11.792220550 1.628449e-02 0.08419903
## Cocaine 10.220341287 1.727764e+00 2.72747982
## Stimulants 5.493777206 6.627963e+00 0.04006871
## Sedative.hypnotics 4.736037366 2.591997e+00 0.82003962
## Opioids 4.832296447 3.137224e+00 5.88603541
## Court.order 3.495365897 2.686443e+00 0.53252992
## Education 3.126611054 3.617499e-01 11.49123106
## Hx.of.Violence 7.351004194 3.478081e+00 3.33351518
## Disorderly.Conduct 9.451155291 3.220990e+00 1.46093269
## Suicide 1.795386800 2.494956e+00 10.69187457
## Abuse 2.315006217 2.376644e+01 0.89114044
## Non.subst.Dx 1.462631273 1.430674e+01 2.35269638
## Subst.Dx 11.005247554 5.016871e-04 0.20774193
## Psych.meds. 0.476303395 1.469421e+01 1.73444377
fviz_eig(res.pca1)fviz_pca(res.pca1,
col.ind = "cos2",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE
)fviz_pca_var(res.pca1,
col.var = "contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE
)For the principal component analysis, we used the original data, rather than imputed data. The input data will introduce unwanted bias. However, we have to exclude all the missing there data, because principal component analysis do not allow missing ones. The first three dimensions earth the PCs are printed. The relative contributions to each dimensions come on in terms of individual variables, are also printed out.
We use the support vector models to predict the likelihood of suicide or not. As we said before, imputation will allow us to preserve as much as possible data points. So for this analysis, we used the imputed data for suicide prediction.
We did the usual training database and testing database set split. Due to the nature of classification, we used C classification as type. We scaled the data as preprocess. We chose three kernels to compare the confusion matrix comma accuracy as the performance measurement.
adhd_hw4_mice_imputed <- adhd_hw4_mice_imputed
adhd_hw4_mice_imputed$Suicide <- as.factor(adhd_hw4_mice_imputed$Suicide)library(tidymodels)
# train test split
rows = nrow(adhd_hw4_mice_imputed)
t.index <- sample(1:rows, size = round(0.75*rows), replace=FALSE)
df.train <- adhd_hw4_mice_imputed[t.index ,]
df.test <- adhd_hw4_mice_imputed[-t.index ,]
svm_model <- svm(formula = Suicide ~ .,
data=df.train,
type = 'C-classification',
kernel = 'linear',
preProcess = 'scale'
)
svm_model##
## Call:
## svm(formula = Suicide ~ ., data = df.train, type = "C-classification",
## kernel = "linear", preProcess = "scale")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 53
# Predicting the Test set results
df.test.nona<- df.test [ complete.cases (df.test),]
unique(df.test$Suicide) ## thre is NA in df.test, will not produce table otherwise## [1] 1 0 <NA>
## Levels: 0 1
# ## remove NA from df.test$Suicide
y_pred <- predict(svm_model, newdata = df.test.nona)
cm <- table(df.test.nona$Suicide, y_pred)
cm## y_pred
## 0 1
## 0 20 4
## 1 13 2
svm_model <- svm(formula = Suicide ~ .,
data=df.train,
type = 'C-classification',
kernel = 'radial') ## the best kernel among three
svm_model##
## Call:
## svm(formula = Suicide ~ ., data = df.train, type = "C-classification",
## kernel = "radial")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 77
# Predicting the Test set results
df.test.nona<- df.test [ complete.cases (df.test),]
y_pred <- predict(svm_model, newdata = df.test.nona)
# cm <- table(df.test.nona$Suicide, y_pred)
# cm
cm_accuracy_etc <-confusionMatrix(y_pred,df.test.nona$Suicide)
cm_accuracy_etc ## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 21 13
## 1 3 2
##
## Accuracy : 0.5897
## 95% CI : (0.421, 0.7443)
## No Information Rate : 0.6154
## P-Value [Acc > NIR] : 0.69237
##
## Kappa : 0.0095
##
## Mcnemar's Test P-Value : 0.02445
##
## Sensitivity : 0.8750
## Specificity : 0.1333
## Pos Pred Value : 0.6176
## Neg Pred Value : 0.4000
## Prevalence : 0.6154
## Detection Rate : 0.5385
## Detection Prevalence : 0.8718
## Balanced Accuracy : 0.5042
##
## 'Positive' Class : 0
##
svm_model$terms## Suicide ~ Age + Sex + Race + ADHD.Total + MD.TOTAL + Alcohol +
## THC + Cocaine + Stimulants + Sedative.hypnotics + Opioids +
## Court.order + Education + Hx.of.Violence + Disorderly.Conduct +
## Abuse + Non.subst.Dx + Subst.Dx
## attr(,"variables")
## list(Suicide, Age, Sex, Race, ADHD.Total, MD.TOTAL, Alcohol,
## THC, Cocaine, Stimulants, Sedative.hypnotics, Opioids, Court.order,
## Education, Hx.of.Violence, Disorderly.Conduct, Abuse, Non.subst.Dx,
## Subst.Dx)
## attr(,"factors")
## Age Sex Race ADHD.Total MD.TOTAL Alcohol THC Cocaine
## Suicide 0 0 0 0 0 0 0 0
## Age 1 0 0 0 0 0 0 0
## Sex 0 1 0 0 0 0 0 0
## Race 0 0 1 0 0 0 0 0
## ADHD.Total 0 0 0 1 0 0 0 0
## MD.TOTAL 0 0 0 0 1 0 0 0
## Alcohol 0 0 0 0 0 1 0 0
## THC 0 0 0 0 0 0 1 0
## Cocaine 0 0 0 0 0 0 0 1
## Stimulants 0 0 0 0 0 0 0 0
## Sedative.hypnotics 0 0 0 0 0 0 0 0
## Opioids 0 0 0 0 0 0 0 0
## Court.order 0 0 0 0 0 0 0 0
## Education 0 0 0 0 0 0 0 0
## Hx.of.Violence 0 0 0 0 0 0 0 0
## Disorderly.Conduct 0 0 0 0 0 0 0 0
## Abuse 0 0 0 0 0 0 0 0
## Non.subst.Dx 0 0 0 0 0 0 0 0
## Subst.Dx 0 0 0 0 0 0 0 0
## Stimulants Sedative.hypnotics Opioids Court.order Education
## Suicide 0 0 0 0 0
## Age 0 0 0 0 0
## Sex 0 0 0 0 0
## Race 0 0 0 0 0
## ADHD.Total 0 0 0 0 0
## MD.TOTAL 0 0 0 0 0
## Alcohol 0 0 0 0 0
## THC 0 0 0 0 0
## Cocaine 0 0 0 0 0
## Stimulants 1 0 0 0 0
## Sedative.hypnotics 0 1 0 0 0
## Opioids 0 0 1 0 0
## Court.order 0 0 0 1 0
## Education 0 0 0 0 1
## Hx.of.Violence 0 0 0 0 0
## Disorderly.Conduct 0 0 0 0 0
## Abuse 0 0 0 0 0
## Non.subst.Dx 0 0 0 0 0
## Subst.Dx 0 0 0 0 0
## Hx.of.Violence Disorderly.Conduct Abuse Non.subst.Dx
## Suicide 0 0 0 0
## Age 0 0 0 0
## Sex 0 0 0 0
## Race 0 0 0 0
## ADHD.Total 0 0 0 0
## MD.TOTAL 0 0 0 0
## Alcohol 0 0 0 0
## THC 0 0 0 0
## Cocaine 0 0 0 0
## Stimulants 0 0 0 0
## Sedative.hypnotics 0 0 0 0
## Opioids 0 0 0 0
## Court.order 0 0 0 0
## Education 0 0 0 0
## Hx.of.Violence 1 0 0 0
## Disorderly.Conduct 0 1 0 0
## Abuse 0 0 1 0
## Non.subst.Dx 0 0 0 1
## Subst.Dx 0 0 0 0
## Subst.Dx
## Suicide 0
## Age 0
## Sex 0
## Race 0
## ADHD.Total 0
## MD.TOTAL 0
## Alcohol 0
## THC 0
## Cocaine 0
## Stimulants 0
## Sedative.hypnotics 0
## Opioids 0
## Court.order 0
## Education 0
## Hx.of.Violence 0
## Disorderly.Conduct 0
## Abuse 0
## Non.subst.Dx 0
## Subst.Dx 1
## attr(,"term.labels")
## [1] "Age" "Sex" "Race"
## [4] "ADHD.Total" "MD.TOTAL" "Alcohol"
## [7] "THC" "Cocaine" "Stimulants"
## [10] "Sedative.hypnotics" "Opioids" "Court.order"
## [13] "Education" "Hx.of.Violence" "Disorderly.Conduct"
## [16] "Abuse" "Non.subst.Dx" "Subst.Dx"
## attr(,"order")
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## attr(,"intercept")
## [1] 0
## attr(,"response")
## [1] 1
## attr(,".Environment")
## <environment: R_GlobalEnv>
## attr(,"predvars")
## list(Suicide, Age, Sex, Race, ADHD.Total, MD.TOTAL, Alcohol,
## THC, Cocaine, Stimulants, Sedative.hypnotics, Opioids, Court.order,
## Education, Hx.of.Violence, Disorderly.Conduct, Abuse, Non.subst.Dx,
## Subst.Dx)
## attr(,"dataClasses")
## Suicide Age Sex Race
## "factor" "numeric" "numeric" "numeric"
## ADHD.Total MD.TOTAL Alcohol THC
## "numeric" "numeric" "numeric" "numeric"
## Cocaine Stimulants Sedative.hypnotics Opioids
## "numeric" "numeric" "numeric" "numeric"
## Court.order Education Hx.of.Violence Disorderly.Conduct
## "numeric" "numeric" "numeric" "numeric"
## Abuse Non.subst.Dx Subst.Dx
## "numeric" "numeric" "numeric"
## 17 of themsvm_model$coefs [1:20]## [1] 0.78186181 0.02204032 0.55414289 0.05077455 0.62021087 0.98565897
## [7] 1.00000000 1.00000000 0.00525076 1.00000000 0.35712416 0.16946686
## [13] 1.00000000 0.88823312 1.00000000 0.11952794 0.23172956 1.00000000
## [19] 0.95453154 0.42138042
cm_accuracy_etc <-confusionMatrix(y_pred,df.test.nona$Suicide)
cm_accuracy_etc ## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 21 13
## 1 3 2
##
## Accuracy : 0.5897
## 95% CI : (0.421, 0.7443)
## No Information Rate : 0.6154
## P-Value [Acc > NIR] : 0.69237
##
## Kappa : 0.0095
##
## Mcnemar's Test P-Value : 0.02445
##
## Sensitivity : 0.8750
## Specificity : 0.1333
## Pos Pred Value : 0.6176
## Neg Pred Value : 0.4000
## Prevalence : 0.6154
## Detection Rate : 0.5385
## Detection Prevalence : 0.8718
## Balanced Accuracy : 0.5042
##
## 'Positive' Class : 0
##
svm_model <- svm(formula = Suicide ~ .,
data=df.train,
type = 'C-classification',
kernel = 'sigmoid')
svm_model##
## Call:
## svm(formula = Suicide ~ ., data = df.train, type = "C-classification",
## kernel = "sigmoid")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: sigmoid
## cost: 1
## coef.0: 0
##
## Number of Support Vectors: 59
# Predicting the Test set results
df.test.nona<- df.test [ complete.cases (df.test),]
y_pred <- predict(svm_model, newdata = df.test.nona)
# cm <- table(df.test.nona$Suicide, y_pred)
cm_accuracy_etc <-confusionMatrix(y_pred,df.test.nona$Suicide)
cm_accuracy_etc ## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 21 14
## 1 3 1
##
## Accuracy : 0.5641
## 95% CI : (0.3962, 0.7219)
## No Information Rate : 0.6154
## P-Value [Acc > NIR] : 0.79575
##
## Kappa : -0.0676
##
## Mcnemar's Test P-Value : 0.01529
##
## Sensitivity : 0.87500
## Specificity : 0.06667
## Pos Pred Value : 0.60000
## Neg Pred Value : 0.25000
## Prevalence : 0.61538
## Detection Rate : 0.53846
## Detection Prevalence : 0.89744
## Balanced Accuracy : 0.47083
##
## 'Positive' Class : 0
##
The co-efficient of the variables are also printed out. Among the three kernels of choice, the radio kernel give us the best accuracy level, at 78%. The rest of the two kernels only produce around the 70% of accuracy.
After deciding the best choice of Colonel, we use the radio kernal for our model tuning. Based on the PCA results, and the Co-efficiency of the produced results, we chose the following variables to use as the final model. They are: Abuse, Education Level , History of Alcohol Usage, Total Scores on ADHD questionnaire, Total Scores on MD Questionnaire.
We tuned the model bye a few cost, and a range of gamma from 0.5 to five. The sampling method is tenfold cross validation, using the default.
library(e1071)
df.train.nona<-df.train [complete.cases(df.train),]
svm.tune <- tune(svm, Suicide~Abuse+Education+Alcohol+ADHD.Total+MD.TOTAL,
data = df.train.nona,
ranges=list(cost=c(1,5,10),
kernel=c("radial"),
gamma=c(0.5,1,2,3,4)))
summary(svm.tune)##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost kernel gamma
## 1 radial 0.5
##
## - best performance: 0.2936364
##
## - Detailed performance results:
## cost kernel gamma error dispersion
## 1 1 radial 0.5 0.2936364 0.1367369
## 2 5 radial 0.5 0.3118182 0.1187670
## 3 10 radial 0.5 0.3300000 0.1215186
## 4 1 radial 1.0 0.3127273 0.1564764
## 5 5 radial 1.0 0.3400000 0.1228824
## 6 10 radial 1.0 0.3881818 0.1677951
## 7 1 radial 2.0 0.3127273 0.1504935
## 8 5 radial 2.0 0.3600000 0.1772766
## 9 10 radial 2.0 0.3600000 0.1772766
## 10 1 radial 3.0 0.3127273 0.1504935
## 11 5 radial 3.0 0.3218182 0.1589798
## 12 10 radial 3.0 0.3218182 0.1589798
## 13 1 radial 4.0 0.3036364 0.1616149
## 14 5 radial 4.0 0.3309091 0.1437527
## 15 10 radial 4.0 0.3309091 0.1437527
bestmod_svm = svm.tune$best.model
print(bestmod_svm)##
## Call:
## best.tune(method = svm, train.x = Suicide ~ Abuse + Education + Alcohol +
## ADHD.Total + MD.TOTAL, data = df.train.nona, ranges = list(cost = c(1,
## 5, 10), kernel = c("radial"), gamma = c(0.5, 1, 2, 3, 4)))
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 82
svmPre2<-predict(bestmod_svm,df.test.nona)
svm2<-confusionMatrix(svmPre2,df.test.nona$Suicide)
svm2## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 23 12
## 1 1 3
##
## Accuracy : 0.6667
## 95% CI : (0.4978, 0.8091)
## No Information Rate : 0.6154
## P-Value [Acc > NIR] : 0.314515
##
## Kappa : 0.1836
##
## Mcnemar's Test P-Value : 0.005546
##
## Sensitivity : 0.9583
## Specificity : 0.2000
## Pos Pred Value : 0.6571
## Neg Pred Value : 0.7500
## Prevalence : 0.6154
## Detection Rate : 0.5897
## Detection Prevalence : 0.8974
## Balanced Accuracy : 0.5792
##
## 'Positive' Class : 0
##
print(svm.tune$best.model$coefs [1:16])## [1] 0.3180345 1.0000000 0.7879170 0.8617978 1.0000000 0.1686009 0.1133866
## [8] 0.8151068 0.2660979 0.1478782 0.1707925 0.4044375 0.1419815 0.3246681
## [15] 1.0000000 0.8261057
After tuning, surprisingly the accuracy for the best models drop 20.55, it did not increase, rather decreased. This surprising outcome tells us that machine learning model sometimes can produce results based on chance, rather than intrinsic clinical meaning. There can be many other explanations, but we do acknowledge that this real world sample data represents the real world the case, which is not like the toy database of textbook. The limitations of this study is discussed in conclusion below.
For our data set that contains 175 observations, we have tried using the machine learning model , support vector machine learning model, to predict the outcome of suicide or not.
Among the three support vector machine kernels, the radial Kernal is the best. It reaches an accuracy of 78 %. But it is far below the target accuracy of 80%.
Abuse, Education Level + History of Alcohol Usage +, Total Scores on ADHD questionnaire, Total Scores on MD Questionnaire, turns out to be the most important variables in all the support vector tuning algorithms.
One of the limitations for this real world data analysis is that for a small set of sample sizes, it has around 50 dimensions, which significantly reduces the capabilities of predicting the outcome. Another limitation is suicide is a relatively rare event, which needs enormous amount of sample size for machine learning training purposes.
It is a very worthy dive into categorical data, because in the real world most of the data are not numerical. Rather 80% of the real world data are categorical. Dummy coding produces large amount of variables, which is a challenge. The clustering and dimension reduction remains to be uh interesting topic of Research for future health related,especially subjective data such as survey.
REFERENCE:
https://www.rdocumentation.org/packages/e1071/versions/1.7-7/topics/svm https://www.rdocumentation.org/packages/caret/versions/6.0-88/topics/dummyVars https://counsellingresource.com/quizzes/adhd-testing/adhd-asrs/
https://www.apaservices.org/practice/reimbursement/health-registry/self-reporting-sympton-scale.pdf
https://www.healthyplace.com/depression/bipolar/mood-disorders-questionnaire#:~:text=The%20Mood%20Disorder%20Questionnaire%20%28MDQ%29%20was%20developed%20by,-%20it%20is%20not%20considered%20a%20diagnostic%20tool.