For this assignment, we will be working with a very interesting mental health dataset from a real-life research project. All identifying information, of course, has been removed. The attached spreadsheet has the data (the tab name “Data”). The data dictionary is given in the second tab. You can get as creative as you want. The assignment is designed to really get you to think about how you could use different methods.
Please use a clustering method to find clusters of patients here. Whether you choose to use k-means clustering or hierarchical clustering is up to you as long as you reason through your work. Can you come up with creative names for the profiles you found? (
Let’s explore using Principal Component Analysis on this dataset. You will note that there are different types of questions in the dataset: column: E-W: ADHD self-report; column X – AM: mood disorders questionnaire, column AN-AS: Individual Substance Misuse; etc. Please reason through your work as you decide on which sets of variables you want to use to conduct Principal Component Analysis.
Assume you are modeling whether a patient attempted suicide (column AX). Please use support vector machine to model this. You might want to consider reducing the number of variables or somehow use extracted information from the variables. This can be a really fun modeling task!
HW4 for Machine Learning 622 This is Version6
library(skimr)
library(tidymodels)
## -- Attaching packages -------------------------------------- tidymodels 0.1.3 --
## v broom 0.7.6 v recipes 0.1.16
## v dials 0.0.9 v rsample 0.1.0
## v dplyr 1.0.6 v tibble 3.1.1
## v ggplot2 3.3.3 v tidyr 1.1.3
## v infer 0.5.4 v tune 0.1.5
## v modeldata 0.1.0 v workflows 0.2.2
## v parsnip 0.1.5 v workflowsets 0.0.2
## v purrr 0.3.4 v yardstick 0.0.8
## -- Conflicts ----------------------------------------- tidymodels_conflicts() --
## x purrr::discard() masks scales::discard()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x recipes::step() masks stats::step()
## * Use tidymodels_prefer() to resolve common conflicts.
library(visdat) ## missing visual
library(dplyr)
library(tidyr)
library(ggplot2)
library(tidyverse)
library(caret) # For featureplot, classification report
library(corrplot) # For correlation matrix
library(AppliedPredictiveModeling)
library(mice) # For data imputation
library(VIM) # For missing data visualization
library(gridExtra) # For grid plots
library(pROC) # For AUC calculations
library(ROCR) # For ROC and AUC plots
library(dendextend)
library(factoextra)
library(inspectdf)
library(corrplot)
library(stats)
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
library(kernlab) # SVM methodology
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:purrr':
##
## cross
## The following object is masked from 'package:ggplot2':
##
## alpha
## The following object is masked from 'package:scales':
##
## alpha
library(e1071) # SVM methodology
##
## Attaching package: 'e1071'
## The following object is masked from 'package:tune':
##
## tune
## The following object is masked from 'package:rsample':
##
## permutations
library(data.table)
library(CGPfunctions)
## Registered S3 methods overwritten by 'lme4':
## method from
## cooks.distance.influence.merMod car
## influence.merMod car
## dfbeta.influence.merMod car
## dfbetas.influence.merMod car
ADHD is a neurological disorder that can present itself in adolescence and adulthood. Some individuals can outgrow ADHD but around 30% will continue to have ADHD throughout their adulthood.
The Adult ADHD Self-Report Scale Symptom Checklist is a self-reported questionnaire used to assist in the diagnosis of adult ADHD.
The Mood Disorder Questionnaire is a self-report questionnaire designed to help detect bipolar disorder. It focuses on symptoms of hypomania and mania, which are the mood states that separate bipolar disorders from other types of depression and mood disorder. It has 5 main questions, and the first question has 13 parts, for a total of 17 questions.
Details of the items an answers for the questionnaires are included in link below
https://counsellingresource.com/quizzes/adhd-testing/adhd-asrs/
https://www.apaservices.org/practice/reimbursement/health-registry/self-reporting-sympton-scale.pdf
https://www.healthyplace.com/depression/bipolar/mood-disorders-questionnaire#:~:text=The%20Mood%20Disorder%20Questionnaire%20%28MDQ%29%20was%20developed%20by,-%20it%20is%20not%20considered%20a%20diagnostic%20tool.
<- read.csv ('adhd_data.csv', header = TRUE) #175*54 adhd_hw4_notReduced
head(adhd_hw4_notReduced,5)
tail(adhd_hw4_notReduced,6)
==''] <- NA
adhd_hw4_notReduced[adhd_hw4_notReduced
nrow(is.na(adhd_hw4_notReduced))
## [1] 175
head(adhd_hw4_notReduced,3)
skim(adhd_hw4_notReduced)
Name | adhd_hw4_notReduced |
Number of rows | 175 |
Number of columns | 54 |
_______________________ | |
Column type frequency: | |
factor | 1 |
numeric | 53 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
Initial | 0 | 1 | FALSE | 109 | DB: 5, CM: 4, DJ: 4, JM: 4 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Age | 0 | 1.00 | 39.47 | 11.17 | 18 | 29.5 | 42 | 48.0 | 69 | <U+2586><U+2585><U+2587><U+2585><U+2581> |
Sex | 0 | 1.00 | 1.43 | 0.50 | 1 | 1.0 | 1 | 2.0 | 2 | <U+2587><U+2581><U+2581><U+2581><U+2586> |
Race | 0 | 1.00 | 1.64 | 0.69 | 1 | 1.0 | 2 | 2.0 | 6 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
ADHD.Q1 | 0 | 1.00 | 1.70 | 1.29 | 0 | 1.0 | 2 | 3.0 | 4 | <U+2587><U+2587><U+2587><U+2586><U+2583> |
ADHD.Q2 | 0 | 1.00 | 1.91 | 1.25 | 0 | 1.0 | 2 | 3.0 | 4 | <U+2585><U+2587><U+2587><U+2586><U+2585> |
ADHD.Q3 | 0 | 1.00 | 1.91 | 1.27 | 0 | 1.0 | 2 | 3.0 | 4 | <U+2585><U+2587><U+2587><U+2586><U+2585> |
ADHD.Q4 | 0 | 1.00 | 2.10 | 1.34 | 0 | 1.0 | 2 | 3.0 | 4 | <U+2585><U+2585><U+2587><U+2585><U+2586> |
ADHD.Q5 | 0 | 1.00 | 2.26 | 1.44 | 0 | 1.0 | 3 | 3.0 | 5 | <U+2587><U+2585><U+2587><U+2586><U+2581> |
ADHD.Q6 | 0 | 1.00 | 1.91 | 1.31 | 0 | 1.0 | 2 | 3.0 | 4 | <U+2586><U+2585><U+2587><U+2587><U+2583> |
ADHD.Q7 | 0 | 1.00 | 1.83 | 1.19 | 0 | 1.0 | 2 | 3.0 | 4 | <U+2583><U+2587><U+2587><U+2583><U+2583> |
ADHD.Q8 | 0 | 1.00 | 2.14 | 1.29 | 0 | 1.0 | 2 | 3.0 | 4 | <U+2583><U+2587><U+2587><U+2587><U+2586> |
ADHD.Q9 | 0 | 1.00 | 1.91 | 1.32 | 0 | 1.0 | 2 | 3.0 | 4 | <U+2586><U+2587><U+2587><U+2587><U+2585> |
ADHD.Q10 | 0 | 1.00 | 2.12 | 1.23 | 0 | 1.0 | 2 | 3.0 | 4 | <U+2582><U+2587><U+2587><U+2586><U+2585> |
ADHD.Q11 | 0 | 1.00 | 2.27 | 1.24 | 0 | 1.0 | 2 | 3.0 | 4 | <U+2582><U+2586><U+2587><U+2587><U+2586> |
ADHD.Q12 | 0 | 1.00 | 1.29 | 1.21 | 0 | 0.0 | 1 | 2.0 | 4 | <U+2587><U+2587><U+2586><U+2582><U+2582> |
ADHD.Q13 | 0 | 1.00 | 2.37 | 1.23 | 0 | 1.5 | 2 | 3.0 | 4 | <U+2582><U+2585><U+2587><U+2587><U+2586> |
ADHD.Q14 | 0 | 1.00 | 2.25 | 1.35 | 0 | 1.0 | 2 | 3.0 | 4 | <U+2585><U+2585><U+2587><U+2587><U+2586> |
ADHD.Q15 | 0 | 1.00 | 1.63 | 1.39 | 0 | 0.0 | 1 | 3.0 | 4 | <U+2587><U+2586><U+2586><U+2585><U+2583> |
ADHD.Q16 | 0 | 1.00 | 1.70 | 1.38 | 0 | 1.0 | 1 | 3.0 | 4 | <U+2586><U+2587><U+2586><U+2583><U+2585> |
ADHD.Q17 | 0 | 1.00 | 1.53 | 1.29 | 0 | 0.0 | 1 | 2.0 | 4 | <U+2587><U+2587><U+2587><U+2583><U+2583> |
ADHD.Q18 | 0 | 1.00 | 1.47 | 1.30 | 0 | 0.0 | 1 | 2.0 | 4 | <U+2587><U+2587><U+2586><U+2583><U+2583> |
ADHD.Total | 0 | 1.00 | 34.32 | 16.70 | 0 | 21.0 | 33 | 47.5 | 72 | <U+2583><U+2586><U+2587><U+2586><U+2582> |
MD.Q1a | 0 | 1.00 | 0.55 | 0.50 | 0 | 0.0 | 1 | 1.0 | 1 | <U+2586><U+2581><U+2581><U+2581><U+2587> |
MD.Q1b | 0 | 1.00 | 0.57 | 0.50 | 0 | 0.0 | 1 | 1.0 | 1 | <U+2586><U+2581><U+2581><U+2581><U+2587> |
MD.Q1c | 0 | 1.00 | 0.54 | 0.50 | 0 | 0.0 | 1 | 1.0 | 1 | <U+2587><U+2581><U+2581><U+2581><U+2587> |
MD.Q1d | 0 | 1.00 | 0.58 | 0.49 | 0 | 0.0 | 1 | 1.0 | 1 | <U+2586><U+2581><U+2581><U+2581><U+2587> |
MD.Q1e | 0 | 1.00 | 0.55 | 0.50 | 0 | 0.0 | 1 | 1.0 | 1 | <U+2586><U+2581><U+2581><U+2581><U+2587> |
MD.Q1f | 0 | 1.00 | 0.70 | 0.46 | 0 | 0.0 | 1 | 1.0 | 1 | <U+2583><U+2581><U+2581><U+2581><U+2587> |
MD.Q1g | 0 | 1.00 | 0.72 | 0.45 | 0 | 0.0 | 1 | 1.0 | 1 | <U+2583><U+2581><U+2581><U+2581><U+2587> |
MD.Q1h | 0 | 1.00 | 0.56 | 0.50 | 0 | 0.0 | 1 | 1.0 | 1 | <U+2586><U+2581><U+2581><U+2581><U+2587> |
MD.Q1i | 0 | 1.00 | 0.59 | 0.49 | 0 | 0.0 | 1 | 1.0 | 1 | <U+2586><U+2581><U+2581><U+2581><U+2587> |
MD.Q1j | 0 | 1.00 | 0.39 | 0.49 | 0 | 0.0 | 0 | 1.0 | 1 | <U+2587><U+2581><U+2581><U+2581><U+2585> |
MD.Q1k | 0 | 1.00 | 0.49 | 0.50 | 0 | 0.0 | 0 | 1.0 | 1 | <U+2587><U+2581><U+2581><U+2581><U+2587> |
MD.Q1L | 0 | 1.00 | 0.58 | 0.49 | 0 | 0.0 | 1 | 1.0 | 1 | <U+2586><U+2581><U+2581><U+2581><U+2587> |
MD.Q1m | 0 | 1.00 | 0.49 | 0.50 | 0 | 0.0 | 0 | 1.0 | 1 | <U+2587><U+2581><U+2581><U+2581><U+2587> |
MD.Q2 | 0 | 1.00 | 0.72 | 0.45 | 0 | 0.0 | 1 | 1.0 | 1 | <U+2583><U+2581><U+2581><U+2581><U+2587> |
MD.Q3 | 0 | 1.00 | 2.01 | 1.07 | 0 | 1.0 | 2 | 3.0 | 3 | <U+2582><U+2582><U+2581><U+2585><U+2587> |
MD.TOTAL | 0 | 1.00 | 10.02 | 4.81 | 0 | 6.5 | 11 | 14.0 | 17 | <U+2583><U+2583><U+2586><U+2587><U+2587> |
Alcohol | 4 | 0.98 | 1.35 | 1.39 | 0 | 0.0 | 1 | 3.0 | 3 | <U+2587><U+2582><U+2581><U+2581><U+2586> |
THC | 4 | 0.98 | 0.81 | 1.27 | 0 | 0.0 | 0 | 1.5 | 3 | <U+2587><U+2581><U+2581><U+2581><U+2583> |
Cocaine | 4 | 0.98 | 1.09 | 1.39 | 0 | 0.0 | 0 | 3.0 | 3 | <U+2587><U+2581><U+2581><U+2581><U+2585> |
Stimulants | 4 | 0.98 | 0.12 | 0.53 | 0 | 0.0 | 0 | 0.0 | 3 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
Sedative.hypnotics | 4 | 0.98 | 0.12 | 0.54 | 0 | 0.0 | 0 | 0.0 | 3 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
Opioids | 4 | 0.98 | 0.39 | 0.99 | 0 | 0.0 | 0 | 0.0 | 3 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
Court.order | 5 | 0.97 | 0.09 | 0.28 | 0 | 0.0 | 0 | 0.0 | 1 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
Education | 9 | 0.95 | 11.90 | 2.17 | 6 | 11.0 | 12 | 13.0 | 19 | <U+2581><U+2585><U+2587><U+2582><U+2581> |
Hx.of.Violence | 11 | 0.94 | 0.24 | 0.43 | 0 | 0.0 | 0 | 0.0 | 1 | <U+2587><U+2581><U+2581><U+2581><U+2582> |
Disorderly.Conduct | 11 | 0.94 | 0.73 | 0.45 | 0 | 0.0 | 1 | 1.0 | 1 | <U+2583><U+2581><U+2581><U+2581><U+2587> |
Suicide | 13 | 0.93 | 0.30 | 0.46 | 0 | 0.0 | 0 | 1.0 | 1 | <U+2587><U+2581><U+2581><U+2581><U+2583> |
Abuse | 14 | 0.92 | 1.33 | 2.12 | 0 | 0.0 | 0 | 2.0 | 7 | <U+2587><U+2582><U+2581><U+2581><U+2581> |
Non.subst.Dx | 22 | 0.87 | 0.44 | 0.68 | 0 | 0.0 | 0 | 1.0 | 2 | <U+2587><U+2581><U+2583><U+2581><U+2581> |
Subst.Dx | 23 | 0.87 | 1.14 | 0.93 | 0 | 0.0 | 1 | 2.0 | 3 | <U+2586><U+2587><U+2581><U+2585><U+2582> |
Psych.meds. | 118 | 0.33 | 0.96 | 0.80 | 0 | 0.0 | 1 | 2.0 | 2 | <U+2587><U+2581><U+2587><U+2581><U+2586> |
# Selecting only variables not containing Qs
<- adhd_hw4_notReduced %>% select(c(Age, Sex, Race, ADHD.Total, MD.TOTAL, Alcohol:Psych.meds.))
adhd_hw4 ### 175 * 20VR
## Using Reduced ones
= apply(adhd_hw4_notReduced, 2, function(x) as.numeric(as.character(x))) %>%
adhd_hw4_notReduced.new data.frame()
## Warning in FUN(newX[, i], ...): NAs introduced by coercion
##NA introduced by coersion
<- stack(adhd_hw4_notReduced.new)
adhd_hw4_notReduced.long
head(adhd_hw4_notReduced.long,3)
# https://rpubs.com/ghh2001/699273
ggplot(data = adhd_hw4_notReduced.long,
aes(x=as.factor(values),
fill=as.factor(values) )) +
geom_bar(stat='count', width=1) +
facet_wrap(~ind)
nearZeroVar(adhd_hw4_notReduced, names=TRUE, saveMetrics = T)
library(visdat)
::vis_miss(adhd_hw4) visdat
<- aggr(adhd_hw4_notReduced, col=c('#F8766D','#00BFC4'), numbers=TRUE, sortVars=TRUE, labels=names(adhd_hw4), cex.axis=.7, gap=3, ylab=c("Missing data","Pattern")) mice_plot
##
## Variables sorted by number of missings:
## Variable Count
## Hx.of.Violence 0.67428571
## Education 0.13142857
## Court.order 0.12571429
## Opioids 0.08000000
## Sedative.hypnotics 0.07428571
## Cocaine 0.06285714
## Stimulants 0.06285714
## THC 0.05142857
## Alcohol 0.02857143
## Psych.meds. 0.02285714
## Age 0.02285714
## Sex 0.02285714
## Race 0.02285714
## ADHD.Total 0.02285714
## MD.TOTAL 0.02285714
## Age 0.00000000
## Sex 0.00000000
## Race 0.00000000
## ADHD.Total 0.00000000
## MD.TOTAL 0.00000000
## Alcohol 0.00000000
## THC 0.00000000
## Cocaine 0.00000000
## Stimulants 0.00000000
## Sedative.hypnotics 0.00000000
## Opioids 0.00000000
## Court.order 0.00000000
## Education 0.00000000
## Hx.of.Violence 0.00000000
## Disorderly.Conduct 0.00000000
## Suicide 0.00000000
## Abuse 0.00000000
## Non.subst.Dx 0.00000000
## Subst.Dx 0.00000000
## Psych.meds. 0.00000000
## Age 0.00000000
## Sex 0.00000000
## Race 0.00000000
## ADHD.Total 0.00000000
## MD.TOTAL 0.00000000
## Alcohol 0.00000000
## THC 0.00000000
## Cocaine 0.00000000
## Stimulants 0.00000000
## Sedative.hypnotics 0.00000000
## Opioids 0.00000000
## Court.order 0.00000000
## Education 0.00000000
## Hx.of.Violence 0.00000000
## Disorderly.Conduct 0.00000000
## Suicide 0.00000000
## Abuse 0.00000000
## Non.subst.Dx 0.00000000
## Subst.Dx 0.00000000
We can see that the psychiatric medicine usage has the most missing data. Actually more than 70% of the people report missing on this variable. For future analysis we will exclude it.
Although alcohol usage have also a relative high number of missing comma we decided to keep this variable due to its clinical importance all the question near. We will impute the missing once using the mice package. Substance usage and non substance usage also have 15% of Messi, and we decide not to include this variable in future analysis. Abuse variable have some missing, and we will impute it. Surprisingly, there is almost no missing data in individual answers.
Surprisingly, there is almost no missing data in individual answers.
<-
missing.total.long%>%
adhd_hw4 mutate (total=n()) %>%
group_by(Suicide) %>%
mutate (missingcnt.total=n() )%>%
select (Suicide,missingcnt.total) %>%
unique() %>%
arrange(-missingcnt.total)
head(missing.total.long, 8)
<- ## no print after the assignment
cntplotggplot(data = missing.total.long,
aes(x=reorder(Suicide, missingcnt.total),
y =missingcnt.total,
fill =Suicide)) +
geom_bar(stat='identity') +
theme ( axis.title.x = element_text(size=10),
axis.text.x = element_text(size=8, angle=45, hjust=1, vjust=1)
+
) ggtitle ('Sum of Missing Numbers, by Suicide')
<-
missing.proport%>% mutate (total=n()) %>%
adhd_hw4 group_by(Suicide) %>%
mutate (missing.cnt=n(), Proportion=missing.cnt/total) %>%
select (Suicide, Proportion) %>%
unique() %>%
arrange(-Proportion)
<-
propplotggplot(data = missing.proport,
aes(x=reorder(Suicide, Proportion),
y =Proportion,
fill = Suicide)) +
geom_bar(stat='identity') +
theme ( axis.title.x = element_text(size=10),
axis.text.x = element_text(size=8, angle=45, hjust=1, vjust=1) ) +
ggtitle ('Proportion of Missing, by Suicide')
library(gridExtra)
par(mfrow=c(1,2))
grid.arrange(cntplot, propplot, top = textGrob("Histogram of Suicide "))
Missing numbers by Suicide status yes/no follows similar pattern.
Because the nature of the categorical variable, when it’s needed for computation, it had been transformed into dummy coding variables. Given the already large amount of numbers of variables, this categorical variable transformation introduces lot more dimensions. So, dimension reduction is such an important nature in categorical data analysis. We decide 1st to create a data set that exclude the dividual answers to the questionnaires. There was no missing data anyway in individual answers, which makes the total scores of both questionnaires very valid. So we will use the total scores, and get rid of, or not considering the individual answers, unless it is really necessary. This is the first type of dimension reduction, using empirical knowledge. That has made the number of variables from 54 to 20.
# Selecting only variables not containing Qs
dim(adhd_hw4)
## [1] 175 20
### 175 * 20VR
## Using Reduced ones
## Using Reduced ones
= apply(adhd_hw4, 2, function(x) as.numeric(as.character(x))) %>%
adhd_hw4.new data.frame()
##NA introduced by coersion
<- stack(adhd_hw4.new)
adhd_hw4.long
head(adhd_hw4.long,3)
# https://rpubs.com/ghh2001/699273
ggplot(data = adhd_hw4.long,
aes(x=as.factor(values),
fill=as.factor(values) )) +
geom_bar(stat='count', width=1) +
facet_wrap(~ind)
Because the nature of the categorical variable, when it’s needed for computation, it had been transformed into dummy coding variables. Given the already large amount of numbers of variables, this categorical variable transformation introduces lot more dimensions. So, dimension reduction is such an important nature in categorical data analysis. We decide 1st to create a data set that exclude the dividual answers to the questionnaires. There was no missing data anyway in individual answers, which makes the total scores of both questionnaires very valid. So we will use the total scores, and get rid of, or not considering the individual answers, unless it is really necessary.
This is the first type of dimension reduction, using empirical knowledge. That has made the number of variables from 54 to 20.
Next we exclude the variables of psychiatric medicine usage, which has 70% missing. And another two variables describing the substance usage and non substance usage. They both have large amount of missing, and they do not contribute heavily into the questionnaire in terms of clinical meaning. That further reduces the dimension, now the data has 17 varaiables.
The character Variables need to be numeric, PCA and clustering computations are based on distance.
<- mutate_all(adhd_hw4, function(x) as.numeric(as.character(x))) adhd_hw4
table(adhd_hw4$Education, useNA='ifany')
##
## 6 7 8 9 10 11 12 13 14 15 16 17 18 19 <NA>
## 2 2 5 12 12 23 67 15 14 1 7 2 3 1 9
table(adhd_hw4$Abuse, useNA='ifany')
##
## 0 1 2 3 4 5 6 7 <NA>
## 101 8 20 4 6 10 4 8 14
<- adhd_hw4 %>% select (Age:'Abuse')
adhd_hw4_wosub_psymeds
<- mice (data=adhd_hw4_wosub_psymeds, print=FALSE)
impute
<-complete(impute)
df_imputed_mice
<-df_imputed_mice
adhd_hw4_mice_imputed
head(adhd_hw4_mice_imputed) ## 17 VR
dim(adhd_hw4_mice_imputed) ##175 *17
## [1] 175 17
<- aggr(adhd_hw4_mice_imputed, col=c('#F8766D','#00BFC4'), numbers=TRUE, sortVars=TRUE, labels=names(adhd_hw4_mice_imputed), cex.axis=.7, gap=3, ylab=c("Missing data","Pattern")) mice_plot
##
## Variables sorted by number of missings:
## Variable Count
## Age 0
## Sex 0
## Race 0
## ADHD.Total 0
## MD.TOTAL 0
## Alcohol 0
## THC 0
## Cocaine 0
## Stimulants 0
## Sedative.hypnotics 0
## Opioids 0
## Court.order 0
## Education 0
## Hx.of.Violence 0
## Disorderly.Conduct 0
## Suicide 0
## Abuse 0
## no missing anymore
For the rest of the missing variables, we used the mice package to imputed.. We check the missing patterns before imputation and after imputation.
dim(adhd_hw4)
## [1] 175 20
# Discard PsychMeds
<- adhd_hw4 %>% select(-c(Psych.meds.)) adhd_hw4_mice_imputed
dim(adhd_hw4_mice_imputed)
## [1] 175 19
## 142*19 now## now, it is 175 * 19 now
skim(adhd_hw4_mice_imputed)
Name | adhd_hw4_mice_imputed |
Number of rows | 175 |
Number of columns | 19 |
_______________________ | |
Column type frequency: | |
numeric | 19 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Age | 0 | 1.00 | 39.47 | 11.17 | 18 | 29.5 | 42 | 48.0 | 69 | <U+2586><U+2585><U+2587><U+2585><U+2581> |
Sex | 0 | 1.00 | 1.43 | 0.50 | 1 | 1.0 | 1 | 2.0 | 2 | <U+2587><U+2581><U+2581><U+2581><U+2586> |
Race | 0 | 1.00 | 1.64 | 0.69 | 1 | 1.0 | 2 | 2.0 | 6 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
ADHD.Total | 0 | 1.00 | 34.32 | 16.70 | 0 | 21.0 | 33 | 47.5 | 72 | <U+2583><U+2586><U+2587><U+2586><U+2582> |
MD.TOTAL | 0 | 1.00 | 10.02 | 4.81 | 0 | 6.5 | 11 | 14.0 | 17 | <U+2583><U+2583><U+2586><U+2587><U+2587> |
Alcohol | 4 | 0.98 | 1.35 | 1.39 | 0 | 0.0 | 1 | 3.0 | 3 | <U+2587><U+2582><U+2581><U+2581><U+2586> |
THC | 4 | 0.98 | 0.81 | 1.27 | 0 | 0.0 | 0 | 1.5 | 3 | <U+2587><U+2581><U+2581><U+2581><U+2583> |
Cocaine | 4 | 0.98 | 1.09 | 1.39 | 0 | 0.0 | 0 | 3.0 | 3 | <U+2587><U+2581><U+2581><U+2581><U+2585> |
Stimulants | 4 | 0.98 | 0.12 | 0.53 | 0 | 0.0 | 0 | 0.0 | 3 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
Sedative.hypnotics | 4 | 0.98 | 0.12 | 0.54 | 0 | 0.0 | 0 | 0.0 | 3 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
Opioids | 4 | 0.98 | 0.39 | 0.99 | 0 | 0.0 | 0 | 0.0 | 3 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
Court.order | 5 | 0.97 | 0.09 | 0.28 | 0 | 0.0 | 0 | 0.0 | 1 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
Education | 9 | 0.95 | 11.90 | 2.17 | 6 | 11.0 | 12 | 13.0 | 19 | <U+2581><U+2585><U+2587><U+2582><U+2581> |
Hx.of.Violence | 11 | 0.94 | 0.24 | 0.43 | 0 | 0.0 | 0 | 0.0 | 1 | <U+2587><U+2581><U+2581><U+2581><U+2582> |
Disorderly.Conduct | 11 | 0.94 | 0.73 | 0.45 | 0 | 0.0 | 1 | 1.0 | 1 | <U+2583><U+2581><U+2581><U+2581><U+2587> |
Suicide | 13 | 0.93 | 0.30 | 0.46 | 0 | 0.0 | 0 | 1.0 | 1 | <U+2587><U+2581><U+2581><U+2581><U+2583> |
Abuse | 14 | 0.92 | 1.33 | 2.12 | 0 | 0.0 | 0 | 2.0 | 7 | <U+2587><U+2582><U+2581><U+2581><U+2581> |
Non.subst.Dx | 22 | 0.87 | 0.44 | 0.68 | 0 | 0.0 | 0 | 1.0 | 2 | <U+2587><U+2581><U+2583><U+2581><U+2581> |
Subst.Dx | 23 | 0.87 | 1.14 | 0.93 | 0 | 0.0 | 1 | 2.0 | 3 | <U+2586><U+2587><U+2581><U+2585><U+2582> |
$Suicide <-as.numeric(adhd_hw4_mice_imputed$Suicide)
adhd_hw4_mice_imputed<-adhd_hw4_mice_imputed
adhd_hw4_mice_imputed2skim(adhd_hw4_mice_imputed2)
Name | adhd_hw4_mice_imputed2 |
Number of rows | 175 |
Number of columns | 19 |
_______________________ | |
Column type frequency: | |
numeric | 19 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Age | 0 | 1.00 | 39.47 | 11.17 | 18 | 29.5 | 42 | 48.0 | 69 | <U+2586><U+2585><U+2587><U+2585><U+2581> |
Sex | 0 | 1.00 | 1.43 | 0.50 | 1 | 1.0 | 1 | 2.0 | 2 | <U+2587><U+2581><U+2581><U+2581><U+2586> |
Race | 0 | 1.00 | 1.64 | 0.69 | 1 | 1.0 | 2 | 2.0 | 6 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
ADHD.Total | 0 | 1.00 | 34.32 | 16.70 | 0 | 21.0 | 33 | 47.5 | 72 | <U+2583><U+2586><U+2587><U+2586><U+2582> |
MD.TOTAL | 0 | 1.00 | 10.02 | 4.81 | 0 | 6.5 | 11 | 14.0 | 17 | <U+2583><U+2583><U+2586><U+2587><U+2587> |
Alcohol | 4 | 0.98 | 1.35 | 1.39 | 0 | 0.0 | 1 | 3.0 | 3 | <U+2587><U+2582><U+2581><U+2581><U+2586> |
THC | 4 | 0.98 | 0.81 | 1.27 | 0 | 0.0 | 0 | 1.5 | 3 | <U+2587><U+2581><U+2581><U+2581><U+2583> |
Cocaine | 4 | 0.98 | 1.09 | 1.39 | 0 | 0.0 | 0 | 3.0 | 3 | <U+2587><U+2581><U+2581><U+2581><U+2585> |
Stimulants | 4 | 0.98 | 0.12 | 0.53 | 0 | 0.0 | 0 | 0.0 | 3 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
Sedative.hypnotics | 4 | 0.98 | 0.12 | 0.54 | 0 | 0.0 | 0 | 0.0 | 3 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
Opioids | 4 | 0.98 | 0.39 | 0.99 | 0 | 0.0 | 0 | 0.0 | 3 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
Court.order | 5 | 0.97 | 0.09 | 0.28 | 0 | 0.0 | 0 | 0.0 | 1 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
Education | 9 | 0.95 | 11.90 | 2.17 | 6 | 11.0 | 12 | 13.0 | 19 | <U+2581><U+2585><U+2587><U+2582><U+2581> |
Hx.of.Violence | 11 | 0.94 | 0.24 | 0.43 | 0 | 0.0 | 0 | 0.0 | 1 | <U+2587><U+2581><U+2581><U+2581><U+2582> |
Disorderly.Conduct | 11 | 0.94 | 0.73 | 0.45 | 0 | 0.0 | 1 | 1.0 | 1 | <U+2583><U+2581><U+2581><U+2581><U+2587> |
Suicide | 13 | 0.93 | 0.30 | 0.46 | 0 | 0.0 | 0 | 1.0 | 1 | <U+2587><U+2581><U+2581><U+2581><U+2583> |
Abuse | 14 | 0.92 | 1.33 | 2.12 | 0 | 0.0 | 0 | 2.0 | 7 | <U+2587><U+2582><U+2581><U+2581><U+2581> |
Non.subst.Dx | 22 | 0.87 | 0.44 | 0.68 | 0 | 0.0 | 0 | 1.0 | 2 | <U+2587><U+2581><U+2583><U+2581><U+2581> |
Subst.Dx | 23 | 0.87 | 1.14 | 0.93 | 0 | 0.0 | 1 | 2.0 | 3 | <U+2586><U+2587><U+2581><U+2585><U+2582> |
# https://www.rdocumentation.org/packages/caret/versions/6.0-88/topics/dummyVars
library(caret)
<- dummyVars(~ 0 + ., drop2nd=TRUE, data = adhd_hw4_mice_imputed)
adhd_dummy_coded <- data.frame(predict(adhd_dummy_coded, newdata = adhd_hw4_mice_imputed))
adhd_dummy_coded skim(adhd_dummy_coded)
Name | adhd_dummy_coded |
Number of rows | 175 |
Number of columns | 19 |
_______________________ | |
Column type frequency: | |
numeric | 19 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Age | 0 | 1.00 | 39.47 | 11.17 | 18 | 29.5 | 42 | 48.0 | 69 | <U+2586><U+2585><U+2587><U+2585><U+2581> |
Sex | 0 | 1.00 | 1.43 | 0.50 | 1 | 1.0 | 1 | 2.0 | 2 | <U+2587><U+2581><U+2581><U+2581><U+2586> |
Race | 0 | 1.00 | 1.64 | 0.69 | 1 | 1.0 | 2 | 2.0 | 6 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
ADHD.Total | 0 | 1.00 | 34.32 | 16.70 | 0 | 21.0 | 33 | 47.5 | 72 | <U+2583><U+2586><U+2587><U+2586><U+2582> |
MD.TOTAL | 0 | 1.00 | 10.02 | 4.81 | 0 | 6.5 | 11 | 14.0 | 17 | <U+2583><U+2583><U+2586><U+2587><U+2587> |
Alcohol | 4 | 0.98 | 1.35 | 1.39 | 0 | 0.0 | 1 | 3.0 | 3 | <U+2587><U+2582><U+2581><U+2581><U+2586> |
THC | 4 | 0.98 | 0.81 | 1.27 | 0 | 0.0 | 0 | 1.5 | 3 | <U+2587><U+2581><U+2581><U+2581><U+2583> |
Cocaine | 4 | 0.98 | 1.09 | 1.39 | 0 | 0.0 | 0 | 3.0 | 3 | <U+2587><U+2581><U+2581><U+2581><U+2585> |
Stimulants | 4 | 0.98 | 0.12 | 0.53 | 0 | 0.0 | 0 | 0.0 | 3 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
Sedative.hypnotics | 4 | 0.98 | 0.12 | 0.54 | 0 | 0.0 | 0 | 0.0 | 3 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
Opioids | 4 | 0.98 | 0.39 | 0.99 | 0 | 0.0 | 0 | 0.0 | 3 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
Court.order | 5 | 0.97 | 0.09 | 0.28 | 0 | 0.0 | 0 | 0.0 | 1 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
Education | 9 | 0.95 | 11.90 | 2.17 | 6 | 11.0 | 12 | 13.0 | 19 | <U+2581><U+2585><U+2587><U+2582><U+2581> |
Hx.of.Violence | 11 | 0.94 | 0.24 | 0.43 | 0 | 0.0 | 0 | 0.0 | 1 | <U+2587><U+2581><U+2581><U+2581><U+2582> |
Disorderly.Conduct | 11 | 0.94 | 0.73 | 0.45 | 0 | 0.0 | 1 | 1.0 | 1 | <U+2583><U+2581><U+2581><U+2581><U+2587> |
Suicide | 13 | 0.93 | 0.30 | 0.46 | 0 | 0.0 | 0 | 1.0 | 1 | <U+2587><U+2581><U+2581><U+2581><U+2583> |
Abuse | 14 | 0.92 | 1.33 | 2.12 | 0 | 0.0 | 0 | 2.0 | 7 | <U+2587><U+2582><U+2581><U+2581><U+2581> |
Non.subst.Dx | 22 | 0.87 | 0.44 | 0.68 | 0 | 0.0 | 0 | 1.0 | 2 | <U+2587><U+2581><U+2583><U+2581><U+2581> |
Subst.Dx | 23 | 0.87 | 1.14 | 0.93 | 0 | 0.0 | 1 | 2.0 | 3 | <U+2586><U+2587><U+2581><U+2585><U+2582> |
<- adhd_hw4[complete.cases(adhd_hw4),]
adhd.complete # adhd.complete_gh[complete.cases(adhd.complete_gh),]
<- prcomp(adhd.complete, center = TRUE,scale. = TRUE)
adhd_dummy_coded.pca1
summary(adhd_dummy_coded.pca1)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0240 1.5778 1.33626 1.30608 1.20322 1.19376 1.03029
## Proportion of Variance 0.2048 0.1245 0.08928 0.08529 0.07239 0.07125 0.05307
## Cumulative Proportion 0.2048 0.3293 0.41857 0.50386 0.57625 0.64750 0.70058
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.9890 0.93238 0.84251 0.80468 0.77489 0.65478 0.62473
## Proportion of Variance 0.0489 0.04347 0.03549 0.03238 0.03002 0.02144 0.01951
## Cumulative Proportion 0.7495 0.79294 0.82844 0.86081 0.89083 0.91227 0.93178
## PC15 PC16 PC17 PC18 PC19 PC20
## Standard deviation 0.58799 0.54370 0.53019 0.43023 0.40329 0.30679
## Proportion of Variance 0.01729 0.01478 0.01405 0.00925 0.00813 0.00471
## Cumulative Proportion 0.94907 0.96385 0.97791 0.98716 0.99529 1.00000
# Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
head(adhd_hw4_mice_imputed)
Data must be scaled
<- scale(adhd_dummy_coded)
adhd.scaled1 # colMeans(x, na.rm = TRUE) : 'x' must be numeric
<- dist(adhd.scaled1, method = "euclidean") # Dissimilarity matrix created
d <- hclust(d, method = "complete" ) ###Complete Linkage hc1
<- cutree(hc1, k = 4)
sub_grp table(sub_grp)
## sub_grp
## 1 2 3 4
## 155 12 6 2
%>%
adhd_hw4_mice_imputed mutate(cluster = sub_grp) %>%
head
%>%
adhd_hw4_mice_imputed mutate(cluster = sub_grp) %>%
group_by(cluster) %>%
summarise(count = n(), meanADHD = mean(ADHD.Total),meanMD = mean(MD.TOTAL) )
We used a hierarchial clustering to investigate the distance between all data points. The dendrogram R printed accordingly. There seems to be one overall hierarchial cluster, followed by three to four sub clusters, followed by about 10 clusters as the third tier. We decided to print out the top 6 clusters, or groups. Shown in above figures.
First, we run the PCA on the reduced data (not including the original individual answers to each and very question). The overall score of HDHD and MD are used stead.
<- adhd_hw4 [complete.cases(adhd_hw4), ] adhd.complete
<- adhd.complete
data.pca1 <- prcomp(data.pca1, scale = TRUE) res.pca1
# Results for Variables
<- get_pca_var(res.pca1)
res.var1
$contrib[,1:3] # Contributions to the PCs res.var1
## Dim.1 Dim.2 Dim.3
## Age 0.111134487 3.484781e+00 3.77084594
## Sex 0.001040411 5.275915e+00 6.88816108
## Race 0.639321986 9.658102e+00 6.96493208
## ADHD.Total 0.102028232 2.373347e+00 22.25624320
## MD.TOTAL 6.070475727 5.552536e-02 17.85120157
## Alcohol 15.522614627 4.098862e-02 0.01468759
## THC 11.792220550 1.628449e-02 0.08419903
## Cocaine 10.220341287 1.727764e+00 2.72747982
## Stimulants 5.493777206 6.627963e+00 0.04006871
## Sedative.hypnotics 4.736037366 2.591997e+00 0.82003962
## Opioids 4.832296447 3.137224e+00 5.88603541
## Court.order 3.495365897 2.686443e+00 0.53252992
## Education 3.126611054 3.617499e-01 11.49123106
## Hx.of.Violence 7.351004194 3.478081e+00 3.33351518
## Disorderly.Conduct 9.451155291 3.220990e+00 1.46093269
## Suicide 1.795386800 2.494956e+00 10.69187457
## Abuse 2.315006217 2.376644e+01 0.89114044
## Non.subst.Dx 1.462631273 1.430674e+01 2.35269638
## Subst.Dx 11.005247554 5.016871e-04 0.20774193
## Psych.meds. 0.476303395 1.469421e+01 1.73444377
fviz_eig(res.pca1)
fviz_pca(res.pca1,
col.ind = "cos2",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE
)
fviz_pca_var(res.pca1,
col.var = "contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE
)
For the principal component analysis, we used the original data, rather than imputed data. The input data will introduce unwanted bias. However, we have to exclude all the missing there data, because principal component analysis do not allow missing ones. The first three dimensions earth the PCs are printed. The relative contributions to each dimensions come on in terms of individual variables, are also printed out.
We use the support vector models to predict the likelihood of suicide or not. As we said before, imputation will allow us to preserve as much as possible data points. So for this analysis, we used the imputed data for suicide prediction.
We did the usual training database and testing database set split. Due to the nature of classification, we used C classification as type. We scaled the data as preprocess. We chose three kernels to compare the confusion matrix comma accuracy as the performance measurement.
<- adhd_hw4_mice_imputed
adhd_hw4_mice_imputed $Suicide <- as.factor(adhd_hw4_mice_imputed$Suicide) adhd_hw4_mice_imputed
library(tidymodels)
# train test split
= nrow(adhd_hw4_mice_imputed)
rows <- sample(1:rows, size = round(0.75*rows), replace=FALSE)
t.index
<- adhd_hw4_mice_imputed[t.index ,]
df.train <- adhd_hw4_mice_imputed[-t.index ,]
df.test
<- svm(formula = Suicide ~ .,
svm_model data=df.train,
type = 'C-classification',
kernel = 'linear',
preProcess = 'scale'
)
svm_model
##
## Call:
## svm(formula = Suicide ~ ., data = df.train, type = "C-classification",
## kernel = "linear", preProcess = "scale")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 53
# Predicting the Test set results
<- df.test [ complete.cases (df.test),]
df.test.nonaunique(df.test$Suicide) ## thre is NA in df.test, will not produce table otherwise
## [1] 1 0 <NA>
## Levels: 0 1
# ## remove NA from df.test$Suicide
<- predict(svm_model, newdata = df.test.nona)
y_pred
<- table(df.test.nona$Suicide, y_pred)
cm cm
## y_pred
## 0 1
## 0 20 4
## 1 13 2
<- svm(formula = Suicide ~ .,
svm_model data=df.train,
type = 'C-classification',
kernel = 'radial') ## the best kernel among three
svm_model
##
## Call:
## svm(formula = Suicide ~ ., data = df.train, type = "C-classification",
## kernel = "radial")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 77
# Predicting the Test set results
<- df.test [ complete.cases (df.test),]
df.test.nona
<- predict(svm_model, newdata = df.test.nona)
y_pred # cm <- table(df.test.nona$Suicide, y_pred)
# cm
<-confusionMatrix(y_pred,df.test.nona$Suicide)
cm_accuracy_etc cm_accuracy_etc
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 21 13
## 1 3 2
##
## Accuracy : 0.5897
## 95% CI : (0.421, 0.7443)
## No Information Rate : 0.6154
## P-Value [Acc > NIR] : 0.69237
##
## Kappa : 0.0095
##
## Mcnemar's Test P-Value : 0.02445
##
## Sensitivity : 0.8750
## Specificity : 0.1333
## Pos Pred Value : 0.6176
## Neg Pred Value : 0.4000
## Prevalence : 0.6154
## Detection Rate : 0.5385
## Detection Prevalence : 0.8718
## Balanced Accuracy : 0.5042
##
## 'Positive' Class : 0
##
$terms svm_model
## Suicide ~ Age + Sex + Race + ADHD.Total + MD.TOTAL + Alcohol +
## THC + Cocaine + Stimulants + Sedative.hypnotics + Opioids +
## Court.order + Education + Hx.of.Violence + Disorderly.Conduct +
## Abuse + Non.subst.Dx + Subst.Dx
## attr(,"variables")
## list(Suicide, Age, Sex, Race, ADHD.Total, MD.TOTAL, Alcohol,
## THC, Cocaine, Stimulants, Sedative.hypnotics, Opioids, Court.order,
## Education, Hx.of.Violence, Disorderly.Conduct, Abuse, Non.subst.Dx,
## Subst.Dx)
## attr(,"factors")
## Age Sex Race ADHD.Total MD.TOTAL Alcohol THC Cocaine
## Suicide 0 0 0 0 0 0 0 0
## Age 1 0 0 0 0 0 0 0
## Sex 0 1 0 0 0 0 0 0
## Race 0 0 1 0 0 0 0 0
## ADHD.Total 0 0 0 1 0 0 0 0
## MD.TOTAL 0 0 0 0 1 0 0 0
## Alcohol 0 0 0 0 0 1 0 0
## THC 0 0 0 0 0 0 1 0
## Cocaine 0 0 0 0 0 0 0 1
## Stimulants 0 0 0 0 0 0 0 0
## Sedative.hypnotics 0 0 0 0 0 0 0 0
## Opioids 0 0 0 0 0 0 0 0
## Court.order 0 0 0 0 0 0 0 0
## Education 0 0 0 0 0 0 0 0
## Hx.of.Violence 0 0 0 0 0 0 0 0
## Disorderly.Conduct 0 0 0 0 0 0 0 0
## Abuse 0 0 0 0 0 0 0 0
## Non.subst.Dx 0 0 0 0 0 0 0 0
## Subst.Dx 0 0 0 0 0 0 0 0
## Stimulants Sedative.hypnotics Opioids Court.order Education
## Suicide 0 0 0 0 0
## Age 0 0 0 0 0
## Sex 0 0 0 0 0
## Race 0 0 0 0 0
## ADHD.Total 0 0 0 0 0
## MD.TOTAL 0 0 0 0 0
## Alcohol 0 0 0 0 0
## THC 0 0 0 0 0
## Cocaine 0 0 0 0 0
## Stimulants 1 0 0 0 0
## Sedative.hypnotics 0 1 0 0 0
## Opioids 0 0 1 0 0
## Court.order 0 0 0 1 0
## Education 0 0 0 0 1
## Hx.of.Violence 0 0 0 0 0
## Disorderly.Conduct 0 0 0 0 0
## Abuse 0 0 0 0 0
## Non.subst.Dx 0 0 0 0 0
## Subst.Dx 0 0 0 0 0
## Hx.of.Violence Disorderly.Conduct Abuse Non.subst.Dx
## Suicide 0 0 0 0
## Age 0 0 0 0
## Sex 0 0 0 0
## Race 0 0 0 0
## ADHD.Total 0 0 0 0
## MD.TOTAL 0 0 0 0
## Alcohol 0 0 0 0
## THC 0 0 0 0
## Cocaine 0 0 0 0
## Stimulants 0 0 0 0
## Sedative.hypnotics 0 0 0 0
## Opioids 0 0 0 0
## Court.order 0 0 0 0
## Education 0 0 0 0
## Hx.of.Violence 1 0 0 0
## Disorderly.Conduct 0 1 0 0
## Abuse 0 0 1 0
## Non.subst.Dx 0 0 0 1
## Subst.Dx 0 0 0 0
## Subst.Dx
## Suicide 0
## Age 0
## Sex 0
## Race 0
## ADHD.Total 0
## MD.TOTAL 0
## Alcohol 0
## THC 0
## Cocaine 0
## Stimulants 0
## Sedative.hypnotics 0
## Opioids 0
## Court.order 0
## Education 0
## Hx.of.Violence 0
## Disorderly.Conduct 0
## Abuse 0
## Non.subst.Dx 0
## Subst.Dx 1
## attr(,"term.labels")
## [1] "Age" "Sex" "Race"
## [4] "ADHD.Total" "MD.TOTAL" "Alcohol"
## [7] "THC" "Cocaine" "Stimulants"
## [10] "Sedative.hypnotics" "Opioids" "Court.order"
## [13] "Education" "Hx.of.Violence" "Disorderly.Conduct"
## [16] "Abuse" "Non.subst.Dx" "Subst.Dx"
## attr(,"order")
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## attr(,"intercept")
## [1] 0
## attr(,"response")
## [1] 1
## attr(,".Environment")
## <environment: R_GlobalEnv>
## attr(,"predvars")
## list(Suicide, Age, Sex, Race, ADHD.Total, MD.TOTAL, Alcohol,
## THC, Cocaine, Stimulants, Sedative.hypnotics, Opioids, Court.order,
## Education, Hx.of.Violence, Disorderly.Conduct, Abuse, Non.subst.Dx,
## Subst.Dx)
## attr(,"dataClasses")
## Suicide Age Sex Race
## "factor" "numeric" "numeric" "numeric"
## ADHD.Total MD.TOTAL Alcohol THC
## "numeric" "numeric" "numeric" "numeric"
## Cocaine Stimulants Sedative.hypnotics Opioids
## "numeric" "numeric" "numeric" "numeric"
## Court.order Education Hx.of.Violence Disorderly.Conduct
## "numeric" "numeric" "numeric" "numeric"
## Abuse Non.subst.Dx Subst.Dx
## "numeric" "numeric" "numeric"
## 17 of them
$coefs [1:20] svm_model
## [1] 0.78186181 0.02204032 0.55414289 0.05077455 0.62021087 0.98565897
## [7] 1.00000000 1.00000000 0.00525076 1.00000000 0.35712416 0.16946686
## [13] 1.00000000 0.88823312 1.00000000 0.11952794 0.23172956 1.00000000
## [19] 0.95453154 0.42138042
<-confusionMatrix(y_pred,df.test.nona$Suicide)
cm_accuracy_etc cm_accuracy_etc
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 21 13
## 1 3 2
##
## Accuracy : 0.5897
## 95% CI : (0.421, 0.7443)
## No Information Rate : 0.6154
## P-Value [Acc > NIR] : 0.69237
##
## Kappa : 0.0095
##
## Mcnemar's Test P-Value : 0.02445
##
## Sensitivity : 0.8750
## Specificity : 0.1333
## Pos Pred Value : 0.6176
## Neg Pred Value : 0.4000
## Prevalence : 0.6154
## Detection Rate : 0.5385
## Detection Prevalence : 0.8718
## Balanced Accuracy : 0.5042
##
## 'Positive' Class : 0
##
<- svm(formula = Suicide ~ .,
svm_model data=df.train,
type = 'C-classification',
kernel = 'sigmoid')
svm_model
##
## Call:
## svm(formula = Suicide ~ ., data = df.train, type = "C-classification",
## kernel = "sigmoid")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: sigmoid
## cost: 1
## coef.0: 0
##
## Number of Support Vectors: 59
# Predicting the Test set results
<- df.test [ complete.cases (df.test),]
df.test.nona
<- predict(svm_model, newdata = df.test.nona)
y_pred
# cm <- table(df.test.nona$Suicide, y_pred)
<-confusionMatrix(y_pred,df.test.nona$Suicide)
cm_accuracy_etc cm_accuracy_etc
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 21 14
## 1 3 1
##
## Accuracy : 0.5641
## 95% CI : (0.3962, 0.7219)
## No Information Rate : 0.6154
## P-Value [Acc > NIR] : 0.79575
##
## Kappa : -0.0676
##
## Mcnemar's Test P-Value : 0.01529
##
## Sensitivity : 0.87500
## Specificity : 0.06667
## Pos Pred Value : 0.60000
## Neg Pred Value : 0.25000
## Prevalence : 0.61538
## Detection Rate : 0.53846
## Detection Prevalence : 0.89744
## Balanced Accuracy : 0.47083
##
## 'Positive' Class : 0
##
The co-efficient of the variables are also printed out. Among the three kernels of choice, the radio kernel give us the best accuracy level, at 78%. The rest of the two kernels only produce around the 70% of accuracy.
After deciding the best choice of Colonel, we use the radio kernal for our model tuning. Based on the PCA results, and the Co-efficiency of the produced results, we chose the following variables to use as the final model. They are: Abuse, Education Level , History of Alcohol Usage, Total Scores on ADHD questionnaire, Total Scores on MD Questionnaire.
We tuned the model bye a few cost, and a range of gamma from 0.5 to five. The sampling method is tenfold cross validation, using the default.
library(e1071)
<-df.train [complete.cases(df.train),]
df.train.nona
<- tune(svm, Suicide~Abuse+Education+Alcohol+ADHD.Total+MD.TOTAL,
svm.tune data = df.train.nona,
ranges=list(cost=c(1,5,10),
kernel=c("radial"),
gamma=c(0.5,1,2,3,4)))
summary(svm.tune)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost kernel gamma
## 1 radial 0.5
##
## - best performance: 0.2936364
##
## - Detailed performance results:
## cost kernel gamma error dispersion
## 1 1 radial 0.5 0.2936364 0.1367369
## 2 5 radial 0.5 0.3118182 0.1187670
## 3 10 radial 0.5 0.3300000 0.1215186
## 4 1 radial 1.0 0.3127273 0.1564764
## 5 5 radial 1.0 0.3400000 0.1228824
## 6 10 radial 1.0 0.3881818 0.1677951
## 7 1 radial 2.0 0.3127273 0.1504935
## 8 5 radial 2.0 0.3600000 0.1772766
## 9 10 radial 2.0 0.3600000 0.1772766
## 10 1 radial 3.0 0.3127273 0.1504935
## 11 5 radial 3.0 0.3218182 0.1589798
## 12 10 radial 3.0 0.3218182 0.1589798
## 13 1 radial 4.0 0.3036364 0.1616149
## 14 5 radial 4.0 0.3309091 0.1437527
## 15 10 radial 4.0 0.3309091 0.1437527
= svm.tune$best.model
bestmod_svm print(bestmod_svm)
##
## Call:
## best.tune(method = svm, train.x = Suicide ~ Abuse + Education + Alcohol +
## ADHD.Total + MD.TOTAL, data = df.train.nona, ranges = list(cost = c(1,
## 5, 10), kernel = c("radial"), gamma = c(0.5, 1, 2, 3, 4)))
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 82
<-predict(bestmod_svm,df.test.nona)
svmPre2<-confusionMatrix(svmPre2,df.test.nona$Suicide)
svm2 svm2
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 23 12
## 1 1 3
##
## Accuracy : 0.6667
## 95% CI : (0.4978, 0.8091)
## No Information Rate : 0.6154
## P-Value [Acc > NIR] : 0.314515
##
## Kappa : 0.1836
##
## Mcnemar's Test P-Value : 0.005546
##
## Sensitivity : 0.9583
## Specificity : 0.2000
## Pos Pred Value : 0.6571
## Neg Pred Value : 0.7500
## Prevalence : 0.6154
## Detection Rate : 0.5897
## Detection Prevalence : 0.8974
## Balanced Accuracy : 0.5792
##
## 'Positive' Class : 0
##
print(svm.tune$best.model$coefs [1:16])
## [1] 0.3180345 1.0000000 0.7879170 0.8617978 1.0000000 0.1686009 0.1133866
## [8] 0.8151068 0.2660979 0.1478782 0.1707925 0.4044375 0.1419815 0.3246681
## [15] 1.0000000 0.8261057
After tuning, surprisingly the accuracy for the best models drop 20.55, it did not increase, rather decreased. This surprising outcome tells us that machine learning model sometimes can produce results based on chance, rather than intrinsic clinical meaning. There can be many other explanations, but we do acknowledge that this real world sample data represents the real world the case, which is not like the toy database of textbook. The limitations of this study is discussed in conclusion below.
For our data set that contains 175 observations, we have tried using the machine learning model , support vector machine learning model, to predict the outcome of suicide or not.
Among the three support vector machine kernels, the radial Kernal is the best. It reaches an accuracy of 78 %. But it is far below the target accuracy of 80%.
Abuse, Education Level + History of Alcohol Usage +, Total Scores on ADHD questionnaire, Total Scores on MD Questionnaire, turns out to be the most important variables in all the support vector tuning algorithms.
One of the limitations for this real world data analysis is that for a small set of sample sizes, it has around 50 dimensions, which significantly reduces the capabilities of predicting the outcome. Another limitation is suicide is a relatively rare event, which needs enormous amount of sample size for machine learning training purposes.
It is a very worthy dive into categorical data, because in the real world most of the data are not numerical. Rather 80% of the real world data are categorical. Dummy coding produces large amount of variables, which is a challenge. The clustering and dimension reduction remains to be uh interesting topic of Research for future health related,especially subjective data such as survey.
REFERENCE:
https://www.rdocumentation.org/packages/e1071/versions/1.7-7/topics/svm https://www.rdocumentation.org/packages/caret/versions/6.0-88/topics/dummyVars https://counsellingresource.com/quizzes/adhd-testing/adhd-asrs/
https://www.apaservices.org/practice/reimbursement/health-registry/self-reporting-sympton-scale.pdf
https://www.healthyplace.com/depression/bipolar/mood-disorders-questionnaire#:~:text=The%20Mood%20Disorder%20Questionnaire%20%28MDQ%29%20was%20developed%20by,-%20it%20is%20not%20considered%20a%20diagnostic%20tool.