Group 1 Members:

  • David Moste
  • Vinayak Kamath
  • Kenan Sooklall
  • Christian Thieme
  • Lidiia Tronina

Introduction

For this project, we will be working with a mental health dataset obtained from a previously completed research project on attention deficit hyperactivity disorder (ADHD). The dataset contains information from 175 individuals who took part in the study and includes 54 columns broken into the following categories:

Excel Column   Variable                       Key
C              Sex                            Male-1, Female-2
D              Race                           White-1, African American-2, Hispanic-3, Asian-4, Native American-5, Other or missing data-6
E-W            ADHD self-report scale         Never-0, Rarely-1, Sometimes-2, Often-3, Very often-4
X-AM           Mood disorder questions        No-0, Yes-1; Question 3: No problem-0, Minor-1, Moderate-2, Serious-3
AN-AS          Individual substances misuse   No use-0, Use-1, Abuse-2, Dependence-3
AT             Court Order                    No-0, Yes-1
AU             Education                      1-12 grade, 13+ college
AV             History of Violence            No-0, Yes-1
AW             Disorderly Conduct             No-0, Yes-1
AX             Suicide attempt                No-0, Yes-1
AY             Abuse History                  No-0, Physical (P)-1, Sexual (S)-2, Emotional (E)-3, P&S-4, P&E-5, S&E-6, P&S&E-7
AZ             Non-substance-related Dx       0-none; 1-one; 2-more than one
BA             Substance-related Dx           0-none; 1-one; 2-two; 3-three or more
BB             Psychiatric Meds               0-none; 1-one psychotropic med; 2-more than one psychotropic med

Our task is first to explore the data and then, using the understanding gained, to build a clustering model that identifies different segments of patients. Additionally, we’ll perform Principal Component Analysis on a subset of the dataset to see what information can be gleaned. Finally, we’ll use Gradient Boosting and Support Vector Machines to predict whether a patient has attempted suicide.

As can be seen from the table above, the dataset has a wide variety of numeric and categorical data. We’ll dive deeper into this data now.


Exploratory Data Analysis

Let’s begin exploring by taking a high level look at our dataset:

glimpse(adhd)
## Rows: 175
## Columns: 54
## $ Initial              <chr> "JA", "LA", "MD", "RD", "RB", "SB", "PK", "RJ"...
## $ Age                  <dbl> 24, 48, 51, 43, 34, 39, 41, 48, 44, 27, 44, 56...
## $ Sex                  <dbl> 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 1, 1, 2, 1, 2, 2...
## $ Race                 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ `ADHD Q1`            <dbl> 1, 3, 2, 3, 4, 2, 2, 2, 3, 2, 2, 2, 1, 2, 1, 4...
## $ `ADHD Q2`            <dbl> 1, 3, 1, 3, 4, 3, 2, 4, 3, 3, 3, 2, 3, 3, 0, 3...
## $ `ADHD Q3`            <dbl> 4, 4, 2, 2, 2, 1, 1, 3, 3, 4, 2, 2, 0, 2, 1, 1...
## $ `ADHD Q4`            <dbl> 2, 4, 1, 2, 4, 4, 3, 4, 4, 4, 3, 3, 2, 4, 2, 4...
## $ `ADHD Q5`            <dbl> 3, 5, 3, 4, 4, 3, 4, 3, 4, 4, 3, 3, 3, 2, 2, 2...
## $ `ADHD Q6`            <dbl> 1, 2, 3, 3, 2, 2, 4, 3, 3, 3, 3, 2, 3, 0, 3, 0...
## $ `ADHD Q7`            <dbl> 1, 2, 3, 2, 3, 3, 2, 1, 3, 4, 3, 1, 2, 4, 0, 2...
## $ `ADHD Q8`            <dbl> 3, 3, 2, 4, 4, 4, 3, 1, 4, 3, 3, 2, 2, 4, 0, 3...
## $ `ADHD Q9`            <dbl> 2, 2, 0, 4, 4, 4, 3, 4, 3, 2, 2, 3, 1, 2, 2, 3...
## $ `ADHD Q10`           <dbl> 4, 4, 1, 2, 2, 2, 4, 2, 4, 4, 3, 2, 1, 4, 2, 3...
## $ `ADHD Q11`           <dbl> 2, 1, 2, 3, 4, 4, 3, 4, 3, 3, 2, 2, 3, 3, 1, 4...
## $ `ADHD Q12`           <dbl> 4, 4, 0, 1, 1, 2, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0...
## $ `ADHD Q13`           <dbl> 1, 2, 2, 3, 3, 4, 4, 4, 4, 3, 2, 3, 3, 1, 2, 3...
## $ `ADHD Q14`           <dbl> 0, 4, 2, 3, 2, 4, 4, 3, 3, 4, 4, 3, 3, 3, 3, 4...
## $ `ADHD Q15`           <dbl> 3, 4, 3, 1, 1, 3, 4, 0, 3, 4, 1, 3, 2, 0, 1, 0...
## $ `ADHD Q16`           <dbl> 1, 3, 2, 2, 2, 4, 4, 0, 2, 4, 2, 1, 1, 0, 1, 1...
## $ `ADHD Q17`           <dbl> 3, 1, 1, 1, 1, 3, 2, 1, 4, 2, 1, 1, 0, 0, 3, 0...
## $ `ADHD Q18`           <dbl> 4, 4, 1, 2, 1, 3, 4, 1, 3, 2, 2, 2, 1, 2, 1, 3...
## $ `ADHD Total`         <dbl> 40, 55, 31, 45, 48, 55, 54, 41, 56, 56, 42, 38...
## $ `MD Q1a`             <dbl> 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0...
## $ `MD Q1b`             <dbl> 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1...
## $ `MD Q1c`             <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0...
## $ `MD Q1d`             <dbl> 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0...
## $ `MD Q1e`             <dbl> 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0...
## $ `MD Q1f`             <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1...
## $ `MD Q1g`             <dbl> 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1...
## $ `MD Q1h`             <dbl> 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0...
## $ `MD Q1i`             <dbl> 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0...
## $ `MD Q1j`             <dbl> 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0...
## $ `MD Q1k`             <dbl> 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0...
## $ `MD Q1L`             <dbl> 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0...
## $ `MD Q1m`             <dbl> 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0...
## $ `MD Q2`              <dbl> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1...
## $ `MD Q3`              <dbl> 3, 3, 2, 3, 2, 3, 3, 3, 3, 2, 3, 0, 2, 2, 2, 2...
## $ `MD TOTAL`           <dbl> 15, 14, 5, 13, 7, 14, 9, 7, 12, 11, 16, 0, 11,...
## $ Alcohol              <dbl> 1, 0, 0, 1, 1, 1, 3, 0, 1, 0, 3, 0, 1, 0, 3, 0...
## $ THC                  <dbl> 1, 0, 0, 1, 1, 0, 3, 0, 0, 3, 1, 0, 1, 0, 1, 0...
## $ Cocaine              <dbl> 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0...
## $ Stimulants           <dbl> 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0...
## $ `Sedative-hypnotics` <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0...
## $ Opioids              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 3, 0, 0, 0, 0, 3...
## $ `Court order`        <dbl> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0...
## $ Education            <dbl> 11, 14, 12, 12, 9, 11, 12, 16, 12, 9, 12, 18, ...
## $ `Hx of Violence`     <dbl> 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0...
## $ `Disorderly Conduct` <dbl> 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0...
## $ Suicide              <dbl> 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0...
## $ Abuse                <dbl> 0, 4, 6, 7, 0, 2, 4, 0, 0, 2, 4, 2, 0, 0, 0, 5...
## $ `Non-subst Dx`       <dbl> 2, 1, 2, 2, 2, 0, 1, 2, 1, 1, 1, 2, 0, 1, 0, 2...
## $ `Subst Dx`           <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 2, 2, 0, 0, 0, 1, 1...
## $ `Psych meds.`        <dbl> 2, 1, 1, 2, 0, 0, 1, 2, 1, 0, 0, 1, 0, 0, 0, 2...

We can see that there is a mix of numeric and categorical data, and that most of the numeric features are actually categorical variables (many of them binary). Before going too far, let’s change the data type of these variables so we can explore our data properly:
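
The conversion can be done in one step; a minimal sketch follows, where the column split is our assumption based on the summaries below (only Initial and Age keep their original types):

library(dplyr)

# treat every column except the identifier and Age as categorical
# (this selection is our assumption, not the original code)
adhd <- adhd %>%
  mutate(across(-c(Initial, Age), as.factor))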

Having made the necessary changes, we can see in the visual above that our dataset is mostly categorical data. However, one column remains of type character and another of type integer.

Numerical Features

      n     min   mean       median   max   sd
Age   175   1     22.36571   25       42    10.94936

Looking at the summary above, the ages of the subjects in this research study range from 1 to 42, with a median of 25. Let’s now take a look at the shape of the distribution and see if there are significant outliers:

In looking at the distribution of Age, it appears that our dataset is multi-modal. This often means that there are sub-populations within the data. We’ll keep this in mind as we continue our exploration. We do not see any significant outliers.

Later, we will be building models to predict which individuals have attempted suicide. As such, we’ll look at the distribution of Age broken down by whether an individual has attempted suicide or not:
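
The histogram itself is drawn along the following lines (a sketch; the bin count is our choice):

library(ggplot2)

# distribution of Age, faceted by whether the individual attempted suicide
ggplot(adhd, aes(x = Age)) +
  geom_histogram(bins = 20) +
  facet_wrap(~ Suicide)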

Based on the histogram above, we do not note any significant differences in the distributions. However, this may be more easily determined by looking at a boxplot.

In the boxplots above, it does appear that, in general, those who attempted suicide were younger than those who did not.

Categorical Features

Now let’s turn our attention to our categorical variables. We have both binary categorical variables, and variables with 3 or more classes. Looking at this data in a table would be unwieldy due to its size, so we’ll visualize this data instead.

In looking at the charts above, we note:

  • Abuse: Nearly half of our population has been subject to some type of abuse
  • ADHD Q1-18: There is a fairly equal dispersion between the responses to each of these questions
  • Alcohol: Alcohol appears to be used by half of our population
  • Cocaine: About a third of the population uses cocaine
  • Court Order: The majority of the population does not have a court order
  • Disorderly Conduct: Over half of the population has a record of disorderly conduct
  • Education: A very small percentage of the population has attended college
  • Hx of Violence: Very few of the subjects have a history of violence
  • MD Q1-Q3: There is a fairly even breakout in the responses to each of these questions
  • Non-subst Dx: We note quite a few missing values in this column. Most individuals do not have a non-substance-related diagnosis
  • Opioids: Most subjects do not use opioids
  • Psych meds.: The majority of this column is missing data
  • Race: The majority of the population is either white or African American
  • Sedative-hypnotics: Hardly anyone in the population is using sedative-hypnotics
  • Sex: The population is a fairly even split between males and females
  • Stimulants: Stimulant usage among the participants is very minimal
  • Subst Dx: A fairly even split between each of these categories. We note quite a few missing values in this column
  • Suicide: It appears that around a quarter of the population has attempted suicide
  • THC: THC is used by only around a third of our population

Again, we will be building models to predict who has attempted suicide, so it will also be helpful to understand the distributions of these variables as they relate to our Suicide variable, both Yes and No. As mentioned above, only about a quarter of the population has attempted suicide, so we expect counts for those who have attempted suicide to be considerably lower than for those who have not. The purpose of the charts below is simply to compare the shapes of the distributions and determine whether anything major sticks out as a red flag.
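
Each comparison chart is built along these lines (a sketch for a single variable; the choice of Abuse is illustrative):

library(ggplot2)

# counts for each Abuse category, split by suicide attempt
ggplot(adhd, aes(x = Abuse, fill = Suicide)) +
  geom_bar(position = "dodge")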

Based on the charts above, we do not note anything we would consider out of the ordinary in comparing these distribution shapes.

Missing Data

As we’ve seen throughout our EDA, there are quite a few missing values within the columns. Let’s take a closer look at this so we can determine the best way to deal with them.

It appears that about 2.6% of our dataset is missing values and as previously mentioned, Psych meds. makes up a large portion of that. Because of this, there would be no good way to impute these values and so we have decided to drop this variable from the dataset.

adhd <- adhd %>% 
  select(-`Psych meds.` )

Further, we see that many of the respondents who have missing data about a suicide attempt also have additional missing data. As our objective is to predict suicide attempts, we will drop the subjects who are missing that variable, since their records would be unusable when it comes to modeling.

adhd <- adhd %>% 
  filter(!is.na(Suicide))

Having adjusted these, let’s now look at our missing data again.

We can see clearly that these adjustments have made a significant impact on the dataset. There is now only 0.4% of our dataset that is missing. We can now use an imputation method to fill in the remainder of the missing values.

We have chosen to use the pmm method (predictive mean matching) from the mice library to impute our missing values. Predictive mean matching calculates a predicted value for our target variable and, for each missing entry, forms a small set of “candidate donors” from the complete cases whose predicted values are closest to the predicted value for the missing entry. A donor is then randomly chosen from the candidates and its observed value is imputed where the value was missing. To apply pmm we assume that the distribution is the same for missing cells as it is for observed data, so the approach becomes less reliable as the percentage of missing values grows.
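
Conceptually, pmm for a single variable y works like the sketch below (fit, obs, and mis are hypothetical stand-ins for a regression fitted on the complete cases and for the complete/incomplete rows):

# predicted values for observed and missing rows (fit, obs, mis are hypothetical)
yhat_obs <- predict(fit, newdata = obs)
yhat_mis <- predict(fit, newdata = mis)

# for each missing entry, take the 5 complete cases with the closest
# predicted values and draw the observed y of one of them at random
imputed <- sapply(yhat_mis, function(p) {
  donors <- order(abs(yhat_obs - p))[1:5]
  sample(obs$y[donors], 1)
})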

In order to apply this method, we’ll need to adjust our column names that have spacing. We can use the clean_names function from the janitor library to do this for us. Once this is completed, we’ll impute our missing values:

adhd <- adhd %>% 
  janitor::clean_names()
# single imputation (m = 1) with predictive mean matching
adhd <- mice(data = adhd, m = 1, method = "pmm", seed = 500)
## 
##  iter imp variable
##   1   1  education*  abuse*  non_subst_dx*  subst_dx*
##   2   1  education*  abuse*  non_subst_dx*  subst_dx*
##   3   1  education*  abuse*  non_subst_dx*  subst_dx*
##   4   1  education*  abuse*  non_subst_dx*  subst_dx*
##   5   1  education*  abuse*  non_subst_dx*  subst_dx*
## Warning: Number of logged events: 45
adhd <- mice::complete(adhd, 1)

We can see now that we have successfully imputed all missing values:

visdat::vis_miss(adhd, sort_miss = TRUE)


Modeling

Unsupervised Learning and Dimensionality Reduction

In the sections below we will explore our dataset further using clustering, to see whether it can surface any “unseen” patterns within the data. Next, we’ll use principal component analysis (PCA) to reduce the dimensionality of our data and to identify which features account for the most variance within our dataset.

K-Means Clustering

K-means clustering creates clusters by finding groups of observations that are similar, as determined by Euclidean distance. The method has advantages (e.g., it is computationally efficient) and disadvantages (e.g., results change from run to run, depend on the starting values, and there is no definitive way to determine the appropriate number of clusters). We ultimately chose k-means over hierarchical clustering because there is no real reason to believe this data has a hierarchical structure to it.

The first step was to pull out only the columns we wanted to use. Since k-means is based on Euclidean distance, we chose to use only variables with continuous values, ruling out all binary and factor-based predictors. Some of this data came in as factors (despite being continuous), so we had to convert those columns to numeric values.

library(tidyverse)

# keep only the continuous variables
adhd_km <- adhd %>%
  select(age, md_total, education, adhd_total)

# factor -> character -> numeric, so the labels (not the
# underlying factor codes) are converted
adhd_km <- adhd_km %>%
  mutate_if(is.factor, as.character) %>%
  mutate_if(is.character, as.numeric)

The next step was to transform the data. Again, the Euclidean-based distance measurement requires that we center and scale the data. We also chose to perform a Box-Cox transformation to help reduce skewness.
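
For reference, the Box-Cox family applied here estimates, for each positive-valued variable, a power λ:

$$
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\
\log y, & \lambda = 0
\end{cases}
$$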

library(caret)
adhd_km_trans <- preProcess(adhd_km,
                            method = c("center", "scale", "BoxCox"))
adhd_km_trans <- predict(adhd_km_trans, adhd_km)

Now that we have our data set up, we can get a sense of how many clusters to create. To do this, we used the silhouette method: once the average silhouette width starts to decrease, you stop gaining information by dividing the data into more clusters. This led us to an optimal number of 2 clusters.
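
For reference, the average silhouette width that fviz_nbclust() plots can also be computed directly with the cluster package (a sketch):

library(cluster)

# average silhouette width for k = 2..10
avg_sil <- sapply(2:10, function(k) {
  km <- kmeans(adhd_km_trans, centers = k, nstart = 50)
  mean(silhouette(km$cluster, dist(adhd_km_trans))[, "sil_width"])
})
which.max(avg_sil) + 1  # the k that maximizes the average silhouette width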

library(factoextra)
fviz_nbclust(adhd_km_trans, kmeans, method = "silhouette", k.max = 20)

Now we can visualize our clusters! To visualize these four-dimensional clusters, we used PCA to reduce the dimensionality of our data.

library(factoextra)
km.res <- kmeans(adhd_km_trans, centers = 2, nstart = 50)
fviz_cluster(km.res, adhd_km_trans)

What incredible clusters! In an effort to get a sense of the differences between these clusters and what made them unique, we went back to the original data and performed a comparative analysis. We computed the mean for each factor within each cluster and plotted these values on a graph to see the differences in the two clusters more clearly.

library(tidyverse)

# attach the cluster assignment to the untransformed data
adhd_km_res <- cbind(adhd_km, km.res$cluster)

# mean of each variable within each cluster, reshaped for plotting
km_comparison <- adhd_km_res %>%
  group_by(`km.res$cluster`) %>%
  rename(cluster = `km.res$cluster`) %>%
  summarise_all(mean) %>%
  pivot_longer(c("age":"adhd_total"),
               names_to = "category",
               values_to = "mean")

ggplot(km_comparison,
       aes(x = as.factor(cluster),
           y = mean,
           fill = as.factor(category))) +
  geom_bar(position = "dodge",
           stat = "identity") +
  scale_fill_brewer(palette = "Set3") +
  labs(x = "Cluster",
       y = "Value",
       fill = "Factor")

By looking at these two clusters, we can see that one of our clusters has a high total ADHD and MD score while also having a lower age and education. This would indicate that our two groups are the:

  • Younger and less educated with high ADHD and MD scores
  • Older and more educated with lower ADHD and MD scores

Out of curiosity, we wanted to add back in a single binary variable, suicide, since we knew we would be investigating it later. We knew k-means would struggle with this, but we were curious to see how much it would impact the outcome.

We went through the same steps as before, but this time included suicide as our first variable.

library(tidyverse)
library(caret)
library(factoextra)
adhd_km <- adhd %>%
  select(suicide, age, md_total, education, adhd_total)
adhd_km <- adhd_km %>%
  mutate_if(is.factor, as.character) %>%
  mutate_if(is.character, as.numeric)
adhd_km_trans <- preProcess(adhd_km,
                            method = c("center", "scale", "BoxCox"))
adhd_km_trans <- predict(adhd_km_trans, adhd_km)
fviz_nbclust(adhd_km_trans, kmeans, method = "silhouette", k.max = 20)

km.res <- kmeans(adhd_km_trans, centers = 12, nstart = 50)
fviz_cluster(km.res, adhd_km_trans)

# attach the new cluster assignment to the untransformed data
adhd_km_res <- cbind(adhd_km, km.res$cluster)
km_comparison <- adhd_km_res %>%
  group_by(`km.res$cluster`) %>%
  rename(cluster = `km.res$cluster`) %>%
  summarise_all(mean) %>%
  pivot_longer(c("suicide":"adhd_total"),
               names_to = "category",
               values_to = "mean")
ggplot(km_comparison,
       aes(x = as.factor(cluster),
           y = mean,
           fill = as.factor(category))) +
  geom_bar(position = "dodge",
           stat = "identity") +
  scale_fill_brewer(palette = "Set3") +
  labs(x = "Cluster",
       y = "Value",
       fill = "Factor")

Adding in just one binary variable - suicide - had a massive impact on our clustering. The optimal number of clusters increased sixfold, from 2 all the way up to 12. All of a sudden we have clusters that fit a number of different scenarios. Based on our clusters, we ended up with the following conclusions:

  • We had 4 groups that had non-zero means for suicide (meaning suicide was attempted)
  • Of these 4 groups, 3 had higher ADHD scores, 1 was an older group and 1 was a younger group, and education for all 4 groups was low.

We’re curious whether the massive increase in the optimal number of clusters was caused by the addition of a single variable or by the addition of a binary variable. The best way to tell would be to replace the binary suicide variable with another continuous variable, but we have already used all of the provided continuous variables. We could instead remove one of our original variables and retain the binary suicide variable, as sketched below, but even this would cause some level of distortion.
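
A sketch of that check (dropping education, chosen arbitrarily, and keeping the binary suicide flag):

# same pipeline as above, minus education
adhd_km_chk <- adhd %>%
  select(suicide, age, md_total, adhd_total) %>%
  mutate_if(is.factor, as.character) %>%
  mutate_if(is.character, as.numeric)
chk_trans <- preProcess(adhd_km_chk, method = c("center", "scale", "BoxCox"))
chk_trans <- predict(chk_trans, adhd_km_chk)
fviz_nbclust(chk_trans, kmeans, method = "silhouette", k.max = 20)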


Principal Component Analysis

PCA is used in exploratory data analysis and for making predictive models. It is commonly used for dimensionality reduction by projecting each data point onto only the first few principal components to obtain lower-dimensional data while preserving as much of the data’s variation as possible. The first principal component can equivalently be defined as a direction that maximizes the variance of the projected data. The i-th principal component can be taken as a direction orthogonal to the first i-1 principal components that maximizes the variance of the projected data. (Source: https://en.wikipedia.org/wiki/Principal_component_analysis)
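
In symbols, with X the centered data matrix and Σ its covariance matrix, the first loading vector solves

$$
\mathbf{w}_{1} = \underset{\lVert \mathbf{w} \rVert = 1}{\arg\max}\ \operatorname{Var}(X\mathbf{w}) = \underset{\lVert \mathbf{w} \rVert = 1}{\arg\max}\ \mathbf{w}^{\top} \Sigma\, \mathbf{w}
$$

and each subsequent component solves the same problem subject to orthogonality with the earlier ones.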

We’ll use the prcomp() function to reduce the number of dimensions of our dataset and measure how much of the variance each component explains. As mentioned previously, PCA works best with numerical data, so we will exclude the character column Initial. We are then left with a matrix of 52 columns and 150+ rows, which we will pass through the prcomp() function for the principal component analysis.

This function returns the results as an object of class ‘prcomp’. We will assign the output to a variable named pca.out.

For this analysis we will concentrate on the responses to the mood disorder questions as mood swings seems to be a strong predictor for suicide attempts.

# working on a copy for the PCA analysis
adhd.pca <- data.frame(adhd)

# dropping the character column for the PCA analysis
adhd.pca <- adhd.pca[,-1]

# dropping the ADHD questionnaire columns and the suicide column for the PCA analysis
adhd.pca <- select(adhd.pca, -c(adhd_q1, adhd_q2, adhd_q3, adhd_q4, adhd_q5, adhd_q6, adhd_q7,
                                adhd_q8, adhd_q9, adhd_q10, adhd_q11, adhd_q12,
                                adhd_q13, adhd_q14, adhd_q15, adhd_q16, adhd_q17,
                                adhd_q18, adhd_total,
                                suicide))

#converting columns to numeric from factor for PCA analysis
adhd.pca$sex <- as.numeric(adhd.pca$sex)
adhd.pca$age <- as.numeric(adhd.pca$age)
adhd.pca$race <- as.numeric(adhd.pca$race)
# (the ADHD questionnaire columns were dropped above, so no conversion is needed for them)
adhd.pca$md_q1a <- as.numeric(adhd.pca$md_q1a)
adhd.pca$md_q1b <- as.numeric(adhd.pca$md_q1b)
adhd.pca$md_q1c <- as.numeric(adhd.pca$md_q1c)
adhd.pca$md_q1d <- as.numeric(adhd.pca$md_q1d)
adhd.pca$md_q1e <- as.numeric(adhd.pca$md_q1e)
adhd.pca$md_q1f <- as.numeric(adhd.pca$md_q1f)
adhd.pca$md_q1g <- as.numeric(adhd.pca$md_q1g)
adhd.pca$md_q1h <- as.numeric(adhd.pca$md_q1h)
adhd.pca$md_q1i <- as.numeric(adhd.pca$md_q1i)
adhd.pca$md_q1j <- as.numeric(adhd.pca$md_q1j)
adhd.pca$md_q1k <- as.numeric(adhd.pca$md_q1k)
adhd.pca$md_q1l <- as.numeric(adhd.pca$md_q1l)
adhd.pca$md_q1m <- as.numeric(adhd.pca$md_q1m)
adhd.pca$md_q2 <- as.numeric(adhd.pca$md_q2)
adhd.pca$md_q3 <- as.numeric(adhd.pca$md_q3)
adhd.pca$md_total <- as.numeric(adhd.pca$md_total)
adhd.pca$alcohol <- as.numeric(adhd.pca$alcohol)
adhd.pca$thc <- as.numeric(adhd.pca$thc)
adhd.pca$cocaine <- as.numeric(adhd.pca$cocaine)
adhd.pca$stimulants <- as.numeric(adhd.pca$stimulants)
adhd.pca$sedative_hypnotics <- as.numeric(adhd.pca$sedative_hypnotics)
adhd.pca$opioids <- as.numeric(adhd.pca$opioids)
adhd.pca$court_order <- as.numeric(adhd.pca$court_order)
adhd.pca$education <- as.numeric(adhd.pca$education)
adhd.pca$hx_of_violence <- as.numeric(adhd.pca$hx_of_violence)
adhd.pca$disorderly_conduct <- as.numeric(adhd.pca$disorderly_conduct)
#adhd.pca$suicide <- as.numeric(adhd.pca$suicide)
adhd.pca$abuse <- as.numeric(adhd.pca$abuse)
adhd.pca$non_subst_dx <- as.numeric(adhd.pca$non_subst_dx)
adhd.pca$subst_dx <- as.numeric(adhd.pca$subst_dx)

# prcomp function
pca.out <- prcomp(adhd.pca, scale = TRUE)
pca.out
## Standard deviations (1, .., p=32):
##  [1] 2.640323e+00 1.650249e+00 1.499738e+00 1.386581e+00 1.314550e+00
##  [6] 1.217532e+00 1.207500e+00 1.090778e+00 1.040926e+00 1.009742e+00
## [11] 9.728212e-01 9.327881e-01 8.822015e-01 8.773261e-01 8.509808e-01
## [16] 8.210993e-01 7.939130e-01 7.686684e-01 7.417143e-01 6.914803e-01
## [21] 6.632220e-01 6.466217e-01 6.332013e-01 6.068797e-01 5.887602e-01
## [26] 5.334894e-01 5.263445e-01 4.837081e-01 4.630039e-01 4.178191e-01
## [31] 3.827453e-02 4.830643e-16
## 
## Rotation (n x k) = (32 x 32):
##                            PC1          PC2          PC3         PC4
## age                 0.03336377 -0.016035809  0.095615021 -0.31769551
## sex                -0.03062185 -0.190448520 -0.150950196  0.04758729
## race                0.06153719  0.231042763  0.338634394 -0.15918540
## md_q1a             -0.27173946  0.031971639 -0.080152997 -0.04865127
## md_q1b             -0.24624637 -0.182944105 -0.101833756 -0.06740290
## md_q1c             -0.15991210  0.194234706  0.292514071  0.01223432
## md_q1d             -0.21841639 -0.026241473  0.068723503 -0.10316833
## md_q1e             -0.23407692  0.009839529  0.140703697 -0.02795032
## md_q1f             -0.24310507 -0.192887263 -0.036205688 -0.03384025
## md_q1g             -0.24395434 -0.212742517 -0.032828309 -0.13726742
## md_q1h             -0.22528456  0.096462817  0.225721670  0.17613526
## md_q1i             -0.21889346  0.118602570  0.200114792  0.07634098
## md_q1j             -0.21644439  0.080973836  0.205195267  0.10524224
## md_q1k             -0.20466450  0.139680860  0.131613748  0.12189475
## md_q1l             -0.26829008 -0.004287841 -0.073824509  0.01976871
## md_q1m             -0.21514390  0.035804951 -0.009100572  0.01433684
## md_q2              -0.25972093 -0.220794089  0.077176045  0.03019134
## md_q3              -0.18448485 -0.236596690 -0.169754723 -0.11717532
## md_total           -0.37206533 -0.062002927  0.068926769 -0.01007812
## alcohol            -0.12054292  0.123344103 -0.109954488  0.10710563
## thc                -0.02302607  0.196274036 -0.154023366  0.20437879
## cocaine            -0.05400986  0.380520206  0.032224697 -0.07488839
## stimulants         -0.03658567 -0.020124781 -0.145762437  0.20198797
## sedative_hypnotics -0.01026143 -0.037156633 -0.172617288  0.32542531
## opioids            -0.04869099 -0.098148030 -0.201787791  0.33938775
## court_order        -0.03891218  0.071537229 -0.125138592  0.11039260
## education           0.12108525 -0.262048046  0.324287388  0.39612473
## hx_of_violence      0.12108525 -0.262048046  0.324287388  0.39612473
## disorderly_conduct -0.06753305  0.257494940 -0.221235237  0.04909009
## abuse              -0.03222505 -0.096570115 -0.256498820 -0.04660587
## non_subst_dx        0.03246106 -0.275826822  0.015791957 -0.11794034
## subst_dx           -0.09013668  0.301983498 -0.243036409  0.29937285
##                             PC5          PC6          PC7          PC8
## age                 0.445768184 -0.273202178  0.220159130 -0.036493967
## sex                 0.244147095  0.474668158  0.194668599  0.293605461
## race               -0.133122538  0.096735334 -0.002553002  0.337627518
## md_q1a             -0.050939977 -0.133414434  0.017826812  0.014474370
## md_q1b             -0.146417253  0.029510948 -0.100116810  0.164947234
## md_q1c              0.075223865 -0.076076265 -0.023933195  0.082555380
## md_q1d              0.025666745 -0.060558979  0.054480774  0.063035504
## md_q1e              0.142531419  0.083523774  0.164914752  0.139185142
## md_q1f             -0.051141802  0.009758732  0.069006185  0.006065575
## md_q1g             -0.087898905  0.039597836  0.030887258  0.092619241
## md_q1h              0.077470221  0.231227357 -0.007695393 -0.052224660
## md_q1i              0.138543372  0.162595243  0.045208613 -0.113517143
## md_q1j              0.052442834  0.260236318 -0.001187189 -0.242671016
## md_q1k             -0.080770710  0.125466107 -0.050319189 -0.330698981
## md_q1l             -0.060320615 -0.255617569 -0.136511087 -0.063895825
## md_q1m              0.146149544 -0.274070462 -0.087082870  0.005792263
## md_q2              -0.168222709 -0.102443549 -0.018731439  0.157866070
## md_q3              -0.046985633 -0.056554589 -0.017757839  0.123780165
## md_total           -0.010952352 -0.008632013 -0.014635730  0.013961048
## alcohol             0.147297949 -0.313200835  0.287395610 -0.189925329
## thc                -0.365084015  0.072358380  0.273776314  0.280913142
## cocaine             0.170599274  0.027682322  0.186688997  0.187158365
## stimulants         -0.182219515 -0.057720153  0.120874555  0.080531874
## sedative_hypnotics  0.282416423  0.077950737 -0.322760424 -0.078968953
## opioids             0.264203495  0.058264056 -0.264238970  0.040080148
## court_order        -0.268320401  0.211767710  0.314123684 -0.306502403
## education          -0.005729779 -0.216346581  0.198371738  0.094956235
## hx_of_violence     -0.005729779 -0.216346581  0.198371738  0.094956235
## disorderly_conduct -0.176591681 -0.246645471  0.068179182 -0.102624027
## abuse               0.278719988  0.079958381  0.395400975  0.072220896
## non_subst_dx       -0.024279764  0.080910918  0.317932262 -0.434214415
## subst_dx            0.156477542 -0.084894551  0.120995084  0.152487267
##                              PC9         PC10         PC11        PC12
## age                 0.1777312076 -0.033851174 -0.044376170 -0.01101578
## sex                -0.0043743170 -0.113037380 -0.033416209  0.21578091
## race               -0.2526303961  0.063968319  0.082748203  0.06501477
## md_q1a             -0.0142589861 -0.042864193  0.235163538 -0.02710441
## md_q1b             -0.0831135520  0.205511396 -0.078456924  0.05030622
## md_q1c              0.1207753893 -0.014691600 -0.182382396 -0.33796885
## md_q1d              0.2430942870 -0.109211210  0.123857841  0.42996712
## md_q1e              0.0173947409 -0.183641606  0.201194645 -0.04159599
## md_q1f              0.1295263777  0.093326665 -0.011117267  0.11605387
## md_q1g              0.0417182729  0.078705407 -0.022172430 -0.05236325
## md_q1h              0.0691508022 -0.020667777  0.033706172  0.09823516
## md_q1i              0.1622382909 -0.008852521 -0.246476514 -0.20880527
## md_q1j             -0.0574372250 -0.098082338  0.025563038 -0.03684529
## md_q1k             -0.2024584368  0.203582083  0.340440799  0.17119197
## md_q1l              0.0402954821  0.232060387  0.190349014 -0.08858718
## md_q1m             -0.2819597356 -0.065863281 -0.279402713 -0.05374598
## md_q2               0.0899932721  0.077440925 -0.077749238  0.01500964
## md_q3              -0.2681941665 -0.231348407 -0.280318292 -0.19404741
## md_total           -0.0331003234 -0.015973667 -0.041828082 -0.04897186
## alcohol            -0.3946825008 -0.154066807 -0.070915177  0.31683583
## thc                -0.0117854843  0.278642521 -0.089647517 -0.10359734
## cocaine             0.1155595328  0.090250104 -0.054205969 -0.19306756
## stimulants          0.0562430140 -0.605440911  0.437782870 -0.36852359
## sedative_hypnotics -0.1432887327  0.182306117 -0.002630789 -0.28159043
## opioids             0.3286830201 -0.061120470 -0.071939824  0.13631361
## court_order         0.0322969193 -0.166270644 -0.434611908  0.06061491
## education           0.0217011803  0.069803763 -0.023575037  0.02263082
## hx_of_violence      0.0217011803  0.069803763 -0.023575037  0.02263082
## disorderly_conduct  0.4853845652  0.063352431 -0.057458561  0.06236044
## abuse              -0.0319473605  0.347314707  0.208313293 -0.17475066
## non_subst_dx       -0.0009272567  0.170937286  0.094339434 -0.24949389
## subst_dx           -0.1664908295  0.057448129  0.036935606  0.10204128
##                           PC13         PC14        PC15        PC16
## age                 0.10375437  0.054845118  0.06114118  0.01012488
## sex                -0.09333258  0.004562066  0.23399217  0.17920773
## race                0.17353032 -0.152078236 -0.16762103 -0.03007089
## md_q1a             -0.12188465 -0.089143091  0.17891548  0.16215963
## md_q1b             -0.04961481 -0.131911852 -0.06055833  0.23891958
## md_q1c             -0.35092838 -0.270435964 -0.20111230 -0.12621602
## md_q1d             -0.05747553 -0.465350249 -0.27992314 -0.06871410
## md_q1e             -0.34378672  0.058328036  0.25754615 -0.36238060
## md_q1f             -0.03862331  0.217713047  0.06710636 -0.38668900
## md_q1g              0.33287507  0.103536598  0.06924608 -0.09952170
## md_q1h             -0.02365317  0.366759891 -0.25667840 -0.01283403
## md_q1i              0.09817992  0.243504140 -0.31044873  0.21616858
## md_q1j              0.09818088 -0.188860802  0.41768012  0.22103975
## md_q1k              0.05598151 -0.075989521 -0.07066745  0.01658730
## md_q1l              0.28620547  0.057783880  0.06495658  0.04365297
## md_q1m              0.08116718  0.157944776  0.17248577 -0.24365328
## md_q2               0.06489761  0.025246492  0.07903445  0.04197274
## md_q3              -0.09509643 -0.172468863 -0.16961693  0.27594017
## md_total           -0.02085377 -0.038320188 -0.02416777  0.02928881
## alcohol            -0.09761260  0.123724840 -0.16537340  0.08553403
## thc                -0.23935328  0.212944835 -0.02753719 -0.04503271
## cocaine             0.29389268 -0.075928717  0.22719392  0.20013260
## stimulants          0.19265550  0.038581363 -0.14674929 -0.01812780
## sedative_hypnotics -0.09426594 -0.302813009  0.06348838 -0.25654687
## opioids             0.08650530  0.054876495 -0.19145352  0.08505569
## court_order         0.29666738 -0.287459422  0.02236677 -0.31427783
## education           0.06631902 -0.091243443  0.07536158  0.06058015
## hx_of_violence      0.06631902 -0.091243443  0.07536158  0.06058015
## disorderly_conduct -0.17578904 -0.095611096  0.15270502  0.11957998
## abuse               0.18847794 -0.172010167 -0.30355552 -0.17439811
## non_subst_dx       -0.28307899  0.027045773 -0.02794345  0.21627401
## subst_dx           -0.03074205 -0.023317906  0.04438788  0.12512795
##                            PC17         PC18         PC19          PC20
## age                 0.234295636 -0.089803480 -0.066413162 -2.262317e-01
## sex                 0.069746234  0.254714801  0.030166372 -6.307930e-02
## race               -0.150421835  0.117002332 -0.025327429  1.637394e-01
## md_q1a             -0.159012065  0.289913675  0.393738400  2.182363e-01
## md_q1b              0.267709570 -0.047078178 -0.016438923 -2.408258e-01
## md_q1c             -0.162250306  0.082533031 -0.256713576  2.314672e-02
## md_q1d             -0.081643286 -0.091462213  0.117910599 -1.489525e-01
## md_q1e              0.034175726 -0.051112830 -0.091353204 -2.482877e-02
## md_q1f             -0.002708953 -0.282596162  0.171882123  3.849295e-01
## md_q1g             -0.202733532 -0.005018515 -0.435391888  2.199747e-01
## md_q1h              0.036839555  0.038681452  0.346193836 -3.750497e-02
## md_q1i              0.231975659  0.134557671 -0.001715819  1.800678e-01
## md_q1j              0.027177481  0.115051663 -0.301364447  2.736566e-02
## md_q1k              0.072350420 -0.155037438 -0.060239352 -2.799618e-01
## md_q1l             -0.023237338  0.227852224  0.066062348  3.819569e-02
## md_q1m             -0.263642450  0.272673268  0.146032465 -4.569315e-01
## md_q2               0.164319856 -0.279077394 -0.166622613 -1.340569e-01
## md_q3              -0.005446446 -0.120633918  0.161008490  1.212325e-01
## md_total           -0.006766558 -0.006600321  0.033229537 -1.119369e-02
## alcohol             0.177591581 -0.047141171 -0.231154849  1.771337e-01
## thc                 0.033262704 -0.003250061 -0.086126748 -2.877085e-01
## cocaine            -0.074209665 -0.458794327  0.287740789 -6.247687e-02
## stimulants          0.104145822 -0.058428682 -0.044460092 -1.468377e-01
## sedative_hypnotics  0.298211099 -0.140577310  0.109388814  1.079867e-01
## opioids            -0.414568377 -0.057817826 -0.168303836 -1.443630e-01
## court_order        -0.024519268  0.028096144  0.133737866 -3.472454e-02
## education           0.009473916  0.044740762  0.085066693  4.103315e-02
## hx_of_violence      0.009473916  0.044740762  0.085066693  4.103315e-02
## disorderly_conduct  0.170859173  0.234855898 -0.051409091  1.023602e-01
## abuse               0.050799291  0.283112660 -0.043128795  5.665629e-05
## non_subst_dx       -0.413062945 -0.163660022  0.049332348 -8.465086e-02
## subst_dx           -0.272317242 -0.218334498 -0.107170997  2.053617e-01
##                            PC21         PC22         PC23         PC24
## age                 0.238954916  0.018173179  0.064999912 -0.450182834
## sex                 0.190717970  0.079184580 -0.030359502 -0.066872102
## race                0.164130183  0.277690738  0.386331874 -0.271829700
## md_q1a             -0.152560797 -0.236380877 -0.121994973 -0.432613950
## md_q1b             -0.180895945 -0.173548999 -0.012824409  0.029513362
## md_q1c              0.123984259 -0.273359774 -0.222081123 -0.065978731
## md_q1d             -0.066549091  0.380328211 -0.210317420  0.123462676
## md_q1e             -0.233917689 -0.163391279  0.364988453  0.148556330
## md_q1f              0.495786124  0.024664521 -0.106671587  0.094929745
## md_q1g             -0.346622786  0.124052391  0.121648672 -0.046698475
## md_q1h             -0.254105328  0.049631769  0.073568358 -0.108228834
## md_q1i             -0.006491301  0.162343478 -0.084611809  0.106710674
## md_q1j              0.182243299  0.126935948 -0.149695947  0.071798712
## md_q1k              0.243026261 -0.228787948  0.219816047  0.062276399
## md_q1l              0.222742062 -0.005536183 -0.008475544 -0.028386223
## md_q1m              0.006873645  0.254541669 -0.029655077  0.199104847
## md_q2              -0.140609437  0.046978325 -0.141576648 -0.248336969
## md_q3               0.236487847 -0.049625834  0.330780843  0.156928985
## md_total            0.059671728 -0.006905531  0.039547970  0.033157547
## alcohol            -0.100442973 -0.020414875  0.071465871 -0.088913993
## thc                 0.179952751  0.132615421 -0.035849094 -0.191966789
## cocaine            -0.117552261 -0.092833367  0.084946520  0.183944615
## stimulants          0.088130414  0.128016270 -0.057610383 -0.026124596
## sedative_hypnotics -0.107752417  0.346763333  0.115103092 -0.249386710
## opioids             0.118282854 -0.201500710  0.294277578 -0.171678607
## court_order        -0.053969225 -0.181493906  0.047601052 -0.238935985
## education          -0.001563704 -0.036359515  0.058987957  0.068381062
## hx_of_violence     -0.001563704 -0.036359515  0.058987957  0.068381062
## disorderly_conduct  0.002461422  0.208628647  0.406480633  0.160387789
## abuse              -0.035296835 -0.137672750 -0.027677085  0.211362467
## non_subst_dx       -0.054779632  0.302170483  0.090289103 -0.101497252
## subst_dx           -0.022049951  0.090444849 -0.249372798  0.004900215
##                            PC25         PC26         PC27         PC28
## age                -0.022045120 -0.107897684 -0.274701658  0.001629420
## sex                 0.014035368 -0.280162365  0.162606740  0.174502014
## race               -0.203191157  0.078657181  0.004200263 -0.147142914
## md_q1a              0.031420005  0.143497458  0.059740499 -0.228326382
## md_q1b             -0.530286967  0.173670657 -0.381698943  0.073236758
## md_q1c             -0.151984893 -0.206664583  0.159484816  0.240754004
## md_q1d              0.205980396  0.133897525 -0.043575449  0.103693597
## md_q1e              0.223356284  0.149221255 -0.248786275 -0.076397654
## md_q1f             -0.307688348  0.110298864  0.032752199 -0.048271033
## md_q1g              0.028106221 -0.166005992 -0.050838820  0.212780837
## md_q1h             -0.191734709 -0.238332128  0.066360077  0.265280011
## md_q1i              0.217782401  0.142241758 -0.200184989 -0.319944900
## md_q1j             -0.116927798  0.302015682  0.036185600 -0.006263303
## md_q1k              0.096561247 -0.306457636  0.032427751 -0.149040596
## md_q1l              0.295403922 -0.003028877 -0.117485720  0.448280327
## md_q1m             -0.155595395 -0.031311156  0.075546142 -0.166412980
## md_q2               0.125102237 -0.177652884  0.434164133 -0.386696873
## md_q3               0.234826081 -0.128650542 -0.007629842  0.015703520
## md_total            0.028732285 -0.012637046 -0.012731205  0.002357241
## alcohol            -0.070816957  0.242980967  0.290604627  0.219293929
## thc                 0.244617783  0.262278334 -0.034061544  0.122440109
## cocaine            -0.033803615  0.142639508  0.241845601  0.169591260
## stimulants         -0.209499548 -0.024278858 -0.007066795  0.020223486
## sedative_hypnotics  0.013844346  0.029613064  0.041317963  0.086856162
## opioids            -0.039430505  0.299010455  0.054438809 -0.085101967
## court_order         0.019766881 -0.051021489 -0.146889441  0.006961255
## education          -0.006537646 -0.012714534 -0.057299215  0.019377766
## hx_of_violence     -0.006537646 -0.012714534 -0.057299215  0.019377766
## disorderly_conduct -0.211807281 -0.221162661  0.106334830 -0.090872970
## abuse              -0.071619507  0.052494399  0.182153747 -0.202439106
## non_subst_dx       -0.107395080  0.009927608 -0.013435858  0.029665771
## subst_dx           -0.032363846 -0.353735936 -0.423357968 -0.191986775
##                            PC29        PC30          PC31          PC32
## age                 0.063106186 -0.19618508  5.957325e-03  0.000000e+00
## sex                 0.083654553  0.32770212 -3.521109e-03  3.321098e-16
## race               -0.151053597  0.13261464  1.444740e-03  1.267305e-16
## md_q1a              0.275031176 -0.04936930 -8.772797e-02  3.422377e-16
## md_q1b             -0.025448563  0.15435023 -1.003719e-01 -4.838127e-16
## md_q1c              0.047349527  0.10987050 -1.016572e-01 -1.046259e-15
## md_q1d              0.091777186 -0.03749042 -9.106163e-02 -2.443516e-16
## md_q1e             -0.210750240  0.11399768 -9.194900e-02 -1.956705e-16
## md_q1f              0.082819631  0.02507437 -9.805214e-02  9.958021e-17
## md_q1g              0.427321529 -0.11467865 -7.434356e-02  2.205538e-16
## md_q1h             -0.207366317 -0.39089204 -7.867493e-02 -8.401095e-16
## md_q1i              0.161891474  0.30407172 -1.036058e-01 -4.509076e-16
## md_q1j             -0.156686443 -0.39862569 -9.151701e-02 -8.292318e-16
## md_q1k              0.337213423  0.07115196 -1.013924e-01  3.376135e-17
## md_q1l             -0.391009377  0.25516955 -1.034310e-01 -6.142866e-16
## md_q1m              0.103815898  0.05906243 -1.020638e-01 -3.486227e-16
## md_q2              -0.355703870  0.07206380 -9.144735e-02 -1.218384e-15
## md_q3              -0.032758874 -0.31596465 -2.011765e-01 -1.033516e-15
## md_total            0.011608462 -0.02427163  9.132117e-01  4.245660e-15
## alcohol             0.011148015  0.15811286 -1.119956e-03 -3.771936e-16
## thc                 0.172094600 -0.21743005 -2.032771e-03  2.153725e-16
## cocaine             0.104717619  0.11976428 -1.159660e-02  3.775554e-16
## stimulants          0.050906399  0.08616164 -4.044616e-03  1.464436e-16
## sedative_hypnotics  0.109615818  0.03349246 -9.108052e-06  1.160947e-16
## opioids            -0.033785738  0.01343407 -5.066329e-03 -1.216283e-16
## court_order        -0.121104288  0.09960254 -5.491854e-04 -4.504842e-16
## education           0.079587461 -0.01450872  1.303516e-03  7.071068e-01
## hx_of_violence      0.079587461 -0.01450872  1.303516e-03 -7.071068e-01
## disorderly_conduct  0.002424071 -0.02805922  5.636794e-03  2.646578e-16
## abuse              -0.142274240 -0.20042710  9.529962e-03  9.671649e-17
## non_subst_dx       -0.112257042  0.19508039 -6.467801e-03 -3.488599e-16
## subst_dx           -0.172793130 -0.04340333  1.105127e-02 -2.788349e-16

Next, we will print the summary of the prcomp object:

# summary of the prcomp object
summary(pca.out)
## Importance of components:
##                           PC1    PC2     PC3     PC4    PC5     PC6     PC7
## Standard deviation     2.6403 1.6502 1.49974 1.38658 1.3146 1.21753 1.20750
## Proportion of Variance 0.2178 0.0851 0.07029 0.06008 0.0540 0.04632 0.04556
## Cumulative Proportion  0.2178 0.3030 0.37325 0.43333 0.4873 0.53365 0.57922
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     1.09078 1.04093 1.00974 0.97282 0.93279 0.88220 0.87733
## Proportion of Variance 0.03718 0.03386 0.03186 0.02957 0.02719 0.02432 0.02405
## Cumulative Proportion  0.61640 0.65026 0.68212 0.71169 0.73888 0.76321 0.78726
##                           PC15    PC16   PC17    PC18    PC19    PC20    PC21
## Standard deviation     0.85098 0.82110 0.7939 0.76867 0.74171 0.69148 0.66322
## Proportion of Variance 0.02263 0.02107 0.0197 0.01846 0.01719 0.01494 0.01375
## Cumulative Proportion  0.80989 0.83096 0.8507 0.86912 0.88631 0.90125 0.91500
##                           PC22    PC23    PC24    PC25    PC26    PC27    PC28
## Standard deviation     0.64662 0.63320 0.60688 0.58876 0.53349 0.52634 0.48371
## Proportion of Variance 0.01307 0.01253 0.01151 0.01083 0.00889 0.00866 0.00731
## Cumulative Proportion  0.92807 0.94059 0.95210 0.96294 0.97183 0.98049 0.98780
##                          PC29    PC30    PC31      PC32
## Standard deviation     0.4630 0.41782 0.03827 4.831e-16
## Proportion of Variance 0.0067 0.00546 0.00005 0.000e+00
## Cumulative Proportion  0.9945 0.99995 1.00000 1.000e+00

Here we get principal components named PC1-PC32. Each of these explains a percentage of the total variation in the dataset. For example, PC1 explains nearly 22% of the total variance, i.e., around one-fourth of the information in the dataset can be encapsulated by just that one principal component. PC2 explains about 8.5%, and so on.
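
These proportions can be recovered directly from the standard deviations stored in the prcomp object:

# proportion of variance explained by each component
pve <- pca.out$sdev^2 / sum(pca.out$sdev^2)
round(head(pve, 6), 4)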

Next, we’ll visualize these components to better understand the analysis. When plotting a PCA, we typically refer to a scatter plot of the first two principal components, PC1 and PC2. These plots can reveal features of the data such as non-linearity and possible departures from normality. PC1 and PC2 are evaluated for each sample vector and plotted.

# loading library
library(ggfortify)
# the autoplot() function from the ggfortify package plots the first two PCs
pca.out.plot <- autoplot(pca.out, data = adhd, colour = 'suicide')
  
pca.out.plot

From the above plot, we cannot see a clear clustering of the suicide dots. With PC1 and PC2 explaining only about 22% and 8.5% of the variance, this is what we’d expect to see.

To determine the ideal number of components to include that capture the most variance, we can use the plot() function on the prcomp object.

plot(pca.out, type="l")

In a scree plot, the ‘arm-bend’ represents a drop-off in the contribution of each additional component. The plot above shows the bend at the second principal component. A scree plot is a diagnostic tool for checking whether PCA works well on the data: PC1 captures the most variation, PC2 the second most, and so on, and we use the plot to select the principal components to keep. The ideal curve is steep at first, bends at the ‘arm-bend’ (the cutting-off point), and flattens out after that. In the visual above, PCs 1 through 6 are enough to describe the variance within the data.

We can also use this PCA analysis to determine which of the questions accounted for the most variance (had the heaviest bearing on the results). For PC1, we can see that question Q1a is the most important of the mood disorder questions (md_q1a -0.2717), and for PC2, question Q3 is the most important (md_q3 -0.2366), along with the question on cocaine use (cocaine 0.3805).
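
A quick way to pull the heaviest loadings out of the rotation matrix:

# variables with the largest absolute loadings on PC1 and PC2
head(sort(abs(pca.out$rotation[, "PC1"]), decreasing = TRUE), 5)
head(sort(abs(pca.out$rotation[, "PC2"]), decreasing = TRUE), 5)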


Predicting Suicide Attempts

Having concluded our exploratory data analysis, we’ll build two models to attempt to predict suicide attempts. The first model will leverage gradient boosting and the second model will use support vector machines.

Gradient Boosting

We begin by dropping the initial column, as it doesn’t hold any predictive information. The data is then split into training and testing sets, stratified on the suicide variable:

library(rsample)  # initial_split(), training(), testing()

set.seed(777)
df_split <- initial_split(adhd %>% select(-c(initial)), strata = suicide)
df_train <- training(df_split)
df_test <- testing(df_split)

Gradient boosting machines are similar to random forests or bagging approaches, but instead of growing a large number of trees from independent random samples of the data, the trees are grown sequentially, with each new tree fit to the errors of the ensemble built so far. Boosting is a method to improve (boost) weak learners sequentially and increase the model’s accuracy with a combined model.
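
Schematically, each new tree h_m is fit to the current residuals and added to the ensemble with shrinkage λ:

$$
F_{m}(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \lambda\, h_{m}(\mathbf{x})
$$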

Unlike random forests, GBMs can show high variability in accuracy depending on their hyperparameter settings.

Boosting has 5 tuning parameters that we can focus on:

  • The number of trees (n.trees).

  • The shrinkage parameter λ (shrinkage), a small positive number. This controls the rate at which boosting learns.

  • The interaction depth (interaction.depth), the number of splits in each tree, which controls the complexity of the boosted ensemble.

  • The minimum number of observations in the terminal nodes (n.minobsinnode).

  • The distribution type. We will use “adaboost”, as we are working with 0-1 outcomes.

We will tune the model on the training set to find the combination with the lowest RMSE.

library(gbm)

# search grid
hyper_grid <- expand.grid(
  n.trees = c(1000, 3000),
  shrinkage = c(0.01, 0.05),
  interaction.depth = c(3, 5, 7),
  n.minobsinnode = c(5, 10, 15)
)

# create model fit function
model_fit <- function(n.trees, shrinkage, interaction.depth, n.minobsinnode) {
  set.seed(777)
  m <- gbm(
    formula = as.character(suicide) ~ .,
    data = data.frame(df_train),
    distribution = "adaboost",
    n.trees = n.trees,
    shrinkage = shrinkage,
    interaction.depth = interaction.depth,
    n.minobsinnode = n.minobsinnode,
    cv.folds = 10
  )
  # RMSE: square root of the minimum cross-validated error
  sqrt(min(m$cv.error))
}


hyper_grid$rmse <- purrr::pmap_dbl(
  hyper_grid,
  ~ model_fit(
    n.trees = ..1,
    shrinkage = ..2,
    interaction.depth = ..3,
    n.minobsinnode = ..4
  )
)

# results
arrange(hyper_grid, rmse)
##    n.trees shrinkage interaction.depth n.minobsinnode      rmse
## 1     1000      0.01                 3             15 0.9668500
## 2     3000      0.01                 3             15 0.9668500
## 3     1000      0.01                 5             15 0.9668500
## 4     3000      0.01                 5             15 0.9668500
## 5     1000      0.01                 7             15 0.9668500
## 6     3000      0.01                 7             15 0.9668500
## 7     1000      0.05                 3             10 0.9669584
## 8     3000      0.05                 3             10 0.9669584
## 9     1000      0.05                 5             10 0.9669584
## 10    3000      0.05                 5             10 0.9669584
## 11    1000      0.05                 7             10 0.9669584
## 12    3000      0.05                 7             10 0.9669584
## 13    1000      0.01                 5              5 0.9673455
## 14    3000      0.01                 5              5 0.9673455
## 15    1000      0.01                 7              5 0.9673455
## 16    3000      0.01                 7              5 0.9673455
## 17    1000      0.01                 3             10 0.9675437
## 18    3000      0.01                 3             10 0.9675437
## 19    1000      0.01                 5             10 0.9675437
## 20    3000      0.01                 5             10 0.9675437
## 21    1000      0.01                 7             10 0.9675437
## 22    3000      0.01                 7             10 0.9675437
## 23    1000      0.01                 3              5 0.9676015
## 24    3000      0.01                 3              5 0.9676015
## 25    1000      0.05                 3             15 0.9676084
## 26    3000      0.05                 3             15 0.9676084
## 27    1000      0.05                 5             15 0.9676084
## 28    3000      0.05                 5             15 0.9676084
## 29    1000      0.05                 7             15 0.9676084
## 30    3000      0.05                 7             15 0.9676084
## 31    1000      0.05                 5              5 0.9677676
## 32    3000      0.05                 5              5 0.9677676
## 33    1000      0.05                 7              5 0.9677676
## 34    3000      0.05                 7              5 0.9677676
## 35    1000      0.05                 3              5 0.9677922
## 36    3000      0.05                 3              5 0.9677922

We saw no improvement in RMSE when we tried to decrease or increase the number of trees.

From the results above, we see the optimal hyperparameters are:

  • n.trees = 1000

  • interaction.depth = 3

  • shrinkage = 0.01

  • n.minobsinnode = 15

We will use these values to build the model for our test set.

set.seed(777)
modelGBM <- gbm(as.character(suicide) ~ ., data = data.frame(df_train),
                n.trees = 1000, distribution = 'adaboost',
                interaction.depth = 3, shrinkage = 0.01, n.minobsinnode = 15)

print(modelGBM)
## gbm(formula = as.character(suicide) ~ ., distribution = "adaboost", 
##     data = data.frame(df_train), n.trees = 1000, interaction.depth = 3, 
##     n.minobsinnode = 15, shrinkage = 0.01)
## A gradient boosted model with adaboost loss function.
## 1000 iterations were performed.
## There were 51 predictors of which 41 had non-zero influence.
# predicted probabilities on the test set, thresholded at 0.5
pgbm <- predict(modelGBM, newdata = data.frame(df_test), n.trees = 1000, type = 'response')
pgbm[pgbm > 0.5] <- 1
pgbm[pgbm <= 0.5] <- 0
confusionMatrix(as.factor(df_test$suicide),as.factor(pgbm), positive='1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 24  4
##          1 10  2
##                                           
##                Accuracy : 0.65            
##                  95% CI : (0.4832, 0.7937)
##     No Information Rate : 0.85            
##     P-Value [Acc > NIR] : 0.9996          
##                                           
##                   Kappa : 0.0278          
##                                           
##  Mcnemar's Test P-Value : 0.1814          
##                                           
##             Sensitivity : 0.3333          
##             Specificity : 0.7059          
##          Pos Pred Value : 0.1667          
##          Neg Pred Value : 0.8571          
##              Prevalence : 0.1500          
##          Detection Rate : 0.0500          
##    Detection Prevalence : 0.3000          
##       Balanced Accuracy : 0.5196          
##                                           
##        'Positive' Class : 1               
## 

Above, we can see the accuracy metrics for our final model. Boosting has a correct classification rate of 65% and a sensitivity of 33%. With these results, we would not recommend using this model: while it can predict with some accuracy who will not attempt suicide, it is very bad at predicting who will. However, we’ll use these results as a benchmark as we move on to the SVM method.

As a point of interest, we can use the summary function on our final model object to produce a variable importance plot. Not surprisingly, adhd_total and md_total are the most important variables in the model.

summary(modelGBM,
        cBars = 10,
        n.trees = modelGBM$n.trees,
        plotit = TRUE,
        order = TRUE,
        method = relative.influence,
        normalize = TRUE)

##                                   var      rel.inf
## adhd_total                 adhd_total 8.918851e+01
## md_total                     md_total 4.997296e+00
## abuse                           abuse 2.361112e+00
## education                   education 3.205825e-01
## adhd_q2                       adhd_q2 3.110028e-01
## adhd_q1                       adhd_q1 2.820472e-01
## adhd_q9                       adhd_q9 2.471811e-01
## adhd_q7                       adhd_q7 2.396161e-01
## adhd_q10                     adhd_q10 1.913477e-01
## adhd_q14                     adhd_q14 1.714906e-01
## alcohol                       alcohol 1.590282e-01
## race                             race 1.583348e-01
## md_q1m                         md_q1m 1.529314e-01
## adhd_q3                       adhd_q3 1.372526e-01
## adhd_q17                     adhd_q17 1.270819e-01
## adhd_q12                     adhd_q12 1.134044e-01
## adhd_q15                     adhd_q15 1.025606e-01
## subst_dx                     subst_dx 9.152572e-02
## adhd_q13                     adhd_q13 8.570253e-02
## age                               age 8.549288e-02
## adhd_q11                     adhd_q11 7.321143e-02
## adhd_q18                     adhd_q18 6.840093e-02
## adhd_q16                     adhd_q16 6.372256e-02
## cocaine                       cocaine 4.022817e-02
## adhd_q6                       adhd_q6 4.007584e-02
## adhd_q8                       adhd_q8 3.235041e-02
## adhd_q4                       adhd_q4 2.921838e-02
## non_subst_dx             non_subst_dx 2.489671e-02
## sex                               sex 1.887829e-02
## md_q1e                         md_q1e 1.575051e-02
## md_q3                           md_q3 1.549181e-02
## adhd_q5                       adhd_q5 1.365565e-02
## thc                               thc 1.337537e-02
## md_q1g                         md_q1g 8.650117e-03
## md_q1h                         md_q1h 6.254487e-03
## md_q1f                         md_q1f 5.378939e-03
## md_q1d                         md_q1d 3.807187e-03
## md_q1k                         md_q1k 1.717775e-03
## md_q1l                         md_q1l 1.135029e-03
## md_q1b                         md_q1b 1.925562e-04
## md_q1a                         md_q1a 1.056823e-04
## md_q1c                         md_q1c 0.000000e+00
## md_q1i                         md_q1i 0.000000e+00
## md_q1j                         md_q1j 0.000000e+00
## md_q2                           md_q2 0.000000e+00
## stimulants                 stimulants 0.000000e+00
## sedative_hypnotics sedative_hypnotics 0.000000e+00
## opioids                       opioids 0.000000e+00
## court_order               court_order 0.000000e+00
## hx_of_violence         hx_of_violence 0.000000e+00
## disorderly_conduct disorderly_conduct 0.000000e+00
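If a cleaner figure is preferred, the influence table can be captured without gbm's base-graphics plot and charted directly. This is a sketch assuming ggplot2 is installed; infl is a hypothetical name.

library(ggplot2)

# Capture the relative-influence table without plotting
infl <- summary(modelGBM, n.trees = modelGBM$n.trees, plotit = FALSE)

# Bar chart of the ten most influential predictors
ggplot(head(infl, 10), aes(x = reorder(var, rel.inf), y = rel.inf)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = 'Relative influence')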

Support Vector Machine

A Support Vector Machine (SVM) is an algorithm that searches the feature space for the optimal hyperplane: the one that separates the classes with the maximum margin. Here we train an SVM to find the dividing plane between those who attempt suicide and those who don't, based on the features we have.
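Concretely, with class labels $y_i \in \{-1, +1\}$ and feature vectors $x_i$, the soft-margin SVM solves the following standard optimization problem; the cost parameter reported by e1071 in the output below plays the role of $C$:

$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i\,(w^\top x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0.$$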

We fit a linear-kernel SVM to the training set:

set.seed(42)
svm_model = svm(suicide~., data=df_train, kernel='linear', probability=TRUE)
summary(svm_model)
## 
## Call:
## svm(formula = suicide ~ ., data = df_train, kernel = "linear", probability = TRUE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
## 
## Number of Support Vectors:  71
## 
##  ( 31 40 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

Our base model consists of 71 support vectors, of which 31 belong to class 0 (no suicide attempt) and 40 to class 1 (suicide attempt).
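Because the kernel is linear, the fitted hyperplane has an explicit weight vector we can recover from the support vectors. The sketch below is illustrative only; note that e1071 scales features internally by default, so these weights live in the scaled feature space.

# Recover the linear hyperplane's weight vector from the support vectors
w <- t(svm_model$coefs) %*% svm_model$SV

# The ten features with the largest absolute weights
sort(abs(w[1, ]), decreasing = TRUE)[1:10]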

We will tune the SVM on the training set, searching over the cost parameter while holding gamma fixed at 0.25 (gamma has no effect for a linear kernel in any case). The tuning process uses 10-fold cross-validation.

set.seed(42)
svm_tune <- tune.svm(suicide ~ ., data = df_train, kernel = 'linear', probability = TRUE,
                     gamma = 0.25, cost = seq(0.1, 1, 0.1))
summary(svm_tune)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  gamma cost
##   0.25  0.1
## 
## - best performance: 0.3044872 
## 
## - Detailed performance results:
##    gamma cost     error dispersion
## 1   0.25  0.1 0.3044872  0.0811544
## 2   0.25  0.2 0.3044872  0.0811544
## 3   0.25  0.3 0.3044872  0.0811544
## 4   0.25  0.4 0.3044872  0.0811544
## 5   0.25  0.5 0.3044872  0.0811544
## 6   0.25  0.6 0.3044872  0.0811544
## 7   0.25  0.7 0.3044872  0.0811544
## 8   0.25  0.8 0.3044872  0.0811544
## 9   0.25  0.9 0.3044872  0.0811544
## 10  0.25  1.0 0.3044872  0.0811544
svm_best <- svm_tune$best.model

The cross-validation error is identical (0.3045) for every cost value in the grid, so the tuner simply settles on the first, giving best parameters gamma = 0.25 and cost = 0.1.

summary(svm_best)
## 
## Call:
## svm(formula = suicide ~ ., data = df_train, cost = 0.1, gamma = 0.25, 
##     kernel = "linear", probability = TRUE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  0.1 
## 
## Number of Support Vectors:  82
## 
##  ( 34 48 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1
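As an aside, since the cross-validation error was flat across the entire cost grid, a broader search might be more informative. Below is a sketch (a suggestion, not part of our analysis; svm_tune_rbf is a hypothetical name) that widens the cost grid and tries a radial kernel, where gamma actually matters:

set.seed(42)
svm_tune_rbf <- tune.svm(suicide ~ ., data = df_train,
                         kernel = 'radial', probability = TRUE,
                         gamma = c(0.01, 0.05, 0.1, 0.25),
                         cost = c(0.1, 1, 10, 100))
summary(svm_tune_rbf)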

Now we can try the best model against the test set:

set.seed(42)

y_test = predict(svm_best, df_test %>% select(-c(suicide)))
# NB: as above, the true labels are passed to confusionMatrix first, so the
# table's 'Prediction' rows are actually the true classes
confusionMatrix(df_test[,'suicide'], y_test, positive = '1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 24  4
##          1  8  4
##                                           
##                Accuracy : 0.7             
##                  95% CI : (0.5347, 0.8344)
##     No Information Rate : 0.8             
##     P-Value [Acc > NIR] : 0.9568          
##                                           
##                   Kappa : 0.2105          
##                                           
##  Mcnemar's Test P-Value : 0.3865          
##                                           
##             Sensitivity : 0.5000          
##             Specificity : 0.7500          
##          Pos Pred Value : 0.3333          
##          Neg Pred Value : 0.8571          
##              Prevalence : 0.2000          
##          Detection Rate : 0.1000          
##    Detection Prevalence : 0.3000          
##       Balanced Accuracy : 0.6250          
##                                           
##        'Positive' Class : 1               
## 

Based on the results above, the model is most accurate at predicting who will not attempt suicide, with 24 true negatives. Because the true labels were passed to confusionMatrix first, the rows of this table are the actual classes and the columns are the model's predictions; this reading is confirmed by the probability summary below, where y_true has 12 positives but y_pred only 8. The largest source of error is therefore the 8 false negatives, individuals who attempted suicide but whom the model predicted would not. There are also 4 false positives, individuals the model flagged as likely to attempt suicide who did not.

set.seed(42)
pred <- predict(svm_best, df_test, decision.values = TRUE, probability = TRUE)
pred_prob <- attr(pred, 'probabilities')
df_prob <- cbind(df_test %>% select(suicide), y_test, pred_prob)
names(df_prob) = c('y_true' ,'y_pred', 'negative', 'positive')
summary(df_prob)
##  y_true y_pred    negative          positive     
##  0:28   0:32   Min.   :0.08737   Min.   :0.3383  
##  1:12   1: 8   1st Qu.:0.18629   1st Qu.:0.6181  
##                Median :0.26803   Median :0.7320  
##                Mean   :0.30472   Mean   :0.6953  
##                3rd Qu.:0.38187   3rd Qu.:0.8137  
##                Max.   :0.66166   Max.   :0.9126

The predicted probabilities are in line with our analysis of the training and test sets: the SVM leans toward predicting that an individual will not attempt suicide, producing many true negatives. Its real weakness is recall; only 4 of the 12 individuals who attempted suicide are flagged. This SVM would not do well in the real world, since it misses the most important cases: the individuals we want to stop from attempting suicide.

There is always a trade-off between precision and recall, and depending on the situation we want to prioritize one over the other. In the case of suicide prediction, false positives are acceptable; they mean we spent time on someone who was not going to attempt suicide anyway. False negatives are the worst outcome, since they represent individuals who receive no attention and may well go on to attempt suicide.
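To make that trade-off concrete, the base-R sketch below sweeps the decision threshold over the SVM's positive-class probabilities from the df_prob table built above, reporting precision and recall at each cutoff:

# Precision and recall at a range of cutoffs on the positive-class probability
thresholds <- seq(0.1, 0.9, by = 0.1)
metrics <- sapply(thresholds, function(t) {
  y_hat <- as.integer(df_prob$positive > t)
  tp <- sum(y_hat == 1 & df_prob$y_true == '1')
  fp <- sum(y_hat == 1 & df_prob$y_true == '0')
  fn <- sum(y_hat == 0 & df_prob$y_true == '1')
  c(threshold = t,
    precision = ifelse(tp + fp > 0, tp / (tp + fp), NA),
    recall = ifelse(tp + fn > 0, tp / (tp + fn), NA))
})
t(metrics)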

Modeling Comparison

Detailed results of the individual models are given above; a side-by-side comparison of our gradient boosted model and our SVM is provided below:

Model Accuracy Specificity Sensitivity
Gradient Boost 65.0% 70.5% 33.3%
SVM 70.0% 75.0% 50.0%

While the accuracy of these models may at first glance seem acceptable, the purpose of these models was to predict whether a patient attempted suicide. Accuracy is therefore not the right yardstick; we should focus on sensitivity, which tells us how many of the subjects who did attempt suicide were correctly identified. The sensitivities above are lackluster, and we would not recommend using either model in its current state to predict suicide attempts. If, however, the objective were to identify those who will not attempt suicide, either model might serve, since both have a specificity above 70%.

Conclusion

Having concluded our analysis and modeling, we can see that this dataset is rich with information, both for better understanding patients with ADHD and as a source of indicators for those who may be contemplating suicide. As a next step, we would bring different or additional features into the models to see whether more signal can be found for predicting who may attempt suicide. As can be seen, modeling of this type, done carefully, has the potential to save lives.