Source Code: https://github.com/djlofland/DATA622_S2021_Group2/tree/master/Homework4
For this assignment, we will be working with a very interesting mental health dataset from a real-life research project. All identifying information, of course, has been removed. The attached spreadsheet has the data (the tab name “Data”). The data dictionary is given in the second tab. You can get as creative as you want. The assignment is designed to really get you to think about how you could use different methods.
We start by loading the Excel dataset provided into a dataframe.
Next, we can take a quick look at the dataframe features. By reviewing the data dictionary and printing the first few rows below, we can start to establish some context around the data collected from the survey.
# Display first few rows for inspection
kable(head(df)) %>%
kable_styling(bootstrap_options = "basic")| Initial | Age | Sex | Race | ADHD Q1 | ADHD Q2 | ADHD Q3 | ADHD Q4 | ADHD Q5 | ADHD Q6 | ADHD Q7 | ADHD Q8 | ADHD Q9 | ADHD Q10 | ADHD Q11 | ADHD Q12 | ADHD Q13 | ADHD Q14 | ADHD Q15 | ADHD Q16 | ADHD Q17 | ADHD Q18 | ADHD Total | MD Q1a | MD Q1b | MD Q1c | MD Q1d | MD Q1e | MD Q1f | MD Q1g | MD Q1h | MD Q1i | MD Q1j | MD Q1k | MD Q1L | MD Q1m | MD Q2 | MD Q3 | MD TOTAL | Alcohol | THC | Cocaine | Stimulants | Sedative-hypnotics | Opioids | Court order | Education | Hx of Violence | Disorderly Conduct | Suicide | Abuse | Non-subst Dx | Subst Dx | Psych meds. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| JA | 24 | 1 | 1 | 1 | 1 | 4 | 2 | 3 | 1 | 1 | 3 | 2 | 4 | 2 | 4 | 1 | 0 | 3 | 1 | 3 | 4 | 40 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 3 | 15 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 11 | 0 | 1 | 1 | 0 | 2 | 0 | 2 |
| LA | 48 | 2 | 1 | 3 | 3 | 4 | 4 | 5 | 2 | 2 | 3 | 2 | 4 | 1 | 4 | 2 | 4 | 4 | 3 | 1 | 4 | 55 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 3 | 14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 14 | 0 | 0 | 1 | 4 | 1 | 0 | 1 |
| MD | 51 | 2 | 1 | 2 | 1 | 2 | 1 | 3 | 3 | 3 | 2 | 0 | 1 | 2 | 0 | 2 | 2 | 3 | 2 | 1 | 1 | 31 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12 | 0 | 0 | 0 | 6 | 2 | 0 | 1 |
| RD | 43 | 1 | 1 | 3 | 3 | 2 | 2 | 4 | 3 | 2 | 4 | 4 | 2 | 3 | 1 | 3 | 3 | 1 | 2 | 1 | 2 | 45 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 3 | 13 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 12 | 0 | 0 | 1 | 7 | 2 | 0 | 2 |
| RB | 34 | 1 | 1 | 4 | 4 | 2 | 4 | 4 | 2 | 3 | 4 | 4 | 2 | 4 | 1 | 3 | 2 | 1 | 2 | 1 | 1 | 48 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 7 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 9 | 1 | 1 | 1 | 0 | 2 | 0 | 0 |
| SB | 39 | 2 | 1 | 2 | 3 | 1 | 4 | 3 | 2 | 3 | 4 | 4 | 2 | 4 | 2 | 4 | 4 | 3 | 4 | 3 | 3 | 55 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 3 | 14 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 11 | 0 | 1 | 1 | 2 | 0 | 0 | 0 |
After reading in the dataframe and identifying the data dictionary that explains the values for our features, our next step is to do some exploratory analysis. We can initially run the skim() function to find important generalized information about the dataset.
# Show df summary details
skim(df)
| Name | df |
| Number of rows | 175 |
| Number of columns | 54 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 53 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Initial | 0 | 1 | 2 | 3 | 0 | 108 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Age | 0 | 1.00 | 39.47 | 11.17 | 18 | 29.5 | 42 | 48.0 | 69 | ▆▅▇▅▁ |
| Sex | 0 | 1.00 | 1.43 | 0.50 | 1 | 1.0 | 1 | 2.0 | 2 | ▇▁▁▁▆ |
| Race | 0 | 1.00 | 1.64 | 0.69 | 1 | 1.0 | 2 | 2.0 | 6 | ▇▁▁▁▁ |
| ADHD Q1 | 0 | 1.00 | 1.70 | 1.29 | 0 | 1.0 | 2 | 3.0 | 4 | ▇▇▇▆▃ |
| ADHD Q2 | 0 | 1.00 | 1.91 | 1.25 | 0 | 1.0 | 2 | 3.0 | 4 | ▅▇▇▆▅ |
| ADHD Q3 | 0 | 1.00 | 1.91 | 1.27 | 0 | 1.0 | 2 | 3.0 | 4 | ▅▇▇▆▅ |
| ADHD Q4 | 0 | 1.00 | 2.10 | 1.34 | 0 | 1.0 | 2 | 3.0 | 4 | ▅▅▇▅▆ |
| ADHD Q5 | 0 | 1.00 | 2.26 | 1.44 | 0 | 1.0 | 3 | 3.0 | 5 | ▇▅▇▆▁ |
| ADHD Q6 | 0 | 1.00 | 1.91 | 1.31 | 0 | 1.0 | 2 | 3.0 | 4 | ▆▅▇▇▃ |
| ADHD Q7 | 0 | 1.00 | 1.83 | 1.19 | 0 | 1.0 | 2 | 3.0 | 4 | ▃▇▇▃▃ |
| ADHD Q8 | 0 | 1.00 | 2.14 | 1.29 | 0 | 1.0 | 2 | 3.0 | 4 | ▃▇▇▇▆ |
| ADHD Q9 | 0 | 1.00 | 1.91 | 1.32 | 0 | 1.0 | 2 | 3.0 | 4 | ▆▇▇▇▅ |
| ADHD Q10 | 0 | 1.00 | 2.12 | 1.23 | 0 | 1.0 | 2 | 3.0 | 4 | ▂▇▇▆▅ |
| ADHD Q11 | 0 | 1.00 | 2.27 | 1.24 | 0 | 1.0 | 2 | 3.0 | 4 | ▂▆▇▇▆ |
| ADHD Q12 | 0 | 1.00 | 1.29 | 1.21 | 0 | 0.0 | 1 | 2.0 | 4 | ▇▇▆▂▂ |
| ADHD Q13 | 0 | 1.00 | 2.37 | 1.23 | 0 | 1.5 | 2 | 3.0 | 4 | ▂▅▇▇▆ |
| ADHD Q14 | 0 | 1.00 | 2.25 | 1.35 | 0 | 1.0 | 2 | 3.0 | 4 | ▅▅▇▇▆ |
| ADHD Q15 | 0 | 1.00 | 1.63 | 1.39 | 0 | 0.0 | 1 | 3.0 | 4 | ▇▆▆▅▃ |
| ADHD Q16 | 0 | 1.00 | 1.70 | 1.38 | 0 | 1.0 | 1 | 3.0 | 4 | ▆▇▆▃▅ |
| ADHD Q17 | 0 | 1.00 | 1.53 | 1.29 | 0 | 0.0 | 1 | 2.0 | 4 | ▇▇▇▃▃ |
| ADHD Q18 | 0 | 1.00 | 1.47 | 1.30 | 0 | 0.0 | 1 | 2.0 | 4 | ▇▇▆▃▃ |
| ADHD Total | 0 | 1.00 | 34.32 | 16.70 | 0 | 21.0 | 33 | 47.5 | 72 | ▃▆▇▆▂ |
| MD Q1a | 0 | 1.00 | 0.55 | 0.50 | 0 | 0.0 | 1 | 1.0 | 1 | ▆▁▁▁▇ |
| MD Q1b | 0 | 1.00 | 0.57 | 0.50 | 0 | 0.0 | 1 | 1.0 | 1 | ▆▁▁▁▇ |
| MD Q1c | 0 | 1.00 | 0.54 | 0.50 | 0 | 0.0 | 1 | 1.0 | 1 | ▇▁▁▁▇ |
| MD Q1d | 0 | 1.00 | 0.58 | 0.49 | 0 | 0.0 | 1 | 1.0 | 1 | ▆▁▁▁▇ |
| MD Q1e | 0 | 1.00 | 0.55 | 0.50 | 0 | 0.0 | 1 | 1.0 | 1 | ▆▁▁▁▇ |
| MD Q1f | 0 | 1.00 | 0.70 | 0.46 | 0 | 0.0 | 1 | 1.0 | 1 | ▃▁▁▁▇ |
| MD Q1g | 0 | 1.00 | 0.72 | 0.45 | 0 | 0.0 | 1 | 1.0 | 1 | ▃▁▁▁▇ |
| MD Q1h | 0 | 1.00 | 0.56 | 0.50 | 0 | 0.0 | 1 | 1.0 | 1 | ▆▁▁▁▇ |
| MD Q1i | 0 | 1.00 | 0.59 | 0.49 | 0 | 0.0 | 1 | 1.0 | 1 | ▆▁▁▁▇ |
| MD Q1j | 0 | 1.00 | 0.39 | 0.49 | 0 | 0.0 | 0 | 1.0 | 1 | ▇▁▁▁▅ |
| MD Q1k | 0 | 1.00 | 0.49 | 0.50 | 0 | 0.0 | 0 | 1.0 | 1 | ▇▁▁▁▇ |
| MD Q1L | 0 | 1.00 | 0.58 | 0.49 | 0 | 0.0 | 1 | 1.0 | 1 | ▆▁▁▁▇ |
| MD Q1m | 0 | 1.00 | 0.49 | 0.50 | 0 | 0.0 | 0 | 1.0 | 1 | ▇▁▁▁▇ |
| MD Q2 | 0 | 1.00 | 0.72 | 0.45 | 0 | 0.0 | 1 | 1.0 | 1 | ▃▁▁▁▇ |
| MD Q3 | 0 | 1.00 | 2.01 | 1.07 | 0 | 1.0 | 2 | 3.0 | 3 | ▂▂▁▅▇ |
| MD TOTAL | 0 | 1.00 | 10.02 | 4.81 | 0 | 6.5 | 11 | 14.0 | 17 | ▃▃▆▇▇ |
| Alcohol | 4 | 0.98 | 1.35 | 1.39 | 0 | 0.0 | 1 | 3.0 | 3 | ▇▂▁▁▆ |
| THC | 4 | 0.98 | 0.81 | 1.27 | 0 | 0.0 | 0 | 1.5 | 3 | ▇▁▁▁▃ |
| Cocaine | 4 | 0.98 | 1.09 | 1.39 | 0 | 0.0 | 0 | 3.0 | 3 | ▇▁▁▁▅ |
| Stimulants | 4 | 0.98 | 0.12 | 0.53 | 0 | 0.0 | 0 | 0.0 | 3 | ▇▁▁▁▁ |
| Sedative-hypnotics | 4 | 0.98 | 0.12 | 0.54 | 0 | 0.0 | 0 | 0.0 | 3 | ▇▁▁▁▁ |
| Opioids | 4 | 0.98 | 0.39 | 0.99 | 0 | 0.0 | 0 | 0.0 | 3 | ▇▁▁▁▁ |
| Court order | 5 | 0.97 | 0.09 | 0.28 | 0 | 0.0 | 0 | 0.0 | 1 | ▇▁▁▁▁ |
| Education | 9 | 0.95 | 11.90 | 2.17 | 6 | 11.0 | 12 | 13.0 | 19 | ▁▅▇▂▁ |
| Hx of Violence | 11 | 0.94 | 0.24 | 0.43 | 0 | 0.0 | 0 | 0.0 | 1 | ▇▁▁▁▂ |
| Disorderly Conduct | 11 | 0.94 | 0.73 | 0.45 | 0 | 0.0 | 1 | 1.0 | 1 | ▃▁▁▁▇ |
| Suicide | 13 | 0.93 | 0.30 | 0.46 | 0 | 0.0 | 0 | 1.0 | 1 | ▇▁▁▁▃ |
| Abuse | 14 | 0.92 | 1.33 | 2.12 | 0 | 0.0 | 0 | 2.0 | 7 | ▇▂▁▁▁ |
| Non-subst Dx | 22 | 0.87 | 0.44 | 0.68 | 0 | 0.0 | 0 | 1.0 | 2 | ▇▁▃▁▁ |
| Subst Dx | 23 | 0.87 | 1.14 | 0.93 | 0 | 0.0 | 1 | 2.0 | 3 | ▆▇▁▅▂ |
| Psych meds. | 118 | 0.33 | 0.96 | 0.80 | 0 | 0.0 | 1 | 2.0 | 2 | ▇▁▇▁▆ |
We can see above that the dataset consists of 175 observations and 54 features. Several features have a substantial number of missing values, most notably Psych meds. (118 missing values), along with Subst Dx and Non-subst Dx (23 and 22 missing values, respectively). These correspond to medication and diagnosis questions that apply only to a subset of respondents: when sampling a broader population, only a minority will be on psychiatric medication or have a substance-related or non-substance-related diagnosis recorded. Many of these missing values therefore make sense, but they will be important to keep in mind when we start our clustering techniques.
Another useful aspect of the skim() output is the quick look it gives at the distribution of responses for each feature. From the Education distribution, we can see that the median education level was a high school diploma (12th grade). Slightly more males than females responded to the survey, and the median age was around 42 years. Additionally, for many of the substance-use questions, such as Cocaine, THC, and Alcohol, respondents tended to answer either "no use" or "dependence", with not much gray area in between. Finally, the vast majority of respondents fall into the two largest race categories.
As we saw in our data summary, we have some missing values. Let's explore this a little further to see if any important patterns stand out.
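Before plotting the missingness pattern, a quick per-column tally (a minimal base-R sketch, not part of the original output) shows where the gaps are concentrated:
# count missing values per column, largest counts first
sort(colSums(is.na(df)), decreasing = TRUE)[1:10]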
# Use naniar package to plot missing value patterns
gg_miss_upset(df)
The missing data only appears in a few of our features: Suicide, Abuse, Non-subst Dx, Subst Dx, and Psych meds..
We can use kNN imputation to help replace many of the missing values. With the kNN method, a missing value is imputed by looking at other records with similar features: once the k most similar records are found, they are used to infer the missing value. Note that imputation can introduce bias; there is a trade-off between losing information (by dropping incomplete rows) and introducing bias when filling in holes.
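As an illustration of the knobs available (a hedged sketch; the column subset and k value here are our own choices, not part of the original analysis), VIM's kNN() lets us choose how many neighbours to use and which columns to impute, and by default it appends *_imp indicator columns flagging imputed cells, which is why we select only the original columns after imputing below:
# illustrative call: impute only the sparse columns using 5 nearest neighbours
df_imp <- kNN(df,
              variable = c("Suicide", "Abuse", "Non-subst Dx", "Subst Dx", "Psych meds."),
              k = 5,           # number of nearest neighbours consulted per missing value
              imp_var = TRUE)  # append <column>_imp flags marking which cells were imputed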
We also see that each respondent is identified by their initials. This won't be useful for our purposes (note there are also duplicates where multiple people share the same initials), so we drop this column below.
## Use kNN to impute the data
df<-kNN(df)%>%
select(Age:'Psych meds.')Looking at our dataset, we see that missing values are no longer an issue; however, we still have some additional cleanup to perform on our dataset below before we proceed with kNN modeling.
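A one-line sanity check (a small sketch) confirms the imputation left no missing values behind:
# total number of remaining missing values; expect 0 after kNN imputation
sum(is.na(df))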
# Display dataframe summary
skim(df)
| Name | df |
| Number of rows | 175 |
| Number of columns | 53 |
| _______________________ | |
| Column type frequency: | |
| numeric | 53 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Age | 0 | 1 | 39.47 | 11.17 | 18 | 29.5 | 42 | 48.0 | 69 | ▆▅▇▅▁ |
| Sex | 0 | 1 | 1.43 | 0.50 | 1 | 1.0 | 1 | 2.0 | 2 | ▇▁▁▁▆ |
| Race | 0 | 1 | 1.64 | 0.69 | 1 | 1.0 | 2 | 2.0 | 6 | ▇▁▁▁▁ |
| ADHD Q1 | 0 | 1 | 1.70 | 1.29 | 0 | 1.0 | 2 | 3.0 | 4 | ▇▇▇▆▃ |
| ADHD Q2 | 0 | 1 | 1.91 | 1.25 | 0 | 1.0 | 2 | 3.0 | 4 | ▅▇▇▆▅ |
| ADHD Q3 | 0 | 1 | 1.91 | 1.27 | 0 | 1.0 | 2 | 3.0 | 4 | ▅▇▇▆▅ |
| ADHD Q4 | 0 | 1 | 2.10 | 1.34 | 0 | 1.0 | 2 | 3.0 | 4 | ▅▅▇▅▆ |
| ADHD Q5 | 0 | 1 | 2.26 | 1.44 | 0 | 1.0 | 3 | 3.0 | 5 | ▇▅▇▆▁ |
| ADHD Q6 | 0 | 1 | 1.91 | 1.31 | 0 | 1.0 | 2 | 3.0 | 4 | ▆▅▇▇▃ |
| ADHD Q7 | 0 | 1 | 1.83 | 1.19 | 0 | 1.0 | 2 | 3.0 | 4 | ▃▇▇▃▃ |
| ADHD Q8 | 0 | 1 | 2.14 | 1.29 | 0 | 1.0 | 2 | 3.0 | 4 | ▃▇▇▇▆ |
| ADHD Q9 | 0 | 1 | 1.91 | 1.32 | 0 | 1.0 | 2 | 3.0 | 4 | ▆▇▇▇▅ |
| ADHD Q10 | 0 | 1 | 2.12 | 1.23 | 0 | 1.0 | 2 | 3.0 | 4 | ▂▇▇▆▅ |
| ADHD Q11 | 0 | 1 | 2.27 | 1.24 | 0 | 1.0 | 2 | 3.0 | 4 | ▂▆▇▇▆ |
| ADHD Q12 | 0 | 1 | 1.29 | 1.21 | 0 | 0.0 | 1 | 2.0 | 4 | ▇▇▆▂▂ |
| ADHD Q13 | 0 | 1 | 2.37 | 1.23 | 0 | 1.5 | 2 | 3.0 | 4 | ▂▅▇▇▆ |
| ADHD Q14 | 0 | 1 | 2.25 | 1.35 | 0 | 1.0 | 2 | 3.0 | 4 | ▅▅▇▇▆ |
| ADHD Q15 | 0 | 1 | 1.63 | 1.39 | 0 | 0.0 | 1 | 3.0 | 4 | ▇▆▆▅▃ |
| ADHD Q16 | 0 | 1 | 1.70 | 1.38 | 0 | 1.0 | 1 | 3.0 | 4 | ▆▇▆▃▅ |
| ADHD Q17 | 0 | 1 | 1.53 | 1.29 | 0 | 0.0 | 1 | 2.0 | 4 | ▇▇▇▃▃ |
| ADHD Q18 | 0 | 1 | 1.47 | 1.30 | 0 | 0.0 | 1 | 2.0 | 4 | ▇▇▆▃▃ |
| ADHD Total | 0 | 1 | 34.32 | 16.70 | 0 | 21.0 | 33 | 47.5 | 72 | ▃▆▇▆▂ |
| MD Q1a | 0 | 1 | 0.55 | 0.50 | 0 | 0.0 | 1 | 1.0 | 1 | ▆▁▁▁▇ |
| MD Q1b | 0 | 1 | 0.57 | 0.50 | 0 | 0.0 | 1 | 1.0 | 1 | ▆▁▁▁▇ |
| MD Q1c | 0 | 1 | 0.54 | 0.50 | 0 | 0.0 | 1 | 1.0 | 1 | ▇▁▁▁▇ |
| MD Q1d | 0 | 1 | 0.58 | 0.49 | 0 | 0.0 | 1 | 1.0 | 1 | ▆▁▁▁▇ |
| MD Q1e | 0 | 1 | 0.55 | 0.50 | 0 | 0.0 | 1 | 1.0 | 1 | ▆▁▁▁▇ |
| MD Q1f | 0 | 1 | 0.70 | 0.46 | 0 | 0.0 | 1 | 1.0 | 1 | ▃▁▁▁▇ |
| MD Q1g | 0 | 1 | 0.72 | 0.45 | 0 | 0.0 | 1 | 1.0 | 1 | ▃▁▁▁▇ |
| MD Q1h | 0 | 1 | 0.56 | 0.50 | 0 | 0.0 | 1 | 1.0 | 1 | ▆▁▁▁▇ |
| MD Q1i | 0 | 1 | 0.59 | 0.49 | 0 | 0.0 | 1 | 1.0 | 1 | ▆▁▁▁▇ |
| MD Q1j | 0 | 1 | 0.39 | 0.49 | 0 | 0.0 | 0 | 1.0 | 1 | ▇▁▁▁▅ |
| MD Q1k | 0 | 1 | 0.49 | 0.50 | 0 | 0.0 | 0 | 1.0 | 1 | ▇▁▁▁▇ |
| MD Q1L | 0 | 1 | 0.58 | 0.49 | 0 | 0.0 | 1 | 1.0 | 1 | ▆▁▁▁▇ |
| MD Q1m | 0 | 1 | 0.49 | 0.50 | 0 | 0.0 | 0 | 1.0 | 1 | ▇▁▁▁▇ |
| MD Q2 | 0 | 1 | 0.72 | 0.45 | 0 | 0.0 | 1 | 1.0 | 1 | ▃▁▁▁▇ |
| MD Q3 | 0 | 1 | 2.01 | 1.07 | 0 | 1.0 | 2 | 3.0 | 3 | ▂▂▁▅▇ |
| MD TOTAL | 0 | 1 | 10.02 | 4.81 | 0 | 6.5 | 11 | 14.0 | 17 | ▃▃▆▇▇ |
| Alcohol | 0 | 1 | 1.35 | 1.39 | 0 | 0.0 | 1 | 3.0 | 3 | ▇▂▁▁▇ |
| THC | 0 | 1 | 0.84 | 1.29 | 0 | 0.0 | 0 | 2.0 | 3 | ▇▁▁▁▃ |
| Cocaine | 0 | 1 | 1.09 | 1.39 | 0 | 0.0 | 0 | 3.0 | 3 | ▇▁▁▁▅ |
| Stimulants | 0 | 1 | 0.12 | 0.53 | 0 | 0.0 | 0 | 0.0 | 3 | ▇▁▁▁▁ |
| Sedative-hypnotics | 0 | 1 | 0.12 | 0.54 | 0 | 0.0 | 0 | 0.0 | 3 | ▇▁▁▁▁ |
| Opioids | 0 | 1 | 0.38 | 0.98 | 0 | 0.0 | 0 | 0.0 | 3 | ▇▁▁▁▁ |
| Court order | 0 | 1 | 0.09 | 0.28 | 0 | 0.0 | 0 | 0.0 | 1 | ▇▁▁▁▁ |
| Education | 0 | 1 | 11.90 | 2.12 | 6 | 11.0 | 12 | 12.0 | 19 | ▁▅▇▂▁ |
| Hx of Violence | 0 | 1 | 0.23 | 0.42 | 0 | 0.0 | 0 | 0.0 | 1 | ▇▁▁▁▂ |
| Disorderly Conduct | 0 | 1 | 0.73 | 0.44 | 0 | 0.0 | 1 | 1.0 | 1 | ▃▁▁▁▇ |
| Suicide | 0 | 1 | 0.29 | 0.46 | 0 | 0.0 | 0 | 1.0 | 1 | ▇▁▁▁▃ |
| Abuse | 0 | 1 | 1.24 | 2.06 | 0 | 0.0 | 0 | 2.0 | 7 | ▇▂▁▁▁ |
| Non-subst Dx | 0 | 1 | 0.41 | 0.65 | 0 | 0.0 | 0 | 1.0 | 2 | ▇▁▂▁▁ |
| Subst Dx | 0 | 1 | 1.11 | 0.89 | 0 | 0.0 | 1 | 2.0 | 3 | ▅▇▁▃▂ |
| Psych meds. | 0 | 1 | 0.95 | 0.70 | 0 | 0.0 | 1 | 1.0 | 2 | ▅▁▇▁▃ |
We know that most of our features are either factors or ordinal factors. Let's recode them from numeric to factor so our models don't incorrectly treat them as continuous. Note that we'll leave Age as continuous even though it's technically ordinal, and for simplicity, binary features (only 0 or 1) can be left as integers. Looking ahead, these factors will produce a large number of columns when we dummy code in preparation for modeling, since dummy encoding creates a separate column for each level, and high dimensionality can cause problems. Fortunately, we don't have that many features and levels, so we aren't starting with an excessive number of columns. That said, we will likely need to either reduce dimensions through PCA or use different recoding approaches.
# We'll leave age as an integer
df$Age <- as.integer(df$Age)
# Misc columns (note: some columns are ordinal, others are not)
df$Sex <- factor(df$Sex)
df$Race <- factor(df$Race)
df$Alcohol <- factor(df$Alcohol)
df$THC <- factor(df$THC)
df$Cocaine <- factor(df$Cocaine)
df$Stimulants <- factor(df$Stimulants)
df$`Sedative-hypnotics` <- factor(df$`Sedative-hypnotics`)
df$Opioids <- factor(df$Opioids)
df$Education <- factor(df$Education)
df$`Court order` <- factor(df$`Court order`)
df$`Hx of Violence` <- factor(df$`Hx of Violence`)
df$`Disorderly Conduct` <- factor(df$`Disorderly Conduct`)
df$Suicide <- factor(df$Suicide)
df$Abuse <- factor(df$Abuse)
df$`Non-subst Dx` <- factor(df$`Non-subst Dx`)
df$`Subst Dx` <- factor(df$`Subst Dx`)
df$`Psych meds.` <- factor(df$`Psych meds.`)
# Quick trick to fix all columns matching name pattern
df <- df %>%
mutate(across(contains('ADHD'), factor)) %>%
mutate(across(contains('MD Q'), factor))
# Back these out since they are totals (sum), not factor
df$`ADHD Total` <- as.integer(df$`ADHD Total`)
df$`MD TOTAL` <- as.integer(df$`MD TOTAL`)For the continuous variables, we can examine the distributions, broken out by the target variable, Sex.
df_plot = df %>% mutate(Sex = ifelse(Sex == 1, "Male", "Female"))
par(mfrow=c(2,3))
par(oma=c(0,0,6,0))
tbl1 <- table(df_plot$Sex, df_plot$Alcohol)
barplot(tbl1, main="Alcohol use",
col=c("thistle3", "cornflowerblue"), ylim=range(c(0, 180)), legend=TRUE)
tbl2 <- table(df_plot$Sex, df_plot$Cocaine)
barplot(tbl2, main="Cocain use",
col=c("thistle3", "cornflowerblue"), ylim=range(c(0, 180)))
tbl3 <- table(df_plot$Sex, df_plot$THC)
barplot(tbl3, main="THC use",
col=c("thistle3", "cornflowerblue"), ylim=range(c(0, 180)))
tbl4 <- table(df_plot$Sex, df_plot$Stimulants)
barplot(tbl4, main="Stimulant use",
col=c("thistle3", "cornflowerblue"), ylim=range(c(0, 180)))
tbl5 <- table(df_plot$Sex, df_plot$`Sedative-hypnotics`)
barplot(tbl5, main="Sedative-hypnotic use",
col=c("thistle3", "cornflowerblue"), ylim=range(c(0, 180)))
tbl6 <- table(df_plot$Sex, df_plot$Opioids)
barplot(tbl6, main="Opioid use",
col=c("thistle3", "cornflowerblue"), ylim=range(c(0, 180)))
title("(0 = No use, 1 = Use, 2 = Abuse, 3 = Dependence)" , outer=TRUE)
par(oma=c(1,0,3,0))
title("SUBSTANCE USE RESPONSES BY SEX" , outer=TRUE)myColors <- ifelse(df_plot$Sex=="Male" , "thistle3", "cornflowerblue")
par(mfrow=c(1,2))
boxplot(df_plot$`ADHD Total`~df_plot$Sex, xlab="", ylab = "ADHD Total", main="ADHD Totals by Sex", col=myColors)
boxplot(df_plot$`MD TOTAL`~df_plot$Sex, xlab="", ylab = "MD Total", main="MD Totals by Sex", col=myColors)
par(mfrow=c(2,2))
par(oma=c(0,0,6,0))
tbl7 <- table(df_plot$Sex, df_plot$`Court order`)
barplot(tbl7, main="Court order",
col=c("thistle3", "cornflowerblue"), ylim=range(c(0, 180)))
tbl8 <- table(df_plot$Sex, df_plot$`Hx of Violence`)
barplot(tbl8, main="History of violence",
col=c("thistle3", "cornflowerblue"), ylim=range(c(0, 180)))
tbl9 <- table(df_plot$Sex, df_plot$`Disorderly Conduct`)
barplot(tbl9, main="Disorderly conduct",
col=c("thistle3", "cornflowerblue"), ylim=range(c(0, 180)))
tbl11 <- table(df_plot$Sex, df_plot$Suicide)
barplot(tbl11, main="Suicide",
col=c("thistle3", "cornflowerblue"), ylim=range(c(0, 180)), legend=TRUE)
title("(0 = No, 1 = Yes)" , outer=TRUE)
par(oma=c(1,0,3,0))
title("MENTAL HEALTH AND JUSTICE INVOLVEMENT BY SEX" , outer=TRUE)par(mfrow=c(1,1))
par(oma=c(4,0,0,0))
df_plot = df_plot %>% mutate(Abuse = ifelse(Abuse == 0, "None",
ifelse(Abuse == 1, "Physical (P)",
ifelse(Abuse == 2, "Sexual (S)",
ifelse(Abuse == 3, "Emotional (E)",
ifelse(Abuse == 4, "P & S",
ifelse(Abuse == 5, "P & E",
ifelse(Abuse == 6, "S & E",
ifelse(Abuse == 7, "P & E & S", "Unknown")))))))))
tbl10 <- table(df_plot$Sex, df_plot$Abuse)
barplot(tbl10, main="Abuse",
col=c("thistle3", "cornflowerblue"), ylim=range(c(0, 180)), las=2, legend=TRUE)When looking at these figures broken out by our target variable Sex, we can see a few things:
Interestingly, female respondents had ADHD total scores trending higher than those of male respondents.
Mood disorder totals were similar across male and female respondents.
Both male and female respondents answered "No use" more frequently for substances such as stimulants, sedative-hypnotics, and opioids than for more common substances such as alcohol, THC, and cocaine.
Males more frequently reported having a history of violence or disorderly conduct.
Conversely, females reported slightly more frequently than males having attempted suicide.
Finally, looking closely at the breakdown of responses to the Abuse question by sex, females far more frequently reported having experienced some form of abuse, especially sexual abuse or a combination of sexual abuse with other forms of abuse, compared with their male counterparts.
We'll need to dummy code our categorical variables. This process creates a new 0/1 column for each level of each factor. Note that a full-rank encoding drops one level per factor to serve as the baseline: with a five-level factor we would get four columns, and a row of all zeros would represent the reference level, which reduces the number of columns. Here we keep a column for every level and instead rely on the variance filter below to drop columns that carry almost no information. With the dummy columns in place, the original factor columns are no longer part of the modeling dataset.
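If we did want the baseline-dropping (full-rank) encoding described above, caret supports it directly. A hedged sketch (df_fullrank is our own name; this is not the encoding used in the rest of this analysis):
# full-rank alternative: one level per factor is dropped and becomes the implicit baseline
df_fullrank <- dummyVars(~ ., data = df, fullRank = TRUE)
df_fullrank <- data.frame(predict(df_fullrank, newdata = df))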
# dummy encode our categorical features
df_dummy <- dummyVars(~ 0 + ., drop2nd=TRUE, data = df)
df_dummy <- data.frame(predict(df_dummy, newdata = df))
# Uncomment if we want to verify the dummy df summary
#skim(df_dummy)
We should remove columns that carry essentially no information. First we check whether any columns have strictly zero variance (none do, as shown below); we then use nearZeroVar() to drop the near-zero-variance dummy columns, i.e. levels that almost never occur, since these extremely sparse indicators add dimensions without adding much signal.
nzv <- nearZeroVar(df_dummy, saveMetrics= TRUE)
(nzv %>% filter(zeroVar == T))## [1] freqRatio percentUnique zeroVar nzv
## <0 rows> (or 0-length row.names)
# drop near-zero-variance columns
nzv <- nearZeroVar(df_dummy)
df_dummy <- df_dummy[, -nzv]Age is our only continuous variable. We normalize our continuous features using a simple z-score standardization. Although this may help our KNN model, we’ll have to be careful about interpretability later on!
# z-score scale continuous features
df[, c("Age")] <- scale(df[, c("Age")])
df_dummy[, c("Age")] <- scale(df_dummy[, c("Age")])Our data set is ready for our unsupervised models.
# quick inspect of dataframe
kable(head(df_dummy)) %>%
kable_styling(bootstrap_options = "basic")| Age | Sex.1 | Sex.2 | Race.1 | Race.2 | X.ADHD.Q1.0 | X.ADHD.Q1.1 | X.ADHD.Q1.2 | X.ADHD.Q1.3 | X.ADHD.Q1.4 | X.ADHD.Q2.0 | X.ADHD.Q2.1 | X.ADHD.Q2.2 | X.ADHD.Q2.3 | X.ADHD.Q2.4 | X.ADHD.Q3.0 | X.ADHD.Q3.1 | X.ADHD.Q3.2 | X.ADHD.Q3.3 | X.ADHD.Q3.4 | X.ADHD.Q4.0 | X.ADHD.Q4.1 | X.ADHD.Q4.2 | X.ADHD.Q4.3 | X.ADHD.Q4.4 | X.ADHD.Q5.0 | X.ADHD.Q5.1 | X.ADHD.Q5.2 | X.ADHD.Q5.3 | X.ADHD.Q5.4 | X.ADHD.Q6.0 | X.ADHD.Q6.1 | X.ADHD.Q6.2 | X.ADHD.Q6.3 | X.ADHD.Q6.4 | X.ADHD.Q7.0 | X.ADHD.Q7.1 | X.ADHD.Q7.2 | X.ADHD.Q7.3 | X.ADHD.Q7.4 | X.ADHD.Q8.0 | X.ADHD.Q8.1 | X.ADHD.Q8.2 | X.ADHD.Q8.3 | X.ADHD.Q8.4 | X.ADHD.Q9.0 | X.ADHD.Q9.1 | X.ADHD.Q9.2 | X.ADHD.Q9.3 | X.ADHD.Q9.4 | X.ADHD.Q10.0 | X.ADHD.Q10.1 | X.ADHD.Q10.2 | X.ADHD.Q10.3 | X.ADHD.Q10.4 | X.ADHD.Q11.0 | X.ADHD.Q11.1 | X.ADHD.Q11.2 | X.ADHD.Q11.3 | X.ADHD.Q11.4 | X.ADHD.Q12.0 | X.ADHD.Q12.1 | X.ADHD.Q12.2 | X.ADHD.Q12.3 | X.ADHD.Q12.4 | X.ADHD.Q13.0 | X.ADHD.Q13.1 | X.ADHD.Q13.2 | X.ADHD.Q13.3 | X.ADHD.Q13.4 | X.ADHD.Q14.0 | X.ADHD.Q14.1 | X.ADHD.Q14.2 | X.ADHD.Q14.3 | X.ADHD.Q14.4 | X.ADHD.Q15.0 | X.ADHD.Q15.1 | X.ADHD.Q15.2 | X.ADHD.Q15.3 | X.ADHD.Q15.4 | X.ADHD.Q16.0 | X.ADHD.Q16.1 | X.ADHD.Q16.2 | X.ADHD.Q16.3 | X.ADHD.Q16.4 | X.ADHD.Q17.0 | X.ADHD.Q17.1 | X.ADHD.Q17.2 | X.ADHD.Q17.3 | X.ADHD.Q17.4 | X.ADHD.Q18.0 | X.ADHD.Q18.1 | X.ADHD.Q18.2 | X.ADHD.Q18.3 | X.ADHD.Q18.4 | X.ADHD.Total. | X.MD.Q1a.0 | X.MD.Q1a.1 | X.MD.Q1b.0 | X.MD.Q1b.1 | X.MD.Q1c.0 | X.MD.Q1c.1 | X.MD.Q1d.0 | X.MD.Q1d.1 | X.MD.Q1e.0 | X.MD.Q1e.1 | X.MD.Q1f.0 | X.MD.Q1f.1 | X.MD.Q1g.0 | X.MD.Q1g.1 | X.MD.Q1h.0 | X.MD.Q1h.1 | X.MD.Q1i.0 | X.MD.Q1i.1 | X.MD.Q1j.0 | X.MD.Q1j.1 | X.MD.Q1k.0 | X.MD.Q1k.1 | X.MD.Q1L.0 | X.MD.Q1L.1 | X.MD.Q1m.0 | X.MD.Q1m.1 | X.MD.Q2.0 | X.MD.Q2.1 | X.MD.Q3.0 | X.MD.Q3.1 | X.MD.Q3.2 | X.MD.Q3.3 | X.MD.TOTAL. | Alcohol.0 | Alcohol.1 | Alcohol.3 | THC.0 | THC.1 | THC.3 | Cocaine.0 | Cocaine.1 | Cocaine.3 | Stimulants.0 | X.Sedative.hypnotics.0 | Opioids.0 | Opioids.3 | X.Court.order.0 | X.Court.order.1 | Education.9 | Education.10 | Education.11 | Education.12 | Education.13 | Education.14 | X.Hx.of.Violence.0 | X.Hx.of.Violence.1 | X.Disorderly.Conduct.0 | X.Disorderly.Conduct.1 | Suicide.0 | Suicide.1 | Abuse.0 | Abuse.1 | Abuse.2 | Abuse.5 | X.Non.subst.Dx.0 | X.Non.subst.Dx.1 | X.Non.subst.Dx.2 | X.Subst.Dx.0 | X.Subst.Dx.1 | X.Subst.Dx.2 | X.Subst.Dx.3 | X.Psych.meds..0 | X.Psych.meds..1 | X.Psych.meds..2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| -1.38522071 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 37 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 15 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 0.76320137 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 52 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 14 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1.03175413 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 28 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 5 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 0.31561343 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 42 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 13 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| -0.49004485 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 45 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 7 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| -0.04245691 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 52 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 14 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
Please use a clustering method to find clusters of patients here. Whether you choose to use k-means clustering or hierarchical clustering is up to you as long as you reason through your work. Can you come up with creative names for the profiles you found? (60)
Now that our data has been tidied and is ready for a few unsupervised learning techniques, we'll start with a k-means approach. The k-means algorithm works via the following steps:
First, a certain number of clusters (k) is designated, and the algorithm generates k starting points (typically chosen at random) that serve as the initial "centroids".
Each observation is then assigned to its nearest centroid by Euclidean distance, which effectively assigns it to a cluster.
The algorithm then takes the average of all of the points in each cluster and recomputes that cluster's centroid, and the assignment step is repeated with the new centroids.
The process ends when the centroid positions no longer change from one iteration to the next (a minimal sketch of this loop follows below).
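To make these steps concrete, here is a minimal base-R sketch of the k-means loop (illustrative only; simple_kmeans is our own toy function, it runs a fixed number of iterations rather than checking convergence, and below we rely on the built-in kmeans()):
# toy k-means: random initial centroids, assign to nearest, recompute means, repeat
simple_kmeans <- function(X, k, iters = 10) {
  X <- as.matrix(X)
  centers <- X[sample(nrow(X), k), , drop = FALSE]   # step 1: random initial centroids
  for (i in seq_len(iters)) {
    # step 2: distance from every observation to every centroid, then nearest-centroid assignment
    d <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k, drop = FALSE]
    assignment <- max.col(-d)
    # step 3: recompute each centroid as the mean of its assigned observations
    for (j in seq_len(k)) {
      if (any(assignment == j)) {
        centers[j, ] <- colMeans(X[assignment == j, , drop = FALSE])
      }
    }
  }
  list(cluster = assignment, centers = centers)
}
# e.g. simple_kmeans(df_dummy, k = 5)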
For our dataset, we'll apply the k-means algorithm directly to the dummy-encoded data. However, k-means requires us to choose the number of clusters up front, which can be a difficult task, so we outline that process below.
We'll need to identify an appropriate value for k, the number of clusters the rows will be grouped into. Common techniques include the Elbow method, the Silhouette method, the Gap statistic, and NbClust(); for simplicity, we will use the Elbow approach (alternative criteria are sketched below).
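For reference (a sketch, not run in the original analysis), the same fviz_nbclust() helper also supports the silhouette and gap-statistic criteria:
# alternative criteria for choosing k (illustrative)
fviz_nbclust(df_dummy, kmeans, method = "silhouette")
fviz_nbclust(df_dummy, kmeans, method = "gap_stat", nboot = 50)  # nboot bootstraps for the gap statistic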
# Elbow method
fviz_nbclust(df_dummy, kmeans, method = "wss") +
geom_vline(xintercept = 5, linetype = 2) + # add line for better visualization
labs(subtitle = "Elbow method") # add subtitle
We can see above that the elbow occurs at approximately k = 5; adding more clusters does not meaningfully improve within-cluster variability. Therefore, we'll fit our initial k-means model with \(k=5\).
set.seed(42)
km_res <- kmeans(df_dummy, centers = 5, nstart = 20)
summary(km_res)
## Length Class Mode
## cluster 175 -none- numeric
## centers 850 -none- numeric
## totss 1 -none- numeric
## withinss 5 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 5 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
sil <- silhouette(km_res$cluster, dist(df_dummy))
fviz_silhouette(sil)
## cluster size ave.sil.width
## 1 1 28 0.32
## 2 2 46 0.25
## 3 3 27 0.27
## 4 4 41 0.21
## 5 5 33 0.22
After running our kmeans() function, we can take a look at the final 5 clusters that were created by our algorithm:
As a reminder, the interpretation of the silhouette coefficient is as follows:
\(> 0\), means that the observation is well grouped. The closer the coefficient is to 1, the better the observation is grouped.
\(< 0\), means that the observation has been placed in the wrong cluster.
\(= 0\), means that the observation lies on the border between two clusters.
The silhouette plot above and the average silhouette coefficient help assess whether the clustering is good. If a large majority of the silhouette coefficients are positive, the observations are generally placed in the correct group; the silhouette plot can therefore also be used to help choose the optimal number of clusters.
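For reference, the overall average silhouette width can be pulled straight from the silhouette object created above (a minimal sketch):
# overall average silhouette width across all observations
mean(sil[, "sil_width"])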
It is also possible to plot the clusters using the fviz_cluster() function. Note that a principal component analysis is performed behind the scenes to represent the observations in a two-dimensional plane (the first two principal components).
fviz_cluster(km_res, df_dummy, ellipse.type = "norm")
To provide some context around the clusters generated by our algorithm, we attach the cluster assignments to our df_dummy dataset, then group by cluster and compute the mean of each feature to see whether any noticeable trends arise.
k_means_analysis_df <- df_dummy %>%
mutate(Cluster = km_res$cluster) %>%
group_by(Cluster) %>%
summarise_all("mean")
k_means_analysis_df <- k_means_analysis_df %>%
  select('Cluster', 'Age', 'Sex.1', 'X.ADHD.Total.', 'X.MD.TOTAL.', 'Alcohol.0', 'Alcohol.3', 'THC.0', 'THC.3', 'X.Court.order.1', 'X.Hx.of.Violence.1', 'Suicide.1')
kable(k_means_analysis_df) %>%
kable_styling(bootstrap_options = "responsive")| Cluster | Age | Sex.1 | X.ADHD.Total. | X.MD.TOTAL. | Alcohol.0 | Alcohol.3 | THC.0 | THC.3 | X.Court.order.1 | X.Hx.of.Violence.1 | Suicide.1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | -0.0264716 | 0.3571429 | 54.392857 | 12.142857 | 0.3214286 | 0.4642857 | 0.6785714 | 0.2857143 | 0.0714286 | 0.2142857 | 0.3571429 |
| 2 | -0.1183523 | 0.5869565 | 41.847826 | 13.065217 | 0.3695652 | 0.4565217 | 0.6956522 | 0.1739130 | 0.1304348 | 0.2173913 | 0.4130435 |
| 3 | -0.1120817 | 0.8148148 | 8.518519 | 4.333333 | 0.5925926 | 0.2962963 | 0.5555556 | 0.4074074 | 0.0370370 | 0.1481481 | 0.1481481 |
| 4 | -0.0184400 | 0.5121951 | 29.487805 | 9.414634 | 0.6097561 | 0.2195122 | 0.6829268 | 0.2195122 | 0.0243902 | 0.2926829 | 0.2195122 |
| 5 | 0.3020502 | 0.5757576 | 18.757576 | 9.393939 | 0.4242424 | 0.5151515 | 0.6969697 | 0.2121212 | 0.1515152 | 0.2727273 | 0.2727273 |
From the table above, we isolated 11 features (plus the cluster label) that were of particular interest because their cluster means show distinct patterns. Recall that Age has been z-scored and that Sex.1 gives the proportion of respondents coded as male (Sex = 1). Here are some interpretations of how the k-means clusters grouped the respondents:
Cluster #1: The 28 individuals in this cluster have the highest average ADHD total (about 54) and a high MD total (about 12). The group is mostly female (only about 36% male), nearly half report alcohol dependence, and over a third report having attempted suicide.
Cluster #2: With 46 individuals, this is the largest cluster. It has the highest average MD total (about 13) together with a high ADHD total (about 42), a slight male majority, the highest proportion reporting a suicide attempt (about 41%), and a relatively high rate of court orders.
Cluster #3: The 27 individuals in this cluster are overwhelmingly male (about 81%) and have by far the lowest ADHD and MD totals. A majority report no alcohol use, yet the group also has the highest proportion reporting THC dependence, along with the lowest rates of violence history and attempted suicide.
Cluster #4: The 41 individuals in this cluster have moderate ADHD and MD totals and a roughly even split by sex. They show the lowest rates of alcohol dependence and court orders, but the highest proportion reporting a history of violence (about 29%).
Cluster #5: The final cluster of 33 individuals is the oldest on average (its standardized Age mean is the only clearly positive one). It has moderate-to-low ADHD and MD totals but the highest proportions reporting alcohol dependence (about 52%) and a court order (about 15%).
Hierarchical clustering, also known as hierarchical cluster analysis, is an algorithm that groups similar objects into groups called clusters. The endpoint is a set of clusters, where each cluster is distinct from each other cluster, and the objects within each cluster are broadly similar to each other.
There are five main methods for measuring the distance between clusters, referred to as linkage methods: complete linkage, single linkage, average linkage, centroid linkage, and Ward's minimum variance method.
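One way to compare linkage methods quantitatively (a hedged sketch using the cluster package; this comparison was not part of the original analysis) is the agglomerative coefficient reported by agnes(), where values closer to 1 suggest stronger clustering structure:
library(cluster)
# agglomerative coefficient for several linkage methods on the dummy-encoded data
sapply(c("average", "single", "complete", "ward"),
       function(m) agnes(df_dummy, method = m)$ac)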
We apply hierarchical clustering using the hclust() function, first with complete linkage (method = "complete") and then with Ward's method (method = "ward.D2"):
d <- dist(df_dummy, method = "euclidean")
hclust1 <- hclust(d, method = "complete")plot(hclust1, hang = -1, cex = 0.4, main = "Complete linkage hierarchical clustering dendrogram")
rect.hclust(hclust1,
k = 5, # k is used to specify the number of clusters
border = "steelblue"
)
Above is the complete linkage hierarchical clustering dendrogram, with the 5 delineated clusters outlined in blue. We can compare this to the Ward's method hierarchical clustering dendrogram presented below:
hclust2 <- hclust(d, method = "ward.D2")plot(hclust2, hang = -1, cex = 0.4, main = "Ward's method hierarchical clustering dendrogram")
rect.hclust(hclust2,
k = 5, # k is used to specify the number of clusters
border = "coral"
)
We compare hierarchical clustering with complete linkage versus Ward's method. The tanglegram() function plots two dendrograms side by side, with their labels connected by lines.
dend1 <- as.dendrogram (hclust1)
dend2 <- as.dendrogram (hclust2)dend_list <- dendlist(dend1, dend2)
tanglegram(dend1, dend2,
highlight_distinct_edges = FALSE, # Turn-off dashed lines
common_subtrees_color_lines = FALSE, # Turn-off line colors
common_subtrees_color_branches = TRUE, # Color common branches
main = paste("entanglement =", round(entanglement(dend_list), 2))
)
As we can see above, the entanglement value of 0.31 is close to 0, meaning that the two dendrograms are well aligned. Additionally, the cor_cophenetic() and cor_bakers_gamma() functions, which measure the cophenetic correlation and Baker's gamma correlation coefficients respectively, both come out at approximately 0.814, indicating reasonable correlation between the two dendrograms.
paste0('Cophenetic correlation coefficient: ', cor_cophenetic(dend_list))
## [1] "Cophenetic correlation coefficient: 0.813789033187447"
paste0('Baker correlation coefficient: ', cor_bakers_gamma(dend_list))
## [1] "Baker correlation coefficient: 0.814728261827425"
Given our hierarchical clustering results, we visualized characteristics across each of the five clusters from the complete linkage model. The heatmap below shows, for each cluster, the proportion of cluster members taking each dummy value (i.e. the share of each factor level within the cluster).
require(reshape2)
# Time for the heatmap
# the 1st step here is to have 1 variable per row
# factors have to be converted to characters in order not to be dropped
clust.num <- cutree(hclust1, k = 5)
df.cl <- cbind(df_dummy, clust.num)
df.cl <- df.cl %>% mutate(rownum = row_number())
df.long <- melt(data.frame(lapply(df.cl, as.character), stringsAsFactors=FALSE),
id = c("rownum", "clust.num"), factorsAsStrings=T)
df.long.q <- df.long %>%
group_by(clust.num, variable, value) %>%
mutate(count = n_distinct(rownum)) %>%
distinct(clust.num, variable, value, count)
heatmap.c <- ggplot(df.long.q, aes(x = clust.num, y = variable, ordered = T)) +
geom_tile(aes(fill = count))+
scale_fill_gradient2(low = "#ffffff", mid = "#E8D5B5", high = "#3E3496")
# calculating the percent of each factor level in the absolute count of cluster members
df.long.p <- df.long.q %>%
group_by(clust.num, variable) %>%
mutate(perc = count / sum(count)) %>%
arrange(clust.num)
heatmap.p <- ggplot(df.long.p, aes(x = clust.num, y = variable, ordered = T)) +
geom_tile(aes(fill = perc), alpha = 0.85)+
labs(title = "Distribution of characteristics across clusters", x = "Cluster number", y = NULL) +
geom_hline(yintercept = 3.5) +
geom_hline(yintercept = 10.5) +
geom_hline(yintercept = 13.5) +
geom_hline(yintercept = 17.5) +
geom_hline(yintercept = 21.5) +
scale_fill_gradient2(low = "#ffffff", mid = "#ddddff", high = "#3333ff")
heatmap.p
From the above visualization, we can see a few distinct trends:
Cluster 1: we can see here that most members of this cluster use cocaine, THC, and alcohol, have an education level at 11th grade (indicating they do not have a high school diploma), and take more than one psychotropic medication.
Cluster 2: we can see here that most of the members of this cluster have an education level at the college level (13 or above), do not use alcohol, and agreed consistently with a response of “rarely” for many of the ADHD questions. With further metadata on the survey questions, we’d be able to identify even more trends based on these questions.
Cluster 3: we can see here that most members of this cluster had more than one non-substance-related disorder and answered with “sometimes” or “often” more frequently on ADHD questions.
Cluster 4: we can see here that most members of this cluster had experienced physical abuse, or a combination of physical and emotional abuse, answered “very often” on a lot of the ADHD questions, were users of THC and alcohol, and had a low education level (around 10th grade).
Cluster 5: we can see here that most members of this cluster also answered “very often” on a lot of the ADHD questions and were less likely to use sedative hypnotics and psychiatric medications than other clusters.
Let’s explore using Principal Component Analysis on this dataset. You will note that there are different types of questions in the dataset: columns E-W: ADHD self-report; column X–AM: mood disorders questionnaire, column AN-AS: Individual Substance Misuse; etc. Please reason through your work as you decide on which sets of variables you want to use to conduct Principal Component Analysis. (60)
Principal Component Analysis (PCA) is an unsupervised, non-parametric statistical technique primarily used for dimensionality reduction in machine learning. It is a useful technique for exploratory data analysis, allowing us to better visualize the variation present in a dataset with many variables. For our particular use case, many of the questionnaire variables fall on Likert scales, which were expanded into dummy variables during preparation. This creates many additional features and makes analysis more difficult due to the increased number of dimensions. Therefore, using PCA to reduce the number of dimensions of our entire dummy-encoded dataset and measure the amount of variance explained is beneficial. To do this, we'll use the prcomp() function:
df_dummy.pca <- prcomp(df_dummy, center = TRUE,scale. = TRUE)
summary(df_dummy.pca)## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 4.5376 3.18663 2.73537 2.32943 2.16129 2.0739 1.95356
## Proportion of Variance 0.1211 0.05973 0.04401 0.03192 0.02748 0.0253 0.02245
## Cumulative Proportion 0.1211 0.18085 0.22486 0.25678 0.28426 0.3096 0.33201
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 1.92898 1.88273 1.80904 1.77187 1.74963 1.73401 1.69848
## Proportion of Variance 0.02189 0.02085 0.01925 0.01847 0.01801 0.01769 0.01697
## Cumulative Proportion 0.35390 0.37475 0.39400 0.41247 0.43047 0.44816 0.46513
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 1.68709 1.63613 1.60444 1.57041 1.55859 1.53553 1.51945
## Proportion of Variance 0.01674 0.01575 0.01514 0.01451 0.01429 0.01387 0.01358
## Cumulative Proportion 0.48187 0.49762 0.51276 0.52727 0.54156 0.55543 0.56901
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 1.48822 1.47627 1.44932 1.43317 1.42505 1.40096 1.35289
## Proportion of Variance 0.01303 0.01282 0.01236 0.01208 0.01195 0.01155 0.01077
## Cumulative Proportion 0.58204 0.59486 0.60721 0.61930 0.63124 0.64279 0.65355
## PC29 PC30 PC31 PC32 PC33 PC34 PC35
## Standard deviation 1.33229 1.32665 1.31559 1.30793 1.28983 1.2712 1.26500
## Proportion of Variance 0.01044 0.01035 0.01018 0.01006 0.00979 0.0095 0.00941
## Cumulative Proportion 0.66399 0.67435 0.68453 0.69459 0.70438 0.7139 0.72330
## PC36 PC37 PC38 PC39 PC40 PC41 PC42
## Standard deviation 1.23800 1.23160 1.21727 1.18520 1.17638 1.14582 1.13364
## Proportion of Variance 0.00902 0.00892 0.00872 0.00826 0.00814 0.00772 0.00756
## Cumulative Proportion 0.73231 0.74123 0.74995 0.75821 0.76635 0.77408 0.78164
## PC43 PC44 PC45 PC46 PC47 PC48 PC49
## Standard deviation 1.12521 1.09399 1.07900 1.06334 1.05498 1.0350 1.02503
## Proportion of Variance 0.00745 0.00704 0.00685 0.00665 0.00655 0.0063 0.00618
## Cumulative Proportion 0.78908 0.79612 0.80297 0.80962 0.81617 0.8225 0.82865
## PC50 PC51 PC52 PC53 PC54 PC55 PC56
## Standard deviation 1.01966 0.99640 0.98515 0.9841 0.94749 0.93823 0.92943
## Proportion of Variance 0.00612 0.00584 0.00571 0.0057 0.00528 0.00518 0.00508
## Cumulative Proportion 0.83477 0.84061 0.84632 0.8520 0.85729 0.86247 0.86755
## PC57 PC58 PC59 PC60 PC61 PC62 PC63
## Standard deviation 0.91723 0.91116 0.89074 0.88521 0.87092 0.86147 0.85339
## Proportion of Variance 0.00495 0.00488 0.00467 0.00461 0.00446 0.00437 0.00428
## Cumulative Proportion 0.87250 0.87739 0.88205 0.88666 0.89112 0.89549 0.89977
## PC64 PC65 PC66 PC67 PC68 PC69 PC70
## Standard deviation 0.84359 0.82856 0.8144 0.80802 0.77716 0.77408 0.76361
## Proportion of Variance 0.00419 0.00404 0.0039 0.00384 0.00355 0.00352 0.00343
## Cumulative Proportion 0.90396 0.90800 0.9119 0.91574 0.91929 0.92282 0.92625
## PC71 PC72 PC73 PC74 PC75 PC76 PC77
## Standard deviation 0.75844 0.73888 0.72172 0.70507 0.69442 0.68624 0.68265
## Proportion of Variance 0.00338 0.00321 0.00306 0.00292 0.00284 0.00277 0.00274
## Cumulative Proportion 0.92963 0.93284 0.93591 0.93883 0.94167 0.94444 0.94718
## PC78 PC79 PC80 PC81 PC82 PC83 PC84
## Standard deviation 0.68101 0.65799 0.63659 0.63540 0.61940 0.61432 0.60301
## Proportion of Variance 0.00273 0.00255 0.00238 0.00237 0.00226 0.00222 0.00214
## Cumulative Proportion 0.94991 0.95245 0.95484 0.95721 0.95947 0.96169 0.96383
## PC85 PC86 PC87 PC88 PC89 PC90 PC91
## Standard deviation 0.58494 0.56710 0.56549 0.56263 0.54866 0.54308 0.52871
## Proportion of Variance 0.00201 0.00189 0.00188 0.00186 0.00177 0.00173 0.00164
## Cumulative Proportion 0.96584 0.96773 0.96961 0.97148 0.97325 0.97498 0.97663
## PC92 PC93 PC94 PC95 PC96 PC97 PC98
## Standard deviation 0.51134 0.50216 0.4872 0.47106 0.46084 0.45508 0.42992
## Proportion of Variance 0.00154 0.00148 0.0014 0.00131 0.00125 0.00122 0.00109
## Cumulative Proportion 0.97816 0.97965 0.9810 0.98235 0.98360 0.98482 0.98590
## PC99 PC100 PC101 PC102 PC103 PC104 PC105
## Standard deviation 0.42528 0.41427 0.40528 0.39905 0.38734 0.37002 0.36575
## Proportion of Variance 0.00106 0.00101 0.00097 0.00094 0.00088 0.00081 0.00079
## Cumulative Proportion 0.98697 0.98798 0.98894 0.98988 0.99076 0.99157 0.99235
## PC106 PC107 PC108 PC109 PC110 PC111 PC112
## Standard deviation 0.3444 0.33774 0.32897 0.3199 0.30388 0.29899 0.29847
## Proportion of Variance 0.0007 0.00067 0.00064 0.0006 0.00054 0.00053 0.00052
## Cumulative Proportion 0.9930 0.99372 0.99436 0.9950 0.99551 0.99603 0.99656
## PC113 PC114 PC115 PC116 PC117 PC118 PC119
## Standard deviation 0.28840 0.27254 0.25597 0.24597 0.23611 0.22167 0.20236
## Proportion of Variance 0.00049 0.00044 0.00039 0.00036 0.00033 0.00029 0.00024
## Cumulative Proportion 0.99704 0.99748 0.99787 0.99822 0.99855 0.99884 0.99908
## PC120 PC121 PC122 PC123 PC124 PC125 PC126
## Standard deviation 0.19741 0.17283 0.16092 0.14456 0.12100 0.09975 0.09005
## Proportion of Variance 0.00023 0.00018 0.00015 0.00012 0.00009 0.00006 0.00005
## Cumulative Proportion 0.99931 0.99949 0.99964 0.99976 0.99985 0.99991 0.99995
## PC127 PC128 PC129 PC130 PC131 PC132
## Standard deviation 0.07295 0.03984 0.02521 0.02149 2.015e-15 1.074e-15
## Proportion of Variance 0.00003 0.00001 0.00000 0.00000 0.000e+00 0.000e+00
## Cumulative Proportion 0.99998 0.99999 1.00000 1.00000 1.000e+00 1.000e+00
## PC133 PC134 PC135 PC136 PC137
## Standard deviation 1.029e-15 1.022e-15 9.436e-16 8.734e-16 7.847e-16
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
## PC138 PC139 PC140 PC141 PC142
## Standard deviation 7.634e-16 7.264e-16 6.527e-16 5.796e-16 5.346e-16
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
## PC143 PC144 PC145 PC146 PC147
## Standard deviation 4.418e-16 3.73e-16 3.35e-16 3.316e-16 3.138e-16
## Proportion of Variance 0.000e+00 0.00e+00 0.00e+00 0.000e+00 0.000e+00
## Cumulative Proportion 1.000e+00 1.00e+00 1.00e+00 1.000e+00 1.000e+00
## PC148 PC149 PC150 PC151 PC152
## Standard deviation 2.981e-16 2.981e-16 2.981e-16 2.981e-16 2.981e-16
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
## PC153 PC154 PC155 PC156 PC157
## Standard deviation 2.981e-16 2.981e-16 2.981e-16 2.981e-16 2.981e-16
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
## PC158 PC159 PC160 PC161 PC162
## Standard deviation 2.981e-16 2.981e-16 2.981e-16 2.981e-16 2.981e-16
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
## PC163 PC164 PC165 PC166 PC167
## Standard deviation 2.981e-16 2.981e-16 2.981e-16 2.981e-16 2.981e-16
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
## PC168 PC169 PC170
## Standard deviation 2.166e-16 1.361e-16 4.369e-17
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion 1.000e+00 1.000e+00 1.000e+00
The scale = 0 argument to biplot ensures that the arrows are scaled to represent the loadings; other values for scale give slightly different biplots with different interpretations.
biplot(df_dummy.pca, scale = 0, cex=0.5)
From the biplot above, it's difficult to tell much given the very large number of features. However, from our PCA analysis, we can also examine the eigenvalues using a scree plot. The plot below shows the percentage of variance explained by each principal component.
fviz_eig(df_dummy.pca)
#compute standard deviation of each principal component
std_dev <- df_dummy.pca$sdev
#compute variance
pr_var <- std_dev^2
prop_varex <- pr_var/sum(pr_var)
round(prop_varex[1:10], 2)
## [1] 0.12 0.06 0.04 0.03 0.03 0.03 0.02 0.02 0.02 0.02
The first principal component in our example therefore explains 12% of the variability and the second explains 6%; together, the first ten principal components explain about 39% of the variability. We then plot the PVE (proportion of variance explained) and the cumulative PVE as scree plots, as we did earlier.
#scree plot
plot(prop_varex, xlab = "Principal Component",
ylab = "Proportion of Variance Explained",
type = "b")#cumulative scree plot
plot(cumsum(prop_varex), xlab = "Principal Component",
ylab = "Cumulative Proportion of Variance Explained",
type = "b")As we can see above in our plots of the PVE and the cumultative PVE, the first few principal components account for a much larger proportion of the variance explained than the remaining 150+ principal components for this dataset. Additionally, the proportion of variance explained by the first principal component at 12% is double the second principal component’s proportion of variance explained.
Although running PCA on the entire dataset is helpful in some ways, certain findings from a biplot are affected by the number of variables in the overall dataset. Additionally, our PCA analysis shows that the first few principal components explain a much larger proportion of the variability than later principal components. Because of this, we thought it would be interesting to dig a bit deeper into which variables hold the most importance in determining the first few dimensions. To do this, we'll use the FactoMineR and factoextra packages to examine the eigenvalues of each dimension, with particular interest in dimensions 1 and 2, since those are what we plotted above and they hold the highest proportion of variance explained.
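The factoextra package loaded just below also offers fviz_contrib() for ranking which variables contribute most to a given dimension (a sketch; the top = 20 cut-off is illustrative):
# top contributing variables to the first and second principal components
fviz_contrib(df_dummy.pca, choice = "var", axes = 1, top = 20)
fviz_contrib(df_dummy.pca, choice = "var", axes = 2, top = 20)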
library("FactoMineR")
library("factoextra")
eig.val <- get_eigenvalue(df_dummy.pca)
eig.val
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 2.058998e+01 1.211175e+01 12.11175
## Dim.2 1.015460e+01 5.973293e+00 18.08505
## Dim.3 7.482244e+00 4.401320e+00 22.48637
## Dim.4 5.426226e+00 3.191898e+00 25.67826
## Dim.5 4.671181e+00 2.747754e+00 28.42602
## Dim.6 4.301045e+00 2.530027e+00 30.95604
## Dim.7 3.816415e+00 2.244950e+00 33.20099
## Dim.8 3.720982e+00 2.188813e+00 35.38981
## Dim.9 3.544675e+00 2.085103e+00 37.47491
## Dim.10 3.272608e+00 1.925064e+00 39.39997
## Dim.11 3.139506e+00 1.846768e+00 41.24674
## Dim.12 3.061203e+00 1.800708e+00 43.04745
## Dim.13 3.006795e+00 1.768703e+00 44.81615
## Dim.14 2.884835e+00 1.696962e+00 46.51311
## Dim.15 2.846258e+00 1.674270e+00 48.18738
## Dim.16 2.676906e+00 1.574651e+00 49.76204
## Dim.17 2.574236e+00 1.514256e+00 51.27629
## Dim.18 2.466179e+00 1.450694e+00 52.72699
## Dim.19 2.429208e+00 1.428946e+00 54.15593
## Dim.20 2.357853e+00 1.386972e+00 55.54290
## Dim.21 2.308716e+00 1.358068e+00 56.90097
## Dim.22 2.214809e+00 1.302829e+00 58.20380
## Dim.23 2.179384e+00 1.281990e+00 59.48579
## Dim.24 2.100534e+00 1.235608e+00 60.72140
## Dim.25 2.053975e+00 1.208220e+00 61.92962
## Dim.26 2.030759e+00 1.194564e+00 63.12418
## Dim.27 1.962699e+00 1.154529e+00 64.27871
## Dim.28 1.830307e+00 1.076651e+00 65.35536
## Dim.29 1.774996e+00 1.044115e+00 66.39948
## Dim.30 1.759998e+00 1.035293e+00 67.43477
## Dim.31 1.730765e+00 1.018097e+00 68.45287
## Dim.32 1.710675e+00 1.006279e+00 69.45915
## Dim.33 1.663649e+00 9.786172e-01 70.43777
## Dim.34 1.615838e+00 9.504927e-01 71.38826
## Dim.35 1.600213e+00 9.413019e-01 72.32956
## Dim.36 1.532634e+00 9.015492e-01 73.23111
## Dim.37 1.516842e+00 8.922598e-01 74.12337
## Dim.38 1.481739e+00 8.716111e-01 74.99498
## Dim.39 1.404703e+00 8.262958e-01 75.82128
## Dim.40 1.383860e+00 8.140351e-01 76.63531
## Dim.41 1.312896e+00 7.722915e-01 77.40760
## Dim.42 1.285132e+00 7.559599e-01 78.16356
## Dim.43 1.266088e+00 7.447574e-01 78.90832
## Dim.44 1.196820e+00 7.040120e-01 79.61233
## Dim.45 1.164236e+00 6.848446e-01 80.29718
## Dim.46 1.130689e+00 6.651113e-01 80.96229
## Dim.47 1.112984e+00 6.546967e-01 81.61698
## Dim.48 1.071175e+00 6.301032e-01 82.24709
## Dim.49 1.050689e+00 6.180525e-01 82.86514
## Dim.50 1.039706e+00 6.115915e-01 83.47673
## Dim.51 9.928116e-01 5.840068e-01 84.06074
## Dim.52 9.705116e-01 5.708892e-01 84.63163
## Dim.53 9.685291e-01 5.697230e-01 85.20135
## Dim.54 8.977434e-01 5.280843e-01 85.72944
## Dim.55 8.802825e-01 5.178132e-01 86.24725
## Dim.56 8.638392e-01 5.081407e-01 86.75539
## Dim.57 8.413060e-01 4.948859e-01 87.25027
## Dim.58 8.302081e-01 4.883577e-01 87.73863
## Dim.59 7.934111e-01 4.667124e-01 88.20535
## Dim.60 7.836012e-01 4.609419e-01 88.66629
## Dim.61 7.584971e-01 4.461748e-01 89.11246
## Dim.62 7.421366e-01 4.365510e-01 89.54901
## Dim.63 7.282774e-01 4.283985e-01 89.97741
## Dim.64 7.116407e-01 4.186122e-01 90.39602
## Dim.65 6.865161e-01 4.038330e-01 90.79986
## Dim.66 6.632368e-01 3.901393e-01 91.19000
## Dim.67 6.528991e-01 3.840583e-01 91.57405
## Dim.68 6.039806e-01 3.552827e-01 91.92934
## Dim.69 5.992040e-01 3.524729e-01 92.28181
## Dim.70 5.830996e-01 3.429998e-01 92.62481
## Dim.71 5.752251e-01 3.383677e-01 92.96318
## Dim.72 5.459440e-01 3.211435e-01 93.28432
## Dim.73 5.208832e-01 3.064019e-01 93.59072
## Dim.74 4.971238e-01 2.924257e-01 93.88315
## Dim.75 4.822259e-01 2.836623e-01 94.16681
## Dim.76 4.709243e-01 2.770143e-01 94.44382
## Dim.77 4.660061e-01 2.741213e-01 94.71795
## Dim.78 4.637766e-01 2.728098e-01 94.99076
## Dim.79 4.329454e-01 2.546738e-01 95.24543
## Dim.80 4.052408e-01 2.383770e-01 95.48381
## Dim.81 4.037374e-01 2.374926e-01 95.72130
## Dim.82 3.836606e-01 2.256827e-01 95.94698
## Dim.83 3.773846e-01 2.219910e-01 96.16897
## Dim.84 3.636213e-01 2.138949e-01 96.38287
## Dim.85 3.421505e-01 2.012650e-01 96.58413
## Dim.86 3.215971e-01 1.891747e-01 96.77331
## Dim.87 3.197801e-01 1.881059e-01 96.96141
## Dim.88 3.165514e-01 1.862067e-01 97.14762
## Dim.89 3.010231e-01 1.770724e-01 97.32469
## Dim.90 2.949359e-01 1.734917e-01 97.49818
## Dim.91 2.795301e-01 1.644294e-01 97.66261
## Dim.92 2.614644e-01 1.538026e-01 97.81642
## Dim.93 2.521640e-01 1.483318e-01 97.96475
## Dim.94 2.373158e-01 1.395976e-01 98.10435
## Dim.95 2.218982e-01 1.305284e-01 98.23487
## Dim.96 2.123762e-01 1.249272e-01 98.35980
## Dim.97 2.071020e-01 1.218247e-01 98.48163
## Dim.98 1.848327e-01 1.087251e-01 98.59035
## Dim.99 1.808651e-01 1.063912e-01 98.69674
## Dim.100 1.716232e-01 1.009548e-01 98.79770
## Dim.101 1.642498e-01 9.661755e-02 98.89431
## Dim.102 1.592391e-01 9.367008e-02 98.98798
## Dim.103 1.500322e-01 8.825426e-02 99.07624
## Dim.104 1.369130e-01 8.053705e-02 99.15678
## Dim.105 1.337712e-01 7.868897e-02 99.23546
## Dim.106 1.186128e-01 6.977221e-02 99.30524
## Dim.107 1.140701e-01 6.710003e-02 99.37234
## Dim.108 1.082187e-01 6.365804e-02 99.43600
## Dim.109 1.023468e-01 6.020399e-02 99.49620
## Dim.110 9.234217e-02 5.431893e-02 99.55052
## Dim.111 8.939229e-02 5.258370e-02 99.60310
## Dim.112 8.908586e-02 5.240345e-02 99.65551
## Dim.113 8.317426e-02 4.892604e-02 99.70443
## Dim.114 7.427911e-02 4.369359e-02 99.74812
## Dim.115 6.551903e-02 3.854061e-02 99.78667
## Dim.116 6.050011e-02 3.558830e-02 99.82225
## Dim.117 5.574898e-02 3.279352e-02 99.85505
## Dim.118 4.913681e-02 2.890401e-02 99.88395
## Dim.119 4.094800e-02 2.408706e-02 99.90804
## Dim.120 3.896884e-02 2.292285e-02 99.93096
## Dim.121 2.986963e-02 1.757037e-02 99.94853
## Dim.122 2.589412e-02 1.523184e-02 99.96376
## Dim.123 2.089667e-02 1.229216e-02 99.97606
## Dim.124 1.464069e-02 8.612168e-03 99.98467
## Dim.125 9.949442e-03 5.852613e-03 99.99052
## Dim.126 8.108799e-03 4.769882e-03 99.99529
## Dim.127 5.321968e-03 3.130570e-03 99.99842
## Dim.128 1.587621e-03 9.338945e-04 99.99935
## Dim.129 6.353731e-04 3.737489e-04 99.99973
## Dim.130 4.616670e-04 2.715688e-04 100.00000
## Dim.131 4.060658e-30 2.388622e-30 100.00000
## Dim.132 1.153285e-30 6.784027e-31 100.00000
## Dim.133 1.058163e-30 6.224488e-31 100.00000
## Dim.134 1.044789e-30 6.145815e-31 100.00000
## Dim.135 8.903315e-31 5.237244e-31 100.00000
## Dim.136 7.628326e-31 4.487250e-31 100.00000
## Dim.137 6.158224e-31 3.622485e-31 100.00000
## Dim.138 5.828365e-31 3.428450e-31 100.00000
## Dim.139 5.276157e-31 3.103622e-31 100.00000
## Dim.140 4.259960e-31 2.505859e-31 100.00000
## Dim.141 3.359680e-31 1.976282e-31 100.00000
## Dim.142 2.857539e-31 1.680905e-31 100.00000
## Dim.143 1.951522e-31 1.147954e-31 100.00000
## Dim.144 1.391153e-31 8.183251e-32 100.00000
## Dim.145 1.122306e-31 6.601800e-32 100.00000
## Dim.146 1.099434e-31 6.467259e-32 100.00000
## Dim.147 9.846459e-32 5.792035e-32 100.00000
## Dim.148 8.884720e-32 5.226306e-32 100.00000
## Dim.149 8.884720e-32 5.226306e-32 100.00000
## Dim.150 8.884720e-32 5.226306e-32 100.00000
## Dim.151 8.884720e-32 5.226306e-32 100.00000
## Dim.152 8.884720e-32 5.226306e-32 100.00000
## Dim.153 8.884720e-32 5.226306e-32 100.00000
## Dim.154 8.884720e-32 5.226306e-32 100.00000
## Dim.155 8.884720e-32 5.226306e-32 100.00000
## Dim.156 8.884720e-32 5.226306e-32 100.00000
## Dim.157 8.884720e-32 5.226306e-32 100.00000
## Dim.158 8.884720e-32 5.226306e-32 100.00000
## Dim.159 8.884720e-32 5.226306e-32 100.00000
## Dim.160 8.884720e-32 5.226306e-32 100.00000
## Dim.161 8.884720e-32 5.226306e-32 100.00000
## Dim.162 8.884720e-32 5.226306e-32 100.00000
## Dim.163 8.884720e-32 5.226306e-32 100.00000
## Dim.164 8.884720e-32 5.226306e-32 100.00000
## Dim.165 8.884720e-32 5.226306e-32 100.00000
## Dim.166 8.884720e-32 5.226306e-32 100.00000
## Dim.167 8.884720e-32 5.226306e-32 100.00000
## Dim.168 4.691854e-32 2.759914e-32 100.00000
## Dim.169 1.853678e-32 1.090399e-32 100.00000
## Dim.170 1.909247e-33 1.123086e-33 100.00000
res.pca <- PCA(df_dummy, scale.unit = TRUE, ncp = 5, graph = FALSE)
var <- get_pca_var(res.pca)
Interestingly, eigenvalues greater than 1 indicate that a principal component accounts for more variance than any single original variable in the standardized data, so this threshold (the Kaiser criterion) is commonly used as a cutoff for deciding which PCs to retain. Since the eigenvalues fall below 1 at roughly Dim = 50, anything past that dimension does not provide much additional insight into our data. Our scree plot above cut this even further, showing that anything past the first ten dimensions accounts for only a small proportion of the explained variance. Therefore, going forward, we will focus on the first few dimensions to see whether we can garner any insights.
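As a quick check on this cutoff, the eigenvalue table can be queried directly (a minimal sketch, assuming factoextra is attached and using the res.pca object above):
# Count how many components satisfy the Kaiser criterion (eigenvalue > 1)
eig <- get_eigenvalue(res.pca)
n_keep <- sum(eig$eigenvalue > 1)
n_keep

# Cumulative variance explained by those retained components
eig[n_keep, "cumulative.variance.percent"]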
Since our biplot above was very crowded and difficult to interpret, we decided to look a bit deeper at the quality of representation of the variables, measured by the squared cosine (cos2). A variable's cos2 reflects its position relative to the circumference of the correlation circle (not pictured in our biplot above, but visible in the length of each vector from the center/origin). Plotting these cos2 values with a corrplot across the first five PCs, we can examine the following:
library("corrplot")
corrplot(var$cos2, is.corr = FALSE)
Typically, a high cos2 value indicates a good representation of the variable on the principal component; in our case such variables sit close to the circumference of the correlation circle (far from the origin), which we can see with variables such as X.ADHD.Total and X.MD.Total on the biplot. The opposite is true for variables with a low cos2 value, which tend to fall closer to the origin.
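An alternative to the corrplot view is a ranked barplot of cos2 values (a small sketch, assuming factoextra is already attached):
# Bar chart of the variables best represented on the first two PCs
fviz_cos2(res.pca, choice = "var", axes = 1:2, top = 15)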
We can also examine the contribution of each variable to our principal components. Variables that are strongly correlated with PC1 and PC2 are the most important in explaining the variability in the dataset. When we build another correlation-style plot, this time of variable contributions, we see the following:
corrplot(var$contrib, is.corr = FALSE)
We can also see this in the barplots below, which show the variables with the highest contribution percentages for our first two PCs:
fviz_contrib(df_dummy.pca, choice = "var", axes = 1, top = 15)
fviz_contrib(df_dummy.pca, choice = "var", axes = 2, top = 15)
Consistent with our biplot visualization and cos2 values, we can confirm that variables such as X.ADHD.Total and X.MD.Total, along with many of the individual ADHD and MD questions, contribute most to the variability explained in our dataset. This is important to take note of for future analysis, where we will look more closely at the features that provide the most insight for clustering and classification.
Graph of variables: positively correlated variables point to the same side of the plot, while negatively correlated variables point to opposite sides of the graph.
fviz_pca_var(df_dummy.pca,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
Assume you are modeling whether a patient attempted suicide (column AX). Please use support vector machine to model this. You might want to consider reducing the number of variables or somehow use extracted information from the variables. This can be a really fun modeling task! (80)
In machine learning, support vector machines (SVMs) are supervised learning models with associated learning algorithms that analyze data for classification and regression; they are most commonly used for classification problems. In this algorithm, each data item is plotted as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate. Classification is then performed by finding the hyperplane that best separates the two classes.
In addition to performing linear classification, SVMs can efficiently perform a non-linear classification, implicitly mapping their inputs into high-dimensional feature spaces.
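To illustrate why the kernel choice matters (a hypothetical toy example with simulated data, not drawn from our dataset), a radial kernel can separate classes that no single hyperplane can:
library(e1071)

# Simulated two-class data with a ring-shaped (non-linear) boundary
set.seed(1)
x <- matrix(rnorm(400), ncol = 2)
y <- factor(ifelse(x[, 1]^2 + x[, 2]^2 > 1.5, "outer", "inner"))
toy <- data.frame(x1 = x[, 1], x2 = x[, 2], y = y)

# A linear kernel cannot capture the circular boundary; a radial kernel can
fit_linear <- svm(y ~ ., data = toy, kernel = "linear")
fit_radial <- svm(y ~ ., data = toy, kernel = "radial")

mean(predict(fit_linear, toy) == toy$y)  # training accuracy, linear kernel
mean(predict(fit_radial, toy) == toy$y)  # training accuracy, radial kernel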
In order to run our SVM model, our first step is to split the data into training and test sets, so that we can later assess accuracy on the held-out test set. We perform a conventional 80/20 training-to-testing split on our original dataframe.
set.seed(123)
df2 <- data.frame(df)
index <- createDataPartition(df2$Suicide, p=0.8, list = FALSE)
df_train <- df2[index,]
df_test <- df2[-index,]
Since the data has 53 variables and SVM performs much better on lower-dimensional datasets, we need some form of dimensionality reduction or feature selection. To do this, we use a Random Forest to reduce the number of variables. Random Forests are often used for feature selection in a data science workflow: the tree-based splits naturally rank features by how much they improve node purity, so the mean decrease in Gini impurity across all trees can be used to determine each variable's importance.
Therefore, we’ll run our random forest algorithm and find the importance of each one of our features.
# Use random Forest to select features
rf<-randomForest(Suicide~., data=df_train)
importance(rf)
## MeanDecreaseGini
## Age 1.5888889
## Sex 0.4510086
## Race 0.8656579
## ADHD.Q1 1.5374429
## ADHD.Q2 2.3424692
## ADHD.Q3 1.4111277
## ADHD.Q4 1.3011155
## ADHD.Q5 1.3537078
## ADHD.Q6 1.7395177
## ADHD.Q7 1.3272879
## ADHD.Q8 1.2233361
## ADHD.Q9 1.4076785
## ADHD.Q10 1.1595813
## ADHD.Q11 1.2866615
## ADHD.Q12 1.3884853
## ADHD.Q13 1.0137116
## ADHD.Q14 1.3344706
## ADHD.Q15 1.3001079
## ADHD.Q16 1.8850309
## ADHD.Q17 2.0366498
## ADHD.Q18 1.8381003
## ADHD.Total 1.9361384
## MD.Q1a 0.2775800
## MD.Q1b 0.4063452
## MD.Q1c 0.2731979
## MD.Q1d 0.4328351
## MD.Q1e 0.2633041
## MD.Q1f 0.2959477
## MD.Q1g 0.6647625
## MD.Q1h 0.1837682
## MD.Q1i 0.2397299
## MD.Q1j 0.2551557
## MD.Q1k 0.2385603
## MD.Q1L 0.3239011
## MD.Q1m 0.3347337
## MD.Q2 0.5071448
## MD.Q3 1.0199672
## MD.TOTAL 2.0597022
## Alcohol 1.5328180
## THC 0.6819343
## Cocaine 0.7803778
## Stimulants 0.2550488
## Sedative.hypnotics 0.7287060
## Opioids 0.5515170
## Court.order 0.1582192
## Education 3.2623427
## Hx.of.Violence 0.5416944
## Disorderly.Conduct 0.4312837
## Abuse 5.9259389
## Non.subst.Dx 0.4289075
## Subst.Dx 1.3522370
## Psych.meds. 1.7046166
varImpPlot(rf, type=2)
As we can see from the mean decrease Gini scores above, the variable Abuse has by far the highest value (roughly 5.93 in the run shown), meaning it has the highest variable importance. To fit our initial support vector machine model, we select ten of the highest-ranking variables: Abuse, Education, ADHD.Q1, Alcohol, ADHD.Q16, ADHD.Total, ADHD.Q18, MD.TOTAL, ADHD.Q9, and ADHD.Q2.
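The same selection can be done programmatically from the importance matrix (a sketch; the exact ranking will vary with the random seed used to grow the forest):
# Rank features by MeanDecreaseGini and keep the ten largest
imp <- importance(rf)
top10 <- names(sort(imp[, "MeanDecreaseGini"], decreasing = TRUE))[1:10]
top10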
# set.seed(123)
# select top 10 variables from Importance Table
df.svm <- svm(Suicide ~ Abuse + Education + ADHD.Q1 + Alcohol + ADHD.Q16 + ADHD.Total +
                ADHD.Q18 + MD.TOTAL + ADHD.Q9 + ADHD.Q2,
              data = df_train, kernel = "radial", cost = 10, scale = FALSE)
We use a radial kernel for this support vector classifier, and the fitted model has 90 support vectors.
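The kernel and support vector count can be confirmed from the fitted object (a small sketch using the df.svm object above):
# Print the fitted model summary, which reports the kernel type and the
# total number of support vectors (also stored in df.svm$tot.nSV)
summary(df.svm)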
Then we can test the accuracy using our test data.
svmPre1<-predict(df.svm,df_test)
svm1<-confusionMatrix(svmPre1,df_test$Suicide)
svm1
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 21 6
## 1 3 4
##
## Accuracy : 0.7353
## 95% CI : (0.5564, 0.8712)
## No Information Rate : 0.7059
## P-Value [Acc > NIR] : 0.4355
##
## Kappa : 0.3014
##
## Mcnemar's Test P-Value : 0.5050
##
## Sensitivity : 0.8750
## Specificity : 0.4000
## Pos Pred Value : 0.7778
## Neg Pred Value : 0.5714
## Prevalence : 0.7059
## Detection Rate : 0.6176
## Detection Prevalence : 0.7941
## Balanced Accuracy : 0.6375
##
## 'Positive' Class : 0
##
The accuracy is 0.7353. Considering that we only use ten variables, this test result is not bad. We can try to improve the model by tuning its parameters.
svm.tune <- tune(e1071::svm, Suicide~Abuse+Education+ADHD.Q1+Alcohol+ADHD.Q16+ADHD.Total+ADHD.Q18+MD.TOTAL+ADHD.Q9+ADHD.Q2,
data = df_train,
ranges=list(cost=c(0.001,0.01,0.1,1,5,10,100),
kernel=c("linear", "polynomial", "radial"),
gamma=c(0.5,1,2,3,4)))
summary(svm.tune)
##
## Parameter tuning of 'e1071::svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost kernel gamma
## 1 linear 0.5
##
## - best performance: 0.2419048
##
## - Detailed performance results:
## cost kernel gamma error dispersion
## 1 1e-03 linear 0.5 0.2914286 0.09924888
## 2 1e-02 linear 0.5 0.2914286 0.09924888
## 3 1e-01 linear 0.5 0.2423810 0.09736902
## 4 1e+00 linear 0.5 0.2419048 0.10856029
## 5 5e+00 linear 0.5 0.3057143 0.11332933
## 6 1e+01 linear 0.5 0.2919048 0.10507529
## 7 1e+02 linear 0.5 0.3271429 0.08609703
## 8 1e-03 polynomial 0.5 0.2914286 0.09924888
## 9 1e-02 polynomial 0.5 0.2771429 0.08002267
## 10 1e-01 polynomial 0.5 0.2709524 0.09517061
## 11 1e+00 polynomial 0.5 0.2566667 0.09144428
## 12 5e+00 polynomial 0.5 0.2566667 0.09144428
## 13 1e+01 polynomial 0.5 0.2566667 0.09144428
## 14 1e+02 polynomial 0.5 0.2566667 0.09144428
## 15 1e-03 radial 0.5 0.2914286 0.09924888
## 16 1e-02 radial 0.5 0.2914286 0.09924888
## 17 1e-01 radial 0.5 0.2914286 0.09924888
## 18 1e+00 radial 0.5 0.2842857 0.10223134
## 19 5e+00 radial 0.5 0.2485714 0.09740394
## 20 1e+01 radial 0.5 0.2485714 0.09740394
## 21 1e+02 radial 0.5 0.2485714 0.09740394
## 22 1e-03 linear 1.0 0.2914286 0.09924888
## 23 1e-02 linear 1.0 0.2914286 0.09924888
## 24 1e-01 linear 1.0 0.2423810 0.09736902
## 25 1e+00 linear 1.0 0.2419048 0.10856029
## 26 5e+00 linear 1.0 0.3057143 0.11332933
## 27 1e+01 linear 1.0 0.2919048 0.10507529
## 28 1e+02 linear 1.0 0.3271429 0.08609703
## 29 1e-03 polynomial 1.0 0.2633333 0.09691773
## 30 1e-02 polynomial 1.0 0.2638095 0.09055135
## 31 1e-01 polynomial 1.0 0.2566667 0.09144428
## 32 1e+00 polynomial 1.0 0.2566667 0.09144428
## 33 5e+00 polynomial 1.0 0.2566667 0.09144428
## 34 1e+01 polynomial 1.0 0.2566667 0.09144428
## 35 1e+02 polynomial 1.0 0.2566667 0.09144428
## 36 1e-03 radial 1.0 0.2914286 0.09924888
## 37 1e-02 radial 1.0 0.2914286 0.09924888
## 38 1e-01 radial 1.0 0.2914286 0.09924888
## 39 1e+00 radial 1.0 0.2914286 0.09924888
## 40 5e+00 radial 1.0 0.2914286 0.09924888
## 41 1e+01 radial 1.0 0.2914286 0.09924888
## 42 1e+02 radial 1.0 0.2914286 0.09924888
## 43 1e-03 linear 2.0 0.2914286 0.09924888
## 44 1e-02 linear 2.0 0.2914286 0.09924888
## 45 1e-01 linear 2.0 0.2423810 0.09736902
## 46 1e+00 linear 2.0 0.2419048 0.10856029
## 47 5e+00 linear 2.0 0.3057143 0.11332933
## 48 1e+01 linear 2.0 0.2919048 0.10507529
## 49 1e+02 linear 2.0 0.3271429 0.08609703
## 50 1e-03 polynomial 2.0 0.2776190 0.08723085
## 51 1e-02 polynomial 2.0 0.2566667 0.09144428
## 52 1e-01 polynomial 2.0 0.2566667 0.09144428
## 53 1e+00 polynomial 2.0 0.2566667 0.09144428
## 54 5e+00 polynomial 2.0 0.2566667 0.09144428
## 55 1e+01 polynomial 2.0 0.2566667 0.09144428
## 56 1e+02 polynomial 2.0 0.2566667 0.09144428
## 57 1e-03 radial 2.0 0.2914286 0.09924888
## 58 1e-02 radial 2.0 0.2914286 0.09924888
## 59 1e-01 radial 2.0 0.2914286 0.09924888
## 60 1e+00 radial 2.0 0.2914286 0.09924888
## 61 5e+00 radial 2.0 0.2914286 0.09924888
## 62 1e+01 radial 2.0 0.2914286 0.09924888
## 63 1e+02 radial 2.0 0.2914286 0.09924888
## 64 1e-03 linear 3.0 0.2914286 0.09924888
## 65 1e-02 linear 3.0 0.2914286 0.09924888
## 66 1e-01 linear 3.0 0.2423810 0.09736902
## 67 1e+00 linear 3.0 0.2419048 0.10856029
## 68 5e+00 linear 3.0 0.3057143 0.11332933
## 69 1e+01 linear 3.0 0.2919048 0.10507529
## 70 1e+02 linear 3.0 0.3271429 0.08609703
## 71 1e-03 polynomial 3.0 0.2566667 0.09144428
## 72 1e-02 polynomial 3.0 0.2566667 0.09144428
## 73 1e-01 polynomial 3.0 0.2566667 0.09144428
## 74 1e+00 polynomial 3.0 0.2566667 0.09144428
## 75 5e+00 polynomial 3.0 0.2566667 0.09144428
## 76 1e+01 polynomial 3.0 0.2566667 0.09144428
## 77 1e+02 polynomial 3.0 0.2566667 0.09144428
## 78 1e-03 radial 3.0 0.2914286 0.09924888
## 79 1e-02 radial 3.0 0.2914286 0.09924888
## 80 1e-01 radial 3.0 0.2914286 0.09924888
## 81 1e+00 radial 3.0 0.2914286 0.09924888
## 82 5e+00 radial 3.0 0.2914286 0.09924888
## 83 1e+01 radial 3.0 0.2914286 0.09924888
## 84 1e+02 radial 3.0 0.2914286 0.09924888
## 85 1e-03 linear 4.0 0.2914286 0.09924888
## 86 1e-02 linear 4.0 0.2914286 0.09924888
## 87 1e-01 linear 4.0 0.2423810 0.09736902
## 88 1e+00 linear 4.0 0.2419048 0.10856029
## 89 5e+00 linear 4.0 0.3057143 0.11332933
## 90 1e+01 linear 4.0 0.2919048 0.10507529
## 91 1e+02 linear 4.0 0.3271429 0.08609703
## 92 1e-03 polynomial 4.0 0.2566667 0.09144428
## 93 1e-02 polynomial 4.0 0.2566667 0.09144428
## 94 1e-01 polynomial 4.0 0.2566667 0.09144428
## 95 1e+00 polynomial 4.0 0.2566667 0.09144428
## 96 5e+00 polynomial 4.0 0.2566667 0.09144428
## 97 1e+01 polynomial 4.0 0.2566667 0.09144428
## 98 1e+02 polynomial 4.0 0.2566667 0.09144428
## 99 1e-03 radial 4.0 0.2914286 0.09924888
## 100 1e-02 radial 4.0 0.2914286 0.09924888
## 101 1e-01 radial 4.0 0.2914286 0.09924888
## 102 1e+00 radial 4.0 0.2914286 0.09924888
## 103 5e+00 radial 4.0 0.2914286 0.09924888
## 104 1e+01 radial 4.0 0.2914286 0.09924888
## 105 1e+02 radial 4.0 0.2914286 0.09924888
bestmod=svm.tune$best.model
bestmod##
## Call:
## best.tune(method = e1071::svm, train.x = Suicide ~ Abuse + Education +
## ADHD.Q1 + Alcohol + ADHD.Q16 + ADHD.Total + ADHD.Q18 + MD.TOTAL +
## ADHD.Q9 + ADHD.Q2, data = df_train, ranges = list(cost = c(0.001,
## 0.01, 0.1, 1, 5, 10, 100), kernel = c("linear", "polynomial",
## "radial"), gamma = c(0.5, 1, 2, 3, 4)))
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 73
The tune() function performs ten-fold cross-validation by default. To find the best kernel, we include linear, polynomial and radial kernels in the tuning grid, and the best model uses a linear kernel with a cost of 1. From the table, we can see that the lowest cross-validation error rate is 0.2419048.
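Those values can also be read directly off the tune object rather than from the printed table (a small sketch using svm.tune from above):
# Best hyperparameter combination and its 10-fold cross-validation error
svm.tune$best.parameters
svm.tune$best.performance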
We can use a confusion matrix to check the predictions and accuracy on the test set.
svmPre2<-predict(bestmod,df_test)
svm2<-confusionMatrix(svmPre2,df_test$Suicide)
svm2
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 23 6
## 1 1 4
##
## Accuracy : 0.7941
## 95% CI : (0.621, 0.913)
## No Information Rate : 0.7059
## P-Value [Acc > NIR] : 0.1741
##
## Kappa : 0.4195
##
## Mcnemar's Test P-Value : 0.1306
##
## Sensitivity : 0.9583
## Specificity : 0.4000
## Pos Pred Value : 0.7931
## Neg Pred Value : 0.8000
## Prevalence : 0.7059
## Detection Rate : 0.6765
## Detection Prevalence : 0.8529
## Balanced Accuracy : 0.6792
##
## 'Positive' Class : 0
##
It appears that tuning improved the test accuracy from 0.7353 to 0.7941, so the tuned model outperforms our initial fit.
To recap, support vector machines are a subclass of supervised classifiers that attempt to partition a feature space into two or more groups. They achieve this by finding an optimal means of separating such groups based on their known class labels.
Comparing the initial SVM with the tuned SVM, the tuned model (linear kernel, cost = 1) achieves the higher test accuracy while using the same ten features. Relying on non-linear kernels to separate the data can make a model difficult (if not impossible) to interpret, so the simpler tuned linear model is both more accurate here and easier to explain. In this scenario, therefore, the tuned SVM is our model of choice, as it is the most efficient.
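To make the comparison explicit, the headline test-set metrics can be collected from the two confusionMatrix objects (a small sketch using the svm1 and svm2 objects defined above):
# Side-by-side test-set metrics for the untuned and tuned SVM fits
comparison <- data.frame(
  Model    = c("SVM (radial, cost = 10)", "SVM tuned (linear, cost = 1)"),
  Accuracy = c(svm1$overall["Accuracy"], svm2$overall["Accuracy"]),
  Kappa    = c(svm1$overall["Kappa"],    svm2$overall["Kappa"])
)
comparison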