Source Code: https://github.com/djlofland/DATA622_S2021_Group2/tree/master/Homework4

For this assignment, we will be working with a very interesting mental health dataset from a real-life research project. All identifying information, of course, has been removed. The attached spreadsheet has the data (the tab name “Data”). The data dictionary is given in the second tab. You can get as creative as you want. The assignment is designed to really get you to think about how you could use different methods.


Load Data & EDA


Load Data

We start by loading the Excel dataset provided into a dataframe.
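
The read step itself is not shown in the rendered output; a minimal sketch follows, where the file name is an assumption (the sheet name "Data" comes from the assignment description):

# Read the provided Excel workbook (the file name here is a placeholder assumption)
library(readxl)

df <- read_excel("ADHD_data.xlsx", sheet = "Data")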

Data Sample

Next, we can take a quick look at the dataframe features. By reviewing the data dictionary below and printing the first few rows, we can start to establish some context for the data collected in the survey.

# Display first few rows for inspection
kable(head(df)) %>% 
  kable_styling(bootstrap_options = "basic")
Initial Age Sex Race ADHD Q1 ADHD Q2 ADHD Q3 ADHD Q4 ADHD Q5 ADHD Q6 ADHD Q7 ADHD Q8 ADHD Q9 ADHD Q10 ADHD Q11 ADHD Q12 ADHD Q13 ADHD Q14 ADHD Q15 ADHD Q16 ADHD Q17 ADHD Q18 ADHD Total MD Q1a MD Q1b MD Q1c MD Q1d MD Q1e MD Q1f MD Q1g MD Q1h MD Q1i MD Q1j MD Q1k MD Q1L MD Q1m MD Q2 MD Q3 MD TOTAL Alcohol THC Cocaine Stimulants Sedative-hypnotics Opioids Court order Education Hx of Violence Disorderly Conduct Suicide Abuse Non-subst Dx Subst Dx Psych meds.
JA 24 1 1 1 1 4 2 3 1 1 3 2 4 2 4 1 0 3 1 3 4 40 1 1 1 1 0 1 1 1 1 1 1 0 1 1 3 15 1 1 1 0 0 0 1 11 0 1 1 0 2 0 2
LA 48 2 1 3 3 4 4 5 2 2 3 2 4 1 4 2 4 4 3 1 4 55 1 1 1 1 1 1 1 1 1 0 0 1 0 1 3 14 0 0 0 0 0 0 0 14 0 0 1 4 1 0 1
MD 51 2 1 2 1 2 1 3 3 3 2 0 1 2 0 2 2 3 2 1 1 31 0 0 0 0 1 1 1 0 0 0 0 0 0 0 2 5 0 0 0 0 0 0 0 12 0 0 0 6 2 0 1
RD 43 1 1 3 3 2 2 4 3 2 4 4 2 3 1 3 3 1 2 1 2 45 1 1 0 0 1 1 1 1 1 0 0 1 1 1 3 13 1 1 1 1 0 0 0 12 0 0 1 7 2 0 2
RB 34 1 1 4 4 2 4 4 2 3 4 4 2 4 1 3 2 1 2 1 1 48 0 1 0 1 0 1 1 0 0 0 0 0 0 1 2 7 1 1 0 0 0 0 1 9 1 1 1 0 2 0 0
SB 39 2 1 2 3 1 4 3 2 3 4 4 2 4 2 4 4 3 4 3 3 55 0 1 0 1 1 1 1 1 1 1 1 1 0 1 3 14 1 0 0 0 0 0 0 11 0 1 1 2 0 0 0

Data Dictionary

  • Initial Respondent's initials: we will drop this column
  • Age Age: 0 … 69
  • Sex Sex: Male-1, Female-2
  • Race Race: White-1, African American-2, Hispanic-3, Asian-4, Native American-5, Other or missing data -6
  • ADHD Q(n) ADHD self-report scale: Never-0, rarely-1, sometimes-2, often-3, very often-4
  • ADHD Total ADHD self-report Total: Sum of ADHD question scores
  • MD Q(n) Mood disorder questions: No-0, yes-1; question 3: no problem-0, minor-1, moderate-2, serious-3
  • MD TOTAL MD self-report Total: Sum of MD question scores
  • Alcohol Individual substances misuse: no use-0, use-1, abuse-2, dependence-3
  • THC Individual substances misuse: no use-0, use-1, abuse-2, dependence-3
  • Cocaine Individual substances misuse: no use-0, use-1, abuse-2, dependence-3
  • Stimulants Individual substances misuse: no use-0, use-1, abuse-2, dependence-3
  • Sedative-hypnotics Individual substances misuse: no use-0, use-1, abuse-2, dependence-3
  • Opioids Individual substances misuse: no use-0, use-1, abuse-2, dependence-3
  • Court order Court Order: No-0, Yes-1
  • Education Education: 1-12 grade, 13+ college
  • Hx of Violence History of Violence: No-0, Yes-1
  • Disorderly Conduct Disorderly Conduct: No-0, Yes-1
  • Suicide Suicide attempt: No-0, Yes-1
  • Abuse Abuse Hx: No-0, Physical (P)-1, Sexual (S)-2, Emotional (E)-3, P&S-4, P&E-5, S&E-6, P&S&E-7
  • Non-subst Dx Non-substance-related Dx: 0 – none; 1 – one; 2 – More than one
  • Subst Dx Substance-related Dx: 0 – none; 1 – one Substance-related; 2 – two; 3 – three or more
  • Psych meds. Psychiatric Meds: 0 – none; 1 – one psychotropic med; 2 – more than one psychotropic med

Data Frame summary

After reading in the dataframe and reviewing the data dictionary that explains the values of our features, our next step is some exploratory analysis. We can start by running the skim() function to get a high-level summary of the dataset.

# Show df summary details
skim(df)
Data summary
Name df
Number of rows 175
Number of columns 54
_______________________
Column type frequency:
character 1
numeric 53
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Initial 0 1 2 3 0 108 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Age 0 1.00 39.47 11.17 18 29.5 42 48.0 69 ▆▅▇▅▁
Sex 0 1.00 1.43 0.50 1 1.0 1 2.0 2 ▇▁▁▁▆
Race 0 1.00 1.64 0.69 1 1.0 2 2.0 6 ▇▁▁▁▁
ADHD Q1 0 1.00 1.70 1.29 0 1.0 2 3.0 4 ▇▇▇▆▃
ADHD Q2 0 1.00 1.91 1.25 0 1.0 2 3.0 4 ▅▇▇▆▅
ADHD Q3 0 1.00 1.91 1.27 0 1.0 2 3.0 4 ▅▇▇▆▅
ADHD Q4 0 1.00 2.10 1.34 0 1.0 2 3.0 4 ▅▅▇▅▆
ADHD Q5 0 1.00 2.26 1.44 0 1.0 3 3.0 5 ▇▅▇▆▁
ADHD Q6 0 1.00 1.91 1.31 0 1.0 2 3.0 4 ▆▅▇▇▃
ADHD Q7 0 1.00 1.83 1.19 0 1.0 2 3.0 4 ▃▇▇▃▃
ADHD Q8 0 1.00 2.14 1.29 0 1.0 2 3.0 4 ▃▇▇▇▆
ADHD Q9 0 1.00 1.91 1.32 0 1.0 2 3.0 4 ▆▇▇▇▅
ADHD Q10 0 1.00 2.12 1.23 0 1.0 2 3.0 4 ▂▇▇▆▅
ADHD Q11 0 1.00 2.27 1.24 0 1.0 2 3.0 4 ▂▆▇▇▆
ADHD Q12 0 1.00 1.29 1.21 0 0.0 1 2.0 4 ▇▇▆▂▂
ADHD Q13 0 1.00 2.37 1.23 0 1.5 2 3.0 4 ▂▅▇▇▆
ADHD Q14 0 1.00 2.25 1.35 0 1.0 2 3.0 4 ▅▅▇▇▆
ADHD Q15 0 1.00 1.63 1.39 0 0.0 1 3.0 4 ▇▆▆▅▃
ADHD Q16 0 1.00 1.70 1.38 0 1.0 1 3.0 4 ▆▇▆▃▅
ADHD Q17 0 1.00 1.53 1.29 0 0.0 1 2.0 4 ▇▇▇▃▃
ADHD Q18 0 1.00 1.47 1.30 0 0.0 1 2.0 4 ▇▇▆▃▃
ADHD Total 0 1.00 34.32 16.70 0 21.0 33 47.5 72 ▃▆▇▆▂
MD Q1a 0 1.00 0.55 0.50 0 0.0 1 1.0 1 ▆▁▁▁▇
MD Q1b 0 1.00 0.57 0.50 0 0.0 1 1.0 1 ▆▁▁▁▇
MD Q1c 0 1.00 0.54 0.50 0 0.0 1 1.0 1 ▇▁▁▁▇
MD Q1d 0 1.00 0.58 0.49 0 0.0 1 1.0 1 ▆▁▁▁▇
MD Q1e 0 1.00 0.55 0.50 0 0.0 1 1.0 1 ▆▁▁▁▇
MD Q1f 0 1.00 0.70 0.46 0 0.0 1 1.0 1 ▃▁▁▁▇
MD Q1g 0 1.00 0.72 0.45 0 0.0 1 1.0 1 ▃▁▁▁▇
MD Q1h 0 1.00 0.56 0.50 0 0.0 1 1.0 1 ▆▁▁▁▇
MD Q1i 0 1.00 0.59 0.49 0 0.0 1 1.0 1 ▆▁▁▁▇
MD Q1j 0 1.00 0.39 0.49 0 0.0 0 1.0 1 ▇▁▁▁▅
MD Q1k 0 1.00 0.49 0.50 0 0.0 0 1.0 1 ▇▁▁▁▇
MD Q1L 0 1.00 0.58 0.49 0 0.0 1 1.0 1 ▆▁▁▁▇
MD Q1m 0 1.00 0.49 0.50 0 0.0 0 1.0 1 ▇▁▁▁▇
MD Q2 0 1.00 0.72 0.45 0 0.0 1 1.0 1 ▃▁▁▁▇
MD Q3 0 1.00 2.01 1.07 0 1.0 2 3.0 3 ▂▂▁▅▇
MD TOTAL 0 1.00 10.02 4.81 0 6.5 11 14.0 17 ▃▃▆▇▇
Alcohol 4 0.98 1.35 1.39 0 0.0 1 3.0 3 ▇▂▁▁▆
THC 4 0.98 0.81 1.27 0 0.0 0 1.5 3 ▇▁▁▁▃
Cocaine 4 0.98 1.09 1.39 0 0.0 0 3.0 3 ▇▁▁▁▅
Stimulants 4 0.98 0.12 0.53 0 0.0 0 0.0 3 ▇▁▁▁▁
Sedative-hypnotics 4 0.98 0.12 0.54 0 0.0 0 0.0 3 ▇▁▁▁▁
Opioids 4 0.98 0.39 0.99 0 0.0 0 0.0 3 ▇▁▁▁▁
Court order 5 0.97 0.09 0.28 0 0.0 0 0.0 1 ▇▁▁▁▁
Education 9 0.95 11.90 2.17 6 11.0 12 13.0 19 ▁▅▇▂▁
Hx of Violence 11 0.94 0.24 0.43 0 0.0 0 0.0 1 ▇▁▁▁▂
Disorderly Conduct 11 0.94 0.73 0.45 0 0.0 1 1.0 1 ▃▁▁▁▇
Suicide 13 0.93 0.30 0.46 0 0.0 0 1.0 1 ▇▁▁▁▃
Abuse 14 0.92 1.33 2.12 0 0.0 0 2.0 7 ▇▂▁▁▁
Non-subst Dx 22 0.87 0.44 0.68 0 0.0 0 1.0 2 ▇▁▃▁▁
Subst Dx 23 0.87 1.14 0.93 0 0.0 1 2.0 3 ▆▇▁▅▂
Psych meds. 118 0.33 0.96 0.80 0 0.0 1 2.0 2 ▇▁▇▁▆

We can see above that the dataset consists of 175 observations and 54 features. A few of the features have a substantial number of missing values, most notably Psych meds. (118 missing values), followed by Subst Dx and Non-subst Dx with 23 and 22 missing values, respectively. Looking at the data summary above, these correspond to medication and diagnosis questions that only apply to part of the population; for instance, when sampling the general population for a survey, only a minority of respondents would be expected to be on psychiatric medication or to be taking medication for substance-related or non-substance-related conditions. Many of these missing values therefore make sense, but they will be important to keep in mind when we start our clustering techniques.

Another useful aspect of the skim() output is the quick look it provides at the distribution of responses for each feature. From the Education distribution, we can see that the median education level was a high school diploma (12th grade). There also appear to have been slightly more male than female respondents, and the median age was around 42 years. Additionally, for several of the substance use questions, such as Cocaine, THC, and Alcohol, most respondents answered at the extremes, either "no use" or "dependence", with relatively little gray area in between. Finally, a large majority of respondents self-identified their race as White.

Missing Values

As we saw in our data summary, we have some missing values. Let's explore this a little further to see if any important patterns stand out.

# Use naniar package to plot missing value patterns
gg_miss_upset(df)

Most of the missing data is concentrated in a handful of features, led by Suicide, Abuse, Non-subst Dx, Subst Dx, and Psych meds.

For factors, traditional imputation approaches can be problematic. While in most cases the missing data might reasonably be inferred as 0, we don't know whether there were problems administering or coding the survey, or whether patients simply skipped questions they didn't want to answer (any of these may have applied). So, rather than dropping the rows or imputing values, we will create a new factor level, 'unknown', for each affected feature to indicate that we lack that information.

We also see that each respondent is identified by their initials. These won't be useful for our purposes (note there are also duplicates where multiple people share the same initials), so we drop the column.

# Replace missing with 'unknown' so we have a factor value for each
df <- df %>% 
  na_replace('unknown')

# Drop Initial column
df <- df %>% 
  dplyr::select(-Initial)

Looking at our dataset, we see that missing values are no longer an issue; however, we still have some additional cleanup to perform below before we proceed with modeling.

# Display dataframe summary
skim(df)
Data summary
Name df
Number of rows 175
Number of columns 53
_______________________
Column type frequency:
character 15
numeric 38
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Alcohol 0 1 1 7 0 5 0
THC 0 1 1 7 0 5 0
Cocaine 0 1 1 7 0 5 0
Stimulants 0 1 1 7 0 4 0
Sedative-hypnotics 0 1 1 7 0 5 0
Opioids 0 1 1 7 0 4 0
Court order 0 1 1 7 0 3 0
Education 0 1 1 7 0 15 0
Hx of Violence 0 1 1 7 0 3 0
Disorderly Conduct 0 1 1 7 0 3 0
Suicide 0 1 1 7 0 3 0
Abuse 0 1 1 7 0 9 0
Non-subst Dx 0 1 1 7 0 4 0
Subst Dx 0 1 1 7 0 5 0
Psych meds. 0 1 1 7 0 4 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Age 0 1 39.47 11.17 18 29.5 42 48.0 69 ▆▅▇▅▁
Sex 0 1 1.43 0.50 1 1.0 1 2.0 2 ▇▁▁▁▆
Race 0 1 1.64 0.69 1 1.0 2 2.0 6 ▇▁▁▁▁
ADHD Q1 0 1 1.70 1.29 0 1.0 2 3.0 4 ▇▇▇▆▃
ADHD Q2 0 1 1.91 1.25 0 1.0 2 3.0 4 ▅▇▇▆▅
ADHD Q3 0 1 1.91 1.27 0 1.0 2 3.0 4 ▅▇▇▆▅
ADHD Q4 0 1 2.10 1.34 0 1.0 2 3.0 4 ▅▅▇▅▆
ADHD Q5 0 1 2.26 1.44 0 1.0 3 3.0 5 ▇▅▇▆▁
ADHD Q6 0 1 1.91 1.31 0 1.0 2 3.0 4 ▆▅▇▇▃
ADHD Q7 0 1 1.83 1.19 0 1.0 2 3.0 4 ▃▇▇▃▃
ADHD Q8 0 1 2.14 1.29 0 1.0 2 3.0 4 ▃▇▇▇▆
ADHD Q9 0 1 1.91 1.32 0 1.0 2 3.0 4 ▆▇▇▇▅
ADHD Q10 0 1 2.12 1.23 0 1.0 2 3.0 4 ▂▇▇▆▅
ADHD Q11 0 1 2.27 1.24 0 1.0 2 3.0 4 ▂▆▇▇▆
ADHD Q12 0 1 1.29 1.21 0 0.0 1 2.0 4 ▇▇▆▂▂
ADHD Q13 0 1 2.37 1.23 0 1.5 2 3.0 4 ▂▅▇▇▆
ADHD Q14 0 1 2.25 1.35 0 1.0 2 3.0 4 ▅▅▇▇▆
ADHD Q15 0 1 1.63 1.39 0 0.0 1 3.0 4 ▇▆▆▅▃
ADHD Q16 0 1 1.70 1.38 0 1.0 1 3.0 4 ▆▇▆▃▅
ADHD Q17 0 1 1.53 1.29 0 0.0 1 2.0 4 ▇▇▇▃▃
ADHD Q18 0 1 1.47 1.30 0 0.0 1 2.0 4 ▇▇▆▃▃
ADHD Total 0 1 34.32 16.70 0 21.0 33 47.5 72 ▃▆▇▆▂
MD Q1a 0 1 0.55 0.50 0 0.0 1 1.0 1 ▆▁▁▁▇
MD Q1b 0 1 0.57 0.50 0 0.0 1 1.0 1 ▆▁▁▁▇
MD Q1c 0 1 0.54 0.50 0 0.0 1 1.0 1 ▇▁▁▁▇
MD Q1d 0 1 0.58 0.49 0 0.0 1 1.0 1 ▆▁▁▁▇
MD Q1e 0 1 0.55 0.50 0 0.0 1 1.0 1 ▆▁▁▁▇
MD Q1f 0 1 0.70 0.46 0 0.0 1 1.0 1 ▃▁▁▁▇
MD Q1g 0 1 0.72 0.45 0 0.0 1 1.0 1 ▃▁▁▁▇
MD Q1h 0 1 0.56 0.50 0 0.0 1 1.0 1 ▆▁▁▁▇
MD Q1i 0 1 0.59 0.49 0 0.0 1 1.0 1 ▆▁▁▁▇
MD Q1j 0 1 0.39 0.49 0 0.0 0 1.0 1 ▇▁▁▁▅
MD Q1k 0 1 0.49 0.50 0 0.0 0 1.0 1 ▇▁▁▁▇
MD Q1L 0 1 0.58 0.49 0 0.0 1 1.0 1 ▆▁▁▁▇
MD Q1m 0 1 0.49 0.50 0 0.0 0 1.0 1 ▇▁▁▁▇
MD Q2 0 1 0.72 0.45 0 0.0 1 1.0 1 ▃▁▁▁▇
MD Q3 0 1 2.01 1.07 0 1.0 2 3.0 3 ▂▂▁▅▇
MD TOTAL 0 1 10.02 4.81 0 6.5 11 14.0 17 ▃▃▆▇▇

Fix Datatypes

We know that most of our features are either factors or ordinal factors. Let's recode them from numeric to factor to ensure our models don't incorrectly treat them as continuous. Note that we'll leave Age as continuous even though it's technically an ordinal factor. Also, for simplicity, boolean features (only 0 or 1) can be left as integer. Looking ahead, factors are going to lead to a large number of features when we dummy code in preparation for modeling, since dummy encoding creates a separate column for each value. High dimensionality can cause problems: distances between observations become less informative, models are more prone to overfitting, and results get harder to interpret. Fortunately, we don't have that many features and values, so we aren't starting with an excessive number of columns. That said, we will likely need to either reduce dimensions through PCA or use different recoding approaches.

# We'll leave age as an integer
df$Age                  <- as.integer(df$Age)

# Misc columns (note: some columns are ordinal, others are not)
df$Sex                  <- factor(df$Sex)
df$Race                 <- factor(df$Race)
df$Alcohol              <- factor(df$Alcohol, levels=c("unknown", 0, 1, 2, 3))
df$THC                  <- factor(df$THC, levels=c('unknown', 0, 1, 2, 3))
df$Cocaine              <- factor(df$Cocaine, levels=c('unknown', 0, 1, 2, 3))
df$Stimulants           <- factor(df$Stimulants, levels=c('unknown', 0, 1, 2, 3))
df$`Sedative-hypnotics` <- factor(df$`Sedative-hypnotics`, levels=c('unknown', 0, 1, 2, 3))
df$Opioids              <- factor(df$Opioids, levels=c('unknown', 0, 1, 2, 3))

df$Education            <- factor(df$Education, levels=c('unknown', 0, 1, 2, 3, 4, 5, 6, 7, 8, 
                                                         9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20))

df$`Court order`        <- factor(df$`Court order`, levels=c('unknown', 0, 1))
df$`Hx of Violence`     <- factor(df$`Hx of Violence`, levels=c('unknown', 0, 1))
df$`Disorderly Conduct` <- factor(df$`Disorderly Conduct`, levels=c('unknown', 0, 1))
df$Suicide              <- factor(df$Suicide, levels=c('unknown', 0, 1))

df$Abuse                <- factor(df$Abuse, levels=c('unknown', 0, 1, 2, 3, 4, 5, 6, 7))

df$`Non-subst Dx`       <- factor(df$`Non-subst Dx`, levels=c('unknown', 0, 1, 2))
df$`Subst Dx`           <- factor(df$`Subst Dx`, levels = c('unknown', 0, 1, 2, 3))
df$`Psych meds.`        <- factor(df$`Psych meds.`, levels=c('unknown', 0, 1, 2))

# Quick trick to fix all columns matching name pattern
df <- df %>%
  mutate(across(contains('ADHD'), factor)) %>%
  mutate(across(contains('MD Q'), factor))

# Back these out since they are totals (sum), not factor
df$`ADHD Total`         <- as.integer(df$`ADHD Total`)
df$`MD TOTAL`           <- as.integer(df$`MD TOTAL`)

Feature Plots

We can examine the distributions of several features of interest, broken out by Sex.

[TODO: This was a quick and dirty placeholder … Provide better visualizations, possibly by feature category (e.g., ADHD, MD, Substance Abuse columns, etc)]

df_plot = df %>% mutate(Sex = ifelse(Sex == 1, "Male", "Female"))

par(mfrow=c(2,3))
par(oma=c(0,0,6,0))


tbl1 <- table(df_plot$Sex, df_plot$Alcohol)
barplot(tbl1, main="Alcohol use",
        col=c("thistle3", "cornflowerblue"), ylim=range(c(0, 180)), legend=TRUE)

tbl2 <- table(df_plot$Sex, df_plot$Cocaine)
barplot(tbl2, main="Cocain use",
        col=c("thistle3", "cornflowerblue"), ylim=range(c(0, 180)))

tbl3 <- table(df_plot$Sex, df_plot$THC)
barplot(tbl3, main="THC use",
        col=c("thistle3", "cornflowerblue"), ylim=range(c(0, 180)))

tbl4 <- table(df_plot$Sex, df_plot$Stimulants)
barplot(tbl4, main="Stimulant use",
        col=c("thistle3", "cornflowerblue"), ylim=range(c(0, 180)))

tbl5 <- table(df_plot$Sex, df_plot$`Sedative-hypnotics`)
barplot(tbl5, main="Sedative-hypnotic use",
        col=c("thistle3", "cornflowerblue"), ylim=range(c(0, 180)))

tbl6 <- table(df_plot$Sex, df_plot$Opioids)
barplot(tbl6, main="Opioid use",
        col=c("thistle3", "cornflowerblue"), ylim=range(c(0, 180)))


title("(0 = No use, 1 = Use, 2 = Abuse, 3 = Dependence)" , outer=TRUE)
par(oma=c(1,0,3,0))
title("SUBSTANCE USE RESPONSES BY SEX" , outer=TRUE)

myColors <- ifelse(df_plot$Sex=="Male" , "thistle3", "cornflowerblue")

par(mfrow=c(1,2))
boxplot(df_plot$`ADHD Total`~df_plot$Sex, xlab="", ylab = "ADHD Total", main="ADHD Totals by Sex", col=myColors)
boxplot(df_plot$`MD TOTAL`~df_plot$Sex, xlab="", ylab = "MD Total", main="MD Totals by Sex", col=myColors)

par(mfrow=c(2,2))
par(oma=c(0,0,6,0))

tbl7 <- table(df_plot$Sex, df_plot$`Court order`)
barplot(tbl7, main="Court order",
        col=c("thistle3", "cornflowerblue"), ylim=range(c(0, 180)))

tbl8 <- table(df_plot$Sex, df_plot$`Hx of Violence`)
barplot(tbl8, main="History of violence",
        col=c("thistle3", "cornflowerblue"), ylim=range(c(0, 180)))

tbl9 <- table(df_plot$Sex, df_plot$`Disorderly Conduct`)
barplot(tbl9, main="Disorderly conduct",
        col=c("thistle3", "cornflowerblue"), ylim=range(c(0, 180)))

tbl11 <- table(df_plot$Sex, df_plot$Suicide)
barplot(tbl11, main="Suicide",
        col=c("thistle3", "cornflowerblue"), ylim=range(c(0, 180)), legend=TRUE)


title("(0 = No, 1 = Yes)" , outer=TRUE)
par(oma=c(1,0,3,0))
title("MENTAL HEALTH AND JUSTICE INVOLVEMENT BY SEX" , outer=TRUE)

par(mfrow=c(1,1))
par(oma=c(4,0,0,0))

df_plot = df_plot %>% mutate(Abuse = ifelse(Abuse == 0, "None", 
                              ifelse(Abuse == 1, "Physical (P)", 
                              ifelse(Abuse == 2, "Sexual (S)", 
                              ifelse(Abuse == 3, "Emotional (E)",
                              ifelse(Abuse == 4, "P & S", 
                              ifelse(Abuse == 5, "P & E", 
                              ifelse(Abuse == 6, "S & E", 
                              ifelse(Abuse == 7, "P & E & S", "Unknown")))))))))

tbl10 <- table(df_plot$Sex, df_plot$Abuse)
barplot(tbl10, main="Abuse",
        col=c("thistle3", "cornflowerblue"), ylim=range(c(0, 180)), las=2, legend=TRUE)

When looking at these figures broken out by Sex, we can see a few things:

  • Interestingly, female respondents' ADHD total scores trended higher than those of male respondents.

  • MD totals were similar across male and female respondents.

  • Both male and female respondents answered "no use" more frequently for substances such as stimulants, sedative-hypnotics, and opioids than for more common substances such as alcohol, THC, and cocaine.

  • Males more frequently reported having a history of violence or disorderly conduct.

  • Conversely, females reported having attempted suicide slightly more frequently than males.

  • Finally, looking closely at the breakdown of the Abuse responses by sex, females reported having experienced some form of abuse far more frequently than males, especially sexual abuse or sexual abuse in combination with other forms of abuse.

[TODO: Insert CorrPlot? Since we have high dimensions, it may be problematic to visualize. Maybe if we facet? Not sure]

Placeholder (and useless) image inserted below.

library(ggcorrplot)
model.matrix(~., data=df) %>% 
  cor(use="pairwise.complete.obs") %>% 
  ggcorrplot(show.diag = F, type="lower", lab=TRUE, lab_size=2)

[Zach comment: I also tried to find a better way to visualize this, but came up empty. I did attempt a distance plot, but similar to yours above, there seem to be too many dimensions. My code below is pretty useless, but wanted to keep it in just in case it spurred other ideas.]

library(factoextra)
distance <- get_dist(df)
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))

[TODO: Describe any clear feature distributions or correlation patterns]

Dummy Columns

We’ll need to dummy code our categorical variables. This process creates a new column for each factor level and assigns a 0 or 1. Note that full-rank dummy encoding typically drops one level per factor to serve as the baseline: a categorical feature with five unique values yields four columns, and all zeros represents the reference level, which reduces the number of columns needed. Here we keep a column for every level (one-hot encoding) and let the zero-variance filter below remove levels that never occur. With the dummy columns in place, the original factor variables are replaced in this dataset.

# dummy encode our categorical features (one column per factor level)
df_dummy <- dummyVars(~ 0 + ., data = df)
df_dummy <- data.frame(predict(df_dummy, newdata = df))

# Uncomment if we want to verify the dummy df summary
#skim(df_dummy)

(Near) Zero Variance

We should remove any columns with zero variance. Note that since we are working with a sparse dummy-encoded dataset, we do NOT want to remove near-zero-variance columns, just zero-variance ones.

nzv <- nearZeroVar(df_dummy, saveMetrics= TRUE)
(nzv %>% filter(zeroVar == T))
##              freqRatio percentUnique zeroVar  nzv
## Stimulants.2         0     0.5714286    TRUE TRUE
## Opioids.2            0     0.5714286    TRUE TRUE
## Education.0          0     0.5714286    TRUE TRUE
## Education.1          0     0.5714286    TRUE TRUE
## Education.2          0     0.5714286    TRUE TRUE
## Education.3          0     0.5714286    TRUE TRUE
## Education.4          0     0.5714286    TRUE TRUE
## Education.5          0     0.5714286    TRUE TRUE
## Education.20         0     0.5714286    TRUE TRUE
# drop the flagged columns (note: nearZeroVar() with default settings also flags
# near-zero variance columns, so rare dummy levels are removed here as well)
nzv <- nearZeroVar(df_dummy)
df_dummy <- df_dummy[, -nzv]

[TODO: Should we revisit correlations between factor variables? Describe any clear feature distributions or correlation patterns]

Feature Correlation

[TODO: Show feature value correlations]

[TODO: Discuss multicollinearity and determine if any features are candidates to be dropped up front.]

Class imbalance

If the values within a factor are highly imbalanced, we can get bias: well-represented values give the algorithms more data points to learn from, while rare values are learned poorly.
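
A quick way to check for imbalance (a sketch using the df built above) is to tabulate the levels of a few factors of interest, for example Suicide, the Part 3 target, and Sex:

# Tabulate factor levels to spot imbalance (counts and proportions)
table(df$Suicide)
round(prop.table(table(df$Suicide)), 2)
table(df$Sex)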

[TODO: If any strong class imbalance of values with in a feature, we might need to bootstrap or resample to address]

Next, we’ll do some data tidying to get our data set ready for modeling. [TODO: Insert discussion on any features to remove, keep, data type changes, needed transforms, etc]

# remove any features
#df <- df %>% 
#  dplyr::select(-c())

Transforms

Age is our only continuous variable. We standardize it using a simple z-score transformation. Although this may help our distance-based models, we'll have to be careful about interpretability later on!

# z-score scale continuous features (note: this scales df only; df_dummy,
# created above, still carries the raw Age values)
df[, c("Age")] <- scale(df[, c("Age")])

Our data set is ready for our unsupervised models.

# quick inspect of dataframe
kable(head(df_dummy)) %>% 
  kable_styling(bootstrap_options = "basic")
Age Sex.1 Sex.2 Race.1 Race.2 X.ADHD.Q1.0 X.ADHD.Q1.1 X.ADHD.Q1.2 X.ADHD.Q1.3 X.ADHD.Q1.4 X.ADHD.Q2.0 X.ADHD.Q2.1 X.ADHD.Q2.2 X.ADHD.Q2.3 X.ADHD.Q2.4 X.ADHD.Q3.0 X.ADHD.Q3.1 X.ADHD.Q3.2 X.ADHD.Q3.3 X.ADHD.Q3.4 X.ADHD.Q4.0 X.ADHD.Q4.1 X.ADHD.Q4.2 X.ADHD.Q4.3 X.ADHD.Q4.4 X.ADHD.Q5.0 X.ADHD.Q5.1 X.ADHD.Q5.2 X.ADHD.Q5.3 X.ADHD.Q5.4 X.ADHD.Q6.0 X.ADHD.Q6.1 X.ADHD.Q6.2 X.ADHD.Q6.3 X.ADHD.Q6.4 X.ADHD.Q7.0 X.ADHD.Q7.1 X.ADHD.Q7.2 X.ADHD.Q7.3 X.ADHD.Q7.4 X.ADHD.Q8.0 X.ADHD.Q8.1 X.ADHD.Q8.2 X.ADHD.Q8.3 X.ADHD.Q8.4 X.ADHD.Q9.0 X.ADHD.Q9.1 X.ADHD.Q9.2 X.ADHD.Q9.3 X.ADHD.Q9.4 X.ADHD.Q10.0 X.ADHD.Q10.1 X.ADHD.Q10.2 X.ADHD.Q10.3 X.ADHD.Q10.4 X.ADHD.Q11.0 X.ADHD.Q11.1 X.ADHD.Q11.2 X.ADHD.Q11.3 X.ADHD.Q11.4 X.ADHD.Q12.0 X.ADHD.Q12.1 X.ADHD.Q12.2 X.ADHD.Q12.3 X.ADHD.Q12.4 X.ADHD.Q13.0 X.ADHD.Q13.1 X.ADHD.Q13.2 X.ADHD.Q13.3 X.ADHD.Q13.4 X.ADHD.Q14.0 X.ADHD.Q14.1 X.ADHD.Q14.2 X.ADHD.Q14.3 X.ADHD.Q14.4 X.ADHD.Q15.0 X.ADHD.Q15.1 X.ADHD.Q15.2 X.ADHD.Q15.3 X.ADHD.Q15.4 X.ADHD.Q16.0 X.ADHD.Q16.1 X.ADHD.Q16.2 X.ADHD.Q16.3 X.ADHD.Q16.4 X.ADHD.Q17.0 X.ADHD.Q17.1 X.ADHD.Q17.2 X.ADHD.Q17.3 X.ADHD.Q17.4 X.ADHD.Q18.0 X.ADHD.Q18.1 X.ADHD.Q18.2 X.ADHD.Q18.3 X.ADHD.Q18.4 X.ADHD.Total. X.MD.Q1a.0 X.MD.Q1a.1 X.MD.Q1b.0 X.MD.Q1b.1 X.MD.Q1c.0 X.MD.Q1c.1 X.MD.Q1d.0 X.MD.Q1d.1 X.MD.Q1e.0 X.MD.Q1e.1 X.MD.Q1f.0 X.MD.Q1f.1 X.MD.Q1g.0 X.MD.Q1g.1 X.MD.Q1h.0 X.MD.Q1h.1 X.MD.Q1i.0 X.MD.Q1i.1 X.MD.Q1j.0 X.MD.Q1j.1 X.MD.Q1k.0 X.MD.Q1k.1 X.MD.Q1L.0 X.MD.Q1L.1 X.MD.Q1m.0 X.MD.Q1m.1 X.MD.Q2.0 X.MD.Q2.1 X.MD.Q3.0 X.MD.Q3.1 X.MD.Q3.2 X.MD.Q3.3 X.MD.TOTAL. Alcohol.0 Alcohol.1 Alcohol.3 THC.0 THC.1 THC.3 Cocaine.0 Cocaine.1 Cocaine.3 Stimulants.0 X.Sedative.hypnotics.0 Opioids.0 Opioids.3 X.Court.order.0 X.Court.order.1 Education.unknown Education.9 Education.10 Education.11 Education.12 Education.13 Education.14 X.Hx.of.Violence.unknown X.Hx.of.Violence.0 X.Hx.of.Violence.1 X.Disorderly.Conduct.unknown X.Disorderly.Conduct.0 X.Disorderly.Conduct.1 Suicide.unknown Suicide.0 Suicide.1 Abuse.unknown Abuse.0 Abuse.2 Abuse.5 X.Non.subst.Dx.unknown X.Non.subst.Dx.0 X.Non.subst.Dx.1 X.Non.subst.Dx.2 X.Subst.Dx.unknown X.Subst.Dx.0 X.Subst.Dx.1 X.Subst.Dx.2 X.Subst.Dx.3 X.Psych.meds..unknown X.Psych.meds..0 X.Psych.meds..1 X.Psych.meds..2
24 1 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 37 0 1 0 1 0 1 0 1 1 0 0 1 0 1 0 1 0 1 0 1 0 1 1 0 0 1 0 1 0 0 0 1 15 0 1 0 0 1 0 0 1 0 1 1 1 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1
48 0 1 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 52 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 1 0 1 0 0 1 1 0 0 1 0 0 0 1 14 1 0 0 1 0 0 1 0 0 1 1 1 0 1 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0
51 0 1 1 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 28 1 0 1 0 1 0 1 0 0 1 0 1 0 1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 0 0 1 0 5 1 0 0 1 0 0 1 0 0 1 1 1 0 1 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0
43 1 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 42 0 1 0 1 1 0 1 0 0 1 0 1 0 1 0 1 0 1 1 0 1 0 0 1 0 1 0 1 0 0 0 1 13 0 1 0 0 1 0 0 1 0 0 1 1 0 1 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1
34 1 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 45 1 0 0 1 1 0 0 1 1 0 0 1 0 1 1 0 1 0 1 0 1 0 1 0 1 0 0 1 0 0 1 0 7 0 1 0 0 1 0 1 0 0 1 1 1 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 1 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0
39 0 1 1 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 52 1 0 0 1 1 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 1 0 0 1 0 0 0 1 14 0 1 0 1 0 0 1 0 0 1 1 1 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0 0 1 0 0 0 0 1 0 0

Part 1: Cluster Patients


Please use a clustering method to find clusters of patients here. Whether you choose to use k-means clustering or hierarchical clustering is up to you as long as you reason through your work. Can you come up with creative names for the profiles you found? (60)

K-Means

Now that our data has been tidied and is ready for a few unsupervised learning techniques, we'll start with a k-means approach. The k-means algorithm works via the following steps:

  • First, a number of clusters (k) is chosen, and the algorithm creates k initial "centroids" (for example, randomly placed points or randomly chosen observations).

  • Each observation is then assigned to its nearest centroid, typically by Euclidean distance; the nearest centroid determines the observation's cluster.

  • The algorithm then recomputes each centroid as the average of all the points assigned to it, and the assignment step runs again with these new centroids.

  • The process ends when the centroid positions (and therefore the cluster assignments) no longer change between successive iterations. A toy sketch of these steps is shown below.
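
To make these steps concrete, here is a small toy sketch of the assign/update loop on two made-up features; it is illustrative only and not part of our analysis:

# Toy illustration of the k-means steps described above (not part of the analysis)
set.seed(42)
toy <- cbind(x = c(rnorm(10, 0), rnorm(10, 5)),
             y = c(rnorm(10, 0), rnorm(10, 5)))

k         <- 2
centroids <- toy[sample(nrow(toy), k), , drop = FALSE]   # step 1: random initial centroids

for (iter in 1:10) {
  # step 2: assign each point to its nearest centroid (squared Euclidean distance)
  dists    <- sapply(1:k, function(i) colSums((t(toy) - centroids[i, ])^2))
  clusters <- max.col(-dists)
  
  # step 3: recompute each centroid as the mean of its assigned points
  new_centroids <- t(sapply(1:k, function(i) colMeans(toy[clusters == i, , drop = FALSE])))
  
  # stop when the centroids (and hence the assignments) no longer change
  if (all(new_centroids == centroids)) break
  centroids <- new_centroids
}

table(clusters)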

For our dataset, we'll apply the k-means algorithm directly to the dummy-encoded data. However, k-means requires specifying the number of clusters up front, which can be a difficult task, so we outline that process below.

Determine Segments

We'll need to identify an appropriate value for k, the number of clusters the rows will be grouped into. Common techniques include the Elbow method, the Silhouette method, the Gap statistic, and NbClust(). For simplicity, we will use the Elbow approach. [TODO: Insert discussion]

# Elbow method
fviz_nbclust(df_dummy, kmeans, method = "wss") +
  geom_vline(xintercept = 4, linetype = 2) + # add line for better visualization
  labs(subtitle = "Elbow method") # add subtitle

We can see above that the within-cluster sum of squares flattens out at roughly 4-5 clusters; additional clusters do little to reduce within-segment variability. Therefore, we'll perform our initial K-Means model with \(k=5\).

Build Model

set.seed(42)
km_res <- kmeans(df_dummy, centers = 5, nstart = 20)

summary(km_res)
##              Length Class  Mode   
## cluster      175    -none- numeric
## centers      885    -none- numeric
## totss          1    -none- numeric
## withinss       5    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           5    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric
sil <- silhouette(km_res$cluster, dist(df_dummy))
fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1   34          0.33
## 2       2   18          0.33
## 3       3   43          0.29
## 4       4   39          0.33
## 5       5   41          0.24

After running the kmeans() function, we can take a look at the five clusters created by the algorithm.
As a reminder, the silhouette coefficient is interpreted as follows:

  • > 0 means the observation is well grouped; the closer the coefficient is to 1, the better the observation is grouped.

  • < 0 means the observation has likely been placed in the wrong cluster.

  • = 0 means the observation sits on the border between two clusters.

The silhouette plot above and the average silhouette coefficient help to determine whether the clustering is good. If a large majority of the silhouette coefficients are positive, the observations are generally placed in the correct group. The silhouette can therefore also be used to help choose the optimal number of clusters.
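
As a cross-check on our choice of k (a sketch; we keep k = 5 for the model above), the same factoextra helper used for the elbow plot can compute the average silhouette width across candidate values of k:

# Average silhouette width for a range of k values (cross-check on the elbow result)
fviz_nbclust(df_dummy, kmeans, method = "silhouette") +
  labs(subtitle = "Silhouette method")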

It is also possible to plot the clusters using the fviz_cluster() function. Note that a principal component analysis is performed behind the scenes so the observations can be represented in a two-dimensional plane (the first two principal components).

fviz_cluster(km_res, df_dummy, ellipse.type = "norm")

[TODO: Discussion of model results]

To provide some context around the clusters generated by our algorithm, we appended the cluster assignments to our df_dummy dataset, grouped by cluster, and computed the mean of each feature to see whether any noticeable trends arise.

k_means_analysis_df <- df_dummy %>%
  mutate(Cluster = km_res$cluster) %>%
  group_by(Cluster) %>%
  summarise_all("mean")
k_means_analysis_df <- k_means_analysis_df %>% select('Cluster', 'Age', 'Sex.1', 'X.ADHD.Total.', 'X.MD.TOTAL.', 'Alcohol.0', 'Alcohol.3', 'THC.0', 'THC.3', 'X.Court.order.1', 'X.Hx.of.Violence.1', 'Suicide.1')

kable(k_means_analysis_df) %>% 
  kable_styling(bootstrap_options = "responsive")
Cluster Age Sex.1 X.ADHD.Total. X.MD.TOTAL. Alcohol.0 Alcohol.3 THC.0 THC.3 X.Court.order.1 X.Hx.of.Violence.1 Suicide.1
1 26.58824 0.5294118 46.02941 12.794118 0.3823529 0.4705882 0.5294118 0.3529412 0.2058824 0.1764706 0.4117647
2 24.61111 0.6666667 11.83333 6.611111 0.8333333 0.1666667 0.3888889 0.5000000 0.1666667 0.2222222 0.1666667
3 48.97674 0.6511628 16.16279 7.069767 0.3953488 0.4418605 0.7674419 0.1162791 0.0697674 0.1395349 0.1627907
4 48.20513 0.4102564 47.07692 11.820513 0.3076923 0.4358974 0.8205128 0.0769231 0.0512821 0.1794872 0.3589744
5 38.41463 0.6097561 29.19512 10.609756 0.5609756 0.2682927 0.6341463 0.2682927 0.0000000 0.4146341 0.2682927

From the table above, we isolated 12 columns of particular interest, whose means differ noticeably across clusters. Here are some interpretations of how the k-means clustering grouped the respondents:

  • Cluster #1: The 34 individuals in cluster 1 are on the younger end, with an average age of about 26, below the 25th percentile of our overall dataset. This group had a high average ADHD total and the highest average MD total. It also had the highest proportion reporting alcohol dependence, as well as the highest proportions reporting a suicide attempt and a court order.

  • Cluster #2: The 18 individuals in this cluster have the youngest average age of the five clusters. This group also has the largest male majority, the lowest average ADHD total, the lowest proportion reporting alcohol dependence, the highest proportion reporting THC dependence, and the second highest proportion with a court order.

  • Cluster #3: The 43 individuals in this cluster make up our largest group, with the oldest average age at roughly 49 years. The group has a majority of male respondents, the second highest proportion reporting alcohol dependence, and the second lowest proportion reporting THC dependence (most responded "no use"). It also has the lowest proportions reporting a history of violence and a suicide attempt.

  • Cluster #4: The 39 individuals in this cluster have the second oldest average age. The group has a majority of female respondents, the highest average ADHD total, the second highest average MD total, the lowest proportion reporting THC dependence (most responded "no use"), and the second highest proportion reporting a suicide attempt.

  • Cluster #5: The final cluster of 41 individuals has an average age that sits between those of the other four clusters. This group falls in between the other clusters on most other measures, except that no one in it reported a court order, and it has the highest proportion reporting a history of violence.

Hierarchical Clustering

Hierarchical clustering, also known as hierarchical cluster analysis, is an algorithm that groups similar objects into clusters. The endpoint is a set of clusters, where each cluster is distinct from the others and the objects within each cluster are broadly similar to each other.

There are five main methods for measuring the distance between clusters, referred to as linkage methods:

  • Single linkage: computes the minimum distance between clusters before merging them.
  • Complete linkage: computes the maximum distance between clusters before merging them.
  • Average linkage: computes the average distance between clusters before merging them.
  • Centroid linkage: calculates the centroid of each cluster, then computes the distance between the two centroids before merging them.
  • Ward's (minimum variance) criterion: minimizes the total within-cluster variance by merging the pair of clusters that leads to the minimum increase in total within-cluster variance.

We apply hierarchical clustering using the hclust() function, once with the complete linkage criterion (method = "complete") and once with Ward's method (method = "ward.D2"):

d <- dist(df_dummy, method = "euclidean")
hclust1 <- hclust(d, method = "complete")
plot(hclust1)

plot(hclust1)
rect.hclust(hclust1,
  k = 5, # k is used to specify the number of clusters
  border = "blue"
)

hclust2 <- hclust(d, method = "ward.D2")
plot(hclust2)

plot(hclust2)
rect.hclust(hclust2,
  k = 5, # k is used to specify the number of clusters
  border = "blue"
)

We compare hierarchical clustering with complete linkage versus Ward’s method. The function tanglegram plots two dendrograms, side by side, with their labels connected by lines.

dend1 <- as.dendrogram(hclust1)
dend2 <- as.dendrogram(hclust2)
dend_list <- dendlist(dend1, dend2)

tanglegram(dend1, dend2,
  highlight_distinct_edges = FALSE, # Turn-off dashed lines
  common_subtrees_color_lines = FALSE, # Turn-off line colors
  common_subtrees_color_branches = TRUE, # Color common branches 
  main = paste("entanglement =", round(entanglement(dend_list), 2))
  )
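
One possible next step (a sketch): cut each dendrogram into 5 groups, matching the rectangles drawn above, and cross-tabulate the memberships against each other and against the k-means clusters from earlier.

# Cut each tree into 5 clusters and compare the resulting memberships
hc_complete_5 <- cutree(hclust1, k = 5)
hc_ward_5     <- cutree(hclust2, k = 5)

table(Complete = hc_complete_5, Ward = hc_ward_5)   # agreement between linkage methods
table(Ward = hc_ward_5, KMeans = km_res$cluster)    # agreement with the k-means clusters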

[TODO: Explanation of the result]

Model

[TODO: Discussion of model results]

Identify “Profiles”

[EDA of Segments - use variable importance to see which features (factor and value) were most important, then see if there are clear groupings]

[TODO: Overall Discussion of segments]


Part 2: Principal Component Analysis (PCA)


Let’s explore using Principal Component Analysis on this dataset. You will note that there are different types of questions in the dataset: columns E-W: ADHD self-report; column X–AM: mood disorders questionnaire, column AN-AS: Individual Substance Misuse; etc. Please reason through your work as you decide on which sets of variables you want to use to conduct Principal Component Analysis. (60)
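
One way to follow the prompt's suggestion of reasoning about variable blocks is to run PCA on a single block; as a sketch, the 18 ADHD items can be analyzed on their own after converting them back to numeric scores (the analysis below instead applies PCA to the full dummy-encoded matrix):

# PCA on the ADHD self-report block alone (sketch)
adhd_items <- df %>%
  dplyr::select(dplyr::matches("^ADHD Q")) %>%
  mutate(across(everything(), ~ as.numeric(as.character(.x))))

adhd_pca <- prcomp(adhd_items, center = TRUE, scale. = TRUE)
summary(adhd_pca)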

df_dummy.pca <- prcomp(df_dummy, center = TRUE, scale. = TRUE)

summary(df_dummy.pca)
## Importance of components:
##                           PC1     PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     4.4894 3.20295 2.77590 2.62572 2.30642 2.10226 2.02049
## Proportion of Variance 0.1139 0.05796 0.04353 0.03895 0.03005 0.02497 0.02306
## Cumulative Proportion  0.1139 0.17183 0.21537 0.25432 0.28437 0.30934 0.33240
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     1.98530 1.93805 1.89249 1.83511 1.76704 1.75229 1.71581
## Proportion of Variance 0.02227 0.02122 0.02023 0.01903 0.01764 0.01735 0.01663
## Cumulative Proportion  0.35467 0.37589 0.39613 0.41515 0.43279 0.45014 0.46677
##                           PC15    PC16    PC17    PC18    PC19    PC20   PC21
## Standard deviation     1.69830 1.67791 1.64408 1.60241 1.57614 1.53845 1.5343
## Proportion of Variance 0.01629 0.01591 0.01527 0.01451 0.01404 0.01337 0.0133
## Cumulative Proportion  0.48307 0.49898 0.51425 0.52875 0.54279 0.55616 0.5695
##                           PC22    PC23    PC24    PC25    PC26    PC27   PC28
## Standard deviation     1.51049 1.48294 1.47207 1.44851 1.42889 1.41373 1.3888
## Proportion of Variance 0.01289 0.01242 0.01224 0.01185 0.01154 0.01129 0.0109
## Cumulative Proportion  0.58235 0.59477 0.60702 0.61887 0.63041 0.64170 0.6526
##                          PC29    PC30    PC31    PC32    PC33    PC34    PC35
## Standard deviation     1.3695 1.34982 1.32913 1.31066 1.30210 1.27977 1.27124
## Proportion of Variance 0.0106 0.01029 0.00998 0.00971 0.00958 0.00925 0.00913
## Cumulative Proportion  0.6632 0.67349 0.68347 0.69317 0.70275 0.71200 0.72113
##                           PC36    PC37    PC38    PC39    PC40    PC41    PC42
## Standard deviation     1.26014 1.23169 1.21476 1.20622 1.19689 1.17444 1.15464
## Proportion of Variance 0.00897 0.00857 0.00834 0.00822 0.00809 0.00779 0.00753
## Cumulative Proportion  0.73011 0.73868 0.74701 0.75523 0.76333 0.77112 0.77865
##                           PC43    PC44    PC45    PC46    PC47    PC48    PC49
## Standard deviation     1.13430 1.11718 1.09868 1.08651 1.07766 1.06357 1.04314
## Proportion of Variance 0.00727 0.00705 0.00682 0.00667 0.00656 0.00639 0.00615
## Cumulative Proportion  0.78592 0.79297 0.79979 0.80646 0.81302 0.81941 0.82556
##                           PC50    PC51    PC52    PC53    PC54    PC55    PC56
## Standard deviation     1.02718 1.01907 1.00972 0.99954 0.97391 0.95617 0.94644
## Proportion of Variance 0.00596 0.00587 0.00576 0.00564 0.00536 0.00517 0.00506
## Cumulative Proportion  0.83152 0.83739 0.84315 0.84879 0.85415 0.85932 0.86438
##                           PC57    PC58    PC59    PC60    PC61    PC62    PC63
## Standard deviation     0.92962 0.91885 0.90686 0.88778 0.88067 0.87013 0.86527
## Proportion of Variance 0.00488 0.00477 0.00465 0.00445 0.00438 0.00428 0.00423
## Cumulative Proportion  0.86926 0.87403 0.87868 0.88313 0.88751 0.89179 0.89602
##                           PC64    PC65    PC66    PC67    PC68   PC69    PC70
## Standard deviation     0.85583 0.84744 0.84228 0.82681 0.81179 0.7869 0.78045
## Proportion of Variance 0.00414 0.00406 0.00401 0.00386 0.00372 0.0035 0.00344
## Cumulative Proportion  0.90016 0.90422 0.90822 0.91209 0.91581 0.9193 0.92275
##                           PC71    PC72    PC73    PC74    PC75    PC76    PC77
## Standard deviation     0.75570 0.74874 0.74247 0.72263 0.71918 0.70869 0.69445
## Proportion of Variance 0.00323 0.00317 0.00311 0.00295 0.00292 0.00284 0.00272
## Cumulative Proportion  0.92598 0.92914 0.93226 0.93521 0.93813 0.94097 0.94369
##                          PC78    PC79    PC80    PC81   PC82   PC83    PC84
## Standard deviation     0.6914 0.68394 0.67353 0.65629 0.6379 0.6246 0.61415
## Proportion of Variance 0.0027 0.00264 0.00256 0.00243 0.0023 0.0022 0.00213
## Cumulative Proportion  0.9464 0.94904 0.95160 0.95403 0.9563 0.9585 0.96067
##                           PC85    PC86    PC87   PC88    PC89    PC90    PC91
## Standard deviation     0.61093 0.59934 0.59026 0.5799 0.56819 0.55491 0.54633
## Proportion of Variance 0.00211 0.00203 0.00197 0.0019 0.00182 0.00174 0.00169
## Cumulative Proportion  0.96278 0.96480 0.96677 0.9687 0.97050 0.97224 0.97392
##                           PC92   PC93    PC94    PC95    PC96    PC97    PC98
## Standard deviation     0.53117 0.5159 0.50429 0.49417 0.47779 0.47247 0.46337
## Proportion of Variance 0.00159 0.0015 0.00144 0.00138 0.00129 0.00126 0.00121
## Cumulative Proportion  0.97552 0.9770 0.97846 0.97984 0.98113 0.98239 0.98360
##                           PC99   PC100   PC101   PC102   PC103   PC104   PC105
## Standard deviation     0.44758 0.43428 0.42540 0.41019 0.40229 0.40193 0.38187
## Proportion of Variance 0.00113 0.00107 0.00102 0.00095 0.00091 0.00091 0.00082
## Cumulative Proportion  0.98473 0.98580 0.98682 0.98777 0.98869 0.98960 0.99042
##                         PC106   PC107   PC108   PC109   PC110   PC111   PC112
## Standard deviation     0.3767 0.35583 0.34609 0.34245 0.33205 0.32181 0.31356
## Proportion of Variance 0.0008 0.00072 0.00068 0.00066 0.00062 0.00059 0.00056
## Cumulative Proportion  0.9912 0.99194 0.99262 0.99328 0.99390 0.99449 0.99504
##                          PC113  PC114   PC115  PC116   PC117   PC118   PC119
## Standard deviation     0.30079 0.2988 0.28458 0.2669 0.25949 0.24660 0.23668
## Proportion of Variance 0.00051 0.0005 0.00046 0.0004 0.00038 0.00034 0.00032
## Cumulative Proportion  0.99555 0.9961 0.99652 0.9969 0.99730 0.99764 0.99796
##                          PC120   PC121   PC122   PC123  PC124   PC125   PC126
## Standard deviation     0.22643 0.22324 0.21772 0.19814 0.1873 0.16559 0.15200
## Proportion of Variance 0.00029 0.00028 0.00027 0.00022 0.0002 0.00015 0.00013
## Cumulative Proportion  0.99825 0.99853 0.99880 0.99902 0.9992 0.99937 0.99950
##                          PC127  PC128  PC129   PC130   PC131   PC132   PC133
## Standard deviation     0.14085 0.1347 0.1308 0.09985 0.08419 0.08024 0.06921
## Proportion of Variance 0.00011 0.0001 0.0001 0.00006 0.00004 0.00004 0.00003
## Cumulative Proportion  0.99962 0.9997 0.9998 0.99987 0.99991 0.99995 0.99997
##                          PC134   PC135  PC136   PC137     PC138     PC139
## Standard deviation     0.05072 0.03287 0.0245 0.01907 1.994e-15 1.116e-15
## Proportion of Variance 0.00001 0.00001 0.0000 0.00000 0.000e+00 0.000e+00
## Cumulative Proportion  0.99999 0.99999 1.0000 1.00000 1.000e+00 1.000e+00
##                            PC140     PC141     PC142    PC143     PC144
## Standard deviation     1.004e-15 1.001e-15 7.869e-16 7.77e-16 7.601e-16
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.00e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.00e+00 1.000e+00
##                            PC145     PC146     PC147     PC148     PC149
## Standard deviation     6.118e-16 5.907e-16 5.039e-16 4.856e-16 4.186e-16
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                           PC150    PC151    PC152    PC153    PC154    PC155
## Standard deviation     3.56e-16 3.56e-16 3.56e-16 3.56e-16 3.56e-16 3.56e-16
## Proportion of Variance 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00
## Cumulative Proportion  1.00e+00 1.00e+00 1.00e+00 1.00e+00 1.00e+00 1.00e+00
##                           PC156    PC157    PC158    PC159    PC160    PC161
## Standard deviation     3.56e-16 3.56e-16 3.56e-16 3.56e-16 3.56e-16 3.56e-16
## Proportion of Variance 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00
## Cumulative Proportion  1.00e+00 1.00e+00 1.00e+00 1.00e+00 1.00e+00 1.00e+00
##                           PC162    PC163    PC164    PC165    PC166    PC167
## Standard deviation     3.56e-16 3.56e-16 3.56e-16 3.56e-16 3.56e-16 3.56e-16
## Proportion of Variance 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00
## Cumulative Proportion  1.00e+00 1.00e+00 1.00e+00 1.00e+00 1.00e+00 1.00e+00
##                           PC168    PC169    PC170    PC171     PC172     PC173
## Standard deviation     3.56e-16 3.56e-16 3.56e-16 2.45e-16 2.359e-16 2.232e-16
## Proportion of Variance 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.00e+00 1.00e+00 1.00e+00 1.00e+00 1.000e+00 1.000e+00
##                            PC174     PC175
## Standard deviation     2.072e-16 1.684e-16
## Proportion of Variance 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00
biplot(df_dummy.pca, scale = 0)

#compute standard deviation of each principal component
std_dev <- df_dummy.pca$sdev

#compute variance
pr_var <- std_dev^2

#check variance of first 10 components
pr_var[1:10]
##  [1] 20.155101 10.258863  7.705646  6.894418  5.319577  4.419484  4.082382
##  [8]  3.941416  3.756023  3.581516
prop_varex <- pr_var/sum(pr_var)
prop_varex[1:10]
##  [1] 0.11387063 0.05795968 0.04353473 0.03895151 0.03005411 0.02496884
##  [7] 0.02306430 0.02226789 0.02122047 0.02023455
#scree plot
plot(prop_varex, xlab = "Principal Component",
             ylab = "Proportion of Variance Explained",
             type = "b")

#cumulative scree plot
plot(cumsum(prop_varex), xlab = "Principal Component",
              ylab = "Cumulative Proportion of Variance Explained",
              type = "b")
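
To turn the cumulative plot into a concrete cutoff (the 80% threshold here is an arbitrary illustrative choice), we can find the first component at which the cumulative proportion of variance crosses it:

# first component at which cumulative variance explained reaches ~80%
which(cumsum(prop_varex) >= 0.80)[1]

From the cumulative proportions printed above, this lands at PC46, i.e., roughly the first quarter of the components capture about 80% of the variance.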


Part 3: Support Vector Machines (SVM)


Assume you are modeling whether a patient attempted suicide (column AX). Please use support vector machine to model this. You might want to consider reducing the number of variables or somehow use extracted information from the variables. This can be a really fun modeling task! (80)
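
A minimal starting sketch for this part, assuming the objects created above (df and df_dummy.pca) and the e1071 package; the number of principal components (20), the radial kernel, and the cost value are arbitrary illustrative choices, not tuned results:

# Sketch: SVM on the first 20 principal components, predicting Suicide
library(e1071)

keep   <- df$Suicide != 'unknown'                      # drop rows with unknown outcome
svm_df <- data.frame(df_dummy.pca$x[keep, 1:20],       # PCA scores as predictors
                     Suicide = droplevels(df$Suicide[keep]))

set.seed(42)
idx   <- sample(nrow(svm_df), size = round(0.8 * nrow(svm_df)))
train <- svm_df[idx, ]
test  <- svm_df[-idx, ]

svm_fit <- svm(Suicide ~ ., data = train, kernel = "radial", cost = 1)
preds   <- predict(svm_fit, newdata = test)
table(Predicted = preds, Actual = test$Suicide)        # simple confusion matrix

From here, one would want to tune the kernel and cost (e.g., via cross-validation), revisit how many components to keep, and consider the class imbalance in the Suicide variable noted earlier.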