Packages

The following packages were used for this solution:

library(tidyverse)
library(MASS)
library(caret)
library(haven)
library(purrr)
library(reshape2)
library(scales)
library(ordinal)
library(lattice)
library(psych)
library(fastDummies)
library(FactoMineR)
library(factoextra)
library(logisticPCA)
library(klaR)

# Ensure 'select' comes from package 'dplyr'
select <- dplyr::select

# Ensure 'alpha' comes from 'psych'
alpha <- psych::alpha

Task 1

Find out how each attribute (and level) influences the overall likeliness to download (experiment_data$answer).

BACKGROUND: 892 participants were asked to rate “How likely are you to download this mobile app?” for 12 random permutations of a message describing the app.

The first step to completing this task involves loading the data and exploring the distributions and types of data that we will be working with.

# Set root directory
setwd("../")

# Load the data
experiment_data <-
  read_sav('inst/extdata/data/experiment_data.sav')

# Return the structure of our data
glimpse(experiment_data)
Rows: 10,704
Columns: 9
$ response_id  <chr> "R_00zxXbZGxe0owBX", "R_00zxXbZGxe0owBX", "R_00zxXbZGxe0owBX", "R_00zxXbZGxe0owBX", "R_00zxXbZGxe0owBX", "R_00zxXbZGxe0owBX"...
$ task         <dbl> 1, 10, 11, 12, 2, 3, 4, 5, 6, 7, 8, 9, 1, 10, 11, 12, 2, 3, 4, 5, 6, 7, 8, 9, 1, 10, 11, 12, 2, 3, 4, 5, 6, 7, 8, 9, 1, 10, ...
$ duration     <chr> "3 months", "6 months", "12 months", "3 months", "12 months", "12 months", "6 months", "12 months", "12 months", "3 months",...
$ offer        <chr> "help you lead a better life", "help you lead a better life", "improve your sleep sustainably", "improve your sleep sustaina...
$ outcome      <chr> "changing your sleep mindset", "changing your sleep mindset", "breaking bad habits and creating new routines", "breaking bad...
$ price        <chr> "$30/month", "$40/month", "$40/month", "$40/month", "$30/month", "$20/month", "$20/month", "$30/month", "$40/month", "$30/mo...
$ rtb          <chr> "cognitive behavioral therapy", "daily text messages from a coach", "cognitive behavioral therapy", "a program created just ...
$ social_proof <chr> "leading researchers", "a method that has helped thousands", "a method that has helped thousands", "professional athletes", ...
$ answer       <dbl> 3, 4, 3, 1, 2, 4, 2, 3, 4, 3, 3, 4, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 2, 2...

Distribution of Answers

As we can see, there are 9 variables: A participant ID variable (response_ID), a conditional variable noting which session a person had (Task), 6 dependent variables (duration, offer, outcome, price, rtb, social_proof), and 1 dependent variable (answer).

Additionally, our data has a great sample size (n = 10,704) and is very clean posessing no NA’s.

# Count the number of NA's in each column
experiment_data %>%
  summarize_all(~ sum(is.na(.)))
# A tibble: 1 x 9
  response_id  task duration offer outcome price   rtb social_proof answer
        <int> <int>    <int> <int>   <int> <int> <int>        <int>  <int>
1           0     0        0     0       0     0     0            0      0

Let’s look at the distribution of data for each level of each variable (except ‘response_id’) using ggplot2.

# Melt data into long format then graph percent of total for each variable
experiment_data %>%
  melt(id = c("response_id")) %>%
  group_by(value) %>%
  count(variable) %>%
  ungroup() %>% group_by(variable) %>%
  mutate(pct = round(n/sum(n),4) * 100)%>%
  arrange(variable) %>%
  ggplot(aes(x = variable, y = pct, fill = fct_rev(value))) + 
  geom_bar(fill = rep(c(colorSet),20), stat = "identity", position = position_stack(vjust = 0.5)) +
  geom_text(aes(label = pct), position = position_stack(vjust = 0.5), size = 6) +
  labs(x = "Variable", y = "Percent of Total", title = "Distribution of Each Level of Each Variable (%)") +
  theme_bw() +
  theme(legend.position = "none", axis.text = element_text(size = 10), axis.title = element_text(size = 14), title = element_text(size = 16))

NOTE: Variable names are long and therefore were left out of the visualization. Each colored portion represents the percentage that a level of a variable takes up relative to the total responses for that variable.

With the exception of the dependent variable, ‘answer’, each variable has an even split of data for each of it’s levels. Let’s investigate the distribution of responses for ‘answer’.

One last thing: each variable is currently coded as either numeric or character. Going forward, we need to convert them to factors.

# Convert variable types to factor
experiment_data <- experiment_data %>%
  mutate_all(~ as.factor(.))

# Percentage of responses for each answer
experiment_data %>%
  histogram(~ answer,
             main = "Distribution of Answers",
             data = .)

Participants answered 1 (very unlikely) as often as 2 and 3 (somewhat unlikely and somewhat likely, respectively) combined. The target answer, 4 (very likely) was answered the least with only 15% of responses.

This imbalance is important to keep in mind when looking at the distributions of predictor variables although statistically, it should not create any problems when building models.

Distribution of Answers for each Predictor

To get a feel for how each level of each predictor may affect participants’ likeliness to use the app, bi-variate histograms of each relationship were plotted below.

Price

experiment_data %>%
  histogram(~ answer | price, 
            main = "Responses by Price",
            data = .)

It appears that ‘$20/month’ is the most desirable choice with both the lowest number of 1’s and the highest number of 4’s.

Duration

experiment_data %>%
  histogram(~ answer | duration, 
            main = "Responses by Duration",
            data = .)

Duration looks a little less clear, although ‘3 months’ may have a slight edge in terms of the number of 3s compared to ‘6 months’.

Offer

experiment_data %>%
  histogram(~ answer | offer, 
            main = "Responses by Offer",
            data = .)

Similar to ‘duration’, responses tend to match the answers distribution. At a glance, the ‘help you sleep more without pills’ looks the best with a slightly lower rate of 1s and slightly higher rate of 3s than average.

Outcome

experiment_data %>%
  histogram(~ answer | outcome, 
            main = "Responses by Outcome",
            data = .)

‘breaking bad habits and creating new routines’ looks most promising with the lowest rate of 1s and the highest rate of 4s.

RTB

experiment_data %>%
  histogram(~ answer | rtb, 
            main = "Responses by RTB",
            data = .)

Although there are no standout winners for ‘rtb’, there are a few losers. ‘daily text messages from a coach’ shows a higher rate of 1’s while ‘personalized, human coaching’ and ‘the suppport of a community of people just like you’ show more 2’s and less 3’s than average.

Social Proof

experiment_data %>%
  histogram(~ answer | social_proof, 
            main = "Responses by Social Proof",
            data = .)

Task Number

‘task’ is not within the variables to be directly analyzed, however it is important to check that there are no times that have significantly different times than the other. For example, if ‘time 12’ has many more people responding with ‘1’ than ‘4’ than ‘time 1’, this may signal that participants become less and less likely to purchase an app with more exposure to advertisements. Both variables are truly ordinal, so Kendall’s Tau will be used as the correlation coefficient. Additionally, a ‘count’ plot from ggplot2 will help visualize the distribution of responses.

# Count plot of 'answer' by 'task'
experiment_data %>%
  ggplot(aes(x = task, y = answer)) +
  geom_count(color = colorSet[1]) +
  labs(title = "Answers by Task", x = "Answer", y = "Task") +
  theme_bw() +
  theme(axis.text = element_text(size = 10), axis.title = element_text(size = 14), title = element_text(size = 16))

# Correlation between 'task' and 'answer'
experiment_data %>%
  select(task, answer) %>%
  mutate_all(~ as.numeric(.)) %>%
  cor(., method = "kendall") %>%
  .[1,2]
[1] -0.01051784

Kendall’s Tau is negligible (𝜏 = -0.01), and there appears to be no relationship given the count plot.

Multicollinearity

One common issue that can impact multivariate analyses of linear models is multicollinearity, or the correlation of predictor variables. We will run a test of the relationships between all predictors to check for multicollinearity.

# Test independence of all predictors for all 2-way interactions
experiment_data %>%
  xtabs(~ duration + offer + outcome + price + rtb + social_proof, data = .) %>%
  loglm(~ duration + offer + outcome + price + rtb + social_proof, data = .)
Call:
loglm(formula = ~duration + offer + outcome + price + rtb + social_proof, 
    data = .)

Statistics:
                      X^2   df      P(> X^2)
Likelihood Ratio 3570.691 3221 0.00001239445
Pearson          3285.516 3221 0.20989030186

The p-value (p = 0.21) is not small enough to provide evidence that the predictor variables are pairwise-dependent. Therefore, our predictors are not multicollinear.

Fitting a Model

The previous distributions gave us a feel for how the data is distributed and now we will consider each variable together in a linear model. Given that our dependent variable is ordinal, we will be using the ordinal package which contains many helpful functions for ordinal regression.

The clm function allows us to fit a cumulative link model (analogous to an ordinal regression model), which allows for each parameter type our model requires. These are:

Independent (Fixed)

duration, price, offer, outcome, rtb, social_proof

Dependent

answer

NOTE: 1.) This task is inferential rather than predictive, so data will not need to undergo typical predictive modeling treatment such as splitting and training.
2.) ‘task’ is left out of the model for the sake of simplicity. Although it is possible that in one of the models ‘task’ displays a significant interactive effect, this would be hard to interpret and provide little to no inferential value.

# Create a cumulative link mixed model including all predictors
# WARNING: This may take a very long time to compute
ORModel <- experiment_data %>%
  clm(answer ~ duration  + price + offer + outcome + rtb + social_proof,
       data = .,
       nAGQ = 10,
      Hess = TRUE)

summary(ORModel)
formula: answer ~ duration + price + offer + outcome + rtb + social_proof
data:    .

 link  threshold nobs  logLik    AIC      niter max.grad cond.H 
 logit flexible  10704 -13828.67 27699.34 5(0)  3.12e-13 1.9e+02

Coefficients:
                                                       Estimate Std. Error z value             Pr(>|z|)    
duration3 months                                       0.029053   0.043799   0.663              0.50711    
duration6 months                                      -0.009370   0.043438  -0.216              0.82921    
price$30/month                                        -0.357336   0.043213  -8.269 < 0.0000000000000002 ***
price$40/month                                        -0.498598   0.043627 -11.429 < 0.0000000000000002 ***
offerhelp you lead a better life                       0.031005   0.055722   0.556              0.57793    
offerhelp you sleep without more pills                 0.068547   0.056002   1.224              0.22095    
offerimprove your health for the long-run              0.091239   0.055757   1.636              0.10176    
offerimprove your sleep sustainably                    0.035942   0.056262   0.639              0.52293    
outcomechanging your sleep mindset                    -0.047279   0.043230  -1.094              0.27411    
outcomeempowering you to take back your sleep habits  -0.116284   0.043615  -2.666              0.00767 ** 
rtbcognitive behavioral therapy                       -0.009455   0.060236  -0.157              0.87527    
rtbdaily text messages from a coach                   -0.162099   0.061680  -2.628              0.00859 ** 
rtbpersonalized, human coaching                       -0.025920   0.061131  -0.424              0.67156    
rtbthe support of a community of people just like you -0.103828   0.061646  -1.684              0.09213 .  
rtbunique daily games, challenges and exercises       -0.085466   0.061202  -1.396              0.16258    
social_proofleading researchers                        0.097973   0.050212   1.951              0.05104 .  
social_proofprofessional athletes                     -0.016104   0.050112  -0.321              0.74794    
social_proofscientific evidence                        0.101145   0.049798   2.031              0.04225 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Threshold coefficients:
    Estimate Std. Error z value
1|2 -0.58549    0.07736  -7.568
2|3  0.31304    0.07723   4.053
3|4  1.47902    0.07902  18.717

The output is rather large, but we really only need to pay attention to two values per coefficient. The estimate shows the strength and direction of the relationship while the p-value provides the probability of our effect in the hypothesis testing sense. The standard cutoff for p-values for human-level data is p < 0.05. Thankfully, those outputs are marked with ., *, **, or *** if they are significant or approaching significance.

Each coefficient is a combination of the level name prefixed by the variable. Additionally, the estimates are relative to the expected difference from the first level of the variable. Therefore, each variable has n - 1 estimates where n is the number of levels the variable has.

In addition, the threshold coefficients provide the estimated standardized differences between each level of ‘answers’. While there are no p-values for this output, the z-values show that these estimates have > 99.99% confidence. Therefore, each level of answer is sufficiently different in a step-wise fashion.

Some variables appear to have an optimal level:

Price: $20
Social Proof: scientific evidence

While others have levels that are worse than other levels:

RTB: the support of a community of people just like you | daily text messages from a coach
Outcome: empowering you to take back your sleep habits

And still other main effects are unrelated to the outcome given that none of their levels are better/worse:

Offer
Duration

Proportional Odds Test

One last thing to test before moving on is the assumption of proportional odds. This assumption states that effects need to be equal for each level of the dependent variable. If this assumption is violated, the effects are not truly ordinal but rather nominal. The nominal_test function can test this.

# Test the proportional odds assumption
nominal_test(ORModel)
Tests of nominal effects

formula: answer ~ duration + price + offer + outcome + rtb + social_proof
             Df logLik   AIC LRT Pr(>Chi)
<none>          -13829 27699             
duration                                 
price                                    
offer                                    
outcome                                  
rtb                                      
social_proof                             

There is no evidence for nominal effects so we may continue.

Interactions

Our model has shown us the main effects, now we are going to test for interactions (i.e. a certain level of ‘price’ and ‘duration’ together are a better predictor of ‘answer’ than looking at either separately).

We will start with 2 way interactions and test for a significant difference between the interaction model and the main-effects model.

# Create a model testing 2-way interactions
ORModelX <- experiment_data %>%
  clm(answer ~ (price + outcome + rtb + social_proof + offer + duration)^2,
       data = .,
       nAGQ = 10,
      Hess = TRUE)

# Compare the main effects only model to the interaction model
anova(ORModel, ORModelX)
Likelihood ratio tests of cumulative link models:
 
         formula:                                                             link: threshold:
ORModel  answer ~ duration + price + offer + outcome + rtb + social_proof     logit flexible  
ORModelX answer ~ (price + outcome + rtb + social_proof + offer + duration)^2 logit flexible  

         no.par   AIC logLik LR.stat  df Pr(>Chisq)
ORModel      21 27699 -13829                       
ORModelX    152 27831 -13764  129.94 131     0.5097

Our test shows that the interactions model and main-effects model are not significantly different (χ2 = 0.51). Higher-order interactions (e.g. three-way) may exist but are both unlikely and hard to interpret as our 2-way interactions model already has 152 parameters (Yikes!).

Conclusion

Our analysis yields the following conclusions about each variable:

Price

Price clearly has an optimal level of ‘$20’ and each increase (‘$30’, ‘$40’) has a negative effect on likeliness to rate high. Additionally, this effect is very large relative to the proceeding effects (‘$30’ and ‘$40’ coefficients were β = -0.36 and β = -0.50, respectively, compared to ‘$20’).

Social Proof

‘scientific evidence’ and ‘leading researchers’ were significantly better than ‘professional athletes’ and ‘a method that has helped thousands’ with very similar coefficients (β = 0.10 for both). This means that people prefer ‘expert’ or ‘scientific’ evidence over other types of evidence.

NOTE: Coefficients can be compared proportionally, therefore this effect has a 3.5x - 5x times smaller effect on answer than price.

RTB

Although no level emerged clearly the best for ‘RTB’, there were some clear losers. ‘daily text messages from a coach’ was significantly worse (β = -0.16) while ‘the support of a community of people just like you’ approached significance (β = -0.06), so avoid using these two levels. Technically, ‘A program created just for you’ performed best, but difference between this level and the remaining 3 levels is insignificant, so all 4 will be considered equal.

Outcome

The phrase to avoid for ‘Outcome’ is ‘empowering you to take back your sleep habits’ (β = -0.12). ‘breaking bad habits and creating new routines’ had a higher estimation than ‘empowering you to take back your sleep habits’ (β = 0.04), but the difference was small and insignificant.

Duration and Offer

No levels for both of these variables were significantly different. In fact, a model that drops both is not significantly different from the main-effects model.

# Build a new model that drops 'duration' and 'price' 
ORModel2 <- experiment_data %>%
  clm(answer ~ outcome + price + rtb + social_proof,
       data = ., 
      Hess = T)

# Compare the models
anova(ORModel, ORModel2)
Likelihood ratio tests of cumulative link models:
 
         formula:                                                         link: threshold:
ORModel2 answer ~ outcome + price + rtb + social_proof                    logit flexible  
ORModel  answer ~ duration + price + offer + outcome + rtb + social_proof logit flexible  

         no.par   AIC logLik LR.stat df Pr(>Chisq)
ORModel2     15 27691 -13831                      
ORModel      21 27699 -13829  4.1418  6     0.6575

Therefore, we can conclude that these two variables are unrelated to the probability that an individual will download the app.

Summary

The combination of levels that is most likely to lead to an individual wanting to download the app is:

Price = the lowest possible price

Social Proof = scientific or research based statement

RTB = DO NOT USE: ‘daily messages from a coach’ OR ‘the support of a community of people just like you’. The other 4 levels were not significantly different, so RTB is not as strongly related as the previous 2 variables OR the remaining 4 phrases are not a standout phrase for this variable and one will need to be found.

Outcome = USE: ‘changing your sleep mindset’ OR ‘breaking bad habits and creating new routines’.

Duration = Unrelated to the outcome

Offer = Unrelated to the outcome

Task 2

Describe the groups of respondents that we see in the data using the added demographic, psychological and behavioral data. Feel free to experiment with clustering techniques in addition to descriptives.

Survey Diagnostics

BACKGROUND: 892 participants were asked to answer a series of surveys asking about their demographic information and opinions.

First, we will load the data and look at the shape of the data.

# Set root directory
setwd("..")

# Load data
survey_data <-
  read_sav('inst/extdata/data/survey_data.sav')

# Store the colnames of the survey data
SurveyColnames <- colnames(survey_data)
as.data.frame(SurveyColnames)
          SurveyColnames
1            response_id
2                d_urban
3               s_gender
4                 s_race
5            d_education
6             s_hhincome
7              s_problem
8        m1_philosophy_1
9        m1_philosophy_2
10       m1_philosophy_3
11       m1_philosophy_4
12       m1_philosophy_5
13       m1_philosophy_6
14       m1_philosophy_7
15       m1_philosophy_8
16       m1_philosophy_9
17        m2_attitudes_1
18        m2_attitudes_2
19        m2_attitudes_3
20        m2_attitudes_4
21        m2_attitudes_5
22        m2_attitudes_6
23        m2_attitudes_7
24        m2_attitudes_8
25        m2_attitudes_9
26       m2_attitudes_10
27       m2_attitudes_11
28   m2_awareness_apps_1
29   m2_awareness_apps_4
30   m2_awareness_apps_5
31   m2_awareness_apps_6
32   m2_awareness_apps_7
33   m2_awareness_apps_8
34   m2_awareness_apps_9
35  m2_awareness_apps_10
36  m2_awareness_apps_11
37  m2_awareness_apps_12
38  m2_awareness_apps_13
39  m2_awareness_apps_14
40                 hours
41              source_1
42              source_4
43              source_5
44              source_6
45              source_7
46              source_8
47              source_9
48             source_10
49             source_11
50             source_12
51             source_13
52             source_14
53             source_15
54             source_16
55             source_17
56            behavior_1
57            behavior_4
58            behavior_5
59            behavior_6
60            behavior_7
61            behavior_8
62            behavior_9
63           behavior_10
64           behavior_11
65           behavior_12
66           behavior_13
67           behavior_14
68           behavior_15
69           behavior_16
70          behavior_a_1
71          behavior_a_2
72          behavior_a_3
73          behavior_a_4
74          behavior_a_5
75          behavior_a_6
76          behavior_a_7
77          behavior_a_8
78          behavior_a_9
79         behavior_a_10
80         behavior_a_11
81         behavior_a_12
82         behavior_a_13
83         behavior_a_14
84           interst_cbt
85            past_coach
86        interest_coach
87             d_marital
88           d_h_hnumber
89              d_parent
90        d_child_infant
91         d_child_young
92         d_child_older
93            d_politics
94      d_political_view
95          d_employment
96       d_work_schedule
97          d_work_hours
98              s_region
99                 s_age
100              weights

This is quite a lot of columns. Thankfully, the naming conventions make it easy to see which parts of the survey are sectioned together.

Missing Data

Let’s check if there are any NAs.

# Store the number of NA's in each column
NAOutPut <- survey_data %>%
  summarize_all(~ sum(is.na(.)))

# Print the output in minimal form
as.numeric(NAOutPut)
  [1]   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 612 848 863 568 728 652 361 847 712
 [37] 843 846 742   0 687 793 791 838 597 657 601 799 719 835 820 799 814 796 814 822 710 575 781 784 594 758 846 831 745 668 746 743 756 822 710 575
 [73] 781 784 594 758 846 831 745 668 746 743 892   0   0   0   0   0   0 592 592 592  25   0   0 459 459   0   0   0

This is the number of NAs for each column (The output was truncated due to length). The NAs in columns 23 - 83 are all survey answers for the ‘awareness’, ‘source’, and ‘behavior’ sections – while the columns after these that have NAs are asking about demographic and biographic information. It is normal for people to not complete a survey and there are a few tools we have at our disposal to consider.

Let’s see how many complete cases each section of the survey has.

# Store Complete number of cases
Sections <- c("Philosophy","Attitudes", "Awareness", "Source", "Behvaior", "Demographic", "Biographic")
CompleteCases <- numeric()

# Function for appending Complete Cases
appendCases <- function(matching) {
  CompleteCases <<- append(CompleteCases,
         survey_data %>%
           select(starts_with(matching)) %>%
           complete.cases() %>%
           sum()
         )
}

# Add complete observation counts for each section of the survey
appendCases("m1_philosophy")
appendCases("m2_attitudes")
appendCases("m2_awareness")
appendCases("source")
appendCases("behavior")
appendCases("d_")
appendCases("s_")

# Combine the section labels and the number of complete cases
casesDf <- data.frame(Sections, CompleteCases)
casesDf
     Sections CompleteCases
1  Philosophy           892
2   Attitudes           892
3   Awareness             0
4      Source             0
5    Behvaior             0
6 Demographic           198
7  Biographic           892

‘philosophy’ and ‘attitudes’ were both fully taken by every individual who took the survey ‘awareness’, on the other hand, has 0 complete observations. ‘source’ and ‘behavior’ are special cases because their NAs represent that a person did not check a box, so their NAs could effectively be recoded as 0s and we can treat them as complete observations. Demographic questions had only ~ 25% complete observations, but the total number of NAs for the questions might not be bad. Finally, biographic questions were all complete.

Awareness

Let’s dig into the proportion of NA’s to not NA’s each column in the ‘awareness’ section has to see how bad this problem really is.

# Get the percentage of NA's per column for the 'awareness' section
survey_data %>%
  select(starts_with("m2_awareness")) %>%
  map(~ paste0(round(sum(is.na(.) / length(.)) * 100, 2), "%"))
$m2_awareness_apps_1
[1] "68.61%"

$m2_awareness_apps_4
[1] "95.07%"

$m2_awareness_apps_5
[1] "96.75%"

$m2_awareness_apps_6
[1] "63.68%"

$m2_awareness_apps_7
[1] "81.61%"

$m2_awareness_apps_8
[1] "73.09%"

$m2_awareness_apps_9
[1] "40.47%"

$m2_awareness_apps_10
[1] "94.96%"

$m2_awareness_apps_11
[1] "79.82%"

$m2_awareness_apps_12
[1] "94.51%"

$m2_awareness_apps_13
[1] "94.84%"

$m2_awareness_apps_14
[1] "83.18%"

It’s really bad. The percentage of NA’s for most columns is > 64%. For this reason, I suggest that data for the awareness section be recollected and analyzed when there is a sufficient response rate.

NOTE: Imputation is a possible solution for missing data, however the ratio of NA’s to data for most questions is far greater than 50%. This means that the expected error for each imputed variable is very large. Additionally, imputation is more suited toward problems of prediction while the task we are answering is inferential.

Demographics

Let’s try the same method on the demographic information.

# Get the percentage of NA's per column for the 'demographic' section
survey_data %>%
  select(starts_with("d_")) %>%
  map(~ paste0(round(sum(is.na(.) / length(.)) * 100, 2), "%"))
$d_urban
[1] "0%"

$d_education
[1] "0%"

$d_marital
[1] "0%"

$d_h_hnumber
[1] "0%"

$d_parent
[1] "0%"

$d_child_infant
[1] "66.37%"

$d_child_young
[1] "66.37%"

$d_child_older
[1] "66.37%"

$d_politics
[1] "2.8%"

$d_political_view
[1] "0%"

$d_employment
[1] "0%"

$d_work_schedule
[1] "51.46%"

$d_work_hours
[1] "51.46%"

This is quite a different story. It appears that people were fine responding with most demographic questions. Questions about their children or work schedule, however, were answered by less than 50% of the participants each. This still leaves us with large sample sizes for both variables: 33% of individuals (n = 300) willing to respond about their children and 49% of respondents (n = 433) willing to talk about their work schedule. We can even use willingness to respond about these variables as a feature itself.

Distributions

There are many demographic and biographic variables in our dataset.

survey_data %>%
  select(starts_with(c("s_", "d_"))) %>%
  mutate_all(~ as_factor(.)) %>%
  map(~ kable(table(.)))
$s_gender


|.      | Freq|
|:------|----:|
|Male   |  389|
|Female |  503|

$s_race


|.                                         | Freq|
|:-----------------------------------------|----:|
|African American/Black/Caribbean American |   98|
|Asian or Pacific Islander                 |   35|
|Hispanic or Latino                        |   44|
|Mixed race and Others                     |   35|
|White                                     |  680|

$s_hhincome


|.                            | Freq|
|:----------------------------|----:|
|Less than $25,000            |  168|
|$25,000 - less than $50,000  |  255|
|$50,000 - less than $100,000 |  273|
|$100,000 or more             |  196|

$s_problem


|.                                                       | Freq|
|:-------------------------------------------------------|----:|
|Every night                                             |  247|
|Most nights (Four or more nights during a typical week) |  305|
|Sometimes (Two or three nights during a typical week)   |  340|
|Rarely (One night during a typical week)                |    0|
|Never                                                   |    0|

$s_region


|.         | Freq|
|:---------|----:|
|MIDWEST   |  237|
|WEST      |  192|
|SOUTH     |  283|
|NORTHEAST |  180|

$s_age


|.     | Freq|
|:-----|----:|
|65+   |  177|
|46-64 |  252|
|31-45 |  253|
|18-30 |  210|

$d_urban


|.                                                                              | Freq|
|:------------------------------------------------------------------------------|----:|
|Urban (An area that is commonly within a town or city with a large population) |  301|
|Suburban (An area that is residential and on the outskirts of a city)          |  443|
|Rural (An area that is outside of town or city and has a small population)     |  148|

$d_education


|.                                         | Freq|
|:-----------------------------------------|----:|
|Less than high school                     |   22|
|High school graduate                      |  188|
|Currently enrolled in college             |   19|
|Some college, but not currently enrolled  |  176|
|Associate's degree/Technical degree/AA/AS |  123|
|College graduate/Bachelor's degree/BA/BS  |  184|
|Some postgraduate courses                 |   25|
|Master's degree (e.g MS, MA, MBA, etc. )  |  123|
|PhD, JD or MD                             |   32|

$d_marital


|.                          | Freq|
|:--------------------------|----:|
|Married                    |  439|
|Separated                  |   15|
|Divorced                   |   81|
|Widowed                    |   40|
|Never married              |  247|
|Domestic/civil partnership |   70|

$d_h_hnumber


|.   | Freq|
|:---|----:|
|1   |  178|
|2-3 |  515|
|4-5 |  168|
|6+  |   31|

$d_parent


|.   | Freq|
|:---|----:|
|Yes |  300|
|No  |  592|

$d_child_infant


|.             | Freq|
|:-------------|----:|
|None          |  168|
|One           |   90|
|Two           |   34|
|Three or more |    8|

$d_child_young


|.             | Freq|
|:-------------|----:|
|None          |  105|
|One           |  144|
|Two           |   40|
|Three or more |   11|

$d_child_older


|.             | Freq|
|:-------------|----:|
|None          |  150|
|One           |  107|
|Two           |   36|
|Three or more |    7|

$d_politics


|.                 | Freq|
|:-----------------|----:|
|Strong Democrat   |  147|
|Democrat          |  208|
|Independent       |  225|
|Republican        |  151|
|Strong Republican |  111|
|Other             |   25|

$d_political_view


|.                     | Freq|
|:---------------------|----:|
|Very liberal          |  137|
|Somewhat liberal      |  168|
|Moderate              |  326|
|Somewhat conservative |  168|
|Very conservative     |   93|

$d_employment


|.                             | Freq|
|:-----------------------------|----:|
|Working full time now         |  326|
|Working part time now         |  107|
|Temporarily laid off          |   24|
|Unemployed                    |   76|
|Retired                       |  215|
|Permanently disabled          |   39|
|Taking care of home or family |   69|
|Student                       |   29|
|Other                         |    7|

$d_work_schedule


|.                                                                  | Freq|
|:------------------------------------------------------------------|----:|
|Fixed work schedule – working hours remain consistent              |  271|
|Flexible work schedule – you control your working hours            |  118|
|Shift work schedule – working hours depend on the week             |   30|
|Rotating work schedule – shift work changes in a repeating pattern |   11|
|Don’t know / not applicable                                        |    3|

$d_work_hours


|.            | Freq|
|:------------|----:|
|Less than 20 |   54|
|20-39        |  179|
|40 or more   |  200|

There are many demographic and biographic variables in the data, so we are going to skip analysis of each variable and just take note of important parts of our data. Additionally, many of the variables are already coded as numeric variables.

The following facts are notable for this dataset:

1.) There are far more white individuals than any other group.

2.) A majority of individuals are either married or have never married.

3.) Every individual in the study has trouble sleep at least sometimes.

4.) 1/3 respondents are parents.

Reliability

To analyze whether the Likert scale survey sections are reliable measures, we will use the alpha function from the psych package. This will give internal consistency estimates based on Cronbach’s Alpha as well as item statistics in one function.

Philosophy

# Alpha for the 'philosophy' scale
survey_data %>%
  select(starts_with("m1_philosophy")) %>%
  alpha()

Reliability analysis   
Call: alpha(x = .)

  raw_alpha std.alpha G6(smc) average_r S/N   ase mean   sd median_r
      0.66      0.67    0.69      0.18   2 0.017  3.6 0.55     0.18

 lower alpha upper     95% confidence boundaries
0.63 0.66 0.69 

 Reliability if an item is dropped:
                raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
m1_philosophy_1      0.63      0.64    0.66      0.18 1.8    0.018 0.019  0.16
m1_philosophy_2      0.68      0.68    0.68      0.21 2.1    0.016 0.013  0.18
m1_philosophy_3      0.64      0.64    0.67      0.18 1.8    0.018 0.022  0.18
m1_philosophy_4      0.62      0.63    0.65      0.18 1.7    0.019 0.019  0.18
m1_philosophy_5      0.62      0.64    0.65      0.18 1.7    0.019 0.020  0.18
m1_philosophy_6      0.59      0.60    0.62      0.16 1.5    0.021 0.017  0.16
m1_philosophy_7      0.61      0.61    0.63      0.17 1.6    0.020 0.016  0.18
m1_philosophy_8      0.63      0.64    0.66      0.18 1.8    0.019 0.017  0.18
m1_philosophy_9      0.66      0.67    0.69      0.20 2.0    0.017 0.021  0.19

 Item statistics 
                  n raw.r std.r r.cor r.drop mean   sd
m1_philosophy_1 892  0.51  0.53  0.44   0.33  3.7 1.03
m1_philosophy_2 892  0.41  0.35  0.23   0.16  2.7 1.28
m1_philosophy_3 892  0.48  0.51  0.40   0.32  3.6 0.89
m1_philosophy_4 892  0.55  0.56  0.47   0.37  3.8 1.04
m1_philosophy_5 892  0.56  0.54  0.46   0.38  3.3 1.06
m1_philosophy_6 892  0.68  0.66  0.63   0.52  3.3 1.17
m1_philosophy_7 892  0.60  0.62  0.58   0.45  3.9 0.95
m1_philosophy_8 892  0.51  0.53  0.45   0.34  4.0 0.97
m1_philosophy_9 892  0.40  0.40  0.25   0.20  3.8 1.03

Non missing response frequency for each item
                   1    2    3    4    5 miss
m1_philosophy_1 0.03 0.10 0.23 0.40 0.24    0
m1_philosophy_2 0.22 0.24 0.25 0.19 0.10    0
m1_philosophy_3 0.02 0.07 0.40 0.37 0.15    0
m1_philosophy_4 0.04 0.07 0.27 0.36 0.27    0
m1_philosophy_5 0.04 0.19 0.33 0.29 0.15    0
m1_philosophy_6 0.09 0.17 0.26 0.33 0.16    0
m1_philosophy_7 0.02 0.07 0.20 0.44 0.28    0
m1_philosophy_8 0.02 0.05 0.16 0.41 0.35    0
m1_philosophy_9 0.02 0.09 0.22 0.38 0.28    0

‘philosophy’ is not quite a singular measure (α = 0.66) (psychometric measures are considered reliable at α = 0.70 or greater). If question 2 is removed, it approaches singularity (α = 0.68), but does not reach it. In order to remedy this, I suggest that future surveys add questions.

Question 9 also looks weak as removing it hardly changes alpha. Questions 2 and 9 also have low correlations to the item-total. Let’s see what the results are if questions 2 and 9 are removed.

# Alpha estimate for the 'philosophy' scale without bad questions (2 and 9)
survey_data %>%
  select(starts_with("m1_philosophy"), -contains(c("_2","_9"))) %>%
  alpha()

Reliability analysis   
Call: alpha(x = .)

  raw_alpha std.alpha G6(smc) average_r S/N   ase mean   sd median_r
      0.69      0.69    0.68      0.24 2.2 0.016  3.7 0.61      0.2

 lower alpha upper     95% confidence boundaries
0.66 0.69 0.72 

 Reliability if an item is dropped:
                raw_alpha std.alpha G6(smc) average_r S/N alpha se  var.r med.r
m1_philosophy_1      0.67      0.66    0.65      0.25 2.0    0.017 0.0145  0.19
m1_philosophy_3      0.68      0.68    0.66      0.26 2.1    0.017 0.0126  0.24
m1_philosophy_4      0.65      0.65    0.64      0.24 1.9    0.018 0.0127  0.19
m1_philosophy_5      0.69      0.69    0.67      0.27 2.2    0.016 0.0106  0.24
m1_philosophy_6      0.62      0.63    0.61      0.22 1.7    0.020 0.0108  0.18
m1_philosophy_7      0.63      0.63    0.61      0.22 1.7    0.019 0.0082  0.19
m1_philosophy_8      0.66      0.66    0.64      0.24 1.9    0.018 0.0092  0.20

 Item statistics 
                  n raw.r std.r r.cor r.drop mean   sd
m1_philosophy_1 892  0.56  0.57  0.45   0.36  3.7 1.03
m1_philosophy_3 892  0.50  0.52  0.39   0.32  3.6 0.89
m1_philosophy_4 892  0.61  0.60  0.49   0.41  3.8 1.04
m1_philosophy_5 892  0.51  0.50  0.35   0.29  3.3 1.06
m1_philosophy_6 892  0.71  0.68  0.62   0.52  3.3 1.17
m1_philosophy_7 892  0.67  0.68  0.62   0.51  3.9 0.95
m1_philosophy_8 892  0.58  0.59  0.50   0.40  4.0 0.97

Non missing response frequency for each item
                   1    2    3    4    5 miss
m1_philosophy_1 0.03 0.10 0.23 0.40 0.24    0
m1_philosophy_3 0.02 0.07 0.40 0.37 0.15    0
m1_philosophy_4 0.04 0.07 0.27 0.36 0.27    0
m1_philosophy_5 0.04 0.19 0.33 0.29 0.15    0
m1_philosophy_6 0.09 0.17 0.26 0.33 0.16    0
m1_philosophy_7 0.02 0.07 0.20 0.44 0.28    0
m1_philosophy_8 0.02 0.05 0.16 0.41 0.35    0

The mean Alpha estimate for these items is close enough to be considered acceptable (α = 0.69). According to the task description, this scale is meant to measure an individuals views on science, products, and health.

Philosophy Composite Score

Now we will create a new column to save the scale composite scores for philosophy in the case that we need it as a continuous variable. Because the ‘philosophy’ and ‘attitudes’ measures both use 5-point Likert scales, we can just take the mean value of each so we don’t need to standardize the values later.

# Add 'philosophy' composite scores
survey_data$philosophy_composite <- 
  survey_data %>%
  select(starts_with("m1_philosophy"), -contains(c("_2","_9"))) %>% 
  rowMeans()

Attitudes

The other scale we will be analyzing is the attitudes scale. Let’s first see what the reliability is out of the gate.

# Alpha estimate for the 'attitudes' scale
survey_data %>% 
  select(starts_with("m2_attitudes")) %>%
  alpha()

Reliability analysis   
Call: alpha(x = .)

  raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd median_r
      0.72      0.74    0.75      0.21 2.9 0.014  3.9 0.5     0.21

 lower alpha upper     95% confidence boundaries
0.69 0.72 0.75 

 Reliability if an item is dropped:
                raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
m2_attitudes_1       0.69      0.72    0.72      0.20 2.5    0.016 0.015  0.20
m2_attitudes_2       0.68      0.71    0.71      0.20 2.4    0.016 0.015  0.20
m2_attitudes_3       0.68      0.71    0.72      0.20 2.5    0.016 0.018  0.20
m2_attitudes_4       0.75      0.77    0.76      0.25 3.3    0.013 0.011  0.23
m2_attitudes_5       0.69      0.72    0.73      0.21 2.6    0.016 0.020  0.20
m2_attitudes_6       0.69      0.72    0.73      0.21 2.6    0.015 0.019  0.20
m2_attitudes_7       0.67      0.70    0.71      0.19 2.3    0.016 0.014  0.20
m2_attitudes_8       0.70      0.73    0.74      0.21 2.7    0.015 0.020  0.20
m2_attitudes_9       0.68      0.71    0.72      0.20 2.4    0.016 0.017  0.20
m2_attitudes_10      0.72      0.74    0.75      0.23 2.9    0.014 0.019  0.24
m2_attitudes_11      0.71      0.74    0.75      0.22 2.8    0.015 0.019  0.23

 Item statistics 
                  n raw.r std.r r.cor r.drop mean   sd
m2_attitudes_1  892  0.55  0.59 0.555  0.439  4.5 0.76
m2_attitudes_2  892  0.59  0.63 0.594  0.480  4.4 0.80
m2_attitudes_3  892  0.60  0.60 0.553  0.458  4.0 1.01
m2_attitudes_4  892  0.30  0.24 0.099  0.083  3.2 1.23
m2_attitudes_5  892  0.57  0.55 0.483  0.421  3.7 1.03
m2_attitudes_6  892  0.54  0.55 0.473  0.400  3.9 0.94
m2_attitudes_7  892  0.65  0.69 0.675  0.553  4.2 0.85
m2_attitudes_8  892  0.51  0.52 0.426  0.366  3.8 0.93
m2_attitudes_9  892  0.61  0.63 0.588  0.492  3.9 0.90
m2_attitudes_10 892  0.44  0.40 0.279  0.236  3.9 1.20
m2_attitudes_11 892  0.46  0.44 0.333  0.290  3.6 1.04

Non missing response frequency for each item
                   1    2    3    4    5 miss
m2_attitudes_1  0.01 0.01 0.07 0.28 0.63    0
m2_attitudes_2  0.01 0.02 0.08 0.33 0.56    0
m2_attitudes_3  0.02 0.06 0.18 0.34 0.40    0
m2_attitudes_4  0.10 0.22 0.23 0.30 0.15    0
m2_attitudes_5  0.04 0.09 0.28 0.37 0.22    0
m2_attitudes_6  0.02 0.07 0.20 0.45 0.26    0
m2_attitudes_7  0.01 0.03 0.15 0.41 0.40    0
m2_attitudes_8  0.02 0.06 0.21 0.48 0.22    0
m2_attitudes_9  0.01 0.05 0.22 0.42 0.30    0
m2_attitudes_10 0.06 0.09 0.16 0.30 0.39    0
m2_attitudes_11 0.03 0.13 0.28 0.37 0.20    0

This scale shows evidence for being a singular measure (α = 0.74). Additionally, the Alpha-if-item-dropped is not significantly higher for any question.

There is one issue, however. The mean scores for a few questions are very high. Questions 1 and 2 both have very high means for a 5-point Likert measure (M = 4.40 and 4.50, respectively). Let’s take a look at the questions:

  1. Getting high quality sleep is important for my long-term health
  1. Getting better sleep would improve my focus

It is reasonable to expect that most people will put a 5 on both these questions. What happens to the results if these questions are dropped?

survey_data %>%
  select(starts_with("m2_attitudes"), -contains(c("_1","_2"))) %>%
  alpha()

Reliability analysis   
Call: alpha(x = .)

  raw_alpha std.alpha G6(smc) average_r S/N   ase mean   sd median_r
      0.64      0.66    0.65      0.22   2 0.019  3.8 0.56     0.25

 lower alpha upper     95% confidence boundaries
0.6 0.64 0.68 

 Reliability if an item is dropped:
               raw_alpha std.alpha G6(smc) average_r S/N alpha se  var.r med.r
m2_attitudes_3      0.58      0.61    0.60      0.21 1.5    0.022 0.0205  0.25
m2_attitudes_4      0.70      0.71    0.67      0.28 2.4    0.016 0.0049  0.28
m2_attitudes_5      0.57      0.60    0.59      0.20 1.5    0.023 0.0233  0.23
m2_attitudes_6      0.61      0.63    0.62      0.22 1.7    0.021 0.0185  0.25
m2_attitudes_7      0.57      0.59    0.57      0.19 1.4    0.023 0.0128  0.23
m2_attitudes_8      0.61      0.64    0.63      0.23 1.8    0.020 0.0219  0.27
m2_attitudes_9      0.58      0.60    0.59      0.20 1.5    0.022 0.0159  0.23

 Item statistics 
                 n raw.r std.r r.cor r.drop mean   sd
m2_attitudes_3 892  0.63  0.63  0.54  0.429  4.0 1.01
m2_attitudes_4 892  0.40  0.33  0.13  0.096  3.2 1.23
m2_attitudes_5 892  0.65  0.64  0.55  0.455  3.7 1.03
m2_attitudes_6 892  0.55  0.57  0.45  0.344  3.9 0.94
m2_attitudes_7 892  0.65  0.68  0.64  0.491  4.2 0.85
m2_attitudes_8 892  0.53  0.55  0.41  0.324  3.8 0.93
m2_attitudes_9 892  0.61  0.64  0.56  0.426  3.9 0.90

Non missing response frequency for each item
                  1    2    3    4    5 miss
m2_attitudes_3 0.02 0.06 0.18 0.34 0.40    0
m2_attitudes_4 0.10 0.22 0.23 0.30 0.15    0
m2_attitudes_5 0.04 0.09 0.28 0.37 0.22    0
m2_attitudes_6 0.02 0.07 0.20 0.45 0.26    0
m2_attitudes_7 0.01 0.03 0.15 0.41 0.40    0
m2_attitudes_8 0.02 0.06 0.21 0.48 0.22    0
m2_attitudes_9 0.01 0.05 0.22 0.42 0.30    0

Dropping these items leads to a less singular scale. For this reason and because the items do not have a very low standard deviation, they will be kept in the composite score calculation. In the future however, I recommend that these questions get reworked or reworded to increase the variance in responses while retaining the meaning of the question.

Attitude Composite Score

Just like with the philosophy scores, we can now create composite scores.

# Add 'attitudes' composite scores
survey_data$attitudes_composite <- survey_data %>%
  select(starts_with("m2_attitudes")) %>%
  rowMeans(.)

Clustering

MFA

One major part of the task is to see if there are any underlying latent structures to our data. Using the MFA function from the FactoMineR package will help to accomplish this.

First we need to check if any of the variables we will be clustering have 0-variance or near-zero variance.

# Prefixes for each group of the survey
survey_group_prefixes <- c("s_", "d_", "interest_", "past_", "behavior_", "source_")

# Subset only variables of interest
survey_factors <- survey_data %>%
  rename(interest_cbt = interst_cbt) %>%
  select(starts_with(survey_group_prefixes), ends_with("_composite")) %>%
  map(~ as_factor(.)) %>%
  as.data.frame()

# Find variables that have 95% or more of their responses as one response
near_0_vars <- nearZeroVar(survey_factors)
near_0_vars
 [1] 23 24 25 26 27 28 29 30 31 32 33 34 35 36 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65

There are quite a few columns that have almost no variance. Let’s take a look at what the variables consist of.

# Column names for near_0_variance predictors
survey_factors[,near_0_vars] %>%
  colnames()
 [1] "behavior_1"    "behavior_4"    "behavior_5"    "behavior_6"    "behavior_7"    "behavior_8"    "behavior_9"    "behavior_10"   "behavior_11"  
[10] "behavior_12"   "behavior_13"   "behavior_14"   "behavior_15"   "behavior_16"   "behavior_a_14" "source_1"      "source_4"      "source_5"     
[19] "source_6"      "source_7"      "source_8"      "source_9"      "source_10"     "source_11"     "source_12"     "source_13"     "source_14"    
[28] "source_15"     "source_16"     "source_17"    

A large number of ‘behavior’ columns as well as every ‘source’ variable had little to no variance. We will keep this in mind in case the results of the MFA are off. In order to run an MFA we needs to separate each section of the survey into groups.

# Remove low-variance variables
survey_factors <- survey_factors[,-near_0_vars]

Now, we can create the model.

# All of the sources have been removed, so remove it from the prefixes list
survey_group_prefixes <- survey_group_prefixes[!survey_group_prefixes %in% "source_"]

# Create an ordered vector of the number of questions in each group
survey_groups <- map(
  survey_group_prefixes, 
      function(x) length(
        # Match column names to the survey_group_prefixes
        grep(
          # "^" specifies "start_with" in R's regex
          paste0("^", x), 
          colnames(survey_factors)))
        )

# Add survey composites to the end of the survey groups
survey_groups <- c(survey_groups,1,1)

# Names for each group
survey_group_names <- c("Biographic", "Demographics", "Interests", "Past-Coach", "Behaviors", "Philosophy", "Attitudes")

# Run an MFA
survey_factors_MFA <- MFA(survey_factors %>% mutate_at(c("philosophy_composite", "attitudes_composite"), as.numeric), 
                          group = as.numeric(survey_groups),
                          # Specify groups (n = categorical, s = numeric)
                          type = c(rep("n",5),rep("s",2)),
                          name.group = survey_group_names,
                          ncp = 6
                          )

MFA Plots

The factoextra library helps us plot the results. Click the tabs below to show different plots of the results.

Variable Groups
fviz_mfa_var(survey_factors_MFA, 'group')

Most variable groups are explained more by dimension 1 than dimension 2. ‘interests’ appears to most strongly and almost solely related to dimension 1 while the same can be said for ‘biographic’ and dimension 2. Finally, ‘behaviors’ and ‘attitudes’ are very similar measures.

Scree Plot
fviz_screeplot(survey_factors_MFA)

paste("Number of Dimensions: ", nrow(survey_factors_MFA$eig))
[1] "Number of Dimensions:  185"

The scree plot displays that the largest dimension (dimension 1) explains around 5% of the variance. For most datasets this is small, however given that our dataset has a very large number of dimensions (185), 5% still represents 9 variables.

Partial Axes
fviz_mfa_axes(survey_factors_MFA, 
              axes = c(1,2),
              geom = c("point", "arrow"))

We can see that neither dimension 1 nor 2 are highly related to any single variable group. This is probably why both axes only account for a small portion of the variability in the sample. This means each group of variables is measuring something different (e.g. biographic information is not as strongly related to demographic information as we might think).

Dim 1
fviz_contrib(survey_factors_MFA, 
             choice = "quali.var", 
             axes = 1, 
             top = 10)

These are the top variables for explaining dimension 1, which explains around 5% of the variance for the model. Looking at the top variables, it appears that the largest cluster in the data is explained by a combination of having a coach or being interested in a coach and to a lesser extent age.

Dim2
fviz_contrib(survey_factors_MFA, 
             choice = "quali.var", 
             axes = 2, 
             top = 10)

The second largest dimension in the model accounts for nearly 3% of the variance. This dimension is explained by being unsure about a coach, age, and to a lesser extent- socioeconomic status.

K-Modes

Another way to cluster our data is using a K-modes algorithm, an extension of the more popular K-means algorithm. The klaR function from the kmodes package will help to accomplish this. But first, we need to create dummy variables for the demographic data because they are non-continuous, as well as some other data cleaning. The last thing the below chunk will accomplish is replacing NAs with 0s for the ‘source’ questions. Thankfully, removing zero-var variables was done previously, so we can reuse some of our work.

# Dummy code demographic, biographic, and interest variables
survey_dummy <- survey_data %>%
  rename(interest_cbt = interst_cbt) %>%
  select(starts_with(survey_group_prefixes)) %>%
  map(~ as_factor(.)) %>%
  dummy_cols() %>%
  select_if(is.numeric)

cluster_data <- data.frame(
  survey_dummy,
  survey_data %>%
    select(ends_with("_composite"))
)

ncol(cluster_data)
[1] 221

While dummy coding data dramatically increases the number of columns (221), any 1 column is straight-forward to interpret by its column name. Our data structure contains:

  1. Dummy coded demographic data (prefixed with “.data.s_”)
  2. Dummy coded biographic data (prefixed with “.data.d_”)
  3. Dummy coded data measuring interest in cognitive behavioral therapy (prefixed with “.data.interest_cbt_”)
  4. Dummy coded data measuring interest in coaching (prefixed with “.data.interest_coach_”)
  5. Dummy coded data for whether an individual has had a coach in the past (prefixed with “.data.past_coach_”)
  6. Dummy coded data representing if the participant has taken partaken in specific behaviors to combat sleep problems (prefixed with “.data.behavior_”)
  7. Composite scores for the ‘philosophy’ and ‘attitudes’ portions of the survey (suffixed with "_composite")

Let’s remove the low variance columns and replacs NAs with 0s like we did for the MFA.

# Find low variance columns
near_0_vars <- nearZeroVar(cluster_data)

# Remove low variance columns
cluster_data <- cluster_data[, -near_0_vars]

# Replace NAs with 0s
cluster_data <- cluster_data %>%
  replace(is.na(.), 0)

We can now proceed with looking at a scree plot of k-modes variance within values by cluster to choose k for our algorithm.

# Get within cluster variance values for k = 1:30
klist <- numeric()

map(1:30, function(x){
  # Set the seed to ensure the output is consistent
  set.seed(1073)
  k <- kmodes(cluster_data, modes = x)
  klist <<- c(klist, sum(k$withindiff))
})

kindex <- seq(1,length(klist))

k_df <- data.frame(klist, kindex)
# Plot the errors for each value of K
k_df %>%
  ggplot(aes(x = kindex, y = klist)) +
  geom_point(color = colorSet[1], size = 4) + 
  theme_bw() +
  labs(title = "Scree Plot of Total Error for K = 1:30", y = "Total Error", x = "K") +
  theme(legend.position = "none", axis.text = element_text(size = 10), axis.title = element_text(size = 14), title = element_text(size = 16))

The “elbows” of the scree plot appears to be at K = 4 and K = 7. We will use K = 4, favoring a simple model. Using the labels for each person, we could look at a plot to see if clusters are related to having certain sources of sleep illness.

# Set seed to ensure identical results each time
set.seed(1073)

# Kmodes on our data for K = 5
km4 <- kmodes(cluster_data, modes = 4)

source_cluster_data <- data.frame(km4$cluster, 
                                  survey_data %>% select(starts_with("source")))

# Graph the percentage of source of sleep issues by cluster
source_cluster_data %>%
  melt(id.vars = "km4.cluster") %>%
  filter(value == 1) %>%
  group_by(variable) %>%
  count(km4.cluster) %>%
  mutate(pct = round(n/sum(n),4) * 100) %>%
  arrange(variable) %>%
  ggplot(aes(x = variable, y = pct, fill = factor(km4.cluster))) + 
  geom_bar(stat = "identity", position = position_stack(vjust = 0.5)) +
  geom_text(aes(label = pct), position = position_stack(vjust = 0.5), size = 3.5) +
  labs(x = "Source", y = "Percent of Total", title = "Distribution of Sleep Issues by Cluster (%)") +
  theme_bw() +
  theme(legend.title = element_blank(), axis.text.x = element_text(size = 10, angle = 45, hjust = 1), axis.title = element_text(size = 14), title = element_text(size = 16))

Each cluster can be analyzed as a group of people with similar characteristics which may be related to having a sleeping issue. For instance, only 7% reporting ‘source_17’ are from cluster 3. Lets look at cluster 3 to find out.

# Add cluster labels to our data
survey_data$cluster <- km4$cluster

# Analyze people with cluster 3
survey_data %>%
  filter(cluster == 3) %>%
  select(starts_with(c("d_"))) %>%
  mutate_all(~ as_factor(.)) %>%
  map(~ kable(table(.)))
$d_urban


|.                                                                              | Freq|
|:------------------------------------------------------------------------------|----:|
|Urban (An area that is commonly within a town or city with a large population) |  138|
|Suburban (An area that is residential and on the outskirts of a city)          |   90|
|Rural (An area that is outside of town or city and has a small population)     |   33|

$d_education


|.                                         | Freq|
|:-----------------------------------------|----:|
|Less than high school                     |    5|
|High school graduate                      |   49|
|Currently enrolled in college             |    3|
|Some college, but not currently enrolled  |   56|
|Associate's degree/Technical degree/AA/AS |   31|
|College graduate/Bachelor's degree/BA/BS  |   56|
|Some postgraduate courses                 |    3|
|Master's degree (e.g MS, MA, MBA, etc. )  |   45|
|PhD, JD or MD                             |   13|

$d_marital


|.                          | Freq|
|:--------------------------|----:|
|Married                    |  165|
|Separated                  |   10|
|Divorced                   |   12|
|Widowed                    |    2|
|Never married              |   51|
|Domestic/civil partnership |   21|

$d_h_hnumber


|.   | Freq|
|:---|----:|
|1   |   12|
|2-3 |  123|
|4-5 |  106|
|6+  |   20|

$d_parent


|.   | Freq|
|:---|----:|
|Yes |  257|
|No  |    4|

$d_child_infant


|.             | Freq|
|:-------------|----:|
|None          |  150|
|One           |   72|
|Two           |   28|
|Three or more |    7|

$d_child_young


|.             | Freq|
|:-------------|----:|
|None          |   84|
|One           |  126|
|Two           |   37|
|Three or more |   10|

$d_child_older


|.             | Freq|
|:-------------|----:|
|None          |  123|
|One           |   96|
|Two           |   32|
|Three or more |    6|

$d_politics


|.                 | Freq|
|:-----------------|----:|
|Strong Democrat   |   42|
|Democrat          |   64|
|Independent       |   52|
|Republican        |   60|
|Strong Republican |   32|
|Other             |    4|

$d_political_view


|.                     | Freq|
|:---------------------|----:|
|Very liberal          |   59|
|Somewhat liberal      |   49|
|Moderate              |   83|
|Somewhat conservative |   36|
|Very conservative     |   34|

$d_employment


|.                             | Freq|
|:-----------------------------|----:|
|Working full time now         |  170|
|Working part time now         |   38|
|Temporarily laid off          |    5|
|Unemployed                    |   11|
|Retired                       |    2|
|Permanently disabled          |    6|
|Taking care of home or family |   25|
|Student                       |    1|
|Other                         |    3|

$d_work_schedule


|.                                                                  | Freq|
|:------------------------------------------------------------------|----:|
|Fixed work schedule – working hours remain consistent              |  132|
|Flexible work schedule – you control your working hours            |   56|
|Shift work schedule – working hours depend on the week             |   12|
|Rotating work schedule – shift work changes in a repeating pattern |    5|
|Don’t know / not applicable                                        |    3|

$d_work_hours


|.            | Freq|
|:------------|----:|
|Less than 20 |   18|
|20-39        |   95|
|40 or more   |   95|

The vast majority of people in this cluster are working 20+ hours pre week and are parents. Because ‘source_17’ was “other”, we can assume that the vast majority of people who have children and work full time have up to 3 of the other listed sleep issues.

Summary

We have extracted the following information from analyzing the survey data:

1.) The sample is predominantly white, married or have never married, and unanimously have at least some sleeping issues.

2.) Missing data in the ‘awareness’ section makes it unusable. I would ask for more data or for a new sample.

3.) The ‘philosophy’ section is on the verge of measuring multiple latent structures. If it is important that it be a reliable measure, I suggest adding at least 2 - 3 more questions and/or rewording problematic questions (2 and 9).

4.) The ‘attitudes’ section is a reliable measure, however questions 1 and 2 both have very high means for a 5-point Likert scale (M = 4.40 and 4.50, respectively). I would suggest rewording or replacing these items as they probably are providing little information to the scale despite increasing it’s reliability.

5.) Missing data in the demographics show that people are unanimously willing to answer most questions with the exception of questions about their children or work.

6.) An MFA revealed that the 2 largest dimensions we can reduce the data to jointly explained ~ 8% of the variance and roughly mapped onto:

  1. Coaching interest

  2. Socio-economic status and age

7.) A K-modes clustering algorithm (K = 4) revealed that nearly all parents clustered together, and their sleeping issues can be described by one of the 14 listed sources of sleeping issues that are not ‘other’.