The following packages were used for this solution:
library(tidyverse)
library(MASS)
library(caret)
library(haven)
library(purrr)
library(reshape2)
library(scales)
library(ordinal)
library(lattice)
library(psych)
library(fastDummies)
library(FactoMineR)
library(factoextra)
library(logisticPCA)
library(klaR)
# Ensure 'select' comes from package 'dplyr'
select <- dplyr::select
# Ensure 'alpha' comes from 'psych'
alpha <- psych::alpha
BACKGROUND: 892 participants were asked to rate “How likely are you to download this mobile app?” for 12 random permutations of a message describing the app.
The first step to completing this task involves loading the data and exploring the distributions and types of data that we will be working with.
# Set root directory
setwd("../")
# Load the data
experiment_data <-
read_sav('inst/extdata/data/experiment_data.sav')
# Return the structure of our data
glimpse(experiment_data)
Rows: 10,704
Columns: 9
$ response_id <chr> "R_00zxXbZGxe0owBX", "R_00zxXbZGxe0owBX", "R_00zxXbZGxe0owBX", "R_00zxXbZGxe0owBX", "R_00zxXbZGxe0owBX", "R_00zxXbZGxe0owBX"...
$ task <dbl> 1, 10, 11, 12, 2, 3, 4, 5, 6, 7, 8, 9, 1, 10, 11, 12, 2, 3, 4, 5, 6, 7, 8, 9, 1, 10, 11, 12, 2, 3, 4, 5, 6, 7, 8, 9, 1, 10, ...
$ duration <chr> "3 months", "6 months", "12 months", "3 months", "12 months", "12 months", "6 months", "12 months", "12 months", "3 months",...
$ offer <chr> "help you lead a better life", "help you lead a better life", "improve your sleep sustainably", "improve your sleep sustaina...
$ outcome <chr> "changing your sleep mindset", "changing your sleep mindset", "breaking bad habits and creating new routines", "breaking bad...
$ price <chr> "$30/month", "$40/month", "$40/month", "$40/month", "$30/month", "$20/month", "$20/month", "$30/month", "$40/month", "$30/mo...
$ rtb <chr> "cognitive behavioral therapy", "daily text messages from a coach", "cognitive behavioral therapy", "a program created just ...
$ social_proof <chr> "leading researchers", "a method that has helped thousands", "a method that has helped thousands", "professional athletes", ...
$ answer <dbl> 3, 4, 3, 1, 2, 4, 2, 3, 4, 3, 3, 4, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 2, 2...
As we can see, there are 9 variables: A participant ID variable (response_ID), a conditional variable noting which session a person had (Task), 6 dependent variables (duration, offer, outcome, price, rtb, social_proof), and 1 dependent variable (answer).
Additionally, our data has a great sample size (n = 10,704) and is very clean posessing no NA’s.
# Count the number of NA's in each column
experiment_data %>%
summarize_all(~ sum(is.na(.)))
# A tibble: 1 x 9
response_id task duration offer outcome price rtb social_proof answer
<int> <int> <int> <int> <int> <int> <int> <int> <int>
1 0 0 0 0 0 0 0 0 0
Let’s look at the distribution of data for each level of each variable (except ‘response_id’) using ggplot2.
# Melt data into long format then graph percent of total for each variable
experiment_data %>%
melt(id = c("response_id")) %>%
group_by(value) %>%
count(variable) %>%
ungroup() %>% group_by(variable) %>%
mutate(pct = round(n/sum(n),4) * 100)%>%
arrange(variable) %>%
ggplot(aes(x = variable, y = pct, fill = fct_rev(value))) +
geom_bar(fill = rep(c(colorSet),20), stat = "identity", position = position_stack(vjust = 0.5)) +
geom_text(aes(label = pct), position = position_stack(vjust = 0.5), size = 6) +
labs(x = "Variable", y = "Percent of Total", title = "Distribution of Each Level of Each Variable (%)") +
theme_bw() +
theme(legend.position = "none", axis.text = element_text(size = 10), axis.title = element_text(size = 14), title = element_text(size = 16))
NOTE: Variable names are long and therefore were left out of the visualization. Each colored portion represents the percentage that a level of a variable takes up relative to the total responses for that variable.
With the exception of the dependent variable, ‘answer’, each variable has an even split of data for each of it’s levels. Let’s investigate the distribution of responses for ‘answer’.
One last thing: each variable is currently coded as either numeric or character. Going forward, we need to convert them to factors.
# Convert variable types to factor
experiment_data <- experiment_data %>%
mutate_all(~ as.factor(.))
# Percentage of responses for each answer
experiment_data %>%
histogram(~ answer,
main = "Distribution of Answers",
data = .)
Participants answered 1 (very unlikely) as often as 2 and 3 (somewhat unlikely and somewhat likely, respectively) combined. The target answer, 4 (very likely) was answered the least with only 15% of responses.
This imbalance is important to keep in mind when looking at the distributions of predictor variables although statistically, it should not create any problems when building models.
To get a feel for how each level of each predictor may affect participants’ likeliness to use the app, bi-variate histograms of each relationship were plotted below.
experiment_data %>%
histogram(~ answer | price,
main = "Responses by Price",
data = .)
It appears that ‘$20/month’ is the most desirable choice with both the lowest number of 1’s and the highest number of 4’s.
experiment_data %>%
histogram(~ answer | duration,
main = "Responses by Duration",
data = .)
Duration looks a little less clear, although ‘3 months’ may have a slight edge in terms of the number of 3s compared to ‘6 months’.
experiment_data %>%
histogram(~ answer | offer,
main = "Responses by Offer",
data = .)
Similar to ‘duration’, responses tend to match the answers distribution. At a glance, the ‘help you sleep more without pills’ looks the best with a slightly lower rate of 1s and slightly higher rate of 3s than average.
experiment_data %>%
histogram(~ answer | outcome,
main = "Responses by Outcome",
data = .)
‘breaking bad habits and creating new routines’ looks most promising with the lowest rate of 1s and the highest rate of 4s.
experiment_data %>%
histogram(~ answer | rtb,
main = "Responses by RTB",
data = .)
Although there are no standout winners for ‘rtb’, there are a few losers. ‘daily text messages from a coach’ shows a higher rate of 1’s while ‘personalized, human coaching’ and ‘the suppport of a community of people just like you’ show more 2’s and less 3’s than average.
‘task’ is not within the variables to be directly analyzed, however it is important to check that there are no times that have significantly different times than the other. For example, if ‘time 12’ has many more people responding with ‘1’ than ‘4’ than ‘time 1’, this may signal that participants become less and less likely to purchase an app with more exposure to advertisements. Both variables are truly ordinal, so Kendall’s Tau will be used as the correlation coefficient. Additionally, a ‘count’ plot from ggplot2 will help visualize the distribution of responses.
# Count plot of 'answer' by 'task'
experiment_data %>%
ggplot(aes(x = task, y = answer)) +
geom_count(color = colorSet[1]) +
labs(title = "Answers by Task", x = "Answer", y = "Task") +
theme_bw() +
theme(axis.text = element_text(size = 10), axis.title = element_text(size = 14), title = element_text(size = 16))
# Correlation between 'task' and 'answer'
experiment_data %>%
select(task, answer) %>%
mutate_all(~ as.numeric(.)) %>%
cor(., method = "kendall") %>%
.[1,2]
[1] -0.01051784
Kendall’s Tau is negligible (𝜏 = -0.01), and there appears to be no relationship given the count plot.
One common issue that can impact multivariate analyses of linear models is multicollinearity, or the correlation of predictor variables. We will run a test of the relationships between all predictors to check for multicollinearity.
# Test independence of all predictors for all 2-way interactions
experiment_data %>%
xtabs(~ duration + offer + outcome + price + rtb + social_proof, data = .) %>%
loglm(~ duration + offer + outcome + price + rtb + social_proof, data = .)
Call:
loglm(formula = ~duration + offer + outcome + price + rtb + social_proof,
data = .)
Statistics:
X^2 df P(> X^2)
Likelihood Ratio 3570.691 3221 0.00001239445
Pearson 3285.516 3221 0.20989030186
The p-value (p = 0.21) is not small enough to provide evidence that the predictor variables are pairwise-dependent. Therefore, our predictors are not multicollinear.
The previous distributions gave us a feel for how the data is distributed and now we will consider each variable together in a linear model. Given that our dependent variable is ordinal, we will be using the ordinal package which contains many helpful functions for ordinal regression.
The clm function allows us to fit a cumulative link model (analogous to an ordinal regression model), which allows for each parameter type our model requires. These are:
duration, price, offer, outcome, rtb, social_proof
answer
NOTE: 1.) This task is inferential rather than predictive, so data will not need to undergo typical predictive modeling treatment such as splitting and training.
2.) ‘task’ is left out of the model for the sake of simplicity. Although it is possible that in one of the models ‘task’ displays a significant interactive effect, this would be hard to interpret and provide little to no inferential value.
# Create a cumulative link mixed model including all predictors
# WARNING: This may take a very long time to compute
ORModel <- experiment_data %>%
clm(answer ~ duration + price + offer + outcome + rtb + social_proof,
data = .,
nAGQ = 10,
Hess = TRUE)
summary(ORModel)
formula: answer ~ duration + price + offer + outcome + rtb + social_proof
data: .
link threshold nobs logLik AIC niter max.grad cond.H
logit flexible 10704 -13828.67 27699.34 5(0) 3.12e-13 1.9e+02
Coefficients:
Estimate Std. Error z value Pr(>|z|)
duration3 months 0.029053 0.043799 0.663 0.50711
duration6 months -0.009370 0.043438 -0.216 0.82921
price$30/month -0.357336 0.043213 -8.269 < 0.0000000000000002 ***
price$40/month -0.498598 0.043627 -11.429 < 0.0000000000000002 ***
offerhelp you lead a better life 0.031005 0.055722 0.556 0.57793
offerhelp you sleep without more pills 0.068547 0.056002 1.224 0.22095
offerimprove your health for the long-run 0.091239 0.055757 1.636 0.10176
offerimprove your sleep sustainably 0.035942 0.056262 0.639 0.52293
outcomechanging your sleep mindset -0.047279 0.043230 -1.094 0.27411
outcomeempowering you to take back your sleep habits -0.116284 0.043615 -2.666 0.00767 **
rtbcognitive behavioral therapy -0.009455 0.060236 -0.157 0.87527
rtbdaily text messages from a coach -0.162099 0.061680 -2.628 0.00859 **
rtbpersonalized, human coaching -0.025920 0.061131 -0.424 0.67156
rtbthe support of a community of people just like you -0.103828 0.061646 -1.684 0.09213 .
rtbunique daily games, challenges and exercises -0.085466 0.061202 -1.396 0.16258
social_proofleading researchers 0.097973 0.050212 1.951 0.05104 .
social_proofprofessional athletes -0.016104 0.050112 -0.321 0.74794
social_proofscientific evidence 0.101145 0.049798 2.031 0.04225 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Threshold coefficients:
Estimate Std. Error z value
1|2 -0.58549 0.07736 -7.568
2|3 0.31304 0.07723 4.053
3|4 1.47902 0.07902 18.717
The output is rather large, but we really only need to pay attention to two values per coefficient. The estimate shows the strength and direction of the relationship while the p-value provides the probability of our effect in the hypothesis testing sense. The standard cutoff for p-values for human-level data is p < 0.05. Thankfully, those outputs are marked with ., *, **, or *** if they are significant or approaching significance.
Each coefficient is a combination of the level name prefixed by the variable. Additionally, the estimates are relative to the expected difference from the first level of the variable. Therefore, each variable has n - 1 estimates where n is the number of levels the variable has.
In addition, the threshold coefficients provide the estimated standardized differences between each level of ‘answers’. While there are no p-values for this output, the z-values show that these estimates have > 99.99% confidence. Therefore, each level of answer is sufficiently different in a step-wise fashion.
Some variables appear to have an optimal level:
Price: $20
Social Proof: scientific evidence
While others have levels that are worse than other levels:
RTB: the support of a community of people just like you | daily text messages from a coach
Outcome: empowering you to take back your sleep habits
And still other main effects are unrelated to the outcome given that none of their levels are better/worse:
Offer
Duration
One last thing to test before moving on is the assumption of proportional odds. This assumption states that effects need to be equal for each level of the dependent variable. If this assumption is violated, the effects are not truly ordinal but rather nominal. The nominal_test function can test this.
# Test the proportional odds assumption
nominal_test(ORModel)
Tests of nominal effects
formula: answer ~ duration + price + offer + outcome + rtb + social_proof
Df logLik AIC LRT Pr(>Chi)
<none> -13829 27699
duration
price
offer
outcome
rtb
social_proof
There is no evidence for nominal effects so we may continue.
Our model has shown us the main effects, now we are going to test for interactions (i.e. a certain level of ‘price’ and ‘duration’ together are a better predictor of ‘answer’ than looking at either separately).
We will start with 2 way interactions and test for a significant difference between the interaction model and the main-effects model.
# Create a model testing 2-way interactions
ORModelX <- experiment_data %>%
clm(answer ~ (price + outcome + rtb + social_proof + offer + duration)^2,
data = .,
nAGQ = 10,
Hess = TRUE)
# Compare the main effects only model to the interaction model
anova(ORModel, ORModelX)
Likelihood ratio tests of cumulative link models:
formula: link: threshold:
ORModel answer ~ duration + price + offer + outcome + rtb + social_proof logit flexible
ORModelX answer ~ (price + outcome + rtb + social_proof + offer + duration)^2 logit flexible
no.par AIC logLik LR.stat df Pr(>Chisq)
ORModel 21 27699 -13829
ORModelX 152 27831 -13764 129.94 131 0.5097
Our test shows that the interactions model and main-effects model are not significantly different (χ2 = 0.51). Higher-order interactions (e.g. three-way) may exist but are both unlikely and hard to interpret as our 2-way interactions model already has 152 parameters (Yikes!).
Our analysis yields the following conclusions about each variable:
Price clearly has an optimal level of ‘$20’ and each increase (‘$30’, ‘$40’) has a negative effect on likeliness to rate high. Additionally, this effect is very large relative to the proceeding effects (‘$30’ and ‘$40’ coefficients were β = -0.36 and β = -0.50, respectively, compared to ‘$20’).
Although no level emerged clearly the best for ‘RTB’, there were some clear losers. ‘daily text messages from a coach’ was significantly worse (β = -0.16) while ‘the support of a community of people just like you’ approached significance (β = -0.06), so avoid using these two levels. Technically, ‘A program created just for you’ performed best, but difference between this level and the remaining 3 levels is insignificant, so all 4 will be considered equal.
The phrase to avoid for ‘Outcome’ is ‘empowering you to take back your sleep habits’ (β = -0.12). ‘breaking bad habits and creating new routines’ had a higher estimation than ‘empowering you to take back your sleep habits’ (β = 0.04), but the difference was small and insignificant.
No levels for both of these variables were significantly different. In fact, a model that drops both is not significantly different from the main-effects model.
# Build a new model that drops 'duration' and 'price'
ORModel2 <- experiment_data %>%
clm(answer ~ outcome + price + rtb + social_proof,
data = .,
Hess = T)
# Compare the models
anova(ORModel, ORModel2)
Likelihood ratio tests of cumulative link models:
formula: link: threshold:
ORModel2 answer ~ outcome + price + rtb + social_proof logit flexible
ORModel answer ~ duration + price + offer + outcome + rtb + social_proof logit flexible
no.par AIC logLik LR.stat df Pr(>Chisq)
ORModel2 15 27691 -13831
ORModel 21 27699 -13829 4.1418 6 0.6575
Therefore, we can conclude that these two variables are unrelated to the probability that an individual will download the app.
The combination of levels that is most likely to lead to an individual wanting to download the app is:
Price = the lowest possible price
Social Proof = scientific or research based statement
RTB = DO NOT USE: ‘daily messages from a coach’ OR ‘the support of a community of people just like you’. The other 4 levels were not significantly different, so RTB is not as strongly related as the previous 2 variables OR the remaining 4 phrases are not a standout phrase for this variable and one will need to be found.
Outcome = USE: ‘changing your sleep mindset’ OR ‘breaking bad habits and creating new routines’.
Duration = Unrelated to the outcome
Offer = Unrelated to the outcome
BACKGROUND: 892 participants were asked to answer a series of surveys asking about their demographic information and opinions.
First, we will load the data and look at the shape of the data.
# Set root directory
setwd("..")
# Load data
survey_data <-
read_sav('inst/extdata/data/survey_data.sav')
# Store the colnames of the survey data
SurveyColnames <- colnames(survey_data)
as.data.frame(SurveyColnames)
SurveyColnames
1 response_id
2 d_urban
3 s_gender
4 s_race
5 d_education
6 s_hhincome
7 s_problem
8 m1_philosophy_1
9 m1_philosophy_2
10 m1_philosophy_3
11 m1_philosophy_4
12 m1_philosophy_5
13 m1_philosophy_6
14 m1_philosophy_7
15 m1_philosophy_8
16 m1_philosophy_9
17 m2_attitudes_1
18 m2_attitudes_2
19 m2_attitudes_3
20 m2_attitudes_4
21 m2_attitudes_5
22 m2_attitudes_6
23 m2_attitudes_7
24 m2_attitudes_8
25 m2_attitudes_9
26 m2_attitudes_10
27 m2_attitudes_11
28 m2_awareness_apps_1
29 m2_awareness_apps_4
30 m2_awareness_apps_5
31 m2_awareness_apps_6
32 m2_awareness_apps_7
33 m2_awareness_apps_8
34 m2_awareness_apps_9
35 m2_awareness_apps_10
36 m2_awareness_apps_11
37 m2_awareness_apps_12
38 m2_awareness_apps_13
39 m2_awareness_apps_14
40 hours
41 source_1
42 source_4
43 source_5
44 source_6
45 source_7
46 source_8
47 source_9
48 source_10
49 source_11
50 source_12
51 source_13
52 source_14
53 source_15
54 source_16
55 source_17
56 behavior_1
57 behavior_4
58 behavior_5
59 behavior_6
60 behavior_7
61 behavior_8
62 behavior_9
63 behavior_10
64 behavior_11
65 behavior_12
66 behavior_13
67 behavior_14
68 behavior_15
69 behavior_16
70 behavior_a_1
71 behavior_a_2
72 behavior_a_3
73 behavior_a_4
74 behavior_a_5
75 behavior_a_6
76 behavior_a_7
77 behavior_a_8
78 behavior_a_9
79 behavior_a_10
80 behavior_a_11
81 behavior_a_12
82 behavior_a_13
83 behavior_a_14
84 interst_cbt
85 past_coach
86 interest_coach
87 d_marital
88 d_h_hnumber
89 d_parent
90 d_child_infant
91 d_child_young
92 d_child_older
93 d_politics
94 d_political_view
95 d_employment
96 d_work_schedule
97 d_work_hours
98 s_region
99 s_age
100 weights
This is quite a lot of columns. Thankfully, the naming conventions make it easy to see which parts of the survey are sectioned together.
Let’s check if there are any NAs.
# Store the number of NA's in each column
NAOutPut <- survey_data %>%
summarize_all(~ sum(is.na(.)))
# Print the output in minimal form
as.numeric(NAOutPut)
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 612 848 863 568 728 652 361 847 712
[37] 843 846 742 0 687 793 791 838 597 657 601 799 719 835 820 799 814 796 814 822 710 575 781 784 594 758 846 831 745 668 746 743 756 822 710 575
[73] 781 784 594 758 846 831 745 668 746 743 892 0 0 0 0 0 0 592 592 592 25 0 0 459 459 0 0 0
This is the number of NAs for each column (The output was truncated due to length). The NAs in columns 23 - 83 are all survey answers for the ‘awareness’, ‘source’, and ‘behavior’ sections – while the columns after these that have NAs are asking about demographic and biographic information. It is normal for people to not complete a survey and there are a few tools we have at our disposal to consider.
Let’s see how many complete cases each section of the survey has.
# Store Complete number of cases
Sections <- c("Philosophy","Attitudes", "Awareness", "Source", "Behvaior", "Demographic", "Biographic")
CompleteCases <- numeric()
# Function for appending Complete Cases
appendCases <- function(matching) {
CompleteCases <<- append(CompleteCases,
survey_data %>%
select(starts_with(matching)) %>%
complete.cases() %>%
sum()
)
}
# Add complete observation counts for each section of the survey
appendCases("m1_philosophy")
appendCases("m2_attitudes")
appendCases("m2_awareness")
appendCases("source")
appendCases("behavior")
appendCases("d_")
appendCases("s_")
# Combine the section labels and the number of complete cases
casesDf <- data.frame(Sections, CompleteCases)
casesDf
Sections CompleteCases
1 Philosophy 892
2 Attitudes 892
3 Awareness 0
4 Source 0
5 Behvaior 0
6 Demographic 198
7 Biographic 892
‘philosophy’ and ‘attitudes’ were both fully taken by every individual who took the survey ‘awareness’, on the other hand, has 0 complete observations. ‘source’ and ‘behavior’ are special cases because their NAs represent that a person did not check a box, so their NAs could effectively be recoded as 0s and we can treat them as complete observations. Demographic questions had only ~ 25% complete observations, but the total number of NAs for the questions might not be bad. Finally, biographic questions were all complete.
Let’s dig into the proportion of NA’s to not NA’s each column in the ‘awareness’ section has to see how bad this problem really is.
# Get the percentage of NA's per column for the 'awareness' section
survey_data %>%
select(starts_with("m2_awareness")) %>%
map(~ paste0(round(sum(is.na(.) / length(.)) * 100, 2), "%"))
$m2_awareness_apps_1
[1] "68.61%"
$m2_awareness_apps_4
[1] "95.07%"
$m2_awareness_apps_5
[1] "96.75%"
$m2_awareness_apps_6
[1] "63.68%"
$m2_awareness_apps_7
[1] "81.61%"
$m2_awareness_apps_8
[1] "73.09%"
$m2_awareness_apps_9
[1] "40.47%"
$m2_awareness_apps_10
[1] "94.96%"
$m2_awareness_apps_11
[1] "79.82%"
$m2_awareness_apps_12
[1] "94.51%"
$m2_awareness_apps_13
[1] "94.84%"
$m2_awareness_apps_14
[1] "83.18%"
It’s really bad. The percentage of NA’s for most columns is > 64%. For this reason, I suggest that data for the awareness section be recollected and analyzed when there is a sufficient response rate.
NOTE: Imputation is a possible solution for missing data, however the ratio of NA’s to data for most questions is far greater than 50%. This means that the expected error for each imputed variable is very large. Additionally, imputation is more suited toward problems of prediction while the task we are answering is inferential.
Let’s try the same method on the demographic information.
# Get the percentage of NA's per column for the 'demographic' section
survey_data %>%
select(starts_with("d_")) %>%
map(~ paste0(round(sum(is.na(.) / length(.)) * 100, 2), "%"))
$d_urban
[1] "0%"
$d_education
[1] "0%"
$d_marital
[1] "0%"
$d_h_hnumber
[1] "0%"
$d_parent
[1] "0%"
$d_child_infant
[1] "66.37%"
$d_child_young
[1] "66.37%"
$d_child_older
[1] "66.37%"
$d_politics
[1] "2.8%"
$d_political_view
[1] "0%"
$d_employment
[1] "0%"
$d_work_schedule
[1] "51.46%"
$d_work_hours
[1] "51.46%"
This is quite a different story. It appears that people were fine responding with most demographic questions. Questions about their children or work schedule, however, were answered by less than 50% of the participants each. This still leaves us with large sample sizes for both variables: 33% of individuals (n = 300) willing to respond about their children and 49% of respondents (n = 433) willing to talk about their work schedule. We can even use willingness to respond about these variables as a feature itself.
There are many demographic and biographic variables in our dataset.
survey_data %>%
select(starts_with(c("s_", "d_"))) %>%
mutate_all(~ as_factor(.)) %>%
map(~ kable(table(.)))
$s_gender
|. | Freq|
|:------|----:|
|Male | 389|
|Female | 503|
$s_race
|. | Freq|
|:-----------------------------------------|----:|
|African American/Black/Caribbean American | 98|
|Asian or Pacific Islander | 35|
|Hispanic or Latino | 44|
|Mixed race and Others | 35|
|White | 680|
$s_hhincome
|. | Freq|
|:----------------------------|----:|
|Less than $25,000 | 168|
|$25,000 - less than $50,000 | 255|
|$50,000 - less than $100,000 | 273|
|$100,000 or more | 196|
$s_problem
|. | Freq|
|:-------------------------------------------------------|----:|
|Every night | 247|
|Most nights (Four or more nights during a typical week) | 305|
|Sometimes (Two or three nights during a typical week) | 340|
|Rarely (One night during a typical week) | 0|
|Never | 0|
$s_region
|. | Freq|
|:---------|----:|
|MIDWEST | 237|
|WEST | 192|
|SOUTH | 283|
|NORTHEAST | 180|
$s_age
|. | Freq|
|:-----|----:|
|65+ | 177|
|46-64 | 252|
|31-45 | 253|
|18-30 | 210|
$d_urban
|. | Freq|
|:------------------------------------------------------------------------------|----:|
|Urban (An area that is commonly within a town or city with a large population) | 301|
|Suburban (An area that is residential and on the outskirts of a city) | 443|
|Rural (An area that is outside of town or city and has a small population) | 148|
$d_education
|. | Freq|
|:-----------------------------------------|----:|
|Less than high school | 22|
|High school graduate | 188|
|Currently enrolled in college | 19|
|Some college, but not currently enrolled | 176|
|Associate's degree/Technical degree/AA/AS | 123|
|College graduate/Bachelor's degree/BA/BS | 184|
|Some postgraduate courses | 25|
|Master's degree (e.g MS, MA, MBA, etc. ) | 123|
|PhD, JD or MD | 32|
$d_marital
|. | Freq|
|:--------------------------|----:|
|Married | 439|
|Separated | 15|
|Divorced | 81|
|Widowed | 40|
|Never married | 247|
|Domestic/civil partnership | 70|
$d_h_hnumber
|. | Freq|
|:---|----:|
|1 | 178|
|2-3 | 515|
|4-5 | 168|
|6+ | 31|
$d_parent
|. | Freq|
|:---|----:|
|Yes | 300|
|No | 592|
$d_child_infant
|. | Freq|
|:-------------|----:|
|None | 168|
|One | 90|
|Two | 34|
|Three or more | 8|
$d_child_young
|. | Freq|
|:-------------|----:|
|None | 105|
|One | 144|
|Two | 40|
|Three or more | 11|
$d_child_older
|. | Freq|
|:-------------|----:|
|None | 150|
|One | 107|
|Two | 36|
|Three or more | 7|
$d_politics
|. | Freq|
|:-----------------|----:|
|Strong Democrat | 147|
|Democrat | 208|
|Independent | 225|
|Republican | 151|
|Strong Republican | 111|
|Other | 25|
$d_political_view
|. | Freq|
|:---------------------|----:|
|Very liberal | 137|
|Somewhat liberal | 168|
|Moderate | 326|
|Somewhat conservative | 168|
|Very conservative | 93|
$d_employment
|. | Freq|
|:-----------------------------|----:|
|Working full time now | 326|
|Working part time now | 107|
|Temporarily laid off | 24|
|Unemployed | 76|
|Retired | 215|
|Permanently disabled | 39|
|Taking care of home or family | 69|
|Student | 29|
|Other | 7|
$d_work_schedule
|. | Freq|
|:------------------------------------------------------------------|----:|
|Fixed work schedule – working hours remain consistent | 271|
|Flexible work schedule – you control your working hours | 118|
|Shift work schedule – working hours depend on the week | 30|
|Rotating work schedule – shift work changes in a repeating pattern | 11|
|Don’t know / not applicable | 3|
$d_work_hours
|. | Freq|
|:------------|----:|
|Less than 20 | 54|
|20-39 | 179|
|40 or more | 200|
There are many demographic and biographic variables in the data, so we are going to skip analysis of each variable and just take note of important parts of our data. Additionally, many of the variables are already coded as numeric variables.
The following facts are notable for this dataset:
1.) There are far more white individuals than any other group.
2.) A majority of individuals are either married or have never married.
3.) Every individual in the study has trouble sleep at least sometimes.
4.) 1/3 respondents are parents.
To analyze whether the Likert scale survey sections are reliable measures, we will use the alpha function from the psych package. This will give internal consistency estimates based on Cronbach’s Alpha as well as item statistics in one function.
# Alpha for the 'philosophy' scale
survey_data %>%
select(starts_with("m1_philosophy")) %>%
alpha()
Reliability analysis
Call: alpha(x = .)
raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
0.66 0.67 0.69 0.18 2 0.017 3.6 0.55 0.18
lower alpha upper 95% confidence boundaries
0.63 0.66 0.69
Reliability if an item is dropped:
raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
m1_philosophy_1 0.63 0.64 0.66 0.18 1.8 0.018 0.019 0.16
m1_philosophy_2 0.68 0.68 0.68 0.21 2.1 0.016 0.013 0.18
m1_philosophy_3 0.64 0.64 0.67 0.18 1.8 0.018 0.022 0.18
m1_philosophy_4 0.62 0.63 0.65 0.18 1.7 0.019 0.019 0.18
m1_philosophy_5 0.62 0.64 0.65 0.18 1.7 0.019 0.020 0.18
m1_philosophy_6 0.59 0.60 0.62 0.16 1.5 0.021 0.017 0.16
m1_philosophy_7 0.61 0.61 0.63 0.17 1.6 0.020 0.016 0.18
m1_philosophy_8 0.63 0.64 0.66 0.18 1.8 0.019 0.017 0.18
m1_philosophy_9 0.66 0.67 0.69 0.20 2.0 0.017 0.021 0.19
Item statistics
n raw.r std.r r.cor r.drop mean sd
m1_philosophy_1 892 0.51 0.53 0.44 0.33 3.7 1.03
m1_philosophy_2 892 0.41 0.35 0.23 0.16 2.7 1.28
m1_philosophy_3 892 0.48 0.51 0.40 0.32 3.6 0.89
m1_philosophy_4 892 0.55 0.56 0.47 0.37 3.8 1.04
m1_philosophy_5 892 0.56 0.54 0.46 0.38 3.3 1.06
m1_philosophy_6 892 0.68 0.66 0.63 0.52 3.3 1.17
m1_philosophy_7 892 0.60 0.62 0.58 0.45 3.9 0.95
m1_philosophy_8 892 0.51 0.53 0.45 0.34 4.0 0.97
m1_philosophy_9 892 0.40 0.40 0.25 0.20 3.8 1.03
Non missing response frequency for each item
1 2 3 4 5 miss
m1_philosophy_1 0.03 0.10 0.23 0.40 0.24 0
m1_philosophy_2 0.22 0.24 0.25 0.19 0.10 0
m1_philosophy_3 0.02 0.07 0.40 0.37 0.15 0
m1_philosophy_4 0.04 0.07 0.27 0.36 0.27 0
m1_philosophy_5 0.04 0.19 0.33 0.29 0.15 0
m1_philosophy_6 0.09 0.17 0.26 0.33 0.16 0
m1_philosophy_7 0.02 0.07 0.20 0.44 0.28 0
m1_philosophy_8 0.02 0.05 0.16 0.41 0.35 0
m1_philosophy_9 0.02 0.09 0.22 0.38 0.28 0
‘philosophy’ is not quite a singular measure (α = 0.66) (psychometric measures are considered reliable at α = 0.70 or greater). If question 2 is removed, it approaches singularity (α = 0.68), but does not reach it. In order to remedy this, I suggest that future surveys add questions.
Question 9 also looks weak as removing it hardly changes alpha. Questions 2 and 9 also have low correlations to the item-total. Let’s see what the results are if questions 2 and 9 are removed.
# Alpha estimate for the 'philosophy' scale without bad questions (2 and 9)
survey_data %>%
select(starts_with("m1_philosophy"), -contains(c("_2","_9"))) %>%
alpha()
Reliability analysis
Call: alpha(x = .)
raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
0.69 0.69 0.68 0.24 2.2 0.016 3.7 0.61 0.2
lower alpha upper 95% confidence boundaries
0.66 0.69 0.72
Reliability if an item is dropped:
raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
m1_philosophy_1 0.67 0.66 0.65 0.25 2.0 0.017 0.0145 0.19
m1_philosophy_3 0.68 0.68 0.66 0.26 2.1 0.017 0.0126 0.24
m1_philosophy_4 0.65 0.65 0.64 0.24 1.9 0.018 0.0127 0.19
m1_philosophy_5 0.69 0.69 0.67 0.27 2.2 0.016 0.0106 0.24
m1_philosophy_6 0.62 0.63 0.61 0.22 1.7 0.020 0.0108 0.18
m1_philosophy_7 0.63 0.63 0.61 0.22 1.7 0.019 0.0082 0.19
m1_philosophy_8 0.66 0.66 0.64 0.24 1.9 0.018 0.0092 0.20
Item statistics
n raw.r std.r r.cor r.drop mean sd
m1_philosophy_1 892 0.56 0.57 0.45 0.36 3.7 1.03
m1_philosophy_3 892 0.50 0.52 0.39 0.32 3.6 0.89
m1_philosophy_4 892 0.61 0.60 0.49 0.41 3.8 1.04
m1_philosophy_5 892 0.51 0.50 0.35 0.29 3.3 1.06
m1_philosophy_6 892 0.71 0.68 0.62 0.52 3.3 1.17
m1_philosophy_7 892 0.67 0.68 0.62 0.51 3.9 0.95
m1_philosophy_8 892 0.58 0.59 0.50 0.40 4.0 0.97
Non missing response frequency for each item
1 2 3 4 5 miss
m1_philosophy_1 0.03 0.10 0.23 0.40 0.24 0
m1_philosophy_3 0.02 0.07 0.40 0.37 0.15 0
m1_philosophy_4 0.04 0.07 0.27 0.36 0.27 0
m1_philosophy_5 0.04 0.19 0.33 0.29 0.15 0
m1_philosophy_6 0.09 0.17 0.26 0.33 0.16 0
m1_philosophy_7 0.02 0.07 0.20 0.44 0.28 0
m1_philosophy_8 0.02 0.05 0.16 0.41 0.35 0
The mean Alpha estimate for these items is close enough to be considered acceptable (α = 0.69). According to the task description, this scale is meant to measure an individuals views on science, products, and health.
Now we will create a new column to save the scale composite scores for philosophy in the case that we need it as a continuous variable. Because the ‘philosophy’ and ‘attitudes’ measures both use 5-point Likert scales, we can just take the mean value of each so we don’t need to standardize the values later.
# Add 'philosophy' composite scores
survey_data$philosophy_composite <-
survey_data %>%
select(starts_with("m1_philosophy"), -contains(c("_2","_9"))) %>%
rowMeans()
The other scale we will be analyzing is the attitudes scale. Let’s first see what the reliability is out of the gate.
# Alpha estimate for the 'attitudes' scale
survey_data %>%
select(starts_with("m2_attitudes")) %>%
alpha()
Reliability analysis
Call: alpha(x = .)
raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
0.72 0.74 0.75 0.21 2.9 0.014 3.9 0.5 0.21
lower alpha upper 95% confidence boundaries
0.69 0.72 0.75
Reliability if an item is dropped:
raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
m2_attitudes_1 0.69 0.72 0.72 0.20 2.5 0.016 0.015 0.20
m2_attitudes_2 0.68 0.71 0.71 0.20 2.4 0.016 0.015 0.20
m2_attitudes_3 0.68 0.71 0.72 0.20 2.5 0.016 0.018 0.20
m2_attitudes_4 0.75 0.77 0.76 0.25 3.3 0.013 0.011 0.23
m2_attitudes_5 0.69 0.72 0.73 0.21 2.6 0.016 0.020 0.20
m2_attitudes_6 0.69 0.72 0.73 0.21 2.6 0.015 0.019 0.20
m2_attitudes_7 0.67 0.70 0.71 0.19 2.3 0.016 0.014 0.20
m2_attitudes_8 0.70 0.73 0.74 0.21 2.7 0.015 0.020 0.20
m2_attitudes_9 0.68 0.71 0.72 0.20 2.4 0.016 0.017 0.20
m2_attitudes_10 0.72 0.74 0.75 0.23 2.9 0.014 0.019 0.24
m2_attitudes_11 0.71 0.74 0.75 0.22 2.8 0.015 0.019 0.23
Item statistics
n raw.r std.r r.cor r.drop mean sd
m2_attitudes_1 892 0.55 0.59 0.555 0.439 4.5 0.76
m2_attitudes_2 892 0.59 0.63 0.594 0.480 4.4 0.80
m2_attitudes_3 892 0.60 0.60 0.553 0.458 4.0 1.01
m2_attitudes_4 892 0.30 0.24 0.099 0.083 3.2 1.23
m2_attitudes_5 892 0.57 0.55 0.483 0.421 3.7 1.03
m2_attitudes_6 892 0.54 0.55 0.473 0.400 3.9 0.94
m2_attitudes_7 892 0.65 0.69 0.675 0.553 4.2 0.85
m2_attitudes_8 892 0.51 0.52 0.426 0.366 3.8 0.93
m2_attitudes_9 892 0.61 0.63 0.588 0.492 3.9 0.90
m2_attitudes_10 892 0.44 0.40 0.279 0.236 3.9 1.20
m2_attitudes_11 892 0.46 0.44 0.333 0.290 3.6 1.04
Non missing response frequency for each item
1 2 3 4 5 miss
m2_attitudes_1 0.01 0.01 0.07 0.28 0.63 0
m2_attitudes_2 0.01 0.02 0.08 0.33 0.56 0
m2_attitudes_3 0.02 0.06 0.18 0.34 0.40 0
m2_attitudes_4 0.10 0.22 0.23 0.30 0.15 0
m2_attitudes_5 0.04 0.09 0.28 0.37 0.22 0
m2_attitudes_6 0.02 0.07 0.20 0.45 0.26 0
m2_attitudes_7 0.01 0.03 0.15 0.41 0.40 0
m2_attitudes_8 0.02 0.06 0.21 0.48 0.22 0
m2_attitudes_9 0.01 0.05 0.22 0.42 0.30 0
m2_attitudes_10 0.06 0.09 0.16 0.30 0.39 0
m2_attitudes_11 0.03 0.13 0.28 0.37 0.20 0
This scale shows evidence for being a singular measure (α = 0.74). Additionally, the Alpha-if-item-dropped is not significantly higher for any question.
There is one issue, however. The mean scores for a few questions are very high. Questions 1 and 2 both have very high means for a 5-point Likert measure (M = 4.40 and 4.50, respectively). Let’s take a look at the questions:
- Getting high quality sleep is important for my long-term health
- Getting better sleep would improve my focus
It is reasonable to expect that most people will put a 5 on both these questions. What happens to the results if these questions are dropped?
survey_data %>%
select(starts_with("m2_attitudes"), -contains(c("_1","_2"))) %>%
alpha()
Reliability analysis
Call: alpha(x = .)
raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
0.64 0.66 0.65 0.22 2 0.019 3.8 0.56 0.25
lower alpha upper 95% confidence boundaries
0.6 0.64 0.68
Reliability if an item is dropped:
raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
m2_attitudes_3 0.58 0.61 0.60 0.21 1.5 0.022 0.0205 0.25
m2_attitudes_4 0.70 0.71 0.67 0.28 2.4 0.016 0.0049 0.28
m2_attitudes_5 0.57 0.60 0.59 0.20 1.5 0.023 0.0233 0.23
m2_attitudes_6 0.61 0.63 0.62 0.22 1.7 0.021 0.0185 0.25
m2_attitudes_7 0.57 0.59 0.57 0.19 1.4 0.023 0.0128 0.23
m2_attitudes_8 0.61 0.64 0.63 0.23 1.8 0.020 0.0219 0.27
m2_attitudes_9 0.58 0.60 0.59 0.20 1.5 0.022 0.0159 0.23
Item statistics
n raw.r std.r r.cor r.drop mean sd
m2_attitudes_3 892 0.63 0.63 0.54 0.429 4.0 1.01
m2_attitudes_4 892 0.40 0.33 0.13 0.096 3.2 1.23
m2_attitudes_5 892 0.65 0.64 0.55 0.455 3.7 1.03
m2_attitudes_6 892 0.55 0.57 0.45 0.344 3.9 0.94
m2_attitudes_7 892 0.65 0.68 0.64 0.491 4.2 0.85
m2_attitudes_8 892 0.53 0.55 0.41 0.324 3.8 0.93
m2_attitudes_9 892 0.61 0.64 0.56 0.426 3.9 0.90
Non missing response frequency for each item
1 2 3 4 5 miss
m2_attitudes_3 0.02 0.06 0.18 0.34 0.40 0
m2_attitudes_4 0.10 0.22 0.23 0.30 0.15 0
m2_attitudes_5 0.04 0.09 0.28 0.37 0.22 0
m2_attitudes_6 0.02 0.07 0.20 0.45 0.26 0
m2_attitudes_7 0.01 0.03 0.15 0.41 0.40 0
m2_attitudes_8 0.02 0.06 0.21 0.48 0.22 0
m2_attitudes_9 0.01 0.05 0.22 0.42 0.30 0
Dropping these items leads to a less singular scale. For this reason and because the items do not have a very low standard deviation, they will be kept in the composite score calculation. In the future however, I recommend that these questions get reworked or reworded to increase the variance in responses while retaining the meaning of the question.
Just like with the philosophy scores, we can now create composite scores.
# Add 'attitudes' composite scores
survey_data$attitudes_composite <- survey_data %>%
select(starts_with("m2_attitudes")) %>%
rowMeans(.)
One major part of the task is to see if there are any underlying latent structures to our data. Using the MFA function from the FactoMineR package will help to accomplish this.
First we need to check if any of the variables we will be clustering have 0-variance or near-zero variance.
# Prefixes for each group of the survey
survey_group_prefixes <- c("s_", "d_", "interest_", "past_", "behavior_", "source_")
# Subset only variables of interest
survey_factors <- survey_data %>%
rename(interest_cbt = interst_cbt) %>%
select(starts_with(survey_group_prefixes), ends_with("_composite")) %>%
map(~ as_factor(.)) %>%
as.data.frame()
# Find variables that have 95% or more of their responses as one response
near_0_vars <- nearZeroVar(survey_factors)
near_0_vars
[1] 23 24 25 26 27 28 29 30 31 32 33 34 35 36 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65
There are quite a few columns that have almost no variance. Let’s take a look at what the variables consist of.
# Column names for near_0_variance predictors
survey_factors[,near_0_vars] %>%
colnames()
[1] "behavior_1" "behavior_4" "behavior_5" "behavior_6" "behavior_7" "behavior_8" "behavior_9" "behavior_10" "behavior_11"
[10] "behavior_12" "behavior_13" "behavior_14" "behavior_15" "behavior_16" "behavior_a_14" "source_1" "source_4" "source_5"
[19] "source_6" "source_7" "source_8" "source_9" "source_10" "source_11" "source_12" "source_13" "source_14"
[28] "source_15" "source_16" "source_17"
A large number of ‘behavior’ columns as well as every ‘source’ variable had little to no variance. We will keep this in mind in case the results of the MFA are off. In order to run an MFA we needs to separate each section of the survey into groups.
# Remove low-variance variables
survey_factors <- survey_factors[,-near_0_vars]
Now, we can create the model.
# All of the sources have been removed, so remove it from the prefixes list
survey_group_prefixes <- survey_group_prefixes[!survey_group_prefixes %in% "source_"]
# Create an ordered vector of the number of questions in each group
survey_groups <- map(
survey_group_prefixes,
function(x) length(
# Match column names to the survey_group_prefixes
grep(
# "^" specifies "start_with" in R's regex
paste0("^", x),
colnames(survey_factors)))
)
# Add survey composites to the end of the survey groups
survey_groups <- c(survey_groups,1,1)
# Names for each group
survey_group_names <- c("Biographic", "Demographics", "Interests", "Past-Coach", "Behaviors", "Philosophy", "Attitudes")
# Run an MFA
survey_factors_MFA <- MFA(survey_factors %>% mutate_at(c("philosophy_composite", "attitudes_composite"), as.numeric),
group = as.numeric(survey_groups),
# Specify groups (n = categorical, s = numeric)
type = c(rep("n",5),rep("s",2)),
name.group = survey_group_names,
ncp = 6
)
The factoextra library helps us plot the results. Click the tabs below to show different plots of the results.
fviz_mfa_var(survey_factors_MFA, 'group')
Most variable groups are explained more by dimension 1 than dimension 2. ‘interests’ appears to most strongly and almost solely related to dimension 1 while the same can be said for ‘biographic’ and dimension 2. Finally, ‘behaviors’ and ‘attitudes’ are very similar measures.
fviz_screeplot(survey_factors_MFA)
paste("Number of Dimensions: ", nrow(survey_factors_MFA$eig))
[1] "Number of Dimensions: 185"
The scree plot displays that the largest dimension (dimension 1) explains around 5% of the variance. For most datasets this is small, however given that our dataset has a very large number of dimensions (185), 5% still represents 9 variables.
fviz_mfa_axes(survey_factors_MFA,
axes = c(1,2),
geom = c("point", "arrow"))
We can see that neither dimension 1 nor 2 are highly related to any single variable group. This is probably why both axes only account for a small portion of the variability in the sample. This means each group of variables is measuring something different (e.g. biographic information is not as strongly related to demographic information as we might think).
fviz_contrib(survey_factors_MFA,
choice = "quali.var",
axes = 1,
top = 10)
These are the top variables for explaining dimension 1, which explains around 5% of the variance for the model. Looking at the top variables, it appears that the largest cluster in the data is explained by a combination of having a coach or being interested in a coach and to a lesser extent age.
fviz_contrib(survey_factors_MFA,
choice = "quali.var",
axes = 2,
top = 10)
The second largest dimension in the model accounts for nearly 3% of the variance. This dimension is explained by being unsure about a coach, age, and to a lesser extent- socioeconomic status.
Another way to cluster our data is using a K-modes algorithm, an extension of the more popular K-means algorithm. The klaR function from the kmodes package will help to accomplish this. But first, we need to create dummy variables for the demographic data because they are non-continuous, as well as some other data cleaning. The last thing the below chunk will accomplish is replacing NAs with 0s for the ‘source’ questions. Thankfully, removing zero-var variables was done previously, so we can reuse some of our work.
# Dummy code demographic, biographic, and interest variables
survey_dummy <- survey_data %>%
rename(interest_cbt = interst_cbt) %>%
select(starts_with(survey_group_prefixes)) %>%
map(~ as_factor(.)) %>%
dummy_cols() %>%
select_if(is.numeric)
cluster_data <- data.frame(
survey_dummy,
survey_data %>%
select(ends_with("_composite"))
)
ncol(cluster_data)
[1] 221
While dummy coding data dramatically increases the number of columns (221), any 1 column is straight-forward to interpret by its column name. Our data structure contains:
Let’s remove the low variance columns and replacs NAs with 0s like we did for the MFA.
# Find low variance columns
near_0_vars <- nearZeroVar(cluster_data)
# Remove low variance columns
cluster_data <- cluster_data[, -near_0_vars]
# Replace NAs with 0s
cluster_data <- cluster_data %>%
replace(is.na(.), 0)
We can now proceed with looking at a scree plot of k-modes variance within values by cluster to choose k for our algorithm.
# Get within cluster variance values for k = 1:30
klist <- numeric()
map(1:30, function(x){
# Set the seed to ensure the output is consistent
set.seed(1073)
k <- kmodes(cluster_data, modes = x)
klist <<- c(klist, sum(k$withindiff))
})
kindex <- seq(1,length(klist))
k_df <- data.frame(klist, kindex)
# Plot the errors for each value of K
k_df %>%
ggplot(aes(x = kindex, y = klist)) +
geom_point(color = colorSet[1], size = 4) +
theme_bw() +
labs(title = "Scree Plot of Total Error for K = 1:30", y = "Total Error", x = "K") +
theme(legend.position = "none", axis.text = element_text(size = 10), axis.title = element_text(size = 14), title = element_text(size = 16))
The “elbows” of the scree plot appears to be at K = 4 and K = 7. We will use K = 4, favoring a simple model. Using the labels for each person, we could look at a plot to see if clusters are related to having certain sources of sleep illness.
# Set seed to ensure identical results each time
set.seed(1073)
# Kmodes on our data for K = 5
km4 <- kmodes(cluster_data, modes = 4)
source_cluster_data <- data.frame(km4$cluster,
survey_data %>% select(starts_with("source")))
# Graph the percentage of source of sleep issues by cluster
source_cluster_data %>%
melt(id.vars = "km4.cluster") %>%
filter(value == 1) %>%
group_by(variable) %>%
count(km4.cluster) %>%
mutate(pct = round(n/sum(n),4) * 100) %>%
arrange(variable) %>%
ggplot(aes(x = variable, y = pct, fill = factor(km4.cluster))) +
geom_bar(stat = "identity", position = position_stack(vjust = 0.5)) +
geom_text(aes(label = pct), position = position_stack(vjust = 0.5), size = 3.5) +
labs(x = "Source", y = "Percent of Total", title = "Distribution of Sleep Issues by Cluster (%)") +
theme_bw() +
theme(legend.title = element_blank(), axis.text.x = element_text(size = 10, angle = 45, hjust = 1), axis.title = element_text(size = 14), title = element_text(size = 16))
Each cluster can be analyzed as a group of people with similar characteristics which may be related to having a sleeping issue. For instance, only 7% reporting ‘source_17’ are from cluster 3. Lets look at cluster 3 to find out.
# Add cluster labels to our data
survey_data$cluster <- km4$cluster
# Analyze people with cluster 3
survey_data %>%
filter(cluster == 3) %>%
select(starts_with(c("d_"))) %>%
mutate_all(~ as_factor(.)) %>%
map(~ kable(table(.)))
$d_urban
|. | Freq|
|:------------------------------------------------------------------------------|----:|
|Urban (An area that is commonly within a town or city with a large population) | 138|
|Suburban (An area that is residential and on the outskirts of a city) | 90|
|Rural (An area that is outside of town or city and has a small population) | 33|
$d_education
|. | Freq|
|:-----------------------------------------|----:|
|Less than high school | 5|
|High school graduate | 49|
|Currently enrolled in college | 3|
|Some college, but not currently enrolled | 56|
|Associate's degree/Technical degree/AA/AS | 31|
|College graduate/Bachelor's degree/BA/BS | 56|
|Some postgraduate courses | 3|
|Master's degree (e.g MS, MA, MBA, etc. ) | 45|
|PhD, JD or MD | 13|
$d_marital
|. | Freq|
|:--------------------------|----:|
|Married | 165|
|Separated | 10|
|Divorced | 12|
|Widowed | 2|
|Never married | 51|
|Domestic/civil partnership | 21|
$d_h_hnumber
|. | Freq|
|:---|----:|
|1 | 12|
|2-3 | 123|
|4-5 | 106|
|6+ | 20|
$d_parent
|. | Freq|
|:---|----:|
|Yes | 257|
|No | 4|
$d_child_infant
|. | Freq|
|:-------------|----:|
|None | 150|
|One | 72|
|Two | 28|
|Three or more | 7|
$d_child_young
|. | Freq|
|:-------------|----:|
|None | 84|
|One | 126|
|Two | 37|
|Three or more | 10|
$d_child_older
|. | Freq|
|:-------------|----:|
|None | 123|
|One | 96|
|Two | 32|
|Three or more | 6|
$d_politics
|. | Freq|
|:-----------------|----:|
|Strong Democrat | 42|
|Democrat | 64|
|Independent | 52|
|Republican | 60|
|Strong Republican | 32|
|Other | 4|
$d_political_view
|. | Freq|
|:---------------------|----:|
|Very liberal | 59|
|Somewhat liberal | 49|
|Moderate | 83|
|Somewhat conservative | 36|
|Very conservative | 34|
$d_employment
|. | Freq|
|:-----------------------------|----:|
|Working full time now | 170|
|Working part time now | 38|
|Temporarily laid off | 5|
|Unemployed | 11|
|Retired | 2|
|Permanently disabled | 6|
|Taking care of home or family | 25|
|Student | 1|
|Other | 3|
$d_work_schedule
|. | Freq|
|:------------------------------------------------------------------|----:|
|Fixed work schedule – working hours remain consistent | 132|
|Flexible work schedule – you control your working hours | 56|
|Shift work schedule – working hours depend on the week | 12|
|Rotating work schedule – shift work changes in a repeating pattern | 5|
|Don’t know / not applicable | 3|
$d_work_hours
|. | Freq|
|:------------|----:|
|Less than 20 | 18|
|20-39 | 95|
|40 or more | 95|
The vast majority of people in this cluster are working 20+ hours pre week and are parents. Because ‘source_17’ was “other”, we can assume that the vast majority of people who have children and work full time have up to 3 of the other listed sleep issues.
We have extracted the following information from analyzing the survey data:
1.) The sample is predominantly white, married or have never married, and unanimously have at least some sleeping issues.
2.) Missing data in the ‘awareness’ section makes it unusable. I would ask for more data or for a new sample.
3.) The ‘philosophy’ section is on the verge of measuring multiple latent structures. If it is important that it be a reliable measure, I suggest adding at least 2 - 3 more questions and/or rewording problematic questions (2 and 9).
4.) The ‘attitudes’ section is a reliable measure, however questions 1 and 2 both have very high means for a 5-point Likert scale (M = 4.40 and 4.50, respectively). I would suggest rewording or replacing these items as they probably are providing little information to the scale despite increasing it’s reliability.
5.) Missing data in the demographics show that people are unanimously willing to answer most questions with the exception of questions about their children or work.
6.) An MFA revealed that the 2 largest dimensions we can reduce the data to jointly explained ~ 8% of the variance and roughly mapped onto:
Coaching interest
Socio-economic status and age
7.) A K-modes clustering algorithm (K = 4) revealed that nearly all parents clustered together, and their sleeping issues can be described by one of the 14 listed sources of sleeping issues that are not ‘other’.
Social Proof