Association rule mining is an unsupervised learning technique designed to find patterns in data such as market transactions or customer preferences, typically in order to build recommendation systems.
Although these are its most popular applications, in this short study we aim to show how association rule algorithms can help to analyse survey data, namely psychological and behavioural questionnaires.
This approach may help to better understand the relationships between respondents’ characteristics and, for example, extend a simple correlation analysis.
Step-by-step, the work will look as follows:
* load and inspect the data,
* select and discretize the demographic features,
* build compound indices from the 50 questionnaire items,
* convert the processed data into transactions,
* mine and interpret association rules with the Apriori algorithm.
The data set we use was retrieved from kaggle.com and is available at the following link.
It contains over 4,000 responses to the Duckworth Grit Scale survey. The data also cover a psychological questionnaire of 50 questions spanning 5 core dimensions:
* extraversion
* neuroticism
* sympathy towards others
* discipline and orderliness
* inventiveness
Please check kaggle.com for the full list of questions and a detailed description of the data set.
The responses were given on a Likert scale from 1 to 5 (1 = Disagree; 5 = Agree). Let us first load the data and see what it looks like.
# packages used throughout: dplyr for data wrangling, DescTools for Desc(),
# arules and arulesViz for association rule mining and its visualisation
library(dplyr); library(DescTools); library(arules); library(arulesViz)

# loading the data
df <- read.csv('dataset.csv', sep = '')
# limiting the data to the questions and demographics
df <- df[complete.cases(df), c(31:92, 1)]
head(df)
## education urban gender engnat age hand religion orientation race voted
## 1 4 3 2 2 28 1 1 1 4 1
## 2 2 3 2 1 19 1 6 1 4 2
## 3 1 2 2 2 16 1 0 1 1 2
## 4 3 2 2 1 30 1 6 1 3 1
## 5 4 2 1 2 38 1 2 1 1 2
## 6 3 3 2 2 23 1 12 1 1 2
## married familysize E1 E2 E3 E4 E5 E6 E7 E8 E9 E10 N1 N2 N3 N4 N5 N6 N7 N8 N9
## 1 1 2 2 2 3 4 5 1 2 4 4 4 2 3 4 2 3 2 3 3 2
## 2 1 3 4 2 5 3 5 2 4 2 5 4 5 2 5 3 3 4 5 5 5
## 3 1 3 1 5 2 4 3 2 1 5 2 5 4 3 4 3 4 4 4 3 4
## 4 1 6 3 5 2 4 3 4 3 5 1 5 1 5 3 3 2 3 1 1 2
## 5 2 3 1 5 1 5 2 5 1 5 1 4 4 1 5 3 5 4 1 1 4
## 6 0 1 3 1 3 1 4 1 4 2 5 2 1 3 1 5 2 1 1 2 1
## N10 A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 O1 O2 O3 O4
## 1 4 1 5 2 5 1 4 2 4 4 3 2 4 4 3 2 4 3 2 2 4 5 2 4 1
## 2 3 5 4 1 4 2 5 1 4 3 4 4 3 4 3 1 3 5 2 5 3 4 2 5 2
## 3 2 3 3 1 3 3 4 4 4 3 3 2 2 4 2 3 4 5 3 3 4 4 2 4 2
## 4 1 1 4 1 5 4 5 2 5 5 4 4 1 5 1 4 1 4 1 4 3 5 2 3 2
## 5 3 2 3 2 3 2 3 2 2 2 4 3 1 3 1 4 2 3 2 3 4 5 3 2 2
## 6 1 2 4 1 4 2 4 2 4 4 5 5 3 4 1 4 3 4 1 4 5 5 1 5 2
## O5 O6 O7 O8 O9 O10 country
## 1 4 2 5 4 5 4 RO
## 2 4 1 4 3 4 5 US
## 3 4 3 5 5 4 4 US
## 4 4 1 4 2 5 4 KE
## 5 3 3 4 1 3 2 JP
## 6 5 1 5 2 5 5 AU
str(df)
## 'data.frame': 4231 obs. of 63 variables:
## $ education : int 4 2 1 3 4 3 3 2 3 1 ...
## $ urban : int 3 3 2 2 2 3 2 1 3 2 ...
## $ gender : int 2 2 2 2 1 2 1 1 1 2 ...
## $ engnat : int 2 1 2 1 2 2 1 2 2 1 ...
## $ age : int 28 19 16 30 38 23 35 22 50 16 ...
## $ hand : int 1 1 1 1 1 1 1 1 1 1 ...
## $ religion : int 1 6 0 6 2 12 3 7 12 1 ...
## $ orientation: int 1 1 1 1 1 1 1 3 1 1 ...
## $ race : int 4 4 1 3 1 1 4 5 4 4 ...
## $ voted : int 1 2 2 1 2 2 2 2 2 2 ...
## $ married : int 1 1 1 1 2 0 3 1 3 1 ...
## $ familysize : int 2 3 3 6 3 1 1 2 3 2 ...
## $ E1 : int 2 4 1 3 1 3 4 3 3 4 ...
## $ E2 : int 2 2 5 5 5 1 1 3 1 1 ...
## $ E3 : int 3 5 2 2 1 3 3 4 5 5 ...
## $ E4 : int 4 3 4 4 5 1 2 3 1 2 ...
## $ E5 : int 5 5 3 3 2 4 3 2 5 4 ...
## $ E6 : int 1 2 2 4 5 1 1 2 1 1 ...
## $ E7 : int 2 4 1 3 1 4 4 2 4 5 ...
## $ E8 : int 4 2 5 5 5 2 2 3 2 4 ...
## $ E9 : int 4 5 2 1 1 5 5 4 4 3 ...
## $ E10 : int 4 4 5 5 4 2 3 4 2 2 ...
## $ N1 : int 2 5 4 1 4 1 5 5 3 4 ...
## $ N2 : int 3 2 3 5 1 3 2 2 4 3 ...
## $ N3 : int 4 5 4 3 5 1 4 5 3 4 ...
## $ N4 : int 2 3 3 3 3 5 2 2 5 2 ...
## $ N5 : int 3 3 4 2 5 2 3 4 1 3 ...
## $ N6 : int 2 4 4 3 4 1 4 5 1 3 ...
## $ N7 : int 3 5 4 1 1 1 5 4 1 3 ...
## $ N8 : int 3 5 3 1 1 2 5 5 1 2 ...
## $ N9 : int 2 5 4 2 4 1 4 5 1 3 ...
## $ N10 : int 4 3 2 1 3 1 2 3 1 3 ...
## $ A1 : int 1 5 3 1 2 2 2 5 1 1 ...
## $ A2 : int 5 4 3 4 3 4 4 5 4 5 ...
## $ A3 : int 2 1 1 1 2 1 4 2 1 1 ...
## $ A4 : int 5 4 3 5 3 4 5 4 5 5 ...
## $ A5 : int 1 2 3 4 2 2 3 2 1 1 ...
## $ A6 : int 4 5 4 5 3 4 3 5 3 5 ...
## $ A7 : int 2 1 4 2 2 2 2 2 1 1 ...
## $ A8 : int 4 4 4 5 2 4 4 4 5 4 ...
## $ A9 : int 4 3 3 5 2 4 5 3 5 4 ...
## $ A10 : int 3 4 3 4 4 5 2 3 5 4 ...
## $ C1 : int 2 4 2 4 3 5 2 4 4 4 ...
## $ C2 : int 4 3 2 1 1 3 4 1 3 3 ...
## $ C3 : int 4 4 4 5 3 4 5 5 4 4 ...
## $ C4 : int 3 3 2 1 1 1 3 3 1 3 ...
## $ C5 : int 2 1 3 4 4 4 2 3 3 2 ...
## $ C6 : int 4 3 4 1 2 3 1 1 1 3 ...
## $ C7 : int 3 5 5 4 3 4 4 5 3 4 ...
## $ C8 : int 2 2 3 1 2 1 3 3 1 2 ...
## $ C9 : int 2 5 3 4 3 4 4 5 4 3 ...
## $ C10 : int 4 3 4 3 4 5 4 4 4 4 ...
## $ O1 : int 5 4 4 5 5 5 5 3 4 4 ...
## $ O2 : int 2 2 2 2 3 1 1 3 1 2 ...
## $ O3 : int 4 5 4 3 2 5 5 5 5 4 ...
## $ O4 : int 1 2 2 2 2 2 1 3 1 1 ...
## $ O5 : int 4 4 4 4 3 5 5 3 4 4 ...
## $ O6 : int 2 1 3 1 3 1 1 3 1 2 ...
## $ O7 : int 5 4 5 4 4 5 5 5 5 4 ...
## $ O8 : int 4 3 5 2 1 2 5 3 1 3 ...
## $ O9 : int 5 4 4 5 3 5 5 4 4 4 ...
## $ O10 : chr "4" "5" "4" "4" ...
## $ country : chr "RO" "US" "US" "KE" ...
summary(df)
## education urban gender engnat
## Min. :0.000 Min. :0.000 Min. :0.000 Min. : 0.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.: 1.000
## Median :2.000 Median :2.000 Median :2.000 Median : 1.000
## Mean :2.392 Mean :2.239 Mean :1.664 Mean : 1.371
## 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.: 2.000
## Max. :4.000 Max. :3.000 Max. :3.000 Max. :30.000
## age hand religion orientation
## Min. : 1.00 Min. :0.000 Min. : 0.000 Min. :0.000
## 1st Qu.: 17.00 1st Qu.:1.000 1st Qu.: 2.000 1st Qu.:1.000
## Median : 21.00 Median :1.000 Median : 4.000 Median :1.000
## Mean : 25.44 Mean :1.156 Mean : 5.123 Mean :1.368
## 3rd Qu.: 29.00 3rd Qu.:1.000 3rd Qu.: 7.000 3rd Qu.:1.000
## Max. :350.00 Max. :8.000 Max. :12.000 Max. :5.000
## race voted married familysize
## Min. :0.000 Min. :0.00 Min. :0.000 Min. : 0
## 1st Qu.:3.000 1st Qu.:1.00 1st Qu.:1.000 1st Qu.: 2
## Median :4.000 Median :2.00 Median :1.000 Median : 2
## Mean :3.434 Mean :1.68 Mean :1.255 Mean : 29153
## 3rd Qu.:4.000 3rd Qu.:2.00 3rd Qu.:1.000 3rd Qu.: 3
## Max. :5.000 Max. :2.00 Max. :6.000 Max. :123334444
## E1 E2 E3 E4
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:3.000 1st Qu.:2.000
## Median :3.000 Median :3.000 Median :3.000 Median :3.000
## Mean :2.735 Mean :2.694 Mean :3.372 Mean :3.074
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## E5 E6 E7 E8 E9
## Min. :0.000 Min. :0.00 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:3.000 1st Qu.:1.00 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.000
## Median :4.000 Median :2.00 Median :3.000 Median :3.000 Median :3.000
## Mean :3.374 Mean :2.38 Mean :2.841 Mean :3.377 Mean :3.008
## 3rd Qu.:4.000 3rd Qu.:3.00 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.00 Max. :5.000 Max. :5.000 Max. :5.000
## E10 N1 N2 N3
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:3.000
## Median :4.000 Median :4.000 Median :3.000 Median :4.000
## Mean :3.554 Mean :3.322 Mean :3.204 Mean :3.847
## 3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:5.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## N4 N5 N6 N7 N8
## Min. :0.000 Min. :0.00 Min. :0.000 Min. :0.000 Min. :0.00
## 1st Qu.:2.000 1st Qu.:2.00 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.00
## Median :3.000 Median :3.00 Median :3.000 Median :3.000 Median :3.00
## Mean :2.727 Mean :2.93 Mean :2.964 Mean :3.144 Mean :2.76
## 3rd Qu.:4.000 3rd Qu.:4.00 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.00
## Max. :5.000 Max. :5.00 Max. :5.000 Max. :5.000 Max. :5.00
## N9 N10 A1 A2
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:3.000
## Median :3.000 Median :3.000 Median :2.000 Median :4.000
## Mean :3.177 Mean :2.732 Mean :2.285 Mean :3.947
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:3.000 3rd Qu.:5.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## A3 A4 A5 A6
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:1.000 1st Qu.:4.000 1st Qu.:1.000 1st Qu.:3.000
## Median :2.000 Median :4.000 Median :2.000 Median :4.000
## Mean :2.236 Mean :4.009 Mean :2.185 Mean :3.855
## 3rd Qu.:3.000 3rd Qu.:5.000 3rd Qu.:3.000 3rd Qu.:5.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## A7 A8 A9 A10
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:1.000 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:3.000
## Median :2.000 Median :4.000 Median :4.000 Median :4.000
## Mean :2.129 Mean :3.769 Mean :3.893 Mean :3.704
## 3rd Qu.:3.000 3rd Qu.:5.000 3rd Qu.:5.000 3rd Qu.:5.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## C1 C2 C3 C4
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:3.000 1st Qu.:2.000
## Median :3.000 Median :3.000 Median :4.000 Median :3.000
## Mean :3.339 Mean :2.979 Mean :4.011 Mean :2.659
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:5.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## C5 C6 C7 C8
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:3.000 1st Qu.:2.000
## Median :3.000 Median :3.000 Median :4.000 Median :2.000
## Mean :2.647 Mean :2.874 Mean :3.678 Mean :2.453
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:5.000 3rd Qu.:3.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## C9 C10 O1 O2
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:2.000 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:1.000
## Median :3.000 Median :4.000 Median :4.000 Median :2.000
## Mean :3.293 Mean :3.632 Mean :3.675 Mean :2.158
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:5.000 3rd Qu.:3.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## O3 O4 O5 O6 O7
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.00
## 1st Qu.:3.000 1st Qu.:1.000 1st Qu.:3.000 1st Qu.:1.000 1st Qu.:3.00
## Median :4.000 Median :2.000 Median :4.000 Median :2.000 Median :4.00
## Mean :4.057 Mean :2.097 Mean :3.859 Mean :1.874 Mean :4.01
## 3rd Qu.:5.000 3rd Qu.:3.000 3rd Qu.:5.000 3rd Qu.:2.000 3rd Qu.:5.00
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.00
## O8 O9 O10 country
## Min. :0.000 Min. :0.000 Length:4231 Length:4231
## 1st Qu.:2.000 1st Qu.:4.000 Class :character Class :character
## Median :3.000 Median :4.000 Mode :character Mode :character
## Mean :3.173 Mean :4.104
## 3rd Qu.:4.000 3rd Qu.:5.000
## Max. :5.000 Max. :5.000
As we can see, we also have some demographic data, such as gender, education level, family size, etc. Almost all the variables are numeric, which means we need to convert them into categorical variables to get the format required by the association rules.
First, we select and discretize the demographic features; second, we take a look at the questionnaire itself.
The data set actually provides a very comprehensive set of demographics, such as religion, orientation and even right- or left-handedness. This is certainly valuable data, but for convenience we need to limit it to the most important features.
The number of association rules can become vast, and overloaded data may create unnecessary noise. We therefore choose and process the features that are most likely to be related to the psychological traits.
These are:
* gender
* age category
* urban place of living or not
* family size
* education level
# 1) split age into categories
df <- df[!df$age < 13,] # we omit responses from children younger than 13
df <- df %>% mutate(age_cat = cut(age, breaks = c(0, 18, 25, 35, 45, 55, 65, Inf),
                                  labels = c('Under 18', '18-24', '25-34', '35-44',
                                             '45-54', '55-64', '65 and over'),
                                  include.lowest = TRUE))
df$age_cat <- as.factor(df$age_cat)
summary(df$age_cat)
## Under 18 18-24 25-34 35-44 45-54 55-64
## 1488 1389 635 343 255 101
## 65 and over
## 15
# 2) factor education to appropriate categories
df$education <- factor(df$education, levels = sort(unique(df$education)),
                       labels = c('Not mentioned', 'Less than high school',
                                  'High school', 'University degree',
                                  'Graduate degree'))
# 3) factor type of urbanization
df$urban <- factor(df$urban, levels = sort(unique(df$urban)),
labels = c('Not mentioned',
'Rural (countryside)', 'Suburban',
'Urban (city)'))
# 4) factor gender
df$gender <- factor(df$gender, levels = sort(unique(df$gender)), labels = c('Not mentioned', 'Male', 'Female', 'Other'))
# 5) factor and split family size into levels
df <- df[!(df$familysize == max(df$familysize)),]
df$familysize <- as.factor(ifelse(df$familysize == 0, 'Adopted',
                            ifelse(df$familysize == 1, 'Only child',
                            ifelse(df$familysize <= 3, '2 to 3',
                            ifelse(df$familysize <= 6, '4 to 6', '7 and more')))))
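As a side note, the same binning can be written more compactly with cut(); a minimal sketch, equivalent to the nested ifelse() above (commented out so it does not rerun the step):
# sketch: family-size bins via cut() (0 = Adopted, 1 = Only child, 2-3, 4-6, 7+)
# df$familysize <- cut(df$familysize, breaks = c(-Inf, 0, 1, 3, 6, Inf),
#                      labels = c('Adopted', 'Only child', '2 to 3',
#                                 '4 to 6', '7 and more'))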
To give a glimpse, a few of the questions are presented below.
| Code | Question |
|---|---|
| E1 | I am the life of the party. |
| E2 | I don’t talk a lot. |
| E3 | I feel comfortable around people. |
| … | … |
| O8 | I use difficult words. |
| O9 | I spend time reflecting on things. |
| O10 | I am full of ideas. |
If we include all 50 questions, the data will be too noisy. Instead, we generate compound indices for the 5 dimensions by taking the average score within each of them.
Before doing that, let us inspect the values.
# inspect some response values
summary(df[, 50:55])
## C8 C9 C10 O1
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:3.000 1st Qu.:3.000
## Median :2.000 Median :3.000 Median :4.000 Median :4.000
## Mean :2.452 Mean :3.293 Mean :3.632 Mean :3.676
## 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:5.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## O2 O3
## Min. :0.000 Min. :0.00
## 1st Qu.:1.000 1st Qu.:3.00
## Median :2.000 Median :4.00
## Mean :2.156 Mean :4.06
## 3rd Qu.:3.000 3rd Qu.:5.00
## Max. :5.000 Max. :5.00
As we can see, some responses contain 0 values, which should not be present in the data.
Most probably, 0 stands for no answer, so we replace these values with NA and omit the affected rows.
for (i in 1:nrow(df)){
  for (j in 13:62){
    if (df[i,j] %in% c(0)) {
      df[i,j] <- NA
    }
  }
}
df <- df[complete.cases(df),]
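For reference, the same replacement can be done without explicit loops; a small sketch, assuming the question items sit in columns 13 to 62 as above (commented out, since the loop above has already done the job):
# sketch: vectorised 0 -> NA replacement over the question columns
# df[, 13:62][df[, 13:62] == 0] <- NA
# df <- df[complete.cases(df), ]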
Finally, we can move on to transforming the 50 columns into 5. The meaning of each statement was analysed, and we take a simple average of the values within each category, reversing the scale (6 - x) for negatively keyed statements.
# averaging extraversion
df <- df %>% mutate(extraversion = round((E1 + E3 + E5 + E7 + E9 +
(6-E2) + (6-E4) + (6-E6) + (6-E8) + (6-E10)
)/10, 0))
# averaging neuroticism
df <- df %>% mutate(neuroticism = round((N1 + N3 + N5 + N6 + N7 + N8 + N9 + N10 +
(6-N2) + (6 - N4))/10, 0))
# averaging sympathy towards others
df <- df %>% mutate(sympathy = round((A2 + A4 + A6 + A9 + A10 +
(6-A1) + (6-A3) + (6-A5) + (6-A7) +
(6-A8))/10, 0))
# averaging tendency for discipline and order
df <- df %>% mutate(discipline = round((C1 + C3 + C5 + C7 + C9 + C10 +
(6-C2) + (6-C4) + (6-C6) + (6- C8))/10, 0))
# one of the columns (O10) is still stored as character
df$O10 <- as.integer(df$O10)
# averaging inventiveness and creativity
# one of the statements (O9) was omitted as its effect on the average was not clear
df <- df %>% mutate(inventive = round((O1 + O3 + O5 + O7 + O8 + O10 +
                                       (6-O2) + (6-O4) + (6-O6))/9, 0))
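Since the five mutate() calls repeat the same pattern, the scoring could also be wrapped in a small helper. Below is a minimal sketch; the function name score_dim is mine, and it assumes a 1-to-5 Likert scale so that reverse-keying is 6 - x:
# sketch: dimension score = rounded mean of positive items and reversed negative items
score_dim <- function(data, positive, negative) {
  pos_sum <- rowSums(data[, positive, drop = FALSE])
  neg_sum <- rowSums(6 - data[, negative, drop = FALSE])
  round((pos_sum + neg_sum) / (length(positive) + length(negative)))
}
# e.g. df$extraversion <- score_dim(df, c('E1','E3','E5','E7','E9'),
#                                       c('E2','E4','E6','E8','E10'))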
Desc(df[,65:69])
## ------------------------------------------------------------------------------
## Describe df[, 65:69] (data.frame):
##
## data frame: 3831 obs. of 5 variables
## 3831 complete cases (100.0%)
##
## Nr ColName Class NAs Levels
## 1 extraversion numeric .
## 2 neuroticism numeric .
## 3 sympathy numeric .
## 4 discipline numeric .
## 5 inventive numeric .
##
##
## ------------------------------------------------------------------------------
## 1 - extraversion (numeric)
##
## length n NAs unique 0s mean meanCI'
## 3'831 3'831 0 5 0 3.02 2.99
## 100.0% 0.0% 0.0% 3.05
##
## .05 .10 .25 median .75 .90 .95
## 2.00 2.00 2.00 3.00 4.00 4.00 4.00
##
## range sd vcoef mad IQR skew kurt
## 4.00 0.97 0.32 1.48 2.00 0.02 -0.71
##
##
## value freq perc cumfreq cumperc
## 1 1 155 4.0% 155 4.0%
## 2 2 1'105 28.8% 1'260 32.9%
## 3 3 1'275 33.3% 2'535 66.2%
## 4 4 1'105 28.8% 3'640 95.0%
## 5 5 191 5.0% 3'831 100.0%
##
## ' 95%-CI (classic)
## ------------------------------------------------------------------------------
## 2 - neuroticism (numeric)
##
## length n NAs unique 0s mean meanCI'
## 3'831 3'831 0 5 0 3.10 3.07
## 100.0% 0.0% 0.0% 3.12
##
## .05 .10 .25 median .75 .90 .95
## 2.00 2.00 2.00 3.00 4.00 4.00 4.00
##
## range sd vcoef mad IQR skew kurt
## 4.00 0.93 0.30 1.48 2.00 -0.13 -0.63
##
##
## value freq perc cumfreq cumperc
## 1 1 132 3.4% 132 3.4%
## 2 2 942 24.6% 1'074 28.0%
## 3 3 1'349 35.2% 2'423 63.2%
## 4 4 1'245 32.5% 3'668 95.7%
## 5 5 163 4.3% 3'831 100.0%
##
## ' 95%-CI (classic)
## ------------------------------------------------------------------------------
## 3 - sympathy (numeric)
##
## length n NAs unique 0s mean meanCI'
## 3'831 3'831 0 5 0 3.68 3.66
## 100.0% 0.0% 0.0% 3.70
##
## .05 .10 .25 median .75 .90 .95
## 2.00 3.00 3.00 4.00 4.00 4.00 4.00
##
## range sd vcoef mad IQR skew kurt
## 4.00 0.65 0.18 0.00 1.00 -0.88 0.96
##
##
## value freq perc cumfreq cumperc
## 1 1 7 0.2% 7 0.2%
## 2 2 203 5.3% 210 5.5%
## 3 3 974 25.4% 1'184 30.9%
## 4 4 2'480 64.7% 3'664 95.6%
## 5 5 167 4.4% 3'831 100.0%
##
## ' 95%-CI (classic)
## ------------------------------------------------------------------------------
## 4 - discipline (numeric)
##
## length n NAs unique 0s mean meanCI'
## 3'831 3'831 0 5 0 3.36 3.34
## 100.0% 0.0% 0.0% 3.39
##
## .05 .10 .25 median .75 .90 .95
## 2.00 2.00 3.00 3.00 4.00 4.00 5.00
##
## range sd vcoef mad IQR skew kurt
## 4.00 0.80 0.24 1.48 1.00 -0.17 -0.47
##
##
## value freq perc cumfreq cumperc
## 1 1 10 0.3% 10 0.3%
## 2 2 543 14.2% 553 14.4%
## 3 3 1'518 39.6% 2'071 54.1%
## 4 4 1'565 40.9% 3'636 94.9%
## 5 5 195 5.1% 3'831 100.0%
##
## ' 95%-CI (classic)
## ------------------------------------------------------------------------------
## 5 - inventive (numeric)
##
## length n NAs unique 0s mean meanCI'
## 3'831 3'831 0 5 0 3.86 3.84
## 100.0% 0.0% 0.0% 3.89
##
## .05 .10 .25 median .75 .90 .95
## 3.00 3.00 3.00 4.00 4.00 5.00 5.00
##
## range sd vcoef mad IQR skew kurt
## 4.00 0.73 0.19 0.00 1.00 -0.25 -0.05
##
##
## value freq perc cumfreq cumperc
## 1 1 6 0.2% 6 0.2%
## 2 2 95 2.5% 101 2.6%
## 3 3 1'007 26.3% 1'108 28.9%
## 4 4 2'034 53.1% 3'142 82.0%
## 5 5 689 18.0% 3'831 100.0%
##
## ' 95%-CI (classic)
# factorisation of indexes
# we will factor extraversion, neuroticism and sympathy into 3 levels
hist(df$extraversion)
hist(df$neuroticism)
hist(df$sympathy)
Observing the distributions, it seems better to split extraversion, neuroticism and sympathy into 3 levels, and discipline and inventiveness into 2 levels.
# low, medium and high levels
for (i in 1:nrow(df)){
  for (j in 65:67){
    if (df[i,j] %in% c(1,2)) {
      df[i,j] <- paste('low', colnames(df[j]), sep = "_")
    } else if (df[i,j] == 3) {
      df[i,j] <- paste('medium', colnames(df[j]), sep = "_")
    } else {
      df[i,j] <- paste('high', colnames(df[j]), sep = "_")
    }
  }
}
hist(df$discipline)
hist(df$inventive)
# discipline and inventive into 2 levels: low and high
for (i in 1:nrow(df)){
  for (j in 68:69){
    if (df[i,j] %in% c(1,2,3)) {
      df[i,j] <- paste('low', colnames(df[j]), sep = "_")
    } else {
      df[i,j] <- paste('high', colnames(df[j]), sep = "_")
    }
  }
}
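The same discretisation can be written without explicit loops; a sketch using dplyr (assuming dplyr >= 1.0 for across() and cur_column()), commented out because the loops above have already recoded the columns:
# sketch: loop-free version of the 3-level / 2-level coding above
# df <- df %>%
#   mutate(across(c(extraversion, neuroticism, sympathy),
#                 ~ ifelse(.x <= 2, paste('low', cur_column(), sep = '_'),
#                   ifelse(.x == 3, paste('medium', cur_column(), sep = '_'),
#                                   paste('high', cur_column(), sep = '_')))),
#          across(c(discipline, inventive),
#                 ~ ifelse(.x <= 3, paste('low', cur_column(), sep = '_'),
#                                   paste('high', cur_column(), sep = '_'))))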
Now we limit the data frame to the variables we need, convert them to factors, remove rows with missing values and save the result for the subsequent association rule mining.
data_person <- df[, c(1:3,64:69)]
summary(data_person)
## education urban gender
## Not mentioned : 28 Not mentioned : 40 Not mentioned: 5
## Less than high school: 560 Rural (countryside): 717 Male :1303
## High school :1660 Suburban :1317 Female :2499
## University degree :1050 Urban (city) :1757 Other : 24
## Graduate degree : 533
##
##
## age_cat extraversion neuroticism sympathy
## Under 18 :1351 Length:3831 Length:3831 Length:3831
## 18-24 :1265 Class :character Class :character Class :character
## 25-34 : 583 Mode :character Mode :character Mode :character
## 35-44 : 316
## 45-54 : 223
## 55-64 : 80
## 65 and over: 13
## discipline inventive
## Length:3831 Length:3831
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
# data_person <- cbind(data_person, df[, c('E2', 'E3', 'E8', 'N1',
# 'N3', 'A9', 'C1', 'C6', 'O6')])
trait_cols <- c('extraversion', 'neuroticism', 'sympathy', 'discipline', 'inventive')
data_person[, trait_cols] <- lapply(data_person[, trait_cols], factor)
data_person <- data_person[complete.cases(data_person), ]
write.csv(data_person, file="data_factor.csv", row.names = F)
To dig into the associations present among the psychological characteristics, we will use the Apriori algorithm. In R, the arules and arulesViz packages serve this purpose.
To figure out which rules are relevant, we focus on 3 measures.
First, we check the single-item frequency of the characteristics. Support helps to determine the most popular items across transactions. For example, we usually want to find a popular consequent X and check what its antecedents could be. With support we can also check how often the itemset \(\{X, Y\}\) occurs in the data.
\[Support(X) = \displaystyle \frac{frequency(X)}{n}\]
Say we have a hypothesis that Y is somehow related to X. Confidence tells us what proportion of the transactions containing X also contain Y. A confidence equal to 1 would mean that all people who chose X also chose Y. In other words, confidence can be called the conditional probability of Y given X.
\[Confidence(X \to\ Y) = \displaystyle \frac{P(X \cap Y)}{P(X)} = P(Y \mid X) \]
However, high confidence does not necessarily imply that an association exists. It might be that X and Y are simply very popular, yet essentially independent, items. This is where lift helps us.
Probability theory tells us that for two independent events, \(P(X \cap Y)\) equals the product of the probabilities, \(P(X) \cdot P(Y)\). Lift compares the observed frequency of X and Y occurring together in the data with the probability of meeting them together if they were independent.
Thus, we may suspect an association between X and Y if the numerator is greater than the denominator, i.e. lift > 1. In that case we meet X and Y together more often than we would if they were independent.
\[Lift(X \to\ Y) = \displaystyle \frac{P(X \cap Y)}{P(X) * P(Y)}\]
To sum up, an association between X and Y is suspected if:
* the support of the itemset is high enough for the rule to rest on a sufficient number of cases,
* the confidence of the rule X => Y is high,
* the lift is greater than 1.
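To make the three measures concrete, here is a tiny worked example with made-up counts (the numbers are hypothetical and not taken from this data set):
# hypothetical example: 1000 respondents, 400 report X, 300 report Y, 180 report both
n <- 1000; n_x <- 400; n_y <- 300; n_xy <- 180
support_xy    <- n_xy / n                              # 0.18
confidence_xy <- n_xy / n_x                            # 0.45 = P(Y | X)
lift_xy       <- support_xy / ((n_x / n) * (n_y / n))  # 1.5 > 1, association suspected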
First of all, we need to load the data in a format appropriate for Apriori.
basket <-read.transactions("data_factor.csv", format="basket", sep=",", skip=1) # reading the file as transactions
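As a side note, writing the factors to CSV and reading them back is not strictly necessary: arules can usually coerce a data frame of factors directly into transactions, although the item labels then take the form variable=level instead of the bare level names used below. A sketch:
# sketch: build the transactions object in memory instead of going through a CSV file
basket_alt <- as(data_person, "transactions")
itemLabels(basket_alt)[1:5]   # items are labelled like "gender=Female"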
Let us check what the data looks like now. As we can see, we now have a long list of bundles of responses given by the participants.
inspect(basket[1:5])
## items
## [1] {25-34,
## Female,
## Graduate degree,
## high_inventive,
## high_sympathy,
## low_discipline,
## medium_extraversion,
## medium_neuroticism,
## Urban (city)}
## [2] {18-24,
## Female,
## High school,
## high_discipline,
## high_extraversion,
## high_inventive,
## high_neuroticism,
## high_sympathy,
## Urban (city)}
## [3] {Female,
## high_inventive,
## high_neuroticism,
## Less than high school,
## low_discipline,
## low_extraversion,
## medium_sympathy,
## Suburban,
## Under 18}
## [4] {25-34,
## Female,
## high_discipline,
## high_inventive,
## high_sympathy,
## low_extraversion,
## low_neuroticism,
## Suburban,
## University degree}
## [5] {35-44,
## Graduate degree,
## high_discipline,
## high_neuroticism,
## high_sympathy,
## low_extraversion,
## low_inventive,
## Male,
## Suburban}
# another way with LIST function
LIST(head(basket))
If we check, the length of the basket will be equal to
the number of rows of data_person.
length(basket)
## [1] 3831
So, we now have 3831 entries of respondents’ characteristics encoded as a list of transactions (response combinations).
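For a broader overview of the transactions object (number of items, density, the most frequent items and transaction sizes), summary() can also be used:
# overall description of the transactions object
summary(basket)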
Next, we may check the most frequent responses in our
basket.
sort(round(itemFrequency(basket),3), decreasing = TRUE)
## high_inventive high_sympathy Female
## 0.711 0.691 0.652
## low_discipline high_discipline Urban (city)
## 0.541 0.459 0.459
## High school high_neuroticism Under 18
## 0.433 0.368 0.353
## medium_neuroticism Suburban Male
## 0.352 0.344 0.340
## high_extraversion medium_extraversion 18-24
## 0.338 0.333 0.330
## low_extraversion low_inventive low_neuroticism
## 0.329 0.289 0.280
## University degree medium_sympathy Rural (countryside)
## 0.274 0.254 0.187
## 25-34 Less than high school Graduate degree
## 0.152 0.146 0.139
## 35-44 45-54 low_sympathy
## 0.082 0.058 0.055
## 55-64 Not mentioned Other
## 0.021 0.017 0.006
## 65 and over
## 0.003
itemFrequencyPlot(basket, topN=25, type="relative", main="Top 25 popular responses", col="#ACE6CF")
So, the 3 most frequent items in the questionnaire were high_inventive, high_sympathy and Female.
A useful way to look at two-item frequencies is the cross table of characteristics. We can compute it both for the support and for the lift measure.
sup_tab<-crossTable(basket, measure="support", sort=TRUE)
round(sup_tab[1:10, 1:10], 3)
## high_inventive high_sympathy Female low_discipline
## high_inventive 0.711 0.510 0.437 0.370
## high_sympathy 0.510 0.691 0.490 0.353
## Female 0.437 0.490 0.652 0.352
## low_discipline 0.370 0.353 0.352 0.541
## high_discipline 0.341 0.338 0.301 0.000
## Urban (city) 0.315 0.313 0.307 0.245
## High school 0.288 0.287 0.285 0.249
## high_neuroticism 0.244 0.241 0.275 0.243
## Under 18 0.237 0.226 0.250 0.229
## medium_neuroticism 0.242 0.244 0.233 0.182
## high_discipline Urban (city) High school high_neuroticism
## high_inventive 0.341 0.315 0.288 0.244
## high_sympathy 0.338 0.313 0.287 0.241
## Female 0.301 0.307 0.285 0.275
## low_discipline 0.000 0.245 0.249 0.243
## high_discipline 0.459 0.214 0.185 0.125
## Urban (city) 0.214 0.459 0.200 0.169
## High school 0.185 0.200 0.433 0.170
## high_neuroticism 0.125 0.169 0.170 0.368
## Under 18 0.124 0.158 0.201 0.153
## medium_neuroticism 0.170 0.159 0.155 0.000
## Under 18 medium_neuroticism
## high_inventive 0.237 0.242
## high_sympathy 0.226 0.244
## Female 0.250 0.233
## low_discipline 0.229 0.182
## high_discipline 0.124 0.170
## Urban (city) 0.158 0.159
## High school 0.201 0.155
## high_neuroticism 0.153 0.000
## Under 18 0.353 0.124
## medium_neuroticism 0.124 0.352
So, high_inventive and high_sympathy occur together in 51% of the responses, and Female and high_sympathy in 49% of the responses. Having previewed these pairwise frequencies, we already have in mind the pairs we want to test for association.
lift_tab <- crossTable(basket, measure="lift", sort=TRUE)
round(lift_tab[1:10, 1:10], 3)
## high_inventive high_sympathy Female low_discipline
## high_inventive NA 1.039 0.944 0.962
## high_sympathy 1.039 NA 1.086 0.946
## Female 0.944 1.086 NA 0.997
## low_discipline 0.962 0.946 0.997 NA
## high_discipline 1.045 1.063 1.003 0.000
## Urban (city) 0.967 0.989 1.027 0.988
## High school 0.936 0.958 1.008 1.062
## high_neuroticism 0.932 0.951 1.148 1.222
## Under 18 0.947 0.928 1.085 1.199
## medium_neuroticism 0.967 1.002 1.013 0.957
## high_discipline Urban (city) High school high_neuroticism
## high_inventive 1.045 0.967 0.936 0.932
## high_sympathy 1.063 0.989 0.958 0.951
## Female 1.003 1.027 1.008 1.148
## low_discipline 0.000 0.988 1.062 1.222
## high_discipline NA 1.015 0.927 0.739
## Urban (city) 1.015 NA 1.009 1.003
## High school 0.927 1.009 NA 1.070
## high_neuroticism 0.739 1.003 1.070 NA
## Under 18 0.765 0.978 1.317 1.182
## medium_neuroticism 1.050 0.988 1.014 0.000
## Under 18 medium_neuroticism
## high_inventive 0.947 0.967
## high_sympathy 0.928 1.002
## Female 1.085 1.013
## low_discipline 1.199 0.957
## high_discipline 0.765 1.050
## Urban (city) 0.978 0.988
## High school 1.317 1.014
## high_neuroticism 1.182 0.000
## Under 18 NA 1.001
## medium_neuroticism 1.001 NA
Without going further, we can also immediately check the pairwise lifts. For example, for the pair Female and high_inventive (for which the support was high), the lift is 0.944. This tells us that the pair occurs together simply because both items are popular on their own, largely independently; an association between them is rather not expected.
After checking the frequencies, it is high time to search for rules between itemsets. First, we may mine the whole basket, without fixing any antecedents or consequents.
For apriori, we need to set constraints on minimum support and confidence; we will use 0.1 and 0.8, respectively. The higher the confidence threshold, the fewer rules we expect to find.
# creating the rules - standard settings
rules_general <- apriori(basket, parameter=list(supp=0.1, conf=0.8))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 383
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[31 item(s), 3831 transaction(s)] done [0.00s].
## sorting and recoding items ... [24 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [38 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules_byconf <- sort(rules_general, by="confidence", decreasing=TRUE)
inspect(head(rules_byconf))
## lhs rhs support confidence coverage lift count
## [1] {Less than high school} => {Under 18} 0.1365179 0.9339286 0.1461759 2.648320 523
## [2] {Female,
## high_discipline,
## high_extraversion} => {high_sympathy} 0.1020621 0.8988506 0.1135474 1.300905 391
## [3] {high_extraversion,
## Male} => {high_inventive} 0.1054555 0.8820961 0.1195510 1.241025 404
## [4] {high_extraversion,
## low_neuroticism} => {high_inventive} 0.1177238 0.8757282 0.1344297 1.232066 451
## [5] {Female,
## high_extraversion,
## high_inventive} => {high_sympathy} 0.1466980 0.8619632 0.1701906 1.247518 562
## [6] {Female,
## high_extraversion} => {high_sympathy} 0.1866353 0.8583433 0.2174367 1.242279 715
As we can see, the strongest rule of all says that respondents whose education is less than high school are under 18 years old. But that is hardly big news, right?
What we are really looking for is whether some traits have antecedents. Rule [2] is rather more interesting: almost 90% of the respondents who are female, highly disciplined and highly extraverted also reported high sympathy.
rules_bylift<-sort(rules_general, by="lift", decreasing=TRUE)
inspect(head(rules_bylift))
## lhs rhs support confidence coverage lift count
## [1] {Less than high school} => {Under 18} 0.1365179 0.9339286 0.1461759 2.648320 523
## [2] {Female,
## high_discipline,
## high_extraversion} => {high_sympathy} 0.1020621 0.8988506 0.1135474 1.300905 391
## [3] {high_neuroticism,
## low_inventive} => {Female} 0.1015401 0.8189474 0.1239885 1.255457 389
## [4] {Female,
## high_extraversion,
## high_inventive} => {high_sympathy} 0.1466980 0.8619632 0.1701906 1.247518 562
## [5] {Female,
## high_extraversion} => {high_sympathy} 0.1866353 0.8583433 0.2174367 1.242279 715
## [6] {high_extraversion,
## Male} => {high_inventive} 0.1054555 0.8820961 0.1195510 1.241025 404
Another way to grasp the associations is to look at graphs. arulesViz provides a rich set of visualisations for association rules.
plot(rules_general, method="paracoord", control=list(reorder=TRUE))
However, it is hard to see any strong relations here. What we can do next is to check specific right-hand sides (consequents) and left-hand sides (antecedents).
To check that, we set the rules’ right-hand side (consequent) to high_neuroticism. Taking into account the large number of distinct characteristics in our basket, we now set the support and confidence thresholds to 0.1 and 0.4, respectively.
rules_neuroticism <-apriori(data= basket, parameter=list(supp=0.1, conf=0.4),
appearance=list(default="lhs", rhs="high_neuroticism"),
control=list(verbose=F))
inspect(sort(rules_neuroticism, by="confidence", decreasing=TRUE)[1:5])
## lhs rhs support confidence coverage lift count
## [1] {low_discipline,
## low_extraversion} => {high_neuroticism} 0.1067606 0.5446072 0.1960324 1.481811 409
## [2] {Female,
## low_extraversion} => {high_neuroticism} 0.1166797 0.5405079 0.2158705 1.470657 447
## [3] {Female,
## low_discipline} => {high_neuroticism} 0.1772383 0.5040831 0.3516053 1.371550 679
## [4] {Female,
## Under 18} => {high_neuroticism} 0.1239885 0.4968619 0.2495432 1.351902 475
## [5] {Female,
## high_inventive,
## low_discipline} => {high_neuroticism} 0.1111981 0.4913495 0.2263117 1.336903 426
inspect(sort(rules_neuroticism, by="lift", decreasing=TRUE)[1:5])
## lhs rhs support confidence coverage lift count
## [1] {low_discipline,
## low_extraversion} => {high_neuroticism} 0.1067606 0.5446072 0.1960324 1.481811 409
## [2] {Female,
## low_extraversion} => {high_neuroticism} 0.1166797 0.5405079 0.2158705 1.470657 447
## [3] {Female,
## low_discipline} => {high_neuroticism} 0.1772383 0.5040831 0.3516053 1.371550 679
## [4] {Female,
## Under 18} => {high_neuroticism} 0.1239885 0.4968619 0.2495432 1.351902 475
## [5] {Female,
## high_inventive,
## low_discipline} => {high_neuroticism} 0.1111981 0.4913495 0.2263117 1.336903 426
Interpretation: how do we interpret the measures of support, confidence and lift?
Support: item combination
{low_discipline, low_extraversion, high_neuroticism} is met
in 10.68% of responses.
Confidence: 54% of the people who responded {low_discipline, low_extraversion} also responded {high_neuroticism}.
Lift: the value of 1.48 tells us that, given the observed frequencies, this itemset occurs roughly 1.5 times more often than it would if the items were independent of each other.
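If needed, arules can compute further interest measures for the same rules, for instance leverage and conviction, via interestMeasure(); a quick sketch (the object name extra is mine):
# sketch: additional quality measures for the neuroticism rules
extra <- interestMeasure(rules_neuroticism,
                         measure = c("leverage", "conviction"),
                         transactions = basket)
round(head(extra), 3)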
plot(rules_neuroticism, method="graph")
plot(rules_neuroticism, method="paracoord", control=list(reorder=FALSE))
plot(rules_neuroticism, method="grouped")
So, as we can see, people with low discipline and low extraversion are more prone to stress and mood swings. These characteristics are also more common among women.
If we look at the parallel coordinates graph, we can see how all these characteristics are connected.
Another option for using the Apriori rules is to fix not the consequent but the antecedent (left-hand side). In reverse, we might ask, for example, which characteristics female respondents tend to report.
Now we set Female as the antecedent (left-hand side) to be studied with the rules.
rules_female <-apriori(data= basket, parameter=list(supp=0.1, conf=0.4),
appearance=list(default="rhs", lhs="Female"),
control=list(verbose=F))
inspect(sort(rules_female, by="confidence", decreasing=TRUE)[1:5])
## lhs rhs support confidence coverage lift count
## [1] {Female} => {high_sympathy} 0.4896894 0.7507003 0.6523101 1.086488 1876
## [2] {} => {high_inventive} 0.7107805 0.7107805 1.0000000 1.000000 2723
## [3] {} => {high_sympathy} 0.6909423 0.6909423 1.0000000 1.000000 2647
## [4] {Female} => {high_inventive} 0.4374837 0.6706683 0.6523101 0.943566 1676
## [5] {} => {low_discipline} 0.5405899 0.5405899 1.0000000 1.000000 2071
inspect(sort(rules_female, by="lift", decreasing=TRUE)[1:5])
## lhs rhs support confidence coverage lift
## [1] {Female} => {high_neuroticism} 0.2751240 0.4217687 0.6523101 1.147582
## [2] {Female} => {high_sympathy} 0.4896894 0.7507003 0.6523101 1.086488
## [3] {Female} => {Urban (city)} 0.3072305 0.4709884 0.6523101 1.026953
## [4] {Female} => {High school} 0.2850431 0.4369748 0.6523101 1.008464
## [5] {Female} => {high_discipline} 0.3007048 0.4609844 0.6523101 1.003427
## count
## [1] 1054
## [2] 1876
## [3] 1177
## [4] 1092
## [5] 1152
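Note that some of the rules above have an empty antecedent {}: with the default minlen = 1, apriori also returns one-item “rules” that merely restate item frequencies. If these are unwanted, the minimum rule length can be raised; a sketch (the object name rules_female2 is mine):
# sketch: require at least one item on the left-hand side
rules_female2 <- apriori(data = basket,
                         parameter = list(supp = 0.1, conf = 0.4, minlen = 2),
                         appearance = list(default = "rhs", lhs = "Female"),
                         control = list(verbose = FALSE))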
Here we see the association between females and higher neuroticism from the other side. As for the {Female} => {high_sympathy} rule, the confidence is quite high and might suggest a connection, but I would rather be skeptical about this association: the lift is not much higher than 1, and we also saw that respondents in general tend to say that they sympathize with others.
plot(rules_female, method="grouped")
plot(rules_female, method="graph")
plot(rules_female, method="paracoord", control=list(reorder=TRUE))
rules_discipline <-apriori(data= basket, parameter=list(supp=0.1, conf=0.08),
appearance=list(default="lhs", rhs="high_discipline"),
control=list(verbose=F))
inspect(sort(rules_discipline, by="confidence", decreasing=TRUE)[1:5])
## lhs rhs support confidence coverage lift count
## [1] {high_inventive,
## high_sympathy,
## low_neuroticism} => {high_discipline} 0.1072827 0.6391913 0.1678413 1.391331 411
## [2] {high_sympathy,
## low_neuroticism} => {high_discipline} 0.1279039 0.6218274 0.2056904 1.353535 490
## [3] {high_inventive,
## low_neuroticism} => {high_discipline} 0.1362569 0.6048667 0.2252676 1.316616 522
## [4] {low_neuroticism} => {high_discipline} 0.1647090 0.5875233 0.2803446 1.278865 631
## [5] {Female,
## high_extraversion,
## high_sympathy} => {high_discipline} 0.1020621 0.5468531 0.1866353 1.190338 391
inspect(sort(rules_discipline, by="lift", decreasing=TRUE)[1:5])
## lhs rhs support confidence coverage lift count
## [1] {high_inventive,
## high_sympathy,
## low_neuroticism} => {high_discipline} 0.1072827 0.6391913 0.1678413 1.391331 411
## [2] {high_sympathy,
## low_neuroticism} => {high_discipline} 0.1279039 0.6218274 0.2056904 1.353535 490
## [3] {high_inventive,
## low_neuroticism} => {high_discipline} 0.1362569 0.6048667 0.2252676 1.316616 522
## [4] {low_neuroticism} => {high_discipline} 0.1647090 0.5875233 0.2803446 1.278865 631
## [5] {Female,
## high_extraversion,
## high_sympathy} => {high_discipline} 0.1020621 0.5468531 0.1866353 1.190338 391
Here we see that people with low_neuroticism also tend to be highly disciplined and to prefer order.
plot(rules_discipline, method="paracoord", control=list(reorder=TRUE))
plot(rules_discipline, method="graph")
***
We can also check rare characteristics, such as low sympathy. We know that only about 5% of respondents reported it.
itemFrequency(basket)['low_sympathy']
## low_sympathy
## 0.05481597
So, to find any associations, we lower the threshold for support to 1%.
rules_sympathy <-apriori(data= basket, parameter=list(supp=0.01, conf=0.1),
appearance=list(default="lhs", rhs="low_sympathy"),
control=list(verbose=F))
inspect(sort(rules_sympathy, by="confidence", decreasing=TRUE)[1:5])
## lhs rhs support confidence coverage lift count
## [1] {high_inventive,
## low_extraversion,
## Male} => {low_sympathy} 0.01200731 0.1523179 0.07883059 2.778713 46
## [2] {low_extraversion,
## Male} => {low_sympathy} 0.01540068 0.1404762 0.10963195 2.562687 59
## [3] {high_inventive,
## low_extraversion,
## Under 18} => {low_sympathy} 0.01122422 0.1295181 0.08666145 2.362780 43
## [4] {High school,
## high_inventive,
## low_extraversion} => {low_sympathy} 0.01226834 0.1256684 0.09762464 2.292552 47
## [5] {high_neuroticism,
## Male} => {low_sympathy} 0.01096319 0.1246291 0.08796659 2.273591 42
inspect(sort(rules_sympathy, by="lift", decreasing=TRUE)[1:5])
## lhs rhs support confidence coverage lift count
## [1] {high_inventive,
## low_extraversion,
## Male} => {low_sympathy} 0.01200731 0.1523179 0.07883059 2.778713 46
## [2] {low_extraversion,
## Male} => {low_sympathy} 0.01540068 0.1404762 0.10963195 2.562687 59
## [3] {high_inventive,
## low_extraversion,
## Under 18} => {low_sympathy} 0.01122422 0.1295181 0.08666145 2.362780 43
## [4] {High school,
## high_inventive,
## low_extraversion} => {low_sympathy} 0.01226834 0.1256684 0.09762464 2.292552 47
## [5] {high_neuroticism,
## Male} => {low_sympathy} 0.01096319 0.1246291 0.08796659 2.273591 42
Lowering the thresholds, we do find a group: the higher lift tells us that male respondents with low extraversion are antecedents of low sympathy.
plot(rules_sympathy, method="paracoord", control=list(reorder=TRUE))
To conclude, association rule mining is a powerful tool when we want to uncover relationships hidden in data. In our case, we showed how useful it can be when applied to survey data. When we look for relationships between features, we often rely on correlation analysis alone.
Association rules are a valuable addition: we obtain interpretable statements (such as the percentage of people who responded X) and gain a better understanding of the respondent profile groups.