Preface


Association rules is an unsupervised learning algorithm which is sharpened for the purpose of finding patterns in such data as transaction market data or customer preference data to create recommendations systems.

Despite the fact that those are the most popular applications, in this short study we aim to show how the association rules algorithms may help to analyse survey data, and namely, psychological and behavioral questionnaires.

This implementation may help to better understand the causalities between the respondents’ characteristics and extend the simple correlation analysis, for example.

Step-by-step, the work will look as follows:

  1. Loading and inspecting data
  2. Feature preprocessing (discretisation)
  3. Frequency inspection
  4. Inspecting itemsets and rules.

01. Data

The data set we use was retrieved from kaggle.com and available under the following link.

It contains above 4000 responses for the Duckworth Grit Scale. So, the data covers the psychological survey with 50 questions, all covering 5 core dimensions:

  1. Extraversion, openness to people, communicativeness
  2. Persistence to stress, emotional stability
  3. Sympathy to others, softness
  4. Self-discipline, tendency for order
  5. Intellect, imagination.

Please, check kaggle.com for the full list of questions and detailed data set description.

The responses were provided on a Likert scale from 1 to 5 (1 = Disagree; 5 = Agree). Let us load the data first and look at how it looks like.

# loading the data
df <- read.csv('dataset.csv', sep = '')

#limiting the data to questions and demographics
df <- df[complete.cases(df),c(31:92, 1)]
head(df)
##   education urban gender engnat age hand religion orientation race voted
## 1         4     3      2      2  28    1        1           1    4     1
## 2         2     3      2      1  19    1        6           1    4     2
## 3         1     2      2      2  16    1        0           1    1     2
## 4         3     2      2      1  30    1        6           1    3     1
## 5         4     2      1      2  38    1        2           1    1     2
## 6         3     3      2      2  23    1       12           1    1     2
##   married familysize E1 E2 E3 E4 E5 E6 E7 E8 E9 E10 N1 N2 N3 N4 N5 N6 N7 N8 N9
## 1       1          2  2  2  3  4  5  1  2  4  4   4  2  3  4  2  3  2  3  3  2
## 2       1          3  4  2  5  3  5  2  4  2  5   4  5  2  5  3  3  4  5  5  5
## 3       1          3  1  5  2  4  3  2  1  5  2   5  4  3  4  3  4  4  4  3  4
## 4       1          6  3  5  2  4  3  4  3  5  1   5  1  5  3  3  2  3  1  1  2
## 5       2          3  1  5  1  5  2  5  1  5  1   4  4  1  5  3  5  4  1  1  4
## 6       0          1  3  1  3  1  4  1  4  2  5   2  1  3  1  5  2  1  1  2  1
##   N10 A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 O1 O2 O3 O4
## 1   4  1  5  2  5  1  4  2  4  4   3  2  4  4  3  2  4  3  2  2   4  5  2  4  1
## 2   3  5  4  1  4  2  5  1  4  3   4  4  3  4  3  1  3  5  2  5   3  4  2  5  2
## 3   2  3  3  1  3  3  4  4  4  3   3  2  2  4  2  3  4  5  3  3   4  4  2  4  2
## 4   1  1  4  1  5  4  5  2  5  5   4  4  1  5  1  4  1  4  1  4   3  5  2  3  2
## 5   3  2  3  2  3  2  3  2  2  2   4  3  1  3  1  4  2  3  2  3   4  5  3  2  2
## 6   1  2  4  1  4  2  4  2  4  4   5  5  3  4  1  4  3  4  1  4   5  5  1  5  2
##   O5 O6 O7 O8 O9 O10 country
## 1  4  2  5  4  5   4      RO
## 2  4  1  4  3  4   5      US
## 3  4  3  5  5  4   4      US
## 4  4  1  4  2  5   4      KE
## 5  3  3  4  1  3   2      JP
## 6  5  1  5  2  5   5      AU
str(df)
## 'data.frame':    4231 obs. of  63 variables:
##  $ education  : int  4 2 1 3 4 3 3 2 3 1 ...
##  $ urban      : int  3 3 2 2 2 3 2 1 3 2 ...
##  $ gender     : int  2 2 2 2 1 2 1 1 1 2 ...
##  $ engnat     : int  2 1 2 1 2 2 1 2 2 1 ...
##  $ age        : int  28 19 16 30 38 23 35 22 50 16 ...
##  $ hand       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ religion   : int  1 6 0 6 2 12 3 7 12 1 ...
##  $ orientation: int  1 1 1 1 1 1 1 3 1 1 ...
##  $ race       : int  4 4 1 3 1 1 4 5 4 4 ...
##  $ voted      : int  1 2 2 1 2 2 2 2 2 2 ...
##  $ married    : int  1 1 1 1 2 0 3 1 3 1 ...
##  $ familysize : int  2 3 3 6 3 1 1 2 3 2 ...
##  $ E1         : int  2 4 1 3 1 3 4 3 3 4 ...
##  $ E2         : int  2 2 5 5 5 1 1 3 1 1 ...
##  $ E3         : int  3 5 2 2 1 3 3 4 5 5 ...
##  $ E4         : int  4 3 4 4 5 1 2 3 1 2 ...
##  $ E5         : int  5 5 3 3 2 4 3 2 5 4 ...
##  $ E6         : int  1 2 2 4 5 1 1 2 1 1 ...
##  $ E7         : int  2 4 1 3 1 4 4 2 4 5 ...
##  $ E8         : int  4 2 5 5 5 2 2 3 2 4 ...
##  $ E9         : int  4 5 2 1 1 5 5 4 4 3 ...
##  $ E10        : int  4 4 5 5 4 2 3 4 2 2 ...
##  $ N1         : int  2 5 4 1 4 1 5 5 3 4 ...
##  $ N2         : int  3 2 3 5 1 3 2 2 4 3 ...
##  $ N3         : int  4 5 4 3 5 1 4 5 3 4 ...
##  $ N4         : int  2 3 3 3 3 5 2 2 5 2 ...
##  $ N5         : int  3 3 4 2 5 2 3 4 1 3 ...
##  $ N6         : int  2 4 4 3 4 1 4 5 1 3 ...
##  $ N7         : int  3 5 4 1 1 1 5 4 1 3 ...
##  $ N8         : int  3 5 3 1 1 2 5 5 1 2 ...
##  $ N9         : int  2 5 4 2 4 1 4 5 1 3 ...
##  $ N10        : int  4 3 2 1 3 1 2 3 1 3 ...
##  $ A1         : int  1 5 3 1 2 2 2 5 1 1 ...
##  $ A2         : int  5 4 3 4 3 4 4 5 4 5 ...
##  $ A3         : int  2 1 1 1 2 1 4 2 1 1 ...
##  $ A4         : int  5 4 3 5 3 4 5 4 5 5 ...
##  $ A5         : int  1 2 3 4 2 2 3 2 1 1 ...
##  $ A6         : int  4 5 4 5 3 4 3 5 3 5 ...
##  $ A7         : int  2 1 4 2 2 2 2 2 1 1 ...
##  $ A8         : int  4 4 4 5 2 4 4 4 5 4 ...
##  $ A9         : int  4 3 3 5 2 4 5 3 5 4 ...
##  $ A10        : int  3 4 3 4 4 5 2 3 5 4 ...
##  $ C1         : int  2 4 2 4 3 5 2 4 4 4 ...
##  $ C2         : int  4 3 2 1 1 3 4 1 3 3 ...
##  $ C3         : int  4 4 4 5 3 4 5 5 4 4 ...
##  $ C4         : int  3 3 2 1 1 1 3 3 1 3 ...
##  $ C5         : int  2 1 3 4 4 4 2 3 3 2 ...
##  $ C6         : int  4 3 4 1 2 3 1 1 1 3 ...
##  $ C7         : int  3 5 5 4 3 4 4 5 3 4 ...
##  $ C8         : int  2 2 3 1 2 1 3 3 1 2 ...
##  $ C9         : int  2 5 3 4 3 4 4 5 4 3 ...
##  $ C10        : int  4 3 4 3 4 5 4 4 4 4 ...
##  $ O1         : int  5 4 4 5 5 5 5 3 4 4 ...
##  $ O2         : int  2 2 2 2 3 1 1 3 1 2 ...
##  $ O3         : int  4 5 4 3 2 5 5 5 5 4 ...
##  $ O4         : int  1 2 2 2 2 2 1 3 1 1 ...
##  $ O5         : int  4 4 4 4 3 5 5 3 4 4 ...
##  $ O6         : int  2 1 3 1 3 1 1 3 1 2 ...
##  $ O7         : int  5 4 5 4 4 5 5 5 5 4 ...
##  $ O8         : int  4 3 5 2 1 2 5 3 1 3 ...
##  $ O9         : int  5 4 4 5 3 5 5 4 4 4 ...
##  $ O10        : chr  "4" "5" "4" "4" ...
##  $ country    : chr  "RO" "US" "US" "KE" ...
summary(df)
##    education         urban           gender          engnat      
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   : 0.000  
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.:1.000   1st Qu.: 1.000  
##  Median :2.000   Median :2.000   Median :2.000   Median : 1.000  
##  Mean   :2.392   Mean   :2.239   Mean   :1.664   Mean   : 1.371  
##  3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:2.000   3rd Qu.: 2.000  
##  Max.   :4.000   Max.   :3.000   Max.   :3.000   Max.   :30.000  
##       age              hand          religion       orientation   
##  Min.   :  1.00   Min.   :0.000   Min.   : 0.000   Min.   :0.000  
##  1st Qu.: 17.00   1st Qu.:1.000   1st Qu.: 2.000   1st Qu.:1.000  
##  Median : 21.00   Median :1.000   Median : 4.000   Median :1.000  
##  Mean   : 25.44   Mean   :1.156   Mean   : 5.123   Mean   :1.368  
##  3rd Qu.: 29.00   3rd Qu.:1.000   3rd Qu.: 7.000   3rd Qu.:1.000  
##  Max.   :350.00   Max.   :8.000   Max.   :12.000   Max.   :5.000  
##       race           voted         married        familysize       
##  Min.   :0.000   Min.   :0.00   Min.   :0.000   Min.   :        0  
##  1st Qu.:3.000   1st Qu.:1.00   1st Qu.:1.000   1st Qu.:        2  
##  Median :4.000   Median :2.00   Median :1.000   Median :        2  
##  Mean   :3.434   Mean   :1.68   Mean   :1.255   Mean   :    29153  
##  3rd Qu.:4.000   3rd Qu.:2.00   3rd Qu.:1.000   3rd Qu.:        3  
##  Max.   :5.000   Max.   :2.00   Max.   :6.000   Max.   :123334444  
##        E1              E2              E3              E4       
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :3.000   Median :3.000   Median :3.000   Median :3.000  
##  Mean   :2.735   Mean   :2.694   Mean   :3.372   Mean   :3.074  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##        E5              E6             E7              E8              E9       
##  Min.   :0.000   Min.   :0.00   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:3.000   1st Qu.:1.00   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.000  
##  Median :4.000   Median :2.00   Median :3.000   Median :3.000   Median :3.000  
##  Mean   :3.374   Mean   :2.38   Mean   :2.841   Mean   :3.377   Mean   :3.008  
##  3rd Qu.:4.000   3rd Qu.:3.00   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :5.000   Max.   :5.00   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##       E10              N1              N2              N3       
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:3.000   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:3.000  
##  Median :4.000   Median :4.000   Median :3.000   Median :4.000  
##  Mean   :3.554   Mean   :3.322   Mean   :3.204   Mean   :3.847  
##  3rd Qu.:5.000   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:5.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##        N4              N5             N6              N7              N8      
##  Min.   :0.000   Min.   :0.00   Min.   :0.000   Min.   :0.000   Min.   :0.00  
##  1st Qu.:2.000   1st Qu.:2.00   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.00  
##  Median :3.000   Median :3.00   Median :3.000   Median :3.000   Median :3.00  
##  Mean   :2.727   Mean   :2.93   Mean   :2.964   Mean   :3.144   Mean   :2.76  
##  3rd Qu.:4.000   3rd Qu.:4.00   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.00  
##  Max.   :5.000   Max.   :5.00   Max.   :5.000   Max.   :5.000   Max.   :5.00  
##        N9             N10              A1              A2       
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:3.000  
##  Median :3.000   Median :3.000   Median :2.000   Median :4.000  
##  Mean   :3.177   Mean   :2.732   Mean   :2.285   Mean   :3.947  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:3.000   3rd Qu.:5.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##        A3              A4              A5              A6       
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:1.000   1st Qu.:4.000   1st Qu.:1.000   1st Qu.:3.000  
##  Median :2.000   Median :4.000   Median :2.000   Median :4.000  
##  Mean   :2.236   Mean   :4.009   Mean   :2.185   Mean   :3.855  
##  3rd Qu.:3.000   3rd Qu.:5.000   3rd Qu.:3.000   3rd Qu.:5.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##        A7              A8              A9             A10       
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:1.000   1st Qu.:3.000   1st Qu.:3.000   1st Qu.:3.000  
##  Median :2.000   Median :4.000   Median :4.000   Median :4.000  
##  Mean   :2.129   Mean   :3.769   Mean   :3.893   Mean   :3.704  
##  3rd Qu.:3.000   3rd Qu.:5.000   3rd Qu.:5.000   3rd Qu.:5.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##        C1              C2              C3              C4       
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:3.000   1st Qu.:2.000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :3.000   Median :3.000   Median :4.000   Median :3.000  
##  Mean   :3.339   Mean   :2.979   Mean   :4.011   Mean   :2.659  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:5.000   3rd Qu.:4.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##        C5              C6              C7              C8       
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :3.000   Median :3.000   Median :4.000   Median :2.000  
##  Mean   :2.647   Mean   :2.874   Mean   :3.678   Mean   :2.453  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:5.000   3rd Qu.:3.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##        C9             C10              O1              O2       
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:2.000   1st Qu.:3.000   1st Qu.:3.000   1st Qu.:1.000  
##  Median :3.000   Median :4.000   Median :4.000   Median :2.000  
##  Mean   :3.293   Mean   :3.632   Mean   :3.675   Mean   :2.158  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:5.000   3rd Qu.:3.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##        O3              O4              O5              O6              O7      
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.00  
##  1st Qu.:3.000   1st Qu.:1.000   1st Qu.:3.000   1st Qu.:1.000   1st Qu.:3.00  
##  Median :4.000   Median :2.000   Median :4.000   Median :2.000   Median :4.00  
##  Mean   :4.057   Mean   :2.097   Mean   :3.859   Mean   :1.874   Mean   :4.01  
##  3rd Qu.:5.000   3rd Qu.:3.000   3rd Qu.:5.000   3rd Qu.:2.000   3rd Qu.:5.00  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.00  
##        O8              O9            O10              country         
##  Min.   :0.000   Min.   :0.000   Length:4231        Length:4231       
##  1st Qu.:2.000   1st Qu.:4.000   Class :character   Class :character  
##  Median :3.000   Median :4.000   Mode  :character   Mode  :character  
##  Mean   :3.173   Mean   :4.104                                        
##  3rd Qu.:4.000   3rd Qu.:5.000                                        
##  Max.   :5.000   Max.   :5.000

As we see, we have also some demographics data, such as gender, education level, family size, etc. Almost all the variables are numeric, which means that we need to process them to qualitative variables to have the appropriate format for the association rules.

02. Feature preprocessing

First, we discretize and select the demographics features and second, take a look at the questionnaire.

Demographics

The data set actually provides a very comprehensive scope of demographics, such as religion, orientation and even right- or left-handedness. Definitely, this is a valuable data, but for the sake of convenience we need to limit the data to the most important features.

The number of association rules can be vast and overloading data may create unnecessary noise. We choose and process those features, which are most likely to have any causality with psychological traits.

Those are: * gender * age category * urban place of living or not * family size * education level

# 1) split age into categories 
df <- df[!df$age < 13,] # we omit responses for children younger 13 
df <- df %>% mutate(age_cat = cut(age, breaks=c(0,18, 25, 35, 45, 55, 65, Inf), 
                                                labels = c('Under 18', '18-24', 
                                                           '25-34','35-44', '45-54', 
                                                           '55-64', '65 and over'), include.lowest = TRUE))
df$age_cat <- as.factor(df$age_cat)
summary(df$age_cat)
##    Under 18       18-24       25-34       35-44       45-54       55-64 
##        1488        1389         635         343         255         101 
## 65 and over 
##          15
# 2) factor education to appropriate categories
df$education <- factor(df$education, levels = sort(unique(df$education)), labels = c('Not mentioned', 
'Less than high school', 'High school', 'University degree', 
'Graduate degree'))

# 3) factor type of urbanization 
df$urban <- factor(df$urban, levels = sort(unique(df$urban)), 
                   labels = c('Not mentioned', 
                              'Rural (countryside)', 'Suburban',
                              'Urban (city)'))
# 4) factor gender 
df$gender <- factor(df$gender, levels = sort(unique(df$gender)), labels = c('Not mentioned', 'Male', 'Female', 'Other'))

# 5) factor and split family size into levels
df <- df[!(df$familysize == max(df$familysize)),]

df$familysize <- as.factor(ifelse(df$familysize == 0, 'Adopted', ifelse(df$familysize == 1, 'Only child', ifelse(df$familysize <= 3, '2 to 3', ifelse(df$familysize <= 6, '4 to 6', '7 and more') ))))

Questionnaire responses

To have a glimpse, below are a few questions presented.

Code Question
E1 I am the life of the party.
E2 I don’t talk a lot.
E3 I feel comfortable around people.
… …
O8 I use difficult words.
O9 I spend time reflecting on things.
O10 I am full of ideas.

If we include all 50 questions, we will get too noisy data. Instead, we generate the compound indices for 5 dimensions by taking the average score for each topic.

Before doing that, let us inspect the values.

# inspect some response values 
summary(df[, 50:55])
##        C8              C9             C10              O1       
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.:3.000   1st Qu.:3.000  
##  Median :2.000   Median :3.000   Median :4.000   Median :4.000  
##  Mean   :2.452   Mean   :3.293   Mean   :3.632   Mean   :3.676  
##  3rd Qu.:3.000   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:5.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##        O2              O3      
##  Min.   :0.000   Min.   :0.00  
##  1st Qu.:1.000   1st Qu.:3.00  
##  Median :2.000   Median :4.00  
##  Mean   :2.156   Mean   :4.06  
##  3rd Qu.:3.000   3rd Qu.:5.00  
##  Max.   :5.000   Max.   :5.00

As we see, some responses contain 0 values, which should not have been present in the data.

Most probably, 0 stands for no answer, so we impute them with NAs and omit those rows.

for (i in 1:nrow(df)){
  for (j in 13:62){
    if (df[i,j] %in% c(0)) {
        df[i,j] <- NA }
  }
}

df <- df[complete.cases(df),]

Finally, we may go to transforming 50 columns into 5. The meaning of each statement was analysed and we take linear average of the values for each category

# averaging extraversion
df <- df %>% mutate(extraversion = round((E1 + E3 + E5 + E7 + E9 + 
                                      (6-E2) + (6-E4) + (6-E6) + (6-E8) + (6-E10)
                                      )/10, 0))
# averaging neuroticism 
df <- df %>% mutate(neuroticism = round((N1 + N3 + N5 + N6 + N7 + N8 + N9 + N10 +
                                           (6-N2) + (6 - N4))/10, 0))
# averaging symapthy to others 
df <- df %>% mutate(sympathy = round((A2 + A4 + A6 + A9 + A10 + 
                                        (6-A1) + (6-A3) + (6-A5) + (6-A7) + 
                                        (6-A8))/10, 0))

# averaging tendency for discipline and order 
df <- df %>% mutate(discipline = round((C1 + C3 + C5 + C7 + C9 + C10 +
                                     (6-C2) + (6-C4) + (6-C6) + (6- C8))/10, 0))

#one of the columns is still a character
df$O10 <- as.integer(df$O10)
 
# averaging being inventive and creativity 
# one of the statements was omitted it effect for average was not clear 
df <- df%>% mutate(inventive = round((O1 + O3 + O5 + O7 + O8 + O10 +
                                        (6-O2) + (6-O4) + (6-O6))/9, 0))
Desc(df[,65:69])
## ------------------------------------------------------------------------------ 
## Describe df[, 65:69] (data.frame):
## 
## data frame:  3831 obs. of  5 variables
##      3831 complete cases (100.0%)
## 
##   Nr  ColName       Class    NAs  Levels
##   1   extraversion  numeric  .          
##   2   neuroticism   numeric  .          
##   3   sympathy      numeric  .          
##   4   discipline    numeric  .          
##   5   inventive     numeric  .          
## 
## 
## ------------------------------------------------------------------------------ 
## 1 - extraversion (numeric)
## 
##   length       n    NAs  unique    0s  mean  meanCI'
##    3'831   3'831      0       5     0  3.02    2.99
##           100.0%   0.0%          0.0%          3.05
##                                                    
##      .05     .10    .25  median   .75   .90     .95
##     2.00    2.00   2.00    3.00  4.00  4.00    4.00
##                                                    
##    range      sd  vcoef     mad   IQR  skew    kurt
##     4.00    0.97   0.32    1.48  2.00  0.02   -0.71
##                                                    
## 
##    value   freq   perc  cumfreq  cumperc
## 1      1    155   4.0%      155     4.0%
## 2      2  1'105  28.8%    1'260    32.9%
## 3      3  1'275  33.3%    2'535    66.2%
## 4      4  1'105  28.8%    3'640    95.0%
## 5      5    191   5.0%    3'831   100.0%
## 
## ' 95%-CI (classic)

## ------------------------------------------------------------------------------ 
## 2 - neuroticism (numeric)
## 
##   length       n    NAs  unique    0s   mean  meanCI'
##    3'831   3'831      0       5     0   3.10    3.07
##           100.0%   0.0%          0.0%           3.12
##                                                     
##      .05     .10    .25  median   .75    .90     .95
##     2.00    2.00   2.00    3.00  4.00   4.00    4.00
##                                                     
##    range      sd  vcoef     mad   IQR   skew    kurt
##     4.00    0.93   0.30    1.48  2.00  -0.13   -0.63
##                                                     
## 
##    value   freq   perc  cumfreq  cumperc
## 1      1    132   3.4%      132     3.4%
## 2      2    942  24.6%    1'074    28.0%
## 3      3  1'349  35.2%    2'423    63.2%
## 4      4  1'245  32.5%    3'668    95.7%
## 5      5    163   4.3%    3'831   100.0%
## 
## ' 95%-CI (classic)

## ------------------------------------------------------------------------------ 
## 3 - sympathy (numeric)
## 
##   length       n    NAs  unique    0s   mean  meanCI'
##    3'831   3'831      0       5     0   3.68    3.66
##           100.0%   0.0%          0.0%           3.70
##                                                     
##      .05     .10    .25  median   .75    .90     .95
##     2.00    3.00   3.00    4.00  4.00   4.00    4.00
##                                                     
##    range      sd  vcoef     mad   IQR   skew    kurt
##     4.00    0.65   0.18    0.00  1.00  -0.88    0.96
##                                                     
## 
##    value   freq   perc  cumfreq  cumperc
## 1      1      7   0.2%        7     0.2%
## 2      2    203   5.3%      210     5.5%
## 3      3    974  25.4%    1'184    30.9%
## 4      4  2'480  64.7%    3'664    95.6%
## 5      5    167   4.4%    3'831   100.0%
## 
## ' 95%-CI (classic)

## ------------------------------------------------------------------------------ 
## 4 - discipline (numeric)
## 
##   length       n    NAs  unique    0s   mean  meanCI'
##    3'831   3'831      0       5     0   3.36    3.34
##           100.0%   0.0%          0.0%           3.39
##                                                     
##      .05     .10    .25  median   .75    .90     .95
##     2.00    2.00   3.00    3.00  4.00   4.00    5.00
##                                                     
##    range      sd  vcoef     mad   IQR   skew    kurt
##     4.00    0.80   0.24    1.48  1.00  -0.17   -0.47
##                                                     
## 
##    value   freq   perc  cumfreq  cumperc
## 1      1     10   0.3%       10     0.3%
## 2      2    543  14.2%      553    14.4%
## 3      3  1'518  39.6%    2'071    54.1%
## 4      4  1'565  40.9%    3'636    94.9%
## 5      5    195   5.1%    3'831   100.0%
## 
## ' 95%-CI (classic)

## ------------------------------------------------------------------------------ 
## 5 - inventive (numeric)
## 
##   length       n    NAs  unique    0s   mean  meanCI'
##    3'831   3'831      0       5     0   3.86    3.84
##           100.0%   0.0%          0.0%           3.89
##                                                     
##      .05     .10    .25  median   .75    .90     .95
##     3.00    3.00   3.00    4.00  4.00   5.00    5.00
##                                                     
##    range      sd  vcoef     mad   IQR   skew    kurt
##     4.00    0.73   0.19    0.00  1.00  -0.25   -0.05
##                                                     
## 
##    value   freq   perc  cumfreq  cumperc
## 1      1      6   0.2%        6     0.2%
## 2      2     95   2.5%      101     2.6%
## 3      3  1'007  26.3%    1'108    28.9%
## 4      4  2'034  53.1%    3'142    82.0%
## 5      5    689  18.0%    3'831   100.0%
## 
## ' 95%-CI (classic)

# factorisation of indexes 
# we will factor extraversion, neuroticism and sympathy into 3 levels 
hist(df$extraversion)
hist(df$neuroticism)
hist(df$sympathy)

Observing the distributions, probably it would be better to split extraversion, neuroticism and sympathy into 3 levels; and discipline and creativity into 2 levels.

# low, medium and high levels 
for (i in 1:nrow(df)){
  for (j in 65:67){
    if (df[i,j] %in% c(1,2)) {
      df[i,j] <- paste('low', colnames(df[j]), sep = "_")}
      else if (df[i,j] == 3) {
        df[i,j] <- paste('medium', colnames(df[j]), sep = "_")}
      else
        df[i,j] <- paste('high', colnames(df[j]), sep = "_")
  }
}
hist(df$discipline)
hist(df$inventive)
# discipline and incentive into 2 levels : low and high 
for (i in 1:nrow(df)){
  for (j in 68:69){
    if (df[i,j] %in% c(1,2,3)) {
      df[i,j] <- paste('low', colnames(df[j]), sep = "_")}
      else
        df[i,j] <- paste('high', colnames(df[j]), sep = "_")
  }
}

Now we limit the data frame to those variables which we need, change those to factors, clean from missings and save it for the further association rule mining.

data_person <- df[, c(1:3,64:69)]
summary(data_person)
##                  education                    urban                gender    
##  Not mentioned        :  28   Not mentioned      :  40   Not mentioned:   5  
##  Less than high school: 560   Rural (countryside): 717   Male         :1303  
##  High school          :1660   Suburban           :1317   Female       :2499  
##  University degree    :1050   Urban (city)       :1757   Other        :  24  
##  Graduate degree      : 533                                                  
##                                                                              
##                                                                              
##         age_cat     extraversion       neuroticism          sympathy        
##  Under 18   :1351   Length:3831        Length:3831        Length:3831       
##  18-24      :1265   Class :character   Class :character   Class :character  
##  25-34      : 583   Mode  :character   Mode  :character   Mode  :character  
##  35-44      : 316                                                           
##  45-54      : 223                                                           
##  55-64      :  80                                                           
##  65 and over:  13                                                           
##   discipline         inventive        
##  Length:3831        Length:3831       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##                                       
## 
# data_person <- cbind(data_person, df[, c('E2', 'E3', 'E8', 'N1',
  #                          'N3', 'A9', 'C1', 'C6', 'O6')])

data_person[, c('extraversion', 'neuroticism', 
                'sympathy', 'discipline', 'inventive')] <- lapply(data_person[,c('extraversion', 'neuroticism', 
                'sympathy', 'discipline', 'inventive')], factor)

data_person <- data_person[complete.cases(data_person), ]
write.csv(data_person, file="data_factor.csv", row.names = F)

03. Apriori associations

For digging into the associations persistent in psychological characteristics, we will use the Apriori algorithm. In R, the arules and arulesViz packages are used.

Theoretical introduction

To figure out which rules are relevant, we have 3 measures, which we focus on.


Firstly, we check one-item frequency of the characteristics. Support helps to determine the most popular items in transactions. For example, usually, we want to find the most popular consequent X and check, what could be its antecedents. Or, with support we may check how often the itemset \({X,Y}\) is met in data.

\[Support(X) = \displaystyle \frac{frequency(X)}{n}\]


Say, we have some hypothesis that Y is somewhat related to X. Confidence tells us, what is the proportion of total transactions with X intersect with transactions Y. Confidence equal to 1 would mean that all people choosing X, also chose Y. In other words, confidence may called the conditional probability of Y given X.

\[Confidence(X \to\ Y) = \displaystyle \frac{P(X \cap Y)}{P(X)} = P(Y \mid X) \]


However, high confidence does not necessarily incure any association persistent. It might be the case that X and Y are simply very popular items, but rather independent ones. Here, lift helps us.

The probabilistic rule says that \({P(X \cap Y)}\) for two independent variables equals to the multiplication of probabilities \({P(X) * P(Y)}\). What lift does is that it compares the observed frequency of X and Y met together in data and the probability to meet them together if they were independent.

Thus, we may suspect any association for X and Y if the nominator is greater than the denominator, i.e lift > 1. In this case we say, we meet X and Y together more often, than we should if they were independent.

\[Lift(X \to\ Y) = \displaystyle \frac{P(X \cap Y)}{P(X) * P(Y)}\]


To sum up, association is suspected if:

  • support: may vary across data
  • confidence: better if fairly high
  • lift: greater than 1.

Single and double frequency inspection

First of all, we need to load the data in a format appropriate for Apriori.

basket <-read.transactions("data_factor.csv", format="basket", sep=",", skip=1) # reading the file as transactions

Let us check how the data looks now. As we see, now we have a long list of bundles of responses given by participants.

inspect(basket[1:5])
##     items                   
## [1] {25-34,                 
##      Female,                
##      Graduate degree,       
##      high_inventive,        
##      high_sympathy,         
##      low_discipline,        
##      medium_extraversion,   
##      medium_neuroticism,    
##      Urban (city)}          
## [2] {18-24,                 
##      Female,                
##      High school,           
##      high_discipline,       
##      high_extraversion,     
##      high_inventive,        
##      high_neuroticism,      
##      high_sympathy,         
##      Urban (city)}          
## [3] {Female,                
##      high_inventive,        
##      high_neuroticism,      
##      Less than high school, 
##      low_discipline,        
##      low_extraversion,      
##      medium_sympathy,       
##      Suburban,              
##      Under 18}              
## [4] {25-34,                 
##      Female,                
##      high_discipline,       
##      high_inventive,        
##      high_sympathy,         
##      low_extraversion,      
##      low_neuroticism,       
##      Suburban,              
##      University degree}     
## [5] {35-44,                 
##      Graduate degree,       
##      high_discipline,       
##      high_neuroticism,      
##      high_sympathy,         
##      low_extraversion,      
##      low_inventive,         
##      Male,                  
##      Suburban}
# another way with LIST function
LIST(head(basket))

If we check, the length of the basket will be equal to the number of rows of data_person.

length(basket)
## [1] 3831

So, now we have 3831 entries of respondents’ characteristics encoded as list of transactions (response combinations).

Next, we may check the most frequent responses in our basket.

sort(round(itemFrequency(basket),3), decreasing = TRUE)
##        high_inventive         high_sympathy                Female 
##                 0.711                 0.691                 0.652 
##        low_discipline       high_discipline          Urban (city) 
##                 0.541                 0.459                 0.459 
##           High school      high_neuroticism              Under 18 
##                 0.433                 0.368                 0.353 
##    medium_neuroticism              Suburban                  Male 
##                 0.352                 0.344                 0.340 
##     high_extraversion   medium_extraversion                 18-24 
##                 0.338                 0.333                 0.330 
##      low_extraversion         low_inventive       low_neuroticism 
##                 0.329                 0.289                 0.280 
##     University degree       medium_sympathy   Rural (countryside) 
##                 0.274                 0.254                 0.187 
##                 25-34 Less than high school       Graduate degree 
##                 0.152                 0.146                 0.139 
##                 35-44                 45-54          low_sympathy 
##                 0.082                 0.058                 0.055 
##                 55-64         Not mentioned                 Other 
##                 0.021                 0.017                 0.006 
##           65 and over 
##                 0.003
itemFrequencyPlot(basket, topN=25, type="relative", main="Top 25 popular responses", col="#ACE6CF")

So, the 3 most frequent items in questionnaire were high_inventive, high_sympathy and Female.

A useful way to look at two itemset frequencies is to look at the cross tables for the characteristics. We may refer both to support and lift measures.

sup_tab<-crossTable(basket, measure="support", sort=TRUE)
round(sup_tab[1:10, 1:10], 3)
##                    high_inventive high_sympathy Female low_discipline
## high_inventive              0.711         0.510  0.437          0.370
## high_sympathy               0.510         0.691  0.490          0.353
## Female                      0.437         0.490  0.652          0.352
## low_discipline              0.370         0.353  0.352          0.541
## high_discipline             0.341         0.338  0.301          0.000
## Urban (city)                0.315         0.313  0.307          0.245
## High school                 0.288         0.287  0.285          0.249
## high_neuroticism            0.244         0.241  0.275          0.243
## Under 18                    0.237         0.226  0.250          0.229
## medium_neuroticism          0.242         0.244  0.233          0.182
##                    high_discipline Urban (city) High school high_neuroticism
## high_inventive               0.341        0.315       0.288            0.244
## high_sympathy                0.338        0.313       0.287            0.241
## Female                       0.301        0.307       0.285            0.275
## low_discipline               0.000        0.245       0.249            0.243
## high_discipline              0.459        0.214       0.185            0.125
## Urban (city)                 0.214        0.459       0.200            0.169
## High school                  0.185        0.200       0.433            0.170
## high_neuroticism             0.125        0.169       0.170            0.368
## Under 18                     0.124        0.158       0.201            0.153
## medium_neuroticism           0.170        0.159       0.155            0.000
##                    Under 18 medium_neuroticism
## high_inventive        0.237              0.242
## high_sympathy         0.226              0.244
## Female                0.250              0.233
## low_discipline        0.229              0.182
## high_discipline       0.124              0.170
## Urban (city)          0.158              0.159
## High school           0.201              0.155
## high_neuroticism      0.153              0.000
## Under 18              0.353              0.124
## medium_neuroticism    0.124              0.352

So, high_inventive and high_sympathy occur together in 51% of responses. Female and high_sympathy - in 49% of responses. So, having preview of the double item frequencies, we already keep in mind pairs that we want to test for association.

lift_tab <- crossTable(basket, measure="lift", sort=TRUE)
round(lift_tab[1:10, 1:10], 3)
##                    high_inventive high_sympathy Female low_discipline
## high_inventive                 NA         1.039  0.944          0.962
## high_sympathy               1.039            NA  1.086          0.946
## Female                      0.944         1.086     NA          0.997
## low_discipline              0.962         0.946  0.997             NA
## high_discipline             1.045         1.063  1.003          0.000
## Urban (city)                0.967         0.989  1.027          0.988
## High school                 0.936         0.958  1.008          1.062
## high_neuroticism            0.932         0.951  1.148          1.222
## Under 18                    0.947         0.928  1.085          1.199
## medium_neuroticism          0.967         1.002  1.013          0.957
##                    high_discipline Urban (city) High school high_neuroticism
## high_inventive               1.045        0.967       0.936            0.932
## high_sympathy                1.063        0.989       0.958            0.951
## Female                       1.003        1.027       1.008            1.148
## low_discipline               0.000        0.988       1.062            1.222
## high_discipline                 NA        1.015       0.927            0.739
## Urban (city)                 1.015           NA       1.009            1.003
## High school                  0.927        1.009          NA            1.070
## high_neuroticism             0.739        1.003       1.070               NA
## Under 18                     0.765        0.978       1.317            1.182
## medium_neuroticism           1.050        0.988       1.014            0.000
##                    Under 18 medium_neuroticism
## high_inventive        0.947              0.967
## high_sympathy         0.928              1.002
## Female                1.085              1.013
## low_discipline        1.199              0.957
## high_discipline       0.765              1.050
## Urban (city)          0.978              0.988
## High school           1.317              1.014
## high_neuroticism      1.182              0.000
## Under 18                 NA              1.001
## medium_neuroticism    1.001                 NA

Not to go far, we may also immediately check double item lifts. For example, looking at the pair Female and high_inventive (for which the support was high), the lift is 0.944. This tells us that this pair is seen as two items are popular by themselves, independently, but the association is rather not expected.

Itemsets and rules

After checking the frequencies, it’s a high time to search for the rules of itemsets. First, we may check the whole basket, without suspecting any antecedents or consequents.

For the apriori, we need to set the constraints on minimum support and confidence. We will set 0.1 and 0.8 respectively. The higher confidence we set, the less rules are expected to be found.

# creating the rules - standard settings
rules_general <- apriori(basket, parameter=list(supp=0.1, conf=0.8))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5     0.1      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 383 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[31 item(s), 3831 transaction(s)] done [0.00s].
## sorting and recoding items ... [24 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [38 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
rules_byconf <- sort(rules_general, by="confidence", decreasing=TRUE)
inspect(head(rules_byconf))
##     lhs                        rhs                support confidence  coverage     lift count
## [1] {Less than high school} => {Under 18}       0.1365179  0.9339286 0.1461759 2.648320   523
## [2] {Female,                                                                                 
##      high_discipline,                                                                        
##      high_extraversion}     => {high_sympathy}  0.1020621  0.8988506 0.1135474 1.300905   391
## [3] {high_extraversion,                                                                      
##      Male}                  => {high_inventive} 0.1054555  0.8820961 0.1195510 1.241025   404
## [4] {high_extraversion,                                                                      
##      low_neuroticism}       => {high_inventive} 0.1177238  0.8757282 0.1344297 1.232066   451
## [5] {Female,                                                                                 
##      high_extraversion,                                                                      
##      high_inventive}        => {high_sympathy}  0.1466980  0.8619632 0.1701906 1.247518   562
## [6] {Female,                                                                                 
##      high_extraversion}     => {high_sympathy}  0.1866353  0.8583433 0.2174367 1.242279   715

As we see,, the strongest rule out of all is that respondents under 18 y.o. have education level less than secondary. But there are no big news about that, right?

What we look for is whether some traits have antecedents. The [2] rule is rather more interesting to us. Almost 90% of the highly sympathetic respondents were females, suggesting that they are highly disciplined and extraverted.

rules_bylift<-sort(rules_general, by="lift", decreasing=TRUE) 
inspect(head(rules_bylift))
##     lhs                        rhs                support confidence  coverage     lift count
## [1] {Less than high school} => {Under 18}       0.1365179  0.9339286 0.1461759 2.648320   523
## [2] {Female,                                                                                 
##      high_discipline,                                                                        
##      high_extraversion}     => {high_sympathy}  0.1020621  0.8988506 0.1135474 1.300905   391
## [3] {high_neuroticism,                                                                       
##      low_inventive}         => {Female}         0.1015401  0.8189474 0.1239885 1.255457   389
## [4] {Female,                                                                                 
##      high_extraversion,                                                                      
##      high_inventive}        => {high_sympathy}  0.1466980  0.8619632 0.1701906 1.247518   562
## [5] {Female,                                                                                 
##      high_extraversion}     => {high_sympathy}  0.1866353  0.8583433 0.2174367 1.242279   715
## [6] {high_extraversion,                                                                      
##      Male}                  => {high_inventive} 0.1054555  0.8820961 0.1195510 1.241025   404

Another way to grasp any associations is to look at the graphs. arulesViz provides very qualified visualisations for associations.

plot(rules_general, method="paracoord", control=list(reorder=TRUE))

However, it is hard to see now any strong relations. What we can do more, is to check specific right hand sides (consequents) and left hand sides (antecedents).


Are there any antecedent characteristics for high neuroticism?

To check that, we set the rules’ right hand side (consequent) as high_neuroticism. Taking into account that the number of unique observations and characteristics is high in our basket, now we will threshold support and confidence values to 0.1 and 0.4 respectively.

rules_neuroticism <-apriori(data= basket, parameter=list(supp=0.1, conf=0.4), 
                           appearance=list(default="lhs", rhs="high_neuroticism"), 
                           control=list(verbose=F)) 

inspect(sort(rules_neuroticism, by="confidence", decreasing=TRUE)[1:5])
##     lhs                   rhs                  support confidence  coverage     lift count
## [1] {low_discipline,                                                                      
##      low_extraversion} => {high_neuroticism} 0.1067606  0.5446072 0.1960324 1.481811   409
## [2] {Female,                                                                              
##      low_extraversion} => {high_neuroticism} 0.1166797  0.5405079 0.2158705 1.470657   447
## [3] {Female,                                                                              
##      low_discipline}   => {high_neuroticism} 0.1772383  0.5040831 0.3516053 1.371550   679
## [4] {Female,                                                                              
##      Under 18}         => {high_neuroticism} 0.1239885  0.4968619 0.2495432 1.351902   475
## [5] {Female,                                                                              
##      high_inventive,                                                                      
##      low_discipline}   => {high_neuroticism} 0.1111981  0.4913495 0.2263117 1.336903   426
inspect(sort(rules_neuroticism, by="lift", decreasing=TRUE)[1:5])
##     lhs                   rhs                  support confidence  coverage     lift count
## [1] {low_discipline,                                                                      
##      low_extraversion} => {high_neuroticism} 0.1067606  0.5446072 0.1960324 1.481811   409
## [2] {Female,                                                                              
##      low_extraversion} => {high_neuroticism} 0.1166797  0.5405079 0.2158705 1.470657   447
## [3] {Female,                                                                              
##      low_discipline}   => {high_neuroticism} 0.1772383  0.5040831 0.3516053 1.371550   679
## [4] {Female,                                                                              
##      Under 18}         => {high_neuroticism} 0.1239885  0.4968619 0.2495432 1.351902   475
## [5] {Female,                                                                              
##      high_inventive,                                                                      
##      low_discipline}   => {high_neuroticism} 0.1111981  0.4913495 0.2263117 1.336903   426

Interpretation: how do we interpret the measures of support, confidence and lift?

  • Support: item combination {low_discipline, low_extraversion, high_neuroticism} is met in 10.68% of responses.

  • Confidence: 54% of people who responded {high_neuroticism} also responded {low_discipline, low_extraversion}.

  • Lift value 1.48 tells us, that under the observed frequencies, the probability to meet the itemset given it is conditional of each other is higher than the probability if the items would be independent from each other.

plot(rules_neuroticism, method="graph")

plot(rules_neuroticism, method="paracoord", control=list(reorder=FALSE))

plot(rules_neuroticism, method="grouped") 
## Registered S3 methods overwritten by 'registry':
##   method               from 
##   print.registry_field proxy
##   print.registry_entry proxy

So, as we see, people with low discipline and low extroversion are more exposed to stress and mood wings. This characteristics are also most common for women.

If we look at the parallel coordinates graph, we can see how all these characteristics are connected.

Another option how we may use Apriori rules is to check, not the consequent, but the antecedent (left hand side). For example, in reverse, we might ask:


How does the psychological profile look like for females?

Now, we will set antecedent Female as a left hand side to be studied with the rules.

rules_female <-apriori(data= basket, parameter=list(supp=0.1, conf=0.4), 
                           appearance=list(default="rhs", lhs="Female"), 
                           control=list(verbose=F)) 

inspect(sort(rules_female, by="confidence", decreasing=TRUE)[1:5])
##     lhs         rhs              support   confidence coverage  lift     count
## [1] {Female} => {high_sympathy}  0.4896894 0.7507003  0.6523101 1.086488 1876 
## [2] {}       => {high_inventive} 0.7107805 0.7107805  1.0000000 1.000000 2723 
## [3] {}       => {high_sympathy}  0.6909423 0.6909423  1.0000000 1.000000 2647 
## [4] {Female} => {high_inventive} 0.4374837 0.6706683  0.6523101 0.943566 1676 
## [5] {}       => {low_discipline} 0.5405899 0.5405899  1.0000000 1.000000 2071
inspect(sort(rules_female, by="lift", decreasing=TRUE)[1:5])
##     lhs         rhs                support   confidence coverage  lift    
## [1] {Female} => {high_neuroticism} 0.2751240 0.4217687  0.6523101 1.147582
## [2] {Female} => {high_sympathy}    0.4896894 0.7507003  0.6523101 1.086488
## [3] {Female} => {Urban (city)}     0.3072305 0.4709884  0.6523101 1.026953
## [4] {Female} => {High school}      0.2850431 0.4369748  0.6523101 1.008464
## [5] {Female} => {high_discipline}  0.3007048 0.4609844  0.6523101 1.003427
##     count
## [1] 1054 
## [2] 1876 
## [3] 1177 
## [4] 1092 
## [5] 1152

Here we see the association between females and higher neuroticism from the other side. As for the {Female}=>{high_sympathy} itemset, though the confidence is pretty high, might say that there is a connection, but I’d rather would be skeptic about this association as the lift is not much higher than 1 and we also saw that people generally tend to say that they sympathize with others.

plot(rules_female, method="grouped") 

plot(rules_female, method="graph")

plot(rules_female, method="paracoord", control=list(reorder=TRUE))

Who are the people with the highest discipline score?

rules_discipline <-apriori(data= basket, parameter=list(supp=0.1, conf=0.08), 
                           appearance=list(default="lhs", rhs="high_discipline"), 
                           control=list(verbose=F)) 


inspect(sort(rules_discipline, by="confidence", decreasing=TRUE)[1:5])
##     lhs                     rhs                 support confidence  coverage     lift count
## [1] {high_inventive,                                                                       
##      high_sympathy,                                                                        
##      low_neuroticism}    => {high_discipline} 0.1072827  0.6391913 0.1678413 1.391331   411
## [2] {high_sympathy,                                                                        
##      low_neuroticism}    => {high_discipline} 0.1279039  0.6218274 0.2056904 1.353535   490
## [3] {high_inventive,                                                                       
##      low_neuroticism}    => {high_discipline} 0.1362569  0.6048667 0.2252676 1.316616   522
## [4] {low_neuroticism}    => {high_discipline} 0.1647090  0.5875233 0.2803446 1.278865   631
## [5] {Female,                                                                               
##      high_extraversion,                                                                    
##      high_sympathy}      => {high_discipline} 0.1020621  0.5468531 0.1866353 1.190338   391
inspect(sort(rules_discipline, by="lift", decreasing=TRUE)[1:5])
##     lhs                     rhs                 support confidence  coverage     lift count
## [1] {high_inventive,                                                                       
##      high_sympathy,                                                                        
##      low_neuroticism}    => {high_discipline} 0.1072827  0.6391913 0.1678413 1.391331   411
## [2] {high_sympathy,                                                                        
##      low_neuroticism}    => {high_discipline} 0.1279039  0.6218274 0.2056904 1.353535   490
## [3] {high_inventive,                                                                       
##      low_neuroticism}    => {high_discipline} 0.1362569  0.6048667 0.2252676 1.316616   522
## [4] {low_neuroticism}    => {high_discipline} 0.1647090  0.5875233 0.2803446 1.278865   631
## [5] {Female,                                                                               
##      high_extraversion,                                                                    
##      high_sympathy}      => {high_discipline} 0.1020621  0.5468531 0.1866353 1.190338   391

Here we see that people with low_neuroticism are also highly disciplined and prefer order.

plot(rules_discipline, method="paracoord", control=list(reorder=TRUE))

plot(rules_discipline, method="graph")

***

Who are low sympathetic people?

We can also check rare characteristics - such as low sympathy. We know, that only 5% of respondents claimed that.

itemFrequency(basket)['low_sympathy']
## low_sympathy 
##   0.05481597

So, to find any associations, we lower the threshold for support to 1%.

rules_sympathy <-apriori(data= basket, parameter=list(supp=0.01, conf=0.1), 
                           appearance=list(default="lhs", rhs="low_sympathy"), 
                           control=list(verbose=F)) 


inspect(sort(rules_sympathy, by="confidence", decreasing=TRUE)[1:5])
##     lhs                    rhs               support confidence   coverage     lift count
## [1] {high_inventive,                                                                     
##      low_extraversion,                                                                   
##      Male}              => {low_sympathy} 0.01200731  0.1523179 0.07883059 2.778713    46
## [2] {low_extraversion,                                                                   
##      Male}              => {low_sympathy} 0.01540068  0.1404762 0.10963195 2.562687    59
## [3] {high_inventive,                                                                     
##      low_extraversion,                                                                   
##      Under 18}          => {low_sympathy} 0.01122422  0.1295181 0.08666145 2.362780    43
## [4] {High school,                                                                        
##      high_inventive,                                                                     
##      low_extraversion}  => {low_sympathy} 0.01226834  0.1256684 0.09762464 2.292552    47
## [5] {high_neuroticism,                                                                   
##      Male}              => {low_sympathy} 0.01096319  0.1246291 0.08796659 2.273591    42
inspect(sort(rules_sympathy, by="lift", decreasing=TRUE)[1:5])
##     lhs                    rhs               support confidence   coverage     lift count
## [1] {high_inventive,                                                                     
##      low_extraversion,                                                                   
##      Male}              => {low_sympathy} 0.01200731  0.1523179 0.07883059 2.778713    46
## [2] {low_extraversion,                                                                   
##      Male}              => {low_sympathy} 0.01540068  0.1404762 0.10963195 2.562687    59
## [3] {high_inventive,                                                                     
##      low_extraversion,                                                                   
##      Under 18}          => {low_sympathy} 0.01122422  0.1295181 0.08666145 2.362780    43
## [4] {High school,                                                                        
##      high_inventive,                                                                     
##      low_extraversion}  => {low_sympathy} 0.01226834  0.1256684 0.09762464 2.292552    47
## [5] {high_neuroticism,                                                                   
##      Male}              => {low_sympathy} 0.01096319  0.1246291 0.08796659 2.273591    42

Lowering the bounds, we find a group: higher lift tells us that low extraverted males are the antecedent items for low sympathy.

plot(rules_sympathy, method="paracoord", control=list(reorder=TRUE))

Conclusion

To conclude, association rules is a very powerful tool when we want to see some causalities in the data. In our case, we showed how useful it could be to implement it to the survey data. For example, when we look for any relationships between features, we often perform the correlation analysis.

Association rules is a valuable built-up as we obtain inference (like the percentages of how many people responded X), and better understand the profile groups.