Introduction


Association rule learning is a data analysis technique in machine learning that reveals relationships between variables in a large dataset. The basic idea is to find common patterns of elements that occur together in transactions or records. These patterns are then used to generate rules, called association rules, that suggest a relationship between elements. The use of association rule learning is significant in data mining as it provides insights into the connections between variables in the data. This information can then be used to make informed choices and forecast future trends. By discovering hidden patterns in the data, organizations can also improve their efficiency and make more informed decisions.

The following analysis is an example of the use of association rules in practice.

About Dataset


The data used in the analysis was taken from Kaggle: https://www.kaggle.com/datasets/wenruliu/adult-income-dataset

The original dataset named ‘adult’ came from the UCI machine learning repository: https://archive.ics.uci.edu/ml/datasets/adult

Before cleaning, the data contains 15 variables (6 continuous and 9 nominal/ordinal) and 48842 observations. The database contains information on the individuals income and personal information. These include age, gender, race, country of origin, marital status and others. All variables and their descriptions can be found in the tables below.

  • Continuous attributes
Variable Description
age age of respondent
fnlwgt represents final weight, which is the number of units in the target population that the responding unit represents
education.num number of years of education in total
capital.gain income from investment sources other than wage/salary
capital.loss losses from investment sources
hours.per.week number of hours worked per week


  • Nominal/ordinal attributes
Variable Description
workclass Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked
education Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool
marital.status Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse
occupation Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces
relationship Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried
race White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
gender Female, Male
native.country United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands
income >50K, <=50K

Loading data

data<-read.csv("adult.csv", header=TRUE, sep=",")
head(data)

Data cleaning and preprocessing

In order to apply association rules methods, variables need to be prepared. The data must be in transactional format. It is therefore necessary to convert continuous variables into nominal variables. We can get rid of some variables such as “fnlwgt”, “educational.num”, as they do not provide valuable information for analysis. The remaining continuous variables need to be grouped and converted into a factor variable type.

data <- select(data,-c("fnlwgt", "educational.num"))
data[!complete.cases(data),]
data %>% group_by(education) %>% tally()
data %>% group_by(workclass) %>% tally()
data %>% group_by(marital.status) %>% tally()
data %>% group_by(occupation) %>% tally()
data %>% group_by(relationship) %>% tally()
data %>% group_by(race) %>% tally()
data %>% group_by(gender) %>% tally()
data %>% group_by(native.country) %>% tally()
data %>% group_by(income) %>% tally()
data <- data %>% mutate(workclass = ifelse(workclass == "?", "Unknown_workclass", workclass))
data <- data %>% mutate(occupation = ifelse(occupation == "?", "Unknown_occupation", occupation))
data <- data %>% mutate(native.country = ifelse(native.country == "?", "Unknown_native.country", native.country))
str(data)
'data.frame':   48842 obs. of  13 variables:
 $ age           : int  25 38 28 44 18 34 29 63 24 55 ...
 $ workclass     : chr  "Private" "Private" "Local-gov" "Private" ...
 $ education     : chr  "11th" "HS-grad" "Assoc-acdm" "Some-college" ...
 $ marital.status: chr  "Never-married" "Married-civ-spouse" "Married-civ-spouse" "Married-civ-spouse" ...
 $ occupation    : chr  "Machine-op-inspct" "Farming-fishing" "Protective-serv" "Machine-op-inspct" ...
 $ relationship  : chr  "Own-child" "Husband" "Husband" "Husband" ...
 $ race          : chr  "Black" "White" "White" "Black" ...
 $ gender        : chr  "Male" "Male" "Male" "Male" ...
 $ capital.gain  : int  0 0 0 7688 0 0 0 3103 0 0 ...
 $ capital.loss  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ hours.per.week: int  40 50 40 40 30 30 40 32 40 10 ...
 $ native.country: chr  "United-States" "United-States" "United-States" "United-States" ...
 $ income        : chr  "<=50K" "<=50K" ">50K" ">50K" ...
ggplot(data, aes(age)) +
  geom_histogram(aes(y = ..density..), color = "#000000", fill = "#0099F8") +
  geom_density(color = "#000000", fill = "#F85700", alpha = 0.6)

max(data$age)
[1] 90
min(data$age)
[1] 17
data$age <- cut(data$age,
            breaks = c(0,24,34,44,54,64,74,100),
            labels = c("17-24","25-34","35-44","45-54","55-64","65-74","75+"),
            ordered_result = T)

class(data$age)
[1] "ordered" "factor" 
data %>% group_by(age) %>% tally()
ggplot(data, aes(hours.per.week)) +
  geom_histogram(aes(y = ..density..), color = "#000000", fill = "#0099F8") +
  geom_density(color = "#000000", fill = "#F85700", alpha = 0.6)

max(data$hours.per.week)
[1] 99
min(data$hours.per.week)
[1] 1
table(data$hours.per.week)

    1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16    17    18    19    20 
   27    53    59    84    95    92    45   218    27   425    20   247    28    55   623   303    42   129    19  1862 
   21    22    23    24    25    26    27    28    29    30    31    32    33    34    35    36    37    38    39    40 
   46    62    40   354   958    40    43   140    15  1700    12   423    61    48  1937   336   242   714    63 22803 
   41    42    43    44    45    46    47    48    49    50    51    52    53    54    55    56    57    58    59    60 
   59   338   227   310  2717   129    82   770    39  4246    20   205    39    62  1051   141    19    38     7  2177 
   61    62    63    64    65    66    67    68    69    70    72    73    74    75    76    77    78    79    80    81 
    4    23    15    22   355    23     6    16     1   437   107     4     3   105     4     9    13     1   210     3 
   82    84    85    86    87    88    89    90    91    92    94    95    96    97    98    99 
    1    72    17     4     1     4     3    42     3     3     1     2     9     2    14   137 
data$hours.per.week <- cut(data$hours.per.week,
                breaks = c(0,31,40,60,100),
                labels = c("Working_part-time","Working_full-time","Working_long-hours","Possible_workaholic"),
                ordered_result = T)

class(data$hours.per.week)
[1] "ordered" "factor" 
data %>% group_by(hours.per.week) %>% tally()
ggplot(data, aes(capital.gain)) +
  geom_histogram(aes(y = ..density..), color = "#000000", fill = "#0099F8") +
  geom_density(color = "#000000", fill = "#F85700", alpha = 0.6)

max(data$capital.gain)
[1] 99999
min(data$capital.gain)
[1] 0
table(data$capital.gain)

    0   114   401   594   914   991  1055  1086  1111  1151  1173  1264  1409  1424  1455  1471  1506  1639  1731  1797 
44807     8     5    52    10     6    37     8     1    13     5     2    10     4     4     9    24     1     1    10 
 1831  1848  2009  2036  2050  2062  2105  2174  2176  2202  2228  2290  2329  2346  2354  2387  2407  2414  2463  2538 
    9     9     3     5     5     3    15    74    31    28     5    10     7     8    21     1    25    10    15     5 
 2580  2597  2635  2653  2829  2885  2907  2936  2961  2964  2977  2993  3103  3137  3273  3325  3411  3418  3432  3456 
   20    31    14    11    42    30    18     4     4    14    11     3   152    51     7    81    34     8     4     6 
 3464  3471  3674  3781  3818  3887  3908  3942  4064  4101  4386  4416  4508  4650  4687  4787  4865  4931  4934  5013 
   33    11    22    16    11     8    42    18    54    29   108    24    23    63     4    35    25     4    10   117 
 5060  5178  5455  5556  5721  6097  6360  6418  6497  6514  6612  6723  6767  6849  7262  7298  7430  7443  7688  7896 
    2   146    18     6     7     2     3    16    15    10     1     5     6    42     1   364    15     7   410     4 
 7978  8614  9386  9562 10520 10566 10605 11678 13550 14084 14344 15020 15024 15831 18481 20051 22040 25124 25236 27828 
    2    82    31     5    64     8    19     4    42    49    34    10   513     8     2    49     1     6    14    58 
34095 41310 99999 
    6     3   244 
x <- data %>% filter(capital.gain > 0)

ggplot(x, aes(capital.gain)) +
  geom_histogram(aes(y = ..density..), color = "#000000", fill = "#0099F8") +
  geom_density(color = "#000000", fill = "#F85700", alpha = 0.6)

max(x$capital.gain)
[1] 99999
min(x$capital.gain)
[1] 114
table(x$capital.gain)

  114   401   594   914   991  1055  1086  1111  1151  1173  1264  1409  1424  1455  1471  1506  1639  1731  1797  1831 
    8     5    52    10     6    37     8     1    13     5     2    10     4     4     9    24     1     1    10     9 
 1848  2009  2036  2050  2062  2105  2174  2176  2202  2228  2290  2329  2346  2354  2387  2407  2414  2463  2538  2580 
    9     3     5     5     3    15    74    31    28     5    10     7     8    21     1    25    10    15     5    20 
 2597  2635  2653  2829  2885  2907  2936  2961  2964  2977  2993  3103  3137  3273  3325  3411  3418  3432  3456  3464 
   31    14    11    42    30    18     4     4    14    11     3   152    51     7    81    34     8     4     6    33 
 3471  3674  3781  3818  3887  3908  3942  4064  4101  4386  4416  4508  4650  4687  4787  4865  4931  4934  5013  5060 
   11    22    16    11     8    42    18    54    29   108    24    23    63     4    35    25     4    10   117     2 
 5178  5455  5556  5721  6097  6360  6418  6497  6514  6612  6723  6767  6849  7262  7298  7430  7443  7688  7896  7978 
  146    18     6     7     2     3    16    15    10     1     5     6    42     1   364    15     7   410     4     2 
 8614  9386  9562 10520 10566 10605 11678 13550 14084 14344 15020 15024 15831 18481 20051 22040 25124 25236 27828 34095 
   82    31     5    64     8    19     4    42    49    34    10   513     8     2    49     1     6    14    58     6 
41310 99999 
    3   244 
data$capital.gain <- cut(data$capital.gain,
                           breaks = c(-1,0,median(x$capital.gain),max(x$capital.gain)),
                           labels = c("0_capital_gain","below_median_capital_gain","above_median_capital_gain"),
                           ordered_result = T)

class(data$capital.gain)
[1] "ordered" "factor" 
data %>% group_by(capital.gain) %>% tally()
y <- data %>% filter(capital.loss > 0)

ggplot(data, aes(capital.loss)) +
  geom_histogram(aes(y = ..density..), color = "#000000", fill = "#0099F8") +
  geom_density(color = "#000000", fill = "#F85700", alpha = 0.6)

max(data$capital.loss)
[1] 4356
min(data$capital.loss)
[1] 0
table(data$capital.loss)

    0   155   213   323   419   625   653   810   880   974  1092  1138  1258  1340  1380  1408  1411  1421  1429  1485 
46560     1     5     5     3    17     4     2     6     2    11     4     6    11    10    35     4     1     3    71 
 1504  1510  1539  1564  1573  1579  1590  1594  1602  1617  1628  1648  1651  1668  1669  1672  1719  1721  1726  1735 
   26     3     1    43    12    30    62     9    62    11    24     3    11     9    35    50    38    28     9     3 
 1740  1741  1755  1762  1816  1825  1844  1848  1870  1876  1887  1902  1911  1944  1974  1977  1980  2001  2002  2042 
   58    44     2    20     4     5     3    67     1    59   233   304     1     3    28   253    36    35    33    12 
 2051  2057  2080  2129  2149  2163  2174  2179  2201  2205  2206  2231  2238  2246  2258  2267  2282  2339  2352  2377 
   29    16     1     7     5     2    10    20     1    19     6     7     4     8    39     3     2    27     2    25 
 2392  2415  2444  2457  2465  2467  2472  2489  2547  2559  2603  2754  2824  3004  3175  3683  3770  3900  4356 
   11    72    20     4     1     2     4     1     5    17     7     2    14     5     2     2     4     2     3 
data$capital.loss <- cut(data$capital.loss,
                         breaks = c(-1,0,median(y$capital.loss),max(y$capital.loss)),
                         labels = c("0_capital_loss","below_median_capital_loss","above_median_capital_loss"),
                         ordered_result = T)

class(data$capital.loss)
[1] "ordered" "factor" 
data %>% group_by(capital.loss) %>% tally()

Due to the asymmetric distribution of the variable capital.gain and capital.loss, the median instead of the average of all non-zero income/losses respectively was used in determining the new levels.

summary(data)
    age         workclass          education         marital.status      occupation        relationship      
 17-24: 8432   Length:48842       Length:48842       Length:48842       Length:48842       Length:48842      
 25-34:12577   Class :character   Class :character   Class :character   Class :character   Class :character  
 35-44:12193   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character  
 45-54: 8771                                                                                                 
 55-64: 4782                                                                                                 
 65-74: 1642                                                                                                 
 75+  :  445                                                                                                 
     race              gender                             capital.gain                      capital.loss  
 Length:48842       Length:48842       0_capital_gain           :44807   0_capital_loss           :46560  
 Class :character   Class :character   below_median_capital_gain: 2345   below_median_capital_loss: 1166  
 Mode  :character   Mode  :character   above_median_capital_gain: 1690   above_median_capital_loss: 1116  
                                                                                                          
                                                                                                          
                                                                                                          
                                                                                                          
             hours.per.week  native.country        income         
 Working_part-time  : 7863   Length:48842       Length:48842      
 Working_full-time  :26627   Class :character   Class :character  
 Working_long-hours :12676   Mode  :character   Mode  :character  
 Possible_workaholic: 1676                                        
                                                                  
                                                                  
                                                                  
trans <- transactions(data)
Warning: Column(s) 2, 3, 4, 5, 6, 7, 8, 12, 13 not logical or factor. Applying default discretization (see '? discretizeDF').
length(trans)
[1] 48842
LIST(head(trans, 2))
$`1`
 [1] "age=25-34"                        "workclass=Private"                "education=11th"                  
 [4] "marital.status=Never-married"     "occupation=Machine-op-inspct"     "relationship=Own-child"          
 [7] "race=Black"                       "gender=Male"                      "capital.gain=0_capital_gain"     
[10] "capital.loss=0_capital_loss"      "hours.per.week=Working_full-time" "native.country=United-States"    
[13] "income=<=50K"                    

$`2`
 [1] "age=35-44"                         "workclass=Private"                 "education=HS-grad"                
 [4] "marital.status=Married-civ-spouse" "occupation=Farming-fishing"        "relationship=Husband"             
 [7] "race=White"                        "gender=Male"                       "capital.gain=0_capital_gain"      
[10] "capital.loss=0_capital_loss"       "hours.per.week=Working_long-hours" "native.country=United-States"     
[13] "income=<=50K"                     

The finished data prepared for further analysis was saved under the name ‘trans’. The number of transactions is 48842, with a fixed length of 13.

Analysis

Item frequency

head(sort(itemFrequency(trans, type="absolute"), decreasing = T), 20)
      capital.loss=0_capital_loss       capital.gain=0_capital_gain      native.country=United-States 
                            46560                             44807                             43832 
                       race=White                      income=<=50K                 workclass=Private 
                            41762                             37155                             33906 
                      gender=Male  hours.per.week=Working_full-time marital.status=Married-civ-spouse 
                            32650                             26627                             22379 
             relationship=Husband                     gender=Female      marital.status=Never-married 
                            19716                             16192                             16117 
                education=HS-grad hours.per.week=Working_long-hours        relationship=Not-in-family 
                            15784                             12676                             12583 
                        age=25-34                         age=35-44                       income=>50K 
                            12577                             12193                             11687 
           education=Some-college                         age=45-54 
                            10878                              8771 
head(sort(itemFrequency(trans, type="absolute"), decreasing = F), 20)
        native.country=Holand-Netherlands                    workclass=Never-worked 
                                        1                                        10 
                  occupation=Armed-Forces                    native.country=Hungary 
                                       15                                        19 
                  native.country=Honduras                     workclass=Without-pay 
                                       20                                        21 
                  native.country=Scotland                       native.country=Laos 
                                       21                                        23 
native.country=Outlying-US(Guam-USVI-etc)                 native.country=Yugoslavia 
                                       23                                        23 
           native.country=Trinadad&Tobago                   native.country=Cambodia 
                                       27                                        28 
                      native.country=Hong                   native.country=Thailand 
                                       30                                        30 
         marital.status=Married-AF-spouse                    native.country=Ireland 
                                       37                                        37 
                    native.country=France                    native.country=Ecuador 
                                       38                                        45 
                      native.country=Peru                     native.country=Greece 
                                       46                                        49 
itemFrequencyPlot(trans, topN=20, type="relative", main="Frequency plot")

The chart above shows the most frequently occurring values in the dataset. As you can see, there are several values that appear in almost every ‘transaction’. For example, the frequency of capital.loss=0 is over 90%. This should not come as a surprise, as it was already visible in the data analysis. The same is true for the values native.country=United-States and race=White. Again, this is not surprising as the data relates to people from the United States, where white people made up the vast majority of the population at the time of data collection.

It is therefore to be expected that these most common values will appear with the vast majority of rules.

Measuring rule interest

  • Support of an association rule is the percentage of groups containing all elements in the rule, calculated from the total number of groups examined. It represents the frequency of occurrence of the rule’s body and head. Setting a minimum support level when mining can produce more significant results and control the number of rules generated.

  • Confidence measures the reliability of an association rule by indicating the frequency at which the head of the rule appears in groups containing the body of the rule. A higher confidence value means the rule is more reliable. Mining settings can include a minimum confidence threshold to produce more significant results and limit the number of generated rules.

  • The lift of an association rule indicates its significance by comparing the actual confidence of the rule to its expected confidence, which is calculated as the support of the head divided by the product of the body and head supports. A lift value greater than 1 means the body positively affects the head, a value less than 1 means a negative effect, and a value near 1 means no effect.

Analysis of the obtained rules - basic parameters

Apriori is an algorithm for discovering frequent item sets and association rules in databases. It starts with individual items with high frequency and builds larger sets while maintaining sufficient frequency. The frequent item sets produced by Apriori reveal general trends in the database.

Given the usefulness of the apriori algorithm and its popularity, it is this algorithm that will be used to further analyse association rules.

rules<-apriori(trans, parameter=list(minlen=2))
Apriori

Parameter specification:

Algorithmic control:

Absolute minimum support count: 4884 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[121 item(s), 48842 transaction(s)] done [0.03s].
sorting and recoding items ... [32 item(s)] done [0.01s].
creating transaction tree ... done [0.02s].
checking subsets of size 1 2 3 4 5 6 7 8 9 done [0.09s].
writing ... [7067 rule(s)] done [0.01s].
creating S4 object  ... done [0.02s].
summary(rules)
set of 7067 rules

rule length distribution (lhs + rhs):sizes
   2    3    4    5    6    7    8    9 
 135  718 1724 2245 1559  577  102    7 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.000   4.000   5.000   4.927   6.000   9.000 

summary of quality measures:
    support         confidence        coverage           lift            count      
 Min.   :0.1000   Min.   :0.8001   Min.   :0.1003   Min.   :0.8722   Min.   : 4885  
 1st Qu.:0.1150   1st Qu.:0.8891   1st Qu.:0.1245   1st Qu.:1.0001   1st Qu.: 5615  
 Median :0.1360   Median :0.9267   Median :0.1473   Median :1.0278   Median : 6641  
 Mean   :0.1692   Mean   :0.9244   Mean   :0.1839   Mean   :1.2065   Mean   : 8263  
 3rd Qu.:0.1876   3rd Qu.:0.9631   3rd Qu.:0.2052   3rd Qu.:1.1480   3rd Qu.: 9163  
 Max.   :0.8707   Max.   :1.0000   Max.   :0.9533   Max.   :2.7453   Max.   :42525  

mining info:

Basic parameters of the apriori function: support = 0.1 and confidence = 0.8.

rules.supp<-sort(rules, by="support", decreasing=TRUE)
inspectDT(rules.supp[is.significant(rules.supp, trans, alpha=0.05)])

By sorting the resulting rules by the highest support value, we obtain similar conclusions to those reached by frequency chart analysis. The expected combinations race=White and native.country=United-States, or in combination with capital.loss=0 have the highest support value. This means that these connections have the highest number of rules obtained. It is also worth mentioning that only rules that are statistically significant at the 5% significance level were selected using the is.significant function.

rules.conf<-sort(rules, by="confidence", decreasing=TRUE) 
inspectDT(rules.conf[is.significant(rules.conf, trans, alpha=0.05)])

Sorting by highest confidence values already yields slightly different conclusions. A value of one for relationship=Husband and gender=Male should not come as a surprise. This means that when a person has the value relationship=Husband he is also male. There is a similar relationship between relationship=Husband and marital.status=Married for obvious reasons. However, there are many other less obvious and interesting connections that are worth exploring.

rules.lift<-sort(rules, by="lift", decreasing=TRUE)
inspectDT(rules.lift[is.significant(rules.lift, trans, alpha=0.05)])

As described earlier, lift values greater than 1 indicate a positive effect of itemset 𝑋 (body of the rule) on 𝑌 (head of the rule). The highest lift values for the analysed dataset are reached by rules describing the impact of, for example, relationship=Own-child on marital.status=Never-married. Or another obvious example like the influence of marital.status=Married-civ-spouse and gender=Male on {relationship=Husband}.

Digging the rules - examples of interesting relationships

A properly prepared dataset, together with the association rules detection techniques used, provides a wealth of knowledge about the units under investigation. Thanks to a large number of variables with multiple levels, we are able to investigate the most diverse connections between specific values. It would be impossible to prepare and describe all the interesting connections, hence I limited the selection to the study of rules related to income=>50K and native.country=Poland.

RHS

I was first interested in the variables and their values that had the greatest impact on income=>50K.

rules.income<-apriori(data=trans, parameter=list(supp=0.1,conf = 0.4), appearance=list(default="lhs", rhs="income=>50K"), control=list(verbose=F)) 
rules.income.byconf<-sort(rules.income, by="confidence", decreasing=TRUE)
inspectDT(rules.income.byconf)

By placing income=>50K to the right of the rules and adjusting the parameters accordingly, it was possible to obtain 102 rules with a minimum confidence of 0.4 and support = 0.1. For the data analysed, it can be concluded that married white US citizens have the highest probability of high earnings. This rule of thumb reaches a confidence level of 0.464. A good indicator may also be the value of the variable relationship=Husband because it also reaches similar confidence scores. In other rules, variables such as hours.per.week=Working_long-hours and capital.gain=0_capital_gain also appear, which seems reasonable.

LHS

The second interesting idea from my perspective to explore was to check the rules determining the impact of the native.country=Poland variable on all the others.

rules.Poland<-apriori(data=trans, parameter=list(supp=0.00001, conf = 0.1, minlen = 2), 
appearance=list(default="rhs", lhs="native.country=Poland"), control=list(verbose=F)) 
rules.Poland.byconf<-sort(rules.Poland, by="confidence", decreasing=TRUE)
inspectDT(rules.Poland.byconf)

Due to the very small number of people whose native country was Poland, support had to be reduced to as low as 0.00001 in order to comply with the 26 rules. Again, the rules were sorted due to the highest confidence. Unsurprisingly, the highest relationship was achieved between {native.country=Poland} and {race=White}. (the vast majority of Poles are White), {capital.loss=0_capital_loss, capital.gain=0_capital_gain} (significant advantage of these categories over others for the entire dataset). A very high confidence level because 0.805 was also achieved with income=<=50K. This shows that the average Pole surveyed was very likely to have low earnings. Further high values can also be read for rules where there is workclass=Private, gender=Male, hours.per.week=Working_full-time. Thus, it can be concluded that among the surveyed Poles there was a significant preponderance of men who worked privately on a full-time basis. Married people also prevailed among the surveyed Poles. They are also relatively young people (a preponderance of ages 35-44 followed by 25-34) who have graduated from high school.

Visualisation of the results obtained

plot(rules[1:100,], method="matrix", measure="lift")
Itemsets in Antecedent (LHS)
 [1] "{marital.status=Married-civ-spouse}" "{relationship=Husband}"              "{relationship=Own-child}"           
 [4] "{age=17-24}"                         "{income=>50K}"                       "{occupation=Craft-repair}"          
 [7] "{hours.per.week=Working_part-time}"  "{marital.status=Never-married}"      "{marital.status=Divorced}"          
[10] "{hours.per.week=Working_long-hours}" "{relationship=Not-in-family}"        "{gender=Female}"                    
[13] "{education=HS-grad}"                 "{education=Some-college}"            "{relationship=Unmarried}"           
[16] "{occupation=Adm-clerical}"           "{occupation=Sales}"                  "{age=25-34}"                        
[19] "{occupation=Exec-managerial}"        "{age=35-44}"                         "{age=45-54}"                        
[22] "{education=Bachelors}"               "{occupation=Prof-specialty}"        
Itemsets in Consequent (RHS)
[1] "{capital.loss=0_capital_loss}"       "{capital.gain=0_capital_gain}"       "{race=White}"                       
[4] "{native.country=United-States}"      "{income=<=50K}"                      "{gender=Male}"                      
[7] "{marital.status=Married-civ-spouse}" "{relationship=Husband}"              "{marital.status=Never-married}"     

plot(rules, measure=c("support","confidence"), shading="lift")
To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

plot(rules[1:20], method="grouped")

Visualisations for rules with income = >50k

plot(rules.income, measure=c("support","confidence"), shading="lift")

plot(rules.income[1:20,], method="graph")

plot(rules.income, method="paracoord", control=list(reorder=TRUE))

Interactive plotting

set.seed(123)
plot(rules.income, method="graph", measure="support", shading="lift", engine="html")
Warning: Too many rules supplied. Only plotting the best 100 using ‘lift’ (change control parameter max if needed).

The above charts are a valuable addition to the analysis of association rules. Using the example of the rules with income=>50k, you can perfectly see in the graphs the connections to other variables and their values. Graphical analysis provides the same information as a thorough review of the tables of metrics, hence its conclusions are analogous to those described earlier. However, many times graphical analysis provides a better and easier way to see the complexity and strength of connections between variables and their levels.

Summary

As stated earlier association rule learning is a method in machine learning used to uncover relationships between variables in a dataset. The technique involves finding common patterns in elements that occur together and generating rules that suggest connections between elements. The method was used in a study and proved to be effective, providing valuable insights into the connections within the data.

By using the Apriori algorithm, valuable information was obtained regarding the relationship between high income and the other variables and their values, as well as between the native country Poland and the other variables. However, this is a small part of the available rules to explore. In further work on the analysis, it would be worthwhile to deepen the interpretations of the results with other interesting variables like race, age, marital status and education.

---
title: "USL project - Association rules"
author: "Adam Janczyszyn"
date: "2022/2023"
output:
  html_notebook:
    toc: yes
    toc_float: yes
    highlight: haddock
    theme: cerulean
    number_sections: no
  pdf_document:
    toc: yes
  html_document:
    toc: yes
    toc_float: yes
    highlight: haddock
    theme: cerulean
    number_sections: no
    df_print: paged
editor_options: 
  markdown: 
    wrap: 72
---

```{r message=FALSE, warning=FALSE, include=FALSE}
library(factoextra)
library(tidyverse)
library(skimr)
library(arules)
library(arulesViz)
library(arulesCBA)
```

# Introduction

![Source:
<https://www.edsurge.com/news/2020-02-28-this-district-helps-young-kids-identify-their-interests-and-ideal-careers>](association_rules.webp)

</br> Association rule learning is a data analysis technique in machine learning that reveals relationships between variables in a large dataset. The basic idea is to find common patterns of elements that occur together in transactions or records. These patterns are then used to generate rules, called association rules, that suggest a relationship between elements. The use of association rule learning is significant in data mining as it provides insights into the connections between variables in the data. This information can then be used to make informed choices and forecast future trends. By discovering hidden patterns in the data, organizations can also improve their efficiency and make more informed decisions.

The following analysis is an example of the use of association rules in
practice. </br></br>

# About Dataset

</br> The data used in the analysis was taken from Kaggle:
<https://www.kaggle.com/datasets/wenruliu/adult-income-dataset>

The original dataset named 'adult' came from the UCI machine learning
repository: <https://archive.ics.uci.edu/ml/datasets/adult>

Before cleaning, the data contains 15 variables (6 continuous and 9 nominal/ordinal) and 48842 observations. The database contains information on the individuals income and personal information. These include age, gender, race, country of origin, marital status and others. All variables and their descriptions can be found in the tables below. </br></br>

-   ***Continuous attributes***

| **Variable**   | Description                                                                                                        |
|:-----------------|------------------------------------------------------|
| age            | age of respondent                                                                                                  |
| fnlwgt         | represents final weight, which is the number of units in the target population that the responding unit represents |
| education.num  | number of years of education in total                                                                              |
| capital.gain   | income from investment sources other than wage/salary                                                              |
| capital.loss   | losses from investment sources                                                                                     |
| hours.per.week | number of hours worked per week                                                                                    |

</br>

-   ***Nominal/ordinal attributes***

| **Variable**   | Description                                                                                                                                                                                                                                                                                                                                                                                                                    |
|-----------|-------------------------------------------------------------|
| workclass      | Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked                                                                                                                                                                                                                                                                                                                          |
| education      | Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool                                                                                                                                                                                                                                                                          |
| marital.status | Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse                                                                                                                                                                                                                                                                                                                      |
| occupation     | Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces                                                                                                                                                                                                       |
| relationship   | Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried                                                                                                                                                                                                                                                                                                                                                             |
| race           | White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black                                                                                                                                                                                                                                                                                                                                                                    |
| gender         | Female, Male                                                                                                                                                                                                                                                                                                                                                                                                                   |
| native.country | United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands |
| income         | \>50K, \<=50K                                                                                                                                                                                                                                                                                                                                                                                                                  |

## Loading data

```{r}
data<-read.csv("adult.csv", header=TRUE, sep=",")
head(data)
```

## Data cleaning and preprocessing

In order to apply association rules methods, variables need to be prepared. The data must be in transactional format. It is therefore necessary to convert continuous variables into nominal variables. We can get rid of some variables such as "fnlwgt", "educational.num", as they do not provide valuable information for analysis. The remaining continuous variables need to be grouped and converted into a factor variable type.

```{r}
data <- select(data,-c("fnlwgt", "educational.num"))
data[!complete.cases(data),]
```

```{r}
data %>% group_by(education) %>% tally()
data %>% group_by(workclass) %>% tally()
data %>% group_by(marital.status) %>% tally()
data %>% group_by(occupation) %>% tally()
data %>% group_by(relationship) %>% tally()
data %>% group_by(race) %>% tally()
data %>% group_by(gender) %>% tally()
data %>% group_by(native.country) %>% tally()
data %>% group_by(income) %>% tally()
```

```{r}
data <- data %>% mutate(workclass = ifelse(workclass == "?", "Unknown_workclass", workclass))
data <- data %>% mutate(occupation = ifelse(occupation == "?", "Unknown_occupation", occupation))
data <- data %>% mutate(native.country = ifelse(native.country == "?", "Unknown_native.country", native.country))
str(data)
```

```{r}
ggplot(data, aes(age)) +
  geom_histogram(aes(y = ..density..), color = "#000000", fill = "#0099F8") +
  geom_density(color = "#000000", fill = "#F85700", alpha = 0.6)
max(data$age)
min(data$age)


data$age <- cut(data$age,
            breaks = c(0,24,34,44,54,64,74,100),
            labels = c("17-24","25-34","35-44","45-54","55-64","65-74","75+"),
            ordered_result = T)

class(data$age)
data %>% group_by(age) %>% tally()
```

```{r}
ggplot(data, aes(hours.per.week)) +
  geom_histogram(aes(y = ..density..), color = "#000000", fill = "#0099F8") +
  geom_density(color = "#000000", fill = "#F85700", alpha = 0.6)
max(data$hours.per.week)
min(data$hours.per.week)
table(data$hours.per.week)

data$hours.per.week <- cut(data$hours.per.week,
                breaks = c(0,31,40,60,100),
                labels = c("Working_part-time","Working_full-time","Working_long-hours","Possible_workaholic"),
                ordered_result = T)

class(data$hours.per.week)
data %>% group_by(hours.per.week) %>% tally()
```

```{r}
ggplot(data, aes(capital.gain)) +
  geom_histogram(aes(y = ..density..), color = "#000000", fill = "#0099F8") +
  geom_density(color = "#000000", fill = "#F85700", alpha = 0.6)
max(data$capital.gain)
min(data$capital.gain)
table(data$capital.gain)
```

```{r}
x <- data %>% filter(capital.gain > 0)

ggplot(x, aes(capital.gain)) +
  geom_histogram(aes(y = ..density..), color = "#000000", fill = "#0099F8") +
  geom_density(color = "#000000", fill = "#F85700", alpha = 0.6)
max(x$capital.gain)
min(x$capital.gain)
table(x$capital.gain)
```

```{r}
data$capital.gain <- cut(data$capital.gain,
                           breaks = c(-1,0,median(x$capital.gain),max(x$capital.gain)),
                           labels = c("0_capital_gain","below_median_capital_gain","above_median_capital_gain"),
                           ordered_result = T)

class(data$capital.gain)
data %>% group_by(capital.gain) %>% tally()
```

```{r}
y <- data %>% filter(capital.loss > 0)

ggplot(data, aes(capital.loss)) +
  geom_histogram(aes(y = ..density..), color = "#000000", fill = "#0099F8") +
  geom_density(color = "#000000", fill = "#F85700", alpha = 0.6)
max(data$capital.loss)
min(data$capital.loss)
table(data$capital.loss)
```

```{r}
data$capital.loss <- cut(data$capital.loss,
                         breaks = c(-1,0,median(y$capital.loss),max(y$capital.loss)),
                         labels = c("0_capital_loss","below_median_capital_loss","above_median_capital_loss"),
                         ordered_result = T)

class(data$capital.loss)
data %>% group_by(capital.loss) %>% tally()
```

Due to the asymmetric distribution of the variable capital.gain and capital.loss, the median instead of the average of all non-zero income/losses respectively was used in determining the new levels.

```{r}
summary(data)
```

```{r}
trans <- transactions(data)
length(trans)
LIST(head(trans, 2))
```

The finished data prepared for further analysis was saved under the name 'trans'. The number of transactions is 48842, with a fixed length of 13.

# Analysis

## Item frequency

```{r}
head(sort(itemFrequency(trans, type="absolute"), decreasing = T), 20)
head(sort(itemFrequency(trans, type="absolute"), decreasing = F), 20)
itemFrequencyPlot(trans, topN=20, type="relative", main="Frequency plot")
```
The chart above shows the most frequently occurring values in the dataset. As you can see, there are several values that appear in almost every 'transaction'. For example, the frequency of capital.loss=0 is over 90%. This should not come as a surprise, as it was already visible in the data analysis. The same is true for the values native.country=United-States and race=White. Again, this is not surprising as the data relates to people from the United States, where white people made up the vast majority of the population at the time of data collection.

It is therefore to be expected that these most common values will appear with the vast majority of rules.

## Measuring rule interest

- Support of an association rule is the percentage of groups containing all elements in the rule, calculated from the total number of groups examined. It represents the frequency of occurrence of the rule's body and head. Setting a minimum support level when mining can produce more significant results and control the number of rules generated.

- Confidence measures the reliability of an association rule by indicating the frequency at which the head of the rule appears in groups containing the body of the rule. A higher confidence value means the rule is more reliable. Mining settings can include a minimum confidence threshold to produce more significant results and limit the number of generated rules.

- The lift of an association rule indicates its significance by comparing the actual confidence of the rule to its expected confidence, which is calculated as the support of the head divided by the product of the body and head supports. A lift value greater than 1 means the body positively affects the head, a value less than 1 means a negative effect, and a value near 1 means no effect.

## Analysis of the obtained rules - basic parameters

Apriori is an algorithm for discovering frequent item sets and association rules in databases. It starts with individual items with high frequency and builds larger sets while maintaining sufficient frequency. The frequent item sets produced by Apriori reveal general trends in the database.

Given the usefulness of the apriori algorithm and its popularity, it is this algorithm that will be used to further analyse association rules.

```{r}
rules<-apriori(trans, parameter=list(minlen=2))
summary(rules)
```

Basic parameters of the apriori function: support = 0.1 and confidence = 0.8.

```{r}
rules.supp<-sort(rules, by="support", decreasing=TRUE)
inspectDT(rules.supp[is.significant(rules.supp, trans, alpha=0.05)])
```

By sorting the resulting rules by the highest support value, we obtain similar conclusions to those reached by frequency chart analysis. The expected combinations race=White and native.country=United-States, or in combination with capital.loss=0 have the highest support value. This means that these connections have the highest number of rules obtained. It is also worth mentioning that only rules that are statistically significant at the 5% significance level were selected using the is.significant function.

```{r}
rules.conf<-sort(rules, by="confidence", decreasing=TRUE) 
inspectDT(rules.conf[is.significant(rules.conf, trans, alpha=0.05)])
```

Sorting by highest confidence values already yields slightly different conclusions. A value of one for relationship=Husband and gender=Male should not come as a surprise. This means that when a person has the value relationship=Husband he is also male. There is a similar relationship between relationship=Husband and marital.status=Married for obvious reasons. However, there are many other less obvious and interesting connections that are worth exploring.

```{r}
rules.lift<-sort(rules, by="lift", decreasing=TRUE)
inspectDT(rules.lift[is.significant(rules.lift, trans, alpha=0.05)])
```

As described earlier, lift values greater than 1 indicate a positive effect of itemset 𝑋 (body of the rule) on 𝑌 (head of the rule). The highest lift values for the analysed dataset are reached by rules describing the impact of, for example, relationship=Own-child on marital.status=Never-married. Or another obvious example like the influence of marital.status=Married-civ-spouse and gender=Male on {relationship=Husband}.

# Digging the rules - examples of interesting relationships

A properly prepared dataset, together with the association rules detection techniques used, provides a wealth of knowledge about the units under investigation. Thanks to a large number of variables with multiple levels, we are able to investigate the most diverse connections between specific values. It would be impossible to prepare and describe all the interesting connections, hence I limited the selection to the study of rules related to income=>50K and native.country=Poland.

## RHS

I was first interested in the variables and their values that had the greatest impact on income=>50K.

```{r}
rules.income<-apriori(data=trans, parameter=list(supp=0.1,conf = 0.4), appearance=list(default="lhs", rhs="income=>50K"), control=list(verbose=F)) 
rules.income.byconf<-sort(rules.income, by="confidence", decreasing=TRUE)
inspectDT(rules.income.byconf)
```

By placing income=>50K to the right of the rules and adjusting the parameters accordingly, it was possible to obtain 102 rules with a minimum confidence of 0.4 and support = 0.1. For the data analysed, it can be concluded that married white US citizens have the highest probability of high earnings. This rule of thumb reaches a confidence level of 0.464. A good indicator may also be the value of the variable relationship=Husband because it also reaches similar confidence scores. In other rules, variables such as hours.per.week=Working_long-hours and capital.gain=0_capital_gain also appear, which seems reasonable.  

## LHS

The second interesting idea from my perspective to explore was to check the rules determining the impact of the native.country=Poland variable on all the others.

```{r}
rules.Poland<-apriori(data=trans, parameter=list(supp=0.00001, conf = 0.1, minlen = 2), 
appearance=list(default="rhs", lhs="native.country=Poland"), control=list(verbose=F)) 
rules.Poland.byconf<-sort(rules.Poland, by="confidence", decreasing=TRUE)
inspectDT(rules.Poland.byconf)
```

Due to the very small number of people whose native country was Poland, support had to be reduced to as low as 0.00001 in order to comply with the 26 rules. Again, the rules were sorted due to the highest confidence. Unsurprisingly, the highest relationship was achieved between {native.country=Poland} and {race=White}. (the vast majority of Poles are White), {capital.loss=0_capital_loss, capital.gain=0_capital_gain} (significant advantage of these categories over others for the entire dataset). A very high confidence level because 0.805 was also achieved with income=<=50K. This shows that the average Pole surveyed was very likely to have low earnings. Further high values can also be read for rules where there is workclass=Private, gender=Male, hours.per.week=Working_full-time. Thus, it can be concluded that among the surveyed Poles there was a significant preponderance of men who worked privately on a full-time basis. Married people also prevailed among the surveyed Poles. They are also relatively young people (a preponderance of ages 35-44 followed by 25-34) who have graduated from high school.

# Visualisation of the results obtained

```{r}
plot(rules[1:100,], method="matrix", measure="lift")
```

```{r}
plot(rules, measure=c("support","confidence"), shading="lift")
```


```{r}
plot(rules[1:20], method="grouped")
```

## Visualisations for rules with income = >50k

```{r}
plot(rules.income, measure=c("support","confidence"), shading="lift")
```

```{r}
plot(rules.income[1:20,], method="graph")
```


```{r}
plot(rules.income, method="paracoord", control=list(reorder=TRUE))
```

## Interactive plotting

```{r}
set.seed(123)
plot(rules.income, method="graph", measure="support", shading="lift", engine="html")
```

The above charts are a valuable addition to the analysis of association rules. Using the example of the rules with income=>50k, you can perfectly see in the graphs the connections to other variables and their values. Graphical analysis provides the same information as a thorough review of the tables of metrics, hence its conclusions are analogous to those described earlier. However, many times graphical analysis provides a better and easier way to see the complexity and strength of connections between variables and their levels.

# Summary

As stated earlier association rule learning is a method in machine learning used to uncover relationships between variables in a dataset. The technique involves finding common patterns in elements that occur together and generating rules that suggest connections between elements. The method was used in a study and proved to be effective, providing valuable insights into the connections within the data.

By using the Apriori algorithm, valuable information was obtained regarding the relationship between high income and the other variables and their values, as well as between the native country Poland and the other variables. However, this is a small part of the available rules to explore. In further work on the analysis, it would be worthwhile to deepen the interpretations of the results with other interesting variables like race, age, marital status and education.

# References

<https://www.kaggle.com/datasets/wenruliu/adult-income-dataset>

<https://www.ibm.com/docs/en/ias?topic=procedures-association-rules>
