Introduction

Women and men are often assumed to differ in many respects, and preferences are one of them. The aim of this work is therefore to analyze preferences by gender. I use association rules, an unsupervised learning technique that describes and discovers regularities in data.

Dataset

The data was collected in fall 2015 from university students of 21 nationalities studying various majors in various countries, using a survey form. I found this dataset on Kaggle.

Source: https://www.kaggle.com/hb20007/gender-classification

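library(arules)      # read.transactions(), apriori(), inspect()
library(arulesViz)   # plot() methods for rules
library(kableExtra)  # kbl(), kable_paper(), scroll_box(); also provides %>%

# Treat empty strings and "NA" as missing values
dane <- read.csv("Gender.csv", sep = ",", header = TRUE, na.strings = c("", "NA"))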

head(dane) %>% kbl() %>% kable_paper("hover") %>% scroll_box(width = "910px")
Favorite.Color  Favorite.Music.Genre  Favorite.Beverage  Favorite.Soft.Drink  Gender
Cool            Rock                  Vodka              7UP/Sprite           F
Neutral         Hip hop               Vodka              Coca Cola/Pepsi      F
Warm            Rock                  Wine               Coca Cola/Pepsi      F
Warm            Folk/Traditional      Whiskey            Fanta                F
Cool            Rock                  Vodka              Coca Cola/Pepsi      F
Warm            Jazz/Blues            Doesn’t drink      Fanta                F
cat("Number of observations in the dataset:", nrow(dane))
## Number of observations in the dataset: 66
cat("Number of variables in the analysis:", ncol(dane))
## Number of variables in the analysis: 5

Check for missing values

##       Favorite.Color Favorite.Music.Genre    Favorite.Beverage 
##                 "0%"                 "0%"                 "0%" 
##  Favorite.Soft.Drink               Gender 
##                 "0%"                 "0%"

No missing values were found in the data.
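The per-column percentages above could be produced with a helper along these lines. This is a minimal base R sketch, not necessarily the code behind the output shown; `missing_share` and the toy data frame are hypothetical names introduced only for illustration.

```r
# Share of missing values per column, formatted as a percentage string.
# `missing_share` is a hypothetical helper, not part of any package.
missing_share <- function(df) {
  sapply(df, function(col) paste0(round(100 * mean(is.na(col))), "%"))
}

# Toy data frame (made up for illustration) with one NA in `Gender`.
toy <- data.frame(
  Favorite.Color = c("Cool", "Warm", "Neutral", "Cool"),
  Gender = c("F", NA, "M", "M")
)

res <- missing_share(toy)
print(res)
```

Calling `missing_share(dane)` on the survey data would give one entry per column, in the same layout as the output above.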

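# Write the data without row names, then re-read it as transactions;
# skip = 1 drops the header line so column names are not treated as items
write.csv(dane, file = "transactions.csv", row.names = FALSE)
trans1 <- read.transactions("transactions.csv", format = "basket", sep = ",", skip = 1)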
summary(trans1)
## transactions as itemMatrix in sparse format with
##  66 rows (elements/itemsets/transactions) and
##  21 columns (items) and a density of 0.2380952 
## 
## most frequent items:
##            Cool               F               M Coca Cola/Pepsi            Warm 
##              37              33              33              32              22 
##         (Other) 
##             173 
## 
## element (itemset/transaction) length distribution:
## sizes
##  5 
## 66 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       5       5       5       5       5       5 
## 
## includes extended item information - examples:
##            labels
## 1      7UP/Sprite
## 2            Beer
## 3 Coca Cola/Pepsi
itemFrequencyPlot(trans1, topN=25, type="relative", main="ItemFrequency") 

The plot above shows the relative frequency of each answer. The most frequent item is the favorite color Cool, chosen by 37 of the 66 respondents (colors reported by respondents were mapped to warm, cool, or neutral).

The Apriori algorithm

To begin, I apply the Apriori algorithm without restricting which items may appear on either side of a rule. The minimum support is 0.1, which is the arules default, and the minimum confidence is lowered from the default 0.8 to 0.65.

rules<-apriori(trans1, parameter=list(supp=0.1, conf=0.65)) 
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.65    0.1    1 none FALSE            TRUE       5     0.1      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 6 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[21 item(s), 66 transaction(s)] done [0.00s].
## sorting and recoding items ... [18 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [9 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

With these settings the algorithm finds 9 rules. Lowering the minimum support would increase the number of rules.
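The effect of the support threshold can be illustrated on a small example. The sketch below uses made-up toy baskets rather than the survey data, and `count_rules` is a hypothetical helper; it only shows that the number of rules cannot shrink as the minimum support is lowered.

```r
library(arules)

# Toy baskets (hypothetical data, not the survey): one vector per respondent.
baskets <- list(
  c("Cool", "Rock", "F"),
  c("Cool", "Pop", "F"),
  c("Warm", "Pop", "F"),
  c("Cool", "Hip hop", "M"),
  c("Cool", "Rock", "M"),
  c("Warm", "Hip hop", "M")
)
toy <- as(baskets, "transactions")

# Number of rules found at a given minimum support (confidence fixed at 0.65).
count_rules <- function(min_support) {
  rules <- apriori(toy,
                   parameter = list(supp = min_support, conf = 0.65),
                   control = list(verbose = FALSE))
  length(rules)
}

# Thresholds from strict to relaxed; the counts are non-decreasing.
n_rules <- sapply(c(0.5, 0.3, 0.1), count_rules)
print(n_rules)
```

The same loop could be run on `trans1` to choose a support threshold that yields a manageable number of rules.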

set.seed(240) 
plot(rules, method="graph", measure="support", shading="lift", main="Graph for 9 rules")

Confidence

rules.by.conf<-sort(rules, by="confidence", decreasing=TRUE) 
inspect(head(rules.by.conf))
##     lhs            rhs               support   confidence coverage  lift    
## [1] {Hip hop}   => {M}               0.1060606 0.8750000  0.1212121 1.750000
## [2] {F,Rock}    => {Coca Cola/Pepsi} 0.1212121 0.8000000  0.1515152 1.650000
## [3] {Pop}       => {F}               0.1969697 0.7647059  0.2575758 1.529412
## [4] {Cool,Rock} => {Coca Cola/Pepsi} 0.1363636 0.7500000  0.1818182 1.546875
## [5] {Cool,Pop}  => {F}               0.1212121 0.7272727  0.1666667 1.454545
## [6] {Beer}      => {Cool}            0.1363636 0.6923077  0.1969697 1.234927
##     count
## [1]  7   
## [2]  8   
## [3] 13   
## [4]  9   
## [5]  8   
## [6]  9

From the results above, the maximum confidence achieved is 0.875, for {Hip hop} => {M}: 7 of the 8 respondents who prefer hip hop are male. The next strongest rule is {F,Rock} => {Coca Cola/Pepsi}, with a confidence of 0.80. Both rules rest on small counts (7 and 8 respondents), so they should be read with caution.
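The measures reported for {Hip hop} => {M} can be reproduced by hand from the counts in the output. The following is a small arithmetic check in base R; the variable names are illustrative, and the counts are read off the tables above (coverage 0.1212121 of 66 respondents is 8, and the item frequency summary lists 33 men).

```r
# Counts read off the output above.
n_total  <- 66  # all respondents
n_hiphop <- 8   # respondents who chose Hip hop (coverage * 66)
n_both   <- 7   # Hip hop and male (the "count" column)
n_male   <- 33  # all male respondents

support    <- n_both / n_total                 # share of respondents matching the whole rule
confidence <- n_both / n_hiphop                # P(M | Hip hop)
lift       <- confidence / (n_male / n_total)  # confidence relative to the base rate of M

print(round(c(support = support, confidence = confidence, lift = lift), 4))
```

A lift of 1.75 means that hip hop fans are 1.75 times as likely to be male as a randomly chosen respondent.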

Lift

rules.by.lift<-sort(rules, by="lift", decreasing=TRUE) 
inspect(head(rules.by.lift))
##     lhs            rhs               support   confidence coverage  lift    
## [1] {Hip hop}   => {M}               0.1060606 0.8750000  0.1212121 1.750000
## [2] {F,Rock}    => {Coca Cola/Pepsi} 0.1212121 0.8000000  0.1515152 1.650000
## [3] {Cool,Rock} => {Coca Cola/Pepsi} 0.1363636 0.7500000  0.1818182 1.546875
## [4] {Pop}       => {F}               0.1969697 0.7647059  0.2575758 1.529412
## [5] {Cool,Pop}  => {F}               0.1212121 0.7272727  0.1666667 1.454545
## [6] {Rock}      => {Coca Cola/Pepsi} 0.1969697 0.6842105  0.2878788 1.411184
##     count
## [1]  7   
## [2]  8   
## [3]  9   
## [4] 13   
## [5]  8   
## [6] 13

Support

rules.by.supp<-sort(rules, by="support", decreasing=TRUE) 
inspect(head(rules.by.supp))
##     lhs                       rhs               support   confidence coverage 
## [1] {Pop}                  => {F}               0.1969697 0.7647059  0.2575758
## [2] {Rock}                 => {Coca Cola/Pepsi} 0.1969697 0.6842105  0.2878788
## [3] {Coca Cola/Pepsi,M}    => {Cool}            0.1515152 0.6666667  0.2272727
## [4] {Beer}                 => {Cool}            0.1363636 0.6923077  0.1969697
## [5] {Coca Cola/Pepsi,Rock} => {Cool}            0.1363636 0.6923077  0.1969697
## [6] {Cool,Rock}            => {Coca Cola/Pepsi} 0.1363636 0.7500000  0.1818182
##     lift     count
## [1] 1.529412 13   
## [2] 1.411184 13   
## [3] 1.189189 10   
## [4] 1.234927  9   
## [5] 1.234927  9   
## [6] 1.546875  9

Support values are fairly low: even the most frequent rules cover just under 20% of respondents (13 of 66).

plot(rules, method="paracoord", control=list(reorder=TRUE))

Individual rule representation

dane <- na.omit(dane)
pie(sort(table(dane$Gender)), main = "Respondents' gender")

As the chart shows, the data set contains the same number of men and women (33 each). This is convenient, because the baseline confidence of any rule predicting gender is then 0.5.

Females

# Only rules with F on the right-hand side; any other item may appear on the left
rules.f <- apriori(data = trans1, parameter = list(supp = 0.1, conf = 0.5),
                   appearance = list(default = "lhs", rhs = "F"), control = list(verbose = FALSE))

rules.f.bysupp <- sort(rules.f, by = "support", decreasing = TRUE)

inspect(head(rules.f.bysupp))
##     lhs                  rhs support   confidence coverage  lift     count
## [1] {}                => {F} 0.5000000 0.5000000  1.0000000 1.000000 33   
## [2] {Coca Cola/Pepsi} => {F} 0.2575758 0.5312500  0.4848485 1.062500 17   
## [3] {Pop}             => {F} 0.1969697 0.7647059  0.2575758 1.529412 13   
## [4] {Warm}            => {F} 0.1969697 0.5909091  0.3333333 1.181818 13   
## [5] {Rock}            => {F} 0.1515152 0.5263158  0.2878788 1.052632 10   
## [6] {Other}           => {F} 0.1363636 0.5000000  0.2727273 1.000000  9

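With a minimum support of 10% and a minimum confidence of 50%, the six rules with the highest support for females are shown above. The first, {} => {F}, is trivial: it only restates that half of the respondents are female. Among the remaining rules, {Coca Cola/Pepsi} => {F} has the highest support (0.26), but its lift of 1.06 indicates almost no association. {Pop} => {F} has both the highest confidence (0.76) and the highest lift (1.53).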
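The first rule in the listing above, {} => {F}, has an empty left-hand side and only reflects the share of women in the sample. One way to leave such rules out is the `minlen` parameter of `apriori`. The sketch below assumes made-up toy baskets (not the survey data) and sorts by lift rather than support.

```r
library(arules)

# Toy baskets (hypothetical data, not the survey): one vector per respondent.
baskets <- list(
  c("Cool", "Rock", "F"),
  c("Cool", "Pop", "F"),
  c("Warm", "Pop", "F"),
  c("Cool", "Hip hop", "M"),
  c("Cool", "Rock", "M"),
  c("Warm", "Hip hop", "M")
)
toy <- as(baskets, "transactions")

# minlen = 2 requires at least one item on the left-hand side,
# which excludes the trivial rule {} => {F}.
rules_f <- apriori(toy,
                   parameter = list(supp = 0.1, conf = 0.5, minlen = 2),
                   appearance = list(default = "lhs", rhs = "F"),
                   control = list(verbose = FALSE))

inspect(sort(rules_f, by = "lift"))
```

Applied to `trans1`, the same `minlen = 2` setting would remove the baseline rule from the listing, so that `head()` shows six informative rules.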

plot(rules.f, method="graph")

For females, the highest support is for Coca Cola/Pepsi and the highest lift is for pop music.

Males

# Only rules with M on the right-hand side; any other item may appear on the left
rules.m <- apriori(data = trans1, parameter = list(supp = 0.1, conf = 0.5),
                   appearance = list(default = "lhs", rhs = "M"), control = list(verbose = FALSE))

rules.m.bysupp <- sort(rules.m, by = "support", decreasing = TRUE)

inspect(head(rules.m.bysupp))
##     lhs                       rhs support   confidence coverage  lift     count
## [1] {}                     => {M} 0.5000000 0.5000000  1.0000000 1.000000 33   
## [2] {Cool}                 => {M} 0.3030303 0.5405405  0.5606061 1.081081 20   
## [3] {Coca Cola/Pepsi,Cool} => {M} 0.1515152 0.5555556  0.2727273 1.111111 10   
## [4] {Doesn't drink}        => {M} 0.1363636 0.6428571  0.2121212 1.285714  9   
## [5] {Other}                => {M} 0.1363636 0.5000000  0.2727273 1.000000  9   
## [6] {Fanta}                => {M} 0.1212121 0.5714286  0.2121212 1.142857  8

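For men, the six rules with the highest support are shown above. Apart from the trivial {} => {M}, the rule with the highest support is {Cool} => {M} (0.30). Among the displayed rules, {Doesn't drink} => {M} has the highest confidence (0.64) and lift (1.29). The strongest rule for men overall, {Hip hop} => {M} (confidence 0.875, lift 1.75), appears in the earlier confidence ranking but not here, because its support of 0.106 places it below the six displayed rules.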

plot(rules.m, method="graph")

For males, the highest support is for the color Cool and the highest lift is for hip hop.
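The genre-to-gender confidences can be cross-checked without arules by tabulating counts. This is a sketch in base R; the counts are derived from the rule tables above (the count column, and coverage multiplied by 66), and the object names are illustrative.

```r
# Counts derived from the rule tables above.
counts <- matrix(
  c(13, 4,   # Pop:     13 F, 17 - 13 = 4 M
     1, 7,   # Hip hop:  8 - 7 = 1 F, 7 M
    10, 9),  # Rock:    10 F, 19 - 10 = 9 M
  ncol = 2, byrow = TRUE,
  dimnames = list(genre = c("Pop", "Hip hop", "Rock"), gender = c("F", "M"))
)

# Row-wise proportions are the confidences of {genre} => {gender}.
conf <- prop.table(counts, margin = 1)
print(round(conf, 3))
```

Because both genders have a base rate of 0.5, dividing each confidence by 0.5 gives the corresponding lift.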

Summary

In this article, I used association rules to discover relationships between items. I conclude that, in general, the rules are not as strong as might be expected. However, there is a visible connection between women and pop music and between men and hip hop.