Introduction

Men and women are often said to differ in many respects, and preferences are one of them. The aim of this work is therefore to analyze how stated preferences vary by gender. I use association rules, an unsupervised learning technique that describes and discovers regularities in data.

Dataset

The data were collected in fall 2015 from university students of 21 nationalities, studying various majors in various countries, using a survey form. I found this dataset on Kaggle.

Source: https://www.kaggle.com/hb20007/gender-classification

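library(arules)      # read.transactions(), apriori(), inspect()
library(arulesViz)   # plot() methods for association rules
library(kableExtra)  # kbl(), kable_paper(), scroll_box()
library(magrittr)    # %>% pipe

# Read the survey answers; treat empty strings and "NA" as missing values
dane <- read.csv("Gender.csv", sep = ",", header = TRUE, na.strings = c("", "NA"))

head(dane) %>% kbl() %>% kable_paper("hover") %>% scroll_box(width = "910px")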

Favorite.Color	Favorite.Music.Genre	Favorite.Beverage	Favorite.Soft.Drink	Gender
Cool	Rock	Vodka	7UP/Sprite	F
Neutral	Hip hop	Vodka	Coca Cola/Pepsi	F
Warm	Rock	Wine	Coca Cola/Pepsi	F
Warm	Folk/Traditional	Whiskey	Fanta	F
Cool	Rock	Vodka	Coca Cola/Pepsi	F
Warm	Jazz/Blues	Doesn’t drink	Fanta	F

cat("Number of observations in the dataset:", nrow(dane))

## Number of observations in the dataset: 66

cat("Number of variables in the analysis:", ncol(dane))

## Number of variables in the analysis: 5

Check for missing values

##       Favorite.Color Favorite.Music.Genre    Favorite.Beverage 
##                 "0%"                 "0%"                 "0%" 
##  Favorite.Soft.Drink               Gender 
##                 "0%"                 "0%"

No missing values were found in the data.
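The code that produced the percentages above is not shown. A minimal sketch of one way to compute them in base R follows; the toy data frame `df` and the name `missing_share` are illustrative assumptions, not the original code or data.

```r
# Toy data frame standing in for `dane` (values are illustrative only)
df <- data.frame(
  Favorite.Color = c("Cool", "Warm", NA, "Neutral"),
  Gender         = c("F", "M", "F", "M"),
  stringsAsFactors = FALSE
)

# Share of missing values per column, formatted as a percentage string
missing_share <- sapply(df, function(col) paste0(round(mean(is.na(col)) * 100), "%"))
print(missing_share)
```

Applied to `dane` instead of `df`, the same expression should return "0%" for every column if the data really contain no missing values.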

# Write the data back out and re-read it as transactions; skip = 1 drops the header row
write.csv(dane, file = "transactions.csv", row.names = FALSE)
trans1 <- read.transactions("transactions.csv", format = "basket", sep = ",", skip = 1)

summary(trans1)

## transactions as itemMatrix in sparse format with
##  66 rows (elements/itemsets/transactions) and
##  21 columns (items) and a density of 0.2380952 
## 
## most frequent items:
##            Cool               F               M Coca Cola/Pepsi            Warm 
##              37              33              33              32              22 
##         (Other) 
##             173 
## 
## element (itemset/transaction) length distribution:
## sizes
##  5 
## 66 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       5       5       5       5       5       5 
## 
## includes extended item information - examples:
##            labels
## 1      7UP/Sprite
## 2            Beer
## 3 Coca Cola/Pepsi

itemFrequencyPlot(trans1, topN=25, type="relative", main="ItemFrequency")

The plot above shows the frequency of occurrence of each answer. The most frequent item is the favorite color Cool (the colors reported by respondents were mapped to warm, cool, or neutral).

The Apriori algorithm

To begin, I apply the Apriori algorithm without any detailed assumptions about the rules. I keep the default minimum support of 0.1 and lower the minimum confidence to 0.65 (the arules default is 0.8).

rules<-apriori(trans1, parameter=list(supp=0.1, conf=0.65))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.65    0.1    1 none FALSE            TRUE       5     0.1      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 6 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[21 item(s), 66 transaction(s)] done [0.00s].
## sorting and recoding items ... [18 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [9 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

With these thresholds the algorithm finds 9 rules. Relaxing the restriction on support would increase the number of rules (it can never decrease it).
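The effect of the support threshold on the rule count can be illustrated with a brute-force sketch in base R. This is not the Apriori implementation from arules: the transactions are made up, the helper `count_rules` is hypothetical, and only rules with a single item on each side are considered.

```r
# Toy transactions (illustrative, not the survey data)
trans <- list(
  c("Pop", "F"), c("Pop", "F"), c("Rock", "M"), c("Rock", "F"),
  c("Hip hop", "M"), c("Pop", "M"), c("Rock", "M"), c("Hip hop", "M"),
  c("Jazz", "F")
)

# Count single-item rules {x} => {y} that pass both thresholds
count_rules <- function(trans, min_supp, min_conf) {
  items <- sort(unique(unlist(trans)))
  n <- length(trans)
  kept <- 0
  for (x in items) {
    # Number of transactions containing the left-hand side
    has_x <- sum(vapply(trans, function(tr) x %in% tr, logical(1)))
    for (y in setdiff(items, x)) {
      # Number of transactions containing both sides of the rule
      has_xy <- sum(vapply(trans, function(tr) all(c(x, y) %in% tr), logical(1)))
      if (has_xy / n >= min_supp && has_xy / has_x >= min_conf) {
        kept <- kept + 1
      }
    }
  }
  kept
}

strict  <- count_rules(trans, min_supp = 0.2, min_conf = 0.65)
relaxed <- count_rules(trans, min_supp = 0.1, min_conf = 0.65)
cat("Rules at support 0.2:", strict, "\n")
cat("Rules at support 0.1:", relaxed, "\n")
```

In this toy example the rare but perfectly consistent pair {Jazz} => {F} only passes once the support threshold is lowered, which is the same mechanism that would add rules in the real data.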

set.seed(240)
plot(rules, method="graph", measure="support", shading="lift", main="Graph for 9 rules")

By support, the strongest gender-related link is between women and pop music ({Pop} => {F}, support 0.197).

Confidence

rules.by.conf<-sort(rules, by="confidence", decreasing=TRUE) 
inspect(head(rules.by.conf))

##     lhs            rhs               support   confidence coverage  lift    
## [1] {Hip hop}   => {M}               0.1060606 0.8750000  0.1212121 1.750000
## [2] {F,Rock}    => {Coca Cola/Pepsi} 0.1212121 0.8000000  0.1515152 1.650000
## [3] {Pop}       => {F}               0.1969697 0.7647059  0.2575758 1.529412
## [4] {Cool,Rock} => {Coca Cola/Pepsi} 0.1363636 0.7500000  0.1818182 1.546875
## [5] {Cool,Pop}  => {F}               0.1212121 0.7272727  0.1666667 1.454545
## [6] {Beer}      => {Cool}            0.1363636 0.6923077  0.1969697 1.234927
##     count
## [1]  7   
## [2]  8   
## [3] 13   
## [4]  9   
## [5]  8   
## [6]  9

From the results above, the maximum confidence achieved is 0.875: 87.5% of the respondents who prefer hip hop are men. The two rules with the highest confidence are {Hip hop} => {M} and {F,Rock} => {Coca Cola/Pepsi}.
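As a check on how these measures relate, the figures for {Hip hop} => {M} can be recomputed by hand from the counts reported above. The sketch below uses only numbers from the output; the count of 8 hip hop respondents is derived from the coverage column (0.1212121 × 66), and the variable names are illustrative.

```r
n_total  <- 66  # transactions in the dataset
n_hiphop <- 8   # transactions containing Hip hop (coverage 0.1212121 * 66)
n_both   <- 7   # transactions containing both Hip hop and M (count column)
n_male   <- 33  # transactions containing M

support    <- n_both / n_total                  # share of all respondents matching the rule
confidence <- n_both / n_hiphop                 # share of hip hop fans who are men
lift       <- confidence / (n_male / n_total)   # confidence relative to the base rate of M

print(round(c(support = support, confidence = confidence, lift = lift), 4))
```

A lift of 1.75 means that men are 1.75 times more frequent among hip hop fans than among respondents overall.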

Lift

rules.by.lift<-sort(rules, by="lift", decreasing=TRUE) 
inspect(head(rules.by.lift))

##     lhs            rhs               support   confidence coverage  lift    
## [1] {Hip hop}   => {M}               0.1060606 0.8750000  0.1212121 1.750000
## [2] {F,Rock}    => {Coca Cola/Pepsi} 0.1212121 0.8000000  0.1515152 1.650000
## [3] {Cool,Rock} => {Coca Cola/Pepsi} 0.1363636 0.7500000  0.1818182 1.546875
## [4] {Pop}       => {F}               0.1969697 0.7647059  0.2575758 1.529412
## [5] {Cool,Pop}  => {F}               0.1212121 0.7272727  0.1666667 1.454545
## [6] {Rock}      => {Coca Cola/Pepsi} 0.1969697 0.6842105  0.2878788 1.411184
##     count
## [1]  7   
## [2]  8   
## [3]  9   
## [4] 13   
## [5]  8   
## [6] 13

Support

rules.by.supp<-sort(rules, by="support", decreasing=TRUE) 
inspect(head(rules.by.supp))

##     lhs                       rhs               support   confidence coverage 
## [1] {Pop}                  => {F}               0.1969697 0.7647059  0.2575758
## [2] {Rock}                 => {Coca Cola/Pepsi} 0.1969697 0.6842105  0.2878788
## [3] {Coca Cola/Pepsi,M}    => {Cool}            0.1515152 0.6666667  0.2272727
## [4] {Beer}                 => {Cool}            0.1363636 0.6923077  0.1969697
## [5] {Coca Cola/Pepsi,Rock} => {Cool}            0.1363636 0.6923077  0.1969697
## [6] {Cool,Rock}            => {Coca Cola/Pepsi} 0.1363636 0.7500000  0.1818182
##     lift     count
## [1] 1.529412 13   
## [2] 1.411184 13   
## [3] 1.189189 10   
## [4] 1.234927  9   
## [5] 1.234927  9   
## [6] 1.546875  9

Support values are quite low: even the most frequent rules cover only about 20% of respondents.

plot(rules, method="paracoord", control=list(reorder=TRUE))

There are 9 rules. Rock and pop appear in most of the rules with the highest lift. Surprisingly, alcohol is almost absent from the rules: the only alcoholic beverage that appears is beer, in {Beer} => {Cool}.

Individual rule representation

dane <- na.omit(dane)
pie(sort(table(dane$Gender)), main = "Respondents' gender")

As we can see, the dataset contains an identical number of men and women (33 each), which is good news: neither gender dominates the rules simply by being more frequent.

Females

rules.f<-apriori(data=trans1, parameter=list(supp=0.1,conf = 0.5), 
                      appearance=list(default="lhs", rhs="F"), control=list(verbose=FALSE)) 

# Sort by support (the object name now matches the sort key)
rules.f.bysupp<-sort(rules.f, by="support", decreasing=TRUE)

inspect(head(rules.f.bysupp))

##     lhs                  rhs support   confidence coverage  lift     count
## [1] {}                => {F} 0.5000000 0.5000000  1.0000000 1.000000 33   
## [2] {Coca Cola/Pepsi} => {F} 0.2575758 0.5312500  0.4848485 1.062500 17   
## [3] {Pop}             => {F} 0.1969697 0.7647059  0.2575758 1.529412 13   
## [4] {Warm}            => {F} 0.1969697 0.5909091  0.3333333 1.181818 13   
## [5] {Rock}            => {F} 0.1515152 0.5263158  0.2878788 1.052632 10   
## [6] {Other}           => {F} 0.1363636 0.5000000  0.2727273 1.000000  9

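With a minimum support of 10% and a minimum confidence of 50%, the table shows the six rules for females with the highest support (head() returns only the first six rules). The first of them, {} => {F}, has an empty left-hand side and only restates that half of the respondents are women. Among the remaining rules, {Coca Cola/Pepsi} => {F} has the highest support, while {Pop} => {F} has both the highest confidence and the highest lift.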
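Because exactly half of the respondents are women, the lift of any rule with {F} on the right-hand side is simply its confidence divided by 0.5. The sketch below checks this against three rows of the table above; the values are copied from the output and the variable names are illustrative.

```r
# Confidence and lift as reported for three rules with {F} on the right-hand side
conf          <- c(Pop = 0.7647059, Warm = 0.5909091, Rock = 0.5263158)
lift_reported <- c(Pop = 1.529412,  Warm = 1.181818,  Rock = 1.052632)

p_f <- 33 / 66                  # base rate of women in the dataset
lift_recomputed <- conf / p_f   # lift = confidence / P(rhs)

print(round(lift_recomputed, 6))
print(all(abs(lift_recomputed - lift_reported) < 1e-5))
```

The same relation holds for the rules with {M} on the right-hand side, since men also make up half of the sample.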

plot(rules.f, method="graph")

For females the biggest support is for Coca Cola/Pepsi and the highest lift for pop music.

Males

rules.m<-apriori(data=trans1, parameter=list(supp=0.1,conf = 0.5), 
                      appearance=list(default="lhs", rhs="M"), control=list(verbose=FALSE)) 

# Sort by support (the object name now matches the sort key)
rules.m.bysupp<-sort(rules.m, by="support", decreasing=TRUE)

inspect(head(rules.m.bysupp))

##     lhs                       rhs support   confidence coverage  lift     count
## [1] {}                     => {M} 0.5000000 0.5000000  1.0000000 1.000000 33   
## [2] {Cool}                 => {M} 0.3030303 0.5405405  0.5606061 1.081081 20   
## [3] {Coca Cola/Pepsi,Cool} => {M} 0.1515152 0.5555556  0.2727273 1.111111 10   
## [4] {Doesn't drink}        => {M} 0.1363636 0.6428571  0.2121212 1.285714  9   
## [5] {Other}                => {M} 0.1363636 0.5000000  0.2727273 1.000000  9   
## [6] {Fanta}                => {M} 0.1212121 0.5714286  0.2121212 1.142857  8

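For men, the table again shows the six rules with the highest support. Among them, {Cool} => {M} has the highest support (again ignoring the empty rule {} => {M}), while {Doesn't drink} => {M} has the highest confidence and the highest lift. The rule {Hip hop} => {M} from the first run (support 0.106, confidence 0.875, lift 1.75) also satisfies these thresholds but falls just outside the top six by support.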

plot(rules.m, method="graph")

For males the biggest support is for Cool and the highest lift for hip hop.

Summary

In this article, I used association rules to discover regularities between items. I came to the conclusion that, in general, there are no rules as strong as might be expected. However, there is a visible connection between women and pop music and between men and hip hop.

Association Rules for Gender Classification.
