Name: Subhankar Pattnaik ID: 71710059 Place: Bangalore
Step 1: Download the data on cosmetics purchases (Cosmetics.xls) from the textbook website (http://dataminingbook.com/).
Step 2: Using R or any other tool you are familiar with, apply association rules to these data. You may choose the threshold support and confidence in such a way that you get approximately 15-20 rules, with at least a few of them having a Lift Ratio greater than 1.
Step 3: Order those rules in decreasing order of Lift Ratio.
library("Matrix")
## Warning: package 'Matrix' was built under R version 3.4.1
library("arules")
## Warning: package 'arules' was built under R version 3.4.1
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
library("arulesViz")
## Warning: package 'arulesViz' was built under R version 3.4.1
## Loading required package: grid
library("xlsx")
## Loading required package: rJava
## Loading required package: xlsxjars
Cosmetics_data <- read.xlsx("Y:\\Knowledge Repository\\ISB\\DMG\\IndividualAssignment2\\Cosmetics\\Cosmetics.xlsx", "Data", header=TRUE)
rules = apriori(as.matrix(Cosmetics_data[,3:16]), parameter=list(support=0.2, confidence=0.45, minlen=2, target = "rules"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.45 0.1 1 none FALSE TRUE 5 0.2 2
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 200
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[14 item(s), 1000 transaction(s)] done [0.00s].
## sorting and recoding items ... [11 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [19 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
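As a side note, the same rules can be mined after first converting the purchase columns into an explicit arules transactions object. This is only a sketch, assuming columns 3:16 hold 0/1 purchase indicators; it should reproduce the same 19 rules.
# Sketch (assumption: columns 3:16 are 0/1 indicators); convert to a logical
# matrix and then to a "transactions" object before running apriori
trans <- as(as.matrix(Cosmetics_data[, 3:16]) == 1, "transactions")
rules_check <- apriori(trans, parameter = list(support = 0.2, confidence = 0.45,
                                               minlen = 2, target = "rules"))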
a) Please include (copy-paste) the output (the first fifteen rules along with header and input parameter details) in your submission.
Input: rules = apriori(as.matrix(Cosmetics_data[,3:16]), parameter=list(support=0.2, confidence=0.45, minlen=2, target = "rules"))
I selected a support threshold of 0.2 and a confidence threshold of 0.45.
With these inputs I obtained 19 rules. Below is the output displaying only the first 15 rules, arranged in descending order of lift ratio.
Output:
inspect(head(sort(rules, by="lift"), 15))
## lhs rhs support confidence lift
## [1] {Mascara} => {Eye.shadow} 0.321 0.8991597 2.359999
## [2] {Eye.shadow} => {Mascara} 0.321 0.8425197 2.359999
## [3] {Concealer} => {Eyeliner} 0.297 0.6719457 1.470341
## [4] {Eyeliner} => {Concealer} 0.297 0.6498906 1.470341
## [5] {Blush} => {Concealer} 0.220 0.6060606 1.371178
## [6] {Concealer} => {Blush} 0.220 0.4977376 1.371178
## [7] {Lip.Gloss} => {Foundation} 0.356 0.7265306 1.355468
## [8] {Foundation} => {Lip.Gloss} 0.356 0.6641791 1.355468
## [9] {Mascara} => {Concealer} 0.204 0.5714286 1.292825
## [10] {Concealer} => {Mascara} 0.204 0.4615385 1.292825
## [11] {Eye.shadow} => {Concealer} 0.201 0.5275591 1.193573
## [12] {Concealer} => {Eye.shadow} 0.201 0.4547511 1.193573
## [13] {Eye.shadow} => {Lip.Gloss} 0.201 0.5275591 1.076651
## [14] {Eye.shadow} => {Foundation} 0.211 0.5538058 1.033220
## [15] {Eyeliner} => {Lip.Gloss} 0.227 0.4967177 1.013710
b) What is the support of the first rule? Explain how it has been calculated for this rule.
Support of the first rule is 0.321. The support of a rule is the percentage of transactions in which both the antecedent (IF) and the consequent (THEN) item sets appear. Interpreting this value, Mascara and Eye.shadow occur together in 32.1% of the 1000 transactions.
It has been calculated by:
Support = (Number of transactions where both Mascara and Eye.shadow occur) / (Total no. of transactions)
=> Support = length(which(Cosmetics_data$Mascara == 1 & Cosmetics_data$Eye.shadow == 1)) / length(Cosmetics_data$Trans..)
=> Support = 321 / 1000 = 0.321
c) What is the confidence of the first rule? Explain how it has been calculated for this rule.
Confidence of the first rule is 0.8991597. The confidence of a rule is the percentage of transactions containing the antecedent (IF) item set that also contain the consequent (THEN) item set. Interpreting this value, Eye.shadow is purchased in about 89.9% of the transactions in which Mascara is purchased.
It has been calculated by:
Confidence = (Number of transactions where both Mascara and Eye.shadow occur) / (No. of transactions having Mascara)
=> Confidence = length(which(Cosmetics_data$Mascara == 1 & Cosmetics_data$Eye.shadow == 1)) / length(which(Cosmetics_data$Mascara == 1))
=> Confidence = 321 / 357 = 0.8991597
d) What is the lift ratio of the first rule? Explain how it has been calculated.
Lift Ratio of the first rule is 2.3599. The lift ratio is calculated as Confidence / Benchmark Confidence, i.e. Lift Ratio = P(C|A) / P(C). It tells us how far the antecedent and consequent are from being independent: a lift ratio greater than 1 means the two item sets occur together more often than would be expected if they were independent.
From the value of 2.3599 we can tell that Mascara and Eye.shadow are not bought together merely by chance: a customer who buys Mascara is about 2.36 times as likely to also buy Eye.shadow as a randomly chosen customer.
It has been calculated by:
=> Lift Ratio = Confidence / Benchmark Confidence
=> Lift Ratio = ((Number of transactions where both Mascara and Eye.shadow occur) / (No. of transactions having Mascara)) / (% of transactions having Eye.shadow)
=> Lift Ratio = (length(which(Cosmetics_data$Mascara == 1 & Cosmetics_data$Eye.shadow == 1)) / length(which(Cosmetics_data$Mascara == 1))) / (length(which(Cosmetics_data$Eye.shadow == 1)) / 1000)
=> Lift Ratio = (321 / 357) / (381/1000) = 2.359999
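These hand calculations can be cross-checked against the measures arules itself stores on the rule; a minimal sketch using the objects created above:
# Cross-check: support, confidence and lift reported by arules for the top rule
top_rule <- head(sort(rules, by = "lift"), 1)
quality(top_rule)   # expected: support 0.321, confidence 0.899, lift 2.36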
e) Reviewing the first fifteen rules, comment on their redundancy (read as constructed from same item set/tuple). How many distinct rules did you find from the first 15 rules?
There are 9 distinct rules among the first 15. Six of the first 15 rules are redundant.
The first 12 rules form 6 mirror-image pairs: each pair is built from the same item set, with the antecedent (IF) and consequent (THEN) swapped, and therefore has identical support and lift ratio (only the confidence differs). Dropping one rule from each pair and keeping rules 13, 14 and 15 leaves 9 distinct rules.
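A sketch of how the mirror-image duplicates can be dropped programmatically; items() returns the union of the LHS and RHS of each rule, so the two rules in a pair map to the same item set:
top15 <- head(sort(rules, by = "lift"), 15)
inspect(top15[!duplicated(items(top15))])   # keeps one rule per distinct item set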
f) Interpret the first three distinct (i.e., excluding the redundant ones, if any, as defined above) rules, in the output, in words.
Below are the first 3 distinct rules.
[1] {Mascara} => {Eye.shadow}
[3] {Concealer} => {Eyeliner}
[5] {Blush} => {Concealer}
The first rule says that Mascara and Eye.shadow appear together in 32.1% of all transactions. The confidence is 0.8991 for Mascara => Eye.shadow (0.8425 in the reverse direction), i.e. roughly 85%-90% either way, and the lift ratio is 2.3599, so the two items are strongly associated: a large share of customers purchase both at the same time.
Similarly, Concealer & Eyeliner appear together in 29.7% of all transactions with a lift ratio of 1.4703, while Blush & Concealer appear together in 22.0% of all transactions with a lift ratio of 1.3711.
Looking at these rules, Blush, Concealer & Eyeliner could also be combined into a single three-item rule.
g) Based on the distinct rules that you identified in Part (f), suggest some action that’ll benefit the business owner.
Based on the distinct rules retrieved in (f), Concealer appears with both Eyeliner and Blush at almost the same support, so the owner can stock all three items close to each other in the store; this raises the chance of a customer picking up all three in a single transaction. Since Mascara & Eye.shadow have both high confidence and a high lift ratio, sales can be boosted further by introducing a discounted combo offer for the pair. A similar offer would work for Concealer, Eyeliner and Blush.
Step 1: Download (1) airports.dat, (2) airlines.dat, and (3) routes.dat from http://openflights.org/data.html or use the files that were shared with you on LMS while doing the in-class exercise on airport networks. Refer to the data descriptions provided on the site itself. You may use the R scripts shown in class (also uploaded on LMS), or any other software packages that you are familiar with to complete the exercise.
Step 2: Create a directed network graph of airline routes using routes.dat
Step 3: Create community-based clusters. If you are using R, choose leading.eigenvector approach. If you are using any other software, use the equivalent algorithm provided with the software.
Step 4: Answer the following questions.
#try(require("igraph") || install.packages("igraph", repos=("http://cran.mirrors.hoobly.com")))
library("igraph")
## Warning: package 'igraph' was built under R version 3.4.1
##
## Attaching package: 'igraph'
## The following object is masked from 'package:arules':
##
## union
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
airline_routes <- read.csv("Y:\\Knowledge Repository\\ISB\\DMG\\IndividualAssignment2\\Airline\\routes.dat", header=FALSE)
colnames(airline_routes) <- c("Airline", "Airline ID", "Source Airport","Source Airport ID","Destination Airport","Destination Airport ID","Codeshare","Stops","Equipment")
Airline Routes
head(airline_routes)
## Airline Airline ID Source Airport Source Airport ID Destination Airport
## 1 2B 410 AER 2965 KZN
## 2 2B 410 ASF 2966 KZN
## 3 2B 410 ASF 2966 MRV
## 4 2B 410 CEK 2968 KZN
## 5 2B 410 CEK 2968 OVB
## 6 2B 410 DME 4029 KZN
## Destination Airport ID Codeshare Stops Equipment
## 1 2990 0 CR2
## 2 2990 0 CR2
## 3 2962 0 CR2
## 4 2990 0 CR2
## 5 4078 0 CR2
## 6 2990 0 CR2
Creating a directed network graph of airline routes
AirlineNW_Directed <- graph.edgelist(as.matrix(airline_routes[,c(3,5)]), directed=TRUE)
A <- leading.eigenvector.community(AirlineNW_Directed)
## Warning in leading.eigenvector.community(AirlineNW_Directed): At
## community.c:1581 :This method was developed for undirected graphs
head(membership(A),100)
## AER KZN ASF MRV CEK OVB DME NBC TGK UUA EGO KGD GYD LED SVX NJC NUX BTK
## 1 1 1 1 1 1 1 1 1 1 1 9 9 9 1 1 1 1
## IKT HTA KCK ODO UKX ULK YKS MJZ AYP LIM CUZ PEM HUU IQT PCL TPP ABJ BOY
## 1 1 1 1 1 1 1 1 19 19 19 19 19 19 19 19 9 11
## OUA ACC BKO DKR COO LFW NIM BOG GYE UIO CLO SCY OCC BDS ZRH BOD BRS GVA
## 9 9 9 9 9 9 9 19 19 19 19 19 19 9 9 9 9 9
## LPA LCA RMF TFS AJR LYC ARN GEV HAD JKG KRF KSD MHQ OER POR TRF VBY VHM
## 9 9 9 9 11 9 9 9 9 9 9 9 9 9 9 9 9 9
## VXO HMV KOK TKU OSL ADQ AOS KKB KLN KOZ OLH KZB SYB KYK ORI KPR BSO MNL
## 9 11 9 9 9 11 11 11 11 11 11 11 11 11 11 11 11 10
## BXU CBO CGY CRM DGT DWC GES KLO LGP MPH
## 11 11 17 11 11 1 11 10 11 11
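Note that the warning above points out that leading.eigenvector was developed for undirected graphs. A sketch of an alternative, collapsing the route graph to an undirected one before detecting communities (the resulting membership may differ slightly from the one shown above):
AirlineNW_Undirected <- as.undirected(AirlineNW_Directed, mode = "collapse")
A_undirected <- leading.eigenvector.community(AirlineNW_Undirected)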
Plot of directed network graph
plot(AirlineNW_Directed)
Plot of community-based clusters
plot(A,AirlineNW_Directed)
(a) What would you call a community in a social-media network? Intuitive, qualitative answers are expected/acceptable.
Generally, a community is formed around similar characteristics, attributes or interests. In a social-media network, a community could be people who share an interest in a topic (movies, technology, food, places, etc.), people from the same school/college/organization, or people grouped by demographics such as geographical location, nationality, religion or economic status.
(b) Extend the definition of community (as you suggested above in Part a) to the community of airports.
If we extend the definition of community to the community of airports, then we can segregate the airports into domestic airports, international airports, community (general aviation) airports, military airports, unclassified airports, etc. In network terms, a community of airports can also be seen as a group of airports that are far more densely connected to one another by routes than to the rest of the network, for example the airports serving one country or region.
(c) How many distinct airports are there in the dataset? How many communities of airports got identified? List the number of airports in each cluster/community in a table.
There are a total of 3425 distinct airports.
A$vcount
## [1] 3425
A total of 25 communities were identified.
length(A)
## [1] 25
List of number of airports in each cluster/community
sizes(A)
## Community sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 137 2 4 2 4 2 4 10 635 340 776 334 274 1 6 1 2 21
## 19 20 21 22 23 24 25
## 198 656 1 1 1 1 12
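The same sizes can be laid out as a two-column table (a small sketch, no new packages assumed):
data.frame(Community = as.integer(names(sizes(A))),
           Airports  = as.vector(sizes(A)))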
Step 5: Compute the centralities (in-degree, out-degree, in-closeness, eigenvector, betweenness) of each airport. Now, run k-Means clustering to group the airports based on their centralities alone. Take k equal to the number of communities you obtained in Part c, above.
How many airports are there in the network?
vcount(AirlineNW_Directed)
## [1] 3425
How many connections are there in the network?
ecount(AirlineNW_Directed)
## [1] 67663
In-Degree
indegree <- degree(AirlineNW_Directed, mode="in")
head(indegree)
## AER KZN ASF MRV CEK OVB
## 26 28 8 22 20 90
Out-Degree
outdegree <- degree(AirlineNW_Directed, mode="out")
head(outdegree)
## AER KZN ASF MRV CEK OVB
## 26 28 8 22 20 87
Closeness
closeness_in <- closeness(AirlineNW_Directed, mode="in", normalized = TRUE)
head(closeness_in)
## AER KZN ASF MRV CEK OVB
## 0.01801138 0.01801868 0.01796951 0.01805165 0.01797724 0.01813761
Betweenness
btwn <- betweenness(AirlineNW_Directed, normalized = TRUE)
head(btwn)
## AER KZN ASF MRV CEK
## 2.057750e-05 2.211094e-05 3.970297e-07 2.315226e-05 2.473746e-06
## OVB
## 1.385724e-03
Eigen Centrality
eigenv <- eigen_centrality(AirlineNW_Directed, directed = TRUE, scale = FALSE, weights = NULL)
eigenvVec <- eigenv$vector
head(eigenvVec)
## AER KZN ASF MRV CEK OVB
## 0.002587475 0.002603360 0.001008290 0.002982193 0.001856655 0.009060106
Combining all the 5 centralities
centralities <- cbind(indegree, outdegree, closeness_in, btwn, eigenvVec)
colnames(centralities) <- c("inDegree","outDegree","closenessIn","betweenness","eigenVector")
head(centralities)
## inDegree outDegree closenessIn betweenness eigenVector
## AER 26 26 0.01801138 2.057750e-05 0.002587475
## KZN 28 28 0.01801868 2.211094e-05 0.002603360
## ASF 8 8 0.01796951 3.970297e-07 0.001008290
## MRV 22 22 0.01805165 2.315226e-05 0.002982193
## CEK 20 20 0.01797724 2.473746e-06 0.001856655
## OVB 90 87 0.01813761 1.385724e-03 0.009060106
K-Means Clustering
centralitiesdf <- as.data.frame(centralities)
norm_centralities <- scale(centralitiesdf)
fit <- kmeans(norm_centralities, centers=25, iter.max=10)
## Warning: did not converge in 10 iterations
fit$centers
## inDegree outDegree closenessIn betweenness eigenVector
## 1 2.25884866 2.26204586 0.26133253 0.27411559 1.75964799
## 2 -0.33917588 -0.33814280 -0.26106571 -0.14311286 -0.32160949
## 3 0.01763672 0.01798590 0.19636413 -0.15150797 -0.12779498
## 4 6.62778625 6.59297530 0.33375627 7.02556884 6.26662998
## 5 -0.34419769 -0.34041832 -8.41655715 -0.16394425 -0.32161001
## 6 10.41950698 10.45774623 0.35480173 13.05595374 10.99255603
## 7 4.82810221 4.83787116 0.30843461 1.74104349 4.86934224
## 8 3.31407487 3.29807336 0.29982896 3.50143070 2.60697587
## 9 0.65568375 0.64920570 0.23423886 1.11863709 0.27629183
## 10 0.19589449 0.20679158 0.16360191 3.03911691 -0.16011137
## 11 0.84026634 0.83733424 0.23969758 -0.03291130 0.63041135
## 12 3.21721367 3.21167689 0.26425178 0.16643940 3.23118579
## 13 -0.14621444 -0.14296702 0.09384112 0.42236992 -0.26483858
## 14 1.17186702 1.18196558 0.26025306 0.03170901 2.56953890
## 15 1.39153836 1.39045749 0.24706381 0.04726597 1.12674174
## 16 1.53436756 1.51885439 0.27438381 2.13824756 1.13125220
## 17 2.63417087 2.67457437 0.28817981 10.05504548 2.18285823
## 18 0.36652734 0.36697974 0.22285299 -0.03438075 0.12393817
## 19 -0.12724797 -0.12770249 0.20725313 -0.19383877 0.08602397
## 20 -0.01964289 -0.01414801 0.11444782 1.29298937 -0.23616894
## 21 -0.32869041 -0.32849651 -0.02433750 -0.17325932 -0.32113200
## 22 -0.32393395 -0.32364294 0.12317781 -0.19078321 -0.29877831
## 23 -0.21621311 -0.21691753 0.17155446 -0.18134307 -0.21869540
## 24 0.07060460 0.06374256 0.22470215 -0.15418913 0.60373198
## 25 0.43669232 0.43738106 0.23916643 -0.14657046 1.34295776
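The warning above shows that this run did not converge within 10 iterations. A sketch of a more robust run (the seed, iter.max and nstart values are assumptions, and the result is stored under a new name so the outputs below still refer to the original fit):
set.seed(123)                                            # assumed seed
fit25 <- kmeans(norm_centralities, centers = 25, iter.max = 100, nstart = 10)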
(e) Do you observe the groups obtained in Step 4 to be similar to or different from what you obtained in Step 5? Why?
fit$size
## [1] 37 125 148 16 47 7 29 20 21 10 63 21 50 17
## [15] 57 16 5 121 139 24 554 1335 487 50 26
No. As the above k-means output shows, the cluster sizes are quite different from those obtained with the leading.eigenvector community detection.
The leading.eigenvector algorithm detects communities using the eigenvectors of the modularity matrix of the graph, so it groups airports according to the route structure, i.e. which airports are densely connected to which. K-means, in contrast, groups airports purely on the basis of their centrality values, iterating until no airport can be moved to another cluster to reduce the within-cluster variance. Two airports on opposite sides of the world can have very similar centralities and fall into the same k-means cluster even though they belong to entirely different route communities, so the two groupings need not agree. K-means is the more natural choice when the goal is to group airports by their role in the network (hub versus spoke), whereas leading.eigenvector captures the connectivity structure itself.
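One way to quantify how different the two groupings are is sketched below; it assumes the row order of the centrality matrix matches the vertex order of the graph (which holds here, since the centralities were computed on AirlineNW_Directed). Normalized mutual information (NMI) near 0 means the groupings share little information, near 1 means they largely agree.
# Sketch: compare community membership with the 25-cluster k-means assignment
compare(membership(A), fit$cluster, method = "nmi")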
Step 6: Now, run k-Means clustering again on the airports based on their centralities. Go with a value of k as you find appropriate.
Determine number of clusters
Cluster_Variability <- matrix(nrow=8, ncol=1)
for (i in 1:8) Cluster_Variability[i] <- kmeans(norm_centralities, centers=i)$tot.withinss
plot(1:8, Cluster_Variability, type="b", xlab="Number of clusters", ylab="Within groups sum of squares")
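Because kmeans starts from random centres, the elbow curve can change a little from run to run; a sketch of a more reproducible scan (the seed and nstart values are assumptions):
set.seed(42)                                             # assumed seed
Cluster_Variability <- sapply(1:8, function(k)
  kmeans(norm_centralities, centers = k, nstart = 10)$tot.withinss)
plot(1:8, Cluster_Variability, type = "b", xlab = "Number of clusters",
     ylab = "Within groups sum of squares")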
Based on the elbow in the plot, we can take 3 clusters/communities, so we re-run the k-means algorithm with k = 3.
fit <- kmeans(norm_centralities, centers=3, iter.max=10)
(f) Interpret the clustering outcome.
fit$centers
## inDegree outDegree closenessIn betweenness eigenVector
## 1 2.6408776 2.6410987 0.2739675 1.0993867 2.4828468
## 2 7.1345324 7.1262662 0.3339593 8.7416041 7.0107256
## 3 -0.1966311 -0.1965708 -0.0168798 -0.1318576 -0.1874907
fit$size
## [1] 165 28 3232
From the calculated cluster centers and sizes we can say that:
Cluster 3 (3232 airports) comprises the airports with low values on all centralities, i.e. in-degree/out-degree, closeness, betweenness and eigenvector centrality.
Cluster 2 (28 airports) comprises the airports with high values on all centralities.
Cluster 1 (165 airports) lies in between in terms of centrality values.
From these statistics we can speculate that:
Cluster 2 comprises the major, probably international, airports that also act as the main hubs of their countries; such airports naturally have high centralities.
Cluster 1 likely contains the larger domestic airports and secondary hubs, with moderate centralities.
Cluster 3 likely contains small country-side, military, unclassified, non-commercial or private airports, which are sparsely connected and therefore have very low centralities compared to the others.
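As a sanity check on this interpretation, the airports that fall into the small high-centrality cluster can be listed (a sketch; cluster 2 is taken from the centers/sizes printed above, and the label can change between k-means runs):
hub_airports <- names(fit$cluster)[fit$cluster == 2]
head(hub_airports, 20)   # expected to contain the major international hubs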
Step 7: Carefully observe the centralities of the airports in the dataset.
(g) If your organization is planning on launching a new flight service on a couple of new routes, what will that be (based on the information you have in this data alone)? Explain your answer. What other information would have helped you to make a better decision?
Based on this data alone, we would probably look to launch new routes between pairs of airports that both have high centralities (large in/out-degree and betweenness) but do not yet have a direct connection, since such hub-to-hub routes are likely to attract the most connecting traffic.
The other information that would have helped includes passenger traffic and demand on each route, fares and operating costs, distances between airports, airport capacity and slot availability, and which competing airlines already serve each route.