Name: Subhankar Pattnaik ID: 71710059 Place: Bangalore
Step 1: Download the data on cosmetics purchases (Cosmetics.xls) from the textbook website (http://dataminingbook.com/).
Step 2: Using R or any other tool you are familiar with, apply association rules to these data. You may choose the threshold support and confidence in such a way that you get approximately 15-20 rules, with at least a few of them having a Lift Ratio greater than 1.
Step 3: Order those rules in decreasing order of Lift Ratio.
library("Matrix")
## Warning: package 'Matrix' was built under R version 3.4.1
library("arules")
## Warning: package 'arules' was built under R version 3.4.1
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
library("arulesViz")
## Warning: package 'arulesViz' was built under R version 3.4.1
## Loading required package: grid
library("xlsx")
## Loading required package: rJava
## Loading required package: xlsxjars
Cosmetics_data <- read.xlsx("Y:\\Knowledge Repository\\ISB\\DMG\\IndividualAssignment2\\Cosmetics\\Cosmetics.xlsx", "Data", header=TRUE)
rules = apriori(as.matrix(Cosmetics_data[,3:16]), parameter=list(support=0.2, confidence=0.45, minlen=2, target = "rules"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.45 0.1 1 none FALSE TRUE 5 0.2 2
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 200
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[14 item(s), 1000 transaction(s)] done [0.00s].
## sorting and recoding items ... [11 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [19 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
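As a side note, the same rules can be mined after first converting the purchase columns into an explicit arules transactions object. This is only a sketch, assuming columns 3:16 hold 0/1 purchase indicators; it should reproduce the same 19 rules.
# Sketch (assumption: columns 3:16 are 0/1 indicators); convert to a logical
# matrix and then to a "transactions" object before running apriori
trans <- as(as.matrix(Cosmetics_data[, 3:16]) == 1, "transactions")
rules_check <- apriori(trans, parameter = list(support = 0.2, confidence = 0.45,
                                               minlen = 2, target = "rules"))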
a) Please include (copy-paste) the output (the first fifteen rules along with header and input parameter details) in your submission.
Input: rules = apriori(as.matrix(Cosmetics_data[,3:16]), parameter=list(support=0.2, confidence=0.45, minlen=2, target = "rules"))
I selected a support threshold of 0.2 and a confidence threshold of 0.45.
With these inputs I obtained 19 rules. Below is the output displaying only the first 15 rules, arranged in descending order of lift ratio.
Output:
inspect(head(sort(rules, by="lift"), 15))
## lhs rhs support confidence lift
## [1] {Mascara} => {Eye.shadow} 0.321 0.8991597 2.359999
## [2] {Eye.shadow} => {Mascara} 0.321 0.8425197 2.359999
## [3] {Concealer} => {Eyeliner} 0.297 0.6719457 1.470341
## [4] {Eyeliner} => {Concealer} 0.297 0.6498906 1.470341
## [5] {Blush} => {Concealer} 0.220 0.6060606 1.371178
## [6] {Concealer} => {Blush} 0.220 0.4977376 1.371178
## [7] {Lip.Gloss} => {Foundation} 0.356 0.7265306 1.355468
## [8] {Foundation} => {Lip.Gloss} 0.356 0.6641791 1.355468
## [9] {Mascara} => {Concealer} 0.204 0.5714286 1.292825
## [10] {Concealer} => {Mascara} 0.204 0.4615385 1.292825
## [11] {Eye.shadow} => {Concealer} 0.201 0.5275591 1.193573
## [12] {Concealer} => {Eye.shadow} 0.201 0.4547511 1.193573
## [13] {Eye.shadow} => {Lip.Gloss} 0.201 0.5275591 1.076651
## [14] {Eye.shadow} => {Foundation} 0.211 0.5538058 1.033220
## [15] {Eyeliner} => {Lip.Gloss} 0.227 0.4967177 1.013710
b) What is the support of the first rule? Explain how it has been calculated for this rule.
Support of the first rule is 0.321. The support of a rule is the percentage of transactions in which both the antecedent (IF) and the consequent (THEN) item sets appear. Interpreting this value, Mascara and Eye.shadow occur together in 32.1% of the 1000 transactions.
It has been calculated by:
Support = (Number of transactions where both Mascara and Eye.shadow occur) / (Total no. of transactions)
=> Support = length(which(Cosmetics_data$Mascara == 1 & Cosmetics_data$Eye.shadow == 1)) / length(Cosmetics_data$Trans..)
=> Support = 321 / 1000 = 0.321
c) What is the confidence of the first rule? Explain how it has been calculated for this rule.
Confidence of the first rule is 0.8991597. The confidence of a rule is the percentage of transactions containing the antecedent (IF) item set that also contain the consequent (THEN) item set. Interpreting this value, Eye.shadow is purchased in about 89.9% of the transactions in which Mascara is purchased.
It has been calculated by:
Confidence = (Number of transactions where both Mascara and Eye.shadow occur) / (No. of transactions having Mascara)
=> Confidence = length(which(Cosmetics_data$Mascara == 1 & Cosmetics_data$Eye.shadow == 1)) / length(which(Cosmetics_data$Mascara == 1))
=> Confidence = 321 / 357 = 0.8991597
d) What is the lift ratio of the first rule? Explain how it has been calculated.
Lift Ratio of the first rule is 2.3599. The lift ratio is calculated as Confidence / Benchmark Confidence, i.e. Lift Ratio = P(C|A) / P(C). It tells us how far the antecedent and consequent are from being independent: a lift ratio greater than 1 means the two item sets occur together more often than would be expected if they were independent.
From the value of 2.3599 we can tell that Mascara and Eye.shadow are not bought together merely by chance: a customer who buys Mascara is about 2.36 times as likely to also buy Eye.shadow as a randomly chosen customer.
It has been calculated by:
=> Lift Ratio = Confidence / Benchmark Confidence
=> Lift Ratio = ((Number of transactions where both Mascara and Eye.shadow occur) / (No. of transactions having Mascara)) / (% of transactions having Eye.shadow)
=> Lift Ratio = (length(which(Cosmetics_data$Mascara == 1 & Cosmetics_data$Eye.shadow == 1)) / length(which(Cosmetics_data$Mascara == 1))) / (length(which(Cosmetics_data$Eye.shadow == 1)) / 1000)
=> Lift Ratio = (321 / 357) / (381/1000) = 2.359999
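These hand calculations can be cross-checked against the measures arules itself stores on the rule; a minimal sketch using the objects created above:
# Cross-check: support, confidence and lift reported by arules for the top rule
top_rule <- head(sort(rules, by = "lift"), 1)
quality(top_rule)   # expected: support 0.321, confidence 0.899, lift 2.36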
e) Reviewing the first fifteen rules, comment on their redundancy (read as constructed from same item set/tuple). How many distinct rules did you find from the first 15 rules?
There are 9 distinct rules among the first 15. Six of the first 15 rules are redundant.
The first 12 rules form 6 mirror-image pairs: each pair is built from the same item set, with the antecedent (IF) and consequent (THEN) swapped, and therefore has identical support and lift ratio (only the confidence differs). Dropping one rule from each pair and keeping rules 13, 14 and 15 leaves 9 distinct rules.
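A sketch of how the mirror-image duplicates can be dropped programmatically; items() returns the union of the LHS and RHS of each rule, so the two rules in a pair map to the same item set:
top15 <- head(sort(rules, by = "lift"), 15)
inspect(top15[!duplicated(items(top15))])   # keeps one rule per distinct item set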
f) Interpret the first three distinct (i.e., excluding the redundant ones, if any, as defined above) rules, in the output, in words.
Below are the first 3 distinct rules.
[1] {Mascara} => {Eye.shadow}
[3] {Concealer} => {Eyeliner}
[5] {Blush} => {Concealer}
The first rule says that Mascara and Eye.shadow appear together in 32.1% of all transactions. The confidence is 0.8991 for Mascara => Eye.shadow (0.8425 in the reverse direction), i.e. roughly 85%-90% either way, and the lift ratio is 2.3599, so the two items are strongly associated: a large share of customers purchase both at the same time.
Similarly, Concealer & Eyeliner appear together in 29.7% of all transactions with a lift ratio of 1.4703, while Blush & Concealer appear together in 22.0% of all transactions with a lift ratio of 1.3711.
Looking at these rules, Blush, Concealer & Eyeliner could also be combined into a single three-item rule.
g) Based on the distinct rules that you identified in Part (f), suggest some action that’ll benefit the business owner.
Based on the distinct rules retrieved in (f), Concealer appears with both Eyeliner and Blush at almost the same support, so the owner can stock all three items close to each other in the store; this raises the chance of a customer picking up all three in a single transaction. Since Mascara & Eye.shadow have both high confidence and a high lift ratio, sales can be boosted further by introducing a discounted combo offer for the pair. A similar offer would work for Concealer, Eyeliner and Blush.
Step 1: Download (1) airports.dat, (2) airlines.dat, and (3) routes.dat from http://openflights.org/data.html or use the files that were shared with you on LMS while doing the in-class exercise on airport networks. Refer to the data descriptions provided on the site itself. You may use the R scripts shown in class (also uploaded on LMS), or any other software packages that you are familiar with to complete the exercise.
Step 2: Create a directed network graph of airline routes using routes.dat
Step 3: Create community-based clusters. If you are using R, choose leading.eigenvector approach. If you are using any other software, use the equivalent algorithm provided with the software.
Step 4: Answer the following questions.
#try(require("igraph") || install.packages("igraph", repos=("http://cran.mirrors.hoobly.com")))
library("igraph")
## Warning: package 'igraph' was built under R version 3.4.1
##
## Attaching package: 'igraph'
## The following object is masked from 'package:arules':
##
## union
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
airline_routes <- read.csv("Y:\\Knowledge Repository\\ISB\\DMG\\IndividualAssignment2\\Airline\\routes.dat", header=FALSE)
colnames(airline_routes) <- c("Airline", "Airline ID", "Source Airport","Source Airport ID","Destination Airport","Destination Airport ID","Codeshare","Stops","Equipment")
Airline Routes
head(airline_routes)
## Airline Airline ID Source Airport Source Airport ID Destination Airport
## 1 2B 410 AER 2965 KZN
## 2 2B 410 ASF 2966 KZN
## 3 2B 410 ASF 2966 MRV
## 4 2B 410 CEK 2968 KZN
## 5 2B 410 CEK 2968 OVB
## 6 2B 410 DME 4029 KZN
## Destination Airport ID Codeshare Stops Equipment
## 1 2990 0 CR2
## 2 2990 0 CR2
## 3 2962 0 CR2
## 4 2990 0 CR2
## 5 4078 0 CR2
## 6 2990 0 CR2
Creating a directed network graph of airline routes
AirlineNW_Directed <- graph.edgelist(as.matrix(airline_routes[,c(3,5)]), directed=TRUE)
A <- leading.eigenvector.community(AirlineNW_Directed)
## Warning in leading.eigenvector.community(AirlineNW_Directed): At
## community.c:1581 :This method was developed for undirected graphs
head(membership(A),100)
## AER KZN ASF MRV CEK OVB DME NBC TGK UUA EGO KGD GYD LED SVX NJC NUX BTK
## 1 1 1 1 1 1 1 1 1 1 1 9 9 9 1 1 1 1
## IKT HTA KCK ODO UKX ULK YKS MJZ AYP LIM CUZ PEM HUU IQT PCL TPP ABJ BOY
## 1 1 1 1 1 1 1 1 19 19 19 19 19 19 19 19 9 11
## OUA ACC BKO DKR COO LFW NIM BOG GYE UIO CLO SCY OCC BDS ZRH BOD BRS GVA
## 9 9 9 9 9 9 9 19 19 19 19 19 19 9 9 9 9 9
## LPA LCA RMF TFS AJR LYC ARN GEV HAD JKG KRF KSD MHQ OER POR TRF VBY VHM
## 9 9 9 9 11 9 9 9 9 9 9 9 9 9 9 9 9 9
## VXO HMV KOK TKU OSL ADQ AOS KKB KLN KOZ OLH KZB SYB KYK ORI KPR BSO MNL
## 9 11 9 9 9 11 11 11 11 11 11 11 11 11 11 11 11 10
## BXU CBO CGY CRM DGT DWC GES KLO LGP MPH
## 11 11 17 11 11 1 11 10 11 11
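Note that the warning above points out that leading.eigenvector was developed for undirected graphs. A sketch of an alternative, collapsing the route graph to an undirected one before detecting communities (the resulting membership may differ slightly from the one shown above):
AirlineNW_Undirected <- as.undirected(AirlineNW_Directed, mode = "collapse")
A_undirected <- leading.eigenvector.community(AirlineNW_Undirected)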
Plot of directed network graph
plot(AirlineNW_Directed)
Plot of community-based clusters
plot(A,AirlineNW_Directed)
(a) What would you call a community in a social-media network? Intuitive, qualitative answers are expected/acceptable.
Generally, a community is formed around similar characteristics, attributes or interests. In a social-media network, a community could be people who share an interest in a topic (movies, technology, food, places, etc.), people from the same school/college/organization, or people grouped by demographics such as geographical location, nationality, religion or economic status.
(b) Extend the definition of community (as you suggested above in Part a) to the community of airports.
If we extend the definition of community to the community of airports, then we can segregate the airports into domestic airports, international airports, community (general aviation) airports, military airports, unclassified airports, etc. In network terms, a community of airports can also be seen as a group of airports that are far more densely connected to one another by routes than to the rest of the network, for example the airports serving one country or region.
(c) How many distinct airports are there in the dataset? How many communities of airports got identified? List the number of airports in each cluster/community in a table.
There are a total of 3425 distinct airports.
A$vcount
## [1] 3425
A total of 25 communities were identified.
length(A)
## [1] 25
List of number of airports in each cluster/community
sizes(A)
## Community sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 137 2 4 2 4 2 4 10 635 340 776 334 274 1 6 1 2 21
## 19 20 21 22 23 24 25
## 198 656 1 1 1 1 12
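The same sizes can be laid out as a two-column table (a small sketch, no new packages assumed):
data.frame(Community = as.integer(names(sizes(A))),
           Airports  = as.vector(sizes(A)))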
Step 5: Compute the centralities (in-degree, out-degree, in-closeness, eigenvector, betweenness) of each airport. Now, run k-Means clustering to group the airports based on their centralities alone. Take k equal to the number of communities you obtained in Part c, above.
How many airports are there in the network?
vcount(AirlineNW_Directed)
## [1] 3425
How many connections are there in the network?
ecount(AirlineNW_Directed)
## [1] 67663
In-Degree
indegree <- degree(AirlineNW_Directed, mode="in")
head(indegree)
## AER KZN ASF MRV CEK OVB
## 26 28 8 22 20 90
Out-Degree
outdegree <- degree(AirlineNW_Directed, mode="out")
head(outdegree)
## AER KZN ASF MRV CEK OVB
## 26 28 8 22 20 87
Closeness
closeness_in <- closeness(AirlineNW_Directed, mode="in", normalized = TRUE)
head(closeness_in)
## AER KZN ASF MRV CEK OVB
## 0.01801138 0.01801868 0.01796951 0.01805165 0.01797724 0.01813761
Betweenness
btwn <- betweenness(AirlineNW_Directed, normalized = TRUE)
head(btwn)
## AER KZN ASF MRV CEK
## 2.057750e-05 2.211094e-05 3.970297e-07 2.315226e-05 2.473746e-06
## OVB
## 1.385724e-03
Eigen Centrality
eigenv <- eigen_centrality(AirlineNW_Directed, directed = TRUE, scale = FALSE, weights = NULL)
eigenvVec <- eigenv$vector
head(eigenvVec)
## AER KZN ASF MRV CEK OVB
## 0.002587475 0.002603360 0.001008290 0.002982193 0.001856655 0.009060106
Combining all the 5 centralities
centralities <- cbind(indegree, outdegree, closeness_in, btwn, eigenvVec)
colnames(centralities) <- c("inDegree","outDegree","closenessIn","betweenness","eigenVector")
head(centralities)
## inDegree outDegree closenessIn betweenness eigenVector
## AER 26 26 0.01801138 2.057750e-05 0.002587475
## KZN 28 28 0.01801868 2.211094e-05 0.002603360
## ASF 8 8 0.01796951 3.970297e-07 0.001008290
## MRV 22 22 0.01805165 2.315226e-05 0.002982193
## CEK 20 20 0.01797724 2.473746e-06 0.001856655
## OVB 90 87 0.01813761 1.385724e-03 0.009060106
K-Means Clustering
centralitiesdf <- as.data.frame(centralities)
norm_centralities <- scale(centralitiesdf)
fit <- kmeans(norm_centralities, centers=25, iter.max=10)
## Warning: did not converge in 10 iterations
fit$centers
## inDegree outDegree closenessIn betweenness eigenVector
## 1 2.25884866 2.26204586 0.26133253 0.27411559 1.75964799
## 2 -0.33917588 -0.33814280 -0.26106571 -0.14311286 -0.32160949
## 3 0.01763672 0.01798590 0.19636413 -0.15150797 -0.12779498
## 4 6.62778625 6.59297530 0.33375627 7.02556884 6.26662998
## 5 -0.34419769 -0.34041832 -8.41655715 -0.16394425 -0.32161001
## 6 10.41950698 10.45774623 0.35480173 13.05595374 10.99255603
## 7 4.82810221 4.83787116 0.30843461 1.74104349 4.86934224
## 8 3.31407487 3.29807336 0.29982896 3.50143070 2.60697587
## 9 0.65568375 0.64920570 0.23423886 1.11863709 0.27629183
## 10 0.19589449 0.20679158 0.16360191 3.03911691 -0.16011137
## 11 0.84026634 0.83733424 0.23969758 -0.03291130 0.63041135
## 12 3.21721367 3.21167689 0.26425178 0.16643940 3.23118579
## 13 -0.14621444 -0.14296702 0.09384112 0.42236992 -0.26483858
## 14 1.17186702 1.18196558 0.26025306 0.03170901 2.56953890
## 15 1.39153836 1.39045749 0.24706381 0.04726597 1.12674174
## 16 1.53436756 1.51885439 0.27438381 2.13824756 1.13125220
## 17 2.63417087 2.67457437 0.28817981 10.05504548 2.18285823
## 18 0.36652734 0.36697974 0.22285299 -0.03438075 0.12393817
## 19 -0.12724797 -0.12770249 0.20725313 -0.19383877 0.08602397
## 20 -0.01964289 -0.01414801 0.11444782 1.29298937 -0.23616894
## 21 -0.32869041 -0.32849651 -0.02433750 -0.17325932 -0.32113200
## 22 -0.32393395 -0.32364294 0.12317781 -0.19078321 -0.29877831
## 23 -0.21621311 -0.21691753 0.17155446 -0.18134307 -0.21869540
## 24 0.07060460 0.06374256 0.22470215 -0.15418913 0.60373198
## 25 0.43669232 0.43738106 0.23916643 -0.14657046 1.34295776
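The warning above shows that this run did not converge within 10 iterations. A sketch of a more robust run (the seed, iter.max and nstart values are assumptions, and the result is stored under a new name so the outputs below still refer to the original fit):
set.seed(123)                                            # assumed seed
fit25 <- kmeans(norm_centralities, centers = 25, iter.max = 100, nstart = 10)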
(e) Do you observe the groups obtained in Step 4 to be similar to or different from what you obtained in Step 5? Why?
fit$size
## [1] 37 125 148 16 47 7 29 20 21 10 63 21 50 17
## [15] 57 16 5 121 139 24 554 1335 487 50 26
No. As the above k-means output shows, the cluster sizes are quite different from those obtained with the leading.eigenvector community detection.
The leading.eigenvector algorithm detects communities using the eigenvectors of the modularity matrix of the graph, so it groups airports according to the route structure, i.e. which airports are densely connected to which. K-means, in contrast, groups airports purely on the basis of their centrality values, iterating until no airport can be moved to another cluster to reduce the within-cluster variance. Two airports on opposite sides of the world can have very similar centralities and fall into the same k-means cluster even though they belong to entirely different route communities, so the two groupings need not agree. K-means is the more natural choice when the goal is to group airports by their role in the network (hub versus spoke), whereas leading.eigenvector captures the connectivity structure itself.
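One way to quantify how different the two groupings are is sketched below; it assumes the row order of the centrality matrix matches the vertex order of the graph (which holds here, since the centralities were computed on AirlineNW_Directed). Normalized mutual information (NMI) near 0 means the groupings share little information, near 1 means they largely agree.
# Sketch: compare community membership with the 25-cluster k-means assignment
compare(membership(A), fit$cluster, method = "nmi")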
Step 6: Now, run k-Means clustering again on the airports based on their centralities. Go with a value of k as you find appropriate.
Determine number of clusters
Cluster_Variability <- matrix(nrow=8, ncol=1)
for (i in 1:8) Cluster_Variability[i] <- kmeans(norm_centralities, centers=i)$tot.withinss
plot(1:8, Cluster_Variability, type="b", xlab="Number of clusters", ylab="Within groups sum of squares")
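Because kmeans starts from random centres, the elbow curve can change a little from run to run; a sketch of a more reproducible scan (the seed and nstart values are assumptions):
set.seed(42)                                             # assumed seed
Cluster_Variability <- sapply(1:8, function(k)
  kmeans(norm_centralities, centers = k, nstart = 10)$tot.withinss)
plot(1:8, Cluster_Variability, type = "b", xlab = "Number of clusters",
     ylab = "Within groups sum of squares")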
Based on the elbow in the plot, we can take 3 clusters/communities, so we re-run the k-means algorithm with k = 3.
fit <- kmeans(norm_centralities, centers=3, iter.max=10)
(f) Interpret the clustering outcome.
fit$centers
## inDegree outDegree closenessIn betweenness eigenVector
## 1 2.6408776 2.6410987 0.2739675 1.0993867 2.4828468
## 2 7.1345324 7.1262662 0.3339593 8.7416041 7.0107256
## 3 -0.1966311 -0.1965708 -0.0168798 -0.1318576 -0.1874907
fit$size
## [1] 165 28 3232
From the calculated cluster centers and sizes we can say that:
Cluster 3 (3232 airports) comprises the airports with low values on all centralities, i.e. in-degree/out-degree, closeness, betweenness and eigenvector centrality.
Cluster 2 (28 airports) comprises the airports with high values on all centralities.
Cluster 1 (165 airports) lies in between in terms of centrality values.
From these statistics we can speculate that:
Cluster 2 comprises the major, probably international, airports that also act as the main hubs of their countries; such airports naturally have high centralities.
Cluster 1 likely contains the larger domestic airports and secondary hubs, with moderate centralities.
Cluster 3 likely contains small country-side, military, unclassified, non-commercial or private airports, which are sparsely connected and therefore have very low centralities compared to the others.
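As a sanity check on this interpretation, the airports that fall into the small high-centrality cluster can be listed (a sketch; cluster 2 is taken from the centers/sizes printed above, and the label can change between k-means runs):
hub_airports <- names(fit$cluster)[fit$cluster == 2]
head(hub_airports, 20)   # expected to contain the major international hubs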
Step 7: Carefully observe the centralities of the airports in the dataset.
(g) If your organization is planning on launching a new flight service on a couple of new routes, what will that be (based on the information you have in this data alone)? Explain your answer. What other information would have helped you to make a better decision?
Based on this data alone, we would probably look to launch new routes between pairs of airports that both have high centralities (large in/out-degree and betweenness) but do not yet have a direct connection, since such hub-to-hub routes are likely to attract the most connecting traffic.
The other information that would have helped includes passenger traffic and demand on each route, fares and operating costs, distances between airports, airport capacity and slot availability, and which competing airlines already serve each route.