Sampurna Tuladhar

Data Mining - Project 3

I imported two libraries, “rvest” and “tidyverse”.

df <- read.csv("dota2.csv")

I had to re-import my .csv file here because, after running all of the code, some variables were somehow removed from my data set.

Association

library(arules)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
library(arulesViz)

I am loading the arules and arulesViz libraries for my association analysis.

transactions(df)
## Warning: Column(s) 1, 3, 4, 5 not logical or factor. Applying default
## discretization (see '? discretizeDF').
## transactions in sparse format with
##  250 transactions (rows) and
##  260 items (columns)

I convert the data using the transactions() function for association analysis.

colnames(df)[c(1,3,4,5)]
## [1] "ability_name"     "item_id"          "duration"         "first_blood_time"

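These are the four columns named in the warning above: they are not logical or factor columns, so transactions() applies its default discretization to them.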

df<- df %>% mutate(
  radiant_win = (radiant_win > 0))

df <- df %>% select(-ability_name)

First, I convert radiant_win to a logical (TRUE/FALSE) variable. Then, because one of my variables, “ability_name”, is inappropriate for my analysis, I am removing it from my data set.

as(df,"transactions")
## Warning: Column(s) 2, 3, 4 not logical or factor. Applying default
## discretization (see '? discretizeDF').
## transactions in sparse format with
##  250 transactions (rows) and
##  10 items (columns)
trans <- transactions(df)
## Warning: Column(s) 2, 3, 4 not logical or factor. Applying default
## discretization (see '? discretizeDF').

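It is showing some warnings, but this is not a problem: they only say that the columns that are not logical or factor were discretized with the default settings, and R can still work with the data.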
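To make the warning more concrete, here is a minimal sketch of what binning a numeric column into three equal-frequency intervals looks like in base R. The vector x is hypothetical data standing in for a column such as duration, and I am assuming that the default discretization behaves roughly like this (three intervals with about the same number of rows each, which matches the counts of 83, 83 and 84 in the output); it is not the exact arules implementation.

```r
# Hypothetical numeric column standing in for something like `duration`.
x <- c(1500, 1800, 2100, 2300, 2500, 2700, 2900, 3500, 5000)

# Cut points at the 0, 1/3, 2/3 and 1 quantiles give three equal-frequency bins.
breaks <- quantile(x, probs = seq(0, 1, length.out = 4))

# right = FALSE gives intervals of the form [a, b); include.lowest = TRUE
# closes the last interval so that the maximum value is not dropped.
bins <- cut(x, breaks = breaks, include.lowest = TRUE, right = FALSE)

# Each bin then plays the role of one item such as "duration=[a,b)".
bin_counts <- table(bins)
print(bin_counts)
```

Each numeric column turns into three items this way, which is why the transactions end up with 10 items: one for radiant_win and three for each of the three numeric columns.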

summary(trans)
## transactions as itemMatrix in sparse format with
##  250 rows (elements/itemsets/transactions) and
##  10 columns (items) and a density of 0.3528 
## 
## most frequent items:
##                  radiant_win             item_id=[46,218] 
##                          132                          130 
##   first_blood_time=[109,321] duration=[2.79e+03,5.34e+03] 
##                           85                           84 
## duration=[1.44e+03,2.25e+03)                      (Other) 
##                           83                          368 
## 
## element (itemset/transaction) length distribution:
## sizes
##   3   4 
## 118 132 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   3.000   4.000   3.528   4.000   4.000 
## 
## includes extended item information - examples:
##            labels   variables  levels
## 1     radiant_win radiant_win    TRUE
## 2  item_id=[2,36)     item_id  [2,36)
## 3 item_id=[36,46)     item_id [36,46)
## 
## includes extended transaction information - examples:
##   transactionID
## 1             1
## 2             2
## 3             3

By using the summary() function I can see my data set’s most frequent items, the discretized item_id ranges, and the distribution of transaction lengths.

colnames(trans)
##  [1] "radiant_win"                  "item_id=[2,36)"              
##  [3] "item_id=[36,46)"              "item_id=[46,218]"            
##  [5] "duration=[1.44e+03,2.25e+03)" "duration=[2.25e+03,2.79e+03)"
##  [7] "duration=[2.79e+03,5.34e+03]" "first_blood_time=[0,29)"     
##  [9] "first_blood_time=[29,109)"    "first_blood_time=[109,321]"

Displaying the column names of trans, which are the labels of the items in it.

inspect(trans[1:3])
##     items                          transactionID
## [1] {radiant_win,                               
##      item_id=[36,46),                           
##      duration=[2.25e+03,2.79e+03),              
##      first_blood_time=[0,29)}                  1
## [2] {item_id=[2,36),                            
##      duration=[2.25e+03,2.79e+03),              
##      first_blood_time=[109,321]}               2
## [3] {item_id=[36,46),                           
##      duration=[2.25e+03,2.79e+03),              
##      first_blood_time=[109,321]}               3

After displaying the column names of trans, I used the inspect() function on trans[1:3] so that it shows me the items in the first three transactions.

image(trans)

I used the image() function to show which items occur in which transactions of my data set trans; each dot marks an item that is present in a transaction.

itemFrequencyPlot(trans, topN = 15)

Then I used the itemFrequencyPlot() function to show the items that are most frequent in my data set.

vertical <- as(trans, "tidLists")
as(vertical,"matrix")[1:8,1:6]
##                                  1     2     3     4     5     6
## radiant_win                   TRUE FALSE FALSE FALSE  TRUE  TRUE
## item_id=[2,36)               FALSE  TRUE FALSE FALSE FALSE  TRUE
## item_id=[36,46)               TRUE FALSE  TRUE FALSE FALSE FALSE
## item_id=[46,218]             FALSE FALSE FALSE  TRUE  TRUE FALSE
## duration=[1.44e+03,2.25e+03) FALSE FALSE FALSE FALSE  TRUE  TRUE
## duration=[2.25e+03,2.79e+03)  TRUE  TRUE  TRUE FALSE FALSE FALSE
## duration=[2.79e+03,5.34e+03] FALSE FALSE FALSE  TRUE FALSE FALSE
## first_blood_time=[0,29)       TRUE FALSE FALSE FALSE FALSE FALSE

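I have also used the vertical variable to store the trans data in vertical layout (one row per item, one column per transaction) using the as() function. The index [1:8,1:6] shows the first 8 items and the first 6 transactions.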
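The difference between the two layouts can be sketched with a small logical matrix in base R. The matrix m and its item names a, b and c are hypothetical toy data, not taken from my data set.

```r
# Horizontal layout: one row per transaction, one column per item (toy data).
m <- matrix(c(TRUE,  FALSE, TRUE,
              FALSE, TRUE,  TRUE,
              TRUE,  TRUE,  FALSE),
            ncol = 3, byrow = TRUE,
            dimnames = list(c("1", "2", "3"), c("a", "b", "c")))

# Vertical layout as a matrix: one row per item, one column per transaction.
vertical_m <- t(m)

# Vertical layout as lists: for each item, the ids of the transactions containing it.
tid_lists <- lapply(as.data.frame(m), which)
print(tid_lists)
```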

Apriori

I have finished visualizing, summarizing and inspecting the new association data. Now, I am going to use the Apriori algorithm to find frequent itemsets and rules, and then visualize, summarize and inspect them.

its <- apriori(trans, parameter=list(target = "frequent"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##          NA    0.1    1 none FALSE            TRUE       5     0.1      1
##  maxlen            target  ext
##      10 frequent itemsets TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 25 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[10 item(s), 250 transaction(s)] done [0.00s].
## sorting and recoding items ... [10 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## sorting transactions ... done [0.00s].
## writing ... [39 set(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

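I used the apriori() function with target = "frequent" to mine frequent itemsets from trans. With the default minimum support of 0.1 and 250 transactions, an itemset has to appear in at least 25 transactions to be kept. I have stored the result in the ‘its’ variable so I can use it later.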
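The idea behind frequent itemset mining can be sketched in a few lines of base R. This is only a toy illustration on hypothetical data (the matrix m, the items a, b and c, and the threshold of 0.5 are all made up), and it only goes up to itemsets of size 2; it is not how apriori() is implemented internally.

```r
# Toy incidence matrix (hypothetical): rows are transactions, columns are items.
m <- matrix(c(TRUE,  TRUE,  FALSE,
              TRUE,  FALSE, TRUE,
              TRUE,  TRUE,  TRUE,
              FALSE, TRUE,  FALSE),
            ncol = 3, byrow = TRUE,
            dimnames = list(NULL, c("a", "b", "c")))
min_support <- 0.5

# Support of a single item = fraction of transactions that contain it.
single <- colMeans(m)
frequent_single <- names(single)[single >= min_support]

# Apriori idea: a pair can only be frequent if both of its items are frequent,
# so candidate pairs are built from the frequent single items only.
pairs <- combn(frequent_single, 2, simplify = FALSE)
pair_support <- sapply(pairs, function(p) mean(m[, p[1]] & m[, p[2]]))
names(pair_support) <- sapply(pairs, paste, collapse = ",")

# Keep only the pairs that reach the minimum support.
frequent_pairs <- pair_support[pair_support >= min_support]
print(frequent_pairs)
```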

its
## set of 39 itemsets

Displaying the set of itemsets.

inspect(head(its, n = 10))
##      items                          support transIdenticalToItemsets count
## [1]  {item_id=[36,46)}              0.152   0                         38  
## [2]  {item_id=[2,36)}               0.328   0                         82  
## [3]  {duration=[2.79e+03,5.34e+03]} 0.336   0                         84  
## [4]  {first_blood_time=[29,109)}    0.328   0                         82  
## [5]  {duration=[2.25e+03,2.79e+03)} 0.332   0                         83  
## [6]  {first_blood_time=[109,321]}   0.340   0                         85  
## [7]  {first_blood_time=[0,29)}      0.332   0                         83  
## [8]  {duration=[1.44e+03,2.25e+03)} 0.332   0                         83  
## [9]  {item_id=[46,218]}             0.520   0                        130  
## [10] {radiant_win}                  0.528   0                        132

Now, it’s time to inspect the frequent itemsets found by apriori(). I can see each itemset together with its support and its count.

ggplot(tibble(`Itemset Size` = factor(size(its))), aes(`Itemset Size`)) + geom_bar()

Using ggplot() on a tibble of the itemset sizes, the bar chart shows how many itemsets there are of each size. As we can see from the plot, there are 10 itemsets of size 1, and the remaining 29 of the 39 itemsets are of size 2 or 3, with size 2 being by far the most common.

Build Rules

Rules are built from frequent itemsets, but they add a direction (lhs => rhs) and a confidence value. So, it’s time to build some rules for my data set.

rules <- apriori(trans, parameter = list(support = 0.01, confidence = 0.4))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.4    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 2 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[10 item(s), 250 transaction(s)] done [0.00s].
## sorting and recoding items ... [10 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [114 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
inspect(head(rules))
##     lhs                               rhs                          support
## [1] {}                             => {item_id=[46,218]}           0.520  
## [2] {}                             => {radiant_win}                0.528  
## [3] {item_id=[36,46)}              => {first_blood_time=[109,321]} 0.068  
## [4] {item_id=[36,46)}              => {radiant_win}                0.084  
## [5] {item_id=[2,36)}               => {radiant_win}                0.152  
## [6] {duration=[2.79e+03,5.34e+03]} => {item_id=[46,218]}           0.180  
##     confidence coverage lift      count
## [1] 0.5200000  1.000    1.0000000 130  
## [2] 0.5280000  1.000    1.0000000 132  
## [3] 0.4473684  0.152    1.3157895  17  
## [4] 0.5526316  0.152    1.0466507  21  
## [5] 0.4634146  0.328    0.8776792  38  
## [6] 0.5357143  0.336    1.0302198  45

I used the apriori() function on trans with the parameter list holding my support and confidence values. I first tried larger values for support and confidence, but that did not produce enough rules, so I had to bring the support and confidence down to get more rules.
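To check that I understand the quality measures, here is a small base R calculation that reproduces the numbers for rule [3], {item_id=[36,46)} => {first_blood_time=[109,321]}. The counts are taken from the outputs above (17 for the rule, 38 for the left-hand side, 85 for the right-hand side, 250 transactions); the variable names are my own.

```r
# Counts taken from the apriori() outputs above.
n          <- 250  # total number of transactions
count_both <- 17   # transactions containing both lhs and rhs (rule count)
count_lhs  <- 38   # transactions containing item_id=[36,46)
count_rhs  <- 85   # transactions containing first_blood_time=[109,321]

support    <- count_both / n                # how often the whole rule occurs
coverage   <- count_lhs / n                 # how often the lhs occurs
confidence <- count_both / count_lhs        # share of lhs transactions that also have the rhs
lift       <- confidence / (count_rhs / n)  # confidence relative to the rhs frequency

print(round(c(support = support, confidence = confidence,
              coverage = coverage, lift = lift), 7))
```

A lift above 1 means the right-hand side shows up more often together with the left-hand side than it does overall, and a lift below 1 means it shows up less often.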

plot(rules, jitter = 1)

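I have used the plot() function on my rules with jitter set to 1, which adds a little random noise to the points so that overlapping rules become visible in the scatter plot of support against confidence. In the plot, the lift scale goes up to about 2.5. Rules with a lift above 1 indicate a positive association between the left-hand and right-hand side, but not every rule qualifies: rule [5] in the output above has a lift of about 0.88. So, I cannot safely say that all of the rules are positively correlated, only those with a lift above 1.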

plot(rules, shading = "order")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

Scatter plot of the same rules, this time colored by the order of each rule (the number of items it contains) instead of by lift.

plot(head(rules, n = 50), method = "graph")

Finally, I have used plot() with the graph method on the first 50 rules, taken with head(rules, n = 50), to show the support and lift of my rules. In the graph we can see my itemsets pointing to each other, with support ranging from about 0.1 at the lowest to 0.5 at the highest, and lift ranging from about 0.8 at the lowest to 1.6 at the highest.