eq <- read.csv("https://raw.githubusercontent.com/khatriprajwol/NEW-Data/main/database.csv")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(rvest)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.4 ✓ stringr 1.4.0
## ✓ tidyr 1.1.4 ✓ forcats 0.5.1
## ✓ readr 2.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x readr::guess_encoding() masks rvest::guess_encoding()
## x dplyr::lag() masks stats::lag()
This project applies association rule mining, a popular data mining method in R. Association rules help us determine the frequent item sets that appear in transactions, and to perform the analysis we use the arules and arulesViz packages. The main application of association rule analysis is finding patterns in what customers buy. We see association rules in real life on YouTube, where we are recommended similar kinds of videos, and in a market, where we are often shown items that other people bought together with ours. Learning this technique can help a business organization recommend the best products to its customers. In my data set it can be used to look for patterns that may hint at where earthquakes of a certain kind tend to occur.
eq <- data.frame(Latitude=eq$Latitude, Longitude=eq$Longitude, Type=eq$Type, Depth=eq$Depth, Magnitude=eq$Magnitude, Status=eq$Status)
head(eq)
## Latitude Longitude Type Depth Magnitude Status
## 1 19.246 145.616 Earthquake 131.6 6.0 Automatic
## 2 1.863 127.352 Earthquake 80.0 5.8 Automatic
## 3 -20.579 -173.972 Earthquake 20.0 6.2 Automatic
## 4 -59.076 -23.557 Earthquake 15.0 5.8 Automatic
## 5 11.938 126.427 Earthquake 15.0 5.8 Automatic
## 6 -13.405 166.629 Earthquake 35.0 6.7 Automatic
tail(eq)
## Latitude Longitude Type Depth Magnitude Status
## 23407 38.3754 -118.8977 Earthquake 10.80 5.6 Reviewed
## 23408 38.3917 -118.8941 Earthquake 12.30 5.6 Reviewed
## 23409 38.3777 -118.8957 Earthquake 8.80 5.5 Reviewed
## 23410 36.9179 140.4262 Earthquake 10.00 5.9 Reviewed
## 23411 -9.0283 118.6639 Earthquake 79.00 6.3 Reviewed
## 23412 37.3973 141.4103 Earthquake 11.94 5.5 Reviewed
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
##
## recode
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(arulesViz)
As discussed above, we use the arules and arulesViz packages for the association mining analysis. Now we need to convert the data for association analysis: before anything can be mined, the transaction data has to be loaded into an object of the transactions class, which is done with the arules package.
transactions(eq)
## Warning: Column(s) 1, 2, 3, 4, 5, 6 not logical or factor. Applying default
## discretization (see '? discretizeDF').
## transactions in sparse format with
## 23412 transactions (rows) and
## 18 items (columns)
I received several warnings here saying the items are not logical or factor. So let's look at the columns and see if we can convert them into factors (or Booleans) for analysis.
colnames(eq)[c(1,2,3,6)]
## [1] "Latitude" "Longitude" "Type" "Status"
Here, I do not think I can do much with Latitude, Longitude, Type and Status. Let's turn Magnitude into "magnitude > 5" and Depth into "depth > 10" and see how that affects the transactions.
df <- eq %>% mutate(
Magnitude = (Magnitude > 5),
Depth = (Depth > 10)
)
Here, I'll build the transactions again and see how converting Magnitude and Depth has cleaned things up.
as(df,"transactions")
## Warning: Column(s) 1, 2, 3, 6 not logical or factor. Applying default
## discretization (see '? discretizeDF').
## transactions in sparse format with
## 23412 transactions (rows) and
## 14 items (columns)
trans <- transactions(df)
## Warning: Column(s) 1, 2, 3, 6 not logical or factor. Applying default
## discretization (see '? discretizeDF').
Even after converting Magnitude and Depth, I still receive the warning that some columns are not logical. I am happy to let R apply its default discretization to the rest of the data, because I could not come up with better cutoffs for what is left, and I feel that Latitude, Longitude, Type and Status are just as important to keep.
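If better cutoffs were available, the default binning could be overridden with discretizeDF() from arules before building the transactions. The sketch below is illustrative only: the equal-frequency binning and the object names df_manual and trans_manual are my own choices, not part of the original analysis.
# Sketch: discretize the remaining numeric columns explicitly instead of letting
# transactions() apply its default binning; the cutoff choices are illustrative.
df_manual <- discretizeDF(df, methods = list(
  Latitude = list(method = "frequency", breaks = 3),
  Longitude = list(method = "frequency", breaks = 3)
))
trans_manual <- transactions(df_manual)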
dim(trans)
## [1] 23412 14
Looking at the dimensions of the object, we have 23412 transactions and 14 distinct items.
itemLabels(trans)
## [1] "Latitude=[-77.1,-12.7)" "Latitude=[-12.7,13)" "Latitude=[13,86]"
## [4] "Longitude=[-180,-28.3)" "Longitude=[-28.3,137)" "Longitude=[137,180]"
## [7] "Type=Earthquake" "Type=Explosion" "Type=Nuclear Explosion"
## [10] "Type=Rock Burst" "Depth" "Magnitude"
## [13] "Status=Automatic" "Status=Reviewed"
The above piece of code shows the list of distinct items in the data, of which there are 14.
summary(trans)
## transactions as itemMatrix in sparse format with
## 23412 rows (elements/itemsets/transactions) and
## 14 columns (items) and a density of 0.4135425
##
## most frequent items:
## Magnitude Type=Earthquake Status=Reviewed
## 23412 23232 20773
## Depth Latitude=[-77.1,-12.7) (Other)
## 18486 7804 41839
##
## element (itemset/transaction) length distribution:
## sizes
## 5 6
## 4926 18486
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 6.00 6.00 5.79 6.00 6.00
##
## includes extended item information - examples:
## labels variables levels
## 1 Latitude=[-77.1,-12.7) Latitude [-77.1,-12.7)
## 2 Latitude=[-12.7,13) Latitude [-12.7,13)
## 3 Latitude=[13,86] Latitude [13,86]
##
## includes extended transaction information - examples:
## transactionID
## 1 1
## 2 2
## 3 3
Here summary() gives us information about our transaction object. We can see we have 23412 transactions (rows) and 14 items (columns), and we can also see the most frequent items. In addition, the density tells us the percentage of non-zero cells in this 23412 x 14 matrix. The length distribution shows that 4926 transactions contain 5 items and 18486 transactions contain 6 items.
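As a small sanity check (my own addition, not part of the original output), the density can be recomputed from the length distribution above, since every transaction contains either 5 or 6 items.
# Share of non-zero cells in the 23412 x 14 incidence matrix; this should come out
# at roughly 0.4135, matching the density reported by summary(trans).
(5 * 4926 + 6 * 18486) / (23412 * 14)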
colnames(trans)
## [1] "Latitude=[-77.1,-12.7)" "Latitude=[-12.7,13)" "Latitude=[13,86]"
## [4] "Longitude=[-180,-28.3)" "Longitude=[-28.3,137)" "Longitude=[137,180]"
## [7] "Type=Earthquake" "Type=Explosion" "Type=Nuclear Explosion"
## [10] "Type=Rock Burst" "Depth" "Magnitude"
## [13] "Status=Automatic" "Status=Reviewed"
head(colnames(trans))
## [1] "Latitude=[-77.1,-12.7)" "Latitude=[-12.7,13)" "Latitude=[13,86]"
## [4] "Longitude=[-180,-28.3)" "Longitude=[-28.3,137)" "Longitude=[137,180]"
inspect(trans[1:6])
## items transactionID
## [1] {Latitude=[13,86],
## Longitude=[137,180],
## Type=Earthquake,
## Depth,
## Magnitude,
## Status=Automatic} 1
## [2] {Latitude=[-12.7,13),
## Longitude=[-28.3,137),
## Type=Earthquake,
## Depth,
## Magnitude,
## Status=Automatic} 2
## [3] {Latitude=[-77.1,-12.7),
## Longitude=[-180,-28.3),
## Type=Earthquake,
## Depth,
## Magnitude,
## Status=Automatic} 3
## [4] {Latitude=[-77.1,-12.7),
## Longitude=[-28.3,137),
## Type=Earthquake,
## Depth,
## Magnitude,
## Status=Automatic} 4
## [5] {Latitude=[-12.7,13),
## Longitude=[-28.3,137),
## Type=Earthquake,
## Depth,
## Magnitude,
## Status=Automatic} 5
## [6] {Latitude=[-77.1,-12.7),
## Longitude=[137,180],
## Type=Earthquake,
## Depth,
## Magnitude,
## Status=Automatic} 6
image(trans)
This is an odd diagram; I was not expecting it to look like this. It is clearly not informative and does not do a great job with all 23412 rows plotted at once. However, let me restrict it to a few transactions and see how it does.
image(trans[1:10])
This did a much better job; it is far clearer and more informative, and I can see the structure of each transaction. Since the data is large I took only a small part of it, [1:10], where we can see which cells are non-zero. Overall about 41% of the cells are non-zero (the density of 0.4135 reported by summary()), so by that standard the matrix is fairly dense rather than sparse.
itemFrequencyPlot(trans,topN = 15, col="Red")
The above diagram shows the relative item frequency. The items {Latitude=[-77.1,-12.7)}, {Latitude=[-12.7,13)}, {Latitude=[13,86]}, {Longitude=[-180,-28.3)}, {Longitude=[-28.3,137)} and {Longitude=[137,180]} all have a relative frequency (support) of about 33%.
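The same support values can also be read off the transactions numerically rather than from the plot; the two item labels below are just examples of how this might be checked.
# itemFrequency() returns the support of each item; each latitude and longitude
# bin should come out at roughly one third of the 23412 records.
itemFrequency(trans)[c("Latitude=[-77.1,-12.7)", "Longitude=[137,180]")]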
vertical <- as(trans, "tidLists")
as(vertical, "matrix")[1:10, 1:10]
## 1 2 3 4 5 6 7 8 9
## Latitude=[-77.1,-12.7) FALSE FALSE TRUE TRUE FALSE TRUE FALSE TRUE TRUE
## Latitude=[-12.7,13) FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## Latitude=[13,86] TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## Longitude=[-180,-28.3) FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## Longitude=[-28.3,137) FALSE TRUE FALSE TRUE TRUE FALSE TRUE FALSE TRUE
## Longitude=[137,180] TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE
## Type=Earthquake TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## Type=Explosion FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## Type=Nuclear Explosion FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## Type=Rock Burst FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 10
## Latitude=[-77.1,-12.7) TRUE
## Latitude=[-12.7,13) FALSE
## Latitude=[13,86] FALSE
## Longitude=[-180,-28.3) FALSE
## Longitude=[-28.3,137) FALSE
## Longitude=[137,180] TRUE
## Type=Earthquake TRUE
## Type=Explosion FALSE
## Type=Nuclear Explosion FALSE
## Type=Rock Burst FALSE
Transaction ID lists store, for each item, the IDs of the transactions that contain it. The tidLists class uses the sparse matrix class ngCMatrix to store these lists efficiently; in the coerced matrix above, each row corresponds to one item and each column to one transaction ID. Note that a matrix is called sparse if most of its elements are zero; conversely, if most of the elements are non-zero, the matrix is considered dense.
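As a quick check (my own addition), the row sums of the coerced logical matrix should recover the item support counts reported by summary(trans) above.
# Each row of the tidLists matrix marks the transactions containing one item, so
# its row sum is that item's support count (e.g. 23232 for Type=Earthquake).
rowSums(as(vertical, "matrix"))[c("Type=Earthquake", "Status=Reviewed")]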
its <- apriori(trans, parameter=list(target = "frequent"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## NA 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 10 frequent itemsets TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 2341
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[14 item(s), 23412 transaction(s)] done [0.01s].
## sorting and recoding items ... [11 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 6 done [0.00s].
## sorting transactions ... done [0.00s].
## writing ... [181 set(s)] done [0.00s].
## creating S4 object ... done [0.00s].
Another way to analyze the data is the Apriori algorithm, via the function apriori(). Here it mines frequent itemsets using a minimum support constraint; when mining rules, it requires a minimum confidence constraint as well.
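One way to dig further into these frequent itemsets (a sketch; the length filter and the name its2 are my own additions) is to drop the single-item sets and look at the most frequent combinations.
# Keep only itemsets with at least two items and show those with the highest support.
its2 <- its[size(its) > 1]
inspect(head(sort(its2, by = "support"), n = 5))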
its
## set of 181 itemsets
inspect(head(its, n = 10))
## items support transIdenticalToItemsets count
## [1] {Status=Automatic} 0.1127200 0 2639
## [2] {Latitude=[-77.1,-12.7)} 0.3333333 0 7804
## [3] {Longitude=[-180,-28.3)} 0.3333333 0 7804
## [4] {Longitude=[-28.3,137)} 0.3333333 0 7804
## [5] {Latitude=[13,86]} 0.3333333 0 7804
## [6] {Latitude=[-12.7,13)} 0.3333333 0 7804
## [7] {Longitude=[137,180]} 0.3333333 0 7804
## [8] {Depth} 0.7895951 0 18486
## [9] {Status=Reviewed} 0.8872800 0 20773
## [10] {Type=Earthquake} 0.9923116 0 23232
Let's build the rules using the Apriori algorithm. In this case I will use a minimum support of 0.2 and a minimum confidence of 0.9.
rules <- apriori(trans, parameter = list(support = 0.2, confidence = 0.9))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.9 0.1 1 none FALSE TRUE 5 0.2 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 4682
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[14 item(s), 23412 transaction(s)] done [0.01s].
## sorting and recoding items ... [10 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [128 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(head(rules))
## lhs rhs support confidence
## [1] {} => {Type=Earthquake} 0.9923116 0.9923116
## [2] {} => {Magnitude} 1.0000000 1.0000000
## [3] {Latitude=[-77.1,-12.7)} => {Status=Reviewed} 0.3071075 0.9213224
## [4] {Latitude=[-77.1,-12.7)} => {Type=Earthquake} 0.3320092 0.9960277
## [5] {Latitude=[-77.1,-12.7)} => {Magnitude} 0.3333333 1.0000000
## [6] {Longitude=[-180,-28.3)} => {Status=Reviewed} 0.3008713 0.9026140
## coverage lift count
## [1] 1.0000000 1.000000 23232
## [2] 1.0000000 1.000000 23412
## [3] 0.3333333 1.038367 7190
## [4] 0.3333333 1.003745 7773
## [5] 0.3333333 1.000000 7804
## [6] 0.3333333 1.017282 7044
I will use a scatter plot to visualize the rules and see how they behave. To visualize association rules in a scatter plot, we use the plot() function of the arulesViz package.
plot(rules,jitter = 1)
The plot shows support on the x-axis and confidence on the y-axis. Lift is shown as a color gradient ranging from grey to red.
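To complement the scatter plot, the highest-lift rules can also be listed directly (a small addition of mine, not part of the original output).
# Sort the mined rules by lift and inspect the top few.
inspect(head(sort(rules, by = "lift"), n = 5))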
plot(rules, shading = "order")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
There is a special value for shading called "order", which produces a two-key plot where the color of the points represents the length (order) of the rule; this is the same plot you get by selecting method = "two-key plot". It is basically a scatter plot with shading = "order". I could not determine much from it because it looks confusing to me, so instead I will take the first 10 rules and see how the data behaves in a graph.
plot(head(rules, n=10), method="graph")
Graph-based techniques concentrate on the relationships between individual items in the rule set. They represent the rules (or itemsets) as a graph, with items as labeled vertices and rules (or itemsets) as vertices connected to the items by arrows. The network graph above shows the associations between the selected rules: larger circles imply higher support, while redder circles imply higher lift.
From the graph I can say that the data does show some associations between the selected items. What I learned from this project is that data selection matters: the data set I chose is not well suited to generating association rules, since the resulting rules have low support and low lift. Even though I found some trends in my data set, it was not the best data for this method, and choosing a better-suited data set is important.
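A quick way to back up the low-lift observation (my own addition; the 1.1 cutoff is arbitrary and purely illustrative) would be to count how many of the 128 rules exceed even a modest lift threshold.
# Keep only rules whose lift exceeds an illustrative threshold and count them.
strong <- subset(rules, subset = lift > 1.1)
length(strong)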