eq <- read.csv("https://raw.githubusercontent.com/khatriprajwol/NEW-Data/main/database.csv")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(rvest)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.4 ✓ stringr 1.4.0
## ✓ tidyr 1.1.4 ✓ forcats 0.5.1
## ✓ readr 2.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x readr::guess_encoding() masks rvest::guess_encoding()
## x dplyr::lag() masks stats::lag()
This project applies association rule mining, a popular data mining method in R. Association rules help us determine the frequent item sets that appear in transactions, and to perform the analysis we use the arules and arulesViz packages. The main application of association rule analysis is finding patterns in what customers buy. We see association rules in real life on YouTube, where we are recommended similar kinds of videos, and in a market, where we are often shown items that other people bought together with ours. Learning this technique can help a business organization recommend the best products to its customers. In my data set it can be used to look for patterns that may hint at where earthquakes of a certain kind tend to occur.
eq <- data.frame(Latitude=eq$Latitude, Longitude=eq$Longitude, Type=eq$Type, Depth=eq$Depth, Magnitude=eq$Magnitude, Status=eq$Status)
head(eq)
## Latitude Longitude Type Depth Magnitude Status
## 1 19.246 145.616 Earthquake 131.6 6.0 Automatic
## 2 1.863 127.352 Earthquake 80.0 5.8 Automatic
## 3 -20.579 -173.972 Earthquake 20.0 6.2 Automatic
## 4 -59.076 -23.557 Earthquake 15.0 5.8 Automatic
## 5 11.938 126.427 Earthquake 15.0 5.8 Automatic
## 6 -13.405 166.629 Earthquake 35.0 6.7 Automatic
tail(eq)
## Latitude Longitude Type Depth Magnitude Status
## 23407 38.3754 -118.8977 Earthquake 10.80 5.6 Reviewed
## 23408 38.3917 -118.8941 Earthquake 12.30 5.6 Reviewed
## 23409 38.3777 -118.8957 Earthquake 8.80 5.5 Reviewed
## 23410 36.9179 140.4262 Earthquake 10.00 5.9 Reviewed
## 23411 -9.0283 118.6639 Earthquake 79.00 6.3 Reviewed
## 23412 37.3973 141.4103 Earthquake 11.94 5.5 Reviewed
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
##
## recode
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(arulesViz)
As discussed above, we use the arules and arulesViz packages for the association mining analysis. Now we need to convert the data for association analysis: before anything can be mined, the transaction data has to be loaded into an object of the transactions class, which is done with the arules package.
transactions(eq)
## Warning: Column(s) 1, 2, 3, 4, 5, 6 not logical or factor. Applying default
## discretization (see '? discretizeDF').
## transactions in sparse format with
## 23412 transactions (rows) and
## 18 items (columns)
I received several warnings here saying the items are not logical or factor. So let's look at the columns and see if we can convert them into factors (or Booleans) for analysis.
colnames(eq)[c(1,2,3,6)]
## [1] "Latitude" "Longitude" "Type" "Status"
Here, I do not think I can do much with Latitude, Longitude, Type and Status. Let's turn Magnitude into "magnitude > 5" and Depth into "depth > 10" and see how that affects the transactions.
df <- eq %>% mutate(
Magnitude = (Magnitude > 5),
Depth = (Depth > 10)
)
Here, I'll build the transactions again and see how converting Magnitude and Depth has cleaned things up.
as(df,"transactions")
## Warning: Column(s) 1, 2, 3, 6 not logical or factor. Applying default
## discretization (see '? discretizeDF').
## transactions in sparse format with
## 23412 transactions (rows) and
## 14 items (columns)
trans <- transactions(df)
## Warning: Column(s) 1, 2, 3, 6 not logical or factor. Applying default
## discretization (see '? discretizeDF').
Even after converting Magnitude and Depth, I still receive the warning that some columns are not logical. I am happy to let R apply its default discretization to the rest of the data, because I could not come up with better cutoffs for what is left, and I feel that Latitude, Longitude, Type and Status are just as important to keep.
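If better cutoffs were available, the default binning could be overridden with discretizeDF() from arules before building the transactions. The sketch below is illustrative only: the equal-frequency binning and the object names df_manual and trans_manual are my own choices, not part of the original analysis.
# Sketch: discretize the remaining numeric columns explicitly instead of letting
# transactions() apply its default binning; the cutoff choices are illustrative.
df_manual <- discretizeDF(df, methods = list(
  Latitude = list(method = "frequency", breaks = 3),
  Longitude = list(method = "frequency", breaks = 3)
))
trans_manual <- transactions(df_manual)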
dim(trans)
## [1] 23412 14
Looking at the dimensions of the object, we have 23412 transactions and 14 distinct items.
itemLabels(trans)
## [1] "Latitude=[-77.1,-12.7)" "Latitude=[-12.7,13)" "Latitude=[13,86]"
## [4] "Longitude=[-180,-28.3)" "Longitude=[-28.3,137)" "Longitude=[137,180]"
## [7] "Type=Earthquake" "Type=Explosion" "Type=Nuclear Explosion"
## [10] "Type=Rock Burst" "Depth" "Magnitude"
## [13] "Status=Automatic" "Status=Reviewed"
The above piece of code shows the list of distinct items in the data, of which there are 14.
summary(trans)
## transactions as itemMatrix in sparse format with
## 23412 rows (elements/itemsets/transactions) and
## 14 columns (items) and a density of 0.4135425
##
## most frequent items:
## Magnitude Type=Earthquake Status=Reviewed
## 23412 23232 20773
## Depth Latitude=[-77.1,-12.7) (Other)
## 18486 7804 41839
##
## element (itemset/transaction) length distribution:
## sizes
## 5 6
## 4926 18486
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 6.00 6.00 5.79 6.00 6.00
##
## includes extended item information - examples:
## labels variables levels
## 1 Latitude=[-77.1,-12.7) Latitude [-77.1,-12.7)
## 2 Latitude=[-12.7,13) Latitude [-12.7,13)
## 3 Latitude=[13,86] Latitude [13,86]
##
## includes extended transaction information - examples:
## transactionID
## 1 1
## 2 2
## 3 3
Here summary() gives us information about our transaction object. We can see we have 23412 transactions (rows) and 14 items (columns), and we can also see the most frequent items. In addition, the density tells us the percentage of non-zero cells in this 23412 x 14 matrix. The length distribution shows that 4926 transactions contain 5 items and 18486 transactions contain 6 items.
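As a small sanity check (my own addition, not part of the original output), the density can be recomputed from the length distribution above, since every transaction contains either 5 or 6 items.
# Share of non-zero cells in the 23412 x 14 incidence matrix; this should come out
# at roughly 0.4135, matching the density reported by summary(trans).
(5 * 4926 + 6 * 18486) / (23412 * 14)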
colnames(trans)
## [1] "Latitude=[-77.1,-12.7)" "Latitude=[-12.7,13)" "Latitude=[13,86]"
## [4] "Longitude=[-180,-28.3)" "Longitude=[-28.3,137)" "Longitude=[137,180]"
## [7] "Type=Earthquake" "Type=Explosion" "Type=Nuclear Explosion"
## [10] "Type=Rock Burst" "Depth" "Magnitude"
## [13] "Status=Automatic" "Status=Reviewed"
head(colnames(trans))
## [1] "Latitude=[-77.1,-12.7)" "Latitude=[-12.7,13)" "Latitude=[13,86]"
## [4] "Longitude=[-180,-28.3)" "Longitude=[-28.3,137)" "Longitude=[137,180]"
inspect(trans[1:6])
## items transactionID
## [1] {Latitude=[13,86],
## Longitude=[137,180],
## Type=Earthquake,
## Depth,
## Magnitude,
## Status=Automatic} 1
## [2] {Latitude=[-12.7,13),
## Longitude=[-28.3,137),
## Type=Earthquake,
## Depth,
## Magnitude,
## Status=Automatic} 2
## [3] {Latitude=[-77.1,-12.7),
## Longitude=[-180,-28.3),
## Type=Earthquake,
## Depth,
## Magnitude,
## Status=Automatic} 3
## [4] {Latitude=[-77.1,-12.7),
## Longitude=[-28.3,137),
## Type=Earthquake,
## Depth,
## Magnitude,
## Status=Automatic} 4
## [5] {Latitude=[-12.7,13),
## Longitude=[-28.3,137),
## Type=Earthquake,
## Depth,
## Magnitude,
## Status=Automatic} 5
## [6] {Latitude=[-77.1,-12.7),
## Longitude=[137,180],
## Type=Earthquake,
## Depth,
## Magnitude,
## Status=Automatic} 6
image(trans)
This is an odd diagram; I was not expecting it to look like this. It is clearly not informative and does not do a great job with all 23412 rows plotted at once. However, let me restrict it to a few transactions and see how it does.
image(trans[1:10])
This did a much better job; it is far clearer and more informative, and I can see the structure of each transaction. Since the data is large I took only a small part of it, [1:10], where we can see which cells are non-zero. Overall about 41% of the cells are non-zero (the density of 0.4135 reported by summary()), so by that standard the matrix is fairly dense rather than sparse.
itemFrequencyPlot(trans,topN = 15, col="Red")
The above diagram shows the relative item frequency. The items {Latitude=[-77.1,-12.7)}, {Latitude=[-12.7,13)}, {Latitude=[13,86]}, {Longitude=[-180,-28.3)}, {Longitude=[-28.3,137)} and {Longitude=[137,180]} all have a relative frequency (support) of about 33%.
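The same support values can also be read off the transactions numerically rather than from the plot; the two item labels below are just examples of how this might be checked.
# itemFrequency() returns the support of each item; each latitude and longitude
# bin should come out at roughly one third of the 23412 records.
itemFrequency(trans)[c("Latitude=[-77.1,-12.7)", "Longitude=[137,180]")]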
vertical <- as(trans, "tidLists")
as(vertical, "matrix")[1:10, 1:10]
## 1 2 3 4 5 6 7 8 9
## Latitude=[-77.1,-12.7) FALSE FALSE TRUE TRUE FALSE TRUE FALSE TRUE TRUE
## Latitude=[-12.7,13) FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## Latitude=[13,86] TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## Longitude=[-180,-28.3) FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## Longitude=[-28.3,137) FALSE TRUE FALSE TRUE TRUE FALSE TRUE FALSE TRUE
## Longitude=[137,180] TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE
## Type=Earthquake TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## Type=Explosion FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## Type=Nuclear Explosion FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## Type=Rock Burst FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 10
## Latitude=[-77.1,-12.7) TRUE
## Latitude=[-12.7,13) FALSE
## Latitude=[13,86] FALSE
## Longitude=[-180,-28.3) FALSE
## Longitude=[-28.3,137) FALSE
## Longitude=[137,180] TRUE
## Type=Earthquake TRUE
## Type=Explosion FALSE
## Type=Nuclear Explosion FALSE
## Type=Rock Burst FALSE
Transaction ID lists store, for each item, the IDs of the transactions that contain it. The tidLists class uses the sparse matrix class ngCMatrix to store these lists efficiently; in the coerced matrix above, each row corresponds to one item and each column to one transaction ID. Note that a matrix is called sparse if most of its elements are zero; conversely, if most of the elements are non-zero, the matrix is considered dense.
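As a quick check (my own addition), the row sums of the coerced logical matrix should recover the item support counts reported by summary(trans) above.
# Each row of the tidLists matrix marks the transactions containing one item, so
# its row sum is that item's support count (e.g. 23232 for Type=Earthquake).
rowSums(as(vertical, "matrix"))[c("Type=Earthquake", "Status=Reviewed")]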
its <- apriori(trans, parameter=list(target = "frequent"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## NA 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 10 frequent itemsets TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 2341
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[14 item(s), 23412 transaction(s)] done [0.01s].
## sorting and recoding items ... [11 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 6 done [0.00s].
## sorting transactions ... done [0.00s].
## writing ... [181 set(s)] done [0.00s].
## creating S4 object ... done [0.00s].
Another way to analyze the data is the Apriori algorithm, via the function apriori(). Here it mines frequent itemsets using a minimum support constraint; when mining rules, it requires a minimum confidence constraint as well.
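One way to dig further into these frequent itemsets (a sketch; the length filter and the name its2 are my own additions) is to drop the single-item sets and look at the most frequent combinations.
# Keep only itemsets with at least two items and show those with the highest support.
its2 <- its[size(its) > 1]
inspect(head(sort(its2, by = "support"), n = 5))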
its
## set of 181 itemsets
inspect(head(its, n = 10))
## items support transIdenticalToItemsets count
## [1] {Status=Automatic} 0.1127200 0 2639
## [2] {Latitude=[-77.1,-12.7)} 0.3333333 0 7804
## [3] {Longitude=[-180,-28.3)} 0.3333333 0 7804
## [4] {Longitude=[-28.3,137)} 0.3333333 0 7804
## [5] {Latitude=[13,86]} 0.3333333 0 7804
## [6] {Latitude=[-12.7,13)} 0.3333333 0 7804
## [7] {Longitude=[137,180]} 0.3333333 0 7804
## [8] {Depth} 0.7895951 0 18486
## [9] {Status=Reviewed} 0.8872800 0 20773
## [10] {Type=Earthquake} 0.9923116 0 23232
Let's build the rules using the Apriori algorithm. In this case I will use a minimum support of 0.2 and a minimum confidence of 0.9.
rules <- apriori(trans, parameter = list(support = 0.2, confidence = 0.9))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.9 0.1 1 none FALSE TRUE 5 0.2 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 4682
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[14 item(s), 23412 transaction(s)] done [0.01s].
## sorting and recoding items ... [10 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [128 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(head(rules))
## lhs rhs support confidence
## [1] {} => {Type=Earthquake} 0.9923116 0.9923116
## [2] {} => {Magnitude} 1.0000000 1.0000000
## [3] {Latitude=[-77.1,-12.7)} => {Status=Reviewed} 0.3071075 0.9213224
## [4] {Latitude=[-77.1,-12.7)} => {Type=Earthquake} 0.3320092 0.9960277
## [5] {Latitude=[-77.1,-12.7)} => {Magnitude} 0.3333333 1.0000000
## [6] {Longitude=[-180,-28.3)} => {Status=Reviewed} 0.3008713 0.9026140
## coverage lift count
## [1] 1.0000000 1.000000 23232
## [2] 1.0000000 1.000000 23412
## [3] 0.3333333 1.038367 7190
## [4] 0.3333333 1.003745 7773
## [5] 0.3333333 1.000000 7804
## [6] 0.3333333 1.017282 7044
I will use a scatter plot to visualize the rules and see how they behave. To visualize association rules in a scatter plot, we use the plot() function of the arulesViz package.
plot(rules,jitter = 1)
The plot shows support on the x-axis and confidence on the y-axis. Lift is shown as a color gradient ranging from grey to red.
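To complement the scatter plot, the highest-lift rules can also be listed directly (a small addition of mine, not part of the original output).
# Sort the mined rules by lift and inspect the top few.
inspect(head(sort(rules, by = "lift"), n = 5))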
plot(rules, shading = "order")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
There is a special value for shading called "order", which produces a two-key plot where the color of the points represents the length (order) of the rule; this is the same plot you get by selecting method = "two-key plot". It is basically a scatter plot with shading = "order". I could not determine much from it because it looks confusing to me, so instead I will take the first 10 rules and see how the data behaves in a graph.
plot(head(rules, n=10), method="graph")
Graph-based techniques concentrate on the relationships between individual items in the rule set. They represent the rules (or itemsets) as a graph, with items as labeled vertices and rules (or itemsets) as vertices connected to the items by arrows. The network graph above shows the associations between the selected rules: larger circles imply higher support, while redder circles imply higher lift.
From the graph I can say that the data does show some associations between the selected items. What I learned from this project is that data selection matters: the data set I chose is not well suited to generating association rules, since the resulting rules have low support and low lift. Even though I found some trends in my data set, it was not the best data for this method, and choosing a better-suited data set is important.
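A quick way to back up the low-lift observation (my own addition; the 1.1 cutoff is arbitrary and purely illustrative) would be to count how many of the 128 rules exceed even a modest lift threshold.
# Keep only rules whose lift exceeds an illustrative threshold and count them.
strong <- subset(rules, subset = lift > 1.1)
length(strong)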