Project
Phishing Websites Data Set
What is the domain, and what are the potential benefits to be derived from association rule mining? This is high level: not finding patterns, but what would improve because of the use of the patterns.
To perform association rule mining in R, we use the arules and arulesViz packages by Michael Hahsler et al.
| n | name | description |
|---|---|---|
| 1 | having_IP_Address | IP address in URL |
| 2 | URL_Length | URL length |
| 3 | Shortining_Service | URL may be made considerably smaller in length and still lead to the required webpage |
| 4 | having_At_Symbol | “@” symbol present in URL |
| 5 | double_slash_redirecting | “//” present in URL |
| 6 | Prefix_Suffix | Prefixes or Suffixes separated by “-” to the domain name |
| 7 | having_Sub_Domain | Sub Domain and/or multi sub domain |
| 8 | SSLfinal_State | Hyper Text Transfer Protocol with Secure Sockets Layer |
| 9 | Domain_registeration_length | Domain Registration Length |
| 10 | Favicon | graphic image (icon) associated with a specific webpage |
| 11 | port | Using Non-Standard Port |
| 12 | HTTPS_token | The Existence of “HTTPS” Token in the Domain Part of the URL |
| 13 | Request_URL | Examines whether the external objects contained within a webpage (images, videos and sounds) are loaded from another domain |
| 14 | URL_of_Anchor | An anchor is an element defined by the `<a>` tag. This feature is treated exactly as “Request URL” |
| 15 | Links_in_tags | Links in Meta, Script and Link tags |
| 16 | SFH | Server Form Handler (SFH) |
| 17 | Submitting_to_email | Submitting Information to Email |
| 18 | Abnormal_URL | Whether host name is included or not in URL |
| 19 | Redirect | How many times a website has been redirected |
| 20 | on_mouseover | Status Bar Customization |
| 21 | RightClick | Disabling Right Click |
| 22 | popUpWidnow | Using Pop-up Window |
| 23 | Iframe | IFrame Redirection, Iframe is an HTML tag used to display an additional webpage into one that is currently shown |
| 24 | age_of_domain | Age of Domain |
| 25 | DNSRecord | DNS record for the domain |
| 26 | web_traffic | Measures the popularity of the website by determining the number of visitors and the number of pages they visit |
| 27 | Page_Rank | PageRank is a value ranging from “0” to “1”. PageRank aims to measure how important a webpage is on the Internet |
| 28 | Google_Index | Examines whether a website is in Google’s index or not. When a site is indexed by Google, it is displayed on search results |
| 29 | Links_pointing_to_page | The number of links pointing to the webpage |
| 30 | Statistical_report | Host belongs to top phishing IP or top phishing domains |
| 31 | Result | Website classification: legit or phishing |
The data set collects information about websites. Each of the 2456 websites is described by 30 attributes plus the class label Result, all categorical. For example, the data set tells us whether the URL is long or short, whether the URL uses an IP address, and so on. Perhaps the most important variable is Result, which tells us whether a website is phishing or not.
Association rule mining lets us see what phishing websites have in common. In this way we can judge whether a website is legitimate based on its characteristics (URL, port, ...).
The rules can also provide approximate guidelines for creating a new website: the creator will know which features to avoid in order not to be categorized as phishing.
colnames(ds)
## [1] "having_IP_Address" "URL_Length"
## [3] "Shortining_Service" "having_At_Symbol"
## [5] "double_slash_redirecting" "Prefix_Suffix"
## [7] "having_Sub_Domain" "SSLfinal_State"
## [9] "Domain_registeration_length" "Favicon"
## [11] "port" "HTTPS_token"
## [13] "Request_URL" "URL_of_Anchor"
## [15] "Links_in_tags" "SFH"
## [17] "Submitting_to_email" "Abnormal_URL"
## [19] "Redirect" "on_mouseover"
## [21] "RightClick" "popUpWidnow"
## [23] "Iframe" "age_of_domain"
## [25] "DNSRecord" "web_traffic"
## [27] "Page_Rank" "Google_Index"
## [29] "Links_pointing_to_page" "Statistical_report"
## [31] "Result"
What is in the data, and what preprocessing was done to make it amenable for association rule mining. Where choices were made (e.g., parameter settings for discretization, or decisions to ignore an attribute), describe your reasoning behind the choices.
The data set collects information about 2456 websites, each described by 31 variables (30 features plus Result).
All variables in the data set are categorical, with 2 or 3 levels each, as shown in the summary table below.
summary(ds)
## having_IP_Address URL_Length Shortining_Service having_At_Symbol
## 1: 278 1 : 416 0:2154 0:2322
## 0:2178 0 : 28 1: 302 1: 134
## -1:2012
## double_slash_redirecting Prefix_Suffix having_Sub_Domain SSLfinal_State
## 1: 308 -1: 954 -1:1060 -1: 788
## 0:2148 0 :1174 0 : 792 1 :1416
## 1 : 328 1 : 604 0 : 252
## Domain_registeration_length Favicon port HTTPS_token Request_URL
## 0 :890 0:1990 0:2124 1: 394 1 :1468
## 1 :806 1: 466 1: 332 0:2062 -1: 988
## -1:760
## URL_of_Anchor Links_in_tags SFH Submitting_to_email Abnormal_URL
## -1: 718 1 : 596 -1:2060 1: 454 1: 346
## 0 :1202 -1: 804 1 : 396 0:2002 0:2110
## 1 : 536 0 :1056
## Redirect on_mouseover RightClick popUpWidnow Iframe age_of_domain DNSRecord
## 0:2196 0:2166 0:2352 0:1974 0:2230 -1:1088 1:1318
## 1: 260 1: 290 1: 104 1: 482 1: 226 0 : 288 0:1138
## 1 :1080
## web_traffic Page_Rank Google_Index Links_pointing_to_page Statistical_report
## -1: 594 -1:1728 0:2113 1 : 966 1: 440
## 0 : 520 0 : 328 1: 343 0 :1370 0:2016
## 1 :1342 1 : 400 -1: 120
## Result
## 1 :1094
## -1:1362
##
The plots below give an idea of how the data are distributed. They already suggest which variables can be collapsed into 2 levels and which can be eliminated.
URL_Length has 3 levels, one of which is very rare (28 cases), so it is safe to assume that we can collapse it into 2 levels.
Unfortunately, the data set lacks a description of how the variables have been coded, so we are left to guess what 1, 0 and -1 correspond to.
In addition, the coding is not consistent: binary variables are sometimes coded with 0 and 1, sometimes with 1 and -1. The first task is therefore to make assumptions about the coding.
We will make the levels explicit; R handles the underlying coding automatically.
With the following chunks of code we reorganize the factor levels (renaming and collapsing some of them) and delete some variables that are deemed to overlap with others.
ds<-rename(ds, Phishing = Result)
ds$Phishing<-fct_recode(ds$Phishing, yes = "1", no = "-1")
ds$having_IP_Address<-fct_recode(ds$having_IP_Address, yes = "1", no = "0")
ds$URL_Length<-fct_collapse(ds$URL_Length,
"<75" = c("0", "1"),
">75" = "-1"
)
ds$Shortining_Service<-fct_recode(ds$Shortining_Service, yes = "1", no = "0")
ds$having_At_Symbol<-fct_recode(ds$having_At_Symbol, yes = "1", no = "0")
ds$double_slash_redirecting<-fct_recode(ds$double_slash_redirecting, yes = "1", no = "0")
ds$Prefix_Suffix<-fct_recode(ds$Prefix_Suffix, no_dash = "1", prefix_dash="-1", suffix_dash="0")
ds$having_Sub_Domain<-fct_recode(ds$having_Sub_Domain, otherwise="-1", two_dot="0", one_dot="1")
ds$SSLfinal_State<-fct_recode(ds$SSLfinal_State, otherwise="-1", http="0", http_ssl="1")
ds$Domain_registeration_length<-fct_collapse(ds$Domain_registeration_length,
"<1yr" = c("0", "1"),
">1yr" = "-1"
)
ds$Favicon<-fct_recode(ds$Favicon, no ="0", yes ="1")
ds$port<-fct_recode(ds$port, std ="0", non_std ="1")
ds$HTTPS_token<-fct_recode(ds$HTTPS_token, yes = "1", no = "0")
ds<-select(ds,-HTTPS_token)
We discard the variable HTTPS_token since it seems redundant with SSLfinal_State.
ds<-select(ds,-Request_URL)
We discard the variable Request_URL since it has 2 discretized levels here, while the documentation describes 3. Recoding such a factor would be too much of a guess; in addition, it is redundant with URL_of_Anchor.
ds$URL_of_Anchor<-fct_recode(ds$URL_of_Anchor, ">67%" ="-1", "31<anc<67"="0", "<31%" ="1")
ds$Links_in_tags<-fct_recode(ds$Links_in_tags, "<17%"="1", ">81%"="-1", "17%<link<81%" = "0")
ds<-select(ds,-SFH)
We discard the variable SFH since it has 2 levels here, while the documentation describes 3. Recoding such a factor would be too much of a guess.
ds$Submitting_to_email<-fct_recode(ds$Submitting_to_email, yes ="1", no="0")
ds$Abnormal_URL<-fct_recode(ds$Abnormal_URL, yes = "1", no = "0")
ds<-select(ds, -Redirect)
We discard the variable Redirect since it has 2 discretized levels here, while the documentation describes 3. In addition, it seems redundant with double_slash_redirecting.
ds$on_mouseover<-fct_recode(ds$on_mouseover, yes="1", no="0")
ds<-select(ds, -RightClick)
The variable RightClick has 2 levels and there are very few cases of RightClick = 1 (104 out of 2456), so we discard it: RightClick = 0 would be too common across the websites.
ds$popUpWidnow<-fct_recode(ds$popUpWidnow, yes="1", no="0")
ds$Iframe<-fct_recode(ds$Iframe, yes="1", no="0")
ds<-select(ds,-age_of_domain)
We discard the variable age_of_domain since the number of discretized levels here does not match the number described in the documentation. In addition, it seems redundant with Domain_registeration_length.
ds$DNSRecord<-fct_recode(ds$DNSRecord, yes="1", no="0")
ds$web_traffic<-fct_recode(ds$web_traffic, "<100k"="1", ">100k"="0", otherwise="-1")
ds<-select(ds, -Page_Rank)
We discard the variable Page_Rank since it has 3 discretized levels here, while the documentation describes 2. In addition, it seems redundant with the variable web_traffic.
ds$Google_Index<-fct_recode(ds$Google_Index, yes="1", no="0")
ds$Links_pointing_to_page<-fct_recode(ds$Links_pointing_to_page, ">2link" ="1", no_link="0", "0<link<2"="-1")
ds$Statistical_report<-fct_recode(ds$Statistical_report, yes="1", no="0")
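The recoding steps above rely on two forcats helpers. As a minimal sketch, the block below applies them to a small made-up factor; the vector `x`, its values and the level names `short`, `medium` and `long` are hypothetical and only serve to show the difference between renaming levels one to one (`fct_recode()`) and merging several levels into one (`fct_collapse()`).

```r
library(forcats)

# Hypothetical three-level factor coded with 1 / 0 / -1 (values invented for illustration)
x <- factor(c("1", "0", "-1", "-1", "1", "-1"))

# One-to-one renaming: new_name = "old_level"
x_renamed <- fct_recode(x, short = "1", medium = "0", long = "-1")

# Many-to-one merging: new_name = c("old_level_1", "old_level_2")
x_collapsed <- fct_collapse(x, "<75" = c("0", "1"), ">75" = "-1")

table(x_renamed)    # still three levels, only the labels changed
table(x_collapsed)  # two levels: "0" and "1" now share the label "<75"
```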
The data set is almost ready; we only need to transform our data frame into a transaction-like table.
ds_tr<- as(ds, "transactions")
ds_tr %>% summary()
## transactions as itemMatrix in sparse format with
## 2456 rows (elements/itemsets/transactions) and
## 55 columns (items) and a density of 0.4363636
##
## most frequent items:
## having_At_Symbol=no Iframe=no having_IP_Address=no
## 2322 2230 2178
## on_mouseover=no Shortining_Service=no (Other)
## 2166 2154 47894
##
## element (itemset/transaction) length distribution:
## sizes
## 24
## 2456
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 24 24 24 24 24 24
##
## includes extended item information - examples:
## labels variables levels
## 1 having_IP_Address=yes having_IP_Address yes
## 2 having_IP_Address=no having_IP_Address no
## 3 URL_Length=<75 URL_Length <75
##
## includes extended transaction information - examples:
## transactionID
## 1 1
## 2 2
## 3 3
Our data set contains 2456 transactions (one per website) and 55 columns (the total number of levels across all the columns of the data frame), which represent the items (website features).
The summary command tells us which items are the most frequent. For example, we can see that most of the websites do not contain the ‘@’ symbol in the URL.
In addition, summary() displays information about the transaction length distribution. In our case all the transactions have length 24, since every website has a value for each of the 24 columns kept in the data frame.
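The density reported by summary() can be checked by hand: it is the share of filled cells in the transaction-by-item matrix. With 2456 transactions of exactly 24 items each and 55 item columns:

\[ \text{density} = \frac{2456 \times 24}{2456 \times 55} = \frac{24}{55} \approx 0.4364 \]

This agrees with the value 0.4363636 printed in the output above.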
This representation is not ideal, since we do not want a column for each factor level. Most of our variables have 2 levels and can be treated as “logical” variables. For example, for the variable having_At_Symbol we only want one column in the transaction data set, which is present when there is an ‘@’ symbol in the URL and absent otherwise.
# Turn every two-level yes/no factor into a logical column (base R comparison)
for (i in seq_along(ds)) {
  if (nlevels(ds[[i]]) == 2 && "yes" %in% levels(ds[[i]])) {
    ds[[i]] <- ds[[i]] == "yes"
  }
}
ds %>% summary()
## having_IP_Address URL_Length Shortining_Service having_At_Symbol
## Mode :logical <75: 444 Mode :logical Mode :logical
## FALSE:2178 >75:2012 FALSE:2154 FALSE:2322
## TRUE :278 TRUE :302 TRUE :134
## double_slash_redirecting Prefix_Suffix having_Sub_Domain SSLfinal_State
## Mode :logical prefix_dash: 954 otherwise:1060 otherwise: 788
## FALSE:2148 suffix_dash:1174 two_dot : 792 http_ssl :1416
## TRUE :308 no_dash : 328 one_dot : 604 http : 252
## Domain_registeration_length Favicon port URL_of_Anchor
## <1yr:1696 Mode :logical std :2124 >67% : 718
## >1yr: 760 FALSE:1990 non_std: 332 31<anc<67:1202
## TRUE :466 <31% : 536
## Links_in_tags Submitting_to_email Abnormal_URL on_mouseover
## <17% : 596 Mode :logical Mode :logical Mode :logical
## >81% : 804 FALSE:2002 FALSE:2110 FALSE:2166
## 17%<link<81%:1056 TRUE :454 TRUE :346 TRUE :290
## popUpWidnow Iframe DNSRecord web_traffic
## Mode :logical Mode :logical Mode :logical otherwise: 594
## FALSE:1974 FALSE:2230 FALSE:1138 >100k : 520
## TRUE :482 TRUE :226 TRUE :1318 <100k :1342
## Google_Index Links_pointing_to_page Statistical_report Phishing
## Mode :logical >2link : 966 Mode :logical Mode :logical
## FALSE:2113 no_link :1370 FALSE:2016 FALSE:1362
## TRUE :343 0<link<2: 120 TRUE :440 TRUE :1094
ds_tr<- as(ds, "transactions")
ds_tr %>% summary()
## transactions as itemMatrix in sparse format with
## 2456 rows (elements/itemsets/transactions) and
## 41 columns (items) and a density of 0.3082645
##
## most frequent items:
## port=std URL_Length=>75
## 2124 2012
## Domain_registeration_length=<1yr SSLfinal_State=http_ssl
## 1696 1416
## Links_pointing_to_page=no_link (Other)
## 1370 22423
##
## element (itemset/transaction) length distribution:
## sizes
## 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## 467 596 494 213 120 179 141 118 61 15 14 12 22 4
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.00 11.00 12.00 12.64 14.00 23.00
##
## includes extended item information - examples:
## labels variables levels
## 1 having_IP_Address having_IP_Address TRUE
## 2 URL_Length=<75 URL_Length <75
## 3 URL_Length=>75 URL_Length >75
##
## includes extended transaction information - examples:
## transactionID
## 1 1
## 2 2
## 3 3
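The drop from 55 to 41 items follows from how arules treats column types when coercing a data frame: a factor contributes one item per level, while a logical column contributes a single item that is present only where the value is TRUE. The sketch below illustrates this on an invented two-column data frame; `toy` and its column `having_At` are hypothetical names, not part of the data set.

```r
library(arules)

# Toy data (invented): one two-level factor and one logical column
toy <- data.frame(
  port      = factor(c("std", "std", "non_std", "std")),
  having_At = c(TRUE, FALSE, FALSE, TRUE)
)

toy_tr <- as(toy, "transactions")

itemLabels(toy_tr)  # two items for the factor, one item for the logical column
size(toy_tr)        # rows where the logical is FALSE hold one item less
```

The same mechanism explains why the transaction lengths now vary between 10 and 23 instead of being fixed at 24.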
Parameter settings, and the time required.
With so many transactions and items available, the apriori algorithm would generate a huge number of rules. We need to set thresholds on measures such as support, confidence and itemset size.
The support of an itemset X is defined as the proportion of transactions in the data set which contain the itemset. We set the minimum support to 0.1.
The confidence of a rule is defined as \(conf(X \Rightarrow Y) = \frac{sup(X \cup Y)}{sup(X)}\). Confidence can be interpreted as an estimate of the probability P(Y|X), the probability of finding the RHS of the rule in transactions under the condition that these transactions also contain the LHS. We set the minimum confidence to 0.5.
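As a worked example of the two definitions, the base R sketch below recomputes support and confidence for the rule {URL_of_Anchor=>67%} => {Phishing} from counts that appear in the outputs of this report. The variable names are chosen for illustration only.

```r
# Counts taken from the outputs in this report
n_total <- 2456  # number of websites (transactions)
n_lhs   <- 718   # websites with URL_of_Anchor=>67%
n_both  <- 710   # websites with URL_of_Anchor=>67% that are also labelled Phishing

support    <- n_both / n_total  # share of all websites matching the whole rule
confidence <- n_both / n_lhs    # share of LHS websites that also match the RHS

round(c(support = support, confidence = confidence), 7)

# Smallest transaction count that reaches a minimum support of 0.1
ceiling(0.1 * n_total)
```

The results agree with the support (0.2890879) and confidence (0.9888579) listed for this rule in the rule tables, and with the minimum count of 246 in the rule summaries.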
We also constrain the size of the generated rules to a maximum length of 5.
Furthermore, the main goal of our project is to find which features are common to phishing websites. For this reason we set the right-hand side of the rules to ‘Phishing’.
The plot below shows all the items ranked according to their support. The red line represents the median.
Top 10 items based on frequency
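sup_th <- 0.1   # minimum support
conf_th <- 0.5  # minimum confidence
mlen <- 5       # maximum rule length (LHS + RHS)
rules <- apriori(data = ds_tr,
                 parameter = list(supp = sup_th, conf = conf_th, maxlen = mlen),
                 appearance = list(rhs = "Phishing"))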
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 5 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 245
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[41 item(s), 2456 transaction(s)] done [0.00s].
## sorting and recoding items ... [38 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.01s].
## writing ... [263 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
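To make the effect of the `appearance` argument easier to see, here is a minimal sketch on an invented data set of eight websites. The data frame `toy`, its columns `long_url` and `has_at`, and the thresholds are all hypothetical. `default = "lhs"` is spelled out here to state explicitly that every other item may only appear on the left-hand side, and `control = list(verbose = FALSE)` suppresses the log.

```r
library(arules)

# Toy data (invented for illustration): three logical features per website
toy <- data.frame(
  Phishing = c(TRUE, TRUE,  TRUE, FALSE, FALSE, FALSE, FALSE, TRUE),
  long_url = c(TRUE, TRUE,  TRUE, TRUE,  FALSE, FALSE, FALSE, FALSE),
  has_at   = c(TRUE, FALSE, TRUE, FALSE, TRUE,  FALSE, TRUE,  FALSE)
)
toy_tr <- as(toy, "transactions")

toy_rules <- apriori(
  data       = toy_tr,
  parameter  = list(supp = 0.3, conf = 0.7, maxlen = 3),
  appearance = list(rhs = "Phishing", default = "lhs"),
  control    = list(verbose = FALSE)
)

inspect(toy_rules)  # every rule returned has Phishing as its consequent
```

Restricting the right-hand side in this way keeps the rule set focused on the class label and avoids mining associations among the features themselves.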
The apriori algorithm generates 263 rules. Size 4 is the most common (123 rules), followed by size 3 (71 rules) and size 5 (58 rules).
## set of 263 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5
## 11 71 123 58
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 3.000 4.000 3.867 4.000 5.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.1002 Min. :0.5010 Min. :0.1002 Min. :1.125
## 1st Qu.:0.1120 1st Qu.:0.6749 1st Qu.:0.1342 1st Qu.:1.515
## Median :0.1295 Median :0.8567 Median :0.1596 Median :1.923
## Mean :0.1451 Mean :0.8200 Mean :0.1887 Mean :1.841
## 3rd Qu.:0.1596 3rd Qu.:0.9841 3rd Qu.:0.2158 3rd Qu.:2.209
## Max. :0.3664 Max. :1.0000 Max. :0.6906 Max. :2.245
## count
## Min. :246.0
## 1st Qu.:275.0
## Median :318.0
## Mean :356.3
## 3rd Qu.:392.0
## Max. :900.0
##
## mining info:
## data ntransactions support confidence
## ds_tr 2456 0.1 0.5
From this plot we can see that there are many rules with high confidence, high lift and rather low support.
## lhs rhs support confidence coverage lift count
## [1] {URL_of_Anchor=>67%,
## web_traffic=otherwise} => {Phishing} 0.1693811 1 0.1693811 2.244973 416
## [2] {having_Sub_Domain=two_dot,
## URL_of_Anchor=>67%} => {Phishing} 0.1473941 1 0.1473941 2.244973 362
## [3] {SSLfinal_State=otherwise,
## URL_of_Anchor=>67%} => {Phishing} 0.1864821 1 0.1864821 2.244973 458
## [4] {Prefix_Suffix=suffix_dash,
## URL_of_Anchor=>67%} => {Phishing} 0.1140065 1 0.1140065 2.244973 280
## [5] {URL_of_Anchor=>67%,
## Links_pointing_to_page=no_link} => {Phishing} 0.1710098 1 0.1710098 2.244973 420
From the ‘two-key plot’ it is clear that order (the number of items contained in the rule) and support have a very strong inverse relationship. The red dots (rules of size 5) tend to have lower support than the purple dots (rules of size 2).
Remove redundant rules
A rule is redundant if a more general rule with the same or a higher confidence exists. That is, a more specific rule is redundant if it is only equally or even less predictive than a more general rule. A rule is more general if it has the same RHS but one or more items removed from the LHS.
rules<-rules[!is.redundant(rules)]
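Stated formally, and assuming confidence is the comparison measure (as in the definition above), a rule \(X \Rightarrow Y\) is redundant when

\[ \exists\, X' \subset X : \quad conf(X' \Rightarrow Y) \ge conf(X \Rightarrow Y) \]

As a check with values reported in this analysis: {port=std, URL_of_Anchor=>67%} => {Phishing} has confidence 0.9896907, which is higher than the 0.9888579 of the more general {URL_of_Anchor=>67%} => {Phishing}, so the longer rule is not made redundant by that shorter one.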
Example of redundant rules
## lhs rhs support confidence coverage
## [1] {SSLfinal_State=http} => {Phishing} 0.1009772 0.9841270 0.1026059
## [2] {web_traffic=>100k} => {Phishing} 0.1473941 0.6961538 0.2117264
## [3] {web_traffic=otherwise} => {Phishing} 0.1921824 0.7946128 0.2418567
## [4] {URL_of_Anchor=>67%} => {Phishing} 0.2890879 0.9888579 0.2923453
## [5] {Links_in_tags=>81%} => {Phishing} 0.1864821 0.5696517 0.3273616
## [6] {having_Sub_Domain=two_dot} => {Phishing} 0.1718241 0.5328283 0.3224756
## lift count
## [1] 2.209338 248
## [2] 1.562846 362
## [3] 1.783884 472
## [4] 2.219959 710
## [5] 1.278853 458
## [6] 1.196185 422
After removing the redundant rules we are left with 115 rules. Most of the rules have size 3 or 4.
This is because many rules of size 5 were considered redundant.
## set of 115 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5
## 11 52 46 6
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 3.000 3.000 3.409 4.000 5.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.1002 Min. :0.5303 Min. :0.1026 Min. :1.191
## 1st Qu.:0.1134 1st Qu.:0.6509 1st Qu.:0.1429 1st Qu.:1.461
## Median :0.1409 Median :0.7977 Median :0.1743 Median :1.791
## Mean :0.1535 Mean :0.7949 Mean :0.2036 Mean :1.785
## 3rd Qu.:0.1759 3rd Qu.:0.9560 3rd Qu.:0.2374 3rd Qu.:2.146
## Max. :0.3664 Max. :1.0000 Max. :0.6906 Max. :2.245
## count
## Min. :246.0
## 1st Qu.:278.5
## Median :346.0
## Mean :376.9
## 3rd Qu.:432.0
## Max. :900.0
##
## mining info:
## data ntransactions support confidence
## ds_tr 2456 0.1 0.5
Lift can be interpreted as the deviation of the support of the whole rule from the support expected under independence given the supports of both sides of the rule. Greater lift values (>>1) indicate stronger associations. Here we select all the rules whose occurrence is at least twice the expected one (if they were independent).
rules<- subset(rules, subset = lift > 2)
summary(rules)
## set of 42 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5
## 2 18 20 2
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 3.000 4.000 3.524 4.000 5.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.1010 Min. :0.8929 Min. :0.1026 Min. :2.004
## 1st Qu.:0.1166 1st Qu.:0.9545 1st Qu.:0.1198 1st Qu.:2.143
## Median :0.1482 Median :0.9804 Median :0.1555 Median :2.201
## Mean :0.1536 Mean :0.9677 Mean :0.1589 Mean :2.172
## 3rd Qu.:0.1775 3rd Qu.:0.9927 3rd Qu.:0.1855 3rd Qu.:2.229
## Max. :0.2891 Max. :1.0000 Max. :0.2923 Max. :2.245
## count
## Min. :248.0
## 1st Qu.:286.2
## Median :364.0
## Mean :377.3
## 3rd Qu.:436.0
## Max. :710.0
##
## mining info:
## data ntransactions support confidence
## ds_tr 2456 0.1 0.5
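Because every rule has the same right-hand side, the lift threshold translates directly into a confidence threshold. Using the 1094 phishing websites out of 2456 from the data summary:

\[ lift(X \Rightarrow Y) = \frac{conf(X \Rightarrow Y)}{sup(Y)}, \qquad sup(\text{Phishing}) = \frac{1094}{2456} \approx 0.4454 \]

\[ lift > 2 \iff conf > 2 \times 0.4454 \approx 0.8909, \qquad lift_{\max} = \frac{1}{0.4454} \approx 2.245 \]

This is consistent with the summary above, where the smallest confidence among the 42 retained rules is 0.8929 and the largest lift is 2.245.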
The following 2 interactive plots show the selected rules.
Simpson's Paradox
Domain_registeration_length is the variable selected to investigate whether we have a case of Simpson's Paradox; this variable measures how old a domain is. Perhaps phishing websites have different features if we distinguish between new and old domains.
We will repeat the analysis (with the same parameters) on subsets of the data set defined by the value of the variable Domain_registeration_length.
Subset of transactions for “New” websites -> Domain_registeration_length < 1 year
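One way to build such subsets is to split the data frame on the grouping variable before converting each part to transactions. The sketch below does this on invented data; `toy`, `reg_length`, `long_url` and the helper `to_transactions()` are hypothetical names, and this is not necessarily how the subsets in this report were produced.

```r
library(arules)

# Toy data (invented): a grouping factor plus two logical columns
toy <- data.frame(
  reg_length = factor(c("<1yr", "<1yr", "<1yr", ">1yr", ">1yr", "<1yr")),
  long_url   = c(TRUE, TRUE, FALSE, TRUE,  FALSE, TRUE),
  Phishing   = c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE)
)

# Keep the rows of one group, drop the grouping column, then convert
to_transactions <- function(df, group) {
  rows <- df$reg_length == group
  as(df[rows, setdiff(names(df), "reg_length")], "transactions")
}
toy_new_tr <- to_transactions(toy, "<1yr")
toy_old_tr <- to_transactions(toy, ">1yr")

# Sanity check: the two subsets together cover every row exactly once
stopifnot(length(toy_new_tr) + length(toy_old_tr) == nrow(toy))
c(new = length(toy_new_tr), old = length(toy_old_tr))
```

Checking the number of transactions in each subset before mining is a cheap way to confirm that the split did what was intended.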
summary(rules_new)
## set of 82 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5
## 10 42 27 3
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 3.00 3.00 3.28 4.00 5.00
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.1002 Min. :0.5303 Min. :0.1026 Min. :1.191
## 1st Qu.:0.1131 1st Qu.:0.6205 1st Qu.:0.1433 1st Qu.:1.393
## Median :0.1405 Median :0.7967 Median :0.1775 Median :1.789
## Mean :0.1525 Mean :0.7969 Mean :0.2014 Mean :1.789
## 3rd Qu.:0.1779 3rd Qu.:0.9637 3rd Qu.:0.2362 3rd Qu.:2.163
## Max. :0.2940 Max. :1.0000 Max. :0.5366 Max. :2.245
## count
## Min. :246.0
## 1st Qu.:277.8
## Median :345.0
## Mean :374.6
## 3rd Qu.:437.0
## Max. :722.0
##
## mining info:
## data ntransactions support confidence
## ds_new_tr 2456 0.1 0.5
Top rules for “New” websites ranked by support
## lhs rhs support confidence coverage lift count
## [1] {Prefix_Suffix=prefix_dash} => {Phishing} 0.2939739 0.7568134 0.3884365 1.699025 722
## [2] {URL_of_Anchor=>67%} => {Phishing} 0.2890879 0.9888579 0.2923453 2.219959 710
## [3] {DNSRecord} => {Phishing} 0.2846091 0.5303490 0.5366450 1.190619 699
## [4] {SSLfinal_State=otherwise} => {Phishing} 0.2768730 0.8629442 0.3208469 1.937286 680
## [5] {URL_Length=>75,
## Prefix_Suffix=prefix_dash} => {Phishing} 0.2483713 0.7568238 0.3281759 1.699049 610
Top rules for “New” websites ranked by confidence
## lhs rhs support confidence coverage lift count
## [1] {URL_of_Anchor=>67%,
## web_traffic=otherwise} => {Phishing} 0.1693811 1 0.1693811 2.244973 416
## [2] {having_Sub_Domain=two_dot,
## URL_of_Anchor=>67%} => {Phishing} 0.1473941 1 0.1473941 2.244973 362
## [3] {SSLfinal_State=otherwise,
## URL_of_Anchor=>67%} => {Phishing} 0.1864821 1 0.1864821 2.244973 458
## [4] {Prefix_Suffix=suffix_dash,
## URL_of_Anchor=>67%} => {Phishing} 0.1140065 1 0.1140065 2.244973 280
## [5] {URL_of_Anchor=>67%,
## Links_pointing_to_page=no_link} => {Phishing} 0.1710098 1 0.1710098 2.244973 420
Top rules for “New” websites ranked by lift
## lhs rhs support confidence coverage lift count
## [1] {URL_of_Anchor=>67%,
## web_traffic=otherwise} => {Phishing} 0.1693811 1 0.1693811 2.244973 416
## [2] {having_Sub_Domain=two_dot,
## URL_of_Anchor=>67%} => {Phishing} 0.1473941 1 0.1473941 2.244973 362
## [3] {SSLfinal_State=otherwise,
## URL_of_Anchor=>67%} => {Phishing} 0.1864821 1 0.1864821 2.244973 458
## [4] {Prefix_Suffix=suffix_dash,
## URL_of_Anchor=>67%} => {Phishing} 0.1140065 1 0.1140065 2.244973 280
## [5] {URL_of_Anchor=>67%,
## Links_pointing_to_page=no_link} => {Phishing} 0.1710098 1 0.1710098 2.244973 420
Subset of transactions for “Old” websites -> Domain_registeration_length > 1 year
Summary
## set of 82 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5
## 10 42 27 3
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 3.00 3.00 3.28 4.00 5.00
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.1002 Min. :0.5303 Min. :0.1026 Min. :1.191
## 1st Qu.:0.1131 1st Qu.:0.6205 1st Qu.:0.1433 1st Qu.:1.393
## Median :0.1405 Median :0.7967 Median :0.1775 Median :1.789
## Mean :0.1525 Mean :0.7969 Mean :0.2014 Mean :1.789
## 3rd Qu.:0.1779 3rd Qu.:0.9637 3rd Qu.:0.2362 3rd Qu.:2.163
## Max. :0.2940 Max. :1.0000 Max. :0.5366 Max. :2.245
## count
## Min. :246.0
## 1st Qu.:277.8
## Median :345.0
## Mean :374.6
## 3rd Qu.:437.0
## Max. :722.0
##
## mining info:
## data ntransactions support confidence
## ds_old_tr 2456 0.1 0.5
Top rules for “Old” websites ranked by support
## lhs rhs support confidence coverage lift count
## [1] {Prefix_Suffix=prefix_dash} => {Phishing} 0.2939739 0.7568134 0.3884365 1.699025 722
## [2] {URL_of_Anchor=>67%} => {Phishing} 0.2890879 0.9888579 0.2923453 2.219959 710
## [3] {DNSRecord} => {Phishing} 0.2846091 0.5303490 0.5366450 1.190619 699
## [4] {SSLfinal_State=otherwise} => {Phishing} 0.2768730 0.8629442 0.3208469 1.937286 680
## [5] {URL_Length=>75,
## Prefix_Suffix=prefix_dash} => {Phishing} 0.2483713 0.7568238 0.3281759 1.699049 610
Top rules for “Old” websites ranked by confidence
## lhs rhs support confidence coverage lift count
## [1] {URL_of_Anchor=>67%,
## web_traffic=otherwise} => {Phishing} 0.1693811 1 0.1693811 2.244973 416
## [2] {having_Sub_Domain=two_dot,
## URL_of_Anchor=>67%} => {Phishing} 0.1473941 1 0.1473941 2.244973 362
## [3] {SSLfinal_State=otherwise,
## URL_of_Anchor=>67%} => {Phishing} 0.1864821 1 0.1864821 2.244973 458
## [4] {Prefix_Suffix=suffix_dash,
## URL_of_Anchor=>67%} => {Phishing} 0.1140065 1 0.1140065 2.244973 280
## [5] {URL_of_Anchor=>67%,
## Links_pointing_to_page=no_link} => {Phishing} 0.1710098 1 0.1710098 2.244973 420
Top rules for “Old” websites ranked by lift
## lhs rhs support confidence coverage lift count
## [1] {URL_of_Anchor=>67%,
## web_traffic=otherwise} => {Phishing} 0.1693811 1 0.1693811 2.244973 416
## [2] {having_Sub_Domain=two_dot,
## URL_of_Anchor=>67%} => {Phishing} 0.1473941 1 0.1473941 2.244973 362
## [3] {SSLfinal_State=otherwise,
## URL_of_Anchor=>67%} => {Phishing} 0.1864821 1 0.1864821 2.244973 458
## [4] {Prefix_Suffix=suffix_dash,
## URL_of_Anchor=>67%} => {Phishing} 0.1140065 1 0.1140065 2.244973 280
## [5] {URL_of_Anchor=>67%,
## Links_pointing_to_page=no_link} => {Phishing} 0.1710098 1 0.1710098 2.244973 420
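Since the top rules by support, lift and confidence for “New” and “Old” websites overlap almost entirely, we do not detect any Simpson's Paradox behind the variable Domain_registeration_length. Note, however, that both runs report 2456 transactions, the size of the full data set, and produce identical summaries, so the subsetting step should be verified before relying on this conclusion.
A deeper analysis combined with domain knowledge might help to unveil a possible Simpson's Paradox in the association rules.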
Summary (number of rules, general description), and a selection of those you would show to a client.
Top 5 rules based on support
## lhs rhs support confidence coverage lift count
## [1] {URL_of_Anchor=>67%} => {Phishing} 0.2890879 0.9888579 0.2923453 2.219959 710
## [2] {Domain_registeration_length=<1yr,
## URL_of_Anchor=>67%} => {Phishing} 0.2369707 0.9965753 0.2377850 2.237284 582
## [3] {port=std,
## URL_of_Anchor=>67%} => {Phishing} 0.2345277 0.9896907 0.2369707 2.221829 576
## [4] {SSLfinal_State=otherwise,
## Domain_registeration_length=<1yr} => {Phishing} 0.2345277 0.9000000 0.2605863 2.020475 576
## [5] {URL_Length=>75,
## URL_of_Anchor=>67%} => {Phishing} 0.2312704 0.9895470 0.2337134 2.221506 568
Top 5 rules based on confidence
## lhs rhs support confidence coverage lift count
## [1] {URL_of_Anchor=>67%,
## web_traffic=otherwise} => {Phishing} 0.1693811 1 0.1693811 2.244973 416
## [2] {having_Sub_Domain=two_dot,
## URL_of_Anchor=>67%} => {Phishing} 0.1473941 1 0.1473941 2.244973 362
## [3] {SSLfinal_State=otherwise,
## URL_of_Anchor=>67%} => {Phishing} 0.1864821 1 0.1864821 2.244973 458
## [4] {Prefix_Suffix=suffix_dash,
## URL_of_Anchor=>67%} => {Phishing} 0.1140065 1 0.1140065 2.244973 280
## [5] {URL_of_Anchor=>67%,
## Links_pointing_to_page=no_link} => {Phishing} 0.1710098 1 0.1710098 2.244973 420
Top 5 rules based on lift
## lhs rhs support confidence coverage lift count
## [1] {URL_of_Anchor=>67%,
## web_traffic=otherwise} => {Phishing} 0.1693811 1 0.1693811 2.244973 416
## [2] {having_Sub_Domain=two_dot,
## URL_of_Anchor=>67%} => {Phishing} 0.1473941 1 0.1473941 2.244973 362
## [3] {SSLfinal_State=otherwise,
## URL_of_Anchor=>67%} => {Phishing} 0.1864821 1 0.1864821 2.244973 458
## [4] {Prefix_Suffix=suffix_dash,
## URL_of_Anchor=>67%} => {Phishing} 0.1140065 1 0.1140065 2.244973 280
## [5] {URL_of_Anchor=>67%,
## Links_pointing_to_page=no_link} => {Phishing} 0.1710098 1 0.1710098 2.244973 420
The rules with the highest confidence are also the rules with the highest lift. We select these rules since they appear to be the most powerful ones for identifying the features of a phishing website.
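The two rankings coincide because every rule has the same right-hand side: lift is then confidence divided by the constant sup(Phishing), so sorting by either measure gives the same order. The sketch below shows this with `sort()` and `head()` from arules on an invented data set; `toy`, its columns and the thresholds are hypothetical.

```r
library(arules)

# Toy data (invented for illustration): three logical features per website
toy <- data.frame(
  Phishing = c(TRUE, TRUE,  TRUE, FALSE, FALSE, FALSE, FALSE, TRUE),
  long_url = c(TRUE, TRUE,  TRUE, TRUE,  FALSE, FALSE, FALSE, FALSE),
  has_at   = c(TRUE, FALSE, TRUE, FALSE, TRUE,  FALSE, TRUE,  FALSE)
)
toy_tr <- as(toy, "transactions")

toy_rules <- apriori(
  toy_tr,
  parameter  = list(supp = 0.2, conf = 0.4, minlen = 2, maxlen = 3),
  appearance = list(rhs = "Phishing", default = "lhs"),
  control    = list(verbose = FALSE)
)

# Rank the rules by each measure (descending by default) and keep the top 2
top_conf <- head(sort(toy_rules, by = "confidence"), n = 2)
top_lift <- head(sort(toy_rules, by = "lift"), n = 2)

inspect(top_conf)
identical(labels(top_conf), labels(top_lift))  # same rules in the same order
```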
For example, if the anchor tags and the website have different domain names (URL_of_Anchor= >67%) and the website is not popular (web_traffic=otherwise), then in this data set the website is always phishing (confidence 1).
A similar conclusion can be drawn for websites whose anchor tags have different domain names (URL_of_Anchor= >67%) and that have no links pointing to the web page (Links_pointing_to_page=no_link).
Top rules
As in the tables above, the item URL_of_Anchor=>67% is very common among the rules pointing to phishing websites.
The rules we just selected could help a person build a website that will not be categorized as phishing.
For example, it would be important for the tags and the website to have the same domain name, and to avoid prefixes or suffixes separated by “-” in the domain name.
Another suggestion could be to have the tags and the website share the same domain name and to avoid multiple subdomains (having_Sub_Domain=two_dot).