Project

Phishing Websites Data Set

Objective

What is the domain, and what are the potential benefits to be derived from association rule mining? This is high level: not which patterns are found, but what would improve because of the use of the patterns.

To perform Association Rule Mining in R, we use the arules and the arulesViz packages in R by Michael Hahsler, et al.

n	name	description
1	having_IP_Address	IP address in URL
2	URL_Length	URL length
3	Shortining_Service	URL may be made considerably smaller in length and still lead to the required webpage
4	having_At_Symbol	“@” symbol present in URL
5	double_slash_redirecting	“//” present in URL
6	Prefix_Suffix	Prefixes or Suffixes separated by “-” to the domain name
7	having_Sub_Domain	Sub Domain and/or multi sub domain
8	SSLfinal_State	Hyper Text Transfer Protocol with Secure Sockets Layer
9	Domain_registeration_length	Domain Registration Length
10	Favicon	graphic image (icon) associated with a specific webpage
11	port	Using Non-Standard Port
12	HTTPS_token	The Existence of “HTTPS” Token in the Domain Part of the URL
13	Request_URL	Examines whether the external objects contained within a webpage (images, videos and sounds) are loaded from another domain
14	URL_of_Anchor	An anchor is an element defined by the a tag. This feature is treated exactly as “Request URL”
15	Links_in_tags	Links in Meta, Script and Link tags
16	SFH	Server Form Handler (SFH)
17	Submitting_to_email	Submitting Information to Email
18	Abnormal_URL	Whether host name is included or not in URL
19	Redirect	How many times a website has been redirected
20	on_mouseover	Status Bar Customization
21	RightClick	Disabling Right Click
22	popUpWidnow	Using Pop-up Window
23	Iframe	IFrame Redirection, Iframe is an HTML tag used to display an additional webpage into one that is currently shown
24	age_of_domain	Age of Domain
25	DNSRecord	DNS record for the domain
26	web_traffic	Measures the popularity of the website by determining the number of visitors and the number of pages they visit
27	Page_Rank	PageRank is a value ranging from “0” to “1”. PageRank aims to measure how important a webpage is on the Internet
28	Google_Index	Examines whether a website is in Google’s index or not. When a site is indexed by Google, it is displayed on search results
29	Links_pointing_to_page	The number of links pointing to the webpage
30	Statistical_report	Host belongs to top phishing IP or top phishing domains
31	Result	Website classification: legit or phishing

The data set collects information about websites. It contains 2456 websites, each described through 31 attributes, all categorical. For example, the data set tells us whether the URL is long or short, whether the URL uses an IP address, and so on. Perhaps the most important variable is Result, which tells us whether a website is a phishing website or not.

From association rule mining we can see what phishing websites have in common. In this way we can recognize whether a website is legitimate or not based on its characteristics (URL, port, ...).

This can also provide approximate guidelines for creating a new website: the website creator will know which features to avoid in order not to be categorized as phishing.

colnames(ds)

##  [1] "having_IP_Address"           "URL_Length"                 
##  [3] "Shortining_Service"          "having_At_Symbol"           
##  [5] "double_slash_redirecting"    "Prefix_Suffix"              
##  [7] "having_Sub_Domain"           "SSLfinal_State"             
##  [9] "Domain_registeration_length" "Favicon"                    
## [11] "port"                        "HTTPS_token"                
## [13] "Request_URL"                 "URL_of_Anchor"              
## [15] "Links_in_tags"               "SFH"                        
## [17] "Submitting_to_email"         "Abnormal_URL"               
## [19] "Redirect"                    "on_mouseover"               
## [21] "RightClick"                  "popUpWidnow"                
## [23] "Iframe"                      "age_of_domain"              
## [25] "DNSRecord"                   "web_traffic"                
## [27] "Page_Rank"                   "Google_Index"               
## [29] "Links_pointing_to_page"      "Statistical_report"         
## [31] "Result"

Data set description

What is in the data, and what preprocessing was done to make it amenable for association rule mining. Where choices were made (e.g., parameter settings for discretization, or decisions to ignore an attribute), describe your reasoning behind the choices.

The data set collects information about 2456 websites. Each website is described through 31 attributes.

All variables in the data set are categorical, and each has 2 or 3 levels, as shown in the summary table below.

summary(ds)

##  having_IP_Address URL_Length Shortining_Service having_At_Symbol
##  1: 278            1 : 416    0:2154             0:2322          
##  0:2178            0 :  28    1: 302             1: 134          
##                    -1:2012                                       
##  double_slash_redirecting Prefix_Suffix having_Sub_Domain SSLfinal_State
##  1: 308                   -1: 954       -1:1060           -1: 788       
##  0:2148                   0 :1174       0 : 792           1 :1416       
##                           1 : 328       1 : 604           0 : 252       
##  Domain_registeration_length Favicon  port     HTTPS_token Request_URL
##  0 :890                      0:1990   0:2124   1: 394      1 :1468    
##  1 :806                      1: 466   1: 332   0:2062      -1: 988    
##  -1:760                                                               
##  URL_of_Anchor Links_in_tags SFH       Submitting_to_email Abnormal_URL
##  -1: 718       1 : 596       -1:2060   1: 454              1: 346      
##  0 :1202       -1: 804       1 : 396   0:2002              0:2110      
##  1 : 536       0 :1056                                                 
##  Redirect on_mouseover RightClick popUpWidnow Iframe   age_of_domain DNSRecord
##  0:2196   0:2166       0:2352     0:1974      0:2230   -1:1088       1:1318   
##  1: 260   1: 290       1: 104     1: 482      1: 226   0 : 288       0:1138   
##                                                        1 :1080                
##  web_traffic Page_Rank Google_Index Links_pointing_to_page Statistical_report
##  -1: 594     -1:1728   0:2113       1 : 966                1: 440            
##  0 : 520     0 : 328   1: 343       0 :1370                0:2016            
##  1 :1342     1 : 400                -1: 120                                  
##  Result   
##  1 :1094  
##  -1:1362  
##

The plots below give an idea of how the data are distributed. They already suggest which variables can be collapsed into 2 levels and which ones can be eliminated.

URL_Length has 3 levels, but one of them (0) occurs only 28 times, so it is safe to collapse it into 2 levels.

Unfortunately, the data set lacks a description of how the variables have been coded, so we are left to guess what 1, 0 and -1 correspond to.

In addition, we notice that the coding is not consistent: a binary variable is sometimes coded with 0 and 1, sometimes with 1 and -1. The first task is to make assumptions about the coding.

We will make the levels explicit; R will handle the underlying coding automatically.

With the following chunks of code we reorganize the factor levels (renaming and collapsing some levels) and delete some variables which are deemed to overlap with others.

ds<-rename(ds, Phishing = Result)
ds$Phishing<-fct_recode(ds$Phishing, yes = "1", no = "-1")

ds$having_IP_Address<-fct_recode(ds$having_IP_Address, yes = "1", no = "0")

ds$URL_Length<-fct_collapse(ds$URL_Length,
  "<75" = c("0", "1"),
  ">75" = "-1"
  )

ds$Shortining_Service<-fct_recode(ds$Shortining_Service, yes = "1", no = "0")

ds$having_At_Symbol<-fct_recode(ds$having_At_Symbol, yes = "1", no = "0")

ds$double_slash_redirecting<-fct_recode(ds$double_slash_redirecting, yes = "1", no = "0")

ds$Prefix_Suffix<-fct_recode(ds$Prefix_Suffix, no_dash = "1", prefix_dash="-1", suffix_dash="0")

ds$having_Sub_Domain<-fct_recode(ds$having_Sub_Domain, otherwise="-1", two_dot="0", one_dot="1")

ds$SSLfinal_State<-fct_recode(ds$SSLfinal_State, otherwise="-1", http="0", http_ssl="1")

ds$Domain_registeration_length<-fct_collapse(ds$Domain_registeration_length,
  "<1yr" = c("0", "1"),
  ">1yr" = "-1"
  )

ds$Favicon<-fct_recode(ds$Favicon, no ="0", yes ="1")

ds$port<-fct_recode(ds$port, std ="0", non_std ="1")

ds$HTTPS_token<-fct_recode(ds$HTTPS_token, yes = "1", no = "0")

ds<-select(ds,-HTTPS_token)

We discard the variable HTTPS_token since it seems redundant with SSLfinal_State.

ds<-select(ds,-Request_URL)

We discard the variable Request_URL since it has 2 discretized levels here but the documentation describes 3. Recoding such a factor would be too much of a guess; in addition, it is redundant with URL_of_Anchor.

ds$URL_of_Anchor<-fct_recode(ds$URL_of_Anchor, ">67%" ="-1", "31<anc<67"="0", "<31%" ="1")
ds$Links_in_tags<-fct_recode(ds$Links_in_tags, "<17%"="1", ">81%"="-1", "17%<link<81%" = "0")

ds<-select(ds,-SFH)

We discard the variable SFH since it has 2 levels here but the documentation describes 3. Recoding such a factor would be too much of a guess.

ds$Submitting_to_email<-fct_recode(ds$Submitting_to_email, yes ="1", no="0")

ds$Abnormal_URL<-fct_recode(ds$Abnormal_URL, yes = "1", no = "0")

ds<-select(ds, -Redirect)

We discard the variable Redirect since it has 2 discretized levels here but the documentation describes 3. In addition, it seems redundant with double_slash_redirecting.

ds$on_mouseover<-fct_recode(ds$on_mouseover, yes="1", no="0")

ds<-select(ds, -RightClick)

We discard the variable RightClick: it has 2 levels and very few cases of RightClick = 1 (104 of 2456), so RightClick = 0 would be too common across the websites to be informative.

ds$popUpWidnow<-fct_recode(ds$popUpWidnow, yes="1", no="0")
ds$Iframe<-fct_recode(ds$Iframe, yes="1", no="0")

ds<-select(ds,-age_of_domain)

We discard the variable age_of_domain since it has 3 discretized levels here (see the summary table above) but the documentation describes 2. In addition, it seems redundant with Domain_registeration_length.

ds$DNSRecord<-fct_recode(ds$DNSRecord, yes="1", no="0")

ds$web_traffic<-fct_recode(ds$web_traffic, "<100k"="1", ">100k"="0", otherwise="-1")

ds<-select(ds, -Page_Rank)

We discard the variable Page_Rank since it has 3 discretized levels here but the documentation describes 2. In addition, it seems redundant with the variable web_traffic.

ds$Google_Index<-fct_recode(ds$Google_Index, yes="1", no="0")

ds$Links_pointing_to_page<-fct_recode(ds$Links_pointing_to_page, ">2link" ="1", no_link="0", "0<link<2"="-1")

ds$Statistical_report<-fct_recode(ds$Statistical_report, yes="1", no="0")

The data set is almost ready; we need to transform our data frame into a transaction-like table.

ds_tr<- as(ds, "transactions")
ds_tr %>% summary()

## transactions as itemMatrix in sparse format with
##  2456 rows (elements/itemsets/transactions) and
##  55 columns (items) and a density of 0.4363636 
## 
## most frequent items:
##   having_At_Symbol=no             Iframe=no  having_IP_Address=no 
##                  2322                  2230                  2178 
##       on_mouseover=no Shortining_Service=no               (Other) 
##                  2166                  2154                 47894 
## 
## element (itemset/transaction) length distribution:
## sizes
##   24 
## 2456 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      24      24      24      24      24      24 
## 
## includes extended item information - examples:
##                  labels         variables levels
## 1 having_IP_Address=yes having_IP_Address    yes
## 2  having_IP_Address=no having_IP_Address     no
## 3        URL_Length=<75        URL_Length    <75
## 
## includes extended transaction information - examples:
##   transactionID
## 1             1
## 2             2
## 3             3

Our data set contains 2456 transactions (the number of websites) and 55 columns (the sum of the number of levels across all the columns in the original data frame), which represent the items (website features).

The summary command tells us which items are the most frequent. For example, we can see that most of the websites do not contain the ‘@’ symbol in the URL.

In addition summary() displays information about the transaction length distribution. In our case we see that all the transactions have length equal to 24 since we have a value for all the columns in the original data set.
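The item count and density reported in this summary can be checked by hand. The sketch below assumes the 24 columns left after the recoding step split into 17 two-level and 7 three-level factors; these counts are read off the recoding code above and are not produced by an arules call.

```r
# Factor levels per column after recoding (counted from the code above):
# 17 columns with 2 levels and 7 columns with 3 levels.
n_two_level   <- 17
n_three_level <- 7

n_columns <- n_two_level + n_three_level          # 24 columns
n_items   <- 2 * n_two_level + 3 * n_three_level  # one item per factor level

# Every transaction holds exactly one item per column, so the density of the
# binary incidence matrix is (items per transaction) / (number of items).
density <- n_columns / n_items

print(n_items)            # 55
print(round(density, 7))  # 0.4363636
```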

We are not entirely happy with this representation, since we do not want a column for each factor level. For example, most of our variables have 2 levels and can be treated as “logical” variables: for the variable having_At_Symbol we only want one column in the transaction data set, with 1 meaning “yes, there is an @ symbol in the URL” and 0 meaning “no @ symbol”.

# Convert every two-level yes/no factor to a logical column
for (i in seq_along(ds)){
  if ( nlevels(ds[[i]])==2 && "yes" %in% levels(ds[[i]]) ){
    ds[[i]]<-to_logical(as.character(ds[[i]]))
    }
}
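For readers without the helper used in the loop, the same conversion can be sketched in base R. The toy data frame and its values are hypothetical; only the column names mirror the data set, and the sketch assumes that "yes" should map to TRUE.

```r
# Toy data frame with hypothetical values; column names mirror the data set.
toy <- data.frame(
  having_At_Symbol = factor(c("yes", "no", "no")),
  port             = factor(c("std", "non_std", "std"))
)

# Flag the two-level factors whose levels include "yes".
is_yes_no <- vapply(
  toy,
  function(col) is.factor(col) && nlevels(col) == 2 && "yes" %in% levels(col),
  logical(1)
)

# Replace only the flagged columns with TRUE/FALSE; other factors are untouched.
toy[is_yes_no] <- lapply(toy[is_yes_no], function(col) col == "yes")

str(toy)
```

Here having_At_Symbol becomes a logical column, while port keeps its two named levels because neither of them is "yes".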

ds %>% summary()

##  having_IP_Address URL_Length Shortining_Service having_At_Symbol
##  Mode :logical     <75: 444   Mode :logical      Mode :logical   
##  FALSE:2178        >75:2012   FALSE:2154         FALSE:2322      
##  TRUE :278                    TRUE :302          TRUE :134       
##  double_slash_redirecting     Prefix_Suffix  having_Sub_Domain   SSLfinal_State
##  Mode :logical            prefix_dash: 954   otherwise:1060    otherwise: 788  
##  FALSE:2148               suffix_dash:1174   two_dot  : 792    http_ssl :1416  
##  TRUE :308                no_dash    : 328   one_dot  : 604    http     : 252  
##  Domain_registeration_length  Favicon             port        URL_of_Anchor 
##  <1yr:1696                   Mode :logical   std    :2124   >67%     : 718  
##  >1yr: 760                   FALSE:1990      non_std: 332   31<anc<67:1202  
##                              TRUE :466                      <31%     : 536  
##       Links_in_tags  Submitting_to_email Abnormal_URL    on_mouseover   
##  <17%        : 596   Mode :logical       Mode :logical   Mode :logical  
##  >81%        : 804   FALSE:2002          FALSE:2110      FALSE:2166     
##  17%<link<81%:1056   TRUE :454           TRUE :346       TRUE :290      
##  popUpWidnow       Iframe        DNSRecord          web_traffic  
##  Mode :logical   Mode :logical   Mode :logical   otherwise: 594  
##  FALSE:1974      FALSE:2230      FALSE:1138      >100k    : 520  
##  TRUE :482       TRUE :226       TRUE :1318      <100k    :1342  
##  Google_Index    Links_pointing_to_page Statistical_report  Phishing      
##  Mode :logical   >2link  : 966          Mode :logical      Mode :logical  
##  FALSE:2113      no_link :1370          FALSE:2016         FALSE:1362     
##  TRUE :343       0<link<2: 120          TRUE :440          TRUE :1094

ds_tr<- as(ds, "transactions")
ds_tr %>% summary()

## transactions as itemMatrix in sparse format with
##  2456 rows (elements/itemsets/transactions) and
##  41 columns (items) and a density of 0.3082645 
## 
## most frequent items:
##                         port=std                   URL_Length=>75 
##                             2124                             2012 
## Domain_registeration_length=<1yr          SSLfinal_State=http_ssl 
##                             1696                             1416 
##   Links_pointing_to_page=no_link                          (Other) 
##                             1370                            22423 
## 
## element (itemset/transaction) length distribution:
## sizes
##  10  11  12  13  14  15  16  17  18  19  20  21  22  23 
## 467 596 494 213 120 179 141 118  61  15  14  12  22   4 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.00   11.00   12.00   12.64   14.00   23.00 
## 
## includes extended item information - examples:
##              labels         variables levels
## 1 having_IP_Address having_IP_Address   TRUE
## 2    URL_Length=<75        URL_Length    <75
## 3    URL_Length=>75        URL_Length    >75
## 
## includes extended transaction information - examples:
##   transactionID
## 1             1
## 2             2
## 3             3
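The figures in this second summary can also be reproduced by hand. The sketch below assumes 14 yes/no columns were converted to logical (one item each, present only when TRUE), leaving 3 two-level and 7 three-level factors; these counts are read off the recoding code, and the length distribution is copied from the output above.

```r
n_logical     <- 14  # yes/no columns converted to logical: one item each
n_two_level   <- 3   # URL_Length, Domain_registeration_length, port
n_three_level <- 7

n_items <- n_logical + 2 * n_two_level + 3 * n_three_level

# Non-logical columns always contribute exactly one item per transaction,
# so no transaction can be shorter than this.
min_length <- n_two_level + n_three_level

# Transaction length distribution copied from the summary output.
sizes  <- 10:23
counts <- c(467, 596, 494, 213, 120, 179, 141, 118, 61, 15, 14, 12, 22, 4)

mean_length <- sum(sizes * counts) / sum(counts)
density     <- mean_length / n_items

print(n_items)                # 41
print(min_length)             # 10
print(round(mean_length, 2))  # 12.64
print(round(density, 7))      # 0.3082645
```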

Rule mining process

Parameter settings, and the time required.

With so many transactions and items available, the apriori algorithm would generate a huge number of rules. We need to set some parameters and thresholds on measures such as support, confidence and itemset size.

The support of an itemset X is defined as the proportion of transactions in the data set which contain the itemset. We set the minimum support threshold to 0.1.
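As a quick illustration of what this threshold means in absolute terms, the base-R sketch below converts it into a number of websites and computes the support of a single item. The counts come from the summary tables earlier in the report.

```r
n_transactions <- 2456
min_support    <- 0.1

# Smallest whole number of transactions whose proportion reaches the threshold.
min_count <- ceiling(min_support * n_transactions)
print(min_count)  # 246

# Support of a single item: 1094 websites are labelled as phishing.
supp_phishing <- 1094 / n_transactions
print(round(supp_phishing, 4))  # 0.4454
```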

The confidence of a rule is defined as \(conf(X \Rightarrow Y) = \frac{sup(X \cup Y)}{sup(X)}.\) Confidence can be interpreted as an estimate of the probability P(Y|X), the probability of finding the RHS of the rule in transactions under the condition that these transactions also contain the LHS. We set the minimum confidence threshold to 0.5.
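A worked example may help. The sketch below recomputes support and confidence for the rule {URL_of_Anchor=>67%} => {Phishing}, using counts that appear in this report's own output: 718 websites have URL_of_Anchor=>67%, and 710 of them are phishing.

```r
n_transactions <- 2456
count_lhs      <- 718   # websites with URL_of_Anchor=>67%
count_both     <- 710   # of those, websites that are also phishing

support    <- count_both / n_transactions
coverage   <- count_lhs / n_transactions   # support of the LHS alone
confidence <- support / coverage           # equals count_both / count_lhs

print(round(support, 7))     # 0.2890879
print(round(coverage, 7))    # 0.2923453
print(round(confidence, 7))  # 0.9888579
```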

We also constrain the size of the generated rules to a maximum length of 5.

Furthermore, the main goal of our project is to find which features are common to phishing websites. For this reason we restrict the right-hand side of the rules to ‘Phishing’.

The plot below shows all the items ranked according to their support. The red line represents the median.

Top 10 items based on frequency

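# Thresholds set above: minimum support, minimum confidence, maximum rule length
sup_th <- 0.1; conf_th <- 0.5; mlen <- 5
rules<-apriori(data=ds_tr,
               parameter=list (supp=sup_th, conf = conf_th, maxlen = mlen),
               appearance = list (rhs="Phishing"))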

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5     0.1      1
##  maxlen target  ext
##       5  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 245 
## 
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[41 item(s), 2456 transaction(s)] done [0.00s].
## sorting and recoding items ... [38 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.01s].
## writing ... [263 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

The apriori algorithm generates 263 rules. Size 4 is the most common (123 rules), followed by size 3 (71 rules) and size 5 (58 rules).

## set of 263 rules
## 
## rule length distribution (lhs + rhs):sizes
##   2   3   4   5 
##  11  71 123  58 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   4.000   3.867   4.000   5.000 
## 
## summary of quality measures:
##     support         confidence        coverage           lift      
##  Min.   :0.1002   Min.   :0.5010   Min.   :0.1002   Min.   :1.125  
##  1st Qu.:0.1120   1st Qu.:0.6749   1st Qu.:0.1342   1st Qu.:1.515  
##  Median :0.1295   Median :0.8567   Median :0.1596   Median :1.923  
##  Mean   :0.1451   Mean   :0.8200   Mean   :0.1887   Mean   :1.841  
##  3rd Qu.:0.1596   3rd Qu.:0.9841   3rd Qu.:0.2158   3rd Qu.:2.209  
##  Max.   :0.3664   Max.   :1.0000   Max.   :0.6906   Max.   :2.245  
##      count      
##  Min.   :246.0  
##  1st Qu.:275.0  
##  Median :318.0  
##  Mean   :356.3  
##  3rd Qu.:392.0  
##  Max.   :900.0  
## 
## mining info:
##   data ntransactions support confidence
##  ds_tr          2456     0.1        0.5

From this plot we can see that there are many rules with high confidence, high lift and rather low support.

##     lhs                                 rhs          support confidence  coverage     lift count
## [1] {URL_of_Anchor=>67%,                                                                        
##      web_traffic=otherwise}          => {Phishing} 0.1693811          1 0.1693811 2.244973   416
## [2] {having_Sub_Domain=two_dot,                                                                 
##      URL_of_Anchor=>67%}             => {Phishing} 0.1473941          1 0.1473941 2.244973   362
## [3] {SSLfinal_State=otherwise,                                                                  
##      URL_of_Anchor=>67%}             => {Phishing} 0.1864821          1 0.1864821 2.244973   458
## [4] {Prefix_Suffix=suffix_dash,                                                                 
##      URL_of_Anchor=>67%}             => {Phishing} 0.1140065          1 0.1140065 2.244973   280
## [5] {URL_of_Anchor=>67%,                                                                        
##      Links_pointing_to_page=no_link} => {Phishing} 0.1710098          1 0.1710098 2.244973   420

From the ‘two-key plot’ it is clear that order (the number of items contained in the rule) and support have a very strong inverse relationship. The red dots (rules of size 5) tend to have lower support than the purple dots (rules of size 2).

Remove redundant rules

A rule is redundant if a more general rule with the same or a higher confidence exists. That is, a more specific rule is redundant if it is only equally or even less predictive than a more general rule. A rule is more general if it has the same RHS but one or more items removed from the LHS.
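The definition can be illustrated with a small base-R sketch. It is not the arules implementation: the function name is_redundant_sketch is made up for this example, all rules are assumed to share the RHS {Phishing}, and rules with an empty LHS are ignored. The first two rules use confidences from this report; the third is hypothetical and included only to show a redundant case.

```r
# Each rule: LHS items and confidence; all rules share the RHS {Phishing}.
rules_sketch <- list(
  list(lhs = c("URL_of_Anchor=>67%"), conf = 0.9888579),
  list(lhs = c("URL_of_Anchor=>67%", "web_traffic=otherwise"), conf = 1.0),
  # Hypothetical rule, added only to show a redundant case:
  list(lhs = c("URL_of_Anchor=>67%", "port=std"), conf = 0.98)
)

# A rule i is redundant if some rule j has an LHS that is a proper subset of
# the LHS of i (j is more general) and a confidence at least as high.
is_redundant_sketch <- function(rules) {
  vapply(seq_along(rules), function(i) {
    any(vapply(seq_along(rules), function(j) {
      more_general <- j != i &&
        length(rules[[j]]$lhs) < length(rules[[i]]$lhs) &&
        all(rules[[j]]$lhs %in% rules[[i]]$lhs)
      more_general && rules[[j]]$conf >= rules[[i]]$conf
    }, logical(1)))
  }, logical(1))
}

print(is_redundant_sketch(rules_sketch))  # FALSE FALSE TRUE
```

The second rule survives because adding web_traffic=otherwise raises the confidence from 0.9888579 to 1; the hypothetical third rule is dropped because its extra item lowers it.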

rules<-rules[!is.redundant(rules)]

Example of redundant rules

##     lhs                            rhs        support   confidence coverage 
## [1] {SSLfinal_State=http}       => {Phishing} 0.1009772 0.9841270  0.1026059
## [2] {web_traffic=>100k}         => {Phishing} 0.1473941 0.6961538  0.2117264
## [3] {web_traffic=otherwise}     => {Phishing} 0.1921824 0.7946128  0.2418567
## [4] {URL_of_Anchor=>67%}        => {Phishing} 0.2890879 0.9888579  0.2923453
## [5] {Links_in_tags=>81%}        => {Phishing} 0.1864821 0.5696517  0.3273616
## [6] {having_Sub_Domain=two_dot} => {Phishing} 0.1718241 0.5328283  0.3224756
##     lift     count
## [1] 2.209338 248  
## [2] 1.562846 362  
## [3] 1.783884 472  
## [4] 2.219959 710  
## [5] 1.278853 458  
## [6] 1.196185 422

After removing the redundant rules we are left with 115 rules. Most of the rules have size 3 or 4.

This is because many rules of size 5 were considered redundant.

## set of 115 rules
## 
## rule length distribution (lhs + rhs):sizes
##  2  3  4  5 
## 11 52 46  6 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   3.000   3.409   4.000   5.000 
## 
## summary of quality measures:
##     support         confidence        coverage           lift      
##  Min.   :0.1002   Min.   :0.5303   Min.   :0.1026   Min.   :1.191  
##  1st Qu.:0.1134   1st Qu.:0.6509   1st Qu.:0.1429   1st Qu.:1.461  
##  Median :0.1409   Median :0.7977   Median :0.1743   Median :1.791  
##  Mean   :0.1535   Mean   :0.7949   Mean   :0.2036   Mean   :1.785  
##  3rd Qu.:0.1759   3rd Qu.:0.9560   3rd Qu.:0.2374   3rd Qu.:2.146  
##  Max.   :0.3664   Max.   :1.0000   Max.   :0.6906   Max.   :2.245  
##      count      
##  Min.   :246.0  
##  1st Qu.:278.5  
##  Median :346.0  
##  Mean   :376.9  
##  3rd Qu.:432.0  
##  Max.   :900.0  
## 
## mining info:
##   data ntransactions support confidence
##  ds_tr          2456     0.1        0.5

Lift can be interpreted as the deviation of the support of the whole rule from the support expected under independence given the supports of both sides of the rule. Greater lift values (>>1) indicate stronger associations. Here we select all the rules whose two sides co-occur more than twice as often as expected if they were independent.
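To make the measure concrete, the sketch below recomputes the lift of the rule {URL_of_Anchor=>67%} => {Phishing} as confidence divided by the support of the RHS, using counts from this report (710 of the 718 matching websites are phishing; 1094 of 2456 websites are phishing overall). It also shows the largest lift any rule with this RHS can reach.

```r
n_transactions <- 2456
supp_rhs   <- 1094 / n_transactions   # support of {Phishing}
confidence <- 710 / 718               # rule {URL_of_Anchor=>67%} => {Phishing}

# Lift = observed confidence relative to the baseline rate of the RHS.
lift <- confidence / supp_rhs
print(round(lift, 6))      # 2.219959

# A rule with confidence 1 reaches the maximum possible lift for this RHS.
max_lift <- 1 / supp_rhs
print(round(max_lift, 6))  # 2.244973
```

Because 2.244973 is the ceiling, a cut-off of lift > 2 keeps only rules whose confidence is close to 1.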

rules<- subset(rules, subset = lift > 2)
summary(rules)

## set of 42 rules
## 
## rule length distribution (lhs + rhs):sizes
##  2  3  4  5 
##  2 18 20  2 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   4.000   3.524   4.000   5.000 
## 
## summary of quality measures:
##     support         confidence        coverage           lift      
##  Min.   :0.1010   Min.   :0.8929   Min.   :0.1026   Min.   :2.004  
##  1st Qu.:0.1166   1st Qu.:0.9545   1st Qu.:0.1198   1st Qu.:2.143  
##  Median :0.1482   Median :0.9804   Median :0.1555   Median :2.201  
##  Mean   :0.1536   Mean   :0.9677   Mean   :0.1589   Mean   :2.172  
##  3rd Qu.:0.1775   3rd Qu.:0.9927   3rd Qu.:0.1855   3rd Qu.:2.229  
##  Max.   :0.2891   Max.   :1.0000   Max.   :0.2923   Max.   :2.245  
##      count      
##  Min.   :248.0  
##  1st Qu.:286.2  
##  Median :364.0  
##  Mean   :377.3  
##  3rd Qu.:436.0  
##  Max.   :710.0  
## 
## mining info:
##   data ntransactions support confidence
##  ds_tr          2456     0.1        0.5

The following 2 interactive plots show the selected rules.

Simpson's Paradox

Domain_registeration_length is the variable selected to investigate whether we have a case of Simpson's Paradox; this variable measures how old a domain is. Perhaps phishing websites have different features if we distinguish between new and old domains.

We will repeat the analysis (same parameters), subsetting the data set based on the value of the variable Domain_registeration_length.
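The subsetting code is not shown in the report, so the following is only one way it might be done, sketched in base R on a toy data frame with hypothetical values. Each subset would then be converted to transactions and mined with the same parameters as before.

```r
# Toy stand-in for the recoded data frame (hypothetical values).
toy <- data.frame(
  Domain_registeration_length = factor(c("<1yr", ">1yr", "<1yr", "<1yr", ">1yr")),
  Phishing                    = c(TRUE, FALSE, TRUE, FALSE, FALSE)
)

is_new <- toy$Domain_registeration_length == "<1yr"

# Drop the splitting column: it is constant within each subset.
keep_cols <- setdiff(names(toy), "Domain_registeration_length")
toy_new <- toy[is_new,  keep_cols, drop = FALSE]
toy_old <- toy[!is_new, keep_cols, drop = FALSE]

# Sanity check: the two subsets partition the original rows.
stopifnot(nrow(toy_new) + nrow(toy_old) == nrow(toy))
print(c(new = nrow(toy_new), old = nrow(toy_old)))
```

A check like the final stopifnot() is worth keeping in the real analysis: each subset should contain fewer rows than the full data set, and the two counts should add up to it.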

Subset of transactions for “New” websites: Domain_registeration_length < 1 year

summary(rules_new)

## set of 82 rules
## 
## rule length distribution (lhs + rhs):sizes
##  2  3  4  5 
## 10 42 27  3 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00    3.00    3.00    3.28    4.00    5.00 
## 
## summary of quality measures:
##     support         confidence        coverage           lift      
##  Min.   :0.1002   Min.   :0.5303   Min.   :0.1026   Min.   :1.191  
##  1st Qu.:0.1131   1st Qu.:0.6205   1st Qu.:0.1433   1st Qu.:1.393  
##  Median :0.1405   Median :0.7967   Median :0.1775   Median :1.789  
##  Mean   :0.1525   Mean   :0.7969   Mean   :0.2014   Mean   :1.789  
##  3rd Qu.:0.1779   3rd Qu.:0.9637   3rd Qu.:0.2362   3rd Qu.:2.163  
##  Max.   :0.2940   Max.   :1.0000   Max.   :0.5366   Max.   :2.245  
##      count      
##  Min.   :246.0  
##  1st Qu.:277.8  
##  Median :345.0  
##  Mean   :374.6  
##  3rd Qu.:437.0  
##  Max.   :722.0  
## 
## mining info:
##       data ntransactions support confidence
##  ds_new_tr          2456     0.1        0.5

Top rules for “New” websites ranked by support

##     lhs                            rhs          support confidence  coverage     lift count
## [1] {Prefix_Suffix=prefix_dash} => {Phishing} 0.2939739  0.7568134 0.3884365 1.699025   722
## [2] {URL_of_Anchor=>67%}        => {Phishing} 0.2890879  0.9888579 0.2923453 2.219959   710
## [3] {DNSRecord}                 => {Phishing} 0.2846091  0.5303490 0.5366450 1.190619   699
## [4] {SSLfinal_State=otherwise}  => {Phishing} 0.2768730  0.8629442 0.3208469 1.937286   680
## [5] {URL_Length=>75,                                                                       
##      Prefix_Suffix=prefix_dash} => {Phishing} 0.2483713  0.7568238 0.3281759 1.699049   610

Top rules for “New” websites ranked by confidence

##     lhs                                 rhs          support confidence  coverage     lift count
## [1] {URL_of_Anchor=>67%,                                                                        
##      web_traffic=otherwise}          => {Phishing} 0.1693811          1 0.1693811 2.244973   416
## [2] {having_Sub_Domain=two_dot,                                                                 
##      URL_of_Anchor=>67%}             => {Phishing} 0.1473941          1 0.1473941 2.244973   362
## [3] {SSLfinal_State=otherwise,                                                                  
##      URL_of_Anchor=>67%}             => {Phishing} 0.1864821          1 0.1864821 2.244973   458
## [4] {Prefix_Suffix=suffix_dash,                                                                 
##      URL_of_Anchor=>67%}             => {Phishing} 0.1140065          1 0.1140065 2.244973   280
## [5] {URL_of_Anchor=>67%,                                                                        
##      Links_pointing_to_page=no_link} => {Phishing} 0.1710098          1 0.1710098 2.244973   420

Top rules for “New” websites ranked by lift

##     lhs                                 rhs          support confidence  coverage     lift count
## [1] {URL_of_Anchor=>67%,                                                                        
##      web_traffic=otherwise}          => {Phishing} 0.1693811          1 0.1693811 2.244973   416
## [2] {having_Sub_Domain=two_dot,                                                                 
##      URL_of_Anchor=>67%}             => {Phishing} 0.1473941          1 0.1473941 2.244973   362
## [3] {SSLfinal_State=otherwise,                                                                  
##      URL_of_Anchor=>67%}             => {Phishing} 0.1864821          1 0.1864821 2.244973   458
## [4] {Prefix_Suffix=suffix_dash,                                                                 
##      URL_of_Anchor=>67%}             => {Phishing} 0.1140065          1 0.1140065 2.244973   280
## [5] {URL_of_Anchor=>67%,                                                                        
##      Links_pointing_to_page=no_link} => {Phishing} 0.1710098          1 0.1710098 2.244973   420

Subset of transactions for “Old” websites: Domain_registeration_length > 1 year

Summary

## set of 82 rules
## 
## rule length distribution (lhs + rhs):sizes
##  2  3  4  5 
## 10 42 27  3 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00    3.00    3.00    3.28    4.00    5.00 
## 
## summary of quality measures:
##     support         confidence        coverage           lift      
##  Min.   :0.1002   Min.   :0.5303   Min.   :0.1026   Min.   :1.191  
##  1st Qu.:0.1131   1st Qu.:0.6205   1st Qu.:0.1433   1st Qu.:1.393  
##  Median :0.1405   Median :0.7967   Median :0.1775   Median :1.789  
##  Mean   :0.1525   Mean   :0.7969   Mean   :0.2014   Mean   :1.789  
##  3rd Qu.:0.1779   3rd Qu.:0.9637   3rd Qu.:0.2362   3rd Qu.:2.163  
##  Max.   :0.2940   Max.   :1.0000   Max.   :0.5366   Max.   :2.245  
##      count      
##  Min.   :246.0  
##  1st Qu.:277.8  
##  Median :345.0  
##  Mean   :374.6  
##  3rd Qu.:437.0  
##  Max.   :722.0  
## 
## mining info:
##       data ntransactions support confidence
##  ds_old_tr          2456     0.1        0.5

Top rules for “Old” websites ranked by support

##     lhs                            rhs          support confidence  coverage     lift count
## [1] {Prefix_Suffix=prefix_dash} => {Phishing} 0.2939739  0.7568134 0.3884365 1.699025   722
## [2] {URL_of_Anchor=>67%}        => {Phishing} 0.2890879  0.9888579 0.2923453 2.219959   710
## [3] {DNSRecord}                 => {Phishing} 0.2846091  0.5303490 0.5366450 1.190619   699
## [4] {SSLfinal_State=otherwise}  => {Phishing} 0.2768730  0.8629442 0.3208469 1.937286   680
## [5] {URL_Length=>75,                                                                       
##      Prefix_Suffix=prefix_dash} => {Phishing} 0.2483713  0.7568238 0.3281759 1.699049   610

Top rules for “Old” websites ranked by confidence

##     lhs                                 rhs          support confidence  coverage     lift count
## [1] {URL_of_Anchor=>67%,                                                                        
##      web_traffic=otherwise}          => {Phishing} 0.1693811          1 0.1693811 2.244973   416
## [2] {having_Sub_Domain=two_dot,                                                                 
##      URL_of_Anchor=>67%}             => {Phishing} 0.1473941          1 0.1473941 2.244973   362
## [3] {SSLfinal_State=otherwise,                                                                  
##      URL_of_Anchor=>67%}             => {Phishing} 0.1864821          1 0.1864821 2.244973   458
## [4] {Prefix_Suffix=suffix_dash,                                                                 
##      URL_of_Anchor=>67%}             => {Phishing} 0.1140065          1 0.1140065 2.244973   280
## [5] {URL_of_Anchor=>67%,                                                                        
##      Links_pointing_to_page=no_link} => {Phishing} 0.1710098          1 0.1710098 2.244973   420

Top rules for “Old” websites ranked by lift

##     lhs                                 rhs          support confidence  coverage     lift count
## [1] {URL_of_Anchor=>67%,                                                                        
##      web_traffic=otherwise}          => {Phishing} 0.1693811          1 0.1693811 2.244973   416
## [2] {having_Sub_Domain=two_dot,                                                                 
##      URL_of_Anchor=>67%}             => {Phishing} 0.1473941          1 0.1473941 2.244973   362
## [3] {SSLfinal_State=otherwise,                                                                  
##      URL_of_Anchor=>67%}             => {Phishing} 0.1864821          1 0.1864821 2.244973   458
## [4] {Prefix_Suffix=suffix_dash,                                                                 
##      URL_of_Anchor=>67%}             => {Phishing} 0.1140065          1 0.1140065 2.244973   280
## [5] {URL_of_Anchor=>67%,                                                                        
##      Links_pointing_to_page=no_link} => {Phishing} 0.1710098          1 0.1710098 2.244973   420

Since the top rules by support, confidence and lift largely overlap between “New” and “Old” websites, we do not detect any Simpson's paradox associated with the variable Domain_registeration_length.

However, a deeper analysis combined with domain knowledge might still reveal a Simpson's paradox in the association rules.
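
One way to make such a deeper check more systematic is to compare, rule by rule, the direction of the association in the pooled data with its direction in each stratum. The sketch below is an assumption about how this could be done, not the procedure used in this report: the helper `flag_reversals` and the data-frame layout (columns `rule` and `lift`) are hypothetical, and the toy values are made up purely to show the mechanics.

```r
# Flag rules whose lift is on the same side of 1 in both strata but on the
# opposite side in the pooled data (the pattern behind Simpson's paradox).
# Each argument is a data frame with columns `rule` and `lift` (assumed layout).
flag_reversals <- function(pooled, stratum_a, stratum_b) {
  names(pooled)[names(pooled) == "lift"]       <- "lift_pooled"
  names(stratum_a)[names(stratum_a) == "lift"] <- "lift_a"
  names(stratum_b)[names(stratum_b) == "lift"] <- "lift_b"

  # Keep only rules that were mined in all three data sets
  m <- merge(merge(pooled, stratum_a, by = "rule"), stratum_b, by = "rule")

  direction <- function(x) sign(x - 1)  # +1: positive association, -1: negative
  m$reversed <- direction(m$lift_a) == direction(m$lift_b) &
                direction(m$lift_a) != direction(m$lift_pooled)
  m
}

# Toy input with made-up rule names and lifts, for illustration only
toy_pooled <- data.frame(rule = c("r1", "r2"), lift = c(1.5, 0.9))
toy_new    <- data.frame(rule = c("r1", "r2"), lift = c(1.6, 1.2))
toy_old    <- data.frame(rule = c("r1", "r2"), lift = c(1.4, 1.1))

result <- flag_reversals(toy_pooled, toy_new, toy_old)
print(result)
```

With arules, a data frame in this layout could presumably be built from a rule set with `labels()` and `quality()`, but that step is not shown in the original analysis.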

Resulting rules

Summary (number of rules, general description), and a selection of those you would show to a client.

Top 5 rules based on support

##     lhs                                   rhs          support confidence  coverage     lift count
## [1] {URL_of_Anchor=>67%}               => {Phishing} 0.2890879  0.9888579 0.2923453 2.219959   710
## [2] {Domain_registeration_length=<1yr,                                                            
##      URL_of_Anchor=>67%}               => {Phishing} 0.2369707  0.9965753 0.2377850 2.237284   582
## [3] {port=std,                                                                                    
##      URL_of_Anchor=>67%}               => {Phishing} 0.2345277  0.9896907 0.2369707 2.221829   576
## [4] {SSLfinal_State=otherwise,                                                                    
##      Domain_registeration_length=<1yr} => {Phishing} 0.2345277  0.9000000 0.2605863 2.020475   576
## [5] {URL_Length=>75,                                                                              
##      URL_of_Anchor=>67%}               => {Phishing} 0.2312704  0.9895470 0.2337134 2.221506   568

Top 5 rules based on confidence

##     lhs                                 rhs          support confidence  coverage     lift count
## [1] {URL_of_Anchor=>67%,                                                                        
##      web_traffic=otherwise}          => {Phishing} 0.1693811          1 0.1693811 2.244973   416
## [2] {having_Sub_Domain=two_dot,                                                                 
##      URL_of_Anchor=>67%}             => {Phishing} 0.1473941          1 0.1473941 2.244973   362
## [3] {SSLfinal_State=otherwise,                                                                  
##      URL_of_Anchor=>67%}             => {Phishing} 0.1864821          1 0.1864821 2.244973   458
## [4] {Prefix_Suffix=suffix_dash,                                                                 
##      URL_of_Anchor=>67%}             => {Phishing} 0.1140065          1 0.1140065 2.244973   280
## [5] {URL_of_Anchor=>67%,                                                                        
##      Links_pointing_to_page=no_link} => {Phishing} 0.1710098          1 0.1710098 2.244973   420

Top 5 rules based on lift

##     lhs                                 rhs          support confidence  coverage     lift count
## [1] {URL_of_Anchor=>67%,                                                                        
##      web_traffic=otherwise}          => {Phishing} 0.1693811          1 0.1693811 2.244973   416
## [2] {having_Sub_Domain=two_dot,                                                                 
##      URL_of_Anchor=>67%}             => {Phishing} 0.1473941          1 0.1473941 2.244973   362
## [3] {SSLfinal_State=otherwise,                                                                  
##      URL_of_Anchor=>67%}             => {Phishing} 0.1864821          1 0.1864821 2.244973   458
## [4] {Prefix_Suffix=suffix_dash,                                                                 
##      URL_of_Anchor=>67%}             => {Phishing} 0.1140065          1 0.1140065 2.244973   280
## [5] {URL_of_Anchor=>67%,                                                                        
##      Links_pointing_to_page=no_link} => {Phishing} 0.1710098          1 0.1710098 2.244973   420

The rules with the highest confidence are also the rules with the highest lift. We select these rules because they appear to be the strongest indicators of a phishing website.

For example, if most anchor tags point to a domain different from the website's own (URL_of_Anchor=>67%) and the website is not popular (web_traffic=otherwise), then every such website in the data is phishing (confidence 1, 416 websites).

A similar conclusion can be drawn for websites whose anchor tags point to a different domain (URL_of_Anchor=>67%) and that have no links pointing to the page (Links_pointing_to_page=no_link).
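
The agreement between the confidence ranking and the lift ranking noted above is expected whenever all rules share the same right-hand side: lift is confidence divided by the support of the right-hand side, so with {Phishing} fixed the two measures differ only by a constant factor. The sketch below checks this in base R on four rules copied from the “Old” websites table ranked by support; the variable names are illustrative only.

```r
# Confidence and lift of rules [1]-[4] from the table ranked by support
confidence <- c(0.7568134, 0.9888579, 0.5303490, 0.8629442)
lift       <- c(1.699025,  2.219959,  1.190619,  1.937286)

# With a common right-hand side, confidence / lift is the support of that
# right-hand side, so the ratio should be the same for every rule.
rhs_support <- confidence / lift
print(round(rhs_support, 4))

# A rule with confidence 1 has lift 1 / support(rhs); the tables report 2.244973
print(round(1 / mean(rhs_support), 4))

# Sorting by confidence and sorting by lift therefore give the same order
same_order <- identical(order(confidence), order(lift))
print(same_order)
```

The ratio is only constant because every rule shown predicts {Phishing}; for rule sets with mixed right-hand sides the two rankings can differ.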

Top rules

As in the tables above, the item URL_of_Anchor=>67% appears in most of the rules that point to phishing websites.

Recommendations

The rules we just selected could help a person build a website that will not be categorized as phishing.

For example, it would be important to keep the anchor tags on the same domain as the website and to avoid prefixes or suffixes separated by a dash (-) in the domain name.

Another suggestion is to keep the anchor tags on the same domain as the website and to avoid multiple subdomains (having_Sub_Domain=two_dot), since this combination also appears among the rules with confidence 1.
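
To illustrate how these recommendations could be applied, the sketch below checks the features of a website against the left-hand sides of the five rules with confidence 1. It is only an assumption about how such a check might look: the helper `matching_rules`, the example site and the level name `"valid_level"` are hypothetical and are not taken from the data set.

```r
# Left-hand sides of the five rules with confidence 1, copied from the tables
selected_rules <- list(
  c(URL_of_Anchor = ">67%", web_traffic = "otherwise"),
  c(having_Sub_Domain = "two_dot", URL_of_Anchor = ">67%"),
  c(SSLfinal_State = "otherwise", URL_of_Anchor = ">67%"),
  c(Prefix_Suffix = "suffix_dash", URL_of_Anchor = ">67%"),
  c(URL_of_Anchor = ">67%", Links_pointing_to_page = "no_link")
)

# Return the indices of the rules whose left-hand side is fully matched
# by the named character vector `site` (hypothetical helper).
matching_rules <- function(site, rules) {
  hits <- vapply(rules, function(lhs) {
    all(names(lhs) %in% names(site)) && all(site[names(lhs)] == lhs)
  }, logical(1))
  which(hits)
}

# Made-up website description, for illustration only
site <- c(URL_of_Anchor = ">67%",
          web_traffic = "otherwise",
          SSLfinal_State = "valid_level")

matched <- matching_rules(site, selected_rules)
print(matched)
```

A website that matches none of these left-hand sides is not guaranteed to be classified as legitimate; the check only shows which of the selected rules would fire.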

Project 2 - Phishing Websites Data Set

Andrea Valtorta

10/13/2020
