Background

The daily use of technology by individuals and government institutions, particularly the internet, has made life simpler and facilitated commercial services and transactions. The internet is a key component of banking and other electronic services, and users browse a large number of web applications every day. During the COVID-19 pandemic, people relied even more on the internet to buy daily household needs such as food, beverages, and clothing through online purchases. This has attracted many fraudsters seeking valuable user data. A common scheme redirects you to a website that appears legitimate but is actually a phishing website, where you are tricked into handing criminals your login credentials and other sensitive data, which can then be used to take over accounts or steal identities. The traditional method of detecting phishing websites is to keep blacklisted URLs and Internet Protocol (IP) addresses updated in an antivirus database, also known as the “blacklist” method. To evade blacklists, attackers use creative techniques such as obfuscating the URL so it looks legitimate. In this article, we therefore use machine learning to analyze various blacklisted and legitimate URLs, extract features from them, and detect phishing websites based on those features.

Import Libraries

Here is the list of libraries that we will be using:

library(dplyr)
library(readr)
library(urltools)
library(httr)
library(stringr)
library(tokenizers)
library(wordcloud)
library(magrittr)
library(tidytext)
library(Rwhois)
library(lubridate)
library(rvest)

Dataset

In order to detect phishing websites, we need to collect data from both legitimate and phishing websites. A list of phishing websites is easy to get from an open-source service called PhishTank. This service provides lists of phishing websites in different formats, such as csv and json, and the lists are updated hourly. Legitimate websites, however, are not as easy to find as phishing websites. Fortunately, I found one source that has a collection of benign, spam, phishing, malware, and defacement URLs. The source comes from the University of New Brunswick. You can check it from this link.

First we will read the phishing data

phis <- read_csv("verified_online.csv")
head(phis)

The dataframe includes the URL (Uniform Resource Locator) of the website and other information related to it. However, our focus is only on the url column, because we want to detect phishing based on the URL itself.

Since all of the websites listed here are phishing websites, we need to label them manually. Therefore, we will create a column is_phishing and give it a value of 1.

phis <- phis %>% 
  mutate(is_phishing = 1)

Similarly, we can do the exact same process for the legitimate websites. First, we read the dataset

legit <- read_csv("ISCXurl2016/FinalDataset/url/Benign_list_big_final.csv", col_names = F)
head(legit)

Now, we will label them as 0 since they are all legitimate websites. We also want to rename the column to url so that both dataframes have the same column name.

legit <- legit %>% 
  mutate(is_phishing = 0) %>% 
  rename(url = X1)

After that, we can combine both dataframes so we have one dataframe containing both legitimate and phishing websites. We can use bind_rows() from the dplyr package to bind the dataframes by row, producing a longer result.

url_data <- bind_rows(phis[,c("url","is_phishing")], legit)
head(url_data)

Feature Extraction

Now, here comes the challenging part. As I said earlier, we are focusing only on the URL itself, and you might be wondering how we can detect whether a website is phishing based only on its URL. There are lots of resources that explain the characteristics of phishing websites, so we will try to extract features based on those characteristics.

Check if the url contains an IP address

Here, we want to check whether the URL contains an IP address. An IP address is a unique address that identifies a device or network on the internet or a local network. IP stands for “Internet Protocol”, the set of rules governing the format of data sent via the internet or a local network. An IPv4 address consists of four numbers, each ranging from 0 to 255, separated by periods; it may look something like 192.0.2.1. Most legitimate websites do not use an IP address as the URL host, so the presence of an IP address in a URL can be taken as an indicator that an attacker is trying to steal sensitive information. In order to check this, we will use a method called regular expressions.

Regular Expression

A regular expression, or regex for short, is a sequence of characters (or even a single character) that describes a certain pattern found in a text. In R, we can use the stringr library to work with regular expressions; each function has a different purpose. For example, say we want to find the substring “the” in a sentence. We can use str_detect():

str_detect(string = "How many brothers and sisters have you got? ",
           pattern = "the")
#> [1] TRUE

The function returns TRUE if it finds the matching pattern in the string. Other than writing a word as a pattern, we can also use special constructs called character classes. A character class matches any character from a predefined set of characters. Examples (a short illustration follows the list):

  • \w: match any word character (any letter, digit, or underscore)
  • \W: match any non-word character
  • \d: match any digit
  • \D: match any non-digit
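
For instance, here is a minimal illustration of the \d class with str_detect() (the example strings below are made up):

str_detect(string = "user2023", pattern = "\\d")
#> [1] TRUE
str_detect(string = "login", pattern = "\\d")
#> [1] FALSE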

It is also possible to create a user-defined character class by enclosing any set of characters inside square brackets. Now, we will apply this to the URL to check whether it contains an IP address. For example, the URL “http://165.232.173.145/mobile.html” should be considered TRUE since it contains an IP address.

str_detect(string = "http://165.232.173.145/mobile.html",
           pattern = "^https?://(([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])\\.){3}([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])[\\w.\\/]+")
#> [1] TRUE

In order to apply this to all of our data, we first create a helper function. This function will check each URL and return 1 if it contains an IP address, otherwise it will return 0. We also want to create a new column that stores this information.

is_ip <- function(url) {
  pattern <- "^https?://(([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])\\.){3}([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])[\\w.\\/]+"
  if(str_detect(url, pattern)) {
    return(1)
  } else {
    return(0)
  }
}
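
Before applying it to the whole dataset, we can sanity-check the function on a couple of URLs (the second one is just an illustrative legitimate URL):

is_ip("http://165.232.173.145/mobile.html")
#> [1] 1
is_ip("https://www.google.com/")
#> [1] 0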

We can use sapply() to apply our function to the url column

url_data <- url_data %>% 
  mutate(contains_ip = sapply(url, is_ip))

Check if the url contains “@” symbol

Next, we will check if the URL contains an “@” symbol. Let’s see what kind of URLs have an “@” in them.

url_data %>% 
  filter(str_detect(url,"@"))

We will create a function that returns 1 if the URL contains an “@” symbol and 0 otherwise.

is_at <- function(url) {
  pattern <- "@"
  if(str_detect(url, pattern)) {
    return(1)
  } else {
    return(0)
  }
}
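
A quick check on a made-up URL with an embedded “@”:

is_at("http://user@phishing.example.net/login")
#> [1] 1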

We apply the function to our data and create a new column to store the result

url_data <- url_data %>% 
  mutate(contains_at = sapply(url, is_at))

Number of dots in hostname

There exist many tokens or special characters that appear frequently in the URL hostnames of phishing sites but not in legitimate sites. One of them is the dot. In R, we can use a package called httr to extract the hostname from a URL. We will use the parse_url() function to perform that.

parse_url("https://sistema.corporategfcx.homes/sobrenos/login/")
#> $scheme
#> [1] "https"
#> 
#> $hostname
#> [1] "sistema.corporategfcx.homes"
#> 
#> $port
#> NULL
#> 
#> $path
#> [1] "sobrenos/login/"
#> 
#> $query
#> NULL
#> 
#> $params
#> NULL
#> 
#> $fragment
#> NULL
#> 
#> $username
#> NULL
#> 
#> $password
#> NULL
#> 
#> attr(,"class")
#> [1] "url"

In order to count the number of dots in the hostname, we can use the str_extract_all() function. This function returns all matches of the pattern in the string. If we wrap the result with length(), we get the number of dots.

parsed_url <- parse_url("https://sistema.corporategfcx.homes/sobrenos/login/")
length(str_extract_all(parsed_url$hostname,"\\.")[[1]])
#> [1] 2

We can now create a function

dots_count <- function(url) {
  parsed_url <- parse_url(url)
  host <- parsed_url$hostname
  pattern <- "\\."
  return(length(str_extract_all(host, pattern)[[1]]))
}
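
A quick check on the same example URL as before gives the expected count:

dots_count("https://sistema.corporategfcx.homes/sobrenos/login/")
#> [1] 2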

We then apply our function to count all the dots for each url and store it under the column num_of_dots

url_data <- url_data %>% 
  mutate(num_of_dots = sapply(url, dots_count))

Check if the domain contains a dash (-)

The dash is rarely used in legitimate URLs, so we will create a function that returns 1 if the hostname contains a dash.

is_dash <- function(url) {
  parsed_url <- parse_url(url)
  host <- parsed_url$hostname
  pattern <- "-"
  if(str_detect(host, pattern)) {
    return(1)
  } else {
    return(0)
  }

}
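
A quick sanity check on two made-up URLs, one with a dash in the hostname and one without:

is_dash("https://secure-login.example.net/")
#> [1] 1
is_dash("https://www.google.com/")
#> [1] 0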

Let’s apply it to our data and store it as a new column contain_dash

url_data <- url_data %>% 
  mutate(contain_dash = sapply(url, is_dash))

Check if the url path contain “//”

The “//” is usually used for redirecting a webpage to another webpage, so we want to check for its presence in the URL path. Keep in mind that we extract the path from the URL, since the “//” we are interested in is the one in the path part of the URL, not the one after “https:”. We can again use the parse_url() function to do the job.

url_data %>%
  mutate(url_path = sapply(url, function(x) parse_url(x)$path)) %>% 
  filter(str_detect(url_path, "//"))

Now, we create a helper function that returns 1 if the URL path contains “//”

is_redirect <- function(url) {
  parsed_url <- parse_url(url)
  path <- parsed_url$path
  pattern <- "//"
  if(str_detect(path, pattern)) {
    return(1)
  } else {
    return(0)
  }

}
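
A quick check on a made-up URL whose path contains a “//”:

is_redirect("http://example.net/a//b")
#> [1] 1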

We apply our function and store it in a new column contain_dblslash

url_data <- url_data %>% 
  mutate(contain_dblslash = sapply(url, is_redirect))

Check if the url host contains “http”

Phishers may add an “HTTPS” token to the domain part of a URL in order to trick users, so we want to check for its presence in the URL host.

url_data %>%
  mutate(url_host = sapply(url, function(x) parse_url(x)$hostname)) %>% 
  filter(str_detect(url_host, "https?"))

We create a helper function that extracts the URL hostname and returns 1 if the host contains “http” or “https”

is_urlhttp <- function(url) {
  parsed_url <- parse_url(url)
  host <- parsed_url$hostname
  pattern <- "https?"
  if(str_detect(host, pattern)) {
    return(1)
  } else {
    return(0)
  }

}
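
As a quick sanity check, here it is on two made-up URLs: one that hides an “https” token inside the hostname and one that does not:

is_urlhttp("http://https-paypal.example.com/login")
#> [1] 1
is_urlhttp("https://www.google.com/")
#> [1] 0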

Apply our function to the data and store it as a new column contain_urlhttp

url_data <- url_data %>% 
  mutate(contain_urlhttp = sapply(url, is_urlhttp))

Check if the url is short url

There are lots of URL shortening services that can be used to shorten a URL, and these services often allow phishers to hide a long phishing URL by making it short. We have collected the common short URL services from the internet. Here we have a pattern containing the distinct forms of short URL.

shorturl <- "^bit\\.ly|^goo\\.gl|^shorte\\.st|^go2l\\.ink|^x\\.co|^ow\\.ly|^t\\.co|^tinyurl|^tr\\.im|^is\\.gd|^cli\\.gs|^yfrog\\.com|^migre\\.me|^ff\\.im|^tiny\\.cc|^url4\\.eu|^twit\\.ac|^su\\.pr|^twurl\\.nl|^snipurl\\.com|^short\\.to|^Budurl\\.com|^ping\\.fm|^post\\.ly|^Just\\.as|^bkite\\.com|^snipr\\.com|^flic\\.kr|^loopt\\.us|^doiop\\.com|^short\\.ie|^kl\\.am|^wp\\.me|^rubyurl\\.com|^om\\.ly|^to\\.ly|^bit\\.do|^lnkd\\.in|^db\\.tt|^qr\\.ae|^adf\\.ly|^bitly\\.com|^cur\\.lv|^tinyurl\\.com|^ity\\.im|^q\\.gs|^po\\.st|^bc\\.vc|^twitthis\\.com|^u\\.to|^j\\.mp|^buzurl\\.com|^cutt\\.us|^u\\.bb|^yourls\\.org|^prettylinkpro\\.com|^scrnch\\.me|^filoops\\.info|^vzturl\\.com|^qr\\.net|^1url\\.com|^tweez\\.me|^v\\.gd|^link\\.zip\\.net|^Dwarfurl\\.com|^Digg\\.com|^htxt\\.it|^Alturl\\.com|^RedirX\\.com|^DigBig\\.com|^u\\.mavrev\\.com|^u\\.nu|^linkbee\\.com|^Yep\\.it|^posted\\.at|^xrl\\.us|^metamark\\.net|^sn\\.im|^hurl\\.ws|^eepurl\\.com|^idek\\.net|^urlpire\\.com|^chilp\\.it"
url_data %>% 
  mutate(url_host = sapply(url, function(x) parse_url(x)$hostname)) %>% 
  filter(str_detect(url_host, shorturl))

We then create a function that checks for the presence of these URL shortening services in our url data

is_short <- function(url) {
  parsed_url <- parse_url(url)
  host <- parsed_url$hostname
  if(str_detect(host, shorturl)) {
    return(1)
  } else {
    return(0)
  }

}
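
A quick sanity check with a made-up shortened link and a regular URL:

is_short("http://bit.ly/2kXg6KQ")
#> [1] 1
is_short("https://www.google.com/")
#> [1] 0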

We apply the function to our data and store it in a new column is_shorturl

url_data <- url_data %>% 
  mutate(is_shorturl = sapply(url, is_short))

Check the length of url host

We can use str_length() to count the number of characters in a string.

a <- parse_url("https://sistema.corporategfcx.homes/sobrenos/login/")
str_length(a$hostname)
#> [1] 27

We create a function to count the number of characters in the hostname of each URL.

count_host <- function(url) {
  parsed_url <- parse_url(url)
  host <- parsed_url$hostname
  return(str_length(host))
}

Apply that and create a new column called host_length

url_data <- url_data %>% 
  mutate(host_length = sapply(url, count_host))

Check if url contains sensitive words

There exist a few words or tokens that are common to most phishing URLs. Let us check our data and extract all the distinct words from it. First, we need to remove unnecessary tokens such as “http”, “com”, etc., because they are not meaningful and every URL is expected to contain them. We will use str_replace_all() to perform the task.

str_replace_all("https://sistema.corporategfcx.homes/sobrenos/login/", "https?:|com", "") %>% 
  str_split("[ ./,\\-\\(\\)\\[\\]]")
#> [[1]]
#> [1] ""              ""              "sistema"       "corporategfcx"
#> [5] "homes"         "sobrenos"      "login"         ""

Let’s apply that to our data

url_data <- url_data %>% 
  mutate(token = str_replace_all(url, "https?:|com|www|php", "")) %>% 
  mutate(token = str_split(token, "[ ./,\\-\\(\\)\\[\\]]"))

After that, we can visualise the word counts for both phishing and legitimate websites using the wordcloud package

# Phishing wordcloud
phis_only <- url_data %>% 
  filter(is_phishing==1)
wordcloud(as.character(phis_only$token), max.words =50, min.freq=500, random.order = F, colors=brewer.pal(8, "Dark2"))

# Legitimate wordcloud
legit_only <- url_data %>% 
  filter(is_phishing==0)
wordcloud(as.character(legit_only$token), max.words =50, min.freq=500, random.order = F, colors=brewer.pal(8, "Dark2"))

Now, we want to create a table that counts every word in our URL data and sorts it in descending order. We only want to keep the most frequent words, so we take words that appear more than 500 times.

words_count <- url_data %>% 
  mutate(token = str_replace_all(url, "https?:|com|www|php", " ")) %>% 
  unnest_tokens(output = words, input = token, token = "regex", pattern = "[ ./,\\-\\(\\)\\[\\]]") %>% 
  filter(is_phishing==1) %>% 
  count(words, sort = TRUE) %>% 
  filter(n>500)

words_count

For legitimate URLs, we can set the threshold to 300

word_counts_legit <- url_data %>% 
  mutate(token = str_replace_all(url, "https?:|com|www|php", " ")) %>% 
  unnest_tokens(output = words, input = token, token = "regex", pattern = "[ ./,\\-\\(\\)\\[\\]]") %>% 
  filter(is_phishing==0) %>% 
  count(words, sort = TRUE) %>% 
  filter(words %in% words_count$words, n>300)

word_counts_legit

I have also found that the following words are quite commonly used on phishing websites, so let’s combine them with our list

sens_words <- words_count %>% 
  filter(!words %in% word_counts_legit$words) %>% 
  .$words

sens_words <- c(sens_words, 'confirm',
'account', 'banking', 'secure', 'ebyisapi', 'webscr', 'signin',
'mail', 'install', 'toolbar', 'backup', 'paypal', 'password',
'username')

We now create a function that checks for the presence of these sensitive words in a URL.

is_sensitive <- function(url) {
  words <- str_replace_all(url, "https?:|com|www|php", "")
  words <- str_split(words, "[ ./,\\-\\(\\)\\[\\]]")
  words <- words[[1]]
  if(any(sens_words %in% words)) {
    return(1)
  } else {
    return(0)
  }
}
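
For instance, a made-up URL containing the hard-coded token “paypal” is flagged:

is_sensitive("http://secure-paypal.example.net/webscr")
#> [1] 1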

Apply that to our data and store it as a column contains_sensitive

url_data <- url_data %>% 
  mutate(contains_sensitive = sapply(url, is_sensitive))

Check the age of domain

Now comes one of the most challenging features to extract: we want to check the age of the domain used in each URL. There is one well-known service that provides this information, called “WHOIS”. WHOIS is a public database that houses the information collected when someone registers a domain name or updates their DNS settings. In order to retrieve that information, we will use a package called whois. This package is intended for Python users, so we can’t use R directly in this case. The interesting part is that we can run Python code in a chunk using the reticulate package.

# Connecting python and conda environment
library(reticulate)
use_condaenv("smm_dadp")

Now, we will create a new Python chunk and try the whois package. We will use a function called whois() to retrieve information from the WHOIS database. Since we want to check the age of the domain, we can use fields such as “creation_date” and “expiration_date”.

import whois
import pandas as pd
import numpy as np

res = whois.whois("sistema.corporategfcx.homes")

print(res)
#> {
#>   "domain_name": null,
#>   "registrar": null,
#>   "whois_server": null,
#>   "referral_url": null,
#>   "updated_date": null,
#>   "creation_date": null,
#>   "expiration_date": null,
#>   "name_servers": null,
#>   "status": null,
#>   "emails": null,
#>   "dnssec": null,
#>   "name": null,
#>   "org": null,
#>   "address": null,
#>   "city": null,
#>   "state": null,
#>   "registrant_postal_code": null,
#>   "country": null
#> }
print(res.creation_date)
#> None
print(res.expiration_date)
#> None

After that, we will create a function that calculates the age of the domain by subtracting the creation_date from the expiration_date. In order to use an R object in a Python chunk, we can use r.name_of_variable.

⚠️ Warning: Applying that function to our data is computationally heavy. Depending on your computer, this can take around 24 hours to finish.


def domain_age(url):
  try:
    url = str(url)
    res = whois.whois(url)
    df = pd.DataFrame({"date": [res.creation_date, res.expiration_date]})
    age = (df["date"][1]-df["date"][0])//np.timedelta64(1, 'M')
  except:
    age = None
  return age


# r.url_data = r.url_data.assign(age_of_domain = r.url_data["url"])
# 
# phish = (r.url_data["is_phishing"]==1)
# 
# age_of_domain = r.url_data["url"].apply(domain_age)

After that, we can take the Python variable that holds the domain ages and use it in an R chunk with py$name_of_variable

url_data["age_of_domain"] <- py$age_of_domain

Data Preprocessing

Let’s check if there are any missing values

url_data %>% 
  is.na() %>% 
  colSums()
#>                url        is_phishing        contains_ip        contains_at 
#>                  0                  0                  0                  0 
#>        num_of_dots       contain_dash   contain_dblslash    contain_urlhttp 
#>                  0                  0                  0                  0 
#>        is_shorturl        host_length              token contains_sensitive 
#>                  0                  0                  0                  0 
#>      age_of_domain 
#>              33057
plot(url_data$age_of_domain~as.factor(url_data$is_phishing))

We can fill the missing values with the median value

url_data[is.na(url_data$age_of_domain),"age_of_domain"] <- median(url_data$age_of_domain, na.rm = T)
url_data_clean <- url_data %>% 
  select(-c(token,url)) %>% 
  mutate(is_phishing = as.factor(is_phishing))
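
As an optional sanity check, we can confirm that no missing values remain after the imputation:

sum(is.na(url_data_clean$age_of_domain))
#> [1] 0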

Now, we can split our data into training and testing sets

RNGkind(sample.kind = "Rounding")
set.seed(417)

index <- sample(x = nrow(url_data_clean) , size = nrow(url_data_clean) * 0.9)

# splitting
url_train <- url_data_clean[index , ]
url_test <- url_data_clean[-index , ]

Check the proportion of our target

prop.table(table(url_train$is_phishing))
#> 
#>         0         1 
#> 0.5896443 0.4103557

Build Model

Support Vector Machine

Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms, which is used for Classification as well as Regression problems. The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes so that we can easily put the new data point in the correct category in the future. This best decision boundary is called a hyperplane.

We can use the e1071 package to build an SVM classifier with the svm() function. It takes four main arguments:

  • formula: y~x
  • data: data that we want to train
  • type: svm can be used for classification or regression so this argument depends on what we want
  • kernel: the kernel used in training and predicting

When dealing with nonlinear data, a Support Vector Machine transforms it into a higher dimension where it may be linearly separated. It does this by using different kernel settings.

Let’s try with the linear kernel

library(e1071)
svm_model <- svm(formula = is_phishing~.,
                 data = url_train,
                 type = "C-classification",
                 kernel = "linear")

We can predict the test data using our first model

prediction <- predict(svm_model, url_test)

After that, we evaluate our model using confusion matrix

library(caret)
confusionMatrix(data = prediction, reference = url_test$is_phishing, positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction    0    1
#>          0 3471  338
#>          1   78 2111
#>                                                
#>                Accuracy : 0.9306               
#>                  95% CI : (0.9239, 0.9369)     
#>     No Information Rate : 0.5917               
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.8541               
#>                                                
#>  Mcnemar's Test P-Value : < 0.00000000000000022
#>                                                
#>             Sensitivity : 0.8620               
#>             Specificity : 0.9780               
#>          Pos Pred Value : 0.9644               
#>          Neg Pred Value : 0.9113               
#>              Prevalence : 0.4083               
#>          Detection Rate : 0.3520               
#>    Detection Prevalence : 0.3650               
#>       Balanced Accuracy : 0.9200               
#>                                                
#>        'Positive' Class : 1                    
#> 

We get quite a high accuracy of around 93%. However, perhaps we can tune our model to get a better accuracy.

Let’s try another kernel setting, “radial”

svm_model2 <- svm(formula = is_phishing~.,
                 data = url_train,
                 type = "C-classification",
                 kernel = "radial")
prediction2 <- predict(svm_model2, url_test)
confusionMatrix(data = prediction2, reference = url_test$is_phishing, positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction    0    1
#>          0 3524  310
#>          1   25 2139
#>                                                
#>                Accuracy : 0.9441               
#>                  95% CI : (0.938, 0.9498)      
#>     No Information Rate : 0.5917               
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.8823               
#>                                                
#>  Mcnemar's Test P-Value : < 0.00000000000000022
#>                                                
#>             Sensitivity : 0.8734               
#>             Specificity : 0.9930               
#>          Pos Pred Value : 0.9884               
#>          Neg Pred Value : 0.9191               
#>              Prevalence : 0.4083               
#>          Detection Rate : 0.3566               
#>    Detection Prevalence : 0.3608               
#>       Balanced Accuracy : 0.9332               
#>                                                
#>        'Positive' Class : 1                    
#> 

There is an improvement in accuracy! Can we do better? Well, let’s try to tune the hyperparameters of the radial kernel. We can try to tune these two parameters (a small grid-search sketch follows the list below):

  • Regularization parameter (C)/cost: It tells us how much misclassification we want to avoid.

    • Hard margin SVM generally has large values of C.
    • Soft margin SVM generally has small values of C
  • Gamma: a parameter for non-linear kernels such as the radial kernel. It defines how far the influence of a single training example reaches, with low values meaning ‘far’ and high values meaning ‘close’.

    • Large Gamma: the decision boundary is influenced by fewer data points, so it becomes highly non-linear and tends to overfit.
    • Small Gamma: the decision boundary is influenced by more data points, so it is more generic.
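
Before fixing values by hand, one option is a small grid search. The sketch below uses tune() from the e1071 package with a deliberately small, illustrative grid; the candidate values are assumptions rather than recommendations, and with a dataset of this size the cross-validated search can take a long time.

# Hedged sketch: cross-validated grid search over cost and gamma
# (tune() uses 10-fold CV by default); the grid values below are only illustrative.
tune_res <- tune(svm, is_phishing ~ ., data = url_train,
                 type = "C-classification", kernel = "radial",
                 ranges = list(cost = c(1, 10, 100),
                               gamma = c(0.5, 1, 5)))
summary(tune_res)         # cross-validation error for each cost/gamma combination
tune_res$best.parameters  # best combination found on this grid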

Let’s apply that to our model; we will set cost = 10 and gamma = 5

svm_model3 <- svm(formula = is_phishing~.,
                 data = url_train,
                 type = "C-classification",
                 kernel = "radial",
                 cost = 10,
                 gamma = 5)
prediction3 <- predict(svm_model3, url_test)
confusionMatrix(data = prediction3, reference = url_test$is_phishing, positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction    0    1
#>          0 3524  235
#>          1   25 2214
#>                                                
#>                Accuracy : 0.9567               
#>                  95% CI : (0.9512, 0.9617)     
#>     No Information Rate : 0.5917               
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.9091               
#>                                                
#>  Mcnemar's Test P-Value : < 0.00000000000000022
#>                                                
#>             Sensitivity : 0.9040               
#>             Specificity : 0.9930               
#>          Pos Pred Value : 0.9888               
#>          Neg Pred Value : 0.9375               
#>              Prevalence : 0.4083               
#>          Detection Rate : 0.3691               
#>    Detection Prevalence : 0.3733               
#>       Balanced Accuracy : 0.9485               
#>                                                
#>        'Positive' Class : 1                    
#> 

We can see that by tuning the hyperparameters, we improve the accuracy.

XGBoost

XGBoost is short for eXtreme Gradient Boosting and is one of the boosting algorithms. To understand what a boosting algorithm is, we first need to know what an ensemble method is. An ensemble method is a machine learning technique that combines several base models in order to produce one optimal predictive model.

The simple version of ensembles is as follows:

  • Develop a predictive model and record the predictions for a given data set.
  • Repeat for multiple models on the same data.
  • For each record to be predicted, take an average (or a weighted average, or a majority vote) of the predictions.

Ensemble models improve model accuracy by combining the results from many models.

Unlike many ML models, which focus on high-quality prediction by a single model, boosting algorithms seek to improve predictive power by training a sequence of weak models, each compensating for the weaknesses of its predecessors.

Now, to use an XGBoost model we can leverage the xgboost package and its xgboost() function. It usually takes four common arguments.

  • data: training data in the form of a matrix
  • label: the target/label of our data
  • objective: specifies the learning task and the corresponding learning objective
  • nrounds: maximum number of boosting iterations

Now, we can try to build the model and train our data

One point to keep in mind is that our XGBoost model expects the label to be numeric, so we first need to convert our label to a numeric type

url_train_xg <- url_train %>% 
  mutate(is_phishing=as.numeric(as.character(is_phishing)))
url_test_xg <- url_test %>% 
  mutate(is_phishing=as.numeric(as.character(is_phishing)))

Let’s create our model

library(xgboost)
xgb_model <- xgboost(data = as.matrix(url_train_xg[,-1]),
                     label = url_train_xg$is_phishing,
                     nrounds = 100,
                     objective = "binary:logistic",
                     verbose=0)

Since we use the “binary:logistic” objective, the output is a probability. We can convert it into class labels (0 or 1) by thresholding at 0.5 with the as.numeric() function

pred <- predict(xgb_model, as.matrix(url_test_xg[,-1]))
prediction4 <- as.numeric(pred > 0.5)
prediction4 <- as.factor(prediction4)

Finally, let’s check our model’s accuracy

confusionMatrix(data = prediction4, reference = as.factor(url_test_xg$is_phishing), positive="1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction    0    1
#>          0 3517  181
#>          1   32 2268
#>                                                
#>                Accuracy : 0.9645               
#>                  95% CI : (0.9595, 0.969)      
#>     No Information Rate : 0.5917               
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.9258               
#>                                                
#>  Mcnemar's Test P-Value : < 0.00000000000000022
#>                                                
#>             Sensitivity : 0.9261               
#>             Specificity : 0.9910               
#>          Pos Pred Value : 0.9861               
#>          Neg Pred Value : 0.9511               
#>              Prevalence : 0.4083               
#>          Detection Rate : 0.3781               
#>    Detection Prevalence : 0.3835               
#>       Balanced Accuracy : 0.9585               
#>                                                
#>        'Positive' Class : 1                    
#> 

That’s an improvement!

Conclusion

Phishing website detection is one of the most serious issues of the digital era, in which people conduct more and more transactions, enter personal information, and interact with others via the internet. This gives criminals an opportunity to obtain information from victims that can later be used for undesirable purposes. It is therefore hoped that a machine learning model that can detect web phishing quickly and accurately will reduce the risk of cybercrime and protect the public, allowing them to use the internet and conduct transactions safely.

References

  1. R. S. Rao, T. Vaishnavi, and A. R. Pais, “CatchPhish: detection of phishing websites by inspecting URLs,” J. Ambient Intell. Humaniz. Comput., vol. 11, no. 2, pp. 813–825, 2020, doi: 10.1007/s12652-019-01311-4.
  2. M. Elsadig et al., “Intelligent Deep Machine Learning Cyber Phishing URL Detection Based on BERT Features Extraction,” Electron., vol. 11, no. 22, 2022, doi: 10.3390/electronics11223647.
  3. R. Mahajan and I. Siddavatam, “Phishing Website Detection using Machine Learning Algorithms,” Int. J. Comput. Appl., vol. 181, no. 23, pp. 45–47, 2018, doi: 10.5120/ijca2018918026.
  4. “A Guide to R Regular Expressions,” 2022. https://www.datacamp.com/tutorial/regex-r-regular-expressions-guide.
  5. R. E. Ikwu, “Extracting Feature Vectors From URL Strings For Malicious URL Detection,” 2021, [Online]. Available: https://towardsdatascience.com/extracting-feature-vectors-from-url-strings-for-malicious-url-detection-cbafc24737a.