The daily use of technology by individuals and institutions, particularly the internet, has made life simpler and facilitated commercial services and transactions. The internet is a key component of commercial services such as banking and other electronic services, and users browse a large number of web applications every day. During the COVID-19 pandemic, people relied even more on the internet to buy their daily household needs, such as food, beverages, and clothing, which is known as online purchasing. However, this has also attracted many fraudsters seeking important and valuable user data. Fraud is usually accomplished by redirecting users to a website that appears legitimate but is actually a phishing website. On the phishing website, victims are tricked into handing over information such as login credentials and other sensitive data, which can then be used to hijack accounts or steal identities. The traditional method of detecting phishing websites is to keep updating a database of blacklisted urls and Internet Protocol (IP) addresses, also known as the "blacklist" method. To evade blacklists, attackers use creative techniques to fool users, such as obfuscating the url so that it looks legitimate. Therefore, in this post we use machine learning to analyze the features of various blacklisted and legitimate urls and detect phishing websites based on those features.
Here is the list of libraries that we will be using:
library(dplyr)
library(readr)
library(urltools)
library(httr)
library(stringr)
library(tokenizers)
library(wordcloud)
library(magrittr)
library(tidytext)
library(Rwhois)
library(lubridate)
library(rvest)
In order to detect phishing websites, we need to collect data from both legitimate and phishing websites. A list of phishing websites is easy to get from an open-source service called PhishTank. This service provides lists of phishing websites in different formats, such as csv and json, that are updated hourly. Legitimate websites, however, are not as easy to find. Fortunately, I found one source that has a collection of benign, spam, phishing, malware and defacement websites. The source comes from the University of New Brunswick. You can check it from this link.
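If you prefer not to download the PhishTank file manually, something like the sketch below may work. The feed URL (and whether an application key is required) is an assumption you should verify on PhishTank's developer page before using it.
# hedged sketch: pull the hourly PhishTank feed directly
# (the URL is an assumption; an application key may be required)
phis_feed <- read_csv("http://data.phishtank.com/data/online-valid.csv")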
First we will read the phishing data
phis <- read_csv("verified_online.csv")
head(phis)
The dataframe includes the URL (Uniform Resource Locator) of the website and other information related to it. However, our focus is only on the url column because we want to detect phishing based on the url itself.
Since all of the websites listed here are phishing websites, we need to label them manually. Therefore, we will create a column is_phishing and give it a value of 1:
phis <- phis %>%
mutate(is_phishing = 1)
Similarly, we can do the exact same process for the legitimate websites. First, we read the dataset:
legit <- read_csv("ISCXurl2016/FinalDataset/url/Benign_list_big_final.csv", col_names = F)
head(legit)
Now, we will label them as 0 since they all contain legitimate websites. We also want to change the column name to url so both dataframes have the same column names.
legit <- legit %>%
mutate(is_phishing = 0) %>%
rename(url = X1)
After that, we can combine both dataframes so we only have one dataframe containing legitimate and phishing websites. We can use bind_rows() from the dplyr package to bind the dataframes by row, producing a longer result.
url_data <- bind_rows(phis[,c("url","is_phishing")], legit)
head(url_data)
Now, here comes the challenging part. As I said earlier, we are only focusing on the url itself, and you might be wondering how we can detect whether a website is phishing or not based only on the url. There are lots of resources that explain the characteristics of phishing websites, so we will try to extract features based on these characteristics.
Here, we want to check if the url contains an IP address. An IP address is a unique address that identifies a network or device on the internet or a local network. IP stands for "Internet Protocol", the set of rules governing the format of data sent via the internet or a local network. IP addresses consist of a series of four numbers separated by periods, each ranging from 0 to 255 (the first cannot be 0). It may look something like 192.0.2.1. Most legitimate websites do not use an IP address as the url host, so the use of an IP address in a url can be taken as an indicator that an attacker is trying to steal sensitive information. In order to check for this, we need to use a method called regular expressions.
A regular expression, or regex for short, is a sequence of characters (or even a single character) that describes a certain pattern found in text. In R, we can use the stringr library to work with regular expressions. Each function has a different purpose. For example, say we want to find the characters "the" in a sentence. We can use str_detect():
str_detect(string = "How many brothers and sisters have you got? ",
pattern = "the")#> [1] TRUE
The function will return TRUE if it finds the matching pattern in the string. Other than writing a word as a pattern, we can also use special characters called character classes. A character class matches any character from a predefined set of characters. Examples:
- \w: match any word character (any letter, digit, or underscore)
- \W: match any non-word character
- \d: match any digit
- \D: match any non-digit
It is also possible to create a user-defined character class by enclosing any set of characters inside square brackets. A short demo of these classes is sketched right after this paragraph; we will then apply them to the url to check if it contains an IP address. For example, this url: "http://165.232.173.145/mobile.html" should be considered TRUE since it contains an IP address.
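Below is a minimal sketch of these character classes in action with str_detect(); the example strings are made up purely for illustration.
# \d matches any digit, \W any non-word character, [0-9]{3} is a user-defined class repeated three times
str_detect("order 66", "\\d")       #> TRUE
str_detect("no digits here", "\\d") #> FALSE
str_detect("a-b", "\\W")            #> TRUE
str_detect("192", "[0-9]{3}")       #> TRUE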
str_detect(string = "http://165.232.173.145/mobile.html",
pattern = "^https?://(([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])\\.){3}([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])[\\w.\\/]+")#> [1] TRUE
In order to apply that to all of our data, we first create a helper function. This function will check every url we have and return 1 if it contains an IP address; otherwise it returns 0. We also want to create a new column that stores this information.
is_ip <- function(url) {
pattern <- "^https?://(([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])\\.){3}([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])[\\w.\\/]+"
if(str_detect(url, pattern)) {
return(1)
} else {
return(0)
}
}
We can use sapply() to apply our function across the url column:
url_data <- url_data %>%
mutate(contains_ip = sapply(url, is_ip))
Next, we will check if the url contains an "@" symbol. Let's see what kind of urls have an "@" in them.
url_data %>%
filter(str_detect(url,"@"))We will create function that return 1 if the url contains “@” symbol and 0 otherwise.
is_at <- function(url) {
pattern <- "@"
if(str_detect(url, pattern)) {
return(1)
} else {
return(0)
}
}
We apply the function to our data and create a new column to store the result:
url_data <- url_data %>%
mutate(contains_at = sapply(url, is_at))
There are many tokens and special characters that appear far more frequently in the url hostname of phishing sites than in legitimate sites. One of them is the dot. In R, we can use a package called httr to extract the hostname from the url. We will use the parse_url() function to do that.
parse_url("https://sistema.corporategfcx.homes/sobrenos/login/")#> $scheme
#> [1] "https"
#>
#> $hostname
#> [1] "sistema.corporategfcx.homes"
#>
#> $port
#> NULL
#>
#> $path
#> [1] "sobrenos/login/"
#>
#> $query
#> NULL
#>
#> $params
#> NULL
#>
#> $fragment
#> NULL
#>
#> $username
#> NULL
#>
#> $password
#> NULL
#>
#> attr(,"class")
#> [1] "url"
In order to count the number of dots in the url, we can use the str_extract_all() function. This function returns all matches of the pattern in the string. If we wrap it with length(), we get the number of dots in the url.
parsed_url <- parse_url("https://sistema.corporategfcx.homes/sobrenos/login/")
length(str_extract_all(parsed_url$hostname,"\\.")[[1]])
#> [1] 2
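As a side note (not used in the rest of this post), stringr also provides str_count(), which counts pattern matches directly and should give the same result:
str_count(parsed_url$hostname, "\\.")
#> [1] 2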
We can now create a function
dots_count <- function(url) {
parsed_url <- parse_url(url)
host <- parsed_url$hostname
pattern <- "\\."
return(length(str_extract_all(host,"\\.")[[1]]))
}
We then apply our function to count the dots in each url and store the result under the column num_of_dots:
url_data <- url_data %>%
mutate(num_of_dots = sapply(url, dots_count))
The dash is rarely used in legitimate urls, so we will create a function that returns 1 if the url hostname contains a dash.
is_dash <- function(url) {
parsed_url <- parse_url(url)
host <- parsed_url$hostname
pattern <- "-"
if(str_detect(host, pattern)) {
return(1)
} else {
return(0)
}
}
Let's apply it to our data and store the result as a new column contain_dash:
url_data <- url_data %>%
mutate(contain_dash = sapply(url, is_dash))
The "//" is usually used for redirecting a webpage to another webpage, so we want to check for its presence in the url path. Keep in mind that we extract the path from the url, since the "//" we care about is the one inside the path, not the one right after "https:", and we can again use the parse_url() function to do the job.
url_data %>%
mutate(url_path = sapply(url, function(x) parse_url(x)$path)) %>%
filter(str_detect(url_path, "//"))
Now, we create a helper function that returns 1 if the url path contains "//":
is_redirect <- function(url) {
parsed_url <- parse_url(url)
path <- parsed_url$path
pattern <- "//"
if(str_detect(path, pattern)) {
return(1)
} else {
return(0)
}
}
We apply our function and store the result in a new column contain_dblslash:
url_data <- url_data %>%
mutate(contain_dblslash = sapply(url, is_redirect))
Phishers may add an "http"/"https" token to the domain part of a url in order to trick users, so we want to check for its presence in the url host.
url_data %>%
mutate(url_host = sapply(url, function(x) parse_url(x)$hostname)) %>%
filter(str_detect(url_host, "https?"))We create the helper function that extract the url hostname and return 1 if the host contain “http/https”
is_urlhttp <- function(url) {
parsed_url <- parse_url(url)
host <- parsed_url$hostname
pattern <- "https?"
if(str_detect(host, pattern)) {
return(1)
} else {
return(0)
}
}
Apply the function to the data and store the result as a new column contain_urlhttp:
url_data <- url_data %>%
mutate(contain_urlhttp = sapply(url, is_urlhttp))
There are lots of url shortening services. These services often allow phishers to hide a long phishing url by making it short. We have tried to collect all the common short url services from the internet. Here is a pattern containing the distinct forms of short urls:
shorturl <- "^bit\\.ly|^goo\\.gl|^shorte\\.st|^go2l\\.ink|^x\\.co|^ow\\.ly|^t\\.co|^tinyurl|^tr\\.im|^is\\.gd|^cli\\.gs|^yfrog\\.com|^migre\\.me|^ff\\.im|^tiny\\.cc|^url4\\.eu|^twit\\.ac|^su\\.pr|^twurl\\.nl|^snipurl\\.com|^short\\.to|^Budurl\\.com|^ping\\.fm|^post\\.ly|^Just\\.as|^bkite\\.com|^snipr\\.com|^flic\\.kr|^loopt\\.us|^doiop\\.com|^short\\.ie|^kl\\.am|^wp\\.me|^rubyurl\\.com|^om\\.ly|^to\\.ly|^bit\\.do|^lnkd\\.in|^db\\.tt|^qr\\.ae|^adf\\.ly|^bitly\\.com|^cur\\.lv|^tinyurl\\.com|^ity\\.im|^q\\.gs|^po\\.st|^bc\\.vc|^twitthis\\.com|^u\\.to|^j\\.mp|^buzurl\\.com|^cutt\\.us|^u\\.bb|^yourls\\.org|^prettylinkpro\\.com|^scrnch\\.me|^filoops\\.info|^vzturl\\.com|^qr\\.net|^1url\\.com|^tweez\\.me|^v\\.gd|^link\\.zip\\.net|^Dwarfurl\\.com|^Digg\\.com|^htxt\\.it|^Alturl\\.com|^RedirX\\.com|^DigBig\\.com|^u\\.mavrev\\.com|^u\\.nu|^linkbee\\.com|^Yep\\.it|^posted\\.at|^xrl\\.us|^metamark\\.net|^sn\\.im|^hurl\\.ws|^eepurl\\.com|^idek\\.net|^urlpire\\.com|^chilp\\.it"
url_data %>%
mutate(url_host = sapply(url, function(x) parse_url(x)$hostname)) %>%
filter(str_detect(url_host, shorturl))
We then create a function that checks for the presence of these url shortening services in our url data:
is_short <- function(url) {
parsed_url <- parse_url(url)
host <- parsed_url$hostname
if(str_detect(host, shorturl)) {
return(1)
} else {
return(0)
}
}
We apply the function to our data and store the result in a column is_shorturl:
url_data <- url_data %>%
mutate(is_shorturl = sapply(url, is_short))
Phishing urls also tend to use unusually long hostnames, so we record the host length as a feature. We can use str_length() to count the number of characters in a string.
a <- parse_url("https://sistema.corporategfcx.homes/sobrenos/login/")
str_length(a$hostname)
#> [1] 27
We create a function to count the number of hostname characters for each url in our data.
count_host <- function(url) {
parsed_url <- parse_url(url)
host <- parsed_url$hostname
return(str_length(host))
}
Apply that and create a new column called host_length:
url_data <- url_data %>%
mutate(host_length = sapply(url, count_host))
There are a few words or tokens that are common to most phishing URLs. Let us check our data and extract all the different words from it. First, we need to remove unnecessary words such as "http", "com", etc., because they are not meaningful and every url is expected to contain them. We will use str_replace_all() to perform the task.
str_replace_all("https://sistema.corporategfcx.homes/sobrenos/login/", "https?:|com", "") %>%
str_split("[ ./,\\-\\(\\)\\[\\]]")#> [[1]]
#> [1] "" "" "sistema" "corporategfcx"
#> [5] "homes" "sobrenos" "login" ""
Let’s apply that to our data
url_data <- url_data %>%
mutate(token = str_replace_all(url, "https?:|com|www|php", "")) %>%
mutate(token = str_split(token, "[ ./,\\-\\(\\)\\[\\]]"))
After that, we can visualise the word counts for both phishing and legitimate websites using the wordcloud package:
# Phishing wordcloud
phis_only <- url_data %>%
filter(is_phishing==1)
wordcloud(as.character(phis_only$token), max.words =50, min.freq=500, random.order = F, colors=brewer.pal(8, "Dark2"))
# Legitimate wordcloud
legit_only <- url_data %>%
filter(is_phishing==0)
wordcloud(as.character(legit_only$token), max.words =50, min.freq=500, random.order = F, colors=brewer.pal(8, "Dark2"))
Now, we want to create a table that counts every word in our phishing urls and sorts it in descending order. We only want the most frequent ones, so we keep words that appear more than 500 times:
words_count <- url_data %>%
mutate(token = str_replace_all(url, "https?:|com|www|php", " ")) %>%
unnest_tokens(output = words, input = token, token = "regex", pattern = "[ ./,\\-\\(\\)\\[\\]]") %>%
filter(is_phishing==1) %>%
count(words, sort = TRUE) %>%
filter(n>500)
words_count
For legitimate urls, perhaps we can set the threshold to 300:
word_counts_legit <- url_data %>%
mutate(token = str_replace_all(url, "https?:|com|www|php", " ")) %>%
unnest_tokens(output = words, input = token, token = "regex", pattern = "[ ./,\\-\\(\\)\\[\\]]") %>%
filter(is_phishing==0) %>%
count(words, sort = TRUE) %>%
filter(words %in% words_count$words, n>300)
word_counts_legit
I have also found that the following words are quite commonly used in phishing websites, so let's combine them with the ones above:
sens_words <- words_count %>%
filter(!words %in% word_counts_legit$words) %>%
.$words
sens_words <- c(sens_words, 'confirm',
'account', 'banking', 'secure', 'ebyisapi', 'webscr', 'signin',
'mail', 'install', 'toolbar', 'backup', 'paypal', 'password',
'username')
We now create a function that checks for the presence of these sensitive words in our url:
is_sensitive <- function(url) {
words <- str_replace_all(url, "https?:|com|www|php", "")
words <- str_split(words, "[ ./,\\-\\(\\)\\[\\]]")
words <- words[[1]]
if(any(sens_words %in% words)) {
return(1)
} else {
return(0)
}
}
Apply that to our data and store the result as a column contains_sensitive:
url_data <- url_data %>%
mutate(contains_sensitive = sapply(url, is_sensitive))
Now comes one of the most challenging feature extractions: we want to check the age of the domain used in each url. There is a well-known service that provides this information, called "WHOIS". WHOIS is a public database that houses the information collected when someone registers a domain name or updates their DNS settings. In order to retrieve this information, we will use a package called whois. This package is intended for Python users, so we cannot call it from R directly. The interesting part is that we can run Python code in a chunk using the reticulate package.
# Connecting python and conda environment
library(reticulate)
use_condaenv("smm_dadp")
Now, we will create a new Python chunk and try the whois package. We will use a function called whois() to retrieve information from the WHOIS database. Since we want to check the age of the domain, we can use fields such as "creation_date" and "expiration_date".
import whois
import pandas as pd
import numpy as np
res = whois.whois("sistema.corporategfcx.homes")
print(res)
#> {
#> "domain_name": null,
#> "registrar": null,
#> "whois_server": null,
#> "referral_url": null,
#> "updated_date": null,
#> "creation_date": null,
#> "expiration_date": null,
#> "name_servers": null,
#> "status": null,
#> "emails": null,
#> "dnssec": null,
#> "name": null,
#> "org": null,
#> "address": null,
#> "city": null,
#> "state": null,
#> "registrant_postal_code": null,
#> "country": null
#> }
print(res.creation_date)
#> None
print(res.expiration_date)
#> None
After that, we will create a function that computes the age of the domain by subtracting the creation_date from the expiration_date. In order to use an R object in a Python chunk, we can refer to it as r.name_of_variable.
⚠️ Warning: Applying that function into our data is computationally heavy. Depending on your computer, this will take an average of 24 hours to finish.
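One possible way to reduce this cost (not part of the original workflow; the domain column and join step here are hypothetical) is to resolve each registered domain only once and then join the ages back onto the url table. A minimal R sketch using urltools:
# hypothetical sketch: one WHOIS lookup per unique domain instead of per url
domain_tbl <- url_data %>%
  mutate(domain = urltools::domain(url)) %>%  # extract the hostname from each url
  distinct(domain)
# ...resolve the age for each row of domain_tbl (e.g. with the Python helper below),
# then left_join() the results back onto url_data by domain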
def domain_age(url):
try:
url = str(url)
res = whois.whois(url)
df = pd.DataFrame({"date": [res.creation_date, res.expiration_date]})
age = (df["date"][1]-df["date"][0])//np.timedelta64(1, 'M')
except:
age = None
return age
# r.url_data = r.url_data.assign(age_of_domain = r.url_data["url"])
#
# phish = (r.url_data["is_phishing"]==1)
#
# age_of_domain = r.url_data["url"].apply(domain_age)
After that, we can take the Python variable that holds the domain ages and use it in an R chunk with py$name_of_variable.
url_data["age_of_domain"] <- py$age_of_domainLet’s check if there is a missing value
url_data %>%
is.na() %>%
colSums()
#> url is_phishing contains_ip contains_at
#> 0 0 0 0
#> num_of_dots contain_dash contain_dblslash contain_urlhttp
#> 0 0 0 0
#> is_shorturl host_length token contains_sensitive
#> 0 0 0 0
#> age_of_domain
#> 33057
plot(url_data$age_of_domain~as.factor(url_data$is_phishing))
We can fill the missing values with the median value:
url_data[is.na(url_data$age_of_domain),"age_of_domain"] <- median(url_data$age_of_domain, na.rm = T)
Then we drop the columns we no longer need and convert the target into a factor:
url_data_clean <- url_data %>%
select(-c(token,url)) %>%
mutate(is_phishing = as.factor(is_phishing))
Now we can split our data into training and testing sets:
RNGkind(sample.kind = "Rounding")
set.seed(417)
index <- sample(x = nrow(url_data_clean) , size = nrow(url_data_clean) * 0.9)
# splitting
url_train <- url_data_clean[index , ]
url_test <- url_data_clean[-index , ]
Check the proportion of our target:
prop.table(table(url_train$is_phishing))
#>
#> 0 1
#> 0.5896443 0.4103557
Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms, which is used for Classification as well as Regression problems. The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes so that we can easily put the new data point in the correct category in the future. This best decision boundary is called a hyperplane.
We can use the e1071 package to build an svm classifier with the svm() function. It takes 4 main arguments:
- formula: y ~ x
- data: the data that we want to train on
- type: svm can be used for classification or regression, so this argument depends on what we want
- kernel: the kernel used in training and predicting
When dealing with nonlinear data, a Support Vector Machine transforms it into a higher dimension where it may be linearly separable. It does this by utilizing various kernel settings.
Let’s try with the linear kernel
library(e1071)
svm_model <- svm(formula = is_phishing~.,
data = url_train,
type = "C-classification",
kernel = "linear")We can predict the data test using our first model
prediction <- predict(svm_model, url_test)
After that, we evaluate our model using a confusion matrix:
library(caret)
confusionMatrix(data = prediction, reference = url_test$is_phishing, positive = "1")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 3471 338
#> 1 78 2111
#>
#> Accuracy : 0.9306
#> 95% CI : (0.9239, 0.9369)
#> No Information Rate : 0.5917
#> P-Value [Acc > NIR] : < 0.00000000000000022
#>
#> Kappa : 0.8541
#>
#> Mcnemar's Test P-Value : < 0.00000000000000022
#>
#> Sensitivity : 0.8620
#> Specificity : 0.9780
#> Pos Pred Value : 0.9644
#> Neg Pred Value : 0.9113
#> Prevalence : 0.4083
#> Detection Rate : 0.3520
#> Detection Prevalence : 0.3650
#> Balanced Accuracy : 0.9200
#>
#> 'Positive' Class : 1
#>
We see quite a high accuracy of around 93%. However, perhaps we can tune our model to get better accuracy.
Let's try another kernel setting, "radial":
svm_model2 <- svm(formula = is_phishing~.,
data = url_train,
type = "C-classification",
kernel = "radial")prediction2 <- predict(svm_model2, url_test)confusionMatrix(data = prediction2, reference = url_test$is_phishing, positive = "1")#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 3524 310
#> 1 25 2139
#>
#> Accuracy : 0.9441
#> 95% CI : (0.938, 0.9498)
#> No Information Rate : 0.5917
#> P-Value [Acc > NIR] : < 0.00000000000000022
#>
#> Kappa : 0.8823
#>
#> Mcnemar's Test P-Value : < 0.00000000000000022
#>
#> Sensitivity : 0.8734
#> Specificity : 0.9930
#> Pos Pred Value : 0.9884
#> Neg Pred Value : 0.9191
#> Prevalence : 0.4083
#> Detection Rate : 0.3566
#> Detection Prevalence : 0.3608
#> Balanced Accuracy : 0.9332
#>
#> 'Positive' Class : 1
#>
There is an improvement in accuracy! Can we do better? Well, let's try to tune the hyperparameters for our radial kernel setting. We can try to tune these two parameters:
- Regularization parameter (C)/cost: it tells us how much misclassification we want to avoid.
- Gamma: the gamma values used in SVM are similar to the C values. Gamma is a parameter for non-linear hyperplanes. It defines how far the influence of a single training example reaches, with low values meaning 'far' and high values meaning 'close'.
Let's apply that to our model; we will set cost = 10 and gamma = 5:
svm_model3 <- svm(formula = is_phishing~.,
data = url_train,
type = "C-classification",
kernel = "radial",
cost = 10,
gamma = 5)
prediction3 <- predict(svm_model3, url_test)
confusionMatrix(data = prediction3, reference = url_test$is_phishing, positive = "1")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 3524 235
#> 1 25 2214
#>
#> Accuracy : 0.9567
#> 95% CI : (0.9512, 0.9617)
#> No Information Rate : 0.5917
#> P-Value [Acc > NIR] : < 0.00000000000000022
#>
#> Kappa : 0.9091
#>
#> Mcnemar's Test P-Value : < 0.00000000000000022
#>
#> Sensitivity : 0.9040
#> Specificity : 0.9930
#> Pos Pred Value : 0.9888
#> Neg Pred Value : 0.9375
#> Prevalence : 0.4083
#> Detection Rate : 0.3691
#> Detection Prevalence : 0.3733
#> Balanced Accuracy : 0.9485
#>
#> 'Positive' Class : 1
#>
We can see that by tuning the hyperparameters, we can improve the accuracy.
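Rather than picking cost and gamma by hand, e1071 also ships tune.svm(), which runs a cross-validated grid search. Below is a minimal sketch; the grid values are arbitrary examples, and the search can take a long time on a dataset of this size.
set.seed(417)
svm_tuned <- tune.svm(is_phishing ~ ., data = url_train,
                      kernel = "radial",
                      cost = c(1, 10, 100),
                      gamma = c(0.5, 1, 5))
summary(svm_tuned)          # cross-validated error for every cost/gamma combination
svm_tuned$best.parameters   # the combination with the lowest error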
XGBoost is short for eXtreme Gradient Boosting and is one of the boosting algorithms. To understand what a boosting algorithm is, we first need to know what an ensemble method is. An ensemble method is a machine learning technique that combines several base models in order to produce one optimal predictive model.
The simple version of ensembling is as follows: ensemble models improve accuracy by combining the results from many models. Unlike many ML models that focus on high-quality predictions from a single model, boosting algorithms seek to improve predictive power by training a sequence of weak models, each compensating for the weaknesses of its predecessors.
Now, to use an XGBoost model we can leverage the xgboost package and its xgboost() function. It usually takes 4 common arguments:
- data: the training data, in matrix form
- label: the target/label of our data
- objective: specifies the learning task and the corresponding learning objective
- nrounds: the maximum number of boosting iterations
Now, we can try to build the model and train it on our data.
One thing to keep in mind is that the XGBoost model expects the label to be numeric, so we first need to convert our label to a numeric type:
url_train_xg <- url_train %>%
mutate(is_phishing=as.numeric(as.character(is_phishing)))
url_test_xg <- url_test %>%
mutate(is_phishing=as.numeric(as.character(is_phishing)))
Let's create our model:
library(xgboost)
xgb_model <- xgboost(data = as.matrix(url_train_xg[,-1]),
label = url_train_xg$is_phishing,
nrounds = 100,
objective = "binary:logistic",
verbose=0)
Since we use the "binary:logistic" objective, the output is a probability. We can convert it into our classes {0, 1} by thresholding at 0.5 and using the as.numeric() function:
pred <- predict(xgb_model, as.matrix(url_test_xg[,-1]))
prediction4 <- as.numeric(pred > 0.5)
prediction4 <- as.factor(prediction4)
Last, we will check our model's accuracy:
confusionMatrix(data = prediction4, reference = as.factor(url_test_xg$is_phishing), positive="1")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 3517 181
#> 1 32 2268
#>
#> Accuracy : 0.9645
#> 95% CI : (0.9595, 0.969)
#> No Information Rate : 0.5917
#> P-Value [Acc > NIR] : < 0.00000000000000022
#>
#> Kappa : 0.9258
#>
#> Mcnemar's Test P-Value : < 0.00000000000000022
#>
#> Sensitivity : 0.9261
#> Specificity : 0.9910
#> Pos Pred Value : 0.9861
#> Neg Pred Value : 0.9511
#> Prevalence : 0.4083
#> Detection Rate : 0.3781
#> Detection Prevalence : 0.3835
#> Balanced Accuracy : 0.9585
#>
#> 'Positive' Class : 1
#>
That's an improvement!
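As a possible next step (not covered in the original analysis), the xgboost package offers xgb.importance() to inspect which of our hand-crafted url features drive the model's predictions:
# feature importance of the fitted XGBoost model
importance <- xgb.importance(feature_names = colnames(url_train_xg[,-1]),
                             model = xgb_model)
head(importance)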
Phishing website detection is a serious issue, particularly in the digital era, when people conduct more and more transactions, enter personal information, and interact with others via the internet. This gives criminals an opportunity to obtain information from victims that can later be used for undesirable purposes. It is therefore hoped that a machine learning model that detects web phishing quickly and accurately will reduce the risk of cybercrime and protect the public, allowing them to use the internet and conduct transactions safely.