| variable | class | description |
|---|---|---|
| rank | double | popularity in their database of released passwords |
| password | character | Actual text of the password |
| category | character | What category does the password fall in to? |
| value | double | Time to crack by online guessing |
| time_unit | character | Time unit to match with value |
| offline_crack_sec | double | Time to crack offline in seconds |
| rank_alt | double | Rank 2 |
| strength | double | Strength = quality of password where 10 is highest, 1 is lowest, please note that these are relative to these generally bad passwords |
| font_size | double | Used to create the graphic for KIB |
Read in Data
passwords <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-14/passwords.csv')
Parsed with column specification:
cols(
rank = col_double(),
password = col_character(),
category = col_character(),
value = col_double(),
time_unit = col_character(),
offline_crack_sec = col_double(),
rank_alt = col_double(),
strength = col_double(),
font_size = col_double()
)
passwords <- dplyr::distinct(passwords, .keep_all = TRUE)
Here’s the first few rows of our dataset
# A tibble: 501 x 9
rank password category value time_unit offline_crack_s~ rank_alt strength
<dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 1 password passwor~ 6.91 years 2.17 1 8
2 2 123456 simple-~ 18.5 minutes 0.0000111 2 4
3 3 12345678 simple-~ 1.29 days 0.00111 3 4
4 4 1234 simple-~ 11.1 seconds 0.000000111 4 4
5 5 qwerty simple-~ 3.72 days 0.00321 5 8
6 6 12345 simple-~ 1.85 minutes 0.00000111 6 4
7 7 dragon animal 3.72 days 0.00321 7 8
8 8 baseball sport 6.91 years 2.17 8 4
9 9 football sport 6.91 years 2.17 9 7
10 10 letmein passwor~ 3.19 months 0.0835 10 8
# ... with 491 more rows, and 1 more variable: font_size <dbl>
We have 501 observations and 9 variables
passwords %>%
select_if(function(x) any(is.na(x))) %>%
summarise_each(funs(sum(is.na(.))))
Warning: funs() is soft deprecated as of dplyr 0.8.0
Please use a list of either functions or lambdas:
# Simple named list:
list(mean = mean, median = median)
# Auto named with `tibble::lst()`:
tibble::lst(mean, median)
# Using lambdas
list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
This warning is displayed once per session.
# A tibble: 1 x 9
rank password category value time_unit offline_crack_s~ rank_alt strength
<int> <int> <int> <int> <int> <int> <int> <int>
1 1 1 1 1 1 1 1 1
# ... with 1 more variable: font_size <int>
There are 1 missing values in each column. We’ll drop them.
passwords <- passwords %>%
na.omit(passwords)
dim(passwords)
[1] 500 9
We have 6 numeric variables in our dataset. However, our response variable, strength, signifies the level of strength of the passwords so it should be a categorical variable
passwords %>%
select_if(., is.numeric)
# A tibble: 500 x 6
rank value offline_crack_sec rank_alt strength font_size
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 6.91 2.17 1 8 11
2 2 18.5 0.0000111 2 4 8
3 3 1.29 0.00111 3 4 8
4 4 11.1 0.000000111 4 4 8
5 5 3.72 0.00321 5 8 11
6 6 1.85 0.00000111 6 4 8
7 7 3.72 0.00321 7 8 11
8 8 6.91 2.17 8 4 8
9 9 6.91 2.17 9 7 11
10 10 3.19 0.0835 10 8 11
# ... with 490 more rows
passwords %>%
sjmisc::descr()
## Basic descriptive statistics
var type label n NA.prc mean sd se
rank numeric rank 500 0 250.50 144.48 6.46
value numeric value 500 0 5.60 8.44 0.38
offline_crack_sec numeric offline_crack_sec 500 0 0.50 2.66 0.12
rank_alt numeric rank_alt 500 0 251.22 145.05 6.49
strength numeric strength 500 0 7.43 5.42 0.24
font_size numeric font_size 500 0 10.30 3.65 0.16
md trimmed range iqr skew
250.50 250.50 499 (1-500) 249.50000 0.00
3.72 4.23 90.98 (1.29-92.27) 0.29000 8.86
0.00 0.08 29.27 (0-29.27) 0.08029 10.03
251.50 251.16 501 (1-502) 250.50000 0.00
7.00 7.06 48 (0-48) 2.00000 4.46
11.00 10.61 28 (0-28) 1.00000 0.13
rank and rank_alt have similar descriptive values. It looks like they give us the same information. Offline_crack_sec have a skew value of 10.03 which shows that most passwords takes little time to be figured out. Below is the historgram of Offline_crack_sec which confirms the right-skewness of the variable. We’ll have to normalize or standardize numeric variables for linear models since they are susceptible to outliers. The same applies to value.
passwords %>%
ggplot2::ggplot(aes(x=offline_crack_sec)) +
geom_histogram(bins=30)
passwords %>%
ggplot2::ggplot(aes(x=value)) +
geom_histogram(bins=30)
The histogram of value also shows that this variable is right skewed.
There are 4 categorical variables
passwords %>%
select_if(., is.character)
# A tibble: 500 x 3
password category time_unit
<chr> <chr> <chr>
1 password password-related years
2 123456 simple-alphanumeric minutes
3 12345678 simple-alphanumeric days
4 1234 simple-alphanumeric seconds
5 qwerty simple-alphanumeric days
6 12345 simple-alphanumeric minutes
7 dragon animal days
8 baseball sport years
9 football sport years
10 letmein password-related months
# ... with 490 more rows
Below are the number of unique values in each variable.
passwords %>%
select_if(., is.character) %>%
dplyr::summarise_all(dplyr::funs(dplyr::n_distinct(.)))
# A tibble: 1 x 3
password category time_unit
<int> <int> <int>
1 500 10 7
We’ll look into the response variable, strength first.
passwords %>%
ggplot(aes(x=strength)) +
geom_bar() +
labs(title = 'Reponse Variable Distribution')
Inspecting the the dsitribution of the response variable, there are values greater than 10. The values that are outside this range are probably wrongly recorded or invalid. Let’s inspect what these values are. In addition, there is a value with less than 5 count. This means we cannot perform 5fold tratified cross validation. We’ll opt to use 5fold cv, instead.
passwords %>%
dplyr::filter(strength > 10)
# A tibble: 15 x 9
rank password category value time_unit offline_crack_s~ rank_alt strength
<dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 13 abc123 simple-~ 3.7 weeks 0.0224 13 32
2 26 trustno1 simple-~ 92.3 years 29.0 26 25
3 149 ncc1701 nerdy-p~ 2.56 years 0.806 149 46
4 197 thx1138 nerdy-p~ 2.56 years 0.806 198 46
5 274 8675309 nerdy-p~ 3.09 hours 0.000111 275 19
6 321 bond007 nerdy-p~ 2.56 years 0.806 322 38
7 336 rush2112 nerdy-p~ 92.3 years 29.0 337 48
8 344 red123 simple-~ 3.7 weeks 0.0224 345 35
9 359 ou812 nerdy-p~ 17.3 hours 0.000622 360 36
10 395 heka6w2 simple-~ 2.56 years 0.806 396 36
11 406 jordan23 sport 92.3 years 29.3 407 34
12 407 eagle1 simple-~ 3.7 weeks 0.0224 408 21
13 463 123abc simple-~ 3.7 weeks 0.0224 465 32
14 478 test123 simple-~ 2.56 years 0.806 480 36
15 500 passw0rd passwor~ 92.3 years 29.0 502 28
# ... with 1 more variable: font_size <dbl>
There are 15 values that are outside the range of 1-10 and seem randomly created. I will remove these values.
passwords <- passwords %>%
dplyr::filter(strength <= 10)
Now we have only 485 observations in our data.
passwords %>%
ggplot(aes(x=category)) +
geom_bar() +
labs(title = 'Category Variable Distribution') +
theme(axis.text.x = element_text(angle=45))
We have ten unique values in category. Most values are in the ‘name’ value.
The test of independence for categorical variables are adapted from this article. In particular, theils’ u was developed to measure the level of association between categorical variables.
library(reticulate)
use_condaenv("D:/Anaconda/envs/pythonwdask/python.exe")
from typing import List
import pandas as pd
import numpy as np
import math
import scipy.stats as ss
from collections import Counter
def cond_entropy(x,y, log_base = math.e):
"""Compute conditional entropy of two categorical variables """
y_counter = Counter(y)
xy_counter = Counter(list(zip(x,y)))
total_occurrences = sum(y_counter.values())
entropy = 0.0
for xy in xy_counter.keys():
p_xy = xy_counter[xy] / total_occurrences
p_y = y_counter[xy[1]] / total_occurrences
entropy += p_xy * math.log(p_y / p_xy, log_base)
return entropy
def theils_u(x, y):
"""Compute association value of x w.r.t y (categorical variables)"""
s_xy = cond_entropy(x, y)
x_counter = Counter(x)
total_occurences = sum(x_counter.values())
p_x = list(map(lambda n: n / total_occurences, x_counter.values()))
s_x = ss.entropy(p_x)
if s_x == 0:
return 1
else:
return (s_x - s_xy) / s_x
def correlation_ratio(cats, measurements):
"""Compute assoviation value between a categorical and a quantitative variable"""
fcat, _ = pd.factorize(cats)
cat_num = np.max(fcat) + 1
y_avg_array = np.zeros(cat_num)
n_array = np.zeros(cat_num)
for i in range(0, cat_num):
cat_measures = measurements.iloc[np.argwhere(fcat == i).flatten(),]
n_array[i] = len(cat_measures)
y_avg_array[i] = np.average(cat_measures)
y_total_avg = np.sum(np.multiply(y_avg_array, n_array)) / np.sum(n_array)
numerator = np.sum(np.multiply(n_array, np.power(np.subtract(y_avg_array, y_total_avg), 2)))
denominator = np.sum(np.power(np.subtract(measurements, y_total_avg),2))
if numerator == 0:
eta == 0.0
else:
eta = np.sqrt(numerator / denominator)
return eta
def corr_mat(df: pd.DataFrame, categories: List[str], measurements: List[str]) -> np.array:
"""Compute correlation matrix of variable in a dataset"""
idx_id = 0.0
id_idx = 0.0
columns = categories + measurements
rows = cols = len(columns)
corr_matrix = np.zeros((rows, cols))
corr_matrix = pd.DataFrame(corr_matrix, index=columns, columns=columns)
for idx, var in enumerate(columns):
for id, var_corr in enumerate(columns):
if var == var_corr:
idx_id = id_idx = 1.0
else:
if var in categories:
if var_corr in categories:
idx_id = theils_u(df[var], df[var_corr])
id_idx = theils_u(df[var_corr], df[var])
else:
idx_id = correlation_ratio(df[var], df[var_corr])
id_idx = idx_id
else:
if var_corr in categories:
idx_id = correlation_ratio(df[var_corr], df[var])
id_idx = idx_id
else:
idx_id, _ = ss.pearsonr(df[var], df[var_corr])
id_idx = idx_id
corr_matrix.loc[var, var_corr] = idx_id if not np.isnan(idx_id) and abs(idx_id) < np.inf else 0.0
corr_matrix.loc[var_corr, var] = id_idx if not np.isnan(id_idx) and abs(id_idx) < np.inf else 0.0
return corr_matrix
df = r.passwords
cat_vars = ["password", "category", "strength", "time_unit"]
num_vars = ["rank", "offline_crack_sec", "font_size", "rank_alt", "value"]
corr_matriex = corr_mat(df, cat_vars, num_vars)
import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(corr_matriex, annot=True, cbar_kws= {'orientation': 'horizontal'})
plt.show()
plt.tight_layout()
Note: Password has perfect correlation with most variables. We should remove it in our model. Font_size and category have a pretty strong linear relationship w each other.
Rank and Rank_Alt are perfectly linear( shown by the plot and the correlation of 1). They are likely to carry similar information. We’ll discard one of them. Strength (our response variable) has a strong correlation with font_size. I suspect this variable will be important in predicting the strength of the passwords. time_unit and value have strong correlation with each other. It makes sense since the description says time_unit specifies the time unit of value. We’ll remove one of these varaibles. Since value and offline_crack_sec indicate the amount of time it takes to guess the passwords online and offline, respectively. They give similar information. Thus, we can remove value.
pp_passwords <- passwords %>%
select(-value, -rank_alt, -password)
pp_passwords
# A tibble: 485 x 6
rank category time_unit offline_crack_sec strength font_size
<dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 1 password-related years 2.17 8 11
2 2 simple-alphanumeric minutes 0.0000111 4 8
3 3 simple-alphanumeric days 0.00111 4 8
4 4 simple-alphanumeric seconds 0.000000111 4 8
5 5 simple-alphanumeric days 0.00321 8 11
6 6 simple-alphanumeric minutes 0.00000111 4 8
7 7 animal days 0.00321 8 11
8 8 sport years 2.17 4 8
9 9 sport years 2.17 7 11
10 10 password-related months 0.0835 8 11
# ... with 475 more rows
Now there are only 6 variables in our dataset
pp_passwords <- pp_passwords %>%
dplyr::mutate_if(is.character, factor, ordered = FALSE) %>%
dplyr::mutate(strength = factor(strength))
# passwords <- passwords %>%
# dplyr::select(-password) %>%
# dplyr::mutate_if(is.character, factor, ordered = FALSE) %>%
# dplyr::mutate(strength = factor(strength))
In order to make sure we don’t overfit our model, we’ll split our data into a train/test set with a 70/30 ratio. Cross-validation will be performed on the train set. The test set is preserved as the final check of the model’s performance. In addition, since we’ll be using some linear models, numerical data will be standardized. The tree-based models only accept numerical data. Hence, categorical variable will be one-hot encoded, except for the response variable. Last but not least, features with zero variance don’t contribute to the predictive abilities of the model. As a result, we’ll remove them.
# set seed for replication
set.seed(123)
# Stratified Kfold on the response variable
pp_split <-rsample::initial_split(pp_passwords, prop = 0.7,
strata = 'strength')
split <- rsample::initial_split(passwords, prop = 0.7,
strata = 'strength')
# Get train set
pp_train <- rsample::training(pp_split)
train <- rsample::training(split)
# Get test set
pp_test <- rsample::testing(pp_split)
test <- rsample::testing(split)
# Preparing a recipe for preprocessing
bluebrint <- recipes::recipe(strength ~ ., data = pp_passwords) %>%
recipes::step_nzv(all_nominal()) %>% # Removing zero variance features
recipes::step_center(all_numeric(), -all_outcomes()) %>%
recipes::step_scale(all_numeric(), -all_outcomes()) %>% # These two steps standardize the numeric variables
recipes::step_dummy(matches("categ|time"))# lastly, we'll one-hot encode the categorical variables
# Preprocessing the data
prepare <- recipes::prep(bluebrint, training = pp_train)
xgb_train <- recipes::bake(prepare, new_data = pp_train)
xgb_test <- recipes::bake(prepare, new_data = pp_test)
print('The dimension of the train data:')
[1] "The dimension of the train data:"
print(dim(xgb_train))
[1] 341 19
print('The dimension of the test data:')
[1] "The dimension of the test data:"
dim(xgb_test)
[1] 144 19
We’ll perform 5-fold cross validation on the train set using the mlr package
ppclf_task <- mlr::makeClassifTask(data = xgb_train, target = 'strength')
Warning in makeTask(type = type, data = data, weights = weights, blocking =
blocking, : Provided data is not a pure data.frame but from class tbl_df, hence
it will be converted.
pprdesc <- mlr::makeResampleDesc("CV", iters = 5, stratify = FALSE)
We’ll fit the data with decision trees, KNN, and xgboost.
learnerCART=mlr::makeLearner(id="CART","classif.rpart", predict.type = "prob")
learnerXGBM=mlr::makeLearner(id="XGBM","classif.xgboost", predict.type = "prob")
learnerKNN=mlr::makeLearner(id="KNN","classif.knn")
set.seed(123)
learners <- list(learnerKNN, learnerCART, learnerXGBM)
bmr <- mlr::benchmark(learners, ppclf_task, pprdesc, measures=list(acc), models = TRUE, show.info = FALSE)
bmr
task.id learner.id acc.test.mean
1 xgb_train KNN 0.5806905
2 xgb_train CART 0.7478261
3 xgb_train XGBM 0.7974851
Performance Plot
plotBMRBoxplots(bmr, measure = acc, order.lrn = getBMRLearnerIds(bmr), style = "violin", pretty.names = FALSE) +
aes(color = learner.id) +
theme(strip.text.x = element_text(size = 8))
Warning: `fun.y` is deprecated. Use `fun` instead.
Warning: `fun.ymin` is deprecated. Use `fun.min` instead.
Warning: `fun.ymax` is deprecated. Use `fun.max` instead.
KNN perform bad for this dataset. While decision trees and xgboost perform, on average, similarly well. The best model is XGBM. We’ll tune this model for optimal performance.
To get the optimal model, we’ll need to tune the learning rate to get a stable model. Instead of using gridsearch, we’ll opt for randomsearch as it is more efficient when tuning a large set of values.
set.seed(123)
# Defining task
ppclf_task1 <- mlr::makeClassifTask(data = xgb_train, target = 'strength')
Warning in makeTask(type = type, data = data, weights = weights, blocking =
blocking, : Provided data is not a pure data.frame but from class tbl_df, hence
it will be converted.
# Set of learning rate values for tuning
learning_rate <- ParamHelpers::makeParamSet(
ParamHelpers::makeNumericParam("eta", lower = -0.3, upper = 0.3)
)
# Using random search
ctrl <- mlr::makeTuneControlRandom(maxit=200L)
# Tuning
res <- mlr::tuneParams("classif.xgboost", task = ppclf_task1, resampling = pprdesc, par.set = learning_rate, control = ctrl, measures = acc, show.info = FALSE)
opt_eta <- res$x$eta
print(paste("The optimal learning rate is:", round(opt_eta, 3)))
[1] "The optimal learning rate is: 0.075"
Once we have derived an optimal learning rate, we’ll tune the optimal number of trees for the model. Instead of using random search to tune out trees, we’ll tune in a specific set of values and narrow down to find the optimal number of trees.
set.seed(123)
opt_eta <- res$x$eta
# Set of learning rate values for tuning
ntrees <- ParamHelpers::makeParamSet(
ParamHelpers::makeDiscreteParam("nrounds", values = c(100, 500, 1000)) # start with 3 widely different values
)
# Using grid search
ctrl <- mlr::makeTuneControlGrid()
# Setting the optimal learning rate
eta_lnr <- mlr::setHyperPars(makeLearner("classif.xgboost"), eta = opt_eta)
# Tuning
res1 <- mlr::tuneParams(eta_lnr, task = ppclf_task1, resampling = pprdesc, par.set = ntrees, control = ctrl, measures = acc, show.info = FALSE)
print(paste('The best one out of 3 is :', res1$x$nrounds))
[1] "The best one out of 3 is : 100"
Now we’ll zoom in the best values out of three.
[1] "The best one out of 3 is : 50"
Continue to zoom in
set.seed(123)
opt_eta <- res$x$eta
learning_rate <- ParamHelpers::makeParamSet(
ParamHelpers::makeDiscreteParam("nrounds", values = c(25, 50, 75))
)
ctrl <- mlr::makeTuneControlGrid()
eta_lnr <- mlr::setHyperPars(makeLearner("classif.xgboost"), eta = opt_eta)
res1 <- mlr::tuneParams(eta_lnr, task = ppclf_task1, resampling = pprdesc, par.set = learning_rate, control = ctrl, measures = acc, show.info = FALSE)
print(paste('The best one out of 3 is :', res1$x$nrounds))
[1] "The best one out of 3 is : 50"
set.seed(123)
opt_eta <- res$x$eta
learning_rate <- ParamHelpers::makeParamSet(
ParamHelpers::makeIntegerParam("nrounds", lower = 60, upper = 80)
)
ctrl <- mlr::makeTuneControlGrid()
eta_lnr <- mlr::setHyperPars(makeLearner("classif.xgboost"), eta = opt_eta)
res1 <- mlr::tuneParams(eta_lnr, task = ppclf_task1, resampling = pprdesc, par.set = learning_rate, control = ctrl, measures = acc, show.info = FALSE)
print(paste('The best one out of 3 is :', res1$x$nrounds))
[1] "The best one out of 3 is : 73"
opt_nrounds <- res1$x$nrounds
treeHype <- ParamHelpers::makeParamSet(
ParamHelpers::makeIntegerParam("max_depth", lower = 3, upper = 10),
ParamHelpers::makeNumericParam("subsample", lower = 0.5, upper = 1),
ParamHelpers::makeNumericParam("min_child_weight", lower = 1, upper = 15),
ParamHelpers::makeNumericParam("colsample_bytree", lower = 0.5, upper = 1)
)
ctrl <- mlr::makeTuneControlRandom()
eta_lnr <- mlr::setHyperPars(makeLearner("classif.xgboost"), eta = opt_eta, nrounds = opt_nrounds)
res1 <- mlr::tuneParams(eta_lnr, task = ppclf_task1, resampling = pprdesc, par.set = treeHype, control = ctrl, measures = acc, show.info = FALSE)
Below are the optimal tree hyper parameters
res1
Tune result:
Op. pars: max_depth=9; subsample=0.842; min_child_weight=1.42; colsample_bytree=0.81
acc.test.mean=0.7916454
Now we’ll use these derived paramters to train a new xgboost model and evaluate its performance on the test set
set.seed(123)
opt_maxdepth <- res1$x$max_depth
opt_subsample <- res1$x$subsample
opt_minchild <- res1$x$min_child_weight
opt_colsample <- res1$x$colsample_bytree
opt_lrn <- setHyperPars(makeLearner("classif.xgboost"), max_depth = opt_maxdepth,
subsample = opt_subsample, min_child_weight = opt_minchild,
colsample_bytree = opt_colsample)
m = train(opt_lrn, ppclf_task1)
Evaluating the final model
train_predictions <- predict(m, newdata = xgb_train)$data
Warning in predict.WrappedModel(m, newdata = xgb_train): Provided data for
prediction is not a pure data.frame but from class tbl_df, hence it will be
converted.
test_predictions <- predict(m, newdata = xgb_test)$data
Warning in predict.WrappedModel(m, newdata = xgb_test): Provided data for
prediction is not a pure data.frame but from class tbl_df, hence it will be
converted.
# predictions %>%
# dplyr::filter(truth == response) %>%
# dplyr::summarise(count = n())
print('The train accuracy is')
[1] "The train accuracy is"
mean(train_predictions$truth == train_predictions$response)
[1] 0.7771261
print('The test accuracy is')
[1] "The test accuracy is"
mean(test_predictions$truth == test_predictions$response)
[1] 0.7430556
Our final model is good. it’s not overfitting and the accuracy is around 80%. Let’s check the confusion matrix
mlr::calculateConfusionMatrix(predict(m, newdata = xgb_test))
Warning in predict.WrappedModel(m, newdata = xgb_test): Provided data for
prediction is not a pure data.frame but from class tbl_df, hence it will be
converted.
predicted
true 0 1 2 3 4 5 6 7 8 9 10 -err.-
0 8 3 0 0 0 0 0 0 3 0 0 6
1 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 13 0 0 0 0 0 0 0
5 0 0 0 0 0 4 0 0 5 0 0 5
6 0 0 0 0 0 0 21 0 0 0 0 0
7 0 0 0 0 0 0 0 10 17 0 0 17
8 0 0 0 0 0 0 0 7 36 0 0 7
9 0 0 0 0 0 0 0 0 0 15 0 0
10 0 0 0 0 0 0 0 0 0 2 0 2
-err.- 0 3 0 0 0 0 0 7 25 2 0 37
Overall our model performs well. We can see that it incorrectly predicts 18 cases of strength lv 7 to be 8. It makese sense since 7 is very close to 8 and 8 makes up the majority of values in the strength variable.
# Saving model for futher reuse
saveRDS(m, "./final_models.rds")
model <- readRDS("./final_models.rds")
In this notebook, we perform data exploration using various statistical properties of the dataset such as mean, variance. In addition, we perform feature engineering and selection using correlation, one-hot encoding. Finally, we use 3 models to fit the data and five predictions about the strength of passwords given its feature. We arrive at a model with around 80% accuracy on the test set. We could try ensembling the models to see if the performance could be improved but that is for another task.