Data Description

variable class description
rank double popularity in their database of released passwords
password character Actual text of the password
category character What category does the password fall in to?
value double Time to crack by online guessing
time_unit character Time unit to match with value
offline_crack_sec double Time to crack offline in seconds
rank_alt double Rank 2
strength double Strength = quality of password where 10 is highest, 1 is lowest, please note that these are relative to these generally bad passwords
font_size double Used to create the graphic for KIB

Read in Data

passwords <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-14/passwords.csv')
Parsed with column specification:
cols(
  rank = col_double(),
  password = col_character(),
  category = col_character(),
  value = col_double(),
  time_unit = col_character(),
  offline_crack_sec = col_double(),
  rank_alt = col_double(),
  strength = col_double(),
  font_size = col_double()
)
passwords <- dplyr::distinct(passwords, .keep_all = TRUE)

Here’s the first few rows of our dataset

# A tibble: 501 x 9
    rank password category value time_unit offline_crack_s~ rank_alt strength
   <dbl> <chr>    <chr>    <dbl> <chr>                <dbl>    <dbl>    <dbl>
 1     1 password passwor~  6.91 years          2.17               1        8
 2     2 123456   simple-~ 18.5  minutes        0.0000111          2        4
 3     3 12345678 simple-~  1.29 days           0.00111            3        4
 4     4 1234     simple-~ 11.1  seconds        0.000000111        4        4
 5     5 qwerty   simple-~  3.72 days           0.00321            5        8
 6     6 12345    simple-~  1.85 minutes        0.00000111         6        4
 7     7 dragon   animal    3.72 days           0.00321            7        8
 8     8 baseball sport     6.91 years          2.17               8        4
 9     9 football sport     6.91 years          2.17               9        7
10    10 letmein  passwor~  3.19 months         0.0835            10        8
# ... with 491 more rows, and 1 more variable: font_size <dbl>

We have 501 observations and 9 variables

Handling Missing Data

passwords %>% 
  select_if(function(x) any(is.na(x))) %>% 
  summarise_each(funs(sum(is.na(.))))
Warning: funs() is soft deprecated as of dplyr 0.8.0
Please use a list of either functions or lambdas: 

  # Simple named list: 
  list(mean = mean, median = median)

  # Auto named with `tibble::lst()`: 
  tibble::lst(mean, median)

  # Using lambdas
  list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
This warning is displayed once per session.
# A tibble: 1 x 9
   rank password category value time_unit offline_crack_s~ rank_alt strength
  <int>    <int>    <int> <int>     <int>            <int>    <int>    <int>
1     1        1        1     1         1                1        1        1
# ... with 1 more variable: font_size <int>

There are 1 missing values in each column. We’ll drop them.

passwords <- passwords %>% 
  na.omit(passwords)

dim(passwords)
[1] 500   9

Data Exploration

Numerical Data

We have 6 numeric variables in our dataset. However, our response variable, strength, signifies the level of strength of the passwords so it should be a categorical variable

passwords %>% 
  select_if(., is.numeric)
# A tibble: 500 x 6
    rank value offline_crack_sec rank_alt strength font_size
   <dbl> <dbl>             <dbl>    <dbl>    <dbl>     <dbl>
 1     1  6.91       2.17               1        8        11
 2     2 18.5        0.0000111          2        4         8
 3     3  1.29       0.00111            3        4         8
 4     4 11.1        0.000000111        4        4         8
 5     5  3.72       0.00321            5        8        11
 6     6  1.85       0.00000111         6        4         8
 7     7  3.72       0.00321            7        8        11
 8     8  6.91       2.17               8        4         8
 9     9  6.91       2.17               9        7        11
10    10  3.19       0.0835            10        8        11
# ... with 490 more rows
passwords %>% 
  sjmisc::descr()

## Basic descriptive statistics

               var    type             label   n NA.prc   mean     sd   se
              rank numeric              rank 500      0 250.50 144.48 6.46
             value numeric             value 500      0   5.60   8.44 0.38
 offline_crack_sec numeric offline_crack_sec 500      0   0.50   2.66 0.12
          rank_alt numeric          rank_alt 500      0 251.22 145.05 6.49
          strength numeric          strength 500      0   7.43   5.42 0.24
         font_size numeric         font_size 500      0  10.30   3.65 0.16
     md trimmed              range       iqr  skew
 250.50  250.50        499 (1-500) 249.50000  0.00
   3.72    4.23 90.98 (1.29-92.27)   0.29000  8.86
   0.00    0.08    29.27 (0-29.27)   0.08029 10.03
 251.50  251.16        501 (1-502) 250.50000  0.00
   7.00    7.06          48 (0-48)   2.00000  4.46
  11.00   10.61          28 (0-28)   1.00000  0.13

rank and rank_alt have similar descriptive values. It looks like they give us the same information. Offline_crack_sec have a skew value of 10.03 which shows that most passwords takes little time to be figured out. Below is the historgram of Offline_crack_sec which confirms the right-skewness of the variable. We’ll have to normalize or standardize numeric variables for linear models since they are susceptible to outliers. The same applies to value.

passwords %>% 
  ggplot2::ggplot(aes(x=offline_crack_sec)) +
  geom_histogram(bins=30)

passwords %>% 
  ggplot2::ggplot(aes(x=value)) +
  geom_histogram(bins=30)

The histogram of value also shows that this variable is right skewed.

Categorical data

There are 4 categorical variables

passwords %>% 
  select_if(., is.character)
# A tibble: 500 x 3
   password category            time_unit
   <chr>    <chr>               <chr>    
 1 password password-related    years    
 2 123456   simple-alphanumeric minutes  
 3 12345678 simple-alphanumeric days     
 4 1234     simple-alphanumeric seconds  
 5 qwerty   simple-alphanumeric days     
 6 12345    simple-alphanumeric minutes  
 7 dragon   animal              days     
 8 baseball sport               years    
 9 football sport               years    
10 letmein  password-related    months   
# ... with 490 more rows

Below are the number of unique values in each variable.

passwords %>%
  select_if(., is.character) %>% 
  dplyr::summarise_all(dplyr::funs(dplyr::n_distinct(.)))
# A tibble: 1 x 3
  password category time_unit
     <int>    <int>     <int>
1      500       10         7

We’ll look into the response variable, strength first.

passwords %>% 
  ggplot(aes(x=strength)) +
  geom_bar() +
  labs(title = 'Reponse Variable Distribution')

Inspecting the the dsitribution of the response variable, there are values greater than 10. The values that are outside this range are probably wrongly recorded or invalid. Let’s inspect what these values are. In addition, there is a value with less than 5 count. This means we cannot perform 5fold tratified cross validation. We’ll opt to use 5fold cv, instead.

passwords %>% 
  dplyr::filter(strength > 10)
# A tibble: 15 x 9
    rank password category value time_unit offline_crack_s~ rank_alt strength
   <dbl> <chr>    <chr>    <dbl> <chr>                <dbl>    <dbl>    <dbl>
 1    13 abc123   simple-~  3.7  weeks             0.0224         13       32
 2    26 trustno1 simple-~ 92.3  years            29.0            26       25
 3   149 ncc1701  nerdy-p~  2.56 years             0.806         149       46
 4   197 thx1138  nerdy-p~  2.56 years             0.806         198       46
 5   274 8675309  nerdy-p~  3.09 hours             0.000111      275       19
 6   321 bond007  nerdy-p~  2.56 years             0.806         322       38
 7   336 rush2112 nerdy-p~ 92.3  years            29.0           337       48
 8   344 red123   simple-~  3.7  weeks             0.0224        345       35
 9   359 ou812    nerdy-p~ 17.3  hours             0.000622      360       36
10   395 heka6w2  simple-~  2.56 years             0.806         396       36
11   406 jordan23 sport    92.3  years            29.3           407       34
12   407 eagle1   simple-~  3.7  weeks             0.0224        408       21
13   463 123abc   simple-~  3.7  weeks             0.0224        465       32
14   478 test123  simple-~  2.56 years             0.806         480       36
15   500 passw0rd passwor~ 92.3  years            29.0           502       28
# ... with 1 more variable: font_size <dbl>

There are 15 values that are outside the range of 1-10 and seem randomly created. I will remove these values.

passwords <- passwords %>% 
  dplyr::filter(strength <= 10)

Now we have only 485 observations in our data.

passwords %>% 
  ggplot(aes(x=category)) +
  geom_bar() +
  labs(title = 'Category Variable Distribution') +
  theme(axis.text.x = element_text(angle=45))

We have ten unique values in category. Most values are in the ‘name’ value.

Checking for Linear Correlation among Variables

The test of independence for categorical variables are adapted from this article. In particular, theils’ u was developed to measure the level of association between categorical variables.

library(reticulate)
use_condaenv("D:/Anaconda/envs/pythonwdask/python.exe")
from typing import List
import pandas as pd
import numpy as np
import math
import scipy.stats as ss
from collections import Counter

def cond_entropy(x,y, log_base = math.e):
  """Compute conditional entropy of two categorical variables """
  y_counter = Counter(y)
  xy_counter = Counter(list(zip(x,y)))
  total_occurrences = sum(y_counter.values())
  entropy = 0.0
  for xy in xy_counter.keys():
    p_xy = xy_counter[xy] / total_occurrences
    p_y = y_counter[xy[1]] / total_occurrences
    entropy += p_xy * math.log(p_y / p_xy, log_base)
  return entropy


def theils_u(x, y):
  """Compute association value of x w.r.t y (categorical variables)"""
  s_xy = cond_entropy(x, y)
  x_counter = Counter(x)
  total_occurences = sum(x_counter.values())
  p_x = list(map(lambda n: n / total_occurences, x_counter.values()))
  s_x = ss.entropy(p_x)
  if s_x == 0:
    return 1
  else:
    return (s_x - s_xy) / s_x
    
    
def correlation_ratio(cats, measurements):
  """Compute assoviation value between a categorical and a quantitative variable""" 
  fcat, _ = pd.factorize(cats)
  cat_num = np.max(fcat) + 1
  y_avg_array = np.zeros(cat_num)
  n_array = np.zeros(cat_num)
  for i in range(0, cat_num):
    cat_measures = measurements.iloc[np.argwhere(fcat == i).flatten(),]
    n_array[i] = len(cat_measures)
    y_avg_array[i] = np.average(cat_measures)
  y_total_avg = np.sum(np.multiply(y_avg_array, n_array)) / np.sum(n_array)
  numerator = np.sum(np.multiply(n_array, np.power(np.subtract(y_avg_array, y_total_avg), 2)))
  denominator = np.sum(np.power(np.subtract(measurements, y_total_avg),2))
  if numerator == 0:
    eta == 0.0
  else:
    eta = np.sqrt(numerator / denominator)
  return eta

def corr_mat(df: pd.DataFrame, categories: List[str], measurements: List[str]) -> np.array:
  """Compute correlation matrix of variable in a dataset"""
  idx_id = 0.0
  id_idx = 0.0
  columns = categories + measurements
  rows = cols = len(columns)
  corr_matrix = np.zeros((rows, cols))
  corr_matrix = pd.DataFrame(corr_matrix, index=columns, columns=columns)
  for idx, var in enumerate(columns):
    for id, var_corr in enumerate(columns):
      if var == var_corr:
        idx_id = id_idx = 1.0
      else:
        if var in categories:
          if var_corr in categories:
            idx_id = theils_u(df[var], df[var_corr])
            id_idx = theils_u(df[var_corr], df[var])
          else:
            idx_id = correlation_ratio(df[var], df[var_corr])
            id_idx = idx_id
        else:
          if var_corr in categories:
            idx_id = correlation_ratio(df[var_corr], df[var])
            id_idx = idx_id
          else:
            idx_id, _ = ss.pearsonr(df[var], df[var_corr])
            id_idx = idx_id
      corr_matrix.loc[var, var_corr] = idx_id if not np.isnan(idx_id) and abs(idx_id) < np.inf else 0.0
      corr_matrix.loc[var_corr, var] = id_idx if not np.isnan(id_idx) and abs(id_idx) < np.inf else 0.0
  
  return corr_matrix
df = r.passwords
cat_vars = ["password", "category", "strength", "time_unit"]
num_vars = ["rank", "offline_crack_sec", "font_size", "rank_alt", "value"]
corr_matriex = corr_mat(df, cat_vars, num_vars)
import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(corr_matriex, annot=True, cbar_kws= {'orientation': 'horizontal'})
plt.show()

plt.tight_layout()

Note: Password has perfect correlation with most variables. We should remove it in our model. Font_size and category have a pretty strong linear relationship w each other.

Rank and Rank_Alt are perfectly linear( shown by the plot and the correlation of 1). They are likely to carry similar information. We’ll discard one of them. Strength (our response variable) has a strong correlation with font_size. I suspect this variable will be important in predicting the strength of the passwords. time_unit and value have strong correlation with each other. It makes sense since the description says time_unit specifies the time unit of value. We’ll remove one of these varaibles. Since value and offline_crack_sec indicate the amount of time it takes to guess the passwords online and offline, respectively. They give similar information. Thus, we can remove value.

pp_passwords <- passwords %>% 
  select(-value, -rank_alt, -password)

pp_passwords
# A tibble: 485 x 6
    rank category            time_unit offline_crack_sec strength font_size
   <dbl> <chr>               <chr>                 <dbl>    <dbl>     <dbl>
 1     1 password-related    years           2.17               8        11
 2     2 simple-alphanumeric minutes         0.0000111          4         8
 3     3 simple-alphanumeric days            0.00111            4         8
 4     4 simple-alphanumeric seconds         0.000000111        4         8
 5     5 simple-alphanumeric days            0.00321            8        11
 6     6 simple-alphanumeric minutes         0.00000111         4         8
 7     7 animal              days            0.00321            8        11
 8     8 sport               years           2.17               4         8
 9     9 sport               years           2.17               7        11
10    10 password-related    months          0.0835             8        11
# ... with 475 more rows

Now there are only 6 variables in our dataset

pp_passwords <- pp_passwords %>% 
  dplyr::mutate_if(is.character, factor, ordered = FALSE) %>% 
  dplyr::mutate(strength = factor(strength))

# passwords <- passwords %>% 
#   dplyr::select(-password) %>% 
#   dplyr::mutate_if(is.character, factor, ordered = FALSE) %>% 
#   dplyr::mutate(strength = factor(strength))

Modeling

Split Data into Train and Test Sets

In order to make sure we don’t overfit our model, we’ll split our data into a train/test set with a 70/30 ratio. Cross-validation will be performed on the train set. The test set is preserved as the final check of the model’s performance. In addition, since we’ll be using some linear models, numerical data will be standardized. The tree-based models only accept numerical data. Hence, categorical variable will be one-hot encoded, except for the response variable. Last but not least, features with zero variance don’t contribute to the predictive abilities of the model. As a result, we’ll remove them.

# set seed for replication
set.seed(123)

# Stratified Kfold on the response variable
pp_split <-rsample::initial_split(pp_passwords, prop = 0.7,
                               strata = 'strength')
split <- rsample::initial_split(passwords, prop = 0.7,
                                strata = 'strength')
# Get train set
pp_train <- rsample::training(pp_split)
train <- rsample::training(split)

# Get test set
pp_test <- rsample::testing(pp_split)
test <- rsample::testing(split)

# Preparing a recipe for preprocessing
bluebrint <- recipes::recipe(strength ~ ., data = pp_passwords) %>% 
  recipes::step_nzv(all_nominal()) %>%  # Removing zero variance features
  recipes::step_center(all_numeric(), -all_outcomes()) %>% 
  recipes::step_scale(all_numeric(), -all_outcomes()) %>% # These two steps standardize the numeric variables
  recipes::step_dummy(matches("categ|time"))# lastly, we'll one-hot encode the categorical variables

# Preprocessing the data
prepare <- recipes::prep(bluebrint, training = pp_train)

xgb_train <- recipes::bake(prepare, new_data = pp_train)
xgb_test <- recipes::bake(prepare, new_data = pp_test)
print('The dimension of the train data:')
[1] "The dimension of the train data:"
print(dim(xgb_train))
[1] 341  19
print('The dimension of the test data:')
[1] "The dimension of the test data:"
dim(xgb_test)
[1] 144  19

We’ll perform 5-fold cross validation on the train set using the mlr package

ppclf_task <- mlr::makeClassifTask(data = xgb_train, target = 'strength')
Warning in makeTask(type = type, data = data, weights = weights, blocking =
blocking, : Provided data is not a pure data.frame but from class tbl_df, hence
it will be converted.
pprdesc <- mlr::makeResampleDesc("CV", iters = 5, stratify = FALSE)

We’ll fit the data with decision trees, KNN, and xgboost.

learnerCART=mlr::makeLearner(id="CART","classif.rpart", predict.type = "prob")

learnerXGBM=mlr::makeLearner(id="XGBM","classif.xgboost", predict.type = "prob")

learnerKNN=mlr::makeLearner(id="KNN","classif.knn")
set.seed(123)
learners <- list(learnerKNN, learnerCART, learnerXGBM)
bmr <- mlr::benchmark(learners, ppclf_task, pprdesc, measures=list(acc), models = TRUE, show.info = FALSE)
bmr
    task.id learner.id acc.test.mean
1 xgb_train        KNN     0.5806905
2 xgb_train       CART     0.7478261
3 xgb_train       XGBM     0.7974851

Performance Plot

plotBMRBoxplots(bmr, measure = acc, order.lrn = getBMRLearnerIds(bmr), style = "violin", pretty.names = FALSE) +
  aes(color = learner.id) +
  theme(strip.text.x = element_text(size = 8))
Warning: `fun.y` is deprecated. Use `fun` instead.
Warning: `fun.ymin` is deprecated. Use `fun.min` instead.
Warning: `fun.ymax` is deprecated. Use `fun.max` instead.

KNN perform bad for this dataset. While decision trees and xgboost perform, on average, similarly well. The best model is XGBM. We’ll tune this model for optimal performance.

Tuning

Learning Rate

To get the optimal model, we’ll need to tune the learning rate to get a stable model. Instead of using gridsearch, we’ll opt for randomsearch as it is more efficient when tuning a large set of values.

set.seed(123)

# Defining task
ppclf_task1 <- mlr::makeClassifTask(data = xgb_train, target = 'strength')
Warning in makeTask(type = type, data = data, weights = weights, blocking =
blocking, : Provided data is not a pure data.frame but from class tbl_df, hence
it will be converted.
# Set of learning rate values for tuning
learning_rate <- ParamHelpers::makeParamSet(
  ParamHelpers::makeNumericParam("eta", lower = -0.3, upper = 0.3)
)

# Using random search
ctrl <- mlr::makeTuneControlRandom(maxit=200L)

# Tuning
res <- mlr::tuneParams("classif.xgboost", task = ppclf_task1, resampling = pprdesc, par.set = learning_rate, control = ctrl, measures = acc, show.info = FALSE)

opt_eta <- res$x$eta
print(paste("The optimal learning rate is:", round(opt_eta, 3)))
[1] "The optimal learning rate is: 0.075"

Number of Trees

Once we have derived an optimal learning rate, we’ll tune the optimal number of trees for the model. Instead of using random search to tune out trees, we’ll tune in a specific set of values and narrow down to find the optimal number of trees.

set.seed(123)

opt_eta <- res$x$eta

# Set of learning rate values for tuning
ntrees <- ParamHelpers::makeParamSet(
  ParamHelpers::makeDiscreteParam("nrounds", values = c(100, 500, 1000)) # start with 3 widely different values
)

# Using grid search
ctrl <- mlr::makeTuneControlGrid()

# Setting the optimal learning rate
eta_lnr <- mlr::setHyperPars(makeLearner("classif.xgboost"), eta = opt_eta)

# Tuning
res1 <- mlr::tuneParams(eta_lnr, task = ppclf_task1, resampling = pprdesc, par.set = ntrees, control = ctrl, measures = acc, show.info = FALSE)
print(paste('The best one out of 3 is :', res1$x$nrounds))
[1] "The best one out of 3 is : 100"

Now we’ll zoom in the best values out of three.

[1] "The best one out of 3 is : 50"

Continue to zoom in

set.seed(123)

opt_eta <- res$x$eta
learning_rate <- ParamHelpers::makeParamSet(
  ParamHelpers::makeDiscreteParam("nrounds", values = c(25, 50, 75))
)
ctrl <- mlr::makeTuneControlGrid()
eta_lnr <- mlr::setHyperPars(makeLearner("classif.xgboost"), eta = opt_eta)
res1 <- mlr::tuneParams(eta_lnr, task = ppclf_task1, resampling = pprdesc, par.set = learning_rate, control = ctrl, measures = acc, show.info = FALSE)

print(paste('The best one out of 3 is :', res1$x$nrounds))
[1] "The best one out of 3 is : 50"
set.seed(123)

opt_eta <- res$x$eta
learning_rate <- ParamHelpers::makeParamSet(
  ParamHelpers::makeIntegerParam("nrounds", lower = 60, upper = 80)
)
ctrl <- mlr::makeTuneControlGrid()
eta_lnr <- mlr::setHyperPars(makeLearner("classif.xgboost"), eta = opt_eta)
res1 <- mlr::tuneParams(eta_lnr, task = ppclf_task1, resampling = pprdesc, par.set = learning_rate, control = ctrl, measures = acc, show.info = FALSE)

print(paste('The best one out of 3 is :', res1$x$nrounds))
[1] "The best one out of 3 is : 73"

Tuning Tree-specific Hyperparameters

opt_nrounds <- res1$x$nrounds
treeHype <- ParamHelpers::makeParamSet(
  ParamHelpers::makeIntegerParam("max_depth", lower = 3, upper = 10),
  ParamHelpers::makeNumericParam("subsample", lower = 0.5, upper = 1),
  ParamHelpers::makeNumericParam("min_child_weight", lower = 1, upper = 15),
  ParamHelpers::makeNumericParam("colsample_bytree", lower = 0.5, upper = 1)
)
ctrl <- mlr::makeTuneControlRandom()
eta_lnr <- mlr::setHyperPars(makeLearner("classif.xgboost"), eta = opt_eta, nrounds = opt_nrounds)
res1 <- mlr::tuneParams(eta_lnr, task = ppclf_task1, resampling = pprdesc, par.set = treeHype, control = ctrl, measures = acc, show.info = FALSE)

Below are the optimal tree hyper parameters

res1
Tune result:
Op. pars: max_depth=9; subsample=0.842; min_child_weight=1.42; colsample_bytree=0.81
acc.test.mean=0.7916454

Now we’ll use these derived paramters to train a new xgboost model and evaluate its performance on the test set

set.seed(123)
opt_maxdepth <- res1$x$max_depth
opt_subsample <- res1$x$subsample
opt_minchild <- res1$x$min_child_weight
opt_colsample <- res1$x$colsample_bytree

opt_lrn <- setHyperPars(makeLearner("classif.xgboost"), max_depth = opt_maxdepth,
                        subsample = opt_subsample, min_child_weight = opt_minchild,
                        colsample_bytree = opt_colsample)

m = train(opt_lrn, ppclf_task1)

Evaluating the final model

train_predictions <- predict(m, newdata = xgb_train)$data
Warning in predict.WrappedModel(m, newdata = xgb_train): Provided data for
prediction is not a pure data.frame but from class tbl_df, hence it will be
converted.
test_predictions <- predict(m, newdata = xgb_test)$data
Warning in predict.WrappedModel(m, newdata = xgb_test): Provided data for
prediction is not a pure data.frame but from class tbl_df, hence it will be
converted.
# predictions %>% 
#   dplyr::filter(truth == response) %>% 
#   dplyr::summarise(count = n())
print('The train accuracy is')
[1] "The train accuracy is"
mean(train_predictions$truth == train_predictions$response)
[1] 0.7771261
print('The test accuracy is')
[1] "The test accuracy is"
mean(test_predictions$truth == test_predictions$response)
[1] 0.7430556

Our final model is good. it’s not overfitting and the accuracy is around 80%. Let’s check the confusion matrix

mlr::calculateConfusionMatrix(predict(m, newdata = xgb_test))
Warning in predict.WrappedModel(m, newdata = xgb_test): Provided data for
prediction is not a pure data.frame but from class tbl_df, hence it will be
converted.
        predicted
true     0 1 2 3  4 5  6  7  8  9 10 -err.-
  0      8 3 0 0  0 0  0  0  3  0  0      6
  1      0 0 0 0  0 0  0  0  0  0  0      0
  2      0 0 0 0  0 0  0  0  0  0  0      0
  3      0 0 0 0  0 0  0  0  0  0  0      0
  4      0 0 0 0 13 0  0  0  0  0  0      0
  5      0 0 0 0  0 4  0  0  5  0  0      5
  6      0 0 0 0  0 0 21  0  0  0  0      0
  7      0 0 0 0  0 0  0 10 17  0  0     17
  8      0 0 0 0  0 0  0  7 36  0  0      7
  9      0 0 0 0  0 0  0  0  0 15  0      0
  10     0 0 0 0  0 0  0  0  0  2  0      2
  -err.- 0 3 0 0  0 0  0  7 25  2  0     37

Overall our model performs well. We can see that it incorrectly predicts 18 cases of strength lv 7 to be 8. It makese sense since 7 is very close to 8 and 8 makes up the majority of values in the strength variable.

# Saving model for futher reuse
saveRDS(m, "./final_models.rds")
model <- readRDS("./final_models.rds")

Conclusion

In this notebook, we perform data exploration using various statistical properties of the dataset such as mean, variance. In addition, we perform feature engineering and selection using correlation, one-hot encoding. Finally, we use 3 models to fit the data and five predictions about the strength of passwords given its feature. We arrive at a model with around 80% accuracy on the test set. We could try ensembling the models to see if the performance could be improved but that is for another task.

References

  1. https://towardsdatascience.com/the-search-for-categorical-correlation-a1cf7f1888c9
  2. https://www.statisticssolutions.com/using-chi-square-statistic-in-research/
  3. De, Muth J. E. Basic Statistics and Pharmaceutical Statistical Applications, Third Edition. Hoboken: CRC Press, 2014. Internet resource.