Password Theming by Popularity and its Relation to Security

Overview

In a dataset of commonly used passwords, which categories are the most prominent?

Introduction

Common themes appear when people create passwords, and the popular passwords in this dataset are sorted into 10 categories. Exploring how frequent each theme is may shed some light on what to avoid when creating a password that is unique and secure.

The data comes from Information is Beautiful and was compiled from a wide variety of leaked passwords found on the internet. It was shared on TidyTuesday at https://github.com/rfordatascience/tidytuesday/blob/main/data/2020/2020-01-14/readme.md.

Exploring the Data

# Load libraries
library(rmarkdown)
library(knitr)
library(lattice)

# Load passwords dataset
passwords <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2020/2020-01-14/passwords.csv', show_col_types = FALSE)

# Remove the empty trailing rows (rows 501 to 507), leaving 500 passwords
passwords <- passwords[-c(501:507), ]

# Preview of data
head(passwords, n = 10)

## # A tibble: 10 × 9
##     rank password category   value time_unit offline_crack_sec rank_alt strength
##    <dbl> <chr>    <chr>      <dbl> <chr>                 <dbl>    <dbl>    <dbl>
##  1     1 password password-…  6.91 years           2.17               1        8
##  2     2 123456   simple-al… 18.5  minutes         0.0000111          2        4
##  3     3 12345678 simple-al…  1.29 days            0.00111            3        4
##  4     4 1234     simple-al… 11.1  seconds         0.000000111        4        4
##  5     5 qwerty   simple-al…  3.72 days            0.00321            5        8
##  6     6 12345    simple-al…  1.85 minutes         0.00000111         6        4
##  7     7 dragon   animal      3.72 days            0.00321            7        8
##  8     8 baseball sport       6.91 years           2.17               8        4
##  9     9 football sport       6.91 years           2.17               9        7
## 10    10 letmein  password-…  3.19 months          0.0835            10        8
## # ℹ 1 more variable: font_size <dbl>

# Tables of category variable
table(passwords$category)

## 
##              animal          cool-macho              fluffy                food 
##                  29                  79                  44                  11 
##                name           nerdy-pop    password-related     rebellious-rude 
##                 183                  30                  15                  11 
## simple-alphanumeric               sport 
##                  61                  37

proportions(table(passwords$category))

## 
##              animal          cool-macho              fluffy                food 
##               0.058               0.158               0.088               0.022 
##                name           nerdy-pop    password-related     rebellious-rude 
##               0.366               0.060               0.030               0.022 
## simple-alphanumeric               sport 
##               0.122               0.074

# Summary of offline_crack_sec variable
summary(passwords$offline_crack_sec)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 1.000e-07 3.210e-03 3.210e-03 5.000e-01 8.350e-02 2.927e+01

# Bar chart with category variable
barchart(passwords$category)

Name is the most frequent category by far, with 183 of the 500 passwords (36.6%), while food and rebellious-rude are the least frequent, with 11 passwords (2.2%) each.
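The ranking of the categories is easier to read when the counts are sorted. The following is a minimal sketch that hard-codes the counts printed by table(passwords$category) above, so it runs without downloading the data; the variable names category_counts, sorted_counts, and sorted_percent are illustrative and not part of the original analysis.

```r
# Category counts copied from the table(passwords$category) output above
category_counts <- c(
  "animal" = 29, "cool-macho" = 79, "fluffy" = 44, "food" = 11,
  "name" = 183, "nerdy-pop" = 30, "password-related" = 15,
  "rebellious-rude" = 11, "simple-alphanumeric" = 61, "sport" = 37
)

# Sort from most to least frequent
sorted_counts <- sort(category_counts, decreasing = TRUE)

# Convert counts to percentages of all passwords in the table
sorted_percent <- round(100 * sorted_counts / sum(sorted_counts), 1)

print(sorted_percent)
```

With the full dataset loaded, sort(table(passwords$category), decreasing = TRUE) should give the same ordering directly.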

Analysis

Of the 10 categories, two of interest are name and simple-alphanumeric. Both are broad categories that are prevalent in the sample and seem to be common choices for passwords, so it is worth examining whether the difference between them holds in the true population and how it might relate to password security.

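I will examine this through a two-sided hypothesis test comparing two proportions at a significance level of \(\alpha = 0.05\), where \(p_N\) is the population proportion of passwords in the name category and \(p_S\) is the population proportion in the simple-alphanumeric category.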

\(H_0: p_N = p_S\)

\(H_1: p_N \neq p_S\)

# Store sample statistics
table(passwords$category)

## 
##              animal          cool-macho              fluffy                food 
##                  29                  79                  44                  11 
##                name           nerdy-pop    password-related     rebellious-rude 
##                 183                  30                  15                  11 
## simple-alphanumeric               sport 
##                  61                  37

x_pass <- c(183, 61)
n_pass <- c(500, 500)

# Two sample prop.test
prop.test(x = x_pass, n = n_pass, alternative = "two.sided", correct = FALSE)

## 
##  2-sample test for equality of proportions without continuity correction
## 
## data:  x_pass out of n_pass
## X-squared = 80.688, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.1929536 0.2950464
## sample estimates:
## prop 1 prop 2 
##  0.366  0.122

The test returns a p-value smaller than 2.2e-16, and the 95 percent confidence interval for \(p_N - p_S\) runs from about 0.193 to 0.295.
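To show where the reported numbers come from, the test statistic and confidence interval can be recomputed by hand. This is a sketch of the standard pooled two-proportion calculation, which is what prop.test() without continuity correction is expected to reproduce; the variable names are illustrative.

```r
# Sample counts taken from the category table above
x_name   <- 183   # passwords in the name category
x_simple <- 61    # passwords in the simple-alphanumeric category
n        <- 500   # passwords in the cleaned dataset

p_name   <- x_name / n
p_simple <- x_simple / n

# Pooled proportion under H0: p_N = p_S
p_pooled <- (x_name + x_simple) / (2 * n)

# Standard error of the difference under H0, and the z statistic
se_pooled <- sqrt(p_pooled * (1 - p_pooled) * (2 / n))
z <- (p_name - p_simple) / se_pooled

# z^2 corresponds to the X-squared value in the prop.test() output
x_squared <- z^2

# Two-sided p-value from the chi-squared distribution with 1 df
p_value <- pchisq(x_squared, df = 1, lower.tail = FALSE)

# The 95% confidence interval uses the unpooled standard error
se_unpooled <- sqrt(p_name * (1 - p_name) / n + p_simple * (1 - p_simple) / n)
ci <- (p_name - p_simple) + c(-1, 1) * qnorm(0.975) * se_unpooled

print(c(z = z, x_squared = x_squared))
print(ci)
```

The squared z statistic should agree with X-squared = 80.688, and the interval with the one printed above, up to rounding.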

Conclusions

Because the p-value is below 2.2e-16, which is far smaller than \(\alpha = 0.05\), I reject the null hypothesis.

There is overwhelming evidence of a difference between the proportion of passwords involving names and the proportion involving simple alphanumeric strings. More specifically, names appear to be a much more common subject for passwords.

Given how prevalent names are, one takeaway could be that using your name as a password is a bad idea. If a large share of people do it, those passwords become easy to guess as long as the owner’s name is known.
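As a cross-check on rejecting the null hypothesis, note that both counts come from the same 500 passwords, whereas a two-group prop.test() treats them as two separate samples of 500. One alternative, which is an assumption of this sketch and not part of the original analysis, is to look only at the 244 passwords in either category and ask whether they split evenly, using a chi-squared goodness-of-fit test.

```r
# Counts from the category table; both come from the same 500 passwords
counts <- c(name = 183, simple_alphanumeric = 61)

# Goodness-of-fit test: under H0 the 244 passwords split evenly (p = 0.5 each)
gof <- chisq.test(counts, p = c(0.5, 0.5))

print(gof$statistic)
print(gof$p.value < 0.05)
```

Under these assumptions the decision is the same: the even-split hypothesis is rejected at the 0.05 level.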

Limitations

There are multiple major elements to this dataset that limit analysis.

First, the data is ranked by frequency from various real password leaks. This means that it is not a random sample and cannot represent a larger population, because less commonly used passwords could reasonably follow different trends.

Second, despite being ranked by frequency, the data does not record how many times each password appeared, so each password can only be treated as a single observation.

Third, because these are common passwords, and especially because they were found leaked on the internet, they are consistently bad passwords and are only really representative of that group.

Another thing I noticed in my research is that profanity and other adult terms were omitted from the list, which makes it a less accurate reflection of the true population.

All of these limitations affect the analysis I conducted. There is no way of knowing from this data how many times each category showed up, because each password only counts as one appearance. Although names seem to be overwhelmingly more common, simple alphanumeric passwords make up a good chunk of the top entries, which could change the results if frequency could be accounted for. Overall, this analysis is not very reliable for predicting details about a greater population of passwords.

This list is also a rather one-sided look at password security. As stated, these are not good passwords and can only say so much on their own about what makes a password strong or weak. For example, these bad passwords are all sorted into categories, but ideally a password would not fall so easily into a theme. A strong password has fewer patterns that could lead to easier guessing and is therefore more “random”.

Randomness, or more accurately entropy, is the biggest factor in a strong password. Entropy describes the amount of uncertainty in a given variable. Basically, each additional character multiplies the number of possible passwords, so the time it would take to guess one grows exponentially with length, and more possibilities per character make that growth faster. In a password with only lowercase letters, each character has 26 possibilities. Adding other character types like uppercase letters, numbers, or symbols increases this, as does simply making the password longer. However, this can be undermined by patterns like common words or details that would be obvious first guesses. What this data suggests is that a person’s name would be a very good first guess before attempting brute force.
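The effect of length and character set on entropy can be made concrete with a small calculation. This is a minimal sketch that assumes every character is chosen uniformly and independently at random, which human-chosen passwords like the ones in this dataset do not satisfy; the helper password_entropy_bits and the figure of 94 printable ASCII symbols (excluding the space) are assumptions for illustration.

```r
# Entropy in bits of a password whose characters are each drawn
# uniformly and independently from an alphabet of `alphabet_size` symbols.
# (Hypothetical helper for illustration; not part of the original analysis.)
password_entropy_bits <- function(n_chars, alphabet_size) {
  n_chars * log2(alphabet_size)
}

# 8 lowercase letters: 26 possibilities per character
lower_8 <- password_entropy_bits(8, 26)

# 8 characters from the 94 printable ASCII symbols (excluding space)
mixed_8 <- password_entropy_bits(8, 94)

# 12 lowercase letters: longer password, same alphabet
lower_12 <- password_entropy_bits(12, 26)

print(round(c(lower_8 = lower_8, mixed_8 = mixed_8, lower_12 = lower_12), 1))
```

Under these assumptions, 12 random lowercase letters carry more entropy than 8 random characters from the larger set, which illustrates why length matters as much as character variety. A name or common word has far less entropy than this formula suggests, because an attacker would try it long before resorting to brute force.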

In other words, the only real trend that could be found in good passwords is that they share little in common with each other and each has a high amount of entropy, while analyzing a dataset such as this one misses much of the larger picture.

This document was produced as a final project for MAT 143H - Introduction to Statistics (Honors) at North Shore Community College.
The course was led by Professor Billy Jackson.
Student Name: Benjamin Rasmussen
Semester: Spring 2026