Loading Libraries

library(readxl)
library(ggplot2)

Bad Passwords

Research question: Does the length of a password affect its strength? The dataset comes from Kaggle.com, which describes it as follows: “This dataset provides columns with different features of different passwords that lots of people choose.”

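Each column gives different information about each of the passwords studied. The columns I am interested in are the password itself (for its length), its strength, its value (the time to crack it by online guessing), and its category.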

passwords <- read.csv("https://raw.githubusercontent.com/SalouaDaouki/Data605/main/passwords.csv")

Exploring the data

Glimpse of the data:

head(passwords)
##   rank password            category value time_unit offline_crack_sec rank_alt
## 1    1 password    password-related  6.91     years          2.17e+00        1
## 2    2   123456 simple-alphanumeric 18.52   minutes          1.11e-05        2
## 3    3 12345678 simple-alphanumeric  1.29      days          1.11e-03        3
## 4    4     1234 simple-alphanumeric 11.11   seconds          1.11e-07        4
## 5    5   qwerty simple-alphanumeric  3.72      days          3.21e-03        5
## 6    6    12345 simple-alphanumeric  1.85   minutes          1.11e-06        6
##   strength font_size
## 1        8        11
## 2        4         8
## 3        4         8
## 4        4         8
## 5        8        11
## 6        4         8
tail(passwords)
##     rank password category value time_unit offline_crack_sec rank_alt strength
## 502   NA     <NA>     <NA>    NA      <NA>                NA       NA       NA
## 503   NA     <NA>     <NA>    NA      <NA>                NA       NA       NA
## 504   NA     <NA>     <NA>    NA      <NA>                NA       NA       NA
## 505   NA     <NA>     <NA>    NA      <NA>                NA       NA       NA
## 506   NA     <NA>     <NA>    NA      <NA>                NA       NA       NA
## 507   NA     <NA>     <NA>    NA      <NA>                NA       NA       NA
##     font_size
## 502        NA
## 503        NA
## 504        NA
## 505        NA
## 506        NA
## 507        NA
dim(passwords)
## [1] 507   9
summary(passwords)
##       rank         password           category             value       
##  Min.   :  1.0   Length:507         Length:507         Min.   : 1.290  
##  1st Qu.:125.8   Class :character   Class :character   1st Qu.: 3.430  
##  Median :250.5   Mode  :character   Mode  :character   Median : 3.720  
##  Mean   :250.5                                         Mean   : 5.603  
##  3rd Qu.:375.2                                         3rd Qu.: 3.720  
##  Max.   :500.0                                         Max.   :92.270  
##  NA's   :7                                             NA's   :7       
##   time_unit         offline_crack_sec     rank_alt        strength     
##  Length:507         Min.   : 0.00000   Min.   :  1.0   Min.   : 0.000  
##  Class :character   1st Qu.: 0.00321   1st Qu.:125.8   1st Qu.: 6.000  
##  Mode  :character   Median : 0.00321   Median :251.5   Median : 7.000  
##                     Mean   : 0.50001   Mean   :251.2   Mean   : 7.432  
##                     3rd Qu.: 0.08350   3rd Qu.:376.2   3rd Qu.: 8.000  
##                     Max.   :29.27000   Max.   :502.0   Max.   :48.000  
##                     NA's   :7          NA's   :7       NA's   :7       
##    font_size   
##  Min.   : 0.0  
##  1st Qu.:10.0  
##  Median :11.0  
##  Mean   :10.3  
##  3rd Qu.:11.0  
##  Max.   :28.0  
##  NA's   :7

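Based on the summary above, columns 1, 4, 6, 7, 8, and 9 hold numerical data. The str() output below confirms that they are already stored as integer or numeric rather than character, so the as.numeric() calls in the cleaning step below only convert the integer columns to double and leave the values unchanged. The summary also shows 7 missing values in each numeric column, which need to be removed before visualizing the data.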

str(passwords)
## 'data.frame':    507 obs. of  9 variables:
##  $ rank             : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ password         : chr  "password" "123456" "12345678" "1234" ...
##  $ category         : chr  "password-related" "simple-alphanumeric" "simple-alphanumeric" "simple-alphanumeric" ...
##  $ value            : num  6.91 18.52 1.29 11.11 3.72 ...
##  $ time_unit        : chr  "years" "minutes" "days" "seconds" ...
##  $ offline_crack_sec: num  2.17 1.11e-05 1.11e-03 1.11e-07 3.21e-03 1.11e-06 3.21e-03 2.17 2.17 8.35e-02 ...
##  $ rank_alt         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ strength         : int  8 4 4 4 8 4 8 4 7 8 ...
##  $ font_size        : int  11 8 8 8 11 8 11 8 11 11 ...
class(passwords)
## [1] "data.frame"
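As a compact complement to summary() and str(), the storage type and the number of missing values can be listed per column. The sketch below is only illustrative: pw_sample is a hypothetical name, its first six rows are copied from the head(passwords) output above, and the final all-NA row is an assumption added to imitate the empty rows seen in tail(passwords).

```r
# Six rows copied from the head(passwords) output, plus one all-NA row
# (an assumption) standing in for the empty rows at the end of the file.
pw_sample <- data.frame(
  rank     = c(1L, 2L, 3L, 4L, 5L, 6L, NA),
  password = c("password", "123456", "12345678", "1234", "qwerty", "12345", NA),
  value    = c(6.91, 18.52, 1.29, 11.11, 3.72, 1.85, NA),
  strength = c(8L, 4L, 4L, 4L, 8L, 4L, NA),
  stringsAsFactors = FALSE
)

# Storage type of every column in a single call
col_types <- sapply(pw_sample, class)
print(col_types)

# Missing values counted per column instead of one grand total
na_per_col <- colSums(is.na(pw_sample))
print(na_per_col)
```

On the full data the same two calls would be applied to passwords directly. A per-column count makes it easier to see whether the missing values sit in a few whole rows or are scattered across the table.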

Cleaning the data:

# Check for missing values
sum(is.na(passwords))
## [1] 63
# Remove rows with any missing values
passwords <- na.omit(passwords)
# Coerce the numeric columns to type numeric (they are already int/num, so the values are unchanged)
passwords$rank <- as.numeric(passwords$rank)
passwords$value <- as.numeric(passwords$value)
passwords$offline_crack_sec <- as.numeric(passwords$offline_crack_sec)
passwords$rank_alt <- as.numeric(passwords$rank_alt)
passwords$strength <- as.numeric(passwords$strength)
passwords$font_size <- as.numeric(passwords$font_size)
summary(passwords)
##       rank         password           category             value       
##  Min.   :  1.0   Length:500         Length:500         Min.   : 1.290  
##  1st Qu.:125.8   Class :character   Class :character   1st Qu.: 3.430  
##  Median :250.5   Mode  :character   Mode  :character   Median : 3.720  
##  Mean   :250.5                                         Mean   : 5.603  
##  3rd Qu.:375.2                                         3rd Qu.: 3.720  
##  Max.   :500.0                                         Max.   :92.270  
##   time_unit         offline_crack_sec     rank_alt        strength     
##  Length:500         Min.   : 0.00000   Min.   :  1.0   Min.   : 0.000  
##  Class :character   1st Qu.: 0.00321   1st Qu.:125.8   1st Qu.: 6.000  
##  Mode  :character   Median : 0.00321   Median :251.5   Median : 7.000  
##                     Mean   : 0.50001   Mean   :251.2   Mean   : 7.432  
##                     3rd Qu.: 0.08350   3rd Qu.:376.2   3rd Qu.: 8.000  
##                     Max.   :29.27000   Max.   :502.0   Max.   :48.000  
##    font_size   
##  Min.   : 0.0  
##  1st Qu.:10.0  
##  Median :11.0  
##  Mean   :10.3  
##  3rd Qu.:11.0  
##  Max.   :28.0
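# Add a new column with the count of characters in the 'password' column, placed right after it
# (slicing columns 1:2 as a data frame keeps the original column names, including 'rank')
passwords <- cbind(passwords[, 1:2], char_count = nchar(passwords$password), passwords[, -(1:2)])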
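The cleaning steps can be tried on a small sample to see their effect. This is a minimal sketch, not the code used for the report: pw_sample and pw_clean are hypothetical names, the six complete rows come from the head(passwords) output, and the all-NA row is an assumption that mimics the empty rows at the end of the file.

```r
# Small sample: six real rows from head(passwords) and one empty row (assumed)
pw_sample <- data.frame(
  rank     = c(1L, 2L, 3L, 4L, 5L, 6L, NA),
  password = c("password", "123456", "12345678", "1234", "qwerty", "12345", NA),
  strength = c(8L, 4L, 4L, 4L, 8L, 4L, NA),
  stringsAsFactors = FALSE
)

# Drop every row that contains at least one missing value
pw_clean <- na.omit(pw_sample)

# Password length in characters
pw_clean$char_count <- nchar(pw_clean$password)

# Reorder by name so that char_count sits directly after password
pw_clean <- pw_clean[, c("rank", "password", "char_count", "strength")]
print(pw_clean)
```

Selecting columns by name, as in the last step, is one way to control the position of the new column without relying on numeric column indices.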

Visualizing the data:

# Scatter plot of 'length' vs 'strength'
plot(passwords$char_count, passwords$strength, 
     xlab = "Length", ylab = "Strength", 
     main = "Scatter plot of Length vs Strength")

# Barplot of 'category'
barplot(table(passwords$category), 
        xlab = "Category", ylab = "Frequency", 
        main = "Barplot of Category")

# Create a scatter plot with category on x-axis and strength on y-axis using ggplot2
ggplot(passwords, aes(x = category, y = strength)) +
  geom_point() +
  labs(x = "Category", y = "Strength", title = "Scatter plot of Category vs Strength") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
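Because category is categorical, many points in the plot above overlap, so a numeric summary per category can help when reading it. The following is a small sketch on the six rows shown in the head(passwords) output; med_by_cat and n_by_cat are hypothetical names, and on the full data the same calls would use passwords$strength and passwords$category.

```r
# Strength and category for the six rows printed by head(passwords)
strength <- c(8, 4, 4, 4, 8, 4)
category <- c("password-related", rep("simple-alphanumeric", 5))

# Median strength within each category
med_by_cat <- tapply(strength, category, median)
print(med_by_cat)

# Number of passwords in each category (the counts behind the barplot)
n_by_cat <- table(category)
print(n_by_cat)
```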

# Calculate correlation coefficients
correlation_matrix <- cor(passwords[, c("char_count", "value", "strength")])

# Print correlation matrix
print(correlation_matrix)
##            char_count      value  strength
## char_count 1.00000000 0.08344091 0.2619824
## value      0.08344091 1.00000000 0.3268753
## strength   0.26198241 0.32687528 1.0000000

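Based on the correlation matrix, the correlation between the length of the password and its value is very weak (r ≈ 0.08), while the length and the strength of the password are somewhat more strongly, though still weakly, correlated (r ≈ 0.26). The strongest of the three is between strength and value (r ≈ 0.33): stronger passwords tend to take more time to crack by online guessing (the value). Note that value is expressed in whatever unit time_unit gives for that row (years, days, minutes, seconds), so correlations involving the raw value should be read with caution.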
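The summary of strength shows a maximum of 48 against a median of 7, and Pearson correlation is sensitive to such extreme values. One possible robustness check, which is an addition here and not part of the original analysis, is to compare it with the rank-based Spearman correlation. The sketch below uses only the six rows from the head(passwords) output, so its numbers are illustrative and will differ from those of the full data; sample_df is a hypothetical name.

```r
# Six rows taken from head(passwords); char_count is the password length
sample_df <- data.frame(
  char_count = c(8, 6, 8, 4, 6, 5),
  value      = c(6.91, 18.52, 1.29, 11.11, 3.72, 1.85),
  strength   = c(8, 4, 4, 4, 8, 4)
)

# Pearson measures linear association and is affected by outliers
pearson <- cor(sample_df, method = "pearson")

# Spearman works on ranks, so single extreme values carry less weight
spearman <- cor(sample_df, method = "spearman")

print(round(pearson, 3))
print(round(spearman, 3))
```

If the two matrices tell a similar story on the full data, the weak correlations are unlikely to be an artifact of a few unusual passwords.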

Linear Regression:

# For this data, I want to predict the "strength" of the password based on its "char_count"
model <- lm(strength ~ char_count, data = passwords)
# Summary of the regression model
summary(model)
## 
## Call:
## lm(formula = strength ~ char_count, data = passwords)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.852 -1.478 -0.160  0.840 38.148 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.9145     1.3975  -0.654    0.513    
## char_count    1.3458     0.2222   6.058 2.72e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.232 on 498 degrees of freedom
## Multiple R-squared:  0.06863,    Adjusted R-squared:  0.06676 
## F-statistic:  36.7 on 1 and 498 DF,  p-value: 2.72e-09
# Plot the residuals vs. fitted values
plot(model, which = 1)

# Plot the normal Q-Q plot of residuals
plot(model, which = 2)

# Plot the scale-location plot (square root of standardized residuals vs. fitted values)
plot(model, which = 3)

# Plot the residuals vs. leverage
plot(model, which = 5)
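The R-squared in the model summary can be cross-checked against the correlation matrix. In a simple linear regression with one predictor, R-squared equals the squared Pearson correlation between predictor and response: \(0.2619824^2 \approx 0.0686\), which agrees with the reported Multiple R-squared of 0.06863. The sketch below verifies the identity on the six rows from the head(passwords) output; fit, r_squared, and r_pearson are hypothetical names, and the sample values will not match those of the full data.

```r
# Six rows taken from head(passwords)
char_count <- c(8, 6, 8, 4, 6, 5)
strength   <- c(8, 4, 4, 4, 8, 4)

# Simple linear regression of strength on length
fit <- lm(strength ~ char_count)

# R-squared from the model and the Pearson correlation from cor()
r_squared <- summary(fit)$r.squared
r_pearson <- cor(char_count, strength)

# The two quantities should agree up to floating-point error
print(c(r_squared = r_squared, r_pearson_squared = r_pearson^2))
```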

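After fitting the linear model, the residual plots suggest that the model assumptions are not well met. The points fall in vertical lines because char_count takes only a few integer values, and the residuals are strongly right-skewed (a maximum of 38.1 against a median of -0.16). In addition, the \(p\text{-value} = 2.72 \times 10^{-9}\) is less than \(0.05\), so the slope for char_count is statistically significant, even though the model explains only about 7% of the variance in strength (\(R^2 = 0.0686\)). A quadratic model is fitted next to check whether curvature improves the fit.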

# Fit a quadratic regression model
lm_quad_model <- lm(strength ~ poly(char_count, degree = 2), data = passwords)

# Summary of the model
summary(lm_quad_model)
## 
## Call:
## lm(formula = strength ~ poly(char_count, degree = 2), data = passwords)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.701 -1.320 -0.230  0.770 38.299 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     7.4320     0.2342  31.739  < 2e-16 ***
## poly(char_count, degree = 2)1  31.6930     5.2359   6.053  2.8e-09 ***
## poly(char_count, degree = 2)2  -2.2241     5.2359  -0.425    0.671    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.236 on 497 degrees of freedom
## Multiple R-squared:  0.06897,    Adjusted R-squared:  0.06523 
## F-statistic: 18.41 on 2 and 497 DF,  p-value: 1.937e-08
# Plot the model
plot(lm_quad_model)

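Both \(R^2\) values (0.0686 for the linear model and 0.0690 for the quadratic model) show that char_count (the length of the password) alone explains only about 7% of the variance in strength. The quadratic term is not significant (p = 0.671), so adding curvature does not improve the fit. Length therefore has a statistically significant but weak association with strength, and other factors must account for most of what makes a password strong.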
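The linear and quadratic models are nested, so they can also be compared formally with an F-test through anova(). This is a sketch of that comparison, not a step from the original analysis. It runs on the six rows from the head(passwords) output only so that it is self-contained; fit_lin, fit_quad, and cmp are hypothetical names. On the full data the equivalent call would be anova(model, lm_quad_model), and because only one term is added, its p-value should match the t-test p-value of the quadratic term (0.671 in the summary above).

```r
# Six rows taken from head(passwords); far too few for real inference
sample_df <- data.frame(
  char_count = c(8, 6, 8, 4, 6, 5),
  strength   = c(8, 4, 4, 4, 8, 4)
)

# Nested models: the linear fit is a special case of the quadratic fit
fit_lin  <- lm(strength ~ char_count, data = sample_df)
fit_quad <- lm(strength ~ poly(char_count, degree = 2), data = sample_df)

# F-test of whether the extra quadratic term reduces the residual sum of squares
cmp <- anova(fit_lin, fit_quad)
print(cmp)
```

A large p-value in the second row of the table means the quadratic term does not add explanatory power beyond the linear term.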