library(readxl)
library(ggplot2)
Research question: Does the length of a password affect its strength? The data come from Kaggle.com, which describes the dataset as follows: “This dataset provides columns with different features of different passwords that lots of people choose.”
Each column gives different information about each of the passwords studied. The columns I am interested in are the following:
category (character): What category does the password fall in to?
value (double): Time to crack by online guessing, expressed in the unit given in the time_unit column
strength (double): Quality of the password, where 10 is highest and 1 is lowest. Note that these scores are relative to this set of generally bad passwords.
passwords <- read.csv("https://raw.githubusercontent.com/SalouaDaouki/Data605/main/passwords.csv")
head(passwords)
## rank password category value time_unit offline_crack_sec rank_alt
## 1 1 password password-related 6.91 years 2.17e+00 1
## 2 2 123456 simple-alphanumeric 18.52 minutes 1.11e-05 2
## 3 3 12345678 simple-alphanumeric 1.29 days 1.11e-03 3
## 4 4 1234 simple-alphanumeric 11.11 seconds 1.11e-07 4
## 5 5 qwerty simple-alphanumeric 3.72 days 3.21e-03 5
## 6 6 12345 simple-alphanumeric 1.85 minutes 1.11e-06 6
## strength font_size
## 1 8 11
## 2 4 8
## 3 4 8
## 4 4 8
## 5 8 11
## 6 4 8
tail(passwords)
## rank password category value time_unit offline_crack_sec rank_alt strength
## 502 NA <NA> <NA> NA <NA> NA NA NA
## 503 NA <NA> <NA> NA <NA> NA NA NA
## 504 NA <NA> <NA> NA <NA> NA NA NA
## 505 NA <NA> <NA> NA <NA> NA NA NA
## 506 NA <NA> <NA> NA <NA> NA NA NA
## 507 NA <NA> <NA> NA <NA> NA NA NA
## font_size
## 502 NA
## 503 NA
## 504 NA
## 505 NA
## 506 NA
## 507 NA
dim(passwords)
## [1] 507 9
summary(passwords)
## rank password category value
## Min. : 1.0 Length:507 Length:507 Min. : 1.290
## 1st Qu.:125.8 Class :character Class :character 1st Qu.: 3.430
## Median :250.5 Mode :character Mode :character Median : 3.720
## Mean :250.5 Mean : 5.603
## 3rd Qu.:375.2 3rd Qu.: 3.720
## Max. :500.0 Max. :92.270
## NA's :7 NA's :7
## time_unit offline_crack_sec rank_alt strength
## Length:507 Min. : 0.00000 Min. : 1.0 Min. : 0.000
## Class :character 1st Qu.: 0.00321 1st Qu.:125.8 1st Qu.: 6.000
## Mode :character Median : 0.00321 Median :251.5 Median : 7.000
## Mean : 0.50001 Mean :251.2 Mean : 7.432
## 3rd Qu.: 0.08350 3rd Qu.:376.2 3rd Qu.: 8.000
## Max. :29.27000 Max. :502.0 Max. :48.000
## NA's :7 NA's :7 NA's :7
## font_size
## Min. : 0.0
## 1st Qu.:10.0
## Median :11.0
## Mean :10.3
## 3rd Qu.:11.0
## Max. :28.0
## NA's :7
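Based on the summary above, columns 1, 4, 6, 7, 8, and 9 are numeric, but each has 7 missing values, which correspond to the empty rows at the end of the file. The str() output below confirms that these columns are stored as int or num rather than as character. Before visualizing the data, I will remove the rows with missing values and convert the integer columns to double (the cleaning code after the str() output).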
str(passwords)
## 'data.frame': 507 obs. of 9 variables:
## $ rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ password : chr "password" "123456" "12345678" "1234" ...
## $ category : chr "password-related" "simple-alphanumeric" "simple-alphanumeric" "simple-alphanumeric" ...
## $ value : num 6.91 18.52 1.29 11.11 3.72 ...
## $ time_unit : chr "years" "minutes" "days" "seconds" ...
## $ offline_crack_sec: num 2.17 1.11e-05 1.11e-03 1.11e-07 3.21e-03 1.11e-06 3.21e-03 2.17 2.17 8.35e-02 ...
## $ rank_alt : int 1 2 3 4 5 6 7 8 9 10 ...
## $ strength : int 8 4 4 4 8 4 8 4 7 8 ...
## $ font_size : int 11 8 8 8 11 8 11 8 11 11 ...
class(passwords)
## [1] "data.frame"
# Check for missing values
sum(is.na(passwords))
## [1] 63
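The total of 63 missing cells is consistent with 7 fully empty rows across 9 columns (7 × 9 = 63). The sketch below shows one way to break such a total down by column and by row. It uses a small made-up data frame named `toy` (an assumption for illustration, not the real data) so that it runs without downloading the file.

```r
# Toy stand-in for `passwords`: two filled rows plus two fully empty rows,
# mimicking the trailing NA rows shown by tail(). Illustrative data only.
toy <- data.frame(
  rank     = c(1L, 2L, NA, NA),
  password = c("password", "123456", NA, NA),
  strength = c(8L, 4L, NA, NA),
  stringsAsFactors = FALSE
)

na_per_column   <- colSums(is.na(toy))        # missing cells in each column
incomplete_rows <- sum(!complete.cases(toy))  # rows with at least one NA
total_na        <- sum(is.na(toy))            # same quantity as sum(is.na(...)) above

print(na_per_column)
print(incomplete_rows)
print(total_na)
```

Running the same three calls on `passwords` before `na.omit()` would show whether the missing values are confined to the empty trailing rows.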
# Remove rows with any missing values
passwords <- na.omit(passwords)
# Convert the integer columns to double (value and offline_crack_sec are already numeric)
passwords$rank <- as.numeric(passwords$rank)
passwords$value <- as.numeric(passwords$value)
passwords$offline_crack_sec <- as.numeric(passwords$offline_crack_sec)
passwords$rank_alt <- as.numeric(passwords$rank_alt)
passwords$strength <- as.numeric(passwords$strength)
passwords$font_size <- as.numeric(passwords$font_size)
summary(passwords)
## rank password category value
## Min. : 1.0 Length:500 Length:500 Min. : 1.290
## 1st Qu.:125.8 Class :character Class :character 1st Qu.: 3.430
## Median :250.5 Mode :character Mode :character Median : 3.720
## Mean :250.5 Mean : 5.603
## 3rd Qu.:375.2 3rd Qu.: 3.720
## Max. :500.0 Max. :92.270
## time_unit offline_crack_sec rank_alt strength
## Length:500 Min. : 0.00000 Min. : 1.0 Min. : 0.000
## Class :character 1st Qu.: 0.00321 1st Qu.:125.8 1st Qu.: 6.000
## Mode :character Median : 0.00321 Median :251.5 Median : 7.000
## Mean : 0.50001 Mean :251.2 Mean : 7.432
## 3rd Qu.: 0.08350 3rd Qu.:376.2 3rd Qu.: 8.000
## Max. :29.27000 Max. :502.0 Max. :48.000
## font_size
## Min. : 0.0
## 1st Qu.:10.0
## Median :11.0
## Mean :10.3
## 3rd Qu.:11.0
## Max. :28.0
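The column description gives strength on a 1 to 10 scale, yet the summary reports a minimum of 0 and a maximum of 48. A quick range check can list the rows that fall outside the documented scale before any modelling. This is a minimal sketch on made-up rows; the passwords `example1` and `example2` are hypothetical placeholders, not entries taken from the dataset.

```r
# Toy stand-in for the cleaned `passwords` data frame (illustrative values).
toy <- data.frame(
  password = c("password", "123456", "example1", "example2"),
  strength = c(8, 4, 48, 0),
  stringsAsFactors = FALSE
)

# Keep only the rows whose strength lies outside the documented 1-10 range.
out_of_range <- toy[toy$strength < 1 | toy$strength > 10, ]
print(out_of_range)
```

Whether such rows should be kept, capped, or dropped is a judgment call; the analysis that follows keeps them.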
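# Add a new column with the count of characters in 'password', placed between 'rank' and 'password'
# drop = FALSE keeps the first column as a data frame so its name ('rank') is preserved
passwords <- cbind(passwords[, 1, drop = FALSE], char_count = nchar(passwords$password), passwords[, -1])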
# Scatter plot of 'length' vs 'strength'
plot(passwords$char_count, passwords$strength,
xlab = "Length", ylab = "Strength",
main = "Scatter plot of Length vs Strength")
# Barplot of 'category'
barplot(table(passwords$category),
xlab = "Category", ylab = "Frequency",
main = "Barplot of Category")
# Create a scatter plot with category on x-axis and strength on y-axis using ggplot2
ggplot(passwords, aes(x = category, y = strength)) +
geom_point() +
labs(x = "Category", y = "Strength", title = "Scatter plot of Category vs Strength") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Calculate correlation coefficients
correlation_matrix <- cor(passwords[, c("char_count", "value", "strength")])
# Print correlation matrix
print(correlation_matrix)
## char_count value strength
## char_count 1.00000000 0.08344091 0.2619824
## value 0.08344091 1.00000000 0.3268753
## strength 0.26198241 0.32687528 1.0000000
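Based on the correlation matrix, the correlation between the length of the password and its value is very weak (r ≈ 0.08), whereas the strength of the password and its length have a slightly stronger, but still weak, correlation (r ≈ 0.26). Strength and value are also positively correlated (r ≈ 0.33): stronger passwords tend to take more time to crack by online guessing. Note that value is recorded in different time units (see time_unit), so correlations involving value should be read with caution.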
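Because value is reported in whatever unit appears in time_unit (years, minutes, days, and seconds are all visible in the first six rows), one option is to convert it to seconds before correlating. The sketch below is an assumption-laden illustration: the lookup table `seconds_per_unit`, the 30-day month, and the 365-day year are my choices, and the unit names hours, weeks, and months are guesses that do not appear in the output shown here.

```r
# Seconds per unit. Month and year lengths are simplifying assumptions.
seconds_per_unit <- c(
  seconds = 1,
  minutes = 60,
  hours   = 3600,
  days    = 86400,
  weeks   = 7 * 86400,
  months  = 30 * 86400,   # assumption: 30-day month
  years   = 365 * 86400   # assumption: 365-day year
)

# First four rows of the head() output, used as toy data.
toy <- data.frame(
  value     = c(6.91, 18.52, 1.29, 11.11),
  time_unit = c("years", "minutes", "days", "seconds"),
  stringsAsFactors = FALSE
)

# Look up the multiplier for each row's unit and convert to seconds.
toy$value_sec <- toy$value * unname(seconds_per_unit[toy$time_unit])

# Fail loudly if a unit is missing from the lookup table.
stopifnot(!anyNA(toy$value_sec))
print(toy)
```

With a column like `value_sec`, the correlation matrix could be recomputed on a common scale; a rank-based method such as `cor(..., method = "spearman")` may also be worth considering, since crack times span many orders of magnitude.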
Linear regression:
# For this data, I want to predict the "strength" of the password based on its "char_count"
model <- lm(strength ~ char_count, data = passwords)
# Summary of the regression model
summary(model)
##
## Call:
## lm(formula = strength ~ char_count, data = passwords)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.852 -1.478 -0.160 0.840 38.148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.9145 1.3975 -0.654 0.513
## char_count 1.3458 0.2222 6.058 2.72e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.232 on 498 degrees of freedom
## Multiple R-squared: 0.06863, Adjusted R-squared: 0.06676
## F-statistic: 36.7 on 1 and 498 DF, p-value: 2.72e-09
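To make the coefficients concrete, the fitted line can be evaluated by hand. The sketch below hard-codes the intercept and slope reported in the summary above and computes the predicted strength for a few password lengths; the helper name `predict_strength` is mine. It also checks that, for a regression with a single predictor, R-squared equals the squared correlation between char_count and strength from the correlation matrix.

```r
# Coefficients copied from the summary(model) output above.
intercept <- -0.9145
slope     <- 1.3458

# Hypothetical helper: predicted strength for a given number of characters.
predict_strength <- function(char_count) {
  intercept + slope * char_count
}

predicted <- predict_strength(c(4, 6, 8))
print(predicted)

# With one predictor, R-squared is the squared Pearson correlation.
r_length_strength <- 0.2619824          # from the correlation matrix
r_squared_from_r  <- r_length_strength^2
print(r_squared_from_r)
```

Each additional character is associated with an increase of about 1.35 points in predicted strength, and the squared correlation matches the reported Multiple R-squared of 0.06863 up to rounding.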
# Plot the residuals vs. fitted values
plot(model, which = 1)
# Plot the normal Q-Q plot of residuals
plot(model, which = 2)
# Plot the scale-location plot (square root of standardized residuals vs. fitted values)
plot(model, which = 3)
# Plot the residuals vs. leverage
plot(model, which = 5)
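After fitting the linear model, the residual plots show the data points arranged in vertical lines. This pattern arises because char_count takes only a few integer values, so the fitted values are discrete; on its own it does not establish non-linearity, but the very large positive residuals (up to 38.1) suggest that the model assumptions are not fully met. In addition, the slope has \(p = 2.72 \times 10^{-9}\), which is less than \(0.05\), so the effect of char_count is statistically significant even though the model explains only about 6.9% of the variance in strength. To check for curvature, the next step fits a quadratic model.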
# Fit a quadratic regression model
lm_quad_model <- lm(strength ~ poly(char_count, degree = 2), data = passwords)
# Summary of the model
summary(lm_quad_model)
##
## Call:
## lm(formula = strength ~ poly(char_count, degree = 2), data = passwords)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.701 -1.320 -0.230 0.770 38.299
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.4320 0.2342 31.739 < 2e-16 ***
## poly(char_count, degree = 2)1 31.6930 5.2359 6.053 2.8e-09 ***
## poly(char_count, degree = 2)2 -2.2241 5.2359 -0.425 0.671
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.236 on 497 degrees of freedom
## Multiple R-squared: 0.06897, Adjusted R-squared: 0.06523
## F-statistic: 18.41 on 2 and 497 DF, p-value: 1.937e-08
# Plot the model
plot(lm_quad_model)
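Both R-squared values (about 0.069) show that char_count (the length of the
password) alone explains only about 7% of the variation in strength, even
though its linear effect is statistically significant. The quadratic term is
not significant (p = 0.671), so adding curvature does not improve the fit.
Other factors contribute to making a password strong.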
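A more formal way to compare a linear fit with a quadratic one is a nested-model F-test via `anova()`. The sketch below runs that comparison on simulated data (the data frame `sim` and its generating values are assumptions chosen for illustration), so it is self-contained and does not reproduce the numbers above. When only one term is added, this F-test is equivalent to the t-test on the quadratic coefficient.

```r
set.seed(1)

# Simulated stand-in: integer lengths with a purely linear effect plus noise.
sim <- data.frame(char_count = sample(4:9, 200, replace = TRUE))
sim$strength <- 1 + 1.2 * sim$char_count + rnorm(200, sd = 2)

# Same two model forms as in the analysis above.
fit_linear <- lm(strength ~ char_count, data = sim)
fit_quad   <- lm(strength ~ poly(char_count, degree = 2), data = sim)

# F-test: does the quadratic term reduce the residual sum of squares
# by more than would be expected by chance?
comparison <- anova(fit_linear, fit_quad)
print(comparison)
```

Applied to the real models, the call would be `anova(model, lm_quad_model)`; a large p-value in the second row would support keeping the simpler linear model.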