# Intro

Wordle recently went viral, so much so that it was acquired by the New York Times. Wordle is basically hangman with five-letter words and six chances to guess one word per day. Many people have examined the data, but I have found most of the analyses lacking, with suggestions such as saint, soare, salet, and various other guesses. Most people reference the most common letters in the English language. What I have not seen is anyone looking into the most common letters within Wordle itself, which only uses a limited selection of words. As a new data analyst, I thought it would be fun to dive deep into what most people seem to skip over as one of the most important steps in playing this game.

The questions we are trying to answer are: What are the most common letters used in Wordle? What word gives you the most information on your first guess? Is the best guess actually the best?

We need some criteria for our second question beyond the word being a real word that we can guess within the game. For the most impact, the word should have five unique letters, with no duplicates. It should contain a combination of the most common letters within the game itself. It doesn’t have to be a valid solution, but it must be a valid guess.

Source Material

I am using the data set Wordle Valid Words by Bill Cruise, released under Public Domain. It contains two files, valid_guesses.csv and valid_solutions.csv, and we will use both in our analysis. The data set is from January 2022, before the New York Times removed some words from the list.

Preparation

## Loading and importing the data

First, I expanded the words into letters by column using Excel, labeling the columns Letter_1, Letter_2, Letter_3, Letter_4, and Letter_5.

## [1] 2315
## [1] 10657

We then import our data and make sure we have the correct number of rows using nrow(), as shown in the output above. We should have 2315 rows for valid solutions and 10657 rows for valid guesses. Everything matches and looks good so far. Next, we make sure everything is in lower-case letters to keep everything consistent.

valid_solutions$word <- tolower(valid_solutions$word)
valid_solutions$Letter_1 <- tolower(valid_solutions$Letter_1)
valid_solutions$Letter_2 <- tolower(valid_solutions$Letter_2)
valid_solutions$Letter_3 <- tolower(valid_solutions$Letter_3)
valid_solutions$Letter_4 <- tolower(valid_solutions$Letter_4)
valid_solutions$Letter_5 <- tolower(valid_solutions$Letter_5)

valid_guesses$word <- tolower(valid_guesses$word)
valid_guesses$Letter_1 <- tolower(valid_guesses$Letter_1)
valid_guesses$Letter_2 <- tolower(valid_guesses$Letter_2)
valid_guesses$Letter_3 <- tolower(valid_guesses$Letter_3)
valid_guesses$Letter_4 <- tolower(valid_guesses$Letter_4)
valid_guesses$Letter_5 <- tolower(valid_guesses$Letter_5)
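
As an alternative to splitting the words in Excel and lower-casing each column by hand, the same preparation could be sketched in R directly. This is only an illustration on a toy data frame: the three words below are made-up stand-ins for the real file, and the loop assumes every word has exactly five letters.

```r
## Toy stand-in for the word column of valid_solutions.csv (assumed data).
valid_solutions <- data.frame(word = c("Crane", "SLATE", "trace"),
                              stringsAsFactors = FALSE)

## Lower-case the words once, before splitting.
valid_solutions$word <- tolower(valid_solutions$word)

## Add Letter_1 ... Letter_5, one column per letter position.
for (k in 1:5) {
  valid_solutions[[paste0("Letter_", k)]] <- substr(valid_solutions$word, k, k)
}

print(valid_solutions)
```

Because the letters are cut from the already lower-cased word, the letter columns need no separate tolower() calls.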

Setting up data frames

We create new data frames to hold the letter counts for each letter position.

# Count letters by frequency sorted by letter placement in all words
Letter_v_1 <- data.frame(table(valid_solutions$Letter_1))
Letter_v_2 <- data.frame(table(valid_solutions$Letter_2))
Letter_v_3 <- data.frame(table(valid_solutions$Letter_3))
Letter_v_4 <- data.frame(table(valid_solutions$Letter_4))
Letter_v_5 <- data.frame(table(valid_solutions$Letter_5))
Letter_g_1 <- data.frame(table(valid_guesses$Letter_1))
Letter_g_2 <- data.frame(table(valid_guesses$Letter_2))
Letter_g_3 <- data.frame(table(valid_guesses$Letter_3))
Letter_g_4 <- data.frame(table(valid_guesses$Letter_4))
Letter_g_5 <- data.frame(table(valid_guesses$Letter_5))

We will then merge all of this into a new, easy-to-read data frame, to which we will add more columns later.

all_letters_solutions <- merge(Letter_v_1, Letter_v_2, by = "Var1", all = TRUE)
all_letters_solutions <- merge(all_letters_solutions, Letter_v_3, by = "Var1", all = TRUE)
all_letters_solutions <- merge(all_letters_solutions, Letter_v_4, by = "Var1", all = TRUE)
## Warning in merge.data.frame(all_letters_solutions, Letter_v_4, by = "Var1", :
## column names 'Freq.x', 'Freq.y' are duplicated in the result
all_letters_solutions <- merge(all_letters_solutions, Letter_v_5, by = "Var1", all = TRUE)
## Warning in merge.data.frame(all_letters_solutions, Letter_v_5, by = "Var1", :
## column names 'Freq.x', 'Freq.y' are duplicated in the result
all_letters_guesses <- merge(Letter_g_1, Letter_g_2, by = "Var1", all = TRUE) 
all_letters_guesses <- merge(all_letters_guesses, Letter_g_3, by = "Var1", all = TRUE)
all_letters_guesses <- merge(all_letters_guesses, Letter_g_4, by = "Var1", all = TRUE)
## Warning in merge.data.frame(all_letters_guesses, Letter_g_4, by = "Var1", :
## column names 'Freq.x', 'Freq.y' are duplicated in the result
all_letters_guesses <- merge(all_letters_guesses, Letter_g_5, by = "Var1", all = TRUE)
## Warning in merge.data.frame(all_letters_guesses, Letter_g_5, by = "Var1", :
## column names 'Freq.x', 'Freq.y' are duplicated in the result

We first have to rename the columns we just merged.

names(all_letters_guesses)[2] <- "Letter_1"
names(all_letters_guesses)[3] <- "Letter_2"
names(all_letters_guesses)[4] <- "Letter_3"
names(all_letters_guesses)[5] <- "Letter_4"
names(all_letters_guesses)[6] <- "Letter_5"

names(all_letters_solutions)[2] <- "Letter_1"
names(all_letters_solutions)[3] <- "Letter_2"
names(all_letters_solutions)[4] <- "Letter_3"
names(all_letters_solutions)[5] <- "Letter_4"
names(all_letters_solutions)[6] <- "Letter_5"
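
The count, merge, and rename steps above could also be collapsed into a single pass, which sidesteps the duplicated-column warning because merge() is never called. The sketch below is one possible way to do it, not the method used for the results in this post; `words` and `position_counts` are illustrative names, and the three words are toy data.

```r
## Toy stand-in for valid_solutions$word (assumed data).
words <- c("crane", "slate", "trace")

## Count each letter a-z at each of the five positions.
## factor(..., levels = letters) keeps letters that never occur, as zeros.
counts <- sapply(1:5, function(k) {
  table(factor(substr(words, k, k), levels = letters))
})
colnames(counts) <- paste0("Letter_", 1:5)

## Convert to a data frame shaped like all_letters_solutions.
position_counts <- data.frame(Var1 = rownames(counts), counts,
                              row.names = NULL, stringsAsFactors = FALSE)
position_counts$sum <- rowSums(position_counts[, 2:6])
position_counts <- position_counts[order(-position_counts$sum), ]

head(position_counts)
```

One difference from the merged version: letters that never appear in a position show up as 0 rather than NA, so na.rm is not needed when summing.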

We can then add up each letter’s overall occurrence by making a new column named sum.

all_letters_solutions$sum <- rowSums(all_letters_solutions[ , c(2:6)], na.rm=TRUE)
all_letters_guesses$sum <- rowSums(all_letters_guesses[ , c(2:6)], na.rm=TRUE)

## sort them highest to lowest
all_letters_solutions<- all_letters_solutions[order(-all_letters_solutions$sum), ]
all_letters_guesses<- all_letters_guesses[order(-all_letters_guesses$sum), ]

Adding weights to letters

Next, we will give each letter a weight based on its percentage of usage compared to the rest of the letters. First, we need to count how many letters we have in each data frame.

## total number of letters in the guesses list
count <- sum(all_letters_guesses$sum)
print(count)
## [1] 53285
## total number of letters in the solutions list
count <- sum(all_letters_solutions$sum)
print(count)
## [1] 11575

Now that we have the total number of letters in each data frame, we divide each letter’s sum by that total, multiply by 100, and put the resulting percentage in a new column named weight.

## weight: each letter's share of all letters, as a percentage
all_letters_solutions$weight <- (all_letters_solutions$sum / 11575) *100
all_letters_guesses$weight <- (all_letters_guesses$sum / 53285) *100
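
The weight calculation can be written without hard-coding the totals, which helps if the word list ever changes. A minimal sketch on made-up numbers; `letter_sums` and its values are assumptions for illustration only, not figures from the data set.

```r
## Toy letter totals standing in for the sum column (assumed data).
letter_sums <- data.frame(Var1 = c("e", "a", "r", "o"),
                          sum  = c(40, 30, 20, 10))

## Weight = share of all letters, in percent.
## Dividing by sum() of the column avoids typing the total by hand.
letter_sums$weight <- letter_sums$sum / sum(letter_sums$sum) * 100

print(letter_sums)
```

A quick sanity check on any such column is that the weights add up to 100.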

Final check

Let’s check our data frames to see if everything looks correct.

all_letters_guesses

##    Var1 Letter_1 Letter_2 Letter_3 Letter_4 Letter_5  sum    weight
## 19    s     1199       77      453      345     3922 5996 11.252698
## 5     e      231     1386      705     2009     1098 5429 10.188608
## 1     a      596     1959      929      911      616 5011  9.404148
## 15    o      221     1817      749      566      331 3684  6.913766
## 18    r      523      673     1035      567      461 3259  6.116168
## 9     i      131     1181      785      722      269 3088  5.795252

all_letters_solutions

##    Var1 Letter_1 Letter_2 Letter_3 Letter_4 Letter_5  sum    weight
## 5     e       72      242      177      318      424 1233 10.652268
## 1     a      141      304      307      163       64  979  8.457883
## 18    r      105      267      163      152      212  899  7.766739
## 15    o       41      279      244      132       58  754  6.514039
## 20    t      149       77      111      139      253  729  6.298056
## 12    l       88      201      112      162      156  719  6.211663

Everything looks correct and is ready for the next step.

Analysis

The first thing we need to do is look at the all_letters_solutions data frame to understand the frequency of letter use in Wordle.

Looking at the letter frequencies above, we can see that e, a, r, o, and t are the top five letters, followed by l, i, s, n, and c as the next five. According to [Lexico](https://www.lexico.com/explore/which-letters-are-used-most), the top ten letters in English are, in order, e, a, r, i, o, t, n, s, l, and c. The ordering differs enough that it could change the outcome of any analysis based on the most-common-in-English list. We will continue with our own list for the rest of the analysis.

We have the top five letters of Wordle. Can we make one or more five-letter words out of these letters? To find out, we search the valid guesses and return any matches from the word column.

library(stringr) ## provides str_detect() and str_count()

best_guess <- character()
for (x in valid_guesses$word) {
  ## keep the word only if it contains all five of the top letters
  if (str_detect(x,"e") && str_detect(x,"a") && str_detect(x,"r") &&
      str_detect(x,"o") && str_detect(x,"t")){
    best_guess <- append(best_guess,x)
  }
}
print(best_guess)
## [1] "oater" "orate" "roate"

The above code runs through each word in the valid_guesses data frame, checks whether each of our five letters is in the word, and, when all five are present in any order, appends the word to a list. Doing so, we come up with three valid words: oater, orate, and roate. One of these could be our word, but we need to dig a little deeper, because the letter l was close behind the letter t. We need to see how well these words stack up against other words. We are looking to eliminate as many words as possible.
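
The letter search described above can also be written without an explicit loop. The sketch below is one hedged alternative using base R only; `contains_all()` is a hypothetical helper name, and the word list is toy data rather than the real valid_guesses file.

```r
## Toy stand-in for valid_guesses$word (assumed data).
words <- c("orate", "roate", "saint", "oater", "tears")

## contains_all() is a hypothetical helper:
## TRUE for each word that contains every letter in `needed`.
contains_all <- function(words, needed) {
  hits <- lapply(needed, function(l) grepl(l, words, fixed = TRUE))
  Reduce(`&`, hits)
}

best_guess <- words[contains_all(words, c("e", "a", "r", "o", "t"))]
print(best_guess)
```

Passing a different set of letters, for example swapping t for l, would make it easy to compare candidate letter sets.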

not_orate <- character()
for (x in valid_solutions$word){
  ## keep solutions that contain none of the letters o, r, a, t, e
  if (!str_detect(x,"[orate]")){
    not_orate <- append(not_orate,x)
  }
}

Taking the length of not_orate after running the above code tells us how many words would be left if we didn’t get a single letter right: 195 out of 2315. Compare this to a word like saint, which leaves 346 words when no letters are correct. That is a huge improvement, especially when others have concluded that saint is a good word to start with. We are not done yet, though; we need to examine and validate the findings. For that, we need to run through all possible guesses and score each word by how many options are left in valid_solutions.

guess_remaining <- data.frame(word=character(),remaining=integer())

for (i in valid_guesses$word){
  not_word <- character() ## solutions that share no letter with guess i
  for (x in valid_solutions$word){ 
    if(!str_detect(x, substr(i,1,1)) & 
       !str_detect(x, substr(i,2,2)) & 
       !str_detect(x, substr(i,3,3)) & 
       !str_detect(x, substr(i,4,4)) & 
       !str_detect(x, substr(i,5,5))){ 
        not_word <- append(not_word,x)
    }
  }
  ## list() keeps remaining as an integer; c() would coerce it to character
  guess_remaining[nrow(guess_remaining) + 1,] = list(word = i, remaining = length(not_word))
}

The above code probably isn’t ideal; it’s probably not the fastest or least intensive to run. But it works, and it took a few days of trying to solve this problem. Checking all 2315 solutions against one guess, then repeating that for each of the 10657 words in valid_guesses, takes a bit of time. I learned a lot getting to this point, including the answer to our question. But I feel like we are missing something. Sure, I have one of many answers, but my answer doesn’t match anyone else’s. Am I wrong? I questioned my choices and searched online for a list of words that are supposed to be better than my answer.
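
For anyone who wants a faster version of the remaining-words count, one option is to test each guess against the whole solution vector at once with a character class. This is a sketch under assumptions, not the code used for the results here: `count_remaining()` is a hypothetical helper, the word lists are toy data, and it relies on the guesses containing only plain letters so they are safe inside `[...]`.

```r
## Toy stand-ins for valid_solutions$word and valid_guesses$word (assumed data).
solutions <- c("crane", "slate", "pound", "might")
guesses   <- c("orate", "fuzzy", "lymph")

## count_remaining() is a hypothetical helper:
## number of solutions that share no letter with the guess.
count_remaining <- function(guess, solutions) {
  pattern <- paste0("[", guess, "]")  ## e.g. "[orate]" matches any of o, r, a, t, e
  sum(!grepl(pattern, solutions))
}

guess_remaining <- data.frame(
  word = guesses,
  remaining = vapply(guesses, count_remaining, integer(1), solutions = solutions),
  row.names = NULL,
  stringsAsFactors = FALSE
)
print(guess_remaining)
```

The inner loop over solutions disappears because grepl() is vectorized, so only the loop over guesses remains.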

top_10 <- data.frame(word=character(),remaining= integer())
top_10[nrow(top_10) + 1,] = list(word = as.character("Salet"), remaining = as.integer(221))
top_10[nrow(top_10) + 1,] = list(word = as.character("Soare"), remaining = as.integer(279))
top_10[nrow(top_10) + 1,] = list(word = as.character("Trace"), remaining = as.integer(246))
top_10[nrow(top_10) + 1,] = list(word = as.character("Serai"), remaining = as.integer(168))
top_10[nrow(top_10) + 1,] = list(word = as.character("Tales"), remaining = as.integer(221))
top_10[nrow(top_10) + 1,] = list(word = as.character("Cones"), remaining = as.integer(268))
top_10[nrow(top_10) + 1,] = list(word = as.character("Hates"), remaining = as.integer(282))
top_10[nrow(top_10) + 1,] = list(word = as.character("Audio"), remaining = as.integer(190))
top_10[nrow(top_10) + 1,] = list(word = as.character("Cones"), remaining = as.integer(268))
top_10[nrow(top_10) + 1,] = list(word = as.character("Adieu"), remaining = as.integer(162))
top_10 <- top_10[order(-top_10$remaining),]
top_10
##     word remaining
## 7  Hates       282
## 2  Soare       279
## 6  Cones       268
## 9  Cones       268
## 3  Trace       246
## 1  Salet       221
## 5  Tales       221
## 8  Audio       190
## 4  Serai       168
## 10 Adieu       162

This is one of many lists I have found. Most words on it come out worse than my original guess of orate, which left 195. Salet is supposed to be the best choice, but by this measure adieu is clearly the top pick of the list. So, what is going on? Adieu is mostly vowels, and the list states that “it’s not as efficient of a word as you might think.”

for(i in valid_guesses$word){ ## cycle words as i
  weight_score = 0
  ban_1 = c("")
  ban_2 = c("")
  ban_3 = c("")
  ban_4 = c("")
  
  for(x in all_letters_solutions$Var1){ ## cycle through letters as x
    if(substr(i,1,1) == x & str_count(i, x) == 1){
      weight_score = as.double(all_letters_solutions$weight[all_letters_solutions$Var1 == x])+weight_score
    }else if(substr(i,1,1) == x & str_count(i, x) >1){
      weight_score = as.double(all_letters_solutions$weight[all_letters_solutions$Var1 == x])+weight_score
      ban_1 <- x
    }
    
    if(substr(i,2,2) == x & str_count(i, x) == 1 & substr(i,2,2) != ban_1){
      weight_score = as.double(all_letters_solutions$weight[all_letters_solutions$Var1 == x])+weight_score
    }else if(substr(i,2,2) == x & str_count(i, x) >1 & substr(i,2,2) != ban_1){
      weight_score = as.double(all_letters_solutions$weight[all_letters_solutions$Var1 == x])+weight_score
      ban_2 <- x
    }
    
    if(substr(i,3,3) == x & str_count(i, x) == 1 & substr(i,3,3) != ban_1 & substr(i,3,3) != ban_2){
      weight_score = as.double(all_letters_solutions$weight[all_letters_solutions$Var1 == x])+weight_score
    }else if(substr(i,3,3) == x & str_count(i, x) >1 & substr(i,3,3) != ban_1 & substr(i,3,3) != ban_2){
      weight_score = as.double(all_letters_solutions$weight[all_letters_solutions$Var1 == x])+weight_score
      ban_3 <- x
    }
    
    if(substr(i,4,4) == x & str_count(i, x) == 1 & substr(i,4,4) != ban_1 & substr(i,4,4) != ban_2 & substr(i,4,4) != ban_3){
      weight_score = as.double(all_letters_solutions$weight[all_letters_solutions$Var1 == x])+weight_score
    }else if(substr(i,4,4) == x & str_count(i, x) >1 & substr(i,4,4) != ban_1 & substr(i,4,4) != ban_2 & substr(i,4,4) != ban_3){
      weight_score = as.double(all_letters_solutions$weight[all_letters_solutions$Var1 == x])+weight_score
      ban_4 <- x
    }
    
    if(substr(i,5,5) == x & str_count(i, x) == 1 & substr(i,5,5) != ban_1 & substr(i,5,5) != ban_2 & substr(i,5,5) != ban_3 & substr(i,5,5) != ban_4){ 
      weight_score = as.double(all_letters_solutions$weight[all_letters_solutions$Var1== x])+weight_score
    }
    
  }
  guess_remaining$score[guess_remaining$word == i] <- weight_score ## add weight to column score
}

The above code gives a score to each word in valid_guesses by adding up the weights we calculated earlier for each of its letters. The letter e has the highest weight and j the lowest, reflecting the highest and lowest usage. To fit our criteria, a letter that appears more than once in a word is only scored once.
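
The scoring rule described above, summing the weight of each distinct letter once, can be expressed compactly. The following is a minimal sketch of that rule, not a replacement for the results in this post; `score_word()` is a hypothetical helper and the weights are made-up round numbers.

```r
## Toy letter weights standing in for all_letters_solutions$weight (assumed values).
weights <- c(e = 10, a = 8, r = 7, o = 6, t = 6, s = 5)

## score_word() is a hypothetical helper:
## sum the weight of each distinct letter in the word exactly once.
score_word <- function(word, weights) {
  letters_in_word <- unique(strsplit(word, "")[[1]])
  sum(weights[letters_in_word], na.rm = TRUE)  ## letters without a weight count as 0
}

score_word("orate", weights)  ## five distinct letters
score_word("eater", weights)  ## the second "e" adds nothing
```

With the real data, the named vector could be built with something like `setNames(all_letters_solutions$weight, as.character(all_letters_solutions$Var1))`, assuming the column names used earlier.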

top_score <- data.frame(word=character(),remaining= integer(),score= numeric())

i=0

guess_remaining <- guess_remaining[order(-guess_remaining$score),]


while (i<10) {
  i=i+1
  top_score[nrow(top_score) + 1,] = list(word = guess_remaining$word[i], remaining = as.integer(guess_remaining$remaining[i]),score = as.double(guess_remaining$score[i]))
}

top_score
##     word remaining    score
## 1  oater       195 39.68898
## 2  orate       195 39.68898
## 3  roate       195 39.68898
## 4  realo       176 39.60259
## 5  artel       196 39.38661
## 6  ratel       196 39.38661
## 7  taler       196 39.38661
## 8  aeros       183 39.17063
## 9  soare       183 39.17063
## 10 retia       194 38.97192

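As you can see above, the words built from our most common letters are on top, scoring almost 40 points. The score is a sum of percentages based on how common each letter is within Wordle, so the five letters of orate account for about 39.7% of all letters in the solution list. Note that this is not the chance of having at least one letter correct; the remaining column gives that: 195 of 2315 solutions share no letter with orate, so about 91.6% of solutions contain at least one of its letters. Before we call the answer, we have three choices to pick from. Or does it even make a difference? Let’s check.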
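
To put a number on the chance of at least one letter being correct, one option is to work from the remaining column rather than the score. The short calculation below uses the figures from the tables above and assumes every solution is equally likely to be the word of the day.

```r
## Numbers taken from the tables above.
total_solutions <- 2315
remaining_orate <- 195  ## solutions sharing no letter with orate

## Share of solutions that contain at least one letter of the guess.
p_at_least_one <- 1 - remaining_orate / total_solutions
round(p_at_least_one * 100, 1)
## [1] 91.6
```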

oater, orate, or roate?

oater = as.numeric(0)
orate = as.numeric(0)
roate = as.numeric(0)

for(i in valid_solutions$word){
  if(substr(i,1,1) == "o"){
    oater = oater+1
  }
  if(substr(i,2,2) == "a"){
    oater = oater+1
  }
  if(substr(i,3,3) == "t"){
    oater = oater+1
  }
  if(substr(i,4,4) == "e"){
    oater = oater+1
  }
  if(substr(i,5,5) == "r"){
    oater = oater+1
  }
}

for(i in valid_solutions$word){
  if(substr(i,1,1) == "o"){
    orate = orate+1
  }
  if(substr(i,2,2) == "r"){
    orate = orate+1
  }
  if(substr(i,3,3) == "a"){
    orate = orate+1
  }
  if(substr(i,4,4) == "t"){
    orate = orate+1
  }
  if(substr(i,5,5) == "e"){
    orate = orate+1
  }
}

for(i in valid_solutions$word){
  if(substr(i,1,1) == "r"){
    roate = roate+1
  }
  if(substr(i,2,2) == "o"){
    roate = roate+1
  }
  if(substr(i,3,3) == "a"){
    roate = roate+1
  }
  if(substr(i,4,4) == "t"){
    roate = roate+1
  }
  if(substr(i,5,5) == "e"){
    roate = roate+1
  }
}

oater_orate_roate <- data.frame(word=character(),score= integer())
oater_orate_roate[nrow(oater_orate_roate) + 1,] = list(word = as.character("oater"), score = as.integer(oater))
oater_orate_roate[nrow(oater_orate_roate) + 1,] = list(word = as.character("orate"), score = as.integer(orate))
oater_orate_roate[nrow(oater_orate_roate) + 1,] = list(word = as.character("roate"), score = as.integer(roate))

oater_orate_roate
##    word score
## 1 oater   986
## 2 orate  1178
## 3 roate  1254

The above list is the sum of every time a letter of the guess matches the same position in a word in valid_solutions. We can see that roate is the clear winner.
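
The three loops above follow the same pattern, so the position check could be wrapped in one function and reused for any guess. This is a sketch only: `position_matches()` is a hypothetical helper, and the solutions are toy data, so the numbers it prints will not match the table above.

```r
## Toy stand-in for valid_solutions$word (assumed data).
solutions <- c("crane", "slate", "trace")

## position_matches() is a hypothetical helper:
## total count of same-letter, same-position matches across all solutions.
position_matches <- function(guess, solutions) {
  per_position <- vapply(1:5, function(k) {
    sum(substr(solutions, k, k) == substr(guess, k, k))
  }, integer(1))
  sum(per_position)
}

sapply(c("oater", "orate", "roate"), position_matches, solutions = solutions)
```

The same call works for any vector of candidate words, which makes it easy to extend the comparison beyond these three.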

Conclusion

It has been a journey of learning, understanding, and research. I explored different paths through this data that didn’t make it into the final analysis. I understand the data better than I probably should, even though it really is just a list of words. I investigated other people’s analyses of this question, and it made me question everything. But I feel confident in my answer using the data I was given. I understand that letter usage within Wordle varies slightly from overall dictionary usage. We now know the best word to use when starting a game of Wordle by these measures, though not necessarily the one that gets to the answer the quickest, which would be a whole other problem. The question I can’t answer is whether the best guess is actually the best. Does using the same word every time take away the fun of it all? I guess that’s up to the person playing.

If you see errors or have any suggestions for areas I could improve, please let me know. I’m happy to be proven wrong.