Wordle Start Word Analysis

Purpose

Which letters appear most frequently in each position of 5-letter words? Words built from these letters make good candidates for opening (or second) guesses, since they have the highest probability of turning up green or yellow letters.

TL;DR: Try these words to start:

SANES, SALES, SORES, CARES, BARES, TARES, SATES, SERES, PARES, MARES

Note: As far as Wordle strategy goes, you may not want to pick words from this list that contain duplicate letters, since a repeated letter wastes a chance to test a new one; a quick filter for this is sketched at the end of the post.

Method

First, load a dictionary of English words and subset it to 5-letter words only. Here I’m using Grady Ward’s Augmented English dictionary (the GradyAugmented word list from the qdapDictionaries package).

# packages used throughout the analysis
library(qdapDictionaries)  # provides the GradyAugmented word list
library(tidyverse)         # dplyr, purrr, stringr, ggplot2, tibble
library(tidytext)          # reorder_within() / scale_x_reordered()
library(flextable)         # table output

# load a dictionary
data(GradyAugmented)

# filter to only 5-letter words and convert to uppercase
wordles <- toupper(GradyAugmented[str_length(GradyAugmented) == 5])

flextable(data.frame(Words = sample(wordles, 10))) %>% 
  set_caption(caption = "Example Wordles") %>% 
  set_table_properties(layout = "autofit", width = 0.2)
Example Wordles

Words: SLUGS, AFOUL, SPANK, SONSY, ZANZA, SPIKY, HEXER, FLINN, YAUDS, CRUET
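As a quick check (my own addition, not part of the original code), you can count how many candidate words survive the 5-letter filter:

# number of 5-letter words kept from the dictionary
length(wordles)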

For each of the five letter positions, we can calculate how often each letter appears in that position. For example, how often does the letter ‘S’ appear as the first letter of a 5-letter word, how often does ‘T’ appear as the second letter, and so on.
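To illustrate that first question, the share of words starting with ‘S’ can be checked directly (this one-liner is a spot check of my own, not part of the original analysis):

# proportion of 5-letter words whose first letter is "S"
mean(str_sub(wordles, 1, 1) == "S")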

The plot below shows the top 10 letters in each word position; it shows that S is the most common first (and last!) letter in 5-letter words.

# function to tabulate letter frequency (count and percentage) from a vector of letters
letter_freq <- function(x){
  tbl <- table(x)
  res <- cbind(tbl, round(prop.table(tbl) * 100, 2))
  colnames(res) <- c('Count', 'Percentage')
  res %>% 
    as.data.frame() %>% 
    rownames_to_column("Letter") %>% 
    arrange(desc(Percentage))
}

# wrapper function for letter table by letter position
letter_table <- function(n){letter_freq(str_sub(wordles,n,n))}

# collect the letter freq tables for all 5 letter positions
wordle_freq <- map_dfr(1:5, letter_table, .id = "position")
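Before plotting, it can help to peek at the combined table (a quick look of my own; wordle_freq holds one row per letter and position, already sorted by Percentage within each position):

# top three letters for the first position
wordle_freq %>% 
  filter(position == "1") %>% 
  head(3)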

wordle_freq %>% 
  group_by(position) %>% 
  top_n(10, Percentage) %>% 
  ungroup() %>% 
  mutate(position = as.factor(position),
         Letter = reorder_within(Letter,Percentage, position)) %>% 
  
  ggplot(aes(Letter, Percentage, fill = position)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~position, scales = 'free_y') +
  coord_flip() +
  scale_x_reordered() +
  theme_bw() +
  labs(y = "Percent Letter Frequency",
       title = "Letter Frequency of 5-letter words by Letter Positions 1-5",
       subtitle = "Dictionary: Grady Ward's Augmented Dictionary")

Applying the by-position letter frequencies to each 5-letter word in our dictionary, we can then calculate a combined score for each word by averaging the frequencies of its letters in their respective positions.

# score a word as the mean positional letter frequency (percentage) from the freq table
score_word <- function(word){
  chars <- data.frame(Letter = unlist(strsplit(word, split = "")), stringsAsFactors = FALSE) %>% 
    mutate(position = as.character(row_number())) %>% 
    inner_join(wordle_freq, by = c("position", "Letter"))
  
  data.frame(Word = word, Score = mean(chars$Percentage))
}
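As a usage sketch, the scorer can be applied to a single uppercase guess (the example word here is arbitrary and is my own addition):

# score a single candidate word
score_word("AROSE")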

# score every 5-letter word and cache the result
word_scores <- map_dfr(wordles, score_word)
saveRDS(word_scores, "word_scores.rds")

# show the 10 highest-scoring words
word_scores %>% 
  arrange(desc(Score)) %>% 
  head(10) %>% 
  flextable() %>% 
  set_caption(caption = "Top 10 Words by average letter position frequency score") %>% 
  set_table_properties(layout = "autofit", width = 0.5)
Top 10 Words by average letter position frequency score

Word     Score
SANES    16.618
SALES    16.492
SORES    16.310
CARES    16.260
BARES    16.090
TARES    16.032
SATES    15.978
SERES    15.960
PARES    15.948
MARES    15.912
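Following the note in the TL;DR, most of these top scorers repeat a letter (usually S). A minimal sketch for dropping words with repeated letters, assuming the word_scores data frame built above, looks like this:

# keep only words whose five letters are all distinct, then re-rank by score
word_scores %>% 
  filter(map_int(strsplit(Word, ""), ~ n_distinct(.x)) == 5) %>% 
  arrange(desc(Score)) %>% 
  head(10)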