Analyzing Scrabble Games

Introduction

There has been quite a lot of high quality analysis of chess games using R. Joshua Kunst has written an excellent post with reproducible code showing piece movement and survival rates, which pieces capture which others and spatial patterns. He has also authored the really great rchess package. Randal Olson has a nice piece on exploring chess openings and Seth Kadish has a cool post on exploring chess square occupancy by different grandmasters.

Most people’s experience of scrabble is the game played with relatives on holidays or over the kitchen table, which descends into arguments over what is or is not a word. You may or may not be surprised to learn that there is a very active tournament scene. Scrabble played at the top level is a game of complex spatial strategy and skill mixed with a fair amount of luck. The game is far more than just remembering lots of words (though this helps!). Data-wise, there is this very nice visualization of wandering paths of scrabble moves by Nicholas Rougeux and, of course, Oliver Roeder’s awesome analysis of the National Scrabble Championship scores at fivethirtyeight and discussion of what makes Nigel Richards the best scrabble player on Earth.

Re-reading all of these posts made me wonder about putting some scrabble data together into a package so that I and others could do some fun data analysis. The result is the package scrabblr. In this package I decided to collate every turn played by two ‘expert’ level computer sims when playing against each other. I used the Quackle Scrabble software for this - a really fantastic piece of open-source software written by Jason Katz-Brown and John O’Laughlin. There are four different computer players in Quackle. Three of these are ‘championship’ players who will spend much longer processing potential moves to find the move with the optimal win percentage. The fourth computer player - ‘Speedy Player’ - picks the move that initially looks best based on score, leave, game position etc. This move is often the optimal one, but sometimes it is not, depending upon board dynamics, unseen tiles etc. For this package, I decided to use Speedy Player rather than any of the Championship players. There are some limitations to this - particularly that the Speedy Player does not always make the best moves in the pre-endgame when there are only a few tiles left in the bag. But, for now, I’m ok with this.

At this point I should make one other major disclaimer and one minor disclaimer. All the scrabble games in the package were played using the CSW15 word source. This is the word source (think dictionary of dictionaries) that is used in International Scrabble tournaments such as the World Championship. It differs from TWL which is the “Tournament Word List” and is used in your home scrabble set in North America. I make no excuses for this - I prefer the Collins (CSW) scrabble dictionary and I switched to playing this ‘code’ of scrabble in tournaments a few years ago.

The minor issue is that Quackle will always start the game playing horizontally. In scrabble you have to touch the DWS on the star of the center square (8H) when playing the first move that isn’t a pass or an exchange, but you can play vertically or horizontally. From my experience of tournament scrabble over 90% of players play horizontally to start with. There is one very notable exception - Nigel Richards always plays vertically e.g. this game.

Getting The Data

I wrote a visual basic script to make Quackle play itself many times on my Windows machine. Each game was played and then two files were stored - the quackle game file (.gcg) and a Speedy Player Report as a .txt file. Between these two files all information related to what tiles were on each player’s rack, what the board position was, what the score was etc. for every turn could be turned into dataframes or matrices.

This is the code I used for the VBScript. There are much better ways of doing this - probably using python or actually modifying the source code of Quackle itself - but I did this because it was straightforward:

Set action_play = createobject("WScript.shell")

action_play.AppActivate "Quackle versus New Player 1 - Quackle"

path_play = "C:\Users\curley1\Desktop\Files2\"

For n = 1 To 2566
    
    action_play.sendkeys "^n", 1
    action_play.SendKeys "~", 1
    
    WScript.sleep 5000
    
    action_play.sendkeys "^s", 1
    WScript.sleep 1000
    action_play.SendKeys path_play & "Game_Number_" & n & ".gcg", 1
    WScript.sleep 1000
    action_play.SendKeys "~", 1
    WScript.sleep 1000
    
    action_play.SendKeys "%p", 1
    action_play.SendKeys "{DOWN}", 1
    action_play.SendKeys "{DOWN}", 1
    action_play.SendKeys "{DOWN}", 1
    action_play.SendKeys "~", 1
    
    WScript.sleep 1000
    action_play.SendKeys path_play & "Report_Number_" & n & ".txt", 1
    WScript.sleep 1000
    action_play.SendKeys "~", 1

    WScript.sleep 5000
Next

Following this, I converted the .gcg files to .txt files using R, cleaned up the raw data, put the tidy data into dataframes and matrices and then compiled all the data into the scrabblr R package. The .R script for doing this is available on GitHub here.
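As a rough illustration of the conversion step, a single move line from a .gcg file can be split into its fields with a regular expression. This is a minimal sketch and not the script linked above: the helper name parse_gcg_move is hypothetical, the example line is reconstructed from the first row of the game 422 table shown later in this post, and it assumes the usual GCG move layout of ">player: rack position play +points score".

```r
# Hypothetical helper: split one GCG move line into its fields.
# Assumes the layout ">player: rack position play +points score".
parse_gcg_move <- function(line) {
  pattern <- "^>([^:]+): +([^ ]+) +([^ ]+) +([^ ]+) +([+-][0-9]+) +(-?[0-9]+)$"
  m <- regmatches(line, regexec(pattern, line))[[1]]
  if (length(m) == 0) return(NULL)  # not a regular move line
  data.frame(player   = m[2],
             rack     = m[3],
             position = m[4],
             play     = m[5],
             points   = as.integer(m[6]),
             score    = as.integer(m[7]),
             stringsAsFactors = FALSE)
}

parse_gcg_move(">Player1: AEEHOOP 8F POOH +18 18")
```

Lines that do not match the pattern (passes, exchanges, comments) return NULL here and would need their own handling.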

Installation

The R package can be installed from GitHub using devtools.

#  devtools::install_github("jalapic/scrabblr")
library(scrabblr)

Data

The scrabblr package contains data from 2566 scrabble games played between two expert level computer players.

Turns dataframe

The main package dataset is called turns. Here are the first 11 columns. You can search for words played using the search box:

DT::datatable(turns[,1:11], rownames = FALSE )

The first two columns identify the unique game and the turn of that game. Shown here are the first 10 turns of game 1. The third column shows which player makes that move. This obviously alternates except for the last two moves. In scrabble, the player who ‘goes out’ first (i.e. uses up all the tiles on their rack with no tiles being left in the bag) then gets double the points value of the tiles on their opponent’s rack. These points are represented as another turn (and therefore row/observation) for that player. The exception to this is if a game ends by six turns of no score in a row - i.e. six passes or exchanges of tiles. In this case, the points value of each player’s rack is subtracted from their own score.

The position refers to where the first letter of the played word is. A blank scrabble board looks like this (complete with bonus squares):

Any position starting with a number is a horizontal play. “8D” refers to a word that started on the 8th row and 4th column - that would mean it started on a DLS (double letter score). Any position starting with a letter is a vertical play. So “I3” refers to a word that starts at the 9th column and 3rd row, which is a double letter square on the standard layout rather than a TWS (the triple word squares sit only at the corners and edge midpoints, e.g. “O1”).
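The notation can be decoded mechanically. The following is a minimal sketch in base R; the helper name parse_position is hypothetical and it only handles real board positions, not the special values 'end', 'pas' and 'xch' that also appear in the position column.

```r
# Hypothetical helper: decode a position such as "8D" or "I3".
parse_position <- function(pos) {
  horizontal <- grepl("^[0-9]", pos)             # number first = horizontal play
  row <- as.integer(gsub("[A-O]", "", pos))      # numeric part is the row
  col <- match(gsub("[0-9]", "", pos), LETTERS)  # letter part is the column
  data.frame(position  = pos,
             row       = row,
             col       = col,
             direction = ifelse(horizontal, "horizontal", "vertical"),
             stringsAsFactors = FALSE)
}

parse_position(c("8D", "I3", "O1"))
```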

The next five variables represent the play made. The variable rack shows the tiles on the rack of each player on each turn. For most turns a player will have 7 tiles. When there are no tiles left in the bag to replace tiles on the rack, then there will be fewer than 7. Racks are shown in alphabetical order and blanks (of which there are two) are shown with a “?”. So the rack “?ABDGIR” is a rack with one blank and six letters.

The variable word represents the actual word in the dictionary (technically ‘word-source’) played. So, for instance, if the word ‘WIRED’ was on the board and a player added ‘RE’ to the front of this word then this variable would have the word ‘REWIRED’ in it.

The variables move and play give more information about the play. In both columns any blanks played are shown as lower case letters (upper case letters are always standard tiles). In the play column any tiles not played by the player are shown within parentheses. So the play ‘CLERIHE(W)’ on row four means that the player played the tiles ‘CLERIHE’ to a ‘W’ already on the board. In the move column all tiles played are shown and tiles that were already on the board are represented by a dot ‘.’.
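To make the play notation concrete, here is a small sketch that recovers the word, the number of tiles placed and the number of blanks placed from a play string. The helper name describe_play is hypothetical; the example strings are taken from the first rows of turns shown in this post.

```r
# Hypothetical helper: summarise strings from the play column.
describe_play <- function(play) {
  own  <- gsub("\\([^)]*\\)", "", play)     # drop tiles already on the board
  word <- toupper(gsub("[()]", "", play))   # full word formed on the board
  data.frame(play          = play,
             word          = word,
             tiles_played  = nchar(own),
             blanks_played = nchar(gsub("[^a-z]", "", own)),
             stringsAsFactors = FALSE)
}

describe_play(c("BARDInG", "FLAVO(n)ES", "CLERIHE(W)", "B(EF)ANAS"))
```

Note that in ‘FLAVO(n)ES’ the lower case ‘n’ sits inside the parentheses: it is a blank that was already on the board, so it does not count as a blank played on that turn.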

The leave variable shows the tiles that are left on the rack of a player after their turn. If they played all their tiles then this is empty. Note - if a player plays all seven tiles from their rack on their turn they get a 50 point bonus and the play is called a ‘bingo’.

The points variable gives the total points for that particular turn, and the score variable is the cumulative score for each player. The final score of each player can be obtained by taking the score on that player’s final turn in each game.

The tile distribution and points value of a standard scrabble game can be quickly glanced at using the tiles dataframe.

tiles

##    tile points number
## 1     A      1      9
## 2     B      3      2
## 3     C      3      2
## 4     D      2      4
## 5     E      1     12
## 6     F      4      2
## 7     G      2      3
## 8     H      4      2
## 9     I      1      9
## 10    J      8      1
## 11    K      5      1
## 12    L      1      4
## 13    M      3      2
## 14    N      1      6
## 15    O      1      8
## 16    P      3      2
## 17    Q     10      1
## 18    R      1      6
## 19    S      1      4
## 20    T      1      6
## 21    U      1      4
## 22    V      4      2
## 23    W      4      2
## 24    X      8      1
## 25    Y      4      2
## 26    Z     10      1
## 27    ?      0      2

The turns dataframe has thirteen other variables. The last four are ‘tws’, ‘dws’, ‘tls’ and ‘dls’, which represent the number of triple word, double word, triple letter and double letter squares played on for that turn.

turns[1:6,c(5,7,21:24)]

##      rack       play tws dws tls dls
## 1 ?ABDGIR    BARDInG   0   1   0   1
## 2 AEFLOSV FLAVO(n)ES   0   0   1   2
## 3 AENRSWY        WYE   1   0   0   0
## 4 CEEHILR CLERIHE(W)   1   0   0   1
## 5 AABNPRS  B(EF)ANAS   0   1   1   0
## 6 AAEGIIK       KAIE   0   0   1   0

The other columns represent summary data regarding the rack or play that could be gleaned from the other variables but are given for convenience. They are the length of the word played, the tiles played, the number of tiles on the rack, the value of the tiles on the rack, and the number of vowels, consonants, esses and blanks on the rack. The variable turn_player gives the cumulative turn of each player in each game.

turns[1:6,c(5,7,12:20)]

##      rack       play word_length tiles_played tiles_rack turn_player
## 1 ?ABDGIR    BARDInG           7            7          7           1
## 2 AEFLOSV FLAVO(n)ES           8            7          7           1
## 3 AENRSWY        WYE           3            3          7           2
## 4 CEEHILR CLERIHE(W)           8            7          7           2
## 5 AABNPRS  B(EF)ANAS           7            5          7           3
## 6 AAEGIIK       KAIE           4            4          7           3
##   rack_value rack_vowels rack_consonants rack_s rack_blanks
## 1         10           2               4      0           1
## 2         13           3               4      1           0
## 3         13           2               5      1           0
## 4         12           3               4      0           0
## 5         11           2               5      1           0
## 6         12           5               2      0           0
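These rack summaries can be recomputed from the rack string alone. The sketch below is my own illustration rather than the code used to build the package: rack_summary is a hypothetical helper, the tile values are copied from the tiles table, and it assumes Y counts as a consonant (which matches the rows shown above).

```r
# Tile values from the tiles table above; "?" is a blank worth 0.
tile_points <- c(A = 1, B = 3, C = 3, D = 2, E = 1, F = 4, G = 2, H = 4, I = 1,
                 J = 8, K = 5, L = 1, M = 3, N = 1, O = 1, P = 3, Q = 10, R = 1,
                 S = 1, T = 1, U = 1, V = 4, W = 4, X = 8, Y = 4, Z = 10, "?" = 0)

vowels <- c("A", "E", "I", "O", "U")

# Hypothetical helper: recompute the rack summary columns for one rack.
rack_summary <- function(rack) {
  tiles <- strsplit(rack, "")[[1]]
  c(rack_value      = sum(tile_points[tiles]),
    rack_vowels     = sum(tiles %in% vowels),
    rack_consonants = sum(tiles %in% setdiff(LETTERS, vowels)),
    rack_s          = sum(tiles == "S"),
    rack_blanks     = sum(tiles == "?"))
}

rack_summary("?ABDGIR")
```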

Final Board Matrices

The second type of data stored in the package is a set of matrices showing the board position of each game. The object finalboards is a list of the final scrabble board of each of the 2566 games. For instance, here is the final board for game 17:

finalboards[[17]]

##    A B C D E F G H I J K L M N O
## 1  =     '       p       '     =
## 2  B O O T   '   r   '       -  
## 3      B A R Q U E "       -    
## 4  ' C O X   I M P A R I T Y   '
## 5    E     -     E     -        
## 6    N       G E N I T U R E '  
## 7    T "       ' S '       " Z  
## 8  = A   '     D I V I   D O I T
## 9    R "       I N ' F L O W N  
## 10   E   K R E N G   F A     C H
## 11   S   A -           -       O
## 12 '     G U A R D I A N -     V
## 13     - O     "   "       -   E
## 14   -   U   M E T E P A S   J A
## 15 W I E L D Y   =       '   O S

The default boards still contain the notation used by Quackle to represent the location of bonus squares. What these notation marks represent can be checked in the notation dataframe. This also gives the total number of each square at the beginning of a game:

notation

##   symbol value total
## 1      '   dls    24
## 2      "   tls    12
## 3      -   dws    17
## 4      =   tws     8
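The idea behind tidying a board is simply to replace these four symbols. The sketch below shows that idea on a toy matrix; it is not the package's own cleaning code, strip_bonus is a hypothetical helper, the 3 x 3 matrix is made up, and I am assuming empty squares are stored as empty strings.

```r
# Toy 3 x 3 corner of a board using the Quackle symbols (not real game data).
toy <- matrix(c("=", "",  "B",
                "",  "-", "O",
                "C", "a", "T"),
              nrow = 3, byrow = TRUE,
              dimnames = list(as.character(1:3), c("A", "B", "C")))

# Hypothetical helper: replace bonus-square symbols with a filler string.
strip_bonus <- function(board, filler = "") {
  board[board %in% c("'", "\"", "-", "=")] <- filler
  board
}

strip_bonus(toy)        # blank out the bonus squares
strip_bonus(toy, ".")   # or mark the bonus squares with a dot
```

Letters, including lower case blanks such as the "a" above, are left untouched.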

To make the board more readable, we can apply either of two functions, depending on which you find easier to read:

board_clean(finalboards[[999]])

##    A B C D E F G H I J K L M N O
## 1      O V E R B E T            
## 2    Q U O D                    
## 3        C         U       R    
## 4      N A N       P       E R R
## 5      A B A       S       T E E
## 6      a   C   Y U T Z     I W I
## 7      N   R       A       L   R
## 8        H E Y I N G       E L D
## 9          O   F E E T     D O  
## 10         U M     S         G  
## 11 S       s I           V   I  
## 12 P O O H   L           O   E  
## 13 E   F A I K         J I G S  
## 14 N         O A T M E A L   T  
## 15 D           X I       A

board_dots(finalboards[[1999]])

##    A B C D E F G H I J K L M N O
## 1  . . . . . . . . . . . . . . .
## 2  . . . . . . . . . Y W R O K E
## 3  . . . . . . Z . P O H I R I .
## 4  . . . . . . O V E N E D . T A
## 5  . . . . . . R . . . W . . . C
## 6  . . . . . Q I . . . . . . . R
## 7  L . . . B A . . . . . . . . I
## 8  I . . F I T N A . V O L T E D
## 9  C . . O N . . T . . . . . . .
## 10 E . M E D . . L . U T . . . .
## 11 N . I F . . J A . . A G A . .
## 12 S . X I . . U N . . P E B A .
## 13 O G E E . u N T I M E L Y . .
## 14 R O D . . . . E . . . . . A .
## 15 S O . . . . . S c R A U G H S

Boards During Games

The board state at any given turn is stored in the object boards. This is a list with one element per game, and each element is itself a list whose index represents the turn in the turns dataframe for that game. For instance, here is the game board for a player on turn 9 of game 1807. The player had the rack - [EINORUX]:

boards[[1807]][[9]]

##    A B C D E F G H I J K L M N O
## 1  =     '       =       '     =
## 2    -       '   O   ' E P O D E
## 3      -       " V A C H E R I N
## 4  '     -       E       -     '
## 5        V I H A R A   O        
## 6    '       "   E   " N     '  
## 7      "       ' A '   S   "    
## 8  =     '       S M O C K     =
## 9      "       ' Y '   R   "    
## 10   '       "       B E     '  
## 11         -         O E        
## 12 '     -       ' W A N -     '
## 13     -       "   U       -    
## 14   -       '     Z '       -  
## 15 =     '       =       '     =

The best play was to make a triple-triple (hit two tws scores giving a x9 bonus to the word score). We can find the play made here:

turns[turns$gameid==1807 & turns$turn_game==9,c(4,5,7:11)]

##       position    rack       play     word leave points score
## 42599       O1 EINORUX X(EN)URINE XENURINE     O    144   330

Oh, of course, X is for XENURINE:

boards[[1807]][[10]]

##    A B C D E F G H I J K L M N O
## 1  =     '       =       '     X
## 2    -       '   O   ' E P O D E
## 3      -       " V A C H E R I N
## 4  '     -       E       -     U
## 5        V I H A R A   O       R
## 6    '       "   E   " N     ' I
## 7      "       ' A '   S   "   N
## 8  =     '       S M O C K     E
## 9      "       ' Y '   R   "    
## 10   '       "       B E     '  
## 11         -         O E        
## 12 '     -       ' W A N -     '
## 13     -       "   U       -    
## 14   -       '     Z '       -  
## 15 =     '       =       '     =
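The 144 points can be checked by hand: the tiles are worth 8 + 1 + 1 + 1 + 1 + 1 + 1 + 1 = 15, the U lands on the double letter square at O4 for one extra point, and the two triple word squares at O1 and O8 multiply the total of 16 by 9. Here is a small sketch of that arithmetic. The helper score_word is hypothetical, it ignores cross-words (there are none for this play) and it does not handle blanks.

```r
# Tile values (from the tiles table) for the letters needed here.
tile_points <- c(E = 1, I = 1, N = 1, R = 1, U = 1, X = 8)

# Hypothetical helper: score a single word, ignoring any cross-words.
# letters      - every letter of the word, in order
# letter_mult  - letter multiplier per square (1 if the tile was already there)
# word_mult    - word multiplier per square (1 if the tile was already there)
# tiles_played - tiles placed from the rack (7 earns the 50 point bonus)
score_word <- function(letters, letter_mult, word_mult, tiles_played) {
  base <- sum(tile_points[letters] * letter_mult)
  base * prod(word_mult) + ifelse(tiles_played == 7, 50, 0)
}

score_word(letters      = c("X", "E", "N", "U", "R", "I", "N", "E"),
           letter_mult  = c(1, 1, 1, 2, 1, 1, 1, 1),  # U on the DLS at O4
           word_mult    = c(3, 1, 1, 1, 1, 1, 1, 3),  # TWS at O1 and O8
           tiles_played = 6)
```

Only six tiles came from the rack, so there is no bingo bonus on top of the x9.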

Analysis

There are obviously many, many things that could be analyzed with these data. I’ll highlight some straightforward descriptive analyses first and then make some suggestions for future work.

Descriptive Analyses

For most analyses I will use the tidyverse suite of packages for ease.

library(tidyverse)

Points per Game

The most sensible starting point seems to be how many points are scored per game by each player. We can get this by filtering for the last row of each player for each game (note - not the maximum score for each player, because if the game ends after six scoreless turns in a row players lose points at the end of the game).

turns %>% 
  group_by(gameid, player) %>% 
  filter(turn_player==max(turn_player)) %>%
  filter(row_number()==max(row_number())) %>%
  select(gameid, player, score) -> final_scores

final_scores %>%
  group_by(player) %>%
  summarize(min=min(score), max=max(score), mean=mean(score), median=median(score), sd=sd(score))

## # A tibble: 2 × 6
##    player   min   max     mean median       sd
##     <chr> <dbl> <dbl>    <dbl>  <dbl>    <dbl>
## 1 Player1   287   731 465.0401    463 60.14975
## 2 Player2   251   663 451.1017    447 62.02244

It appears that Player 1, who goes first, has a slight edge. Let’s plot:

final_scores %>%
  spread(player,score) %>%
  ggplot(aes(Player1, Player2)) + 
  geom_point(alpha=.15) + 
  theme_classic()

We can further investigate the advantage of going first by looking at win proportions.

final_scores %>% 
  spread(player,score) %>%
  ungroup()%>%
  summarise(P1 = sum(Player1>Player2), P2 = sum(Player2>Player1), tie=sum(Player1==Player2), total = n())

## # A tibble: 1 × 4
##      P1    P2   tie total
##   <int> <int> <int> <int>
## 1  1404  1151    11  2566

Only 11 out of 2566 games were ties (0.4%). Player 1 won 1404 games compared to Player 2’s 1151 games. We can test if this is a significant difference using a binomial test.

binom.test(1404,2555)

## 
##  Exact binomial test
## 
## data:  1404 and 2555
## number of successes = 1404, number of trials = 2555, p-value =
## 6.058e-07
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.5299753 0.5689326
## sample estimates:
## probability of success 
##              0.5495108
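As a quick cross-check on the exact test, the same comparison can be done with the normal approximation to the binomial. This is only a sketch of the standard z-test using the win counts above; with a sample this large it should land close to the exact result.

```r
wins_p1 <- 1404
wins_p2 <- 1151
n <- wins_p1 + wins_p2          # 2555 decided games (ties excluded)

p_hat <- wins_p1 / n            # observed share of wins for Player 1
se0   <- sqrt(0.5 * 0.5 / n)    # standard error under the null p = 0.5
z     <- (p_hat - 0.5) / se0    # z statistic
p_val <- 2 * pnorm(-abs(z))     # two-sided p-value

c(p_hat = p_hat, z = z, p_value = p_val)
```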

We can also plot the distribution of Player 1 versus Player 2 scores:

final_scores %>%
  ggplot(aes(score, fill = player)) +
  geom_density(alpha = 0.2) +
  scale_fill_manual(values = c("firebrick", "dodgerblue")) + 
  theme_classic()

We can clearly see that under these game conditions there is an advantage to playing first. Of course, expert players will point to the fallibility of the Speedy Player, particularly during the pre-endgame part of the game, but others have shown from tournament data that there is a slight edge to going first in scrabble.

Highest Scoring Games

From the above table the highest score by any player is 731. We can find the top five scores like this:

final_scores %>%
  arrange(-score) %>%
  head(5)

## Source: local data frame [5 x 3]
## Groups: gameid, player [5]
## 
##   gameid  player score
##    <int>   <chr> <dbl>
## 1    422 Player1   731
## 2   2268 Player1   695
## 3   1733 Player1   688
## 4   2129 Player1   688
## 5   1513 Player1   675

Here’s the board for the 731 game:

gameid_maxscore <- as.numeric(final_scores[final_scores$score==max(final_scores$score),'gameid'])
finalboards[[422]]

##    A B C D E F G H I J K L M N O
## 1  =     V E N U S       '     N
## 2    -       "   K I E L B A S I
## 3      -       W I F E   R - T E
## 4  '     -       '       O   E D
## 5          L I G H t I N G   E  
## 6    "       "       "   U   P  
## 7      '       C   '     E Y E  
## 8  =     '   P O O H     ' A R C
## 9      '   T A X E M E S   R   O
## 10   "   T A D       F E U d " V
## 11     Z A G           N   A   E
## 12 W O O D       Q I N T A R   Y
## 13     R       N I D       M    
## 14   - I     J O     "       -  
## 15 A B L A T O R S       '     =

and here are the moves of Player 1 during this game:

turns %>%
  filter(gameid==gameid_maxscore & player=="Player1") %>%
  select(position, rack, play, word,leave,points,score)

##    position    rack       play     word leave points score
## 1        8F AEEHOOP       POOH     POOH   AEE     18    18
## 2        9E ABEEMTX     TAXEME   TAXEME     B     59    77
## 3       11C AABGSTZ        ZAG      ZAG  ABST     36   113
## 4       15A AABORST AB(L)ATORS ABLATORS          149   262
## 5        M7 ?AAMRRY    YARdARM  YARDARM           88   350
## 6        O8 ACELOVY      COVEY    COVEY    AL     66   416
## 7        2H ABEIIKL KIELBA(S)I KIELBASI           66   482
## 8        3G EEFGISW       WIFE     WIFE   EGS     35   517
## 9        1D EGNNSUV      VENUS    VENUS    GN     57   574
## 10      13G DGHIINN        NID      NID  GHIN     27   601
## 11       5E ?GHIILN LIGHtIN(G) LIGHTING           98   699
## 12       K9      NS   S(E)N(T)     SENT           24   723
## 13      end                                        8   731

We can find the top combined scores like so:

final_scores %>%
  spread(player,score) %>%
  mutate(total = Player1 + Player2) %>%
  arrange(-total)

## Source: local data frame [2,566 x 4]
## Groups: gameid [2,566]
## 
##    gameid Player1 Player2 total
##     <int>   <dbl>   <dbl> <dbl>
## 1     542     556     640  1196
## 2    1867     509     641  1150
## 3    2131     566     560  1126
## 4    2268     695     431  1126
## 5    1051     574     548  1122
## 6     150     533     580  1113
## 7     904     674     435  1109
## 8     563     579     529  1108
## 9    1941     572     533  1105
## 10    350     607     496  1103
## # ... with 2,556 more rows

and plot the distribution like so:

final_scores %>%
  spread(player,score) %>%
  mutate(total = Player1 + Player2) %>%
  ggplot(aes(total)) + 
  geom_histogram(bins=60, fill='gray88',color='black') + 
  theme_classic()

Points per Play

How many points are scored on the average play? For this analysis, I will only consider plays when there are still 7 tiles on a player’s rack, as in the endgame there may be lots of plays of much smaller value.

turns %>% 
  filter(tiles_rack==7) %>%
  ggplot(aes(points)) + 
  geom_histogram(bins=60, fill='gray88',color='black') + 
  theme_classic()

This appears to be a bimodal distribution. This isn’t surprising because most high scoring plays come from bingos, where players get a +50 point bonus for using all 7 tiles. There are also double-double and triple-triple plays, where the values of words are x4 and x9 respectively, which will lead to some extra high scoring plays. We can formally test for a bimodal distribution using the diptest package. The test strongly rejects unimodality, so the distribution is at least bimodal.

diptest::dip.test(turns %>% filter(tiles_rack==7) %>% .$points)

## 
##  Hartigans' dip test for unimodality / multimodality
## 
## data:  turns %>% filter(tiles_rack == 7) %>% .$points
## D = 0.018457, p-value < 2.2e-16
## alternative hypothesis: non-unimodal, i.e., at least bimodal

The distributions of bingos and non-bingos look like this when considered separately:

ggplot(turns %>% filter(tiles_rack==7 & tiles_played<7), aes(points)) + geom_histogram(bins=60, fill='gray88',color='black') + theme_classic() + ggtitle("Non-Bingos") + xlim(0,250)

ggplot(turns %>% filter(tiles_rack==7 & tiles_played==7), aes(points)) + geom_histogram(bins=60, fill='gray88',color='black') + theme_classic() + ggtitle("Bingos") + xlim(0,250)

Highest Scoring Plays

We can also look for the highest scoring individual plays. The top 10 are:

turns %>%
  arrange(-points) %>%
  select(1,2,4,5,7:10,16,22,21) %>%
  head(10) %>%
  DT::datatable(.)

All of these plays are 8 letter word bingos that cover two TWS and use one letter already on the board. This means that they get a x9 value of the word score. This is also obvious from the position the word was played. We can visualize the highest scoring play by looking at the boards.

boards[[1954]][[7]]

##    A B C D E F G H I J K L M N O
## 1  =     '       =       '     R
## 2    -       '       '       - E
## 3      -       "   F       -   P
## 4  '     -       ' R     -     I
## 5          N O O G I E S       Q
## 6    '       "     E "       ' U
## 7      "       ' A N G U L A T E
## 8  =     W A U L E D     '     D
## 9      K E E N O   L       "    
## 10   '       "     Y "       '  
## 11         -           -        
## 12 '     -       '       -     '
## 13     -       "   "       -    
## 14   -       '       '       -  
## 15 =     '       =       '     =

The highest non bingo plays can be found like this:

turns %>%
  filter(tiles_played<7) %>%
  arrange(-points) %>%
  select(1,2,4,5,7:10,16,22,21) %>%
  head(10) %>%
  DT::datatable(.)

Again most of these words achieve extra bonuses by being triple-triples. The highest is the X(EN)URINE play. The second highest is an amazing C(YA)NO(S)ES play. This was the board prior to the play:

boards[[245]][[25]]

##    A B C D E F G H I J K L M N O
## 1  = Y A '   S   =       '     =
## 2    U G     N       '       -  
## 3    M A     I "   "       -    
## 4  '   Z -   F   '       -     '
## 5      E T A T         -        
## 6    '   A W E       "       '  
## 7    N " I   R '   '   R   "    
## 8  V A I L       D W E E B     =
## 9    I n O C U L A '   H E X    
## 10   N   R   P A K   M E T   '  
## 11       E - O R       E T H    
## 12 '     s       V     L O O   '
## 13   J U S     B I O     N E    
## 14   U R   T R E A D I N G   D O
## 15 = D   Q I     L       S P I F

Blanks

The blanks are the most significant part of a scrabble game. When individuals are close in skill level, getting both blanks (double blanking your opponent) is a major advantage. What are the win proportions of player1 and player2 if they played 0, 1 or 2 blanks? (Note - this is different from drawing 0, 1 or 2 blanks, as a player might get a blank but not be able to play it, e.g. if their opponent went out with a bingo before they got to play it.)

turns[grepl("[a-z]",turns$move),] %>%
  mutate(player = factor(player)) %>%
  group_by(gameid,player) %>%
  summarize(total = n()) %>%
  complete(player, fill = list(total = 0)) %>%
  full_join(final_scores) %>%
  group_by(gameid)  %>%
  mutate(result = ifelse(score==max(score) & score==min(score),"T",
                         ifelse(score==max(score) & score>min(score), "W","L")))  %>% 
  filter(!is.na(total)) %>% #one game neither player played a blank
  group_by(player,total) %>% 
  summarise(wins = sum(result=="W"), losses=sum(result=="L"), ties=sum(result=="T")) %>%
  group_by(player) %>%
  mutate(winpct = wins/(wins+losses+ties)) %>%
  ggplot(aes(total,winpct,fill=player)) +
  geom_bar(position='dodge', stat='identity') + 
  theme_classic() + 
  scale_fill_manual(values=c("gray30", 'gray68')) +
  scale_x_continuous(breaks=0:2) + 
  xlab("Blanks Played") +
  ylab("Win Percentage")

Which letters do blanks get played as?

blanks <- function(x) { x[grepl("[a-z]", x)] }
bls <- lapply(finalboards,blanks)

table(unlist(bls))

## 
##   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o   p   q   r 
## 371 102 176 176 465  66 152 106 335   6  57 246 174 308 302 130  17 364 
##   s   t   u   v   w   x   y   z 
## 730 298 237  73  46  16  88  33

Word Length

The distribution of word lengths is as one might expect. Most plays are 3 to 5 letters long. Words of 6, 7 and 8 letters are roughly equally likely. Typically words of 7 or 8 letters are bonus bingo words (although they could be played through other letters like X(EN)URINE). Words of length 6 aren’t particularly useful for scrabble players to learn - usually there are 2 or more tiles worth keeping on a rack that you can’t bingo with - so the optimal strategy is to keep those and play off the 3-5 other tiles. This is more obvious when looking at the distribution of tiles played.

turns %>% filter(position!='end') %>% ggplot(aes(word_length)) + geom_bar() + theme_classic() + scale_x_continuous(breaks=0:12) + ggtitle("Word Length")

turns %>% filter(position!='end') %>% ggplot(aes(tiles_played)) + geom_bar() + theme_classic() + scale_x_continuous(breaks=0:7) + ggtitle("Tiles Played")

Notably there are a number of plays of 0 tiles. These are passes or exchanges. We can find the proportion of these compared to regular plays like this:

turns %>%
  filter(position!='end') %>%
  summarise(pass = sprintf("%.4f", (sum(position=='pas') / n())), 
            exchange = sprintf("%.4f", (sum(position=='xch') / n()) )
  )

##     pass exchange
## 1 0.0010   0.0096

A tiny proportion of moves are exchanges, and passes are rarer still. Passes always occur at the end of the game. It would be interesting to know on what turn exchanges occur.

turns %>%
  filter(position=='xch') %>%
  summarize(mean = mean(turn_player),
            median = median(turn_player), 
            lqr=quantile(turn_player,.25), 
            uqr=quantile(turn_player,.75)
            )

##   mean median lqr uqr
## 1  3.5      2   1   6

Exchanges clearly tend to occur early in games (half of them happen by a player’s second turn), when players are trying to balance their racks to have high scoring potential.

What are the longest words played? We can start by tabulating the variable word_length.

table(turns$word_length)

## 
##     0     2     3     4     5     6     7     8     9    10    11    12 
##  3184  3959 10664 12046 10920  5806  7012  6610   370    44     6     1

Using table shows us that there are some words of length 11 and 12. We can find these:

turns %>%
  arrange(-word_length) %>%
  filter(word_length>=11) %>%
  select(1,2,4,5,7:10,12,13) %>%
  head(10) %>%
  DT::datatable(.)

A quarrelsome grandnephew needed chaperonage due to overfatigue caused by a misdivision of reallocated fermentation.

Common Words

Finding amazing 11+ length plays is not what makes somebody a good scrabble player. The nuts and bolts of the game are knowing those obscure English words that are super important to play regularly. So which words are the most frequently played?

turns %>%
  filter(word_length>1) %>%
  group_by(word) %>%
  summarize(total = n()) %>%
  arrange(-total) %>%
  mutate(pct = sprintf("%.1f", 100*total/2566)) %>%
  head(10)

## # A tibble: 10 × 3
##      word total   pct
##    <fctr> <int> <chr>
## 1      QI   641  25.0
## 2     QAT   239   9.3
## 3     QIN   195   7.6
## 4      XI   150   5.8
## 5      OX   118   4.6
## 6    EUOI   107   4.2
## 7      XU   103   4.0
## 8      ZO    96   3.7
## 9      ZA    94   3.7
## 10     EX    93   3.6

In roughly one game out of four a player plays the word ‘QI’ (an alternative spelling of ‘CHI’ - the life energy in Chinese philosophy). QI is the only two letter word containing the difficult to play ‘Q’ tile. All but one of the other most common words are short words played using the ‘Q’, ‘X’ or ‘Z’ tiles. The exception is a four letter word with no consonants - ‘EUOI’ (a cry of impassioned rapture in ancient Bacchic revels).

Let’s look at Q plays a bit more. Just three ‘Q’ words (QI, QAT and QIN) account for approximately 41% of all plays of the Q tile in this dataset, as the cumulative percentages computed next show.

We can check the move variable for Q words as this variable shows tiles played - not already played tiles on the board. It also contains times when a Q was left on the rack of the player that didn’t go out - but we can filter this out as these tiles did not result in a word.

q <- table(as.character(turns$word[grepl("Q", turns$move)]))

data.frame(q) %>%
  filter(Var1!="") %>%
  arrange(-Freq) %>%
  mutate(wordno = row_number(),
         pct = Freq/sum(Freq),
         cum.pct = cumsum(pct)) -> qdf

Here are the 10 most played ‘Q’ words. As you can see, if you only know a handful of Q words commonly played in scrabble you’ll get much better very quickly!

head(qdf, 10)

##    Var1 Freq wordno        pct   cum.pct
## 1    QI  621      1 0.25000000 0.2500000
## 2   QAT  231      2 0.09299517 0.3429952
## 3   QIN  179      3 0.07206119 0.4150564
## 4   QIS   93      4 0.03743961 0.4524960
## 5  QADI   64      5 0.02576490 0.4782609
## 6  QAID   53      6 0.02133655 0.4995974
## 7   QUA   44      7 0.01771337 0.5173108
## 8  QATS   30      8 0.01207729 0.5293881
## 9   SUQ   26      9 0.01046699 0.5398551
## 10 CINQ   25     10 0.01006441 0.5499195

However, to get really good at scrabble you have to learn exponentially more words for smaller and smaller gains:

qdf %>% 
  ggplot(aes(x=wordno, y=cum.pct)) + 
  geom_path() + 
  theme_classic() + 
  ggtitle("Cumulative Distribution of Q plays") +
  xlab("Unique Q Word") +
  ylab("Cumulative proportion of Q plays")

All Words

One issue with the above word analysis is that it only represents words played by a player from tiles on their own rack. When playing scrabble, finding a place for the word to fit on the board is key. This means that other words are often made by playing your word in parallel to, or adjoining, another word. scrabblr has a function to get all the words made on any board - board_words.

These are all the words on one board:

finalboards[[60]]

##    A B C D E F G H I J K L M N O
## 1  =     '     A D O     J E E L
## 2    -   C   ' T O O T H I E R  
## 3      w H E r E O F     V -    
## 4  '     O       '       E L   '
## 5        P I S T I L   A Y U   R
## 6    '       T       M U   N ' E
## 7      "     O '   A I D   G   B
## 8  =     '   W A U K   I '     A
## 9      "   Z E X   A N T B I R D
## 10   '   W I D E N   " O     ' G
## 11         P           R     M E
## 12 '   L I S     '       -   A S
## 13 Q U A       "   "       - V  
## 14   G R A N T E E   '       I F
## 15 =     '       =       '   N Y

board_words(finalboards[[60]])

##  [1] "ADO"      "JEEL"     "TOOTHIER" "WHEREOF"  "EL"       "PISTIL"  
##  [7] "AYU"      "MU"       "AID"      "WAUK"     "ZEX"      "ANTBIRD" 
## [13] "WIDEN"    "ME"       "LIS"      "AS"       "QUA"      "GRANTEE" 
## [19] "IF"       "NY"       "UG"       "LAR"      "CHOP"     "ZIPS"    
## [25] "STOWED"   "ATE"      "AXE"      "DOO"      "OOF"      "AKA"     
## [31] "MI"       "AUDITOR"  "JIVEY"    "EE"       "LUNG"     "ER"      
## [37] "MAVIN"    "REBADGES" "FY"
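
To make the idea concrete, here is a minimal sketch of how words can be read off a board. It is not the code inside board_words: extract_words is a hypothetical helper, and it assumes the board is a character matrix in which letters are tiles (lowercase for blanks) and every other symbol is an empty or bonus square. It collects every run of two or more letters across each row and then down each column, which appears to be the order used in the output above.

```r
# Hypothetical helper (not the scrabblr implementation): collect every run of
# two or more letters reading across each row, then down each column.
extract_words <- function(board) {
  m <- as.matrix(board)
  # Anything that is not a single letter (bonus markers, empty squares) is a gap
  m[!grepl("^[A-Za-z]$", m)] <- " "
  # Split one row or column into runs of letters and keep runs of length >= 2
  runs <- function(cells) {
    parts <- strsplit(paste(cells, collapse = ""), " +")[[1]]
    parts[nchar(parts) >= 2]
  }
  across <- unlist(lapply(seq_len(nrow(m)), function(i) runs(m[i, ])))
  down   <- unlist(lapply(seq_len(ncol(m)), function(j) runs(m[, j])))
  # Blanks are stored in lowercase, so convert everything to upper case
  toupper(c(across, down))
}

# A made-up 3 x 3 corner of a board; the lowercase n is a blank
toy <- matrix(c("Z", "A", "=",
                "'", "X", "I",
                "-", "'", "n"),
              nrow = 3, byrow = TRUE)

extract_words(toy)
```

On this toy board the across words are ZA and XI, and the down words are AX and IN.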

Let’s use this function to get all the words on each board and look at the cumulative proportion curves for plays involving all the power tiles (QXJZ) as well as the high scoring K tile and the V tile which is often tricky to play.

l <- lapply(finalboards, board_words)
l.u <- unlist(l)
qs <- l.u[grepl("Q", l.u)]
xs <- l.u[grepl("X", l.u)]
js <- l.u[grepl("J", l.u)]
zs <- l.u[grepl("Z", l.u)]
ks <- l.u[grepl("K", l.u)]
vs <- l.u[grepl("V", l.u)]

lapply(list(qs,xs,js,zs,ks,vs), 
       function(x) data.frame(table(as.character(x)))) %>%
      map(~ filter(., Var1!="")) %>%
      map(~ arrange(., -Freq)) %>%
      map(~ mutate(., 
                   wordno = row_number(),
                   pct = Freq/sum(Freq),
                   cum.pct = cumsum(pct))) %>%
      Map(cbind, ., letter = c("Q","X","J","Z","K","V")) %>%
      do.call('rbind', .)  %>%
  ggplot(aes(x=wordno, y=cum.pct, color=letter)) + 
  geom_path() + 
  theme_classic() + 
  ggtitle("Cumulative Distribution of QXJZKV plays") +
  xlab("Unique Word") +
  ylab("Cumulative proportion of words played")

As we can see, a relatively small subset of words accounts for 50% of all the words played that include a Q, X, J or Z. Slightly more K words are needed to reach 50%, but for the V nearly 500 unique words need to be known.
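
Rather than reading the 50% point off the plot, it can be computed directly. The following is a rough sketch using base R only. words_to_cover is a hypothetical helper name, and the toy vector is made up for illustration rather than taken from the package data.

```r
# Hypothetical helper: number of distinct words (most frequent first) needed
# to cover at least `target` of all plays in a character vector of words.
words_to_cover <- function(words, target = 0.5) {
  freq <- sort(table(words), decreasing = TRUE)
  cum_pct <- cumsum(as.numeric(freq)) / sum(freq)
  which(cum_pct >= target)[1]
}

# Toy data, made up for illustration only: 10 plays of 5 distinct words
toy_q <- rep(c("QI", "QAT", "QIN", "QUA", "SUQ"), times = c(5, 2, 1, 1, 1))

words_to_cover(toy_q, 0.5)   # QI alone covers 5 of the 10 plays
words_to_cover(toy_q, 0.75)  # QI, QAT and one more word are needed
```

With the vectors built above, something like sapply(list(Q = qs, X = xs, J = js, Z = zs, K = ks, V = vs), words_to_cover) should give one count per letter.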

Tile Synergy

A critical skill that expert scrabble players need to develop is a feel (or explicit knowledge) of what tiles they should break up on their rack and which ones they should keep. A series of bad leaves will result in a rack that is almost unplayable. Expert players are able to realize when they have to play off a certain letter and make a play that is slightly less valuable than another play because if they don’t play it then their next move will be worse.

Here is one very crude and quick way to look at tile synergy. For every turn in which a player had a full rack, I calculated the average score for that turn across all racks containing each possible combination of two tiles (letters or blanks). For the following visualization I removed blanks because they go well with every letter - I wanted to see which letters had better synergy with each other.

library(viridis)

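# a function to get turn score for each pair of letters on rack
# s: the rack as a single string; tv: the score made on that turn
synergy <- function(s, tv){
  # all two-tile combinations on the rack, one pair per column
  x <- combn(unlist(strsplit(s, split="")), 2)
  # collapse each pair to a two-character label, dropping duplicates
  v <- unique(apply(x, 2, function(z) paste0(z[1], z[2])))
  # assign the turn score to every pair found on the rack
  out <- rep(tv, length(v))
  names(out) <- v
  return(out)
}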

p  <- apply(turns[turns$tiles_rack==7,], 1, function(x) synergy(x[5], x[10]))

data.table::rbindlist(lapply(p, function(x) 
  data.frame(value=as.numeric(as.character(x)), 
             tiles=as.character(names(x))
  )
))  %>%
  group_by(tiles) %>%
  summarize(val = mean(value, na.rm=T)) %>%
  mutate(tiles = as.character(tiles),
         tile1 = substr(tiles,1,1),
         tile2 = substr(tiles,2,2)) -> df2

df2$tile1 <- factor(df2$tile1, levels=LETTERS[1:26])
df2$tile2 <- factor(df2$tile2, levels=LETTERS[26:1])

df2 %>%  
  filter(!grepl("\\?", tiles)) %>%
  ggplot(aes(tile1, tile2, fill = val)) + 
  geom_tile(colour="gray5", size=.75, stat="identity") + 
  scale_x_discrete(expand = c(0, 0)) +
  scale_y_discrete(expand = c(0, 0)) +
  scale_fill_viridis(guide = guide_legend(title = "Average Points Per Turn", title.position = "top"),
                     option="A",  limits=c(25,50)) +
  xlab("") + 
  ylab("") +
  ggtitle("Scrabble Tile Synergy") +
  theme(
    plot.title = element_text(color="black",hjust=0,vjust=2.5, size=rel(1.7)),
    plot.background = element_rect(fill="white"),
    panel.background = element_rect(fill="white"),
    panel.border = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    axis.line = element_blank(),
    axis.ticks = element_blank(), 
    axis.text = element_text(color="black", size=rel(1.3)),
    axis.text.y  = element_text(hjust=1),
    legend.text = element_text(color="black", size=rel(1.3)),
    legend.background = element_rect(fill="white"),
    legend.position = "bottom"
  )

There are some pretty straightforward conclusions from this. Having a Z or an S is nearly always going to lead to a high scoring play. They are especially good when there is a ZA combination (‘ZA’ is an acceptable word meaning pizza), and SE is super-strong. Conversely, the Q is awful. The V and W tiles are also pretty bad. You also don’t want to have two of the same tile (except for SS) - that really kills scoring. In particular, UU is an absolutely brutal combination. There are some other clashes - PB is not a great combination and should be avoided. Also, the F is a surprisingly bad tile, even though it is worth four points and forms lots of two letter words, meaning it can easily be played two ways on bonus squares.

There are some other combinations that do well despite their letters not being that wonderful. Unsurprisingly, the CH combination is ok and so is QU. The G is not a good tile except when played with the I or N - presumably because the -ING suffix makes many bingos.

This is a very crude analysis and has many limitations. The value of tile combinations will vary by turn, by stage of the game and with the configuration of the board.
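
To make the pair-averaging concrete, here is a toy version in base R. The racks and scores are made up, and rack_pairs is a hypothetical helper. Unlike the synergy function above, it sorts the letters within each pair so that, for example, ZA and AZ are counted as the same combination. That is a choice made for this sketch, not what the analysis above does.

```r
# Toy racks and the score made on each turn (made-up numbers)
racks  <- c("AEINSTZ", "AEIQUUV", "AEINRST")
scores <- c(62, 12, 70)

# Unique unordered two-tile pairs on one rack, letters sorted within each pair
rack_pairs <- function(rack) {
  tiles <- sort(strsplit(rack, "")[[1]])
  unique(combn(tiles, 2, FUN = paste, collapse = ""))
}

# One row per (rack, pair), carrying the score made with that rack
pairs <- lapply(racks, rack_pairs)
long <- data.frame(
  pair  = unlist(pairs),
  score = rep(scores, lengths(pairs))
)

# Average score per turn for each pair
pair_means <- tapply(long$score, long$pair, mean)
pair_means[c("AE", "ST", "UU")]
```

Here AE appears on all three racks, so its value is the mean of all three scores, while UU only appears on the low-scoring rack. With so few racks the numbers mean nothing, which is the same limitation as above in miniature: each pair inherits the score of everything else that happened on that turn.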

Space Usage

Given the data available in scrabblr it is possible to determine and visualize which squares of a scrabble board get played more or less frequently. We can also break this down by all tiles or specific tiles. Below I have visualized the relative likelihood of each square being played upon.

When looking at ‘any’ it is clear that the pattern resembles a scrabble board. The most frequently played on squares are those with bonuses. In particular going diagonally out from the center are the DWS. The TWS are also played very frequently, but most especially H15. Interestingly the top left quadrant of the board is much less likely to be played in.

When looking at blanks we can see again that the bonus squares are often played on. However, we can also see a path of darker, higher frequency squares along the sides of the board (particularly the bottom and right). Closer to the center, the stretches between rows 4 and 11 in columns E and K are more likely to have blanks played on them. These would represent words that are double-doubles (a word covering two DWS for a x4 bonus). Again the top-left quadrant is less occupied.

I’ve chosen to visualize the pattern also for Q and Z. Here it is clear that the most likely landing spots are the TLS and to a lesser extent the DLS - this pattern is most striking close to the center of the board. There is also a really interesting high frequency landing spot for Q at position C9. Also there are relatively more Q plays in the top-left quadrant compared to the lower-right quadrant.

# A function to calculate the frequency of times each square is played on by 
# any letter, a blank or a specific letter.

board_list2df <- function(l, type="alpha"){

  # flatten each board to a vector of squares, stripping non-letter bonus markers
  lval <- lapply(l, function(x) sub("[^[:alpha:]]+", "", unlist(matrix(x))))

  # keep only the requested letter, or only blanks (stored as lowercase letters)
  if(type!="alpha" && type!="blank") { lval <- lapply(lval, function(z) ifelse(z %in% type, z, "")) }
  if(type=="blank") { lval <- lapply(lval, function(z) ifelse(grepl("[a-z]", z), z, "")) }

  # one row per square, one column per board
  lvaldf <- as.data.frame.matrix(do.call('cbind', lval), stringsAsFactors = FALSE)

  # number of boards on which each square is occupied
  vals <- apply(lvaldf, 1, function(y) sum(grepl("[[:alpha:]]", y)))

  valsdf <- data.frame(var1 = rep(LETTERS[1:15], each=15),
                       var2 = rep(1:15, 15),
                       vals, stringsAsFactors = FALSE)
  # proportion of boards on which each square is occupied
  valsdf$vals1 <- valsdf$vals/ncol(lvaldf)

  return(list('raw_data'=lvaldf, 'summary'=valsdf))
}


# Get results for different tiles and
# scale data relative to each board in a 0-1 range

rbind(
board_list2df(finalboards)[[2]] %>% mutate(tiles="Any", vals2=(vals1-min(vals1)) / (max(vals1)-min(vals1))),
board_list2df(finalboards, "blank")[[2]] %>% mutate(tiles="Blank", vals2=(vals1-min(vals1))/(max(vals1)-min(vals1))),
board_list2df(finalboards, "Q")[[2]] %>% mutate(tiles="Q", vals2=(vals1-min(vals1))/(max(vals1)-min(vals1))),
board_list2df(finalboards, "Z")[[2]] %>% mutate(tiles="Z", vals2=(vals1-min(vals1)) /(max(vals1)-min(vals1)))
) -> tilespace


ggplot(tilespace, aes(var1, var2, fill = vals2)) + 
  geom_tile(colour="gray5", size=.75, stat="identity") + 
  scale_fill_viridis(guide = guide_legend(title = "Relative Occupancy Likelihood", title.position = "top"),
                     option="A",direction=-1) +
  scale_y_reverse(breaks=1:15) +
  xlab("") + 
  ylab("") +
  ggtitle("Likely Board Position of Scrabble Tiles") +
  facet_wrap(~tiles) +
  theme(
    plot.title = element_text(color="black",hjust=0,vjust=3, size=rel(1.7)),
    plot.background = element_rect(fill="white"),
    strip.background = element_blank(),
    strip.text = element_text(face = "bold", size = 15),
    panel.background = element_blank(),
    panel.border = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    axis.line = element_blank(),
    axis.ticks = element_blank(), 
    axis.text = element_text(color="black", size=rel(1.2)),
    axis.text.y  = element_text(hjust=1),
    axis.text.x = element_text(margin=unit(c(-0.2,-0.2,-0.2,-0.2), "cm")),
    legend.text = element_text(color="black", size=rel(1.3)),
    legend.background = element_rect(fill="white"),
    legend.position = "bottom"
  )
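
The two steps behind this plot, counting how often each square is occupied and then rescaling each set of results to a 0-1 range, can be seen on a much smaller example. This is only a sketch: the 2 x 2 boards are made up and rescale01 is a hypothetical helper name.

```r
# Two made-up 2 x 2 final boards; non-letter characters mark empty squares
boards <- list(
  matrix(c("A", "=", "T", "'"), nrow = 2),
  matrix(c("Q", "I", "-", "'"), nrow = 2)
)

# Proportion of boards on which each square holds a tile
occupied  <- lapply(boards, function(b) grepl("[[:alpha:]]", b))
occupancy <- matrix(Reduce(`+`, occupied) / length(boards), nrow = 2)

# Min-max rescale to the 0-1 range, as done for each facet before plotting
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))
scaled <- rescale01(occupancy)
scaled
```

Because each set of results is rescaled separately, the colours are only comparable within a panel, not between panels. Note also that this kind of rescaling divides by zero if every square has the same occupancy, which would happen for a tile that is never played.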

Conclusions & Contact Me

Thanks if you managed to read this far. Please use the package and explore interesting questions with these data. I’ve only just scratched the surface of what is possible. I will add more games to the database if there is enough demand. I could also add games by the Championship players. If you have any comments, suggestions or questions, please contact me via twitter.