Project 1 - Chess Tournament

Jeff Parks

2021-09-19

Project Description

For this project, we’re provided a text file with chess tournament results in a tabular format. Our task is to generate a .CSV file with the following information for all of the players:

  • Player’s Name
  • Player’s State
  • Total Number of Points
  • Player’s Pre-Rating
  • Average Pre Chess Rating of Opponents

sample of tournamentinfo.txt

Data Loading & Transformations

To accomplish this task, we’ll create three subtables; one of Player data, one of Match data, and a third table that counts how many matches each Player completed.

As we can see, our source data is not well-formatted (or ‘tidy’) with each row representing a single observation of player performance; instead, each player has information stored over two consecutive rows.

This tabular format makes for a nice visual presentation, but it poses a challenge for processing and analysis. We’ll take two different approaches to tidying this data - but first let’s load it in using read_delim with the pipe character (|) as our separator.

library(tidyverse)
df <- read_delim('https://raw.githubusercontent.com/jefedigital/cuny-data-607-projects/main/chess-tournament/data/tournamentinfo.txt', 
                 delim='|', skip=2, trim_ws=TRUE, show_col_types = FALSE) %>% 
  rename(player_id=1, player_name=2)
  
df <- select(df,1:length(df)-1) # drop last col
## # A tibble: 6 × 10
##   player_id         player_name  Pts   `1`   `2`   `3`   `4`   `5`   `6`   `7`  
##   <chr>             <chr>        <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 ----------------… <NA>         <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
## 2 1                 GARY HUA     6.0   W  39 W  21 W  18 W  14 W   7 D  12 D   4
## 3 ON                15445895 / … N:2   W     B     W     B     W     B     W    
## 4 ----------------… <NA>         <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
## 5 2                 DAKSHESH DA… 6.0   W  63 W  58 L   4 W  17 W  16 W  20 W   7
## 6 MI                14598900 / … N:2   B     W     B     W     B     W     B

Players table

First, let’s create a Player table with tournament id, name, state, and pre-tournament rating, so that each player’s information is represented on a unique row.

First, we’ll select the first two columns from df, and remove all the ‘separator’ rows (where player_name is NA.)

# select first two cols, drop rows where player_id is NA 
df_players <- select(df,1:2) %>% 
  filter(!is.na(df['player_name'])) 
## # A tibble: 6 × 2
##   player_id player_name                
##   <chr>     <chr>                      
## 1 1         GARY HUA                   
## 2 ON        15445895 / R: 1794   ->1817
## 3 2         DAKSHESH DARURI            
## 4 MI        14598900 / R: 1553   ->1663
## 5 3         ADITYA BAJAJ               
## 6 MI        14959604 / R: 1384   ->1640

Next we’ll want to condense each two-row sequence to a single row; creating separate dataframes of even and odd rows, then binding them together column-wise.

# create two dfs from even and odd rows, bind column-wise and relabel
df_players_odd <- filter(df_players,row_number() %% 2 == 1)
df_players_even <- filter(df_players,row_number() %% 2 == 0)

df_players <- bind_cols(df_players_odd, df_players_even)
names(df_players) <- c('player_id','name','state','player_info')
## # A tibble: 6 × 4
##   player_id name                state player_info                
##   <chr>     <chr>               <chr> <chr>                      
## 1 1         GARY HUA            ON    15445895 / R: 1794   ->1817
## 2 2         DAKSHESH DARURI     MI    14598900 / R: 1553   ->1663
## 3 3         ADITYA BAJAJ        MI    14959604 / R: 1384   ->1640
## 4 4         PATRICK H SCHILLING MI    12616049 / R: 1716   ->1744
## 5 5         HANSHI ZUO          MI    14601533 / R: 1655   ->1690
## 6 6         HANSEN SONG         OH    15055204 / R: 1686   ->1687

FInally, we’ll use a regex to grab the player’s pre-tournament rating from the middle of player_info.

# extract player pre-rating, convert player_id to numeric, drop player_info 
df_players <- df_players %>% 
  mutate(pre_rating = as.numeric(str_extract(player_info,'(?<=: ).{4}'))) %>% 
  mutate(player_id = as.numeric(player_id)) %>% 
  select(!player_info)
## # A tibble: 6 × 4
##   player_id name                state pre_rating
##       <dbl> <chr>               <chr>      <dbl>
## 1         1 GARY HUA            ON          1794
## 2         2 DAKSHESH DARURI     MI          1553
## 3         3 ADITYA BAJAJ        MI          1384
## 4         4 PATRICK H SCHILLING MI          1716
## 5         5 HANSHI ZUO          MI          1655
## 6         6 HANSEN SONG         OH          1686

Matches table

Tidying up for our Matches table will be simpler; looking at our original df, the scores for each tournament round are contained in the rows where player_id is a number. We should be able to simply filter rows using as.numeric, but there’s an extra step required!

The player_id column is character type, and we didn’t coerce it to numeric in the first step because we needed to retrieve the State and Pre-Rating values. Instead, let’s create an array with as.numeric and filter on the resulting NA values instead:

# only keep rows with a numeric value in first column
df_matches <- filter(df, !is.na(as.numeric(df$player_id))) 

# drop cols we don't need
df_matches <- select(df_matches,-player_name,-Pts)
## # A tibble: 6 × 8
##   player_id `1`   `2`   `3`   `4`   `5`   `6`   `7`  
##   <chr>     <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1         W  39 W  21 W  18 W  14 W   7 D  12 D   4
## 2 2         W  63 W  58 L   4 W  17 W  16 W  20 W   7
## 3 3         L   8 W  61 W  25 W  21 W  11 W  13 W  12
## 4 4         W  23 D  28 W   2 W  26 D   5 W  19 D   1
## 5 5         W  45 W  37 D  12 D  13 D   4 W  14 W  17
## 6 6         W  34 D  29 L  11 W  35 D  10 W  27 W  21

Next, we’ll use pivot_longer to pivot the data so that each row represents a single round’s results from a single player:

# pivot all columns except player_id
df_matches <- pivot_longer(df_matches,!player_id, 
                           names_to='round', 
                           values_to='match_info') 
## # A tibble: 6 × 3
##   player_id round match_info
##   <chr>     <chr> <chr>     
## 1 1         1     W  39     
## 2 1         2     W  21     
## 3 1         3     W  18     
## 4 1         4     W  14     
## 5 1         5     W   7     
## 6 1         6     D  12

For brevity, the next code chunk accomplishes several tasks:

  • Extract the outcome code (W,L,D…) and the opponent_id from match_info.
  • Filter the rows on outcome so we only keep Wins, Losses and Draws.
  • Create a match_points column which uses ifelse to award 1 point for a Win and .5 points for a Draw.
# separate match_info into outcome and opponent_id; filter for wins, losses 
# and draws; calculate points per match; convert to numeric; drop match_info.

df_matches <- df_matches %>% 
  mutate(outcome=str_sub(match_info,1,1), 
         opponent_id=str_sub(match_info,str_length(match_info)-1)) %>%
  filter(outcome == 'W' | outcome =='L' | outcome == 'D') %>% 
  mutate(match_points = ifelse(outcome == 'W', 1, ifelse(outcome == 'D', 0.5, 0))) %>% 
  mutate(player_id = as.numeric(player_id), 
         opponent_id = as.numeric(opponent_id)) %>%
  select(!match_info) # 
## # A tibble: 6 × 5
##   player_id round outcome opponent_id match_points
##       <dbl> <chr> <chr>         <dbl>        <dbl>
## 1         1 1     W                39          1  
## 2         1 2     W                21          1  
## 3         1 3     W                18          1  
## 4         1 4     W                14          1  
## 5         1 5     W                 7          1  
## 6         1 6     D                12          0.5

And as a final step, we’ll join in the opponent’s pre-tournament ratings from our new Players table.

# join in the opponent's pre-ratings; rename column
df_matches <- df_matches %>% 
  left_join(df_players[c('player_id','pre_rating')], 
            by = c('opponent_id' = 'player_id')) %>% 
  rename(opponent_pre_rating = pre_rating)
## # A tibble: 6 × 6
##   player_id round outcome opponent_id match_points opponent_pre_rating
##       <dbl> <chr> <chr>         <dbl>        <dbl>               <dbl>
## 1         1 1     W                39          1                  1436
## 2         1 2     W                21          1                  1563
## 3         1 3     W                18          1                  1600
## 4         1 4     W                14          1                  1610
## 5         1 5     W                 7          1                  1649
## 6         1 6     D                12          0.5                1663

Matches per Player table

For our third table, let’s run a quick group_by on the Matches table and count the number of matches completed per player.

# calculate number of matches per player
df_matches_played <- df_matches %>%
  group_by(player_id) %>%
  summarize(matches_played = n())
## # A tibble: 6 × 2
##   player_id matches_played
##       <dbl>          <int>
## 1         1              7
## 2         2              7
## 3         3              7
## 4         4              7
## 5         5              7
## 6         6              7

Final Scores and Opponent Ratings

With our Matches table, let’s total up the match scores for each player, and get the average pre-rating of the opponents they each faced.

df_player_scores <- df_matches %>% 
  group_by(player_id) %>% 
  summarize(total_points=sum(match_points), 
            avg_opponent_pre_rating=round(mean(opponent_pre_rating))) %>%
  arrange(player_id)
## # A tibble: 6 × 3
##   player_id total_points avg_opponent_pre_rating
##       <dbl>        <dbl>                   <dbl>
## 1         1          6                      1605
## 2         2          6                      1469
## 3         3          6                      1564
## 4         4          5.5                    1574
## 5         5          5.5                    1501
## 6         6          5                      1519

Results

Joining our player scores to our Players table gives us the final result.

df_final <- df_players %>% 
  left_join(df_player_scores, by='player_id') %>%
  left_join(df_matches_played, by='player_id') %>% 
  relocate(matches_played, .after=state) %>% 
  relocate(total_points, .after=matches_played) # rearrange column order

df_final_all_players <- df_final %>% 
  select(!matches_played)
## # A tibble: 6 × 6
##   player_id name                state total_points pre_rating avg_opponent_pre_…
##       <dbl> <chr>               <chr>        <dbl>      <dbl>              <dbl>
## 1         1 GARY HUA            ON             6         1794               1605
## 2         2 DAKSHESH DARURI     MI             6         1553               1469
## 3         3 ADITYA BAJAJ        MI             6         1384               1564
## 4         4 PATRICK H SCHILLING MI             5.5       1716               1574
## 5         5 HANSHI ZUO          MI             5.5       1655               1501
## 6         6 HANSEN SONG         OH             5         1686               1519

We can filter this using our Matches per Player table compare players who completed all seven rounds, to those who completed fewer than seven:

df_final_all_matches <- df_final %>% 
  filter(matches_played == 7) %>%
  select(!matches_played)
## tibble [41 × 6] (S3: tbl_df/tbl/data.frame)
##  $ player_id              : num [1:41] 1 2 3 4 5 6 7 8 9 10 ...
##  $ name                   : chr [1:41] "GARY HUA" "DAKSHESH DARURI" "ADITYA BAJAJ" "PATRICK H SCHILLING" ...
##  $ state                  : chr [1:41] "ON" "MI" "MI" "MI" ...
##  $ total_points           : num [1:41] 6 6 6 5.5 5.5 5 5 5 5 5 ...
##  $ pre_rating             : num [1:41] 1794 1553 1384 1716 1655 ...
##  $ avg_opponent_pre_rating: num [1:41] 1605 1469 1564 1574 1501 ...
df_final_some_matches <- df_final %>%
  filter(matches_played < 7) %>%
  select(!matches_played)
## tibble [23 × 6] (S3: tbl_df/tbl/data.frame)
##  $ player_id              : num [1:23] 12 16 22 27 29 36 37 38 41 44 ...
##  $ name                   : chr [1:23] "KENNETH J TACK" "MIKE NIKITIN" "EUGENE L MCCLURE" "GAURAV GIDWANI" ...
##  $ state                  : chr [1:23] "MI" "MI" "MI" "MI" ...
##  $ total_points           : num [1:23] 4 3.5 3.5 3.5 3.5 3 2 2.5 2 2 ...
##  $ pre_rating             : num [1:23] 1663 1604 1555 1552 1602 ...
##  $ avg_opponent_pre_rating: num [1:23] 1506 1386 1300 1222 1314 ...

Output

write_csv(df_final_all_players, 'output/all_players.csv')
write_csv(df_final_all_matches, 'output/played_all_matches.csv')
write_csv(df_final_some_matches, 'output/played_some_matches.csv')