Project Description
For this project, we’re provided a text file with chess tournament results in a tabular format. Our task is to generate a .CSV file with the following information for all of the players:
- Player’s Name
- Player’s State
- Total Number of Points
- Player’s Pre-Rating
- Average Pre Chess Rating of Opponents
sample of tournamentinfo.txt
Data Loading & Transformations
To accomplish this task, we’ll create three subtables; one of Player data, one of Match data, and a third table that counts how many matches each Player completed.
As we can see, our source data is not well-formatted (or ‘tidy’) with each row representing a single observation of player performance; instead, each player has information stored over two consecutive rows.
This tabular format makes for a nice visual presentation, but it poses a challenge for processing and analysis. We’ll take two different approaches to tidying this data - but first let’s load it in using read_delim with the pipe character (|) as our separator.
library(tidyverse)
df <- read_delim('https://raw.githubusercontent.com/jefedigital/cuny-data-607-projects/main/chess-tournament/data/tournamentinfo.txt',
delim='|', skip=2, trim_ws=TRUE, show_col_types = FALSE) %>%
rename(player_id=1, player_name=2)
df <- select(df,1:length(df)-1) # drop last col## # A tibble: 6 × 10
## player_id player_name Pts `1` `2` `3` `4` `5` `6` `7`
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 ----------------… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 2 1 GARY HUA 6.0 W 39 W 21 W 18 W 14 W 7 D 12 D 4
## 3 ON 15445895 / … N:2 W B W B W B W
## 4 ----------------… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 5 2 DAKSHESH DA… 6.0 W 63 W 58 L 4 W 17 W 16 W 20 W 7
## 6 MI 14598900 / … N:2 B W B W B W B
Players table
First, let’s create a Player table with tournament id, name, state, and pre-tournament rating, so that each player’s information is represented on a unique row.
First, we’ll select the first two columns from df, and remove all the ‘separator’ rows (where player_name is NA.)
# select first two cols, drop rows where player_id is NA
df_players <- select(df,1:2) %>%
filter(!is.na(df['player_name'])) ## # A tibble: 6 × 2
## player_id player_name
## <chr> <chr>
## 1 1 GARY HUA
## 2 ON 15445895 / R: 1794 ->1817
## 3 2 DAKSHESH DARURI
## 4 MI 14598900 / R: 1553 ->1663
## 5 3 ADITYA BAJAJ
## 6 MI 14959604 / R: 1384 ->1640
Next we’ll want to condense each two-row sequence to a single row; creating separate dataframes of even and odd rows, then binding them together column-wise.
# create two dfs from even and odd rows, bind column-wise and relabel
df_players_odd <- filter(df_players,row_number() %% 2 == 1)
df_players_even <- filter(df_players,row_number() %% 2 == 0)
df_players <- bind_cols(df_players_odd, df_players_even)
names(df_players) <- c('player_id','name','state','player_info')## # A tibble: 6 × 4
## player_id name state player_info
## <chr> <chr> <chr> <chr>
## 1 1 GARY HUA ON 15445895 / R: 1794 ->1817
## 2 2 DAKSHESH DARURI MI 14598900 / R: 1553 ->1663
## 3 3 ADITYA BAJAJ MI 14959604 / R: 1384 ->1640
## 4 4 PATRICK H SCHILLING MI 12616049 / R: 1716 ->1744
## 5 5 HANSHI ZUO MI 14601533 / R: 1655 ->1690
## 6 6 HANSEN SONG OH 15055204 / R: 1686 ->1687
FInally, we’ll use a regex to grab the player’s pre-tournament rating from the middle of player_info.
# extract player pre-rating, convert player_id to numeric, drop player_info
df_players <- df_players %>%
mutate(pre_rating = as.numeric(str_extract(player_info,'(?<=: ).{4}'))) %>%
mutate(player_id = as.numeric(player_id)) %>%
select(!player_info)## # A tibble: 6 × 4
## player_id name state pre_rating
## <dbl> <chr> <chr> <dbl>
## 1 1 GARY HUA ON 1794
## 2 2 DAKSHESH DARURI MI 1553
## 3 3 ADITYA BAJAJ MI 1384
## 4 4 PATRICK H SCHILLING MI 1716
## 5 5 HANSHI ZUO MI 1655
## 6 6 HANSEN SONG OH 1686
Matches table
Tidying up for our Matches table will be simpler; looking at our original df, the scores for each tournament round are contained in the rows where player_id is a number. We should be able to simply filter rows using as.numeric, but there’s an extra step required!
The player_id column is character type, and we didn’t coerce it to numeric in the first step because we needed to retrieve the State and Pre-Rating values. Instead, let’s create an array with as.numeric and filter on the resulting NA values instead:
# only keep rows with a numeric value in first column
df_matches <- filter(df, !is.na(as.numeric(df$player_id)))
# drop cols we don't need
df_matches <- select(df_matches,-player_name,-Pts)## # A tibble: 6 × 8
## player_id `1` `2` `3` `4` `5` `6` `7`
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 W 39 W 21 W 18 W 14 W 7 D 12 D 4
## 2 2 W 63 W 58 L 4 W 17 W 16 W 20 W 7
## 3 3 L 8 W 61 W 25 W 21 W 11 W 13 W 12
## 4 4 W 23 D 28 W 2 W 26 D 5 W 19 D 1
## 5 5 W 45 W 37 D 12 D 13 D 4 W 14 W 17
## 6 6 W 34 D 29 L 11 W 35 D 10 W 27 W 21
Next, we’ll use pivot_longer to pivot the data so that each row represents a single round’s results from a single player:
# pivot all columns except player_id
df_matches <- pivot_longer(df_matches,!player_id,
names_to='round',
values_to='match_info') ## # A tibble: 6 × 3
## player_id round match_info
## <chr> <chr> <chr>
## 1 1 1 W 39
## 2 1 2 W 21
## 3 1 3 W 18
## 4 1 4 W 14
## 5 1 5 W 7
## 6 1 6 D 12
For brevity, the next code chunk accomplishes several tasks:
- Extract the outcome code (W,L,D…) and the opponent_id from match_info.
- Filter the rows on outcome so we only keep Wins, Losses and Draws.
- Create a match_points column which uses ifelse to award 1 point for a Win and .5 points for a Draw.
# separate match_info into outcome and opponent_id; filter for wins, losses
# and draws; calculate points per match; convert to numeric; drop match_info.
df_matches <- df_matches %>%
mutate(outcome=str_sub(match_info,1,1),
opponent_id=str_sub(match_info,str_length(match_info)-1)) %>%
filter(outcome == 'W' | outcome =='L' | outcome == 'D') %>%
mutate(match_points = ifelse(outcome == 'W', 1, ifelse(outcome == 'D', 0.5, 0))) %>%
mutate(player_id = as.numeric(player_id),
opponent_id = as.numeric(opponent_id)) %>%
select(!match_info) # ## # A tibble: 6 × 5
## player_id round outcome opponent_id match_points
## <dbl> <chr> <chr> <dbl> <dbl>
## 1 1 1 W 39 1
## 2 1 2 W 21 1
## 3 1 3 W 18 1
## 4 1 4 W 14 1
## 5 1 5 W 7 1
## 6 1 6 D 12 0.5
And as a final step, we’ll join in the opponent’s pre-tournament ratings from our new Players table.
# join in the opponent's pre-ratings; rename column
df_matches <- df_matches %>%
left_join(df_players[c('player_id','pre_rating')],
by = c('opponent_id' = 'player_id')) %>%
rename(opponent_pre_rating = pre_rating)## # A tibble: 6 × 6
## player_id round outcome opponent_id match_points opponent_pre_rating
## <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 1 W 39 1 1436
## 2 1 2 W 21 1 1563
## 3 1 3 W 18 1 1600
## 4 1 4 W 14 1 1610
## 5 1 5 W 7 1 1649
## 6 1 6 D 12 0.5 1663
Matches per Player table
For our third table, let’s run a quick group_by on the Matches table and count the number of matches completed per player.
# calculate number of matches per player
df_matches_played <- df_matches %>%
group_by(player_id) %>%
summarize(matches_played = n())## # A tibble: 6 × 2
## player_id matches_played
## <dbl> <int>
## 1 1 7
## 2 2 7
## 3 3 7
## 4 4 7
## 5 5 7
## 6 6 7
Final Scores and Opponent Ratings
With our Matches table, let’s total up the match scores for each player, and get the average pre-rating of the opponents they each faced.
df_player_scores <- df_matches %>%
group_by(player_id) %>%
summarize(total_points=sum(match_points),
avg_opponent_pre_rating=round(mean(opponent_pre_rating))) %>%
arrange(player_id)## # A tibble: 6 × 3
## player_id total_points avg_opponent_pre_rating
## <dbl> <dbl> <dbl>
## 1 1 6 1605
## 2 2 6 1469
## 3 3 6 1564
## 4 4 5.5 1574
## 5 5 5.5 1501
## 6 6 5 1519
Results
Joining our player scores to our Players table gives us the final result.
df_final <- df_players %>%
left_join(df_player_scores, by='player_id') %>%
left_join(df_matches_played, by='player_id') %>%
relocate(matches_played, .after=state) %>%
relocate(total_points, .after=matches_played) # rearrange column order
df_final_all_players <- df_final %>%
select(!matches_played)## # A tibble: 6 × 6
## player_id name state total_points pre_rating avg_opponent_pre_…
## <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 GARY HUA ON 6 1794 1605
## 2 2 DAKSHESH DARURI MI 6 1553 1469
## 3 3 ADITYA BAJAJ MI 6 1384 1564
## 4 4 PATRICK H SCHILLING MI 5.5 1716 1574
## 5 5 HANSHI ZUO MI 5.5 1655 1501
## 6 6 HANSEN SONG OH 5 1686 1519
We can filter this using our Matches per Player table compare players who completed all seven rounds, to those who completed fewer than seven:
df_final_all_matches <- df_final %>%
filter(matches_played == 7) %>%
select(!matches_played)## tibble [41 × 6] (S3: tbl_df/tbl/data.frame)
## $ player_id : num [1:41] 1 2 3 4 5 6 7 8 9 10 ...
## $ name : chr [1:41] "GARY HUA" "DAKSHESH DARURI" "ADITYA BAJAJ" "PATRICK H SCHILLING" ...
## $ state : chr [1:41] "ON" "MI" "MI" "MI" ...
## $ total_points : num [1:41] 6 6 6 5.5 5.5 5 5 5 5 5 ...
## $ pre_rating : num [1:41] 1794 1553 1384 1716 1655 ...
## $ avg_opponent_pre_rating: num [1:41] 1605 1469 1564 1574 1501 ...
df_final_some_matches <- df_final %>%
filter(matches_played < 7) %>%
select(!matches_played)## tibble [23 × 6] (S3: tbl_df/tbl/data.frame)
## $ player_id : num [1:23] 12 16 22 27 29 36 37 38 41 44 ...
## $ name : chr [1:23] "KENNETH J TACK" "MIKE NIKITIN" "EUGENE L MCCLURE" "GAURAV GIDWANI" ...
## $ state : chr [1:23] "MI" "MI" "MI" "MI" ...
## $ total_points : num [1:23] 4 3.5 3.5 3.5 3.5 3 2 2.5 2 2 ...
## $ pre_rating : num [1:23] 1663 1604 1555 1552 1602 ...
## $ avg_opponent_pre_rating: num [1:23] 1506 1386 1300 1222 1314 ...
Output
write_csv(df_final_all_players, 'output/all_players.csv')
write_csv(df_final_all_matches, 'output/played_all_matches.csv')
write_csv(df_final_some_matches, 'output/played_some_matches.csv')