Overview / Introduction

For this project, we are given a particularly structured text file with chess tournament results. Our goal is to create an R Markdown file that generates a .CSV file with the following information for all of the players:

Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Opponent Pre-Rating

Load Packages and Data

Let’s begin by first loading the necessary packages and reading the text file into a manageable data frame. Looking at the text file, one can see that each row is separated by “|” which makes this symbol perfect to be used as the delimiter in our read.delim() function.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
url = "https://raw.githubusercontent.com/Stevee-G/Data607/refs/heads/main/Project1Table"
raw <- read.delim(file=url,header=FALSE,sep="|")
glimpse(raw)
## Rows: 196
## Columns: 11
## $ V1  <chr> "-----------------------------------------------------------------…
## $ V2  <chr> "", " Player Name                     ", " USCF ID / Rtg (Pre->Pos…
## $ V3  <chr> "", "Total", " Pts ", "", "6.0  ", "N:2  ", "", "6.0  ", "N:2  ", …
## $ V4  <chr> "", "Round", "  1  ", "", "W  39", "W    ", "", "W  63", "B    ", …
## $ V5  <chr> "", "Round", "  2  ", "", "W  21", "B    ", "", "W  58", "W    ", …
## $ V6  <chr> "", "Round", "  3  ", "", "W  18", "W    ", "", "L   4", "B    ", …
## $ V7  <chr> "", "Round", "  4  ", "", "W  14", "B    ", "", "W  17", "W    ", …
## $ V8  <chr> "", "Round", "  5  ", "", "W   7", "W    ", "", "W  16", "B    ", …
## $ V9  <chr> "", "Round", "  6  ", "", "D  12", "B    ", "", "W  20", "W    ", …
## $ V10 <chr> "", "Round", "  7  ", "", "D   4", "W    ", "", "W   7", "B    ", …
## $ V11 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…

Remove Junk Rows and Blank Columns

We will now remove any unusable data or blank rows. The “raw” data frame seems to have interpreted the “—” values as observations with their own rows. Thankfully, this makes these rows easy to remove through a simple filter() command. The blank column at the end can then be removed by deselecting it.

adjusted <- raw %>%
  filter(!grepl("---",V1)) %>%
  select(-V11)
glimpse(adjusted)
## Rows: 130
## Columns: 10
## $ V1  <chr> " Pair ", " Num  ", "    1 ", "   ON ", "    2 ", "   MI ", "    3…
## $ V2  <chr> " Player Name                     ", " USCF ID / Rtg (Pre->Post)  …
## $ V3  <chr> "Total", " Pts ", "6.0  ", "N:2  ", "6.0  ", "N:2  ", "6.0  ", "N:…
## $ V4  <chr> "Round", "  1  ", "W  39", "W    ", "W  63", "B    ", "L   8", "W …
## $ V5  <chr> "Round", "  2  ", "W  21", "B    ", "W  58", "W    ", "W  61", "B …
## $ V6  <chr> "Round", "  3  ", "W  18", "W    ", "L   4", "B    ", "W  25", "W …
## $ V7  <chr> "Round", "  4  ", "W  14", "B    ", "W  17", "W    ", "W  21", "B …
## $ V8  <chr> "Round", "  5  ", "W   7", "W    ", "W  16", "B    ", "W  11", "W …
## $ V9  <chr> "Round", "  6  ", "D  12", "B    ", "W  20", "W    ", "W  13", "B …
## $ V10 <chr> "Round", "  7  ", "D   4", "W    ", "W   7", "B    ", "W  12", "W …

Divide Data Frame and Prepare Subsets

Further observation of the “adjusted” data frame reveals a pattern within its contents. The values in the rows alternate between two sets of data. With regards to the fields we want for our final deliverable, the odd numbered rows seem to have the player name, their total points, and the opponent number for each round. The even numbered rows give us the player state and player rankings.

Seeing this, we can then split the “adjusted” data frame into two subsets of data with their respective contents. Later on, after making the necessary changes and calculations, when creating our final data frame, we can merge them back together. knowing this, it would be wise to make use of the shared total number of rows each subset has and create an index column for joining.

rounds <- adjusted[seq_len(nrow(adjusted)) %% 2 == 1,]
glimpse(rounds)
## Rows: 65
## Columns: 10
## $ V1  <chr> " Pair ", "    1 ", "    2 ", "    3 ", "    4 ", "    5 ", "    6…
## $ V2  <chr> " Player Name                     ", " GARY HUA                   …
## $ V3  <chr> "Total", "6.0  ", "6.0  ", "6.0  ", "5.5  ", "5.5  ", "5.0  ", "5.…
## $ V4  <chr> "Round", "W  39", "W  63", "L   8", "W  23", "W  45", "W  34", "W …
## $ V5  <chr> "Round", "W  21", "W  58", "W  61", "D  28", "W  37", "D  29", "W …
## $ V6  <chr> "Round", "W  18", "L   4", "W  25", "W   2", "D  12", "L  11", "W …
## $ V7  <chr> "Round", "W  14", "W  17", "W  21", "W  26", "D  13", "W  35", "W …
## $ V8  <chr> "Round", "W   7", "W  16", "W  11", "D   5", "D   4", "D  10", "L …
## $ V9  <chr> "Round", "D  12", "W  20", "W  13", "W  19", "W  14", "W  27", "W …
## $ V10 <chr> "Round", "D   4", "W   7", "W  12", "D   1", "W  17", "W  21", "L …
colnames(rounds) <- c("player_number","player_name","total_points",
                      "round1","round2","round3","round4",
                      "round5","round6","round7")
rounds <- rounds[-1,]
rounds$index <- 1:nrow(rounds)
glimpse(rounds)
## Rows: 64
## Columns: 11
## $ player_number <chr> "    1 ", "    2 ", "    3 ", "    4 ", "    5 ", "    6…
## $ player_name   <chr> " GARY HUA                        ", " DAKSHESH DARURI  …
## $ total_points  <chr> "6.0  ", "6.0  ", "6.0  ", "5.5  ", "5.5  ", "5.0  ", "5…
## $ round1        <chr> "W  39", "W  63", "L   8", "W  23", "W  45", "W  34", "W…
## $ round2        <chr> "W  21", "W  58", "W  61", "D  28", "W  37", "D  29", "W…
## $ round3        <chr> "W  18", "L   4", "W  25", "W   2", "D  12", "L  11", "W…
## $ round4        <chr> "W  14", "W  17", "W  21", "W  26", "D  13", "W  35", "W…
## $ round5        <chr> "W   7", "W  16", "W  11", "D   5", "D   4", "D  10", "L…
## $ round6        <chr> "D  12", "W  20", "W  13", "W  19", "W  14", "W  27", "W…
## $ round7        <chr> "D   4", "W   7", "W  12", "D   1", "W  17", "W  21", "L…
## $ index         <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…

Create Long Rounds Data Frame

The “rounds” data frame would be more useful to us if we had all of the rounds and their respective opponent numbers in two columns. This won’t affect the index for joining in any way since the index column and its values were already declared before this step.

tidy_rounds <- rounds %>%
  pivot_longer(cols=starts_with("round"), 
               names_to="round",
               values_to="opponent_number")
tidy_rounds$opponent_number <- as.numeric(
  gsub("\\D","",tidy_rounds$opponent_number))
tidy_rounds$round <- as.numeric(
  gsub("\\D","",tidy_rounds$round))
glimpse(tidy_rounds)
## Rows: 448
## Columns: 6
## $ player_number   <chr> "    1 ", "    1 ", "    1 ", "    1 ", "    1 ", "   …
## $ player_name     <chr> " GARY HUA                        ", " GARY HUA       …
## $ total_points    <chr> "6.0  ", "6.0  ", "6.0  ", "6.0  ", "6.0  ", "6.0  ", …
## $ index           <int> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, …
## $ round           <dbl> 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, …
## $ opponent_number <dbl> 39, 21, 18, 14, 7, 12, 4, 63, 58, 4, 17, 16, 20, 7, 8,…

Create Ratings Data Frame

Next, we will create the “ratings” data frame which, as its name suggests, will hold the ratings data for each player, along with their state data.

ratings <- adjusted[seq_len(nrow(adjusted)) %% 2 == 0,] %>%
  select(V1,V2)
glimpse(ratings)
## Rows: 65
## Columns: 2
## $ V1 <chr> " Num  ", "   ON ", "   MI ", "   MI ", "   MI ", "   MI ", "   OH …
## $ V2 <chr> " USCF ID / Rtg (Pre->Post)       ", " 15445895 / R: 1794   ->1817 …
colnames(ratings) <- c("player_state","player_prerating")
ratings <- ratings[-1,]
ratings$index <- 1:nrow(ratings)
glimpse(ratings)
## Rows: 64
## Columns: 3
## $ player_state     <chr> "   ON ", "   MI ", "   MI ", "   MI ", "   MI ", "  …
## $ player_prerating <chr> " 15445895 / R: 1794   ->1817     ", " 14598900 / R: …
## $ index            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…

Tidy Player Ratings Column

Here, we will strip the player pre-ratings column values down to just the pre-ratings since the final deliverable does not require USCF ID or post-ratings data.

tidy_ratings <- ratings
tidy_ratings$player_prerating <- substr(
  tidy_ratings$player_prerating,16,19)
glimpse(tidy_ratings)
## Rows: 64
## Columns: 3
## $ player_state     <chr> "   ON ", "   MI ", "   MI ", "   MI ", "   MI ", "  …
## $ player_prerating <chr> "1794", "1553", "1384", "1716", "1655", "1686", "1649…
## $ index            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…

Create Player Rating Index

Afterwards, we create a new data frame that combines “tidy_ratings” with “rounds” in order to tie player number and name to their pre-rating.

prerating_index <- left_join(tidy_ratings,rounds,by="index") %>%
  select(player_number,player_name,player_prerating)
glimpse(prerating_index)
## Rows: 64
## Columns: 3
## $ player_number    <chr> "    1 ", "    2 ", "    3 ", "    4 ", "    5 ", "  …
## $ player_name      <chr> " GARY HUA                        ", " DAKSHESH DARUR…
## $ player_prerating <chr> "1794", "1553", "1384", "1716", "1655", "1686", "1649…

Create Average Opponent Ratings Column

We will now use “tidy_rounds” and “prerating_index” in order to tie opponent number to their pre-rating data. Once this is done, we pivot the data frame wider and prepare the rows to be used towards our calculation of the opponent pre-rating average using only the columns referring to rounds.

prerating_index$player_number <- as.numeric(
  as.character(prerating_index$player_number))
prerating_index$player_prerating <- as.numeric(
  as.character(prerating_index$player_prerating))
prerating_avg <- left_join(tidy_rounds,prerating_index,
                       c("opponent_number"="player_number")) %>%
  select(player_number,round,opponent_number,player_prerating)
prerating_avg <- prerating_avg %>%
  pivot_wider(id_cols="player_number",
              names_from="round",
              values_from="player_prerating")
prerating_avg$opp_avg_prerating <- rowMeans(
  prerating_avg[c("1","2","3","4","5","6","7")],na.rm=TRUE)
prerating_avg$opp_avg_prerating <-
  format(round(prerating_avg$opp_avg_prerating,0),nsmall=0)
glimpse(prerating_avg)
## Rows: 64
## Columns: 9
## $ player_number     <chr> "    1 ", "    2 ", "    3 ", "    4 ", "    5 ", " …
## $ `1`               <dbl> 1436, 1175, 1641, 1363, 1242, 1399, 1092, 1384, 1745…
## $ `2`               <dbl> 1563, 917, 955, 1507, 980, 1602, 377, 1441, 1600, 15…
## $ `3`               <dbl> 1600, 1716, 1745, 1553, 1663, 1712, 1666, 1610, 853,…
## $ `4`               <dbl> 1610, 1629, 1563, 1579, 1666, 1438, 1712, 1411, 1641…
## $ `5`               <dbl> 1649, 1604, 1712, 1655, 1716, 1365, 1794, 1362, 1579…
## $ `6`               <dbl> 1663, 1595, 1666, 1564, 1610, 1552, 1411, 1507, 1649…
## $ `7`               <dbl> 1716, 1649, 1663, 1794, 1629, 1563, 1553, 1564, 1595…
## $ opp_avg_prerating <chr> "1605", "1469", "1564", "1574", "1501", "1519", "137…

Merge Subsets

Finally, we have all of the fields we are looking for! All we have to do now is merge the necessary data frames, pulling in only the fields related to our deliverable.

tidy <- left_join(rounds,tidy_ratings,by="index") %>%
  select(player_number,player_name,player_state,total_points,player_prerating)
glimpse(tidy)
## Rows: 64
## Columns: 5
## $ player_number    <chr> "    1 ", "    2 ", "    3 ", "    4 ", "    5 ", "  …
## $ player_name      <chr> " GARY HUA                        ", " DAKSHESH DARUR…
## $ player_state     <chr> "   ON ", "   MI ", "   MI ", "   MI ", "   MI ", "  …
## $ total_points     <chr> "6.0  ", "6.0  ", "6.0  ", "5.5  ", "5.5  ", "5.0  ",…
## $ player_prerating <chr> "1794", "1553", "1384", "1716", "1655", "1686", "1649…
tidier <- left_join(tidy,prerating_avg,by="player_number") %>%
  select(player_name,player_state,total_points,player_prerating,opp_avg_prerating)
glimpse(tidier)
## Rows: 64
## Columns: 5
## $ player_name       <chr> " GARY HUA                        ", " DAKSHESH DARU…
## $ player_state      <chr> "   ON ", "   MI ", "   MI ", "   MI ", "   MI ", " …
## $ total_points      <chr> "6.0  ", "6.0  ", "6.0  ", "5.5  ", "5.5  ", "5.0  "…
## $ player_prerating  <chr> "1794", "1553", "1384", "1716", "1655", "1686", "164…
## $ opp_avg_prerating <chr> "1605", "1469", "1564", "1574", "1501", "1519", "137…

Finalize Data Frame

Glimpsing “tidier,” we can see a bunch of white space surrounding our values and some columns with the incorrect data type. We quickly remedy this using the str_trim() and as.numeric() functions, respectively, to create our “tidiest” final product.

tidiest <- tidier
tidiest$player_name <- str_trim(tidiest$player_name,side="both")
tidiest$player_state <- str_trim(tidiest$player_state,side="both")
tidiest$total_points <- str_trim(tidiest$total_points,side="both") %>%
  as.numeric(as.character(tidiest$total_points))
tidiest$player_prerating <- as.numeric(as.character(tidiest$player_prerating))
tidiest$opp_avg_prerating <- as.numeric(as.character(tidiest$opp_avg_prerating))
glimpse(tidiest)
## Rows: 64
## Columns: 5
## $ player_name       <chr> "GARY HUA", "DAKSHESH DARURI", "ADITYA BAJAJ", "PATR…
## $ player_state      <chr> "ON", "MI", "MI", "MI", "MI", "OH", "MI", "MI", "ON"…
## $ total_points      <dbl> 6.0, 6.0, 6.0, 5.5, 5.5, 5.0, 5.0, 5.0, 5.0, 5.0, 4.…
## $ player_prerating  <dbl> 1794, 1553, 1384, 1716, 1655, 1686, 1649, 1641, 1411…
## $ opp_avg_prerating <dbl> 1605, 1469, 1564, 1574, 1501, 1519, 1372, 1468, 1523…

Create CSV File

We finish off by writing our “tidiest” data frame unto a .CSV file that ends up in our default or recently used R folder.

write_csv(tidiest, "chess_data.csv")

Conclusions / Findings and Recommendations

Suffice to say, tidying up a text file, especially one formatted the way this chess tournament results one was, is quite the mission. Either way, with the right resources and plenty of patience, it can be done. For fun, here’s a plot illustrating player pre-rating versus average opponent pre-rating.

ggplot(data=tidiest,aes(x=player_prerating,y=opp_avg_prerating,
                        color=player_prerating>opp_avg_prerating)) +
  geom_point() +
  theme(legend.position="top")