Introduction

In this project, we take chess tournament results data from a text file and prepare it for export as .csv

Strategy

My strategy for this project was to identify the individual steps I needed to take to transform the data from the original text file.

The steps I identified to complete:
1. Removing |’s and -’s
2. Isolating the player names and ratings into their own columns
3. Extracting the second row of data for each observation and adding them as additional columns to each observation.
4. Calculating the opposing player rating by match in order to aggregate an average.

Data Cleaning

The first step was to read the data into R. I tried several methods: first, table.read(), then read.tsv(), then I tried to read each individual line of the file into a new file. I ultimately discovered I could use read.csv, and through the arguments sep, header, skip, and colnames, I could achieve several of my cleaning goals in one line of code:

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
url <- 'https://raw.githubusercontent.com/josh1den/DATA-607/Projects/tournamentinfo.txt'
cols = c('player_num','name','total_pts','round_1','round_2','round_3','round_4',
         'round_5','round_6','round_7','NA')
df <- read.csv(url, sep="|", header=FALSE, skip=3, col.names=cols)
head(df)
##                                                                                  player_num
## 1 -----------------------------------------------------------------------------------------
## 2                                                                                        1 
## 3                                                                                       ON 
## 4 -----------------------------------------------------------------------------------------
## 5                                                                                        2 
## 6                                                                                       MI 
##                                name total_pts round_1 round_2 round_3 round_4
## 1                                                                            
## 2  GARY HUA                             6.0     W  39   W  21   W  18   W  14
## 3  15445895 / R: 1794   ->1817          N:2     W       B       W       B    
## 4                                                                            
## 5  DAKSHESH DARURI                      6.0     W  63   W  58   L   4   W  17
## 6  14598900 / R: 1553   ->1663          N:2     B       W       B       W    
##   round_5 round_6 round_7 NA.
## 1                          NA
## 2   W   7   D  12   D   4  NA
## 3   W       B       W      NA
## 4                          NA
## 5   W  16   W  20   W   7  NA
## 6   B       W       B      NA

The next step was to remove the rows with dashes.

# remove rows with dashes
dashes <- "-----------------------------------------------------------------------------------------"
df <- filter(df, player_num != dashes)
head(df)
##   player_num                              name total_pts round_1 round_2
## 1         1   GARY HUA                             6.0     W  39   W  21
## 2        ON   15445895 / R: 1794   ->1817          N:2     W       B    
## 3         2   DAKSHESH DARURI                      6.0     W  63   W  58
## 4        MI   14598900 / R: 1553   ->1663          N:2     B       W    
## 5         3   ADITYA BAJAJ                         6.0     L   8   W  61
## 6        MI   14959604 / R: 1384   ->1640          N:2     W       B    
##   round_3 round_4 round_5 round_6 round_7 NA.
## 1   W  18   W  14   W   7   D  12   D   4  NA
## 2   W       B       W       B       W      NA
## 3   L   4   W  17   W  16   W  20   W   7  NA
## 4   B       W       B       W       B      NA
## 5   W  25   W  21   W  11   W  13   W  12  NA
## 6   W       B       W       B       W      NA

Next, I decided to create two new dataframes which I would later merge to one. The first dataframe (df1) would contain the odd number rows, and the second dataframe (df2) would contain the even number rows.

# extract every first row
df1 <- df %>%
  filter(row_number() %% 2 == 1)

# extract every second row
df2 <- df %>%
  filter(row_number() %% 2 != 1)

Next I used regex to extract the player rating from the name column as a new column: rating. The player ratings all followed a : followed by two or three spaces, and were digits of a minimum length of 3, maximum length of 4.

df2 <- df2 %>%
  mutate(rating = as.numeric(str_extract(name, "(?<=:\\s|:\\s\\s)\\d{3,4}")))
head(df2)
##   player_num                              name total_pts round_1 round_2
## 1        ON   15445895 / R: 1794   ->1817          N:2     W       B    
## 2        MI   14598900 / R: 1553   ->1663          N:2     B       W    
## 3        MI   14959604 / R: 1384   ->1640          N:2     W       B    
## 4        MI   12616049 / R: 1716   ->1744          N:2     W       B    
## 5        MI   14601533 / R: 1655   ->1690          N:2     B       W    
## 6        OH   15055204 / R: 1686   ->1687          N:3     W       B    
##   round_3 round_4 round_5 round_6 round_7 NA. rating
## 1   W       B       W       B       W      NA   1794
## 2   B       W       B       W       B      NA   1553
## 3   W       B       W       B       W      NA   1384
## 4   W       B       W       B       B      NA   1716
## 5   B       W       B       W       B      NA   1655
## 6   W       B       B       W       B      NA   1686

Now I could rename the player_num column to its appropriate name, state, and by subsetting, remove the columns I don’t need.

# rename and drop unnecessary columns
df2 <- rename(df2, state=player_num)
df2 <- df2[c("state","rating")]
head(df2)
##    state rating
## 1    ON    1794
## 2    MI    1553
## 3    MI    1384
## 4    MI    1716
## 5    MI    1655
## 6    OH    1686

Now I’m ready to merge the dataframes.

# merge the dataframes
df_new <- merge(df1, df2, by.x = 0, by.y = 0)
head(df_new)
##   Row.names player_num                              name total_pts round_1
## 1         1         1   GARY HUA                             6.0     W  39
## 2        10        10   ANVIT RAO                            5.0     D  16
## 3        11        11   CAMERON WILLIAM MC LEMAN             4.5     D  38
## 4        12        12   KENNETH J TACK                       4.5     W  42
## 5        13        13   TORRANCE HENRY JR                    4.5     W  36
## 6        14        14   BRADLEY SHAW                         4.5     W  54
##   round_2 round_3 round_4 round_5 round_6 round_7 NA.  state rating
## 1   W  21   W  18   W  14   W   7   D  12   D   4  NA    ON    1794
## 2   L  19   W  55   W  31   D   6   W  25   W  18  NA    MI    1365
## 3   W  56   W   6   L   7   L   3   W  34   W  26  NA    MI    1712
## 4   W  33   D   5   W  38   H       D   1   L   3  NA    MI    1663
## 5   W  27   L   7   D   5   W  33   L   3   W  32  NA    MI    1666
## 6   W  44   W   8   L   1   D  27   L   5   W  31  NA    MI    1610

Steps 1, 2, and 3 are now basically complete. However, there is a little bit of tidying up to attend to.

# remove undesired columns
cols_to_remove <- c("Row.names","NA.")
df_new <- select(df_new, -cols_to_remove)
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(cols_to_remove)` instead of `cols_to_remove` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
head(df_new)
##   player_num                              name total_pts round_1 round_2
## 1         1   GARY HUA                             6.0     W  39   W  21
## 2        10   ANVIT RAO                            5.0     D  16   L  19
## 3        11   CAMERON WILLIAM MC LEMAN             4.5     D  38   W  56
## 4        12   KENNETH J TACK                       4.5     W  42   W  33
## 5        13   TORRANCE HENRY JR                    4.5     W  36   W  27
## 6        14   BRADLEY SHAW                         4.5     W  54   W  44
##   round_3 round_4 round_5 round_6 round_7  state rating
## 1   W  18   W  14   W   7   D  12   D   4    ON    1794
## 2   W  55   W  31   D   6   W  25   W  18    MI    1365
## 3   W   6   L   7   L   3   W  34   W  26    MI    1712
## 4   D   5   W  38   H       D   1   L   3    MI    1663
## 5   L   7   D   5   W  33   L   3   W  32    MI    1666
## 6   W   8   L   1   D  27   L   5   W  31    MI    1610
# the name column appears to have whitespace, let's trim that
df_new$name <- str_trim(df_new$name, side = c("both"))
head(df_new)
##   player_num                     name total_pts round_1 round_2 round_3 round_4
## 1         1                  GARY HUA     6.0     W  39   W  21   W  18   W  14
## 2        10                 ANVIT RAO     5.0     D  16   L  19   W  55   W  31
## 3        11  CAMERON WILLIAM MC LEMAN     4.5     D  38   W  56   W   6   L   7
## 4        12            KENNETH J TACK     4.5     W  42   W  33   D   5   W  38
## 5        13         TORRANCE HENRY JR     4.5     W  36   W  27   L   7   D   5
## 6        14              BRADLEY SHAW     4.5     W  54   W  44   W   8   L   1
##   round_5 round_6 round_7  state rating
## 1   W   7   D  12   D   4    ON    1794
## 2   D   6   W  25   W  18    MI    1365
## 3   L   3   W  34   W  26    MI    1712
## 4   H       D   1   L   3    MI    1663
## 5   W  33   L   3   W  32    MI    1666
## 6   D  27   L   5   W  31    MI    1610
# player_num column needs to be numeric
df_new$player_num <- as.numeric(df_new$player_num)
head(df_new)
##   player_num                     name total_pts round_1 round_2 round_3 round_4
## 1          1                 GARY HUA     6.0     W  39   W  21   W  18   W  14
## 2         10                ANVIT RAO     5.0     D  16   L  19   W  55   W  31
## 3         11 CAMERON WILLIAM MC LEMAN     4.5     D  38   W  56   W   6   L   7
## 4         12           KENNETH J TACK     4.5     W  42   W  33   D   5   W  38
## 5         13        TORRANCE HENRY JR     4.5     W  36   W  27   L   7   D   5
## 6         14             BRADLEY SHAW     4.5     W  54   W  44   W   8   L   1
##   round_5 round_6 round_7  state rating
## 1   W   7   D  12   D   4    ON    1794
## 2   D   6   W  25   W  18    MI    1365
## 3   L   3   W  34   W  26    MI    1712
## 4   H       D   1   L   3    MI    1663
## 5   W  33   L   3   W  32    MI    1666
## 6   D  27   L   5   W  31    MI    1610

Now to complete the final step: calculating the average opposing player ratings.

First I start by extracting the player IDs from the round columns.

# reduce round columns to only the player player_ids
round_cols <- c("round_1","round_2","round_3","round_4", "round_5", "round_6", "round_7")

for (col in round_cols) {
  df_new[[col]] <- as.numeric(str_extract_all(df_new[[col]], "(?<=\\s)\\d{1,2}", simplify=TRUE)) 
}

head(df_new)
##   player_num                     name total_pts round_1 round_2 round_3 round_4
## 1          1                 GARY HUA     6.0        39      21      18      14
## 2         10                ANVIT RAO     5.0        16      19      55      31
## 3         11 CAMERON WILLIAM MC LEMAN     4.5        38      56       6       7
## 4         12           KENNETH J TACK     4.5        42      33       5      38
## 5         13        TORRANCE HENRY JR     4.5        36      27       7       5
## 6         14             BRADLEY SHAW     4.5        54      44       8       1
##   round_5 round_6 round_7  state rating
## 1       7      12       4    ON    1794
## 2       6      25      18    MI    1365
## 3       3      34      26    MI    1712
## 4      NA       1       3    MI    1663
## 5      33       3      32    MI    1666
## 6      27       5      31    MI    1610

Next, I iterate through the round columns, replacing the player IDs with the player rating corresponding to the player ID.

# loop through the rounds replacing the opposing player id with the opposing player's pre-rating
for (row in 1:nrow(df_new)) {
  for (col in round_cols) {
    rating <- df_new[row, col]
    
    if(!is.na(rating)) {
      df_new[row, col] <- df_new$rating[df_new$player_num == rating]
    }
  }
}

head(df_new)
##   player_num                     name total_pts round_1 round_2 round_3 round_4
## 1          1                 GARY HUA     6.0      1436    1563    1600    1610
## 2         10                ANVIT RAO     5.0      1604    1564    1186    1494
## 3         11 CAMERON WILLIAM MC LEMAN     4.5      1423    1153    1686    1649
## 4         12           KENNETH J TACK     4.5      1332    1449    1655    1423
## 5         13        TORRANCE HENRY JR     4.5      1355    1552    1649    1655
## 6         14             BRADLEY SHAW     4.5      1270    1199    1641    1794
##   round_5 round_6 round_7  state rating
## 1    1649    1663    1716    ON    1794
## 2    1686    1745    1600    MI    1365
## 3    1384    1399    1579    MI    1712
## 4      NA    1794    1384    MI    1663
## 5    1449    1384    1441    MI    1666
## 6    1552    1655    1494    MI    1610

Now that I have columns with each opposing player rating, I can create a new columns with the average of the values in the round columns.

## calculate avg opponent rating
df_new$avg_opposing_rating <- round(rowMeans(df_new[round_cols], na.rm=TRUE))

head(df_new)
##   player_num                     name total_pts round_1 round_2 round_3 round_4
## 1          1                 GARY HUA     6.0      1436    1563    1600    1610
## 2         10                ANVIT RAO     5.0      1604    1564    1186    1494
## 3         11 CAMERON WILLIAM MC LEMAN     4.5      1423    1153    1686    1649
## 4         12           KENNETH J TACK     4.5      1332    1449    1655    1423
## 5         13        TORRANCE HENRY JR     4.5      1355    1552    1649    1655
## 6         14             BRADLEY SHAW     4.5      1270    1199    1641    1794
##   round_5 round_6 round_7  state rating avg_opposing_rating
## 1    1649    1663    1716    ON    1794                1605
## 2    1686    1745    1600    MI    1365                1554
## 3    1384    1399    1579    MI    1712                1468
## 4      NA    1794    1384    MI    1663                1506
## 5    1449    1384    1441    MI    1666                1498
## 6    1552    1655    1494    MI    1610                1515

From here, it’s simply a matter of selecting the desired columns for my output file.

# select the desired output
final <- select(df_new, c(2, 11, 3, 12, 13))
head(final)
##                       name  state total_pts rating avg_opposing_rating
## 1                 GARY HUA    ON      6.0     1794                1605
## 2                ANVIT RAO    MI      5.0     1365                1554
## 3 CAMERON WILLIAM MC LEMAN    MI      4.5     1712                1468
## 4           KENNETH J TACK    MI      4.5     1663                1506
## 5        TORRANCE HENRY JR    MI      4.5     1666                1498
## 6             BRADLEY SHAW    MI      4.5     1610                1515

Writing the file to csv:

path <- "/Users/joshiden/Documents/Classes/CUNY SPS/Fall 2022/DATA 607/Projects/Project 1.csv"
write.csv(final, path)

Thanks!