library(openintro)
library(tinytex)
library(tidyverse)
library(stringr)
library(magrittr)
library(gridExtra)

Working with Text Files

Introduction

In this project we will convert a text file into a usable dataframe. The file has extra characters and other issues which need to be resolved before it can be useful. The exercise will show the many ways that R can help us clean a messy text file.

This is the file - it shows results from a chess match and some player statistics. Our goal is to end with a file that has 5 columns: Name, State, Points, pre-chess rating and average rating of opponents.

dfChess <- as.data.frame(read.delim("https://raw.githubusercontent.com/ericonsi/CUNY_607/main/Projects/Project%201/Chess.txt", header = FALSE, stringsAsFactors = FALSE, sep = "|"))


head(dfChess)
##                                                                                          V1
## 1 -----------------------------------------------------------------------------------------
## 2                                                                                     Pair 
## 3                                                                                     Num  
## 4 -----------------------------------------------------------------------------------------
## 5                                                                                        1 
## 6                                                                                       ON 
##                                  V2    V3    V4    V5    V6    V7    V8    V9
## 1                                                                            
## 2  Player Name                      Total Round Round Round Round Round Round
## 3  USCF ID / Rtg (Pre->Post)         Pts    1     2     3     4     5     6  
## 4                                                                            
## 5  GARY HUA                         6.0   W  39 W  21 W  18 W  14 W   7 D  12
## 6  15445895 / R: 1794   ->1817      N:2   W     B     W     B     W     B    
##     V10 V11
## 1        NA
## 2 Round  NA
## 3   7    NA
## 4        NA
## 5 D   4  NA
## 6 W      NA

Eliminating What We Don’t Want and Extracting What We Do

  1. First we eliminate rows and columns with no information.
dfChess %<>% filter(!str_detect(dfChess$V1, "-----")) %<>% select(!V11)
  1. Then we merge the even rows with the odd rows, since each player’s info spans two rows.
EvenRows<- dfChess %>% filter(row_number() %% 2 == 1) %<>% mutate("qid" = row_number()) %<>% filter(qid!=1)
OddRows<- dfChess %>% filter(row_number() %% 2 != 1)  %<>% mutate("qid" = row_number()) %<>% filter(qid!=1)
dfChessMerged <- as.data.frame(merge(x = EvenRows, y = OddRows, by = c("qid")))
  1. Then we get the pre-chess rating for each player. To get the pre-chess rating, we need to manipulate the string to extract just the rating. (We ignore the parts of the rating that occur with “P” for some players). We can do this with str_sub and str_extract.

Before:

head(dfChessMerged$V2.y)
## [1] " 15445895 / R: 1794   ->1817     " " 14598900 / R: 1553   ->1663     "
## [3] " 14959604 / R: 1384   ->1640     " " 12616049 / R: 1716   ->1744     "
## [5] " 14601533 / R: 1655   ->1690     " " 15055204 / R: 1686   ->1687     "
dfChessMerged$PreChessRating <- str_sub(str_extract(dfChessMerged$V2.y, "R:....."), -4,-1)

After:

head(dfChessMerged$PreChessRating)
## [1] "1794" "1553" "1384" "1716" "1655" "1686"
  1. Now we create a dataframe to hold our final result. We select and rename the columns which are of interest to us. We also trim the character fields to remove whitesapce.
dfChessFinal <- dfChessMerged %>% 
          subset(select = c("V2.x", "V1.y", "V3.x", "PreChessRating")) %>% 
          rename(c(Name =  "V2.x", State = "V1.y", Points = "V3.x"))

dfChessFinal$Name<-str_trim(dfChessFinal$Name)
dfChessFinal$State<-str_trim(dfChessFinal$State)
dfChessFinal$Points<-str_trim(dfChessFinal$Points)

head(dfChessFinal)
##                  Name State Points PreChessRating
## 1            GARY HUA    ON    6.0           1794
## 2     DAKSHESH DARURI    MI    6.0           1553
## 3        ADITYA BAJAJ    MI    6.0           1384
## 4 PATRICK H SCHILLING    MI    5.5           1716
## 5          HANSHI ZUO    MI    5.5           1655
## 6         HANSEN SONG    OH    5.0           1686

Creating the Opponent’s Average Pre-Chess Rating

We have also been tasked with calculating and displaying the average of the opponent’s pre-chess ratings for each player (the total ratings divided by the number of games.) We will do this by creating a matrix which will hold the opponent rating for each player for each round. This will simplify the task of calculating the average.

  1. We begin by creating a dataframe which will hold the rating and opponent ID for each round for each player. We will need to extract the number out of the round column, as this represents the opponent’s ID. We use lapply and the gsub function (which uses a regular expression to find the first sequence of digits within the string), and then we convert the NAs (which occur in the rounds where there was no game) to 0.
dfChessOpponentAvg <- dfChessMerged %>% 
          subset(select = c("V1.x", "PreChessRating", "V4.x", "V5.x", "V6.x", "V7.x", "V8.x", "V9.x", "V10.x" )) %>% 
          rename(c(ID="V1.x")) 

dfChessOpponentAvg[] <- lapply(dfChessOpponentAvg, function(x) as.numeric(gsub(".*?([0-9]+).*", "\\1", x)))
## Warning in FUN(X[[i]], ...): NAs introduced by coercion

## Warning in FUN(X[[i]], ...): NAs introduced by coercion

## Warning in FUN(X[[i]], ...): NAs introduced by coercion

## Warning in FUN(X[[i]], ...): NAs introduced by coercion

## Warning in FUN(X[[i]], ...): NAs introduced by coercion

## Warning in FUN(X[[i]], ...): NAs introduced by coercion

## Warning in FUN(X[[i]], ...): NAs introduced by coercion
dfChessOpponentAvg[is.na(dfChessOpponentAvg)] <- 0
head(dfChessOpponentAvg)
##   ID PreChessRating V4.x V5.x V6.x V7.x V8.x V9.x V10.x
## 1  1           1794   39   21   18   14    7   12     4
## 2  2           1553   63   58    4   17   16   20     7
## 3  3           1384    8   61   25   21   11   13    12
## 4  4           1716   23   28    2   26    5   19     1
## 5  5           1655   45   37   12   13    4   14    17
## 6  6           1686   34   29   11   35   10   27    21
  1. Now in place of all of the player IDs, we are going to put the player’s rating. That way we can easily sum and average the opponents’ ratings across the player’s row.

In Java or C# we might need to do something complicated with a loop and a function here, but in R we can easily use a few simple commands to make this substitution over the entire matrx. Probably. In any case, as a newcomer to R I didn’t manage to figure that out. However, in R there are many ways to get to the same destination, so here is my function (which accepts the player’s identification number and returns their rating), and my loop (with map_dfr from the purrr library.) We loop through the dataframe, and as we do, we also collect the total number of non-zero cells in each row (i.e., the total number of games). We need to create a second dataframe for this exercise, since we are adding a column inside the loop and don’t want map_dfr to loop over this column as well.

ConvertPlayerIDToRating<-function(playerID)
  {
    if(playerID==0) {return(0)}
    x<- as.vector(filter(dfChessOpponentAvg, dfChessOpponentAvg$ID==playerID))
    playerRating<-x$PreChessRating
  
return (playerRating)
  }

dfC<-subset(dfChessOpponentAvg, select= c("V4.x", "V5.x", "V6.x", "V7.x", "V8.x", "V9.x", "V10.x"))

for(i in 1:nrow(dfC)) 
  { 
      dfC[i, ] <- map_dfr(dfC[i,], ConvertPlayerIDToRating)   #This will substitute the player identification number with their rating
      dfChessOpponentAvg[i,10] = sum(dfC[i,] != 0)                #This column will hold the number of games per player
  }

head(dfC)
##   V4.x V5.x V6.x V7.x V8.x V9.x V10.x
## 1 1436 1563 1600 1610 1649 1663  1716
## 2 1175  917 1716 1629 1604 1595  1649
## 3 1641  955 1745 1563 1712 1666  1663
## 4 1363 1507 1553 1579 1655 1564  1794
## 5 1242  980 1663 1666 1716 1610  1629
## 6 1399 1602 1712 1438 1365 1552  1563
  1. Now all we need to do is sum up the rows, divide by the number of games, write that to the final table and output it to a csv file (we push this to the github file at https://github.com/ericonsi/CUNY_607/blob/main/ChessFinal.csv) But there is more we can do!
dfChessOpponentAvg$row_sum = rowSums(dfC[,])
dfChessFinal$AvgRatingOfOpps = dfChessOpponentAvg$row_sum/dfChessOpponentAvg$V10

dfChessFinal
##                          Name State Points PreChessRating AvgRatingOfOpps
## 1                    GARY HUA    ON    6.0           1794        1605.286
## 2             DAKSHESH DARURI    MI    6.0           1553        1469.286
## 3                ADITYA BAJAJ    MI    6.0           1384        1563.571
## 4         PATRICK H SCHILLING    MI    5.5           1716        1573.571
## 5                  HANSHI ZUO    MI    5.5           1655        1500.857
## 6                 HANSEN SONG    OH    5.0           1686        1518.714
## 7           GARY DEE SWATHELL    MI    5.0           1649        1372.143
## 8            EZEKIEL HOUGHTON    MI    5.0           1641        1468.429
## 9                 STEFANO LEE    ON    5.0           1411        1523.143
## 10                  ANVIT RAO    MI    5.0           1365        1554.143
## 11   CAMERON WILLIAM MC LEMAN    MI    4.5           1712        1467.571
## 12             KENNETH J TACK    MI    4.5           1663        1506.167
## 13          TORRANCE HENRY JR    MI    4.5           1666        1497.857
## 14               BRADLEY SHAW    MI    4.5           1610        1515.000
## 15     ZACHARY JAMES HOUGHTON    MI    4.5           1220        1483.857
## 16               MIKE NIKITIN    MI    4.0           1604        1385.800
## 17         RONALD GRZEGORCZYK    MI    4.0           1629        1498.571
## 18              DAVID SUNDEEN    MI    4.0           1600        1480.000
## 19               DIPANKAR ROY    MI    4.0           1564        1426.286
## 20                JASON ZHENG    MI    4.0           1595        1410.857
## 21              DINH DANG BUI    ON    4.0           1563        1470.429
## 22           EUGENE L MCCLURE    MI    4.0           1555        1300.333
## 23                   ALAN BUI    ON    4.0           1363        1213.857
## 24          MICHAEL R ALDRICH    MI    4.0           1229        1357.000
## 25           LOREN SCHWIEBERT    MI    3.5           1745        1363.286
## 26                    MAX ZHU    ON    3.5           1579        1506.857
## 27             GAURAV GIDWANI    MI    3.5           1552        1221.667
## 28 SOFIA ADINA STANESCU-BELLU    MI    3.5           1507        1522.143
## 29           CHIEDOZIE OKORIE    MI    3.5           1602        1313.500
## 30         GEORGE AVERY JONES    ON    3.5           1522        1144.143
## 31               RISHI SHETTY    MI    3.5           1494        1259.857
## 32      JOSHUA PHILIP MATHEWS    ON    3.5           1441        1378.714
## 33                    JADE GE    MI    3.5           1449        1276.857
## 34     MICHAEL JEFFERY THOMAS    MI    3.5           1399        1375.286
## 35           JOSHUA DAVID LEE    MI    3.5           1438        1149.714
## 36              SIDDHARTH JHA    MI    3.5           1355        1388.167
## 37       AMIYATOSH PWNANANDAM    MI    3.5            980        1384.800
## 38                  BRIAN LIU    MI    3.0           1423        1539.167
## 39              JOEL R HENDON    MI    3.0           1436        1429.571
## 40               FOREST ZHANG    MI    3.0           1348        1390.571
## 41        KYLE WILLIAM MURPHY    MI    3.0           1403        1248.500
## 42                   JARED GE    MI    3.0           1332        1149.857
## 43          ROBERT GLEN VASEY    MI    3.0           1283        1106.571
## 44         JUSTIN D SCHILLING    MI    3.0           1199        1327.000
## 45                  DEREK YAN    MI    3.0           1242        1152.000
## 46   JACOB ALEXANDER LAVALLEY    MI    3.0            377        1357.714
## 47                ERIC WRIGHT    MI    2.5           1362        1392.000
## 48               DANIEL KHAIN    MI    2.5           1382        1355.800
## 49           MICHAEL J MARTIN    MI    2.5           1291        1285.800
## 50                 SHIVAM JHA    MI    2.5           1056        1296.000
## 51             TEJAS AYYAGARI    MI    2.5           1011        1356.143
## 52                  ETHAN GUO    MI    2.5            935        1494.571
## 53              JOSE C YBARRA    MI    2.0           1393        1345.333
## 54                LARRY HODGE    MI    2.0           1270        1206.167
## 55                  ALEX KONG    MI    2.0           1186        1406.000
## 56               MARISA RICCI    MI    2.0           1153        1414.400
## 57                 MICHAEL LU    MI    2.0           1092        1363.000
## 58               VIRAJ MOHILE    MI    2.0            917        1391.000
## 59          SEAN M MC CORMICK    MI    2.0            853        1319.000
## 60                 JULIA SHEN    MI    1.5            967        1330.200
## 61              JEZZEL FARKAS    ON    1.5            955        1327.286
## 62              ASHWIN BALAJI    MI    1.0           1530        1186.000
## 63       THOMAS JOSEPH HOSMER    MI    1.0           1175        1350.200
## 64                     BEN LI    MI    1.0           1163        1263.000
write.csv(dfChessFinal, "d:\\RStudio\\CUNY_607\\ChessFinal.csv") 
  1. The opponent player matrix is a handy tool which we can use in other ways as well. For example, we could use it to collect numbers of wins by player (unlike the “points” category which includes .5 points for a draw). Here, for each round we are going to put a 1 if they won the round and a zero otherwise.
dfChessWorL <- dfChessMerged %>% 
          subset(select = c("V4.x", "V5.x", "V6.x", "V7.x", "V8.x", "V9.x", "V10.x" )) 


for(i in 1:nrow(dfC)) 
  { 
      dfChessWorL[i,] <- as.numeric(str_detect(dfChessWorL[i,], "W"))
      
  }

      dfChessWorL <- as.data.frame(apply(dfChessWorL, 2, as.numeric))  # Convert all variable types to numeric
      dfChessWorL$NumOfWins = rowSums(dfChessWorL[,])
      
      dfChessFinal$NumOfWins = dfChessWorL$NumOfWins
      
head(dfChessWorL)
##   V4.x V5.x V6.x V7.x V8.x V9.x V10.x NumOfWins
## 1    1    1    1    1    1    0     0         5
## 2    1    1    0    1    1    1     1         6
## 3    0    1    1    1    1    1     1         6
## 4    1    0    1    1    0    1     0         4
## 5    1    1    0    0    0    1     1         4
## 6    1    0    0    1    0    1     1         4

We assume that the best players are being paired with each other. It might be interesting to see if this effectively evens the playing field. If so, we might expect to see number of wins relatively evenly distributed among players. Here we plot the number of wins against strength of opponents using boxplots:

EHplot <- function(i)
  
{
  a <- dfChessFinal %>%
  filter(NumOfWins==i)

g1 <- ggplot(data = a, aes(x ="", y=AvgRatingOfOpps )) +
  geom_boxplot() + ggtitle(str_c("Number of Wins: ", i))
return(g1)

}

grid.arrange(EHplot(1), EHplot(2), EHplot(3), EHplot(4), EHplot(5), EHplot(6), ncol=3)

There are a number of interesting findings here. First, as strength of opponents increases, the number of wins also increases. This counterintuitive finding may suggest that the best players are being paired with the other good players, but they are nonetheless dominating them. We can also see some outliers. For example, on the one-win boxplot we see Ethan G, a low ranked player who faced some of the strongest opponents. At the time same time we see Michael M only score 1 win against some of the easiest opponents.

Conclusion

R has many tools for cleaning up messy files. We used stringr, purrr, dyplr and other libraries to accomplish this task. As a beginner to R I’m confident there are more robust solutions than mine - but it is encouraging to know how much can be done with so few lines of code!