Data 607 Project 1

Excerpt from source file

Our purpose is to take the source file tournamentinfo.txt and generate a .CSV file that has the following information for each player: Player's Name, Player's State, Total Number of Points, Players' Pre-Rating, and Average Pre Chess Rating of Opponents.

We calculate the last column by looking up the pre rating of each opponent by the pair number. So Gary Hua’s first opponent in Round 1 was pair number 39, Joel R Hendon, who had a pre chess rating of 1436P23, but we only care about 1436. We add up the rating for all the opponents, then divide by seven to arrive at the Average Pre Chess Rating of Opponents.

Let’s start by reading in the file.

library(stringr)
url <- "https://raw.githubusercontent.com/EyeDen/data607/master/tournamentinfo.txt"
download.file(url, destfile = "tournamentinfo.txt")
tour.info <- read.csv("tournamentinfo.txt", header = FALSE, sep = "|")
head(tour.info, 10)

##                                                                                           V1
## 1  -----------------------------------------------------------------------------------------
## 2                                                                                      Pair 
## 3                                                                                      Num  
## 4  -----------------------------------------------------------------------------------------
## 5                                                                                         1 
## 6                                                                                        ON 
## 7  -----------------------------------------------------------------------------------------
## 8                                                                                         2 
## 9                                                                                        MI 
## 10 -----------------------------------------------------------------------------------------
##                                   V2    V3    V4    V5    V6    V7    V8
## 1                                                                       
## 2   Player Name                      Total Round Round Round Round Round
## 3   USCF ID / Rtg (Pre->Post)         Pts    1     2     3     4     5  
## 4                                                                       
## 5   GARY HUA                         6.0   W  39 W  21 W  18 W  14 W   7
## 6   15445895 / R: 1794   ->1817      N:2   W     B     W     B     W    
## 7                                                                       
## 8   DAKSHESH DARURI                  6.0   W  63 W  58 L   4 W  17 W  16
## 9   14598900 / R: 1553   ->1663      N:2   B     W     B     W     B    
## 10                                                                      
##       V9   V10 V11
## 1               NA
## 2  Round Round  NA
## 3    6     7    NA
## 4               NA
## 5  D  12 D   4  NA
## 6  B     W      NA
## 7               NA
## 8  W  20 W   7  NA
## 9  W     B      NA
## 10              NA

Not bad, but it’s not exactly structured the way we’d like. Firstly, the source file actually contains two headers. This means that every entry is really a pair. However, if we examine each entry’s second row, we will see that we don’t need all the information contained. Secondly, the “——” lines might make the text file more readable, but turns our data frame into a mess. Lastly, the final column is nothing but NA values, a quirk of the original file.

Removing the extraneous characters is simple to do. By examining the original text file in a text editor, we can see that it has 196 lines and that the last line is also a meaningless row of “——”. Alternatively, if the text file is too unwieldly to read, we can peak at it with this:

tail(tour.info, 10)

##                                                                                            V1
## 187 -----------------------------------------------------------------------------------------
## 188                                                                                       62 
## 189                                                                                       MI 
## 190 -----------------------------------------------------------------------------------------
## 191                                                                                       63 
## 192                                                                                       MI 
## 193 -----------------------------------------------------------------------------------------
## 194                                                                                       64 
## 195                                                                                       MI 
## 196 -----------------------------------------------------------------------------------------
##                                    V2    V3    V4    V5    V6    V7    V8
## 187                                                                      
## 188  ASHWIN BALAJI                    1.0   W  55 U     U     U     U    
## 189  15219542 / R: 1530   ->1535            B                            
## 190                                                                      
## 191  THOMAS JOSEPH HOSMER             1.0   L   2 L  48 D  49 L  43 L  45
## 192  15057092 / R: 1175   ->1125            W     B     W     B     B    
## 193                                                                      
## 194  BEN LI                           1.0   L  22 D  30 L  31 D  49 L  46
## 195  15006561 / R: 1163   ->1112            B     W     W     B     W    
## 196                                                                      
##        V9   V10 V11
## 187              NA
## 188 U     U      NA
## 189              NA
## 190              NA
## 191 H     U      NA
## 192              NA
## 193              NA
## 194 L  42 L  54  NA
## 195 B     B      NA
## 196              NA

Again, row 196 is the last row and is nothing but the character “——”. Because each entry is a pair, and each pair of entries is separated by dashes, there is an obvious pattern we can exploit. Each dash line occurs three lines after the previous. Let us remove them this way:

rows.to.delete <- seq(from = 1, to = 196, by = 3)
tour.info <- tour.info[-rows.to.delete, ]
head(tour.info, 10)

##        V1                                V2    V3    V4    V5    V6    V7
## 2   Pair   Player Name                      Total Round Round Round Round
## 3   Num    USCF ID / Rtg (Pre->Post)         Pts    1     2     3     4  
## 5      1   GARY HUA                         6.0   W  39 W  21 W  18 W  14
## 6     ON   15445895 / R: 1794   ->1817      N:2   W     B     W     B    
## 8      2   DAKSHESH DARURI                  6.0   W  63 W  58 L   4 W  17
## 9     MI   14598900 / R: 1553   ->1663      N:2   B     W     B     W    
## 11     3   ADITYA BAJAJ                     6.0   L   8 W  61 W  25 W  21
## 12    MI   14959604 / R: 1384   ->1640      N:2   W     B     W     B    
## 14     4   PATRICK H SCHILLING              5.5   W  23 D  28 W   2 W  26
## 15    MI   12616049 / R: 1716   ->1744      N:2   W     B     W     B    
##       V8    V9   V10 V11
## 2  Round Round Round  NA
## 3    5     6     7    NA
## 5  W   7 D  12 D   4  NA
## 6  W     B     W      NA
## 8  W  16 W  20 W   7  NA
## 9  B     W     B      NA
## 11 W  11 W  13 W  12  NA
## 12 W     B     W      NA
## 14 D   5 W  19 D   1  NA
## 15 W     B     B      NA

Much better! Now let’s work on extracting only the information we need from the second row of each paired row. We only need Player's State from the first column, and their Pre-Rating Score from the second. For that, we’ll need to use regular expressions.

First, we will clean up the row numbers to better see the pattern we’ll need for the paired rows.

row.names(tour.info) <- 1:nrow(tour.info)
head(tour.info)

##       V1                                V2    V3    V4    V5    V6    V7
## 1  Pair   Player Name                      Total Round Round Round Round
## 2  Num    USCF ID / Rtg (Pre->Post)         Pts    1     2     3     4  
## 3     1   GARY HUA                         6.0   W  39 W  21 W  18 W  14
## 4    ON   15445895 / R: 1794   ->1817      N:2   W     B     W     B    
## 5     2   DAKSHESH DARURI                  6.0   W  63 W  58 L   4 W  17
## 6    MI   14598900 / R: 1553   ->1663      N:2   B     W     B     W    
##      V8    V9   V10 V11
## 1 Round Round Round  NA
## 2   5     6     7    NA
## 3 W   7 D  12 D   4  NA
## 4 W     B     W      NA
## 5 W  16 W  20 W   7  NA
## 6 B     W     B      NA

tail(tour.info)

##         V1                                V2    V3    V4    V5    V6    V7
## 125    62   ASHWIN BALAJI                    1.0   W  55 U     U     U    
## 126    MI   15219542 / R: 1530   ->1535            B                      
## 127    63   THOMAS JOSEPH HOSMER             1.0   L   2 L  48 D  49 L  43
## 128    MI   15057092 / R: 1175   ->1125            W     B     W     B    
## 129    64   BEN LI                           1.0   L  22 D  30 L  31 D  49
## 130    MI   15006561 / R: 1163   ->1112            B     W     W     B    
##        V8    V9   V10 V11
## 125 U     U     U      NA
## 126                    NA
## 127 L  45 H     U      NA
## 128 B                  NA
## 129 L  46 L  42 L  54  NA
## 130 W     B     B      NA

The only rows we’re concerned about right now starts at row 4 for Gary Hua, ends at row 130 for Ben Li, and are separated by one row. Our sequence will look like this.

rows.to.extract <- seq(from = 4, to = 130, by = 2)

We’ll start with the states, because it’s simpler.

states <- as.character(tour.info[rows.to.extract, 1])
states <- str_replace_all(states, pattern = " ", replacement = "")
states

##  [1] "ON" "MI" "MI" "MI" "MI" "OH" "MI" "MI" "ON" "MI" "MI" "MI" "MI" "MI"
## [15] "MI" "MI" "MI" "MI" "MI" "MI" "ON" "MI" "ON" "MI" "MI" "ON" "MI" "MI"
## [29] "MI" "ON" "MI" "ON" "MI" "MI" "MI" "MI" "MI" "MI" "MI" "MI" "MI" "MI"
## [43] "MI" "MI" "MI" "MI" "MI" "MI" "MI" "MI" "MI" "MI" "MI" "MI" "MI" "MI"
## [57] "MI" "MI" "MI" "MI" "ON" "MI" "MI" "MI"

Now let’s get the scores. The obvious delimiters should be the R: that precedes the pre-score, and the -> that separates the pre and post score. However, we know from our very first example that some scores have characters in them such as 1436P23. Also, some scores do not break a thousand and are padded with two spaces instead of one. We can simplify the regex like so: "R:\\s+\\d+.

prescores <- str_extract(tour.info[rows.to.extract, 2], pattern = "R:\\s+\\d+")
prescores <- as.integer(str_replace_all(prescores, pattern = "R:\\s+", replacement = ""))
prescores

##  [1] 1794 1553 1384 1716 1655 1686 1649 1641 1411 1365 1712 1663 1666 1610
## [15] 1220 1604 1629 1600 1564 1595 1563 1555 1363 1229 1745 1579 1552 1507
## [29] 1602 1522 1494 1441 1449 1399 1438 1355  980 1423 1436 1348 1403 1332
## [43] 1283 1199 1242  377 1362 1382 1291 1056 1011  935 1393 1270 1186 1153
## [57] 1092  917  853  967  955 1530 1175 1163

Next we need to convert the opponents into integers. For this assignment we don’t care whether a player won, lost, had a bye, or any other stat. We only care about who their opponent was, if they had one at all. To make this even easier, we can now delete the rows we no longer need.

tour.info <- tour.info[-c(1, 2, rows.to.extract), ]
tour.info$V12 <- prescores
op.cols <- c(4, 5, 6, 7, 8, 9, 10)

for(i in op.cols){
  tour.info[, i] <- as.integer(str_extract(tour.info[, i], pattern = "\\s+\\d*"))
}

tour.info[is.na(tour.info)] <- 0
row.names(tour.info) <- 1:nrow(tour.info)
colnames(tour.info) <- c("PairID", "Player.Name", "TotalPts", "R1", "R2", "R3", "R4", "R5", "R6", "R7", "Player.State", "PreChess")

Finally we’ve got a clean data frame. All that’s left is to calculate the Average Pre Chess Rating of Opponents.

tour.info$Avg.Opp.Rating <- ""
avg.rating <- function(x, columns){
  # Get playerIDs
  for(i in 1:64){
    opponents <- x[i, columns]
    # Get Pre Chess rating for ID (if applicable)
    num.op <- 0
    sum <- 0
    for(j in opponents){
      if(j != 0){
        sum <- sum + x[j, 12]
        num.op <- num.op + 1
      }
    }
    # Get Average
    avg <- sum/num.op
    # Insert into column
    x[i, 13] <- avg
  }
  
  return(x)
}

tour.info <- avg.rating(tour.info, op.cols)
head(tour.info, 10)

##    PairID                       Player.Name TotalPts R1 R2 R3 R4 R5 R6 R7
## 1      1   GARY HUA                            6.0   39 21 18 14  7 12  4
## 2      2   DAKSHESH DARURI                     6.0   63 58  4 17 16 20  7
## 3      3   ADITYA BAJAJ                        6.0    8 61 25 21 11 13 12
## 4      4   PATRICK H SCHILLING                 5.5   23 28  2 26  5 19  1
## 5      5   HANSHI ZUO                          5.5   45 37 12 13  4 14 17
## 6      6   HANSEN SONG                         5.0   34 29 11 35 10 27 21
## 7      7   GARY DEE SWATHELL                   5.0   57 46 13 11  1  9  2
## 8      8   EZEKIEL HOUGHTON                    5.0    3 32 14  9 47 28 19
## 9      9   STEFANO LEE                         5.0   25 18 59  8 26  7 20
## 10    10   ANVIT RAO                           5.0   16 19 55 31  6 25 18
##    Player.State PreChess   Avg.Opp.Rating
## 1             0     1794 1605.28571428571
## 2             0     1553 1469.28571428571
## 3             0     1384 1563.57142857143
## 4             0     1716 1573.57142857143
## 5             0     1655 1500.85714285714
## 6             0     1686 1518.71428571429
## 7             0     1649 1372.14285714286
## 8             0     1641 1468.42857142857
## 9             0     1411 1523.14285714286
## 10            0     1365 1554.14285714286

Almost there. We just need to round the numbers, as indicated in the prompt.

tour.info$Avg.Opp.Rating <- round(as.integer(tour.info[,13]))

Now we write the required information to a .CSV.

final <- subset(tour.info, select = c("Player.Name", "Player.State", "TotalPts", "PreChess", "Avg.Opp.Rating"))
head(final, 10)

##                          Player.Name Player.State TotalPts PreChess
## 1   GARY HUA                                    0    6.0       1794
## 2   DAKSHESH DARURI                             0    6.0       1553
## 3   ADITYA BAJAJ                                0    6.0       1384
## 4   PATRICK H SCHILLING                         0    5.5       1716
## 5   HANSHI ZUO                                  0    5.5       1655
## 6   HANSEN SONG                                 0    5.0       1686
## 7   GARY DEE SWATHELL                           0    5.0       1649
## 8   EZEKIEL HOUGHTON                            0    5.0       1641
## 9   STEFANO LEE                                 0    5.0       1411
## 10  ANVIT RAO                                   0    5.0       1365
##    Avg.Opp.Rating
## 1            1605
## 2            1469
## 3            1563
## 4            1573
## 5            1500
## 6            1518
## 7            1372
## 8            1468
## 9            1523
## 10           1554

write.csv(final, file = "chessinfo.csv")

Data 607 Project 1

Iden Watanabe

February 21, 2018