The goal of this assignment is to read a text file containing the results of a chess tournament and generate a CSV file from that data. The average pre-tournament rating of each player’s opponents must also be calculated and included.

The first few lines of the text file in question look like this:

Chess Tournament Text File

At first glance, we can see that records are separated by rows of dashes, that each record spans two lines, and that fields within a record are separated by vertical pipes.

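The first step is to load the stringr library and read the file into R. Setting sep="*" (a character that does not appear in the data) makes read.table treat each line as a single field: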

library(stringr)
chess <- read.table("tournamentinfo.txt",header=FALSE, sep="*")
head(chess)
##                                                                                           V1
## 1  -----------------------------------------------------------------------------------------
## 2  Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| 
## 3  Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | 
## 4  -----------------------------------------------------------------------------------------
## 5      1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|
## 6     ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |

We can see that each line of the table has been imported as one long string.
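
As an aside, base R's readLines also returns one string per line, without relying on a separator that happens to be absent from the file. The following is only a minimal sketch: sample_text and the temporary file are hypothetical stand-ins for tournamentinfo.txt, not part of the original workflow.

```r
# Hypothetical sample standing in for the first lines of the tournament file
sample_text <- c(
  "-----------------------------",
  " Pair | Player Name    |Total|",
  "    1 | GARY HUA       |6.0  |"
)

# Write the sample to a temporary file so the example is self-contained
tmp <- tempfile(fileext = ".txt")
writeLines(sample_text, tmp)

# readLines returns a character vector with one element per line
chess_lines <- readLines(tmp)
length(chess_lines)
class(chess_lines)
```

With the real data, the path "tournamentinfo.txt" would take the place of tmp. The rest of this write-up keeps the read.table approach.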

Next, we will remove the header lines.

chess <- chess[-c(1,2,3,4),]
head(chess)
## [1]     1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|
## [2]    ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |
## [3] -----------------------------------------------------------------------------------------
## [4]     2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|
## [5]    MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |
## [6] -----------------------------------------------------------------------------------------
## 131 Levels:     1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4| ...

To separate the lines containing data from the lines containing dashes, we will take every third line, starting from the first line and from the second line respectively, and assign each set to its own variable.

chess_line1 <- chess[seq(1,length(chess), 3)]
head(chess_line1)
## [1]     1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|
## [2]     2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|
## [3]     3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|
## [4]     4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|
## [5]     5 | HANSHI ZUO                      |5.5  |W  45|W  37|D  12|D  13|D   4|W  14|W  17|
## [6]     6 | HANSEN SONG                     |5.0  |W  34|D  29|L  11|W  35|D  10|W  27|W  21|
## 131 Levels:     1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4| ...
chess_line2 <- chess[seq(2,length(chess), 3)]
head(chess_line2)
## [1]    ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |
## [2]    MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |
## [3]    MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |
## [4]    MI | 12616049 / R: 1716   ->1744     |N:2  |W    |B    |W    |B    |W    |B    |B    |
## [5]    MI | 14601533 / R: 1655   ->1690     |N:2  |B    |W    |B    |W    |B    |W    |B    |
## [6]    OH | 15055204 / R: 1686   ->1687     |N:3  |W    |B    |W    |B    |B    |W    |B    |
## 131 Levels:     1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4| ...
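
To make the every-third-line selection concrete, here is a small sketch on a toy vector (toy_lines is a made-up name and made-up data) laid out like the file: two data lines followed by a separator.

```r
# Toy vector laid out like the file: two data lines, then a separator
toy_lines <- c("a1", "a2", "---", "b1", "b2", "---", "c1", "c2", "---")

# seq(1, n, by = 3) gives 1, 4, 7: the first line of every record
first_lines <- toy_lines[seq(1, length(toy_lines), by = 3)]

# seq(2, n, by = 3) gives 2, 5, 8: the second line of every record
second_lines <- toy_lines[seq(2, length(toy_lines), by = 3)]

first_lines
second_lines
```

The separator lines at positions 3, 6, and 9 are never selected, so they drop out without any explicit filtering.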

We now need to take the lines with data and split them into individual fields. Because each field is separated by a vertical pipe, we will use the str_split function of the stringr library with the pipe as the separator. The pipe is a regular-expression metacharacter, so it is escaped as "\\|".

split1 <- str_split(chess_line1,"\\|")
split1[[1]]
##  [1] "    1 "                           
##  [2] " GARY HUA                        "
##  [3] "6.0  "                            
##  [4] "W  39"                            
##  [5] "W  21"                            
##  [6] "W  18"                            
##  [7] "W  14"                            
##  [8] "W   7"                            
##  [9] "D  12"                            
## [10] "D   4"                            
## [11] ""
split2 <- str_split(chess_line2,"\\|")
split2[[1]]
##  [1] "   ON "                           
##  [2] " 15445895 / R: 1794   ->1817     "
##  [3] "N:2  "                            
##  [4] "W    "                            
##  [5] "B    "                            
##  [6] "W    "                            
##  [7] "B    "                            
##  [8] "W    "                            
##  [9] "B    "                            
## [10] "W    "                            
## [11] ""
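
For comparison, the same kind of split can be sketched in base R. This is an illustration under assumptions, not the method used here: toy_record is a shortened, made-up record, and strsplit with fixed = TRUE treats the pipe literally so no escaping is needed.

```r
# Shortened, made-up record in the same layout as the tournament file
toy_record <- "    1 | GARY HUA        |6.0  |W  39|"

# fixed = TRUE matches "|" as a literal character rather than a regex
toy_fields <- strsplit(toy_record, "|", fixed = TRUE)[[1]]

# trimws removes the padding around each field
toy_trimmed <- trimws(toy_fields)

toy_fields
toy_trimmed
```

One difference to be aware of: strsplit does not return the trailing empty string that appears as element [11] in the str_split output.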

The output above shows only the first record. We can see the data we need, but some cleanup is required. Before that, we will use a for loop to collect the fields we want into vectors.

name <- NULL
state <- NULL
total_pts <- NULL
pre_rating <- NULL
round_1 <- NULL
round_2 <- NULL
round_3 <- NULL
round_4 <- NULL
round_5 <- NULL
round_6 <- NULL
round_7 <- NULL


for(i in split1){
  name <- c(name,i[2])
  total_pts <- c(total_pts,i[3])
  round_1 <- c(round_1,i[4])
  round_2 <- c(round_2,i[5])
  round_3 <- c(round_3,i[6])
  round_4 <- c(round_4,i[7])
  round_5 <- c(round_5,i[8])
  round_6 <- c(round_6,i[9])
  round_7 <- c(round_7,i[10])
}

for(i in split2){
  state <- c(state,i[1])
  pre_rating <- c(pre_rating,i[2])
}

head(name)
## [1] " GARY HUA                        " " DAKSHESH DARURI                 "
## [3] " ADITYA BAJAJ                    " " PATRICK H SCHILLING             "
## [5] " HANSHI ZUO                      " " HANSEN SONG                     "
head(state)
## [1] "   ON " "   MI " "   MI " "   MI " "   MI " "   OH "
head(total_pts)
## [1] "6.0  " "6.0  " "6.0  " "5.5  " "5.5  " "5.0  "
head(pre_rating)
## [1] " 15445895 / R: 1794   ->1817     " " 14598900 / R: 1553   ->1663     "
## [3] " 14959604 / R: 1384   ->1640     " " 12616049 / R: 1716   ->1744     "
## [5] " 14601533 / R: 1655   ->1690     " " 15055204 / R: 1686   ->1687     "
head(round_1)
## [1] "W  39" "W  63" "L   8" "W  23" "W  45" "W  34"
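
The loops above grow each vector one element at a time. A possible alternative, sketched here on made-up data, pulls the same field from every record in a single call. The names toy_split and field are hypothetical helpers introduced only for this illustration.

```r
# Hypothetical split output for two players (same shape as split1)
toy_split <- list(
  c("    1 ", " GARY HUA ", "6.0  ", "W  39"),
  c("    2 ", " DAKSHESH DARURI ", "6.0  ", "W  63")
)

# Return the k-th field of every record as a character vector
field <- function(records, k) {
  vapply(records, function(r) r[k], character(1))
}

toy_name <- field(toy_split, 2)
toy_pts <- field(toy_split, 3)

toy_name
toy_pts
```

The result should match what the loop builds, while avoiding the repeated c(...) calls.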

We now have separate vectors containing the name, state, and total points of each player, plus the raw string that holds the pre-rating. We also have fields containing the result letter and the pair number of the opponent each player faced in each of the seven rounds. Some cleanup is still needed to leave us with only the data we want, in the correct data type.

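We will trim the whitespace from the name and state fields and convert total points to a number. For the “round” fields, we will use the str_extract function to keep only the opponent’s pair number, as we do not need the result letter. Rounds with no opponent contain no digits and become NA.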

# Strip surrounding whitespace and convert total points to numbers
name <- str_trim(name)
state <- str_trim(state)
total_pts <- as.numeric(total_pts)

# Keep only the opponent's pair number; rounds with no digits become NA
round_1 <- as.numeric(str_extract(round_1, "\\d+"))
round_2 <- as.numeric(str_extract(round_2, "\\d+"))
round_3 <- as.numeric(str_extract(round_3, "\\d+"))
round_4 <- as.numeric(str_extract(round_4, "\\d+"))
round_5 <- as.numeric(str_extract(round_5, "\\d+"))
round_6 <- as.numeric(str_extract(round_6, "\\d+"))
round_7 <- as.numeric(str_extract(round_7, "\\d+"))

head(name)
## [1] "GARY HUA"            "DAKSHESH DARURI"     "ADITYA BAJAJ"       
## [4] "PATRICK H SCHILLING" "HANSHI ZUO"          "HANSEN SONG"
head(state)
## [1] "ON" "MI" "MI" "MI" "MI" "OH"
head(total_pts)
## [1] 6.0 6.0 6.0 5.5 5.5 5.0
head(round_1)
## [1] 39 63  8 23 45 34
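
The handling of rounds without an opponent can be sketched on a few made-up values. This example uses base R's gsub in place of str_extract so that it needs no packages; toy_rounds is illustrative data in the same layout as the round fields.

```r
# Made-up round fields: three played games and one round with no opponent
toy_rounds <- c("W  39", "L   8", "H    ", "D  12")

# Delete everything that is not a digit; the unplayed round becomes ""
toy_digits <- gsub("[^0-9]", "", toy_rounds)

# as.numeric turns the empty string into NA
toy_opponent <- as.numeric(toy_digits)

toy_opponent
```

The NA marks a round that should be left out of the opponent average later on.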

Pre-rating also uses str_extract, but in two passes. The first pass isolates the pre-tournament rating from the rest of the string (ratings vary in length, leading spaces, and trailing characters), and the second strips any leftover non-digit characters.

pre_rating <- str_extract(pre_rating,"[ ]([0-9]{3})[ ]|[:][ ]([0-9]{4})|([0-9]{3})[P]|([0-9]{4})[P]")

pre_rating <- as.numeric(str_extract(pre_rating,"(\\d)+"))
head(pre_rating)
## [1] 1794 1553 1384 1716 1655 1686
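
A single-pass alternative is to anchor on the "R:" label and capture the digits that follow it. The sketch below uses base R's sub; toy_raw is illustrative, and the IDs in its second and third strings are made up to show a rating followed by a "P" suffix and a three-digit rating.

```r
# Illustrative strings; the last two IDs are invented for the example
toy_raw <- c(
  " 15445895 / R: 1794   ->1817     ",
  " 12345678 / R: 1641P17->1657P24  ",
  " 12345679 / R:  955P11-> 979P18  "
)

# Capture the first run of digits after "R:" and discard everything else
toy_pre <- as.numeric(sub(".*R:[[:space:]]*([0-9]+).*", "\\1", toy_raw))

toy_pre
```

This relies on every line containing exactly one "R:" label, which holds for the records shown here but is an assumption about the full file.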

Now we have most of the elements needed to build our final dataframe.

df <- data.frame(name,state,total_pts,pre_rating)
head(df)
##                  name state total_pts pre_rating
## 1            GARY HUA    ON       6.0       1794
## 2     DAKSHESH DARURI    MI       6.0       1553
## 3        ADITYA BAJAJ    MI       6.0       1384
## 4 PATRICK H SCHILLING    MI       5.5       1716
## 5          HANSHI ZUO    MI       5.5       1655
## 6         HANSEN SONG    OH       5.0       1686

The final step (before generating the CSV) is to calculate the average pre-rating of each player’s opponents.

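Because the “round” vectors contain each opponent’s pair number, and the players appear in the file in pair-number order, row i of the dataframe is player i. We can therefore use the pair numbers to index directly into the pre_rating column.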

round_1_opponent_rating <- df$pre_rating[round_1]
round_2_opponent_rating <- df$pre_rating[round_2]
round_3_opponent_rating <- df$pre_rating[round_3]
round_4_opponent_rating <- df$pre_rating[round_4]
round_5_opponent_rating <- df$pre_rating[round_5]
round_6_opponent_rating <- df$pre_rating[round_6]
round_7_opponent_rating <- df$pre_rating[round_7]

head(round_1_opponent_rating)
## [1] 1436 1175 1641 1363 1242 1399
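
The lookup is ordinary vector indexing, which a three-player toy example makes easy to check by hand (toy_rating and toy_round are made-up names and values).

```r
# Pre-ratings of three hypothetical players, in pair-number order
toy_rating <- c(1794, 1553, 1384)

# Opponent pair numbers for one round; player 2 had no opponent
toy_round <- c(2, NA, 1)

# Indexing by pair number returns each opponent's pre-rating
toy_opponent_rating <- toy_rating[toy_round]

toy_opponent_rating
```

Player 1 faced player 2 (rated 1553), player 3 faced player 1 (rated 1794), and the NA index carries through as NA.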

Now we can build a dataframe containing the pre-ratings of each player’s opponents.

average_df <- data.frame(round_1_opponent_rating,
                         round_2_opponent_rating,
                         round_3_opponent_rating,
                         round_4_opponent_rating,
                         round_5_opponent_rating,
                         round_6_opponent_rating,
                         round_7_opponent_rating)

head(average_df)
##   round_1_opponent_rating round_2_opponent_rating round_3_opponent_rating
## 1                    1436                    1563                    1600
## 2                    1175                     917                    1716
## 3                    1641                     955                    1745
## 4                    1363                    1507                    1553
## 5                    1242                     980                    1663
## 6                    1399                    1602                    1712
##   round_4_opponent_rating round_5_opponent_rating round_6_opponent_rating
## 1                    1610                    1649                    1663
## 2                    1629                    1604                    1595
## 3                    1563                    1712                    1666
## 4                    1579                    1655                    1564
## 5                    1666                    1716                    1610
## 6                    1438                    1365                    1552
##   round_7_opponent_rating
## 1                    1716
## 2                    1649
## 3                    1663
## 4                    1794
## 5                    1629
## 6                    1563

Some players did not play certain rounds and so have no opponent for those rounds, which leaves NA values in average_df. Calling rowMeans with na.rm=TRUE averages each row over only the games actually played. We then store these averages in a new column of our final df.

df$opponent_average <- rowMeans(average_df,na.rm=TRUE)
head(df)
##                  name state total_pts pre_rating opponent_average
## 1            GARY HUA    ON       6.0       1794         1605.286
## 2     DAKSHESH DARURI    MI       6.0       1553         1469.286
## 3        ADITYA BAJAJ    MI       6.0       1384         1563.571
## 4 PATRICK H SCHILLING    MI       5.5       1716         1573.571
## 5          HANSHI ZUO    MI       5.5       1655         1500.857
## 6         HANSEN SONG    OH       5.0       1686         1518.714
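
The effect of na.rm can be seen on a two-row toy dataframe (toy_opp is made-up data): the second player has only one opponent rating.

```r
# Two hypothetical players; the second missed round 2
toy_opp <- data.frame(
  round_1 = c(1500, 1400),
  round_2 = c(1600, NA)
)

# With na.rm = TRUE the NA is dropped and the mean uses the remaining games
avg_played <- rowMeans(toy_opp, na.rm = TRUE)

# Without it, any NA in a row makes that row's mean NA
avg_default <- rowMeans(toy_opp)

avg_played
avg_default
```

The first call gives 1550 and 1400, while the second gives 1550 and NA.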

With the final dataframe complete, we can generate the CSV.

write.csv(df,"chess_tournament.csv")
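
By default, write.csv also writes the row names as an extra, unnamed first column. If that column is not wanted, row.names = FALSE leaves it out. A minimal sketch on made-up data, writing to a temporary file rather than chess_tournament.csv:

```r
# Small hypothetical dataframe standing in for df
toy_df <- data.frame(
  name = c("GARY HUA", "DAKSHESH DARURI"),
  pre_rating = c(1794, 1553)
)

# Write without row names to a temporary file
out <- tempfile(fileext = ".csv")
write.csv(toy_df, out, row.names = FALSE)

# Read it back to confirm that only the two data columns are present
toy_back <- read.csv(out)
names(toy_back)
```

Whether to keep the row-number column depends on how the CSV will be consumed; it is harmless if the consumer ignores it.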