1. Loading Data into a Dataframe

The data I will be using is sourced from the course for this project. The data involves a chess tournament, the data includes for example, the 7 games played per player, their elo ratings, and the results of the games. The data is in a text file format, and requires processing before data can be extracted. The first challenge is that the data is meant for human consumption, so spacing is uniform and the data is not separated by commas. Second, the data is not in a standard format, so data for one player spans multiple lines. Though it may be possible to simply read the file line by line and extract the data, I will be using the readr package to read the data into a dataframe. This process helps eliminate the need to manually extract the data, but still requires some processing in order to extract the data correctly. The chief challenge is that the data is not in a standard format, in particular, its use of a non-standard delimiter, the pipe character “|”. Secondly, between each player’s data, there is a line of dashes that must be removed. Third, and the least important, is that the data for each player is split across two lines. In this first section, I will load the data into a dataframe in a way that not only identifies the delimiter, but also helps remove the initial set of dashes at the top of the dataset in order to allow the second row to become the title row.

Note: The code used to load the data is sourced from the renaming the file to a .csv file and then using Rstudio’s import dataset feature.

library(readr)
tournament <- read_delim("tournament.csv", delim = "|", escape_double = FALSE, trim_ws = TRUE,
    skip = 1, show_col_types = FALSE)
## New names:
## • `Round` -> `Round...4`
## • `Round` -> `Round...5`
## • `Round` -> `Round...6`
## • `Round` -> `Round...7`
## • `Round` -> `Round...8`
## • `Round` -> `Round...9`
## • `Round` -> `Round...10`
## • `` -> `...11`
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)

After running this code, we can see that the data has been loaded into the dataframe tournament. The data is now in a format that can be used for further processing. Unlike the original data, extraneous data and formatting, can now be removed or ignored. What data that remains clustered together can be much more easily separated with simple regex patterns that are applied cell to cell rather than line by line. First up for further processing, is to remove the dashes that separate each player’s data.

1.a. Removing Dashes

Since the data is now in a dataframe, we can use tidyverse to handle the processing of the player separators, but searching and removing for rows with missing values where the name should be. This is because the name and the ID/Rating(s) are are populated for all players. Where there is a row of dashes, there is a NA value in the name column.

tournament <- tournament %>%
    drop_na("Player Name")

1.b. Removing the Last Column

The last column of the data is not needed for this project, so it will be removed. This column is all NA values, so it can be removed by simply removing the last column of the dataframe. The source of this last column is due to the fact that the data had a “|” delimiter at the end of each line, which caused the last column to be empty.

To make this simple, we can remove the last column of the dataframe by using the following code that takes the index of the last column and removes it from the dataframe. And since the last column’s index is equal to the number of columns in the dataframe, we can simply use the following code to remove the last column.

tournament <- tournament[, -ncol(tournament)]

1.c. Making Each Row Represent a Player

The data is currently in a format where each player’s data is split across two rows. This is not ideal for processing the data, so I will be combining the data for each player into a single row. This makes the data easier to work with by allowing us to know that each row represents a player, and that each column represents a different attribute (kind of) of the player.

rows <- seq(2, nrow(tournament), 2)
players = tournament[rows, ]  # Extracting the 1st row of each player
players
## # A tibble: 64 × 10
##    Pair  `Player Name`   Total Round...4 Round...5 Round...6 Round...7 Round...8
##    <chr> <chr>           <chr> <chr>     <chr>     <chr>     <chr>     <chr>    
##  1 1     GARY HUA        6.0   W  39     W  21     W  18     W  14     W   7    
##  2 2     DAKSHESH DARURI 6.0   W  63     W  58     L   4     W  17     W  16    
##  3 3     ADITYA BAJAJ    6.0   L   8     W  61     W  25     W  21     W  11    
##  4 4     PATRICK H SCHI… 5.5   W  23     D  28     W   2     W  26     D   5    
##  5 5     HANSHI ZUO      5.5   W  45     W  37     D  12     D  13     D   4    
##  6 6     HANSEN SONG     5.0   W  34     D  29     L  11     W  35     D  10    
##  7 7     GARY DEE SWATH… 5.0   W  57     W  46     W  13     W  11     L   1    
##  8 8     EZEKIEL HOUGHT… 5.0   W   3     W  32     L  14     L   9     W  47    
##  9 9     STEFANO LEE     5.0   W  25     L  18     W  59     W   8     W  26    
## 10 10    ANVIT RAO       5.0   D  16     L  19     W  55     W  31     D   6    
## # ℹ 54 more rows
## # ℹ 2 more variables: Round...9 <chr>, Round...10 <chr>
base <- tournament[rows + 1, ]  # Extracting the 2nd row of each player
cols <- tournament[1, ]  # extracting the column names for the 2nd rows for each player
cols  # we can see that the columns are pretty much correct, except for the fact that they some are meant to be read with the first row, I will change those to better reflect the data.
## # A tibble: 1 × 10
##   Pair  `Player Name`    Total Round...4 Round...5 Round...6 Round...7 Round...8
##   <chr> <chr>            <chr> <chr>     <chr>     <chr>     <chr>     <chr>    
## 1 Num   USCF ID / Rtg (… Pts   1         2         3         4         5        
## # ℹ 2 more variables: Round...9 <chr>, Round...10 <chr>
cols[1] <- "State"
cols[2] <- "ID/Ratings"  # simplify the ratings column name
cols[3] <- "Unkown"  # just a temporary name since we don't need this data
colnames(base) = cols  # Setting the column names of the 2nd row to the 1st row of the original double stacked data
base
## # A tibble: 64 × 10
##    State `ID/Ratings`           Unkown `1`   `2`   `3`   `4`   `5`   `6`   `7`  
##    <chr> <chr>                  <chr>  <chr> <chr> <chr> <chr> <chr> <chr> <chr>
##  1 ON    15445895 / R: 1794   … N:2    W     B     W     B     W     B     W    
##  2 MI    14598900 / R: 1553   … N:2    B     W     B     W     B     W     B    
##  3 MI    14959604 / R: 1384   … N:2    W     B     W     B     W     B     W    
##  4 MI    12616049 / R: 1716   … N:2    W     B     W     B     W     B     B    
##  5 MI    14601533 / R: 1655   … N:2    B     W     B     W     B     W     B    
##  6 OH    15055204 / R: 1686   … N:3    W     B     W     B     B     W     B    
##  7 MI    11146376 / R: 1649   … N:3    W     B     W     B     B     W     W    
##  8 MI    15142253 / R: 1641P17… N:3    B     W     B     W     B     W     W    
##  9 ON    14954524 / R: 1411   … N:2    W     B     W     B     W     B     B    
## 10 MI    14150362 / R: 1365   … N:3    W     W     B     B     W     B     W    
## # ℹ 54 more rows

We can then combine the two sets of data into a single dataframe by using the following code:

tournament <- cbind(players, base)

# just a quick and dirty rename of the round names to make them more readable.
# Also great if we need to search them later.
cols <- colnames(tournament)
cols[4] <- "Round 1"
cols[5] <- "Round 2"
cols[6] <- "Round 3"
cols[7] <- "Round 4"
cols[8] <- "Round 5"
cols[9] <- "Round 6"
cols[10] <- "Round 7"
colnames(tournament) <- cols

2. Extracting Data

we can use a loop to extract the data for each player now since each row represents a player. At each iteration, we extract the player name, state, total points scored, their pre-rating, and their pair number. We are reserving the last column, the ratings of their opponents faced, for another section. After we have the data we need from a player, we append it to a new dataframe called players. Lastly, we need to reformat the pre-rating column to be the player’s rating, right now its a combination of their ID and their pre/post ratings.

players <- data.frame()
for (row in 1:nrow(tournament)) {
    name <- tournament[row, 2]
    state <- tournament[row, "State"]  # easier to search by col name state here
    points <- tournament[row, 3]
    pre_rating <- tournament[row, "ID/Ratings"]
    pair_num <- tournament[row, 1]
    player <- cbind(name, state, points, pre_rating, pair_num)
    players <- rbind(players, player)
}

Removing only the pre-ratings is simple, unlike all other numeric entries in the column, the pre-ratings are the only ones that have a space before the number. So we can use regex to extract the pre-ratings from the column. Then to remove the space, we can simply use the trimws function to remove the whitespace from the string.

# Reformatting the pre-rating column to be the player's rating
players$pre_rating <- trimws(str_extract(players$pre_rating, "\\s\\d+"))

And to double check that the data has been extracted correctly, we can use the following code to display the data in the players dataframe:

# Just to quickly check if we have the data
head(players)
##                  name state points pre_rating pair_num
## 1            GARY HUA    ON    6.0       1794        1
## 2     DAKSHESH DARURI    MI    6.0       1553        2
## 3        ADITYA BAJAJ    MI    6.0       1384        3
## 4 PATRICK H SCHILLING    MI    5.5       1716        4
## 5          HANSHI ZUO    MI    5.5       1655        5
## 6         HANSEN SONG    OH    5.0       1686        6

3. Extracting Opponent Ratings

This is a bit more complicated than the prior section as each player all have 7 games, however, not all are played. So we need to extract them all first, search through them to obtain a list of opponents, and then extract the ratings of those opponents. This is done by using regex to extract the opponent pair number and then using the number to search for the ratings of the opponents from the extracted data from the prior step.

op_ratings <- data.frame()
for (row in 1:nrow(tournament)) {
    rounds <- tournament[row, 4:10]  # we already know the round col #s from the previous section
    opponent_nums <- data.frame(ncol = 7)
    for (round in rounds) {
        if (round != "U") {
            # U means unplayed Extract the player number of the opponent
            opponent_num <- str_extract(round, "\\d+")
            opponent_nums <- cbind(opponent_nums, opponent_num)
        } else {
            opponent_nums <- cbind(opponent_nums, NA)
        }
    }
    p_ops <- data.frame(ncols = 7)  # player opponent scores
    for (opponent in opponent_nums) {
        if (!is.na(opponent)) {
            # Extract the rating of the opponent
            opponent_rating <- players[players$pair_num == opponent, "pre_rating"]
            p_ops <- cbind(p_ops, opponent_rating)
        }
    }
    op_ratings <- rbind(op_ratings, sum(as.numeric(p_ops[2:length(p_ops)]))/(length(p_ops) -
        1))  # divide by ops to get the average rating of the opponents
}

head(op_ratings)
##   X1610.75
## 1 1610.750
## 2 1491.750
## 3 1574.250
## 4 1583.000
## 5 1519.375
## 6 1535.000

4. Finalize Data

Now that we have the data for the players and their opponents, we can combine the two dataframes into a single dataframe. This is done by using the cbind function to combine the two dataframes. However, we must first remove the pair number column from the players dataframe, as it is not needed in the final dataframe.

players <- players[, -ncol(players)]  # remove the pair number column
players <- cbind(players, op_ratings)  # combine the two dataframes
colnames(players)[5] <- "opponent_ratings"  # rename the column to be more descriptive
head(players)
##                  name state points pre_rating opponent_ratings
## 1            GARY HUA    ON    6.0       1794         1610.750
## 2     DAKSHESH DARURI    MI    6.0       1553         1491.750
## 3        ADITYA BAJAJ    MI    6.0       1384         1574.250
## 4 PATRICK H SCHILLING    MI    5.5       1716         1583.000
## 5          HANSHI ZUO    MI    5.5       1655         1519.375
## 6         HANSEN SONG    OH    5.0       1686         1535.000

5. Conclusion

In this project, I have explored the usage of regex in R, in combination with csv importing, and dataframe manipulation, to extract data from a text file. The data was not in a standard format, so it required some processing to extract the data correctly. But by creatively applying individual steps to eliminate the human-readable formatting, extraneous data, and the non-standard delimiter, the data required data was extracted from the text file. This new data is now nearly ready for further processing, such as analysis or visualization. The data extracted includes the player’s name, state, points scored, rating, and the average rating of their opponents. Unfortunately, the data is not yet in a format that is ready for analysis, as the ratings and points scored are still in string format. However, this data can be easily converted to numeric format, and then used for further analysis.

Using this data, we can discern relationships between the player’s rating and the average rating of their opponents, and how that affected the players performance in the tournament by comparing the points scored by the player to the average rating of their opponents. Furthermore, the data includes the player’s state, which could be used to determine if there is a correlation between the player’s state and their performance in the tournament. Or it may allow us to visualize the data on a map. Regardless, this subset of the data is now ready for further analysis, and can be used to answer questions about the tournament, the players, and the games played.

Some further steps that could be taken to improve the data: First, and as mentioned earlier, we can conver the ratings and points scored to numeric format. Second, we can add how many games each player played, as this data is not included despite it influencing the average rating of opponents played, and can indicate therefore how confident we can be in the average rating of the opponents as well as the points scored. Lastly, we can also provide a ratio of the games won to the games played, this can also help us understand the player’s performance in the tournament. These are only a few of the many ways we can further process the data, and the possibilities are endless, limited only by the original data.

In conclusion, this project has demonstrated the power of regex in R, and how it can be used to extract data from a text file. It has also illuminated the fact that Regex is not the only tool that can be used to extract data. In the end, the data extracted can be used for further analysis, and can help answer questions about the tournament, the players, and the games played.

6. Appendix

6.a. The Full Final Dataframe

players
##                          name state points pre_rating opponent_ratings
## 1                    GARY HUA    ON    6.0       1794         1610.750
## 2             DAKSHESH DARURI    MI    6.0       1553         1491.750
## 3                ADITYA BAJAJ    MI    6.0       1384         1574.250
## 4         PATRICK H SCHILLING    MI    5.5       1716         1583.000
## 5                  HANSHI ZUO    MI    5.5       1655         1519.375
## 6                 HANSEN SONG    OH    5.0       1686         1535.000
## 7           GARY DEE SWATHELL    MI    5.0       1649         1406.750
## 8            EZEKIEL HOUGHTON    MI    5.0       1641         1491.000
## 9                 STEFANO LEE    ON    5.0       1411         1538.875
## 10                  ANVIT RAO    MI    5.0       1365         1566.000
## 11   CAMERON WILLIAM MC LEMAN    MI    4.5       1712         1490.250
## 12             KENNETH J TACK    MI    4.5       1663         1526.571
## 13          TORRANCE HENRY JR    MI    4.5       1666         1516.750
## 14               BRADLEY SHAW    MI    4.5       1610         1531.750
## 15     ZACHARY JAMES HOUGHTON    MI    4.5       1220         1504.500
## 16               MIKE NIKITIN    MI    4.0       1604         1429.667
## 17         RONALD GRZEGORCZYK    MI    4.0       1629         1517.375
## 18              DAVID SUNDEEN    MI    4.0       1600         1501.125
## 19               DIPANKAR ROY    MI    4.0       1564         1454.125
## 20                JASON ZHENG    MI    4.0       1595         1440.625
## 21              DINH DANG BUI    ON    4.0       1563         1492.750
## 22           EUGENE L MCCLURE    MI    4.0       1555         1350.143
## 23                   ALAN BUI    ON    4.0       1363         1268.250
## 24          MICHAEL R ALDRICH    MI    4.0       1229         1393.500
## 25           LOREN SCHWIEBERT    MI    3.5       1745         1399.000
## 26                    MAX ZHU    ON    3.5       1579         1524.625
## 27             GAURAV GIDWANI    MI    3.5       1552         1282.714
## 28 SOFIA ADINA STANESCU-BELLU    MI    3.5       1507         1538.000
## 29           CHIEDOZIE OKORIE    MI    3.5       1602         1361.429
## 30         GEORGE AVERY JONES    ON    3.5       1522         1207.250
## 31               RISHI SHETTY    MI    3.5       1494         1308.500
## 32      JOSHUA PHILIP MATHEWS    ON    3.5       1441         1412.500
## 33                    JADE GE    MI    3.5       1449         1323.375
## 34     MICHAEL JEFFERY THOMAS    MI    3.5       1399         1409.500
## 35           JOSHUA DAVID LEE    MI    3.5       1438         1212.125
## 36              SIDDHARTH JHA    MI    3.5       1355         1425.429
## 37       AMIYATOSH PWNANANDAM    MI    3.5        980         1428.833
## 38                  BRIAN LIU    MI    3.0       1423         1554.857
## 39              JOEL R HENDON    MI    3.0       1436         1457.000
## 40               FOREST ZHANG    MI    3.0       1348         1422.875
## 41        KYLE WILLIAM MURPHY    MI    3.0       1403         1328.600
## 42                   JARED GE    MI    3.0       1332         1212.250
## 43          ROBERT GLEN VASEY    MI    3.0       1283         1174.375
## 44         JUSTIN D SCHILLING    MI    3.0       1199         1373.000
## 45                  DEREK YAN    MI    3.0       1242         1214.125
## 46   JACOB ALEXANDER LAVALLEY    MI    3.0        377         1394.125
## 47                ERIC WRIGHT    MI    2.5       1362         1424.125
## 48               DANIEL KHAIN    MI    2.5       1382         1404.667
## 49           MICHAEL J MARTIN    MI    2.5       1291         1346.333
## 50                 SHIVAM JHA    MI    2.5       1056         1346.429
## 51             TEJAS AYYAGARI    MI    2.5       1011         1392.750
## 52                  ETHAN GUO    MI    2.5        935         1513.875
## 53              JOSE C YBARRA    MI    2.0       1393         1421.250
## 54                LARRY HODGE    MI    2.0       1270         1269.429
## 55                  ALEX KONG    MI    2.0       1186         1440.714
## 56               MARISA RICCI    MI    2.0       1153         1453.500
## 57                 MICHAEL LU    MI    2.0       1092         1403.857
## 58               VIRAJ MOHILE    MI    2.0        917         1427.857
## 59          SEAN M MC CORMICK    MI    2.0        853         1366.143
## 60                 JULIA SHEN    MI    1.5        967         1383.333
## 61              JEZZEL FARKAS    ON    1.5        955         1367.500
## 62              ASHWIN BALAJI    MI    1.0       1530         1417.500
## 63       THOMAS JOSEPH HOSMER    MI    1.0       1175         1400.000
## 64                     BEN LI    MI    1.0       1163         1311.250