DATA 607 Project 1
Regex on Chess Data
Prof. Catlin
The data I will be using is sourced from the course for this project.
The data involves a chess tournament, the data includes for example, the
7 games played per player, their elo ratings, and the results of the
games. The data is in a text file format, and requires processing before
data can be extracted. The first challenge is that the data is meant for
human consumption, so spacing is uniform and the data is not separated
by commas. Second, the data is not in a standard format, so data for one
player spans multiple lines. Though it may be possible to simply read
the file line by line and extract the data, I will be using the
readr package to read the data into a dataframe. This
process helps eliminate the need to manually extract the data, but still
requires some processing in order to extract the data correctly. The
chief challenge is that the data is not in a standard format, in
particular, its use of a non-standard delimiter, the pipe character “|”.
Secondly, between each player’s data, there is a line of dashes that
must be removed. Third, and the least important, is that the data for
each player is split across two lines. In this first section, I will
load the data into a dataframe in a way that not only identifies the
delimiter, but also helps remove the initial set of dashes at the top of
the dataset in order to allow the second row to become the title
row.
Note: The code used to load the data is sourced from the renaming the file to a .csv file and then using Rstudio’s import dataset feature.
library(readr)
tournament <- read_delim("tournament.csv", delim = "|", escape_double = FALSE, trim_ws = TRUE,
skip = 1, show_col_types = FALSE)
## New names:
## • `Round` -> `Round...4`
## • `Round` -> `Round...5`
## • `Round` -> `Round...6`
## • `Round` -> `Round...7`
## • `Round` -> `Round...8`
## • `Round` -> `Round...9`
## • `Round` -> `Round...10`
## • `` -> `...11`
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
After running this code, we can see that the data has been loaded
into the dataframe tournament. The data is now in a format
that can be used for further processing. Unlike the original data,
extraneous data and formatting, can now be removed or ignored. What data
that remains clustered together can be much more easily separated with
simple regex patterns that are applied cell to cell rather than line by
line. First up for further processing, is to remove the dashes that
separate each player’s data.
Since the data is now in a dataframe, we can use
tidyverse to handle the processing of the player
separators, but searching and removing for rows with missing values
where the name should be. This is because the name and the ID/Rating(s)
are are populated for all players. Where there is a row of dashes, there
is a NA value in the name column.
tournament <- tournament %>%
drop_na("Player Name")
The last column of the data is not needed for this project, so it
will be removed. This column is all NA values, so it can be removed by
simply removing the last column of the dataframe. The source of this
last column is due to the fact that the data had a “|” delimiter at the
end of each line, which caused the last column to be empty.
To make this simple, we can remove the last column of the dataframe by using the following code that takes the index of the last column and removes it from the dataframe. And since the last column’s index is equal to the number of columns in the dataframe, we can simply use the following code to remove the last column.
tournament <- tournament[, -ncol(tournament)]
The data is currently in a format where each player’s data is split across two rows. This is not ideal for processing the data, so I will be combining the data for each player into a single row. This makes the data easier to work with by allowing us to know that each row represents a player, and that each column represents a different attribute (kind of) of the player.
rows <- seq(2, nrow(tournament), 2)
players = tournament[rows, ] # Extracting the 1st row of each player
players
## # A tibble: 64 × 10
## Pair `Player Name` Total Round...4 Round...5 Round...6 Round...7 Round...8
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 GARY HUA 6.0 W 39 W 21 W 18 W 14 W 7
## 2 2 DAKSHESH DARURI 6.0 W 63 W 58 L 4 W 17 W 16
## 3 3 ADITYA BAJAJ 6.0 L 8 W 61 W 25 W 21 W 11
## 4 4 PATRICK H SCHI… 5.5 W 23 D 28 W 2 W 26 D 5
## 5 5 HANSHI ZUO 5.5 W 45 W 37 D 12 D 13 D 4
## 6 6 HANSEN SONG 5.0 W 34 D 29 L 11 W 35 D 10
## 7 7 GARY DEE SWATH… 5.0 W 57 W 46 W 13 W 11 L 1
## 8 8 EZEKIEL HOUGHT… 5.0 W 3 W 32 L 14 L 9 W 47
## 9 9 STEFANO LEE 5.0 W 25 L 18 W 59 W 8 W 26
## 10 10 ANVIT RAO 5.0 D 16 L 19 W 55 W 31 D 6
## # ℹ 54 more rows
## # ℹ 2 more variables: Round...9 <chr>, Round...10 <chr>
base <- tournament[rows + 1, ] # Extracting the 2nd row of each player
cols <- tournament[1, ] # extracting the column names for the 2nd rows for each player
cols # we can see that the columns are pretty much correct, except for the fact that they some are meant to be read with the first row, I will change those to better reflect the data.
## # A tibble: 1 × 10
## Pair `Player Name` Total Round...4 Round...5 Round...6 Round...7 Round...8
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Num USCF ID / Rtg (… Pts 1 2 3 4 5
## # ℹ 2 more variables: Round...9 <chr>, Round...10 <chr>
cols[1] <- "State"
cols[2] <- "ID/Ratings" # simplify the ratings column name
cols[3] <- "Unkown" # just a temporary name since we don't need this data
colnames(base) = cols # Setting the column names of the 2nd row to the 1st row of the original double stacked data
base
## # A tibble: 64 × 10
## State `ID/Ratings` Unkown `1` `2` `3` `4` `5` `6` `7`
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 ON 15445895 / R: 1794 … N:2 W B W B W B W
## 2 MI 14598900 / R: 1553 … N:2 B W B W B W B
## 3 MI 14959604 / R: 1384 … N:2 W B W B W B W
## 4 MI 12616049 / R: 1716 … N:2 W B W B W B B
## 5 MI 14601533 / R: 1655 … N:2 B W B W B W B
## 6 OH 15055204 / R: 1686 … N:3 W B W B B W B
## 7 MI 11146376 / R: 1649 … N:3 W B W B B W W
## 8 MI 15142253 / R: 1641P17… N:3 B W B W B W W
## 9 ON 14954524 / R: 1411 … N:2 W B W B W B B
## 10 MI 14150362 / R: 1365 … N:3 W W B B W B W
## # ℹ 54 more rows
We can then combine the two sets of data into a single dataframe by using the following code:
tournament <- cbind(players, base)
# just a quick and dirty rename of the round names to make them more readable.
# Also great if we need to search them later.
cols <- colnames(tournament)
cols[4] <- "Round 1"
cols[5] <- "Round 2"
cols[6] <- "Round 3"
cols[7] <- "Round 4"
cols[8] <- "Round 5"
cols[9] <- "Round 6"
cols[10] <- "Round 7"
colnames(tournament) <- cols
we can use a loop to extract the data for each player now since each
row represents a player. At each iteration, we extract the player name,
state, total points scored, their pre-rating, and their pair number. We
are reserving the last column, the ratings of their opponents faced, for
another section. After we have the data we need from a player, we append
it to a new dataframe called players. Lastly, we need to
reformat the pre-rating column to be the player’s rating, right now its
a combination of their ID and their pre/post ratings.
players <- data.frame()
for (row in 1:nrow(tournament)) {
name <- tournament[row, 2]
state <- tournament[row, "State"] # easier to search by col name state here
points <- tournament[row, 3]
pre_rating <- tournament[row, "ID/Ratings"]
pair_num <- tournament[row, 1]
player <- cbind(name, state, points, pre_rating, pair_num)
players <- rbind(players, player)
}
Removing only the pre-ratings is simple, unlike all other numeric
entries in the column, the pre-ratings are the only ones that have a
space before the number. So we can use regex to extract the pre-ratings
from the column. Then to remove the space, we can simply use the
trimws function to remove the whitespace from the
string.
# Reformatting the pre-rating column to be the player's rating
players$pre_rating <- trimws(str_extract(players$pre_rating, "\\s\\d+"))
And to double check that the data has been extracted correctly, we
can use the following code to display the data in the
players dataframe:
# Just to quickly check if we have the data
head(players)
## name state points pre_rating pair_num
## 1 GARY HUA ON 6.0 1794 1
## 2 DAKSHESH DARURI MI 6.0 1553 2
## 3 ADITYA BAJAJ MI 6.0 1384 3
## 4 PATRICK H SCHILLING MI 5.5 1716 4
## 5 HANSHI ZUO MI 5.5 1655 5
## 6 HANSEN SONG OH 5.0 1686 6
This is a bit more complicated than the prior section as each player all have 7 games, however, not all are played. So we need to extract them all first, search through them to obtain a list of opponents, and then extract the ratings of those opponents. This is done by using regex to extract the opponent pair number and then using the number to search for the ratings of the opponents from the extracted data from the prior step.
op_ratings <- data.frame()
for (row in 1:nrow(tournament)) {
rounds <- tournament[row, 4:10] # we already know the round col #s from the previous section
opponent_nums <- data.frame(ncol = 7)
for (round in rounds) {
if (round != "U") {
# U means unplayed Extract the player number of the opponent
opponent_num <- str_extract(round, "\\d+")
opponent_nums <- cbind(opponent_nums, opponent_num)
} else {
opponent_nums <- cbind(opponent_nums, NA)
}
}
p_ops <- data.frame(ncols = 7) # player opponent scores
for (opponent in opponent_nums) {
if (!is.na(opponent)) {
# Extract the rating of the opponent
opponent_rating <- players[players$pair_num == opponent, "pre_rating"]
p_ops <- cbind(p_ops, opponent_rating)
}
}
op_ratings <- rbind(op_ratings, sum(as.numeric(p_ops[2:length(p_ops)]))/(length(p_ops) -
1)) # divide by ops to get the average rating of the opponents
}
head(op_ratings)
## X1610.75
## 1 1610.750
## 2 1491.750
## 3 1574.250
## 4 1583.000
## 5 1519.375
## 6 1535.000
Now that we have the data for the players and their opponents, we can
combine the two dataframes into a single dataframe. This is done by
using the cbind function to combine the two dataframes.
However, we must first remove the pair number column from the
players dataframe, as it is not needed in the final
dataframe.
players <- players[, -ncol(players)] # remove the pair number column
players <- cbind(players, op_ratings) # combine the two dataframes
colnames(players)[5] <- "opponent_ratings" # rename the column to be more descriptive
head(players)
## name state points pre_rating opponent_ratings
## 1 GARY HUA ON 6.0 1794 1610.750
## 2 DAKSHESH DARURI MI 6.0 1553 1491.750
## 3 ADITYA BAJAJ MI 6.0 1384 1574.250
## 4 PATRICK H SCHILLING MI 5.5 1716 1583.000
## 5 HANSHI ZUO MI 5.5 1655 1519.375
## 6 HANSEN SONG OH 5.0 1686 1535.000
In this project, I have explored the usage of regex in R, in combination with csv importing, and dataframe manipulation, to extract data from a text file. The data was not in a standard format, so it required some processing to extract the data correctly. But by creatively applying individual steps to eliminate the human-readable formatting, extraneous data, and the non-standard delimiter, the data required data was extracted from the text file. This new data is now nearly ready for further processing, such as analysis or visualization. The data extracted includes the player’s name, state, points scored, rating, and the average rating of their opponents. Unfortunately, the data is not yet in a format that is ready for analysis, as the ratings and points scored are still in string format. However, this data can be easily converted to numeric format, and then used for further analysis.
Using this data, we can discern relationships between the player’s rating and the average rating of their opponents, and how that affected the players performance in the tournament by comparing the points scored by the player to the average rating of their opponents. Furthermore, the data includes the player’s state, which could be used to determine if there is a correlation between the player’s state and their performance in the tournament. Or it may allow us to visualize the data on a map. Regardless, this subset of the data is now ready for further analysis, and can be used to answer questions about the tournament, the players, and the games played.
Some further steps that could be taken to improve the data: First, and as mentioned earlier, we can conver the ratings and points scored to numeric format. Second, we can add how many games each player played, as this data is not included despite it influencing the average rating of opponents played, and can indicate therefore how confident we can be in the average rating of the opponents as well as the points scored. Lastly, we can also provide a ratio of the games won to the games played, this can also help us understand the player’s performance in the tournament. These are only a few of the many ways we can further process the data, and the possibilities are endless, limited only by the original data.
In conclusion, this project has demonstrated the power of regex in R, and how it can be used to extract data from a text file. It has also illuminated the fact that Regex is not the only tool that can be used to extract data. In the end, the data extracted can be used for further analysis, and can help answer questions about the tournament, the players, and the games played.
players
## name state points pre_rating opponent_ratings
## 1 GARY HUA ON 6.0 1794 1610.750
## 2 DAKSHESH DARURI MI 6.0 1553 1491.750
## 3 ADITYA BAJAJ MI 6.0 1384 1574.250
## 4 PATRICK H SCHILLING MI 5.5 1716 1583.000
## 5 HANSHI ZUO MI 5.5 1655 1519.375
## 6 HANSEN SONG OH 5.0 1686 1535.000
## 7 GARY DEE SWATHELL MI 5.0 1649 1406.750
## 8 EZEKIEL HOUGHTON MI 5.0 1641 1491.000
## 9 STEFANO LEE ON 5.0 1411 1538.875
## 10 ANVIT RAO MI 5.0 1365 1566.000
## 11 CAMERON WILLIAM MC LEMAN MI 4.5 1712 1490.250
## 12 KENNETH J TACK MI 4.5 1663 1526.571
## 13 TORRANCE HENRY JR MI 4.5 1666 1516.750
## 14 BRADLEY SHAW MI 4.5 1610 1531.750
## 15 ZACHARY JAMES HOUGHTON MI 4.5 1220 1504.500
## 16 MIKE NIKITIN MI 4.0 1604 1429.667
## 17 RONALD GRZEGORCZYK MI 4.0 1629 1517.375
## 18 DAVID SUNDEEN MI 4.0 1600 1501.125
## 19 DIPANKAR ROY MI 4.0 1564 1454.125
## 20 JASON ZHENG MI 4.0 1595 1440.625
## 21 DINH DANG BUI ON 4.0 1563 1492.750
## 22 EUGENE L MCCLURE MI 4.0 1555 1350.143
## 23 ALAN BUI ON 4.0 1363 1268.250
## 24 MICHAEL R ALDRICH MI 4.0 1229 1393.500
## 25 LOREN SCHWIEBERT MI 3.5 1745 1399.000
## 26 MAX ZHU ON 3.5 1579 1524.625
## 27 GAURAV GIDWANI MI 3.5 1552 1282.714
## 28 SOFIA ADINA STANESCU-BELLU MI 3.5 1507 1538.000
## 29 CHIEDOZIE OKORIE MI 3.5 1602 1361.429
## 30 GEORGE AVERY JONES ON 3.5 1522 1207.250
## 31 RISHI SHETTY MI 3.5 1494 1308.500
## 32 JOSHUA PHILIP MATHEWS ON 3.5 1441 1412.500
## 33 JADE GE MI 3.5 1449 1323.375
## 34 MICHAEL JEFFERY THOMAS MI 3.5 1399 1409.500
## 35 JOSHUA DAVID LEE MI 3.5 1438 1212.125
## 36 SIDDHARTH JHA MI 3.5 1355 1425.429
## 37 AMIYATOSH PWNANANDAM MI 3.5 980 1428.833
## 38 BRIAN LIU MI 3.0 1423 1554.857
## 39 JOEL R HENDON MI 3.0 1436 1457.000
## 40 FOREST ZHANG MI 3.0 1348 1422.875
## 41 KYLE WILLIAM MURPHY MI 3.0 1403 1328.600
## 42 JARED GE MI 3.0 1332 1212.250
## 43 ROBERT GLEN VASEY MI 3.0 1283 1174.375
## 44 JUSTIN D SCHILLING MI 3.0 1199 1373.000
## 45 DEREK YAN MI 3.0 1242 1214.125
## 46 JACOB ALEXANDER LAVALLEY MI 3.0 377 1394.125
## 47 ERIC WRIGHT MI 2.5 1362 1424.125
## 48 DANIEL KHAIN MI 2.5 1382 1404.667
## 49 MICHAEL J MARTIN MI 2.5 1291 1346.333
## 50 SHIVAM JHA MI 2.5 1056 1346.429
## 51 TEJAS AYYAGARI MI 2.5 1011 1392.750
## 52 ETHAN GUO MI 2.5 935 1513.875
## 53 JOSE C YBARRA MI 2.0 1393 1421.250
## 54 LARRY HODGE MI 2.0 1270 1269.429
## 55 ALEX KONG MI 2.0 1186 1440.714
## 56 MARISA RICCI MI 2.0 1153 1453.500
## 57 MICHAEL LU MI 2.0 1092 1403.857
## 58 VIRAJ MOHILE MI 2.0 917 1427.857
## 59 SEAN M MC CORMICK MI 2.0 853 1366.143
## 60 JULIA SHEN MI 1.5 967 1383.333
## 61 JEZZEL FARKAS ON 1.5 955 1367.500
## 62 ASHWIN BALAJI MI 1.0 1530 1417.500
## 63 THOMAS JOSEPH HOSMER MI 1.0 1175 1400.000
## 64 BEN LI MI 1.0 1163 1311.250