Gary Hua, ON, 6.0, 1794, 1605
Load required libraries
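The library calls themselves are not shown in this write-up. As a minimal sketch, the package list below is inferred from the functions used later (`str_extract_all` and `str_split` from stringr, `select` from dplyr, `ggplot` from ggplot2); it is an assumption, not the author's original chunk.

```r
# Packages inferred from the functions used in this report (assumption).
needed <- c("stringr", "dplyr", "ggplot2")

# Check which packages are installed without failing when one is absent.
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
not_installed <- needed[!installed]

if (length(not_installed) == 0) {
  # Attach all three packages.
  invisible(lapply(needed, library, character.only = TRUE))
} else {
  message("Install before running: ", paste(not_installed, collapse = ", "))
}
```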
The structure of the text is the following:
- The first three lines form the header that describes the layout (we skip them when processing)
- Player attributes such as Player Name, Total Points as shown below
- Player attributes such as State, Pre Tournament Rating as shown below
- A dashed line which we will ignore
## X.........................................................................................
## 1 Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round|
## 2 Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
## 3 -----------------------------------------------------------------------------------------
## 4 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|
## 5 ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |
## 6 -----------------------------------------------------------------------------------------
## 7 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|
## 8 MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |
## 9 -----------------------------------------------------------------------------------------
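We re-format the text file for easier extraction. Each player has two lines of text, from which we extract information in the two loops that follow. We skip the first three lines of text and step through the rows in increments of 3 so that the dashed lines are never read: the first loop starts at row 4 (names, total points, opponents) and the second at row 5 (states, pre tournament ratings).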
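Before reading the loops, it may help to see what each pattern returns on a single player record. The sketch below uses base R only (`regmatches` and `gregexpr`) rather than stringr, so it runs without extra packages; `extract_all` is a hypothetical helper, and the two sample lines are copied from the listing above with the spacing shown there, which may differ from the padding in the raw file.

```r
# Two lines for one player, copied from the listing above.
name_line   <- "1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
rating_line <- "ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"

# Hypothetical helper: return every match of `pattern` in the string `x`.
extract_all <- function(x, pattern) regmatches(x, gregexpr(pattern, x))[[1]]

# Name: runs of two or more letters, so one-letter result codes (W, D) drop out.
player_name <- paste(extract_all(name_line, "[[:alpha:]]{2,}"), collapse = " ")

# Total points: digit, literal dot, digit.
total_pts <- as.numeric(extract_all(name_line, "[0-9]\\.[0-9]"))

# Opponents: the first three digit runs are the pair number and the two halves
# of the total ("1", "6", "0"), so the opponents sit in positions 4 to 10.
opponents <- as.integer(extract_all(name_line, "[0-9]+")[4:10])

# State: the only two-letter run on the second line.
state <- extract_all(rating_line, "[[:alpha:]]{2}")

# Pre-rating: the 8-digit USCF ID yields two 4-digit matches, so the
# pre tournament rating is the third 4-digit match.
pre_rating <- as.numeric(extract_all(rating_line, "[0-9]{4}")[3])
```

One consequence of the four-digit pattern is that a pre tournament rating with fewer than four digits cannot be matched by it, so such a record would need a different pattern.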
# initialize lists that we will iteratively build
names <- c()
totalpts <- c()
opponentlist <- c()
states <- c()
preratings <- c()
# extract the name, total points and opponents list
for (i in seq(from=4, to=nrow(tournament_data), by=3)) {
str <- tournament_data[i,]
name <- unlist(str_extract_all(str, "[:alpha:]{2,}"))
name <- paste(name, collapse = " ")
names <- append(names, name)
totalpts <- append(totalpts, unlist(str_extract_all(str, "\\d\\.\\d")))
totalpts <- as.numeric(totalpts)
opponents <- unlist(str_extract_all(str, "[:digit:]{1,}"))[4:10]
opponents <- paste(opponents, collapse = ",")
opponentlist <- append(opponentlist,opponents)
}
# extract the states and pre tournament ratings
for (i in seq(from=5, to=nrow(tournament_data), by=3)) {
str <- tournament_data[i,]
states <- append(states, unlist(str_extract_all(str, "[:alpha:]{2}")))
preratings <- append(preratings, unlist(str_extract_all(str, "[:digit:]{4}"))[3])
preratings <- as.numeric(preratings)
}
We combine this extracted information into a dataframe and take a look at it.
players <- data.frame(name = names, state = states,
total_pts = totalpts, pre_tournament_rating = preratings,
opponents = opponentlist)
## 'data.frame': 64 obs. of 5 variables:
## $ name : Factor w/ 64 levels "ADITYA BAJAJ",..: 24 12 1 51 28 27 23 21 59 5 ...
## $ state : Factor w/ 3 levels "MI","OH","ON": 3 1 1 1 1 2 1 1 3 1 ...
## $ total_pts : num 6 6 6 5.5 5.5 5 5 5 5 5 ...
## $ pre_tournament_rating: num 1794 1553 1384 1716 1655 ...
## $ opponents : Factor w/ 64 levels "1,54,40,16,44,21,24",..: 35 60 63 18 41 31 53 26 20 10 ...
| name | state | total_pts | pre_tournament_rating | opponents |
|---|---|---|---|---|
| GARY HUA | ON | 6.0 | 1794 | 39,21,18,14,7,12,4 |
| DAKSHESH DARURI | MI | 6.0 | 1553 | 63,58,4,17,16,20,7 |
| ADITYA BAJAJ | MI | 6.0 | 1384 | 8,61,25,21,11,13,12 |
| PATRICK SCHILLING | MI | 5.5 | 1716 | 23,28,2,26,5,19,1 |
| HANSHI ZUO | MI | 5.5 | 1655 | 45,37,12,13,4,14,17 |
| HANSEN SONG | OH | 5.0 | 1686 | 34,29,11,35,10,27,21 |
| name | state | total_pts | pre_tournament_rating | opponents |
|---|---|---|---|---|
| ADITYA BAJAJ : 1 | MI:55 | Min. :1.000 | Min. :1011 | 1,54,40,16,44,21,24 : 1 |
| ALAN BUI : 1 | OH: 1 | 1st Qu.:2.500 | 1st Qu.:1280 | 10,15,39,2,36,NA,NA : 1 |
| ALEX KONG : 1 | ON: 8 | Median :3.500 | Median :1430 | 11,35,29,12,18,15,NA: 1 |
| AMIYATOSH PWNANANDAM: 1 | NA | Mean :3.438 | Mean :1425 | 11,35,45,40,42,NA,NA: 1 |
| ANVIT RAO : 1 | NA | 3rd Qu.:4.000 | 3rd Qu.:1596 | 12,50,57,60,61,64,56: 1 |
| ASHWIN BALAJI : 1 | NA | Max. :6.000 | Max. :1794 | 13,57,51,33,16,28,NA: 1 |
| (Other) :58 | NA | NA | NA’s :4 | (Other) :58 |
From the summary above, we learn that there are missing values in both the pre_tournament_rating and the opponents columns. We need to keep this in mind when calculating the apcro (Average Pre Chess Rating of Opponent) for each player, using each player’s row index as the pair number. We will drop the NA values in the apcro calculation.
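The NA handling described here can be checked on a small example first. The sketch below uses a made-up four-player table (the ratings and opponent lists are illustrative, not tournament data) and a hypothetical helper, `average_opponent_rating`, written in base R.

```r
# Toy table: ratings and opponent lists are made up for illustration.
toy <- data.frame(
  pre_tournament_rating = c(1500, 1400, NA, 1300),
  opponents = c("2,3,4", "1,4,NA", "1,NA,NA", "1,2,NA"),
  stringsAsFactors = FALSE
)

# Average the pre tournament ratings of the listed opponents, skipping
# unplayed rounds ("NA" in the list) and opponents whose rating is missing.
average_opponent_rating <- function(opponent_string, ratings) {
  idx <- suppressWarnings(as.integer(strsplit(opponent_string, ",")[[1]]))
  idx <- idx[!is.na(idx)]                   # drop unplayed rounds
  round(mean(ratings[idx], na.rm = TRUE))   # drop opponents without a rating
}

toy$apcro <- vapply(toy$opponents, average_opponent_rating, numeric(1),
                    ratings = toy$pre_tournament_rating, USE.NAMES = FALSE)
```

Player 1 faces players 2, 3 and 4; player 3 has no rating, so the average is taken over 1400 and 1300 only.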
for (i in seq(from=1, to=nrow(players), by=1)) {
# separate the opponents list and coerce it to integers ("NA" entries become NA)
op_index <- unlist(str_split(as.character(players$opponents[i]), ","))
op_index <- suppressWarnings(as.integer(op_index))
# NAs are removed since no game was played, so no opponent rating should be fetched
op_index <- op_index[!is.na(op_index)]
# use indices to locate the opponents' pre tournament ratings (already numeric)
op_ratings <- players$pre_tournament_rating[op_index]
# calculate the apcro by taking the mean and dropping the NAs for players without a rating (4 in the summary above)
players$apcro[i] <- round(mean(op_ratings, na.rm=TRUE))
}
Preparing the output format for data export
# We create a new dataframe and omit the opponents variable, we will not be writing it to file.
outputdf <- select(players, -opponents)
| name | state | total_pts | pre_tournament_rating | apcro |
|---|---|---|---|---|
| GARY HUA | ON | 6.0 | 1794 | 1605 |
| DAKSHESH DARURI | MI | 6.0 | 1553 | 1561 |
| ADITYA BAJAJ | MI | 6.0 | 1384 | 1665 |
| PATRICK SCHILLING | MI | 5.5 | 1716 | 1574 |
| HANSHI ZUO | MI | 5.5 | 1655 | 1515 |
| HANSEN SONG | OH | 5.0 | 1686 | 1519 |
| GARY DEE SWATHELL | MI | 5.0 | 1649 | 1472 |
| EZEKIEL HOUGHTON | MI | 5.0 | 1641 | 1468 |
| STEFANO LEE | ON | 5.0 | 1411 | 1635 |
| ANVIT RAO | MI | 5.0 | 1365 | 1554 |
Write the dataframe to a .csv file
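The export call is not shown in this write-up. A minimal sketch of this step, assuming base R's `write.csv` and a hypothetical file name, using the first two rows of the output table above:

```r
# First two rows of the output table shown above.
outputdf_sample <- data.frame(
  name = c("GARY HUA", "DAKSHESH DARURI"),
  state = c("ON", "MI"),
  total_pts = c(6.0, 6.0),
  pre_tournament_rating = c(1794, 1553),
  apcro = c(1605, 1561),
  stringsAsFactors = FALSE
)

# Hypothetical output location; a temporary directory keeps the sketch
# from writing into the working directory.
csv_path <- file.path(tempdir(), "tournament_output.csv")

# row.names = FALSE leaves the row index out of the file.
write.csv(outputdf_sample, csv_path, row.names = FALSE)

# Read the file back to confirm the round trip.
check <- read.csv(csv_path, stringsAsFactors = FALSE)
```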
Let’s start by taking a look at the summary statistics. It is interesting to note that the mean pre tournament rating of the players and the mean of their opponents' average ratings are very close, 1425 and 1424 respectively.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.500 3.500 3.438 4.000 6.000
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1011 1280 1430 1425 1596 1794 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1186 1356 1418 1424 1496 1665
Here we take a look at the players with the highest pre tournament ratings and those with the highest average opponent ratings.
| | name | state | total_pts | pre_tournament_rating | apcro |
|---|---|---|---|---|---|
| 1 | GARY HUA | ON | 6.0 | 1794 | 1605 |
| 25 | LOREN SCHWIEBERT | MI | 3.5 | 1745 | 1363 |
| 4 | PATRICK SCHILLING | MI | 5.5 | 1716 | 1574 |
| 11 | CAMERON WILLIAM MC LEMAN | MI | 4.5 | 1712 | 1468 |
| 6 | HANSEN SONG | OH | 5.0 | 1686 | 1519 |
| 13 | TORRANCE HENRY JR | MI | 4.5 | 1666 | 1498 |
| | name | state | total_pts | pre_tournament_rating | apcro |
|---|---|---|---|---|---|
| 3 | ADITYA BAJAJ | MI | 6.0 | 1384 | 1665 |
| 9 | STEFANO LEE | ON | 5.0 | 1411 | 1635 |
| 41 | KYLE WILLIAM MURPHY | MI | 3.0 | 1403 | 1612 |
| 1 | GARY HUA | ON | 6.0 | 1794 | 1605 |
| 4 | PATRICK SCHILLING | MI | 5.5 | 1716 | 1574 |
| 2 | DAKSHESH DARURI | MI | 6.0 | 1553 | 1561 |
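The code that produced these two tables is not shown. One way to build such rankings, sketched here with base R's `order` and `head` on the ten rows printed in the output table earlier (so the result covers only that subset, not all 64 players):

```r
# The ten rows printed in the output table earlier in this report.
outputdf_sample <- data.frame(
  name = c("GARY HUA", "DAKSHESH DARURI", "ADITYA BAJAJ", "PATRICK SCHILLING",
           "HANSHI ZUO", "HANSEN SONG", "GARY DEE SWATHELL", "EZEKIEL HOUGHTON",
           "STEFANO LEE", "ANVIT RAO"),
  pre_tournament_rating = c(1794, 1553, 1384, 1716, 1655, 1686, 1649, 1641, 1411, 1365),
  apcro = c(1605, 1561, 1665, 1574, 1515, 1519, 1472, 1468, 1635, 1554),
  stringsAsFactors = FALSE
)

# Sort in descending order and keep the top three of each ranking.
# Row names are preserved, which is where the leading index column comes from.
top_by_pre   <- head(outputdf_sample[order(-outputdf_sample$pre_tournament_rating), ], 3)
top_by_apcro <- head(outputdf_sample[order(-outputdf_sample$apcro), ], 3)
```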
The histogram of the total points shows a mean and median close to the center, near 3.5, as described above.
Here we look at the Average Pre Chess Rating of Opponent against the Pre Tournament Rating. An x = y line is added to identify which players played against opponents that were on average rated better (upper left side of the line) or worse than themselves (lower right side of the line). A color dimension is added to represent the number of points obtained. It appears that weaker players tended to face opponents rated higher than themselves, while stronger players tended to face opponents rated lower than themselves, as can be expected.
ggplot(data = outputdf) +
geom_point(mapping = aes(x = pre_tournament_rating, y = apcro, color = total_pts)) +
geom_abline(slope = 1, intercept = 0) +
xlim(1000, 1800) + ylim(1000, 1800)
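The reading of the plot can also be checked numerically. The sketch below is limited to the ten rows printed in the output table earlier, so it illustrates the comparison rather than reproducing the full result; `faced_stronger` is a hypothetical name for the "above the x = y line" flag.

```r
# Ratings for the ten players printed in the output table earlier.
pre_tournament_rating <- c(1794, 1553, 1384, 1716, 1655, 1686, 1649, 1641, 1411, 1365)
apcro                 <- c(1605, 1561, 1665, 1574, 1515, 1519, 1472, 1468, 1635, 1554)

# TRUE when a player's opponents were, on average, rated higher than the
# player (a point above the x = y line).
faced_stronger <- apcro > pre_tournament_rating

# Number of players on each side of the line.
n_above <- sum(faced_stronger)
n_below <- sum(!faced_stronger)

# Mean pre tournament rating within each group.
group_means <- tapply(pre_tournament_rating, faced_stronger, mean)
```

In this subset, the players above the line have the lower mean pre tournament rating, which agrees with the pattern described for the plot.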