For this project, I will use the dplyr, stringr, and readr libraries to generate a .CSV file from the information in tournamentinfo.txt, which can be accessed here: https://raw.githubusercontent.com/emilye5/607-project1/refs/heads/main/tournamentinfo.txt. After structuring the data in a new data frame, I will calculate the average pre-tournament rating of each player’s opponents and then use the data frame to create the .CSV file required for this project. To conclude, I will verify my results by hand-calculating the average opponent rating for two players and comparing those values to my code’s output. I will also create a scatter plot to assess the relationship between a player’s pre-rating and their total number of points.
The dplyr, stringr, and readr libraries are loaded to manipulate the data, match patterns in the text file, and write the output file, respectively. Together they support the transformation of the raw text into a usable .CSV file.
#load libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
library(readr)
#read tournamentinfo.txt
url <- "https://raw.githubusercontent.com/emilye5/607-project1/refs/heads/main/tournamentinfo.txt"
df <- readLines(url)
## Warning in readLines(url): incomplete final line found on
## 'https://raw.githubusercontent.com/emilye5/607-project1/refs/heads/main/tournamentinfo.txt'
#view the current structure of the data
head(df)
## [1] "-----------------------------------------------------------------------------------------"
## [2] " Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round| "
## [3] " Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 | "
## [4] "-----------------------------------------------------------------------------------------"
## [5] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [6] " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
Clean the data by removing the header rows, the dashed separator lines, and any empty lines.
#remove the header & separating lines containing dashes
df <- df[!str_detect(df, "Pair|Num|---")]
#remove blank lines
df <- df[df != ""]
#view the current structure of the data
head(df)
## [1] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [2] " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
## [3] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|"
## [4] " MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |"
## [5] " 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|"
## [6] " MI | 14959604 / R: 1384 ->1640 |N:2 |W |B |W |B |W |B |W |"
We can see that every two lines account for the data of one player.
#count the players (each player occupies two rows)
players <- length(df) / 2
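Because the records alternate, the first line of each pair holds the name and round results, and the second holds the state and ratings. A minimal sketch of that pairing is shown here; `demo_lines` is a hypothetical, abbreviated stand-in for the cleaned `df`, and the names `first_lines` and `second_lines` are illustrative only.

```r
# Hypothetical stand-in for the cleaned `df` (fields abbreviated for illustration)
demo_lines <- c(
  "1 | GARY HUA |6.0 |W 39|W 21|",
  "ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |",
  "2 | DAKSHESH DARURI |6.0 |W 63|W 58|",
  "MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |"
)

# Guard against a stray line that would break the two-lines-per-player assumption
stopifnot(length(demo_lines) %% 2 == 0)

n_players <- length(demo_lines) / 2

# Odd-numbered lines: name and results; even-numbered lines: state and ratings
first_lines  <- demo_lines[seq(1, length(demo_lines), by = 2)]
second_lines <- demo_lines[seq(2, length(demo_lines), by = 2)]
```

The `stopifnot()` check fails early if the cleaning step leaves an odd number of lines, which would otherwise shift every later record.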
With the player count known, each player’s name, state, total number of points, pre-rating, and opponents can be extracted.
#create empty variables
name <- c()
state <- c()
total_points <- c()
pre_rating <- c()
opponents <- list()
#loop through players and populate variables
for (i in seq_len(players)) {
  line1 <- df[(2 * i) - 1]
  line2 <- df[2 * i]
  #name
  name <- c(name, str_trim(substr(line1, 8, 40)))
  #total points
  total_points <- c(total_points, as.numeric(str_trim(substr(line1, 42, 46))))
  #state
  state <- c(state, substr(str_trim(line2), 1, 2))
  #pre rating
  pre_rating <- c(pre_rating, as.numeric(substr(line2, 23, 26)))
  #opponents: skip the first three numbers (pair number and the two digits of total points)
  line_1_numbers <- str_extract_all(line1, "\\d+")[[1]]
  opps <- line_1_numbers[4:length(line_1_numbers)]
  opponents[[i]] <- as.numeric(opps)
}
#average opponent rating
avg_opps_rating <- c()
for (i in seq_len(players)) {
  opps_ids <- opponents[[i]]
  opps_ratings <- pre_rating[opps_ids]
  avg <- mean(opps_ratings)
  avg_opps_rating <- c(avg_opps_rating, avg)
}
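The opponent extraction above depends on the first three numbers in a line always being the pair number and the two digits of the point total. As an alternative sketch, not the method used in this report, the opponent numbers can be captured only where they follow a result letter, and the averages computed with `sapply()`. The three `demo_line1` records and `demo_pre_rating` values are made up for illustration, and `extract_opponents` is a hypothetical helper.

```r
library(stringr)

# Hypothetical first lines for three players who only play each other
demo_line1 <- c(
  "    1 | PLAYER A   |1.5  |W   2|D   3|",
  "    2 | PLAYER B   |1.0  |L   1|W   3|",
  "    3 | PLAYER C   |0.5  |H    |D   1|"
)
# Hypothetical pre-ratings, indexed by pair number
demo_pre_rating <- c(1500, 1400, 1300)

# Capture only numbers that follow a W, L, or D result;
# rounds without an opponent number (such as the H entry) are skipped
extract_opponents <- function(line) {
  as.numeric(str_match_all(line, "\\|[WLD]\\s+(\\d+)")[[1]][, 2])
}

demo_opponents <- lapply(demo_line1, extract_opponents)

# Average the pre-ratings of the opponents each player actually faced
demo_avg <- sapply(demo_opponents, function(ids) mean(demo_pre_rating[ids]))
```

In this toy data, Player C’s average is taken over one game only, mirroring how the report divides by the number of games actually played. A player with no games at all would get `NaN` from `mean()`.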
With all of the necessary variables populated, the data frame can be created.
tournament_info <- data.frame(
  name = name,
  state = state,
  total_points = total_points,
  pre_rating = pre_rating,
  avg_opps_rating = avg_opps_rating
)
Export to .CSV file:
write_csv(tournament_info, "tournament_info.csv")
To verify my work, I hand-calculated the average opponent ratings for player 1 (Gary Hua) and player 2 (Dakshesh Daruri). Gary Hua’s average opponent rating is about 1605, while Dakshesh Daruri’s is about 1469. Both agree with the values the code produces:
print(avg_opps_rating[1])
## [1] 1605.286
print(avg_opps_rating[2])
## [1] 1469.286
One way to extend this work is to create a plot that represents some of the data. Below is a scatter plot that assesses the relationship between a player’s pre-rating and their total number of points:
library(ggplot2)
ggplot(tournament_info, aes(x = pre_rating, y = total_points)) +
  geom_point(color = "pink") +
  labs(title = "Tournament Player Pre Rating vs Total Points",
       x = "Pre Rating",
       y = "Total Points")
This scatter plot suggests a positive relationship: players with higher
pre-ratings tend to finish with more total points. The relationship is
not strict, however, as several players deviate clearly from the overall
trend.
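One way to put a number on the trend is a correlation coefficient. The sketch below uses base R’s `cor()` on a small made-up data frame, `demo`, whose values are purely illustrative. On the real data, the same call would take `tournament_info$pre_rating` and `tournament_info$total_points`.

```r
# Hypothetical ratings and point totals, for illustration only
demo <- data.frame(
  pre_rating   = c(1200, 1400, 1500, 1700, 1800),
  total_points = c(2.0, 3.5, 3.0, 5.0, 4.5)
)

# Pearson correlation; "complete.obs" drops rows with a missing rating or score
r <- cor(demo$pre_rating, demo$total_points, use = "complete.obs")
r
```

A value near 1 would indicate a strong positive linear relationship, while a value closer to 0 would be consistent with the scattered deviations visible in the plot.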