Project 1

Code Base

Introduction

The dplyr, stringr, and readr libraries are loaded to manipulate the data, extract patterns from the text file, and write the result. The file itself is read with base R's readLines(). Together these steps transform the raw text into a usable .CSV file.

#load libraries
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(stringr)
library(readr)
#read tournamentinfo.txt
url <- "https://raw.githubusercontent.com/emilye5/607-project1/refs/heads/main/tournamentinfo.txt"
df <- readLines(url)

## Warning in readLines(url): incomplete final line found on
## 'https://raw.githubusercontent.com/emilye5/607-project1/refs/heads/main/tournamentinfo.txt'

#view the current structure of the data
head(df)

## [1] "-----------------------------------------------------------------------------------------" 
## [2] " Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| "
## [3] " Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | "
## [4] "-----------------------------------------------------------------------------------------" 
## [5] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|" 
## [6] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"

Body

Clean the data by removing the header lines, the dashed separator lines, and any empty lines.

#remove the header & separating lines containing dashes
df <- df[!str_detect(df, "Pair|Num|---")]
#remove blank lines
df <- df[df != ""]
#view the current structure of the data
head(df)

## [1] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"
## [2] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
## [3] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|"
## [4] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
## [5] "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|"
## [6] "   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |"

We can see that every two lines account for the data of one player.

#every two rows describe one player, so the player count is half the line count
players <- length(df) / 2
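The cleaning and pairing steps above can also be sketched on a small sample. The `raw` vector below holds shortened copies of the lines shown earlier, and the anchored regular expression is my assumption about the layout (a pair number or a two-letter code in the first field), not part of the original analysis. An anchored pattern avoids dropping a player line that happens to contain "Pair" or "Num", and the matrix makes the two-lines-per-player structure explicit.

```r
library(stringr)

# Shortened copies of the file's lines, used only for illustration
raw <- c(
  "-----------------------------------------------",
  " Pair | Player Name                     |Total|",
  " Num  | USCF ID / Rtg (Pre->Post)       | Pts |",
  "-----------------------------------------------",
  "    1 | GARY HUA                        |6.0  |",
  "   ON | 15445895 / R: 1794   ->1817     |N:2  |",
  "-----------------------------------------------",
  "    2 | DAKSHESH DARURI                 |6.0  |",
  "   MI | 14598900 / R: 1553   ->1663     |N:2  |",
  ""
)

# Keep lines whose first field is a pair number or a two-letter code
is_data <- str_detect(raw, "^\\s*(\\d+|[A-Z]{2})\\s*\\|")
cleaned <- raw[is_data]

# An odd count would mean a player record is incomplete
stopifnot(length(cleaned) %% 2 == 0)

# One row per player: column 1 is the first line, column 2 the second
records <- matrix(cleaned, ncol = 2, byrow = TRUE)
n_players <- nrow(records)
print(n_players)
```

With the real data, `records[, 1]` and `records[, 2]` would play the roles of `line1` and `line2` for all players at once.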

With the player count known, each player’s name, state, total number of points, pre-rating, and opponents can be populated.

#create empty variables
name <- c()
state <- c()
total_points <- c()
pre_rating <- c()
opponents <- list()
#loop through players and populate variables
for (i in 1:players) {
  line1 <- df[(2*i)-1]
  line2 <- df[2*i]
  #name
  name <- c(name, str_trim(substr(line1, 8, 40)))
  #total points
  total_points <- c(total_points, as.numeric(str_trim(substr(line1,42, 46))))
  #state
  state <- c(state, substr(str_trim(line2), 1, 2))
  #pre rating
  pre_rating <- c(pre_rating, as.numeric(substr(line2, 23, 26)))
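  #opponents: the first three numbers on line 1 are the pair number and the
  #two digits of the total points (e.g. "6" and "0"), so opponents start at 4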
  line_1_numbers <- str_extract_all(line1, "\\d+")[[1]]
  opps <- line_1_numbers[4:length(line_1_numbers)]
  opponents[[i]] <- as.numeric(opps)
}
#average opponent rating
avg_opps_rating <- c()
for (i in 1:players) {
  opps_ids <- opponents[[i]] 
  opps_ratings <- pre_rating[opps_ids]
  avg <- mean(opps_ratings)
  avg_opps_rating <- c(avg_opps_rating, avg)
}
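The two loops above depend on fixed character positions and grow their vectors one element at a time. As a hedged alternative, the sketch below splits each line on the "|" delimiter and averages with `vapply()`. The three players, their ratings, and their pairings are made up for illustration, and the patterns are assumptions about the layout rather than the method used in this project.

```r
library(stringr)

# Made-up records that imitate the cleaned two-line layout
line1 <- c(
  "    1 | PLAYER A                        |2.0  |W   2|W   3|",
  "    2 | PLAYER B                        |1.0  |L   1|B    |",
  "    3 | PLAYER C                        |1.0  |B    |L   1|"
)
line2 <- c(
  "   ON | 10000001 / R: 1500   ->1510     |N:2  |W    |B    |",
  "   MI | 10000002 / R: 1400P12->1390P17  |N:2  |B    |     |",
  "   MI | 10000003 / R:  900   -> 950     |N:2  |     |W    |"
)

# Split line 1 into pair number, name, total points, and the round results
fields <- str_split_fixed(line1, "\\|", 4)
name <- str_trim(fields[, 2])
total_points <- as.numeric(str_trim(fields[, 3]))

# The state is the first two-letter code; the pre-rating follows "R:"
state <- str_extract(line2, "[A-Z]{2}")
pre_rating <- as.numeric(str_match(line2, "R:\\s*(\\d+)")[, 2])

# Opponent pair numbers come only from the round columns
opponents <- lapply(str_extract_all(fields[, 4], "\\d+"), as.integer)

# Average the pre-ratings of each player's opponents
avg_opps_rating <- vapply(
  opponents,
  function(ids) mean(pre_rating[ids]),
  numeric(1)
)
print(avg_opps_rating)
```

A player with no played games would get `NaN` here, because the mean of an empty vector is undefined. Whether to report that as `NA` is a design choice this sketch leaves open.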

With all of the necessary variables populated, the data frame can be created.

tournament_info <- data.frame(
  name = name,
  state = state,
  total_points = total_points,
  pre_rating = pre_rating,
  avg_opps_rating = avg_opps_rating
)

Export to .CSV file:

write_csv(tournament_info, "tournament_info.csv")
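One way to confirm the export is to read the file back and compare it with the data frame. The sketch below does this with a made-up two-row data frame and a temporary file, so the names `toy`, `path`, and `check` are illustrative and not part of the project.

```r
library(readr)

# Made-up stand-in for tournament_info
toy <- data.frame(
  name = c("PLAYER A", "PLAYER B"),
  state = c("ON", "MI"),
  total_points = c(2, 1),
  pre_rating = c(1500, 1400),
  avg_opps_rating = c(1150, 1500)
)

# Write to a temporary file so nothing in the working directory is touched
path <- tempfile(fileext = ".csv")
write_csv(toy, path)

# Read the file back and check that shape and column names survived
check <- read_csv(path, show_col_types = FALSE)
stopifnot(nrow(check) == nrow(toy))
stopifnot(identical(names(check), names(toy)))
```

The same check could be run on "tournament_info.csv" after the export above.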

Conclusion

To verify my work, I hand-calculated the average opponent ratings for player 1 (Gary Hua) and player 2 (Dakshesh Daruri). Gary Hua’s average opponent rating is about 1605, while Dakshesh Daruri’s is about 1469. Both agree with the computed values:

print(avg_opps_rating[1])

## [1] 1605.286

print(avg_opps_rating[2])

## [1] 1469.286
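The hand calculation for player 1 can be written out as a short check. The opponent pair numbers come from Gary Hua's first line shown earlier. The pre-ratings are the values I recall for those opponents in the linked file, so treat them as an assumption and confirm them against tournamentinfo.txt.

```r
# Opponent pair numbers for player 1, read from the first data line
opp_ids <- c(39, 21, 18, 14, 7, 12, 4)

# Pre-ratings of those opponents, in the same order
# (assumption: looked up by hand, verify against the file)
opp_pre <- c(1436, 1563, 1600, 1610, 1649, 1663, 1716)

stopifnot(length(opp_ids) == length(opp_pre))

# Average = sum of opponent pre-ratings / number of games played
hand_avg <- sum(opp_pre) / length(opp_pre)
print(round(hand_avg, 3))
```

Under these values the sum is 11237 over 7 games, which rounds to the 1605.286 printed above.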

Another way to extend this work might be to plot some of the data. Below I have created a scatter plot that shows the relationship between a player’s pre-rating and their total number of points:

library(ggplot2)
ggplot(tournament_info, aes(x = pre_rating, y = total_points)) +
  geom_point(color = "pink") +
  labs(title = "Tournament Player Pre Rating vs Total Points",
       x = "Pre Rating",
       y = "Total Points")

This scatter plot suggests a positive relationship: players with higher pre-ratings tend to finish with more total points. The relationship is not strict, however, as several players deviate clearly from the trend.
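The visual impression could be backed by a correlation coefficient. The sketch below uses made-up values in a data frame named `toy`, which is only a stand-in for `tournament_info`. It computes both Pearson and Spearman coefficients, since a rank-based measure is less sensitive to the outlying players mentioned above.

```r
# Made-up values standing in for tournament_info
toy <- data.frame(
  pre_rating   = c(1800, 1600, 1400, 1200, 1000),
  total_points = c(6, 5, 5.5, 3, 2)
)

# Pearson measures linear association; Spearman works on ranks
pearson <- cor(toy$pre_rating, toy$total_points)
spearman <- cor(toy$pre_rating, toy$total_points, method = "spearman")

print(c(pearson = pearson, spearman = spearman))
```

On the real data the equivalent call would be `cor(tournament_info$pre_rating, tournament_info$total_points, use = "complete.obs")`, where `use = "complete.obs"` skips any rows with missing values.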

Emily El Mouaquite

2026-02-22
