Introduction

This project was all about turning messy text data into something structured and insightful. I was given a .txt file containing results from a chess tournament, and the goal was to extract the meaningful data—like each player’s name, state, total number of points, pre-tournament rating, and the average pre-rating of all their opponents—and generate a clean .csv that could be used for analysis or imported into a SQL database.

At first glance, the file looked chaotic. The information was technically structured, but it wasn’t in a straightforward format like a spreadsheet. This meant I had to write logic in R to parse out the right pieces, clean them up, and reorganize everything into something readable and consistent.

Read and Process Tournament Data

# Read File
file_path <- "tournamentinfo.txt"
lines <- readLines(file_path, encoding = "UTF-8")


# Remove separator lines
lines <- lines[!grepl("^-{5,}", lines)]

# Extract player-related lines (skip first 3 header lines)
player_lines <- lines[4:length(lines)]

# Extract every alternate line CORRECTLY
player_data <- player_lines[seq(2, length(player_lines), 2)]  # Players
rating_data <- player_lines[seq(1, length(player_lines), 2)]  # Ratings

# Ensure equal row count
min_rows <- min(length(player_data), length(rating_data))
player_data <- player_data[1:min_rows]
rating_data <- rating_data[1:min_rows]

# Extract Player Information
df <- tibble(
  Rank = as.numeric(str_extract(player_data, "^\\s*\\d+")),  
  Player = str_trim(str_match(player_data, "\\|\\s*([A-Za-z .'-]+?)\\s*\\|")[,2]),  
  Total_Points = as.numeric(str_match(player_data, "\\|\\s*([0-9]+\\.?[0-9]*)\\s*\\|")[,2]),
  Opponents = str_extract_all(player_data, "\\d{1,2}")  # Extract opponent numbers as a list
)

# Ensure all columns have the same number of rows
num_rows <- min(length(df$Rank), length(df$Player), length(df$Total_Points), length(df$Opponents))
df <- df[1:num_rows, ]

# Extract Pre-Ratings from rating_data
df$Pre_Rating <- as.numeric(str_match(rating_data, "R:\\s*([0-9]{3,4})")[,2])

# Extract State from rating_data
df$State <- str_trim(str_match(rating_data, "^\\s*([A-Z]{2})")[,2])

# Check for missing values and issue warnings
if (any(is.na(df$Pre_Rating))) {
  warning("Some Pre_Rating values could not be extracted. Check input formatting.")
  print(which(is.na(df$Pre_Rating)))
}

if (any(is.na(df$Total_Points))) {
  warning(" Some Total_Points values could not be extracted. Check input formatting.")
  print(which(is.na(df$Total_Points)))
}

# Print first 10 rows to verify
print(head(df, 10))

## # A tibble: 10 × 6
##     Rank Player                   Total_Points Opponents  Pre_Rating State
##    <dbl> <chr>                           <dbl> <list>          <dbl> <chr>
##  1     2 DAKSHESH DARURI                   6   <chr [10]>       1794 ON   
##  2     3 ADITYA BAJAJ                      6   <chr [10]>       1553 MI   
##  3     4 PATRICK H SCHILLING               5.5 <chr [10]>       1384 MI   
##  4     5 HANSHI ZUO                        5.5 <chr [10]>       1716 MI   
##  5     6 HANSEN SONG                       5   <chr [10]>       1655 MI   
##  6     7 GARY DEE SWATHELL                 5   <chr [10]>       1686 OH   
##  7     8 EZEKIEL HOUGHTON                  5   <chr [10]>       1649 MI   
##  8     9 STEFANO LEE                       5   <chr [10]>       1641 MI   
##  9    10 ANVIT RAO                         5   <chr [10]>       1411 ON   
## 10    11 CAMERON WILLIAM MC LEMAN          4.5 <chr [10]>       1365 MI

Calculate Average Opponent Rating

# Convert Opponents column into a list of numeric values
df$Opponents <- lapply(df$Opponents, function(x) as.numeric(unlist(x)))

# Create a lookup table for ratings
rating_lookup <- setNames(df$Pre_Rating, df$Rank)

# Calculate the average opponent rating
df$Avg_Opponent_Pre_Rating <- sapply(df$Opponents, function(opp) {
  opp_ratings <- as.numeric(rating_lookup[as.character(opp)])
  mean(opp_ratings, na.rm = TRUE)  # Get the average, ignoring NAs
})

# Print first 10 rows to verify
print(head(df, 10))

## # A tibble: 10 × 7
##     Rank Player   Total_Points Opponents Pre_Rating State Avg_Opponent_Pre_Rat…¹
##    <dbl> <chr>           <dbl> <list>         <dbl> <chr>                  <dbl>
##  1     2 DAKSHES…          6   <dbl>           1794 ON                     1503.
##  2     3 ADITYA …          6   <dbl>           1553 MI                     1488.
##  3     4 PATRICK…          5.5 <dbl>           1384 MI                     1642 
##  4     5 HANSHI …          5.5 <dbl>           1716 MI                     1573.
##  5     6 HANSEN …          5   <dbl>           1655 MI                     1520.
##  6     7 GARY DE…          5   <dbl>           1686 OH                     1532.
##  7     8 EZEKIEL…          5   <dbl>           1649 MI                     1472 
##  8     9 STEFANO…          5   <dbl>           1641 MI                     1531.
##  9    10 ANVIT R…          5   <dbl>           1411 ON                     1472.
## 10    11 CAMERON…          4.5 <dbl>           1365 MI                     1472.
## # ℹ abbreviated name: ¹Avg_Opponent_Pre_Rating

Rank Players by Performance

# Rank players by Total_Points, then Avg_Opponent_Rating (tie-breakers)
df <- df[order(-df$Total_Points, -df$Avg_Opponent_Pre_Rating), ]

# Assign new ranking
df$Final_Rank <- 1:nrow(df)

# Print updated rankings
print(head(df, 10))

## # A tibble: 10 × 8
##     Rank Player   Total_Points Opponents Pre_Rating State Avg_Opponent_Pre_Rat…¹
##    <dbl> <chr>           <dbl> <list>         <dbl> <chr>                  <dbl>
##  1     2 DAKSHES…          6   <dbl>           1794 ON                     1503.
##  2     3 ADITYA …          6   <dbl>           1553 MI                     1488.
##  3     4 PATRICK…          5.5 <dbl>           1384 MI                     1642 
##  4     5 HANSHI …          5.5 <dbl>           1716 MI                     1573.
##  5     7 GARY DE…          5   <dbl>           1686 OH                     1532.
##  6     9 STEFANO…          5   <dbl>           1641 MI                     1531.
##  7     6 HANSEN …          5   <dbl>           1655 MI                     1520.
##  8    10 ANVIT R…          5   <dbl>           1411 ON                     1472.
##  9     8 EZEKIEL…          5   <dbl>           1649 MI                     1472 
## 10    13 TORRANC…          4.5 <dbl>           1663 MI                     1567 
## # ℹ abbreviated name: ¹Avg_Opponent_Pre_Rating
## # ℹ 1 more variable: Final_Rank <int>

Export Cleaned Data to CSV

After preparing the final dataset, I exported it to a CSV. I then created a SQL script to generate a table and load the CSV into a MySQL database. There were a few permissions issues (thank you –secure-file-priv), but I was able to work around them and confirm all 63 players were loaded correctly.

# Select important columns including State for database import
final_table <- df[, c("Final_Rank", "Player", "State", "Total_Points", "Pre_Rating", "Avg_Opponent_Pre_Rating")]

# Export CSV
write.csv(final_table, "chess_tournament_results.csv", row.names = FALSE)

# Print confirmation
print("✅ CSV file 'chess_tournament_results.csv' has been successfully created.")

## [1] "✅ CSV file 'chess_tournament_results.csv' has been successfully created."

#Data Cleaning & Processing***

One of the trickier parts was aligning the player lines with their corresponding rating lines. Once aligned, I used regular expressions to pull out what I needed. I then used those opponent IDs to match players and compute each person’s average opponent rating.

Visualize Data

With a clean dataset in hand, I created a few visualizations using ggplot2 to highlight top performers and explore relationships in the data. For example:A bar chart showing the top 10 players by total points.A scatter plot showing how a player’s pre-rating compared to the average rating of their opponents.I also added a Shiny app so the data could be explored interactively. Users can filter players by rating and view individual stats, making the data more dynamic and useful for quick insights.

library(ggplot2)

# Bar plot of top 10 players by Total Points
ggplot(head(df, 10), aes(x = reorder(Player, -Total_Points), y = Total_Points, fill = as.factor(Final_Rank))) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Top 10 Players by Points", x = "Player", y = "Total Points") +
  theme_minimal()

Interactive Data w. shiny

ggplot(data = df, aes(x = Pre_Rating, y = Avg_Opponent_Pre_Rating, color = Total_Points)) +
  geom_point(size = 3) +
  labs(title = "Pre-Rating vs. Average Opponent Pre-Rating", x = "Pre-Rating", y = "Avg Opponent Pre-Rating") +
  theme_minimal()

# Rpubs dont allow interactive data

library(shiny)
library(ggplot2)
library(dplyr)

# UI: Sidebar with inputs
ui <- fluidPage(
    titlePanel("Chess Tournament Analysis"),
    sidebarLayout(
        sidebarPanel(
            selectInput("player", "Select Player:", choices = unique(df$Player), selected = unique(df$Player)[1]),
            sliderInput("minRating", "Minimum Rating:", min = min(df$Pre_Rating), max = max(df$Pre_Rating), value = min(df$Pre_Rating))
        ),
        mainPanel(
            plotOutput("ratingPlot"),
            tableOutput("playerStats")
        )
    )
)

# SERVER: Dynamic filtering & plotting
server <- function(input, output) {
    filteredData <- reactive({
      df$Player <- trimws(df$Player)  # Remove extra spaces
    df$Player <- as.character(df$Player)  # Ensure it's a character
        df %>% filter(Pre_Rating >= input$minRating)
    })

    output$ratingPlot <- renderPlot({
        ggplot(filteredData(), aes(x = Pre_Rating, y = Avg_Opponent_Pre_Rating, color = Total_Points)) +
            geom_point(size = 3) +
            labs(title = "Pre-Rating vs. Avg Opponent Pre-Rating", x = "Pre-Rating", y = "Avg Opponent Pre-Rating") +
            theme_minimal()
    })

    output$playerStats <- renderTable({
    req(input$player)  # Ensure input is not empty
    filtered_df <- df %>% filter(Player == input$player)

    if (nrow(filtered_df) == 0) {
        return(data.frame(Message = "No data found for this player"))
    }

    # Select only existing columns
    filtered_df %>% select(any_of(c("Player", "State", "Total_Points", "Pre_Rating", "Avg_Opponent_Pre_Rating")))
})

}

# Run the app
shinyApp(ui, server)

Reflection

This project taught me how to wrangle semi-structured text data and bring it to life. I went from a plain .txt file that looked like gibberish to a working SQL-ready table, a downloadable CSV, clear visualizations, and even an interactive Shiny dashboard. Every part of the workflow had its own challenge—especially extracting and aligning the data—but it was satisfying to see it all come together.

If I had more time, I’d probably explore how performance correlates with other metrics like number of games played or opponent diversity. I’d also look into deploying the Shiny app or visualizations online, just to make it more accessible.

Chess Tournament Data Analysis

Sheriann McLarty

2025-03-22

Interactive Data w. shiny