Introduction

In this project, you’re given a text file with chess tournament results where the information has some structure.

Your job is to create an R Markdown file that generates a .CSV file (that could for example be imported into a SQL database) with the following information for all of the players:

  • Player’s Name
  • Player’s State
  • Total Number of Points
  • Player’s Pre-Rating
  • Average Pre Chess Rating of Opponents

Loading packages

These are the packages that will be necessary to complete the following conversion of the Chess Tournament Data.

library(tidyverse)
library(knitr)
library(kableExtra)

Load raw data from Github

The first step is to load the data into R from the text file.

setwd("C:/Users/biguz/Desktop/CUNY Data Science/Fall2020/Data 607/Projects/Project 1")
tournament_data <- readLines("tournamentinfo.txt", warn=FALSE)

readLines() returns the data as a character vector with one element per line of the file, so we will need some string operations to turn it into a usable data frame and get it ready to export to a CSV file.
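As a minimal sketch of what readLines() hands back, the snippet below reads two sample lines through a text connection instead of the real file. The sample lines use the compact layout shown in this report, so their spacing is illustrative rather than an exact copy of tournamentinfo.txt.

```r
# Two sample lines in the layout shown in this report (spacing is illustrative)
sample_text <- c(
  "1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|",
  "ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
)

# A text connection stands in for the file on disk
con <- textConnection(sample_text)
sample_lines <- readLines(con, warn = FALSE)
close(con)

# The result is a character vector with one element per line
print(is.character(sample_lines))
print(length(sample_lines))
```

Because each line is its own element, the separator rows can later be dropped with ordinary vector subsetting.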

How the initial load of the data looks (first 7 rows as an example):
tournament_data
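-----------------------------------------------------------------------------------------
Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round|
Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
-----------------------------------------------------------------------------------------
1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|
ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |
-----------------------------------------------------------------------------------------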

We want to remove the dashed separator lines and combine the two lines that describe each player into a single row.

Data munging to create readable dataframe

In this section I reshaped the raw lines shown above into a more readable data frame, from which the requested CSV file could be created.

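  1. I removed the dashed separator lines from the data. I also called str_replace() with the pattern " {2}", which removes only the first run of two spaces in each line; the remaining extra spacing is trimmed in a later step.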
lines <-
  c('-----------------------------------------------------------------------------------------')
tournament_data <- tournament_data[!(tournament_data == lines)]
tournament_data <- str_replace(tournament_data," {2}","")
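The separator removal above compares each line against one exact string of dashes. As a hedged alternative, a pattern match drops any line made up only of dashes, whatever its length. This is a sketch on sample lines, not the code used for the results in this report, and the dash count here is arbitrary.

```r
# Sample input: two data lines wrapped in dashed separators (dash count is arbitrary)
sample_lines <- c(
  strrep("-", 20),
  "1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|",
  "ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |",
  strrep("-", 20)
)

# TRUE for lines that contain nothing but dashes
is_separator <- grepl("^-+$", sample_lines)

# Keep only the data lines
kept <- sample_lines[!is_separator]
print(kept)
```

The pattern keeps working if the file is regenerated with a different table width, which the exact string comparison would not.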
  2. I created a data frame by splitting each line on “|”. Having the data in a data frame helps in a later step, where I merge each player's second row into the first. I also noticed that an extra column was created, so I removed it as it held duplicate information.
#Create dataframe
tournament_raw_df <- data.frame(do.call(rbind, strsplit(tournament_data, "|", fixed=TRUE)))
tournament_raw_df <- tournament_raw_df[-c(11)]
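The split in this step relies on fixed = TRUE because “|” on its own is the regular-expression alternation operator, not a literal pipe. The sketch below shows the split and the rbind step on shortened sample lines; the object names are illustrative and are not used elsewhere in this report.

```r
# Shortened sample line in the layout shown in this report
line <- "1 | GARY HUA |6.0 |W 39|W 21|"

# fixed = TRUE treats "|" as a literal character rather than regex alternation
fields <- trimws(strsplit(line, "|", fixed = TRUE)[[1]])
print(fields)

# Stacking one split vector per line gives a matrix, which becomes a data frame
two_lines <- c("1 | GARY HUA |6.0 ", "2 | DAKSHESH DARURI |6.0 ")
split_rows <- lapply(strsplit(two_lines, "|", fixed = TRUE), trimws)
sample_df <- data.frame(do.call(rbind, split_rows), stringsAsFactors = FALSE)
print(sample_df)
```

Columns created this way get the default names X1, X2, and so on, which is why the rename step later maps from those names.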
  3. I realized that each participant of the tournament takes up 2 rows. I merged each even row into the odd row above it, using “|” as a separator. After merging, I kept only the odd rows from row 3 onward, which drops the two header rows and the even rows that were now excess data.
#Merge row data
for (column in names(tournament_raw_df)){
  for (i in 1:nrow(tournament_raw_df)){
    if((i %% 2) > 0){
          tournament_raw_df[i,column] <-
            paste(tournament_raw_df[i,column], tournament_raw_df[i+1,column],sep="|")}}}

#Keeping the merged odd rows, starting at row 3 to skip the header
toKeep <- seq(3,nrow(tournament_raw_df),2)
tournament_raw_df <- tournament_raw_df[toKeep,]
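The pairing idea can also be expressed without a loop. The sketch below is an alternative under stated assumptions, not the method used above: it works on the raw lines rather than on data frame cells, and it assumes the header and separator lines are already gone so the lines strictly alternate between a player's first and second line.

```r
# Shortened sample lines: first and second line for two players
rows <- c(
  "1 | GARY HUA |6.0 ",
  "ON | 15445895 / R: 1794 ->1817 |N:2 ",
  "2 | DAKSHESH DARURI |6.0 ",
  "MI | 14598900 / R: 1553 ->1663 |N:2 "
)

# Odd positions hold the first line of each player, even positions the second
first_line  <- rows[seq(1, length(rows), by = 2)]
second_line <- rows[seq(2, length(rows), by = 2)]

# paste() is vectorized, so each pair is joined in one call
combined <- paste(first_line, second_line, sep = "|")
print(combined)
```

Note that this produces a different field order than the cell-by-cell merge above, so the column renaming would have to be adjusted if it were swapped in.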
  4. I pasted each row back into a single string and split it again on the separator “|”, so that every field landed in its own column of a new, more readable data frame.
#Creating new clean dataframe
df_args <- c(tournament_raw_df, sep="|")
tournament_raw_df <- do.call(paste,df_args)
tournament_df <- data.frame(do.call(rbind, strsplit(tournament_raw_df, "|", fixed=TRUE)))

#Renaming columns
tournament_df <- rename(tournament_df,
  c("Pair Num" = "X1", "State" = "X2", "Player Name" = "X3", "USCF ID / Rtg (Pre->Post)" = "X4",
    "Total Pts" = "X5","N Count" = "X6","Round 1 Outcome" = "X7", "Round 1 Pieces" = "X8", 
    "Round 2 Outcome" = "X9", "Round 2 Pieces" = "X10", "Round 3 Outcome" = "X11",
    "Round 3 Pieces" = "X12", "Round 4 Outcome" = "X13", "Round 4 Pieces" = "X14",
    "Round 5 Outcome" = "X15", "Round 5 Pieces" = "X16", "Round 6 Outcome" = "X17",
    "Round 6 Pieces" = "X18", "Round 7 Outcome" = "X19", "Round 7 Pieces" = "X20"))
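  5. I recoded the piece colors for each round, so “W” became “White” and “B” became “Black”. Rounds that were not played have no color and are left as NA.
for (col in names(tournament_df)){
  if (grepl("Pieces",col)){
    #Cells with neither W nor B (rounds not played) become NA instead of defaulting to "Black"
    tournament_df[,c(col)] <- ifelse(grepl("W",tournament_df[,c(col)]),"White",
                                ifelse(grepl("B",tournament_df[,c(col)]),"Black",NA))}}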
  6. I continued splitting the data frame to create separate columns for each participant's USCF ID, pre-tournament rating, and post-tournament rating.
#Creating ID column
tournament_df$USCFID <-
  unlist(lapply(
    strsplit(as.character(tournament_df[,c("USCF ID / Rtg (Pre->Post)")]), " / "), '[', 1))

#Creating pre-post column
tournament_df$"Rtg (Pre->Post)" <-
  unlist(lapply(
    strsplit(as.character(tournament_df[,c("USCF ID / Rtg (Pre->Post)")]), " / "), '[', 2))

#Removing old id/rating column
tournament_df <- tournament_df[, -which(names(tournament_df) %in% "USCF ID / Rtg (Pre->Post)")]

#Creating pre rating column
tournament_df$PreRating <-
  unlist(lapply(
    strsplit(as.character(tournament_df[,c("Rtg (Pre->Post)")]), "->"), '[', 1))

#Creating post rating column
tournament_df$PostRating <-
  unlist(lapply(
    strsplit(as.character(tournament_df[,c("Rtg (Pre->Post)")]), "->"), '[', 2))

#Removing old pre/post rating column
tournament_df <- tournament_df[,-which(names(tournament_df) %in% "Rtg (Pre->Post)")]
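The same three values can be pulled out in one pass with a regular expression. This is a hedged sketch rather than the approach used above: the pattern assumes the field always looks like "ID / R: pre ->post", and the third sample value is a made-up provisional rating included only to show that a trailing P suffix is skipped.

```r
rating_field <- c(
  "15445895 / R: 1794 ->1817",
  "14598900 / R: 1553 ->1663",
  "00000000 / R: 1200P5 ->1250P6"  # hypothetical provisional rating, for illustration only
)

# Capture groups: (1) USCF ID, (2) pre-rating digits, (3) post-rating digits
pattern <- "^ *([0-9]+) */ *R: *([0-9]+)[^-]*-> *([0-9]+)"
parts <- regmatches(rating_field, regexec(pattern, rating_field))

# Element 1 of each match is the full match; elements 2 to 4 are the groups
parsed <- do.call(rbind, lapply(parts, function(p) p[2:4]))

ratings <- data.frame(
  USCFID = parsed[, 1],
  PreRating = as.integer(parsed[, 2]),
  PostRating = as.integer(parsed[, 3]),
  stringsAsFactors = FALSE
)
print(ratings)
```

Capturing only the leading digits means the "R:" prefix and any P suffix never reach the rating columns, so no separate cleanup pass is needed in this variant.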
  7. I cleaned up the new pre- and post-rating columns by removing any excess text. (Some of the ratings had a P followed by digits, which I removed as it was not relevant to this project.)
#Remove "R:" from pre rating number
tournament_df$PreRating <- gsub("R: ", "", tournament_df$PreRating)

#Remove P and everything after it from pre and post rating
tournament_df$PreRating <- gsub("P.*", "", tournament_df$PreRating)
tournament_df$PostRating <- gsub("P.*", "", tournament_df$PostRating)
  8. I split each round outcome column into a round outcome column and a round opponent column. This way I could easily match a round opponent to their pre-tournament rating.
#Create new columns and populate values
for (col in names(tournament_df)){
  if (grepl("Outcome",col)){
    #Opponent column
    tournament_df[,c(gsub("Outcome", "Opponent", col))] <-
      unlist(lapply(strsplit(as.character(tournament_df[,c(col)]), "  "), '[', 2))
    #Outcome column
    tournament_df[,c(col)] <- gsub("([A-Z]).*","\\1",tournament_df[,c(col)])}}

#Clean up white spaces
for (i in names(tournament_df)) {
  tournament_df[[i]] <-
    trimws(tournament_df[[i]], which = c("both", "left", "right"), whitespace = "[ \t\r\n]")}
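A compact way to see what this split produces is to run it on a few cells. The sketch below uses the compact cell values shown in this report for GARY HUA plus one assumed value, "H", standing in for a round with no opponent; the exact codes used for unplayed rounds are an assumption here.

```r
# Round cells: a result letter followed by the opponent's pair number
# "H" is an assumed code for a round without an opponent
round_cells <- c("W 39", "D 12", "W 7", "H")

# The outcome is the first character of the cell
outcome <- substr(round_cells, 1, 1)

# Strip the letter and spaces; an empty remainder converts to NA
opponent <- as.integer(sub("^[A-Z] *", "", round_cells))

print(data.frame(cell = round_cells, outcome = outcome, opponent = opponent))
```

Storing the opponent as an integer with NA for unplayed rounds makes the later rating lookup and the games-played count straightforward.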
  9. I looked up each opponent's pre-tournament chess rating by matching the round opponent's pair number to the pair number in the tournament data frame, grabbing the value in the PreRating column, and writing it into the corresponding row and column.
for (i in seq(nrow(tournament_df))){
  for (col in names(tournament_df)){
    if (grepl("Opponent",col)){
        tournament_df[i,c(gsub("Opponent","Opp PreRtg",col))] <-
          ifelse(
            is.na(tournament_df[i,col]), NA,
            tournament_df$PreRating[tournament_df$`Pair Num` == tournament_df[i,col]])}}}
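The lookup in this step can also be written with match(), which returns the row position of each opponent and NA where there is none. The sketch below is an illustration on a four-player subset whose pair numbers and ratings are taken from the data dump in this report; the data frame and column names are hypothetical.

```r
# Subset of players: pair number and pre-tournament rating
players <- data.frame(
  PairNum = c(1L, 4L, 7L, 12L),
  PreRating = c(1794L, 1716L, 1649L, 1663L)
)

# Opponents faced in three rounds, plus NA for a round that was not played
opponents <- c(7L, 12L, 4L, NA)

# match() gives the row of each opponent; indexing with NA yields NA
opp_rating <- players$PreRating[match(opponents, players$PairNum)]
print(opp_rating)
```

Because match() is vectorized, a whole opponent column can be translated in one call instead of row by row.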
  10. Finally, I reorganized the columns for readability. The result is still a raw data frame: it is too wide to read comfortably, as seen in the example below.
Data dump of chess tournament (example first 5 rows)
Pair Num State Player Name USCFID PreRating PostRating Total Pts N Count Round 1 Opponent Round 1 Opp PreRtg Round 1 Outcome Round 1 Pieces Round 2 Opponent Round 2 Opp PreRtg Round 2 Outcome Round 2 Pieces Round 3 Opponent Round 3 Opp PreRtg Round 3 Outcome Round 3 Pieces Round 4 Opponent Round 4 Opp PreRtg Round 4 Outcome Round 4 Pieces Round 5 Opponent Round 5 Opp PreRtg Round 5 Outcome Round 5 Pieces Round 6 Opponent Round 6 Opp PreRtg Round 6 Outcome Round 6 Pieces Round 7 Opponent Round 7 Opp PreRtg Round 7 Outcome Round 7 Pieces
1 ON GARY HUA 15445895 1794 1817 6.0 N:2 39 1436 W White 21 1563 W Black 18 1600 W White 14 1610 W Black 7 1649 W White 12 1663 D Black 4 1716 D White
2 MI DAKSHESH DARURI 14598900 1553 1663 6.0 N:2 63 1175 W Black 58 917 W White 4 1716 L Black 17 1629 W White 16 1604 W Black 20 1595 W White 7 1649 W Black
3 MI ADITYA BAJAJ 14959604 1384 1640 6.0 N:2 8 1641 L White 61 955 W Black 25 1745 W White 21 1563 W Black 11 1712 W White 13 1666 W Black 12 1663 W White
4 MI PATRICK H SCHILLING 12616049 1716 1744 5.5 N:2 23 1363 W White 28 1507 D Black 2 1553 W White 26 1579 W Black 5 1655 D White 19 1564 W Black 1 1794 D Black
5 MI HANSHI ZUO 14601533 1655 1690 5.5 N:2 45 1242 W Black 37 980 W White 12 1663 D Black 13 1666 D White 4 1716 D Black 14 1610 W White 17 1629 W Black

Creating the CSV file

For this project we were asked to generate a .CSV file (that could for example be imported into a SQL database) with the following information for all of the players:

  • Player’s Name
  • Player’s State
  • Total Number of Points
  • Player’s Pre-Rating
  • Average Pre Chess Rating of Opponents

The first 4 columns of the CSV file were completed in the previous section. I used this section to create the final column of the CSV table, Average Pre Chess Rating of Opponents.

  1. I converted the opponent pre rating fields, which I summed in the next step, into numeric columns using a for loop.
for (col in names(tournament_df)){
  if (grepl("PreRtg",col)){
    tournament_df[,c(col)] <- as.numeric(tournament_df[,c(col)])}}
  2. I summed the pre-tournament ratings of each participant's opponents and counted the number of games played. I did this with the rowSums function, ignoring NA values since some participants did not play a full 7 rounds.
#Creating pre rating sum
tournament_df$OppPreRtgSum <-
  rowSums(tournament_df[,c("Round 1 Opp PreRtg", "Round 2 Opp PreRtg", "Round 3 Opp PreRtg",
                           "Round 4 Opp PreRtg", "Round 5 Opp PreRtg", "Round 6 Opp PreRtg",
                           "Round 7 Opp PreRtg")], na.rm = TRUE)

#Creating games played
tournament_df$GamesPlayed <-
  rowSums(!is.na(tournament_df[,c("Round 1 Opp PreRtg", "Round 2 Opp PreRtg", "Round 3 Opp PreRtg",
                                  "Round 4 Opp PreRtg", "Round 5 Opp PreRtg", "Round 6 Opp PreRtg",
                                  "Round 7 Opp PreRtg")]))
  3. I divided the total sum of pre-tournament ratings by the number of games played to get the average opponent rating for each tournament participant.
tournament_df <- tournament_df %>%
  mutate(AvgOppRtg = OppPreRtgSum/GamesPlayed)
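As a check on the arithmetic, the sum-and-count approach gives the same answer as rowMeans() with na.rm = TRUE. The sketch below uses GARY HUA's seven opponent ratings from the data dump above and one hypothetical row with unplayed rounds; the matrix and row names are illustrative.

```r
# Opponent pre-ratings per round; NA marks a round that was not played
opp <- rbind(
  gary_hua = c(1436, 1563, 1600, 1610, 1649, 1663, 1716),
  illustrative = c(1436, 1563, 1600, NA, NA, NA, NA)  # hypothetical row
)

# Sum and count, as in the steps above
avg_by_sum <- rowSums(opp, na.rm = TRUE) / rowSums(!is.na(opp))

# Same result in one call
avg_by_mean <- rowMeans(opp, na.rm = TRUE)

print(round(avg_by_mean, 3))
```

For GARY HUA the seven ratings add up to 11237, and 11237 / 7 rounds to the 1605.286 reported in the final table.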
  4. I created the export table with the five requested columns, as seen below:
Final Chess Tournament Table
Player’s Name | Player’s State | Total Number of Points | Player’s Pre-Rating | Average Pre Chess Rating of Opponents
GARY HUA | ON | 6.0 | 1794 | 1605.286
DAKSHESH DARURI | MI | 6.0 | 1553 | 1469.286
ADITYA BAJAJ | MI | 6.0 | 1384 | 1563.571
PATRICK H SCHILLING | MI | 5.5 | 1716 | 1573.571
HANSHI ZUO | MI | 5.5 | 1655 | 1500.857
HANSEN SONG | OH | 5.0 | 1686 | 1518.714
GARY DEE SWATHELL | MI | 5.0 | 1649 | 1372.143
EZEKIEL HOUGHTON | MI | 5.0 | 1641 | 1468.429
STEFANO LEE | ON | 5.0 | 1411 | 1523.143
ANVIT RAO | MI | 5.0 | 1365 | 1554.143
CAMERON WILLIAM MC LEMAN | MI | 4.5 | 1712 | 1467.571
KENNETH J TACK | MI | 4.5 | 1663 | 1506.167
TORRANCE HENRY JR | MI | 4.5 | 1666 | 1497.857
BRADLEY SHAW | MI | 4.5 | 1610 | 1515.000
ZACHARY JAMES HOUGHTON | MI | 4.5 | 1220 | 1483.857
MIKE NIKITIN | MI | 4.0 | 1604 | 1385.800
RONALD GRZEGORCZYK | MI | 4.0 | 1629 | 1498.571
DAVID SUNDEEN | MI | 4.0 | 1600 | 1480.000
DIPANKAR ROY | MI | 4.0 | 1564 | 1426.286
JASON ZHENG | MI | 4.0 | 1595 | 1410.857
DINH DANG BUI | ON | 4.0 | 1563 | 1470.429
EUGENE L MCCLURE | MI | 4.0 | 1555 | 1300.333
ALAN BUI | ON | 4.0 | 1363 | 1213.857
MICHAEL R ALDRICH | MI | 4.0 | 1229 | 1357.000
LOREN SCHWIEBERT | MI | 3.5 | 1745 | 1363.286
MAX ZHU | ON | 3.5 | 1579 | 1506.857
GAURAV GIDWANI | MI | 3.5 | 1552 | 1221.667
SOFIA ADINA STANESCU-BELLU | MI | 3.5 | 1507 | 1522.143
CHIEDOZIE OKORIE | MI | 3.5 | 1602 | 1313.500
GEORGE AVERY JONES | ON | 3.5 | 1522 | 1144.143
RISHI SHETTY | MI | 3.5 | 1494 | 1259.857
JOSHUA PHILIP MATHEWS | ON | 3.5 | 1441 | 1378.714
JADE GE | MI | 3.5 | 1449 | 1276.857
MICHAEL JEFFERY THOMAS | MI | 3.5 | 1399 | 1375.286
JOSHUA DAVID LEE | MI | 3.5 | 1438 | 1149.714
SIDDHARTH JHA | MI | 3.5 | 1355 | 1388.167
AMIYATOSH PWNANANDAM | MI | 3.5 | 980 | 1384.800
BRIAN LIU | MI | 3.0 | 1423 | 1539.167
JOEL R HENDON | MI | 3.0 | 1436 | 1429.571
FOREST ZHANG | MI | 3.0 | 1348 | 1390.571
KYLE WILLIAM MURPHY | MI | 3.0 | 1403 | 1248.500
JARED GE | MI | 3.0 | 1332 | 1149.857
ROBERT GLEN VASEY | MI | 3.0 | 1283 | 1106.571
JUSTIN D SCHILLING | MI | 3.0 | 1199 | 1327.000
DEREK YAN | MI | 3.0 | 1242 | 1152.000
JACOB ALEXANDER LAVALLEY | MI | 3.0 | 377 | 1357.714
ERIC WRIGHT | MI | 2.5 | 1362 | 1392.000
DANIEL KHAIN | MI | 2.5 | 1382 | 1355.800
MICHAEL J MARTIN | MI | 2.5 | 1291 | 1285.800
SHIVAM JHA | MI | 2.5 | 1056 | 1296.000
TEJAS AYYAGARI | MI | 2.5 | 1011 | 1356.143
ETHAN GUO | MI | 2.5 | 935 | 1494.571
JOSE C YBARRA | MI | 2.0 | 1393 | 1345.333
LARRY HODGE | MI | 2.0 | 1270 | 1206.167
ALEX KONG | MI | 2.0 | 1186 | 1406.000
MARISA RICCI | MI | 2.0 | 1153 | 1414.400
MICHAEL LU | MI | 2.0 | 1092 | 1363.000
VIRAJ MOHILE | MI | 2.0 | 917 | 1391.000
SEAN M MC CORMICK | MI | 2.0 | 853 | 1319.000
JULIA SHEN | MI | 1.5 | 967 | 1330.200
JEZZEL FARKAS | ON | 1.5 | 955 | 1327.286
ASHWIN BALAJI | MI | 1.0 | 1530 | 1186.000
THOMAS JOSEPH HOSMER | MI | 1.0 | 1175 | 1350.200
BEN LI | MI | 1.0 | 1163 | 1263.000
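
The export step itself is not shown above. A minimal sketch of writing a table like this to a .CSV file with base R follows; the data frame, its column names, and the output file name are assumptions made for illustration, and only the first two players are included.

```r
# First two rows of the final table, rebuilt by hand for illustration
final_table <- data.frame(
  Name = c("GARY HUA", "DAKSHESH DARURI"),
  State = c("ON", "MI"),
  TotalPoints = c(6.0, 6.0),
  PreRating = c(1794L, 1553L),
  AvgOppPreRating = c(1605.286, 1469.286),
  stringsAsFactors = FALSE
)

# Hypothetical output location; a temp directory keeps the sketch side-effect free
out_file <- file.path(tempdir(), "chess_tournament.csv")

# row.names = FALSE avoids an extra unnamed index column in the file
write.csv(final_table, out_file, row.names = FALSE)

# Read the file back to confirm it round-trips
check <- read.csv(out_file, stringsAsFactors = FALSE)
print(check)
```

Column names without spaces or apostrophes tend to import more cleanly into a SQL database, which is why the sketch shortens the display headers.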