Introduction

In this project, you’re given a text file with chess tournament results where the information has some structure.

Your job is to create an R Markdown file that generates a .CSV file (that could for example be imported into a SQL database) with the following information for all of the players:

Player’s Name
Player’s State
Total Number of Points
Player’s Pre-Rating
Average Pre Chess Rating of Opponents

Loading packages

These are the packages that will be necessary to complete the following conversion of the Chess Tournament Data.

library(tidyverse)
library(knitr)
library(kableExtra)

Load raw data from GitHub

The first step is to load the data into R from the text file.

setwd("C:/Users/biguz/Desktop/CUNY Data Science/Fall2020/Data 607/Projects/Project 1")
tournament_data <- readLines("tournamentinfo.txt", warn=FALSE)

readLines() reads the file into R as a character vector with one element per line, so we will need some string operations to turn it into a usable data frame that can be exported to a CSV file.
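As a small illustration of what readLines() returns, the sketch below reads from an in-memory connection instead of the real file. The sample_text variable is a stand-in invented for this example, and its lines are abbreviated from the listing in this report.

```r
# Hypothetical stand-in for the file: a few abbreviated lines held in memory
sample_text <- c(
  "-----------------------",
  "    1 | GARY HUA |6.0  |W  39|",
  "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |"
)

con <- textConnection(sample_text)
sample_lines <- readLines(con, warn = FALSE)
close(con)

# readLines() returns a character vector with one element per line
class(sample_lines)
length(sample_lines)
```

Because each line is a separate element, whole lines can be dropped or edited with ordinary vector indexing.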

How the data looks after the initial load (first 7 rows as an example):
tournament_data
-----------------------------------------------------------------------------------------
Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round|
Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
-----------------------------------------------------------------------------------------
1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|
ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |
-----------------------------------------------------------------------------------------

We want to remove the separator lines and combine the two rows that describe each player into a single row.

Data munging to create readable dataframe

In this section I reshaped the table shown above into a more readable data frame, from which the requested CSV file could be created.

I removed the separator lines from the character vector and stripped the first run of two spaces from each remaining line (str_replace only replaces the first match).

lines <-
  c('-----------------------------------------------------------------------------------------')
tournament_data <- tournament_data[!(tournament_data == lines)]
tournament_data <- str_replace(tournament_data," {2}","")
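A minimal sketch of the same clean-up on a toy vector, using base R only. Matching separator rows with the regular expression ^-+$ is a substitution for the exact string comparison above, assumed here so that the example does not depend on the exact number of hyphens. sub() stands in for str_replace(), since both replace only the first match in each element.

```r
# Toy input: separator rows mixed with data rows (abbreviated, hypothetical)
toy <- c("-----", "    1 | GARY HUA  |6.0  |", "-----", "   ON | 15445895  |N:2  |")

# Drop rows that consist only of hyphens
toy <- toy[!grepl("^-+$", toy)]

# sub() replaces only the first run of two spaces in each element
toy_first <- sub(" {2}", "", toy)

toy_first
```

Only the leading indentation is shortened; double spaces later in each line, such as the one before the first pipe, are left in place.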

I created a data frame. Creating this data frame helps me in a later step by allowing me to merge odd rows with even rows. I also noticed an extra column was created so I removed it as it was duplicate information.

#Create dataframe
tournament_raw_df <- data.frame(do.call(rbind, strsplit(tournament_data, "|", fixed=TRUE)))
tournament_raw_df <- tournament_raw_df[-c(11)]
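The split-and-bind step can be sketched on two short strings. The rows vector is made up for illustration. The point is that fixed = TRUE makes strsplit() treat the pipe as a literal character rather than as the regular-expression alternation operator.

```r
# Two abbreviated, hypothetical rows with pipe-delimited fields
rows <- c("1|GARY HUA|6.0", "ON|15445895|N:2")

# fixed = TRUE treats "|" as a literal character, not regex alternation
parts <- strsplit(rows, "|", fixed = TRUE)

# Stack the equal-length pieces into a matrix, then convert to a data frame
toy_df <- data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)

dim(toy_df)
names(toy_df)
```

Since the matrix has no column names, data.frame() assigns X1, X2, and so on, which is where the X-prefixed names used later in this report come from.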

I realized that each participant of the tournament occupies 2 rows. I merged each even row into the odd row above it, using “|” as a separator. After merging, I kept only the odd rows from row 3 onward: the even rows were now redundant, and row 1 held the merged column headers.

#Merge row data
for (column in names(tournament_raw_df)){
  for (i in 1:nrow(tournament_raw_df)){
    if((i %% 2) > 0){
          tournament_raw_df[i,column] <-
            paste(tournament_raw_df[i,column], tournament_raw_df[i+1,column],sep="|")}}}

#Keep only the merged odd rows, skipping the header in row 1
toKeep <- seq(3,nrow(tournament_raw_df),2)
tournament_raw_df <- tournament_raw_df[toKeep,]
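The nested loop can also be written without explicit row indexing. The following is a sketch of a vectorized alternative on a toy data frame, not the approach used in this report. The toy values are abbreviated from the listing above, and the names odd, even, and merged are introduced only for this example.

```r
toy <- data.frame(
  X1 = c("Pair", "Num", "1", "ON", "2", "MI"),
  X2 = c("Player Name", "USCF ID", "GARY HUA", "15445895",
         "DAKSHESH DARURI", "14598900"),
  stringsAsFactors = FALSE
)

odd  <- seq(1, nrow(toy), by = 2)  # first row of each pair
even <- odd + 1                    # second row of each pair

# Paste each odd row to the even row below it, column by column
merged <- as.data.frame(
  Map(function(a, b) paste(a, b, sep = "|"), toy[odd, ], toy[even, ]),
  stringsAsFactors = FALSE
)

# Drop the first merged row, which holds the two header lines
merged <- merged[-1, ]
merged
```

This assumes an even number of rows, so that every odd row has a partner below it.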

I pasted the columns of each row back into a single string and split it again on the separator “|”. This gave every field its own column and produced a more readable data frame.

#Creating new clean dataframe
df_args <- c(tournament_raw_df, sep="|")
tournament_raw_df <- do.call(paste,df_args)
tournament_df <- data.frame(do.call(rbind, strsplit(tournament_raw_df, "|", fixed=TRUE)))

#Renaming columns
tournament_df <- rename(tournament_df,
  c("Pair Num" = "X1", "State" = "X2", "Player Name" = "X3", "USCF ID / Rtg (Pre->Post)" = "X4",
    "Total Pts" = "X5","N Count" = "X6","Round 1 Outcome" = "X7", "Round 1 Pieces" = "X8", 
    "Round 2 Outcome" = "X9", "Round 2 Pieces" = "X10", "Round 3 Outcome" = "X11",
    "Round 3 Pieces" = "X12", "Round 4 Outcome" = "X13", "Round 4 Pieces" = "X14",
    "Round 5 Outcome" = "X15", "Round 5 Pieces" = "X16", "Round 6 Outcome" = "X17",
    "Round 6 Pieces" = "X18", "Round 7 Outcome" = "X19", "Round 7 Pieces" = "X20"))

I recoded the values that identify the color of the pieces played in each round, so “W” became “White” and “B” became “Black”.

for (col in names(tournament_df)){
  if (grepl("Pieces",col)){
    tournament_df[,c(col)] = ifelse(grepl("W",tournament_df[,c(col)]),"White","Black")}}
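The ifelse() call above labels any value without a W as Black, which would include a blank cell if one occurs. A possible variation, sketched here on a made-up vector, uses a named lookup so that anything other than W or B becomes NA. This is an alternative, not what the report does.

```r
# Hypothetical piece codes, including one blank cell
pieces <- c("W    ", "B    ", "     ")

code <- trimws(pieces)

# Named lookup vector: anything other than "W" or "B" becomes NA
colors <- unname(c(W = "White", B = "Black")[code])
colors
```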

I continued splitting the data frame to create separate columns for each participant’s USCF ID, pre-tournament rating, and post-tournament rating.

#Creating ID column
tournament_df$USCFID <-
  unlist(lapply(
    strsplit(as.character(tournament_df[,c("USCF ID / Rtg (Pre->Post)")]), " / "), '[', 1))

#Creating pre-post column
tournament_df$"Rtg (Pre->Post)" <-
  unlist(lapply(
    strsplit(as.character(tournament_df[,c("USCF ID / Rtg (Pre->Post)")]), " / "), '[', 2))

#Removing old id/rating column
tournament_df <- tournament_df[, -which(names(tournament_df) %in% "USCF ID / Rtg (Pre->Post)")]

#Creating pre rating column
tournament_df$PreRating <-
  unlist(lapply(
    strsplit(as.character(tournament_df[,c("Rtg (Pre->Post)")]), "->"), '[', 1))

#Creating post rating column
tournament_df$PostRating <-
  unlist(lapply(
    strsplit(as.character(tournament_df[,c("Rtg (Pre->Post)")]), "->"), '[', 2))

#Removing old pre/post rating column
tournament_df <- tournament_df[,-which(names(tournament_df) %in% "Rtg (Pre->Post)")]

I cleaned up the new pre- and post-rating columns by stripping the extra characters. (Some of the ratings had a P followed by a list of numbers, which I removed as it was not relevant to this project.)

#Remove "R:" from pre rating number
tournament_df$PreRating <- gsub("R: ", "", tournament_df$PreRating)

#Remove P and everything after it from pre and post rating
tournament_df$PreRating <- gsub("P.*", "", tournament_df$PreRating)
tournament_df$PostRating <- gsub("P.*", "", tournament_df$PostRating)
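The split-then-strip sequence above could also be collapsed into one capture-group substitution per rating. This is a sketch of that idea, not the method used in the report. The first string follows the format shown in the listing above. The second is a made-up entry with P suffixes, included only to show that the suffix is ignored.

```r
raw <- c(
  "15445895 / R: 1794   ->1817",
  "00000000 / R: 1500P12->1520P19"  # hypothetical entry with P suffixes
)

# Capture the digits after "R:" (pre-rating) and after "->" (post-rating)
pre  <- as.integer(sub(".*R:\\s*(\\d+).*", "\\1", raw))
post <- as.integer(sub(".*->\\s*(\\d+).*", "\\1", raw))

pre
post
```

Because the capture group takes digits only, the R: prefix, the padding spaces, and any P suffix all fall outside it.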

I split each round outcome column into a round outcome column and a round opponent column. This way I could easily match each round opponent to their pre-tournament rating.

#Create new columns and populate values
for (col in names(tournament_df)){
  if (grepl("Outcome",col)){
    #Opponent column
    tournament_df[,c(gsub("Outcome", "Opponent", col))] <-
      unlist(lapply(strsplit(as.character(tournament_df[,c(col)]), "  "), '[', 2))
    #Outcome column
    tournament_df[,c(col)] <- gsub("([A-Z]).*","\\1",tournament_df[,c(col)])}}

#Clean up white spaces
for (i in names(tournament_df)) {
  #trimws() strips leading and trailing whitespace by default
  tournament_df[[i]] <- trimws(tournament_df[[i]])}
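A compact way to see the outcome and opponent split is to run it on a few raw cells. In this sketch the first two values follow the format shown in the listing above. The third, a code with no opponent number written here as H, is an assumption about how an unplayed round might appear.

```r
outcome_raw <- c("W  39", "D   4", "H    ")

# First character is the result code
result <- substr(outcome_raw, 1, 1)

# Any digits that follow are the opponent's pair number; none means no game
opponent <- suppressWarnings(as.integer(sub("^[A-Z]\\s*", "", outcome_raw)))

result
opponent
```

Converting the opponent to an integer at this point also leaves NA for rounds without an opponent, which the later averaging step can skip.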

I created the opponent pre-tournament rating columns by matching each round opponent’s pair number to the Pair Num column of the tournament data frame, taking the value in the PreRating column, and writing it to the corresponding row and column.

for (i in seq(nrow(tournament_df))){
  for (col in names(tournament_df)){
    if (grepl("Opponent",col)){
        tournament_df[i,c(gsub("Opponent","Opp PreRtg",col))] <-
          ifelse(
            is.na(tournament_df[i,col]), NA,
            tournament_df$PreRating[tournament_df$`Pair Num` == tournament_df[i,col]])}}}
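The same lookup can be expressed with match(), which avoids the row-by-row loop. This sketch uses the first three pre-ratings from this report, but the pairings in the opponents data frame are invented for illustration, as are the names players and opp_ratings.

```r
players <- data.frame(
  pair_num   = c(1, 2, 3),
  pre_rating = c(1794, 1553, 1384)
)

# Hypothetical opponent pair numbers for two rounds; NA marks a round with no game
opponents <- data.frame(round1 = c(2, 1, NA), round2 = c(3, 3, 1))

# match() returns the row position of each opponent; indexing by NA gives NA
opp_ratings <- as.data.frame(
  lapply(opponents, function(opp) players$pre_rating[match(opp, players$pair_num)])
)
opp_ratings
```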

Finally, I reorganized the columns for readability. The result is still a raw data frame: it is too wide to read comfortably, as the example below shows.

Data dump of chess tournament (example first 5 rows)
Pair Num	State	Player Name	USCFID	PreRating	PostRating	Total Pts	N Count	Round 1 Opponent	Round 1 Opp PreRtg	Round 1 Outcome	Round 1 Pieces	Round 2 Opponent	Round 2 Opp PreRtg	Round 2 Outcome	Round 2 Pieces	Round 3 Opponent	Round 3 Opp PreRtg	Round 3 Outcome	Round 3 Pieces	Round 4 Opponent	Round 4 Opp PreRtg	Round 4 Outcome	Round 4 Pieces	Round 5 Opponent	Round 5 Opp PreRtg	Round 5 Outcome	Round 5 Pieces	Round 6 Opponent	Round 6 Opp PreRtg	Round 6 Outcome	Round 6 Pieces	Round 7 Opponent	Round 7 Opp PreRtg	Round 7 Outcome	Round 7 Pieces
1	ON	GARY HUA	15445895	1794	1817	6.0	N:2	39	1436	W	White	21	1563	W	Black	18	1600	W	White	14	1610	W	Black	7	1649	W	White	12	1663	D	Black	4	1716	D	White
2	MI	DAKSHESH DARURI	14598900	1553	1663	6.0	N:2	63	1175	W	Black	58	917	W	White	4	1716	L	Black	17	1629	W	White	16	1604	W	Black	20	1595	W	White	7	1649	W	Black
3	MI	ADITYA BAJAJ	14959604	1384	1640	6.0	N:2	8	1641	L	White	61	955	W	Black	25	1745	W	White	21	1563	W	Black	11	1712	W	White	13	1666	W	Black	12	1663	W	White
4	MI	PATRICK H SCHILLING	12616049	1716	1744	5.5	N:2	23	1363	W	White	28	1507	D	Black	2	1553	W	White	26	1579	W	Black	5	1655	D	White	19	1564	W	Black	1	1794	D	Black
5	MI	HANSHI ZUO	14601533	1655	1690	5.5	N:2	45	1242	W	Black	37	980	W	White	12	1663	D	Black	13	1666	D	White	4	1716	D	Black	14	1610	W	White	17	1629	W	Black

Creating the CSV file

For this project we were asked to generate a .CSV file (that could for example be imported into a SQL database) with the following information for all of the players:

Player’s Name
Player’s State
Total Number of Points
Player’s Pre-Rating
Average Pre Chess Rating of Opponents

The first four columns of the CSV file were completed in the previous section. I used this section to create the final column, Average Pre Chess Rating of Opponents.

I used a for loop to convert the opponent pre-rating columns to numeric so that they could be summed in the next step.

for (col in names(tournament_df)){
  if (grepl("PreRtg",col)){
    tournament_df[,c(col)] <- as.numeric(tournament_df[,c(col)])}}

I summed each player’s opponent pre-ratings and counted the number of games played. I did this with the rowSums function, ignoring NA values because some participants did not play all 7 rounds.

#Creating pre rating sum
tournament_df$OppPreRtgSum <-
  rowSums(tournament_df[,c("Round 1 Opp PreRtg", "Round 2 Opp PreRtg", "Round 3 Opp PreRtg",
                           "Round 4 Opp PreRtg", "Round 5 Opp PreRtg", "Round 6 Opp PreRtg",
                           "Round 7 Opp PreRtg")], na.rm = TRUE)

#Creating games played
tournament_df$GamesPlayed <-
  rowSums(!is.na(tournament_df[,c("Round 1 Opp PreRtg", "Round 2 Opp PreRtg", "Round 3 Opp PreRtg",
                                  "Round 4 Opp PreRtg", "Round 5 Opp PreRtg", "Round 6 Opp PreRtg",
                                  "Round 7 Opp PreRtg")]))

I divided the total sum of pre tournament ratings by the number of games played to get the average opponent rating for each tournament participant.

tournament_df <- tournament_df %>%
  mutate(AvgOppRtg = OppPreRtgSum/GamesPlayed)
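As a worked check of the arithmetic, the sketch below averages Gary Hua's seven opponent pre-ratings from the table above, alongside a made-up second row with two unplayed rounds. It also shows that rowMeans() with na.rm = TRUE gives the same result as dividing the sum by the count. The names opp, sum_rtg, and games exist only in this example.

```r
# Row 1: Gary Hua's opponent pre-ratings; row 2: hypothetical, two rounds unplayed
opp <- rbind(
  c(1436, 1563, 1600, 1610, 1649, 1663, 1716),
  c(1500, 1400, NA, 1300, NA, 1200, 1100)
)

sum_rtg <- rowSums(opp, na.rm = TRUE)   # total of the ratings present
games   <- rowSums(!is.na(opp))         # number of rounds actually played

avg_by_division <- sum_rtg / games
avg_by_rowmeans <- rowMeans(opp, na.rm = TRUE)

round(avg_by_division, 3)
```

For the first row the sum is 11237 over 7 games, which rounds to 1605.286.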

I created the export table with the five requested columns, as seen below:

Final Chess Tournament Table
Player’s Name	Player’s State	Total Number of Points	Player’s Pre-Rating	Average Pre Chess Rating of Opponents
GARY HUA	ON	6.0	1794	1605.286
DAKSHESH DARURI	MI	6.0	1553	1469.286
ADITYA BAJAJ	MI	6.0	1384	1563.571
PATRICK H SCHILLING	MI	5.5	1716	1573.571
HANSHI ZUO	MI	5.5	1655	1500.857
HANSEN SONG	OH	5.0	1686	1518.714
GARY DEE SWATHELL	MI	5.0	1649	1372.143
EZEKIEL HOUGHTON	MI	5.0	1641	1468.429
STEFANO LEE	ON	5.0	1411	1523.143
ANVIT RAO	MI	5.0	1365	1554.143
CAMERON WILLIAM MC LEMAN	MI	4.5	1712	1467.571
KENNETH J TACK	MI	4.5	1663	1506.167
TORRANCE HENRY JR	MI	4.5	1666	1497.857
BRADLEY SHAW	MI	4.5	1610	1515.000
ZACHARY JAMES HOUGHTON	MI	4.5	1220	1483.857
MIKE NIKITIN	MI	4.0	1604	1385.800
RONALD GRZEGORCZYK	MI	4.0	1629	1498.571
DAVID SUNDEEN	MI	4.0	1600	1480.000
DIPANKAR ROY	MI	4.0	1564	1426.286
JASON ZHENG	MI	4.0	1595	1410.857
DINH DANG BUI	ON	4.0	1563	1470.429
EUGENE L MCCLURE	MI	4.0	1555	1300.333
ALAN BUI	ON	4.0	1363	1213.857
MICHAEL R ALDRICH	MI	4.0	1229	1357.000
LOREN SCHWIEBERT	MI	3.5	1745	1363.286
MAX ZHU	ON	3.5	1579	1506.857
GAURAV GIDWANI	MI	3.5	1552	1221.667
SOFIA ADINA STANESCU-BELLU	MI	3.5	1507	1522.143
CHIEDOZIE OKORIE	MI	3.5	1602	1313.500
GEORGE AVERY JONES	ON	3.5	1522	1144.143
RISHI SHETTY	MI	3.5	1494	1259.857
JOSHUA PHILIP MATHEWS	ON	3.5	1441	1378.714
JADE GE	MI	3.5	1449	1276.857
MICHAEL JEFFERY THOMAS	MI	3.5	1399	1375.286
JOSHUA DAVID LEE	MI	3.5	1438	1149.714
SIDDHARTH JHA	MI	3.5	1355	1388.167
AMIYATOSH PWNANANDAM	MI	3.5	980	1384.800
BRIAN LIU	MI	3.0	1423	1539.167
JOEL R HENDON	MI	3.0	1436	1429.571
FOREST ZHANG	MI	3.0	1348	1390.571
KYLE WILLIAM MURPHY	MI	3.0	1403	1248.500
JARED GE	MI	3.0	1332	1149.857
ROBERT GLEN VASEY	MI	3.0	1283	1106.571
JUSTIN D SCHILLING	MI	3.0	1199	1327.000
DEREK YAN	MI	3.0	1242	1152.000
JACOB ALEXANDER LAVALLEY	MI	3.0	377	1357.714
ERIC WRIGHT	MI	2.5	1362	1392.000
DANIEL KHAIN	MI	2.5	1382	1355.800
MICHAEL J MARTIN	MI	2.5	1291	1285.800
SHIVAM JHA	MI	2.5	1056	1296.000
TEJAS AYYAGARI	MI	2.5	1011	1356.143
ETHAN GUO	MI	2.5	935	1494.571
JOSE C YBARRA	MI	2.0	1393	1345.333
LARRY HODGE	MI	2.0	1270	1206.167
ALEX KONG	MI	2.0	1186	1406.000
MARISA RICCI	MI	2.0	1153	1414.400
MICHAEL LU	MI	2.0	1092	1363.000
VIRAJ MOHILE	MI	2.0	917	1391.000
SEAN M MC CORMICK	MI	2.0	853	1319.000
JULIA SHEN	MI	1.5	967	1330.200
JEZZEL FARKAS	ON	1.5	955	1327.286
ASHWIN BALAJI	MI	1.0	1530	1186.000
THOMAS JOSEPH HOSMER	MI	1.0	1175	1350.200
BEN LI	MI	1.0	1163	1263.000
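The report does not show the export call itself. One possible way to write such a table to disk is sketched here with base R's write.csv(). The export_df data frame is a two-row stand-in built from the table above, and its column names and the temporary output path are assumptions made for the example.

```r
# Hypothetical two-row stand-in for the final table
export_df <- data.frame(
  Name      = c("GARY HUA", "DAKSHESH DARURI"),
  State     = c("ON", "MI"),
  TotalPts  = c(6.0, 6.0),
  PreRating = c(1794, 1553),
  AvgOppRtg = c(1605.286, 1469.286)
)

# Write to a temporary file; row.names = FALSE keeps the row index out of the CSV
out_path <- tempfile(fileext = ".csv")
write.csv(export_df, out_path, row.names = FALSE)

# Read it back to confirm the round trip
check <- read.csv(out_path, stringsAsFactors = FALSE)
check
```

Plain column names without spaces or apostrophes tend to import more smoothly into a SQL database than the display headers used in the table.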

Project 1 - Chess Tournament Data

Stefano Biguzzi

9/9/2020
