In this project, you’re given a text file with chess tournament results where the information has some structure. Your job is to create an R Markdown file that generates a .CSV file (that could for example be imported into a SQL database) with the following information for all of the players:
Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents
For the first player, the information would be:
Gary Hua, ON, 6.0, 1794, 1605
1605 was calculated by using the pre-tournament opponents’ ratings of 1436, 1563, 1600, 1610, 1649, 1663, 1716, and dividing by the total number of games played.
I used read.table so that I could separate data on the pipe “|” symbol. I tried using the read_delim function from the readr package, but it gave me errors because of the separator rows.
theUrl <- "https://raw.githubusercontent.com/betsyrosalen/DATA_607_Data_Acquisition_and_Management/master/Project1/tournamentinfo.txt"
rawChessData <- read.table(file = theUrl, header = FALSE, skip = 1, sep = "|", fill = TRUE, stringsAsFactors = FALSE)
str(rawChessData)
## 'data.frame': 195 obs. of 11 variables:
## $ V1 : chr " Pair " " Num " "-----------------------------------------------------------------------------------------" " 1 " ...
## $ V2 : chr " Player Name " " USCF ID / Rtg (Pre->Post) " "" " GARY HUA " ...
## $ V3 : chr "Total" " Pts " "" "6.0 " ...
## $ V4 : chr "Round" " 1 " "" "W 39" ...
## $ V5 : chr "Round" " 2 " "" "W 21" ...
## $ V6 : chr "Round" " 3 " "" "W 18" ...
## $ V7 : chr "Round" " 4 " "" "W 14" ...
## $ V8 : chr "Round" " 5 " "" "W 7" ...
## $ V9 : chr "Round" " 6 " "" "D 12" ...
## $ V10: chr "Round" " 7 " "" "D 4" ...
## $ V11: logi NA NA NA NA NA NA ...
The original file has two rows for each player plus a separator row. I created two data frames, one for each row of data and then combined them into one consolidated data.frame automatically removing the separator rows in the process.
chessData1 <- rawChessData[seq(1 , 195, 3), ]
head(chessData1, 5)
## V1 V2 V3 V4 V5 V6 V7
## 1 Pair Player Name Total Round Round Round Round
## 4 1 GARY HUA 6.0 W 39 W 21 W 18 W 14
## 7 2 DAKSHESH DARURI 6.0 W 63 W 58 L 4 W 17
## 10 3 ADITYA BAJAJ 6.0 L 8 W 61 W 25 W 21
## 13 4 PATRICK H SCHILLING 5.5 W 23 D 28 W 2 W 26
## V8 V9 V10 V11
## 1 Round Round Round NA
## 4 W 7 D 12 D 4 NA
## 7 W 16 W 20 W 7 NA
## 10 W 11 W 13 W 12 NA
## 13 D 5 W 19 D 1 NA
chessData2 <- rawChessData[seq(2 , 195, 3), ]
head(chessData2, 5)
## V1 V2 V3 V4 V5 V6 V7
## 2 Num USCF ID / Rtg (Pre->Post) Pts 1 2 3 4
## 5 ON 15445895 / R: 1794 ->1817 N:2 W B W B
## 8 MI 14598900 / R: 1553 ->1663 N:2 B W B W
## 11 MI 14959604 / R: 1384 ->1640 N:2 W B W B
## 14 MI 12616049 / R: 1716 ->1744 N:2 W B W B
## V8 V9 V10 V11
## 2 5 6 7 NA
## 5 W B W NA
## 8 B W B NA
## 11 W B W NA
## 14 W B B NA
chessData <- cbind(chessData1[-1,1:10], chessData2[-1,1:10])
names(chessData) <- cbind(chessData1[1,1:10], chessData2[1,1:10])
names(chessData)[c(1, 3:10, 11)] <- c('PairNum', 'TotalPts', paste('Round', 1:7, sep=""), 'State')
rownames(chessData) <- NULL
head(chessData, 5)
## PairNum Player Name TotalPts Round1 Round2 Round3
## 1 1 GARY HUA 6.0 W 39 W 21 W 18
## 2 2 DAKSHESH DARURI 6.0 W 63 W 58 L 4
## 3 3 ADITYA BAJAJ 6.0 L 8 W 61 W 25
## 4 4 PATRICK H SCHILLING 5.5 W 23 D 28 W 2
## 5 5 HANSHI ZUO 5.5 W 45 W 37 D 12
## Round4 Round5 Round6 Round7 State USCF ID / Rtg (Pre->Post)
## 1 W 14 W 7 D 12 D 4 ON 15445895 / R: 1794 ->1817
## 2 W 17 W 16 W 20 W 7 MI 14598900 / R: 1553 ->1663
## 3 W 21 W 11 W 13 W 12 MI 14959604 / R: 1384 ->1640
## 4 W 26 D 5 W 19 D 1 MI 12616049 / R: 1716 ->1744
## 5 D 13 D 4 W 14 W 17 MI 14601533 / R: 1655 ->1690
## Pts 1 2 3 4 5 6 7
## 1 N:2 W B W B W B W
## 2 N:2 B W B W B W B
## 3 N:2 W B W B W B W
## 4 N:2 W B W B W B B
## 5 N:2 B W B W B W B
The original file had multiple variables of data in one column, so I used regular expressions to separate out and create new columns for USCFID, and both ratings.
library(stringr)
USCFID <- str_extract_all(chessData[12], "\\d{8}")
chessData$USCFID <- unlist(USCFID)
RtgPre <- str_extract_all(chessData[12], "R: +\\d{3,4}")
RtgPre <- str_extract_all(RtgPre, "\\d{3,4}")
chessData$RtgPre <- as.integer(unlist(RtgPre))
RtgPost <- str_extract_all(chessData[12], "-> *\\d{3,4}")
RtgPost <- str_extract_all(RtgPost, "\\d{3,4}")
chessData$RtgPost <- as.integer(unlist(RtgPost))
head(chessData[c(2,21:23)], 5)
## Player Name USCFID RtgPre RtgPost
## 1 GARY HUA 15445895 1794 1817
## 2 DAKSHESH DARURI 14598900 1553 1663
## 3 ADITYA BAJAJ 14959604 1384 1640
## 4 PATRICK H SCHILLING 12616049 1716 1744
## 5 HANSHI ZUO 14601533 1655 1690
I had a lot of false starts with this section before I figured out how to do this. I created a separate matrix with the round columns and used a regular expression to pull out just the opponents ID numbers, then used that in a loop to pull the oppenents ratings and calculate the average and add it as a new column to my original data.frame.
opponents <- t(chessData[4:10])
opponents <- as.integer(str_replace(opponents,"\\w*\\s*(\\d{1,})","\\1"))
opponents <- matrix(opponents, nrow=64)
x <- 1
for (i in 1:nrow(chessData)){
y = as.integer(opponents[x:(x+6)])
Average <- round(mean(chessData$RtgPre[y], na.rm = TRUE))
chessData$avgOppRating[i] <- Average
x <- x+7
}
head(chessData[,c(1,2,24)])
## PairNum Player Name avgOppRating
## 1 1 GARY HUA 1605
## 2 2 DAKSHESH DARURI 1469
## 3 3 ADITYA BAJAJ 1564
## 4 4 PATRICK H SCHILLING 1574
## 5 5 HANSHI ZUO 1501
## 6 6 HANSEN SONG 1519
Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents
finalChessData <- chessData[,c(2,11,3,22,24)]
head(finalChessData)
## Player Name State TotalPts RtgPre avgOppRating
## 1 GARY HUA ON 6.0 1794 1605
## 2 DAKSHESH DARURI MI 6.0 1553 1469
## 3 ADITYA BAJAJ MI 6.0 1384 1564
## 4 PATRICK H SCHILLING MI 5.5 1716 1574
## 5 HANSHI ZUO MI 5.5 1655 1501
## 6 HANSEN SONG OH 5.0 1686 1519
write.csv(finalChessData, file = "chess.csv", row.names = FALSE)
I struggled with a few sections of this poject, but learned a lot along the way. I have a feeling if I went back and tried my original ideas again, probably I could get them to work now after all the troubleshooting I did trying different solutions. I find that it’s hard to troublshoot in R. Error messages are not always as descritive as I am used to from working with Python. Often I find that my problems are with earlier steps and data that is of the wrong type, rather than errors in the code that gives me the error. I need to figure out a better way of finding my mistakes and making sure data is the right type.