CUNY SPS DATA 607 Project 1

Assignment
Import the data.
Combine rows of data that belong to the same observation
Create new columns for USCFID, and both ratings
Calculate Average Pre Chess Rating of Opponents
Create new data.frame with just the required columns of data
Export to CSV!
Final Thoughts

Assignment

In this project, you’re given a text file with chess tournament results where the information has some structure. Your job is to create an R Markdown file that generates a .CSV file (that could for example be imported into a SQL database) with the following information for all of the players:

Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents

For the first player, the information would be:

Gary Hua, ON, 6.0, 1794, 1605

1605 was calculated by using the pre-tournament opponents’ ratings of 1436, 1563, 1600, 1610, 1649, 1663, 1716, and dividing by the total number of games played.

Import the data.

I used read.table so that I could separate data on the pipe “|” symbol. I tried using the read_delim function from the readr package, but it gave me errors because of the separator rows.

theUrl <- "https://raw.githubusercontent.com/betsyrosalen/DATA_607_Data_Acquisition_and_Management/master/Project1/tournamentinfo.txt"
rawChessData <- read.table(file = theUrl, header = FALSE, skip = 1, sep = "|", fill = TRUE, stringsAsFactors = FALSE)
str(rawChessData)

## 'data.frame':    195 obs. of  11 variables:
##  $ V1 : chr  " Pair " " Num  " "-----------------------------------------------------------------------------------------" "    1 " ...
##  $ V2 : chr  " Player Name                     " " USCF ID / Rtg (Pre->Post)       " "" " GARY HUA                        " ...
##  $ V3 : chr  "Total" " Pts " "" "6.0  " ...
##  $ V4 : chr  "Round" "  1  " "" "W  39" ...
##  $ V5 : chr  "Round" "  2  " "" "W  21" ...
##  $ V6 : chr  "Round" "  3  " "" "W  18" ...
##  $ V7 : chr  "Round" "  4  " "" "W  14" ...
##  $ V8 : chr  "Round" "  5  " "" "W   7" ...
##  $ V9 : chr  "Round" "  6  " "" "D  12" ...
##  $ V10: chr  "Round" "  7  " "" "D   4" ...
##  $ V11: logi  NA NA NA NA NA NA ...

Combine rows of data that belong to the same observation

The original file has two rows for each player plus a separator row. I created two data frames, one for each row of data and then combined them into one consolidated data.frame automatically removing the separator rows in the process.

Data frame with row one data

chessData1 <- rawChessData[seq(1 , 195, 3), ]
head(chessData1, 5)

##        V1                                V2    V3    V4    V5    V6    V7
## 1   Pair   Player Name                      Total Round Round Round Round
## 4      1   GARY HUA                         6.0   W  39 W  21 W  18 W  14
## 7      2   DAKSHESH DARURI                  6.0   W  63 W  58 L   4 W  17
## 10     3   ADITYA BAJAJ                     6.0   L   8 W  61 W  25 W  21
## 13     4   PATRICK H SCHILLING              5.5   W  23 D  28 W   2 W  26
##       V8    V9   V10 V11
## 1  Round Round Round  NA
## 4  W   7 D  12 D   4  NA
## 7  W  16 W  20 W   7  NA
## 10 W  11 W  13 W  12  NA
## 13 D   5 W  19 D   1  NA

Data frame with row two data

chessData2 <- rawChessData[seq(2 , 195, 3), ]
head(chessData2, 5)

##        V1                                V2    V3    V4    V5    V6    V7
## 2   Num    USCF ID / Rtg (Pre->Post)         Pts    1     2     3     4  
## 5     ON   15445895 / R: 1794   ->1817      N:2   W     B     W     B    
## 8     MI   14598900 / R: 1553   ->1663      N:2   B     W     B     W    
## 11    MI   14959604 / R: 1384   ->1640      N:2   W     B     W     B    
## 14    MI   12616049 / R: 1716   ->1744      N:2   W     B     W     B    
##       V8    V9   V10 V11
## 2    5     6     7    NA
## 5  W     B     W      NA
## 8  B     W     B      NA
## 11 W     B     W      NA
## 14 W     B     B      NA

Combined data.frame with one row per player, no separator rows, and column names added.

chessData <- cbind(chessData1[-1,1:10], chessData2[-1,1:10])
names(chessData) <- cbind(chessData1[1,1:10], chessData2[1,1:10])
names(chessData)[c(1, 3:10, 11)] <- c('PairNum', 'TotalPts', paste('Round', 1:7, sep=""), 'State')
rownames(chessData) <- NULL
head(chessData, 5)

##   PairNum  Player Name                      TotalPts Round1 Round2 Round3
## 1      1   GARY HUA                            6.0    W  39  W  21  W  18
## 2      2   DAKSHESH DARURI                     6.0    W  63  W  58  L   4
## 3      3   ADITYA BAJAJ                        6.0    L   8  W  61  W  25
## 4      4   PATRICK H SCHILLING                 5.5    W  23  D  28  W   2
## 5      5   HANSHI ZUO                          5.5    W  45  W  37  D  12
##   Round4 Round5 Round6 Round7  State  USCF ID / Rtg (Pre->Post)       
## 1  W  14  W   7  D  12  D   4    ON   15445895 / R: 1794   ->1817     
## 2  W  17  W  16  W  20  W   7    MI   14598900 / R: 1553   ->1663     
## 3  W  21  W  11  W  13  W  12    MI   14959604 / R: 1384   ->1640     
## 4  W  26  D   5  W  19  D   1    MI   12616049 / R: 1716   ->1744     
## 5  D  13  D   4  W  14  W  17    MI   14601533 / R: 1655   ->1690     
##    Pts    1     2     3     4     5     6     7  
## 1 N:2   W     B     W     B     W     B     W    
## 2 N:2   B     W     B     W     B     W     B    
## 3 N:2   W     B     W     B     W     B     W    
## 4 N:2   W     B     W     B     W     B     B    
## 5 N:2   B     W     B     W     B     W     B

Create new columns for USCFID, and both ratings

The original file had multiple variables of data in one column, so I used regular expressions to separate out and create new columns for USCFID, and both ratings.

library(stringr)
USCFID <- str_extract_all(chessData[12], "\\d{8}")
chessData$USCFID <- unlist(USCFID)
RtgPre <- str_extract_all(chessData[12], "R: +\\d{3,4}")  
RtgPre <- str_extract_all(RtgPre, "\\d{3,4}")
chessData$RtgPre <- as.integer(unlist(RtgPre))
RtgPost <- str_extract_all(chessData[12], "-> *\\d{3,4}")
RtgPost <- str_extract_all(RtgPost, "\\d{3,4}")
chessData$RtgPost <- as.integer(unlist(RtgPost))
head(chessData[c(2,21:23)], 5)

##    Player Name                        USCFID RtgPre RtgPost
## 1  GARY HUA                         15445895   1794    1817
## 2  DAKSHESH DARURI                  14598900   1553    1663
## 3  ADITYA BAJAJ                     14959604   1384    1640
## 4  PATRICK H SCHILLING              12616049   1716    1744
## 5  HANSHI ZUO                       14601533   1655    1690

Calculate Average Pre Chess Rating of Opponents

I had a lot of false starts with this section before I figured out how to do this. I created a separate matrix with the round columns and used a regular expression to pull out just the opponents ID numbers, then used that in a loop to pull the oppenents ratings and calculate the average and add it as a new column to my original data.frame.

opponents <- t(chessData[4:10])
opponents <- as.integer(str_replace(opponents,"\\w*\\s*(\\d{1,})","\\1"))
opponents <- matrix(opponents, nrow=64)

x <- 1
for (i in 1:nrow(chessData)){
  y = as.integer(opponents[x:(x+6)])
  Average <- round(mean(chessData$RtgPre[y], na.rm = TRUE))
  chessData$avgOppRating[i] <- Average
  x <- x+7
}
head(chessData[,c(1,2,24)])

##   PairNum  Player Name                      avgOppRating
## 1      1   GARY HUA                                 1605
## 2      2   DAKSHESH DARURI                          1469
## 3      3   ADITYA BAJAJ                             1564
## 4      4   PATRICK H SCHILLING                      1574
## 5      5   HANSHI ZUO                               1501
## 6      6   HANSEN SONG                              1519

Create new data.frame with just the required columns of data

Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents

finalChessData <- chessData[,c(2,11,3,22,24)]
head(finalChessData)

##    Player Name                       State TotalPts RtgPre avgOppRating
## 1  GARY HUA                            ON     6.0     1794         1605
## 2  DAKSHESH DARURI                     MI     6.0     1553         1469
## 3  ADITYA BAJAJ                        MI     6.0     1384         1564
## 4  PATRICK H SCHILLING                 MI     5.5     1716         1574
## 5  HANSHI ZUO                          MI     5.5     1655         1501
## 6  HANSEN SONG                         OH     5.0     1686         1519

Export to CSV!

write.csv(finalChessData, file = "chess.csv", row.names = FALSE)

Final Thoughts

I struggled with a few sections of this poject, but learned a lot along the way. I have a feeling if I went back and tried my original ideas again, probably I could get them to work now after all the troubleshooting I did trying different solutions. I find that it’s hard to troublshoot in R. Error messages are not always as descritive as I am used to from working with Python. Often I find that my problems are with earlier steps and data that is of the wrong type, rather than errors in the code that gives me the error. I need to figure out a better way of finding my mistakes and making sure data is the right type.

CUNY SPS DATA 607 Project 1

Betsy Rosalen

February 24, 2018

Assignment

Import the data.

Combine rows of data that belong to the same observation

Data frame with row one data

Data frame with row two data

Combined data.frame with one row per player, no separator rows, and column names added.

Create new columns for USCFID, and both ratings

Calculate Average Pre Chess Rating of Opponents

Create new data.frame with just the required columns of data

Export to CSV!

Final Thoughts