Assignment Requirements

Project 1

In this project, you’re given a text file with chess tournament results where the information has some structure.

Your job is to create an R Markdown file that generates a .CSV file (that could for example be imported into a SQL database) with the following information for all of the players: + Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents

For the first player, the information would be: Gary Hua, ON, 6.0, 1794, 1605

1605 was calculated by using the pre-tournament opponents’ ratings of 1436, 1563, 1600, 1610, 1649, 1663, 1716, and dividing by the total number of games played.

If you have questions about the meaning of the data or the results, please post them on the discussion forum.

Data science, like chess, is a game of back and forth…

The chess rating system (invented by a Minnesota statistician named Arpad Elo) has been used in many other contexts, including assessing relative strength of employment candidates by human resource departments.

You may substitute another text file (or set of text files, or data scraped from web pages) of similar or greater complexity, and create your own assignment and solution. You may work in a small team. All of your code should be in an R markdown file (and published to rpubs.com); with your data accessible for the person running.

Import Data

Data is imported from the text file path leading to tournamentinfo.txt from my github Project 1 folder. Function read.delim from the utils package is used.

# Store Github url to variable
txtfile= 
"https://raw.githubusercontent.com/gcampos100/DATA607Spring2021/main/Projects/Project%201/
tournamentinfo.txt"

The text file is converted into a data.frame separated by the delimiter | and not claiming a header, resulting in 11 columns vs. the actual 10 in the file.

my_data <- as.data.frame(read.delim(txtfile,header = FALSE,stringsAsFactors = FALSE, sep = "|"))

Note1

##                                                                                          V1
## 1 -----------------------------------------------------------------------------------------
## 2                                                                                     Pair 
## 3                                                                                     Num  
## 4 -----------------------------------------------------------------------------------------
##                                  V2    V3    V4    V5    V6    V7    V8    V9
## 1                                                                            
## 2  Player Name                      Total Round Round Round Round Round Round
## 3  USCF ID / Rtg (Pre->Post)         Pts    1     2     3     4     5     6  
## 4

Cleaning Imported Data

Remove --- & NULL

The data frame my_data is a 2 dimensional data frame containing 11 columns and 195 rows. Rows composed completely of --- and the additional all NULL column created on import needs to be removed.

#Find rows to delete, multiples of 3
toDelete    <- seq(1, length(my_data$V1), 3)
#remove rows
my_data     <- my_data[-toDelete ,]
# remove column with NULL Values
my_data[11] <-NULL

Note2

##       V1                                V2    V3    V4    V5    V6    V7    V8
## 2  Pair   Player Name                      Total Round Round Round Round Round
## 3  Num    USCF ID / Rtg (Pre->Post)         Pts    1     2     3     4     5  
## 5     1   GARY HUA                         6.0   W  39 W  21 W  18 W  14 W   7

Subset header and body

The header is composed of 2 rows that require merging. They must also be separated by column. Separating while keep most data intact requires subsetting. The first two rows are subset individually and the body of the data set is retained on my_data.

# Subset column names and remainder of vectors
# Subset row 1 of my data
my_data_names <- my_data[1,]
# Subset row 2 of my data
my_data_names_2 <- my_data[2,]
# Subset remainder of my data
my_data<-my_data[3:NROW(my_data),]

Header subsets

Using str_replace in combination with gsub() and regex expressions, unnecessary information from the rows are removed. The rows are then combined to form a single header. The end result will be the exact or near exact column names needed for this dataset.

# Merge both vectors and seperate by `,`
my_data_names <- 
  paste(my_data_names,my_data_names_2,collapse = ",")
# Replace any character issues
my_data_names <- 
  str_replace(my_data_names, "/",",")
my_data_names <- 
  str_replace(my_data_names, "->"," , ")
my_data_names <- 
  str_replace(my_data_names, "\\(","")
my_data_names <- 
  str_replace(my_data_names, "Post\\)","Rtg Post")
# Remove excessive spacing
my_data_names <-
  str_replace(gsub("\\s\\s", " ", str_trim(my_data_names)), "B", "b")
# Add comma to attribute `Player Name`
my_data_names <-
  str_replace(gsub("Player Name", "Player Name,",my_data_names),"B", "b")
my_data_names<-
  unlist(strsplit(my_data_names, ","))
##  [1] "Pair  Num "           " Player Name"         "            USCF ID "
##  [4] " Rtg Pre "            " Rtg Post    "        "Total Pts "          
##  [7] "Round  1 "            "Round  2 "            "Round  3 "           
## [10] "Round  4 "            "Round  5 "            "Round  6 "           
## [13] "Round  7"

Body subset

The relevant data in the body of the data set, is split between 2 rows, similar to the header. In order to consolidate, the body is broken into 2 halves and ultimately merged. A sequence is used to store the index for odd rows and even rows in the data frame. The index is used to create subsets my_data_top and my_data_bottom.

# Subset of top using index
top    <- seq(1, length(my_data$V1), 2)
my_data_top         <- my_data[top ,]
# Subset of bottom using index
bottom     <- seq(2, length(my_data$V1), 2)
my_data_bottom          <- my_data[bottom ,]

Note3

## [1] "TOP"
##                                  V2    V3    V4    V5    V6    V7    V8    V9
## 5  GARY HUA                         6.0   W  39 W  21 W  18 W  14 W   7 D  12
## 8  DAKSHESH DARURI                  6.0   W  63 W  58 L   4 W  17 W  16 W  20
## [1] "BOTTOM"
##                                  V2    V3    V4    V5    V6    V7    V8    V9
## 6  15445895 / R: 1794   ->1817      N:2   W     B     W     B     W     B    
## 9  14598900 / R: 1553   ->1663      N:2   B     W     B     W     B     W

Cleanup & consolidation

In order to merge the two subsets data cleanly as well as easy manipulation of the data sets’, content the following was done:

  • Both subsets were converted into lists
  • Using regex, unnecessary content from my_data_bottom is removed
  • Rows are merged, consolidating columns with the same index for each subset
  • Additional cleanup using regex is conducted, including the adding of pipe in column one |, for future splitting
# List conversion
my_data_top <- as.list(t(my_data_top))
my_data_bottom<- as.list(t(my_data_bottom))

Note4

# Unnecessary Data removal
my_data_bottom<-str_replace_all(gsub("W|B|b|P[1-9]|P[1-9][0-9]", " ",my_data_bottom),"B", "b")
my_data_bottom<-str_replace_all(gsub("N\\:.", " ",my_data_bottom),"B", "b")
# Insertion of `|` to use as a delimiter later on
my_data_bottom<-str_replace_all(gsub("O", "| O",my_data_bottom),"B", "b")
my_data_bottom<-str_replace_all(gsub("M", "| M",my_data_bottom),"B", "b")
# paste or merger of rows
my_data <- paste(my_data_top,my_data_bottom,collapse = ",")
# Overall cleanup of data
my_data <-str_replace(gsub("/ R:", ",",my_data),"B", "b")
my_data <-str_replace(gsub("->", ",",my_data),"B", "b")
my_data <-str_replace(gsub("\\s+", " ", str_trim(my_data)), "B", "B")

The result of the cleanup is a list of all variables, comma separated, WITH EXCEPTION of Pair Num content and, Player Name and USCF ID.

##  chr "1 | ON , GARY HUA 15445895 , 1794 ,1817 ,6.0 ,W 39 ,W 21 ,W 18 ,W 14 ,W 7 ,D 12 ,D 4 , 2 | MI , DAKSHESH DARURI"| __truncated__

Recreate Data Frame

The list is split into elements, than converted into a data frame.

my_data<-strsplit(my_data[1],",")
my_data <- data.frame(matrix(unlist(my_data), ncol=12, byrow=TRUE))

Note5

##         X1                         X2     X3    X4   X5    X6    X7    X8    X9
## 1  1 | ON          GARY HUA 15445895   1794  1817  6.0  W 39  W 21  W 18  W 14 
## 2  2 | MI   DAKSHESH DARURI 14598900   1553  1663  6.0  W 63  W 58   L 4  W 17

Specifically column X2 needs to be separated using str_split_fixed

## [1] " GARY HUA 15445895 "
## [1] " DAKSHESH DARURI 14598900 "
quick.split<-str_split_fixed(my_data$X2, "\\s[0-9]", 2)
my_data<-cbind(my_data[1],quick.split,my_data[3:12])

This will to create the 13 columns we required to match with the names in my_data_names

##        X1         1        2     X3    X4   X5    X6    X7    X8    X9  X10
## 1 1 | ON   GARY HUA 5445895   1794  1817  6.0  W 39  W 21  W 18  W 14  W 7 
##     X11  X12
## 1 D 12  D 4

Unfortunately, the State column name was not accounted for with Pair Num ’s aggregate data, hence we have to seperate the first column X1 using the below command.

Note6

my_data<-my_data %>%
  separate(X1, c(my_data_names[1], "State"), "\\|")

Then the remaining names can be added using the list my_data_names

colnames(my_data)[colnames(my_data)
            %in% c("1", "2","X3","X4","X5","X6","X7","X8","X9","X10","X11","X12")] <-
                                                                    my_data_names[2:13]
##   Pair  Num  State  Player Name             USCF ID   Rtg Pre   Rtg Post    
## 1         1    ON      GARY HUA             5445895      1794          1817 
##   Total Pts  Round  1  Round  2  Round  3  Round  4  Round  5  Round  6 
## 1       6.0      W 39      W 21      W 18      W 14       W 7      D 12 
##   Round  7
## 1     D 4

Calculating Average

Vector Average is created to store the calculated values, that will also be stored into my_data’s final column.

Average<-c(1:64)

The logic for the below for loop is as follows

  • Initial loop goes from the first row 1 to the maximum row nrow
  • Initial column for each row is 8 indexing the Round 1 column to the maximum column ncol.
  • Index i is retrieved as a numeric value
  • i is used to index the row of opponents data, with 4 indexing Rtg Pre value as a numeric
  • Value is calculated into sum initialized in the first for loop as 0, if it passes the conditional if statement of not NA using the function \(\\!is.na\).
  • The else statement, tallies number of NA values.
  • Upon completion of the inner loop, an average for that row is calculated and stored to vector Average indexing the same row that is calculated.
  • Average calculation = \(round(\frac{sum}{column_{\#of}-NA_{\#of}})\)
for(row in 1:nrow(my_data)){
    sum = 0
    columns.na=0
  for(col in 8:ncol(my_data)){
    i<-as.numeric(str_extract(my_data[row,col],"[1-9]."))
        if(!is.na(i)){
      print(i)
      sum = sum + as.numeric(my_data[i,5])
        }
      else columns.na=columns.na+1
  }
  #print(columns.na)
  #print(sum)
  #print(round((sum/((ncol(my_data)-7)-columns.na))))
  Average[row]<-round((sum/((ncol(my_data)-7)-columns.na)))
}

Once the loop ends and the Average vector is filled, Average is combined with my_data using cbind() function.

my_data<-cbind(my_data,Average)
##      Pair  Num  State   Player Name                   USCF ID   Rtg Pre 
## [1,] "1 "       " ON " " GARY HUA"        "5445895 "           " 1794 " 
## [2,] " 2 "      " MI " " DAKSHESH DARURI" "4598900 "           " 1553 " 
## [3,] " 3 "      " MI " " ADITYA BAJAJ"    "4959604 "           " 1384 " 
##       Rtg Post     Total Pts  Round  1  Round  2  Round  3  Round  4  Round  5 
## [1,] "1817 "       "6.0 "     "W 39 "   "W 21 "   "W 18 "   "W 14 "   "W 7 "   
## [2,] "1663 "       "6.0 "     "W 63 "   "W 58 "   "L 4 "    "W 17 "   "W 16 "  
## [3,] "1640 "       "6.0 "     "L 8 "    "W 61 "   "W 25 "   "W 21 "   "W 11 "  
##      Round  6  Round  7 Average
## [1,] "D 12 "   "D 4 "   "1605" 
## [2,] "W 20 "   "W 7 "   "1469" 
## [3,] "W 13 "   "W 12 "  "1564"

Convert to CSV

The requirements for Project 1’s submission is the data:

  Player’s Name, Player’s State, Total Number of Points, Player’s 
  Pre-Rating, and Average Pre Chess Rating of Opponents

\(\therefore\) my_data subset Project_1_csv with the needed columns is create, before converting to a .csv

colnames(my_data)
##  [1] "Pair  Num "           "State"                " Player Name"        
##  [4] "            USCF ID " " Rtg Pre "            " Rtg Post    "       
##  [7] "Total Pts "           "Round  1 "            "Round  2 "           
## [10] "Round  3 "            "Round  4 "            "Round  5 "           
## [13] "Round  6 "            "Round  7"             "Average"
#my_data<-as.data.frame(my_data)
Project_1_csv<-as.data.frame(my_data) %>%
  select(" Player Name",State,"Total Pts "," Rtg Pre ","Average")
head(Project_1_csv)
##            Player Name State Total Pts   Rtg Pre  Average
## 1             GARY HUA   ON        6.0      1794     1605
## 2      DAKSHESH DARURI   MI        6.0      1553     1469
## 3         ADITYA BAJAJ   MI        6.0      1384     1564
## 4  PATRICK H SCHILLING   MI        5.5      1716     1574
## 5           HANSHI ZUO   MI        5.5      1655     1501
## 6          HANSEN SONG   OH        5.0      1686     1519

The .csv file is created and stored using a local github path

cloned_git_PATH <- rstudioapi::askForPassword("Enter Path:")
write.csv(Project_1_csv,
          cloned_git_PATH)

and can be found on github.com/gcampos100/ in the DATA607 repository


  1. Column 10 was ommitted from head() for formatting:↩︎

  2. Column 9 & 10 was ommitted from head() for formatting:↩︎

  3. Columns 1 & 10 was omitted from head() for formatting:↩︎

  4. Regex was used to add | character. Although not very intuitive, the pasting of the subsets created issues separating column 1 Pair Num into Pair Num and State, hence | was added to use as a delimiter later on:↩︎

  5. Column 10-12 was ommitted from head() for formatting:↩︎

  6. This is where the added | is used as a delimeter:↩︎