Project 1

Load All Necessary Libraries

First, we need to load all the libraries used in this project. The following packages and the reason for them is shown below:

tidyr - This project utilizes the “separate” function to split columns based on a delimiter.
stringr - Various functions from this package are used to manipulate strings.
DT - This was used to display the final table into a user-friendly table.

library(tidyr)
library(stringr)
library(DT)

Load in Data from URL and Clean Up

The following R code will collect the data from a URL and place it in a data frame:

raw.url <- "https://raw.githubusercontent.com/rg563/DATA607/master/Projects/Project%201/tournamentinfo.txt"
raw.table <- read.delim(raw.url,header=FALSE,stringsAsFactors=FALSE)
raw.data <- as.data.frame(raw.table)
head(raw.data)

                                                                                          V1
1  -----------------------------------------------------------------------------------------
2  Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| 
3  Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | 
4  -----------------------------------------------------------------------------------------
5      1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|
6     ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |

When looking at the structure of this data, there are two things that immediately stand out. The first is how each section is separated by a chain of “-”. We will need to remove these rows. The second is that it appears that there are consecutive rows of data that are stacked on top of each other. We will need to combine these consecutive rows into one row, so that we have all the information on one line.

First, let’s delete all the rows with “-” using the “grepl” function. This function returns a logical vector. In my code, I use “!grepl”, which will return the opposite of “grepl”. Therefore, my function will return ‘FALSE’ if the line contains “—” and return ‘TRUE’ if it does not contain it. Then, my data frame will only contain rows that do not contain “—”:

raw.data <- as.data.frame(raw.data[!grepl("---",raw.data$V1),])
head(raw.data)

                                                      raw.data[!grepl("---", raw.data$V1), ]
1  Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| 
2  Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | 
3      1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|
4     ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |
5      2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|
6     MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |

Next, we need to combine consecutive rows into one. First, we can create two data frames from the original data frame with one containing all even rows, and the other containing all odd rows. For this, I used the “seq” function, which will return a specified sequence of rows from my data frame. Then, we can use the “paste” function to combine the data frames into one data frame:

odd.df <- raw.data[seq(1,nrow(raw.data),2),]
even.df <- raw.data[seq(2,nrow(raw.data),2),]
raw.data.combined <- data.frame(paste(odd.df,even.df))
head(raw.data.combined)

                                                                                                                                                                 paste.odd.df..even.df.
1  Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round|   Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | 
2       1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|    ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |
3       2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|    MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |
4       3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|    MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |
5       4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|    MI | 12616049 / R: 1716   ->1744     |N:2  |W    |B    |W    |B    |W    |B    |B    |
6       5 | HANSHI ZUO                      |5.5  |W  45|W  37|D  12|D  13|D   4|W  14|W  17|    MI | 14601533 / R: 1655   ->1690     |N:2  |B    |W    |B    |W    |B    |W    |B    |

Now, we can split the rows using “|” as the deliminator, select the first row as the column names, and then delete the first row:

col.vector <- as.character(1:20)
raw.data.sep <- separate(data=raw.data.combined,col=paste.odd.df..even.df.,into=col.vector,sep="\\|")
colnames(raw.data.sep) = raw.data.sep[1,]
raw.data.sep <- raw.data.sep[-1,]
head(raw.data.sep)

   Pair   Player Name                      Total Round Round Round Round
2     1   GARY HUA                         6.0   W  39 W  21 W  18 W  14
3     2   DAKSHESH DARURI                  6.0   W  63 W  58 L   4 W  17
4     3   ADITYA BAJAJ                     6.0   L   8 W  61 W  25 W  21
5     4   PATRICK H SCHILLING              5.5   W  23 D  28 W   2 W  26
6     5   HANSHI ZUO                       5.5   W  45 W  37 D  12 D  13
7     6   HANSEN SONG                      5.0   W  34 D  29 L  11 W  35
  Round Round Round    Num    USCF ID / Rtg (Pre->Post)         Pts    1  
2 W   7 D  12 D   4      ON   15445895 / R: 1794   ->1817      N:2   W    
3 W  16 W  20 W   7      MI   14598900 / R: 1553   ->1663      N:2   B    
4 W  11 W  13 W  12      MI   14959604 / R: 1384   ->1640      N:2   W    
5 D   5 W  19 D   1      MI   12616049 / R: 1716   ->1744      N:2   W    
6 D   4 W  14 W  17      MI   14601533 / R: 1655   ->1690      N:2   B    
7 D  10 W  27 W  21      OH   15055204 / R: 1686   ->1687      N:3   W    
    2     3     4     5     6     7  
2 B     W     B     W     B     W    
3 W     B     W     B     W     B    
4 B     W     B     W     B     W    
5 B     W     B     W     B     B    
6 W     B     W     B     W     B    
7 B     W     B     B     W     B

Next, we need to split up the ‘USCF ID / Rtg (Pre->Post)’ column because this contains three pieces of useful information all in one column. All of the important information are the digits containing ID, Pre-Rating, and Post-Rating. Originally, I used the “str_extract_all” function with the regular expression “\\d{1,}” to list all sequences of digits in the string. However, this returned abnormal results from some rows because the ratings contained the letter ‘P’ in the middle of the number. I noticed that the sequence of numbers after each ‘P’ were at most two digits, and the ID, Pre- and Post-Ratings were all at least 3 digits. Therefore, I ammended the regular expression to “\\d{3,}” to account for this. Then, I added an additional for loop to only extract the digits from these columns, and used the “cbind” function to add these new columns to the existing data frame:

split.column <- data.frame(as.character(str_extract_all(raw.data.sep$` USCF ID / Rtg (Pre->Post)       `,"\\d{3,}")))
split.column <- separate(data=split.column,col="as.character.str_extract_all.raw.data.sep...USCF.ID...Rtg..Pre..Post...........",into=c("ID","Pre","Post"),sep=",")
for (i in 1:nrow(split.column)) {
  for (j in 1:ncol(split.column)) {
    split.column[i,j] <- str_extract(split.column[i,j],"\\d{1,}")
  }
}
raw.data.add.col <- cbind(raw.data.sep,split.column)
head(raw.data.add.col)

   Pair   Player Name                      Total Round Round Round Round
2     1   GARY HUA                         6.0   W  39 W  21 W  18 W  14
3     2   DAKSHESH DARURI                  6.0   W  63 W  58 L   4 W  17
4     3   ADITYA BAJAJ                     6.0   L   8 W  61 W  25 W  21
5     4   PATRICK H SCHILLING              5.5   W  23 D  28 W   2 W  26
6     5   HANSHI ZUO                       5.5   W  45 W  37 D  12 D  13
7     6   HANSEN SONG                      5.0   W  34 D  29 L  11 W  35
  Round Round Round    Num    USCF ID / Rtg (Pre->Post)         Pts    1  
2 W   7 D  12 D   4      ON   15445895 / R: 1794   ->1817      N:2   W    
3 W  16 W  20 W   7      MI   14598900 / R: 1553   ->1663      N:2   B    
4 W  11 W  13 W  12      MI   14959604 / R: 1384   ->1640      N:2   W    
5 D   5 W  19 D   1      MI   12616049 / R: 1716   ->1744      N:2   W    
6 D   4 W  14 W  17      MI   14601533 / R: 1655   ->1690      N:2   B    
7 D  10 W  27 W  21      OH   15055204 / R: 1686   ->1687      N:3   W    
    2     3     4     5     6     7         ID  Pre Post
2 B     W     B     W     B     W     15445895 1794 1817
3 W     B     W     B     W     B     14598900 1553 1663
4 B     W     B     W     B     W     14959604 1384 1640
5 B     W     B     W     B     B     12616049 1716 1744
6 W     B     W     B     W     B     14601533 1655 1690
7 B     W     B     B     W     B     15055204 1686 1687

Next, we need to clean up the “Round” columns since they contain the result and the pair number of the person they played, and we only care about the opponents they played. We can use “str_extract” again to accomplish this.

for (i in 1:nrow(raw.data.add.col)) {
  for (j in 4:10) {
    raw.data.add.col[i,j] <- str_extract(raw.data.add.col[i,j],"\\d{1,}")
  }
}
head(raw.data.add.col)

   Pair   Player Name                      Total Round Round Round Round
2     1   GARY HUA                         6.0      39    21    18    14
3     2   DAKSHESH DARURI                  6.0      63    58     4    17
4     3   ADITYA BAJAJ                     6.0       8    61    25    21
5     4   PATRICK H SCHILLING              5.5      23    28     2    26
6     5   HANSHI ZUO                       5.5      45    37    12    13
7     6   HANSEN SONG                      5.0      34    29    11    35
  Round Round Round    Num    USCF ID / Rtg (Pre->Post)         Pts    1  
2     7    12     4      ON   15445895 / R: 1794   ->1817      N:2   W    
3    16    20     7      MI   14598900 / R: 1553   ->1663      N:2   B    
4    11    13    12      MI   14959604 / R: 1384   ->1640      N:2   W    
5     5    19     1      MI   12616049 / R: 1716   ->1744      N:2   W    
6     4    14    17      MI   14601533 / R: 1655   ->1690      N:2   B    
7    10    27    21      OH   15055204 / R: 1686   ->1687      N:3   W    
    2     3     4     5     6     7         ID  Pre Post
2 B     W     B     W     B     W     15445895 1794 1817
3 W     B     W     B     W     B     14598900 1553 1663
4 B     W     B     W     B     W     14959604 1384 1640
5 B     W     B     W     B     B     12616049 1716 1744
6 W     B     W     B     W     B     14601533 1655 1690
7 B     W     B     B     W     B     15055204 1686 1687

Finally, we can place only the necessary information needed for the final table in the data frame.

column.names <- c('Pair','Player Name','Total Points','Opponent 1','Opponent 2','Opponent 3','Opponent 4','Opponent 5','Opponent 6','Opponent 7','State','Pre')
raw.data.final <- raw.data.add.col[,c(1,2,3,4,5,6,7,8,9,10,11,22)]
colnames(raw.data.final) <- column.names
head(raw.data.final)

    Pair                       Player Name Total Points Opponent 1
2     1   GARY HUA                                6.0           39
3     2   DAKSHESH DARURI                         6.0           63
4     3   ADITYA BAJAJ                            6.0            8
5     4   PATRICK H SCHILLING                     5.5           23
6     5   HANSHI ZUO                              5.5           45
7     6   HANSEN SONG                             5.0           34
  Opponent 2 Opponent 3 Opponent 4 Opponent 5 Opponent 6 Opponent 7
2         21         18         14          7         12          4
3         58          4         17         16         20          7
4         61         25         21         11         13         12
5         28          2         26          5         19          1
6         37         12         13          4         14         17
7         29         11         35         10         27         21
    State  Pre
2     ON  1794
3     MI  1553
4     MI  1384
5     MI  1716
6     MI  1655
7     OH  1686

Calculating Average Opponent Pre-Rating for Each Player

In order to conduct numerical calculations, I needed to convert all rows that contained numeric values to numeric columns.

for (j in 3:10) {
  raw.data.final[,j] <- as.numeric(raw.data.final[,j])
}
raw.data.final$Pre <- as.numeric(raw.data.final$Pre)

Next, I am going to replace all of the “Opponent”" columns with that particular player’s Pre-Rating:

for (i in 1:nrow(raw.data.final)) {
  for (j in 4:10) {
    if (!is.na(raw.data.final[i,j])) {
      raw.data.final[i,j] <- raw.data.final[raw.data.final[i,j],12]
    }
  }
}
head(raw.data.final)

    Pair                       Player Name Total Points Opponent 1
2     1   GARY HUA                                  6.0       1436
3     2   DAKSHESH DARURI                           6.0       1175
4     3   ADITYA BAJAJ                              6.0       1641
5     4   PATRICK H SCHILLING                       5.5       1363
6     5   HANSHI ZUO                                5.5       1242
7     6   HANSEN SONG                               5.0       1399
  Opponent 2 Opponent 3 Opponent 4 Opponent 5 Opponent 6 Opponent 7
2       1563       1600       1610       1649       1663       1716
3        917       1716       1629       1604       1595       1649
4        955       1745       1563       1712       1666       1663
5       1507       1553       1579       1655       1564       1794
6        980       1663       1666       1716       1610       1629
7       1602       1712       1438       1365       1552       1563
    State  Pre
2     ON  1794
3     MI  1553
4     MI  1384
5     MI  1716
6     MI  1655
7     OH  1686

Finally, we can calculate the average of these columns using the rowMeans() function. It is important to specify the “na.rm” component to be ‘TRUE’ to avoid the NA values being used in the calculation.

raw.data.final$average <- format(round(rowMeans(raw.data.final[,4:10],na.rm=TRUE),digits=2),nsmall=2)

Creating Final Table and Writing to CSV File

Now that we have all of the components for the data table, we can create our data frame from this, and display it in a table.

final.column.names <- c('Player Name','Player State','Total Points','Player Pre-Rating','Avg. Opponent Pre-Rating')
final.table <- raw.data.final[,c(2,11,3,12,13)]
colnames(final.table) <- final.column.names
rownames(final.table) <- NULL
datatable(final.table, rownames=FALSE)

Finally, we can save this data table to a CSV file. However, we need to make sure all the columns that are numeric are saved as numeric, and to trim all white-space from the string columns.

for (j in 3:5) {
  final.table[,j] <- as.numeric(final.table[,j])
}
for (j in 1:2) {
  final.table[,j] <- str_trim(final.table[,j])
}
write.table(final.table,file="Ryan_Gordon_Project1.csv",row.names=FALSE,col.names=TRUE,sep=",")

Project 1

Ryan Gordon

02/24/2019

Load All Necessary Libraries

Load in Data from URL and Clean Up

Calculating Average Opponent Pre-Rating for Each Player

Creating Final Table and Writing to CSV File