First, we need to load all the libraries used in this project. The following packages and the reason for them is shown below:
library(tidyr)
library(stringr)
library(DT)The following R code will collect the data from a URL and place it in a data frame:
raw.url <- "https://raw.githubusercontent.com/rg563/DATA607/master/Projects/Project%201/tournamentinfo.txt"
raw.table <- read.delim(raw.url,header=FALSE,stringsAsFactors=FALSE)
raw.data <- as.data.frame(raw.table)
head(raw.data) V1
1 -----------------------------------------------------------------------------------------
2 Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round|
3 Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
4 -----------------------------------------------------------------------------------------
5 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|
6 ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |
When looking at the structure of this data, there are two things that immediately stand out. The first is how each section is separated by a chain of “-”. We will need to remove these rows. The second is that it appears that there are consecutive rows of data that are stacked on top of each other. We will need to combine these consecutive rows into one row, so that we have all the information on one line.
First, let’s delete all the rows with “-” using the “grepl” function. This function returns a logical vector. In my code, I use “!grepl”, which will return the opposite of “grepl”. Therefore, my function will return ‘FALSE’ if the line contains “—” and return ‘TRUE’ if it does not contain it. Then, my data frame will only contain rows that do not contain “—”:
raw.data <- as.data.frame(raw.data[!grepl("---",raw.data$V1),])
head(raw.data) raw.data[!grepl("---", raw.data$V1), ]
1 Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round|
2 Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
3 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|
4 ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |
5 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|
6 MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |
Next, we need to combine consecutive rows into one. First, we can create two data frames from the original data frame with one containing all even rows, and the other containing all odd rows. For this, I used the “seq” function, which will return a specified sequence of rows from my data frame. Then, we can use the “paste” function to combine the data frames into one data frame:
odd.df <- raw.data[seq(1,nrow(raw.data),2),]
even.df <- raw.data[seq(2,nrow(raw.data),2),]
raw.data.combined <- data.frame(paste(odd.df,even.df))
head(raw.data.combined) paste.odd.df..even.df.
1 Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round| Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
2 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4| ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |
3 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7| MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |
4 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12| MI | 14959604 / R: 1384 ->1640 |N:2 |W |B |W |B |W |B |W |
5 4 | PATRICK H SCHILLING |5.5 |W 23|D 28|W 2|W 26|D 5|W 19|D 1| MI | 12616049 / R: 1716 ->1744 |N:2 |W |B |W |B |W |B |B |
6 5 | HANSHI ZUO |5.5 |W 45|W 37|D 12|D 13|D 4|W 14|W 17| MI | 14601533 / R: 1655 ->1690 |N:2 |B |W |B |W |B |W |B |
Now, we can split the rows using “|” as the deliminator, select the first row as the column names, and then delete the first row:
col.vector <- as.character(1:20)
raw.data.sep <- separate(data=raw.data.combined,col=paste.odd.df..even.df.,into=col.vector,sep="\\|")
colnames(raw.data.sep) = raw.data.sep[1,]
raw.data.sep <- raw.data.sep[-1,]
head(raw.data.sep) Pair Player Name Total Round Round Round Round
2 1 GARY HUA 6.0 W 39 W 21 W 18 W 14
3 2 DAKSHESH DARURI 6.0 W 63 W 58 L 4 W 17
4 3 ADITYA BAJAJ 6.0 L 8 W 61 W 25 W 21
5 4 PATRICK H SCHILLING 5.5 W 23 D 28 W 2 W 26
6 5 HANSHI ZUO 5.5 W 45 W 37 D 12 D 13
7 6 HANSEN SONG 5.0 W 34 D 29 L 11 W 35
Round Round Round Num USCF ID / Rtg (Pre->Post) Pts 1
2 W 7 D 12 D 4 ON 15445895 / R: 1794 ->1817 N:2 W
3 W 16 W 20 W 7 MI 14598900 / R: 1553 ->1663 N:2 B
4 W 11 W 13 W 12 MI 14959604 / R: 1384 ->1640 N:2 W
5 D 5 W 19 D 1 MI 12616049 / R: 1716 ->1744 N:2 W
6 D 4 W 14 W 17 MI 14601533 / R: 1655 ->1690 N:2 B
7 D 10 W 27 W 21 OH 15055204 / R: 1686 ->1687 N:3 W
2 3 4 5 6 7
2 B W B W B W
3 W B W B W B
4 B W B W B W
5 B W B W B B
6 W B W B W B
7 B W B B W B
Next, we need to split up the ‘USCF ID / Rtg (Pre->Post)’ column because this contains three pieces of useful information all in one column. All of the important information are the digits containing ID, Pre-Rating, and Post-Rating. Originally, I used the “str_extract_all” function with the regular expression “\\d{1,}” to list all sequences of digits in the string. However, this returned abnormal results from some rows because the ratings contained the letter ‘P’ in the middle of the number. I noticed that the sequence of numbers after each ‘P’ were at most two digits, and the ID, Pre- and Post-Ratings were all at least 3 digits. Therefore, I ammended the regular expression to “\\d{3,}” to account for this. Then, I added an additional for loop to only extract the digits from these columns, and used the “cbind” function to add these new columns to the existing data frame:
split.column <- data.frame(as.character(str_extract_all(raw.data.sep$` USCF ID / Rtg (Pre->Post) `,"\\d{3,}")))
split.column <- separate(data=split.column,col="as.character.str_extract_all.raw.data.sep...USCF.ID...Rtg..Pre..Post...........",into=c("ID","Pre","Post"),sep=",")
for (i in 1:nrow(split.column)) {
for (j in 1:ncol(split.column)) {
split.column[i,j] <- str_extract(split.column[i,j],"\\d{1,}")
}
}
raw.data.add.col <- cbind(raw.data.sep,split.column)
head(raw.data.add.col) Pair Player Name Total Round Round Round Round
2 1 GARY HUA 6.0 W 39 W 21 W 18 W 14
3 2 DAKSHESH DARURI 6.0 W 63 W 58 L 4 W 17
4 3 ADITYA BAJAJ 6.0 L 8 W 61 W 25 W 21
5 4 PATRICK H SCHILLING 5.5 W 23 D 28 W 2 W 26
6 5 HANSHI ZUO 5.5 W 45 W 37 D 12 D 13
7 6 HANSEN SONG 5.0 W 34 D 29 L 11 W 35
Round Round Round Num USCF ID / Rtg (Pre->Post) Pts 1
2 W 7 D 12 D 4 ON 15445895 / R: 1794 ->1817 N:2 W
3 W 16 W 20 W 7 MI 14598900 / R: 1553 ->1663 N:2 B
4 W 11 W 13 W 12 MI 14959604 / R: 1384 ->1640 N:2 W
5 D 5 W 19 D 1 MI 12616049 / R: 1716 ->1744 N:2 W
6 D 4 W 14 W 17 MI 14601533 / R: 1655 ->1690 N:2 B
7 D 10 W 27 W 21 OH 15055204 / R: 1686 ->1687 N:3 W
2 3 4 5 6 7 ID Pre Post
2 B W B W B W 15445895 1794 1817
3 W B W B W B 14598900 1553 1663
4 B W B W B W 14959604 1384 1640
5 B W B W B B 12616049 1716 1744
6 W B W B W B 14601533 1655 1690
7 B W B B W B 15055204 1686 1687
Next, we need to clean up the “Round” columns since they contain the result and the pair number of the person they played, and we only care about the opponents they played. We can use “str_extract” again to accomplish this.
for (i in 1:nrow(raw.data.add.col)) {
for (j in 4:10) {
raw.data.add.col[i,j] <- str_extract(raw.data.add.col[i,j],"\\d{1,}")
}
}
head(raw.data.add.col) Pair Player Name Total Round Round Round Round
2 1 GARY HUA 6.0 39 21 18 14
3 2 DAKSHESH DARURI 6.0 63 58 4 17
4 3 ADITYA BAJAJ 6.0 8 61 25 21
5 4 PATRICK H SCHILLING 5.5 23 28 2 26
6 5 HANSHI ZUO 5.5 45 37 12 13
7 6 HANSEN SONG 5.0 34 29 11 35
Round Round Round Num USCF ID / Rtg (Pre->Post) Pts 1
2 7 12 4 ON 15445895 / R: 1794 ->1817 N:2 W
3 16 20 7 MI 14598900 / R: 1553 ->1663 N:2 B
4 11 13 12 MI 14959604 / R: 1384 ->1640 N:2 W
5 5 19 1 MI 12616049 / R: 1716 ->1744 N:2 W
6 4 14 17 MI 14601533 / R: 1655 ->1690 N:2 B
7 10 27 21 OH 15055204 / R: 1686 ->1687 N:3 W
2 3 4 5 6 7 ID Pre Post
2 B W B W B W 15445895 1794 1817
3 W B W B W B 14598900 1553 1663
4 B W B W B W 14959604 1384 1640
5 B W B W B B 12616049 1716 1744
6 W B W B W B 14601533 1655 1690
7 B W B B W B 15055204 1686 1687
Finally, we can place only the necessary information needed for the final table in the data frame.
column.names <- c('Pair','Player Name','Total Points','Opponent 1','Opponent 2','Opponent 3','Opponent 4','Opponent 5','Opponent 6','Opponent 7','State','Pre')
raw.data.final <- raw.data.add.col[,c(1,2,3,4,5,6,7,8,9,10,11,22)]
colnames(raw.data.final) <- column.names
head(raw.data.final) Pair Player Name Total Points Opponent 1
2 1 GARY HUA 6.0 39
3 2 DAKSHESH DARURI 6.0 63
4 3 ADITYA BAJAJ 6.0 8
5 4 PATRICK H SCHILLING 5.5 23
6 5 HANSHI ZUO 5.5 45
7 6 HANSEN SONG 5.0 34
Opponent 2 Opponent 3 Opponent 4 Opponent 5 Opponent 6 Opponent 7
2 21 18 14 7 12 4
3 58 4 17 16 20 7
4 61 25 21 11 13 12
5 28 2 26 5 19 1
6 37 12 13 4 14 17
7 29 11 35 10 27 21
State Pre
2 ON 1794
3 MI 1553
4 MI 1384
5 MI 1716
6 MI 1655
7 OH 1686
In order to conduct numerical calculations, I needed to convert all rows that contained numeric values to numeric columns.
for (j in 3:10) {
raw.data.final[,j] <- as.numeric(raw.data.final[,j])
}
raw.data.final$Pre <- as.numeric(raw.data.final$Pre)Next, I am going to replace all of the “Opponent”" columns with that particular player’s Pre-Rating:
for (i in 1:nrow(raw.data.final)) {
for (j in 4:10) {
if (!is.na(raw.data.final[i,j])) {
raw.data.final[i,j] <- raw.data.final[raw.data.final[i,j],12]
}
}
}
head(raw.data.final) Pair Player Name Total Points Opponent 1
2 1 GARY HUA 6.0 1436
3 2 DAKSHESH DARURI 6.0 1175
4 3 ADITYA BAJAJ 6.0 1641
5 4 PATRICK H SCHILLING 5.5 1363
6 5 HANSHI ZUO 5.5 1242
7 6 HANSEN SONG 5.0 1399
Opponent 2 Opponent 3 Opponent 4 Opponent 5 Opponent 6 Opponent 7
2 1563 1600 1610 1649 1663 1716
3 917 1716 1629 1604 1595 1649
4 955 1745 1563 1712 1666 1663
5 1507 1553 1579 1655 1564 1794
6 980 1663 1666 1716 1610 1629
7 1602 1712 1438 1365 1552 1563
State Pre
2 ON 1794
3 MI 1553
4 MI 1384
5 MI 1716
6 MI 1655
7 OH 1686
Finally, we can calculate the average of these columns using the rowMeans() function. It is important to specify the “na.rm” component to be ‘TRUE’ to avoid the NA values being used in the calculation.
raw.data.final$average <- format(round(rowMeans(raw.data.final[,4:10],na.rm=TRUE),digits=2),nsmall=2)Now that we have all of the components for the data table, we can create our data frame from this, and display it in a table.
final.column.names <- c('Player Name','Player State','Total Points','Player Pre-Rating','Avg. Opponent Pre-Rating')
final.table <- raw.data.final[,c(2,11,3,12,13)]
colnames(final.table) <- final.column.names
rownames(final.table) <- NULL
datatable(final.table, rownames=FALSE)Finally, we can save this data table to a CSV file. However, we need to make sure all the columns that are numeric are saved as numeric, and to trim all white-space from the string columns.
for (j in 3:5) {
final.table[,j] <- as.numeric(final.table[,j])
}
for (j in 1:2) {
final.table[,j] <- str_trim(final.table[,j])
}
write.table(final.table,file="Ryan_Gordon_Project1.csv",row.names=FALSE,col.names=TRUE,sep=",")