Project 1
In this project, you’re given a text file with chess tournament results where the information has some structure.
Your job is to create an R Markdown file that generates a .CSV file (that could for example be imported into a SQL database) with the following information for all of the players: + Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents
For the first player, the information would be: Gary Hua, ON, 6.0, 1794, 1605
1605 was calculated by using the pre-tournament opponents’ ratings of 1436, 1563, 1600, 1610, 1649, 1663, 1716, and dividing by the total number of games played.
If you have questions about the meaning of the data or the results, please post them on the discussion forum.
Data science, like chess, is a game of back and forth…
The chess rating system (invented by a Minnesota statistician named Arpad Elo) has been used in many other contexts, including assessing relative strength of employment candidates by human resource departments.
You may substitute another text file (or set of text files, or data scraped from web pages) of similar or greater complexity, and create your own assignment and solution. You may work in a small team. All of your code should be in an R markdown file (and published to rpubs.com); with your data accessible for the person running.
Data is imported from the text file path leading to tournamentinfo.txt from my github Project 1 folder. Function read.delim from the utils package is used.
# Store Github url to variable
txtfile=
"https://raw.githubusercontent.com/gcampos100/DATA607Spring2021/main/Projects/Project%201/
tournamentinfo.txt"The text file is converted into a data.frame separated by the delimiter | and not claiming a header, resulting in 11 columns vs. the actual 10 in the file.
Note1
## V1
## 1 -----------------------------------------------------------------------------------------
## 2 Pair
## 3 Num
## 4 -----------------------------------------------------------------------------------------
## V2 V3 V4 V5 V6 V7 V8 V9
## 1
## 2 Player Name Total Round Round Round Round Round Round
## 3 USCF ID / Rtg (Pre->Post) Pts 1 2 3 4 5 6
## 4
--- & NULLThe data frame my_data is a 2 dimensional data frame containing 11 columns and 195 rows. Rows composed completely of --- and the additional all NULL column created on import needs to be removed.
#Find rows to delete, multiples of 3
toDelete <- seq(1, length(my_data$V1), 3)
#remove rows
my_data <- my_data[-toDelete ,]
# remove column with NULL Values
my_data[11] <-NULLNote2
## V1 V2 V3 V4 V5 V6 V7 V8
## 2 Pair Player Name Total Round Round Round Round Round
## 3 Num USCF ID / Rtg (Pre->Post) Pts 1 2 3 4 5
## 5 1 GARY HUA 6.0 W 39 W 21 W 18 W 14 W 7
The header is composed of 2 rows that require merging. They must also be separated by column. Separating while keep most data intact requires subsetting. The first two rows are subset individually and the body of the data set is retained on my_data.
Using str_replace in combination with gsub() and regex expressions, unnecessary information from the rows are removed. The rows are then combined to form a single header. The end result will be the exact or near exact column names needed for this dataset.
# Merge both vectors and seperate by `,`
my_data_names <-
paste(my_data_names,my_data_names_2,collapse = ",")
# Replace any character issues
my_data_names <-
str_replace(my_data_names, "/",",")
my_data_names <-
str_replace(my_data_names, "->"," , ")
my_data_names <-
str_replace(my_data_names, "\\(","")
my_data_names <-
str_replace(my_data_names, "Post\\)","Rtg Post")
# Remove excessive spacing
my_data_names <-
str_replace(gsub("\\s\\s", " ", str_trim(my_data_names)), "B", "b")
# Add comma to attribute `Player Name`
my_data_names <-
str_replace(gsub("Player Name", "Player Name,",my_data_names),"B", "b")
my_data_names<-
unlist(strsplit(my_data_names, ","))## [1] "Pair Num " " Player Name" " USCF ID "
## [4] " Rtg Pre " " Rtg Post " "Total Pts "
## [7] "Round 1 " "Round 2 " "Round 3 "
## [10] "Round 4 " "Round 5 " "Round 6 "
## [13] "Round 7"
The relevant data in the body of the data set, is split between 2 rows, similar to the header. In order to consolidate, the body is broken into 2 halves and ultimately merged. A sequence is used to store the index for odd rows and even rows in the data frame. The index is used to create subsets my_data_top and my_data_bottom.
# Subset of top using index
top <- seq(1, length(my_data$V1), 2)
my_data_top <- my_data[top ,]
# Subset of bottom using index
bottom <- seq(2, length(my_data$V1), 2)
my_data_bottom <- my_data[bottom ,]Note3
## [1] "TOP"
## V2 V3 V4 V5 V6 V7 V8 V9
## 5 GARY HUA 6.0 W 39 W 21 W 18 W 14 W 7 D 12
## 8 DAKSHESH DARURI 6.0 W 63 W 58 L 4 W 17 W 16 W 20
## [1] "BOTTOM"
## V2 V3 V4 V5 V6 V7 V8 V9
## 6 15445895 / R: 1794 ->1817 N:2 W B W B W B
## 9 14598900 / R: 1553 ->1663 N:2 B W B W B W
In order to merge the two subsets data cleanly as well as easy manipulation of the data sets’, content the following was done:
listsmy_data_bottom is removed|, for future splitting# List conversion
my_data_top <- as.list(t(my_data_top))
my_data_bottom<- as.list(t(my_data_bottom))Note4
# Unnecessary Data removal
my_data_bottom<-str_replace_all(gsub("W|B|b|P[1-9]|P[1-9][0-9]", " ",my_data_bottom),"B", "b")
my_data_bottom<-str_replace_all(gsub("N\\:.", " ",my_data_bottom),"B", "b")
# Insertion of `|` to use as a delimiter later on
my_data_bottom<-str_replace_all(gsub("O", "| O",my_data_bottom),"B", "b")
my_data_bottom<-str_replace_all(gsub("M", "| M",my_data_bottom),"B", "b")# Overall cleanup of data
my_data <-str_replace(gsub("/ R:", ",",my_data),"B", "b")
my_data <-str_replace(gsub("->", ",",my_data),"B", "b")
my_data <-str_replace(gsub("\\s+", " ", str_trim(my_data)), "B", "B")The result of the cleanup is a list of all variables, comma separated, WITH EXCEPTION of Pair Num content and, Player Name and USCF ID.
## chr "1 | ON , GARY HUA 15445895 , 1794 ,1817 ,6.0 ,W 39 ,W 21 ,W 18 ,W 14 ,W 7 ,D 12 ,D 4 , 2 | MI , DAKSHESH DARURI"| __truncated__
The list is split into elements, than converted into a data frame.
my_data<-strsplit(my_data[1],",")
my_data <- data.frame(matrix(unlist(my_data), ncol=12, byrow=TRUE))Note5
## X1 X2 X3 X4 X5 X6 X7 X8 X9
## 1 1 | ON GARY HUA 15445895 1794 1817 6.0 W 39 W 21 W 18 W 14
## 2 2 | MI DAKSHESH DARURI 14598900 1553 1663 6.0 W 63 W 58 L 4 W 17
Specifically column X2 needs to be separated using str_split_fixed
## [1] " GARY HUA 15445895 "
## [1] " DAKSHESH DARURI 14598900 "
quick.split<-str_split_fixed(my_data$X2, "\\s[0-9]", 2)
my_data<-cbind(my_data[1],quick.split,my_data[3:12])This will to create the 13 columns we required to match with the names in my_data_names
## X1 1 2 X3 X4 X5 X6 X7 X8 X9 X10
## 1 1 | ON GARY HUA 5445895 1794 1817 6.0 W 39 W 21 W 18 W 14 W 7
## X11 X12
## 1 D 12 D 4
Unfortunately, the State column name was not accounted for with Pair Num ’s aggregate data, hence we have to seperate the first column X1 using the below command.
Note6
Then the remaining names can be added using the list my_data_names
colnames(my_data)[colnames(my_data)
%in% c("1", "2","X3","X4","X5","X6","X7","X8","X9","X10","X11","X12")] <-
my_data_names[2:13]## Pair Num State Player Name USCF ID Rtg Pre Rtg Post
## 1 1 ON GARY HUA 5445895 1794 1817
## Total Pts Round 1 Round 2 Round 3 Round 4 Round 5 Round 6
## 1 6.0 W 39 W 21 W 18 W 14 W 7 D 12
## Round 7
## 1 D 4
Vector Average is created to store the calculated values, that will also be stored into my_data’s final column.
The logic for the below for loop is as follows
1 to the maximum row nrow8 indexing the Round 1 column to the maximum column ncol.i is retrieved as a numeric valuei is used to index the row of opponents data, with 4 indexing Rtg Pre value as a numericsum initialized in the first for loop as 0, if it passes the conditional if statement of not NA using the function \(\\!is.na\).else statement, tallies number of NA values.Average indexing the same row that is calculated.for(row in 1:nrow(my_data)){
sum = 0
columns.na=0
for(col in 8:ncol(my_data)){
i<-as.numeric(str_extract(my_data[row,col],"[1-9]."))
if(!is.na(i)){
print(i)
sum = sum + as.numeric(my_data[i,5])
}
else columns.na=columns.na+1
}
#print(columns.na)
#print(sum)
#print(round((sum/((ncol(my_data)-7)-columns.na))))
Average[row]<-round((sum/((ncol(my_data)-7)-columns.na)))
}Once the loop ends and the Average vector is filled, Average is combined with my_data using cbind() function.
## Pair Num State Player Name USCF ID Rtg Pre
## [1,] "1 " " ON " " GARY HUA" "5445895 " " 1794 "
## [2,] " 2 " " MI " " DAKSHESH DARURI" "4598900 " " 1553 "
## [3,] " 3 " " MI " " ADITYA BAJAJ" "4959604 " " 1384 "
## Rtg Post Total Pts Round 1 Round 2 Round 3 Round 4 Round 5
## [1,] "1817 " "6.0 " "W 39 " "W 21 " "W 18 " "W 14 " "W 7 "
## [2,] "1663 " "6.0 " "W 63 " "W 58 " "L 4 " "W 17 " "W 16 "
## [3,] "1640 " "6.0 " "L 8 " "W 61 " "W 25 " "W 21 " "W 11 "
## Round 6 Round 7 Average
## [1,] "D 12 " "D 4 " "1605"
## [2,] "W 20 " "W 7 " "1469"
## [3,] "W 13 " "W 12 " "1564"
The requirements for Project 1’s submission is the data:
Player’s Name, Player’s State, Total Number of Points, Player’s
Pre-Rating, and Average Pre Chess Rating of Opponents
\(\therefore\) my_data subset Project_1_csv with the needed columns is create, before converting to a .csv
## [1] "Pair Num " "State" " Player Name"
## [4] " USCF ID " " Rtg Pre " " Rtg Post "
## [7] "Total Pts " "Round 1 " "Round 2 "
## [10] "Round 3 " "Round 4 " "Round 5 "
## [13] "Round 6 " "Round 7" "Average"
#my_data<-as.data.frame(my_data)
Project_1_csv<-as.data.frame(my_data) %>%
select(" Player Name",State,"Total Pts "," Rtg Pre ","Average")
head(Project_1_csv)## Player Name State Total Pts Rtg Pre Average
## 1 GARY HUA ON 6.0 1794 1605
## 2 DAKSHESH DARURI MI 6.0 1553 1469
## 3 ADITYA BAJAJ MI 6.0 1384 1564
## 4 PATRICK H SCHILLING MI 5.5 1716 1574
## 5 HANSHI ZUO MI 5.5 1655 1501
## 6 HANSEN SONG OH 5.0 1686 1519
The .csv file is created and stored using a local github path
and can be found on github.com/gcampos100/ in the DATA607 repository
Column 10 was ommitted from head() for formatting:↩︎
Column 9 & 10 was ommitted from head() for formatting:↩︎
Columns 1 & 10 was omitted from head() for formatting:↩︎
Regex was used to add | character. Although not very intuitive, the pasting of the subsets created issues separating column 1 Pair Num into Pair Num and State, hence | was added to use as a delimiter later on:↩︎
Column 10-12 was ommitted from head() for formatting:↩︎
This is where the added | is used as a delimeter:↩︎