In this project, you’re given a text file with chess tournament results where the information has some structure. Your job is to create an R Markdown file that generates a .CSV file (that could for example be imported into a SQL database) with the following information for all of the players:
Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents.
The text file will be downloaded from the Github repository to csv file using read_csv function. read_csv is faster for large .csv files comparing to read.csv function as it imports data into R as a tibble. So it may be better be familiar to try to use it in this project.
data <-read_csv(file = "https://raw.githubusercontent.com/ex-pr/DATA607/Project-1/7645617.txt",col_names = FALSE, show_col_types = FALSE)
By checking the data downloaded, we have 1 column with 196 rows. The final file should contain information about 64 players.
summary(data)
## X1
## Length:196
## Class :character
## Mode :character
head(data)
## # A tibble: 6 x 1
## X1
## <chr>
## 1 -----------------------------------------------------------------------------~
## 2 Pair | Player Name |Total|Round|Round|Round|Round|Round|R~
## 3 Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | ~
## 4 -----------------------------------------------------------------------------~
## 5 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 1~
## 6 ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B ~
To make the work easy, I will remove column names as well as the last dashed line.
data <- data %>%
filter(!row_number() %in% c(1, 2, 3, 4, 196))
data
## # A tibble: 191 x 1
## X1
## <chr>
## 1 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D ~
## 2 ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B ~
## 3 ----------------------------------------------------------------------------~
## 4 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W ~
## 5 MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W ~
## 6 ----------------------------------------------------------------------------~
## 7 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W ~
## 8 MI | 14959604 / R: 1384 ->1640 |N:2 |W |B |W |B |W |B ~
## 9 ----------------------------------------------------------------------------~
## 10 4 | PATRICK H SCHILLING |5.5 |W 23|D 28|W 2|W 26|D 5|W ~
## # ... with 181 more rows
As we see in our file each player takes 2 lines, so we need to take the information we need for each player from these two lines and make it as one line for one player. To make it easy to work with, we will deframe the tibble and receive 196 strings of data.
chess <- deframe(data)
head(chess)
## [1] "1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [2] "ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
## [3] "-----------------------------------------------------------------------------------------"
## [4] "2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|"
## [5] "MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |"
## [6] "-----------------------------------------------------------------------------------------"
Using code and libraries from the previous HW, we will treat our data as string. At this step, the main goal is to remove all | between future columns and dashes between future rows.
As the first step, we will detect and delete dashes with a help from regular expressions.
remove_ <- str_detect(chess, '^[-]{2,}$')
chess <- chess[!remove_ == "TRUE"]
head(chess)
## [1] "1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [2] "ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
## [3] "2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|"
## [4] "MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |"
## [5] "3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|"
## [6] "MI | 14959604 / R: 1384 ->1640 |N:2 |W |B |W |B |W |B |W |"
Next, we will substitute | and / with comma as it will help us to convert data into data frame.
chess <- str_replace_all(chess, "[|/]",",")
head(chess)
## [1] "1 , GARY HUA ,6.0 ,W 39,W 21,W 18,W 14,W 7,D 12,D 4,"
## [2] "ON , 15445895 , R: 1794 ->1817 ,N:2 ,W ,B ,W ,B ,W ,B ,W ,"
## [3] "2 , DAKSHESH DARURI ,6.0 ,W 63,W 58,L 4,W 17,W 16,W 20,W 7,"
## [4] "MI , 14598900 , R: 1553 ->1663 ,N:2 ,B ,W ,B ,W ,B ,W ,B ,"
## [5] "3 , ADITYA BAJAJ ,6.0 ,L 8,W 61,W 25,W 21,W 11,W 13,W 12,"
## [6] "MI , 14959604 , R: 1384 ->1640 ,N:2 ,W ,B ,W ,B ,W ,B ,W ,"
Combine two rows
a <- c("")
for (i in seq(1, length(chess)-1, by = 2)){
a <- c(a, paste(chess[i], chess[i+1], sep = "", collapse = NULL))
}
Create data frame from the strings of data.
chess_final <- as.data.frame(do.call(rbind, strsplit(a, ",")), stringsAsFactors = FALSE)
head(chess_final)
## V1 V2 V3 V4 V5 V6 V7 V8
## 1 1 GARY HUA 6.0 W 39 W 21 W 18 W 14 W 7
## 2 2 DAKSHESH DARURI 6.0 W 63 W 58 L 4 W 17 W 16
## 3 3 ADITYA BAJAJ 6.0 L 8 W 61 W 25 W 21 W 11
## 4 4 PATRICK H SCHILLING 5.5 W 23 D 28 W 2 W 26 D 5
## 5 5 HANSHI ZUO 5.5 W 45 W 37 D 12 D 13 D 4
## 6 6 HANSEN SONG 5.0 W 34 D 29 L 11 W 35 D 10
## V9 V10 V11 V12 V13 V14 V15 V16 V17
## 1 D 12 D 4 ON 15445895 R: 1794 ->1817 N:2 W B W
## 2 W 20 W 7 MI 14598900 R: 1553 ->1663 N:2 B W B
## 3 W 13 W 12 MI 14959604 R: 1384 ->1640 N:2 W B W
## 4 W 19 D 1 MI 12616049 R: 1716 ->1744 N:2 W B W
## 5 W 14 W 17 MI 14601533 R: 1655 ->1690 N:2 B W B
## 6 W 27 W 21 OH 15055204 R: 1686 ->1687 N:3 W B W
## V18 V19 V20 V21
## 1 B W B W
## 2 W B W B
## 3 B W B W
## 4 B W B B
## 5 W B W B
## 6 B B W B
We will remove columns that we don’t need for the assignment
chess_final[12] <- list(NULL)
chess_final[c(1,13:ncol(chess_final))] <- list(NULL)
Rename columns.
colnames(chess_final)[1] <- c("Name")
colnames(chess_final)[2] <- c("Total Number of Points")
colnames(chess_final)[3:9] <- c("R1","R2","R3","R4","R5","R6","R7")
colnames(chess_final)[10] <- c("State")
colnames(chess_final)[11] <- c("Pre_Rating")
We need to extract pre-rating for each player.
chess_final$Pre_Rating <- str_sub(chess_final$Pre_Rating, 5, 8)
Remove letters from Round results.
nm1 <- c("R1", "R2", "R3", "R4", "R5", "R6", "R7")
chess_final[nm1]<-lapply(chess_final[nm1], gsub, pattern = "W ", replacement = "")
chess_final[nm1]<-lapply(chess_final[nm1], gsub, pattern = "L ", replacement = "")
chess_final[nm1]<-lapply(chess_final[nm1], gsub, pattern = "D ", replacement = "")
chess_final[nm1]<-lapply(chess_final[nm1], gsub, pattern = "U ", replacement = "")
chess_final[nm1]<-lapply(chess_final[nm1], gsub, pattern = "H ", replacement = "")
chess_final[nm1]<-lapply(chess_final[nm1], gsub, pattern = "X ", replacement = "")
If there are any empty (NA) spaces, we will substitute it with 0
chess_final[is.na(chess_final)] <- 0
We won’t be able to calculate average if we don’t transform data in columns 2:9 and 11 in to numbers.
chess_final[c(2:9,11)] <- sapply((chess_final)[c(2:9,11)], as.numeric)
chess_final[is.na(chess_final)] <- 0
head(chess_final)
## Name Total Number of Points R1 R2 R3 R4 R5 R6 R7
## 1 GARY HUA 6.0 39 21 18 14 7 12 4
## 2 DAKSHESH DARURI 6.0 63 58 4 17 16 20 7
## 3 ADITYA BAJAJ 6.0 8 61 25 21 11 13 12
## 4 PATRICK H SCHILLING 5.5 23 28 2 26 5 19 1
## 5 HANSHI ZUO 5.5 45 37 12 13 4 14 17
## 6 HANSEN SONG 5.0 34 29 11 35 10 27 21
## State Pre_Rating
## 1 ON 1794
## 2 MI 1553
## 3 MI 1384
## 4 MI 1716
## 5 MI 1655
## 6 OH 1686
The average pre-rating of opponents is calculated by using the pre-tournament opponents’ ratings and dividing by the total number of games played.
total <- numeric(0)
opponents <- numeric(0)
avg_rate <- vector()
for (i in 1:length(chess_final$Pre_Rating)){
player <- as.numeric(as.vector(chess_final[i,3:9]))
total <- sum(chess_final[player, "Pre_Rating"])
opponents <- sum(chess_final[i,c(3:9)]!=0) # number of opponents for each player, if no opponents than 0
avg_rate[i] <- round(total / opponents, digits = 0)
}
chess_final$Avg_Pre_Rate <- avg_rate
head(chess_final)
## Name Total Number of Points R1 R2 R3 R4 R5 R6 R7
## 1 GARY HUA 6.0 39 21 18 14 7 12 4
## 2 DAKSHESH DARURI 6.0 63 58 4 17 16 20 7
## 3 ADITYA BAJAJ 6.0 8 61 25 21 11 13 12
## 4 PATRICK H SCHILLING 5.5 23 28 2 26 5 19 1
## 5 HANSHI ZUO 5.5 45 37 12 13 4 14 17
## 6 HANSEN SONG 5.0 34 29 11 35 10 27 21
## State Pre_Rating Avg_Pre_Rate
## 1 ON 1794 1605
## 2 MI 1553 1469
## 3 MI 1384 1564
## 4 MI 1716 1574
## 5 MI 1655 1501
## 6 OH 1686 1519
For the final csv file, we will use columns Name, State, Total Number of Points,Pre-Rating, and Average Pre Chess Rating of Opponents as it is asked by the assignment.
chess_final <- chess_final[,c(1,10,2,11,12)]
head(chess_final)
## Name State Total Number of Points Pre_Rating
## 1 GARY HUA ON 6.0 1794
## 2 DAKSHESH DARURI MI 6.0 1553
## 3 ADITYA BAJAJ MI 6.0 1384
## 4 PATRICK H SCHILLING MI 5.5 1716
## 5 HANSHI ZUO MI 5.5 1655
## 6 HANSEN SONG OH 5.0 1686
## Avg_Pre_Rate
## 1 1605
## 2 1469
## 3 1564
## 4 1574
## 5 1501
## 6 1519
write.csv(chess_final, file = "Chess.csv")