In this project, you’re given a text file with chess tournament results where the information has some structure. Your job is to create an R Markdown file that generates a .CSV file (that could for example be imported into a SQL database) with the following information for all of the players:
Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents For the first player, the information would be: Gary Hua, ON, 6.0, 1794, 1605 1605 was calculated by using the pre-tournament opponents’ ratings of 1436, 1563, 1600, 1610, 1649, 1663, 1716, and dividing by the total number of games played.
If you have questions about the meaning of the data or the results, please post them on the discussion forum. Data science, like chess, is a game of back and forth…
The chess rating system (invented by a Minnesota statistician named Arpad Elo) has been used in many other contexts, including assessing relative strength of employment candidates by human resource departments.
You may substitute another text file (or set of text files, or data scraped from web pages) of similar or greater complexity, and create your own assignment and solution. You may work in a small team. All of your code should be in an R markdown file (and published to rpubs.com); with your data accessible for the person running the script.
library(stringr)
library(tidyverse)
library(ggplot2)Importing tounrament project data
#ingesting data from github repo
tournament.data <- readLines('https://raw.githubusercontent.com/keshaws/CUNY_MSDS_2020/master/DATA607/tournamentinfo.txt')
head(tournament.data,20)## [1] "-----------------------------------------------------------------------------------------"
## [2] " Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round| "
## [3] " Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 | "
## [4] "-----------------------------------------------------------------------------------------"
## [5] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [6] " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
## [7] "-----------------------------------------------------------------------------------------"
## [8] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|"
## [9] " MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |"
## [10] "-----------------------------------------------------------------------------------------"
## [11] " 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|"
## [12] " MI | 14959604 / R: 1384 ->1640 |N:2 |W |B |W |B |W |B |W |"
## [13] "-----------------------------------------------------------------------------------------"
## [14] " 4 | PATRICK H SCHILLING |5.5 |W 23|D 28|W 2|W 26|D 5|W 19|D 1|"
## [15] " MI | 12616049 / R: 1716 ->1744 |N:2 |W |B |W |B |W |B |B |"
## [16] "-----------------------------------------------------------------------------------------"
## [17] " 5 | HANSHI ZUO |5.5 |W 45|W 37|D 12|D 13|D 4|W 14|W 17|"
## [18] " MI | 14601533 / R: 1655 ->1690 |N:2 |B |W |B |W |B |W |B |"
## [19] "-----------------------------------------------------------------------------------------"
## [20] " 6 | HANSEN SONG |5.0 |W 34|D 29|L 11|W 35|D 10|W 27|W 21|"
#view data excluding first 4 lines
tournament.data <- tournament.data[-c(0:4)]
head(tournament.data,20)## [1] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [2] " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
## [3] "-----------------------------------------------------------------------------------------"
## [4] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|"
## [5] " MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |"
## [6] "-----------------------------------------------------------------------------------------"
## [7] " 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|"
## [8] " MI | 14959604 / R: 1384 ->1640 |N:2 |W |B |W |B |W |B |W |"
## [9] "-----------------------------------------------------------------------------------------"
## [10] " 4 | PATRICK H SCHILLING |5.5 |W 23|D 28|W 2|W 26|D 5|W 19|D 1|"
## [11] " MI | 12616049 / R: 1716 ->1744 |N:2 |W |B |W |B |W |B |B |"
## [12] "-----------------------------------------------------------------------------------------"
## [13] " 5 | HANSHI ZUO |5.5 |W 45|W 37|D 12|D 13|D 4|W 14|W 17|"
## [14] " MI | 14601533 / R: 1655 ->1690 |N:2 |B |W |B |W |B |W |B |"
## [15] "-----------------------------------------------------------------------------------------"
## [16] " 6 | HANSEN SONG |5.0 |W 34|D 29|L 11|W 35|D 10|W 27|W 21|"
## [17] " OH | 15055204 / R: 1686 ->1687 |N:3 |W |B |W |B |B |W |B |"
## [18] "-----------------------------------------------------------------------------------------"
## [19] " 7 | GARY DEE SWATHELL |5.0 |W 57|W 46|W 13|W 11|L 1|W 9|L 2|"
## [20] " MI | 11146376 / R: 1649 ->1673 |N:3 |W |B |W |B |B |W |W |"
tournament.data <- tournament.data[sapply(tournament.data, nchar) > 0]
head(tournament.data,10)## [1] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [2] " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
## [3] "-----------------------------------------------------------------------------------------"
## [4] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|"
## [5] " MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |"
## [6] "-----------------------------------------------------------------------------------------"
## [7] " 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|"
## [8] " MI | 14959604 / R: 1384 ->1640 |N:2 |W |B |W |B |W |B |W |"
## [9] "-----------------------------------------------------------------------------------------"
## [10] " 4 | PATRICK H SCHILLING |5.5 |W 23|D 28|W 2|W 26|D 5|W 19|D 1|"
From dataset, it appears that each player data present in two consecutive rows. First row has player’s information and match result. The second row has players state, USCF information.
# extracting data of players match - Odd posistion starting from 1
player.data <- c(seq(1, length(tournament.data), 3))
player.info <- tournament.data[player.data]
head(player.info)## [1] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [2] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|"
## [3] " 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|"
## [4] " 4 | PATRICK H SCHILLING |5.5 |W 23|D 28|W 2|W 26|D 5|W 19|D 1|"
## [5] " 5 | HANSHI ZUO |5.5 |W 45|W 37|D 12|D 13|D 4|W 14|W 17|"
## [6] " 6 | HANSEN SONG |5.0 |W 34|D 29|L 11|W 35|D 10|W 27|W 21|"
# Extracting rating data - even position starting from 2
player.rating.data <- c(seq(2, length(tournament.data), 3))
player.rating.info <- tournament.data[player.rating.data]
head(player.rating.info)## [1] " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
## [2] " MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |"
## [3] " MI | 14959604 / R: 1384 ->1640 |N:2 |W |B |W |B |W |B |W |"
## [4] " MI | 12616049 / R: 1716 ->1744 |N:2 |W |B |W |B |W |B |B |"
## [5] " MI | 14601533 / R: 1655 ->1690 |N:2 |B |W |B |W |B |W |B |"
## [6] " OH | 15055204 / R: 1686 ->1687 |N:3 |W |B |W |B |B |W |B |"
Information on player name
player.name <- str_extract(player.info, "\\s+([[:alpha:]- ]+)\\b\\s*\\|")
player.name <- gsub(player.name, pattern = "|", replacement = "", fixed = T)
player.name <- trimws(player.name)
head(player.name)## [1] "GARY HUA" "DAKSHESH DARURI" "ADITYA BAJAJ"
## [4] "PATRICK H SCHILLING" "HANSHI ZUO" "HANSEN SONG"
Infomration on player state
player.state <- str_extract(player.rating.info, "[[:alpha:]]{2}")
head(player.state)## [1] "ON" "MI" "MI" "MI" "MI" "OH"
Infomration on Player Pre rating score value
player.prerating.score <- str_extract(player.rating.info, ".\\: \\s?[[:digit:]]{3,4}")
player.prerating.score <- gsub(player.prerating.score, pattern = "R: ", replacement = "", fixed = T)
player.prerating.score <- as.numeric(as.character(player.prerating.score))
head(player.prerating.score)## [1] 1794 1553 1384 1716 1655 1686
Extract Players total points
player.total.points <- str_extract(player.info, "[[:digit:]]+\\.[[:digit:]]")
player.total.points <- as.numeric(as.character(player.total.points))
head(player.total.points)## [1] 6.0 6.0 6.0 5.5 5.5 5.0
Infomration on players opponent info
player.opponent.info <- str_extract_all(player.info, "[[:digit:]]{1,2}\\|")
player.opponent.info <- str_extract_all(player.opponent.info, "[[:digit:]]{1,2}")
player.opponent.info <- lapply(player.opponent.info, as.numeric)
head(player.opponent.info)## [[1]]
## [1] 39 21 18 14 7 12 4
##
## [[2]]
## [1] 63 58 4 17 16 20 7
##
## [[3]]
## [1] 8 61 25 21 11 13 12
##
## [[4]]
## [1] 23 28 2 26 5 19 1
##
## [[5]]
## [1] 45 37 12 13 4 14 17
##
## [[6]]
## [1] 34 29 11 35 10 27 21
Now calulating Player’s opponent avg. rating
opponent.avg.rating <- list()
for (i in 1:length(player.opponent.info)){
opponent.avg.rating[i] <- round(mean(player.prerating.score[unlist(player.opponent.info[i])]),2)
}
opponent.avg.rating <- lapply(opponent.avg.rating, as.numeric)
opponent.avg.rating <- data.frame(unlist(opponent.avg.rating))
head(opponent.avg.rating)## unlist.opponent.avg.rating.
## 1 1605.29
## 2 1469.29
## 3 1563.57
## 4 1573.57
## 5 1500.86
## 6 1518.71
player.df <- cbind.data.frame(player.name, player.state, player.total.points,
player.prerating.score,round(opponent.avg.rating,0))
head(player.df)## player.name player.state player.total.points
## 1 GARY HUA ON 6.0
## 2 DAKSHESH DARURI MI 6.0
## 3 ADITYA BAJAJ MI 6.0
## 4 PATRICK H SCHILLING MI 5.5
## 5 HANSHI ZUO MI 5.5
## 6 HANSEN SONG OH 5.0
## player.prerating.score unlist.opponent.avg.rating.
## 1 1794 1605
## 2 1553 1469
## 3 1384 1564
## 4 1716 1574
## 5 1655 1501
## 6 1686 1519
colnames(player.df)## [1] "player.name" "player.state"
## [3] "player.total.points" "player.prerating.score"
## [5] "unlist.opponent.avg.rating."
play_final_df <- rename(player.df, opp.avg.rating='unlist.opponent.avg.rating.')
colnames(play_final_df)## [1] "player.name" "player.state"
## [3] "player.total.points" "player.prerating.score"
## [5] "opp.avg.rating"
ggplot(play_final_df, aes(player.prerating.score, opp.avg.rating, color = player.state)) +
geom_point(aes(size = player.total.points, shape = player.state))+
ggtitle('Pre-rating Vs. Opponent Avg. Pre-rating')+
xlab('Pre-rating')+
ylab('Opponent Avg. Pre-rating')write.csv(player.df,'player_chess_data.csv')