1 Project 1

In this project, you’re given a text file with chess tournament results where the information has some structure. Your job is to create an R Markdown file that generates a .CSV file (that could for example be imported into a SQL database) with the following information for all of the players:

Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents

For the first player, the information would be: Gary Hua, ON, 6.0, 1794, 1605

1605 was calculated by using the pre-tournament opponents’ ratings of 1436, 1563, 1600, 1610, 1649, 1663, 1716, and dividing by the total number of games played.
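The arithmetic in this example can be checked directly in R. The following is a minimal sketch that uses only the seven ratings quoted above; the variable names are illustrative and not part of the assignment.

```r
# Pre-tournament ratings of Gary Hua's seven opponents, as quoted above
opponent.ratings <- c(1436, 1563, 1600, 1610, 1649, 1663, 1716)

# Average = sum of the ratings divided by the number of games played
avg.rating <- sum(opponent.ratings) / length(opponent.ratings)

round(avg.rating, 2)  # 1605.29
round(avg.rating)     # 1605
```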

If you have questions about the meaning of the data or the results, please post them on the discussion forum. Data science, like chess, is a game of back and forth…

The chess rating system (invented by Arpad Elo, a Hungarian-born physics professor at Marquette University in Milwaukee) has been used in many other contexts, including assessing relative strength of employment candidates by human resource departments.

You may substitute another text file (or set of text files, or data scraped from web pages) of similar or greater complexity, and create your own assignment and solution. You may work in a small team. All of your code should be in an R Markdown file (and published to rpubs.com), with your data accessible to the person running the script.

2 Data Acquisition

library(stringr)
library(tidyverse)
library(ggplot2)

Importing tournament project data

#ingesting data from github repo
tournament.data <- readLines('https://raw.githubusercontent.com/keshaws/CUNY_MSDS_2020/master/DATA607/tournamentinfo.txt')
head(tournament.data,20)
##  [1] "-----------------------------------------------------------------------------------------" 
##  [2] " Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| "
##  [3] " Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | "
##  [4] "-----------------------------------------------------------------------------------------" 
##  [5] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|" 
##  [6] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |" 
##  [7] "-----------------------------------------------------------------------------------------" 
##  [8] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|" 
##  [9] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |" 
## [10] "-----------------------------------------------------------------------------------------" 
## [11] "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|" 
## [12] "   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |" 
## [13] "-----------------------------------------------------------------------------------------" 
## [14] "    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|" 
## [15] "   MI | 12616049 / R: 1716   ->1744     |N:2  |W    |B    |W    |B    |W    |B    |B    |" 
## [16] "-----------------------------------------------------------------------------------------" 
## [17] "    5 | HANSHI ZUO                      |5.5  |W  45|W  37|D  12|D  13|D   4|W  14|W  17|" 
## [18] "   MI | 14601533 / R: 1655   ->1690     |N:2  |B    |W    |B    |W    |B    |W    |B    |" 
## [19] "-----------------------------------------------------------------------------------------" 
## [20] "    6 | HANSEN SONG                     |5.0  |W  34|D  29|L  11|W  35|D  10|W  27|W  21|"
# Drop the first 4 header lines, then view the data
tournament.data <- tournament.data[-(1:4)]
head(tournament.data,20)
##  [1] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"
##  [2] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
##  [3] "-----------------------------------------------------------------------------------------"
##  [4] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|"
##  [5] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
##  [6] "-----------------------------------------------------------------------------------------"
##  [7] "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|"
##  [8] "   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
##  [9] "-----------------------------------------------------------------------------------------"
## [10] "    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|"
## [11] "   MI | 12616049 / R: 1716   ->1744     |N:2  |W    |B    |W    |B    |W    |B    |B    |"
## [12] "-----------------------------------------------------------------------------------------"
## [13] "    5 | HANSHI ZUO                      |5.5  |W  45|W  37|D  12|D  13|D   4|W  14|W  17|"
## [14] "   MI | 14601533 / R: 1655   ->1690     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
## [15] "-----------------------------------------------------------------------------------------"
## [16] "    6 | HANSEN SONG                     |5.0  |W  34|D  29|L  11|W  35|D  10|W  27|W  21|"
## [17] "   OH | 15055204 / R: 1686   ->1687     |N:3  |W    |B    |W    |B    |B    |W    |B    |"
## [18] "-----------------------------------------------------------------------------------------"
## [19] "    7 | GARY DEE SWATHELL               |5.0  |W  57|W  46|W  13|W  11|L   1|W   9|L   2|"
## [20] "   MI | 11146376 / R: 1649   ->1673     |N:3  |W    |B    |W    |B    |B    |W    |W    |"

3 Data Wrangling

3.1 Tournament Data Cleaning

# Drop empty lines (nchar is already vectorized, so sapply is not needed)
tournament.data <- tournament.data[nchar(tournament.data) > 0]
head(tournament.data,10)
##  [1] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"
##  [2] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
##  [3] "-----------------------------------------------------------------------------------------"
##  [4] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|"
##  [5] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
##  [6] "-----------------------------------------------------------------------------------------"
##  [7] "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|"
##  [8] "   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
##  [9] "-----------------------------------------------------------------------------------------"
## [10] "    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|"

The data shows that each player's record spans two consecutive rows followed by a dashed separator row, so the pattern repeats every three lines. The first row holds the player’s name and match results. The second row holds the player’s state and USCF information (ID and ratings).
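The three-line layout can be checked on a small sample before indexing the full file. This is a sketch only: `records` holds the first two players' lines copied from the output above, the separator is rebuilt with `strrep()` (its exact width is an assumption and does not matter here), and `line.type` is an illustrative name.

```r
# First six lines of the cleaned data (two players), copied from the output above
records <- c(
  "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|",
  "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |",
  strrep("-", 89),
  "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|",
  "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |",
  strrep("-", 89)
)

# Label each line by its position within the repeating three-line pattern
line.type <- rep(c("player", "rating", "separator"), length.out = length(records))

which(line.type == "player")  # 1 4
which(line.type == "rating")  # 2 5

# Every line labelled "separator" should consist only of dashes
all(grepl("^-+$", records[line.type == "separator"]))  # TRUE
```

If the last check returned FALSE on the real data, the every-third-line indexing would be misaligned and the extraction steps would read the wrong rows.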

# Extract the player/match rows: every third line, starting from line 1
player.data <- seq(1, length(tournament.data), 3)
player.info <- tournament.data[player.data]
head(player.info)
## [1] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"
## [2] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|"
## [3] "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|"
## [4] "    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|"
## [5] "    5 | HANSHI ZUO                      |5.5  |W  45|W  37|D  12|D  13|D   4|W  14|W  17|"
## [6] "    6 | HANSEN SONG                     |5.0  |W  34|D  29|L  11|W  35|D  10|W  27|W  21|"
# Extract the state/rating rows: every third line, starting from line 2
player.rating.data <- seq(2, length(tournament.data), 3)
player.rating.info <- tournament.data[player.rating.data]
head(player.rating.info)
## [1] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
## [2] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
## [3] "   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
## [4] "   MI | 12616049 / R: 1716   ->1744     |N:2  |W    |B    |W    |B    |W    |B    |B    |"
## [5] "   MI | 14601533 / R: 1655   ->1690     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
## [6] "   OH | 15055204 / R: 1686   ->1687     |N:3  |W    |B    |W    |B    |B    |W    |B    |"

3.2 Extract information of players using RegEx Techniques

Information on player name

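# Match the name field: letters, spaces, and hyphens up to the next "|"
player.name <- str_extract(player.info, "\\s+([[:alpha:]- ]+)\\b\\s*\\|")
# Strip the trailing "|" and the surrounding whitespace
player.name <- gsub(player.name, pattern = "|", replacement = "", fixed = TRUE)
player.name <- trimws(player.name)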
head(player.name)
## [1] "GARY HUA"            "DAKSHESH DARURI"     "ADITYA BAJAJ"       
## [4] "PATRICK H SCHILLING" "HANSHI ZUO"          "HANSEN SONG"

Information on player state

player.state <- str_extract(player.rating.info, "[[:alpha:]]{2}")
head(player.state)
## [1] "ON" "MI" "MI" "MI" "MI" "OH"

Information on player pre-rating score

# Match "R: " followed by the 3- or 4-digit pre-rating, then strip the "R: " prefix
player.prerating.score <- str_extract(player.rating.info, ".\\: \\s?[[:digit:]]{3,4}")
player.prerating.score <- gsub(player.prerating.score, pattern = "R: ", replacement = "", fixed = TRUE)
player.prerating.score <- as.numeric(player.prerating.score)
head(player.prerating.score)
## [1] 1794 1553 1384 1716 1655 1686
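As an alternative to extracting and then stripping the prefix, a capture group can pull the pre-rating out in one step. This is a sketch using base R's `sub()` on the first rating row shown earlier; `rating.line` and `pre.rating` are illustrative names, and the pattern assumes the row contains exactly one `R:` marker.

```r
# First rating row, copied from the output above
rating.line <- "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"

# Keep only the digits that follow "R:" (and any spaces after it)
pre.rating <- as.numeric(sub(".*R:\\s*([0-9]+).*", "\\1", rating.line))

pre.rating  # 1794
```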

Extract players' total points

player.total.points <- str_extract(player.info, "[[:digit:]]+\\.[[:digit:]]")
player.total.points <- as.numeric(as.character(player.total.points))
head(player.total.points)
## [1] 6.0 6.0 6.0 5.5 5.5 5.0

Information on each player's opponents

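# Opponent pair numbers are the 1- or 2-digit numbers immediately before a "|"
player.opponent.info <- str_extract_all(player.info, "[[:digit:]]{1,2}\\|")
# Drop the trailing "|" from each match, then convert to numeric
player.opponent.info <- lapply(player.opponent.info, str_remove, pattern = fixed("|"))
player.opponent.info <- lapply(player.opponent.info, as.numeric)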
head(player.opponent.info)
## [[1]]
## [1] 39 21 18 14  7 12  4
## 
## [[2]]
## [1] 63 58  4 17 16 20  7
## 
## [[3]]
## [1]  8 61 25 21 11 13 12
## 
## [[4]]
## [1] 23 28  2 26  5 19  1
## 
## [[5]]
## [1] 45 37 12 13  4 14 17
## 
## [[6]]
## [1] 34 29 11 35 10 27 21
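The same opponent numbers can be recovered in a single pass with a lookahead, which matches the digits without consuming the `|`. This is a base R sketch on the first player row shown earlier; `player.line`, `matches`, and `opponents` are illustrative names.

```r
# First player row, copied from the output above
player.line <- "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"

# Digits immediately followed by "|"; the lookahead (?=\\|) needs perl = TRUE
matches <- regmatches(player.line, gregexpr("[0-9]+(?=\\|)", player.line, perl = TRUE))
opponents <- as.numeric(matches[[1]])

opponents  # 39 21 18 14 7 12 4
```

The pair number and the total points are skipped because neither is directly followed by `|` in the row.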

Now calculating each player's average opponent rating

# For each player, average the pre-ratings of the opponents they faced
opponent.avg.rating <- list()
for (i in seq_along(player.opponent.info)){
  opponent.avg.rating[i] <- round(mean(player.prerating.score[unlist(player.opponent.info[i])]),2)
}
opponent.avg.rating <- lapply(opponent.avg.rating, as.numeric)
opponent.avg.rating <- data.frame(unlist(opponent.avg.rating))
head(opponent.avg.rating)
##   unlist.opponent.avg.rating.
## 1                     1605.29
## 2                     1469.29
## 3                     1563.57
## 4                     1573.57
## 5                     1500.86
## 6                     1518.71
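The lookup behind this step is plain vector indexing: each player's opponent numbers are used as positions into the vector of pre-ratings. The sketch below shows the idea with `sapply()` on made-up data; `toy.ratings` and `toy.opponents` are hypothetical values, not taken from the tournament file.

```r
# Hypothetical pre-ratings for four players (illustrative values only)
toy.ratings <- c(1500, 1600, 1700, 1800)

# Hypothetical opponent lists: player 1 faced players 2, 3 and 4, and so on
toy.opponents <- list(c(2, 3, 4), c(1, 3), c(1, 2), 1)

# Index the ratings by opponent number and average them, one player at a time
toy.avg <- sapply(toy.opponents, function(idx) mean(toy.ratings[idx]))

toy.avg  # 1700 1600 1550 1500
```

Note that `mean()` of an empty vector is `NaN`, so a player with no recorded opponents would need separate handling.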

3.3 Data PreProcessing

player.df <- cbind.data.frame(player.name, player.state, player.total.points, 
                              player.prerating.score,round(opponent.avg.rating,0))
head(player.df)
##           player.name player.state player.total.points
## 1            GARY HUA           ON                 6.0
## 2     DAKSHESH DARURI           MI                 6.0
## 3        ADITYA BAJAJ           MI                 6.0
## 4 PATRICK H SCHILLING           MI                 5.5
## 5          HANSHI ZUO           MI                 5.5
## 6         HANSEN SONG           OH                 5.0
##   player.prerating.score unlist.opponent.avg.rating.
## 1                   1794                        1605
## 2                   1553                        1469
## 3                   1384                        1564
## 4                   1716                        1574
## 5                   1655                        1501
## 6                   1686                        1519
colnames(player.df)
## [1] "player.name"                 "player.state"               
## [3] "player.total.points"         "player.prerating.score"     
## [5] "unlist.opponent.avg.rating."
play_final_df <- rename(player.df, opp.avg.rating='unlist.opponent.avg.rating.')
colnames(play_final_df)
## [1] "player.name"            "player.state"          
## [3] "player.total.points"    "player.prerating.score"
## [5] "opp.avg.rating"

4 Data Visualization

ggplot(play_final_df, aes(player.prerating.score, opp.avg.rating, color = player.state)) + 
           geom_point(aes(size = player.total.points, shape = player.state))+
           ggtitle('Pre-rating Vs. Opponent Avg. Pre-rating')+
           xlab('Pre-rating')+
           ylab('Opponent Avg. Pre-rating')

5 Export Data

# Export the data frame with the renamed column, without R's row index
write.csv(play_final_df, 'player_chess_data.csv', row.names = FALSE)
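A round trip through a temporary file is one way to confirm what an exported CSV contains. This is a sketch under stated assumptions: `toy.df` is a two-row stand-in built from values shown in the output above, and `csv.path` is a temporary file rather than the project's real output path. Passing `row.names = FALSE` keeps R's row index out of the file, so the CSV holds only the five required fields.

```r
# Two-row stand-in for the final data frame, using values from the output above
toy.df <- data.frame(
  player.name            = c("GARY HUA", "DAKSHESH DARURI"),
  player.state           = c("ON", "MI"),
  player.total.points    = c(6.0, 6.0),
  player.prerating.score = c(1794, 1553),
  opp.avg.rating         = c(1605, 1469)
)

# Write to a temporary file and read it back
csv.path <- tempfile(fileext = ".csv")
write.csv(toy.df, csv.path, row.names = FALSE)
round.trip <- read.csv(csv.path)

names(round.trip)  # the same five column names, with no extra index column
nrow(round.trip)   # 2
```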