Project 1

In this project, you’re given a text file with chess tournament results where the information has some structure. Your job is to create an R Markdown file that generates a .CSV file (that could for example be imported into a SQL database) with the following information for all of the players: Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents.

For the first player, the information would be: Gary Hua, ON, 6.0, 1794, 1605

1605 was calculated by using the pre-tournament opponents’ ratings of 1436, 1563, 1600, 1610, 1649, 1663, 1716, and dividing by the total number of games played.

If you have questions about the meaning of the data or the results, please post them on the discussion forum. Data science, like chess, is a game of back and forth… The chess rating system (invented by a Minnesota statistician named Arpad Elo) has been used in many other contexts, including assessing relative strength of employment candidates by human resource departments.

You may substitute another text file (or set of text files, or data scraped from web pages) of similar or greater complexity, and create your own assignment and solution. You may work in a small team. All of your code should be in an R Markdown file (and published to rpubs.com), with your data accessible for the person running the script.
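The worked example for Gary Hua can be checked directly in R. This is a minimal sketch that uses only the seven opponent ratings quoted in the assignment text; the variable names `opp_ratings` and `avg_rating` are illustrative and do not appear elsewhere in this write-up.

```r
# Pre-tournament ratings of Gary Hua's seven opponents, as quoted above
opp_ratings <- c(1436, 1563, 1600, 1610, 1649, 1663, 1716)

# Sum of the ratings divided by the number of games played,
# rounded to a whole rating point
avg_rating <- round(sum(opp_ratings) / length(opp_ratings))
avg_rating
```

The unrounded value is 11237 / 7, which rounds to the 1605 given in the assignment.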

Required Data: Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents

Load the libraries

library(stringr)
library(DT)
library(data.table)

Load the txt file

tournament <- readLines("https://raw.githubusercontent.com/Vthomps000/DATA607_VT/master/tournamentinfo.txt")

## Warning in readLines("https://raw.githubusercontent.com/Vthomps000/DATA607_VT/
## master/tournamentinfo.txt"): incomplete final line found on 'https://
## raw.githubusercontent.com/Vthomps000/DATA607_VT/master/tournamentinfo.txt'

head(tournament)

## [1] "-----------------------------------------------------------------------------------------" 
## [2] " Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| "
## [3] " Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | "
## [4] "-----------------------------------------------------------------------------------------" 
## [5] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|" 
## [6] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"

Inspecting the file shows that each player's record spans two lines. The first line contains the pair number, the player's name, the total points, and the result of each round (outcome and opponent number). The second line contains the state and the pre- and post-tournament ratings. I will first index the two kinds of lines separately, then extract the required fields into vectors, and finally combine those vectors into a data frame.
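The two-line record layout can be illustrated on a single record copied from the `head(tournament)` output. The sketch below is an alternative to fixed line offsets, not the approach used in the rest of this write-up: it drops the dashed separator lines and then alternates between the remaining lines. The names `sample_lines`, `data_lines`, `first_rows`, and `second_rows` are assumptions made for the illustration, and the real file's two header lines would need to be skipped as well.

```r
# One separator plus one two-line record, copied from the head() output
sample_lines <- c(
  "-----------------------------------------------------------------------------------------",
  "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|",
  "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
)

# Drop lines that consist only of dashes
data_lines <- sample_lines[!grepl("^-+$", sample_lines)]

# Odd positions are "name" lines, even positions are "state/rating" lines
first_rows  <- data_lines[c(TRUE, FALSE)]
second_rows <- data_lines[c(FALSE, TRUE)]
```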

Subsetting the required data

summary(tournament)

##    Length     Class      Mode 
##       196 character character

Extracting relevant data by row:

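name <- seq(5, length(tournament), 3)         # first line of each player record
state_point <- seq(6, length(tournament), 3)  # second line of each player record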
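As a quick sanity check on the indexing, the two sequences should have the same length and each second-line index should sit directly after its first-line index. A minimal sketch, assuming the 196-line length reported by `summary(tournament)`; the names `n_lines`, `first_idx`, and `second_idx` are illustrative.

```r
n_lines <- 196  # line count reported by summary(tournament)

first_idx  <- seq(5, n_lines, by = 3)  # lines 5, 8, 11, ...
second_idx <- seq(6, n_lines, by = 3)  # lines 6, 9, 12, ...

# Both sequences should describe the same number of players
length(first_idx)
length(second_idx)

# Each second line immediately follows its first line
all(second_idx - first_idx == 1)
```

Both lengths come out to 64, which matches the 64 averages printed later in this report.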

Extracting the player number, name, and points from the first line of each record

player <- as.integer(str_extract(tournament[name], "\\d+")) 

name_player <- str_replace_all(str_extract(tournament[name],"([|]).+?\\1"),"[|]","") 

points <- str_extract(tournament[name], "\\d\\.\\d")
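The same three fields can also be pulled out by splitting the line on the pipe character, which removes the padding around the name at the same time. This is a base R sketch of a different technique from the regular expressions used above, shown on the first record from the `head(tournament)` output; `line1`, `fields`, `pair_num`, `player_name`, and `total_pts` are names introduced only for this example.

```r
line1 <- "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"

# Split on the literal "|" and strip the padding from every field
fields <- trimws(strsplit(line1, "|", fixed = TRUE)[[1]])

pair_num    <- as.integer(fields[1])  # pair number
player_name <- fields[2]              # name without surrounding spaces
total_pts   <- as.numeric(fields[3])  # total points
```

Trimming matters for the CSV output: without it, every name carries its fixed-width padding into the file.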

Extracting State and rating

state <- str_extract(tournament[state_point], "[A-Z]{2}" ) 
rating <- as.integer(str_replace_all(str_extract(tournament[state_point], "R: \\s?\\d{3,4}"), "R:\\s", ""))
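To see what the two patterns pick up, here is a base R sketch applied to the second line of the first record. It is an equivalent written without stringr so that it runs on its own; `line2`, `state_demo`, and `rating_demo` are illustrative names.

```r
line2 <- "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"

# State: the first run of two capital letters on the line
state_demo <- regmatches(line2, regexpr("[A-Z]{2}", line2))

# Pre-rating: the digits that follow "R:" and any spaces after it;
# anything after those digits (including the post-rating) is discarded
rating_demo <- as.integer(sub(".*R:\\s*(\\d+).*", "\\1", line2, perl = TRUE))
```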

Combining the vectors into a data frame

new_tournament <- data.frame(player, name_player, state, points, rating) 
head(new_tournament)

##   player                       name_player state points rating
## 1      1  GARY HUA                            ON    6.0   1794
## 2      2  DAKSHESH DARURI                     MI    6.0   1553
## 3      3  ADITYA BAJAJ                        MI    6.0   1384
## 4      4  PATRICK H SCHILLING                 MI    5.5   1716
## 5      5  HANSHI ZUO                          MI    5.5   1655
## 6      6  HANSEN SONG                         OH    5.0   1686

Calculating average opponent rating:

opponent <- str_extract_all(tournament[name], "\\d+(?=\\|)")
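The idea behind the opponent extraction is that an opponent's pair number is the only run of digits that sits directly in front of a pipe. A base R sketch on the first record, using a lookahead so that the pipe itself is not captured; `line1` and `opp_ids` are names introduced for this example.

```r
line1 <- "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"

# Digits immediately followed by "|" are opponent pair numbers. The pair
# number and the points total are followed by a space, so they are skipped.
opp_ids <- as.integer(
  regmatches(line1, gregexpr("\\d+(?=\\|)", line1, perl = TRUE))[[1]]
)
opp_ids
```

A round cell that contains no digits produces no match, so it does not count toward the number of games played.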

avg_opp <- numeric(length(name))
for (i in seq_along(name)) {
  # Look up the pre-ratings of player i's opponents and average them
  avg_opp[i] <- round(mean(rating[as.numeric(opponent[[player[i]]])]))
}
avg_opp

##  [1] 1605 1469 1564 1574 1501 1519 1372 1468 1523 1554 1468 1506 1498 1515 1484
## [16] 1386 1499 1480 1426 1411 1470 1300 1214 1357 1363 1507 1222 1522 1314 1144
## [31] 1260 1379 1277 1375 1150 1388 1385 1539 1430 1391 1248 1150 1107 1327 1152
## [46] 1358 1392 1356 1286 1296 1356 1495 1345 1206 1406 1414 1363 1391 1319 1330
## [61] 1327 1186 1350 1263
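The lookup inside the loop works by using each player's opponent numbers as positions in the rating vector. A small sketch on a hypothetical three-player table makes this easier to follow; the ratings and pairings below are invented for illustration and are not taken from the tournament file.

```r
# Hypothetical data: player k has rating toy_rating[k]
toy_rating <- c(1500, 1600, 1700)

# Hypothetical pairings: element k lists the opponents faced by player k
toy_opponent <- list(c(2, 3), c(1, 3), c(1, 2))

# For each player, index the rating vector by opponent number and average
toy_avg <- vapply(
  toy_opponent,
  function(ids) round(mean(toy_rating[ids])),
  numeric(1)
)
toy_avg
```

Player 1 faced players 2 and 3, so the average is (1600 + 1700) / 2 = 1650, and likewise 1600 and 1550 for the other two.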

Adding the average opponent rating to the final data frame, which I named newtournamentinfo. The name column holds the player names from name_player, trimmed of their padding.

newtournamentinfo <- data.frame(player, name = str_trim(name_player), state, points, rating, avg_opp)
head(newtournamentinfo)

##   player                name state points rating avg_opp
## 1      1            GARY HUA    ON    6.0   1794    1605
## 2      2     DAKSHESH DARURI    MI    6.0   1553    1469
## 3      3        ADITYA BAJAJ    MI    6.0   1384    1564
## 4      4 PATRICK H SCHILLING    MI    5.5   1716    1574
## 5      5          HANSHI ZUO    MI    5.5   1655    1501
## 6      6         HANSEN SONG    OH    5.0   1686    1519

Writing CSV

write.csv(newtournamentinfo, file = "newtournamentinfo.csv", row.names = FALSE)
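Before importing the file elsewhere, it can help to read it back and confirm the column layout. The sketch below does a round trip through a temporary file with a one-row data frame; the row uses the first player's values from the assignment text, and `demo`, `csv_path`, and `round_trip` are illustrative names. Passing `row.names = FALSE` keeps R's row index out of the file, so the CSV contains only the required columns.

```r
# One-row stand-in for the final data frame (first player's values)
demo <- data.frame(
  player = 1L, name = "GARY HUA", state = "ON",
  points = "6.0", rating = 1794L, avg_opp = 1605,
  stringsAsFactors = FALSE
)

# Write to a temporary file without the row-name column, then read it back
csv_path <- tempfile(fileext = ".csv")
write.csv(demo, file = csv_path, row.names = FALSE)
round_trip <- read.csv(csv_path, stringsAsFactors = FALSE)

names(round_trip)
```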