Data 607 - Project 1

Introduction

In this project, you’re given a text file with chess tournament results where the information has some structure. Your job is to create an R Markdown file that generates a .CSV file (that could, for example, be imported into a SQL database) with the following information for all of the players: Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents. For the first player, the information would be: Gary Hua, ON, 6.0, 1794, 1605

Pulling in the data

library(readr, quietly = TRUE)
library(stringr, quietly = TRUE)

TournamentInfo = read_csv(file = 'https://raw.githubusercontent.com/dcorrea614/MSDS/master/tournamentinfo.txt',col_names = FALSE)

## Rows: 196 Columns: 1
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): X1
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Configuring the data

Each player occupies two relevant lines in this data. The first contains the player’s name and results. The second contains the player’s state, ID, rating, and board color for each round. We therefore flatten the data into a one-column matrix and pull out these two sets of lines.

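# Flatten the single-column data frame into a one-column character matrix
TourneyMatrix = matrix(unlist(TournamentInfo), byrow=TRUE)
# Rows 1-4 are the header; each player then takes 3 rows (2 data lines + 1 separator)
line1 = TourneyMatrix[seq(5,length(TourneyMatrix),3)]
line2 = TourneyMatrix[seq(6,length(TourneyMatrix),3)]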
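To make the indexing concrete, here is a minimal sketch on a made-up, abbreviated stand-in for the file. The sample lines and the `<id>` placeholders are hypothetical; the only assumption carried over is the layout of four header rows followed by repeating groups of result line, detail line, and separator.

```r
# Hypothetical, abbreviated stand-in for the tournament file
sample_lines = c(
  "-----",                       # row 1: separator
  "Pair | Player Name |Total",   # row 2: header
  "Num  | USCF ID     | Pts",    # row 3: header
  "-----",                       # row 4: separator
  "1 | GARY HUA |6.0",           # row 5: player 1, result line
  "ON | <id> / R: 1794",         # row 6: player 1, detail line
  "-----",                       # row 7: separator
  "2 | DAKSHESH DARURI |6.0",    # row 8: player 2, result line
  "MI | <id> / R: 1553",         # row 9: player 2, detail line
  "-----"                        # row 10: separator
)

# Start at row 5 (or 6) and step by 3 to land on the same line of every player
first_lines  = sample_lines[seq(5, length(sample_lines), 3)]
second_lines = sample_lines[seq(6, length(sample_lines), 3)]

print(first_lines)
print(second_lines)
```

Here `seq(5, 10, 3)` yields rows 5 and 8, and `seq(6, 10, 3)` yields rows 6 and 9, so the separators and headers are never selected.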

Extracting the data

Now we take the data in these two character vectors and build one vector for each column of the final data frame. The name is easy to extract because a wide run of spaces separates it from the next field.

# [A-Za-z] matches letters only; the range A-z would also match [ \ ] ^ _ `
Name = str_trim(str_extract(line1, '[A-Za-z].{1,25}'),'right')
head(Name)

## [1] "GARY HUA"            "DAKSHESH DARURI"     "ADITYA BAJAJ"       
## [4] "PATRICK H SCHILLING" "HANSHI ZUO"          "HANSEN SONG"
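The name pattern can be tried on a single line in isolation. The sketch below builds a hypothetical sample line (the padding width of 24 spaces is an assumption for illustration, not a measurement of the real file) and shows that the match starts at the first letter, runs for at most 26 characters, and is then trimmed on the right. It also shows why `[A-Za-z]` is the stricter character class.

```r
library(stringr)

# Hypothetical sample line: pair number, padded name field, points, one round
sample_line = paste0("1 | GARY HUA", strrep(" ", 24), "|6.0  |W  39|")

# First letter plus up to 25 following characters (greedy)
raw_name = str_extract(sample_line, "[A-Za-z].{1,25}")

# Drop the padding that follows the name
clean_name = str_trim(raw_name, side = "right")

print(nchar(raw_name))
print(clean_name)

# The range A-z also covers the punctuation between "Z" and "a"
print(str_detect("_", "[A-z]"))
print(str_detect("_", "[A-Za-z]"))
```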

Next we pull the state, which is the first two letters on the second line of info.

State = str_extract(line2, '[A-Z]{2}')
head(State)

## [1] "ON" "MI" "MI" "MI" "MI" "OH"

Then we pull the number of points each player earned in the tournament.

# Points sit within characters 35-44; match digits with an optional decimal part
Points = as.numeric(str_extract(str_sub(line1,35,44),'\\d+\\.?\\d*'))
head(Points)

## [1] 6.0 6.0 6.0 5.5 5.5 5.0
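Because the fields on each line are separated by `|` characters, the name and points could also be recovered by splitting on that delimiter rather than by pattern or position. This is a different technique from the one used above, sketched in base R on a hypothetical sample line whose spacing is an assumption.

```r
# Hypothetical sample line modeled on the first line of a player record
sample_line = "1 | GARY HUA                        |6.0  |W  39|W  21|"

# Split on the literal "|" (fixed = TRUE avoids treating it as regex alternation)
fields = strsplit(sample_line, "|", fixed = TRUE)[[1]]

pair_number   = as.numeric(trimws(fields[1]))  # first field: pair number
player_name   = trimws(fields[2])              # second field: padded name
player_points = as.numeric(trimws(fields[3]))  # third field: total points
round_results = trimws(fields[-(1:3)])         # remaining fields: one per round

print(player_name)
print(player_points)
print(round_results)
```

Splitting does not depend on column positions, so it is unaffected when a two-digit pair number shifts the rest of the line by one character.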

Now we’ll pull each player’s pre-tournament rating from line2.

Rating = as.numeric(str_sub(line2,20,23))
head(Rating)

## [1] 1794 1553 1384 1716 1655 1686
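Fixed character positions work when every line is laid out identically. As a hedged alternative, the sketch below anchors on the `R:` label instead. The sample lines are hypothetical, including the `<id>` placeholders and the assumption that a rating may be shorter than four digits or carry a suffix such as `P17` directly after the number.

```r
library(stringr)

# Hypothetical second lines; the IDs are placeholders and the suffix format is assumed
sample_line2 = c(
  "ON | <id> / R: 1794   ->1817     |N:2  |",
  "MI | <id> / R: 1641P17->1657P24  |N:3  |",
  "MI | <id> / R:  980P12->1077P17  |N:4  |"
)

# Capture the first run of digits after "R:" and any spaces
pre_rating = as.numeric(str_match(sample_line2, "R:\\s*(\\d+)")[, 2])

# The state is the two capital letters at the start of the line
state = str_extract(sample_line2, "^[A-Z]{2}")

print(pre_rating)
print(state)
```

The capture group stops at the first non-digit, so any suffix after the rating is ignored without extra cleanup.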

Now we must calculate the average pre-rating of each player’s opponents. On line 1, an opponent’s pair number is the only number immediately followed by a | character, so we extract those numbers, use them to index into Rating, and average the result for each player.

# Capture only the digits that are immediately followed by "|" (the opponents' pair numbers)
Rounds = str_extract_all(line1, '\\d+(?=\\|)')

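# Average the pre-ratings of each player's opponents, rounded to a whole number
OppRating = numeric(length(Rounds))
for(i in seq_along(Rounds)){
  OppRating[i] = round(mean(Rating[as.numeric(Rounds[[i]])]),0)
}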
OppRating

##  [1] 1605 1469 1564 1574 1501 1519 1372 1468 1523 1554 1468 1506 1498 1515 1484
## [16] 1386 1499 1480 1426 1411 1470 1300 1214 1357 1363 1507 1222 1522 1314 1144
## [31] 1260 1379 1277 1375 1150 1388 1385 1539 1430 1391 1248 1150 1107 1327 1152
## [46] 1358 1392 1356 1286 1296 1356 1495 1345 1206 1406 1414 1363 1391 1319 1330
## [61] 1327 1186 1350 1263
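The lookup-and-average step is easier to follow on a tiny example. The sketch below uses three hypothetical players with made-up ratings and results, including rounds marked `H` with no opponent number, which the pattern skips. It assumes, as in the data above, that a player’s pair number equals their position in the rating vector.

```r
library(stringr)

# Hypothetical first lines for three players over two rounds
toy_line1 = c(
  "1 | PLAYER A |2.0  |W   2|W   3|",
  "2 | PLAYER B |0.5  |L   1|H    |",
  "3 | PLAYER C |0.5  |H    |L   1|"
)

# Hypothetical pre-ratings, indexed by pair number
toy_rating = c(1500, 1400, 1300)

# Digits immediately followed by "|" are the opponents' pair numbers
opp_ids = str_extract_all(toy_line1, "\\d+(?=\\|)")

# Look up each opponent's rating and average per player
opp_avg = vapply(
  opp_ids,
  function(ids) round(mean(toy_rating[as.numeric(ids)]), 0),
  numeric(1)
)

print(opp_ids)
print(opp_avg)
```

Player A faced pair numbers 2 and 3, so the average is the mean of 1400 and 1300; players B and C each faced only player 1.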

Putting it all together

Now we will construct the data frame and write the CSV file.

tournamentFinal = data.frame(Name,State,Points,Rating,OppRating)
head(tournamentFinal)

##                  Name State Points Rating OppRating
## 1            GARY HUA   ON     6.0   1794      1605
## 2     DAKSHESH DARURI   MI     6.0   1553      1469
## 3        ADITYA BAJAJ   MI     6.0   1384      1564
## 4 PATRICK H SCHILLING   MI     5.5   1716      1574
## 5          HANSHI ZUO   MI     5.5   1655      1501
## 6         HANSEN SONG   OH     5.0   1686      1519

write_csv(tournamentFinal,'tournamentFinal.csv')
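As an optional sanity check, the written file can be read back and compared with the data frame. The sketch below is self-contained: it uses a two-row stand-in built from the first two rows shown above and a temporary file path, both of which are assumptions for illustration rather than part of the original workflow.

```r
library(readr)

# Two-row stand-in for tournamentFinal, using values shown in the output above
demo = data.frame(
  Name      = c("GARY HUA", "DAKSHESH DARURI"),
  State     = c("ON", "MI"),
  Points    = c(6.0, 6.0),
  Rating    = c(1794, 1553),
  OppRating = c(1605, 1469)
)

# Write to a temporary file, then read it back without the column-type message
demo_path = tempfile(fileext = ".csv")
write_csv(demo, demo_path)
check = read_csv(demo_path, show_col_types = FALSE)

# The round trip should preserve the row count, column names, and values
print(nrow(check) == nrow(demo))
print(identical(names(check), names(demo)))
print(all(check$OppRating == demo$OppRating))
```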