library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(readr)
In this project, you’re given a text file with chess tournament results where the information has some structure. Your job is to create an R Markdown file that generates a .CSV file (that could for example be imported into a SQL database) with the following information for all of the players:
Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents.
#Load dataset
Here we get an idea of the number of observations in the dataset and what to expect with the setup, such as the column names.
rawlink <- 'https://raw.githubusercontent.com/rkasa01/tournament_info/main/tournamentinfo.txt'
dataset <- read.csv(rawlink)
head(dataset)
## X.........................................................................................
## 1 Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round|
## 2 Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
## 3 -----------------------------------------------------------------------------------------
## 4 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|
## 5 ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |
## 6 -----------------------------------------------------------------------------------------
summary(dataset)
## X.........................................................................................
## Length:195
## Class :character
## Mode :character
#Reformatting the dataset to get what we need
We are only looking for the Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents, meaning that we don’t need a lot of this data. To do this, we need to remove what we don’t need from the data, and then rename the new data accordingly.
It looks like each player’s data is formatted into 2 rows so this must first be separated. We can extract data by starting with the first player at the 4th row, and taking every 3rd row thereafter to get the next player, and so on.
scores <- matrix(unlist(dataset), byrow=TRUE)
score_rows<- scores[seq(4,length(scores),3)]
head(score_rows)
## [1] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [2] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|"
## [3] " 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|"
## [4] " 4 | PATRICK H SCHILLING |5.5 |W 23|D 28|W 2|W 26|D 5|W 19|D 1|"
## [5] " 5 | HANSHI ZUO |5.5 |W 45|W 37|D 12|D 13|D 4|W 14|W 17|"
## [6] " 6 | HANSEN SONG |5.0 |W 34|D 29|L 11|W 35|D 10|W 27|W 21|"
Since each player’s scores are tracked between 2 rows, we can also extract data by starting with the first player at the 5th row, and then taking every 3rd row thereafter to get the next player, and so on.
new_score_rows <- scores[seq(5,length(scores),3)]
head(new_score_rows)
## [1] " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
## [2] " MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |"
## [3] " MI | 14959604 / R: 1384 ->1640 |N:2 |W |B |W |B |W |B |W |"
## [4] " MI | 12616049 / R: 1716 ->1744 |N:2 |W |B |W |B |W |B |B |"
## [5] " MI | 14601533 / R: 1655 ->1690 |N:2 |B |W |B |W |B |W |B |"
## [6] " OH | 15055204 / R: 1686 ->1687 |N:3 |W |B |W |B |B |W |B |"
Using regexes, we can extract only the data which we are looking for in the strings.
Player_Number <- as.numeric(str_extract(score_rows, '\\d+'))
Player_Name <- str_extract(score_rows, '[A-z].{1,32}')
Player_Name <- str_trim(str_extract(Player_Name, '.+\\s{2,}'))
State <- str_extract(new_score_rows, '[A-Z]{2}')
Total_Points <- as.numeric(str_extract(score_rows, '\\d+\\.\\d'))
Pre_Rating <- str_extract(new_score_rows, 'R:.{8,}-')
Pre_Rating <- as.numeric(str_extract(Pre_Rating, '\\d{1,4}'))
Round <- str_extract_all(score_rows, '[A-Z]\\s{2,}\\d+')
Round <- str_extract_all(Round, '\\d+')
## Warning in stri_extract_all_regex(string, pattern, simplify = simplify, :
## argument is not an atomic vector; coercing
The average pre chess ratings of opponents are not given in the data set, therefore these values must be calculated and placed into a new column.
Opp_Pre_Rating_Avg <- c()
for(i in c(1:length(Round))){
Opp_Pre_Rating_Avg[i] <- round(mean(Pre_Rating[as.numeric(Round[[i]])]),0)
}
Opp_Pre_Rating_Avg
## [1] 1605 1469 1564 1574 1501 1519 1372 1468 1523 1554 1468 1506 1498 1515 1484
## [16] 1386 1499 1480 1426 1411 1470 1300 1214 1357 1363 1507 1222 1522 1314 1144
## [31] 1260 1379 1277 1375 1150 1388 1385 1539 1430 1391 1248 1150 1107 1327 1152
## [46] 1358 1392 1356 1286 1296 1356 1495 1345 1206 1406 1414 1363 1391 1319 1330
## [61] 1327 1186 1350 1263
Project1 <- data.frame(Player_Number,Player_Name,State,Total_Points,Pre_Rating,Opp_Pre_Rating_Avg)
head(Project1)
## Player_Number Player_Name State Total_Points Pre_Rating
## 1 1 GARY HUA ON 6.0 1794
## 2 2 DAKSHESH DARURI MI 6.0 1553
## 3 3 ADITYA BAJAJ MI 6.0 1384
## 4 4 PATRICK H SCHILLING MI 5.5 1716
## 5 5 HANSHI ZUO MI 5.5 1655
## 6 6 HANSEN SONG OH 5.0 1686
## Opp_Pre_Rating_Avg
## 1 1605
## 2 1469
## 3 1564
## 4 1574
## 5 1501
## 6 1519
write.csv(Project1, "Project1.csv", row.names = FALSE)