In this project, we are given a text file with chess tournament results. The goal is to create an R Markdown file that generates a .CSV file, suitable for loading into a database management system, with the following information for each player:
Player’s Name
Player’s State
Total Number of Points
Player’s Pre-Rating
Average Pre-Chess Rating of Opponents
For example: For the 1st player, the information would be: Gary Hua, ON, 6.0, 1794, 1605
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(stringr)
library(ggplot2)
Save the given text file to GitHub and load the data into R:
chess_data <- read.delim("https://raw.githubusercontent.com/FarhanaAkther23/DATA607/main/Project%201/Project%201.txt", header = FALSE, stringsAsFactors = FALSE)
Let’s take a look at our input data:
head(chess_data, 12)
## V1
## 1 -----------------------------------------------------------------------------------------
## 2 Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round|
## 3 Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
## 4 -----------------------------------------------------------------------------------------
## 5 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|
## 6 ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |
## 7 -----------------------------------------------------------------------------------------
## 8 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|
## 9 MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |
## 10 -----------------------------------------------------------------------------------------
## 11 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|
## 12 MI | 14959604 / R: 1384 ->1640 |N:2 |W |B |W |B |W |B |W |
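The first few lines show that dashed lines separate the players and that each player’s record spans two rows: a “player” row (pair number, name, total points, and round results) and a “ratings” row (state, USCF ID, and pre- and post-ratings).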
We can use regular expressions to extract the information that we need.
Extract information we need from the “player” row.
player <- chess_data[seq(5, nrow(chess_data), 3), ] # The first player row is line 5; each record spans 3 lines
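To see why a step of 3 picks out the right rows, here is a minimal sketch on a hypothetical ten-line stand-in for the file (the `toy` data frame and its contents are made up for illustration). It also shows that indexing a one-column data frame by row returns a plain character vector, which is the form the regular expressions operate on.

```r
# Hypothetical stand-in for the tournament file: 4 header lines,
# then repeating (player row, ratings row, separator) records.
toy <- data.frame(
  V1 = c("-----", "header 1", "header 2", "-----",
         "player A", "ratings A", "-----",
         "player B", "ratings B", "-----"),
  stringsAsFactors = FALSE
)

player_idx  <- seq(5, nrow(toy), by = 3)  # rows 5 and 8
ratings_idx <- seq(6, nrow(toy), by = 3)  # rows 6 and 9

# With a single column, [rows, ] drops the data frame to a character vector.
toy_player  <- toy[player_idx, ]
toy_ratings <- toy[ratings_idx, ]

print(toy_player)
print(toy_ratings)
```

The same two `seq()` calls, starting at 5 and 6, are what separate the “player” rows from the “ratings” rows in the real file.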
We will extract each variable separately to form vectors that we will put together in a data frame.
We will also extract the player ID. This will help us compute the average rating of each player’s opponents.
library(stringr)
playerId <- as.integer(str_extract(player, "\\d+"))
playerName <- str_extract(player, "(\\w+\\s){2,4}(\\w+-\\w+)?") #There are players with 4 names and hyphenated names.
playerName <- str_trim(playerName) # gets rid of the extra blank spaces.
playerPoints <- as.numeric(str_extract(player, "\\d+\\.\\d+"))
playerOpponent <- str_extract_all(player, "\\d+\\|") # opponent IDs are the digits immediately before a "|"
playerOpponent <- lapply(playerOpponent, str_extract, "\\d+") # second step: keep only the digits; lapply avoids the coercion warning raised when a list is passed to str_extract_all
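As a quick check of what each pattern captures, the sketch below applies them to a single player row (Gary Hua's, retyped from the output above, so the spacing is approximate). For the opponent IDs it uses a one-step variant with a lookahead, `\\d+(?=\\|)`, rather than the two-step extraction; this is an alternative shown for illustration, not the method used in this report.

```r
library(stringr)

# Player row for Gary Hua; column widths are approximate and the
# patterns below do not depend on them.
row <- "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"

# First run of digits in the row: the pair number / player ID.
id <- as.integer(str_extract(row, "\\d+"))

# Two to four words, optionally followed by a hyphenated word.
name <- str_trim(str_extract(row, "(\\w+\\s){2,4}(\\w+-\\w+)?"))

# The only decimal number in the row: total points.
points <- as.numeric(str_extract(row, "\\d+\\.\\d+"))

# Digits directly before a "|" (lookahead keeps the "|" out of the match).
opps <- as.integer(str_extract_all(row, "\\d+(?=\\|)")[[1]])

print(id)
print(name)
print(points)
print(opps)
```

The pair number and the points are not picked up as opponents because neither is immediately followed by `|`.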
Extract information needed from the “ratings” row.
ratings = chess_data[seq(6, nrow(chess_data), 3), ] #The ratings data starts at line 6
playerState <- str_extract(ratings, "\\w+")
playerRating <- (str_extract(ratings, "(\\:\\s\\s?\\d+)([[:alpha:]]\\d+)?"))
playerRating <- as.numeric(str_extract(playerRating, "\\d+")) # used a second step to get the player's rating by itself.
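The sketch below runs the same patterns on two ratings rows. The first is Gary Hua's row, retyped from the output above. The second is hypothetical (state `XX` and all of its numbers are made up) and illustrates a rating with a letter suffix such as `P17`, which the optional second group in the pattern is presumably there to absorb.

```r
library(stringr)

rows <- c(
  # Gary Hua's ratings row (spacing approximate).
  "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |",
  # HYPOTHETICAL row with a letter suffix on the pre-rating.
  "   XX | 00000000 / R: 1500P17->1550P24  |N:3  |W    |B    |W    |B    |W    |B    |W    |"
)

state  <- str_extract(rows, "\\w+")                                # first word: state code
raw    <- str_extract(rows, "(\\:\\s\\s?\\d+)([[:alpha:]]\\d+)?")  # ": 1794" or ": 1500P17"
rating <- as.numeric(str_extract(raw, "\\d+"))                     # first run of digits: pre-rating

print(state)
print(raw)
print(rating)
```

Taking the first run of digits in the second step is what discards the suffix, so only the pre-rating itself is converted to a number.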
Put together the data frame:
chess_data_trans <- data.frame(playerId, playerName, playerState, playerPoints, playerRating)
head(chess_data_trans)
## playerId playerName playerState playerPoints playerRating
## 1 1 GARY HUA ON 6.0 1794
## 2 2 DAKSHESH DARURI MI 6.0 1553
## 3 3 ADITYA BAJAJ MI 6.0 1384
## 4 4 PATRICK H SCHILLING MI 5.5 1716
## 5 5 HANSHI ZUO MI 5.5 1655
## 6 6 HANSEN SONG OH 5.0 1686
Add a column that shows the average pre-rating of each player’s opponents. One way to do this is a for loop that looks up the pre-ratings of each player’s opponents and averages them with the mean function.
List of the opponents by player ID:
unlist(playerOpponent[playerId[1]])
## [1] "39" "21" "18" "14" "7" "12" "4"
The output above lists the opponent IDs for Gary Hua (player ID 1).
Pre-rating of each of these opponents:
playerRating[as.numeric(unlist(playerOpponent[playerId[1]]))]
## [1] 1436 1563 1600 1610 1649 1663 1716
The average:
round(mean(playerRating[as.numeric(unlist(playerOpponent[playerId[1]]))]))
## [1] 1605
For loop for all 64 players:
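avgRating <- numeric(length(playerId)) # preallocate one slot per player
for (i in seq_along(playerId)) {
  # look up the pre-ratings of player i's opponents by ID and average them
  avgRating[i] <- round(mean(playerRating[as.numeric(unlist(playerOpponent[playerId[i]]))]))
}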
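The same lookup can also be written without an explicit loop. The sketch below uses `vapply()` on hypothetical toy data (four made-up players whose IDs equal their positions, as in the tournament file); it is an alternative formulation under that assumption, not a replacement for the loop above.

```r
# HYPOTHETICAL toy data: 4 players, IDs 1..4 equal to their positions.
toy_rating   <- c(1500, 1600, 1700, 1400)
toy_opponent <- list(c("2", "3"), c("1", "4"), "1", c("2", "3", "1"))

# For each player, index the ratings vector by opponent ID and average.
toy_avg <- vapply(
  toy_opponent,
  function(ids) round(mean(toy_rating[as.integer(ids)])),
  numeric(1)  # each player yields exactly one number
)

print(toy_avg)
```

Both versions rely on the player ID matching the player's position in the ratings vector; if IDs were not consecutive from 1, the lookup would need an explicit match on ID instead.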
Put everything together in a data frame:
chess_data_trans <- data.frame(playerId, playerName, playerState, playerPoints, playerRating, avgRating)
colnames(chess_data_trans) <- c("Player ID", "Player Name", "State", "Total Points", "Pre-Rating", "Average Opponents Pre-Rating")
head(chess_data_trans, 12)
## Player ID Player Name State Total Points Pre-Rating
## 1 1 GARY HUA ON 6.0 1794
## 2 2 DAKSHESH DARURI MI 6.0 1553
## 3 3 ADITYA BAJAJ MI 6.0 1384
## 4 4 PATRICK H SCHILLING MI 5.5 1716
## 5 5 HANSHI ZUO MI 5.5 1655
## 6 6 HANSEN SONG OH 5.0 1686
## 7 7 GARY DEE SWATHELL MI 5.0 1649
## 8 8 EZEKIEL HOUGHTON MI 5.0 1641
## 9 9 STEFANO LEE ON 5.0 1411
## 10 10 ANVIT RAO MI 5.0 1365
## 11 11 CAMERON WILLIAM MC LEMAN MI 4.5 1712
## 12 12 KENNETH J TACK MI 4.5 1663
## Average Opponents Pre-Rating
## 1 1605
## 2 1469
## 3 1564
## 4 1574
## 5 1501
## 6 1519
## 7 1372
## 8 1468
## 9 1523
## 10 1554
## 11 1468
## 12 1506
Next, compute the average pre-rating for each state:
avg_state_rating <- aggregate(x = chess_data_trans["Pre-Rating"], by = list(State = chess_data_trans$State), FUN = mean, na.rm = TRUE)
avg_state_rating
## State Pre-Rating
## 1 MI 1362.0
## 2 OH 1686.0
## 3 ON 1453.5
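A per-state mean is easier to interpret alongside the number of players behind it. As a rough sketch, the code below computes both for the first six players shown earlier, using base R's `tapply()` and `table()`. Because this is only a subset, the means differ from the full-tournament values above.

```r
# First six players from the data frame shown earlier (a subset only).
state  <- c("ON", "MI", "MI", "MI", "MI", "OH")
rating <- c(1794, 1553, 1384, 1716, 1655, 1686)

state_mean  <- tapply(rating, state, mean)  # mean pre-rating per state
state_count <- table(state)                 # number of players per state

print(state_mean)
print(state_count)
```

In this subset, OH and ON each rest on a single player, which is worth keeping in mind when comparing averages across groups of very different sizes.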
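# Bars are ordered from the highest to the lowest average pre-rating
ggplot(data = avg_state_rating, aes(x = reorder(State, -`Pre-Rating`), y = `Pre-Rating`)) +
  geom_bar(stat = "identity", color = "black", fill = "lightblue") +
  labs(x = "State", y = "Average Pre-Rating") +
  ggtitle("Average Pre-Rating by State")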
write.table(chess_data_trans, file = "chessExtraction.csv", row.names = FALSE, na = "", col.names = TRUE, sep = ",")
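One way to confirm that the file was written as intended is to read it back. The sketch below does this on a small hypothetical data frame written to a temporary file (the `toy` frame and `out_file` path are illustrative only), using the same `write.table()` arguments as above. `check.names = FALSE` keeps column names that contain spaces, such as "Player Name", from being rewritten on import.

```r
# Small illustrative data frame with the same style of column names.
toy <- data.frame(
  "Player Name"  = c("GARY HUA", "DAKSHESH DARURI"),
  "State"        = c("ON", "MI"),
  "Total Points" = c(6.0, 6.0),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Write to a temporary file so nothing in the working directory is touched.
out_file <- tempfile(fileext = ".csv")
write.table(toy, file = out_file, row.names = FALSE, na = "", col.names = TRUE, sep = ",")

# Read it back and inspect the header and row count.
round_trip <- read.csv(out_file, check.names = FALSE, stringsAsFactors = FALSE)
print(names(round_trip))
print(nrow(round_trip))
```

Without `check.names = FALSE`, `read.csv()` would turn "Player Name" into "Player.Name", which can be surprising when comparing against the original data frame.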
In this project, we read a text file of tournament results into R, extracted each player’s name, state, total number of points, and pre-rating, computed the average pre-rating of each player’s opponents, and wrote the result to a .CSV file. We also compared the average pre-ratings by state: in the graph above, OH has the highest average pre-rating, followed by ON and MI.