DATA 607: Project 1

Overview:

In this project, we are given a text file with chess tournament results. The goal is to create an R Markdown file that generates a .CSV file, suitable for loading into a database management system, with the following information for each player:

Player’s Name
Player’s State
Total Number of Points
Player’s Pre-Rating
Average Pre-Chess Rating of Opponents

For example, the information for the first player would be: Gary Hua, ON, 6.0, 1794, 1605

Load required libraries and data

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.3.0      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(stringr)
library(ggplot2)

Save the given text file to GitHub and load the data into R:

chess_data <- read.delim("https://raw.githubusercontent.com/FarhanaAkther23/DATA607/main/Project%201/Project%201.txt", header = FALSE, stringsAsFactors = FALSE)

Let’s take a look at our input data:

head(chess_data, 12)

##                                                                                            V1
## 1   -----------------------------------------------------------------------------------------
## 2   Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| 
## 3   Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | 
## 4   -----------------------------------------------------------------------------------------
## 5       1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|
## 6      ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |
## 7   -----------------------------------------------------------------------------------------
## 8       2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|
## 9      MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |
## 10  -----------------------------------------------------------------------------------------
## 11      3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|
## 12     MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |

The first few lines show that dashed lines separate the players, and that each player's record spans two rows: a “player” row followed by a “ratings” row.

We can use regular expressions to extract the information that we need.

Extract the information we need from the “player” rows.

player <- chess_data[seq(5, nrow(chess_data), 3), ] # The player rows start at line 5 and repeat every 3 lines
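The row selection above relies on the file's fixed layout: a four-line header, then three lines per player. A minimal sketch of the same indexing on a made-up vector (the names `toy_lines`, `player_idx`, and `ratings_idx` are illustrative and not part of the project code):

```r
# Toy stand-in for the file: 4 header lines, then 3-line records
# (player row, ratings row, dashed separator). The contents are made up.
toy_lines <- c("-----", "header 1", "header 2", "-----",
               "1 | PLAYER A", "ON | rating A", "-----",
               "2 | PLAYER B", "MI | rating B", "-----")

# Every third line starting at 5 is a player row; starting at 6, a ratings row.
player_idx  <- seq(5, length(toy_lines), by = 3)
ratings_idx <- seq(6, length(toy_lines), by = 3)

toy_lines[player_idx]
toy_lines[ratings_idx]
```

If the header length or the record length ever changed, both starting offsets and the step of 3 would need to change with it.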

We will extract each variable separately to form vectors that we will put together in a data frame.

We will also extract the player ID (the pair number). This will help us obtain the average opponent rating.

playerId <- as.integer(str_extract(player, "\\d+")) # the first run of digits on the row is the pair number
playerName <- str_extract(player, "(\\w+\\s){2,4}(\\w+-\\w+)?") # some players have four names or hyphenated names
playerName <- str_trim(playerName) # remove the surrounding whitespace
playerPoints <- as.numeric(str_extract(player, "\\d+\\.\\d+"))
playerOpponent <- str_extract_all(player, "\\d+(?=\\|)") # digits immediately before a "|" are opponent pair numbers
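The patterns in this step can be checked against a single player row. The sketch below uses the row for pair number 1 as printed in the input above. The variable names are illustrative, and the lookahead `(?=\\|)` is one way to match the digits that sit directly before a `|` without capturing the bar itself:

```r
library(stringr)

# Player row for pair number 1, copied from the input data shown earlier.
line <- "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"

# First run of digits on the row: the pair number.
id <- as.integer(str_extract(line, "\\d+"))

# Two to four space-terminated words, optionally followed by a hyphenated word.
name <- str_trim(str_extract(line, "(\\w+\\s){2,4}(\\w+-\\w+)?"))

# The only decimal number on the row: total points.
points <- as.numeric(str_extract(line, "\\d+\\.\\d+"))

# Digits immediately followed by "|": the opponents' pair numbers.
opponents <- str_extract_all(line, "\\d+(?=\\|)")[[1]]

id
name
points
opponents
```

The points field does not leak into `opponents` because `6.0` is followed by spaces rather than directly by `|`.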

Extract the information we need from the “ratings” rows.

ratings <- chess_data[seq(6, nrow(chess_data), 3), ] # The ratings rows start at line 6 and repeat every 3 lines

playerState <- str_extract(ratings, "\\w+") # the first word on the row is the state
playerRating <- str_extract(ratings, "(\\:\\s\\s?\\d+)([[:alpha:]]\\d+)?") # pre-rating after "R:", with an optional letter-plus-digits suffix
playerRating <- as.numeric(str_extract(playerRating, "\\d+")) # second step: keep only the leading digits, i.e. the pre-rating itself
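To see what the rating pattern allows for, the sketch below applies it to three ratings rows. The first is copied from the input above. The other two are hypothetical rows written for illustration (placeholder ID `00000000`): one with a letter-plus-digits suffix on the rating and one with a three-digit rating preceded by two spaces.

```r
library(stringr)

rows <- c(
  # Copied from the input data shown earlier.
  "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |",
  # Hypothetical row: rating carries a suffix such as "P17".
  "   MI | 00000000 / R: 1641P17->1657P24  |N:3  |W    |B    |W    |B    |W    |B    |W    |",
  # Hypothetical row: three-digit rating, so two spaces follow the colon.
  "   MI | 00000000 / R:  955   -> 979     |N:4  |W    |B    |W    |B    |W    |B    |W    |"
)

state <- str_extract(rows, "\\w+")

# Step 1: colon, one or two spaces, digits, and an optional suffix.
raw <- str_extract(rows, "(\\:\\s\\s?\\d+)([[:alpha:]]\\d+)?")
# Step 2: the first run of digits in that match is the pre-rating.
rating <- as.numeric(str_extract(raw, "\\d+"))

state
rating
```

The `N:2` field later in the row is not matched because the pattern requires whitespace after the colon.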

Put together the data frame:

chess_data_trans <- data.frame(playerId, playerName, playerState, playerPoints, playerRating)
head(chess_data_trans)

##   playerId          playerName playerState playerPoints playerRating
## 1        1            GARY HUA          ON          6.0         1794
## 2        2     DAKSHESH DARURI          MI          6.0         1553
## 3        3        ADITYA BAJAJ          MI          6.0         1384
## 4        4 PATRICK H SCHILLING          MI          5.5         1716
## 5        5          HANSHI ZUO          MI          5.5         1655
## 6        6         HANSEN SONG          OH          5.0         1686

Add a column that shows the average pre-rating of each player's opponents. This requires looking up the pre-rating of every opponent a player faced and taking the mean, which we do with a for loop.

List of the opponents by player ID:

unlist(playerOpponent[playerId[1]])

## [1] "39" "21" "18" "14" "7"  "12" "4"

The output above lists the IDs of the opponents faced by Gary Hua, who has player ID 1.

Actual pre-rating for each of these opponents:

playerRating[as.numeric(unlist(playerOpponent[playerId[1]]))]

## [1] 1436 1563 1600 1610 1649 1663 1716

The average:

round(mean(playerRating[as.numeric(unlist(playerOpponent[playerId[1]]))]))

## [1] 1605

For loop for all 64 players:

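avgRating <- numeric(length(playerId)) # pre-allocate one slot per player
for (i in seq_along(playerId)) {
  # look up the pre-ratings of player i's opponents and average them
  avgRating[i] <- round(mean(playerRating[as.numeric(unlist(playerOpponent[playerId[i]]))]))
}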
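The same lookup can also be written without an explicit loop. A minimal sketch using `vapply` on made-up data (the names `toy_rating`, `toy_opponents`, and `toy_avg` and all values are illustrative, not taken from the tournament file):

```r
# Made-up pre-ratings for four players, indexed by pair number.
toy_rating <- c(1500, 1600, 1750, 1800)

# Made-up opponent lists: element i holds the pair numbers player i faced.
toy_opponents <- list(c("2", "3"), c("1", "4"), c("1"), c("2", "3", "1"))

# For each player, index the rating vector by opponent number and average.
toy_avg <- vapply(
  toy_opponents,
  function(opp) round(mean(toy_rating[as.numeric(opp)])),
  numeric(1)
)

toy_avg
```

Because each player's opponent list can have a different length (byes and unplayed rounds have no opponent number), a list of vectors is a natural fit here, and `mean` is taken only over the games actually played.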

Average Pre-Chess Ratings of Opponents

Putting it all together in a data frame:

chess_data_trans <- data.frame(playerId, playerName, playerState, playerPoints, playerRating, avgRating)
colnames(chess_data_trans) <- c("Player ID", "Player Name", "State", "Total Points", "Pre-Rating", "Average Opponents Pre-Rating")
head(chess_data_trans, 12)

##    Player ID              Player Name State Total Points Pre-Rating
## 1          1                 GARY HUA    ON          6.0       1794
## 2          2          DAKSHESH DARURI    MI          6.0       1553
## 3          3             ADITYA BAJAJ    MI          6.0       1384
## 4          4      PATRICK H SCHILLING    MI          5.5       1716
## 5          5               HANSHI ZUO    MI          5.5       1655
## 6          6              HANSEN SONG    OH          5.0       1686
## 7          7        GARY DEE SWATHELL    MI          5.0       1649
## 8          8         EZEKIEL HOUGHTON    MI          5.0       1641
## 9          9              STEFANO LEE    ON          5.0       1411
## 10        10                ANVIT RAO    MI          5.0       1365
## 11        11 CAMERON WILLIAM MC LEMAN    MI          4.5       1712
## 12        12           KENNETH J TACK    MI          4.5       1663
##    Average Opponents Pre-Rating
## 1                          1605
## 2                          1469
## 3                          1564
## 4                          1574
## 5                          1501
## 6                          1519
## 7                          1372
## 8                          1468
## 9                          1523
## 10                         1554
## 11                         1468
## 12                         1506

Average Pre-Rating by state:

avg_state_rating <- aggregate(x=chess_data_trans["Pre-Rating"], by = list(State=chess_data_trans$State), FUN = mean, na.rm=TRUE)
avg_state_rating

##   State Pre-Rating
## 1    MI     1362.0
## 2    OH     1686.0
## 3    ON     1453.5
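A group mean is easier to interpret when the group size is shown next to it. The sketch below computes both per state in base R on made-up data (the data frame `toy` and its values are illustrative, not the tournament data):

```r
# Made-up players: state and pre-rating.
toy <- data.frame(
  State     = c("MI", "MI", "OH", "ON", "ON"),
  PreRating = c(1500, 1300, 1686, 1400, 1600)
)

# Mean pre-rating and number of players per state.
means  <- tapply(toy$PreRating, toy$State, mean)
counts <- table(toy$State)

by_state <- data.frame(
  State = names(means),
  Mean  = as.numeric(means),
  N     = as.integer(counts[names(means)])
)

by_state
```

A state represented by a single player, like OH in this toy data, has a mean equal to that one player's rating.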

Visualization:

ggplot(avg_state_rating, aes(x = reorder(State, -`Pre-Rating`), y = `Pre-Rating`)) +
  geom_bar(stat = "identity", color = "black", fill = "lightblue") +
  ggtitle("Average Pre-Rating by State")

Generate a .CSV file:

write.table(chess_data_trans, file = "chessExtraction.csv", row.names = FALSE, na = "", col.names = TRUE, sep = ",")
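One way to check the export is to read the file back and compare it with what was written. A minimal sketch with a one-row made-up data frame and a temporary file (the names `toy`, `path`, and `back` are illustrative). The column names contain spaces, so `check.names = FALSE` is passed to `read.csv` to keep them unchanged:

```r
# One-row stand-in for the output table, with column names that contain spaces.
toy <- data.frame(name = "GARY HUA", state = "ON", points = 6.0)
colnames(toy) <- c("Player Name", "State", "Total Points")

# Write with the same arguments as the export step, but to a temporary file.
path <- tempfile(fileext = ".csv")
write.table(toy, file = path, row.names = FALSE, na = "", col.names = TRUE, sep = ",")

# Read it back; check.names = FALSE stops R from rewriting "Player Name" as "Player.Name".
back <- read.csv(path, check.names = FALSE)
unlink(path)

colnames(back)
back
```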

Conclusion

In this project we read the tournament text file into R, extracted each player's name, state, total number of points, and pre-rating, computed the average pre-rating of each player's opponents, and wrote the result to a .CSV file. We also computed the average pre-rating of the players by state. The bar chart above shows that OH has the highest average pre-rating, followed by ON and MI.