Project1-ChessDataAnalysis

DATA 607 Project 1

Take a chess tournament results text file and create an R Markdown document that generates a structured CSV file. Each row of the output CSV should contain:

1. Player’s Name
2. Player’s State
3. Total Number of Points
4. Player’s Pre-Rating
5. Average Pre Chess Rating of Opponents

For example, the first player would be Gary Hua, ON, 6.0, 1794, 1605

Loading Packages

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0     ✔ purrr   1.0.1
## ✔ tibble  3.1.8     ✔ dplyr   1.1.0
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.4     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Text Import

Because the text file on Blackboard sits behind a login wall and can’t be read directly, I uploaded it to my own GitHub repository and read it from there.

Reading the file is somewhat difficult because it has an unconventional structure. The simplest solution I could find was using readLines.

chess <- readLines("https://raw.githubusercontent.com/rossboehme/DATA607/main/project1/data607-project1-chess.txt")

Scraping File With Regex

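My approach is to first remove unnecessary lines and then split what remains into groups based on the information a given line provides. (readLines returns a character vector with one element per line, not a data frame.) Working on one kind of line at a time should make my regexes simpler.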

Removing Unnecessary Info; Preparing Data for Scraping

#Remove first four lines
chess <- chess[-(1:4)]

#Splitting the remaining lines into groups based on their position in each three-line block

#Lines 1, 4, 7, etc. contain Pair Number, Player Name, Total Points, and Opponent info.
#Storing these as "PairNamePointsOpp"
PairNamePointsOpp <- chess[seq(1, length(chess),3)]

#Lines 2, 5, 8, etc. contain Player State and Pre Rating.
#Storing these as "StateRating"
StateRating <- chess[seq(2, length(chess),3)]

#Rows 3, 6, 9, etc. can be removed. Don't need to be saved.
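
The every-third-line split can be tried on a toy vector first. This is only a sketch: `toy`, `results_rows` and `state_rows` are made-up names, and the strings stand in for the real file lines.

```r
# Toy stand-in for the trimmed `chess` vector: each player occupies three
# consecutive lines (results line, state/rating line, dashed separator).
# The contents are invented for illustration.
toy <- c("player 1 results", "player 1 state", "-----",
         "player 2 results", "player 2 state", "-----")

# seq(from, to, by = 3) picks every third index starting at `from`
results_rows <- toy[seq(1, length(toy), by = 3)]
state_rows   <- toy[seq(2, length(toy), by = 3)]

# Both groups should hold one entry per player
stopifnot(length(results_rows) == length(state_rows))
results_rows
state_rows
```

Checking that both groups have the same length is a cheap way to confirm that the header lines were trimmed correctly before any regex is applied.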

Scraping Pair Number, Player’s Name, Total Points, and Opponent Info

#Note: I put my regexes on lines by themselves so they are easier to read

pair_num <- as.integer(unlist(str_extract_all(PairNamePointsOpp,
                                              "(?<=\\s{3,4})\\d{1,2}(?=\\s)"
                                              )))

name <- unlist(trimws(str_extract(PairNamePointsOpp,
                           "([[A-Z]]+\\s){2,3}"
                           )))

points <- as.numeric(unlist(str_extract(PairNamePointsOpp,
                             "\\d+\\.\\d+"
                             )))

#Opponent pair numbers are the digits immediately before each "|"
opponents <- as.integer(unlist(str_extract_all(PairNamePointsOpp,
                                               "[0-9]+(?=\\|)"
                                               )))
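
The data frame printed at the end of this document contains names such as "CAMERON WILLIAM MC" and "SEAN M MC", which suggests that the `{2,3}` quantifier may cut off names with more than three parts. One possible alternative, sketched here under the assumption that the name is always the second `|`-delimited field, is to split on the pipe and trim. The sample line and the names `line`, `fields` and `player_name` are hypothetical.

```r
library(stringr)

# Hypothetical results line in the assumed layout:
# pair number | name | points | one field per round
line <- "   11 | ALPHA BETA MC GAMMA             |4.5  |W  35|B  10|"

# Split on the literal pipe and keep the second field, whatever its word count
fields <- str_split(line, fixed("|"), simplify = TRUE)
player_name <- str_trim(fields[, 2])
player_name
```

Because the field boundaries come from the file layout rather than from a word count, this approach does not depend on how many parts a name has.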

Scraping Player’s State, Pre Rating

state <- str_extract(StateRating, 
                     "[[A-Z]]+"
                     )

# I'll pull two versions of the "pre ratings":
# 1) The full version keeps any alphabetical characters, e.g. "EZEKIEL HOUGHTON": 1641P17. I'll store this as pre_rating.
# 2) Since I need to average the Pre Ratings of each player's opponents for my final product, I'll also pull an adjusted ("adj") version with only the numeric characters, and use that for my calculations.
pre_rating <- unlist(trimws(str_extract(StateRating, 
                                                 "(?<=>)(\\s)?[0-9A-Z]{3,7}"
                                     )))

pre_rating_adj <- as.integer(unlist(trimws(str_extract(StateRating, 
                                                 "(?<=>)(\\s)?[0-9]{3,4}"
                                     ))))
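
The assignment's example row gives Gary Hua a pre-rating of 1794, while the data frame printed at the end of this document shows 1817 for him (and 1657P24 rather than 1641P17 for Ezekiel Houghton). That suggests the `(?<=>)` lookbehind is matching the value after the `->` arrow, i.e. the post-rating. A minimal sketch of anchoring on `R:` instead, assuming each state line has the form `R: <pre>-><post>`; the sample lines and ID numbers below are placeholders, not rows copied from the file.

```r
library(stringr)

# Assumed layout of a state/rating line: "R:" precedes the pre-rating and
# "->" precedes the post-rating. IDs are placeholders.
lines <- c("   ON | 00000001 / R: 1794   ->1817     |N:2  |W    |",
           "   MI | 00000002 / R: 1641P17->1657P24  |N:3  |B    |")

# Digits that follow "R:" (after optional padding) are the pre-rating;
# a provisional suffix such as "P17" is left out of the numeric value
pre <- as.integer(str_trim(str_extract(lines, "(?<=R:)\\s*\\d{3,4}")))

# Digits that follow ">" are the post-rating, which "(?<=>)" captures
post <- as.integer(str_trim(str_extract(lines, "(?<=>)\\s*\\d{3,4}")))

pre
post
```

If the real lines follow this layout, swapping the lookbehind in both rating regexes would be enough; the averages would then need to be recomputed from the corrected values.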

Putting Scraped Information into Dataframe

chess_cleaned <- data.frame(name,state,points,pre_rating)

Calculating Pre Chess Rating of Opponents; Adding to DF

As a final step, I need to add the Average Pre Chess Rating of Opponents to the chess_cleaned df created above. Each player (“name”) has 6.375 opponents on average, fewer than the 7 rounds recorded on each line, so some players played fewer than 7 matches.

#Average of 6.375 opponents for every player
length(opponents) / length(name)

## [1] 6.375

As a solution, I’ll create a matrix containing the pair numbers of the opponents each player (“name”) faced, one row per player and one column per round. Missed matches appear as NA values.

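#Split each line on "|"; columns 4 to 10 hold the seven round results
col_df <- str_split(PairNamePointsOpp, pattern = "\\|",simplify = TRUE)

#Keep only the opponent's pair number from each round; rounds with no opponent become NA
opp_matrix <- matrix(as.integer(str_extract(col_df[,4:10], pattern = "[0-9]+")), ncol = 7)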
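
To see what the split produces, here is a small sketch on two invented lines. The layout (pair number, name, points, then seven round fields) is assumed from the column indices used above, and `lines`, `cols` and `opp` are illustrative names.

```r
library(stringr)

# Two hypothetical results lines; a round field without digits has no opponent
lines <- c("    1 | PLAYER ONE                      |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|",
           "   62 | PLAYER TWO                      |1.0  |W  55|U    |U    |U    |U    |U    |U    |")

# One row per line, one column per "|"-delimited field
cols <- str_split(lines, fixed("|"), simplify = TRUE)

# Columns 4 to 10 hold the seven rounds; pull the digits, NA where there are none.
# The extracted values come back in column-major order, so matrix() restores the shape.
opp <- matrix(as.integer(str_extract(cols[, 4:10], "\\d+")), ncol = 7)
opp
```

The second row ends up with one pair number and six NA values, which is the shape the averaging step relies on.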

I’ll then loop over the rows of that matrix and, for each row, average the opponents’ pre ratings, skipping NA values.

avg_opp_pre_rating <- numeric(nrow(opp_matrix))

for(i in seq_len(nrow(opp_matrix))){
  avg_opp_pre_rating[i] <- round(mean(pre_rating_adj[opp_matrix[i,]], na.rm = TRUE),0)
}
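
The same lookup-and-average step can be checked by hand on toy numbers. This sketch uses a vectorised form with `rowMeans` instead of a loop; `ratings`, `opp`, `opp_ratings` and `avg` are made-up names and values.

```r
# Toy ratings indexed by pair number (invented values)
ratings <- c(1500L, 1400L, 1300L, 1200L)

# Each row lists the pair numbers one player faced; NA marks an unplayed round
opp <- matrix(c(2, 3,  4,
                1, NA, NA),
              nrow = 2, byrow = TRUE)

# Indexing a plain vector with a matrix returns a vector in column-major
# order, so reshape it back before taking row means
opp_ratings <- matrix(ratings[opp], nrow = nrow(opp))

# na.rm = TRUE averages over the rounds that were actually played
avg <- round(rowMeans(opp_ratings, na.rm = TRUE))
avg
```

Player 1 faced pairs 2, 3 and 4, so the average is (1400 + 1300 + 1200) / 3 = 1300. Player 2 only faced pair 1, so the average is 1500 rather than being dragged down by the two missing rounds.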

Finally, I’ll add this column to my df.

chess_cleaned$avg_opp_pre_rating <- avg_opp_pre_rating

Showing Dataframe; Writing to CSV

chess_cleaned

##                        name state points pre_rating avg_opp_pre_rating
## 1                  GARY HUA    ON    6.0       1817               1611
## 2           DAKSHESH DARURI    MI    6.0       1663               1468
## 3              ADITYA BAJAJ    MI    6.0       1640               1558
## 4       PATRICK H SCHILLING    MI    5.5       1744               1598
## 5                HANSHI ZUO    MI    5.5       1690               1510
## 6               HANSEN SONG    OH    5.0       1687               1520
## 7         GARY DEE SWATHELL    MI    5.0       1673               1508
## 8          EZEKIEL HOUGHTON    MI    5.0    1657P24               1526
## 9               STEFANO LEE    ON    5.0       1564               1517
## 10                ANVIT RAO    MI    5.0       1544               1537
## 11       CAMERON WILLIAM MC    MI    4.5       1696               1506
## 12           KENNETH J TACK    MI    4.5       1670               1544
## 13        TORRANCE HENRY JR    MI    4.5       1662               1538
## 14             BRADLEY SHAW    MI    4.5       1618               1507
## 15   ZACHARY JAMES HOUGHTON    MI    4.5    1416P20               1459
## 16             MIKE NIKITIN    MI    4.0       1613               1481
## 17       RONALD GRZEGORCZYK    MI    4.0       1610               1499
## 18            DAVID SUNDEEN    MI    4.0       1600               1530
## 19             DIPANKAR ROY    MI    4.0       1570               1509
## 20              JASON ZHENG    MI    4.0       1569               1437
## 21            DINH DANG BUI    ON    4.0       1562               1498
## 22         EUGENE L MCCLURE    MI    4.0       1529               1348
## 23                 ALAN BUI    ON    4.0       1371               1323
## 24        MICHAEL R ALDRICH    MI    4.0       1300               1339
## 25         LOREN SCHWIEBERT    MI    3.5       1681               1450
## 26                  MAX ZHU    ON    3.5       1564               1522
## 27           GAURAV GIDWANI    MI    3.5       1539               1370
## 28              SOFIA ADINA    MI    3.5       1513               1534
## 29         CHIEDOZIE OKORIE    MI    3.5    1508P12               1344
## 30       GEORGE AVERY JONES    ON    3.5       1444               1188
## 31             RISHI SHETTY    MI    3.5       1444               1276
## 32    JOSHUA PHILIP MATHEWS    ON    3.5       1433               1394
## 33                  JADE GE    MI    3.5       1421               1330
## 34   MICHAEL JEFFERY THOMAS    MI    3.5       1400               1389
## 35         JOSHUA DAVID LEE    MI    3.5       1392               1264
## 36            SIDDHARTH JHA    MI    3.5       1367               1398
## 37     AMIYATOSH PWNANANDAM    MI    3.5    1077P17               1396
## 38                BRIAN LIU    MI    3.0       1439               1547
## 39            JOEL R HENDON    MI    3.0       1413               1434
## 40             FOREST ZHANG    MI    3.0       1346               1379
## 41      KYLE WILLIAM MURPHY    MI    3.0     1341P9               1250
## 42                 JARED GE    MI    3.0       1256               1154
## 43        ROBERT GLEN VASEY    MI    3.0       1244               1211
## 44       JUSTIN D SCHILLING    MI    3.0       1199               1334
## 45                DEREK YAN    MI    3.0       1191               1163
## 46 JACOB ALEXANDER LAVALLEY    MI    3.0    1076P10               1349
## 47              ERIC WRIGHT    MI    2.5       1341               1411
## 48             DANIEL KHAIN    MI    2.5       1335               1345
## 49         MICHAEL J MARTIN    MI    2.5    1259P17               1262
## 50               SHIVAM JHA    MI    2.5       1111               1358
## 51           TEJAS AYYAGARI    MI    2.5       1097               1339
## 52                ETHAN GUO    MI    2.5       1092               1454
## 53            JOSE C YBARRA    MI    2.0       1359               1320
## 54              LARRY HODGE    MI    2.0       1200               1236
## 55                ALEX KONG    MI    2.0       1163               1400
## 56             MARISA RICCI    MI    2.0       1140               1376
## 57               MICHAEL LU    MI    2.0       1079               1357
## 58             VIRAJ MOHILE    MI    2.0        941               1378
## 59                SEAN M MC    MI    2.0        878               1316
## 60               JULIA SHEN    MI    1.5        984               1314
## 61            JEZZEL FARKAS    ON    1.5     979P18               1342
## 62            ASHWIN BALAJI    MI    1.0       1535               1163
## 63     THOMAS JOSEPH HOSMER    MI    1.0       1125               1338
## 64                   BEN LI    MI    1.0       1112               1315

#Change this path to wherever the CSV should be saved. Mine points to my desktop.
write.csv(chess_cleaned,"C:\\Users\\rossboehme\\Desktop\\chess.csv",row.names=FALSE)
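
For a version that does not depend on one machine's folder layout, a sketch along these lines may help. The `demo` data frame is a made-up stand-in for chess_cleaned, and `tempdir()` is used only so the example runs anywhere; any writable directory would do.

```r
# Made-up one-row data frame standing in for chess_cleaned
demo <- data.frame(name = "PLAYER ONE", state = "MI", points = 6.0)

# file.path() joins path components with a separator that works on the current OS
out_path <- file.path(tempdir(), "chess.csv")
write.csv(demo, out_path, row.names = FALSE)

# Read the file back to confirm the round trip
check <- read.csv(out_path)
check
```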