Problem statement

In this project, I have been given a text file with chess tournament results where the information has some structure. My job is to create an R Markdown file that generates a .CSV file (that could for example be imported into a SQL database) with the following information for all of the players:

Player’s Name Player’s State Total Number of Points Player’s Pre-Rating Average Pre-Chess Rating of Opponenets
GARY HUA ON 6 1794 1605

1605 was calculated by using the pre-tournament opponents’ ratings of 1436, 1563, 1600, 1610, 1649, 1663, 1716, and dividing by the total number of games played.

Data in Text File

The chess tournament text file is written in a format that is not legible in R. This file needs to be re-structured in order to find the average pre-tournament score.

Data in text file

Packages:

  • Stringr

  • Tydyverse

  • Dplryr

Key Methods

Using Github URL

https://raw.githubusercontent.com/mgino11/Cleanup_txt_str/main/tournamentinfo.txt

We will perfom a series of methods using the Stringr package Some of the functions used where str_split, str_extract_all.

  • Skip the Headers and get the data
  • Remove the Dash-lines
  • Extraction of Columns using Regular Epressions
    • Obtain Players Names
    • Get the players States
    • Get the total number of Points
    • Get the Pre-Ratings
    • Get the Opponent Strings
    • Get the Individual Opponent into a matrix of 7 columns
    • Remove any Blank rows from the Data

Deconstruct Data

In this data set we already know that the first 4 lines are not part of our data because they include – dash lines, Pair, Num, Dash lines. in addition, we need to extract the data using sequences to be able to see data better and take direction from what we see. In this case we will divide the data in 2 matrices.

  • Matrix 1 = the sequence starts at row 5 with length (chess1)
  • Matrix 2 = the sequence starts at row 6 with length (chess1)
See Code
chess_data <- read.csv(file = "https://raw.githubusercontent.com/mgino11/Cleanup_txt_str/main/tournamentinfo.txt", skip = 3, header = F)

# Create a Matrix and create a sequence
chess1 <- matrix(unlist(chess_data), byrow = TRUE)
matrix_chess1 <- chess1[seq(5, length(chess1), 3)]
head(matrix_chess1)
## [1] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|"
## [2] "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|"
## [3] "    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|"
## [4] "    5 | HANSHI ZUO                      |5.5  |W  45|W  37|D  12|D  13|D   4|W  14|W  17|"
## [5] "    6 | HANSEN SONG                     |5.0  |W  34|D  29|L  11|W  35|D  10|W  27|W  21|"
## [6] "    7 | GARY DEE SWATHELL               |5.0  |W  57|W  46|W  13|W  11|L   1|W   9|L   2|"
matrix_chess2 <- chess1[seq(6,length(chess1),3)]
head(matrix_chess2)
## [1] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
## [2] "   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
## [3] "   MI | 12616049 / R: 1716   ->1744     |N:2  |W    |B    |W    |B    |W    |B    |B    |"
## [4] "   MI | 14601533 / R: 1655   ->1690     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
## [5] "   OH | 15055204 / R: 1686   ->1687     |N:3  |W    |B    |W    |B    |B    |W    |B    |"
## [6] "   MI | 11146376 / R: 1649   ->1673     |N:3  |W    |B    |W    |B    |B    |W    |W    |"

_________________________________________________________________________

String Manipulation and Regular Expressions

In this section we will use Stringr to obtain the data we need to create a data frame

We will extract

  • ID
  • Name
  • State
  • Total number of Points
  • Pre rating score

Name

See Code
# we will use the ID number to obtain an organized Matrix

ID <- as.numeric(str_extract(matrix_chess1, '\\d+'))

# Matching Name Characters any characters with length up to 32 char

Name <- str_extract(matrix_chess1, '[A-z].{1,32}')
head(Name)
## [1] "DAKSHESH DARURI                 |" "ADITYA BAJAJ                    |"
## [3] "PATRICK H SCHILLING             |" "HANSHI ZUO                      |"
## [5] "HANSEN SONG                     |" "GARY DEE SWATHELL               |"
# Extract Name - we get rid of white space 
Name <- str_trim(str_extract(Name, '.+\\s{2,}'))
head(Name)
## [1] "DAKSHESH DARURI"     "ADITYA BAJAJ"        "PATRICK H SCHILLING"
## [4] "HANSHI ZUO"          "HANSEN SONG"         "GARY DEE SWATHELL"

_________________________________________________________________________

State

See Code
# Matrix chess 2 is the one that contains the information for state
# We are asking to extract range of characters from A-Z with length of 2 char
state <- str_extract(matrix_chess2, '[A-Z]{2}')
head(state)
## [1] "MI" "MI" "MI" "MI" "OH" "MI"

_________________________________________________________________________

Total Points

See Code
# First we need to match the total points format 6.0 number followed by a period and a number
total_points <- as.numeric(str_extract(matrix_chess1, '\\d+\\.\\d'))
total_points
##  [1] 6.0 6.0 5.5 5.5 5.0 5.0 5.0 5.0 5.0 4.5 4.5 4.5 4.5 4.5 4.0 4.0 4.0 4.0 4.0
## [20] 4.0 4.0 4.0 4.0 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.0 3.0
## [39] 3.0 3.0 3.0 3.0 3.0 3.0 3.0 2.5 2.5 2.5 2.5 2.5 2.5 2.0 2.0 2.0 2.0 2.0 2.0
## [58] 2.0 1.5 1.5 1.0 1.0 1.0

_________________________________________________________________________

Pre Ratings

See Code
# First we need to Match the combination of "R" and the characters before (-) that information will be  Matrix # 2
pre_rating <- str_extract(matrix_chess2, 'R:.{8,}-')
head(pre_rating)
## [1] "R: 1553   -" "R: 1384   -" "R: 1716   -" "R: 1655   -" "R: 1686   -"
## [6] "R: 1649   -"
# now that we have located the data we need to match and extract 
pre_rating <- as.numeric(str_extract(pre_rating, '\\d{1,4}'))
head(pre_rating)
## [1] 1553 1384 1716 1655 1686 1649

_________________________________________________________________________

Average Pre Chess Ratings

See Code

1605 was calculated by using the pre-tournament opponents’ ratings of 1436, 1563, 1600, 1610, 1649, 1663, 1716, and dividing by the total number of games played.

Data in text file

# Match for all combinations of 1 capital letter and 2 spaces  and all numbers after it
round_opponent <- str_extract(matrix_chess1, '[A-Z]\\s{2,}\\d+')
head(round_opponent)
## [1] "W  63" "L   8" "W  23" "W  45" "W  34" "W  57"
round_opponent <- as.numeric(str_extract(round_opponent, '\\d+'))
head(round_opponent)
## [1] 63  8 23 45 34 57

I will use a loop to calculate the Average

average <- c()

for(i in c(1:length(round_opponent))) {
  average[i] <- round(mean(pre_rating[as.numeric(round_opponent[[i]])]),0)
  
}

head(average)
## [1] 1163 1411 1229  377 1438  917

_________________________________________________________________________

Data Frame to .csv

See Code
Project1_DT607 <- data.frame(ID, Name, state, total_points, pre_rating,average)
head(Project1_DT607)
##   ID                Name state total_points pre_rating average
## 1  2     DAKSHESH DARURI    MI          6.0       1553    1163
## 2  3        ADITYA BAJAJ    MI          6.0       1384    1411
## 3  4 PATRICK H SCHILLING    MI          5.5       1716    1229
## 4  5          HANSHI ZUO    MI          5.5       1655     377
## 5  6         HANSEN SONG    OH          5.0       1686    1438
## 6  7   GARY DEE SWATHELL    MI          5.0       1649     917

_________________________________________________________________________

Solution

ID Player’s Name Player’s State Total Number of Points Player’s Pre-Rating Average Pre-Chess Rating of Opponenets
2 DAKSHESH DARURI MI 6.0 1553 1163
3 ADITYA BAJAJ MI 6.0 1384 1411
4 PATRICK H SCHILLING MI 5.5 1716 1229
5 HANSHI ZUO MI 5.5 1655 377
6 HANSEN SONG OH 5.0 1686 1438
7 GARY DEE SWATHELL MI 5.0 1649 917
8 EZEKIEL HOUGHTON MI 5.0 1641 1716
9 STEFANO LEE ON 5.0 1411 1579
10 ANVIT RAO MI 5.0 1365 1629
11 CAMERON WILLIAM MC LEMAN MI 4.5 1712 1436
12 KENNETH J TACK MI 4.5 1663 1283
13 TORRANCE HENRY JR MI 4.5 1666 980
14 BRADLEY SHAW MI 4.5 1610 1186
15 ZACHARY JAMES HOUGHTON MI 4.5 1220 1595
16 MIKE NIKITIN MI 4.0 1604 1712
17 RONALD GRZEGORCZYK MI 4.0 1629 1291
18 DAVID SUNDEEN MI 4.0 1600 1382
19 DIPANKAR ROY MI 4.0 1564 1604
20 JASON ZHENG MI 4.0 1595 1403
21 DINH DANG BUI ON 4.0 1563 1199
22 EUGENE L MCCLURE MI 4.0 1555 NA
23 ALAN BUI ON 4.0 1363 1655
24 MICHAEL R ALDRICH MI 4.0 1229 1602
25 LOREN SCHWIEBERT MI 3.5 1745 1365
26 MAX ZHU ON 3.5 1579 1056
27 GAURAV GIDWANI MI 3.5 1552 935
28 SOFIA ADINA STANESCU-BELLU MI 3.5 1507 1745
29 CHIEDOZIE OKORIE MI 3.5 1602 1011
30 GEORGE AVERY JONES ON 3.5 1522 1393
31 RISHI SHETTY MI 3.5 1494 853
32 JOSHUA PHILIP MATHEWS ON 3.5 1441 1530
33 JADE GE MI 3.5 1449 955
34 MICHAEL JEFFERY THOMAS MI 3.5 1399 1649
35 JOSHUA DAVID LEE MI 3.5 1438 1362
36 SIDDHARTH JHA MI 3.5 1355 1610
37 AMIYATOSH PWNANANDAM MI 3.5 980 1686
38 BRIAN LIU MI 3.0 1423 1663
39 JOEL R HENDON MI 3.0 1436 1553
40 FOREST ZHANG MI 3.0 1348 1563
41 KYLE WILLIAM MURPHY MI 3.0 1403 967
42 JARED GE MI 3.0 1332 1666
43 ROBERT GLEN VASEY MI 3.0 1283 1555
44 JUSTIN D SCHILLING MI 3.0 1199 1220
45 DEREK YAN MI 3.0 1242 1686
46 JACOB ALEXANDER LAVALLEY MI 3.0 377 1355
47 ERIC WRIGHT MI 2.5 1362 1564
48 DANIEL KHAIN MI 2.5 1382 1600
49 MICHAEL J MARTIN MI 2.5 1291 1552
50 SHIVAM JHA MI 2.5 1056 1522
51 TEJAS AYYAGARI MI 2.5 1011 1507
52 ETHAN GUO MI 2.5 935 1494
53 JOSE C YBARRA MI 2.0 1393 1579
54 LARRY HODGE MI 2.0 1270 1220
55 ALEX KONG MI 2.0 1186 1175
56 MARISA RICCI MI 2.0 1153 1663
57 MICHAEL LU MI 2.0 1092 1641
58 VIRAJ MOHILE MI 2.0 917 1441
59 SEAN M MC CORMICK MI 2.0 853 1332
60 JULIA SHEN MI 1.5 967 1399
61 JEZZEL FARKAS ON 1.5 955 1449
62 ASHWIN BALAJI MI 1.0 1530 1153
63 THOMAS JOSEPH HOSMER MI 1.0 1175 1384
64 BEN LI MI 1.0 1163 1363

_________________________________________________________________________

Data Frame to .CSV

write.csv(Project1_DT607, "Project1.csv")