Problem statement
In this project, I have been given a text file with chess tournament results where the information has some structure. My job is to create an R Markdown file that generates a .CSV file (that could for example be imported into a SQL database) with the following information for all of the players:
Player’s Name | Player’s State | Total Number of Points | Player’s Pre-Rating | Average Pre-Chess Rating of Opponenets |
---|---|---|---|---|
GARY HUA | ON | 6 | 1794 | 1605 |
1605 was calculated by using the pre-tournament opponents’ ratings of 1436, 1563, 1600, 1610, 1649, 1663, 1716, and dividing by the total number of games played.
Data in Text File
The chess tournament text file is written in a format that is not legible in R. This file needs to be re-structured in order to find the average pre-tournament score.
Data in text file
Packages:
Stringr
Tydyverse
Dplryr
Key Methods
Using Github URL
https://raw.githubusercontent.com/mgino11/Cleanup_txt_str/main/tournamentinfo.txt
We will perfom a series of methods using the Stringr package Some of the functions used where str_split, str_extract_all.
- Skip the Headers and get the data
- Remove the Dash-lines
- Extraction of Columns using Regular Epressions
- Obtain Players Names
- Get the players States
- Get the total number of Points
- Get the Pre-Ratings
- Get the Opponent Strings
- Get the Individual Opponent into a matrix of 7 columns
- Remove any Blank rows from the Data
Deconstruct Data
In this data set we already know that the first 4 lines are not part of our data because they include – dash lines, Pair, Num, Dash lines. in addition, we need to extract the data using sequences to be able to see data better and take direction from what we see. In this case we will divide the data in 2 matrices.
- Matrix 1 = the sequence starts at row 5 with length (chess1)
- Matrix 2 = the sequence starts at row 6 with length (chess1)
See Code
<- read.csv(file = "https://raw.githubusercontent.com/mgino11/Cleanup_txt_str/main/tournamentinfo.txt", skip = 3, header = F)
chess_data
# Create a Matrix and create a sequence
<- matrix(unlist(chess_data), byrow = TRUE)
chess1 <- chess1[seq(5, length(chess1), 3)]
matrix_chess1 head(matrix_chess1)
## [1] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|"
## [2] " 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|"
## [3] " 4 | PATRICK H SCHILLING |5.5 |W 23|D 28|W 2|W 26|D 5|W 19|D 1|"
## [4] " 5 | HANSHI ZUO |5.5 |W 45|W 37|D 12|D 13|D 4|W 14|W 17|"
## [5] " 6 | HANSEN SONG |5.0 |W 34|D 29|L 11|W 35|D 10|W 27|W 21|"
## [6] " 7 | GARY DEE SWATHELL |5.0 |W 57|W 46|W 13|W 11|L 1|W 9|L 2|"
<- chess1[seq(6,length(chess1),3)]
matrix_chess2 head(matrix_chess2)
## [1] " MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |"
## [2] " MI | 14959604 / R: 1384 ->1640 |N:2 |W |B |W |B |W |B |W |"
## [3] " MI | 12616049 / R: 1716 ->1744 |N:2 |W |B |W |B |W |B |B |"
## [4] " MI | 14601533 / R: 1655 ->1690 |N:2 |B |W |B |W |B |W |B |"
## [5] " OH | 15055204 / R: 1686 ->1687 |N:3 |W |B |W |B |B |W |B |"
## [6] " MI | 11146376 / R: 1649 ->1673 |N:3 |W |B |W |B |B |W |W |"
_________________________________________________________________________
String Manipulation and Regular Expressions
In this section we will use Stringr to obtain the data we need to create a data frame
We will extract
- ID
- Name
- State
- Total number of Points
- Pre rating score
Name
See Code
# we will use the ID number to obtain an organized Matrix
<- as.numeric(str_extract(matrix_chess1, '\\d+'))
ID
# Matching Name Characters any characters with length up to 32 char
<- str_extract(matrix_chess1, '[A-z].{1,32}')
Name head(Name)
## [1] "DAKSHESH DARURI |" "ADITYA BAJAJ |"
## [3] "PATRICK H SCHILLING |" "HANSHI ZUO |"
## [5] "HANSEN SONG |" "GARY DEE SWATHELL |"
# Extract Name - we get rid of white space
<- str_trim(str_extract(Name, '.+\\s{2,}'))
Name head(Name)
## [1] "DAKSHESH DARURI" "ADITYA BAJAJ" "PATRICK H SCHILLING"
## [4] "HANSHI ZUO" "HANSEN SONG" "GARY DEE SWATHELL"
_________________________________________________________________________
State
See Code
# Matrix chess 2 is the one that contains the information for state
# We are asking to extract range of characters from A-Z with length of 2 char
<- str_extract(matrix_chess2, '[A-Z]{2}')
state head(state)
## [1] "MI" "MI" "MI" "MI" "OH" "MI"
_________________________________________________________________________
Total Points
See Code
# First we need to match the total points format 6.0 number followed by a period and a number
<- as.numeric(str_extract(matrix_chess1, '\\d+\\.\\d'))
total_points total_points
## [1] 6.0 6.0 5.5 5.5 5.0 5.0 5.0 5.0 5.0 4.5 4.5 4.5 4.5 4.5 4.0 4.0 4.0 4.0 4.0
## [20] 4.0 4.0 4.0 4.0 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.0 3.0
## [39] 3.0 3.0 3.0 3.0 3.0 3.0 3.0 2.5 2.5 2.5 2.5 2.5 2.5 2.0 2.0 2.0 2.0 2.0 2.0
## [58] 2.0 1.5 1.5 1.0 1.0 1.0
_________________________________________________________________________
Pre Ratings
See Code
# First we need to Match the combination of "R" and the characters before (-) that information will be Matrix # 2
<- str_extract(matrix_chess2, 'R:.{8,}-')
pre_rating head(pre_rating)
## [1] "R: 1553 -" "R: 1384 -" "R: 1716 -" "R: 1655 -" "R: 1686 -"
## [6] "R: 1649 -"
# now that we have located the data we need to match and extract
<- as.numeric(str_extract(pre_rating, '\\d{1,4}'))
pre_rating head(pre_rating)
## [1] 1553 1384 1716 1655 1686 1649
_________________________________________________________________________
Average Pre Chess Ratings
See Code
1605 was calculated by using the pre-tournament opponents’ ratings of 1436, 1563, 1600, 1610, 1649, 1663, 1716, and dividing by the total number of games played.
Data in text file
# Match for all combinations of 1 capital letter and 2 spaces and all numbers after it
<- str_extract(matrix_chess1, '[A-Z]\\s{2,}\\d+')
round_opponent head(round_opponent)
## [1] "W 63" "L 8" "W 23" "W 45" "W 34" "W 57"
<- as.numeric(str_extract(round_opponent, '\\d+'))
round_opponent head(round_opponent)
## [1] 63 8 23 45 34 57
I will use a loop to calculate the Average
<- c()
average
for(i in c(1:length(round_opponent))) {
<- round(mean(pre_rating[as.numeric(round_opponent[[i]])]),0)
average[i]
}
head(average)
## [1] 1163 1411 1229 377 1438 917
_________________________________________________________________________
Data Frame to .csv
See Code
<- data.frame(ID, Name, state, total_points, pre_rating,average)
Project1_DT607 head(Project1_DT607)
## ID Name state total_points pre_rating average
## 1 2 DAKSHESH DARURI MI 6.0 1553 1163
## 2 3 ADITYA BAJAJ MI 6.0 1384 1411
## 3 4 PATRICK H SCHILLING MI 5.5 1716 1229
## 4 5 HANSHI ZUO MI 5.5 1655 377
## 5 6 HANSEN SONG OH 5.0 1686 1438
## 6 7 GARY DEE SWATHELL MI 5.0 1649 917
_________________________________________________________________________
Solution
ID | Player’s Name | Player’s State | Total Number of Points | Player’s Pre-Rating | Average Pre-Chess Rating of Opponenets |
---|---|---|---|---|---|
2 | DAKSHESH DARURI | MI | 6.0 | 1553 | 1163 |
3 | ADITYA BAJAJ | MI | 6.0 | 1384 | 1411 |
4 | PATRICK H SCHILLING | MI | 5.5 | 1716 | 1229 |
5 | HANSHI ZUO | MI | 5.5 | 1655 | 377 |
6 | HANSEN SONG | OH | 5.0 | 1686 | 1438 |
7 | GARY DEE SWATHELL | MI | 5.0 | 1649 | 917 |
8 | EZEKIEL HOUGHTON | MI | 5.0 | 1641 | 1716 |
9 | STEFANO LEE | ON | 5.0 | 1411 | 1579 |
10 | ANVIT RAO | MI | 5.0 | 1365 | 1629 |
11 | CAMERON WILLIAM MC LEMAN | MI | 4.5 | 1712 | 1436 |
12 | KENNETH J TACK | MI | 4.5 | 1663 | 1283 |
13 | TORRANCE HENRY JR | MI | 4.5 | 1666 | 980 |
14 | BRADLEY SHAW | MI | 4.5 | 1610 | 1186 |
15 | ZACHARY JAMES HOUGHTON | MI | 4.5 | 1220 | 1595 |
16 | MIKE NIKITIN | MI | 4.0 | 1604 | 1712 |
17 | RONALD GRZEGORCZYK | MI | 4.0 | 1629 | 1291 |
18 | DAVID SUNDEEN | MI | 4.0 | 1600 | 1382 |
19 | DIPANKAR ROY | MI | 4.0 | 1564 | 1604 |
20 | JASON ZHENG | MI | 4.0 | 1595 | 1403 |
21 | DINH DANG BUI | ON | 4.0 | 1563 | 1199 |
22 | EUGENE L MCCLURE | MI | 4.0 | 1555 | NA |
23 | ALAN BUI | ON | 4.0 | 1363 | 1655 |
24 | MICHAEL R ALDRICH | MI | 4.0 | 1229 | 1602 |
25 | LOREN SCHWIEBERT | MI | 3.5 | 1745 | 1365 |
26 | MAX ZHU | ON | 3.5 | 1579 | 1056 |
27 | GAURAV GIDWANI | MI | 3.5 | 1552 | 935 |
28 | SOFIA ADINA STANESCU-BELLU | MI | 3.5 | 1507 | 1745 |
29 | CHIEDOZIE OKORIE | MI | 3.5 | 1602 | 1011 |
30 | GEORGE AVERY JONES | ON | 3.5 | 1522 | 1393 |
31 | RISHI SHETTY | MI | 3.5 | 1494 | 853 |
32 | JOSHUA PHILIP MATHEWS | ON | 3.5 | 1441 | 1530 |
33 | JADE GE | MI | 3.5 | 1449 | 955 |
34 | MICHAEL JEFFERY THOMAS | MI | 3.5 | 1399 | 1649 |
35 | JOSHUA DAVID LEE | MI | 3.5 | 1438 | 1362 |
36 | SIDDHARTH JHA | MI | 3.5 | 1355 | 1610 |
37 | AMIYATOSH PWNANANDAM | MI | 3.5 | 980 | 1686 |
38 | BRIAN LIU | MI | 3.0 | 1423 | 1663 |
39 | JOEL R HENDON | MI | 3.0 | 1436 | 1553 |
40 | FOREST ZHANG | MI | 3.0 | 1348 | 1563 |
41 | KYLE WILLIAM MURPHY | MI | 3.0 | 1403 | 967 |
42 | JARED GE | MI | 3.0 | 1332 | 1666 |
43 | ROBERT GLEN VASEY | MI | 3.0 | 1283 | 1555 |
44 | JUSTIN D SCHILLING | MI | 3.0 | 1199 | 1220 |
45 | DEREK YAN | MI | 3.0 | 1242 | 1686 |
46 | JACOB ALEXANDER LAVALLEY | MI | 3.0 | 377 | 1355 |
47 | ERIC WRIGHT | MI | 2.5 | 1362 | 1564 |
48 | DANIEL KHAIN | MI | 2.5 | 1382 | 1600 |
49 | MICHAEL J MARTIN | MI | 2.5 | 1291 | 1552 |
50 | SHIVAM JHA | MI | 2.5 | 1056 | 1522 |
51 | TEJAS AYYAGARI | MI | 2.5 | 1011 | 1507 |
52 | ETHAN GUO | MI | 2.5 | 935 | 1494 |
53 | JOSE C YBARRA | MI | 2.0 | 1393 | 1579 |
54 | LARRY HODGE | MI | 2.0 | 1270 | 1220 |
55 | ALEX KONG | MI | 2.0 | 1186 | 1175 |
56 | MARISA RICCI | MI | 2.0 | 1153 | 1663 |
57 | MICHAEL LU | MI | 2.0 | 1092 | 1641 |
58 | VIRAJ MOHILE | MI | 2.0 | 917 | 1441 |
59 | SEAN M MC CORMICK | MI | 2.0 | 853 | 1332 |
60 | JULIA SHEN | MI | 1.5 | 967 | 1399 |
61 | JEZZEL FARKAS | ON | 1.5 | 955 | 1449 |
62 | ASHWIN BALAJI | MI | 1.0 | 1530 | 1153 |
63 | THOMAS JOSEPH HOSMER | MI | 1.0 | 1175 | 1384 |
64 | BEN LI | MI | 1.0 | 1163 | 1363 |
_________________________________________________________________________
Data Frame to .CSV
write.csv(Project1_DT607, "Project1.csv")