Problem statement

In this project, I have been given a text file with chess tournament results where the information has some structure. My job is to create an R Markdown file that generates a .CSV file (that could for example be imported into a SQL database) with the following information for all of the players:

Player’s Name	Player’s State	Total Number of Points	Player’s Pre-Rating	Average Pre-Chess Rating of Opponenets
GARY HUA	ON	6	1794	1605

1605 was calculated by using the pre-tournament opponents’ ratings of 1436, 1563, 1600, 1610, 1649, 1663, 1716, and dividing by the total number of games played.

Data in Text File

The chess tournament text file is written in a format that is not legible in R. This file needs to be re-structured in order to find the average pre-tournament score.

Data in text file

Packages:

Stringr
Tydyverse
Dplryr

Key Methods

Using Github URL

https://raw.githubusercontent.com/mgino11/Cleanup_txt_str/main/tournamentinfo.txt

We will perfom a series of methods using the Stringr package Some of the functions used where str_split, str_extract_all.

Skip the Headers and get the data
Remove the Dash-lines
Extraction of Columns using Regular Epressions
- Obtain Players Names
- Get the players States
- Get the total number of Points
- Get the Pre-Ratings
- Get the Opponent Strings
- Get the Individual Opponent into a matrix of 7 columns
- Remove any Blank rows from the Data

Deconstruct Data

In this data set we already know that the first 4 lines are not part of our data because they include – dash lines, Pair, Num, Dash lines. in addition, we need to extract the data using sequences to be able to see data better and take direction from what we see. In this case we will divide the data in 2 matrices.

Matrix 1 = the sequence starts at row 5 with length (chess1)
Matrix 2 = the sequence starts at row 6 with length (chess1)

See Code

chess_data <- read.csv(file = "https://raw.githubusercontent.com/mgino11/Cleanup_txt_str/main/tournamentinfo.txt", skip = 3, header = F)

# Create a Matrix and create a sequence
chess1 <- matrix(unlist(chess_data), byrow = TRUE)
matrix_chess1 <- chess1[seq(5, length(chess1), 3)]
head(matrix_chess1)

## [1] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|"
## [2] "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|"
## [3] "    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|"
## [4] "    5 | HANSHI ZUO                      |5.5  |W  45|W  37|D  12|D  13|D   4|W  14|W  17|"
## [5] "    6 | HANSEN SONG                     |5.0  |W  34|D  29|L  11|W  35|D  10|W  27|W  21|"
## [6] "    7 | GARY DEE SWATHELL               |5.0  |W  57|W  46|W  13|W  11|L   1|W   9|L   2|"

matrix_chess2 <- chess1[seq(6,length(chess1),3)]
head(matrix_chess2)

## [1] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
## [2] "   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
## [3] "   MI | 12616049 / R: 1716   ->1744     |N:2  |W    |B    |W    |B    |W    |B    |B    |"
## [4] "   MI | 14601533 / R: 1655   ->1690     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
## [5] "   OH | 15055204 / R: 1686   ->1687     |N:3  |W    |B    |W    |B    |B    |W    |B    |"
## [6] "   MI | 11146376 / R: 1649   ->1673     |N:3  |W    |B    |W    |B    |B    |W    |W    |"

_________________________________________________________________________

String Manipulation and Regular Expressions

In this section we will use Stringr to obtain the data we need to create a data frame

We will extract

ID
Name
State
Total number of Points
Pre rating score

Name

See Code

# we will use the ID number to obtain an organized Matrix

ID <- as.numeric(str_extract(matrix_chess1, '\\d+'))

# Matching Name Characters any characters with length up to 32 char

Name <- str_extract(matrix_chess1, '[A-z].{1,32}')
head(Name)

## [1] "DAKSHESH DARURI                 |" "ADITYA BAJAJ                    |"
## [3] "PATRICK H SCHILLING             |" "HANSHI ZUO                      |"
## [5] "HANSEN SONG                     |" "GARY DEE SWATHELL               |"

# Extract Name - we get rid of white space 
Name <- str_trim(str_extract(Name, '.+\\s{2,}'))
head(Name)

## [1] "DAKSHESH DARURI"     "ADITYA BAJAJ"        "PATRICK H SCHILLING"
## [4] "HANSHI ZUO"          "HANSEN SONG"         "GARY DEE SWATHELL"

_________________________________________________________________________

State

See Code

# Matrix chess 2 is the one that contains the information for state
# We are asking to extract range of characters from A-Z with length of 2 char
state <- str_extract(matrix_chess2, '[A-Z]{2}')
head(state)

## [1] "MI" "MI" "MI" "MI" "OH" "MI"

_________________________________________________________________________

Total Points

See Code

# First we need to match the total points format 6.0 number followed by a period and a number
total_points <- as.numeric(str_extract(matrix_chess1, '\\d+\\.\\d'))
total_points

##  [1] 6.0 6.0 5.5 5.5 5.0 5.0 5.0 5.0 5.0 4.5 4.5 4.5 4.5 4.5 4.0 4.0 4.0 4.0 4.0
## [20] 4.0 4.0 4.0 4.0 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.0 3.0
## [39] 3.0 3.0 3.0 3.0 3.0 3.0 3.0 2.5 2.5 2.5 2.5 2.5 2.5 2.0 2.0 2.0 2.0 2.0 2.0
## [58] 2.0 1.5 1.5 1.0 1.0 1.0

_________________________________________________________________________

Pre Ratings

See Code

# First we need to Match the combination of "R" and the characters before (-) that information will be  Matrix # 2
pre_rating <- str_extract(matrix_chess2, 'R:.{8,}-')
head(pre_rating)

## [1] "R: 1553   -" "R: 1384   -" "R: 1716   -" "R: 1655   -" "R: 1686   -"
## [6] "R: 1649   -"

# now that we have located the data we need to match and extract 
pre_rating <- as.numeric(str_extract(pre_rating, '\\d{1,4}'))
head(pre_rating)

## [1] 1553 1384 1716 1655 1686 1649

_________________________________________________________________________

Average Pre Chess Ratings

See Code

1605 was calculated by using the pre-tournament opponents’ ratings of 1436, 1563, 1600, 1610, 1649, 1663, 1716, and dividing by the total number of games played.

Data in text file

# Match for all combinations of 1 capital letter and 2 spaces  and all numbers after it
round_opponent <- str_extract(matrix_chess1, '[A-Z]\\s{2,}\\d+')
head(round_opponent)

## [1] "W  63" "L   8" "W  23" "W  45" "W  34" "W  57"

round_opponent <- as.numeric(str_extract(round_opponent, '\\d+'))
head(round_opponent)

## [1] 63  8 23 45 34 57

I will use a loop to calculate the Average

average <- c()

for(i in c(1:length(round_opponent))) {
  average[i] <- round(mean(pre_rating[as.numeric(round_opponent[[i]])]),0)
  
}

head(average)

## [1] 1163 1411 1229  377 1438  917

_________________________________________________________________________

Data Frame to .csv

See Code

Project1_DT607 <- data.frame(ID, Name, state, total_points, pre_rating,average)
head(Project1_DT607)

##   ID                Name state total_points pre_rating average
## 1  2     DAKSHESH DARURI    MI          6.0       1553    1163
## 2  3        ADITYA BAJAJ    MI          6.0       1384    1411
## 3  4 PATRICK H SCHILLING    MI          5.5       1716    1229
## 4  5          HANSHI ZUO    MI          5.5       1655     377
## 5  6         HANSEN SONG    OH          5.0       1686    1438
## 6  7   GARY DEE SWATHELL    MI          5.0       1649     917

_________________________________________________________________________

Solution

ID	Player’s Name	Player’s State	Total Number of Points	Player’s Pre-Rating	Average Pre-Chess Rating of Opponenets
2	DAKSHESH DARURI	MI	6.0	1553	1163
3	ADITYA BAJAJ	MI	6.0	1384	1411
4	PATRICK H SCHILLING	MI	5.5	1716	1229
5	HANSHI ZUO	MI	5.5	1655	377
6	HANSEN SONG	OH	5.0	1686	1438
7	GARY DEE SWATHELL	MI	5.0	1649	917
8	EZEKIEL HOUGHTON	MI	5.0	1641	1716
9	STEFANO LEE	ON	5.0	1411	1579
10	ANVIT RAO	MI	5.0	1365	1629
11	CAMERON WILLIAM MC LEMAN	MI	4.5	1712	1436
12	KENNETH J TACK	MI	4.5	1663	1283
13	TORRANCE HENRY JR	MI	4.5	1666	980
14	BRADLEY SHAW	MI	4.5	1610	1186
15	ZACHARY JAMES HOUGHTON	MI	4.5	1220	1595
16	MIKE NIKITIN	MI	4.0	1604	1712
17	RONALD GRZEGORCZYK	MI	4.0	1629	1291
18	DAVID SUNDEEN	MI	4.0	1600	1382
19	DIPANKAR ROY	MI	4.0	1564	1604
20	JASON ZHENG	MI	4.0	1595	1403
21	DINH DANG BUI	ON	4.0	1563	1199
22	EUGENE L MCCLURE	MI	4.0	1555	NA
23	ALAN BUI	ON	4.0	1363	1655
24	MICHAEL R ALDRICH	MI	4.0	1229	1602
25	LOREN SCHWIEBERT	MI	3.5	1745	1365
26	MAX ZHU	ON	3.5	1579	1056
27	GAURAV GIDWANI	MI	3.5	1552	935
28	SOFIA ADINA STANESCU-BELLU	MI	3.5	1507	1745
29	CHIEDOZIE OKORIE	MI	3.5	1602	1011
30	GEORGE AVERY JONES	ON	3.5	1522	1393
31	RISHI SHETTY	MI	3.5	1494	853
32	JOSHUA PHILIP MATHEWS	ON	3.5	1441	1530
33	JADE GE	MI	3.5	1449	955
34	MICHAEL JEFFERY THOMAS	MI	3.5	1399	1649
35	JOSHUA DAVID LEE	MI	3.5	1438	1362
36	SIDDHARTH JHA	MI	3.5	1355	1610
37	AMIYATOSH PWNANANDAM	MI	3.5	980	1686
38	BRIAN LIU	MI	3.0	1423	1663
39	JOEL R HENDON	MI	3.0	1436	1553
40	FOREST ZHANG	MI	3.0	1348	1563
41	KYLE WILLIAM MURPHY	MI	3.0	1403	967
42	JARED GE	MI	3.0	1332	1666
43	ROBERT GLEN VASEY	MI	3.0	1283	1555
44	JUSTIN D SCHILLING	MI	3.0	1199	1220
45	DEREK YAN	MI	3.0	1242	1686
46	JACOB ALEXANDER LAVALLEY	MI	3.0	377	1355
47	ERIC WRIGHT	MI	2.5	1362	1564
48	DANIEL KHAIN	MI	2.5	1382	1600
49	MICHAEL J MARTIN	MI	2.5	1291	1552
50	SHIVAM JHA	MI	2.5	1056	1522
51	TEJAS AYYAGARI	MI	2.5	1011	1507
52	ETHAN GUO	MI	2.5	935	1494
53	JOSE C YBARRA	MI	2.0	1393	1579
54	LARRY HODGE	MI	2.0	1270	1220
55	ALEX KONG	MI	2.0	1186	1175
56	MARISA RICCI	MI	2.0	1153	1663
57	MICHAEL LU	MI	2.0	1092	1641
58	VIRAJ MOHILE	MI	2.0	917	1441
59	SEAN M MC CORMICK	MI	2.0	853	1332
60	JULIA SHEN	MI	1.5	967	1399
61	JEZZEL FARKAS	ON	1.5	955	1449
62	ASHWIN BALAJI	MI	1.0	1530	1153
63	THOMAS JOSEPH HOSMER	MI	1.0	1175	1384
64	BEN LI	MI	1.0	1163	1363

_________________________________________________________________________

Data Frame to .CSV

write.csv(Project1_DT607, "Project1.csv")

Data Project 1

MariAlejandra Ginorio

2021-02-28

Problem statement

Data in Text File

Packages:

Key Methods

Deconstruct Data

_________________________________________________________________________

String Manipulation and Regular Expressions

Name

_________________________________________________________________________

State

_________________________________________________________________________

Total Points

_________________________________________________________________________

Pre Ratings

_________________________________________________________________________

Average Pre Chess Ratings

_________________________________________________________________________

Data Frame to .csv

_________________________________________________________________________

Solution

_________________________________________________________________________

Data Frame to .CSV