data607-illien-project1

We’re taking a look at chess tournament data. We want to process it and extract the following variables into a formatted .csv file:

Player’s Name
Player’s State
Total Number of Points
Player’s Pre-Rating
Average Pre Chess Rating of Opponents

Output format:

Gary Hua, ON, 6.0, 1794, 1605

Libraries

Load required libraries

library(stringr)
library(readr)
library(dplyr)
library(knitr)
library(kableExtra)
library(ggplot2)

Data Import

url <- "https://raw.githubusercontent.com/maelillien/data607-illien-project1/master/tournamentinfo.txt"
# use stringsAsFactors = FALSE to prevent character vectors from being converted to factors automatically
tournament_data <- read.delim(url, stringsAsFactors = FALSE)

Data Evaluation

The structure of the text is the following:
- The first three lines describe the pattern (which we will skip when processing)
- Player attributes such as Player Name, Total Points as shown below
- Player attributes such as State, Pre Tournament Rating as shown below
- A dashed line which we will ignore

##   X.........................................................................................
## 1  Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| 
## 2  Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | 
## 3  -----------------------------------------------------------------------------------------
## 4      1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|
## 5     ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |
## 6  -----------------------------------------------------------------------------------------
## 7      2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|
## 8     MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |
## 9  -----------------------------------------------------------------------------------------

Data Transformation

We re-format the text file for easier extraction. Each player occupies two lines of text, and the two loops below extract information from the first and the second line respectively. We skip the first three lines of text and step through the rows in increments of 3 so that the lines of dashes are never visited.
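As a quick check of the indexing described above, the sketch below lists the row numbers each loop would visit. It assumes a hypothetical file with only three players (three header rows plus three rows per player, so 12 rows in total); `n_rows`, `name_rows` and `state_rows` are illustrative names.

```r
# hypothetical row count: 3 header rows + 3 players x 3 rows each
n_rows <- 12

# rows holding the name, total points and opponents
name_rows <- seq(from = 4, to = n_rows, by = 3)
# rows holding the state and pre tournament rating
state_rows <- seq(from = 5, to = n_rows, by = 3)

print(name_rows)
print(state_rows)
```

The dashed rows (6, 9, 12) appear in neither sequence, which is why no explicit filtering is needed.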

# initialize lists that we will iteratively build
names <- c()
totalpts <- c()
opponentlist <- c()
states <- c()
preratings <- c()

# extract the name, total points and opponents list
for (i in seq(from=4, to=nrow(tournament_data), by=3)) {
  
  str <- tournament_data[i,]
  
  name <- unlist(str_extract_all(str, "[:alpha:]{2,}"))
  name <- paste(name, collapse = " ")
  names <- append(names, name)
  
  # the dot is escaped so that it only matches a literal decimal point
  totalpts <- append(totalpts, unlist(str_extract_all(str, "\\d\\.\\d")))
  totalpts <- as.numeric(totalpts)
  
  opponents <- unlist(str_extract_all(str, "[:digit:]{1,}"))[4:10]
  opponents <- paste(opponents, collapse = ",")
  opponentlist <- append(opponentlist,opponents)
}
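To make the three extractions concrete, the sketch below applies equivalent patterns to the first player line from the data preview above. It uses base R (`regmatches`, `gregexpr`) with POSIX bracket classes rather than stringr, purely so the snippet has no dependencies; the variable names are illustrative.

```r
# sample line copied from the data preview above
line <- "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"

# runs of two or more letters make up the name (single-letter results W/D/L are skipped)
name <- paste(regmatches(line, gregexpr("[[:alpha:]]{2,}", line))[[1]], collapse = " ")

# digit, literal dot, digit gives the total points
pts <- as.numeric(regmatches(line, regexpr("[[:digit:]]\\.[[:digit:]]", line)))

# every run of digits: the pair number, the two digits of "6.0", then the opponents
digits <- regmatches(line, gregexpr("[[:digit:]]+", line))[[1]]
opponents <- paste(digits[4:10], collapse = ",")

print(name)
print(pts)
print(opponents)
```

Because the opponents are taken by position (elements 4 to 10), a round without an opponent number shifts the remaining numbers forward and leaves `NA` at the end of the list.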

# extract the states and pre tournament ratings
for (i in seq(from=5, to=nrow(tournament_data), by=3)) {
  str <- tournament_data[i,]
  
  states <- append(states, unlist(str_extract_all(str, "[:alpha:]{2}")))
  
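  # the first two four-digit matches are the halves of the 8-digit USCF ID, the third is taken as the pre-rating
  # note: this only holds when the pre-rating has exactly four digits
  preratings <- append(preratings, unlist(str_extract_all(str, "[:digit:]{4}"))[3])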
  preratings <- as.numeric(preratings)
}
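The `[:digit:]{4}` pattern above relies on the pre-rating being exactly four digits long. As a possible alternative, the sketch below anchors on the `R:` label that precedes the pre-rating. It is only an illustration: the first sample line is copied from the data preview, the second is a made-up line with a three-digit provisional rating, and base R is used so the snippet has no dependencies. The tables in this report were produced with the original pattern.

```r
lines <- c(
  # copied from the data preview above
  "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |",
  # hypothetical line with a three-digit provisional pre-rating
  "   MI | 12345678 / R:  980P12->1077P17  |     |B    |W    |B    |W    |B    |W    |B    |"
)

# original approach: third run of exactly four digits
old <- sapply(regmatches(lines, gregexpr("[[:digit:]]{4}", lines)), function(x) x[3])

# alternative: capture the digits that directly follow "R:" and its padding spaces
m <- regmatches(lines, regexec("R:[[:space:]]*([[:digit:]]+)", lines))
pre <- as.numeric(sapply(m, function(x) x[2]))

print(old)
print(pre)
```

On the made-up second line the original pattern picks up the post-tournament rating, while the anchored pattern returns the three-digit pre-rating.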

We combine this extracted information into a dataframe and take a look at it.

players <- data.frame(name = names, state = states, 
                      total_pts = totalpts, pre_tournament_rating = preratings,
                      opponents = opponentlist)

## 'data.frame':    64 obs. of  5 variables:
##  $ name                 : Factor w/ 64 levels "ADITYA BAJAJ",..: 24 12 1 51 28 27 23 21 59 5 ...
##  $ state                : Factor w/ 3 levels "MI","OH","ON": 3 1 1 1 1 2 1 1 3 1 ...
##  $ total_pts            : num  6 6 6 5.5 5.5 5 5 5 5 5 ...
##  $ pre_tournament_rating: num  1794 1553 1384 1716 1655 ...
##  $ opponents            : Factor w/ 64 levels "1,54,40,16,44,21,24",..: 35 60 63 18 41 31 53 26 20 10 ...

Extracted Data
name	state	total_pts	pre_tournament_rating	opponents
GARY HUA	ON	6.0	1794	39,21,18,14,7,12,4
DAKSHESH DARURI	MI	6.0	1553	63,58,4,17,16,20,7
ADITYA BAJAJ	MI	6.0	1384	8,61,25,21,11,13,12
PATRICK SCHILLING	MI	5.5	1716	23,28,2,26,5,19,1
HANSHI ZUO	MI	5.5	1655	45,37,12,13,4,14,17
HANSEN SONG	OH	5.0	1686	34,29,11,35,10,27,21

Summary of Extracted Data
name	state	total_pts	pre_tournament_rating	opponents
ADITYA BAJAJ : 1	MI:55	Min. :1.000	Min. :1011	1,54,40,16,44,21,24 : 1
ALAN BUI : 1	OH: 1	1st Qu.:2.500	1st Qu.:1280	10,15,39,2,36,NA,NA : 1
ALEX KONG : 1	ON: 8	Median :3.500	Median :1430	11,35,29,12,18,15,NA: 1
AMIYATOSH PWNANANDAM: 1	NA	Mean :3.438	Mean :1425	11,35,45,40,42,NA,NA: 1
ANVIT RAO : 1	NA	3rd Qu.:4.000	3rd Qu.:1596	12,50,57,60,61,64,56: 1
ASHWIN BALAJI : 1	NA	Max. :6.000	Max. :1794	13,57,51,33,16,28,NA: 1
(Other) :58	NA	NA	NA’s :4	(Other) :58

Data Processing

From the summary above, we learn that there are missing values in both the pre_tournament_rating and the opponents variables. We need to keep this in mind when calculating the apcro (Average Pre Chess Rating of Opponents) for each player. Because the players are stored in pair-number order, each opponent’s pair number can be used directly as a row index. We drop the NA values in the apcro calculation.

# initialize the column that the loop fills in
players$apcro <- NA_real_
for (i in seq_len(nrow(players))) {
  # separate the opponents list and coerce it to integer pair numbers
  op_index <- as.integer(unlist(str_split(players$opponents[i], ",")))
  # drop the NAs left by unplayed rounds (as.integer() has no na.rm argument)
  op_index <- op_index[!is.na(op_index)]
  # use the pair numbers as row indices to locate the opponents' pre tournament ratings
  op_ratings <- players$pre_tournament_rating[op_index]
  # calculate the apcro by taking the mean and dropping the 4 missing pre tournament ratings
  players$apcro[i] <- round(mean(op_ratings, na.rm=TRUE))
}
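The same lookup can also be written without an explicit loop. The sketch below is one possible vectorised form on a small made-up data frame (`toy` and all of its values are hypothetical, not taken from the tournament); it is not the code used to produce the results in this report.

```r
toy <- data.frame(
  pre_tournament_rating = c(1500, 1600, NA, 1400),
  opponents = c("2,3,4", "1,4,NA", "1,NA,NA", "1,2,NA"),
  stringsAsFactors = FALSE
)

toy$apcro <- sapply(strsplit(toy$opponents, ","), function(ids) {
  # "NA" entries (unplayed rounds) become NA and are dropped
  idx <- suppressWarnings(as.integer(ids))
  idx <- idx[!is.na(idx)]
  # opponents with a missing rating are left out of the mean
  round(mean(toy$pre_tournament_rating[idx], na.rm = TRUE))
})

print(toy$apcro)
```

For the first toy player the opponents are rows 2, 3 and 4; row 3 has no rating, so the average is taken over 1600 and 1400 only.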

Preparing the output format for data export

# We create a new data frame without the opponents variable, since we will not be writing it to file.
outputdf <- select(players, -opponents)

Processed Data
name	state	total_pts	pre_tournament_rating	apcro
GARY HUA	ON	6.0	1794	1605
DAKSHESH DARURI	MI	6.0	1553	1561
ADITYA BAJAJ	MI	6.0	1384	1665
PATRICK SCHILLING	MI	5.5	1716	1574
HANSHI ZUO	MI	5.5	1655	1515
HANSEN SONG	OH	5.0	1686	1519
GARY DEE SWATHELL	MI	5.0	1649	1472
EZEKIEL HOUGHTON	MI	5.0	1641	1468
STEFANO LEE	ON	5.0	1411	1635
ANVIT RAO	MI	5.0	1365	1554

Data Export

Write the data frame to a .csv file

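# write.csv() ignores col.names and sep (with a warning), so write.table() is used to omit the header row
write.table(outputdf, file = "chess_tournament.csv", sep = ",", row.names = FALSE, col.names = FALSE, quote = FALSE)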
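One way to check the export is to read the file back. The sketch below writes a two-row data frame (values copied from the Processed Data table above) to a temporary file and prints its lines. The temporary path and the `demo` name are assumptions for illustration, and the `formatC()` step is an optional addition that keeps one decimal place in the points column, as in the target output format.

```r
demo <- data.frame(
  name = c("GARY HUA", "DAKSHESH DARURI"),
  state = c("ON", "MI"),
  total_pts = c(6.0, 6.0),
  pre_tournament_rating = c(1794, 1553),
  apcro = c(1605, 1561),
  stringsAsFactors = FALSE
)

# numeric 6.0 would otherwise be written as 6, so format it as text with one decimal
demo$total_pts <- formatC(demo$total_pts, format = "f", digits = 1)

tmp <- tempfile(fileext = ".csv")
write.table(demo, file = tmp, sep = ",", row.names = FALSE, col.names = FALSE, quote = FALSE)

# read the raw lines back to inspect exactly what was written
written <- readLines(tmp)
print(written)
```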

Data Analysis & Visualization

Let’s start by taking a look at the summary statistics. It is interesting to note that the mean pre tournament rating of the players (1425) and the mean of their opponents’ average pre tournament ratings (1424) are very close.

summary(outputdf$total_pts)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.500   3.500   3.438   4.000   6.000

summary(outputdf$pre_tournament_rating)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1011    1280    1430    1425    1596    1794       4

summary(outputdf$apcro)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1186    1356    1418    1424    1496    1665

Here we take a look at the players with the highest pre tournament ratings and those with the highest average opponent ratings.

Pre Tournament Rating Descending
	name	state	total_pts	pre_tournament_rating	apcro
1	GARY HUA	ON	6.0	1794	1605
25	LOREN SCHWIEBERT	MI	3.5	1745	1363
4	PATRICK SCHILLING	MI	5.5	1716	1574
11	CAMERON WILLIAM MC LEMAN	MI	4.5	1712	1468
6	HANSEN SONG	OH	5.0	1686	1519
13	TORRANCE HENRY JR	MI	4.5	1666	1498

Average Pre Chess Rating of Opponent Descending
	name	state	total_pts	pre_tournament_rating	apcro
3	ADITYA BAJAJ	MI	6.0	1384	1665
9	STEFANO LEE	ON	5.0	1411	1635
41	KYLE WILLIAM MURPHY	MI	3.0	1403	1612
1	GARY HUA	ON	6.0	1794	1605
4	PATRICK SCHILLING	MI	5.5	1716	1574
2	DAKSHESH DARURI	MI	6.0	1553	1561

The bar chart of the total points shows a distribution centered near 3.5, in line with the mean (3.438) and median (3.5) reported above.

ggplot(data = outputdf, mapping = aes(x = total_pts)) + 
  geom_bar(fill = 'lightblue')

Here we plot the Average Pre Chess Rating of Opponents against the Pre Tournament Rating. A y = x line is added to identify which players faced opponents that were on average rated higher than themselves (upper left side of the line) or lower than themselves (lower right side of the line). A color dimension represents the number of points obtained. It appears that lower-rated players mostly faced opponents rated above them and higher-rated players mostly faced opponents rated below them, as can be expected.

ggplot(data = outputdf) +
    geom_point(mapping = aes(x = pre_tournament_rating, y = apcro, color = total_pts)) +
    geom_abline(slope = 1, intercept = 0) +
    xlim(1000, 1800) + ylim(1000, 1800)
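The position of a point relative to the y = x line can also be computed directly. The sketch below does this for four rows copied from the Processed Data table above; the `shown`, `diff` and `side` names are illustrative.

```r
# four rows copied from the Processed Data table
shown <- data.frame(
  name = c("GARY HUA", "DAKSHESH DARURI", "ADITYA BAJAJ", "PATRICK SCHILLING"),
  pre_tournament_rating = c(1794, 1553, 1384, 1716),
  apcro = c(1605, 1561, 1665, 1574),
  stringsAsFactors = FALSE
)

# positive difference: opponents were on average rated higher than the player
shown$diff <- shown$apcro - shown$pre_tournament_rating
shown$side <- ifelse(shown$diff > 0, "above", "below")

print(shown[, c("name", "diff", "side")])
```

In this small sample, the two players rated above 1700 fall below the line and the two lower-rated players fall above it, matching the pattern described for the plot.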