Project Goal: The data of a chess tournament is imported in a text format and manipulated to show the players’ name, the state they are from, their total number of points, in addition to their pre-rating and the average pre-rating of their opponents.The data is then written into a .csv file.

The data is imported from GitHub and the required libraries are loaded. In addition, the columns are formatted before working with the data.

library(stringr)
library(knitr)
library(ggplot2)

# import data from github
chess = read.table("https://raw.githubusercontent.com/Galanopoulog/DATA607-Project-1/master/chess.txt",
           header = F,
           sep = "|",
           skip = 3,
           fill = T,
           blank.lines.skip = T,
           stringsAsFactors = F)

# Because the V11 column is not required in addition to having only NA values, we remove it.
chess = chess[1:10]

# Format column 1 by removing the repeating "---"
chess$V1 = str_replace_all(chess$V1, "\\-", "")

head(data.frame(chess))
##       V1                                V2    V3    V4    V5    V6    V7
## 1                                                                       
## 2     1   GARY HUA                         6.0   W  39 W  21 W  18 W  14
## 3    ON   15445895 / R: 1794   ->1817      N:2   W     B     W     B    
## 4                                                                       
## 5     2   DAKSHESH DARURI                  6.0   W  63 W  58 L   4 W  17
## 6    MI   14598900 / R: 1553   ->1663      N:2   B     W     B     W    
##      V8    V9   V10
## 1                  
## 2 W   7 D  12 D   4
## 3 W     B     W    
## 4                  
## 5 W  16 W  20 W   7
## 6 B     W     B

To make the data more compact, we remove any rows with NA values or only spaces.

# Remove rows with NA and spaces
chess = subset(chess, chess[,1:10]  != "")
chess = chess[apply(chess,1,function(chess)any(!is.na(chess))),]
head(chess)
##       V1                                V2    V3    V4    V5    V6    V7
## 2     1   GARY HUA                         6.0   W  39 W  21 W  18 W  14
## 3    ON   15445895 / R: 1794   ->1817      N:2   W     B     W     B    
## 5     2   DAKSHESH DARURI                  6.0   W  63 W  58 L   4 W  17
## 6    MI   14598900 / R: 1553   ->1663      N:2   B     W     B     W    
## 8     3   ADITYA BAJAJ                     6.0   L   8 W  61 W  25 W  21
## 9    MI   14959604 / R: 1384   ->1640      N:2   W     B     W     B    
##      V8    V9   V10
## 2 W   7 D  12 D   4
## 3 W     B     W    
## 5 W  16 W  20 W   7
## 6 B     W     B    
## 8 W  11 W  13 W  12
## 9 W     B     W

Since two rows are dedicated to one player, the goal is to make one row include all the information for one player. To do that, we will split the data into two data sets using their index as an odd or even row and then combine them back into one data set. Some columns will be dropped because they are not necessary to the overall project.

# Splitting the dataframe into two and then combining them again
odds = seq(1, nrow(chess), by=2)
chessodds = chess[odds,]
colnames(chessodds) = c("Num", "Name", "Points", "Round_1", "Round_2", "Round_3", "Round_4", "Round_5", "Round_6", "Round_7")

evens = seq(2, nrow(chess), by=2)
chessevens = chess[evens, 1:3]
colnames(chessevens) = c("State", "ID/Pre_Post", "Total")

kable(head(chessodds))
Num Name Points Round_1 Round_2 Round_3 Round_4 Round_5 Round_6 Round_7
2 1 GARY HUA 6.0 W 39 W 21 W 18 W 14 W 7 D 12 D 4
5 2 DAKSHESH DARURI 6.0 W 63 W 58 L 4 W 17 W 16 W 20 W 7
8 3 ADITYA BAJAJ 6.0 L 8 W 61 W 25 W 21 W 11 W 13 W 12
11 4 PATRICK H SCHILLING 5.5 W 23 D 28 W 2 W 26 D 5 W 19 D 1
14 5 HANSHI ZUO 5.5 W 45 W 37 D 12 D 13 D 4 W 14 W 17
17 6 HANSEN SONG 5.0 W 34 D 29 L 11 W 35 D 10 W 27 W 21
kable(head(chessevens))
State ID/Pre_Post Total
3 ON 15445895 / R: 1794 ->1817 N:2
6 MI 14598900 / R: 1553 ->1663 N:2
9 MI 14959604 / R: 1384 ->1640 N:2
12 MI 12616049 / R: 1716 ->1744 N:2
15 MI 14601533 / R: 1655 ->1690 N:2
18 OH 15055204 / R: 1686 ->1687 N:3

One step before combining the data, the data in the column titled “ID/Pre_Post” will be broken into three parts: ID, Pre and Post. These three columns will describe the players’ ID number, their Pre-Rating and their Post-Rating.

## Breaking up ID/Pre_Post into ID, Pre and Post
# Remove spacing for easier manipulation
chessevens$'ID/Pre_Post' = gsub("[[:space:]]", "", chessevens$'ID/Pre_Post')

# ID
ID = substr(chessevens$'ID/Pre_Post', 0, 8)

# Pre values
Pre = as.numeric(str_extract(substr(chessevens$'ID/Pre_Post', 12, 30), "\\d+"))

# Post values
Post = str_extract(substr(chessevens$'ID/Pre_Post', 12, 30), "(>)\\d+")
Post = as.numeric(gsub(">", "", Post))

Finally, we combine the two data sets and rearrange the columns for an easier read.

# Combine
merged_chess = cbind(chessodds, chessevens, ID, Post, Pre)

# Rearrange columns
merged_chess = merged_chess[ ,c("Num", "ID", "Name", "State","Points", "Pre", "Post", "Round_1", "Round_2", "Round_3", "Round_4", "Round_5", "Round_6", "Round_7")]
kable(head(merged_chess))
Num ID Name State Points Pre Post Round_1 Round_2 Round_3 Round_4 Round_5 Round_6 Round_7
2 1 15445895 GARY HUA ON 6.0 1794 1817 W 39 W 21 W 18 W 14 W 7 D 12 D 4
5 2 14598900 DAKSHESH DARURI MI 6.0 1553 1663 W 63 W 58 L 4 W 17 W 16 W 20 W 7
8 3 14959604 ADITYA BAJAJ MI 6.0 1384 1640 L 8 W 61 W 25 W 21 W 11 W 13 W 12
11 4 12616049 PATRICK H SCHILLING MI 5.5 1716 1744 W 23 D 28 W 2 W 26 D 5 W 19 D 1
14 5 14601533 HANSHI ZUO MI 5.5 1655 1690 W 45 W 37 D 12 D 13 D 4 W 14 W 17
17 6 15055204 HANSEN SONG OH 5.0 1686 1687 W 34 D 29 L 11 W 35 D 10 W 27 W 21

In order to generate the Average Pre-Rating score of a player’s opponents, we will create an empty array that we will fill in with a for loop. From each round, we will extract each player’s opponent, draw that opponent’s pre-rating score and then find the average of all oppents.

rank = array(dim=length(merged_chess$Num))
for (i in 1:length(merged_chess$Num)){
  pres = na.omit(as.numeric(str_extract_all(merged_chess[i,8:14],"\\d+")))
  rank[i] = round(mean(merged_chess[pres, 6]))
}
merged_chess$Ranks = rank

Once again, we will rearrange the columns and, finally, write the data out into a csv file.

# Rearrange columns again
merged_chess = merged_chess[ ,c("Num", "ID", "Name", "State","Points", "Pre", "Post", "Ranks", "Round_1", "Round_2", "Round_3", "Round_4", "Round_5", "Round_6", "Round_7")]
kable(head(merged_chess))
Num ID Name State Points Pre Post Ranks Round_1 Round_2 Round_3 Round_4 Round_5 Round_6 Round_7
2 1 15445895 GARY HUA ON 6.0 1794 1817 1605 W 39 W 21 W 18 W 14 W 7 D 12 D 4
5 2 14598900 DAKSHESH DARURI MI 6.0 1553 1663 1469 W 63 W 58 L 4 W 17 W 16 W 20 W 7
8 3 14959604 ADITYA BAJAJ MI 6.0 1384 1640 1564 L 8 W 61 W 25 W 21 W 11 W 13 W 12
11 4 12616049 PATRICK H SCHILLING MI 5.5 1716 1744 1574 W 23 D 28 W 2 W 26 D 5 W 19 D 1
14 5 14601533 HANSHI ZUO MI 5.5 1655 1690 1501 W 45 W 37 D 12 D 13 D 4 W 14 W 17
17 6 15055204 HANSEN SONG OH 5.0 1686 1687 1519 W 34 D 29 L 11 W 35 D 10 W 27 W 21
# Remove "#" in the line below to write out csv file that will be titled "ChessTournament"
#write.csv(merged_chess, "ChessTournament.csv")

Trying to visually identify if players’ score increases if they face off opponents who have on average higher scores than them.

To do that, we take the difference between Post scores and Pre scores and if the difference is positive, then the player has increased their scoring.

We will also take the difference between Pre scores and “Ranks” scores to identify if the players faced against opponents who overall had higher scores than them and if the difference is positive, then the player played against more skilled opponents on average.

To simplify:

Orange scores above the zero line = player increased their scores

Green scores above the zero line = player faced on average opponents with higher scores

PlotPostPre = data.frame(x= merged_chess$Num, y= (Post-Pre))
PlotRanksPost = data.frame(x= merged_chess$Num, y= (merged_chess$Ranks-Post))
ZeroLine = data.frame(x= merged_chess$Num, y= 0)

ggplot(PlotPostPre,aes(x,y))+ geom_line(aes(color= "Post-Pre",group= 1))+
  geom_line(data= PlotRanksPost, aes(color="Ranks-Pre",group= 1))+
  geom_line(data= ZeroLine, aes(color="Zero Line",group= 1))+
  labs(color="Legend")