9/22/2019

In this project, we are given a text file of chess tournament results that follows a loosely regular structure. I will extract the relevant information in an R Markdown file and then generate a .CSV file that could be loaded into a database management system.

I will look for and obtain the following player information:

Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents.

My first step is to load the data into R:

chess_data <- read.delim("https://raw.githubusercontent.com/marioipena/Project1data/master/tournamentinfo.txt", header = FALSE, stringsAsFactors = FALSE)

Let’s take a look at our data to get a better sense of how we will extract the required information:

head(chess_data, 10)

We can see from the first few lines of our data that a repeating pattern separates the players: the layout repeats every three lines. I will use this pattern to pick out the two lines per player that contain the information we want to extract.

player <- chess_data[seq(5, nrow(chess_data), 3), ] # The player data starts at line 5
ratings <- chess_data[seq(6, nrow(chess_data), 3), ] # The ratings data starts at line 6
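
To make the row selection concrete, here is a minimal sketch on a made-up vector. The `toy_lines` contents are purely hypothetical stand-ins, assuming four header lines followed by repeating three-line blocks; only the `seq()` stepping mirrors what I do above.

```r
# Hypothetical stand-in for the file: 4 header lines, then repeating
# blocks of (player line, ratings line, separator line).
toy_lines <- c(
  "-----", "header 1", "header 2", "-----",
  "player A", "ratings A", "-----",
  "player B", "ratings B", "-----"
)

# Start at line 5 (or 6) and step by 3 to land on the same kind of line each time.
player_idx  <- seq(5, length(toy_lines), by = 3)
ratings_idx <- seq(6, length(toy_lines), by = 3)

toy_player  <- toy_lines[player_idx]
toy_ratings <- toy_lines[ratings_idx]
print(toy_player)
print(toy_ratings)
```

The two resulting vectors line up element by element, which is what lets the later steps combine them into one row per player.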

Now we will use regular expressions to extract only the information listed in the project instructions.

I will extract each variable separately to form vectors that I will later combine into a data frame.

First, let’s extract all the information we will need from the “player” row.

Note that I will also extract the player ID, as it will help us later to obtain the average opponent rating.

library(stringr)
playerId <- as.integer(str_extract(player, "\\d+"))
playerName <- str_extract(player, "(\\w+\\s){2,4}(\\w+-\\w+)?") #There are players with 4 names and hyphenated names.
playerName <- str_trim(playerName) #This gets rid of the extra blank spaces.
playerPoints <- as.numeric(str_extract(player, "\\d+\\.\\d+"))
playerOpponent <- str_extract_all(player, "\\d+(?=\\|)") #The lookahead keeps only digits directly before a "|" (the opponent IDs), so no second pass over a list is needed.
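
To see what each pattern captures, here is a small sketch run on a single made-up line. The `sample_player` string is hypothetical (not taken from the file) and only imitates the layout of a "player" row. The opponent IDs are taken here with a lookahead, which is one way to capture the digits that sit directly before a `|`.

```r
library(stringr)

# Hypothetical line imitating the layout of a "player" row.
sample_player <- "    1 | JANE Q DOE                      |6.0  |W  39|W  21|D  12|H    |"

# First run of digits on the line: the player ID.
sample_id <- as.integer(str_extract(sample_player, "\\d+"))

# Two to four space-separated words, optionally ending in a hyphenated word.
sample_name <- str_trim(str_extract(sample_player, "(\\w+\\s){2,4}(\\w+-\\w+)?"))

# First decimal number: the total points.
sample_points <- as.numeric(str_extract(sample_player, "\\d+\\.\\d+"))

# Digits immediately followed by "|": the opponent IDs (rounds without an
# opponent, such as "H    |", contribute nothing).
sample_opponents <- str_extract_all(sample_player, "\\d+(?=\\|)")[[1]]

print(list(id = sample_id, name = sample_name,
           points = sample_points, opponents = sample_opponents))
```

Note that the leading `1 |` is not picked up as an opponent because a space sits between the digit and the `|`.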

I will now extract all the information we need from the “ratings” row.

playerState <- str_extract(ratings, "\\w+")
playerRating <- (str_extract(ratings, "(\\:\\s\\s?\\d+)([[:alpha:]]\\d+)?"))
playerRating <- as.numeric(str_extract(playerRating, "\\d+")) #I used a second step to get the player's rating by itself.
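
The two-step rating extraction is easier to follow on one line. The sketch below uses a hypothetical `sample_ratings` string that only imitates the layout of a "ratings" row, including a letter-and-digits suffix after the pre-rating.

```r
library(stringr)

# Hypothetical line imitating the layout of a "ratings" row.
sample_ratings <- "   ON | 12345678 / R: 1794P12 ->1817     |N:2  |W    |B    |"

# First run of word characters: the state.
sample_state <- str_extract(sample_ratings, "\\w+")

# Step 1: grab the colon, one or two spaces, the pre-rating, and any suffix.
raw_rating <- str_extract(sample_ratings, "(\\:\\s\\s?\\d+)([[:alpha:]]\\d+)?")

# Step 2: keep only the first run of digits, which drops the suffix.
sample_rating <- as.numeric(str_extract(raw_rating, "\\d+"))

print(c(state = sample_state, raw = raw_rating, rating = sample_rating))
```

Anchoring on the colon followed by a space is what keeps the ID number and the post-rating after `->` from being picked up.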

Let’s put together this dataframe and see what it looks like for now.

chess_data_trans <- data.frame(playerId, playerName, playerState, playerPoints, playerRating)
head(chess_data_trans)

Great! We’re a step closer to completing the task at hand.

The last step is to add a column that shows the average of the opponents’ pre-ratings. This requires a for loop that looks up each opponent’s pre-rating for every player; I will use the mean function to compute the average for simplicity.

First let’s find a way to get a list of the opponents by player ID:

unlist(playerOpponent[playerId[1]])

## [1] "39" "21" "18" "14" "7"  "12" "4"

unlist(playerOpponent[playerId[2]])

## [1] "63" "58" "4"  "17" "16" "20" "7"

We can see above that I managed to pull a list of the opponents’ IDs for players Gary Hua and Dakshesh Daruri, who have player IDs 1 and 2 respectively.

Next let’s try to get the actual pre-rating for each of these opponents:

playerRating[as.numeric(unlist(playerOpponent[playerId[1]]))]

## [1] 1436 1563 1600 1610 1649 1663 1716

playerRating[as.numeric(unlist(playerOpponent[playerId[2]]))]

## [1] 1175  917 1716 1629 1604 1595 1649

Lastly, the mean. I’ll round the number to an integer:

round(mean(playerRating[as.numeric(unlist(playerOpponent[playerId[1]]))]), digits = 0)

## [1] 1605

round(mean(playerRating[as.numeric(unlist(playerOpponent[playerId[2]]))]), digits = 0)

## [1] 1469

OK, let’s wrap this up in a for loop so that we don’t have to do this one by one for all 64 players:

avgRating <- numeric(length(playerId)) #Pre-allocate one slot per player.
for (i in seq_along(playerId)) {
  avgRating[i] <- round(mean(playerRating[as.numeric(unlist(playerOpponent[playerId[i]]))]), digits = 0)
}
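
As an aside, the same lookup can be written without an explicit loop. This is only a sketch on hypothetical toy vectors (`toy_rating` and `toy_opponents` are made up), and it relies on the same assumption as the loop above: a player’s ID equals that player’s position in the ratings vector.

```r
# Hypothetical pre-ratings for four players; position in the vector = player ID.
toy_rating <- c(1500, 1600, 1700, 1800)

# Hypothetical opponent IDs per player, stored as character like the regex output.
toy_opponents <- list(c("2", "3", "4"), c("1", "3"), c("1", "2"), c("1"))

# For each player, look up the opponents' ratings by position and average them.
toy_avg <- vapply(
  toy_opponents,
  function(ids) round(mean(toy_rating[as.integer(ids)])),
  numeric(1)
)

print(toy_avg)
```

`vapply` returns a plain numeric vector of the same length as the list, so it can be added to a data frame in the same way as the vector built by the loop.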

Our next step is to add this last vector to our dataframe and take a look at the data:

chess_data_trans <- data.frame(playerId, playerName, playerState, playerPoints, playerRating, avgRating)
colnames(chess_data_trans) <- c("Player ID", "Player Name", "State", "Total Points", "Pre-Rating", "Average Opponents Pre-Rating")
head(chess_data_trans, 10)

Let’s see how many players there are per state:

table(chess_data_trans$State)

## 
## MI OH ON 
## 55  1  8

Highest player pre-rating by state:

tapply(chess_data_trans$`Pre-Rating`, chess_data_trans$State, max)

##   MI   OH   ON 
## 1745 1686 1794

The highest player pre-rating in the whole data set:

subset(chess_data_trans, `Pre-Rating` == max(chess_data_trans$`Pre-Rating`), select = c(`Player ID`, `Player Name`, State, `Total Points`, `Pre-Rating`))

Average Pre-Rating by state:

avg_state_rating <- aggregate(x=chess_data_trans["Pre-Rating"], by = list(State=chess_data_trans$State), FUN = mean, na.rm=TRUE)
avg_state_rating

library(ggplot2)
ggplot(data = avg_state_rating, aes(x = reorder(State, -`Pre-Rating`), y = `Pre-Rating`)) + geom_bar(stat = "identity")

In our final step we generate a .CSV file:

write.table(chess_data_trans, file = "chessExtraction.csv", row.names = FALSE, na = "", col.names = TRUE, sep = ",")
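
Because the column names contain spaces, it can be worth checking that the file reads back as expected. The sketch below does a round trip with a small hypothetical data frame (`toy`) and a temporary file rather than the real output; `check.names = FALSE` is what keeps `read.csv` from rewriting the headers.

```r
# Hypothetical two-row data frame with the same kind of spaced column names.
toy <- data.frame(id = 1:2, name = c("JANE DOE", "JOHN ROE"), stringsAsFactors = FALSE)
colnames(toy) <- c("Player ID", "Player Name")

# Write with the same arguments used above, but to a temporary file.
tmp <- tempfile(fileext = ".csv")
write.table(toy, file = tmp, row.names = FALSE, na = "", col.names = TRUE, sep = ",")

# Read it back, keeping the headers exactly as written.
toy_back <- read.csv(tmp, check.names = FALSE, stringsAsFactors = FALSE)
print(toy_back)
```

Without `check.names = FALSE`, R would convert a header such as "Player ID" into a syntactic name like `Player.ID` on import.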

CUNY SPS - Master of Science in Data Science - DATA607

Project 1: Chess Tournament Data Manipulation

Mario Pena
