In this project, we are given a text file of chess tournament results in which the information follows a regular structure. I will extract the relevant information in an R Markdown file and then generate a .CSV file that could be loaded into a database management system.
I will look for and obtain the following player information:
Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents.
My first step is to load the data into R:
chess_data <- read.delim("https://raw.githubusercontent.com/marioipena/Project1data/master/tournamentinfo.txt", header = FALSE, stringsAsFactors = FALSE)
Let’s take a look at our data to get a better sense of how we will extract the required information:
head(chess_data, 10)
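We can see from the first few lines of our data that there is a repeating pattern: after a four-line header, each player occupies two lines followed by a dashed separator, so the records repeat every three lines. I will use this pattern to pull out the two lines per player that contain the information we want to extract.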
player = chess_data[seq(5, nrow(chess_data), 3), ] #The player data starts at line 5
ratings = chess_data[seq(6, nrow(chess_data), 3), ] #The ratings data starts at line 6
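To make the row-selection pattern concrete, here is a minimal sketch on a toy character vector. The vector `toy_lines` and its contents are made up for illustration; they only mimic the layout of four header lines followed by repeating groups of a player line, a ratings line, and a dashed separator.

```r
# Toy stand-in for the file: 4 header lines, then repeating groups of
# (player line, ratings line, dashed separator). Contents are illustrative.
toy_lines <- c(
  "---------",           # line 1: separator
  "Pair | Player Name",  # line 2: header
  "Num  | USCF ID",      # line 3: header
  "---------",           # line 4: separator
  "1 | PLAYER A",        # line 5: first player line
  "ON | 111",            # line 6: first ratings line
  "---------",           # line 7: separator
  "2 | PLAYER B",        # line 8: second player line
  "MI | 222",            # line 9: second ratings line
  "---------"            # line 10: separator
)

# Every third line starting at 5 is a player line; starting at 6, a ratings line.
toy_player  <- toy_lines[seq(5, length(toy_lines), 3)]
toy_ratings <- toy_lines[seq(6, length(toy_lines), 3)]
```

In the real data the same `seq()` calls index the rows of `chess_data`; because that data frame has a single column, the `[rows, ]` subset should come back as a plain character vector, which is what the regular expressions below operate on.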
Now we will use regular expressions to extract only the information specified in the project instructions.
I will extract each variable separately into its own vector and later combine the vectors into a dataframe.
First, let’s extract all the information we will need from the “player” row.
Note that I will also extract the player ID, as it will help us later on to obtain the average rating of each player’s opponents.
library(stringr)
playerId <- as.integer(str_extract(player, "\\d+"))
playerName <- str_extract(player, "(\\w+\\s){2,4}(\\w+-\\w+)?") #There are players with 4 names and hyphenated names.
playerName <- str_trim(playerName) #This gets rid of the extra blank spaces.
playerPoints <- as.numeric(str_extract(player, "\\d+\\.\\d+"))
playerOpponent <- str_extract_all(player, "\\d+(?=\\|)") #The lookahead matches digits followed by a pipe, so we get the opponent ID by itself in one step.
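As a quick check of these patterns, the sketch below applies them to a single line. The string `sample_player` is typed by hand to resemble one player row rather than read from the file, and the `sample_*` names are introduced only for this illustration. The opponent pattern here uses a lookahead, `(?=\\|)`, so the digits are matched without the trailing pipe.

```r
library(stringr)

# A sample line typed to resemble one player row (illustrative, not read from the file).
sample_player <- "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"

# First run of digits on the line is the player ID.
sample_id <- as.integer(str_extract(sample_player, "\\d+"))

# Two to four space-separated words, optionally ending in a hyphenated word.
sample_name <- str_trim(str_extract(sample_player, "(\\w+\\s){2,4}(\\w+-\\w+)?"))

# The only "digits.digits" token on the line is the total points.
sample_points <- as.numeric(str_extract(sample_player, "\\d+\\.\\d+"))

# Digits directly followed by a pipe are opponent IDs; the points field is
# skipped because spaces sit between its digits and the pipe.
sample_opponents <- str_extract_all(sample_player, "\\d+(?=\\|)")[[1]]
```

Rounds with no opponent number on the line simply produce no match, so such a player would end up with a shorter opponent vector.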
I will now extract all the information we need from the “ratings” row.
playerState <- str_extract(ratings, "\\w+")
playerRating <- (str_extract(ratings, "(\\:\\s\\s?\\d+)([[:alpha:]]\\d+)?"))
playerRating <- as.numeric(str_extract(playerRating, "\\d+")) #I used a second step to get the player's rating by itself.
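The two-step rating extraction can be sketched on hand-typed lines as well. Both strings in `sample_ratings` are illustrative stand-ins for ratings rows (the second uses a made-up ID and a letter-suffixed rating), and the variable names are assumptions for this example only.

```r
library(stringr)

# Two sample ratings lines typed to resemble the file's layout (illustrative values);
# the second has a rating followed by a letter and more digits.
sample_ratings <- c(
  "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |",
  "   MI | 12345678 / R:  955P11 -> 979P18 |N:4  |B    |W    |"
)

# The first word on each line is the state.
sample_state <- str_extract(sample_ratings, "\\w+")

# Step 1: grab the colon, one or two spaces, the rating, and any letter suffix.
raw_rating <- str_extract(sample_ratings, "(\\:\\s\\s?\\d+)([[:alpha:]]\\d+)?")

# Step 2: keep only the leading digits, which are the pre-rating itself.
sample_pre <- as.numeric(str_extract(raw_rating, "\\d+"))
```

The first step is what keeps the eight-digit ID and the post-tournament rating out of the result: only digits that follow a colon and whitespace are considered.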
Let’s put together this dataframe and see what it looks like for now.
chess_data_trans <- data.frame(playerId, playerName, playerState, playerPoints, playerRating)
head(chess_data_trans)
Great! We’re a step closer to completing the task at hand.
The last step is to add a column with the average pre-rating of each player’s opponents. This requires a for loop over the players; inside it, the mean function averages the pre-ratings of that player’s opponents.
First let’s find a way to get a list of the opponents by player ID:
unlist(playerOpponent[playerId[1]])
## [1] "39" "21" "18" "14" "7" "12" "4"
unlist(playerOpponent[playerId[2]])
## [1] "63" "58" "4" "17" "16" "20" "7"
We can see above that I managed to pull a list of the opponents’ IDs for Gary Hua and Dakshesh Daruri, who have player IDs 1 and 2 respectively.
Next let’s try to get the actual pre-rating for each of these opponents:
playerRating[as.numeric(unlist(playerOpponent[playerId[1]]))]
## [1] 1436 1563 1600 1610 1649 1663 1716
playerRating[as.numeric(unlist(playerOpponent[playerId[2]]))]
## [1] 1175 917 1716 1629 1604 1595 1649
Lastly, the mean. I’ll round the number to an integer:
round(mean(playerRating[as.numeric(unlist(playerOpponent[playerId[1]]))]), digits = 0)
## [1] 1605
round(mean(playerRating[as.numeric(unlist(playerOpponent[playerId[2]]))]), digits = 0)
## [1] 1469
Ok, let’s wrap this up in a for loop so that we don’t have to do this one by one for all 64 players:
avgRating <- numeric(length(playerId)) #Preallocate one slot per player.
for (i in seq_along(playerId)) {
  avgRating[i] <- round(mean(playerRating[as.numeric(unlist(playerOpponent[playerId[i]]))]), digits = 0)
}
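The same lookup-and-average logic can also be written without an explicit loop. The sketch below uses `sapply` on toy inputs; `toy_rating` and `toy_opponents` are invented for illustration and, like the loop above, assume that player IDs run from 1 upward in row order so that an ID can be used directly as an index.

```r
# Toy inputs (illustrative): ratings indexed by player ID, and each player's opponent IDs.
toy_rating    <- c(1800, 1600, 1400, 1700)
toy_opponents <- list(c("2", "3"), c("1", "4"), c("1"), c("2", "3", "1"))

# For each player, look up the opponents' ratings by ID and average them.
toy_avg <- sapply(toy_opponents, function(ids) {
  round(mean(toy_rating[as.numeric(ids)]), digits = 0)
})
```

On the real vectors this would amount to passing `playerOpponent` and `playerRating` in place of the toy objects; whether that reads better than the loop is a matter of taste.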
Our next step is to add this last vector to our dataframe and take a look at the data:
chess_data_trans <- data.frame(playerId, playerName, playerState, playerPoints, playerRating, avgRating)
colnames(chess_data_trans) <- c("Player ID", "Player Name", "State", "Total Points", "Pre-Rating", "Average Opponents Pre-Rating")
head(chess_data_trans, 10)
Let’s see how many players there are per state:
table(chess_data_trans$State)
##
## MI OH ON
## 55  1  8
Highest player pre-rating by state:
tapply(chess_data_trans$`Pre-Rating`, chess_data_trans$State, max)
##   MI   OH   ON
## 1745 1686 1794
The highest player pre-rating in the whole data set:
subset(chess_data_trans, `Pre-Rating` == max(chess_data_trans$`Pre-Rating`), select = c(`Player ID`, `Player Name`, State, `Total Points`, `Pre-Rating`))
Average Pre-Rating by state:
avg_state_rating <- aggregate(x=chess_data_trans["Pre-Rating"], by = list(State=chess_data_trans$State), FUN = mean, na.rm=TRUE)
avg_state_rating
library(ggplot2)
ggplot(data = avg_state_rating, aes(x = reorder(State, -`Pre-Rating`), y = `Pre-Rating`)) + geom_bar(stat = "identity")
In our final step we generate a .CSV file:
write.table(chess_data_trans, file = "chessExtraction.csv", row.names = FALSE, na = "", col.names = TRUE, sep = ",")
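One way to confirm the export is to read the file back in. The sketch below does this round trip on a small made-up data frame written to a temporary file; `toy_export`, `toy_path`, and `toy_back` are hypothetical names, and the real check would point at "chessExtraction.csv" instead.

```r
# Small stand-in for the final data frame, with the same style of column names.
toy_export <- data.frame(
  id   = 1:2,
  name = c("GARY HUA", "DAKSHESH DARURI"),
  stringsAsFactors = FALSE
)
colnames(toy_export) <- c("Player ID", "Player Name")

# Write with the same write.table arguments used above, but to a temporary path.
toy_path <- tempfile(fileext = ".csv")
write.table(toy_export, file = toy_path, row.names = FALSE, na = "", col.names = TRUE, sep = ",")

# check.names = FALSE keeps headers such as "Player ID" from being rewritten as "Player.ID".
toy_back <- read.csv(toy_path, check.names = FALSE, stringsAsFactors = FALSE)
```

Without `check.names = FALSE`, `read.csv` would convert the spaces and hyphens in the column names into dots, which is worth keeping in mind if the file is consumed by another R script.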