Data607 - Project 1

Chess Tournament

We will attempt to complete this project by using only REGEX (Regular Expressions) to extract all features necessary. I am sure there are many other ways to tackle this project, but since I’ve never used REGEX before, I decided to give it a try.

Initialization

We won’t need many packages, just the core ones loaded by TIDYVERSE.

rm(list=ls())
library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v tibble  3.1.6     v purrr   0.3.4
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.1.1     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x mosaic::count()            masks dplyr::count()
## x purrr::cross()             masks mosaic::cross()
## x mosaic::do()               masks dplyr::do()
## x tidyr::expand()            masks Matrix::expand()
## x dplyr::filter()            masks stats::filter()
## x ggstance::geom_errorbarh() masks ggplot2::geom_errorbarh()
## x dplyr::lag()               masks stats::lag()
## x tidyr::pack()              masks Matrix::pack()
## x mosaic::stat()             masks ggplot2::stat()
## x mosaic::tally()            masks dplyr::tally()
## x tidyr::unpack()            masks Matrix::unpack()

Load the chess.txt file

We will load the TXT file to manipulate it with R using regular expressions only. We will skip the first 4 lines, since they don’t have any information we will use in this project.

my_chess <- read_lines("chess.txt", skip=4)
head(my_chess)

## [1] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"
## [2] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
## [3] "-----------------------------------------------------------------------------------------"
## [4] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|"
## [5] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
## [6] "-----------------------------------------------------------------------------------------"
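
Each player takes up three lines: one with the name and round results, one with the state and rating, and a dashed separator. As a small sketch of how that layout can be split by position, here are the six lines printed above (sample_lines, name_rows and rating_rows are names made up for this illustration, not objects used elsewhere in the report):

```r
# The first six lines of the file, copied from the preview above
sample_lines <- c(
  "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|",
  "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |",
  "-----------------------------------------------------------------------------------------",
  "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|",
  "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |",
  "-----------------------------------------------------------------------------------------"
)

n <- length(sample_lines)

# Lines 1, 4, ... hold the name, points and round results
name_rows <- sample_lines[seq(1, n, by = 3)]

# Lines 2, 5, ... hold the state and rating
rating_rows <- sample_lines[seq(2, n, by = 3)]

length(name_rows)
length(rating_rows)
```

The same seq(start, n, by = 3) indexing is what selects the relevant rows in the extraction steps of this report.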

REGEX Matching and extracting features

For each feature we will follow the same procedure:

1. Define a REGEX pattern.
2. Use str_match_all to extract the matching text.
3. Remove any unwanted artifacts, like empty rows or columns.
4. Trim any whitespace and convert to numeric if necessary.
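
To make the procedure concrete, here is a minimal sketch that runs it on a single line of the file. It uses base R only (regexpr and regmatches stand in for str_match_all) so that it runs without any packages; the variable names are made up for the illustration.

```r
# One line copied from the file preview
line <- "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"

# Define the patterns: a name sits between "| " and the next "|",
# and the points look like digit-dot-digit
name_pattern   <- "(?<=\\| )[A-Za-z -]{6,}(?=\\|)"
points_pattern <- "\\d\\.\\d"

# Extract the first match of each pattern (perl = TRUE enables lookarounds)
raw_name   <- regmatches(line, regexpr(name_pattern, line, perl = TRUE))
raw_points <- regmatches(line, regexpr(points_pattern, line, perl = TRUE))

# Trim the padding and convert to numeric where it makes sense
player_name   <- trimws(raw_name)
player_points <- as.numeric(raw_points)

player_name
player_points
```

The raw name comes back with all of its trailing padding, which is why the trimming step is needed before the value is usable.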

Extract player’s NAMES

pattern_names <- "(?<=\\| )[A-Za-z -]{6,}(?=\\|)"
names <- str_match_all(my_chess,pattern_names)

# str_match_all returns an entry for every line, but names only appear on
# the first line of each 3-line record, so keep rows 1, 4, 7, ...
# and remove all leading and trailing white space
names2 <- str_trim(names[seq(1, length(names), 3)], side = "both")
head(names2)

## [1] "GARY HUA"            "DAKSHESH DARURI"     "ADITYA BAJAJ"       
## [4] "PATRICK H SCHILLING" "HANSHI ZUO"          "HANSEN SONG"

Extract player’s STATES

# Pattern for STATE
pattern_states <- "[A-Z]{2}(?=\\s\\|)"
states <- str_match_all(my_chess,pattern_states)
# States are on the second line of each 3-line record
states2 <- str_trim(states[seq(2, length(states), 3)], side = "both")
head(states2)

## [1] "ON" "MI" "MI" "MI" "MI" "OH"

Extract player’s POINTS

pattern_points <- "\\d\\.\\d"
points <- str_match_all(my_chess,pattern_points)
# Points are on the first line of each 3-line record
points2 <- str_trim(points[seq(1, length(points), 3)], side = "both")
head(points2)

## [1] "6.0" "6.0" "6.0" "5.5" "5.5" "5.0"

Extract PLAYER’s RATING

# Pattern for the rating: 3 or 4 digits that follow "R: " or "R:  "
pattern_rating <- "((?<=R: )|(?<=R:  ))\\d{3,4}"
rating <- str_match_all(my_chess,pattern_rating)
# Keep only the second line of each 3-line record
rating2 <- rating[seq(2, length(rating), 3)]
# Drop the capture-group column and keep the full match
rating_temp <- lapply(rating2, function(x) x[,-2])
# Handle unrated players: no match means a placeholder of "0000"
rating_temp[lengths(rating_temp) == 0] <- "0000"
# Trim whitespace
rating2 <- str_trim(rating_temp, side = "both")
# Convert from character to numeric
rating2 <- as.numeric(rating2)
head(rating2)

## [1] 1794 1553 1384 1716 1655 1686
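
An alternative to writing one lookbehind per spacing is a capture group, which lets the number of spaces after "R:" vary freely. The following is only a sketch in base R, not the method used above. The first line is copied from the file preview; the other two are hypothetical lines made up to show a two-space rating and a line with no numeric rating.

```r
rating_lines <- c(
  "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |",
  "   XX | 00000000 / R:  955   -> 960     |N:2  |",  # hypothetical: two spaces
  "   XX | 00000000 / R: UNR               |N:2  |"   # hypothetical: no digits
)

# regexec returns the full match followed by each capture group
matches <- regmatches(
  rating_lines,
  regexec("R:\\s*(\\d{3,4})", rating_lines, perl = TRUE)
)

# Take the captured digits, or NA when the line had no match
rating_text <- vapply(
  matches,
  function(m) if (length(m) >= 2) m[2] else NA_character_,
  character(1)
)
pre_rating <- as.numeric(rating_text)

pre_rating
```

Using NA instead of a placeholder of 0 is a design choice: a 0 would be averaged in as if it were a real rating, whereas NA can be skipped with mean(x, na.rm = TRUE).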

Extract PLAYERS PLAYED

# Keep the first line of each 3-line record, which holds the round results
my_chess_games <- my_chess[seq(1, length(my_chess), 3)]

# This pattern extracts the number that follows a W, L or D
pattern_games <- "(((?<=W )|(?<=L ))|(?<=D ))\\s{0,3}\\d{0,2}(?=\\|)"

# Results are stored in a list with one matrix per player
players_played <- str_match_all(my_chess_games,pattern_games)

# Drop columns 2 and 3 (the capture groups) and keep the full match
players2 <- lapply(players_played, function(x) x[,-c(2:3)])

# Trim the whitespace
players2 <- lapply(players2,str_trim)
head(players2)

## [[1]]
## [1] "39" "21" "18" "14" "7"  "12" "4" 
## 
## [[2]]
## [1] "63" "58" "4"  "17" "16" "20" "7" 
## 
## [[3]]
## [1] "8"  "61" "25" "21" "11" "13" "12"
## 
## [[4]]
## [1] "23" "28" "2"  "26" "5"  "19" "1" 
## 
## [[5]]
## [1] "45" "37" "12" "13" "4"  "14" "17"
## 
## [[6]]
## [1] "34" "29" "11" "35" "10" "27" "21"
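
As a cross-check of the idea, here is a sketch that pulls the opponent numbers out of a single results line with base R. The pattern is a simplified variant of pattern_games (one character class instead of three lookbehinds, and at least one digit required), so treat it as an assumption rather than a drop-in replacement.

```r
# First results line, copied from the file preview
games_line <- "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"

# Digits that follow "W ", "L " or "D " and end right before a "|"
opp_pattern <- "(?<=[WLD] )\\s*\\d+(?=\\|)"

# gregexpr finds every match on the line; [[1]] takes the result for our one line
raw_opponents <- regmatches(
  games_line,
  gregexpr(opp_pattern, games_line, perl = TRUE)
)[[1]]

# Trim the padding and convert to integers so they can be used as indices
opponents <- as.integer(trimws(raw_opponents))

opponents
```

For this line the result agrees with the first element of players2 shown above: 39, 21, 18, 14, 7, 12 and 4.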

Calculate AVERAGE RATING OF PLAYERS PLAYED

This one was tricky. I had all the data elements I needed to calculate the mean ratings, and I was looking for a way to do it without a for loop. Unfortunately, I don’t know enough R to figure out a sleek way of applying MEAN to the vectors I had and storing the results as a simple vector. LAPPLY would generate the AVERAGES I needed, but it returned a very complex nested LIST of LISTS of LISTS. So I gave up and used a FOR LOOP.

index_players <- lapply(players2,as.numeric)
rows_players <- length(index_players)
player_opponents <- vector(mode = "list", length = rows_players)

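# For each player, look up the rating of every opponent by player number.
# seq_len and seq_along skip the loop body when there is nothing to iterate over.
for (row in seq_len(rows_players)) {
  for (col in seq_along(index_players[[row]])) {
    player_opponents[[row]][col] <- rating2[[index_players[[row]][col]]]
  }
}

# Take the mean for each player, then simplify the list into a single vector
avg_players <- round(unlist(lapply(player_opponents,mean),recursive=FALSE))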

head(avg_players)

## [1] 1605 1469 1564 1574 1501 1519
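
For what it is worth, the loop can probably be avoided, because R allows a vector to be indexed by another vector: ratings[c(2, 3)] returns the ratings of players 2 and 3 in one step. Here is a minimal sketch with made-up data; toy_ratings and toy_opponents are hypothetical stand-ins for rating2 and index_players.

```r
# Hypothetical ratings for four players, indexed by player number
toy_ratings <- c(1500, 1600, 1700, 1800)

# Hypothetical opponents faced by each player; the last one played nobody
toy_opponents <- list(c(2, 3), c(1, 4), c(1, 2, 4), integer(0))

# Index the ratings by each opponent vector and average the result.
# vapply guarantees a plain numeric vector, one value per player.
toy_avg <- vapply(
  toy_opponents,
  function(idx) mean(toy_ratings[idx]),
  numeric(1)
)
toy_avg <- round(toy_avg)

toy_avg
```

A player with no opponents gets NaN here instead of an error. Under the same assumptions, vapply(index_players, function(idx) mean(rating2[idx]), numeric(1)) should give the same averages as the loop.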

Create CSV file

Now we have all the information we need. We will combine it into a comma-separated file.

output <- cbind(names2,states2,points2,rating2,avg_players)
write.csv(output,"project1.txt", row.names=FALSE)
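
One thing to be aware of: cbind on vectors of mixed types produces a character matrix, so every column is written as text and the types are only recovered when the file is read back. A sketch of an alternative that keeps the types, using data.frame and the first two rows of results from this report (the column names and the temporary file are assumptions made for the illustration):

```r
# cbind coerces everything to character when any input is character
mixed <- cbind(c("GARY HUA", "DAKSHESH DARURI"), c(1794, 1553))
is.character(mixed)

# A data frame keeps one type per column
results <- data.frame(
  name           = c("GARY HUA", "DAKSHESH DARURI"),
  state          = c("ON", "MI"),
  points         = c(6.0, 6.0),
  pre_rating     = c(1794, 1553),
  avg_opp_rating = c(1605, 1469),
  stringsAsFactors = FALSE
)

# Write to a temporary file and read it back to confirm the round trip
out_file <- tempfile(fileext = ".csv")
write.csv(results, out_file, row.names = FALSE)
round_trip <- read.csv(out_file, stringsAsFactors = FALSE)

str(round_trip)
```

Descriptive column names also make the header row of the CSV easier to read than the intermediate variable names.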

Let’s check the CSV to make sure all is OK

The final thing to do is to load the file and inspect it to see whether the results are exactly what we wanted.

check_csv <- read_csv("project1.txt")

## Rows: 64 Columns: 5

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (2): names2, states2
## dbl (3): points2, rating2, avg_players

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(check_csv)

## # A tibble: 6 x 5
##   names2              states2 points2 rating2 avg_players
##   <chr>               <chr>     <dbl>   <dbl>       <dbl>
## 1 GARY HUA            ON          6      1794        1605
## 2 DAKSHESH DARURI     MI          6      1553        1469
## 3 ADITYA BAJAJ        MI          6      1384        1564
## 4 PATRICK H SCHILLING MI          5.5    1716        1574
## 5 HANSHI ZUO          MI          5.5    1655        1501
## 6 HANSEN SONG         OH          5      1686        1519

All good!