We will attempt to complete this project using only regular expressions (regex) to extract every feature we need. There are certainly other ways to tackle this project, but since I had never used regex before, I decided to give it a try.
We won’t need many packages, just the core ones in the tidyverse.
rm(list=ls())
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble 3.1.6 v purrr 0.3.4
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.1.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x mosaic::count() masks dplyr::count()
## x purrr::cross() masks mosaic::cross()
## x mosaic::do() masks dplyr::do()
## x tidyr::expand() masks Matrix::expand()
## x dplyr::filter() masks stats::filter()
## x ggstance::geom_errorbarh() masks ggplot2::geom_errorbarh()
## x dplyr::lag() masks stats::lag()
## x tidyr::pack() masks Matrix::pack()
## x mosaic::stat() masks ggplot2::stat()
## x mosaic::tally() masks dplyr::tally()
## x tidyr::unpack() masks Matrix::unpack()
We will load the TXT file to manipulate it with R using regular expressions only. We will skip the first 4 lines, since they don’t have any information we will use in this project.
my_chess <- read_lines("chess.txt", skip=4)
head(my_chess)
## [1] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [2] " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
## [3] "-----------------------------------------------------------------------------------------"
## [4] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|"
## [5] " MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |"
## [6] "-----------------------------------------------------------------------------------------"
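The output above shows that every player occupies three consecutive lines: a line with the name, points and games, a line with the state and ratings, and a separator. As a minimal sketch of how that layout can be indexed, the snippet below uses a shortened, hypothetical copy of the first two records (sample_lines, position, name_lines and rating_lines are illustrative names, not part of the project code).

```r
# Shortened copy of the first two records, in the same three-line layout
sample_lines <- c(
  " 1 | GARY HUA |6.0 |W 39|W 21|",
  " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |",
  "-----------------------------",
  " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|",
  " MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |",
  "-----------------------------"
)

# Position of each line inside its record: 1, 2, 3, 1, 2, 3, ...
position <- (seq_along(sample_lines) - 1) %% 3 + 1

name_lines   <- sample_lines[position == 1]  # pair number, name, points, games
rating_lines <- sample_lines[position == 2]  # state, player id, ratings
```

This is the same idea as calling seq() with a step of 3, only written so that the role of each line is explicit.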
For each feature we will follow the same procedure: define a regex pattern, use str_match_all to extract the matches, remove any unwanted artifacts such as empty rows or columns, and finally trim any white space and convert to numeric if necessary.
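As a small illustration of that procedure, here is a sketch that runs all the steps on a single line taken from the output above. It uses str_extract rather than str_match_all because only one match is expected on one line; the names line, pattern, raw and total_points are made up for this example.

```r
library(stringr)

# One name line, copied from the head() output
line <- " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"

# Step 1: define the pattern (a digit, a literal dot, a digit)
pattern <- "\\d\\.\\d"

# Step 2: extract the first match as a character string
raw <- str_extract(line, pattern)

# Steps 3 and 4: trim white space and convert to numeric
total_points <- as.numeric(str_trim(raw))
```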
pattern_names <- "(?<=\\| )[A-Za-z -]{6,}(?=\\|)"
names <- str_match_all(my_chess,pattern_names)
# str_match_all returns one result per input line, so lines 2 and 3 of
# every record come back empty. Keep every third result (the name lines)
# and remove all leading and trailing white space
names2 <- str_trim(names[seq(1, length(names), 3)], side = "both")
head(names2)
## [1] "GARY HUA" "DAKSHESH DARURI" "ADITYA BAJAJ"
## [4] "PATRICK H SCHILLING" "HANSHI ZUO" "HANSEN SONG"
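The empty results mentioned in the comments come from the fact that str_match_all returns one matrix per input line, even for lines where nothing matches. The sketch below shows this on one shortened, hypothetical record and contrasts it with str_extract, which returns NA for those lines instead. The variable names lines, per_line and names_vec are assumptions for the example.

```r
library(stringr)

# One shortened record: name line, rating line, separator
lines <- c(
  " 1 | GARY HUA |6.0 |W 39|",
  " ON | 15445895 / R: 1794 ->1817 |N:2 |W |",
  "-----------------------------"
)

pattern_names <- "(?<=\\| )[A-Za-z -]{6,}(?=\\|)"

# One matrix per line; lines without a match give a matrix with zero rows
per_line <- str_match_all(lines, pattern_names)
match_counts <- lengths(per_line)

# str_extract gives a plain character vector with NA where nothing matched
names_vec <- str_trim(str_extract(lines, pattern_names))
```

Either way, only the first line of each record carries a name, which is why every third result is kept.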
# Pattern for STATE
pattern_states <- "[A-Z]{2}(?=\\s\\|)"
states <- str_match_all(my_chess,pattern_states)
states2 <- str_trim(states[seq(2, length(states), 3)], side = "both")
head(states2)
## [1] "ON" "MI" "MI" "MI" "MI" "OH"
pattern_points <- "\\d\\.\\d"
points <- str_match_all(my_chess,pattern_points)
points2 <- str_trim(points[seq(1, length(points), 3)], side = "both")
head(points2)
## [1] "6.0" "6.0" "6.0" "5.5" "5.5" "5.0"
# Pattern Rating
# Look behind for "R:" followed by one or two spaces, then take 3 or 4 digits
pattern_rating <- "((?<=R: )|(?<=R:  ))\\d{3,4}"
rating <- str_match_all(my_chess,pattern_rating)
#Remove unneeded rows
rating2 <- rating[seq(2, length(rating), 3)]
#Remove unneeded columns
rating_temp <- lapply(rating2, function(x) x[,-2])
# Handle unrated players: lines without a match get a placeholder rating of 0
rating_temp[lengths(rating_temp) == 0] <- "0000"
# Trim whitespace
rating2 <- str_trim(rating_temp, side = "both")
# Convert from character to numeric
rating2 <- as.numeric(rating2)
head(rating2)
## [1] 1794 1553 1384 1716 1655 1686
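A possible alternative to the lookbehind is a capture group, which allows any amount of white space after "R:" and avoids having to drop extra columns. This is only a sketch: the second and third sample lines are hypothetical (a three-digit rating with extra padding and an unrated player), and m and pre_rating are illustrative names.

```r
library(stringr)

rating_lines <- c(
  " ON | 15445895 / R: 1794 ->1817 |N:2 |W |",        # from the output above
  " MI | 12345678 / R:  955P11 -> 979P18 |N:3 |B |",  # hypothetical line
  " MI | 87654321 / R: Unrated -> 1200 |N:4 |W |"     # hypothetical line
)

# Column 1 is the full match, column 2 is the captured group of digits
m <- str_match(rating_lines, "R:\\s*(\\d{3,4})")

# Lines without a rating give NA, which is replaced by 0
pre_rating <- as.numeric(m[, 2])
pre_rating[is.na(pre_rating)] <- 0
```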
# Keep every 3rd line, starting with the first (the lines that list the games)
my_chess_games <- my_chess[seq(1, length(my_chess), 3)]
# This pattern extracts the opponent number that follows a W, L or D
pattern_games <- "(((?<=W )|(?<=L ))|(?<=D ))\\s{0,3}\\d{0,2}(?=\\|)"
# Results are stored in this list of lists
players_played <- str_match_all(my_chess_games,pattern_games)
# Drop columns 2 and 3 (the capture groups), since I won't use them
players2 <- lapply(players_played, function(x) x[,-c(2:3)])
# Need to trim for whitespace
players2 <- lapply(players2,str_trim)
head(players2)
## [[1]]
## [1] "39" "21" "18" "14" "7" "12" "4"
##
## [[2]]
## [1] "63" "58" "4" "17" "16" "20" "7"
##
## [[3]]
## [1] "8" "61" "25" "21" "11" "13" "12"
##
## [[4]]
## [1] "23" "28" "2" "26" "5" "19" "1"
##
## [[5]]
## [1] "45" "37" "12" "13" "4" "14" "17"
##
## [[6]]
## [1] "34" "29" "11" "35" "10" "27" "21"
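For comparison, here is a sketch of the same extraction written with capture groups, which returns the result letter and the opponent number together. It is not the pattern used above, only an assumption about how it could be written; game_line, rounds, results and opponents are illustrative names.

```r
library(stringr)

# First name line, copied from the head() output
game_line <- " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"

# Group 1: result letter (W, L or D); group 2: opponent number.
# The lookahead leaves the closing "|" available for the next match.
rounds <- str_match_all(game_line, "\\|([WLD])\\s+(\\d+)(?=\\|)")[[1]]

results   <- rounds[, 2]
opponents <- as.integer(rounds[, 3])
```

With this pattern, cells that contain no opponent number are simply not matched.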
This one was tricky. I had all the data elements I needed to calculate the mean of the opponents’ ratings, and I was looking for a way to do it without a for loop. Unfortunately, I don’t know enough R to figure out a sleek way of applying mean to the vectors I had and storing the results as a simple vector. lapply would generate the averages I needed, but the return value was a complicated nested list of lists. So I gave up and wrote the for loop.
index_players <- lapply(players2,as.numeric)
rows_players <- length(index_players)
player_opponents <- vector(mode = "list", length = rows_players)
# Look up the rating of every opponent each player faced
for (row in seq_len(rows_players)) {
  for (col in seq_along(index_players[[row]])) {
    player_opponents[[row]][col] <- rating2[[index_players[[row]][col]]]
  }
}
# We need to take mean and then simplify list of vectors into a single vector
avg_players <- round(unlist(lapply(player_opponents,mean),recursive=FALSE))
head(avg_players)
## [1] 1605 1469 1564 1574 1501 1519
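One way the loop could be avoided is to index the ratings vector with each player's whole vector of opponent numbers and let vapply return a plain numeric vector. The sketch below uses small hypothetical data (four players with made-up ratings and opponents), so the numbers are not from the tournament file.

```r
# Hypothetical ratings for four players, indexed by pair number
rating2 <- c(1500, 1600, 1700, 1800)

# Hypothetical opponents faced by each player, as pair numbers
index_players <- list(c(2, 3), c(1, 4), c(1, 2, 4), c(3))

# rating2[idx] looks up all opponents at once; vapply guarantees
# one numeric value per player and returns a simple vector
avg_players <- round(
  vapply(index_players, function(idx) mean(rating2[idx]), numeric(1))
)
```

vapply is used here instead of lapply because it simplifies the result and checks that each mean is a single number.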
Now we have all the information we need. We will merge it into a comma-separated file.
output <- cbind(names2,states2,points2,rating2,avg_players)
write.csv(output,"project1.txt", row.names=FALSE)
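Note that cbind on vectors of mixed types produces a character matrix, so the numbers are written as quoted strings. A possible alternative, sketched here with the first two rows of the results, is to build a data frame so each column keeps its own type. The column names and the temporary file are assumptions made for the example.

```r
# First two players, values copied from the outputs above
output_df <- data.frame(
  name = c("GARY HUA", "DAKSHESH DARURI"),
  state = c("ON", "MI"),
  points = c(6.0, 6.0),
  pre_rating = c(1794, 1553),
  avg_opponent_rating = c(1605, 1469),
  stringsAsFactors = FALSE
)

# Write to a temporary file and read it back to confirm the column types
out_file <- tempfile(fileext = ".csv")
write.csv(output_df, out_file, row.names = FALSE)
round_trip <- read.csv(out_file, stringsAsFactors = FALSE)
```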
The final step is to load the file and inspect it to check that the results are exactly what we wanted.
check_csv <- read_csv("project1.txt")
## Rows: 64 Columns: 5
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (2): names2, states2
## dbl (3): points2, rating2, avg_players
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(check_csv)
## # A tibble: 6 x 5
## names2 states2 points2 rating2 avg_players
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 GARY HUA ON 6 1794 1605
## 2 DAKSHESH DARURI MI 6 1553 1469
## 3 ADITYA BAJAJ MI 6 1384 1564
## 4 PATRICK H SCHILLING MI 5.5 1716 1574
## 5 HANSHI ZUO MI 5.5 1655 1501
## 6 HANSEN SONG OH 5 1686 1519
All good!