Project1

Crosstable Cleaning

I spoke with Jeff Shamp regarding his approach to this problem. All the code is my own. I also included my own crosstable from the recent USATE to show how the code is robust and can process multiple files of similar type.

The crosstable was originally stored on blackboard, but my Rstudio could not access it there so I uploaded it to my github account. I initialize important libraries and download the file if it is not already present in my working directory.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(stringr)
library(stringi)

url <- 'https://raw.githubusercontent.com/sbellows1/607/master/tournamentinfo.txt?token=ALKCMBOIKKNNYO5VRLGIXQK6JWQPK'
if (!file.exists('tournamentinfo.txt')){
  download.file(url, 'tournamentinfo.txt')
}

Initial Clean

The first thing I would like to do is to remove the useless lines of dashes. Then, each person really takes up two lines, so we will split those lines out to have like information with like information. If you would like to try the USATE crosstable, simply change the filename below.

##Read in the data and remove lines with 0 info
filename <- 'tournamentinfo.txt'
crosstab <- read.delim(filename, header = FALSE, sep = '\n', stringsAsFactors = FALSE)
crosstab <- crosstab %>% filter(row_number() %% 3 != 1)

##DF has a pattern of alternating lines, so split them out into two dataframes.
set1 <- crosstab %>% filter(row_number() %% 2 == 0)
head(set1, 10)

##                                                                                            V1
## 1   Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | 
## 2      ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |
## 3      MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |
## 4      MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |
## 5      MI | 12616049 / R: 1716   ->1744     |N:2  |W    |B    |W    |B    |W    |B    |B    |
## 6      MI | 14601533 / R: 1655   ->1690     |N:2  |B    |W    |B    |W    |B    |W    |B    |
## 7      OH | 15055204 / R: 1686   ->1687     |N:3  |W    |B    |W    |B    |B    |W    |B    |
## 8      MI | 11146376 / R: 1649   ->1673     |N:3  |W    |B    |W    |B    |B    |W    |W    |
## 9      MI | 15142253 / R: 1641P17->1657P24  |N:3  |B    |W    |B    |W    |B    |W    |W    |
## 10     ON | 14954524 / R: 1411   ->1564     |N:2  |W    |B    |W    |B    |W    |B    |B    |

Regex

Here I use some regular expressions to extract the states and ratings for each player from the first half of the data.

##Extract the stats and pre-rating
state_regex <- '^\\s*([A-Z]{2})'
states <- apply(set1, 1, str_match, state_regex)[2,]
head(states, 10)

##  [1] NA   "ON" "MI" "MI" "MI" "MI" "OH" "MI" "MI" "ON"

rating_regex <- 'R:\\s*(\\d{3,4})'
pre_rating <- apply(set1, 1, str_match, rating_regex)[2,]
head(pre_rating, 10)

##  [1] NA     "1794" "1553" "1384" "1716" "1655" "1686" "1649" "1641" "1411"

Now I move on to the second half of the data having retrieved all of the requisite information

##Extract the player number then split the rows using | character
set2 <- crosstab %>% filter(row_number() %% 2 == 1)
head(set2,10)

##                                                                                            V1
## 1   Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| 
## 2       1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|
## 3       2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|
## 4       3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|
## 5       4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|
## 6       5 | HANSHI ZUO                      |5.5  |W  45|W  37|D  12|D  13|D   4|W  14|W  17|
## 7       6 | HANSEN SONG                     |5.0  |W  34|D  29|L  11|W  35|D  10|W  27|W  21|
## 8       7 | GARY DEE SWATHELL               |5.0  |W  57|W  46|W  13|W  11|L   1|W   9|L   2|
## 9       8 | EZEKIEL HOUGHTON                |5.0  |W   3|W  32|L  14|L   9|W  47|W  28|W  19|
## 10      9 | STEFANO LEE                     |5.0  |W  25|L  18|W  59|W   8|W  26|L   7|W  20|

More regex!

player_regex <- '^\\s*(\\d*)'
player_number <- apply(set2, 1, str_match, player_regex)[2,]
head(player_number, 10)

##  [1] ""  "1" "2" "3" "4" "5" "6" "7" "8" "9"

Splitting the data

The second line has a lot of information that I want split by the same character ‘|’. I will split these lines by this character and then store the resulting in a matrix. I will also remove the information about whether each game was with white or black as that is not relevant to my analysis.

name_score_opponents <- sapply(set2, str_split, '\\|')

##Translate these rows into a matrix, keep relevant portions of matrix, and clean the remaining cells
nso_matrix <- stri_list2matrix(name_score_opponents)
nso_matrix[4:nrow(nso_matrix),2:ncol(nso_matrix)] <- str_replace_all(nso_matrix[4:nrow(nso_matrix),2:ncol(nso_matrix)], '[A-Z]*', '')
nso_matrix <- apply(nso_matrix, 1, str_trim)

Bringing it together!

Now I collect all the information we have up to this point in one dataframe. The last step is for each player, I only know the number of the players they played, not the rating of the players they played. In order to find this, I will create our own function to lookup a player’s rating based on their number.

##Create overarching df with relevant information and names
df <- cbind(states, pre_rating, nso_matrix)
df <- as.data.frame(df, stringsasfactors = FALSE)
names(df) <- c('State', 'Rating', 'Num', 'Name', 'Pts', 'Opp1', 'Opp2', 'Opp3', 'Opp4', 'Opp5', 'Opp6', 'Opp7')
df <- df[2:dim(nso_matrix)[1], -13]
df <- apply(df, 2, as.character)
df <- as.data.frame(df, stringsAsFactors = FALSE)

#Function to turn an opponent pairing into an opponent rating
pairing_rating <- function(pairing){
  if (pairing %in% df$Num){
    df %>% filter(Num == pairing) %>% .[2]
  }
  else {
    NA
  }
}

opps <- df[,6:ncol(df)]

Finding Opponent Rating

Here I will loop through the rows of the dataframe and apply the function I made to the columns, thus looking up the rating of every opponent in the dataframe. If there is no opponent, the function returns NA. Then I will take the mean of these ratings to find the average opponent rating.

##Iterate over the rows and apply the function along the columns to apply the function to each cell
for (i in 1:dim(opps)[1]){
  opps[i,] <- apply(opps[i,], 2, pairing_rating)
}
opps <- apply(opps, 2, as.integer)

#Take the mean rating as that is what was requested
avg_opp_rating <- round(rowMeans(opps, na.rm = TRUE))
head(avg_opp_rating, 10)

##  [1] 1605 1469 1564 1574 1501 1519 1372 1468 1523 1554

Export

Finally, I tidy up the data and give the columns appropriate names before exporting to csv.

#Rebind into final dataframe
df <- cbind(df$Name, df$State, df$Pts, df$Rating, avg_opp_rating)
df <- as.data.frame(df)
names(df) <- c('Name', 'State', 'Points', 'Rating', 'Average_Opponent_Rating')

#Export
write.csv(df, 'cleaned_crosstable.txt')

Visualization

One quick little visualization before I go!

library(ggplot2)

df %>% ggplot(aes(x = as.integer(as.character(Rating)), y = Points, color = State)) + geom_point() + xlab('Rating')

We see a clear positive relationship between rating and score as we would expect. Also the Canadiens are bringing the heat!