Assignment
Hypothesis
Read in file
Get Rid of Header Records
Intitialize Variables
Loop Through Records in File 3 at a time
Calculate Pre Tournement Opponent Average
Check Correlations
Scatterplots
Conclusions

Assignment

In this project, you’re given a text file with chess tournament results where the information has some structure. Your job is to create an R Markdown file that generates a .CSV file (that could for example be imported into a SQL database) with the following information for all of the players:

Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents

For the first player, the information would be: Gary Hua, ON, 6.0, 1794, 1605

1605 was calculated by using the pre-tournament opponents’ ratings of 1436, 1563, 1600, 1610, 1649, 1663, 1716, and dividing by the total number of games played.

If you have questions about the meaning of the data or the results, please post them on the discussion forum. Data science, like chess, is a game of back and forth…

The chess rating system (invented by a Minnesota statistician named Arpad Elo) has been used in many other contexts, including assessing relative strength of employment candidates by human resource departments.

Hypothesis

A chess players total points in this tournament will correlate to how good of a player that they are based on pre-tournament rating.

Read in file

file <- readLines("tournamentinfo.txt")
head(file)

## [1] "-----------------------------------------------------------------------------------------" 
## [2] " Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| "
## [3] " Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | "
## [4] "-----------------------------------------------------------------------------------------" 
## [5] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|" 
## [6] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"

Get Rid of Header Records

file <- file[5: length(file)]
head(file)

## [1] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"
## [2] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
## [3] "-----------------------------------------------------------------------------------------"
## [4] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|"
## [5] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
## [6] "-----------------------------------------------------------------------------------------"

Intitialize Variables

num_players <- length(file) / 3
player_number <-vector()
player_name <-vector()
total_points <- vector()
num_games_played <-vector()
opponents <-vector()
state <- vector()
pre_tournament_rating <- vector()

Loop Through Records in File 3 at a time

for ( i in 1:num_players)  {
  
  rawPlayerData <- file[1:2]
  
  file <- file[4: length(file)]
  
  player_number <- c (player_number, as.numeric( str_extract (
    substr(rawPlayerData[1], 3, 7), '\\d{1,2}')))
  
  player_name <- c ( player_name, trimws(substr(rawPlayerData[1], 9, 40)))
  
  total_points <- c ( total_points, as.numeric(substr(rawPlayerData[1], 42, 44)))
  
  num_games_played <-c ( num_games_played, length(unlist(str_extract_all(substr(rawPlayerData[1],
                                                                                44,
                                                                                nchar(rawPlayerData[1])),
                                                                         "[WLD]")))) 
  
  opponents <- c ( opponents, str_extract_all(substr(rawPlayerData[1],
                                                     45,
                                                     nchar(rawPlayerData[1])), 
                                              "\\d{1,2}"))
  
  state <- c ( state, trimws(substr(rawPlayerData[2], 3, 6))) 
  
  pre_tournament_rating <- c ( pre_tournament_rating,
                               as.numeric(unlist (str_extract_all(rawPlayerData[2], 
                                                                  "[:space:]\\d{3,4}" ))[2])) 
}

players <- data.table(
  player_number = player_number,
  player_name = player_name,
  total_points = total_points,
  num_games_played = num_games_played,
  opponents = opponents,
  state = state,
  pre_tournament_rating = pre_tournament_rating,
  opponents_pre_tournament_average = 1
  )

Calculate Pre Tournement Opponent Average

for (i in 1:64) {
    opponents_list <- as.numeric(unlist(players[player_number==i][["opponents"]]))
    
    opponents_vector <- as.vector(opponents_list)
    
    sumOppPreTourneyAvg <- sum(subset(players, player_number %in% opponents_vector)
                                [, pre_tournament_rating]) / 
                                   as.numeric(players[[ i, c("num_games_played")]])
    
    players [ player_number == i,   
              opponents_pre_tournament_average := round(sumOppPreTourneyAvg, digits =2 ) ]

}

datatable(players)

Check Correlations

vars_to_correlate <- players[, c("total_points", 
                                 "pre_tournament_rating", 
                                 "opponents_pre_tournament_average" 
                                 ),
                             with = FALSE]

corrM <- cor(vars_to_correlate)
kable(corrM)

	total_points	pre_tournament_rating	opponents_pre_tournament_average
total_points	1.0000000	0.6093942	0.5505095
pre_tournament_rating	0.6093942	1.0000000	0.2839561
opponents_pre_tournament_average	0.5505095	0.2839561	1.0000000

corrplot(corrM, method = "ellipse", type = "upper")

Scatterplots

p <- ggplot(players, aes(x = pre_tournament_rating, y = total_points)) +
         geom_point() +
         stat_smooth(method = "lm", col = "red")

ggplotly()

p <- ggplot(players, aes(x = opponents_pre_tournament_average, y = total_points)) +
         geom_point() +
         stat_smooth(method = "lm", col = "red")

ggplotly()

p <- ggplot(players, aes(x = opponents_pre_tournament_average, y = pre_tournament_rating)) +
         geom_point() +
         stat_smooth(method = "lm", col = "red")

ggplotly()

Conclusions

This is a fair tournament. All players play players of varying skill levels (as demonstrated my their pre tournament rating). This shown by the fact that there is no correlation between a players pre tournament rating and the average rating of his/her opponent.
There is a weak positive relationship between a person’s pre tournament rating and the number of points a player receives (Pearson’s r = .6). Therefore it is inconclusive that, but more likely than not that a players pre tournament ranking is related to their pre tournament rating. To prove this conclusively, we would need data from many more chess tournaments.

CUNY DATA 607 Project 1

Raphael Nash

September 13, 2016