Data607 - Project 1

0.1 Environment Setup

library(tidyverse)

## -- Attaching packages ------------------------------------ tidyverse 1.2.1 --

## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.4.2     v dplyr   0.7.4
## v tidyr   0.8.0     v stringr 1.2.0
## v readr   1.1.1     v forcats 0.2.0

## -- Conflicts --------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(data.table)

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

## The following object is masked from 'package:purrr':
## 
##     transpose

library(magrittr)

## 
## Attaching package: 'magrittr'

## The following object is masked from 'package:purrr':
## 
##     set_names

## The following object is masked from 'package:tidyr':
## 
##     extract

library(RCurl)

## Loading required package: bitops

## 
## Attaching package: 'RCurl'

## The following object is masked from 'package:tidyr':
## 
##     complete

0.2 Task

In this project, you’re given a text file with chess tournament results where the information has some structure. Your job is to create an R Markdown file that generates a .CSV file (that could for example be imported into a SQL database) with the following information for all of the players: Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents.

For the first player, the information would be:

Gary Hua, ON, 6.0, 1794, 1605

1605 was calculated by using the pre-tournament opponents’ ratings of 1436, 1563, 1600, 1610, 1649, 1663, 1716, and dividing by the total number of games played.

If you have questions about the meaning of the data or the results, please post them on the discussion forum. Data science, like chess, is a game of back and forth. The chess rating system (invented by a Minnesota statistician named Arpad Elo) has been used in many other contexts, including assessing relative strength of employment candidates by human resource departments.

You may substitute another text file (or set of text files, or data scraped from web pages) of similar or greater complexity, and create your own assignment and solution. You may work in a small team. All of your code should be in an R markdown file (and published to rpubs.com), with your data accessible for the person running the script.
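The arithmetic behind the example can be checked by hand. The following is a minimal sketch in base R that uses only the seven ratings quoted in the task; the variable names are illustrative and do not appear in the rest of this report.

```r
# Pre-tournament ratings of Gary Hua's seven opponents, as quoted in the task
opp_ratings <- c(1436, 1563, 1600, 1610, 1649, 1663, 1716)

# Average = sum of the opponents' pre-ratings / number of games played
avg_opp <- sum(opp_ratings) / length(opp_ratings)

round(avg_opp)  # 1605
```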

0.3 Import Data

The text file has been uploaded to GitHub. The getURL function from RCurl downloads its raw contents, and the fread function from the data.table package reads them into a data table. Additional arguments skip the four header lines (skip = 4), set the field separator (sep = "|"), and pad rows that have too few columns (fill = TRUE).

txt <- getURL("https://raw.githubusercontent.com/baroncurtin2/data607/master/tournamentinfo.txt")

fields <- c("Number", "Name", "Points", "R1", "R2", "R3", "R4", "R5", "R6", "R7", "EOL")
data <- fread(txt, header = FALSE, skip = 4, sep = "|", fill = TRUE, stringsAsFactors = FALSE, col.names = fields)
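To make the pipe-delimited layout concrete, here is a small sketch in base R that splits a single player line by hand. It uses strsplit rather than fread, so it illustrates the layout only and is not the import method used above. The sample line is reconstructed for illustration, and its exact spacing is an assumption.

```r
# One player line in the layout of the tournament file
# (reconstructed for illustration; exact spacing is an assumption)
line <- "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"

# Split on the literal pipe; fixed = TRUE stops "|" being read as regex alternation
parts <- strsplit(line, "|", fixed = TRUE)[[1]]

# Strip the padding around each field
parts <- trimws(parts)

length(parts)  # 10 fields: Number, Name, Points, R1 to R7
parts[1:3]     # "1" "GARY HUA" "6.0"
```

Each line ends with a trailing pipe, which is presumably why the fields vector above carries an eleventh name, EOL, for the empty column that follows it.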

0.4 Data Cleansing

The resulting data contains many rows that carry no information (the dashed separator lines) as well as extra spaces, and both need to be removed. Functions from the tidyverse will assist. The player’s state is in the row that follows the player’s name row. Additional columns need to be created for the player’s state and pre-rating.

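data <- data %>%
  filter(Name != "") %>%      # drop rows with no Name (the dashed separators)
  select(Number:R7) %>%       # drop the empty EOL column
  mutate_all(str_trim) %>%    # strip padding from every field
  mutate(State = "",          # placeholders, filled in by the loop below
         Rating = "")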

Loop through the data table to pull the needed values from the row that follows each player’s name row. In that second row, the Number column contains the State, and the Name column contains the USCF ID and the pre-rating.

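for (i in 1:nrow(data)) {
  # Row i + 1 is the detail row that follows a player's name row
  data$State[i] <- data$Number[i + 1]
  # First run of 3-4 digits preceded by a blank, i.e. the pre-rating after "R:"
  data$Rating[i] <- str_extract(data$Name[i + 1], '[[:blank:]]{1}[[:digit:]]{3,4}')
}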

data <- data %>%
  filter(!is.na(Rating))
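The same row pairing can be expressed without a loop by shifting each column up by one position. The sketch below does this in base R on toy data; it is an alternative illustration, not the code used in this report. The second player and the helper extract_first are hypothetical.

```r
# Toy data with two rows per player, in the shape produced by the import step
# (the second player is hypothetical)
toy <- data.frame(
  Number = c("1", "ON", "2", "MI"),
  Name   = c("GARY HUA", "15445895 / R: 1794   ->1817",
             "PLAYER TWO", "12345678 / R: 1553   ->1663"),
  stringsAsFactors = FALSE
)

# Shift each column up by one so that every row sees the row that follows it
next_number <- c(toy$Number[-1], NA)
next_name   <- c(toy$Name[-1], NA)

# Hypothetical helper: first match of a pattern, or NA where there is none
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x)
  out <- rep(NA_character_, length(x))
  hit <- !is.na(m) & m > -1L
  out[hit] <- regmatches(x, m)  # regmatches returns matched elements only
  out
}

toy$State  <- next_number
toy$Rating <- trimws(extract_first(next_name, "[[:blank:]][[:digit:]]{3,4}"))

# Only player-name rows are followed by a row containing a rating
players <- toy[!is.na(toy$Rating), ]
players
```

Here the name rows keep the state and rating taken from the row beneath them, while the detail rows get an NA rating and are filtered out.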

Since the next part needs only the opponents’ numbers, we can reduce each round column to just the opponent’s number.

data <- data %>%
  mutate_at(vars(matches('R[[:digit:]]{1}')),
            funs(str_extract(. ,'[[:space:]]+[[:digit:]]{1,2}'))) %>%
  mutate_all(str_trim)
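To show what this extraction does to individual cells, here is a sketch in base R with the same regular expression. The round values are illustrative, and the entries without a number stand in for rounds with no opponent, such as a bye.

```r
# Illustrative round entries; "H" and "B" stand in for rounds with no opponent
rounds <- c("W  39", "D   4", "H", "B", "L   7")

# Locate whitespace followed by a one- or two-digit opponent number
m <- regexpr("[[:space:]]+[[:digit:]]{1,2}", rounds)

# Keep the matched text, trimmed; entries without a match stay NA
opp <- rep(NA_character_, length(rounds))
opp[m > -1L] <- trimws(regmatches(rounds, m))

opp  # "39" "4" NA NA "7"
```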

0.5 Opponents Ratings

First, we create a lookup table that maps each player’s ID (Number) to that player’s Rating.

ratings <- data %>%
  select(Number, Rating) %>%
  mutate_all(str_trim)

Next, we replace each opponent’s ID in the round columns with that opponent’s rating.

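for (i in 1:nrow(data)) {
  for (j in 4:10) {  # columns 4 to 10 hold R1 through R7
    # Replace the opponent's ID with that opponent's rating (column 2 of ratings);
    # rounds with no opponent stay NA
    data[i,j] <- ratings[ratings$Number == data[i,j], 2][1]
  }
}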
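The lookup itself can also be written as a single vectorized step with match. This is a sketch of that alternative on toy data, not the code used above. The ratings for players 2 and 3 are hypothetical.

```r
# Toy lookup table (ratings for players 2 and 3 are hypothetical)
ratings <- data.frame(
  Number = c("1", "2", "3"),
  Rating = c("1794", "1553", "1384"),
  stringsAsFactors = FALSE
)

# Opponent IDs for one player; NA marks a round with no opponent
opp_ids <- c("3", NA, "2")

# match() returns the row position of each ID in the lookup table (NA if absent)
opp_rating <- ratings$Rating[match(opp_ids, ratings$Number)]

opp_rating  # "1384" NA "1553"
```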

Next, we calculate each player’s average opponent rating. Rounds with no opponent are NA, so they are excluded from the mean with na.rm = TRUE.

data <- data %>%
  mutate_at(vars(matches('R[[:digit:]]{1}')), 
            funs(as.numeric))
data$OppRating <- round(rowMeans(data[, c(4:10)], na.rm = TRUE), 1)
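The effect of na.rm = TRUE is easiest to see on a small matrix. In this sketch the first row holds the seven ratings quoted in the task, and the second row is a hypothetical player who faced only two opponents.

```r
# Opponent ratings per round; NA marks a round with no opponent
opp <- rbind(
  full_schedule = c(1436, 1563, 1600, 1610, 1649, 1663, 1716),
  two_games     = c(1500, NA, 1600, NA, NA, NA, NA)  # hypothetical player
)

# Without na.rm, any NA in a row makes that row's mean NA
plain <- rowMeans(opp)

# With na.rm = TRUE, the mean is taken over the games actually played
avg <- round(rowMeans(opp, na.rm = TRUE), 1)

avg  # full_schedule 1605.3, two_games 1550.0
```

Dividing by the number of games actually played, rather than by seven, matches the task’s wording of "dividing by the total number of games played."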

0.6 Final Table

final <- data %>%
  select(Name, State, Points, Rating, OppRating)

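# write_csv writes comma-separated values, as the task requires
# (write_tsv would write tab-separated values despite the .csv extension)
write_csv(final, path = "final.csv")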