Introduction

This project solution creates a .CSV files based on a text file with chess tournament results where the information has a defined structure. The final output in the .CSV files contains one line for each player which includes the player’s name, player’s state, player’s total number of points, player’s rating before the tournament, and the average rating of the player’s opponents before the tournament.

The input file required for this program: “tournamentinfo.txt”.

The output file generated by this program: “tournamentInfoOutput.csv”.

An example of one line of output in the .CSV file:

Gary Hua,ON,6.0,1794,1605

The solution is dependent on the file structure present in “tournamentinfo.txt” and is capable of producing a file of a greater number of tournament players than the test file provides.

Program functions

To organize the solution better, I created several functions to perform the different data transformations and calculations. The function definitions below are documented with name and description along with input and output objects. Having never written functions in R before, the language does provide quite a bit of leeway with the function definitions compared to more general programming lanugages such as Java and C++.

library(stringr)
library(gdata)

## gdata: read.xls support for 'XLS' (Excel 97-2004) files ENABLED.

##

## gdata: read.xls support for 'XLSX' (Excel 2007+) files ENABLED.

## 
## Attaching package: 'gdata'

## The following object is masked from 'package:stats':
## 
##     nobs

## The following object is masked from 'package:utils':
## 
##     object.size

## The following object is masked from 'package:base':
## 
##     startsWith

# Function: retrieve_opp_number
# Description: Retrieve the numeric identifier for an opponent based on the match info
# Input: list containing the result and opponent numeric identifier for a single match
# Output: numeric identifier of a player's opponent based on the single match info (integer)
retrieve_opp_number <- function(opp_section) {
  
  player_opp_section_tok <- strsplit(opp_section, " +")
  player_opp_result <- player_opp_section_tok[[1]][1]

  # Default to not available numerically
  player_opp <- "-1"

  # Opponents only occur with a Win, Loss or Draw
  if (player_opp_result == "W" || 
      player_opp_result == "L" ||
      player_opp_result == "D") {
    # Set
    player_opp <- player_opp_section_tok[[1]][2]
  }

  # Return the player opponent number
  player_opp
}

# Function: transform_input_file_to_df
# Description: Transform the raw input (list of character strings) into a dataframe object in which each player information is contained in one entry of the data frame
# Input: list of character strings
# Output: dataframe object of the meaningful player information
transform_input_file_to_df <- function(input) {
  
  input_len <- length(input)
  line_count <- 1
  
  # Create the dataframe to be returned
  final_df <- data.frame(matrix(ncol = 12, nrow = 0))
  column_names <- c("Player_Number", "Player_Name", "Player_State", "Player_Total_Points",
                    "Player_Pre_Rating", "Player_Opp01", "Player_Opp02", "Player_Opp03",
                    "Player_Opp04", "Player_Opp05", "Player_Opp06", "Player_opp07")
  colnames(final_df) <- column_names

  # Loop through the raw input file of characters strings
  while (line_count < input_len) {
    
    player_rows_input <- input[line_count:(line_count+1)]
    player_row_for_df <- transform_player_input_to_row(player_rows_input)
    
    # Add the player's info to the final data frame
    df <- data.frame(player_row_for_df[1], player_row_for_df[2], player_row_for_df[3],
                     player_row_for_df[4], player_row_for_df[5], player_row_for_df[6],
                     player_row_for_df[7], player_row_for_df[8], player_row_for_df[9],
                     player_row_for_df[10], player_row_for_df[11], player_row_for_df[12])
    names(df) <- column_names
    final_df <- rbind(final_df, df)
    
    # Increase by 3 due to the input file's structure
    line_count <- line_count + 3
  }
  
  # Return the final data frame with each player's info as a single entry
  final_df
}

# Function: extract_player_pre_rating
# Description: As the tournament info sheet may contain a player's pre-rating with additional info, this function will remove the extra info and return just the player's pre-rating
# Input: a character string
# Output: a character string with only the initial pre-rating included
extract_player_pre_rating <- function(player_pre_rating) {
  
  if (str_detect(player_pre_rating, "P")) {
    player_pre_rating_tok <- strsplit(player_pre_rating, "P")
    player_pre_rating <- player_pre_rating_tok[[1]][1]
  }
  
  # Return the player's pre tournament rating
  player_pre_rating
}

# Function: transform_player_input_to_row
# Description: A player's entry in the tournament info sheet contains two lines of character data. This function will transform the two lines of character data into a list object of the player's info
# Input: list of character strings (list should contain 3 items)
# Output: list of values that represent a single player's information
transform_player_input_to_row <- function(player_input) {
  
  # Read first line of player and tokenize on the pipe "|" and trim leading/trailing whitespace
  first_line <- player_input[1]
  first_line_tok <- strsplit(first_line, "\\|")
  first_line_tok <- trim(first_line_tok)

  # Extract the player number, name, total points, and 7 opponents from the tokenzied first line 
  player_num <- first_line_tok[[1]][1]
  player_name <- first_line_tok[[1]][2]
  player_tot_num_pts <- first_line_tok[[1]][3]
  player_opp01 <- retrieve_opp_number(first_line_tok[[1]][4])
  player_opp02 <- retrieve_opp_number(first_line_tok[[1]][5])
  player_opp03 <- retrieve_opp_number(first_line_tok[[1]][6])
  player_opp04 <- retrieve_opp_number(first_line_tok[[1]][7])
  player_opp05 <- retrieve_opp_number(first_line_tok[[1]][8])
  player_opp06 <- retrieve_opp_number(first_line_tok[[1]][9])
  player_opp07 <- retrieve_opp_number(first_line_tok[[1]][10])

  # Read second line of player, tokenize on the pipe "|" and trim leading/trailing whitespace
  second_line <- player_input[2]
  second_line_tok <- strsplit(second_line, "\\|")
  second_line_tok <- trim(second_line_tok)
  
  # Extract the player's state from tokenzied second line
  player_state <- second_line_tok[[1]][1]
  
  # Retrieve pre-rating from second line
  rating_section <- second_line_tok[[1]][2]
  rating_section_tok <- strsplit(rating_section, " +")
  player_pre_rating <- extract_player_pre_rating(rating_section_tok[[1]][4])
  
  # Combine all the necessary player info into one list to be returned as player object
  player <- c(player_num, player_name, player_state, player_tot_num_pts, 
              player_pre_rating, player_opp01, player_opp02, player_opp03, 
              player_opp04, player_opp05, player_opp06, player_opp07)

  # Return the vector for a single player instance
  player
}

# Function: remove_input_file_header
# Description: Truncate the first 4 lines (character strings) from the initial input file. 
# Input: list of character strings
# Output: same list as input without the first 4 characters strings included
remove_input_file_header <- function(input) {

  # Remove the first 4 lines from the input file
  input[5:length(input)]
}

# Function: calculate_opp_pre_rating_avg
# Description: Calculate a player's opponents' pre-rating average
# Input: dataframe of all the tournament players
# Output: dataframe of all the tournament players with an additional column with the calculated opponents' pre-rating average
calculate_opp_pre_rating_avg <- function(input_df) {
  
  df_with_avg <- data.frame(matrix(ncol = 5, nrow = 0))
  column_names <- c("Player_Name", "Player_State", "Player_Total_Points",
                    "Player_Pre_Rating", "Opponents_Pre_Rating_Avg")
  colnames(df_with_avg) <- column_names
  
  rows <- nrow(input_df)
  
  # Loop through each player in the tournament
  for (pl in 1:rows) {
    player_instance <- input_df[pl,]
    names(player_instance) <- NULL
    
    player_list <- unlist(c(player_instance))
    
    # Loop through a player's opponents
    opp_cnt <- 0
    opp_pts_total <- 0

    # Loop through the opponents faced by the player
    for (i in 6:12) {
      player_opp_num <- toString(player_list[i])
      opp_row <- subset(df_func, Player_Number == player_opp_num)
      names(opp_row) <- NULL
      opp_row_list <- unlist(c(opp_row))
      opp_pre_rating <- opp_row_list[5]
  
      opp_pre_rating_num <- as.numeric(as.character(opp_pre_rating))
  
      # Only consider match information for calculation when opponent info is present
      if (is.na(opp_pre_rating_num) == FALSE) {
        opp_pts_total <- opp_pts_total + opp_pre_rating_num
        opp_cnt <- opp_cnt + 1
      }
    }
    
    opp_avg <- round(opp_pts_total / opp_cnt, digits=0)

    # Construct return data frame with player name, player state, player total points, player pre-rating, and the final column of the player's opponents' pre-rating average
    df <- data.frame(player_instance[2], player_instance[3], player_instance[4],
                     player_instance[5], opp_avg)
    names(df) <- column_names
    df_with_avg <- rbind(df_with_avg, df)
  }
  
  # Return the final data frame of all player's with additional column for the opponents' pre tournament average rating
  df_with_avg
}

Program main

The main section of the R program processing acts as a controller to initiate the functions defined above. The intention of the main-section is to provide a high-level step process of transfomring the raw text file input into the specific .CSV file structure and format.

# Read in the txt file as list of character lines
raw_data <- readLines("tournamentinfo.txt")

## Warning in readLines("tournamentinfo.txt"): incomplete final line found on
## 'tournamentinfo.txt'

# Call function to remove the rows of the header
raw_data <- remove_input_file_header(raw_data)

# Transform raw input file into data frame
df_func <- transform_input_file_to_df(raw_data)

# Calculate the opponents' pre-rating average 
# Expect response to be input data frame with additional column for average
df_final <- calculate_opp_pre_rating_avg(df_func)

# Output the final data frame to a CSV file named "tournamentInfoOutput.csv"
write.table(df_final, file="tournamentInfoOutput.csv", row.names=FALSE, 
            col.names=FALSE, sep=",", quote=FALSE)

Conclusion

The project solution transforms the chess tournament info with a specific structure into a .CSV file that can be easily consumed in a SQL statement. The most difficult part of the project assignment was the manipulation of data across the different data types in the R programming language. Understanding and researching the differences between a data frame and vector were key to completing the proper transformations. I do believe that if a different input file structure were provided, this project solution could be modified easily to consumer a different, yet similar structure.

DATA 607 Project 1

Philip Tanofsky

2/23/2020

Introduction

Program functions

Program main

Conclusion