607 Week4 Project1

Project Summary

This project explores the ability to perform basic data wrangling functions with R:

Load data into R and exporting data from R
Using regular expressions to extract portions of larger text strings
Transforming data orientation
Changing data types
Joining/merging dataframes
Performing aggregations on dataframes

The desired task is to use R to take a provided semi-structured text file containing information on chess ratings and output a fully-structured .csv file with the following information for all records:

Player’s Id
Player’s Name
Player’s State
Player’s Total Number of Points
Player’s Pre-Rating
Average Opponent’s Pre-Rating

In this context a semi-structured file is one that is structured to the human eye but is not laid out in the machine readable structure of rows and columns. Whereas a fully-structured file is a laid out in a machine readable fashion.

Load the text file and the necessary R packages

library(stringr)
library(plyr)
library(reshape2)

input_file_name <- "https://raw.githubusercontent.com/ChadRyanBailey/607-Week4-Project1/master/02%20tournamentinfo.txt"

textFile <- readLines(input_file_name, warn = FALSE)

Explore the text file

head(textFile, n=10)

##  [1] "-----------------------------------------------------------------------------------------" 
##  [2] " Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| "
##  [3] " Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | "
##  [4] "-----------------------------------------------------------------------------------------" 
##  [5] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|" 
##  [6] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |" 
##  [7] "-----------------------------------------------------------------------------------------" 
##  [8] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|" 
##  [9] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |" 
## [10] "-----------------------------------------------------------------------------------------"

tail(textFile, n=10)

##  [1] "-----------------------------------------------------------------------------------------"
##  [2] "   62 | ASHWIN BALAJI                   |1.0  |W  55|U    |U    |U    |U    |U    |U    |"
##  [3] "   MI | 15219542 / R: 1530   ->1535     |     |B    |     |     |     |     |     |     |"
##  [4] "-----------------------------------------------------------------------------------------"
##  [5] "   63 | THOMAS JOSEPH HOSMER            |1.0  |L   2|L  48|D  49|L  43|L  45|H    |U    |"
##  [6] "   MI | 15057092 / R: 1175   ->1125     |     |W    |B    |W    |B    |B    |     |     |"
##  [7] "-----------------------------------------------------------------------------------------"
##  [8] "   64 | BEN LI                          |1.0  |L  22|D  30|L  31|D  49|L  46|L  42|L  54|"
##  [9] "   MI | 15006561 / R: 1163   ->1112     |     |B    |W    |W    |B    |W    |B    |B    |"
## [10] "-----------------------------------------------------------------------------------------"

It appears the text file is laid out with the first three rows containing file header information and the last row being a visual divider not containing data.

The remaining rows are in three block groupings:

A row having a visual divider with no data
A row with data visually partitioned
A another row, of the same record, visually partitioned

Transform the data into a machine-readable layout

Start by removing the header and last rows as they do not follow the same pattern as other rows. Then within each 3-block of rows load each row with data (rows 2 and 3) to its own vector.

#drop the headers and the final row 
textFileNRows <- length(textFile)
rows_with_pattern <- textFile [c(4:(textFileNRows-1))]

#In 3-block rows: ignore row1 (visual divider) and load rows 2 and 3 to seperate vectors
record_row1 <- rows_with_pattern[seq(from = 2, to = length(rows_with_pattern), by = 3)]
record_row2 <- rows_with_pattern[seq(from = 3, to = length(rows_with_pattern), by = 3)]

Then beginning with the row1 vector, within each row split apart the character string to individual elements.

#For row1 split apart to individual elements and then load into a dataframe
data_elements <- unlist(str_split(record_row1, "\\|"))
data_elements <- str_trim(data_elements) 
de_matrix <- matrix(data_elements, nrow = length(record_row1), byrow = TRUE)
de_matrix <- de_matrix[, 1:(ncol(de_matrix)-1)]
de_dataframe <- data.frame(de_matrix, stringsAsFactors = FALSE)    
names(de_dataframe) <- c("Id"
                         , "Name"
                         , "Total_Points"
                         , "Round1"
                         , "Round2"
                         , "Round3"
                         , "Round4"
                         , "Round5"
                         , "Round6"
                         , "Round7")

#Convert column data types as needed
de_dataframe["Id"] <- apply(de_dataframe["Id"], MARGIN = c(1), FUN = as.numeric)
de_dataframe["Total_Points"] <- apply(de_dataframe["Total_Points"], MARGIN = c(1), FUN = as.numeric)

head(de_dataframe)

##   Id                Name Total_Points Round1 Round2 Round3 Round4 Round5
## 1  1            GARY HUA          6.0  W  39  W  21  W  18  W  14  W   7
## 2  2     DAKSHESH DARURI          6.0  W  63  W  58  L   4  W  17  W  16
## 3  3        ADITYA BAJAJ          6.0  L   8  W  61  W  25  W  21  W  11
## 4  4 PATRICK H SCHILLING          5.5  W  23  D  28  W   2  W  26  D   5
## 5  5          HANSHI ZUO          5.5  W  45  W  37  D  12  D  13  D   4
## 6  6         HANSEN SONG          5.0  W  34  D  29  L  11  W  35  D  10
##   Round6 Round7
## 1  D  12  D   4
## 2  W  20  W   7
## 3  W  13  W  12
## 4  W  19  D   1
## 5  W  14  W  17
## 6  W  27  W  21

The data for each round contain two pieces of information (status and Opponent Id) that still need to be split apart.

First get the outcome status of the round.

#limit to columns with rounds data
rounds <- de_dataframe[, 4:10]

#get just the rounds status
round_status_func <- function(data) 
{
  unlist(str_extract_all(data, "[[:alpha:]]"))
}
round_status <- data.frame(apply(rounds, MARGIN = c(1,2), FUN = round_status_func), stringsAsFactors =  FALSE)

names(round_status) <- c("R1_Status"
                         , "R2_Status"
                         , "R3_Status"
                         , "R4_Status"
                         , "R5_Status"
                         , "R6_Status"
                         , "R7_Status")

Then get the opponent Id for the round.

#get just the round's OppoentId
round_opp_func <- function(data) 
{
  test <- unlist(str_extract_all(data, "[[:digit:]]{1,}"))
  if (length(test) == 0) {NA}
  else {test}
}

round_oppId <- apply(rounds, MARGIN = c(1,2), FUN = round_opp_func)
round_oppId <- apply(round_oppId, MARGIN = c(1,2), FUN = as.numeric)
round_oppId <- data.frame(round_oppId, stringsAsFactors = FALSE)

names(round_oppId) <- c("R1_Opp_Id"
                         , "R2_Opp_Id"
                         , "R3_Opp_Id"
                         , "R4_Opp_Id"
                         , "R5_Opp_Id"
                         , "R6_Opp_Id"
                         , "R7_Opp_Id")

Moving on the the row2 vector, within each row split apart the character string to individual elements. Because most of the data in the second row is not needed for this project a more direct approach for extracting the individual elements is taken.

head(record_row2)

## [1] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
## [2] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
## [3] "   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
## [4] "   MI | 12616049 / R: 1716   ->1744     |N:2  |W    |B    |W    |B    |W    |B    |B    |"
## [5] "   MI | 14601533 / R: 1655   ->1690     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
## [6] "   OH | 15055204 / R: 1686   ->1687     |N:3  |W    |B    |W    |B    |B    |W    |B    |"

State <- str_trim(unlist(str_extract_all(record_row2, "\\s\\w{2}\\s")))

Pre_Rating <- unlist(str_extract_all(record_row2, "R:[ ]{0,}\\d{1,}"))
Pre_Rating <- unlist(str_extract_all(Pre_Rating, "\\d{1,}"))
Pre_Rating <- as.numeric(Pre_Rating)

Pull the transformed elements back together.

tournament <- cbind(de_dataframe[, c("Id", "Name", "Total_Points")]
                    , round_status
                    , round_oppId
                    , State
                    , Pre_Rating)

head(tournament)

##   Id                Name Total_Points R1_Status R2_Status R3_Status
## 1  1            GARY HUA          6.0         W         W         W
## 2  2     DAKSHESH DARURI          6.0         W         W         L
## 3  3        ADITYA BAJAJ          6.0         L         W         W
## 4  4 PATRICK H SCHILLING          5.5         W         D         W
## 5  5          HANSHI ZUO          5.5         W         W         D
## 6  6         HANSEN SONG          5.0         W         D         L
##   R4_Status R5_Status R6_Status R7_Status R1_Opp_Id R2_Opp_Id R3_Opp_Id
## 1         W         W         D         D        39        21        18
## 2         W         W         W         W        63        58         4
## 3         W         W         W         W         8        61        25
## 4         W         D         W         D        23        28         2
## 5         D         D         W         W        45        37        12
## 6         W         D         W         W        34        29        11
##   R4_Opp_Id R5_Opp_Id R6_Opp_Id R7_Opp_Id State Pre_Rating
## 1        14         7        12         4    ON       1794
## 2        17        16        20         7    MI       1553
## 3        21        11        13        12    MI       1384
## 4        26         5        19         1    MI       1716
## 5        13         4        14        17    MI       1655
## 6        35        10        27        21    OH       1686

Determine the average pre-ratings of opponents

This requires another data transformation: melting (i.e., unpivoting) the opponent Ids so that for each player Id gets a block of 7 rows - one for each round’s opponent Id.

opponentsDenorm <- tournament[, c("Id"
                            , "R1_Opp_Id"
                            , "R2_Opp_Id"
                            , "R3_Opp_Id"
                            , "R4_Opp_Id"
                            , "R5_Opp_Id"
                            , "R6_Opp_Id"
                            , "R7_Opp_Id")]

OpponentsNorm <- melt(opponentsDenorm, id.vars = c("Id"), variable.name = "Round", value.name = "OpponentId")

OpponentsNorm$Round <- unlist(str_extract_all(OpponentsNorm$Round, "\\d"))
OpponentsNorm$Round <- as.numeric(OpponentsNorm$Round)

head(OpponentsNorm[OpponentsNorm$Id == 1, ])

##     Id Round OpponentId
## 1    1     1         39
## 65   1     2         21
## 129  1     3         18
## 193  1     4         14
## 257  1     5          7
## 321  1     6         12

Then merge/join the transformed tournament dataframe to the transformed opponent dataframe

oppPreRatings <- merge(x=OpponentsNorm, y=tournament[,c("Id", "Pre_Rating")]
      ,by.x=c("OpponentId")
      ,by.y=c("Id"))

head(oppPreRatings[oppPreRatings$Id == 1, ], n=7)

##     OpponentId Id Round Pre_Rating
## 27           4  1     7       1716
## 48           7  1     5       1649
## 81          12  1     6       1663
## 95          14  1     4       1610
## 121         18  1     3       1600
## 140         21  1     2       1563
## 260         39  1     1       1436

Finally, calculate the average (mean) opponent pre-rating

avgOppPreRating <- aggregate(Pre_Rating ~ Id, oppPreRatings, mean, na.action = na.omit)
names(avgOppPreRating) <- c("Id", "Avg_Opp_Pre_Rating")
avgOppPreRating$Avg_Opp_Pre_Rating <- round(avgOppPreRating$Avg_Opp_Pre_Rating, 0)

head(avgOppPreRating)

##   Id Avg_Opp_Pre_Rating
## 1  1               1605
## 2  2               1469
## 3  3               1564
## 4  4               1574
## 5  5               1501
## 6  6               1519

Construct the desired output dataset

First merge/join the average opponent pre-rating with the transformed tournament dataframe.

tournament <- oppPreRatings <- merge(x=tournament, y=avgOppPreRating
      ,by.x=c("Id")
      ,by.y=c("Id"))

Then limit to the desired columns/fields.

tournamentSubset <- tournament[, c("Id"
                                   , "Name"
                                   , "State"
                                   , "Total_Points"
                                   , "Pre_Rating"
                                   , "Avg_Opp_Pre_Rating")]

head(tournamentSubset)

##   Id                Name State Total_Points Pre_Rating Avg_Opp_Pre_Rating
## 1  1            GARY HUA    ON          6.0       1794               1605
## 2  2     DAKSHESH DARURI    MI          6.0       1553               1469
## 3  3        ADITYA BAJAJ    MI          6.0       1384               1564
## 4  4 PATRICK H SCHILLING    MI          5.5       1716               1574
## 5  5          HANSHI ZUO    MI          5.5       1655               1501
## 6  6         HANSEN SONG    OH          5.0       1686               1519

Output the data to a .csv file

output_file_name <- "C:/Users/cbailey/Desktop/tournamentSubset.csv"

write.csv(tournamentSubset, file = output_file_name, row.names = FALSE)