Data 607 Project 1

Project 1

In this project, you’re given a text file with chess tournament results where the information has some structure. Your job is to create an R Markdown file that generates a .CSV file (that could for example be imported into a SQL database) with the following information for all of the players:

Player’s Name

Player’s State

Total Number of Points

Player’s Pre-Rating

Average Pre Chess Rating of Opponents

Source File:

https://raw.githubusercontent.com/Shetura36/Data-607-Assignments/master/Project1/tournamentinfo.txt

The file has 196 lines

The file has 128 lines that contain player information

The file has 64 individual players

Output File:

Link to a CSV file generated by this R code that contains the player data extracted from the source file.

https://github.com/Shetura36/Data-607-Assignments/blob/master/Project1/project1-output.csv

https://raw.githubusercontent.com/Shetura36/Data-607-Assignments/master/Project1/project1-output.csv (raw file)

————————————————

Load Libraries

library(RCurl)
library(stringr)

————————————————

Prepare Data Frame

Read file

Load data as a data frame

Convert data type of each line from factor to character

fileURL = "https://raw.githubusercontent.com/Shetura36/Data-607-Assignments/master/Project1/tournamentinfo.txt"
textFile <- readLines(fileURL)
typeof(textFile) #array of character

## [1] "character"

df.textData <- as.data.frame(textFile)#convert textFile into a data frame
dim(df.textData) # has 196 lines

## [1] 196   1

df.textData$textFile <- as.character(df.textData$textFile) #convert factor data into character data
head(df.textData)

##                                                                                     textFile
## 1  -----------------------------------------------------------------------------------------
## 2  Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| 
## 3  Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | 
## 4  -----------------------------------------------------------------------------------------
## 5      1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|
## 6     ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |

tail(df.textData)

##                                                                                      textFile
## 191    63 | THOMAS JOSEPH HOSMER            |1.0  |L   2|L  48|D  49|L  43|L  45|H    |U    |
## 192    MI | 15057092 / R: 1175   ->1125     |     |W    |B    |W    |B    |B    |     |     |
## 193 -----------------------------------------------------------------------------------------
## 194    64 | BEN LI                          |1.0  |L  22|D  30|L  31|D  49|L  46|L  42|L  54|
## 195    MI | 15006561 / R: 1163   ->1112     |     |B    |W    |W    |B    |W    |B    |B    |
## 196 -----------------------------------------------------------------------------------------

————————————————

Define Functions

(1) Function: isPlayerData

This function helps identify which lines of text contain player data and which lines don’t.

Assumption:
Every player data line either starts with a number or a 2 letter abbreviation of a state. This assumption is true for this specific file.

Returns TRUE if line either starts with a 2 letter abbreviation or start with a number; otherwise return FALSE.

isPlayerData <- function(strData){
  return(str_detect(str_trim(strData), "^([0-9]{1,2})") | 
           str_detect(str_trim(strData), "^([A-Z]{2,2})"))
}

(2) Function: getRoundOutcome

This will extra theh “outcome portion of a”round" token of a player

getRoundOutcome <- function(roundData){
  return(unlist(sapply(roundData, str_extract, pattern = "^[A-Z]")))
}

(3) Function: getRoundOpponent

This will extract the “opponent” portion of a “round” token of a player.

getRoundOpponent <- function(roundData){
  unlist(sapply(roundData, str_extract, pattern = "[0-9]{1,2}$"))
}

(4) Function: getPlayerPreRating

This function returns a player’s pre-rating.
This function takes a parameter called “prerating”, and it expects this data frame to have a column called “playerID” and “pre_ratings”

getPlayerPreRating <- function(thisPlayerID, preratings){
  
  if (is.na(thisPlayerID)) {return(NA)}
  else
  {
    thisSubset <- subset(playerPreRating, playerPreRating$playerID == thisPlayerID)
    return(thisSubset$pre_ratings)
  }
}

————————————————

Process Data

Data to be extracted from the file:

Player’s Name

Player’s State

Total Number of Points

Player’s Pre-Rating

Average Pre Chess Rating of opponents

Step 1: Identify if line contains player data or not.

Add a logical column called “isPlayerData” to “df.textData”.

This column will be TRUE if the line has player data and FALSE if the line does not have player data.

The text file has 128 lines that contain player information.

The text file has 64 individual players

#We are adding a logical column isPlayerData: 
df.textData$isPlayerData <- sapply(df.textData$textFile, isPlayerData)

#Retrieve all lines identified to contain player data. 
playerLines <- subset(df.textData, df.textData$isPlayerData == TRUE)

#has 128 row of player data
dim(playerLines)

## [1] 128   2

Step 2: Identify odd and even player lines

The assumption here is each player in the file has two lines of data: an odd and even line. The purpose of this data processing is to collapse the odd and even lines for each player into a single row.

#Get odd player lines
oddPlayerLines <- unlist(sapply(playerLines$textFile[seq(1, 128, 2)], str_trim))
names(oddPlayerLines) <- NULL

#64 odd player lines
length(oddPlayerLines)

## [1] 64

#Get even player lines
evenPlayerLines <- unlist(sapply(playerLines$textFile[seq(2, 128, 2)], str_trim))
names(evenPlayerLines) <- NULL

#64 even player lines
length(evenPlayerLines)

## [1] 64

#Combine odd and even player lines as columns in a matrix. 
players <- cbind(oddPlayerLines, evenPlayerLines)

#Convert player as a data frame
players <- as.data.frame(players)

#Convert data type from factor to character
players$oddPlayerLines <- as.character(players$oddPlayerLines)
players$evenPlayerLines <- as.character(players$evenPlayerLines)

#64 rows with 2 columns
dim(players)

## [1] 64  2

head(players)

##                                                                          oddPlayerLines
## 1 1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|
## 2 2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|
## 3 3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|
## 4 4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|
## 5 5 | HANSHI ZUO                      |5.5  |W  45|W  37|D  12|D  13|D   4|W  14|W  17|
## 6 6 | HANSEN SONG                     |5.0  |W  34|D  29|L  11|W  35|D  10|W  27|W  21|
##                                                                          evenPlayerLines
## 1 ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |
## 2 MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |
## 3 MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |
## 4 MI | 12616049 / R: 1716   ->1744     |N:2  |W    |B    |W    |B    |W    |B    |B    |
## 5 MI | 14601533 / R: 1655   ->1690     |N:2  |B    |W    |B    |W    |B    |W    |B    |
## 6 OH | 15055204 / R: 1686   ->1687     |N:3  |W    |B    |W    |B    |B    |W    |B    |

Step 3: Extract atomic player information from “oddPlayerLine” column

“Odd” player lines contain the following information:

Player Id

Player Name

Total Points

Outcome for each round

Opponent for each round

#Tokenize "odd" player line by breaking the line apart with the separator "|"
playerInfo <- unlist(sapply(players$oddPlayerLines, str_split, "\\|"))

#Convert to data frame
playerInfo <- as.data.frame(playerInfo)

#704 tokens
dim(playerInfo)

## [1] 704   1

#Rename column
row.names(playerInfo) <- NULL
names(playerInfo) <- NULL
names(playerInfo) <- c("tokens")

head(playerInfo)

##                              tokens
## 1                                1 
## 2  GARY HUA                        
## 3                             6.0  
## 4                             W  39
## 5                             W  21
## 6                             W  18

#Convert data type of tokens from factor to character
playerInfo$tokens <- as.character(playerInfo$tokens)

#Remove empty tokens
playerInfo <- subset(playerInfo, playerInfo$tokens != "")

#640 tokens after removing all empty tokens
dim(playerInfo)

## [1] 640   1

#Each player's "odd" line has 10 attributes. 
#We are going to retrieve each attribute
playerID <- playerInfo$tokens[seq(1,640, 10)]
playerNames <- playerInfo$tokens[seq(2,640,10)]
totalPoints <- playerInfo$tokens[seq(3,640,10)]
Round1 <- playerInfo$tokens[seq(4,640,10)]
Round2 <- playerInfo$tokens[seq(5,640,10)]
Round3 <- playerInfo$tokens[seq(6,640,10)]
Round4 <- playerInfo$tokens[seq(7,640,10)]
Round5 <- playerInfo$tokens[seq(8,640,10)]
Round6 <- playerInfo$tokens[seq(9,640,10)]
Round7 <- playerInfo$tokens[seq(10,640,10)]
R1_outcome <- getRoundOutcome(Round1)
R1_opponent <- getRoundOpponent(Round1)
R2_outcome <- getRoundOutcome(Round2)
R2_opponent <- getRoundOpponent(Round2)
R3_outcome <- getRoundOutcome(Round3)
R3_opponent <- getRoundOpponent(Round3)
R4_outcome <- getRoundOutcome(Round4)
R4_opponent <- getRoundOpponent(Round4)
R5_outcome <- getRoundOutcome(Round5)
R5_opponent <- getRoundOpponent(Round5)
R6_outcome <- getRoundOutcome(Round6)
R6_opponent <- getRoundOpponent(Round6)
R7_outcome <- getRoundOutcome(Round7)
R7_opponent <- getRoundOpponent(Round7)

#check the tokens extracted
check <- cbind(playerID, playerNames, totalPoints, R1_opponent, R2_opponent, R3_opponent, R4_opponent, R5_opponent, R6_opponent, R7_opponent)
row.names(check) <- NULL
dim(check) #64 rows, 10 columns

## [1] 64 10

head(check)

##      playerID playerNames                         totalPoints R1_opponent
## [1,] "1 "     " GARY HUA                        " "6.0  "     "39"       
## [2,] "2 "     " DAKSHESH DARURI                 " "6.0  "     "63"       
## [3,] "3 "     " ADITYA BAJAJ                    " "6.0  "     "8"        
## [4,] "4 "     " PATRICK H SCHILLING             " "5.5  "     "23"       
## [5,] "5 "     " HANSHI ZUO                      " "5.5  "     "45"       
## [6,] "6 "     " HANSEN SONG                     " "5.0  "     "34"       
##      R2_opponent R3_opponent R4_opponent R5_opponent R6_opponent
## [1,] "21"        "18"        "14"        "7"         "12"       
## [2,] "58"        "4"         "17"        "16"        "20"       
## [3,] "61"        "25"        "21"        "11"        "13"       
## [4,] "28"        "2"         "26"        "5"         "19"       
## [5,] "37"        "12"        "13"        "4"         "14"       
## [6,] "29"        "11"        "35"        "10"        "27"       
##      R7_opponent
## [1,] "4"        
## [2,] "7"        
## [3,] "12"       
## [4,] "1"        
## [5,] "17"       
## [6,] "21"

Step 4: Extract atomic player information from “evenPlayerLine” column

“Even” player lines contain the following information:

State

Pre_rating

states <- unlist(sapply(players$evenPlayerLines, str_extract, pattern = "^[A-Z]{2,2}"))

pre_ratings_part1 <- unlist(sapply(players$evenPlayerLines, str_extract, pattern = "R:([ ]){1,}([0-9]){1,}"))

pre_ratings <- unlist(sapply(pre_ratings_part1, str_extract, pattern = "([0-9]){1,}"))

#check extraced data
check <- cbind(states, pre_ratings)
row.names(check) <- NULL
dim(check) #64 rows, 2 columns

## [1] 64  2

head(check)

##      states pre_ratings
## [1,] "ON"   "1794"     
## [2,] "MI"   "1553"     
## [3,] "MI"   "1384"     
## [4,] "MI"   "1716"     
## [5,] "MI"   "1655"     
## [6,] "OH"   "1686"

Step 5: Build data frames for players

Data frames:

playerInfo - contains playerID, playerNames, states

playerOpponent - contains playerID, playerOpponent

playerOutcome - contains playerID, playerOutcome

playerTotalPoints - contains playerID, totalPoints

playerPreRatings - contains playerID, pre_ratings

playerInfo <- as.data.frame(cbind(playerID, playerNames, states))
row.names(playerInfo) <- NULL

playerOutcome <- as.data.frame(cbind(playerID, R1_outcome, R2_outcome, R3_outcome, R4_outcome, R5_outcome, R6_outcome, R7_outcome))
row.names(playerOutcome) <- NULL

playerOpponent <- as.data.frame(cbind(playerID, R1_opponent, R2_opponent, R3_opponent, R4_opponent, R5_opponent, R6_opponent, R7_opponent))
row.names(playerOpponent) <- NULL

playerPreRating <- as.data.frame(cbind(playerID, pre_ratings))
row.names(playerPreRating) <- NULL

playerTotalPoints <- as.data.frame(cbind(playerID, totalPoints))
row.names(playerTotalPoints) <- NULL

head(playerInfo)

##   playerID                       playerNames states
## 1       1   GARY HUA                             ON
## 2       2   DAKSHESH DARURI                      MI
## 3       3   ADITYA BAJAJ                         MI
## 4       4   PATRICK H SCHILLING                  MI
## 5       5   HANSHI ZUO                           MI
## 6       6   HANSEN SONG                          OH

head(playerTotalPoints)

##   playerID totalPoints
## 1       1        6.0  
## 2       2        6.0  
## 3       3        6.0  
## 4       4        5.5  
## 5       5        5.5  
## 6       6        5.0

head(playerPreRating)

##   playerID pre_ratings
## 1       1         1794
## 2       2         1553
## 3       3         1384
## 4       4         1716
## 5       5         1655
## 6       6         1686

head(playerOpponent)

##   playerID R1_opponent R2_opponent R3_opponent R4_opponent R5_opponent
## 1       1           39          21          18          14           7
## 2       2           63          58           4          17          16
## 3       3            8          61          25          21          11
## 4       4           23          28           2          26           5
## 5       5           45          37          12          13           4
## 6       6           34          29          11          35          10
##   R6_opponent R7_opponent
## 1          12           4
## 2          20           7
## 3          13          12
## 4          19           1
## 5          14          17
## 6          27          21

head(playerOutcome)

##   playerID R1_outcome R2_outcome R3_outcome R4_outcome R5_outcome
## 1       1           W          W          W          W          W
## 2       2           W          W          L          W          W
## 3       3           L          W          W          W          W
## 4       4           W          D          W          W          D
## 5       5           W          W          D          D          D
## 6       6           W          D          L          W          D
##   R6_outcome R7_outcome
## 1          D          D
## 2          W          W
## 3          W          W
## 4          W          D
## 5          W          W
## 6          W          W

Step 6: Convert to appropriate data types

When a data frame is built, most of the variables are presented as factors. Towards the end of this project, I learned that I could set “stringsAsFactors=FALSE” to avoid the character data from being converted to factors automatically.

playerID –> numeric

opponents –> numeric (since these refer to playerID’s)

totalPoints –> numeric

pre_ratings –> numeric

outcomes –> factors

states –> factors

playerNames –> factors

#convert data type
playerInfo$playerID <- as.numeric(as.character(playerInfo$playerID))
playerInfo$states <- as.factor(playerInfo$states)

playerOutcome$playerID <- as.numeric(as.character(playerOutcome$playerID))
playerOutcome$R1_outcome <- as.factor(playerOutcome$R1_outcome)
playerOutcome$R2_outcome <- as.factor(playerOutcome$R2_outcome)
playerOutcome$R3_outcome <- as.factor(playerOutcome$R3_outcome)
playerOutcome$R4_outcome <- as.factor(playerOutcome$R4_outcome)
playerOutcome$R5_outcome <- as.factor(playerOutcome$R5_outcome)
playerOutcome$R6_outcome <- as.factor(playerOutcome$R6_outcome)
playerOutcome$R7_outcome <- as.factor(playerOutcome$R7_outcome)

playerOpponent$playerID <- as.numeric(as.character(playerOpponent$playerID))
playerOpponent$R1_opponent <- as.numeric(as.character(playerOpponent$R1_opponent))
playerOpponent$R2_opponent <- as.numeric(as.character(playerOpponent$R2_opponent))
playerOpponent$R3_opponent <- as.numeric(as.character(playerOpponent$R3_opponent))
playerOpponent$R4_opponent <- as.numeric(as.character(playerOpponent$R4_opponent))
playerOpponent$R5_opponent <- as.numeric(as.character(playerOpponent$R5_opponent))
playerOpponent$R6_opponent <- as.numeric(as.character(playerOpponent$R6_opponent))
playerOpponent$R7_opponent <- as.numeric(as.character(playerOpponent$R7_opponent))

playerPreRating$playerID <- as.numeric(as.character(playerPreRating$playerID))
playerPreRating$pre_ratings <- as.numeric(as.character(playerPreRating$pre_ratings))

playerTotalPoints$playerID <- as.numeric(as.character(playerTotalPoints$playerID))
playerTotalPoints$totalPoints <- as.numeric(as.character(playerTotalPoints$totalPoints))

Step 7: Calculate opponents’ average pre-rating

#Retrieve opponent's pre-rating scores
R1_opponentPreRatings <- unlist(sapply(playerOpponent$R1_opponent, getPlayerPreRating, preratings = playerPreRating))
R2_opponentPreRatings <- unlist(sapply(playerOpponent$R2_opponent, getPlayerPreRating, preratings = playerPreRating))
R3_opponentPreRatings <- unlist(sapply(playerOpponent$R3_opponent, getPlayerPreRating, preratings = playerPreRating))
R4_opponentPreRatings <- unlist(sapply(playerOpponent$R4_opponent, getPlayerPreRating, preratings = playerPreRating))
R5_opponentPreRatings <- unlist(sapply(playerOpponent$R5_opponent, getPlayerPreRating, preratings = playerPreRating))
R6_opponentPreRatings <- unlist(sapply(playerOpponent$R6_opponent, getPlayerPreRating, preratings = playerPreRating))
R7_opponentPreRatings <- unlist(sapply(playerOpponent$R7_opponent, getPlayerPreRating, preratings = playerPreRating))

#Build Matrix
opponentPreRatings <- cbind(R1_opponentPreRatings, R2_opponentPreRatings, R3_opponentPreRatings, R4_opponentPreRatings, R5_opponentPreRatings, R6_opponentPreRatings, R7_opponentPreRatings)

#Get the mean avarage pre rating scores of all oppponents for each player
opponentAverageRatings <- rowMeans(opponentPreRatings, na.rm = TRUE)

#Display 
opponentAverageRatings

##  [1] 1605.286 1469.286 1563.571 1573.571 1500.857 1518.714 1372.143
##  [8] 1468.429 1523.143 1554.143 1467.571 1506.167 1497.857 1515.000
## [15] 1483.857 1385.800 1498.571 1480.000 1426.286 1410.857 1470.429
## [22] 1300.333 1213.857 1357.000 1363.286 1506.857 1221.667 1522.143
## [29] 1313.500 1144.143 1259.857 1378.714 1276.857 1375.286 1149.714
## [36] 1388.167 1384.800 1539.167 1429.571 1390.571 1248.500 1149.857
## [43] 1106.571 1327.000 1152.000 1357.714 1392.000 1355.800 1285.800
## [50] 1296.000 1356.143 1494.571 1345.333 1206.167 1406.000 1414.400
## [57] 1363.000 1391.000 1319.000 1330.200 1327.286 1186.000 1350.200
## [64] 1263.000

Step 8: Write to file

all_data <- as.data.frame(cbind(playerInfo, playerTotalPoints$totalPoints, playerPreRating$pre_ratings, opponentAverageRatings, playerOpponent[,2:8], playerOutcome[,2:8]))

names(all_data) <- c("Player ID", "Player Name", "State", "Total Points", "Pre-rating", "Opponent Average Pre-rating",
                     "R1 - Opponent", "R2 - Opponent", "R3 - Opponent", "R4 - Opponent", "R5 - Opponent", "R6 - Opponent", "R7 - Opponent",
                     "R1 - Outcome", "R2 - Outcome", "R3 - Outcome", "R4 - Outcome", "R5 - Outcome", "R6 - Outcome", "R7 - Outcome" )

filePath <- "C:/Users/stina/Documents/CUNY SPS Data Science/Spring 2018 Classes/DATA 607 - Data Acquisition and Management/Project 1/project1-output.csv"

#write to file
write.table(all_data, file = filePath, col.names = TRUE, row.names = FALSE, sep = ",")

————————————————

Present Data

Comparison between player’s pre-rating and opponents’ average pre-rating.

plot(all_data$`Pre-rating`, all_data$`Opponent Average Pre-rating`, xlim=c(1000,1800), ylim=c(1000,1800),
     main="Player Pre-rating vs. Opponent Average Pre-rating", xlab = "Player Pre-rating", ylab="Opponent Average Pre-rating")

barplot(rbind(all_data$`Pre-rating`[1:20], all_data$`Opponent Average Pre-rating`[1:20]), beside = TRUE, col = c("red", "blue"), 
        xlab = "Player Pre-Rating vs. Opponent Average Pre-Rating", ylab = "Pre-rating", names.arg = c(1:20))

barplot(rbind(all_data$`Pre-rating`[21:40], all_data$`Opponent Average Pre-rating`[21:40]), beside = TRUE, col = c("red", "blue"), 
        xlab = "Player Pre-Rating vs. Opponent Average Pre-Rating", ylab = "Pre-rating", names.arg = c(21:40))

barplot(rbind(all_data$`Pre-rating`[41:64], all_data$`Opponent Average Pre-rating`[41:64]), beside = TRUE, col = c("red", "blue"), 
        xlab = "Player Pre-Rating vs. Opponent Average Pre-Rating", ylab = "Pre-rating", names.arg = c(41:64))

Visualize number of wins in the 7 rounds.

#initialize array of 64 elements
wins <- rep(-1, 64)

#count number of wins
for (i in 1:64){
  wins[i] <- length(which(playerOutcome[i,]=='W'))
}

table(wins)

## wins
##  0  1  2  3  4  5  6 
##  4 11 11 18 14  4  2

barplot(wins[1:30], col=c("blue"), xlab="Players", ylab="Number of Wins in 7 Rounds", names.arg = c(1:30))

barplot(wins[31:64], col=c("blue"), xlab="Players", ylab="Number of Wins in 7 Rounds", names.arg = c(31:64))

Data 607 Project 1

Sherranette Tinapunan

February 24, 2018

Project 1

Source File:

Output File:

————————————————

Load Libraries

————————————————

Prepare Data Frame

————————————————

Define Functions

(1) Function: isPlayerData

(2) Function: getRoundOutcome

(3) Function: getRoundOpponent

(4) Function: getPlayerPreRating

————————————————

Process Data

Step 1: Identify if line contains player data or not.

Step 2: Identify odd and even player lines

Step 3: Extract atomic player information from “oddPlayerLine” column

Step 4: Extract atomic player information from “evenPlayerLine” column

Step 5: Build data frames for players

Step 6: Convert to appropriate data types

Step 7: Calculate opponents’ average pre-rating

Step 8: Write to file

————————————————

Present Data

Comparison between player’s pre-rating and opponents’ average pre-rating.

Visualize number of wins in the 7 rounds.