In this project, you’re given a text file with chess tournament results where the information has some structure. Your job is to create an R Markdown file that generates a .CSV file (that could for example be imported into a SQL database) with the following information for all of the players:
- Player’s Name
- Player’s State
- Total Number of Points
- Player’s Pre-Rating
- Average Pre Chess Rating of Opponents
Source File:
https://raw.githubusercontent.com/Shetura36/Data-607-Assignments/master/Project1/tournamentinfo.txt
- The file has 196 lines
- The file has 128 lines that contain player information
- The file has 64 individual players
Output File:
Link to a CSV file generated by this R code that contains the player data extracted from the source file.
library(RCurl)
library(stringr)
- Read file
- Load data as a data frame
- Convert data type of each line from factor to character
fileURL = "https://raw.githubusercontent.com/Shetura36/Data-607-Assignments/master/Project1/tournamentinfo.txt"
textFile <- readLines(fileURL)
typeof(textFile) #array of character
## [1] "character"
df.textData <- as.data.frame(textFile)#convert textFile into a data frame
dim(df.textData) # has 196 lines
## [1] 196 1
df.textData$textFile <- as.character(df.textData$textFile) #convert factor data into character data
head(df.textData)
## textFile
## 1 -----------------------------------------------------------------------------------------
## 2 Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round|
## 3 Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
## 4 -----------------------------------------------------------------------------------------
## 5 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|
## 6 ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |
tail(df.textData)
## textFile
## 191 63 | THOMAS JOSEPH HOSMER |1.0 |L 2|L 48|D 49|L 43|L 45|H |U |
## 192 MI | 15057092 / R: 1175 ->1125 | |W |B |W |B |B | | |
## 193 -----------------------------------------------------------------------------------------
## 194 64 | BEN LI |1.0 |L 22|D 30|L 31|D 49|L 46|L 42|L 54|
## 195 MI | 15006561 / R: 1163 ->1112 | |B |W |W |B |W |B |B |
## 196 -----------------------------------------------------------------------------------------
This function helps identify which lines of text contain player data and which lines don’t.
Assumption:
Every player data line either starts with a number or a 2 letter abbreviation of a state. This assumption is true for this specific file.
Returns TRUE if line either starts with a 2 letter abbreviation or start with a number; otherwise return FALSE.
isPlayerData <- function(strData){
return(str_detect(str_trim(strData), "^([0-9]{1,2})") |
str_detect(str_trim(strData), "^([A-Z]{2,2})"))
}
This will extra theh “outcome portion of a”round" token of a player
getRoundOutcome <- function(roundData){
return(unlist(sapply(roundData, str_extract, pattern = "^[A-Z]")))
}
This will extract the “opponent” portion of a “round” token of a player.
getRoundOpponent <- function(roundData){
unlist(sapply(roundData, str_extract, pattern = "[0-9]{1,2}$"))
}
This function returns a player’s pre-rating.
This function takes a parameter called “prerating”, and it expects this data frame to have a column called “playerID” and “pre_ratings”
getPlayerPreRating <- function(thisPlayerID, preratings){
if (is.na(thisPlayerID)) {return(NA)}
else
{
thisSubset <- subset(playerPreRating, playerPreRating$playerID == thisPlayerID)
return(thisSubset$pre_ratings)
}
}
Data to be extracted from the file:
- Player’s Name
- Player’s State
- Total Number of Points
- Player’s Pre-Rating
- Average Pre Chess Rating of opponents
- Add a logical column called “isPlayerData” to “df.textData”.
- This column will be TRUE if the line has player data and FALSE if the line does not have player data.
- The text file has 128 lines that contain player information.
- The text file has 64 individual players
#We are adding a logical column isPlayerData:
df.textData$isPlayerData <- sapply(df.textData$textFile, isPlayerData)
#Retrieve all lines identified to contain player data.
playerLines <- subset(df.textData, df.textData$isPlayerData == TRUE)
#has 128 row of player data
dim(playerLines)
## [1] 128 2
The assumption here is each player in the file has two lines of data: an odd and even line. The purpose of this data processing is to collapse the odd and even lines for each player into a single row.
#Get odd player lines
oddPlayerLines <- unlist(sapply(playerLines$textFile[seq(1, 128, 2)], str_trim))
names(oddPlayerLines) <- NULL
#64 odd player lines
length(oddPlayerLines)
## [1] 64
#Get even player lines
evenPlayerLines <- unlist(sapply(playerLines$textFile[seq(2, 128, 2)], str_trim))
names(evenPlayerLines) <- NULL
#64 even player lines
length(evenPlayerLines)
## [1] 64
#Combine odd and even player lines as columns in a matrix.
players <- cbind(oddPlayerLines, evenPlayerLines)
#Convert player as a data frame
players <- as.data.frame(players)
#Convert data type from factor to character
players$oddPlayerLines <- as.character(players$oddPlayerLines)
players$evenPlayerLines <- as.character(players$evenPlayerLines)
#64 rows with 2 columns
dim(players)
## [1] 64 2
head(players)
## oddPlayerLines
## 1 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|
## 2 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|
## 3 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|
## 4 4 | PATRICK H SCHILLING |5.5 |W 23|D 28|W 2|W 26|D 5|W 19|D 1|
## 5 5 | HANSHI ZUO |5.5 |W 45|W 37|D 12|D 13|D 4|W 14|W 17|
## 6 6 | HANSEN SONG |5.0 |W 34|D 29|L 11|W 35|D 10|W 27|W 21|
## evenPlayerLines
## 1 ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |
## 2 MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |
## 3 MI | 14959604 / R: 1384 ->1640 |N:2 |W |B |W |B |W |B |W |
## 4 MI | 12616049 / R: 1716 ->1744 |N:2 |W |B |W |B |W |B |B |
## 5 MI | 14601533 / R: 1655 ->1690 |N:2 |B |W |B |W |B |W |B |
## 6 OH | 15055204 / R: 1686 ->1687 |N:3 |W |B |W |B |B |W |B |
“Odd” player lines contain the following information:
- Player Id
- Player Name
- Total Points
- Outcome for each round
- Opponent for each round
#Tokenize "odd" player line by breaking the line apart with the separator "|"
playerInfo <- unlist(sapply(players$oddPlayerLines, str_split, "\\|"))
#Convert to data frame
playerInfo <- as.data.frame(playerInfo)
#704 tokens
dim(playerInfo)
## [1] 704 1
#Rename column
row.names(playerInfo) <- NULL
names(playerInfo) <- NULL
names(playerInfo) <- c("tokens")
head(playerInfo)
## tokens
## 1 1
## 2 GARY HUA
## 3 6.0
## 4 W 39
## 5 W 21
## 6 W 18
#Convert data type of tokens from factor to character
playerInfo$tokens <- as.character(playerInfo$tokens)
#Remove empty tokens
playerInfo <- subset(playerInfo, playerInfo$tokens != "")
#640 tokens after removing all empty tokens
dim(playerInfo)
## [1] 640 1
#Each player's "odd" line has 10 attributes.
#We are going to retrieve each attribute
playerID <- playerInfo$tokens[seq(1,640, 10)]
playerNames <- playerInfo$tokens[seq(2,640,10)]
totalPoints <- playerInfo$tokens[seq(3,640,10)]
Round1 <- playerInfo$tokens[seq(4,640,10)]
Round2 <- playerInfo$tokens[seq(5,640,10)]
Round3 <- playerInfo$tokens[seq(6,640,10)]
Round4 <- playerInfo$tokens[seq(7,640,10)]
Round5 <- playerInfo$tokens[seq(8,640,10)]
Round6 <- playerInfo$tokens[seq(9,640,10)]
Round7 <- playerInfo$tokens[seq(10,640,10)]
R1_outcome <- getRoundOutcome(Round1)
R1_opponent <- getRoundOpponent(Round1)
R2_outcome <- getRoundOutcome(Round2)
R2_opponent <- getRoundOpponent(Round2)
R3_outcome <- getRoundOutcome(Round3)
R3_opponent <- getRoundOpponent(Round3)
R4_outcome <- getRoundOutcome(Round4)
R4_opponent <- getRoundOpponent(Round4)
R5_outcome <- getRoundOutcome(Round5)
R5_opponent <- getRoundOpponent(Round5)
R6_outcome <- getRoundOutcome(Round6)
R6_opponent <- getRoundOpponent(Round6)
R7_outcome <- getRoundOutcome(Round7)
R7_opponent <- getRoundOpponent(Round7)
#check the tokens extracted
check <- cbind(playerID, playerNames, totalPoints, R1_opponent, R2_opponent, R3_opponent, R4_opponent, R5_opponent, R6_opponent, R7_opponent)
row.names(check) <- NULL
dim(check) #64 rows, 10 columns
## [1] 64 10
head(check)
## playerID playerNames totalPoints R1_opponent
## [1,] "1 " " GARY HUA " "6.0 " "39"
## [2,] "2 " " DAKSHESH DARURI " "6.0 " "63"
## [3,] "3 " " ADITYA BAJAJ " "6.0 " "8"
## [4,] "4 " " PATRICK H SCHILLING " "5.5 " "23"
## [5,] "5 " " HANSHI ZUO " "5.5 " "45"
## [6,] "6 " " HANSEN SONG " "5.0 " "34"
## R2_opponent R3_opponent R4_opponent R5_opponent R6_opponent
## [1,] "21" "18" "14" "7" "12"
## [2,] "58" "4" "17" "16" "20"
## [3,] "61" "25" "21" "11" "13"
## [4,] "28" "2" "26" "5" "19"
## [5,] "37" "12" "13" "4" "14"
## [6,] "29" "11" "35" "10" "27"
## R7_opponent
## [1,] "4"
## [2,] "7"
## [3,] "12"
## [4,] "1"
## [5,] "17"
## [6,] "21"
“Even” player lines contain the following information:
- State
- Pre_rating
states <- unlist(sapply(players$evenPlayerLines, str_extract, pattern = "^[A-Z]{2,2}"))
pre_ratings_part1 <- unlist(sapply(players$evenPlayerLines, str_extract, pattern = "R:([ ]){1,}([0-9]){1,}"))
pre_ratings <- unlist(sapply(pre_ratings_part1, str_extract, pattern = "([0-9]){1,}"))
#check extraced data
check <- cbind(states, pre_ratings)
row.names(check) <- NULL
dim(check) #64 rows, 2 columns
## [1] 64 2
head(check)
## states pre_ratings
## [1,] "ON" "1794"
## [2,] "MI" "1553"
## [3,] "MI" "1384"
## [4,] "MI" "1716"
## [5,] "MI" "1655"
## [6,] "OH" "1686"
Data frames:
- playerInfo - contains playerID, playerNames, states
- playerOpponent - contains playerID, playerOpponent
- playerOutcome - contains playerID, playerOutcome
- playerTotalPoints - contains playerID, totalPoints
- playerPreRatings - contains playerID, pre_ratings
playerInfo <- as.data.frame(cbind(playerID, playerNames, states))
row.names(playerInfo) <- NULL
playerOutcome <- as.data.frame(cbind(playerID, R1_outcome, R2_outcome, R3_outcome, R4_outcome, R5_outcome, R6_outcome, R7_outcome))
row.names(playerOutcome) <- NULL
playerOpponent <- as.data.frame(cbind(playerID, R1_opponent, R2_opponent, R3_opponent, R4_opponent, R5_opponent, R6_opponent, R7_opponent))
row.names(playerOpponent) <- NULL
playerPreRating <- as.data.frame(cbind(playerID, pre_ratings))
row.names(playerPreRating) <- NULL
playerTotalPoints <- as.data.frame(cbind(playerID, totalPoints))
row.names(playerTotalPoints) <- NULL
head(playerInfo)
## playerID playerNames states
## 1 1 GARY HUA ON
## 2 2 DAKSHESH DARURI MI
## 3 3 ADITYA BAJAJ MI
## 4 4 PATRICK H SCHILLING MI
## 5 5 HANSHI ZUO MI
## 6 6 HANSEN SONG OH
head(playerTotalPoints)
## playerID totalPoints
## 1 1 6.0
## 2 2 6.0
## 3 3 6.0
## 4 4 5.5
## 5 5 5.5
## 6 6 5.0
head(playerPreRating)
## playerID pre_ratings
## 1 1 1794
## 2 2 1553
## 3 3 1384
## 4 4 1716
## 5 5 1655
## 6 6 1686
head(playerOpponent)
## playerID R1_opponent R2_opponent R3_opponent R4_opponent R5_opponent
## 1 1 39 21 18 14 7
## 2 2 63 58 4 17 16
## 3 3 8 61 25 21 11
## 4 4 23 28 2 26 5
## 5 5 45 37 12 13 4
## 6 6 34 29 11 35 10
## R6_opponent R7_opponent
## 1 12 4
## 2 20 7
## 3 13 12
## 4 19 1
## 5 14 17
## 6 27 21
head(playerOutcome)
## playerID R1_outcome R2_outcome R3_outcome R4_outcome R5_outcome
## 1 1 W W W W W
## 2 2 W W L W W
## 3 3 L W W W W
## 4 4 W D W W D
## 5 5 W W D D D
## 6 6 W D L W D
## R6_outcome R7_outcome
## 1 D D
## 2 W W
## 3 W W
## 4 W D
## 5 W W
## 6 W W
When a data frame is built, most of the variables are presented as factors. Towards the end of this project, I learned that I could set “stringsAsFactors=FALSE” to avoid the character data from being converted to factors automatically.
- playerID –> numeric
- opponents –> numeric (since these refer to playerID’s)
- totalPoints –> numeric
- pre_ratings –> numeric
- outcomes –> factors
- states –> factors
- playerNames –> factors
#convert data type
playerInfo$playerID <- as.numeric(as.character(playerInfo$playerID))
playerInfo$states <- as.factor(playerInfo$states)
playerOutcome$playerID <- as.numeric(as.character(playerOutcome$playerID))
playerOutcome$R1_outcome <- as.factor(playerOutcome$R1_outcome)
playerOutcome$R2_outcome <- as.factor(playerOutcome$R2_outcome)
playerOutcome$R3_outcome <- as.factor(playerOutcome$R3_outcome)
playerOutcome$R4_outcome <- as.factor(playerOutcome$R4_outcome)
playerOutcome$R5_outcome <- as.factor(playerOutcome$R5_outcome)
playerOutcome$R6_outcome <- as.factor(playerOutcome$R6_outcome)
playerOutcome$R7_outcome <- as.factor(playerOutcome$R7_outcome)
playerOpponent$playerID <- as.numeric(as.character(playerOpponent$playerID))
playerOpponent$R1_opponent <- as.numeric(as.character(playerOpponent$R1_opponent))
playerOpponent$R2_opponent <- as.numeric(as.character(playerOpponent$R2_opponent))
playerOpponent$R3_opponent <- as.numeric(as.character(playerOpponent$R3_opponent))
playerOpponent$R4_opponent <- as.numeric(as.character(playerOpponent$R4_opponent))
playerOpponent$R5_opponent <- as.numeric(as.character(playerOpponent$R5_opponent))
playerOpponent$R6_opponent <- as.numeric(as.character(playerOpponent$R6_opponent))
playerOpponent$R7_opponent <- as.numeric(as.character(playerOpponent$R7_opponent))
playerPreRating$playerID <- as.numeric(as.character(playerPreRating$playerID))
playerPreRating$pre_ratings <- as.numeric(as.character(playerPreRating$pre_ratings))
playerTotalPoints$playerID <- as.numeric(as.character(playerTotalPoints$playerID))
playerTotalPoints$totalPoints <- as.numeric(as.character(playerTotalPoints$totalPoints))
#Retrieve opponent's pre-rating scores
R1_opponentPreRatings <- unlist(sapply(playerOpponent$R1_opponent, getPlayerPreRating, preratings = playerPreRating))
R2_opponentPreRatings <- unlist(sapply(playerOpponent$R2_opponent, getPlayerPreRating, preratings = playerPreRating))
R3_opponentPreRatings <- unlist(sapply(playerOpponent$R3_opponent, getPlayerPreRating, preratings = playerPreRating))
R4_opponentPreRatings <- unlist(sapply(playerOpponent$R4_opponent, getPlayerPreRating, preratings = playerPreRating))
R5_opponentPreRatings <- unlist(sapply(playerOpponent$R5_opponent, getPlayerPreRating, preratings = playerPreRating))
R6_opponentPreRatings <- unlist(sapply(playerOpponent$R6_opponent, getPlayerPreRating, preratings = playerPreRating))
R7_opponentPreRatings <- unlist(sapply(playerOpponent$R7_opponent, getPlayerPreRating, preratings = playerPreRating))
#Build Matrix
opponentPreRatings <- cbind(R1_opponentPreRatings, R2_opponentPreRatings, R3_opponentPreRatings, R4_opponentPreRatings, R5_opponentPreRatings, R6_opponentPreRatings, R7_opponentPreRatings)
#Get the mean avarage pre rating scores of all oppponents for each player
opponentAverageRatings <- rowMeans(opponentPreRatings, na.rm = TRUE)
#Display
opponentAverageRatings
## [1] 1605.286 1469.286 1563.571 1573.571 1500.857 1518.714 1372.143
## [8] 1468.429 1523.143 1554.143 1467.571 1506.167 1497.857 1515.000
## [15] 1483.857 1385.800 1498.571 1480.000 1426.286 1410.857 1470.429
## [22] 1300.333 1213.857 1357.000 1363.286 1506.857 1221.667 1522.143
## [29] 1313.500 1144.143 1259.857 1378.714 1276.857 1375.286 1149.714
## [36] 1388.167 1384.800 1539.167 1429.571 1390.571 1248.500 1149.857
## [43] 1106.571 1327.000 1152.000 1357.714 1392.000 1355.800 1285.800
## [50] 1296.000 1356.143 1494.571 1345.333 1206.167 1406.000 1414.400
## [57] 1363.000 1391.000 1319.000 1330.200 1327.286 1186.000 1350.200
## [64] 1263.000
all_data <- as.data.frame(cbind(playerInfo, playerTotalPoints$totalPoints, playerPreRating$pre_ratings, opponentAverageRatings, playerOpponent[,2:8], playerOutcome[,2:8]))
names(all_data) <- c("Player ID", "Player Name", "State", "Total Points", "Pre-rating", "Opponent Average Pre-rating",
"R1 - Opponent", "R2 - Opponent", "R3 - Opponent", "R4 - Opponent", "R5 - Opponent", "R6 - Opponent", "R7 - Opponent",
"R1 - Outcome", "R2 - Outcome", "R3 - Outcome", "R4 - Outcome", "R5 - Outcome", "R6 - Outcome", "R7 - Outcome" )
filePath <- "C:/Users/stina/Documents/CUNY SPS Data Science/Spring 2018 Classes/DATA 607 - Data Acquisition and Management/Project 1/project1-output.csv"
#write to file
write.table(all_data, file = filePath, col.names = TRUE, row.names = FALSE, sep = ",")
plot(all_data$`Pre-rating`, all_data$`Opponent Average Pre-rating`, xlim=c(1000,1800), ylim=c(1000,1800),
main="Player Pre-rating vs. Opponent Average Pre-rating", xlab = "Player Pre-rating", ylab="Opponent Average Pre-rating")
barplot(rbind(all_data$`Pre-rating`[1:20], all_data$`Opponent Average Pre-rating`[1:20]), beside = TRUE, col = c("red", "blue"),
xlab = "Player Pre-Rating vs. Opponent Average Pre-Rating", ylab = "Pre-rating", names.arg = c(1:20))
barplot(rbind(all_data$`Pre-rating`[21:40], all_data$`Opponent Average Pre-rating`[21:40]), beside = TRUE, col = c("red", "blue"),
xlab = "Player Pre-Rating vs. Opponent Average Pre-Rating", ylab = "Pre-rating", names.arg = c(21:40))
barplot(rbind(all_data$`Pre-rating`[41:64], all_data$`Opponent Average Pre-rating`[41:64]), beside = TRUE, col = c("red", "blue"),
xlab = "Player Pre-Rating vs. Opponent Average Pre-Rating", ylab = "Pre-rating", names.arg = c(41:64))
#initialize array of 64 elements
wins <- rep(-1, 64)
#count number of wins
for (i in 1:64){
wins[i] <- length(which(playerOutcome[i,]=='W'))
}
table(wins)
## wins
## 0 1 2 3 4 5 6
## 4 11 11 18 14 4 2
barplot(wins[1:30], col=c("blue"), xlab="Players", ylab="Number of Wins in 7 Rounds", names.arg = c(1:30))
barplot(wins[31:64], col=c("blue"), xlab="Players", ylab="Number of Wins in 7 Rounds", names.arg = c(31:64))