The goal of the assignment is to read data from a text file that has information on chess players and their performance in a tournament, create a data frame that has five specified columns, and write that data frame to a csv file.
The five columns are: Player Name, Player State, Total Points, Player Pre Rating, Average Opponent Pre Rating
I copied the text file to my Github repo and then pulled the text file from the repo to allow for recreation of the code. I squished the text file to remove any unnecessary spaces.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
fileURL <- 'https://raw.githubusercontent.com/stoybis/DATA607Repo/main/tournamentinfo.txt'
txtFileData <- read_lines(url(fileURL), skip = 3)
txtFileDataSquished <- str_squish(txtFileData)
remove(txtFileData) #drop the original text file since I'll be working with the squished file
head(txtFileDataSquished, n=10)
## [1] "-----------------------------------------------------------------------------------------"
## [2] "1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [3] "ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
## [4] "-----------------------------------------------------------------------------------------"
## [5] "2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|"
## [6] "MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |"
## [7] "-----------------------------------------------------------------------------------------"
## [8] "3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|"
## [9] "MI | 14959604 / R: 1384 ->1640 |N:2 |W |B |W |B |W |B |W |"
## [10] "-----------------------------------------------------------------------------------------"
I use regex to extract player names. I first search for the pattern that starts with a space, followed by a |, followed by a space, followed by letters, followed by a space, followed by a |, followed by a digit to capture the the name surrounded by | and the digit after it.
playerName <- str_view(txtFileDataSquished, "\\s\\|\\s[a-zA-Z].*\\s\\|\\d")
head(playerName)
## [2] │ 1< | GARY HUA |6>.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|
## [5] │ 2< | DAKSHESH DARURI |6>.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|
## [8] │ 3< | ADITYA BAJAJ |6>.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|
## [11] │ 4< | PATRICK H SCHILLING |5>.5 |W 23|D 28|W 2|W 26|D 5|W 19|D 1|
## [14] │ 5< | HANSHI ZUO |5>.5 |W 45|W 37|D 12|D 13|D 4|W 14|W 17|
## [17] │ 6< | HANSEN SONG |5>.0 |W 34|D 29|L 11|W 35|D 10|W 27|W 21|
From this vector, I extract the \, followed by anything that is not a number, followed by a space and a |
playerName <- str_extract(playerName, "\\|\\s[^0-9]+\\s\\|")
head(playerName)
## [1] "| GARY HUA |" "| DAKSHESH DARURI |"
## [3] "| ADITYA BAJAJ |" "| PATRICK H SCHILLING |"
## [5] "| HANSHI ZUO |" "| HANSEN SONG |"
Last, I extract everything that is not a | and trim white spaces. This now has 64 clean names
playerName <- str_extract(playerName, "\\s.*\\s")
playerName <-str_trim(playerName) # this has 64 clean names
head(playerName, n=10)
## [1] "GARY HUA" "DAKSHESH DARURI" "ADITYA BAJAJ"
## [4] "PATRICK H SCHILLING" "HANSHI ZUO" "HANSEN SONG"
## [7] "GARY DEE SWATHELL" "EZEKIEL HOUGHTON" "STEFANO LEE"
## [10] "ANVIT RAO"
I use regex to identify the pattern that starts with a letter followed by any character followed by a space, which is how every line that starts with the State starts.
playerStates <-str_view(txtFileDataSquished, "^[A-Z].\\s")
head(playerStates)
## [3] │ <ON >| 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |
## [6] │ <MI >| 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |
## [9] │ <MI >| 14959604 / R: 1384 ->1640 |N:2 |W |B |W |B |W |B |W |
## [12] │ <MI >| 12616049 / R: 1716 ->1744 |N:2 |W |B |W |B |W |B |B |
## [15] │ <MI >| 14601533 / R: 1655 ->1690 |N:2 |B |W |B |W |B |W |B |
## [18] │ <OH >| 15055204 / R: 1686 ->1687 |N:3 |W |B |W |B |B |W |B |
From this, I extract the the first two letters
playerStates <-str_extract(playerStates, "[A-Z][A-Z]\\s")
head(playerStates)
## [1] "ON " "MI " "MI " "MI " "MI " "OH "
I combine the playerName vector and PlayerStates vector into a data frame
cleanedData <- data.frame(playerName, playerStates)
head(cleanedData)
## playerName playerStates
## 1 GARY HUA ON
## 2 DAKSHESH DARURI MI
## 3 ADITYA BAJAJ MI
## 4 PATRICK H SCHILLING MI
## 5 HANSHI ZUO MI
## 6 HANSEN SONG OH
I use regex to identify the pattern that starts with a letter, followed by a space and |, followed by a digit, a period and a digit
playerTotalPoints <-str_view(txtFileDataSquished, "[A-Z]\\s\\|\\d.\\d")
head(playerTotalPoints)
## [2] │ 1 | GARY HU<A |6.0> |W 39|W 21|W 18|W 14|W 7|D 12|D 4|
## [5] │ 2 | DAKSHESH DARUR<I |6.0> |W 63|W 58|L 4|W 17|W 16|W 20|W 7|
## [8] │ 3 | ADITYA BAJA<J |6.0> |L 8|W 61|W 25|W 21|W 11|W 13|W 12|
## [11] │ 4 | PATRICK H SCHILLIN<G |5.5> |W 23|D 28|W 2|W 26|D 5|W 19|D 1|
## [14] │ 5 | HANSHI ZU<O |5.5> |W 45|W 37|D 12|D 13|D 4|W 14|W 17|
## [17] │ 6 | HANSEN SON<G |5.0> |W 34|D 29|L 11|W 35|D 10|W 27|W 21|
From this, I extract the two digits.
playerTotalPoints <- as.numeric(str_extract(playerTotalPoints, "\\d.\\d"))
head(playerTotalPoints)
## [1] 6.0 6.0 6.0 5.5 5.5 5.0
I then add this to the data frame as a new column
cleanedData <- mutate(cleanedData, playerTotalPoints)
head(cleanedData)
## playerName playerStates playerTotalPoints
## 1 GARY HUA ON 6.0
## 2 DAKSHESH DARURI MI 6.0
## 3 ADITYA BAJAJ MI 6.0
## 4 PATRICK H SCHILLING MI 5.5
## 5 HANSHI ZUO MI 5.5
## 6 HANSEN SONG OH 5.0
I use regex to identify the pattern that starts with a : followed by a space then any characters followed by a -, which captures the pre-rating and the characters surrounding it. Note that some pre-ratings have a letter followed by two digits after it.
preRating <- str_view(txtFileDataSquished, ":\\s.*\\-")
head(preRating, n=10)
## [3] │ ON | 15445895 / R<: 1794 ->>1817 |N:2 |W |B |W |B |W |B |W |
## [6] │ MI | 14598900 / R<: 1553 ->>1663 |N:2 |B |W |B |W |B |W |B |
## [9] │ MI | 14959604 / R<: 1384 ->>1640 |N:2 |W |B |W |B |W |B |W |
## [12] │ MI | 12616049 / R<: 1716 ->>1744 |N:2 |W |B |W |B |W |B |B |
## [15] │ MI | 14601533 / R<: 1655 ->>1690 |N:2 |B |W |B |W |B |W |B |
## [18] │ OH | 15055204 / R<: 1686 ->>1687 |N:3 |W |B |W |B |B |W |B |
## [21] │ MI | 11146376 / R<: 1649 ->>1673 |N:3 |W |B |W |B |B |W |W |
## [24] │ MI | 15142253 / R<: 1641P17->>1657P24 |N:3 |B |W |B |W |B |W |W |
## [27] │ ON | 14954524 / R<: 1411 ->>1564 |N:2 |W |B |W |B |W |B |B |
## [30] │ MI | 14150362 / R<: 1365 ->>1544 |N:3 |W |W |B |B |W |B |W |
I extract the pre-rating using regex
preRating <- str_extract(preRating, ":.*-")
head(preRating, n=10)
## [1] ": 1794 -" ": 1553 -" ": 1384 -" ": 1716 -" ": 1655 -"
## [6] ": 1686 -" ": 1649 -" ": 1641P17-" ": 1411 -" ": 1365 -"
I then remove the : and - using regex. This still captures the letter and the two digits after it which some players have.
preRating <- str_extract(preRating,"\\d.*\\d")
head(preRating, n=10)
## [1] "1794" "1553" "1384" "1716" "1655" "1686" "1649"
## [8] "1641P17" "1411" "1365"
Last, I capture the first four digits only.
preRating <- str_extract(preRating, "\\d{3,4}")
head(preRating, n=10)
## [1] "1794" "1553" "1384" "1716" "1655" "1686" "1649" "1641" "1411" "1365"
I add the pre-rating to the data frame
cleanedData <- mutate(cleanedData, preRating)
head(cleanedData)
## playerName playerStates playerTotalPoints preRating
## 1 GARY HUA ON 6.0 1794
## 2 DAKSHESH DARURI MI 6.0 1553
## 3 ADITYA BAJAJ MI 6.0 1384
## 4 PATRICK H SCHILLING MI 5.5 1716
## 5 HANSHI ZUO MI 5.5 1655
## 6 HANSEN SONG OH 5.0 1686
I add an ID column using the row names which corresponds to the position of the player in the original order of players in the text file. I will later use this to merge data. Furthermore, I convert the ID, total points, and pre–rating to type numeric
cleanedData <- rownames_to_column(cleanedData, "ID")
colsToConvert <- names(cleanedData[c(1,4,5)])
cleanedData[colsToConvert] <- lapply(cleanedData[colsToConvert], as.numeric)
head(cleanedData)
## ID playerName playerStates playerTotalPoints preRating
## 1 1 GARY HUA ON 6.0 1794
## 2 2 DAKSHESH DARURI MI 6.0 1553
## 3 3 ADITYA BAJAJ MI 6.0 1384
## 4 4 PATRICK H SCHILLING MI 5.5 1716
## 5 5 HANSHI ZUO MI 5.5 1655
## 6 6 HANSEN SONG OH 5.0 1686
I use regex to remove letters as we know opponent numbers don’t have letters
noletters <- str_remove_all(txtFileDataSquished, "[A-Z]")
head(noletters)
## [1] "-----------------------------------------------------------------------------------------"
## [2] "1 | |6.0 | 39| 21| 18| 14| 7| 12| 4|"
## [3] " | 15445895 / : 1794 ->1817 |:2 | | | | | | | |"
## [4] "-----------------------------------------------------------------------------------------"
## [5] "2 | |6.0 | 63| 58| 4| 17| 16| 20| 7|"
## [6] " | 14598900 / : 1553 ->1663 |:2 | | | | | | | |"
I then search for lines that start with a digit followed by any characters and then ending with a |. This captures the line of the text file that has the player ID, total points, and their opponents.
noletters <- str_view(noletters, "^\\d.*\\|")
head(noletters)
## [2] │ <1 | |6.0 | 39| 21| 18| 14| 7| 12| 4|>
## [5] │ <2 | |6.0 | 63| 58| 4| 17| 16| 20| 7|>
## [8] │ <3 | |6.0 | 8| 61| 25| 21| 11| 13| 12|>
## [11] │ <4 | |5.5 | 23| 28| 2| 26| 5| 19| 1|>
## [14] │ <5 | |5.5 | 45| 37| 12| 13| 4| 14| 17|>
## [17] │ <6 | |5.0 | 34| 29| 11| 35| 10| 27| 21|>
I then use str_split to split the vector based on | and use simplify = True to create a matrix
playerandOppsMatrix <- str_split(noletters,"\\|", simplify = TRUE)
head(playerandOppsMatrix)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
## [1,] "<1 " " " "6.0 " " 39" " 21" " 18" " 14" " 7" " 12" " 4" ">"
## [2,] "<2 " " " "6.0 " " 63" " 58" " 4" " 17" " 16" " 20" " 7" ">"
## [3,] "<3 " " " "6.0 " " 8" " 61" " 25" " 21" " 11" " 13" " 12" ">"
## [4,] "<4 " " " "5.5 " " 23" " 28" " 2" " 26" " 5" " 19" " 1" ">"
## [5,] "<5 " " " "5.5 " " 45" " 37" " 12" " 13" " 4" " 14" " 17" ">"
## [6,] "<6 " " " "5.0 " " 34" " 29" " 11" " 35" " 10" " 27" " 21" ">"
I then drop the first, second and last column to capture the total points and the list of opponents for each round. I add column names
playerandOppsMatrix <- subset(playerandOppsMatrix, select = -c(1,2,11))
colnames(playerandOppsMatrix) <- c('TotalPts', 'R1_opp', 'R2_opp', 'R3_opp',
'R4_opp', 'R5_opp', 'R6_opp',
'R7_opp')
head(playerandOppsMatrix)
## TotalPts R1_opp R2_opp R3_opp R4_opp R5_opp R6_opp R7_opp
## [1,] "6.0 " " 39" " 21" " 18" " 14" " 7" " 12" " 4"
## [2,] "6.0 " " 63" " 58" " 4" " 17" " 16" " 20" " 7"
## [3,] "6.0 " " 8" " 61" " 25" " 21" " 11" " 13" " 12"
## [4,] "5.5 " " 23" " 28" " 2" " 26" " 5" " 19" " 1"
## [5,] "5.5 " " 45" " 37" " 12" " 13" " 4" " 14" " 17"
## [6,] "5.0 " " 34" " 29" " 11" " 35" " 10" " 27" " 21"
I convert the matrix to a dataframe and add a column of row names to correspond to the player ID, similar to above. I also convert ID and points to type numeric.
playerandOppsMatrix <- as.data.frame(playerandOppsMatrix)
playerandOppsMatrix <- rownames_to_column(playerandOppsMatrix, "ID")
playerandOppsMatrix$ID <- as.numeric(playerandOppsMatrix$ID)
playerandOppsMatrix$TotalPts <- as.numeric(playerandOppsMatrix$TotalPts)
head(playerandOppsMatrix)
## ID TotalPts R1_opp R2_opp R3_opp R4_opp R5_opp R6_opp R7_opp
## 1 1 6.0 39 21 18 14 7 12 4
## 2 2 6.0 63 58 4 17 16 20 7
## 3 3 6.0 8 61 25 21 11 13 12
## 4 4 5.5 23 28 2 26 5 19 1
## 5 5 5.5 45 37 12 13 4 14 17
## 6 6 5.0 34 29 11 35 10 27 21
Now I merge the data frame that has the player ID, total points, and list of opponents per round with the cleaned data frame I have created previously, merging by the ID column. I convert columns that have numbers into type numeric, and then move the data from first data frame (name, state, etc) to the start of the merged data frame. I drop the total points column since I already have it.
mergedData <- merge(playerandOppsMatrix, cleanedData, by = 'ID')
colsToConvert2 <- names(mergedData[3:9])
mergedData[colsToConvert2] <- lapply(mergedData[colsToConvert2], as.numeric)
mergedData <- relocate(mergedData, playerName, playerStates,
playerTotalPoints, preRating,
.after = ID )
mergedData$TotalPts <- NULL
head(mergedData, n=10)
## ID playerName playerStates playerTotalPoints preRating R1_opp
## 1 1 GARY HUA ON 6.0 1794 39
## 2 2 DAKSHESH DARURI MI 6.0 1553 63
## 3 3 ADITYA BAJAJ MI 6.0 1384 8
## 4 4 PATRICK H SCHILLING MI 5.5 1716 23
## 5 5 HANSHI ZUO MI 5.5 1655 45
## 6 6 HANSEN SONG OH 5.0 1686 34
## 7 7 GARY DEE SWATHELL MI 5.0 1649 57
## 8 8 EZEKIEL HOUGHTON MI 5.0 1641 3
## 9 9 STEFANO LEE ON 5.0 1411 25
## 10 10 ANVIT RAO MI 5.0 1365 16
## R2_opp R3_opp R4_opp R5_opp R6_opp R7_opp
## 1 21 18 14 7 12 4
## 2 58 4 17 16 20 7
## 3 61 25 21 11 13 12
## 4 28 2 26 5 19 1
## 5 37 12 13 4 14 17
## 6 29 11 35 10 27 21
## 7 46 13 11 1 9 2
## 8 32 14 9 47 28 19
## 9 18 59 8 26 7 20
## 10 19 55 31 6 25 18
The last column we need to deliver is the average of the opponents pre-ratings. We have to look up each opponent’s ID and then find their pre-rating. I first want to add columns to the data frame that has the pre-rating of the opponent for each round. Once I have a column for the pre-rating of an opponent for each round, I can create a column that is the average of these aforementioned columns.
To return an opponents pre-rating based on their ID, I create a function that is similar to the vlookup function in excel. In the below function, I provide the dataframe, the lookup values column (eg the column that has the IDs for round 1), the column in which to find these IDs (the ID column), and then the column to return (the pre-rating column). The function returns a vector of pre-ratings based on the IDs passed.
As an example, the function should do the below for a list of IDs.
mergedData$preRating[mergedData$R1_opp[1]==mergedData$ID]
## [1] 1436
Below is the function. I add an if else statement to account for NAs as some players did not have opponents in certain rounds.
lookupFunction <- function(dataFrame,
lookupValues, lookupColumn, returnColumn) {
returnVector <- c() ### an empty vector
for (i in 1:length(dataFrame[[lookupValues]])) {
if (is.na(dataFrame[[lookupValues]][i])) {
returnVector[i] <- NA
}
else {
returnVector[i] <- dataFrame[[returnColumn]][dataFrame[[lookupValues]][i]==dataFrame[[lookupColumn]]]
}
}
return(returnVector)
}
Using the above function, I create 7 columns that have the opponent pre-rating for each round.
mergedData$R1_opp_rating <- lookupFunction(mergedData, 'R1_opp', 'ID', 'preRating')
mergedData$R2_opp_rating <- lookupFunction(mergedData, 'R2_opp', 'ID', 'preRating')
mergedData$R3_opp_rating <- lookupFunction(mergedData, 'R3_opp', 'ID', 'preRating')
mergedData$R4_opp_rating <- lookupFunction(mergedData, 'R4_opp', 'ID', 'preRating')
mergedData$R5_opp_rating <- lookupFunction(mergedData, 'R5_opp', 'ID', 'preRating')
mergedData$R6_opp_rating <- lookupFunction(mergedData, 'R6_opp', 'ID', 'preRating')
mergedData$R7_opp_rating <- lookupFunction(mergedData, 'R7_opp', 'ID', 'preRating')
head(mergedData)
## ID playerName playerStates playerTotalPoints preRating R1_opp R2_opp
## 1 1 GARY HUA ON 6.0 1794 39 21
## 2 2 DAKSHESH DARURI MI 6.0 1553 63 58
## 3 3 ADITYA BAJAJ MI 6.0 1384 8 61
## 4 4 PATRICK H SCHILLING MI 5.5 1716 23 28
## 5 5 HANSHI ZUO MI 5.5 1655 45 37
## 6 6 HANSEN SONG OH 5.0 1686 34 29
## R3_opp R4_opp R5_opp R6_opp R7_opp R1_opp_rating R2_opp_rating R3_opp_rating
## 1 18 14 7 12 4 1436 1563 1600
## 2 4 17 16 20 7 1175 917 1716
## 3 25 21 11 13 12 1641 955 1745
## 4 2 26 5 19 1 1363 1507 1553
## 5 12 13 4 14 17 1242 980 1663
## 6 11 35 10 27 21 1399 1602 1712
## R4_opp_rating R5_opp_rating R6_opp_rating R7_opp_rating
## 1 1610 1649 1663 1716
## 2 1629 1604 1595 1649
## 3 1563 1712 1666 1663
## 4 1579 1655 1564 1794
## 5 1666 1716 1610 1629
## 6 1438 1365 1552 1563
Now that I have the pre-ratings of each opponent in each round, I create a column that has the average of the opponent pre-ratings.
mergedData$avgOppRating <- round(rowMeans(mergedData[13:19], na.rm = TRUE))
head(mergedData)
## ID playerName playerStates playerTotalPoints preRating R1_opp R2_opp
## 1 1 GARY HUA ON 6.0 1794 39 21
## 2 2 DAKSHESH DARURI MI 6.0 1553 63 58
## 3 3 ADITYA BAJAJ MI 6.0 1384 8 61
## 4 4 PATRICK H SCHILLING MI 5.5 1716 23 28
## 5 5 HANSHI ZUO MI 5.5 1655 45 37
## 6 6 HANSEN SONG OH 5.0 1686 34 29
## R3_opp R4_opp R5_opp R6_opp R7_opp R1_opp_rating R2_opp_rating R3_opp_rating
## 1 18 14 7 12 4 1436 1563 1600
## 2 4 17 16 20 7 1175 917 1716
## 3 25 21 11 13 12 1641 955 1745
## 4 2 26 5 19 1 1363 1507 1553
## 5 12 13 4 14 17 1242 980 1663
## 6 11 35 10 27 21 1399 1602 1712
## R4_opp_rating R5_opp_rating R6_opp_rating R7_opp_rating avgOppRating
## 1 1610 1649 1663 1716 1605
## 2 1629 1604 1595 1649 1469
## 3 1563 1712 1666 1663 1564
## 4 1579 1655 1564 1794 1574
## 5 1666 1716 1610 1629 1501
## 6 1438 1365 1552 1563 1519
From the above data frame, I select the original five columns that are asked for in the deliverable and then write the data frame to a CSV file.
finalDF <- mergedData[c(2,3,4,5,20)]
head(finalDF, n=10)
## playerName playerStates playerTotalPoints preRating avgOppRating
## 1 GARY HUA ON 6.0 1794 1605
## 2 DAKSHESH DARURI MI 6.0 1553 1469
## 3 ADITYA BAJAJ MI 6.0 1384 1564
## 4 PATRICK H SCHILLING MI 5.5 1716 1574
## 5 HANSHI ZUO MI 5.5 1655 1501
## 6 HANSEN SONG OH 5.0 1686 1519
## 7 GARY DEE SWATHELL MI 5.0 1649 1372
## 8 EZEKIEL HOUGHTON MI 5.0 1641 1468
## 9 STEFANO LEE ON 5.0 1411 1523
## 10 ANVIT RAO MI 5.0 1365 1554
write.csv(finalDF, "chessTournamentInfo.csv", row.names = FALSE) #will be saved into the current working directory