Project 1
Load Libraries
library(RCurl)
library(stringr)
library(readr)
library(dplyr)
library(tidyr)
library(knitr)
library(kableExtra)Load Raw Data
Using the RCurl library download the raw text data from my GitHub repository. Set the result as raw_data
file_url <- 'https://raw.githubusercontent.com/nolivercuny/data607/master/project1/tournamentinfo.txt'
raw_data <- getURL(file_url)Processing
Converting raw text string to dataframe
Get rid of dashes
raw_data <- str_replace_all(raw_data,"-", "")Split the data on newlines then on the pipe character. Newlines being the logical separator between the rows and the pipe character being the separator of the columns.
split_data<- raw_data %>%
str_split('\n', simplify = TRUE) %>%
str_split('\\|', simplify = TRUE)Convert to a dataframe and use the filter function to remove empty rows
df <- as.data.frame(split_data) %>%
filter(V1 != "", V1 != "\r")Manipulate the dataframe
At this point we have a dataframe which contains all our data in individual rows but there are two rows representing a single player record. This is because the raw data was structured in such a way that a logical row was actually represented as two newline separated lines in the text file.
Break up the dataframe into two dataframes so they can be recombined as a single row. This is done by using the modulus operator to split into even and odd rows.
df_one <- df %>% filter(row_number() %% 2 == 1)
df_two <- df %>% filter(row_number() %% 2 == 0)Set the column names to the first row.
names(df_one) <- df_one[1,]
names(df_one) <- df_one[1,]Merge the two dataframes back into a single data frame. I am also using the subset combined with the select function to drop two columns that contain no data.
data <- bind_cols(df_one,df_two) %>%
subset(select=-c(11,22))## New names:
## * Round -> Round...4
## * Round -> Round...5
## * Round -> Round...6
## * Round -> Round...7
## * Round -> Round...8
## * ...
Drop the first row because it contained the column names and we no longer need that.
data <- data[-c(1),]Split column 12 which contains the USCF Id and the Pre and Post ratings for the the player. Then split the pre and post ratings into separate columns as well.
data <- separate(data, 12, sep="/", into = c('USCF_ID','PRE_POST'))
data <- separate(data, 13, sep=">", into = c('Pre_Rating','Post_Rating'))Parse the Pre rating column as a number so it can be used in subsequent calculation
data$Pre_Rating <- parse_number(data$Pre_Rating)Computing Opponent’s Pre-Rating Mean
- Loop over every row.
- Select the “Rounds” columns (4 through 10).
- Because those contain the row number of the opponent. Parse out that value as a number using
gsubto replace all non-numeric characters with no value. - Then us
as.numericto cast the values as numbers to be used as indexes in the dataframe. - Use the opponent index vector to select the opponents pre-ratings.
- Use the mean function to obtain the average opponent pre-rating with the
na.rmargument set toTRUEin order to removeNAvalues where opponents were not part of the round for the player. - Round the mean to the nearest whole value using the
roundfunction and add that value as a new column in the dataframe.
for (i in 1:nrow(data)){
opponent_indexes <- as.numeric(gsub("[^0-9]","",data[i,4:10]))
opponents_ratings<-data[opponent_indexes,13]
rating_average <- mean(opponents_ratings, na.rm = TRUE)
data$Mean_Opponent_Pre_Rating[i] <- round(rating_average)
}Clean up for final output
- Select only relevant columns
- Remove the
R:from the player’s pre-rating. - Rename the columns
- Trim whitespace
- Reset the row numbers
final_output <- data[,c(2,3,11,13,23)]
final_output$Pre_Rating <- gsub("R:", "", final_output$Pre_Rating)
names(final_output) <- c('Player’s Name', 'Total Number of Points','Player’s State', 'Player’s Pre-Rating', 'Average Pre Chess Rating of Opponents')
final_output <- final_output %>%
mutate(across(where(is.character), str_trim))
rownames(final_output) <- NULLDisplay the values as a table for double checking before outputing to a CSV
kable(final_output,caption="Players",digits = 2, format = "html", row.names = TRUE)| Player’s Name | Total Number of Points | Player’s State | Player’s Pre-Rating | Average Pre Chess Rating of Opponents | |
|---|---|---|---|---|---|
| 1 | GARY HUA | 6.0 | ON | 1794 | 1605 |
| 2 | DAKSHESH DARURI | 6.0 | MI | 1553 | 1469 |
| 3 | ADITYA BAJAJ | 6.0 | MI | 1384 | 1564 |
| 4 | PATRICK H SCHILLING | 5.5 | MI | 1716 | 1574 |
| 5 | HANSHI ZUO | 5.5 | MI | 1655 | 1501 |
| 6 | HANSEN SONG | 5.0 | OH | 1686 | 1519 |
| 7 | GARY DEE SWATHELL | 5.0 | MI | 1649 | 1372 |
| 8 | EZEKIEL HOUGHTON | 5.0 | MI | 1641 | 1468 |
| 9 | STEFANO LEE | 5.0 | ON | 1411 | 1523 |
| 10 | ANVIT RAO | 5.0 | MI | 1365 | 1554 |
| 11 | CAMERON WILLIAM MC LEMAN | 4.5 | MI | 1712 | 1468 |
| 12 | KENNETH J TACK | 4.5 | MI | 1663 | 1506 |
| 13 | TORRANCE HENRY JR | 4.5 | MI | 1666 | 1498 |
| 14 | BRADLEY SHAW | 4.5 | MI | 1610 | 1515 |
| 15 | ZACHARY JAMES HOUGHTON | 4.5 | MI | 1220 | 1484 |
| 16 | MIKE NIKITIN | 4.0 | MI | 1604 | 1386 |
| 17 | RONALD GRZEGORCZYK | 4.0 | MI | 1629 | 1499 |
| 18 | DAVID SUNDEEN | 4.0 | MI | 1600 | 1480 |
| 19 | DIPANKAR ROY | 4.0 | MI | 1564 | 1426 |
| 20 | JASON ZHENG | 4.0 | MI | 1595 | 1411 |
| 21 | DINH DANG BUI | 4.0 | ON | 1563 | 1470 |
| 22 | EUGENE L MCCLURE | 4.0 | MI | 1555 | 1300 |
| 23 | ALAN BUI | 4.0 | ON | 1363 | 1214 |
| 24 | MICHAEL R ALDRICH | 4.0 | MI | 1229 | 1357 |
| 25 | LOREN SCHWIEBERT | 3.5 | MI | 1745 | 1363 |
| 26 | MAX ZHU | 3.5 | ON | 1579 | 1507 |
| 27 | GAURAV GIDWANI | 3.5 | MI | 1552 | 1222 |
| 28 | SOFIA ADINA STANESCUBELLU | 3.5 | MI | 1507 | 1522 |
| 29 | CHIEDOZIE OKORIE | 3.5 | MI | 1602 | 1314 |
| 30 | GEORGE AVERY JONES | 3.5 | ON | 1522 | 1144 |
| 31 | RISHI SHETTY | 3.5 | MI | 1494 | 1260 |
| 32 | JOSHUA PHILIP MATHEWS | 3.5 | ON | 1441 | 1379 |
| 33 | JADE GE | 3.5 | MI | 1449 | 1277 |
| 34 | MICHAEL JEFFERY THOMAS | 3.5 | MI | 1399 | 1375 |
| 35 | JOSHUA DAVID LEE | 3.5 | MI | 1438 | 1150 |
| 36 | SIDDHARTH JHA | 3.5 | MI | 1355 | 1388 |
| 37 | AMIYATOSH PWNANANDAM | 3.5 | MI | 980 | 1385 |
| 38 | BRIAN LIU | 3.0 | MI | 1423 | 1539 |
| 39 | JOEL R HENDON | 3.0 | MI | 1436 | 1430 |
| 40 | FOREST ZHANG | 3.0 | MI | 1348 | 1391 |
| 41 | KYLE WILLIAM MURPHY | 3.0 | MI | 1403 | 1248 |
| 42 | JARED GE | 3.0 | MI | 1332 | 1150 |
| 43 | ROBERT GLEN VASEY | 3.0 | MI | 1283 | 1107 |
| 44 | JUSTIN D SCHILLING | 3.0 | MI | 1199 | 1327 |
| 45 | DEREK YAN | 3.0 | MI | 1242 | 1152 |
| 46 | JACOB ALEXANDER LAVALLEY | 3.0 | MI | 377 | 1358 |
| 47 | ERIC WRIGHT | 2.5 | MI | 1362 | 1392 |
| 48 | DANIEL KHAIN | 2.5 | MI | 1382 | 1356 |
| 49 | MICHAEL J MARTIN | 2.5 | MI | 1291 | 1286 |
| 50 | SHIVAM JHA | 2.5 | MI | 1056 | 1296 |
| 51 | TEJAS AYYAGARI | 2.5 | MI | 1011 | 1356 |
| 52 | ETHAN GUO | 2.5 | MI | 935 | 1495 |
| 53 | JOSE C YBARRA | 2.0 | MI | 1393 | 1345 |
| 54 | LARRY HODGE | 2.0 | MI | 1270 | 1206 |
| 55 | ALEX KONG | 2.0 | MI | 1186 | 1406 |
| 56 | MARISA RICCI | 2.0 | MI | 1153 | 1414 |
| 57 | MICHAEL LU | 2.0 | MI | 1092 | 1363 |
| 58 | VIRAJ MOHILE | 2.0 | MI | 917 | 1391 |
| 59 | SEAN M MC CORMICK | 2.0 | MI | 853 | 1319 |
| 60 | JULIA SHEN | 1.5 | MI | 967 | 1330 |
| 61 | JEZZEL FARKAS | 1.5 | ON | 955 | 1327 |
| 62 | ASHWIN BALAJI | 1.0 | MI | 1530 | 1186 |
| 63 | THOMAS JOSEPH HOSMER | 1.0 | MI | 1175 | 1350 |
| 64 | BEN LI | 1.0 | MI | 1163 | 1263 |
Write out the results as CSV
Making an attempt at being agnostic about file systems by trying to get the working directory and using that as the location to write out the CSV.
working_directory <- getwd()
write_csv(final_output,paste(working_directory, "/chess_tournement_players.csv",sep=""))