Project 1

Load Libraries

library(RCurl)
library(stringr)
library(readr)
library(dplyr)
library(tidyr)
library(knitr)
library(kableExtra)

Load Raw Data

Using the RCurl library download the raw text data from my GitHub repository. Set the result as raw_data

file_url <- 'https://raw.githubusercontent.com/nolivercuny/data607/master/project1/tournamentinfo.txt'
raw_data <- getURL(file_url)

Processing

Converting raw text string to dataframe

Get rid of dashes

raw_data <- str_replace_all(raw_data,"-", "")

Split the data on newlines then on the pipe character. Newlines being the logical separator between the rows and the pipe character being the separator of the columns.

split_data<- raw_data %>% 
  str_split('\n', simplify = TRUE) %>% 
  str_split('\\|', simplify = TRUE)

Convert to a dataframe and use the filter function to remove empty rows

df <- as.data.frame(split_data) %>%
  filter(V1 != "", V1 != "\r")

Manipulate the dataframe

At this point we have a dataframe which contains all our data in individual rows but there are two rows representing a single player record. This is because the raw data was structured in such a way that a logical row was actually represented as two newline separated lines in the text file.

Break up the dataframe into two dataframes so they can be recombined as a single row. This is done by using the modulus operator to split into even and odd rows.

df_one <- df %>% filter(row_number() %% 2 == 1) 
df_two <- df %>% filter(row_number() %% 2 == 0)

Set the column names to the first row.

names(df_one) <- df_one[1,]
names(df_one) <- df_one[1,]

Merge the two dataframes back into a single data frame. I am also using the subset combined with the select function to drop two columns that contain no data.

data <- bind_cols(df_one,df_two) %>% 
  subset(select=-c(11,22))

## New names:
## * Round -> Round...4
## * Round -> Round...5
## * Round -> Round...6
## * Round -> Round...7
## * Round -> Round...8
## * ...

Drop the first row because it contained the column names and we no longer need that.

data <- data[-c(1),]

Split column 12 which contains the USCF Id and the Pre and Post ratings for the the player. Then split the pre and post ratings into separate columns as well.

data <- separate(data, 12, sep="/", into = c('USCF_ID','PRE_POST'))
data <- separate(data, 13, sep=">", into = c('Pre_Rating','Post_Rating'))

Parse the Pre rating column as a number so it can be used in subsequent calculation

data$Pre_Rating <- parse_number(data$Pre_Rating)

Computing Opponent’s Pre-Rating Mean

Loop over every row.
Select the “Rounds” columns (4 through 10).
Because those contain the row number of the opponent. Parse out that value as a number using gsub to replace all non-numeric characters with no value.
Then us as.numeric to cast the values as numbers to be used as indexes in the dataframe.
Use the opponent index vector to select the opponents pre-ratings.
Use the mean function to obtain the average opponent pre-rating with the na.rm argument set to TRUE in order to remove NA values where opponents were not part of the round for the player.
Round the mean to the nearest whole value using the round function and add that value as a new column in the dataframe.

for (i in 1:nrow(data)){
  opponent_indexes <- as.numeric(gsub("[^0-9]","",data[i,4:10]))
  opponents_ratings<-data[opponent_indexes,13]
  rating_average <- mean(opponents_ratings, na.rm = TRUE)
  data$Mean_Opponent_Pre_Rating[i] <- round(rating_average)
}

Clean up for final output

Select only relevant columns
Remove the R: from the player’s pre-rating.
Rename the columns
Trim whitespace
Reset the row numbers

final_output <- data[,c(2,3,11,13,23)]
final_output$Pre_Rating <- gsub("R:", "", final_output$Pre_Rating)
names(final_output) <- c('Player’s Name', 'Total Number of Points','Player’s State', 'Player’s Pre-Rating', 'Average Pre Chess Rating of Opponents')
final_output <- final_output %>% 
  mutate(across(where(is.character), str_trim))
rownames(final_output) <- NULL

Display the values as a table for double checking before outputing to a CSV

kable(final_output,caption="Players",digits = 2, format = "html", row.names = TRUE)

Players
	Player’s Name	Total Number of Points	Player’s State	Player’s Pre-Rating	Average Pre Chess Rating of Opponents
1	GARY HUA	6.0	ON	1794	1605
2	DAKSHESH DARURI	6.0	MI	1553	1469
3	ADITYA BAJAJ	6.0	MI	1384	1564
4	PATRICK H SCHILLING	5.5	MI	1716	1574
5	HANSHI ZUO	5.5	MI	1655	1501
6	HANSEN SONG	5.0	OH	1686	1519
7	GARY DEE SWATHELL	5.0	MI	1649	1372
8	EZEKIEL HOUGHTON	5.0	MI	1641	1468
9	STEFANO LEE	5.0	ON	1411	1523
10	ANVIT RAO	5.0	MI	1365	1554
11	CAMERON WILLIAM MC LEMAN	4.5	MI	1712	1468
12	KENNETH J TACK	4.5	MI	1663	1506
13	TORRANCE HENRY JR	4.5	MI	1666	1498
14	BRADLEY SHAW	4.5	MI	1610	1515
15	ZACHARY JAMES HOUGHTON	4.5	MI	1220	1484
16	MIKE NIKITIN	4.0	MI	1604	1386
17	RONALD GRZEGORCZYK	4.0	MI	1629	1499
18	DAVID SUNDEEN	4.0	MI	1600	1480
19	DIPANKAR ROY	4.0	MI	1564	1426
20	JASON ZHENG	4.0	MI	1595	1411
21	DINH DANG BUI	4.0	ON	1563	1470
22	EUGENE L MCCLURE	4.0	MI	1555	1300
23	ALAN BUI	4.0	ON	1363	1214
24	MICHAEL R ALDRICH	4.0	MI	1229	1357
25	LOREN SCHWIEBERT	3.5	MI	1745	1363
26	MAX ZHU	3.5	ON	1579	1507
27	GAURAV GIDWANI	3.5	MI	1552	1222
28	SOFIA ADINA STANESCUBELLU	3.5	MI	1507	1522
29	CHIEDOZIE OKORIE	3.5	MI	1602	1314
30	GEORGE AVERY JONES	3.5	ON	1522	1144
31	RISHI SHETTY	3.5	MI	1494	1260
32	JOSHUA PHILIP MATHEWS	3.5	ON	1441	1379
33	JADE GE	3.5	MI	1449	1277
34	MICHAEL JEFFERY THOMAS	3.5	MI	1399	1375
35	JOSHUA DAVID LEE	3.5	MI	1438	1150
36	SIDDHARTH JHA	3.5	MI	1355	1388
37	AMIYATOSH PWNANANDAM	3.5	MI	980	1385
38	BRIAN LIU	3.0	MI	1423	1539
39	JOEL R HENDON	3.0	MI	1436	1430
40	FOREST ZHANG	3.0	MI	1348	1391
41	KYLE WILLIAM MURPHY	3.0	MI	1403	1248
42	JARED GE	3.0	MI	1332	1150
43	ROBERT GLEN VASEY	3.0	MI	1283	1107
44	JUSTIN D SCHILLING	3.0	MI	1199	1327
45	DEREK YAN	3.0	MI	1242	1152
46	JACOB ALEXANDER LAVALLEY	3.0	MI	377	1358
47	ERIC WRIGHT	2.5	MI	1362	1392
48	DANIEL KHAIN	2.5	MI	1382	1356
49	MICHAEL J MARTIN	2.5	MI	1291	1286
50	SHIVAM JHA	2.5	MI	1056	1296
51	TEJAS AYYAGARI	2.5	MI	1011	1356
52	ETHAN GUO	2.5	MI	935	1495
53	JOSE C YBARRA	2.0	MI	1393	1345
54	LARRY HODGE	2.0	MI	1270	1206
55	ALEX KONG	2.0	MI	1186	1406
56	MARISA RICCI	2.0	MI	1153	1414
57	MICHAEL LU	2.0	MI	1092	1363
58	VIRAJ MOHILE	2.0	MI	917	1391
59	SEAN M MC CORMICK	2.0	MI	853	1319
60	JULIA SHEN	1.5	MI	967	1330
61	JEZZEL FARKAS	1.5	ON	955	1327
62	ASHWIN BALAJI	1.0	MI	1530	1186
63	THOMAS JOSEPH HOSMER	1.0	MI	1175	1350
64	BEN LI	1.0	MI	1163	1263

Write out the results as CSV

Making an attempt at being agnostic about file systems by trying to get the working directory and using that as the location to write out the CSV.

working_directory <- getwd()
write_csv(final_output,paste(working_directory, "/chess_tournement_players.csv",sep=""))