Project 1

1. Introduction

In this project, you’re given a text file with chess tournament results where the information has some structure. Your job is to create an R Markdown file that generates a .CSV file (that could for example be imported into a SQL database) with the following information for all of the players:
Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents.

2. Load text file

The text file will be downloaded from the Github repository to csv file using read_csv function. read_csv is faster for large .csv files comparing to read.csv function as it imports data into R as a tibble. So it may be better be familiar to try to use it in this project.

data <-read_csv(file = "https://raw.githubusercontent.com/ex-pr/DATA607/Project-1/7645617.txt",col_names = FALSE, show_col_types = FALSE)

By checking the data downloaded, we have 1 column with 196 rows. The final file should contain information about 64 players.

summary(data)

##       X1           
##  Length:196        
##  Class :character  
##  Mode  :character

head(data)

## # A tibble: 6 x 1
##   X1                                                                            
##   <chr>                                                                         
## 1 -----------------------------------------------------------------------------~
## 2 Pair | Player Name                     |Total|Round|Round|Round|Round|Round|R~
## 3 Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  | ~
## 4 -----------------------------------------------------------------------------~
## 5 1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  1~
## 6 ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B  ~

To make the work easy, I will remove column names as well as the last dashed line.

data <- data %>%
  filter(!row_number() %in% c(1, 2, 3, 4, 196))
data

## # A tibble: 191 x 1
##    X1                                                                           
##    <chr>                                                                        
##  1 1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  ~
##  2 ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B ~
##  3 ----------------------------------------------------------------------------~
##  4 2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  ~
##  5 MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W ~
##  6 ----------------------------------------------------------------------------~
##  7 3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  ~
##  8 MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B ~
##  9 ----------------------------------------------------------------------------~
## 10 4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  ~
## # ... with 181 more rows

3. Format data

3.1 Deframe data

As we see in our file each player takes 2 lines, so we need to take the information we need for each player from these two lines and make it as one line for one player. To make it easy to work with, we will deframe the tibble and receive 196 strings of data.

chess <- deframe(data)
head(chess)

## [1] "1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"    
## [2] "ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"   
## [3] "-----------------------------------------------------------------------------------------"
## [4] "2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|"    
## [5] "MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |"   
## [6] "-----------------------------------------------------------------------------------------"

3.2 Removing - and |

Using code and libraries from the previous HW, we will treat our data as string. At this step, the main goal is to remove all | between future columns and dashes between future rows.
As the first step, we will detect and delete dashes with a help from regular expressions.

remove_ <- str_detect(chess, '^[-]{2,}$') 
chess <- chess[!remove_ == "TRUE"]
head(chess)

## [1] "1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|" 
## [2] "ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
## [3] "2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|" 
## [4] "MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
## [5] "3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|" 
## [6] "MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |"

Next, we will substitute | and / with comma as it will help us to convert data into data frame.

chess <- str_replace_all(chess, "[|/]",",")   
head(chess)

## [1] "1 , GARY HUA                        ,6.0  ,W  39,W  21,W  18,W  14,W   7,D  12,D   4," 
## [2] "ON , 15445895 , R: 1794   ->1817     ,N:2  ,W    ,B    ,W    ,B    ,W    ,B    ,W    ,"
## [3] "2 , DAKSHESH DARURI                 ,6.0  ,W  63,W  58,L   4,W  17,W  16,W  20,W   7," 
## [4] "MI , 14598900 , R: 1553   ->1663     ,N:2  ,B    ,W    ,B    ,W    ,B    ,W    ,B    ,"
## [5] "3 , ADITYA BAJAJ                    ,6.0  ,L   8,W  61,W  25,W  21,W  11,W  13,W  12," 
## [6] "MI , 14959604 , R: 1384   ->1640     ,N:2  ,W    ,B    ,W    ,B    ,W    ,B    ,W    ,"

3.3 Combining two rows and creating data frame

Combine two rows

a <- c("")
for (i in seq(1, length(chess)-1, by = 2)){
   a <- c(a, paste(chess[i], chess[i+1], sep = "", collapse = NULL))
}

Create data frame from the strings of data.

chess_final <- as.data.frame(do.call(rbind, strsplit(a, ",")), stringsAsFactors = FALSE)
head(chess_final)

##   V1                                V2    V3    V4    V5    V6    V7    V8
## 1 1   GARY HUA                         6.0   W  39 W  21 W  18 W  14 W   7
## 2 2   DAKSHESH DARURI                  6.0   W  63 W  58 L   4 W  17 W  16
## 3 3   ADITYA BAJAJ                     6.0   L   8 W  61 W  25 W  21 W  11
## 4 4   PATRICK H SCHILLING              5.5   W  23 D  28 W   2 W  26 D   5
## 5 5   HANSHI ZUO                       5.5   W  45 W  37 D  12 D  13 D   4
## 6 6   HANSEN SONG                      5.0   W  34 D  29 L  11 W  35 D  10
##      V9   V10 V11        V12                    V13   V14   V15   V16   V17
## 1 D  12 D   4 ON   15445895   R: 1794   ->1817      N:2   W     B     W    
## 2 W  20 W   7 MI   14598900   R: 1553   ->1663      N:2   B     W     B    
## 3 W  13 W  12 MI   14959604   R: 1384   ->1640      N:2   W     B     W    
## 4 W  19 D   1 MI   12616049   R: 1716   ->1744      N:2   W     B     W    
## 5 W  14 W  17 MI   14601533   R: 1655   ->1690      N:2   B     W     B    
## 6 W  27 W  21 OH   15055204   R: 1686   ->1687      N:3   W     B     W    
##     V18   V19   V20   V21
## 1 B     W     B     W    
## 2 W     B     W     B    
## 3 B     W     B     W    
## 4 B     W     B     B    
## 5 W     B     W     B    
## 6 B     B     W     B

3.4 Format data frame

We will remove columns that we don’t need for the assignment

chess_final[12] <- list(NULL) 
chess_final[c(1,13:ncol(chess_final))] <- list(NULL)

Rename columns.

colnames(chess_final)[1] <- c("Name")
colnames(chess_final)[2] <- c("Total Number of Points")
colnames(chess_final)[3:9] <- c("R1","R2","R3","R4","R5","R6","R7")
colnames(chess_final)[10] <- c("State")
colnames(chess_final)[11] <- c("Pre_Rating")

We need to extract pre-rating for each player.

chess_final$Pre_Rating <- str_sub(chess_final$Pre_Rating, 5, 8)

Remove letters from Round results.

nm1 <- c("R1", "R2", "R3", "R4", "R5", "R6", "R7") 
chess_final[nm1]<-lapply(chess_final[nm1], gsub, pattern = "W ", replacement = "")
chess_final[nm1]<-lapply(chess_final[nm1], gsub, pattern = "L ", replacement = "")
chess_final[nm1]<-lapply(chess_final[nm1], gsub, pattern = "D ", replacement = "")
chess_final[nm1]<-lapply(chess_final[nm1], gsub, pattern = "U ", replacement = "")
chess_final[nm1]<-lapply(chess_final[nm1], gsub, pattern = "H ", replacement = "")
chess_final[nm1]<-lapply(chess_final[nm1], gsub, pattern = "X ", replacement = "")

If there are any empty (NA) spaces, we will substitute it with 0

chess_final[is.na(chess_final)] <- 0

We won’t be able to calculate average if we don’t transform data in columns 2:9 and 11 in to numbers.

chess_final[c(2:9,11)] <- sapply((chess_final)[c(2:9,11)], as.numeric) 
chess_final[is.na(chess_final)] <- 0  
head(chess_final)

##                                Name Total Number of Points R1 R2 R3 R4 R5 R6 R7
## 1  GARY HUA                                            6.0 39 21 18 14  7 12  4
## 2  DAKSHESH DARURI                                     6.0 63 58  4 17 16 20  7
## 3  ADITYA BAJAJ                                        6.0  8 61 25 21 11 13 12
## 4  PATRICK H SCHILLING                                 5.5 23 28  2 26  5 19  1
## 5  HANSHI ZUO                                          5.5 45 37 12 13  4 14 17
## 6  HANSEN SONG                                         5.0 34 29 11 35 10 27 21
##   State Pre_Rating
## 1   ON        1794
## 2   MI        1553
## 3   MI        1384
## 4   MI        1716
## 5   MI        1655
## 6   OH        1686

4 Calculating Average Pre-Chess Rating of Opponents

The average pre-rating of opponents is calculated by using the pre-tournament opponents’ ratings and dividing by the total number of games played.

total <- numeric(0)
opponents <- numeric(0)
avg_rate <- vector()
for (i in 1:length(chess_final$Pre_Rating)){
    player <- as.numeric(as.vector(chess_final[i,3:9])) 
    total <- sum(chess_final[player, "Pre_Rating"])
    opponents <-  sum(chess_final[i,c(3:9)]!=0) # number of opponents for each player, if no opponents than 0
    avg_rate[i] <- round(total / opponents, digits = 0)   
    }
chess_final$Avg_Pre_Rate <- avg_rate
head(chess_final)

##                                Name Total Number of Points R1 R2 R3 R4 R5 R6 R7
## 1  GARY HUA                                            6.0 39 21 18 14  7 12  4
## 2  DAKSHESH DARURI                                     6.0 63 58  4 17 16 20  7
## 3  ADITYA BAJAJ                                        6.0  8 61 25 21 11 13 12
## 4  PATRICK H SCHILLING                                 5.5 23 28  2 26  5 19  1
## 5  HANSHI ZUO                                          5.5 45 37 12 13  4 14 17
## 6  HANSEN SONG                                         5.0 34 29 11 35 10 27 21
##   State Pre_Rating Avg_Pre_Rate
## 1   ON        1794         1605
## 2   MI        1553         1469
## 3   MI        1384         1564
## 4   MI        1716         1574
## 5   MI        1655         1501
## 6   OH        1686         1519

5. Creating csv file

For the final csv file, we will use columns Name, State, Total Number of Points,Pre-Rating, and Average Pre Chess Rating of Opponents as it is asked by the assignment.

chess_final <- chess_final[,c(1,10,2,11,12)]
head(chess_final)

##                                Name State Total Number of Points Pre_Rating
## 1  GARY HUA                           ON                     6.0       1794
## 2  DAKSHESH DARURI                    MI                     6.0       1553
## 3  ADITYA BAJAJ                       MI                     6.0       1384
## 4  PATRICK H SCHILLING                MI                     5.5       1716
## 5  HANSHI ZUO                         MI                     5.5       1655
## 6  HANSEN SONG                        OH                     5.0       1686
##   Avg_Pre_Rate
## 1         1605
## 2         1469
## 3         1564
## 4         1574
## 5         1501
## 6         1519

write.csv(chess_final, file = "Chess.csv")