Project 1

Project introduction

In this project, you’re given a text file with chess tournament results where the information has some structure. Your job is to create an R Markdown file that generates a .CSV file (that could for example be imported into a SQL database) with the following information for all of the players: Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents.

For the first player, the information would be: Gary Hua, ON, 6.0, 1794, 1605

1605 was calculated by using the pre-tournament opponents’ ratings of 1436, 1563, 1600, 1610, 1649, 1663, 1716, and dividing by the total number of games played.

If you have questions about the meaning of the data or the results, please post them on the discussion forum. Data science, like chess, is a game of back and forth… The chess rating system (invented by a Minnesota statistician named Arpad Elo) has been used in many other contexts, including assessing relative strength of employment candidates by human resource departments.

You may substitute another text file (or set of text files, or data scraped from web pages) of similar or greater complexity, and create your own assignment and solution. You may work in a small team. All of your code should be in an R markdown file (and published to rpubs.com); with your data accessible for the person running the script.

Loading the data

library(stringr)
library(ggplot2)
library(tidyverse)

I'm going to put ./ in front of the file name to make it explicit that the path is relative to the working directory, which keeps the code portable.

tournament <- "./tournamentinfo.txt"
waheeb <- readLines(tournament)
head(waheeb, 7)

## [1] "-----------------------------------------------------------------------------------------" 
## [2] " Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| "
## [3] " Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | "
## [4] "-----------------------------------------------------------------------------------------" 
## [5] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|" 
## [6] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |" 
## [7] "-----------------------------------------------------------------------------------------"

Data transformation

# remove the first 4 header rows, which are not needed
con <- waheeb[-(1:4)]

# remove empty lines
con <- con[nchar(con) > 0]

# each player record spans three lines: a name line, a state/rating line,
# and a dashed separator; take the first and second line of every record
odd <- seq(1, length(con), 3)
odd_a <- con[odd]

even <- seq(2, length(con), 3)
even_a <- con[even]
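
The three-line record layout can be checked on a toy vector. The sketch below uses made-up lines (`toy` and the other names are hypothetical, not the tournament data) and suggests that stepping through the lines in threes picks out the same rows as reshaping them into a three-column matrix.

```r
# Hypothetical stand-in for the cleaned lines: two records of three lines each
toy <- c("player A", "details A", "-----",
         "player B", "details B", "-----")

# The split only makes sense if every record is complete
stopifnot(length(toy) %% 3 == 0)

# First and second line of every record
player_rows <- toy[seq(1, length(toy), by = 3)]
detail_rows <- toy[seq(2, length(toy), by = 3)]

# Equivalent view: one record per matrix row
records <- matrix(toy, ncol = 3, byrow = TRUE)
print(identical(player_rows, records[, 1]))
print(identical(detail_rows, records[, 2]))
```

If the length check fails on real input, a header or separator line was probably left in or dropped, and the offsets 1 and 2 would no longer line up with the name and rating lines.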

I will use regular expressions to extract only the required information.

# name: letters, hyphens and spaces up to the next "|"
name <- str_extract(odd_a, "\\s+([[:alpha:]- ]+)\\b\\s*\\|")
name <- gsub(name, pattern = "|", replacement = "", fixed = TRUE)
# strip the surrounding whitespace
name <- trimws(name)

# state
state <- str_extract(even_a, "[[:alpha:]]{2}")

# total_points
total_points <- str_extract(odd_a, "[[:digit:]]+\\.[[:digit:]]")
total_points <- as.numeric(as.character(total_points))

# pre_rating: match "R: " followed by the 3- or 4-digit pre-tournament rating
pre_rating <- str_extract(even_a, ".\\: \\s?[[:digit:]]{3,4}")
pre_rating <- gsub(pre_rating, pattern = "R: ", replacement = "", fixed = TRUE)
pre_rating <- as.numeric(pre_rating)

# opponent_number: pair numbers of the opponents each player faced
# (digits immediately before a "|"; unplayed rounds have none and are skipped)
opponent_number <- str_extract_all(odd_a, "[[:digit:]]{1,2}\\|")
opponent_number <- lapply(opponent_number, function(x) as.numeric(str_extract(x, "[[:digit:]]{1,2}")))
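
As a cross-check on the regular expressions, the same fields can be pulled out by splitting each line on the "|" delimiter. This is a different technique from the one used above, shown only as a sketch in base R on the two sample lines printed earlier; all `ex_` names are hypothetical.

```r
# The two lines for the first player, copied from the head() output above
player_line <- "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"
rating_line <- "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"

# Split on the literal pipe and trim the padding around each field
p <- trimws(strsplit(player_line, "|", fixed = TRUE)[[1]])
r <- trimws(strsplit(rating_line, "|", fixed = TRUE)[[1]])

ex_name   <- p[2]
ex_points <- as.numeric(p[3])
ex_state  <- r[1]

# Pre-rating: the digits that follow "R:" in the second field
ex_pre_rating <- as.numeric(sub(".*R:\\s*([0-9]+).*", "\\1", r[2]))

# Opponents: drop the result letter; rounds without an opponent become NA
ex_opponents <- as.numeric(sub("^[A-Z]\\s*", "", p[4:10]))
ex_opponents <- ex_opponents[!is.na(ex_opponents)]

print(list(ex_name, ex_state, ex_points, ex_pre_rating, ex_opponents))
```

Splitting relies on the columns always appearing in the same order, while the regular expressions rely on the shape of each value, so agreement between the two is a reasonable sanity check.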

Calculate the average pre-tournament rating of each player's opponents and store it in a list.

opp_avg_rating <- list()
for (i in 1:length(opponent_number)){
  opp_avg_rating[i] <- round(mean(pre_rating[unlist(opponent_number[i])]),2)
}
opp_avg_rating <- lapply(opp_avg_rating, as.numeric)
opp_avg_rating <- data.frame(unlist(opp_avg_rating))
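
The arithmetic can be verified by hand for the first player, using the seven opponent ratings quoted in the project introduction. The sketch below also shows, on made-up data, how indexing a rating vector by pair number gives each player's opponent average in one `sapply` call; the `toy_` and `gary_` names are hypothetical.

```r
# Opponent pre-ratings for the first player, taken from the project introduction
gary_opponents <- c(1436, 1563, 1600, 1610, 1649, 1663, 1716)
gary_avg <- mean(gary_opponents)
print(round(gary_avg, 2))  # two decimals, as in the loop above
print(round(gary_avg))     # whole number, as in the assignment text

# Hypothetical toy data: three players, each listing opponents by pair number
toy_rating <- c(1500, 1600, 1700)
toy_opponents <- list(c(2, 3), c(1, 3), c(1, 2))

# Look up each opponent's rating by position and average per player
toy_avg <- sapply(toy_opponents, function(idx) mean(toy_rating[idx]))
print(toy_avg)
```

The lookup works because the position of a player in the rating vector is the same as the pair number, which holds as long as the records are read in file order.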

Create the data frame

df <- cbind.data.frame(name, state, total_points, pre_rating, opp_avg_rating)
colnames(df) <- c("Name", "State", "Total_points", "Pre_rating", "Avg_pre_chess_rating_of_opponents")
head(df)

##                  Name State Total_points Pre_rating
## 1            GARY HUA    ON          6.0       1794
## 2     DAKSHESH DARURI    MI          6.0       1553
## 3        ADITYA BAJAJ    MI          6.0       1384
## 4 PATRICK H SCHILLING    MI          5.5       1716
## 5          HANSHI ZUO    MI          5.5       1655
## 6         HANSEN SONG    OH          5.0       1686
##   Avg_pre_chess_rating_of_opponents
## 1                           1605.29
## 2                           1469.29
## 3                           1563.57
## 4                           1573.57
## 5                           1500.86
## 6                           1518.71

Visualization

ggplot(data = df, aes(x = Pre_rating, y = Total_points)) +
  geom_point(size = 4, color = "blue") +
  ggtitle("Pre-chess rating vs Total points earned") +
  xlab("Pre-chess rating") +
  ylab("Total points earned")

df_state_points <- df %>% group_by(State) %>% 
  summarize(Total_points = sum(Total_points))

ggplot(data = df_state_points, aes(x = "", y = Total_points, fill = State)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0) +
  ggtitle("Distribution of Total points earned by players from different states") +
  labs(fill = "State") +
  scale_fill_brewer(palette = "Set1")

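# row.names = FALSE keeps the row index out of the file, so the CSV has exactly five columns
write.csv(df, "chess_ratings.csv", row.names = FALSE)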
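
Because the CSV is meant to be imported elsewhere, it can help to read it back and confirm the columns. By default `write.csv` adds the row names as an extra unnamed first column; `row.names = FALSE` suppresses it. The sketch below uses a small made-up data frame and a temporary file (`toy_df` and `out_file` are hypothetical).

```r
# Hypothetical two-row data frame standing in for the real results
toy_df <- data.frame(
  Name = c("A PLAYER", "B PLAYER"),
  State = c("MI", "ON"),
  Total_points = c(6, 5.5),
  stringsAsFactors = FALSE
)

# Write without row names, then read the file back
out_file <- tempfile(fileext = ".csv")
write.csv(toy_df, out_file, row.names = FALSE)
round_trip <- read.csv(out_file, stringsAsFactors = FALSE)

# The column names and row count should survive the round trip
print(names(round_trip))
print(nrow(round_trip))
```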

Conclusion

summary(df$Total_points)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.500   3.500   3.438   4.000   6.000

Half of the values for the “Total_points” variable fall between 2.5 and 4.0 (the first and third quartiles), with a median of 3.5. The mean of 3.438 is close to the median, which suggests that the distribution is roughly symmetric and is not dominated by extreme values.

table(df$State)

## 
## MI OH ON 
## 55  1  8

The table shows the number of players from each state or province: 55 from MI, 8 from ON, and 1 from OH. Most of the players in this dataset therefore come from Michigan, with far fewer from Ontario (a Canadian province rather than a state) and Ohio.
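
The counts can also be expressed as shares of the field. This is a small sketch that re-enters the three counts from the table above by hand (`counts` and `shares` are hypothetical names).

```r
# Player counts per state or province, copied from the table above
counts <- c(MI = 55, OH = 1, ON = 8)

# Percentage of all players, rounded to one decimal place
shares <- round(100 * counts / sum(counts), 1)
print(shares)
```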

summary(df)

##      Name              State            Total_points     Pre_rating  
##  Length:64          Length:64          Min.   :1.000   Min.   : 377  
##  Class :character   Class :character   1st Qu.:2.500   1st Qu.:1227  
##  Mode  :character   Mode  :character   Median :3.500   Median :1407  
##                                        Mean   :3.438   Mean   :1378  
##                                        3rd Qu.:4.000   3rd Qu.:1583  
##                                        Max.   :6.000   Max.   :1794  
##  Avg_pre_chess_rating_of_opponents
##  Min.   :1107                     
##  1st Qu.:1310                     
##  Median :1382                     
##  Mean   :1379                     
##  3rd Qu.:1481                     
##  Max.   :1605