Data 607_Project 1

First chunk loaded the necessary libraries.

library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.4     v stringr 1.4.0
## v tidyr   1.1.2     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Used the read_delim function to import the tournament info first. At first, I had difficulty importing the text files without adjusting the attributes. The header within the text file did not allow read_delim to break up the text into its appropriate column. By using the skip attribute I was able to skip the header and resolve this issue. Next after reading through the r documentation I made use of the comment attribute. This attribute allows you to designate a character and assign that text as a comment. When first attempting to used this attribute, I used ‘-’ which got rid of all the dash lines. Unexpected outcome was that it also removed ‘->’ in the ratings column. In order to resolve this I adjusted the attribute to read double dashes.

df<-read_delim("https://raw.githubusercontent.com/engine2031/Data-607/main/tournamentinfo.txt",delim="|", 
               col_names = c("Player State", "Player Name", "Total Pts", 
                             "X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8", "X9"),
               skip=4, comment = "--")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   `Player State` = col_character(),
##   `Player Name` = col_character(),
##   `Total Pts` = col_character(),
##   X1 = col_character(),
##   X2 = col_character(),
##   X3 = col_character(),
##   X4 = col_character(),
##   X5 = col_character(),
##   X6 = col_character(),
##   X7 = col_character(),
##   X8 = col_logical(),
##   X9 = col_character()
## )

## Warning: 128 parsing failures.
## row col   expected     actual                                                                            file
##   1  -- 12 columns 11 columns 'https://raw.githubusercontent.com/engine2031/Data-607/main/tournamentinfo.txt'
##   2  -- 12 columns 11 columns 'https://raw.githubusercontent.com/engine2031/Data-607/main/tournamentinfo.txt'
##   3  -- 12 columns 11 columns 'https://raw.githubusercontent.com/engine2031/Data-607/main/tournamentinfo.txt'
##   4  -- 12 columns 11 columns 'https://raw.githubusercontent.com/engine2031/Data-607/main/tournamentinfo.txt'
##   5  -- 12 columns 11 columns 'https://raw.githubusercontent.com/engine2031/Data-607/main/tournamentinfo.txt'
## ... ... .......... .......... ...............................................................................
## See problems(...) for more details.

In this chunk I create a new data frame to only work with the columns that will be used for the final product. Additionally within the total points column I got rid of unnecessary text “N:(.)” using reg expressions.

df1 <- df %>% select(`Player State`,`Player Name`,`Total Pts`)
df1$`Total Pts`<- str_remove(df1$`Total Pts`,"N:(.)")

The column ‘Player Name’ included text having the player’s name and its rating. I separated these two types of characters using the separate function. Next for the ‘Player Name1’ and ‘Player State’ columns I get rid of the numbers.

df2 <-df1 %>% mutate(df1, separate(df1,col = "Player Name", into = c("Player Name1", "Rating"), sep="/"))

## Warning: Problem with `mutate()` input `..2`.
## i Expected 2 pieces. Missing pieces filled with `NA` in 64 rows [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, ...].
## i Input `..2` is `separate(...)`.

df2$`Player Name1` <- str_remove(df2$`Player Name1`,"\\d+")
df2$`Player State` <- str_remove(df2$`Player State`,"\\d+")

df3 <- df2 %>% select(-'Player Name')

General preperation for the final production. In this chunk I get rid of the spaces within each column.

df3$'Player Name1' <- str_trim(df3$`Player Name1`, side = c("both"))
df3$'Player State' <- str_trim(df3$`Player State`, side = c("both"))
df3$'Total Pts' <- str_trim(df3$`Total Pts`, side = c("both"))

Because of the way the initial text file was set up, there is blank data for some of the values within the columns. In order to prep the data frame to drop these empty variable, I replace the spaces with NA. My intent was to use the drop_na() function with brute force but I found that I lost data when applying the function to the entire data frame. NA values were alternating depending on which column they were in. I subsetted the columns so that the NA values were in the same row. Then finally using cbind to merge the two subsets to create the complete dataframe.

df3 <- df3 %>% na_if("")
df3_drop1 <- df3 %>% select(-'Player State',-'Rating') %>% drop_na()


df3_drop2 <- df3 %>% select(-'Total Pts',-'Player Name1') %>% drop_na()


df4 <- cbind(df3_drop1,df3_drop2)
df4 <- df4 %>% relocate('Player Name1', 'Player State', 'Total Pts', 'Rating')

I left data manipulation of the Rating column as the last big task. Using the separate() function I created two new rows to indicate PreRating and PostRating. The PreRating column is ultimately the column I wanted in my last dataframe and did some clean up on it. Using str_remove and regular expressions I removed non relevant data. In some cases the rating value had a P with some numbers following. I assumed this group of strings were errors and removed accordingly.

df5 <-df4 %>% mutate(df4, separate(df4,col = "Rating", into = c("PreRating", "PostRating"), sep="->"))


df5 <- df5 %>% select(-'Rating')
df5$'PreRating' <- str_remove(df5$'PreRating', "R: ")
df5$'PreRating' <- str_remove(df5$'PreRating', "(P)(.*)$")

df5$'PreRating' <- str_trim(df5$'PreRating', side = c("right"))

Final Product. Creating the CSV.

df6 <- df5 %>% select(-"PostRating")
df6

##                  Player Name1 Player State Total Pts PreRating
## 1                    GARY HUA           ON       6.0      1794
## 2             DAKSHESH DARURI           MI       6.0      1553
## 3                ADITYA BAJAJ           MI       6.0      1384
## 4         PATRICK H SCHILLING           MI       5.5      1716
## 5                  HANSHI ZUO           MI       5.5      1655
## 6                 HANSEN SONG           OH       5.0      1686
## 7           GARY DEE SWATHELL           MI       5.0      1649
## 8            EZEKIEL HOUGHTON           MI       5.0      1641
## 9                 STEFANO LEE           ON       5.0      1411
## 10                  ANVIT RAO           MI       5.0      1365
## 11   CAMERON WILLIAM MC LEMAN           MI       4.5      1712
## 12             KENNETH J TACK           MI       4.5      1663
## 13          TORRANCE HENRY JR           MI       4.5      1666
## 14               BRADLEY SHAW           MI       4.5      1610
## 15     ZACHARY JAMES HOUGHTON           MI       4.5      1220
## 16               MIKE NIKITIN           MI       4.0      1604
## 17         RONALD GRZEGORCZYK           MI       4.0      1629
## 18              DAVID SUNDEEN           MI       4.0      1600
## 19               DIPANKAR ROY           MI       4.0      1564
## 20                JASON ZHENG           MI       4.0      1595
## 21              DINH DANG BUI           ON       4.0      1563
## 22           EUGENE L MCCLURE           MI       4.0      1555
## 23                   ALAN BUI           ON       4.0      1363
## 24          MICHAEL R ALDRICH           MI       4.0      1229
## 25           LOREN SCHWIEBERT           MI       3.5      1745
## 26                    MAX ZHU           ON       3.5      1579
## 27             GAURAV GIDWANI           MI       3.5      1552
## 28 SOFIA ADINA STANESCU-BELLU           MI       3.5      1507
## 29           CHIEDOZIE OKORIE           MI       3.5      1602
## 30         GEORGE AVERY JONES           ON       3.5      1522
## 31               RISHI SHETTY           MI       3.5      1494
## 32      JOSHUA PHILIP MATHEWS           ON       3.5      1441
## 33                    JADE GE           MI       3.5      1449
## 34     MICHAEL JEFFERY THOMAS           MI       3.5      1399
## 35           JOSHUA DAVID LEE           MI       3.5      1438
## 36              SIDDHARTH JHA           MI       3.5      1355
## 37       AMIYATOSH PWNANANDAM           MI       3.5       980
## 38                  BRIAN LIU           MI       3.0      1423
## 39              JOEL R HENDON           MI       3.0      1436
## 40               FOREST ZHANG           MI       3.0      1348
## 41        KYLE WILLIAM MURPHY           MI       3.0      1403
## 42                   JARED GE           MI       3.0      1332
## 43          ROBERT GLEN VASEY           MI       3.0      1283
## 44         JUSTIN D SCHILLING           MI       3.0      1199
## 45                  DEREK YAN           MI       3.0      1242
## 46   JACOB ALEXANDER LAVALLEY           MI       3.0       377
## 47                ERIC WRIGHT           MI       2.5      1362
## 48               DANIEL KHAIN           MI       2.5      1382
## 49           MICHAEL J MARTIN           MI       2.5      1291
## 50                 SHIVAM JHA           MI       2.5      1056
## 51             TEJAS AYYAGARI           MI       2.5      1011
## 52                  ETHAN GUO           MI       2.5       935
## 53              JOSE C YBARRA           MI       2.0      1393
## 54                LARRY HODGE           MI       2.0      1270
## 55                  ALEX KONG           MI       2.0      1186
## 56               MARISA RICCI           MI       2.0      1153
## 57                 MICHAEL LU           MI       2.0      1092
## 58               VIRAJ MOHILE           MI       2.0       917
## 59          SEAN M MC CORMICK           MI       2.0       853
## 60                 JULIA SHEN           MI       1.5       967
## 61              JEZZEL FARKAS           ON       1.5       955
## 62              ASHWIN BALAJI           MI       1.0      1530
## 63       THOMAS JOSEPH HOSMER           MI       1.0      1175
## 64                     BEN LI           MI       1.0      1163

write.csv(df6, "Project1_ER.csv", row.names = TRUE)

Data 607_Project 1_ER

Euclides Rodriguez

2/28/2021