First chunk loaded the necessary libraries.
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.0.4 v stringr 1.4.0
## v tidyr 1.1.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Used the read_delim function to import the tournament info first. At first, I had difficulty importing the text files without adjusting the attributes. The header within the text file did not allow read_delim to break up the text into its appropriate column. By using the skip attribute I was able to skip the header and resolve this issue. Next after reading through the r documentation I made use of the comment attribute. This attribute allows you to designate a character and assign that text as a comment. When first attempting to used this attribute, I used ‘-’ which got rid of all the dash lines. Unexpected outcome was that it also removed ‘->’ in the ratings column. In order to resolve this I adjusted the attribute to read double dashes.
df<-read_delim("https://raw.githubusercontent.com/engine2031/Data-607/main/tournamentinfo.txt",delim="|",
col_names = c("Player State", "Player Name", "Total Pts",
"X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8", "X9"),
skip=4, comment = "--")
##
## -- Column specification --------------------------------------------------------
## cols(
## `Player State` = col_character(),
## `Player Name` = col_character(),
## `Total Pts` = col_character(),
## X1 = col_character(),
## X2 = col_character(),
## X3 = col_character(),
## X4 = col_character(),
## X5 = col_character(),
## X6 = col_character(),
## X7 = col_character(),
## X8 = col_logical(),
## X9 = col_character()
## )
## Warning: 128 parsing failures.
## row col expected actual file
## 1 -- 12 columns 11 columns 'https://raw.githubusercontent.com/engine2031/Data-607/main/tournamentinfo.txt'
## 2 -- 12 columns 11 columns 'https://raw.githubusercontent.com/engine2031/Data-607/main/tournamentinfo.txt'
## 3 -- 12 columns 11 columns 'https://raw.githubusercontent.com/engine2031/Data-607/main/tournamentinfo.txt'
## 4 -- 12 columns 11 columns 'https://raw.githubusercontent.com/engine2031/Data-607/main/tournamentinfo.txt'
## 5 -- 12 columns 11 columns 'https://raw.githubusercontent.com/engine2031/Data-607/main/tournamentinfo.txt'
## ... ... .......... .......... ...............................................................................
## See problems(...) for more details.
In this chunk I create a new data frame to only work with the columns that will be used for the final product. Additionally within the total points column I got rid of unnecessary text “N:(.)” using reg expressions.
df1 <- df %>% select(`Player State`,`Player Name`,`Total Pts`)
df1$`Total Pts`<- str_remove(df1$`Total Pts`,"N:(.)")
The column ‘Player Name’ included text having the player’s name and its rating. I separated these two types of characters using the separate function. Next for the ‘Player Name1’ and ‘Player State’ columns I get rid of the numbers.
df2 <-df1 %>% mutate(df1, separate(df1,col = "Player Name", into = c("Player Name1", "Rating"), sep="/"))
## Warning: Problem with `mutate()` input `..2`.
## i Expected 2 pieces. Missing pieces filled with `NA` in 64 rows [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, ...].
## i Input `..2` is `separate(...)`.
df2$`Player Name1` <- str_remove(df2$`Player Name1`,"\\d+")
df2$`Player State` <- str_remove(df2$`Player State`,"\\d+")
df3 <- df2 %>% select(-'Player Name')
General preperation for the final production. In this chunk I get rid of the spaces within each column.
df3$'Player Name1' <- str_trim(df3$`Player Name1`, side = c("both"))
df3$'Player State' <- str_trim(df3$`Player State`, side = c("both"))
df3$'Total Pts' <- str_trim(df3$`Total Pts`, side = c("both"))
Because of the way the initial text file was set up, there is blank data for some of the values within the columns. In order to prep the data frame to drop these empty variable, I replace the spaces with NA. My intent was to use the drop_na() function with brute force but I found that I lost data when applying the function to the entire data frame. NA values were alternating depending on which column they were in. I subsetted the columns so that the NA values were in the same row. Then finally using cbind to merge the two subsets to create the complete dataframe.
df3 <- df3 %>% na_if("")
df3_drop1 <- df3 %>% select(-'Player State',-'Rating') %>% drop_na()
df3_drop2 <- df3 %>% select(-'Total Pts',-'Player Name1') %>% drop_na()
df4 <- cbind(df3_drop1,df3_drop2)
df4 <- df4 %>% relocate('Player Name1', 'Player State', 'Total Pts', 'Rating')
I left data manipulation of the Rating column as the last big task. Using the separate() function I created two new rows to indicate PreRating and PostRating. The PreRating column is ultimately the column I wanted in my last dataframe and did some clean up on it. Using str_remove and regular expressions I removed non relevant data. In some cases the rating value had a P with some numbers following. I assumed this group of strings were errors and removed accordingly.
df5 <-df4 %>% mutate(df4, separate(df4,col = "Rating", into = c("PreRating", "PostRating"), sep="->"))
df5 <- df5 %>% select(-'Rating')
df5$'PreRating' <- str_remove(df5$'PreRating', "R: ")
df5$'PreRating' <- str_remove(df5$'PreRating', "(P)(.*)$")
df5$'PreRating' <- str_trim(df5$'PreRating', side = c("right"))
Final Product. Creating the CSV.
df6 <- df5 %>% select(-"PostRating")
df6
## Player Name1 Player State Total Pts PreRating
## 1 GARY HUA ON 6.0 1794
## 2 DAKSHESH DARURI MI 6.0 1553
## 3 ADITYA BAJAJ MI 6.0 1384
## 4 PATRICK H SCHILLING MI 5.5 1716
## 5 HANSHI ZUO MI 5.5 1655
## 6 HANSEN SONG OH 5.0 1686
## 7 GARY DEE SWATHELL MI 5.0 1649
## 8 EZEKIEL HOUGHTON MI 5.0 1641
## 9 STEFANO LEE ON 5.0 1411
## 10 ANVIT RAO MI 5.0 1365
## 11 CAMERON WILLIAM MC LEMAN MI 4.5 1712
## 12 KENNETH J TACK MI 4.5 1663
## 13 TORRANCE HENRY JR MI 4.5 1666
## 14 BRADLEY SHAW MI 4.5 1610
## 15 ZACHARY JAMES HOUGHTON MI 4.5 1220
## 16 MIKE NIKITIN MI 4.0 1604
## 17 RONALD GRZEGORCZYK MI 4.0 1629
## 18 DAVID SUNDEEN MI 4.0 1600
## 19 DIPANKAR ROY MI 4.0 1564
## 20 JASON ZHENG MI 4.0 1595
## 21 DINH DANG BUI ON 4.0 1563
## 22 EUGENE L MCCLURE MI 4.0 1555
## 23 ALAN BUI ON 4.0 1363
## 24 MICHAEL R ALDRICH MI 4.0 1229
## 25 LOREN SCHWIEBERT MI 3.5 1745
## 26 MAX ZHU ON 3.5 1579
## 27 GAURAV GIDWANI MI 3.5 1552
## 28 SOFIA ADINA STANESCU-BELLU MI 3.5 1507
## 29 CHIEDOZIE OKORIE MI 3.5 1602
## 30 GEORGE AVERY JONES ON 3.5 1522
## 31 RISHI SHETTY MI 3.5 1494
## 32 JOSHUA PHILIP MATHEWS ON 3.5 1441
## 33 JADE GE MI 3.5 1449
## 34 MICHAEL JEFFERY THOMAS MI 3.5 1399
## 35 JOSHUA DAVID LEE MI 3.5 1438
## 36 SIDDHARTH JHA MI 3.5 1355
## 37 AMIYATOSH PWNANANDAM MI 3.5 980
## 38 BRIAN LIU MI 3.0 1423
## 39 JOEL R HENDON MI 3.0 1436
## 40 FOREST ZHANG MI 3.0 1348
## 41 KYLE WILLIAM MURPHY MI 3.0 1403
## 42 JARED GE MI 3.0 1332
## 43 ROBERT GLEN VASEY MI 3.0 1283
## 44 JUSTIN D SCHILLING MI 3.0 1199
## 45 DEREK YAN MI 3.0 1242
## 46 JACOB ALEXANDER LAVALLEY MI 3.0 377
## 47 ERIC WRIGHT MI 2.5 1362
## 48 DANIEL KHAIN MI 2.5 1382
## 49 MICHAEL J MARTIN MI 2.5 1291
## 50 SHIVAM JHA MI 2.5 1056
## 51 TEJAS AYYAGARI MI 2.5 1011
## 52 ETHAN GUO MI 2.5 935
## 53 JOSE C YBARRA MI 2.0 1393
## 54 LARRY HODGE MI 2.0 1270
## 55 ALEX KONG MI 2.0 1186
## 56 MARISA RICCI MI 2.0 1153
## 57 MICHAEL LU MI 2.0 1092
## 58 VIRAJ MOHILE MI 2.0 917
## 59 SEAN M MC CORMICK MI 2.0 853
## 60 JULIA SHEN MI 1.5 967
## 61 JEZZEL FARKAS ON 1.5 955
## 62 ASHWIN BALAJI MI 1.0 1530
## 63 THOMAS JOSEPH HOSMER MI 1.0 1175
## 64 BEN LI MI 1.0 1163
write.csv(df6, "Project1_ER.csv", row.names = TRUE)