In this project, you’re given a text file with chess tournament results where the information has some structure. Your job is to create an R Markdown file that generates a .CSV file with the following information for all of the players:: Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents
First of all, we can go to the url website of the original data: https://bbhosted.cuny.edu/bbcswebdav/pid-25842547-dt-content-rid-119781398_1/courses/SPS01_DATA_607_01_1169_1/SPS01_DATA_607_01_1169_1_ImportedContent_20160815114002/SPS01_DATA_607_01_1162_1_ImportedContent_20160126062155/tournamentinfo.txt
This website can really give us an idea about how this dataset is structured. It seems to me everything is seperated by an “|”. Therefore, the first step I am trying to do is to load the “stringr” package which I am going to use for regular expression. I loaded the raw data from my local machine and set the seperator to be “|”.
library(stringr)
raw_data <- read.csv("C:/Users/blin261/Downloads/tournamentinfo.txt", header = FALSE, stringsAsFactors = FALSE, sep = "|")
The following codes just help me to explore the raw data.
head(raw_data)
## V1
## 1 -----------------------------------------------------------------------------------------
## 2 Pair
## 3 Num
## 4 -----------------------------------------------------------------------------------------
## 5 1
## 6 ON
## V2 V3 V4 V5 V6 V7 V8
## 1
## 2 Player Name Total Round Round Round Round Round
## 3 USCF ID / Rtg (Pre->Post) Pts 1 2 3 4 5
## 4
## 5 GARY HUA 6.0 W 39 W 21 W 18 W 14 W 7
## 6 15445895 / R: 1794 ->1817 N:2 W B W B W
## V9 V10 V11
## 1 NA
## 2 Round Round NA
## 3 6 7 NA
## 4 NA
## 5 D 12 D 4 NA
## 6 B W NA
str(raw_data)
## 'data.frame': 196 obs. of 11 variables:
## $ V1 : chr "-----------------------------------------------------------------------------------------" " Pair " " Num " "-----------------------------------------------------------------------------------------" ...
## $ V2 : chr "" " Player Name " " USCF ID / Rtg (Pre->Post) " "" ...
## $ V3 : chr "" "Total" " Pts " "" ...
## $ V4 : chr "" "Round" " 1 " "" ...
## $ V5 : chr "" "Round" " 2 " "" ...
## $ V6 : chr "" "Round" " 3 " "" ...
## $ V7 : chr "" "Round" " 4 " "" ...
## $ V8 : chr "" "Round" " 5 " "" ...
## $ V9 : chr "" "Round" " 6 " "" ...
## $ V10: chr "" "Round" " 7 " "" ...
## $ V11: logi NA NA NA NA NA NA ...
class(raw_data)
## [1] "data.frame"
rownames(raw_data)
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11"
## [12] "12" "13" "14" "15" "16" "17" "18" "19" "20" "21" "22"
## [23] "23" "24" "25" "26" "27" "28" "29" "30" "31" "32" "33"
## [34] "34" "35" "36" "37" "38" "39" "40" "41" "42" "43" "44"
## [45] "45" "46" "47" "48" "49" "50" "51" "52" "53" "54" "55"
## [56] "56" "57" "58" "59" "60" "61" "62" "63" "64" "65" "66"
## [67] "67" "68" "69" "70" "71" "72" "73" "74" "75" "76" "77"
## [78] "78" "79" "80" "81" "82" "83" "84" "85" "86" "87" "88"
## [89] "89" "90" "91" "92" "93" "94" "95" "96" "97" "98" "99"
## [100] "100" "101" "102" "103" "104" "105" "106" "107" "108" "109" "110"
## [111] "111" "112" "113" "114" "115" "116" "117" "118" "119" "120" "121"
## [122] "122" "123" "124" "125" "126" "127" "128" "129" "130" "131" "132"
## [133] "133" "134" "135" "136" "137" "138" "139" "140" "141" "142" "143"
## [144] "144" "145" "146" "147" "148" "149" "150" "151" "152" "153" "154"
## [155] "155" "156" "157" "158" "159" "160" "161" "162" "163" "164" "165"
## [166] "166" "167" "168" "169" "170" "171" "172" "173" "174" "175" "176"
## [177] "177" "178" "179" "180" "181" "182" "183" "184" "185" "186" "187"
## [188] "188" "189" "190" "191" "192" "193" "194" "195" "196"
colnames(raw_data)
## [1] "V1" "V2" "V3" "V4" "V5" "V6" "V7" "V8" "V9" "V10" "V11"
This code really helps me visualize the raw data, because the raw data contains lots of “——”, which really causes a lot of confusion. More importantly, this subset tells me which variables are located at which specific rows.
raw_data[-seq(1,196,3), 1:10]
## V1 V2 V3 V4 V5 V6 V7
## 2 Pair Player Name Total Round Round Round Round
## 3 Num USCF ID / Rtg (Pre->Post) Pts 1 2 3 4
## 5 1 GARY HUA 6.0 W 39 W 21 W 18 W 14
## 6 ON 15445895 / R: 1794 ->1817 N:2 W B W B
## 8 2 DAKSHESH DARURI 6.0 W 63 W 58 L 4 W 17
## 9 MI 14598900 / R: 1553 ->1663 N:2 B W B W
## 11 3 ADITYA BAJAJ 6.0 L 8 W 61 W 25 W 21
## 12 MI 14959604 / R: 1384 ->1640 N:2 W B W B
## 14 4 PATRICK H SCHILLING 5.5 W 23 D 28 W 2 W 26
## 15 MI 12616049 / R: 1716 ->1744 N:2 W B W B
## 17 5 HANSHI ZUO 5.5 W 45 W 37 D 12 D 13
## 18 MI 14601533 / R: 1655 ->1690 N:2 B W B W
## 20 6 HANSEN SONG 5.0 W 34 D 29 L 11 W 35
## 21 OH 15055204 / R: 1686 ->1687 N:3 W B W B
## 23 7 GARY DEE SWATHELL 5.0 W 57 W 46 W 13 W 11
## 24 MI 11146376 / R: 1649 ->1673 N:3 W B W B
## 26 8 EZEKIEL HOUGHTON 5.0 W 3 W 32 L 14 L 9
## 27 MI 15142253 / R: 1641P17->1657P24 N:3 B W B W
## 29 9 STEFANO LEE 5.0 W 25 L 18 W 59 W 8
## 30 ON 14954524 / R: 1411 ->1564 N:2 W B W B
## 32 10 ANVIT RAO 5.0 D 16 L 19 W 55 W 31
## 33 MI 14150362 / R: 1365 ->1544 N:3 W W B B
## 35 11 CAMERON WILLIAM MC LEMAN 4.5 D 38 W 56 W 6 L 7
## 36 MI 12581589 / R: 1712 ->1696 N:3 B W B W
## 38 12 KENNETH J TACK 4.5 W 42 W 33 D 5 W 38
## 39 MI 12681257 / R: 1663 ->1670 N:3 W B W B
## 41 13 TORRANCE HENRY JR 4.5 W 36 W 27 L 7 D 5
## 42 MI 15082995 / R: 1666 ->1662 N:3 B W B B
## 44 14 BRADLEY SHAW 4.5 W 54 W 44 W 8 L 1
## 45 MI 10131499 / R: 1610 ->1618 N:3 W B W W
## 47 15 ZACHARY JAMES HOUGHTON 4.5 D 19 L 16 W 30 L 22
## 48 MI 15619130 / R: 1220P13->1416P20 N:3 B B W W
## 50 16 MIKE NIKITIN 4.0 D 10 W 15 H W 39
## 51 MI 10295068 / R: 1604 ->1613 N:3 B W B
## 53 17 RONALD GRZEGORCZYK 4.0 W 48 W 41 L 26 L 2
## 54 MI 10297702 / R: 1629 ->1610 N:3 W B W B
## 56 18 DAVID SUNDEEN 4.0 W 47 W 9 L 1 W 32
## 57 MI 11342094 / R: 1600 ->1600 N:3 B W B W
## 59 19 DIPANKAR ROY 4.0 D 15 W 10 W 52 D 28
## 60 MI 14862333 / R: 1564 ->1570 N:3 W B W B
## 62 20 JASON ZHENG 4.0 L 40 W 49 W 23 W 41
## 63 MI 14529060 / R: 1595 ->1569 N:4 W B W B
## 65 21 DINH DANG BUI 4.0 W 43 L 1 W 47 L 3
## 66 ON 15495066 / R: 1563P22->1562 N:3 B W B W
## 68 22 EUGENE L MCCLURE 4.0 W 64 D 52 L 28 W 15
## 69 MI 12405534 / R: 1555 ->1529 N:4 W B W B
## 71 23 ALAN BUI 4.0 L 4 W 43 L 20 W 58
## 72 ON 15030142 / R: 1363 ->1371 B W B W
## 74 24 MICHAEL R ALDRICH 4.0 L 28 L 47 W 43 L 25
## 75 MI 13469010 / R: 1229 ->1300 N:4 B W B B
## 77 25 LOREN SCHWIEBERT 3.5 L 9 W 53 L 3 W 24
## 78 MI 12486656 / R: 1745 ->1681 N:4 B W B W
## 80 26 MAX ZHU 3.5 W 49 W 40 W 17 L 4
## 81 ON 15131520 / R: 1579 ->1564 N:4 B W B W
## 83 27 GAURAV GIDWANI 3.5 W 51 L 13 W 46 W 37
## 84 MI 14476567 / R: 1552 ->1539 N:4 W B W B
## 86 28 SOFIA ADINA STANESCU-BELLU 3.5 W 24 D 4 W 22 D 19
## 87 MI 14882954 / R: 1507 ->1513 N:3 W W B W
## 89 29 CHIEDOZIE OKORIE 3.5 W 50 D 6 L 38 L 34
## 90 MI 15323285 / R: 1602P6 ->1508P12 N:4 B W B W
## 92 30 GEORGE AVERY JONES 3.5 L 52 D 64 L 15 W 55
## 93 ON 12577178 / R: 1522 ->1444 W B B W
## 95 31 RISHI SHETTY 3.5 L 58 D 55 W 64 L 10
## 96 MI 15131618 / R: 1494 ->1444 B W B W
## 98 32 JOSHUA PHILIP MATHEWS 3.5 W 61 L 8 W 44 L 18
## 99 ON 14073750 / R: 1441 ->1433 N:4 W B W B
## 101 33 JADE GE 3.5 W 60 L 12 W 50 D 36
## 102 MI 14691842 / R: 1449 ->1421 B W B W
## 104 34 MICHAEL JEFFERY THOMAS 3.5 L 6 W 60 L 37 W 29
## 105 MI 15051807 / R: 1399 ->1400 B W B B
## 107 35 JOSHUA DAVID LEE 3.5 L 46 L 38 W 56 L 6
## 108 MI 14601397 / R: 1438 ->1392 W W B W
## 110 36 SIDDHARTH JHA 3.5 L 13 W 57 W 51 D 33
## 111 MI 14773163 / R: 1355 ->1367 N:4 W B W B
## 113 37 AMIYATOSH PWNANANDAM 3.5 B L 5 W 34 L 27
## 114 MI 15489571 / R: 980P12->1077P17 B W W
## 116 38 BRIAN LIU 3.0 D 11 W 35 W 29 L 12
## 117 MI 15108523 / R: 1423 ->1439 N:4 W B W W
## 119 39 JOEL R HENDON 3.0 L 1 W 54 W 40 L 16
## 120 MI 12923035 / R: 1436P23->1413 N:4 B W B W
## 122 40 FOREST ZHANG 3.0 W 20 L 26 L 39 W 59
## 123 MI 14892710 / R: 1348 ->1346 B B W W
## 125 41 KYLE WILLIAM MURPHY 3.0 W 59 L 17 W 58 L 20
## 126 MI 15761443 / R: 1403P5 ->1341P9 B W B W
## 128 42 JARED GE 3.0 L 12 L 50 L 57 D 60
## 129 MI 14462326 / R: 1332 ->1256 B W B B
## 131 43 ROBERT GLEN VASEY 3.0 L 21 L 23 L 24 W 63
## 132 MI 14101068 / R: 1283 ->1244 W B W W
## 134 44 JUSTIN D SCHILLING 3.0 B L 14 L 32 W 53
## 135 MI 15323504 / R: 1199 ->1199 W B B
## 137 45 DEREK YAN 3.0 L 5 L 51 D 60 L 56
## 138 MI 15372807 / R: 1242 ->1191 W B W B
## 140 46 JACOB ALEXANDER LAVALLEY 3.0 W 35 L 7 L 27 L 50
## 141 MI 15490981 / R: 377P3 ->1076P10 B W B W
## 143 47 ERIC WRIGHT 2.5 L 18 W 24 L 21 W 61
## 144 MI 12533115 / R: 1362 ->1341 W B W B
## 146 48 DANIEL KHAIN 2.5 L 17 W 63 H D 52
## 147 MI 14369165 / R: 1382 ->1335 B W B
## 149 49 MICHAEL J MARTIN 2.5 L 26 L 20 D 63 D 64
## 150 MI 12531685 / R: 1291P12->1259P17 W W B W
## 152 50 SHIVAM JHA 2.5 L 29 W 42 L 33 W 46
## 153 MI 14773178 / R: 1056 ->1111 W B W B
## 155 51 TEJAS AYYAGARI 2.5 L 27 W 45 L 36 W 57
## 156 MI 15205474 / R: 1011 ->1097 B W B W
## 158 52 ETHAN GUO 2.5 W 30 D 22 L 19 D 48
## 159 MI 14918803 / R: 935 ->1092 N:4 B W B W
## 161 53 JOSE C YBARRA 2.0 H L 25 H L 44
## 162 MI 12578849 / R: 1393 ->1359 B W
## 164 54 LARRY HODGE 2.0 L 14 L 39 L 61 B
## 165 MI 12836773 / R: 1270 ->1200 B B W
## 167 55 ALEX KONG 2.0 L 62 D 31 L 10 L 30
## 168 MI 15412571 / R: 1186 ->1163 W B W B
## 170 56 MARISA RICCI 2.0 H L 11 L 35 W 45
## 171 MI 14679887 / R: 1153 ->1140 B W W
## 173 57 MICHAEL LU 2.0 L 7 L 36 W 42 L 51
## 174 MI 15113330 / R: 1092 ->1079 B W W B
## 176 58 VIRAJ MOHILE 2.0 W 31 L 2 L 41 L 23
## 177 MI 14700365 / R: 917 -> 941 W B W B
## 179 59 SEAN M MC CORMICK 2.0 L 41 B L 9 L 40
## 180 MI 12841036 / R: 853 -> 878 W B B
## 182 60 JULIA SHEN 1.5 L 33 L 34 D 45 D 42
## 183 MI 14579262 / R: 967 -> 984 W B B W
## 185 61 JEZZEL FARKAS 1.5 L 32 L 3 W 54 L 47
## 186 ON 15771592 / R: 955P11-> 979P18 B W B W
## 188 62 ASHWIN BALAJI 1.0 W 55 U U U
## 189 MI 15219542 / R: 1530 ->1535 B
## 191 63 THOMAS JOSEPH HOSMER 1.0 L 2 L 48 D 49 L 43
## 192 MI 15057092 / R: 1175 ->1125 W B W B
## 194 64 BEN LI 1.0 L 22 D 30 L 31 D 49
## 195 MI 15006561 / R: 1163 ->1112 B W W B
## V8 V9 V10
## 2 Round Round Round
## 3 5 6 7
## 5 W 7 D 12 D 4
## 6 W B W
## 8 W 16 W 20 W 7
## 9 B W B
## 11 W 11 W 13 W 12
## 12 W B W
## 14 D 5 W 19 D 1
## 15 W B B
## 17 D 4 W 14 W 17
## 18 B W B
## 20 D 10 W 27 W 21
## 21 B W B
## 23 L 1 W 9 L 2
## 24 B W W
## 26 W 47 W 28 W 19
## 27 B W W
## 29 W 26 L 7 W 20
## 30 W B B
## 32 D 6 W 25 W 18
## 33 W B W
## 35 L 3 W 34 W 26
## 36 B W B
## 38 H D 1 L 3
## 39 W B
## 41 W 33 L 3 W 32
## 42 W W B
## 44 D 27 L 5 W 31
## 45 B B W
## 47 W 54 W 33 W 38
## 48 B B W
## 50 L 2 W 36 U
## 51 W B
## 53 W 23 W 22 L 5
## 54 W B W
## 56 L 19 W 38 L 10
## 57 B W B
## 59 W 18 L 4 L 8
## 60 W W B
## 62 W 28 L 2 L 9
## 63 W B W
## 65 W 40 W 39 L 6
## 66 W B W
## 68 H L 17 W 40
## 69 W B
## 71 L 17 W 37 W 46
## 72 B W B
## 74 W 60 W 44 W 39
## 75 W W B
## 77 D 34 L 10 W 47
## 78 B W B
## 80 L 9 D 32 L 11
## 81 B W W
## 83 D 14 L 6 U
## 84 W B
## 86 L 20 L 8 D 36
## 87 B B W
## 89 W 52 W 48 U
## 90 W B
## 92 L 31 W 61 W 50
## 93 W B B
## 95 W 30 W 50 L 14
## 96 B W B
## 98 W 51 D 26 L 13
## 99 W B W
## 101 L 13 L 15 W 51
## 102 B W B
## 104 D 25 L 11 W 52
## 105 W B W
## 107 W 57 D 52 W 48
## 108 B B W
## 110 H L 16 D 28
## 111 W B
## 113 H L 23 W 61
## 114 B W
## 116 H L 18 L 15
## 117 B B
## 119 W 44 L 21 L 24
## 120 B W W
## 122 L 21 W 56 L 22
## 123 B W W
## 125 X U U
## 126
## 128 D 61 W 64 W 56
## 129 W W B
## 131 W 59 L 46 W 55
## 132 B B W
## 134 L 39 L 24 W 59
## 135 W B W
## 137 W 63 D 55 W 58
## 138 W B W
## 140 W 64 W 43 L 23
## 141 B W W
## 143 L 8 D 51 L 25
## 144 W B W
## 146 H L 29 L 35
## 147 W B
## 149 W 58 H U
## 150 B
## 152 H L 31 L 30
## 153 B W
## 155 L 32 D 47 L 33
## 156 B W W
## 158 L 29 D 35 L 34
## 159 B W B
## 161 U W 57 U
## 162 W
## 164 L 15 L 59 W 64
## 165 W B W
## 167 B D 45 L 43
## 168 W B
## 170 H L 40 L 42
## 171 B W
## 173 L 35 L 53 B
## 174 W B
## 176 L 49 B L 45
## 177 W B
## 179 L 43 W 54 L 44
## 180 W W B
## 182 L 24 H U
## 183 B
## 185 D 42 L 30 L 37
## 186 B W B
## 188 U U U
## 189
## 191 L 45 H U
## 192 B
## 194 L 46 L 42 L 54
## 195 W B B
Then I am going to extract the information I need and create vectors with appropriate data types.
player_number <- as.numeric(raw_data[seq(5,195,3), "V1"])
head(player_number)
## [1] 1 2 3 4 5 6
player_name <- raw_data[seq(5,195,3),"V2"]
player_name <- str_trim(player_name)
head(player_name)
## [1] "GARY HUA" "DAKSHESH DARURI" "ADITYA BAJAJ"
## [4] "PATRICK H SCHILLING" "HANSHI ZUO" "HANSEN SONG"
player_state <- raw_data[seq(6,195,3), "V1"]
player_state <- str_trim(player_state)
head(player_state)
## [1] "ON" "MI" "MI" "MI" "MI" "OH"
total_number_of_points <- as.numeric(raw_data[seq(5,195,3), "V3"])
head(total_number_of_points)
## [1] 6.0 6.0 6.0 5.5 5.5 5.0
I used two steps of regular expression techniques to draw the pre-rating data from the raw data
pre_rating <- raw_data[seq(6,195,3), "V2"]
head(pre_rating)
## [1] " 15445895 / R: 1794 ->1817 " " 14598900 / R: 1553 ->1663 "
## [3] " 14959604 / R: 1384 ->1640 " " 12616049 / R: 1716 ->1744 "
## [5] " 14601533 / R: 1655 ->1690 " " 15055204 / R: 1686 ->1687 "
for (i in 1:length(pre_rating))
{
#I want any string that starts with "R" followed by a space and followed by more than 1 digits.
pre_rating[i] <- str_extract(pre_rating[i], "R:[:space:]+[:digit:]+")
#After I got the string containing pre-rating, I get rid of the "R" and space, then convert these digits from string into numeric values.
pre_rating[i] <- str_extract(pre_rating[i], "[[:digit:]]+")
}
head(pre_rating)
## [1] "1794" "1553" "1384" "1716" "1655" "1686"
The last variable I want is the average of opponents’ rating for each individual players. The first step invloves getting those opponents’ player numbers, which can be found on columns 4 to 10. I only need to extract strings that refer to numeric values from these columns. After I got those strings, I convert them into numeric values and store them into a variable called opponent_number. Then I found their corresponding pre-rating values, use the mean function to get the average of these opponents’ pre-rating values.
average_opponent_point <- vector()
for (i in seq(5,195,3))
{
{
opponent_number<- as.numeric(unlist(str_extract_all(raw_data[i, 4:10], "[[:digit:]]+")))
average_opponent_point[(i-2)/3] <- mean(as.numeric(pre_rating[opponent_number]), na.rm = TRUE)
}
}
average_opponent_point
## [1] 1605.286 1469.286 1563.571 1573.571 1500.857 1518.714 1372.143
## [8] 1468.429 1523.143 1554.143 1467.571 1506.167 1497.857 1515.000
## [15] 1483.857 1385.800 1498.571 1480.000 1426.286 1410.857 1470.429
## [22] 1300.333 1213.857 1357.000 1363.286 1506.857 1221.667 1522.143
## [29] 1313.500 1144.143 1259.857 1378.714 1276.857 1375.286 1149.714
## [36] 1388.167 1384.800 1539.167 1429.571 1390.571 1248.500 1149.857
## [43] 1106.571 1327.000 1152.000 1357.714 1392.000 1355.800 1285.800
## [50] 1296.000 1356.143 1494.571 1345.333 1206.167 1406.000 1414.400
## [57] 1363.000 1391.000 1319.000 1330.200 1327.286 1186.000 1350.200
## [64] 1263.000
Then I put all the vectors together and create a data frame called chess_tournament.
chess_tournament <- data.frame(player_number, player_name, player_state, total_number_of_points, pre_rating, average_opponent_point)
head(chess_tournament)
## player_number player_name player_state total_number_of_points
## 1 1 GARY HUA ON 6.0
## 2 2 DAKSHESH DARURI MI 6.0
## 3 3 ADITYA BAJAJ MI 6.0
## 4 4 PATRICK H SCHILLING MI 5.5
## 5 5 HANSHI ZUO MI 5.5
## 6 6 HANSEN SONG OH 5.0
## pre_rating average_opponent_point
## 1 1794 1605.286
## 2 1553 1469.286
## 3 1384 1563.571
## 4 1716 1573.571
## 5 1655 1500.857
## 6 1686 1518.714
In the end, I export the output into a csv file in my working directory.
write.csv(chess_tournament, file = "chess_tournament_result", row.names=FALSE)