Copyright @ Sya’roni @ Prof.Dr. Suhartono M.Kom @ Magister Informatika @ UIN Maulana Malik Ibrahim @ UIN Malang

9/6/2021

Load the rvest library and read the HTML of the target web page.

library(rvest)
## Loading required package: xml2
url <- "https://www.bbc.com/sport/football/premier-league/table"
html <- url %>% read_html

Using the browser's Inspect Element tool, we find that the CSS selector for the element holding the league table is .gs-o-table.

<table class="gs-o-table" data-reactid=".98jiqbvx9q.2.0.0.0.0.1.$competition-table-0">

Using the html_node and html_table functions, we obtain a data frame without having to parse the HTML by hand.

epl_table <- html %>%
    html_node(".gs-o-table") %>%
    html_table

str(epl_table)
## 'data.frame':    21 obs. of  12 variables:
##  $     : chr  "1" "2" "3" "4" ...
##  $     : chr  "team hasn't moved" "team hasn't moved" "team hasn't moved" "team hasn't moved" ...
##  $ Team: chr  "Man City" "Man Utd" "Liverpool" "Chelsea" ...
##  $ P   : chr  "38" "38" "38" "38" ...
##  $ W   : chr  "27" "21" "20" "19" ...
##  $ D   : chr  "5" "11" "9" "10" ...
##  $ L   : chr  "6" "6" "9" "9" ...
##  $ F   : chr  "83" "73" "68" "58" ...
##  $ A   : chr  "32" "44" "42" "36" ...
##  $ GD  : chr  "51" "29" "26" "22" ...
##  $ Pts : chr  "86" "74" "69" "67" ...
##  $ Form: chr  "WWon 2 - 0 against Crystal Palace on May 1st 2021.LLost 1 - 2 against Chelsea on May 8th 2021.WWon 4 - 3 agains"| __truncated__ "WWon 3 - 1 against Aston Villa on May 9th 2021.LLost 1 - 2 against Leicester City on May 11th 2021.LLost 2 - 4 "| __truncated__ "WWon 2 - 0 against Southampton on May 8th 2021.WWon 4 - 2 against Manchester United on May 13th 2021.WWon 2 - 1"| __truncated__ "WWon 2 - 0 against Fulham on May 1st 2021.WWon 2 - 1 against Manchester City on May 8th 2021.LLost 0 - 1 agains"| __truncated__ ...
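
Because the live page may change or be unavailable, the same select-then-convert step can be tried offline. The following is a minimal sketch, assuming a made-up two-row table (the `snippet` string and the name `toy_table` are hypothetical and not part of the original analysis); only the class name .gs-o-table is taken from the page above.

```r
library(rvest)

# Hypothetical inline HTML standing in for the real BBC page
snippet <- '<html><body>
<table class="gs-o-table">
  <tr><th>Team</th><th>P</th><th>Pts</th></tr>
  <tr><td>Man City</td><td>38</td><td>86</td></tr>
  <tr><td>Man Utd</td><td>38</td><td>74</td></tr>
</table>
</body></html>'

# Select the table by its CSS class, then convert it to a data frame
toy_table <- read_html(snippet) %>%
    html_node(".gs-o-table") %>%
    html_table()

print(toy_table)
```

The header row is taken from the th cells, so the columns can be addressed by name, for example toy_table$Team.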

Scraping and parsing the HTML table are now done.

The next step is to clean the data. First, remove the columns and rows that are not needed. If you print the data frame, you will see that the first two columns and the last row carry no league-table data and can be dropped. Remove the first two columns:

epl_table[1:2] <- list(NULL)

Remove the last row (row 21):

epl_table <- epl_table[-21,]
str(epl_table)
## 'data.frame':    20 obs. of  10 variables:
##  $ Team: chr  "Man City" "Man Utd" "Liverpool" "Chelsea" ...
##  $ P   : chr  "38" "38" "38" "38" ...
##  $ W   : chr  "27" "21" "20" "19" ...
##  $ D   : chr  "5" "11" "9" "10" ...
##  $ L   : chr  "6" "6" "9" "9" ...
##  $ F   : chr  "83" "73" "68" "58" ...
##  $ A   : chr  "32" "44" "42" "36" ...
##  $ GD  : chr  "51" "29" "26" "22" ...
##  $ Pts : chr  "86" "74" "69" "67" ...
##  $ Form: chr  "WWon 2 - 0 against Crystal Palace on May 1st 2021.LLost 1 - 2 against Chelsea on May 8th 2021.WWon 4 - 3 agains"| __truncated__ "WWon 3 - 1 against Aston Villa on May 9th 2021.LLost 1 - 2 against Leicester City on May 11th 2021.LLost 2 - 4 "| __truncated__ "WWon 2 - 0 against Southampton on May 8th 2021.WWon 4 - 2 against Manchester United on May 13th 2021.WWon 2 - 1"| __truncated__ "WWon 2 - 0 against Fulham on May 1st 2021.WWon 2 - 1 against Manchester City on May 8th 2021.LLost 0 - 1 agains"| __truncated__ ...
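
The hard-coded index 21 only works while the table has exactly 21 rows. As a hedged variation, the sketch below drops the last row by position using nrow, on a made-up stand-in data frame (`toy` and its contents are hypothetical). It also shows one optional extra step that the walkthrough above does not perform: converting a chr column to integer.

```r
# Hypothetical stand-in for the scraped table:
# two unneeded leading columns and one footer row
toy <- data.frame(
    rank = c("1", "2", "footer"),
    move = c("team hasn't moved", "team hasn't moved", "footer"),
    Team = c("Man City", "Man Utd", "footer"),
    Pts  = c("86", "74", "footer"),
    stringsAsFactors = FALSE
)

toy[1:2] <- list(NULL)          # drop the first two columns
toy <- toy[-nrow(toy), ]        # drop the last row, whatever its index
toy$Pts <- as.integer(toy$Pts)  # optional: chr to integer (not done in the original)

str(toy)
```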

Second, reformat the Form column. Besides the one-letter abbreviation and the result (Won, Drew, or Lost), each entry in the Form column also records the opponent, the score, and the date of the match.

epl_table$Form[1]
## [1] "WWon 2 - 0 against Crystal Palace on May 1st 2021.LLost 1 - 2 against Chelsea on May 8th 2021.WWon 4 - 3 against Newcastle United on May 14th 2021.LLost 2 - 3 against Brighton & Hove Albion on May 18th 2021.WWon 5 - 0 against Everton on May 23rd 2021."

We will turn this into W,L,W,L,W.

For this purpose, we will use several functions from the stringr package.

library(stringr)

First, we extract the text WWon, DDrew, or LLost. Use the str_extract_all function with the regular expression "WWon|DDrew|LLost". The | symbol means "or".

extract_form <- function(form){
    str_extract_all(form, "WWon|DDrew|LLost")
}

form <- sapply(epl_table$Form, extract_form, USE.NAMES = FALSE)
str(form)
## List of 20
##  $ : chr [1:5] "WWon" "LLost" "WWon" "LLost" ...
##  $ : chr [1:5] "WWon" "LLost" "LLost" "DDrew" ...
##  $ : chr [1:5] "WWon" "WWon" "WWon" "WWon" ...
##  $ : chr [1:5] "WWon" "WWon" "LLost" "WWon" ...
##  $ : chr [1:5] "DDrew" "LLost" "WWon" "LLost" ...
##  $ : chr [1:5] "WWon" "LLost" "DDrew" "WWon" ...
##  $ : chr [1:5] "WWon" "LLost" "WWon" "LLost" ...
##  $ : chr [1:5] "WWon" "WWon" "WWon" "WWon" ...
##  $ : chr [1:5] "LLost" "WWon" "WWon" "WWon" ...
##  $ : chr [1:5] "WWon" "DDrew" "LLost" "WWon" ...
##  $ : chr [1:5] "LLost" "DDrew" "LLost" "WWon" ...
##  $ : chr [1:5] "LLost" "WWon" "LLost" "WWon" ...
##  $ : chr [1:5] "DDrew" "WWon" "LLost" "LLost" ...
##  $ : chr [1:5] "WWon" "LLost" "WWon" "LLost" ...
##  $ : chr [1:5] "LLost" "WWon" "WWon" "LLost" ...
##  $ : chr [1:5] "WWon" "LLost" "DDrew" "WWon" ...
##  $ : chr [1:5] "LLost" "WWon" "LLost" "LLost" ...
##  $ : chr [1:5] "LLost" "LLost" "LLost" "DDrew" ...
##  $ : chr [1:5] "DDrew" "LLost" "LLost" "LLost" ...
##  $ : chr [1:5] "LLost" "LLost" "WWon" "LLost" ...
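
To see what the alternation pattern does on a single value, here is a small sketch on a shortened, made-up Form entry (the string `entry` and the opponents A, B, C are hypothetical). Note that str_extract_all returns a list with one element per input string, which is why the sapply call above yields a list of 20 character vectors.

```r
library(stringr)

# Shortened, made-up Form entry in the same style as the scraped data
entry <- "WWon 2 - 0 against A on May 1st 2021.DDrew 1 - 1 against B on May 8th 2021.LLost 0 - 1 against C on May 14th 2021."

# Every occurrence of any of the three alternatives is returned, in order
matches <- str_extract_all(entry, "WWon|DDrew|LLost")
str(matches)
```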

Next, within each list element, extract the single letter W, D, or L and join the letters with a comma as the delimiter. The letter extraction uses the str_extract function.

simply_form <- function(form){
    form %>%
        str_extract("W|D|L") %>%
        paste(collapse = ",")
}

form <- sapply(form, simply_form)
str(form)
##  chr [1:20] "W,L,W,L,W" "W,L,L,D,W" "W,W,W,W,W" "W,W,L,W,L" "D,L,W,L,L" ...
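
The two steps can also be folded into one helper. The following is only a sketch of one possible arrangement, not the method used above: the name `form_to_letters` and the sample `entries` are hypothetical, and it takes the first character with str_sub instead of a second regular expression.

```r
library(stringr)

# Hypothetical helper combining extraction and simplification
form_to_letters <- function(form) {
    # list with one character vector of results per Form entry
    results <- str_extract_all(form, "WWon|DDrew|LLost")
    # keep the first letter of each result and join with commas
    vapply(results,
           function(x) paste(str_sub(x, 1, 1), collapse = ","),
           character(1))
}

# Made-up entries in the same style as the scraped data
entries <- c(
    "WWon 2 - 0 against A on May 1st 2021.LLost 1 - 2 against B on May 8th 2021.",
    "DDrew 1 - 1 against C on May 2nd 2021.WWon 3 - 0 against D on May 9th 2021."
)

form_to_letters(entries)
```

Using vapply with character(1) guarantees a character vector with one string per entry, so the result can be assigned directly to a data frame column.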

Update the Form column of the epl_table data frame with the form vector.

epl_table$Form <- form

The result:

print(epl_table)
##              Team  P  W  D  L  F  A  GD Pts      Form
## 1        Man City 38 27  5  6 83 32  51  86 W,L,W,L,W
## 2         Man Utd 38 21 11  6 73 44  29  74 W,L,L,D,W
## 3       Liverpool 38 20  9  9 68 42  26  69 W,W,W,W,W
## 4         Chelsea 38 19 10  9 58 36  22  67 W,W,L,W,L
## 5       Leicester 38 20  6 12 68 50  18  66 D,L,W,L,L
## 6        West Ham 38 19  8 11 62 47  15  65 W,L,D,W,W
## 7       Tottenham 38 18  8 12 68 45  23  62 W,L,W,L,W
## 8         Arsenal 38 18  7 13 55 39  16  61 W,W,W,W,W
## 9           Leeds 38 18  5 15 62 54   8  59 L,W,W,W,W
## 10        Everton 38 17  8 13 47 48  -1  59 W,D,L,W,L
## 11    Aston Villa 38 16  7 15 55 46   9  55 L,D,L,W,W
## 12      Newcastle 38 12  9 17 46 62 -16  45 L,W,L,W,W
## 13         Wolves 38 12  9 17 36 52 -16  45 D,W,L,L,L
## 14 Crystal Palace 38 12  8 18 41 66 -25  44 W,L,W,L,L
## 15    Southampton 38 12  7 19 47 68 -21  43 L,W,W,L,L
## 16       Brighton 38  9 14 15 40 46  -6  41 W,L,D,W,L
## 17        Burnley 38 10  9 19 33 55 -22  39 L,W,L,L,L
## 18         Fulham 38  5 13 20 27 53 -26  28 L,L,L,D,L
## 19      West Brom 38  5 11 22 35 76 -41  26 D,L,L,L,L
## 20      Sheff Utd 38  7  2 29 20 63 -43  23 L,L,W,L,W