Assessment: Web Scraping

This Ex is part of course: Data science: Wrangling The web page using for practice is: https://web.archive.org/web/20181024132313/http://www.stevetheump.com/Payrolls.htm

Using package: rvest for web crawling websites

library(rvest)
## Loading required package: xml2
library(tidyverse)
## -- Attaching packages ------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.0     v purrr   0.3.4
## v tibble  3.0.0     v dplyr   0.8.5
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## -- Conflicts ---------------------------------- tidyverse_conflicts() --
## x dplyr::filter()         masks stats::filter()
## x readr::guess_encoding() masks rvest::guess_encoding()
## x dplyr::lag()            masks stats::lag()
## x purrr::pluck()          masks rvest::pluck()
url_using <- "https://web.archive.org/web/20181024132313/http://www.stevetheump.com/Payrolls.htm"
target <- read_html(url_using)

The tables in html are associated with the table node. Use the html_nodes() function and the table node type to extract the first table

nodes <- html_nodes(target, "table")

The html_nodes() function returns a list of objects of class xml_node. We can see the content of each one using, for example, the html_text() function

html_text(nodes[[1]])
## [1] "\n\n\nSalary Stats 1967-2019\nTop ML Player Salaries / Baseball's Luxury Tax\n"

If the content of this object is an html table, we can use the html_table() function to convert it to a data frame

html_table(nodes[[1]])
##   X1                                                                     X2
## 1 NA Salary Stats 1967-2019\nTop ML Player Salaries / Baseball's Luxury Tax

Question 1: Many tables on this page are team payroll tables, with columns for rank, team, and one or more money values. Convert the first for tables in nodes to data frames and inspect them.

Using lappy to apply a function over a list or vector. To get more information, using ?lapply() Try both lapply(), and sapply() to see the diff

lapply(nodes[1:4], html_table)
## [[1]]
##   X1                                                                     X2
## 1 NA Salary Stats 1967-2019\nTop ML Player Salaries / Baseball's Luxury Tax
## 
## [[2]]
##    RANK                  TEAM   Payroll
## 1     1        Boston Red Sox  $235.65M
## 2     2  San Francisco Giants  $208.51M
## 3     3   Los Angeles Dodgers  $186.14M
## 4     4          Chicago Cubs  $183.46M
## 5     5  Washington Nationals  $181.59M
## 6     6    Los Angeles Angels   $175.1M
## 7     7      New York Yankees  $168.54M
## 8     8      Seattle Mariners  $162.48M
## 9     9     Toronto Blue Jays $162.316M
## 10   10   St. Louis Cardinals  $161.01M
## 11   11        Houston Astros  $160.04M
## 12   12         New York Mets  $154.61M
## 13   13         Texas Rangers   $144.0M
## 14   14     Baltimore Orioles  $143.09M
## 15   15      Colorado Rockies  $141.34M
## 16   16     Cleveland Indians  $134.35M
## 17   17  Arizona Diamondbacks   $132.5M
## 18   18       Minnesota Twins  $131.91M
## 19   19        Detroit Tigers  $129.92M
## 20   20    Kansas City Royals  $129.92M
## 21   21        Atlanta Braves  $120.54M
## 22   22       Cincinnati Reds  $101.19M
## 23   23         Miami Marlins   $98.64M
## 24   24 Philadelphia Phillies   $96.85M
## 25   25      San Diego Padres   $96.13M
## 26   26     Milwaukee Brewers   $90.24M
## 27   27    Pittsburgh Pirates   $87.88M
## 28   28        Tampa Bay Rays   $78.73M
## 29   29     Chicago White Sox   $72.18M
## 30   30     Oakland Athletics   $68.53M
## 
## [[3]]
##      X1                            X2           X3            X4            X5
## 1  Rank                          Team       25 Man Disabled List Total Payroll
## 2     1           Los Angeles Dodgers $155,887,854   $37,354,166  $242,065,828
## 3     2              New York Yankees $168,045,699    $5,644,000  $201,539,699
## 4     3                Boston Red Sox $136,780,500   $38,239,250  $199,805,178
## 5     4                Detroit Tigers $168,500,600   $11,750,000  $199,750,600
## 6     5             Toronto Blue Jays $159,175,968    $2,169,400  $177,795,368
## 7     6                 Texas Rangers $115,162,703   $39,136,360  $175,909,063
## 8     7          San Francisco Giants $169,504,611    $2,500,000  $172,354,611
## 9     8                  Chicago Cubs $170,189,880    $2,000,000  $172,189,880
## 10    9          Washington Nationals $163,111,918      $535,000  $167,846,918
## 11   10             Baltimore Orioles $142,066,615   $19,501,668  $163,676,616
## 12   11 Los Angeles Angels of Anaheim $116,844,833   $17,120,500  $160,375,333
## 13   12                 New York Mets $120,870,470   $26,141,990  $155,187,460
## 14   13              Seattle Mariners $139,257,018   $15,007,300  $154,800,918
## 15   14           St. Louis Cardinals $136,181,533   $13,521,400  $152,452,933
## 16   15            Kansas City Royals $127,333,150    $4,092,100  $140,925,250
## 17   16              Colorado Rockies  $86,909,571   $14,454,000  $130,963,571
## 18   17             Cleveland Indians $101,105,399   $14,005,766  $124,861,165
## 19   18                Houston Astros $117,957,800    $4,386,100  $124,343,900
## 20   19                Atlanta Braves $103,303,791    $8,927,500  $112,437,541
## 21   20                 Miami Marlins  $96,446,100   $15,035,000  $111,881,100
## 22   21         Philadelphia Phillies  $86,841,000      $537,000  $111,378,000
## 23   22               Minnesota Twins  $92,592,500    $8,735,000  $108,077,500
## 24   23            Pittsburgh Pirates  $92,362,832             -  $100,575,946
## 25   24             Chicago White Sox  $95,625,000    $1,671,000   $99,119,770
## 26   25               Cincinnati Reds  $53,858,785   $26,910,000   $93,768,785
## 27   26          Arizona Diamondbacks  $91,481,600    $1,626,000   $93,257,600
## 28   27             Oakland Athletics  $64,339,166    $5,732,500   $81,738,333
## 29   28              San Diego Padres  $29,628,400    $4,946,000   $71,624,400
## 30   29                Tampa Bay Rays  $55,282,232   $14,680,300   $69,962,532
## 31   30             Milwaukee Brewers  $50,023,900   $13,037,400   $63,061,300
## 
## [[4]]
##    Rank         Team   Opening Day  Avg Salary      Median
## 1     1      Dodgers $ 223,352,402 $ 7,445,080 $ 5,166,666
## 2     2      Yankees $ 213,472,857 $ 7,361,133 $ 3,300,000
## 3     3      Red Sox $ 182,161,414 $ 6,072,047 $ 3,500,000
## 4     4       Tigers $ 172,282,250 $ 6,891,290 $ 3,000,000
## 5     5       Giants $ 166,495,942 $ 5,946,284 $ 4,000,000
## 6     6    Nationals $ 166,010,977 $ 5,724,516 $ 2,500,000
## 7     7       Angels $ 146,449,583 $ 5,049,986 $ 1,312,500
## 8     8      Rangers $ 144,307,373 $ 4,509,605   $ 937,500
## 9     9     Phillies $ 133,048,000 $ 4,434,933   $ 700,000
## 10   10    Blue Jays $ 126,369,628 $ 4,357,573 $ 1,650,000
## 11   11     Mariners $ 122,706,842 $ 4,719,494 $ 2,252,500
## 12   12    Cardinals $ 120,301,957 $ 4,455,628 $ 2,000,000
## 13   13         Reds $ 116,732,284 $ 4,323,418 $ 2,350,000
## 14   14         Cubs $ 116,654,522 $ 4,166,233 $ 2,515,000
## 15   15      Orioles $ 115,587,632 $ 3,985,780 $ 2,750,000
## 16   16       Royals $ 112,914,525 $ 4,032,662 $ 2,532,500
## 17   17       Padres $ 112,895,700 $ 4,342,142   $ 763,500
## 18   18        Twins $ 108,262,000 $ 4,163,923 $ 1,775,000
## 19   19         Mets  $ 99,626,453 $ 3,558,088   $ 669,562
## 20   20    White Sox  $ 98,712,867 $ 3,525,460 $ 1,250,000
## 21   21      Brewers  $ 98,683,035 $ 3,795,501   $ 529,750
## 22   22      Rockies  $ 98,261,171 $ 3,388,316 $ 1,087,600
## 23   23       Braves  $ 87,622,648 $ 2,920,755 $ 1,333,333
## 24   24      Indians  $ 86,339,067 $ 3,197,743 $ 1,940,000
## 25   25      Pirates  $ 85,885,832 $ 2,862,861 $ 1,279,166
## 26   26      Marlins  $ 84,637,500 $ 3,134,722 $ 1,925,000
## 27   27    Athletics  $ 80,279,166 $ 2,508,724   $ 648,750
## 28   28         Rays  $ 73,649,584 $ 2,454,986   $ 750,000
## 29   29 Diamondbacks  $ 70,762,833 $ 2,358,761   $ 663,000
## 30   30       Astros  $ 69,064,200 $ 2,466,579 $ 1,031,250
sapply(nodes[1:4], html_table)
## [[1]]
##   X1                                                                     X2
## 1 NA Salary Stats 1967-2019\nTop ML Player Salaries / Baseball's Luxury Tax
## 
## [[2]]
##    RANK                  TEAM   Payroll
## 1     1        Boston Red Sox  $235.65M
## 2     2  San Francisco Giants  $208.51M
## 3     3   Los Angeles Dodgers  $186.14M
## 4     4          Chicago Cubs  $183.46M
## 5     5  Washington Nationals  $181.59M
## 6     6    Los Angeles Angels   $175.1M
## 7     7      New York Yankees  $168.54M
## 8     8      Seattle Mariners  $162.48M
## 9     9     Toronto Blue Jays $162.316M
## 10   10   St. Louis Cardinals  $161.01M
## 11   11        Houston Astros  $160.04M
## 12   12         New York Mets  $154.61M
## 13   13         Texas Rangers   $144.0M
## 14   14     Baltimore Orioles  $143.09M
## 15   15      Colorado Rockies  $141.34M
## 16   16     Cleveland Indians  $134.35M
## 17   17  Arizona Diamondbacks   $132.5M
## 18   18       Minnesota Twins  $131.91M
## 19   19        Detroit Tigers  $129.92M
## 20   20    Kansas City Royals  $129.92M
## 21   21        Atlanta Braves  $120.54M
## 22   22       Cincinnati Reds  $101.19M
## 23   23         Miami Marlins   $98.64M
## 24   24 Philadelphia Phillies   $96.85M
## 25   25      San Diego Padres   $96.13M
## 26   26     Milwaukee Brewers   $90.24M
## 27   27    Pittsburgh Pirates   $87.88M
## 28   28        Tampa Bay Rays   $78.73M
## 29   29     Chicago White Sox   $72.18M
## 30   30     Oakland Athletics   $68.53M
## 
## [[3]]
##      X1                            X2           X3            X4            X5
## 1  Rank                          Team       25 Man Disabled List Total Payroll
## 2     1           Los Angeles Dodgers $155,887,854   $37,354,166  $242,065,828
## 3     2              New York Yankees $168,045,699    $5,644,000  $201,539,699
## 4     3                Boston Red Sox $136,780,500   $38,239,250  $199,805,178
## 5     4                Detroit Tigers $168,500,600   $11,750,000  $199,750,600
## 6     5             Toronto Blue Jays $159,175,968    $2,169,400  $177,795,368
## 7     6                 Texas Rangers $115,162,703   $39,136,360  $175,909,063
## 8     7          San Francisco Giants $169,504,611    $2,500,000  $172,354,611
## 9     8                  Chicago Cubs $170,189,880    $2,000,000  $172,189,880
## 10    9          Washington Nationals $163,111,918      $535,000  $167,846,918
## 11   10             Baltimore Orioles $142,066,615   $19,501,668  $163,676,616
## 12   11 Los Angeles Angels of Anaheim $116,844,833   $17,120,500  $160,375,333
## 13   12                 New York Mets $120,870,470   $26,141,990  $155,187,460
## 14   13              Seattle Mariners $139,257,018   $15,007,300  $154,800,918
## 15   14           St. Louis Cardinals $136,181,533   $13,521,400  $152,452,933
## 16   15            Kansas City Royals $127,333,150    $4,092,100  $140,925,250
## 17   16              Colorado Rockies  $86,909,571   $14,454,000  $130,963,571
## 18   17             Cleveland Indians $101,105,399   $14,005,766  $124,861,165
## 19   18                Houston Astros $117,957,800    $4,386,100  $124,343,900
## 20   19                Atlanta Braves $103,303,791    $8,927,500  $112,437,541
## 21   20                 Miami Marlins  $96,446,100   $15,035,000  $111,881,100
## 22   21         Philadelphia Phillies  $86,841,000      $537,000  $111,378,000
## 23   22               Minnesota Twins  $92,592,500    $8,735,000  $108,077,500
## 24   23            Pittsburgh Pirates  $92,362,832             -  $100,575,946
## 25   24             Chicago White Sox  $95,625,000    $1,671,000   $99,119,770
## 26   25               Cincinnati Reds  $53,858,785   $26,910,000   $93,768,785
## 27   26          Arizona Diamondbacks  $91,481,600    $1,626,000   $93,257,600
## 28   27             Oakland Athletics  $64,339,166    $5,732,500   $81,738,333
## 29   28              San Diego Padres  $29,628,400    $4,946,000   $71,624,400
## 30   29                Tampa Bay Rays  $55,282,232   $14,680,300   $69,962,532
## 31   30             Milwaukee Brewers  $50,023,900   $13,037,400   $63,061,300
## 
## [[4]]
##    Rank         Team   Opening Day  Avg Salary      Median
## 1     1      Dodgers $ 223,352,402 $ 7,445,080 $ 5,166,666
## 2     2      Yankees $ 213,472,857 $ 7,361,133 $ 3,300,000
## 3     3      Red Sox $ 182,161,414 $ 6,072,047 $ 3,500,000
## 4     4       Tigers $ 172,282,250 $ 6,891,290 $ 3,000,000
## 5     5       Giants $ 166,495,942 $ 5,946,284 $ 4,000,000
## 6     6    Nationals $ 166,010,977 $ 5,724,516 $ 2,500,000
## 7     7       Angels $ 146,449,583 $ 5,049,986 $ 1,312,500
## 8     8      Rangers $ 144,307,373 $ 4,509,605   $ 937,500
## 9     9     Phillies $ 133,048,000 $ 4,434,933   $ 700,000
## 10   10    Blue Jays $ 126,369,628 $ 4,357,573 $ 1,650,000
## 11   11     Mariners $ 122,706,842 $ 4,719,494 $ 2,252,500
## 12   12    Cardinals $ 120,301,957 $ 4,455,628 $ 2,000,000
## 13   13         Reds $ 116,732,284 $ 4,323,418 $ 2,350,000
## 14   14         Cubs $ 116,654,522 $ 4,166,233 $ 2,515,000
## 15   15      Orioles $ 115,587,632 $ 3,985,780 $ 2,750,000
## 16   16       Royals $ 112,914,525 $ 4,032,662 $ 2,532,500
## 17   17       Padres $ 112,895,700 $ 4,342,142   $ 763,500
## 18   18        Twins $ 108,262,000 $ 4,163,923 $ 1,775,000
## 19   19         Mets  $ 99,626,453 $ 3,558,088   $ 669,562
## 20   20    White Sox  $ 98,712,867 $ 3,525,460 $ 1,250,000
## 21   21      Brewers  $ 98,683,035 $ 3,795,501   $ 529,750
## 22   22      Rockies  $ 98,261,171 $ 3,388,316 $ 1,087,600
## 23   23       Braves  $ 87,622,648 $ 2,920,755 $ 1,333,333
## 24   24      Indians  $ 86,339,067 $ 3,197,743 $ 1,940,000
## 25   25      Pirates  $ 85,885,832 $ 2,862,861 $ 1,279,166
## 26   26      Marlins  $ 84,637,500 $ 3,134,722 $ 1,925,000
## 27   27    Athletics  $ 80,279,166 $ 2,508,724   $ 648,750
## 28   28         Rays  $ 73,649,584 $ 2,454,986   $ 750,000
## 29   29 Diamondbacks  $ 70,762,833 $ 2,358,761   $ 663,000
## 30   30       Astros  $ 69,064,200 $ 2,466,579 $ 1,031,250

Question 2: For the last 3 components of nodes which of the following are true?

First using length(nodes) to know the get the length of nodes. then using code html_table() to get answer or can use lapply()

length_n <- length(nodes)
html_table(nodes[length_n])
## [[1]]
##      X1       X2         X3     X4
## 1  Year  Minimum    Average  % Chg
## 2  2019 $555,000                 -
## 3  2018 $545,000 $4,520,000       
## 4  2017 $535,000 $4,470,000    5.4
## 5  2016 $507,500 $4,400,000      -
## 6  2015 $507,500 $4,250,000      -
## 7  2014 $507,500 $3,820,000   12.8
## 8  2013 $480,000 $3,386,212    5.4
## 9  2012 $480,000 $3,440,000    3.8
## 10 2011 $414,500 $3,305,393    0.2
## 11 2010 $400,000 $3,297,828    1.8
## 12 2009 $400,000 $3,240,206    2.7
## 13 2008 $390,000 $3,150,000    7.1
## 14 2007 $380,000 $2,820,000    4.6
## 15 2006 $327,000 $2,699,292      9
## 16 2005 $316,000 $2,632,655    5.9
## 17 2004 $300,000 $2,486,609 (-2.7)
## 18 2003 $300,000 $2,555,416    7.2
## 19 2002 $200,000 $2,340,920    5.2
## 20 2001 $200,000 $2,138,896   13.9
## 21 2000 $200,000 $1,895,630   15.6
## 22 1999 $200,000 $1,611,166   19.3
## 23 1998 $170,000 $1,398,831    4.2
## 24 1997 $150,000 $1,336,609   17.6
## 25 1996 $122,667 $1,119,981    9.9
## 26 1995 $109,000 $1,110,766 (-9.9)
## 27 1994 $109,000 $1,168,263    6.1
## 28 1993 $109,000 $1,076,089    3.3
## 29 1992 $109,000 $1,028,667   21.7
## 30 1991 $100,000   $851,492   53.9
## 31 1990 $100,000   $597,537   12.9
## 32 1989  $68,000   $497,254       
## 33 1988  $62,500   $438,729       
## 34 1987  $62,500   $412,454       
## 35 1986  $60,000   $412,520       
## 36 1985  $60,000   $371,571       
## 37 1984  $40,000   $329,408       
## 38 1983  $35,000   $289,194       
## 39 1982  $33,500   $241,497       
## 40 1981  $32,500   $185,651       
## 41 1980  $30,000   $143,756       
## 42 1979  $21,000   $113,558       
## 43 1978  $21,000    $99,876       
## 44 1977  $19,000    $76,066       
## 45 1976  $19,000    $51,501       
## 46 1975  $16,000    $44,676       
## 47 1974  $15,000    $40,839       
## 48 1973  $15,000    $36,566       
## 49 1972  $13,500    $34,092       
## 50 1971  $12,750    $31,543       
## 51 1970  $12,000    $29,303       
## 52 1969  $10,000    $24,909       
## 53 1968  $10,000        N/A       
## 54 1967   $6,000    $19,000
lapply(nodes[length_n - 1], html_table)
## [[1]]
##                   X1          X2         X3
## 1               Team     Payroll    Average
## 2         NY Yankees $92,538,260 $3,190,974
## 3        Los Angeles $88,124,286 $3,263,862
## 4            Atlanta $84,537,836 $2,817,928
## 5          Baltimore $81,447,435 $2,808,532
## 6            Arizona $81,027,833 $2,893,851
## 7            NY Mets $79,509,776 $3,180,391
## 8             Boston $77,940,333 $2,598,011
## 9          Cleveland $75,880,871 $2,918,495
## 10             Texas $70,795,921 $2,722,920
## 11         Tampa Bay $62,765,129 $2,024,682
## 12         St. Louis $61,453,863 $2,276,069
## 13          Colorado $61,111,190 $2,182,543
## 14      Chicago Cubs $60,539,333 $2,017,978
## 15           Seattle $58,915,000 $2,265,962
## 16           Detroit $58,265,167 $2,157,969
## 17         San Diego $54,821,000 $1,827,367
## 18     San Francisco $53,737,826 $2,066,839
## 19           Anaheim $51,464,167 $1,715,472
## 20           Houston $51,289,111 $1,899,597
## 21      Philadelphia $47,308,000 $1,631,310
## 22        Cincinnati $46,867,200 $1,735,822
## 23           Toronto $46,238,333 $1,778,397
## 24         Milwaukee $36,505,333 $1,140,792
## 25          Montreal $34,807,833 $1,200,270
## 26           Oakland $31,971,333 $1,184,123
## 27 Chicago White Sox $31,133,500 $1,073,569
## 28        Pittsburgh $28,928,333 $1,112,628
## 29       Kansas City $23,433,000   $836,893
## 30           Florida $20,072,000   $692,138
## 31         Minnesota $16,519,500   $635,365

if using sapply() the result will different from lapply()

sapply(nodes[length_n - 1], html_table)
##    [,1]        
## X1 Character,31
## X2 Character,31
## X3 Character,31

Question 3: Create a table called tab_1 using entry 10 of nodes . Create a table called tab_2 using entry 19 of nodes. Remove the extra column in tab_1, remove the first row of each dataset. and change the column names for each table to c(“Team”, “Payroll”, “Average”). Use a full_join() by the Team to combine these two tables.

tab_1 <- html_table(nodes[10])
tab_1 <- as.data.frame(tab_1)
tab_1 <- tab_1[-1, -1]
tab_2 <- html_table(nodes[19])
tab_2 <- as.data.frame(tab_2)
tab_2 <- tab_2[-1,]
names(tab_1) <- c("Team", "Payroll", "Average")
names(tab_2) <- c("Team", "Payroll", "Average")
full_join(tab_1,tab_2, by = "Team") %>% 
  nrow()
## [1] 58

Introduction: Questions 4 and 5: The Wikipedia page on opinion polling for the Brexit referendum, in which the United Kingdom voted to leave the European Union in June 2016, contains several tables. One table contains the results of all polls regarding the referendum over 2016: Use the rvest library to read the HTML from this Wikipedia page

url_new <- "https://en.wikipedia.org/w/index.php?title=Opinion_polling_for_the_United_Kingdom_European_Union_membership_referendum&oldid=896735054"

Questions 4: Assign tab to be the html nodes of the “table” class.

How many tables are in this Wikipedia page?

new_target <- read_html(url_new)
tab <- html_nodes(new_target, "table")
length(tab)
## [1] 40

Question 5: Inspect the first several html tables using html_table() with the argument fill=TRUE (you can read about this argument in the documentation). Find the first table that has 9 columns with the first column named “Date(s) conducted”.

tab[[5]] %>% html_table(fill = TRUE) %>% 
  names()
## [1] "Date(s) conducted" "Remain"            "Leave"            
## [4] "Undecided"         "Lead"              "Sample"           
## [7] "Conducted by"      "Polling type"      "Notes"

Instead of TRUE, use FALSE to see result, by default fill is FALSE