This Ex is part of course: Data science: Wrangling The web page using for practice is: https://web.archive.org/web/20181024132313/http://www.stevetheump.com/Payrolls.htm
Using package: rvest for web crawling websites
library(rvest)
## Loading required package: xml2
library(tidyverse)
## -- Attaching packages ------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.0 v purrr 0.3.4
## v tibble 3.0.0 v dplyr 0.8.5
## v tidyr 1.0.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts ---------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x readr::guess_encoding() masks rvest::guess_encoding()
## x dplyr::lag() masks stats::lag()
## x purrr::pluck() masks rvest::pluck()
url_using <- "https://web.archive.org/web/20181024132313/http://www.stevetheump.com/Payrolls.htm"
target <- read_html(url_using)
The tables in html are associated with the table node. Use the html_nodes() function and the table node type to extract the first table
nodes <- html_nodes(target, "table")
The html_nodes() function returns a list of objects of class xml_node. We can see the content of each one using, for example, the html_text() function
html_text(nodes[[1]])
## [1] "\n\n\nSalary Stats 1967-2019\nTop ML Player Salaries / Baseball's Luxury Tax\n"
If the content of this object is an html table, we can use the html_table() function to convert it to a data frame
html_table(nodes[[1]])
## X1 X2
## 1 NA Salary Stats 1967-2019\nTop ML Player Salaries / Baseball's Luxury Tax
Using lappy to apply a function over a list or vector. To get more information, using ?lapply()
Try both lapply(), and sapply() to see the diff
lapply(nodes[1:4], html_table)
## [[1]]
## X1 X2
## 1 NA Salary Stats 1967-2019\nTop ML Player Salaries / Baseball's Luxury Tax
##
## [[2]]
## RANK TEAM Payroll
## 1 1 Boston Red Sox $235.65M
## 2 2 San Francisco Giants $208.51M
## 3 3 Los Angeles Dodgers $186.14M
## 4 4 Chicago Cubs $183.46M
## 5 5 Washington Nationals $181.59M
## 6 6 Los Angeles Angels $175.1M
## 7 7 New York Yankees $168.54M
## 8 8 Seattle Mariners $162.48M
## 9 9 Toronto Blue Jays $162.316M
## 10 10 St. Louis Cardinals $161.01M
## 11 11 Houston Astros $160.04M
## 12 12 New York Mets $154.61M
## 13 13 Texas Rangers $144.0M
## 14 14 Baltimore Orioles $143.09M
## 15 15 Colorado Rockies $141.34M
## 16 16 Cleveland Indians $134.35M
## 17 17 Arizona Diamondbacks $132.5M
## 18 18 Minnesota Twins $131.91M
## 19 19 Detroit Tigers $129.92M
## 20 20 Kansas City Royals $129.92M
## 21 21 Atlanta Braves $120.54M
## 22 22 Cincinnati Reds $101.19M
## 23 23 Miami Marlins $98.64M
## 24 24 Philadelphia Phillies $96.85M
## 25 25 San Diego Padres $96.13M
## 26 26 Milwaukee Brewers $90.24M
## 27 27 Pittsburgh Pirates $87.88M
## 28 28 Tampa Bay Rays $78.73M
## 29 29 Chicago White Sox $72.18M
## 30 30 Oakland Athletics $68.53M
##
## [[3]]
## X1 X2 X3 X4 X5
## 1 Rank Team 25 Man Disabled List Total Payroll
## 2 1 Los Angeles Dodgers $155,887,854 $37,354,166 $242,065,828
## 3 2 New York Yankees $168,045,699 $5,644,000 $201,539,699
## 4 3 Boston Red Sox $136,780,500 $38,239,250 $199,805,178
## 5 4 Detroit Tigers $168,500,600 $11,750,000 $199,750,600
## 6 5 Toronto Blue Jays $159,175,968 $2,169,400 $177,795,368
## 7 6 Texas Rangers $115,162,703 $39,136,360 $175,909,063
## 8 7 San Francisco Giants $169,504,611 $2,500,000 $172,354,611
## 9 8 Chicago Cubs $170,189,880 $2,000,000 $172,189,880
## 10 9 Washington Nationals $163,111,918 $535,000 $167,846,918
## 11 10 Baltimore Orioles $142,066,615 $19,501,668 $163,676,616
## 12 11 Los Angeles Angels of Anaheim $116,844,833 $17,120,500 $160,375,333
## 13 12 New York Mets $120,870,470 $26,141,990 $155,187,460
## 14 13 Seattle Mariners $139,257,018 $15,007,300 $154,800,918
## 15 14 St. Louis Cardinals $136,181,533 $13,521,400 $152,452,933
## 16 15 Kansas City Royals $127,333,150 $4,092,100 $140,925,250
## 17 16 Colorado Rockies $86,909,571 $14,454,000 $130,963,571
## 18 17 Cleveland Indians $101,105,399 $14,005,766 $124,861,165
## 19 18 Houston Astros $117,957,800 $4,386,100 $124,343,900
## 20 19 Atlanta Braves $103,303,791 $8,927,500 $112,437,541
## 21 20 Miami Marlins $96,446,100 $15,035,000 $111,881,100
## 22 21 Philadelphia Phillies $86,841,000 $537,000 $111,378,000
## 23 22 Minnesota Twins $92,592,500 $8,735,000 $108,077,500
## 24 23 Pittsburgh Pirates $92,362,832 - $100,575,946
## 25 24 Chicago White Sox $95,625,000 $1,671,000 $99,119,770
## 26 25 Cincinnati Reds $53,858,785 $26,910,000 $93,768,785
## 27 26 Arizona Diamondbacks $91,481,600 $1,626,000 $93,257,600
## 28 27 Oakland Athletics $64,339,166 $5,732,500 $81,738,333
## 29 28 San Diego Padres $29,628,400 $4,946,000 $71,624,400
## 30 29 Tampa Bay Rays $55,282,232 $14,680,300 $69,962,532
## 31 30 Milwaukee Brewers $50,023,900 $13,037,400 $63,061,300
##
## [[4]]
## Rank Team Opening Day Avg Salary Median
## 1 1 Dodgers $ 223,352,402 $ 7,445,080 $ 5,166,666
## 2 2 Yankees $ 213,472,857 $ 7,361,133 $ 3,300,000
## 3 3 Red Sox $ 182,161,414 $ 6,072,047 $ 3,500,000
## 4 4 Tigers $ 172,282,250 $ 6,891,290 $ 3,000,000
## 5 5 Giants $ 166,495,942 $ 5,946,284 $ 4,000,000
## 6 6 Nationals $ 166,010,977 $ 5,724,516 $ 2,500,000
## 7 7 Angels $ 146,449,583 $ 5,049,986 $ 1,312,500
## 8 8 Rangers $ 144,307,373 $ 4,509,605 $ 937,500
## 9 9 Phillies $ 133,048,000 $ 4,434,933 $ 700,000
## 10 10 Blue Jays $ 126,369,628 $ 4,357,573 $ 1,650,000
## 11 11 Mariners $ 122,706,842 $ 4,719,494 $ 2,252,500
## 12 12 Cardinals $ 120,301,957 $ 4,455,628 $ 2,000,000
## 13 13 Reds $ 116,732,284 $ 4,323,418 $ 2,350,000
## 14 14 Cubs $ 116,654,522 $ 4,166,233 $ 2,515,000
## 15 15 Orioles $ 115,587,632 $ 3,985,780 $ 2,750,000
## 16 16 Royals $ 112,914,525 $ 4,032,662 $ 2,532,500
## 17 17 Padres $ 112,895,700 $ 4,342,142 $ 763,500
## 18 18 Twins $ 108,262,000 $ 4,163,923 $ 1,775,000
## 19 19 Mets $ 99,626,453 $ 3,558,088 $ 669,562
## 20 20 White Sox $ 98,712,867 $ 3,525,460 $ 1,250,000
## 21 21 Brewers $ 98,683,035 $ 3,795,501 $ 529,750
## 22 22 Rockies $ 98,261,171 $ 3,388,316 $ 1,087,600
## 23 23 Braves $ 87,622,648 $ 2,920,755 $ 1,333,333
## 24 24 Indians $ 86,339,067 $ 3,197,743 $ 1,940,000
## 25 25 Pirates $ 85,885,832 $ 2,862,861 $ 1,279,166
## 26 26 Marlins $ 84,637,500 $ 3,134,722 $ 1,925,000
## 27 27 Athletics $ 80,279,166 $ 2,508,724 $ 648,750
## 28 28 Rays $ 73,649,584 $ 2,454,986 $ 750,000
## 29 29 Diamondbacks $ 70,762,833 $ 2,358,761 $ 663,000
## 30 30 Astros $ 69,064,200 $ 2,466,579 $ 1,031,250
sapply(nodes[1:4], html_table)
## [[1]]
## X1 X2
## 1 NA Salary Stats 1967-2019\nTop ML Player Salaries / Baseball's Luxury Tax
##
## [[2]]
## RANK TEAM Payroll
## 1 1 Boston Red Sox $235.65M
## 2 2 San Francisco Giants $208.51M
## 3 3 Los Angeles Dodgers $186.14M
## 4 4 Chicago Cubs $183.46M
## 5 5 Washington Nationals $181.59M
## 6 6 Los Angeles Angels $175.1M
## 7 7 New York Yankees $168.54M
## 8 8 Seattle Mariners $162.48M
## 9 9 Toronto Blue Jays $162.316M
## 10 10 St. Louis Cardinals $161.01M
## 11 11 Houston Astros $160.04M
## 12 12 New York Mets $154.61M
## 13 13 Texas Rangers $144.0M
## 14 14 Baltimore Orioles $143.09M
## 15 15 Colorado Rockies $141.34M
## 16 16 Cleveland Indians $134.35M
## 17 17 Arizona Diamondbacks $132.5M
## 18 18 Minnesota Twins $131.91M
## 19 19 Detroit Tigers $129.92M
## 20 20 Kansas City Royals $129.92M
## 21 21 Atlanta Braves $120.54M
## 22 22 Cincinnati Reds $101.19M
## 23 23 Miami Marlins $98.64M
## 24 24 Philadelphia Phillies $96.85M
## 25 25 San Diego Padres $96.13M
## 26 26 Milwaukee Brewers $90.24M
## 27 27 Pittsburgh Pirates $87.88M
## 28 28 Tampa Bay Rays $78.73M
## 29 29 Chicago White Sox $72.18M
## 30 30 Oakland Athletics $68.53M
##
## [[3]]
## X1 X2 X3 X4 X5
## 1 Rank Team 25 Man Disabled List Total Payroll
## 2 1 Los Angeles Dodgers $155,887,854 $37,354,166 $242,065,828
## 3 2 New York Yankees $168,045,699 $5,644,000 $201,539,699
## 4 3 Boston Red Sox $136,780,500 $38,239,250 $199,805,178
## 5 4 Detroit Tigers $168,500,600 $11,750,000 $199,750,600
## 6 5 Toronto Blue Jays $159,175,968 $2,169,400 $177,795,368
## 7 6 Texas Rangers $115,162,703 $39,136,360 $175,909,063
## 8 7 San Francisco Giants $169,504,611 $2,500,000 $172,354,611
## 9 8 Chicago Cubs $170,189,880 $2,000,000 $172,189,880
## 10 9 Washington Nationals $163,111,918 $535,000 $167,846,918
## 11 10 Baltimore Orioles $142,066,615 $19,501,668 $163,676,616
## 12 11 Los Angeles Angels of Anaheim $116,844,833 $17,120,500 $160,375,333
## 13 12 New York Mets $120,870,470 $26,141,990 $155,187,460
## 14 13 Seattle Mariners $139,257,018 $15,007,300 $154,800,918
## 15 14 St. Louis Cardinals $136,181,533 $13,521,400 $152,452,933
## 16 15 Kansas City Royals $127,333,150 $4,092,100 $140,925,250
## 17 16 Colorado Rockies $86,909,571 $14,454,000 $130,963,571
## 18 17 Cleveland Indians $101,105,399 $14,005,766 $124,861,165
## 19 18 Houston Astros $117,957,800 $4,386,100 $124,343,900
## 20 19 Atlanta Braves $103,303,791 $8,927,500 $112,437,541
## 21 20 Miami Marlins $96,446,100 $15,035,000 $111,881,100
## 22 21 Philadelphia Phillies $86,841,000 $537,000 $111,378,000
## 23 22 Minnesota Twins $92,592,500 $8,735,000 $108,077,500
## 24 23 Pittsburgh Pirates $92,362,832 - $100,575,946
## 25 24 Chicago White Sox $95,625,000 $1,671,000 $99,119,770
## 26 25 Cincinnati Reds $53,858,785 $26,910,000 $93,768,785
## 27 26 Arizona Diamondbacks $91,481,600 $1,626,000 $93,257,600
## 28 27 Oakland Athletics $64,339,166 $5,732,500 $81,738,333
## 29 28 San Diego Padres $29,628,400 $4,946,000 $71,624,400
## 30 29 Tampa Bay Rays $55,282,232 $14,680,300 $69,962,532
## 31 30 Milwaukee Brewers $50,023,900 $13,037,400 $63,061,300
##
## [[4]]
## Rank Team Opening Day Avg Salary Median
## 1 1 Dodgers $ 223,352,402 $ 7,445,080 $ 5,166,666
## 2 2 Yankees $ 213,472,857 $ 7,361,133 $ 3,300,000
## 3 3 Red Sox $ 182,161,414 $ 6,072,047 $ 3,500,000
## 4 4 Tigers $ 172,282,250 $ 6,891,290 $ 3,000,000
## 5 5 Giants $ 166,495,942 $ 5,946,284 $ 4,000,000
## 6 6 Nationals $ 166,010,977 $ 5,724,516 $ 2,500,000
## 7 7 Angels $ 146,449,583 $ 5,049,986 $ 1,312,500
## 8 8 Rangers $ 144,307,373 $ 4,509,605 $ 937,500
## 9 9 Phillies $ 133,048,000 $ 4,434,933 $ 700,000
## 10 10 Blue Jays $ 126,369,628 $ 4,357,573 $ 1,650,000
## 11 11 Mariners $ 122,706,842 $ 4,719,494 $ 2,252,500
## 12 12 Cardinals $ 120,301,957 $ 4,455,628 $ 2,000,000
## 13 13 Reds $ 116,732,284 $ 4,323,418 $ 2,350,000
## 14 14 Cubs $ 116,654,522 $ 4,166,233 $ 2,515,000
## 15 15 Orioles $ 115,587,632 $ 3,985,780 $ 2,750,000
## 16 16 Royals $ 112,914,525 $ 4,032,662 $ 2,532,500
## 17 17 Padres $ 112,895,700 $ 4,342,142 $ 763,500
## 18 18 Twins $ 108,262,000 $ 4,163,923 $ 1,775,000
## 19 19 Mets $ 99,626,453 $ 3,558,088 $ 669,562
## 20 20 White Sox $ 98,712,867 $ 3,525,460 $ 1,250,000
## 21 21 Brewers $ 98,683,035 $ 3,795,501 $ 529,750
## 22 22 Rockies $ 98,261,171 $ 3,388,316 $ 1,087,600
## 23 23 Braves $ 87,622,648 $ 2,920,755 $ 1,333,333
## 24 24 Indians $ 86,339,067 $ 3,197,743 $ 1,940,000
## 25 25 Pirates $ 85,885,832 $ 2,862,861 $ 1,279,166
## 26 26 Marlins $ 84,637,500 $ 3,134,722 $ 1,925,000
## 27 27 Athletics $ 80,279,166 $ 2,508,724 $ 648,750
## 28 28 Rays $ 73,649,584 $ 2,454,986 $ 750,000
## 29 29 Diamondbacks $ 70,762,833 $ 2,358,761 $ 663,000
## 30 30 Astros $ 69,064,200 $ 2,466,579 $ 1,031,250
First using length(nodes) to know the get the length of nodes. then using code html_table() to get answer or can use lapply()
length_n <- length(nodes)
html_table(nodes[length_n])
## [[1]]
## X1 X2 X3 X4
## 1 Year Minimum Average % Chg
## 2 2019 $555,000 -
## 3 2018 $545,000 $4,520,000
## 4 2017 $535,000 $4,470,000 5.4
## 5 2016 $507,500 $4,400,000 -
## 6 2015 $507,500 $4,250,000 -
## 7 2014 $507,500 $3,820,000 12.8
## 8 2013 $480,000 $3,386,212 5.4
## 9 2012 $480,000 $3,440,000 3.8
## 10 2011 $414,500 $3,305,393 0.2
## 11 2010 $400,000 $3,297,828 1.8
## 12 2009 $400,000 $3,240,206 2.7
## 13 2008 $390,000 $3,150,000 7.1
## 14 2007 $380,000 $2,820,000 4.6
## 15 2006 $327,000 $2,699,292 9
## 16 2005 $316,000 $2,632,655 5.9
## 17 2004 $300,000 $2,486,609 (-2.7)
## 18 2003 $300,000 $2,555,416 7.2
## 19 2002 $200,000 $2,340,920 5.2
## 20 2001 $200,000 $2,138,896 13.9
## 21 2000 $200,000 $1,895,630 15.6
## 22 1999 $200,000 $1,611,166 19.3
## 23 1998 $170,000 $1,398,831 4.2
## 24 1997 $150,000 $1,336,609 17.6
## 25 1996 $122,667 $1,119,981 9.9
## 26 1995 $109,000 $1,110,766 (-9.9)
## 27 1994 $109,000 $1,168,263 6.1
## 28 1993 $109,000 $1,076,089 3.3
## 29 1992 $109,000 $1,028,667 21.7
## 30 1991 $100,000 $851,492 53.9
## 31 1990 $100,000 $597,537 12.9
## 32 1989 $68,000 $497,254
## 33 1988 $62,500 $438,729
## 34 1987 $62,500 $412,454
## 35 1986 $60,000 $412,520
## 36 1985 $60,000 $371,571
## 37 1984 $40,000 $329,408
## 38 1983 $35,000 $289,194
## 39 1982 $33,500 $241,497
## 40 1981 $32,500 $185,651
## 41 1980 $30,000 $143,756
## 42 1979 $21,000 $113,558
## 43 1978 $21,000 $99,876
## 44 1977 $19,000 $76,066
## 45 1976 $19,000 $51,501
## 46 1975 $16,000 $44,676
## 47 1974 $15,000 $40,839
## 48 1973 $15,000 $36,566
## 49 1972 $13,500 $34,092
## 50 1971 $12,750 $31,543
## 51 1970 $12,000 $29,303
## 52 1969 $10,000 $24,909
## 53 1968 $10,000 N/A
## 54 1967 $6,000 $19,000
lapply(nodes[length_n - 1], html_table)
## [[1]]
## X1 X2 X3
## 1 Team Payroll Average
## 2 NY Yankees $92,538,260 $3,190,974
## 3 Los Angeles $88,124,286 $3,263,862
## 4 Atlanta $84,537,836 $2,817,928
## 5 Baltimore $81,447,435 $2,808,532
## 6 Arizona $81,027,833 $2,893,851
## 7 NY Mets $79,509,776 $3,180,391
## 8 Boston $77,940,333 $2,598,011
## 9 Cleveland $75,880,871 $2,918,495
## 10 Texas $70,795,921 $2,722,920
## 11 Tampa Bay $62,765,129 $2,024,682
## 12 St. Louis $61,453,863 $2,276,069
## 13 Colorado $61,111,190 $2,182,543
## 14 Chicago Cubs $60,539,333 $2,017,978
## 15 Seattle $58,915,000 $2,265,962
## 16 Detroit $58,265,167 $2,157,969
## 17 San Diego $54,821,000 $1,827,367
## 18 San Francisco $53,737,826 $2,066,839
## 19 Anaheim $51,464,167 $1,715,472
## 20 Houston $51,289,111 $1,899,597
## 21 Philadelphia $47,308,000 $1,631,310
## 22 Cincinnati $46,867,200 $1,735,822
## 23 Toronto $46,238,333 $1,778,397
## 24 Milwaukee $36,505,333 $1,140,792
## 25 Montreal $34,807,833 $1,200,270
## 26 Oakland $31,971,333 $1,184,123
## 27 Chicago White Sox $31,133,500 $1,073,569
## 28 Pittsburgh $28,928,333 $1,112,628
## 29 Kansas City $23,433,000 $836,893
## 30 Florida $20,072,000 $692,138
## 31 Minnesota $16,519,500 $635,365
if using sapply() the result will different from lapply()
sapply(nodes[length_n - 1], html_table)
## [,1]
## X1 Character,31
## X2 Character,31
## X3 Character,31
tab_1 <- html_table(nodes[10])
tab_1 <- as.data.frame(tab_1)
tab_1 <- tab_1[-1, -1]
tab_2 <- html_table(nodes[19])
tab_2 <- as.data.frame(tab_2)
tab_2 <- tab_2[-1,]
names(tab_1) <- c("Team", "Payroll", "Average")
names(tab_2) <- c("Team", "Payroll", "Average")
full_join(tab_1,tab_2, by = "Team") %>%
nrow()
## [1] 58
url_new <- "https://en.wikipedia.org/w/index.php?title=Opinion_polling_for_the_United_Kingdom_European_Union_membership_referendum&oldid=896735054"
How many tables are in this Wikipedia page?
new_target <- read_html(url_new)
tab <- html_nodes(new_target, "table")
length(tab)
## [1] 40
tab[[5]] %>% html_table(fill = TRUE) %>%
names()
## [1] "Date(s) conducted" "Remain" "Leave"
## [4] "Undecided" "Lead" "Sample"
## [7] "Conducted by" "Polling type" "Notes"
Instead of TRUE, use FALSE to see result, by default fill is FALSE