library(rvest)
## Loading required package: xml2
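# Download and parse the all-time domestic box office page
mojo=read_html("https://www.boxofficemojo.com/alltime/domestic.htm")
# Each column is the nth <td> of every table row after the header row
rank=mojo %>% html_nodes("center td tr+ tr td:nth-child(1) font") %>% html_text()
title=mojo %>% html_nodes("center td tr+ tr td:nth-child(2) font") %>% html_text()
studio=mojo %>% html_nodes("center td tr+ tr td:nth-child(3) font") %>% html_text()
gross=mojo %>% html_nodes("center td tr+ tr td:nth-child(4) font") %>% html_text()
# Strip dollar signs and commas so gross can be stored as a number
gross=as.numeric(gsub("[$,]","", gross))
year=mojo %>% html_nodes("center td tr+ tr td:nth-child(5) font") %>% html_text()
# Some years carry a "^" marker; escape the caret so it is matched literally
year=as.numeric(gsub("\\^","", year))
# Build the data frame directly: cbind() would coerce every column to character
mojodf = data.frame(rank,title,studio,gross,year, stringsAsFactors=FALSE)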
mojodf
## rank title
## 1 1 Star Wars: The Force Awakens
## 2 2 Avatar
## 3 3 Black Panther
## 4 4 Avengers: Infinity War
## 5 5 Titanic
## 6 6 Jurassic World
## 7 7 Marvel's The Avengers
## 8 8 Star Wars: The Last Jedi
## 9 9 Incredibles 2
## 10 10 The Dark Knight
## 11 11 Rogue One: A Star Wars Story
## 12 12 Beauty and the Beast (2017)
## 13 13 Finding Dory
## 14 14 Star Wars: Episode I - The Phantom Menace
## 15 15 Star Wars
## 16 16 Avengers: Age of Ultron
## 17 17 The Dark Knight Rises
## 18 18 Shrek 2
## 19 19 E.T.: The Extra-Terrestrial
## 20 20 The Hunger Games: Catching Fire
## 21 21 Pirates of the Caribbean: Dead Man's Chest
## 22 22 The Lion King
## 23 23 Toy Story 3
## 24 24 Wonder Woman
## 25 25 Jurassic World: Fallen Kingdom
## 26 26 Iron Man 3
## 27 27 Captain America: Civil War
## 28 28 The Hunger Games
## 29 29 Jumanji: Welcome to the Jungle
## 30 30 Spider-Man
## 31 31 Jurassic Park
## 32 32 Transformers: Revenge of the Fallen
## 33 33 Frozen
## 34 34 Guardians of the Galaxy Vol. 2
## 35 35 Harry Potter and the Deathly Hallows Part 2
## 36 36 Finding Nemo
## 37 37 Star Wars: Episode III - Revenge of the Sith
## 38 38 The Lord of the Rings: The Return of the King
## 39 39 Spider-Man 2
## 40 40 The Passion of the Christ
## 41 41 The Secret Life of Pets
## 42 42 Despicable Me 2
## 43 43 The Jungle Book (2016)
## 44 44 Deadpool
## 45 45 Inside Out
## 46 46 Furious 7
## 47 47 Transformers: Dark of the Moon
## 48 48 American Sniper
## 49 49 The Lord of the Rings: The Two Towers
## 50 50 Zootopia
## 51 51 The Hunger Games: Mockingjay - Part 1
## 52 52 Spider-Man 3
## 53 53 Minions
## 54 54 Spider-Man: Homecoming
## 55 55 Alice in Wonderland (2010)
## 56 56 Guardians of the Galaxy
## 57 57 Batman v Superman: Dawn of Justice
## 58 58 Forrest Gump
## 59 59 It
## 60 60 Suicide Squad
## 61 61 Shrek the Third
## 62 62 Transformers
## 63 63 Iron Man
## 64 64 Deadpool 2
## 65 65 Harry Potter and the Sorcerer's Stone
## 66 66 Indiana Jones and the Kingdom of the Crystal Skull
## 67 67 The Lord of the Rings: The Fellowship of the Ring
## 68 68 Thor: Ragnarok
## 69 69 Iron Man 2
## 70 70 Star Wars: Episode II - Attack of the Clones
## 71 71 Pirates of the Caribbean: At World's End
## 72 72 Return of the Jedi
## 73 73 Independence Day
## 74 74 Pirates of the Caribbean: The Curse of the Black Pearl
## 75 75 Skyfall
## 76 76 The Hobbit: An Unexpected Journey
## 77 77 Harry Potter and the Half-Blood Prince
## 78 78 The Twilight Saga: Eclipse
## 79 79 The Twilight Saga: New Moon
## 80 80 Harry Potter and the Deathly Hallows Part 1
## 81 81 The Sixth Sense
## 82 82 Up
## 83 83 Inception
## 84 84 The Twilight Saga: Breaking Dawn Part 2
## 85 85 Harry Potter and the Order of the Phoenix
## 86 86 The Chronicles of Narnia: The Lion, the Witch and the Wardrobe
## 87 87 Man of Steel
## 88 88 The Empire Strikes Back
## 89 89 Harry Potter and the Goblet of Fire
## 90 90 Monsters, Inc.
## 91 91 Home Alone
## 92 92 The Hunger Games: Mockingjay - Part 2
## 93 93 The Matrix Reloaded
## 94 94 The Twilight Saga: Breaking Dawn Part 1
## 95 95 Meet the Fockers
## 96 96 The Hangover
## 97 97 Gravity
## 98 98 Sing
## 99 99 Monsters University
## 100 100 Shrek
## studio gross year
## 1 BV 936662225 2015
## 2 Fox 760507625 2009
## 3 BV 700059566 2018
## 4 BV 678587869 2018
## 5 Par. 659363944 1997
## 6 Uni. 652270625 2015
## 7 BV 623357910 2012
## 8 BV 620181382 2017
## 9 BV 594119848 2018
## 10 WB 534858444 2008
## 11 BV 532177324 2016
## 12 BV 504014165 2017
## 13 BV 486295561 2016
## 14 Fox 474544677 1999
## 15 Fox 460998007 1977
## 16 BV 459005868 2015
## 17 WB 448139099 2012
## 18 DW 441226247 2004
## 19 Uni. 435110554 1982
## 20 LGF 424668047 2013
## 21 BV 423315812 2006
## 22 BV 422783777 1994
## 23 BV 415004880 2010
## 24 WB 412563408 2017
## 25 Uni. 411752365 2018
## 26 BV 409013994 2013
## 27 BV 408084349 2016
## 28 LGF 408010692 2012
## 29 Sony 404515480 2017
## 30 Sony 403706375 2002
## 31 Uni. 402453882 1993
## 32 P/DW 402111870 2009
## 33 BV 400738009 2013
## 34 BV 389813101 2017
## 35 WB 381011219 2011
## 36 BV 380843261 2003
## 37 Fox 380270577 2005
## 38 NL 377845905 2003
## 39 Sony 373585825 2004
## 40 NM 370782930 2004
## 41 Uni. 368384330 2016
## 42 Uni. 368061265 2013
## 43 BV 364001123 2016
## 44 Fox 363070709 2016
## 45 BV 356461711 2015
## 46 Uni. 353007020 2015
## 47 P/DW 352390543 2011
## 48 WB 350126372 2014
## 49 NL 342551365 2002
## 50 BV 341268248 2016
## 51 LGF 337135885 2014
## 52 Sony 336530303 2007
## 53 Uni. 336045770 2015
## 54 Sony 334201140 2017
## 55 BV 334191110 2010
## 56 BV 333176600 2014
## 57 WB 330360194 2016
## 58 Par. 330252182 1994
## 59 WB (NL) 327481748 2017
## 60 WB 325100054 2016
## 61 P/DW 322719944 2007
## 62 P/DW 319246193 2007
## 63 Par. 318412101 2008
## 64 Fox 318278611 2018
## 65 WB 317575550 2001
## 66 Par. 317101119 2008
## 67 NL 315544750 2001
## 68 BV 315058289 2017
## 69 Par. 312433331 2010
## 70 Fox 310676740 2002
## 71 BV 309420425 2007
## 72 Fox 309306177 1983
## 73 Fox 306169268 1996
## 74 BV 305413918 2003
## 75 Sony 304360277 2012
## 76 WB (NL) 303003568 2012
## 77 WB 301959197 2009
## 78 Sum. 300531751 2010
## 79 Sum. 296623634 2009
## 80 WB 295983305 2010
## 81 BV 293506292 1999
## 82 BV 293004164 2009
## 83 WB 292576195 2010
## 84 LG/S 292324737 2012
## 85 WB 292004738 2007
## 86 BV 291710957 2005
## 87 WB 291045518 2013
## 88 Fox 290475067 1980
## 89 WB 290013036 2005
## 90 BV 289916256 2001
## 91 Fox 285761243 1990
## 92 LGF 281723902 2015
## 93 WB 281576461 2003
## 94 Sum. 281287133 2011
## 95 Uni. 279261160 2004
## 96 WB 277322503 2009
## 97 WB 274092705 2013
## 98 Uni. 270395425 2016
## 99 BV 268492764 2013
## 100 DW 267665011 2001
When I tried to scrape the title, SelectorGadget’s suggested selector (“tr+ tr a b”) almost worked: in addition to the intended movie titles, it also latched onto two headers (Rank and Lifetime Gross), which misaligned my data frame. I had hoped that part of the selector was unnecessary and that removing it would drop the headers, but none of the variations of “tr+ tr a b” I tried worked. I stumbled upon a workaround by taking the selector used for the rank and bumping the number up by one, giving “center td tr+ tr td:nth-child(2) font”. I followed this pattern to get all subsequent columns.
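The column pattern can be written down once instead of five times. This is only a sketch in base R: the names `cols` and `selectors` are my own, and the selector template is the one from the code at the top.

```r
# Column positions in the table, in the order used in the scrape above
cols = c(rank = 1, title = 2, studio = 3, gross = 4, year = 5)

# Fill the column number into the selector template
selectors = sprintf("center td tr+ tr td:nth-child(%d) font", cols)
names(selectors) = names(cols)

print(selectors[["title"]])
```

Each entry of `selectors` could then be passed to `html_nodes()` in place of the hand-typed string.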
The domestic gross was imported as a character string, which I converted to numeric by using the gsub function to replace every dollar sign and comma with an empty string before calling as.numeric.
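A small offline illustration of that cleanup, using two hand-typed sample strings in place of the scraped column (`raw_gross` and `clean_gross` are names made up for this example):

```r
# Sample values typed by hand in the format the page uses
raw_gross = c("$936,662,225", "$760,507,625")

# Inside a character class, "$" and "," are matched literally
clean_gross = as.numeric(gsub("[$,]", "", raw_gross))

print(class(clean_gross))   # "numeric"
```

Without the gsub step, as.numeric would return NA for these strings because of the symbols.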
The year column was annoying because some years had a “^” attached. Because the caret is a special character in regular expressions (it anchors a match to the start of the string), my initial gsub calls did not remove it, and as.numeric then turned every year carrying that symbol into NA. After a lot of Googling I eventually narrowed down the jargon I needed for a better query (“escape”), figured out the right way to spell the word (“caret”), and found the correct number of backslashes to make R treat it as a literal character rather than a metacharacter. (I later discovered this solution was mentioned in the Wikibooks reading.)
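The difference is easy to reproduce offline. The sample years below are typed by hand for illustration, and the variable names are my own. The `fixed = TRUE` variant is an alternative I did not use in the scrape; it sidesteps escaping by turning regular expressions off.

```r
# Hand-typed sample: two of the three years carry the "^" marker
raw_year = c("2015", "1977^", "1997^")

# Unescaped: "^" matches the (empty) start of each string, so nothing is removed
unescaped = gsub("^", "", raw_year)

# Escaped: "\\^" in R source is the regex \^, a literal caret
escaped = gsub("\\^", "", raw_year)

# Alternative: treat the pattern as a plain string instead of a regex
literal = gsub("^", "", raw_year, fixed = TRUE)

# as.numeric warns and returns NA when the caret is still present
print(suppressWarnings(as.numeric(unescaped)))
print(as.numeric(escaped))
```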
The one problem I wasn’t able to solve was pagination: I only scraped page 1 and was unable to scale the code to a set number of pages (let’s say 1000), or ideally to all of the pages. I wanted to use the code from the YouTube link in this week’s reading but had difficulty repurposing it.
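One way the single-page code might be scaled up is sketched below. It rests on assumptions I have not verified. The `?page=` query parameter is a hypothetical URL pattern, not something confirmed for this site, and every page is assumed to share the table layout of page 1. The helper names (`page_urls`, `get_col`, `scrape_page`, `scrape_all`, `fake_scraper`) are all invented for this example. The idea is to wrap the scrape in a function, apply it to a vector of URLs, and stack the results with rbind.

```r
base_url = "https://www.boxofficemojo.com/alltime/domestic.htm"

# ASSUMPTION: pages are addressed with a "page" query parameter
page_urls = function(n_pages) paste0(base_url, "?page=", seq_len(n_pages))

# Extract the text of table column i from a parsed page
get_col = function(page, i) {
  css = sprintf("center td tr+ tr td:nth-child(%d) font", i)
  rvest::html_text(rvest::html_nodes(page, css))
}

# Scrape one page into a data frame (needs rvest and a network connection)
scrape_page = function(url) {
  page = rvest::read_html(url)
  data.frame(
    rank   = get_col(page, 1),
    title  = get_col(page, 2),
    studio = get_col(page, 3),
    gross  = as.numeric(gsub("[$,]", "", get_col(page, 4))),
    year   = as.numeric(gsub("\\^", "", get_col(page, 5))),
    stringsAsFactors = FALSE
  )
}

# Apply a scraper to every URL, pausing between requests, then stack the rows
scrape_all = function(urls, scraper = scrape_page, pause = 1) {
  pages = lapply(urls, function(u) {
    Sys.sleep(pause)
    scraper(u)
  })
  do.call(rbind, pages)
}

# Offline check of the looping logic with a stand-in scraper (no network used)
fake_scraper = function(url) data.frame(url = url, stringsAsFactors = FALSE)
demo = scrape_all(page_urls(3), scraper = fake_scraper, pause = 0)
print(nrow(demo))
```

If the URL pattern turned out to be right, `mojodf = scrape_all(page_urls(10))` would be the real call. Scraping "all of the pages" would additionally need a stopping rule, for example ending the loop when a page returns zero rows.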