library(rvest)
## Loading required package: xml2
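# Parse the first page of the all-time domestic chart
mojo <- read_html("https://www.boxofficemojo.com/alltime/domestic.htm")
# Each column is the k-th cell (td:nth-child(k)) of every row that follows another row
rank <- mojo %>% html_nodes("center td tr+ tr td:nth-child(1) font") %>% html_text()
title <- mojo %>% html_nodes("center td tr+ tr td:nth-child(2) font") %>% html_text()
studio <- mojo %>% html_nodes("center td tr+ tr td:nth-child(3) font") %>% html_text()
gross <- mojo %>% html_nodes("center td tr+ tr td:nth-child(4) font") %>% html_text()
# Strip dollar signs and commas, then convert to numeric
gross <- as.numeric(gsub("[$,]", "", gross))
year <- mojo %>% html_nodes("center td tr+ tr td:nth-child(5) font") %>% html_text()
# Strip the caret attached to some years, then convert to numeric
year <- as.numeric(gsub("\\^", "", year))
# Build the data frame directly; wrapping the vectors in cbind() would coerce every column to character
mojodf <- data.frame(rank, title, studio, gross, year, stringsAsFactors = FALSE)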
mojodf
##     rank                                                          title
## 1      1                                   Star Wars: The Force Awakens
## 2      2                                                         Avatar
## 3      3                                                  Black Panther
## 4      4                                         Avengers: Infinity War
## 5      5                                                        Titanic
## 6      6                                                 Jurassic World
## 7      7                                          Marvel's The Avengers
## 8      8                                       Star Wars: The Last Jedi
## 9      9                                                  Incredibles 2
## 10    10                                                The Dark Knight
## 11    11                                   Rogue One: A Star Wars Story
## 12    12                                    Beauty and the Beast (2017)
## 13    13                                                   Finding Dory
## 14    14                      Star Wars: Episode I - The Phantom Menace
## 15    15                                                      Star Wars
## 16    16                                        Avengers: Age of Ultron
## 17    17                                          The Dark Knight Rises
## 18    18                                                        Shrek 2
## 19    19                                    E.T.: The Extra-Terrestrial
## 20    20                                The Hunger Games: Catching Fire
## 21    21                     Pirates of the Caribbean: Dead Man's Chest
## 22    22                                                  The Lion King
## 23    23                                                    Toy Story 3
## 24    24                                                   Wonder Woman
## 25    25                                 Jurassic World: Fallen Kingdom
## 26    26                                                     Iron Man 3
## 27    27                                     Captain America: Civil War
## 28    28                                               The Hunger Games
## 29    29                                 Jumanji: Welcome to the Jungle
## 30    30                                                     Spider-Man
## 31    31                                                  Jurassic Park
## 32    32                            Transformers: Revenge of the Fallen
## 33    33                                                         Frozen
## 34    34                                 Guardians of the Galaxy Vol. 2
## 35    35                    Harry Potter and the Deathly Hallows Part 2
## 36    36                                                   Finding Nemo
## 37    37                   Star Wars: Episode III - Revenge of the Sith
## 38    38                  The Lord of the Rings: The Return of the King
## 39    39                                                   Spider-Man 2
## 40    40                                      The Passion of the Christ
## 41    41                                        The Secret Life of Pets
## 42    42                                                Despicable Me 2
## 43    43                                         The Jungle Book (2016)
## 44    44                                                       Deadpool
## 45    45                                                     Inside Out
## 46    46                                                      Furious 7
## 47    47                                 Transformers: Dark of the Moon
## 48    48                                                American Sniper
## 49    49                          The Lord of the Rings: The Two Towers
## 50    50                                                       Zootopia
## 51    51                          The Hunger Games: Mockingjay - Part 1
## 52    52                                                   Spider-Man 3
## 53    53                                                        Minions
## 54    54                                         Spider-Man: Homecoming
## 55    55                                     Alice in Wonderland (2010)
## 56    56                                        Guardians of the Galaxy
## 57    57                             Batman v Superman: Dawn of Justice
## 58    58                                                   Forrest Gump
## 59    59                                                             It
## 60    60                                                  Suicide Squad
## 61    61                                                Shrek the Third
## 62    62                                                   Transformers
## 63    63                                                       Iron Man
## 64    64                                                     Deadpool 2
## 65    65                          Harry Potter and the Sorcerer's Stone
## 66    66             Indiana Jones and the Kingdom of the Crystal Skull
## 67    67              The Lord of the Rings: The Fellowship of the Ring
## 68    68                                                 Thor: Ragnarok
## 69    69                                                     Iron Man 2
## 70    70                   Star Wars: Episode II - Attack of the Clones
## 71    71                       Pirates of the Caribbean: At World's End
## 72    72                                             Return of the Jedi
## 73    73                                               Independence Day
## 74    74         Pirates of the Caribbean: The Curse of the Black Pearl
## 75    75                                                        Skyfall
## 76    76                              The Hobbit: An Unexpected Journey
## 77    77                         Harry Potter and the Half-Blood Prince
## 78    78                                     The Twilight Saga: Eclipse
## 79    79                                    The Twilight Saga: New Moon
## 80    80                    Harry Potter and the Deathly Hallows Part 1
## 81    81                                                The Sixth Sense
## 82    82                                                             Up
## 83    83                                                      Inception
## 84    84                        The Twilight Saga: Breaking Dawn Part 2
## 85    85                      Harry Potter and the Order of the Phoenix
## 86    86 The Chronicles of Narnia: The Lion, the Witch and the Wardrobe
## 87    87                                                   Man of Steel
## 88    88                                        The Empire Strikes Back
## 89    89                            Harry Potter and the Goblet of Fire
## 90    90                                                 Monsters, Inc.
## 91    91                                                     Home Alone
## 92    92                          The Hunger Games: Mockingjay - Part 2
## 93    93                                            The Matrix Reloaded
## 94    94                        The Twilight Saga: Breaking Dawn Part 1
## 95    95                                               Meet the Fockers
## 96    96                                                   The Hangover
## 97    97                                                        Gravity
## 98    98                                                           Sing
## 99    99                                            Monsters University
## 100  100                                                          Shrek
##      studio     gross year
## 1        BV 936662225 2015
## 2       Fox 760507625 2009
## 3        BV 700059566 2018
## 4        BV 678587869 2018
## 5      Par. 659363944 1997
## 6      Uni. 652270625 2015
## 7        BV 623357910 2012
## 8        BV 620181382 2017
## 9        BV 594119848 2018
## 10       WB 534858444 2008
## 11       BV 532177324 2016
## 12       BV 504014165 2017
## 13       BV 486295561 2016
## 14      Fox 474544677 1999
## 15      Fox 460998007 1977
## 16       BV 459005868 2015
## 17       WB 448139099 2012
## 18       DW 441226247 2004
## 19     Uni. 435110554 1982
## 20      LGF 424668047 2013
## 21       BV 423315812 2006
## 22       BV 422783777 1994
## 23       BV 415004880 2010
## 24       WB 412563408 2017
## 25     Uni. 411752365 2018
## 26       BV 409013994 2013
## 27       BV 408084349 2016
## 28      LGF 408010692 2012
## 29     Sony 404515480 2017
## 30     Sony 403706375 2002
## 31     Uni. 402453882 1993
## 32     P/DW 402111870 2009
## 33       BV 400738009 2013
## 34       BV 389813101 2017
## 35       WB 381011219 2011
## 36       BV 380843261 2003
## 37      Fox 380270577 2005
## 38       NL 377845905 2003
## 39     Sony 373585825 2004
## 40       NM 370782930 2004
## 41     Uni. 368384330 2016
## 42     Uni. 368061265 2013
## 43       BV 364001123 2016
## 44      Fox 363070709 2016
## 45       BV 356461711 2015
## 46     Uni. 353007020 2015
## 47     P/DW 352390543 2011
## 48       WB 350126372 2014
## 49       NL 342551365 2002
## 50       BV 341268248 2016
## 51      LGF 337135885 2014
## 52     Sony 336530303 2007
## 53     Uni. 336045770 2015
## 54     Sony 334201140 2017
## 55       BV 334191110 2010
## 56       BV 333176600 2014
## 57       WB 330360194 2016
## 58     Par. 330252182 1994
## 59  WB (NL) 327481748 2017
## 60       WB 325100054 2016
## 61     P/DW 322719944 2007
## 62     P/DW 319246193 2007
## 63     Par. 318412101 2008
## 64      Fox 318278611 2018
## 65       WB 317575550 2001
## 66     Par. 317101119 2008
## 67       NL 315544750 2001
## 68       BV 315058289 2017
## 69     Par. 312433331 2010
## 70      Fox 310676740 2002
## 71       BV 309420425 2007
## 72      Fox 309306177 1983
## 73      Fox 306169268 1996
## 74       BV 305413918 2003
## 75     Sony 304360277 2012
## 76  WB (NL) 303003568 2012
## 77       WB 301959197 2009
## 78     Sum. 300531751 2010
## 79     Sum. 296623634 2009
## 80       WB 295983305 2010
## 81       BV 293506292 1999
## 82       BV 293004164 2009
## 83       WB 292576195 2010
## 84     LG/S 292324737 2012
## 85       WB 292004738 2007
## 86       BV 291710957 2005
## 87       WB 291045518 2013
## 88      Fox 290475067 1980
## 89       WB 290013036 2005
## 90       BV 289916256 2001
## 91      Fox 285761243 1990
## 92      LGF 281723902 2015
## 93       WB 281576461 2003
## 94     Sum. 281287133 2011
## 95     Uni. 279261160 2004
## 96       WB 277322503 2009
## 97       WB 274092705 2013
## 98     Uni. 270395425 2016
## 99       BV 268492764 2013
## 100      DW 267665011 2001

When I tried to scrape the title, SelectorGadget’s suggested selector (“tr+ tr a b”) almost worked: in addition to the intended movie titles, it also latched onto two headers (Rank and Lifetime Gross), which misaligned my data frame. I had hoped that the selector contained unnecessary parts and that removing them would drop the headers, but none of the variations of “tr+ tr a b” that I tried worked. I stumbled upon a workaround by taking the selector used for the rank and bumping the number up by one to “center td tr+ tr td:nth-child(2) font”. I followed this pattern to get all subsequent columns.
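
To see how the td:nth-child(k) pattern picks out one column at a time, here is a minimal sketch on a made-up two-column table. The HTML and the film names are invented for illustration and are much simpler than the real page, so the selector is shortened accordingly.

```r
library(rvest)

# A made-up table standing in for the real page (not Box Office Mojo's markup)
toy <- read_html("
<table>
  <tr><td><b>Rank</b></td><td><b>Title</b></td></tr>
  <tr><td>1</td><td>Film A</td></tr>
  <tr><td>2</td><td>Film B</td></tr>
</table>")

# "tr + tr" only matches a row that directly follows another row,
# so the first (header) row is skipped; nth-child(k) picks the k-th cell
toy_rank  <- toy %>% html_nodes("tr + tr td:nth-child(1)") %>% html_text()
toy_title <- toy %>% html_nodes("tr + tr td:nth-child(2)") %>% html_text()

toy_rank
toy_title
```

Because every column is selected by position in the same set of rows, the resulting vectors have the same length and line up row by row.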

The domestic gross was imported as a character string, which I had to convert to numeric by using the gsub function to replace every dollar sign and comma with an empty string before calling as.numeric.
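
As a small illustration of that cleanup, the sketch below runs the same gsub pattern on two sample strings written in the format I assume the page uses (a dollar sign plus comma separators).

```r
# Sample strings in the assumed scraped format
gross_raw <- c("$936,662,225", "$760,507,625")

# Inside [...] the dollar sign is a literal character, so "[$,]" matches
# either a dollar sign or a comma; each match is replaced with ""
gross_clean <- gsub("[$,]", "", gross_raw)

# Only now can the strings be converted to numbers
gross_num <- as.numeric(gross_clean)
gross_num
```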

The year column was annoying because some years had a “^”. Because “^” is a special character in regular expressions (it matches the start of a string), my initial gsub calls removed nothing, and as.numeric then turned every year with that symbol into NA. After a lot of Googling I was eventually able to narrow down the jargon I needed for a better query (“escape”), figured out the right way to spell the word (“caret”), and found the correct number of backslashes to make R treat it as a literal character. (I later discovered this solution was mentioned in the Wikibooks reading.)
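
A minimal sketch of the difference, using two sample values (the second one mimics a year that carries a caret; it is not copied from the page). The fixed = TRUE variant is an alternative I did not use in my code above.

```r
# Sample values; the second mimics a year with a caret attached
year_raw <- c("2015", "1977^")

# Unescaped: "^" is the start-of-string anchor, so no character is removed
unescaped <- gsub("^", "", year_raw)
suppressWarnings(as.numeric(unescaped))  # the caret is still there, so this yields NA

# Escaped: "\\^" in R source is the regex \^, which matches a literal caret
escaped <- gsub("\\^", "", year_raw)

# Alternative: fixed = TRUE treats the pattern as plain text, no escaping needed
fixed <- gsub("^", "", year_raw, fixed = TRUE)

year_num <- as.numeric(escaped)
year_num
```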

The one problem I wasn’t able to solve was pagination: I only scraped page 1 and was unable to scale the code to a set number of pages (let’s say 1000), or ideally to all of the pages. I wanted to use the code from the YouTube link in this week’s reading but had difficulty repurposing it.
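
One possible way to approach this is sketched below. It is untested against the live site and rests on a big assumption: that page n can be reached by adding a "page" query parameter to the URL. That pattern is hypothetical, so the real "next page" link should be checked first. The helper names (page_url, scrape_column, scrape_page, scrape_pages) are made up for this sketch.

```r
# HYPOTHETICAL URL pattern: assumes a "page" query parameter selects the page.
# Check the site's own "next page" link and adjust this function if it differs.
page_url <- function(n) {
  paste0("https://www.boxofficemojo.com/alltime/domestic.htm?page=", n)
}

# Pull the k-th column from one parsed page, reusing the selector pattern from above
scrape_column <- function(page, k) {
  selector <- sprintf("center td tr+ tr td:nth-child(%d) font", k)
  rvest::html_text(rvest::html_nodes(page, selector))
}

# Scrape a single page into a data frame
scrape_page <- function(n) {
  page <- rvest::read_html(page_url(n))
  data.frame(
    rank   = scrape_column(page, 1),
    title  = scrape_column(page, 2),
    studio = scrape_column(page, 3),
    gross  = as.numeric(gsub("[$,]", "", scrape_column(page, 4))),
    year   = as.numeric(gsub("\\^", "", scrape_column(page, 5))),
    stringsAsFactors = FALSE
  )
}

# Scrape a set of pages and stack the results into one data frame
scrape_pages <- function(pages) {
  do.call(rbind, lapply(pages, function(n) {
    Sys.sleep(1)  # pause between requests to go easy on the server
    scrape_page(n)
  }))
}

# Example call (commented out because it needs network access):
# mojo_all <- scrape_pages(1:3)
```

Scraping "all of the pages" would additionally need a stopping rule, for example stopping at the first page that returns zero rows; that part is left out of the sketch.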