library(rvest)      # scraping
library(tidyverse)  # tidying data
library(ggplot2)    # plotting
library(quanteda)   # text analysis
library(wordcloud2) # word clouds
library(tm)         # text analysis
If you have never used R before (footnote 1), or have never downloaded these packages, you will need to install them first with the install.packages() function and then load them with the library() function.
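As a minimal sketch of that install-then-load step, the snippet below checks which of the packages above are not yet installed. The helper name missing_packages() is made up for illustration and is not part of any package, and the actual install call is left commented out so that nothing is downloaded by accident.

```r
# Packages used in this exercise
needed <- c("rvest", "tidyverse", "ggplot2", "quanteda", "wordcloud2", "tm")

# Hypothetical helper: return the packages in `pkgs` that are not installed yet
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

to_install <- missing_packages(needed)
to_install

# Uncomment to install whatever is missing (needs an internet connection):
# if (length(to_install) > 0) install.packages(to_install)
```

Installing is only needed once per machine, whereas library() has to be called in every new R session.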
In this short exercise we will get some data about horror movies from the Rotten Tomatoes website. First, we will learn how to extract some numeric data and visualize it. Second, we will look at some character data, namely movie reviews, and use them to generate a picture that summarizes what the reviews say.
We start by getting the URL and saving it. This is the page that we will scrape (footnote 2): https://www.rottentomatoes.com/top/bestofrt/top_100_horror_movies/. We store the Rotten Tomatoes address in an object we call url. Then, we read the HTML (footnote 3) and store it in an object we call webpage. url and webpage are objects that we define in the R environment.
The main things that we are doing in this chunk of code are:
1. Storing the web address in the url object (using the assignment arrow, <-)
2. Using the read_html() function to download and parse the page that url points to
3. Printing the result to the screen, to verify that the code ‘worked’
url<-"https://www.rottentomatoes.com/top/bestofrt/top_100_horror_movies/"
webpage <- read_html(url)
webpage
## {html_document}
## <html lang="en" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://opengraphprotocol.org/schema/">
## [1] <head prefix="og: http://ogp.me/ns# flixstertomatoes: http://ogp.me/ns/ap ...
## [2] <body class="body " style="min-height:800px">\n <script>console.info( ...
Now that we have the page data stored, we can start extracting interesting things from it. Let’s get the ratings from the page. We find the part of the page that holds the ratings with SelectorGadget (easy to install; see the video on how to use it): https://selectorgadget.com/. Once we know what needs to be scraped, we:
first select it using the html_nodes() function.
We then turn it into text using the html_text() function. Note that the %>% symbol is called a ‘pipe’. It passes the object on its left (in this case webpage) to the function on its right, so the steps read in the order in which they are carried out.
The next step is cleaning the data: we trim the white space and keep only the numeric part of each rating. We then examine the output by printing it to the screen.
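To make the pipe idea concrete, here is a small sketch that does the same two steps twice, once as a nested call and once as a pipeline. It uses base R's native pipe |> (available from R 4.1) instead of %>% so that it runs without loading any package; for a simple chain like this one the two pipes behave the same way. The three input strings are made up for illustration.

```r
raw <- c(" 98%", " 100%", " 63%")

# Nested call: has to be read from the inside out
nested <- nchar(trimws(raw))

# Pipeline: reads left to right, one step at a time
piped <- raw |> trimws() |> nchar()

identical(nested, piped)
```

Both versions trim the white space first and then count the characters in each string.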
##grab the ratings
ratings<-webpage%>%
html_nodes("#top_movies_main .tMeterScore")%>%
html_text()
ratings
## [1] " 98%" " 98%" " 89%" " 95%" " 98%" " 95%" " 96%" " 100%" " 90%"
## [10] " 92%" " 86%" " 95%" " 92%" " 96%" " 93%" " 98%" " 95%" " 94%"
## [19] " 87%" " 89%" " 88%" " 92%" " 88%" " 79%" " 87%" " 88%" " 86%"
## [28] " 91%" " 93%" " 88%" " 92%" " 90%" " 86%" " 85%" " 89%" " 90%"
## [37] " 90%" " 87%" " 90%" " 77%" " 83%" " 86%" " 85%" " 86%" " 85%"
## [46] " 85%" " 81%" " 85%" " 83%" " 82%" " 79%" " 73%" " 68%" " 78%"
## [55] " 79%" " 72%" " 76%" " 78%" " 81%" " 77%" " 80%" " 79%" " 78%"
## [64] " 80%" " 75%" " 75%" " 74%" " 74%" " 71%" " 76%" " 78%" " 72%"
## [73] " 70%" " 73%" " 74%" " 72%" " 74%" " 76%" " 66%" " 75%" " 68%"
## [82] " 72%" " 74%" " 68%" " 74%" " 74%" " 70%" " 73%" " 69%" " 66%"
## [91] " 71%" " 72%" " 71%" " 71%" " 70%" " 70%" " 67%" " 66%" " 69%"
## [100] " 63%"
##get rid of the percentage sign so we can plot the data
ratings<-str_trim(ratings)%>%
str_extract("\\d+")%>%#match every digit, so that 100% is kept as 100 rather than cut to 10
as.numeric()#clean the data
ratings
## [1] 98 98 89 95 98 95 96 100 90 92 86 95 92 96 93 98 95 94
## [19] 87 89 88 92 88 79 87 88 86 91 93 88 92 90 86 85 89 90
## [37] 90 87 90 77 83 86 85 86 85 85 81 85 83 82 79 73 68 78
## [55] 79 72 76 78 81 77 80 79 78 80 75 75 74 74 71 76 78 72
## [73] 70 73 74 72 74 76 66 75 68 72 74 68 74 74 70 73 69 66
## [91] 71 72 71 71 70 70 67 66 69 63
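One detail worth checking in the cleaning step is how many digits the pattern captures: a pattern for exactly two digits turns a 100% rating into 10, while \\d+ keeps every digit. The sketch below illustrates the difference in base R, so it runs without stringr. first_match() is a made-up helper that imitates str_extract(), and the three input strings are illustrative.

```r
raw <- c(" 98%", " 100%", " 63%")

# Hypothetical helper: first match of `pattern` in each string, NA if none
first_match <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  hit <- as.vector(m) > 0
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)
  out
}

two_digits <- as.numeric(first_match(raw, "\\d\\d"))  # exactly two digits
all_digits <- as.numeric(first_match(raw, "\\d+"))    # one or more digits

two_digits
all_digits
```

A quick range check such as max(ratings) after cleaning is a cheap way to catch this kind of truncation.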
Now let’s get the corresponding movie titles!
Similar to what we did before, we:
Use SelectorGadget to find the correct CSS selector, and
plug it into the html_nodes() function.
We then turn it into text using the html_text() function. The next two steps clean up the titles a bit.
names<-webpage%>%html_nodes("#top_movies_main .articleLink")%>%
html_text()
names[1:10]
## [1] "\n Get Out (2017)"
## [2] "\n The Babadook (2014)"
## [3] "\n Hereditary (2018)"
## [4] "\n It Follows (2015)"
## [5] "\n Let the Right One In (2008)"
## [6] "\n Freaks (1932)"
## [7] "\n Night of the Living Dead (1968)"
## [8] "\n One Cut of the Dead (Kamera o tomeru na!) (2019)"
## [9] "\n The Witch (2016)"
## [10] "\n The Cabin in the Woods (2012)"
names<-str_trim(names)#clean white space
names[1:10]
## [1] "Get Out (2017)"
## [2] "The Babadook (2014)"
## [3] "Hereditary (2018)"
## [4] "It Follows (2015)"
## [5] "Let the Right One In (2008)"
## [6] "Freaks (1932)"
## [7] "Night of the Living Dead (1968)"
## [8] "One Cut of the Dead (Kamera o tomeru na!) (2019)"
## [9] "The Witch (2016)"
## [10] "The Cabin in the Woods (2012)"
clean_names<-str_extract(names, "^[^\\(]+")#a bit more cleaning with a 'regular expression' (google it if curious): keep the text before the first "("
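The pattern ^[^\\(]+ keeps everything before the first opening parenthesis, including the space in front of it, so a title with an alternate name in parentheses loses that name as well. As a hedged sketch of an alternative, the snippet below removes only a trailing "(year)". It uses base R's sub() and trimws() so it runs without stringr, and the two titles are copied from the output above.

```r
titles <- c("Get Out (2017)",
            "One Cut of the Dead (Kamera o tomeru na!) (2019)")

# Everything before the first "(": base-R equivalent of the pattern above
before_paren <- sub("^([^(]+).*$", "\\1", titles, perl = TRUE)
before_paren_trimmed <- trimws(before_paren)  # drop the trailing space

# Alternative: strip only a final " (YYYY)" and keep the rest of the title
without_year <- sub("\\s*\\(\\d{4}\\)$", "", titles, perl = TRUE)

before_paren_trimmed
without_year
```

Which version is preferable depends on whether the alternate title should appear in later labels.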
Let’s get the years and turn them into a numeric object, again using regular expressions and the str_extract() function, which returns the part of each string that matches a pattern.
years<-str_extract(names, "\\d\\d\\d\\d")#four digits in a row
years<-as.numeric(years)
years
## [1] 2017 2014 2018 2015 2008 1932 1968 2019 2016 2012 2017 1963 2009 2014 1976
## [16] 2012 2016 2013 2017 2009 2016 2018 2018 2018 2012 2010 2013 2015 2016 1973
## [31] 2017 2020 2006 2012 2017 2018 2017 2016 2010 2017 2009 1990 2013 2014 2011
## [46] 2009 2004 2015 2016 2016 2013 2012 2017 2015 2012 2015 2016 1975 2016 2018
## [61] 2014 2003 2012 2016 2010 2018 2005 2014 2017 2012 2010 2017 2013 2010 2010
## [76] 2011 2017 2013 2013 2010 2015 2010 2017 2019 2011 2017 2010 2011 2010 2012
## [91] 2013 2011 2011 2018 2018 2015 2015 2011 2009 2013
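The four-digit pattern returns the first run of four digits in each string, which works for this list but would pick up the wrong number for a title that itself starts with a year. The sketch below compares it with a pattern anchored to the parentheses at the end of the string. It uses base R so it runs without stringr, and the second title is an illustrative input that is not in the scraped list.

```r
titles <- c("Get Out (2017)", "2001: A Space Odyssey (1968)")

# First run of four digits anywhere in the string
first_four <- as.numeric(
  regmatches(titles, regexpr("\\d{4}", titles, perl = TRUE))
)

# Four digits inside the final pair of parentheses
in_parens <- as.numeric(
  sub("^.*\\((\\d{4})\\)\\s*$", "\\1", titles, perl = TRUE)
)

first_four
in_parens
```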
# Plotting our data
Now, we can plot ratings against years. We begin by creating the data set, which we will call horror_data. It is a data frame composed of three variables: years, ratings, names.
horror_data<-data.frame(years=years,
ratings=ratings,
names=names)
We plot the data using the ggplot() function. This is a very basic plot, but ggplot is very powerful: many of the figures you see online these days in outlets like The Guardian, The New York Times, or the Financial Times can be produced with it.
horror_plot<-ggplot(data=horror_data, aes(x=years, y=ratings))+
geom_point()+
geom_smooth()+
labs(title="Top 100 Horror Movies",
x="years",
subtitle="Source: Rotten Tomatoes")
horror_plot
If we want to focus on recent films, and perhaps use a trend line that is more sensitive to specific observations, we can filter the data so that the plot only presents movies released from 2000 onward, and lower the span of the smoother so that the line is more wiggly:
horror_plot_recent<-horror_data%>%filter(years>1999)%>%
ggplot(aes(x=years, y=ratings))+
geom_point()+
geom_smooth(span = 0.3)+
labs(title="Top 100 Horror Movies",
x="years",
subtitle="Source: Rotten Tomatoes")
horror_plot_recent
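To see what the span argument does, here is a small sketch with base R's loess() function, which is the smoother that geom_smooth() uses by default for data sets of this size (an assumption worth checking against your ggplot2 version). The data are simulated, not the movie ratings, and rss() is a made-up helper that sums the squared residuals.

```r
set.seed(1)
x <- 1:80
y <- 10 * sin(x / 5) + rnorm(80)  # a wavy signal plus noise

smooth_default <- loess(y ~ x)             # default span of 0.75
smooth_wiggly  <- loess(y ~ x, span = 0.3) # smaller span, more local fit

# Hypothetical helper: residual sum of squares of a fitted smoother
rss <- function(fit) sum(residuals(fit)^2)

c(default = rss(smooth_default), wiggly = rss(smooth_wiggly))
```

A smaller span means each fitted value is based on fewer neighbouring points, so the line follows the data more closely, at the risk of chasing noise.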
# Getting text data from hyperlinks
Now, we would like to play with some textual data. Specifically, we want to get the reviews from the hyperlinks that sit behind each movie title. We use SelectorGadget and the html_attr() function, asking for the href attribute so that R returns the hyperlink rather than the displayed text.
links<-webpage%>%html_nodes("#top_movies_main .articleLink")%>%
html_attr("href")
links[1:10]
## [1] "/m/get_out" "/m/the_babadook"
## [3] "/m/hereditary" "/m/it_follows"
## [5] "/m/let_the_right_one_in" "/m/freaks"
## [7] "/m/night_of_the_living_dead" "/m/one_cut_of_the_dead"
## [9] "/m/the_witch_2016" "/m/the_cabin_in_the_woods"
To access the reviews, notice that the links are relative, so we need to concatenate them with the Rotten Tomatoes address to get the full hyperlinks. We use the paste0() function to do that:
links<-paste0("https://www.rottentomatoes.com", links)
links[1:10]
## [1] "https://www.rottentomatoes.com/m/get_out"
## [2] "https://www.rottentomatoes.com/m/the_babadook"
## [3] "https://www.rottentomatoes.com/m/hereditary"
## [4] "https://www.rottentomatoes.com/m/it_follows"
## [5] "https://www.rottentomatoes.com/m/let_the_right_one_in"
## [6] "https://www.rottentomatoes.com/m/freaks"
## [7] "https://www.rottentomatoes.com/m/night_of_the_living_dead"
## [8] "https://www.rottentomatoes.com/m/one_cut_of_the_dead"
## [9] "https://www.rottentomatoes.com/m/the_witch_2016"
## [10] "https://www.rottentomatoes.com/m/the_cabin_in_the_woods"
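paste0() is vectorized: a single base address is recycled and glued onto every element of the vector of paths, with no separator in between. A minimal sketch with two of the paths shown above (the object names base_url, paths and full are chosen for illustration):

```r
base_url <- "https://www.rottentomatoes.com"
paths <- c("/m/get_out", "/m/the_babadook")

# One base address, many paths: the result has one element per path
full <- paste0(base_url, paths)
full
```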
This gives us the links to the reviews, which we can now loop over!
Now, we create a loop in order to scrape the movies’ reviews. A loop means that we want the machine to do something repeatedly. Here, we will tell it to grab only the first 50 reviews (to save time). But a loop can run 10,000 or 1,000,000 times if we tell it to. It’s just a machine.
critic_concensus<-character()##Create a holder to store the reviews
Run the loop, going through the first 50 links, scraping the reviews, turning them into text and storing them in critic_concensus. Notice the i below. That is a counter that runs from 1 to 50. Every time we loop through a review, we tell R to print the link, to make sure everything is working.
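The loop described here could be sketched as follows. This is an illustration under stated assumptions, not the code used for the exercise: collect_reviews() and fake_fetch() are made-up names, and the demo uses an offline stand-in instead of the website so that it runs without a connection. The real selector for the review text is not shown in this document, so it appears only as a placeholder in the commented lines.

```r
# Hypothetical helper: collect one piece of text per link.
# `fetch_text` is any function that takes a URL and returns character text.
collect_reviews <- function(links, fetch_text, n = 50, pause = 0) {
  n <- min(n, length(links))
  reviews <- character(n)              # holder for the results
  for (i in seq_len(n)) {              # i is the counter: 1, 2, ..., n
    print(links[i])                    # progress check
    reviews[i] <- tryCatch(
      paste(fetch_text(links[i]), collapse = " "),
      error = function(e) NA_character_  # keep going if one page fails
    )
    Sys.sleep(pause)                   # optional pause between requests
  }
  reviews
}

# Offline stand-in so the sketch runs without touching the website
fake_fetch <- function(link) paste("Review text for", basename(link))

demo_links <- c("https://www.rottentomatoes.com/m/get_out",
                "https://www.rottentomatoes.com/m/the_babadook")
demo_reviews <- collect_reviews(demo_links, fake_fetch)
demo_reviews

# With rvest loaded, a real fetcher might look like this. The selector is a
# placeholder to be found with SelectorGadget:
# real_fetch <- function(link) {
#   read_html(link) %>% html_nodes("SELECTOR_GOES_HERE") %>% html_text() %>% str_trim()
# }
# critic_concensus <- collect_reviews(links, real_fetch, n = 50, pause = 1)
```

Wrapping each request in tryCatch() means one broken page produces an NA instead of stopping the whole loop.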
## [1] "https://www.rottentomatoes.com/m/get_out"
## [1] "https://www.rottentomatoes.com/m/the_babadook"
## [1] "https://www.rottentomatoes.com/m/hereditary"
## [1] "https://www.rottentomatoes.com/m/it_follows"
## [1] "https://www.rottentomatoes.com/m/let_the_right_one_in"
## [1] "https://www.rottentomatoes.com/m/freaks"
## [1] "https://www.rottentomatoes.com/m/night_of_the_living_dead"
## [1] "https://www.rottentomatoes.com/m/one_cut_of_the_dead"
## [1] "https://www.rottentomatoes.com/m/the_witch_2016"
## [1] "https://www.rottentomatoes.com/m/the_cabin_in_the_woods"
## [1] "https://www.rottentomatoes.com/m/it_2017"
## [1] "https://www.rottentomatoes.com/m/1002448-birds"
## [1] "https://www.rottentomatoes.com/m/drag_me_to_hell"
## [1] "https://www.rottentomatoes.com/m/a_girl_walks_home_alone_at_night"
## [1] "https://www.rottentomatoes.com/m/1003625-carrie"
## [1] "https://www.rottentomatoes.com/m/the_loved_ones_2012"
## [1] "https://www.rottentomatoes.com/m/the_love_witch"
## [1] "https://www.rottentomatoes.com/m/room_237_2012"
## [1] "https://www.rottentomatoes.com/m/it_comes_at_night"
## [1] "https://www.rottentomatoes.com/m/zombieland"
## [1] "https://www.rottentomatoes.com/m/dont_breathe_2016"
## [1] "https://www.rottentomatoes.com/m/the_endless"
## [1] "https://www.rottentomatoes.com/m/upgrade_2018"
## [1] "https://www.rottentomatoes.com/m/halloween_2018"
## [1] "https://www.rottentomatoes.com/m/frankenweenie_2012"
## [1] "https://www.rottentomatoes.com/m/let_me_in"
## [1] "https://www.rottentomatoes.com/m/the_conjuring"
## [1] "https://www.rottentomatoes.com/m/bone_tomahawk"
## [1] "https://www.rottentomatoes.com/m/hush_2016"
## [1] "https://www.rottentomatoes.com/m/the_wicker_man_1973"
## [1] "https://www.rottentomatoes.com/m/the_devils_candy"
## [1] "https://www.rottentomatoes.com/m/blood_quantum"
## [1] "https://www.rottentomatoes.com/m/descent"
## [1] "https://www.rottentomatoes.com/m/chronicle"
## [1] "https://www.rottentomatoes.com/m/better_watch_out"
## [1] "https://www.rottentomatoes.com/m/possum"
## [1] "https://www.rottentomatoes.com/m/a_dark_song"
## [1] "https://www.rottentomatoes.com/m/the_autopsy_of_jane_doe"
## [1] "https://www.rottentomatoes.com/m/los_ojos_de_julia"
## [1] "https://www.rottentomatoes.com/m/split_2017"
## [1] "https://www.rottentomatoes.com/m/paranormal_activity"
## [1] "https://www.rottentomatoes.com/m/tremors"
## [1] "https://www.rottentomatoes.com/m/berberian_sound_studio_2012"
## [1] "https://www.rottentomatoes.com/m/a_field_in_england"
## [1] "https://www.rottentomatoes.com/m/tucker_and_dale_vs_evil"
## [1] "https://www.rottentomatoes.com/m/the_house_of_the_devil_2009"
## [1] "https://www.rottentomatoes.com/m/hellboy"
## [1] "https://www.rottentomatoes.com/m/spring_2015"
## [1] "https://www.rottentomatoes.com/m/tale_of_tales"
## [1] "https://www.rottentomatoes.com/m/ouija_origin_of_evil"
## [1] "Funny, scary, and thought-provoking, Get Out seamlessly weaves its trenchant social critiques into a brilliantly effective and entertaining horror/comedy thrill ride."
## [2] "The Babadook relies on real horror rather than cheap jump scares -- and boasts a heartfelt, genuinely moving story to boot."
## [3] "Hereditary uses its classic setup as the framework for a harrowing, uncommonly unsettling horror film whose cold touch lingers long beyond the closing credits."
## [4] "Smart, original, and above all terrifying, It Follows is the rare modern horror film that works on multiple levels -- and leaves a lingering sting."
## [5] "Let the Right One In reinvigorates the seemingly tired vampire genre by effectively mixing scares with intelligent storytelling."
## [6] "Time has been kind to this horror legend: Freaks manages to frighten, shock, and even touch viewers in ways that contemporary viewers missed."
## [7] "George A. Romero's debut set the template for the zombie film, and features tight editing, realistic gore, and a sly political undercurrent."
## [8] "Brainy and bloody in equal measure, One Cut of the Dead reanimates the moribund zombie genre with a refreshing blend of formal daring and clever satire."
## [9] "As thought-provoking as it is visually compelling, The Witch delivers a deeply unsettling exercise in slow-building horror that suggests great things for debuting writer-director Robert Eggers."
## [10] "The Cabin in the Woods is an astonishing meta-feat, capable of being funny, strange, and scary -- frequently all at the same time."
We want to make a word cloud in order to see which words characterize the reviews. Before we do that, let’s use R to clean up a bit by getting rid of numbers, punctuation and stop words (such as ‘a’, ‘the’, ‘I’). Then, we’ll turn the words into a data frame that R can plot.
docs<-Corpus(VectorSource(critic_concensus))
docs <- docs %>%
tm_map(removeNumbers) %>%
tm_map(removePunctuation) %>%
tm_map(stripWhitespace)
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("english"))
dtm <- TermDocumentMatrix(docs)
matrix <- as.matrix(dtm)
words <- sort(rowSums(matrix),decreasing=TRUE)
df <- data.frame(word = names(words),freq=words)
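To show what these cleaning steps produce, here is a tiny base-R sketch that goes from raw text to a word-frequency table by hand. It is a simplified stand-in for the tm pipeline above, not a replacement: the two review snippets and the short stop-word list are made up for illustration, and the object names are chosen for the example.

```r
reviews <- c("Smart, original, and above all terrifying.",
             "A smart and scary horror film.")
stop_words <- c("a", "and", "the", "all", "above")  # tiny illustrative list

tokens <- tolower(reviews)                  # lower-case everything
tokens <- gsub("[[:punct:]]", "", tokens)   # remove punctuation
tokens <- unlist(strsplit(tokens, "\\s+"))  # split into words
tokens <- tokens[nzchar(tokens) & !tokens %in% stop_words]  # drop stop words

# Count each word and sort from most to least frequent
counts <- sort(table(tokens), decreasing = TRUE)
freq <- data.frame(word = names(counts), freq = as.vector(counts))
freq
```

The result has the same two columns, word and freq, as the df object that is passed to the word cloud.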
Now, the word-cloud. Don’t be like me! Never only do a word-cloud!!!
wordcloud2(data=df, size = 0.7, shape = 'star')
What do you think? Anything surprising?
This is it. Come see me if you want to learn more. This is the tip of the iceberg!
1. R is currently one of the world’s fastest-growing tools in computing and data science.
2. Scraping is importing information from a website into a local file on our computer. It’s a really fast and efficient way of getting lots of information.
3. Hypertext Markup Language: the code that is used to structure a web page and its content.