Web Scraping with R

Introduction

Web scraping is a form of data scraping used to extract data from websites. To do it, you need a basic understanding of HTML, CSS and related technologies. To be honest, I am still studying networking myself, so in this document I will only briefly introduce how I extract data from HTML.

HTML (HyperText Markup Language): the standard markup language for creating web pages and web applications.
- Contents: headings, paragraphs, lists
- Structure:
  <html>
  <head>
  <title> .... </title>
  </head>
  <body>
  <h1> ....</h1>
  <p> ........ </p>
  <li> .... </li>
  </body>
  </html>
CSS (Cascading Style Sheets): a style sheet language used for describing the presentation of a document written in a markup language such as HTML.
- Presentation: font, color, background color, border
JavaScript: a scripting language that adds behavior to web pages.
- Behavior: dynamic display, widgets, user interaction, click to open a popup page
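
To make the nesting of an HTML document concrete, here is a minimal sketch that parses a small page with the xml2 package and lists the child elements at each level. The page content is made up for illustration, and the `<li>` is wrapped in a `<ul>` here, which the skeleton above leaves out.

```r
library(xml2)

# A tiny, made-up page with the same skeleton as the structure above
doc <- read_html(
  "<html><head><title>My page</title></head>
   <body><h1>Heading</h1><p>Paragraph</p><ul><li>Item</li></ul></body></html>"
)

# The root <html> element has two children: <head> and <body>
top_level <- xml_name(xml_children(doc))
top_level

# The <body> element holds the visible content
body <- xml_find_first(doc, "//body")
body_children <- xml_name(xml_children(body))
body_children
```

Each name printed here is an element (node) that a scraper can later select.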

1 Web scraping with `rvest`

Before using the rvest library, it helps to understand what nodes, attributes and text are in HTML.

html_node: Simply put, a node is an HTML element, such as a heading, a paragraph or a title, and each node can have HTML attributes. You select nodes with XPath expressions or CSS selectors. rvest provides both html_node and html_nodes: html_node returns exactly one element (the first match), while html_nodes returns every matching element.
html_attr: Attributes are additional values that configure elements or adjust their behavior. HTML has many attributes; two global attributes that I find useful are dir and title.
- dir: global attribute; defines the text direction
- title: global attribute; advisory text describing the element
html_text: extracts the text from an HTML node

For more detail, please check this website.
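
The difference between html_node, html_nodes, html_attr and html_text is easier to see on a small page. The following is a minimal sketch on an inline HTML string; the page content is invented for illustration and is not taken from any real site.

```r
library(rvest)

# A made-up page with two paragraphs that carry dir and title attributes
page <- xml2::read_html('
<html>
  <head><title>Demo page</title></head>
  <body>
    <p dir="ltr" title="first paragraph">Hello</p>
    <p dir="rtl" title="second paragraph">World</p>
  </body>
</html>')

first_p <- html_node(page, "p")    # only the first matching <p>
all_p   <- html_nodes(page, "p")   # every matching <p>

first_text <- html_text(first_p)   # text of the first paragraph
all_text   <- html_text(all_p)     # text of both paragraphs
dirs       <- html_attr(all_p, "dir")    # value of the dir attribute
titles     <- html_attr(all_p, "title")  # value of the title attribute

# The same kind of selection with XPath instead of a CSS selector
page_title <- html_text(html_nodes(page, xpath = "/html/head/title"))
```

html_node is convenient when you expect a single element, while html_nodes is the safer choice when the number of matches is unknown.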

1.1 Example

library(rvest) # read_html(), html_nodes(), html_text(), html_table()
library(dplyr) # %>% and glimpse()

# Load URL
url <- "http://kabutan.jp/stock/kabuka?code=0000"
url_res <- read_html(url)
url_res

## {xml_document}
## <html xmlns="http://www.w3.org/1999/xhtml" lang="ja" xml:lang="ja">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body>\r\n<!-- Google Tag Manager -->\r\n<noscript><iframe src="//ww ...

# Two equivalent ways to extract the page title
url_title1 <- html_nodes(url_res, 
                        xpath = "/html/head/title") %>% # using xpath
  html_text()

url_title2 <- html_nodes(url_res,
                        css = "title") %>% # Using CSS
  html_text()

url_title1 == url_title2

## [1] TRUE

stock <- read_html("http://kabutan.jp/stock/kabuka?code=0000") %>% 
  html_nodes(xpath = "//*[@id='stock_kabuka_table']/table[2]") %>% 
  html_table()

stock %>% class()

## [1] "list"
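
The result is a list because html_nodes can match several tables, so html_table returns one table per match. As a minimal sketch of the difference, the inline HTML below imitates the page layout with two tables; the English column names are stand-ins chosen for this example, and only the id and the XPath come from the code above.

```r
library(rvest)

# Made-up fragment that reuses the id and table position from the example
page <- xml2::read_html('
<div id="stock_kabuka_table">
  <table><tr><th>date</th><th>close</th></tr>
         <tr><td>18/08/02</td><td>22,512.53</td></tr></table>
  <table><tr><th>date</th><th>close</th></tr>
         <tr><td>18/08/01</td><td>22,746.70</td></tr>
         <tr><td>18/07/31</td><td>22,553.72</td></tr></table>
</div>')

xp <- "//*[@id='stock_kabuka_table']/table[2]"

# html_nodes(): a set of nodes, so html_table() gives a list of tables
as_list <- page %>% html_nodes(xpath = xp) %>% html_table()

# html_node(): a single node, so html_table() gives one data frame
as_frame <- page %>% html_node(xpath = xp) %>% html_table()

class(as_list)
is.data.frame(as_frame)
```

With the list version you would take the first element, for example `as_list[[1]]`, before working with the table.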

It would be convenient if all the data were on a single page, but in practice most web data is spread over several pages. In that case, a for() loop makes it easy to collect every page.

base_url <- "https://kabutan.jp/stock/kabuka?code=0000&ashi=day&page="

urls <- NULL
stocks <- list()

for(i in 1:9) {
  pgnum <- as.character(i)
  urls[i] <- paste0(base_url, pgnum)
  
  stocks[[i]] <- read_html(urls[i]) %>% 
    html_node(xpath = "//*[@id='stock_kabuka_table']/table[2]") %>% 
    html_table() %>% 
    dplyr::mutate_at("前日比", as.character)
  
  Sys.sleep(1) # pause between requests so the server is not overloaded; do not skip this
}

## Warning: Mangling the following names: <U+58F2>買高(株) -> <U+58F2>買高
## (株). Use enc2native() to avoid the warning.

# Combine the per-page tables into a single data frame
stocks <- dplyr::bind_rows(stocks)
stocks %>% 
  glimpse()

## Observations: 269
## Variables: 8
## $ 日付         <chr> "18/08/02", "18/08/01", "18/07/31", "18/07/30", "18...
## $ 始値         <chr> "22,676.73", "22,642.18", "22,472.12", "22,613.30",...
## $ 高値         <chr> "22,754.73", "22,775.47", "22,678.06", "22,631.32",...
## $ 安値         <chr> "22,464.81", "22,615.98", "22,352.21", "22,518.94",...
## $ 終値         <chr> "22,512.53", "22,746.70", "22,553.72", "22,544.84",...
## $ 前日比       <chr> "-234.17", "192.98", "8.88", "-167.91", "125.88", "-...
## $ `前日比％`   <dbl> -1.0, 0.9, 0.0, -0.7, 0.6, -0.1, 0.5, 0.5, -1.3, -0.3...
## $ `<U+58F2>買高(株)` <chr> "1,642,420,000", "1,767,250,000", "1,972,430,000", "1...
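
If one page fails to load, the loop above stops with an error. One possible variation, sketched here under assumptions, wraps each request in tryCatch so that a failed page is skipped. The helper name `scrape_pages` and the stand-in `fake_fetch` are hypothetical; in real use the fetch function would contain the read_html() pipeline from the loop.

```r
# Hypothetical helper: fetch every URL and skip the pages that fail
scrape_pages <- function(urls, fetch_page, delay = 1) {
  results <- lapply(urls, function(u) {
    out <- tryCatch(fetch_page(u), error = function(e) NULL)
    Sys.sleep(delay) # pause between requests to go easy on the server
    out
  })
  names(results) <- urls
  Filter(Negate(is.null), results) # keep only the pages that loaded
}

base_url <- "https://kabutan.jp/stock/kabuka?code=0000&ashi=day&page="
urls <- paste0(base_url, 1:9) # paste0() is vectorised, no loop needed

# Stand-in for the real download so that this sketch runs offline:
# it pretends that page 3 fails and returns a dummy table otherwise
fake_fetch <- function(u) {
  if (grepl("page=3$", u)) stop("simulated network error")
  data.frame(url = u, stringsAsFactors = FALSE)
}

pages <- scrape_pages(urls, fake_fetch, delay = 0)
length(pages)
```

The pages that loaded can then be combined with dplyr::bind_rows() exactly as before.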

Once the web data is loaded, you can wrangle it however you like. Please check the site's terms and intellectual property rights before you scrape it.
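
As a small example of such wrangling, the price columns above arrive as character strings with thousands separators, and the dates as text. The sketch below converts a few values copied from the output above; the English column names are stand-ins, and the date format yy/mm/dd is an assumption based on how the values look.

```r
# A few rows copied from the output above, with stand-in column names
stocks_head <- data.frame(
  date  = c("18/08/02", "18/08/01", "18/07/31"),
  close = c("22,512.53", "22,746.70", "22,553.72"),
  stringsAsFactors = FALSE
)

# Assumed format: two-digit year / month / day
stocks_head$date <- as.Date(stocks_head$date, format = "%y/%m/%d")

# Remove the thousands separator before converting to numeric
stocks_head$close <- as.numeric(gsub(",", "", stocks_head$close))

str(stocks_head)
```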

1.2 Example

Now let's try web scraping with CSS selectors. This example scrapes movie ratings. To follow along, press Ctrl + Shift + I in your browser to open the developer tools and inspect the page's HTML; it helps to see where each selector points.
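
Before running the selectors against the live page, it may help to see what the three kinds of selector used in this example do. The sketch below applies them to a made-up fragment; the class names are borrowed from the selectors in this section, but the markup itself is an assumption and not a copy of the real page.

```r
library(rvest)

# Made-up fragment that borrows the class names used in this section
page <- xml2::read_html('
<div class="lister-item">
  <h3 class="lister-item-header">
    <span class="text-primary">1.</span>
    <a href="/title/a">Movie A</a>
  </h3>
  <div class="ratings-bar"><strong>7.9</strong></div>
  <p class="text-muted">Plot of movie A.</p>
</div>')

# Class selector: every element with class "text-primary"
rank_demo <- page %>% html_nodes(".text-primary") %>% html_text() %>% as.numeric()

# Descendant selector: <a> elements inside ".lister-item-header"
title_demo <- page %>% html_nodes(".lister-item-header a") %>% html_text()

# Adjacent sibling selector: ".text-muted" placed right after ".ratings-bar"
plot_demo <- page %>% html_nodes(".ratings-bar + .text-muted") %>% html_text()
```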

url <- 'http://www.imdb.com/search/title?count=100&release_date=2017,2017&title_type=feature'

webpage <- xml2::read_html(url)


# ranking data
rank.data <- html_nodes(webpage,
                        '.text-primary') %>% # scrape the ranking section
  html_text() %>% # convert the ranking nodes to text
  as.numeric() # convert the rankings to numeric

rank.data %>% head() # for check

## [1] 1 2 3 4 5 6

# title data
title.data <- html_nodes(webpage,
                         '.lister-item-header a') %>% 
  html_text()
title.data %>% head() # for check

## [1] "Mighty Thor: Battle Royale" "A Prayer Before Dawn"      
## [3] "The Greatest Showman"       "Justice League"            
## [5] "It"                         "The Snowman"

# description
description <- html_nodes(webpage,
                          '.ratings-bar+ .text-muted') %>% 
  html_text()
description %>% head() # for check

## [1] "\n    Thor is imprisoned on the planet Sakaar, and must race against time to return to Asgard and stop Ragnarok, the destruction of his world, at the hands of the powerful and ruthless villain Hela."
## [2] "\n    The true story of an English boxer incarcerated in one of Thailand's most notorious prisons as he fights in Muay Thai tournaments to earn his freedom."                                          
## [3] "\n    Celebrates the birth of show business, and tells of a visionary who rose from nothing to create a spectacle that became a worldwide sensation."                                                  
## [4] "\n    Fueled by his restored faith in humanity and inspired by Superman's selfless act, Bruce Wayne enlists the help of his newfound ally, Diana Prince, to face an even greater enemy."               
## [5] "\n    In the summer of 1989, a group of bullied kids band together to destroy a shapeshifting monster, which disguises itself as a clown and preys on the children of Derry, their small Maine town."  
## [6] "\n    Detective Harry Hole investigates the disappearance of a woman whose scarf is found wrapped around an ominous-looking snowman."

# runtime
runtime <- html_nodes(webpage,
                      '.text-muted .runtime') %>% 
  html_text() 
runtime <- gsub(" min", "",
                   runtime) %>% # Remove "min" character
  as.numeric()
runtime %>% head() # for check

## [1] 130 116 105 120 135 119

# Genre
genre <- html_nodes(webpage,
                    '.genre') %>% 
  html_text()
genre <- gsub(",.*", "", genre) # take only the first genre
genre <- gsub("[ \n]", "", genre) # remove spaces and newlines; the result is a character vector

genre %>% head()

## [1] "Action"    "Action"    "Biography" "Action"    "Drama"     "Crime"

# IMDb rating (kept in the variable `ranking`)
ranking <- html_nodes(webpage,
                      '.ratings-imdb-rating strong') %>% 
  html_text() %>% 
  as.numeric()

ranking %>% head()

## [1] 7.9 7.0 7.7 6.6 7.4 5.1

# votes
votes <- html_nodes(webpage,
                    '.sort-num_votes-visible span:nth-child(2)') %>% 
  html_text()
votes <- gsub(",", "", 
              votes) %>% as.numeric()

votes %>% length()

## [1] 100

# revenue
revenue <- html_nodes(webpage,
                      '.sort-num_votes-visible span:nth-child(5)') %>%   # == '.ghost~ .text-muted+ span'
  html_text()
# revenue <- html_nodes(webpage,
#                       xpath = '//*[@id="main"]/div/div/div[3]/div[1]/div[3]/p[4]/span[5]') # Using xpath

revenue <- gsub("M", "",
                        revenue) # remove "M" characters
revenue %>% length() # only 87 values for 100 movies, so 13 are missing

## [1] 87

We now have the movie title, description, running time and ranking from the web, and we also want the gross and vote data. However, if you look carefully at the gross data, 13 of the 100 values are missing. Combining vectors of different lengths into one data frame in R raises an error. So I will scrape the vote and gross data together, one string per movie, and then split each string.

votes_revenue <- html_nodes(webpage,
                      '.sort-num_votes-visible') %>%
  html_text(trim = TRUE)

votes_revenue <- strsplit(votes_revenue, "\n    |                ")

votes_revenue %>% head()

## [[1]]
## [1] "Votes:"               "            367,761"  "|"                   
## [4] "Gross:"               "            $315.06M"
## 
## [[2]]
## [1] "Votes:"            "            3,541"
## 
## [[3]]
## [1] "Votes:"               "            143,538"  "|"                   
## [4] "Gross:"               "            $174.34M"
## 
## [[4]]
## [1] "Votes:"               "            270,856"  "|"                   
## [4] "Gross:"               "            $229.02M"
## 
## [[5]]
## [1] "Votes:"               "            294,240"  "|"                   
## [4] "Gross:"               "            $327.48M"
## 
## [[6]]
## [1] "Votes:"             "            39,120" "|"                 
## [4] "Gross:"             "            $6.67M"

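votes <- NULL
revenue <- NULL

for(i in seq_along(votes_revenue)) {
  
  votes[i] <- votes_revenue[[i]][2] # second element holds the vote count
  votes[i] <- gsub(" ", "",
                   gsub(",", "", votes[i])) # drop spaces and thousands separators
  
  # fifth element holds the gross; it is NA when the movie has no gross value
  revenue[i] <- votes_revenue[[i]][5]
  revenue[i] <- gsub(" ", "",
                     gsub("M", "", 
                          revenue[i])) # drop spaces and the "M" suffix
  revenue[i] <- substring(revenue[i], 2) # drop the leading "$"
}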

revenue %>% length();votes %>% length()

## [1] 100

## [1] 100

revenue %>% head(30)

##  [1] "315.06" NA       "174.34" "229.02" "327.48" "6.67"   NA      
##  [8] "404.52" "107.83" "92.05"  "620.18" "0.14"   "100.23" "58.06" 
## [15] "209.73" "63.86"  "334.20" "412.56" "3.48"   "176.04" "389.81"
## [22] "2.52"   "188.37" "18.10"  "54.51"  "17.80"  "130.17" "33.80" 
## [29] "30.01"  "12.64"
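
Another way to keep the vectors aligned, sketched here under assumptions, is to select one container node per movie first and then call html_node inside each container. html_node returns one result per input node, so a movie without a gross value yields NA instead of being dropped. The markup and the class names `item`, `votes` and `gross` are hypothetical; the real page uses different markup.

```r
library(rvest)

# Made-up fragment: the second movie has no gross value
page <- xml2::read_html('
<div class="item"><span class="votes">367,761</span><span class="gross">$315.06M</span></div>
<div class="item"><span class="votes">3,541</span></div>
<div class="item"><span class="votes">143,538</span><span class="gross">$174.34M</span></div>')

items <- html_nodes(page, ".item") # one node per movie

# html_node() on a node set returns exactly one result per movie
votes_txt <- items %>% html_node(".votes") %>% html_text()
gross_txt <- items %>% html_node(".gross") %>% html_text()

votes_demo   <- as.numeric(gsub(",", "", votes_txt))
revenue_demo <- as.numeric(gsub("[$M]", "", gross_txt))

revenue_demo
```

Both vectors have one entry per movie, so they can go straight into data.frame().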

# combining all
movies <- data.frame(
  id = rank.data,
  Title = title.data %>% as.character(),
  Description = description %>% as.character(),
  Runtime = runtime,
  Genre = genre,
  Rank = ranking,
  Votes = votes %>% as.numeric(),
  Revenue_Million = revenue %>% as.numeric()
)
movies %>% glimpse()

## Observations: 100
## Variables: 8
## $ id              <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,...
## $ Title           <fct> Mighty Thor: Battle Royale, A Prayer Before Da...
## $ Description     <fct> 
##     Thor is imprisoned on the planet Sakaar, ...
## $ Runtime         <dbl> 130, 116, 105, 120, 135, 119, 107, 119, 112, 1...
## $ Genre           <fct> Action, Action, Biography, Action, Drama, Crim...
## $ Rank            <dbl> 7.9, 7.0, 7.7, 6.6, 7.4, 5.1, 6.5, 7.0, 7.7, 8...
## $ Votes           <dbl> 367761, 3541, 143538, 270856, 294240, 39120, 2...
## $ Revenue_Million <dbl> 315.06, NA, 174.34, 229.02, 327.48, 6.67, NA, ...

Great, we now have a data set with 100 observations and 8 variables. Let's use it to draw some useful graphs.

suppressMessages(library(ggplot2)) # Draw plots

movies %>%
  dplyr::group_by(Genre) %>% 
  dplyr::summarize(
    cnt = n() 
  ) %>%
  ggplot() +
  geom_col(aes(x = Genre, y = cnt, fill = Genre))

movies %>% 
  ggplot() +
  geom_point(aes(x = Runtime, y = Rank,
                 size = Votes, 
                 col = Genre))

movies %>% 
  ggplot() +
  geom_histogram(aes(x = Runtime, 
                     fill = Genre),
                 bins = 30)
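
Revenue_Million contains missing values, so any numeric summary of it has to handle NA explicitly. As a minimal sketch, the code below uses only the first six rows shown in the output above and computes the mean revenue per genre with base R.

```r
# First six movies, copied from the output above
movies_head <- data.frame(
  Genre = c("Action", "Action", "Biography", "Action", "Drama", "Crime"),
  Revenue_Million = c(315.06, NA, 174.34, 229.02, 327.48, 6.67),
  stringsAsFactors = FALSE
)

# na.rm = TRUE drops the missing revenue instead of returning NA
mean_revenue <- tapply(movies_head$Revenue_Million,
                       movies_head$Genre,
                       mean, na.rm = TRUE)
mean_revenue
```

Without na.rm = TRUE the mean for the Action genre would be NA, because one of its three movies has no gross value.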

If you are comfortable with text mining, you can also analyze the movie descriptions.
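
As a very small starting point for that kind of analysis, the sketch below counts word frequencies in two of the descriptions shown earlier, using base R only. It is a simple illustration rather than a full text-mining workflow; a real analysis would at least remove stop words.

```r
# Two descriptions copied from the output above
descriptions <- c(
  "\n    The true story of an English boxer incarcerated in one of Thailand's most notorious prisons as he fights in Muay Thai tournaments to earn his freedom.",
  "\n    Detective Harry Hole investigates the disappearance of a woman whose scarf is found wrapped around an ominous-looking snowman."
)

# Normalise: trim the leading whitespace and lower-case everything
clean <- tolower(trimws(descriptions))

# Split on anything that is not a letter or an apostrophe
words <- unlist(strsplit(clean, "[^a-z']+"))
words <- words[nzchar(words)] # drop empty tokens

# Count the words and sort by frequency
word_counts <- sort(table(words), decreasing = TRUE)
head(word_counts)
```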

2 RSelenium

You can use other packages for web scraping too, but Selenium is useful when a page is generated by JavaScript, because the content you want often does not have its own separate URL. In that case you drive a browser so that the JavaScript runs, read the rendered result, and then extract the desired text. At the time of writing, the RSelenium package was not available on CRAN, so it had to be installed from outside CRAN.

However, the RSelenium code takes a long time to run when knitting this R Markdown document to HTML, so I will not show it here. I also find it quite tricky at first. If you want to study it, please search for tutorials on web scraping with RSelenium.

Web Scraping with R

Using rvest package

Catharina Jisoo Park
