To knit this R Markdown file to PDF, you may first need to run the following R code:

tinytex::install_tinytex()

Review of web scraping

library(XML)
library(stringr)

baseurl <- "https://movie.naver.com/movie/point/af/list.nhn?&page="
pages <- seq(from=1, to=1000, by=1) # Maximum of 1000 pages
urls <- str_c(baseurl, pages)
class(urls)
length(urls)
head(urls)
tail(urls)

Things to remember before web scraping

  1. Different XPATH expressions

  2. A dynamic, not a static, display of content in a page layout

library(XML)
library(rvest)

page_captured <- function(url){
  page <- readLines(url)
  page_parsed <- htmlParse(page)
  return(page_parsed)
}

page_list <- lapply(urls, page_captured)
length(page_list)
class(page_list)
#page_list[[1]]

Collecting movie titles from HTML documents

Title 1: '//*[@id="old_content"]/table/tbody/tr[1]/td[2]/a[1]'
Title 2: '//*[@id="old_content"]/table/tbody/tr[2]/td[2]/a[1]'
Title 3: '//*[@id="old_content"]/table/tbody/tr[3]/td[2]/a[1]'
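As a minimal sketch of why dropping the row index generalizes the per-row expressions, the following parses a small hand-written HTML fragment and compares an indexed XPath with an unindexed one. The fragment, the movie names, and the variable names toy_html, toy_doc, first_title, and all_titles are made up for illustration; they only mimic the table structure and are not the real page.

```r
library(XML)

# Hypothetical HTML fragment that imitates the structure of the review table
toy_html <- '<html><body><div id="old_content"><table><tbody>
<tr><td>1</td><td><a href="#">Movie A</a><div><em>10</em></div></td></tr>
<tr><td>2</td><td><a href="#">Movie B</a><div><em>8</em></div></td></tr>
<tr><td>3</td><td><a href="#">Movie C</a><div><em>6</em></div></td></tr>
</tbody></table></div></body></html>'

# asText = TRUE tells htmlParse that the input is HTML text, not a file name
toy_doc <- htmlParse(toy_html, asText = TRUE)

# tr[1] selects only the first row
first_title <- xpathSApply(
  toy_doc, '//*[@id="old_content"]/table/tbody/tr[1]/td[2]/a[1]', xmlValue
)

# tr without an index selects every row at once
all_titles <- xpathSApply(
  toy_doc, '//*[@id="old_content"]/table/tbody/tr/td[2]/a[1]', xmlValue
)

first_title  # "Movie A"
all_titles   # "Movie A" "Movie B" "Movie C"
```

Removing the [1], [2], [3] index from tr is what lets one expression collect all titles on a page.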

titles <- xpathSApply(page_list[[1]], '//*[@id="old_content"]/table/tbody/tr/td[2]/a[1]', xmlValue)
titles

titles <- repair_encoding(titles, from="utf-8")
titles

title_extractor <- function(page_parsed){
  titles <- xpathSApply(page_parsed, '//*[@id="old_content"]/table/tbody/tr/td[2]/a[1]', xmlValue)
  titles <- repair_encoding(titles, from="utf-8") # Repair Encoding
  return(titles)
}

title_list <- lapply(page_list, title_extractor)
length(title_list)
title_list[[1000]]

Collecting movie rates from HTML documents

Rate 1: '//*[@id="old_content"]/table/tbody/tr[1]/td[2]/div/em'
Rate 2: '//*[@id="old_content"]/table/tbody/tr[2]/td[2]/div/em'

rates <- xpathSApply(page_list[[1]], '//*[@id="old_content"]/table/tbody/tr/td[2]/div/em', xmlValue)
rates
class(rates)

rates <- as.integer(rates)
rates
class(rates)

rate_extractor <- function(page_parsed){
  rates <- xpathSApply(page_parsed, '//*[@id="old_content"]/table/tbody/tr/td[2]/div/em', xmlValue)
  rates <- as.integer(rates)
  return(rates)
}
rate_extractor(page_list[[1]])

rate_list <- lapply(page_list, rate_extractor)
length(rate_list)
rate_list[[1000]]

Combining movie titles and rates into a data frame

title_list[[1]]
rate_list[[1]]
library(tidyverse) # load first so that the help page for tibble can be found
?tibble
tibble(title = title_list[[1]], rate = rate_list[[1]])

df_1 <- tibble(title = title_list[[1]], rate = rate_list[[1]])
df_1

df_2 <- tibble(title = title_list[[2]], rate = rate_list[[2]])
df_2

#...

df_1000 <- tibble(title = title_list[[1000]], rate = rate_list[[1000]])
df_1000

Apply a combining function to multiple list elements

?cbind
?mapply

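# SIMPLIFY = FALSE always returns a list with one matrix per page;
# the default may collapse the results into a single matrix when all pages have the same size
movie_list <- mapply(cbind, title_list, rate_list, SIMPLIFY = FALSE)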
class(movie_list)
movie_list[[1000]]
?rbind
?do.call

rbind(rbind(movie_list[[1]], movie_list[[2]]), movie_list[[3]])

movie_list[1:3]

do.call(rbind, movie_list[1:3])

movie_df <- do.call(rbind, movie_list)
dim(movie_df)
class(movie_df)
movie_df <- as_tibble(movie_df)
class(movie_df)
movie_df
colnames(movie_df)
colnames(movie_df) <- c("Title","Rate")
movie_df

class(movie_df$Title)
movie_df$Title[1:10]
class(movie_df$Rate)
movie_df$Rate[1:10]

movie_df$Rate <- as.integer(movie_df$Rate)
class(movie_df$Rate)

movie_df
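The combining steps in this section can be tried on two tiny made-up "pages" without any scraping. This is only a sketch in base R: the names toy_titles, toy_rates, toy_list, toy_matrix, and toy_df and all values are hypothetical, and SIMPLIFY = FALSE is used here as an assumption to guarantee that mapply returns a list.

```r
# Two hypothetical pages with two reviews each
toy_titles <- list(c("A", "B"), c("C", "D"))
toy_rates  <- list(c(10L, 8L), c(7L, 9L))

# Pair the i-th titles with the i-th rates; each element is a 2 x 2 matrix
toy_list <- mapply(cbind, toy_titles, toy_rates, SIMPLIFY = FALSE)

# With the default SIMPLIFY = TRUE, equally sized results collapse into one matrix
collapsed <- mapply(cbind, toy_titles, toy_rates)
is.matrix(collapsed)  # TRUE

# Stack all per-page matrices; same as rbind(toy_list[[1]], toy_list[[2]])
toy_matrix <- do.call(rbind, toy_list)
dim(toy_matrix)       # 4 2

# cbind coerced the rates to character, so convert them back to integer
toy_df <- data.frame(
  Title = toy_matrix[, 1],
  Rate  = as.integer(toy_matrix[, 2])
)
toy_df
```

Because cbind builds a matrix, and a matrix holds a single type, the rates come out as character strings. That is why the Rate column has to be converted with as.integer afterwards.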

Example of Web Data Analysis

movie_df %>%
  count(Title, sort = TRUE)

movie_df %>%
  left_join(movie_df %>% count(Title, sort = TRUE), by = "Title")

movie_df %>%
  left_join(movie_df %>% count(Title, sort = TRUE), by = "Title") %>%
  filter(n > 29) %>%
  group_by(Title) %>%
  summarise(Average = mean(Rate)) %>%
  arrange(desc(Average))

Let’s take a look at how some of these functions are used to analyze data scraped from the web.

First, the pipe operator. The pipe operator %>% is very useful in data processing.

Background of the pipe operator in R

Let’s say we have two functions, do.call(A) => B and as_tibble(B) => C. The function do.call processes the input A and returns the outcome B, and the function as_tibble processes the input B and returns the outcome C.

So far, we have run the two functions step by step. Using the pipe operator, we can chain them together by taking the output of one function and inserting it into the next. In short, “chaining” means that we pass an intermediate result on to the next function. Here, “as_tibble follows do.call”: as_tibble(do.call(A))

In R, we can pass the result of one command to the next with the pipe operator. As we’ve seen, R code often contains many parentheses, ( and ), especially when it is complex: a function is nested in another function, which is nested in yet another, and so on. This makes R code hard to read and understand. This is where %>% comes to the rescue.

Here’s an example:

library(tidyverse)

rename(as_tibble(do.call(rbind, movie_list)), Title = V1, Rate = V2)

# This nested call looks complicated, and it is hard to see what each function does.

But with the help of %>%, we can rewrite the above code step by step as follows:

movie_df <- do.call(rbind, movie_list) %>% 
  as_tibble() %>% 
  rename(Title = V1, Rate = V2)
movie_df

Using the pipe operator, we can write R code in an intuitive, readable way: a sequence of functions is chained together and run in the order in which it is written.
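The equivalence between nested calls and a piped chain can also be checked on a toy numeric vector. This is a small sketch that loads magrittr (the package that supplies %>% and is attached with the tidyverse); the vector and the names nested and piped are made up for illustration.

```r
library(magrittr)

# Nested form: read from the inside out (sum, then sqrt, then round)
nested <- round(sqrt(sum(c(1, 4, 9))), 1)

# Piped form: read from left to right, in the order the steps run
piped <- c(1, 4, 9) %>%
  sum() %>%
  sqrt() %>%
  round(1)

identical(nested, piped)  # TRUE
```

In the piped form, the left-hand side becomes the first argument of the function on the right, so x %>% round(1) is the same call as round(x, 1).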

Web Data Analysis with dplyr package

From now on, we are going to work with the movie_df data set, stored as a data frame, and manipulate its observations and variables. Let’s say we want to remove any movie with too few reviews; the example above keeps only titles with more than 29 reviews (filter(n > 29)). The dplyr package provides useful functions for such tasks.

Today, I am going to introduce its basic set of functions and show how to apply them to the movie_df data frame.

dplyr functions

Package dplyr provides useful functions for data manipulation:

| Name | Task |
|------|------|
| left_join() | to add columns from another data frame based on the key |
| filter() | to select cases (observations) based on their values |
| group_by() | to group cases (observations) defined by variables |
| summarise() | to create a new data frame by reducing multiple values down to a single summary |
| arrange() | to change the ordering of the rows |
| count() | to count the number of observations based on their values |
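To see how these functions fit together, here is a minimal sketch on a six-row toy data frame. The data, the names toy_df, title_counts, and toy_summary, and the threshold of 2 reviews are all made up for illustration; the threshold is lowered only because the toy data is so small.

```r
library(dplyr)

# Hypothetical reviews: three for "A", two for "B", one for "C"
toy_df <- tibble::tibble(
  Title = c("A", "A", "A", "B", "B", "C"),
  Rate  = c(10L, 8L, 9L, 6L, 4L, 7L)
)

# count(): number of reviews per title, most reviewed first
title_counts <- toy_df %>% count(Title, sort = TRUE)

toy_summary <- toy_df %>%
  left_join(title_counts, by = "Title") %>%  # add the review count n to every row
  filter(n >= 2) %>%                         # keep titles with at least 2 reviews
  group_by(Title) %>%                        # one group per title
  summarise(Average = mean(Rate)) %>%        # one row per group
  arrange(desc(Average))                     # highest average first

toy_summary
# "A" has an average of 9 and "B" an average of 5; "C" is dropped
```

Each step returns a data frame, which is why the functions can be chained with %>% in any order that makes sense for the analysis.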

filter(): Reduce rows/observations with matching conditions

Filtering keeps only the observations in which a particular variable matches a specific value or condition. filter() therefore takes a logical expression that refers to variables within the data frame and keeps the rows where the expression is TRUE. For instance, we can filter on whether the value of Title is a duplicate. The base function duplicated() checks each element of a vector (or of a variable in a data frame) and returns a logical vector: TRUE for an element that repeats an earlier one, and FALSE for a first occurrence.

movie_df %>% filter(duplicated(Title))
# 7,290 titles are duplicated (replicated)

# Putting the ! operator in front reverses the logical vector returned by duplicated(), so we keep only the first row for each title.
movie_df %>% filter(!duplicated(Title))
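The behaviour of duplicated() is easiest to see on a short vector. The sketch below uses made-up values and hypothetical names (demo_titles, demo_df, repeats, firsts) to show which rows each of the two filters keeps.

```r
library(dplyr)

demo_titles <- c("A", "B", "A", "C", "B", "A")

# TRUE marks an element that has already appeared earlier in the vector
dup_flags <- duplicated(demo_titles)
dup_flags  # FALSE FALSE TRUE FALSE TRUE TRUE

demo_df <- data.frame(Title = demo_titles, Rate = c(10L, 7L, 8L, 9L, 6L, 5L))

# Rows whose title repeats an earlier row
repeats <- demo_df %>% filter(duplicated(Title))

# First row for each title
firsts <- demo_df %>% filter(!duplicated(Title))

repeats$Title  # "A" "B" "A"
firsts$Title   # "A" "B" "C"
```

Note that the first occurrence of a repeated title is not flagged, so filter(duplicated(Title)) returns the second and later rows only.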