tinytex::install_tinytex()
library(XML)
library(stringr)
baseurl <- "https://movie.naver.com/movie/point/af/list.nhn?&page="
pages <- seq(from=1, to=1000, by=1) # Maximum of 1000 pages
urls <- str_c(baseurl, pages)
class(urls)
length(urls)
head(urls)
tail(urls)
Different XPATH expressions
A dynamic, not a static, display of content in a page layout
library(XML)
library(rvest)
page_captured <- function(url){
  page <- readLines(url)
  page_parsed <- htmlParse(page)
  return(page_parsed)
}
page_list <- lapply(urls, page_captured)
length(page_list)
class(page_list)
page_list[[1]]
Title 1: '//*[@id="old_content"]/table/tbody/tr[1]/td[2]/a[1]'
Title 2: '//*[@id="old_content"]/table/tbody/tr[2]/td[2]/a[1]'
Title 3: '//*[@id="old_content"]/table/tbody/tr[3]/td[2]/a[1]'
titles <- xpathSApply(page_list[[1]], '//*[@id="old_content"]/table/tbody/tr/td[2]/a[1]', xmlValue)
titles
titles <- repair_encoding(titles, from="utf-8")
titles
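The per-row expressions for Title 1, Title 2 and Title 3 differ only in the index on tr; dropping that index makes one XPath match every row at once. A minimal sketch on a made-up HTML fragment (the markup and movie names below are assumptions for illustration, not the actual page source):

```r
library(XML)

# Hypothetical fragment that mimics the table layout the XPath expects
toy_html <- '
<div id="old_content">
  <table><tbody>
    <tr><td>1</td><td><a href="#">Movie A</a><div><em>10</em></div></td></tr>
    <tr><td>2</td><td><a href="#">Movie B</a><div><em>7</em></div></td></tr>
  </tbody></table>
</div>'

toy_page <- htmlParse(toy_html, asText = TRUE)

# With an index on tr, only that one row is matched
first_title <- xpathSApply(toy_page, '//*[@id="old_content"]/table/tbody/tr[1]/td[2]/a[1]', xmlValue)

# Without an index on tr, every row is matched
all_titles <- xpathSApply(toy_page, '//*[@id="old_content"]/table/tbody/tr/td[2]/a[1]', xmlValue)

first_title
all_titles
```

The same idea carries over to the rating expression, where only the last steps of the path change.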
title_extractor <- function(page_parsed){
  titles <- xpathSApply(page_parsed, '//*[@id="old_content"]/table/tbody/tr/td[2]/a[1]', xmlValue)
  titles <- repair_encoding(titles, from="utf-8") # Repair Encoding
  return(titles)
}
title_list <- lapply(page_list, title_extractor)
length(title_list)
title_list[[1000]]
Rate 1: '//*[@id="old_content"]/table/tbody/tr[1]/td[2]/div/em'
Rate 2: '//*[@id="old_content"]/table/tbody/tr[2]/td[2]/div/em'
rates <- xpathSApply(page_list[[1]], '//*[@id="old_content"]/table/tbody/tr/td[2]/div/em', xmlValue)
rates
class(rates)
mean(rates) # Returns NA with a warning because rates is still a character vector
rates <- as.integer(rates)
rates
class(rates)
mean(rates)
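xmlValue returns text, so the ratings arrive as character strings and have to be converted before any arithmetic. A small sketch with made-up values (the three ratings are invented for illustration):

```r
# Made-up ratings, stored as text the way xmlValue returns them
rates_chr <- c("10", "8", "6")

class(rates_chr)
# mean() needs numeric input; on character input it warns and returns NA
mean_chr <- suppressWarnings(mean(rates_chr))
mean_chr

# After conversion the mean is computed as expected
rates_int <- as.integer(rates_chr)
class(rates_int)
mean_int <- mean(rates_int)
mean_int
```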
rate_extractor <- function(page_parsed){
  rates <- xpathSApply(page_parsed, '//*[@id="old_content"]/table/tbody/tr/td[2]/div/em', xmlValue)
  rates <- as.integer(rates)
  return(rates)
}
rate_extractor(page_list[[1]])
rate_list <- lapply(page_list, rate_extractor)
length(rate_list)
rate_list[[1000]]
title_list[[1000]]
rate_list[[1000]]
?tibble
library(tidyverse)
tibble(title = title_list[[1]], rate = rate_list[[1]])
title_list[[1]]
rate_list[[1]]
df_1 <- tibble(title = title_list[[1]], rate = rate_list[[1]])
df_1
df_2 <- tibble(title = title_list[[2]], rate = rate_list[[2]])
df_2
#...
df_1000 <- tibble(title = title_list[[1000]], rate = rate_list[[1000]])
df_1000
?rbind
?cbind
?mapply
movie_list <- mapply(cbind, title_list, rate_list, SIMPLIFY = FALSE) # SIMPLIFY = FALSE guarantees a list of matrices, one per page
class(movie_list)
movie_list[[1]]
movie_list[[2]]
movie_list[[1000]]
class(movie_list[[1]])
?rbind
?do.call
rbind(rbind(movie_list[[1]], movie_list[[2]]), movie_list[[3]])
movie_list[1:3]
do.call(rbind, movie_list[1:3])
movie_df <- do.call(rbind, movie_list)
dim(movie_df)
class(movie_df)
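The combination of mapply and do.call can be tried on a small scale first. The sketch below uses two made-up "pages" (titles and ratings are invented); SIMPLIFY = FALSE is passed so that mapply always returns a list with one matrix per page, even when every page has the same number of rows.

```r
# Two hypothetical pages of titles and ratings
toy_titles <- list(c("Movie A", "Movie B"), c("Movie C", "Movie A"))
toy_rates  <- list(c(10L, 7L), c(8L, 9L))

# Bind each page's titles and ratings column-wise: one matrix per page
toy_pages <- mapply(cbind, toy_titles, toy_rates, SIMPLIFY = FALSE)
toy_pages[[1]]

# Stack all per-page matrices row-wise in a single call,
# equivalent to rbind(toy_pages[[1]], toy_pages[[2]])
toy_all <- do.call(rbind, toy_pages)
dim(toy_all)
class(toy_all)

# cbind() coerced the integer ratings to text to fit a character matrix
is.character(toy_all)
```

Because the stacked result is a character matrix, the rating column has to be converted back to integer after the matrix is turned into a data frame.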
movie_df <- as_tibble(movie_df)
class(movie_df)
movie_df
colnames(movie_df)
colnames(movie_df) <- c("Title","Rate")
movie_df
class(movie_df$Title)
movie_df$Title[1:10]
class(movie_df$Rate)
movie_df$Rate[1:10]
movie_df$Rate <- as.integer(movie_df$Rate)
class(movie_df$Rate)
movie_df
movie_df %>%
  count(Title, sort=TRUE)
movie_df %>%
  count(Title, sort=TRUE) %>%
  ggplot(aes(Title, n)) +
  geom_col()
movie_df %>%
  group_by(Title) %>%
  summarise(mean = mean(Rate)) %>%
  arrange(desc(mean))
movie_df %>%
  left_join(movie_df %>% count(Title, sort=TRUE), by="Title")
movie_df %>%
  left_join(movie_df %>% count(Title, sort=TRUE), by="Title") %>%
  filter(n > 29) %>%
  group_by(Title) %>%
  summarise(Average = mean(Rate)) %>%
  arrange(desc(Average))
Let’s take a look at how some functions are used for analyzing data scraped from the web.
First, the pipe operator. The pipe operator %>% is very useful in data processing.
Let’s say we have two function calls: do.call(A) => B and as_tibble(B) => C. That is, do.call processes the input A and returns the outcome B, and as_tibble processes the input B and returns the outcome C.
So far, we have run the two functions step by step. Using the pipe operator, we can chain these two functions together by taking the output of one function and inserting it into the next. In short, “chaining” means that we pass an intermediate result on to the next function. Here, as_tibble follows do.call: as_tibble(do.call(A)).
In R, the pipe operator passes the result of one command on to the next. As we’ve seen, R code often contains many parentheses, ( and ), especially when it is complex: functions are nested inside other functions, which are nested inside yet others, and so on. This makes R code hard to read and understand. This is where %>% comes to the rescue.
Here’s an example:
library(tidyverse)
rename(as_tibble(do.call(rbind, movie_list)), Title = V1, Rate = V2)
# This nested call looks complicated, and it is hard to see what each function does.
But with the help of %>%, we can rewrite the above code step by step as follows:
movie_df <- do.call(rbind, movie_list) %>%
  as_tibble() %>%
  rename(Title = V1, Rate = V2)
movie_df
Using the pipe operator, we can write the R input in an intuitively simple way while chaining a sequence of multiple functions together to be run.
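To check that the nested and the piped forms really do the same work, both can be run on a small stand-in for movie_list. This is only a sketch: toy_list and its contents are made up, and the matrix columns are named V1 and V2 explicitly so the example does not depend on how tibble names unnamed columns.

```r
library(dplyr)
library(tibble)

# Made-up stand-in for movie_list: one character matrix per page
toy_list <- list(
  cbind(V1 = c("Movie A", "Movie B"), V2 = c("10", "7")),
  cbind(V1 = c("Movie C", "Movie A"), V2 = c("8", "9"))
)

# Nested form: has to be read from the inside out
nested <- rename(as_tibble(do.call(rbind, toy_list)), Title = V1, Rate = V2)

# Piped form: reads from top to bottom, one step per line
piped <- do.call(rbind, toy_list) %>%
  as_tibble() %>%
  rename(Title = V1, Rate = V2)

# Both forms produce the same tibble
identical(nested, piped)
```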
dplyr package
From now on, we are going to work with the movie_df data set, in data frame format, to manipulate its observations and variables. Let’s say we want to remove any movie with fewer than a given number of reviews. The dplyr package provides useful functions for such tasks.
Today, I am going to introduce you to its basic set of functions and show you how to apply them to the movie_df data frame.
dplyr functions
The dplyr package provides useful functions for data manipulation:
Name | Task |
---|---|
left_join() | to add columns from another data frame based on the key |
filter() | to select cases (observations) based on their values |
group_by() | to group cases (observations) defined by variables |
summarise() | to create a new data frame by reducing multiple values down to a single summary |
arrange() | to change the ordering of the rows |
count() | to count the number of observations based on their values |
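The six functions in the table can be combined into a single pipeline. A minimal sketch on made-up reviews (the data frame toy_df and the threshold of two reviews are assumptions chosen to keep the example small):

```r
library(dplyr)
library(tibble)

# Made-up reviews: one row per review
toy_df <- tibble(
  Title = c("Movie A", "Movie A", "Movie A", "Movie B", "Movie B", "Movie C"),
  Rate  = c(10L, 8L, 9L, 6L, 4L, 10L)
)

# count(): number of reviews per title, most reviewed first
review_counts <- toy_df %>% count(Title, sort = TRUE)
review_counts

toy_avg <- toy_df %>%
  left_join(review_counts, by = "Title") %>%  # attach each title's review count
  filter(n >= 2) %>%                          # keep titles with at least 2 reviews
  group_by(Title) %>%                         # one group per title
  summarise(Average = mean(Rate)) %>%         # one average rating per group
  arrange(desc(Average))                      # highest average first

toy_avg
```

Filtering on the review count before averaging keeps a title with a single perfect score, such as Movie C here, from topping the ranking.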
filter(): Reduce rows/observations with matching conditions
Filtering is a common task: it identifies and keeps the observations in which a particular variable meets a specific condition. filter() therefore takes an expression that refers to a variable within the data frame and keeps the rows where that expression is TRUE. For instance, we can filter on the variable Title to find titles that appear more than once. The base function duplicated determines which elements of a vector, or of a variable in a data frame, repeat an earlier element, and returns a logical vector marking each element (row) as a duplicate (TRUE) or not (FALSE). The first occurrence of each value is marked FALSE.
movie_df %>% filter(duplicated(Title))
# 7,290 rows repeat a title that already appeared earlier
# Putting the ! operator in front reverses the logical vector returned by duplicated. This means we keep only the first occurrence of each title.
movie_df %>% filter(!duplicated(Title))
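The behavior of duplicated is easiest to see on a handful of rows. A small sketch with made-up data (toy_df is hypothetical):

```r
library(dplyr)
library(tibble)

toy_df <- tibble(
  Title = c("Movie A", "Movie B", "Movie A", "Movie C", "Movie A"),
  Rate  = c(10L, 7L, 9L, 8L, 6L)
)

# TRUE marks every repeat of a title; the first occurrence stays FALSE
dup <- duplicated(toy_df$Title)
dup

# Rows whose title appeared earlier (the 3rd and 5th rows)
repeats <- toy_df %>% filter(duplicated(Title))
repeats

# First occurrence of each title (the 1st, 2nd and 4th rows)
firsts <- toy_df %>% filter(!duplicated(Title))
firsts
```

Note that the first filter does not return the first review of Movie A, only its later repeats, so the row count it reports is the number of repeated rows rather than the number of distinct titles that occur more than once.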