Assignment six

This assignment works with the features page of the FiveThirtyEight website, https://fivethirtyeight.com/features/. The page appears to be open to scraping: its robots.txt does not disallow it, and the terms and conditions do not explicitly prohibit scraping.
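
As a rough illustration of the robots.txt check, here is a minimal base R sketch. `is_disallowed` is a hypothetical helper, and `example_robots` is a made-up file, not FiveThirtyEight's actual robots.txt. The check is deliberately simplified: it only prefix-matches `Disallow` rules in the `User-agent: *` group and ignores `Allow` rules and wildcards.

```r
# Simplified robots.txt check (hypothetical helper, not a full parser):
# TRUE if a Disallow rule in the "User-agent: *" group prefix-matches `path`.
is_disallowed <- function(robots_lines, path) {
  lines <- trimws(sub("#.*$", "", robots_lines))  # drop comments
  lines <- lines[nzchar(lines)]                   # drop empty lines
  in_star_group  <- FALSE
  prev_was_agent <- FALSE
  for (line in lines) {
    key   <- tolower(trimws(sub(":.*$", "", line)))
    value <- trimws(sub("^[^:]*:", "", line))
    if (key == "user-agent") {
      # Consecutive User-agent lines belong to the same group
      in_star_group  <- (value == "*") || (prev_was_agent && in_star_group)
      prev_was_agent <- TRUE
    } else {
      prev_was_agent <- FALSE
      if (key == "disallow" && in_star_group && nzchar(value) &&
          startsWith(path, value)) {
        return(TRUE)
      }
    }
  }
  FALSE
}

# Made-up example file, for illustration only
example_robots <- c(
  "User-agent: *",
  "Disallow: /wp-admin/",
  "Allow: /wp-admin/admin-ajax.php"
)

features_blocked <- is_disallowed(example_robots, "/features/")
admin_blocked    <- is_disallowed(example_robots, "/wp-admin/options.php")

# With network access, the real file could be read with:
# robots_lines <- readLines("https://fivethirtyeight.com/robots.txt")
```

A check like this does not replace reading the site's terms and conditions.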

This exercise scrapes recent articles from that page and summarizes them by category and publication date.

Let’s start by loading libraries we will need:

library(readr)
library(dplyr)
library(tidyverse)
library(rvest)
library(stringr)
library(reshape2)

Using read_html, let’s read in the URL:

fivethirtyeithtEntries <- read_html("https://fivethirtyeight.com/features/")

read_html() downloads and parses the HTML and returns the result as an xml_document object, which is stored as a list. What we are interested in is the node information, which we will look into in detail. Below is the head of that object:

head(fivethirtyeithtEntries)
## $node
## <pointer: 0x9657b70>
## 
## $doc
## <pointer: 0x965d670>

The output above shows only external pointers to the parsed document and its root node, not the page content itself.
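
To see actual content rather than pointers, the parsed document has to be queried with selectors. The sketch below uses a small inline HTML string (a made-up stand-in, not the FiveThirtyEight page) so it runs without network access.

```r
library(rvest)

# Made-up stand-in document; read_html() also accepts literal HTML
doc <- read_html(
  "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"
)

# The parsed page is an xml_document; printing it shows pointers or a tree
# summary, so use selectors to pull out the parts of interest
is_document <- inherits(doc, "xml_document")
page_title  <- html_text(html_node(doc, "title"))
paragraphs  <- html_text(html_nodes(doc, "p"))
```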

Now let’s pull out the details (titles, writers, category and date published) for the first ten pages:

titles <- lapply(paste0('https://fivethirtyeight.com/features/page/', 1:10),
                function(url){
                    url %>% read_html() %>% 
                        html_nodes("#primary .entry-title a") %>% 
                        html_text()
                })
titles <- unlist(titles) # lapply returns one list element per page; collapse them into a single character vector

writers <- lapply(paste0('https://fivethirtyeight.com/features/page/', 1:10),
                function(url){
                    url %>% read_html() %>% 
                        html_nodes(".fn") %>% 
                        html_text() 
                })

writers <- unlist(writers)

category <- lapply(paste0('https://fivethirtyeight.com/features/page/', 1:10),
                function(url){
                    url %>% read_html() %>% 
                        html_nodes("#primary .term") %>% 
                        html_text() 
                })
category <- unlist(category)

datepublished <- lapply(paste0('https://fivethirtyeight.com/features/page/', 1:10),
                function(url){
                    url %>% read_html() %>% 
                        html_nodes(".updated") %>% 
                        html_text() 
                })
datepublished <- unlist(datepublished)
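
The four blocks above download the same ten pages once per field, so every page is fetched four times. A minimal sketch of one way to avoid that, assuming the same CSS selectors, is to parse each page once and reuse the parsed pages for every field. `extract_text` is a hypothetical helper, and the two inline pages are made-up stand-ins so the sketch runs without network access.

```r
library(rvest)

# Hypothetical helper: text of every node matching `selector`,
# collected across a list of already-parsed pages
extract_text <- function(pages, selector) {
  unlist(lapply(pages, function(page) {
    page %>% html_nodes(selector) %>% html_text()
  }))
}

# Made-up stand-in pages (not real FiveThirtyEight markup)
pages <- list(
  read_html('<html><body><div id="primary"><h2 class="entry-title"><a>First</a></h2><span class="term">NFL</span></div></body></html>'),
  read_html('<html><body><div id="primary"><h2 class="entry-title"><a>Second</a></h2><span class="term">NBA</span></div></body></html>')
)

titles_demo   <- extract_text(pages, "#primary .entry-title a")
category_demo <- extract_text(pages, "#primary .term")

# With network access, the real pages could be fetched once,
# pausing between requests to go easy on the server:
# urls  <- paste0("https://fivethirtyeight.com/features/page/", 1:10)
# pages <- lapply(urls, function(u) { Sys.sleep(1); read_html(u) })
```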

A peek at the details:

dplyr::glimpse(titles)
##  chr [1:100] "Politics Podcast: The Electoral Challenges Facing The Republican Party" ...
dplyr::glimpse(writers)
##  chr [1:155] "Galen Druke" "Nate Silver" "Perry Bacon Jr." ...
dplyr::glimpse(category)
##  chr [1:100] "Politics Podcast" "Politics Podcast" "Hot Takedown" ...
dplyr::glimpse(datepublished)
##  chr [1:100] "Dec. 8, 2020" "Dec. 8, 2020" "Dec. 8, 2020" "Dec. 8, 2020" ...
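# Remove the newline and tab padding around each title
dataTitles <- titles %>% str_replace("\n\t\t\t\t", "") %>% str_replace("\t\t\t", "")
# writers is left out: it has 155 entries (presumably because some articles have several authors), so it cannot be matched row by row with the 100 articles
class.df <- data.frame(datepublished, category, dataTitles)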
dplyr::glimpse(class.df)
## Rows: 100
## Columns: 3
## $ datepublished <chr> "Dec. 8, 2020", "Dec. 8, 2020", "Dec. 8, 2020", "Dec. 8…
## $ category      <chr> "Politics Podcast", "Politics Podcast", "Hot Takedown",…
## $ dataTitles    <chr> "Politics Podcast: The Electoral Challenges Facing The …
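
The writers vector has 155 entries for 100 articles, so it cannot simply be added as a column. One possible approach, sketched below, is to extract fields article by article so that every row stays aligned. The `article` selector and the inline HTML are assumptions for illustration; the real page structure may differ.

```r
library(rvest)

# Made-up stand-in page with one two-author article and one single-author article
page <- read_html('
<html><body><div id="primary">
  <article>
    <h2 class="entry-title"><a>\n\t\t\t\tTitle A\t\t\t</a></h2>
    <a class="fn">Author One</a><a class="fn">Author Two</a>
  </article>
  <article>
    <h2 class="entry-title"><a>Title B</a></h2>
    <a class="fn">Author Three</a>
  </article>
</div></body></html>')

# One node per article (hypothetical selector)
articles <- html_nodes(page, "#primary article")

per_article <- data.frame(
  # html_node() returns exactly one match per article; trimws() strips
  # the newline and tab padding around the title
  title = trimws(html_text(html_node(articles, ".entry-title a"))),
  # Collapse all authors of an article into a single string
  writers = vapply(articles, function(a) {
    paste(html_text(html_nodes(a, ".fn")), collapse = ", ")
  }, character(1)),
  stringsAsFactors = FALSE
)
```

Here `per_article` has one row per article, with co-authors joined by commas.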

Let’s group the articles by category and see which categories are written about most often:

class.df %>% group_by(category) %>% summarise(numberOfArticles = n_distinct(dataTitles)) %>% arrange(desc(numberOfArticles))
## # A tibble: 28 x 2
##    category              numberOfArticles
##    <chr>                            <int>
##  1 2020 Election                       20
##  2 Politics Podcast                    18
##  3 NFL                                  9
##  4 NBA                                  8
##  5 College Football                     6
##  6 Hot Takedown                         5
##  7 College Basketball                   3
##  8 Georgia Senate Runoff                3
##  9 Soccer                               3
## 10 COVID-19                             2
## # … with 18 more rows

The top category is 2020 Election, with 20 of the last 100 articles.
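
For reference, dplyr's count() gives a shorter way to produce this kind of table. The sketch below uses made-up toy data rather than the scraped articles. Note one difference: count() counts rows, while n_distinct(dataTitles) counts unique titles, so the two only agree when no title is repeated within a category.

```r
library(dplyr)

# Toy data for illustration only
toy <- data.frame(
  category = c("NFL", "NBA", "NFL", "Soccer", "NFL", "NBA"),
  title    = paste("Article", 1:6),
  stringsAsFactors = FALSE
)

# Count rows per category and sort from most to least frequent
counts <- toy %>% count(category, sort = TRUE, name = "numberOfArticles")
```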

Let’s group the articles by publication date and see which days were busiest at FiveThirtyEight:

class.df %>% group_by(datepublished) %>% summarise(numberOfArticles = n_distinct(dataTitles)) %>% arrange(desc(numberOfArticles))
## # A tibble: 22 x 2
##    datepublished            numberOfArticles
##    <chr>                               <int>
##  1 Nov. 12, 2020                           8
##  2 Nov. 23, 2020                           7
##  3 Nov. 9, 2020                            7
##  4 Dec. 3, 2020                            5
##  5 Nov. 10, 2020                           5
##  6 Nov. 16, 2020                           5
##  7 Nov. 19, 2020                           5
##  8 Nov. 24, 2020                           5
##  9 Nov. 30, 2020                           5
## 10 Dec. 1, 2020                            4
## # … with 12 more rows

The busiest day was Nov. 12, 2020, with 8 articles.
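
The dates are stored as text, so ties sort alphabetically ("Nov. 23, 2020" comes before "Nov. 9, 2020"). A minimal base R sketch for converting them to Date values follows. `parse_ap_date` is a hypothetical helper that assumes every string looks like "Mon. D, YYYY" and starts with the first three letters of the English month name.

```r
# Hypothetical helper: convert strings such as "Dec. 8, 2020" to Date.
# Uses month.abb (always English) instead of the locale-dependent %b format.
parse_ap_date <- function(x) {
  month <- match(substr(x, 1, 3), month.abb)
  day   <- as.integer(sub("^[A-Za-z]+\\.?[[:space:]]+([0-9]{1,2}),.*$", "\\1", x))
  year  <- as.integer(sub("^.*,[[:space:]]*([0-9]{4})$", "\\1", x))
  as.Date(ISOdate(year, month, day))
}

dates_demo <- c("Nov. 23, 2020", "Nov. 9, 2020", "Dec. 8, 2020")
parsed     <- parse_ap_date(dates_demo)

# Chronological order instead of alphabetical order
chronological <- dates_demo[order(parsed)]
```

With real dates in place, the articles could also be grouped by week or weekday.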

Let’s build some nicely formatted tables and plots:

Table of all 100 articles:

library(DT)
datatable(class.df)

What are the recurring terms across all the article titles? (Getting fancy and exploring NLP.)

library(tm)
library(NLP)
library(wordcloud) # this requires the tm and NLP packages

wordcloud(class.df$dataTitles, min.freq = 1)  # min.freq = 1 keeps words that appear only once

The word cloud shows that Biden and Trump are prominent terms in the titles.
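
A word cloud sizes each word by how often it occurs, so the same finding can be checked with a plain frequency table. The base R sketch below is only an approximation of that idea: the titles and the stop-word list are made up for illustration, and wordcloud's own preprocessing may differ.

```r
# Made-up titles for illustration only
titles_demo <- c(
  "Biden Leads Trump In Polls",
  "What Trump Said",
  "Biden Picks A Cabinet"
)

# Lowercase, split on anything that is not a letter, digit or apostrophe
words <- unlist(strsplit(tolower(titles_demo), "[^a-z0-9']+"))
words <- words[nzchar(words)]

# Tiny illustrative stop-word list (an assumption, not tm's list)
stop_words <- c("a", "in", "the", "what", "of", "and", "to")
words <- words[!words %in% stop_words]

# Frequency table, most common words first
freq <- sort(table(words), decreasing = TRUE)
```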

ggplot(class.df, aes(x=factor(1), fill=category))+
  geom_bar(width = 1)+
  coord_polar("y")

The pie chart confirms that 2020 Election is the most common category.
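
With 28 categories, many pie slices are too thin to compare. As one possible alternative, the sketch below draws a sorted horizontal bar chart. It uses made-up toy data, and the choice of geom_col() with coord_flip() is just one option.

```r
library(ggplot2)

# Toy data for illustration only
toy <- data.frame(
  category = c("2020 Election", "2020 Election", "NFL", "NBA", "NFL", "2020 Election"),
  stringsAsFactors = FALSE
)

# Count articles per category
counts <- as.data.frame(table(category = toy$category), responseName = "n")

# Horizontal bars ordered by count, largest at the top
p <- ggplot(counts, aes(x = reorder(category, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Number of articles")
```

Printing `p` renders the chart.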
