Introduction

As a fan of television dramas I often find myself trying to convince friends to watch certain programs. Inspired by several chapters in Text Minining with R I decided to pull the main characters/most used words to describe the first season of my favorite 5 television shows and use some basic sentiment analysis to see if they are considered positive or negative.

Top 5 Television Shows

Load libraries

library(rvest)
## Warning: package 'rvest' was built under R version 3.4.2
## Loading required package: xml2
## Warning: package 'xml2' was built under R version 3.4.2
library(stringr)
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.4.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidytext)
## Warning: package 'tidytext' was built under R version 3.4.2
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 3.4.4
## Loading required package: RColorBrewer
library(ggplot2)

Web Scraping the Data

From the IMDB site I will scrape data for each show for their inaugural season using the rvest package along with assistance from CSS Selector Gadget

The Wire

Create a vector entitled url which will store the web address to be scraped. Next, create another vector named webpage which will use the read_html feature from the rvest package to read the contents of url.

#obtaining the web address for the first season of The wire
url <- 'https://www.imdb.com/title/tt0306414/episodes?season=1&ref_=tt_eps_sn_1'

#using the read_html function from the rvest package
webpage <- read_html(url)

Next, we create wire_description_html which will utilize the html_nodes feature which will extract contents of webpage and .item_description, a CSS (Cascading Style Sheet) field, which houses the description of each episode; CSS selector was identified by using the CSS Selecotor Gadget extension on Google Chrome.

Following the previous step, we create wire_description which extracts attributes, text and tag name information from wire_description_html and then view a portion using the head feature in base R.

#scrape each episodes description using information from the selector gadget and using the html_nodes function from the rvest package
wire_description_html <- html_nodes(webpage,'.item_description')

#convert the description data
wire_description <- html_text(wire_description_html)

#view the head of wire_description
head(wire_description)
## [1] "\nBaltimore Det. Jimmy McNulty finds himself in hot water with his superior Major William Rawls after a drug dealer, D'Angelo Barksdale who is charged with three murders, is acquitted. McNulty knows the judge in question and although it's not his case, he's called into chambers to explain what happened. Obviously key witnesses recanted their police statements on the stand but McNulty doesn't underplay Barksdale's role in at least 7 other murders. When the judge's raises his concerns at the senior levels of the police department, they have a new investigation on their ...    "  
## [2] "\nLt. Daniels puts his team together but the extra help he's asked includes several less than stellar officers. It also includes McNulty, who is obviously in the doghouse. When one of the witnesses against D'Angelo Barksdale is found dead, McNulty and his homicide partner 'Bunk' Moreland bring him in for questioning. They have no evidence that he had anything to do with the killing but their instincts tell them otherwise. When news of the witnesses death hits the press, Major Rawls is convinced that McNulty is again responsible. Three of the newly assigned detectives, ...    " 
## [3] "\nAfter two weeks on the job, the team still doesn't have a photo of their principal target, Avon Barksdale. Freamon has a solution when he hears Avon was a Gold Gloves boxer. In the aftermath of the near riot at the towers, Lt. Daniels stands by his men even though he thinks they were stupid to do what they did. He's ordered to lead a raid on the low-rise units in the projects but McNulty rebels and refuses to participate in a raid that's just being put up for show. Dissatisfied with the antiquated equipment they have to work with, McNulty asks his friend at the FBI to...    "
## [4] "\nBodie walks out of detention pretending to be a janitor but Patty is recovering in hospital and is pleased that his injuries will result in a medical pension. Avon Barksdale puts a contract out on Omar and others who robbed him. He wants to make sure everyone on the street knows he's not to be messed with. Judge Phelan keeps the pressure on and Daniels recommends to the Deputy that they need a wire. McNulty suggests they get a warrant to clone the pagers the drug runners use. Det. Fremon is one step ahead of everyone having already found D'Angelo Barksdale's pager ...    "   
## [5] "\nAvon Barksdale is becoming paranoid and thinks he's being watched. He's also worried his phone is being tapped. He also thinks there's a snitch in D'Angelo's crew and he's told not to pay anyone until it's sorted. D'Angelo isn't too pleased when he hears Avon has given Stinkum a new territory. Judge Phelan visits the squad to sign the warrants to clone D'Angelo Barksdale's pager. The numbers they're collecting are coded and don't make much sense. Det. Prez Pryzbylewski figures it out however. McNulty and Greggs try to get information from Omar.    "                           
## [6] "\nAvon takes care of Omar's man Brandon but Wallace, who saw Brandon in the arcade, isn't too comfortable when he sees what they've done to him. Avon gives D'Angelo and Wallace a bonus for their good work. The police meanwhile get authorization to place taps for the pay phones used by D'Angelo and his crew but can only listen in when one of their suspects is using it meaning they'll have to keep the phones under constant observation. McNulty is in a tight spot when Major Rawls gives him a week to report back to his old job. When he realizes the connections McNulty's ...    "

Our next step consists of creating a data frame using dplyr and data_frame and rename the sole column wire_description as text. Once achieved, we then tokenize the contents of the column using unnest_tokens from the tidytext package. This will extract all punctuation and return one word based on the text column and will return all data to a column named word. We also make to_lower false since unnest_tokens converts all words to lowercase by default; this will asssist in identifying main characters since they are usually capitalized.

Mutate is also used to create a linenumber, a numbered column matching each row of data. Lastly, we use select to reorder the columns and view the contents of the data frame.

#create a data frame using dplyr's data_frame function
wire_df <- data_frame(text = wire_description)

#tokenize; convert all text into a word using the unnest_tokens feature in tidytext
wire_df <- wire_df %>%
  unnest_tokens(word, text, to_lower = FALSE) %>%
  mutate(linenumber = row_number()) %>%
  select (linenumber, word)

#view the dataframe
wire_df
## # A tibble: 1,268 x 2
##    linenumber word     
##         <int> <chr>    
##  1          1 Baltimore
##  2          2 Det      
##  3          3 Jimmy    
##  4          4 McNulty  
##  5          5 finds    
##  6          6 himself  
##  7          7 in       
##  8          8 hot      
##  9          9 water    
## 10         10 with     
## # ... with 1,258 more rows

We obtain a count of the words used in the data frame by using the count feature in addition to sort to give us insight on how many words are used. As predicted, common words like ‘the’ and ‘to’ populate the data frame.

wire_df %>% 
  count(word, sort = TRUE)
## Warning: package 'bindrcpp' was built under R version 3.4.2
## # A tibble: 561 x 2
##    word      n
##    <chr> <int>
##  1 the      75
##  2 to       54
##  3 and      34
##  4 a        30
##  5 in       25
##  6 is       24
##  7 of       24
##  8 his      19
##  9 on       16
## 10 Avon     15
## # ... with 551 more rows

Using the dplyr and tidytext packages we apply anti_join with the stop_words dataset. This pre-installed dataset removes common words from the data frame. We also obtain a new count of the words in the data frame before and after applying stop_words.

wire_df <- wire_df %>%
  anti_join(stop_words)
## Joining, by = "word"
wire_df %>% 
  count(word, sort = TRUE)
## # A tibble: 387 x 2
##    word          n
##    <chr>     <int>
##  1 Avon         15
##  2 McNulty      15
##  3 D'Angelo     10
##  4 Stringer     10
##  5 Wallace       8
##  6 Daniels       7
##  7 detail        7
##  8 When          7
##  9 Barksdale     6
## 10 pit           6
## # ... with 377 more rows

Based on the count we can see who are the main characters/most used words of this groundbreaking show in season one. Next we will plot this data in ggplot2 and filter on words used 6 or more times.

wire_df %>%
  count(word, sort = TRUE) %>%
  filter(n > 5) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

Excluding the word When the most used words in the IMDB descriptions do a great job of describing the first season of The Wire. We see that the main characters are McNulty, Avon, Stringer, D’Angelo, Wallace and Daniels. In this show about the crime in Baltimore the police form a detail to survey a drug location known as the pit that is run by the Barksdale organization.

Next we will load the bing dictionary which is one of 3 that is used in the tidytext package. These dictionaries use special algorithms to see if text is considered negative or positive.

wire_bing <- wire_df %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE)
## Joining, by = "word"
wire_bing
## # A tibble: 43 x 3
##    word        sentiment     n
##    <chr>       <chr>     <int>
##  1 dead        negative      2
##  2 killed      negative      2
##  3 pleased     positive      2
##  4 worried     negative      2
##  5 angry       negative      1
##  6 antiquated  negative      1
##  7 bonus       positive      1
##  8 comfortable positive      1
##  9 complex     negative      1
## 10 concerns    negative      1
## # ... with 33 more rows

As we can see the descriptions provided by IMDB don’t really provide a case to plot positive against negative words.

Lastly, we remove the stop_words again and conduct a re-count and provide a pattern of words using the wordcloud package which for the most part describes the first season of The Wire.

wire_df %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
## Joining, by = "word"

Game of Thrones

url <- 'https://www.imdb.com/title/tt0944947/episodes?season=1&ref_=tt_eps_sn_1'
webpage <- read_html(url)

got_description_html <- html_nodes(webpage,'.item_description')

got_description <- html_text(got_description_html)

got_df <- data_frame(text = got_description)

got_df <- got_df %>%
  unnest_tokens(word, text, to_lower = FALSE) %>%
  mutate(linenumber = row_number()) %>%
  select (linenumber, word)

got_df
## # A tibble: 333 x 2
##    linenumber word 
##         <int> <chr>
##  1          1 Jon  
##  2          2 Arryn
##  3          3 the  
##  4          4 Hand 
##  5          5 of   
##  6          6 the  
##  7          7 King 
##  8          8 is   
##  9          9 dead 
## 10         10 King 
## # ... with 323 more rows

Anti-Join/Word Count (Game of Thrones)

got_df <- got_df %>%
  anti_join(stop_words)
## Joining, by = "word"
got_df %>% 
  count(word, sort = TRUE)
## # A tibble: 116 x 2
##    word         n
##    <chr>    <int>
##  1 Jon          7
##  2 Eddard       5
##  3 plans        5
##  4 Robb         5
##  5 Robert       5
##  6 Daenerys     4
##  7 Drogo        4
##  8 Night's      4
##  9 Tyrion       4
## 10 Watch        4
## # ... with 106 more rows

PLOT (Game of Thrones)

got_df %>%
  count(word, sort = TRUE) %>%
  filter(n > 3) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

Excluding the word plans the most used words in the IMDB descriptions continue to do a great job of describing the first season of Game of Thrones.

Negative/Positive (Game of Thrones)

got_bing <- got_df %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE)
## Joining, by = "word"
got_bing
## # A tibble: 18 x 3
##    word       sentiment     n
##    <chr>      <chr>     <int>
##  1 attack     negative      2
##  2 dead       negative      2
##  3 coward     negative      1
##  4 desperate  negative      1
##  5 dying      negative      1
##  6 fall       negative      1
##  7 fallen     negative      1
##  8 freedom    positive      1
##  9 fresh      positive      1
## 10 killed     negative      1
## 11 losing     negative      1
## 12 murder     negative      1
## 13 patience   positive      1
## 14 poison     negative      1
## 15 refuses    negative      1
## 16 revenge    negative      1
## 17 struggling negative      1
## 18 wound      negative      1

Wordcloud (Game of Thrones)

got_df %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
## Joining, by = "word"

Boardwalk Empire

url <- 'https://www.imdb.com/title/tt0979432/episodes?season=1&ref_=tt_eps_sn_1'

webpage <- read_html(url)

empire_description_html <- html_nodes(webpage,'.item_description')

empire_description <- html_text(empire_description_html)

empire_df <- data_frame(text = empire_description)

empire_df <- empire_df %>%
  unnest_tokens(word, text, to_lower = FALSE) %>%
  mutate(linenumber = row_number()) %>%
  select (linenumber, word)

empire_df
## # A tibble: 307 x 2
##    linenumber word        
##         <int> <chr>       
##  1          1 In          
##  2          2 1920        
##  3          3 Atlantic    
##  4          4 City        
##  5          5 politician  
##  6          6 Enoch       
##  7          7 Nucky       
##  8          8 Thompson    
##  9          9 makes       
## 10         10 arrangements
## # ... with 297 more rows

Anti-Join/Word Count (Boardwalk Empire)

empire_df <- empire_df %>%
  anti_join(stop_words)
## Joining, by = "word"
empire_df %>% 
  count(word, sort = TRUE)
## # A tibble: 147 x 2
##    word         n
##    <chr>    <int>
##  1 Jimmy        9
##  2 Nucky        9
##  3 Margaret     5
##  4 Chicago      4
##  5 Darmody      3
##  6 brother      2
##  7 Chalky       2
##  8 Day          2
##  9 Eli          2
## 10 Feds         2
## # ... with 137 more rows

PLOT (Boardwalk Empire)

empire_df %>%
  count(word, sort = TRUE) %>%
  filter(n > 3) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

Negative/Positive (Boardwalk Empire)

empire_bing <- empire_df %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE)
## Joining, by = "word"
empire_bing
## # A tibble: 17 x 3
##    word         sentiment     n
##    <chr>        <chr>     <int>
##  1 rival        negative      2
##  2 breaks       negative      1
##  3 chilly       negative      1
##  4 clash        negative      1
##  5 dying        negative      1
##  6 illegitimate negative      1
##  7 isolated     negative      1
##  8 kill         negative      1
##  9 loyal        positive      1
## 10 massacre     negative      1
## 11 mistakenly   negative      1
## 12 retaliate    negative      1
## 13 senile       negative      1
## 14 suspect      negative      1
## 15 suspicious   negative      1
## 16 trauma       negative      1
## 17 weakness     negative      1

Wordcloud (Boardwalk Empire)

empire_df %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
## Joining, by = "word"

The Sopranos

url <- 'https://www.imdb.com/title/tt0141842/episodes?season=1&ref_=tt_eps_sn_1'

webpage <- read_html(url)

sopranos_description_html <- html_nodes(webpage,'.item_description')

sopranos_description <- html_text(sopranos_description_html)

sopranos_df <- data_frame(text = sopranos_description)

sopranos_df <- sopranos_df %>%
  unnest_tokens(word, text, to_lower = FALSE) %>%
  mutate(linenumber = row_number()) %>%
  select (linenumber, word)

sopranos_df
## # A tibble: 285 x 2
##    linenumber word    
##         <int> <chr>   
##  1          1 A       
##  2          2 mobster 
##  3          3 passes  
##  4          4 out     
##  5          5 at      
##  6          6 a       
##  7          7 family  
##  8          8 barbecue
##  9          9 and     
## 10         10 seeks   
## # ... with 275 more rows

Anti-Join/Word Count (The Sopranos)

sopranos_df <- sopranos_df %>%
  anti_join(stop_words)
## Joining, by = "word"
sopranos_df %>% 
  count(word, sort = TRUE)
## # A tibble: 117 x 2
##    word          n
##    <chr>     <int>
##  1 Tony         11
##  2 Carmela       4
##  3 Chris         4
##  4 Junior        3
##  5 leaves        3
##  6 The           3
##  7 Tony's        3
##  8 attention     2
##  9 begins        2
## 10 Brendan       2
## # ... with 107 more rows

PLOT (The Sopranos)

sopranos_df %>%
  count(word, sort = TRUE) %>%
  filter(n > 2) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

We see that the word the slips through and bypasses the stop_words attempts to remove common words. Also the word leaves pops up more than twice. Other than that, it’s clear to see the Sopranos has a stand-alone main character unlinke other shows with an ensemble cast.

Negative/Positive (The Sopranos)

sopranos_bing <- sopranos_df %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE)
## Joining, by = "word"
sopranos_bing
## # A tibble: 14 x 3
##    word          sentiment     n
##    <chr>         <chr>     <int>
##  1 death         negative      2
##  2 angry         negative      1
##  3 confused      negative      1
##  4 disrespecting negative      1
##  5 embarrassment negative      1
##  6 fumes         negative      1
##  7 mobster       negative      1
##  8 peace         positive      1
##  9 stolen        negative      1
## 10 stress        negative      1
## 11 stymied       negative      1
## 12 toll          negative      1
## 13 traitor       negative      1
## 14 unwanted      negative      1

Wordcloud (The Sopranos)

sopranos_df %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
## Joining, by = "word"

Mad Men

url <- 'https://www.imdb.com/title/tt0804503/episodes?season=1&ref_=tt_eps_sn_1'

webpage <- read_html(url)

mm_description_html <- html_nodes(webpage,'.item_description')

mm_description <- html_text(mm_description_html)

mm_df <- data_frame(text = mm_description)

mm_df <- mm_df %>%
  unnest_tokens(word, text, to_lower = FALSE) %>%
  mutate(linenumber = row_number()) %>%
  select (linenumber, word)

mm_df
## # A tibble: 997 x 2
##    linenumber word  
##         <int> <chr> 
##  1          1 New   
##  2          2 York  
##  3          3 City  
##  4          4 1960s 
##  5          5 In    
##  6          6 the   
##  7          7 ego   
##  8          8 driven
##  9          9 Golden
## 10         10 Age   
## # ... with 987 more rows

Anti-Join/Word Count (Mad Men)

mm_df <- mm_df %>%
  anti_join(stop_words)
## Joining, by = "word"
mm_df %>% 
  count(word, sort = TRUE)
## # A tibble: 306 x 2
##    word         n
##    <chr>    <int>
##  1 Don         21
##  2 Pete        11
##  3 Draper       9
##  4 Peggy        8
##  5 Roger        8
##  6 Campbell     7
##  7 Cooper       6
##  8 account      5
##  9 ad           5
## 10 Sterling     5
## # ... with 296 more rows

PLOT (The Sopranos)

mm_df %>%
  count(word, sort = TRUE) %>%
  filter(n > 4) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

IMDB has done a fantastic job writing descriptions for Mad Men which gives us a look into the shows first season.

Negative/Positive (Mad Men)

mm_bing <- mm_df %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE)
## Joining, by = "word"
mm_bing
## # A tibble: 29 x 3
##    word          sentiment     n
##    <chr>         <chr>     <int>
##  1 loss          negative      2
##  2 attack        negative      1
##  3 bonus         positive      1
##  4 cold          negative      1
##  5 complicated   negative      1
##  6 conflicted    negative      1
##  7 confrontation negative      1
##  8 congratulate  positive      1
##  9 creative      positive      1
## 10 denies        negative      1
## # ... with 19 more rows

Wordcloud (Mad Men)

mm_df %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
## Joining, by = "word"

Summary

With the exception of the attempts to judge a show based on negative/positive sentiment analysis, we see what can be achieved by text mining. To dig deep further, one could retrieve the scripts from each first season instead of relying on a one paragraph synopsis of the television show.