As a fan of television dramas I often find myself trying to convince friends to watch certain programs. Inspired by several chapters in Text Minining with R I decided to pull the main characters/most used words to describe the first season of my favorite 5 television shows and use some basic sentiment analysis to see if they are considered positive or negative.
library(rvest)
## Warning: package 'rvest' was built under R version 3.4.2
## Loading required package: xml2
## Warning: package 'xml2' was built under R version 3.4.2
library(stringr)
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.4.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidytext)
## Warning: package 'tidytext' was built under R version 3.4.2
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 3.4.4
## Loading required package: RColorBrewer
library(ggplot2)
From the IMDB site I will scrape data for each show for their inaugural season using the rvest package along with assistance from CSS Selector Gadget
Create a vector entitled url which will store the web address to be scraped. Next, create another vector named webpage which will use the read_html feature from the rvest package to read the contents of url.
#obtaining the web address for the first season of The wire
url <- 'https://www.imdb.com/title/tt0306414/episodes?season=1&ref_=tt_eps_sn_1'
#using the read_html function from the rvest package
webpage <- read_html(url)
Next, we create wire_description_html which will utilize the html_nodes feature which will extract contents of webpage and .item_description, a CSS (Cascading Style Sheet) field, which houses the description of each episode; CSS selector was identified by using the CSS Selecotor Gadget extension on Google Chrome.
Following the previous step, we create wire_description which extracts attributes, text and tag name information from wire_description_html and then view a portion using the head feature in base R.
#scrape each episodes description using information from the selector gadget and using the html_nodes function from the rvest package
wire_description_html <- html_nodes(webpage,'.item_description')
#convert the description data
wire_description <- html_text(wire_description_html)
#view the head of wire_description
head(wire_description)
## [1] "\nBaltimore Det. Jimmy McNulty finds himself in hot water with his superior Major William Rawls after a drug dealer, D'Angelo Barksdale who is charged with three murders, is acquitted. McNulty knows the judge in question and although it's not his case, he's called into chambers to explain what happened. Obviously key witnesses recanted their police statements on the stand but McNulty doesn't underplay Barksdale's role in at least 7 other murders. When the judge's raises his concerns at the senior levels of the police department, they have a new investigation on their ... "
## [2] "\nLt. Daniels puts his team together but the extra help he's asked includes several less than stellar officers. It also includes McNulty, who is obviously in the doghouse. When one of the witnesses against D'Angelo Barksdale is found dead, McNulty and his homicide partner 'Bunk' Moreland bring him in for questioning. They have no evidence that he had anything to do with the killing but their instincts tell them otherwise. When news of the witnesses death hits the press, Major Rawls is convinced that McNulty is again responsible. Three of the newly assigned detectives, ... "
## [3] "\nAfter two weeks on the job, the team still doesn't have a photo of their principal target, Avon Barksdale. Freamon has a solution when he hears Avon was a Gold Gloves boxer. In the aftermath of the near riot at the towers, Lt. Daniels stands by his men even though he thinks they were stupid to do what they did. He's ordered to lead a raid on the low-rise units in the projects but McNulty rebels and refuses to participate in a raid that's just being put up for show. Dissatisfied with the antiquated equipment they have to work with, McNulty asks his friend at the FBI to... "
## [4] "\nBodie walks out of detention pretending to be a janitor but Patty is recovering in hospital and is pleased that his injuries will result in a medical pension. Avon Barksdale puts a contract out on Omar and others who robbed him. He wants to make sure everyone on the street knows he's not to be messed with. Judge Phelan keeps the pressure on and Daniels recommends to the Deputy that they need a wire. McNulty suggests they get a warrant to clone the pagers the drug runners use. Det. Fremon is one step ahead of everyone having already found D'Angelo Barksdale's pager ... "
## [5] "\nAvon Barksdale is becoming paranoid and thinks he's being watched. He's also worried his phone is being tapped. He also thinks there's a snitch in D'Angelo's crew and he's told not to pay anyone until it's sorted. D'Angelo isn't too pleased when he hears Avon has given Stinkum a new territory. Judge Phelan visits the squad to sign the warrants to clone D'Angelo Barksdale's pager. The numbers they're collecting are coded and don't make much sense. Det. Prez Pryzbylewski figures it out however. McNulty and Greggs try to get information from Omar. "
## [6] "\nAvon takes care of Omar's man Brandon but Wallace, who saw Brandon in the arcade, isn't too comfortable when he sees what they've done to him. Avon gives D'Angelo and Wallace a bonus for their good work. The police meanwhile get authorization to place taps for the pay phones used by D'Angelo and his crew but can only listen in when one of their suspects is using it meaning they'll have to keep the phones under constant observation. McNulty is in a tight spot when Major Rawls gives him a week to report back to his old job. When he realizes the connections McNulty's ... "
Our next step consists of creating a data frame using dplyr and data_frame and rename the sole column wire_description as text. Once achieved, we then tokenize the contents of the column using unnest_tokens from the tidytext package. This will extract all punctuation and return one word based on the text column and will return all data to a column named word. We also make to_lower false since unnest_tokens converts all words to lowercase by default; this will asssist in identifying main characters since they are usually capitalized.
Mutate is also used to create a linenumber, a numbered column matching each row of data. Lastly, we use select to reorder the columns and view the contents of the data frame.
#create a data frame using dplyr's data_frame function
wire_df <- data_frame(text = wire_description)
#tokenize; convert all text into a word using the unnest_tokens feature in tidytext
wire_df <- wire_df %>%
unnest_tokens(word, text, to_lower = FALSE) %>%
mutate(linenumber = row_number()) %>%
select (linenumber, word)
#view the dataframe
wire_df
## # A tibble: 1,268 x 2
## linenumber word
## <int> <chr>
## 1 1 Baltimore
## 2 2 Det
## 3 3 Jimmy
## 4 4 McNulty
## 5 5 finds
## 6 6 himself
## 7 7 in
## 8 8 hot
## 9 9 water
## 10 10 with
## # ... with 1,258 more rows
We obtain a count of the words used in the data frame by using the count feature in addition to sort to give us insight on how many words are used. As predicted, common words like ‘the’ and ‘to’ populate the data frame.
wire_df %>%
count(word, sort = TRUE)
## Warning: package 'bindrcpp' was built under R version 3.4.2
## # A tibble: 561 x 2
## word n
## <chr> <int>
## 1 the 75
## 2 to 54
## 3 and 34
## 4 a 30
## 5 in 25
## 6 is 24
## 7 of 24
## 8 his 19
## 9 on 16
## 10 Avon 15
## # ... with 551 more rows
Using the dplyr and tidytext packages we apply anti_join with the stop_words dataset. This pre-installed dataset removes common words from the data frame. We also obtain a new count of the words in the data frame before and after applying stop_words.
wire_df <- wire_df %>%
anti_join(stop_words)
## Joining, by = "word"
wire_df %>%
count(word, sort = TRUE)
## # A tibble: 387 x 2
## word n
## <chr> <int>
## 1 Avon 15
## 2 McNulty 15
## 3 D'Angelo 10
## 4 Stringer 10
## 5 Wallace 8
## 6 Daniels 7
## 7 detail 7
## 8 When 7
## 9 Barksdale 6
## 10 pit 6
## # ... with 377 more rows
Based on the count we can see who are the main characters/most used words of this groundbreaking show in season one. Next we will plot this data in ggplot2 and filter on words used 6 or more times.
wire_df %>%
count(word, sort = TRUE) %>%
filter(n > 5) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col() +
xlab(NULL) +
coord_flip()
Excluding the word When the most used words in the IMDB descriptions do a great job of describing the first season of The Wire. We see that the main characters are McNulty, Avon, Stringer, D’Angelo, Wallace and Daniels. In this show about the crime in Baltimore the police form a detail to survey a drug location known as the pit that is run by the Barksdale organization.
Next we will load the bing dictionary which is one of 3 that is used in the tidytext package. These dictionaries use special algorithms to see if text is considered negative or positive.
wire_bing <- wire_df %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE)
## Joining, by = "word"
wire_bing
## # A tibble: 43 x 3
## word sentiment n
## <chr> <chr> <int>
## 1 dead negative 2
## 2 killed negative 2
## 3 pleased positive 2
## 4 worried negative 2
## 5 angry negative 1
## 6 antiquated negative 1
## 7 bonus positive 1
## 8 comfortable positive 1
## 9 complex negative 1
## 10 concerns negative 1
## # ... with 33 more rows
As we can see the descriptions provided by IMDB don’t really provide a case to plot positive against negative words.
Lastly, we remove the stop_words again and conduct a re-count and provide a pattern of words using the wordcloud package which for the most part describes the first season of The Wire.
wire_df %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100))
## Joining, by = "word"
url <- 'https://www.imdb.com/title/tt0944947/episodes?season=1&ref_=tt_eps_sn_1'
webpage <- read_html(url)
got_description_html <- html_nodes(webpage,'.item_description')
got_description <- html_text(got_description_html)
got_df <- data_frame(text = got_description)
got_df <- got_df %>%
unnest_tokens(word, text, to_lower = FALSE) %>%
mutate(linenumber = row_number()) %>%
select (linenumber, word)
got_df
## # A tibble: 333 x 2
## linenumber word
## <int> <chr>
## 1 1 Jon
## 2 2 Arryn
## 3 3 the
## 4 4 Hand
## 5 5 of
## 6 6 the
## 7 7 King
## 8 8 is
## 9 9 dead
## 10 10 King
## # ... with 323 more rows
got_df <- got_df %>%
anti_join(stop_words)
## Joining, by = "word"
got_df %>%
count(word, sort = TRUE)
## # A tibble: 116 x 2
## word n
## <chr> <int>
## 1 Jon 7
## 2 Eddard 5
## 3 plans 5
## 4 Robb 5
## 5 Robert 5
## 6 Daenerys 4
## 7 Drogo 4
## 8 Night's 4
## 9 Tyrion 4
## 10 Watch 4
## # ... with 106 more rows
got_df %>%
count(word, sort = TRUE) %>%
filter(n > 3) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col() +
xlab(NULL) +
coord_flip()
Excluding the word plans the most used words in the IMDB descriptions continue to do a great job of describing the first season of Game of Thrones.
got_bing <- got_df %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE)
## Joining, by = "word"
got_bing
## # A tibble: 18 x 3
## word sentiment n
## <chr> <chr> <int>
## 1 attack negative 2
## 2 dead negative 2
## 3 coward negative 1
## 4 desperate negative 1
## 5 dying negative 1
## 6 fall negative 1
## 7 fallen negative 1
## 8 freedom positive 1
## 9 fresh positive 1
## 10 killed negative 1
## 11 losing negative 1
## 12 murder negative 1
## 13 patience positive 1
## 14 poison negative 1
## 15 refuses negative 1
## 16 revenge negative 1
## 17 struggling negative 1
## 18 wound negative 1
got_df %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100))
## Joining, by = "word"
url <- 'https://www.imdb.com/title/tt0979432/episodes?season=1&ref_=tt_eps_sn_1'
webpage <- read_html(url)
empire_description_html <- html_nodes(webpage,'.item_description')
empire_description <- html_text(empire_description_html)
empire_df <- data_frame(text = empire_description)
empire_df <- empire_df %>%
unnest_tokens(word, text, to_lower = FALSE) %>%
mutate(linenumber = row_number()) %>%
select (linenumber, word)
empire_df
## # A tibble: 307 x 2
## linenumber word
## <int> <chr>
## 1 1 In
## 2 2 1920
## 3 3 Atlantic
## 4 4 City
## 5 5 politician
## 6 6 Enoch
## 7 7 Nucky
## 8 8 Thompson
## 9 9 makes
## 10 10 arrangements
## # ... with 297 more rows
empire_df <- empire_df %>%
anti_join(stop_words)
## Joining, by = "word"
empire_df %>%
count(word, sort = TRUE)
## # A tibble: 147 x 2
## word n
## <chr> <int>
## 1 Jimmy 9
## 2 Nucky 9
## 3 Margaret 5
## 4 Chicago 4
## 5 Darmody 3
## 6 brother 2
## 7 Chalky 2
## 8 Day 2
## 9 Eli 2
## 10 Feds 2
## # ... with 137 more rows
empire_df %>%
count(word, sort = TRUE) %>%
filter(n > 3) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col() +
xlab(NULL) +
coord_flip()
empire_bing <- empire_df %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE)
## Joining, by = "word"
empire_bing
## # A tibble: 17 x 3
## word sentiment n
## <chr> <chr> <int>
## 1 rival negative 2
## 2 breaks negative 1
## 3 chilly negative 1
## 4 clash negative 1
## 5 dying negative 1
## 6 illegitimate negative 1
## 7 isolated negative 1
## 8 kill negative 1
## 9 loyal positive 1
## 10 massacre negative 1
## 11 mistakenly negative 1
## 12 retaliate negative 1
## 13 senile negative 1
## 14 suspect negative 1
## 15 suspicious negative 1
## 16 trauma negative 1
## 17 weakness negative 1
empire_df %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100))
## Joining, by = "word"
url <- 'https://www.imdb.com/title/tt0141842/episodes?season=1&ref_=tt_eps_sn_1'
webpage <- read_html(url)
sopranos_description_html <- html_nodes(webpage,'.item_description')
sopranos_description <- html_text(sopranos_description_html)
sopranos_df <- data_frame(text = sopranos_description)
sopranos_df <- sopranos_df %>%
unnest_tokens(word, text, to_lower = FALSE) %>%
mutate(linenumber = row_number()) %>%
select (linenumber, word)
sopranos_df
## # A tibble: 285 x 2
## linenumber word
## <int> <chr>
## 1 1 A
## 2 2 mobster
## 3 3 passes
## 4 4 out
## 5 5 at
## 6 6 a
## 7 7 family
## 8 8 barbecue
## 9 9 and
## 10 10 seeks
## # ... with 275 more rows
sopranos_df <- sopranos_df %>%
anti_join(stop_words)
## Joining, by = "word"
sopranos_df %>%
count(word, sort = TRUE)
## # A tibble: 117 x 2
## word n
## <chr> <int>
## 1 Tony 11
## 2 Carmela 4
## 3 Chris 4
## 4 Junior 3
## 5 leaves 3
## 6 The 3
## 7 Tony's 3
## 8 attention 2
## 9 begins 2
## 10 Brendan 2
## # ... with 107 more rows
sopranos_df %>%
count(word, sort = TRUE) %>%
filter(n > 2) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col() +
xlab(NULL) +
coord_flip()
We see that the word the slips through and bypasses the stop_words attempts to remove common words. Also the word leaves pops up more than twice. Other than that, it’s clear to see the Sopranos has a stand-alone main character unlinke other shows with an ensemble cast.
sopranos_bing <- sopranos_df %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE)
## Joining, by = "word"
sopranos_bing
## # A tibble: 14 x 3
## word sentiment n
## <chr> <chr> <int>
## 1 death negative 2
## 2 angry negative 1
## 3 confused negative 1
## 4 disrespecting negative 1
## 5 embarrassment negative 1
## 6 fumes negative 1
## 7 mobster negative 1
## 8 peace positive 1
## 9 stolen negative 1
## 10 stress negative 1
## 11 stymied negative 1
## 12 toll negative 1
## 13 traitor negative 1
## 14 unwanted negative 1
sopranos_df %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100))
## Joining, by = "word"
url <- 'https://www.imdb.com/title/tt0804503/episodes?season=1&ref_=tt_eps_sn_1'
webpage <- read_html(url)
mm_description_html <- html_nodes(webpage,'.item_description')
mm_description <- html_text(mm_description_html)
mm_df <- data_frame(text = mm_description)
mm_df <- mm_df %>%
unnest_tokens(word, text, to_lower = FALSE) %>%
mutate(linenumber = row_number()) %>%
select (linenumber, word)
mm_df
## # A tibble: 997 x 2
## linenumber word
## <int> <chr>
## 1 1 New
## 2 2 York
## 3 3 City
## 4 4 1960s
## 5 5 In
## 6 6 the
## 7 7 ego
## 8 8 driven
## 9 9 Golden
## 10 10 Age
## # ... with 987 more rows
mm_df <- mm_df %>%
anti_join(stop_words)
## Joining, by = "word"
mm_df %>%
count(word, sort = TRUE)
## # A tibble: 306 x 2
## word n
## <chr> <int>
## 1 Don 21
## 2 Pete 11
## 3 Draper 9
## 4 Peggy 8
## 5 Roger 8
## 6 Campbell 7
## 7 Cooper 6
## 8 account 5
## 9 ad 5
## 10 Sterling 5
## # ... with 296 more rows
mm_df %>%
count(word, sort = TRUE) %>%
filter(n > 4) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col() +
xlab(NULL) +
coord_flip()
IMDB has done a fantastic job writing descriptions for Mad Men which gives us a look into the shows first season.
mm_bing <- mm_df %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE)
## Joining, by = "word"
mm_bing
## # A tibble: 29 x 3
## word sentiment n
## <chr> <chr> <int>
## 1 loss negative 2
## 2 attack negative 1
## 3 bonus positive 1
## 4 cold negative 1
## 5 complicated negative 1
## 6 conflicted negative 1
## 7 confrontation negative 1
## 8 congratulate positive 1
## 9 creative positive 1
## 10 denies negative 1
## # ... with 19 more rows
mm_df %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100))
## Joining, by = "word"
With the exception of the attempts to judge a show based on negative/positive sentiment analysis, we see what can be achieved by text mining. To dig deep further, one could retrieve the scripts from each first season instead of relying on a one paragraph synopsis of the television show.