As a fan of The Daily Show, one thing I look forward to is the guests the host interviews. I will analyze a dataset I found online (I do not remember where I found it, but I downloaded it last year) listing the people who have appeared as guests on The Daily Show.
The dataset focuses on interviews done by Jon Stewart, who hosted the show from 1999 to 2015.
I also want to text-mine the occupations of these folks and turn them into a wordcloud or something, to see what the most common occupation among the interviewees is.
library(readxl)
guests <- read_excel("D:/Working Directory/Daily_show_guests.xlsx")
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------ tidyverse 1.2.1 --
## v ggplot2 3.1.0 v purrr 0.2.5
## v tibble 1.4.2 v dplyr 0.7.8
## v tidyr 0.8.1 v stringr 1.3.1
## v readr 1.1.1 v forcats 0.3.0
## Warning: package 'dplyr' was built under R version 3.5.2
## -- Conflicts --------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 3.5.2
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(stringr)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
library(effects)
## Warning: package 'effects' was built under R version 3.5.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 3.5.2
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
library(carData)
library(httr)
## Warning: package 'httr' was built under R version 3.5.3
library(tm)
## Warning: package 'tm' was built under R version 3.5.3
## Loading required package: NLP
## Warning: package 'NLP' was built under R version 3.5.2
##
## Attaching package: 'NLP'
## The following object is masked from 'package:httr':
##
## content
## The following object is masked from 'package:ggplot2':
##
## annotate
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 3.5.3
## Loading required package: RColorBrewer
library(SnowballC)
## Warning: package 'SnowballC' was built under R version 3.5.2
library(car)
## Warning: package 'car' was built under R version 3.5.3
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
glimpse(guests)
## Observations: 2,693
## Variables: 5
## $ YEAR <dbl> 1999, 1999, 1999, 1999, 1999, 1999, ...
## $ GoogleKnowlege_Occupation <chr> "actor", "Comedian", "television act...
## $ Show <dttm> 1999-01-11, 1999-01-12, 1999-01-13,...
## $ Group <chr> "Acting", "Comedy", "Acting", "Actin...
## $ Raw_Guest_List <chr> "Michael J. Fox", "Sandra Bernhard",...
summary(guests)
## YEAR GoogleKnowlege_Occupation Show
## Min. :1999 Length:2693 Min. :1999-01-11 00:00:00
## 1st Qu.:2003 Class :character 1st Qu.:2003-02-13 00:00:00
## Median :2007 Mode :character Median :2007-03-22 00:00:00
## Mean :2007 Mean :2007-04-15 06:20:43
## 3rd Qu.:2011 3rd Qu.:2011-06-21 00:00:00
## Max. :2015 Max. :2015-08-05 00:00:00
## Group Raw_Guest_List
## Length:2693 Length:2693
## Class :character Class :character
## Mode :character Mode :character
##
##
##
Thankfully there are no NAs, so I reckon we can go ahead with our analysis here.
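A quick way to double-check that claim is to count the missing values in each column; a minimal base-R sketch:
# Count NAs per column; all zeros means we are safe to proceed
colSums(is.na(guests))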
YEAR - the year the episode took place
GoogleKnowlege_Occupation - the occupation of the person being interviewed
Show - the day the show was aired
Group - I think this is the field the interviewees come from
Raw_Guest_List - I guess these are the names of the interviewees
I would like to determine which fields the guests who appear on The Daily Show come from.
guests_a <- guests %>%
group_by(Group) %>%
summarise(total = n()) %>%
arrange(desc(total))
print(guests_a)
## # A tibble: 18 x 2
## Group total
## <chr> <int>
## 1 Acting 930
## 2 Media 751
## 3 Politician 308
## 4 Comedy 150
## 5 Musician 123
## 6 Academic 103
## 7 Athletics 52
## 8 Misc 45
## 9 Government 40
## 10 Political Aide 36
## 11 NA 31
## 12 Science 28
## 13 Business 25
## 14 Advocacy 24
## 15 Consultant 18
## 16 Military 16
## 17 Clergy 8
## 18 media 5
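One thing that stands out in the table is that the Group column contains both "Media" (751) and "media" (5). If I wanted to fold those together, a minimal sketch using the stringr functions already loaded would be:
# Normalize the case of Group so "media" is counted together with "Media"
guests %>%
  mutate(Group = str_to_title(Group)) %>%
  group_by(Group) %>%
  summarise(total = n()) %>%
  arrange(desc(total))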
guests_a %>%
head(5) %>%
ggplot(aes(x = reorder(Group, - total), y = total)) +
geom_bar(aes(fill = Group), stat = "identity") +
coord_cartesian(ylim = c(100, 1000)) +
theme_solarized_2() +
# apply the complete theme first so the axis-text and legend tweaks below are not overridden
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 9.5, face = "bold")) +
geom_text(aes(label = round(total, 0), hjust = 0.65, vjust = - 0.7), size = 4) +
theme(legend.position = "none") +
labs(x = "\n Field \n", y = "\n Total \n",
     title = "What field do The Daily Show's interviewees come from? \n")
Acting is the field most of The Daily Show's interviewees come from, followed by those in the media (which I think includes the press), then politicians (and I've seen them get roasted on The Daily Show), then fellow comedians, and then musicians.
I would also like to answer the question: in which year did The Daily Show have the most interviews?
year <- guests %>%
group_by(YEAR) %>%
summarise(total = n()) %>%
arrange(desc(total))
print(year)
## # A tibble: 17 x 2
## YEAR total
## <dbl> <int>
## 1 2000 169
## 2 1999 166
## 3 2003 166
## 4 2013 166
## 5 2010 165
## 6 2004 164
## 7 2008 164
## 8 2012 164
## 9 2009 163
## 10 2011 163
## 11 2014 163
## 12 2005 162
## 13 2006 161
## 14 2002 159
## 15 2001 157
## 16 2007 141
## 17 2015 100
year %>%
head(5) %>%
ggplot(aes(x = reorder(YEAR, - total), y = total)) +
geom_bar(aes(fill = YEAR), stat = "identity") +
coord_cartesian(ylim = c(0, 200)) +
theme_solarized_2() +
# apply the complete theme first so the axis-text and legend tweaks below are not overridden
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 9.5, face = "bold")) +
geom_text(aes(label = round(total, 0), hjust = 0.65, vjust = - 0.7), size = 4) +
theme(legend.position = "none") +
labs(x = "\n Year \n", y = "\n Total \n",
     title = "In which years did The Daily Show have the most interviews? \n")
So 2000 was the year The Daily Show had the most interviews, closely followed by 1999, 2003, 2013, and 2010.
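The bar chart above only shows the top five years, and the counts are all very close. A quick sketch plotting all seventeen years from the same year table might be more informative:
# All seventeen years at once, in chronological order
year %>%
  ggplot(aes(x = YEAR, y = total)) +
  geom_col(fill = "orchid") +
  theme_solarized_2() +
  labs(x = "Year", y = "Number of guests",
       title = "Daily Show guests per year, 1999-2015")
Now, on to text-mining the occupations.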
# Creating a vector of guest occupations
guestslist <- guests$GoogleKnowlege_Occupation
# Wrapping the occupation vector as a tm source
review <- VectorSource(guestslist)
# Creating VCorpus object
rev <- VCorpus(review)
# Removing punctuation from reviews
rev <- tm_map(rev, removePunctuation)
# Converting to lowercase to make future cleaning easier
rev <- tm_map(rev, content_transformer(tolower))
# Removing Numbers
rev <- tm_map(rev, removeNumbers)
# Removing English stopwords
rev <- tm_map(rev, removeWords, stopwords("en"))
# Removing excess whitespaces
rev <- tm_map(rev, stripWhitespace)
# Conducting word stemming
rev <- tm_map(rev, stemDocument)
Now let's see which occupation stems are most prominent among The Daily Show's guests.
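Before building the document-term matrix, it's worth spot-checking a few cleaned entries to confirm the transformations behaved as expected; a small sketch using tm's as.character idiom:
# Peek at the first three cleaned documents
lapply(rev[1:3], as.character)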
# Creating a DTM matrix
rev_dtm <- DocumentTermMatrix(rev)
# Finding the most popular stems. Let's use ones that appear at least 50 times
rev_freq <- findFreqTerms(rev_dtm, lowfreq = 50)
rev_freq
## [1] "actor" "actress" "author" "comedian" "film"
## [6] "former" "governor" "journalist" "secretari" "senat"
## [11] "state" "televis" "unit" "writer"
I want to remove some of the less informative words here, namely "unit" and "former".
new_stopwords <- c("unit", "former")
rev_new <- tm_map(rev, removeWords, new_stopwords)
# Creating second DTM
rev_dtm2 <- DocumentTermMatrix(rev_new)
# Finding top stems
top_stems <- findFreqTerms(rev_dtm2, lowfreq = 50)
top_stems
## [1] "actor" "actress" "author" "comedian" "film"
## [6] "governor" "journalist" "secretari" "senat" "state"
## [11] "televis" "writer"
Creating the Clean Corpus function
clean_corpus <- function(corpus){
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, removeWords, new_stopwords)
corpus <- tm_map(corpus, stripWhitespace)
return(corpus)
}
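Note that rev has already been cleaned step by step above, so clean_corpus(rev) below mostly re-applies the same transformations. Applying the function to a corpus built straight from the raw column should give roughly the same result (occ_corpus is just an illustrative name):
# Equivalent route: run the whole cleaning pipeline on a fresh corpus in one call
occ_corpus <- clean_corpus(VCorpus(VectorSource(guests$GoogleKnowlege_Occupation)))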
guests_corpus <- clean_corpus(rev)
guest_corpus_dm <- TermDocumentMatrix(guests_corpus)
Creating the matrix from the TDM.
guest_corpus_cdm <- as.matrix(guest_corpus_dm)
guest_corpus_f <- rowSums(guest_corpus_cdm)
guest_corpus_f <- sort(guest_corpus_f, decreasing = T)
# Creating the data frame
gcf <- data.frame(term=names(guest_corpus_f), num = guest_corpus_f)
# Creating a plot of 10 top stems
ggplot(data = head(gcf, 10), aes(x = factor(term, levels = gcf$term[order(-gcf$num)]), y = num)) +
geom_col(fill = "orchid") +
labs(x = "Word Stems", y = "Count", title = "Most common Occupations interviewed at The Daily Show") +
scale_y_continuous(expand = c(0, 0), breaks = seq(0, 10000, 500))
Creating the wordcloud. For legibility I will use a 10-word limit.
wordcloud(gcf$term, gcf$num, max.words = 10, colors = "orchid")
Actors and actresses are the most common professions interviewed on The Daily Show, followed by journalists, authors, fellow comedians, and senators.
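If more color variety is wanted, RColorBrewer (loaded above as a wordcloud dependency) can supply a palette instead of a single color; a minimal variation:
# Same wordcloud, colored with a Brewer palette
wordcloud(gcf$term, gcf$num, max.words = 10, colors = brewer.pal(8, "Dark2"))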
We would also like to know the most common first and last names among the interviewees on The Daily Show.
# Creating a vector of guest names
guestslist <- guests$Raw_Guest_List
# Wrapping the guest-name vector as a tm source
review <- VectorSource(guestslist)
# Creating VCorpus object
rev <- VCorpus(review)
# Removing punctuation from reviews
rev <- tm_map(rev, removePunctuation)
# Converting to lowercase to make future cleaning easier
rev <- tm_map(rev, content_transformer(tolower))
# Removing Numbers
rev <- tm_map(rev, removeNumbers)
# Removing English stopwords
rev <- tm_map(rev, removeWords, stopwords("en"))
# Removing excess whitespaces
rev <- tm_map(rev, stripWhitespace)
# Conducting word stemming
rev <- tm_map(rev, stemDocument)
We want to know which names are most prominent amongst The Daily Show's guest list.
# Creating a DTM matrix
rev_dtm <- DocumentTermMatrix(rev)
# Finding the most popular stems. Let's use ones that appear at least 15 times
rev_freq <- findFreqTerms(rev_dtm, lowfreq = 15)
rev_freq
## [1] "adam" "ben" "bill" "bob" "brian" "bruce"
## [7] "chris" "colin" "david" "deni" "edward" "fare"
## [13] "ferrel" "georg" "gyllenha" "howard" "jackson" "jame"
## [19] "jeff" "jennif" "jim" "joe" "john" "jon"
## [25] "kevin" "leari" "lewi" "mark" "martin" "matt"
## [31] "matthew" "michael" "mike" "miller" "paul" "peter"
## [37] "richard" "robert" "rudd" "sen" "steve" "thoma"
## [43] "tim" "tom" "will" "william" "zakaria"
I'll reuse the same extra stop words as before; since "unit" and "former" don't actually appear among the names, this leaves the list unchanged.
new_stopwords <- c("unit", "former")
rev_new <- tm_map(rev, removeWords, new_stopwords)
# Creating second DTM
rev_dtm2 <- DocumentTermMatrix(rev_new)
# Finding top stems
top_stems <- findFreqTerms(rev_dtm2, lowfreq = 15)
top_stems
## [1] "adam" "ben" "bill" "bob" "brian" "bruce"
## [7] "chris" "colin" "david" "deni" "edward" "fare"
## [13] "ferrel" "georg" "gyllenha" "howard" "jackson" "jame"
## [19] "jeff" "jennif" "jim" "joe" "john" "jon"
## [25] "kevin" "leari" "lewi" "mark" "martin" "matt"
## [31] "matthew" "michael" "mike" "miller" "paul" "peter"
## [37] "richard" "robert" "rudd" "sen" "steve" "thoma"
## [43] "tim" "tom" "will" "william" "zakaria"
Creating the Clean Corpus function
clean_corpus <- function(corpus){
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, removeWords, new_stopwords)
corpus <- tm_map(corpus, stripWhitespace)
return(corpus)
}
guests_corpus <- clean_corpus(rev)
guest_corpus_dm <- TermDocumentMatrix(guests_corpus)
Creating the matrix from the TDM.
guest_corpus_cdm <- as.matrix(guest_corpus_dm)
guest_corpus_f <- rowSums(guest_corpus_cdm)
guest_corpus_f <- sort(guest_corpus_f, decreasing = T)
# Creating the data frame
gcf <- data.frame(term=names(guest_corpus_f), num = guest_corpus_f)
# Creating a plot of 10 top stems
ggplot(data = head(gcf, 10), aes(x = factor(term, levels = gcf$term[order(-gcf$num)]), y = num)) +
geom_col(fill = "orchid") +
labs(x = "Word Stems", y = "Count", title = "Most common Occupations interviewed at The Daily Show") +
scale_y_continuous(expand = c(0, 0), breaks=seq(0,10000,500))Creating the Wordcloud. For legibility i will use a 10 word limit
wordcloud(gcf$term, gcf$num, max.words = 25, colors = "orchid")
So, looking at the results, it's obvious that John, a very common name, is the most frequent first name among the interviewees, followed by David, Michael, Bill, William, Richard, Paul, and so on.
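Since stemming mangles some of the names ("deni", "gyllenha"), an alternative that skips tm entirely is to take the first word of each Raw_Guest_List entry with stringr and count it directly. A rough sketch (it ignores rows that list several guests at once, and first_name is just an illustrative column name):
# Count first names straight from the raw guest list, without stemming
guests %>%
  mutate(first_name = str_to_lower(word(Raw_Guest_List, 1))) %>%
  group_by(first_name) %>%
  summarise(total = n()) %>%
  arrange(desc(total)) %>%
  head(10)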