Introduction

As a fan of The Daily Show, one thing I look forward to is the guests the hosts interview. I will analyze a dataset I found online (I do not remember where, but I downloaded it last year) listing the people who have appeared as guests on The Daily Show.

Scope

The dataset focuses on interviews conducted by Jon Stewart, who hosted the show from 1999 to 2015.

I also want to text mine the occupations of these guests and turn the results into a wordcloud to see which occupations are interviewed most often.

Loading the data and the required libraries

library(readxl)
guests <- read_excel("D:/Working Directory/Daily_show_guests.xlsx")

Loading the packages

library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------ tidyverse 1.2.1 --
## v ggplot2 3.1.0     v purrr   0.2.5
## v tibble  1.4.2     v dplyr   0.7.8
## v tidyr   0.8.1     v stringr 1.3.1
## v readr   1.1.1     v forcats 0.3.0
## Warning: package 'dplyr' was built under R version 3.5.2
## -- Conflicts --------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 3.5.2
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(stringr)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
library(effects)
## Warning: package 'effects' was built under R version 3.5.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 3.5.2
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
library(carData)
library(httr)
## Warning: package 'httr' was built under R version 3.5.3
library(tm)
## Warning: package 'tm' was built under R version 3.5.3
## Loading required package: NLP
## Warning: package 'NLP' was built under R version 3.5.2
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:httr':
## 
##     content
## The following object is masked from 'package:ggplot2':
## 
##     annotate
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 3.5.3
## Loading required package: RColorBrewer
library(SnowballC)
## Warning: package 'SnowballC' was built under R version 3.5.2
library(car)
## Warning: package 'car' was built under R version 3.5.3
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some

Examining the data

glimpse(guests)
## Observations: 2,693
## Variables: 5
## $ YEAR                      <dbl> 1999, 1999, 1999, 1999, 1999, 1999, ...
## $ GoogleKnowlege_Occupation <chr> "actor", "Comedian", "television act...
## $ Show                      <dttm> 1999-01-11, 1999-01-12, 1999-01-13,...
## $ Group                     <chr> "Acting", "Comedy", "Acting", "Actin...
## $ Raw_Guest_List            <chr> "Michael J. Fox", "Sandra Bernhard",...
summary(guests)
##       YEAR      GoogleKnowlege_Occupation      Show                    
##  Min.   :1999   Length:2693               Min.   :1999-01-11 00:00:00  
##  1st Qu.:2003   Class :character          1st Qu.:2003-02-13 00:00:00  
##  Median :2007   Mode  :character          Median :2007-03-22 00:00:00  
##  Mean   :2007                             Mean   :2007-04-15 06:20:43  
##  3rd Qu.:2011                             3rd Qu.:2011-06-21 00:00:00  
##  Max.   :2015                             Max.   :2015-08-05 00:00:00  
##     Group           Raw_Guest_List    
##  Length:2693        Length:2693       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 

Thankfully the summary shows no NAs in the numeric and date columns, so I reckon we can go ahead with the analysis here.
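
A quick way to double-check is to count the missing values in each column (a minimal sketch; output not shown here):

# Count NAs per column; all zeros would mean the data is complete
colSums(is.na(guests))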

Explanation of variables

YEAR - the year the episode took place
GoogleKnowlege_Occupation - the occupation of the person being interviewed
Show - the date the show aired
Group - the broad field the interviewee comes from (as far as I can tell)
Raw_Guest_List - the name(s) of the interviewee(s) as listed

Getting the number and percentage of guests per group appearing on the show

I would like to determine which fields the guests interviewed on The Daily Show come from. The code below counts guests per group; a sketch for the percentages follows the table.

guests_a <- guests %>%
  group_by(Group) %>%
  summarise(total = n()) %>%
  arrange(desc(total)) 

print(guests_a)
## # A tibble: 18 x 2
##    Group          total
##    <chr>          <int>
##  1 Acting           930
##  2 Media            751
##  3 Politician       308
##  4 Comedy           150
##  5 Musician         123
##  6 Academic         103
##  7 Athletics         52
##  8 Misc              45
##  9 Government        40
## 10 Political Aide    36
## 11 NA                31
## 12 Science           28
## 13 Business          25
## 14 Advocacy          24
## 15 Consultant        18
## 16 Military          16
## 17 Clergy             8
## 18 media              5
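
Since I also wanted the percentages, here is a minimal sketch that adds them to the table above; it simply divides each group's total by the overall number of guests.

guests_a %>%
  # Express each group's count as a share of all 2,693 guests
  mutate(percentage = round(100 * total / sum(total), 1))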

Top 5 fields interviewees come from

guests_a %>%
  head(5) %>%
  ggplot(aes(x = reorder(Group, -total), y = total)) +
  geom_bar(aes(fill = Group), stat = "identity") +
  coord_cartesian(ylim = c(100, 1000)) +
  geom_text(aes(label = total), hjust = 0.65, vjust = -0.7, size = 4) +
  labs(x = "\n Field \n", y = "\n Total \n",
       title = "What field do The Daily Show's interviewees come from? \n") +
  theme_solarized_2() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 9.5, face = "bold"),
        legend.position = "none")

Acting is the field most of The Daily Show's interviewees come from, followed by the media (which I think includes the press), then politicians (whom I have seen roasted on the show), then fellow comedians, and then musicians.

Most prolific interview years

I would also like to answer the question: in which year did The Daily Show have the most interviews?

year <- guests %>%
  group_by(YEAR) %>%
  summarise(total = n()) %>%
  arrange(desc(total)) 

print(year)
## # A tibble: 17 x 2
##     YEAR total
##    <dbl> <int>
##  1  2000   169
##  2  1999   166
##  3  2003   166
##  4  2013   166
##  5  2010   165
##  6  2004   164
##  7  2008   164
##  8  2012   164
##  9  2009   163
## 10  2011   163
## 11  2014   163
## 12  2005   162
## 13  2006   161
## 14  2002   159
## 15  2001   157
## 16  2007   141
## 17  2015   100

Top 5 most prolific interview years

year %>%
  head(5) %>%
  ggplot(aes(x = reorder(YEAR, -total), y = total)) +
  geom_bar(aes(fill = factor(YEAR)), stat = "identity") +
  coord_cartesian(ylim = c(0, 200)) +
  geom_text(aes(label = total), hjust = 0.65, vjust = -0.7, size = 4) +
  labs(x = "\n Year \n", y = "\n Total \n",
       title = "In which years did The Daily Show have the most interviews? \n") +
  theme_solarized_2() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 9.5, face = "bold"),
        legend.position = "none")

So, 2000 was the year The Daily Show had the most interviews, closely followed by 1999, 2003, 2013 and 2010. The much lower count for 2015 makes sense, since the data only runs to August 2015, the end of Jon Stewart's tenure.
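
To confirm that 2015 is a partial year, a quick sketch (output omitted) is to check the last air date in the data; the summary() output above already puts it at 2015-08-05.

guests %>%
  # Latest air date in the dataset
  summarise(last_show = max(Show))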

Text Mining on Guests and Occupation.

Text Mining on Occupation

# Creating a vector of guest occupations
guestslist <- guests$GoogleKnowlege_Occupation

# Turning the occupation vector into a tm source
review <- VectorSource(guestslist)

# Creating VCorpus object
rev <- VCorpus(review)

# Removing punctuation
rev <- tm_map(rev, removePunctuation)

# Converting to lowercase to make future cleaning easier
rev <- tm_map(rev, content_transformer(tolower))

# Removing Numbers
rev <- tm_map(rev, removeNumbers)

# Removing English stopwords
rev <- tm_map(rev, removeWords, stopwords("en"))

# Removing excess whitespaces
rev <- tm_map(rev, stripWhitespace)

# Conducting word stemming
rev <- tm_map(rev, stemDocument)

Now let's see which occupation stems appear most frequently.

# Creating a DTM matrix
rev_dtm <- DocumentTermMatrix(rev)

# Finding the most popular stems. Let's use ones that appear at least 50 times
rev_freq <- findFreqTerms(rev_dtm, lowfreq = 50)
rev_freq
##  [1] "actor"      "actress"    "author"     "comedian"   "film"      
##  [6] "former"     "governor"   "journalist" "secretari"  "senat"     
## [11] "state"      "televis"    "unit"       "writer"

Removing irrelevant words

I want to remove a couple of stems that carry little meaning on their own here ("unit" and "former").

new_stopwords <- c("unit", "former")

rev_new <- tm_map(rev, removeWords, new_stopwords)

# Creating second DTM
rev_dtm2 <- DocumentTermMatrix(rev_new)

# Finding top stems
top_stems <- findFreqTerms(rev_dtm2, lowfreq = 50)

top_stems
##  [1] "actor"      "actress"    "author"     "comedian"   "film"      
##  [6] "governor"   "journalist" "secretari"  "senat"      "state"     
## [11] "televis"    "writer"

Creating the wordcloud.

Creating the Clean Corpus function

clean_corpus <- function(corpus){
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeWords, stopwords("en"))
  corpus <- tm_map(corpus, stemDocument)
  corpus <- tm_map(corpus, removeWords, new_stopwords)
  corpus <-tm_map(corpus, stripWhitespace)
  return(corpus)
}
guests_corpus <- clean_corpus(rev)
guest_corpus_dm <- TermDocumentMatrix(guests_corpus)

Creating the matrix from the TDM

guest_corpus_cdm <- as.matrix(guest_corpus_dm)
guest_corpus_f <- rowSums(guest_corpus_cdm)
guest_corpus_f <- sort(guest_corpus_f, decreasing = T)

# Creating the data frame
gcf <- data.frame(term=names(guest_corpus_f), num = guest_corpus_f)

# Creating a plot of 10 top stems
ggplot(data = head(gcf, 10), aes(x = factor(term, levels = gcf$term[order(-gcf$num)]), y = num)) + 
  geom_col(fill = "orchid") + 
  labs(x = "Word Stems", y = "Count", title = "Most common Occupations interviewed at The Daily Show") + 
    scale_y_continuous(expand = c(0, 0), breaks=seq(0,10000,500))

Creating the wordcloud. For legibility I will use a 10 word limit.

wordcloud(gcf$term, gcf$num, max.words = 10, colors = "orchid")

Actors and actresses are the most common professions interviewed on The Daily Show, followed by journalists, authors, fellow comedians and senators.
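
As a quick cross-check on the stemmed results, we can also tally the raw (unstemmed) occupation labels directly; this is just a sketch and the output is not shown. Note that the raw labels mix cases (e.g. "actor" vs "Comedian"), so the counts will be split a little compared to the cleaned corpus.

guests %>%
  # Tally the original occupation strings, most frequent first
  count(GoogleKnowlege_Occupation, sort = TRUE) %>%
  head(10)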

Text Mining on Guest Names

We would also like to know which first or last names are most common among the interviewees on The Daily Show.

# Creating a vector of guest names
guestslist <- guests$Raw_Guest_List

# Turning the name vector into a tm source
review <- VectorSource(guestslist)

# Creating VCorpus object
rev <- VCorpus(review)

# Removing punctuation
rev <- tm_map(rev, removePunctuation)

# Converting to lowercase to make future cleaning easier
rev <- tm_map(rev, content_transformer(tolower))

# Removing Numbers
rev <- tm_map(rev, removeNumbers)

# Removing English stopwords
rev <- tm_map(rev, removeWords, stopwords("en"))

# Removing excess whitespaces
rev <- tm_map(rev, stripWhitespace)

# Conducting word stemming
rev <- tm_map(rev, stemDocument)

Now let's see which name stems appear most frequently in the guest list.

# Creating a DTM matrix
rev_dtm <- DocumentTermMatrix(rev)

# Finding the most popular stems. Let's use ones that appear at least 15 times
rev_freq <- findFreqTerms(rev_dtm, lowfreq = 15)
rev_freq
##  [1] "adam"     "ben"      "bill"     "bob"      "brian"    "bruce"   
##  [7] "chris"    "colin"    "david"    "deni"     "edward"   "fare"    
## [13] "ferrel"   "georg"    "gyllenha" "howard"   "jackson"  "jame"    
## [19] "jeff"     "jennif"   "jim"      "joe"      "john"     "jon"     
## [25] "kevin"    "leari"    "lewi"     "mark"     "martin"   "matt"    
## [31] "matthew"  "michael"  "mike"     "miller"   "paul"     "peter"   
## [37] "richard"  "robert"   "rudd"     "sen"      "steve"    "thoma"   
## [43] "tim"      "tom"      "will"     "william"  "zakaria"

Removing irrelevant words

I will reuse the same custom stopword list as before; since neither "unit" nor "former" appears among the name stems, the results are unchanged.

new_stopwords <- c("unit", "former")

rev_new <- tm_map(rev, removeWords, new_stopwords)

# Creating second DTM
rev_dtm2 <- DocumentTermMatrix(rev_new)

# Finding top stems
top_stems <- findFreqTerms(rev_dtm2, lowfreq = 15)

top_stems
##  [1] "adam"     "ben"      "bill"     "bob"      "brian"    "bruce"   
##  [7] "chris"    "colin"    "david"    "deni"     "edward"   "fare"    
## [13] "ferrel"   "georg"    "gyllenha" "howard"   "jackson"  "jame"    
## [19] "jeff"     "jennif"   "jim"      "joe"      "john"     "jon"     
## [25] "kevin"    "leari"    "lewi"     "mark"     "martin"   "matt"    
## [31] "matthew"  "michael"  "mike"     "miller"   "paul"     "peter"   
## [37] "richard"  "robert"   "rudd"     "sen"      "steve"    "thoma"   
## [43] "tim"      "tom"      "will"     "william"  "zakaria"

Creating the wordcloud.

Creating the Clean Corpus function

clean_corpus <- function(corpus){
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeWords, stopwords("en"))
  corpus <- tm_map(corpus, stemDocument)
  corpus <- tm_map(corpus, removeWords, new_stopwords)
  corpus <-tm_map(corpus, stripWhitespace)
  return(corpus)
}
guests_corpus <- clean_corpus(rev)
guest_corpus_dm <- TermDocumentMatrix(guests_corpus)

Creating the matrix from the TDM

guest_corpus_cdm <- as.matrix(guest_corpus_dm)
guest_corpus_f <- rowSums(guest_corpus_cdm)
guest_corpus_f <- sort(guest_corpus_f, decreasing = T)

# Creating the data frame
gcf <- data.frame(term=names(guest_corpus_f), num = guest_corpus_f)

# Creating a plot of 10 top stems
ggplot(data = head(gcf, 10), aes(x = factor(term, levels = gcf$term[order(-gcf$num)]), y = num)) + 
  geom_col(fill = "orchid") + 
  labs(x = "Name Stems", y = "Count", title = "Most common guest names on The Daily Show") + 
  scale_y_continuous(expand = c(0, 0))

Creating the wordcloud. For legibility I will use a 25 word limit.

wordcloud(gcf$term, gcf$num, max.words = 25, colors = "orchid")

Looking at the results, it's clear that John, a very common name, is the most common first name among interviewees, followed by David, Michael, Bill, William, Richard and Paul.
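
As a cross-check that does not rely on stemming, here is a minimal sketch (output not shown) that takes the first word of each Raw_Guest_List entry with stringr::word() and tallies it. Note that some entries may list several guests, so this only captures the first guest's first name.

guests %>%
  # Take the first word of each guest entry as a rough first name
  mutate(first_name = word(Raw_Guest_List, 1)) %>%
  count(first_name, sort = TRUE) %>%
  head(10)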