Homework Six: Bivariate Relationships using The Daily Show Guests Data From 1999-2015

Introduction

Homework Six returns to the Daily Show data set used in Homework Five to gain more insight into it through the lens of bivariate relationships.

Importing the data

Once again we import the data. Before any analysis can take place, the libraries and the data must be loaded into RStudio.

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.0.2     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(datasets)
library(stringr)
library(distill)
library(devtools)

## Loading required package: usethis

library(here)

## here() starts at /Users/chester/Desktop

library(skimr)
library(grDevices)
library(wordcloud)

## Loading required package: RColorBrewer

library(RColorBrewer)
library(fivethirtyeight)

## Some larger datasets need to be installed separately, like senators and
## house_district_forecast. To install these, we recommend you install the
## fivethirtyeightdata package by running:
## install.packages('fivethirtyeightdata', repos =
## 'https://fivethirtyeightdata.github.io/drat/', type = 'source')

library(data.table)

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

## The following object is masked from 'package:purrr':
## 
##     transpose

library(DT)
data("daily_show_guests")

Now that the daily_show_guests data set is loaded, we can tidy it, clean it, and see what it contains. Note that this data set was tidied, cleaned, recoded, and summarized with descriptive statistics in Homework Five. The tidying was, however, refined in some places for Homework Six to improve readability and comprehension.

head(daily_show_guests,5)    # to see the first five rows of data set

## # A tibble: 5 × 5
##    year google_knowledge_occupation show       group  raw_guest_list  
##   <int> <chr>                       <date>     <chr>  <chr>           
## 1  1999 actor                       1999-01-11 Acting Michael J. Fox  
## 2  1999 comedian                    1999-01-12 Comedy Sandra Bernhard 
## 3  1999 television actress          1999-01-13 Acting Tracey Ullman   
## 4  1999 film actress                1999-01-14 Acting Gillian Anderson
## 5  1999 actor                       1999-01-18 Acting David Alan Grier

tail(daily_show_guests,5)    # shows the last five rows of data set

## # A tibble: 5 × 5
##    year google_knowledge_occupation show       group  raw_guest_list      
##   <int> <chr>                       <date>     <chr>  <chr>               
## 1  2015 biographer                  2015-07-29 Media  Doris Kearns Goodwin
## 2  2015 director                    2015-07-30 Media  J. J. Abrams        
## 3  2015 stand-up comedian           2015-08-03 Comedy Amy Schumer         
## 4  2015 actor                       2015-08-04 Acting Denis Leary         
## 5  2015 comedian                    2015-08-05 Comedy Louis C.K.

dim(daily_show_guests)       # shows the dimensions of data set

## [1] 2693    5

colnames(daily_show_guests)  # column names

## [1] "year"                        "google_knowledge_occupation"
## [3] "show"                        "group"                      
## [5] "raw_guest_list"

Summary Statistics

Here the skim() function from the skimr package is used; the summary() call is left commented out below. My preference is for skim() because it reports missing data and handles dates and logicals. By comparison, it gives more comprehensive insight than summary().

skim(daily_show_guests)

Data summary
Name	daily_show_guests
Number of rows	2693
Number of columns	5
_______________________
Column type frequency:
character	3
Date	1
numeric	1
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
google_knowledge_occupation	26	0.99	1	66	335
group	31	0.99	4	14	17
raw_guest_list	0	1.00	3	72	1669

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
show	0	1	1999-01-11	2015-08-05	2007-03-22	2639

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
year	0	1	2006.82	4.83	1999	2003	2007	2011	2015	▇▆▆▆▇

#summary(daily_show_guests)

To make the data set more understandable, the rename() function is used to rename google_knowledge_occupation to guest_occupation, show to show_date, raw_guest_list to guest_list, year to year_episode_aired, and group to occupational_domain.

# rename() takes new_name = old_name pairs, so all five columns can be renamed in one call
daily_show_guests<-daily_show_guests %>%
  rename(
    guest_occupation    = google_knowledge_occupation,
    show_date           = show,
    guest_list          = raw_guest_list,
    year_episode_aired  = year,
    occupational_domain = group
  )

This chunk verifies that the column headings now reflect the new names.

head(daily_show_guests,5)    #shows first five rows of data set.

## # A tibble: 5 × 5
##   year_episode_aired guest_occupation   show_date  occupational_dom… guest_list 
##                <int> <chr>              <date>     <chr>             <chr>      
## 1               1999 actor              1999-01-11 Acting            Michael J.…
## 2               1999 comedian           1999-01-12 Comedy            Sandra Ber…
## 3               1999 television actress 1999-01-13 Acting            Tracey Ull…
## 4               1999 film actress       1999-01-14 Acting            Gillian An…
## 5               1999 actor              1999-01-18 Acting            David Alan…

This chunk creates my_data, a copy of daily_show_guests, for ease of analysis.

my_data<-daily_show_guests

This chunk shows the occupational domains that appeared more than 10 times on the show. As predicted, Acting appeared the most, at 930 times.

my_data %>% 
  count(occupational_domain) %>% 
  filter(n > 10) %>% 
  na.omit()

## # A tibble: 15 × 2
##    occupational_domain     n
##    <chr>               <int>
##  1 Academic              103
##  2 Acting                930
##  3 Advocacy               24
##  4 Athletics              52
##  5 Business               25
##  6 Comedy                150
##  7 Consultant             18
##  8 Government             40
##  9 Media                 751
## 10 Military               16
## 11 Misc                   45
## 12 Musician              123
## 13 Political Aide         36
## 14 Politician            308
## 15 Science                28

This chunk shows the guests who appeared on the show more than 10 times. Fareed Zakaria appeared more than any other guest, with 19 appearances.

my_data %>% 
  count(guest_list) %>% 
  filter(n > 10) %>% 
  na.omit()

## # A tibble: 6 × 2
##   guest_list         n
##   <chr>          <int>
## 1 Brian Williams    16
## 2 Denis Leary       17
## 3 Fareed Zakaria    19
## 4 Paul Rudd         13
## 5 Ricky Gervais     13
## 6 Tom Brokaw        12

This chunk explores how many guest appearances fall in the Acting domain. It returns 930 rows from 1999 to 2015; each row is an appearance, so a returning actor is counted more than once.

my_data %>% 
 select(year_episode_aired,guest_occupation,show_date,occupational_domain,guest_list) %>% 
 filter(occupational_domain == "Acting")

## # A tibble: 930 × 5
##    year_episode_aired guest_occupation   show_date  occupational_dom… guest_list
##                 <int> <chr>              <date>     <chr>             <chr>     
##  1               1999 actor              1999-01-11 Acting            Michael J…
##  2               1999 television actress 1999-01-13 Acting            Tracey Ul…
##  3               1999 film actress       1999-01-14 Acting            Gillian A…
##  4               1999 actor              1999-01-18 Acting            David Ala…
##  5               1999 actor              1999-01-19 Acting            William B…
##  6               1999 actor              1999-01-25 Acting            Matthew L…
##  7               1999 actress            1999-01-27 Acting            Yasmine B…
##  8               1999 actor              1999-01-28 Acting            D. L. Hug…
##  9               1999 television actress 1999-10-18 Acting            Rebecca G…
## 10               1999 actress            1999-10-20 Acting            Amy Brenn…
## # … with 920 more rows
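
Because each row is one appearance, the row count and the number of distinct guests can differ. A minimal sketch of the difference, using a made-up tibble (toy_guests and the guest names are hypothetical):

```r
library(dplyr)

# Hypothetical toy data: "Guest A" appears twice in the Acting domain
toy_guests <- tibble::tibble(
  occupational_domain = c("Acting", "Acting", "Acting", "Media"),
  guest_list          = c("Guest A", "Guest B", "Guest A", "Guest C")
)

acting_summary <- toy_guests %>%
  filter(occupational_domain == "Acting") %>%
  summarise(
    appearances     = n(),                   # number of rows (appearances)
    distinct_guests = n_distinct(guest_list) # number of different guest entries
  )

acting_summary
```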

Here we see 28 appearances by guests in the Science domain during the show’s run.

my_data %>% 
 select(year_episode_aired,guest_occupation,show_date,occupational_domain,guest_list) %>% 
 filter(occupational_domain == "Science")

## # A tibble: 28 × 5
##    year_episode_aired guest_occupation show_date  occupational_domain guest_list
##                 <int> <chr>            <date>     <chr>               <chr>     
##  1               2003 neurosurgeon     2003-04-28 Science             Dr Sanjay…
##  2               2004 scientist        2004-01-13 Science             Catherine…
##  3               2004 physician        2004-06-15 Science             Hassan Ib…
##  4               2005 doctor           2005-09-06 Science             Dr. Marc …
##  5               2006 astronaut        2006-02-13 Science             Astronaut…
##  6               2007 astrophysicist   2007-01-30 Science             Neil deGr…
##  7               2007 surgeon          2007-03-06 Science             Richard J…
##  8               2007 physician        2007-03-08 Science             Dr. Sharo…
##  9               2007 astrophysicist   2007-07-23 Science             Neil deGr…
## 10               2008 neuroscientist   2008-04-01 Science             Simon LeV…
## # … with 18 more rows

Guests in the Politician domain account for 308 appearances during the show’s run.

my_data %>% 
 select(year_episode_aired,guest_occupation,show_date,occupational_domain,guest_list) %>% 
 filter(occupational_domain == "Politician")

## # A tibble: 308 × 5
##    year_episode_aired guest_occupation  show_date  occupational_do… guest_list  
##                 <int> <chr>             <date>     <chr>            <chr>       
##  1               1999 us senator        1999-12-07 Politician       Senator Bob…
##  2               1999 us senator        1999-12-08 Politician       Senator Bob…
##  3               2000 former mayor of … 2000-01-20 Politician       Jerry Sprin…
##  4               2000 former us senator 2000-11-06 Politician       Arlen Spect…
##  5               2000 american politic… 2000-11-07 Politician       Bob Dole    
##  6               2000 former senator f… 2000-02-02 Politician       Focus on Ne…
##  7               2000 american politic… 2000-03-08 Politician       Bob Dole    
##  8               2000 former us senator 2000-04-20 Politician       Arlen Spect…
##  9               2000 american politic… 2000-08-01 Politician       Bob Dole    
## 10               2000 former governor … 2000-08-15 Politician       Bob Kerrey  
## # … with 298 more rows

In this chunk the guest_occupation column is used to filter for writers and directors who appeared on the show, using %in% with a vector built by the c() (combine) function. Only rows whose occupation is exactly “writer” or “director” match.

my_data %>% 
  select(year_episode_aired,guest_occupation,show_date,occupational_domain,guest_list) %>% 
  filter(guest_occupation %in% c("writer","director"))

## # A tibble: 61 × 5
##    year_episode_aired guest_occupation show_date  occupational_domain guest_list
##                 <int> <chr>            <date>     <chr>               <chr>     
##  1               1999 writer           1999-03-17 Media               Frank DeC…
##  2               1999 director         1999-08-12 Media               Eduardo S…
##  3               2000 writer           2000-04-11 Media               Ben Stein 
##  4               2000 writer           2000-06-19 Media               Heather D…
##  5               2000 writer           2000-07-25 Media               Joe Eszte…
##  6               2000 writer           2000-08-04 Media               Robert Re…
##  7               2001 writer           2001-08-13 Media               David Rak…
##  8               2002 writer           2002-05-07 Media               Mark Bowd…
##  9               2003 writer           2003-10-22 Media               Walter Is…
## 10               2003 writer           2003-11-18 Media               Bernard G…
## # … with 51 more rows
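
Exact matching with %in% skips occupations such as “film director”. If those should be included, a pattern match is one option. The sketch below uses stringr::str_detect() on a made-up tibble; toy_guests and its values are hypothetical.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data with exact and compound occupation labels
toy_guests <- tibble::tibble(
  guest_occupation = c("writer", "film director", "actor", "comedy writer", NA)
)

# Exact match: only rows equal to "writer" or "director"
exact_match <- toy_guests %>%
  filter(guest_occupation %in% c("writer", "director"))

# Pattern match: any occupation containing "writer" or "director"
# (rows with a missing occupation are dropped by filter())
pattern_match <- toy_guests %>%
  filter(str_detect(guest_occupation, "writer|director"))

nrow(exact_match)
nrow(pattern_match)
```

Which of the two is appropriate depends on whether compound occupations should count toward the total.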

Observations

If you’re a viewer of The Daily Show, you know that some episodes have more than one guest. The data reflects this: there are 2,693 rows but only 2,639 distinct show dates, so sometimes there are two entries for the same show, one per guest. When guests appear on the same date, an entry may contain both guests’ names in guest_list, while guest_occupation understandably has separate entries if their occupations differ.

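This chunk lists the guest_list entries that appear under more than one occupational domain, along with the dates of their appearances and their occupations. This captures multi-guest entries that are recorded once per guest with different domains, as well as individual guests whose domain differs between appearances.

library(DT)
my_data %>%                                          # guest_list entries linked to more than one occupational domain
  group_by(guest_list) %>% 
  summarise(ngroups = n_distinct(occupational_domain)) %>% 
  filter(ngroups>1) %>% 
  select(-ngroups) %>% 
  inner_join(my_data, by= "guest_list") %>% 
  arrange(year_episode_aired,guest_list) %>% 
  datatable()                                        # the piped result is the table to display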
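
To look directly for show dates that have more than one row, which is one way to spot multi-guest episodes, the rows can be counted per date. A minimal sketch on made-up data (toy_guests, the dates, and the rows_on_date column name are hypothetical):

```r
library(dplyr)

# Hypothetical toy data: two rows share the same show date
toy_guests <- tibble::tibble(
  show_date  = as.Date(c("2001-01-08", "2001-01-08", "2001-01-09", "2001-01-10")),
  guest_list = c("Guest A", "Guest B", "Guest C", "Guest D")
)

multi_guest_dates <- toy_guests %>%
  add_count(show_date, name = "rows_on_date") %>%  # attach the per-date row count
  filter(rows_on_date > 1)                         # keep dates with several rows

multi_guest_dates
```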

Missing Data

While analyzing the data, I noticed that some episodes were special events with no specified guests. Because of this, the data was analyzed only in small chunks as needed. Since the special events provide meaningful insight, those rows should not be removed.

library(DT)
my_data %>% 
  filter(is.na(occupational_domain)) %>% 
  datatable()
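
A quick way to see how many values are missing in every column at once is to sum is.na() across the columns. The sketch below assumes dplyr 1.0.0 or later (for across()) and uses a made-up tibble; toy_guests and its values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: two rows have no occupational domain
toy_guests <- tibble::tibble(
  occupational_domain = c("Acting", NA, "Media", NA),
  guest_list          = c("Guest A", "Special event", "Guest C", "No guest")
)

missing_per_column <- toy_guests %>%
  summarise(across(everything(), ~ sum(is.na(.x))))  # count of NA per column

missing_per_column
```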

Now for some Data Visualisation

This plot visualizes the occupational domains and the number of guests in each. I used a bar chart from the ggplot2 package and flipped it with coord_flip() so the domain names are easier to read. The bars are ordered from the most popular occupational domain to the least. Note that the axis labels can look interchanged after flipping: coord_flip() changes where the axes are drawn, but xlab() and ylab() still label the original x and y aesthetics, so xlab() ends up on the vertical axis.

my_data %>% 
  group_by(occupational_domain) %>%
  summarise(n=n()) %>% 
  arrange(desc(n)) %>% 
  na.omit() %>%                                      # drop the row for missing domains
  ggplot(aes(reorder(occupational_domain,n),n))+     # order bars by count
     geom_col()+                                     # same as geom_bar(stat = "identity")
     coord_flip()+
     xlab("Guest Occupation")+
     ylab("Number of Guests")+
     ggtitle("Guest_List_Grouped_By_Occupation")
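
An alternative that avoids any confusion about labels with coord_flip() is to map the count to x and the category to y directly; ggplot2 3.3.0 and later draws this as horizontal bars. A minimal sketch on made-up counts (toy_counts and its values are hypothetical):

```r
library(ggplot2)

# Hypothetical toy counts per occupational domain
toy_counts <- data.frame(
  occupational_domain = c("Acting", "Media", "Comedy"),
  n = c(5, 3, 1)
)

# Category on y, count on x: horizontal bars without coord_flip(),
# so each label names the axis it is actually drawn on
p <- ggplot(toy_counts, aes(x = n, y = reorder(occupational_domain, n))) +
  geom_col() +
  labs(x = "Number of Guests", y = "Guest Occupation")

p
```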

This plot shows guest appearances by the year the episode aired, grouped by occupational domain (which, if you recall, was renamed from the group variable). Each domain has its own fill color so the domains can be told apart within each year’s bar.

library(ggplot2)
my_data %>% 
  group_by(occupational_domain,year_episode_aired) %>% 
  summarise(n=n()) %>% 
  filter(!is.na(occupational_domain)) %>%            # drop missing domains before plotting
  ggplot(aes(factor(year_episode_aired),n))+
  geom_bar(stat = "identity",aes(fill= occupational_domain))+
  theme(axis.text.x = element_text(angle = 90,hjust = 0.5))+
  xlab("Year Episode Aired")+
  ylab("Frequency of Occupational Domain")+
  ggtitle("Guests_Appearing_Per_Year")

## `summarise()` has grouped output by 'occupational_domain'. You can override using the `.groups` argument.
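
One thing to keep in mind when a ggplot2 chain mixes + and %>%: the pipe has higher operator precedence than +, so a pipe placed after the last + applies only to that last term, not to the whole plot or its data. Data steps such as removing missing values therefore belong before ggplot(). A small arithmetic sketch of the precedence rule:

```r
library(magrittr)

# %>% binds tighter than +, so only the 4 is piped into sqrt()
piped_last_term <- 1 + 4 %>% sqrt()    # evaluated as 1 + sqrt(4)

# Parentheses are needed to pipe the whole sum
piped_whole_sum <- (1 + 4) %>% sqrt()  # evaluated as sqrt(5)

piped_last_term
piped_whole_sum
```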

Bivariate Relationships

This bivariate graph shows the top five guest occupational domains over the period from 1999 to 2015. It helps visualize the five most common types of occupation appearing on The Daily Show during its run. Acting and Media were the two most frequent occupational domains. When the show first aired, there was a conspicuous peak of actors appearing. After that there is a steady decline, and between 2008 and 2009 there is a precipitous dip in actors appearing and an uptrend in media appearances.

my_data %>% 
  group_by(occupational_domain,year_episode_aired) %>% 
  summarise(n=n()) %>% 
  summarise(m=mean(n)) %>% 
  arrange(desc(m)) %>% 
  filter(row_number()<= 5) %>% 
  select(-m) %>% 
  inner_join(my_data, by ="occupational_domain") %>% 
  group_by(occupational_domain,year_episode_aired) %>% 
  summarise(n=n()) %>% 
  filter(year_episode_aired<=2015) %>% 
  ggplot(aes(year_episode_aired,n)) + geom_line(aes(col= occupational_domain), lwd=1.5) +
  ggtitle(" Guests Occupational Domain Over Time ")

## `summarise()` has grouped output by 'occupational_domain'. You can override using the `.groups` argument.
## `summarise()` has grouped output by 'occupational_domain'. You can override using the `.groups` argument.
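
The step that picks the top domains by average yearly count can also be written with slice_max(), available in dplyr 1.0.0 and later. The sketch below shows the idea on made-up yearly counts, keeping the top two rather than five; toy_counts and its values are hypothetical.

```r
library(dplyr)

# Hypothetical yearly counts per occupational domain
toy_counts <- tibble::tibble(
  occupational_domain = c("Acting", "Acting", "Media", "Media", "Comedy", "Comedy"),
  year_episode_aired  = c(1999L, 2000L, 1999L, 2000L, 1999L, 2000L),
  n                   = c(50L, 40L, 30L, 45L, 5L, 8L)
)

top_domains <- toy_counts %>%
  group_by(occupational_domain) %>%
  summarise(m = mean(n)) %>%               # average appearances per year
  slice_max(m, n = 2, with_ties = FALSE)   # keep the two highest averages

top_domains
```

The resulting domain names can then be joined back to the full data, as the chunk above does with inner_join().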

The next chunks build a table of occupational domains ranked from most to least frequent.

my_data<-as.data.table(my_data)
my_data

##       year_episode_aired   guest_occupation  show_date occupational_domain
##    1:               1999              actor 1999-01-11              Acting
##    2:               1999           comedian 1999-01-12              Comedy
##    3:               1999 television actress 1999-01-13              Acting
##    4:               1999       film actress 1999-01-14              Acting
##    5:               1999              actor 1999-01-18              Acting
##   ---                                                                     
## 2689:               2015         biographer 2015-07-29               Media
## 2690:               2015           director 2015-07-30               Media
## 2691:               2015  stand-up comedian 2015-08-03              Comedy
## 2692:               2015              actor 2015-08-04              Acting
## 2693:               2015           comedian 2015-08-05              Comedy
##                 guest_list
##    1:       Michael J. Fox
##    2:      Sandra Bernhard
##    3:        Tracey Ullman
##    4:     Gillian Anderson
##    5:     David Alan Grier
##   ---                     
## 2689: Doris Kearns Goodwin
## 2690:         J. J. Abrams
## 2691:          Amy Schumer
## 2692:          Denis Leary
## 2693:           Louis C.K.

my_data1<-my_data[occupational_domain!=""]
my_data2<-my_data1[,.N,by=occupational_domain]
view(my_data2)

Arrange the new table, which has two variables (occupational domain and number of appearances), in descending order of appearances.

my_data3<-my_data2[order(-N)] %>% 
  rename("frequency" = N) %>% 
  head(17)
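
The filter, count, rename, and sort steps can also be chained in data.table syntax alone. A minimal sketch on a made-up table (toy_dt and its values are hypothetical):

```r
library(data.table)

# Hypothetical toy data; the empty string stands for a missing domain
toy_dt <- data.table(
  occupational_domain = c("Acting", "Media", "Acting", "", "Acting", "Media")
)

# i: drop empty domains; j: count rows and name the column; by: one row per domain
domain_freq <- toy_dt[occupational_domain != "",
                      .(frequency = .N),
                      by = occupational_domain][order(-frequency)]

domain_freq
```

Naming the count inside .() avoids the separate rename() step.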

I wanted to try a word cloud since I had never made one before, and it can give a visual bivariate representation of data of my choosing. For this word cloud I chose occupational_domain and a renamed variable called frequency. My hypothesis was that, since Acting had the most appearances on the show, its text would appear largest, because text in a word cloud is drawn larger the more frequent it is. Running the word cloud chunk confirmed this: the Acting domain appears largest, followed in descending frequency by the other occupational domains. Since this is a small subsection of the data set, the word cloud is even smaller than I had imagined; however, it accomplishes its intended purpose.

library(wordcloud)
#png("wordcloud.png",width=10,height=7,units ='in',res=300)
par(mar=rep(0,4))
set.seed(1330)
wordcloud(words = my_data3$occupational_domain,freq = my_data3$frequency,scale=c(3.5,0.65),
          max.words=17,colors=brewer.pal(8,"Dark2"))

Here I’m trying a word cloud on a larger subset to see what it looks like. This word cloud is built from guest_occupation and the frequency of each occupation’s appearance.

my_data4<-my_data[guest_occupation!=""]       # drop rows with no guest occupation
my_data5<-my_data4[,.N,by=guest_occupation]   # count appearances per guest occupation
view(my_data5)

my_data6<-my_data5[order(-N)] %>% # sorts the data in descending order of N
  rename("frequency_1" = N) %>%  # renames the guest occupation frequency from N to frequency_1
  head(333)    # keeps the first 333 rows of the new bivariate data set
view(my_data6) # displays the renamed data set

library(wordcloud)
#png("wordcloud.png",width=10,height=7,units ='in',res=300)
par(mar=rep(0,4))
set.seed(50)
wordcloud(words = my_data6$guest_occupation,freq = my_data6$frequency_1,scale=c(3.5,0.75),
          max.words=333,colors=brewer.pal(8,"Dark2"))

## Warning in wordcloud(words = my_data6$guest_occupation, freq =
## my_data6$frequency_1, : former lieutenant governor of maryland could not be fit
## on page. It will not be plotted.

## Warning in wordcloud(words = my_data6$guest_occupation, freq =
## my_data6$frequency_1, : former white house press secretary could not be fit on
## page. It will not be plotted.

## Warning in wordcloud(words = my_data6$guest_occupation, freq =
## my_data6$frequency_1, : television personality could not be fit on page. It will
## not be plotted.

## Warning in wordcloud(words = my_data6$guest_occupation, freq =
## my_data6$frequency_1, : minority leader of the united states house of
## representatives could not be fit on page. It will not be plotted.
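
The “could not be fit on page” warnings mean some long labels were left out of the plot. Drawing on a larger device and lowering the scale values may help more of them fit, though this is not guaranteed for very long labels. The sketch below is an assumption-laden example: toy_freq, its values, the output size, and the scale values are all hypothetical choices.

```r
library(wordcloud)
library(RColorBrewer)

# Hypothetical toy frequencies, including one long label
toy_freq <- data.frame(
  guest_occupation = c("actor", "journalist", "former white house press secretary"),
  frequency_1      = c(30, 12, 3)
)

# Draw to a larger PNG canvas with a smaller maximum text size
out_file <- tempfile(fileext = ".png")
png(out_file, width = 12, height = 9, units = "in", res = 150)
par(mar = rep(0, 4))
set.seed(50)
wordcloud(words = toy_freq$guest_occupation, freq = toy_freq$frequency_1,
          scale = c(2.5, 0.5), min.freq = 1,
          colors = brewer.pal(8, "Dark2"))
dev.off()
```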