Introduction

Homework Five takes a more granular look at how data can be analysed visually. A visual component makes it possible to see patterns in the data and to glean insights that might not have been apparent otherwise.

For this assignment I am using a data set from FiveThirtyEight, a website that covers economics, entertainment, science and politics. The data set lists the guests of The Daily Show during the tenure of host Jon Stewart, from 1999 to 2015. I chose it because I thought it would be fun and insightful to examine a data set in the entertainment domain.

Importing the data

Before any visualization can take place, the data and the libraries must first be loaded into RStudio.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.0.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(datasets)
library(stringr)
library(distill)
library(devtools)
## Loading required package: usethis
library(here)
## here() starts at /Users/chester/Documents
library(skimr)
library(fivethirtyeight)
## Some larger datasets need to be installed separately, like senators and
## house_district_forecast. To install these, we recommend you install the
## fivethirtyeightdata package by running:
## install.packages('fivethirtyeightdata', repos =
## 'https://fivethirtyeightdata.github.io/drat/', type = 'source')
library(DT)
data("daily_show_guests")

Now that the daily_show_guests data set is loaded, we can tidy it up, clean it, and see what it’s all about.

head(daily_show_guests,5)    # to see the first five rows of data set
## # A tibble: 5 × 5
##    year google_knowledge_occupation show       group  raw_guest_list  
##   <int> <chr>                       <date>     <chr>  <chr>           
## 1  1999 actor                       1999-01-11 Acting Michael J. Fox  
## 2  1999 comedian                    1999-01-12 Comedy Sandra Bernhard 
## 3  1999 television actress          1999-01-13 Acting Tracey Ullman   
## 4  1999 film actress                1999-01-14 Acting Gillian Anderson
## 5  1999 actor                       1999-01-18 Acting David Alan Grier
tail(daily_show_guests,5)    # shows the last five rows of data set
## # A tibble: 5 × 5
##    year google_knowledge_occupation show       group  raw_guest_list      
##   <int> <chr>                       <date>     <chr>  <chr>               
## 1  2015 biographer                  2015-07-29 Media  Doris Kearns Goodwin
## 2  2015 director                    2015-07-30 Media  J. J. Abrams        
## 3  2015 stand-up comedian           2015-08-03 Comedy Amy Schumer         
## 4  2015 actor                       2015-08-04 Acting Denis Leary         
## 5  2015 comedian                    2015-08-05 Comedy Louis C.K.
dim(daily_show_guests)       # shows the dimensions of data set
## [1] 2693    5
colnames(daily_show_guests)  # column names
## [1] "year"                        "google_knowledge_occupation"
## [3] "show"                        "group"                      
## [5] "raw_guest_list"

Summary Statistics

Here the skim() function from the skimr package is used; the base summary() call is left commented out. My preference is for skim() because it reports missing values and handles dates and logicals. It gives more comprehensive insight than summary() by comparison.

skim(daily_show_guests)   
Data summary
Name daily_show_guests
Number of rows 2693
Number of columns 5
_______________________
Column type frequency:
character 3
Date 1
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
google_knowledge_occupation 26 0.99 1 66 0 335 0
group 31 0.99 4 14 0 17 0
raw_guest_list 0 1.00 3 72 0 1669 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
show 0 1 1999-01-11 2015-08-05 2007-03-22 2639

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 2006.82 4.83 1999 2003 2007 2011 2015 ▇▆▆▆▇
#summary(daily_show_guests)
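
As a quick cross-check on the n_missing and complete_rate columns that skim() reports, the same figures can be computed by hand in base R. This is only a sketch on a small made-up data frame (the toy rows are illustrative and are not taken from daily_show_guests).

```r
# Toy stand-in for daily_show_guests (illustrative values, not real rows)
toy <- data.frame(
  guest_occupation = c("actor", NA, "comedian", "writer"),
  domain           = c("Acting", NA, "Comedy", NA),
  guest_list       = c("Guest A", "Special event", "Guest B", "Guest C"),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
missing_counts <- colSums(is.na(toy))
print(missing_counts)

# Share of non-missing values, comparable to skim()'s complete_rate
complete_rate <- 1 - missing_counts / nrow(toy)
print(complete_rate)
```

Replacing toy with daily_show_guests should give the same missing counts that appear in the skim() output above.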

Now, to make the data set a bit more understandable, the rename() function is used to rename google_knowledge_occupation to guest_occupation, show to show_date, raw_guest_list to guest_list, year to year_episode_aired and group to domain.

daily_show_guests <- daily_show_guests %>%
  rename(
    guest_occupation   = google_knowledge_occupation,
    show_date          = show,
    guest_list         = raw_guest_list,
    year_episode_aired = year,
    domain             = group
  )

This chunk verifies that the column headings now reflect the new names.

head(daily_show_guests,5)    #shows first five rows of data set.
## # A tibble: 5 × 5
##   year_episode_aired guest_occupation   show_date  domain guest_list      
##                <int> <chr>              <date>     <chr>  <chr>           
## 1               1999 actor              1999-01-11 Acting Michael J. Fox  
## 2               1999 comedian           1999-01-12 Comedy Sandra Bernhard 
## 3               1999 television actress 1999-01-13 Acting Tracey Ullman   
## 4               1999 film actress       1999-01-14 Acting Gillian Anderson
## 5               1999 actor              1999-01-18 Acting David Alan Grier

This chunk creates my_data, a copy of daily_show_guests, for ease of analysis.

my_data<-daily_show_guests

This chunk shows the domains that appeared more than 10 times on the show. As predicted, Acting appeared the most, at 930 times.

my_data %>% 
  count(domain) %>% 
  filter(n > 10) %>% 
  na.omit()
## # A tibble: 15 × 2
##    domain             n
##    <chr>          <int>
##  1 Academic         103
##  2 Acting           930
##  3 Advocacy          24
##  4 Athletics         52
##  5 Business          25
##  6 Comedy           150
##  7 Consultant        18
##  8 Government        40
##  9 Media            751
## 10 Military          16
## 11 Misc              45
## 12 Musician         123
## 13 Political Aide    36
## 14 Politician       308
## 15 Science           28

This chunk shows the guests who appeared on the show more than 10 times. Fareed Zakaria appeared more than any other guest, with 19 appearances.

my_data %>% 
  count(guest_list) %>% 
  filter(n > 10) %>% 
  na.omit()
## # A tibble: 6 × 2
##   guest_list         n
##   <chr>          <int>
## 1 Brian Williams    16
## 2 Denis Leary       17
## 3 Fareed Zakaria    19
## 4 Paul Rudd         13
## 5 Ricky Gervais     13
## 6 Tom Brokaw        12
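
To rank the most frequent guests rather than list them alphabetically, count() accepts sort = TRUE. A minimal sketch on a made-up tibble (the guest names are placeholders, not real data):

```r
library(dplyr)

# Toy stand-in for my_data (placeholder names, not real rows)
toy <- tibble(
  guest_list = c("Guest A", "Guest B", "Guest A", "Guest C", "Guest A", "Guest B")
)

top_guests <- toy %>%
  count(guest_list, sort = TRUE) %>%  # most frequent guest first
  filter(n > 1)                       # keep repeat guests only

print(top_guests)
```

With my_data in place of toy and n > 10 as the threshold, the table above would come out ordered by number of appearances.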

This chunk explores how many guest appearances fall in the Acting domain. It shows 930 appearances from 1999 to 2015.

my_data %>% 
 select(year_episode_aired,guest_occupation,show_date,domain,guest_list) %>% 
 filter(domain == "Acting")
## # A tibble: 930 × 5
##    year_episode_aired guest_occupation   show_date  domain guest_list      
##                 <int> <chr>              <date>     <chr>  <chr>           
##  1               1999 actor              1999-01-11 Acting Michael J. Fox  
##  2               1999 television actress 1999-01-13 Acting Tracey Ullman   
##  3               1999 film actress       1999-01-14 Acting Gillian Anderson
##  4               1999 actor              1999-01-18 Acting David Alan Grier
##  5               1999 actor              1999-01-19 Acting William Baldwin 
##  6               1999 actor              1999-01-25 Acting Matthew Lillard 
##  7               1999 actress            1999-01-27 Acting Yasmine Bleeth  
##  8               1999 actor              1999-01-28 Acting D. L. Hughley   
##  9               1999 television actress 1999-10-18 Acting Rebecca Gayheart
## 10               1999 actress            1999-10-20 Acting Amy Brenneman   
## # … with 920 more rows

Here we see 28 guest appearances in the Science domain during the show’s run.

my_data %>% 
 select(year_episode_aired,guest_occupation,show_date,domain,guest_list) %>% 
 filter(domain == "Science")
## # A tibble: 28 × 5
##    year_episode_aired guest_occupation show_date  domain  guest_list            
##                 <int> <chr>            <date>     <chr>   <chr>                 
##  1               2003 neurosurgeon     2003-04-28 Science Dr Sanjay Gupta       
##  2               2004 scientist        2004-01-13 Science Catherine Weitz       
##  3               2004 physician        2004-06-15 Science Hassan Ibrahim        
##  4               2005 doctor           2005-09-06 Science Dr. Marc Siegel       
##  5               2006 astronaut        2006-02-13 Science Astronaut Mike Mullane
##  6               2007 astrophysicist   2007-01-30 Science Neil deGrasse Tyson   
##  7               2007 surgeon          2007-03-06 Science Richard Jadick        
##  8               2007 physician        2007-03-08 Science Dr. Sharon Moalem     
##  9               2007 astrophysicist   2007-07-23 Science Neil deGrasse Tyson   
## 10               2008 neuroscientist   2008-04-01 Science Simon LeVay           
## # … with 18 more rows

There were 308 guest appearances in the Politician domain during the show’s run.

my_data %>% 
 select(year_episode_aired,guest_occupation,show_date,domain,guest_list) %>% 
 filter(domain == "Politician")
## # A tibble: 308 × 5
##    year_episode_aired guest_occupation            show_date  domain  guest_list 
##                 <int> <chr>                       <date>     <chr>   <chr>      
##  1               1999 us senator                  1999-12-07 Politi… Senator Bo…
##  2               1999 us senator                  1999-12-08 Politi… Senator Bo…
##  3               2000 former mayor of cincinatti  2000-01-20 Politi… Jerry Spri…
##  4               2000 former us senator           2000-11-06 Politi… Arlen Spec…
##  5               2000 american politician         2000-11-07 Politi… Bob Dole   
##  6               2000 former senator from kansas  2000-02-02 Politi… Focus on N…
##  7               2000 american politician         2000-03-08 Politi… Bob Dole   
##  8               2000 former us senator           2000-04-20 Politi… Arlen Spec…
##  9               2000 american politician         2000-08-01 Politi… Bob Dole   
## 10               2000 former governor of nebraska 2000-08-15 Politi… Bob Kerrey 
## # … with 298 more rows

In this chunk the guest_occupation column is used to filter for the writers and directors who appeared on the show, using the combine function c() together with %in%.

my_data %>% 
  select(year_episode_aired,guest_occupation,show_date,domain,guest_list) %>% 
  filter(guest_occupation %in% c("writer","director"))
## # A tibble: 61 × 5
##    year_episode_aired guest_occupation show_date  domain guest_list             
##                 <int> <chr>            <date>     <chr>  <chr>                  
##  1               1999 writer           1999-03-17 Media  Frank DeCaro's Oscar S…
##  2               1999 director         1999-08-12 Media  Eduardo Sanchez and Da…
##  3               2000 writer           2000-04-11 Media  Ben Stein              
##  4               2000 writer           2000-06-19 Media  Heather Donahue        
##  5               2000 writer           2000-07-25 Media  Joe Eszterhas          
##  6               2000 writer           2000-08-04 Media  Robert Reich and Ben S…
##  7               2001 writer           2001-08-13 Media  David Rakoff           
##  8               2002 writer           2002-05-07 Media  Mark Bowden            
##  9               2003 writer           2003-10-22 Media  Walter Isaacson        
## 10               2003 writer           2003-11-18 Media  Bernard Goldberg       
## # … with 51 more rows
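
The %in% filter above matches only occupations that are exactly "writer" or "director". If partial matches are also of interest, stringr's str_detect() can be used instead. The sketch below runs on made-up rows, and the occupation labels in it are assumptions for illustration only.

```r
library(dplyr)
library(stringr)

# Toy stand-in for my_data (illustrative occupations, not real rows)
toy <- tibble(
  guest_occupation = c("writer", "film director", "actor", "comedy writer", NA),
  guest_list       = c("Guest A", "Guest B", "Guest C", "Guest D", "Guest E")
)

# Exact match: only rows whose occupation is exactly "writer" or "director"
exact <- toy %>%
  filter(guest_occupation %in% c("writer", "director"))

# Partial match: any occupation containing "writer" or "director"
# (rows with a missing occupation are dropped by filter())
partial <- toy %>%
  filter(str_detect(guest_occupation, "writer|director"))

print(exact)
print(partial)
```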

Observations

If you’re a viewer of The Daily Show you know that many shows have more than one guest. The data reflects this: sometimes there are two entries for the same show, one for “guest 1” and one for “guest 2”. When guests appear on the same date, each entry contains both guests’ names, but guest_occupation understandably differs between the entries if the guests have different occupations.

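This chunk shows the guest_list entries recorded under more than one domain, which is how multi-guest episodes with differing occupations show up, along with the dates of their appearances and their occupations.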

library(DT)
my_data %>%                                          # guest entries recorded under more than one domain
  group_by(guest_list) %>% 
  summarise(ngroups = n_distinct(domain)) %>% 
  filter(ngroups>1) %>% 
  select(-ngroups) %>% 
  inner_join(my_data, by= "guest_list") %>% 
  arrange(year_episode_aired,guest_list) %>% 
  datatable()                                        # the piped data is already the first argument
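
Another way to look for multi-guest episodes is to count the entries per show date and keep the dates with more than one. This is a sketch on a made-up tibble, so the dates and names below are placeholders rather than real rows.

```r
library(dplyr)

# Toy stand-in for my_data (placeholder dates and names)
toy <- tibble(
  show_date  = as.Date(c("2001-01-01", "2001-01-01", "2001-01-02",
                         "2001-01-03", "2001-01-03")),
  guest_list = c("Guest A and Guest B", "Guest A and Guest B",
                 "Guest C", "Guest D", "Guest E")
)

# Dates that have more than one entry in the data
multi_entry_dates <- toy %>%
  count(show_date, name = "entries") %>%
  filter(entries > 1)

print(multi_entry_dates)
```

The skim() output earlier reports 2693 rows but only 2639 unique show dates, so running this on my_data should return the dates that account for the difference.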

Missing Data

When the data was analysed, it became clear that some episodes were special events with no specified guests. Because of this, the data was only analysed in small chunks as needed. Since the special events provide meaningful insight, they should not be removed.

library(DT)
my_data %>% 
  filter(is.na(domain)) %>%    # rows with no domain recorded
  datatable()

Now for some Data Visualisation

This chart visualizes guests by occupational domain.

my_data %>% 
  group_by(domain) %>%
  summarise(n=n()) %>% 
  arrange(desc(n)) %>% 
  na.omit() %>% 
  ggplot(aes(reorder(domain,n),n))+
     geom_bar(stat = "identity")+
     coord_flip()+
     ggtitle("Guest_List_Grouped_By_Occupation")

This chart visualizes guest appearances by the year the episode aired, grouped by occupational domain.

my_data %>% 
  group_by(domain,year_episode_aired) %>% 
  summarise(n=n()) %>% 
  na.omit() %>%                      # drop rows with a missing domain before plotting
  ggplot(aes(factor(year_episode_aired),n))+
  geom_bar(stat = "identity",aes(fill=domain))+
  theme(axis.text.x = element_text(angle = 45,hjust = 1))+
  ggtitle("Guests_Appearing_Per_Year")
## `summarise()` has grouped output by 'domain'. You can override using the `.groups` argument.
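
The message above about grouped output can be avoided by setting the .groups argument of summarise() explicitly. A small sketch on made-up data (the rows are illustrative only):

```r
library(dplyr)

# Toy stand-in for my_data (illustrative rows)
toy <- tibble(
  domain             = c("Acting", "Acting", "Media", "Media", "Media"),
  year_episode_aired = c(1999L, 1999L, 1999L, 2000L, 2000L)
)

per_year <- toy %>%
  group_by(domain, year_episode_aired) %>%
  summarise(n = n(), .groups = "drop")  # return an ungrouped tibble, no message

print(per_year)
```

Using .groups = "drop" returns an ungrouped result; "drop_last" keeps the default behaviour but without the message.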

Top 5 guest occupational domains over time. The last year of the show, 2015, is omitted for aesthetic purposes.

my_data %>% 
  group_by(domain,year_episode_aired) %>% 
  summarise(n=n()) %>% 
  summarise(m=mean(n)) %>% 
  arrange(desc(m)) %>% 
  filter(row_number()<= 5) %>% 
  select(-m) %>% 
  inner_join(my_data, by ="domain") %>% 
  group_by(domain, year_episode_aired) %>% 
  summarise(n=n()) %>% 
  filter(year_episode_aired<2015) %>% 
  ggplot(aes(year_episode_aired,n)) + geom_line(aes(col=domain), lwd=2) +
    ggtitle(" Guests Occupational Domain Over Time ")
## `summarise()` has grouped output by 'domain'. You can override using the `.groups` argument.
## `summarise()` has grouped output by 'domain'. You can override using the `.groups` argument.

That’s all for now; this is a work in progress.