Selected data set: “Daily-Show-Guests” from FiveThirtyEight’s GitHub: https://github.com/fivethirtyeight/data/tree/master/daily-show-guests
This data comes from the FiveThirtyEight article “Every Guest Jon Stewart Ever Had on ‘The Daily Show.’” Article summary: The Daily Show aired its last new episode hosted by Jon Stewart on August 6th, 2015, and FiveThirtyEight did a retrospective, analyzing who served as a guest during his hosting tenure (1/11/99 to 8/6/15). These data are important because Stewart’s show was popular, and who he chose to highlight could have had a major impact on culture and politics.
The data set is in CSV format. I read it into my R Markdown document directly from its URL and store it under the name “daily.”
daily <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/daily-show-guests/daily_show_guests.csv", sep = ",", encoding = "UTF-8")
The data set includes a “README.md” metadata file, which helps define the columns.
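To try the import step without a network connection, read.csv can parse inline text instead of a URL. The following is a minimal sketch, assuming the header names given in the README; the object names sample_csv and sample_daily are hypothetical, and the three rows are an excerpt of the first rows of the file.

```r
# Hypothetical three-row excerpt that uses the same header as the real file
sample_csv <- "YEAR,GoogleKnowlege_Occupation,Show,Group,Raw_Guest_List
1999,actor,1/11/99,Acting,Michael J. Fox
1999,Comedian,1/12/99,Comedy,Sandra Bernhard
1999,television actress,1/13/99,Acting,Tracey Ullman"

# The text argument makes read.csv() parse a string rather than a file or URL
sample_daily <- read.csv(text = sample_csv, stringsAsFactors = FALSE)

# Inspect the column types that read.csv() inferred
str(sample_daily)
```

Passing stringsAsFactors = FALSE is redundant on R 4.0 and later, but it keeps the text columns as character on older versions too.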
readLines(con="https://raw.githubusercontent.com/fivethirtyeight/data/master/daily-show-guests/README.md")
## [1] "# Daily Show Guests"
## [2] ""
## [3] "This folder contains data behind the story [Every Guest Jon Stewart Ever Had On ‘The Daily Show’](http://fivethirtyeight.com/datalab/every-guest-jon-stewart-ever-had-on-the-daily-show/)."
## [4] ""
## [5] "Header | Definition"
## [6] "---|---------"
## [7] "`YEAR` | The year the episode aired"
## [8] "`GoogleKnowlege_Occupation` | Their occupation or office, according to Google's Knowledge Graph or, if they're not in there, how Stewart introduced them on the program."
## [9] "`Show` | Air date of episode. Not unique, as some shows had more than one guest"
## [10] "`Group` | A larger group designation for the occupation. For instance, us senators, us presidents, and former presidents are all under \"politicians\""
## [11] "`Raw_Guest_List` | The person or list of people who appeared on the show, according to Wikipedia. The GoogleKnowlege_Occupation only refers to one of them in a given row."
## [12] ""
## [13] "Source: Google Knowlege Graph, The Daily Show clip library, Wikipedia."
First, I’ll rename the GoogleKnowlege_Occupation column (the misspelling is in the source file) to something shorter, rename the Show and Group columns to something more intuitive, and change YEAR to title case. This will make exploratory data analysis easier.
colnames(daily)[colnames(daily) == "GoogleKnowlege_Occupation"] <- "Occupation"
colnames(daily)[colnames(daily) == "Show"] <- "AirDate"
colnames(daily)[colnames(daily) == "Group"] <- "OccupationType"
colnames(daily)[colnames(daily) == "YEAR"] <- "Year"
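The four renames can also be driven by a single lookup vector, which keeps the old-to-new mapping in one place. This is only a sketch on a one-row toy data frame; toy and new_names are hypothetical names introduced for illustration.

```r
# Toy data frame with the original header names
toy <- data.frame(
  YEAR = 1999L,
  GoogleKnowlege_Occupation = "actor",
  Show = "1/11/99",
  Group = "Acting",
  Raw_Guest_List = "Michael J. Fox",
  stringsAsFactors = FALSE
)

# Named vector: names are the old headers, values are the new headers
new_names <- c(
  YEAR = "Year",
  GoogleKnowlege_Occupation = "Occupation",
  Show = "AirDate",
  Group = "OccupationType"
)

# Rename only the columns that appear in the lookup; leave the rest untouched
hits <- names(toy) %in% names(new_names)
names(toy)[hits] <- unname(new_names[names(toy)[hits]])

names(toy)
```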
Next, I’ll run a few R commands to understand the shape and format of the data before making any further changes.
The summary function reports a length of 2693 for each of the four character columns and gives the range and quartiles of the numeric Year column (1999 to 2015). Together with the five column names, this implies a data frame of 2693 rows by 5 columns, with every column the same length.
summary(daily)
## Year Occupation AirDate OccupationType
## Min. :1999 Length:2693 Length:2693 Length:2693
## 1st Qu.:2003 Class :character Class :character Class :character
## Median :2007 Mode :character Mode :character Mode :character
## Mean :2007
## 3rd Qu.:2011
## Max. :2015
## Raw_Guest_List
## Length:2693
## Class :character
## Mode :character
##
##
##
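Because summary only implies the dimensions, a few direct checks can confirm them. The sketch below uses a hypothetical two-row data frame named toy; run against daily, the same calls would report the row and column counts, the class of each column, and the number of missing values per column.

```r
# Hypothetical two-row stand-in for the real data frame
toy <- data.frame(
  Year = c(1999L, 1999L),
  AirDate = c("1/11/99", "1/12/99"),
  stringsAsFactors = FALSE
)

shape <- dim(toy)                 # c(rows, columns)
col_classes <- sapply(toy, class) # class of each column
na_counts <- colSums(is.na(toy))  # missing values per column

shape
col_classes
na_counts
```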
The head function gives an example of what the values look like. The values align with what we’d expect based on the column names.
head(daily)
## Year Occupation AirDate OccupationType Raw_Guest_List
## 1 1999 actor 1/11/99 Acting Michael J. Fox
## 2 1999 Comedian 1/12/99 Comedy Sandra Bernhard
## 3 1999 television actress 1/13/99 Acting Tracey Ullman
## 4 1999 film actress 1/14/99 Acting Gillian Anderson
## 5 1999 actor 1/18/99 Acting David Alan Grier
## 6 1999 actor 1/19/99 Acting William Baldwin
Per the above, this data frame has 2693 rows and 5 columns. My first goal is to remove as many columns as possible, because the assignment instructions ask that the final product be a “subset” of the original data frame. Two pairs of columns carry somewhat duplicative information: 1) Year and AirDate, and 2) Occupation and OccupationType. Although Year’s information is technically contained in AirDate, AirDate has so many unique values (roughly one date per show) that I believe keeping Year is prudent: it groups dates in a way that makes graphs easier to read.
“Occupation” can be removed since it is overly detailed. As the R code below shows, this column has 399 unique values (counting NA, and counting case variants such as “Film actor” and “film actor” separately). These would be easier to understand, chart, and analyze in larger groupings, such as “Singer,” “Vocalist,” and “singer-songwriter” being grouped under “Musician.” The OccupationType column already provides these groupings.
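To illustrate that Year is recoverable from AirDate, the date strings can be parsed and the year extracted. This is a sketch under the assumption that every AirDate follows the month/day/two-digit-year pattern seen in the output above; air_date and parsed are hypothetical names.

```r
# Three AirDate-style strings in month/day/two-digit-year form
air_date <- c("1/11/99", "2/1/00", "8/6/15")

# %y maps 69-99 to 19xx and 00-68 to 20xx, which covers 1999 through 2015
parsed <- as.Date(air_date, format = "%m/%d/%y")

# Four-digit year recovered from the parsed dates
year_from_date <- format(parsed, "%Y")
year_from_date
```

Keeping the ready-made Year column avoids this parsing step whenever only the year is needed.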
length(unique(daily$Occupation)) #Number of unique Occupation values: 399
## [1] 399
head(unique(daily$Occupation),50) #Sample of the overly-detailed occupation values
## [1] "actor" "Comedian"
## [3] "television actress" "film actress"
## [5] "Singer-lyricist" "model"
## [7] "stand-up comedian" "actress"
## [9] "comedian" "Singer-songwriter"
## [11] "television personality" "Comic"
## [13] "rock band" "musician"
## [15] "Film actor" "Model"
## [17] "journalist" NA
## [19] "singer-songwriter" "us senator"
## [21] "film actor" "pianist"
## [23] "Vocalist" "writer"
## [25] "Stand-up comedian" "Film director"
## [27] "singer" "television host"
## [29] "televison actor" "muppet"
## [31] "director" "film director"
## [33] "american television personality" "rapper"
## [35] "football player" "former mayor of cincinatti"
## [37] "Film actress" "businesswoman"
## [39] "activist" "Media person"
## [41] "former us senator" "american politician"
## [43] "Filmmaker" "radio personality"
## [45] "commentator" "Journalist"
## [47] "former senator from kansas" "Reporter"
## [49] "Singer" "professional wrestler"
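Part of the 399 count comes from inconsistent capitalization rather than genuinely different occupations. The sketch below shows the effect on a small hypothetical vector, occ, built from values visible in the output above; it is an illustration, not a step in the analysis.

```r
# Values taken from the printed sample, including case variants and an NA
occ <- c("actor", "Film actor", "film actor", "Comedian", "comedian",
         "Stand-up comedian", "stand-up comedian", NA)

# Count of distinct values as-is (NA counts as one value)
n_raw <- length(unique(occ))

# Lower-case and trim whitespace before counting again
normalized <- tolower(trimws(occ))
n_normalized <- length(unique(normalized))

# Same count with NA excluded
n_non_missing <- length(unique(na.omit(normalized)))

c(raw = n_raw, normalized = n_normalized, non_missing = n_non_missing)
```

Even after normalization the labels remain fine-grained, which is why the broader OccupationType grouping is still the better column to keep.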
Dropping the Occupation column:
daily <- subset(daily, select = -c(Occupation))
The daily data frame now has 4 columns instead of 5.
ncol(daily)
## [1] 4
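How many rows repeat an AirDate that has already appeared, meaning the show had more than one guest row that day? Note that duplicated flags the second and later occurrences of a value, so the result counts extra rows rather than distinct dates.
library(tidyverse) #duplicated() is base R; tidyverse is loaded here for the ggplot2 chart further down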
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
sum(duplicated(daily$AirDate))
## [1] 54
Which AirDates had multiple guests? A date that appears more than once in this list (for example, 2/10/99) has three or more rows.
daily$AirDate[duplicated(daily$AirDate)]
## [1] "2/10/99" "2/10/99" "3/11/99" "3/17/99" "8/12/99" "9/30/99"
## [7] "10/25/00" "2/1/00" "4/24/00" "6/14/00" "7/17/00" "8/4/00"
## [13] "6/20/02" "12/3/03" "5/26/03" "7/28/03" "9/15/03" "11/2/04"
## [19] "4/6/04" "9/30/04" "7/18/05" "9/14/05" "9/14/05" "1/1/07"
## [25] "3/8/07" "1/1/08" "1/1/08" "7/22/08" "10/28/09" "8/12/09"
## [31] "1/13/10" "11/16/10" "6/24/10" "6/30/10" "1/20/11" "3/10/11"
## [37] "6/15/11" "11/27/12" "11/8/12" "2/1/12" "2/20/12" "6/4/12"
## [43] "7/26/12" "11/13/13" "12/18/13" "12/18/13" "2/26/13" "6/3/13"
## [49] "9/11/13" "11/10/14" "11/13/14" "12/9/14" "3/25/14" "4/30/14"
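Because duplicated counts extra rows, a frequency table is a more direct way to count distinct dates with more than one row. The sketch below contrasts the two on a hypothetical vector, air_date, in which one date appears three times and another twice.

```r
# Hypothetical air dates: one date with three rows, one with two, one with one
air_date <- c("2/10/99", "2/10/99", "2/10/99", "3/11/99", "3/11/99", "1/11/99")

# Extra rows beyond the first occurrence of each date
n_extra_rows <- sum(duplicated(air_date))

# Rows per date, then keep only dates with more than one row
counts <- table(air_date)
multi <- counts[counts > 1]

# Number of distinct dates with multiple rows, and which dates they are
n_multi_dates <- length(multi)
multi_dates <- names(multi)

n_extra_rows
n_multi_dates
multi_dates
```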
Graphing how many guests appeared per year.
library(ggplot2)
ggplot(data = daily, aes(x = Year)) +
  geom_bar()
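The bar heights in this chart are simply row counts per Year, so they can also be tabulated and checked numerically. The following sketch uses base R on a hypothetical vector, year; per_year is likewise a name introduced for illustration.

```r
# Hypothetical Year values standing in for daily$Year
year <- c(1999L, 1999L, 2000L, 2000L, 2000L, 2015L)

# One row per year with its count, the same quantity geom_bar() draws
per_year <- as.data.frame(table(Year = year), stringsAsFactors = FALSE)
per_year

# Average number of rows per year
avg_per_year <- mean(per_year$Freq)
avg_per_year
```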
Which 10 guests appeared the greatest number of times on The Daily Show during Stewart’s tenure?
sort(table(daily$Raw_Guest_List),decreasing=TRUE)[1:10]
##
## Fareed Zakaria Denis Leary Brian Williams Paul Rudd Ricky Gervais
## 19 17 16 13 13
## Tom Brokaw Bill O'Reilly Reza Aslan Richard Lewis Will Ferrell
## 12 10 10 10 10
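One caveat: the README notes that Raw_Guest_List can hold a list of people, and table counts identical strings, so a person who shared a row would be counted under the combined string. The sketch below shows one way to count per person instead. It assumes that names within a row are separated by commas, which is not confirmed by the source, and it uses placeholder names rather than real rows.

```r
# Placeholder rows; the last row holds two people (comma separator is an assumption)
guest_rows <- c("Guest A", "Guest B", "Guest A", "Guest B, Guest A")

# Counting whole strings treats the combined row as its own "guest"
by_row <- sort(table(guest_rows), decreasing = TRUE)

# Split each row on commas, trim whitespace, and count individual people
people <- trimws(unlist(strsplit(guest_rows, ",", fixed = TRUE)))
by_person <- sort(table(people), decreasing = TRUE)

by_row
by_person
```

A comma split would mishandle any name that itself contains a comma, so the separator should be checked against the actual rows before relying on it.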
The Daily Show had an eclectic array of guests (2693 guest rows with 399 distinct Occupation values), with roughly 150 guests per year from 1999 to 2014 (2015 had fewer because Stewart’s tenure ended in August of that year). Fareed Zakaria was the most frequent guest with 19 appearances, followed by Denis Leary (17) and Brian Williams (16).
The FiveThirtyEight article that covered these data looked only at how the occupation type of the guests changed over time; I would also be interested to see how the ethnicity and gender of guests changed over time. We tend to talk to people who are most like ourselves. Did Jon Stewart make an effort to share the perspectives of people with backgrounds different from his own? If the “Most Frequent Guest” list excerpted above is indicative, the answer is “Not so much.”