Homework Five takes a more granular look at how data can be analysed visually. A visual component makes it possible to see patterns in the data and to glean insights that may not have been apparent otherwise.
For this assignment I am using a data set from FiveThirtyEight, a website that covers economics, entertainment, science and politics. The data set lists the guests of The Daily Show during the tenure of host Jon Stewart, from 1999 to 2015. I chose it because I thought it would be fun and insightful to examine a data set in the entertainment domain.
Before any visualization can take place, the libraries and the data must first be loaded into RStudio.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.0.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(datasets)
library(stringr)
library(distill)
library(devtools)
## Loading required package: usethis
library(here)
## here() starts at /Users/chester/Documents
library(skimr)
library(fivethirtyeight)
## Some larger datasets need to be installed separately, like senators and
## house_district_forecast. To install these, we recommend you install the
## fivethirtyeightdata package by running:
## install.packages('fivethirtyeightdata', repos =
## 'https://fivethirtyeightdata.github.io/drat/', type = 'source')
library(DT)
data("daily_show_guests")
Now that the daily_show_guests data set is loaded, we can tidy it up, clean it, and see what it’s all about.
head(daily_show_guests,5) # to see the first five rows of data set
## # A tibble: 5 × 5
## year google_knowledge_occupation show group raw_guest_list
## <int> <chr> <date> <chr> <chr>
## 1 1999 actor 1999-01-11 Acting Michael J. Fox
## 2 1999 comedian 1999-01-12 Comedy Sandra Bernhard
## 3 1999 television actress 1999-01-13 Acting Tracey Ullman
## 4 1999 film actress 1999-01-14 Acting Gillian Anderson
## 5 1999 actor 1999-01-18 Acting David Alan Grier
tail(daily_show_guests,5) # shows the last five rows of data set
## # A tibble: 5 × 5
## year google_knowledge_occupation show group raw_guest_list
## <int> <chr> <date> <chr> <chr>
## 1 2015 biographer 2015-07-29 Media Doris Kearns Goodwin
## 2 2015 director 2015-07-30 Media J. J. Abrams
## 3 2015 stand-up comedian 2015-08-03 Comedy Amy Schumer
## 4 2015 actor 2015-08-04 Acting Denis Leary
## 5 2015 comedian 2015-08-05 Comedy Louis C.K.
dim(daily_show_guests) # shows the dimensions of data set
## [1] 2693 5
colnames(daily_show_guests) # column names
## [1] "year" "google_knowledge_occupation"
## [3] "show" "group"
## [5] "raw_guest_list"
Here the skim() function from the skimr package is used; the summary() call is left commented out for comparison. My preference is for skim() because it reports missing data and handles dates and logicals. It provides more comprehensive insight than summary() by comparison.
skim(daily_show_guests)
| Name | daily_show_guests |
| Number of rows | 2693 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| Date | 1 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| google_knowledge_occupation | 26 | 0.99 | 1 | 66 | 0 | 335 | 0 |
| group | 31 | 0.99 | 4 | 14 | 0 | 17 | 0 |
| raw_guest_list | 0 | 1.00 | 3 | 72 | 0 | 1669 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| show | 0 | 1 | 1999-01-11 | 2015-08-05 | 2007-03-22 | 2639 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1 | 2006.82 | 4.83 | 1999 | 2003 | 2007 | 2011 | 2015 | ▇▆▆▆▇ |
#summary(daily_show_guests)
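The n_missing column that skim() reports can be cross-checked by hand. The sketch below is only an illustration: toy_guests is a hypothetical tibble that mimics the shape of the guest data, not the real data set.

```r
library(dplyr)

# Hypothetical toy rows (assumption: not taken from the real data set)
toy_guests <- tibble(
  group = c("Acting", NA, "Media", "Comedy"),
  raw_guest_list = c("Guest A", "Guest B", "Guest C", "Guest D"),
  google_knowledge_occupation = c("actor", NA, "writer", NA)
)

# is.na() marks every missing cell; colSums() adds the marks up per column
missing_per_column <- colSums(is.na(toy_guests))
missing_per_column
```

Running the same colSums(is.na(...)) call on daily_show_guests should agree with the n_missing values in the skim() output.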
To make the data set more understandable, the rename() function is used to rename google_knowledge_occupation to guest_occupation, show to show_date, raw_guest_list to guest_list, year to year_episode_aired, and group to domain.
daily_show_guests <- daily_show_guests %>%
  rename(
    guest_occupation = google_knowledge_occupation,
    show_date = show,
    guest_list = raw_guest_list,
    year_episode_aired = year,
    domain = group
  )
This chunk verifies that the column headings now show the new names.
head(daily_show_guests,5) #shows first five rows of data set.
## # A tibble: 5 × 5
## year_episode_aired guest_occupation show_date domain guest_list
## <int> <chr> <date> <chr> <chr>
## 1 1999 actor 1999-01-11 Acting Michael J. Fox
## 2 1999 comedian 1999-01-12 Comedy Sandra Bernhard
## 3 1999 television actress 1999-01-13 Acting Tracey Ullman
## 4 1999 film actress 1999-01-14 Acting Gillian Anderson
## 5 1999 actor 1999-01-18 Acting David Alan Grier
This chunk creates a new variable, my_data, which is a copy of daily_show_guests. This was done for ease of data analysis.
my_data<-daily_show_guests
This chunk shows the domains that appeared more than 10 times on the show. As predicted, Acting appeared the most, at 930 times.
my_data %>%
count(domain) %>%
filter(n > 10) %>%
na.omit()
## # A tibble: 15 × 2
## domain n
## <chr> <int>
## 1 Academic 103
## 2 Acting 930
## 3 Advocacy 24
## 4 Athletics 52
## 5 Business 25
## 6 Comedy 150
## 7 Consultant 18
## 8 Government 40
## 9 Media 751
## 10 Military 16
## 11 Misc 45
## 12 Musician 123
## 13 Political Aide 36
## 14 Politician 308
## 15 Science 28
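To put counts like these in proportion, each domain's share of the rows can be computed alongside the count. This is a minimal sketch on a hypothetical toy tibble (toy_guests and domain_share are illustrative names); the real data would give different numbers.

```r
library(dplyr)

# Hypothetical toy rows (assumption: not taken from the real data set)
toy_guests <- tibble(
  domain = c("Acting", "Acting", "Media", "Comedy", NA, "Acting", "Media")
)

domain_share <- toy_guests %>%
  filter(!is.na(domain)) %>%        # drop rows with no domain
  count(domain, sort = TRUE) %>%    # one row per domain, largest first
  mutate(share = n / sum(n))        # fraction of the remaining rows
domain_share
```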
This chunk shows the guests who appeared on the show more than 10 times. Fareed Zakaria appeared more than any other guest, with 19 appearances.
my_data %>%
count(guest_list) %>%
filter(n > 10) %>%
na.omit()
## # A tibble: 6 × 2
## guest_list n
## <chr> <int>
## 1 Brian Williams 16
## 2 Denis Leary 17
## 3 Fareed Zakaria 19
## 4 Paul Rudd 13
## 5 Ricky Gervais 13
## 6 Tom Brokaw 12
This chunk lists the guest appearances in the Acting domain. It returns 930 rows, one per appearance, from 1999 to 2015.
my_data %>%
  select(year_episode_aired, guest_occupation, show_date, domain, guest_list) %>%
  filter(domain == "Acting", !is.na(guest_list)) # keep Acting rows that name a guest
## # A tibble: 930 × 5
## year_episode_aired guest_occupation show_date domain guest_list
## <int> <chr> <date> <chr> <chr>
## 1 1999 actor 1999-01-11 Acting Michael J. Fox
## 2 1999 television actress 1999-01-13 Acting Tracey Ullman
## 3 1999 film actress 1999-01-14 Acting Gillian Anderson
## 4 1999 actor 1999-01-18 Acting David Alan Grier
## 5 1999 actor 1999-01-19 Acting William Baldwin
## 6 1999 actor 1999-01-25 Acting Matthew Lillard
## 7 1999 actress 1999-01-27 Acting Yasmine Bleeth
## 8 1999 actor 1999-01-28 Acting D. L. Hughley
## 9 1999 television actress 1999-10-18 Acting Rebecca Gayheart
## 10 1999 actress 1999-10-20 Acting Amy Brenneman
## # … with 920 more rows
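A note on filtering a character column such as guest_list: comparing text with a number, as in guest_list >= 1, does not count guests. R coerces the number to the string "1" and compares the two strings, with a result that depends on the locale's sort order. The sketch below illustrates the coercion and an explicit non-missing check; the variable names are illustrative only.

```r
guest_name <- "Michael J. Fox"

# The number 1 is coerced to "1", so both lines are the same string comparison
same_comparison <- identical(guest_name >= 1, guest_name >= "1")

# An explicit check for "a guest is named" uses is.na() instead
has_guest <- !is.na(c("Michael J. Fox", NA))
```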
Here we see that there were 28 guest appearances in the Science domain during the show’s run.
my_data %>%
  select(year_episode_aired, guest_occupation, show_date, domain, guest_list) %>%
  filter(domain == "Science", !is.na(guest_list)) # keep Science rows that name a guest
## # A tibble: 28 × 5
## year_episode_aired guest_occupation show_date domain guest_list
## <int> <chr> <date> <chr> <chr>
## 1 2003 neurosurgeon 2003-04-28 Science Dr Sanjay Gupta
## 2 2004 scientist 2004-01-13 Science Catherine Weitz
## 3 2004 physician 2004-06-15 Science Hassan Ibrahim
## 4 2005 doctor 2005-09-06 Science Dr. Marc Siegel
## 5 2006 astronaut 2006-02-13 Science Astronaut Mike Mullane
## 6 2007 astrophysicist 2007-01-30 Science Neil deGrasse Tyson
## 7 2007 surgeon 2007-03-06 Science Richard Jadick
## 8 2007 physician 2007-03-08 Science Dr. Sharon Moalem
## 9 2007 astrophysicist 2007-07-23 Science Neil deGrasse Tyson
## 10 2008 neuroscientist 2008-04-01 Science Simon LeVay
## # … with 18 more rows
There were 308 appearances by politicians during the show’s run.
my_data %>%
  select(year_episode_aired, guest_occupation, show_date, domain, guest_list) %>%
  filter(domain == "Politician", !is.na(guest_list)) # keep Politician rows that name a guest
## # A tibble: 308 × 5
## year_episode_aired guest_occupation show_date domain guest_list
## <int> <chr> <date> <chr> <chr>
## 1 1999 us senator 1999-12-07 Politi… Senator Bo…
## 2 1999 us senator 1999-12-08 Politi… Senator Bo…
## 3 2000 former mayor of cincinatti 2000-01-20 Politi… Jerry Spri…
## 4 2000 former us senator 2000-11-06 Politi… Arlen Spec…
## 5 2000 american politician 2000-11-07 Politi… Bob Dole
## 6 2000 former senator from kansas 2000-02-02 Politi… Focus on N…
## 7 2000 american politician 2000-03-08 Politi… Bob Dole
## 8 2000 former us senator 2000-04-20 Politi… Arlen Spec…
## 9 2000 american politician 2000-08-01 Politi… Bob Dole
## 10 2000 former governor of nebraska 2000-08-15 Politi… Bob Kerrey
## # … with 298 more rows
In this chunk the guest_occupation column is used to filter for appearances by writers and directors, using %in% with the combine function c().
my_data %>%
select(year_episode_aired,guest_occupation,show_date,domain,guest_list) %>%
filter(guest_occupation %in% c("writer","director"))
## # A tibble: 61 × 5
## year_episode_aired guest_occupation show_date domain guest_list
## <int> <chr> <date> <chr> <chr>
## 1 1999 writer 1999-03-17 Media Frank DeCaro's Oscar S…
## 2 1999 director 1999-08-12 Media Eduardo Sanchez and Da…
## 3 2000 writer 2000-04-11 Media Ben Stein
## 4 2000 writer 2000-06-19 Media Heather Donahue
## 5 2000 writer 2000-07-25 Media Joe Eszterhas
## 6 2000 writer 2000-08-04 Media Robert Reich and Ben S…
## 7 2001 writer 2001-08-13 Media David Rakoff
## 8 2002 writer 2002-05-07 Media Mark Bowden
## 9 2003 writer 2003-10-22 Media Walter Isaacson
## 10 2003 writer 2003-11-18 Media Bernard Goldberg
## # … with 51 more rows
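The reason for %in% rather than == when matching several values is that == compares element by element and recycles the shorter vector, so it can silently miss matches. A small sketch with a hypothetical vector of occupations:

```r
# Hypothetical occupations (assumption: not taken from the real data set)
occupations <- c("director", "writer", "actor", "writer")
targets <- c("writer", "director")

# %in% asks, for each occupation, whether it appears anywhere in targets
in_match <- occupations %in% targets

# == pairs occupations with targets position by position (targets is recycled),
# so here no pair happens to line up
eq_match <- occupations == targets
```

Here in_match flags the director and both writers, while eq_match is FALSE throughout.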
If you’re a viewer of The Daily Show, you know that a show can have more than one guest. The data reflects this, and sometimes there are two entries showing “guest 1” and “guest 2” for the same show. When guests appear on the same date, guest_list contains both guests’ names, but there will understandably be separate entries under guest_occupation if they have different occupations.
This chunk shows the guest_list entries that were recorded under more than one domain, along with the dates of their appearances and their occupations.
my_data %>% # guest_list entries recorded under more than one domain
  group_by(guest_list) %>%
  summarise(ngroups = n_distinct(domain)) %>%
  filter(ngroups > 1) %>%
  select(-ngroups) %>%
  inner_join(my_data, by = "guest_list") %>%
  arrange(year_episode_aired, guest_list) %>%
  datatable() # the piped data is the table; DT was loaded earlier
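The core of this chunk is the group_by() plus n_distinct() step, which counts how many different domains each guest_list entry was recorded under. A minimal sketch on hypothetical toy rows (toy_guests and multi_domain are illustrative names):

```r
library(dplyr)

# Hypothetical toy rows (assumption: not taken from the real data set)
toy_guests <- tibble(
  guest_list = c("Guest A", "Guest A", "Guest B", "Guest C", "Guest C"),
  domain     = c("Media",   "Comedy",  "Acting",  "Media",   "Media")
)

multi_domain <- toy_guests %>%
  group_by(guest_list) %>%
  summarise(n_domains = n_distinct(domain)) %>%  # distinct domains per entry
  filter(n_domains > 1)                          # keep entries with two or more
multi_domain
```

Only Guest A is kept, because Guest C appears twice but always under the same domain.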
While analysing the data, I noticed that some shows were special events with no specified guests. Because of this, the data was analysed in small chunks, handling these rows only as needed. Since the special events provide meaningful insight, they should not be removed.
my_data %>%
  filter(is.na(domain)) %>% # rows with no domain recorded
  datatable()
Visualizes guest appearances by occupational domain
my_data %>%
  group_by(domain) %>%
  summarise(n = n()) %>%
  arrange(desc(n)) %>%
  na.omit() %>%
  ggplot(aes(reorder(domain, n), n)) +
  geom_col() + # same as geom_bar(stat = "identity")
  coord_flip() +
  ggtitle("Guest_List_Grouped_By_Occupation")
Visualizes guest appearances by the year the episode aired, grouped by occupational domain
my_data %>%
  group_by(domain, year_episode_aired) %>%
  summarise(n = n()) %>%
  na.omit() %>% # drop rows with a missing domain before plotting
  ggplot(aes(factor(year_episode_aired), n)) +
  geom_bar(stat = "identity", aes(fill = domain)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Guests_Appearing_Per_Year")
## `summarise()` has grouped output by 'domain'. You can override using the `.groups` argument.
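One thing to watch when mixing + and %>% in a plotting chain: R gives %>% higher precedence than +, so a pipe placed after the last + term applies to that term only, not to the whole expression. In a ggplot chain, a trailing %>% na.omit() would therefore act on the last layer rather than on the data. The arithmetic sketch below only illustrates the parsing rule; the variable names are illustrative.

```r
library(magrittr)

# Parsed as 1 + sqrt(9): the pipe grabs only the 9
piped_last_term <- 1 + 9 %>% sqrt()

# Parentheses are needed to pipe the whole sum: sqrt(10)
piped_whole_sum <- (1 + 9) %>% sqrt()
```

For this reason, steps that clean the data are safest placed before the ggplot() call.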
Top 5 guest occupational domains over time. The last year of the show, 2015, is omitted for aesthetic purposes.
my_data %>%
  group_by(domain, year_episode_aired) %>%
  summarise(n = n()) %>%          # appearances per domain and year
  summarise(m = mean(n)) %>%      # average yearly appearances per domain
  arrange(desc(m)) %>%
  filter(row_number() <= 5) %>%   # keep the five highest averages
  select(-m) %>%
  inner_join(my_data, by = "domain") %>%
  group_by(domain, year_episode_aired) %>%
  summarise(n = n()) %>%
  filter(year_episode_aired < 2015) %>%
  ggplot(aes(year_episode_aired, n)) +
  geom_line(aes(col = domain), lwd = 2) +
  ggtitle("Guests Occupational Domain Over Time")
## `summarise()` has grouped output by 'domain'. You can override using the `.groups` argument.
## `summarise()` has grouped output by 'domain'. You can override using the `.groups` argument.
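The "top domains by average yearly count" step can also be written more compactly. The sketch below is one possible alternative, assuming dplyr 1.0.0 or later for slice_max(); it runs on hypothetical toy rows and keeps the top 2 rather than the top 5, so it is not the chunk used for the plot.

```r
library(dplyr)

# Hypothetical toy rows (assumption: not taken from the real data set)
toy_guests <- tibble(
  domain = c("Acting", "Acting", "Acting", "Acting",
             "Media", "Media", "Media", "Comedy"),
  year_episode_aired = c(1999L, 1999L, 1999L, 2000L,
                         1999L, 1999L, 2000L, 2000L)
)

top_domains <- toy_guests %>%
  count(domain, year_episode_aired) %>%       # rows per domain and year
  group_by(domain) %>%
  summarise(mean_per_year = mean(n)) %>%      # average yearly count per domain
  slice_max(mean_per_year, n = 2)             # keep the two highest averages
top_domains
```

Note that slice_max() keeps ties by default, so it can return more rows than requested when averages are equal.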
That’s all for now; this is a work in progress.