Homework Six returns to the Daily Show data set from Homework Five to glean more insight into it through the lens of bivariate relationships.
Once again, the first step is the import: before any analysis can take place, the libraries and the data must be loaded into RStudio.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.0.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(datasets)
library(stringr)
library(distill)
library(devtools)
## Loading required package: usethis
library(here)
## here() starts at /Users/chester/Desktop
library(stringr)
library(skimr)
library(grDevices)
library(wordcloud)
## Loading required package: RColorBrewer
library(RColorBrewer)
library(fivethirtyeight)
## Some larger datasets need to be installed separately, like senators and
## house_district_forecast. To install these, we recommend you install the
## fivethirtyeightdata package by running:
## install.packages('fivethirtyeightdata', repos =
## 'https://fivethirtyeightdata.github.io/drat/', type = 'source')
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
library(DT)
data("daily_show_guests")
Now that the daily_show_guests data set is imported, we can tidy it, clean it, and see what it is all about. Note that this data set was tidied, cleaned, recoded, and summarized descriptively in Homework Five. The tidying was, however, refined in some places for Homework Six to improve readability and comprehension.
head(daily_show_guests,5) # to see the first five rows of data set
## # A tibble: 5 × 5
## year google_knowledge_occupation show group raw_guest_list
## <int> <chr> <date> <chr> <chr>
## 1 1999 actor 1999-01-11 Acting Michael J. Fox
## 2 1999 comedian 1999-01-12 Comedy Sandra Bernhard
## 3 1999 television actress 1999-01-13 Acting Tracey Ullman
## 4 1999 film actress 1999-01-14 Acting Gillian Anderson
## 5 1999 actor 1999-01-18 Acting David Alan Grier
tail(daily_show_guests,5) # shows the last five rows of data set
## # A tibble: 5 × 5
## year google_knowledge_occupation show group raw_guest_list
## <int> <chr> <date> <chr> <chr>
## 1 2015 biographer 2015-07-29 Media Doris Kearns Goodwin
## 2 2015 director 2015-07-30 Media J. J. Abrams
## 3 2015 stand-up comedian 2015-08-03 Comedy Amy Schumer
## 4 2015 actor 2015-08-04 Acting Denis Leary
## 5 2015 comedian 2015-08-05 Comedy Louis C.K.
dim(daily_show_guests) # shows the dimensions of data set
## [1] 2693 5
colnames(daily_show_guests) # column names
## [1] "year" "google_knowledge_occupation"
## [3] "show" "group"
## [5] "raw_guest_list"
Here the skim() function from the skimr package is used in place of summary(), which is left commented out below. My preference is for skim() because it reports missing data, dates, and logicals, so it gives more comprehensive insight than summary().
skim(daily_show_guests)
| Name | daily_show_guests |
| Number of rows | 2693 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| Date | 1 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| google_knowledge_occupation | 26 | 0.99 | 1 | 66 | 0 | 335 | 0 |
| group | 31 | 0.99 | 4 | 14 | 0 | 17 | 0 |
| raw_guest_list | 0 | 1.00 | 3 | 72 | 0 | 1669 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| show | 0 | 1 | 1999-01-11 | 2015-08-05 | 2007-03-22 | 2639 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1 | 2006.82 | 4.83 | 1999 | 2003 | 2007 | 2011 | 2015 | ▇▆▆▆▇ |
#summary(daily_show_guests)
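The n_missing column reported by skim() can be cross-checked with dplyr alone. The sketch below is only an illustration: it uses a small made-up tibble (toy_guests is a hypothetical stand-in, not the real data) and counts the missing values in every column.

```r
library(dplyr)

# Hypothetical toy data standing in for daily_show_guests
toy_guests <- tibble(
  occupation = c("actor", NA, "comedian", "writer"),
  domain     = c("Acting", NA, "Comedy", NA),
  guest      = c("Guest A", "Guest B", "Guest C", "Guest D")
)

# Count the missing values in every column, similar to skim()'s n_missing
missing_counts <- toy_guests %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

missing_counts
```

Run against the real data set, the same summarise() call should agree with the n_missing values in the skim() tables.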
Now, to make the data set a bit more understandable, the rename() function is used to rename google_knowledge_occupation to guest_occupation, show to show_date, raw_guest_list to guest_list, year to year_episode_aired, and group to occupational_domain.
daily_show_guests <- daily_show_guests %>%
  rename(
    guest_occupation    = google_knowledge_occupation,
    show_date           = show,
    guest_list          = raw_guest_list,
    year_episode_aired  = year,
    occupational_domain = group
  )
This chunk verifies that the column headings now reflect the new names.
head(daily_show_guests,5) #shows first five rows of data set.
## # A tibble: 5 × 5
## year_episode_aired guest_occupation show_date occupational_dom… guest_list
## <int> <chr> <date> <chr> <chr>
## 1 1999 actor 1999-01-11 Acting Michael J.…
## 2 1999 comedian 1999-01-12 Comedy Sandra Ber…
## 3 1999 television actress 1999-01-13 Acting Tracey Ull…
## 4 1999 film actress 1999-01-14 Acting Gillian An…
## 5 1999 actor 1999-01-18 Acting David Alan…
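Besides inspecting the head() output by eye, the renamed headings can be checked programmatically. This is a minimal sketch under the assumption that a one-row stand-in (toy, built here from the first row shown above) behaves like the full data set; the object names toy, renamed, and expected are hypothetical.

```r
library(dplyr)

# Hypothetical one-row stand-in that uses the original column names
toy <- tibble(
  year = 1999L,
  google_knowledge_occupation = "actor",
  show = as.Date("1999-01-11"),
  group = "Acting",
  raw_guest_list = "Michael J. Fox"
)

renamed <- toy %>%
  rename(
    year_episode_aired  = year,
    guest_occupation    = google_knowledge_occupation,
    show_date           = show,
    occupational_domain = group,
    guest_list          = raw_guest_list
  )

expected <- c("year_episode_aired", "guest_occupation", "show_date",
              "occupational_domain", "guest_list")

# setequal() ignores column order; identical() would also check position
names_ok <- setequal(names(renamed), expected)
names_ok
```

A check like this fails loudly if a later step reintroduces an old column name, which a visual scan of the first five rows can miss.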
This chunk creates my_data, a copy of daily_show_guests, for ease of analysis.
my_data<-daily_show_guests
This chunk shows the occupational domains that appear more than 10 times in the data. As predicted, Acting appears the most, at 930 appearances.
my_data %>%
count(occupational_domain) %>%
filter(n > 10) %>%
na.omit()
## # A tibble: 15 × 2
## occupational_domain n
## <chr> <int>
## 1 Academic 103
## 2 Acting 930
## 3 Advocacy 24
## 4 Athletics 52
## 5 Business 25
## 6 Comedy 150
## 7 Consultant 18
## 8 Government 40
## 9 Media 751
## 10 Military 16
## 11 Misc 45
## 12 Musician 123
## 13 Political Aide 36
## 14 Politician 308
## 15 Science 28
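The same kind of frequency table can be produced already sorted, with missing domains dropped before counting rather than after. The sketch below uses a made-up toy tibble and a threshold of 1 instead of 10, purely so the result is small enough to read.

```r
library(dplyr)

# Hypothetical toy data; the real column is occupational_domain
toy <- tibble(
  domain = c("Acting", "Acting", "Acting", "Media", "Media", NA, "Science")
)

domain_counts <- toy %>%
  filter(!is.na(domain)) %>%      # drop missing domains before counting
  count(domain, sort = TRUE) %>%  # sort = TRUE orders rows by descending n
  filter(n > 1)                   # threshold of 1 here; the report uses 10

domain_counts
```

Sorting inside count() puts the most frequent domain in the first row, which makes a statement such as "Acting appears the most" easy to confirm at a glance.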
This chunk shows the guests who appeared on the show more than 10 times. Fareed Zakaria appeared more than any other guest, with 19 appearances.
my_data %>%
count(guest_list) %>%
filter(n > 10) %>%
na.omit()
## # A tibble: 6 × 2
## guest_list n
## <chr> <int>
## 1 Brian Williams 16
## 2 Denis Leary 17
## 3 Fareed Zakaria 19
## 4 Paul Rudd 13
## 5 Ricky Gervais 13
## 6 Tom Brokaw 12
This chunk explores how many guest appearances fall in the Acting domain. It shows 930 acting appearances from 1999 to 2015. The my_data object is the copy of daily_show_guests created above.
my_data %>%
select(year_episode_aired,guest_occupation,show_date,occupational_domain,guest_list) %>%
filter(occupational_domain == "Acting") # keep rows in the Acting domain
## # A tibble: 930 × 5
## year_episode_aired guest_occupation show_date occupational_dom… guest_list
## <int> <chr> <date> <chr> <chr>
## 1 1999 actor 1999-01-11 Acting Michael J…
## 2 1999 television actress 1999-01-13 Acting Tracey Ul…
## 3 1999 film actress 1999-01-14 Acting Gillian A…
## 4 1999 actor 1999-01-18 Acting David Ala…
## 5 1999 actor 1999-01-19 Acting William B…
## 6 1999 actor 1999-01-25 Acting Matthew L…
## 7 1999 actress 1999-01-27 Acting Yasmine B…
## 8 1999 actor 1999-01-28 Acting D. L. Hug…
## 9 1999 television actress 1999-10-18 Acting Rebecca G…
## 10 1999 actress 1999-10-20 Acting Amy Brenn…
## # … with 920 more rows
Here we see 28 appearances by guests in the Science domain during the show's run.
my_data %>%
select(year_episode_aired,guest_occupation,show_date,occupational_domain,guest_list) %>%
filter(occupational_domain == "Science") # keep rows in the Science domain
## # A tibble: 28 × 5
## year_episode_aired guest_occupation show_date occupational_domain guest_list
## <int> <chr> <date> <chr> <chr>
## 1 2003 neurosurgeon 2003-04-28 Science Dr Sanjay…
## 2 2004 scientist 2004-01-13 Science Catherine…
## 3 2004 physician 2004-06-15 Science Hassan Ib…
## 4 2005 doctor 2005-09-06 Science Dr. Marc …
## 5 2006 astronaut 2006-02-13 Science Astronaut…
## 6 2007 astrophysicist 2007-01-30 Science Neil deGr…
## 7 2007 surgeon 2007-03-06 Science Richard J…
## 8 2007 physician 2007-03-08 Science Dr. Sharo…
## 9 2007 astrophysicist 2007-07-23 Science Neil deGr…
## 10 2008 neuroscientist 2008-04-01 Science Simon LeV…
## # … with 18 more rows
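Politicians account for 308 guest appearances during the show's run. These are appearances rather than distinct people, since some politicians, such as Bob Dole, appear more than once.
my_data %>%
select(year_episode_aired,guest_occupation,show_date,occupational_domain,guest_list) %>%
filter(occupational_domain == "Politician") # keep rows in the Politician domain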
## # A tibble: 308 × 5
## year_episode_aired guest_occupation show_date occupational_do… guest_list
## <int> <chr> <date> <chr> <chr>
## 1 1999 us senator 1999-12-07 Politician Senator Bob…
## 2 1999 us senator 1999-12-08 Politician Senator Bob…
## 3 2000 former mayor of … 2000-01-20 Politician Jerry Sprin…
## 4 2000 former us senator 2000-11-06 Politician Arlen Spect…
## 5 2000 american politic… 2000-11-07 Politician Bob Dole
## 6 2000 former senator f… 2000-02-02 Politician Focus on Ne…
## 7 2000 american politic… 2000-03-08 Politician Bob Dole
## 8 2000 former us senator 2000-04-20 Politician Arlen Spect…
## 9 2000 american politic… 2000-08-01 Politician Bob Dole
## 10 2000 former governor … 2000-08-15 Politician Bob Kerrey
## # … with 298 more rows
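Row counts like the ones above measure appearances, not distinct people. The difference can be sketched with n() and n_distinct() on a small made-up tibble; the guest names below are hypothetical placeholders.

```r
library(dplyr)

# Hypothetical toy data: Guest A appears twice in the same domain
toy <- tibble(
  occupational_domain = c("Politician", "Politician", "Politician", "Science"),
  guest_list          = c("Guest A", "Guest A", "Guest B", "Guest C")
)

domain_summary <- toy %>%
  group_by(occupational_domain) %>%
  summarise(
    appearances     = n(),                    # one per row
    distinct_guests = n_distinct(guest_list)  # one per unique name
  )

domain_summary
```

On the real data, distinct_guests would still depend on how consistently names are spelled in guest_list, so it is best read as an approximation.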
In this chunk the guest_occupation column is used to filter for the writers and directors who appeared on the show, using %in% with a vector built by the c() function. There are 61 such appearances.
my_data %>%
select(year_episode_aired,guest_occupation,show_date,occupational_domain,guest_list) %>%
filter(guest_occupation %in% c("writer","director"))
## # A tibble: 61 × 5
## year_episode_aired guest_occupation show_date occupational_domain guest_list
## <int> <chr> <date> <chr> <chr>
## 1 1999 writer 1999-03-17 Media Frank DeC…
## 2 1999 director 1999-08-12 Media Eduardo S…
## 3 2000 writer 2000-04-11 Media Ben Stein
## 4 2000 writer 2000-06-19 Media Heather D…
## 5 2000 writer 2000-07-25 Media Joe Eszte…
## 6 2000 writer 2000-08-04 Media Robert Re…
## 7 2001 writer 2001-08-13 Media David Rak…
## 8 2002 writer 2002-05-07 Media Mark Bowd…
## 9 2003 writer 2003-10-22 Media Walter Is…
## 10 2003 writer 2003-11-18 Media Bernard G…
## # … with 51 more rows
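The %in% operator matters here: it tests each value for membership in the set, whereas == would recycle the two-element vector and compare position by position. A small base R sketch with a made-up occupations vector shows the difference.

```r
# Hypothetical vector of occupations
occupations <- c("director", "writer", "actor", "writer")

# %in% asks, for each element, whether it is one of the listed values
in_result <- occupations %in% c("writer", "director")

# == recycles c("writer", "director") and compares element by element,
# so "director" is compared with "writer", "writer" with "director", etc.
eq_result <- occupations == c("writer", "director")

sum(in_result)
sum(eq_result)
```

With ==, matches are silently missed whenever the order of the data does not line up with the recycled vector, which is why %in% is the safer choice inside filter().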
If you are a viewer of The Daily Show, you know that some shows have more than one guest. The data supports this observation: sometimes there are two entries for the same show, one per guest. For guests appearing on the same date, each entry's guest_list contains both names, but guest_occupation understandably differs between the entries if the guests have different occupations.
This chunk lists the guest_list entries recorded under more than one occupational domain, along with the dates of their appearances and their occupations.
library(DT)
my_data %>% # guest_list entries recorded under more than one occupational domain
group_by(guest_list) %>%
summarise(ngroups = n_distinct(occupational_domain)) %>%
filter(ngroups>1) %>%
select(-ngroups) %>%
inner_join(my_data, by= "guest_list") %>%
arrange(year_episode_aired,guest_list) %>%
datatable() # the piped result is the table to display
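The summarise-then-join pattern can also be written as a grouped filter, which keeps all original columns without a join. This is a swapped-in alternative rather than the approach used in the chunk, sketched on a made-up tibble with hypothetical guest names.

```r
library(dplyr)

# Hypothetical toy data: one shared entry spans two domains, one does not
toy <- tibble(
  guest_list          = c("Guest A and Guest B", "Guest A and Guest B",
                          "Guest C", "Guest C"),
  occupational_domain = c("Acting", "Media", "Comedy", "Comedy")
)

multi_domain <- toy %>%
  group_by(guest_list) %>%
  filter(n_distinct(occupational_domain) > 1) %>%  # keep whole groups
  ungroup()

multi_domain
```

Note that n_distinct() counts NA as a value by default, so entries with a missing domain may need separate handling on the real data.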
When the data was analyzed, it became clear that some entries are special events with no specified guest, so occupational_domain is missing for them. Because of this, missing values were dropped only within individual chunks as needed. Since the special events provide meaningful insight, they were not removed from the data set.
library(DT)
my_data %>%
filter(is.na(occupational_domain)) %>% # rows with no occupational domain
datatable() # the piped result is the table to display
This chunk visualizes the occupational domains and the number of guest appearances in each. I used a bar chart from the ggplot2 package and flipped it with coord_flip() so the domain labels are easier to read. The bars are in descending order, from the most popular occupational domain to the least. Note that coord_flip() draws the x aesthetic on the vertical axis, so xlab() labels the vertical axis and ylab() labels the horizontal one.
my_data %>%
group_by(occupational_domain) %>%
summarise(n=n()) %>%
arrange(desc(n)) %>%
na.omit() %>% # drop the row for missing domains
ggplot(aes(reorder(occupational_domain,n),n))+
geom_bar(stat = "identity")+
coord_flip()+ # x aesthetic is drawn vertically
xlab("Guest Occupation")+
ylab("Number of Guests")+
ggtitle("Guest_List_Grouped_By_Occupation")
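An alternative to coord_flip() is to map the category to the y aesthetic directly, which recent versions of ggplot2 (3.3.0 and later) support for geom_col(). The sketch below is only an illustration: toy_counts is a hypothetical data frame that reuses three of the counts reported earlier.

```r
library(ggplot2)

# Hypothetical summary table reusing three counts from the report
toy_counts <- data.frame(
  domain = c("Acting", "Media", "Politician"),
  n      = c(930, 751, 308)
)

# Category on y, count on x: horizontal bars with no coord_flip(),
# so the x label really is the horizontal axis
p <- ggplot(toy_counts, aes(x = n, y = reorder(domain, n))) +
  geom_col() +
  labs(x = "Number of Guests", y = "Guest Occupation")

p
```

With this layout the axis labels match what is seen on screen, which avoids the confusion over which label belongs to which axis.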
This chunk visualizes guest appearances by the year the episode aired, with the bars filled by occupational domain (which, if you recall, was renamed from the group variable). Each occupational domain gets its own fill color so the domains can be told apart within each year.
library(ggplot2)
my_data %>%
group_by(occupational_domain,year_episode_aired) %>%
summarise(n=n()) %>%
na.omit() %>% # drop missing domains before plotting
ggplot(aes(factor(year_episode_aired),n))+
geom_bar(stat = "identity",aes(fill= occupational_domain))+
theme(axis.text.x = element_text(angle = 90,hjust = 0.5))+
xlab("Year Episode Aired")+
ylab("Frequency of Occupational Domain")+
ggtitle("Guests_Appearing_Per_Year")
## `summarise()` has grouped output by 'occupational_domain'. You can override using the `.groups` argument.
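When a chain mixes `+` and `%>%`, operator precedence decides what gets piped: `%>%` binds more tightly than `+`, so a pipe written after the last `+` term receives only that term, not the whole expression. A small numeric sketch makes this visible.

```r
library(magrittr)

# %>% binds tighter than +, so only the 9 is piped into sqrt()
piped_last_term <- 1 + 9 %>% sqrt()    # parsed as 1 + sqrt(9)

# Parentheses are needed to pipe the whole sum
piped_whole_sum <- (1 + 9) %>% sqrt()  # parsed as sqrt(1 + 9)

piped_last_term
piped_whole_sum
```

For ggplot2 code this means any data step such as na.omit() belongs before the ggplot() call, while the data is still flowing through `%>%`.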
This bivariate graph shows the top five guest occupational domains, ranked by their mean number of appearances per year, from 1999 to 2015. It is helpful for visualizing the five most common types of occupation appearing on The Daily Show during its run. It shows that acting and media were the two most frequent occupational domains. When the show first aired, there was a conspicuous peak of actors appearing. After that there is a steady decline, and between 2008 and 2009 there is a precipitous dip in actors appearing and an uptrend in media appearances.
my_data %>%
group_by(occupational_domain,year_episode_aired) %>%
summarise(n=n()) %>%
summarise(m=mean(n)) %>%
arrange(desc(m)) %>%
filter(row_number()<= 5) %>%
select(-m) %>%
inner_join(my_data, by ="occupational_domain") %>%
group_by(occupational_domain,year_episode_aired) %>%
summarise(n=n()) %>%
filter(year_episode_aired<=2015) %>%
ggplot(aes(year_episode_aired,n)) + geom_line(aes(col= occupational_domain), lwd=1.5) +
ggtitle(" Guests Occupational Domain Over Time ")
## `summarise()` has grouped output by 'occupational_domain'. You can override using the `.groups` argument.
## `summarise()` has grouped output by 'occupational_domain'. You can override using the `.groups` argument.
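The "pick the top groups, then count them per year" idea can be sketched more compactly with slice_max() and semi_join(). Two assumptions are worth flagging: the toy tibble below is made up, and this version ranks domains by total appearances rather than by the mean per year used in the chunk above.

```r
library(dplyr)

# Hypothetical toy data: one row per guest appearance
toy <- tibble(
  domain = c("Acting", "Acting", "Acting", "Media", "Media", "Science"),
  year   = c(1999L, 1999L, 2000L, 1999L, 2000L, 2000L)
)

# Rank domains by total appearances and keep the top two
top_domains <- toy %>%
  count(domain, name = "appearances") %>%
  slice_max(appearances, n = 2, with_ties = FALSE)

# Keep only rows from those domains, then count per domain and year
yearly <- toy %>%
  semi_join(top_domains, by = "domain") %>%
  count(domain, year)

yearly
```

Because semi_join() only filters rows and adds no columns, the second count() works on the original appearance-level data, and the result can go straight into ggplot() with geom_line().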
This chunk converts my_data to a data.table and counts the appearances in each occupational domain, so the domains can be ranked from most to least frequent.
my_data<-as.data.table(my_data)
my_data
## year_episode_aired guest_occupation show_date occupational_domain
## 1: 1999 actor 1999-01-11 Acting
## 2: 1999 comedian 1999-01-12 Comedy
## 3: 1999 television actress 1999-01-13 Acting
## 4: 1999 film actress 1999-01-14 Acting
## 5: 1999 actor 1999-01-18 Acting
## ---
## 2689: 2015 biographer 2015-07-29 Media
## 2690: 2015 director 2015-07-30 Media
## 2691: 2015 stand-up comedian 2015-08-03 Comedy
## 2692: 2015 actor 2015-08-04 Acting
## 2693: 2015 comedian 2015-08-05 Comedy
## guest_list
## 1: Michael J. Fox
## 2: Sandra Bernhard
## 3: Tracey Ullman
## 4: Gillian Anderson
## 5: David Alan Grier
## ---
## 2689: Doris Kearns Goodwin
## 2690: J. J. Abrams
## 2691: Amy Schumer
## 2692: Denis Leary
## 2693: Louis C.K.
my_data1<-my_data[occupational_domain!=""]
my_data2<-my_data1[,.N,by=occupational_domain]
view(my_data2)
Arrange the new table, which has two variables (occupational domain and number of appearances), in descending order of frequency.
my_data3<-my_data2[order(-N)] %>%
rename("frequency" = N) %>%
head(17)
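The filter, count, and sort steps can also be chained in a single data.table expression. This is a minimal sketch on a made-up table (toy_dt and domain_freq are hypothetical names), naming the count column directly instead of renaming N afterwards.

```r
library(data.table)

# Hypothetical toy data with one missing domain
toy_dt <- data.table(
  domain = c("Acting", "Media", "Acting", NA, "Acting", "Media")
)

# i: drop empty strings (rows where the test is NA are dropped too)
# j: count rows per group with .N, named frequency
# then sort by descending frequency
domain_freq <- toy_dt[domain != "", .(frequency = .N), by = domain][order(-frequency)]

domain_freq
```

Chaining with `][` avoids the intermediate objects, at the cost of a denser line; which form reads better is a matter of taste.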
I wanted to try a word cloud since I have never made one before, and it can provide a visual bivariate representation of data of my choosing. For this word cloud I have chosen occupational_domain and the renamed count variable, frequency. My hypothesis is that, since the Acting domain had the most appearances on the show, its text will appear largest, because text in a word cloud is drawn larger the more frequently it appears. After running the word cloud chunk, the hypothesis holds: the Acting domain appears largest, followed in descending frequency by the other occupational domains. Since this is a small subsection of the data set, the word cloud is even smaller than I would have imagined; however, it does accomplish its intended purpose.
library(wordcloud)
#png("wordcloud.png",width=10,height=7,units ='in',res=300)
par(mar=rep(0,4))
set.seed(1330)
wordcloud(words = my_data3$occupational_domain,freq = my_data3$frequency,scale=c(3.5,0.65),
max.words=17,colors=brewer.pal(8,"Dark2"))
Here I am trying a word cloud for a larger subset to see what it looks like. This word cloud is built from guest_occupation and the frequency with which each occupation appears.
my_data<-as.data.table(my_data)
my_data4<-my_data[guest_occupation!=""] # drop rows with a missing or empty guest occupation
my_data5<-my_data4[,.N,by=guest_occupation] # count appearances per guest occupation
view(my_data5)
my_data6<-my_data5[order(-N)] %>% # sorts occupations by descending count
rename("frequency_1" = N) %>% # renames the count column from N to frequency_1
head(333) # keeps the 333 most frequent occupations
view(my_data6) # displays the renamed data set
library(wordcloud)
#png("wordcloud.png",width=10,height=7,units ='in',res=300)
par(mar=rep(0,4))
set.seed(50)
wordcloud(words = my_data6$guest_occupation,freq = my_data6$frequency_1,scale=c(3.5,0.75),
max.words=333,colors=brewer.pal(8,"Dark2"))
## Warning in wordcloud(words = my_data6$guest_occupation, freq =
## my_data6$frequency_1, : former lieutenant governor of maryland could not be fit
## on page. It will not be plotted.
## Warning in wordcloud(words = my_data6$guest_occupation, freq =
## my_data6$frequency_1, : former white house press secretary could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(words = my_data6$guest_occupation, freq =
## my_data6$frequency_1, : television personality could not be fit on page. It will
## not be plotted.
## Warning in wordcloud(words = my_data6$guest_occupation, freq =
## my_data6$frequency_1, : minority leader of the united states house of
## representatives could not be fit on page. It will not be plotted.
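The warnings above come from labels that are too long to place at the chosen scale. One possible workaround, sketched here as an assumption rather than something tried in this homework, is to shorten long labels with stringr's str_trunc() before passing them to wordcloud(). The width of 25 is an arbitrary illustrative choice.

```r
library(stringr)

# Three labels taken from the data and the warnings above
occupations <- c(
  "actor",
  "former white house press secretary",
  "minority leader of the united states house of representatives"
)

# Truncate to at most 25 characters, including the trailing ellipsis
short_labels <- str_trunc(occupations, width = 25)

short_labels
nchar(short_labels)
```

Shorter labels are more likely to fit, but truncation can make two different occupations look alike, so lowering the first value of scale or max.words may be a gentler option.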