How do I read this file?

# Alternative 1: 
# Download the file into excel, save it in the correct sub-folder, change working directory to that sub-folder then read it.

ted<-readxl::read_excel("TED.xlsx")
# View(ted) 

# If you issue the View command that has been commented, the dataframe is pulled up. Did you notice any change in the "duration" variable in the 'ted' dataframe?

# Alternative 2: google: "Reading directly from googlesheet R"; the first search result took me to this page: https://www.r-bloggers.com/reading-data-from-google-sheets-into-r/

# There is a package called googlesheets and there is a functionality built into that package that can help read the data directly into it....The r-bloggers page didn't give that info... So I looked up the googlesheets package page on CRAN (a key place from which we can download R package): https://cran.r-project.org/web/packages/googlesheets/vignettes/basic-usage.html#get-a-google-sheet-to-practice-with; This page seems to have the necessary info. In the "Register a sheet" section of that page is the information for how to read from a sheet that we don't own. 

#install.packages("googlesheets")

sheeturl="https://docs.google.com/spreadsheets/d/1Yv_9nDl4ocIZR0GXU3OZuBaXxER1blfwR_XHvklPpEM/edit?hl=en&hl=en&hl=en#gid=0"

library(googlesheets)
library(tidyverse)

tedsheet <- sheeturl %>% gs_url()
TED <- tedsheet %>% gs_read()

# This version of the data does not have the duration field messed up. Cleaner. Let us use this version. Else, we might have to cleanup the mess from the earlier dataframe, "ted".

1. Create an interactive table that shows the total number of talks given by an individual and the average duration of all their talks. Hence, there should be three columns in the table: The name, the number of talks, and the mean of the talk time (in minutes) for all their talks.

For us to answer this question, we need the information on the total number of talks, average duration. So, some data processing is needed.

  1. Computing the total talks should be relatively easier using the group_by function from dplyr.
  2. For mean talk time, in minutes, we will need to convert the duration variable into the number of minutes. Let’s work on this before we deal with the first part. Note that if we had worked with the excel version of the file, duration shows up in this format 1899-12-31 00:16:17 (this is the first value in that column). This is a version of the date variable in yyyy-mm-dd hh:mm:ss format. If we worked with the dataframe that was produced by googlesheets, then we have duration in only the hh:mm:ss format. Let us check that the data are identical if we focused only on the hours, minutes, and seconds.

Excel data

# ted is the dataframe produced by excel. We are extracting the values of hour, minute, and seconds from that dataframe for the first observation.

# The three functions used below -- hour, minute, and second --- are from the lubridate package. This page has information on how to extract specific elements: https://r4ds.had.co.nz/dates-and-times.html 

library(lubridate)# Need to install that package if not already present

hour(ted$duration[1])
## [1] 0
minute(ted$duration[1])
## [1] 16
second(ted$duration[1])
## [1] 17

Google sheets data

# TED is the dataframe produced by excel. We are extracting the values of hour, minute, and seconds from that dataframe for the first observation.
hour(TED$duration[1])
## [1] 0
minute(TED$duration[1])
## [1] 16
second(TED$duration[1])
## [1] 17

Now that we are in a position to extract the minutes and seconds. Let us compute the total seconds for each talk and then divide by 60 to find the duration of each talk in minutes. We will use the google sheet version of the data for the rest of the analysis.

TED <- TED %>% mutate(duration_minutes=(second(duration)+60*minute(duration)+3600*hour(duration))/60)

Let us verify that the job was done by taking a look at the first three observations of the duration and the duration_minutes columns.

head(TED[,c("duration","duration_minutes")])
## # A tibble: 6 x 2
##   duration duration_minutes
##   <time>              <dbl>
## 1 16'17"               16.3
## 2 21'26"               21.4
## 3 18'36"               18.6
## 4 19'24"               19.4
## 5 19'50"               19.8
## 6 21'45"               21.8

Looks good. Moving on, we now need to compute the total number of talks by each speaker and the average minutes per talk for each speaker. dplyr should help us. For this, we might have to refresh our learning from chapter 5 of this book: https://r4ds.had.co.nz/transform.html.

TED_speaker_metrics <- TED %>% group_by(speaker_name)%>%
  summarise(Number_talks=length(speaker_name),
            Mean_talk_duration=mean(duration_minutes))

# Note, names of variables, dataframes, etc., are a personal preference. You may have chosen a different name than TED_speaker_metrics, but that seemed 'right' to me because it provides a good amount of information regarding what might be contained in it.

Let us verify if we have the data in the right format.

head(as.data.frame(TED_speaker_metrics))
##      speaker_name Number_talks Mean_talk_duration
## 1   Aakash Odedra            1           9.833333
## 2   Aala El-Khani            1          14.266667
## 3      Aaron Huey            1          15.450000
## 4    Aaron Koblin            1          18.300000
## 5 Aaron O'Connell            1           7.850000
## 6       Abe Davis            1          17.950000

The Mean_talk_duration variable has 6 decimals. Let us round it off at 2 decimals.

TED_speaker_metrics$Mean_talk_duration <- round(TED_speaker_metrics$Mean_talk_duration,2)

Now we need an interactive table to display the new dataframe we created, TED_speaker_metrics. One package we came across in the htmlwidgets section is DT, or the datatable package.

library(htmlwidgets)
library(DT) # You will have to install it using install.packages('DT') before issuing the library command for the first time.

datatable(TED_speaker_metrics)

Question 2

Create bar graphs to:

  1. show speakers who gave more than 3 talks, such that the height of bars corresponds to the mean talk time of each speaker and the color of the bar corresponds to the number of talks given by each speaker.
  2. show the top 20 tag terms/phrase (based on the frequency of use of each term/phrase) and how frequently they were present in the dataset.

There are 2335 unique speakers in the dataset. Let’s just use ggplot to plot a histogram of the talks.

TED_speaker_metrics %>% ggplot(.,aes(Number_talks))+geom_histogram()

Most people have only 1 talk. Let us filter it to those who have more than 3 talks, as the assignment questions suggests.

TED_speaker_metrics %>% filter(Number_talks>3)%>% ggplot(.,aes(reorder(speaker_name, Mean_talk_duration),Mean_talk_duration))+geom_bar(stat="identity")+coord_flip()

TED_speaker_metrics %>% filter(Number_talks>3)%>% ggplot(.,aes(reorder(speaker_name, Mean_talk_duration),Mean_talk_duration))+geom_bar(stat="identity")+coord_flip()+labs(x="",y="Mean talk duration")+theme_bw()

TED_speaker_metrics %>% filter(Number_talks>3)%>% ggplot(.,aes(reorder(speaker_name, Mean_talk_duration),Mean_talk_duration, fill=Number_talks))+geom_bar(stat="identity")+coord_flip()+labs(x="",y="Mean talk duration")+theme_bw()

TED_speaker_metrics %>% filter(Number_talks>3)%>% ggplot(.,aes(reorder(speaker_name, Mean_talk_duration),Mean_talk_duration, fill=as.factor(Number_talks)))+geom_bar(stat="identity")+coord_flip()+labs(x="",y="Mean talk duration")+theme_bw()

TED_speaker_metrics %>% filter(Number_talks>3)%>% ggplot(.,aes(reorder(speaker_name, Mean_talk_duration),Mean_talk_duration, fill=as.factor(Number_talks)))+geom_bar(stat="identity")+coord_flip()+labs(x="",y="Mean talk duration")+theme_bw()+scale_fill_discrete("Number of talks")

# install.packages("plotly")
library(plotly)

ggobject <- TED_speaker_metrics %>% filter(Number_talks>3)%>% ggplot(.,aes(reorder(speaker_name, Mean_talk_duration),Mean_talk_duration, fill=as.factor(Number_talks)))+geom_bar(stat="identity")+coord_flip()+labs(x="",y="Mean talk duration")+theme_bw()+scale_fill_discrete("Number of talks")

ggobject

ggplotly(ggobject)
TED_speaker_metrics %>% filter(Number_talks>3)%>%
  plot_ly(x=~Mean_talk_duration,y=~speaker_name,color=~as.factor(Number_talks))
TED_speaker_metrics %>% filter(Number_talks>3)%>%
  plot_ly(x=~Mean_talk_duration,y=~reorder(speaker_name,Mean_talk_duration),color=~as.factor(Number_talks),type="bar") %>% layout(title="Speakers with more than 3 Ted Talks", yaxis=list(title=""),xaxis=list(title="Mean talk duration"))
library(taucharts)

tmp= TED_speaker_metrics %>% filter(Number_talks>3)

tauchart(tmp) %>% tau_bar("Mean_talk_duration","speaker_name",color="Number_talks",horizontal = "TRUE") %>% tau_legend() %>% tau_tooltip()
tmp= TED_speaker_metrics %>% filter(Number_talks>3)%>%arrange(-Mean_talk_duration)
 
# Rearranging the data in descending order of Mean_talk_duration, then converting speaker_name to factor variable and specifying that the order of factor variables be in the sequence in which they appear.

tmp$speaker_name=fct_inorder(tmp$speaker_name)

tauchart(tmp)%>%tau_bar("Mean_talk_duration","speaker_name",color="Number_talks",horizontal = "TRUE")%>% tau_legend()%>% tau_tooltip()

show the top 20 tag terms/phrase (based on the frequency of use of each term/phrase) and how frequently they were present in the dataset.

For this, let us take a look at how that column is currently shown

head(TED$tags)
## [1] "alternative energy,cars,global issues,climate change,environment,science,culture,sustainability,technology"
## [2] "simplicity,entertainment,interface design,software,media,computers,technology,music,performance"           
## [3] "MacArthur grant,cities,green,activism,politics,pollution,environment,inequality,business"                  
## [4] "children,teaching,creativity,parenting,culture,dance,education"                                            
## [5] "demo,Asia,global issues,visualizations,global development,statistics,math,health,economics,Google,Africa"  
## [6] "entertainment,goal-setting,potential,psychology,motivation,emotions,culture,business"

For each talk, tags are separated by commas (,). So the strategy will be to separate that column into multiple columns, such that each tag occupies a column of its own, then we can rearrange the data into a long format to look at each specific tag. A priori, we don’t know how many tags are present in each talk. Let’s venture a guess that there are 50 of them;

#stringr

TEDtags <- TED %>% select(tags) %>% separate(tags,c("tag1","tag2","tag3","tag4","tag5","tag6","tag7","tag8","tag9","tag10","tag11","tag12","tag13","tag14","tag15","tag16","tag17","tag18","tag19","tag20","tag21","tag22","tag23","tag24","tag25","tag26","tag27","tag28","tag29","tag30","tag31","tag32","tag33","tag34","tag35","tag36","tag37","tag38","tag39","tag40","tag41","tag42","tag43","tag44","tag45","tag46","tag47","tag48","tag49","tag50"),sep=",") %>%
  gather(tagnum, Tag, tag1:tag50) %>%
  filter(Tag != "")
head(TEDtags)
## # A tibble: 6 x 2
##   tagnum Tag               
##   <chr>  <chr>             
## 1 tag1   alternative energy
## 2 tag1   simplicity        
## 3 tag1   MacArthur grant   
## 4 tag1   children          
## 5 tag1   demo              
## 6 tag1   entertainment

Let us now clean the entries so that there are no beginning and trailing spaces and convert the text to lowercase.

# "alternative energy" is different from "  alternative energy", "alternative energy "

TEDtags$Tag <- trimws(TEDtags$Tag)
TEDtags$Tag <- tolower(TEDtags$Tag)

Now, let us put the power of dplyr to do what is needed to count the tags up.

tagcount <-TEDtags %>% group_by(Tag)%>%summarise(Tag_count=length(Tag))%>%arrange(-Tag_count)

tagcount$Tag=fct_inorder(tagcount$Tag)
# dataframename[rows,columns] # patilv.com/Rbook
tauchart(tagcount[1:20,])%>%tau_bar("Tag_count","Tag",horizontal = "TRUE")%>% tau_legend()%>% tau_tooltip()