Data about date and time can be captured in myriad ways, depending on what data is available, which country captured the data, whether it complies with certain standards, and so on. Various professionals are regularly required to interpret and make use of time data. For example, a marketer might want to know the best day of the week to upload a YouTube video so that it becomes a “trending” video.
Lubridate makes it easier to analyse time data by converting it into a usable format, a time object, and by allowing components of that time object (such as day, week, month, year and more!) to be extracted for analysis, for example, extracting days of the week from dates. We will look at examples of this using Trending YouTube Video Statistics data, downloaded from Kaggle.
# load packages
library(lubridate)
library(here)
library(ggplot2)
library(readr)
# read in data
videos <- read_csv(here("data", "GBvideos.csv"))
If R does not recognise that a variable contains date and time information, it might store it as a character or factor. Lubridate allows you to convert that variable into a time object. This action of converting text into its logical date and time components is commonly referred to as ‘parsing’.
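As a quick illustration (using a made-up date string rather than our dataset), lubridate's ymd() function parses a character string whose components are in year-month-day order:
ymd("2017-11-10") # parse a year-month-day character string into a date object
## [1] "2017-11-10"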
By looking at the class of the variables, we can see that the ‘publish_time’ variable is stored as a character, when we know that this variable contains data about the date and time that the video was published.
# check the class of the variable 'publish_time'
sapply(videos, class)
## video_id trending_date title
## "character" "character" "character"
## channel_title category_id publish_time
## "character" "numeric" "character"
## tags views likes
## "character" "numeric" "numeric"
## dislikes comment_count thumbnail_link
## "numeric" "numeric" "character"
## comments_disabled ratings_disabled video_error_or_removed
## "logical" "logical" "logical"
## description
## "character"
The Lubridate function that you use to convert the time data from a factor or character into a time object is simply named after the order that your time data is currently in.
#check the first few rows of the 'publish_time' variable to find out the format that the time data is in
head(videos$publish_time)
## [1] "11/10/17 7:38" "11/12/17 6:24" "11/10/17 17:00" "11/11/17 17:00"
## [5] "11/9/17 11:04" "11/10/17 19:19"
By viewing the first few rows of data, we can see that the time is in the format DD/MM/YY HH:MM. Therefore, we will use the function dmy_hm() to convert this variable to a time object.
#convert the 'publish_time' variable to a time object and create a new column called "date" where the newly converted time data will be stored
videos$date <- dmy_hm(videos$publish_time)
names(videos)
## [1] "video_id" "trending_date" "title"
## [4] "channel_title" "category_id" "publish_time"
## [7] "tags" "views" "likes"
## [10] "dislikes" "comment_count" "thumbnail_link"
## [13] "comments_disabled" "ratings_disabled" "video_error_or_removed"
## [16] "description" "date"
If your time information is in a different order, then you would follow the same steps as above, but the Lubridate function that you use will change to reflect the order that your time data is in, e.g. dmy(), mdy_hm(), etc.
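For example, using made-up strings purely for illustration, data in month-day-year or year-month-day-hour-minute-second order would be parsed like this:
mdy_hm("10/11/17 7:38") # month, day, year, hour, minute
## [1] "2017-10-11 07:38:00 UTC"
ymd_hms("2017-10-11 07:38:00") # year, month, day, hour, minute, second
## [1] "2017-10-11 07:38:00 UTC"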
Once we’ve converted the data into a time object, there are a number of ways we can use it. We can now extract years, months, days, hours, minutes and seconds to use in our analysis.
vidyear <- year(videos$date) # creates a vector of the years that the trending videos were published
head(vidyear, 5)
## [1] 2017 2017 2017 2017 2017
Now that we have a time object, Lubridate can also give us additional information such as the day of week.
vidwday <- wday(videos$date) # creates a vector of the days of the week that the trending videos were published
head(vidwday, 5)
## [1] 4 2 4 7 2
By default, the day of week will be represented as a number, with the week starting on Sunday i.e. Sunday = 1, Monday = 2, etc. We can ask Lubridate to label the days.
vidwday2 <- wday(videos$date, label = TRUE) # creates a vector of the days of the week that the trending videos were published and labels the days e.g. Monday, Tuesday
head(vidwday2, 5)
## [1] Wed Mon Wed Sat Mon
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
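If you would rather have the week start on Monday, wday() also takes a week_start argument; a small sketch on the same column, where the numbering now runs Monday = 1 through Sunday = 7:
vidwday3 <- wday(videos$date, week_start = 1) # creates a vector of days of the week, numbered with Monday as day 1
head(vidwday3, 5)
## [1] 3 1 3 6 1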
The same labelling convention applies to months:
vidmon2 <- month(videos$date, label = TRUE) # creates a vector of the months that the trending videos were published, and labels the months
head(vidmon2, 5)
## [1] Oct Dec Oct Nov Sep
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
Weeks of the year, which don’t have labels, can be extracted and will show as a number:
vidweek <- week(videos$date) # creates a vector of the weeks of the year that the trending videos were published
head(vidweek, 20)
## [1] 41 50 41 45 37 41 41 41 45 41 41 41 37 41 37 41 37 41 41 37
Now that ‘publish_time’ has been converted to a time object and we can extract its components, we can use it for analysis. For instance, if we wanted to know which day of the week was most likely to have trending YouTube videos, we could create a frequency table that shows us the most frequent day on which trending videos began:
table(wday(videos$date, label = TRUE)) #create a frequency table to show which weekday has the greatest frequency of trending videos
##
## Sun Mon Tue Wed Thu Fri Sat
## 35 88 79 97 34 53 132
sort(table(wday(videos$date, label = TRUE))) #sort the table so that we can quickly see which weekday has the greatest frequency of trending videos
##
## Thu Sun Fri Tue Mon Wed Sat
## 34 35 53 79 88 97 132
We can see that Saturday has the greatest frequency of trending videos.
We could also look at trends, such as which day of the week was most likely to have trending YouTube videos in each of the past years.
videos$vyear <- year(videos$date) # create a new column in the videos data frame for year
videos$vwday <- wday(videos$date, label = TRUE) # create a new column in the videos data frame for weekday
names(videos)
## [1] "video_id" "trending_date" "title"
## [4] "channel_title" "category_id" "publish_time"
## [7] "tags" "views" "likes"
## [10] "dislikes" "comment_count" "thumbnail_link"
## [13] "comments_disabled" "ratings_disabled" "video_error_or_removed"
## [16] "description" "date" "vyear"
## [19] "vwday"
with(videos, table(vyear, vwday)) # create a frequency table using two variables
## vwday
## vyear Sun Mon Tue Wed Thu Fri Sat
## 2010 0 0 2 0 0 0 0
## 2015 0 0 0 0 0 1 3
## 2016 0 0 0 1 0 0 0
## 2017 35 88 77 96 34 52 129
# plot the frequency of trending videos by day, split out by year
ggplot(videos, aes(vwday)) +
geom_bar() +
facet_wrap(~ vyear)
Lubridate can also be used to calculate the difference between times. If the marketer wasn’t sure which time period these trending videos were from, they could extract the earliest and latest dates and use Lubridate’s as.duration() and as.period() functions to calculate the time in between.
First, we need to define the two dates that we want to examine and assign an interval to these two dates:
startdate <- min(videos$date) #create an object for the earliest date in the dataset
enddate <- max(videos$date) #create an object for the latest date in the dataset
interval <- startdate %--% enddate #create an interval spanning the earliest and latest dates
interval
## [1] 2010-02-09 20:48:00 UTC--2017-12-11 19:53:00 UTC
Lubridate’s as.duration function can then be used to find the time between these two dates.
duration <- as.duration(interval)
duration # lubridate prints the duration in seconds and in this case, gives an estimate in years
## [1] "247273500s (~7.84 years)"
The as.period function can be used to find the time in years, months, days, hours, minutes and seconds.
period <- as.period(interval)
period # lubridate prints the duration in years, months, days, hours, minutes, seconds
## [1] "7y 10m 1d 23H 5M 0S"
The marketer can quickly find that the YouTube data that they are looking at spans a period of over 7 years and 10 months.
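If a single number in one unit is more convenient, lubridate’s time_length() function can express the same interval directly in a unit of your choosing, for example:
time_length(interval, "day") # length of the interval expressed in days
time_length(interval, "year") # length of the interval expressed in years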