Data about date and time can be captured in myriad ways, depending on what data is available, which country captured the data, whether it complies with certain standards, and so on. Various professionals are regularly required to interpret and make use of time data. For example, a marketer might want to know the best day of the week to upload a YouTube video so that it becomes a “trending” video.
Lubridate makes it easier to analyse time data by converting it into a usable format, a time object, and by allowing components of that time object (such as day, week, month, year and more!) to be extracted for analysis, for example, extracting days of the week from dates. We will look at examples of this using Trending YouTube Video Statistics data, downloaded from Kaggle.
# load packages
library(lubridate)
library(here)
library(ggplot2)
library(readr)
# read in data
videos <- read_csv(here("data", "GBvideos.csv"))
If R does not recognise that a variable contains date and time information, it might store it as a character or factor. Lubridate allows you to convert that variable into a time object. This action of converting text into its logical date and time components is commonly referred to as ‘parsing’.
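As a quick illustration (using a made-up date string rather than our dataset), lubridate's ymd() function parses a character string whose components are in year-month-day order:
ymd("2017-11-10") # parse a year-month-day character string into a date object
## [1] "2017-11-10"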
By looking at the class of the variables, we can see that the ‘publish_time’ variable is stored as a character, when we know that this variable contains data about the date and time that the video was published.
# check the class of the variable 'publish_time'
sapply(videos, class)
## video_id trending_date title
## "character" "character" "character"
## channel_title category_id publish_time
## "character" "numeric" "character"
## tags views likes
## "character" "numeric" "numeric"
## dislikes comment_count thumbnail_link
## "numeric" "numeric" "character"
## comments_disabled ratings_disabled video_error_or_removed
## "logical" "logical" "logical"
## description
## "character"
The Lubridate function that you use to convert the time data from a factor or character into a time object is simply named after the order that your time data is currently in.
#check the first few rows of the 'publish_time' variable to find out the format that the time data is in
head(videos$publish_time)
## [1] "11/10/17 7:38" "11/12/17 6:24" "11/10/17 17:00" "11/11/17 17:00"
## [5] "11/9/17 11:04" "11/10/17 19:19"
By viewing the first few rows of data, we can see that the time is in the format DD/MM/YY HH:MM. Therefore, we will use the function dmy_hm() to convert this variable to a time object.
#convert the 'publish_time' variable to a time object and create a new column called "date" where the newly converted time data will be stored
videos$date <- dmy_hm(videos$publish_time)
names(videos)
## [1] "video_id" "trending_date" "title"
## [4] "channel_title" "category_id" "publish_time"
## [7] "tags" "views" "likes"
## [10] "dislikes" "comment_count" "thumbnail_link"
## [13] "comments_disabled" "ratings_disabled" "video_error_or_removed"
## [16] "description" "date"
If your time information is in a different order, then you would follow the same steps as above, but the Lubridate function that you use will change to reflect the order that your time data is in, e.g. dmy(), mdy_hm(), etc.
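For example, using made-up strings purely for illustration, data in month-day-year or year-month-day-hour-minute-second order would be parsed like this:
mdy_hm("10/11/17 7:38") # month, day, year, hour, minute
## [1] "2017-10-11 07:38:00 UTC"
ymd_hms("2017-10-11 07:38:00") # year, month, day, hour, minute, second
## [1] "2017-10-11 07:38:00 UTC"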
Once we’ve converted the data into a time object, there are a number of ways we can use it. We can now extract years, months, days, hours, minutes and seconds to use in our analysis.
vidyear <- year(videos$date) # creates a vector of the years that the trending videos were published
head(vidyear, 5)
## [1] 2017 2017 2017 2017 2017
Now that we have a time object, Lubridate can also give us additional information such as the day of week.
vidwday <- wday(videos$date) # creates a vector of the days of the week that the trending videos were published
head(vidwday, 5)
## [1] 4 2 4 7 2
By default, the day of week will be represented as a number, with the week starting on Sunday i.e. Sunday = 1, Monday = 2, etc. We can ask Lubridate to label the days.
vidwday2 <- wday(videos$date, label = TRUE) # creates a vector of the days of the week that the trending videos were published and labels the days e.g. Monday, Tuesday
head(vidwday2, 5)
## [1] Wed Mon Wed Sat Mon
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
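If you would rather have the week start on Monday, wday() also takes a week_start argument; a small sketch on the same column, where the numbering now runs Monday = 1 through Sunday = 7:
vidwday3 <- wday(videos$date, week_start = 1) # creates a vector of days of the week, numbered with Monday as day 1
head(vidwday3, 5)
## [1] 3 1 3 6 1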
The same labelling convention applies to months:
vidmon2 <- month(videos$date, label = TRUE) # creates a vector of the months that the trending videos were published, and labels the months
head(vidmon2, 5)
## [1] Oct Dec Oct Nov Sep
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
Weeks of the year, which don’t have labels, can be extracted and will show as a number:
vidweek <- week(videos$date) # creates a vector of the weeks of the year that the trending videos were published
head(vidweek, 20)
## [1] 41 50 41 45 37 41 41 41 45 41 41 41 37 41 37 41 37 41 41 37
Now that ‘publish_time’ has been converted to a time object and we can extract its components, we can use it for analysis. For instance, if we wanted to know which day of the week was most likely to have trending YouTube videos, we could create a frequency table that shows us the most frequent day on which trending videos began:
table(wday(videos$date, label = TRUE)) #create a frequency table to show which weekday has the greatest frequency of trending videos
##
## Sun Mon Tue Wed Thu Fri Sat
## 35 88 79 97 34 53 132
sort(table(wday(videos$date, label = TRUE))) #sort the table so that we can quickly see which weekday has the greatest frequency of trending videos
##
## Thu Sun Fri Tue Mon Wed Sat
## 34 35 53 79 88 97 132
We can see that Saturday has the greatest frequency of trending videos.
We could also look at trends, such as which day of the week was most likely to have trending YouTube videos in each of the past years.
videos$vyear <- year(videos$date) # create a new column in the videos data frame for year
videos$vwday <- wday(videos$date, label = TRUE) # create a new column in the videos data frame for weekday
names(videos)
## [1] "video_id" "trending_date" "title"
## [4] "channel_title" "category_id" "publish_time"
## [7] "tags" "views" "likes"
## [10] "dislikes" "comment_count" "thumbnail_link"
## [13] "comments_disabled" "ratings_disabled" "video_error_or_removed"
## [16] "description" "date" "vyear"
## [19] "vwday"
with(videos, table(vyear, vwday)) # create a frequency table using two variables
## vwday
## vyear Sun Mon Tue Wed Thu Fri Sat
## 2010 0 0 2 0 0 0 0
## 2015 0 0 0 0 0 1 3
## 2016 0 0 0 1 0 0 0
## 2017 35 88 77 96 34 52 129
# plot the frequency of trending videos by day, split out by year
ggplot(videos, aes(vwday)) +
geom_bar() +
facet_wrap(~ vyear)
Lubridate can also be used to calculate the difference between times. If the marketer wasn’t sure which time period these trending videos were from, they could extract the earliest and latest dates and use Lubridate’s as.duration() and as.period() functions to calculate the time in between.
First, we need to define the two dates that we want to examine and assign an interval to these two dates:
startdate <- min(videos$date) #create an object for the earliest date in the dataset
enddate <- max(videos$date) #create an object for the latest date in the dataset
interval <- startdate %--% enddate #create an interval spanning the earliest and latest dates
interval
## [1] 2010-02-09 20:48:00 UTC--2017-12-11 19:53:00 UTC
Lubridate’s as.duration function can then be used to find the time between these two dates.
duration <- as.duration(interval)
duration # lubridate prints the duration in seconds and in this case, gives an estimate in years
## [1] "247273500s (~7.84 years)"
The as.period function can be used to find the time in years, months, days, hours, minutes and seconds.
period <- as.period(interval)
period # lubridate prints the duration in years, months, days, hours, minutes, seconds
## [1] "7y 10m 1d 23H 5M 0S"
The marketer can quickly find that the YouTube data that they are looking at spans a period of over 7 years and 10 months.
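If a single number in one unit is more convenient, lubridate’s time_length() function can express the same interval directly in a unit of your choosing, for example:
time_length(interval, "day") # length of the interval expressed in days
time_length(interval, "year") # length of the interval expressed in years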