I have always struggled with times, dates and durations in computer programs. There does not seem to be an intuitive way to represent the idiosyncrasies we take for granted in these concepts: durations and times of day look alike, and then there are time zones, leap years, and even our “sexagesimal” (base-60) way of counting minutes and seconds. For this project I wanted to explore a package that deals with all of the above - the tidyverse package lubridate.
While searching the FiveThirtyEight data sets, I found an interesting one to explore. FiveThirtyEight analyzed tweets sent during a big boxing match - Mayweather vs McGregor - using emoji as a proxy for excitement and for how the fight was progressing. I will limit my analysis to the timestamp of each of the roughly 12 thousand tweets.
library(RCurl)
Tweets <- read.csv(text = getURL("https://raw.githubusercontent.com/ChristopherBloome/607/master/Boxing.Tweets.csv"))
head(Tweets)
## created_at emojis id
## 1 8/27/2017 0:05 TRUE 9.01657e+17
## 2 8/27/2017 0:05 TRUE 9.01657e+17
## 3 8/27/2017 0:05 TRUE 9.01657e+17
## 4 8/27/2017 0:05 TRUE 9.01657e+17
## 5 8/27/2017 0:05 TRUE 9.01657e+17
## 6 8/27/2017 0:05 TRUE 9.01657e+17
## link retweeted screen_name
## 1 https://twitter.com/statuses/901656910939770881 FALSE aaLiysr
## 2 https://twitter.com/statuses/901656917281574912 FALSE zulmafrancozaf
## 3 https://twitter.com/statuses/901656917105369088 FALSE Adriana11D
## 4 https://twitter.com/statuses/901656917747142657 FALSE Nathan_Caro_
## 5 https://twitter.com/statuses/901656916828594177 FALSE sahouraxox
## 6 https://twitter.com/statuses/901656914307805184 FALSE wvtnces
## text
## 1 Ringe çikmadan ates etmeye basladi <U+0001F603>#McGregor https://t.co/mJHDvLfIVc
## 2 <U+0001F632><U+0001F632><U+0001F632><U+0001F632><U+0001F632> @lalylourbet2 https://t.co/ERUGHhQINE
## 3 <U+0001F1EE><U+0001F1EA><U+0001F1EE><U+0001F1EA><U+0001F1EE><U+0001F1EA> <U+0001F4AA><U+0001F4AA>#MayweathervMcgregor
## 4 Cest partit #MayweatherMcGregor <U+0001F4AA><U+0001F3FF>
## 5 Low key feeling bad for ppl who payed to watch the game cause it got delayed it's rigged <U+0001F923><U+0001F923>#AintNobodyGotTimeForThat #MayweathervMcgregor
## 6 #McGregor <U+0001F44A><U+0001F44A><U+0001F44A>
Tweets$created_at[1]
## [1] 8/27/2017 0:05
## 70 Levels: 8/27/2017 0:05 8/27/2017 0:06 8/27/2017 0:07 ... 8/27/2017 1:14
class(Tweets$created_at[1])
## [1] "factor"
While the data may seem to be in the correct form, the created_at column has class “factor.” If it were a true timestamp, we would expect class “POSIXct”. Luckily, we have lubridate to help us convert.
library(tidyverse)
## -- Attaching packages --------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1 v purrr 0.3.3
## v tibble 2.1.3 v dplyr 0.8.4
## v tidyr 1.0.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ------------------------------------------------------------------------ tidyverse_conflicts() --
## x tidyr::complete() masks RCurl::complete()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
Tweets$created_at_timestamp <- mdy_hm(Tweets$created_at)
Tweets$created_at_timestamp[1]
## [1] "2017-08-27 00:05:00 UTC"
class(Tweets$created_at_timestamp[1])
## [1] "POSIXct" "POSIXt"
lubridate has a very simple interface. As you can see above, the call is nothing more than mdy_hm(), which stands for month, day, year, hour, minute. The package offers a family of functions named with variations of these letters, such as ymd(), dmy_hms() and mdy_h(); using one is as simple as choosing the name whose letters match the order in which the components appear in the string. The underscore in the name only separates the date letters from the time letters. It does not have to match anything in the string itself, because lubridate recognizes the usual separators (slashes, dashes, colons, spaces) on its own.
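As a minimal sketch of that naming convention, the same moment written in three layouts can be parsed by picking the function whose letters match the order of the components. The sample strings are made up for illustration and are not taken from the tweet data.

```r
library(lubridate)

# Hypothetical sample strings: one moment, three layouts.
us_style  <- mdy_hm("8/27/2017 0:05")         # month, day, year, then hour, minute
iso_style <- ymd_hms("2017-08-27 00:05:00")   # year, month, day, then hour, minute, second
eu_style  <- dmy_hm("27.08.2017 00:05")       # day, month, year, then hour, minute

# All three should describe the same instant (UTC unless tz is supplied).
us_style == iso_style
iso_style == eu_style
```

Only the order of the components matters when choosing the function; the separators in the string play no part in the choice.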
As you may notice, the timestamp above is labelled UTC. After a quick search, I learned that the fight started at 12:05 am ET, which is 04:05 UTC, so the clock time is right but the time zone attached to it is wrong. We have two options for correcting this: express each timestamp as the correct time in UTC, or as the correct time in Eastern Time. Because the recorded clock times already match Eastern Time, it is more logical to keep them and correct the time zone.
Tweets$created_at_timestamp <- force_tz(Tweets$created_at_timestamp, tzone = "America/New_York")
The force_tz function keeps the clock time and simply changes the time zone attached to it, so the result refers to a different instant. Alternatively, if the UTC time were correct and we only wanted to display it in another time zone, we could use the with_tz function, which keeps the instant and converts the clock time to its equivalent in the new zone.
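The difference between the two functions is easiest to see side by side. The sketch below uses a single hypothetical timestamp as a stand-in for the first tweet instead of the full data frame; the last lines show one possible alternative, assuming the zone is known in advance, which is to pass tz directly to the parser.

```r
library(lubridate)

# Hypothetical stand-in for the first parsed tweet timestamp (labelled UTC).
parsed <- mdy_hm("8/27/2017 0:05")

# force_tz(): same clock time, different instant.
relabelled <- force_tz(parsed, tzone = "America/New_York")

# with_tz(): same instant, different clock time.
converted <- with_tz(parsed, tzone = "America/New_York")

hour(relabelled)      # clock hour is unchanged by force_tz()
hour(converted)       # clock hour is shifted to Eastern Time by with_tz()
converted == parsed   # with_tz() does not change the underlying moment
relabelled - parsed   # force_tz() does: the two values are hours apart

# Supplying tz while parsing avoids the separate force_tz() step.
direct <- mdy_hm("8/27/2017 0:05", tz = "America/New_York")
direct == relabelled
```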
As a final note, the time zone names are not exactly intuitive. To find a complete list, use OlsonNames(). Also note that the name passed to the function is not how the zone is shown in the printed timestamp: "America/New_York" goes in, but the timestamp prints with an abbreviation such as EDT.
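Since OlsonNames() returns several hundred entries, it can help to search the list instead of reading it. The following is a small sketch, with a hypothetical timestamp, of how the name supplied and the abbreviation printed can each be inspected.

```r
library(lubridate)

# Search the built-in list rather than scrolling through every name.
grep("New_York", OlsonNames(), value = TRUE)

# Hypothetical timestamp in Eastern Time.
x <- mdy_hm("8/27/2017 0:05", tz = "America/New_York")

tz(x)            # the Olson name that was supplied
format(x, "%Z")  # the abbreviation used when the value is printed
```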