Let’s first load up the tidyverse library because Hadley Wickham is a homie.
library(tidyverse)
Let’s begin with a sample dataset of five datapoints of 5 people with different timestamps.
#Sample dataset
start_data <- tibble(Name = c("Sean_Ontime",
"Chips",
"Amy",
"Sean_Late",
"Sean_BarelyOntime"),
Time_Stamp = c("2020/01/04 07:32:21 AM PST",
"2020/02/05 05:32:21 PM PST",
"2020/03/04 12:09:11 AM PST",
"2020/01/04 06:32:21 PM PST",
"2020/01/04 09:00:00 AM PST"))
head(start_data)
## # A tibble: 5 x 2
## Name Time_Stamp
## <chr> <chr>
## 1 Sean_Ontime 2020/01/04 07:32:21 AM PST
## 2 Chips 2020/02/05 05:32:21 PM PST
## 3 Amy 2020/03/04 12:09:11 AM PST
## 4 Sean_Late 2020/01/04 06:32:21 PM PST
## 5 Sean_BarelyOntime 2020/01/04 09:00:00 AM PST
We will do two rounds of separation.
In this case, we have the structure YYYY/MM/DD HH:MM:SS AM/PM PST
We will separate() the Timestamp column by everything separated by a space. This will result in 4 columns:
#Separate timestamp out by spaces
DataTimeDate <- start_data %>%
separate(Time_Stamp, sep = " ", into = c("Date", "Time", "DayNight", "PST")) %>%
#Remove unnecessary columns
select(-PST)
head(DataTimeDate)
## # A tibble: 5 x 4
## Name Date Time DayNight
## <chr> <chr> <chr> <chr>
## 1 Sean_Ontime 2020/01/04 07:32:21 AM
## 2 Chips 2020/02/05 05:32:21 PM
## 3 Amy 2020/03/04 12:09:11 AM
## 4 Sean_Late 2020/01/04 06:32:21 PM
## 5 Sean_BarelyOntime 2020/01/04 09:00:00 AM
We will now further separate the date using the numbers separated by a slash “/”. Since slash is a special character, in R we must indicate we are introducing a special character by proceeding it with two backslashes “\”.
For example, our code would look like… separate(Date, sep = “\/”)
This will separate the Date columns into three new columns:
Lastly, we will change the Year, Month, and Day column into numeric vectors (they are currently charcter vectors). This will allow us to use numerical filtering such as greater than or less than.
#Separate date into year, month, day
DataYearMonthDay <- DataTimeDate %>%
separate(Date, sep = "\\/", into = c("Year", "Month", "Day")) %>%
#Change dates into numerics for filtering
mutate_at(c("Year", "Month", "Day"), as.numeric)
head(DataYearMonthDay)
## # A tibble: 5 x 6
## Name Year Month Day Time DayNight
## <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 Sean_Ontime 2020 1 4 07:32:21 AM
## 2 Chips 2020 2 5 05:32:21 PM
## 3 Amy 2020 3 4 12:09:11 AM
## 4 Sean_Late 2020 1 4 06:32:21 PM
## 5 Sean_BarelyOntime 2020 1 4 09:00:00 AM
If at this point you want to filter for a specific month and day, you can do so using the filter() function on your dataframe.
How would be filter for all records that have a date before February?
#Give me dates before February
DataYearMonthDay %>%
filter(Month < 2)
## # A tibble: 3 x 6
## Name Year Month Day Time DayNight
## <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 Sean_Ontime 2020 1 4 07:32:21 AM
## 2 Sean_Late 2020 1 4 06:32:21 PM
## 3 Sean_BarelyOntime 2020 1 4 09:00:00 AM
Now we are leveling up.
You may find that you need more than one argument to satisfy your query. Sometimes it’s easier separating your question into multiple arguments. This makes it easier to do a query. For example let’s say “We want all dates between February 2-10”.
We can separate this into two specific arguments:
We need our query to match BOTH these arguments. That means this argument should be stated “Give records that match arugment 1 AND arugment two”. In R, the AND operator is “&”.
#Conditional: Give me dates between feburary 2 and 10
DataYearMonthDay %>%
#Give me only records that are in February AND have day between 2 and 10
filter(Month == 2 & Day %in% 2:10)
## # A tibble: 1 x 6
## Name Year Month Day Time DayNight
## <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 Chips 2020 2 5 05:32:21 PM
Now that we have our date separated, let’s also deal with the time.
Currently our time is in the format “HH:MM:SS”. We see that hours, minutes, and seconds are separated by colons. Colons are not a special character, so we can simply separate them like… separate(Data, sep = “:”)
We can break up our time column into three new columns:
We will then transform them in numeric vectors since we saw it made things easier for the date!
#Separate time into Hour, minutes seconds
DataHourMinutesSeconds <- DataYearMonthDay %>%
separate(Time, sep = ":", into = c("Hours", "Minutes", "Seconds")) %>%
#Change these columns into numeric vectors
mutate_at(c("Hours", "Minutes", "Seconds"), as.numeric)
head(DataHourMinutesSeconds)
## # A tibble: 5 x 8
## Name Year Month Day Hours Minutes Seconds DayNight
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Sean_Ontime 2020 1 4 7 32 21 AM
## 2 Chips 2020 2 5 5 32 21 PM
## 3 Amy 2020 3 4 12 9 11 AM
## 4 Sean_Late 2020 1 4 6 32 21 PM
## 5 Sean_BarelyOntime 2020 1 4 9 0 0 AM
Time is bit tricker to deal with than the date. It may be easier to use military time for example since you don’t have to worry about AM or PM.
I’m lazy, so I’m not going to convert things to military times, but use many conditional statement to get what I want! It’s not ideal but it lets us play with conditionals.
Let’s say we want to get all records on January 4 before 5pm.
We can break this into four arguments:
Argument 1 and 2 are clearly an AND statment “&” since both conditions must be met.
Argument 3 and 4 can are an OR statment. The condition is satified simply if the hour is AM OR if the leftover PM hours are before 5pm.
In R the operator for OR is “|”.
DataHourMinutesSeconds %>%
#Filter records on January 4
filter(Month == 1 & Day == 4) %>%
#Must be either in the AM hours OR before 5pm
filter(DayNight == "AM" | Hours <5)
## # A tibble: 2 x 8
## Name Year Month Day Hours Minutes Seconds DayNight
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Sean_Ontime 2020 1 4 7 32 21 AM
## 2 Sean_BarelyOntime 2020 1 4 9 0 0 AM
What if we don’t want to just query the results that match our criteria, but want to flag it for later. Perhaps a bad graduate student, Sean, is often late, or barely ontime. We may want to track this in his records….
In this situation we can create a new column called “Lateness” that has a different entry dependent on whether or not it means the condition we are interested in.
Let’s say we want to mark anyone who shows up between 7 and 9am as ontime, and everyone else late. We can create a new column by using the mutate() function. We make the new column “Lateness” equal to a conditional statement that populates the column with one of two possible states “OnTime” or “Late”.
Then we will mark the entry as ontime. Otherwise, we will mark it as late. We code this up as follows..
#Maybe you want to mark everybody as late or ontime depending upon if they show up between 7 and 9AM
NewColumnLATES <- DataHourMinutesSeconds %>%
mutate(Lateness = if_else(DayNight == "AM" & Hours %in% 7:8 |
DayNight == "AM" & Hours == 9 & Minutes == 0,
#If all coniditions are met, we fill the corresponding element with OnTime
"OnTime",
#If the conditions are not met, we fill the element with Late
"Late"))
head(NewColumnLATES)
## # A tibble: 5 x 9
## Name Year Month Day Hours Minutes Seconds DayNight Lateness
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 Sean_Ontime 2020 1 4 7 32 21 AM OnTime
## 2 Chips 2020 2 5 5 32 21 PM Late
## 3 Amy 2020 3 4 12 9 11 AM Late
## 4 Sean_Late 2020 1 4 6 32 21 PM Late
## 5 Sean_BarelyOntime 2020 1 4 9 0 0 AM OnTime
Hope this was helpful!
-Sean Perez PhD Candidate @ UC Berkeley