Data Wrangling with R

Introduction

R has a variety of packages for handling character strings and dates. I will introduce some of them and show how they are used. If you are curious about the details of a function, look it up with ?function_name or help(function_name).

Strings

Whenever you report or print results, you need to know how to work with strings.

1. length() / nchar()

When you want to know how long a string is, the nchar() function will help you. It counts the number of characters (or bytes, or display width).

strings <- c("curly", "straight", "long", "short")
nchar("curly")

## [1] 5

nchar(strings)

## [1] 5 8 4 5

You may think that length() counts the characters in a string. It does not. It returns the length of the vector. When you apply length() to a single string, it returns 1 because R treats that string as a vector with one element:

length("long")

## [1] 1

length(strings)

## [1] 4
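To make the difference concrete, here is a minimal sketch that applies both functions to the same vector. The names words, n_elements, n_chars, total_chars and longest are placeholders introduced only for this illustration.

```r
# `words` is a hypothetical vector used only for this illustration
words <- c("curly", "straight", "long", "short")

n_elements  <- length(words)              # number of elements in the vector: 4
n_chars     <- nchar(words)               # characters per element: 5 8 4 5
total_chars <- sum(n_chars)               # characters across the whole vector: 22
longest     <- words[which.max(n_chars)]  # element with the most characters: "straight"
```

Because nchar() is vectorized, it combines naturally with sum() or which.max(), whereas length() only ever describes the container.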

2. paste() / paste0()

When you want to join two or more strings, the paste() function concatenates them. In short, it creates a new string by joining its arguments.

paste("Everyone", "loves", "Statistics") # default sep is " " (a single space)

## [1] "Everyone loves Statistics"

paste("Everyone", "loves", "Statistics", 
      sep = "-")

## [1] "Everyone-loves-Statistics"

paste("The square root of twice pi is approximately", sqrt(2*pi))

## [1] "The square root of twice pi is approximately 2.506628274631"

paste(strings, 
      "hairstyle", 
      collapse = ", and ") # collapse parameter lets you define a top-level separator

## [1] "curly hairstyle, and straight hairstyle, and long hairstyle, and short hairstyle"

The paste0() function is equivalent to paste(..., sep = "", collapse) but slightly more efficient. If a value is specified for collapse, the elements of the result are concatenated into a single string, separated by the value of collapse.

paste0("Everyone", "loves", "Statistics")

## [1] "EveryonelovesStatistics"

paste0(strings, collapse = "; ")

## [1] "curly; straight; long; short"
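The roles of sep and collapse are easy to mix up, so here is a small sketch that contrasts them. The input vectors and result names are hypothetical and exist only for this illustration.

```r
# `letters_vec` and `numbers_vec` are hypothetical inputs for illustration
letters_vec <- c("a", "b", "c")
numbers_vec <- c("1", "2", "3")

# sep joins the arguments element by element; the result keeps length 3
joined <- paste(letters_vec, numbers_vec, sep = "-")

# collapse then joins those elements into one string of length 1
one_string <- paste(letters_vec, numbers_vec, sep = "-", collapse = " | ")

# shorter arguments are recycled, which is handy for building labels
col_labels <- paste0("x", 1:3)
```

Here joined is "a-1" "b-2" "c-3", one_string is "a-1 | b-2 | c-3", and col_labels is "x1" "x2" "x3".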

3. substr()

substr() extracts a portion of a string by position, from start to stop inclusive.

substr("Statistics is awesome", 
       start = 1,
       stop = 4)

## [1] "Stat"

substr("Statistics is awesome", 
       start = 15,
       stop = 21)

## [1] "awesome"

substr(strings, # vector 
       start = 1, stop = 3)

## [1] "cur" "str" "lon" "sho"
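Two related patterns come up often: taking characters from the end of a string, and overwriting part of a string. The sketch below shows one way to do both in base R; the variable names are placeholders for this example.

```r
word <- "Statistics"   # hypothetical example string
n <- nchar(word)

# the last three characters: count back from the end
last_three <- substr(word, n - 2, n)   # "ics"

# substr() also has a replacement form that overwrites characters in place
shout <- word
substr(shout, 1, 4) <- "STAT"          # shout is now "STATistics"
```

The replacement form never changes the length of the string; it only overwrites the positions you name.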

4. strsplit()

Strings can be split as well as concatenated. strsplit() separates a string at each occurrence of a delimiter. If there is a match at the beginning of a (non-empty) string, the first element of the output is "", but if there is a match at the end of the string, the output is the same as with the match removed. strsplit() returns the pieces as a list, with one list element per input string.

path <- "/home/mike/data/trials.csv"

strsplit(path, split = "/" ) # returns list

## [[1]]
## [1] ""           "home"       "mike"       "data"       "trials.csv"

unlist(strsplit(path, split = "/" )) # if you don't want a list, unlist() flattens it into a character vector

## [1] ""           "home"       "mike"       "data"       "trials.csv"

paths <- c("/home/data/trials.csv",
           "/home/data/errors.csv",
           "/home/corr/reject.doc")

strsplit(paths, split = "/")

## [[1]]
## [1] ""           "home"       "data"       "trials.csv"
## 
## [[2]]
## [1] ""           "home"       "data"       "errors.csv"
## 
## [[3]]
## [1] ""           "home"       "corr"       "reject.doc"

unlist(strsplit(paths, split = "/"))

##  [1] ""           "home"       "data"       "trials.csv" ""          
##  [6] "home"       "data"       "errors.csv" ""           "home"      
## [11] "corr"       "reject.doc"
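Flattening with unlist() loses track of which piece came from which path. If you want one result per input, a possible approach is to keep the list and pick a piece from each element, as in this sketch. The names parts and file_names are placeholders introduced here.

```r
paths <- c("/home/data/trials.csv",
           "/home/data/errors.csv",
           "/home/corr/reject.doc")

# fixed = TRUE treats "/" as a literal string rather than a regular expression
parts <- strsplit(paths, split = "/", fixed = TRUE)

# take the last piece of each split path; vapply() guarantees a character vector
file_names <- vapply(parts, function(p) p[length(p)], character(1))
# "trials.csv" "errors.csv" "reject.doc"

# base R's basename() gives the same answer for paths like these
same_as_basename <- identical(file_names, basename(paths))
```

For real file paths basename() is the simpler choice; the strsplit() version is shown because it generalizes to any delimiter.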

5. sub() / gsub()

sub() and gsub() perform replacement of the first and all matches respectively.

strings <- "Statistics are attractive. Statistics are awesome"

sub("Statistics", "you", strings) # replaces only the first match

## [1] "you are attractive. Statistics are awesome"

gsub("Statistics", "you", strings) # replaces all matches

## [1] "you are attractive. you are awesome"

sub(" and SAS", "",
    "For really tough problems, you need R and SAS.") # replace with "" to delete the match

## [1] "For really tough problems, you need R."
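The first argument of sub() and gsub() is a regular expression by default, which matters as soon as the pattern contains special characters. The following sketch, using made-up input strings, shows a pattern with a character class and the effect of fixed = TRUE.

```r
# a character class with a quantifier: every run of digits becomes "#"
masked <- gsub("[0-9]+", "#", "a1b22c333")        # "a#b#c#"

# "." is a regex metacharacter that matches any character ...
all_replaced <- gsub(".", "-", "a.b")             # "---"

# ... so use fixed = TRUE (or escape it as "\\.") to match a literal dot
dot_only   <- gsub(".", "-", "a.b", fixed = TRUE) # "a-b"
dot_escape <- gsub("\\.", "-", "a.b")             # "a-b"
```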

I am not yet fluent in regular expressions (a.k.a. regexp) myself and am still studying them. If you are interested too, the book Mastering Regular Expressions may help you.

6. combination using outer() & paste()

outer() is designed to calculate the outer product. However, its third argument accepts any other vectorized function. In this code, passing paste() creates all combinations of two string vectors.

locations <- c("NY", "CA", "CHI", "LA")
treatments <- c("T1", "T2", "T3")

outer(locations, treatments, paste, sep = " - ") # returns as a matrix

##      [,1]       [,2]       [,3]      
## [1,] "NY - T1"  "NY - T2"  "NY - T3" 
## [2,] "CA - T1"  "CA - T2"  "CA - T3" 
## [3,] "CHI - T1" "CHI - T2" "CHI - T3"
## [4,] "LA - T1"  "LA - T2"  "LA - T3"

outer(locations, treatments, paste, sep = " - ") %>% as.vector() # returns a vector (the %>% pipe needs magrittr or dplyr loaded)

##  [1] "NY - T1"  "CA - T1"  "CHI - T1" "LA - T1"  "NY - T2"  "CA - T2" 
##  [7] "CHI - T2" "LA - T2"  "NY - T3"  "CA - T3"  "CHI - T3" "LA - T3"

outer(treatments, treatments, paste, sep = ":") # includes self-pairs and both orderings

##      [,1]    [,2]    [,3]   
## [1,] "T1:T1" "T1:T2" "T1:T3"
## [2,] "T2:T1" "T2:T2" "T2:T3"
## [3,] "T3:T1" "T3:T2" "T3:T3"

m <- outer(treatments, treatments, paste, sep = ":")
m[!lower.tri(m)] # drops the strict lower triangle, keeping the upper triangle and the diagonal

## [1] "T1:T1" "T1:T2" "T2:T2" "T1:T3" "T2:T3" "T3:T3"
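If you also want to drop the self-pairs such as "T1:T1", one option is base R's combn(), which enumerates each unordered pair once. This is a sketch of an alternative, not part of the original recipe; the result names are placeholders.

```r
treatments <- c("T1", "T2", "T3")

# combn() enumerates each unordered pair once and applies paste() to it;
# as.vector() drops any dim attribute so the result is a plain character vector
pairs_combn <- as.vector(combn(treatments, 2, paste, collapse = ":"))
# "T1:T2" "T1:T3" "T2:T3"

# the same pairs from the outer() matrix: keep only the strict upper triangle
m <- outer(treatments, treatments, paste, sep = ":")
pairs_outer <- m[upper.tri(m)]
```

Both approaches give the three distinct pairs; combn() avoids building the full matrix first.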

Dates

The following classes are included in the base distribution of R:

Date
The Date class can represent a calendar date but not a clock time. It is a solid, general-purpose class for working with dates, including conversions, formatting, basic date arithmetic, and time-zone handling.

POSIXct
This is a datetime class, and it can represent a moment in time with an accuracy of one second. Internally, the datetime is stored as the number of seconds since January 1, 1970, and so is a very compact representation. This class is recommended for storing datetime information (e.g., in data frames).

POSIXlt
This is also a datetime class, but the representation is stored in a nine-element list that includes the year, month, day, hour, minute, and second. That representation makes it easy to extract date parts, such as the month or hour. Obviously, this representation is much less compact than the POSIXct class; hence it is normally used for intermediate processing and not for storing data.

The base distribution also provides functions for easily converting between representations: as.Date, as.POSIXct, and as.POSIXlt.
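A quick way to see how the three base classes differ is to look at what they store. The sketch below uses an arbitrary example timestamp and fixes the time zone to UTC so the numbers do not depend on your machine; all variable names are placeholders.

```r
d  <- as.Date("2018-07-16")
ct <- as.POSIXct("2018-07-16 12:34:56", tz = "UTC")
lt <- as.POSIXlt(ct)

days_since_epoch    <- unclass(d)       # 17728 days since 1970-01-01
seconds_since_epoch <- as.numeric(ct)   # 1531744496 seconds since 1970-01-01 00:00:00 UTC

# POSIXlt stores the parts as list elements; note the offsets
hour_part  <- lt$hour          # 12
month_part <- lt$mon + 1       # mon is 0-based, so add 1 to get 7
year_part  <- lt$year + 1900   # year counts from 1900, so add 1900 to get 2018
```

The single number behind POSIXct is why it suits data frame columns, while the list behind POSIXlt is convenient when you need individual parts.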

The following packages are available for download from CRAN:

chron
The chron package can represent both dates and times but without the added complexities of handling time zones and daylight saving time. It’s therefore easier to use than Date but less powerful than POSIXct and POSIXlt. It would be useful for work in econometrics or time series analysis.

lubridate
This is a relatively new, general-purpose package. It’s designed to make working with dates and times easier while keeping the important bells and whistles such as time zones. It’s especially clever regarding datetime arithmetic.

mondate
This is a specialized package for handling dates in units of months in addition to days and years. Such needs arise in accounting and actuarial work, for example, where month-by-month calculations are needed.

timeDate
This is a high-powered package with well-thought-out facilities for handling dates and times, including date arithmetic, business days, holidays, conversions, and generalized handling of time zones. It was originally part of the Rmetrics software for financial modeling, where precision in dates and times is critical. If you have a demanding need for date facilities, consider this package.

Personally, I usually use the lubridate package and the base Date functions. For details, also see Date and Time Classes in R.

1. current date

Sys.Date() returns the current date, and Sys.time() returns the current date and time. Note the capital D: R is case-sensitive, so Sys.date() does not exist.

Sys.Date() # current date

## [1] "2018-08-05"

Sys.time() # current date and time

## [1] "2018-08-05 18:05:36 KST"

2. change character to date

Using the as.Date() function, we can convert a character string to the Date class. If you want to change how the date is displayed, the format() function will help.

"2018-07-16" %>% class()

## [1] "character"

as.Date("2018-07-16")

## [1] "2018-07-16"

as.Date("2018-07-16") %>% class()

## [1] "Date"

as.Date("2018-07-16 12:34:56") # only returns date info

## [1] "2018-07-16"

format(as.Date("2018-07-16"),
       "%d / %m / %Y") # "%Y/%m/%d"

## [1] "16 / 07 / 2018"

The details of the formats are platform-specific, but the following are likely to be widely available:

%b
Abbreviated month name in the current locale (also matches the full name on input)
%B
Full month name in the current locale
%d
Day of the month as a decimal number (01-31)
%y
Year without century (00-99)
%Y
Year with century (0000-9999)

For the details, type ?strftime in the R console.
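The same format codes work in both directions: as.Date() uses them to read a string, and format() uses them to write one. Here is a minimal sketch with an arbitrary example date. It sticks to the numeric codes because %b and %B depend on the current locale.

```r
# a day/month/year string is not in the default ISO layout, so describe it with format
d <- as.Date("16/07/2018", format = "%d/%m/%Y")

iso_text   <- format(d)          # "2018-07-16"
year_long  <- format(d, "%Y")    # "2018"
year_short <- format(d, "%y")    # "18"
day_part   <- format(d, "%d")    # "16"
```

If the format string does not match the input, as.Date() returns NA rather than raising an error, so it is worth checking the result with is.na().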

3. lubridate

Before we start, install the lubridate package if you have not done so already. I also recommend installing tidyverse. From here on I assume that the packages we need are already installed. According to the lubridate vignette, the package was created by Garrett Grolemund and Hadley Wickham, and is now maintained by Vitalie Spinu.

Before I knew about this package, I could not handle times and dates well because they were tricky. This package has really saved me time.

3.1 Parsing dates and times

lubridate’s parse functions accept a wide variety of formats and separators, which simplifies the parsing process. As long as the order of the components (year, month, day) is correct, these functions will parse dates correctly even when the input vectors contain heterogeneous formats.

suppressMessages(library(lubridate))
lubridate::today() # == Sys.Date()

## [1] "2018-08-05"

ymd(20180717) == ymd(180717)

## [1] TRUE

ymd("180717")

## [1] "2018-07-17"

ymd("2018 ?! 07 -- 17") # heterogeneous format

## [1] "2018-07-17"

mdy("02292018") # NA: Feb 29th, 2018 does NOT exist on the calendar (quoted so the leading zero is kept)

## Warning: 1 failed to parse.

## [1] NA

mdy("08-14-2018")

## [1] "2018-08-14"

mdy("0814-2018")

## [1] "2018-08-14"

mdy("August 14th (2018)")

## [1] "2018-08-14"

dmy(21122018)

## [1] "2018-12-21"

dmy("21/12/2018")

## [1] "2018-12-21"

dmy("18 - July - 2018")

## [1] "2018-07-18"
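The claim about heterogeneous input can be checked on a single vector. The sketch below assumes lubridate is installed; the vector mixed and the other names are made up for this illustration.

```r
suppressMessages(library(lubridate))

# the same date written with different separators, in one vector
mixed  <- c("2018-07-17", "2018/07/17", "20180717")
parsed <- ymd(mixed)

all_same <- all(parsed == as.Date("2018-07-17"))   # TRUE

# an impossible date becomes NA; quiet = TRUE suppresses the parse warning
bad <- mdy("02-30-2018", quiet = TRUE)
is_missing <- is.na(bad)                            # TRUE
```

Checking for NA after parsing is a cheap way to catch malformed or impossible dates in real data.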

3.2 Setting and Extracting information

Functions such as second(), minute(), hour(), day(), wday(), yday(), month() and year() are intuitive and easy to use.

lb1 <- ymd_hms("2018/07/17 12:34:56")
lubridate::year(lb1)

## [1] 2018

lubridate::month(lb1)

## [1] 7

lubridate::month(lb1,
                 label = TRUE) # as an ordered factor of character strings

## [1] Jul
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec

lubridate::day(lb1) # == lubridate::mday(lb1)

## [1] 17

lubridate::wday(lb1) # Sun = 1, ..., Sat = 7

## [1] 3

lubridate::wday(lb1, 
                label = TRUE, 
                abbr = FALSE) # abbr = TRUE -> Tue

## [1] Tuesday
## 7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday

lubridate::yday(lb1) # day of the year, 1 to 366

## [1] 198

lubridate::hour(lb1)

## [1] 12

lubridate::minute(lb1)

## [1] 34

lubridate::second(lb1)

## [1] 56

Extracting information from date-time is not that difficult. What about modifying date-time?

lb2 <- mdy_hms("0229 2016 12:34:56")

lubridate::hour(lb2) <- 18 ; lb2 # change the hour from 12 to 18

## [1] "2016-02-29 18:34:56 UTC"

lubridate::year(lb2) <- 2004 ;lb2

## [1] "2004-02-29 18:34:56 UTC"

lb2 <- update(lb2, 
              year = 2016,
              hour = 12,
              minutes = 30,
              seconds = 30) ; lb2

## [1] "2016-02-29 12:30:30 UTC"

3.3 Time zones

One reason why handling time is tricky is that people live in different areas with different time zones. For example, if you live in Japan, your time zone is JST. If you live in New York, United States, your clock follows EDT in summer and EST in winter. You can easily look up a time zone, which tells you how far the clock in one area is offset from another.

length(OlsonNames()) # how many time zone names your system knows (varies by platform)

## [1] 592

head(OlsonNames())

## [1] "Africa/Abidjan"     "Africa/Accra"       "Africa/Addis_Ababa"
## [4] "Africa/Algiers"     "Africa/Asmara"      "Africa/Asmera"

Sys.timezone() # returns your time zone

## [1] "Asia/Seoul"

now("America/New_York") # current time in New York

## [1] "2018-08-05 05:05:37 EDT"

now("UTC") # Coordinated Universal Time(UTC)

## [1] "2018-08-05 09:05:37 UTC"

now("GMT") # Greenwich Mean Time(GMT)

## [1] "2018-08-05 09:05:37 GMT"

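current <- lubridate::now() # current time (named `current` so it does not shadow the now() function)

lubridate::force_tz(current, "GMT") # keeps the clock time and relabels the zone as GMT, so the instant changes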

## [1] "2018-08-05 18:05:37 GMT"
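Note that the force_tz() result above still reads 18:05:37, the Seoul clock time, only relabelled as GMT. lubridate also provides with_tz(), which keeps the instant and changes the displayed clock time. The sketch below contrasts the two on a fixed timestamp, assuming lubridate is installed; the variable names are placeholders.

```r
suppressMessages(library(lubridate))

t_seoul <- ymd_hms("2018-08-05 18:05:37", tz = "Asia/Seoul")

# with_tz(): same moment in time, shown on a UTC clock (09:05:37 UTC)
same_instant <- with_tz(t_seoul, "UTC")

# force_tz(): same clock reading, relabelled as UTC (18:05:37 UTC), a different moment
relabelled <- force_tz(t_seoul, "UTC")

instants_match <- same_instant == t_seoul                                    # TRUE
shift_hours    <- as.numeric(difftime(relabelled, t_seoul, units = "hours")) # 9
```

As a rule of thumb, reach for with_tz() when converting between zones and force_tz() only when the data was stamped with the wrong zone.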

3.4 Arithmetic with date times

Just like numbers, date and time objects support arithmetic, including addition (+), subtraction (-) and division (/). Understanding the three kinds of time spans below will also help you.

durations
represent an exact number of seconds. A duration always records the time span in seconds.

r_age <- today() - ymd(19930801)

as.duration(r_age)

## [1] "789264000s (~25.01 years)"

duration(31) + dminutes(3)

## [1] "211s (~3.52 minutes)"

dminutes(30)

## [1] "1800s (~30 minutes)"

ddays(0:4)

## [1] "0s"                "86400s (~1 days)"  "172800s (~2 days)"
## [4] "259200s (~3 days)" "345600s (~4 days)"

dyears(1)

## [1] "31536000s (~52.14 weeks)"

# Arithmetic
dyears(2) + dweeks(12) + dhours(13)

## [1] "70376400s (~2.23 years)"

today() + ddays(1) # == tomorrow

## [1] "2018-08-06"

now() + ddays(1)

## [1] "2018-08-06 18:05:37 KST"

periods
represent human units like weeks and months. Periods are time spans but don’t have a fixed length in seconds.

seconds(31)

## [1] "31S"

minutes(10)

## [1] "10M 0S"

hours(c(9, 12, 18))

## [1] "9H 0M 0S"  "12H 0M 0S" "18H 0M 0S"

days(c(1, 15, 31))

## [1] "1d 0H 0M 0S"  "15d 0H 0M 0S" "31d 0H 0M 0S"

months(seq(1, 12, by = 3))

## [1] "1m 0d 0H 0M 0S"  "4m 0d 0H 0M 0S"  "7m 0d 0H 0M 0S"  "10m 0d 0H 0M 0S"

weeks(2) # 7 days * 2

## [1] "14d 0H 0M 0S"

# Arithmetic

months(6:8) + days(c(1, 15, 31)) + hours(c(9, 12, 18)) + seconds(31) + minutes(10)

## [1] "6m 1d 9H 10M 31S"   "7m 15d 12H 10M 31S" "8m 31d 18H 10M 31S"

# for little fun
jan31 <- ymd(20180131)
jan31 + months(0:11) # which months have 31 days?

##  [1] "2018-01-31" NA           "2018-03-31" NA           "2018-05-31"
##  [6] NA           "2018-07-31" "2018-08-31" NA           "2018-10-31"
## [11] NA           "2018-12-31"

feb29 <- ymd(19920229)
feb29 + years(0:30) # Oops! my birthday has come around only 6 times so far!

##  [1] "1992-02-29" NA           NA           NA           "1996-02-29"
##  [6] NA           NA           NA           "2000-02-29" NA          
## [11] NA           NA           "2004-02-29" NA           NA          
## [16] NA           "2008-02-29" NA           NA           NA          
## [21] "2012-02-29" NA           NA           NA           "2016-02-29"
## [26] NA           NA           NA           "2020-02-29" NA          
## [31] NA

feb29 %m+% years(0:10) # rolls back to the last day of February when Feb 29 does not exist

##  [1] "1992-02-29" "1993-02-28" "1994-02-28" "1995-02-28" "1996-02-29"
##  [6] "1997-02-28" "1998-02-28" "1999-02-28" "2000-02-29" "2001-02-28"
## [11] "2002-02-28"

jan31 %m-% months(0:12) # returns the last day of each month over the past year

##  [1] "2018-01-31" "2017-12-31" "2017-11-30" "2017-10-31" "2017-09-30"
##  [6] "2017-08-31" "2017-07-31" "2017-06-30" "2017-05-31" "2017-04-30"
## [11] "2017-03-31" "2017-02-28" "2017-01-31"
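The difference between durations and periods shows up most clearly around a daylight saving change. The sketch below uses the US spring-forward date of 11 March 2018 as an example and assumes lubridate and the America/New_York time zone data are available; the variable names are placeholders.

```r
suppressMessages(library(lubridate))

# noon on the day before US clocks moved forward (11 March 2018)
start <- ymd_hms("2018-03-10 12:00:00", tz = "America/New_York")

plus_duration <- start + ddays(1)   # exactly 86400 seconds later
plus_period   <- start + days(1)    # same clock time on the next calendar day

duration_hour <- hour(plus_duration)   # 13, because one hour of clock time was skipped
period_hour   <- hour(plus_period)     # 12
```

Use a duration when you care about elapsed time, and a period when you care about what the calendar and clock will read.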

intervals
represent a time span with a specific start point and end point. Because both points are fixed, you can determine exactly how long the interval is.

lubridate::interval(start = ymd(20180725),
                    end = ymd(20180731)) # == ymd(20180725) %--% ymd(20180731)

## [1] 2018-07-25 UTC--2018-07-31 UTC

years(1) / days(1)

## estimate only: convert to intervals for accuracy

## [1] 365.25

(today() %--% (today() + years(1))) / ddays(1)

## [1] 365
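Besides measuring length, intervals can answer membership and overlap questions. Here is a short sketch using the same July 2018 dates as above, assuming lubridate is installed; the second interval and all variable names are invented for the example.

```r
suppressMessages(library(lubridate))

iv <- interval(ymd("2018-07-25"), ymd("2018-07-31"))

length_days <- iv / ddays(1)                      # 6
inside      <- ymd("2018-07-28") %within% iv      # TRUE
outside     <- ymd("2018-08-01") %within% iv      # FALSE

# int_overlaps() checks whether two intervals share any time
other <- interval(ymd("2018-07-30"), ymd("2018-08-02"))
overlapping <- int_overlaps(iv, other)            # TRUE
```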

An R example with the lakers dataset using lubridate

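library(dplyr) # provides the %>% pipe and the verbs used below
data("lakers") # example dataset shipped with lubridate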

lakers %>% head()

##       date opponent game_type  time period     etype team
## 1 20081028      POR      home 12:00      1 jump ball  OFF
## 2 20081028      POR      home 11:39      1      shot  LAL
## 3 20081028      POR      home 11:37      1   rebound  LAL
## 4 20081028      POR      home 11:25      1      shot  LAL
## 5 20081028      POR      home 11:23      1   rebound  LAL
## 6 20081028      POR      home 11:22      1      shot  LAL
##                player result points  type  x  y
## 1                                 0       NA NA
## 2           Pau Gasol missed      0  hook 23 13
## 3 Vladimir Radmanovic             0   off NA NA
## 4        Derek Fisher missed      0 layup 25  6
## 5           Pau Gasol             0   off NA NA
## 6           Pau Gasol   made      2  hook 25 10

lakers_df <- lakers %>% 
  dplyr::mutate(date = paste(date, time) %>% ymd_hm) %>% 
  dplyr::rename(time_index = date) %>% 
  dplyr::select(-time) 

lakers_df %>% head()

##            time_index opponent game_type period     etype team
## 1 2008-10-28 12:00:00      POR      home      1 jump ball  OFF
## 2 2008-10-28 11:39:00      POR      home      1      shot  LAL
## 3 2008-10-28 11:37:00      POR      home      1   rebound  LAL
## 4 2008-10-28 11:25:00      POR      home      1      shot  LAL
## 5 2008-10-28 11:23:00      POR      home      1   rebound  LAL
## 6 2008-10-28 11:22:00      POR      home      1      shot  LAL
##                player result points  type  x  y
## 1                                 0       NA NA
## 2           Pau Gasol missed      0  hook 23 13
## 3 Vladimir Radmanovic             0   off NA NA
## 4        Derek Fisher missed      0 layup 25  6
## 5           Pau Gasol             0   off NA NA
## 6           Pau Gasol   made      2  hook 25 10

d1 <- lakers_df %>% 
  dplyr::filter(time_index >= ymd("20081028", tz = "UTC"), # tz = "UTC" makes the bounds POSIXct, like time_index
                time_index <= ymd("20081029", tz = "UTC"))

inter <- ymd(20081028) %--% ymd(20081029) # ==  interval(ymd(20081028), ymd(20081029))

d2 <- lakers_df %>% 
  dplyr::filter(time_index %within% inter)

all.equal(d1, d2) # d1 == d2 ?

## [1] TRUE