Introduction

The purpose of these notes is to facilitate the transition from SAS to R at health departments.

These notes outline the structure and content of a series of R training sessions held by the Department of Epidemiology at the Marion County Public Health Department in early 2017. The series consisted of three 90-minute sessions in which we learned how the most common data management and analytic (DMA) tasks that we routinely perform in SAS can be performed in R.

The training sessions were part of an effort to assess whether our department should start preparing to transition all of its SAS-based DMA activities to R in the long term. Three of our highly proficient SAS programmers, with little to no recent experience in R, volunteered to participate in the training sessions with me as the instructor.

Our approach had the following important elements:

  1. Joint planning sessions in which we discussed and selected topics to cover in the next training session.
  2. Learning with datasets that we regularly use for our work.

Our Windows laptops were synchronized so as to make sure we were all using the exact same version of R, RStudio, and Rtools. This book assumes the reader will be using Windows machines as well.

Getting set up

Download or update R, RStudio, and Rtools

You will need to install R, RStudio, and Rtools. Below are the links to the download pages for each. You should download or update to the latest versions of all these. If you haven’t updated to the latest versions of these in 2017, you might not be able to follow the steps in this tutorial.

Installing R and RStudio is straightforward. See the note below about an important step that must be taken when installing Rtools.

Download R: https://cran.r-project.org/bin/windows/base/ Note: Use the top link on that page.

Download RStudio: https://www.rstudio.com/products/rstudio/download/ Note: Go to the middle of the page where it says Installers for Supported Platforms.

Download Rtools: https://cran.r-project.org/bin/windows/Rtools/ Note: Click the RtoolsXX.exe in the topmost table row for which the column Frozen says Yes. Important: During the installation process for Rtools, there will be a menu step with the following prompt…

“Select the additional tasks you would like Setup to perform while installing R tools, then click Next.

Edit the system Path.”

There will be a checkbox next to the Current value of PATH. Out of the two checkboxes in this menu, it is the unchecked one. You must check it and click Next. The rest is straightforward.

Install required packages

You will also need to install a set of packages that will be used in this tutorial. This step is done in RStudio. Simply copy-paste the line of R code below into the RStudio console.

install.packages(c("rio", "dplyr", "magrittr", "highcharter", "sqldf", "RODBC", "lubridate", "tidyr" ))

RStudio

We will be using the RStudio Desktop IDE throughout the entire tutorial. Whenever I refer to doing something in “R” or “RStudio”, I really mean doing it within the RStudio application (sorry about not keeping it standardized - it was an afterthought). In this tutorial, R refers to the programming language, whereas RStudio refers to the RStudio environment you carry out your R programming in. You should not open up the R application for any reason during this tutorial, just RStudio.

After having installed RStudio, you should have a shortcut available somewhere. Anytime you click on the shortcut (and if you didn’t leave any files open in RStudio the last time you used it), you will see three windows within RStudio. The left window is the Console. It is where you can type R commands and press Enter to execute them. Before RStudio, there was really only the Console. RStudio brought with it some really neat features. For instance, you can click the big window-shaped button on the top-right corner of the Console.

Figure: RStudio IDE - at startup

You’ll then see a blank window labeled Untitled1, which is a temporary area where you can also write your commands. This area is neat because you write your commands, execute them with the shortcut keys given below, and then save all the commands you wrote as a .R file (e.g. FirstProgram.R) so you can refer to it later. This is the easiest way to code in R and RStudio makes that happen.

Figure: RStudio IDE - the four main windows

The next most important window is labeled Environment and it is on the top-right. Any R object that you create gets listed here. If the object you created is a dataset, it will have a right-arrow next to it and this means that you can click on the object’s row in the Environment to open the Viewer. Later in this tutorial, you’ll create datasets and will be able to open them. In fact, I highly encourage you to view every dataset you create and manipulate so that you can get a better feel for what the R commands are doing. I won’t be telling you to look at any datasets, I’ll assume that you will do this yourself.

Figure: RStudio IDE - viewing a dataset

Shortcut Keys to know:

Run current line/selection of code: Ctrl+Enter

Clear console: Ctrl+L

R

R vs. SAS

Some of the first things you should know about R include the following.

There are no end-of-line markers. In SAS, you have to mark the end of almost every statement with a ; character. R does not require any such character: a statement ends at the end of the line once it is syntactically complete. (R does accept a ; to separate several statements written on one line, but you will rarely need it.)

Code comments. Single-line comments in R are created with the # character. Simply precede the text or code you want the R interpreter to ignore with the # character. Unlike SAS, there are no multi-line comment markers in R (no analogue to SAS’s /* */). If you want to comment out multiple lines in R, each line needs its own #. In RStudio you can select the lines and press Ctrl+Shift+C to comment or uncomment them all at once.

Watch the slash direction. Windows machines use backslashes \ almost exclusively when working with paths. Whenever you copy the path to any file or folder, the path will contain backslashes. For instance C:\Harold\Desktop\file.txt. This can cause confusion because R treats a backslash inside a quoted string as the start of an escape sequence, so a path pasted as-is will either trigger an error or point to the wrong place. The simplest fix is to (manually) replace \ with / in path strings so that the previous example is referenced in RStudio as C:/Harold/Desktop/file.txt. Doubling each backslash (C:\\Harold\\Desktop\\file.txt) also works.

Working directory. The working directory is the directory which R is set to input/output files from/to. You can figure out what your working directory is by entering the command getwd() in the console. We’ll learn to use working directories more later on in the book.
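The path and working-directory points above can be tried out with a minimal sketch like the one below. It uses only base R. The file name example.txt and the use of a temporary folder are assumptions made for illustration and are not part of the original sessions.

```r
# A Windows path typed as an R string: each backslash must be doubled,
# because a single backslash starts an escape sequence inside quotes.
win_path <- "C:\\Harold\\Desktop\\file.txt"

# Convert backslashes to forward slashes. With fixed = TRUE the pattern
# "\\" is matched as one literal backslash, not as a regular expression.
r_path <- gsub("\\", "/", win_path, fixed = TRUE)
print(r_path)

# file.path() builds a path with forward slashes from its pieces.
same_path <- file.path("C:", "Harold", "Desktop", "file.txt")

# Working directory: remember the current one, switch to a temporary
# folder (a stand-in for your own project folder), then switch back.
old_wd <- getwd()
setwd(tempdir())
writeLines("hello", "example.txt")   # relative path: lands in the working directory
found <- file.exists(file.path(tempdir(), "example.txt"))
setwd(old_wd)
```

Because example.txt was written with a relative path, it ends up in whatever folder was the working directory at that moment, which is why checking getwd() before reading or writing files is a good habit.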

Packages

What makes R so versatile is the availability of downloadable add-on features, known as packages, nearly all of which are available for free on the Internet through reputable, trustworthy, and curated sites like CRAN.

In more detail, an R package is a set of R functions, each of which is designed to carry out a specific task. We’ll discuss functions more a little later.

You can install an R package using the command install.packages("PACKAGE") where PACKAGE is the name of the package that you want to install.

It is important to note that the functions of the R packages you’ve downloaded are not accessible to you until you load the package. To load a particular package, use the command library("PACKAGE"), where as before PACKAGE is the name of the package you want to load from the package library. This is a point of major confusion for beginners, so as an example let’s pretend you want to use the import() function from the rio package to import a datafile. You may think that because you’ve already installed this package, you’re golden. WRONG. Before you can use any function from any package, you must load the package using the command mentioned earlier. To summarize, you only need to install a package once per machine (strictly, once per R installation, so you may need to reinstall after a major R upgrade). To use the functions in the package, you need to load the package every time you open RStudio (this can be multiple times a day).
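The install-once, load-every-session distinction can be sketched as a small helper. The function name load_or_install is hypothetical (it is not part of R or of these notes), and the stats package is used only because it ships with R, so nothing is downloaded when you run this.

```r
# Hypothetical helper: install a package only if it is missing,
# then load it for the current session.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)               # one-time step
  }
  library(pkg, character.only = TRUE)   # needed in every new R session
}

load_or_install("stats")

# A single function can also be called without loading its package
# by writing package::function.
m <- stats::median(c(1, 5, 9))
```

The package::function form is handy when you need one function once, but for the packages used throughout this tutorial a library() call at the top of your script is simpler.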

Something useful to know is that many developers nowadays use the site GitHub to work on creating packages. When their package is “good enough”, they usually submit it to CRAN, where it goes through a review and approval process that determines whether CRAN will host the package on its servers. Being hosted on CRAN makes a package more “official” and trustworthy because the truth is that (although most people don’t do this) someone could create a package with harmful, virus-carrying code and host it on GitHub for people to download. This is one of the reasons IT departments are cautious about their organization’s employees using R and downloading packages from random sources on the Internet.

Essential packages list

Many DMA tasks are extremely common across health departments (import/export, redefining variables, restructuring datasets, etc.). Given that there are so many ways to accomplish the same task in R, it is helpful to have a list of what packages and functions are the most useful to accomplish these common tasks.

The table below lists the most useful packages and functions. You can easily accomplish 95% of everything you do in SAS with the functions below. Throughout the book, we will cover how to use these functions (i.e. function arguments, limitations, and some things to keep in mind as you use them). You will find that they are quite simple ways to accomplish even complicated tasks.

[Currently the packages are listed here: https://sites.google.com/site/rapplicationforbiosurveillance/home/essential-packages]

This list should save you a ton of time scouring the Internet for the R functions you’ll need for a particular and common task.

Session 1 - a practical warm up

First click on the links below to download two SAS datasets we’ll use in this tutorial. After downloading them I moved them to my Desktop because it was convenient. You’ll obviously have to edit any referenced paths in this tutorial to fit where you stored the dataset files (e.g. your paths won’t be “S:/EPI_Data/FINAL/”).

Download bca - a SAS dataset

Download dca1 - a SAS dataset

Let’s import our bca dataset. For this we use the import() function from the rio package. Note, this package makes importing, exporting, and converting between file types really easy. We’ll use it for these sorts of tasks.

library(rio)
bca <- import("C:/Users/haroldicus/Desktop/bca.sas7bdat")

Looks like it imported correctly. Now let’s create a dataset with only a few fields of interest that we’ll work with in this session.

For manipulating datasets, the dplyr package is one of the most useful packages you’ll use.

library(dplyr)
sbca <- select(bca, LOCAL_FILE_NUMBER, Event_Year, WTGRAMS, EST_GEST, MDPSMOKE)

Let’s check out some metadata for the dataset. We’ll use the cryptic-looking, but actually simple, str() function, which is short for structure.

str(sbca)
## 'data.frame':    3000 obs. of  5 variables:
##  $ LOCAL_FILE_NUMBER: atomic  908 2656 964 2545 1824 ...
##   ..- attr(*, "format.sas")= chr "BEST"
##  $ Event_Year       : atomic  1993 1974 1979 1991 1973 ...
##   ..- attr(*, "format.sas")= chr "BEST"
##  $ WTGRAMS          : atomic  3333 2413 3957 600 2849 ...
##   ..- attr(*, "format.sas")= chr "BEST"
##  $ EST_GEST         : atomic  0 0 0 0 0 1 1 1 1 1 ...
##   ..- attr(*, "format.sas")= chr "BEST"
##  $ MDPSMOKE         : atomic  1 0 1 1 0 0 1 1 1 0 ...
##   ..- attr(*, "format.sas")= chr "BEST"
##  - attr(*, "label")= chr "BCA"

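Looks like when we imported the dataset, some field data types were set in a way that we don’t want. Every field is listed as atomic and carries a leftover format.sas attribute instead of having a plain R type. For instance, WTGRAMS, which means “weight in grams”, should obviously be of a plain double type, and LOCAL_FILE_NUMBER is an identifier, so it is better stored as chr (short for character). We’ll modify the field data types using dplyr’s mutate() function.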

sbca <- mutate(sbca,  LOCAL_FILE_NUMBER = as.character(LOCAL_FILE_NUMBER), 
                      Event_Year = as.integer(Event_Year), 
                      WTGRAMS = as.double(WTGRAMS), 
                      EST_GEST = as.integer(EST_GEST),
                      MDPSMOKE = as.integer(MDPSMOKE)
)

Let’s check out the structure of the dataset again.

str(sbca)
## 'data.frame':    3000 obs. of  5 variables:
##  $ LOCAL_FILE_NUMBER: chr  "908" "2656" "964" "2545" ...
##  $ Event_Year       : int  1993 1974 1979 1991 1973 1976 1992 1970 1985 1973 ...
##  $ WTGRAMS          : num  3333 2413 3957 600 2849 ...
##  $ EST_GEST         : int  0 0 0 0 0 1 1 1 1 1 ...
##  $ MDPSMOKE         : int  1 0 1 1 0 0 1 1 1 0 ...

Better!
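To see why the conversion functions tidy up the str() output, here is a minimal base R sketch. The vector wt is made up to mimic an imported SAS numeric column; it is an assumption for illustration and not taken from the bca dataset.

```r
# Toy vector that mimics an imported SAS numeric column.
wt <- c(3333, 2413, 3957)
attr(wt, "format.sas") <- "BEST"

typeof(wt)       # the underlying storage type
attributes(wt)   # the leftover SAS format label

# as.double() returns the same values with the attributes stripped.
wt_clean <- as.double(wt)
attributes(wt_clean)
```

In this toy case the values were numeric all along, and the conversion mainly removes the extra attribute.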

Let’s start with creating a new field. Our objective right now will be to make a plot of the count of low birthweight infants born in Marion County, IN by year. We’ll need to create an indicator variable to do this.

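sbca <- mutate(sbca, LOW_WEIGHT = WTGRAMS < 2500) # Logical indicator: TRUE/FALSE (NA if the weight is missing)
sbca <- mutate(sbca, LOW_WEIGHT = as.integer(WTGRAMS < 2500)) # Same indicator as 1/0; this overwrites the line above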

It would be neat if for every year and every category for LOW_WEIGHT, we could see the corresponding counts. We’ll need to do some aggregation for this.

sbcag <- group_by(sbca, LOW_WEIGHT, Event_Year)
lw_y <- summarise(sbcag, count = n())
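If you want to check what the aggregation does before running it on real data, a toy version is easy to inspect by eye. The data frame toy below is made up for illustration. The count() call is a dplyr shortcut that I am adding here as an alternative; it was not part of the original session.

```r
library(dplyr)

# Made-up toy data: five births across two years, one with a missing weight.
toy <- data.frame(
  Event_Year = c(2000, 2000, 2000, 2001, 2001),
  WTGRAMS    = c(2400, 3100, NA, 2300, 2200)
)
toy <- mutate(toy, LOW_WEIGHT = as.integer(WTGRAMS < 2500))

# Same pattern as above: group, then count the rows in each group.
toy_grouped <- group_by(toy, LOW_WEIGHT, Event_Year)
toy_counts  <- summarise(toy_grouped, count = n())

# count() does the grouping and counting in one step.
toy_counts2 <- count(toy, LOW_WEIGHT, Event_Year, name = "count")

print(toy_counts2)
```

Note that the birth with a missing weight gets its own group with LOW_WEIGHT equal to NA, which is why a "missing" category shows up later in the plots.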

Now let’s make a plot for the trends we wanted to see!

library(highcharter)

hchart(lw_y, "line", hcaes(x = Event_Year, y = count, group = LOW_WEIGHT)) 

Looks nice! But something is up with the way our stratified field LOW_WEIGHT is being coded. It’s because the group parameter in the code above is expecting a field of type character. Let’s correct that and improve our plot.

lw_y <- ungroup(lw_y) # Need to ungroup to account for some under-the-hood detail

lw_y <- mutate(lw_y,  LOW_WEIGHT = as.character(LOW_WEIGHT), # First change data type of LOW_WEIGHT to character
                      LOW_WEIGHT = replace(LOW_WEIGHT, LOW_WEIGHT == 0, "normal weight"),
                      LOW_WEIGHT = replace(LOW_WEIGHT, LOW_WEIGHT == 1, "low weight"),
                      LOW_WEIGHT = replace(LOW_WEIGHT, is.na(LOW_WEIGHT), "missing")
)

hchart(lw_y, "line", hcaes(x = Event_Year, y = count, group = LOW_WEIGHT))

Better. Now let’s improve on the graph.

hchart(lw_y, "line", hcaes(x = Event_Year, y = count, group = LOW_WEIGHT)) %>%
  hc_title(text = "Infant Birth Weights") %>%
  hc_tooltip(table = TRUE, sort = TRUE)

We just used the magical pipe operator %>%. The pipe operator lets you feed the result of an operation to the next operation following the pipe operator. You can use pipe operators in sequence and have as many steps as you want be linked by pipe operators. A simple example of this is below.

library(magrittr) # Note that for many scenarios, use of the pipe operator requires loading this package first

# Let's take this computation below as an example
sqrtOfMean <- sqrt(mean(c(1,2,3,4,5)))

# This is equivalent to the above
sqrtOfMean <- c(1,2,3,4,5) %>% mean() %>% sqrt()

# This is also equivalent and good syntax
sqrtOfMean <- c(1,2,3,4,5) %>% 
  mean() %>% 
  sqrt()
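As an aside that was not covered in the sessions: newer versions of R (4.1.0 and later) also include a built-in pipe operator written |>. A minimal sketch, assuming your R version is recent enough, shows that it gives the same result for this example without loading any package.

```r
# Base R's native pipe (R 4.1.0 or later; no package needed).
sqrt_of_mean_native <- c(1, 2, 3, 4, 5) |>
  mean() |>
  sqrt()

# The nested call from the example above, for comparison.
sqrt_of_mean_nested <- sqrt(mean(c(1, 2, 3, 4, 5)))

isTRUE(all.equal(sqrt_of_mean_native, sqrt_of_mean_nested))
```

The rest of this tutorial sticks with %>% because that is what the dplyr and highcharter examples use.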

The code below is bad syntax for pipe operators. It doesn’t work because R treats a statement as finished at the end of a line whenever that line is syntactically complete. R doesn’t use end-of-line markers, so it thinks the computation is complete at the end of the first line. Then it moves to the next line and, since it’s not in the middle of a computation, it doesn’t know what to do with a line that starts with a pipe operator.

sqrtOfMean <- c(1,2,3,4,5) 
%>% mean() 
%>% sqrt()

Let’s use the pipe operator as an alternative way of doing earlier computations.

sbca <- bca %>% # The dataset
  select(LOCAL_FILE_NUMBER, Event_Year, WTGRAMS, EST_GEST, MDPSMOKE) %>% # Keep these fields only
  mutate(LOCAL_FILE_NUMBER = as.character(LOCAL_FILE_NUMBER), # Change data types
         Event_Year = as.integer(Event_Year), 
         WTGRAMS = as.double(WTGRAMS), 
         EST_GEST = as.integer(EST_GEST),
         MDPSMOKE = as.integer(MDPSMOKE)
  )

Homework

Now let’s make a similar plot using the field EST_GEST (which represents estimated gestation time in weeks). Instead of looking at low birthweight trends by year, we’ll look at premature birth trends by year. A premature birth is indicated by an estimated gestation period less than 37 weeks. Try to take the earlier code to make a similar plot for this field (premature, normal, and missing counts by year).

Don’t look or work through the code below (which is the solution) until you try to do it by yourself. If you get stuck, look at the code below. If you still don’t get it, ask me. I’m not sure I’d recommend Google searching for us right now.

sbca <- mutate(sbca, PREMATURE = as.integer(EST_GEST < 37))

sbcag <- group_by(sbca, Event_Year, PREMATURE)
premature_y <- summarise(sbcag, count = n())

premature_y <- ungroup(premature_y)

premature_y <- mutate(premature_y,  PREMATURE = as.character(PREMATURE),
                                    PREMATURE = replace(PREMATURE, PREMATURE == 0, "normal"),
                                    PREMATURE = replace(PREMATURE, PREMATURE == 1, "premature"),
                                    PREMATURE = replace(PREMATURE, is.na(PREMATURE), "missing")
)

hchart(premature_y, "line", hcaes(x = Event_Year, y = count, group = PREMATURE)) %>%
  hc_title(text = "Infants Born Premature") %>%
  hc_tooltip(table = TRUE, sort = TRUE)

Now let’s make the same plot again for the field MDPSMOKE. This field indicates whether the mother smoked or not during pregnancy (1-yes, 0-no). Note that it is already coded as an indicator variable (does that save you a step?).

Try to make the same trend plot for this field on your own without looking at the code below at all!

sbcag <- group_by(sbca, Event_Year, MDPSMOKE)
MDPSMOKE_y <- summarise(sbcag, count = n())

MDPSMOKE_y <- ungroup(MDPSMOKE_y)

MDPSMOKE_y <- mutate(MDPSMOKE_y,  MDPSMOKE = as.character(MDPSMOKE),
                                  MDPSMOKE = replace(MDPSMOKE, MDPSMOKE == 0, "did not smoke"),
                                  MDPSMOKE = replace(MDPSMOKE, MDPSMOKE == 1, "smoked"),
                                  MDPSMOKE = replace(MDPSMOKE, is.na(MDPSMOKE), "missing")
)

hchart(MDPSMOKE_y, "line", hcaes(x = Event_Year, y = count, group = MDPSMOKE)) %>%
  hc_title(text = "Infants Whose Mothers Smoked During Pregnancy") %>%
  hc_tooltip(table = TRUE, sort = TRUE)

Session 2 - more versatility

HW review

Let’s go over the HW. Any problems encountered?

rio and dplyr overview

Today we’re going to review some of the packages from the last session and review and expand on the functions we know about. After the review, we’ll cover topics largely falling under the themes of dataset merging, SQL querying, and connecting to remote databases.

rio

Let’s start by reviewing the rio package.

We already learned about the import() function.

bca <- import("C:/Users/haroldicus/Desktop/bca.sas7bdat")
dca1 <- import("C:/Users/haroldicus/Desktop/dca1.sas7bdat")

You can also use the rio package to convert datasets of one file type to another. For instance, below I convert the dataset dca1 from a SAS file to a CSV file using the convert() function. This function infers the file types from the file type extensions in the file paths.

convert("C:/Users/haroldicus/Desktop/dca1.sas7bdat", "C:/Users/haroldicus/Desktop/dca1.csv")

You can use rio’s export() function to export any dataset that you’ve already imported as a file of a specified file type. The export() function infers the file type you desire to create from the file’s file type extension.

export(dca1, "C:/Users/haroldicus/Desktop/dca1.xlsx")
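If you don’t have the SAS files handy, the export/import round trip can be tried with a small made-up data frame. This is a sketch under assumptions: the data frame toy and the temporary file location are mine, chosen so the example runs anywhere.

```r
library(rio)

# Made-up data frame standing in for a real dataset.
toy <- data.frame(id = 1:3, sex = c("M", "F", "M"))

# Write it out as CSV; export() picks the format from the ".csv" extension.
csv_path <- file.path(tempdir(), "toy.csv")
export(toy, csv_path)

# Read it back; import() also picks the format from the extension.
toy_back <- import(csv_path)
str(toy_back)
```

Changing the extension in csv_path (for example to .xlsx) is all it takes to write a different file type, which is the main convenience of rio.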

Conclusion: rio is a great package and can be used to import and export files, as well as convert between file types.

Remember to check out the rio package vignette to see really useful documentation: https://cran.r-project.org/web/packages/rio/vignettes/rio.html

dplyr

Last week we learned about some of dplyr’s most useful functions:

  • select(): select certain columns (fields/variables) of your dataset
  • filter(): select specific rows (observations) of your dataset
  • mutate(): add new columns or change existing ones
  • group_by() and summarise(): aggregate your data according to specified groupings

See last week’s notes for examples using these functions.

Some other dplyr functions we have not mentioned so far:

  • arrange(): sort rows by the specified columns in ascending (default) or descending order
  • rename(): change column names for variables
  • distinct(): get unique values of specified variable set

Below are examples using these functions.

library(dplyr)

# First select a few fields to work with
sdca1 <- select(dca1, YearOfDeath, Sex_DC)

# Sort by YearOfDeath in ascending order and Sex_DC in descending order
sdca1 <- arrange(sdca1, YearOfDeath, desc(Sex_DC)) # Note the syntax to specify descending order

# Rename the field Sex_DC to Sex
sdca1 <- rename(sdca1, Sex = Sex_DC)

# Select fields you want to get unique values for
sdca1 <- distinct(sdca1, YearOfDeath, Sex) # Consider unique YearOfDeath and Sex combination values

head(sdca1) # Inspect output
##   YearOfDeath Sex
## 1        1960   M
## 2        1961   M
## 3        1961   F
## 4        1962   M
## 5        1962   F
## 6        1963   M
sdca1 <- distinct(sdca1, YearOfDeath) # Consider unique YearOfDeath values only

head(sdca1) # Inspect output
##   YearOfDeath
## 1        1960
## 2        1961
## 3        1962
## 4        1963
## 5        1964
## 6        1965

Now we’re going to learn how to use dplyr to merge datasets. As an example, let’s merge our infant births (bca) and deaths (dca1) datasets together. The following code provides two different ways of accomplishing this, depending on whether the field(s) comprising the key have the same name or not. In our case, the field that we want to link by initially has a different name in each dataset (LOCAL_FILE_NUMBER in bca and BC_DataIDnumber in dca1).

# If the key fields have different names in each dataset
bda <- left_join(bca, dca1, by = c("LOCAL_FILE_NUMBER" = "BC_DataIDnumber"))

# Alternative approach: Rename the key fields to be the same, then the join syntax is simpler.
bca <- rename(bca, Key = LOCAL_FILE_NUMBER)
dca1 <- rename(dca1, Key = BC_DataIDnumber)
bda <- left_join(bca, dca1, by = "Key") # Only need this line if the key fields have the same name initially.

Below are some of the dplyr join functions:

  • left_join(a, b, by = key)
  • right_join(a, b, by = key)
  • inner_join(a, b, by = key)
  • full_join(a, b, by = key)

Other types of joins and set operations (such as union, intersection, and set difference) are available through dplyr. More simple examples and a table comparing the dplyr functions to their associated SQL equivalent code are available here: https://cran.r-project.org/web/packages/dplyr/vignettes/two-table.html

Note that you can only use these joins and set operations with two tables at a time, but you can always pipe the result of one join into another.
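Piping one join into another can be sketched with three tiny made-up tables. The table names births, deaths, and moms and all of their values are hypothetical; only the Key column name follows the renaming used above.

```r
library(dplyr)

# Three made-up tables that share a key column.
births <- data.frame(Key = c(1, 2, 3), WTGRAMS = c(3333, 2413, 3957))
deaths <- data.frame(Key = c(2, 3), YearOfDeath = c(1999, 1966))
moms   <- data.frame(Key = c(1, 3), MDPSMOKE = c(0, 1))

# Each join takes two tables; the pipe feeds the result into the next join.
all3 <- births %>%
  left_join(deaths, by = "Key") %>%
  left_join(moms, by = "Key")

print(all3)
```

Because these are left joins, every row of births is kept, and fields from the other tables are NA where no matching Key exists.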

Using SQL code instead of R code

We’ve done quite a bit of SQL-like work with R’s dplyr package. What if we prefer to use SQL code to work with our local dataset? There’s a package for that - sqldf. This package is extremely simple to use. See the example below.

library(sqldf)

res <- sqldf("SELECT * FROM bda WHERE Event_Year >= 2016 AND Sex = 'F' ORDER BY BMI")

There’s not much to explain here. You simply have to be familiar with the SQL language.

This is how we would’ve done the same computation using R instead of SQL.

res <- bda %>% 
       filter(Event_Year >= 2016, Sex == "F") %>%
       arrange(BMI) # ORDER BY sorts in ascending order by default, so no desc() here
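To compare the two approaches side by side, here is a sketch on a made-up data frame. The table toy and its values are assumptions for illustration. An SQL ORDER BY without a direction sorts in ascending order, so the dplyr counterpart in this sketch is arrange(BMI).

```r
library(sqldf)
library(dplyr)

# Made-up data with the three fields used in the query.
toy <- data.frame(
  Event_Year = c(2015, 2016, 2016, 2017),
  Sex        = c("F", "F", "M", "F"),
  BMI        = c(22.5, 30.1, 27.0, 19.8)
)

# SQL version: sqldf() finds the data frame by its name.
res_sql <- sqldf("SELECT * FROM toy WHERE Event_Year >= 2016 AND Sex = 'F' ORDER BY BMI")

# dplyr version of the same filter and sort.
res_r <- toy %>%
  filter(Event_Year >= 2016, Sex == "F") %>%
  arrange(BMI)

print(res_sql)
print(res_r)
```

Both results keep the same two rows in the same order, so which style you use is mostly a matter of what your team reads more comfortably.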

Remote database connections

The code below resembles a SAS connection string used by MCPHD.

CONNECT TO ODBC AS &DB_Conn. (REQUIRED = "DATABASE=HealthKnowledgeDB;Server=sql123/inst456;DRIVER={SQL Server};Trusted_Connection=Yes");

The equivalent connection string in R is similar. We use the package RODBC for SQL Server connections.

library(RODBC)

# Connection string
mycon <- odbcDriverConnect('driver={SQL Server};server=sql123\\inst456;database=HealthKnowledgeDB;trusted_connection=yes') # Note: Sometimes, instead of the parameter 'database', you have to use 'dsn'. Example: dsn=HealthKnowledgeDB. 

# Optional: Now that you're connected to the database, you can get a list of all the SQL tables in the database.
sqlTables(mycon)

# Submit your SQL query and get the result
res <- sqlQuery(mycon, "SELECT * FROM EPICASE INNER JOIN TBEPDISEASE ON EPICASE.DISEASE_CODE=TBEPDISEASE.DISEASE_CODE")  

# When you're done working with the database, you can close the connection.
close(mycon)

Concatenation

To concatenate strings together, use the paste() function.

paste("Hello", "there", "friend")
## [1] "Hello there friend"

One major reason paste() is extremely useful is that it lets you reference the objects you’ve created. See the example below.

a <- "WHAT"
b <- "WHO"
chika <- "CHIKA CHIKA"
name <- "Slim Shady"
paste("Hi", "My name is", a, "My name is", b, "My name is", chika, name)
## [1] "Hi My name is WHAT My name is WHO My name is CHIKA CHIKA Slim Shady"
# Numeric objects can also be used in paste()
age <- 30
paste("And I am", age, "years old")
## [1] "And I am 30 years old"
# Or you can directly use the numeric value in the paste() function
paste("And I am", 35, "years old")
## [1] "And I am 35 years old"

The most important parameter to know about for paste() is sep, which is the character string that will separate the terms you use in paste(). If you are using paste(), it is important to know that the parameter sep is set to " " by default (yes, that is one space character in between quotes).

# Let's try to create the string "1-800-123-4567" using the paste function.
paste(1,"-", 800, "-", 123, "-", 4567)
## [1] "1 - 800 - 123 - 4567"

We ended up with an extra space in between the terms we used in paste(). Again, this is because sep is set to " " (one space character) by default. But in this case we want no spaces in between the numbers! To adapt to this case, we tell paste() to use no separator at all with sep = "" (note that the space character is now missing).

# Specify sep to have no characters.
paste(1,"-", 800, "-", 123, "-", 4567, sep = "")
## [1] "1-800-123-4567"

This is what we wanted. We can also specify any other characters we want in sep.

paste("1", "000", "000") # "1 000 000"
## [1] "1 000 000"
paste("1", "000", "000", sep = "") # "1000000"
## [1] "1000000"
paste("1", "000", "000", sep = ",") # "1,000,000"
## [1] "1,000,000"
paste("1", "000", "000", sep = ".") # "1.000.000"
## [1] "1.000.000"
paste("1", "000", "000", sep = "xyz") # "1xyz000xyz000"
## [1] "1xyz000xyz000"

Note: paste0(...) is shorthand for paste(..., sep = "") and is slightly faster than paste(). The difference isn’t significant unless you’re going to be concatenating A LOT in your program.

Further documentation for these functions, showing more interesting examples and features such as the collapse parameter, is available here: https://stat.ethz.ch/R-manual/R-devel/library/base/html/paste.html
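As a short sketch of paste0() and the collapse parameter mentioned above, the example below builds an SQL query string from a vector of IDs. The ID values and the query text are made up for illustration; only base R is used.

```r
# Made-up record IDs.
ids <- c(908, 2656, 964)

# paste0() works element by element on a vector.
labels <- paste0("ID_", ids)

# collapse joins all elements of one vector into a single string.
id_list <- paste(ids, collapse = ", ")

# Combine the pieces into one query string.
query <- paste0("SELECT * FROM bca WHERE LOCAL_FILE_NUMBER IN (", id_list, ")")
print(query)
```

A string built this way can be passed to functions that take SQL text, such as sqldf() or sqlQuery() from the earlier sections.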

Session 3 - other tools

Dates and datetimes

Let’s first start with a small example. Let’s say we had an object of character type that holds a list of dates. Whenever you’re working with dates in R, it’s a safe bet to assume that you should use the lubridate package (don’t laugh). The most useful functions in lubridate to start with are the ones that take dates stored as character type and convert them to Date type. These functions are:

  • ymd()
  • ydm()
  • mdy()
  • myd()
  • dmy()
  • dym()

The letters in these functions - y, m, and d - simply stand for year, month, and day, respectively. All you have to do is use the function which matches the order of the year, month, and day portions of the character type object that represents a date. This is way easier to understand with the examples below. We will apply these functions to different representations of the same date - 3/22/2017.

library(lubridate)
Date_1 <- ymd("2017Mar22")
Date_2 <- ydm("2017/22/03")
Date_3 <- mdy("March 22, 2017")
Date_4 <- myd("Mar2017-22")
Date_5 <- dmy("22-03-2017")
Date_6 <- dym("22201703")

# Check the structure of these objects to confirm they worked as expected
str(Date_1)
##  Date[1:1], format: "2017-03-22"
str(Date_2)
##  Date[1:1], format: "2017-03-22"
str(Date_3)
##  Date[1:1], format: "2017-03-22"
str(Date_4)
##  Date[1:1], format: "2017-03-22"
str(Date_5)
##  Date[1:1], format: "2017-03-22"
str(Date_6)
##  Date[1:1], format: "2017-03-22"

Note the data types are all Date and not chr. That was amazingly simple. The one case you have to watch out for is when the century is not specified. Let’s say you KNOW someone was born before the 20th century, for instance: 6/21/1890. If you don’t specify the century (and leave the date as 6/21/90), lubridate has to guess it. By default it follows the usual convention for two-digit years: 69 through 99 are read as 1969-1999 and 00 through 68 as 2000-2068. So 6/21/90 becomes 1990, not 1890 (and not 2090).

mdy("6/21/1890")
## [1] "1890-06-21"
mdy("6/21/90")
## [1] "1990-06-21"
mdy("3/22/2010")
## [1] "2010-03-22"
mdy("3/22/10")
## [1] "2010-03-22"
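When you know something about your data that lubridate cannot know, you can correct the guessed century afterwards. The sketch below is one possible approach under a stated assumption: all of these made-up records are known to be dated before 2000, so any parsed date after that is moved back 100 years. The cutoff date and the raw values are hypothetical.

```r
library(lubridate)

# Made-up dates with two-digit years.
raw    <- c("6/21/90", "3/22/10", "1/5/45")
parsed <- mdy(raw)   # years are guessed as 1990, 2010, and 2045

# Assumption for this example: no record can be later than 1999-12-31.
latest_possible <- ymd("1999-12-31")
too_late <- parsed > latest_possible

# Shift only the impossible dates back by one century.
fixed <- parsed
fixed[too_late] <- parsed[too_late] - years(100)

format(fixed)
```

The safer fix is always to store four-digit years in the source data; a correction like this should be checked against what you actually know about the records.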

Notice that R prints Date objects in the format yyyy-mm-dd. This is the most important standard unambiguous date format in R (the only other one is yyyy/mm/dd but it’s not as important to know since R doesn’t print it this way).

Why go through this Date type conversion? Once R recognizes an object as being of Date type, it will be much easier to work with. You can compare whether one date is less than another, etc.
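A minimal sketch of what the Date type buys you is below. The two dates are made up for illustration. The last line shows what happens if you compare the same kind of values while they are still character strings.

```r
library(lubridate)

# Made-up dates, parsed from two different text layouts.
birth <- mdy("3/22/2017")
death <- ymd("2017-11-03")

# Dates compare chronologically.
birth_first <- birth < death

# Subtracting two dates gives the number of days between them.
days_between <- as.numeric(death - birth)

# Character strings compare alphabetically, which gives the wrong answer here.
text_compare <- "3/22/2017" < "11/3/2017"

c(birth_first, text_compare)
days_between
```

The text comparison comes out FALSE because the character "3" sorts after "1", even though March comes before November.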

What if you want to extract the year, month, or day component from an object of Date type? For that you have the following functions:

  • year()
  • month()
  • day()

Date_1
## [1] "2017-03-22"
year(Date_1)
## [1] 2017
month(Date_1)
## [1] 3
day(Date_1)
## [1] 22

Let’s mess with some datetimes. [This section needs to be expanded. Skip to the next import step].

Notes on topics to cover here:

  • format(today, "%B %d %Y")
  • tz description
  • OlsonNames()
  • America/Indiana/Indianapolis
  • America/Indianapolis
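Until this section is expanded, the items noted above (format(), the tz argument, OlsonNames(), and the Indianapolis time zone name) could be explored with a sketch like the one below. The datetime value is made up, and this is only one possible way to tie those pieces together, not the content originally planned for the session.

```r
library(lubridate)

# A made-up datetime in local Indianapolis time.
dt <- ymd_hms("2017-03-22 14:30:00", tz = "America/Indiana/Indianapolis")

# format() controls how the datetime is printed:
# %B is the full month name (in your locale), %d the day, %Y the year.
format(dt, "%B %d %Y")

# tz() reports the time zone attached to the object.
tz(dt)

# with_tz() shows the same instant on another time zone's clock.
dt_utc <- with_tz(dt, "UTC")
dt_utc

# OlsonNames() lists the time zone names your system knows about.
head(OlsonNames())
```

Note that with_tz() does not change the moment in time, only the clock it is displayed on.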

Now let’s apply this new knowledge to one of our datasets.

Let’s import the dataset dca1 again and work with a date variable there.

library(rio)
dca1 <- import("C:/Users/haroldicus/Desktop/dca1.sas7bdat")

Let’s look at the structure of the dataset.

str(dca1)
## 'data.frame':    500 obs. of  6 variables:
##  $ Key        : atomic  1518 934 1651 1001 2919 ...
##   ..- attr(*, "format.sas")= chr "BEST"
##  $ YearOfDeath: atomic  1999 1966 1963 1978 1970 ...
##   ..- attr(*, "format.sas")= chr "BEST"
##  $ Sex_DC     : atomic  M M M M ...
##   ..- attr(*, "format.sas")= chr "$CHAR"
##  $ DateOfDeath: Date, format: "1991-11-03" "1994-08-01" ...
##  $ Dummy_Var4 : atomic  3930 1411 551 1930 2102 ...
##   ..- attr(*, "format.sas")= chr "BEST"
##  $ Dummy_Var5 : atomic  4860 9286 723 5400 1112 ...
##   ..- attr(*, "format.sas")= chr "BEST"
##  - attr(*, "label")= chr "DCA1"

Notice that the field DateOfDeath is a column of dates. Moreover, it looks like it is already of Date type. R must have inferred that this field should be of Date type from the format metadata in the original SAS dataset. That’s great but it usually won’t happen for us automatically (for instance, if you import a CSV file which doesn’t store format metadata).

For the sake of having an example to work with, let’s convert the DateOfDeath column from Date type to character type.

library(dplyr)
dca1 <- mutate(dca1, DateOfDeath = as.character(DateOfDeath))

You can check the structure of the dataset and see that it worked.

Let’s pretend now that this is how our dataset was originally imported. Oh dang, we need to convert DateOfDeath to a Date type! What do we do?

library(lubridate)
dca1 <- mutate(dca1, DateOfDeath = ymd(DateOfDeath))

That was way easy.
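Real columns are rarely this clean, so it is worth checking which values failed to convert. The sketch below uses a made-up character vector rather than the dca1 column. Values that ymd() cannot parse come back as NA.

```r
library(lubridate)

# Made-up column with one bad value and one value that was already missing.
raw <- c("1991-11-03", "1994-08-01", "not a date", NA)

# quiet = TRUE suppresses the warning about values that failed to parse.
parsed <- ymd(raw, quiet = TRUE)

# Values that were present in the raw data but could not be converted.
failed <- is.na(parsed) & !is.na(raw)
raw[failed]
```

Looking at raw[failed] tells you whether you have a few typos to fix by hand or a second date layout that needs a different function, such as mdy().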

References:

http://www.statmethods.net/input/dates.html

https://cran.r-project.org/web/packages/lubridate/vignettes/lubridate.html

Data formats: Long and Wide

tidyr package

Next session and HW

By now, you guys have seen how to do all the common DMA tasks we do at MCPHD with R. For homework, please pick one of your own SAS scripts which you would like to convert into R. First, make sure the SAS script has common DMA tasks that would be relevant to others in our Epidemiology group. Then, look over the SAS script prior to the next session and ask yourself if you’ve already seen how to do the types of tasks in the script with R. If not, please compile a list of those tasks we haven’t covered (example: Send conditional email alert) and email them to me.

In the next session, we’ll be going over a SAS script and converting it into R. Then we’ll take a look at your selected SAS script and try to figure out how you would do the same tasks with R. We will not plan on learning any specific things that are new, just those steps in your SAS scripts that we haven’t covered (perhaps sending an email alert, generating a log, etc.).