TO DO

  1. Everyone needs to make sure they have R, RStudio, and Rtools installed
  2. Everyone needs to have installed a set of R packages using the line of code below:
install.packages(c("rio", "dplyr", "magrittr", "sqldf", "RODBC", "lubridate", "tidyr" ))
  1. Keep these notes open throughout the session.
  2. Go to this webpage https://github.com/haroldgil/MCPHD-R-Intro/tree/master/data, download all 3 data sets, and save them to your Desktop
  3. We will also discuss the R Cheat Sheet available for download at: https://github.com/haroldgil/MCPHD-R-Intro/raw/master/R%20Packages%20Cheat%20Sheet.docx

FYI

Fundamentals

Calculator

You can do all sorts of arithmetic computations in R.

1

1 + 2

(1 + 2)^2/3

1.2 + 2.3

Data types

# By default, numbers in R are stored as 'double' (decimal) types even though they may look like integers
1  

# To force R to understand that you want the 'integer' type, follow the number with an L. Note: In practice you never really need to do this. If the number is stored a 'double', it's still going to work fine.
1L  

# Character
"Hello, Adele"

# Logicals
TRUE

FALSE

T

F

Columns (Vectors)

The c() function is used to create columns (also know as vector). The ‘c’ is short for column.

c(1, 2, 3)

c(1.5, 1, 2.5)/2

c("Hello", "Darkness", "My", "Old", "Friend...")

c(TRUE, FALSE, T, F)

Functions

Have the format function(). Different functions have different parameters/options.

mean(c(1, 2, 3))  # Take the mean of the values: 1, 2, and 3.

?mean  # Look up info about function (inputs, outputs, parameters/options)

Assignment

The <- symbol is the assignment operator. This is how you store things in R. Anything that

x <- 1  # Here I created the R object, x, which stores the value 1..

y <- c(1, 2, 3)  # Here I stored the column of values: 1, 2, and 3.

mean_of_y <- mean(y)  # Here I applied the mean() function to the object y which was created above.

Questions?

Installing R packages

# Install single package
install.packages("packageName")

# Install multiple packages
install.packages(c("a", "b", "c"))

# Load a package
library(package)

# User str() to find out more information about an object. 'str' is short for structure.
str(object)

Questions?

Check data types

is.integer(1)  # This is FALSE because R stored all numbers as double, not integer, types by default.
is.integer(1L)  # Here I told R to interpret the 1 as an integer so this is TRUE
is.numeric(c(1, 2, 3))  # This function evaluates to TRUE if the object is of integer or double type. 
is.integer(1.5)
is.double(1.5)
is.character(1.5)
is.character("Hello Adele")
is.logical(TRUE)
is.logical(F)

Logical operations

1==1  # Note that to test for equality requires two = signs
1==2
1<2
1>2
1<=1
1>=1
1!=2

Missing values in R

R represents missing values with NA.NAs are their own data type and there are actually sub-types for NAs (one for characters, one for real/decimal numbers, etc.) but usually you don’t need to worry about this. What is important to know though is that you need to use a special function to test whether values of an R object are NAs.

# Is NA equivalent to NA? This evaluates to NA. Not helpful.
NA==NA  

# Instead, you need to use is.na(). This will work as you would expect it to. Remember this function.
is.na(NA)
is.na(c(1, NA, 3))

Coercing data types

You can change the data type of objects in R from one to another. This is sometimes useful. For instance, you may want years (2000, 2001, …) to be of character type instead of numeric type.

a <- as.character(1)
b <- as.double(1)
c <- as.integer("1")
d <- as.logical(1)

# Remember you can also use str() to check the object data type. The output will be different though.

str(a)
str(b)
str(c)
str(d)

is.double(as.double(1))

Dataframes

Here I’m going to create a dataframe (R’s equivalent data structure for a dataset). R is designed to make working with dataframes easy.

# I'll first make three columns that will become my dataframe (dataset) columns
var1 <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
var2 <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j")
var3 <- seq.Date(from = as.Date("2017-05-16"), to = as.Date("2017-05-25"), by="day")

# data.frame() is the function that takes columns of equal length and makes them a dataframe
df <- data.frame(var1, var2, var3)

# Let's see some info about df
str(df)

# Is the object of type dataframe? We could see it was from using the str() command but we can also see it this way.
is.data.frame(df)

# A dataframe column itself is not a dataframe 
is.data.frame(var1)

# head() gives the first 6 elements (for columns) or rows (for dataframes) of an object by default. You can add the 'n = Some_Number' option to change the number of elements of rows that head() will retrieve.
head(df)
head(df, n = 8)
head(var1)

Concatenating strings/characters

You can concatenate characters with the paste() function. By default, paste() has an option called sep that sets the character that will be inserted in between character elements being pasted and it is set to one blank: " “.

# When these two characters are concatenated, there will be one blank space placed between them because sep = " " is a default option.
paste("Hello by Adele", "is #1")

me <- "Harold"
paste(me, "loves Adele")

# You can change the sep option by specifying it to be something else in quotes (for instance, no blank, dash, nonsense, etc.).
paste("317", "123", "4567", sep = "")
paste("317", "123", "4567", sep = "-")
paste("317", "123", "4567", sep = "blah")

Importing/exporting and connecting to DBs

Load rio.

library(rio)

Why do you load a package? In order to have access to its functions. Try using ?import without loading rio first results in an error. A package needs to be loaded every time you restart an R session in order to have access to the package functions again.

See reference: (https://cran.r-project.org/web/packages/rio/vignettes/rio.html)

IMPORTANT: Make sure to use forward slashes and not a backslashes!

library(rio)
bca <- import("C:/Users/haroldicus/Desktop/bca.sas7bdat")
export(df, "C:/Users/haroldicus/Desktop/df.csv")
export(df, "C:/Users/haroldicus/Desktop/df.xlsx", which = "Training day")
convert("C:/Users/haroldicus/Desktop/bca.sas7bdat", "C:/Users/haroldicus/Desktop/bca.xlsx")

Questions?

RODBC (for remote DB connections to SQL Server DBs)

We use the RODBC package to connect to SQL Server databases.

library(RODBC)

Interestingly, you need to use \ instead of  or / if your server name includes them.

If you use passwords to connect to DBs, don’t use the trusted_connection parameter, instead use uid for username and pwd for password.

Also note that the parameter in which you specify the database name can go by different names depending on your SQL Server implementation: dsn, dbname, or database. You can pick one and try it out and move to the next. Many times, the connection string won’t work until you use the right parameter (dsn, dbname, or database) for this.

This is an example of an R connection string used at MCPHD.

# This is the connection and it is stored in the object named 'con'
con <- odbcDriverConnect('driver={SQL Server}; server=sql123\\inst456;database=HealthKnowledgeDB; trusted_connection=yes')

# Here I submit my SQL query through the connection so I specify the object 'con'
res <- sqlQuery(con, "SELECT * FROM sql_table_with_data")

Templates. Remember to use only one of the parameters dsn, dbname, and database.

# Using trusted connection
con <- odbcDriverConnect ('driver={SQL Server};server=SERVER\\instance;dsn/dbname/database=DatabaseName;trusted_connection=yes')
res <- sqlQuery(con, "sql_statement")

# Using username and password
con <- odbcDriverConnect ('driver={SQL Server};server=SERVER\\instance;dsn/dbname/database=DatabaseName;uid=username;pwd=password')
res <- sqlQuery(con, "sql_statement")

Try adapting one of your SAS connection strings to R now.

Questions?

sqldf (for applying SQL to local dataframes)

The package sqldf is for applying SQL queries to local dataframes, not external databases.

We will apply SQL to the dataframe we created previously.

var1 <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
var2 <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j")
var3 <- seq.Date(from = as.Date("2017-05-16"), to = as.Date("2017-05-25"), by="day")

df <- data.frame(var1, var2, var3)

Here we apply the SQL code the the dataframe object named ‘df’.

library(sqldf)
sqldf("SELECT * FROM df WHERE var1 >= 5 OR var2 == 'a'")

Questions?

Transforming data

dplyr (transforming and linking data)

# The function names() tells you the names of the columns of a dataframe.
names(bca)

library(dplyr)

# Use the select() function to keep or drop columns.

# Keep columns
Kbca <- select(bca, LOCAL_FILE_NUMBER, Event_Year, WTGRAMS, EST_GEST, MDPSMOKE, BMI, DateOfBirth)

# Drop columns
Dbca <- select(bca, -Dummy_Var1, -Dummy_Var2, -Dummy_Var3)

# The result of the previous two commands is the same, we just stored them in different objects. We'll pick one (Dbca) for the subsequent examples.

# filter() allows you to get the rows which meet any logical conditions you specify

example1 <- filter(Dbca, Event_Year >= 2000) 
example2 <- filter(Dbca, Event_Year >= 2000 | WTGRAMS >= 2500)  # | means OR
example3 <- filter(Dbca, Event_Year >= 2000, WTGRAMS >= 2500)  # Comma is the same as & which means AND

# mutate() let's you change entries in columns or create new columns
example4 <- mutate(Dbca, Infant = 1)

Nbca <- mutate(Dbca, Risk_Level = EST_GEST + MDPSMOKE + (WTGRAMS < 2500))

Nbca <- mutate(Nbca, Risk = case_when(
                                                          Nbca$Risk_Level == 3 ~ "high",
                                                          Nbca$Risk_Level == 2 ~ "medium", 
                                                          Nbca$Risk_Level == 1 ~ "low", 
                                                          Nbca$Risk_Level == 0 ~ "zero risk", 
                                                          TRUE ~ "unknown"  # Basically, an else condition
                                                          )
                  )

# Another example
Nbca <- mutate(Nbca, Millenial = case_when(
                                                      Nbca$Event_Year >= 1982 & Nbca$Event_Year <= 2004 ~ "yes",
                                                      Nbca$Event_Year < 1975 ~ NA_character_,  # Note this is a weird case in which I have to specify which type of NA I'm using (I used NA_character_)
                                                      TRUE ~ "no"
                                                    )
          )

# Replace values in the column Millenial which say "yes" with "totes"
Nbca <- mutate(Nbca, Millenial = replace(Millenial, Millenial == "yes", "totes"))       

miss_bca <- filter(Nbca, is.na(Millenial))  # Get the rows for which Millenial entries are missing

# ! symbol stands for negation of logical values
nonmiss_bca <- filter(Nbca, !is.na(Millenial))  # Get the rows for which Millenial entries are not missing

# Sort values with arrange(). Sorted in ascending order by default. To specify that you want a particular column sorted in descending order, enclose the column in the desc() function.
Nbca <- arrange(Nbca, Event_Year)
Nbca <- arrange(Nbca, Event_Year, desc(EST_GEST))

# Rename columns
Nbca <- rename(Nbca, Year = Event_Year, Weight = WTGRAMS)

# Get distinct values combinations for one or more columns
distinct(Nbca, BMI)

Questions?

Grouping, summarizing, and merging/linking

# Use group_by() and summarize() to get aggregate measures
gbca <- group_by(bca, Event_Year)
summarize(gbca, Count = n())

gbca <- group_by(bca, Event_Year, MDPSMOKE)
summarize(gbca, Count = n())

gbca <- group_by(bca, Event_Year)
summarize(gbca, Count = sum(MDPSMOKE))

gbca <- group_by(bca, Event_Year)
summarize(gbca, Proportion = sum(MDPSMOKE)/n())

# Let's merge some datasets

dca1 <- import("C:/Users/haroldicus/Desktop/dca1.sas7bdat")  # Note, the last letter of the file is the number 1 not the letter l.

linked <- left_join(bca, dca1, by=c("LOCAL_FILE_NUMBER"="BC_DataIDnumber"))

# If the columns you are joining by have the same name in both datasets, the syntax for joining is simpler.
bca <- rename(bca, BC_DataIDnumber = LOCAL_FILE_NUMBER)
linked <- left_join(bca, dca1, by = "BC_DataIDnumber")

Questions?

Working with dates and datetimes

library(lubridate)

# Use the function which specifies the order in which the Year, Month, and Day of the date is represented.
Date_1 <- ymd("2017Mar22")  # Here the date has Year, then Month, then Day, so use ymd()
Date_2 <- ydm("2017/22/03")
Date_3 <- mdy("March 22, 2017")  # Here the date has Month, then Day, then Year so use mdy()
Date_4 <- myd("Mar2017-22")
Date_5 <- dmy("22-03-2017")
Date_6 <- dym("22201703")

# You get the pattern. You can use str() on these and see that the objects are now Date types instead of character types. They are automatically stored in the format YYYY-MM-DD. This is called the unambiguous date format. But what if I want to change the format? You can use R's base format() function in which you specify how you want the date to be formatted. See this really simple and short resource to learn more about how to use it: http://www.statmethods.net/input/dates.html

reformatDate <- format(Date_1, "%m/%d/%Y")

# Once the objects are recognized by R as Date types, it's simple to extract information from the dates.

year(Date_1) 
month(Date_1) 
day(Date_1) 

datetimeNow <- Sys.time()
 
hour(datetimeNow)
minute(datetimeNow)
second(datetimeNow)

# Specifying time zones
ymd_hms(datetimeNow, tz="America/Indiana/Indianapolis")
nz <- ymd_hms(datetimeNow, tz="Pacific/Auckland")

# What datetime is this in another timezone?
with_tz(nz, "America/Indiana/Indianapolis")

Questions?

Piping/Chaining

library(magrittr) # Note that for many scenarios, use of the pipe operator requires loading this package first

# Let us take this computation below as an example
sqrtOfMean <- sqrt(mean(c(1,2,3,4,5)))

# This is equivalent to the above
sqrtOfMean <- c(1,2,3,4,5) %>% mean() %>% sqrt()

# This is also equivalent and good syntax
sqrtOfMean <- c(1,2,3,4,5) %>% 
                mean() %>% 
                sqrt()

Questions?

Long to wide and wide to long

Dataset with Measles and Varicella vaccination dates for different children.

Dataset with Measles and Varicella vaccination dates for different children.

library(tidyr)

# Note that in importing the datasets below, I specified the data types of the columns. 
vax_wide <- import("C:/Users/haroldicus/Desktop/vax.xlsx", which = "vax_wide", col_types=c("Child_ID"="numeric", "Varicella"="date", "Measles"="date"))  
vax_long <- import("C:/Users/haroldicus/Desktop/vax.xlsx", which = "vax_long", col_types=c("Child_ID"="numeric", "Vax_Received"="text", "Vax_Date"="date"))

# Wide to Long

vax_WtL <- gather(vax_wide, Vax_Received, Vax_Date, Varicella, Measles)

vax_WtL <- gather(vax_wide, Vax_Received, Vax_Date, Varicella, Measles, na.rm = T)  # Remove entries with NAs if you want to

# Long to Wide

vax_LtW <- spread(vax_long, Vax_Received, Vax_Date)

Questions?