R

R is an open-source programming environment for statistical computing and data visualization, among other capabilities.

R has many features to recommend it (Kabacoff, 2015):

where to look in the book?:

chapter 1


RStudio

Programming, editing, project management, and other tasks in R are easier and more efficient in an integrated development environment (IDE); the most widely used and versatile one is RStudio.

Exercises

Open RStudio:

  • browse all the components of the interface
  • establish a Working Directory
  • create an R Script (compare with the R console)
  • write code to calculate the area of a geometric surface, and run it
  • explore other windows and menus of RStudio
  • load packages

where to look in the book?:

chapter 2 & 3


R Markdown (or R Notebook)

Communicating the results to an audience, and receiving acknowledgement and comments about them, is as important as carrying out the data analysis of an investigation (the main objective of this workshop). Usually that aspect is addressed at the end (as in the workshop book: Chapter 28), but if we do so, we can leave behind notes, results, graphs, etc., which at some point we have to find, organize, and move to a final document.

Creating an R Markdown document

First, check if you have the package rmarkdown installed; for this, look into the Packages menu (User Library).

To create an R Markdown document, use the menu in the upper-left corner:


Select R Markdown.

Editing the document

An R Markdown document has three types of content:

  • an (optional) YAML header surrounded by ---
  • R code chunks surrounded by ```
  • simple text mixed with formatted text, images, tables

The YAML header is a metadata section of the document, and can be edited to include basic information of the document:


You can change (or eliminate) the title, author, and date. The output option is created initially by RStudio, and depends on the output format that you produce (HTML, PDF, Word). An R Markdown document (html_document) can be transformed into an R Notebook (html_notebook) and vice versa.

The text and other document features (web links, images, tables) are edited or created using R Markdown syntax.

Inserting R code chunks

A code chunk is simply a piece of R code embedded in an R Markdown document. The format is:


Inside the {r} you can write chunk options that control the behavior of the code output, when producing the final document.
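
As a minimal sketch of how such options work: the options written inside {r} can also be set once, as defaults for every chunk, with opts_chunk$set() from the knitr package (the package that runs the chunks when the document is produced). The values chosen below are only illustrative assumptions, not settings required by this workshop.

```r
# load knitr, the package that runs the chunks of an R Markdown document
library(knitr)

# set default options for every chunk in the document
opts_chunk$set(
  echo = TRUE,       # show the R code in the final document
  message = FALSE,   # hide messages (for example, from library())
  warning = FALSE,   # hide warnings
  fig.width = 6,     # default figure width, in inches
  fig.height = 4     # default figure height, in inches
)

# check one of the current defaults
opts_chunk$get("message")
## [1] FALSE
```

A call like this usually goes in the first chunk of the document; an option written inside the {r} of a single chunk overrides the default for that chunk only.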

Another way to create code chunks (including chunks for other scripting languages) and select options is the drop-down menu.

Executing code chunks and controlling output

  1. Code in the notebook can be executed in different ways: Use the green triangle button on the toolbar of a code chunk that has the tool tip “Run Current Chunk”, or Ctrl + Shift + Enter (macOS: Cmd + Shift + Enter) to run the current chunk, or use “Run Selected Line(s)”, for one or more selected lines of code.

  2. Press Ctrl + Enter (macOS: Cmd + Enter) to run just the current statement.

  3. There are other ways to run a batch of chunks if you click the menu Run on the editor toolbar, such as “Run All”, “Run All Chunks Above”, and “Run All Chunks Below”.

In the previous section you can find how to control chunk output, with several options, but you can also select the options using .

Using Knit to produce an HTML or Word document

Now you can produce an HTML or Word document (this is called knitting, Knit), using the following menu:


Producing a PDF document requires a LaTeX distribution installed on your computer (usually more than 1 GB!).

Extras

TinyTeX
RPubs

where to look in the book?:

chapters 27 & 28


Basic operations and variable types

  1. Open a new R Markdown document, and name it “R Workshop Notes”.
  2. Save it with a short name.
  3. Edit the YAML metadata information, including a new title, your name, and the date.
  4. Write a title for a section called “Introductory Exercises”. In this section we are going to practice some common R operations and variable types.
  5. Create a chunk.
  6. Using # write titles and short descriptions inside the chunk.

Mathematical operations and variable assignment

#basic math operations
56 + 45
## [1] 101
56/45
## [1] 1.244444
#order of operations and parentheses
6 + 5 * 9
## [1] 51
(6 + 5) * 9
## [1] 99
5 + 3 / 2 * 3
## [1] 9.5
5 + 3 / (2 * 3)
## [1] 5.5
#variable assignment
#you can use <- or =, but <- is more common
v1 <- 2.5
v1
## [1] 2.5
long <- v1
width <- 1.25
area <- long * width
long
## [1] 2.5
width
## [1] 1.25
area
## [1] 3.125

Data types

There are four basic data types in R:

  • numeric (including integer, double)
  • character (including “strings”, factor)
  • time (including Date and POSIXct)
  • logical (TRUE, FALSE)

numeric

# how do we know if a variable contains numeric data?
vari1 <- 14 / 2
vari1
## [1] 7
class(vari1)
## [1] "numeric"
# numeric data can be stored as an integer
vari2 <- as.integer(14 / 2)
vari2
## [1] 7
class(vari2)
## [1] "integer"
# try 15/2
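
A short sketch of what the 15/2 case shows, under the assumption that the point of the exercise is the behavior of as.integer with a non-whole number (the names vari3 and vari4 are only illustrative):

```r
# 15 / 2 is stored as a double (class "numeric")
vari3 <- 15 / 2
vari3
## [1] 7.5
class(vari3)
## [1] "numeric"
# as.integer() drops the decimal part (it truncates toward zero, it does not round)
vari4 <- as.integer(vari3)
vari4
## [1] 7
as.integer(-7.5)
## [1] -7
# the L suffix creates an integer directly
class(7L)
## [1] "integer"
```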

character

# characters must use " "
char1 <- "hola"
char1
## [1] "hola"
class(char1)
## [1] "character"
# numeric to factor
char2 <- factor(3)
char2
## [1] 3
## Levels: 3
class(char2)
## [1] "factor"
# nchar output is the length of a character variable (or numeric treated as character)
nchar(char1)
## [1] 4
nchar(12358)
## [1] 5
# does it work with a factor?
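
One possible way to explore the factor question is sketched below. nchar stops with an error when it receives a factor, so the sketch wraps the call in tryCatch to keep the code running; the names fac1 and result are hypothetical.

```r
# a factor built from a character value
fac1 <- factor("hola")
# nchar() does not accept factors; tryCatch() captures the error instead of stopping
result <- tryCatch(nchar(fac1), error = function(e) "error")
result
## [1] "error"
# converting the factor to character first gives the expected length
nchar(as.character(fac1))
## [1] 4
```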

Date and Time of the day

as.Date stores a date string (“year-month-day”) as Date type data; it can be converted to numeric (as.numeric), counting days since January 1, 1970. With as.POSIXct, a string of date and time of the day (“year-month-day hour:minute:second”) is converted to time (POSIXct) class data; numerically, it is the number of seconds since January 1, 1970.

# as.Date stores a date string as Date data
today <- as.Date("2019-10-26")
today
## [1] "2019-10-26"
class(today)
## [1] "Date"
# number of days since January 1, 1970
today.days <- as.numeric(today)
today.days
## [1] 18195
class(today.days)
## [1] "numeric"
# date and time
today.time <- as.POSIXct("2019-10-26 09:00")
today.time
## [1] "2019-10-26 09:00:00 AST"
class(today.time)
## [1] "POSIXct" "POSIXt"
# how many seconds since January 1, 1970?
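
A sketch of the seconds calculation follows. It fixes the time zone to UTC (an assumption made here so that the result is the same on any computer; the output above was produced in the AST zone, so its number of seconds would differ by the zone offset).

```r
# fix the time zone to UTC so the result does not depend on the computer's settings
time.utc <- as.POSIXct("2019-10-26 09:00", tz = "UTC")
# number of seconds since January 1, 1970 (00:00:00 UTC)
time.secs <- as.numeric(time.utc)
time.secs
## [1] 1572080400
# whole days contained in those seconds (86400 seconds per day)
time.secs %/% 86400
## [1] 18195
```

The number of whole days matches the value obtained from as.Date above.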

logical

A variable can store logical data (TRUE or FALSE), as result of a logical statement.

# is a equal (or not equal) to b?
a <- 23 + 2/3
b <- 25 - 2/3
equal <- a == b
equal
## [1] FALSE
class(equal)
## [1] "logical"
noequal <- a != b
noequal
## [1] TRUE
# comparing characters, logical results depend on alphanumeric order
char <- "2data" > "data2"
char
## [1] FALSE
# what is the result: equal*5 - noequal*5 ? Why?
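
The question above depends on how R treats logical values in arithmetic. A minimal sketch, using the literal values FALSE and TRUE in place of equal and noequal:

```r
# in arithmetic, TRUE is treated as 1 and FALSE as 0
TRUE * 5
## [1] 5
FALSE * 5
## [1] 0
# so FALSE * 5 - TRUE * 5 is 0 - 5
FALSE * 5 - TRUE * 5
## [1] -5
# the same rule lets sum() count how many values are TRUE
sum(c(TRUE, FALSE, TRUE))
## [1] 2
```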

where to look in the book?:

chapter 4


Datasets

The first step in any data analysis is the creation of a dataset containing the data to be analyzed, in a format that meets your needs. In R, this task involves the following:

Data structures

R has a wide variety of objects for holding data, including scalars, vectors, matrices, arrays, data frames, and lists. They differ in terms of the type of data they can hold, how they are created, their structural complexity, and the notation used to identify and access individual elements.

R data structures:

  • vector
  • matrix
  • data frame
  • array
  • list

Vectors

Vectors are one-dimensional arrays that can hold numeric data, character data, or logical data. The combine function, c(…), is used to form the vector.

# numeric data
vec.num <- c(1, 2, 5, 3, 6, -2, 2.3)
vec.num
## [1]  1.0  2.0  5.0  3.0  6.0 -2.0  2.3
# character data
vec.char <- c("one", "two", "three")
vec.char
## [1] "one"   "two"   "three"
# What happens if we combine different types of data?
##
# vectors can be created with numeric and logical operators:
vec.oper <- c(2/5, 3, 5+3, 4-7, 4 == 4.01, 3.5 < 3.5001, "data" < "2data")
vec.oper
## [1]  0.4  3.0  8.0 -3.0  0.0  1.0  0.0
# vectors can be created with the content of variables:
vec.var <- c(equal, a, b, today.days)
vec.var
## [1]     0.00000    23.66667    24.33333 18195.00000
# the function seq(...) can be used to create vectors:
vec.seq <- seq(3, 4.5, 0.2) # c(...) is not necessary
vec.seq
## [1] 3.0 3.2 3.4 3.6 3.8 4.0 4.2 4.4
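
The question about combining different types of data can be explored with a sketch like the following. A vector holds a single type, so c(…) converts all the values to the most general type present; the names vec.mix1 and vec.mix2 are only illustrative.

```r
# numbers and logicals mixed with a string all become character
vec.mix1 <- c(1, "two", TRUE)
vec.mix1
## [1] "1"    "two"  "TRUE"
class(vec.mix1)
## [1] "character"
# numbers and logicals without strings become numeric
vec.mix2 <- c(1.5, TRUE, FALSE)
vec.mix2
## [1] 1.5 1.0 0.0
class(vec.mix2)
## [1] "numeric"
```

This is the same conversion that turned the logical results inside vec.oper and vec.var into 0 and 1.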

You can refer to elements of a vector using a numeric vector of positions within brackets.

# vector created from simple sequence
vec.ref <- c(3:11)
vec.ref
## [1]  3  4  5  6  7  8  9 10 11
# locate the first three elements and last two elements
# first we need the length of the vector:
length(vec.ref)
## [1] 9
# now we can look for the elements:
vec.ref[c(1:3, 8, 9)]
## [1]  3  4  5 10 11
# another way:
vec.ref[-(4:7)]
## [1]  3  4  5 10 11

We can do operations with vectors.

# multiplying two vectors
f <- c(1, 2, 3, 5, 7)
g <- c(2, 4, 6, 8, 10)
vec.mult <- f * g
vec.mult
## [1]  2  8 18 40 70
# What happens if the vectors have different lengths?
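
A small sketch of what happens with vectors of different length: R repeats (recycles) the shorter vector until it matches the longer one. The vectors h and k are hypothetical.

```r
# the shorter vector is recycled: it is used as 10, 100, 10, 100
h <- c(1, 2, 3, 4)
k <- c(10, 100)
h * k
## [1]  10 200  30 400
# if the longer length is not a multiple of the shorter one,
# R still recycles, but it also gives a warning (suppressed here)
vec.rec <- suppressWarnings(c(1, 2, 3) * c(10, 100))
vec.rec
## [1]  10 200  30
```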

where to look in the book?:

chapter 4


Matrices

A matrix is a two-dimensional array in which each element has the same class (numeric, character, or logical). Matrices are created with the matrix(…) function. The general format is:

mymatrix <- matrix(vector,
                   nrow = number of rows,
                   ncol = number of columns,
                   byrow = logical value,
                   dimnames = list(character vector of row names,
                                   character vector of column names))

# data vector - [age,glu,chol]
vec.diabetes <- c(32,90,160,26,130,200,40,200,180,55,150,260)
# name vectors
RowN <- c("patient 1","patient 2","patient 3","patient 4")
ColN <- c("Age","Glucose","Cholesterol")
# matrix
mtx.diabetes <- matrix(vec.diabetes,
                       ncol = 3,
                       byrow = TRUE,
                       dimnames = list(RowN,ColN)
                       )
mtx.diabetes
##           Age Glucose Cholesterol
## patient 1  32      90         160
## patient 2  26     130         200
## patient 3  40     200         180
## patient 4  55     150         260
class(mtx.diabetes)
## [1] "matrix" "array"
# (in R versions before 4.0 the output is only "matrix")

Data selection from a matrix is similar to selection from a vector, but now we must specify rows and columns.

# selecting all rows and two columns
new.matrix1 <- mtx.diabetes[ ,2:3]
new.matrix1
##           Glucose Cholesterol
## patient 1      90         160
## patient 2     130         200
## patient 3     200         180
## patient 4     150         260
# selecting first and last patient, and age and cholesterol
new.matrix2 <- mtx.diabetes[c(1,4),c(1,3)]
new.matrix2
##           Age Cholesterol
## patient 1  32         160
## patient 4  55         260
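
Because the matrix has row and column names, elements can also be selected by name, and whole columns can be summarized at once. The sketch below rebuilds the same values in a matrix called mtx.example (a hypothetical name) so that it can be run on its own.

```r
# small matrix with the same values and layout as mtx.diabetes
mtx.example <- matrix(c(32,  90, 160,
                        26, 130, 200,
                        40, 200, 180,
                        55, 150, 260),
                      ncol = 3, byrow = TRUE,
                      dimnames = list(paste("patient", 1:4),
                                      c("Age", "Glucose", "Cholesterol")))
# one element, selected by row name and column name
mtx.example["patient 2", "Glucose"]
## [1] 130
# one whole row: leave the column position empty
mtx.example["patient 3", ]
# mean of each column: Age 38.25, Glucose 142.5, Cholesterol 200
col.means <- colMeans(mtx.example)
col.means
```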

Data frames

A data frame is more general than a matrix in that different columns can contain different classes of data (numeric, character, and so on). Data frames are the most common data structure you will deal with in R. A data frame is created with the data.frame(…) function:

mydata <- data.frame(col1, col2, col3,…)

where col1, col2, col3, and so on are column vectors of any type (such as character, numeric, or logical).

# column vectors
ID <- c(1L,2L,3L,4L)
chol <- c(160, 200, 180, 260)
glu <- c(90, 130, 200, 150)
age <- c(32,26,40,55)
sex <- c("M","F","M","M")
diabetes <- c("neg","pos","pos","neg")
# data frame from vectors
patientdata <- data.frame(ID, age, sex, glu, chol, diabetes)
class(patientdata)
## [1] "data.frame"
patientdata
# changing the name of the columns
new.patiendata <- setNames(patientdata, c("Patient ID","Age","Birth Sex","Blood Glucose","Cholesterol","Diabetes Diagnosis"))
new.patiendata

Data selection from a data frame is similar to selection from a matrix, but it is also possible to do logical selections using factor columns.

# selecting columns by name
select.df1 <- new.patiendata[ ,c("Blood Glucose","Diabetes Diagnosis")]
select.df1
# selecting using logical operator
select.df2 <- new.patiendata[new.patiendata$`Diabetes Diagnosis`=="pos",]
select.df2
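
Logical selections can combine more than one condition. The following sketch uses a small data frame, df.example, with values like those of patientdata (the name and the age threshold are only illustrative), and also shows the subset function from base R as an alternative notation.

```r
# small data frame with columns like those of patientdata
df.example <- data.frame(ID = 1:4,
                         age = c(32, 26, 40, 55),
                         glu = c(90, 130, 200, 150),
                         diabetes = c("neg", "pos", "pos", "neg"))
# one column as a vector, using $
df.example$glu
## [1]  90 130 200 150
# rows that meet two conditions at once (& means "and")
select.df3 <- df.example[df.example$diabetes == "pos" & df.example$age > 30, ]
select.df3$ID
## [1] 3
# subset() selects the same rows with a shorter notation
select.df4 <- subset(df.example, diabetes == "pos" & age > 30)
select.df4$ID
## [1] 3
```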

where to look in the book?:

chapter 5


Data input

R provides a wide range of tools for importing data and creating data frames. The definitive guide for importing data in R is the R Data Import/Export manual, available at http://mng.bz/urwn.

We are considering three tools to input data for analyses:

Reading CSV files

CSV files are text files in which the values on each line are separated by commas (e.g. 1,2,3,5,7,11), with a line break at the end of each line. The file can be created with any text editor or exported from a spreadsheet application (Excel, for example). To read a CSV file use the following basic code:

honeymoondata <- read.csv('honeymoon.csv', header = TRUE)
dim(honeymoondata)
## [1] 115   6
head(honeymoondata)
class(honeymoondata)
## [1] "data.frame"
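
Since honeymoon.csv is not reproduced here, the sketch below writes a small CSV file with made-up values to a temporary location and reads it back, to show what read.csv does with the header line and with missing values. The file contents and the column names (couple, days, cost) are hypothetical.

```r
# hypothetical data, written to a temporary CSV file
csv.file <- tempfile(fileext = ".csv")
writeLines(c("couple,days,cost",
             "A,7,1200.50",
             "B,10,2300",
             "C,5,NA"),
           csv.file)
# header = TRUE takes the column names from the first line of the file
csv.data <- read.csv(csv.file, header = TRUE)
dim(csv.data)
## [1] 3 3
names(csv.data)
## [1] "couple" "days"   "cost"
# read.csv chooses a class for each column
sapply(csv.data, class)
# the text NA in the file is read as a missing value
is.na(csv.data$cost)
## [1] FALSE FALSE  TRUE
```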

Import data from Excel file

RStudio provides a menu-driven tool to import data from an Excel file (and other types, too). It requires that the readxl package is installed and loaded. Use the menu Import Dataset, select the File, select the Sheet with the data you are interested in, and check First Row as Names if that is the case.


These menu actions will generate code that you can use inside your procedure.

library(readxl)
melodata <- read_excel("melocactus.xlsx", 
    sheet = "datos")
class(melodata)
## [1] "tbl_df"     "tbl"        "data.frame"
head(melodata)
tail(melodata)

where to look in the book?:

chapter 6


Descriptive Statistics

R has many ways to calculate descriptive statistics on datasets. Some are included in the basic installation; others come in packages that must be downloaded and installed (using install.packages(…) or the Packages menu) and then loaded into the R environment (using library(…)).

Using summary

summary is the simplest way to obtain some descriptive statistics from our data.

# select the columns (variable) to analyze
melovars <- c("alturatotal","longinflo")
# use summary
summary(melodata[melovars])
##   alturatotal      longinflo     
##  Min.   : 3.00   Min.   : 0.000  
##  1st Qu.:11.00   1st Qu.: 0.000  
##  Median :18.00   Median : 0.000  
##  Mean   :21.93   Mean   : 5.966  
##  3rd Qu.:30.00   3rd Qu.:11.000  
##  Max.   :69.00   Max.   :35.000

Using sapply and a function

A more general way to calculate descriptive statistics is using the sapply procedure, with the syntax:

sapply(x, FUN, options)

where FUN is a simple system function (like mean(var), sd(var), etc.) or a user-defined function, with the following syntax:

myfunction <- function(arg1, arg2, … ){
statements
return(object)
}

# defining function
mystats <- function(x){
  m <- mean(x)
  md <- median(x)
  n <- length(x)
  s <- sd(x)
  return(c(n=n, mean=m, median=md, stdev=s))
}
# sapply function on dataset
sapply(melodata[melovars], mystats)
##        alturatotal  longinflo
## n        145.00000 145.000000
## mean      21.93103   5.965517
## median    18.00000   0.000000
## stdev     14.18120   8.065198

We can select some of the data to apply the statistics.

# select plants with inflorescence length greater than 0
melo.inflo <- melodata[melodata$longinflo > 0,"longinflo"]
# sapply function mystats
sapply(melo.inflo, mystats)
##        longinflo
## n      69.000000
## mean   12.536232
## median 11.000000
## stdev   7.359627
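
A function like mystats returns NA for every statistic as soon as one value is missing. The sketch below is one possible variant that removes missing values first; the function name mystats.na and the vector heights are hypothetical and are not part of the workshop data.

```r
# variant of mystats that ignores missing values (NA)
mystats.na <- function(x) {
  x <- x[!is.na(x)]   # drop missing values first
  c(n = length(x), mean = mean(x), median = median(x), stdev = sd(x))
}
# hypothetical measurements, one of them missing
heights <- c(10, 12, NA, 20, 18)
# a single NA makes the plain mean NA
mean(heights)
## [1] NA
# the variant uses the 4 available values: mean 15, median 15
stats.na <- mystats.na(heights)
stats.na
```

The function can be passed to sapply in the same way as mystats.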

Using aggregate

When you have variables that can be treated as factors, you can use the aggregate function to obtain basic statistics aggregated by those factors. The function has the following syntax:

aggregate(x, by, FUN)

# calculate the mean of plant heights (alturatotal) by plant status (estado)
aggdata <- aggregate(melodata$alturatotal, by = list(melodata$estado), mean)
aggdata
# we can change the names of columns
aggdata <- setNames(aggdata, c("Plant Status","Mean Height, cm"))
aggdata
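
aggregate also accepts a formula, which keeps the original column names in the result, so the renaming step is often unnecessary. A minimal sketch with made-up data (the data frame plants and its values are hypothetical):

```r
# hypothetical data: plant height (cm) and plant status
plants <- data.frame(height = c(10, 20, 30, 40, 50, 60),
                     status = c("alive", "alive", "alive", "dead", "dead", "dead"))
# formula interface: height summarized by status
agg.example <- aggregate(height ~ status, data = plants, FUN = mean)
agg.example
##   status height
## 1  alive     20
## 2   dead     50
```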

Using a data.table

The data.table package extends the functionality of data.frame operations. First, we have to convert a data.frame into a data.table. Thereafter you can select subgroups and apply functions to them.

# activate data.table package
library(data.table)
# convert data.frame to data.table
melodataDT <- data.table(melodata)
class(melodataDT)
## [1] "data.table" "data.frame"
# descriptive statistics by groups
meloDS <- melodataDT[, list(Media = mean(alturatotal),
                            Median = median(alturatotal),
                            StDev = sd(alturatotal)),
                     by = list(Status = estado)]
meloDS
# ordering results
meloDS[order(Media)]

where to look in the book?:

chapters 11, 18


Introduction to graphs

R is a great platform for building graphs. In a typical interactive session, you build a graph one statement at a time, adding features until you have what you want.

The base graphics system of R is described at the beginning of chapter 7 of Lander's book (Lander, 2014). Two other widely used systems that provide extensive options are lattice and ggplot2. We will be mostly using the base graphics system and ggplot2.

The primary graph for a variable: the histogram

Histograms display the distribution of a continuous variable by dividing the range of scores into a specified number of bins on the x-axis and displaying the frequency of scores in each bin on the y-axis.
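
As a minimal sketch of this definition with the base graphics system, the code below draws a histogram of simulated values (not the workshop data) with hist and then inspects the bins that R computed. The variable names and the simulated heights are assumptions made for illustration.

```r
# hypothetical data: 200 simulated heights (cm)
set.seed(123)
sim.height <- rnorm(200, mean = 22, sd = 5)
# base graphics histogram; breaks is a suggestion for the number of bins
hist.base <- hist(sim.height, breaks = 10,
                  main = "Simulated heights", xlab = "Height, cm")
# the returned object stores the bin limits (x-axis) ...
hist.base$breaks
# ... and the frequency of values in each bin (y-axis)
hist.base$counts
# every value falls in exactly one bin
sum(hist.base$counts)
## [1] 200
```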

Introducing ggplot2: building a histogram

Now we are going to build a histogram using ggplot2. ggplot2 provides a system for creating graphs based on the grammar of graphics. The intention of the ggplot2 package is to provide a comprehensive, grammar-based system for generating graphs in a unified and coherent manner, allowing users to create new and innovative data visualizations. The power of this approach has led ggplot2 to become an important tool for visualizing data using R.

First, let's see a basic ggplot2 histogram:

# activating ggplot2
library(ggplot2)
# basic histogram
ggplot(melodata, aes(alturatotal))+
  geom_histogram(color="white", bins = 14)

Now a more detailed histogram, including several layers:

# note: after_stat() and linewidth need ggplot2 3.4.0 or later;
# older versions use ..density.. and size instead
hist.melodata <- ggplot(melodata, aes(alturatotal)) + 
  geom_histogram(aes(y = after_stat(density)), bins = 14, colour = "white", fill = "green") +
  geom_rug(sides = "t", color = "black") +
  labs(x = "Total height, cm", y = "Density") +
  stat_function(fun = dnorm, 
                args = list(mean = mean(melodata$alturatotal, na.rm = TRUE), 
                            sd = sd(melodata$alturatotal, na.rm = TRUE)), 
                colour = "red", linewidth = 1)
hist.melodata

Homework

Here is an assignment to complete and send to me for review.

References

Kabacoff, R., 2015. R in action: data analysis and graphics with R, Second edition. Manning, Shelter Island.

Lander, J. P., 2014. R for everyone. Pearson Education, Inc., Upper Saddle River, NJ, USA.

Verzani, J., 2012. Getting started with RStudio. O’Reilly, Sebastopol, Calif.

Xie, Y., J. J. Allaire, G. Grolemund, 2018. R Markdown: The Definitive Guide. Chapman & Hall/CRC.