R is an open-source programming environment for statistical computing and data visualization, among other capabilities.
R has many features to recommend it (Kabacoff, 2015):
Most commercial statistical software platforms cost thousands, if not tens of thousands, of dollars. R is free! If you’re a teacher or a student, the benefits are obvious.
R is a comprehensive statistical platform, offering all manner of data-analytic techniques. Just about any type of data analysis can be done in R.
R contains advanced statistical routines not yet available in other packages. In fact, new methods become available for download on a weekly basis.
R has state-of-the-art graphics capabilities. If you want to visualize complex data, R has the most comprehensive and powerful feature set available.
R is a powerful platform for interactive data analysis and exploration. For example, the results of any analytic step can easily be saved, manipulated, and used as input for additional analyses.
Getting data into a usable form from multiple sources can be a challenging proposition. R can easily import data from a wide variety of sources, including text files, database-management systems, statistical packages, and specialized data stores. It can write data out to these systems as well. R can also access data directly from web pages, social media sites, and a wide range of online data services.
R provides an unparalleled platform for programming new statistical methods in an easy, straightforward manner. It’s easily extensible and provides a natural language for quickly programming recently published methods.
R functionality can be integrated into applications written in other languages, including C++, Java, Python, PHP, Pentaho, SAS, and SPSS. This allows you to continue working in a language that you may be familiar with, while adding R’s capabilities to your applications.
R runs on a wide array of platforms, including Windows, Unix, and Mac OS X. It’s likely to run on any computer you may have. If you don’t want to learn a new language, a variety of graphic user interfaces (GUIs) are available, offering the power of R through menus and dialogs.
chapter 1
Programming, editing, project management, and other procedures in R are easier and more efficient with an integrated development environment (IDE); the most widely used and versatile one is RStudio.
chapter 2 & 3
Communicating the results to an audience, and receiving feedback on them, is as important as carrying out the data analysis itself (the main objective of this workshop). That aspect is usually addressed at the end (as in the workshop book: Chapter 28), but doing so can leave behind notes, results, graphs, etc., which at some point we have to find, organize, and move to a final document.
First, check if you have the package rmarkdown installed; for this, look into the Packages menu (User Library).
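The same check can also be done from the console. This is a minimal sketch using base R's `requireNamespace()`; the variable name `has_rmarkdown` is only illustrative.

```r
# Check whether the rmarkdown package is available in any library on this machine.
has_rmarkdown <- requireNamespace("rmarkdown", quietly = TRUE)
has_rmarkdown  # TRUE if the package is installed, FALSE otherwise

# If it is missing, it can be installed from CRAN (needs an internet connection):
# install.packages("rmarkdown")
```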
To create an R Markdown document, use the menu in the upper-left corner:
Select R Markdown.
An R Markdown document has three types of content: a YAML header, text, and code chunks.
The YAML header is a metadata section of the document, and can be edited to include basic information of the document:
You can change (or remove) the title, author, and date. The output option is originally created by RStudio, and depends on the output format that you produce (HTML, PDF, Word). An R Markdown document (html_document) can be transformed into an R Notebook (html_notebook) and vice versa.
The text and other document features (web links, images, tables) are edited or created using R Markdown syntax.
A code chunk is simply a piece of R code by itself, embedded in an R Markdown document. The format is:
Inside the {r} you can write chunk options that control the behavior of the code output, when producing the final document.
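The three kinds of content can be seen together in a small file. The sketch below builds a hypothetical minimal R Markdown document from R and writes it to a temporary file; the title, author, chunk label (`example-chunk`) and text are invented for illustration. The chunk option `echo=FALSE` hides the code in the final document and keeps only its output.

```r
# Build the chunk delimiter (three backticks) programmatically.
fence <- strrep("`", 3)

# A hypothetical minimal R Markdown document, held as a character vector.
rmd_lines <- c(
  "---",                                   # YAML header starts
  "title: \"My first report\"",
  "author: \"Workshop participant\"",
  "output: html_document",
  "---",                                   # YAML header ends
  "",
  "## A heading",                          # text written in R Markdown syntax
  "",
  "Some text with a [web link](https://www.r-project.org).",
  "",
  paste0(fence, "{r example-chunk, echo=FALSE}"),  # code chunk with one option
  "summary(c(1, 2, 5, 3))",
  fence
)

# Write the document to a temporary .Rmd file and show its contents.
rmd_file <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_file)
cat(readLines(rmd_file), sep = "\n")
```

Opening the resulting file in RStudio should show the header, the text, and the chunk as separate, editable parts.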
Another way to create code chunks (including chunks for other scripting languages) and select options is to use the drop-down menu.
Code in the notebook can be executed in different ways: Use the green triangle button on the toolbar of a code chunk that has the tool tip “Run Current Chunk”, or Ctrl + Shift + Enter (macOS: Cmd + Shift + Enter) to run the current chunk, or use “Run Selected Line(s)”, for one or more selected lines of code.
Press Ctrl + Enter (macOS: Cmd + Enter) to run just the current statement.
There are other ways to run a batch of chunks if you click the menu Run on the editor toolbar, such as “Run All”, “Run All Chunks Above”, and “Run All Chunks Below”.
In the previous section you learned how to control chunk output with several options. You can also select the options using .
Now you can produce an HTML or Word document (this is called knitting, Knit), using the following menu:
Producing a PDF document requires a LaTeX installation on your computer (it usually takes more than 1 GB!).
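Knitting can also be started from code instead of the menu. The following is a sketch, assuming the rmarkdown package and pandoc are available (RStudio normally ships with pandoc); the file content is a made-up example, and the knit step is skipped when either tool is missing.

```r
# Write a tiny hypothetical R Markdown file to a temporary location.
rmd_file <- tempfile(fileext = ".Rmd")
writeLines(c(
  "---",
  "title: \"Knit test\"",
  "output: html_document",
  "---",
  "",
  "Hello from R Markdown."
), rmd_file)

# Knit only if rmarkdown and pandoc are both available on this machine.
can_knit <- requireNamespace("rmarkdown", quietly = TRUE) &&
  rmarkdown::pandoc_available()

if (can_knit) {
  # render() returns the path of the file it produced.
  out_file <- rmarkdown::render(rmd_file, output_format = "html_document",
                                quiet = TRUE)
  print(out_file)
} else {
  message("rmarkdown or pandoc not found; skipping the knit step.")
}
```

Changing `output_format` to `"word_document"` should produce a Word file in the same way.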
chapters 27 & 28
The first step in any data analysis is the creation of a dataset containing the data to be analyzed, in a format that meets your needs. In R, this task involves the following:
R has a wide variety of objects for holding data, including scalars, vectors, matrices, arrays, data frames, and lists. They differ in terms of the type of data they can hold, how they’re created, their structural complexity, and the notation used to identify and access individual elements.
R data structures:
Vectors are one-dimensional arrays that can hold numeric data, character data, or logical data. The combine function c(…) is used to form the vector.
a <- c(1, 2, 5, 3, 6, -2, 2.3)
a
## [1] 1.0 2.0 5.0 3.0 6.0 -2.0 2.3
b <- c("one", "two", "three")
b
## [1] "one" "two" "three"
c <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)
c
## [1] TRUE TRUE TRUE FALSE TRUE FALSE
# What happens if we combine different types of data?
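One way to explore the question above is the short sketch below (the names `mixed1` and `mixed2` are illustrative). Because a vector can hold only one type of data, R converts the elements to a common type.

```r
# Numbers and logicals mixed with text: everything becomes character.
mixed1 <- c(1, "two", TRUE)
mixed1
class(mixed1)   # "character"

# Numbers mixed with logicals: TRUE becomes 1 and FALSE becomes 0.
mixed2 <- c(1, TRUE, FALSE)
mixed2
class(mixed2)   # "numeric"
```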
Vectors may also be generated using the seq(…) function, when you want a sequence of numbers, evenly spaced by a specified interval.
d <- seq(3, 11, by = 1)
d
## [1] 3 4 5 6 7 8 9 10 11
dd <- c(1:4, seq(2.1, 4.5, by = 1.1))
dd
## [1] 1.0 2.0 3.0 4.0 2.1 3.2 4.3
# You can use operators (+, -, *, /, etc.) to create the vector; try them.
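A few possible ways to try this are sketched below; the vector names are illustrative. Each operator is applied to every element of the sequence.

```r
evens   <- 2 * (1:5)                    # 2 4 6 8 10
squares <- (1:5)^2                      # 1 4 9 16 25
shifted <- seq(0, 1, by = 0.25) + 10    # 10.00 10.25 10.50 10.75 11.00

evens
squares
shifted
```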
You can refer to elements of a vector using a numeric vector of positions within brackets. For example, a[c(2, 4)] refers to the second and fourth elements of vector a.
e <- c("k", "j", "h", "a", "c", "m")
e[3]
## [1] "h"
e[c(1, 3, 5)]
## [1] "k" "h" "c"
e[2:6]
## [1] "j" "h" "a" "c" "m"
# What happens if you use e[]?
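The sketch below answers the question and adds two related cases that are not in the original text (negative positions and positions beyond the end of the vector).

```r
e <- c("k", "j", "h", "a", "c", "m")

e[]           # an empty index returns every element
e[-1]         # a negative position drops that element
e[-c(1, 2)]   # drops the first two elements
e[10]         # a position beyond the end returns NA
```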
We can do operations with vectors.
f <- c(1, 2, 3, 5, 7)
g <- c(2, 4, 6, 8, 10)
f + g
## [1] 3 6 9 13 17
f / g
## [1] 0.500 0.500 0.500 0.625 0.700
# What happens if the vectors have different lengths?
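A short sketch of what to expect (the vectors `h` and `k` are invented for illustration): R recycles the shorter vector, and warns when the longer length is not a multiple of the shorter one.

```r
f <- c(1, 2, 3, 5, 7)
h <- c(10, 20)
k <- c(1, 2, 3, 4)

# Length 4 and length 2: h is reused twice, no warning.
k + h                              # 11 22 13 24

# Length 5 and length 2: h is still recycled, but R emits a warning.
# suppressWarnings() is used here only to keep the output clean.
recycled <- suppressWarnings(f + h)
recycled                           # 11 22 13 25 17
```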
chapter 4
A matrix is a two-dimensional array in which each element has the same mode (numeric, character, or logical). Matrices are created with the matrix() function. The general format is:
mymatrix <- matrix(vector, nrow=number_of_rows, ncol=number_of_columns,
byrow=logical_value,
dimnames=list(char_vector_rownames,
char_vector_colnames))
y <- matrix(1:20, nrow=5, ncol=4)
y
##      [,1] [,2] [,3] [,4]
## [1,]    1    6   11   16
## [2,]    2    7   12   17
## [3,]    3    8   13   18
## [4,]    4    9   14   19
## [5,]    5   10   15   20
# What happens if we add byrow = TRUE?
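A small comparison, using a 2 x 3 matrix so the difference is easy to see (the names `y_col` and `y_row` are illustrative):

```r
# Default: the matrix is filled column by column.
y_col <- matrix(1:6, nrow = 2, ncol = 3)
y_col   # first row: 1 3 5, second row: 2 4 6

# With byrow = TRUE the matrix is filled row by row.
y_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
y_row   # first row: 1 2 3, second row: 4 5 6
```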
# You can give names to the rows and columns:
cells <- c(1,26,24,68,-3,0.3,110,17,2)
rnames <- c("Row1", "Row2", "Row3")
cnames <- c("Col1", "Col2", "Col3")
mymatrix <- matrix(cells, ncol=3, byrow=TRUE,
dimnames=list(rnames, cnames))
mymatrix
##      Col1 Col2 Col3
## Row1    1   26 24.0
## Row2   68   -3  0.3
## Row3  110   17  2.0
# Multiply a matrix by a scalar (for example, to change units).
# Can you extract specific elements of a matrix? (Look back at vectors, and use the format [rows, columns])
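One possible solution to the two exercises above, as a sketch. The matrix is rebuilt here so the block runs on its own; the factor of 10 is an arbitrary stand-in for a unit conversion.

```r
cells    <- c(1, 26, 24, 68, -3, 0.3, 110, 17, 2)
mymatrix <- matrix(cells, ncol = 3, byrow = TRUE,
                   dimnames = list(c("Row1", "Row2", "Row3"),
                                   c("Col1", "Col2", "Col3")))

# Multiplying by a scalar applies to every element.
scaled <- mymatrix * 10
scaled

# Extracting elements with [rows, columns]:
mymatrix[2, 3]              # single element: 0.3
mymatrix[1, ]               # the whole first row
mymatrix[, "Col2"]          # a column selected by name
mymatrix[c(1, 3), c(1, 2)]  # rows 1 and 3, columns 1 and 2
```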
# A matrix can be transposed (rows <-> columns) using t(...)
tmatrix <- t(mymatrix)
tmatrix
##      Row1 Row2 Row3
## Col1    1 68.0  110
## Col2   26 -3.0   17
## Col3   24  0.3    2
A data frame is more general than a matrix in that different columns can contain different modes of data (numeric, character, and so on). Data frames are the most common data structure you’ll deal with in R. A data frame is created with the data.frame() function:
mydata <- data.frame(col1, col2, col3,…)
where col1, col2, col3, and so on are column vectors of any type (such as character, numeric, or logical).
# Creating a data frame from vectors:
patientID <- c(1, 2, 3, 4)
age <- c(25, 34, 28, 52)
diabetes <- c("Type1", "Type2", "Type1", "Type1")
status <- c("Poor", "Improved", "Excellent", "Poor")
patientdata <- data.frame(patientID, age, diabetes, status)
patientdata
# Changing the names of the columns:
newpatientdata <- setNames(patientdata, c("ID","edad","DiabT","Estado"))
newpatientdata
There are several ways to select the elements of a data frame:
# Using the subscript notation (as in a matrix):
selectpatientdata <- patientdata[ ,1:2] # the blank for row index means that all rows are selected
selectpatientdata
# Using the columns' names:
selectpatientdata <- patientdata[,c("diabetes", "status")]
selectpatientdata
# Using the object$variable notation:
selectpatientdata <- patientdata$age
selectpatientdata # Note that the output is a vector
## [1] 25 34 28 52
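The same bracket notation can also select rows, including rows that meet a condition. This is a sketch that goes slightly beyond the three notations above; the object names `older`, `type1_age` and `age_df` are illustrative.

```r
patientdata <- data.frame(
  patientID = c(1, 2, 3, 4),
  age       = c(25, 34, 28, 52),
  diabetes  = c("Type1", "Type2", "Type1", "Type1"),
  status    = c("Poor", "Improved", "Excellent", "Poor")
)

# Rows where age is above 30 (all columns kept).
older <- patientdata[patientdata$age > 30, ]
older

# Ages of the Type1 patients only (the result is a vector).
type1_age <- patientdata[patientdata$diabetes == "Type1", "age"]
type1_age

# drop = FALSE keeps a single selected column as a data frame.
age_df <- patientdata[, "age", drop = FALSE]
class(age_df)
```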
Categorical (nominal) and ordered categorical (ordinal) variables in R are called factors. Factors are crucial in R because they determine how data is analyzed and presented visually.
With categorical variables you can create cross-tabulations, using the table function:
table(patientdata$diabetes, patientdata$status)
##        
##         Excellent Improved Poor
##   Type1         1        0    2
##   Type2         0        1    0
The function factor(…) stores the categorical values as a vector of integers, each associated with a different value of the categorical variable. This is very important for data analysis:
patientdata
statuslevel <- factor(status)
statuslevel
## [1] Poor Improved Excellent Poor
## Levels: Excellent Improved Poor
# Factor levels as numbers (easier to analyze)
as.numeric(statuslevel)
## [1] 3 2 1 3
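By default the levels are sorted alphabetically, which is why "Excellent" is coded as 1 above. For an ordinal variable the order can be stated explicitly. The sketch below assumes that Poor < Improved < Excellent is the intended order; the name `status_ord` is illustrative.

```r
status <- c("Poor", "Improved", "Excellent", "Poor")

# Ordered factor with an explicit level order.
status_ord <- factor(status,
                     levels  = c("Poor", "Improved", "Excellent"),
                     ordered = TRUE)
status_ord
as.integer(status_ord)      # 1 2 3 1

# Ordered factors can be compared against a level.
status_ord < "Excellent"    # TRUE TRUE FALSE TRUE
```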
chapter 5
R provides a wide range of tools for importing data and creating data frames. The definitive guide for importing data in R is the R Data Import/Export manual available at http://mng.bz/urwn.
We will consider four tools to input data for analyses:
Once a data frame has been created, you can edit it using the editData package (really an RStudio Addin).
You can manually enter (or paste) data into the RStudio script editor (or a chunk) in a table-style format, with column names and values (numeric or character), using the following code:
mydata <- read.table(header=TRUE, text = "
age gender weight
25 m 166
30 f 115
18 f 120
")
mydata
class(mydata)
## [1] "data.frame"
Alternatively, you can use vectors to create the data frame:
#vectors
age <- c(25,30,18)
gender <- c("m","f","f")
weight <- c(166,115,120)
#data.frame
mydata2 <- data.frame(age,gender,weight)
colnames(mydata2) <- c("Age (years)", "Gender", "Weight (lb)")
mydata2
CSV files are text files where the values in each line are separated by commas (e.g., 1,2,3,5,7,11). The file can be created with any text editor or exported from a spreadsheet application (Excel, for example). To read a CSV file, use the following basic code:
honeymoondata <- read.csv('honeymoon.csv', header = TRUE)
head(honeymoondata)
class(honeymoondata)
## [1] "data.frame"
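If honeymoon.csv is not at hand, the same steps can be tried on a file created on the spot. The sketch below uses invented data (`trips`, with columns `couple` and `days`) and a temporary file, so nothing here reflects the actual workshop dataset.

```r
# A small made-up data frame, written to a temporary CSV file.
trips <- data.frame(couple = c("A", "B", "C"),
                    days   = c(7, 10, 5))
csv_file <- tempfile(fileext = ".csv")
write.csv(trips, csv_file, row.names = FALSE)

# The raw text of the file: a header line, then comma-separated values.
readLines(csv_file)

# Read it back; header = TRUE uses the first line as column names.
trips_back <- read.csv(csv_file, header = TRUE)
head(trips_back)
class(trips_back)
```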
RStudio provides a menu-driven tool to import data from an Excel file (and other file types, too). It requires that you have installed and loaded the readxl package. Use the menu Import Dataset, select the File, select the Sheet with the data you are interested in, and check First Row as Names if that is the case.
These menu actions will generate code that you can use in your own script.
library(readxl)
honeymoon <- read_excel("honeymoon.xlsx",
sheet = "Sheet1")
head(honeymoon)
class(honeymoon)
## [1] "tbl_df" "tbl" "data.frame"
Several data frames can be stored together in a single file (.Rdata) using the save function,
save(mydata, honeymoon, file = "rfile.Rdata")
and used as input later with the load function:
load("rfile.Rdata")
mydata
head(honeymoon)
class(honeymoon)
## [1] "tbl_df" "tbl" "data.frame"
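The round trip can be tried without the workshop files. The sketch below uses two invented data frames (`df_a`, `df_b`) and a temporary file; note that load() restores the objects under their original names and returns those names.

```r
# Two small made-up data frames.
df_a <- data.frame(x = 1:3)
df_b <- data.frame(y = c("a", "b"))

# Save both objects into one .Rdata file.
rdata_file <- tempfile(fileext = ".Rdata")
save(df_a, df_b, file = rdata_file)

# Remove them from the workspace, then restore them from the file.
rm(df_a, df_b)
loaded <- load(rdata_file)
loaded   # the names of the restored objects: "df_a" "df_b"
df_a
```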
Packing the data into an Rdata file and loading it later may be useful when you work on a project with several data frames and need to distribute and reuse them easily while preserving the original data (like teaching a workshop!).
chapter 6
Here is an assignment to complete; we will review it first thing next session.
Kabacoff, R., 2015. R in action: data analysis and graphics with R, Second edition. Manning, Shelter Island.
Verzani, J., 2012. Getting started with RStudio. O’Reilly, Sebastopol, Calif.