Communicating the results of an investigation to an audience, and receiving acknowledgement and comments on them, is as important as carrying out the data analysis itself (the main objective of this workshop). Communication is usually addressed at the end (as in the workshop book, chapter 22), but doing so can leave results, graphs, etc. scattered behind, and at some point we have to find, organize, and move them into a final document. In this workshop we will work from the beginning with a format that lets us use the R platform for data analysis and visualization while producing documents that contain the text, code, and products (results and graphs) of the workshop activities.
Creating an R Notebook
(An R Notebook is an R Markdown document with some extra options, such as Preview.)
Use the menu in the upper-left corner:

Select R Notebook.
Editing the document
The R Notebook has three types of content:
- an (optional) YAML header surrounded by ---
- R code chunks surrounded by ```
- text mixed with simple text formatting
The YAML header is a metadata section of the document, and can be edited to include basic information of the document:

You can change (or remove) the title, author, and date. The output option is created by RStudio and depends on the output format you produce (HTML, PDF, Word). An R Markdown document (html_document) can be converted into an R Notebook (html_notebook) and vice versa.
The text and other document features (web links, images) are edited or created using R Markdown syntax.
Inserting R code chunks
A code chunk is simply a piece of R code embedded in an R Markdown document. The format is:

Inside the {r} you can write chunk options that control the behavior of the code output, when producing the final document.
Another way to create code chunks (including chunks for other scripting languages) and to select options is to use the drop-down menu.
Executing code chunks and controlling output
Code in the notebook can be executed in different ways:
- Use the green triangle button on the toolbar of a code chunk (tool tip “Run Current Chunk”), or press Ctrl + Shift + Enter (macOS: Cmd + Shift + Enter), to run the current chunk.
- Press Ctrl + Enter (macOS: Cmd + Enter) to run just the current statement.
- Click the Run menu on the editor toolbar to run a batch of chunks, with options such as “Run All”, “Run All Chunks Above”, and “Run All Chunks Below”.
In the previous section you learned how to control chunk output with several options. You can also select these options from the chunk toolbar.
Preview and Knit
To obtain a preview of the R Notebook (without running the code), or to produce a different format (HTML, PDF, or Word, with code output), use the following menu. If you start with an R Markdown document instead of an R Notebook, you will see the word Knit instead of Preview:

Producing a PDF document requires a LaTeX installation on your computer (which usually needs more than 1 GB!).
Exercises
- Open a new R Notebook and save it as DayOneExercises??? (??? = your initials). You will be asked for the name when you save it.
- Edit the YAML metadata information, including a title, your name, and the date.
- Write a short paragraph about your municipality that includes text in different styles, a web link, and an image. Preview and edit as necessary.
- Write code to calculate the area of a geometric figure. Use the code chunk format (you can start in an R script, then create a chunk in the document and copy the code into it). Run the chunk. Change the output options for the chunk and run it again. When finished, preview the notebook.
Datasets
The first step in any data analysis is the creation of a dataset containing the information to be studied, in a format that meets your needs. In R, this task involves the following:
- Selecting a data structure to hold your data
- Entering or importing your data into the data structure
Data structures
R has a wide variety of objects for holding data, including scalars, vectors, matrices, arrays, data frames, and lists. They differ in terms of the type of data they can hold, how they’re created, their structural complexity, and the notation used to identify and access individual elements.
R data structures:
Vectors
Vectors are one-dimensional arrays that can hold numeric data, character data, or logical data. The combine function c(…) is used to form the vector.
a <- c(1, 2, 5, 3, 6, -2, 2.3)
a
b <- c("one", "two", "three")
b
c <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)
c
# What happens if we combine different types of data?
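One way to explore the question above is the short sketch below. The object names mixed and num_logical are made up for this illustration and are not part of the workshop code.

```r
# Mixing types in c() coerces every element to a single common type.
mixed <- c(1, "two", TRUE)
mixed           # all three elements are now character strings
class(mixed)

# Logical and numeric values combine into a numeric vector (TRUE becomes 1).
num_logical <- c(2.5, TRUE, FALSE)
num_logical
class(num_logical)
```

A vector always holds a single type, so R silently converts the elements rather than raising an error.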
Vectors may also be generated using the seq(…) function, when you want a sequence of numbers, evenly spaced by a specified interval.
d <- seq(3, 11, by = 0.7)
d
dd <- c(1:4, seq(2.1, 4.5, by = 1.1))
dd
# You can use operators (+, -, *, /, etc.) to create the vector; try them.
You can refer to elements of a vector using a numeric vector of positions within brackets. For example, a[c(2, 4)] refers to the second and fourth elements of vector a.
e <- c("k", "j", "h", "a", "c", "m")
e[3]
e[c(1, 3, 5)]
e[2:6]
# What happens if you use e[]?
We can do operations with vectors.
f <- c(1, 2, 3, 5, 7)
g <- c(2, 4, 6, 8, 10)
f + g
f / g
# What happens if vectors are of different lengths?
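As a hedged illustration of operating on vectors of different lengths, the sketch below reuses f from the example above and adds a hypothetical shorter vector named short. The warning is silenced here only so that the result can be stored cleanly.

```r
f <- c(1, 2, 3, 5, 7)
short <- c(10, 20)   # hypothetical shorter vector, not in the workshop code

# A single number is reused for every element of the vector.
f * 2

# The shorter vector is repeated to match the longer one. Because 5 is not
# a multiple of 2, R returns a result and also issues a warning
# (silenced here with suppressWarnings()).
recycled <- suppressWarnings(f + short)
recycled
```

Running f + short without suppressWarnings() shows the warning message, which is worth reading once.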
Matrices
A matrix is a two-dimensional array in which each element has the same mode (numeric, character, or logical). Matrices are created with the matrix() function. The general format is:
mymatrix <- matrix(vector, nrow=number_of_rows, ncol=number_of_columns,
                   byrow=logical_value,
                   dimnames=list(char_vector_rownames,
                                 char_vector_colnames))
y <- matrix(1:20, nrow=5, ncol=4)
y
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20
# What happens if we add byrow = TRUE?
cells <- c(1,26,24,68,-3,0.3,110,17,2)
rnames <- c("Row1", "Row2", "Row3")
cnames <- c("Col1", "Col2", "Col3")
mymatrix <- matrix(cells, ncol=3, byrow=TRUE,
                   dimnames=list(rnames, cnames))
mymatrix
     Col1 Col2 Col3
Row1    1   26 24.0
Row2   68   -3  0.3
Row3  110   17  2.0
# Multiply a matrix by a scalar (for example, to change units).
# Can you extract specific elements of a matrix? (Look back in vectors, and use the format [rows,columns])
# A matrix can be transposed (rows <-> columns) using t(...)
tmatrix <- t(mymatrix)
tmatrix
     Row1 Row2 Row3
Col1    1 68.0  110
Col2   26 -3.0   17
Col3   24  0.3    2
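The prompts above about scalar multiplication and extracting elements can be approached as in the sketch below. It rebuilds mymatrix so that it runs on its own; the factor 10 and the positions selected are arbitrary choices for illustration.

```r
cells  <- c(1, 26, 24, 68, -3, 0.3, 110, 17, 2)
rnames <- c("Row1", "Row2", "Row3")
cnames <- c("Col1", "Col2", "Col3")
mymatrix <- matrix(cells, ncol = 3, byrow = TRUE,
                   dimnames = list(rnames, cnames))

scaled <- mymatrix * 10    # every element multiplied by the scalar 10
scaled

mymatrix[2, ]              # second row, all columns
mymatrix[, "Col3"]         # third column, selected by name
mymatrix[1, c(2, 3)]       # row 1, columns 2 and 3
mymatrix[3, 1]             # a single element
```

As with data frames, leaving the row or column position blank selects all rows or all columns.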
Data frames
A data frame is more general than a matrix in that different columns can contain different modes of data (numeric, character, and so on). Data frames are the most common data structure you’ll deal with in R. A data frame is created with the data.frame() function:
mydata <- data.frame(col1, col2, col3,…)
where col1, col2, col3, and so on are column vectors of any type (such as character, numeric, or logical). Names for each column can be provided with the names function.
# Creating a data frame from vectors:
patientID <- c(1, 2, 3, 4)
age <- c(25, 34, 28, 52)
diabetes <- c("Type1", "Type2", "Type1", "Type1")
status <- c("Poor", "Improved", "Excellent", "Poor")
patientdata <- data.frame(patientID, age, diabetes, status)
patientdata
# Changing the names of the columns:
newpatientdata <- setNames(patientdata, c("ID","edad","DiabT","Estado"))
newpatientdata
There are several ways to select the elements of a data frame:
# Using the subscript notation (as in a matrix):
selectpatientdata <- patientdata[ ,1:2] # the blank for row index means that all rows are selected
selectpatientdata
# Using the columns' names:
selectpatientdata <- patientdata[,c("diabetes", "status")]
selectpatientdata
# Using the object$variable notation:
selectpatientdata <- patientdata$age
selectpatientdata
[1] 25 34 28 52
# Note that output is a vector
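The names function mentioned earlier can also be used to rename columns, as an alternative to setNames. The sketch below rebuilds patientdata so that it runs on its own; the new column names and the copy called renamed are hypothetical.

```r
patientdata <- data.frame(
  patientID = c(1, 2, 3, 4),
  age       = c(25, 34, 28, 52),
  diabetes  = c("Type1", "Type2", "Type1", "Type1"),
  status    = c("Poor", "Improved", "Excellent", "Poor")
)

renamed <- patientdata             # work on a copy, keep the original intact
names(renamed)                     # current column names
names(renamed) <- c("ID", "Age", "DiabetesType", "Status")  # replace all names
names(renamed)[2] <- "AgeYears"    # rename a single column by position
str(renamed)
```

Unlike setNames, which returns a renamed copy, assigning to names() changes the object in place.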
Factors
Categorical (nominal) and ordered categorical (ordinal) variables in R are called factors. Factors are crucial in R because they determine how data is analyzed and presented visually.
With categorical variables you can create cross-tabulations, using the table function:
table(patientdata$diabetes, patientdata$status)
The factor(…) function stores the categorical values as a vector of integers, with one integer code for each distinct value of the categorical variable. This is very important for data analysis:
statuslevel <- factor(status)
statuslevel
as.numeric(statuslevel)
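For an ordinal variable, the order of the levels usually matters. The sketch below assumes that status runs from Poor to Improved to Excellent; that ordering and the name statusordered are choices made for this illustration.

```r
status <- c("Poor", "Improved", "Excellent", "Poor")

# By default, factor() sorts the levels alphabetically.
levels(factor(status))

# For an ordinal variable, state the order of the levels explicitly.
statusordered <- factor(status, ordered = TRUE,
                        levels = c("Poor", "Improved", "Excellent"))
statusordered
as.numeric(statusordered)    # integer codes now follow the stated order
statusordered < "Excellent"  # comparisons are defined for ordered factors
```

Without the levels argument, Excellent would receive code 1 and Poor code 3, which reverses the intended ranking.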
Descriptive Statistics
R has many ways to calculate descriptive statistics on datasets. Some are included in the base installation; others come in packages that must be downloaded and installed (using install.packages) and then loaded into the environment (using library).
Using summary
Let’s start with the basic procedure (the summary function), using the mtcars dataset provided with your book’s code.
# create a data frame with the CSV file and make it available for the procedure
mtcars <- read.csv("mtcars.csv")
# create a list of variables to analyze
myvars <- c("mpg", "hp", "wt")
# use summary with the selected variables
summary(mtcars[myvars])
Using function and sapply
A more general way to calculate basic statistics is the sapply function, with the syntax:
sapply(x, FUN, options)
where FUN is a built-in function (such as mean or sd) or a user-defined function. A user-defined function has the following syntax:
myfunction <- function(arg1, arg2, … ){
  statements
  return(object)
}
# defining function
mystats <- function(x){
  m <- mean(x)
  n <- length(x)
  s <- sd(x)
  skew <- sum((x-m)^3/s^3)/n
  kurt <- sum((x-m)^4/s^4)/n - 3
  return(c(n=n, mean=m, stdev=s, skew=skew, kurtosis=kurt))
}
# arguments (variables) to use
myvars <- c("mpg", "hp", "wt")
# sapply function on dataset
sapply(mtcars[myvars], mystats)
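sapply also works directly with built-in functions, and any extra arguments are passed on to FUN. The sketch below assumes that the mtcars data set bundled with R (datasets::mtcars) can stand in for the workshop's mtcars.csv; the object names cars, means, and quartiles are made up for this example.

```r
# Assumption: the built-in data set has the same mpg, hp, and wt columns
# as the CSV file used in the workshop.
cars <- datasets::mtcars
myvars <- c("mpg", "hp", "wt")

means <- sapply(cars[myvars], mean)        # one value per column
means
sapply(cars[myvars], sd)
sapply(cars[myvars], mean, na.rm = TRUE)   # na.rm = TRUE is passed on to mean()

# When FUN returns several values, the result is a matrix.
quartiles <- sapply(cars[myvars], quantile, probs = c(0.25, 0.75))
quartiles
```

The result has one element (or one column) per variable, named after the columns of the data frame.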
Using aggregate
When you have variables that can be treated as factors, you can use the aggregate function to obtain basic statistics grouped by those factors. The function has the following syntax:
aggregate(x, by, FUN)
#read data and attach object for easier use of the variables
mtcars <- read.csv('mtcars.csv')
attach(mtcars)
#calculate the mean of numeric variables, aggregated by cyl and gear as factors; missing values are removed
aggdata <- aggregate(mtcars[c('mpg','hp','wt')], by=list(cyl,gear), FUN = mean, na.rm = TRUE)
#set names for resulting data table - what happens if we skip this line of code?
aggdata <- setNames(aggdata, c("CYL","GEAR","mpg","hp","wt"))
aggdata
detach(mtcars) #why detach?
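A possible variation, sketched below, avoids attach and detach by referring to the columns with the $ notation and by naming the grouping variables inside the list. It assumes that the built-in datasets::mtcars can stand in for mtcars.csv; the names cars, aggdata2, and aggformula are made up for this example.

```r
# Assumption: the built-in data set matches the workshop's CSV file.
cars <- datasets::mtcars

# Naming the list elements sets the names of the grouping columns directly.
aggdata2 <- aggregate(cars[c("mpg", "hp", "wt")],
                      by = list(CYL = cars$cyl, GEAR = cars$gear),
                      FUN = mean, na.rm = TRUE)
aggdata2

# The formula interface of aggregate() expresses the same grouping.
aggformula <- aggregate(cbind(mpg, hp, wt) ~ cyl + gear,
                        data = cars, FUN = mean)
aggformula
```

Both calls return one row for each combination of cyl and gear that occurs in the data.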
