R

R is an open-source programming environment for statistical computing and data visualization, among other capabilities.

Exercises

  1. Open R (without RStudio):
  • use it as a scientific calculator
  • assign values or arithmetic formulas to R objects (variables): x = 3 or x <- 3?
  • edit calculations and formulas
  2. Look up the R menus
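
As a starting point for the first exercise, here is a minimal sketch of typing arithmetic at the R console and storing results in objects. The object names (x, y, radius, area) are illustrative choices, not part of the workshop material.

```r
# R as a calculator: type an expression and R prints the result
2 + 3 * 4          # 14 (multiplication is done before addition)
sqrt(16)           # 4
10 %/% 3           # 3 (integer division)
10 %% 3            # 1 (remainder)

# Assignment: both forms create an object
x <- 3             # the conventional assignment operator
y = 3              # also works at the console
identical(x, y)    # TRUE

# Store the result of a formula, then edit the input and recompute
radius <- 2
area <- pi * radius^2
area
```

Both `=` and `<-` assign at the console; `<-` is the form most R code uses, which avoids confusion with `=` inside function calls.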

extra

Why use R?

RStudio

Programming, editing, project management, and other procedures in R are easier and more efficient with an integrated development environment (IDE); the most widely used and versatile one is RStudio.

Exercises

  1. Open RStudio:
  • browse all the components of the interface
  • establish a Working Directory
  • create an R Script (compare with the R console)
  • write code to calculate the area of a geometric surface, and run it
  • explore other windows and menus of RStudio
  • load packages: ggplot2, ggversa
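
A script for the area exercise could look like the sketch below. The shape (a triangle) and its dimensions are arbitrary illustrative values, and the package is loaded only if it is already installed.

```r
# Area of a triangle from its base and height (illustrative values)
base_length <- 6
height <- 4
triangle_area <- base_length * height / 2
triangle_area      # 12

# Show the current working directory (where R reads and writes files)
getwd()

# Load a package only if it is installed, so the script does not stop with an error
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
}
```

In a script, nothing runs until you send lines to the console (for example with Ctrl + Enter), which is the main difference from typing directly in the console.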

extra

Packages


R Markdown (or R Notebook)

Communicating results to an audience, and receiving acknowledgement and comments on them, is as important as carrying out the data analysis itself (the main objective of this workshop). Communication is usually addressed at the end (as in the workshop book, chapter 22), but that approach can leave results, graphs, etc. scattered, to be found, organized, and moved into a final document later. In this workshop we will work from the beginning with a format that lets us use R for data analysis and visualization while producing documents that contain the text, code, and products (results and graphs) of the workshop activities.

Creating an R Notebook

(An R Notebook is an R Markdown document with some extra options, such as a Preview.)

Use the left-upper corner menu:


Select R Notebook.

Editing the document

The R Notebook has three types of content:

  • an (optional) YAML header surrounded by ---
  • R code chunks surrounded by ```
  • text mixed with simple text formatting

The YAML header is a metadata section of the document, and can be edited to include basic information of the document:


You can change (or remove) the title, author, and date. The output option is created by RStudio and depends on the output format that you produce (HTML, PDF, Word). An R Markdown document (html_document) can be transformed into an R Notebook (html_notebook) and vice versa.

The text and other document features (web links, images) are edited or created using R Markdown syntax.

Inserting R code chunks

A code chunk is simply a piece of R code embedded in an R Markdown document. The format is:


Inside the {r} you can write chunk options that control the behavior of the code output, when producing the final document.

Another way to create code chunks (including chunks for other scripting languages) and select options is the drop-down menu.

extra

inline code

Executing code chunks and controlling output

Code in the notebook can be executed in different ways:

  1. Use the green triangle button on the toolbar of a code chunk, which has the tooltip “Run Current Chunk”, or press Ctrl + Shift + Enter (macOS: Cmd + Shift + Enter) to run the current chunk.

  2. Press Ctrl + Enter (macOS: Cmd + Enter) to run just the current statement.

  3. Use the Run menu on the editor toolbar to run a batch of chunks, with options such as “Run All”, “Run All Chunks Above”, and “Run All Chunks Below”.

In the previous section you learned how to control chunk output with several options. You can also select the options using the gear button on the chunk's toolbar.

Preview and Knit

To obtain a preview of the R Notebook (which does not run the code), or to produce the document in a different format (HTML, PDF, Word, with code output), use the following menu. If you start with an R Markdown document instead of an R Notebook, you will see the word Knit instead of Preview:


Producing a PDF document requires a LaTeX distribution installed on your computer (usually more than 1 GB!).

extras

TinyTeX
RPubs

Exercises

  1. Open a new R Notebook and name it DayOneExercises??? (??? = your initials). You will be asked for the name when you save it.
  2. Edit the YAML metadata, including a title, your name, and the date.
  3. Write a short paragraph about your municipality that includes text in different styles, a web link, and an image. Preview and edit as necessary.
  4. Write code to calculate the surface area of a geometric figure. Use the code chunk format (you can start in an R Script, then create a chunk in the document and copy the code into it). Run the chunk. Change the output options for the chunk and run it again. When finished, preview the notebook.

Datasets

The first step in any data analysis is the creation of a dataset containing the information to be studied, in a format that meets your needs. In R, this task involves the following:

Data structures

R has a wide variety of objects for holding data, including scalars, vectors, matrices, arrays, data frames, and lists. They differ in terms of the type of data they can hold, how they’re created, their structural complexity, and the notation used to identify and access individual elements.

R data structures:


Vectors

Vectors are one-dimensional arrays that can hold numeric data, character data, or logical data. The combine function c(…) is used to form the vector.

a <- c(1, 2, 5, 3, 6, -2, 2.3)
a
## [1]  1.0  2.0  5.0  3.0  6.0 -2.0  2.3
b <- c("one", "two", "three")
b
## [1] "one"   "two"   "three"
c <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)
c
## [1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE
# What happens if we combine different types of data?
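
One way to explore that question is the short sketch below: a vector can hold only one type, so R converts (coerces) mixed values to the most general type present. The object names are illustrative.

```r
# Mixing numbers, text, and logicals: everything becomes character
mixed <- c(1, "two", TRUE)
mixed            # "1"    "two"  "TRUE"
class(mixed)     # "character"

# Mixing numbers and logicals: TRUE becomes 1 and FALSE becomes 0
num_log <- c(1.5, TRUE, FALSE)
num_log          # 1.5 1.0 0.0
class(num_log)   # "numeric"
```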

Vectors may also be generated using the seq(…) function, when you want a sequence of numbers, evenly spaced by a specified interval.

d <- seq(3, 11, by = 1)
d
## [1]  3  4  5  6  7  8  9 10 11
dd <- c(1:4, seq(2.1, 4.5, by = 1.1))
dd
## [1] 1.0 2.0 3.0 4.0 2.1 3.2 4.3
# You can use operators (+, -, *, /, etc.) to create the vector; try them.
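
A few more ways to generate vectors are sketched here, using seq() with a fixed number of points instead of a step, and arithmetic operators applied to a sequence. The object names are illustrative.

```r
# Ask for a given number of evenly spaced values instead of a step size
halves <- seq(0, 2, length.out = 5)
halves     # 0.0 0.5 1.0 1.5 2.0

# Operators act on every element of a sequence
evens <- 2 * (1:5)
evens      # 2 4 6 8 10
squares <- (1:5)^2
squares    # 1 4 9 16 25

# rep() repeats values
labels <- rep(c("a", "b"), times = 3)
labels     # "a" "b" "a" "b" "a" "b"
```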

You can refer to elements of a vector using a numeric vector of positions within brackets. For example, e[c(2, 4)] refers to the second and fourth elements of vector e.

e <- c("k", "j", "h", "a", "c", "m")
e[3]
## [1] "h"
e[c(1, 3, 5)]
## [1] "k" "h" "c"
e[2:6]
## [1] "j" "h" "a" "c" "m"
# What happens if you use e[]?
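
The sketch below shows a few other indexing forms worth trying: empty brackets, negative positions, a position past the end, and a logical condition. The scores vector is an illustrative addition.

```r
e <- c("k", "j", "h", "a", "c", "m")

e[]              # empty brackets return the whole vector
e[-1]            # negative index drops an element: "j" "h" "a" "c" "m"
e[length(e)]     # last element: "m"
e[10]            # a position past the end gives NA

# A logical condition keeps the elements where it is TRUE
scores <- c(4, 9, 2, 7)
scores[scores > 5]   # 9 7
```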

We can do operations with vectors.

f <- c(1, 2, 3, 5, 7)
g <- c(2, 4, 6, 8, 10)

f + g
## [1]  3  6  9 13 17
f / g
## [1] 0.500 0.500 0.500 0.625 0.700
# What happens if vectors are of different length?
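
To explore that question: R recycles the shorter vector until it matches the longer one. The sketch below uses illustrative vectors h, k, and short; suppressWarnings() is used only so the last line runs quietly.

```r
f <- c(1, 2, 3, 5, 7)

# A single number is recycled across every element
f * 2                 # 2 4 6 10 14

# A shorter vector is repeated to fill the longer one
h <- c(1, 2, 3, 4, 5, 6)
k <- c(10, 20)
h + k                 # 11 22 13 24 15 26

# If the longer length is not a multiple of the shorter one,
# R still recycles but issues a warning
short <- c(1, 2, 3)
result <- suppressWarnings(f + short)
result                # 2 4 6 6 9
```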

Matrices

A matrix is a two-dimensional array in which each element has the same mode (numeric, character, or logical). Matrices are created with the matrix() function. The general format is:

mymatrix <- matrix(vector, nrow=number_of_rows, ncol=number_of_columns,
                   byrow=logical_value,
                   dimnames=list(char_vector_rownames,
                                 char_vector_colnames))

y <- matrix(1:20, nrow=5, ncol=4)   
y
##      [,1] [,2] [,3] [,4]
## [1,]    1    6   11   16
## [2,]    2    7   12   17
## [3,]    3    8   13   18
## [4,]    4    9   14   19
## [5,]    5   10   15   20
# What happens if we add byrow = TRUE?
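
A quick comparison of the two fill orders, as a sketch with illustrative object names: by default matrix() fills column by column, and byrow = TRUE fills row by row.

```r
by_col <- matrix(1:20, nrow = 5, ncol = 4)                 # default: fill by columns
by_row <- matrix(1:20, nrow = 5, ncol = 4, byrow = TRUE)   # fill by rows

by_col[1, ]    # 1 6 11 16
by_row[1, ]    # 1 2 3 4
dim(by_row)    # 5 4 (the shape does not change, only the fill order)
```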

# You can put names to rows and columns:
cells    <- c(1,26,24,68,-3,0.3,110,17,2)
rnames   <- c("Row1", "Row2", "Row3")
cnames   <- c("Col1", "Col2", "Col3")
mymatrix <- matrix(cells, ncol=3, byrow=TRUE,   
                     dimnames=list(rnames, cnames)) 
mymatrix
##      Col1 Col2 Col3
## Row1    1   26 24.0
## Row2   68   -3  0.3
## Row3  110   17  2.0
# Multiply a matrix by a scalar (for example, to change units).

# Can you extract specific elements of a matrix? (Look back in vectors, and use the format [rows,columns])

# A matrix can be transposed (rows <-> columns) using t(...)
tmatrix <- t(mymatrix)
tmatrix
##      Row1 Row2 Row3
## Col1    1 68.0  110
## Col2   26 -3.0   17
## Col3   24  0.3    2
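
The two questions posed in the comments above (scaling a matrix and extracting elements) can be explored with a sketch like this one. It rebuilds mymatrix so that it runs on its own; the factor of 10 is an arbitrary example.

```r
cells    <- c(1, 26, 24, 68, -3, 0.3, 110, 17, 2)
mymatrix <- matrix(cells, ncol = 3, byrow = TRUE,
                   dimnames = list(c("Row1", "Row2", "Row3"),
                                   c("Col1", "Col2", "Col3")))

# Multiplying by a scalar rescales every cell
scaled <- mymatrix * 10
scaled[1, 2]                 # 260

# Extract with [rows, columns], by position or by name
mymatrix[2, 3]               # 0.3
mymatrix["Row3", "Col1"]     # 110
mymatrix[, 2]                # the whole second column
sub <- mymatrix[1:2, c("Col1", "Col3")]   # a 2 x 2 sub-matrix
sub
```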

Data frames

A data frame is more general than a matrix in that different columns can contain different modes of data (numeric, character, and so on). Data frames are the most common data structure you’ll deal with in R. A data frame is created with the data.frame() function:

mydata <- data.frame(col1, col2, col3,…)

where col1, col2, col3, and so on are column vectors of any type (such as character, numeric, or logical).

# Creating a data frame from vectors:
patientID <- c(1, 2, 3, 4)
age <- c(25, 34, 28, 52)
diabetes <- c("Type1", "Type2", "Type1", "Type1")
status <- c("Poor", "Improved", "Excellent", "Poor")
patientdata <- data.frame(patientID, age, diabetes, status)
patientdata
# Changing the names of the columns:
newpatientdata <- setNames(patientdata, c("ID","edad","DiabT","Estado"))
newpatientdata

There are several ways to select the elements of a data frame:

# Using the subscript notation (as in a matrix):
selectpatientdata <- patientdata[ ,1:2]  # the blank for row index means that all rows are selected
selectpatientdata
# Using the columns' names:
selectpatientdata <- patientdata[,c("diabetes", "status")]
selectpatientdata
# Using the object$variable notation:
selectpatientdata <- patientdata$age
selectpatientdata   # Note that the output is a vector
## [1] 25 34 28 52
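
Rows can be selected as well as columns. The sketch below rebuilds patientdata so that it runs on its own, then filters rows with a logical condition; the names older and type1_ages are illustrative.

```r
patientdata <- data.frame(
  patientID = c(1, 2, 3, 4),
  age       = c(25, 34, 28, 52),
  diabetes  = c("Type1", "Type2", "Type1", "Type1"),
  status    = c("Poor", "Improved", "Excellent", "Poor")
)

# Inspect the structure: one line per column, with its type
str(patientdata)

# Keep only the rows where a condition is TRUE (blank column index = all columns)
older <- patientdata[patientdata$age > 30, ]
older

# Combine a row condition with a column name
type1_ages <- patientdata[patientdata$diabetes == "Type1", "age"]
type1_ages     # 25 28 52
```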

Factors

Categorical (nominal) and ordered categorical (ordinal) variables in R are called factors. Factors are crucial in R because they determine how data is analyzed and presented visually.

With categorical variables you can create cross-tabulations, using the table function:

table(patientdata$diabetes, patientdata$status)
##        
##         Excellent Improved Poor
##   Type1         1        0    2
##   Type2         0        1    0

The function factor(…) stores the categorical values as a vector of integers, one for each distinct value of the categorical variable. This is very important for data analysis:

patientdata
statuslevel <- factor(status)
statuslevel
## [1] Poor      Improved  Excellent Poor     
## Levels: Excellent Improved Poor
as.numeric(statuslevel)
## [1] 3 2 1 3
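
By default the levels are sorted alphabetically, which is why Excellent is coded 1 above. For an ordinal variable you can set the order yourself. This is a sketch, and the ordering Poor < Improved < Excellent is an assumption about what the categories mean.

```r
status <- c("Poor", "Improved", "Excellent", "Poor")

# Ordered factor with an explicit level order (assumed: Poor < Improved < Excellent)
status_ord <- factor(status,
                     levels = c("Poor", "Improved", "Excellent"),
                     ordered = TRUE)
status_ord
as.integer(status_ord)     # 1 2 3 1

# Ordered factors support comparisons
status_ord > "Poor"        # FALSE TRUE TRUE FALSE

# Tables now follow the level order instead of the alphabet
table(status_ord)
```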

Data input

R provides a wide range of tools for importing data and creating data frames. The definitive guide for importing data in R is the R Data Import/Export manual, available at http://mng.bz/urwn.
We will consider four tools for data input:

Once you have created a data frame, you can edit it using the editData package (really an RStudio Addin).

Manually

You can manually enter (or paste) data into the RStudio script editor (or a chunk), with column names and values (numeric or character) in a table-style format, using the following code:

mydata <- read.table(header=TRUE, text = "
age gender weight
25 m 166
30 f 115
18 f 120
")
mydata
class(mydata)
## [1] "data.frame"

Reading CSV files

CSV files are text files in which the values on each line are separated by commas (e.g. 1,2,3,5,7,11). The file can be created with any text editor or exported from a spreadsheet application (Excel, for example). To read a CSV file, use the following basic code:

honeymoondata <- read.csv('honeymoon.csv', header = TRUE)
honeymoondata
class(honeymoondata)
## [1] "data.frame"
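
If honeymoon.csv is not at hand, the round trip can be tried with a temporary file. This sketch reuses the values from the manual-entry example; the object names and the temporary path are illustrative.

```r
# A small data frame to export
sample_data <- data.frame(age    = c(25, 30, 18),
                          gender = c("m", "f", "f"),
                          weight = c(166, 115, 120))

# Write it to a temporary CSV file (row.names = FALSE avoids an extra column)
csv_path <- tempfile(fileext = ".csv")
write.csv(sample_data, csv_path, row.names = FALSE)

# Look at the raw text, then read it back into a data frame
readLines(csv_path)
from_csv <- read.csv(csv_path, header = TRUE)
str(from_csv)

# Remove the temporary file
unlink(csv_path)
```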

Importing data from an Excel file

RStudio provides a menu-driven tool to import data from an Excel file (and other file types, too). It requires that the readxl package is installed and loaded. Use the Import Dataset menu, select the file, select the sheet with the data you are interested in, and check First Row as Names if that applies.


These menu actions generate code that you can use in your own procedure.

library(readxl)
honeymoon <- read_excel("honeymoon.xlsx", 
    sheet = "Sheet1")
View(honeymoon)
class(honeymoon)
## [1] "tbl_df"     "tbl"        "data.frame"

Rdata files

Several data objects can be stored together in a single file (.Rdata) using the save function,

save(mydata, honeymoon, file = "rfile.Rdata")

and use them as input later with the load function:

load("rfile.Rdata")
mydata
honeymoon
class(honeymoon)
## [1] "tbl_df"     "tbl"        "data.frame"

Packing and loading data in an .Rdata file is useful when a project has several data frames that you need to distribute and reuse easily while preserving the original data (as when teaching a workshop!).
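
The save/load cycle can be checked end to end with a temporary file, as in this sketch. The objects and the path are illustrative; note that load() restores objects under their original names and returns those names.

```r
mydata <- data.frame(age = c(25, 30, 18), weight = c(166, 115, 120))
scores <- c(1.5, 2.5)

# Save two objects into one file
rdata_path <- tempfile(fileext = ".Rdata")
save(mydata, scores, file = rdata_path)

# Remove them from the workspace, then bring them back
rm(mydata, scores)
restored <- load(rdata_path)
restored      # "mydata" "scores"
mydata

# Remove the temporary file
unlink(rdata_path)
```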


Descriptive Statistics

R has many ways to calculate descriptive statistics on datasets. Some are included in the base installation; others come in packages that must be downloaded and installed (using install.packages) and then loaded into the environment (using library).

Using summary

Let's start with the basic procedure (the summary function), using the mtcars dataset provided with your book’s code.

# create a data frame with the CSV file and make it available for the procedure
mtcars <- read.csv("mtcars.csv")

# create a list of variables to analyze
myvars <- c("mpg", "hp", "wt")

# use summary with the selected variables
summary(mtcars[myvars])
##       mpg              hp              wt       
##  Min.   :10.40   Min.   : 52.0   Min.   :1.513  
##  1st Qu.:15.43   1st Qu.: 96.5   1st Qu.:2.581  
##  Median :19.20   Median :123.0   Median :3.325  
##  Mean   :20.09   Mean   :146.7   Mean   :3.217  
##  3rd Qu.:22.80   3rd Qu.:180.0   3rd Qu.:3.610  
##  Max.   :33.90   Max.   :335.0   Max.   :5.424
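
If the CSV file is not available, base R ships a built-in mtcars dataset that can stand in for it. This sketch assumes the built-in data match the book's file, which the summary above suggests. The name cars is illustrative.

```r
# Built-in copy of the dataset from the datasets package
cars <- datasets::mtcars
myvars <- c("mpg", "hp", "wt")

summary(cars[myvars])

# Individual statistics for one variable
range(cars$mpg)                                   # 10.4 33.9
median(cars$mpg)                                  # 19.2
quantile(cars$mpg, probs = c(0.25, 0.5, 0.75))    # quartiles
IQR(cars$mpg)                                     # interquartile range
```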

Using function and sapply

A more general way to calculate basic statistics is the sapply function, with the syntax:

sapply(x, FUN, options)

where FUN is a simple system function (like mean, sd, etc.) or a user-defined function, with the following syntax:

myfunction <- function(arg1, arg2, … ){
  statements
  return(object)
}

# defining function
mystats <- function(x){
  m <- mean(x)
  n <- length(x)
  s <- sd(x)
  skew <- sum((x-m)^3/s^3)/n
  kurt <- sum((x-m)^4/s^4)/n - 3
  return(c(n=n, mean=m, stdev=s, skew=skew, kurtosis=kurt))
}
# arguments (variables) to use
myvars <- c("mpg", "hp", "wt")
# sapply function on dataset
sapply(mtcars[myvars], mystats)
##                mpg          hp          wt
## n        32.000000  32.0000000 32.00000000
## mean     20.090625 146.6875000  3.21725000
## stdev     6.026948  68.5628685  0.97845744
## skew      0.610655   0.7260237  0.42314646
## kurtosis -0.372766  -0.1355511 -0.02271075
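
sapply also works with built-in functions, extra arguments, and short anonymous functions. This sketch uses the built-in mtcars dataset under the illustrative name cars.

```r
cars <- datasets::mtcars
myvars <- c("mpg", "hp", "wt")

# A built-in function applied to each column
means <- sapply(cars[myvars], mean)
means

# Extra arguments after FUN are passed on to it
quartiles <- sapply(cars[myvars], quantile, probs = c(0.25, 0.75))
quartiles      # one row per probability, one column per variable

# An anonymous function defined in place
spreads <- sapply(cars[myvars], function(x) max(x) - min(x))
spreads
```

When FUN returns one value per column the result is a named vector; when it returns several values, as quantile does here, the result is a matrix.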

Using aggregate

When you have variables that can be treated as factors, you can use the aggregate function to obtain basic statistics grouped by those factors. The function has the following syntax:

aggregate(x, by, FUN)

# read data and attach the object for easier use of the variables
mtcars <- read.csv('mtcars.csv')
attach(mtcars)
# calculate the mean of numeric variables, aggregated by cyl and gear as factors; missing values are removed
aggdata <- aggregate(mtcars[c('mpg','hp','wt')], by=list(cyl,gear), FUN = mean, na.rm = TRUE)

# set names for the resulting data table - what happens if we skip this line of code?
aggdata <- setNames(aggdata, c("CYL","GEAR","mpg","hp","wt"))
aggdata
aggdata
detach(mtcars)
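
An alternative that avoids attach() and the renaming step is the formula interface of aggregate, sketched here on the built-in mtcars dataset (illustrative name cars). With an unnamed by list, as in the code above, the grouping columns are named Group.1 and Group.2; the formula form keeps the variable names.

```r
cars <- datasets::mtcars

# Left of ~ : variables to summarize; right of ~ : grouping variables
agg <- aggregate(cbind(mpg, hp, wt) ~ cyl + gear, data = cars, FUN = mean)
agg
names(agg)     # "cyl" "gear" "mpg" "hp" "wt"
nrow(agg)      # one row per cyl/gear combination present in the data
```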

Exercise

Here is an exercise to complete; we will discuss it first thing tomorrow morning.

  • Look in your kitchen, fridge, or elsewhere, for food with nutrition labels (Nutrition Facts).
  • Find four canned foods, four in boxes (dry food), and four beverages (liquid).
  • Take note of the values for calories per serving, total fat (g), cholesterol (mg), sodium (mg), total carbohydrates (g), and protein (g), for each food.
  • Create a data frame with the data.