We R getting started

About `R`

R is a free software environment for statistical computing and graphics. Documentation and further information are available on its homepage. The software itself, as well as contributed packages, can be downloaded from its main repository, CRAN (The Comprehensive R Archive Network).

Many users and developers prefer to run R from RStudio, an integrated development environment (IDE) for R and Python, with a console, a syntax-highlighting editor that supports direct code execution, and tools for plotting, history, debugging and workspace management. It can also be downloaded for free.

Some web-hosted services allow running R in the cloud. One of them is RStudio Cloud (rstudio.cloud). Another is Jupyter Notebook, which can run R through an R kernel.

Main types of objects and classes

In R we usually deal with the following types of objects:

numeric (or float, continuous data)
integer (discrete data)
complex
character (nominal data)
logical (FALSE and TRUE)

Examples:

x <- 2.7
class(x)

[1] "numeric"

y <- 3L
class(y)

[1] "integer"

z <- 'yellow'
class(z)

[1] "character"

k <- TRUE
class(k)

[1] "logical"
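
The complex type is listed above without an example. As a brief sketch (the object name w is only illustrative and not part of the original examples), the following shows a complex value and how typeof() and as.integer() behave for the numeric types:

```r
# Illustrative object, not part of the original examples
w <- 1 + 2i
class(w)          # "complex"

typeof(2.7)       # "double": the storage type behind class "numeric"
typeof(3L)        # "integer"
as.integer(2.7)   # 2: conversion drops the decimal part
is.numeric(3L)    # TRUE: integers also count as numeric
```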

Some operations may result in special values, such as

a <- 0
b <- 0
a/b

[1] NaN

which means Not a Number, or

a <- 1
b <- 0
a/b

[1] Inf

Missing data are represented by NA (Not Available).
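
As a small sketch of how these special values can be detected in practice (the vector v is made up for illustration), the base functions is.na(), is.nan() and is.finite() can be compared side by side:

```r
# Hypothetical vector mixing a regular value with NA, NaN and Inf
v <- c(1, NA, 0/0, 1/0)

is.na(v)       # FALSE TRUE TRUE FALSE: NaN is also flagged as missing
is.nan(v)      # FALSE FALSE TRUE FALSE: only NaN
is.finite(v)   # TRUE FALSE FALSE FALSE

# Many functions return NA when a value is missing, unless told to skip it
mean(c(1, 3, NA))                # NA
mean(c(1, 3, NA), na.rm = TRUE)  # 2
```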

Vectors

The function c() concatenates objects to build vectors of data.

x <- c(2.7, 1.3, 0.85, 4)
y <- c('red', 'yellow', 'another colour')

What happens if we concatenate x and y? Try it. Since a vector can hold only one type of data, the numeric values are converted to character. This is what is called coercion.
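
A minimal sketch of coercion, using the numeric vector x and the character vector y defined above (the name xy is introduced here only for illustration):

```r
x <- c(2.7, 1.3, 0.85, 4)
y <- c('red', 'yellow', 'another colour')

# Mixing numeric and character values: everything becomes character
xy <- c(x, y)
class(xy)     # "character"
length(xy)    # 7

# Mixing logical and integer values: TRUE becomes 1L
c(TRUE, 2L)   # 1 2

# Explicit conversion back to numeric for the first four elements
as.numeric(xy[1:4])
```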

Subsetting

Let us say we are interested in only part of a vector, say the second value.

x[2]

[1] 1.3

y[2]

[1] "yellow"

Or perhaps we would like to exclude only the first value.

x[-1]

[1] 1.30 0.85 4.00

A sequence of positions, here the second and the third, can also be selected:

x[2:3]

[1] 1.30 0.85

Or we just want the first and the third:

x[c(1, 3)]

[1] 2.70 0.85
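
Besides positions, a vector can be subset with a logical condition. This goes slightly beyond the examples above and is offered only as a sketch, using the same vector x:

```r
x <- c(2.7, 1.3, 0.85, 4)

x > 1          # TRUE TRUE FALSE TRUE
x[x > 1]       # 2.7 1.3 4.0: keeps the values where the condition is TRUE
which(x > 1)   # 1 2 4: positions where the condition is TRUE

# Negative indices can also be combined to drop several values at once
x[-c(1, 3)]    # 1.3 4.0
```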

Some built-in functions

Let us say we want to calculate the average value of x:

s <- sum(x)
s

[1] 8.85

n <- length(x)
n

[1] 4

s/n

[1] 2.2125

Well, not very practical! This is better:

mean(x)

[1] 2.2125

Some other useful functions for descriptive analysis are:

min() and max(), minimum and maximum
median(), median
quantile(), sample quantiles
sd(), standard deviation
var(), variance
sqrt(), square root
log(), natural logarithm
exp(), exponential
cor(), correlation
sort(), values in increasing (or decreasing) order
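
As a quick sketch of the functions listed above, they can be applied to the vector x used so far (y is taken here as x^2, an assumption made only to have a second variable for cor()):

```r
x <- c(2.7, 1.3, 0.85, 4)
y <- x^2   # assumed second variable, for illustration only

min(x)                               # 0.85
max(x)                               # 4
median(x)                            # 2
quantile(x, probs = c(0.25, 0.75))   # first and third quartiles

sd(x)                                # standard deviation
sqrt(var(x))                         # same value as sd(x)
exp(log(x))                          # recovers x

cor(x, y)                            # correlation between x and y
sort(x)                              # 0.85 1.30 2.70 4.00
sort(x, decreasing = TRUE)           # 4.00 2.70 1.30 0.85
```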

Matrices and arrays

Consider these two vectors, both of length 4.

x

[1] 2.70 1.30 0.85 4.00

y <- x^2

We can bind them by columns or by rows, like this:

M1 <- cbind(x, y)
M1

        x       y
[1,] 2.70  7.2900
[2,] 1.30  1.6900
[3,] 0.85  0.7225
[4,] 4.00 16.0000

M2 <- rbind(x, y)
M2

  [,1] [,2]   [,3] [,4]
x 2.70 1.30 0.8500    4
y 7.29 1.69 0.7225   16

We end up with two matrices, M1 ($4 \times 2$) and M2 ($2 \times 4$).
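
The dimensions can be checked with dim(), and matrix elements are reached with [row, column] indices. A short sketch, rebuilding M1 and M2 from the vectors above:

```r
x <- c(2.7, 1.3, 0.85, 4)
y <- x^2
M1 <- cbind(x, y)
M2 <- rbind(x, y)

dim(M1)       # 4 2
dim(M2)       # 2 4

M1[2, ]       # second row of M1
M1[, 'y']     # column named y
M2['y', 3]    # row named y, third column: 0.7225
```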

Arrays generalize matrices to higher dimensions. For instance, suppose we want to combine three matrices ($2 \times 4$). The result is an array of dimension $2 \times 4 \times 3$. Example:

A <- array(rnorm(n = 2*4*3), dim = c(2, 4, 3))
A

, , 1

           [,1]      [,2]      [,3]       [,4]
[1,] -0.2076705  2.024141 0.2504917 -0.1801403
[2,] -1.8222397 -0.630097 0.6148187 -0.6685826

, , 2

           [,1]       [,2]        [,3]      [,4]
[1,]  0.6724788 -1.8735471  0.06543014 0.3229076
[2,] -1.5745063  0.6683405 -0.58822985 0.1742973

, , 3

          [,1]     [,2]        [,3]       [,4]
[1,] 0.2910866 1.781717  0.02779452 -0.6235659
[2,] 0.3524773 1.219310 -2.41213322 -0.5107545

To extract the third matrix, the subsetting command is:

A[ , , 3]

          [,1]     [,2]        [,3]       [,4]
[1,] 0.2910866 1.781717  0.02779452 -0.6235659
[2,] 0.3524773 1.219310 -2.41213322 -0.5107545
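
As a brief check of the dimensions involved, the sketch below rebuilds an array of the same shape. The call to set.seed() is an addition (the seed value is arbitrary) so that the random numbers are reproducible; they will not match the values printed above.

```r
set.seed(1)   # arbitrary seed, only for reproducibility
A <- array(rnorm(n = 2*4*3), dim = c(2, 4, 3))

dim(A)            # 2 4 3: rows, columns, matrices
dim(A[ , , 3])    # 2 4: the third matrix
A[1, 2, 3]        # a single element: row 1, column 2 of the third matrix
length(A)         # 24 values in total
```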

Main operations with matrices

Some common matrix operations are:

A %*% B, matrix multiplication, $A \times B$
solve(A), inverse of a nonsingular square matrix, $A^{-1}$
t(A), transpose, $A^T$
diag(A), diagonal elements

Example:

M3 <- M1 %*% M2
M3

           [,1]      [,2]      [,3]   [,4]
[1,]  60.434100 15.830100  7.562025 127.44
[2,]  15.830100  4.546100  2.326025  32.24
[3,]   7.562025  2.326025  1.244506  14.96
[4,] 127.440000 32.240000 14.960000 272.00
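
The remaining operations can be sketched in the same way. Note that M3 above is a $4 \times 4$ product of a $4 \times 2$ and a $2 \times 4$ matrix, so its rank is at most 2 and it has no inverse. The example below therefore uses the $2 \times 2$ product in the other order (the names S and S_inv are introduced here only for illustration):

```r
x <- c(2.7, 1.3, 0.85, 4)
M1 <- cbind(x, y = x^2)   # 4 x 2, as above
M2 <- t(M1)               # transpose: 2 x 4, same values as rbind(x, y)

S <- M2 %*% M1            # 2 x 2 square matrix
S_inv <- solve(S)         # its inverse

round(S_inv %*% S, 10)    # identity matrix, up to rounding error
diag(S)                   # diagonal elements of S
```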

Data frames

Data frames (class data.frame) are data tables whose columns can contain data of different types, as long as all columns have the same length. For example,

a <- seq(2, 10, by = 2)
b <- c('maize', 'soybean', 'wheat', NA, 'maize')
d <- data.frame(a, b)
d

   a       b
1  2   maize
2  4 soybean
3  6   wheat
4  8    <NA>
5 10   maize

Notice that, although the fourth value of b is missing, b has the same length (5) as a, while the two columns are of different classes.

Data are often imported into R as a data.frame. The subsetting procedures are similar to those for matrices, using the operator [ , ] with (rows, columns) indices. Moreover, data frames can be conveniently subset using the function subset(). Check this out:

subset(d, a >= 4)

   a       b
2  4 soybean
3  6   wheat
4  8    <NA>
5 10   maize

subset(d, b == 'maize')

   a     b
1  2 maize
5 10 maize
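
For comparison, the same data frame can be subset with the [ , ] operator. The sketch below also shows one difference worth knowing: with [ , ], a missing value in the condition produces a row of NA, whereas subset() drops it.

```r
a <- seq(2, 10, by = 2)
b <- c('maize', 'soybean', 'wheat', NA, 'maize')
d <- data.frame(a, b)

d[2, ]                  # second row
d[, 'a']                # column a, returned as a vector
d$b                     # column b, using the $ operator
d[d$a >= 4, ]           # same rows as subset(d, a >= 4)

# The condition is NA for the fourth row, which shows up as a row of NA
d[d$b == 'maize', ]
nrow(d[d$b == 'maize', ])        # 3
nrow(subset(d, b == 'maize'))    # 2
```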

Lists

Lists are very flexible objects, meaning that they can store different types (classes) of data, with different dimensions. For example, consider a list x that contains a character vector of length 2, a numeric vector of length 1, a logical vector of length 3 and a ($2 \times 2$) matrix:

x <- list(crop = c('maize', 'soybean'), 
          yield = 2.2221, 
          rotation = c(TRUE, FALSE, FALSE),
          geopoints = matrix(1:4, nrow = 2))
x

$crop
[1] "maize"   "soybean"

$yield
[1] 2.2221

$rotation
[1]  TRUE FALSE FALSE

$geopoints
     [,1] [,2]
[1,]    1    3
[2,]    2    4

We can easily retrieve any of its elements with the $ operator, like this:

x$geopoints

     [,1] [,2]
[1,]    1    3
[2,]    2    4
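
As a complementary sketch, list elements can also be reached with double brackets, and a retrieved element can be subset further. Single brackets, in contrast, return a shorter list rather than the element itself.

```r
x <- list(crop = c('maize', 'soybean'),
          yield = 2.2221,
          rotation = c(TRUE, FALSE, FALSE),
          geopoints = matrix(1:4, nrow = 2))

names(x)             # "crop" "yield" "rotation" "geopoints"
length(x)            # 4

x[['yield']]         # same as x$yield
x$crop[2]            # "soybean"
x$geopoints[2, 1]    # 2

class(x['crop'])     # "list": single brackets keep the list structure
```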

Import and export data

Data sets stored in the common .csv and .txt formats can be imported using one of the functions

read.csv(), or read.csv2()
read.table()

The result is an object of class data.frame. Those functions can also read data from a URL. For example, take the data set available from https://raw.githubusercontent.com/arsilva87/statsbook/main/datasets/camadassolo.csv

file_address <- 'https://raw.githubusercontent.com/arsilva87/statsbook/main/datasets/camadassolo.csv'
soil <- read.csv(file = file_address)
head(soil)   # first 6 rows

     US   DS   RP    CO Argila Tensao Camada
1  6.85 1.77 4.37  9.94     16   83.0   0-20
2 10.61 1.58 1.64 26.89     13   64.7   0-20
3  6.63 1.79 5.15  9.94     18   93.5   0-20
4  6.63 1.78 4.83  9.94     18   87.5   0-20
5 10.72 1.57 0.40 27.80     12   61.0   0-20
6 11.93 1.56 0.74 24.18     12   61.9   0-20

Alternatively, a path to a local file can be passed to the argument file. Run help(read.table) to check the help page for more details and further arguments.

The function readLines() can be used to read a text file line by line.

The analogous functions to export data are: write.csv(), write.csv2(), write.table() and writeLines().
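
A minimal sketch of exporting and re-importing data with these functions is given below. It writes to a temporary file created with tempfile(), so no particular folder or file name is assumed, and it uses a small data frame like the one from the Data frames section.

```r
d <- data.frame(a = seq(2, 10, by = 2),
                b = c('maize', 'soybean', 'wheat', NA, 'maize'))

# Temporary file path, used here instead of a real local path
tmp <- tempfile(fileext = '.csv')

# Export without the row names, then import the same file again
write.csv(d, file = tmp, row.names = FALSE)
d2 <- read.csv(file = tmp)
d2

# Inspect the raw text of the first two lines (header and first record)
first_lines <- readLines(tmp, n = 2)
first_lines
```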

The package readxl can be used to import data stored as .xls or .xlsx (MS Excel).

Packages

R is made of packages. In general, a package contains data, documentation and functions that perform procedures, graphical or not. The R distribution comes with a number of pre-installed packages, such as stats, graphics, datasets and many others. To check what packages are installed, run installed.packages()
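
As a small sketch, the result of installed.packages() is a matrix with one row per package, so it can be stored and queried like any other matrix (the name ip is only illustrative):

```r
ip <- installed.packages()

class(ip)                              # a matrix
head(ip[, c('Package', 'Version')])    # first few packages and their versions

# Check whether a given package is installed
'stats' %in% rownames(ip)              # TRUE, since stats ships with R
```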

These pre-installed packages cannot cover every procedure that one might need. This is one reason to have packages: we install only what we need to use. The number of contributed packages increases every day. At the time of writing, almost 20 thousand packages are available from CRAN, which also hosts their source code.

To install a package directly from the R console,

install.packages('readxl')
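
In scripts it is often convenient to install a package only when it is missing. The helper below is a hypothetical sketch, not a function from base R; it relies on requireNamespace(), which returns TRUE when the package can be loaded. It is called here with stats, which is always available, so nothing is downloaded.

```r
# Hypothetical helper (not part of base R): install a package only if missing
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # downloads from the configured CRAN mirror
  }
  invisible(requireNamespace(pkg, quietly = TRUE))
}

ok <- ensure_package('stats')   # already installed, so nothing is downloaded
ok                              # TRUE

# ensure_package('readxl')      # would install readxl from CRAN if missing
# library(readxl)               # then attach it for use in the session
```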

Packages can also be installed from local files, either Windows binaries (.zip) or source packages (.tar.gz).

Also, some packages may be available in other repositories, such as Bioconductor or R-Forge. Many packages are also hosted on development platforms such as GitHub, from which they can be installed as well.

Exercises

Using the soil data, create a new data frame containing only the data from layer ‘0-20’.
Calculate the mean and standard deviation of each variable, by levels of ‘Camada’.

Miscellanea

Descriptive statistics by subsets of data: aggregate(), apply(), tapply(), lapply()
Installing packages from CRAN, GitHub and other repositories, and from local source files.
Writing functions.
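
As a hedged illustration of the first topic, the sketch below computes statistics by groups. It uses the iris data set from the pre-installed datasets package, chosen here only because it is always available; it is not one of the data sets used above.

```r
# Mean of one variable by levels of a factor
m <- tapply(iris$Sepal.Length, iris$Species, mean)
m

# Same idea with a formula interface, returning a data frame
aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)

# Standard deviation of each numeric column (2 = apply over columns)
apply(as.matrix(iris[, 1:4]), 2, sd)

# Range of one variable within each group, returned as a list
lapply(split(iris$Sepal.Length, iris$Species), range)
```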

We `R` getting started

Anderson Rodrigo da Silva

02/03/2022
