About R
R is a free software environment for statistical computing and graphics. Documentation and further information are available on its homepage. Both the software and its contributed packages can be downloaded from the main repository, CRAN (The Comprehensive R Archive Network).
Many users and developers prefer to run R from RStudio, an IDE (integrated development environment) for R and Python that provides a console, a syntax-highlighting editor that supports direct code execution, and tools for plotting, history, debugging and workspace management. It can also be downloaded for free.
Some web-hosted services allow R to be run in the cloud. One of them is Rstudio.cloud; another is Jupyter Notebook.
Main types of objects and classes
In R we usually deal with the following types of objects:
numeric (or float, continuous data)
integer (discrete data)
complex
character (nominal data)
logical (FALSE and TRUE)
Examples:
[1] "numeric"
[1] "integer"
[1] "character"
[1] "logical"
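The commands that produced the output above are not shown here. A minimal sketch that returns the same four classes, using values chosen only for illustration, could be:

```r
# Illustrative values (assumptions, not taken from the original text)
n <- 3.14      # a number with decimals is stored as numeric (double)
i <- 2L        # the L suffix asks for integer storage
s <- "maize"   # text in quotes is character
l <- TRUE      # TRUE and FALSE are logical

class(n)
class(i)
class(s)
class(l)
```

Note that a whole number typed without the L suffix, such as 2, is still stored as numeric.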
Some operations result in special values, such as
[1] NaN
which means Not a Number, or
[1] Inf
Missing data is defined as NA (Not Available).
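As a short sketch of how these special values can arise and how missing data behaves (the vector v is a made-up example):

```r
0 / 0    # NaN: the result is undefined
1 / 0    # Inf: a positive number divided by zero
-1 / 0   # -Inf

# A hypothetical vector with one missing value
v <- c(1, NA, 3)
is.na(v)               # flags which elements are missing
mean(v)                # NA, because one value is unknown
mean(v, na.rm = TRUE)  # 2, missing values removed before averaging
```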
Vectors
The function c() is used to concatenate objects and then build vectors of data.
x <- c(2.7, 1.3, 0.85, 4)
y <- c('red', 'yellow', 'another colour')
What happens if we concatenate x and y? Try it. A vector can hold only one type of data, so R converts all elements to a common type. This is what is called coercion.
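One way to check the result of this experiment is to inspect the class of the combined vector. The name z is only illustrative:

```r
x <- c(2.7, 1.3, 0.85, 4)
y <- c('red', 'yellow', 'another colour')

# Combining numeric and character data: the numbers are converted to text
z <- c(x, y)
class(z)    # "character"
length(z)   # 7
```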
Subsetting
Let us say we are interested in only part of a vector, say the second value.
[1] 1.3
[1] "yellow"
Or perhaps we would like to exclude the first value, or both the first and the last:
[1] 1.30 0.85 4.00
[1] 1.30 0.85
Or, if we just wanted the first and the third:
[1] 2.70 0.85
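The subsetting commands themselves are not shown above. The following sketch is consistent with the printed results, although the original calls may have been written differently (for instance, x[2:3] instead of x[-c(1, 4)]):

```r
x <- c(2.7, 1.3, 0.85, 4)
y <- c('red', 'yellow', 'another colour')

x[2]          # second value of x
y[2]          # second value of y
x[-1]         # a negative index excludes: everything but the first value
x[-c(1, 4)]   # excludes the first and the fourth values
x[c(1, 3)]    # only the first and the third values
```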
Some built-in functions
Let us say we want to calculate the average value of x:
[1] 8.85
[1] 4
[1] 2.2125
Well, not very practical! This is better:
[1] 2.2125
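The four results above correspond to computing the average by hand and then with the built-in function. A sketch of the likely commands:

```r
x <- c(2.7, 1.3, 0.85, 4)

sum(x)               # 8.85, the total
length(x)            # 4, the number of values
sum(x) / length(x)   # 2.2125, the average computed by hand
mean(x)              # 2.2125, the same result in one call
```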
Some other useful functions for descriptive analysis are:
min() and max()
median(), median
quantile(), sample quantiles
sd(), standard deviation
var(), variance
sqrt(), square root
log(), natural logarithm
exp(), exponential
cor(), correlation
sort()
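As a brief illustration, some of these functions can be applied to the vector x defined earlier. This is only a sketch and the choice of functions is arbitrary:

```r
x <- c(2.7, 1.3, 0.85, 4)

min(x)       # 0.85
max(x)       # 4
sort(x)      # values in increasing order
median(x)    # 2, the average of the two middle values (1.3 and 2.7)
var(x)       # sample variance
sd(x)        # sample standard deviation, the square root of var(x)
quantile(x, probs = c(0.25, 0.75))   # first and third quartiles
cor(x, x^2)  # correlation between x and its squares
```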
Matrices and arrays
Consider these two vectors, both of length 4.
[1] 2.70 1.30 0.85 4.00
We can col-bind or row-bind them, like this:
x y
[1,] 2.70 7.2900
[2,] 1.30 1.6900
[3,] 0.85 0.7225
[4,] 4.00 16.0000
[,1] [,2] [,3] [,4]
x 2.70 1.30 0.8500 4
y 7.29 1.69 0.7225 16
And we end up with matrices, M1 (\(4 \times 2\)) and M2 (\(2 \times 4\)).
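The commands behind these matrices are not shown above. Judging from the printed values, y holds the squares of x, so a sketch under that assumption is:

```r
x <- c(2.7, 1.3, 0.85, 4)
y <- x^2   # assumption: the printed y values equal the squares of x

M1 <- cbind(x, y)   # vectors become columns
M2 <- rbind(x, y)   # vectors become rows

dim(M1)   # 4 2
dim(M2)   # 2 4
```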
Now, arrays are generalizations of matrices for higher dimensions. For instance, suppose we want to combine three matrices (\(2 \times 4\)). The result is an array of dimension \(2 \times 4 \times 3\). Example:
A <- array(rnorm(n = 2*4*3), dim = c(2, 4, 3))
A
, , 1
[,1] [,2] [,3] [,4]
[1,] -0.2076705 2.024141 0.2504917 -0.1801403
[2,] -1.8222397 -0.630097 0.6148187 -0.6685826
, , 2
[,1] [,2] [,3] [,4]
[1,] 0.6724788 -1.8735471 0.06543014 0.3229076
[2,] -1.5745063 0.6683405 -0.58822985 0.1742973
, , 3
[,1] [,2] [,3] [,4]
[1,] 0.2910866 1.781717 0.02779452 -0.6235659
[2,] 0.3524773 1.219310 -2.41213322 -0.5107545
If we want to retrieve the third matrix, the subsetting command should be:
[,1] [,2] [,3] [,4]
[1,] 0.2910866 1.781717 0.02779452 -0.6235659
[2,] 0.3524773 1.219310 -2.41213322 -0.5107545
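An array is indexed with one position per dimension, and a position left empty selects everything along that dimension. A sketch of the command, with set.seed() added here only to make the random values reproducible (the original seed is unknown):

```r
set.seed(1)   # assumed seed, so the numbers differ from those printed above
A <- array(rnorm(n = 2*4*3), dim = c(2, 4, 3))

A3 <- A[, , 3]   # all rows, all columns, third matrix
dim(A3)          # 2 4

A[1, 2, 3]       # a single element: row 1, column 2 of the third matrix
```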
Main operations with matrices
Some common matrix operations are:
A %*% B, multiplication, \(A \times B\)
solve(A), unique inverse, \(A^{-1}\)
t(A), transpose, \(A^T\)
diag(A), diagonal elements
Example:
[,1] [,2] [,3] [,4]
[1,] 60.434100 15.830100 7.562025 127.44
[2,] 15.830100 4.546100 2.326025 32.24
[3,] 7.562025 2.326025 1.244506 14.96
[4,] 127.440000 32.240000 14.960000 272.00
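The printed result matches the product of a \(4 \times 2\) matrix and a \(2 \times 4\) matrix built from x and its squares. A sketch that reproduces it and tries the other operations, assuming y is x^2 (the names P, Q and Qinv are illustrative):

```r
x <- c(2.7, 1.3, 0.85, 4)
y <- x^2               # assumption, consistent with the printed values
M1 <- cbind(x, y)      # 4 x 2
M2 <- rbind(x, y)      # 2 x 4

P <- M1 %*% M2         # 4 x 4 product, as in the output above
diag(P)                # its diagonal elements
t(M1)                  # transpose of M1, with the same values as M2

# P has rank 2, so it has no inverse; the 2 x 2 product below does
Q <- M2 %*% M1
Qinv <- solve(Q)
Q %*% Qinv             # identity matrix, up to rounding error
```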
Data frames
Data frames (class data.frame) are data tables in which each column can contain data of a different type, but all columns must have the same length. For example,
a <- seq(2, 10, by = 2)
b <- c('maize', 'soybean', 'wheat', NA, 'maize')
d <- data.frame(a, b)
d
a b
1 2 maize
2 4 soybean
3 6 wheat
4 8 <NA>
5 10 maize
Notice that, although the fourth value is missing, b has the same length (5) as a, but they are of two different classes.
Data are often imported into R as a data.frame. The subsetting procedure is similar to that for matrices, using the operator [ , ] with (rows, columns) indices. Moreover, data frames can be conveniently subsetted using the function subset(). Check this out:
a b
2 4 soybean
3 6 wheat
4 8 <NA>
5 10 maize
a b
1 2 maize
5 10 maize
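The two results above can be obtained, for example, with the commands below. This is a sketch consistent with the output; the original calls are not shown:

```r
a <- seq(2, 10, by = 2)
b <- c('maize', 'soybean', 'wheat', NA, 'maize')
d <- data.frame(a, b)

d[-1, ]                   # all columns, every row except the first
subset(d, b == 'maize')   # rows where b is 'maize'; the NA row is dropped

d[2, 'b']                 # a single cell: row 2 of column b
d$a                       # a whole column, as a vector
```

Unlike subset(), the bracket form d[d$b == 'maize', ] returns an extra row filled with NA for the missing value, which is one reason subset() is convenient here.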
Lists
Lists are very flexible objects, meaning that they can store different types (classes) of data, with different dimensions. For example, consider a list x that contains a character vector of size 2, a numeric vector of size 1, a logical vector of size 3 and a (\(2 \times 2\)) matrix:
x <- list(crop = c('maize', 'soybean'),
yield = 2.2221,
rotation = c(TRUE, FALSE, FALSE),
geopoints = matrix(1:4, nrow = 2))
x
$crop
[1] "maize" "soybean"
$yield
[1] 2.2221
$rotation
[1] TRUE FALSE FALSE
$geopoints
[,1] [,2]
[1,] 1 3
[2,] 2 4
And we can easily retrieve any of its elements by using the symbol $, like this:
[,1] [,2]
[1,] 1 3
[2,] 2 4
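A sketch of the retrieval shown above, together with two equivalent ways of reaching list elements:

```r
x <- list(crop = c('maize', 'soybean'),
          yield = 2.2221,
          rotation = c(TRUE, FALSE, FALSE),
          geopoints = matrix(1:4, nrow = 2))

x$geopoints      # the matrix, retrieved by name
x[["yield"]]     # double brackets also return the element itself
x$crop[2]        # elements can be subsetted further: "soybean"
names(x)         # names of all elements in the list
```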
Import and export data
Data sets in the common .csv and .txt formats can be imported using one of the functions
read.csv(), or read.csv2()
read.table()
The result is an object of class data.frame. Those functions can also read data from a URL. For example, take the data set available from https://raw.githubusercontent.com/arsilva87/statsbook/main/datasets/camadassolo.csv
file_address <- 'https://raw.githubusercontent.com/arsilva87/statsbook/main/datasets/camadassolo.csv'
soil <- read.csv(file = file_address)
head(soil) # first 6 rows
US DS RP CO Argila Tensao Camada
1 6.85 1.77 4.37 9.94 16 83.0 0-20
2 10.61 1.58 1.64 26.89 13 64.7 0-20
3 6.63 1.79 5.15 9.94 18 93.5 0-20
4 6.63 1.78 4.83 9.94 18 87.5 0-20
5 10.72 1.57 0.40 27.80 12 61.0 0-20
6 11.93 1.56 0.74 24.18 12 61.9 0-20
Alternatively, a path to a local file can be passed to the argument file. Run help(read.table) to check the help page for more details and further arguments.
The function readLines() can be used to read the lines of a file as plain text.
The analogous functions to export data are: write.csv(), write.csv2(), write.table() and writeLines().
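A small round trip illustrates how the export and import functions pair up. This sketch writes a made-up data frame to a temporary file, so the file location is only illustrative:

```r
# A small example data frame (hypothetical data)
d <- data.frame(a = seq(2, 10, by = 2),
                b = c('maize', 'soybean', 'wheat', NA, 'maize'))

tmp <- tempfile(fileext = ".csv")             # path to a temporary file
write.csv(d, file = tmp, row.names = FALSE)   # export without row names

d2 <- read.csv(file = tmp)   # import it back as a data.frame
d2

readLines(tmp, n = 2)        # header and first record as raw text
```

Setting row.names = FALSE avoids writing the row numbers as an extra column, which would otherwise come back as a new variable on import.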
The package readxl can be used to import data stored as .xls or .xlsx (MS Excel).
Packages
R is made of packages. In general, a package contains data, documentation and functions that carry out procedures, graphical or not. The R distribution comes with a set of pre-installed packages, such as stats, graphics, datasets and many others. To check which packages are installed, run installed.packages()
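As a sketch of how this information can be used, the result of installed.packages() is a matrix with one row per package, so its row names can be searched. The object name ip is illustrative:

```r
ip <- installed.packages()

"stats" %in% rownames(ip)    # TRUE, since stats ships with R
"readxl" %in% rownames(ip)   # TRUE only if readxl has been installed

# Alternative check that does not attach the package
requireNamespace("stats", quietly = TRUE)

# Attach an installed package to make its functions available
library(stats)
```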
These pre-installed packages cannot account for all kinds of procedures that one might need. This is one reason to have packages, so that we can install only what we need to use. And the number of packages from contributors increases every day. Currently, almost 20 thousand packages are available from CRAN, where the source code of the packages is available.
To install a package directly from the R console,
install.packages('readxl')
Packages can also be installed from their source code, as local .zip (or .tar.gz) files.
Also, some packages may be available in other repositories, such as Bioconductor or R-forge. And many packages are hosted on development platforms such as GitHub, from where a package can also be installed.
Exercises
- With the data soil, create a new data frame containing only the data from layer ‘0-20’.
- Calculate mean and standard deviation of each variable, by levels of ‘Camada’.
Miscellanea
- Descriptive statistics by subsets of data: aggregate(), apply(), tapply(), lapply()
- Installing packages from CRAN, GitHub and other repositories, and from local source files.
- Writing functions.
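Two of these topics can be sketched briefly. The data frame crops and the function cv below are made up for illustration and are not part of the original material:

```r
# Hypothetical data: yields for two crops
crops <- data.frame(crop  = c('maize', 'maize', 'soybean', 'soybean'),
                    yield = c(2, 4, 1, 3))

# Mean yield by crop, as a named vector
m <- tapply(crops$yield, crops$crop, mean)
m

# The same summary returned as a data frame
aggregate(yield ~ crop, data = crops, FUN = mean)

# A user-written function: coefficient of variation, in percent
cv <- function(v) {
  100 * sd(v) / mean(v)
}
cv(crops$yield)
```

The same cv function could be passed as the FUN argument of tapply() or aggregate() to obtain the coefficient of variation within each group.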
