If you are already familiar with these concepts and are interested in more advanced material, you can have a look here.
You can work directly in R, but most users prefer a graphical interface to interact with R more easily.
A popular choice is RStudio, an integrated development environment (IDE).
The first time you open RStudio, you will see three windows. The code editor is hidden by default, but can be opened by clicking the File drop-down menu, then New File, and then R Script.
| RStudio Windows / Tabs | Description |
|---|---|
| Console Window | location where commands are entered and the output is printed |
| Source Tabs | built-in text editor |
| Environment Tab | interactive list of loaded R objects |
| History Tab | list of commands entered into the Console |
| Files Tab | file explorer to navigate folders |
| Plots Tab | output location for plots |
| Packages Tab | list of installed packages |
| Help Tab | output location for help commands and help search window |
| Viewer Tab | advanced tab for local web content |
Before we begin working in R, we should set our working directory, a folder that holds all of our project files. This directory is where our input datasets are stored, and it also serves as the default location for plots and other objects exported from R. Once it is set, we can conveniently import data into R with just a file name instead of the entire file path. To change the working directory in RStudio, select Files Tab > More > Set As Working Directory; alternatively, use the functions getwd() and setwd() to get and set the working directory, respectively.
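As a minimal sketch of getwd() and setwd() in practice, the example below creates a throwaway folder inside R's temporary directory and uses it as the working directory. The folder name my_project and the file example.csv are hypothetical and chosen only for illustration.

```r
# Remember the current working directory so it can be restored afterwards
old_wd <- getwd()

# Hypothetical project folder, created inside R's temporary directory
project_dir <- file.path(tempdir(), "my_project")
dir.create(project_dir, showWarnings = FALSE)

# Make it the working directory
setwd(project_dir)

# A file can now be written and read back using only its name
write.csv(data.frame(id = 1:3), "example.csv", row.names = FALSE)
example <- read.csv("example.csv")
files_found <- list.files()

# Restore the original working directory
setwd(old_wd)
```

In a real project, project_dir would simply be the path to the folder that holds the data.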
R code can be entered into the console directly or be saved as a script. We can run a command directly from a script by placing the cursor inside the command or highlighting the commands and hitting Ctrl-Enter. This will advance the cursor to the next command, where we can hit Ctrl-Enter again to run it.
Commands are separated either by a ; or by a newline.
R is case sensitive.
The # character signifies a comment, which is not executed.
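Commands can extend beyond one line of text: an incomplete command (for example, one that ends with an operator such as + or has an unclosed bracket) continues on the next line, and the console shows a + prompt until the command is complete.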
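These syntax rules can be tried out in a short sketch. The object names are arbitrary and only serve the illustration.

```r
# Two commands on one line, separated by a semicolon
x <- 2; y <- 3

# R is case sensitive: x and X are different objects
X <- 10
x + X
## [1] 12

# An incomplete command continues on the next line
total <- x +
  y +
  X
total
## [1] 15
```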
In R, data is stored in objects. To achieve this we use the <- or = operator. A simple analogue for objects is a closet, where we can store similar (homogeneous) or different (heterogeneous) things of various sizes.
To print the contents of an object, we type the object’s name alone.
For example:
# assign the number 3 to object called my_closet
my_closet <- 3
# print contents
my_closet
## [1] 3
Functions perform most of the work on data in R.
Functions in R are much the same as they are in math; they perform some operation on an input and return some output. For example, the mathematical function \(f(x) = x^2\), takes an input x, and returns its square. Similarly, the mean() function in R takes a vector of numbers and returns its mean. The inputs to functions are often referred to as arguments.
We have already discussed a few functions, such as getwd() and setwd().
get_square <- function(x){x ^ 2}
a <- 5
get_square(a)
## [1] 25
b <- 1:5
mean(b)
## [1] 3
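Functions can take more than one argument, and arguments can have default values. The sketch below uses a hypothetical function, get_power(), that generalises get_square(); like most R functions, it also works element-wise on vectors.

```r
# exponent defaults to 2 when it is not supplied
get_power <- function(x, exponent = 2) {
  x ^ exponent
}

get_power(3)
## [1] 9
get_power(3, exponent = 3)
## [1] 27
get_power(1:3)
## [1] 1 4 9
```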
Help files for R functions are accessed by preceding the name of the function with ? (e.g., ?seq) or by the F1 key.
In the help file, we can find the list of function arguments, in a specific order. Values for arguments can be specified either by name or position.
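For instance, the first three arguments of seq() are from, to and by. The sketch below calls it by position, by name, and with a mix of both; the result names are arbitrary.

```r
# Arguments matched by position: from, to, by
by_position <- seq(1, 10, 3)
by_position
## [1]  1  4  7 10

# The same call with arguments matched by name, in a different order
by_name <- seq(by = 3, to = 10, from = 1)
identical(by_position, by_name)
## [1] TRUE

# Mixing both: positional arguments first, then a named one
mixed_call <- seq(1, 10, length.out = 4)
```

Naming arguments makes calls easier to read, especially for functions with many optional arguments.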
Basic data structures can be organised by their dimensionality (1-D, 2-D, or N-D) and whether they are homogeneous (all contents must be of the same type) or heterogeneous (the contents can be of different types). The most common data structures used in data analysis are:
| | Homogeneous | Heterogeneous |
|---|---|---|
| 1-D | Atomic vector | List |
| 2-D | Matrix | Data frame |
| N-D | Array | |
The basic data structure in R is the vector. Vectors come in two flavours: atomic vectors and lists. Both are one-dimensional, but they differ in the types of their elements: all elements of an atomic vector must be of the same type (homogeneous), whereas the elements of a list can have different types (heterogeneous). Atomic vectors can be numeric, character or logical.
They are created and printed by:
just_a_vector <- c(3, 5, -2, 24, 1, 0, 2, 1)
my_fav_fruits <- c('apple', 'orange', 'pear') # This is not true
b <- 4.17
rainy_days <- c(T, F, T, T)
my_fav_fruits
## [1] "apple" "orange" "pear"
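The homogeneity of atomic vectors can be seen in a small sketch: when values of different types are combined with c(), R silently converts them to a common type, whereas a list keeps each element as it is. The object names mixed and kept are arbitrary.

```r
# Mixing types in c() coerces every element to a common type
mixed <- c(1, "apple", TRUE)
typeof(mixed)
## [1] "character"

# A list keeps the type of each element:
# here double, character and logical
kept <- list(1, "apple", TRUE)
kept_types <- sapply(kept, typeof)
```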
Vector elements can be accessed or subsetted by specifying a vector of numbers inside [].
my_fav_fruits[3] #I hate pears
## [1] "pear"
rainy_days[1:2]
## [1] TRUE FALSE
In addition, elements can be named and then accessed or subsetted by name, given as quoted strings:
rainy_days <- c(Mon = T, Tue = F, Wed = T, Thu = T)
rainy_days[c('Tue', 'Wed')]
## Tue Wed
## FALSE TRUE
Like atomic vectors, lists are 1-D structures, but their elements can be a mixture of types. The elements are often vectors (of any length), but they can also be other lists, matrices and data frames.
Lists are created and printed by:
fruits <- list(
weight_deka = c(2, 5, 3, 4, 5),
type = c('apple', 'pear'),
fresh = T,
owners = c('John', 'Jane'),
quality = c(3.25, 1.17, 2, 2.1)
)
fruits
## $weight_deka
## [1] 2 5 3 4 5
##
## $type
## [1] "apple" "pear"
##
## $fresh
## [1] TRUE
##
## $owners
## [1] "John" "Jane"
##
## $quality
## [1] 3.25 1.17 2.00 2.10
There are a couple of ways to access list elements. Most common is by [[]] or $:
fruits[[2]]
## [1] "apple" "pear"
fruits$owners
## [1] "John" "Jane"
fruits$quality[3:4] # accessing the third and fourth elements of the quality vector
## [1] 2.0 2.1
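Single brackets also work on lists, but they behave differently from [[]] and $. The sketch below, which uses a small hypothetical list called basket, shows that [] returns a shorter list while [[]] returns the element itself.

```r
basket <- list(type = c("apple", "pear"), fresh = TRUE)

# Single brackets return a list containing the selected elements
class(basket["type"])
## [1] "list"

# Double brackets return the element itself
class(basket[["type"]])
## [1] "character"

# [[ ]] with a name is equivalent to $
identical(basket[["type"]], basket$type)
## [1] TRUE
```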
Matrices are 2-D, homogeneous data structures that can be generated manually with matrix(). The input to matrix() is a one-dimensional vector, which is reshaped into a two-dimensional matrix according to the dimensions specified by the user in the arguments nrow and ncol (generally only one of them is needed).
The matrix is filled down the columns by default, but this can be changed by setting the byrow argument to TRUE.
bad_apples <- matrix(c(6, 9, 9, 1, 0, 4, 4, 4, 8, 7, 9, 0, 8, 0, 7, 5, 3, 2, 9, 4, 7, 7, 1, 4, 5), nrow = 5)
bad_apples
## [,1] [,2] [,3] [,4] [,5]
## [1,] 6 4 9 5 7
## [2,] 9 4 0 3 7
## [3,] 9 4 8 2 1
## [4,] 1 8 0 9 4
## [5,] 0 7 7 4 5
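The effect of the byrow argument is easiest to see on a small matrix. This sketch fills the same six values both ways; the names by_col and by_row are arbitrary.

```r
# Filled column by column (the default)
by_col <- matrix(1:6, nrow = 2)
by_col
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

# Filled row by row
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)
by_row
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
```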
Matrix elements can be accessed with matrix[row, column] notation.
Omitting row accesses all rows, and omitting column accesses all columns.
bad_apples[2, 4]
## [1] 3
bad_apples[2, ]
## [1] 9 4 0 3 7
bad_apples[, 4]
## [1] 5 3 2 9 4
Arrays are multi-dimensional matrices:
a_big_matrix <- matrix(seq(from = 2, to = 32, by = 2), nrow = 4)
a_random_array <- array(a_big_matrix, dim = c(2, 2, 4))
a_big_matrix
## [,1] [,2] [,3] [,4]
## [1,] 2 10 18 26
## [2,] 4 12 20 28
## [3,] 6 14 22 30
## [4,] 8 16 24 32
a_random_array
## , , 1
##
## [,1] [,2]
## [1,] 2 6
## [2,] 4 8
##
## , , 2
##
## [,1] [,2]
## [1,] 10 14
## [2,] 12 16
##
## , , 3
##
## [,1] [,2]
## [1,] 18 22
## [2,] 20 24
##
## , , 4
##
## [,1] [,2]
## [1,] 26 30
## [2,] 28 32
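Array elements are accessed like matrix elements, with one index per dimension. The sketch below uses a hypothetical 2 x 3 x 4 array; leaving an index empty selects everything along that dimension.

```r
small_array <- array(1:24, dim = c(2, 3, 4))
dim(small_array)
## [1] 2 3 4

# Element in row 1, column 2 of the third layer
small_array[1, 2, 3]
## [1] 15

# The whole first layer is an ordinary 2 x 3 matrix
first_layer <- small_array[, , 1]
dim(first_layer)
## [1] 2 3
```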
Datasets for statistical analysis are typically stored in data frames, which combine the features of matrices and lists.
Real datasets usually combine variables of different types (heterogeneous), so data frames are well suited for storage.
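medical_record <- data.frame(name = c('John', 'Emily', 'Mary', 'Dan'),
weight = c(185, 150, 120, 225),
height = c(69, 62, 65, 72),
age = c(34.5, 55.6, 21.1, 51.1),
disease = c(T, F, T, T),
stringsAsFactors = TRUE) # since R 4.0.0 strings stay character by default; this stores name as a factor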
medical_record
## name weight height age disease
## 1 John 185 69 34.5 TRUE
## 2 Emily 150 62 55.6 FALSE
## 3 Mary 120 65 21.1 TRUE
## 4 Dan 225 72 51.1 TRUE
Since data frames share features of both matrices and lists, they can be subsetted with the methods for either matrices or lists.
medical_record[, 3]
## [1] 69 62 65 72
medical_record[1, 3]
## [1] 69
medical_record$name
## [1] John Emily Mary Dan
## Levels: Dan Emily John Mary
medical_record$name[3]
## [1] Mary
## Levels: Dan Emily John Mary
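Both subsetting styles can be combined with names and logical conditions. The following sketch uses a small hypothetical data frame, patients, so that it can be run on its own.

```r
patients <- data.frame(name = c("Ann", "Bob", "Cid"),
                       age = c(30, 41, 25))

# Matrix style: row by position, column by name
patients[2, "age"]
## [1] 41

# List style: a whole column as a vector
patients[["age"]]
## [1] 30 41 25

# Rows selected by a logical condition (here Ann and Bob)
older <- patients[patients$age > 28, ]
nrow(older)
## [1] 2
```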
Here is a series of plots describing the most extreme events in Europe during the last 250 years.
Source: Scientific Reports
- Create the atomic vector eur_runoff with the years when extreme runoff droughts happened across the whole of Europe (right-hand plot in the last panel of plots in the figure).
- Use the sort() function to arrange them from the earliest to the most recent one.
- Use the order() function to do the same thing.
- Create the 3-element list all_droughts with the years when extreme drought events happened in Europe, classified by drought type (precipitation, runoff, soil moisture).
- Access the list elements to estimate the average interval between each type of drought (hint).
- Create the data frame prcp_droughts_ceu with the precipitation droughts of CEU (first column and row in the figure) with four variables: ‘year’, ‘region’, ‘severity’, ‘area’.
Sometimes it is useful to examine the structure of an R object. We can do this with the dim() and str() functions:
dim(medical_record)
## [1] 4 5
str(medical_record)
## 'data.frame': 4 obs. of 5 variables:
## $ name : Factor w/ 4 levels "Dan","Emily",..: 3 2 4 1
## $ weight : num 185 150 120 225
## $ height : num 69 62 65 72
## $ age : num 34.5 55.6 21.1 51.1
## $ disease: logi TRUE FALSE TRUE TRUE
Here we can see the three most common data types in R: numerical, factor and logical, corresponding to quantitative, qualitative and logical data.
A factor is a vector object used to specify a discrete classification (grouping) of the components of other vectors of the same length, e.g. station, country, month etc.
station <- factor(c('Praha-Libus', 'Brno-Turany', 'Lysa Hora'))
station
## [1] Praha-Libus Brno-Turany Lysa Hora
## Levels: Brno-Turany Lysa Hora Praha-Libus
levels(station)
## [1] "Brno-Turany" "Lysa Hora" "Praha-Libus"
as.integer(station)
## [1] 3 1 2
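By default the levels are sorted alphabetically, which is why Praha-Libus is coded as 3 above. When another order is more natural, it can be given through the levels argument. The sketch below uses a hypothetical sizes factor.

```r
sizes <- factor(c("small", "large", "medium", "small"),
                levels = c("small", "medium", "large"))

levels(sizes)
## [1] "small"  "medium" "large"

# Integer codes follow the order of the levels
as.integer(sizes)
## [1] 1 3 2 1

# Number of observations per level
size_counts <- table(sizes)
```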
A typical use case is as a station indicator in data frames:
data.frame(station = station[c(1, 1, 2, 2, 3, 3, 1, 1)], rainfall = c(0, 0, 4, 5, 1, 1, 10, 1))
## station rainfall
## 1 Praha-Libus 0
## 2 Praha-Libus 0
## 3 Brno-Turany 4
## 4 Brno-Turany 5
## 5 Lysa Hora 1
## 6 Lysa Hora 1
## 7 Praha-Libus 10
## 8 Praha-Libus 1
Factors are also handy for classification and grouping, e.g., based on value or date, using the cut() function.
For example:
cut(rnorm(20), breaks = 4)
## [1] (0.42,1.44] (1.44,2.46] (1.44,2.46] (0.42,1.44]
## [5] (-0.598,0.42] (-1.62,-0.598] (0.42,1.44] (-1.62,-0.598]
## [9] (1.44,2.46] (0.42,1.44] (-1.62,-0.598] (-1.62,-0.598]
## [13] (-1.62,-0.598] (0.42,1.44] (-0.598,0.42] (-0.598,0.42]
## [17] (-0.598,0.42] (-1.62,-0.598] (0.42,1.44] (0.42,1.44]
## Levels: (-1.62,-0.598] (-0.598,0.42] (0.42,1.44] (1.44,2.46]
cut(rnorm(20), breaks = c(-Inf, -1, 0, 1, Inf))
## [1] (-1,0] (0,1] (-1,0] (0,1] (1, Inf] (-1,0] (-1,0]
## [8] (-1,0] (1, Inf] (-1,0] (0,1] (-Inf,-1] (-Inf,-1] (0,1]
## [15] (0,1] (1, Inf] (-1,0] (-1,0] (1, Inf] (-1,0]
## Levels: (-Inf,-1] (-1,0] (0,1] (1, Inf]
x <- rnorm(20)
cut(x, breaks = pretty(x, 4))
## [1] (0,1] (-1,0] (0,1] (-1,0] (-1,0] (1,2] (2,3] (0,1]
## [9] (0,1] (0,1] (-1,0] (-1,0] (-1,0] (-2,-1] (-1,0] (1,2]
## [17] (-1,0] (1,2] (-2,-1] (-2,-1]
## Levels: (-2,-1] (-1,0] (0,1] (1,2] (2,3]
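Because rnorm() draws random numbers, the outputs above change on every run. The sketch below uses fixed, hypothetical rainfall values instead, and adds the labels argument of cut() to give the classes readable names. By default each interval is closed on the right, so a value of exactly 1 would fall into the first class.

```r
# Hypothetical daily rainfall totals in mm
rainfall <- c(0, 0.4, 12, 3.5, 25, 7)

rain_class <- cut(rainfall,
                  breaks = c(-Inf, 1, 10, Inf),
                  labels = c("dry", "wet", "very wet"))

# Resulting classes: dry, dry, very wet, wet, very wet, wet
as.character(rain_class)

# Two observations fall into each class
table(rain_class)
```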
Logical comparisons (e.g. x < 0, x == 0) return logical vectors, which can be used to subset a vector (x[x < 0]):
x <- rnorm(10)
x
## [1] 0.2466275 0.9970717 0.4467337 0.7554737 -0.1569542 0.1007701
## [7] -0.1819114 -0.3369466 1.0686075 1.4106003
x < 0
## [1] FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE
x[x < 0]
## [1] -0.1569542 -0.1819114 -0.3369466
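Logical vectors can also be counted, located and combined. The sketch below uses a fixed, hypothetical vector v so that the results are reproducible.

```r
v <- c(-2, 0, 3, -1, 5)

# TRUE counts as 1, so sum() gives the number of negative values
sum(v < 0)
## [1] 2

# Positions of the negative values
which(v < 0)
## [1] 1 4

# Conditions can be combined with | (or) and & (and)
v[v < 0 | v > 4]
## [1] -2 -1  5

# Equality is tested with ==
v[v == 0]
## [1] 0
```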
Useful functions, classes and packages for working with dates:
- as.Date(x, format, ...), for formatting strings see ?strptime
- seq
- format
- base package: months, weekdays, quarters, julian
- data.table | lubridate packages: hour, year, month, mday, etc.
- POSIXlt, POSIXct (see ?DateTimeClasses)
- lubridate, chron
For example:
dates <- c("27/02/92", "22/03/92", "14/01/93", "28/10/95", "02/01/96")
dates <- as.Date(dates, format = "%d/%m/%y")
months(dates)
## [1] "February" "March" "January" "October" "January"
The cut() function can be used with date types as well. For example, we can split a sequence of days into groups of 3 days (the same can be done for months, years, etc.).
d <- seq(as.Date('1900-01-01'), length = 10, by = 'day')
cut(d, breaks = '3 days')
## [1] 1900-01-01 1900-01-01 1900-01-01 1900-01-04 1900-01-04 1900-01-04
## [7] 1900-01-07 1900-01-07 1900-01-07 1900-01-10
## Levels: 1900-01-01 1900-01-04 1900-01-07 1900-01-10
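Dates also support arithmetic and formatting. The following sketch uses two arbitrary dates, start_date and end_date, to show a date difference, a custom output format and a monthly sequence.

```r
start_date <- as.Date("2020-01-01")
end_date <- as.Date("2020-03-01")

# Subtracting dates gives a time difference in days
days_between <- as.numeric(end_date - start_date)
days_between
## [1] 60

# format() converts a date back to a string in a chosen layout
format(start_date, "%d/%m/%Y")
## [1] "01/01/2020"

# seq() can step by day, week, month or year
monthly <- seq(start_date, by = "month", length.out = 3)
format(monthly)
## [1] "2020-01-01" "2020-02-01" "2020-03-01"
```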
- Explore the structure of the data frame prcp_droughts_ceu with the str() function.
- Apply the x[x < 0] syntax to prcp_droughts_ceu to create a vector with all years after 1900 (hint: you might want to first create a new vector with all the year values).
- By combining the cut() and seq() functions, split the year values of prcp_droughts_ceu into 50-year intervals from 1760 to 2010 and store the result in the vector int_50.
- Use the seq() function to calculate how many days have passed since the last drought (hint: first create a date object).
- Use the plot() function to plot severity versus area from the prcp_droughts_ceu data frame.
Detailed information about using RStudio can be found at the RStudio website or in other web resources (for example).
Some of the material used in this workshop can be found here, as well as a lot of other interesting stuff.