Introduction to R/RStudio and Data Structures

Objectives
RStudio
R Basics
Basic data structures
Data types
Other resources:

Objectives

Get familiar with RStudio
Create and access the basic data structures in R
Explore the basic data types in R

If you are already familiar with these concepts and are interested in more advanced ones, you can have a look here.

RStudio

You can work directly in R, but most users prefer a graphical interface to interact with R more easily.

A popular choice is RStudio, an integrated development environment (IDE) that features:

- a powerful code/script editor featuring
  - syntax highlighting
  - code completion
  - smart indentation
  - tab completion for object names and function arguments
- tools for plotting, viewing R objects and code history
- some file management options
- GitHub connectivity
- cloud portability

[Figure: screenshot of the RStudio interface]

The first time you open RStudio, you will see three windows. The code editor is hidden by default, but can be opened by clicking the File drop-down menu, then New File, and then R Script.

RStudio Windows / Tabs	Description
Console Window	location where commands are entered and the output is printed
Source Tabs	built-in text editor
Environment Tab	interactive list of loaded R objects
History Tab	list of commands entered into the Console
Files Tab	file explorer to navigate folders
Plots Tab	output location for plots
Packages Tab	list of installed packages
Help Tab	output location for help commands and help search window
Viewer Tab	advanced tab for local web content

R Basics

Before we begin working in R, we should set our working directory (a folder to hold all of our project files). This directory is the location where all our input datasets are stored. It also serves as the default location for plots and other objects exported from R. Once set, it conveniently allows us to import data into R with just a file name, not the entire file path. To change the working directory in RStudio, select the Files Tab > More > Set As Working Directory, or use the functions getwd() and setwd() to get and set the working directory, respectively.
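As a minimal sketch of the two functions mentioned above, the snippet below stores the current working directory, switches to another folder, reads a file by name only, and switches back. The temporary folder from tempdir() and the file name example.csv are assumptions made for illustration; in practice you would pass the path of your own project folder.

```r
# remember the current working directory
old_wd <- getwd()

# switch to a temporary folder (a stand-in for a project folder)
setwd(tempdir())

# files in the working directory can be referred to by name only
writeLines("a,b\n1,2", "example.csv")
example_data <- read.csv("example.csv")

# return to the original working directory
setwd(old_wd)
```

Restoring the original directory at the end keeps the rest of the session unaffected.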

R code can be entered into the console directly or be saved as a script. We can run a command directly from a script by placing the cursor inside the command or highlighting the commands and hitting Ctrl-Enter. This will advance the cursor to the next command, where we can hit Ctrl-Enter again to run it.

Commands are separated either by a ; or by a newline.

R is case sensitive.

The # character signifies a comment, which is not executed.

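Commands can extend beyond one line of text: R keeps reading until the expression is complete, for example when a line ends with an operator such as + or a bracket is still open. In the console, continuation lines are marked with a + prompt.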
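The rules above can be tried out in a short sketch. The object names are arbitrary; the last command ends its first two lines with an operator, so R continues reading on the next line.

```r
# two commands on one line, separated by ;
x <- 2; y <- 3

# R is case sensitive: x and X are different objects
X <- 10

# an unfinished expression continues on the next line
total <- x +
  y +
  X

total
## [1] 15
```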

In R, data is stored in objects. To create an object, we assign a value to a name with the <- or = operator. A simple analogue for objects is a closet, where we can store similar (homogeneous) or different (heterogeneous) things of various sizes.

To print the contents of an object, we type the object’s name alone.

For example:

# assign the number 3 to object called my_closet
my_closet <- 3

# print contents
my_closet

## [1] 3

Functions perform most of the work on data in R.

Functions in R are much the same as they are in math; they perform some operation on an input and return some output. For example, the mathematical function $f(x) = x^2$ takes an input $x$ and returns its square. Similarly, the mean() function in R takes a vector of numbers and returns its mean. The inputs to functions are often referred to as arguments.

We have already discussed a few functions, such as getwd() and setwd(). We can also define our own with function():

get_square <- function(x){x ^ 2}

a <- 5
get_square(a)

## [1] 25

b <- 1:5
mean(b)

## [1] 3

Help files for R functions are accessed by preceding the name of the function with ? (e.g., ?seq) or by the F1 key.

In the help file, we can find the list of function arguments, in a specific order. Values for arguments can be specified either by name or position.
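As a small illustration of matching arguments by position or by name, here is a sketch with seq(), whose first arguments are from, to and by (see ?seq). The object names are chosen only for this example.

```r
# arguments matched by position: from, to, by
by_position <- seq(2, 10, 2)

# the same call with named arguments, in a different order
by_name <- seq(by = 2, to = 10, from = 2)

# mixing both: positional arguments first, then a named one
mixed <- seq(2, 10, length.out = 5)

by_position
## [1]  2  4  6  8 10
```

Naming arguments makes a call easier to read and removes the need to remember their order.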

Basic data structures

Basic data structures can be organised by their dimensionality (1-D, 2-D, or N-D) and whether they are homogeneous (all contents must be of the same type) or heterogeneous (the contents can be of different types). The most common data structures used in data analysis are:

	Homogeneous	Heterogeneous
1-D	Atomic vector	List
2-D	Matrix	Data frame
N-D	Array

The basic data structure in R is the vector. Vectors come in two flavours: atomic vectors and lists. Both are one-dimensional, but they differ in the types of their elements: all elements of an atomic vector must be of the same type (homogeneous), whereas the elements of a list can have different types (heterogeneous). The elements of an atomic vector can be numerical, categorical or logical.

Here are some atomic vectors, which are created and printed by:

just_a_vector <- c(3, 5, -2, 24, 1, 0, 2, 1) 

my_fav_fruits <- c('apple', 'orange', 'pear') # This is not true

b <- 4.17

rainy_days <- c(T, F, T, T)

my_fav_fruits

## [1] "apple"  "orange" "pear"
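Because all elements of an atomic vector must share one type, c() silently converts mixed inputs to the most general type among them. The sketch below (with made-up object names) shows this, and contrasts it with a list, which keeps each element's own type.

```r
# mixing numbers, strings and logicals: everything becomes character
mixed_vector <- c(1, 'apple', TRUE)
typeof(mixed_vector)
## [1] "character"

# mixing logical and numeric: logicals become numbers (TRUE -> 1)
numbers_and_logicals <- c(TRUE, FALSE, 3.5)
typeof(numbers_and_logicals)
## [1] "double"

# a list keeps the type of each element
mixed_list <- list(1, 'apple', TRUE)
element_types <- sapply(mixed_list, typeof)
element_types
## [1] "double"    "character" "logical"
```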

Vector elements can be accessed or subsetted by specifying a vector of numbers inside [].

my_fav_fruits[3] #I hate pears

## [1] "pear"

rainy_days[1:2]

## [1]  TRUE FALSE

In addition, the elements can be named and then accessed or subsetted by name, using quoted names inside []:

rainy_days <- c(Mon = T, Tue = F, Wed = T, Thu = T)
rainy_days[c('Tue', 'Wed')]

##   Tue   Wed 
## FALSE  TRUE
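A few more indexing patterns, as a sketch on the numeric vector defined earlier (repeated here so the snippet runs on its own; the names of the result objects are arbitrary):

```r
just_a_vector <- c(3, 5, -2, 24, 1, 0, 2, 1)

# several positions at once
first_and_fourth <- just_a_vector[c(1, 4)]
first_and_fourth
## [1]  3 24

# negative positions drop elements instead of selecting them
without_first <- just_a_vector[-1]

# length() gives the number of elements, so this is the last one
last_element <- just_a_vector[length(just_a_vector)]
last_element
## [1] 1
```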

Like atomic vectors, lists are 1-D structures, but their elements can be a mixture of types: often vectors (of any length), but also other lists, matrices and data frames.

Lists are created and printed by:

fruits <- list(
  weight_deka = c(2, 5, 3, 4, 5),
  type = c('apple', 'pear'),
  fresh = T,
  owners = c('John', 'Jane'),
  quality = c(3.25, 1.17, 2, 2.1)
)

fruits

## $weight_deka
## [1] 2 5 3 4 5
## 
## $type
## [1] "apple" "pear" 
## 
## $fresh
## [1] TRUE
## 
## $owners
## [1] "John" "Jane"
## 
## $quality
## [1] 3.25 1.17 2.00 2.10

There are a couple of ways to access list elements. The most common are [[]] and $:

fruits[[2]]

## [1] "apple" "pear"

fruits$owners

## [1] "John" "Jane"

fruits$quality[3:4] # accessing the third and fourth elements of the quality vector

## [1] 2.0 2.1
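Single brackets also work on lists, but they behave differently from [[]] and $. The sketch below uses a shortened version of the fruits list; the element colour is a made-up addition for illustration.

```r
fruits <- list(type = c('apple', 'pear'), fresh = TRUE)

# [[ ]] and $ extract the element itself (here a character vector)
class(fruits[['type']])
## [1] "character"

# [ ] returns a shorter list that still wraps the element
class(fruits['type'])
## [1] "list"

# $ can also add or replace an element
fruits$colour <- c('red', 'green')
names(fruits)
## [1] "type"   "fresh"  "colour"
```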

Matrices are 2-D, homogeneous data structures that can be generated manually with matrix(). The input to matrix() is a one-dimensional vector, which is reshaped into a two-dimensional matrix according to the dimensions specified by the user in the arguments nrow and ncol (generally only one is needed).

The matrix is filled down the columns by default, but this can be changed by setting the byrow argument to TRUE.

bad_apples <- matrix(c(6, 9, 9, 1, 0, 4, 4, 4, 8, 7, 9, 0, 8, 0, 7, 5, 3, 2, 9, 4, 7, 7, 1, 4, 5), nrow = 5)

bad_apples

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    6    4    9    5    7
## [2,]    9    4    0    3    7
## [3,]    9    4    8    2    1
## [4,]    1    8    0    9    4
## [5,]    0    7    7    4    5
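To see the effect of the byrow argument, here is a small sketch that fills the same six numbers in both ways (the object names are arbitrary):

```r
# filled down the columns (default)
by_column <- matrix(1:6, nrow = 2)
by_column
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

# filled along the rows
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)
by_row
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
```

In both cases ncol is inferred from the length of the input and nrow.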

Matrix elements can be accessed with matrix[row, column] notation.

Omitting row accesses all rows, and omitting column accesses all columns.

bad_apples[2, 4]

## [1] 3

bad_apples[2, ]

## [1] 9 4 0 3 7

bad_apples[, 4]

## [1] 5 3 2 9 4

Arrays extend matrices to more than two dimensions:

a_big_matrix <- matrix(seq(from = 2, to = 32, by = 2), nrow = 4)
a_random_array <- array(a_big_matrix, dim = c(2, 2, 4))

a_big_matrix

##      [,1] [,2] [,3] [,4]
## [1,]    2   10   18   26
## [2,]    4   12   20   28
## [3,]    6   14   22   30
## [4,]    8   16   24   32

a_random_array

## , , 1
## 
##      [,1] [,2]
## [1,]    2    6
## [2,]    4    8
## 
## , , 2
## 
##      [,1] [,2]
## [1,]   10   14
## [2,]   12   16
## 
## , , 3
## 
##      [,1] [,2]
## [1,]   18   22
## [2,]   20   24
## 
## , , 4
## 
##      [,1] [,2]
## [1,]   26   30
## [2,]   28   32
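Array elements are accessed like matrix elements, with one index per dimension. A minimal sketch, rebuilding the same array directly from the sequence so that the snippet runs on its own:

```r
a_random_array <- array(seq(from = 2, to = 32, by = 2), dim = c(2, 2, 4))

# size of each dimension
dim(a_random_array)
## [1] 2 2 4

# row 1, column 2 of the third 2 x 2 slice
a_random_array[1, 2, 3]
## [1] 22

# omitting the first two indices returns the whole fourth slice as a matrix
fourth_slice <- a_random_array[, , 4]
```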

Datasets for statistical analysis are typically stored in data frames, which combine the features of matrices and lists.

- Similarly to matrices, data frames are rectangular, where the columns are variables and the rows are observations of those variables.
- Similarly to lists, data frames can have elements (column vectors) of different data types (some double, some character, etc.), but they must all be of equal length.

Real datasets usually combine variables of different types (heterogeneous), so data frames are well suited for storage.

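# stringsAsFactors = TRUE stores the names as a factor, as in the output below
# (this was the default behaviour before R 4.0.0)
medical_record <- data.frame(name = c('John', 'Emily', 'Mary', 'Dan'),
                             weight = c(185, 150, 120, 225),
                             height = c(69, 62, 65, 72),
                             age = c(34.5, 55.6, 21.1, 51.1),
                             disease = c(T, F, T, T),
                             stringsAsFactors = TRUE)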

medical_record

##    name weight height  age disease
## 1  John    185     69 34.5    TRUE
## 2 Emily    150     62 55.6   FALSE
## 3  Mary    120     65 21.1    TRUE
## 4   Dan    225     72 51.1    TRUE

Since data frames share features of both matrices and lists, they can be subsetted with the methods for either matrices or lists.

medical_record[, 3]

## [1] 69 62 65 72

medical_record[1, 3]

## [1] 69

medical_record$name

## [1] John  Emily Mary  Dan  
## Levels: Dan Emily John Mary

medical_record$name[3]

## [1] Mary
## Levels: Dan Emily John Mary
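Columns can also be selected by name, and rows by a condition. The following sketch uses a reduced version of the medical record (three of the five columns) so that it runs on its own; the threshold of 160 and the object names heavy and name_and_disease are arbitrary choices for illustration.

```r
medical_record <- data.frame(name = c('John', 'Emily', 'Mary', 'Dan'),
                             weight = c(185, 150, 120, 225),
                             disease = c(TRUE, FALSE, TRUE, TRUE))

# one column by name: matrix style and list style give the same vector
medical_record[, 'weight']
medical_record[['weight']]

# rows that satisfy a condition, all columns
heavy <- medical_record[medical_record$weight > 160, ]
heavy$weight
## [1] 185 225

# several columns by name
name_and_disease <- medical_record[, c('name', 'disease')]
```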

Exercises: Droughts in Europe (part I)

Here is a series of plots describing the most extreme events in Europe during the last 250 years.

Source: Scientific Reports

Create the atomic vector eur_runoff with the years when extreme runoff droughts happened across the whole of Europe (right-hand plot in the last panel of plots in the figure).

Use the sort() function to order them from the earliest to the most recent one.

Use the order() function to do the same thing.

Create the 3-element list all_droughts with the years when extreme drought events happened in Europe, classified by drought type (precipitation, runoff, soil moisture).

Access the list elements to estimate the average interval between droughts of each type (hint)

Create the data frame prcp_droughts_ceu with the precipitation droughts of CEU (first column and row in figure) with four variables: ‘year’, ‘region’, ‘severity’, ‘area’

Data types

Sometimes it is useful to examine the structure of an R object. We can do this with the dim() and str() functions:

dim(medical_record)

## [1] 4 5

str(medical_record)

## 'data.frame':    4 obs. of  5 variables:
##  $ name   : Factor w/ 4 levels "Dan","Emily",..: 3 2 4 1
##  $ weight : num  185 150 120 225
##  $ height : num  69 62 65 72
##  $ age    : num  34.5 55.6 21.1 51.1
##  $ disease: logi  TRUE FALSE TRUE TRUE
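Besides dim() and str(), a few other base R functions answer narrower questions about an object. A sketch on a two-column version of the medical record (reduced only to keep the example short):

```r
medical_record <- data.frame(weight = c(185, 150, 120, 225),
                             disease = c(TRUE, FALSE, TRUE, TRUE))

# number of rows and columns separately
nrow(medical_record)
## [1] 4
ncol(medical_record)
## [1] 2

# column names
names(medical_record)
## [1] "weight"  "disease"

# class of the whole object and of single columns
class(medical_record)
## [1] "data.frame"
class(medical_record$weight)
## [1] "numeric"
class(medical_record$disease)
## [1] "logical"
```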

Here we can see the three most common data types in R: numerical, factor and logical, corresponding to quantitative, qualitative and logical data.

A factor is a vector object used to specify a discrete classification (grouping) of the components of other vectors of the same length, e.g. station, country, month etc.

station <- factor(c('Praha-Libus', 'Brno-Turany', 'Lysa Hora'))
station

## [1] Praha-Libus Brno-Turany Lysa Hora  
## Levels: Brno-Turany Lysa Hora Praha-Libus

levels(station)

## [1] "Brno-Turany" "Lysa Hora"   "Praha-Libus"

as.integer(station)

## [1] 3 1 2
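As the output above shows, levels are sorted alphabetically by default, and the integer codes follow that order. When another order is more natural, it can be set with the levels argument of factor(). A sketch with made-up month labels:

```r
months_observed <- c('Mar', 'Jan', 'Feb', 'Jan')

# default: levels sorted alphabetically
default_levels <- levels(factor(months_observed))
default_levels
## [1] "Feb" "Jan" "Mar"

# levels given explicitly, in calendar order
months_factor <- factor(months_observed, levels = c('Jan', 'Feb', 'Mar'))
as.integer(months_factor)
## [1] 3 1 2 1

# number of observations per level
month_counts <- table(months_factor)
```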

A typical use case is as an indicator of the station in a data frame:

data.frame(station = station[c(1, 1, 2, 2, 3, 3, 1, 1)], rainfall = c(0, 0, 4, 5, 1, 1, 10, 1))

##       station rainfall
## 1 Praha-Libus        0
## 2 Praha-Libus        0
## 3 Brno-Turany        4
## 4 Brno-Turany        5
## 5   Lysa Hora        1
## 6   Lysa Hora        1
## 7 Praha-Libus       10
## 8 Praha-Libus        1

Factors are also handy for classification and grouping, e.g., based on value or date, using the cut() function.

For example:

Split 20 random numbers into 4 classes:

cut(rnorm(20), breaks = 4)

##  [1] (0.42,1.44]    (1.44,2.46]    (1.44,2.46]    (0.42,1.44]   
##  [5] (-0.598,0.42]  (-1.62,-0.598] (0.42,1.44]    (-1.62,-0.598]
##  [9] (1.44,2.46]    (0.42,1.44]    (-1.62,-0.598] (-1.62,-0.598]
## [13] (-1.62,-0.598] (0.42,1.44]    (-0.598,0.42]  (-0.598,0.42] 
## [17] (-0.598,0.42]  (-1.62,-0.598] (0.42,1.44]    (0.42,1.44]   
## Levels: (-1.62,-0.598] (-0.598,0.42] (0.42,1.44] (1.44,2.46]

Define the classes manually:

cut(rnorm(20), breaks = c(-Inf, -1, 0, 1, Inf))

##  [1] (-1,0]    (0,1]     (-1,0]    (0,1]     (1, Inf]  (-1,0]    (-1,0]   
##  [8] (-1,0]    (1, Inf]  (-1,0]    (0,1]     (-Inf,-1] (-Inf,-1] (0,1]    
## [15] (0,1]     (1, Inf]  (-1,0]    (-1,0]    (1, Inf]  (-1,0]   
## Levels: (-Inf,-1] (-1,0] (0,1] (1, Inf]

Let R decide the value of the breaks but use “pretty” numbers. Note that the number of classes is only a suggestion in this case:

x <- rnorm(20)
cut(x, breaks = pretty(x, 4))

##  [1] (0,1]   (-1,0]  (0,1]   (-1,0]  (-1,0]  (1,2]   (2,3]   (0,1]  
##  [9] (0,1]   (0,1]   (-1,0]  (-1,0]  (-1,0]  (-2,-1] (-1,0]  (1,2]  
## [17] (-1,0]  (1,2]   (-2,-1] (-2,-1]
## Levels: (-2,-1] (-1,0] (0,1] (1,2] (2,3]
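The examples above use random numbers, so their output changes on every run. The sketch below uses fixed values instead and adds the labels argument of cut() to give the classes readable names; the values and the label texts are made up for illustration.

```r
values <- c(-1.5, -0.2, 0.3, 0.8, 2.1)

# same manual breaks as above, with a label for each interval
classes <- cut(values,
               breaks = c(-Inf, -1, 0, 1, Inf),
               labels = c('very low', 'low', 'high', 'very high'))

as.character(classes)
## [1] "very low"  "low"       "high"      "high"      "very high"

# number of values in each class
class_counts <- table(classes)
```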

Logical vectors

- typically result from comparison operations like x < 0
- are often used as an index for subsetting, e.g., x[x < 0]
- note that for the equality condition we use x == 0

x <- rnorm(10)
x

##  [1]  0.2466275  0.9970717  0.4467337  0.7554737 -0.1569542  0.1007701
##  [7] -0.1819114 -0.3369466  1.0686075  1.4106003

x < 0

##  [1] FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE

x[x < 0]

## [1] -0.1569542 -0.1819114 -0.3369466
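Logical vectors can also be counted, located and combined. A sketch with fixed values, so that the results are reproducible (the object names are arbitrary):

```r
x <- c(0.5, -1.2, 0, 3.1, -0.4)

# TRUE counts as 1, so sum() gives the number of negative values
n_negative <- sum(x < 0)
n_negative
## [1] 2

# positions of the negative values
which(x < 0)
## [1] 2 5

# equality uses ==, not =
x[x == 0]
## [1] 0

# conditions can be combined with | (or) and & (and)
outside <- x[x < 0 | x > 1]
outside
## [1] -1.2  3.1 -0.4
```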

Date (and Time)

- conversion from other data types with as.Date(x, format, ...); for format strings see ?strptime
- creating sequences with seq
- formatting with format
- easy extraction (base package): months, weekdays, quarters, julian
- easy extraction (data.table or lubridate packages): hour, year, month, mday, etc.
- classes including time: POSIXlt, POSIXct (see ?DateTimeClasses)
- useful packages: lubridate, chron

For example:

dates <- c("27/02/92", "22/03/92", "14/01/93", "28/10/95", "02/01/96")
dates <- as.Date(dates, format = "%d/%m/%y")
months(dates)

## [1] "February" "March"    "January"  "October"  "January"
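Some of the other operations listed above can be sketched with base R only. The dates are arbitrary examples; results that depend on the language settings of the system (such as month or weekday names) are left out on purpose.

```r
start <- as.Date('2017-01-01')
workshop <- as.Date('20.09.2017', format = '%d.%m.%Y')

# difference between two dates, in days
days_between <- as.numeric(workshop - start)
days_between
## [1] 262

# a monthly sequence of three dates
first_of_month <- seq(start, by = 'month', length.out = 3)

# formatting dates as text
formatted <- format(first_of_month, '%d/%m/%Y')
formatted
## [1] "01/01/2017" "01/02/2017" "01/03/2017"
```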

The cut() function can be used with date types as well. For example, we can split a sequence of days into groups of 3 days (the same can be done for months, years etc.)

d <- seq(as.Date('1900-01-01'), length.out = 10, by = 'day')
cut(d, breaks = '3 days')

##  [1] 1900-01-01 1900-01-01 1900-01-01 1900-01-04 1900-01-04 1900-01-04
##  [7] 1900-01-07 1900-01-07 1900-01-07 1900-01-10
## Levels: 1900-01-01 1900-01-04 1900-01-07 1900-01-10

Exercises: Droughts in Europe (part II)

Explore the structure of the data frame prcp_droughts_ceu with the str() function.

Apply the x[x < 0] syntax to prcp_droughts_ceu to create a vector with all years after 1900 (hint: you might want to first create a new vector with all the year values).

By combining the cut() and seq() functions, split the year values of prcp_droughts_ceu into 50-year intervals from 1760 to 2010 and store the result in the vector int_50.

Use the seq() function to calculate how many days have passed since the last drought (hint: first create a date object).

Use the plot() function to plot severity versus area from the prcp_droughts_ceu data frame.

Other resources:

Detailed information about using RStudio can be found at RStudio Website or in other web resources (for example).

Some of the material used in this workshop can be found here, as well as a lot of other interesting stuff.

Yannis Markonis

20.09.2017