R training: Basic R

IMPACT Initiatives - Iraq (Apr 2021)

1. What is R?

R is a powerful, extensible environment for statistical computing. It offers a wide range of statistical, general data analysis and visualization capabilities.

Install R & RStudio: Download both R and RStudio. Install R first and then RStudio.

2. RStudio interface

3. Basic terms

4. Basic data structure (vectors)

A vector is the most common and basic data structure in R, the workhorse of R. It is a collection of values of the same basic type. The c() function returns a vector.

# define the vector
myVector <- c(1, 50, 9, 42)

# display the vector
myVector
## [1]  1 50  9 42
govs <- c("Anbar", "Baghdad", "Basra", "Duhok", "Kirkuk", "Ninevah")
govs
## [1] "Anbar"   "Baghdad" "Basra"   "Duhok"   "Kirkuk"  "Ninevah"
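
Because a vector holds values of a single basic type, R converts mixed inputs to the most general type without warning. The sketch below illustrates this, and also shows how square brackets select elements by position. The names mixed and govs_demo are illustrative only and not part of the training material.

```r
# Mixing a number, text and a logical: everything becomes character
mixed <- c(1, "a", TRUE)
class(mixed)          # "character"

# Select elements by position (R counts from 1)
govs_demo <- c("Anbar", "Baghdad", "Basra")
govs_demo[2]          # "Baghdad"
govs_demo[c(1, 3)]    # "Anbar" "Basra"
```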

5. Operators

5.1 Assignment

element <- "this is a value"
element
## [1] "this is a value"

5.2 Arithmetic

Examples of arithmetic operators:

3 - 2
## [1] 1
v <- c(2, 5, 6)
t <- c(8, 3, 4)
v + t
## [1] 10  8 10
5^2 * 2
## [1] 50

5.3 Relational

Examples of relational operators:

1 == 2
## [1] FALSE
4 > 3
## [1] TRUE
"a" != "b"
## [1] TRUE
v <- c(2, 5, 6, 9)
t <- c(8, 3, 4, 9)
v > t
## [1] FALSE  TRUE  TRUE FALSE

5.4 Logical

1 & 3 < 4
## [1] TRUE
1 & 3 > 4
## [1] FALSE
1 | 3 < 2
## [1] TRUE
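
In the examples above, the comparison (< or >) is evaluated before & or |, and any non-zero number counts as TRUE. A sketch with explicit parentheses and a vector may make the behaviour easier to follow; the names x, in_range and outside are illustrative.

```r
x <- c(2, 5, 9)

# & (AND) is TRUE only where both conditions hold, element by element
in_range <- (x > 3) & (x < 8)
in_range              # FALSE TRUE FALSE

# | (OR) is TRUE where at least one condition holds
outside <- (x < 3) | (x > 8)
outside               # TRUE FALSE TRUE

# ! (NOT) flips TRUE and FALSE
!in_range             # TRUE FALSE TRUE
```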

5.5 Other operators

v <- 2:8
v
## [1] 2 3 4 5 6 7 8
v <- 5
t <- c(1:7)

v %in% t
## [1] TRUE
9 %in% t
## [1] FALSE
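
The %in% operator also works when the left-hand side is a vector, which makes it handy for filtering. A minimal sketch, using hypothetical vectors surveyed and target:

```r
surveyed <- c("Anbar", "Duhok", "Erbil")
target   <- c("Anbar", "Baghdad", "Basra", "Duhok")

# One TRUE/FALSE per element of the left-hand vector
surveyed %in% target              # TRUE TRUE FALSE

# Keep only the surveyed governorates that appear in the target list
surveyed[surveyed %in% target]    # "Anbar" "Duhok"
```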

6. Data types

6.1 Character

The character class is your typical string: a sequence of characters (letters, digits, spaces, punctuation) enclosed in quotation marks.

myString <- "Hello R!"
myString
## [1] "Hello R!"

You can check the class associated with the object we defined above, by wrapping it in the class() function.

class(myString)
## [1] "character"

6.2 Numeric

The numeric class holds numbers such as 10, 15.6, -48792.54989827 and so on.

myNum <- 10.4343
myNum
## [1] 10.4343
class(myNum)
## [1] "numeric"

6.3 Integer

Integers are whole numbers. Note that R stores a number typed without the L suffix as numeric, even when it has no decimal part. To get an integer, convert the value with as.integer() or add the L suffix.

myInt <- 209173987
class(myInt)
## [1] "numeric"
myInt <- as.integer(myInt)
class(myInt)
## [1] "integer"
myInt <- 209173987L
class(myInt)
## [1] "integer"
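
The difference between integer and numeric also shows up in calculations. The sketch below uses illustrative names a_int and b_int and is only meant to show when the integer class is kept and when it is lost:

```r
a_int <- 5L
b_int <- 2L

class(a_int + b_int)   # "integer": both inputs are integers
class(a_int / b_int)   # "numeric": division returns a numeric (2.5)
class(a_int * 2)       # "numeric": 2 without the L suffix is numeric
class(a_int * 2L)      # "integer"
is.integer(7)          # FALSE: 7 without the L suffix is numeric
```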

6.4 Logical

Logical types (booleans) work as in most other languages and take one of two values: true or false. True is written TRUE and false is written FALSE. The shorthands T and F also work, but they are ordinary variables that can be overwritten, so the full names are safer.

Logical_1 <- TRUE 
class(Logical_1)
## [1] "logical"
Logical_2 <- 2 < 1 
class(Logical_2)
## [1] "logical"

6.5 Factors

A factor is a type of vector that is used to store categorical data. Each unique category is referred to as a factor level (category = level).

Strlevels <- c("low", "high", "medium", "high", "low", "medium", "high")
Factorlevels <- factor(Strlevels)
Factorlevels
## [1] low    high   medium high   low    medium high  
## Levels: high low medium
class(Factorlevels)
## [1] "factor"
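
As the output above shows, factor() orders levels alphabetically by default. When the categories have a natural order, the levels argument can set it explicitly. A minimal sketch with illustrative names sizes and sizes_f:

```r
sizes <- c("low", "high", "medium", "high", "low")

# State the level order explicitly instead of relying on alphabetical order
sizes_f <- factor(sizes, levels = c("low", "medium", "high"))
levels(sizes_f)        # "low" "medium" "high"

# Count the observations per level, reported in the order given above
table(sizes_f)
```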

6.6 Lists

Lists are R objects whose elements can be of different types: numbers, strings, vectors, another list, a matrix, a function. Lists are created using list().

list_data <- list(c("Red", "Green"), c(21,32,11), TRUE, 51.23)
list_data
## [[1]]
## [1] "Red"   "Green"
## 
## [[2]]
## [1] 21 32 11
## 
## [[3]]
## [1] TRUE
## 
## [[4]]
## [1] 51.23
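
List elements can also be named and then extracted by name or position. The sketch below uses a hypothetical list called survey. Double brackets and $ return the element itself, while single brackets return a shorter list.

```r
survey <- list(site = "Mosul", scores = c(21, 32, 11), verified = TRUE)

survey[["scores"]]        # the numeric vector 21 32 11
survey$site               # "Mosul", shorthand for survey[["site"]]
survey[[2]][1]            # 21, the first value of the second element
class(survey["scores"])   # "list": single brackets keep the list wrapper
```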

6.7 Matrices

A matrix in R is a collection of vectors of the same length and the same data type. Vectors can be combined as columns or as rows to create a 2-dimensional structure. By default, matrix() fills the values column by column. Matrices are mostly used for mathematical calculations.

# use a name other than "matrix" so the object is not confused with the function
myMatrix <- matrix(c(3, 9, -1, 4, 2, 6), nrow = 2)
myMatrix
##      [,1] [,2] [,3]
## [1,]    3   -1    2
## [2,]    9    4    6
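
To fill a matrix row by row instead, matrix() accepts byrow = TRUE. The sketch below contrasts the two layouts and shows indexing by row and column; the names by_col and by_row are illustrative.

```r
values <- c(3, 9, -1, 4, 2, 6)

by_col <- matrix(values, nrow = 2)                 # default: column by column
by_row <- matrix(values, nrow = 2, byrow = TRUE)   # row by row

by_row
dim(by_row)      # 2 3, i.e. two rows and three columns

# Index with [row, column]
by_col[1, 3]     # 2
by_row[1, 3]     # -1
```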

6.8 Data frame

A data frame is the most common way of storing data in R and, generally, is the data structure most often used for data analyses. Under the hood, a data frame is a list of equal-length vectors. Each element of the list can be thought of as a column and the length of each element of the list is the number of rows. As a result, data frames can store different classes of objects in each column (i.e. numeric, character, factor). Typically, data frames have column headers (i.e. variable names).

In essence, the easiest way to think of a data frame is as an Excel worksheet whose columns can hold different types of data but all have the same number of rows.

Example

Let’s define a data frame, using the data.frame() function, and then have a look at it:

df <- data.frame(col1 = 1:3, 
                 col2 = c("this", "is", "text"), 
                 variable_name3 = c(TRUE, FALSE, TRUE), 
                 Variable.4 = c(2.5, 4.2, pi))

df
##   col1 col2 variable_name3 Variable.4
## 1    1 this           TRUE   2.500000
## 2    2   is          FALSE   4.200000
## 3    3 text           TRUE   3.141593
class(df$col1)
## [1] "integer"
class(df$col2)
## [1] "character"
class(df$variable_name3)
## [1] "logical"
class(df$Variable.4)
## [1] "numeric"

Importing data in R

Typically, you do not create your own data frame in R, but load in a dataset, which you then use to do calculations on.

Generally, it is good practice to work with CSV files rather than Excel sheets. As needed, you define the character separating cells in your CSV file, and specify strings that should be treated as missing values (or NA):

data <- read.csv("data/example_data.csv")
data <- read.csv("data/example_data.csv", na.strings = c("NA", ""), sep = ",")

More documentation on data input functions is available in R's built-in help, for example by running ?read.csv.

Working with “projects” (.Rproj): RStudio projects make it straightforward to divide your work into multiple contexts, each with its own working directory, workspace, history, and source documents.

Click on “File >> New Project” to create a new project file. Once a project file is created, open it (which opens a new R session) and open your scripts from within RStudio (via the bottom right “Files” pane).

A major advantage of working with project files is that they automatically set the working directory to where the respective project documents are saved. When you load in datasets from your project folder, it means the document path will be relative and therefore much cleaner (as in the above example).

More information on projects is available in the RStudio documentation.

Inspecting your data

All data structures:

  • str(): compact display of an object's structure and contents
  • class(): returns the class (data type) of an object
  • summary(): summarizes the data in a way that depends on its class
  • head(): prints the first entries (six by default) in the console
  • tail(): prints the last entries (six by default) in the console

Vector and factor variables:

  • length(): returns the number of elements in the vector or factor

Dataframe and matrix variables:

  • dim(): returns the dimensions of the data frame or matrix (i.e. the number of rows and columns)
  • nrow(): returns the number of rows in the dataset
  • ncol(): returns the number of columns in the dataset
  • rownames(): returns the row names in the dataset
  • colnames(): returns the column names in the dataset
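
The functions listed above can be tried on any small dataset. The sketch below builds a hypothetical data frame called visits (the values are invented for illustration) and applies several of them:

```r
# A small made-up data frame, for illustration only
visits <- data.frame(
  governorate = c("Anbar", "Baghdad", "Basra", "Duhok"),
  households  = c(120, 340, 210, 95)
)

str(visits)                  # structure: column names, types, first values
dim(visits)                  # 4 2, i.e. four rows and two columns
nrow(visits)                 # 4
ncol(visits)                 # 2
colnames(visits)             # "governorate" "households"
head(visits, 2)              # first two rows
summary(visits$households)   # minimum, quartiles, mean and maximum
length(visits$households)    # 4 elements in this column
```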

7. Functions

A key feature of R is functions. Functions are self-contained modules of code that accomplish a specific task. Functions usually take in some sort of data structure (value, vector, data frame etc.), process it, and return a result.

The syntax of functions always follows the same structure: the name of the function is followed by parentheses (), in which the input arguments are specified.

Examples of base R functions include seq(), mean() and max():

a <- c(7,3,12,19)
mean(a)
## [1] 10.25
b <- c(7,3,12, NA)
mean(b, na.rm = TRUE)
## [1] 7.333333
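
Arguments can be passed by name or by position. A short sketch with seq() and max(), using the illustrative name evens:

```r
# Named arguments: a sequence from 2 to 10 in steps of 2
evens <- seq(from = 2, to = 10, by = 2)
evens            # 2 4 6 8 10

# The same call with arguments matched by position
seq(2, 10, 2)

# max() returns the largest value in the vector
max(evens)       # 10

# The built-in help page lists every argument of a function:
# ?seq
```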

There are many, many more functions that are available. As we will see later, R also allows users to define their own functions.

8. Missing values (NA)

In R, missing values are represented by the symbol NA (not available). Missing values are not the same as empty values (""): an empty value is a blank string that R treats like any other value, while NA marks a value that is truly missing.

Test for missing values

Here is how you can test for missing data:

v <- c(7,3,12,NA)
is.na(v)
## [1] FALSE FALSE FALSE  TRUE

Exclude missing values

Some functions return NA when missing values are not explicitly excluded:

v <- c(7,3,12,NA)
mean(v)
## [1] NA

Passing an argument to remove missing values (na.rm = TRUE) may be needed for the function to work as intended:

mean(v, na.rm = TRUE)
## [1] 7.333333
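
Because is.na() returns a logical vector, it can also be used to count missing values or to drop them before further processing. A minimal sketch with illustrative names v_na and v_complete:

```r
v_na <- c(7, 3, 12, NA)

# TRUE counts as 1, so the sum is the number of missing values
sum(is.na(v_na))                   # 1

# Keep only the values that are not missing
v_complete <- v_na[!is.na(v_na)]
v_complete                         # 7 3 12

# Same result as mean(v_na, na.rm = TRUE)
mean(v_complete)
```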

Specifying NAs when importing datasets

When importing a dataset into R, make sure you specify the NA values in the na.strings argument as needed. Failing to do so may affect your analysis later on, as blanks ("") are regular values that will be evaluated and therefore might skew your results.

data <- read.csv("data/example_data.csv", na.strings = c("NA", ""))
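
The effect of na.strings can be checked without a file on disk, because read.csv() also accepts the CSV content through its text argument. The sketch below uses a small invented dataset (csv_text, default_read and strict_read are illustrative names) and sets stringsAsFactors = FALSE explicitly so that it behaves the same on older R versions:

```r
# Invented CSV content: one blank cell and one literal NA in "district"
csv_text <- "name,district,age
Ali,Mosul,30
Sara,,25
Omar,NA,41"

# Default import: only the literal NA is treated as missing
default_read <- read.csv(text = csv_text, stringsAsFactors = FALSE)
sum(is.na(default_read$district))   # 1

# Blank cells are now treated as missing too
strict_read <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                        na.strings = c("NA", ""))
sum(is.na(strict_read$district))    # 2
```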

9. R packages

R packages are collections of functions and datasets developed by the community. They increase the power of R by improving existing base R functionalities, or by adding new ones.

Whether you want to analyze stratified household-level data, make maps of migration routes and charts for your reports, extract price data from the web, or build an online dashboard, R packages have you covered. The official repository (CRAN) hosts more than 10,000 packages published by developers from all over the world, and many more are publicly available elsewhere on the internet.

The directory where packages are stored is called the library. R comes with a standard set of packages. Others are available for download and installation. Once installed, they have to be loaded into the session to be used.

Run the following code (with the package name in quotation marks) to install a package (once):

install.packages("dplyr")

Then run the following line (once per active R session):

library(dplyr)
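
To avoid reinstalling packages that are already in the library, it can help to check first. The sketch below is one possible approach, not a required workflow: it uses requireNamespace() to test whether each package is available, and leaves the install step commented out so nothing is downloaded unintentionally. The vector name pkgs and the choice of packages are illustrative.

```r
pkgs <- c("stats", "dplyr")

# TRUE for each package that is installed and can be loaded
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
installed

# Names of the packages that still need to be installed
missing_pkgs <- pkgs[!installed]
# install.packages(missing_pkgs)   # uncomment to install anything missing

# A single function can be called without library() via package::function
stats::median(c(7, 3, 12))         # 7
```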

Popular packages that you might want to check out if you start using R regularly are:
