R is a powerful, extensible environment for statistical computing. It offers a wide range of capabilities for statistics, general data analysis, and visualization.
Install R & RStudio: Download both R and RStudio. Install R first and then RStudio.
A vector is the most common and basic data structure in R, the workhorse of R. It is a collection of values of the same basic type. The c() function returns a vector.
## [1] 1 50 9 42
## [1] "Anbar" "Baghdad" "Basra" "Duhok" "Kirkuk" "Ninevah"
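The code that printed the two vectors above is not shown. A minimal sketch that would print vectors of this form follows; the variable names `numbers`, `governorates`, and `mixed` are chosen here purely for illustration.

```r
# Hypothetical variable names; c() combines values of one type into a vector
numbers <- c(1, 50, 9, 42)
governorates <- c("Anbar", "Baghdad", "Basra", "Duhok", "Kirkuk", "Ninevah")

numbers
governorates

# Mixing types coerces every element to a common type (here: character)
mixed <- c(1, "a", TRUE)
class(mixed)             # "character"
```

The last two lines illustrate the "same basic type" rule: when types are mixed, R converts all elements to one type rather than raising an error.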
<-, <<-, = : Leftwards assignment
->, ->> : Rightwards assignment
## [1] "this is a value"
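A short sketch of both assignment directions, with illustrative variable names. The `<<-` and `->>` forms assign in an enclosing environment and mostly matter inside functions, so they are left out here.

```r
# Leftwards assignment (the conventional style)
x <- "this is a value"
y = "this is a value"      # also works at the top level

# Rightwards assignment
"this is a value" -> z

x
identical(x, z)            # TRUE
```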
+ : Addition
- : Subtraction
* : Multiplication
/ : Division
^ : Exponent
%% : Modulus (remainder from division)
%/% : Integer division
Examples of arithmetic operators:
## [1] 1
## [1] 10 8 10
## [1] 50
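The expressions behind the outputs above are not shown. The sketch below uses values chosen for illustration and is not necessarily what the original ran.

```r
7 + 3        # addition: 10
7 - 3        # subtraction: 4
7 * 3        # multiplication: 21
7 / 2        # division: 3.5
2 ^ 3        # exponent: 8
7 %% 3       # modulus: 1
7 %/% 3      # integer division: 2

# Operators are vectorised: each element is multiplied by 2
c(5, 4, 5) * 2   # 10 8 10
```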
< : Less than
> : Greater than
<= : Less than or equal to
>= : Greater than or equal to
== : Equal to
!= : Not equal to
Examples of relational operators:
## [1] FALSE
## [1] TRUE
## [1] TRUE
## [1] FALSE TRUE TRUE FALSE
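As a sketch, comparisons of this kind could look as follows. The values and the names `a`, `b`, and `v` are assumptions made for the example.

```r
a <- 5
b <- 8

a > b        # FALSE
a < b        # TRUE
a != b       # TRUE

# Comparisons are element-wise on vectors
v <- c(1, 6, 9, 2)
v > 3        # FALSE TRUE TRUE FALSE
```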
! : Logical NOT
& : Element-wise logical AND
&& : Logical AND
| : Element-wise logical OR
|| : Logical OR
## [1] TRUE
## [1] FALSE
## [1] TRUE
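The difference between the element-wise and the single-value forms is easiest to see side by side. A minimal sketch with made-up vectors `x` and `y`:

```r
x <- c(TRUE, FALSE, TRUE)
y <- c(TRUE, TRUE, FALSE)

!x           # FALSE TRUE FALSE
x & y        # element-wise AND: TRUE FALSE FALSE
x | y        # element-wise OR: TRUE TRUE TRUE

# && and || work on single values and stop evaluating as soon as
# the result is known (short-circuit evaluation)
TRUE && FALSE    # FALSE
TRUE || FALSE    # TRUE
```

The single-value forms are the usual choice inside `if` conditions, while `&` and `|` are the ones to use when filtering vectors.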
: operator - creates the series of numbers in sequence for a vector
## [1] 2 3 4 5 6 7 8
%in% - used to identify if an element belongs to a vector
## [1] TRUE
## [1] FALSE
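Both operators can be tried together. In this sketch the name `s` and the tested values are illustrative assumptions.

```r
s <- 2:8                 # 2 3 4 5 6 7 8
s

5 %in% s                 # TRUE
12 %in% s                # FALSE

# %in% is vectorised over its left-hand side
c(1, 4, 9) %in% s        # FALSE TRUE FALSE
```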
The character class is your typical string: a sequence of characters (letters, digits, spaces, symbols) enclosed in quotes.
## [1] "Hello R!"
You can check the class associated with the object we defined above, by wrapping it in the class() function.
## [1] "character"
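A minimal sketch of defining a string and checking its class. The variable name `greeting` is hypothetical, since the original object name is not shown.

```r
greeting <- "Hello R!"   # hypothetical variable name
greeting
class(greeting)          # "character"
nchar(greeting)          # number of characters: 8
```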
The numeric class holds numbers such as 10, 15.6, -48792.54989827, and so on.
## [1] 10.4343
## [1] "numeric"
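For example, assuming a variable named `num` (the name is an assumption for this sketch):

```r
num <- 10.4343
num
class(num)               # "numeric"
typeof(num)              # "double": how numerics are stored internally
is.numeric(num)          # TRUE
```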
Integers are whole numbers. A whole number typed without a suffix (such as 5) is stored as a numeric; to create an integer, append L (5L) or convert with as.integer().
## [1] "numeric"
## [1] "integer"
## [1] "integer"
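The three class checks can be sketched as follows. The variable names are illustrative, and the exact calls used originally are not shown.

```r
whole <- 5
class(whole)               # "numeric": stored as a double

int1 <- 5L                 # the L suffix creates an integer
class(int1)                # "integer"

int2 <- as.integer(whole)  # explicit conversion
class(int2)                # "integer"
```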
Logical types (booleans) are the same as in most other languages and can take one of two values: true or false. True can be written as TRUE or T, and false as FALSE or F. The full forms are safer, because T and F are ordinary variables that can be overwritten.
## [1] "logical"
## [1] "logical"
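A brief sketch, with an illustrative variable name `flag`:

```r
flag <- TRUE
class(flag)                 # "logical"
class(3 > 5)                # "logical": comparisons return logicals

# In arithmetic, TRUE counts as 1 and FALSE as 0
sum(c(TRUE, FALSE, TRUE))   # 2
```

Counting `TRUE` values with `sum()` is a common way to count how many elements meet a condition.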
A factor is a type of vector that is used to store categorical data. Each unique category is referred to as a factor level (category = level).
Strlevels <- c("low", "high", "medium", "high", "low", "medium", "high")
Factorlevels <- factor(Strlevels)
Factorlevels
## [1] low high medium high low medium high
## Levels: high low medium
## [1] "factor"
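The levels above are sorted alphabetically, which is the default. The sketch below repeats the example and adds one assumed refinement that is not in the original: passing `levels` explicitly to control the order (the name `Ordered` is hypothetical).

```r
Strlevels <- c("low", "high", "medium", "high", "low", "medium", "high")

# Default: levels are sorted alphabetically
Factorlevels <- factor(Strlevels)
class(Factorlevels)      # "factor"
levels(Factorlevels)     # "high" "low" "medium"

# Assumed refinement: give the levels a meaningful order
Ordered <- factor(Strlevels, levels = c("low", "medium", "high"))
levels(Ordered)          # "low" "medium" "high"
table(Ordered)           # counts per level
```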
Lists are R objects that can contain elements of different types: numbers, strings, vectors, another list, a matrix, or a function. Lists are created using list().
## [[1]]
## [1] "Red" "Green"
##
## [[2]]
## [1] 21 32 11
##
## [[3]]
## [1] TRUE
##
## [[4]]
## [1] 51.23
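A list that prints like the one above could be built as follows. The name `my_list` is an assumption, as the original call is not shown.

```r
my_list <- list(c("Red", "Green"), c(21, 32, 11), TRUE, 51.23)
my_list

# [[ ]] extracts a single element; [ ] returns a sub-list
my_list[[2]]             # 21 32 11
class(my_list[2])        # "list"
```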
A matrix in R is a collection of vectors of the same length and identical data type. Vectors can be combined as columns or as rows of the matrix to create a 2-dimensional structure. Matrices are mostly used for mathematical calculations.
##      [,1] [,2] [,3]
## [1,]    3   -1    2
## [2,]    9    4    6
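One way to build a matrix with these values is sketched below. The names `m` and `m_byrow` are illustrative, and the original call may have differed.

```r
# matrix() fills column by column by default
m <- matrix(c(3, 9, -1, 4, 2, 6), nrow = 2)
m

# byrow = TRUE fills row by row instead
m_byrow <- matrix(c(3, -1, 2, 9, 4, 6), nrow = 2, byrow = TRUE)
identical(m, m_byrow)    # TRUE

dim(m)                   # 2 3
```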
A data frame is the most common way of storing data in R and, generally, is the data structure most often used for data analyses. Under the hood, a data frame is a list of equal-length vectors. Each element of the list can be thought of as a column and the length of each element of the list is the number of rows. As a result, data frames can store different classes of objects in each column (e.g. numeric, character, factor). Typically, data frames have column headers (i.e. variable names).
In essence, the easiest way to think of a data frame is as an Excel worksheet whose columns may hold different types of data but all have the same number of rows.
Let’s define a data frame, using the data.frame() function, and then have a look at it:
df <- data.frame(col1 = 1:3,
                 col2 = c("this", "is", "text"),
                 variable_name3 = c(TRUE, FALSE, TRUE),
                 Variable.4 = c(2.5, 4.2, pi))
df
##   col1 col2 variable_name3 Variable.4
## 1    1 this           TRUE   2.500000
## 2    2   is          FALSE   4.200000
## 3    3 text           TRUE   3.141593
## [1] "integer"
## [1] "character"
## [1] "logical"
## [1] "numeric"
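The four classes above correspond to the four columns of `df`. A sketch of how they could be checked, either one column at a time or all at once, follows. Note that `col2` is reported as "character" in R 4.0 and later; older versions converted strings to factors by default.

```r
df <- data.frame(col1 = 1:3,
                 col2 = c("this", "is", "text"),
                 variable_name3 = c(TRUE, FALSE, TRUE),
                 Variable.4 = c(2.5, 4.2, pi))

# One column at a time, via the $ operator
class(df$col1)             # "integer"
class(df$variable_name3)   # "logical"
class(df$Variable.4)       # "numeric"

# All columns at once
sapply(df, class)
```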
Typically, you do not create your own data frame in R; instead, you load a dataset and run your calculations on it.
Generally, it is good practice to work with CSV files rather than Excel sheets. Where needed, you can set the character that separates cells in your CSV file (sep) and specify the strings that should be treated as missing values, or NA (na.strings):
data <- read.csv("data/example_data.csv")
data <- read.csv("data/example_data.csv", na.strings = c("NA", ""), sep = ",")
You can find more documentation on data input functions here.
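Because `data/example_data.csv` is not available here, the sketch below writes a small made-up CSV to a temporary file and reads it back with the same arguments. The column names and values are invented for illustration.

```r
# Write a tiny CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,region,price",
             "1,Anbar,10.5",
             "2,,7",
             "3,Basra,NA"),
           csv_path)

data <- read.csv(csv_path, na.strings = c("NA", ""), sep = ",")
data

# Both the blank region and the literal "NA" price are read as missing
is.na(data$region)       # FALSE TRUE FALSE
is.na(data$price)        # FALSE FALSE TRUE
```

Without `""` in `na.strings`, the blank region would be kept as an empty string rather than treated as missing.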
Working with “projects” (.Rproj): RStudio projects make it straightforward to divide your work into multiple contexts, each with its own working directory, workspace, history, and source documents.
Click on “File >> New Project” to create a new project file. Once a project file is created, open it (which opens a new R session) and open your scripts from within RStudio (via the bottom right “Files” pane).
A major advantage of working with project files is that they automatically set the working directory to where the respective project documents are saved. When you load in datasets from your project folder, it means the document path will be relative and therefore much cleaner (as in the above example).
More info on projects is found here.
All data structures:
str(): compact display of data contents (env.)
class(): returns data type
summary(): summarizes aspects of the data, depending on its class
head(): prints the beginning entries for the variable in the console
tail(): prints the end entries for the variable in the console
Vector and factor variables:
length(): returns the number of elements in the vector or factor
Dataframe and matrix variables:
dim(): returns dimensions of the data frame or matrix (i.e. number of rows and columns)
nrow(): returns the number of rows in the dataset
ncol(): returns the number of columns in the dataset
rownames(): returns the row names in the dataset
colnames(): returns the column names in the dataset

A key feature of R is functions. Functions are “self contained” modules of code that accomplish a specific task. Functions usually take in some sort of data structure (value, vector, dataframe etc.), process it, and return a result.
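Each of the inspection helpers listed above is itself a function that takes a data structure and returns a result. A minimal sketch applying them to a small data frame follows; the `scores` data is hypothetical.

```r
# Hypothetical example data
scores <- data.frame(name = c("a", "b", "c", "d"),
                     score = c(10, 20, 30, 40))

str(scores)              # compact overview of structure and contents
class(scores)            # "data.frame"
summary(scores$score)    # min, quartiles, mean, max
head(scores, 2)          # first two rows
tail(scores, 2)          # last two rows

length(scores$score)     # 4 elements in the vector
dim(scores)              # 4 2
nrow(scores)             # 4
ncol(scores)             # 2
colnames(scores)         # "name" "score"
rownames(scores)         # "1" "2" "3" "4"
```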
The syntax of function calls always follows the same structure: the name of the function is followed by parentheses (), in which the input arguments are specified.
Examples:
sum(1, 2): adds two or more numbers
sum(a): sums a numeric vector
sum(b, na.rm = TRUE): sums a numeric vector, ignoring missing values (NA)
Other examples of base R functions: seq(), mean(), max().
## [1] 10.25
## [1] 7.333333
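The function calls named above can be sketched with made-up vectors `a` and `b`. The values are chosen for illustration and are not necessarily those behind the outputs shown.

```r
sum(1, 2)                          # 3

a <- c(1, 2, 3, 35)
sum(a)                             # 41
mean(a)                            # 10.25
max(a)                             # 35

b <- c(3, 7, 12, NA)
sum(b, na.rm = TRUE)               # 22

seq(from = 2, to = 10, by = 2)     # 2 4 6 8 10
```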
There are many, many more functions that are available. As we will see later, R also allows users to define their own functions.
In R, missing values are represented by the symbol NA (not available). Missing values are not the same as empty values ("") - empty values ("") are blanks, while missing values (NA) are truly missing.
Here is how you can test for missing data:
## [1] FALSE FALSE FALSE TRUE
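A test of this kind is typically done with `is.na()`. A minimal sketch, assuming a vector `x` whose last element is missing:

```r
x <- c(3, 7, 12, NA)
is.na(x)                 # FALSE FALSE FALSE TRUE
sum(is.na(x))            # number of missing values: 1
anyNA(x)                 # TRUE

# An empty string is a regular value, not a missing one
is.na("")                # FALSE
```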
Some functions return NA when missing values are not explicitly excluded:
## [1] NA
Passing an argument to remove missing values (na.rm = TRUE) may be needed for the function to work as intended:
## [1] 7.333333
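Both behaviours can be reproduced with `mean()`. The vector `x` is an assumption chosen for this sketch.

```r
x <- c(3, 7, 12, NA)

mean(x)                  # NA: the missing value propagates
mean(x, na.rm = TRUE)    # 7.333333: NA is dropped before averaging
```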
When importing a dataset into R, make sure you specify the NA values in the na.strings argument as needed. Failing to do so may affect your analysis later on, as blanks ("") are regular values that will be evaluated and therefore might skew your results.
R packages are collections of functions and datasets developed by the community. They increase the power of R by improving existing base R functionalities, or by adding new ones.
Whether you would like to analyze stratified household-level data, make maps of migration routes and charts for your reports, extract price data from the web, or build an online dashboard, R packages have you covered. The official repository (CRAN) hosts more than 10,000 packages published by developers from all over the world, and many more are publicly available through the internet.
The directory where packages are stored is called the library. R comes with a standard set of packages. Others are available for download and installation. Once installed, they have to be loaded into the session to be used.
To install a package, call install.packages() with the package name in quotation marks (needed only once).
Then load it with library() once per active R session.
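The install-once, load-per-session pattern could be sketched as follows. The package name here is a placeholder: `stats` ships with R, so the install branch is skipped and the example runs without network access. Substitute the name of the package you actually need.

```r
pkg <- "stats"   # placeholder; replace with the package you need

# Install only if the package is not already available (once per machine)
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# Load the package (once per R session)
library(pkg, character.only = TRUE)

"package:stats" %in% search()   # TRUE once the package is attached
```

In interactive use the shorter forms `install.packages("name")` and `library(name)` are more common; the `requireNamespace()` guard is an optional addition that avoids reinstalling on every run.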
Popular packages that you might want to check out if you start using R regularly are: