Free and cross-platform
R is a programming language and open-source software dedicated to statistics and data science, supported by the R Foundation for Statistical Computing. R is cross-platform and can therefore be installed on Windows, macOS or GNU/Linux.
Versatile
R allows you to manipulate all types of objects. This versatility is one of the reasons why it competes with, complements or replaces a whole range of pre-existing software and languages, including in very specific fields such as textual statistics, graph analysis, cartography, spatial statistics, web scraping, document production and web applications. It also competes with other languages widely used for scientific computing and data analysis, in particular Python.
Expandable
R is composed of a common base, r-base (standard statistical and graphical functions, called primitive functions), onto which is grafted a set of extensions called packages. A package is a library of functions implemented by users and made available to all through repositories grouped under the Comprehensive R Archive Network (CRAN). This modular structure explains the wide range of possible applications: the expansion of the software is only limited by the work that users worldwide make available to other users (12,809 packages were available on CRAN as of July 30, 2018).
Reproducible
The entire processing chain can be carried out in R. This integrated workflow is more efficient and more secure: there are no more imports and exports from one software package to another, and the archiving, dissemination and reproducibility of the work and its methodology are ensured.
Let’s start by recalling some important points in R’s history.
R was created in the early 1990s in New Zealand and has been growing ever since. Since 1997, a group of researchers called the “R Core Team”, composed today of 20 people, has had the authority to modify the source code of R. Anyone can submit packages, but the core of the software is managed by these 20 people. Two other structures are important for R: the R Foundation and CRAN.
THE R FOUNDATION
It is a non-profit organization created by the R Core Team to provide support for the development of the R project. It serves as an interlocutor for those who would like to support or interact with the R software developer community. This foundation owns and administers the copyright of the R software and related documentation.
THE CRAN
It is the “Comprehensive R Archive Network”, a network of web servers hosted in different locations around the world that stores R’s code and documentation in order to make R quickly accessible from anywhere in the world. It is also the place where new packages are submitted to be made available to R users.
RStudio’s interface is divided into four panes:
* One area allows the editing of R source files (with syntax highlighting and autocompletion of function and object names with the Tab key).
* Another area displays the console with the current R session running. The shortcut Ctrl+Enter executes a line or selection directly from the source file.
* A third area allows you to switch between displaying the objects in the current workspace and the history of the commands executed. You can even visually inspect the contents of some objects.
* Finally, a fourth area allows you to switch between:
- a file browser
- the graphs display and export window
- a list of installed extensions, which allows you to load them into memory or install new ones very easily
- a help browser that allows both navigation in the online help integrated in R and the display of the help pages for the various functions
[Figure: the RStudio interface]
Each time R is asked to load or save a file (especially when trying to import data), R will evaluate the name of the file sent to it against the currently defined working directory, which corresponds to the directory in which R is currently running.
To know the current working directory, we can use the getwd function:
getwd()
To set the working directory (the path below is only an example):
setwd("/path/to/my_project")
RStudio has a very practical feature for organizing your work into different projects.
The main idea is to gather all the files / documents related to the same project (data, scripts, automated reports…) in a dedicated directory.
You can create an RStudio project from the File → New Project... menu.
When a project is opened within RStudio the following actions are taken:
* A new R session (process) is started.
* The .Rprofile file in the project’s main directory (if any) is sourced by R
* The .RData file in the project’s main directory is loaded (if project options indicate that it should be loaded).
* The .Rhistory file in the project’s main directory is loaded into the RStudio History pane (and used for Console Up/Down arrow command history).
* The current working directory is set to the project directory.
* Previously edited source documents are restored into editor tabs.
* Other RStudio settings (e.g. active tabs, splitter positions, etc.) are restored to where they were the last time the project was closed.
The help(ma_function) command, or equivalently ?ma_function, gives access to the help page of the function ma_function. The help page automatically appears in the Help tab, at the bottom right of the RStudio interface.
Example: executing the following command opens the help page of the log() function:
help(log)
An R package is a library of functions that perform particular operations. To use a package, it must have been previously installed and loaded. During the installation of R, a number of packages are pre-installed. When RStudio is launched, some of these packages are loaded by default. This is the case, for example, of the stats package.
It is possible to view the list of packages already installed on your computer via the RStudio Packages tab.
When you need a package that is not pre-installed, it can be installed manually via the menu Tools → Install Packages. Once the package is installed, it is loaded with the library() instruction.
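For instance, a package such as ggplot2 (used later in this document) can also be installed and then loaded directly from the console:
# install the package once from CRAN
install.packages("ggplot2")
# load it into the current session (to be repeated in each new session)
library(ggplot2)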
Starting your code with an annotated description of what the code does when it is run will help you when you have to look at or change it in the future. Just one or two lines at the beginning of the file can save you or someone else a lot of time and effort when trying to understand what a particular script does.
# it's a wonderful script that will take you to other worlds
# L . Lospin & J.C
# 18 brumaire de l'an 205
library(ggplot2)
library(reshape)
# OR
x <- c("data.table", "tidyverse", "magrittr")   # several packages at once
lapply(x, require, character.only = TRUE)
input_file <- "data/data.csv"
output_file <- "data/results.csv"
# read input
input_data <- read.csv(input_file)
# get number of samples in data
sample_number <- nrow(input_data)
# generate results
results <- some_other_function(input_file, sample_number)
# write results
write.table(results, output_file)
is preferable to:
# read input
input_data <- read.csv("monOrdi/mes fichiers/data/data.csv")
# get number of samples in data
sample_number <- nrow(input_data)
# generate results
results <- some_other_function("monOrdi/mes fichiers/data/data.csv", sample_number)
# write results
write.table(results, "data/results.csv")
# keep reusable functions in a separate file and load them with source()
source("my_genius_fxns.R")
Other ideas:
* Use a consistent style within your code. For example, name all matrices something ending in _mat. Consistency makes code easier to read and problems easier to spot.
* Keep your code in bite-sized chunks. If a single function or loop gets too long, consider looking for ways to break it into smaller pieces.
* Don't repeat yourself: automate! If you are repeating the same code over and over, use a loop or a function to repeat that code for you. Needless repetition doesn't just waste time, it also increases the likelihood you'll make a costly mistake!
* Keep all of your source files for a project in the same directory, then use relative paths as necessary to access them. For example, use
dat <- read.csv(file = "files/dataset-2013-01.csv", header = TRUE)
rather than:
dat <- read.csv(file = "/Users/Karthik/Documents/sannic-project/files/dataset-2013-01.csv", header = TRUE)
Everything in the R language is an object: variables containing data, functions, operators, even the symbol representing the name of an object is itself an object. Objects have at least one mode and one length and some may have one or more attributes.
The mode of an object is obtained with the mode function:
v <- c(1, 2, 5, 9)
mode(v)
[1] "numeric"
The length of an object is obtained with the length function:
length(v)
[1] 4
In R, for all practical purposes, everything is a vector. The vector is the basic unit of calculation.
In a simple vector, all elements must be of the same mode. We restrict ourselves to this type of vector for the moment. The basic functions for creating vectors are:
c (concatenation); numeric (numeric-mode vector); logical (logical-mode vector); character (character-mode vector). It is possible (and often desirable) to give a label to each of the elements of a vector:
v <- c(a = 1, b = 2, c = 5)
v
a b c
1 2 5
v <- c(1, 2, 5)
names(v) <- c("a", "b", "c")
v
a b c
1 2 5
These labels are then part of the attributes of the vector.
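This can be checked with the attributes() function:
attributes(v)
$names
[1] "a" "b" "c"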
Indexing a vector is done with square brackets [ ]. An element can be extracted from a vector by its position or by its label, if it has one (in which case this approach is much safer).
v[3]
v["c"]
As R is a specialized language for mathematical calculations, it naturally and intuitively supports matrices and, more generally, multidimensional tables.
A matrix is a vector with a dim attribute of length 2. This implicitly changes the object class to “matrix” and, as a result, the way the object is displayed and its interaction with several operators and functions. The basic function for creating matrices is matrix :
matrix(1:6, nrow = 2, ncol = 3)
matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
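To see that a matrix is indeed just a vector carrying a dim attribute, the attribute can be set by hand on an ordinary vector (the object name m is only illustrative):
m <- 1:6
dim(m) <- c(2, 3)  # m is now treated and displayed as a 2 x 3 matrix
m
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6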
The generalization of a matrix to more than two dimensions is an array. The number of dimensions of the array is always equal to the length of the dim attribute, the implicit class of an array is “array”.
The basic function to create arrays is array:
array(1:24, dim = c(3, 4, 2))
The list is the most general and versatile storage mode of the R language. It is a special type of vector whose elements can be of any mode, including the list mode itself. This allows lists to be nested, hence the term recursive for this type of object.
The basic function for creating lists is list:
x <- list(size = c(1, 5, 2), user = "Joe", new = TRUE)
x
$size
[1] 1 5 2
$user
[1] "Joe"
$new
[1] TRUE
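Individual elements of a list can then be extracted by name, with $ or [[ ]]:
x$user
[1] "Joe"
x[["size"]]
[1] 1 5 2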
Vectors, matrices, tables and lists are the most common object types used in R programming. However, many statistical procedures - think of linear regression, for example - rely more on data frames for data storage.
Although visually similar to a matrix, a data frame is more general since its columns can be of different modes; think of a table with names (character mode) in one column and grades (numeric mode) in another.
A data frame is created with the data.frame function or, to convert another type of object into a data frame, with as.data.frame.
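As a small illustration (the object and column names used here are only examples), such a mixed-mode table can be built directly:
# one character column and one numeric column in the same data frame
students <- data.frame(name = c("Anna", "Joe"), grade = c(12.5, 14))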
We have the following types of operators in R programming:
‘+’ Adds two vectors
v <- c( 2,5.5,6)
t <- c(8, 3, 4)
print(v+t)
[1] 10.0 8.5 10.0
‘−’ Subtracts second vector from the first
v <- c( 2,5.5,6)
t <- c(8, 3, 4)
print(v-t)
[1] -6.0 2.5 2.0
’*’ Multiplies both vectors
v <- c( 2,5.5,6)
t <- c(8, 3, 4)
print(v*t)
[1] 16.0 16.5 24.0
‘/’ Divides the first vector by the second
v <- c( 2,5.5,6)
t <- c(8, 3, 4)
print(v/t)
[1] 0.250000 1.833333 1.500000
‘%%’ Gives the remainder of dividing the first vector by the second (element-wise)
v <- c( 2,5.5,6)
t <- c(8, 3, 4)
print(v%%t)
[1] 2.0 2.5 2.0
‘%/%’ Gives the quotient of the integer division of the first vector by the second
v <- c( 2,5.5,6)
t <- c(8, 3, 4)
print(v%/%t)
[1] 0 1 1
‘^’ Raises each element of the first vector to the power of the corresponding element of the second vector
v <- c( 2,5.5,6)
t <- c(8, 3, 4)
print(v^t)
[1] 256.000 166.375 1296.000
‘>’ Checks if each element of the first vector is greater than the corresponding element of the second vector.
v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
print(v>t)
[1] FALSE TRUE FALSE FALSE
‘<’ Checks if each element of the first vector is less than the corresponding element of the second vector.
v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
print(v < t)
[1] TRUE FALSE TRUE FALSE
‘==’ Checks if each element of the first vector is equal to the corresponding element of the second vector.
v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
print(v == t)
[1] FALSE FALSE FALSE TRUE
‘<=’ Checks if each element of the first vector is less than or equal to the corresponding element of the second vector.
v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
print(v<=t)
[1] TRUE FALSE TRUE TRUE
‘>=’ Checks if each element of the first vector is greater than or equal to the corresponding element of the second vector.
v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
print(v>=t)
[1] FALSE TRUE FALSE TRUE
‘!=’ Checks if each element of the first vector is not equal to the corresponding element of the second vector.
v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
print(v!=t)
[1] TRUE TRUE TRUE FALSE
Many times we are interested in repeating some calculations. In R, there are many methods to do this, including the use of loops.
K1 <- c(4,2,8,5)
L1 <- c(1,3,4,2)
M1 <- 0 * 1:4 # this is the object where we will place the result
M1
This loop finds the maximum of K1 and L1 at each position:
for (j in 1:4){
M1[j] <- max(K1[j],L1[j])
}
M1
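Note that many operations in R are vectorised, so this particular loop can be replaced by a single call to the built-in pmax() function, which returns the element-wise (parallel) maximum:
M2 <- pmax(K1, L1)  # same result as the loop above
M2
[1] 4 3 8 5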
General form: ifelse(test_expression, x, y)
Here, test_expression must be a logical vector (or an object that can be coerced to logical). The return value is a vector with the same length as test_expression.
This returned vector has elements taken from x where the corresponding value of test_expression is TRUE, and from y where it is FALSE.
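For example, applied to a small numeric vector (the object name and the threshold of 20 are arbitrary):
temp <- c(12, 25, 31)
ifelse(temp > 20, "warm", "cold")
[1] "cold" "warm" "warm"
The if()/else construct used in the next example is different: it evaluates a single condition rather than operating element by element.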
x <- c("what","is","truth")
if("Truth" %in% x) {
print("Truth is found")
} else {
print("Truth is not found")
}
[1] "Truth is not found"
In programming, you use functions to incorporate sets of instructions that you want to use repeatedly or that, because of their complexity, are better self-contained in a sub-program and called when needed. A function is a piece of code written to carry out a specified task; it may or may not accept arguments or parameters, and it may or may not return one or more values.
The absolute value of “x”
abs(x)
A generic function which combines its arguments
c(x)
Combine vectors by column / by row (cf. “paste” in Unix)
cbind()
rbind()
Returns suitably lagged and iterated differences
diff(x)
Pattern matching
grep()
Test if 2 objects are exactly equal
identical()
Add a small amount of noise to a numeric vector
jitter()
Return no. of elements in vector x
length(x)
List objects in current environment
ls()
Concatenate vectors after converting to character
paste(x)
Returns the minimum and maximum of x
range(x)
Repeat the number 1 five times
rep(1,5)
List the elements of “x” in reverse order
rev(x)
Generate a sequence (1 -> 10, spaced by 0.4)
seq(1,10,0.4)
Create a vector of sequences
sequence()
Returns the signs of the elements of x
sign(x)
Sort the vector x
sort(x)
Return the indices that sort the elements of x
order(x)
Convert string to lower/upper case letters
tolower()
toupper()
Remove duplicate entries from vector x
unique(x)
Rounding functions (truncate towards zero / round to the nearest value)
trunc(x)
round(x)
Return system date
Sys.Date()
Return working directory
getwd()
Set working directory
setwd()
The different parts of a function are :
Function Name − This is the actual name of the function. It is stored in R environment as an object with this name.
Arguments − An argument is a placeholder. When a function is invoked, you pass a value to the argument. Arguments are optional; that is, a function may contain no arguments. Arguments can also have default values.
Function Body − The function body contains a collection of statements that defines what the function does.
Return Value − The return value of a function is the last expression in the function body to be evaluated.
new.function <- function(a, b) {
print(a^2)
print(a)
print(b)
}
new.function(2,4)
[1] 4
[1] 2
[1] 4
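As mentioned above, arguments can also have default values. Here is an illustrative variant of the same function (the name new.function2 is just an example) where b defaults to 4 when no second argument is supplied:
new.function2 <- function(a, b = 4) {  # b has a default value of 4
   print(a^2)
   print(a)
   print(b)
}
new.function2(2)  # only a is supplied, so the default b = 4 is used
[1] 4
[1] 2
[1] 4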
R has a function dedicated to reading comma-separated files. To import a local CSV file and store the data in an R variable named mydata, the syntax would be:
mydata = read.csv("~/Documents/Course/Yield_Winterschool.csv") # read csv file
If your data use another character to separate the fields, not a comma, R also has the more general read.table function. So if your separator is a tab, for instance, this would work:
mydata <- read.table("~/Documents/Course/Yield_Winterschool.csv", sep="\t", header=TRUE) # but it is not the case here!
There is often more than one way to read data into R. Even a simple .csv file can be imported using a range of methods, with implications for computational efficiency. This section looks at three approaches: base R’s reading functions such as read.csv, which are derived from read.table; the data.table approach, which uses the function fread; and the new readr package which provides read_csv and other read_ functions such as read_tsv.
library(data.table)
mydata <- fread("~/Documents/Course/Yield_Winterschool.csv", header=TRUE)
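With the readr package mentioned above (assuming it is installed), the equivalent call would be:
library(readr)
mydata <- read_csv("~/Documents/Course/Yield_Winterschool.csv")
Note that read_csv returns a tibble and fread a data.table; both extend the base data frame, which is worth keeping in mind for the subsetting syntax used later.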
[Figure: efficiency of the different packages for loading files]
The data import features can be accessed from the environment pane or from the tools menu. The importers are grouped into 3 categories: Text data, Excel data and statistical data. To access this feature, use the “Import Dataset” dropdown from the “Environment” pane:
[Figure: the Import Dataset dropdown in RStudio's Environment pane]
You can invoke the data viewer from the console by calling the View function on the data frame you want to look at. For instance, to view the data set imported above, run:
View(mydata)
Useful Functions for Exploring Data Frames
Use dim() to obtain the dimensions of the data frame (number of rows and number of columns). The output is a vector.
dim(mydata)
Use nrow() and ncol() to get the number of rows and number of columns, respectively. You can get the same information by extracting the first and second element of the output vector from dim().
nrow(mydata)
# same as dim(mydata)[1]
ncol(mydata)
# same as dim(mydata)[2]
Use head() to obtain the first n observations and tail() to obtain the last n observations; by default, n = 6. These are good commands for obtaining an intuitive idea of what the data look like without revealing the entire data set, which could have millions of rows and thousands of columns.
head(mydata, n = 5)
For example, the following command returns the last 5 observations.
tail(mydata, n = 5)
The names() function will return the column headers.
names(mydata)
The str() function returns many useful pieces of information, including the outputs above and the type of data in each column: for example, “num” denotes a numeric (continuous) variable, and “Factor” denotes a categorical variable together with its number of categories or levels.
str(mydata)
To obtain all of the categories or levels of a categorical variable, use the levels() function.
mydata$departement <- as.factor(mydata$departement)
levels(mydata$departement)
When applied to a data frame, the summary() function is essentially applied to each column, and the results for all columns are shown together.
For a continuous (numeric) variable like “yield”, it returns the minimum, quartiles, median, mean and maximum. If there are any missing values (denoted by “NA”), it also provides a count of them; in this example there are no missing values for “yield”, so no NA count is displayed. For a categorical variable like “departement”, it returns the levels and the number of observations in each level.
summary(mydata)
Use the assignment operator <- to create new variables. A wide array of operators and functions are available here.
mydata$sum <- mydata$Tmean_1 + mydata$Tmean_2
mydata$mean <- (mydata$Tmean_1 + mydata$Tmean_2)/2
We can also delete a variable by assigning it NULL:
mydata$sum <- NULL
mydata$mean <- NULL
You can also select a subset of the data:
# data.table syntax (mydata was read with fread, so it is a data.table)
mysubset <- mydata[departement == "AIN"]
# is equivalent to
mysubset <- subset(mydata, departement == "AIN")
# is equivalent to
# install.packages("dplyr")
library(dplyr)
mysubset <- mydata %>% filter(departement == "AIN")
The following is an introduction for producing simple graphs with the R Programming Language.
Graph the yield vector with all defaults
plot(mysubset$yield)
Let’s add a title, a line to connect the points, and some color:
# Graph the yields using blue points overlaid by a line
plot(mysubset$yield,type="o", col="blue")
# Create a title with a red, bold/italic font
title(main="Yield", col.main="red", font.main=4)
Now let's add a red line for another department and specify the y-axis range directly so it will be large enough to fit all the data:
# Graph the AIN yields using a y axis that ranges from 0 to 12
plot(mysubset$yield, type="o", col="blue", ylim=c(0,12))
# Graph the AISNE yields with a red dashed line and square points
lines(mydata[departement=="AISNE"]$yield, type="o", pch=22, lty=2, col="red")
# Create a title with a red, bold/italic font
title(main="Yield", col.main="red", font.main=4)
Next let's change the axis labels to match our data and add a legend. We'll also compute the y-axis range from the data so that any changes to our data will be automatically reflected in our graph.
# Calculate the y range from 0 to the maximum yield value
g_range <- range(0, mydata$yield)
# Graph the AIN yields using a y axis that ranges from 0 to the
# maximum yield value. Turn off axes and annotations (axis labels)
# so we can specify them ourselves
plot(mysubset$year_harvest,mysubset$yield, type="o", col="blue", ylim=g_range,ann=FALSE)
# Graph the AISNE yields with a red dashed line and square points
lines(mysubset$year_harvest,mydata[departement=="AISNE"]$yield, type="o", pch=22, lty=2, col="red")
# Create a title with a red, bold/italic font
title(main="Yield", col.main="red", font.main=4)
# Label the x and y axes with dark green text
title(xlab="years", col.lab=rgb(0,0.5,0))
title(ylab="YIELD", col.lab=rgb(0,0.5,0))
# Create a legend at (1960, g_range[2]) that is slightly smaller
# (cex) and uses the same line colors and points used by
# the actual plots
legend(1960, g_range[2], c("AIN", "AISNE"), cex = 0.8,
       col = c("blue", "red"), pch = 21:22, lty = 1:2)
Let's now draw a bar chart of the yield vector, adding labels, blue borders around the bars, and density lines:
barplot(mysubset2$yield, main="yield", xlab="years",
ylab="yields", names.arg=c("best_years","2001","2002","2003","2004","2005"),
border="blue", density=c(10,20,30,40,50, 60))
Let's now build a simple dot chart of the yield data:
mysubset3<- mydata[year_harvest %in% c(2003:2005),]
mysubset3<- mysubset3[departement %in% c("AIN", "AISNE"),]
mysubset3<- mysubset3[,c("year_harvest","departement","yield")]
library("reshape")
mysubset3$year_harvest<-as.factor(mysubset3$year_harvest)
mysubset3 <- cast(mysubset3, year_harvest ~ departement, mean, value = 'yield')
# cast() keeps year_harvest as a column: drop it and build a numeric matrix before transposing
yield_mat <- as.matrix(mysubset3[, -1])
rownames(yield_mat) <- as.character(mysubset3$year_harvest)
dotchart(t(yield_mat))
Let’s make the dotchart a little more colorful:
dotchart(t(yield_mat), color = c("red", "blue", "darkgreen"),
         main = "yield", cex = 0.8)
The function used for building linear models is lm(). The lm() function takes two main arguments: 1. Formula 2. Data. The data is typically a data.frame, and the formula is an object of class formula, but the most common convention is to write the formula directly in place of the argument, as below.
linearMod <- lm(yield ~ year_harvest, data=mysubset) # build linear regression model on full data
print(linearMod)
Call:
lm(formula = yield ~ year_harvest, data = mysubset)
Coefficients:
(Intercept) year_harvest
-140.3625 0.0722
Plot the observations and the fitted line:
mysubset$pred<- predict(linearMod)
plot(mysubset$year_harvest, mysubset$yield)
lines(mysubset$year_harvest,mysubset$pred)
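To go further than the printed coefficients, summary() reports the standard errors, R-squared and p-values of the fit, and abline() is a convenient alternative for drawing the fitted line on the current scatterplot:
summary(linearMod)               # detailed fit statistics
abline(linearMod, col = "red")   # add the fitted regression line to the existing plot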
There are many resources available on R. Here are some of them in open access, some of which have been widely used to produce this document: