1 Installing R

First, install the R core. Here are the links to the installers for the different operating systems:

Windows,
macOS: Apple silicon (M1-3)
macOS: Intel
and for Linux: Debian, Fedora/Redhat, Ubuntu

Then install RStudio, the most widely used interface, for:

2 Operators

R can be used as a calculator: all the usual operations can be carried out directly in the R console.

2.1 Arithmetic Operators

2.1.1 Addition: `+`

1+1

## [1] 2

2.1.2 Subtraction: `-`

1-1

## [1] 0

2.1.3 Multiplication: `*`

1*1

## [1] 1

2.1.4 Division: `/`

1/1

## [1] 1

2.1.5 Modulo (remainder of the division of one number by another): `%%`

1 %% 2

## [1] 1

2.1.6 Exponentiation: `^` or `**`

2 ^ 10 # or 2 ** 10

## [1] 1024

2.1.7 Integer division: `%/%`

1035 %/% 3

## [1] 345
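Integer division and modulo are two halves of the same operation: the quotient `a %/% b` and the remainder `a %% b` recombine to give back `a`. A minimal check, where the divisor 4 is chosen here purely for illustration:

```r
a <- 1035
b <- 4  # illustrative divisor, not taken from the text above

q <- a %/% b  # integer quotient
r <- a %% b   # remainder

# the quotient and the remainder recombine to the original number
recombined <- q * b + r
print(c(quotient = q, remainder = r))
print(recombined == a)
```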

2.2 Logical operators

Logical operators are used to perform tests. Each test returns TRUE or FALSE.

2.2.1 Strictly less than: `<`

1 < 1

## [1] FALSE

2.2.2 Less than or equal to: `<=`

1 <= 1

## [1] TRUE

2.2.3 Strictly greater than: `>`

1 > 1

## [1] FALSE

2.2.4 Greater than or equal to: `>=`

1 >= 1

## [1] TRUE

2.2.5 Exactly equal to: `==`

"R" == "r"

## [1] FALSE

2.2.6 Not equal to: `!=`

1 != 1

## [1] FALSE

2.2.7 Negation/NOT: `!`

Used to turn a false condition into a true one (or a true one into a false one).

!TRUE # or !T

## [1] FALSE

!FALSE # or !F

## [1] TRUE

!(T & F) # this is TRUE

## [1] TRUE

!(F | T) # is FALSE

## [1] FALSE

2.2.8 AND: `&`

TRUE & TRUE

## [1] TRUE

TRUE & FALSE

## [1] FALSE

FALSE & FALSE

## [1] FALSE

2.2.9 OR: `|`

TRUE | TRUE

## [1] TRUE

TRUE | FALSE

## [1] TRUE

FALSE | FALSE

## [1] FALSE
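The operators `&` and `|` also work element by element on vectors, which makes it possible to combine several tests at once. A small sketch with made-up values:

```r
x <- c(1, 5, 8, 12)  # illustrative values

in_range <- x > 2 & x < 10    # TRUE where both conditions hold
outside  <- x <= 2 | x >= 10  # TRUE where at least one condition holds

print(in_range)
print(outside)
```

Here `outside` is simply the negation of `in_range`, so `!in_range` would give the same vector.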

3 R objects and assignment

In R, we can use <-, = (a single equals sign!) and -> to assign a value to a variable.

A variable name:

can start with a letter or with one or more dots. E.g. a <- 1, 0 -> .a
must not contain spaces. Replace each space with _.

v rsion <- 4.3.2

## Error: <text>:1:3: unexpected symbol
## 1: v rsion
##       ^

can contain digits. E.g. a1 <- 1.

a <- 1
b <- 2
0 -> .a
a1 = .a

3.1 Data types

R has the following data types:

Numeric (numeric)
Integer (integer)
Complex (complex)
Character/string (character)
Boolean (logical)
raw
factor

3.1.1 Numeric (numeric) - (10.5, 55, 787)

PI <- pi; class(PI); typeof(PI)

## [1] "numeric"

## [1] "double"

n <- 55; class(n); typeof(n)

## [1] "numeric"

## [1] "double"

3.1.2 Integer (integer)

(1L, 55L, 100L, where the letter L declares the number preceding it as an integer).
Check the class of n <- 55L. What do you see?

n <- 55L
class(n)

## [1] "integer"

3.1.3 Complex (complex) - (9 + 3i, where `i` is the imaginary unit)

z <- 9 + 3i
class(z)

## [1] "complex"

typeof(z)

## [1] "complex"

z1 <- a + 1i*b
print(z1)

## [1] 1+2i

class(z1)

## [1] "complex"

3.1.4 Character/string (character)

string <- "J'apprends R"
class(string)

## [1] "character"

Remember: R is case-sensitive, so LeaRning is different from Learning.

3.1.5 Boolean (logical) - (TRUE or FALSE)

TRUE # or T

## [1] TRUE

FALSE # or F

## [1] FALSE

A logical value can also be the result of a test. Example: checking whether LeaRning == Learning

"LeaRning" == "Learning"

## [1] FALSE

3.1.6 Raw

# text variable
text <- "J'apprend R."
# converting the variable text to raw
raw_text <- charToRaw(text)

print(raw_text)

##  [1] 4a 27 61 70 70 72 65 6e 64 20 52 2e

class(raw_text)

## [1] "raw"

Converting raw to text:

rawToChar(raw_text)

## [1] "J'apprend R."

3.1.7 Categorical data (factor)

A factor is a data type used to represent qualitative information such as colours, good and bad, course grades or film ratings, etc. Factors are useful in statistical modelling.

Gender <- factor(c("Female", "Male"))
print(Gender)

## [1] Female Male  
## Levels: Female Male

class(Gender)

## [1] "factor"

3.1.8 Logical

v <- TRUE
w <- FALSE

class(v); typeof(v)

## [1] "logical"

## [1] "logical"

!v

## [1] FALSE

isTRUE(w)

## [1] FALSE

# if (isTRUE(v)) {
#   print("This code is compiled")
# }

3.2 R Data Structures

The most commonly used data structures in R are:

Vectors
Lists
Matrices (2D)
Arrays (more than two dimensions)
Factors
Data Frames

3.2.1 Scalars and vectors:

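A scalar is any number in $\mathbb{N}$, $\mathbb{Z}$, $\mathbb{D}$, $\mathbb{Q}$, $\mathbb{R}$, or $\mathbb{C}$ (complex numbers appear, for example, in quantum mechanics).
A vector is a collection of objects of the same type. A vector can also be a sequence.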

Example 1:

v <- c(1, "R", T, FALSE, NA)
# print v
print(v)

## [1] "1"     "R"     "TRUE"  "FALSE" NA

# what is the class of v?
class(v)

## [1] "character"
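The vector above ends up as character because a vector can only hold one type, so R converts every element to the most general type present (roughly logical, then integer, then double, then character). A short sketch of this coercion, with illustrative values:

```r
# each vector mixes two types; R converts to the more general one
v_int <- c(TRUE, 2L)   # logical + integer  -> integer
v_dbl <- c(2L, 3.5)    # integer + double   -> double
v_chr <- c(3.5, "R")   # double + character -> character

print(typeof(v_int))
print(typeof(v_dbl))
print(typeof(v_chr))
```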

# sequence: ?seq
x <- seq(0, 2*pi, length.out = 90)
y <- seq(0, 2*pi, by = 0.1282283)
head(x); head(y)

## [1] 0.00000000 0.07059759 0.14119518 0.21179276 0.28239035 0.35298794

## [1] 0.0000000 0.1282283 0.2564566 0.3846849 0.5129132 0.6411415

tail(x)

## [1] 5.930197 6.000795 6.071393 6.141990 6.212588 6.283185

range(x)

## [1] 0.000000 6.283185

rg <- range(x)
rg[1]

## [1] 0

rg[2]

## [1] 6.283185

x[10]

## [1] 0.6353783

The length of a vector is given by:

length(x)

## [1] 90

length(rg)

## [1] 2

a <- 9
length(a)

## [1] 1

A scalar is a vector of length 1.

Example 2:

# repeating
rep("I learn R", 5)

## [1] "I learn R" "I learn R" "I learn R" "I learn R" "I learn R"

rep(c(0, 1), 10)

##  [1] 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1

v <- rep(0, 10)
v <- numeric(10)
v[10] <- NA
v

##  [1]  0  0  0  0  0  0  0  0  0 NA

# repetition
rep(c(0:1), c(50, 50))

##   [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

rep(c(0:5), each = 50)

##   [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
## [112] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [149] 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [186] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [223] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5
## [260] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
## [297] 5 5 5 5

# sampling
set.seed(24102024) # fix the randomness for reproducibility.
sample(0:1, size = 100, replace = TRUE, prob = c(0.3, 0.7)) -> y
y; y == 0

##   [1] 1 1 1 1 0 1 1 0 1 1 0 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 0
##  [38] 1 1 1 1 1 0 0 1 1 1 0 1 1 1 0 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1
##  [75] 1 1 0 1 0 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 0

##   [1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE
##  [13] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [25] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE
##  [37]  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE
##  [49] FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE
##  [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
##  [73] FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
##  [85] FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
##  [97] FALSE FALSE FALSE  TRUE

table(y)

## y
##  0  1 
## 21 79

sum(y == 0); sum(!y == 0)

## [1] 21

## [1] 79

as.numeric(TRUE)

## [1] 1

as.numeric(FALSE)

## [1] 0

sum(c(T, F))

## [1] 1

3.2.2 Matrices:

A matrix is a two-dimensional data set with rows and columns.

(A <- matrix(1:25, ncol = 5)) # byrow = F by default

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    6   11   16   21
## [2,]    2    7   12   17   22
## [3,]    3    8   13   18   23
## [4,]    4    9   14   19   24
## [5,]    5   10   15   20   25

(B <- matrix(1:25, nrow = 5, ncol = 5, byrow = T)) # ncol = 5 is optional.

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    2    3    4    5
## [2,]    6    7    8    9   10
## [3,]   11   12   13   14   15
## [4,]   16   17   18   19   20
## [5,]   21   22   23   24   25

3.2.2.1 Matrix definition

(A <- matrix(c(1, 0, 2, 5, 2, 1, 4, 2, 0), nrow = 3))

##      [,1] [,2] [,3]
## [1,]    1    5    4
## [2,]    0    2    2
## [3,]    2    1    0

(B <- matrix(c(2, 5, 2, 3, 1, 1, 0, 1, 1), nrow = 3))

##      [,1] [,2] [,3]
## [1,]    2    3    0
## [2,]    5    1    1
## [3,]    2    1    1

3.2.2.2 Matrix from vectors

We can also construct a matrix from vectors $M = (v_1, v_2, v_3)$ using the cbind and rbind functions.

v1 <- c(1, 0, 2); v2 <- c(5, 2, 1); v3 <- c(4, 2, 0)
(M1 <- cbind(v1, v2, v3))

##      v1 v2 v3
## [1,]  1  5  4
## [2,]  0  2  2
## [3,]  2  1  0

(M2 <- rbind(v1, v2, v3))

##    [,1] [,2] [,3]
## v1    1    0    2
## v2    5    2    1
## v3    4    2    0

class(M1)

## [1] "matrix" "array"

class(M2)

## [1] "matrix" "array"

3.2.2.3 Matrix using `dim` function

Note: dim can also be used to check the dimensions of a matrix, a data frame or an array.

M3 <- c(1, 5, 4, 0, 2, 2, 2, 1, 0)
dim(M3) <- c(3, 3) # sets the dimensions of M3
dim(M3) # shows the dimensions of M3

## [1] 3 3

M3

##      [,1] [,2] [,3]
## [1,]    1    0    2
## [2,]    5    2    1
## [3,]    4    2    0

class(M3)

## [1] "matrix" "array"

3.2.2.4 Matrix operations

Transpose

(A_T <- t(A))

##      [,1] [,2] [,3]
## [1,]    1    0    2
## [2,]    5    2    1
## [3,]    4    2    0

Addition

A + B

##      [,1] [,2] [,3]
## [1,]    3    8    4
## [2,]    5    3    3
## [3,]    4    2    1

Subtraction

A - B

##      [,1] [,2] [,3]
## [1,]   -1    2    4
## [2,]   -5    1    1
## [3,]    0    0   -1

Multiplication

# number of columns in A: dim(A)[2], or ncol(A).
# number of rows in A: dim(A)[1], or nrow(A)
dim(A)[2] == ncol(A)

## [1] TRUE

ncol(A) == nrow(B)

## [1] TRUE

A %*% B

##      [,1] [,2] [,3]
## [1,]   35   12    9
## [2,]   14    4    4
## [3,]    9    7    1

Inverse

# I want to get the inverse of A
(A_inv <- solve(A))

##      [,1] [,2] [,3]
## [1,]   -1  2.0    1
## [2,]    2 -4.0   -1
## [3,]   -2  4.5    1

A %*% A_inv # is to check if A_inv is really the inverse of A.

##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1
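The printed product looks like an exact identity matrix, but floating-point arithmetic can leave tiny rounding errors, so comparing with `==` is not always reliable. A hedged sketch of a more robust check, using `all.equal()` to compare up to a small tolerance:

```r
A <- matrix(c(1, 0, 2, 5, 2, 1, 4, 2, 0), nrow = 3)  # same A as above
A_inv <- solve(A)

# compare the product with the 3 x 3 identity matrix, up to rounding error
is_inverse <- isTRUE(all.equal(A %*% A_inv, diag(3)))
print(is_inverse)
```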

3.2.2.5 Solving a system of equations

\[ \begin{cases} 2x + 2y &= 4\\ x + 3y &= 4 \end{cases} \]

The coefficient matrix of the system is $\displaystyle A_1 = \begin{pmatrix}2&2\\1&3\end{pmatrix}$ (A1 in the code) and the right-hand side is $b = \begin{pmatrix}4\\4\end{pmatrix}$. We can use the solve function to obtain the solution.

A1 <- matrix(c(2, 2, 1, 3), nrow = 2, byrow = TRUE)
b <- c(4, 4)
solve(A1, b)

## [1] 1 1

# A1*A1 # point-wise multiplication.
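A simple way to confirm the solution is to plug it back into the system: the product of the matrix and the solution vector should reproduce the right-hand side. A minimal sketch (the names `sol` and `residual` are introduced here for illustration):

```r
A1 <- matrix(c(2, 2, 1, 3), nrow = 2, byrow = TRUE)
b <- c(4, 4)
sol <- solve(A1, b)

# A1 %*% sol should reproduce b; drop() turns the 2 x 1 matrix into a vector
residual <- drop(A1 %*% sol) - b
print(sol)
print(residual)
```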

Division: there is no direct matrix division; instead, multiply a matrix by the inverse of another. $B/A = BA^{-1}$

B %*% A_inv

##      [,1] [,2] [,3]
## [1,]    4 -8.0   -1
## [2,]   -5 10.5    5
## [3,]   -2  4.5    2

3.2.2.6 Eigen values/vectors (basis of Principal Component Analysis)

Requirements:

A should be a square matrix of dimension $n$.
The eigenvalues $\lambda$ are the roots of the characteristic polynomial, i.e. the solutions of

\[ P_A(\lambda) = \det(A-\lambda I_n) = 0,\quad n\in \mathbb{N}. \]

3.2.2.7 Eigen values/vectors

ev <- eigen(A)  # gives a list of eigen values and
                # eigen vectors
ev$values

## [1]  4.7664355 -1.4836116 -0.2828239

is.list(ev)

## [1] TRUE

ev$vectors

##            [,1]       [,2]       [,3]
## [1,] -0.8535725 -0.3668743  0.2177685
## [2,] -0.3052279 -0.4631774 -0.6431613
## [3,] -0.4221966  0.8067651  0.7341120
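Each eigenvalue $\lambda$ and its eigenvector $v$ (the matching column of `ev$vectors`) should satisfy $Av = \lambda v$. A quick numerical check for the first pair, written as a sketch with helper names (`lambda1`, `v1`, `lhs`, `rhs`) that are not part of the original code:

```r
A <- matrix(c(1, 0, 2, 5, 2, 1, 4, 2, 0), nrow = 3)  # same A as above
ev <- eigen(A)

lambda1 <- ev$values[1]   # first eigenvalue
v1 <- ev$vectors[, 1]     # matching eigenvector (first column)

lhs <- drop(A %*% v1)     # A v
rhs <- lambda1 * v1       # lambda v
print(isTRUE(all.equal(lhs, rhs)))
```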

3.2.3 Arrays

An array is a data structure that can have more than two dimensions.

(aRray <- array(1:24, dim = c(3, 4, 2)))

## , , 1
## 
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
## 
## , , 2
## 
##      [,1] [,2] [,3] [,4]
## [1,]   13   16   19   22
## [2,]   14   17   20   23
## [3,]   15   18   21   24

class(aRray)

## [1] "array"

An example of an array is NetCDF data, with for instance:

Longitude as column names ($n$)
Latitude as row names ($p$)
Time as a possible third dimension. For each time, we have a $p\times n$ matrix.

dim(aRray)

## [1] 3 4 2

aRray[1, 1, 2] # element at i=1, j=1 from the second matrix

## [1] 13

The indices are, in order: row position, column position, matrix level.
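Leaving an index empty selects everything along that dimension, which is how a whole matrix or a whole row is extracted from an array. A short sketch on the same array:

```r
aRray <- array(1:24, dim = c(3, 4, 2))  # same array as above

second_matrix <- aRray[, , 2]  # the whole second 3 x 4 matrix
first_row <- aRray[1, , 1]     # first row of the first matrix

print(dim(second_matrix))
print(first_row)
```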

3.2.4 Lists

A list is a collection of objects that can be of different types. Its elements can also have different sizes.

mylist <- list("matrix" = A, 
               "sequence" = x,
               "Bool" = TRUE, 
               "Array" = aRray)

mylist$matrix

##      [,1] [,2] [,3]
## [1,]    1    5    4
## [2,]    0    2    2
## [3,]    2    1    0

3.2.4.1 Accessing elements of a list

class(mylist[[1]])

## [1] "matrix" "array"

mylist$Array

## , , 1
## 
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
## 
## , , 2
## 
##      [,1] [,2] [,3] [,4]
## [1,]   13   16   19   22
## [2,]   14   17   20   23
## [3,]   15   18   21   24

3.2.4.2 Accessing several elements of a list

mylist[c("Array", "matrix")]

## $Array
## , , 1
## 
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
## 
## , , 2
## 
##      [,1] [,2] [,3] [,4]
## [1,]   13   16   19   22
## [2,]   14   17   20   23
## [3,]   15   18   21   24
## 
## 
## $matrix
##      [,1] [,2] [,3]
## [1,]    1    5    4
## [2,]    0    2    2
## [3,]    2    1    0

3.2.5 Data Frames

A data frame is a table with $n$ rows (observations) and $p$ columns (features or variables). Each variable can be of any data type.

3.2.6 Factors

3.2.7 Converting a continuous variable into a categorical variable

set.seed(12092024)
age <- sample(0:120, size = 100)
(brks <- seq(0, 120, by = 10))

##  [1]   0  10  20  30  40  50  60  70  80  90 100 110 120

# (brks <- seq(min(age), max(age), le = 40))
age_groups <- cut(age, breaks = brks, include.lowest = TRUE)
class(age_groups)

## [1] "factor"

table(age_groups)

## age_groups
##    [0,10]   (10,20]   (20,30]   (30,40]   (40,50]   (50,60]   (60,70]   (70,80] 
##        10         9         9         5         8         8         9         8 
##   (80,90]  (90,100] (100,110] (110,120] 
##         9         8         9         8

# checking for missing values
(which(is.na(age_groups)) -> id_missing)

## integer(0)

# conversion
age_factor <- factor(age_groups) # not necessary!

identical(age_groups, age_factor)

## [1] TRUE

# count in each class/group
frequencies <- table(age_groups)

3.2.8 Data frames

A data frame is also a list in which all elements (columns) have the same length. In other words, a data frame in R is a table.

# converting a list into a dataframe
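One way to carry out such a conversion is `as.data.frame()`, provided every element of the list has the same length. A minimal sketch, in which the list `my_list` and its contents are made up for illustration:

```r
# a list whose elements all have length 3 (illustrative names and values)
my_list <- list(id = 1:3,
                name = c("a", "b", "c"),
                passed = c(TRUE, FALSE, TRUE))

df_from_list <- as.data.frame(my_list)

print(df_from_list)
print(is.list(df_from_list))  # a data frame is still a list
print(dim(df_from_list))      # number of rows, number of columns
```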

3.2.8.1 Create a data frame using the `data.frame()` function

df2 <- data.frame(x = rnorm(10), y = rpois(10, 2))
head(df2)

3.2.8.2 Data manipulation

Missing values (NA)

x <- c(NA, 1, 2, NA, 3, NA, 3.55)
which(is.na(x)) # means: which of the elements of x are missing

## [1] 1 4 6

which(x >= 2) # means: which of the elements of x are greater than or equal to 2

## [1] 3 5 7

# which(x != NA) is the wrong way to check for non-missing values
which(!is.na(x)) # means: which of the elements of x are not missing

## [1] 2 3 5 7

mis_id <- which(is.na(x))
x[mis_id]

## [1] NA NA NA

x[is.na(x)] <- mean(x[which(!is.na(x))]) # Good but could be shorter
x[is.na(x)] <- mean(x, na.rm = TRUE)

print(x)

## [1] 2.3875 1.0000 2.0000 2.3875 3.0000 2.3875 3.5500

NAs introduced by coercion when converting strings to numeric

x <- c(2, 1, 2, 7, 3, 2.5, 9, "2,7")
class(x)

## [1] "character"

# converting into numeric
z <- as.numeric(x)

## Warning: NAs introduced by coercion

x[which(is.na(z))] <- 2.7
z[which(is.na(z))] <- 2.7
z; x

## [1] 2.0 1.0 2.0 7.0 3.0 2.5 9.0 2.7

## [1] "2"   "1"   "2"   "7"   "3"   "2.5" "9"   "2.7"
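Patching the NA by hand works for a single value, but when the problem is a decimal comma it may be cleaner to repair the text before converting. A hedged sketch using `sub()` (the name `x_clean` is introduced here):

```r
x <- c(2, 1, 2, 7, 3, 2.5, 9, "2,7")  # same vector as above

# replace the decimal comma by a decimal point before converting
x_clean <- sub(",", ".", x, fixed = TRUE)
z <- as.numeric(x_clean)

print(z)
print(anyNA(z))  # FALSE means no value was lost in the conversion
```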

Outlier detection

3.2.8.3 Plots from a data frame

4 Data simulation and visualization

4.1 Charts in R

4.1.1 Bar chart/plot

barplot(frequencies, cex.names=0.6) # use horiz = TRUE to have horizontal bars

The argument cex.names scales the size of the x-axis labels. A small value, say cex.names = 0.6, shrinks the labels so that R has room to show all of them.

Number of observations in a subset: $\displaystyle\sum_{i=1}^nI_{\{\textrm{age}_i\geq80\}}$
Percentage of observations in a subset: $\displaystyle\frac{100}{n}\sum_{i=1}^nI_{\{\textrm{age}_i\geq80\}}$
$n$ is the sample size.

sum(age >= 80)

## [1] 35

mean(age >= 80) # I get the relative frequency

## [1] 0.35

4.1.2 Pie chart/plot

pie(c(30, 50, 20), col = 1:3)

4.1.3 Histograms

set.seed(12092024)
x <- sample(1:120, size = 1000, replace = TRUE)
# use probability = TRUE to have densities instead of counts (frequencies)
hist(x, probability = TRUE)

# Density plots
plot(density(x))

4.1.4 Scatter plot

x <- seq(0, 2*pi, length.out = 50)
y <- sin(x)
z <- cos(x)
tg <- tan(x)
plot(x, y, pch = 16, col = "blue")

4.2 Distribution simulations

4.2.1 Uniform distribution

set.seed(13092024)
unif_dist <- runif(1000)
hist(unif_dist, probability = TRUE, xlab = NULL,
     main = "Histogram of uniform distribution", col = "turquoise")
lines(density(unif_dist), col = "red", lwd = 2, lty = 2)
abline(v = mean(unif_dist), col = "blue", lty = 2, lwd = 2)
legend("topleft", lty = c(2, 2), 
       col = c("red", "blue"), legend = c("Density", "Average"))

4.2.2 Binomial distribution

rbinom_dist <- rbinom(10000, 10, 0.5)
hist(rbinom_dist, probability = TRUE, main = "Histogram of binomial distribution", 
     col = "turquoise", breaks = 20, xlab = NULL)
lines(density(rbinom_dist), col = "red", lwd = 2, lty = 2)
abline(v = mean(rbinom_dist), col = "blue", lty = 2, lwd = 2) # vertical line
legend("topleft", lty = c(2, 2), 
       col = c("red", "blue"), legend = c("Density", "Average"))

4.2.3 Gaussian distribution

set.seed(13092024)
gauss_dist <- rnorm(1000, mean = 0, sd = 1)
hist(gauss_dist, probability = TRUE, breaks = 30, xlab = NULL,
     main = "Histogram of standard normal\ndistribution", col = "turquoise")
lines(density(gauss_dist), col = "red", lwd = 2, lty = 2)
abline(v = mean(gauss_dist), col = "blue", lty = 2, lwd = 2)
legend("topleft", lty = c(2, 2), 
       col = c("red", "blue"), legend = c("Density", "Average"))

4.2.4 Scatter plot to show relationship between two variables

set.seed(13092024)
x <- rnorm(1000); y <- rnorm(1000, mean = 5, sd = 1.5)
z <- 4 + 8*x + rnorm(1000) # linear dependence between x and z
par(mfrow = c(1, 2))
plot(x, y, main = "Scatter plot of x and y", col = "blue", pch = 16); grid()
plot(x, z, main = "Scatter plot of x and z", col = "blue", pch = 16)

4.2.5 Exponential distribution

set.seed(13092024)
exp_dist <- rexp(1000, rate = 1)
hist(exp_dist, probability = TRUE, xlab = NULL, 
     main = "Histogram of exponential distribution", col = "turquoise")
lines(density(exp_dist), col = "red", lwd = 2, lty = 2)
abline(v = mean(exp_dist), col = "blue", lty = 2, lwd = 2)
legend("topright", lty = c(2, 2), 
       col = c("red", "blue"), legend = c("Density", "Average"))

4.2.6 Poisson distribution

set.seed(13092024)
pois_dist <- rpois(1000, lambda = 2.5)
hist(pois_dist, probability = TRUE, main = "Histogram of poisson distribution", 
     col = "turquoise", xlab = NULL)
lines(density(pois_dist), col = "red", lwd = 2, lty = 2)
abline(v = mean(pois_dist), col = "blue", lty = 2, lwd = 2)
legend("topright", lty = c(2, 2), 
       col = c("red", "blue"), legend = c("Density", "Average"))

5 Control flow

5.1 `if` / `else`

if (condition/Boolean expression){
  ## code to be executed
}

5.1.1 Example

x <- 3

if (x < 4){
  print(TRUE)
} else {
  print(FALSE)
}

## [1] TRUE

# one line
ifelse(x < 4, T, F)

## [1] TRUE
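The difference between the two forms matters for vectors: `if` expects a single TRUE or FALSE, whereas `ifelse()` tests every element and returns a vector of the same length. A small sketch with illustrative values and labels:

```r
x <- c(1, 4, 7)  # illustrative values

# ifelse() is applied element by element
labels <- ifelse(x < 4, "small", "large")
print(labels)
```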

We can nest if statements inside the if and else branches.

if (x < 4){
  if (x != 0){
    print("x is not equal to zero.")
  } else {
    print("x is equal to zero")
  }
  
  print("x is less than 4")
} else {
  if (x > 1){
    print("x is greater than 1.")
  } else {
    print("x is less than or equal to 1")
  }
  print("x is greater than or equal to 4.")
}

## [1] "x is not equal to zero."
## [1] "x is less than 4"

5.2 Loops

for loops

for (i in vector){
  ## code to be executed
}

m <- 6
for (i in 1:m) print(i)

## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6

for (i in 1:m) {
  print(i)
}

## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6

5.2.1 Exercise 1:

Write a for loop that checks whether each of the first 10 positive integers is odd or even.

## TODO

5.2.2 Exercise 2:

Using a for loop, import all CSV files from the data_files folder.

# checking the working directory
getwd()

## [1] "/Users/logamouseknewnalema/Library/CloudStorage/GoogleDrive-lseknewna@aimsric.org/My Drive/INDABAXCHAD"

# simple ls like in bash
dir()

##  [1] "a.csv"                                   
##  [2] "additional_code.R"                       
##  [3] "aims-logo.jpg"                           
##  [4] "calc.png"                                
##  [5] "data"                                    
##  [6] "donnees_hospitalieres"                   
##  [7] "mypackages.R"                            
##  [8] "pipe.jpeg"                               
##  [9] "R_Intro_DrLema_files"                    
## [10] "R_Intro_DrLema.html"                     
## [11] "R_Intro_DrLema.pdf"                      
## [12] "R_Intro_DrLema.Rmd"                      
## [13] "R_Intro_IndabaX_Chad_fr_files"           
## [14] "R_Intro_IndabaX_Chad_fr.html"            
## [15] "R_Intro_IndabaX_Chad_fr.Rmd"             
## [16] "R_Intro_IndabaX_Chad.html"               
## [17] "R_Intro_IndabaX_Chad.Rmd"                
## [18] "rsconnect"                               
## [19] "Screenshots"                             
## [20] "telecharger_donnees_admissions_med.ipynb"

dir("./data_list/", pattern = "\\.csv$") # list the CSV files in a directory

## character(0)

# Exercise: write a for loop to import all
# csv files in a list.
(file_names <- dir("./data_list/", pattern = "\\.csv$"))

## character(0)

Hints: Importing files from the working directory

We need a path/url when the file to be loaded is not in the working directory.
We construct a path by combining strings. See the example below.

string1 <- "." # working directory (root where the script is saved)
string2 <- "folder" # folder in the working directory
string3 <- "subfolder" # sub-folder in folder
paste(string1, string2, string3, sep = "/")

## [1] "./folder/subfolder"

paste0(string1, "/", string2, "/", string3)

## [1] "./folder/subfolder"

5.2.3 Importing files from a folder located in my working directory
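One possible sketch of such an import is given here. It assumes a hypothetical folder `csv_demo`, created in a temporary directory and filled with two small files so that the example runs anywhere; in practice the folder would sit inside the working directory. `file.path()` builds the path in the same way as `paste(..., sep = "/")`.

```r
# set up a temporary folder with two small CSV files (illustrative data only)
folder <- file.path(tempdir(), "csv_demo")
dir.create(folder, showWarnings = FALSE)
write.csv(head(iris), file.path(folder, "iris_head.csv"), row.names = FALSE)
write.csv(head(cars), file.path(folder, "cars_head.csv"), row.names = FALSE)

# list the CSV files, then read each one into a named list
file_names <- dir(folder, pattern = "\\.csv$")
data_list <- list()
for (f in file_names) {
  data_list[[f]] <- read.csv(file.path(folder, f))
}

print(names(data_list))
print(sapply(data_list, nrow))
```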

5.3 while

while (condition){
  ## code to be executed
  
  # increment
}

# Initialize i
i <- 0
while (i <= 10) {
  print(i*2)
  
  i <- i+10
}

## [1] 0
## [1] 20

5.4 Exercises

Write a program that tells the user YOU WON! and exits if they get 5 three times in a row.
Write a program that runs continuously: it asks the user to input a number between 0 and 9, provides the multiplication table by 2, and then asks the user whether to stop or continue.

Hint: Use the function readline(prompt = "Enter a number: ") to interact with the user.

number <- readline(prompt = "Enter a number: ") # readline() returns a character string, so a conversion (e.g. as.numeric()) is needed.

5.5 repeat

Syntax of the repeat loop:

# initialize i or anything else
i <- 0

repeat{
  # execute a code
  
  # increment
  i <- i + 1
  
  # stopping criteria
  if ( something happens ){
    break # repeat until something happens
  }
}

i <- 0
repeat{
  print(i)
  i <- i + 1
  if (i > 10) break # repeat until condition holds.
}

## [1] 0
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10

6 Apply Functions Over Array Margins

6.1 `apply`

The apply() function returns a vector, array or list of values obtained by applying a function to the margins of an array or matrix.

A <- c(1:4)
dim(A) <- c(2, 2)
A

##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

avg <- function(x){
  sum(x)/length(x)
}

v <- 1:10

avg(x = v)

## [1] 5.5

# iris[-5] drops the fifth column (Species), which is not numeric
apply(iris[-5], MARGIN = 2, summary) # MARGIN = 2 means column-wise

##         Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min.        4.300000    2.000000        1.000    0.100000
## 1st Qu.     5.100000    2.800000        1.600    0.300000
## Median      5.800000    3.000000        4.350    1.300000
## Mean        5.843333    3.057333        3.758    1.199333
## 3rd Qu.     6.400000    3.300000        5.100    1.800000
## Max.        7.900000    4.400000        6.900    2.500000

6.2 `sapply`: use `?sapply` to check the documentation.

sapply(A, sum) # sapply treats the matrix as a vector, so sum is applied to each element separately

## [1] 1 2 3 4
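To sum over the rows or the columns of a matrix, `apply()` with the appropriate MARGIN is the better fit. A minimal sketch on the same 2 x 2 matrix (base R also offers `rowSums()` and `colSums()` for this particular case):

```r
A <- matrix(1:4, nrow = 2)  # same 2 x 2 matrix as above

row_totals <- apply(A, MARGIN = 1, sum)  # one sum per row
col_totals <- apply(A, MARGIN = 2, sum)  # one sum per column

print(row_totals)
print(col_totals)
```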

The sapply function can also return a list if the outputs are not of the same length.

sapply(iris, summary)

## $Sepal.Length
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.300   5.100   5.800   5.843   6.400   7.900 
## 
## $Sepal.Width
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.800   3.000   3.057   3.300   4.400 
## 
## $Petal.Length
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.600   4.350   3.758   5.100   6.900 
## 
## $Petal.Width
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.100   0.300   1.300   1.199   1.800   2.500 
## 
## $Species
##     setosa versicolor  virginica 
##         50         50         50

df <- data.frame(replicate(10, rnorm(1000)))
L <- as.list(df) # converting data frame to list.
sapply(L, avg)

##           X1           X2           X3           X4           X5           X6 
##  0.002825105  0.020374063  0.055152188 -0.044140208 -0.016019744  0.050948281 
##           X7           X8           X9          X10 
## -0.030072130  0.015579582 -0.007629547  0.019322313

sapply(1:10, function(x) x^2)

##  [1]   1   4   9  16  25  36  49  64  81 100

6.3 `lapply`:

The lapply() function returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.

a <- lapply(iris[-5], mean) # applies mean to each column of the data frame
write.csv(a, "a.csv")
unlist(a)

## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##     5.843333     3.057333     3.758000     1.199333

6.4 `tapply`: check the documentation using `?tapply`

is.factor(iris$Species) # checking if the column named Species is a factor.

## [1] TRUE

tapply(iris$Sepal.Length, iris[[5]], mean)

##     setosa versicolor  virginica 
##      5.006      5.936      6.588

6.5 `vapply`: check the documentation

vapply(X = as.list(iris[-5]), quantile, FUN.VALUE =
       c("0%" = 0, "25%" = 0, "50%" = 0, "75%" = 0, "100%" = 0))

##      Sepal.Length Sepal.Width Petal.Length Petal.Width
## 0%            4.3         2.0         1.00         0.1
## 25%           5.1         2.8         1.60         0.3
## 50%           5.8         3.0         4.35         1.3
## 75%           6.4         3.3         5.10         1.8
## 100%          7.9         4.4         6.90         2.5
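What sets `vapply()` apart from `sapply()` is the FUN.VALUE template: every result must match it in type and length, otherwise the call stops with an error. A hedged sketch of both situations (the names `col_means` and `bad_call` are introduced here):

```r
# each column mean is a single number, so FUN.VALUE is numeric(1)
col_means <- vapply(iris[-5], mean, FUN.VALUE = numeric(1))
print(col_means)

# range() returns two numbers, which does not match numeric(1): this is an error
bad_call <- try(vapply(iris[-5], range, FUN.VALUE = numeric(1)), silent = TRUE)
print(inherits(bad_call, "try-error"))
```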

7 Define functions in R

Syntax to write/define a function in R:

function_name <- function(arg1, arg2, ...){
  # code to be executed
}

pp <- function(x) return(x+1)
i <- 1
(i <- pp(i))

## [1] 2
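Arguments can be given default values, and the value of the last evaluated expression is returned even without an explicit `return()`. A small sketch, where the function name `power` is made up for illustration:

```r
# power() raises x to the given exponent; the exponent defaults to 2
power <- function(x, exponent = 2) {
  x^exponent
}

print(power(3))                # uses the default exponent
print(power(3, exponent = 3))  # overrides the default
```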

7.1 Exercises

Write a function that takes a vector x as argument, detects the NAs and replaces them with the mean.

replace_missing <- function(x, fun){
  
}

replace_missing(x, fun = mean)

## NULL

Draw the flowchart for solving the quadratic equation $ax^2+bx+c=0$ and write an R function that gives the solutions and comments on them according to the value of the discriminant.

8 Packages

A package is a collection of data and functions together with their documentation.

8.1 Install a package if it is not in the `installed.packages` list.

# to get the list of all installed packages in R
all_packages <- rownames(installed.packages())
head(all_packages, 20) # only displaying the first 20 packages in alphabetical order

##  [1] "abind"          "acepack"        "ada"            "adabag"        
##  [5] "ade4"           "adegenet"       "alphavantager"  "animation"     
##  [9] "AnnotationDbi"  "AnophelesModel" "anytime"        "ape"           
## [13] "arrow"          "arsenal"        "ash"            "askpass"       
## [17] "assertthat"     "audio"          "automap"        "backports"

Having the list of all packages, we can check if a package is in it.

"pacman" %in% all_packages

## [1] FALSE

It looks like the package is not installed yet. Using the if control-flow statement, we can check whether a package is missing and then install it with the install.packages() function.

if (!"pacman" %in% all_packages) {
  install.packages("pacman", repos = "http://cran.us.r-project.org")
}

## 
## The downloaded binary packages are in
##  /var/folders/x6/rdmyg9yd5cq432r1z8p90p6r0000gn/T//RtmpKhzeNP/downloaded_packages
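An alternative to scanning the full list of installed packages is base R's `requireNamespace()`, which returns TRUE or FALSE for a single package. A minimal sketch: the stats package is used only because it ships with R, and the second package name is deliberately fictitious.

```r
# "stats" is used here only because it ships with every R installation
is_available <- requireNamespace("stats", quietly = TRUE)
print(is_available)

# the same test for a package name that does not exist (hypothetical name)
is_missing <- !requireNamespace("notARealPackage123", quietly = TRUE)
print(is_missing)
```

The result can be used in the same if statement as above to decide whether install.packages() needs to be called.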

8.2 Loading a package using `library` function from the `base` package.

library(pacman)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()     masks stats::filter()
## ✖ dplyr::group_rows() masks kableExtra::group_rows()
## ✖ dplyr::lag()        masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

R displays messages when loading the tidyverse package. You probably do not want them to appear in your report.

8.3 Prevent R from displaying warnings and messages when loading a package

Set the following chunk options:

{r warning=FALSE, message=FALSE}

library(pacman)
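In a plain R script, where chunk options are not available, base R's `suppressPackageStartupMessages()` offers a similar effect for startup messages. A minimal sketch, using the stats package only because it ships with R:

```r
# stats ships with R, so this runs without installing anything
suppressPackageStartupMessages(library(stats))

# the package is attached as usual, only its startup messages are hidden
print("package:stats" %in% search())
```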

8.4 Package documentation

To get the documentation for a specific package that you have already installed, use the command help(package = "the_package_name").

Let’s get help for the pacman package:

help(package = "pacman")

The entire documentation is then displayed in the Files | Plots | Packages | Help pane of RStudio.

8.5 Functions from a specific package

To access all the functions and data from a given package, we need to load it in R using library(the_package) or require(the_package). The pacman package gives more flexibility: it loads a whole list of packages, and if any package on the list is not installed, pacman installs it for you.

The command to load a list of packages with pacman is as follows:

pkg_list <- c("tidyverse", "ggplot2", "lubridate", "flextable")
p_load(pkg_list, character.only = TRUE)

9 Import data in R

9.1 Inbuilt data

The iris data set already exists in the R environment. We can import data into R from different sources:

# access iris data
data("iris")
# opening the help page of the data set
help(iris)
# data structure
str(iris)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

# descriptive statistics
summary(iris)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##

9.2 Data from a package, without loading it with the `library` function

data("spam", package = "kernlab")
# data structure
str(spam[1:10])

## 'data.frame':    4601 obs. of  10 variables:
##  $ make    : num  0 0.21 0.06 0 0 0 0 0 0.15 0.06 ...
##  $ address : num  0.64 0.28 0 0 0 0 0 0 0 0.12 ...
##  $ all     : num  0.64 0.5 0.71 0 0 0 0 0 0.46 0.77 ...
##  $ num3d   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ our     : num  0.32 0.14 1.23 0.63 0.63 1.85 1.92 1.88 0.61 0.19 ...
##  $ over    : num  0 0.28 0.19 0 0 0 0 0 0 0.32 ...
##  $ remove  : num  0 0.21 0.19 0.31 0.31 0 0 0 0.3 0.38 ...
##  $ internet: num  0 0.07 0.12 0.63 0.63 1.85 0 1.88 0 0 ...
##  $ order   : num  0 0 0.64 0.31 0.31 0 0 0 0.92 0.06 ...
##  $ mail    : num  0 0.94 0.25 0.63 0.63 0 0.64 0 0.76 0 ...

9.3 Comma-separated values (CSV) file

To import a CSV file in R we can use:

the read.csv() function from the utils package, which ships with base R:

dta <- read.csv("./data/iris.csv")
head(dta)

Notice that the column Species is read as character. We can force its conversion to a factor by setting the argument stringsAsFactors to TRUE in read.csv().
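The effect of stringsAsFactors can be checked with a small self-contained sketch. It writes iris to a temporary file (an assumption made so the example runs without ./data/iris.csv) and reads it back both ways.

```r
# Write iris to a temporary CSV so the example does not depend on ./data
tmp <- tempfile(fileext = ".csv")
write.csv(iris, tmp, row.names = FALSE)

# Default import: text columns stay character (default since R 4.0)
as_chr <- read.csv(tmp)
class(as_chr$Species)

# Text columns converted to factors at import time
as_fct <- read.csv(tmp, stringsAsFactors = TRUE)
class(as_fct$Species)
levels(as_fct$Species)
```

Converting after the import with factor(), as done later in this document, gives the same result.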

the read_csv() function from the readr package, which is loaded together with tidyverse:

dta <- read_csv("./data/iris.csv")

## Rows: 150 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Species
## dbl (4): Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(dta)

! Always check the errors, warnings and messages printed by your code: they can reveal problems with the import, and they clutter the rendered report if left unhandled.

the import() function from the rio package, which detects the file format from its extension and loads the file accordingly.
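A minimal sketch of rio::import() follows. It assumes the rio package may or may not be installed, so the call is guarded, and it uses a temporary copy of iris rather than a file from ./data.

```r
# Temporary CSV file used only for this illustration
tmp <- tempfile(fileext = ".csv")
write.csv(iris, tmp, row.names = FALSE)

if (requireNamespace("rio", quietly = TRUE)) {
  # import() picks the reader from the ".csv" extension
  dta <- rio::import(tmp)
  str(dta)
} else {
  message("rio is not installed; run install.packages(\"rio\") first")
}
```

The same import() call is meant to work for other formats such as .xlsx or .sav, though some of them may require additional packages.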

Exercise: import all the csv files in a list using a for loop.

data_list <- list() # creating an empty list.
dir()

##  [1] "a.csv"                                   
##  [2] "additional_code.R"                       
##  [3] "aims-logo.jpg"                           
##  [4] "calc.png"                                
##  [5] "data"                                    
##  [6] "donnees_hospitalieres"                   
##  [7] "mypackages.R"                            
##  [8] "pipe.jpeg"                               
##  [9] "R_Intro_DrLema_files"                    
## [10] "R_Intro_DrLema.html"                     
## [11] "R_Intro_DrLema.pdf"                      
## [12] "R_Intro_DrLema.Rmd"                      
## [13] "R_Intro_IndabaX_Chad_fr_files"           
## [14] "R_Intro_IndabaX_Chad_fr.html"            
## [15] "R_Intro_IndabaX_Chad_fr.Rmd"             
## [16] "R_Intro_IndabaX_Chad.html"               
## [17] "R_Intro_IndabaX_Chad.Rmd"                
## [18] "rsconnect"                               
## [19] "Screenshots"                             
## [20] "telecharger_donnees_admissions_med.ipynb"

# check the file names in data/csv
dir("./data/csv/")

## character(0)

# import 
# TODO
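One possible way to approach the exercise is sketched below. Because the data/csv folder above is empty, the sketch first creates a temporary folder (the name csv_demo is an assumption) holding one CSV file per iris species, then imports every file into a named list with a for loop.

```r
# Build a temporary folder with three small CSV files (illustration only)
csv_dir <- file.path(tempdir(), "csv_demo")
dir.create(csv_dir, showWarnings = FALSE)
for (sp in levels(iris$Species)) {
  write.csv(iris[iris$Species == sp, ],
            file.path(csv_dir, paste0(sp, ".csv")),
            row.names = FALSE)
}

# List the CSV files with their full paths
csv_files <- dir(csv_dir, pattern = "\\.csv$", full.names = TRUE)

# Import each file and store it under its file name (without extension)
data_list <- list()
for (f in csv_files) {
  nm <- tools::file_path_sans_ext(basename(f))
  data_list[[nm]] <- read.csv(f)
}

names(data_list)
sapply(data_list, nrow)
```

For your own files, replace csv_dir with the path of the folder that contains them, for example "./data/csv".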

10 Pipe: `%>%` from `magrittr` package or `|>` from `base` package.

Library: tidyverse or dplyr
Shortcut: Ctrl + Shift + M
Why is it useful? It lets us write a chain of operations from left to right, in the order in which they are applied:

iris %>% group_by(Species) %>% summarise(mean = mean(Petal.Width))

Instead of the nested call, which has to be read from the inside out:

summarise(group_by(iris, Species), mean = mean(Petal.Width))
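The same idea can be checked with the native pipe `|>`, which needs no package but assumes R 4.1 or later. The sketch below computes the rounded mean petal width both ways and confirms that the results match.

```r
# Nested call: read from the inside out
nested <- round(mean(iris$Petal.Width), 2)

# Piped call: read from left to right (native pipe, R >= 4.1)
piped <- iris$Petal.Width |> mean() |> round(2)

# Both forms give the same value
identical(nested, piped)
```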

11 Data manipulation

11.1 Data manipulation with tidyverse

11.2 Data manipulation with tibble

11.3 Data manipulation with reshape2

library(ggplot2)
iris %>% ggplot(aes(y = Petal.Length,  col = Species)) + geom_boxplot()

11.4 Data display with kableExtra

11.5 Data display with officer, …

11.6 Data visualization with ggplot2

iris %>% 
  ggplot(aes(y = Petal.Length,  fill = Species)) + 
  geom_boxplot() + 
  facet_grid(~Species) + 
  theme_minimal()

https://bookdown.org/ozancanozdemir/introduction-to-ggplot2/

11.7 Data visualization with plotly

12 R advanced

12.1 Regular expressions

12.2 Unsupervised & Supervised Learning

12.2.1 Principal Component Analysis

12.2.2 Clustering: K-means, Hierarchical Clustering

12.2.3 K-Nearest Neighbor

12.2.4 Simple Linear Regression

12.2.5 Logistic Regression

12.3 Machine Learning

12.4 LaTeX in RStudio (R Markdown/Quarto markdown)

The variance of a real-valued sample $X = (X_1, \ldots, X_n)$ is given by:

$$
\mathrm{Var}(X) = 
\frac{1}{n}\sum_{i=1}^n\left(X_i-\frac{1}{n}
\sum_{j=1}^n X_j\right)^2
$$

The same formula written with the `\[ ... \]` delimiters:

\[ \mathrm{Var}(X) = \frac{1}{n}\sum_{i=1}^n\left(X_i-\frac{1}{n}\sum_{j=1}^n X_j\right)^2 \]
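As a quick numerical check, the variance, understood as the mean squared deviation from the mean with a division by n, can be computed directly in R. Note that the built-in var() divides by n - 1 instead, so the two values differ by the factor (n - 1) / n. The sketch uses iris$Petal.Length purely as illustrative data.

```r
x <- iris$Petal.Length
n <- length(x)

# Variance with division by n (mean squared deviation from the mean)
pop_var <- sum((x - mean(x))^2) / n

# var() uses the n - 1 denominator, so rescale it for comparison
all.equal(pop_var, var(x) * (n - 1) / n)

# The square root of the variance is the standard deviation
pop_sd <- sqrt(pop_var)
```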

12.5 Improvisation

# import data into R
donnees <- read.csv("./data/iris.csv")

donnees$Species <- factor(donnees$Species)

View(donnees)

# descriptive analysis of the data
summary(donnees)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##

# structure
str(donnees)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

12.6 Exploratory analysis

library(tidyverse)

donnees[-5] %>% colMeans()

## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##     5.843333     3.057333     3.758000     1.199333

Look at the relationship between two variables

plot(Sepal.Length ~ Sepal.Width, data = donnees)

plot(Petal.Length ~ Petal.Width, data = donnees)

with(donnees, hist(Petal.Length))

Bar chart

library(ggplot2)
donnees[-5] %>% colMeans() %>% barplot()

# the row names hold the variable names, so store them in a column called Variable
donnees[-5] %>% colMeans() %>% as.data.frame() %>% 
  rename(Means = ".") %>% 
  rownames_to_column("Variable") %>% 
  ggplot(aes(Means, Variable)) + geom_col() + theme_minimal()

Introduction to the R language (the R environment, data manipulation and visualization)

IndabaX Chad 2024

Lema Logamou Seknewna, Data Scientist at AIMS RIC

27 October, 2024