R can be used as a calculator: any arithmetic operation can be typed and
evaluated directly in the R console.
Logical operators are used to perform tests.
! is used to negate a condition: it turns a false condition into a true one (and a true one into a false one).
## [1] FALSE
## [1] TRUE
## [1] TRUE
## [1] FALSE
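The logical operators can be tried directly in the console. The following is a minimal sketch; the values compared are chosen purely for illustration and are not taken from the original examples.

```r
# Comparison operators return TRUE or FALSE
3 > 2               # TRUE
3 == 2              # FALSE

# ! negates a condition
!(3 > 2)            # FALSE
!FALSE              # TRUE

# & (and) and | (or) combine two conditions
(3 > 2) & (1 > 2)   # FALSE
(3 > 2) | (1 > 2)   # TRUE
```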
In R, we can use <-, = (a single equals sign!) and -> to assign a value to
a variable.
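A short sketch of the three assignment forms. The variable names a, b and c_val are hypothetical and only serve the illustration.

```r
a <- 1        # leftward assignment (the most common style)
b = 2         # a single equals sign also assigns
3 -> c_val    # rightward assignment: the value is on the left

total <- a + b + c_val
total         # 6
```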
A variable name:
a <- 1, 0 -> .a_.
## Error: <text>:1:3: unexpected symbol
## 1: v rsion
## ^
a1 <- 1.
In R, we have the following data types:
## [1] "numeric"
## [1] "double"
## [1] "numeric"
## [1] "double"
The suffix L declares the number that precedes it as an integer.
## [1] "integer"
In a complex number, i marks the imaginary part.
## [1] "complex"
## [1] "complex"
## [1] 1+2i
## [1] "complex"
## [1] "character"
Remember: R is case-sensitive, so LeaRning is different from Learning.
## [1] TRUE
## [1] FALSE
A logical value can also be the result of a test. Example: checking
whether LeaRning == Learning.
## [1] FALSE
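The basic types listed so far can be inspected with class() and typeof(). This is a minimal sketch with illustrative values, not a reproduction of the original chunks.

```r
class(2.5)          # "numeric"
typeof(2.5)         # "double"
class(2L)           # "integer": the L suffix makes the number an integer
class(1 + 2i)       # "complex"
class("LeaRning")   # "character"
class(TRUE)         # "logical"

# A logical value as the result of a test (strings are case-sensitive)
same_word <- "LeaRning" == "Learning"
same_word           # FALSE
```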
# text variable
text <- "J'apprend R."
# convert the text variable to raw (bytes)
raw_text <- charToRaw(text)
print(raw_text)
## [1] 4a 27 61 70 70 72 65 6e 64 20 52 2e
## [1] "raw"
Converting raw to text:
## [1] "J'apprend R."
A factor is a data type used to represent qualitative (categorical) information such as colours, good and bad, or ratings of courses or films. Factors are useful in statistical modelling.
## [1] Female Male
## Levels: Female Male
## [1] "factor"
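A factor like the one printed above could be built as follows. This is a sketch; the variable name sex is an assumption made for the example.

```r
sex <- factor(c("Female", "Male"))  # hypothetical variable name
sex
levels(sex)       # "Female" "Male"
class(sex)        # "factor"
as.integer(sex)   # 1 2: factors are stored as integer codes
```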
The most used data types in R are
Example 1:
## [1] "1" "R" "TRUE" "FALSE" NA
## [1] "character"
# sequence: ?seq
x <- seq(0, 2*pi, length.out = 90)
y <- seq(0, 2*pi, by = 0.1282283)
head(x); head(y)
## [1] 0.00000000 0.07059759 0.14119518 0.21179276 0.28239035 0.35298794
## [1] 0.0000000 0.1282283 0.2564566 0.3846849 0.5129132 0.6411415
## [1] 5.930197 6.000795 6.071393 6.141990 6.212588 6.283185
## [1] 0.000000 6.283185
## [1] 0
## [1] 6.283185
## [1] 0.6353783
The length of a vector is given by:
## [1] 90
## [1] 2
## [1] 1
A scalar is a vector of length 1.
Example 2:
## [1] "I learn R" "I learn R" "I learn R" "I learn R" "I learn R"
## [1] 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
## [1] 0 0 0 0 0 0 0 0 0 NA
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
## [112] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [149] 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [186] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [223] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5
## [260] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
## [297] 5 5 5 5
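Repeated vectors such as those printed above are typically produced with rep(). The sketch below uses shorter lengths than the original output so that the results are easy to check by eye; the exact calls used in the original are an assumption.

```r
rep("I learn R", times = 5)     # the same string five times
rep(0:1, times = 3)             # 0 1 0 1 0 1  (the whole vector is repeated)
rep(0:1, each = 3)              # 0 0 0 1 1 1  (each element is repeated)

long_vec <- rep(0:5, each = 50)
length(long_vec)                # 300
```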
# sampling
set.seed(24102024) # fix the randomness for reproducibility.
sample(0:1, size = 100, replace = TRUE, prob = c(0.3, 0.7)) -> y
y; y == 0
## [1] 1 1 1 1 0 1 1 0 1 1 0 0 1 1 1 1 0 1 1 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 0
## [38] 1 1 1 1 1 0 0 1 1 1 0 1 1 1 0 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1
## [75] 1 1 0 1 0 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 0
## [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE TRUE
## [13] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
## [37] TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE TRUE
## [49] FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [73] FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [97] FALSE FALSE FALSE TRUE
## y
## 0 1
## 21 79
## [1] 21
## [1] 79
## [1] 1
## [1] 0
## [1] 1
A matrix is a two-dimensional data set with rows and columns.
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 6 11 16 21
## [2,] 2 7 12 17 22
## [3,] 3 8 13 18 23
## [4,] 4 9 14 19 24
## [5,] 5 10 15 20 25
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 2 3 4 5
## [2,] 6 7 8 9 10
## [3,] 11 12 13 14 15
## [4,] 16 17 18 19 20
## [5,] 21 22 23 24 25
## [,1] [,2] [,3]
## [1,] 1 5 4
## [2,] 0 2 2
## [3,] 2 1 0
## [,1] [,2] [,3]
## [1,] 2 3 0
## [2,] 5 1 1
## [3,] 2 1 1
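By default matrix() fills column by column; byrow = TRUE fills row by row, which is the difference between the first two matrices printed above. A small sketch with a 2 x 3 matrix (the names M_col and M_row are hypothetical):

```r
M_col <- matrix(1:6, nrow = 2)                # filled column by column (default)
M_row <- matrix(1:6, nrow = 2, byrow = TRUE)  # filled row by row

M_col[1, ]   # 1 3 5
M_row[1, ]   # 1 2 3
dim(M_col)   # 2 3
```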
We can also construct a matrix from vectors \(M = (v_1, v_2, v_3)\) using the
cbind and rbind functions.
## v1 v2 v3
## [1,] 1 5 4
## [2,] 0 2 2
## [3,] 2 1 0
## [,1] [,2] [,3]
## v1 1 0 2
## v2 5 2 1
## v3 4 2 0
## [1] "matrix" "array"
## [1] "matrix" "array"
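A sketch of cbind and rbind, using vector values read off the output above (v1, v2, v3 as columns, then as rows):

```r
v1 <- c(1, 0, 2)
v2 <- c(5, 2, 1)
v3 <- c(4, 2, 0)

M_cols <- cbind(v1, v2, v3)   # vectors become columns
M_rows <- rbind(v1, v2, v3)   # vectors become rows

class(M_cols)                 # "matrix" "array"
all(t(M_cols) == M_rows)      # TRUE: one is the transpose of the other
```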
The dim function can set the dimensions of a vector, which turns it into a matrix. dim is also used to check the dimensions of a matrix,
a data frame or an array.
M3 <- c(1, 5, 4, 0, 2, 2, 2, 1, 0)
dim(M3) <- c(3, 3) # sets the dimensions of M3
dim(M3) # shows the dimensions of M3
## [1] 3 3
## [,1] [,2] [,3]
## [1,] 1 0 2
## [2,] 5 2 1
## [3,] 4 2 0
## [1] "matrix" "array"
## [,1] [,2] [,3]
## [1,] 1 0 2
## [2,] 5 2 1
## [3,] 4 2 0
## [,1] [,2] [,3]
## [1,] 3 8 4
## [2,] 5 3 3
## [3,] 4 2 1
## [,1] [,2] [,3]
## [1,] -1 2 4
## [2,] -5 1 1
## [3,] 0 0 -1
# number of columns in A: dim(A)[2], or ncol(A).
# number of rows in A: dim(A)[1], or nrow(A)
dim(A)[2] == ncol(A)
## [1] TRUE
## [1] TRUE
## [,1] [,2] [,3]
## [1,] 35 12 9
## [2,] 14 4 4
## [3,] 9 7 1
## [,1] [,2] [,3]
## [1,] -1 2.0 1
## [2,] 2 -4.0 -1
## [3,] -2 4.5 1
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
\[
\begin{cases}
2x + 2y &= 4\\
x + 3y &= 4
\end{cases}
\] The matrix of the system is \(\displaystyle A =
\begin{pmatrix}2&2\\1&3\end{pmatrix}\) and the right-hand
side is \(b =
\begin{pmatrix}4\\4\end{pmatrix}\). We can use the
solve function to obtain the solution.
## [1] 1 1
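The system above can be solved as sketched here. The matrix and right-hand side come from the text; the name x_sol for the solution is an assumption.

```r
A <- matrix(c(2, 2,
              1, 3), nrow = 2, byrow = TRUE)  # coefficient matrix
b <- c(4, 4)                                  # right-hand side

x_sol <- solve(A, b)   # solves A %*% x = b
x_sol                  # 1 1
A %*% x_sol            # reproduces b
```

Calling solve(A) with a single argument returns the inverse of A instead.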
## [,1] [,2] [,3]
## [1,] 4 -8.0 -1
## [2,] -5 10.5 5
## [3,] -2 4.5 2
Requirements:
A should be a square matrix of dimension \(n\). Its eigenvalues \(\lambda\) are the roots of the characteristic polynomial:
\[ P_A(\lambda) = \det(A-\lambda I_n) = 0,\quad n\in \mathbb{N}. \]
## [1] 4.7664355 -1.4836116 -0.2828239
## [1] TRUE
## [,1] [,2] [,3]
## [1,] -0.8535725 -0.3668743 0.2177685
## [2,] -0.3052279 -0.4631774 -0.6431613
## [3,] -0.4221966 0.8067651 0.7341120
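A small sketch of eigen() on a 2 x 2 matrix chosen for illustration (it is not the matrix used in the output above), checking that each eigenvalue is a root of the characteristic polynomial:

```r
A <- matrix(c(2, 1,
              1, 2), nrow = 2)   # illustrative symmetric matrix
e <- eigen(A)

e$values                          # 3 1
# det(A - lambda * I) should be (numerically) zero for each eigenvalue
char_poly <- sapply(e$values, function(l) det(A - l * diag(2)))
round(char_poly, 10)

# A v = lambda v for the first eigenpair
v1 <- e$vectors[, 1]
residual <- A %*% v1 - e$values[1] * v1
```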
Arrays are data structures that can have more than two dimensions.
## , , 1
##
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
##
## , , 2
##
## [,1] [,2] [,3] [,4]
## [1,] 13 16 19 22
## [2,] 14 17 20 23
## [3,] 15 18 21 24
## [1] "array"
An example of an array is NetCDF data with, for instance:
* Longitude as column names (\(n\))
* Latitude as row names (\(p\))
* The 3rd dimension could be time.
For each time step, we have a \(p\times n\) matrix.
## [1] 3 4 2
## [1] 13
The indexing order is: row position, column position, matrix level.
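A sketch that is consistent with the array output shown above (values 1 to 24 in a 3 x 4 x 2 array); the name arr is an assumption.

```r
arr <- array(1:24, dim = c(3, 4, 2))  # 3 rows, 4 columns, 2 matrix levels

dim(arr)        # 3 4 2
arr[1, 1, 2]    # 13: row 1, column 1, second level
arr[, , 2]      # the whole second 3 x 4 matrix
```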
A list is a collection of objects of possibly different types. The elements can also differ in size.
## [,1] [,2] [,3]
## [1,] 1 5 4
## [2,] 0 2 2
## [3,] 2 1 0
## [1] "matrix" "array"
## , , 1
##
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
##
## , , 2
##
## [,1] [,2] [,3] [,4]
## [1,] 13 16 19 22
## [2,] 14 17 20 23
## [3,] 15 18 21 24
## $Array
## , , 1
##
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
##
## , , 2
##
## [,1] [,2] [,3] [,4]
## [1,] 13 16 19 22
## [2,] 14 17 20 23
## [3,] 15 18 21 24
##
##
## $matrix
## [,1] [,2] [,3]
## [1,] 1 5 4
## [2,] 0 2 2
## [3,] 2 1 0
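A list like the one printed above can be sketched as follows. The element names Array and matrix are taken from the output; the object names M, arr and my_list are assumptions.

```r
M   <- matrix(c(1, 0, 2, 5, 2, 1, 4, 2, 0), nrow = 3)  # 3 x 3 matrix
arr <- array(1:24, dim = c(3, 4, 2))                   # 3 x 4 x 2 array

my_list <- list(Array = arr, matrix = M)

names(my_list)                 # "Array" "matrix"
my_list$matrix[1, 2]           # 5: access by name with $
my_list[["Array"]][1, 1, 2]    # 13: access by name with [[ ]]
length(my_list)                # 2
```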
A data frame is a table with \(n\) rows (observations) and \(p\) columns (features or variables). Different columns can hold different data types.
## [1] 0 10 20 30 40 50 60 70 80 90 100 110 120
# (brks <- seq(min(age), max(age), le = 40))
age_groups <- cut(age, breaks = brks, include.lowest = TRUE)
class(age_groups)
## [1] "factor"
## age_groups
## [0,10] (10,20] (20,30] (30,40] (40,50] (50,60] (60,70] (70,80]
## 10 9 9 5 8 8 9 8
## (80,90] (90,100] (100,110] (110,120]
## 9 8 9 8
## integer(0)
## [1] TRUE
A data frame is also a list in which all elements (columns) have the same length. A data frame in R is a table.
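A minimal sketch of a data frame with columns of different types. The column names and values are invented for the example.

```r
# Hypothetical data: three observations, three variables of different types
df <- data.frame(
  name    = c("A", "B", "C"),
  age     = c(23, 35, 41),
  student = c(TRUE, FALSE, FALSE)
)

dim(df)             # 3 3
is.list(df)         # TRUE: a data frame is a list of equal-length columns
sapply(df, class)   # "character" "numeric" "logical"
df$age              # a column is accessed like a list element
```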
data.frame() function
NA)
## [1] 1 4 6
## [1] 3 5 7
# equal to 2.
# which(x != NA) wrong way to check for non-missing values
which(!is.na(x)) # means: which of the elements of x are not missing
## [1] 2 3 5 7
## [1] NA NA NA
x[is.na(x)] <- mean(x[which(!is.na(x))]) # Good but could be shorter
x[is.na(x)] <- mean(x, na.rm = TRUE)
print(x)
## [1] 2.3875 1.0000 2.0000 2.3875 3.0000 2.3875 3.5500
## [1] "character"
## Warning: NAs introduced by coercion
## [1] 2.0 1.0 2.0 7.0 3.0 2.5 9.0 2.7
## [1] "2" "1" "2" "7" "3" "2.5" "9" "2.7"
The cex.names argument of barplot scales the size of the bar labels. Small
values, say cex.names = 0.6, shrink the labels so that R can show all of
them.
## [1] 35
## [1] 0.35
set.seed(13092024)
unif_dist <- runif(1000)
hist(unif_dist, probability = TRUE, xlab = NULL,
main = "Histogram of uniform distribution", col = "turquoise")
lines(density(unif_dist), col = "red", lwd = 2, lty = 2)
abline(v = mean(unif_dist), col = "blue", lty = 2, lwd = 2)
legend("topleft", lty = c(2, 2),
col = c("red", "blue"), legend = c("Density", "Average"))
rbinom_dist <- rbinom(10000, 10, 0.5)
hist(rbinom_dist, probability = TRUE, main = "Histogram of binomial distribution",
col = "turquoise", breaks = 20, xlab = NULL)
lines(density(rbinom_dist), col = "red", lwd = 2, lty = 2)
abline(v = mean(rbinom_dist), col = "blue", lty = 2, lwd = 2) # vertical line
legend("topleft", lty = c(2, 2),
col = c("red", "blue"), legend = c("Density", "Average"))
set.seed(13092024)
gauss_dist <- rnorm(1000, mean = 0, sd = 1)
hist(gauss_dist, probability = TRUE, breaks = 30, xlab = NULL,
main = "Histogram of standard normal\ndistribution", col = "turquoise")
lines(density(gauss_dist), col = "red", lwd = 2, lty = 2)
abline(v = mean(gauss_dist), col = "blue", lty = 2, lwd = 2)
legend("topleft", lty = c(2, 2),
col = c("red", "blue"), legend = c("Density", "Average"))
set.seed(13092024)
x <- rnorm(1000); y <- rnorm(1000, mean = 5, sd = 1.5)
z <- 4 + 8*x + rnorm(1000) # linear dependence between x and z
par(mfrow = c(1, 2))
plot(x, y, main = "Scatter plot of x and y", col = "blue", pch = 16); grid()
plot(x, z, main = "Scatter plot of x and z", col = "blue", pch = 16)
set.seed(13092024)
exp_dist <- rexp(1000, rate = 1)
hist(exp_dist, probability = TRUE, xlab = NULL,
main = "Histogram of exponential distribution", col = "turquoise")
lines(density(exp_dist), col = "red", lwd = 2, lty = 2)
abline(v = mean(exp_dist), col = "blue", lty = 2, lwd = 2)
legend("topright", lty = c(2, 2),
col = c("red", "blue"), legend = c("Density", "Average"))
set.seed(13092024)
pois_dist <- rpois(1000, lambda = 2.5)
hist(pois_dist, probability = TRUE, main = "Histogram of poisson distribution",
col = "turquoise", xlab = NULL)
lines(density(pois_dist), col = "red", lwd = 2, lty = 2)
abline(v = mean(pois_dist), col = "blue", lty = 2, lwd = 2)
legend("topright", lty = c(2, 2),
col = c("red", "blue"), legend = c("Density", "Average"))
if / else
if (condition/Boolean expression){
## code to be executed
}
## [1] TRUE
## [1] TRUE
We can nest if statements inside if and else blocks.
if (x < 4){
if (x != 0){
print("x is not equal to zero.")
} else {
print("x is equal to zero")
}
print("x is less than 4")
} else {
if (x > 1){
print("x is greater than 1.")
} else {
print("x is less than or equal to 1")
}
print("x is greater than or equal to 4.")
}
## [1] "x is not equal to zero."
## [1] "x is less than 4"
for (i in vector){
## code to be executed
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
Write a for loop that checks whether each of the first 10 positive integers is odd or even.
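One possible solution to this exercise, given as a sketch rather than the reference answer. It uses the modulo operator %% and stores the results in a vector named parity, a name chosen for this example.

```r
parity <- character(10)          # one slot per integer
for (i in 1:10) {
  if (i %% 2 == 0) {             # remainder of the division by 2
    parity[i] <- "even"
  } else {
    parity[i] <- "odd"
  }
  print(paste(i, "is", parity[i]))
}
```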
Using a for loop, import all CSV files from the
data_files folder.
## [1] "/Users/logamouseknewnalema/Library/CloudStorage/GoogleDrive-lseknewna@aimsric.org/My Drive/INDABAXCHAD"
## [1] "a.csv"
## [2] "additional_code.R"
## [3] "aims-logo.jpg"
## [4] "calc.png"
## [5] "data"
## [6] "donnees_hospitalieres"
## [7] "mypackages.R"
## [8] "pipe.jpeg"
## [9] "R_Intro_DrLema_files"
## [10] "R_Intro_DrLema.html"
## [11] "R_Intro_DrLema.pdf"
## [12] "R_Intro_DrLema.Rmd"
## [13] "R_Intro_IndabaX_Chad_fr_files"
## [14] "R_Intro_IndabaX_Chad_fr.html"
## [15] "R_Intro_IndabaX_Chad_fr.Rmd"
## [16] "R_Intro_IndabaX_Chad.html"
## [17] "R_Intro_IndabaX_Chad.Rmd"
## [18] "rsconnect"
## [19] "Screenshots"
## [20] "telecharger_donnees_admissions_med.ipynb"
## character(0)
# Exercise: write a for loop to import all
# csv files in a list.
# the pattern is a regular expression: "\\.csv$" matches names ending in .csv
(file_names <- dir("./data_list/", pattern = "\\.csv$"))
## character(0)
Hints: Importing files from the working directory
string1 <- "." # working directory (root where the script is saved)
string2 <- "folder" # folder in the working directory
string3 <- "subfolder" # sub-folder in folder
paste(string1, string2, string3, sep = "/")
## [1] "./folder/subfolder"
## [1] "./folder/subfolder"
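The hints above can be combined into a sketch of the import exercise. Everything about the folder is an assumption: the example creates its own temporary folder with two small CSV files so that it runs anywhere, and the names data_dir and data_list are hypothetical.

```r
# Hypothetical setup: a temporary folder holding two small CSV files
data_dir <- file.path(tempdir(), "data_files_demo")
dir.create(data_dir, showWarnings = FALSE)
write.csv(head(iris),   file.path(data_dir, "iris_head.csv"),   row.names = FALSE)
write.csv(head(mtcars), file.path(data_dir, "mtcars_head.csv"), row.names = FALSE)

# List the CSV files, then read each one into a named list
file_names <- dir(data_dir, pattern = "\\.csv$")
data_list <- list()
for (f in file_names) {
  data_list[[f]] <- read.csv(file.path(data_dir, f))
}

names(data_list)          # one element per file
sapply(data_list, nrow)   # number of rows read from each file
```

In a real project, data_dir would simply be the path to the existing folder.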
while (condition){
## code to be executed
# increment
}
## [1] 0
## [1] 20
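A runnable sketch of the while syntax. The counter and the bound are illustrative and are not the ones behind the output above.

```r
i <- 0                 # initialise the counter
while (i < 5) {        # condition checked before every iteration
  print(i)
  i <- i + 1           # increment, otherwise the loop never ends
}
i                      # 5: the first value for which the condition is FALSE
```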
Write a program that tells the user YOU WON! and exits if they get 5 three times in a row.
Write a program that runs continuously: it asks the user to input a number between 0 and 9, provides the multiplication table by 2, and asks the user whether to stop or continue.
Hint: Use the function
readline(prompt = "Enter a number: ") to interact with the
user.
number <- readline(prompt = "Enter a number: ") # readline() returns a character string, so a conversion (e.g. as.numeric()) is needed.
Syntax of the repeat loop:
# initialise the counter i (or any other state)
i <- 0
repeat{
# execute a code
# increment
i <- i + 1
# stopping criteria
if ( something happens ){
break # repeat until something happens
}
}
## [1] 0
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
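A minimal repeat loop that prints the integers 0 to 10, as in the output above. Whether the original chunk looked exactly like this is an assumption.

```r
i <- 0
repeat {
  print(i)          # the code to execute
  i <- i + 1        # increment
  if (i > 10) {     # stopping criterion
    break           # leave the loop
  }
}
i                   # 11: the value that triggered the break
```

Unlike while, repeat has no condition of its own, so the break is what prevents an infinite loop.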
apply
The apply() function returns a vector, array or list of
values obtained by applying a function to the margins of an array or
matrix.
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
## [1] 5.5
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. 4.300000 2.000000 1.000 0.100000
## 1st Qu. 5.100000 2.800000 1.600 0.300000
## Median 5.800000 3.000000 4.350 1.300000
## Mean 5.843333 3.057333 3.758 1.199333
## 3rd Qu. 6.400000 3.300000 5.100 1.800000
## Max. 7.900000 4.400000 6.900 2.500000
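A sketch of apply() on a small matrix and on the numeric columns of iris. MARGIN = 1 works over rows and MARGIN = 2 over columns; the object names are chosen for the example.

```r
m <- matrix(1:4, nrow = 2)        # rows: (1, 3) and (2, 4)

row_sums <- apply(m, 1, sum)      # 4 6
col_sums <- apply(m, 2, sum)      # 3 7

# Column-wise summaries of the four numeric iris variables
iris_summary <- apply(iris[-5], 2, summary)
dim(iris_summary)                 # 6 4: six statistics for four variables
```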
sapply:
use ?sapply to check the documentation.
## [1] 1 2 3 4
The sapply function can also return a list if the
outputs are not of the same length.
## $Sepal.Length
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.300 5.100 5.800 5.843 6.400 7.900
##
## $Sepal.Width
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.800 3.000 3.057 3.300 4.400
##
## $Petal.Length
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.600 4.350 3.758 5.100 6.900
##
## $Petal.Width
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.100 0.300 1.300 1.199 1.800 2.500
##
## $Species
## setosa versicolor virginica
## 50 50 50
df <- data.frame(replicate(10, rnorm(1000)))
L <- as.list(df) # converting data frame to list.
sapply(L, avg)
## X1 X2 X3 X4 X5 X6
## 0.002825105 0.020374063 0.055152188 -0.044140208 -0.016019744 0.050948281
## X7 X8 X9 X10
## -0.030072130 0.015579582 -0.007629547 0.019322313
## [1] 1 4 9 16 25 36 49 64 81 100
lapply:
The lapply() function returns a list of the same length
as X, each element of which is the result of applying FUN to the
corresponding element of X.
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5.843333 3.057333 3.758000 1.199333
tapply:
check the documentation using ?tapply
## [1] TRUE
## setosa versicolor virginica
## 5.006 5.936 6.588
vapply:
check the documentation using ?vapply
vapply(X = as.list(iris[-5]), quantile, FUN.VALUE =
c("0%" = 0, "25%" = 0, "50%" = 0, "75%" = 0, "100%" = 0))
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 0% 4.3 2.0 1.00 0.1
## 25% 5.1 2.8 1.60 0.3
## 50% 5.8 3.0 4.35 1.3
## 75% 6.4 3.3 5.10 1.8
## 100% 7.9 4.4 6.90 2.5
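The four functions can be compared side by side on small inputs. This is a sketch; the anonymous squaring function and the object names are assumptions made for the comparison.

```r
squares_list <- lapply(1:3, function(v) v^2)   # always returns a list
squares_vec  <- sapply(1:3, function(v) v^2)   # simplifies to a vector: 1 4 9

# vapply is like sapply, but the type and length of each result are declared
col_means <- vapply(iris[-5], mean, FUN.VALUE = numeric(1))

# tapply applies a function to a vector split by groups
by_species <- tapply(iris$Sepal.Length, iris$Species, mean)
by_species    # setosa 5.006, versicolor 5.936, virginica 6.588
```

Declaring FUN.VALUE makes vapply fail early if a result has an unexpected shape, which is why it is often preferred inside functions.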
Syntax to write/define a function in R:
function_name <- function(arg1, arg2, ...){
# code to be executed
}
## [1] 2
Exercise: write a function that takes a vector x as argument,
detects NA values and replaces them by the mean.
## NULL
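A possible sketch of such a function, reusing the mean(x, na.rm = TRUE) idea shown earlier for missing values. The name replace_na_mean is hypothetical.

```r
# Replace the missing values of a numeric vector by the mean of the others
replace_na_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)  # mean computed without the NAs
  x                                     # return the completed vector
}

filled <- replace_na_mean(c(1, NA, 3))
filled   # 1 2 3
```

The last expression of the body is the returned value; a function whose body ends with an assignment returns its result invisibly, so nothing is printed at the console.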
A package is a collection of data and functions together with their documentation.
installed.packages list.
# to get the list of all installed packages in R
all_packages <- rownames(installed.packages())
head(all_packages, 20) # only displaying the first 20 packages in alphabetical order
## [1] "abind" "acepack" "ada" "adabag"
## [5] "ade4" "adegenet" "alphavantager" "animation"
## [9] "AnnotationDbi" "AnophelesModel" "anytime" "ape"
## [13] "arrow" "arsenal" "ash" "askpass"
## [17] "assertthat" "audio" "automap" "backports"
Having the list of all packages, we can check if a package is in it.
## [1] FALSE
It looks like the package is not installed yet. Using an if
statement, we can check whether a package is missing and then install it
using the install.packages() function.
if (!"pacman" %in% all_packages) {
install.packages("pacman", repos = "http://cran.us.r-project.org")
}
##
## The downloaded binary packages are in
## /var/folders/x6/rdmyg9yd5cq432r1z8p90p6r0000gn/T//RtmpKhzeNP/downloaded_packages
library function from the base package.
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::group_rows() masks kableExtra::group_rows()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
R displays messages when loading the tidyverse package. You would probably not want them displayed in your report.
Use the following chunk options:
{r warning=FALSE, message=FALSE}
To get the documentation for a specific package that you have already
installed, use the command
help(package = "the_package_name")
Let’s get help for the pacman package
We can have the entire documentation displayed in
File | Plots | Packages | Help pane.
To access all the functions and data from a given package, we need to
load it in R using library(the_package) or
require(the_package). The pacman package gives
more flexibility: it loads a list of packages and, if any package on the
list is not installed, pacman installs it for
you.
The command to load a list of packages with pacman is as
follows:
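A hedged sketch of loading several packages at once. It assumes pacman may or may not be installed: if it is, pacman::p_load() is used with its char argument; otherwise the code falls back to library(). The package list pkgs is a placeholder made of packages shipped with R.

```r
pkgs <- c("stats", "utils")   # placeholder list: replace with the packages you need

if (requireNamespace("pacman", quietly = TRUE)) {
  # p_load installs any missing package from the list, then loads them all
  pacman::p_load(char = pkgs)
} else {
  # fallback without pacman: load each package by its name
  invisible(lapply(pkgs, library, character.only = TRUE))
}

loaded <- paste0("package:", pkgs) %in% search()
loaded   # TRUE for every package that is now attached
```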
The iris data set already exists in the R environment. We
can import data into R from different sources:
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
library function.
## 'data.frame': 4601 obs. of 10 variables:
## $ make : num 0 0.21 0.06 0 0 0 0 0 0.15 0.06 ...
## $ address : num 0.64 0.28 0 0 0 0 0 0 0 0.12 ...
## $ all : num 0.64 0.5 0.71 0 0 0 0 0 0.46 0.77 ...
## $ num3d : num 0 0 0 0 0 0 0 0 0 0 ...
## $ our : num 0.32 0.14 1.23 0.63 0.63 1.85 1.92 1.88 0.61 0.19 ...
## $ over : num 0 0.28 0.19 0 0 0 0 0 0 0.32 ...
## $ remove : num 0 0.21 0.19 0.31 0.31 0 0 0 0.3 0.38 ...
## $ internet: num 0 0.07 0.12 0.63 0.63 1.85 0 1.88 0 0 ...
## $ order : num 0 0 0.64 0.31 0.31 0 0 0 0.92 0.06 ...
## $ mail : num 0 0.94 0.25 0.63 0.63 0 0.64 0 0.76 0 ...
To import a CSV file in R we can use:
read.csv() function from the utils package (part of the standard R distribution)
Notice that the column Species is read as
character. We can force the conversion to factor by setting the
argument stringsAsFactors to TRUE in
read.csv().
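The effect of stringsAsFactors can be checked with a temporary file. In this sketch the file is written from the built-in iris data, so the path csv_path and the object names d_chr and d_fct are assumptions.

```r
# Write iris to a temporary CSV file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
write.csv(iris, csv_path, row.names = FALSE)

d_chr <- read.csv(csv_path)                           # default since R 4.0.0
class(d_chr$Species)                                  # "character"

d_fct <- read.csv(csv_path, stringsAsFactors = TRUE)  # strings become factors
class(d_fct$Species)                                  # "factor"
levels(d_fct$Species)
```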
read_csv() function from the readr package,
already loaded together with tidyverse.
## Rows: 150 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Species
## dbl (4): Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
! Always check the errors, warnings and messages to make your report look good.
import function from the rio package,
which detects the file extension and loads the file accordingly.
Exercise: import all the csv files in a list using a for loop.
## [1] "a.csv"
## [2] "additional_code.R"
## [3] "aims-logo.jpg"
## [4] "calc.png"
## [5] "data"
## [6] "donnees_hospitalieres"
## [7] "mypackages.R"
## [8] "pipe.jpeg"
## [9] "R_Intro_DrLema_files"
## [10] "R_Intro_DrLema.html"
## [11] "R_Intro_DrLema.pdf"
## [12] "R_Intro_DrLema.Rmd"
## [13] "R_Intro_IndabaX_Chad_fr_files"
## [14] "R_Intro_IndabaX_Chad_fr.html"
## [15] "R_Intro_IndabaX_Chad_fr.Rmd"
## [16] "R_Intro_IndabaX_Chad.html"
## [17] "R_Intro_IndabaX_Chad.Rmd"
## [18] "rsconnect"
## [19] "Screenshots"
## [20] "telecharger_donnees_admissions_med.ipynb"
## character(0)
%>% from the magrittr package or
|> from the base package.
tidyverse or dplyr
Ctrl + Shift + M
Instead of
iris %>%
ggplot(aes(y = Petal.Length, fill = Species)) +
geom_boxplot() +
facet_grid(~Species) +
theme_minimal()
The variance of a real-valued variable \(X = (X_1, \ldots, X_n)\) is given by:
\[ \mathrm{Var}(X) = \frac{1}{n}\sum_{i=1}^n\left(X_i-\frac{1}{n}\sum_{j=1}^nX_j\right)^2 \]
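A small numerical check of these quantities: the mean of the squared deviations (dividing by n) and its square root, compared with R's var() and sd(), which divide by n - 1. The data vector is invented for the example.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)     # illustrative data, mean = 5
n <- length(x)

pop_var <- mean((x - mean(x))^2)   # divides by n: 32 / 8 = 4
pop_sd  <- sqrt(pop_var)           # 2

var(x)                             # divides by n - 1: 32 / 7
pop_var * n / (n - 1)              # same value as var(x)
```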
# import the data into R
donnees <- read.csv("./data/iris.csv")
donnees$Species <- factor(donnees$Species)
View(donnees)
# descriptive analysis of the data
summary(donnees)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5.843333 3.057333 3.758000 1.199333
Viewing the relationship between two variables
Bar chart
# the row names are the variable names (not species), so the column is called Variable
donnees[-5] %>% colMeans() %>% as.data.frame() %>%
rename(Means = ".") %>%
rownames_to_column("Variable") %>%
ggplot(aes(Means, Variable)) + geom_col() + theme_minimal()