R a statistical programming language created in 1992 by two statisticians from the Auchkand University (New Zealand), Ross Ihaka & Robert Gentlemen. The first version (1.0) was released in February 2000. The name R comes from their first names’ initials. The R language come from the S language also based on Fortran was developped between 1975 and 1976 at Bell Laboratories. The current version of R is 4.5.1.
Choose your OS (operating system):
If you do not know how to install a package, refer to the section Packages.
# the following code installs tinytex if it is not installed already.
if (!"tinytex" %in% rownames(installed.packages())){
install.packages("tinytex")
}
library(tinytex)
install_tinytex() # to install LaTeXYou can also access R online from Posit Cloud
Kindly reply to this survey. Collected data will be used in this training.
R is a calculator because you can perform all operations in the R
console.
==## [1] FALSE
The equality operator can also be used to match one element with multiple elements
## [1] FALSE FALSE FALSE FALSE TRUE
!Used to change a TRUE condition to FALSE
(respectively a FALSE condition to TRUE)
## [1] FALSE
## [1] TRUE
## [1] TRUE
## [1] FALSE
In R, we also have inbuilt functions that help to match element of a
given vector. The first function is match(). You can check
the documentation with help("match") or
?match. Read that: match returns a
vector of the positions of (first) matches of its first argument in its
second.
## [1] 5
The second function %in% check the existence of a value
in a given vector (of values).
## [1] TRUE
In R we can use <-, = (single equal sign
!) and -> to assign a value to a variable.
A variable name:
a <- 1,
0 -> .a_ or
a dot ..## Error in parse(text = input): <text>:1:3: unexpected symbol
## 1: v rsion
## ^
a1 <- 1.In R we have the following data types: * numeric * integer * complex * character * logical * raw * factor
Examples of numberic numbers are 10.5, 55, 787, pi
## [1] "numeric"
## [1] "double"
## [1] "numeric"
## [1] "double"
L declares this as an
integer).## [1] "integer"
An example of a complex number is \(9 +
3i\), where \(i\) is the
imaginary part. Multiplying a real number by 1i, transforms
it to complex.
## [1] "complex"
## [1] "complex"
## [1] 1+2i
## [1] "complex"
## [1] "character"
Remember!! LeaRning is different from
Learning.
## [1] TRUE
## [1] FALSE
Logical output can also be an outcome of a test. Example: if we want
to check if "LeaRning" == "Learning"
## [1] FALSE
## [1] 49 20 61 6d 20 6c 65 61 72 6e 69 6e 67 20 52 2e
## [1] "raw"
Converting raw to text:
## [1] "I am learning R."
They are a data type that is used to refer to a qualitative relationship like colors, good & bad, course or movie ratings, etc. They are useful in statistical modeling.
## [1] Female Male
## Levels: Female Male
## [1] "factor"
## [1] "logical"
## [1] "logical"
## [1] FALSE
## [1] FALSE
## [1] 0 0 0 0 0 0 0 0 0 0
## [1] 2.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
## [1] 0 0 0 0 0 0 0 0 0 0
## [1] "integer"
## [1] "numeric"
## [1] 2.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
To convert data in R we can use function starting with
as. + data type from the base package.
Numeric to character
Character to numeric
Factor to character
Character to factor
The most used data types in R are
Let create a vector with elements of different types to see how R will deal with them.
## [1] "1" "R" "TRUE" "FALSE" NA
## [1] "character"
R converts everything in character type except NA which is common to numeric and character.
## [1] 1 4 8 0 1 0 0 1 0
Here, R converts everything into numeric. FALSE is 0 and TRUE is 1.
seq()
function## [1] 90
## [1] 0.0706
## [1] 0.00000000 0.07059759 0.14119518 0.21179276 0.28239035 0.35298794
## [7] 0.42358553 0.49418311 0.56478070 0.63537829 0.70597588 0.77657346
## [13] 0.84717105 0.91776864 0.98836623 1.05896382 1.12956140 1.20015899
## [19] 1.27075658 1.34135417
## [1] 0.00000000 0.07059759 0.14119518 0.21179277 0.28239036 0.35298795
## [7] 0.42358554 0.49418313 0.56478072 0.63537831 0.70597590 0.77657349
## [13] 0.84717108 0.91776867 0.98836626 1.05896385 1.12956144 1.20015903
## [19] 1.27075662 1.34135421
## [1] 5.930197 6.000795 6.071393 6.141990 6.212588 6.283185
## [1] 0.000000 6.283185
If you want the difference between the max and the min of x
## [1] 2
## [1] 6.283185
## [1] 0
## [1] 6.283185
## [1] TRUE
## [1] TRUE
## [1] 9
The length of a vector is given by:
## [1] 90
## [1] 2
## [1] 1
A scalar is a vector of length 1.
Example 2:
## [1] "I learn R" "I learn R" "I learn R" "I learn R" "I learn R" "I learn R"
## [7] "I learn R" "I learn R" "I learn R" "I learn R"
## [1] 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
## [1] 0 0 0 0 0 0 0 0 0 NA
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Random generation
# sampling
set.seed(123) # fix the randomness for reproducibility.
sample(0:1, size = 200, replace = TRUE, prob = c(1/3, 1-1/3)) -> y
print(y)## [1] 1 0 1 0 0 1 1 0 1 1 0 1 0 1 1 0 1 1 1 0 0 0 1 0 1 0 1 1 1 1 0 0 0 0 1 1 0
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 0 0 1 1 1 1 1 0 1 0 0 0 1 0 1 0 1
## [75] 1 1 1 1 1 1 1 0 1 0 1 1 0 0 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 0 0 1 1 1 0
## [112] 1 1 0 0 1 1 0 1 1 1 1 1 1 1 0 1 1 1 0 1 0 0 0 1 1 0 0 0 1 1 1 1 1 0 1 1 1
## [149] 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 0 0 1 1 1 0 1 0 1 0 1 1
## [186] 1 1 1 0 0 1 1 0 1 0 1 1 1 1 1
What does the following code do?
## [1] 62
## [1] 138
prop.table() on table()## y
## 0 1
## 0.31 0.69
## [1] 0.31
## [1] 0.69
## [1] 62
## [1] 138
## [1] 1
## [1] 0
set.seed(123)
participants <- sample(c("Female", "Male", "Child"), size = 120, replace = TRUE)
head(participants, 20) # displays the first 20 elements of the sample## [1] "Child" "Child" "Child" "Male" "Child" "Male" "Male" "Male"
## [9] "Child" "Female" "Male" "Male" "Female" "Male" "Child" "Female"
## [17] "Child" "Child" "Female" "Female"
## participants
## Child Female Male
## 40 36 44
## [1] "Child" "Child" "Child" "Male" "Child" "Male" "Male" "Male"
## [9] "Child" "Female" "Male" "Male" "Female" "Male" "Child" "Female"
## [17] "Child" "Child" "Female" "Female" "Female" "Female" "Child" "Male"
## [25] "Child" "Male" "Female" "Male" "Child" "Male" "Female" "Child"
## [33] "Child" "Female" "Child" "Male" "Female" "Child" "Female" "Female"
## [41] "Male" "Child" "Child" "Female" "Child" "Female" "Child" "Male"
## [49] "Female" "Male" "Female" "Female" "Child" "Female" "Male" "Female"
## [57] "Female" "Child" "Female" "Male" "Female" "Child" "Female" "Child"
## [65] "Male" "Child" "Male" "Male" "Child" "Male" "Male" "Child"
## [73] "Child" "Female" "Male" "Male" "Female" "Male" "Female" "Female"
## [81] "Male" "Child" "Child" "Female" "Male" "Female" "Male" "Female"
## [89] "Child" "Child" "Male" "Child" "Female" "Male" "Male" "Child"
## [97] "Male" "Female" "Child" "Child" "Child" "Male" "Male" "Child"
## [105] "Female" "Female" "Child" "Male" "Male" "Male" "Male" "Male"
## [113] "Male" "Child" "Male" "Female" "Male" "Male" "Male" "Child"
Count of females
## [1] 36
Proportion of females
## [1] 0.3
From our survey
Gender <- c("Female", "Male", "Male", "Male", "Female", "Female", "Male",
"Male", "Female", "Female", "Male", "Male", "Male", "Male", "Female",
"Female", "Male", "Female", "Female", "Male", "Female", "Male",
"Male", "Male", "Male", "Female", "Femaale", "Male", "Female",
"Male", "Male", "Female", "Male", "Female", "Malle", "Male", "Female",
"Female", "Femmale", "Female", "Female", "Female", "Female", "Male",
"Male", "Male", "Male")
# counts
sum(Gender == "Female")## [1] 20
## [1] 27
## [1] 0.4255319
## [1] 0.5744681
## Gender
## Femaale Female Femmale Male Malle
## 1 20 1 24 1
## Gender
## Femaale Female Femmale Male Malle
## 0.0212766 0.4255319 0.0212766 0.5106383 0.0212766
Gender[Gender == "Femaale"] <- "Female"
Gender[Gender == "Femmale"] <- "Female"
Gender[Gender == "Malle"] <- "Male"
# checking if the Gender variable has only 2 classes
table(Gender)## Gender
## Female Male
## 22 25
##
## Female Male
## 10 10
Mimicking the LUDO game
##
## 1 2 3 4 5 6
## 1697 1688 1706 1594 1665 1650
##
## 1 2 3 4 5 6
## 0.1649 0.1663 0.1717 0.1692 0.1669 0.1610
## [1] 3 2 5
## [1] 3 2 5
## [1] 1 2 -1
## [1] -1 -2 1
## [1] 2 0 6
## [1] 2 0 6
## [1] 0.5 0.0 1.5
## [1] 2.0000000 Inf 0.6666667
k <- 1/2 # scalar
v <- c(pi, 0, 1, 4)
w <- c(0, pi, pi, 0, 0)
# product of scalar and vector
k*v; v*k## [1] 1.570796 0.000000 0.500000 2.000000
## [1] 1.570796 0.000000 0.500000 2.000000
## [1] 1.772454 0.000000 1.000000 2.000000
## Warning in v + w: longer object length is not a multiple of shorter object
## length
## [1] 3.141593 3.141593 4.141593 4.000000 3.141593
## [1] 6.283185 0.000000 2.000000 8.000000
## Warning in v^w: longer object length is not a multiple of shorter object length
## [1] 1 0 1 1 1
Matrices are two dimensional data set with columns and rows.
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 6 11 16 21
## [2,] 2 7 12 17 22
## [3,] 3 8 13 18 23
## [4,] 4 9 14 19 24
## [5,] 5 10 15 20 25
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 2 3 4 5
## [2,] 6 7 8 9 10
## [3,] 11 12 13 14 15
## [4,] 16 17 18 19 20
## [5,] 21 22 23 24 25
Exercise:
Define the following matrix in R
$$
\begin{pmatrix}
1 & 0 & 0 & 1\\
0 & 1 & 1 & 1\\
0 & 0 & 1 & 1\\
\end{pmatrix}
$$
\[ \begin{pmatrix} 1 & 0 & 0 & 1\\ 0 & 1 & 1 & 1\\ 0 & 0 & 1 & 1\\ \end{pmatrix} \]
# define the matrix above
v <- c(1, rep(0, 2), 1, 0, rep(1, 3), rep(0, 2), rep(1, 2))
A = matrix(v, 3, byrow = TRUE)
# displaying A
print(A)## [,1] [,2] [,3] [,4]
## [1,] 1 0 0 1
## [2,] 0 1 1 1
## [3,] 0 0 1 1
## [1] 3 4
Define a matrix using the dim() function which stands
for dimension of a given matrix.
## [,1] [,2] [,3] [,4]
## [1,] 1 0 0 1
## [2,] 0 1 1 1
## [3,] 0 0 1 1
Define a 0 matrix
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [39] 0 0 0 0 0 0 0 0 0 0
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 0 0 0 0 0 0 0 0
## [2,] 0 0 0 0 0 0 0 0
## [3,] 0 0 0 0 0 0 0 0
## [4,] 0 0 0 0 0 0 0 0
## [5,] 0 0 0 0 0 0 0 0
## [6,] 0 0 0 0 0 0 0 0
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 0 0 0 0 0 0 0 0
## [2,] 0 0 0 0 0 0 0 0
## [3,] 0 0 0 0 0 0 0 0
## [4,] 0 0 0 0 0 0 0 0
## [5,] 0 0 0 0 0 0 0 0
## [6,] 0 0 0 0 0 0 0 0
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 0 0 0 0 0 0 0 0
## [2,] 0 0 0 0 0 0 0 0
## [3,] 0 0 0 0 0 0 0 0
## [4,] 0 0 0 0 0 0 0 0
## [5,] 0 0 0 0 0 0 0 0
## [6,] 0 0 0 0 0 0 0 0
## [,1] [,2] [,3]
## [1,] 1 5 4
## [2,] 0 2 2
## [3,] 2 1 0
## [,1] [,2] [,3]
## [1,] 2 3 0
## [2,] 5 1 1
## [3,] 2 1 1
We can also construct a matrix from vectors \(M = (v_1, v_2, v_3)\) using the
cbind and rbind functions.
## v1 v2 v3
## [1,] 1 5 4
## [2,] 0 2 2
## [3,] 2 1 0
## [,1] [,2] [,3]
## v1 1 0 2
## v2 5 2 1
## v3 4 2 0
## [1] "matrix" "array"
## [1] "matrix" "array"
dim function! dim is also called to check the dimension of a matrix,
a data frame or an array.
M3 <- c(1, 5, 4, 0, 2, 2, 2, 1, 0)
dim(M3) <- c(3, 3) # sets the dimensions of M3
dim(M3) # shows the dimensions of M3## [1] 3 3
## [,1] [,2] [,3]
## [1,] 1 0 2
## [2,] 5 2 1
## [3,] 4 2 0
## [1] "matrix" "array"
## [,1] [,2] [,3]
## [1,] 1 0 2
## [2,] 5 2 1
## [3,] 4 2 0
## [,1] [,2] [,3]
## [1,] 3 8 4
## [2,] 5 3 3
## [3,] 4 2 1
## [,1] [,2] [,3]
## [1,] -1 2 4
## [2,] -5 1 1
## [3,] 0 0 -1
# number of columns in A: dim(A)[2], or ncol(A).
# number of rows in A: dim(A)[1], or nrow(A)
dim(A)[2] == ncol(A)## [1] TRUE
## [1] TRUE
## [,1] [,2] [,3]
## [1,] 35 12 9
## [2,] 14 4 4
## [3,] 9 7 1
## [,1] [,2] [,3]
## [1,] -1 2.0 1
## [2,] 2 -4.0 -1
## [3,] -2 4.5 1
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
\[ \begin{cases} 2x + 2y &= 0\\ x + 3y &= 2 \end{cases} \]
The matrix of the equation system is: \(\displaystyle A =
\begin{pmatrix}2&2\\1&3\end{pmatrix}\) and the right hand
side of the equation is \(b =
\begin{pmatrix}4\\4\end{pmatrix}\). We can use the
solve function to have the solutions.
## [1] -1 1
## [,1] [,2] [,3]
## [1,] 4 -8.0 -1
## [2,] -5 10.5 5
## [3,] -2 4.5 2
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
Requirements:
A should be a square matrix of dimension \(n\).\[ P_A(\lambda) = \det(A-\lambda I_n) = 0,\quad n\in \mathbb{N}. \]
## eigen() decomposition
## $values
## [1] 4.7664355 -1.4836116 -0.2828239
##
## $vectors
## [,1] [,2] [,3]
## [1,] -0.8535725 -0.3668743 0.2177685
## [2,] -0.3052279 -0.4631774 -0.6431613
## [3,] -0.4221966 0.8067651 0.7341120
## [1] 4.7664355 -1.4836116 -0.2828239
## [,1] [,2] [,3]
## [1,] 4.7664 0.0000 0.0000
## [2,] 0.0000 -1.4836 0.0000
## [3,] 0.0000 0.0000 -0.2828
Example:
## [1] 3 2
## [,1] [,2]
## [1,] 0.7071068 1
## [2,] 0.7071068 0
Exercise for tomorrow
dtadta. Save it in avgdta by removing the average of each column. Name
it it scaled_dtat(scaled_dta) %*% scaled_dtaArrays are data type with more than two dimensions
## , , 1
##
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
##
## , , 2
##
## [,1] [,2] [,3] [,4]
## [1,] 13 16 19 22
## [2,] 14 17 20 23
## [3,] 15 18 21 24
## [1] "array"
An example of array is NetCDF data with for instance: * Longitude as column names (\(n\)) * Latitude as row names (\(p\)) * 3rd dimension could the time. For each time, we have a \(n\times p\) matrix.
## [1] 3 4 2
## [1] 13
The dimension: row position, column position, matrix level
A list is a collection of object of different types. The sizes of elements could be different.
## $A
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0 0 0 0 0 0
## [2,] 0 0 0 0 0 0
## [3,] 0 0 0 0 0 0
## [4,] 0 0 0 0 0 0
## [5,] 0 0 0 0 0 0
## $A
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0 0 0 0 0 0
## [2,] 0 0 0 0 0 0
## [3,] 0 0 0 0 0 0
## [4,] 0 0 0 0 0 0
## [5,] 0 0 0 0 0 0
##
## $my_seq
## [1] 7 10 8 6 4
## $A
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0 0 0 0 0 0
## [2,] 0 0 0 0 0 0
## [3,] 0 0 0 0 0 0
## [4,] 0 0 0 0 0 0
## [5,] 0 0 0 0 0 0
##
## $my_seq
## [1] 7 10 8 6 4
##
## $Bool
## [1] TRUE
## $A
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0 0 0 0 0 0
## [2,] 0 0 0 0 0 0
## [3,] 0 0 0 0 0 0
## [4,] 0 0 0 0 0 0
## [5,] 0 0 0 0 0 0
##
## $my_seq
## [1] 7 10 8 6 4
##
## $Bool
## [1] TRUE
##
## $iris
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## [1] 4
## $A
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0 0 0 0 0 0
## [2,] 0 0 0 0 0 0
## [3,] 0 0 0 0 0 0
## [4,] 0 0 0 0 0 0
## [5,] 0 0 0 0 0 0
##
## $my_seq
## [1] 7 10 8 6 4
##
## $Bool
## [1] TRUE
##
## $iris
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
##
## [[5]]
## [1] "I am learning R"
Getting the names of element of L?
## [1] "A" "my_seq" "Bool" "iris" ""
Renaming the 5th element of L?
## $A
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0 0 0 0 0 0
## [2,] 0 0 0 0 0 0
## [3,] 0 0 0 0 0 0
## [4,] 0 0 0 0 0 0
## [5,] 0 0 0 0 0 0
##
## $my_seq
## [1] 7 10 8 6 4
##
## $Bool
## [1] TRUE
##
## $iris
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
##
## $String
## [1] "I am learning R"
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0 0 0 0 0 0
## [2,] 0 0 0 0 0 0
## [3,] 0 0 0 0 0 0
## [4,] 0 0 0 0 0 0
## [5,] 0 0 0 0 0 0
## [1] "matrix" "array"
## NULL
A data frame is a table of \(n\) number of rows (observations) and \(p\) number of columns (features or variables). Variables can be of any data type.
set.seed(123)
n <- 120
df <- data.frame(
"ID" = paste0("Particip_", 1:n),
"age" = sample(0:120, size = n)
)
head(df, 10)brks <- seq(0, 120, by = 10)
df$age_groups <- cut(df$age, breaks = brks, include.lowest = TRUE)
head(df, 20)Exercise:
##
## [0,10] (10,20] (20,30] (30,40] (40,50] (50,60] (60,70] (70,80]
## 11 10 10 10 10 9 10 10
## (80,90] (90,100] (100,110] (110,120]
## 10 10 10 10
barplot of age groupsx <- barplot(counts, col = "purple", main = "Distribution of ages", xaxt="n")
text(x = x, y = -1.5, cex = 0.8, xpd=TRUE, srt=45, labels = names(counts))## [1] "Particip_2" "Particip_21" "Particip_28" "Particip_29" "Particip_31"
## [6] "Particip_37" "Particip_52" "Particip_83" "Particip_91" "Particip_97"
## [1] "Particip_2" "Particip_21" "Particip_28" "Particip_29" "Particip_31"
## [6] "Particip_37" "Particip_52" "Particip_83" "Particip_91" "Particip_97"
## [1] "[0,10]" "(10,20]" "(20,30]" "(30,40]" "(40,50]" "(50,60]"
## [7] "(60,70]" "(70,80]" "(80,90]" "(90,100]" "(100,110]" "(110,120]"
## [1] "Particip_2" "Particip_21" "Particip_28" "Particip_29" "Particip_31"
## [6] "Particip_37" "Particip_52" "Particip_83" "Particip_91" "Particip_97"
data.frame() functionNA)## [1] 1 4 6
## [1] 3 5 7
# equal to 2.
# which(x != NA) wrong way to check for non-missing values
which(!is.na(x)) # means: which of the elements of x are not missing## [1] 2 3 5 7
## [1] NA NA NA
x[is.na(x)] <- mean(x[which(!is.na(x))]) # Good but could be shorter
x[is.na(x)] <- mean(x, na.rm = TRUE)
print(x)## [1] 2.3875 1.0000 2.0000 2.3875 3.0000 2.3875 3.5500
## [1] "2" "1" "2" "7" "$8" "3" "2.5" "9" "2,7"
## [1] "character"
## Warning: NAs introduced by coercion
## [1] 2.0 1.0 2.0 7.0 NA 3.0 2.5 9.0 NA
id <- which(is.na(y))
x[id[1]] <- 8
x[id[2]] <- 2.7
y <- as.numeric(x)
# gsub
x <- c(2, 1, 2, ",7", "$8", 3, "&2.5", 9, "2,7")
x <- gsub("\\$|\\&", "", x)
print(x)## [1] "2" "1" "2" ",7" "8" "3" "2.5" "9" "2,7"
## [1] 2.0 1.0 2.0 0.7 8.0 3.0 2.5 9.0 2.7
A number outside the following intervall is an outlier
\[[Q_1 - 1.5\times IQR, Q_3 + 1.5\times IQR]\]
where
x <- c(19.103, 19.632, 22.494, 20.113, 10.5, 22.744, 20.737, 17.976,
18.901, 192.87, 21.959, 0.001, 20.641, 20.177, 19.111, 22.859,
207.97, 16.853, 21.122, 19.244, 18.291, 19.651, 18.358, 18.834, 19)
boxplot(x_copy <- x)Here
Q <- as.numeric(quantile(x, probs = c(0.25, 0.5, 0.75)))
d <- as.data.frame(t(Q))
names(d) <- paste0("Q", 1:length(Q))
rownames(d) <- "stats"
d$IQR <- d$Q3 - d$Q1
d$LowerLim <- d$Q1 - 1.5*d$IQR
d$UpperLim <- d$Q3 + 1.5*d$IQR
dChecking for all elements of x if they are outliers or not.
Identifying outliers:
d_out table## [1] 10.500 192.870 0.001 207.970
boxplot() function## $stats
## [,1]
## [1,] 16.853
## [2,] 18.834
## [3,] 19.632
## [4,] 21.122
## [5,] 22.859
##
## $n
## [1] 25
##
## $conf
## [,1]
## [1,] 18.90899
## [2,] 20.35501
##
## $out
## [1] 10.500 192.870 0.001 207.970
##
## $group
## [1] 1 1 1 1
##
## $names
## [1] "1"
The object bxpt is a list of 6 elements with a column
named out which stands for outliers (this is
the vector of outliers). You all know how to identify and element in a
list using $out or [["out"]].
## [1] 10.500 192.870 0.001 207.970
## [1] 10.500 192.870 0.001 207.970
## [1] 10.500 192.870 0.001 207.970
Now, we can consider all the outliers as missing values
(NAs). First, we need to locate them.
# find all outliers in x
x[d_out$outlier] <- NA # or x[x %in% bxpt$out]
x[d_out$outlier] <- mean(x, na.rm = TRUE)# boxplot of imputed x
par(mfrow = c(1, 2))
boxplot(x_copy, main = "Before\nhandling outliers", cex.main = 0.7)
boxplot(x, main = "After\nhandling outliers", col = "green", cex.main = 0.7)As you can see, there is no more outlier.
frequencies <- c(Female = 35, Children = 25, Male = 40)
barplot(frequencies, cex.names=0.8,
names.arg = names(frequencies), # optional
horiz = FALSE, col = c("purple", "salmon1", "pink"))The argument cex.names reduces the size of x-labels. Low
values, say cex.names=0.6, forces R to show all the
labels.
par(mfrow = c(1, 2))
pie(frequencies, cex = 0.5,
labels = paste0(names(frequencies), " (", frequencies, ")"),
col = c("purple", "salmon1", "pink"))
pie(frequencies, cex = 0.5,
labels = paste0(names(frequencies), " (", frequencies, ")"),
col = rainbow(length(frequencies)))set.seed(12092024)
x <- sample(1:120, size = 1000, replace = TRUE)
par(mfrow = c(1,2))
hist(x, probability = TRUE) # use probability = TRUE to have densities
# instead of counts (frequencies)
# Density plots
plot(density(x), col = "salmon", main = "density of x")x <- seq(0, 2*pi, le = 50)
y <- sin(x)
# z <- cos(x)
# tg <- tan(x)
plot(x, y, pch = 16, col = "blue")Exercise: Make a scatter plot of
Petal.Length and Petal.Width colored by
Species using iris data frame.
f <- Petal.Length ~ Petal.Width
lr <- lm(f, data = iris)
plot(f, data = iris, col = Species, pch = 16)
abline(lr, col = "blue", lwd = 2, lty = 3)
legend("topleft", legend = unique(iris$Species), col = unique(iris$Species), pch = 16, bty = "n")The p.d.f of the uniform distribution is:
\[ f_X(x) = \begin{cases} \frac{1}{b-a} & \text{if } x \in[a, b] \\ 0 & \text{else}\end{cases} \]
set.seed(123) # just for reproducibility
x <- runif(1000)
hist(x, probability = TRUE, xlab = NULL,
main = "Histogram of a uniform distribution", col = "turquoise")
# using the lines function draw a line on top of the existing histogram
lines(density(x), col = "red", lwd = 2, lty = 3)
# adding a vertical line that represents the mean (average)
abline(v = mean(x), col = "blue", lty = 2, lwd = 2)
# adding the legend to our plot
legend("topleft", lty = c(2, 2),
col = c("red", "blue"), legend = c("Density", "Average"))## [1] 0.4972778
## [1] 0.082647
## [1] 0.08256435
## [1] 0.5781153
Let’s test if the uniform distribution is normal.
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: x
## D = 0.50019, p-value < 2.2e-16
## alternative hypothesis: two-sided
# is the distribution uniform?
ks.test(x, "punif") # check if the p-value > 0.05. If yes, x is uniform.##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: x
## D = 0.014051, p-value = 0.9891
## alternative hypothesis: two-sided
A given variable \(X\) follows a binomial distribution of parameters \(n\) and \(p\) if: \[ P(X = k) = {n \choose k}p^k(1-p)^{n-k}, \quad k\in\{0, 1\}. \]
rbinom_dist <- rbinom(n = 10000, size = 1, 0.3)
hist(rbinom_dist, probability = TRUE, main = "Histogram of a binomial distribution",
col = "turquoise", breaks = 20, xlab = NULL)
lines(density(rbinom_dist), col = "red", lwd = 2, lty = 2)
abline(v = mean(rbinom_dist), col = "blue", lty = 2, lwd = 2) # vertical line
legend("topright", lty = c(2, 2),
col = c("red", "blue"), legend = c("Density", "Average"), bty = "n")A histogram is not appropriate for this distribution. We use the bar charts or pie charts instead.
The probability density function for a normal distribution with parameters \(\mu\) and \(\sigma^2\) is given by:
\(\LaTeX\) code:
f_X(x) = \frac{1}{\sigma^2\sqrt{2\pi}}\exp\left\{-\frac{1}{2\sigma^2}\left(x -\mu\right)^2\right\}, \quad x\in R.
\[ f_X(x) = \frac{1}{\sigma^2\sqrt{2\pi}}\exp\left\{-\frac{1}{2\sigma^2}\left(x -\mu\right)^2\right\}, \quad x\in R. \]
set.seed(13092024)
gauss_dist <- rnorm(1000, mean = 0, sd = 1)
hist(gauss_dist, probability = TRUE, breaks = 30, xlab = NULL,
main = "Histogram of standard normal\ndistribution", col = "turquoise")
lines(density(gauss_dist), col = "red", lwd = 2, lty = 2)
abline(v = mean(gauss_dist), col = "blue", lty = 2, lwd = 2)
legend("topleft", lty = c(2, 2),
col = c("red", "blue"), legend = c("Density", "Average"), bty = "n")##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: gauss_dist
## D = 0.019659, p-value = 0.8343
## alternative hypothesis: two-sided
# is the distribution uniform?
ks.test(gauss_dist, "punif") # check if the p-value > 0.05. If yes, x is uniform.##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: gauss_dist
## D = 0.49381, p-value < 2.2e-16
## alternative hypothesis: two-sided
## [1] 0.01907579
## [1] 0.9686031
Equation:
\[y_i = \alpha + \beta x_i + \varepsilon_i, \quad i=1, 2, \ldots, n\] where \(\varepsilon_i \sim \mathcal{N}(0, 1)\). The coefficients \(\alpha\) and \(\beta\) are to be estimated.
Data simulation
set.seed(123)
x <- rnorm(1000, mean = 5, sd = 1.5)
y <- rnorm(1000) # another normal distribution.
error_term <- rnorm(1000)
z <- 4 + 8*x + error_term # linear dependence between x and z (this is
# just a simulation. In real-world data analysis the relationships between
# variables are not known in advance).
# construction of data frame from x and z
head(df <- data.frame(x, z), 10)Simple Linear Model
# fitting the model
LR <- lm(z ~ x, data = df)
# the estimated coefficients
coefs <- coefficients(LR)
# summary
summary(LR)##
## Call:
## lm(formula = z ~ x, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8254 -0.6397 -0.0310 0.6588 3.4195
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.04376 0.10906 37.08 <2e-16 ***
## x 7.98729 0.02082 383.72 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9787 on 998 degrees of freedom
## Multiple R-squared: 0.9933, Adjusted R-squared: 0.9933
## F-statistic: 1.472e+05 on 1 and 998 DF, p-value: < 2.2e-16
Estimated model: \(\hat{y}=\) 4.0437603 + 7.987287\(x\). For a new value of \(x\), say \(x =
15\), estimated value for \(y\)
should be: \(\hat{y}=\) 4.0437603 +
4.0437603\(\times 15\) = 123.8530646.
We use the predict() function to make predictions from our
model.
# Prediction
new_data <- data.frame(x = 15)
4 + 8*new_data$x # is the theoretical value of y for x = 15## [1] 124
## 1
## 123.8531
# plot of the predicted value and existing ones
plot(predicted ~ x, data = new_data, col = "darkgreen", xlim = c(0, 15),
ylim = c(0, new_data$predicted), pch = 16, ylab = "z")
points(z ~ x, data = df, col = "darkorange", pch = 20)
legend("topleft", c("Actual", "Predicted"), col = c("darkorange", "darkgreen"),
pch = c(20, 16), bty = "n")
abline(LR, col = "blue")
abline(v = new_data$x, h = new_data$predicted, col = "gray", lty = 3)par(mfrow = c(1, 2))
plot(x, y, main = "Scatter plot of x and y", col = "blue", pch = 16); grid()
plot(x, z, main = "Scatter plot of x and z", col = "blue", pch = 16)
abline(LR, col = "red")## [1] 0.9966282
## [1] 0.08647944
set.seed(13092024)
exp_dist <- rexp(1000, rate = 1)
hist(exp_dist, probability = TRUE, xlab = NULL,
main = "Histogram of exponential distribution", col = "turquoise")
lines(density(exp_dist), col = "red", lwd = 2, lty = 2)
abline(v = mean(exp_dist), col = "blue", lty = 2, lwd = 2)
legend("topright", lty = c(2, 2),
col = c("red", "blue"), legend = c("Density", "Average"))set.seed(13092024)
pois_dist <- rpois(1000, lambda = 2.5)
hist(pois_dist, probability = TRUE, main = "Histogram of poisson distribution",
col = "turquoise", xlab = NULL)
lines(density(pois_dist), col = "red", lwd = 2, lty = 2)
abline(v = mean(pois_dist), col = "blue", lty = 2, lwd = 2)
legend("topright", lty = c(2, 2),
col = c("red", "blue"), legend = c("Density", "Average"))if /
elseif (condition/Boolean expression){
## code to be executed
}
## [1] TRUE
## [1] TRUE
# alternative 2
x <- c(3, 4, 6, 7, 0, 4)
ifelse(x < 4, T, F) # this is a vectorized function (can be applied on vectors)## [1] TRUE FALSE FALSE FALSE TRUE FALSE
We can embed if to if and
else.
Exercise: write a if statement to check for each value of marks if they fall in the following categories
m < 60 or marks[i] < 60.marks directlycategory <- character(0)
for (m in marks){
if (m >= 0 & m < 60){
cat("Fail ")
category <- c(category, "Fail") # category.append("Fail") from python
} else if (m >= 60 & m < 70){
cat("Pass ")
category <- c(category, "Pass")
} else if (m >= 70 & m < 80){
cat("Good Pass ")
category <- c(category, "Good Pass")
} else if (m >= 80 & m < 85){
cat("Very Good Pass ")
category <- c(category, "Very Good Pass")
} else {
cat("Distinction ")
category <- c(category, "Distinction")
}
}## Fail Fail Very Good Pass Pass Pass Distinction Pass Fail Fail Fail Good Pass Pass Pass Pass Fail Distinction Pass Fail Good Pass Fail Fail Fail Fail Fail Fail Fail Good Pass Pass Fail Good Pass Pass Fail Good Pass Good Pass Good Pass Good Pass Pass Fail Fail Fail Fail Fail Fail Distinction Good Pass Fail Fail Fail Good Pass Fail
category2 <- character()
for (i in 1:length(marks)){
if(marks[i] >= 0 & marks[i] < 60){
cat("F ")
category2[i] <- "F"
} else if (marks[i] >= 60 & marks[i] < 70){
cat("P ")
category2[i] <- "P"
} else if (marks[i] >= 70 & marks[i] < 80){
cat("GP ")
category2[i] <- "GP"
}else if (marks[i] >= 80 & marks[i] < 85){
cat("VGP ")
category2[i] <- "VGP"
} else {
cat("D ")
category2[i] <- "D"
}
}## F F VGP P P D P F F F GP P P P F D P F GP F F F F F F F GP P F GP P F GP GP GP GP P F F F F F F D GP F F F GP F
ifelse (vectorized function)category3 <- ifelse(marks >= 0 & marks < 60, "Fail",
ifelse (marks >= 60 & marks < 70, "Pass",
ifelse(marks >= 70 & marks < 80, "Good Pass",
ifelse(marks >= 80 & marks < 85, "Very Good Pass", "Distinction"))))
print(category3)## [1] "Fail" "Fail" "Very Good Pass" "Pass"
## [5] "Pass" "Distinction" "Pass" "Fail"
## [9] "Fail" "Fail" "Good Pass" "Pass"
## [13] "Pass" "Pass" "Fail" "Distinction"
## [17] "Pass" "Fail" "Good Pass" "Fail"
## [21] "Fail" "Fail" "Fail" "Fail"
## [25] "Fail" "Fail" "Good Pass" "Pass"
## [29] "Fail" "Good Pass" "Pass" "Fail"
## [33] "Good Pass" "Good Pass" "Good Pass" "Good Pass"
## [37] "Pass" "Fail" "Fail" "Fail"
## [41] "Fail" "Fail" "Fail" "Distinction"
## [45] "Good Pass" "Fail" "Fail" "Fail"
## [49] "Good Pass" "Fail"
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [46] TRUE TRUE TRUE TRUE TRUE
Assignment: Write an R function that will take as an
input a vector of marks and return a data frame of marks
and categories (F, P, GP,
VGP, and D). The function should be able make
some plots (pie, barplot).
for (i in vector){
## code to be executed
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
for (i in 1:10){
# for each element, let's check if it is a multiple of 3
if (i %% 3 == 0){
cat(i, "is a multiple of 3\n")
} else if (i %% 2 == 0) {
cat(i, "is a multiple of 2\n")
} else {
cat(i, "is neither a multiple of 2 nor 3.\n")
}
}## 1 is neither a multiple of 2 nor 3.
## 2 is a multiple of 2
## 3 is a multiple of 3
## 4 is a multiple of 2
## 5 is neither a multiple of 2 nor 3.
## 6 is a multiple of 3
## 7 is neither a multiple of 2 nor 3.
## 8 is a multiple of 2
## 9 is a multiple of 3
## 10 is a multiple of 2
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
Write a for loop that checks each of the first 10 positive integers if it is odd or even.
Using for loop, import all CSV from the
data_files folder.
# checking the working directory
# getwd()
# simple ls like in bash
dir("./data_list/", pattern = ".csv") # list of elements of in a directory## character(0)
# Exercise: write a for loop to import all
# csv files in a list.
(file_names <- dir("./data_list/", pattern = ".csv"))## character(0)
Hints: Importing files from the working directory
string1 <- "." # working directory (root where the script is saved)
string2 <- "folder" # folder in the working directory
string3 <- "subfolder" # sub-folder in folder
paste(string1, string2, string3, sep = "/")## [1] "./folder/subfolder"
## [1] "./folder/subfolder"
while (condition){
## code to be executed
# increment
}
## [1] 0
## [1] 20
Write a program that will tell the user YOU WON! and exit if they get 5 three times on a row.
Write a program that run continuously an ask a user to input a number between 0 and 9 and provide the multiplication table by 2 and asks the user to stop or continue.
Hint: Use the function
readline(prompt = "Enter a number: ") to interact with the
user.
number <- readline(prompt = "Entrer un nombre: ") # conversion is needed.
Syntax of the repeat loop:
# increment i or anything else
i <- 0
repeat{
# execute a code
# increment
i <- i + 1
# stopping criteria
if ( something happens ){
break # repeat until something happens
}
}
## [1] 0
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
applyThe apply() function return a vector or array or list of
values obtained by applying a function to margins of an array or
matrix.
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
## [1] 5.5
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. 4.300000 2.000000 1.000 0.100000
## 1st Qu. 5.100000 2.800000 1.600 0.300000
## Median 5.800000 3.000000 4.350 1.300000
## Mean 5.843333 3.057333 3.758 1.199333
## 3rd Qu. 6.400000 3.300000 5.100 1.800000
## Max. 7.900000 4.400000 6.900 2.500000
sapply:
use ?sapply to check the documentation.## [1] 1 2 3 4
Calculating the mean of each from iris data
coef_disp <- function(x){ # x is a numeric vector
sd(x)/mean(x)
}
get_range <- function(x) diff(range(x))
minmax <- function(x) c(min = min(x), max = max(x))
sapply(Filter(is.numeric, iris), get_range)## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 3.6 2.4 5.9 2.4
sapply(Filter(is.numeric, iris), function(x) c(min = min(x),
max = max(x),
std_dev = sd(x),
Q1 = quantile(x, 0.25),
Med = quantile(x, .5),
Q3 = quantile(x, .75),
disp = coef_disp(x)))## Sepal.Length Sepal.Width Petal.Length Petal.Width
## min 4.3000000 2.0000000 1.0000000 0.1000000
## max 7.9000000 4.4000000 6.9000000 2.5000000
## std_dev 0.8280661 0.4358663 1.7652982 0.7622377
## Q1.25% 5.1000000 2.8000000 1.6000000 0.3000000
## Med.50% 5.8000000 3.0000000 4.3500000 1.3000000
## Q3.75% 6.4000000 3.3000000 5.1000000 1.8000000
## disp 0.1417113 0.1425642 0.4697441 0.6355511
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5.849664 3.057333 3.758000 1.199333
The sapply function can also return a list if the
outputs are not of the same length.
## $Sepal.Length
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.300 5.100 5.800 5.843 6.400 7.900
##
## $Sepal.Width
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.800 3.000 3.057 3.300 4.400
##
## $Petal.Length
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.600 4.350 3.758 5.100 6.900
##
## $Petal.Width
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.100 0.300 1.300 1.199 1.800 2.500
##
## $Species
## setosa versicolor virginica
## 50 50 50
df <- data.frame(replicate(10, rnorm(1000)))
L <- as.list(df) # converting data frame to list.
sapply(L, avg)## X1 X2 X3 X4 X5 X6
## 0.022972504 0.034160787 -0.024801530 -0.008415118 -0.030320654 0.039747511
## X7 X8 X9 X10
## -0.024418261 -0.007606869 0.026690273 -0.048200721
## [1] 1 4 9 16 25 36 49 64 81 100
lapply:The lapply() function returns a list of the same length
as X, each element of which is the result of applying FUN to the
corresponding element of X
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5.843333 3.057333 3.758000 1.199333
tapply:
check the documentation using ?tapply## [1] TRUE
## setosa versicolor virginica
## 5.006 5.936 6.588
vapply:
check the documentationvapply(X = as.list(iris[-5]), quantile, FUN.VALUE =
c("0%" = 0, "25%" = 0, "50%" = 0, "75%" = 0, "100%" = 0))## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 0% 4.3 2.0 1.00 0.1
## 25% 5.1 2.8 1.60 0.3
## 50% 5.8 3.0 4.35 1.3
## 75% 6.4 3.3 5.10 1.8
## 100% 7.9 4.4 6.90 2.5
Syntax to write/define a function in R:
function_name <- function(arg1, arg2, ...){
# code to be executed
}
toss_coin <- function(){
face <- sample(6, size = 1)
return(face)
}
faces <- c()
for (i in 1:100) {
faces[i] <- toss_coin()
}## [1] 2
x as argument and
detects NA then replaces them by the mean## NULL
A package is a well documented collection of functions, compiled code and data sets. Packages are created to make specific functionality easy.
If a package is not in the installed.packages(), matrix
of installed packages, one can install it using the command
install.packages("package_name").
Notice that the matrix of installed packages has a column named
Package that is easily accessible by the command
installed.packages()[,"Package"] or
rownames(installed.packages()).
all_packages <- installed.packages()[,"Package"]
head(all_packages, 10) # only displaying the first 20 packages by alp. order## abind AER aion arkhe ash
## "abind" "AER" "aion" "arkhe" "ash"
## AsioHeaders askpass assertthat backports base
## "AsioHeaders" "askpass" "assertthat" "backports" "base"
Having the list of all packages, we can check if a package is in it.
## [1] FALSE
It looks like the package is not installed yet. Using if
control-flow, we can check if a package is missing and then install it
using the install.package() function.
if (!"pacman" %in% all_packages) {
install.packages("pacman", repos = "http://cran.us.r-project.org")
}##
## The downloaded binary packages are in
## /var/folders/x6/rdmyg9yd5cq432r1z8p90p6r0000gn/T//Rtmp9V874j/downloaded_packages
library() function from the base
package.library(pacman)
# library(tidyverse)
plist <- c("kairos", "ggplot2", "tidyverse", "tidyr", "dplyr", "stringr")
pacman::p_load(plist, character.only = TRUE)
# library(mice)R is displaying messages when loading the tidyverse package. You would not want to have it displayed in your report.
Do the following setting
{r warning=FALSE, message=FALSE}
To get the documentation for a specific package that you already
installed. Use the command
help(package = "the_package_name")
Let’s get help for the pacman package
We can have the entire documentation displayed in
File | Plots | Packages | Help pane.
To access all the functions and data from a given package, we need to
load it in R using the library(the_package) or
require(the_package). The pacman package give
more flexibility by loading a list of packages and if there any on the
list that is not install, pacman does the installation for
you.
The command to load a list of packages with pacman is as
follows:
The iris data set exist already in the R environment. We
can import data in R from different sources:
# access iris data
data("iris")
data("spam", package = "kernlab")
# displaying the first 6 rows
# help(iris)
# data structure
str(iris)## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
## [1] 5
## [1] 5
Checking the classes of all variables in the a given data frame
library(tictoc)
dtypes <- c()
tic() # you can also use start <- Sys.time()
for ( i in 1:ncol(iris)){
dtypes[i] <- class(iris[[i]])
}
toc() # and end <- Sys.time()## 0.001 sec elapsed
## [1] "numeric" "numeric" "numeric" "numeric" "factor"
## [1] "numeric" "factor"
library function.## 'data.frame': 4601 obs. of 10 variables:
## $ make : num 0 0.21 0.06 0 0 0 0 0 0.15 0.06 ...
## $ address : num 0.64 0.28 0 0 0 0 0 0 0 0.12 ...
## $ all : num 0.64 0.5 0.71 0 0 0 0 0 0.46 0.77 ...
## $ num3d : num 0 0 0 0 0 0 0 0 0 0 ...
## $ our : num 0.32 0.14 1.23 0.63 0.63 1.85 1.92 1.88 0.61 0.19 ...
## $ over : num 0 0.28 0.19 0 0 0 0 0 0 0.32 ...
## $ remove : num 0 0.21 0.19 0.31 0.31 0 0 0 0.3 0.38 ...
## $ internet: num 0 0.07 0.12 0.63 0.63 1.85 0 1.88 0 0 ...
## $ order : num 0 0 0.64 0.31 0.31 0 0 0 0.92 0.06 ...
## $ mail : num 0 0.94 0.25 0.63 0.63 0 0.64 0 0.76 0 ...
To import a CSV file in R we can use:
read.csv() function from base packageNotice that the column Species is seen as
character. We can force the conversion by setting the
argument stringsAsFactors to TRUE in
read.csv().
read_csv() function from readr package
already loaded together with tidyverse.## Rows: 150 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Species
## dbl (4): Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
! Always check the errors, warnings and messages to make your report look good.
import function from rio package
that can detect file extension and load it.##
## Attaching package: 'rio'
## The following object is masked from 'package:dimensio':
##
## export
## The following object is masked from 'package:aion':
##
## convert
Exercise: import all the csv files in a list using a for loop.
data_list <- list() # creating an empty list.
# check the files names in data/csv
dir("./data/csv/")## character(0)
%>% from magrittr package or
*|> from base package.
Library: tidyverse or dplyr
Shortcut: Crtl + Shift + M
Why is it useful?
f(g(h(x))) is equivalent to
x %>% h() %>% g() %>% f()
Gender %>% table() %>% data.frame() %>%
rename("Gender" = 1, Count = Freq) %>%
mutate(Rel_freq = round(Count/sum(Count), 2)) %>%
# using ggplot to plot
ggplot(aes(x = Gender, y = Rel_freq)) + geom_col()Instead of
When we load tidyverse, we load extra packages like:
broom, conflicted, cli,
dbplyr, dplyr, dtplyr,
forcats, ggplot2, googledrive,
googlesheets4, haven, hms,
httr, jsonlite, lubridate,
magrittr, modelr, pillar,
purrr, ragg, readr,
readxl, reprex, rlang,
rstudioapi, rvest, stringr,
tibble, tidyr and xml2.
So, it is not necessary to load tidyverse and then load
any of the dependencies above.
It is part of tidyverse package that allows user to
perform data manipulation easily using grammar which makes it simpler to
use and with the use of pipe, the code is cleaner.
The most used functions from dplyr are:
select: Returns a subset of columns from a data
framefilter: Extracts a subset of rows from a data frame
based on logical conditionsarrange: Reorders rows of a data frame according to a
variable or columnrename: Makes it easier to rename variables in a data
framemutate: Computes transformations of variables in a data
framesummarize: Collapses a group into a single rowlibrary(reshape2)
library(ggplot2)
view1 <- iris %>% group_by(Species) %>%
summarise_if(is.numeric, list(mean = mean, sd = sd)) %>%
# using reshape2 package to reshape from wide to long format
# the tidyr::pivot_longer function can also be used
melt(id.vars = "Species") %>%
separate(variable, into = c("variable", "metric"), sep = "_")
view1Now if we want to have mean and sd as
columns, the tidyr::pivot_wider can be used.
view2 <- view1 %>%
pivot_wider(names_from = metric, values_from = value) %>%
mutate(std_error = sd/sqrt(nrow(iris)))
view2view2 %>% ggplot() +
geom_bar(aes(x = Species, y = mean), stat="identity", fill="forestgreen", alpha = 0.5) +
geom_errorbar(aes(x=Species,
ymin = mean - std_error,
ymax = mean + std_error),
width = 0.4, colour="orange",
alpha = 0.9, linewidth = 0.5) +
facet_wrap(~variable, ncol = 4)library(broom)
library(tidyverse)
library(flextable)
tidy(LR) %>% select(-5) %>% flextable() %>% autofit() %>% theme_box()term | estimate | std.error | statistic |
|---|---|---|---|
(Intercept) | 4.043760 | 0.10906295 | 37.0773 |
x | 7.987287 | 0.02081526 | 383.7227 |
officeriris %>%
ggplot(aes(y = Petal.Length, fill = Species)) +
geom_boxplot() +
facet_grid(~Species) +
theme_minimal()library(plotly)
library(ggpubr)
plot_ly(data = iris %>% count(Species),
labels = ~Species, values = ~n, type = "pie")ggpie(data = iris %>% count(Species) %>%
mutate(label = paste0(Species, " (", round(100* n/sum(n), 2), "%)")),
x = "n", label = "label",
color = "white",
fill = "Species", lab.pos = "in")