At the end of the lesson, the students are expected to:
Basic data types are numeric, character and logical.
# Numeric object: How old are you?
my_age <- 18
# Character object: What's your name?
my_name <- "Nicolas"
# logical object: Are you a data scientist?
# (yes/no) <=> (TRUE/FALSE)
is_datascientist <- TRUE
Note that character vectors can be created using double (") or single (') quotes.
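As a small illustration of the two quoting styles (the variable names are hypothetical), a string in double quotes can contain an apostrophe directly, while inside single quotes the apostrophe has to be escaped with a backslash:

```r
# Double quotes: the apostrophe needs no escaping
sentence_double <- "My Friend's name is Jerome"
# Single quotes: escape the apostrophe with a backslash
sentence_single <- 'My Friend\'s name is Jerome'
# Both spellings create the same character string
identical(sentence_double, sentence_single)
sentence_double
```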
## [1] "My Friend's name is Jerome"
It’s possible to use the function class() to see what type a variable is:
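A minimal sketch of such calls, reusing the objects defined earlier (a reconstruction, since the original code is not shown):

```r
my_age <- 18
my_name <- "Nicolas"
# Ask R for the type of each object
class(my_age)
class(my_name)
```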
## [1] "numeric"
## [1] "character"
You can also use the functions is.numeric(), is.character(), is.logical() to check whether a variable is numeric, character or logical, respectively. For instance:
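One possible pair of checks, assuming the objects my_age and my_name from above:

```r
my_age <- 18
my_name <- "Nicolas"
# Is my_age numeric?
is.numeric(my_age)
# Is my_name numeric?
is.numeric(my_name)
```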
## [1] TRUE
## [1] FALSE
If you want to change the type of a variable to another one, use the as.* functions, including: as.numeric(), as.character(), as.logical(), etc.
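A short sketch of a conversion from numeric to character, assuming my_age from above:

```r
my_age <- 18
# The numeric value
my_age
# The same value converted to a character string
as.character(my_age)
```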
## [1] 18
## [1] "18"
Note that converting a character string that does not represent a number (for example "Nicolas") to numeric will output NA (for not available), along with a warning. R doesn’t know how to convert such a character value to a numeric one.
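To illustrate the difference, the sketch below converts one string that looks like a number and one that does not. The variable names are hypothetical, and suppressWarnings() is used only to keep the console quiet:

```r
# A string that represents a number converts cleanly
converted_ok <- as.numeric("18")
# A string that does not represent a number becomes NA (R also warns)
converted_na <- suppressWarnings(as.numeric("Nicolas"))
converted_ok
converted_na
```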
A vector is a combination of multiple values (numeric, character or logical) in the same object. In this case, you can have numeric vectors, character vectors or logical vectors.
Create a vector
A vector is created using the function c() (for concatenate), as follows:
# Store your friends' ages in a numeric vector
friend_ages <- c(27, 25, 29, 26) # Create
friend_ages # Print
## [1] 27 25 29 26
# Store your friends' names in a character vector
my_friends <- c("Nicolas", "Thierry", "Bernard", "Jerome")
my_friends
## [1] "Nicolas" "Thierry" "Bernard" "Jerome"
# Store your friends' marital status in a logical vector
# Are they married? (yes/no <=> TRUE/FALSE)
are_married <- c(TRUE, FALSE, TRUE, TRUE)
are_married
## [1] TRUE FALSE TRUE TRUE
It’s possible to give a name to the elements of a vector using the function names().
## [1] 27 25 29 26
# Vector with element names
names(friend_ages) <- c("Nicolas", "Thierry", "Bernard", "Jerome")
friend_ages
## Nicolas Thierry Bernard Jerome
## 27 25 29 26
You can also create a named vector as follows:
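One way to do this, sketched here as a reconstruction, is to give the names directly inside c():

```r
# Name each element at creation time with name = value pairs
friend_ages <- c(Nicolas = 27, Thierry = 25, Bernard = 29, Jerome = 26)
friend_ages
```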
## Nicolas Thierry Bernard Jerome
## 27 25 29 26
Note that a vector can only hold elements of the same type. For example, you cannot have a vector that contains both characters and numeric values.
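A brief sketch of what happens when types are mixed (the variable names are hypothetical). R does not raise an error; it silently converts every element to a common type:

```r
# Mixing a number and a string: everything becomes character
mixed_chr <- c(27, "Nicolas")
class(mixed_chr)
# Mixing a logical and a number: everything becomes numeric (TRUE -> 1)
mixed_num <- c(TRUE, 2)
class(mixed_num)
```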
Find the length of a vector (i.e., the number of elements in a vector)
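For instance, assuming the my_friends vector defined above:

```r
my_friends <- c("Nicolas", "Thierry", "Bernard", "Jerome")
# Number of elements in the vector
length(my_friends)
```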
## [1] 4
Case of missing values
I know that some of my friends (Nicolas and Thierry) have 2 children. But this information is not available (NA) for the remaining friends (Bernard and Jerome).
In R missing values (or missing information) are represented by NA:
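A sketch of a vector with missing values. The name have_child is an assumption chosen for illustration:

```r
# "yes" where the information is known, NA where it is missing
have_child <- c(Nicolas = "yes", Thierry = "yes", Bernard = NA, Jerome = NA)
have_child
```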
## Nicolas Thierry Bernard Jerome
## "yes" "yes" NA NA
It’s possible to use the function is.na() to check whether data contain missing values. The result of is.na() is a logical vector in which the value TRUE specifies that the corresponding element of the input is NA.
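Applied to the same kind of vector (have_child is a hypothetical name), the check could look like this:

```r
have_child <- c(Nicolas = "yes", Thierry = "yes", Bernard = NA, Jerome = NA)
# TRUE marks the elements that are missing
is.na(have_child)
```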
## Nicolas Thierry Bernard Jerome
## FALSE FALSE TRUE TRUE
Note that there is a second type of missing value named NaN (“Not a Number”). It is produced when a mathematical operation has no defined result, for example 0/0 = NaN.
Note also that the function is.na() is TRUE for both NA and NaN values. To differentiate these, the function is.nan() is only TRUE for NaNs.
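The difference between the two checks can be seen on a small vector (the name values is hypothetical):

```r
# One ordinary number, one NA and one NaN
values <- c(1, NA, 0/0)
# is.na() flags both NA and NaN
is.na(values)
# is.nan() flags only NaN
is.nan(values)
```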
Get a subset of a vector
Subsetting a vector consists of selecting a part of your vector.
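A sketch of selection by position, using square brackets (a reconstruction consistent with the output shown next):

```r
my_friends <- c("Nicolas", "Thierry", "Bernard", "Jerome")
# Select the second element
my_friends[2]
# Select elements 2 and 4
my_friends[c(2, 4)]
# Select elements 1 to 3
my_friends[1:3]
```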
## [1] "Thierry"
## [1] "Thierry" "Jerome"
## [1] "Nicolas" "Thierry" "Bernard"
Note that R indexes from 1, NOT 0. So the first element of a vector is at [1] and not [0].
If you have a named vector, it’s also possible to use the name for selecting an element:
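For example, assuming the named friend_ages vector from above:

```r
friend_ages <- c(Nicolas = 27, Thierry = 25, Bernard = 29, Jerome = 26)
# Select an element by its name
friend_ages["Bernard"]
```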
## Bernard
## 29
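Two other common forms of subsetting are exclusion with negative indices and selection with a logical vector. The six outputs that follow are consistent with calls like these (a reconstruction, not necessarily the original code):

```r
my_friends <- c("Nicolas", "Thierry", "Bernard", "Jerome")
friend_ages <- c(27, 25, 29, 26)
are_married <- c(TRUE, FALSE, TRUE, TRUE)
# Exclusion: negative indices drop elements
my_friends[-2]
my_friends[-c(2, 4)]
my_friends[-(1:3)]
# Logical selection: keep the elements where the condition is TRUE
my_friends[are_married]
my_friends[friend_ages >= 27]
my_friends[friend_ages != 27]
```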
## [1] "Nicolas" "Bernard" "Jerome"
## [1] "Nicolas" "Bernard"
## [1] "Jerome"
## [1] "Nicolas" "Bernard" "Jerome"
## [1] "Nicolas" "Bernard"
## [1] "Thierry" "Bernard" "Jerome"
If you want to remove missing data, use this:
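A sketch of two options, dropping the missing values or replacing them, with have_child as a hypothetical vector name:

```r
have_child <- c(Nicolas = "yes", Thierry = "yes", Bernard = NA, Jerome = NA)
have_child
# Keep only the elements that are not NA
have_child[!is.na(have_child)]
# Or replace the NA values with "NO"
have_child[is.na(have_child)] <- "NO"
have_child
```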
## Nicolas Thierry Bernard Jerome
## "yes" "yes" NA NA
## Nicolas Thierry
## "yes" "yes"
## Nicolas Thierry Bernard Jerome
## "yes" "yes" "NO" "NO"
Note that the comparison operators available in R, which return logical values, are:
<: for less than
>: for greater than
<=: for less than or equal to
>=: for greater than or equal to
==: for equal to each other
!=: not equal to each other
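As a quick illustration, each operator compares a vector element by element and returns a logical vector, which can then be used for subsetting:

```r
friend_ages <- c(27, 25, 29, 26)
# Which ages are greater than 26?
older_than_26 <- friend_ages > 26
older_than_26
# Which ages are exactly 25?
friend_ages == 25
# Keep only the ages that are at most 26
friend_ages[friend_ages <= 26]
```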
Calculations with vectors
Note that all the basic arithmetic operators (+, -, *, / and ^) as well as the common arithmetic functions (log, exp, sin, cos, tan, sqrt, abs, …), described in the previous sections, can be applied to a numeric vector.
If you perform an operation with vectors, the operation will be applied to each element of the vector. An example is provided below:
# My friends' salary in dollars
salaries <- c(2000, 1800, 2500, 3000)
names(salaries) <- c("Nicolas", "Thierry", "Bernard", "Jerome")
salaries
## Nicolas Thierry Bernard Jerome
## 2000 1800 2500 3000
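A call that would give the doubled values shown next could be sketched as:

```r
salaries <- c(Nicolas = 2000, Thierry = 1800, Bernard = 2500, Jerome = 3000)
# Multiply every salary by 2
salaries * 2
```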
## Nicolas Thierry Bernard Jerome
## 4000 3600 5000 6000
As you can see, R multiplies each element in the salaries vector by 2.
Now, suppose that you want to multiply the salaries by different coefficients. The following R code can be used:
# create coefs vector with the same length as salaries
coefs <- c(2, 1.5, 1, 3)
# Multiply salaries by coefs
salaries*coefs
## Nicolas Thierry Bernard Jerome
## 4000 2700 2500 9000
Note that the calculation is done element-wise. The first element of salaries vector is multiplied by the first element of coefs vector, and so on.
Compute the square root of a numeric vector:
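An input vector consistent with the output shown next would be, for example (the name my_vector is hypothetical):

```r
my_vector <- c(4, 16, 9)
# sqrt() is applied to each element
sqrt(my_vector)
```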
## [1] 2 4 3
Other useful functions are:
max(x) # Get the maximum value of x
min(x) # Get the minimum value of x
range(x) # Get the range of x. Returns a vector containing the minimum and the maximum of x
length(x) # Get the number of elements in x
sum(x) # Get the total of the elements in x
prod(x) # Get the product of the elements in x
mean(x) # Get the mean of x, i.e. sum(x)/length(x)
sd(x) # Standard deviation of x
var(x) # Variance of x
sort(x) # Sort the element of x in ascending order
For example, if you want to compute the total sum of salaries, type this:
## [1] 9300
Compute the mean of salaries:
## [1] 2325
The range (minimum, maximum) of salaries is:
## [1] 1800 3000
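The three results above can be reproduced with the following calls, assuming the salaries vector defined earlier:

```r
salaries <- c(Nicolas = 2000, Thierry = 1800, Bernard = 2500, Jerome = 3000)
# Total of all salaries
sum(salaries)
# Mean salary
mean(salaries)
# Minimum and maximum
range(salaries)
```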
Please be cautious in generating descriptive statistics because some measures are more appropriate than the others in specific situations. For example, when there are outliers in the data, the median is more appropriate than the mean.
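To make the point about outliers concrete, the sketch below adds one invented extreme salary (50000, purely hypothetical) and compares the two measures:

```r
# Four ordinary salaries plus one hypothetical outlier
salaries_with_outlier <- c(2000, 1800, 2500, 3000, 50000)
# The mean is pulled far above the typical values
mean(salaries_with_outlier)
# The median stays close to the typical values
median(salaries_with_outlier)
```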
A matrix is like an Excel sheet containing multiple rows and columns. It’s used to combine vectors with the same type, which can be either numeric, character or logical. Matrices are used to store a data table in R. The rows of a matrix are generally individuals/observations and the columns are variables.
To easily create a matrix, use the function cbind() or rbind() as follows:
# Numeric vectors
col1 <- c(5, 6, 7, 8, 9)
col2 <- c(2, 4, 5, 9, 8)
col3 <- c(7, 3, 4, 8, 7)
# Combine the vectors by column
my_data <- cbind(col1, col2, col3)
my_data
## col1 col2 col3
## [1,] 5 2 7
## [2,] 6 4 3
## [3,] 7 5 4
## [4,] 8 9 8
## [5,] 9 8 7
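Row names can be set with rownames(). The sketch below is a reconstruction consistent with the named output that follows:

```r
my_data <- cbind(col1 = c(5, 6, 7, 8, 9),
                 col2 = c(2, 4, 5, 9, 8),
                 col3 = c(7, 3, 4, 8, 7))
# Name the rows row1 ... row5
rownames(my_data) <- paste0("row", 1:5)
my_data
```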
## col1 col2 col3
## row1 5 2 7
## row2 6 4 3
## row3 7 5 4
## row4 8 9 8
## row5 9 8 7
If you want to transpose your data, use the function t():
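A minimal sketch, rebuilding my_data so that the example is self-contained:

```r
my_data <- cbind(col1 = c(5, 6, 7, 8, 9),
                 col2 = c(2, 4, 5, 9, 8),
                 col3 = c(7, 3, 4, 8, 7))
rownames(my_data) <- paste0("row", 1:5)
# Rows become columns and columns become rows
t(my_data)
```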
## row1 row2 row3 row4 row5
## col1 5 6 7 8 9
## col2 2 4 5 9 8
## col3 7 3 4 8 7
Note that it’s also possible to construct a matrix using the function matrix().
The simplified format of matrix() is as follows:
matrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE,
dimnames = NULL)
In the R code below, the input data has length 6. We want to create a matrix with two rows. You don’t need to specify the number of columns (here ncol = 3); R will infer this automatically. The matrix is filled row by row when the argument byrow = TRUE. If you want to fill the matrix by columns, use byrow = FALSE (the default).
mdat <- matrix(
data = c(1,2,3, 11,12,13),
nrow = 2, byrow = TRUE,
dimnames = list(c("row1", "row2"), c("C.1", "C.2", "C.3"))
)
mdat
## C.1 C.2 C.3
## row1 1 2 3
## row2 11 12 13
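To see the effect of byrow, the sketch below builds the same data both ways (the object names are hypothetical):

```r
values <- c(1, 2, 3, 11, 12, 13)
# Filled row by row: the first row is 1 2 3
by_row <- matrix(values, nrow = 2, byrow = TRUE)
# Filled column by column (the default): the first column is 1 2
by_col <- matrix(values, nrow = 2, byrow = FALSE)
by_row
by_col
```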
The R functions nrow() and ncol() return the number of rows and columns present in the data, respectively.
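A sketch on my_data, which also shows dim(), a function that returns both numbers at once:

```r
my_data <- cbind(col1 = c(5, 6, 7, 8, 9),
                 col2 = c(2, 4, 5, 9, 8),
                 col3 = c(7, 3, 4, 8, 7))
# Number of columns
ncol(my_data)
# Number of rows
nrow(my_data)
# Both at once: rows, then columns
dim(my_data)
```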
## [1] 3
## [1] 5
## [1] 5 3
Rows and/or columns can be selected as follows: my_data[row, col]
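The outputs that follow are consistent with selections like these (a reconstruction, since the original calls are not shown):

```r
my_data <- cbind(col1 = c(5, 6, 7, 8, 9),
                 col2 = c(2, 4, 5, 9, 8),
                 col3 = c(7, 3, 4, 8, 7))
rownames(my_data) <- paste0("row", 1:5)
my_data[2, ]            # row 2, all columns
my_data[2:4, ]          # rows 2 to 4
my_data[c(2, 4), ]      # rows 2 and 4
my_data[, 3]            # column 3, all rows
my_data[2, 3]           # row 2, column 3
my_data[, "col2"]       # a column selected by name
my_data["row3", "col2"] # a single cell selected by names
my_data[, -1]           # everything except column 1
# Rows where col3 is at least 4
my_data[my_data[, "col3"] >= 4, ]
```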
## col1 col2 col3
## 6 4 3
## col1 col2 col3
## row2 6 4 3
## row3 7 5 4
## row4 8 9 8
## col1 col2 col3
## row2 6 4 3
## row4 8 9 8
## row1 row2 row3 row4 row5
## 7 3 4 8 7
## [1] 3
## row1 row2 row3 row4 row5
## 2 4 5 9 8
## [1] 5
## col2 col3
## row1 2 7
## row2 4 3
## row3 5 4
## row4 9 8
## row5 8 7
## col1 col2 col3
## row1 5 2 7
## row3 7 5 4
## row4 8 9 8
## row5 9 8 7
## col1 col2 col3
## row1 10 4 14
## row2 12 8 6
## row3 14 10 8
## row4 16 18 16
## row5 18 16 14
Or, compute the log2 values:
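Arithmetic operators and mathematical functions work element by element on a matrix. A sketch of both a multiplication by 2 and the log2 transformation (object names are hypothetical):

```r
my_data <- cbind(col1 = c(5, 6, 7, 8, 9),
                 col2 = c(2, 4, 5, 9, 8),
                 col3 = c(7, 3, 4, 8, 7))
rownames(my_data) <- paste0("row", 1:5)
# Multiply every cell by 2
doubled <- my_data * 2
# Take the base-2 logarithm of every cell
log2_data <- log2(my_data)
log2_data
```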
## col1 col2 col3
## row1 2.321928 1.000000 2.807355
## row2 2.584963 2.000000 1.584963
## row3 2.807355 2.321928 2.000000
## row4 3.000000 3.169925 3.000000
## row5 3.169925 3.000000 2.807355
## row1 row2 row3 row4 row5
## 14 13 16 25 24
## col1 col2 col3
## 35 28 29
If you are interested in row/column means, you can use the functions rowMeans() and colMeans() for computing row and column means, respectively.
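A sketch of the sum and mean helpers on my_data:

```r
my_data <- cbind(col1 = c(5, 6, 7, 8, 9),
                 col2 = c(2, 4, 5, 9, 8),
                 col3 = c(7, 3, 4, 8, 7))
rownames(my_data) <- paste0("row", 1:5)
# Totals by row and by column
rowSums(my_data)
colSums(my_data)
# Means by row and by column
rowMeans(my_data)
colMeans(my_data)
```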
Note that it’s also possible to use the function apply() to apply any statistical functions to rows/columns of matrices.
The simplified format of apply() is as follows:
apply(X, MARGIN, FUN)
Here X is the matrix, MARGIN is 1 for rows or 2 for columns, and FUN is the function to apply. Use apply() as follows:
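The three outputs that follow are consistent with these calls (a reconstruction):

```r
my_data <- cbind(col1 = c(5, 6, 7, 8, 9),
                 col2 = c(2, 4, 5, 9, 8),
                 col3 = c(7, 3, 4, 8, 7))
rownames(my_data) <- paste0("row", 1:5)
# Mean of each row (MARGIN = 1)
apply(my_data, 1, mean)
# Median of each row
apply(my_data, 1, median)
# Mean of each column (MARGIN = 2)
apply(my_data, 2, mean)
```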
## row1 row2 row3 row4 row5
## 4.666667 4.333333 5.333333 8.333333 8.000000
## row1 row2 row3 row4 row5
## 5 4 5 8 8
## col1 col2 col3
## 7.0 5.6 5.8
Factor variables represent categories or groups in your data. The function factor() can be used to create a factor variable.
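A minimal sketch that creates the factor discussed next, with the groups coded as 1 and 2:

```r
# Group membership of the four friends
friend_groups <- factor(c(1, 2, 1, 2))
friend_groups
```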
## [1] 1 2 1 2
## Levels: 1 2
The variable friend_groups contains two categories of friends: 1 and 2. In R terminology, categories are called factor levels.
It’s possible to access the factor levels using the function levels():
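levels() can both read and rename the levels. The sketch below is a reconstruction consistent with the two outputs that follow:

```r
friend_groups <- factor(c(1, 2, 1, 2))
# Read the current levels
levels(friend_groups)
# Rename the levels, in the same order
levels(friend_groups) <- c("best_friend", "not_best_friend")
friend_groups
```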
## [1] "1" "2"
## [1] best_friend not_best_friend best_friend not_best_friend
## Levels: best_friend not_best_friend
Note that, by default, R orders factor levels alphabetically. If you want a different order, you can specify the levels argument in the factor() function as follows:
# Change the order of levels
friend_groups <- factor(friend_groups,
levels = c("not_best_friend", "best_friend"))
# Print
friend_groups
## [1] best_friend not_best_friend best_friend not_best_friend
## Levels: not_best_friend best_friend
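Note that the function is.factor() checks whether a variable is a factor, and as.factor() converts a variable to a factor: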
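The outputs that follow are consistent with a check using is.factor() and a conversion using as.factor(), sketched here as a reconstruction:

```r
friend_groups <- factor(c("best_friend", "not_best_friend", "best_friend", "not_best_friend"))
are_married <- c(TRUE, FALSE, TRUE, TRUE)
# Check whether a variable is a factor
is.factor(friend_groups)
# Convert a logical vector to a factor
as.factor(are_married)
```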
## [1] TRUE
## [1] TRUE FALSE TRUE TRUE
## Levels: FALSE TRUE
If you want to know the number of individuals in each level, use the function summary():
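A sketch, rebuilding friend_groups with the reordered levels:

```r
friend_groups <- factor(
  c("best_friend", "not_best_friend", "best_friend", "not_best_friend"),
  levels = c("not_best_friend", "best_friend")
)
# Number of friends in each level
summary(friend_groups)
```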
## not_best_friend best_friend
## 2 2
In the following example, I want to compute the mean salary of my friends by groups. The function tapply() can be used to apply a function, here mean(), to each group.
## Nicolas Thierry Bernard Jerome
## 2000 1800 2500 3000
## [1] best_friend not_best_friend best_friend not_best_friend
## Levels: not_best_friend best_friend
# Compute the mean salaries by groups
mean_salaries <- tapply(salaries, friend_groups, mean)
mean_salaries
## not_best_friend best_friend
## 2400 2250
## not_best_friend best_friend
## 2 2
It’s also possible to use the function table() to create a frequency table, also known as a contingency table of the counts at each combination of factor levels.
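For a single factor, a call like the following would give the counts shown next (a reconstruction):

```r
friend_groups <- factor(
  c("best_friend", "not_best_friend", "best_friend", "not_best_friend"),
  levels = c("not_best_friend", "best_friend")
)
# Frequency of each level
table(friend_groups)
```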
## friend_groups
## not_best_friend best_friend
## 2 2
# Cross-tabulation between
# friend_groups and are_married variables
table(friend_groups, are_married)
## are_married
## friend_groups FALSE TRUE
## not_best_friend 1 1
## best_friend 0 2
A data frame is like a matrix but can have columns with different types (numeric, character, logical). Rows are observations (individuals) and columns are variables.
A data frame can be created using the function data.frame(), as follows:
# Create a data frame
friends_data <- data.frame(
name = my_friends,
age = friend_ages,
height = c(180, 170, 185, 169),
married = are_married
)
# Print
friends_data
To check whether an object is a data frame, use the is.data.frame() function. It returns TRUE if the object is a data frame:
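A sketch of the check on a data frame and on a matrix. Both objects are rebuilt here so that the example runs on its own, and stringsAsFactors = FALSE is set explicitly as an assumption to get the same behaviour in older R versions:

```r
friends_data <- data.frame(
  name = c("Nicolas", "Thierry", "Bernard", "Jerome"),
  age = c(27, 25, 29, 26),
  height = c(180, 170, 185, 169),
  married = c(TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)
my_data <- cbind(col1 = c(5, 6, 7, 8, 9), col2 = c(2, 4, 5, 9, 8))
# A data frame
is.data.frame(friends_data)
# A matrix is not a data frame
is.data.frame(my_data)
```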
## [1] TRUE
## [1] FALSE
The object “friends_data” is a data frame, but the object “my_data” is not. We can convert it to a data frame using the as.data.frame() function:
## [1] "matrix" "array"
# Convert it as a data frame
my_data2 <- as.data.frame(my_data)
# Now, the class is data.frame
class(my_data2)
## [1] "data.frame"
As described in the matrix section, you can use the function t() to transpose a data frame. Because the columns have different types, the result is a character matrix:
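A sketch of the transposition. The row names are set explicitly here, which is an assumption made to mirror the column headers in the output that follows:

```r
friends_data <- data.frame(
  name = c("Nicolas", "Thierry", "Bernard", "Jerome"),
  age = c(27, 25, 29, 26),
  height = c(180, 170, 185, 169),
  married = c(TRUE, FALSE, TRUE, TRUE),
  row.names = c("Nicolas", "Thierry", "Bernard", "Jerome"),
  stringsAsFactors = FALSE
)
# Transpose: variables become rows, friends become columns
transposed <- t(friends_data)
transposed
```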
## Nicolas Thierry Bernard Jerome
## name "Nicolas" "Thierry" "Bernard" "Jerome"
## age "27" "25" "29" "26"
## height "180" "170" "185" "169"
## married "TRUE" "FALSE" "TRUE" "TRUE"
To select just certain columns from a data frame, you can either refer to the columns by name or by their location (i.e., column 1, 2, 3, etc.).
1.) Positive indexing by name and by location
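A sketch of three equivalent ways to extract the name column (a reconstruction; the original calls are not shown):

```r
friends_data <- data.frame(
  name = c("Nicolas", "Thierry", "Bernard", "Jerome"),
  age = c(27, 25, 29, 26),
  height = c(180, 170, 185, 169),
  married = c(TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)
# By name, with the $ operator
friends_data$name
# By name, with square brackets
friends_data[, "name"]
# By location: the first column
friends_data[, 1]
```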
## [1] "Nicolas" "Thierry" "Bernard" "Jerome"
## [1] "Nicolas" "Thierry" "Bernard" "Jerome"
2.) Index by characteristics
We want to select all friends with age >= 27.
## [1] TRUE FALSE TRUE FALSE
TRUE specifies that the row contains a value of age >= 27.
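The logical vector and the row selection it drives can be sketched as follows (a reconstruction):

```r
friends_data <- data.frame(
  name = c("Nicolas", "Thierry", "Bernard", "Jerome"),
  age = c(27, 25, 29, 26),
  height = c(180, 170, 185, 169),
  married = c(TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)
# One TRUE/FALSE value per row
friends_data$age >= 27
# Keep the rows where the condition is TRUE, with all columns
friends_data[friends_data$age >= 27, ]
```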
The R code above tells R to get all rows from friends_data where age >= 27, and then to return all the columns.
If you don’t want to see all the column data for the selected rows but are just interested in displaying, for example, friend names and age for friends with age >= 27, you could use the following R code:
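One possible form of that selection, with the rows given by a condition and the columns by name:

```r
friends_data <- data.frame(
  name = c("Nicolas", "Thierry", "Bernard", "Jerome"),
  age = c(27, 25, 29, 26),
  height = c(180, 170, 185, 169),
  married = c(TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)
# Rows with age >= 27, columns name and age only
friends_data[friends_data$age >= 27, c("name", "age")]
```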
If you’re finding that your selection statement is starting to be inconvenient, you can put your row and column selections into variables first, such as:
Then you can select the rows and columns with those variables:
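A sketch of this two-step approach. The variable names age27 and cols are hypothetical:

```r
friends_data <- data.frame(
  name = c("Nicolas", "Thierry", "Bernard", "Jerome"),
  age = c(27, 25, 29, 26),
  height = c(180, 170, 185, 169),
  married = c(TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)
# Step 1: store the row condition and the column names
age27 <- friends_data$age >= 27
cols <- c("name", "age")
# Step 2: use the variables inside the brackets
friends_data[age27, cols]
```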
It’s also possible to use the function subset() as follows:
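A sketch of subset(), where column names can be written without the friends_data$ prefix (the result names are hypothetical):

```r
friends_data <- data.frame(
  name = c("Nicolas", "Thierry", "Bernard", "Jerome"),
  age = c(27, 25, 29, 26),
  height = c(180, 170, 185, 169),
  married = c(TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)
# Rows with age >= 27, all columns
older <- subset(friends_data, age >= 27)
# Same rows, but only the name and age columns
older_short <- subset(friends_data, age >= 27, select = c(name, age))
older_short
```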
Another option is to use the functions attach() and detach(). The function attach() takes a data frame and makes its columns accessible by simply giving their names.
The functions attach() and detach() can be used as follows:
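A minimal sketch. Keep in mind that an existing object named age in your workspace would take precedence over the attached column, so it is safest to detach as soon as you are done:

```r
friends_data <- data.frame(
  name = c("Nicolas", "Thierry", "Bernard", "Jerome"),
  age = c(27, 25, 29, 26),
  height = c(180, 170, 185, 169),
  married = c(TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)
# Make the columns accessible by name
attach(friends_data)
# age now refers to friends_data$age
older <- friends_data[age >= 27, ]
# Remove the data frame from the search path
detach(friends_data)
older
```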
Add a new column to a data frame
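A sketch using the $ operator. The column name group and its values are assumptions chosen for illustration:

```r
friends_data <- data.frame(
  name = c("Nicolas", "Thierry", "Bernard", "Jerome"),
  age = c(27, 25, 29, 26),
  height = c(180, 170, 185, 169),
  married = c(TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)
# Assigning to a column that does not exist yet creates it
friends_data$group <- factor(
  c("best_friend", "not_best_friend", "best_friend", "not_best_friend")
)
friends_data
```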
It’s also possible to use the functions cbind() and rbind() to extend a data frame.
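A sketch of both functions. The salary column and the extra friend "Paul" are invented for illustration only:

```r
friends_data <- data.frame(
  name = c("Nicolas", "Thierry", "Bernard", "Jerome"),
  age = c(27, 25, 29, 26),
  height = c(180, 170, 185, 169),
  married = c(TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)
# cbind() adds a column (hypothetical salaries)
with_salary <- cbind(friends_data, salary = c(2000, 1800, 2500, 3000))
# rbind() adds a row; the new row needs the same column names
new_friend <- data.frame(name = "Paul", age = 30, height = 175,
                         married = FALSE, stringsAsFactors = FALSE)
with_new_row <- rbind(friends_data, new_friend)
with_new_row
```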
With a numeric data frame, you can use the functions rowSums(), colSums(), colMeans(), rowMeans() and apply() as described in the matrix section.
A list is an ordered collection of objects, which can be vectors, matrices, data frames, etc. In other words, a list can contain all kinds of R objects.
# Create a list
my_family <- list(
mother = "Veronique",
father = "Michel",
sisters = c("Alicia", "Monica"),
sister_age = c(12, 22)
)
# Print
my_family
## $mother
## [1] "Veronique"
##
## $father
## [1] "Michel"
##
## $sisters
## [1] "Alicia" "Monica"
##
## $sister_age
## [1] 12 22
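The names and the number of components of a list can be obtained with names() and length(). The two outputs that follow are consistent with these calls (a reconstruction):

```r
my_family <- list(
  mother = "Veronique",
  father = "Michel",
  sisters = c("Alicia", "Monica"),
  sister_age = c(12, 22)
)
# Names of the components
names(my_family)
# Number of components
length(my_family)
```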
## [1] "mother" "father" "sisters" "sister_age"
## [1] 4
The list object “my_family” contains four components, which may be individually referred to as my_family[[1]], my_family[[2]] and so on.
It’s possible to select an element, from a list, by its name or its index:
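A sketch of the selection styles, consistent with the outputs shown next (a reconstruction):

```r
my_family <- list(
  mother = "Veronique",
  father = "Michel",
  sisters = c("Alicia", "Monica"),
  sister_age = c(12, 22)
)
# By name, with $ or with double brackets
my_family$father
my_family[["father"]]
# By index
my_family[[1]]
my_family[[3]]
```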
## [1] "Michel"
## [1] "Michel"
## [1] "Veronique"
## [1] "Alicia" "Monica"
# Select a specific element of a component
# select the first ([1]) element of my_family[[3]]
my_family[[3]][1]
## [1] "Alicia"
Note that it’s possible to extend an original list.
In the R code below, we want to add the components “grand_father” and “grand_mother” to my_family list object:
# Extend the list
my_family$grand_father <- "John"
my_family$grand_mother <- "Mary"
# Print
my_family
## $mother
## [1] "Veronique"
##
## $father
## [1] "Michel"
##
## $sisters
## [1] "Alicia" "Monica"
##
## $sister_age
## [1] 12 22
##
## $grand_father
## [1] "John"
##
## $grand_mother
## [1] "Mary"
You can also concatenate three lists as follows:
list_abc <- c(list_a, list_b, list_c)
The result is a list also, whose components are those of the argument lists joined together in sequence.
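As a runnable illustration, the three input lists below are invented (their names and contents are hypothetical):

```r
# Three small lists to join
list_a <- list(first = 1)
list_b <- list(second = "b")
list_c <- list(third = c(TRUE, FALSE))
# c() joins the components in sequence
list_abc <- c(list_a, list_b, list_c)
names(list_abc)
length(list_abc)
```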