Session Objectives

At the end of the session, the participants are expected to

  • be able to install R and RStudio on their own computers.
  • explore the programming basics of R.
  • understand the different data types in R and how to perform basic operations on these data types.
  • learn how to install R packages.
  • get acquainted with the built-in datasets in R.

What is R and Why use R?

  • R is a programming language and environment that is used widely for statistical data manipulation and analysis.

  • R has become popular because it’s free and people can contribute to the development of R.

  • R is widely regarded as a de facto standard among professional statisticians.

  • comparable, and often superior, in power to commercial products.

  • available for Windows, Mac, Linux.

  • in addition to enabling statistical operations, it’s a general programming language, so that you can automate your analyses and create new functions.

  • it’s easy to get help from the user community, and lots of new functions get contributed by users, many of whom are prominent statisticians.

  • it has a feature that supports the use of other programming languages, such as Python, SQL and the like.

  • it has a feature for reproducible research.

Installing R and RStudio

To install R

  1. Open an internet browser and go to www.r-project.org

  2. Click the “download R” link in the middle of the page under “Getting Started.”

  3. Select a Comprehensive R Archive Network (CRAN) location (a mirror site) and click the corresponding link. Look for the Philippines.

  4. Click on the “Download R for Windows” link if you are using Windows. Otherwise, click on “Download R for (Mac) OS X” or “Download R for Linux”.

  5. Click “base”.

  6. Click “Download R 4.0.2 for Windows”.

Note that R is constantly being updated, so new versions become available from time to time. When you navigate to the download page, you will see the latest version by default. Download the latest version of R.

  7. Double-click the downloaded file to open it, and follow the installation instructions.

  8. Now that R is installed, you need to download and install RStudio.

RStudio

RStudio is an integrated development environment (IDE) for R. It includes a console and a syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management. See the RStudio website for more of its features.

For ease of use, navigation and computation, we will be using RStudio throughout our lessons.

To install RStudio

  1. Go to the RStudio website.

  2. Click on “Products”.

  3. Under Open Source, click “RStudio”.

  4. Scroll down the page, and click RStudio Desktop.

  5. Choose the open source edition which is free.

  6. Click on the version recommended for your system.

The RStudio Interface

To get started with RStudio, open it. You will see the following RStudio interface, in which the window is partitioned into four major panes.

Figure 1

RStudio is a four-pane workspace for

  1. creating files containing R scripts,

  2. typing R commands,

  3. viewing command histories,

  4. viewing plots and more.

Code editor: allows you to create and open files containing R scripts. An R script is where you keep a record of your work. A new R script can be created as follows: File > New File > R Script.

R console is for typing R commands.

Environment Pane: shows the list of R objects you created during your R session.

  • Import Dataset tab allows you to load datasets stored in your computer in the R environment.

History Pane: shows the history of all previous commands

Connections Pane shows you all the connections you have made to supported data sources, and lets you know which connections are currently active.

Tutorial Pane is powered by the learnr package which hosts the tutorials of the use of different packages.

Files tab: shows the files in your working directory.

Plots tab: shows the history of the plots you have created. From this tab, you can export a plot to a PDF or an image file.

Packages tab: shows the external R packages available on your system. If a package is checked, it is loaded in R.

Viewer tab: displays local web content, such as HTML output produced from R.

Customizing RStudio

To change the appearance of RStudio: Tools > Global Options > Appearance

You can use the Global options to customize the Code, Pane layout and other parts of RStudio. Take time to explore.

For more info about RStudio read the Online Documentation.

R Working Directory

Recall that the working directory is the folder where R reads and saves files.

Check your existing working directory

To check your existing working directory: Type getwd() in the R Console and press enter.

getwd()
## [1] "C:/Users/Roel Ceballos/Dropbox/ROEL/R Training"

Change your working directory

From RStudio, use the menu to change your working directory under Session > Set Working Directory > Choose Directory.
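
The working directory can also be changed from the console with the base R function setwd(). The sketch below uses tempdir() only as a stand-in folder, which is an assumption made so that the example runs on any computer; replace it with the path to your own project folder.

```r
# Remember the current working directory so it can be restored later
old_wd <- getwd()

# tempdir() is only a placeholder folder for this sketch;
# in practice, pass the path to your own folder to setwd()
setwd(tempdir())

# Confirm that the working directory has changed
new_wd <- getwd()
print(new_wd)

# Go back to the original working directory
setwd(old_wd)
```

Use forward slashes (/) in the path, even on Windows, as in the getwd() output shown earlier.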

Easy R Programming Basics

R can be used to perform calculations. For a start, we have the basic arithmetic operators:

  • addition (+)
  • subtraction (-)
  • multiplication (*)
  • division (/)
  • exponentiation (^)
# addition
3+7
## [1] 10
#subtraction
7-3
## [1] 4
#multiplication
7*3
## [1] 21
#division
8/2
## [1] 4
# exponentiation
7^2
## [1] 49

Basic arithmetic functions

Logarithms and Exponentials:

log2(x) # logarithms base 2 of x
log10(x) # logarithms base 10 of x
exp(x) # Exponential of x

Trigonometric functions:

cos(x) # Cosine of x
sin(x) # Sine of x
tan(x) #Tangent of x
acos(x) # arc-cosine of x
asin(x) # arc-sine of x
atan(x) #arc-tangent of x
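
In the listings above, x stands for any numeric value. As a small illustration, the same functions can be called with concrete numbers (the particular values below are chosen only for this sketch). Note that the trigonometric functions expect angles in radians, and that pi is a built-in constant in R.

```r
# Logarithms and exponentials
log2(8)       # logarithm base 2 of 8
log10(1000)   # logarithm base 10 of 1000
exp(1)        # exponential of 1, i.e. Euler's number e

# Trigonometric functions (angles are in radians)
cos(0)        # cosine of 0
sin(pi / 2)   # sine of pi/2
tan(pi / 4)   # tangent of pi/4
acos(1)       # arc-cosine of 1
atan(1)       # arc-tangent of 1, equal to pi/4
```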

CRAN Resources for R Programming

  • Almost all mathematical operations and functions can be found in R, meaning you don’t have to create them from scratch. All you have to do is call the function if it is contained in the base package. If not, install and load the package that contains the function. Make sure to input the right argument of the function.

  • The base package, which is loaded automatically whenever you start R, contains the most commonly used math functions. A list of useful built-in functions is available in the R documentation.

  • On the other hand, advanced mathematical and statistical functions such as those that are used in Time series analysis, Differential equations, Optimization, and Mathematical Modeling can be found in other R packages that you can easily install and use.

  • CRAN maintains Task Views for these advanced mathematical and statistical topics, which serve as a guide for users. For instance, there are Task Views for Differential Equations, Optimization and Time Series.

Assigning values to Variables

A variable can be used to store a value.

For example, the R code below will store the price of a lemon in a variable, say “lemon_price”:

# Price of a lemon = 2 euros
lemon_price <- 2
# or use this
lemon_price = 2

Note that it’s possible to use <- or = for variable assignments.

Note that R is case-sensitive. This means that lemon_price is different from Lemon_Price.

Note that # is used to make a comment.

To print the value of the created object, just type its name:

lemon_price
## [1] 2

or use the function print():

print(lemon_price)
## [1] 2

R saves the object lemon_price (also known as a variable) in memory. It’s possible to make some operations with it.

# Multiply lemon price by 5
5 * lemon_price
## [1] 10

You can change the value of the object:

# Change the value
lemon_price <- 5
# Print again
lemon_price
## [1] 5

The following R code creates two variables holding the width and the height of a rectangle. These two variables will be used to compute the area of the rectangle.

# Rectangle height
height <- 10
# rectangle width
width <- 5
# compute rectangle area
area <- height*width
print(area)
## [1] 50

The function ls() can be used to see the list of objects we have created:

ls()
## [1] "area"        "height"      "lemon_price" "width"

Note that each variable takes up some space in the computer’s memory. If you work on a big project, it’s good to clean up your workspace.

To remove a variable, use the function rm():

# Remove height and width variable
rm(height, width)
# Display the remaining variables
ls()
## [1] "area"        "lemon_price"

Basic Data Types

Basic data types are numeric, character and logical.

# Numeric object: How old are you?
my_age <- 18
# Character  object: What's your name?
my_name <- "Nicolas"
# logical object: Are you a data scientist?
# (yes/no) <=> (TRUE/FALSE)
is_datascientist <- TRUE

Note that a character vector can be created using double (") or single (') quotes.

"My Friend's name is Jerome"
## [1] "My Friend's name is Jerome"

It’s possible to use the function class() to see what type a variable is:

class(my_age)
## [1] "numeric"
class(my_name)
## [1] "character"

You can also use the functions is.numeric(), is.character(), is.logical() to check whether a variable is numeric, character or logical, respectively. For instance:

is.numeric(my_age)
## [1] TRUE
is.numeric(my_name)
## [1] FALSE

If you want to change the type of a variable to another one, use the as.* functions, including: as.numeric(), as.character(), as.logical(), etc.

my_age
## [1] 18
# Convert my_age to a character variable
as.character(my_age)
## [1] "18"

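Note that converting a character value that does not represent a number (for example “abc”) to numeric will output NA (for not available), because R doesn’t know how to interpret such text as a number. In contrast, any numeric value can be converted to a character value.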
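
A small sketch of character-to-numeric conversion: a string that contains a valid number converts as expected, while any other string gives NA together with a warning. The variable names and the string "eighteen" are chosen only for this illustration, and suppressWarnings() is used here just to keep the output short.

```r
# A character string that looks like a number converts cleanly
age_text <- "18"
age_number <- as.numeric(age_text)
class(age_number)

# A character string that is not a number becomes NA;
# without suppressWarnings(), R would also print a warning
not_a_number <- suppressWarnings(as.numeric("eighteen"))
not_a_number
is.na(not_a_number)
```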

Vectors

A vector is a combination of multiple values (numeric, character or logical) in the same object. In this case, you can have numeric vectors, character vectors or logical vectors.

Create a vector

A vector is created using the function c() (for concatenate), as follows:

# Store your friends' ages in a numeric vector
friend_ages <- c(27, 25, 29, 26) # Create
friend_ages # Print
## [1] 27 25 29 26
# Store your friends' names in a character vector
my_friends <- c("Nicolas", "Thierry", "Bernard", "Jerome")
my_friends 
## [1] "Nicolas" "Thierry" "Bernard" "Jerome"
# Store your friends marital status in a logical vector
# Are they married? (yes/no <=> TRUE/FALSE)
are_married <- c(TRUE, FALSE, TRUE, TRUE)
are_married
## [1]  TRUE FALSE  TRUE  TRUE

It’s possible to give a name to the elements of a vector using the function names().

# Vector without element names
friend_ages
## [1] 27 25 29 26
# Vector with element names
names(friend_ages) <- c("Nicolas", "Thierry", "Bernard", "Jerome")
friend_ages
## Nicolas Thierry Bernard  Jerome 
##      27      25      29      26

You can also create a named vector as follows:

friend_ages <- c(Nicolas = 27, Thierry = 25, 
                 Bernard = 29, Jerome = 26)
friend_ages
## Nicolas Thierry Bernard  Jerome 
##      27      25      29      26

Note that a vector can only hold elements of the same type. For example, you cannot have a vector that contains both characters and numeric values.
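
If values of different types are passed to c(), R does not raise an error. Instead, it silently converts all elements to a single common type. The short sketch below illustrates this behaviour; the variable names are chosen only for this example.

```r
# Mixing numeric, character and logical values: everything becomes character
mixed <- c(27, "Nicolas", TRUE)
mixed
class(mixed)

# Mixing numeric and logical values: TRUE becomes 1 and FALSE becomes 0
num_lgl <- c(27, TRUE, FALSE)
num_lgl
class(num_lgl)
```

It is therefore a good habit to check the result with class() whenever a vector is built from mixed inputs.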

Find the length of a vector (i.e., the number of elements in a vector)

# Number of friends
length(my_friends)
## [1] 4

Case of missing values

Suppose you know that some of your friends (Nicolas and Thierry) have children, but this information is not available (NA) for the remaining friends (Bernard and Jerome).

In R missing values (or missing information) are represented by NA:

have_child <- c(Nicolas = "yes", Thierry = "yes", 
                Bernard = NA, Jerome = NA)
have_child
## Nicolas Thierry Bernard  Jerome 
##   "yes"   "yes"      NA      NA

It’s possible to use the function is.na() to check whether a data object contains missing values. The result of is.na() is a logical vector in which the value TRUE indicates that the corresponding element of the input is NA.

# Check if have_child contains missing values
is.na(have_child)
## Nicolas Thierry Bernard  Jerome 
##   FALSE   FALSE    TRUE    TRUE

Note that there is a second type of missing value named NaN (“Not a Number”). It is produced when a mathematical operation has no defined result, for example 0/0 = NaN.

Note also that the function is.na() is TRUE for both NA and NaN values. To differentiate these, the function is.nan() is only TRUE for NaNs.
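
The difference between the two functions can be seen on a small vector that contains both kinds of missing value. The vector below is made up for this illustration.

```r
# A numeric vector with one NA and one NaN (0/0)
values <- c(1, NA, 0/0, 4)
values

# is.na() flags both NA and NaN
is.na(values)

# is.nan() flags only NaN
is.nan(values)
```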

Get a subset of a vector

Subsetting a vector consists of selecting a part of your vector.

  • Selection by positive indexing: select an element of a vector by its position (index) in square brackets
# Select my friend number 2
my_friends[2]
## [1] "Thierry"
# Select my friends number 2 and 4 
my_friends[c(2, 4)]
## [1] "Thierry" "Jerome"
# Select my friends number 1 to 3
my_friends[1:3]
## [1] "Nicolas" "Thierry" "Bernard"

Note that R indexes from 1, NOT 0. So the first element of a vector is at [1] and not [0].

If you have a named vector, it’s also possible to use the name for selecting an element:

friend_ages["Bernard"]
## Bernard 
##      29
  • Selection by negative indexing: Exclude an element
# Exclude my friend number 2
my_friends[-2]
## [1] "Nicolas" "Bernard" "Jerome"
# Exclude my friends number 2 and 4
my_friends[-c(2, 4)]
## [1] "Nicolas" "Bernard"
# Exclude my friends number 1 to 3
my_friends[-(1:3)]
## [1] "Jerome"
  • Selection by logical vector: Only, the elements for which the corresponding value in the selecting vector is TRUE, will be kept in the subset.
# Select only married friends
my_friends[are_married == TRUE]
## [1] "Nicolas" "Bernard" "Jerome"
# Friends with age >=27
my_friends[friend_ages >= 27]
## [1] "Nicolas" "Bernard"
# Friends with age different from 27
my_friends[friend_ages != 27]
## [1] "Thierry" "Bernard" "Jerome"

If you want to remove missing data, use this:

# Data with missing values
have_child
## Nicolas Thierry Bernard  Jerome 
##   "yes"   "yes"      NA      NA
# Keep only values different from NA (!is.na())
have_child[!is.na(have_child)]
## Nicolas Thierry 
##   "yes"   "yes"
# Or, replace NA values by "NO" and then print
have_child[is.na(have_child)] <- "NO"
have_child
## Nicolas Thierry Bernard  Jerome 
##   "yes"   "yes"    "NO"    "NO"

Note that the comparison operators available in R, which return logical values, are:

<: for less than
>: for greater than
<=: for less than or equal to
>=: for greater than or equal to
==: for equal to each other
!=: not equal to each other

Calculations with vectors

Note that, all the basic arithmetic operators (+, -, *, / and ^ ) as well as the common arithmetic functions (log, exp, sin, cos, tan, sqrt, abs, …), described in the previous sections, can be applied on a numeric vector.

If you perform an operation with vectors, the operation will be applied to each element of the vector. An example is provided below:

# My friends' salary in dollars
salaries <- c(2000, 1800, 2500, 3000)
names(salaries) <- c("Nicolas", "Thierry", "Bernard", "Jerome")
salaries
## Nicolas Thierry Bernard  Jerome 
##    2000    1800    2500    3000
# Multiply salaries by 2
salaries*2
## Nicolas Thierry Bernard  Jerome 
##    4000    3600    5000    6000

As you can see, R multiplies each element in the salaries vector with 2.

Now, suppose that you want to multiply the salaries by different coefficients. The following R code can be used:

# create coefs vector with the same length as salaries
coefs <- c(2, 1.5, 1, 3)
# Multiply salaries by coeff
salaries*coefs
## Nicolas Thierry Bernard  Jerome 
##    4000    2700    2500    9000

Note that the calculation is done element-wise. The first element of salaries vector is multiplied by the first element of coefs vector, and so on.

Compute the square root of a numeric vector:

my_vector <- c(4, 16, 9)
sqrt(my_vector)
## [1] 2 4 3

Other useful functions are:

max(x) # Get the maximum value of x
min(x) # Get the minimum value of x
range(x) # Get the range of x. Returns a vector containing the minimum and the maximum of x
length(x) # Get the number of elements in x
sum(x) # Get the total of the elements in x
prod(x) # Get the product of the elements in x
mean(x) # Get the mean of x, equal to sum(x)/length(x)
sd(x) # Standard deviation of x
var(x) # Variance of x
sort(x) # Sort the element of x in ascending order

For example, if you want to compute the total sum of salaries, type this:

sum(salaries)
## [1] 9300

Compute the mean of salaries:

mean(salaries)
## [1] 2325

The range (minimum, maximum) of salaries is:

range(salaries)
## [1] 1800 3000

Please be cautious in generating descriptive statistics because some measures are more appropriate than the others in specific situations. For example, when there are outliers in the data, the median is more appropriate than the mean.
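
To illustrate the effect of an outlier, the sketch below adds one hypothetical, very large salary (50000) to the four salaries used above and compares the mean with the median. The extra value is invented purely for this example.

```r
# The four salaries from above plus one hypothetical extreme value (50000)
salaries_outlier <- c(2000, 1800, 2500, 3000, 50000)

# The mean is pulled upwards by the single extreme value
mean(salaries_outlier)

# The median stays close to the typical salaries
median(salaries_outlier)
```

Here the mean no longer resembles any typical salary in the data, while the median still does.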

Matrices

A matrix is like an Excel sheet containing multiple rows and columns. It’s used to combine vectors with the same type, which can be either numeric, character or logical. Matrices are used to store a data table in R. The rows of a matrix are generally individuals/observations and the columns are variables.

Create and name a matrix

To create a matrix easily, use the function cbind() or rbind() as follows:

# Numeric vectors
col1 <- c(5, 6, 7, 8, 9)
col2 <- c(2, 4, 5, 9, 8)
col3 <- c(7, 3, 4, 8, 7)
# Combine the vectors by column
my_data <- cbind(col1, col2, col3)
my_data
##      col1 col2 col3
## [1,]    5    2    7
## [2,]    6    4    3
## [3,]    7    5    4
## [4,]    8    9    8
## [5,]    9    8    7
# Change rownames
rownames(my_data) <- c("row1", "row2", "row3", "row4", "row5")
my_data
##      col1 col2 col3
## row1    5    2    7
## row2    6    4    3
## row3    7    5    4
## row4    8    9    8
## row5    9    8    7
  • cbind(): combine R objects by columns
  • rbind(): combine R objects by rows
  • rownames(): retrieve or set row names of a matrix-like object
  • colnames(): retrieve or set column names of a matrix-like object
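
The example above uses cbind(). A matching sketch with rbind() and colnames() is given below; the vector names row_a and row_b are chosen only for this illustration.

```r
# Two numeric vectors that will become the rows of the matrix
row_a <- c(5, 2, 7)
row_b <- c(6, 4, 3)

# Combine the vectors by row; the vector names become the row names
by_rows <- rbind(row_a, row_b)

# Set the column names
colnames(by_rows) <- c("col1", "col2", "col3")
by_rows

# The result has 2 rows and 3 columns
dim(by_rows)
```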

If you want to transpose your data, use the function t():

t(my_data)
##      row1 row2 row3 row4 row5
## col1    5    6    7    8    9
## col2    2    4    5    9    8
## col3    7    3    4    8    7

Note that it’s also possible to construct a matrix using the function matrix().

The simplified format of matrix() is as follows:

matrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE,
       dimnames = NULL)
  • data: a data vector
  • nrow, ncol: the desired number of rows and columns, respectively.
  • byrow: logical value. If FALSE (the default) the matrix is filled by columns, otherwise the matrix is filled by rows.
  • dimnames: A list of two vectors giving the row and column names respectively.

In the R code below, the input data has length 6 and we want to create a matrix with two rows. You don’t need to specify the number of columns (here ncol = 3); R will infer it automatically. Because byrow = TRUE, the matrix is filled row by row. With byrow = FALSE (the default), it would be filled column by column.

mdat <- matrix(
           data = c(1,2,3, 11,12,13), 
           nrow = 2, byrow = TRUE,
           dimnames = list(c("row1", "row2"), c("C.1", "C.2", "C.3"))
           )
mdat
##      C.1 C.2 C.3
## row1   1   2   3
## row2  11  12  13
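
For comparison, here is a sketch of the same call with byrow = FALSE, so that the six values are placed column by column instead. The object name mdat_bycol is chosen only for this illustration.

```r
# Same data and dimension names, but filled column by column
mdat_bycol <- matrix(
  data = c(1, 2, 3, 11, 12, 13),
  nrow = 2, byrow = FALSE,
  dimnames = list(c("row1", "row2"), c("C.1", "C.2", "C.3"))
)
mdat_bycol

# The first column now holds the first two values of the data vector
mdat_bycol[, "C.1"]
```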

Dimensions of a matrix

The R functions nrow() and ncol() return the number of rows and columns present in the data, respectively.

ncol(my_data) # Number of columns
## [1] 3
nrow(my_data) # Number of rows
## [1] 5
dim(my_data) # Number of rows and columns
## [1] 5 3

Get a subset of a matrix

  • Select rows/columns by positive indexing

Rows and/or columns can be selected as follows: my_data[row, col]

# Select row number 2
my_data[2, ]
## col1 col2 col3 
##    6    4    3
# Select row number 2 to 4
my_data[2:4, ]
##      col1 col2 col3
## row2    6    4    3
## row3    7    5    4
## row4    8    9    8
# Select multiple rows that aren't contiguous
# e.g.: rows 2 and 4 but not 3
my_data[c(2,4), ]
##      col1 col2 col3
## row2    6    4    3
## row4    8    9    8
# Select column number 3
my_data[, 3]
## row1 row2 row3 row4 row5 
##    7    3    4    8    7
# Select the value at row 2 and column  3
my_data[2, 3]
## [1] 3
  • Select by row/column names
# Select column 2
my_data[, "col2"]
## row1 row2 row3 row4 row5 
##    2    4    5    9    8
# Select by index and names: row 3 and column 2
my_data[3, "col2"]
## [1] 5
  • Exclude rows/columns by negative indexing
# Exclude column 1
my_data[, -1]
##      col2 col3
## row1    2    7
## row2    4    3
## row3    5    4
## row4    9    8
## row5    8    7
  • Selection by logical: In the R code below, we want to keep only rows where col3 >=4:
col3 <- my_data[, "col3"]
my_data[col3 >= 4, ]
##      col1 col2 col3
## row1    5    2    7
## row3    7    5    4
## row4    8    9    8
## row5    9    8    7

Calculations with matrices

  • It’s also possible to perform simple operations on matrices. For example, the following R code multiplies each element of the matrix by 2:
my_data*2
##      col1 col2 col3
## row1   10    4   14
## row2   12    8    6
## row3   14   10    8
## row4   16   18   16
## row5   18   16   14

Or, compute the log2 values:

log2(my_data)
##          col1     col2     col3
## row1 2.321928 1.000000 2.807355
## row2 2.584963 2.000000 1.584963
## row3 2.807355 2.321928 2.000000
## row4 3.000000 3.169925 3.000000
## row5 3.169925 3.000000 2.807355
  • rowSums() and colSums() functions: Compute the total of each row and the total of each column, respectively.
# Total of each row
rowSums(my_data)
## row1 row2 row3 row4 row5 
##   14   13   16   25   24
# Total of each column
colSums(my_data)
## col1 col2 col3 
##   35   28   29

If you are interested in row/column means, you can use the function rowMeans() and colMeans() for computing row and column means, respectively.
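
A minimal sketch of both functions is shown below. The matrix is rebuilt inside the block with the same values as my_data so that the example can be run on its own.

```r
# Rebuild the example matrix (same values as my_data above)
my_data <- cbind(
  col1 = c(5, 6, 7, 8, 9),
  col2 = c(2, 4, 5, 9, 8),
  col3 = c(7, 3, 4, 8, 7)
)
rownames(my_data) <- c("row1", "row2", "row3", "row4", "row5")

# Mean of each row
rowMeans(my_data)

# Mean of each column
colMeans(my_data)
```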

Note that it’s also possible to use the function apply() to apply any statistical functions to rows/columns of matrices.

The simplified format of apply() is as follows:

apply(X, MARGIN, FUN)
  • X: your data matrix
  • MARGIN: possible values are 1 (for rows) and 2 (for columns)
  • FUN: the function to apply on rows/columns

Use apply() as follows:

# Compute row means
apply(my_data, 1, mean)
##     row1     row2     row3     row4     row5 
## 4.666667 4.333333 5.333333 8.333333 8.000000
# Compute row medians
apply(my_data, 1, median)
## row1 row2 row3 row4 row5 
##    5    4    5    8    8
# Compute column means
apply(my_data, 2, mean)
## col1 col2 col3 
##  7.0  5.6  5.8

Factors

Factor variables represent categories or groups in your data. The function factor() can be used to create a factor variable.

Create a factor

# Create a factor variable
friend_groups <- factor(c(1, 2, 1, 2))
friend_groups
## [1] 1 2 1 2
## Levels: 1 2

The variable friend_groups contains two categories of friends: 1 and 2. In R terminology, categories are called factor levels.

It’s possible to access the factor levels using the function levels():

# Get group names (or levels)
levels(friend_groups)
## [1] "1" "2"
# Change levels
levels(friend_groups) <- c("best_friend", "not_best_friend")
friend_groups
## [1] best_friend     not_best_friend best_friend     not_best_friend
## Levels: best_friend not_best_friend

Note that R orders factor levels alphabetically. If you want a different order for the levels, you can specify the levels argument in the factor() function as follows.

# Change the order of levels
friend_groups <- factor(friend_groups, 
                      levels = c("not_best_friend", "best_friend"))
# Print
friend_groups
## [1] best_friend     not_best_friend best_friend     not_best_friend
## Levels: not_best_friend best_friend

Note that:

  • The function is.factor() can be used to check whether a variable is a factor. Results are TRUE (if factor) or FALSE (if not factor)
  • The function as.factor() can be used to convert a variable to a factor.
# Check if friend_groups is a factor
is.factor(friend_groups)
## [1] TRUE
# Convert "are_married" as a factor
as.factor(are_married)
## [1] TRUE  FALSE TRUE  TRUE 
## Levels: FALSE TRUE

Calculations with factors

If you want to know the number of individuals in each level, use the function summary():

summary(friend_groups)
## not_best_friend     best_friend 
##               2               2

In the following example, I want to compute the mean salary of my friends by groups. The function tapply() can be used to apply a function, here mean(), to each group.

# Salaries of my friends
salaries
## Nicolas Thierry Bernard  Jerome 
##    2000    1800    2500    3000
# Friend groups
friend_groups
## [1] best_friend     not_best_friend best_friend     not_best_friend
## Levels: not_best_friend best_friend
# Compute the mean salaries by groups
mean_salaries <- tapply(salaries, friend_groups, mean)
mean_salaries
## not_best_friend     best_friend 
##            2400            2250
# Compute the size/length of each group
tapply(salaries, friend_groups, length)
## not_best_friend     best_friend 
##               2               2

It’s also possible to use the function table() to create a frequency table, also known as a contingency table of the counts at each combination of factor levels.

table(friend_groups)
## friend_groups
## not_best_friend     best_friend 
##               2               2
# Cross-tabulation between 
# friend_groups and are_married variables
table(friend_groups, are_married)
##                  are_married
## friend_groups     FALSE TRUE
##   not_best_friend     1    1
##   best_friend         0    2

Data frames

A data frame is like a matrix but can have columns with different types (numeric, character, logical). Rows are observations (individuals) and columns are variables.

Create a data frame

A data frame can be created using the function data.frame(), as follows:

# Create a data frame
friends_data <- data.frame(
  name = my_friends,
  age = friend_ages,
  height = c(180, 170, 185, 169),
  married = are_married
)
# Print
friends_data

To check whether an object is a data frame, use the is.data.frame() function. It returns TRUE if the object is a data frame:

is.data.frame(friends_data)
## [1] TRUE
is.data.frame(my_data)
## [1] FALSE

The object “friends_data” is a data frame, but the object “my_data” is not. We can convert it to a data frame using the as.data.frame() function:

# What is the class of my_data? --> matrix
class(my_data)
## [1] "matrix" "array"
# Convert it as a data frame
my_data2 <- as.data.frame(my_data)
# Now, the class is data.frame
class(my_data2)
## [1] "data.frame"

As described in the matrix section, you can use the function t() to transpose a data frame. Because the columns have different types, the result is a character matrix:

t(friends_data)
##         Nicolas   Thierry   Bernard   Jerome  
## name    "Nicolas" "Thierry" "Bernard" "Jerome"
## age     "27"      "25"      "29"      "26"    
## height  "180"     "170"     "185"     "169"   
## married "TRUE"    "FALSE"   "TRUE"    "TRUE"

Subset a data frame

To select just certain columns from a data frame, you can either refer to the columns by name or by their location (i.e., column 1, 2, 3, etc.).

1.) Positive indexing by name and by location

# Access the data in 'name' column
# dollar sign is used
friends_data$name
## [1] "Nicolas" "Thierry" "Bernard" "Jerome"
# or use this
friends_data[, 'name']
## [1] "Nicolas" "Thierry" "Bernard" "Jerome"
# Subset columns 1 and 3
friends_data[ , c(1, 3)]
# Exclude column 1
friends_data[, -1]

2.) Index by characteristics

We want to select all friends with age >= 27.

# Identify rows that meet the condition
friends_data$age >= 27
## [1]  TRUE FALSE  TRUE FALSE

TRUE specifies that the row contains a value of age >= 27.

# Select the rows that meet the condition
friends_data[friends_data$age >= 27, ]

The R code above tells R to get all rows from friends_data where age >= 27, and then to return all the columns.

If you don’t want to see all the column data for the selected rows but are just interested in displaying, for example, friend names and age for friends with age >= 27, you could use the following R code:

# Use column locations
friends_data[friends_data$age >= 27,  c(1, 2)]
# Or use column names
friends_data[friends_data$age >= 27, c("name", "age")]

If you’re finding that your selection statement is starting to be inconvenient, you can put your row and column selections into variables first, such as:

age27 <- friends_data$age >= 27
cols <- c("name", "age")

Then you can select the rows and columns with those variables:

friends_data[age27, cols]

It’s also possible to use the function subset() as follows.

# Select friends data with age >= 27
subset(friends_data, age >= 27)

Another option is to use the functions attach() and detach(). The function attach() takes a data frame and makes its columns accessible by simply giving their names.

The functions attach() and detach() can be used as follows:

# Attach a data frame
attach(friends_data)
# === Data manipulation ====
friends_data[age>=27, ]
# === End of data manipulation ====
# Detach the data frame
detach(friends_data)

Extend a data frame

Add a new column to a data frame

# Add group column to friends_data
friends_data$group <- friend_groups
friends_data

It’s also possible to use the functions cbind() and rbind() to extend a data frame.

cbind(friends_data, group = friend_groups)
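
While cbind() adds columns, rbind() adds rows. The sketch below uses a small stand-alone data frame and a hypothetical new friend (“Paul”, aged 31), both invented for this illustration. The new row must have the same column names as the existing data frame.

```r
# A small data frame with two friends
friends_small <- data.frame(
  name = c("Nicolas", "Thierry"),
  age = c(27, 25),
  stringsAsFactors = FALSE
)

# A one-row data frame for a hypothetical new friend
new_friend <- data.frame(name = "Paul", age = 31, stringsAsFactors = FALSE)

# Append the new row to the existing data frame
friends_extended <- rbind(friends_small, new_friend)
friends_extended

# The data frame now has three rows
nrow(friends_extended)
```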

Calculations with data frames

With a numeric data frame, you can use the functions rowSums(), colSums(), colMeans(), rowMeans() and apply() as described in the matrix section.
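
As a minimal sketch, the block below builds a data frame that contains only numeric columns, using the ages and heights from the friends example (the object name measures is chosen for this illustration), and applies some of these functions to it.

```r
# A data frame with numeric columns only
measures <- data.frame(
  age = c(27, 25, 29, 26),
  height = c(180, 170, 185, 169)
)

# Mean of each column
colMeans(measures)

# Total of each column
colSums(measures)

# Maximum of each column, using apply() with MARGIN = 2 (columns)
apply(measures, 2, max)
```

These functions fail on a data frame that also contains character columns, so select the numeric columns first.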

Lists

A list is an ordered collection of objects, which can be vectors, matrices, data frames, etc. In other words, a list can contain all kinds of R objects.

Create a list

# Create a list
my_family <- list(
  mother = "Veronique", 
  father = "Michel",
  sisters = c("Alicia", "Monica"),
  sister_age = c(12, 22)
  )
# Print
my_family
## $mother
## [1] "Veronique"
## 
## $father
## [1] "Michel"
## 
## $sisters
## [1] "Alicia" "Monica"
## 
## $sister_age
## [1] 12 22
# Names of elements in the list
names(my_family)
## [1] "mother"     "father"     "sisters"    "sister_age"
# Number of elements in the list
length(my_family)
## [1] 4

The list object “my_family” contains four components, which may be individually referred to as my_family[[1]], my_family[[2]] and so on.

Subset a list

It’s possible to select an element, from a list, by its name or its index:

  • my_family$mother is the same as my_family[[1]]
  • my_family$father is the same as my_family[[2]]
# Select by name (1/2)
my_family$father
## [1] "Michel"
# Select by name (2/2)
my_family[["father"]]
## [1] "Michel"
# Select by index
my_family[[1]]
## [1] "Veronique"
my_family[[3]]
## [1] "Alicia" "Monica"
# Select a specific element of a component
# select the first ([1]) element of my_family[[3]]
my_family[[3]][1] 
## [1] "Alicia"

Extend a list

Note that it’s possible to extend an original list.

In the R code below, we want to add the components “grand_father” and “grand_mother” to my_family list object:

# Extend the list
my_family$grand_father <- "John"
my_family$grand_mother <- "Mary"
# Print
my_family
## $mother
## [1] "Veronique"
## 
## $father
## [1] "Michel"
## 
## $sisters
## [1] "Alicia" "Monica"
## 
## $sister_age
## [1] 12 22
## 
## $grand_father
## [1] "John"
## 
## $grand_mother
## [1] "Mary"

You can also concatenate three lists as follows:

list_abc <- c(list_a, list_b, list_c)

The result is a list also, whose components are those of the argument lists joined together in sequence.
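
In the line above, list_a, list_b and list_c stand for any three lists. A runnable sketch with three small lists, whose contents are chosen only for this illustration, could look like this:

```r
# Three small lists to be joined
list_a <- list(mother = "Veronique")
list_b <- list(father = "Michel")
list_c <- list(sisters = c("Alicia", "Monica"))

# Concatenate the three lists into one
list_abc <- c(list_a, list_b, list_c)

# The components appear in the order in which the lists were given
names(list_abc)
length(list_abc)
```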

Getting help with R functions

Getting help on a specific function

To read more about a given function, for example mean, the R function help() can be used as follows:

help(mean)

Or use this:

?mean

The output looks like this:

Figure 2

If you want to see some examples of how to use the function, type this: example(function_name).

example(sum)
## 
## sum> ## Pass a vector to sum, and it will add the elements together.
## sum> sum(1:5)
## [1] 15
## 
## sum> ## Pass several numbers to sum, and it also adds the elements.
## sum> sum(1, 2, 3, 4, 5)
## [1] 15
## 
## sum> ## In fact, you can pass vectors into several arguments, and everything gets added.
## sum> sum(1:2, 3:5)
## [1] 15
## 
## sum> ## If there are missing values, the sum is unknown, i.e., also missing, ....
## sum> sum(1:5, NA)
## [1] NA
## 
## sum> ## ... unless  we exclude missing values explicitly:
## sum> sum(1:5, NA, na.rm = TRUE)
## [1] 15

Note that typical R help files contain the following sections:

  • Title and Description: the title and a short description of what the function does.
  • Usage: the syntax of the function.
  • Arguments: the description of the arguments taken by the function.
  • Value: the value returned by the function.
  • Examples: examples of how to use the function.

General help

If you want to read the general documentation about R, use the function help.start():

help.start()

The output looks like this:

Figure 3

Others

  • apropos(): returns a character vector with the names of the objects that match the pattern you searched for, by partial matching. This is useful when you don’t remember the exact name of a function:
# Returns the names of the objects containing "med"
apropos("med")
## [1] "elNamed"        "elNamed<-"      "median"         "median.default"
## [5] "medpolish"      "runmed"
  • help.search() (alternatively ??): searches the documentation for entries matching a given character string. It returns a list of functions matching your search term, each with a short description.
help.search("mean")
# Or use this
??mean
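The pattern given to apropos() is treated as a regular expression, so the search can be narrowed. The following is a small sketch; the exact names returned depend on which packages are attached in your session:

```r
# "^med" matches only names that begin with "med"
# (letter case is ignored by default)
apropos("^med")

# Restrict the search to functions only
apropos("^med", mode = "function")
```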

R Packages

What are R packages?

  • An R package is an extension of R containing data sets and specific functions to solve specific questions.

  • R comes with standard (or base) packages, which contain the basic functions and data sets as well as standard statistical and graphical functions that allow R to work.

  • There are also thousands of other R packages available for download and installation from the CRAN, Bioconductor and GitHub repositories.

  • After installation, you must load a package before you can use the functions it contains.

Installing R packages

  • Packages can be installed from CRAN (for general packages), from Bioconductor (for biology-related packages) or from GitHub (development versions of packages).

Install a package from CRAN

The function install.packages() is used to install a package from CRAN. The syntax is as follows:

install.packages("package_name")

For example, to install the package named readr, type this:

install.packages("readr")

Note that every time you install an R package, R may ask you to specify a CRAN mirror (or server). Choose one that’s close to your location, and R will connect to that server to download and install the package files.

It’s also possible to install multiple packages at the same time, as follows:

install.packages(c("readr", "ggplot2"))

Alternatively, you can use the RStudio point-and-click option: Packages > Install > type the name of the package you want to install.

Figure 4
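Reinstalling a package that is already present costs time, so it can be convenient to install only what is missing. The sketch below is one possible way to do this; missing_packages() and install_if_missing() are hypothetical helper names introduced here for illustration, not functions that come with R:

```r
# Hypothetical helper: return the names in `pkgs` that are not yet installed.
# installed.packages() returns a matrix whose row names are package names.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# Hypothetical helper: install only the packages that are missing
install_if_missing <- function(pkgs) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  invisible(to_install)
}

# Standard packages ship with R, so none of them is reported as missing
missing_packages(c("stats", "utils"))
```

With these helpers defined, install_if_missing(c("readr", "ggplot2")) would download only the packages that are not yet in your library.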

Install a package from Bioconductor

Bioconductor contains packages for analyzing biological data. In the following R code, we want to install the R/Bioconductor package limma, which is dedicated to analyzing genomic data.

To install a package from Bioconductor, use this:

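# Install the BiocManager package first if it is not already available
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("limma")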

Install a package from GitHub

GitHub is a hosting platform for software development and data analysis projects, including R packages. It makes sharing your package easy. You can read more about GitHub here: Git and GitHub, by Hadley Wickham.

To install a package from GitHub, the R package devtools (by Hadley Wickham) can be used. You should first install devtools if you don’t have it installed on your computer.

For example, the following R code installs the latest development version of the survminer R package developed by A. Kassambara (https://github.com/kassambara/survminer).

install.packages("devtools")
devtools::install_github("kassambara/survminer")

View the list of installed packages

To view the list of the packages already installed on your computer, type:

installed.packages()

Alternatively, you can go to the Packages tab to check the installed packages.

Figure 5
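The full output of installed.packages() is a wide matrix with many columns. As a small sketch, the code below keeps only the package name and version; stats is used as the example package because it ships with R:

```r
# installed.packages() returns a character matrix, one row per package
pkgs <- installed.packages()

# Keep only the package name and version columns, first few rows
head(pkgs[, c("Package", "Version")])

# Version of a single installed package
packageVersion("stats")
```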

Folder containing installed packages

R packages are installed in a directory called library. The R function .libPaths() can be used to get the path to the library.

.libPaths()
## [1] "C:/Users/Roel Ceballos/Documents/R/win-library/4.0"
## [2] "C:/Program Files/R/R-4.0.2/library"

Load and use an R package

To use a specific function available in an R package, you have to load the R package using the function library().

In the following R code, we want to import a file into R using the R package readr, which has been installed in the previous section.

The function read_tsv() [in readr] can be used to import a tab separated .txt file:

# Import my data
library("readr")
my_data <- read_tsv("http://www.sthda.com/upload/decathlon.txt")
## Parsed with column specification:
## cols(
##   name = col_character(),
##   `100m` = col_double(),
##   Long.jump = col_double(),
##   Shot.put = col_double(),
##   High.jump = col_double(),
##   `400m` = col_double(),
##   `110m.hurdle` = col_double(),
##   Discus = col_double(),
##   Pole.vault = col_double(),
##   Javeline = col_double(),
##   `1500m` = col_double(),
##   Rank = col_double(),
##   Points = col_double(),
##   Competition = col_double()
## )
# View the first 6 rows and the first 6 columns
# syntax: my_data[row, column]
my_data[1:6, 1:6]
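When only one or two functions of a package are needed, they can also be called with the package::function() notation, without loading the whole package first. The sketch below uses tools, a standard package that ships with R, so that it runs without any extra installation:

```r
# Call a function through its package name without attaching the package
ext <- tools::file_ext("decathlon.txt")
ext

# With readr installed, the import above could be written the same way:
# my_data <- readr::read_tsv("http://www.sthda.com/upload/decathlon.txt")
```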

View loaded R packages

To view the list of loaded (or attached) packages during an R session, use the function search():

search()

If you’re done with the package readr and you want to unload it, use the function detach():

detach("package:readr", unload = TRUE)
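Attached packages appear in the output of search() with the prefix "package:". The following sketch uses this to check whether a given package is currently attached; is_attached() is a hypothetical helper written for this illustration:

```r
# Hypothetical helper: TRUE if the package is currently attached
is_attached <- function(pkg) {
  paste0("package:", pkg) %in% search()
}

# base is always attached in an R session
is_attached("base")
```

The same check can be run for readr before and after calling library() or detach() to see the effect of each function.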

Remove installed packages

To remove an installed R package, use the function remove.packages() as follows:

remove.packages("package_name")

Update installed packages

If you want to update all installed R packages, type this:

update.packages()

To update specific installed packages, say readr and ggplot2, use this:

update.packages(oldPkgs = c("readr", "ggplot2"))

Alternatively, you can go to Packages > Update and check the packages you want to update.

Figure 6

R Built-in Data sets

List of pre-loaded data

To see the list of pre-loaded data, type the function data():

data()

The output is as follows:

Figure 7
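The list can also be handled as a regular R object. As a small sketch, data() returns an object whose results component is a character matrix, so the names and titles of the data sets in the datasets package can be extracted like this:

```r
# One row per data set, with columns such as "Item" and "Title"
ds <- data(package = "datasets")$results

# Show the name and title of the first few data sets
head(ds[, c("Item", "Title")])
```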

Loading a built-in R data

Load and print the mtcars data as follows:

data(mtcars)
# Print the first 6 rows
head(mtcars, 6)

If you want to learn more about the mtcars data set, type this:

?mtcars

Most used R built-in data sets

mtcars: Motor Trend Car Road Tests

  • The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

  • View the content of mtcars data set:

# 1. Loading 
data("mtcars")
# 2. Print
head(mtcars)
  • It contains 32 observations and 11 variables:
# Number of rows (observations)
nrow(mtcars)
## [1] 32
# Number of columns (variables)
ncol(mtcars)
## [1] 11
  • Description of variables:
  1. mpg: Miles/(US) gallon
  2. cyl: Number of cylinders
  3. disp: Displacement (cu.in.)
  4. hp: Gross horsepower
  5. drat: Rear axle ratio
  6. wt: Weight (1000 lbs)
  7. qsec: 1/4 mile time
  8. vs: Engine (0 = V-shaped, 1 = straight)
  9. am: Transmission (0 = automatic, 1 = manual)
  10. gear: Number of forward gears
  11. carb: Number of carburetors
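Some of these variables, such as am, are stored as numeric codes. The sketch below shows one way to inspect the variables and to give the transmission codes readable labels; the copy named cars is an assumption made here so that the built-in data set itself is left unchanged:

```r
data("mtcars")

# Structure: variable names, types and first values
str(mtcars)

# Work on a copy and turn the 0/1 codes of am into a labelled factor
cars <- mtcars
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("automatic", "manual"))

# Number of cars per transmission type
table(cars$am)
```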

If you want to learn more about mtcars, type this:

?mtcars

iris

The iris data set gives the measurements in centimeters of the variables sepal length, sepal width, petal length and petal width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

data("iris")
head(iris)
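As a quick check of the description above, the following sketch counts the flowers per species and computes the mean of each measurement by species:

```r
data("iris")

# Number of flowers per species (50 each, as described above)
table(iris$Species)

# Mean of each of the four measurements, by species
aggregate(. ~ Species, data = iris, FUN = mean)
```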

ToothGrowth

The ToothGrowth data set contains the results of an experiment studying the effect of vitamin C on tooth growth in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods: orange juice or ascorbic acid (a form of vitamin C, coded as VC).

data("ToothGrowth")
  
head(ToothGrowth)
  • len: Tooth length
  • supp: Supplement type (VC or OJ).
  • dose: Dose in milligrams/day (numeric)
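The design of the experiment can be read directly from the data. As a small sketch, the code below counts the animals for each combination of supplement type and dose, and then computes the mean tooth length for each combination:

```r
data("ToothGrowth")

# Number of animals for each supplement/dose combination
with(ToothGrowth, table(supp, dose))

# Mean tooth length for each combination
with(ToothGrowth, tapply(len, list(supp, dose), mean))
```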

PlantGrowth

The PlantGrowth data set contains the results of an experiment comparing yields (as measured by dried weight of plants) obtained under a control and two different treatment conditions.

data("PlantGrowth")
  
head(PlantGrowth)
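The control and the two treatment conditions are stored in the group column. The following sketch lists the group labels and compares the mean dried weight across them:

```r
data("PlantGrowth")

# Group labels: one control and two treatments
levels(PlantGrowth$group)

# Mean dried weight per group
tapply(PlantGrowth$weight, PlantGrowth$group, mean)
```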

USArrests

This data set contains statistics about violent crime rates by US state.

data("USArrests")
     
head(USArrests)
  • Murder: Murder arrests (per 100,000)
  • Assault: Assault arrests (per 100,000)
  • UrbanPop: Percent urban population
  • Rape: Rape arrests (per 100,000)
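In this data set the state names are stored as row names rather than in a column of their own. As a small sketch, the code below prints the first few state names and then sorts the states by murder arrest rate:

```r
data("USArrests")

# State names are stored as row names, not as a column
head(rownames(USArrests))

# The three states with the highest murder arrest rate
head(USArrests[order(USArrests$Murder, decreasing = TRUE), ], 3)
```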