Opening RStudio will automatically launch R software. The platform
interface looks as follows:
Roughly, we can divide the working window into three areas:
Left area: includes the tabs
Console, Terminal, and
Background Jobs
Top-right area: includes the tabs
Environment, History,
Connections, and Tutorial
Bottom-right area: includes the tabs
Files, Plots, Packages,
Help, Viewer, and
Presentation.
Let’s take a closer look at the essential tabs.
On this tab, we first see the information about the R version in use and also some basic commands to try. At the end of those descriptions, we can type our R code, press Enter, and get the result below the code line (e.g., try running 2*2 and see what happens). Virtually, we can do here anything we would do in any other R program, for example:
However, when we run our code directly in the console, it isn’t saved for being reproduced further. If we need (and we usually do) to write a reproducible code to solve a specific task, we have to record and regularly save it in a script file rather than in a console.
We’ll explore how to write scripts. For now, let’s keep in mind that you should mostly use the console to test the code and install R packages since they only need to be installed once.
Whenever we define a new or re-assign an existing variable in
RStudio, it’s stored as an object in the workspace and gets displayed,
together with its value, on the Environment tab in the
top-right area of the RStudio window. Try running greeting
<- "Hello, World!" in the console and see what happens
on the Environment tab.
Terminal - to run commands from the terminalHistory - to track the history of all the operations
performed during the current RStudio sessionFiles - to see the structure of the working folder,
reset the working folder, navigate between folders, etc.Plots - to preview and export created data
visualizationsPackages - to check what packages were loaded and load
or unload packages (by switching on/off the box to the left of a package
name)When creating an R File you can make an R-script (typical R document) or make an Rmarkdown, where you can add chunks with scripts and exporting the file as a pdf, html or word file.
To start recording a script, click File -
New File - R Script. This will open a text
editor in the top-left corner of the RStudio interface (above the
Console tab)
In a script, we can do all the things we listed in the section on the console, only that now our actions will be stored in a file for further usage or sharing.
To run a single line of code from a script, put the cursor on that
line and click the Run icon on the top-right of the text
editor. Otherwise, use a keyboard shortcut (Ctrl + Enter in
Windows/Linux, Cmd + Enter in Mac). To run multiple code
lines, do the same after selecting the necessary lines. To run all code
lines, select all the lines and click the Run icon OR use a keyboard
shortcut (Ctrl + A + Enter in Windows/Linux,
Cmd + A + Enter in Mac).
When we write a script, it makes sense to add code comments where
necessary (using a hashtag symbol # followed by a line of a
comment text) to explain to a potential future reader the why behind
certain pieces of code.
R packages are add-ons to base R that help you achieve additional tasks that are not directly supported by base R. It is by the action of these extra functionality that R excels as a tool for computational genomics. The Bioconductor project (http://bioconductor.org/) is a dedicated package repository for computational biology-related packages. However main package repository of R, called CRAN, also has computational biology related packages. In addition, R-Forge (http://r-forge.r-project.org/), GitHub (https://github.com/), and Bitbucket (http://www.bitbucket.org) are some of the other locations where R packages might be hosted. The packages needed for the code snippets in this book and how to install them are explained in the Packages needed to run the book code section in the Preface of the book.
You can install CRAN packages using install.packages()
(# is the comment character in R).
# install package named "ggplot2" from CRAN
install.packages("ggplot2")
You can install bioconductor packages with a specific installer script.
# get the installer package if you don't have
install.packages("BiocManager")
# install bioconductor package "rtracklayer"
BiocManager::install("rtracklayer")
You can install packages from GitHub using the
install_github() function from devtools
package.
library(devtools)
install_github("hadley/stringr")
Another way to install packages is from the source.
# download the source file
download.file(
"https://github.com/al2na/methylKit/releases/download/v0.99.2/methylKit_0.99.2.tar.gz",
destfile="methylKit_0.99.2.tar.gz")
# install the package from the source file
install.packages("methylKit_0.99.2.tar.gz",
repos=NULL,type="source")
# delete the source file
unlink("methylKit_0.99.2.tar.gz")
You can also update CRAN and Bioconductor packages.
# updating CRAN packages
update.packages()
# updating bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install()
You can get help on functions by using help() and
help.search() functions. You can list the functions in a
package with the ls() function
library(MASS)
ls("package:MASS") # functions in the package
ls() # objects in your R enviroment
# get help on hist() function
?hist
help("hist")
# search the word "hist" in help pages
help.search("hist")
??hist
R can be used as an ordinary calculator, and some say it is an
over-grown calculator. Here are some examples. Remember that
# is the comment character. The comments give details about
the operations in case they are not clear.
2 + 3 * 5 # Note the order of operations.
log(10) # Natural logarithm with base e
5^2 # 5 raised to the second power
3/2 # Division
sqrt(16) # Square root
abs(3-7) # Absolute value of 3-7
pi # The number
exp(2) # exponential function
# This is a comment line
Vectors are one of the core R data structures. It is basically a list
of elements of the same type (numeric, character or logical). Later you
will see that every column of a table will be represented as a vector. R
handles vectors easily and intuitively. You can create vectors with the
c() function, however that is not the only way. The
operations on vectors will propagate to all the elements of the
vectors.
x<-c(1,3,2,10,5) #create a vector named x with 5 components
x = c(1,3,2,10,5)
x
## [1] 1 3 2 10 5
y<-1:5 #create a vector of consecutive integers y
y+2 #scalar addition
## [1] 3 4 5 6 7
2*y #scalar multiplication
## [1] 2 4 6 8 10
y #y itself has not been unchanged
## [1] 1 2 3 4 5
The standard assignment operator in R is <-. This
operator is preferentially used in books and documentation. However, it
is also possible to use the = operator for the assignment.
We have an example in the above code snippet and throughout the book we
use <- and = interchangeably for
assignment.
A matrix refers to a numeric array of rows and columns. You can think
of it as a stacked version of vectors where each row or column is a
vector. One of the easiest ways to create a matrix is to combine vectors
of equal length using cbind(), meaning ‘column bind’.
x<-c(1,2,3,4)
y<-c(4,5,6,7)
m1<-cbind(x,y);m1
## x y
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
## [4,] 4 7
dim(m1) # 2 by 5 matrix
## [1] 4 2
A data frame is more general than a matrix, in that different columns
can have different modes (numeric, character, factor, etc.). A data
frame can be constructed by the data.frame() function. For
example, we illustrate how to construct a data frame from genomic
intervals or coordinates.
chr <- c("chr1", "chr1", "chr2", "chr2")
strand <- c("-","-","+","+")
start<- c(200,4000,100,400)
end<-c(250,410,200,450)
mydata <- data.frame(chr,start,end,strand)
#change column names
names(mydata) <- c("chr","start","end","strand")
mydata # OR this will work too
## chr start end strand
## 1 chr1 200 250 -
## 2 chr1 4000 410 -
## 3 chr2 100 200 +
## 4 chr2 400 450 +
There are a variety of ways to extract the elements of a data frame. You can extract certain columns using column numbers or names, or you can extract certain rows by using row numbers. You can also extract data using logical arguments, such as extracting all rows that have a value in a column larger than your threshold.
mydata[,2:4] # columns 2,3,4 of data frame
## start end strand
## 1 200 250 -
## 2 4000 410 -
## 3 100 200 +
## 4 400 450 +
mydata[,c("chr","start")] # columns chr and start from data frame
## chr start
## 1 chr1 200
## 2 chr1 4000
## 3 chr2 100
## 4 chr2 400
mydata$start # variable start in the data frame
## [1] 200 4000 100 400
mydata[mydata$start>400,] # get all rows where start>400
## chr start end strand
## 2 chr1 4000 410 -
A list in R is an ordered collection of objects (components). A list
allows you to gather a variety of (possibly unrelated) objects under one
name. You can create a list with the list() function. Each
object or element in the list has a numbered position and can have
names. Below we show a few examples of how to create lists.
# example of a list with 4 components
# a string, a numeric vector, a matrix, and a scalar
w <- list(name="Fred",
mynumbers=c(1,2,3),
mymatrix=matrix(1:4,ncol=2),
age=5.3)
w
## $name
## [1] "Fred"
##
## $mynumbers
## [1] 1 2 3
##
## $mymatrix
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
##
## $age
## [1] 5.3
Factors are used to store categorical data. They are important for statistical modeling since categorical variables are treated differently in statistical models than continuous variables. This ensures categorical data treated accordingly in statistical models.
features=c("promoter","exon","intron")
f.feat=factor(features)
An important thing to note is that when you are reading a data frame
with read.table() or creating a data frame with
data.frame() function, the character columns are stored as
factors by default, to change this behavior you need to set
stringsAsFactors=FALSE in read.table() and/or
data.frame() function arguments.
There are four common data types in R, they are numeric,
logical, character and integer.
All these data types can be used to create vectors natively.
#create a numeric vector x with 5 components
x<-c(1,3,2,10,5)
x
## [1] 1 3 2 10 5
#create a logical vector x
x<-c(TRUE,FALSE,TRUE)
x
## [1] TRUE FALSE TRUE
# create a character vector
x<-c("sds","sd","as")
x
## [1] "sds" "sd" "as"
class(x)
## [1] "character"
x<-c(1L,2L,3L)
x
## [1] 1 2 3
class(x)
## [1] "integer"
world_population <- read.csv("world_population.csv")
(Ro run the above piece of code, first download the publicly available World Population Dataset from Kaggle and unzip it into the same folder where you store your R script.)
The result of running the above piece of code will be an R dataframe in your working folder.
In RStudio:
File - Import DatasetOR
Import Dataset on the Environmnet
tabAlso, in R, you can easily read tabular format data with the
read.table() function. You can find a data set on different
formats, some of the most typical are the following ones:
excel file (.xlsx)
csv file (.csv)
txt file (.txt).
There are different functions that can be used for opening the data
frame file, like read.csv or read_xlsx. In
order to save or export a dataset from rstudio, you can easly do it
through the function write.csv() (.csv file) or
write.table()
Tips: You can also import your dataset
through the files window. First you should select your working directory
with the function setwd() or with the option More from the
Files window.
To know where is your working directory, you can use the function
getwd(), and you’ll know where your working directory is
located.
Once you’ve import the dataset and you have all your data uploaded on the Rstudio environment is important to make a descriptive analysis of the data. Some of the most important functions for the descriptive analysis are the following ones:
head() tail(): Returns the first or
last parts of a vector, matrix, table, data frame or function.
str(): Compactly display the internal structure of
an R object, a diagnostic function and an alternative to
summary
summary(): is a generic function used to produce
result summaries of the results of various model fitting
functions.
min() max(): Returns the (regular or
parallel) maxima and minima of the input values.
range(): returns a vector containing the minimum and
maximum of all the given arguments.
names(): Functions to get or set the names of an
object.
mean(): Generic function for the (trimmed)
arithmetic mean.
median(): Compute the sample median.
sd(): This function computes the standard deviation
of the values in x
var() cov() cor(): compute
the variance of x and the covariance or correlation of x and y if these
are vectors.
lapply(): returns a list of the same length as X,
each element of which is the result of applying FUN to the corresponding
element of X.
We’ll be using the dataset iris throughout this example.
This dataset is imported by defaut in R, you only need to load it by
running iris:
dat <- iris # load the iris dataset and renamed it dat
Below a preview of this dataset and its structure:
head(dat) # first 6 observations
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
str(dat) # structure of the dataset
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
The dataset contains 150 observations and 5 variables, representing
the length and width of the sepal and petal and the species of 150
flowers. Length and width of the sepal and petal are numeric variables
and the species is a factor with 3 levels (indicated by num
and Factor w/ 3 levels after the name of the
variables).
Minimum and maximum can be found thanks to the min() and
max()functions:
min(dat$Sepal.Length) # The $ symbol is used to indicate which column we desire to calculate the minimum value from the dataset dat
## [1] 4.3
max(dat$Sepal.Length)
## [1] 7.9
Alternatively the range() function:
rng <- range(dat$Sepal.Length)
rng
## [1] 4.3 7.9
gives you the minimum and maximum directly. Note that the output of the range() function is actually an object containing the minimum and maximum (in that order). This means you can actually access the minimum with:
rng[1] # rng = name of the object specified above
## [1] 4.3
This reminds us that, in R, there are often several ways to arrive at the same result. The method that uses the shortest piece of code is usually preferred as a shorter piece of code is less prone to coding errors and more readable.
The range can then be easily computed, as you have guessed, by subtracting the minimum from the maximum:
max(dat$Sepal.Length) - min(dat$Sepal.Length)
## [1] 3.6
The mean can be computed with the mean() function:
mean(dat$Sepal.Length)
## [1] 5.843333
mean(dat\$Sepal.Length, na.rm = TRUE) to compute the mean
with the NA excluded. This argument can be used for most functions
presented in this article, not only the meanmean(dat\$Sepal.Length, trim = 0.10) and change the
trim argument to your needsThe median can be computed thanks to the median()
function:
median(dat$Sepal.Length)
## [1] 5.8
The standard deviation and the variance is computed with the
sd() and var() functions:
sd(dat$Sepal.Length) # standard deviation
## [1] 0.8280661
var(dat$Sepal.Length) # variance
## [1] 0.6856935
to compute the standard deviation (or variance) of multiple variables
at the same time, use lapply() with the appropriate
statistics as second argument:
lapply(dat[, 1:4], sd)
## $Sepal.Length
## [1] 0.8280661
##
## $Sepal.Width
## [1] 0.4358663
##
## $Petal.Length
## [1] 1.765298
##
## $Petal.Width
## [1] 0.7622377
The command dat[, 1:4] selects the variables 1 to 4 as
the fifth variable is a qualitative variable and the standard deviation
cannot be computed on such type of variable.
You can compute the minimum, 1st quartile, median, mean,
3rd quartile and the maximum for all numeric variables of a
dataset at once using summary():
summary(dat)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
If you need these descriptive statistics by group use the
by() function:
by(dat, dat$Species, summary)
## dat$Species: setosa
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.300 Min. :1.000 Min. :0.100
## 1st Qu.:4.800 1st Qu.:3.200 1st Qu.:1.400 1st Qu.:0.200
## Median :5.000 Median :3.400 Median :1.500 Median :0.200
## Mean :5.006 Mean :3.428 Mean :1.462 Mean :0.246
## 3rd Qu.:5.200 3rd Qu.:3.675 3rd Qu.:1.575 3rd Qu.:0.300
## Max. :5.800 Max. :4.400 Max. :1.900 Max. :0.600
## Species
## setosa :50
## versicolor: 0
## virginica : 0
##
##
##
## ------------------------------------------------------------
## dat$Species: versicolor
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## Min. :4.900 Min. :2.000 Min. :3.00 Min. :1.000 setosa : 0
## 1st Qu.:5.600 1st Qu.:2.525 1st Qu.:4.00 1st Qu.:1.200 versicolor:50
## Median :5.900 Median :2.800 Median :4.35 Median :1.300 virginica : 0
## Mean :5.936 Mean :2.770 Mean :4.26 Mean :1.326
## 3rd Qu.:6.300 3rd Qu.:3.000 3rd Qu.:4.60 3rd Qu.:1.500
## Max. :7.000 Max. :3.400 Max. :5.10 Max. :1.800
## ------------------------------------------------------------
## dat$Species: virginica
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.900 Min. :2.200 Min. :4.500 Min. :1.400
## 1st Qu.:6.225 1st Qu.:2.800 1st Qu.:5.100 1st Qu.:1.800
## Median :6.500 Median :3.000 Median :5.550 Median :2.000
## Mean :6.588 Mean :2.974 Mean :5.552 Mean :2.026
## 3rd Qu.:6.900 3rd Qu.:3.175 3rd Qu.:5.875 3rd Qu.:2.300
## Max. :7.900 Max. :3.800 Max. :6.900 Max. :2.500
## Species
## setosa : 0
## versicolor: 0
## virginica :50
##
##
##
hist(dat$Sepal.Length)
plot(dat$Species, dat$Sepal.Length,
main = "Sepal Length based on the Species",
xlab = "Species", ylab = "Sepal Length (cm)",
col = "red")
plot(dat$Sepal.Length, dat$Sepal.Width,
main = "Sepal Length Vs Sepal Width" )
barplot(prop.table(table(dat$Species)),
col=c("orange","blue","red"),
legend.text =c("Setosa","versicolor","virginica"),
ylim=c(0,0.5))
shapiro.test(dat$Sepal.Length)
##
## Shapiro-Wilk normality test
##
## data: dat$Sepal.Length
## W = 0.97609, p-value = 0.01018
library(car)
## Warning: package 'car' was built under R version 4.2.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.2.2
qqPlot(dat$Sepal.Length)
## [1] 132 118
lm.iris1 <- lm(Sepal.Length~Species, data=iris) ##makes a 'linear model' object which we will use for the next line of code and the ANOVA code
leveneTest(lm.iris1)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 6.3527 0.002259 **
## 147
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(lm.iris1)
## Analysis of Variance Table
##
## Response: Sepal.Length
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 63.212 31.606 119.26 < 2.2e-16 ***
## Residuals 147 38.956 0.265
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.iris1)
##
## Call:
## lm(formula = Sepal.Length ~ Species, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6880 -0.3285 -0.0060 0.3120 1.3120
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.0060 0.0728 68.762 < 2e-16 ***
## Speciesversicolor 0.9300 0.1030 9.033 8.77e-16 ***
## Speciesvirginica 1.5820 0.1030 15.366 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5148 on 147 degrees of freedom
## Multiple R-squared: 0.6187, Adjusted R-squared: 0.6135
## F-statistic: 119.3 on 2 and 147 DF, p-value: < 2.2e-16
iris.aov <- aov(Sepal.Length~Species, data = iris)
TukeyHSD(iris.aov)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Sepal.Length ~ Species, data = iris)
##
## $Species
## diff lwr upr p adj
## versicolor-setosa 0.930 0.6862273 1.1737727 0
## virginica-setosa 1.582 1.3382273 1.8257727 0
## virginica-versicolor 0.652 0.4082273 0.8957727 0
iris.aov2 <- aov(Sepal.Width~Species, dat = iris)
TukeyHSD(iris.aov2)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Sepal.Width ~ Species, data = iris)
##
## $Species
## diff lwr upr p adj
## versicolor-setosa -0.658 -0.81885528 -0.4971447 0.0000000
## virginica-setosa -0.454 -0.61485528 -0.2931447 0.0000000
## virginica-versicolor 0.204 0.04314472 0.3648553 0.0087802
plot(dat$Species, dat$Sepal.Width,
main = "Sepal Width based on the Species",
xlab = "Species", ylab = "Sepal Width (cm)",
col = "blue")