R is a programming language created by statisticians. It is frequently used by statisticians and computational biologists. R is especially helpful for data visualization and data analysis.
Like other programming languages, it can be used for mundane tasks like simple math. But its true power lies in statistical interpretations and data analysis/manipulation.
To download R go to: https://www.r-project.org/.
R can be used in RStudio, a text editor, or the R shell. In BIOSC1540, we use RStudio.
One of the most helpful things about R is its help menu. Need to use a function but don’t remember how? Type:
?print()
If using RStudio, a side pane will pull up the documentation for
the function. The general formula for this command is
?function_name()
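The same documentation can be reached in a couple of other ways. This is a minimal sketch using only functions that ship with R; the choice of print as the example function simply follows the text above.

```r
# Equivalent ways to open the help page for print()
?print
help("print")

# Peek at a function's arguments without opening the full help page
args(print)
```

In RStudio, both help calls open the same side pane.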
If you come from a Python or Java background, the way variable
assignment works in R will make you angry. In R, the convention is
NOT to use = as the assignment
operator.
If you don’t have a coding background: variables are stored data that can be reused and/or reassigned throughout your program.
The generally agreed-upon way to assign variables in R is with
<-.
x <- "a"
print(x)
## [1] "a"
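Reassignment works the same way. The sketch below uses made-up variable names (count, total) purely for illustration, and shows that = also works at the top level even though <- is the convention.

```r
count <- 1           # assign a value
count <- count + 1   # reassign, using the old value
print(count)
## [1] 2

# = is accepted here too, but <- is the R convention
total = 10
total <- total + count
print(total)
## [1] 12
```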
Typically, bland variable names such as “x” are bad. It’s best practice to name the variable as it relates to the data being stored. For example:
alphabet.lower <- c("a", "b", "c", "d", "e", "f", "g",
"h", "i", "j", "k", "l", "m", "n",
"o", "p", "q", "r", "s", "t", "u",
"v", "w", "x", "y", "z")
With a variable named “alphabet.lower”, coming back to the code a day or so later will remind you what the variable is and why you have it. If it were saved as “x” it could mean anything.
The most important thing is that the variable name is unambiguous, but it’s nice to keep the names short to avoid errors down the road.
Spelling a variable wrong and not noticing, then getting angry about the code not working, is a genuine issue.
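Here is roughly what that looks like. The variable name gene.count is hypothetical, and tryCatch() is used only so the sketch keeps running after the error instead of stopping.

```r
gene.count <- 42

# Misspelled name: R searches for an object called gene.cuont and fails
result <- tryCatch(
  gene.cuont + 1,
  error = function(e) conditionMessage(e)
)
print(result)   # the error message names the object R could not find
```

R is also case-sensitive, so Gene.Count and gene.count would be two different variables.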
You may be looking at the above code chunk like “hey, what is c() and why do I need it? Python doesn’t make me do this.” And you would be correct.
Let’s check the documentation:
?c
c() combines its arguments into a vector. In most cases, this is used to
create ordered collections of data.
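A short sketch of c() in action. The variable names and values (codons, gc.content) are invented for illustration only.

```r
# c() combines values into a single vector
codons     <- c("ATG", "TAA", "TGA")   # character vector
gc.content <- c(0.41, 0.52, 0.38)      # numeric vector

# c() can also extend an existing vector
more.codons <- c(codons, "TAG")
print(more.codons)
## [1] "ATG" "TAA" "TGA" "TAG"

# A vector holds one data type, so mixed input is coerced (here to character)
mixed <- c(1, "a", TRUE)
print(mixed)
## [1] "1"    "a"    "TRUE"
```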
A vector is an object that is a staple in computation. Other data objects in R that will be used in this class are: Arrays, Matrices, Data Frames.
Objects contain data types. Data types are exactly as they sound: what kind of data is it? They can be logical, numeric, character, complex (numbers with an imaginary part, such as 1+2i), integer (which is different from numeric), or raw.
For 1540, focus on logical, numeric, character, and integer.
Want to know what data type a variable is? Use is(): the first entry it
returns is the data type, and the rest are the classes that type extends.
is(alphabet.lower)
## [1] "character" "vector" "data.frameRowLabels"
## [4] "SuperClassMethod"
is(1L)
## [1] "integer" "double" "numeric"
## [4] "vector" "data.frameRowLabels"
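When only the type itself is wanted, class() is a shorter alternative. It is a base R function that the text above does not use, so treat this as an optional sketch rather than the course's method.

```r
class(TRUE)      # "logical"
class(3.14)      # "numeric"
class("ATG")     # "character"
class(2L)        # "integer": the L suffix marks a whole number
class(2)         # "numeric": without the L, 2 is stored as a double

is.numeric(2L)   # TRUE: integers still count as numeric
```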
Most bioinformatics work requires working with vectors. Some objects
are particular about the lengths of vectors, especially when
manipulating them. To find the length of a vector, use
length()
length(alphabet.lower)
## [1] 26
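One place where lengths matter is building a data frame. The sketch below uses two made-up vectors (short and long) and wraps the call in tryCatch() so the example keeps running after the error.

```r
short <- c(1, 2, 3)
long  <- c(10, 20, 30, 40)

length(short)   # 3
length(long)    # 4

# data.frame() refuses columns whose lengths do not match up
outcome <- tryCatch(
  data.frame(short, long),
  error = function(e) "lengths differ, so no data frame was built"
)
print(outcome)

# Comparing lengths first avoids the error
length(short) == length(long)   # FALSE
```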
As mentioned before, a big part of R is data visualization. So, how do we make graphs?
In 1540, we’ll use many different kinds of plots, different packages that make special plots, and more! But we’ll keep those a surprise for now.
“Base R” refers to R as it is downloaded. Because of the collaborative nature of the R community, thousands of packages exist to do specific things. With data visualization being a primary use of R, lots of them are for plots. Later in the semester, we’ll use the ggplot2 package for making several types of charts/graphs.
The base R plotting function is simply plot(). There are
2 ways to use this function:
plot(x = variablex, y = variabley) OR plot(variabley ~ variablex)
Dr. B prefers the ~ notation.
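Both styles can be tried side by side. This sketch assumes the cars dataset that ships with R (speed and stopping distance); it is a stand-in chosen here for illustration, not data from the course.

```r
# cars comes with R: speed in mph, stopping distance in feet
speed         <- cars$speed
stopping.dist <- cars$dist

# Style 1: name the x and y arguments
plot(x = speed, y = stopping.dist)

# Style 2: formula notation, read as "stopping.dist as a function of speed"
plot(stopping.dist ~ speed)
```

Both calls draw the same scatterplot; the formula version puts the y variable first.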
To use plot(), we need data. For this, we’re going to use a dataset from the Stat2Data package, which is not part of Base R and must be installed before it can be loaded. When you encounter a new dataset, it is best practice to go through a series of exploratory steps so that you know what it contains.
Oh, and those “#” signs? That’s how you make comments in R (just like Python!)
# load a library and data:
library(Stat2Data)
data(SeaIce)
# explore the data:
head(SeaIce) # shows the beginning of the dataset
##   Year Extent Area t
## 1 1979   7.22 4.54 1
## 2 1980   7.86 4.83 2
## 3 1981   7.25 4.38 3
## 4 1982   7.45 4.38 4
## 5 1983   7.54 4.64 5
## 6 1984   7.11 4.04 6
dim(SeaIce) # [rows, columns]
## [1] 37 4
names(SeaIce) # what are the column names?
## [1] "Year" "Extent" "Area" "t"
summary(SeaIce) # a stat summary of the data
##       Year          Extent           Area             t
##  Min.   :1979   Min.   :3.630   Min.   :2.370   Min.   : 1
##  1st Qu.:1988   1st Qu.:5.590   1st Qu.:3.980   1st Qu.:10
##  Median :1997   Median :6.540   Median :4.360   Median :19
##  Mean   :1997   Mean   :6.349   Mean   :4.254   Mean   :19
##  3rd Qu.:2006   3rd Qu.:7.240   3rd Qu.:4.640   3rd Qu.:28
##  Max.   :2015   Max.   :7.910   Max.   :5.610   Max.   :37
Now that we know what the data is and what it looks like, let’s plot
it. To pull an entire column from the dataset, use $
notation:
A way to remember how this notation works is to think of $ as “in”
when reading from right to left. For example: SeaIce$Area
can be thought of as the Area column IN the SeaIce dataset.
plot(SeaIce$Area ~ SeaIce$Year,
main = "Sea Ice Area per Year", #title
xlab = "Year", #label x axis
ylab = "Area" #label y axis
)
Just like with most things in coding, making explanatory labels and titles for plots is important. This allows for clarity for both the coder and whoever is looking at the results of the code.
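The $ notation from the plot call can be tried on a small scale. The sketch below rebuilds the first six rows of SeaIce by hand from the head() output, so it runs without the Stat2Data package; the name ice is made up for this example. It also shows the data = argument of the formula interface, an option that avoids repeating the dataset name.

```r
# First six Year/Area pairs, typed in from the head(SeaIce) output
ice <- data.frame(
  Year = 1979:1984,
  Area = c(4.54, 4.83, 4.38, 4.38, 4.64, 4.04)
)

ice$Area                             # the Area column IN ice
identical(ice$Area, ice[["Area"]])   # TRUE: two spellings of the same column

# With data = ice, the formula can use bare column names
plot(Area ~ Year, data = ice,
     main = "Sea Ice Area per Year",  # title
     xlab = "Year",                   # label x axis
     ylab = "Area")                   # label y axis
```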
The dataset used above, SeaIce, is stored in a dataframe. Dataframes are essentially tables.
# More Dataframe Exploration:
# load a library and data:
library(Stat2Data)
data(SeaIce)
SeaIce[1, ] # will give first row of dataframe
##   Year Extent Area t
## 1 1979   7.22 4.54 1
SeaIce[ ,1] # will give first col of dataframe
## [1] 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993
## [16] 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008
## [31] 2009 2010 2011 2012 2013 2014 2015
is(SeaIce) # will give datatype
## [1] "data.frame" "list" "oldClass" "vector"
tail(SeaIce) # shows bottom of dataframe
##    Year Extent Area  t
## 32 2010   4.93 3.29 32
## 33 2011   4.63 3.18 33
## 34 2012   3.63 2.37 34
## 35 2013   5.35 3.75 35
## 36 2014   5.29 3.70 36
## 37 2015   4.68 3.37 37
mean(SeaIce$Extent)
## [1] 6.348919
max(SeaIce$Extent)
## [1] 7.91
min(SeaIce$Extent)
## [1] 3.63
sqrt(SeaIce$Extent) # sqrt of each value in the col
## [1] 2.687006 2.803569 2.692582 2.729469 2.745906 2.666458 2.632489 2.747726
## [9] 2.740438 2.744085 2.660827 2.503997 2.567100 2.754995 2.557342 2.690725
## [17] 2.485961 2.812472 2.603843 2.572936 2.507987 2.521904 2.603843 2.445404
## [25] 2.485961 2.465766 2.364318 2.439262 2.078461 2.174856 2.321637 2.220360
## [33] 2.151743 1.905256 2.313007 2.300000 2.163331
Dataframes can be created from two or more vectors of the same
length in R using data.frame(vector1, vector2).
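As a minimal sketch, here are two short vectors turned into a data frame. The values are copied from the first three rows of the SeaIce output; the names year, extent, and ice.small are made up for this example.

```r
year   <- c(1979, 1980, 1981)
extent <- c(7.22, 7.86, 7.25)

ice.small <- data.frame(year, extent)
print(ice.small)
##   year extent
## 1 1979   7.22
## 2 1980   7.86
## 3 1981   7.25

dim(ice.small)     # 3 rows, 2 columns
names(ice.small)   # column names are taken from the vector names
```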
BIOSC1540 is a class for Computational Biology majors and nonmajors. Bioinformatics is an ever-growing field with the capability to assist in several disciplines. Even if comp bio is not what you plan to do, this class is helpful to see what possibilities lie beyond standard wet lab trials. If comp bio is your plan, this class will offer a taste of what you can do. This is the first of several bioinformatics classes you will take, so don’t feel discouraged if some of the things we do aren’t wildly interesting. It is impossible to fit everything you can do with computational biology into a single semester class.
Please reach out with any questions or concerns.