R Bootcamp

What is R?

R is a programming language created by statisticians. It is frequently used by statisticians and computational biologists. R is especially helpful for data visualization and data analysis.

Like other programming languages, it can be used for mundane tasks like simple math. But its true power lies in statistical interpretations and data analysis/manipulation.
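As a small illustration of the "simple math" side, here is a minimal sketch of R used as a calculator. The numbers are made up, and c() (which groups values together) is explained further down.

```r
# Basic arithmetic typed at the R console
2 + 3        # addition
10 / 4       # division returns a decimal
2^3          # exponent
sqrt(16)     # square root

# A first taste of statistics: the average of a few made-up measurements
mean(c(4.2, 5.1, 3.8, 4.9))
```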

How do I use it?

To download R go to: https://www.r-project.org/.

R can be used in RStudio, a text editor, or the R shell. In BIOSC1540, we use RStudio.

The Basics:

Help!

One of the most helpful things about R is its help menu. Need to use a function but don’t remember how? Type:

?print()

If using RStudio, a side pane will pull up the documentation for the function. The general formula for this command is ?function_name()

Variables

If you come from a Python or Java background, the way variable assignment works in R will make you angry. In R, the convention is NOT to use = as the assignment operator.

If you don’t have a coding background: variables are stored data that can be reused and/or reassigned throughout your program.

In R, the generally agreed-upon way to assign variables is with <-.

x <- "a"
print(x)

## [1] "a"
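To see reassignment in action, here is a minimal sketch. The variable names gene.count and sample.name are made up for illustration.

```r
# Assign with <- (the conventional R operator)
gene.count <- 10
print(gene.count)

# Reassigning replaces the old value
gene.count <- gene.count + 5
print(gene.count)

# = also assigns at the top level, but <- is the community convention
sample.name = "control"
print(sample.name)
```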

Typically, bland variable names such as “x” are bad. It’s best practice to name the variable as it relates to the data being stored. For example:

alphabet.lower <- c("a", "b", "c", "d", "e", "f", "g",
                    "h", "i", "j", "k", "l", "m", "n",
                    "o", "p", "q", "r", "s", "t", "u",
                    "v", "w", "x", "y", "z")

With a variable named “alphabet.lower”, coming back to the code a day or so later will remind you what the variable is and why you have it. If it were saved as “x” it could mean anything.

The most important thing is that the variable name is unambiguous, but it’s nice to keep the names short to avoid errors down the road.

Misspelling a variable name without noticing, then getting angry that the code doesn’t work, is a genuine issue.
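Here is a rough sketch of what a misspelled name looks like to R. The vector is shortened and the typo "alphabet.lowr" is deliberate; tryCatch() is only used so the example keeps running after the error.

```r
alphabet.lower <- c("a", "b", "c")   # shortened for illustration

# "alphabet.lowr" was never created, so R raises an error.
# tryCatch() captures the error message instead of stopping the script.
result <- tryCatch(
  length(alphabet.lowr),
  error = function(e) conditionMessage(e)
)
print(result)

# exists() checks whether a name has been defined
exists("alphabet.lower")
exists("alphabet.lowr")
```

If you hit an "object not found" error, checking the spelling of the name is a good first step.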

Basic Functions

You may be looking at the above code chunk like “hey, what is c() and why do I need it? Python doesn’t make me do this.” And you would be correct.

Let’s check the documentation:

?c

c() combines its arguments into a vector. In most cases, this is used to create an ordered collection of data values that all share one type.

A vector is an object that is a staple in computation. Other data objects in R that will be used in this class are arrays, matrices, and data frames.

Objects contain data types. Data types are exactly as they sound: what is the data? They can be logical, numeric, character, complex (numbers with an imaginary part, such as 1+2i), integer (which is different from numeric), or raw.

For 1540, focus on logical, numeric, character, and integer.

Logical:
- TRUE/FALSE values
Numeric:
- if you use other programming languages, numeric is a float value
- in other words, numeric data can hold decimals
Character:
- a single letter, a word, or a whole sentence
- denoted with quotation marks (“”)
- if you use other programming languages: strings are character vectors
Integer:
- whole numbers
- 1L is an integer, “1L” is a character, 1.0 is numeric
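The types in the list above can be checked at the console. This is a minimal sketch using class(), a base R function that returns just the type name; the vector "mixed" is made up for illustration.

```r
class(TRUE)      # "logical"
class(1.0)       # "numeric"
class(1L)        # "integer"
class("1L")      # "character"

# A vector holds one type; mixing types converts everything to the most general one
mixed <- c(1L, 2.5, "three")
class(mixed)     # "character"

# Convert explicitly with the as.*() functions
as.integer("42") + 1L
```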

Want to know what data type a variable is? Use is()! The first entry it returns is the data type; the remaining entries are broader classes the object also belongs to.

is(alphabet.lower)

## [1] "character"           "vector"              "data.frameRowLabels"
## [4] "SuperClassMethod"

is(1L)

## [1] "integer"             "double"              "numeric"            
## [4] "vector"              "data.frameRowLabels"

Most bioinformatics work requires working with vectors. Some objects are particular about the lengths of vectors, especially when manipulating them. To find the length of a vector, use length().

length(alphabet.lower)

## [1] 26
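Here is a rough sketch of why vector length matters, using made-up vectors. It shows indexing, element-by-element arithmetic, and one case where mismatched lengths cause an error.

```r
nucleotides <- c("A", "C", "G", "T")   # hypothetical example vector
length(nucleotides)

# Square brackets pull out elements by position (R counts from 1)
nucleotides[1]
nucleotides[2:3]

# Arithmetic on equal-length vectors works element by element
counts.a <- c(10, 20, 30)
counts.b <- c(1, 2, 3)
counts.a + counts.b

# data.frame() refuses columns whose lengths do not match
mismatch <- tryCatch(
  data.frame(x = c(1, 2, 3), y = c(1, 2)),
  error = function(e) "lengths differ"
)
print(mismatch)
```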

As mentioned before, a big part of R is data visualization. So, how do we make graphs?

In 1540, we’ll use many different kinds of plots, different packages that make special plots, and more! But we’ll keep those a surprise for now.

Plots with Base R:

“Base R” refers to R as it is downloaded. Because of the collaborative nature of the R community, thousands of packages exist to do specific things. With data visualization being a primary use of R, lots of them are for plots. Later in the semester, we’ll use the ggplot2 package for making several types of charts/graphs.

The base R plotting function is simply plot(). There are 2 ways to use this function:

plot(x = variablex, y = variabley) OR plot(variabley ~ variablex)

Dr. B prefers the ~ notation.
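Both call styles should draw the same scatterplot. Here is a minimal sketch with two short made-up vectors (dose and response are hypothetical names, not course data):

```r
# Made-up data for illustration only
dose     <- c(1, 2, 3, 4, 5)
response <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Style 1: name the x and y arguments
plot(x = dose, y = response)

# Style 2: formula notation, read as "response as a function of dose"
plot(response ~ dose)
```

Note the order: in the formula style, the y variable comes first and the x variable comes after the ~.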

To use plot(), we need data. For this, we’re going to use a dataset from the Stat2Data package. When you encounter a new dataset, it is best practice to go through a series of exploratory steps so that you know what it contains.

Oh, and those “#” signs? That’s how you make comments in R (just like Python!).

# load a library and data:
library(Stat2Data)
data(SeaIce)

# explore the data:
head(SeaIce) # shows the beginning of the dataset

##   Year Extent Area t
## 1 1979   7.22 4.54 1
## 2 1980   7.86 4.83 2
## 3 1981   7.25 4.38 3
## 4 1982   7.45 4.38 4
## 5 1983   7.54 4.64 5
## 6 1984   7.11 4.04 6

dim(SeaIce) # [rows, columns]

## [1] 37  4

names(SeaIce) # what are the column names?

## [1] "Year"   "Extent" "Area"   "t"

summary(SeaIce) # a stat summary of the data

##       Year          Extent           Area             t     
##  Min.   :1979   Min.   :3.630   Min.   :2.370   Min.   : 1  
##  1st Qu.:1988   1st Qu.:5.590   1st Qu.:3.980   1st Qu.:10  
##  Median :1997   Median :6.540   Median :4.360   Median :19  
##  Mean   :1997   Mean   :6.349   Mean   :4.254   Mean   :19  
##  3rd Qu.:2006   3rd Qu.:7.240   3rd Qu.:4.640   3rd Qu.:28  
##  Max.   :2015   Max.   :7.910   Max.   :5.610   Max.   :37

Now that we know what the data is and what it looks like, let’s plot it. To pull an entire column from the dataset, use $ notation:

dataset$column_name

A way to remember how this notation works is to think of $ as “in” when reading from right to left. For example, SeaIce$Area can be read as “the Area column IN the SeaIce dataset.”
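Here is a minimal sketch of $ notation on a tiny data frame. The name "ice" is made up, and the values are the first three rows of SeaIce, typed by hand from the head() output above, so the example runs without the Stat2Data package.

```r
# First three rows of SeaIce, copied by hand
ice <- data.frame(
  Year   = c(1979, 1980, 1981),
  Extent = c(7.22, 7.86, 7.25),
  Area   = c(4.54, 4.83, 4.38)
)

ice$Area          # the Area column IN ice
ice[["Area"]]     # same column, selected by its name as a string
ice$Area[2]       # second value of that column
```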

plot(SeaIce$Area ~ SeaIce$Year,
     main = "Sea Ice Area per Year", #title
     xlab = "Year", #label x axis
     ylab = "Area" #label y axis
     )

Just like with most things in coding, making explanatory labels and titles for plots is important. This allows for clarity for both the coder and whoever is looking at the results of the code.

Dataframes:

The dataset used above, SeaIce, is stored in a dataframe. Dataframes are essentially tables.

# More Dataframe Exploration:

# load a library and data:
library(Stat2Data)
data(SeaIce)

SeaIce[1, ] # will give first row of dataframe

##   Year Extent Area t
## 1 1979   7.22 4.54 1

SeaIce[ ,1] # will give first col of dataframe

##  [1] 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993
## [16] 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008
## [31] 2009 2010 2011 2012 2013 2014 2015

is(SeaIce) # will give datatype

## [1] "data.frame" "list"       "oldClass"   "vector"

tail(SeaIce) # shows bottom of dataframe

##    Year Extent Area  t
## 32 2010   4.93 3.29 32
## 33 2011   4.63 3.18 33
## 34 2012   3.63 2.37 34
## 35 2013   5.35 3.75 35
## 36 2014   5.29 3.70 36
## 37 2015   4.68 3.37 37

mean(SeaIce$Extent)

## [1] 6.348919

max(SeaIce$Extent)

## [1] 7.91

min(SeaIce$Extent)

## [1] 3.63

sqrt(SeaIce$Extent) # sqrt of each value in the col

##  [1] 2.687006 2.803569 2.692582 2.729469 2.745906 2.666458 2.632489 2.747726
##  [9] 2.740438 2.744085 2.660827 2.503997 2.567100 2.754995 2.557342 2.690725
## [17] 2.485961 2.812472 2.603843 2.572936 2.507987 2.521904 2.603843 2.445404
## [25] 2.485961 2.465766 2.364318 2.439262 2.078461 2.174856 2.321637 2.220360
## [33] 2.151743 1.905256 2.313007 2.300000 2.163331

Dataframes can be created from two or more vectors of the same length in R using data.frame(vector1, vector2).
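Here is a rough sketch of building a dataframe from scratch. The vectors sample.id and read.count are made up for illustration.

```r
# Two made-up vectors of the same length
sample.id  <- c("s1", "s2", "s3")
read.count <- c(120, 340, 95)

samples <- data.frame(sample.id, read.count)
print(samples)

dim(samples)      # [rows, columns]
names(samples)    # column names come from the vector names

# Add a new column computed from an existing one
samples$read.thousands <- samples$read.count / 1000
print(samples)
```

The same exploration functions used on SeaIce (head(), dim(), names(), summary()) work on any dataframe you build this way.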

Computational Biology at Pitt:

BIOSC1540 is a class for Computational Biology majors and nonmajors. Bioinformatics is an ever-growing field with the capability to assist in several disciplines. Even if comp bio is not what you plan to do, this class is helpful to see what possibilities lie beyond standard wet lab trials. If comp bio is your plan, this class will offer a taste of what you can do. This is the first of several bioinformatics classes you will take, so don’t feel discouraged if some of the things we do aren’t wildly interesting. It is impossible to fit everything you can do with computational biology into a single-semester class.

Please reach out with any questions or concerns.