Coding Club 5 - Title TBD

When should you use R vs Excel?

Excel is a standard for many reasons. Here are 3 weak reasons, and 1 strong one.

Excel can be easier, but it also tends to be more limited.
Excel can be faster, but it is also more error-prone.
Excel can make graphs, but they tend to be ugly.
Excel is a true spreadsheet editor, which makes it ideal for recording data.

When not to use spreadsheets

Spreadsheets serve everything to you raw. As you know from a lifetime of eating, it can be dangerous to work with raw material.

Open our old blood_pressure.csv dataset and PBMC data

# Open a new script.
# First, load libraries (Seurat, ggplot2, dplyr)
# Ensure your working directory is set properly.
# Remember, getwd() tells you where you are, and setwd() lets you choose your position.
# Assign these datasets to the variables called "bp.data" and "pbmc.data"
# Remember, the useful functions are "Read10X()" and "read.csv()"

## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')

In programming, you don’t change raw data. Notice how each time we run our script, the raw data is copied (‘read’) into R before we change anything about it. All processing then happens on that copy, and the file on disk stays exactly as it was.
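A minimal sketch of this idea, using a small temporary CSV file as a stand-in for blood_pressure.csv (the file, its columns, and the variable names bp.copy and bp.fresh are made up for illustration). Changing the copy inside R leaves the file untouched, so re-reading it gives back the original values.

```r
# Hypothetical stand-in for blood_pressure.csv, written to a temporary file
raw_path <- tempfile(fileext = ".csv")
write.csv(data.frame(patient = c("A", "B"), systolic = c(120, 135)),
          raw_path, row.names = FALSE)

bp.copy <- read.csv(raw_path)                 # copy ('read') the raw data into R
bp.copy$systolic <- bp.copy$systolic + 10     # change the copy only

bp.fresh <- read.csv(raw_path)                # read the raw file again
print(bp.copy$systolic)                       # the modified copy
print(bp.fresh$systolic)                      # the raw values, unchanged
```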

Two essential concepts

The Container versus the Contents

This applies to every relevant programming language. The container (or “variable”) does not care what’s in it. Think of variables as boxes - anything could be inside, but they only hold one object at a time.
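As a small illustration of the box idea (the variable name box is made up for this sketch): the same container can be given new contents, and whatever was in it before is replaced.

```r
box <- 24              # the container 'box' now holds a number
print(box)

box <- "twenty-four"   # same container, new contents; the number is gone
print(box)
```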

But, why is this concept essential?

# Create a few new variables
# Make one variable called name, give it your name using quotes
# Make one variable called number, give it the number 24
# Make one variable called numbers, give it the value seq(1,10)
# Make one variable called numberWord, give it the value "1"

# What do you predict will happen if you try to:
# Add number to number?
# Add number to name?
# Add number to numbers?
# Add number to numberWord?
# Add number to bp.data?
# Add numberWord to name?
## Is the result here weird to you?

# Add 1 + "1"

Data-types define the rules of what can be done within a language

We can look at what type of data is in the container using class()
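For example, here is class() applied to a few throwaway values (these are illustrative values, not the variables from the exercise above). Note that quotes change the type even when the contents look like a number.

```r
type_of_number   <- class(3.5)     # a number
type_of_text     <- class("3.5")   # looks like a number, but the quotes make it text
type_of_logical  <- class(TRUE)    # a TRUE/FALSE value
type_of_function <- class(mean)    # even functions have a class

print(c(type_of_number, type_of_text, type_of_logical, type_of_function))
```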

# Use class() to identify all of the variables we've created

# Is it sensible to add a number to a number?
# Is it sensible to add a number to a series of numbers?
# Is it sensible to add a number to a name?
# Is it sensible to add a name to a name?

# What themes do we notice?
# What do you think a data-frame is?
# What do you think a Seurat object is?

Modifying variables

The most common way to access specific locations (also known as an index, plural indices) in a variable is by using square brackets.
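A short sketch with a made-up vector called fruits shows both uses of square brackets: reading one position, and replacing what is stored there. R counts positions starting from 1.

```r
fruits <- c("apple", "banana", "cherry")   # hypothetical example vector

second <- fruits[2]          # read the element at position 2
print(second)

fruits[3] <- "cranberry"     # replace the element at position 3
print(fruits)
```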

# What do you think will happen if we try to access numbers[2]
# What do you think will happen if we try to access numbers[11]

# What do you think will happen if we try to access name[1]
# What about if we try to access name[2]?

# Can we be powerful, and assign "Henry" to number[3]?
## What happened?

# What do you think will happen if we check bp.data[2]?
# What about bp.data[[2]]?

# Lastly, what do you think happens if we try to access number["Ian"]
# Can we be clever? Can we access this thing?

Okay, back to Seurat things.

Last time, we saw this weird function

pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-") # New function!

Here, the interpretation is that we’re creating (or overwriting) one specific part of pbmc, calling it “percent.mt”, and filling it with the result of PercentageFeatureSet(). This is a really useful motif: very frequently we’ll be changing the contents of one container, rather than creating many different (unrelated) containers.
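The same motif can be tried on an ordinary data frame. This is only an analogy with invented numbers and column names (cells, total_counts, mt_counts), not how a Seurat object works internally, but the double-bracket assignment reads the same way: add a named piece to an existing container.

```r
# Hypothetical per-cell counts; not real PBMC data
cells <- data.frame(
  total_counts = c(1000, 2500, 400),
  mt_counts    = c(20, 50, 120)
)

# Add a new named column to the existing container instead of making a new variable
cells[["percent.mt"]] <- 100 * cells$mt_counts / cells$total_counts
print(cells)
```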

VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3) # Now we have more information in our pbmc object.

plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt") # FeatureScatter() allows us to create scatter plots of any of our single cell data. My interpretation is that cells with a high percentage of mitochondrial RNA are likely low-quality/dying, and therefore do not express many other RNAs.

plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA") # We should expect a relatively linear relationship between them (more RNA detected in a cell -> more unique genes discovered in that cell)

plot1 + plot2 # Combine and show both plots side-by-side

pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000) # Let's look at the documentation of this function using "?" (for example, ?NormalizeData)
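As the Seurat documentation describes it, "LogNormalize" divides each gene's count by the total counts of its cell, multiplies by scale.factor, and then applies log1p(). The sketch below repeats that arithmetic by hand on a tiny invented count matrix (genes in rows, cells in columns). It is meant to show the calculation only, not Seurat's actual implementation.

```r
# Hypothetical 3-gene x 2-cell count matrix; matrix() fills column by column
counts <- matrix(c(10, 0, 90,
                    5, 5, 40),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 c("cell1", "cell2")))

scale.factor <- 10000
cell_totals  <- colSums(counts)                    # total counts per cell

# Divide each column by its cell total, scale, then take log(1 + x)
normalized <- log1p(sweep(counts, 2, cell_totals, "/") * scale.factor)
print(round(normalized, 2))
```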

pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)

# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(pbmc), 10)

## When using repel, set xnudge and ynudge to 0 for optimal results