Setting up

This is an R-markdown document for PCL1491 - Introduction to bioinformatics.

Let’s start by loading all the packages we will need for this session (or install first, as necessary).

#install.packages('Seurat') # as needed
library(Seurat) # for most if not all of our work with scRNAseq
## Attaching SeuratObject

Now let’s load our data, which is a mouse brain single-cell RNAseq dataset we get from the Allen Institute for Brain Science (AIBS) https://portal.brain-map.org/atlases-and-data/rnaseq/mouse-whole-cortex-and-hippocampus-smart-seq. Our data files are small fractions of the total data stored in this link, to make it easier to handle for all computers.

# replace the paths below to where you have saved "L3_metadata_L.csv" and "L3_count_matrix_L.csv". If your computer is slow, you can try "L3_metadata_S.csv" and "L3_count_matrix_S.csv" instead. Please only use "S" if you know for sure your computer is slow - if, for example, your computer struggled to process the data from last week.

sc_metadata <- read.csv("/external/rprshnas01/kcni/ychen/git/TeachingScraps/L3_metadata_L.csv", row.names=1) # our metadata
sc_matrix <- read.csv("/external/rprshnas01/kcni/ychen/git/TeachingScraps/L3_count_matrix_L.csv", row.names=1) # our count matrix

When starting any new project or research, it is always important to understand the contents, characteristics and distribution of our data (or the data we will be collecting, if it was a brand new experiment). First, let’s take a look at the data we are importing.

head(sc_matrix[1:5, 1:5]) # let's see the first 5 rows and columns of our count matrix
##             sample_name ATP6 ATP8 Aaas Aacs
## 87  F2S4_150427_011_B01    3    0   48   22
## 117 F2S4_150427_003_F01    1    1   11   25
## 185 F2S4_151217_014_E01   12    0    0   76
## 565 F2S4_160115_012_B01   11    0    1    4
## 719 F2S4_160120_014_G01    2    0   55   74

As it is, we have every column (except the first) representing one gene, and every row representing one sample - in single-cell RNAseq (and single-nucleus RNAseq), a sample is a cell, so every row is one cell. The numbers in the main matrix represent the number of RNA fragments, or reads, detected for one gene in one cell. For example, in sc_matrix[3, 2], the cell named “F2S4_151217_014_E01” has 12 RNA reads detected for the gene “ATP6”.

In addition to the above, our count matrix contains a “sample_name” first column, and some row names (e.g: 87 and 117 for the first two rows) that are not continuous and do not begin from 1, which is a left-over from the down-sampling performed previously to produce the small datasets we are working with.

Now let us look at the metadata, which is a process that’s a bit more involved.

head(sc_metadata[1:5, 1:5]) # let's see the first 5 rows and columns of our metadata
##             sample_name  exp_component_name platform_label cluster_color
## 184 F2S4_160620_006_D01  LS-15354_S36_E1-50             SS       #BF7185
## 209 F2S4_180507_103_E01 SM-GE92I_S221_E1-50             SS       #D97C80
## 372 F2S4_171205_060_E01 SM-GE65S_S053_E1-50             SS       #996068
## 676 F2S4_171212_075_D01 SM-GE8ZS_S268_E1-50             SS       #935F50
## 723 F2S4_171206_002_F01 SM-GE8ZS_S038_E1-50             SS       #CC7F64
##     cluster_order
## 184            15
## 209            16
## 372             9
## 676             3
## 723             4

As you can see, the format is similar to the count matrix, with every row being one sample or cell, but now every column being a metadata variable. The row names are also left-over indices from the down-sampling process. It would be good to know what metadata we have available in our dataset, so let’s take a look at our variables/columns.

colnames(sc_metadata) # see our column names, which are our metadata variables
##  [1] "sample_name"                 "exp_component_name"         
##  [3] "platform_label"              "cluster_color"              
##  [5] "cluster_order"               "cluster_label"              
##  [7] "class_color"                 "class_order"                
##  [9] "class_label"                 "subclass_color"             
## [11] "subclass_order"              "subclass_label"             
## [13] "full_genotype_color"         "full_genotype_id"           
## [15] "full_genotype_label"         "sex_color"                  
## [17] "sex_id"                      "donor_sex_label"            
## [19] "region_color"                "region_id"                  
## [21] "region_label"                "cell_type_accession_color"  
## [23] "cell_type_accession_id"      "cell_type_accession_label"  
## [25] "cell_type_alias_color"       "cell_type_alias_id"         
## [27] "cell_type_alias_label"       "cell_type_alt_alias_color"  
## [29] "cell_type_alt_alias_id"      "cell_type_alt_alias_label"  
## [31] "cell_type_designation_color" "cell_type_designation_id"   
## [33] "cell_type_designation_label" "neighborhood_label"         
## [35] "neighborhood_id"             "neighborhood_color"         
## [37] "external_donor_name_color"   "external_donor_name_id"     
## [39] "external_donor_name_label"   "facs_population_plan_color" 
## [41] "facs_population_plan_id"     "facs_population_plan_label" 
## [43] "injection_materials_color"   "injection_materials_id"     
## [45] "injection_materials_label"   "injection_method_color"     
## [47] "injection_method_id"         "injection_method_label"     
## [49] "injection_roi_color"         "injection_roi_id"           
## [51] "injection_roi_label"         "injection_type_color"       
## [53] "injection_type_id"           "injection_type_label"       
## [55] "cortical_layer_label"        "outlier_call"               
## [57] "outlier_type"

There are a few things to note about our metadata.

First, we have a lot of information about our experimental samples and specimen: the sex and genotype of the donor mouse, the brain region from which each cell is collected, and more.

Second, notice the mention of outliers. There were originally outliers in the AIBS data, but these outliers were taken out earlier during the preparation of the data tables for our course.

table(sc_metadata$outlier_call) # see the breakdown of outlier information - True means cell is an outlier, False means cell is not.
## 
## False 
##   814

All cells in our current dataset have “outlier_call” values of “False” - we have 814 rows or cells in our matrix and metadata, and there are 814 “False” values under the “outlier_call” column of our metadata.

Next note that we have have various resolutions of cell groupings - class, neighborhood, subclass, cluster. We will be working with the “subclass” resolution of cell types. We will also be looking at sex differences in gene expression. So let’s see the breakdown of our data.

table(sc_metadata$donor_sex_label, sc_metadata$subclass_label) #let's see the distribution of our data between cell subclasses and biological sex 
##    
##     Astro CA1-ProS CA3 Car3 CR CT SUB DG Endo L2 IT RHP L2/3 IT CTX-1
##   F    11       11  11   11 11     11 11   11        11            11
##   M    11       11  11   11 11     11 11   11        11            11
##    
##     L2/3 IT CTX-2 L2/3 IT ENTl L2/3 IT PPP L3 IT ENT L3 RSP-ACA L4/5 IT CTX
##   F            11           11          11        11         11          11
##   M            11           11          11        11         11          11
##    
##     L5 IT CTX L5 IT TPE-ENT L5 NP CTX L5 PPP L5 PT CTX L6 CT CTX L6 IT CTX
##   F        11            11        11     11        11        11        11
##   M        11            11        11     11        11        11        11
##    
##     L6b CTX L6b/CT ENT Lamp5 Meis2 Micro-PVM NP PPP Oligo Pvalb SMC-Peri Sncg
##   F      11         11    11    11        11     11    11    11       11   11
##   M      11         11    11    11        11     11    11    11       11   11
##    
##     Sst Sst Chodl Vip VLMC
##   F  11        11  11   11
##   M  11        11  11   11

Note we have 11 cells for each subclass-sex group. For example, there are 11 “Sst” cells from males, and another 11 “Sst” cells from females in our dataset. This is because our data was downsampled to meet this criteria.

Starting with Seurat - the beginning

Now that we have an understanding of our data and metadata, we can now move on to creating our Seurat object, which we will use with Seurat functions for scRNAseq data processing and analysis.

Seurat matches the identity of cells between our metadata and count matrix by row names and column names, respectively. So we will need to modify our dataframes a little.

row.names(sc_metadata) <- sc_metadata$sample_name # set sample names to row names
# sc_metadata <- sc_metadata[,-1] 
sc_metadata[1:5,1:4] # see the first 5 rows and 4 columns of our metadata
##                             sample_name  exp_component_name platform_label
## F2S4_160620_006_D01 F2S4_160620_006_D01  LS-15354_S36_E1-50             SS
## F2S4_180507_103_E01 F2S4_180507_103_E01 SM-GE92I_S221_E1-50             SS
## F2S4_171205_060_E01 F2S4_171205_060_E01 SM-GE65S_S053_E1-50             SS
## F2S4_171212_075_D01 F2S4_171212_075_D01 SM-GE8ZS_S268_E1-50             SS
## F2S4_171206_002_F01 F2S4_171206_002_F01 SM-GE8ZS_S038_E1-50             SS
##                     cluster_color
## F2S4_160620_006_D01       #BF7185
## F2S4_180507_103_E01       #D97C80
## F2S4_171205_060_E01       #996068
## F2S4_171212_075_D01       #935F50
## F2S4_171206_002_F01       #CC7F64

All we need to do for the metadata is to set samples names (sc_metadata$sample_name) as the row names. Note how the row name numbers are now replaced with values from the “sample_name” column.

The second line of code is optional - we CAN remove the first column of sample names AFTER setting it as row names if we desire. The row names will be retained in our Seurat object. Leaving the column of sample names in the dataframe is also fine, this would just result in our Seurat object having a somewhat-redundant “sample_name” column as an additional metadata column.

We need to do a bit more work on the count matrix.

row.names(sc_matrix) <- sc_matrix$sample_name # set sample names to row names
sc_matrix <- as.matrix(sc_matrix[,-1]) # remove the first column of sample names, to leave us with a pure matrix
sc_matrix <- t(sc_matrix) # finally, transpose the matrix, as Seurat expects a count matrix formatted with samples as columns and genes as rows
head(sc_matrix[,1:5]) # see the first five columns and rows of our transformed count matrix
##         F2S4_150427_011_B01 F2S4_150427_003_F01 F2S4_151217_014_E01
## ATP6                      3                   1                  12
## ATP8                      0                   1                   0
## Aaas                     48                  11                   0
## Aacs                     22                  25                  76
## Aadac                     0                   0                   0
## Aadacl2                   0                   0                   0
##         F2S4_160115_012_B01 F2S4_160120_014_G01
## ATP6                     11                   2
## ATP8                      0                   0
## Aaas                      1                  55
## Aacs                      4                  74
## Aadac                     0                   0
## Aadacl2                   0                   0

Our final count matrix is now a matrix object (as opposed to a dataframe). It is also transposed, with columns as samples and rows as genes. This is the general format of count matrices for RNAseq data. The sample names stored in the column names here match the sample names stored in the row names in our metadata object.

length(intersect(row.names(sc_metadata), # see how many names (length function) overlap (intersect function) between the row names of sc_metadata...
                 colnames(sc_matrix))) # and the column names of sc_matrix
## [1] 814

We can now create our Seurat object

Seurat_object <- CreateSeuratObject(count = sc_matrix, 
                                    meta.data = sc_metadata) # create our Seurat object
## Warning: The following arguments are not used: row.names
Seurat_object # see the number of genes (features) and cells (samples) in our object - it matches the corresponding dimensions of our initial data
## An object of class Seurat 
## 22084 features across 814 samples within 1 assay 
## Active assay: RNA (22084 features, 0 variable features)

Both our metadata and our count matrix information, matched by sample names, are stored in the Seurat object.

Seurat_object@meta.data[1:5, 1:7] # see the first 5 rows and first 7 columns of our metadata
##                     orig.ident nCount_RNA nFeature_RNA         sample_name
## F2S4_150427_011_B01       F2S4     760015         8523 F2S4_150427_011_B01
## F2S4_150427_003_F01       F2S4     694742         9444 F2S4_150427_003_F01
## F2S4_151217_014_E01       F2S4    1452266         5351 F2S4_151217_014_E01
## F2S4_160115_012_B01       F2S4    1351557         4951 F2S4_160115_012_B01
## F2S4_160120_014_G01       F2S4    1473413         9161 F2S4_160120_014_G01
##                     exp_component_name platform_label cluster_color
## F2S4_150427_011_B01  US-1250275_E2_S19             SS       #2FBCE5
## F2S4_150427_003_F01  US-1250275_E2_S60             SS       #0A75B1
## F2S4_151217_014_E01 LS-14690_S45_E1-50             SS       #72B09C
## F2S4_160115_012_B01 LS-15001_S90_E1-50             SS       #332F26
## F2S4_160120_014_G01 LS-15003_S63_E1-50             SS       #A7A322
Seurat_object@assays$RNA@counts[1:5, 1:5] # see the first 5 rows and columns of our count matrix
## 5 x 5 sparse Matrix of class "dgCMatrix"
##       F2S4_150427_011_B01 F2S4_150427_003_F01 F2S4_151217_014_E01
## ATP6                    3                   1                  12
## ATP8                    .                   1                   .
## Aaas                   48                  11                   .
## Aacs                   22                  25                  76
## Aadac                   .                   .                   .
##       F2S4_160115_012_B01 F2S4_160120_014_G01
## ATP6                   11                   2
## ATP8                    .                   .
## Aaas                    1                  55
## Aacs                    4                  74
## Aadac                   .                   .

Our data is stored in one Seurat object now. There are a few changes we should note above.

Our metadata now has a few extra columns at the beginning, like n_Count_RNA and n_Feature_RNA, which represent the total number of reads and the number of detected genes in each cell. This was generated by Seurat.

Our count matrix is now stored as a sparse matrix, where 0’s are replaced with “.”. This is also done by Seurat, with the intention to save memory space.

We next need to normalize our count data. This is important to consider as, for example, we do not want to confound increased gene expression because of biological reasons with increased RNA amplification due to technical differences.

Seurat_object <- NormalizeData(Seurat_object, 
                               normalization.method = "LogNormalize", 
                               scale.factor = 1000000) # we use a scale factor of 1 million to convert our values to counts-per-million

Seurat_object@assays$RNA@data[1:10,1:5] # our normalized data is stored here; preview/see our normalized data
## 10 x 5 sparse Matrix of class "dgCMatrix"
##         F2S4_150427_011_B01 F2S4_150427_003_F01 F2S4_151217_014_E01
## ATP6               1.598840           0.8917452            2.226022
## ATP8               .                  0.8917452            .       
## Aaas               4.161328           2.8233541            .       
## Aacs               3.399422           3.6105011            3.976537
## Aadac              .                  .                    .       
## Aadacl2            .                  .                    .       
## Aadacl3            .                  .                    .       
## Aadat              .                  .                    .       
## Aaed1              2.896184           2.1037580            2.297726
## Aagab              4.330714           4.5183167            .       
##         F2S4_160115_012_B01 F2S4_160120_014_G01
## ATP6              2.2125249           0.8575562
## ATP8              .                   .        
## Aaas              0.5538204           3.6461885
## Aacs              1.3761303           3.9361990
## Aadac             .                   .        
## Aadacl2           .                   .        
## Aadacl3           .                   .        
## Aadat             .                   .        
## Aaed1             5.1512889           .        
## Aagab             .                   4.4365107
Seurat_object@assays$RNA@counts[1:10,1:5] # our counts are still present in our Seurat object, which we can see here
## 10 x 5 sparse Matrix of class "dgCMatrix"
##         F2S4_150427_011_B01 F2S4_150427_003_F01 F2S4_151217_014_E01
## ATP6                      3                   1                  12
## ATP8                      .                   1                   .
## Aaas                     48                  11                   .
## Aacs                     22                  25                  76
## Aadac                     .                   .                   .
## Aadacl2                   .                   .                   .
## Aadacl3                   .                   .                   .
## Aadat                     .                   .                   .
## Aaed1                    13                   5                  13
## Aagab                    57                  63                   .
##         F2S4_160115_012_B01 F2S4_160120_014_G01
## ATP6                     11                   2
## ATP8                      .                   .
## Aaas                      1                  55
## Aacs                      4                  74
## Aadac                     .                   .
## Aadacl2                   .                   .
## Aadacl3                   .                   .
## Aadat                     .                   .
## Aaed1                   232                   .
## Aagab                     .                 123

In many Seurat functions, the output is often saved into a proper part of the Seurat object. In this case, our normalized data is saved under “data”, separate from “counts”, thus allowing us to keep both.

Extra/Optional - PCA, clustering, and UMAP visualization of clusters

We already have cell clusters and groups (e.g: classes and subclasses) defined by the AIBS. To simulate the experience of finding cluster, we can perform some machine learning techniques in Seurat, starting with a PCA.

### First, we need to scale the data before PCA

Seurat_object <- ScaleData(Seurat_object, 
                           features = rownames(Seurat_object)) # we are going to scale the data,and do this for all genes
## Centering and scaling data matrix
Seurat_object@assays$RNA@scale.data[1:10,1:5] # see our new scaled data, stored in yet another part of the Seurat object
##         F2S4_150427_011_B01 F2S4_150427_003_F01 F2S4_151217_014_E01
## ATP6             0.69011371         -0.28211938          1.55246977
## ATP8            -0.46379888          2.17872518         -0.46379888
## Aaas             1.39146774          0.66251443         -0.87570217
## Aacs             0.59639887          0.70040387          0.88076090
## Aadac            0.00000000          0.00000000          0.00000000
## Aadacl2         -0.04724882         -0.04724882         -0.04724882
## Aadacl3         -0.04495216         -0.04495216         -0.04495216
## Aadat           -0.12831727         -0.12831727         -0.12831727
## Aaed1            1.01476245          0.51437290          0.63685703
## Aagab            0.83592693          0.92621027         -1.24821646
##         F2S4_160115_012_B01 F2S4_160120_014_G01
## ATP6             1.53391099         -0.32912822
## ATP8            -0.46379888         -0.46379888
## Aaas            -0.57397036          1.11081016
## Aacs            -0.40053953          0.86088525
## Aadac            0.00000000          0.00000000
## Aadacl2         -0.04724882         -0.04724882
## Aadacl3         -0.04495216         -0.04495216
## Aadat           -0.12831727         -0.12831727
## Aaed1            2.43878295         -0.81407732
## Aagab           -1.24821646          0.88684133
### Run our PCA, for ALL of our genes

Seurat_object <- RunPCA(Seurat_object,
                        features = rownames(Seurat_object))
## Warning in PrepDR(object = object, features = features, verbose = verbose): The
## following 3454 features requested have zero variance (running reduction without
## them): Aadac, Acrv1, Actbl2, Actl11, Actrt1, Actrt2, Adam20, Adam25, Adam39,
## Adam6a, Adam6b, Adh6.ps1, Ahsg, Akap4, Akp.ps1, Alpi, Amd.ps2, Amd.ps5, Amtn,
## Amy2a2, Amy2a3, Ang.ps3, Ang5, Ankrd33, Antxrl, Anxa2r, Apcs, Api5.ps, Aplnr,
## Apoa4, Apol10c.ps, Apol11a, Apol11b, Apol7c, Aqp2, Aqp3, Aqp8, Arl14, Art2b,
## Astx, Astx6, Atp1b4, Barx1, Bc1.ps1, Becn2, Bhlha15, Bhlhe23, Bhmt.ps1, Bmp10,
## Bmp4.ps, Bpifa2, Bpifa3, Bpifa5, Bpifb6, Bpifb9b, Bricd5, Bsnd, Btg1.ps2, Btnl1,
## Btnl4, Btnl5.ps, Btnl6, C1qtnf5, Cabs1, Calm4, Calm5, Casp14, Catsper1, Ccdc185,
## Ccdc54, Ccdc70, Ccer1, Ccl19.ps1, Ccl19.ps2, Ccl20, Ccl21b, Ccl27b, Ccnb2.ps,
## Ccr1l1, Cd2, Cd209d, Cd300c, Cd40lg, Cd5l, Cd69, Cdhr5, Ceacam11, Ceacam12,
## Ceacam15, Ceacam18, Cebpe, Cela3a, Cers1, Ces1b, Ces1h, Ces2c, Ces2f, Cetn1,
## Chad, Chrnd, Clca2, Cldn13, Cldn17, Cldn24, Cldn4, Cldn6, Cldn8, Clec2j, Clec4e,
## Cox8c, Cphx2, Cphx3, Cplx4, Crct1, Crnn, Crp, Crygc, Crygd, Cryge, Crygf,
## Csn1s2b, Csn2, Csn3, Cst9, Csta1, Ctag2, Ctcflos, Ctla4, Cts3, Cts7, Cts8.ps,
## Ctsg, Ctsll3, Ctsm, Ctsq, Ctsr, Cxcl10, Cyp11b2, Cyp17a1, Cyp1a2, Cyp21a1,
## Cyp2a12, Cyp2a4, Cyp2c37, Cyp2c73.ps, Cyp2d12, Cyp2d13, Cyp2d26, Cyp2d32.ps,
## Cyp2d34.ps, Cyp2d35.ps, Cyp2d37.ps, Cyp2d38.ps, Cyp2d39.ps, Cyp2d40, Cyp2d67.ps,
## Cyp3a11, Cyp3a41a, Cyp3a60.ps, Cyp4a14, Cyp4a31, Cyp4b1.ps1, Cyp4b1.ps2,
## Cyp4f38.ps, Cyp7a1, Cyp8b1, Cypt1, Cypt12, Cypt14, Cypt15, Cypt2, Cypt3, Cypt4,
## Cytl1, D11Bhm181e, D5Ertd577e, D7Ertd143e, Dbil5.ps, Dcpp1, Dcpp2, Dcpp3, Ddi1,
## Defa.ps1, Defa.ps10, Defa.ps11, Defa.ps12, Defa.ps14, Defa.ps15, Defa.ps16,
## Defa.ps17, Defa.ps18, Defa.ps6, Defa.rs1, Defa.rs7, Defa17, Defa2, Defa20,
## Defa21, Defa22, Defa23, Defa24, Defa25, Defa26, Defa3, Defa5, Defb10, Defb11,
## Defb12, Defb13, Defb15, Defb18, Defb19, Defb2, Defb21, Defb25, Defb28, Defb29,
## Defb3, Defb30, Defb37, Defb38, Defb4, Defb40, Defb42, Defb45, Defb46, Defb50,
## Defb7, Defb8, Defb9, Dhx58os, Dkk1, Dmr, Dmrtc1b, Dmrtc1c1, Dmrtc1c2, Dnajb8,
## Dpep3, Dppa3.ps, Dppa5b, Dppa5c, Dreh, Duoxa2, Dusp21, Duxbl2, Duxbl3, Ear.ps1,
## Ear.ps10, Ear.ps12, Ear.ps2, Ear.ps3, Ear.ps4, Ear.ps5, Ear.ps7, Ear.ps8,
## Ear.ps9, Ear10, Ear6, Eddm3b, Elane, Elf3, Emilin3, En1, Epo, Eppin, Ereg, Esp1,
## Esp15, Esp16, Esp18, Esp23, Esp36, Esp4, Esp5, Esp6, Evx1, Evx2, F11, Fabp1,
## Fabp2, Fabp9, Fam154aos, Fam170a, Fam170b, Fam71b, Fam83g, Fbxw14, Fbxw20,
## Fbxw28, Fcamr, Fcer1a, Ffar1, Ffar3, Fga, Fgf20, Fgf4, Fgfr3.ps, Fmo3, Foxa1,
## Foxa2, Foxb1, Foxb2, Foxe3, Foxi2, Foxl1, Fpr.rs3, Fpr.rs4, Fpr.rs5, Fpr.rs6,
## Fpr.rs7, Fpr3, Frmpd1os, Fscb, Fshb, Fthl17, Ftmt, G6b, G6bos, Gabarapl2.ps,
## Gata1, Gata5os, Gcm2, Gcsam, Gdf1, Gdf2, Gfi1b, Gh, Ghsr, Gja10, Gjb5, Gk2,
## Gkn2, Glycam1, Gm12, Gm156, Gm318, Gm362, Gm371, Gm378, Gm379, Gm381, Gm40,
## Gm412, Gm428, Gm438, Gm454, Gm46, Gm53, Gm649, Gm725, Gm732, Gm839, Gm904,
## Gm906, Gm94, Gmcl1l, Gnb1.ps3, Gngt1, Gpha2, Gpihbp1, Gpr144.ps, Gpr31a,
## Gpr31b, Gpr31c, Gprc6a, Gsc, Gsc2, Gsdmc2, Gsdmcl1, Gsdmcl2, Gsx1, Gtsf1l, Gzmb,
## Gzmd, Gzme, Gzmf, Gzmg, H1fnt, H2.Ea.ps, H2.M1, H2.M10.3, H2.M10.4, H2.M10.6,
## H2.M11, H2.M9, H2af.ps, H2afb1, H2afb2, H2afb3, H2bfm, H60c, Hand1, Hand2,
## Hba.ps3, Hba.x, Hbb.bh1, Hbb.bh2, Hbb.bh3, Hbb.y, Higd1c, Hist1h2ah, Hist1h2aj,
## Hmgb1.ps11, Hmgb3.ps, Hmgb4, Hmgb4os, Hmx2, Hmx3, Hopxos, Hotair, Hottip,
## Hoxa10, Hoxa11os, Hoxa13, Hoxa2, Hoxa4, Hoxa6, Hoxa9, Hoxaas3, Hoxb1, Hoxb13,
## Hoxb4, Hoxb5, Hoxb6, Hoxb7, Hoxb8, Hoxc10, Hoxc11, Hoxc12, Hoxc13, Hoxc4,
## Hoxc5, Hoxc9, Hoxd10, Hoxd11, Hoxd12, Hoxd13, Hoxd3os1, Hoxd4, Hoxd9, Hsd3b4,
## Hspe1.ps4, Hypm, Iapp, Ifi44l, Ifna.ps1, Ifna1, Ifna11, Ifna12, Ifna13, Ifna14,
## Ifna15, Ifna16, Ifna2, Ifna4, Ifna6, Ifna7, Ifna9, Ifne, Ifng, Ifnl2, Ifnl3,
## Ifnz, Igfals, Igfl3, Ighd, Ighd1.1, Ighd2.3, Ighd2.4, Ighd2.5, Ighd2.6, Ighd2.7,
## Ighd2.8, Ighd3.1, Ighd3.2, Ighd4.1, Ighd5.1, Ighd5.2, Ighd5.3, Ighd5.4, Ighd5.5,
## Ighd5.6, Ighd5.7, Ighd5.8, Ighd6.1, Ighd6.2, Ighe, Ighg1, Ighg2b, Ighg2c,
## Ighg3, Ighj1, Ighj2, Ighj3, Ighj4, Ighm, Ighv1.1, Ighv1.10, Ighv1.11, Ighv1.12,
## Ighv1.13, Ighv1.14, Ighv1.15, Ighv1.16, Ighv1.17, Ighv1.17.1, Ighv1.18,
## Ighv1.19, Ighv1.19.1, Ighv1.2, Ighv1.20, Ighv1.21, Ighv1.21.1, Ighv1.22,
## Ighv1.23, Ighv1.24, Ighv1.25, Ighv1.26, Ighv1.27, Ighv1.28, Ighv1.29, Ighv1.3,
## Ighv1.30, Ighv1.31, Ighv1.32, Ighv1.33, Ighv1.34, Ighv1.35, Ighv1.36, Ighv1.37,
## Ighv1.38, Ighv1.39, Ighv1.4, Ighv1.40, Ighv1.41, Ighv1.42, Ighv1.43, Ighv1.44,
## Ighv1.45, Ighv1.46, Ighv1.47, Ighv1.48, Ighv1.49, Ighv1.5, Ighv1.50, Ighv1.51,
## Ighv1.52, Ighv1.53, Ighv1.54, Ighv1.55, Ighv1.56, Ighv1.57, Ighv1.58, Ighv1.59,
## Ighv1.6, Ighv1.60, Ighv1.61, Ighv1.62, Ighv1.62.1, Ighv1.62.2, Ighv1.62.3,
## Ighv1.63, Ighv1.64, Ighv1.65, Ighv1.66, Ighv1.67, Ighv1.68, Ighv1.69, Ighv1.7,
## Ighv1.70, Ighv1.71, Ighv1.72, Ighv1.73, Ighv1.74, Ighv1.75, Ighv1.76, Ighv1.77,
## Ighv1.78, Ighv1.79, Ighv1.8, Ighv1.80, Ighv1.81, Ighv1.82, Ighv1.83, Ighv1.84,
## Ighv1.85, Ighv1.86, Ighv1.9, Ighv10.1, Ighv10.2, Ighv10.3, Ighv10.4, Ighv11.1,
## Ighv11.2, Ighv12.1, Ighv12.2, Ighv12.3, Ighv13.1, Ighv13.2, Ighv14.1, Ighv14.2,
## Ighv14.3, Ighv14.4, Ighv15.1, Ighv15.2, Ighv16.1, Ighv2.1, Ighv2.2, Ighv2.3,
## Ighv2.4, Ighv2.5, Ighv2.6, Ighv2.6.8, Ighv2.7, Ighv2.8, Ighv2.9, Ighv2.9.1,
## Ighv3.1, Ighv3.2, Ighv3.3, Ighv3.4, Ighv3.5, Ighv3.6, Ighv3.7, Ighv3.8,
## Ighv4.1, Ighv4.2, Ighv5.1, Ighv5.10, Ighv5.11, Ighv5.12, Ighv5.12.4, Ighv5.13,
## Ighv5.15, Ighv5.16, Ighv5.17, Ighv5.18, Ighv5.19, Ighv5.2, Ighv5.21, Ighv5.3,
## Ighv5.4, Ighv5.5, Ighv5.6, Ighv5.7, Ighv5.8, Ighv5.9, Ighv5.9.1, Ighv6.1,
## Ighv6.2, Ighv6.3, Ighv6.4, Ighv6.5, Ighv6.6, Ighv6.7, Ighv7.1, Ighv7.2,
## Ighv7.3, Ighv7.4, Ighv8.1, Ighv8.10, Ighv8.11, Ighv8.12, Ighv8.13, Ighv8.14,
## Ighv8.15, Ighv8.16, Ighv8.2, Ighv8.3, Ighv8.4, Ighv8.5, Ighv8.6, Ighv8.7,
## Ighv8.8, Ighv8.9, Ighv9.1, Ighv9.2, Ighv9.3, Ighv9.4, Igkc, Igkj1, Igkj2,
## Igkj3, Igkj4, Igkj5, Igkv1.108, Igkv1.110, Igkv1.115, Igkv1.117, Igkv1.122,
## Igkv1.131, Igkv1.132, Igkv1.133, Igkv1.135, Igkv1.136, Igkv1.35, Igkv1.88,
## Igkv1.99, Igkv10.94, Igkv10.95, Igkv10.96, Igkv11.106, Igkv11.114, Igkv11.118,
## Igkv11.125, Igkv12.38, Igkv12.40, Igkv12.41, Igkv12.42, Igkv12.44, Igkv12.46,
## Igkv12.47, Igkv12.49, Igkv12.66, Igkv12.67, Igkv12.89, Igkv12.98, Igkv13.54.1,
## Igkv13.55.1, Igkv13.56.1, Igkv13.57.1, Igkv13.57.2, Igkv13.61.1, Igkv13.62.1,
## Igkv13.64, Igkv13.71.1, Igkv13.73.1, Igkv13.74.1, Igkv13.76, Igkv13.78.1,
## Igkv13.80.1, Igkv13.82, Igkv13.84, Igkv13.85, Igkv13.87, Igkv13.89.1,
## Igkv14.100, Igkv14.111, Igkv14.118.1, Igkv14.118.2, Igkv14.126, Igkv14.126.1,
## Igkv14.130, Igkv14.134.1, Igkv15.101, Igkv15.101.1, Igkv15.102, Igkv15.103,
## Igkv15.97, Igkv16.104, Igkv17.121, Igkv17.127, Igkv17.134, Igkv18.36, Igkv19.93,
## Igkv2.105, Igkv2.107, Igkv2.109, Igkv2.112, Igkv2.113, Igkv2.116, Igkv2.137,
## Igkv2.93.1, Igkv2.95.1, Igkv2.95.2, Igkv20.101.2, Igkv3.1, Igkv3.10, Igkv3.11,
## Igkv3.12, Igkv3.12.1, Igkv3.2, Igkv3.3, Igkv3.4, Igkv3.5, Igkv3.6, Igkv3.7,
## Igkv3.8, Igkv3.9, Igkv4.50, Igkv4.51, Igkv4.53, Igkv4.54, Igkv4.55, Igkv4.56,
## Igkv4.57, Igkv4.57.1, Igkv4.58, Igkv4.59, Igkv4.60, Igkv4.61, Igkv4.62,
## Igkv4.63, Igkv4.65, Igkv4.68, Igkv4.69, Igkv4.70, Igkv4.71, Igkv4.72, Igkv4.73,
## Igkv4.74, Igkv4.75, Igkv4.77, Igkv4.78, Igkv4.79, Igkv4.80, Igkv4.81, Igkv4.83,
## Igkv4.86, Igkv4.90, Igkv4.91, Igkv4.92, Igkv5.37, Igkv5.39, Igkv5.40.1,
## Igkv5.43, Igkv5.45, Igkv5.48, Igkv6.13, Igkv6.14, Igkv6.15, Igkv6.17, Igkv6.20,
## Igkv6.23, Igkv6.25, Igkv6.29, Igkv6.32, Igkv7.33, Igkv8.16, Igkv8.18, Igkv8.19,
## Igkv8.21, Igkv8.22, Igkv8.23.1, Igkv8.24, Igkv8.26, Igkv8.27, Igkv8.28,
## Igkv8.30, Igkv8.31, Igkv8.34, Igkv9.119, Igkv9.120, Igkv9.123, Igkv9.124,
## Igkv9.128, Igkv9.129, Iglc1, Iglc2, Iglc3, Iglc4, Iglj1, Iglj2, Iglj3, Iglj3p,
## Iglj4, Iglv1, Iglv2, Iglv3, Il11ra2, Il13, Il17a, Il1f8, Il1f9, Il20, Il24,
## Il31, Il4i1, Il9, Ins1, Ins2, Iqcf1, Iqcf4, Iqcf6, Irx5, Itgb2l, Itih1, Itih4,
## Itpa.ps3, Izumo3, Kars.ps1, Kcna10, Kcnj1, Kcnk16, Kif12, Kif2b, Kifc5c.ps,
## Kir3dl1, Kis2, Klk1b1, Klk1b10.ps, Klk1b11, Klk1b14.ps, Klk1b16, Klk1b18.ps,
## Klk1b2.ps, Klk1b21, Klk1b23.ps, Klk1b24, Klk1b26, Klk1b28.ps, Klk1b5, Klk1b7.ps,
## Klk1b9, Klk4, Klra1, Klra10, Klra13.ps, Klra14.ps, Klra3, Klra6, Klra7, Klrc2,
## Klrc3, Klri2, Krt32, Krt33a, Krt34, Krt35, Krt39, Krt4, Krt6a, Krt6b, Kr
## PC_ 1 
## Positive:  Syp, Ckmt1, Atp1a3, Sult4a1, Napb, Syngr3, Sncb, Dnm1, Atp6v1g2, Tmem59l 
##     Snap25, Vsnl1, Stmn3, Mllt11, Ndrg4, Gabrg2, Syt4, Stmn2, Syn1, Pdxp 
##     Tuba4a, Faim2, Pacsin1, Scn1b, Sept3, Thy1, Tubb3, Scn8a, Mapk10, Syt1 
## Negative:  Zfp36l1, Sepp1, Sparc, Ifitm3, Gng5, Zfp36, Fcgrt, Gng11, Apoe, Epas1 
##     Cyba, Cd63.ps, Neat1, Tcf7l2, Cd63, Serpinh1, Id3, Ctsh, Tcn2, Vamp8 
##     Hspb1, Pon2, Ifitm2, S100a13, Txnip, Msn, Cfh, Mertk, Zfp36l2, Clic1 
## PC_ 2 
## Positive:  Eif1, Ubb, Cox7c, Ppia, Cox6c, Bex2, Bex1, Usmg5, Uqcrb, Cox7a2 
##     Ndufb4, Ank3, Cox7b, Ahi1, Cox5b, Pcsk1n, Atp5k, Cox17, Scg5, Ndufs5 
##     Cox6a1, Bex4, Ngfrap1, Pam16, Selm, Stmn1, Caly, Cox6b1, Nrxn1, Anks1b 
## Negative:  Plat, Serinc3, Akap13, Dusp6, Zfp36l1, Sepp1, Il10rb, Ctsh, Arpc1b, H2.K1 
##     Grn, Serp1, Txnip, Ptn, Zfp36, Ctsd, Gnai2, Gatm, Sparc, Ctsa 
##     Fcgrt, Tcn2, Lamp2, B3gnt2, Surf4, Ifngr1, Ifitm3, Bst2, Nbl1, Cmtm6 
## PC_ 3 
## Positive:  Slc6a1, Slc32a1, Gad2, Dlx1, Gad1, Arx, Sox2ot, Dlx1as, Dlx5, Dlx2 
##     Rnd2, Kcnmb2, Kcnip1, Erbb4, Dlx6os1, Sox2, Npy, Zfp536, Lhx6, Tspan2 
##     Cort, Afap1, Alk, Btbd11, Rpp25, Igf1, Grik1, Nxph1, Zcchc12, Dlx6 
## Negative:  Slc17a7, Neurod2, Nrgn, Nrn1, Rprml, Itpka, Arpp21, Neurod6, Ier5, Ptk2b 
##     Sv2b, Zbtb18, Baiap2, Satb2, Tbr1, Ano3, Egr3, Lingo1, Fermt3, Slit3 
##     Mast3, Ankrd33b, Fam212b, Lzts1, Fmnl1, Cacna2d1, Egr1, Ctxn1, Atp1a1, Extl1 
## PC_ 4 
## Positive:  Gpr37l1, Cyp4f15, Acsbg1, Ppp1r3g, Slc39a12, Dio2, Cmtm5, Lcat, Nat8, Slc7a10 
##     Mlc1, Scrg1, Timp4, Hepacam, Cdh19, Gpc5, Grin2c, Aqp4, Slc25a18, Cldn10 
##     Bcan, Gli3, Rfx4, Ank2, Fgfr3, Ednrb, Cyp2j9, Kcnj10, Chrdl1, F3 
## Negative:  Tmsb4x, Myl6, Rps11, Myl12b, Cldn5, Pglyrp1, Ly6a, Eltd1, Slco1a4, Ly6c1 
##     Ctla2a, Ly6c2, Hspb1, Gpr116, Abcb1a, Hsp25.ps1, Icam2, Flt1, Tuba1b, Ebf1 
##     Bsg, Emcn, Tie1, Nostrin, Atox1, Cyyr1, Rpl41, Eng, Sox17, Eef1a1 
## PC_ 5 
## Positive:  Fcrls, Laptm5, Tyrobp, C1qb, Csf1r, C1qc, Cd53, Hmha1, C1qa, Pld4 
##     P2ry13, Fcgr3, Spi1, Fcer1g, Emr1, Ltc4s, Ctss, P2ry12, Fyb, Tbxas1 
##     Cx3cr1, Cd86, Ptprc, Dock2, P2ry6, Ptpn18, Aif1, Siglech, Dock8, Gpr34 
## Negative:  Sptbn1, Tsc22d1, Fermt2, Sparcl1, Serpinh1, Nedd4, Fxyd1, Ptn, Dynll1, Epas1 
##     Crip2, Gng11, Foxc1, Igfbp7, Plat, Mfge8, Tm4sf1, Id3, Slc38a3, Hspb1 
##     Cnn2, Hes1, Hsp25.ps1, Rbpms, Atp1a2, Rhoc, Vim, Wfdc1, Gja1, Nckap1
print(Seurat_object@reductions$pca, dims = 1:5, nfeatures = 5) # see some of our PCs
## PC_ 1 
## Positive:  Syp, Ckmt1, Atp1a3, Sult4a1, Napb 
## Negative:  Zfp36l1, Sepp1, Sparc, Ifitm3, Gng5 
## PC_ 2 
## Positive:  Eif1, Ubb, Cox7c, Ppia, Cox6c 
## Negative:  Plat, Serinc3, Akap13, Dusp6, Zfp36l1 
## PC_ 3 
## Positive:  Slc6a1, Slc32a1, Gad2, Dlx1, Gad1 
## Negative:  Slc17a7, Neurod2, Nrgn, Nrn1, Rprml 
## PC_ 4 
## Positive:  Gpr37l1, Cyp4f15, Acsbg1, Ppp1r3g, Slc39a12 
## Negative:  Tmsb4x, Myl6, Rps11, Myl12b, Cldn5 
## PC_ 5 
## Positive:  Fcrls, Laptm5, Tyrobp, C1qb, Csf1r 
## Negative:  Sptbn1, Tsc22d1, Fermt2, Sparcl1, Serpinh1

We output the first 5 PC’s resulting from our PCA. We can see, for example, that the first PC is positively correlated with the genes Syp, Ckmt1, Atp1a3, Sult4a1, and Napb.

We can next find clusters and visualize them via a method/measure called UMAP. You can think of UMAP as method to help us visualize our PCA results.

Seurat_object <- FindNeighbors(Seurat_object) # find neighbours (cells close to one another)
## Computing nearest neighbor graph
## Computing SNN
Seurat_object  <- FindClusters(Seurat_object) # find clusters
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 814
## Number of edges: 19940
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8427
## Number of communities: 13
## Elapsed time: 0 seconds
Seurat_object <- RunUMAP(Seurat_object, 
                         dims = 1:10) # run umap for a number of dimensions, we'll go with 10
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 10:52:58 UMAP embedding parameters a = 0.9922 b = 1.112
## 10:52:58 Read 814 rows and found 10 numeric columns
## 10:52:58 Using Annoy for neighbor search, n_neighbors = 30
## 10:52:58 Building Annoy index with metric = cosine, n_trees = 50
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 10:52:58 Writing NN index file to temp file /tmp/RtmpLRoVuv/file138b3aecf504
## 10:52:58 Searching Annoy index using 1 thread, search_k = 3000
## 10:52:59 Annoy recall = 100%
## 10:52:59 Commencing smooth kNN distance calibration using 1 thread
## 10:53:00 Initializing from normalized Laplacian + noise
## 10:53:00 Commencing optimization for 500 epochs, with 27116 positive edges
## 10:53:03 Optimization finished
### plotting

DimPlot(Seurat_object, # our data object
        reduction = "umap", # the reduction method we want to plot; we could put PCA here, but it would result in a hard-to-interpret plot 
        label = TRUE) + # show the labels of our clusters 
        NoLegend() # we don't need a legend for our purposes, it would be redundant

We end up with 13 clusters, labelled “0” to “12”. If we were working from fresh data, we would now work to try to identify what these clusters mean biologically. This can be a difficult and tedious process. Thankfully we have cell group labels from the AIBS. We can see how their subclasses, for example, look on our UMAP plot.

Idents(Seurat_object) <- "subclass_label" # The Idents() function lets us specify which grouping variable we want to work with

### run the same code to generate our UMAP plot, but now we have different active "identities" due to the code above

DimPlot(Seurat_object, 
        reduction = "umap", 
        label = TRUE) +  
        NoLegend()

Note we have some distinct clusters with a few others more vaguely grouped together. This is likely because we are only working with a subset of the data, and that some cell subclasses are close to others.

Differential gene expression testing

Now for your assignment. Each of you have been assigned one subclass. We can filter our object for our given subclass with the following code.

Seurat_object <- subset(Seurat_object, subclass_label == "Sst") # we select for Sst cells only

table(Seurat_object$subclass_label, Seurat_object$donor_sex_label) # notice we can call metadata variables as if the Seurat object was a dataframe
##      
##        F  M
##   Sst 11 11

Now we only have 22 cells in our Seurat object - 11 Sst cells from female mice, and 11 Sst cells from Male mice.

We can now look for differentially expressed genes (DEG’s) between cells from male and female mice.

Idents(Seurat_object) <- "donor_sex_label" # we set "donor_sex_label" as the active identity via "Idents()", this lets Seurat know we're using the biological sex information stored in "donor_sex_label" to define our comparison groups

results <- FindMarkers(Seurat_object, # our data
                       ident.1 = "M", # what group are we looking for DEG's for
                       logfc.threshold = 2.5, # what fold-change difference minimum do we want to see between M and F for us to consider testing it
                       min.pct = .35, # what percent of "ident.1" cells do we want to express a gene in order for us to consider testing it
                       only.pos = T) # do we want only genes that are enriched/greater in "ident.1"

And we’re done with Seurat! We’re only playing around with logfc.threshold, min.pct, and only.pos to increase the speed of our analysis. In a real analysis, we would adjust these thresholds based on what kind of DEG’s we’re looking for.

We can now take our results for further analyses.

write.csv(results, "results.csv")