RNA-seq analysis for B7H3 KO clones 2

## Warning: package(s) not installed when version(s) same as current; use `force = TRUE` to
##   re-install: 'ggplot2'

Load in the data (aligned reads count) into the vector counts

## [1] "/Users/nataliamunozperez/Documents/Einstein/PhD/Martin Lab/Year 1/Data/RNA-seq/counts"

## [1] "/Users/nataliamunozperez/Documents/Einstein/PhD/Martin Lab/Year 1/Data/RNA-seq/counts"

##    parent_1A          parent_2A          parent_3A         Sample_2_1_1     
##  Min.   :       0   Min.   :       0   Min.   :       0   Min.   :       0  
##  1st Qu.:       0   1st Qu.:       0   1st Qu.:       0   1st Qu.:       0  
##  Median :       0   Median :       0   Median :       0   Median :       0  
##  Mean   :    1661   Mean   :    1286   Mean   :    1326   Mean   :    1533  
##  3rd Qu.:      36   3rd Qu.:      34   3rd Qu.:      28   3rd Qu.:      34  
##  Max.   :27002543   Max.   :20936225   Max.   :22118521   Max.   :27493245  
##   Sample_2_1_2       Sample_2_1_3       Sample_2_2_1       Sample_2_2_2     
##  Min.   :       0   Min.   :       0   Min.   :       0   Min.   :       0  
##  1st Qu.:       0   1st Qu.:       0   1st Qu.:       0   1st Qu.:       0  
##  Median :       0   Median :       0   Median :       0   Median :       0  
##  Mean   :    1872   Mean   :    1626   Mean   :    1473   Mean   :    1387  
##  3rd Qu.:      40   3rd Qu.:      36   3rd Qu.:      31   3rd Qu.:      29  
##  Max.   :33341466   Max.   :29367119   Max.   :24550302   Max.   :23262095  
##   Sample_2_2_3       Sample_6_1_1       Sample_6_1_2       Sample_6_1_3     
##  Min.   :       0   Min.   :       0   Min.   :       0   Min.   :       0  
##  1st Qu.:       0   1st Qu.:       0   1st Qu.:       0   1st Qu.:       0  
##  Median :       0   Median :       0   Median :       0   Median :       0  
##  Mean   :    2030   Mean   :    1550   Mean   :    1563   Mean   :    1720  
##  3rd Qu.:      45   3rd Qu.:      31   3rd Qu.:      35   3rd Qu.:      36  
##  Max.   :35812014   Max.   :26936999   Max.   :27213534   Max.   :28815476  
##  Sample_Scr2_1A     Sample_Scr2_2A     Sample_Scr2_3A      gene_name        
##  Min.   :       0   Min.   :       0   Min.   :       0   Length:55541      
##  1st Qu.:       0   1st Qu.:       0   1st Qu.:       0   Class :character  
##  Median :       0   Median :       0   Median :       0   Mode  :character  
##  Mean   :    1533   Mean   :    1446   Mean   :    1389                     
##  3rd Qu.:      36   3rd Qu.:      34   3rd Qu.:      32                     
##  Max.   :23431953   Max.   :23751800   Max.   :22851275

## [1] 55541    16

Seperate out the gene name

Remove it from the data frame
Create additional information

## [1] "/Users/nataliamunozperez/Documents/Einstein/PhD/Martin Lab/Year 1/Data/RNA-seq/counts"

## [1] "/Users/nataliamunozperez/Documents/Einstein/PhD/Martin Lab/Year 1/Data/RNA-seq/counts"

## [1] 15  3

3.Generate a dds object

## Warning: package(s) not installed when version(s) same as current; use `force = TRUE` to
##   re-install: 'DESeq2'

## Warning: package 'S4Vectors' was built under R version 4.1.2

## Warning: package 'GenomicRanges' was built under R version 4.1.2

## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors

## class: DESeqDataSet 
## dim: 55541 12 
## metadata(1): version
## assays(1): counts
## rownames(55541): ENSMUSG00000000001.4 ENSMUSG00000000003.15 ...
##   __not_aligned __alignment_not_unique
## rowData names(0):
## colnames(12): parent_1A parent_2A ... Sample_6_1_2 Sample_6_1_3
## colData names(3): condition group replicate

## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors

## class: DESeqDataSet 
## dim: 55541 12 
## metadata(1): version
## assays(1): counts
## rownames(55541): ENSMUSG00000000001.4 ENSMUSG00000000003.15 ...
##   __not_aligned __alignment_not_unique
## rowData names(0):
## colnames(12): Sample_2_1_1 Sample_2_1_2 ... Sample_Scr2_2A
##   Sample_Scr2_3A
## colData names(3): condition group replicate

## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors

## class: DESeqDataSet 
## dim: 55541 6 
## metadata(1): version
## assays(1): counts
## rownames(55541): ENSMUSG00000000001.4 ENSMUSG00000000003.15 ...
##   __not_aligned __alignment_not_unique
## rowData names(0):
## colnames(6): parent_1A parent_2A ... Sample_Scr2_2A Sample_Scr2_3A
## colData names(3): condition group replicate

3b. Make sure factor levels are correct

## [1] "parent" "scr2"

4.Pre-filtering: filter out lowly expressed genes by only keeping rows with at least 10 counts per gene

Perform differential expression and store results in a vector

## log2 fold change (MLE): condition clone vs control 
## Wald test p-value: condition clone vs control 
## DataFrame with 55541 rows and 6 columns
##                           baseMean log2FoldChange     lfcSE       stat
##                          <numeric>      <numeric> <numeric>  <numeric>
## ENSMUSG00000000001.4   1.50452e+04       0.524650 0.0853515  6.1469390
## ENSMUSG00000000003.15  6.33682e-02       0.126886 3.5986409  0.0352595
## ENSMUSG00000000028.15  3.34137e+03      -0.194482 0.1339397 -1.4520105
## ENSMUSG00000000031.16  6.07257e+02      -7.770508 0.9940525 -7.8169996
## ENSMUSG00000000037.16  1.84680e+00      -1.383965 1.2275245 -1.1274438
## ...                            ...            ...       ...        ...
## __no_feature               4391282     -0.2050530 0.1304784   -1.57155
## __ambiguous                2495136     -0.2823281 0.1409592   -2.00291
## __too_low_aQual                  0             NA        NA         NA
## __not_aligned                    0             NA        NA         NA
## __alignment_not_unique    26697626      0.0901309 0.0629959    1.43074
##                             pvalue        padj
##                          <numeric>   <numeric>
## ENSMUSG00000000001.4   7.89925e-10 6.10884e-08
## ENSMUSG00000000003.15  9.71873e-01          NA
## ENSMUSG00000000028.15  1.46499e-01 4.54375e-01
## ENSMUSG00000000031.16  5.40973e-15 9.34790e-13
## ENSMUSG00000000037.16  2.59555e-01          NA
## ...                            ...         ...
## __no_feature             0.1160556    0.400921
## __ambiguous              0.0451873    0.224862
## __too_low_aQual                 NA          NA
## __not_aligned                   NA          NA
## __alignment_not_unique   0.1525045    0.464131

## log2 fold change (MLE): condition clone vs control 
## Wald test p-value: condition clone vs control 
## DataFrame with 55541 rows and 6 columns
##                           baseMean log2FoldChange     lfcSE       stat
##                          <numeric>      <numeric> <numeric>  <numeric>
## ENSMUSG00000000001.4   1.59289e+04      0.2414369 0.0737715  3.2727678
## ENSMUSG00000000003.15  6.42570e-02      0.2291678 3.5986409  0.0636818
## ENSMUSG00000000028.15  3.33477e+03     -0.0929964 0.1276046 -0.7287859
## ENSMUSG00000000031.16  5.15760e+01     -3.9539147 0.9023733 -4.3816843
## ENSMUSG00000000037.16  1.28053e+00      0.3673577 1.4522845  0.2529516
## ...                            ...            ...       ...        ...
## __no_feature               4257890      0.0612479 0.1278220   0.479166
## __ambiguous                2457085     -0.1130532 0.0777348  -1.454345
## __too_low_aQual                  0             NA        NA         NA
## __not_aligned                    0             NA        NA         NA
## __alignment_not_unique    26744126      0.1852999 0.0580834   3.190241
##                             pvalue       padj
##                          <numeric>  <numeric>
## ENSMUSG00000000001.4   1.06500e-03 0.01108376
## ENSMUSG00000000003.15  9.49224e-01         NA
## ENSMUSG00000000028.15  4.66133e-01 0.73032335
## ENSMUSG00000000031.16  1.17765e-05 0.00024855
## ENSMUSG00000000037.16  8.00306e-01         NA
## ...                            ...        ...
## __no_feature            0.63182074  0.8399297
## __ambiguous             0.14585071  0.3904142
## __too_low_aQual                 NA         NA
## __not_aligned                   NA         NA
## __alignment_not_unique  0.00142154  0.0139158

## log2 fold change (MLE): group scr2 vs parent 
## Wald test p-value: group scr2 vs parent 
## DataFrame with 55541 rows and 6 columns
##                           baseMean log2FoldChange     lfcSE      stat
##                          <numeric>      <numeric> <numeric> <numeric>
## ENSMUSG00000000001.4   11788.31229       0.288504  0.162454  1.775916
## ENSMUSG00000000003.15      0.00000             NA        NA        NA
## ENSMUSG00000000028.15   3347.62931      -0.096815  0.195210 -0.495953
## ENSMUSG00000000031.16   1202.45920      -3.809510  0.763116 -4.992044
## ENSMUSG00000000037.16      2.06334      -1.741200  1.662284 -1.047474
## ...                            ...            ...       ...       ...
## __no_feature               4192386     -0.2602742  0.224274 -1.160519
## __ambiguous                2553811     -0.1632179  0.251762 -0.648301
## __too_low_aQual                  0             NA        NA        NA
## __not_aligned                    0             NA        NA        NA
## __alignment_not_unique    23151471     -0.0896871  0.155899 -0.575291
##                             pvalue        padj
##                          <numeric>   <numeric>
## ENSMUSG00000000001.4   7.57467e-02 0.628107119
## ENSMUSG00000000003.15           NA          NA
## ENSMUSG00000000028.15  6.19928e-01 0.979093604
## ENSMUSG00000000031.16  5.97437e-07 0.000110118
## ENSMUSG00000000037.16  2.94881e-01          NA
## ...                            ...         ...
## __no_feature              0.245837    0.871268
## __ambiguous               0.516790    0.967566
## __too_low_aQual                 NA          NA
## __not_aligned                   NA          NA
## __alignment_not_unique    0.565094    0.972468

Reorganize your p-values

## 
## out of 38992 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 1340, 3.4%
## LFC < 0 (down)     : 1778, 4.6%
## outliers [1]       : 74, 0.19%
## low counts [2]     : 16275, 42%
## (mean count < 2)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

## 
## out of 38952 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 2255, 5.8%
## LFC < 0 (down)     : 2224, 5.7%
## outliers [1]       : 43, 0.11%
## low counts [2]     : 15516, 40%
## (mean count < 2)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

## 
## out of 35175 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 208, 0.59%
## LFC < 0 (down)     : 255, 0.72%
## outliers [1]       : 30, 0.085%
## low counts [2]     : 16529, 47%
## (mean count < 6)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

How many p-values were less than 0.1?

## [1] 3118

## [1] 4479

## [1] 463

6: Use the alpha argument in the results function to identify a different p-value cut off (in this case, 0.05) and output a summary with the chosen cutoff.

## 
## out of 38992 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 1001, 2.6%
## LFC < 0 (down)     : 1417, 3.6%
## outliers [1]       : 74, 0.19%
## low counts [2]     : 15537, 40%
## (mean count < 2)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

## 
## out of 38952 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 1751, 4.5%
## LFC < 0 (down)     : 1723, 4.4%
## outliers [1]       : 43, 0.11%
## low counts [2]     : 16253, 42%
## (mean count < 2)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

## 
## out of 35175 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 154, 0.44%
## LFC < 0 (down)     : 185, 0.53%
## outliers [1]       : 30, 0.085%
## low counts [2]     : 13885, 39%
## (mean count < 3)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

Generate an MA plot

8b. MA plots and volcano plots

9.Using the plotcounts functions, you can visualize the read counts of a gene across different groups.

Plot the gene from your results that has the lowest p-value from your results.

## [1] "Ddx19b"

## [1] "Sri"

## [1] "Rundc3b"

Customize plot using ggplot2 to plot gene with lowest p-value

Customize plot using ggplot2 to plot B7H3/CD276 gene

12.Save your results into a .txt file

13.Plot PCA

Now we will extract results for gene set enrichment analysis

## Warning: package(s) not installed when version(s) same as current; use `force = TRUE` to
##   re-install: 'AnnotationDbi'

## Warning: package 'AnnotationDbi' was built under R version 4.1.2

## Warning: package(s) not installed when version(s) same as current; use `force = TRUE` to
##   re-install: 'AnnotationHub'

## Warning: package(s) not installed when version(s) same as current; use `force = TRUE` to
##   re-install: 'org.Mm.eg.db'

##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
## [11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MGI"         
## [16] "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
## [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UNIPROT"

##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
## [11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MGI"         
## [16] "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
## [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UNIPROT"

## [1] "11287" "11298" "11302" "11303" "11304" "11305"

## [1] "ENSMUSG00000033658" "ENSMUSG00000060803" "ENSMUSG00000051579"
## [4] "ENSMUSG00000046991" "ENSMUSG00000117904" "ENSMUSG00000074863"

## [1] "ENSMUSG00000003161" "ENSMUSG00000040570" "ENSMUSG00000003623"
## [4] "ENSMUSG00000079659" "ENSMUSG00000054099" "ENSMUSG00000033658"

## [1] "ENSMUSG00000040570" "ENSMUSG00000003623" "ENSMUSG00000003161"
## [4] "ENSMUSG00000056004" "ENSMUSG00000054099" "ENSMUSG00000002297"

## AnnotationHub with 1 record
## # snapshotDate(): 2021-10-20
## # names(): AH78811
## # $dataprovider: Ensembl
## # $species: Mus musculus
## # $rdataclass: EnsDb
## # $rdatadateadded: 2019-10-29
## # $title: Ensembl 99 EnsDb for Mus musculus
## # $description: Gene and protein annotations for Mus musculus based on Ensem...
## # $taxonomyid: 10090
## # $genome: GRCm38
## # $sourcetype: ensembl
## # $sourceurl: http://www.ensembl.org
## # $sourcesize: NA
## # $tags: c("99", "AHEnsDbs", "Annotation", "EnsDb", "Ensembl", "Gene",
## #   "Protein", "Transcript") 
## # retrieve record with 'object[["AH78811"]]'

## EnsDb for Ensembl:
## |Backend: SQLite
## |Db type: EnsDb
## |Type of Gene ID: Ensembl Gene ID
## |Supporting package: ensembldb
## |Db created by: ensembldb package from Bioconductor
## |script_version: 0.3.5
## |Creation time: Wed Feb  5 00:04:44 2020
## |ensembl_version: 99
## |ensembl_host: localhost
## |Organism: Mus musculus
## |taxonomy_id: 10090
## |genome_build: GRCm38
## |DBSCHEMAVERSION: 2.1
## | No. of genes: 56289.
## | No. of transcripts: 144726.
## |Protein data available.

##  [1] "DESCRIPTION"         "ENTREZID"            "EXONID"             
##  [4] "EXONIDX"             "EXONSEQEND"          "EXONSEQSTART"       
##  [7] "GCCONTENT"           "GENEBIOTYPE"         "GENEID"             
## [10] "GENEIDVERSION"       "GENENAME"            "GENESEQEND"         
## [13] "GENESEQSTART"        "INTERPROACCESSION"   "ISCIRCULAR"         
## [16] "PROTDOMEND"          "PROTDOMSTART"        "PROTEINDOMAINID"    
## [19] "PROTEINDOMAINSOURCE" "PROTEINID"           "PROTEINSEQUENCE"    
## [22] "SEQCOORDSYSTEM"      "SEQLENGTH"           "SEQNAME"            
## [25] "SEQSTRAND"           "SYMBOL"              "TXBIOTYPE"          
## [28] "TXCDSSEQEND"         "TXCDSSEQSTART"       "TXID"               
## [31] "TXIDVERSION"         "TXNAME"              "TXSEQEND"           
## [34] "TXSEQSTART"          "TXSUPPORTLEVEL"      "UNIPROTDB"          
## [37] "UNIPROTID"           "UNIPROTMAPPINGTYPE"

##  [1] "ENTREZID"            "EXONID"              "GENEBIOTYPE"        
##  [4] "GENEID"              "GENENAME"            "PROTDOMID"          
##  [7] "PROTEINDOMAINID"     "PROTEINDOMAINSOURCE" "PROTEINID"          
## [10] "SEQNAME"             "SEQSTRAND"           "SYMBOL"             
## [13] "TXBIOTYPE"           "TXID"                "TXNAME"             
## [16] "UNIPROTID"

## Warning: Unable to map 5 of 2431 requested IDs.

## Warning: Unable to map 10 of 3458 requested IDs.

## Warning: Unable to map 1 of 348 requested IDs.

Make a pretty heatmap
Gene ontology

## Warning: package(s) not installed when version(s) same as current; use `force = TRUE` to
##   re-install: 'topGO'

## Warning: package(s) not installed when version(s) same as current; use `force = TRUE` to
##   re-install: 'org.Mm.eg.db'

## Warning: package(s) not installed when version(s) same as current; use `force = TRUE` to
##   re-install: 'tidygraph'

## Warning: package(s) not installed when version(s) same as current; use `force = TRUE` to
##   re-install: 'clusterProfiler'

## Warning: package(s) not installed when version(s) same as current; use `force = TRUE` to
##   re-install: 'enrichplot'

## Warning: package(s) not installed when version(s) same as current; use `force = TRUE` to
##   re-install: 'ggplot2'