Unit 4 Final Part I: Volcano Plots

As we’ve seen, volcano plots are incredibly useful visuals to create when working with expression data for a large number of proteins. The data set that we will work with for this assignment has data comparing protein expression between two different types of skin cancer. Each row of the data set represents a comparison between expression of a specific protein with respect to the two types of cancers.

Complete the prompts below and be sure to answer all questions as completely as possible!

NOTE: given that I ask you to overwrite the variable name for the dataframe several times, if you run into errors with your code, try putting ALL of your code together in one code chunk!

Part I: Exploring the data

  1. Code: Bring in the “RDEB.csv” file and call it “mySig”

  2. Code: Explore the data set by looking at the variables and variable types

  3. Question: What do you notice about the columns in this dataset? What columns/variables sound familiar given our previous content on hypothesis testing? The columns in the data set represent 11 different variables each with 1765 observances. They are all categorized in respect to their class and mode. If the column is a character column then there are no summary statistics to be calculated. The variables log2FC (log2 fold change), SE (standard error), T- value, df(degrees of freedom), p-value, and adjusted p-value are all familiar in the data set. In the summary statistics we have gone over all of the statistics shown in the code block: min, 1stQ, median, mean, etc.

  4. Question: If each row represents a comparison for expression between two types of cancer, what type of test was likely run to generate the data in a few of the columns? Why? Assuming that the data fits all the assumptions necessary, a two-sample t-test would be run. This would be the appropriate test as there are two categorical explanatory variables with numerical response variables.

mySig <- read.table('RDEB.csv', sep = ',', header = TRUE)
str(mySig)
## 'data.frame':    1764 obs. of  11 variables:
##  $ Protein             : chr  "sp|A0A0C4DH31|HV118_HUMAN" "sp|A0A0C4DH33|HV124_HUMAN" "sp|A0A0C4DH38|HV551_HUMAN" "sp|A0A0C4DH67|KV108_HUMAN" ...
##  $ Label               : chr  "metast-RDEB" "metast-RDEB" "metast-RDEB" "metast-RDEB" ...
##  $ log2FC              : num  -0.2603 -0.3026 -0.0631 -2.1192 0.1537 ...
##  $ SE                  : num  0.789 0.224 0.973 0.516 0.759 ...
##  $ Tvalue              : num  -0.3297 -1.3519 -0.0649 -4.1087 0.2024 ...
##  $ DF                  : int  11 11 7 6 6 17 4 6 13 6 ...
##  $ pvalue              : num  0.7478 0.2036 0.9501 0.0063 0.8463 ...
##  $ adj.pvalue          : num  0.8526 0.3927 0.9699 0.0703 0.9088 ...
##  $ issue               : chr  NA NA NA NA ...
##  $ MissingPercentage   : num  0.579 0.526 0.719 0.711 0.754 ...
##  $ ImputationPercentage: num  0.263 0.211 0.193 0.132 0.175 ...
summary(mySig)
##    Protein             Label               log2FC               SE         
##  Length:1764        Length:1764        Min.   :-3.07917   Min.   :0.09937  
##  Class :character   Class :character   1st Qu.: 0.09289   1st Qu.:0.26397  
##  Mode  :character   Mode  :character   Median : 0.43447   Median :0.35016  
##                                        Mean   :     Inf   Mean   :0.41715  
##                                        3rd Qu.: 0.74577   3rd Qu.:0.49471  
##                                        Max.   :     Inf   Max.   :2.75309  
##                                        NA's   :1          NA's   :49       
##      Tvalue              DF            pvalue          adj.pvalue    
##  Min.   :-4.7744   Min.   : 2.00   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.: 0.2114   1st Qu.:13.00   1st Qu.:0.04154   1st Qu.:0.1546  
##  Median : 1.2109   Median :16.00   Median :0.19116   Median :0.3740  
##  Mean   : 1.1304   Mean   :14.59   Mean   :0.29402   Mean   :0.4123  
##  3rd Qu.: 2.1400   3rd Qu.:17.00   3rd Qu.:0.49104   3rd Qu.:0.6435  
##  Max.   : 8.2094   Max.   :17.00   Max.   :0.99919   Max.   :0.9992  
##  NA's   :49        NA's   :49      NA's   :49                        
##     issue           MissingPercentage ImputationPercentage
##  Length:1764        Min.   :0.0000    Min.   :0.0000      
##  Class :character   1st Qu.:0.3459    1st Qu.:0.2442      
##  Mode  :character   Median :0.4526    Median :0.3421      
##                     Mean   :0.4627    Mean   :0.3356      
##                     3rd Qu.:0.5789    3rd Qu.:0.4210      
##                     Max.   :0.8421    Max.   :0.8089      
##                     NA's   :49        NA's   :49

Part II: Manipulating the data

  1. Code: to generate a volcano plot, we really only need 3 columns of data, write a line of code that overwrites the “mySig” variable but keeps on columns 1, 3, and 7. NOTE: in one of our first R labs, you created code that removed ROWS, the code will be similar and follow the example below. Keep the name of the dataframe the same!

Ex: myData <- myData[, c(4, 8, 9)] ### this code would rewrite myData but keep only columns 4,8 and 9.

  1. Code: given that this dataset is so large, there are likely instances where “NA” or “INf” values exist, if these stay in your dataframe, R will trip over itself while trying to make a plot. You need to search for and seek out rows in your data that are complete (no NA values), use the example code below and recreate your own

Ex: myData <- myData[complete.cases(myData),] ### this code will search for and pick out complete cases for each row, meaning all of the data are there.

  1. Code: change the log2FC variable to a numeric
mySig <- mySig[, c(-2, -4, -5, -6, -8, -9, -10, -11)]
mySig <- mySig[complete.cases(mySig),]
mySig$log2FC <- as.numeric(mySig$log2FC)
class(mySig$log2FC)
## [1] "numeric"

Part III: Creating, changing, and interpretting the volcano plot

  1. Code: load the “EnhancedVolcano” package (NOTE: if you get an error saying that the package won’t load because of issues in software version, you can use the following code to make sure it loads)

install.packages(“EnhancedVolcano”, force = T)

  1. Code: run the following code chunk to create the plot
library(ggplot2)
#install.packages("EnhancedVolcano", force = T)
library(EnhancedVolcano)
## Loading required package: ggrepel
rownames(mySig) <- mySig$Protein
mySig <- mySig[,-1]

EnhancedVolcano(mySig,
                lab = rownames(mySig),
                x = 'log2FC',
                y = 'pvalue',
                pCutoff = 0.05,
                FCcutoff = 2,
                pointSize = 4,
                labSize = 3)

rownames(mySig) <- mySig$Protein
EnhancedVolcano(mySig,
                lab = rownames(mySig),
                x = 'log2FC',
                y = 'pvalue',
                pCutoff = 0.005,
                FCcutoff = 2.5,
                pointSize = 4,
                labSize = 3)

3. Code: The settings for the code above had created visual cutoffs for the p-value and the log fold change, create a plot that changes the cutoffs for both of these variables, make the pCutoff 0.005 and the FCcutoff 2.5

  1. Question: How do you interpret the plot you made? Think about explaining the colors and locations! All of the grey dots signify a non-significant change. The blue dots signify a significant change in -Log10p. The green dots symbolize a sample with significant change in Log2FC. Finally, the red dots are samples with statistically significant changes in both -Log10p and Log2FC. Anything above the threshold (dotted horizontal line) and/or outside the two vertical zero point lines is a significant change.

  2. Question: If you were to use this information to hone in on specific proteins that experience drastic differences in fold change and significant p-values, what proteins would they be? How would you use this information to determine future research directions? They would be the red colored proteins. In addition, they can be sorted by their values to categorize the most significant changes. I would investigate these proteins and their functions to understand why they may show more significant change than the others. If the goal is to optimize function these proteins can also be a starting point for structural design for drug discovery etc.