Unit 4 Final Part I: Volcano Plots

As we’ve seen, volcano plots are incredibly useful visuals to create when working with expression data for a large number of proteins. The data set that we will work with for this assignment has data comparing protein expression between two different types of skin cancer. Each row of the data set represents a comparison between expression of a specific protein with respect to the two types of cancers.

Complete the prompts below and be sure to answer all questions as completely as possible!

NOTE: given that I ask you to overwrite the variable name for the dataframe several times, if you run into errors with your code, try putting ALL of your code together in one code chunk!

Part I: Exploring the data

  1. Code: Bring in the “RDEB.csv” file and call it “mySig”
mySig<- read.csv("RDEB.csv")
  1. Code: Explore the data set by looking at the variables and variable types
str(mySig)
## 'data.frame':    1764 obs. of  11 variables:
##  $ Protein             : chr  "sp|A0A0C4DH31|HV118_HUMAN" "sp|A0A0C4DH33|HV124_HUMAN" "sp|A0A0C4DH38|HV551_HUMAN" "sp|A0A0C4DH67|KV108_HUMAN" ...
##  $ Label               : chr  "metast-RDEB" "metast-RDEB" "metast-RDEB" "metast-RDEB" ...
##  $ log2FC              : chr  "-0.260261457" "-0.302608749" "-0.063104838" "-2.119160915" ...
##  $ SE                  : num  0.789 0.224 0.973 0.516 0.759 ...
##  $ Tvalue              : num  -0.3297 -1.3519 -0.0649 -4.1087 0.2024 ...
##  $ DF                  : int  11 11 7 6 6 17 4 6 13 6 ...
##  $ pvalue              : num  0.7478 0.2036 0.9501 0.0063 0.8463 ...
##  $ adj.pvalue          : num  0.8526 0.3927 0.9699 0.0703 0.9088 ...
##  $ issue               : chr  NA NA NA NA ...
##  $ MissingPercentage   : num  0.579 0.526 0.719 0.711 0.754 ...
##  $ ImputationPercentage: num  0.263 0.211 0.193 0.132 0.175 ...
names(mySig)
##  [1] "Protein"              "Label"                "log2FC"              
##  [4] "SE"                   "Tvalue"               "DF"                  
##  [7] "pvalue"               "adj.pvalue"           "issue"               
## [10] "MissingPercentage"    "ImputationPercentage"
sapply(mySig, class)
##              Protein                Label               log2FC 
##          "character"          "character"          "character" 
##                   SE               Tvalue                   DF 
##            "numeric"            "numeric"            "integer" 
##               pvalue           adj.pvalue                issue 
##            "numeric"            "numeric"          "character" 
##    MissingPercentage ImputationPercentage 
##            "numeric"            "numeric"
summary(mySig)
##    Protein             Label              log2FC                SE         
##  Length:1764        Length:1764        Length:1764        Min.   :0.09937  
##  Class :character   Class :character   Class :character   1st Qu.:0.26397  
##  Mode  :character   Mode  :character   Mode  :character   Median :0.35016  
##                                                           Mean   :0.41715  
##                                                           3rd Qu.:0.49471  
##                                                           Max.   :2.75309  
##                                                           NA's   :49       
##      Tvalue              DF            pvalue          adj.pvalue    
##  Min.   :-4.7744   Min.   : 2.00   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.: 0.2114   1st Qu.:13.00   1st Qu.:0.04154   1st Qu.:0.1546  
##  Median : 1.2109   Median :16.00   Median :0.19116   Median :0.3740  
##  Mean   : 1.1304   Mean   :14.59   Mean   :0.29402   Mean   :0.4123  
##  3rd Qu.: 2.1400   3rd Qu.:17.00   3rd Qu.:0.49104   3rd Qu.:0.6435  
##  Max.   : 8.2094   Max.   :17.00   Max.   :0.99919   Max.   :0.9992  
##  NA's   :49        NA's   :49      NA's   :49                        
##     issue           MissingPercentage ImputationPercentage
##  Length:1764        Min.   :0.0000    Min.   :0.0000      
##  Class :character   1st Qu.:0.3459    1st Qu.:0.2442      
##  Mode  :character   Median :0.4526    Median :0.3421      
##                     Mean   :0.4627    Mean   :0.3356      
##                     3rd Qu.:0.5789    3rd Qu.:0.4210      
##                     Max.   :0.8421    Max.   :0.8089      
##                     NA's   :49        NA's   :49
  1. Question: What do you notice about the columns in this dataset? What columns/variables sound familiar given our previous content on hypothesis testing? # There is 1764 rows in this dataset with 11 variables. Each row of the dataset represnets one protein comparison. The Tvalue is the t stastistic that comes form a t test. DF is the degrees of freedom in that comparison. pvalue or the probability of observing the Tvalue considering the null hypothesis. log2FC is the log fold change. SE is the standard error that measures the variability in the log2FC.
  2. Question: If each row represents a comparison for expression between two types of cancer, what type of test was likely run to generate the data in a few of the columns? Why? # The most likely test to run is a two sample t test. The null hyptohesis would staste that the expression for each cancer type is equal while the alternative hypothesis would state the expression differs for each cancer type. The columns in the data is what would be expected from a t test output: t value and p value for example. The t value measures how different the 2 groups are (the different cancers). The p value tells if that difference is significant.

Part II: Manipulating the data

  1. Code: to generate a volcano plot, we really only need 3 columns of data, write a line of code that overwrites the “mySig” variable but keeps on columns 1, 3, and 7. NOTE: in one of our first R labs, you created code that removed ROWS, the code will be similar and follow the example below. Keep the name of the dataframe the same!

Ex: myData <- myData[, c(4, 8, 9)] ### this code would rewrite myData but keep only columns 4,8 and 9.

mySig <- mySig [, c(1, 3, 7)]
  1. Code: given that this dataset is so large, there are likely instances where “NA” or “INf” values exist, if these stay in your dataframe, R will trip over itself while trying to make a plot. You need to search for and seek out rows in your data that are complete (no NA values), use the example code below and recreate your own

Ex: myData <- myData[complete.cases(myData),] ### this code will search for and pick out complete cases for each row, meaning all of the data are there.

mySig <- mySig[complete.cases(mySig),]
  1. Code: change the log2FC variable to a numeric
mySig$log2FC <- as.numeric(mySig$log2FC)

Part III: Creating, changing, and interpretting the volcano plot

  1. Code: load the “EnhancedVolcano” package (NOTE: if you get an error saying that the package won’t load because of issues in software version, you can use the following code to make sure it loads) #Installed package outside of rmd to knit upon save install.packages(“EnhancedVolcano”, force = T)
library(EnhancedVolcano)
## Warning: package 'EnhancedVolcano' was built under R version 4.4.1
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.4.1
## Loading required package: ggrepel
## Warning: package 'ggrepel' was built under R version 4.4.1
  1. Code: run the following code chunk to create the plot
rownames(mySig) <- mySig$Protein
mySig <- mySig[,-1]

EnhancedVolcano(mySig,
                lab = rownames(mySig),
                x = 'log2FC',
                y = 'pvalue',
                pCutoff = 0.05,
                FCcutoff = 2,
                pointSize = 4,
                labSize = 3)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the EnhancedVolcano package.
##   Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## ℹ The deprecated feature was likely used in the EnhancedVolcano package.
##   Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

  1. Code: The settings for the code above had created visual cutoffs for the p-value and the log fold change, create a plot that changes the cutoffs for both of these variables, make the pCutoff 0.005 and the FCcutoff 2.5
EnhancedVolcano(mySig,
                lab = rownames(mySig),
                x = 'log2FC',
                y = 'pvalue',
                pCutoff = 0.005,
                FCcutoff = 2.5,
                pointSize = 4,
                labSize = 3)

4. Question: How do you interpret the plot you made? Think about explaining the colors and locations! # The x-axis displays the log fold change. This represnts the change in protien abundance between the 2 cancer types. The left side shows the negative logFC when proteins are downregulated and the right side shows proteins that are upregulated. The y-axis represnts the -log10 p-value. This is the stastical signifance. The higher the point is on the plot the more stastically significant. The colors represent stastical significance -> grey shows no significance, green has a large FC but it is not significant, blue and red both show significance but red shows a more strongly up or down regulated gene.

  1. Question: If you were to use this information to hone in one specific proteins that experience drastic differences in fold change and significant p-values, what proteins would they be? How would you use this information to determine future research directions? # I would chose proteins APOE and ASPN. APOE is both upregulated and significant while APN is downregulated and significant. I would research into the functions of these proteins to validate the cancer’s progression and what symptoms may occur.