As we’ve seen, volcano plots are incredibly useful visuals to create when working with expression data for a large number of proteins. The data set that we will work with for this assignment has data comparing protein expression between two different types of skin cancer. Each row of the data set represents a comparison between expression of a specific protein with respect to the two types of cancers.
Complete the prompts below and be sure to answer all questions as completely as possible!
NOTE: given that I ask you to overwrite the variable name for the dataframe several times, if you run into errors with your code, try putting ALL of your code together in one code chunk!
Code: Bring in the “RDEB.csv” file and call it “mySig”
Code: Explore the data set by looking at the variables and variable types
Question: What do you notice about the columns in this dataset? What columns/variables sound familiar given our previous content on hypothesis testing? The columns in the data set represent 11 different variables each with 1765 observances. They are all categorized in respect to their class and mode. If the column is a character column then there are no summary statistics to be calculated. The variables log2FC (log2 fold change), SE (standard error), T- value, df(degrees of freedom), p-value, and adjusted p-value are all familiar in the data set. In the summary statistics we have gone over all of the statistics shown in the code block: min, 1stQ, median, mean, etc.
Question: If each row represents a comparison for expression between two types of cancer, what type of test was likely run to generate the data in a few of the columns? Why? Assuming that the data fits all the assumptions necessary, a two-sample t-test would be run. This would be the appropriate test as there are two categorical explanatory variables with numerical response variables.
mySig <- read.table('RDEB.csv', sep = ',', header = TRUE)
str(mySig)
## 'data.frame': 1764 obs. of 11 variables:
## $ Protein : chr "sp|A0A0C4DH31|HV118_HUMAN" "sp|A0A0C4DH33|HV124_HUMAN" "sp|A0A0C4DH38|HV551_HUMAN" "sp|A0A0C4DH67|KV108_HUMAN" ...
## $ Label : chr "metast-RDEB" "metast-RDEB" "metast-RDEB" "metast-RDEB" ...
## $ log2FC : num -0.2603 -0.3026 -0.0631 -2.1192 0.1537 ...
## $ SE : num 0.789 0.224 0.973 0.516 0.759 ...
## $ Tvalue : num -0.3297 -1.3519 -0.0649 -4.1087 0.2024 ...
## $ DF : int 11 11 7 6 6 17 4 6 13 6 ...
## $ pvalue : num 0.7478 0.2036 0.9501 0.0063 0.8463 ...
## $ adj.pvalue : num 0.8526 0.3927 0.9699 0.0703 0.9088 ...
## $ issue : chr NA NA NA NA ...
## $ MissingPercentage : num 0.579 0.526 0.719 0.711 0.754 ...
## $ ImputationPercentage: num 0.263 0.211 0.193 0.132 0.175 ...
summary(mySig)
## Protein Label log2FC SE
## Length:1764 Length:1764 Min. :-3.07917 Min. :0.09937
## Class :character Class :character 1st Qu.: 0.09289 1st Qu.:0.26397
## Mode :character Mode :character Median : 0.43447 Median :0.35016
## Mean : Inf Mean :0.41715
## 3rd Qu.: 0.74577 3rd Qu.:0.49471
## Max. : Inf Max. :2.75309
## NA's :1 NA's :49
## Tvalue DF pvalue adj.pvalue
## Min. :-4.7744 Min. : 2.00 Min. :0.00000 Min. :0.0000
## 1st Qu.: 0.2114 1st Qu.:13.00 1st Qu.:0.04154 1st Qu.:0.1546
## Median : 1.2109 Median :16.00 Median :0.19116 Median :0.3740
## Mean : 1.1304 Mean :14.59 Mean :0.29402 Mean :0.4123
## 3rd Qu.: 2.1400 3rd Qu.:17.00 3rd Qu.:0.49104 3rd Qu.:0.6435
## Max. : 8.2094 Max. :17.00 Max. :0.99919 Max. :0.9992
## NA's :49 NA's :49 NA's :49
## issue MissingPercentage ImputationPercentage
## Length:1764 Min. :0.0000 Min. :0.0000
## Class :character 1st Qu.:0.3459 1st Qu.:0.2442
## Mode :character Median :0.4526 Median :0.3421
## Mean :0.4627 Mean :0.3356
## 3rd Qu.:0.5789 3rd Qu.:0.4210
## Max. :0.8421 Max. :0.8089
## NA's :49 NA's :49
Ex: myData <- myData[, c(4, 8, 9)] ### this code would rewrite myData but keep only columns 4,8 and 9.
Ex: myData <- myData[complete.cases(myData),] ### this code will search for and pick out complete cases for each row, meaning all of the data are there.
mySig <- mySig[, c(-2, -4, -5, -6, -8, -9, -10, -11)]
mySig <- mySig[complete.cases(mySig),]
mySig$log2FC <- as.numeric(mySig$log2FC)
class(mySig$log2FC)
## [1] "numeric"
install.packages(“EnhancedVolcano”, force = T)
library(ggplot2)
#install.packages("EnhancedVolcano", force = T)
library(EnhancedVolcano)
## Loading required package: ggrepel
rownames(mySig) <- mySig$Protein
mySig <- mySig[,-1]
EnhancedVolcano(mySig,
lab = rownames(mySig),
x = 'log2FC',
y = 'pvalue',
pCutoff = 0.05,
FCcutoff = 2,
pointSize = 4,
labSize = 3)
rownames(mySig) <- mySig$Protein
EnhancedVolcano(mySig,
lab = rownames(mySig),
x = 'log2FC',
y = 'pvalue',
pCutoff = 0.005,
FCcutoff = 2.5,
pointSize = 4,
labSize = 3)
3. Code: The settings for the code above had created visual cutoffs for
the p-value and the log fold change, create a plot that changes the
cutoffs for both of these variables, make the pCutoff 0.005 and the
FCcutoff 2.5
Question: How do you interpret the plot you made? Think about explaining the colors and locations! All of the grey dots signify a non-significant change. The blue dots signify a significant change in -Log10p. The green dots symbolize a sample with significant change in Log2FC. Finally, the red dots are samples with statistically significant changes in both -Log10p and Log2FC. Anything above the threshold (dotted horizontal line) and/or outside the two vertical zero point lines is a significant change.
Question: If you were to use this information to hone in on specific proteins that experience drastic differences in fold change and significant p-values, what proteins would they be? How would you use this information to determine future research directions? They would be the red colored proteins. In addition, they can be sorted by their values to categorize the most significant changes. I would investigate these proteins and their functions to understand why they may show more significant change than the others. If the goal is to optimize function these proteins can also be a starting point for structural design for drug discovery etc.