As we’ve seen, volcano plots are incredibly useful visuals to create when working with expression data for a large number of proteins. The data set that we will work with for this assignment has data comparing protein expression between two different types of skin cancer. Each row of the data set represents a comparison between expression of a specific protein with respect to the two types of cancers.
Complete the prompts below and be sure to answer all questions as completely as possible!
NOTE: given that I ask you to overwrite the variable name for the dataframe several times, if you run into errors with your code, try putting ALL of your code together in one code chunk!
mySig<- read.csv("RDEB.csv")
str(mySig)
## 'data.frame': 1764 obs. of 11 variables:
## $ Protein : chr "sp|A0A0C4DH31|HV118_HUMAN" "sp|A0A0C4DH33|HV124_HUMAN" "sp|A0A0C4DH38|HV551_HUMAN" "sp|A0A0C4DH67|KV108_HUMAN" ...
## $ Label : chr "metast-RDEB" "metast-RDEB" "metast-RDEB" "metast-RDEB" ...
## $ log2FC : chr "-0.260261457" "-0.302608749" "-0.063104838" "-2.119160915" ...
## $ SE : num 0.789 0.224 0.973 0.516 0.759 ...
## $ Tvalue : num -0.3297 -1.3519 -0.0649 -4.1087 0.2024 ...
## $ DF : int 11 11 7 6 6 17 4 6 13 6 ...
## $ pvalue : num 0.7478 0.2036 0.9501 0.0063 0.8463 ...
## $ adj.pvalue : num 0.8526 0.3927 0.9699 0.0703 0.9088 ...
## $ issue : chr NA NA NA NA ...
## $ MissingPercentage : num 0.579 0.526 0.719 0.711 0.754 ...
## $ ImputationPercentage: num 0.263 0.211 0.193 0.132 0.175 ...
names(mySig)
## [1] "Protein" "Label" "log2FC"
## [4] "SE" "Tvalue" "DF"
## [7] "pvalue" "adj.pvalue" "issue"
## [10] "MissingPercentage" "ImputationPercentage"
sapply(mySig, class)
## Protein Label log2FC
## "character" "character" "character"
## SE Tvalue DF
## "numeric" "numeric" "integer"
## pvalue adj.pvalue issue
## "numeric" "numeric" "character"
## MissingPercentage ImputationPercentage
## "numeric" "numeric"
summary(mySig)
## Protein Label log2FC SE
## Length:1764 Length:1764 Length:1764 Min. :0.09937
## Class :character Class :character Class :character 1st Qu.:0.26397
## Mode :character Mode :character Mode :character Median :0.35016
## Mean :0.41715
## 3rd Qu.:0.49471
## Max. :2.75309
## NA's :49
## Tvalue DF pvalue adj.pvalue
## Min. :-4.7744 Min. : 2.00 Min. :0.00000 Min. :0.0000
## 1st Qu.: 0.2114 1st Qu.:13.00 1st Qu.:0.04154 1st Qu.:0.1546
## Median : 1.2109 Median :16.00 Median :0.19116 Median :0.3740
## Mean : 1.1304 Mean :14.59 Mean :0.29402 Mean :0.4123
## 3rd Qu.: 2.1400 3rd Qu.:17.00 3rd Qu.:0.49104 3rd Qu.:0.6435
## Max. : 8.2094 Max. :17.00 Max. :0.99919 Max. :0.9992
## NA's :49 NA's :49 NA's :49
## issue MissingPercentage ImputationPercentage
## Length:1764 Min. :0.0000 Min. :0.0000
## Class :character 1st Qu.:0.3459 1st Qu.:0.2442
## Mode :character Median :0.4526 Median :0.3421
## Mean :0.4627 Mean :0.3356
## 3rd Qu.:0.5789 3rd Qu.:0.4210
## Max. :0.8421 Max. :0.8089
## NA's :49 NA's :49
Ex: myData <- myData[, c(4, 8, 9)] ### this code would rewrite myData but keep only columns 4,8 and 9.
mySig <- mySig [, c(1, 3, 7)]
Ex: myData <- myData[complete.cases(myData),] ### this code will search for and pick out complete cases for each row, meaning all of the data are there.
mySig <- mySig[complete.cases(mySig),]
mySig$log2FC <- as.numeric(mySig$log2FC)
library(EnhancedVolcano)
## Warning: package 'EnhancedVolcano' was built under R version 4.4.1
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.4.1
## Loading required package: ggrepel
## Warning: package 'ggrepel' was built under R version 4.4.1
rownames(mySig) <- mySig$Protein
mySig <- mySig[,-1]
EnhancedVolcano(mySig,
lab = rownames(mySig),
x = 'log2FC',
y = 'pvalue',
pCutoff = 0.05,
FCcutoff = 2,
pointSize = 4,
labSize = 3)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the EnhancedVolcano package.
## Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## ℹ The deprecated feature was likely used in the EnhancedVolcano package.
## Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
EnhancedVolcano(mySig,
lab = rownames(mySig),
x = 'log2FC',
y = 'pvalue',
pCutoff = 0.005,
FCcutoff = 2.5,
pointSize = 4,
labSize = 3)
4. Question: How do you interpret the plot you made? Think about
explaining the colors and locations! # The x-axis displays the log fold
change. This represnts the change in protien abundance between the 2
cancer types. The left side shows the negative logFC when proteins are
downregulated and the right side shows proteins that are upregulated.
The y-axis represnts the -log10 p-value. This is the stastical
signifance. The higher the point is on the plot the more stastically
significant. The colors represent stastical significance -> grey
shows no significance, green has a large FC but it is not significant,
blue and red both show significance but red shows a more strongly up or
down regulated gene.