library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(ade4)
## Warning: package 'ade4' was built under R version 4.5.2
expr <-
read.delim("Data/expr.txt", header=TRUE)
pheno <-
read.delim("Data/pheno.txt")
pheno$Cancer <- as.factor(pheno$Cancer)
pheno$Batch <- as.factor(pheno$Batch)
pheno$Outcome <- as.factor(pheno$Outcome)
(mergedData <- full_join(pheno, expr, by = "Sample"))
cancerPCA <- dudi.pca(mergedData[,5:22283], scannf = FALSE, nf = 3)
summary(cancerPCA)
## Class: pca dudi
## Call: dudi.pca(df = mergedData[, 5:22283], scannf = FALSE, nf = 3)
##
## Total inertia: 22280
##
## Eigenvalues:
## Ax1 Ax2 Ax3 Ax4 Ax5
## 7967.5 2464.0 1369.6 764.6 627.4
##
## Projected inertia (%):
## Ax1 Ax2 Ax3 Ax4 Ax5
## 35.762 11.060 6.147 3.432 2.816
##
## Cumulative projected inertia (%):
## Ax1 Ax1:2 Ax1:3 Ax1:4 Ax1:5
## 35.76 46.82 52.97 56.40 59.22
##
## (Only 5 dimensions (out of 56) are shown)
The first three principal axes account for 52.97% of the total variance (inertia).
s.class(
cancerPCA$li,
fac = mergedData$Cancer,
col = rainbow(3),
axesell = FALSE,
grid = FALSE,
cstar = 0,
cpoint = 2,
sub = "PCA of bladder samples by cancer status"
)
Normal and cancerous samples show some similarity in expression levels, indicated by the slight overlap of the ellipses, but they also differ, since the ellipses do not fully overlap.
s.class(
cancerPCA$li,
fac = mergedData$Batch,
col = rainbow(5),
axesell = FALSE,
grid = FALSE,
cstar = 0,
cpoint = 2,
sub = "PCA of bladder samples by batch number"
)
There are differences between batches. Batch 1 differs from Batches 2 and 4; Batch 2 differs from Batches 1, 3, and 4; Batch 3 differs from Batches 2 and 4; and Batch 4 differs from Batches 1, 2, and 3. Batch 5 does not separate from any other batch.
batch_outcome_table <- table(mergedData$Batch, mergedData$Outcome)
print(batch_outcome_table)
##
## Biopsy mTCC Normal sTCC-CIS sTCC+CIS
## 1 0 11 0 0 0
## 2 0 1 4 13 0
## 3 0 0 4 0 0
## 4 5 0 0 0 0
## 5 4 0 0 3 12
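Batch and outcome do not line up one-to-one. Only the most severe outcome (sTCC+CIS) falls entirely within a single batch (Batch 5); every other outcome is split across two batches, and Batches 2 and 5 each contain three different outcomes. Batch and outcome are therefore partly confounded, so the separation seen in the PCA cannot be attributed to outcome alone. To correct this, I would analyze all samples in the same batch to reduce batch effects.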
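The overlap between batch and outcome can be summarized by counting how many batches each outcome appears in, and how many outcomes each batch contains. This is a minimal sketch that re-enters the counts from the printed table by hand so it runs on its own; the object names (`batch_outcome`, `batches_per_outcome`, `outcomes_per_batch`) are mine, and with the data loaded the same two lines would work directly on `batch_outcome_table`.

```r
# Counts copied by hand from the printed batch-by-outcome table
batch_outcome <- matrix(
  c(0, 11, 0,  0,  0,
    0,  1, 4, 13,  0,
    0,  0, 4,  0,  0,
    5,  0, 0,  0,  0,
    4,  0, 0,  3, 12),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    Batch   = as.character(1:5),
    Outcome = c("Biopsy", "mTCC", "Normal", "sTCC-CIS", "sTCC+CIS")
  )
)

# Number of batches in which each outcome occurs at least once
batches_per_outcome <- colSums(batch_outcome > 0)

# Number of distinct outcomes present in each batch
outcomes_per_batch <- rowSums(batch_outcome > 0)

print(batches_per_outcome)
print(outcomes_per_batch)
```

An outcome that occurs in only one batch cannot be separated from that batch's effect, while a batch with several outcomes gives at least some within-batch comparison.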
hobbits <-
read.csv("Data/Hobbits.csv")
hobbits$Species <- as.factor(hobbits$Species)
hobbitsLog <- log10(hobbits[,3:6])
hobbitPCA <- dudi.pca(hobbitsLog,scannf = FALSE, nf = 3)
summary(hobbitPCA)
## Class: pca dudi
## Call: dudi.pca(df = hobbitsLog, scannf = FALSE, nf = 3)
##
## Total inertia: 4
##
## Eigenvalues:
## Ax1 Ax2 Ax3 Ax4
## 1.8023 1.3115 0.5859 0.3003
##
## Projected inertia (%):
## Ax1 Ax2 Ax3 Ax4
## 45.058 32.786 14.648 7.507
##
## Cumulative projected inertia (%):
## Ax1 Ax1:2 Ax1:3 Ax1:4
## 45.06 77.84 92.49 100.00
The first three principal axes account for 92.49% of the total variance (inertia).
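As a check on that figure, the percentages can be recomputed from the eigenvalues printed in the summary: each axis's share is its eigenvalue divided by the total inertia. This is a minimal sketch that copies the four rounded eigenvalues by hand (the names `eig`, `pct`, and `cum_pct` are mine); with the fitted object, `hobbitPCA$eig` would supply the unrounded values, so the last digit may differ slightly.

```r
# Eigenvalues copied from the summary(hobbitPCA) output (rounded to 4 decimals)
eig <- c(Ax1 = 1.8023, Ax2 = 1.3115, Ax3 = 0.5859, Ax4 = 0.3003)

# Share of the total inertia carried by each axis, in percent
pct <- 100 * eig / sum(eig)

# Running total across axes
cum_pct <- cumsum(pct)

print(round(pct, 2))
print(round(cum_pct, 2))
```

Because the PCA is run on four scaled variables, the eigenvalues sum to the total inertia of 4, and the cumulative share reaches 100% at the fourth axis.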
s.class(
hobbitPCA$li,
fac = hobbits$Species,
col = rainbow(9),
axesell = FALSE,
grid = FALSE,
cstar = 0,
cpoint = 2,
clab = 0.5,
sub = "PCA of hobbits by hominid species"
)
The Homo floresiensis skull most closely resembles those of H. erectus
and H. habilis. It is hard to say which of the two it resembles more,
since its cluster lies between the two species and does not overlap
with either.
hobbit.dist <- dist(hobbitsLog)
hobbit.hc <- hclust(hobbit.dist, method = "mcquitty")
plot(hobbit.hc,
labels = hobbits$Species,
cex = 0.25)
In my analysis, the hobbit is most similar to Homo erectus and Homo
habilis.