Genes (DNA sequences) contain the instructions for producing proteins (via an intermediate called RNA), which are the molecular workhorses in humans and other organisms. The amount of RNA produced, which is called “expression level” in biology, varies for each gene depending on the need for various proteins in different cells in the body. These needs can change during disease, such as cancer, which is the basis for molecular diagnostic tests to detect diseases using gene expression profiles.
The objective of this project is to use gene expression data to classify five common types of cancer (breast, colon, kidney, lung, and prostate cancer).
Gene expression data were obtained from the UCI Machine Learning Repository (Dataset C5R88H).
The main data file was too large (206 MB) to import directly, so I downloaded and analyzed a local copy.
tcga_pancan_data_df <-
read_csv('Data/TCGA-PANCAN-HiSeq-801x20531/data.csv', show_col_types = FALSE)
## New names:
## • `` -> `...1`
tcga_pancan_labels_df <-
read_csv('Data/TCGA-PANCAN-HiSeq-801x20531/labels.csv', show_col_types = FALSE)
## New names:
## • `` -> `...1`
I dropped the first column of both datasets because it only contained sample IDs, which are not informative for machine learning.
tcga_pancan_data_df <- tcga_pancan_data_df[, 2:ncol(tcga_pancan_data_df)]
tcga_pancan_labels_df <- tcga_pancan_labels_df[, 2:ncol(tcga_pancan_labels_df)]
The ‘data’ dataframe has 801 rows (samples) and 20,531 columns (genes). Each gene_# column contains numeric values representing the expression level of that gene in each sample.
dim(tcga_pancan_data_df)
## [1] 801 20531
The dataframe is too big to glimpse(), so I show the
first and last few rows and columns below.
head(tcga_pancan_data_df)
## # A tibble: 6 × 20,531
## gene_0 gene_1 gene_2 gene_3 gene_4 gene_5 gene_6 gene_7 gene_8 gene_9 gene_10
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 2.02 3.27 5.48 10.4 0 7.18 0.592 0 0 0.592
## 2 0 0.593 1.59 7.59 9.62 0 6.82 0 0 0 0
## 3 0 3.51 4.33 6.88 9.87 0 6.97 0.453 0 0 0
## 4 0 3.66 4.51 6.66 10.2 0 7.84 0.435 0 0 0
## 5 0 2.66 2.82 6.54 9.74 0 6.57 0.361 0 0 0
## 6 0 3.47 3.58 6.62 9.71 0 7.76 0 0 0 0.515
## # ℹ 20,520 more variables: gene_11 <dbl>, gene_12 <dbl>, gene_13 <dbl>,
## # gene_14 <dbl>, gene_15 <dbl>, gene_16 <dbl>, gene_17 <dbl>, gene_18 <dbl>,
## # gene_19 <dbl>, gene_20 <dbl>, gene_21 <dbl>, gene_22 <dbl>, gene_23 <dbl>,
## # gene_24 <dbl>, gene_25 <dbl>, gene_26 <dbl>, gene_27 <dbl>, gene_28 <dbl>,
## # gene_29 <dbl>, gene_30 <dbl>, gene_31 <dbl>, gene_32 <dbl>, gene_33 <dbl>,
## # gene_34 <dbl>, gene_35 <dbl>, gene_36 <dbl>, gene_37 <dbl>, gene_38 <dbl>,
## # gene_39 <dbl>, gene_40 <dbl>, gene_41 <dbl>, gene_42 <dbl>, …
tcga_pancan_data_df[1:5, (ncol(tcga_pancan_data_df) - 3):ncol(tcga_pancan_data_df)]
## # A tibble: 5 × 4
## gene_20527 gene_20528 gene_20529 gene_20530
## <dbl> <dbl> <dbl> <dbl>
## 1 9.65 8.92 5.29 0
## 2 10.5 9.40 2.09 0
## 3 9.79 10.1 1.68 0
## 4 9.69 9.68 3.29 0
## 5 9.22 9.46 5.11 0
Finally, I checked for duplicates and missing values in the data (there were none).
sprintf("Duplicate rows in dataset: %d", count_duplicate_rows(tcga_pancan_data_df))
## [1] "Duplicate rows in dataset: 0"
get_na_vars(tcga_pancan_data_df)
## [1] "There are no variables with missing values"
The ‘labels’ dataframe has 801 rows (samples) and 1 column of cancer abbreviations, which are shorthand for five types of cancer (BRCA = breast cancer, COAD = colon cancer, KIRC = kidney cancer, LUAD = lung cancer, PRAD = prostate cancer).
dim(tcga_pancan_labels_df)
## [1] 801 1
unique(tcga_pancan_labels_df$Class)
## [1] "PRAD" "LUAD" "BRCA" "KIRC" "COAD"
I recoded these abbreviations as numbers.
tcga_pancan_labels_df <- tcga_pancan_labels_df %>%
mutate(
cancer_type = case_when(Class == 'BRCA' ~ 0,
Class == 'COAD' ~ 1,
Class == 'KIRC' ~ 2,
Class == 'LUAD' ~ 3,
Class == 'PRAD' ~ 4)
) %>%
select(cancer_type)
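Later tables map these numeric codes back to the cancer abbreviations using a lookup named get_cancer_code, which is defined in the project's setup code. A minimal sketch of what that lookup presumably looks like (an assumption, shown here for readability):
# Assumed definition: named character vector mapping the numeric code (as a string)
# back to the TCGA abbreviation
get_cancer_code <- c('0' = 'BRCA', '1' = 'COAD', '2' = 'KIRC', '3' = 'LUAD', '4' = 'PRAD')
# Example: unname(get_cancer_code['2']) returns "KIRC"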
There are five types of cancer in the dataset. The most common is breast cancer (BRCA) and the least common is colon cancer (COAD). This class imbalance will need to be addressed during pre-processing.
# Cancer types: 0 = breast, 1 = colon, 2 = kidney, 3 = lung, 4 = prostate
tcga_pancan_labels_df %>%
group_by(cancer_type) %>%
summarise(
n = n()
) %>%
arrange(desc(n))
## # A tibble: 5 × 2
## cancer_type n
## <dbl> <int>
## 1 0 300
## 2 2 146
## 3 3 141
## 4 4 136
## 5 1 78
There are too many genes to characterize individually, so I analyzed one as an example and then calculated summary statistics for all of them as a group.
The expression of ‘gene_1’ ranges from 0 to 6.237, with a mean of 3.0 and a standard deviation of 1.2.
tcga_pancan_data_df %>%
summarise(min = round(min(gene_1), 3),
max = round(max(gene_1), 3),
mean = round(mean(gene_1), 1),
sd = round(sd(gene_1), 1),
median = round(median(gene_1), 3),
IQR = round(IQR(gene_1), 3),
n = n())
## # A tibble: 1 × 7
## min max mean sd median IQR n
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 0 6.24 3 1.2 3.14 1.58 801
The expression values appear to be mostly normally distributed, although there are some zero values (ie, samples in which the gene is not expressed).
ggplot(tcga_pancan_data_df, aes(x = gene_1)) +
geom_histogram(bins = 50, fill = 'steelblue', color = 'darkgray') +
coord_cartesian(xlim = c(0, 8), ylim = c(0, 50)) +
labs(x = 'Expression of gene_1', y = 'Count',
title = 'Histogram of expression values for gene_1') +
guides(
x = guide_axis(minor.ticks = TRUE, cap = 'both'),
y = guide_axis(minor.ticks = TRUE, cap = 'both')
) +
theme_classic() +
theme(
axis.title = element_text(face = 'bold'),
plot.title = element_text(face = 'bold')
)
The gene expression values range from 0 to 20.779.
gene_expr_values <- as.vector(as.matrix(
tcga_pancan_data_df[, 1:(ncol(tcga_pancan_data_df) - 1)]))
(gene_expr_stats <- summary(gene_expr_values))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.600 7.610 6.444 9.702 20.779
Using the quartile-based 1.5 × IQR rule, the upper threshold for potential outliers (20.353) falls just below the maximum value, so only expression levels at or near the maximum may be outliers.
IQR <- gene_expr_stats['3rd Qu.'] - gene_expr_stats['1st Qu.']
lower_threshold <- gene_expr_stats['1st Qu.'] - 1.5 * IQR
# Note that the lowest biologically possible expression level is 0
lower_threshold <- ifelse((lower_threshold < 0), 0, lower_threshold)
upper_threshold <- gene_expr_stats['3rd Qu.'] + 1.5 * IQR
sprintf('Expression values less than %.3f or greater than %.3f may be outliers',
lower_threshold, upper_threshold)
## [1] "Expression values less than 0.000 or greater than 20.353 may be outliers"
These summary statistics suggest that the data should be normalized.
I also determined the number of genes with zero expression and, more generally, the number with near-zero variance. These columns will need to be removed during pre-processing, because they are not informative for machine learning.
# Number of zero columns
n_zero_col <- sum(colSums(tcga_pancan_data_df != 0) == 0)
sprintf('There are %d zero columns', n_zero_col)
## [1] "There are 267 zero columns"
# Indexes of columns with near-zero variance
zero_variance_col_idx <- caret::nearZeroVar(tcga_pancan_data_df)
# Number of such columns
sprintf('There are %d near-zero variance columns', length(zero_variance_col_idx))
## [1] "There are 1259 near-zero variance columns"
I selected algorithms based on problem type (supervised multiclass classification), dataset characteristics, model strengths/weaknesses, and the assignment requirement of selecting one algorithm from Weeks 1-10 and the other from Weeks 11-15.
The first algorithm I selected was random forest, which was discussed during Week 6. The main rationale for random forest is that it can handle large datasets with many features, such as this gene expression dataset with its >20,000 features. In addition, random forest models can be used to perform multiclass classification.
The second algorithm I selected was feedforward neural network, which was discussed during Week 13. The main rationale for neural networks is that they can detect complex patterns in data, which may be advantageous for the many features (genes) in the cancer dataset. In addition, neural network models can be used to perform multiclass classification.
Data were split 70:30 into training and test datasets. I mitigated class imbalance in the training dataset using Synthetic Minority Oversampling Technique (SMOTE). Subsequently, I used principal component analysis (PCA) to reduce dimensionality.
For the classification problem, prediction of a class (cancer type) influences the type of treatment that the patient will receive. Both false positives (predicting type ‘X’ for a cancer that is actually not type ‘X’) and false negatives (predicting ‘not X’ for a cancer that is actually type ‘X’) would result in loss of time and resources spent on incorrect treatments (not to mention malpractice lawsuits). The magnitude of the loss depends on the type of cancer, because some types are more aggressive than others (eg, lung cancer tends to worsen much faster than prostate cancer), which could affect whether a patient has time to receive a correct diagnosis and treatment.
This suggests that both precision and recall should be assessed. In addition, F1 score, which is the harmonic mean of precision and recall, would be helpful to assess the balance between the two metrics. For completeness, I also evaluated overall accuracy, balanced accuracy for each class, and model training time.
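For reference, the F1 score is calculated as \(F_1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}\).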
To enable statistical comparison of model performance, I used 10-fold
cross-validation to calculate 95% confidence intervals of performance
metrics. However, I encountered some challenges with calculating and reporting these intervals by class. After extensive searching, I concluded that existing R packages for machine learning, such as caret, do not provide this functionality. As a result, I wrote custom functions to generate the desired output.
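Minimal sketches of these helpers are shown below; they are assumptions about the implementation (the actual definitions, along with the constants n_CVfolds = 10 and n_cores, live in the project's setup code), using a t-based confidence interval computed from the per-fold means and standard deviations.
# Assumed sketches of the custom helpers used in the cross-validation sections below
calc_ci95 <- function(metric_mean, metric_sd, n) {
  # t-based 95% CI from the mean and SD of a metric across n folds
  margin <- qt(0.975, df = n - 1) * metric_sd / sqrt(n)
  # Metrics are proportions, so clamp to [0, 1] (assumption; reported intervals never exceed 1)
  list(ci95_lower = max(0, metric_mean - margin),
       ci95_upper = min(1, metric_mean + margin))
}
format_ci95_str <- function(lower, upper) {
  # Format a 95% CI as a compact string, eg "(0.97, 1)"
  sprintf('(%s, %s)', round(lower, 3), round(upper, 3))
}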
As noted in section 3.2.2, there are 1,259 near-zero variance variables. Here, I remove them using the column indices identified during exploratory data analysis (section 3.2.2, Near-zero variance genes).
tcga_pancan_data_df <- tcga_pancan_data_df[, -zero_variance_col_idx]
I normalized gene expression using min-max normalization, so that all values are between 0 and 1.
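Specifically, each gene's values are rescaled as \(x' = \frac{x - \min(x)}{\max(x) - \min(x)}\), which is what preProcess() computes with method = 'range'.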
normalize_process <- preProcess(tcga_pancan_data_df, method = c('range'))
tcga_pancan_data_df <- predict(normalize_process, tcga_pancan_data_df)
gene_expr_values_norm <- as.vector(as.matrix(
tcga_pancan_data_df[, 1:(ncol(tcga_pancan_data_df) - 1)]))
(gene_expr_stats <- summary(gene_expr_values_norm))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2791 0.4587 0.4337 0.6060 1.0000
First, I combined the ‘data’ and ‘labels’ dataframes.
tcga_pancan_all_df <- cbind(tcga_pancan_data_df, tcga_pancan_labels_df)
Then I created training and test datasets using a 70:30 split.
train_idx <- createDataPartition(tcga_pancan_all_df$cancer_type, p = 0.7, list = FALSE)
train_imbalanced <- tcga_pancan_all_df[train_idx, ]
test <- tcga_pancan_all_df[-train_idx, ]
The training dataset is imbalanced with respect to target variable classes.
(target_class_freq_df <- as.data.frame(table(train_imbalanced$cancer_type)))
## Var1 Freq
## 1 0 213
## 2 1 59
## 3 2 95
## 4 3 99
## 5 4 96
To balance the training dataset, I used the ‘scutr’
package to oversample the minority classes using SMOTE.
# The goal is for each minority class to have the same number of samples as the majority class
# This is specified by the 'm' parameter in the oversample_smote() function
#
# This block may take 1-2 minutes to run
n_samples <- max(target_class_freq_df$Freq)
train_smote_cls1 <- scutr::oversample_smote(data = train_imbalanced,
cls = 1, cls_col = 'cancer_type', m = n_samples)
train_smote_cls2 <- scutr::oversample_smote(data = train_imbalanced,
cls = 2, cls_col = 'cancer_type', m = n_samples)
train_smote_cls3 <- scutr::oversample_smote(data = train_imbalanced,
cls = 3, cls_col = 'cancer_type', m = n_samples)
train_smote_cls4 <- scutr::oversample_smote(data = train_imbalanced,
cls = 4, cls_col = 'cancer_type', m = n_samples)
# Combine the class subsets to create a balanced training dataset
train_cls0 <- filter(train_imbalanced, cancer_type == 0) # majority class
train_smote <- rbind(train_cls0, train_smote_cls1, train_smote_cls2,
train_smote_cls3, train_smote_cls4)
# Subsets are no longer needed
rm(list = c('train_cls0', 'train_smote_cls1', 'train_smote_cls2', 'train_smote_cls3',
'train_smote_cls4'))
The classes of cancer_type in the SMOTE training dataset
are now balanced.
table(train_smote$cancer_type)
##
## 0 1 2 3 4
## 213 213 213 213 213
I used PCA to reduce the dimensions of the training and test datasets. Of note, the scaling and PCA transformations are fit on the training data only and then applied to both datasets.
# Separate features and target variable in training and test datasets
X_train <- train_smote %>% select(-cancer_type)
y_train <- train_smote$cancer_type
X_test <- test %>% select(-cancer_type)
y_test <- test$cancer_type
# Feature scaling
scaler <- caret::preProcess(X_train, method = c('center', 'scale'))
X_train_scaled <- predict(scaler, X_train)
X_test_scaled <- predict(scaler, X_test) # Apply same scaling
# Apply PCA to scaled training data
# This may take a couple of minutes
pca_model <- prcomp(X_train_scaled, center = FALSE, scale. = FALSE)
The scree plot shows that the first two principal components (PCs) explain ~25% of the variance in the training data.
factoextra::fviz_eig(pca_model, ylim = c(0, 20))
Analysis of the cumulative proportion of variance explained indicates that the first 96 PCs need to be retained to explain 80% of the variance in the training data.
# Calculate cumulative proportion of variance explained by PCs
pca_variance <- (pca_model$sdev)^2
prop_variance <- pca_variance / sum(pca_variance)
cumulative_prop_df <- data.frame(x = 1:length(prop_variance), y = cumsum(prop_variance))
# Number of PCs needed to explain X% of variance
threshold <- 0.8
npc_criterion <- threshold - 0.005 # accounts for rounding
npc_needed <- min(which(cumulative_prop_df$y >= npc_criterion))
sprintf('%d of %d PCs are needed to explain %.1f%% of variance',
npc_needed, nrow(cumulative_prop_df), threshold * 100)
## [1] "96 of 1065 PCs are needed to explain 80.0% of variance"
The test dataset can now be transformed using the same PCA rotation, retaining the first 96 PCs.
# Get the PCA-transformed, SMOTE-balanced training data
# The 'x' element of prcomp() output contains the PC scores (transformed coordinates)
train_smote_pca <- as.data.frame(pca_model$x[, 1 : npc_needed])
train_smote_pca$cancer_type <- as.factor(y_train) # Add the target variable
# Transform the test data using the same PCA transformation from the training data
test_pca <- as.data.frame(predict(pca_model, X_test_scaled))
# Retain the same number of PCs used for the training data
test_pca <- test_pca[, 1 : npc_needed]
test_pca$cancer_type <- as.factor(y_test) # Add the target variable
To enable statistical comparison of model performance with cross-validation, I created 10 folds from the SMOTE-balanced, PCA-transformed training dataset.
folds <- createFolds(train_smote_pca$cancer_type, k = n_CVfolds)
At this point, each fold contains only the row numbers of the training dataframe that belong to that fold; the folds are used during model development below.
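For illustration, a single fold's row numbers would be used to split the training data as sketched below (an assumed usage example; the cross-validation loops later in the report apply the same pattern to every fold).
# Illustrative only: hold out one fold for validation, train on the rest
val_idx  <- folds[[1]]                    # row numbers belonging to the first fold
cv_val   <- train_smote_pca[val_idx, ]    # held-out validation subset
cv_train <- train_smote_pca[-val_idx, ]   # remaining rows used for training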
Algorithm: randomForest()
Parameters:
ntree (number of trees) = 1000. I followed the rule of thumb of starting with roughly 10 times the number of features (ie, 96 features in the PCA-transformed training dataset \(\times\) 10 = 960 trees, rounded up to 1000).[1]
mtry (number of variables that are randomly sampled at each split) = 10. This is approximately the square root of the number of features (ie, \(\sqrt{96} \approx 9.8\)).
set.seed(1403)
rf0_runtime <- system.time(
rf0_model <- randomForest(cancer_type ~ ., data = train_smote_pca,
ntree = 1000, mtry = 10))
rf0_runtime <- rf0_runtime[['elapsed']]
sprintf('Runtime: %.1f seconds', rf0_runtime)
## [1] "Runtime: 2.8 seconds"
Use the model to make class predictions
rf0_predictions <- predict(rf0_model, newdata = test_pca, type = 'class')
Confusion matrix
rf0_confusion_matrix <- caret::confusionMatrix(data = rf0_predictions,
reference = test_pca$cancer_type)
rf0_confusion_matrix$table
## Reference
## Prediction 0 1 2 3 4
## 0 87 0 0 0 0
## 1 0 18 0 0 0
## 2 0 0 51 0 0
## 3 0 1 0 42 1
## 4 0 0 0 0 39
The baseline model performed well and had 99.2% overall accuracy (95% CI, 97.0% to 99.9%).
sprintf('Overall accuracy (95%% CI): %.3f (%.3f, %.3f)',
rf0_confusion_matrix$overall['Accuracy'],
rf0_confusion_matrix$overall['AccuracyLower'],
rf0_confusion_matrix$overall['AccuracyUpper'])
## [1] "Overall accuracy (95% CI): 0.992 (0.970, 0.999)"
By class, the balanced accuracy ranged from 97.4% for colon cancer to 100% for breast and kidney cancer. Precision (range, 95.5% to 100%) and recall (range, 94.7% to 100%) were balanced, which was also reflected in the F1 score (range, 97.3% to 100%). Of note, the model predicted all breast and kidney cancers correctly.
rf0_class_metrics_df <- as.data.frame(rf0_confusion_matrix$byClass)
rf0_class_metrics_df <- cbind(cancer_type = rownames(rf0_class_metrics_df),
rf0_class_metrics_df)
rownames(rf0_class_metrics_df) <- NULL
rf0_class_metrics_df %>%
mutate(
cancer_type = unname(get_cancer_code[str_sub(cancer_type, -1, -1)]),
balanced_accuracy = round(`Balanced Accuracy`, 3),
precision = round(Precision, 3),
recall = round(Recall, 3),
F1_score = round(F1, 3)
) %>%
select(cancer_type, balanced_accuracy, precision, recall, F1_score)
## cancer_type balanced_accuracy precision recall F1_score
## 1 BRCA 1.000 1.000 1.000 1.000
## 2 COAD 0.974 1.000 0.947 0.973
## 3 KIRC 1.000 1.000 1.000 1.000
## 4 LUAD 0.995 0.955 1.000 0.977
## 5 PRAD 0.988 1.000 0.975 0.987
To determine whether classification performance could be improved
further, I tuned mtry (number of variables that are
randomly sampled at each split) by assessing the out-of-bag error for
values less than or greater than the initial value
(mtry = 10). I used the same number of trees,
ntreeTry, as the baseline model (1000).
The plot below shows that the out-of-bag error is minimized when
mtry = 5.
set.seed(1403)
optimal_mtry <- tuneRF(x = train_smote_pca[, -ncol(train_smote_pca)], # features
y = train_smote_pca[, ncol(train_smote_pca)], # target variable
mtryStart = 10, stepFactor = 1.5,
plot = TRUE, ntreeTry = 1000)
## mtry = 10 OOB error = 0.38%
## Searching left ...
## mtry = 7 OOB error = 0.28%
## 0.25 0.05
## mtry = 5 OOB error = 0.19%
## 0.3333333 0.05
## mtry = 4 OOB error = 0.38%
## -1 0.05
## Searching right ...
## mtry = 15 OOB error = 0.38%
## -1 0.05
best_mtry <- as.data.frame(optimal_mtry) %>%
slice_min(OOBError) %>%
pull(mtry)
sprintf('The optimal value of mtry is %d', best_mtry)
## [1] "The optimal value of mtry is 5"
Then, I reran the random forest model with the best value of
mtry and compared the performance with the previous
version.
set.seed(1403)
rf1_runtime <- system.time(
rf1_model <- randomForest(cancer_type ~ ., data = train_smote_pca,
ntree = 1000, mtry = best_mtry))
rf1_runtime <- rf1_runtime[['elapsed']]
sprintf('Runtime: %.1f seconds', rf1_runtime)
## [1] "Runtime: 3.1 seconds"
Use the model to make class predictions
rf1_predictions <- predict(rf1_model, newdata = test_pca, type = 'class')
Confusion matrix
rf1_confusion_matrix <- caret::confusionMatrix(data = rf1_predictions,
reference = test_pca$cancer_type)
rf1_confusion_matrix$table
## Reference
## Prediction 0 1 2 3 4
## 0 87 0 0 0 1
## 1 0 18 0 0 0
## 2 0 0 51 0 0
## 3 0 1 0 42 0
## 4 0 0 0 0 39
This change did not affect overall accuracy, and the per-class metrics shifted only marginally.
sprintf('Overall accuracy of tuned model: %.3f', rf1_confusion_matrix$overall['Accuracy'])
## [1] "Overall accuracy of tuned model: 0.992"
sprintf('Change in overall accuracy vs baseline model: %.3f%%',
100 * (rf1_confusion_matrix$overall['Accuracy'] - rf0_confusion_matrix$overall['Accuracy']) /
rf0_confusion_matrix$overall['Accuracy'])
## [1] "Change in overall accuracy vs baseline model: 0.000%"
rf1_class_metrics_df <- as.data.frame(rf1_confusion_matrix$byClass)
rf1_class_metrics_df <- cbind(cancer_type = rownames(rf1_class_metrics_df),
rf1_class_metrics_df)
rownames(rf1_class_metrics_df) <- NULL
rf1_class_metrics_df %>%
mutate(
cancer_type = unname(get_cancer_code[str_sub(cancer_type, -1, -1)]),
balanced_accuracy = round(`Balanced Accuracy`, 3),
precision = round(Precision, 3),
recall = round(Recall, 3),
F1_score = round(F1, 3)
) %>%
select(cancer_type, balanced_accuracy, precision, recall, F1_score)
## cancer_type balanced_accuracy precision recall F1_score
## 1 BRCA 0.997 0.989 1.000 0.994
## 2 COAD 0.974 1.000 0.947 0.973
## 3 KIRC 1.000 1.000 1.000 1.000
## 4 LUAD 0.997 0.977 1.000 0.988
## 5 PRAD 0.988 1.000 0.975 0.987
Since the classification performance of the two models was nearly identical, I performed cross-validation using the tuned value of mtry.
set.seed(1403)
rf2_runtime <- system.time(
cv_results <- mclapply(folds, function(x) {
# Training and test datasets
exp_train <- train_smote_pca[-x, ]
exp_test <- train_smote_pca[x, ]
# Use training data to create random forest classification model
exp_model <- randomForest(cancer_type ~ ., data = exp_train,
ntree = 1000, mtry = best_mtry)
# Make predictions on test data
exp_predictions <- predict(exp_model, newdata = exp_test, type = 'class')
# Actual (true) data for comparison
# exp_actual <- exp_test$cancer_type
# Create confusion matrix of actual vs predicted classes
exp_confusion_matrix <- confusionMatrix(data = exp_predictions,
reference = exp_test$cancer_type)
# Extract performance metrics from confusion matrix
exp_overall_accuracy <- exp_confusion_matrix$overall['Accuracy']
# Class metrics
exp_class_metrics_df <- as.data.frame(exp_confusion_matrix$byClass)
exp_balanced_accuracy_vec <- exp_class_metrics_df$`Balanced Accuracy`
exp_precision_vec <- exp_class_metrics_df$Precision
exp_recall_vec <- exp_class_metrics_df$Recall
exp_f1_vec <- exp_class_metrics_df$F1
# Combine class metrics into dataframe
exp_class_metrics_df <- as.data.frame(cbind(class = c(0, 1, 2, 3, 4),
balanced_accuracy = exp_balanced_accuracy_vec,
precision = exp_precision_vec,
recall = exp_recall_vec,
F1_score = exp_f1_vec))
# Return experiment metrics as a list
return(list(overall_accuracy = exp_overall_accuracy,
class_metrics_df = exp_class_metrics_df))
}, mc.cores = n_cores))
rf2_runtime <- rf2_runtime[['elapsed']]
sprintf('Runtime: %.1f seconds', rf2_runtime)
## [1] "Runtime: 5.9 seconds"
For overall accuracy, the 95% confidence interval is 99.3% to 100%. Performance metrics by class are similar. Furthermore, the overlapping confidence intervals for all cancer types indicate that the model does not perform significantly differently for any cancer type. In addition, the computation time was quite short.
These results show that the random forest model performs very well for all five cancer types.
# Extract performance metrics for each fold
cv_overall_accuracy_vec <- sapply(cv_results, function(x) x$overall_accuracy)
# Calculate 95% CIs
cv_overall_accuracy_mean <- mean(cv_overall_accuracy_vec)
cv_overall_accuracy_sd <- sd(cv_overall_accuracy_vec)
cv_overall_accuracy_ci95 <- calc_ci95(cv_overall_accuracy_mean, cv_overall_accuracy_sd,
n_CVfolds)
sprintf('Overall accuracy 95%% CI: (%.3f, %.3f)',
cv_overall_accuracy_ci95$ci95_lower, cv_overall_accuracy_ci95$ci95_upper)
## [1] "Overall accuracy 95% CI: (0.993, 1.000)"
# Extract performance results by class for all folds, then bind them into a new dataframe
cv_class_metrics_df <- lapply(cv_results, function(x) x$class_metrics_df) %>%
bind_rows()
# For each class (cancer type), calculate 95% CI for all performance metrics
cv_class_metrics_summary_df <- cv_class_metrics_df %>%
group_by(class) %>%
reframe(
# Balanced accuracy
balanced_accuracy_ci95 =
format_ci95_str(calc_ci95(mean(balanced_accuracy), sd(balanced_accuracy),
n_CVfolds)$ci95_lower,
calc_ci95(mean(balanced_accuracy), sd(balanced_accuracy),
n_CVfolds)$ci95_upper),
# Precision
precision_ci95 =
format_ci95_str(calc_ci95(mean(precision), sd(precision), n_CVfolds)$ci95_lower,
calc_ci95(mean(precision), sd(precision), n_CVfolds)$ci95_upper),
# Recall
recall_ci95 =
format_ci95_str(calc_ci95(mean(recall), sd(recall), n_CVfolds)$ci95_lower,
calc_ci95(mean(recall), sd(recall), n_CVfolds)$ci95_upper),
# F1 score
F1_score_ci95 =
format_ci95_str(calc_ci95(mean(F1_score), sd(F1_score), n_CVfolds)$ci95_lower,
calc_ci95(mean(F1_score), sd(F1_score), n_CVfolds)$ci95_upper),
) %>%
ungroup() %>%
mutate(
cancer_type = unname(get_cancer_code[as.character(class)])
) %>%
select(cancer_type, balanced_accuracy_ci95, precision_ci95, recall_ci95, F1_score_ci95)
# Add model name
rf_cv_class_metrics_summary_df <- cv_class_metrics_summary_df %>%
mutate(
model = 'RF',
.before = cancer_type
)
rf_cv_class_metrics_summary_df %>%
kbl() %>%
kable_styling()
| model | cancer_type | balanced_accuracy_ci95 | precision_ci95 | recall_ci95 | F1_score_ci95 |
|---|---|---|---|---|---|
| RF | BRCA | (0.996, 1) | (0.97, 1) | (1, 1) | (0.984, 1) |
| RF | COAD | (1, 1) | (1, 1) | (1, 1) | (1, 1) |
| RF | KIRC | (0.993, 1) | (1, 1) | (0.987, 1) | (0.993, 1) |
| RF | LUAD | (0.989, 1) | (1, 1) | (0.978, 1) | (0.989, 1) |
| RF | PRAD | (1, 1) | (1, 1) | (1, 1) | (1, 1) |
The variable importance plot shows that the first two PCs (PC1 and PC2) are the most important variables, and that the mean decrease in Gini impurity plateaus beyond PC15.
# The baseline and optimized models had identical performance,
# so here I just use the baseline model
# Calculate variable importance scores
rf0_model_var_imp_df <- vip::vi(rf0_model)
# Plot
var_imp_plot1 <- vip::vip(rf0_model_var_imp_df, num_features = 15, include_type = TRUE) +
ggtitle('Random forest variable importance') +
theme_classic()
var_imp_plot1
Algorithm: nnet()
Parameters:
Number of neurons in hidden layer: size = 5
Weight decay: decay = 0 (default value)
Notes:
Note that the nnet package is limited to neural networks with a single hidden layer, so the number of layers is not specified.
The model has 5 neurons because the data have 5 classes (cancer types).
Although the nnet() function has a logical softmax parameter, it does not need to be set explicitly. When the target variable is a factor with more than two levels, the function automatically uses a softmax output activation function.
set.seed(1403)
nn0_runtime <- system.time(
nn0_model <- nnet(cancer_type ~ ., data = train_smote_pca, size = 5)
)
## # weights: 515
## initial value 1825.274188
## iter 10 value 71.839843
## iter 20 value 25.910201
## iter 30 value 8.279478
## iter 40 value 6.514153
## iter 50 value 5.344508
## iter 60 value 4.803838
## iter 70 value 4.791983
## iter 80 value 4.783975
## iter 90 value 4.161583
## iter 100 value 4.161337
## final value 4.161337
## stopped after 100 iterations
nn0_runtime <- nn0_runtime[['elapsed']]
sprintf('Runtime: %.1f seconds', nn0_runtime)
## [1] "Runtime: 0.2 seconds"
Use the model to make class predictions
nn0_predictions <- predict(nn0_model, newdata = test_pca, type = 'class')
nn0_predictions <- factor(nn0_predictions, levels = levels(test_pca$cancer_type))
Confusion matrix
nn0_confusion_matrix <- caret::confusionMatrix(data = nn0_predictions,
reference = test_pca$cancer_type)
nn0_confusion_matrix$table
## Reference
## Prediction 0 1 2 3 4
## 0 87 0 0 0 0
## 1 0 17 0 0 0
## 2 0 0 51 0 0
## 3 0 2 0 42 0
## 4 0 0 0 0 40
The baseline model performed well and had 99.2% overall accuracy (95% CI, 97.0% to 99.9%).
sprintf('Overall accuracy (95%% CI): %.3f (%.3f, %.3f)',
nn0_confusion_matrix$overall['Accuracy'],
nn0_confusion_matrix$overall['AccuracyLower'],
nn0_confusion_matrix$overall['AccuracyUpper'])
## [1] "Overall accuracy (95% CI): 0.992 (0.970, 0.999)"
By class, the balanced accuracy ranged from 94.7% for colon cancer to 100% for breast, kidney, and prostate cancer. Precision (range, 95.5% to 100%) and recall (range, 89.5% to 100%) were balanced, which was also reflected in the F1 score (range, 94.4% to 100%). Of note, the model predicted all breast, kidney, and prostate cancers correctly.
nn0_class_metrics_df <- as.data.frame(nn0_confusion_matrix$byClass)
nn0_class_metrics_df <- cbind(cancer_type = rownames(nn0_class_metrics_df),
nn0_class_metrics_df)
rownames(nn0_class_metrics_df) <- NULL
nn0_class_metrics_df %>%
mutate(
cancer_type = unname(get_cancer_code[str_sub(cancer_type, -1, -1)]),
balanced_accuracy = round(`Balanced Accuracy`, 3),
precision = round(Precision, 3),
recall = round(Recall, 3),
F1_score = round(F1, 3)
) %>%
select(cancer_type, balanced_accuracy, precision, recall, F1_score)
## cancer_type balanced_accuracy precision recall F1_score
## 1 BRCA 1.000 1.000 1.000 1.000
## 2 COAD 0.947 1.000 0.895 0.944
## 3 KIRC 1.000 1.000 1.000 1.000
## 4 LUAD 0.995 0.955 1.000 0.977
## 5 PRAD 1.000 1.000 1.000 1.000
I kept the number of hidden neurons equal to the number of classes of the target variable (size = 5), as in the baseline model, and only tuned the decay hyperparameter.
# This function performs a single run of a neural network model
# Inputs:
# train, test (dataframe): data
# formula for model relating target and predictor variables
# size, decay (numeric): hyperparameters
# model_num (integer): model number
# results_tracker_df: dataframe with results of previous runs
# Output:
# results_tracker_df: updated dataframe with results of current run
train_nn <- function(train, test, formula, size, decay, model_num, results_tracker_df) {
nn_runtime <- system.time(
nn_model <- nnet(formula = formula, data = train, size = size, decay = decay,
trace = FALSE)
)
nn_runtime <- nn_runtime[['elapsed']]
# Make predictions and ensure factor levels are consistent
predictions <- predict(nn_model, newdata = test, type = "class")
predictions <- factor(predictions, levels = levels(test$cancer_type))
# Confusion matrix
confusion_matrix <- caret::confusionMatrix(data = predictions,
reference = test$cancer_type)
# Model accuracy
overall_accuracy <- confusion_matrix$overall['Accuracy']
# Document results
model_id <- paste0('nn', model_num)
exp_results <- c(model_id, size, decay, round(overall_accuracy, 3), round(nn_runtime, 3))
results_tracker_df <- rbind(results_tracker_df, exp_results)
return(results_tracker_df)
}
set.seed(1403)
# Create experiment tracker
results_tracker_df <- data.frame()
decay_vals <- c(0, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 0.1, 0.5)
# Run experiments
for (i in 1:length(decay_vals)) {
results_tracker_df <- train_nn(
train = train_smote_pca, test = test_pca, formula = cancer_type ~ .,
size = 5, decay = decay_vals[i], model_num = i, results_tracker_df)
}
colnames(results_tracker_df) <- c('model', 'size', 'decay', 'overall_accuracy', 'runtime_sec')
The overall model accuracy was maximized when
decay = 0.01. Computation times were negligible.
results_tracker_df %>%
arrange(desc(overall_accuracy)) %>%
kbl() %>%
kable_styling()
| model | size | decay | overall_accuracy | runtime_sec |
|---|---|---|---|---|
| nn7 | 5 | 0.01 | 1 | 0.246 |
| nn1 | 5 | 0 | 0.992 | 0.196 |
| nn3 | 5 | 1e-06 | 0.992 | 0.232 |
| nn6 | 5 | 0.001 | 0.992 | 0.202 |
| nn4 | 5 | 1e-05 | 0.987 | 0.21 |
| nn5 | 5 | 1e-04 | 0.987 | 0.202 |
| nn2 | 5 | 1e-07 | 0.983 | 0.187 |
| nn8 | 5 | 0.1 | 0.983 | 0.28 |
| nn9 | 5 | 0.5 | 0.971 | 0.3 |
I performed cross-validation using the optimized model.
nn1_runtime <- system.time(
cv_results <- mclapply(folds, function(x) {
# Training and test datasets
exp_train <- train_smote_pca[-x, ]
exp_test <- train_smote_pca[x, ]
# Use training data to create neural network classification model
exp_model <- nnet(cancer_type ~ ., data = exp_train, size = 5, decay = 0.01,
trace = FALSE)
# Make predictions on test data
exp_predictions <- predict(exp_model, newdata = exp_test, type = 'class')
exp_predictions <- factor(exp_predictions, levels = levels(exp_test$cancer_type))
# Create confusion matrix of actual vs predicted classes
exp_confusion_matrix <- confusionMatrix(data = exp_predictions,
reference = exp_test$cancer_type)
# Extract performance metrics from confusion matrix
exp_overall_accuracy <- exp_confusion_matrix$overall['Accuracy']
# Class metrics
exp_class_metrics_df <- as.data.frame(exp_confusion_matrix$byClass)
exp_balanced_accuracy_vec <- exp_class_metrics_df$`Balanced Accuracy`
exp_precision_vec <- exp_class_metrics_df$Precision
exp_recall_vec <- exp_class_metrics_df$Recall
exp_f1_vec <- exp_class_metrics_df$F1
# Combine class metrics into dataframe
exp_class_metrics_df <- as.data.frame(cbind(class = c(0, 1, 2, 3, 4),
balanced_accuracy = exp_balanced_accuracy_vec,
precision = exp_precision_vec,
recall = exp_recall_vec,
F1_score = exp_f1_vec))
# Return experiment metrics as a list
return(list(overall_accuracy = exp_overall_accuracy,
class_metrics_df = exp_class_metrics_df))
}, mc.cores = n_cores)
)
nn1_runtime <- nn1_runtime[['elapsed']]
sprintf('Runtime: %.1f seconds', nn1_runtime)
## [1] "Runtime: 0.6 seconds"
For overall accuracy, the 95% confidence interval is 99.4% to 100%.
# Extract performance metrics for each fold
cv_overall_accuracy_vec <- sapply(cv_results, function(x) x$overall_accuracy)
# Calculate 95% CIs
cv_overall_accuracy_mean <- mean(cv_overall_accuracy_vec)
cv_overall_accuracy_sd <- sd(cv_overall_accuracy_vec)
cv_overall_accuracy_ci95 <- calc_ci95(cv_overall_accuracy_mean, cv_overall_accuracy_sd,
n_CVfolds)
sprintf('Overall accuracy 95%% CI: (%.3f, %.3f)',
cv_overall_accuracy_ci95$ci95_lower, cv_overall_accuracy_ci95$ci95_upper)
## [1] "Overall accuracy 95% CI: (0.994, 1.000)"
Performance metrics by class are similar. Furthermore, the overlapping confidence intervals for all cancer types indicate that the model does not perform significantly differently for any cancer type. In addition, the computation time was very short.
These results show that the feedforward neural network model performs very well for all five cancer types.
# Extract performance results by class for all folds, then bind them into a new dataframe
cv_class_metrics_df <- lapply(cv_results, function(x) x$class_metrics_df) %>%
bind_rows()
# For each class (cancer type), calculate 95% CI for all performance metrics
cv_class_metrics_summary_df <- cv_class_metrics_df %>%
group_by(class) %>%
reframe(
# Balanced accuracy
balanced_accuracy_ci95 =
format_ci95_str(calc_ci95(mean(balanced_accuracy), sd(balanced_accuracy),
n_CVfolds)$ci95_lower,
calc_ci95(mean(balanced_accuracy), sd(balanced_accuracy),
n_CVfolds)$ci95_upper),
# Precision
precision_ci95 =
format_ci95_str(calc_ci95(mean(precision), sd(precision), n_CVfolds)$ci95_lower,
calc_ci95(mean(precision), sd(precision), n_CVfolds)$ci95_upper),
# Recall
recall_ci95 =
format_ci95_str(calc_ci95(mean(recall), sd(recall), n_CVfolds)$ci95_lower,
calc_ci95(mean(recall), sd(recall), n_CVfolds)$ci95_upper),
# F1 score
F1_score_ci95 =
format_ci95_str(calc_ci95(mean(F1_score), sd(F1_score), n_CVfolds)$ci95_lower,
calc_ci95(mean(F1_score), sd(F1_score), n_CVfolds)$ci95_upper),
) %>%
ungroup() %>%
mutate(
cancer_type = unname(get_cancer_code[as.character(class)])
) %>%
select(cancer_type, balanced_accuracy_ci95, precision_ci95, recall_ci95, F1_score_ci95)
# Add model name
nn_cv_class_metrics_summary_df <- cv_class_metrics_summary_df %>%
mutate(
model = 'NN',
.before = cancer_type
)
nn_cv_class_metrics_summary_df %>%
kbl() %>%
kable_styling()
| model | cancer_type | balanced_accuracy_ci95 | precision_ci95 | recall_ci95 | F1_score_ci95 |
|---|---|---|---|---|---|
| NN | BRCA | (0.989, 1) | (1, 1) | (0.979, 1) | (0.989, 1) |
| NN | COAD | (0.997, 1) | (0.979, 1) | (1, 1) | (0.99, 1) |
| NN | KIRC | (1, 1) | (1, 1) | (1, 1) | (1, 1) |
| NN | LUAD | (0.992, 1) | (0.987, 1) | (0.986, 1) | (0.989, 1) |
| NN | PRAD | (1, 1) | (1, 1) | (1, 1) | (1, 1) |
The variable importance plot shows that PC5 is the most important variable.
set.seed(1403)
# Optimized model
nn2_model <- nnet(cancer_type ~ ., data = train_smote_pca,
size = 5, decay = 0.01, trace = FALSE)
# Calculate variable importance scores
# Variable importance is determined by Olden algorithm, which calculates
# the sum of the product of raw connection weights between neurons
# https://cran.r-project.org/web/packages/vip/vignettes/vip.html#neural-networks
nn2_model_var_imp_df <- vip::vi(nn2_model)
# Plot
var_imp_plot2 <- vip::vip(nn2_model_var_imp_df, num_features = 15, include_type = TRUE) +
ggtitle('Neural network variable importance') +
theme_classic()
var_imp_plot2
The overlapping 95% confidence intervals for all class performance metrics show that there is no significant difference in classification performance between the random forest and feedforward neural network models for any cancer type. Both models have excellent balanced accuracy, precision, recall, and F1 score.
rbind(rf_cv_class_metrics_summary_df, nn_cv_class_metrics_summary_df) %>%
group_by(cancer_type) %>%
arrange(cancer_type) %>%
kbl() %>%
kable_styling() %>%
row_spec(c(1, 2, 5, 6, 9, 10), background = 'gray90')
| model | cancer_type | balanced_accuracy_ci95 | precision_ci95 | recall_ci95 | F1_score_ci95 |
|---|---|---|---|---|---|
| RF | BRCA | (0.996, 1) | (0.97, 1) | (1, 1) | (0.984, 1) |
| NN | BRCA | (0.989, 1) | (1, 1) | (0.979, 1) | (0.989, 1) |
| RF | COAD | (1, 1) | (1, 1) | (1, 1) | (1, 1) |
| NN | COAD | (0.997, 1) | (0.979, 1) | (1, 1) | (0.99, 1) |
| RF | KIRC | (0.993, 1) | (1, 1) | (0.987, 1) | (0.993, 1) |
| NN | KIRC | (1, 1) | (1, 1) | (1, 1) | (1, 1) |
| RF | LUAD | (0.989, 1) | (1, 1) | (0.978, 1) | (0.989, 1) |
| NN | LUAD | (0.992, 1) | (0.987, 1) | (0.986, 1) | (0.989, 1) |
| RF | PRAD | (1, 1) | (1, 1) | (1, 1) | (1, 1) |
| NN | PRAD | (1, 1) | (1, 1) | (1, 1) | (1, 1) |
Despite the similar performance of the two models, the most important PCs differ between models. As shown in the figure below, PC1 and PC2 are the most important in the random forest model (left plot), whereas PC5 is the most important in the neural network model (right plot).
These differences reflect how each model finds patterns in the data, as well as how variable importance scores are calculated for each model type.
In random forest models, the variable importance score reflects the reduction in node impurity (mean decrease in Gini impurity) at decision tree splits. This implies that variables that have a lot of variation can contribute greater reductions and, therefore, will have greater importance. This suggests that the PCA-transformed gene expression values in PC1 and PC2 have more variation than those in other PCs.
In neural networks, the variable importance score is more complex and reflects the sum of products of connection weights between neurons (calculated using the “Olden” algorithm, as noted on the x-axis of the right plot). The nature of this calculation, along with the fact that the neural network model uses one neuron per class (cancer type), suggests that the PCA-transformed gene expression values in PC5 contribute greater discriminatory value between cancer types than those in other PCs.
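For reference, the Gini impurity at a node with class proportions \(p_k\) is \(G = 1 - \sum_{k=1}^{K} p_k^2\) (splits are chosen to reduce \(G\)), and the Olden importance of input variable \(j\) is, in simplified form, \(I_j = \sum_{h} w_{jh} \, w_{hk}\), the sum over hidden neurons \(h\) of the products of the input-to-hidden and hidden-to-output connection weights for output class \(k\) (notation mine; see the vip documentation linked in the code comments for the exact calculation).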
# Modify random forest variable importance plot
rf0_model_var_imp_df2 <- rf0_model_var_imp_df %>%
arrange(desc(Importance)) %>%
slice(1:15) %>%
mutate(
color = case_when(Variable %in% c('PC1', 'PC2') ~ '#E1BE6A',
Variable == 'PC5' ~ '#40B0A6',
TRUE ~ 'gray')
)
p1 <- ggplot(rf0_model_var_imp_df2, aes(x = reorder(Variable, Importance),
y = Importance, fill = color)) +
geom_bar(stat = 'identity') +
scale_fill_identity() +
coord_flip() +
xlab('Variable') + ylab('Importance (mean decrease Gini)') +
theme_classic() +
ggtitle('Random forest variable importance') +
theme(
axis.title = element_text(face = 'bold'),
plot.title = element_text(face = 'bold', size = 12),
legend.position = 'none'
)
# Modify neural network variable importance plot
nn2_model_var_imp_df2 <- nn2_model_var_imp_df %>%
arrange(desc(Importance)) %>%
slice(1:15) %>%
mutate(
color = ifelse(Variable == 'PC5', '#40B0A6', 'gray')
)
p2 <- ggplot(nn2_model_var_imp_df2, aes(x = reorder(Variable, Importance),
y = Importance, fill = color)) +
geom_bar(stat = 'identity') +
scale_fill_identity() +
coord_flip() +
xlab('Variable') + ylab('Importance (Olden)') +
theme_classic() +
ggtitle('Neural network variable importance') +
theme(
axis.title = element_text(face = 'bold'),
plot.title = element_text(face = 'bold', size = 12),
legend.position = 'none'
)
# Combine plots
plot_grid(p1, p2, nrow = 1, ncol = 2)
Also of note, PC5 is highly ranked in both models (most important in the neural network and fourth most important in the random forest). This suggests that the genes contributing to this PC are likely to be important for classifying the five types of cancer (keeping in mind, though, that the first 5 PCs only explained ~40% of the variance in the data).
Examination of the PCA loadings shows that the top three genes
contributing to PC5 are gene_15306,
gene_10731, and gene_17801.
# Extract the loadings from the PCA model
pc_loadings_df <- as.data.frame(pca_model$rotation)
pc_loadings_df <- cbind(gene = rownames(pc_loadings_df), pc_loadings_df)
rownames(pc_loadings_df) <- NULL
# Rank the loadings and list the top 3 genes
pc5_genes <- pc_loadings_df %>%
select(gene, PC5) %>%
# I use the absolute value of loadings because they can be positive or negative values
arrange(desc(abs(PC5)))
pc5_genes <- as.vector(pc5_genes$gene)
pc5_genes[1:3]
## [1] "gene_15306" "gene_10731" "gene_17801"
The results suggest that both random forest and feedforward neural network models can classify five types of cancer (breast, colon, kidney, lung, and prostate) using gene expression data with high overall and balanced (ie, by class) accuracy. The computation time for both models was short, although the neural network seemed to be slightly faster.
As a medical researcher, the main limitation that I see with this analysis is that the UCI dataset is highly simplified. In fact, it is oversimplified for several reasons:
The number of samples (N=801) is very small compared with real-world datasets, which often have tens of thousands of samples. 801 samples is insufficient to generalize results to the real world. In addition, training models with real-world datasets would require much more computation time.
The UCI dataset does not include any patient metadata, such as demographic information (eg, age, sex, ethnicity). In this project, this limited the exploratory data analysis. From a business/clinical perspective, the lack of demographic information would make it impossible to know whether the performance of the machine learning models can be generalized to a broad patient population.
The genes in the UCI dataset are not named and only have numeric identifiers. This prevents biologically meaningful analysis of variable (gene) importance in models, because they are not usable in gene databases that contain information about gene functions and disease relationships (eg, NCBI Gene). From a business/clinical perspective, the lack of gene names would also make it difficult to interpret test results at a genetic level.
These shortcomings limit the conclusions that can be drawn from this analysis as well as the ability to deploy the models in a real-world (ie, clinical) setting.
In conclusion, from a purely data science perspective, both random forest and feedforward neural network models performed well for classifying cancer types in the UCI gene expression dataset. However, from a business/clinical perspective, further testing with larger and more complete datasets is needed to determine how the results in this analysis scale and generalize before the model(s) could be used for real-world applications.
[1] Boehmke B, Greenwell B. (2020). Hands-On Machine Learning with R. Section 11.4.1. https://bradleyboehmke.github.io/HOML/random-forest.html