Genes (DNA sequences) contain the instructions for producing proteins (via an intermediate called RNA), which are the molecular workhorses in humans and other organisms. The amount of RNA produced, which is called “expression level” in biology, varies for each gene depending on the need for various proteins in different cells in the body. These needs can change during disease, such as cancer, which is the basis for molecular diagnostic tests to detect diseases using gene expression profiles.
The objective of this project is to use gene expression data to classify five common types of cancer (breast, colon, kidney, lung, and prostate cancer).
Gene expression data were obtained from the UCI Machine Learning Repository (Dataset C5R88H).
The main data file was too large (206 MB) to import directly, so I downloaded and analyzed a local copy.
tcga_pancan_data_df <-
read_csv('Data/TCGA-PANCAN-HiSeq-801x20531/data.csv', show_col_types = FALSE)
## New names:
## • `` -> `...1`
tcga_pancan_labels_df <-
read_csv('Data/TCGA-PANCAN-HiSeq-801x20531/labels.csv', show_col_types = FALSE)
## New names:
## • `` -> `...1`
I dropped the first column of both datasets because it only contained sample IDs, which are not informative for machine learning.
tcga_pancan_data_df <- tcga_pancan_data_df[, 2:ncol(tcga_pancan_data_df)]
tcga_pancan_labels_df <- tcga_pancan_labels_df[, 2:ncol(tcga_pancan_labels_df)]
The ‘data’ dataframe has 801 rows (samples) and 20,531 columns (genes). Each gene_# column contains numeric values representing the expression level of that gene in each sample.
dim(tcga_pancan_data_df)
## [1] 801 20531
The dataframe is too big to glimpse(), so I show the
first and last few rows and columns below.
head(tcga_pancan_data_df)
## # A tibble: 6 × 20,531
## gene_0 gene_1 gene_2 gene_3 gene_4 gene_5 gene_6 gene_7 gene_8 gene_9 gene_10
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 2.02 3.27 5.48 10.4 0 7.18 0.592 0 0 0.592
## 2 0 0.593 1.59 7.59 9.62 0 6.82 0 0 0 0
## 3 0 3.51 4.33 6.88 9.87 0 6.97 0.453 0 0 0
## 4 0 3.66 4.51 6.66 10.2 0 7.84 0.435 0 0 0
## 5 0 2.66 2.82 6.54 9.74 0 6.57 0.361 0 0 0
## 6 0 3.47 3.58 6.62 9.71 0 7.76 0 0 0 0.515
## # ℹ 20,520 more variables: gene_11 <dbl>, gene_12 <dbl>, gene_13 <dbl>,
## # gene_14 <dbl>, gene_15 <dbl>, gene_16 <dbl>, gene_17 <dbl>, gene_18 <dbl>,
## # gene_19 <dbl>, gene_20 <dbl>, gene_21 <dbl>, gene_22 <dbl>, gene_23 <dbl>,
## # gene_24 <dbl>, gene_25 <dbl>, gene_26 <dbl>, gene_27 <dbl>, gene_28 <dbl>,
## # gene_29 <dbl>, gene_30 <dbl>, gene_31 <dbl>, gene_32 <dbl>, gene_33 <dbl>,
## # gene_34 <dbl>, gene_35 <dbl>, gene_36 <dbl>, gene_37 <dbl>, gene_38 <dbl>,
## # gene_39 <dbl>, gene_40 <dbl>, gene_41 <dbl>, gene_42 <dbl>, …
tcga_pancan_data_df[1:5, (ncol(tcga_pancan_data_df) - 3):ncol(tcga_pancan_data_df)]
## # A tibble: 5 × 4
## gene_20527 gene_20528 gene_20529 gene_20530
## <dbl> <dbl> <dbl> <dbl>
## 1 9.65 8.92 5.29 0
## 2 10.5 9.40 2.09 0
## 3 9.79 10.1 1.68 0
## 4 9.69 9.68 3.29 0
## 5 9.22 9.46 5.11 0
Finally, I checked for duplicates and missing values in the data (there were none).
sprintf("Duplicate rows in dataset: %d", count_duplicate_rows(tcga_pancan_data_df))
## [1] "Duplicate rows in dataset: 0"
get_na_vars(tcga_pancan_data_df)
## [1] "There are no variables with missing values"
The ‘labels’ dataframe has 801 rows (samples) and 1 column of cancer abbreviations, which are shorthand for five types of cancer (BRCA = breast cancer, COAD = colon cancer, KIRC = kidney cancer, LUAD = lung cancer, PRAD = prostate cancer).
dim(tcga_pancan_labels_df)
## [1] 801 1
unique(tcga_pancan_labels_df$Class)
## [1] "PRAD" "LUAD" "BRCA" "KIRC" "COAD"
I recoded these abbreviations as numbers.
tcga_pancan_labels_df <- tcga_pancan_labels_df %>%
mutate(
cancer_type = case_when(Class == 'BRCA' ~ 0,
Class == 'COAD' ~ 1,
Class == 'KIRC' ~ 2,
Class == 'LUAD' ~ 3,
Class == 'PRAD' ~ 4)
) %>%
select(cancer_type)
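Later tables map these numeric codes back to the cancer abbreviations using a lookup named get_cancer_code, which is defined in the project's setup code. A minimal sketch of what that lookup presumably looks like (an assumption, shown here for readability):
# Assumed definition: named character vector mapping the numeric code (as a string)
# back to the TCGA abbreviation
get_cancer_code <- c('0' = 'BRCA', '1' = 'COAD', '2' = 'KIRC', '3' = 'LUAD', '4' = 'PRAD')
# Example: unname(get_cancer_code['2']) returns "KIRC"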
There are five types of cancer in the dataset. The most common is breast cancer (BRCA) and the least common is colon cancer (COAD). This class imbalance will need to be addressed during pre-processing.
# Cancer types: 0 = breast, 1 = colon, 2 = kidney, 3 = lung, 4 = prostate
tcga_pancan_labels_df %>%
group_by(cancer_type) %>%
summarise(
n = n()
) %>%
arrange(desc(n))
## # A tibble: 5 × 2
## cancer_type n
## <dbl> <int>
## 1 0 300
## 2 2 146
## 3 3 141
## 4 4 136
## 5 1 78
There are too many genes to characterize individually, so I analyzed one as an example and then calculated summary statistics for all of them as a group.
The expression of ‘gene_1’ ranges from 0 to 6.237, with a mean of 3.0 and a standard deviation of 1.2.
tcga_pancan_data_df %>%
summarise(min = round(min(gene_1), 3),
max = round(max(gene_1), 3),
mean = round(mean(gene_1), 1),
sd = round(sd(gene_1), 1),
median = round(median(gene_1), 3),
IQR = round(IQR(gene_1), 3),
n = n())
## # A tibble: 1 × 7
## min max mean sd median IQR n
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 0 6.24 3 1.2 3.14 1.58 801
The expression values appear to be mostly normally distributed, although there are some zero values (ie, samples in which the gene is not expressed).
ggplot(tcga_pancan_data_df, aes(x = gene_1)) +
geom_histogram(bins = 50, fill = 'steelblue', color = 'darkgray') +
coord_cartesian(xlim = c(0, 8), ylim = c(0, 50)) +
labs(x = 'Expression of gene_1', y = 'Count',
title = 'Histogram of expression values for gene_1') +
guides(
x = guide_axis(minor.ticks = TRUE, cap = 'both'),
y = guide_axis(minor.ticks = TRUE, cap = 'both')
) +
theme_classic() +
theme(
axis.title = element_text(face = 'bold'),
plot.title = element_text(face = 'bold')
)
The gene expression values range from 0 to 20.779.
gene_expr_values <- as.vector(as.matrix(
tcga_pancan_data_df[, 1:(ncol(tcga_pancan_data_df) - 1)]))
(gene_expr_stats <- summary(gene_expr_values))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.600 7.610 6.444 9.702 20.779
Using the quartile-based 1.5 × IQR rule, the upper threshold for potential outliers (20.353) falls just below the maximum value, so only expression levels at or near the maximum may be outliers.
IQR <- gene_expr_stats['3rd Qu.'] - gene_expr_stats['1st Qu.']
lower_threshold <- gene_expr_stats['1st Qu.'] - 1.5 * IQR
# Note that the lowest biologically possible expression level is 0
lower_threshold <- ifelse((lower_threshold < 0), 0, lower_threshold)
upper_threshold <- gene_expr_stats['3rd Qu.'] + 1.5 * IQR
sprintf('Expression values less than %.3f or greater than %.3f may be outliers',
lower_threshold, upper_threshold)
## [1] "Expression values less than 0.000 or greater than 20.353 may be outliers"
These summary statistics suggest that the data should be normalized.
I also determined the number of genes with zero expression and, more generally, the number with near-zero variance. These columns will need to be removed during pre-processing, because they are not informative for machine learning.
# Number of zero columns
n_zero_col <- sum(colSums(tcga_pancan_data_df != 0) == 0)
sprintf('There are %d zero columns', n_zero_col)
## [1] "There are 267 zero columns"
# Indexes of columns with near-zero variance
zero_variance_col_idx <- caret::nearZeroVar(tcga_pancan_data_df)
# Number of such columns
sprintf('There are %d near-zero variance columns', length(zero_variance_col_idx))
## [1] "There are 1259 near-zero variance columns"
I selected algorithms based on problem type (supervised multiclass classification), dataset characteristics, model strengths/weaknesses, and the assignment requirement of selecting one algorithm from Weeks 1-10 and the other from Weeks 11-15.
The first algorithm I selected was random forest, which was discussed during Week 6. The main rationale for random forest is that it can handle large datasets with many features, such as this gene expression dataset with its >20,000 features. In addition, random forest models can be used to perform multiclass classification.
The second algorithm I selected was feedforward neural network, which was discussed during Week 13. The main rationale for neural networks is that they can detect complex patterns in data, which may be advantageous for the many features (genes) in the cancer dataset. In addition, neural network models can be used to perform multiclass classification.
Data were split 70:30 into training and test datasets. I mitigated class imbalance in the training dataset using Synthetic Minority Oversampling Technique (SMOTE). Subsequently, I used principal component analysis (PCA) to reduce dimensionality.
For the classification problem, prediction of a class (cancer type) influences the type of treatment that the patient will receive. Both false positives (predicting type ‘X’ for a cancer that is actually not type ‘X’) and false negatives (predicting ‘not X’ for a cancer that is actually type ‘X’) would result in loss of time and resources spent on incorrect treatments (not to mention malpractice lawsuits). The magnitude of the loss depends on the type of cancer, because some types are more aggressive than others (eg, lung cancer tends to worsen much faster than prostate cancer), which could affect whether a patient has time to receive a correct diagnosis and treatment.
This suggests that both precision and recall should be assessed. In addition, F1 score, which is the harmonic mean of precision and recall, would be helpful to assess the balance between the two metrics. For completeness, I also evaluated overall accuracy, balanced accuracy for each class, and model training time.
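For reference, the F1 score is calculated as \(F_1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}\).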
To enable statistical comparison of model performance, I used 10-fold
cross-validation to calculate 95% confidence intervals of performance
metrics. However, I encountered some challenges with calculating and reporting these intervals by class. After extensive searching, I concluded that existing R packages for machine learning, such as caret, do not provide this functionality. As a result, I wrote custom functions to generate the desired output.
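Minimal sketches of these helpers are shown below; they are assumptions about the implementation (the actual definitions, along with the constants n_CVfolds = 10 and n_cores, live in the project's setup code), using a t-based confidence interval computed from the per-fold means and standard deviations.
# Assumed sketches of the custom helpers used in the cross-validation sections below
calc_ci95 <- function(metric_mean, metric_sd, n) {
  # t-based 95% CI from the mean and SD of a metric across n folds
  margin <- qt(0.975, df = n - 1) * metric_sd / sqrt(n)
  # Metrics are proportions, so clamp to [0, 1] (assumption; reported intervals never exceed 1)
  list(ci95_lower = max(0, metric_mean - margin),
       ci95_upper = min(1, metric_mean + margin))
}
format_ci95_str <- function(lower, upper) {
  # Format a 95% CI as a compact string, eg "(0.97, 1)"
  sprintf('(%s, %s)', round(lower, 3), round(upper, 3))
}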
As noted in section 3.2.2, there are 1,259 near-zero variance variables. Here, I remove them using the column indices identified during exploratory data analysis (section 3.2.2, Near-zero variance genes).
tcga_pancan_data_df <- tcga_pancan_data_df[, -zero_variance_col_idx]
I normalized gene expression using min-max normalization, so that all values are between 0 and 1.
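Specifically, each gene's values are rescaled as \(x' = \frac{x - \min(x)}{\max(x) - \min(x)}\), which is what preProcess() computes with method = 'range'.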
normalize_process <- preProcess(tcga_pancan_data_df, method = c('range'))
tcga_pancan_data_df <- predict(normalize_process, tcga_pancan_data_df)
gene_expr_values_norm <- as.vector(as.matrix(
tcga_pancan_data_df[, 1:(ncol(tcga_pancan_data_df) - 1)]))
(gene_expr_stats <- summary(gene_expr_values_norm))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2791 0.4587 0.4337 0.6060 1.0000
First, I combined the ‘data’ and ‘labels’ dataframes.
tcga_pancan_all_df <- cbind(tcga_pancan_data_df, tcga_pancan_labels_df)
Then I created training and test datasets using a 70:30 split.
train_idx <- createDataPartition(tcga_pancan_all_df$cancer_type, p = 0.7, list = FALSE)
train_imbalanced <- tcga_pancan_all_df[train_idx, ]
test <- tcga_pancan_all_df[-train_idx, ]
The training dataset is imbalanced with respect to target variable classes.
(target_class_freq_df <- as.data.frame(table(train_imbalanced$cancer_type)))
## Var1 Freq
## 1 0 213
## 2 1 59
## 3 2 95
## 4 3 99
## 5 4 96
To balance the training dataset, I used the ‘scutr’
package to oversample the minority classes using SMOTE.
# The goal is for each minority class to have the same number of samples as the majority class
# This is specified by the 'm' parameter in the oversample_smote() function
#
# This block may take 1-2 minutes to run
n_samples <- max(target_class_freq_df$Freq)
train_smote_cls1 <- scutr::oversample_smote(data = train_imbalanced,
cls = 1, cls_col = 'cancer_type', m = n_samples)
train_smote_cls2 <- scutr::oversample_smote(data = train_imbalanced,
cls = 2, cls_col = 'cancer_type', m = n_samples)
train_smote_cls3 <- scutr::oversample_smote(data = train_imbalanced,
cls = 3, cls_col = 'cancer_type', m = n_samples)
train_smote_cls4 <- scutr::oversample_smote(data = train_imbalanced,
cls = 4, cls_col = 'cancer_type', m = n_samples)
# Combine the class subsets to create a balanced training dataset
train_cls0 <- filter(train_imbalanced, cancer_type == 0) # majority class
train_smote <- rbind(train_cls0, train_smote_cls1, train_smote_cls2,
train_smote_cls3, train_smote_cls4)
# Subsets are no longer needed
rm(list = c('train_cls0', 'train_smote_cls1', 'train_smote_cls2', 'train_smote_cls3',
'train_smote_cls4'))
The classes of cancer_type in the SMOTE training dataset
are now balanced.
table(train_smote$cancer_type)
##
## 0 1 2 3 4
## 213 213 213 213 213
I used PCA to reduce the dimensions of the training and test datasets. Of note, the scaling and PCA transformations are fit on the training data only and then applied to both datasets.
# Separate features and target variable in training and test datasets
X_train <- train_smote %>% select(-cancer_type)
y_train <- train_smote$cancer_type
X_test <- test %>% select(-cancer_type)
y_test <- test$cancer_type
# Feature scaling
scaler <- caret::preProcess(X_train, method = c('center', 'scale'))
X_train_scaled <- predict(scaler, X_train)
X_test_scaled <- predict(scaler, X_test) # Apply same scaling
# Apply PCA to scaled training data
# This may take a couple of minutes
pca_model <- prcomp(X_train_scaled, center = FALSE, scale. = FALSE)
The scree plot shows that the first two principal components (PCs) explain ~25% of the variance in the training data.
factoextra::fviz_eig(pca_model, ylim = c(0, 20))
Analysis of the cumulative proportion of variance explained indicates that the first 96 PCs need to be retained to explain 80% of the variance in the training data.
# Calculate cumulative proportion of variance explained by PCs
pca_variance <- (pca_model$sdev)^2
prop_variance <- pca_variance / sum(pca_variance)
cumulative_prop_df <- data.frame(x = 1:length(prop_variance), y = cumsum(prop_variance))
# Number of PCs needed to explain X% of variance
threshold <- 0.8
npc_criterion <- threshold - 0.005 # accounts for rounding
npc_needed <- min(which(cumulative_prop_df$y >= npc_criterion))
sprintf('%d of %d PCs are needed to explain %.1f%% of variance',
npc_needed, nrow(cumulative_prop_df), threshold * 100)
## [1] "96 of 1065 PCs are needed to explain 80.0% of variance"
The test dataset can now be transformed using the same PCA rotation, retaining the first 96 PCs.
# Get the PCA-transformed, SMOTE-balanced training data
# The 'x' element of prcomp() output contains the PC scores (transformed coordinates)
train_smote_pca <- as.data.frame(pca_model$x[, 1 : npc_needed])
train_smote_pca$cancer_type <- as.factor(y_train) # Add the target variable
# Transform the test data using the same PCA transformation from the training data
test_pca <- as.data.frame(predict(pca_model, X_test_scaled))
# Retain the same number of PCs used for the training data
test_pca <- test_pca[, 1 : npc_needed]
test_pca$cancer_type <- as.factor(y_test) # Add the target variable
To enable statistical comparison of model performance with cross-validation, I created 10 folds from the SMOTE-balanced, PCA-transformed training dataset.
folds <- createFolds(train_smote_pca$cancer_type, k = n_CVfolds)
At this point, each fold contains only the row numbers of the training dataframe that belong to that fold; the folds are used during model development below.
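For illustration, a single fold's row numbers would be used to split the training data as sketched below (an assumed usage example; the cross-validation loops later in the report apply the same pattern to every fold).
# Illustrative only: hold out one fold for validation, train on the rest
val_idx  <- folds[[1]]                    # row numbers belonging to the first fold
cv_val   <- train_smote_pca[val_idx, ]    # held-out validation subset
cv_train <- train_smote_pca[-val_idx, ]   # remaining rows used for training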
Algorithm: randomForest()
Parameters:
ntree (number of trees) = 1000. I followed the rule of thumb of starting with roughly 10 times the number of features (ie, 96 features in the PCA-transformed training dataset \(\times\) 10 = 960 trees, rounded up to 1000).[1]
mtry (number of variables that are randomly sampled at each split) = 10. This is approximately the square root of the number of features (ie, \(\sqrt{96} \approx 9.8\)).
set.seed(1403)
rf0_runtime <- system.time(
rf0_model <- randomForest(cancer_type ~ ., data = train_smote_pca,
ntree = 1000, mtry = 10))
rf0_runtime <- rf0_runtime[['elapsed']]
sprintf('Runtime: %.1f seconds', rf0_runtime)
## [1] "Runtime: 2.8 seconds"
Use the model to make class predictions
rf0_predictions <- predict(rf0_model, newdata = test_pca, type = 'class')
Confusion matrix
rf0_confusion_matrix <- caret::confusionMatrix(data = rf0_predictions,
reference = test_pca$cancer_type)
rf0_confusion_matrix$table
## Reference
## Prediction 0 1 2 3 4
## 0 87 0 0 0 0
## 1 0 18 0 0 0
## 2 0 0 51 0 0
## 3 0 1 0 42 1
## 4 0 0 0 0 39
The baseline model performed well and had 99.2% overall accuracy (95% CI, 97.0% to 99.9%).
sprintf('Overall accuracy (95%% CI): %.3f (%.3f, %.3f)',
rf0_confusion_matrix$overall['Accuracy'],
rf0_confusion_matrix$overall['AccuracyLower'],
rf0_confusion_matrix$overall['AccuracyUpper'])
## [1] "Overall accuracy (95% CI): 0.992 (0.970, 0.999)"
By class, the balanced accuracy ranged from 97.4% for colon cancer to 100% for breast and kidney cancer. Precision (range, 95.5% to 100%) and recall (range, 94.7% to 100%) were balanced, which was also reflected in the F1 score (range, 97.3% to 100%). Of note, the model predicted all breast and kidney cancers correctly.
rf0_class_metrics_df <- as.data.frame(rf0_confusion_matrix$byClass)
rf0_class_metrics_df <- cbind(cancer_type = rownames(rf0_class_metrics_df),
rf0_class_metrics_df)
rownames(rf0_class_metrics_df) <- NULL
rf0_class_metrics_df %>%
mutate(
cancer_type = unname(get_cancer_code[str_sub(cancer_type, -1, -1)]),
balanced_accuracy = round(`Balanced Accuracy`, 3),
precision = round(Precision, 3),
recall = round(Recall, 3),
F1_score = round(F1, 3)
) %>%
select(cancer_type, balanced_accuracy, precision, recall, F1_score)
## cancer_type balanced_accuracy precision recall F1_score
## 1 BRCA 1.000 1.000 1.000 1.000
## 2 COAD 0.974 1.000 0.947 0.973
## 3 KIRC 1.000 1.000 1.000 1.000
## 4 LUAD 0.995 0.955 1.000 0.977
## 5 PRAD 0.988 1.000 0.975 0.987
To determine whether classification performance could be improved
further, I tuned mtry (number of variables that are
randomly sampled at each split) by assessing the out-of-bag error for
values less than or greater than the initial value
(mtry = 10). I used the same number of trees,
ntreeTry, as the baseline model (1000).
The plot below shows that the out-of-bag error is minimized when
mtry = 5.
set.seed(1403)
optimal_mtry <- tuneRF(x = train_smote_pca[, -ncol(train_smote_pca)], # features
y = train_smote_pca[, ncol(train_smote_pca)], # target variable
mtryStart = 10, stepFactor = 1.5,
plot = TRUE, ntreeTry = 1000)
## mtry = 10 OOB error = 0.38%
## Searching left ...
## mtry = 7 OOB error = 0.28%
## 0.25 0.05
## mtry = 5 OOB error = 0.19%
## 0.3333333 0.05
## mtry = 4 OOB error = 0.38%
## -1 0.05
## Searching right ...
## mtry = 15 OOB error = 0.38%
## -1 0.05
best_mtry <- as.data.frame(optimal_mtry) %>%
slice_min(OOBError) %>%
pull(mtry)
sprintf('The optimal value of mtry is %d', best_mtry)
## [1] "The optimal value of mtry is 5"
Then, I reran the random forest model with the best value of
mtry and compared the performance with the previous
version.
set.seed(1403)
rf1_runtime <- system.time(
rf1_model <- randomForest(cancer_type ~ ., data = train_smote_pca,
ntree = 1000, mtry = best_mtry))
rf1_runtime <- rf1_runtime[['elapsed']]
sprintf('Runtime: %.1f seconds', rf1_runtime)
## [1] "Runtime: 3.1 seconds"
Use the model to make class predictions
rf1_predictions <- predict(rf1_model, newdata = test_pca, type = 'class')
Confusion matrix
rf1_confusion_matrix <- caret::confusionMatrix(data = rf1_predictions,
reference = test_pca$cancer_type)
rf1_confusion_matrix$table
## Reference
## Prediction 0 1 2 3 4
## 0 87 0 0 0 1
## 1 0 18 0 0 0
## 2 0 0 51 0 0
## 3 0 1 0 42 0
## 4 0 0 0 0 39
This change did not affect overall accuracy, and the per-class metrics shifted only marginally.
sprintf('Overall accuracy of tuned model: %.3f', rf1_confusion_matrix$overall['Accuracy'])
## [1] "Overall accuracy of tuned model: 0.992"
sprintf('Change in overall accuracy vs baseline model: %.3f%%',
100 * (rf1_confusion_matrix$overall['Accuracy'] - rf0_confusion_matrix$overall['Accuracy']) /
rf0_confusion_matrix$overall['Accuracy'])
## [1] "Change in overall accuracy vs baseline model: 0.000%"
rf1_class_metrics_df <- as.data.frame(rf1_confusion_matrix$byClass)
rf1_class_metrics_df <- cbind(cancer_type = rownames(rf1_class_metrics_df),
rf1_class_metrics_df)
rownames(rf1_class_metrics_df) <- NULL
rf1_class_metrics_df %>%
mutate(
cancer_type = unname(get_cancer_code[str_sub(cancer_type, -1, -1)]),
balanced_accuracy = round(`Balanced Accuracy`, 3),
precision = round(Precision, 3),
recall = round(Recall, 3),
F1_score = round(F1, 3)
) %>%
select(cancer_type, balanced_accuracy, precision, recall, F1_score)
## cancer_type balanced_accuracy precision recall F1_score
## 1 BRCA 0.997 0.989 1.000 0.994
## 2 COAD 0.974 1.000 0.947 0.973
## 3 KIRC 1.000 1.000 1.000 1.000
## 4 LUAD 0.997 0.977 1.000 0.988
## 5 PRAD 0.988 1.000 0.975 0.987
Since the classification performance of the two models was nearly identical, I performed cross-validation using the tuned value of mtry.
set.seed(1403)
rf2_runtime <- system.time(
cv_results <- mclapply(folds, function(x) {
# Training and test datasets
exp_train <- train_smote_pca[-x, ]
exp_test <- train_smote_pca[x, ]
# Use training data to create random forest classification model
exp_model <- randomForest(cancer_type ~ ., data = exp_train,
ntree = 1000, mtry = best_mtry)
# Make predictions on test data
exp_predictions <- predict(exp_model, newdata = exp_test, type = 'class')
# Actual (true) data for comparison
# exp_actual <- exp_test$cancer_type
# Create confusion matrix of actual vs predicted classes
exp_confusion_matrix <- confusionMatrix(data = exp_predictions,
reference = exp_test$cancer_type)
# Extract performance metrics from confusion matrix
exp_overall_accuracy <- exp_confusion_matrix$overall['Accuracy']
# Class metrics
exp_class_metrics_df <- as.data.frame(exp_confusion_matrix$byClass)
exp_balanced_accuracy_vec <- exp_class_metrics_df$`Balanced Accuracy`
exp_precision_vec <- exp_class_metrics_df$Precision
exp_recall_vec <- exp_class_metrics_df$Recall
exp_f1_vec <- exp_class_metrics_df$F1
# Combine class metrics into dataframe
exp_class_metrics_df <- as.data.frame(cbind(class = c(0, 1, 2, 3, 4),
balanced_accuracy = exp_balanced_accuracy_vec,
precision = exp_precision_vec,
recall = exp_recall_vec,
F1_score = exp_f1_vec))
# Return experiment metrics as a list
return(list(overall_accuracy = exp_overall_accuracy,
class_metrics_df = exp_class_metrics_df))
}, mc.cores = n_cores))
rf2_runtime <- rf2_runtime[['elapsed']]
sprintf('Runtime: %.1f seconds', rf2_runtime)
## [1] "Runtime: 5.9 seconds"
For overall accuracy, the 95% confidence interval is 99.3% to 100%. Performance metrics by class are similar. Furthermore, the overlapping confidence intervals for all cancer types indicate that the model does not perform significantly differently for any cancer type. In addition, the computation time was quite short.
These results show that the random forest model performs very well for all five cancer types.
# Extract performance metrics for each fold
cv_overall_accuracy_vec <- sapply(cv_results, function(x) x$overall_accuracy)
# Calculate 95% CIs
cv_overall_accuracy_mean <- mean(cv_overall_accuracy_vec)
cv_overall_accuracy_sd <- sd(cv_overall_accuracy_vec)
cv_overall_accuracy_ci95 <- calc_ci95(cv_overall_accuracy_mean, cv_overall_accuracy_sd,
n_CVfolds)
sprintf('Overall accuracy 95%% CI: (%.3f, %.3f)',
cv_overall_accuracy_ci95$ci95_lower, cv_overall_accuracy_ci95$ci95_upper)
## [1] "Overall accuracy 95% CI: (0.993, 1.000)"
# Extract performance results by class for all folds, then bind them into a new dataframe
cv_class_metrics_df <- lapply(cv_results, function(x) x$class_metrics_df) %>%
bind_rows()
# For each class (cancer type), calculate 95% CI for all performance metrics
cv_class_metrics_summary_df <- cv_class_metrics_df %>%
group_by(class) %>%
reframe(
# Balanced accuracy
balanced_accuracy_ci95 =
format_ci95_str(calc_ci95(mean(balanced_accuracy), sd(balanced_accuracy),
n_CVfolds)$ci95_lower,
calc_ci95(mean(balanced_accuracy), sd(balanced_accuracy),
n_CVfolds)$ci95_upper),
# Precision
precision_ci95 =
format_ci95_str(calc_ci95(mean(precision), sd(precision), n_CVfolds)$ci95_lower,
calc_ci95(mean(precision), sd(precision), n_CVfolds)$ci95_upper),
# Recall
recall_ci95 =
format_ci95_str(calc_ci95(mean(recall), sd(recall), n_CVfolds)$ci95_lower,
calc_ci95(mean(recall), sd(recall), n_CVfolds)$ci95_upper),
# F1 score
F1_score_ci95 =
format_ci95_str(calc_ci95(mean(F1_score), sd(F1_score), n_CVfolds)$ci95_lower,
calc_ci95(mean(F1_score), sd(F1_score), n_CVfolds)$ci95_upper),
) %>%
ungroup() %>%
mutate(
cancer_type = unname(get_cancer_code[as.character(class)])
) %>%
select(cancer_type, balanced_accuracy_ci95, precision_ci95, recall_ci95, F1_score_ci95)
# Add model name
rf_cv_class_metrics_summary_df <- cv_class_metrics_summary_df %>%
mutate(
model = 'RF',
.before = cancer_type
)
rf_cv_class_metrics_summary_df %>%
kbl() %>%
kable_styling()
| model | cancer_type | balanced_accuracy_ci95 | precision_ci95 | recall_ci95 | F1_score_ci95 |
|---|---|---|---|---|---|
| RF | BRCA | (0.996, 1) | (0.97, 1) | (1, 1) | (0.984, 1) |
| RF | COAD | (1, 1) | (1, 1) | (1, 1) | (1, 1) |
| RF | KIRC | (0.993, 1) | (1, 1) | (0.987, 1) | (0.993, 1) |
| RF | LUAD | (0.989, 1) | (1, 1) | (0.978, 1) | (0.989, 1) |
| RF | PRAD | (1, 1) | (1, 1) | (1, 1) | (1, 1) |
The variable importance plot shows that the first two PCs (PC1 and PC2) are the most important variables, and that the mean decrease in Gini impurity plateaus beyond PC15.
# The baseline and optimized models had identical performance,
# so here I just use the baseline model
# Calculate variable importance scores
rf0_model_var_imp_df <- vip::vi(rf0_model)
# Plot
var_imp_plot1 <- vip::vip(rf0_model_var_imp_df, num_features = 15, include_type = TRUE) +
ggtitle('Random forest variable importance') +
theme_classic()
var_imp_plot1
Algorithm: nnet()
Parameters:
Number of neurons in hidden layer: size = 5
Weight decay: decay = 0 (default value)
Notes:
Note that the nnet package is limited to neural networks with a single hidden layer, so the number of layers is not specified.
The model has 5 neurons because the data have 5 classes (cancer types).
Although the nnet() function has a logical softmax parameter, it does not need to be set explicitly. When the target variable is a factor with more than two levels, the function automatically uses a softmax output activation function.
set.seed(1403)
nn0_runtime <- system.time(
nn0_model <- nnet(cancer_type ~ ., data = train_smote_pca, size = 5)
)
## # weights: 515
## initial value 1825.274188
## iter 10 value 71.839843
## iter 20 value 25.910201
## iter 30 value 8.279478
## iter 40 value 6.514153
## iter 50 value 5.344508
## iter 60 value 4.803838
## iter 70 value 4.791983
## iter 80 value 4.783975
## iter 90 value 4.161583
## iter 100 value 4.161337
## final value 4.161337
## stopped after 100 iterations
nn0_runtime <- nn0_runtime[['elapsed']]
sprintf('Runtime: %.1f seconds', nn0_runtime)
## [1] "Runtime: 0.2 seconds"
Use the model to make class predictions
nn0_predictions <- predict(nn0_model, newdata = test_pca, type = 'class')
nn0_predictions <- factor(nn0_predictions, levels = levels(test_pca$cancer_type))
Confusion matrix
nn0_confusion_matrix <- caret::confusionMatrix(data = nn0_predictions,
reference = test_pca$cancer_type)
nn0_confusion_matrix$table
## Reference
## Prediction 0 1 2 3 4
## 0 87 0 0 0 0
## 1 0 17 0 0 0
## 2 0 0 51 0 0
## 3 0 2 0 42 0
## 4 0 0 0 0 40
The baseline model performed well and had 99.2% overall accuracy (95% CI, 97.0% to 99.9%).
sprintf('Overall accuracy (95%% CI): %.3f (%.3f, %.3f)',
nn0_confusion_matrix$overall['Accuracy'],
nn0_confusion_matrix$overall['AccuracyLower'],
nn0_confusion_matrix$overall['AccuracyUpper'])
## [1] "Overall accuracy (95% CI): 0.992 (0.970, 0.999)"
By class, the balanced accuracy ranged from 94.7% for colon cancer to 100% for breast, kidney, and prostate cancer. Precision (range, 95.5% to 100%) and recall (range, 89.5% to 100%) were balanced, which was also reflected in the F1 score (range, 94.4% to 100%). Of note, the model predicted all breast, kidney, and prostate cancers correctly.
nn0_class_metrics_df <- as.data.frame(nn0_confusion_matrix$byClass)
nn0_class_metrics_df <- cbind(cancer_type = rownames(nn0_class_metrics_df),
nn0_class_metrics_df)
rownames(nn0_class_metrics_df) <- NULL
nn0_class_metrics_df %>%
mutate(
cancer_type = unname(get_cancer_code[str_sub(cancer_type, -1, -1)]),
balanced_accuracy = round(`Balanced Accuracy`, 3),
precision = round(Precision, 3),
recall = round(Recall, 3),
F1_score = round(F1, 3)
) %>%
select(cancer_type, balanced_accuracy, precision, recall, F1_score)
## cancer_type balanced_accuracy precision recall F1_score
## 1 BRCA 1.000 1.000 1.000 1.000
## 2 COAD 0.947 1.000 0.895 0.944
## 3 KIRC 1.000 1.000 1.000 1.000
## 4 LUAD 0.995 0.955 1.000 0.977
## 5 PRAD 1.000 1.000 1.000 1.000
I kept the number of hidden neurons equal to the number of classes of the target variable (size = 5), as in the baseline model, and only tuned the decay hyperparameter.
# This function performs a single run of a neural network model
# Inputs:
# train, test (dataframe): data
# formula for model relating target and predictor variables
# size, decay (numeric): hyperparameters
# model_num (integer): model number
# results_tracker_df: dataframe with results of previous runs
# Output:
# results_tracker_df: updated dataframe with results of current run
train_nn <- function(train, test, formula, size, decay, model_num, results_tracker_df) {
nn_runtime <- system.time(
nn_model <- nnet(formula = formula, data = train, size = size, decay = decay,
trace = FALSE)
)
nn_runtime <- nn_runtime[['elapsed']]
# Make predictions and ensure factor levels are consistent
predictions <- predict(nn_model, newdata = test, type = "class")
predictions <- factor(predictions, levels = levels(test$cancer_type))
# Confusion matrix
confusion_matrix <- caret::confusionMatrix(data = predictions,
reference = test$cancer_type)
# Model accuracy
overall_accuracy <- confusion_matrix$overall['Accuracy']
# Document results
model_id <- paste0('nn', model_num)
exp_results <- c(model_id, size, decay, round(overall_accuracy, 3), round(nn_runtime, 3))
results_tracker_df <- rbind(results_tracker_df, exp_results)
return(results_tracker_df)
}
set.seed(1403)
# Create experiment tracker
results_tracker_df <- data.frame()
decay_vals <- c(0, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 0.1, 0.5)
# Run experiments
for (i in 1:length(decay_vals)) {
results_tracker_df <- train_nn(
train = train_smote_pca, test = test_pca, formula = cancer_type ~ .,
size = 5, decay = decay_vals[i], model_num = i, results_tracker_df)
}
colnames(results_tracker_df) <- c('model', 'size', 'decay', 'overall_accuracy', 'runtime_sec')
The overall model accuracy was maximized when
decay = 0.01. Computation times were negligible.
results_tracker_df %>%
arrange(desc(overall_accuracy)) %>%
kbl() %>%
kable_styling()
| model | size | decay | overall_accuracy | runtime_sec |
|---|---|---|---|---|
| nn7 | 5 | 0.01 | 1 | 0.246 |
| nn1 | 5 | 0 | 0.992 | 0.196 |
| nn3 | 5 | 1e-06 | 0.992 | 0.232 |
| nn6 | 5 | 0.001 | 0.992 | 0.202 |
| nn4 | 5 | 1e-05 | 0.987 | 0.21 |
| nn5 | 5 | 1e-04 | 0.987 | 0.202 |
| nn2 | 5 | 1e-07 | 0.983 | 0.187 |
| nn8 | 5 | 0.1 | 0.983 | 0.28 |
| nn9 | 5 | 0.5 | 0.971 | 0.3 |
I performed cross-validation using the optimized model.
nn1_runtime <- system.time(
cv_results <- mclapply(folds, function(x) {
# Training and test datasets
exp_train <- train_smote_pca[-x, ]
exp_test <- train_smote_pca[x, ]
# Use training data to create neural network classification model
exp_model <- nnet(cancer_type ~ ., data = exp_train, size = 5, decay = 0.01,
trace = FALSE)
# Make predictions on test data
exp_predictions <- predict(exp_model, newdata = exp_test, type = 'class')
exp_predictions <- factor(exp_predictions, levels = levels(exp_test$cancer_type))
# Create confusion matrix of actual vs predicted classes
exp_confusion_matrix <- confusionMatrix(data = exp_predictions,
reference = exp_test$cancer_type)
# Extract performance metrics from confusion matrix
exp_overall_accuracy <- exp_confusion_matrix$overall['Accuracy']
# Class metrics
exp_class_metrics_df <- as.data.frame(exp_confusion_matrix$byClass)
exp_balanced_accuracy_vec <- exp_class_metrics_df$`Balanced Accuracy`
exp_precision_vec <- exp_class_metrics_df$Precision
exp_recall_vec <- exp_class_metrics_df$Recall
exp_f1_vec <- exp_class_metrics_df$F1
# Combine class metrics into dataframe
exp_class_metrics_df <- as.data.frame(cbind(class = c(0, 1, 2, 3, 4),
balanced_accuracy = exp_balanced_accuracy_vec,
precision = exp_precision_vec,
recall = exp_recall_vec,
F1_score = exp_f1_vec))
# Return experiment metrics as a list
return(list(overall_accuracy = exp_overall_accuracy,
class_metrics_df = exp_class_metrics_df))
}, mc.cores = n_cores)
)
nn1_runtime <- nn1_runtime[['elapsed']]
sprintf('Runtime: %.1f seconds', nn1_runtime)
## [1] "Runtime: 0.6 seconds"
For overall accuracy, the 95% confidence interval is 99.4% to 100%.
# Extract performance metrics for each fold
cv_overall_accuracy_vec <- sapply(cv_results, function(x) x$overall_accuracy)
# Calculate 95% CIs
cv_overall_accuracy_mean <- mean(cv_overall_accuracy_vec)
cv_overall_accuracy_sd <- sd(cv_overall_accuracy_vec)
cv_overall_accuracy_ci95 <- calc_ci95(cv_overall_accuracy_mean, cv_overall_accuracy_sd,
n_CVfolds)
sprintf('Overall accuracy 95%% CI: (%.3f, %.3f)',
cv_overall_accuracy_ci95$ci95_lower, cv_overall_accuracy_ci95$ci95_upper)
## [1] "Overall accuracy 95% CI: (0.994, 1.000)"
Performance metrics by class are similar. Furthermore, the overlapping confidence intervals for all cancer types indicate that the model does not perform significantly differently for any cancer type. In addition, the computation time was very short.
These results show that the feedforward neural network model performs very well for all five cancer types.
# Extract performance results by class for all folds, then bind them into a new dataframe
cv_class_metrics_df <- lapply(cv_results, function(x) x$class_metrics_df) %>%
bind_rows()
# For each class (cancer type), calculate 95% CI for all performance metrics
cv_class_metrics_summary_df <- cv_class_metrics_df %>%
group_by(class) %>%
reframe(
# Balanced accuracy
balanced_accuracy_ci95 =
format_ci95_str(calc_ci95(mean(balanced_accuracy), sd(balanced_accuracy),
n_CVfolds)$ci95_lower,
calc_ci95(mean(balanced_accuracy), sd(balanced_accuracy),
n_CVfolds)$ci95_upper),
# Precision
precision_ci95 =
format_ci95_str(calc_ci95(mean(precision), sd(precision), n_CVfolds)$ci95_lower,
calc_ci95(mean(precision), sd(precision), n_CVfolds)$ci95_upper),
# Recall
recall_ci95 =
format_ci95_str(calc_ci95(mean(recall), sd(recall), n_CVfolds)$ci95_lower,
calc_ci95(mean(recall), sd(recall), n_CVfolds)$ci95_upper),
# F1 score
F1_score_ci95 =
format_ci95_str(calc_ci95(mean(F1_score), sd(F1_score), n_CVfolds)$ci95_lower,
calc_ci95(mean(F1_score), sd(F1_score), n_CVfolds)$ci95_upper),
) %>%
ungroup() %>%
mutate(
cancer_type = unname(get_cancer_code[as.character(class)])
) %>%
select(cancer_type, balanced_accuracy_ci95, precision_ci95, recall_ci95, F1_score_ci95)
# Add model name
nn_cv_class_metrics_summary_df <- cv_class_metrics_summary_df %>%
mutate(
model = 'NN',
.before = cancer_type
)
nn_cv_class_metrics_summary_df %>%
kbl() %>%
kable_styling()
| model | cancer_type | balanced_accuracy_ci95 | precision_ci95 | recall_ci95 | F1_score_ci95 |
|---|---|---|---|---|---|
| NN | BRCA | (0.989, 1) | (1, 1) | (0.979, 1) | (0.989, 1) |
| NN | COAD | (0.997, 1) | (0.979, 1) | (1, 1) | (0.99, 1) |
| NN | KIRC | (1, 1) | (1, 1) | (1, 1) | (1, 1) |
| NN | LUAD | (0.992, 1) | (0.987, 1) | (0.986, 1) | (0.989, 1) |
| NN | PRAD | (1, 1) | (1, 1) | (1, 1) | (1, 1) |
The variable importance plot shows that PC5 is the most important variable.
set.seed(1403)
# Optimized model
nn2_model <- nnet(cancer_type ~ ., data = train_smote_pca,
size = 5, decay = 0.01, trace = FALSE)
# Calculate variable importance scores
# Variable importance is determined by Olden algorithm, which calculates
# the sum of the product of raw connection weights between neurons
# https://cran.r-project.org/web/packages/vip/vignettes/vip.html#neural-networks
nn2_model_var_imp_df <- vip::vi(nn2_model)
# Plot
var_imp_plot2 <- vip::vip(nn2_model_var_imp_df, num_features = 15, include_type = TRUE) +
ggtitle('Neural network variable importance') +
theme_classic()
var_imp_plot2
The overlapping 95% confidence intervals for all class performance metrics show that there is no significant difference in classification performance between the random forest and feedforward neural network models for any cancer type. Both models have excellent balanced accuracy, precision, recall, and F1 score.
rbind(rf_cv_class_metrics_summary_df, nn_cv_class_metrics_summary_df) %>%
group_by(cancer_type) %>%
arrange(cancer_type) %>%
kbl() %>%
kable_styling() %>%
row_spec(c(1, 2, 5, 6, 9, 10), background = 'gray90')
| model | cancer_type | balanced_accuracy_ci95 | precision_ci95 | recall_ci95 | F1_score_ci95 |
|---|---|---|---|---|---|
| RF | BRCA | (0.996, 1) | (0.97, 1) | (1, 1) | (0.984, 1) |
| NN | BRCA | (0.989, 1) | (1, 1) | (0.979, 1) | (0.989, 1) |
| RF | COAD | (1, 1) | (1, 1) | (1, 1) | (1, 1) |
| NN | COAD | (0.997, 1) | (0.979, 1) | (1, 1) | (0.99, 1) |
| RF | KIRC | (0.993, 1) | (1, 1) | (0.987, 1) | (0.993, 1) |
| NN | KIRC | (1, 1) | (1, 1) | (1, 1) | (1, 1) |
| RF | LUAD | (0.989, 1) | (1, 1) | (0.978, 1) | (0.989, 1) |
| NN | LUAD | (0.992, 1) | (0.987, 1) | (0.986, 1) | (0.989, 1) |
| RF | PRAD | (1, 1) | (1, 1) | (1, 1) | (1, 1) |
| NN | PRAD | (1, 1) | (1, 1) | (1, 1) | (1, 1) |
Despite the similar performance of the two models, the most important PCs differ between models. As shown in the figure below, PC1 and PC2 are the most important in the random forest model (left plot), whereas PC5 is the most important in the neural network model (right plot).
These differences reflect how each model finds patterns in the data, as well as how variable importance scores are calculated for each model type.
In random forest models, the variable importance score reflects the reduction in node impurity (mean decrease in Gini impurity) at decision tree splits. This implies that variables that have a lot of variation can contribute greater reductions and, therefore, will have greater importance. This suggests that the PCA-transformed gene expression values in PC1 and PC2 have more variation than those in other PCs.
In neural networks, the variable importance score is more complex and reflects the sum of products of connection weights between neurons (calculated using the “Olden” algorithm, as noted on the x-axis of the right plot). The nature of this calculation, along with the fact that the neural network model uses one neuron per class (cancer type), suggests that the PCA-transformed gene expression values in PC5 contribute greater discriminatory value between cancer types than those in other PCs.
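For reference, the Gini impurity at a node with class proportions \(p_k\) is \(G = 1 - \sum_{k=1}^{K} p_k^2\) (splits are chosen to reduce \(G\)), and the Olden importance of input variable \(j\) is, in simplified form, \(I_j = \sum_{h} w_{jh} \, w_{hk}\), the sum over hidden neurons \(h\) of the products of the input-to-hidden and hidden-to-output connection weights for output class \(k\) (notation mine; see the vip documentation linked in the code comments for the exact calculation).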
# Modify random forest variable importance plot
rf0_model_var_imp_df2 <- rf0_model_var_imp_df %>%
arrange(desc(Importance)) %>%
slice(1:15) %>%
mutate(
color = case_when(Variable %in% c('PC1', 'PC2') ~ '#E1BE6A',
Variable == 'PC5' ~ '#40B0A6',
TRUE ~ 'gray')
)
p1 <- ggplot(rf0_model_var_imp_df2, aes(x = reorder(Variable, Importance),
y = Importance, fill = color)) +
geom_bar(stat = 'identity') +
scale_fill_identity() +
coord_flip() +
xlab('Variable') + ylab('Importance (mean decrease Gini)') +
theme_classic() +
ggtitle('Random forest variable importance') +
theme(
axis.title = element_text(face = 'bold'),
plot.title = element_text(face = 'bold', size = 12),
legend.position = 'none'
)
# Modify neural network variable importance plot
nn2_model_var_imp_df2 <- nn2_model_var_imp_df %>%
arrange(desc(Importance)) %>%
slice(1:15) %>%
mutate(
color = ifelse(Variable == 'PC5', '#40B0A6', 'gray')
)
p2 <- ggplot(nn2_model_var_imp_df2, aes(x = reorder(Variable, Importance),
y = Importance, fill = color)) +
geom_bar(stat = 'identity') +
scale_fill_identity() +
coord_flip() +
xlab('Variable') + ylab('Importance (Olden)') +
theme_classic() +
ggtitle('Neural network variable importance') +
theme(
axis.title = element_text(face = 'bold'),
plot.title = element_text(face = 'bold', size = 12),
legend.position = 'none'
)
# Combine plots
plot_grid(p1, p2, nrow = 1, ncol = 2)
Also of note, PC5 is highly ranked in both models (most important in the neural network and fourth most important in the random forest). This suggests that the genes contributing to this PC are likely to be important for classifying the five types of cancer (keeping in mind, though, that the first 5 PCs only explained ~40% of the variance in the data).
Examination of the PCA loadings shows that the top three genes
contributing to PC5 are gene_15306,
gene_10731, and gene_17801.
# Extract the loadings from the PCA model
pc_loadings_df <- as.data.frame(pca_model$rotation)
pc_loadings_df <- cbind(gene = rownames(pc_loadings_df), pc_loadings_df)
rownames(pc_loadings_df) <- NULL
# Rank the loadings and list the top 3 genes
pc5_genes <- pc_loadings_df %>%
select(gene, PC5) %>%
# I use the absolute value of loadings because they can be positive or negative values
arrange(desc(abs(PC5)))
pc5_genes <- as.vector(pc5_genes$gene)
pc5_genes[1:3]
## [1] "gene_15306" "gene_10731" "gene_17801"
The results suggest that both random forest and feedforward neural network models can classify five types of cancer (breast, colon, kidney, lung, and prostate) using gene expression data with high overall and balanced (ie, by class) accuracy. The computation time for both models was short, although the neural network seemed to be slightly faster.
As a medical researcher, the main limitation that I see with this analysis is that the UCI dataset is highly simplified. In fact, it is oversimplified for several reasons:
The number of samples (N=801) is very small compared with real-world datasets, which often have tens of thousands of samples. 801 samples is insufficient to generalize results to the real world. In addition, training models with real-world datasets would require much more computation time.
The UCI dataset does not include any patient metadata, such as demographic information (eg, age, sex, ethnicity). In this project, this limited the exploratory data analysis. From a business/clinical perspective, the lack of demographic information would make it impossible to know whether the performance of the machine learning models can be generalized to a broad patient population.
The genes in the UCI dataset are not named and only have numeric identifiers. This prevents biologically meaningful analysis of variable (gene) importance in models, because they are not usable in gene databases that contain information about gene functions and disease relationships (eg, NCBI Gene). From a business/clinical perspective, the lack of gene names would also make it difficult to interpret test results at a genetic level.
These shortcomings limit the conclusions that can be drawn from this analysis as well as the ability to deploy the models in a real-world (ie, clinical) setting.
In conclusion, from a purely data science perspective, both random forest and feedforward neural network models performed well for classifying cancer types in the UCI gene expression dataset. However, from a business/clinical perspective, further testing with larger and more complete datasets is needed to determine how the results in this analysis scale and generalize before the model(s) could be used for real-world applications.
[1] Boehmke B, Greenwell B. (2020). Hands-On Machine Learning with R. Section 11.4.1. https://bradleyboehmke.github.io/HOML/random-forest.html