Lung cancer is the second most common cancer worldwide and had the highest mortality rate of any cancer in 2020 (World Health Organization, 2025). Lung cancer exists in various forms, with lung adenocarcinoma (LUAD) being the most prevalent, accounting for approximately 40% of all cases (Myers, 2023). In a previous group project, a machine learning model was built to distinguish cancerous LUAD tissue from non-cancerous, healthy lung tissue using RNA-seq data. The model was created to support early detection and non-invasive cancer screening methods. To address class imbalance in the data, the Synthetic Minority Oversampling Technique (SMOTE) is implemented and evaluated to determine whether it balances the class distribution and improves the model’s performance. The SMOTE algorithm works by selecting a minority class instance at random, finding its k nearest minority class neighbors, and generating a synthetic sample by interpolating in feature space between the selected instance and one of those neighbors (Van Otten, 2023). Integrating a data balancing algorithm can increase the accuracy of the model, which suggests that SMOTE could be useful in other prediction models aimed at early detection of disease.
The binary machine learning model built to distinguish cancerous from healthy lung tissue RNA-seq data has an accuracy of approximately 86%. Can the integration of SMOTE result in greater accuracy for this model?
LUAD data were taken from cBioPortal for Cancer Genomics, and healthy samples were taken from the GTEx database.
A machine learning model distinguishing cancerous from healthy lung samples was built using publicly available RNA-Seq data, which were imbalanced between the two classes. Both healthy and cancerous data were used to create synthetic data through the SMOTE algorithm in order to balance the training data. The resulting dataset was then run through the model to determine whether SMOTE increased the model’s accuracy.
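As a rough illustration of the interpolation step that SMOTE performs, the Python sketch below builds synthetic minority samples from scratch. It is not the packaged implementation used in this project: the function name `smote`, the toy expression values, and the fixed random seed are all assumptions made for the example.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic rows from a list of minority-class rows."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick one real minority sample at random.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to it.
        others = [row for j, row in enumerate(minority) if j != i]
        others.sort(key=lambda row: math.dist(base, row))
        # Choose one of its k nearest neighbors.
        neighbor = rng.choice(others[:k])
        # Interpolate at a random point on the segment between the two.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Hypothetical toy data: four minority samples with two "genes" each.
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_rows = smote(minority, n_new=6, k=2)
```

Every synthetic row lies on a straight line between two real minority samples, so each of its values stays within the range already observed for that gene in the minority class.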
Since SMOTE is already implemented in its own package, creating a synthetic dataset only required transforming the data into a form the algorithm could run on. The following is my code for creating the SMOTEd dataset. The new synthetic data was then run through the model.
Our initial model had an accuracy of 86.10% and an ROC AUC of 94.71%, with 30 misclassified samples. The top 10 most important genes were those with the greatest impact on the model’s ability to distinguish healthy from cancerous tissue.
The creation and integration of a SMOTEd dataset increased accuracy to 99.55% and ROC AUC to 99.94%, with only one misclassified sample. The top 10 most important genes also changed: 9 of the 10 are not known as key contributors to LUAD.
The number of nearest neighbors used by SMOTE was also tested at K = 4, alongside K = 5. This resulted in the following tibble:
## # A tibble: 3 × 4
##   data          Accuracy ROC_AUC PR_AUC
##   <chr>            <dbl>   <dbl>  <dbl>
## 1 Original data    0.861   0.947  0.940
## 2 K = 4            0.902   0.952  0.999
## 3 K = 5            0.996   0.999  0.999
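For readers unfamiliar with the metrics in the table, the Python sketch below shows one way accuracy and ROC AUC can be computed from true labels and predicted probabilities. The labels, scores, and 0.5 decision threshold are hypothetical values chosen for illustration and are not taken from the project’s data. PR AUC is omitted for brevity.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = LUAD, 0 = healthy) and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]

# Assumed decision threshold of 0.5 for turning scores into labels.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

acc = accuracy(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Accuracy depends on the chosen threshold, while ROC AUC summarizes how well the scores rank cancerous samples above healthy ones across all thresholds, which is why the two can move by different amounts between rows of the table.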
From here, I created graphs to compare the metric values for the original data, K = 4, and K = 5.
Integrating SMOTE into our machine learning model improved all three metrics: Accuracy, ROC_AUC, and PR_AUC. While SMOTE is useful for data balancing, it has several drawbacks, including oversampling of noise and reliance on the choice of K, the number of nearest neighbors. As seen in the results, SMOTE’s data balancing increased the accuracy of the model but failed to highlight the genes most involved in LUAD. When SMOTE creates synthetic samples by interpolating between real minority class samples, the result can sometimes be new and unrealistic patterns in gene expression, such as artificial relationships between genes that do not exist in the original dataset. The model, in turn, can learn these patterns and assign importance to the genes involved, which brings lesser-known genes to the top of the importance ranking.
The increase in accuracy after SMOTE suggests that the model is becoming better at distinguishing between the classes, but it does not identify the biologically relevant genes. While the accuracy improves, the model seems to be learning from the synthetic data in a way that does not reflect the genes originally associated with LUAD.
World Health Organization. (2025, February 3). Cancer fact sheet.
Myers, D. J. (2023). Lung adenocarcinoma. National Institutes of Health.
Wongvorachan, T., He, S., & Bulut, O. (2023). A comparison of undersampling, oversampling, and SMOTE methods for dealing with imbalanced classification in educational data mining. Information, 14(1), 54.
Van Otten, N. (2023, October 31). SMOTE oversampling & tutorial on how to implement in Python and R. Spot Intelligence. https://spotintelligence.com/2023/02/17/smote-oversampling-python-r/
The author would like to thank group members Colin, Tom, and Jeffin from the Predictive Modeling in Biomedicine course.