The data set is a synthetic data set created for practice purposes in the book Applied Analytics through Case Studies Using SAS and R by Deepti Gupta (APress). It has 600 observations with 11 total variables: 10 numerical variables and 1 categorical variable. Only 9 of these variables serve as predictors, however, because the observation number, while numerical, has no predictive power, and the categorical variable is the response.
The raw CSV file is posted on GitHub: https://raw.githubusercontent.com/JZhong01/STA321/main/Topic%204%20(Multiple%20Logistic%20Regression/syntheticBreastCancerData.csv.
The primary objective of this analysis is to determine whether these nine cellular characteristics can predict the outcome (benign or malignant) of breast cancer cases.
The objectives of this case study are to test the following hypotheses:
1. These cellular characteristics are predictive of breast cancer outcomes, and there exists a significant relationship between certain factors and the presence of malignant breast cancer.
2. The predictive model's accuracy is significantly better than random chance, indicating its efficacy in identifying breast cancer outcomes.
In our case study, we analyze the synthetic_cancer_data data set, which includes critical cellular features to differentiate between benign and malignant breast cancer. Key variables such as ‘Thickness of Clump’, ‘Cell Size Uniformity’, and ‘Cell Shape Uniformity’ provide insights into the physical attributes of cancer cells. Additional factors like ‘Marginal Adhesion’, ‘Bland Chromatin’, and ‘Mitoses’ help in understanding cellular behaviors and division patterns. This comprehensive analysis aims to enhance the accuracy of breast cancer diagnosis through detailed examination of these cellular characteristics.
When performing logistic regression, it’s a standard practice to code the response variable as binary numerical values. In our data set, ‘Outcome’ will be recoded such that “Yes” represents 1 and “No” represents 0. Here, ‘1’ signifies ‘Success’ in identifying the presence of cancer, while ‘0’ indicates ‘Failure’ to do so, within the context of statistical modeling. This terminology does not reflect any value judgment about the disease itself; rather, it’s a convention in statistical analysis to facilitate the computational process of the logistic regression.
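A minimal sketch of this recoding in R, assuming the CSV has been read into a data frame called cancer (an illustrative name) from the GitHub link above, with the response column named Outcome and levels "Yes"/"No":

```r
# Recode the response to a 0/1 indicator for logistic regression:
# 1 = "Yes" (cancer present), 0 = "No".
cancer$Outcome <- ifelse(cancer$Outcome == "Yes", 1, 0)

table(cancer$Outcome)  # quick check of the resulting class balance
```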
This synthetic data set is checked for missing values, outliers, and anything else that may require transformation.
We first verify that there are no missing values in any of the variables or observations.
The missing-value check returns a count of 0 for each variable in the data set.
There are no missing values in our data set. This is unsurprising because the data were fabricated for logistic regression practice, but we verify it rather than assume it.
We next count the potential outliers in our data. We do so by calculating the interquartile range (IQR), the difference between the 75th and 25th percentiles, and applying the common rule of thumb that flags as outliers any values more than 1.5 * IQR above the 75th percentile or more than 1.5 * IQR below the 25th percentile.
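A minimal sketch of this per-variable count, assuming the nine numerical predictors have been collected into a data frame called predictors (a name introduced here for illustration only):

```r
# Count values outside the 1.5 * IQR fences for a single variable.
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
}

# Apply the rule to each of the nine predictors.
sapply(predictors, count_outliers)
```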
Applying this rule to each predictor individually flags many observations, with as many as 85 "outliers" for a single variable. However, these univariate counts are not a reliable measure of how many or how extreme the outliers truly are. Because there are many predictor variables, a lower-than-normal value for Thickness of Clump, for example, could be explained by the other predictors or could itself signal potentially malignant cells.
As a result, no transformations were applied and no outliers were removed. Transformation and removal are two common ways of handling outliers, but this is an associative study: we are less concerned with building a predictive model than with analyzing the relationship between the predictor variables and the response, Outcome.
Multiple logistic regression offers relatively few diagnostic tools. That does not mean, however, that pre-processing procedures are skipped for the predictor variables.
Pairwise scatter plots are used to inspect the predictor variables, making potential problems easier to identify.
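A sketch of how such a plot can be produced, again assuming the nine numerical predictors are stored in the illustrative data frame predictors:

```r
# Base-R scatter plot matrix of the predictors; points are drawn semi-transparent
# so that overplotting in the skewed variables is easier to see.
pairs(predictors, pch = 20, col = rgb(0, 0, 0, 0.3))

# GGally::ggpairs(predictors)  # optional: also prints pairwise correlations
```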
The scatter plots reveal that many of the predictor variables are unimodal but right-skewed. In addition, moderate correlation is found between many of the variables.
In order to address the heavy right skews experienced by some of the predictor variables, we considered discretizing variables. Converting continuous data into discrete categories not only simplifies the data but also helps in managing skewed distributions, making the statistical analysis more robust, especially in models sensitive to non-normal distributions.
Let’s take a look at all the numerical predictor variables for our model:
Many of the individual predictors in our dataset exhibit right-skewed distributions, indicating a concentration of values towards the lower end with a tail extending towards higher values. Right-skewed distributions suggest that some observations have larger values, potentially influencing model predictions.
To mitigate the skewness, discretizing these variables is considered. Discretization involves grouping continuous values into discrete intervals or categories. This process can be advantageous as it reduces the impact of extreme values and outliers, making the data more resilient to the influence of skewed distributions.
However, in this case, discretizing may not be beneficial. Discretization can lead to information loss and oversimplification of the data, especially if the original continuous values provide valuable insights. Additionally, it might introduce artificial patterns or categories that do not reflect the true nature of the predictors. Therefore, in this analysis, retaining the original continuous nature of the predictors may be more appropriate for capturing the nuances in the data.
Considering the right-skewed distributions of individual predictors, log transformation is a viable approach to address the skewness. Log transformation involves taking the logarithm of the variable values, which tends to compress higher values and stretch lower values, effectively reducing the right skewness. This is particularly useful when dealing with variables that exhibit a wide range of values.
However, in this analysis, the use of log transformation may not be necessary. The dataset contains a sample size of 600 observations, which is sufficiently large for the Central Limit Theorem (CLT) to come into play. The CLT states that the distribution of the sample mean of a sufficiently large sample approaches a normal distribution, even if the underlying population distribution is not normal. Given the sizable dataset, the impact of right-skewed distributions on the overall analysis may be mitigated by the CLT, making log transformation unnecessary for achieving a semi-normal distribution.
The pairwise scatter plot shown earlier exhibited a heavy presence of correlated pairs of predictor variables. Since there are 9 numerical predictor variables, 36 pairwise comparisons were made, and a staggering 16 of these pairs had a correlation coefficient of 0.60 or more. A correlation coefficient measures the strength and direction of the relationship between two variables; a coefficient of 0.60 indicates a moderate to strong linear relationship. Having so many pairs of correlated predictors indicates multicollinearity.
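A short sketch of how such a count can be obtained from the correlation matrix, using the same illustrative predictors data frame as above:

```r
# Count the predictor pairs whose absolute correlation is at least 0.60.
cor_mat <- cor(predictors)
high_pairs <- which(abs(cor_mat) >= 0.60 & upper.tri(cor_mat), arr.ind = TRUE)
nrow(high_pairs)  # number of pairs with |r| >= 0.60 (each pair counted once)
```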
Multicollinearity is problematic as too much can obscure the individual effect of each predictor on the outcome variable, making it difficult to ascertain the true relationship between predictors and the response.
Addressing multicollinearity in the model requires careful consideration of various techniques. One effective approach is the utilization of variable selection techniques, which involve iteratively adding and removing variables based on their impact on the model’s fit. Another method to combat multicollinearity is employing regularization techniques like Ridge regression, which introduces a penalty to the coefficients of correlated variables, thereby reducing the magnitude of collinear predictors. Additionally, an alternative strategy involves combining correlated variables into a single predictor using principal component analysis.
However, after careful consideration, it is concluded that standardizing the variables is the most suitable approach for addressing multicollinearity in this specific analysis. Standardization ensures that all variables are on a common scale, mitigating the impact of collinearity and facilitating more stable and interpretable regression coefficients. This choice aligns with the specific characteristics of the data set and the goals of the analysis.
We will obtain candidate models by applying step-wise variable selection to our standardized predictors and then use cross-validation to identify the best model.
Standardizing numerical variables involves transforming them in a way that ensures they have a mean of 0 and a standard deviation of 1. This process is essential when dealing with variables measured on different scales, as it puts them on a common scale, preventing one variable from dominating others during model training. In this analysis, standardizing the numerical variables is critical for mitigating the impact of multicollinearity and ensuring that each predictor contributes fairly to the predictive model, improving stability and interpretability of regression coefficients.
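A minimal sketch of this standardization step, assuming the recoded data frame cancer from earlier; the column names follow the coefficient tables below, and cancer_std is an illustrative name:

```r
# Standardize each numerical predictor to mean 0 and standard deviation 1.
num_vars <- c("Thickness_of_Clump", "Cell_Size_Uniformity", "Cell_Shape_Uniformity",
              "Marginal_Adhesion", "Single_Epithelial_Cell_Size", "Bare_Nuclei",
              "Bland_Chromatin", "Normal_Nucleoli", "Mitoses")

cancer_std <- cancer
cancer_std[num_vars] <- scale(cancer[num_vars])  # (x - mean) / sd for each column
```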
It's important that we build multiple models to compare with one another to assess quality of fit. In this case study we compare candidate models of three kinds: the full model without alterations, the full model with standardized numerical predictors, and the reduced models produced by step-wise regression.
We will look at each of these models’ summary statistics separately first to consider the impact of each variable. This is to help with model interpretability on top of our goal of having a model with predictive power.
The first candidate model is a generalized linear model performed on all the unmodified predictors.
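A sketch of how this first model can be fit, assuming the data frame cancer with the 0/1 response from earlier; full.model is an illustrative name:

```r
# Full logistic regression model on the nine unmodified predictors.
full.model <- glm(Outcome ~ Thickness_of_Clump + Cell_Size_Uniformity +
                    Cell_Shape_Uniformity + Marginal_Adhesion +
                    Single_Epithelial_Cell_Size + Bare_Nuclei +
                    Bland_Chromatin + Normal_Nucleoli + Mitoses,
                  family = binomial(link = "logit"), data = cancer)

summary(full.model)$coefficients  # estimates, standard errors, z values, p-values
```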
|   | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -11.3995 | 1.2547 | -9.0857 | 0.0000 |
| Thickness_of_Clump | 0.4670 | 0.1258 | 3.7134 | 0.0002 |
| Cell_Size_Uniformity | 0.0314 | 0.1567 | 0.2003 | 0.8413 |
| Cell_Shape_Uniformity | 0.3686 | 0.1687 | 2.1854 | 0.0289 |
| Marginal_Adhesion | 0.1968 | 0.1111 | 1.7716 | 0.0765 |
| Single_Epithelial_Cell_Size | 0.0761 | 0.1391 | 0.5466 | 0.5847 |
| Bare_Nuclei | 0.4283 | 0.0872 | 4.9130 | 0.0000 |
| Bland_Chromatin | 0.3641 | 0.1366 | 2.6658 | 0.0077 |
| Normal_Nucleoli | 0.1406 | 0.1015 | 1.3853 | 0.1660 |
| Mitoses | 0.4012 | 0.2317 | 1.7315 | 0.0834 |
The full model appears to have multiple variables that are insignificant. We will deal with this later when we use goodness-of-fit measures to select the best model.
The second candidate model includes all the variables, but with the numerical predictors standardized.
|   | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -1.1907 | 0.2899 | -4.1071 | 0.0000 |
| sd.clump_thickness | 1.3068 | 0.3519 | 3.7134 | 0.0002 |
| sd.size_uniformity | 0.0922 | 0.4602 | 0.2003 | 0.8413 |
| sd.shape_uniformity | 1.0787 | 0.4936 | 2.1854 | 0.0289 |
| sd.marginal_adhesion | 0.5477 | 0.3092 | 1.7716 | 0.0765 |
| sd.single_ep_size | 0.1734 | 0.3172 | 0.5466 | 0.5847 |
| sd.bare_nuclei | 1.4887 | 0.3030 | 4.9130 | 0.0000 |
| sd.bland_chromatin | 0.8878 | 0.3330 | 2.6658 | 0.0077 |
| sd.normal_nucleoli | 0.4263 | 0.3078 | 1.3853 | 0.1660 |
| sd.mitoses | 0.7075 | 0.4086 | 1.7315 | 0.0834 |
Standardizing had no impact on the z values or p-values of our predictors: standardization rescales each coefficient and its standard error by the same factor (that predictor's standard deviation), so the statistical significance of each individual predictor is unchanged.
As a result, we reevaluate the multicollinearity in our new full model with standardized variables.
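The report does not show which diagnostic was used for this check; one common choice is the variance inflation factor (VIF) from the car package. A sketch, assuming the standardized full model has been stored in an object called full.model.std (an illustrative name):

```r
# VIF values much larger than about 5-10 are usually read as signs of
# problematic multicollinearity among the predictors.
library(car)
vif(full.model.std)
```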
The multicollinearity among our standardized predictors remains high. As a result, we will employ automatic variable selection to control for this.
Automatic step-wise regression was performed on both the unmodified full model and the full model with standardized predictors.
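A sketch of this selection step, assuming the full models from earlier are stored in full.model and full.model.std (illustrative names); step() performs AIC-based step-wise selection:

```r
# Step-wise selection in both directions, starting from each full model.
reduced.model     <- step(full.model,     direction = "both", trace = 0)
reduced.model.std <- step(full.model.std, direction = "both", trace = 0)

summary(reduced.model)$coefficients      # reduced model, unstandardized predictors
summary(reduced.model.std)$coefficients  # reduced model, standardized predictors
```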
Reduced model with unstandardized predictors:

|   | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -11.3645 | 1.2251 | -9.2766 | 0.0000 |
| Thickness_of_Clump | 0.4726 | 0.1251 | 3.7789 | 0.0002 |
| Cell_Shape_Uniformity | 0.4055 | 0.1346 | 3.0124 | 0.0026 |
| Marginal_Adhesion | 0.2170 | 0.1064 | 2.0393 | 0.0414 |
| Bare_Nuclei | 0.4325 | 0.0874 | 4.9497 | 0.0000 |
| Bland_Chromatin | 0.3767 | 0.1350 | 2.7906 | 0.0053 |
| Normal_Nucleoli | 0.1596 | 0.0974 | 1.6384 | 0.1013 |
| Mitoses | 0.4155 | 0.2249 | 1.8473 | 0.0647 |
Reduced model with standardized predictors:

|   | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -1.1729 | 0.2817 | -4.1634 | 0.0000 |
| sd.clump_thickness | 1.3225 | 0.3500 | 3.7789 | 0.0002 |
| sd.shape_uniformity | 1.1867 | 0.3939 | 3.0124 | 0.0026 |
| sd.marginal_adhesion | 0.6038 | 0.2961 | 2.0393 | 0.0414 |
| sd.bare_nuclei | 1.5033 | 0.3037 | 4.9497 | 0.0000 |
| sd.bland_chromatin | 0.9186 | 0.3292 | 2.7906 | 0.0053 |
| sd.normal_nucleoli | 0.4839 | 0.2953 | 1.6384 | 0.1013 |
| sd.mitoses | 0.7327 | 0.3966 | 1.8473 | 0.0647 |
Interestingly, there appears to be no difference between the standardized and unstandardized predictors in explaining the model. Both step-wise searches removed the same two predictors, Cell Size Uniformity and Single Epithelial Cell Size.
For our model development process, we randomly partition our data set into two distinct subsets: a training set comprising 70% of the data and a test set containing the remaining 30%.
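A minimal sketch of such a split, assuming the recoded data frame cancer; the seed value and object names are illustrative:

```r
set.seed(123)  # fix the random split so results can be reproduced

n <- nrow(cancer)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))

train <- cancer[train_idx, ]   # 70% of observations for model building
test  <- cancer[-train_idx, ]  # remaining 30% held out for final assessment
```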
The training set serves as the foundation for the exploration, refinement, and selection of candidate models. Here, we conduct an extensive search for potential models, leveraging the training data to iteratively validate their performance and ultimately identify the most promising model using the cross-validation method. This comprehensive approach allows us to assess how well each candidate model generalizes to unseen data and aids in the avoidance of overfitting to the training set.
Once the final model is identified through this iterative process, we assess its performance on the withheld test set, providing a reliable measure of the model's predictive capabilities on new, independent data.
In our analysis, we implement a robust 5-fold cross-validation approach to rigorously assess the performance of candidate models developed on the training data. This process involves partitioning the training set into five subsets, with each subset taking turns as the validation set while the remaining four are used for model training. By repeating this procedure five times, we obtain a comprehensive evaluation of each model’s generalization performance across different subsets of the training data. The average predictive errors are then calculated, guiding us in the selection of the most effective model. This meticulous cross-validation strategy ensures that our final model is not only optimized for the training set but also demonstrates strong predictive capabilities on new, unseen data.
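A sketch of this procedure for one candidate model, the reduced model with unstandardized predictors, using the 0.5 cutoff described below; the fold assignment and object names are illustrative:

```r
k <- 5
folds <- sample(rep(1:k, length.out = nrow(train)))  # assign each row to one of 5 folds
pe <- numeric(k)

for (i in 1:k) {
  # Fit on four folds, predict on the held-out fold.
  fit <- glm(Outcome ~ Thickness_of_Clump + Cell_Shape_Uniformity + Marginal_Adhesion +
               Bare_Nuclei + Bland_Chromatin + Normal_Nucleoli + Mitoses,
             family = binomial, data = train[folds != i, ])
  prob <- predict(fit, newdata = train[folds == i, ], type = "response")
  pred <- ifelse(prob > 0.5, 1, 0)

  pe[i] <- mean(pred != train$Outcome[folds == i])  # misclassification rate for fold i
}

mean(pe)  # average cross-validated value for this candidate model
```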
Raw cross-validation output: 4.72619, 4.72619, 4.72619, 4.72619.
| PE1 | PE2 | PE3 | PE4 |
|---|---|---|---|
| 0.9452 | 0.9452 | 0.9452 | 0.9452 |
In the process of model selection through cross-validation, predictive errors played a crucial role in evaluating and comparing the candidate models. A cutoff probability of 0.5 was used to classify predicted outcomes: instances with a predicted probability above 0.5 were labeled positive and those below 0.5 negative, and each model was scored by its proportion of misclassifications across the folds. Averaging these fold-level results yielded a single cross-validated value per candidate model, guiding the selection of the most suitable one; as the table above shows, the four candidates performed essentially identically (about 0.945 each).
The preceding cross-validation process determined the optimal model using a pre-defined cutoff of 0.5. However, to accurately report the model’s performance to the client, it is essential to assess its accuracy on a withheld test dataset. Consequently, the final model’s real accuracy is computed by predicting outcomes on the test data:
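A sketch of this final check, assuming the chosen reduced model and the 30% test split from earlier (reduced.model and test are the illustrative names used above):

```r
# Score the withheld test set at the 0.5 cutoff.
test_prob <- predict(reduced.model, newdata = test, type = "response")
test_pred <- ifelse(test_prob > 0.5, 1, 0)

mean(test_pred == test$Outcome)  # proportion of correct classifications on the test set
```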
| Test accuracy |
|---|
| 0.95 |
The actual accuracy of the final model on the withheld test set is 0.95, as reported in the table above. This value reflects the genuine performance of the selected model on data it has never seen, providing a more reliable measure of its accuracy. Note that because a random split defines the training and testing data, slight variations in performance metrics may occur when rerunning the code.
ROC curves, or Receiver Operating Characteristic curves, are graphical representations used to evaluate the performance of binary classification models. These curves plot the True Positive Rate (also known as sensitivity) against the False Positive Rate (which is 1 minus specificity) for different threshold settings.
The true positive rate is the probability that the model correctly predicts a positive result, given that the result is actually positive. The false positive rate is the probability that the model incorrectly predicts a positive result, given that the result is actually negative.
The ROC curve graphs the relationship between the true positive rate and the false positive rate to show how well the model separates the two classes.
The Area Under the Curve (AUC) is another metric used to evaluate the performance of a binary classification model. It is closely tied to the ROC curve, which plots the true positive rate against the false positive rate at various threshold settings; the AUC quantifies the entire two-dimensional area underneath that curve.
This area provides an aggregate measure of performance across all possible classification thresholds. An AUC of 1 indicates a perfect model; an AUC of 0.5 suggests no discriminative power, equivalent to random guessing; and an AUC less than 0.5 suggests worse-than-random predictions.
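The report does not show which package produced its ROC curve; one common option is pROC. A sketch, assuming the test-set probabilities test_prob from the accuracy check above:

```r
library(pROC)

roc_obj <- roc(test$Outcome, test_prob)  # build the ROC object from labels and probabilities
plot(roc_obj)                            # draw the ROC curve
auc(roc_obj)                             # area under the curve
```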
In our example, the AUC for our reduced models exceeds 0.99, meaning the model has excellent predictive performance.
Our final model was derived from a synthetic data set consisting of 600 observations and 11 variables, with 9 predictors capturing various cellular characteristics indicative of benign or malignant breast cancer. We aimed to associate these cellular traits with cancer outcomes.
Several alterations had to be performed on the data. We began by recoding the categorical response variable 'Outcome' from "Yes" and "No" to 1 and 0, ensuring compatibility with logistic regression analysis. Preliminary checks for missing values and outliers followed, leading to the discovery of substantial multicollinearity, which we addressed through automatic variable selection using step-wise regression. We then split the data 70% for training and 30% for testing, using cross-validation to find the model with the best predictive power.
Both reduced models, with and without standardized predictor variables, had the same predictive power, so we use the simpler model, in which the predictors are not standardized.
We finally checked the model's predictive power by examining the ROC curves and calculating the area under the curve. This check assessed and compared the global performance of our candidate models and confirmed the solid predictive performance of the final model.
Why is it bad to discretize a continuous variable? (2022, October 14). Cross Validated. Retrieved January 4, 2024, from https://stats.stackexchange.com/questions/592246/why-is-it-bad-to-discretize-a-continuous-variable
Gupta, D. (2018). Applied Analytics through Case Studies Using SAS and R. APress.
Ciaburro, G. (2018). Regression Analysis with R: Design and Develop Statistical Nodes to Identify Unique Relationships Within Data at Scale. Packt Publishing.