Step 1: The Data

Dataset Description

The data set used in this analysis is the Breast Cancer Wisconsin Diagnostic data set (brca), which is commonly used for medical classification tasks.

The data set contains 569 observations and 31 variables: 30 numeric features plus the class label. Each observation represents measurements taken from a digitized image of a breast mass.

The variables describe characteristics of cell nuclei, such as size, texture, and shape.

The outcome variable is Class:

  • B = Benign (non-cancerous)

  • M = Malignant (cancerous)

Understanding this outcome is important because early detection of malignant tumors can significantly improve treatment outcomes and survival rates.

Data Source: The data set was obtained from the brca object in the dslabs R package.

Data Preparation and Structure

The original data set is stored as a list containing:

  • x: numeric feature matrix (measurements)

  • y: classification labels

To make the data set suitable for analysis, it was converted into a structured data frame.

No missing values were found. The conversion ensures the data are in a format suitable for statistical analysis and visualization.
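The conversion and the missing-value check can be sketched in R as follows. This is a minimal sketch under stated assumptions, not necessarily the code used for this report: the helper names brca_to_df and count_missing are hypothetical, and storing the labels in a column called Class is an assumption based on the outcome name given above. The last part runs only if the dslabs package is installed.

```r
# Convert a list holding a numeric matrix `x` and a label vector `y`
# into one data frame, with the labels in a `Class` column (assumed name).
brca_to_df <- function(brca_list) {
  df <- as.data.frame(brca_list$x)
  df$Class <- factor(brca_list$y)
  df
}

# Total number of missing values across all columns.
count_missing <- function(df) {
  sum(is.na(df))
}

if (requireNamespace("dslabs", quietly = TRUE)) {
  brca_df <- brca_to_df(dslabs::brca)
  print(dim(brca_df))            # number of rows and columns
  print(count_missing(brca_df))  # number of NA values
}
```

Keeping the features and the label in one data frame means each plot and summary in the later steps can refer to columns by name.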

Step 2: Data Health

Choice of Variables

Two explanatory variables were selected:

  • radius_mean (mean radius of the cell nuclei, a measure of size)

  • texture_mean (mean texture of the cell nuclei, measured as variation in gray-scale intensity)

These variables are relevant because cancerous cells often differ from normal cells in size and texture.

2.1 Outcome Variable Distribution

Interpretation:

The bar chart shows 357 benign (B) tumors and 212 malignant (M) tumors, which together account for all 569 observations. Benign cases are therefore clearly more frequent than malignant ones (about 63% versus 37%). This kind of imbalance is common in medical data sets and can bias a classification model toward the majority class.
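The class balance can also be summarized numerically. The sketch below rebuilds the label vector from the two counts read off the bar chart (357 and 212) instead of loading the data, so it runs without any packages; class_summary is a hypothetical helper name.

```r
# Tabulate class labels and express each count as a percentage of the total.
class_summary <- function(labels) {
  counts <- table(labels)
  data.frame(
    class = names(counts),
    n = as.vector(counts),
    percent = round(100 * as.vector(counts) / sum(counts), 1)
  )
}

# Label vector rebuilt from the counts reported in the interpretation above.
labels <- rep(c("B", "M"), times = c(357, 212))
print(class_summary(labels))
```

With these counts, a model that always predicts B would already be correct for about 62.7% of cases, which gives a simple baseline against which any later classifier can be compared.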

2.2 Distribution of Radius Mean

Interpretation:

The histogram depicts the distribution of radius mean values across the data set. The distribution exhibits a right-skewed pattern, with the highest frequency of observations concentrated between radius mean values of 12 and 14, reaching a peak frequency of approximately 65 to 67.

The distribution gradually tapers toward the right, with values extending as far as 28, suggesting the presence of tumors with considerably larger radius measurements at the upper end of the data set.

2.3 Distribution of Texture Mean

Interpretation:

The histogram illustrates the distribution of texture mean values across the data set. The distribution follows a relatively bell-shaped pattern, centered around values of 18 to 20, with peak frequencies reaching approximately 53 to 55. A slight right skew is observed, with values extending toward 40, indicating the presence of a small number of cases with notably higher texture mean measurements.
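The skew described for radius_mean and texture_mean can be checked with a number as well as by eye. The sketch below uses the moment-based skewness coefficient, where a positive value indicates a longer right tail. The function name sample_skewness and the bin count of 30 are assumptions; the report does not state which binning produced the histograms, so bar heights may differ. The plotting part runs only if dslabs is installed.

```r
# Moment-based sample skewness: positive values indicate a longer right tail,
# negative values a longer left tail, and 0 a symmetric sample.
sample_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

if (requireNamespace("dslabs", quietly = TRUE)) {
  features <- dslabs::brca$x
  for (v in c("radius_mean", "texture_mean")) {
    # breaks = 30 is an assumed bin count, not taken from the report.
    hist(features[, v], breaks = 30,
         main = paste("Distribution of", v), xlab = v)
    cat(v, "skewness:", round(sample_skewness(features[, v]), 2), "\n")
  }
}
```

A larger positive coefficient for one variable than for the other would support the description of a pronounced right skew in one histogram and only a slight skew in the other.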

Step 3: Relationship Between Variables and Outcome

3.1 Radius Mean vs Class

Interpretation:

The boxplot presents a comparison of radius mean values between benign (B) and malignant (M) tumor classes. Benign tumors recorded a median radius mean of approximately 12, while malignant tumors recorded a notably higher median of approximately 17 to 18.

The boxes show minimal overlap, and the malignant group has a wider interquartile range. In the benign group, a few outliers are visible at approximately 17 (above the box) and approximately 9 (below it). This clear separation suggests that radius mean is a strong candidate feature for distinguishing between benign and malignant tumors.
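The medians read off the boxplot can be computed directly. The following is a minimal sketch using base R; median_by_class is a hypothetical helper name, and the part that draws the boxplot runs only if dslabs is installed.

```r
# Median of `values` within each level of `classes`.
median_by_class <- function(values, classes) {
  tapply(values, classes, median)
}

if (requireNamespace("dslabs", quietly = TRUE)) {
  radius <- dslabs::brca$x[, "radius_mean"]
  tumor_class <- dslabs::brca$y
  boxplot(radius ~ tumor_class, xlab = "Class", ylab = "radius_mean")
  print(median_by_class(radius, tumor_class))
}
```

Printing the medians alongside the plot avoids having to estimate them from the axis.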

3.2 Texture Mean vs Class

Interpretation:

The boxplot illustrates the comparison of texture mean values between benign and malignant tumor classes. Benign tumors recorded a median texture mean of approximately 18, while malignant tumors recorded a higher median of approximately 22 to 23.

Several outliers are visible in both classes, particularly in the benign group, where values extend to 30 and beyond. The considerable overlap between the interquartile ranges of the two classes indicates that texture mean alone has limited power to separate benign from malignant tumors.
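The difference in overlap between the two boxplots can be put on a common footing by measuring how much of the interquartile ranges the classes share. This is a rough sketch, not a formal test: iqr_overlap is a hypothetical helper, it is intended for a two-class comparison, and it reports the length of the shared interval in the units of the variable. The comparison on the real data runs only if dslabs is installed.

```r
# Length of the interval shared by the interquartile ranges of the classes.
# Returns 0 when the boxes do not overlap.
iqr_overlap <- function(values, classes) {
  groups <- split(values, classes)
  q1 <- sapply(groups, quantile, probs = 0.25, names = FALSE)
  q3 <- sapply(groups, quantile, probs = 0.75, names = FALSE)
  max(0, min(q3) - max(q1))
}

if (requireNamespace("dslabs", quietly = TRUE)) {
  x <- dslabs::brca$x
  y <- dslabs::brca$y
  for (v in c("radius_mean", "texture_mean")) {
    cat(v, "IQR overlap:", round(iqr_overlap(x[, v], y), 2), "\n")
  }
}
```

Because the two variables are on different scales, the overlap is easiest to interpret relative to each variable's own interquartile range, not by comparing the raw lengths directly.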

Limitations of the Analysis

Conclusion

Based on the exploratory data analysis, radius mean showed a stronger ability to distinguish between benign and malignant tumors, as evidenced by the clear separation in the boxplot with minimal overlap between the two classes. Texture mean, while showing some difference between classes, exhibited considerable overlap, which limits its discriminatory power as a standalone feature.

The data set reflects an imbalance between tumor classes, with benign cases outnumbering malignant cases. This observation should be taken into consideration in any subsequent classification or predictive modeling.

Overall, the findings suggest that radius mean is a more reliable feature than texture mean for differentiating tumor classes, and that both features should ideally be used together with additional variables to strengthen any subsequent analysis.