Introduction

This report is based on the Breast Cancer Dataset from Kaggle. The dataset contains information about breast cancer patients from the November 2017 update of the SEER Program of the NCI. It covers female patients with infiltrating duct and lobular carcinoma breast cancer diagnosed in 2006-2010, for a total of 4,024 observations. Important variables include age, T Stage, N Stage, marital status, tumor size, and survival status.

Dataset Descriptive Stats

The code below reports descriptive statistics for the data. Most of the variables are categorical, so they are left out of this numeric summary.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.patches as patches
import warnings
from matplotlib.ticker import FuncFormatter

warnings.filterwarnings("ignore")

path = "//apporto.com/dfs/LOYOLA/Users/rlhamilton1_loyola/Documents/DS 736/"

filename = path + 'Breast_Cancer.csv'

df = pd.read_csv(filename)

df.describe()
##                Age   Tumor Size  ...  Reginol Node Positive  Survival Months
## count  4024.000000  4024.000000  ...            4024.000000      4024.000000
## mean     53.972167    30.473658  ...               4.158052        71.297962
## std       8.963134    21.119696  ...               5.109331        22.921430
## min      30.000000     1.000000  ...               1.000000         1.000000
## 25%      47.000000    16.000000  ...               1.000000        56.000000
## 50%      54.000000    25.000000  ...               2.000000        73.000000
## 75%      61.000000    38.000000  ...               5.000000        90.000000
## max      69.000000   140.000000  ...              46.000000       107.000000
## 
## [8 rows x 5 columns]
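
By default, `describe()` summarizes only the numeric columns, which is why the categorical variables are missing from the output above. The following is a minimal sketch of one way to summarize them as well. The frame `toy` and its values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data; the values are made up
toy = pd.DataFrame({
    'Age': [45, 52, 61, 38],
    'Race': ['White', 'Black', 'White', 'Other'],
    'Status': ['Alive', 'Alive', 'Dead', 'Alive'],
})

# Summarize only the non-numeric columns: count, unique, top, freq
cat_summary = toy.describe(exclude=[np.number])
print(cat_summary)

# Frequency table for a single categorical column
race_counts = toy['Race'].value_counts()
print(race_counts)
```

On the real data, the same two calls could be applied to `df` to see how many categories each variable has and which category is most common.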

Findings

This analysis explores the relationship between cancer survival rates and factors such as T Stage, race, age, tumor size, and marital status. Some findings include racial disparities in survival and a trend of decreasing survival as T Stage increases. The following charts provide a visual representation of these trends and patterns, offering insights into how these factors influence cancer outcomes.

Scatterplot

This scatter plot shows the percentage of patients who are alive, grouped by T Stage and race. T4, the most advanced stage, has the lowest survival rate across all races, and as expected the survival rate decreases as T Stage increases. The Other race category has the highest survival rate at every T Stage. White patients have slightly lower percentages, while Black patients have markedly lower survival rates. This may indicate racial disparities in cancer survival, with Black patients potentially facing worse outcomes than other racial groups. The difference could stem from a variety of factors, including differences in access to healthcare, socioeconomic status, healthcare quality, early detection, and treatment options.

total = df.groupby(['T Stage ', 'Race'])['Status'].count().reset_index(name = 'Total Count')
alive_df = df[df['Status'] == 'Alive']
alive = alive_df.groupby(['T Stage ', 'Race']).size().reset_index(name='Count')
x = alive.merge(total, on=['T Stage ', 'Race'], how='left')
x['Percents'] = (x['Count'] / x['Total Count']) *100

plt.figure(figsize=(16, 8))

plt.scatter(x['Race'], x['T Stage '], marker='8', cmap='Wistia', c=x['Percents'],
            s=x['Percents'] *20, edgecolors='black')

plt.title('Percentage of Cancer Cases Alive by T Stage and Race', fontsize=18)
plt.xlabel('Race', fontsize=14)
plt.ylabel('T Stage', fontsize=14)

cbar = plt.colorbar()
cbar.set_label('Percentage Alive', rotation=270, fontsize=14, color='black', labelpad = 30)

cbar_ticks = np.linspace(x['Percents'].min(), x['Percents'].max(), num=6)  # Auto-adjusted ticks
cbar.set_ticks(cbar_ticks)
cbar.set_ticklabels([f'{tick:0.0f}%' for tick in cbar_ticks])

for i in range(len(x)):
    plt.text(x['Race'].iloc[i], x['T Stage '].iloc[i], 
             f"{x['Percents'].iloc[i]:.0f}%", 
             fontsize=10, ha='center', va='center', color='black', fontweight='bold')
plt.show()
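
The percentages in this chart depend on how many patients fall into each T Stage and race group, and small groups can swing widely. The following is a minimal sketch of computing the percentage alive together with the group size in one pass. The frame `toy`, the name `summary`, and all values are illustrative assumptions. Note that the real column is named `'T Stage '`, with a trailing space.

```python
import pandas as pd

# Hypothetical records; column names mirror the dataset, values are made up
toy = pd.DataFrame({
    'T Stage': ['T1', 'T1', 'T1', 'T1', 'T4', 'T4'],
    'Race':    ['White', 'White', 'White', 'White', 'Other', 'Other'],
    'Status':  ['Alive', 'Alive', 'Alive', 'Dead', 'Alive', 'Dead'],
})

# The mean of a boolean column is the share of True values, so a single
# aggregation yields both the share alive and the group size
summary = (
    toy.assign(alive=toy['Status'].eq('Alive'))
       .groupby(['T Stage', 'Race'])['alive']
       .agg(pct_alive='mean', n='size')
       .reset_index()
)
summary['pct_alive'] = summary['pct_alive'] * 100
print(summary)
```

Keeping `n` next to each percentage makes it easier to judge whether a low or high value rests on only a handful of patients.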

Bar Charts

The top plot shows the distribution of cancer cases by age. The x-axis represents age, and the y-axis shows the number of cancer cases at each age. The bars are color-coded by whether the count at that age is more than 5% above the mean, within 5% of it, or more than 5% below it. The mean number of cases per age is 100.6. Age 45 falls within 5% of this mean, every age below 45 is more than 5% below it, and every age above 45 is more than 5% above it. Within this dataset, breast cancer diagnoses are therefore more common among older women.

The second plot shows the average survival months for patients aged 45-65. The x-axis represents age, and the y-axis shows the average survival months at that age. The bars are color-coded by whether the average survival at that age is more than 1% above the mean, within 1% of it, or more than 1% below it. The mean is 72 months, and ages 54, 64, and 65 are the only ages more than 1% below it. Interestingly, average survival at age 54 is lower than at ages 55-63, even though older patients would be expected to survive fewer months.

#Create dataframe 
df = df.sort_values('Survival Months', ascending = False)

agedf = df.groupby(['Age']).size().reset_index(name='Count')

survival = df.groupby('Age')['Survival Months'].mean().round().reset_index(name = 'Average')
survival = survival.sort_values('Age', ascending=False)
survival_filtered = survival[(survival['Age'] >= 45) & (survival['Age'] <= 65)]

#Function for color selecting
def pick_colors_according_to_mean_count(df):
    mean_value = df['Count'].mean() 
    colors = []

    for value in df['Count']:  
        if value > mean_value*1.05:
            colors.append('firebrick')  
        elif value < mean_value*0.95:
            colors.append('forestgreen')  
        else:
            colors.append('black') 

    return colors

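# Colors for the second plot, computed from the same filtered rows that are
# plotted so that each bar is compared with the plotted mean and gets its own color
mean_value = survival_filtered.Average.mean()
colors = []
for value in survival_filtered.Average:
    if value > mean_value * 1.01:
        colors.append('firebrick')
    elif value < mean_value * 0.99:
        colors.append('forestgreen')
    else:
        colors.append('black')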

#Apply color selection
my_colors1 = pick_colors_according_to_mean_count(agedf)
above = patches.Patch(color='firebrick', label = 'Above Average')
at = patches.Patch(color='black', label = 'Within 5% of Average')
below = patches.Patch(color='forestgreen', label = 'Below Average')
my_colors2 = colors

#Build plot space
fig = plt.figure(figsize = (18,16))
fig.suptitle('Cancer Cases Age Breakdown', fontsize=18, fontweight='bold')

#Create plot with labels
ax1 = fig.add_subplot(2,1,1)
ax1.bar(agedf.Age, agedf.Count, label = 'Count', color = my_colors1)
ax1.legend(handles=[above, at, below], fontsize=14)
plt.axhline(agedf.Count.mean(), color = 'black', linestyle = "dashed")
ax1.spines['right'].set_visible(False)
ax1.spines['top'].set_visible(False)
ax1.set_title('Count of Cancer Cases by Age', size =16)
ax1.text(30, agedf.Count.mean()+5, 'Mean = ' + str(agedf.Count.mean()), rotation = 0, fontsize=14)
ax1.set_xlabel('Age', fontsize=14)
ax1.set_ylabel('Cancer Cases', fontsize=14)

#Create second plot
at2 = patches.Patch(color='black', label = 'Within 1% of Average')
ax2 = fig.add_subplot(2,1,2)
ax2.bar(survival_filtered.Age, survival_filtered.Average, label = 'Average Survival Months', color = colors)
ax2.legend(handles=[above, at2, below], fontsize=14)
plt.axhline(survival_filtered.Average.mean(), color = 'black', linestyle = "dashed")
ax2.spines['right'].set_visible(False)
ax2.spines['top'].set_visible(False)
ax2.set_title('Average Survival Months by Age (45-65)', size =16)
ax2.text(45, survival_filtered.Average.mean()+8, 'Mean = ' + str(round(survival_filtered.Average.mean())), rotation = 0, fontsize=14)
ax2.set_xlabel('Age', fontsize=14)
ax2.set_ylabel('Average Survival Months', fontsize=14)

ax2.set_ylim(0, survival_filtered.Average.max()*1.3)

plt.show()
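
Both plots use the same coloring rule with different tolerances (5% and 1%). The following is a minimal sketch of that rule as one reusable function. The names `classify_against_mean` and `color_for`, and the sample counts, are hypothetical and only illustrate the idea.

```python
def classify_against_mean(values, tolerance):
    """Label each value as above, within, or below a band around the mean."""
    mean_value = sum(values) / len(values)
    upper = mean_value * (1 + tolerance)
    lower = mean_value * (1 - tolerance)
    labels = []
    for value in values:
        if value > upper:
            labels.append('above')
        elif value < lower:
            labels.append('below')
        else:
            labels.append('within')
    return labels

# Same colors as the charts in this report
color_for = {'above': 'firebrick', 'within': 'black', 'below': 'forestgreen'}

counts = [80, 100, 120, 103]          # made-up counts with a mean of 100.75
labels = classify_against_mean(counts, tolerance=0.05)
bar_colors = [color_for[label] for label in labels]
print(labels)
print(bar_colors)
```

Passing the values that are actually plotted, together with the tolerance, keeps the mean, the colors, and the bars in step with one another.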

Line Chart

This line plot shows the average survival months for patients grouped by tumor size and N Stage (a classification of how far the cancer has spread to the lymph nodes). The N1 and N3 lines both slope downward overall, meaning that average survival decreases as tumor size increases. N2 is interesting: it decreases until the 80-100 mm bin and then increases, which might indicate that larger tumors at the N2 stage are treated more effectively. Three quarters of the tumors are 38 mm or smaller, however, so the larger bins contain relatively few patients and their averages are less stable. N1 has the highest average survival, followed by N2 and then N3, as expected: for the same tumor size, higher N Stages have worse survival outcomes than lower ones. N3 drops very steeply in the final bin, which covers tumors of 120-140 mm (140 mm is the largest tumor size in the data).

bin_edges = [0, 20, 40, 60, 80, 100, 120, 140]  # Adjust as needed
bin_labels = ['0-20mm', '20-40mm', '40-60mm', '60-80mm', '80-100mm', '100-120mm', '120-140mm']

# Create binned tumor size column
df['tumor_size_bin'] = pd.cut(df['Tumor Size'], bins=bin_edges, labels=bin_labels, include_lowest=True)

# Group by binned tumor size and N Stage
survival_df = df.groupby(['tumor_size_bin', 'N Stage'])['Survival Months'].mean().reset_index(name='Avg Survival Months')

my_colors = {'N1':'red', 'N2':'blue', 'N3':'green'}

# Initialize figure
fig = plt.figure(figsize=(18,10))
ax = fig.add_subplot(1,1,1)

# Plot line for each N Stage
for key, grp in survival_df.groupby('N Stage'):  # a string key keeps `key` a plain label such as 'N1'
    grp.plot(ax=ax, kind='line', x='tumor_size_bin', y='Avg Survival Months', color=my_colors[key], label=key, marker='o')

# Titles and labels
plt.title('Average Survival Months by Tumor Size for Each N Stage', fontsize=18)
ax.set_xlabel('Tumor Size (mm)', fontsize=18)
ax.set_ylabel('Average Survival Months', fontsize=18, labelpad=20)
ax.tick_params(axis='x', labelsize=14, rotation=0)
ax.tick_params(axis='y', labelsize=14, rotation=0)

# Legend
plt.legend(title="N Stage", fontsize=17)

plt.show()  
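
The binning step relies on `pd.cut`, which needs exactly one label per interval and treats each interval as closed on the right by default. The following is a minimal sketch of how the edges map to bins, using made-up tumor sizes. Building the labels from the edges is an illustrative choice here, not the approach used in the report.

```python
import pandas as pd

sizes = pd.Series([1, 20, 21, 95, 140])   # made-up tumor sizes in mm
edges = [0, 20, 40, 60, 80, 100, 120, 140]

# One label per interval, built from consecutive pairs of edges
labels = [f'{low}-{high}mm' for low, high in zip(edges[:-1], edges[1:])]

# Intervals are closed on the right, so 20 falls in 0-20mm and 21 in 20-40mm
binned = pd.cut(sizes, bins=edges, labels=labels, include_lowest=True)
print(binned.tolist())
```

With these edges the last interval ends at 140 mm, so any larger value would come back as missing rather than being placed in a bin.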

Stacked Bar Chart

Grade IV anaplastic tumors, which correspond to the undifferentiated status, were removed from the data for this stacked bar plot; there were only a few such cases, and including them made the chart unreadable. The chart displays the count of patients by differentiation status (how closely the cancer cells resemble normal cells) and race. Moderately differentiated tumors occur most often, followed by poorly differentiated and then well differentiated. The proportions of White, Black, and Other patients appear very similar in each status category, with White patients the majority, then Black, then Other. The chart does not suggest that race plays a role in the distribution of differentiation status, as each status follows a similar pattern.

df = df[~df['Grade'].str.contains('anaplastic', case=False, na=False)]

stacked_df = df.groupby(['differentiate', 'Race']).size().reset_index(name='Count')

# Pivot table for stacked bar plot
stacked_df = stacked_df.pivot(index='differentiate', columns='Race', values='Count')

# Build plot space
fig = plt.figure(figsize=(18,10))
ax = fig.add_subplot(1,1,1)

colors = ["#39DB49", "#63A4EB", "#F3C59C"]
# Plot stacked bar chart
stacked_df.plot(kind='bar', stacked=True, ax=ax, color = colors)

# Labels and Titles
plt.ylabel('Count of Patients', fontsize=18, labelpad=10)
plt.title('Count of Patients by Differentiated Status and Race \n Stacked Bar Plot', fontsize=18)
plt.xticks(rotation=0, horizontalalignment='center', fontsize=14)
plt.yticks(rotation=0, fontsize=14)
ax.set_xlabel('Differentiated Status', fontsize=18)

#Format
ax.yaxis.set_major_formatter(FuncFormatter(lambda x, pos: f'{int(x):,}'))

# Legend
plt.legend(title="Race", loc='best', fontsize=14)

plt.show()
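
Stacked counts make it hard to compare racial composition across bars of different heights. The following is a minimal sketch of checking the proportions directly with a row-normalized cross-tabulation. The frame `toy` and its category names are simplified, made-up stand-ins for the real values.

```python
import pandas as pd

# Hypothetical records; the values are made up for illustration
toy = pd.DataFrame({
    'differentiate': ['Well', 'Well', 'Moderate', 'Moderate', 'Moderate', 'Moderate'],
    'Race': ['White', 'Black', 'White', 'White', 'Black', 'Other'],
})

# normalize='index' divides each row by its own total, giving the share of
# each race within every differentiation status
shares = pd.crosstab(toy['differentiate'], toy['Race'], normalize='index') * 100
print(shares.round(1))
```

If the rows of the resulting table are close to one another, the racial mix is similar across statuses, which is the pattern the chart is read as showing.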

Donut Chart

This donut chart visualizes the distribution of cancer cases by grade and status (alive or dead). The sections are sized by percentage so that the data is normalized. Grade 2 is the most frequent grade, with almost 59% of cases; Grade 3 is next, and Grade 1 is the least common. If a grade has a high percentage of deceased patients, it could suggest a poorer prognosis for that grade. Grade 3 has the highest percentage of deceased patients relative to its alive patients. Grade 2 accounts for a larger percentage of deceased patients overall, but it also has the most patients. Grade 1 has the lowest percentage of deceased patients, indicating better outcomes, while Grade 3 indicates poorer outcomes. The total number of cases is shown in the center; it reflects that 19 cases were removed from the original dataset along with the anaplastic grade, as described in the previous section.

# Outer ring: grade counts
outer_df = df['Grade'].value_counts().reset_index()
outer_df.columns = ['Grade', 'Count']

# Inner ring: Status
inner_df = df.groupby(['Grade', 'Status']).size().reset_index(name='Count')

total_cases = outer_df['Count'].sum()
number_outside_colors = len(outer_df)
outside_color_ref_number = np.arange(number_outside_colors) * 4

#Build plot space
fig = plt.figure(figsize=(8,8))
ax = fig.add_subplot(1,1,1)

colormap = plt.get_cmap("tab20c")
outer_colors = colormap(outside_color_ref_number)
inner_colors = ["#83E98D", "#39DB49", "#A5CBF4", "#63A4EB", "#ED9E58", "#F3C59C"]
# Outer pie: cancer grade
outer_df.set_index('Grade')['Count'].plot(
   kind='pie', radius=1, colors=outer_colors, pctdistance=0.85, labeldistance=1.1,
   wedgeprops=dict(edgecolor='white'), textprops={'fontsize':18}, 
   autopct=lambda p: '{:.2f}%\n({:,.0f})'.format(p, (p/100)*total_cases),
   startangle=90
)

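# Inner pie: status
inner_df['Label'] = inner_df['Grade'] + '-' + inner_df['Status']
# Start the inner ring at the angle where the outer wedge of its first grade begins,
# so each Alive/Dead pair sits inside its own grade (this assumes both rings list
# the grades in the same cyclic order)
first_grade_pos = outer_df['Grade'].tolist().index(inner_df['Grade'].iloc[0])
inner_start = 90 + 360 * outer_df['Count'].iloc[:first_grade_pos].sum() / total_cases
inner_df.set_index('Label')['Count'].plot(
   kind='pie', radius=0.7, colors=inner_colors, pctdistance=0.60, labeldistance=0.8,
   wedgeprops=dict(edgecolor='white'), textprops={'fontsize':10}, 
   autopct='%1.2f%%',
   startangle=inner_start
)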

# Donut hole
hole = plt.Circle((0,0), 0.3, fc='white')
fig.gca().add_artist(hole)

ax.yaxis.set_visible(False)

# Total case label
ax.text(0, 0, 'Total Cases\n' + '{:,}'.format(total_cases), size=18, ha='center', va='center')


plt.title('Cancer Case Distribution by Grade and Status', fontsize=18)
ax.axis('equal')
plt.tight_layout()
plt.show()
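
The inner-ring percentages are shares of all cases, so a large grade can show a large deceased share simply because it has more patients. The following is a minimal sketch of the two ways to express the deceased percentage. The frame `toy` and its values are made up and only illustrate the difference.

```python
import pandas as pd

# Hypothetical records; the values are made up for illustration
toy = pd.DataFrame({
    'Grade':  ['1', '1', '2', '2', '2', '2', '3', '3'],
    'Status': ['Alive', 'Alive', 'Alive', 'Alive', 'Alive', 'Dead', 'Alive', 'Dead'],
})

# Rows are grades, columns are statuses; missing combinations become 0
counts = toy.groupby(['Grade', 'Status']).size().unstack(fill_value=0)

# Deceased patients in each grade as a share of ALL cases
share_of_all = counts['Dead'] / counts.to_numpy().sum() * 100

# Deceased patients in each grade as a share of THAT grade's cases
share_within_grade = counts['Dead'] / counts.sum(axis=1) * 100

print(share_of_all)
print(share_within_grade)
```

In this toy data, Grades 2 and 3 have the same share of all cases but different within-grade shares, which is why the within-grade figure is the one that speaks to prognosis.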

Bump Chart

This bump chart shows how the marital statuses rank across bins of survival months, to check whether marital status is related to how long patients survive. The counts are normalized: each status is ranked by the share of its own patients who fall in each survival bin. The lines weave back and forth as the rankings change from bin to bin, so there appears to be no relationship between marital status and survival months. Each marital status ranks first once and last once, which further suggests that marital status does not influence how many months a person survives.

# Define survival bins
bins = [0, 20, 40, 60, 80, 100, np.inf]
labels = ['0–20', '20–40', '40–60', '60–80', '80–100', '100+']
df['survival_bin'] = pd.cut(df['Survival Months'], bins=bins, labels=labels, right=True)

# Group by survival bin and marital status, count patients
bump_df = df.groupby(['survival_bin', 'Marital Status']).size().reset_index(name='count')

# Manage df so the data is normalized
total_counts = df['Marital Status'].value_counts().to_dict()
bump_df['normalized'] = bump_df.apply(lambda row: row['count'] / total_counts[row['Marital Status']], axis=1)
bump_df_norm = bump_df.pivot(index='Marital Status', columns='survival_bin', values='normalized')
bump_df_ranked = bump_df_norm.rank(axis=0, ascending=False, method='min')
bump_df_ranked = bump_df_ranked.T

#Build plot space
fig = plt.figure(figsize=(18,10)) 
ax = fig.add_subplot(1,1,1)

#Plot the data points
bump_df_ranked.plot(kind='line', ax=ax, marker='o',
                   markeredgewidth=1, linewidth=6,
                   markersize=44, markerfacecolor='white')

ax.invert_yaxis()

num_rows = bump_df_ranked.shape[0]
num_cols = bump_df_ranked.shape[1]

#Labels and ticks
plt.ylabel('Marital Status Ranking (by Normalized Proportion)', fontsize=18, labelpad=10)
plt.title('Ranking of Marital Status Across Survival Time Bins (Normalized) \n Bump Chart', fontsize=18, pad=15)
plt.xticks(np.arange(num_rows), bump_df_ranked.index, fontsize=14)
plt.yticks(range(1, num_cols+1, 1), fontsize=14)
ax.set_xlabel('Survival Bin (Months)', fontsize=18)

# Legend 
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles, labels, bbox_to_anchor=(1.01,1.01), fontsize=14,
         labelspacing=1, markerscale=.4,
         borderpad=1, handletextpad=0.8)

plt.show()
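
The ranking behind the bump chart has two steps: normalize the counts within each marital status, then rank the statuses within each survival bin. The following is a minimal sketch of those two steps on made-up records. The frame `toy`, the two-bin split, and the values are illustrative assumptions.

```python
import pandas as pd

# Hypothetical records; the values are made up for illustration
toy = pd.DataFrame({
    'Marital Status': ['Married'] * 4 + ['Single'] * 2,
    'survival_bin':   ['0-60', '60+', '60+', '60+', '0-60', '60+'],
})

# Counts per status (rows) and survival bin (columns)
counts = pd.crosstab(toy['Marital Status'], toy['survival_bin'])

# Step 1: divide each row by its total so large groups do not dominate
shares = counts.div(counts.sum(axis=1), axis=0)

# Step 2: rank the statuses within each bin; rank 1 is the largest share
ranks = shares.rank(axis=0, ascending=False, method='min')
print(ranks)
```

Both statuses have one patient in the `0-60` bin, yet they rank differently there because the ranking uses shares rather than raw counts.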

Conclusion

This analysis reveals several findings about breast cancer survival and patient characteristics. Survival rates decrease as T Stage increases, and Black patients have markedly lower survival rates than White patients and patients in the Other category, indicating potential racial disparities in cancer outcomes. Cases in this dataset are more common among older women, with counts rising noticeably after age 45. While survival months generally decrease with age, some unexpected patterns emerge, such as lower average survival at age 54 than at older ages. Larger tumors at the N2 stage are associated with longer average survival, possibly because of more effective treatment at this stage; otherwise, higher N Stages are associated with fewer survival months. Race does not appear to influence tumor differentiation status: moderately differentiated tumors are the most common regardless of race. Marital status does not appear to affect survival months, as the rankings fluctuate without a clear relationship to survival. These findings show the importance of considering a range of patient factors when evaluating cancer survival rates and stages.