import os
os.environ['QT_QPA_PLATFORM_PLUGIN_PATH'] = 'D:/Anaconda3/Library/plugins/platforms'

Introduction to the Heart Disease Dataset

Heart disease refers to several types of heart conditions. Key risk factors include high blood pressure, high blood cholesterol, and smoking. Other medical conditions and lifestyle choices can also put people at higher risk for heart disease. In the United States, the most common type of heart disease is coronary artery disease (CAD), which affects blood flow to the heart and can cause a heart attack.

The heart disease data set used for this analysis was obtained from Kaggle. It contains 16 columns and 4239 rows, which capture key information about the target patients such as gender, blood glucose level, lifestyle (current smoker or not), and other vital readings.

Data Processing and Data Setup

The following libraries were imported to support the analysis of the heart disease dataset.


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
import seaborn as sns
from matplotlib.ticker import FuncFormatter


warnings.filterwarnings("ignore")

path = "U:/"

filename = path + 'heart_disease.csv'

df = pd.read_csv(filename)
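
Before plotting, it can help to confirm that the loaded frame matches the description above (row and column counts) and to see which columns contain missing values. The following is a minimal sketch; the helper name `summarize` and the toy frame are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd


def summarize(frame: pd.DataFrame) -> dict:
    """Return row count, column count, and missing-value counts per column."""
    n_rows, n_cols = frame.shape
    missing = frame.isna().sum()
    return {
        "rows": n_rows,
        "columns": n_cols,
        # Keep only the columns that actually have gaps
        "missing": missing[missing > 0].to_dict(),
    }


# Toy stand-in for the real file; the values are made up for illustration
toy = pd.DataFrame({
    "age": [39, 46, 48, 61],
    "glucose": [77.0, None, 70.0, 103.0],
    "cigsPerDay": [0.0, 20.0, None, 30.0],
})
summary = summarize(toy)
print(summary)
```

Calling `summarize(df)` on the real data would be expected to report the row and column counts quoted in the introduction, along with the columns that need cleaning.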

Data Visualizations

Relationship Between Age and Glucose Levels

For the first analysis, it was important to determine whether age is related to the glucose levels of the target patients. The normal range for a fasting blood glucose test is 70-100 mg/dL, and a reading can rise to approximately 125 mg/dL depending on when food was last eaten and when the test is conducted. The analysis indicates that the majority of the target patients were within the normal range; however, outliers with elevated glucose levels can be seen among patients aged 47 and older. This could indicate that elevated glucose levels become more prevalent as people age.

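# Keep only the two columns of interest; copy so later assignments do not touch df
data1 = df[['age', 'glucose']].copy()
# Drop rows with missing or zero readings
data1 = data1[data1['glucose'].notna()]
data1 = data1[(data1['glucose'] != 0) & (data1['age'] != 0)].copy()
data1['glucose'] = data1['glucose'].astype('int')
# Count the patients at each (age, glucose) pair; the count drives marker size and color
data1 = data1.groupby(['age', 'glucose'])['age'].count().reset_index(name='count')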



plt.figure(figsize=(22,14))

plt.scatter(data1['age'], data1['glucose'], marker='h', cmap='viridis', c=data1['count'], s=data1['count']*100)

plt.title('Relationship Between Age and Glucose Levels', fontsize=30)

plt.xlabel('Patient Age', fontsize=20)
plt.ylabel('Glucose Level', fontsize=20)

cbar = plt.colorbar()
cbar.set_label('Number of patients', rotation=270, fontsize=14, color='blue', labelpad=20)

my_colorbar_ticks = [*range (1, int(data1['count'].max())+1, 1)]
cbar.set_ticks(my_colorbar_ticks);


plt.show()
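
The observation that elevated readings appear mainly from age 47 onward can also be checked numerically. The sketch below compares the share of readings above 125 mg/dL on either side of that age. The function name, the default cutoffs, and the toy data are assumptions chosen for illustration.

```python
import pandas as pd


def elevated_share_by_age(frame, age_cutoff=47, glucose_limit=125):
    """Percentage of readings above glucose_limit, split at age_cutoff."""
    clean = frame.dropna(subset=["glucose"])
    older = clean["age"] >= age_cutoff           # True for the older group
    elevated = clean["glucose"] > glucose_limit  # True for elevated readings
    # The mean of a boolean column is the fraction of True values
    share = (elevated.groupby(older).mean() * 100).round(2)
    return share.rename(index={False: f"under {age_cutoff}",
                               True: f"{age_cutoff} and older"})


# Made-up readings, not taken from the real data set
toy = pd.DataFrame({
    "age": [35, 42, 45, 47, 55, 63, 66, 50],
    "glucose": [80, 95, None, 130, 90, 170, 85, 99],
})
share = elevated_share_by_age(toy)
print(share)
```

Running the same function on `df[['age', 'glucose']]` would give one percentage per age group, which makes the visual impression from the scatter plot easier to state precisely.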

Percentage of Target Patients with Heart Stroke

The first visualization showed how age could play a part in elevated glucose levels, so for the second visualization I decided to analyze the percentage of target patients who have had a heart stroke. The analysis shows that older target patients are more likely to have had a heart stroke: the highest percentage falls in the 60-and-over age bracket.


data2 = df[['age', 'Heart_ stroke']].copy()  # copy so a new column can be added safely

def age_interval(age):
    if age < 40:
        return '30-39'
    elif age < 50:
        return '40-49'
    elif age < 60:
        return '50-59'
    else:
        return '>= 60'


data2['age_interval'] = data2['age'].apply(age_interval)    

# Row-normalized cross-tabulation: percentage of "No" / "yes" within each age interval
stacked_df = pd.crosstab(data2['age_interval'], data2['Heart_ stroke'], normalize='index') * 100
stacked_df = stacked_df.reset_index()
stacked_df = stacked_df[["age_interval", "No", "yes"]]

fig = plt.figure(figsize = (18,10))
ax = fig.add_subplot (1,1,1)

stacked_df.plot(kind='bar', stacked=True, x='age_interval', ax=ax)

plt.ylabel('Percentage of Patients', fontsize=18)
plt.title('Percentage of Patient Age Group with Heart Stroke', fontsize=28)
plt.xticks(rotation=0, horizontalalignment = 'center', fontsize=18);
plt.yticks(fontsize=18);
ax.set_xlabel('Age Interval', fontsize=18)
horiz_offset = 1.0
vert_offset = 1.0

for bar in ax.patches:
  ax.text(bar.get_x() + bar.get_width() / 2,
          bar.get_height() / 2 + bar.get_y(),
          round(bar.get_height()), ha = 'center',
          color = 'w', weight = 'bold', size = 15)

ax.legend(bbox_to_anchor=(horiz_offset, vert_offset))



plt.show()
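
As an alternative to applying `age_interval` row by row, the same brackets can be produced with `pd.cut`. This is a sketch, not the method used above: the bin edges are an assumption that mirrors the function's thresholds, and the function is copied here only so the comparison is self-contained.

```python
import pandas as pd

AGE_LABELS = ["30-39", "40-49", "50-59", ">= 60"]
# Left-closed bins: [-inf, 40), [40, 50), [50, 60), [60, inf)
AGE_BINS = [-float("inf"), 40, 50, 60, float("inf")]


def age_interval(age):
    """Row-wise version, copied from the analysis for comparison."""
    if age < 40:
        return '30-39'
    elif age < 50:
        return '40-49'
    elif age < 60:
        return '50-59'
    else:
        return '>= 60'


ages = pd.Series([32, 39, 40, 49, 50, 59, 60, 70])
binned = pd.cut(ages, bins=AGE_BINS, labels=AGE_LABELS, right=False)

# Both approaches should assign every age to the same bracket
same = list(binned.astype(str)) == [age_interval(a) for a in ages]
print(same)
```

One practical difference is that `pd.cut` returns an ordered categorical, so the brackets keep their natural order in tables and plots regardless of the order in which they appear in the data.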

Cigarettes Consumed per Day by Target Patients

According to the Centers for Disease Control and Prevention (CDC), cigarette consumption is one of the lifestyle habits that can lead to heart disease, so it was important to identify how many target patients are smokers and how much they consume. From the data, approximately 50% of the target patients are non-smokers. The raw data set has 29 missing values in the “cigsPerDay” column; these were filled with 0 rather than deleted so that the complete data set could be analyzed.

Among the approximately 50% of target patients who are smokers, a notable share consumes a high number of cigarettes: the 11-20 cigarettes per day bracket accounts for 23.1% in the chart.


data3 = df[['cigsPerDay']].copy()

# Treat missing values as 0 cigarettes per day so that no patients are dropped
data3["cigsPerDay"] = data3["cigsPerDay"].fillna(0)

data3 = data3.astype({'cigsPerDay': 'int'})

def cigarettes_interval(cigsPerDay):
    if cigsPerDay == 0:
        return '0'
    elif cigsPerDay <= 10:
        return '1-10'
    elif cigsPerDay <= 20:
        return '11-20'
    else:
        return '>20'

data3['cigarettes_interval'] = data3['cigsPerDay'].apply(cigarettes_interval)
counts_for_plot6 = data3['cigarettes_interval'].value_counts()



fig = plt.figure(figsize=(8,8))
ax = fig.add_subplot(1, 1, 1)

colormap = plt.get_cmap('tab20c')

all_cigcount = data3.cigsPerDay.count()

explode = (0, 0.05,0,0)

plt.pie(counts_for_plot6, explode = explode,
        labels = counts_for_plot6.index, autopct = '%1.1f%%',
        startangle = 90,
        wedgeprops = {"edgecolor" : "black",
                      'linewidth': 2,
                      'antialiased': True});
plt.legend(counts_for_plot6.index, loc="upper right")
plt.title('Cigarettes Consumed per Day by Target Patient', fontsize=16)

hole = plt.Circle((0,0), 0.3,fc='white')
fig1 =plt.gcf()
fig1.gca().add_artist(hole)

ax.text(0, 0, 'Target Patient\n' + str(all_cigcount), ha='center', va='center'  )
           
plt.show()
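
The percentages printed on the pie chart are shares of all target patients, non-smokers included. When discussing smokers only, the shares need to be recomputed on that subset. The sketch below shows both calculations on a made-up series of interval labels; the values are illustrative assumptions, not results from the data set.

```python
import pandas as pd

# Made-up interval labels, in the same format as data3['cigarettes_interval']
intervals = pd.Series(['0', '0', '0', '1-10', '11-20', '11-20', '>20', '0'])

# Share of each bracket among all patients (what the pie chart displays)
share_all = intervals.value_counts(normalize=True) * 100

# Share of each bracket among smokers only
smokers = intervals[intervals != '0']
share_smokers = smokers.value_counts(normalize=True) * 100

print(share_all)
print(share_smokers)
```

In this toy example the 11-20 bracket is 25% of all patients but 50% of smokers, which is why it matters to state which denominator a quoted percentage uses.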

Correlation Matrix of Key Variables

For the fourth visualization, I decided to examine how some of the patients’ lifestyle factors and vital measurements are correlated with each other and with heart disease. The correlation matrix of some key variables shows that systolic and diastolic blood pressure measurements are highly correlated with prevalent hypertension: as systolic or diastolic blood pressure increases, so does the risk of hypertension.


hm =  df[['age', 'prevalentHyp', 'Heart_ stroke', 'currentSmoker', 'diabetes', 'sysBP', 'diaBP', 'BMI']]

corr_matrix = hm.select_dtypes(include='number').corr()  # correlate numeric columns only; text columns are excluded


fig = plt.figure(figsize=(18, 10))
ax = fig.add_subplot(1, 1, 1)

# Show correlation coefficients on the colorbar with one decimal place
coef_fmt = FuncFormatter(lambda x, p: format(x, '.1f'))

sns.heatmap(corr_matrix, linewidths=0.2, annot=True, cmap='coolwarm',
            annot_kws={'size': 11},
            cbar_kws={'format': coef_fmt, 'orientation': 'vertical'}, ax=ax)


plt.title('Correlation Matrix of some Key variables', fontsize= 28)
plt.xlabel('Key Data Variables', fontsize= 18, labelpad=15)
plt.ylabel('Key Data Variables', fontsize= 18, labelpad=15)
plt.yticks(rotation=0, size =14);
plt.xticks(size=14);

cbar = ax.collections[0].colorbar
cbar.set_label('Correlation Coefficient', rotation = 270, fontsize=14, color='red', labelpad=20)


plt.show()
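
A column that stores "No" / "yes" as text cannot take part in a Pearson correlation until it is encoded as numbers. The sketch below maps the two labels to 0 and 1 first, so the column appears in the matrix. The toy values are made up, and the label spelling is an assumption based on how the column is used earlier in this analysis.

```python
import pandas as pd

# Made-up rows; column names follow the data set used above
toy = pd.DataFrame({
    "sysBP": [110.0, 125.0, 140.0, 160.0, 175.0, 118.0],
    "prevalentHyp": [0, 0, 1, 1, 1, 0],
    "Heart_ stroke": ["No", "No", "No", "yes", "yes", "No"],
})

encoded = toy.copy()
# Encode the text labels as 0/1 so the column is numeric
encoded["Heart_ stroke"] = encoded["Heart_ stroke"].map({"No": 0, "yes": 1})

corr = encoded.corr()
print(corr.round(2))
```

With a 0/1 column, the Pearson coefficient is equivalent to the point-biserial correlation, so the value can be read as the strength of association between the binary outcome and each numeric measurement.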

Age Distribution of Prevalent Hypertension

With an understanding of which variables are highly correlated with prevalent hypertension, it was important to identify the target patients who have high blood pressure or are taking blood pressure medication (prevalent hypertension). Analysis of the data shows that a high number of target patients in the 50-59 age group have prevalent hypertension (12.48%), out of the total percentage of target patients with hypertension (22.48%).

data4 = df[['age', 'prevalentHyp', 'Heart_ stroke']].copy()  # copy so columns can be modified safely
data4['prevalentHyp'] = data4['prevalentHyp'].replace({0: 'No', 1: 'yes'})

data4['age_interval'] = data4['age'].apply(age_interval)



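fig, ax = plt.subplots(figsize=(18,10))

# Draw on the axes created above and keep the age brackets in ascending order
sns.countplot(x='age_interval', hue='prevalentHyp', data=data4, palette='Paired',
              order=['30-39', '40-49', '50-59', '>= 60'], ax=ax)
plt.title("Age Distribution of Prevalent Hypertension", fontsize = 28)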



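# Label each bar with its share of all target patients
total = sum(bar.get_height() for bar in ax.patches)
for bar in ax.patches:
    if bar.get_height() == 0:
        continue  # nothing to label on a zero-height patch
    ax.text(bar.get_x() + .05, bar.get_height() - 35,
            str(round((bar.get_height() / total) * 100, 2)) + '%', fontsize=14,
            color='red')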
  
plt.xticks(rotation=0, horizontalalignment = 'center', fontsize=18);
plt.yticks(fontsize=14);

plt.xlabel('Age Interval of Target Patients', fontsize=18, labelpad=10)
plt.ylabel('Count of Target Patients', fontsize=18, labelpad=10)
 
    
plt.show()
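
The percentages on the bars are shares of all target patients, so they can also be obtained as a table without reading them off the plot. The following sketch uses `pd.crosstab` with `normalize='all'` on made-up rows; the toy values are assumptions for illustration only.

```python
import pandas as pd

# Made-up rows in the same shape as data4 after the age brackets are added
toy = pd.DataFrame({
    "age_interval": ['30-39', '40-49', '40-49', '50-59',
                     '50-59', '50-59', '>= 60', '>= 60'],
    "prevalentHyp": ['No', 'No', 'yes', 'yes', 'yes', 'No', 'yes', 'No'],
})

# normalize='all' divides every cell by the grand total
table = pd.crosstab(toy["age_interval"], toy["prevalentHyp"], normalize="all") * 100

print(table)
print("Overall share with hypertension:", table["yes"].sum())
```

Each cell is the percentage of all patients in that age bracket and hypertension status, and the sum of the "yes" column gives the overall share with hypertension.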

Conclusion

In conclusion, the analysis of some of the variables in the data set provided insights into the age groups of target patients who could be at risk for heart disease, based on lifestyle and vital measurement indicators.