Python Project - Health And Lifestyle Dataset

Introduction

Analysis of stress levels in people aged 18-29 and 30-39 in the Health and Lifestyle Dataset

Dataset

Health And Lifestyle Dataset

https://www.kaggle.com/datasets/sahilislam007/health-and-lifestyle-dataset?resource=download

Disclaimer: This is a Synthetic Health and Lifestyle Dataset.

Findings

This analysis covers a subset of the data. I concentrate on two age groups, 18-29 and 30-39. Throughout the report, stress levels are examined by exercise frequency, sleep hours, BMI, gender, and smoker status.

Below, you will see individual tabs for the visualizations mentioned above.
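
The two age groups used throughout the report are built with pandas.cut and right=False. The sketch below is a minimal illustration of how that call treats the boundary ages. The ages are hypothetical probe values, not records from the dataset, and the name age_to_group is introduced here only for the example.

```python
import pandas as pd

# Hypothetical ages chosen to probe the bin boundaries (not taken from the dataset)
ages = pd.DataFrame({"Age": [17, 18, 29, 30, 39, 40]})

# right=False makes each interval closed on the left and open on the right:
# [18, 30) is labelled '18-29' and [30, 40) is labelled '30-39'
ages["Age_Group"] = pd.cut(
    ages["Age"], bins=[18, 30, 40], labels=["18-29", "30-39"], right=False
)

# Ages outside 18-39 receive no label (NaN), so they drop out of any
# comparison that selects rows by age group
age_to_group = dict(zip(ages["Age"].tolist(), ages["Age_Group"].astype(object)))
print(age_to_group)
```

With these settings, age 29 lands in the younger group and age 30 in the older one, while 17 and 40 are left unlabelled.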

Average Stress Level by Exercise Frequency and Gender

Across both age groups (18–29 and 30–39), average stress levels remain relatively stable, indicating no substantial age-driven variation in stress within this range. The observed differences are marginal, suggesting a weak relationship between age (within these age groups) and stress levels.

Within the 18–29 group, males who exercised daily exhibited the highest average stress level (6.10), which may indicate that increased exercise frequency does not necessarily correspond to lower stress in this subgroup. Conversely, males reporting no exercise had comparatively lower stress levels, suggesting a lack of a clear inverse relationship between exercise frequency and stress.

In the 30–39 group, the pattern differs: males who did not exercise reported the highest average stress level (6.05). Individuals identifying as “Other” who exercised daily showed the lowest stress levels, indicating potential variability in how exercise relates to stress across gender categories.

Overall, the results do not show a consistent or strong correlation between exercise frequency and stress levels. Instead, stress appears relatively uniform across categories, with only slight fluctuations by gender and activity level. This suggests that factors beyond exercise frequency, such as lifestyle, health conditions, or external stressors, may play a more significant role in influencing stress.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

path = "U:/DS736 Files/"
filename = "synthetic_health_lifestyle_dataset.csv"
df = pd.read_csv(path + filename)

bins = [18, 30, 40]
labels = ['18-29', '30-39']

df['Age_Group'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)

# Define the y-axis min and max once for consistency across charts
y_min = df['Stress_Level'].min()
y_max = df['Stress_Level'].max()

for group in labels:

    subset = df[df['Age_Group'] == group]

    grouped = subset.groupby(['Exercise_Freq', 'Gender'])['Stress_Level'].mean().unstack()

    ax = grouped.plot(kind='bar', figsize=(12,8))

    plt.title(f"Average Stress Level by Exercise Frequency and Gender ({group})")
    plt.xlabel("Exercise Frequency")
    plt.ylabel("Average Stress Level")

    plt.xticks(rotation=0)

    plt.ylim(y_min, y_max)
    
    # Add legend outside the plot area
    plt.legend(title="Gender", bbox_to_anchor=(1.05, 1), loc='upper left')

    plt.grid(axis='y', alpha=0.3)

    for container in ax.containers:
        ax.bar_label(container, fmt='%.2f', padding=3)

    plt.tight_layout()
    plt.show()
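
The observation that stress is relatively uniform across exercise and gender categories can also be checked numerically, by comparing the range of the group means with the overall spread of the variable. The sketch below assumes the column names used above; the records and the category labels "Daily" and "Never" are toy values for illustration only, not figures from the dataset.

```python
import pandas as pd

# Toy records for illustration only; values and category labels are hypothetical
toy = pd.DataFrame({
    "Exercise_Freq": ["Daily", "Daily", "Never", "Never", "Daily", "Never"],
    "Gender": ["Male", "Female", "Male", "Female", "Male", "Female"],
    "Stress_Level": [6, 5, 6, 4, 8, 6],
})

# Same aggregation as the bar chart: mean stress per exercise level and gender
means = toy.groupby(["Exercise_Freq", "Gender"])["Stress_Level"].mean().unstack()

# Range of the group means: a small value relative to the overall standard
# deviation points to little separation between categories
spread = means.max().max() - means.min().min()
overall_std = toy["Stress_Level"].std()

print(means)
print(f"range of group means: {spread:.2f}, overall std: {overall_std:.2f}")
```

On the real data, a range of group means that is small compared with the overall standard deviation would be consistent with the weak relationship described above.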

Sleep vs Stress Level

Across both age groups (18–29 and 30–39), the highest density of observations is concentrated within the 6–8 hour sleep range. Most records, and therefore most of the observed spread in stress levels, fall within this interval, but no strong directional relationship between sleep duration and stress is evident.

# Assign age groups; ages outside 18-39 get no group (NaN) and are left out of the plots
bins = [18, 30, 40]
labels = ['18-29', '30-39']
df['Age_Group'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)


df['Sleep_Hours_Round'] = df['Sleep_Hours'].round(0)


plot_df = (
    df.groupby(['Age_Group', 'Sleep_Hours_Round', 'Stress_Level'])
      .size()
      .reset_index(name='count')
)


plot_df['count_size'] = plot_df['count'] * 40


vmin = plot_df['count'].min()
vmax = plot_df['count'].max()


for group in labels:
    subset = plot_df[plot_df['Age_Group'] == group]

    plt.figure(figsize=(18, 10))

    sc = plt.scatter(
        subset['Sleep_Hours_Round'],
        subset['Stress_Level'],
        marker='8',
        c=subset['count'],
        s=subset['count_size'],
        cmap='viridis',
        edgecolors='black',
        vmin=vmin,
        vmax=vmax
    )

    plt.title(f"Sleep vs Stress Level ({group})", fontsize=18)
    plt.xlabel("Sleep Hours", fontsize=14)
    plt.ylabel("Stress Level", fontsize=14)

    # Axis limits and tick formatting

    plt.xlim(1, 12)
    plt.xticks(range(1, 13), fontsize=12, color='black')

 
    plt.yticks(range(1, 11), fontsize=12, color='black')

    # Add a colorbar showing the number of records behind each point
    cbar = plt.colorbar(sc)
    cbar.set_label('Number of Records', rotation=270, labelpad=20, fontsize=14, color='black')

    plt.grid(alpha=0.3)
    plt.show()
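
A correlation coefficient per age group is one way to put a number on the "no strong directional relationship" reading of the bubble charts. The sketch below is a minimal example using the Pearson correlation from Series.corr. The toy values are constructed to show the two extremes (a perfect negative relationship and none at all) and do not come from the dataset.

```python
import pandas as pd

# Toy data for illustration only: the first group is built to be perfectly
# negatively correlated, the second to have zero linear correlation
toy = pd.DataFrame({
    "Age_Group": ["18-29"] * 4 + ["30-39"] * 4,
    "Sleep_Hours": [5.0, 6.0, 7.0, 8.0, 5.0, 6.0, 7.0, 8.0],
    "Stress_Level": [8, 6, 4, 2, 5, 7, 7, 5],
})

# Pearson correlation between sleep and stress, computed separately per group
corr_by_group = {
    group: part["Sleep_Hours"].corr(part["Stress_Level"])
    for group, part in toy.groupby("Age_Group")
}

for group, r in corr_by_group.items():
    print(f"{group}: r = {r:.2f}")
```

Values of r close to 0 on the real data would support the interpretation above; values near -1 or 1 would contradict it.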

Stress Level by BMI

Across both age groups (18–29 and 30–39), there is no strong or consistent relationship between BMI and stress levels. Stress appears to be broadly distributed across BMI categories, indicating a weak or negligible correlation between these variables.

A higher concentration of observations is evident within the BMI range of approximately 18–34, suggesting that most individuals in the dataset fall within this interval. However, the distribution of stress levels within this range is relatively uniform, with no clear trend indicating that higher BMI values correspond to increased stress.

There is no meaningful clustering of elevated stress levels at the upper end of the BMI spectrum. This further supports the conclusion that BMI is not a strong predictor of stress within these age groups.

Given that this is a synthetic dataset constructed to simulate realistic population health trends, the clustering of observations within common BMI ranges is expected. Therefore, the lack of a strong relationship between BMI and stress may reflect the underlying data generation process rather than a definitive real-world conclusion.

# Filter ages 18-39; copy so the column assignments below act on an independent DataFrame
df = df[(df['Age'] >= 18) & (df['Age'] < 40)].copy()

# Create age groups
bins = [18, 30, 40]
labels = ['18-29', '30-39']
df['Age_Group'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)

# Create BMI bins (2-point intervals)
df['BMI_Bin'] = pd.cut(df['BMI'], bins=range(10, 50, 2))

# Fixed scale to match bubble chart
vmax = 60

# Loop through age groups
for group in labels:
    subset = df[df['Age_Group'] == group]
    heatmap_data = subset.pivot_table(
        index='Stress_Level',
        columns='BMI_Bin',
        aggfunc='size',
        fill_value=0
    )
    # Ensure stress levels always show 1-10
    heatmap_data = heatmap_data.reindex(index=range(1, 11), fill_value=0)
    plt.figure(figsize=(12, 6))
    plt.imshow(
        heatmap_data,
        aspect='auto',
        cmap='viridis',
        vmin=0,
        vmax=vmax,
        origin='lower'
    )
    plt.title(f"Stress Level by BMI (Age {group})")
    plt.xlabel("BMI (binned)")
    plt.ylabel("Stress Level")
    # X-axis labels
    plt.xticks(
        ticks=np.arange(len(heatmap_data.columns)),
        labels=[str(col) for col in heatmap_data.columns],
        rotation=45)
    # Y-axis labels
    plt.yticks(
        ticks=np.arange(len(heatmap_data.index)),
        labels=heatmap_data.index)
    # Add white count labels inside non-empty cells
    for i in range(heatmap_data.shape[0]):
        for j in range(heatmap_data.shape[1]):
            value = heatmap_data.iloc[i, j]
            if value > 0:
                plt.text(
                    j, i, str(value),
                    ha='center',
                    va='center',
                    color='white',
                    fontsize=9)
    
    # Colorbar
    cbar = plt.colorbar()
    cbar.set_label("Count of Individuals")
    cbar.set_ticks(range(0, 61, 10))
    plt.tight_layout()
    plt.show()
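
Two of the statements about the heatmaps can be checked directly: the share of records that fall in the BMI range of roughly 18–34, and whether average stress differs inside and outside that range. The sketch below assumes the BMI and Stress_Level columns used above; the rows are toy values for illustration and the variable names are introduced only for this example.

```python
import pandas as pd

# Toy records for illustration only; not taken from the dataset
toy = pd.DataFrame({
    "BMI": [19.5, 22.0, 24.5, 27.0, 31.0, 33.5, 36.0, 41.0],
    "Stress_Level": [5, 6, 5, 6, 5, 6, 5, 6],
})

# Boolean mask for the BMI range discussed in the text (both ends inclusive)
in_range = toy["BMI"].between(18, 34)

# Fraction of records inside the range
share_in_range = in_range.mean()

# Average stress inside and outside the range; similar values suggest that
# BMI carries little information about stress
mean_in = toy.loc[in_range, "Stress_Level"].mean()
mean_out = toy.loc[~in_range, "Stress_Level"].mean()

print(f"share in 18-34: {share_in_range:.0%}")
print(f"mean stress inside: {mean_in:.2f}, outside: {mean_out:.2f}")
```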

Average Stress Level by Age and Gender

There is no clear trend indicating that stress levels increase or decrease with age between 18 and 39. Average stress remains relatively stable, with values consistently clustered between 5 and 7 across all ages. This consistency suggests a weak relationship between age and stress within this range. A slight deviation is observed at age 36 among individuals identifying as “Other,” where the average stress level drops to approximately 4; however, this appears to be an isolated variation rather than part of a broader trend.

# Filter ages 18-39
df = df[(df['Age'] >= 18) & (df['Age'] < 40)]

# Group by Gender and Age, then calculate average stress level
stress_df = (
    df.groupby(['Gender', 'Age'])['Stress_Level']
      .mean()
      .reset_index()
)

# Create figure and axis
fig = plt.figure(figsize=(18,10))
ax = fig.add_subplot(1,1,1)

# Optional: choose colors for genders
my_colors = {
    'Male': 'blue',
    'Female': 'red',
    'Other': 'green'
}

# Plot one line per gender
for key, grp in stress_df.groupby('Gender'):
    grp = grp.sort_values('Age')
    grp.plot(
        ax=ax,
        kind='line',
        x='Age',
        y='Stress_Level',
        color=my_colors.get(key, 'black'),
        label=key,
        marker='8'
    )

# Titles and labels
plt.title('Average Stress Level by Age and Gender', fontsize=18)
ax.set_xlabel('Age', fontsize=18)
ax.set_ylabel('Average Stress Level', fontsize=18, labelpad=20)

# Axis formatting
ax.tick_params(axis='x', labelsize=14, rotation=0)
ax.tick_params(axis='y', labelsize=14, rotation=0)

# Set x-axis from 18 to 39
ax.set_xticks(np.arange(18, 40, 1))

# Set y-axis from 1 to 10
ax.set_yticks(np.arange(1, 11, 1))
ax.set_ylim(1, 10)

# Legend
plt.legend(loc='best', fontsize=14, ncol=1)


plt.show()
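
An isolated dip such as the one at age 36 can come from a mean that rests on very few records. One way to check this is to report the group size next to each mean. The sketch below is a minimal example with toy data. The threshold of 3 records and the column name small_sample are arbitrary choices for illustration, and nothing here shows what the real group sizes are.

```python
import pandas as pd

# Toy records for illustration only; not taken from the dataset
toy = pd.DataFrame({
    "Gender": ["Male"] * 4 + ["Other"] * 2,
    "Age": [36, 36, 36, 36, 36, 36],
    "Stress_Level": [5, 6, 7, 6, 4, 4],
})

# Mean stress and number of records for every gender and age combination
summary = (
    toy.groupby(["Gender", "Age"])["Stress_Level"]
       .agg(["mean", "count"])
       .reset_index()
)

# Flag means that rest on only a few records (threshold chosen arbitrarily)
summary["small_sample"] = summary["count"] < 3

print(summary)
```

If the flagged rows on the real data coincide with the outlying points in the line chart, the deviation is more likely noise than a trend.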

Stress Level by Smoker Status

Both age groups (18–29 and 30–39) exhibit nearly identical distributions in smoking status, with approximately 70% of individuals reporting as non-smokers in each group. Additionally, the average stress level is identical across both groups at 5.58. The consistency in both smoking behavior and stress levels across age groups suggests minimal variation between these two groups. This lack of differentiation indicates a weak relationship between age, smoking status, and stress within the dataset. The uniformity across variables likely reflects the synthetic design of the dataset rather than real-world variability.

# Create age groups
bins = [18, 30, 40]
labels = ['18-29', '30-39']
df['Age_Group'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)

# Loop through each age group
for group in labels:

    subset = df[df['Age_Group'] == group]
    # Smoker counts and percentages
    smoker_counts = subset['Smoker'].value_counts()
    smoker_pct = smoker_counts / smoker_counts.sum() * 100

    # Labels with percentages
    smoker_labels = [
        f"{label} ({pct:.1f}%)"
        for label, pct in zip(smoker_counts.index, smoker_pct)
    ]

    # Average stress level for center
    avg_stress = subset['Stress_Level'].mean()

  
    fig = plt.figure(figsize=(5, 5))
    ax = fig.add_subplot(1, 1, 1)

    # Donut chart
    ax.pie(
        smoker_counts,
        labels=smoker_labels,
        autopct='%1.1f%%',
        startangle=90,
        wedgeprops=dict(width=0.35, edgecolor='white'),
        textprops={'fontsize': 11}
    )

    # Donut hole
    hole = plt.Circle((0, 0), 0.45, fc='white')
    ax.add_artist(hole)

    # Center text
    ax.text(
        0, 0,
        "Age " + str(group) + "\nAvg Stress\n" + str(round(avg_stress, 2)),
        ha='center',
        va='center',
        fontsize=12
    )

    # Title
    plt.title("Smoker Status by Age Group (" + str(group) + ")", fontsize=14)

    ax.axis('equal')
    plt.tight_layout()
    plt.show()
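
The donut charts show the smoker share and one overall stress average per age group. To compare stress between smokers and non-smokers directly, the average can be split by smoker status as well. The sketch below is a minimal example with toy data. The "Yes" and "No" labels for Smoker are an assumption about how the column is coded, and the values are not from the dataset.

```python
import pandas as pd

# Toy records for illustration only; Smoker labels "Yes"/"No" are assumed
toy = pd.DataFrame({
    "Age_Group": ["18-29", "18-29", "18-29", "30-39", "30-39", "30-39"],
    "Smoker": ["No", "No", "Yes", "No", "Yes", "No"],
    "Stress_Level": [5, 7, 6, 4, 6, 8],
})

# Mean stress for each age group and smoker status
stress_by_smoker = toy.pivot_table(
    index="Age_Group", columns="Smoker", values="Stress_Level", aggfunc="mean"
)

# Share of smokers and non-smokers within each age group (each row sums to 1)
smoker_share = pd.crosstab(toy["Age_Group"], toy["Smoker"], normalize="index")

print(stress_by_smoker)
print(smoker_share)
```

Near-identical values across the rows and columns of stress_by_smoker on the real data would match the uniformity described above.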

Conclusion

Given that this is a synthetic dataset designed to reflect realistic population patterns, the relatively uniform distributions and lack of strong relationships between variables are expected. The clustering of values within moderate ranges likely reflects how the data was generated rather than indicating definitive real-world relationships.