import kagglehub
import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import matplotlib.patches as mpatches
from matplotlib.ticker import FuncFormatter
import plotly.graph_objects as go

# Credentials should not be hard-coded in the report: set KAGGLE_API_TOKEN in the
# environment before running this script so the token never appears in the source.

path = kagglehub.dataset_download("jayjoshi37/sleep-screen-time-and-stress-analysis")

Introduction

This report analyses a dataset from Kaggle, downloaded with kagglehub: https://www.kaggle.com/datasets/jayjoshi37/sleep-screen-time-and-stress-analysis. The dataset contains 15,000 records, each representing one individual. It is meant to provide insight into how stress, screen time, physical activity, age, and other factors affect a person. The goal of this project is to analyze the data and identify trends in how screen time and lifestyle factors influence sleep quality and stress levels.

Dataset

df = pd.read_csv(os.path.join(path, "sleep_mobile_stress_dataset_15000.csv"))
stress_sorted = df['stress_level'].sort_values(ascending=False).reset_index(drop=True)
df.head()
##    user_id  age  ... notifications_received_per_day mental_fatigue_score
## 0        1   56  ...                            119                 3.57
## 1        2   46  ...                            299                 1.91
## 2        3   32  ...                             21                 6.05
## 3        4   25  ...                            220                 9.92
## 4        5   38  ...                            167                 5.99
## 
## [5 rows x 13 columns]

As seen above, the dataset includes 13 variables: user_id, age, gender, occupation, daily_screen_time_hours, phone_usage_before_sleep_minutes, sleep_duration_hours, sleep_quality_score, stress_level, caffeine_intake_cups, physical_activity_minutes, notifications_received_per_day, and mental_fatigue_score.

Findings

Below are the visualizations and findings! Click through the tabs to see each.

Stress Level Distribution

def pick_colors_according_to_mean_count(data):
    colors = []
    avg = data.mean()
    for each in data:
        if each > avg * 1.01:
            colors.append('lightcoral')
        elif each < avg * 0.99:
            colors.append('green')
        else:
            colors.append('black')
    return colors
my_colors1 = pick_colors_according_to_mean_count(stress_sorted)
Above = mpatches.Patch(color='lightcoral', label='Above Average')
At = mpatches.Patch(color='black', label='Within 1% of the Average')
Below = mpatches.Patch(color='green', label='Below Average')
fig = plt.figure(figsize=(18, 10))
ax1 = fig.add_subplot(1, 1, 1)
ax1.bar(range(len(stress_sorted)), stress_sorted, color=my_colors1)
ax1.legend(handles=[Above, At, Below], fontsize=14)
ax1.axhline(stress_sorted.mean(), color='black', linestyle='dashed')
ax1.spines['right'].set_visible(False)
ax1.spines['top'].set_visible(False)
ax1.axes.xaxis.set_visible(False)
ax1.set_ylabel('Stress Level', fontsize=14)
ax1.set_title('Stress Level Distribution', size=20)
ax1.text(len(stress_sorted)-500, stress_sorted.mean()+0.2,
         'Mean = ' + str(round(stress_sorted.mean(), 2)),
         rotation=0, fontsize=14)
plt.tight_layout()
plt.show()

Analysis

The graph above shows the stress level of every individual, sorted from highest to lowest, with the mean stress level of 6.98 marked by the dashed line. Stress was rated on a scale from 1 to 10, with 1 being the lowest stress and 10 the highest. The values decline steadily from a maximum of 10 down to about 1, and a large portion of individuals experience above-average stress, as shown by the pink section taking up roughly half the chart. The high mean of 6.98 out of 10 further supports that this population leans toward higher stress, which could be tied to factors like screen time and sleep habits. An interesting note is that a good number of individuals reported the maximum stress of 10/10; it would be interesting to see how high their ratings would go if the scale extended further.
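The "roughly half the chart" reading can be checked numerically instead of by eye. The following is a minimal sketch that reuses the same 1% band around the mean as the colouring function above; the helper name `share_by_mean_band` and the toy values are assumptions for illustration and are not taken from the dataset.

```python
import pandas as pd

def share_by_mean_band(values: pd.Series, tolerance: float = 0.01) -> dict:
    """Fraction of values above, within, and below a +/- tolerance band around the mean."""
    avg = values.mean()
    above = float((values > avg * (1 + tolerance)).mean())
    below = float((values < avg * (1 - tolerance)).mean())
    return {"above": above, "within": 1.0 - above - below, "below": below}

# Toy values for illustration only; these are NOT rows from the dataset.
toy_stress = pd.Series([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])
shares = share_by_mean_band(toy_stress)
print(shares)
```

On the real data, the same call would be `share_by_mean_band(df['stress_level'])`, assuming `df` is loaded as in the Dataset section.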

User Count and Average Stress Level Analysis By Occupation

occ_stats = df.groupby('occupation').agg(
    Count=('user_id', 'count'),
    AvgStress=('stress_level', 'mean')
).sort_values('Count', ascending=False).reset_index()
def autolabel(these_bars, this_ax, place_of_decimals, symbol):
    for each_bar in these_bars:
        height = each_bar.get_height()
        this_ax.text(each_bar.get_x()+each_bar.get_width()/2, height*1.01,
                     symbol+format(height, place_of_decimals),
                     fontsize=11, color='black', ha='center', va='bottom')
fig = plt.figure(figsize=(18, 10))
ax1 = fig.add_subplot(1, 1, 1)
ax2 = ax1.twinx()
bar_width = 0.4
x_pos = np.arange(len(occ_stats))
count_bars = ax1.bar(x_pos-(0.5*bar_width), occ_stats.Count, bar_width,
                     color='gray', edgecolor='black', label='User Count')
stress_bars = ax2.bar(x_pos+(0.5*bar_width), occ_stats.AvgStress, bar_width,
                      color='green', edgecolor='black', label='Avg Stress Level')
ax1.set_xlabel('Occupation', fontsize=18)
ax1.set_ylabel('User Count', fontsize=18, labelpad=20)
ax2.set_ylabel('Avg Stress Level', fontsize=18, rotation=270, labelpad=20)
ax1.tick_params(axis='y', labelsize=14)
ax2.tick_params(axis='y', labelsize=14)
plt.title('User Count and Average Stress Level Analysis\nBy Occupation', fontsize=18)
ax1.set_xticks(x_pos)
ax1.set_xticklabels(occ_stats.occupation, fontsize=14)
count_color, count_label = ax1.get_legend_handles_labels()
stress_color, stress_label = ax2.get_legend_handles_labels()
legend = ax1.legend(count_color + stress_color, count_label + stress_label,
                    loc='upper center', frameon=True, ncol=2, borderpad=1, fontsize=14,
                    bbox_to_anchor=(0.5, -0.08))
ax1.set_ylim(0, occ_stats.Count.max()*1.50)
## (0.0, 2943.0)
autolabel(count_bars, ax1, '.0f', '')
autolabel(stress_bars, ax2, '.2f', '')
plt.tight_layout()
plt.show()

Analysis

In the graph above, Students have the highest average stress level, which makes sense, as being a young adult in college is a high-stress environment where you are still trying to figure out life. Their average stress level of 7.18 is above the overall mean of 6.98 found in the previous graph (Stress Level Distribution). Managers have the highest user count, at 1962 individuals. They also have the second-highest average stress level at 7.06, which also makes sense, as being a manager can be very stressful with everything the job entails, like being in charge of a team. On the other end, the least stressful occupation was Designer, with an average of 6.87, which is below the mean. Overall, the average stress levels across all occupations are close, ranging only from 6.87 (Designer) to 7.18 (Student).
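The gap between the extremes quoted above is 7.18 - 6.87 = 0.31 points on a 10-point scale. A minimal sketch of how that spread could be computed directly is shown here; the helper name `group_mean_spread` and the toy rows are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

def group_mean_spread(frame: pd.DataFrame, group_col: str, value_col: str):
    """Return (lowest group, highest group, gap between their means)."""
    means = frame.groupby(group_col)[value_col].mean()
    return means.idxmin(), means.idxmax(), float(means.max() - means.min())

# Toy rows for illustration only; these are NOT values from the dataset.
toy = pd.DataFrame({
    "occupation": ["A", "A", "B", "B", "C", "C"],
    "stress_level": [6, 8, 5, 7, 7, 9],
})
lowest, highest, gap = group_mean_spread(toy, "occupation", "stress_level")
print(lowest, highest, gap)
```

Applied to the report's data, this would be `group_mean_spread(df, 'occupation', 'stress_level')`.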

Total Mental Fatigue Score by Caffeine Intake

fatigue_df = df.groupby(['caffeine_intake_cups', 'occupation'])['mental_fatigue_score'].sum().reset_index(name='TotalFatigue')
fig = plt.figure(figsize = (18, 10))
ax = fig.add_subplot(1, 1, 1)
all_occupations = sorted(fatigue_df['occupation'].unique())
color_list = ['blue', 'red', 'green', 'orange', 'gray', 'purple', 'gold', 'brown', 'cyan', 'magenta', 'olive', 'teal']
my_colors = {occ: color_list[i % len(color_list)] for i, occ in enumerate(all_occupations)}
# Group by a single column name so each key is a plain string across pandas versions
for occ, grp in fatigue_df.groupby('occupation'):
    grp.plot(ax=ax, kind='line', x='caffeine_intake_cups', y='TotalFatigue', color=my_colors[occ], label=occ, marker='8')
plt.title('Total Mental Fatigue Score by Caffeine Intake', fontsize=18)
ax.set_xlabel('Caffeine Intake (Cups)', fontsize=18)
ax.set_ylabel('Total Mental Fatigue Score', fontsize=18, labelpad=20)
ax.tick_params(axis='x', labelsize=14, rotation=0)
ax.tick_params(axis='y', labelsize=14, rotation=0)
ax.set_xticks(range(0, 5))
handles, labels = ax.get_legend_handles_labels()
sorted_pairs = sorted(zip(labels, handles))
labels, handles = zip(*sorted_pairs)
plt.legend(handles, labels, loc='best', fontsize=14, ncol=1)
plt.tight_layout()
plt.show()

Analysis

In the graph above, there is no clear, consistent trend between caffeine intake and total mental fatigue score across occupations: the lines move in different directions rather than all rising or falling together. There are, however, patterns within individual occupations. For example, software engineers who have 0 or 1 cups of caffeine have very high total mental fatigue scores, while software engineers who have 2, 3, or 4 cups sit toward the lower end of the graph. Students show the opposite pattern: lower caffeine intake goes with lower total mental fatigue, and higher intake goes with higher total mental fatigue. Because the score plotted is a sum, each total also reflects how many people fall into that occupation and caffeine group, not only how fatigued they are.
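Since the chart plots summed fatigue, a group can have a high total simply because it contains more people. One way to separate the two effects is to look at the sum, the mean, and the head count side by side. This is a minimal sketch with made-up rows (the occupation "X" and its scores are hypothetical), not a result from the dataset.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT values from the dataset.
toy = pd.DataFrame({
    "occupation": ["X", "X", "X", "X"],
    "caffeine_intake_cups": [0, 0, 0, 1],
    "mental_fatigue_score": [5.0, 5.0, 5.0, 9.0],
})

# Sum, mean, and head count per (occupation, caffeine) group
summary = (
    toy.groupby(["occupation", "caffeine_intake_cups"])["mental_fatigue_score"]
    .agg(TotalFatigue="sum", MeanFatigue="mean", People="count")
    .reset_index()
)
print(summary)
```

In this toy case the 0-cup group has the larger total (three people) while the 1-cup group has the larger mean (one person), which is the kind of difference a total-only chart cannot show.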

Occupation Avg and Overall Avg Sleep Quality

wf_df = df.groupby(['occupation'])['sleep_quality_score'].mean().reset_index(name='AvgSleepQuality')
overall_avg = df['sleep_quality_score'].mean()
wf_df['Deviation'] = wf_df['AvgSleepQuality'] - overall_avg
wf_df.loc[wf_df.index.max()+1] = ['Total', wf_df['AvgSleepQuality'].mean(), wf_df['Deviation'].sum()]
occ_order = sorted(df['occupation'].unique().tolist()) + ['Total']
wf_df.occupation = pd.Categorical(wf_df.occupation, categories=occ_order, ordered=True)
wf_df.sort_values(by='occupation', inplace=True)
wf_df.reset_index(inplace=True, drop=True)
num_occs = len(wf_df) - 1
measure_list = ['absolute'] * num_occs + ['total']
fig = go.Figure( go.Bar( x=wf_df['occupation'],
                    y=wf_df['Deviation'],
                    text=['{:.2f}'.format(each) for each in wf_df['AvgSleepQuality']],
                    textposition='outside',
                    marker_color=['green' if row['occupation'] == 'Total' and row['Deviation'] >= 0 else
                                  'black' if row['occupation'] == 'Total' else
                                  'green' if row['Deviation'] >= 0 else
                                  'red' for _, row in wf_df.iterrows()],
                    hovertemplate='Deviation from Avg: ' + '%{y:.2f}' + '<br>' +
                                  'Avg Sleep Quality: %{text}'))
fig.update_xaxes(title_text='Occupation', title_font={'size': 18})
fig.update_yaxes(title_text='Deviation from Overall Avg Sleep Quality', title_font={'size':18},
                zeroline=True)
fig.update_layout(title=dict(text='Deviation between Occupation Avg and Overall Avg Sleep Quality (Waterfall Diagram)<br>' +
                                  'Above Average in Green, Below Average in Red',
                             font=dict(family='Arial', size=18, color='black')),
                  template='simple_white',
                  title_x=0.5,
                  showlegend=False,
                  autosize=True,
                  margin=dict(l=30, r=30, t=60, b=30))
fig.write_html("waterfall.html", full_html=False)

Analysis

In the graph above, green bars mark a positive deviation and red bars a negative one. Designers have the highest sleep quality at 6.33, sitting furthest above the overall average of 6.25, while Students fall lowest at 6.11 with the largest negative deviation. This ties back to the earlier finding that Students also had the highest average stress level at 7.18. Manager, Software Engineer, and Student are the three occupations that fall below average, with Students the clear outlier.
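Each bar is simply the occupation mean minus the overall mean; for Students that is 6.11 - 6.25 = -0.14. The calculation can be sketched as follows, where the helper name `deviation_from_overall` and the toy rows are assumptions for illustration only.

```python
import pandas as pd

def deviation_from_overall(frame: pd.DataFrame, group_col: str, value_col: str) -> pd.DataFrame:
    """Per-group mean and its deviation from the mean over all rows."""
    overall = frame[value_col].mean()
    out = frame.groupby(group_col)[value_col].mean().rename("GroupMean").to_frame()
    out["Deviation"] = out["GroupMean"] - overall
    return out.sort_values("Deviation")

# Toy rows for illustration only; these are NOT values from the dataset.
toy = pd.DataFrame({
    "occupation": ["A", "A", "B"],
    "sleep_quality_score": [6.0, 8.0, 4.0],
})
deviations = deviation_from_overall(toy, "occupation", "sleep_quality_score")
print(deviations)
```

Note that the overall mean here is taken over all rows, so larger groups pull it toward their own mean; it is generally not the same as the average of the group means.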

Total Screen Time by Age Group and Occupation

df['age_group'] = pd.cut(df['age'], bins=[17, 25, 35, 45, 55, 65], labels=['18-25', '26-35', '36-45', '46-55', '56-65'])
bump_df = df.groupby(['occupation', 'age_group'], observed=True)['daily_screen_time_hours'].sum().reset_index(name='TotalScreenTime')
bump_df = bump_df.pivot(index='occupation', columns='age_group', values='TotalScreenTime')
bump_df = bump_df.dropna()
bump_df_ranked = bump_df.rank(0, ascending=False, method='min')
age_labels = ['18-25', '26-35', '36-45', '46-55', '56-65']
fig = plt.figure(figsize=(18, 10))
ax = fig.add_subplot(1, 1, 1)
bump_df_ranked.T.plot(kind='line', ax=ax, marker='o', markeredgewidth=1, linewidth=6,
                    markersize=18,
                    markerfacecolor='white')
ax.invert_yaxis()
num_rows = bump_df_ranked.shape[1]
num_cols = bump_df_ranked.shape[0]
_ = plt.ylabel('Occupation Ranking', fontsize=18, labelpad=10)
_ = plt.title('Ranking of Total Screen Time by Age Group and Occupation \n Bump Chart', fontsize=18, pad=15)
_ = plt.xticks(np.arange(num_rows), age_labels, fontsize=14)
_ = plt.yticks(range(1, num_cols+1, 1), fontsize=14)
_ = ax.set_xlabel('Age Group', fontsize=18)
handles, labels = ax.get_legend_handles_labels()
handles = list(reversed(handles))
labels  = list(reversed(labels))
ax.legend(handles, labels, bbox_to_anchor=(1.01, 1.01), fontsize=14,
          labelspacing=1,
          markerscale=.4,
          borderpad=1,
          handletextpad=0.8)
# Label each marker with the group's total screen time, in thousands of hours
for i, eachcol in enumerate(bump_df_ranked.columns):
    for j, eachrow in enumerate(bump_df_ranked.index):
        this_rank = bump_df_ranked.iloc[j, i]
        ax.text(i, this_rank, str(round(bump_df.iloc[j, i]/1000, 1)) + 'K', ha='center', va='center', fontsize=12)
plt.show()

Analysis

Looking at the graph above, Managers consistently rank at or near the top in total screen time across almost every age group, holding the #1 spot from 26-35 through 36-45 and staying in the top 3 until the 46-55 group, which makes sense given the meeting-heavy, email-driven nature of management roles. Students dominate screen time in the 18-25 group at 2.2K but then plummet to the bottom ranks in older age groups, likely because fewer people identify as students past their 20s, so the sample shrinks and the total drops with it.
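The explanation that Student totals fall because the sample shrinks could be checked by putting the head count and the per-person mean next to each total. This is a minimal sketch using the same age bins as the chart; the toy rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT values from the dataset.
toy = pd.DataFrame({
    "occupation": ["Student", "Student", "Student", "Student", "Manager", "Manager"],
    "age": [19, 22, 24, 40, 30, 41],
    "daily_screen_time_hours": [6.0, 6.0, 6.0, 6.0, 7.0, 7.0],
})
toy["age_group"] = pd.cut(
    toy["age"],
    bins=[17, 25, 35, 45, 55, 65],
    labels=["18-25", "26-35", "36-45", "46-55", "56-65"],
)

# Total, head count, and per-person mean for each occupation and age group
summary = (
    toy.groupby(["occupation", "age_group"], observed=True)["daily_screen_time_hours"]
    .agg(Total="sum", People="count", MeanPerPerson="mean")
    .reset_index()
)
print(summary)
```

In the toy data the Student total drops from 18 to 6 hours between the two age groups while the per-person mean stays at 6, so the drop comes entirely from the head count.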

Conclusion

Overall, the “Sleep, Screen Time and Stress Analysis” dataset reveals that stress levels are fairly high across the board, with a mean of 6.98 out of 10, and that occupation alone does not strongly differentiate stress or sleep quality, as the ranges across all occupations were narrow. Students stood out as the most impacted group, having both the highest average stress level at 7.18 and the lowest sleep quality at 6.11 (a -0.14 deviation), which suggests a possible link between stress and poor sleep. Caffeine intake showed no clear overall trend, but there were interesting patterns within individual occupations, such as Software Engineer. Screen time rankings shifted heavily by age group, which suggests that total screen time depended more on age group than on occupation. Altogether, the dataset was great to work with.
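The suggested link between stress and poor sleep rests on group averages. A direct check would be a correlation at the level of individual records. The following is a minimal sketch using the Pearson correlation from pandas; the toy values are made up for illustration, and no correlation was computed on the real data in this report.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT values from the dataset.
toy = pd.DataFrame({
    "stress_level": [1, 2, 3, 4, 5],
    "sleep_quality_score": [9, 8, 7, 6, 5],
})

# Pearson correlation between the two columns (Series.corr defaults to Pearson)
r = toy["stress_level"].corr(toy["sleep_quality_score"])
print(round(r, 2))
```

On the real data this would be `df['stress_level'].corr(df['sleep_quality_score'])`; a value near zero would weaken the suggested link, while a clearly negative value would support it.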