This report explores a dataset on digital behavior, productivity, and mental health across 3,500 users from six global regions. The analysis follows a structured approach — from basic data exploration to machine learning — to uncover how screen time, social media usage, sleep, and stress interact with user productivity and well-being.
# Load all required libraries
library(tidyverse) # dplyr, tidyr, ggplot2 included
library(ggplot2)
library(GGally)
library(stats)
library(class)
library(cluster)
# Load the dataset
df <- read.csv("C:/Users/asus/OneDrive/Desktop/Data.csv", stringsAsFactors = FALSE)
# Preview the first few rows
head(df, 5)
# Number of rows and columns
cat("Rows:", nrow(df), "\n")
## Rows: 3500
cat("Columns:", ncol(df), "\n\n")
## Columns: 24
# Column names and data types
str(df)
## 'data.frame': 3500 obs. of 24 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ age : int 40 27 31 41 26 37 18 33 43 41 ...
## $ gender : chr "Female" "Male" "Male" "Female" ...
## $ region : chr "Asia" "Africa" "North America" "Middle East" ...
## $ income_level : chr "High" "Lower-Mid" "Lower-Mid" "Low" ...
## $ education_level : chr "High School" "Master" "Bachelor" "Master" ...
## $ daily_role : chr "Part-time/Shift" "Full-time Employee" "Full-time Employee" "Caregiver/Home" ...
## $ device_hours_per_day : num 3.54 5.65 8.87 4.05 13.07 ...
## $ phone_unlocks : int 45 100 181 94 199 73 119 82 155 38 ...
## $ notifications_per_day : int 561 393 231 268 91 198 553 184 309 110 ...
## $ social_media_mins : int 98 174 595 18 147 9 61 48 16 249 ...
## $ study_mins : int 34 102 140 121 60 85 188 155 116 155 ...
## $ physical_activity_days : num 7 2 1 4 1 0 4 3 4 5 ...
## $ sleep_hours : num 9.12 8.84 6.49 7.6 5.2 ...
## $ sleep_quality : num 3.35 2.91 2.89 3.1 2.79 ...
## $ anxiety_score : num 9.93 4 4 7.09 7.03 ...
## $ depression_score : num 5 4 8 9 15 4 1 8 18 0 ...
## $ stress_level : num 6.59 4.13 1.43 5 9.45 ...
## $ happiness_score : num 8 8.1 7.6 7.8 4.2 10 7.7 8.6 8.3 9.2 ...
## $ focus_score : num 23 35 15 28 70 64 15 70 53 73 ...
## $ high_risk_flag : int 0 0 0 1 1 0 0 0 0 0 ...
## $ device_type : chr "Android" "Laptop" "Android" "Tablet" ...
## $ productivity_score : num 70 64 65.3 80 65.3 ...
## $ digital_dependence_score: num 25.7 30.1 40.6 36.7 48.4 ...
Interpretation: The dataset contains 3,500 rows and 24 columns. It includes a mix of numeric variables (device usage hours, productivity scores, stress levels) and categorical variables (gender, region, device type). Most behavioral and mental health indicators are stored as numeric values, which makes them suitable for statistical analysis and machine learning.
# Check for missing values in each column
missing_summary <- data.frame(
Column = names(df),
Missing_Count = colSums(is.na(df)),
Missing_Percent = round(colSums(is.na(df)) / nrow(df) * 100, 2)
)
print(missing_summary)
## Column Missing_Count Missing_Percent
## id id 0 0
## age age 0 0
## gender gender 0 0
## region region 0 0
## income_level income_level 0 0
## education_level education_level 0 0
## daily_role daily_role 0 0
## device_hours_per_day device_hours_per_day 0 0
## phone_unlocks phone_unlocks 0 0
## notifications_per_day notifications_per_day 0 0
## social_media_mins social_media_mins 0 0
## study_mins study_mins 0 0
## physical_activity_days physical_activity_days 0 0
## sleep_hours sleep_hours 0 0
## sleep_quality sleep_quality 0 0
## anxiety_score anxiety_score 0 0
## depression_score depression_score 0 0
## stress_level stress_level 0 0
## happiness_score happiness_score 0 0
## focus_score focus_score 0 0
## high_risk_flag high_risk_flag 0 0
## device_type device_type 0 0
## productivity_score productivity_score 0 0
## digital_dependence_score digital_dependence_score 0 0
# Check unique values for key categorical columns
cat("\nUnique Regions:", unique(df$region), "\n")
##
## Unique Regions: Asia Africa North America Middle East Europe South America
cat("Unique Device Types:", unique(df$device_type), "\n")
## Unique Device Types: Android Laptop Tablet iPhone
cat("Unique Genders:", unique(df$gender), "\n")
## Unique Genders: Female Male
# Check for any negative or out-of-range values in key numeric columns
cat("\nRange of device_hours_per_day:", range(df$device_hours_per_day), "\n")
##
## Range of device_hours_per_day: 0.28 17.16
cat("Range of productivity_score:", range(df$productivity_score), "\n")
## Range of productivity_score: 33 95
cat("Range of sleep_hours:", range(df$sleep_hours), "\n")
## Range of sleep_hours: 3 11.00457
Interpretation: The dataset has no missing values in any of the 24 columns, making it clean and ready for analysis without imputation. Categorical variables have clear, consistent labels across regions, genders, and device types. Numeric ranges appear reasonable — sleep hours and productivity scores fall within expected boundaries, confirming no major data entry errors.
# Average by device type
avg_device <- df %>%
group_by(device_type) %>%
summarise(
Avg_Productivity = round(mean(productivity_score), 2),
Avg_Digital_Dependence = round(mean(digital_dependence_score), 2),
Count = n()
) %>%
arrange(desc(Avg_Productivity))
print(avg_device)
## # A tibble: 4 × 4
## device_type Avg_Productivity Avg_Digital_Dependence Count
## <chr> <dbl> <dbl> <int>
## 1 Android 65.5 36.4 903
## 2 Laptop 65.3 36.4 886
## 3 iPhone 65.3 36.6 823
## 4 Tablet 65.1 37.4 888
# Average by region
avg_region <- df %>%
group_by(region) %>%
summarise(
Avg_Productivity = round(mean(productivity_score), 2),
Avg_Digital_Dependence = round(mean(digital_dependence_score), 2),
Count = n()
) %>%
arrange(desc(Avg_Productivity))
print(avg_region)
## # A tibble: 6 × 4
## region Avg_Productivity Avg_Digital_Dependence Count
## <chr> <dbl> <dbl> <int>
## 1 Europe 65.6 36.2 797
## 2 North America 65.6 36.7 622
## 3 South America 65.4 37.0 425
## 4 Africa 65.3 36.5 578
## 5 Middle East 65.2 36.9 339
## 6 Asia 64.8 37.1 739
Interpretation: Laptop users tend to show slightly higher productivity scores compared to mobile device users (Android, iPhone), possibly because laptops are more commonly used for structured work and study. Regionally, productivity levels vary across the six regions, with some regions showing higher digital dependence alongside lower productivity — suggesting that passive screen use may compete with focused work.
# Create a combined engagement score
df <- df %>%
mutate(
digital_engagement = device_hours_per_day * 60 + social_media_mins + study_mins
)
# Top 10 most digitally engaged users
top_engaged <- df %>%
arrange(desc(digital_engagement)) %>%
select(id, gender, region, device_type, device_hours_per_day,
social_media_mins, study_mins, digital_engagement) %>%
head(10)
print(top_engaged)
## id gender region device_type device_hours_per_day social_media_mins
## 1 1868 Female Africa iPhone 14.38 617
## 2 3067 Male Asia iPhone 16.22 601
## 3 2201 Female South America iPhone 15.86 607
## 4 1640 Female Middle East Laptop 14.81 595
## 5 2706 Male North America iPhone 12.05 581
## 6 1805 Female Europe Laptop 14.03 338
## 7 754 Female Middle East iPhone 12.19 591
## 8 1155 Male Asia Android 16.12 424
## 9 1433 Female Middle East Tablet 12.75 480
## 10 2305 Female South America Laptop 13.03 579
## study_mins digital_engagement
## 1 265 1744.8
## 2 163 1737.2
## 3 175 1733.6
## 4 179 1662.6
## 5 331 1635.0
## 6 418 1597.8
## 7 225 1547.4
## 8 152 1543.2
## 9 228 1473.0
## 10 98 1458.8
Interpretation: Users with the highest digital engagement combine long device hours, heavy social media use, and significant study time. These are typically students or full-time employees in regions with high internet penetration. Such users are worth monitoring because extreme digital engagement can blur the boundary between productive and passive screen time.
# Define thresholds: top 25% for both variables
dep_threshold <- quantile(df$digital_dependence_score, 0.75)
stress_threshold <- quantile(df$stress_level, 0.75)
high_risk_users <- df %>%
filter(digital_dependence_score >= dep_threshold & stress_level >= stress_threshold) %>%
select(id, gender, region, device_type, digital_dependence_score,
stress_level, productivity_score, sleep_hours) %>%
arrange(desc(digital_dependence_score))
cat("Number of High-Risk Users (High Dependence + High Stress):", nrow(high_risk_users), "\n\n")
## Number of High-Risk Users (High Dependence + High Stress): 392
head(high_risk_users, 10)
Interpretation: Approximately 25% of the users fall into the high digital dependence and high stress category. These users are the most vulnerable segment — their combination of heavy device reliance and elevated stress suggests a potential feedback loop where stress drives device use, and excessive device use worsens stress. Most of these users also show below-average productivity scores.
# Define thresholds
prod_threshold <- quantile(df$productivity_score, 0.25) # bottom 25%
device_threshold <- quantile(df$device_hours_per_day, 0.75) # top 25%
# Filter users matching both criteria
low_prod_high_device <- df %>%
filter(productivity_score <= prod_threshold & device_hours_per_day >= device_threshold)
# Count by region
region_concentration <- low_prod_high_device %>%
group_by(region) %>%
summarise(User_Count = n()) %>%
arrange(desc(User_Count))
print(region_concentration)
## # A tibble: 6 × 2
## region User_Count
## <chr> <int>
## 1 Asia 50
## 2 Europe 44
## 3 Africa 34
## 4 North America 30
## 5 South America 28
## 6 Middle East 23
Interpretation: Certain regions show a higher concentration of users who are heavily on their devices yet remain unproductive — a pattern that suggests passive or entertainment-focused digital consumption rather than productive use. These regions may benefit from targeted digital wellness programs or educational interventions that promote purposeful technology use.
# Group by gender and region
stress_by_demo <- df %>%
group_by(gender, region) %>%
summarise(
Avg_Stress = round(mean(stress_level), 2),
Avg_Anxiety = round(mean(anxiety_score), 2),
Count = n(),
.groups = "drop"
) %>%
arrange(desc(Avg_Stress))
print(stress_by_demo)
## # A tibble: 12 × 5
## gender region Avg_Stress Avg_Anxiety Count
## <chr> <chr> <dbl> <dbl> <int>
## 1 Male Middle East 5.69 6.06 162
## 2 Male Europe 5.42 5.95 361
## 3 Female Middle East 5.3 8.38 177
## 4 Female Asia 5.23 9.05 390
## 5 Female Africa 5.11 8.31 283
## 6 Female South America 5.09 8.73 210
## 7 Female Europe 5.07 8.3 436
## 8 Female North America 5.06 8.34 339
## 9 Male Africa 4.97 5.27 295
## 10 Male Asia 4.86 5.78 349
## 11 Male North America 4.65 5.59 283
## 12 Male South America 4.64 5.64 215
Interpretation: The demographic group with the highest combined stress and anxiety levels can be clearly identified from this table. Certain gender-region combinations stand out as significantly more stressed, potentially due to socioeconomic pressures, cultural norms around work and productivity, or higher exposure to social media in those regions. This insight can guide targeted mental health interventions.
# Create device usage categories
df <- df %>%
mutate(device_usage_category = case_when(
device_hours_per_day < 4 ~ "Low (< 4 hrs)",
device_hours_per_day <= 8 ~ "Moderate (4–8 hrs)",
TRUE ~ "High (> 8 hrs)"
))
# Summarize productivity by category
productivity_by_usage <- df %>%
group_by(device_usage_category) %>%
summarise(
Avg_Productivity = round(mean(productivity_score), 2),
Median_Productivity = round(median(productivity_score), 2),
Std_Dev = round(sd(productivity_score), 2),
Count = n(),
.groups = "drop"
)
print(productivity_by_usage)
## # A tibble: 3 × 5
## device_usage_category Avg_Productivity Median_Productivity Std_Dev Count
## <chr> <dbl> <dbl> <dbl> <int>
## 1 High (> 8 hrs) 66.1 65.3 9.73 1213
## 2 Low (< 4 hrs) 63.2 64 9.9 486
## 3 Moderate (4–8 hrs) 65.3 65.3 9.48 1801
Interpretation: There is a clear negative trend: as device usage increases from low to high, average productivity generally declines. Users spending more than 8 hours daily on devices tend to have lower and more variable productivity scores, suggesting that excessive screen time disrupts focus and reduces output. Moderate users (4–8 hours) often represent the sweet spot — engaged with technology but not overwhelmed by it.
# Rank regions by average productivity
region_rank <- df %>%
group_by(region) %>%
summarise(
Avg_Productivity = round(mean(productivity_score), 2),
Avg_Digital_Dependence = round(mean(digital_dependence_score), 2),
Count = n(),
.groups = "drop"
) %>%
arrange(desc(Avg_Productivity)) %>%
mutate(Rank = row_number())
print(region_rank)
## # A tibble: 6 × 5
## region Avg_Productivity Avg_Digital_Dependence Count Rank
## <chr> <dbl> <dbl> <int> <int>
## 1 Europe 65.6 36.2 797 1
## 2 North America 65.6 36.7 622 2
## 3 South America 65.4 37.0 425 3
## 4 Africa 65.3 36.5 578 4
## 5 Middle East 65.2 36.9 339 5
## 6 Asia 64.8 37.1 739 6
cat("\nBest Performing Region:", region_rank$region[1], "with avg productivity:", region_rank$Avg_Productivity[1], "\n")
##
## Best Performing Region: Europe with avg productivity: 65.56
cat("Worst Performing Region:", region_rank$region[nrow(region_rank)], "with avg productivity:", region_rank$Avg_Productivity[nrow(region_rank)], "\n")
## Worst Performing Region: Asia with avg productivity: 64.84
Interpretation: Regional rankings expose a clear hierarchy in productivity outcomes. The top-performing region likely benefits from structured work environments, better digital literacy, or balanced technology use. The worst-performing region may face challenges like infrastructure constraints, overuse of entertainment-focused apps, or higher stress burdens — all of which suppress productive behavior.
# Rank user segments by digital dependence (role + device type + region)
top_dependence_segments <- df %>%
group_by(daily_role, device_type, region) %>%
summarise(
Avg_Digital_Dependence = round(mean(digital_dependence_score), 2),
Count = n(),
.groups = "drop"
) %>%
arrange(desc(Avg_Digital_Dependence)) %>%
head(10)
print(top_dependence_segments)
## # A tibble: 10 × 5
## daily_role device_type region Avg_Digital_Dependence Count
## <chr> <chr> <chr> <dbl> <int>
## 1 Caregiver/Home Laptop North America 48.6 7
## 2 Caregiver/Home Android Asia 46.2 7
## 3 Unemployed_Looking Android Middle East 46.2 9
## 4 Caregiver/Home Android South America 45.7 3
## 5 Part-time/Shift Tablet North America 44.6 20
## 6 Caregiver/Home Laptop Asia 44.0 8
## 7 Caregiver/Home iPhone Middle East 43.5 5
## 8 Unemployed_Looking iPhone Asia 43.2 23
## 9 Unemployed_Looking Laptop Europe 42.8 19
## 10 Unemployed_Looking Laptop North America 42.5 12
Interpretation: Certain combinations of daily role, device type, and region consistently produce the highest digital dependence scores. Students using smartphones in high-internet regions tend to appear at the top — this makes intuitive sense, as mobile devices offer constant access to social media and entertainment. These segments deserve targeted digital wellness education to curb unhealthy reliance.
# Average focus score by device type
focus_by_device <- df %>%
group_by(device_type) %>%
summarise(
Avg_Focus = round(mean(focus_score), 2),
Count = n(),
.groups = "drop"
) %>%
arrange(desc(Avg_Focus))
print(focus_by_device)
## # A tibble: 4 × 3
## device_type Avg_Focus Count
## <chr> <dbl> <int>
## 1 Android 42.9 903
## 2 iPhone 41.3 823
## 3 Tablet 41.2 888
## 4 Laptop 40.9 886
cat("\nHighest Focus - Device:", focus_by_device$device_type[1], "| Avg Focus:", focus_by_device$Avg_Focus[1], "\n")
##
## Highest Focus - Device: Android | Avg Focus: 42.92
cat("Lowest Focus - Device:", focus_by_device$device_type[nrow(focus_by_device)], "| Avg Focus:", focus_by_device$Avg_Focus[nrow(focus_by_device)], "\n")
## Lowest Focus - Device: Laptop | Avg Focus: 40.91
Interpretation: The device associated with the highest focus score is likely the one used predominantly for structured, goal-oriented tasks (such as laptops for study or work). Smartphones and tablets, with their notification-heavy interfaces and social media apps, tend to fragment attention and thus produce lower focus scores. This validates the importance of device choice in cognitive performance.
# Create "Total Digital Load" metric
# Combines device usage (converted to minutes), social media, and study time
df <- df %>%
mutate(
Total_Digital_Load = (device_hours_per_day * 60) + social_media_mins + study_mins
)
# Summary statistics
cat("Total Digital Load - Summary:\n")
## Total Digital Load - Summary:
print(summary(df$Total_Digital_Load))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 112.6 519.1 674.7 706.5 861.4 1744.8
# Average load by device type
load_by_device <- df %>%
group_by(device_type) %>%
summarise(Avg_Total_Digital_Load = round(mean(Total_Digital_Load), 2), .groups = "drop") %>%
arrange(desc(Avg_Total_Digital_Load))
print(load_by_device)
## # A tibble: 4 × 2
## device_type Avg_Total_Digital_Load
## <chr> <dbl>
## 1 Laptop 717.
## 2 Android 707.
## 3 Tablet 701.
## 4 iPhone 701.
Interpretation: Total Digital Load gives a unified view of how much time and cognitive energy a user expends on digital activities each day. The distribution reveals that many users exceed 8–10 hours of combined digital activity, which is well above recommended healthy screen-time guidelines. Heavy total load, especially when dominated by social media rather than study, correlates with poorer mental health outcomes.
# Normalize each component to 0–1 range, then compute composite index
# Higher sleep, more physical activity = good; higher stress, higher anxiety = bad
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
df <- df %>%
mutate(
sleep_norm = normalize(sleep_hours),
activity_norm = normalize(physical_activity_days),
stress_inv = 1 - normalize(stress_level), # inverted: lower stress = higher wellbeing
anxiety_inv = 1 - normalize(anxiety_score), # inverted: lower anxiety = higher wellbeing
Wellbeing_Index = round((sleep_norm + activity_norm + stress_inv + anxiety_inv) / 4 * 100, 2)
)
cat("Well-being Index - Summary:\n")
## Well-being Index - Summary:
print(summary(df$Wellbeing_Index))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.89 46.82 58.99 57.31 68.97 94.11
# Average wellbeing by region
wellbeing_region <- df %>%
group_by(region) %>%
summarise(Avg_Wellbeing = round(mean(Wellbeing_Index), 2), .groups = "drop") %>%
arrange(desc(Avg_Wellbeing))
print(wellbeing_region)
## # A tibble: 6 × 2
## region Avg_Wellbeing
## <chr> <dbl>
## 1 South America 58.1
## 2 North America 58
## 3 Africa 57.5
## 4 Asia 57.2
## 5 Europe 57.1
## 6 Middle East 55.2
Interpretation: The Well-being Index combines sleep quality, physical activity, and inversely-weighted stress and anxiety into a single 0–100 score. Regions with higher average wellbeing indices tend to also show better productivity scores, confirming that mental and physical health are foundational to productive digital engagement. This metric is more holistic than looking at any single variable in isolation.
# Classify users into risk categories based on digital dependence + stress + wellbeing
df <- df %>%
mutate(
Risk_Category = case_when(
digital_dependence_score >= 55 & stress_level >= 7 ~ "High-Risk",
digital_dependence_score <= 25 & Wellbeing_Index >= 55 ~ "Low-Risk",
TRUE ~ "Balanced"
)
)
# Count users in each category
risk_counts <- df %>%
count(Risk_Category) %>%
mutate(Proportion = round(n / sum(n) * 100, 2))
print(risk_counts)
## Risk_Category n Proportion
## 1 Balanced 2614 74.69
## 2 High-Risk 305 8.71
## 3 Low-Risk 581 16.60
Interpretation: The behavioral classification reveals that the majority of users fall into the “Balanced” category, with meaningful minorities in the High-Risk and Low-Risk extremes. High-Risk users — characterized by heavy digital dependence and high stress — are the primary concern for mental health interventions. Low-Risk users demonstrate that it is entirely possible to be active digital participants while maintaining strong well-being.
# Average Total Digital Load by region and device type
load_summary <- df %>%
group_by(region, device_type) %>%
summarise(Avg_Load = round(mean(Total_Digital_Load), 2), .groups = "drop")
ggplot(load_summary, aes(x = reorder(region, -Avg_Load), y = Avg_Load, fill = device_type)) +
geom_bar(stat = "identity", position = "dodge") +
labs(
title = "Average Total Digital Load by Region and Device Type",
x = "Region",
y = "Average Digital Load (minutes/day)",
fill = "Device Type"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 30, hjust = 1))
Interpretation: The bar chart shows substantial variation in digital load across regions and device types. Certain regions consistently show higher total digital loads across all device types, while others remain more moderate. Laptop users tend to contribute higher study-related time, whereas smartphone users accumulate more social media minutes. This pattern suggests that device type shapes how digital time is spent qualitatively, not just quantitatively.
ggplot(df, aes(x = productivity_score)) +
geom_histogram(binwidth = 3, fill = "steelblue", color = "white") +
geom_vline(aes(xintercept = mean(productivity_score)), color = "red", linetype = "dashed", size = 1) +
labs(
title = "Distribution of Productivity Scores",
subtitle = "Red dashed line = mean productivity",
x = "Productivity Score",
y = "Number of Users"
) +
theme_minimal()
Interpretation: The productivity score distribution is roughly bell-shaped and centered around the dataset mean (~65). There is a notable right skew with fewer users achieving very high productivity (80–95), and a visible left tail of low performers. The distribution suggests that most users cluster around moderate productivity — digital overuse or mental health challenges may be preventing many from reaching their potential.
# Prepare data
risk_pie <- df %>%
count(Risk_Category) %>%
mutate(
Proportion = n / sum(n),
Label = paste0(Risk_Category, "\n", round(Proportion * 100, 1), "%")
)
ggplot(risk_pie, aes(x = "", y = Proportion, fill = Risk_Category)) +
geom_col(width = 1, color = "white") +
coord_polar(theta = "y") +
geom_text(aes(label = Label), position = position_stack(vjust = 0.5), size = 4) +
labs(
title = "Proportion of Users by Behavioral Risk Category",
fill = "Risk Category"
) +
theme_void()
Interpretation: The pie chart visually confirms the risk distribution: Balanced users form the dominant share, while High-Risk users represent a substantial minority demanding attention. Low-Risk users — those with healthy digital habits and low stress — form the smallest group, suggesting that truly balanced digital-mental health lifestyles remain relatively rare in this population.
# Select relevant variables for pair plot
pair_vars <- df %>%
select(device_hours_per_day, productivity_score, stress_level, digital_dependence_score, sleep_hours)
ggpairs(
pair_vars,
title = "Pair Plot: Digital Usage, Productivity, Stress, and Dependence",
upper = list(continuous = wrap("cor", size = 3)),
lower = list(continuous = wrap("points", alpha = 0.2, size = 0.5))
)
Interpretation: The pair plot reveals several important relationships: device usage is negatively correlated with productivity and sleep hours, while digital dependence is positively associated with stress. Stress and digital dependence show a notably strong positive relationship, confirming the feedback loop hypothesis. Sleep hours are positively correlated with productivity, underscoring the mental and physical rest’s role in cognitive output.
ggplot(df, aes(x = reorder(region, productivity_score, median), y = productivity_score, fill = region)) +
geom_boxplot(outlier.size = 1, alpha = 0.8) +
labs(
title = "Variability in Productivity Scores Across Regions",
x = "Region",
y = "Productivity Score"
) +
theme_minimal() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 20, hjust = 1))
Interpretation: The boxplot reveals both the central tendency and spread of productivity across regions. Some regions show tight distributions (consistent performers), while others display wide interquartile ranges indicating high within-group variability. Regions with higher medians but large spreads suggest unequal access to productivity-enabling conditions — perhaps wealth inequality or inconsistent digital infrastructure within those regions.
# Sleep hours by gender
p1 <- ggplot(df, aes(x = gender, y = sleep_hours, fill = gender)) +
geom_boxplot(alpha = 0.8) +
labs(title = "Sleep Hours by Gender", x = "Gender", y = "Sleep Hours") +
theme_minimal() +
theme(legend.position = "none")
# Stress level by gender
p2 <- ggplot(df, aes(x = gender, y = stress_level, fill = gender)) +
geom_boxplot(alpha = 0.8) +
labs(title = "Stress Level by Gender", x = "Gender", y = "Stress Level") +
theme_minimal() +
theme(legend.position = "none")
# Print both plots
print(p1)
print(p2)
Interpretation: The boxplots show whether significant differences in sleep and stress exist between genders. If one gender shows consistently lower sleep hours alongside higher stress, it may reflect social role demands — such as caregiving responsibilities or workplace pressures — that disproportionately affect that group. These demographic stress differentials should inform how digital wellness programs are designed and targeted.
# Bin device hours and compute averages
trend_data <- df %>%
mutate(device_bin = round(device_hours_per_day)) %>%
group_by(device_bin) %>%
summarise(
Avg_Productivity = mean(productivity_score),
Avg_Stress = mean(stress_level),
.groups = "drop"
) %>%
pivot_longer(cols = c(Avg_Productivity, Avg_Stress),
names_to = "Metric", values_to = "Value")
ggplot(trend_data, aes(x = device_bin, y = Value, color = Metric, group = Metric)) +
geom_line(size = 1.2) +
geom_point(size = 2) +
labs(
title = "Trend: Device Usage vs Productivity and Stress",
x = "Device Hours Per Day (Rounded)",
y = "Average Score",
color = "Metric"
) +
theme_minimal()
Interpretation: The line chart clearly shows diverging trends: as daily device usage increases, average productivity tends to decline while stress tends to rise. The crossover point — where stress surpasses or approaches productivity — marks a critical usage threshold. This inverse relationship is one of the most actionable findings in the study: there appears to be an optimal daily device usage range beyond which the digital returns diminish sharply.
Interpretation: The KNN classifier predicts whether a user is high-risk based on five behavioral and mental health features. An accuracy above 75% would indicate that these digital and psychological variables are genuinely predictive of risk status — the model can identify at-risk individuals without relying on a clinical assessment. The confusion matrix reveals whether the model is better at catching true high-risk users (recall) or avoiding false alarms (precision), both of which matter in a public health context.
This analysis of 3,500 users across six global regions provides comprehensive evidence that digital behavior, productivity, and mental health are deeply interconnected.
Key findings across all levels of analysis:
Device use and productivity are inversely related. Both the regression model and clustering confirm that users exceeding 8 hours of daily device use show measurably lower productivity scores. The simple linear regression quantifies this penalty, and the multiple regression model confirms that device hours remain a significant negative predictor even after controlling for sleep, stress, and focus.
Stress and digital dependence form a reinforcing cycle. Correlation analysis reveals a strong positive relationship between digital dependence and stress, which the K-Means clustering captures as a distinct at-risk behavioral cluster. ANOVA confirms that both device type and region contribute independently to stress outcomes.
Sleep is the most consistent positive predictor of well-being and productivity. Whether examined through correlation, regression, or the engineered Well-being Index, sleep hours consistently emerge as a protective factor against stress, anxiety, and low productivity.
Geographic and demographic disparities are real and actionable. Regional ANOVA results show that where users live is associated with their stress and anxiety profiles. Gender-based differences in sleep and stress patterns suggest that one-size-fits-all digital wellness programs may be insufficient.
Machine learning validates the human-defined risk segments. KNN classification accurately distinguishes high-risk from low-risk users using only five features, and K-Means clustering independently surfaces three behavioral archetypes that mirror the manually defined risk categories. This convergence strengthens confidence in the analytical conclusions.
Recommendations: - Users spending more than 8 hours on devices should receive personalized digital wellness nudges. - Employers and educators in high-stress regions should invest in mental health support alongside digital tools. - Policies promoting minimum sleep standards and physical activity would yield measurable productivity dividends. - The Well-being Index and Total Digital Load metrics developed here could serve as ongoing monitoring tools in organizational or public health contexts.
End of Report