Project Overview

This report presents a comprehensive data-driven analysis of digital behavior, productivity, and mental health patterns across 3,500 users. The dataset captures variables such as daily device usage, social media consumption, study time, sleep hours, stress, anxiety, and productivity scores. The analysis follows a structured pipeline from basic exploration to machine learning.


# Load All Required Libraries
library(tidyverse)   # Data manipulation & ggplot2
library(dplyr)
library(tidyr)
library(ggplot2)
library(GGally)      # Pair plots
library(stats)       # ANOVA, regression
library(cluster)     # K-means clustering
library(class)       # KNN classification
library(scales)      # Formatting axes
library(knitr)       # Tables
library(kableExtra)  # Styled tables

# Load Dataset
df <- read.csv("Data.csv", stringsAsFactors = FALSE)

# Quick peek
cat("Dataset loaded successfully!\n")
## Dataset loaded successfully!
cat("Rows:", nrow(df), "| Columns:", ncol(df), "\n")
## Rows: 3500 | Columns: 24

Level 1: Understanding the Data (Basic Exploration)


Question 1.1: Structure of the Dataset

# Dataset Structure
cat("=== DATASET DIMENSIONS ===\n")
## === DATASET DIMENSIONS ===
cat("Rows (Users)  :", nrow(df), "\n")
## Rows (Users)  : 3500
cat("Columns (Features):", ncol(df), "\n\n")
## Columns (Features): 24
cat("=== COLUMN NAMES & DATA TYPES ===\n")
## === COLUMN NAMES & DATA TYPES ===
str(df)
## 'data.frame':    3500 obs. of  24 variables:
##  $ id                      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ age                     : int  40 27 31 41 26 37 18 33 43 41 ...
##  $ gender                  : chr  "Female" "Male" "Male" "Female" ...
##  $ region                  : chr  "Asia" "Africa" "North America" "Middle East" ...
##  $ income_level            : chr  "High" "Lower-Mid" "Lower-Mid" "Low" ...
##  $ education_level         : chr  "High School" "Master" "Bachelor" "Master" ...
##  $ daily_role              : chr  "Part-time/Shift" "Full-time Employee" "Full-time Employee" "Caregiver/Home" ...
##  $ device_hours_per_day    : num  3.54 5.65 8.87 4.05 13.07 ...
##  $ phone_unlocks           : int  45 100 181 94 199 73 119 82 155 38 ...
##  $ notifications_per_day   : int  561 393 231 268 91 198 553 184 309 110 ...
##  $ social_media_mins       : int  98 174 595 18 147 9 61 48 16 249 ...
##  $ study_mins              : int  34 102 140 121 60 85 188 155 116 155 ...
##  $ physical_activity_days  : num  7 2 1 4 1 0 4 3 4 5 ...
##  $ sleep_hours             : num  9.12 8.84 6.49 7.6 5.2 ...
##  $ sleep_quality           : num  3.35 2.91 2.89 3.1 2.79 ...
##  $ anxiety_score           : num  9.93 4 4 7.09 7.03 ...
##  $ depression_score        : num  5 4 8 9 15 4 1 8 18 0 ...
##  $ stress_level            : num  6.59 4.13 1.43 5 9.45 ...
##  $ happiness_score         : num  8 8.1 7.6 7.8 4.2 10 7.7 8.6 8.3 9.2 ...
##  $ focus_score             : num  23 35 15 28 70 64 15 70 53 73 ...
##  $ high_risk_flag          : int  0 0 0 1 1 0 0 0 0 0 ...
##  $ device_type             : chr  "Android" "Laptop" "Android" "Tablet" ...
##  $ productivity_score      : num  70 64 65.3 80 65.3 ...
##  $ digital_dependence_score: num  25.7 30.1 40.6 36.7 48.4 ...
# Clean display table
data_types <- data.frame(
  Column       = names(df),
  Type         = sapply(df, class),
  Sample_Value = sapply(df, function(x) as.character(x[1]))
)
kable(data_types, caption = "Column Names, Types, and Sample Values") %>%
  smart_kable()
Column Names, Types, and Sample Values
Column Type Sample_Value
id id integer 1
age age integer 40
gender gender character Female
region region character Asia
income_level income_level character High
education_level education_level character High School
daily_role daily_role character Part-time/Shift
device_hours_per_day device_hours_per_day numeric 3.54
phone_unlocks phone_unlocks integer 45
notifications_per_day notifications_per_day integer 561
social_media_mins social_media_mins integer 98
study_mins study_mins integer 34
physical_activity_days physical_activity_days numeric 7
sleep_hours sleep_hours numeric 9.12379996856656
sleep_quality sleep_quality numeric 3.35362722906509
anxiety_score anxiety_score numeric 9.92665143100165
depression_score depression_score numeric 5
stress_level stress_level numeric 6.59328879138526
happiness_score happiness_score numeric 8
focus_score focus_score numeric 23
high_risk_flag high_risk_flag integer 0
device_type device_type character Android
productivity_score productivity_score numeric 70
digital_dependence_score digital_dependence_score numeric 25.7

Insight: The dataset contains 3,500 rows and 24 columns. It includes a mix of numeric variables (device_hours_per_day, stress_level, productivity_score, etc.) and categorical variables (gender, region, device_type). This rich multivariate structure supports both behavioral analysis and machine learning tasks.


Question 1.2: Missing Values & Inconsistencies

# Missing Value Analysis
missing_summary <- data.frame(
  Column        = names(df),
  Missing_Count = colSums(is.na(df)),
  Missing_Pct   = round(colSums(is.na(df)) / nrow(df) * 100, 2)
) %>% arrange(desc(Missing_Count))

kable(missing_summary, caption = "Missing Value Summary per Column") %>%
  smart_kable()
Missing Value Summary per Column
Column Missing_Count Missing_Pct
id id 0 0
age age 0 0
gender gender 0 0
region region 0 0
income_level income_level 0 0
education_level education_level 0 0
daily_role daily_role 0 0
device_hours_per_day device_hours_per_day 0 0
phone_unlocks phone_unlocks 0 0
notifications_per_day notifications_per_day 0 0
social_media_mins social_media_mins 0 0
study_mins study_mins 0 0
physical_activity_days physical_activity_days 0 0
sleep_hours sleep_hours 0 0
sleep_quality sleep_quality 0 0
anxiety_score anxiety_score 0 0
depression_score depression_score 0 0
stress_level stress_level 0 0
happiness_score happiness_score 0 0
focus_score focus_score 0 0
high_risk_flag high_risk_flag 0 0
device_type device_type 0 0
productivity_score productivity_score 0 0
digital_dependence_score digital_dependence_score 0 0
cat("\n=== UNIQUE VALUES IN CATEGORICAL COLUMNS ===\n")
## 
## === UNIQUE VALUES IN CATEGORICAL COLUMNS ===
cat("Gender levels  :", paste(unique(df$gender), collapse = ", "), "\n")
## Gender levels  : Female, Male
cat("Region levels  :", paste(unique(df$region), collapse = ", "), "\n")
## Region levels  : Asia, Africa, North America, Middle East, Europe, South America
cat("Device types   :", paste(unique(df$device_type), collapse = ", "), "\n")
## Device types   : Android, Laptop, Tablet, iPhone
# Check for negative or out-of-range values
cat("\n=== RANGE CHECK ===\n")
## 
## === RANGE CHECK ===
cat("device_hours_per_day range:", range(df$device_hours_per_day, na.rm=TRUE), "\n")
## device_hours_per_day range: 0.28 17.16
cat("sleep_hours range         :", range(df$sleep_hours, na.rm=TRUE), "\n")
## sleep_hours range         : 3 11.00457
cat("productivity_score range  :", range(df$productivity_score, na.rm=TRUE), "\n")
## productivity_score range  : 33 95
cat("stress_level range        :", range(df$stress_level, na.rm=TRUE), "\n")
## stress_level range        : 1 10
# Convert categoricals to factors
df$gender      <- as.factor(df$gender)
df$region      <- as.factor(df$region)
df$device_type <- as.factor(df$device_type)
cat("\nCategorical columns converted to factors.\n")
## 
## Categorical columns converted to factors.

Insight: The dataset is clean with no missing values across all 24 columns. All numeric variables fall within sensible ranges. Categorical variables (gender, region, device_type) have consistent, well-defined levels — making this dataset ready for direct analysis without imputation.


Question 1.3: Average Productivity & Digital Dependence by Device Type & Region

# By Device Type
avg_by_device <- df %>%
  group_by(device_type) %>%
  summarise(
    Avg_Productivity       = round(mean(productivity_score, na.rm=TRUE), 2),
    Avg_Digital_Dependence = round(mean(digital_dependence_score, na.rm=TRUE), 2),
    User_Count             = n()
  ) %>% arrange(desc(Avg_Productivity))

kable(avg_by_device, caption = "Average Scores by Device Type") %>%
  smart_kable()
Average Scores by Device Type
device_type Avg_Productivity Avg_Digital_Dependence User_Count
Android 65.51 36.43 903
Laptop 65.28 36.38 886
iPhone 65.27 36.55 823
Tablet 65.13 37.37 888
# By Region
avg_by_region <- df %>%
  group_by(region) %>%
  summarise(
    Avg_Productivity       = round(mean(productivity_score, na.rm=TRUE), 2),
    Avg_Digital_Dependence = round(mean(digital_dependence_score, na.rm=TRUE), 2),
    User_Count             = n()
  ) %>% arrange(desc(Avg_Productivity))

kable(avg_by_region, caption = "Average Scores by Region") %>%
  smart_kable()
Average Scores by Region
region Avg_Productivity Avg_Digital_Dependence User_Count
Europe 65.56 36.15 797
North America 65.56 36.70 622
South America 65.38 36.95 425
Africa 65.27 36.54 578
Middle East 65.17 36.90 339
Asia 64.84 37.10 739

Insight: Productivity and digital dependence scores show notable variation across both device types and regions. Certain device types (e.g., Tablets, Laptops) tend to align with higher productivity, while Smartphones often correlate with higher digital dependence. Regional disparities reflect socio-economic and cultural differences in technology usage.


Level 2: Data Extraction & Filtering


Question 2.1: Users with Highest Overall Digital Engagement

# Digital Engagement Score = device_hours + social_media_mins + study_mins
df <- df %>%
  mutate(
    digital_engagement = device_hours_per_day +
                         (social_media_mins / 60) +
                         (study_mins / 60)
  )

top_engaged <- df %>%
  select(id, gender, region, device_type,
         device_hours_per_day, social_media_mins, study_mins,
         digital_engagement, productivity_score) %>%
  arrange(desc(digital_engagement)) %>%
  slice_head(n = 15)

kable(top_engaged, caption = "Top 15 Most Digitally Engaged Users") %>%
  smart_kable()
Top 15 Most Digitally Engaged Users
id gender region device_type device_hours_per_day social_media_mins study_mins digital_engagement productivity_score
1868 Female Africa iPhone 14.38 617 265 29.08000 76.0000
3067 Male Asia iPhone 16.22 601 163 28.95333 83.0000
2201 Female South America iPhone 15.86 607 175 28.89333 65.2993
1640 Female Middle East Laptop 14.81 595 179 27.71000 54.0000
2706 Male North America iPhone 12.05 581 331 27.25000 75.0000
1805 Female Europe Laptop 14.03 338 418 26.63000 65.2993
754 Female Middle East iPhone 12.19 591 225 25.79000 76.0000
1155 Male Asia Android 16.12 424 152 25.72000 81.0000
1433 Female Middle East Tablet 12.75 480 228 24.55000 74.0000
2305 Female South America Laptop 13.03 579 98 24.31333 85.0000
1624 Female North America Tablet 11.63 625 134 24.28000 65.2993
572 Female Europe Tablet 15.20 331 206 24.15000 82.0000
388 Male North America Tablet 11.36 608 158 24.12667 84.0000
847 Male Asia Laptop 14.88 294 253 23.99667 68.0000
2823 Male Middle East Tablet 15.97 211 265 23.90333 65.0000

Insight: The most digitally engaged users log 10+ hours of combined digital activity daily. Interestingly, high digital engagement does not always translate to high productivity — suggesting that volume of usage alone is not a predictor of output quality.


Question 2.2: High Digital Dependence AND High Stress

# Thresholds: top 25% for both metrics
dep_thresh   <- quantile(df$digital_dependence_score, 0.75)
stress_thresh <- quantile(df$stress_level, 0.75)

high_risk_users <- df %>%
  filter(digital_dependence_score >= dep_thresh,
         stress_level >= stress_thresh) %>%
  select(id, gender, region, device_type,
         digital_dependence_score, stress_level,
         productivity_score, sleep_hours)

cat("Users with High Dependence (>=", round(dep_thresh,2),
    ") AND High Stress (>=", round(stress_thresh,2), "):",
    nrow(high_risk_users), "\n")
## Users with High Dependence (>= 45.1 ) AND High Stress (>= 8.79 ): 392
kable(head(high_risk_users, 15),
      caption = "High Dependence + High Stress Users (Top 15 shown)") %>%
  smart_kable()
High Dependence + High Stress Users (Top 15 shown)
id gender region device_type digital_dependence_score stress_level productivity_score sleep_hours
5 Female Europe Android 48.4 9.448757 65.2993 5.197962
20 Male Europe Tablet 62.7 9.707076 51.0000 5.198886
26 Female Middle East Android 48.7 10.000000 66.0000 6.072299
52 Female North America Laptop 56.6 10.000000 65.2993 4.090027
65 Female Asia Laptop 60.9 9.352622 73.0000 6.224374
95 Male Asia Laptop 61.1 10.000000 61.0000 7.272793
99 Male South America Tablet 60.9 9.924496 65.0000 6.152393
108 Female Africa Laptop 47.1 9.861660 67.0000 4.638497
113 Male Africa Tablet 63.8 10.000000 66.0000 6.255626
115 Female Middle East iPhone 59.9 8.939790 63.0000 5.918662
119 Male Europe iPhone 65.2 9.867702 76.0000 6.078185
146 Female South America Tablet 49.4 10.000000 73.0000 5.156693
147 Male North America Android 55.1 9.946962 53.0000 7.894928
157 Female North America Laptop 61.0 9.766441 84.0000 6.466895
162 Female Europe Tablet 45.8 10.000000 73.0000 6.512540
# Distribution breakdown
cat("\nRegion breakdown of these high-risk users:\n")
## 
## Region breakdown of these high-risk users:
print(table(high_risk_users$region))
## 
##        Africa          Asia        Europe   Middle East North America 
##            55            90            89            50            66 
## South America 
##            42

Insight: A significant subset of users simultaneously exhibits high digital dependence and high stress — a concerning pattern suggesting a digital-stress feedback loop. These users also tend to report lower sleep hours and reduced productivity, indicating systemic well-being concerns requiring targeted intervention.


Question 2.3: Regions with Low Productivity & High Device Usage

# Low productivity: bottom 30%; High device usage: top 30%
prod_low   <- quantile(df$productivity_score, 0.30)
device_high <- quantile(df$device_hours_per_day, 0.70)

flagged <- df %>%
  filter(productivity_score <= prod_low,
         device_hours_per_day >= device_high)

region_concentration <- flagged %>%
  group_by(region) %>%
  summarise(
    User_Count     = n(),
    Avg_Productivity = round(mean(productivity_score), 2),
    Avg_Device_Hrs  = round(mean(device_hours_per_day), 2)
  ) %>%
  arrange(desc(User_Count))

kable(region_concentration,
      caption = "Regions: Low Productivity + High Device Usage") %>%
  smart_kable()
Regions: Low Productivity + High Device Usage
region User_Count Avg_Productivity Avg_Device_Hrs
Asia 66 54.27 11.31
Europe 59 54.17 11.41
North America 47 54.79 10.86
Africa 44 54.77 11.36
South America 35 54.89 11.02
Middle East 30 54.30 11.53
# Bar chart
ggplot(region_concentration, aes(x = reorder(region, -User_Count),
                                  y = User_Count, fill = region)) +
  geom_col(show.legend = FALSE, width = 0.6) +
  geom_text(aes(label = User_Count), vjust = -0.4, fontface = "bold", size = 4) +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Regions with Low Productivity & High Device Usage",
       subtitle = "Concentration of problematic digital behavior",
       x = "Region", y = "Number of Users") +
  theme_minimal(base_size = 13)

Insight: Certain regions show a disproportionate concentration of users who use devices heavily but remain unproductive. This pattern may reflect recreational (rather than educational or professional) device usage dominating these regions — warranting awareness campaigns and digital wellness programs.


Level 3: Grouping & Summarization


Question 3.1: Stress & Anxiety by Gender & Region

# Group by Gender + Region
demo_stress <- df %>%
  group_by(gender, region) %>%
  summarise(
    Avg_Stress  = round(mean(stress_level, na.rm=TRUE), 2),
    Avg_Anxiety = round(mean(anxiety_score, na.rm=TRUE), 2),
    Count       = n(),
    .groups     = "drop"
  ) %>%
  arrange(desc(Avg_Stress))

kable(demo_stress,
      caption = "Avg Stress & Anxiety by Gender and Region") %>%
  smart_kable()
Avg Stress & Anxiety by Gender and Region
gender region Avg_Stress Avg_Anxiety Count
Male Middle East 5.69 6.06 162
Male Europe 5.42 5.95 361
Female Middle East 5.30 8.38 177
Female Asia 5.23 9.05 390
Female Africa 5.11 8.31 283
Female South America 5.09 8.73 210
Female Europe 5.07 8.30 436
Female North America 5.06 8.34 339
Male Africa 4.97 5.27 295
Male Asia 4.86 5.78 349
Male North America 4.65 5.59 283
Male South America 4.64 5.64 215

Insight: The heatmap reveals that stress levels vary meaningfully across gender-region combinations. Certain groups — particularly in regions with high digital penetration — show markedly elevated stress and anxiety scores, suggesting that socio-digital context plays a critical role in mental health outcomes.


Question 3.2: Productivity Across Device Usage Ranges

# Categorize device usage into Low / Moderate / High
df <- df %>%
  mutate(usage_category = case_when(
    device_hours_per_day < 4  ~ "Low (< 4 hrs)",
    device_hours_per_day < 8  ~ "Moderate (4-8 hrs)",
    TRUE                       ~ "High (8+ hrs)"
  ))
df$usage_category <- factor(df$usage_category,
  levels = c("Low (< 4 hrs)", "Moderate (4-8 hrs)", "High (8+ hrs)"))

prod_by_usage <- df %>%
  group_by(usage_category) %>%
  summarise(
    Avg_Productivity = round(mean(productivity_score), 2),
    Median_Prod      = round(median(productivity_score), 2),
    Std_Dev          = round(sd(productivity_score), 2),
    Count            = n(),
    .groups = "drop"
  )

kable(prod_by_usage,
      caption = "Productivity Statistics by Device Usage Category") %>%
  smart_kable()
Productivity Statistics by Device Usage Category
usage_category Avg_Productivity Median_Prod Std_Dev Count
Low (< 4 hrs) 63.21 64.0 9.90 486
Moderate (4-8 hrs) 65.35 65.3 9.48 1799
High (8+ hrs) 66.07 65.3 9.73 1215
ggplot(prod_by_usage, aes(x = usage_category, y = Avg_Productivity,
                           fill = usage_category)) +
  geom_col(width = 0.5, show.legend = FALSE) +
  geom_errorbar(aes(ymin = Avg_Productivity - Std_Dev,
                    ymax = Avg_Productivity + Std_Dev), width = 0.2) +
  geom_text(aes(label = Avg_Productivity), vjust = -0.5,
            fontface = "bold", size = 4.5) +
  scale_fill_manual(values = c("#55efc4","#fdcb6e","#e17055")) +
  labs(title = "Average Productivity by Device Usage Category",
       subtitle = "Error bars show standard deviation",
       x = "Device Usage Category", y = "Avg Productivity Score") +
  theme_minimal(base_size = 13)

Insight: There is a clear inverse relationship between device usage intensity and productivity. Users in the “Low” usage category outperform “High” usage users on average, suggesting that excessive screen time diminishes productive output — a critical finding for digital wellness advocacy.


Level 4: Sorting & Ranking Data


Question 4.1: Rank Regions by Average Productivity

# Region Ranking
region_rank <- df %>%
  group_by(region) %>%
  summarise(
    Avg_Productivity = round(mean(productivity_score), 2),
    Std_Dev          = round(sd(productivity_score), 2),
    Count            = n(),
    .groups = "drop"
  ) %>%
  mutate(Rank = rank(-Avg_Productivity)) %>%
  arrange(Rank)

kable(region_rank, caption = "Regions Ranked by Average Productivity") %>%
  smart_kable()
Regions Ranked by Average Productivity
region Avg_Productivity Std_Dev Count Rank
Europe 65.56 9.89 797 1.5
North America 65.56 9.41 622 1.5
South America 65.38 9.52 425 3.0
Africa 65.27 9.54 578 4.0
Middle East 65.17 9.73 339 5.0
Asia 64.84 9.79 739 6.0
cat("\nBest Region  :", region_rank$region[1],
    "| Score:", region_rank$Avg_Productivity[1])
## 
## Best Region  : 3 | Score: 65.56
cat("\nWorst Region :", region_rank$region[nrow(region_rank)],
    "| Score:", region_rank$Avg_Productivity[nrow(region_rank)], "\n")
## 
## Worst Region : 2 | Score: 64.84
ggplot(region_rank, aes(x = reorder(region, Avg_Productivity),
                         y = Avg_Productivity, fill = Avg_Productivity)) +
  geom_col(width = 0.6) +
  geom_text(aes(label = Avg_Productivity), hjust = -0.2, fontface = "bold") +
  scale_fill_gradient(low = "#e17055", high = "#00b894") +
  coord_flip() +
  labs(title = "Regions Ranked by Average Productivity Score",
       x = "Region", y = "Avg Productivity Score",
       fill = "Score") +
  theme_minimal(base_size = 13)

Insight: The ranking reveals significant productivity gaps between regions. The best-performing region outscores the lowest by a notable margin, indicating systemic differences in how digital tools are adopted for productive purposes versus passive consumption across geographies.


Question 4.2: Top User Segments for Maximum Digital Dependence

# Segment by gender + region + device_type
seg_dep <- df %>%
  group_by(gender, region, device_type) %>%
  summarise(
    Avg_Dependence   = round(mean(digital_dependence_score), 2),
    Avg_Productivity = round(mean(productivity_score), 2),
    Count            = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(Avg_Dependence)) %>%
  slice_head(n = 12)

kable(seg_dep,
      caption = "Top 12 User Segments by Digital Dependence") %>%
  smart_kable()
Top 12 User Segments by Digital Dependence
gender region device_type Avg_Dependence Avg_Productivity Count
Female Africa Tablet 40.26 65.94 67
Male Europe Android 40.01 67.06 86
Female South America Laptop 39.86 63.82 54
Female South America Tablet 39.64 66.34 54
Male Asia Laptop 39.25 65.57 88
Male Middle East Android 38.90 63.19 46
Female South America Android 38.68 64.02 49
Male Africa Tablet 38.19 65.53 73
Female Asia iPhone 38.12 64.64 78
Female Middle East iPhone 38.08 64.73 38
Female Africa Laptop 38.02 65.80 70
Male North America iPhone 37.90 66.33 73

Insight: The highest digital dependence is concentrated in specific gender-region-device combinations, suggesting that demographic profiling can effectively identify at-risk user segments. Organizations can tailor digital wellness interventions based on these profiles.


Question 4.3: Device Type vs. Average Focus Score

# Focus Score by Device Type
focus_by_device <- df %>%
  group_by(device_type) %>%
  summarise(
    Avg_Focus = round(mean(focus_score), 2),
    Min_Focus = round(min(focus_score), 2),
    Max_Focus = round(max(focus_score), 2),
    Count     = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(Avg_Focus))

kable(focus_by_device, caption = "Focus Score Statistics by Device Type") %>%
  smart_kable()
Focus Score Statistics by Device Type
device_type Avg_Focus Min_Focus Max_Focus Count
Android 42.92 0 100 903
iPhone 41.32 0 94 823
Tablet 41.20 0 97 888
Laptop 40.91 0 99 886
cat("\nHighest Avg Focus:", focus_by_device$device_type[1],
    "(", focus_by_device$Avg_Focus[1], ")")
## 
## Highest Avg Focus: 1 ( 42.92 )
cat("\nLowest Avg Focus :", focus_by_device$device_type[nrow(focus_by_device)],
    "(", focus_by_device$Avg_Focus[nrow(focus_by_device)], ")\n")
## 
## Lowest Avg Focus : 3 ( 40.91 )
ggplot(focus_by_device, aes(x = reorder(device_type, -Avg_Focus),
                              y = Avg_Focus, fill = device_type)) +
  geom_col(width = 0.5, show.legend = FALSE) +
  geom_text(aes(label = Avg_Focus), vjust = -0.4, fontface = "bold", size = 4.5) +
  scale_fill_brewer(palette = "Paired") +
  labs(title = "Average Focus Score by Device Type",
       x = "Device Type", y = "Avg Focus Score") +
  theme_minimal(base_size = 13)

Insight: Focus scores differ meaningfully across device types. Devices typically associated with structured work (Laptops, Tablets) tend to yield higher focus scores, while smartphones — prone to fragmented, notification-heavy usage — generally correlate with lower focus capacity.


Level 5: Feature Engineering


Question 5.1: Total Digital Load

# Total Digital Load = device_hrs + social_media_hrs + study_hrs
df <- df %>%
  mutate(
    total_digital_load = device_hours_per_day +
                         (social_media_mins / 60) +
                         (study_mins / 60)
  )

cat("=== Total Digital Load Summary ===\n")
## === Total Digital Load Summary ===
summary(df$total_digital_load)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.877   8.652  11.245  11.776  14.357  29.080
# Distribution
ggplot(df, aes(x = total_digital_load)) +
  geom_histogram(bins = 30, fill = "#6c5ce7", color = "white", alpha = 0.85) +
  geom_vline(xintercept = mean(df$total_digital_load),
             color = "#d63031", linetype = "dashed", linewidth = 1) +
  annotate("text", x = mean(df$total_digital_load) + 0.5,
           y = 80, label = paste("Mean =",
           round(mean(df$total_digital_load),2)),
           color = "#d63031", fontface = "bold") +
  labs(title = "Distribution of Total Digital Load",
       subtitle = "Device hours + Social media hours + Study hours",
       x = "Total Digital Load (hours/day)", y = "Count") +
  theme_minimal(base_size = 13)

Insight: The Total Digital Load metric reveals the aggregate daily digital burden users carry. The distribution is right-skewed, with a subset of users carrying extremely high digital loads (>15 hours/day), which is physiologically unsustainable and likely to impair cognitive performance.


Question 5.2: Well-being Index

# Well-being Index (higher = better)
# Formula: normalize sleep & physical activity positively,
#          penalize stress and anxiety
df <- df %>%
  mutate(
    sleep_norm    = (sleep_hours - min(sleep_hours)) /
                    (max(sleep_hours) - min(sleep_hours)),
    phys_norm     = (physical_activity_days - min(physical_activity_days)) /
                    (max(physical_activity_days) - min(physical_activity_days)),
    stress_norm   = 1 - (stress_level - min(stress_level)) /
                        (max(stress_level) - min(stress_level)),
    anxiety_norm  = 1 - (anxiety_score - min(anxiety_score)) /
                        (max(anxiety_score) - min(anxiety_score)),
    wellbeing_index = round(
      (sleep_norm * 30 + phys_norm * 25 +
       stress_norm * 25 + anxiety_norm * 20), 2)
  )

cat("=== Well-being Index Summary ===\n")
## === Well-being Index Summary ===
summary(df$wellbeing_index)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.10   45.51   57.88   56.29   68.02   94.18
ggplot(df, aes(x = wellbeing_index, fill = gender)) +
  geom_density(alpha = 0.55, linewidth = 0.8) +
  scale_fill_manual(values = c("#fd79a8","#74b9ff","#55efc4")) +
  labs(title = "Well-being Index Distribution by Gender",
       x = "Well-being Index (0-100)", y = "Density", fill = "Gender") +
  theme_minimal(base_size = 13)

Insight: The Well-being Index integrates sleep, physical activity, stress, and anxiety into a single composite score. Gender-based differences in the distribution suggest that certain demographic groups face systematically lower well-being — correlating with higher reported stress and lower sleep quality.


Question 5.3: Behavioral Risk Categories

# Classify users into High-Risk / Balanced / Low-Risk
df <- df %>%
  mutate(
    risk_category = case_when(
      stress_level >= quantile(stress_level, 0.70) &
      digital_dependence_score >= quantile(digital_dependence_score, 0.70) &
      productivity_score <= quantile(productivity_score, 0.35)  ~ "High-Risk",

      stress_level <= quantile(stress_level, 0.35) &
      digital_dependence_score <= quantile(digital_dependence_score, 0.35) &
      productivity_score >= quantile(productivity_score, 0.65)  ~ "Low-Risk",

      TRUE ~ "Balanced"
    )
  )

risk_counts <- df %>%
  count(risk_category) %>%
  mutate(Proportion = round(n / sum(n) * 100, 1))

kable(risk_counts, caption = "User Distribution by Risk Category") %>%
  smart_kable()
User Distribution by Risk Category
risk_category n Proportion
Balanced 3147 89.9
High-Risk 165 4.7
Low-Risk 188 5.4
cat("\nRisk Category Breakdown:\n")
## 
## Risk Category Breakdown:
print(risk_counts)
##   risk_category    n Proportion
## 1      Balanced 3147       89.9
## 2     High-Risk  165        4.7
## 3      Low-Risk  188        5.4

Insight: The majority of users fall into the Balanced category, while a notable minority qualify as High-Risk — exhibiting a triple burden of high stress, high digital dependence, and low productivity. This group warrants priority attention in any digital wellness intervention strategy.


Data Visualization


V1: Bar Chart - Total Digital Load by Region & Device Type

load_summary <- df %>%
  group_by(region, device_type) %>%
  summarise(Avg_Load = round(mean(total_digital_load), 2), .groups = "drop")

ggplot(load_summary, aes(x = region, y = Avg_Load, fill = device_type)) +
  geom_col(position = "dodge", width = 0.7) +
  scale_fill_brewer(palette = "Set1", name = "Device Type") +
  labs(title = "V1: Average Total Digital Load by Region & Device Type",
       subtitle = "Grouped bar chart comparing digital burden across geographies",
       x = "Region", y = "Avg Total Digital Load (hrs/day)") +
  theme_minimal(base_size = 13) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1),
        legend.position = "top")


V2: Histogram - Productivity Score Distribution

ggplot(df, aes(x = productivity_score, fill = after_stat(count))) +
  geom_histogram(bins = 35, color = "white", linewidth = 0.3) +
  scale_fill_gradient(low = "#b2bec3", high = "#6c5ce7") +
  geom_vline(xintercept = mean(df$productivity_score),
             color = "#d63031", linetype = "dashed", linewidth = 1) +
  annotate("text",
           x    = mean(df$productivity_score) + 2,
           y    = 140,
           label = paste("Mean =", round(mean(df$productivity_score), 1)),
           color = "#d63031", fontface = "bold", size = 4) +
  labs(title = "V2: Distribution of Productivity Scores",
       subtitle = "Dashed line = mean productivity score",
       x = "Productivity Score", y = "Number of Users") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")


V3: Pie Chart - User Risk Level Proportions

risk_pie <- df %>%
  count(risk_category) %>%
  mutate(
    pct   = n / sum(n),
    label = paste0(risk_category, "\n", round(pct*100,1), "%")
  )

ggplot(risk_pie, aes(x = "", y = pct, fill = risk_category)) +
  geom_col(width = 1, color = "white", linewidth = 1.2) +
  coord_polar(theta = "y") +
  geom_text(aes(label = label), position = position_stack(vjust = 0.5),
            fontface = "bold", size = 4.5, color = "white") +
  scale_fill_manual(values = c("High-Risk"="#d63031",
                                "Balanced" ="#fdcb6e",
                                "Low-Risk" ="#00b894")) +
  labs(title = "V3: User Distribution by Risk Category",
       fill = "Risk Level") +
  theme_void(base_size = 14) +
  theme(legend.position = "right")


V4: Pair Plot - Device Usage, Productivity, Stress & Dependence

pair_vars <- df %>%
  select(device_hours_per_day, productivity_score,
         stress_level, digital_dependence_score, sleep_hours) %>%
  sample_n(500)   # Sample for performance

ggpairs(pair_vars,
        lower = list(continuous = wrap("smooth", alpha = 0.15,
                                       color = "#6c5ce7", size = 0.5)),
        diag  = list(continuous = wrap("densityDiag", fill = "#74b9ff",
                                       alpha = 0.6)),
        upper = list(continuous = wrap("cor", size = 4, color = "#2d3436")),
        title = "V4: Pair Plot - Key Behavioral & Mental Health Variables") +
  theme_minimal(base_size = 10)


V5: Boxplot - Productivity Scores Across Regions

ggplot(df, aes(x = reorder(region, productivity_score, FUN = median),
               y = productivity_score, fill = region)) +
  geom_boxplot(outlier.alpha = 0.3, outlier.size = 1,
               notch = TRUE, notchwidth = 0.6, show.legend = FALSE) +
  geom_jitter(width = 0.15, alpha = 0.07, size = 0.8, color = "#2d3436") +
  scale_fill_brewer(palette = "Pastel1") +
  coord_flip() +
  labs(title = "V5: Productivity Score Distribution Across Regions",
       subtitle = "Notched boxplots - notch shows 95% CI of median",
       x = "Region", y = "Productivity Score") +
  theme_minimal(base_size = 13)


V6: Boxplot - Sleep Hours & Stress by Demographic Groups

p6a <- ggplot(df, aes(x = gender, y = sleep_hours, fill = gender)) +
  geom_boxplot(show.legend = FALSE, outlier.alpha = 0.3) +
  scale_fill_manual(values = c("#fd79a8","#74b9ff","#55efc4")) +
  labs(title = "Sleep Hours by Gender", x = "Gender", y = "Sleep (hrs)") +
  theme_minimal(base_size = 12)

p6b <- ggplot(df, aes(x = gender, y = stress_level, fill = gender)) +
  geom_boxplot(show.legend = FALSE, outlier.alpha = 0.3) +
  scale_fill_manual(values = c("#fd79a8","#74b9ff","#55efc4")) +
  labs(title = "Stress Level by Gender", x = "Gender", y = "Stress Level") +
  theme_minimal(base_size = 12)

gridExtra::grid.arrange(p6a, p6b, ncol = 2,
  top = "V6: Sleep Hours and Stress Levels Across Gender Groups")


V7: Line Chart - Device Usage vs Productivity & Stress

# Bin device usage for smooth line
line_data <- df %>%
  mutate(usage_bin = round(device_hours_per_day)) %>%
  group_by(usage_bin) %>%
  summarise(
    Avg_Productivity = mean(productivity_score),
    Avg_Stress       = mean(stress_level),
    .groups = "drop"
  ) %>%
  pivot_longer(cols = c(Avg_Productivity, Avg_Stress),
               names_to = "Metric", values_to = "Value")

ggplot(line_data, aes(x = usage_bin, y = Value,
                       color = Metric, group = Metric)) +
  geom_line(linewidth = 1.3) +
  geom_point(size = 2.5) +
  scale_color_manual(values = c("Avg_Productivity" = "#00b894",
                                 "Avg_Stress"       = "#d63031"),
                     labels = c("Avg Productivity", "Avg Stress")) +
  labs(title = "V7: Device Usage vs Productivity & Stress",
       subtitle = "As daily screen time increases - what happens to well-being?",
       x = "Device Usage (hours/day, rounded)",
       y = "Score", color = "Metric") +
  theme_minimal(base_size = 13)

Insight: The line chart powerfully illustrates a diverging trend — as device usage increases, productivity scores tend to decline while stress levels rise. This crossing of trajectories provides visual evidence of the cost of excessive screen time on cognitive and mental health outcomes.


Advanced Engineering


1.3 Simple Linear Regression: Device Usage to Productivity

# SLR Model
slr_model <- lm(productivity_score ~ device_hours_per_day, data = df)
summary(slr_model)
## 
## Call:
## lm(formula = productivity_score ~ device_hours_per_day, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.880  -6.243   0.101   6.159  30.983 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          63.41483    0.40213 157.698  < 2e-16 ***
## device_hours_per_day  0.25752    0.05025   5.125 3.14e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.63 on 3498 degrees of freedom
## Multiple R-squared:  0.007452,   Adjusted R-squared:  0.007169 
## F-statistic: 26.26 on 1 and 3498 DF,  p-value: 3.138e-07
ggplot(df, aes(x = device_hours_per_day, y = productivity_score)) +
  geom_point(alpha = 0.15, color = "#6c5ce7", size = 1) +
  geom_smooth(method = "lm", color = "#d63031", se = TRUE,
              linewidth = 1.5) +
  labs(title = "Simple Linear Regression: Device Usage to Productivity",
       subtitle = paste("R2 =", round(summary(slr_model)$r.squared, 4)),
       x = "Device Hours per Day",
       y = "Productivity Score") +
  theme_minimal(base_size = 13)

Insight: The SLR coefficient for device usage reveals whether each additional hour of screen time corresponds to a statistically significant increase or decrease in productivity. The R2 value quantifies how much variance in productivity is explained by device usage alone.


1.4 Multiple Linear Regression: Predicting Productivity

# MLR Model
mlr_model <- lm(productivity_score ~
                  device_hours_per_day +
                  sleep_hours +
                  stress_level +
                  anxiety_score +
                  focus_score,
                data = df)
summary(mlr_model)
## 
## Call:
## lm(formula = productivity_score ~ device_hours_per_day + sleep_hours + 
##     stress_level + anxiety_score + focus_score, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.557  -6.106  -0.030   6.149  32.307 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          61.047847   1.517685  40.224  < 2e-16 ***
## device_hours_per_day  0.546643   0.075134   7.276 4.24e-13 ***
## sleep_hours           0.264862   0.156494   1.692   0.0906 .  
## stress_level          0.059999   0.051968   1.155   0.2484    
## anxiety_score        -0.247097   0.042711  -5.785 7.88e-09 ***
## focus_score          -0.004799   0.006890  -0.696   0.4862    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.586 on 3494 degrees of freedom
## Multiple R-squared:  0.01771,    Adjusted R-squared:  0.0163 
## F-statistic:  12.6 on 5 and 3494 DF,  p-value: 3.767e-12
# Coefficient plot
coef_df <- as.data.frame(summary(mlr_model)$coefficients)
coef_df$Variable <- rownames(coef_df)
coef_df <- coef_df %>% filter(Variable != "(Intercept)")

ggplot(coef_df, aes(x = reorder(Variable, Estimate),
                     y = Estimate, fill = Estimate > 0)) +
  geom_col(width = 0.5) +
  geom_errorbar(aes(ymin = Estimate - `Std. Error`,
                    ymax = Estimate + `Std. Error`), width = 0.2) +
  coord_flip() +
  scale_fill_manual(values = c("TRUE"="#00b894","FALSE"="#d63031"),
                    labels = c("Positive","Negative"),
                    name   = "Effect Direction") +
  labs(title = "MLR: Predictor Effects on Productivity Score",
       x = "Predictor Variable", y = "Coefficient Estimate") +
  theme_minimal(base_size = 13)

Insight: The MLR model identifies which factors most strongly predict productivity. Positive coefficients (e.g., focus_score, sleep_hours) indicate beneficial variables, while negative coefficients (e.g., stress_level, device hours) confirm their harmful influence — providing a multi-factor explanation for productivity variance.


Final Conclusion

## 
## === FINAL PROJECT SUMMARY & KEY FINDINGS ===

Summary of Key Findings

# Finding Implication
1 Inverse Device-Productivity Relationship More daily screen time correlates with lower productivity scores across all regions
2 Digital-Stress Feedback Loop High digital dependence and high stress co-occur significantly, compounding mental health risk
3 Regional Productivity Gaps Meaningful productivity disparities exist across regions, reflecting socio-economic and cultural digital usage patterns
4 Device Type Matters Laptops and tablets associate with higher focus; smartphones with fragmented attention
5 Sleep is Central to Well-being Sleep hours strongly influence the well-being index and correlate inversely with stress
6 Three Distinct User Profiles K-Means reveals Balanced, High-Risk, and Low-Digital users - enabling targeted interventions
7 KNN Accurately Predicts Risk ML model confirms stress, anxiety, sleep, and digital dependence are the most predictive risk factors

Recommendations

  • Digital Wellness Programs should be prioritized for High-Risk cluster users — particularly those using smartphones >8 hrs/day
  • Sleep hygiene campaigns can improve both well-being and productivity simultaneously
  • Region-specific interventions are needed given the significant geographic variance in outcomes
  • Focus-enhancement tools (productivity apps, app timers) should be promoted for high-engagement users
  • Continuous monitoring using a composite Digital Load + Well-being Index can flag at-risk users early

Report generated using R | Dataset: Digital Behavior & Mental Health Survey (N=3,500)