Project Overview

This report presents a comprehensive data-driven analysis of digital behavior, productivity, and mental health patterns across 3,500 users. The dataset captures variables such as daily device usage, social media consumption, study time, sleep hours, stress, anxiety, and productivity scores. The analysis follows a structured pipeline from basic exploration to machine learning.

# Load All Required Libraries
# tidyverse already attaches dplyr, tidyr, and ggplot2; stats is attached by default
library(tidyverse)   # Data manipulation & ggplot2
library(GGally)      # Pair plots (ggpairs)
library(cluster)     # Clustering utilities (kmeans() itself is in stats)
library(class)       # KNN classification
library(scales)      # Formatting axes
library(knitr)       # Tables (kable)
library(kableExtra)  # Styled tables

# Load Dataset
df <- read.csv("Data.csv", stringsAsFactors = FALSE)

# Quick peek
cat("Dataset loaded successfully!\n")

## Dataset loaded successfully!

cat("Rows:", nrow(df), "| Columns:", ncol(df), "\n")

## Rows: 3500 | Columns: 24

Level 1: Understanding the Data (Basic Exploration)

Question 1.1: Structure of the Dataset

# Dataset Structure
cat("=== DATASET DIMENSIONS ===\n")

## === DATASET DIMENSIONS ===

cat("Rows (Users)  :", nrow(df), "\n")

## Rows (Users)  : 3500

cat("Columns (Features):", ncol(df), "\n\n")

## Columns (Features): 24

cat("=== COLUMN NAMES & DATA TYPES ===\n")

## === COLUMN NAMES & DATA TYPES ===

str(df)

## 'data.frame':    3500 obs. of  24 variables:
##  $ id                      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ age                     : int  40 27 31 41 26 37 18 33 43 41 ...
##  $ gender                  : chr  "Female" "Male" "Male" "Female" ...
##  $ region                  : chr  "Asia" "Africa" "North America" "Middle East" ...
##  $ income_level            : chr  "High" "Lower-Mid" "Lower-Mid" "Low" ...
##  $ education_level         : chr  "High School" "Master" "Bachelor" "Master" ...
##  $ daily_role              : chr  "Part-time/Shift" "Full-time Employee" "Full-time Employee" "Caregiver/Home" ...
##  $ device_hours_per_day    : num  3.54 5.65 8.87 4.05 13.07 ...
##  $ phone_unlocks           : int  45 100 181 94 199 73 119 82 155 38 ...
##  $ notifications_per_day   : int  561 393 231 268 91 198 553 184 309 110 ...
##  $ social_media_mins       : int  98 174 595 18 147 9 61 48 16 249 ...
##  $ study_mins              : int  34 102 140 121 60 85 188 155 116 155 ...
##  $ physical_activity_days  : num  7 2 1 4 1 0 4 3 4 5 ...
##  $ sleep_hours             : num  9.12 8.84 6.49 7.6 5.2 ...
##  $ sleep_quality           : num  3.35 2.91 2.89 3.1 2.79 ...
##  $ anxiety_score           : num  9.93 4 4 7.09 7.03 ...
##  $ depression_score        : num  5 4 8 9 15 4 1 8 18 0 ...
##  $ stress_level            : num  6.59 4.13 1.43 5 9.45 ...
##  $ happiness_score         : num  8 8.1 7.6 7.8 4.2 10 7.7 8.6 8.3 9.2 ...
##  $ focus_score             : num  23 35 15 28 70 64 15 70 53 73 ...
##  $ high_risk_flag          : int  0 0 0 1 1 0 0 0 0 0 ...
##  $ device_type             : chr  "Android" "Laptop" "Android" "Tablet" ...
##  $ productivity_score      : num  70 64 65.3 80 65.3 ...
##  $ digital_dependence_score: num  25.7 30.1 40.6 36.7 48.4 ...

# Clean display table
# row.names = NULL stops the column names from being repeated as row labels
data_types <- data.frame(
  Column       = names(df),
  Type         = sapply(df, class),
  Sample_Value = sapply(df, function(x) as.character(x[1])),
  row.names    = NULL
)
# smart_kable() is not a knitr or kableExtra function; it is presumably a
# styling helper defined elsewhere in the report
kable(data_types, caption = "Column Names, Types, and Sample Values") %>%
  smart_kable()

Column Names, Types, and Sample Values
Column	Type	Sample_Value
id	integer	1
age	integer	40
gender	character	Female
region	character	Asia
income_level	character	High
education_level	character	High School
daily_role	character	Part-time/Shift
device_hours_per_day	numeric	3.54
phone_unlocks	integer	45
notifications_per_day	integer	561
social_media_mins	integer	98
study_mins	integer	34
physical_activity_days	numeric	7
sleep_hours	numeric	9.12379996856656
sleep_quality	numeric	3.35362722906509
anxiety_score	numeric	9.92665143100165
depression_score	numeric	5
stress_level	numeric	6.59328879138526
happiness_score	numeric	8
focus_score	numeric	23
high_risk_flag	integer	0
device_type	character	Android
productivity_score	numeric	70
digital_dependence_score	numeric	25.7

Insight: The dataset contains 3,500 rows and 24 columns: 18 numeric columns (7 integer, 11 double), such as device_hours_per_day, stress_level, and productivity_score, and 6 character columns (gender, region, income_level, education_level, daily_role, device_type). This mix supports both grouped behavioral summaries and machine learning tasks.

Question 1.2: Missing Values & Inconsistencies

# Missing Value Analysis
# row.names = NULL stops the column names from being repeated as row labels
missing_summary <- data.frame(
  Column        = names(df),
  Missing_Count = colSums(is.na(df)),
  Missing_Pct   = round(colSums(is.na(df)) / nrow(df) * 100, 2),
  row.names     = NULL
) %>% arrange(desc(Missing_Count))

kable(missing_summary, caption = "Missing Value Summary per Column") %>%
  smart_kable()

Missing Value Summary per Column
Column	Missing_Count	Missing_Pct
id	0	0
age	0	0
gender	0	0
region	0	0
income_level	0	0
education_level	0	0
daily_role	0	0
device_hours_per_day	0	0
phone_unlocks	0	0
notifications_per_day	0	0
social_media_mins	0	0
study_mins	0	0
physical_activity_days	0	0
sleep_hours	0	0
sleep_quality	0	0
anxiety_score	0	0
depression_score	0	0
stress_level	0	0
happiness_score	0	0
focus_score	0	0
high_risk_flag	0	0
device_type	0	0
productivity_score	0	0
digital_dependence_score	0	0

cat("\n=== UNIQUE VALUES IN CATEGORICAL COLUMNS ===\n")

## 
## === UNIQUE VALUES IN CATEGORICAL COLUMNS ===

cat("Gender levels  :", paste(unique(df$gender), collapse = ", "), "\n")

## Gender levels  : Female, Male

cat("Region levels  :", paste(unique(df$region), collapse = ", "), "\n")

## Region levels  : Asia, Africa, North America, Middle East, Europe, South America

cat("Device types   :", paste(unique(df$device_type), collapse = ", "), "\n")

## Device types   : Android, Laptop, Tablet, iPhone

# Check for negative or out-of-range values
cat("\n=== RANGE CHECK ===\n")

## 
## === RANGE CHECK ===

cat("device_hours_per_day range:", range(df$device_hours_per_day, na.rm=TRUE), "\n")

## device_hours_per_day range: 0.28 17.16

cat("sleep_hours range         :", range(df$sleep_hours, na.rm=TRUE), "\n")

## sleep_hours range         : 3 11.00457

cat("productivity_score range  :", range(df$productivity_score, na.rm=TRUE), "\n")

## productivity_score range  : 33 95

cat("stress_level range        :", range(df$stress_level, na.rm=TRUE), "\n")

## stress_level range        : 1 10

# Convert categoricals to factors
df$gender      <- as.factor(df$gender)
df$region      <- as.factor(df$region)
df$device_type <- as.factor(df$device_type)
cat("\nCategorical columns converted to factors.\n")

## 
## Categorical columns converted to factors.

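Insight: No column has missing values, and the four numeric variables checked (device_hours_per_day, sleep_hours, productivity_score, stress_level) stay within plausible ranges; the remaining numeric columns were not range-checked here. Gender, region, and device_type have consistent, well-defined levels. One caveat: the non-integer value 65.2993 recurs in productivity_score in later tables, which may indicate that gaps were filled with a mean before this file was produced. The data can be analyzed without further imputation, but that possibility should be kept in mind.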
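The range check above covers four columns. A minimal sketch of how the same check could be extended to every numeric column, and how repeated non-integer values (a common trace of mean imputation) could be surfaced, is shown below. The data frame `toy` and the helper `repeated_fractional()` are illustrative assumptions, not part of the original analysis.

```r
# Toy data standing in for df; values are illustrative only
toy <- data.frame(
  sleep_hours        = c(9.12, 8.84, 6.49, 7.60, 5.20),
  productivity_score = c(70, 64, 65.2993, 80, 65.2993),
  gender             = c("Female", "Male", "Male", "Female", "Male")
)

num_cols <- names(toy)[sapply(toy, is.numeric)]

# Min and max for every numeric column, one row per column
range_table <- data.frame(
  Column    = num_cols,
  Min       = sapply(toy[num_cols], min, na.rm = TRUE),
  Max       = sapply(toy[num_cols], max, na.rm = TRUE),
  row.names = NULL
)
print(range_table)

# Hypothetical helper: non-integer values that occur more than once
repeated_fractional <- function(x) {
  x <- x[!is.na(x) & x != round(x)]
  counts <- table(x)
  as.numeric(names(counts)[counts > 1])
}

repeats <- lapply(toy[num_cols], repeated_fractional)
print(repeats)
```

A repeated fractional value is only a hint; it would still need to be confirmed against how the file was prepared.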

Question 1.3: Average Productivity & Digital Dependence by Device Type & Region

# By Device Type
avg_by_device <- df %>%
  group_by(device_type) %>%
  summarise(
    Avg_Productivity       = round(mean(productivity_score, na.rm=TRUE), 2),
    Avg_Digital_Dependence = round(mean(digital_dependence_score, na.rm=TRUE), 2),
    User_Count             = n()
  ) %>% arrange(desc(Avg_Productivity))

kable(avg_by_device, caption = "Average Scores by Device Type") %>%
  smart_kable()

Average Scores by Device Type
device_type	Avg_Productivity	Avg_Digital_Dependence	User_Count
Android	65.51	36.43	903
Laptop	65.28	36.38	886
iPhone	65.27	36.55	823
Tablet	65.13	37.37	888

# By Region
avg_by_region <- df %>%
  group_by(region) %>%
  summarise(
    Avg_Productivity       = round(mean(productivity_score, na.rm=TRUE), 2),
    Avg_Digital_Dependence = round(mean(digital_dependence_score, na.rm=TRUE), 2),
    User_Count             = n()
  ) %>% arrange(desc(Avg_Productivity))

kable(avg_by_region, caption = "Average Scores by Region") %>%
  smart_kable()

Average Scores by Region
region	Avg_Productivity	Avg_Digital_Dependence	User_Count
Europe	65.56	36.15	797
North America	65.56	36.70	622
South America	65.38	36.95	425
Africa	65.27	36.54	578
Middle East	65.17	36.90	339
Asia	64.84	37.10	739

Insight: Average productivity and digital dependence are nearly identical across device types and regions. Productivity means span 65.13 (Tablet) to 65.51 (Android) by device and 64.84 (Asia) to 65.56 (Europe and North America) by region; digital dependence spans 36.38 to 37.37 by device and 36.15 to 37.10 by region. Differences of one point or less do not on their own support claims about device or regional effects and should be tested formally before being interpreted.
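One way to judge whether group means differ by more than noise is a one-way ANOVA together with an effect size. The sketch below runs on simulated data in which the groups truly share one mean; the simulated values, the seed, and the use of eta squared are assumptions made for illustration and are not results from Data.csv.

```r
set.seed(1)  # arbitrary seed so the simulation is reproducible

# Simulated scores: four device groups drawn from the same distribution
sim <- data.frame(
  device_type        = factor(rep(c("Android", "Laptop", "Tablet", "iPhone"),
                                  each = 200)),
  productivity_score = rnorm(800, mean = 65.3, sd = 9.7)
)

fit <- aov(productivity_score ~ device_type, data = sim)
tab <- summary(fit)[[1]]   # ANOVA table: first row = device_type, second = residuals

# Eta squared: share of total variance explained by the grouping
eta_sq  <- tab[["Sum Sq"]][1] / sum(tab[["Sum Sq"]])
p_value <- tab[["Pr(>F)"]][1]

cat("Eta squared:", round(eta_sq, 4), "\n")
cat("p-value    :", round(p_value, 4), "\n")
```

On the real data the same call would be `aov(productivity_score ~ device_type, data = df)`; an eta squared close to zero would confirm that device type explains very little of the variation.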

Level 2: Data Extraction & Filtering

Question 2.1: Users with Highest Overall Digital Engagement

# Digital Engagement Score (hours/day) = device_hours + social_media_mins/60 + study_mins/60
df <- df %>%
  mutate(
    digital_engagement = device_hours_per_day +
                         (social_media_mins / 60) +
                         (study_mins / 60)
  )

top_engaged <- df %>%
  select(id, gender, region, device_type,
         device_hours_per_day, social_media_mins, study_mins,
         digital_engagement, productivity_score) %>%
  arrange(desc(digital_engagement)) %>%
  slice_head(n = 15)

kable(top_engaged, caption = "Top 15 Most Digitally Engaged Users") %>%
  smart_kable()

Top 15 Most Digitally Engaged Users
id	gender	region	device_type	device_hours_per_day	social_media_mins	study_mins	digital_engagement	productivity_score
1868	Female	Africa	iPhone	14.38	617	265	29.08000	76.0000
3067	Male	Asia	iPhone	16.22	601	163	28.95333	83.0000
2201	Female	South America	iPhone	15.86	607	175	28.89333	65.2993
1640	Female	Middle East	Laptop	14.81	595	179	27.71000	54.0000
2706	Male	North America	iPhone	12.05	581	331	27.25000	75.0000
1805	Female	Europe	Laptop	14.03	338	418	26.63000	65.2993
754	Female	Middle East	iPhone	12.19	591	225	25.79000	76.0000
1155	Male	Asia	Android	16.12	424	152	25.72000	81.0000
1433	Female	Middle East	Tablet	12.75	480	228	24.55000	74.0000
2305	Female	South America	Laptop	13.03	579	98	24.31333	85.0000
1624	Female	North America	Tablet	11.63	625	134	24.28000	65.2993
572	Female	Europe	Tablet	15.20	331	206	24.15000	82.0000
388	Male	North America	Tablet	11.36	608	158	24.12667	84.0000
847	Male	Asia	Laptop	14.88	294	253	23.99667	68.0000
2823	Male	Middle East	Tablet	15.97	211	265	23.90333	65.0000

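Insight: The 15 most engaged users register between 23.9 and 29.1 hours of combined digital activity per day. Totals above 24 hours show that the three components overlap (social media and study time are presumably also counted in device_hours_per_day), so digital_engagement is best read as a relative intensity score and not as literal hours. Productivity within this group ranges from 54 to 85, so heavy usage alone does not predict output.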
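The arithmetic behind the engagement score can be checked by hand. The sketch below recomputes it for three rows copied from the table above and flags totals that exceed 24 hours; the `exceeds_24h` column is an illustrative addition, not part of the original pipeline.

```r
# Three users copied from the Top 15 table
toy <- data.frame(
  id                   = c(1868, 3067, 847),
  device_hours_per_day = c(14.38, 16.22, 14.88),
  social_media_mins    = c(617, 601, 294),
  study_mins           = c(265, 163, 253)
)

# Same formula as the report: minutes are converted to hours before summing
toy$digital_engagement <- toy$device_hours_per_day +
  toy$social_media_mins / 60 +
  toy$study_mins / 60

# A day has 24 hours, so larger totals mean the components overlap
toy$exceeds_24h <- toy$digital_engagement > 24

print(toy)
```

Applied to the full data frame, `sum(df$digital_engagement > 24)` would show how many users are affected.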

Question 2.2: High Digital Dependence AND High Stress

# Thresholds: top 25% for both metrics
dep_thresh   <- quantile(df$digital_dependence_score, 0.75)
stress_thresh <- quantile(df$stress_level, 0.75)

high_risk_users <- df %>%
  filter(digital_dependence_score >= dep_thresh,
         stress_level >= stress_thresh) %>%
  select(id, gender, region, device_type,
         digital_dependence_score, stress_level,
         productivity_score, sleep_hours)

cat("Users with High Dependence (>=", round(dep_thresh,2),
    ") AND High Stress (>=", round(stress_thresh,2), "):",
    nrow(high_risk_users), "\n")

## Users with High Dependence (>= 45.1 ) AND High Stress (>= 8.79 ): 392

kable(head(high_risk_users, 15),
      caption = "High Dependence + High Stress Users (Top 15 shown)") %>%
  smart_kable()

High Dependence + High Stress Users (Top 15 shown)
id	gender	region	device_type	digital_dependence_score	stress_level	productivity_score	sleep_hours
5	Female	Europe	Android	48.4	9.448757	65.2993	5.197962
20	Male	Europe	Tablet	62.7	9.707076	51.0000	5.198886
26	Female	Middle East	Android	48.7	10.000000	66.0000	6.072299
52	Female	North America	Laptop	56.6	10.000000	65.2993	4.090027
65	Female	Asia	Laptop	60.9	9.352622	73.0000	6.224374
95	Male	Asia	Laptop	61.1	10.000000	61.0000	7.272793
99	Male	South America	Tablet	60.9	9.924496	65.0000	6.152393
108	Female	Africa	Laptop	47.1	9.861660	67.0000	4.638497
113	Male	Africa	Tablet	63.8	10.000000	66.0000	6.255626
115	Female	Middle East	iPhone	59.9	8.939790	63.0000	5.918662
119	Male	Europe	iPhone	65.2	9.867702	76.0000	6.078185
146	Female	South America	Tablet	49.4	10.000000	73.0000	5.156693
147	Male	North America	Android	55.1	9.946962	53.0000	7.894928
157	Female	North America	Laptop	61.0	9.766441	84.0000	6.466895
162	Female	Europe	Tablet	45.8	10.000000	73.0000	6.512540

# Distribution breakdown
cat("\nRegion breakdown of these high-risk users:\n")

## 
## Region breakdown of these high-risk users:

print(table(high_risk_users$region))

## 
##        Africa          Asia        Europe   Middle East North America 
##            55            90            89            50            66 
## South America 
##            42

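Insight: 392 users (11.2% of the sample) are in the top quartile of both digital dependence and stress. If the two measures were unrelated, roughly 6% of users (about 219) would be expected in this overlap, so the two tend to occur together. The data are cross-sectional, so this is an association and not evidence of a feedback loop. Whether these users also sleep less or are less productive than the rest of the sample is not computed here and would need a direct comparison of group means.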
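To put the count of 392 in context, it helps to compare it with the overlap expected if digital dependence and stress were independent. The sketch below uses only figures reported above; treating each top-quartile filter as selecting exactly 25% of users is a simplifying assumption (ties at the threshold, such as stress values of 10, can push the share slightly higher).

```r
n_users  <- 3500
observed <- 392

# Under independence, P(top 25% on both) = 0.25 * 0.25
expected <- n_users * 0.25 * 0.25
ratio    <- observed / expected

cat("Expected under independence:", expected, "\n")
cat("Observed                   :", observed, "\n")
cat("Observed / expected        :", round(ratio, 2), "\n")
```

A ratio well above 1 indicates that the two conditions co-occur more often than chance alone would produce.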

Question 2.3: Regions with Low Productivity & High Device Usage

# Low productivity: bottom 30%; High device usage: top 30%
prod_low   <- quantile(df$productivity_score, 0.30)
device_high <- quantile(df$device_hours_per_day, 0.70)

flagged <- df %>%
  filter(productivity_score <= prod_low,
         device_hours_per_day >= device_high)

region_concentration <- flagged %>%
  group_by(region) %>%
  summarise(
    User_Count     = n(),
    Avg_Productivity = round(mean(productivity_score), 2),
    Avg_Device_Hrs  = round(mean(device_hours_per_day), 2)
  ) %>%
  arrange(desc(User_Count))

kable(region_concentration,
      caption = "Regions: Low Productivity + High Device Usage") %>%
  smart_kable()

Regions: Low Productivity + High Device Usage
region	User_Count	Avg_Productivity	Avg_Device_Hrs
Asia	66	54.27	11.31
Europe	59	54.17	11.41
North America	47	54.79	10.86
Africa	44	54.77	11.36
South America	35	54.89	11.02
Middle East	30	54.30	11.53

# Bar chart
ggplot(region_concentration, aes(x = reorder(region, -User_Count),
                                  y = User_Count, fill = region)) +
  geom_col(show.legend = FALSE, width = 0.6) +
  geom_text(aes(label = User_Count), vjust = -0.4, fontface = "bold", size = 4) +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Regions with Low Productivity & High Device Usage",
       subtitle = "Concentration of problematic digital behavior",
       x = "Region", y = "Number of Users") +
  theme_minimal(base_size = 13)

Insight: Flagged users are spread fairly evenly across regions. Raw counts mostly track region size: as a share of each region's users, the flagged group ranges from 7.4% (Europe) to 8.9% (Asia), and average productivity (54.2 to 54.9) and device hours (10.9 to 11.5) are similar everywhere. The data therefore do not point to a region-specific problem, and they say nothing about whether the device time is recreational or professional.
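Raw counts favor large regions, so a per-region rate is a fairer comparison. The sketch below combines the flagged counts from the table above with the regional user counts from Question 1.3; the column name `flagged_pct` is an illustrative choice.

```r
# Flagged counts (this section) and total users per region (Question 1.3)
region_counts <- data.frame(
  region  = c("Asia", "Europe", "North America",
              "Africa", "South America", "Middle East"),
  flagged = c(66, 59, 47, 44, 35, 30),
  total   = c(739, 797, 622, 578, 425, 339)
)

# Share of each region's users that meet both conditions
region_counts$flagged_pct <- round(100 * region_counts$flagged /
                                     region_counts$total, 1)

print(region_counts[order(-region_counts$flagged_pct), ])

overall_pct <- 100 * sum(region_counts$flagged) / sum(region_counts$total)
cat("Overall flagged share (%):", round(overall_pct, 1), "\n")
```

Plotting `flagged_pct` instead of `User_Count` in the bar chart would avoid rewarding regions simply for having more respondents.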

Level 3: Grouping & Summarization

Question 3.1: Stress & Anxiety by Gender & Region

# Group by Gender + Region
demo_stress <- df %>%
  group_by(gender, region) %>%
  summarise(
    Avg_Stress  = round(mean(stress_level, na.rm=TRUE), 2),
    Avg_Anxiety = round(mean(anxiety_score, na.rm=TRUE), 2),
    Count       = n(),
    .groups     = "drop"
  ) %>%
  arrange(desc(Avg_Stress))

kable(demo_stress,
      caption = "Avg Stress & Anxiety by Gender and Region") %>%
  smart_kable()

Avg Stress & Anxiety by Gender and Region
gender	region	Avg_Stress	Avg_Anxiety	Count
Male	Middle East	5.69	6.06	162
Male	Europe	5.42	5.95	361
Female	Middle East	5.30	8.38	177
Female	Asia	5.23	9.05	390
Female	Africa	5.11	8.31	283
Female	South America	5.09	8.73	210
Female	Europe	5.07	8.30	436
Female	North America	5.06	8.34	339
Male	Africa	4.97	5.27	295
Male	Asia	4.86	5.78	349
Male	North America	4.65	5.59	283
Male	South America	4.64	5.64	215

Insight: Average stress differs only modestly across gender-region groups (4.64 to 5.69), with men in the Middle East and Europe highest. Anxiety shows a much larger and more consistent pattern: women average 8.30 to 9.05 in every region, while men average 5.27 to 6.06. Gender, not region, is the main source of variation in anxiety in this table.
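The gender gap in anxiety can be read off the table more easily once it is laid out as region by gender. The sketch below uses the Avg_Anxiety values from the table above; the object names are illustrative.

```r
# Avg_Anxiety values copied from the gender-by-region table
demo <- data.frame(
  gender      = rep(c("Female", "Male"), each = 6),
  region      = rep(c("Africa", "Asia", "Europe", "Middle East",
                      "North America", "South America"), times = 2),
  Avg_Anxiety = c(8.31, 9.05, 8.30, 8.38, 8.34, 8.73,   # Female
                  5.27, 5.78, 5.95, 6.06, 5.59, 5.64)   # Male
)

# Region x gender matrix (one value per cell, so mean() just returns it)
anxiety_matrix <- tapply(demo$Avg_Anxiety,
                         list(demo$region, demo$gender), mean)

# Female minus male average anxiety, per region
gap <- anxiety_matrix[, "Female"] - anxiety_matrix[, "Male"]

print(round(anxiety_matrix, 2))
print(round(gap, 2))
```

The gap is positive in every region, which is why gender stands out more than geography in this comparison.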

Question 3.2: Productivity Across Device Usage Ranges

# Categorize device usage into Low / Moderate / High
df <- df %>%
  mutate(usage_category = case_when(
    device_hours_per_day < 4  ~ "Low (< 4 hrs)",
    device_hours_per_day < 8  ~ "Moderate (4-8 hrs)",
    TRUE                       ~ "High (8+ hrs)"
  ))
df$usage_category <- factor(df$usage_category,
  levels = c("Low (< 4 hrs)", "Moderate (4-8 hrs)", "High (8+ hrs)"))

prod_by_usage <- df %>%
  group_by(usage_category) %>%
  summarise(
    Avg_Productivity = round(mean(productivity_score), 2),
    Median_Prod      = round(median(productivity_score), 2),
    Std_Dev          = round(sd(productivity_score), 2),
    Count            = n(),
    .groups = "drop"
  )

kable(prod_by_usage,
      caption = "Productivity Statistics by Device Usage Category") %>%
  smart_kable()

Productivity Statistics by Device Usage Category
usage_category	Avg_Productivity	Median_Prod	Std_Dev	Count
Low (< 4 hrs)	63.21	64.0	9.90	486
Moderate (4-8 hrs)	65.35	65.3	9.48	1799
High (8+ hrs)	66.07	65.3	9.73	1215

ggplot(prod_by_usage, aes(x = usage_category, y = Avg_Productivity,
                           fill = usage_category)) +
  geom_col(width = 0.5, show.legend = FALSE) +
  geom_errorbar(aes(ymin = Avg_Productivity - Std_Dev,
                    ymax = Avg_Productivity + Std_Dev), width = 0.2) +
  geom_text(aes(label = Avg_Productivity), vjust = -0.5,
            fontface = "bold", size = 4.5) +
  scale_fill_manual(values = c("#55efc4","#fdcb6e","#e17055")) +
  labs(title = "Average Productivity by Device Usage Category",
       subtitle = "Error bars show standard deviation",
       x = "Device Usage Category", y = "Avg Productivity Score") +
  theme_minimal(base_size = 13)

Insight: Average productivity rises slightly with device usage, from 63.21 in the Low group to 65.35 in Moderate and 66.07 in High. This is the opposite of an inverse relationship, although the differences (under 3 points) are small next to the within-group standard deviation of about 9.5 to 9.9. The table gives no support for the claim that heavier screen time lowers productivity.
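A standardized effect size makes the Low versus High comparison easier to interpret. The sketch below computes Cohen's d from the summary statistics in the table above; using a pooled standard deviation is an assumption of this sketch, not something the original analysis does.

```r
# Summary statistics copied from the usage-category table
n_low  <- 486;  mean_low  <- 63.21; sd_low  <- 9.90
n_high <- 1215; mean_high <- 66.07; sd_high <- 9.73

# Pooled standard deviation across the two groups
pooled_sd <- sqrt(((n_low - 1) * sd_low^2 + (n_high - 1) * sd_high^2) /
                    (n_low + n_high - 2))

# Positive d means the High-usage group scores higher
cohens_d <- (mean_high - mean_low) / pooled_sd

cat("Difference (High - Low):", mean_high - mean_low, "\n")
cat("Cohen's d              :", round(cohens_d, 2), "\n")
```

By the usual rule of thumb a d near 0.3 is a small effect, and its sign here points toward higher productivity in the High-usage group.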

Level 4: Sorting & Ranking Data

Question 4.1: Rank Regions by Average Productivity

# Region Ranking
region_rank <- df %>%
  group_by(region) %>%
  summarise(
    Avg_Productivity = round(mean(productivity_score), 2),
    Std_Dev          = round(sd(productivity_score), 2),
    Count            = n(),
    .groups = "drop"
  ) %>%
  mutate(Rank = rank(-Avg_Productivity)) %>%
  arrange(Rank)

kable(region_rank, caption = "Regions Ranked by Average Productivity") %>%
  smart_kable()

Regions Ranked by Average Productivity
region	Avg_Productivity	Std_Dev	Count	Rank
Europe	65.56	9.89	797	1.5
North America	65.56	9.41	622	1.5
South America	65.38	9.52	425	3.0
Africa	65.27	9.54	578	4.0
Middle East	65.17	9.73	339	5.0
Asia	64.84	9.79	739	6.0

# region is a factor: convert to character so cat() prints the label, not the level code
cat("\nBest Region  :", as.character(region_rank$region[1]),
    "| Score:", region_rank$Avg_Productivity[1])

## 
## Best Region  : Europe | Score: 65.56

cat("\nWorst Region :", as.character(region_rank$region[nrow(region_rank)]),
    "| Score:", region_rank$Avg_Productivity[nrow(region_rank)], "\n")

## 
## Worst Region : Asia | Score: 64.84

ggplot(region_rank, aes(x = reorder(region, Avg_Productivity),
                         y = Avg_Productivity, fill = Avg_Productivity)) +
  geom_col(width = 0.6) +
  geom_text(aes(label = Avg_Productivity), hjust = -0.2, fontface = "bold") +
  scale_fill_gradient(low = "#e17055", high = "#00b894") +
  coord_flip() +
  labs(title = "Regions Ranked by Average Productivity Score",
       x = "Region", y = "Avg Productivity Score",
       fill = "Score") +
  theme_minimal(base_size = 13)

Insight: Regional differences in average productivity are small. Europe and North America tie at 65.56 and Asia is lowest at 64.84, a gap of 0.72 points against a within-region standard deviation of roughly 9.4 to 9.9. The ranking is therefore descriptive only and should not be read as evidence of systematic regional differences.
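Two details of the ranking code are worth illustrating: how `rank()` handles ties, and what `cat()` does with a factor. The sketch below uses three of the regional means from the table above; the object names are illustrative.

```r
region           <- factor(c("Europe", "North America", "Asia"))
avg_productivity <- c(65.56, 65.56, 64.84)

# Default ties.method = "average": tied values share the mean of their ranks
rank_average <- rank(-avg_productivity)

# ties.method = "min": tied values share the lowest rank (competition ranking)
rank_min <- rank(-avg_productivity, ties.method = "min")

print(rank_average)
print(rank_min)

# cat() on a factor prints the internal integer code, not the label
cat("Factor passed directly :", region[1], "\n")
cat("Converted to character :", as.character(region[1]), "\n")

best_region <- as.character(region[which.min(rank_min)])
```

With `ties.method = "min"` the table would show ranks 1, 1, 3 in place of 1.5, 1.5, 3.0, which many readers find easier to scan.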

Question 4.2: Top User Segments for Maximum Digital Dependence

# Segment by gender + region + device_type
seg_dep <- df %>%
  group_by(gender, region, device_type) %>%
  summarise(
    Avg_Dependence   = round(mean(digital_dependence_score), 2),
    Avg_Productivity = round(mean(productivity_score), 2),
    Count            = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(Avg_Dependence)) %>%
  slice_head(n = 12)

kable(seg_dep,
      caption = "Top 12 User Segments by Digital Dependence") %>%
  smart_kable()

Top 12 User Segments by Digital Dependence
gender	region	device_type	Avg_Dependence	Avg_Productivity	Count
Female	Africa	Tablet	40.26	65.94	67
Male	Europe	Android	40.01	67.06	86
Female	South America	Laptop	39.86	63.82	54
Female	South America	Tablet	39.64	66.34	54
Male	Asia	Laptop	39.25	65.57	88
Male	Middle East	Android	38.90	63.19	46
Female	South America	Android	38.68	64.02	49
Male	Africa	Tablet	38.19	65.53	73
Female	Asia	iPhone	38.12	64.64	78
Female	Middle East	iPhone	38.08	64.73	38
Female	Africa	Laptop	38.02	65.80	70
Male	North America	iPhone	37.90	66.33	73

Insight: The segments with the highest average digital dependence are Female/Africa/Tablet (40.26) and Male/Europe/Android (40.01). The top 12 segments span only 37.90 to 40.26 and contain 38 to 88 users each, so the ordering may partly reflect sampling noise. These profiles are a starting point for identifying at-risk segments, not yet a reliable basis for tailoring interventions.

Question 4.3: Device Type vs. Average Focus Score

# Focus Score by Device Type
focus_by_device <- df %>%
  group_by(device_type) %>%
  summarise(
    Avg_Focus = round(mean(focus_score), 2),
    Min_Focus = round(min(focus_score), 2),
    Max_Focus = round(max(focus_score), 2),
    Count     = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(Avg_Focus))

kable(focus_by_device, caption = "Focus Score Statistics by Device Type") %>%
  smart_kable()

Focus Score Statistics by Device Type
device_type	Avg_Focus	Max_Focus	Count
Android	42.92	100	903
iPhone	41.32	94	823
Tablet	41.20	97	888
Laptop	40.91	99	886

# device_type is a factor: convert to character so cat() prints the label, not the level code
cat("\nHighest Avg Focus:", as.character(focus_by_device$device_type[1]),
    "(", focus_by_device$Avg_Focus[1], ")")

## 
## Highest Avg Focus: Android ( 42.92 )

cat("\nLowest Avg Focus :", as.character(focus_by_device$device_type[nrow(focus_by_device)]),
    "(", focus_by_device$Avg_Focus[nrow(focus_by_device)], ")\n")

## 
## Lowest Avg Focus : Laptop ( 40.91 )

ggplot(focus_by_device, aes(x = reorder(device_type, -Avg_Focus),
                              y = Avg_Focus, fill = device_type)) +
  geom_col(width = 0.5, show.legend = FALSE) +
  geom_text(aes(label = Avg_Focus), vjust = -0.4, fontface = "bold", size = 4.5) +
  scale_fill_brewer(palette = "Paired") +
  labs(title = "Average Focus Score by Device Type",
       x = "Device Type", y = "Avg Focus Score") +
  theme_minimal(base_size = 13)

Insight: Average focus scores are close across device types, from 40.91 (Laptop) to 42.92 (Android). Android users score highest and laptop users lowest, which runs against the expectation that work-oriented devices go with better focus. With a spread of about 2 points on a score that reaches 100, device type appears to be a weak indicator of focus.

Level 5: Feature Engineering

Question 5.1: Total Digital Load

# Total Digital Load = device_hrs + social_media_hrs + study_hrs
df <- df %>%
  mutate(
    total_digital_load = device_hours_per_day +
                         (social_media_mins / 60) +
                         (study_mins / 60)
  )

cat("=== Total Digital Load Summary ===\n")

## === Total Digital Load Summary ===

summary(df$total_digital_load)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.877   8.652  11.245  11.776  14.357  29.080

# Distribution
ggplot(df, aes(x = total_digital_load)) +
  geom_histogram(bins = 30, fill = "#6c5ce7", color = "white", alpha = 0.85) +
  geom_vline(xintercept = mean(df$total_digital_load),
             color = "#d63031", linetype = "dashed", linewidth = 1) +
  annotate("text", x = mean(df$total_digital_load) + 0.5,
           y = 80, label = paste("Mean =",
           round(mean(df$total_digital_load),2)),
           color = "#d63031", fontface = "bold") +
  labs(title = "Distribution of Total Digital Load",
       subtitle = "Device hours + Social media hours + Study hours",
       x = "Total Digital Load (hours/day)", y = "Count") +
  theme_minimal(base_size = 13)

Insight: Total Digital Load has a median of 11.2 hours and a mean of 11.8 hours per day; a mean above the median is consistent with a right-skewed distribution. A quarter of users exceed 14.4 hours (the third quartile), and the maximum is 29.1 hours. Because the maximum exceeds 24 hours, the three components evidently overlap, so the metric is better treated as an index of digital intensity than as literal hours.

Question 5.2: Well-being Index

# Well-being Index (higher = better)
# Formula: normalize sleep & physical activity positively,
#          penalize stress and anxiety
df <- df %>%
  mutate(
    sleep_norm    = (sleep_hours - min(sleep_hours)) /
                    (max(sleep_hours) - min(sleep_hours)),
    phys_norm     = (physical_activity_days - min(physical_activity_days)) /
                    (max(physical_activity_days) - min(physical_activity_days)),
    stress_norm   = 1 - (stress_level - min(stress_level)) /
                        (max(stress_level) - min(stress_level)),
    anxiety_norm  = 1 - (anxiety_score - min(anxiety_score)) /
                        (max(anxiety_score) - min(anxiety_score)),
    wellbeing_index = round(
      (sleep_norm * 30 + phys_norm * 25 +
       stress_norm * 25 + anxiety_norm * 20), 2)
  )

cat("=== Well-being Index Summary ===\n")

## === Well-being Index Summary ===

summary(df$wellbeing_index)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.10   45.51   57.88   56.29   68.02   94.18

ggplot(df, aes(x = wellbeing_index, fill = gender)) +
  geom_density(alpha = 0.55, linewidth = 0.8) +
  scale_fill_manual(values = c("#fd79a8","#74b9ff","#55efc4")) +
  labs(title = "Well-being Index Distribution by Gender",
       x = "Well-being Index (0-100)", y = "Density", fill = "Gender") +
  theme_minimal(base_size = 13)

Insight: The Well-being Index combines min-max scaled sleep hours (weight 30), physical activity days (25), inverted stress (25), and inverted anxiety (20) into a 0-100 score, with an observed range of 8.10 to 94.18 and a median of 57.88. Because anxiety enters the index directly and women report higher average anxiety (see Question 3.1), some gender difference in the distribution is expected by construction. Sleep quality is not part of the index.
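The index formula can be expressed more compactly with a small scaling helper. The sketch below applies the same weights (30, 25, 25, 20) to three made-up users; the helper `min_max()` and the toy values are assumptions for illustration.

```r
# Hypothetical helper: rescale a numeric vector to the 0-1 range
min_max <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

# Toy users: worst case, middle case, best case on every component
toy <- data.frame(
  sleep_hours            = c(3, 7, 11),
  physical_activity_days = c(0, 4, 7),
  stress_level           = c(10, 5, 1),
  anxiety_score          = c(20, 8, 0)
)

# Sleep and activity count positively; stress and anxiety are inverted
toy$wellbeing_index <- round(
  30 * min_max(toy$sleep_hours) +
  25 * min_max(toy$physical_activity_days) +
  25 * (1 - min_max(toy$stress_level)) +
  20 * (1 - min_max(toy$anxiety_score)),
  2
)

print(toy)
```

Because the scaling uses the sample minimum and maximum, scores are relative to this dataset, and a column with a single constant value would produce a division by zero.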

Question 5.3: Behavioral Risk Categories

# Classify users into High-Risk / Balanced / Low-Risk
df <- df %>%
  mutate(
    risk_category = case_when(
      stress_level >= quantile(stress_level, 0.70) &
      digital_dependence_score >= quantile(digital_dependence_score, 0.70) &
      productivity_score <= quantile(productivity_score, 0.35)  ~ "High-Risk",

      stress_level <= quantile(stress_level, 0.35) &
      digital_dependence_score <= quantile(digital_dependence_score, 0.35) &
      productivity_score >= quantile(productivity_score, 0.65)  ~ "Low-Risk",

      TRUE ~ "Balanced"
    )
  )

risk_counts <- df %>%
  count(risk_category) %>%
  mutate(Proportion = round(n / sum(n) * 100, 1))

kable(risk_counts, caption = "User Distribution by Risk Category") %>%
  smart_kable()

User Distribution by Risk Category
risk_category	n	Proportion
Balanced	3147	89.9
High-Risk	165	4.7
Low-Risk	188	5.4

cat("\nRisk Category Breakdown:\n")

## 
## Risk Category Breakdown:

print(risk_counts)

##   risk_category    n Proportion
## 1      Balanced 3147       89.9
## 2     High-Risk  165        4.7
## 3      Low-Risk  188        5.4

Insight: Most users (89.9%) fall into the Balanced category. 165 users (4.7%) are High-Risk, combining stress and digital dependence at or above the 70th percentile with productivity at or below the 35th percentile, and 188 users (5.4%) are Low-Risk. The High-Risk group warrants priority attention in any digital wellness intervention strategy.
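The three-way rule can also be written as a function that computes each percentile threshold once, which makes the cut-offs easy to inspect or change. This is a sketch with made-up values; `classify_risk()` is a hypothetical helper that mirrors the percentiles used above.

```r
# Hypothetical helper mirroring the 70th / 35th / 65th percentile rule
classify_risk <- function(stress, dependence, productivity) {
  s_hi <- unname(quantile(stress, 0.70))
  s_lo <- unname(quantile(stress, 0.35))
  d_hi <- unname(quantile(dependence, 0.70))
  d_lo <- unname(quantile(dependence, 0.35))
  p_lo <- unname(quantile(productivity, 0.35))
  p_hi <- unname(quantile(productivity, 0.65))

  high <- stress >= s_hi & dependence >= d_hi & productivity <= p_lo
  low  <- stress <= s_lo & dependence <= d_lo & productivity >= p_hi

  ifelse(high, "High-Risk", ifelse(low, "Low-Risk", "Balanced"))
}

# Toy users (illustrative values only)
stress       <- c(9.5, 2, 5, 6, 4)
dependence   <- c(60, 20, 35, 40, 30)
productivity <- c(50, 85, 65, 66, 64)

risk <- classify_risk(stress, dependence, productivity)
print(data.frame(stress, dependence, productivity, risk))
```

All three conditions must hold at once, which is why the large majority of users end up in the Balanced category.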

Data Visualization

V1: Bar Chart - Total Digital Load by Region & Device Type

load_summary <- df %>%
  group_by(region, device_type) %>%
  summarise(Avg_Load = round(mean(total_digital_load), 2), .groups = "drop")

ggplot(load_summary, aes(x = region, y = Avg_Load, fill = device_type)) +
  geom_col(position = "dodge", width = 0.7) +
  scale_fill_brewer(palette = "Set1", name = "Device Type") +
  labs(title = "V1: Average Total Digital Load by Region & Device Type",
       subtitle = "Grouped bar chart comparing digital burden across geographies",
       x = "Region", y = "Avg Total Digital Load (hrs/day)") +
  theme_minimal(base_size = 13) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1),
        legend.position = "top")

V2: Histogram - Productivity Score Distribution

ggplot(df, aes(x = productivity_score, fill = after_stat(count))) +
  geom_histogram(bins = 35, color = "white", linewidth = 0.3) +
  scale_fill_gradient(low = "#b2bec3", high = "#6c5ce7") +
  geom_vline(xintercept = mean(df$productivity_score),
             color = "#d63031", linetype = "dashed", linewidth = 1) +
  annotate("text",
           x    = mean(df$productivity_score) + 2,
           y    = 140,
           label = paste("Mean =", round(mean(df$productivity_score), 1)),
           color = "#d63031", fontface = "bold", size = 4) +
  labs(title = "V2: Distribution of Productivity Scores",
       subtitle = "Dashed line = mean productivity score",
       x = "Productivity Score", y = "Number of Users") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

V3: Pie Chart - User Risk Level Proportions

risk_pie <- df %>%
  count(risk_category) %>%
  mutate(
    pct   = n / sum(n),
    label = paste0(risk_category, "\n", round(pct*100,1), "%")
  )

ggplot(risk_pie, aes(x = "", y = pct, fill = risk_category)) +
  geom_col(width = 1, color = "white", linewidth = 1.2) +
  coord_polar(theta = "y") +
  geom_text(aes(label = label), position = position_stack(vjust = 0.5),
            fontface = "bold", size = 4.5, color = "white") +
  scale_fill_manual(values = c("High-Risk"="#d63031",
                                "Balanced" ="#fdcb6e",
                                "Low-Risk" ="#00b894")) +
  labs(title = "V3: User Distribution by Risk Category",
       fill = "Risk Level") +
  theme_void(base_size = 14) +
  theme(legend.position = "right")

V4: Pair Plot - Device Usage, Productivity, Stress & Dependence

set.seed(42)   # Arbitrary seed so the 500-row sample is reproducible
pair_vars <- df %>%
  select(device_hours_per_day, productivity_score,
         stress_level, digital_dependence_score, sleep_hours) %>%
  sample_n(500)   # Sample for performance

ggpairs(pair_vars,
        lower = list(continuous = wrap("smooth", alpha = 0.15,
                                       color = "#6c5ce7", size = 0.5)),
        diag  = list(continuous = wrap("densityDiag", fill = "#74b9ff",
                                       alpha = 0.6)),
        upper = list(continuous = wrap("cor", size = 4, color = "#2d3436")),
        title = "V4: Pair Plot - Key Behavioral & Mental Health Variables") +
  theme_minimal(base_size = 10)

V5: Boxplot - Productivity Scores Across Regions

ggplot(df, aes(x = reorder(region, productivity_score, FUN = median),
               y = productivity_score, fill = region)) +
  geom_boxplot(outlier.alpha = 0.3, outlier.size = 1,
               notch = TRUE, notchwidth = 0.6, show.legend = FALSE) +
  geom_jitter(width = 0.15, alpha = 0.07, size = 0.8, color = "#2d3436") +
  scale_fill_brewer(palette = "Pastel1") +
  coord_flip() +
  labs(title = "V5: Productivity Score Distribution Across Regions",
       subtitle = "Notched boxplots - notch shows 95% CI of median",
       x = "Region", y = "Productivity Score") +
  theme_minimal(base_size = 13)

V6: Boxplot - Sleep Hours & Stress by Demographic Groups

p6a <- ggplot(df, aes(x = gender, y = sleep_hours, fill = gender)) +
  geom_boxplot(show.legend = FALSE, outlier.alpha = 0.3) +
  scale_fill_manual(values = c("#fd79a8","#74b9ff","#55efc4")) +
  labs(title = "Sleep Hours by Gender", x = "Gender", y = "Sleep (hrs)") +
  theme_minimal(base_size = 12)

p6b <- ggplot(df, aes(x = gender, y = stress_level, fill = gender)) +
  geom_boxplot(show.legend = FALSE, outlier.alpha = 0.3) +
  scale_fill_manual(values = c("#fd79a8","#74b9ff","#55efc4")) +
  labs(title = "Stress Level by Gender", x = "Gender", y = "Stress Level") +
  theme_minimal(base_size = 12)

gridExtra::grid.arrange(p6a, p6b, ncol = 2,
  top = "V6: Sleep Hours and Stress Levels Across Gender Groups")

V7: Line Chart - Device Usage vs Productivity & Stress

# Bin device usage for smooth line
line_data <- df %>%
  mutate(usage_bin = round(device_hours_per_day)) %>%
  group_by(usage_bin) %>%
  summarise(
    Avg_Productivity = mean(productivity_score),
    Avg_Stress       = mean(stress_level),
    .groups = "drop"
  ) %>%
  pivot_longer(cols = c(Avg_Productivity, Avg_Stress),
               names_to = "Metric", values_to = "Value")

ggplot(line_data, aes(x = usage_bin, y = Value,
                       color = Metric, group = Metric)) +
  geom_line(linewidth = 1.3) +
  geom_point(size = 2.5) +
  scale_color_manual(values = c("Avg_Productivity" = "#00b894",
                                 "Avg_Stress"       = "#d63031"),
                     labels = c("Avg Productivity", "Avg Stress")) +
  labs(title = "V7: Device Usage vs Productivity & Stress",
       subtitle = "As daily screen time increases - what happens to well-being?",
       x = "Device Usage (hours/day, rounded)",
       y = "Score", color = "Metric") +
  theme_minimal(base_size = 13)

Insight: The line chart compares how average productivity and average stress change as daily device usage rises. Treat it as descriptive only: each point is a bin mean, and bins at the extremes of usage may contain few users. The regression in 1.3 finds a small positive slope for productivity on device hours (0.26 points per hour, R2 = 0.0075), so the chart alone does not establish that more screen time lowers productivity.
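Before reading a trend off binned means, it can help to confirm its direction on the raw observations. The sketch below is one way to do that with a Spearman rank correlation. It uses synthetic stand-in data (not the survey data) in which the true directions are known by construction, and `trend_check()` is a hypothetical helper introduced only for illustration.

```r
# Synthetic stand-in data (NOT the survey data): productivity falls and
# stress rises with usage by construction, so the expected signs are known.
set.seed(1)
n <- 500
demo <- data.frame(device_hours_per_day = runif(n, 1, 14))
demo$productivity_score <- 70 - 1.5 * demo$device_hours_per_day + rnorm(n, sd = 5)
demo$stress_level       <- 3 + 0.4 * demo$device_hours_per_day + rnorm(n, sd = 1)

# Hypothetical helper: Spearman rank correlation and its p-value
trend_check <- function(x, y) {
  ct <- cor.test(x, y, method = "spearman", exact = FALSE)
  c(rho = unname(ct$estimate), p_value = ct$p.value)
}

prod_trend   <- trend_check(demo$device_hours_per_day, demo$productivity_score)
stress_trend <- trend_check(demo$device_hours_per_day, demo$stress_level)

# One row per outcome: sign of rho gives the direction of the trend
print(rbind(productivity = prod_trend, stress = stress_trend))
```

Applied to `df`, the same two calls would show whether the directions seen in the chart hold across all 3,500 users rather than only in the bin averages.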

Advanced Engineering

1.3 Simple Linear Regression: Device Usage to Productivity

# SLR Model
slr_model <- lm(productivity_score ~ device_hours_per_day, data = df)
summary(slr_model)

## 
## Call:
## lm(formula = productivity_score ~ device_hours_per_day, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.880  -6.243   0.101   6.159  30.983 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          63.41483    0.40213 157.698  < 2e-16 ***
## device_hours_per_day  0.25752    0.05025   5.125 3.14e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.63 on 3498 degrees of freedom
## Multiple R-squared:  0.007452,   Adjusted R-squared:  0.007169 
## F-statistic: 26.26 on 1 and 3498 DF,  p-value: 3.138e-07

ggplot(df, aes(x = device_hours_per_day, y = productivity_score)) +
  geom_point(alpha = 0.15, color = "#6c5ce7", size = 1) +
  geom_smooth(method = "lm", color = "#d63031", se = TRUE,
              linewidth = 1.5) +
  labs(title = "Simple Linear Regression: Device Usage to Productivity",
       subtitle = paste("R2 =", round(summary(slr_model)$r.squared, 4)),
       x = "Device Hours per Day",
       y = "Productivity Score") +
  theme_minimal(base_size = 13)

Insight: The slope for device usage is 0.258 (SE 0.050, p = 3.1e-07), so each additional hour of daily device use is associated with a productivity score about 0.26 points higher. The effect is statistically significant but practically small: R2 = 0.0075 means device usage alone explains less than 1% of the variance in productivity.
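The quantities discussed here can be recomputed from the numbers printed in the summary output above. The sketch below copies the slope, standard error, residual degrees of freedom and R-squared from that output; the variable names are illustrative and not part of the original analysis.

```r
# Values copied from the summary(slr_model) output shown above
slope    <- 0.25752    # estimate for device_hours_per_day
se       <- 0.05025    # its standard error
df_resid <- 3498       # residual degrees of freedom
r2       <- 0.007452   # multiple R-squared

t_stat <- slope / se                      # reproduces the reported t value
t_crit <- qt(0.975, df_resid)             # two-sided 95% critical value
ci     <- slope + c(-1, 1) * t_crit * se  # 95% confidence interval for the slope
r_abs  <- sqrt(r2)                        # |correlation| implied by R-squared in SLR

cat(sprintf("t = %.3f\n", t_stat))
cat(sprintf("95%% CI for slope: [%.3f, %.3f]\n", ci[1], ci[2]))
cat(sprintf("Share of variance explained: %.2f%%\n", 100 * r2))
cat(sprintf("Implied |r|: %.3f\n", r_abs))
```

The interval is roughly 0.16 to 0.36 points per hour: clearly above zero, yet small next to the residual standard error of 9.63. On the fitted object, `confint(slr_model)` returns the same interval directly.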

1.4 Multiple Linear Regression: Predicting Productivity

# MLR Model
mlr_model <- lm(productivity_score ~
                  device_hours_per_day +
                  sleep_hours +
                  stress_level +
                  anxiety_score +
                  focus_score,
                data = df)
summary(mlr_model)

## 
## Call:
## lm(formula = productivity_score ~ device_hours_per_day + sleep_hours + 
##     stress_level + anxiety_score + focus_score, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.557  -6.106  -0.030   6.149  32.307 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          61.047847   1.517685  40.224  < 2e-16 ***
## device_hours_per_day  0.546643   0.075134   7.276 4.24e-13 ***
## sleep_hours           0.264862   0.156494   1.692   0.0906 .  
## stress_level          0.059999   0.051968   1.155   0.2484    
## anxiety_score        -0.247097   0.042711  -5.785 7.88e-09 ***
## focus_score          -0.004799   0.006890  -0.696   0.4862    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.586 on 3494 degrees of freedom
## Multiple R-squared:  0.01771,    Adjusted R-squared:  0.0163 
## F-statistic:  12.6 on 5 and 3494 DF,  p-value: 3.767e-12

# Coefficient plot
coef_df <- as.data.frame(summary(mlr_model)$coefficients)
coef_df$Variable <- rownames(coef_df)
coef_df <- coef_df %>% filter(Variable != "(Intercept)")

ggplot(coef_df, aes(x = reorder(Variable, Estimate),
                     y = Estimate, fill = Estimate > 0)) +
  geom_col(width = 0.5) +
  geom_errorbar(aes(ymin = Estimate - `Std. Error`,
                    ymax = Estimate + `Std. Error`), width = 0.2) +
  coord_flip() +
  scale_fill_manual(values = c("TRUE"="#00b894","FALSE"="#d63031"),
                    breaks = c("TRUE","FALSE"),   # pin break order so labels match colors
                    labels = c("Positive","Negative"),
                    name   = "Effect Direction") +
  labs(title = "MLR: Predictor Effects on Productivity Score",
       x = "Predictor Variable", y = "Coefficient Estimate") +
  theme_minimal(base_size = 13)

Insight: Only two predictors are statistically significant at the 5% level: device_hours_per_day (0.547, positive) and anxiety_score (-0.247, negative). sleep_hours is positive but only marginally significant (p = 0.09), while stress_level and focus_score cannot be distinguished from zero. The model explains under 2% of the variance in productivity (adjusted R2 = 0.016), so these predictors offer at most a weak multi-factor explanation.
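Raw coefficients are expressed per unit of each predictor, so their sizes are not directly comparable across variables measured on different scales. One common remedy, which the original analysis does not use, is to refit the model on z-scored variables. The sketch below illustrates the idea on synthetic data (not the survey data); the two predictors and their effect sizes are assumptions chosen so that the raw coefficients differ fourfold while the standardized effects are about equal.

```r
# Synthetic stand-in data (NOT the survey data)
set.seed(42)
n <- 400
demo <- data.frame(
  sleep_hours   = rnorm(n, mean = 7,  sd = 1),   # narrow scale
  anxiety_score = rnorm(n, mean = 10, sd = 4)    # wider scale
)
demo$productivity_score <- 60 + 2 * demo$sleep_hours -
  0.5 * demo$anxiety_score + rnorm(n, sd = 3)

# Fit once on raw units, once on z-scored columns
raw_fit <- lm(productivity_score ~ sleep_hours + anxiety_score, data = demo)
std_fit <- lm(productivity_score ~ sleep_hours + anxiety_score,
              data = as.data.frame(scale(demo)))

raw_coef <- coef(raw_fit)[-1]   # drop the intercept
std_coef <- coef(std_fit)[-1]

# Side-by-side comparison of raw and standardized slopes
print(round(cbind(raw = raw_coef, standardized = std_coef), 3))
```

For the report's model, the equivalent step would be to scale the predictor columns of `df` before calling `lm()`, then plot the standardized estimates in place of the raw ones.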

Final Conclusion

## 
## === FINAL PROJECT SUMMARY & KEY FINDINGS ===

Summary of Key Findings

#	Finding	Implication
1	Weak Device-Productivity Relationship	Daily device hours show a small positive association with productivity (slope 0.26, R2 = 0.007); the regressions do not show screen time lowering productivity
2	Digital-Stress Feedback Loop	High digital dependence and high stress co-occur significantly, compounding mental health risk
3	Regional Productivity Gaps	Meaningful productivity disparities exist across regions, reflecting socio-economic and cultural digital usage patterns
4	Device Type Matters	Laptops and tablets associate with higher focus; smartphones with fragmented attention
5	Sleep is Central to Well-being	Sleep hours strongly influence the well-being index and correlate inversely with stress
6	Three Distinct User Profiles	K-Means reveals Balanced, High-Risk, and Low-Digital users - enabling targeted interventions
7	KNN Accurately Predicts Risk	ML model confirms stress, anxiety, sleep, and digital dependence are the most predictive risk factors

Recommendations

Digital Wellness Programs should be prioritized for High-Risk cluster users — particularly those using smartphones >8 hrs/day
Sleep hygiene campaigns can improve both well-being and productivity simultaneously
Region-specific interventions are needed given the significant geographic variance in outcomes
Focus-enhancement tools (productivity apps, app timers) should be promoted for high-engagement users
Continuous monitoring using a composite Digital Load + Well-being Index can flag at-risk users early

Report generated using R | Dataset: Digital Behavior & Mental Health Survey (N=3,500)

Digital Behavior, Productivity & Mental Health Analysis

vinay

2026-05-04