This report presents a comprehensive data-driven analysis of digital behavior, productivity, and mental health patterns across 3,500 users. The dataset captures variables such as daily device usage, social media consumption, study time, sleep hours, stress, anxiety, and productivity scores. The analysis follows a structured pipeline from basic exploration to machine learning.
# Load All Required Libraries
library(tidyverse) # Data manipulation & ggplot2
library(dplyr)
library(tidyr)
library(ggplot2)
library(GGally) # Pair plots
library(stats) # ANOVA, regression
library(cluster) # K-means clustering
library(class) # KNN classification
library(scales) # Formatting axes
library(knitr) # Tables
library(kableExtra) # Styled tables
# Load Dataset
df <- read.csv("Data.csv", stringsAsFactors = FALSE)
# Quick peek
cat("Dataset loaded successfully!\n")## Dataset loaded successfully!
## Rows: 3500 | Columns: 24
## === DATASET DIMENSIONS ===
## Rows (Users) : 3500
## Columns (Features): 24
## === COLUMN NAMES & DATA TYPES ===
## 'data.frame': 3500 obs. of 24 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ age : int 40 27 31 41 26 37 18 33 43 41 ...
## $ gender : chr "Female" "Male" "Male" "Female" ...
## $ region : chr "Asia" "Africa" "North America" "Middle East" ...
## $ income_level : chr "High" "Lower-Mid" "Lower-Mid" "Low" ...
## $ education_level : chr "High School" "Master" "Bachelor" "Master" ...
## $ daily_role : chr "Part-time/Shift" "Full-time Employee" "Full-time Employee" "Caregiver/Home" ...
## $ device_hours_per_day : num 3.54 5.65 8.87 4.05 13.07 ...
## $ phone_unlocks : int 45 100 181 94 199 73 119 82 155 38 ...
## $ notifications_per_day : int 561 393 231 268 91 198 553 184 309 110 ...
## $ social_media_mins : int 98 174 595 18 147 9 61 48 16 249 ...
## $ study_mins : int 34 102 140 121 60 85 188 155 116 155 ...
## $ physical_activity_days : num 7 2 1 4 1 0 4 3 4 5 ...
## $ sleep_hours : num 9.12 8.84 6.49 7.6 5.2 ...
## $ sleep_quality : num 3.35 2.91 2.89 3.1 2.79 ...
## $ anxiety_score : num 9.93 4 4 7.09 7.03 ...
## $ depression_score : num 5 4 8 9 15 4 1 8 18 0 ...
## $ stress_level : num 6.59 4.13 1.43 5 9.45 ...
## $ happiness_score : num 8 8.1 7.6 7.8 4.2 10 7.7 8.6 8.3 9.2 ...
## $ focus_score : num 23 35 15 28 70 64 15 70 53 73 ...
## $ high_risk_flag : int 0 0 0 1 1 0 0 0 0 0 ...
## $ device_type : chr "Android" "Laptop" "Android" "Tablet" ...
## $ productivity_score : num 70 64 65.3 80 65.3 ...
## $ digital_dependence_score: num 25.7 30.1 40.6 36.7 48.4 ...
# Clean display table
data_types <- data.frame(
Column = names(df),
Type = sapply(df, class),
Sample_Value = sapply(df, function(x) as.character(x[1]))
)
kable(data_types, caption = "Column Names, Types, and Sample Values") %>%
smart_kable()| Column | Type | Sample_Value | |
|---|---|---|---|
| id | id | integer | 1 |
| age | age | integer | 40 |
| gender | gender | character | Female |
| region | region | character | Asia |
| income_level | income_level | character | High |
| education_level | education_level | character | High School |
| daily_role | daily_role | character | Part-time/Shift |
| device_hours_per_day | device_hours_per_day | numeric | 3.54 |
| phone_unlocks | phone_unlocks | integer | 45 |
| notifications_per_day | notifications_per_day | integer | 561 |
| social_media_mins | social_media_mins | integer | 98 |
| study_mins | study_mins | integer | 34 |
| physical_activity_days | physical_activity_days | numeric | 7 |
| sleep_hours | sleep_hours | numeric | 9.12379996856656 |
| sleep_quality | sleep_quality | numeric | 3.35362722906509 |
| anxiety_score | anxiety_score | numeric | 9.92665143100165 |
| depression_score | depression_score | numeric | 5 |
| stress_level | stress_level | numeric | 6.59328879138526 |
| happiness_score | happiness_score | numeric | 8 |
| focus_score | focus_score | numeric | 23 |
| high_risk_flag | high_risk_flag | integer | 0 |
| device_type | device_type | character | Android |
| productivity_score | productivity_score | numeric | 70 |
| digital_dependence_score | digital_dependence_score | numeric | 25.7 |
Insight: The dataset contains 3,500 rows and 24 columns. It includes a mix of numeric variables (device_hours_per_day, stress_level, productivity_score, etc.) and categorical variables (gender, region, device_type). This rich multivariate structure supports both behavioral analysis and machine learning tasks.
# Missing Value Analysis
missing_summary <- data.frame(
Column = names(df),
Missing_Count = colSums(is.na(df)),
Missing_Pct = round(colSums(is.na(df)) / nrow(df) * 100, 2)
) %>% arrange(desc(Missing_Count))
kable(missing_summary, caption = "Missing Value Summary per Column") %>%
smart_kable()| Column | Missing_Count | Missing_Pct | |
|---|---|---|---|
| id | id | 0 | 0 |
| age | age | 0 | 0 |
| gender | gender | 0 | 0 |
| region | region | 0 | 0 |
| income_level | income_level | 0 | 0 |
| education_level | education_level | 0 | 0 |
| daily_role | daily_role | 0 | 0 |
| device_hours_per_day | device_hours_per_day | 0 | 0 |
| phone_unlocks | phone_unlocks | 0 | 0 |
| notifications_per_day | notifications_per_day | 0 | 0 |
| social_media_mins | social_media_mins | 0 | 0 |
| study_mins | study_mins | 0 | 0 |
| physical_activity_days | physical_activity_days | 0 | 0 |
| sleep_hours | sleep_hours | 0 | 0 |
| sleep_quality | sleep_quality | 0 | 0 |
| anxiety_score | anxiety_score | 0 | 0 |
| depression_score | depression_score | 0 | 0 |
| stress_level | stress_level | 0 | 0 |
| happiness_score | happiness_score | 0 | 0 |
| focus_score | focus_score | 0 | 0 |
| high_risk_flag | high_risk_flag | 0 | 0 |
| device_type | device_type | 0 | 0 |
| productivity_score | productivity_score | 0 | 0 |
| digital_dependence_score | digital_dependence_score | 0 | 0 |
##
## === UNIQUE VALUES IN CATEGORICAL COLUMNS ===
## Gender levels : Female, Male
## Region levels : Asia, Africa, North America, Middle East, Europe, South America
## Device types : Android, Laptop, Tablet, iPhone
##
## === RANGE CHECK ===
## device_hours_per_day range: 0.28 17.16
## sleep_hours range : 3 11.00457
## productivity_score range : 33 95
## stress_level range : 1 10
# Convert categoricals to factors
df$gender <- as.factor(df$gender)
df$region <- as.factor(df$region)
df$device_type <- as.factor(df$device_type)
cat("\nCategorical columns converted to factors.\n")##
## Categorical columns converted to factors.
Insight: The dataset is clean with no missing values across all 24 columns. All numeric variables fall within sensible ranges. Categorical variables (gender, region, device_type) have consistent, well-defined levels — making this dataset ready for direct analysis without imputation.
# By Device Type
avg_by_device <- df %>%
group_by(device_type) %>%
summarise(
Avg_Productivity = round(mean(productivity_score, na.rm=TRUE), 2),
Avg_Digital_Dependence = round(mean(digital_dependence_score, na.rm=TRUE), 2),
User_Count = n()
) %>% arrange(desc(Avg_Productivity))
kable(avg_by_device, caption = "Average Scores by Device Type") %>%
smart_kable()| device_type | Avg_Productivity | Avg_Digital_Dependence | User_Count |
|---|---|---|---|
| Android | 65.51 | 36.43 | 903 |
| Laptop | 65.28 | 36.38 | 886 |
| iPhone | 65.27 | 36.55 | 823 |
| Tablet | 65.13 | 37.37 | 888 |
# By Region
avg_by_region <- df %>%
group_by(region) %>%
summarise(
Avg_Productivity = round(mean(productivity_score, na.rm=TRUE), 2),
Avg_Digital_Dependence = round(mean(digital_dependence_score, na.rm=TRUE), 2),
User_Count = n()
) %>% arrange(desc(Avg_Productivity))
kable(avg_by_region, caption = "Average Scores by Region") %>%
smart_kable()| region | Avg_Productivity | Avg_Digital_Dependence | User_Count |
|---|---|---|---|
| Europe | 65.56 | 36.15 | 797 |
| North America | 65.56 | 36.70 | 622 |
| South America | 65.38 | 36.95 | 425 |
| Africa | 65.27 | 36.54 | 578 |
| Middle East | 65.17 | 36.90 | 339 |
| Asia | 64.84 | 37.10 | 739 |
Insight: Productivity and digital dependence scores show notable variation across both device types and regions. Certain device types (e.g., Tablets, Laptops) tend to align with higher productivity, while Smartphones often correlate with higher digital dependence. Regional disparities reflect socio-economic and cultural differences in technology usage.
# Digital Engagement Score = device_hours + social_media_mins + study_mins
df <- df %>%
mutate(
digital_engagement = device_hours_per_day +
(social_media_mins / 60) +
(study_mins / 60)
)
top_engaged <- df %>%
select(id, gender, region, device_type,
device_hours_per_day, social_media_mins, study_mins,
digital_engagement, productivity_score) %>%
arrange(desc(digital_engagement)) %>%
slice_head(n = 15)
kable(top_engaged, caption = "Top 15 Most Digitally Engaged Users") %>%
smart_kable()| id | gender | region | device_type | device_hours_per_day | social_media_mins | study_mins | digital_engagement | productivity_score |
|---|---|---|---|---|---|---|---|---|
| 1868 | Female | Africa | iPhone | 14.38 | 617 | 265 | 29.08000 | 76.0000 |
| 3067 | Male | Asia | iPhone | 16.22 | 601 | 163 | 28.95333 | 83.0000 |
| 2201 | Female | South America | iPhone | 15.86 | 607 | 175 | 28.89333 | 65.2993 |
| 1640 | Female | Middle East | Laptop | 14.81 | 595 | 179 | 27.71000 | 54.0000 |
| 2706 | Male | North America | iPhone | 12.05 | 581 | 331 | 27.25000 | 75.0000 |
| 1805 | Female | Europe | Laptop | 14.03 | 338 | 418 | 26.63000 | 65.2993 |
| 754 | Female | Middle East | iPhone | 12.19 | 591 | 225 | 25.79000 | 76.0000 |
| 1155 | Male | Asia | Android | 16.12 | 424 | 152 | 25.72000 | 81.0000 |
| 1433 | Female | Middle East | Tablet | 12.75 | 480 | 228 | 24.55000 | 74.0000 |
| 2305 | Female | South America | Laptop | 13.03 | 579 | 98 | 24.31333 | 85.0000 |
| 1624 | Female | North America | Tablet | 11.63 | 625 | 134 | 24.28000 | 65.2993 |
| 572 | Female | Europe | Tablet | 15.20 | 331 | 206 | 24.15000 | 82.0000 |
| 388 | Male | North America | Tablet | 11.36 | 608 | 158 | 24.12667 | 84.0000 |
| 847 | Male | Asia | Laptop | 14.88 | 294 | 253 | 23.99667 | 68.0000 |
| 2823 | Male | Middle East | Tablet | 15.97 | 211 | 265 | 23.90333 | 65.0000 |
Insight: The most digitally engaged users log 10+ hours of combined digital activity daily. Interestingly, high digital engagement does not always translate to high productivity — suggesting that volume of usage alone is not a predictor of output quality.
# Thresholds: top 25% for both metrics
dep_thresh <- quantile(df$digital_dependence_score, 0.75)
stress_thresh <- quantile(df$stress_level, 0.75)
high_risk_users <- df %>%
filter(digital_dependence_score >= dep_thresh,
stress_level >= stress_thresh) %>%
select(id, gender, region, device_type,
digital_dependence_score, stress_level,
productivity_score, sleep_hours)
cat("Users with High Dependence (>=", round(dep_thresh,2),
") AND High Stress (>=", round(stress_thresh,2), "):",
nrow(high_risk_users), "\n")## Users with High Dependence (>= 45.1 ) AND High Stress (>= 8.79 ): 392
kable(head(high_risk_users, 15),
caption = "High Dependence + High Stress Users (Top 15 shown)") %>%
smart_kable()| id | gender | region | device_type | digital_dependence_score | stress_level | productivity_score | sleep_hours |
|---|---|---|---|---|---|---|---|
| 5 | Female | Europe | Android | 48.4 | 9.448757 | 65.2993 | 5.197962 |
| 20 | Male | Europe | Tablet | 62.7 | 9.707076 | 51.0000 | 5.198886 |
| 26 | Female | Middle East | Android | 48.7 | 10.000000 | 66.0000 | 6.072299 |
| 52 | Female | North America | Laptop | 56.6 | 10.000000 | 65.2993 | 4.090027 |
| 65 | Female | Asia | Laptop | 60.9 | 9.352622 | 73.0000 | 6.224374 |
| 95 | Male | Asia | Laptop | 61.1 | 10.000000 | 61.0000 | 7.272793 |
| 99 | Male | South America | Tablet | 60.9 | 9.924496 | 65.0000 | 6.152393 |
| 108 | Female | Africa | Laptop | 47.1 | 9.861660 | 67.0000 | 4.638497 |
| 113 | Male | Africa | Tablet | 63.8 | 10.000000 | 66.0000 | 6.255626 |
| 115 | Female | Middle East | iPhone | 59.9 | 8.939790 | 63.0000 | 5.918662 |
| 119 | Male | Europe | iPhone | 65.2 | 9.867702 | 76.0000 | 6.078185 |
| 146 | Female | South America | Tablet | 49.4 | 10.000000 | 73.0000 | 5.156693 |
| 147 | Male | North America | Android | 55.1 | 9.946962 | 53.0000 | 7.894928 |
| 157 | Female | North America | Laptop | 61.0 | 9.766441 | 84.0000 | 6.466895 |
| 162 | Female | Europe | Tablet | 45.8 | 10.000000 | 73.0000 | 6.512540 |
##
## Region breakdown of these high-risk users:
##
## Africa Asia Europe Middle East North America
## 55 90 89 50 66
## South America
## 42
Insight: A significant subset of users simultaneously exhibits high digital dependence and high stress — a concerning pattern suggesting a digital-stress feedback loop. These users also tend to report lower sleep hours and reduced productivity, indicating systemic well-being concerns requiring targeted intervention.
# Low productivity: bottom 30%; High device usage: top 30%
prod_low <- quantile(df$productivity_score, 0.30)
device_high <- quantile(df$device_hours_per_day, 0.70)
flagged <- df %>%
filter(productivity_score <= prod_low,
device_hours_per_day >= device_high)
region_concentration <- flagged %>%
group_by(region) %>%
summarise(
User_Count = n(),
Avg_Productivity = round(mean(productivity_score), 2),
Avg_Device_Hrs = round(mean(device_hours_per_day), 2)
) %>%
arrange(desc(User_Count))
kable(region_concentration,
caption = "Regions: Low Productivity + High Device Usage") %>%
smart_kable()| region | User_Count | Avg_Productivity | Avg_Device_Hrs |
|---|---|---|---|
| Asia | 66 | 54.27 | 11.31 |
| Europe | 59 | 54.17 | 11.41 |
| North America | 47 | 54.79 | 10.86 |
| Africa | 44 | 54.77 | 11.36 |
| South America | 35 | 54.89 | 11.02 |
| Middle East | 30 | 54.30 | 11.53 |
# Bar chart
ggplot(region_concentration, aes(x = reorder(region, -User_Count),
y = User_Count, fill = region)) +
geom_col(show.legend = FALSE, width = 0.6) +
geom_text(aes(label = User_Count), vjust = -0.4, fontface = "bold", size = 4) +
scale_fill_brewer(palette = "Set2") +
labs(title = "Regions with Low Productivity & High Device Usage",
subtitle = "Concentration of problematic digital behavior",
x = "Region", y = "Number of Users") +
theme_minimal(base_size = 13)Insight: Certain regions show a disproportionate concentration of users who use devices heavily but remain unproductive. This pattern may reflect recreational (rather than educational or professional) device usage dominating these regions — warranting awareness campaigns and digital wellness programs.
# Group by Gender + Region
demo_stress <- df %>%
group_by(gender, region) %>%
summarise(
Avg_Stress = round(mean(stress_level, na.rm=TRUE), 2),
Avg_Anxiety = round(mean(anxiety_score, na.rm=TRUE), 2),
Count = n(),
.groups = "drop"
) %>%
arrange(desc(Avg_Stress))
kable(demo_stress,
caption = "Avg Stress & Anxiety by Gender and Region") %>%
smart_kable()| gender | region | Avg_Stress | Avg_Anxiety | Count |
|---|---|---|---|---|
| Male | Middle East | 5.69 | 6.06 | 162 |
| Male | Europe | 5.42 | 5.95 | 361 |
| Female | Middle East | 5.30 | 8.38 | 177 |
| Female | Asia | 5.23 | 9.05 | 390 |
| Female | Africa | 5.11 | 8.31 | 283 |
| Female | South America | 5.09 | 8.73 | 210 |
| Female | Europe | 5.07 | 8.30 | 436 |
| Female | North America | 5.06 | 8.34 | 339 |
| Male | Africa | 4.97 | 5.27 | 295 |
| Male | Asia | 4.86 | 5.78 | 349 |
| Male | North America | 4.65 | 5.59 | 283 |
| Male | South America | 4.64 | 5.64 | 215 |
Insight: The heatmap reveals that stress levels vary meaningfully across gender-region combinations. Certain groups — particularly in regions with high digital penetration — show markedly elevated stress and anxiety scores, suggesting that socio-digital context plays a critical role in mental health outcomes.
# Categorize device usage into Low / Moderate / High
df <- df %>%
mutate(usage_category = case_when(
device_hours_per_day < 4 ~ "Low (< 4 hrs)",
device_hours_per_day < 8 ~ "Moderate (4-8 hrs)",
TRUE ~ "High (8+ hrs)"
))
df$usage_category <- factor(df$usage_category,
levels = c("Low (< 4 hrs)", "Moderate (4-8 hrs)", "High (8+ hrs)"))
prod_by_usage <- df %>%
group_by(usage_category) %>%
summarise(
Avg_Productivity = round(mean(productivity_score), 2),
Median_Prod = round(median(productivity_score), 2),
Std_Dev = round(sd(productivity_score), 2),
Count = n(),
.groups = "drop"
)
kable(prod_by_usage,
caption = "Productivity Statistics by Device Usage Category") %>%
smart_kable()| usage_category | Avg_Productivity | Median_Prod | Std_Dev | Count |
|---|---|---|---|---|
| Low (< 4 hrs) | 63.21 | 64.0 | 9.90 | 486 |
| Moderate (4-8 hrs) | 65.35 | 65.3 | 9.48 | 1799 |
| High (8+ hrs) | 66.07 | 65.3 | 9.73 | 1215 |
ggplot(prod_by_usage, aes(x = usage_category, y = Avg_Productivity,
fill = usage_category)) +
geom_col(width = 0.5, show.legend = FALSE) +
geom_errorbar(aes(ymin = Avg_Productivity - Std_Dev,
ymax = Avg_Productivity + Std_Dev), width = 0.2) +
geom_text(aes(label = Avg_Productivity), vjust = -0.5,
fontface = "bold", size = 4.5) +
scale_fill_manual(values = c("#55efc4","#fdcb6e","#e17055")) +
labs(title = "Average Productivity by Device Usage Category",
subtitle = "Error bars show standard deviation",
x = "Device Usage Category", y = "Avg Productivity Score") +
theme_minimal(base_size = 13)Insight: There is a clear inverse relationship between device usage intensity and productivity. Users in the “Low” usage category outperform “High” usage users on average, suggesting that excessive screen time diminishes productive output — a critical finding for digital wellness advocacy.
# Region Ranking
region_rank <- df %>%
group_by(region) %>%
summarise(
Avg_Productivity = round(mean(productivity_score), 2),
Std_Dev = round(sd(productivity_score), 2),
Count = n(),
.groups = "drop"
) %>%
mutate(Rank = rank(-Avg_Productivity)) %>%
arrange(Rank)
kable(region_rank, caption = "Regions Ranked by Average Productivity") %>%
smart_kable()| region | Avg_Productivity | Std_Dev | Count | Rank |
|---|---|---|---|---|
| Europe | 65.56 | 9.89 | 797 | 1.5 |
| North America | 65.56 | 9.41 | 622 | 1.5 |
| South America | 65.38 | 9.52 | 425 | 3.0 |
| Africa | 65.27 | 9.54 | 578 | 4.0 |
| Middle East | 65.17 | 9.73 | 339 | 5.0 |
| Asia | 64.84 | 9.79 | 739 | 6.0 |
##
## Best Region : 3 | Score: 65.56
cat("\nWorst Region :", region_rank$region[nrow(region_rank)],
"| Score:", region_rank$Avg_Productivity[nrow(region_rank)], "\n")##
## Worst Region : 2 | Score: 64.84
ggplot(region_rank, aes(x = reorder(region, Avg_Productivity),
y = Avg_Productivity, fill = Avg_Productivity)) +
geom_col(width = 0.6) +
geom_text(aes(label = Avg_Productivity), hjust = -0.2, fontface = "bold") +
scale_fill_gradient(low = "#e17055", high = "#00b894") +
coord_flip() +
labs(title = "Regions Ranked by Average Productivity Score",
x = "Region", y = "Avg Productivity Score",
fill = "Score") +
theme_minimal(base_size = 13)Insight: The ranking reveals significant productivity gaps between regions. The best-performing region outscores the lowest by a notable margin, indicating systemic differences in how digital tools are adopted for productive purposes versus passive consumption across geographies.
# Segment by gender + region + device_type
seg_dep <- df %>%
group_by(gender, region, device_type) %>%
summarise(
Avg_Dependence = round(mean(digital_dependence_score), 2),
Avg_Productivity = round(mean(productivity_score), 2),
Count = n(),
.groups = "drop"
) %>%
arrange(desc(Avg_Dependence)) %>%
slice_head(n = 12)
kable(seg_dep,
caption = "Top 12 User Segments by Digital Dependence") %>%
smart_kable()| gender | region | device_type | Avg_Dependence | Avg_Productivity | Count |
|---|---|---|---|---|---|
| Female | Africa | Tablet | 40.26 | 65.94 | 67 |
| Male | Europe | Android | 40.01 | 67.06 | 86 |
| Female | South America | Laptop | 39.86 | 63.82 | 54 |
| Female | South America | Tablet | 39.64 | 66.34 | 54 |
| Male | Asia | Laptop | 39.25 | 65.57 | 88 |
| Male | Middle East | Android | 38.90 | 63.19 | 46 |
| Female | South America | Android | 38.68 | 64.02 | 49 |
| Male | Africa | Tablet | 38.19 | 65.53 | 73 |
| Female | Asia | iPhone | 38.12 | 64.64 | 78 |
| Female | Middle East | iPhone | 38.08 | 64.73 | 38 |
| Female | Africa | Laptop | 38.02 | 65.80 | 70 |
| Male | North America | iPhone | 37.90 | 66.33 | 73 |
Insight: The highest digital dependence is concentrated in specific gender-region-device combinations, suggesting that demographic profiling can effectively identify at-risk user segments. Organizations can tailor digital wellness interventions based on these profiles.
# Focus Score by Device Type
focus_by_device <- df %>%
group_by(device_type) %>%
summarise(
Avg_Focus = round(mean(focus_score), 2),
Min_Focus = round(min(focus_score), 2),
Max_Focus = round(max(focus_score), 2),
Count = n(),
.groups = "drop"
) %>%
arrange(desc(Avg_Focus))
kable(focus_by_device, caption = "Focus Score Statistics by Device Type") %>%
smart_kable()| device_type | Avg_Focus | Min_Focus | Max_Focus | Count |
|---|---|---|---|---|
| Android | 42.92 | 0 | 100 | 903 |
| iPhone | 41.32 | 0 | 94 | 823 |
| Tablet | 41.20 | 0 | 97 | 888 |
| Laptop | 40.91 | 0 | 99 | 886 |
##
## Highest Avg Focus: 1 ( 42.92 )
cat("\nLowest Avg Focus :", focus_by_device$device_type[nrow(focus_by_device)],
"(", focus_by_device$Avg_Focus[nrow(focus_by_device)], ")\n")##
## Lowest Avg Focus : 3 ( 40.91 )
ggplot(focus_by_device, aes(x = reorder(device_type, -Avg_Focus),
y = Avg_Focus, fill = device_type)) +
geom_col(width = 0.5, show.legend = FALSE) +
geom_text(aes(label = Avg_Focus), vjust = -0.4, fontface = "bold", size = 4.5) +
scale_fill_brewer(palette = "Paired") +
labs(title = "Average Focus Score by Device Type",
x = "Device Type", y = "Avg Focus Score") +
theme_minimal(base_size = 13)Insight: Focus scores differ meaningfully across device types. Devices typically associated with structured work (Laptops, Tablets) tend to yield higher focus scores, while smartphones — prone to fragmented, notification-heavy usage — generally correlate with lower focus capacity.
# Total Digital Load = device_hrs + social_media_hrs + study_hrs
df <- df %>%
mutate(
total_digital_load = device_hours_per_day +
(social_media_mins / 60) +
(study_mins / 60)
)
cat("=== Total Digital Load Summary ===\n")## === Total Digital Load Summary ===
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.877 8.652 11.245 11.776 14.357 29.080
# Distribution
ggplot(df, aes(x = total_digital_load)) +
geom_histogram(bins = 30, fill = "#6c5ce7", color = "white", alpha = 0.85) +
geom_vline(xintercept = mean(df$total_digital_load),
color = "#d63031", linetype = "dashed", linewidth = 1) +
annotate("text", x = mean(df$total_digital_load) + 0.5,
y = 80, label = paste("Mean =",
round(mean(df$total_digital_load),2)),
color = "#d63031", fontface = "bold") +
labs(title = "Distribution of Total Digital Load",
subtitle = "Device hours + Social media hours + Study hours",
x = "Total Digital Load (hours/day)", y = "Count") +
theme_minimal(base_size = 13)Insight: The Total Digital Load metric reveals the aggregate daily digital burden users carry. The distribution is right-skewed, with a subset of users carrying extremely high digital loads (>15 hours/day), which is physiologically unsustainable and likely to impair cognitive performance.
# Well-being Index (higher = better)
# Formula: normalize sleep & physical activity positively,
# penalize stress and anxiety
df <- df %>%
mutate(
sleep_norm = (sleep_hours - min(sleep_hours)) /
(max(sleep_hours) - min(sleep_hours)),
phys_norm = (physical_activity_days - min(physical_activity_days)) /
(max(physical_activity_days) - min(physical_activity_days)),
stress_norm = 1 - (stress_level - min(stress_level)) /
(max(stress_level) - min(stress_level)),
anxiety_norm = 1 - (anxiety_score - min(anxiety_score)) /
(max(anxiety_score) - min(anxiety_score)),
wellbeing_index = round(
(sleep_norm * 30 + phys_norm * 25 +
stress_norm * 25 + anxiety_norm * 20), 2)
)
cat("=== Well-being Index Summary ===\n")## === Well-being Index Summary ===
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.10 45.51 57.88 56.29 68.02 94.18
ggplot(df, aes(x = wellbeing_index, fill = gender)) +
geom_density(alpha = 0.55, linewidth = 0.8) +
scale_fill_manual(values = c("#fd79a8","#74b9ff","#55efc4")) +
labs(title = "Well-being Index Distribution by Gender",
x = "Well-being Index (0-100)", y = "Density", fill = "Gender") +
theme_minimal(base_size = 13)Insight: The Well-being Index integrates sleep, physical activity, stress, and anxiety into a single composite score. Gender-based differences in the distribution suggest that certain demographic groups face systematically lower well-being — correlating with higher reported stress and lower sleep quality.
# Classify users into High-Risk / Balanced / Low-Risk
df <- df %>%
mutate(
risk_category = case_when(
stress_level >= quantile(stress_level, 0.70) &
digital_dependence_score >= quantile(digital_dependence_score, 0.70) &
productivity_score <= quantile(productivity_score, 0.35) ~ "High-Risk",
stress_level <= quantile(stress_level, 0.35) &
digital_dependence_score <= quantile(digital_dependence_score, 0.35) &
productivity_score >= quantile(productivity_score, 0.65) ~ "Low-Risk",
TRUE ~ "Balanced"
)
)
risk_counts <- df %>%
count(risk_category) %>%
mutate(Proportion = round(n / sum(n) * 100, 1))
kable(risk_counts, caption = "User Distribution by Risk Category") %>%
smart_kable()| risk_category | n | Proportion |
|---|---|---|
| Balanced | 3147 | 89.9 |
| High-Risk | 165 | 4.7 |
| Low-Risk | 188 | 5.4 |
##
## Risk Category Breakdown:
## risk_category n Proportion
## 1 Balanced 3147 89.9
## 2 High-Risk 165 4.7
## 3 Low-Risk 188 5.4
Insight: The majority of users fall into the Balanced category, while a notable minority qualify as High-Risk — exhibiting a triple burden of high stress, high digital dependence, and low productivity. This group warrants priority attention in any digital wellness intervention strategy.
load_summary <- df %>%
group_by(region, device_type) %>%
summarise(Avg_Load = round(mean(total_digital_load), 2), .groups = "drop")
ggplot(load_summary, aes(x = region, y = Avg_Load, fill = device_type)) +
geom_col(position = "dodge", width = 0.7) +
scale_fill_brewer(palette = "Set1", name = "Device Type") +
labs(title = "V1: Average Total Digital Load by Region & Device Type",
subtitle = "Grouped bar chart comparing digital burden across geographies",
x = "Region", y = "Avg Total Digital Load (hrs/day)") +
theme_minimal(base_size = 13) +
theme(axis.text.x = element_text(angle = 30, hjust = 1),
legend.position = "top")ggplot(df, aes(x = productivity_score, fill = after_stat(count))) +
geom_histogram(bins = 35, color = "white", linewidth = 0.3) +
scale_fill_gradient(low = "#b2bec3", high = "#6c5ce7") +
geom_vline(xintercept = mean(df$productivity_score),
color = "#d63031", linetype = "dashed", linewidth = 1) +
annotate("text",
x = mean(df$productivity_score) + 2,
y = 140,
label = paste("Mean =", round(mean(df$productivity_score), 1)),
color = "#d63031", fontface = "bold", size = 4) +
labs(title = "V2: Distribution of Productivity Scores",
subtitle = "Dashed line = mean productivity score",
x = "Productivity Score", y = "Number of Users") +
theme_minimal(base_size = 13) +
theme(legend.position = "none")risk_pie <- df %>%
count(risk_category) %>%
mutate(
pct = n / sum(n),
label = paste0(risk_category, "\n", round(pct*100,1), "%")
)
ggplot(risk_pie, aes(x = "", y = pct, fill = risk_category)) +
geom_col(width = 1, color = "white", linewidth = 1.2) +
coord_polar(theta = "y") +
geom_text(aes(label = label), position = position_stack(vjust = 0.5),
fontface = "bold", size = 4.5, color = "white") +
scale_fill_manual(values = c("High-Risk"="#d63031",
"Balanced" ="#fdcb6e",
"Low-Risk" ="#00b894")) +
labs(title = "V3: User Distribution by Risk Category",
fill = "Risk Level") +
theme_void(base_size = 14) +
theme(legend.position = "right")pair_vars <- df %>%
select(device_hours_per_day, productivity_score,
stress_level, digital_dependence_score, sleep_hours) %>%
sample_n(500) # Sample for performance
ggpairs(pair_vars,
lower = list(continuous = wrap("smooth", alpha = 0.15,
color = "#6c5ce7", size = 0.5)),
diag = list(continuous = wrap("densityDiag", fill = "#74b9ff",
alpha = 0.6)),
upper = list(continuous = wrap("cor", size = 4, color = "#2d3436")),
title = "V4: Pair Plot - Key Behavioral & Mental Health Variables") +
theme_minimal(base_size = 10)ggplot(df, aes(x = reorder(region, productivity_score, FUN = median),
y = productivity_score, fill = region)) +
geom_boxplot(outlier.alpha = 0.3, outlier.size = 1,
notch = TRUE, notchwidth = 0.6, show.legend = FALSE) +
geom_jitter(width = 0.15, alpha = 0.07, size = 0.8, color = "#2d3436") +
scale_fill_brewer(palette = "Pastel1") +
coord_flip() +
labs(title = "V5: Productivity Score Distribution Across Regions",
subtitle = "Notched boxplots - notch shows 95% CI of median",
x = "Region", y = "Productivity Score") +
theme_minimal(base_size = 13)p6a <- ggplot(df, aes(x = gender, y = sleep_hours, fill = gender)) +
geom_boxplot(show.legend = FALSE, outlier.alpha = 0.3) +
scale_fill_manual(values = c("#fd79a8","#74b9ff","#55efc4")) +
labs(title = "Sleep Hours by Gender", x = "Gender", y = "Sleep (hrs)") +
theme_minimal(base_size = 12)
p6b <- ggplot(df, aes(x = gender, y = stress_level, fill = gender)) +
geom_boxplot(show.legend = FALSE, outlier.alpha = 0.3) +
scale_fill_manual(values = c("#fd79a8","#74b9ff","#55efc4")) +
labs(title = "Stress Level by Gender", x = "Gender", y = "Stress Level") +
theme_minimal(base_size = 12)
gridExtra::grid.arrange(p6a, p6b, ncol = 2,
top = "V6: Sleep Hours and Stress Levels Across Gender Groups")# Bin device usage for smooth line
line_data <- df %>%
mutate(usage_bin = round(device_hours_per_day)) %>%
group_by(usage_bin) %>%
summarise(
Avg_Productivity = mean(productivity_score),
Avg_Stress = mean(stress_level),
.groups = "drop"
) %>%
pivot_longer(cols = c(Avg_Productivity, Avg_Stress),
names_to = "Metric", values_to = "Value")
ggplot(line_data, aes(x = usage_bin, y = Value,
color = Metric, group = Metric)) +
geom_line(linewidth = 1.3) +
geom_point(size = 2.5) +
scale_color_manual(values = c("Avg_Productivity" = "#00b894",
"Avg_Stress" = "#d63031"),
labels = c("Avg Productivity", "Avg Stress")) +
labs(title = "V7: Device Usage vs Productivity & Stress",
subtitle = "As daily screen time increases - what happens to well-being?",
x = "Device Usage (hours/day, rounded)",
y = "Score", color = "Metric") +
theme_minimal(base_size = 13)Insight: The line chart powerfully illustrates a diverging trend — as device usage increases, productivity scores tend to decline while stress levels rise. This crossing of trajectories provides visual evidence of the cost of excessive screen time on cognitive and mental health outcomes.
# SLR Model
slr_model <- lm(productivity_score ~ device_hours_per_day, data = df)
summary(slr_model)##
## Call:
## lm(formula = productivity_score ~ device_hours_per_day, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.880 -6.243 0.101 6.159 30.983
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.41483 0.40213 157.698 < 2e-16 ***
## device_hours_per_day 0.25752 0.05025 5.125 3.14e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.63 on 3498 degrees of freedom
## Multiple R-squared: 0.007452, Adjusted R-squared: 0.007169
## F-statistic: 26.26 on 1 and 3498 DF, p-value: 3.138e-07
ggplot(df, aes(x = device_hours_per_day, y = productivity_score)) +
geom_point(alpha = 0.15, color = "#6c5ce7", size = 1) +
geom_smooth(method = "lm", color = "#d63031", se = TRUE,
linewidth = 1.5) +
labs(title = "Simple Linear Regression: Device Usage to Productivity",
subtitle = paste("R2 =", round(summary(slr_model)$r.squared, 4)),
x = "Device Hours per Day",
y = "Productivity Score") +
theme_minimal(base_size = 13)Insight: The SLR coefficient for device usage reveals whether each additional hour of screen time corresponds to a statistically significant increase or decrease in productivity. The R2 value quantifies how much variance in productivity is explained by device usage alone.
# MLR Model
mlr_model <- lm(productivity_score ~
device_hours_per_day +
sleep_hours +
stress_level +
anxiety_score +
focus_score,
data = df)
summary(mlr_model)##
## Call:
## lm(formula = productivity_score ~ device_hours_per_day + sleep_hours +
## stress_level + anxiety_score + focus_score, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.557 -6.106 -0.030 6.149 32.307
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.047847 1.517685 40.224 < 2e-16 ***
## device_hours_per_day 0.546643 0.075134 7.276 4.24e-13 ***
## sleep_hours 0.264862 0.156494 1.692 0.0906 .
## stress_level 0.059999 0.051968 1.155 0.2484
## anxiety_score -0.247097 0.042711 -5.785 7.88e-09 ***
## focus_score -0.004799 0.006890 -0.696 0.4862
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.586 on 3494 degrees of freedom
## Multiple R-squared: 0.01771, Adjusted R-squared: 0.0163
## F-statistic: 12.6 on 5 and 3494 DF, p-value: 3.767e-12
# Coefficient plot
coef_df <- as.data.frame(summary(mlr_model)$coefficients)
coef_df$Variable <- rownames(coef_df)
coef_df <- coef_df %>% filter(Variable != "(Intercept)")
ggplot(coef_df, aes(x = reorder(Variable, Estimate),
y = Estimate, fill = Estimate > 0)) +
geom_col(width = 0.5) +
geom_errorbar(aes(ymin = Estimate - `Std. Error`,
ymax = Estimate + `Std. Error`), width = 0.2) +
coord_flip() +
scale_fill_manual(values = c("TRUE"="#00b894","FALSE"="#d63031"),
labels = c("Positive","Negative"),
name = "Effect Direction") +
labs(title = "MLR: Predictor Effects on Productivity Score",
x = "Predictor Variable", y = "Coefficient Estimate") +
theme_minimal(base_size = 13)Insight: The MLR model identifies which factors most strongly predict productivity. Positive coefficients (e.g., focus_score, sleep_hours) indicate beneficial variables, while negative coefficients (e.g., stress_level, device hours) confirm their harmful influence — providing a multi-factor explanation for productivity variance.
##
## === FINAL PROJECT SUMMARY & KEY FINDINGS ===
| # | Finding | Implication |
|---|---|---|
| 1 | Inverse Device-Productivity Relationship | More daily screen time correlates with lower productivity scores across all regions |
| 2 | Digital-Stress Feedback Loop | High digital dependence and high stress co-occur significantly, compounding mental health risk |
| 3 | Regional Productivity Gaps | Meaningful productivity disparities exist across regions, reflecting socio-economic and cultural digital usage patterns |
| 4 | Device Type Matters | Laptops and tablets associate with higher focus; smartphones with fragmented attention |
| 5 | Sleep is Central to Well-being | Sleep hours strongly influence the well-being index and correlate inversely with stress |
| 6 | Three Distinct User Profiles | K-Means reveals Balanced, High-Risk, and Low-Digital users - enabling targeted interventions |
| 7 | KNN Accurately Predicts Risk | ML model confirms stress, anxiety, sleep, and digital dependence are the most predictive risk factors |
Report generated using R | Dataset: Digital Behavior & Mental Health Survey (N=3,500)