Libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
Dataset
dataset <- read.csv("dataset.csv", stringsAsFactors = FALSE)
head(dataset)
## transaction_id user_id age gender daily_screen_time_hours social_media_hours
## 1 TXN00001 U00001 21 Male 3.23 2.01
## 2 TXN00002 U00002 24 Other 5.09 3.81
## 3 TXN00003 U00003 31 Other 6.06 1.36
## 4 TXN00004 U00004 32 Other 7.83 5.85
## 5 TXN00005 U00005 25 Male 9.96 5.92
## 6 TXN00006 U00006 26 Male 9.32 4.26
## gaming_hours work_study_hours sleep_hours notifications_per_day
## 1 0.89 4.55 7.55 248
## 2 2.24 4.44 7.66 127
## 3 3.83 2.35 4.92 44
## 4 1.51 3.54 8.23 178
## 5 3.42 5.27 6.21 136
## 6 0.29 3.99 6.90 82
## app_opens_per_day weekend_screen_time stress_level academic_work_impact
## 1 154 3.95 Medium Yes
## 2 71 6.71 Medium Yes
## 3 106 8.68 High No
## 4 107 9.77 High Yes
## 5 177 12.55 Low No
## 6 56 10.98 Medium Yes
## addiction_level addicted_label
## 1 None 0
## 2 None 0
## 3 Mild 0
## 4 Moderate 1
## 5 Severe 1
## 6 Severe 1
Data cleaning
dataset[dataset == ""] <- NA
str(dataset)
## 'data.frame': 7500 obs. of 16 variables:
## $ transaction_id : chr "TXN00001" "TXN00002" "TXN00003" "TXN00004" ...
## $ user_id : chr "U00001" "U00002" "U00003" "U00004" ...
## $ age : int 21 24 31 32 25 26 25 26 21 35 ...
## $ gender : chr "Male" "Other" "Other" "Other" ...
## $ daily_screen_time_hours: num 3.23 5.09 6.06 7.83 9.96 9.32 10.4 4.26 4.38 9.76 ...
## $ social_media_hours : num 2.01 3.81 1.36 5.85 5.92 4.26 4.93 4.6 1.38 4.73 ...
## $ gaming_hours : num 0.89 2.24 3.83 1.51 3.42 0.29 1.6 2.16 2.72 1.36 ...
## $ work_study_hours : num 4.55 4.44 2.35 3.54 5.27 3.99 0.86 4.61 3.78 2.11 ...
## $ sleep_hours : num 7.55 7.66 4.92 8.23 6.21 6.9 8.61 6.43 6.23 5.21 ...
## $ notifications_per_day : int 248 127 44 178 136 82 165 169 172 20 ...
## $ app_opens_per_day : int 154 71 106 107 177 56 95 117 134 82 ...
## $ weekend_screen_time : num 3.95 6.71 8.68 9.77 12.55 ...
## $ stress_level : chr "Medium" "Medium" "High" "High" ...
## $ academic_work_impact : chr "Yes" "Yes" "No" "Yes" ...
## $ addiction_level : chr "None" "None" "Mild" "Moderate" ...
## $ addicted_label : int 0 0 0 1 1 1 1 1 0 1 ...
summary(dataset)
## transaction_id user_id age gender
## Length:7500 Length:7500 Min. :18.00 Length:7500
## Class :character Class :character 1st Qu.:22.00 Class :character
## Mode :character Mode :character Median :27.00 Mode :character
## Mean :26.57
## 3rd Qu.:31.00
## Max. :35.00
## daily_screen_time_hours social_media_hours gaming_hours work_study_hours
## Min. : 3.000 Min. :0.500 Min. :0.000 Min. :0.500
## 1st Qu.: 5.220 1st Qu.:1.910 1st Qu.:1.020 1st Qu.:1.850
## Median : 7.525 Median :3.270 Median :2.040 Median :3.230
## Mean : 7.500 Mean :3.273 Mean :2.014 Mean :3.242
## 3rd Qu.: 9.810 3rd Qu.:4.630 3rd Qu.:2.990 3rd Qu.:4.640
## Max. :12.000 Max. :6.000 Max. :4.000 Max. :6.000
## sleep_hours notifications_per_day app_opens_per_day weekend_screen_time
## Min. :4.500 Min. : 20.0 Min. : 15.00 Min. : 3.580
## 1st Qu.:5.630 1st Qu.: 76.0 1st Qu.: 55.00 1st Qu.: 6.960
## Median :6.720 Median :134.0 Median : 98.00 Median : 9.260
## Mean :6.738 Mean :134.3 Mean : 97.83 Mean : 9.244
## 3rd Qu.:7.840 3rd Qu.:191.0 3rd Qu.:140.00 3rd Qu.:11.540
## Max. :9.000 Max. :250.0 Max. :180.00 Max. :14.880
## stress_level academic_work_impact addiction_level addicted_label
## Length:7500 Length:7500 Length:7500 Min. :0.0000
## Class :character Class :character Class :character 1st Qu.:0.0000
## Mode :character Mode :character Mode :character Median :1.0000
## Mean :0.7077
## 3rd Qu.:1.0000
## Max. :1.0000
dim(dataset)
## [1] 7500 16
colSums(is.na(dataset))
## transaction_id user_id age
## 0 0 0
## gender daily_screen_time_hours social_media_hours
## 0 0 0
## gaming_hours work_study_hours sleep_hours
## 0 0 0
## notifications_per_day app_opens_per_day weekend_screen_time
## 0 0 0
## stress_level academic_work_impact addiction_level
## 0 0 0
## addicted_label
## 0
Analysis Questions
Question 1
How many users belong to each gender category in the dataset?
dataset %>% count(gender)
## gender n
## 1 Female 2461
## 2 Male 2553
## 3 Other 2486
Interpretation: The dataset has a nearly equal distribution of users across the Female (2,461), Male (2,553), and Other (2,486) gender categories, so the behavioral patterns analyzed below are unlikely to be dominated by any one gender.
Question 2
What is the distribution of users across different age groups?
dataset %>%
mutate(Age_Group = ifelse(age < 30, "Young Adult", "Adult")) %>%
count(Age_Group)
## Age_Group n
## 1 Adult 2569
## 2 Young Adult 4931
Interpretation: The majority of users are “Young Adults” (under 30). This indicates that the analysis is more relevant to younger individuals who generally use smartphones more actively.
Question 3
What are the minimum, maximum, and average daily screen time hours?
dataset %>%
summarise(
Min_Screen_Time = min(daily_screen_time_hours),
Max_Screen_Time = max(daily_screen_time_hours),
Avg_Screen_Time = mean(daily_screen_time_hours)
)
## Min_Screen_Time Max_Screen_Time Avg_Screen_Time
## 1 3 12 7.499912
Interpretation:
Screen time ranges from 3 to 12 hours, with an average of about 7.5
hours. This shows a high level of daily smartphone usage among
users.
Question 4
How many users are classified as addicted versus not addicted?
dataset %>% count(addicted_label)
## addicted_label n
## 1 0 2192
## 2 1 5308
Interpretation:
More than 70% of users are classified as addicted, showing a high level
of digital dependency in the dataset.
Question 5
Does the average daily screen time differ between addicted and
non-addicted users?
dataset %>%
group_by(addicted_label) %>%
summarise(Avg_Screen_Time = mean(daily_screen_time_hours))
## # A tibble: 2 × 2
## addicted_label Avg_Screen_Time
## <int> <dbl>
## 1 0 5.16
## 2 1 8.47
Interpretation:
Addicted users average substantially more daily screen time (8.47 hours)
than non-addicted users (5.16 hours), a gap of about 3.3 hours. This
points to a strong association between screen time and the addiction
label.
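The comparison of group means above can be backed by a formal test. The following is a minimal sketch using a Welch two-sample t-test; the `toy` data frame is a hypothetical stand-in (simulated values, not the real data), and in the report `dataset` would be passed instead.

```r
# Hypothetical stand-in data; in the report, pass `dataset` instead of `toy`.
set.seed(1)
toy <- data.frame(
  addicted_label = rep(c(0, 1), each = 50),
  daily_screen_time_hours = c(rnorm(50, mean = 5, sd = 1.5),
                              rnorm(50, mean = 8.5, sd = 1.5))
)

# Welch two-sample t-test (the default for t.test): compares mean screen
# time between the two label groups without assuming equal variances.
tt <- t.test(daily_screen_time_hours ~ addicted_label, data = toy)

tt$estimate  # the two group means
tt$conf.int  # 95% confidence interval for the difference in means
tt$p.value   # p-value for the null hypothesis of equal means
```

A small p-value here would justify describing the difference as statistically significant rather than only large.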
Question 6
Who are the top 5 users with the highest daily screen time?
dataset %>%
arrange(desc(daily_screen_time_hours)) %>%
select(user_id, daily_screen_time_hours, addiction_level) %>%
head(5)
## user_id daily_screen_time_hours addiction_level
## 1 U00694 12.00 Severe
## 2 U02237 12.00 Moderate
## 3 U05173 12.00 Severe
## 4 U05236 12.00 Moderate
## 5 U00585 11.99 Severe
Interpretation:
Top users reach the maximum screen time of 12 hours. Addiction levels
vary, indicating other factors also influence addiction severity.
Question 7
How many users of each gender are classified as having “Severe”
addiction?
dataset %>%
filter(addiction_level == "Severe") %>%
count(gender)
## gender n
## 1 Female 762
## 2 Male 848
## 3 Other 824
Interpretation: Severe addiction is almost evenly distributed across genders, with slightly higher numbers in males.
Question 8
Is there a significant difference in age across the different addiction
levels?
dataset %>%
group_by(addiction_level) %>%
summarise(Avg_Age = mean(age, na.rm = TRUE))
## # A tibble: 4 × 2
## addiction_level Avg_Age
## <chr> <dbl>
## 1 Mild 26.6
## 2 Moderate 26.4
## 3 None 26.5
## 4 Severe 26.8
Interpretation: Average age is nearly the same across all addiction levels (~26.5 years), suggesting age is not a major factor in addiction severity.
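The question asks about a significant difference, while the summary above only compares means. One way to test it, sketched here on hypothetical simulated data (`toy` is not the real dataset), is a one-way ANOVA of age on addiction level.

```r
# Hypothetical stand-in data; in the report, pass `dataset` instead of `toy`.
set.seed(2)
toy <- data.frame(
  addiction_level = sample(c("None", "Mild", "Moderate", "Severe"),
                           200, replace = TRUE),
  age = sample(18:35, 200, replace = TRUE)
)

# One-way ANOVA: is mean age the same in every addiction level?
fit_age <- aov(age ~ addiction_level, data = toy)

# summary() returns a list; its first element is the ANOVA table.
age_table <- summary(fit_age)[[1]]
age_table

# p-value for the addiction_level term (first row of the table)
p_age <- age_table[["Pr(>F)"]][1]
p_age
```

A p-value above 0.05 would support the reading that age does not differ meaningfully across addiction levels.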
Question 9
What percentage of “Mild” addiction users report a negative academic
impact?
dataset %>%
filter(addiction_level == "Mild") %>%
count(academic_work_impact) %>%
mutate(Percent = n / sum(n) * 100)
## academic_work_impact n Percent
## 1 No 690 50.25492
## 2 Yes 683 49.74508
Interpretation:
About half (49.7%) of mildly addicted users report an academic or work
impact, showing that even low addiction levels can coincide with reduced
performance.
Question 10
Are high notification counts correlated with high stress levels?
dataset %>%
group_by(stress_level) %>%
summarise(Avg_Notifications = mean(notifications_per_day))
## # A tibble: 3 × 2
## stress_level Avg_Notifications
## <chr> <dbl>
## 1 High 134.
## 2 Low 134.
## 3 Medium 135.
Interpretation:
Notification counts are similar across all stress levels, indicating
notifications alone do not significantly affect stress.
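Because the question is phrased in terms of correlation, a rank correlation between notifications and the ordered stress level is one possible check. This is a sketch on hypothetical simulated data; the names `toy`, `stress_rank`, and `sp` are illustrative and not part of the original analysis.

```r
# Hypothetical stand-in data; in the report, pass `dataset` instead of `toy`.
set.seed(3)
toy <- data.frame(
  stress_level = sample(c("Low", "Medium", "High"), 200, replace = TRUE),
  notifications_per_day = sample(20:250, 200, replace = TRUE)
)

# Encode stress as ordered ranks: Low = 1, Medium = 2, High = 3.
stress_rank <- as.integer(factor(toy$stress_level,
                                 levels = c("Low", "Medium", "High")))

# Spearman rank correlation suits an ordinal variable; exact = FALSE
# avoids the warning about exact p-values when there are ties.
sp <- cor.test(toy$notifications_per_day, stress_rank,
               method = "spearman", exact = FALSE)

sp$estimate  # Spearman's rho
sp$p.value
```

A rho near zero with a large p-value would be consistent with the flat group averages shown above.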
Question 11
How many users are in the “Other” gender category?
dataset %>% count(gender) %>% filter(gender == "Other")
## gender n
## 1 Other 2486
Interpretation: There are 2,486 users in the “Other” category, making it about one-third of the dataset.
Question 12
How do average sleep hours vary by stress level?
ggplot(dataset, aes(x = stress_level, y = sleep_hours)) +
stat_summary(fun = mean, geom = "bar", fill = "skyblue") +
labs(title = "Average Sleep Hours by Stress Level",
x = "Stress Level",
y = "Average Sleep Hours")
Interpretation:
Sleep duration is slightly lower for users with high stress, though
overall differences are small.
Question 13
How are genders distributed in the dataset?
ggplot(dataset, aes(x = gender)) +
geom_bar(fill = "red") +
labs(title = "Gender Distribution",
x = "Gender",
y = "Count")
Interpretation: The bar chart shows an almost equal distribution of genders, so no single group dominates the analysis.
Question 14
How does daily screen time compare to weekend screen time?
ggplot(dataset, aes(x = daily_screen_time_hours, y = weekend_screen_time)) +
geom_point(color = "blue") +
labs(title = "Daily vs. Weekend Screen Time",
x = "Daily Screen Time (Hours)",
y = "Weekend Screen Time (Hours)")
Interpretation:
Users with higher daily screen time also tend to spend more time on
weekends, showing consistent usage behavior.
Question 15
How is addiction level distributed among users?
ggplot(dataset, aes(x = addiction_level)) +
geom_bar(fill = "orange") +
coord_flip() +
labs(title = "Distribution of Addiction Levels",
x = "Addiction Level",
y = "Count")
Interpretation: This chart shows how users are distributed across different addiction levels, helping identify the most common category.
Question 16
How do Screen Time, Gaming, and Notifications together impact
Stress?
dataset$stress_numeric <- as.numeric(factor(dataset$stress_level,
levels = c("Low", "Medium", "High")))
model_multi <- lm(stress_numeric ~ daily_screen_time_hours + gaming_hours + notifications_per_day, data = dataset)
summary(model_multi)
##
## Call:
## lm(formula = stress_numeric ~ daily_screen_time_hours + gaming_hours +
## notifications_per_day, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0306 -0.9984 -0.0066 0.9835 1.0157
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.982e+00 3.830e-02 51.760 <2e-16 ***
## daily_screen_time_hours 4.142e-03 3.637e-03 1.139 0.255
## gaming_hours -1.240e-03 8.281e-03 -0.150 0.881
## notifications_per_day -2.364e-05 1.425e-04 -0.166 0.868
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8217 on 7496 degrees of freedom
## Multiple R-squared: 0.0001797, Adjusted R-squared: -0.0002205
## F-statistic: 0.449 on 3 and 7496 DF, p-value: 0.718
Interpretation:
The multiple regression model shows that screen time, gaming hours, and
notifications together have no significant impact on stress levels. All
p-values are high (>0.05), and the R² value is almost zero,
indicating that these variables do not explain stress variation in this
dataset.
Question 17
Is there a Correlation between Daily Screen Time and Sleep Hours?
cor(dataset$daily_screen_time_hours, dataset$sleep_hours)
## [1] 0.01934324
Interpretation:
The correlation coefficient is approximately 0.019. This value is very
close to zero, suggesting there is virtually no linear relationship
between how much time a person spends on their screen and how many hours
they sleep according to this data.
Question 18
What is the proportion of users based on academic work impact?
impact_counts <- table(dataset$academic_work_impact)
impact_counts
##
## No Yes
## 3753 3747
labels <- paste(names(impact_counts),
round(impact_counts / sum(impact_counts) * 100, 1), "%")
colors <- c("skyblue", "orange")
pie(impact_counts,
labels = labels,
col = colors,
main = "Proportion of Academic Work Impact")
Interpretation:
The pie chart shows the share of users whose academic or work
performance is affected by digital usage. The split is almost exactly
even: 3,753 users (50.0%) report no impact and 3,747 (50.0%) report an
impact, so neither response dominates in this dataset.
Question 19
Do screen time and notifications together increase addiction more than
individually?
model2 <- lm(addicted_label ~ daily_screen_time_hours + notifications_per_day,
data = dataset)
summary(model2)
##
## Call:
## lm(formula = addicted_label ~ daily_screen_time_hours + notifications_per_day,
## data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.75823 -0.31646 0.02322 0.25670 0.74477
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.574e-02 1.566e-02 -2.922 0.00349 **
## daily_screen_time_hours 1.006e-01 1.644e-03 61.187 < 2e-16 ***
## notifications_per_day -7.708e-06 6.443e-05 -0.120 0.90477
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3715 on 7497 degrees of freedom
## Multiple R-squared: 0.3331, Adjusted R-squared: 0.3329
## F-statistic: 1872 on 2 and 7497 DF, p-value: < 2.2e-16
Interpretation:
The model shows that daily screen time has a strong and significant
positive effect on addiction (p < 2e-16), while notifications have no
significant impact. This suggests that screen time alone is a key driver
of addiction, rather than notifications.
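The model above has no term for the two predictors acting together, and `lm` on a 0/1 label is a linear probability model. One alternative, sketched below on hypothetical simulated data, is a logistic regression with an interaction term. This is an illustrative substitute technique, not the method used in the report, and the simulated outcome depends on screen time only.

```r
# Hypothetical stand-in data; in the report, pass `dataset` instead of `toy`.
set.seed(4)
n <- 500
toy <- data.frame(
  daily_screen_time_hours = runif(n, 3, 12),
  notifications_per_day = sample(20:250, n, replace = TRUE)
)
# Toy outcome: probability of the label rises with screen time only.
toy$addicted_label <- rbinom(n, 1,
                             plogis(-4 + 0.7 * toy$daily_screen_time_hours))

# `a * b` expands to a + b + a:b, so the model includes both main effects
# and their interaction; family = binomial gives logistic regression.
fit_int <- glm(addicted_label ~ daily_screen_time_hours * notifications_per_day,
               data = toy, family = binomial)

# The row "daily_screen_time_hours:notifications_per_day" is the
# interaction: does the effect of one predictor depend on the other?
coef(summary(fit_int))
```

A non-significant interaction row would suggest the two predictors do not amplify each other beyond their separate effects.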
Question 20
How does Daily Screen Time distribution vary by Gender?
ggplot(dataset, aes(x = gender, y = daily_screen_time_hours, fill = gender)) +
geom_boxplot() +
labs(title = "Daily Screen Time Distribution by Gender",
x = "Gender",
y = "Daily Screen Time")+
theme_minimal()
Interpretation:
The box plot shows that daily screen time remains very consistent across
Male, Female, and Other gender categories. The median screen time sits
around 7.5 hours for all groups, and the spread (interquartile range) is
almost identical, indicating that gender is not a major factor in
determining total daily usage in this dataset.
Question 21
Do screen time and notifications influence addiction?
model2 <- lm(addicted_label ~ daily_screen_time_hours + notifications_per_day,
data = dataset)
summary(model2)
##
## Call:
## lm(formula = addicted_label ~ daily_screen_time_hours + notifications_per_day,
## data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.75823 -0.31646 0.02322 0.25670 0.74477
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.574e-02 1.566e-02 -2.922 0.00349 **
## daily_screen_time_hours 1.006e-01 1.644e-03 61.187 < 2e-16 ***
## notifications_per_day -7.708e-06 6.443e-05 -0.120 0.90477
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3715 on 7497 degrees of freedom
## Multiple R-squared: 0.3331, Adjusted R-squared: 0.3329
## F-statistic: 1872 on 2 and 7497 DF, p-value: < 2.2e-16
Interpretation:
This is the same model fitted in Question 19. Only the screen time
coefficient is positive and significant (0.1006, p < 2e-16): each
additional hour of daily screen time is associated with roughly a 10
percentage point increase in the probability of being labeled addicted.
The notifications coefficient is slightly negative (-7.7e-06) and not
significant (p = 0.905), so this model gives no evidence that frequent
notifications contribute to addiction once screen time is accounted for.
Question 22
Is there a visible difference in Social Media usage between Addicted and
Non-Addicted users?
ggplot(dataset, aes(x = as.factor(addicted_label), y = social_media_hours, fill = as.factor(addicted_label))) +
geom_boxplot() +
labs(title = "Social Media Usage by Addiction Label",
x = "Addicted Label",
y = "Social Media Hours",
fill = "Addicted Label")+
theme_minimal()
Interpretation:
There is a clear upward shift in social media usage for users labeled as
“Addicted” (1). The median usage for addicted users is significantly
higher (around 4 hours) compared to non-addicted users (around 2.3
hours). Additionally, the top 25% of addicted users spend more time on
social media than almost anyone in the non-addicted category.
Question 23
How does Daily Screen Time predict Social Media usage?
ggplot(dataset, aes(x = daily_screen_time_hours, y = social_media_hours)) +
geom_point(color = "skyblue") +
geom_smooth(method = "lm", color = "red") +
labs(title = "Regression Analysis: Daily Screen Time vs. Social Media Usage",
x = "Daily Screen Time (Hours)",
y = "Social Media Hours")+
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
Interpretation:
- Near-zero correlation: The correlation coefficient is approximately 0.01, which indicates a very weak positive relationship.
- Horizontal regression line: The red line is nearly flat, showing that an increase in total daily screen time does not necessarily result in a predictable increase in social media hours.
- Data dispersion: The scatter plot reveals that users with both high and low total screen time have widely varying social media habits, suggesting that other activities (like gaming or work) significantly contribute to the total screen time for many individuals.
Question 24
Is there a Correlation between Daily Screen Time and Sleep Hours?
cor(dataset$daily_screen_time_hours, dataset$sleep_hours)
## [1] 0.01934324
Interpretation:
The correlation coefficient is approximately 0.019. This value is very
close to zero, suggesting there is virtually no linear relationship
between how much time a person spends on their screen and how many hours
they sleep according to this data.
Question 25
What is the distribution of Daily Screen Time across Stress Levels?
ggplot(dataset, aes(x = stress_level, y = daily_screen_time_hours, fill = stress_level)) +
geom_boxplot() +
labs(title = "Daily Screen Time by Stress Level")
Interpretation:
The boxplot indicates that daily screen time is quite similar across
Low, Medium, and High stress groups. There is no major variation in
medians or spread, suggesting that stress levels are not strongly
influenced by screen time.
Question 26
What is the proportion of users in each Stress Category?
stress_counts <- table(dataset$stress_level)
pie(stress_counts, labels = names(stress_counts), main = "Proportion of Stress Levels")
Interpretation:
The pie chart shows a balanced distribution: High Stress (34.1%), Low
Stress (33.4%), and Medium Stress (32.5%). This confirms that the survey
group is not skewed toward any single stress experience.
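The percentages quoted in the interpretation are not produced by the `pie()` call itself. A minimal sketch of how they could be computed and shown on the chart follows; the four-element `toy_stress` vector is a hypothetical stand-in for `dataset$stress_level`, and the other names are illustrative.

```r
# Hypothetical stand-in for dataset$stress_level.
toy_stress <- c("High", "High", "Low", "Medium")

stress_counts_toy <- table(toy_stress)

# prop.table() converts counts to proportions; scale to percent and round.
stress_pct <- round(prop.table(stress_counts_toy) * 100, 1)

# Build labels such as "High (50%)" so the chart shows the shares directly.
pie_labels <- paste0(names(stress_pct), " (", stress_pct, "%)")
pie_labels

pie(stress_counts_toy, labels = pie_labels,
    main = "Proportion of Stress Levels")
```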
Question 27
How do Social Media Hours and Stress Levels relate visually?
ggplot(dataset, aes(x = social_media_hours, y = stress_numeric)) +
geom_point() +
geom_smooth(method = "lm", color = "red") +
labs(title = "Social Media Hours vs. Stress Level")
## `geom_smooth()` using formula = 'y ~ x'
Interpretation:
The scatter plot shows a very weak relationship between social media
hours and stress levels. The regression line is nearly flat, indicating
that increased social media usage does not significantly increase or
decrease stress in this dataset.
Question 28
Does Screen Time significantly vary across Stress Levels?
model_anova <- aov(daily_screen_time_hours ~ stress_level, data = dataset)
summary(model_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## stress_level 2 34 17.168 2.523 0.0803 .
## Residuals 7497 51018 6.805
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation:
The ANOVA test resulted in a p-value of 0.08. Since this is above the
standard 0.05 threshold, we cannot conclude that there is a
statistically significant difference in daily screen time between Low,
Medium, and High stress groups in this dataset.
Question 29
Does the relationship between Screen Time and App Opens follow a
curve?
ggplot(dataset, aes(x = daily_screen_time_hours, y = app_opens_per_day)) +
geom_point(color = "green", size = 2) +
stat_smooth(method = "lm",
formula = y ~ x + I(x^2),
color = "red",
linewidth = 1.5) +
labs(title = "Daily Screen Time vs App Opens",
x = "Daily Screen Time (Hours)",
y = "App Opens Per Day") +
theme_minimal()
Interpretation:
The polynomial model suggests a very weak curved relationship between
screen time and app opens. Although the first term is slightly
significant, the overall model fit is extremely low (R² ≈ 0.0006),
meaning screen time does not meaningfully predict app usage patterns.
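The plot draws the quadratic fit but does not print the coefficients or R² that the interpretation refers to. A sketch of how such figures could be obtained follows, using hypothetical simulated data in place of `dataset`; the object names are illustrative.

```r
# Hypothetical stand-in data; in the report, pass `dataset` instead of `toy`.
set.seed(6)
toy <- data.frame(daily_screen_time_hours = runif(300, 3, 12))
toy$app_opens_per_day <- sample(15:180, 300, replace = TRUE)

# Same formula shape as the stat_smooth() call: a linear term plus a
# squared term, with I() protecting the ^ operator inside the formula.
fit_quad <- lm(app_opens_per_day ~ daily_screen_time_hours +
                 I(daily_screen_time_hours^2), data = toy)

quad_summary <- summary(fit_quad)
coef(quad_summary)      # estimates, standard errors, t-values, p-values
quad_summary$r.squared  # share of variance explained by the curve
```

Reading the p-value of the squared term indicates whether the curvature itself is supported by the data.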
Question 30
Which factor is the strongest predictor of addiction?
model_full <- lm(addicted_label ~ daily_screen_time_hours + gaming_hours +
social_media_hours + notifications_per_day, data = dataset)
summary(model_full)
##
## Call:
## lm(formula = addicted_label ~ daily_screen_time_hours + gaming_hours +
## social_media_hours + notifications_per_day, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.82903 -0.24810 0.00692 0.25600 0.75079
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.334e-01 1.683e-02 -25.757 <2e-16 ***
## daily_screen_time_hours 9.987e-02 1.424e-03 70.126 <2e-16 ***
## gaming_hours 3.241e-03 3.242e-03 0.999 0.318
## social_media_hours 1.172e-01 2.344e-03 49.985 <2e-16 ***
## notifications_per_day 1.542e-05 5.581e-05 0.276 0.782
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3218 on 7495 degrees of freedom
## Multiple R-squared: 0.4998, Adjusted R-squared: 0.4996
## F-statistic: 1873 on 4 and 7495 DF, p-value: < 2.2e-16
Interpretation:
The strongest predictor of addiction is daily screen time, which has the
highest t-value (70.13) and is highly significant (p < 0.001). Social
media hours is also a strong predictor, with a larger estimated increase
per hour (0.117 versus 0.100) and a very high t-value (49.99). By
contrast, gaming hours and notifications per day are not statistically
significant in this model, so they add little once the other predictors
are included. Overall, the model explains about 50% of the variance in
the addicted label (R² = 0.4998).
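Raw coefficients are expressed per unit of each predictor (hours versus notification counts), so they are hard to compare directly. One common way to put predictors on a shared scale, sketched here on hypothetical simulated data, is to standardize them before fitting. The `toy` data and its outcome rule are assumptions for illustration only.

```r
# Hypothetical stand-in data; in the report, pass `dataset` instead of `toy`.
set.seed(7)
n <- 500
toy <- data.frame(
  daily_screen_time_hours = runif(n, 3, 12),
  social_media_hours = runif(n, 0.5, 6),
  gaming_hours = runif(n, 0, 4),
  notifications_per_day = sample(20:250, n, replace = TRUE)
)
# Toy outcome driven by screen time and social media only.
toy$addicted_label <- as.integer(
  toy$daily_screen_time_hours + toy$social_media_hours + rnorm(n) > 10
)

predictors <- c("daily_screen_time_hours", "social_media_hours",
                "gaming_hours", "notifications_per_day")

# Standardize each predictor to mean 0 and standard deviation 1.
toy_std <- toy
toy_std[predictors] <- lapply(toy[predictors],
                              function(x) as.numeric(scale(x)))

fit_std <- lm(addicted_label ~ ., data = toy_std)

# Each coefficient is now the change in the label per one standard
# deviation of the predictor, so absolute sizes can be compared.
std_coefs <- coef(fit_std)[predictors]
sort(abs(std_coefs), decreasing = TRUE)
```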