Smartphones have become an essential part of modern life. While they provide numerous benefits in communication, education and entertainment, excessive usage may lead to behavioural addiction, stress, and reduced academic productivity.
The increasing prevalence of smartphone addiction among students and working adults has raised concerns regarding its impact on daily activities and mental well-being.
How many hours is a person likely to spend on their smartphone each day, based on their demographic profile and behavioural habits?
Can a user be classified as “Addicted” or “Not Addicted” based on their daily phone usage habits and lifestyle patterns?
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.1 ✔ readr 2.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.3 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## corrplot 0.95 loaded
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
getCurrentFileLocation <- function() {
  cmdArgs <- commandArgs(trailingOnly = FALSE)
  needle <- "--file="
  match <- grep(needle, cmdArgs)
  if (length(match) > 0) {
    return(dirname(normalizePath(sub(needle, "", cmdArgs[match]))))
  } else {
    for (f in sys.frames()) {
      if (!is.null(f$ofile)) return(dirname(normalizePath(f$ofile)))
    }
  }
  # Fallback for interactive RStudio sessions:
  if (requireNamespace("rstudioapi", quietly = TRUE) &&
      rstudioapi::isAvailable()) {
    return(dirname(rstudioapi::getActiveDocumentContext()$path))
  }
  return(NULL)
}
script_dir <- getCurrentFileLocation()
if (!is.null(script_dir)) setwd(script_dir)
# Output folders (created inside Compile/)
dir.create("models", showWarnings = FALSE)
dir.create("plots", showWarnings = FALSE)
# Absolute paths to the raw and cleaned datasets (adjust these for your own machine)
RAW_DATA_PATH <- "/Users/thinkacademy/Downloads/Smartphone_Usage_And_Addiction_Analysis_7500_Rows.csv"
CLEANED_DATA_PATH <- "/Users/thinkacademy/Downloads/Smartphone_Cleaned-2.csv"
df <- read.csv(
  RAW_DATA_PATH,
  stringsAsFactors = FALSE
)
The dataset used in this study is the Smartphone Usage and Addiction Analysis dataset, which contains information on smartphone usage behaviour, demographic characteristics, stress levels, academic or work impact, and addiction status.
Source and Year: The dataset was obtained from [https://www.kaggle.com/datasets/algozee/smartphone-usage-and-addiction-analysis-dataset/data].
Purpose: The dataset was selected because it provides sufficient behavioural and demographic detail to support both a regression task (predicting daily screen time) and a classification task (predicting addiction status), making it suitable for addressing both research questions of this project.
The dataset consists of 7,500 observations and 16 variables, resulting in 120,000 data points, which satisfies the project requirement of analysing more than 100,000 data points.
data.frame(
  Rows = nrow(df),
  Columns = ncol(df),
  Datapoints = nrow(df) * ncol(df)
) |>
  knitr::kable(caption = "Dataset Overview")
| Rows | Columns | Datapoints |
|---|---|---|
| 7500 | 16 | 120000 |
## 'data.frame': 7500 obs. of 16 variables:
## $ transaction_id : chr "TXN00001" "TXN00002" "TXN00003" "TXN00004" ...
## $ user_id : chr "U00001" "U00002" "U00003" "U00004" ...
## $ age : int 21 24 31 32 25 26 25 26 21 35 ...
## $ gender : chr "Male" "Other" "Other" "Other" ...
## $ daily_screen_time_hours: num 3.23 5.09 6.06 7.83 9.96 9.32 10.4 4.26 4.38 9.76 ...
## $ social_media_hours : num 2.01 3.81 1.36 5.85 5.92 4.26 4.93 4.6 1.38 4.73 ...
## $ gaming_hours : num 0.89 2.24 3.83 1.51 3.42 0.29 1.6 2.16 2.72 1.36 ...
## $ work_study_hours : num 4.55 4.44 2.35 3.54 5.27 3.99 0.86 4.61 3.78 2.11 ...
## $ sleep_hours : num 7.55 7.66 4.92 8.23 6.21 6.9 8.61 6.43 6.23 5.21 ...
## $ notifications_per_day : int 248 127 44 178 136 82 165 169 172 20 ...
## $ app_opens_per_day : int 154 71 106 107 177 56 95 117 134 82 ...
## $ weekend_screen_time : num 3.95 6.71 8.68 9.77 12.55 ...
## $ stress_level : chr "Medium" "Medium" "High" "High" ...
## $ academic_work_impact : chr "Yes" "Yes" "No" "Yes" ...
## $ addiction_level : chr "None" "None" "Mild" "Moderate" ...
## $ addicted_label : int 0 0 0 1 1 1 1 1 0 1 ...
## transaction_id user_id age gender
## Length:7500 Length:7500 Min. :18.00 Length:7500
## Class :character Class :character 1st Qu.:22.00 Class :character
## Mode :character Mode :character Median :27.00 Mode :character
## Mean :26.57
## 3rd Qu.:31.00
## Max. :35.00
## daily_screen_time_hours social_media_hours gaming_hours work_study_hours
## Min. : 3.000 Min. :0.500 Min. :0.000 Min. :0.500
## 1st Qu.: 5.220 1st Qu.:1.910 1st Qu.:1.020 1st Qu.:1.850
## Median : 7.525 Median :3.270 Median :2.040 Median :3.230
## Mean : 7.500 Mean :3.273 Mean :2.014 Mean :3.242
## 3rd Qu.: 9.810 3rd Qu.:4.630 3rd Qu.:2.990 3rd Qu.:4.640
## Max. :12.000 Max. :6.000 Max. :4.000 Max. :6.000
## sleep_hours notifications_per_day app_opens_per_day weekend_screen_time
## Min. :4.500 Min. : 20.0 Min. : 15.00 Min. : 3.580
## 1st Qu.:5.630 1st Qu.: 76.0 1st Qu.: 55.00 1st Qu.: 6.960
## Median :6.720 Median :134.0 Median : 98.00 Median : 9.260
## Mean :6.738 Mean :134.3 Mean : 97.83 Mean : 9.244
## 3rd Qu.:7.840 3rd Qu.:191.0 3rd Qu.:140.00 3rd Qu.:11.540
## Max. :9.000 Max. :250.0 Max. :180.00 Max. :14.880
## stress_level academic_work_impact addiction_level addicted_label
## Length:7500 Length:7500 Length:7500 Min. :0.0000
## Class :character Class :character Class :character 1st Qu.:0.0000
## Mode :character Mode :character Mode :character Median :1.0000
## Mean :0.7077
## 3rd Qu.:1.0000
## Max. :1.0000
In addition to the base R summary, the skimr package was
used to generate a more detailed variable-level summary, including
completeness, standard deviation, and quantile information for each
variable.
| Name | df |
|---|---|
| Number of rows | 7500 |
| Number of columns | 16 |
| Column type frequency: | |
| character | 6 |
| numeric | 10 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| transaction_id | 0 | 1 | 8 | 8 | 0 | 7500 | 0 |
| user_id | 0 | 1 | 6 | 6 | 0 | 7500 | 0 |
| gender | 0 | 1 | 4 | 6 | 0 | 3 | 0 |
| stress_level | 0 | 1 | 3 | 6 | 0 | 3 | 0 |
| academic_work_impact | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| addiction_level | 0 | 1 | 4 | 8 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 0 | 1 | 26.57 | 5.20 | 18.00 | 22.00 | 27.00 | 31.00 | 35.00 | ▇▆▇▆▇ |
| daily_screen_time_hours | 0 | 1 | 7.50 | 2.61 | 3.00 | 5.22 | 7.53 | 9.81 | 12.00 | ▇▇▇▇▇ |
| social_media_hours | 0 | 1 | 3.27 | 1.59 | 0.50 | 1.91 | 3.27 | 4.63 | 6.00 | ▇▇▇▇▇ |
| gaming_hours | 0 | 1 | 2.01 | 1.15 | 0.00 | 1.02 | 2.04 | 2.99 | 4.00 | ▇▇▇▇▇ |
| work_study_hours | 0 | 1 | 3.24 | 1.60 | 0.50 | 1.85 | 3.23 | 4.64 | 6.00 | ▇▇▇▇▇ |
| sleep_hours | 0 | 1 | 6.74 | 1.28 | 4.50 | 5.63 | 6.72 | 7.84 | 9.00 | ▇▇▇▇▇ |
| notifications_per_day | 0 | 1 | 134.26 | 66.59 | 20.00 | 76.00 | 134.00 | 191.00 | 250.00 | ▇▇▇▇▇ |
| app_opens_per_day | 0 | 1 | 97.83 | 48.42 | 15.00 | 55.00 | 98.00 | 140.00 | 180.00 | ▇▇▇▇▇ |
| weekend_screen_time | 0 | 1 | 9.24 | 2.72 | 3.58 | 6.96 | 9.26 | 11.54 | 14.88 | ▃▇▇▇▅ |
| addicted_label | 0 | 1 | 0.71 | 0.45 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | ▃▁▁▁▇ |
The variables can be grouped into the following categories:
- Identifiers: transaction_id, user_id
- Demographics: age, gender
- Usage behaviour: daily_screen_time_hours, social_media_hours, gaming_hours, work_study_hours, sleep_hours, notifications_per_day, app_opens_per_day, weekend_screen_time
- Well-being and addiction outcomes: stress_level, academic_work_impact, addiction_level, addicted_label
Initial Findings
The dataset contains both numerical and categorical variables and is suitable for both regression and classification modelling. The large sample size provides sufficient data for reliable predictive analysis.
Data cleaning was performed to ensure the dataset was complete,
consistent, and suitable for predictive modelling. The cleaning steps in
this section primarily use base R functions: is.na() for
missing value detection, duplicated() for duplicate record
detection, table() for cross-tabulation checks, and
ifelse() combined with
factor()/as.factor() for recoding and type
conversion of categorical variables. The tidyverse package
(specifically dplyr, via the %>% pipe) is
used later in this report for feature selection during the regression
and classification modelling stages.
The dataset was first examined for missing values across all variables.
## transaction_id user_id age
## 0 0 0
## gender daily_screen_time_hours social_media_hours
## 0 0 0
## gaming_hours work_study_hours sleep_hours
## 0 0 0
## notifications_per_day app_opens_per_day weekend_screen_time
## 0 0 0
## stress_level academic_work_impact addiction_level
## 0 0 0
## addicted_label
## 0
Findings
No missing values were identified in any of the variables.
Although the variable addiction_level contains the
category “None”, this represents users with no
smartphone addiction and is a valid category rather than a missing
value.
Duplicate observations were checked to ensure that each row represented a unique user record.
## Number of duplicate rows: 0
Findings
No duplicate records were detected in the dataset. Therefore, no observations were removed.
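The code behind the missing-value and duplicate checks is not shown in the report. A minimal sketch of how they could be run with the base R functions named earlier (`is.na()` and `duplicated()`), using a small hypothetical data frame `toy` rather than the real dataset:

```r
# Hypothetical toy data: one missing value and one duplicated row.
toy <- data.frame(
  user_id = c("U1", "U2", "U2", "U3"),
  age     = c(21, 24, 24, NA)
)

# Missing values per column.
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)

# Number of rows that are identical to an earlier row.
n_duplicates <- sum(duplicated(toy))
cat("Number of duplicate rows:", n_duplicates, "\n")
```

Applied to `df`, the same two calls would be expected to return zeros for every column and a duplicate count of 0, in line with the outputs reported above.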
The relationship between addiction_level and
addicted_label was verified to ensure consistency.
##
## Mild Moderate None Severe
## 1373 2874 819 2434
##
## 0 1
## Mild 1373 0
## Moderate 0 2874
## None 819 0
## Severe 0 2434
Findings
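The cross-tabulation shows that, in the original data, users classified as “None” or “Mild” correspond to 0 (Not Addicted), while users classified as “Moderate” or “Severe” correspond to 1 (Addicted).
To support classification modelling, the binary target variable was redefined from the original addiction categories so that only “None” maps to 0 and any level of addiction (Mild, Moderate, or Severe) maps to 1. This moves the 1,373 Mild users into the Addicted class, leaving 819 users (10.9%) labelled Not Addicted.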
df$addicted_label <- ifelse(df$addiction_level == "None", 0, 1)
df$addicted_label <- factor(df$addicted_label, levels = c(0,1))
| Label | Description |
|---|---|
| 0 | Not Addicted |
| 1 | Addicted |
This transformation simplifies the classification problem into a binary prediction task.
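To make the effect of the recoding concrete, the sketch below rebuilds the two columns from the counts in the cross-tabulation and compares the label before and after `ifelse(addiction_level == "None", 0, 1)`. The vectors are reconstructed from the reported counts, not read from the dataset, and the names `original_label` and `recoded_label` are illustrative.

```r
# Rebuild addiction_level from the counts in the cross-tabulation above.
levels_vec <- c("Mild", "Moderate", "None", "Severe")
counts     <- c(1373, 2874, 819, 2434)
addiction_level <- rep(levels_vec, times = counts)

# Label as it appears in the raw data: only Moderate/Severe are coded 1.
original_label <- ifelse(addiction_level %in% c("Moderate", "Severe"), 1, 0)

# Label used for modelling: everything except "None" is coded 1.
recoded_label <- ifelse(addiction_level == "None", 0, 1)

# Compare the two definitions side by side.
print(table(addiction_level, original_label))
print(table(addiction_level, recoded_label))

# Share of users labelled Addicted under each definition.
share_original <- mean(original_label)
share_recoded  <- mean(recoded_label)
cat("Addicted share (original):", round(share_original, 4), "\n")
cat("Addicted share (recoded): ", round(share_recoded, 4), "\n")
```

The recoded share is what determines the class balance faced by the classification models later in the report.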
Categorical variables were converted into factor data types to ensure correct handling during statistical analysis and machine learning modelling.
df$gender <- as.factor(df$gender)
df$stress_level <- as.factor(df$stress_level)
df$academic_work_impact <- as.factor(df$academic_work_impact)
df$addiction_level <- as.factor(df$addiction_level)
df$addicted_label <- as.factor(df$addicted_label)
str(df)
## 'data.frame': 7500 obs. of 16 variables:
## $ transaction_id : chr "TXN00001" "TXN00002" "TXN00003" "TXN00004" ...
## $ user_id : chr "U00001" "U00002" "U00003" "U00004" ...
## $ age : int 21 24 31 32 25 26 25 26 21 35 ...
## $ gender : Factor w/ 3 levels "Female","Male",..: 2 3 3 3 2 2 2 2 3 3 ...
## $ daily_screen_time_hours: num 3.23 5.09 6.06 7.83 9.96 9.32 10.4 4.26 4.38 9.76 ...
## $ social_media_hours : num 2.01 3.81 1.36 5.85 5.92 4.26 4.93 4.6 1.38 4.73 ...
## $ gaming_hours : num 0.89 2.24 3.83 1.51 3.42 0.29 1.6 2.16 2.72 1.36 ...
## $ work_study_hours : num 4.55 4.44 2.35 3.54 5.27 3.99 0.86 4.61 3.78 2.11 ...
## $ sleep_hours : num 7.55 7.66 4.92 8.23 6.21 6.9 8.61 6.43 6.23 5.21 ...
## $ notifications_per_day : int 248 127 44 178 136 82 165 169 172 20 ...
## $ app_opens_per_day : int 154 71 106 107 177 56 95 117 134 82 ...
## $ weekend_screen_time : num 3.95 6.71 8.68 9.77 12.55 ...
## $ stress_level : Factor w/ 3 levels "High","Low","Medium": 3 3 1 1 2 3 3 2 1 2 ...
## $ academic_work_impact : Factor w/ 2 levels "No","Yes": 2 2 1 2 1 2 1 1 2 2 ...
## $ addiction_level : Factor w/ 4 levels "Mild","Moderate",..: 3 3 1 2 4 4 4 2 3 4 ...
## $ addicted_label : Factor w/ 2 levels "0","1": 1 1 2 2 2 2 2 2 1 2 ...
Findings
The following variables were successfully converted into factors:
- gender
- stress_level
- academic_work_impact
- addiction_level
- addicted_label
The final dataset was reviewed before proceeding to exploratory analysis and predictive modelling.
data.frame(
  Rows = nrow(df),
  Columns = ncol(df),
  Missing_Values = sum(is.na(df)),
  Duplicate_Rows = sum(duplicated(df))
) |>
  knitr::kable(
    caption = "Final Dataset Summary"
  )
| Rows | Columns | Missing_Values | Duplicate_Rows |
|---|---|---|---|
| 7500 | 16 | 0 | 0 |
The cleaned dataset was saved to disk for reproducibility and downstream use.
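The saving step itself is not shown. A minimal sketch, assuming `write.csv()` was used with `CLEANED_DATA_PATH`; here a temporary file and a three-row stand-in data frame (`df_clean`, hypothetical) replace the real path and data so the example runs anywhere:

```r
# Stand-in for the cleaned data frame; the real one has 7,500 rows.
df_clean <- data.frame(
  age            = c(21L, 24L, 31L),
  gender         = factor(c("Male", "Other", "Other")),
  addicted_label = factor(c(0, 0, 1), levels = c(0, 1))
)

# A temporary file stands in for CLEANED_DATA_PATH.
cleaned_path <- tempfile(fileext = ".csv")
write.csv(df_clean, cleaned_path, row.names = FALSE)

# A CSV stores factors as plain text or numbers, so they have to be
# converted again after the file is read back in.
reloaded <- read.csv(cleaned_path, stringsAsFactors = FALSE)
reloaded$gender <- as.factor(reloaded$gender)
reloaded$addicted_label <- factor(reloaded$addicted_label, levels = c(0, 1))
str(reloaded)
```

If preserving column types matters for downstream use, `saveRDS()` and `readRDS()` are an alternative that keeps factors intact.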
The dataset was successfully cleaned and prepared for analysis. No missing values or duplicate records were identified, addiction labels were validated, and categorical variables were transformed appropriately for modelling.
Exploratory Data Analysis (EDA) was conducted to understand smartphone usage patterns, identify relationships between variables, detect potential anomalies, and gain insights before predictive modelling.
The distribution of smartphone addiction levels was examined to understand the prevalence of addiction within the dataset.
p_addiction_dist <- ggplot(df,
                           aes(x = addiction_level,
                               fill = addiction_level)) +
  geom_bar() +
  labs(
    title = "Distribution of Addiction Levels",
    x = "Addiction Level",
    y = "Frequency"
  ) +
  theme_minimal()
p_addiction_dist
Findings
The plot shows the number of users within each addiction category (None, Mild, Moderate, and Severe), providing an overview of the target variable distribution.
The gender distribution of users was analysed to understand the demographic composition of the dataset.
ggplot(df,
       aes(x = gender,
           fill = gender)) +
  geom_bar() +
  labs(
    title = "Gender Distribution",
    x = "Gender",
    y = "Count"
  ) +
  theme_minimal()
Findings
The dataset contains users from three gender groups (Female, Male, and Other), so the sample is not restricted to a single group.
The distribution of daily screen time was examined to understand overall smartphone usage behaviour.
ggplot(df,
       aes(x = daily_screen_time_hours)) +
  geom_histogram(
    bins = 30,
    fill = "steelblue",
    color = "white"
  ) +
  labs(
    title = "Distribution of Daily Screen Time",
    x = "Daily Screen Time (Hours)",
    y = "Frequency"
  ) +
  theme_minimal()
Findings
The histogram provides an overview of users’ daily screen time patterns and highlights the most common screen time range.
Boxplots were used to identify potential outliers in the numerical variables.
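numeric_cols <- df %>%
  select(where(is.numeric))
# Nine numeric columns remain once addicted_label is a factor,
# so a 3 x 3 grid keeps all boxplots on a single page
par(mfrow = c(3,3))
for (col in names(numeric_cols)) {
  boxplot(
    numeric_cols[[col]],
    main = col,
    col = "lightblue"
  )
}
Findings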
Several variables contain naturally occurring extreme values. However, no severe outliers were identified that required removal.
A consistency check was performed to determine whether the combined hours spent on social media, gaming, and work/study exceeded the reported daily screen time.
inconsistent_rows <- sum(
  (df$social_media_hours +
     df$gaming_hours +
     df$work_study_hours) >
    df$daily_screen_time_hours
)
cat("Rows with hour inconsistency:", inconsistent_rows, "\n")
cat("Percentage:", round(100 * inconsistent_rows / nrow(df), 1), "%\n")
## Rows with hour inconsistency: 4553
## Percentage: 60.7 %
Findings
A total of 4,553 observations (60.7%) exhibited inconsistencies where the combined activity hours exceeded the reported daily screen time.
Interpretation
This suggests that users may have reported overlapping activities or estimated usage differently. Therefore, these variables were treated as independent behavioural indicators.
A correlation heatmap was generated to examine relationships among numerical variables.
numeric_df <- df %>%
  select(where(is.numeric))
cor_matrix <- cor(
  numeric_df,
  use = "complete.obs"
)
corrplot(
  cor_matrix,
  method = "color",
  type = "upper",
  tl.cex = 0.7,
  number.cex = 0.7
)
# Save a copy of the heatmap to the plots/ folder
png("plots/correlation_heatmap.png", width = 800, height = 800)
corrplot(
  cor_matrix,
  method = "color",
  type = "upper",
  tl.cex = 0.7,
  number.cex = 0.7
)
dev.off()
## quartz_off_screen
## 2
Findings
The strongest correlation was observed between
weekend_screen_time and
daily_screen_time_hours (r ≈ 0.96), indicating a potential
target leakage issue for regression modelling.
Feature scaling was performed to standardise numerical variables before modelling.
numeric_features <- df %>%
  select(
    age,
    daily_screen_time_hours,
    social_media_hours,
    gaming_hours,
    work_study_hours,
    notifications_per_day
  )
scaled_features <- scale(numeric_features)
head(scaled_features)
## age daily_screen_time_hours social_media_hours gaming_hours
## [1,] -1.0715189 -1.6364910 -0.7969791 -0.9809287
## [2,] -0.4942749 -0.9236255 0.3384230 0.1970416
## [3,] 0.8526280 -0.5518622 -1.2069853 1.5844287
## [4,] 1.0450427 0.1265099 1.6252119 -0.4399349
## [5,] -0.3018602 0.9428560 1.6693665 1.2266748
## [6,] -0.1094455 0.6975689 0.6222735 -1.5044710
## work_study_hours notifications_per_day
## [1,] 0.8168470 1.70818428
## [2,] 0.7481298 -0.10899044
## [3,] -0.5574960 -1.35548219
## [4,] 0.1858986 0.65692618
## [5,] 1.2666320 0.02617132
## [6,] 0.4670142 -0.78479922
Findings
Feature scaling transformed variables to a common scale with mean 0 and standard deviation 1, preventing variables with larger ranges from dominating the models.
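As a quick check of what `scale()` computes, the sketch below standardises the first observation of `daily_screen_time_hours` by hand using the rounded mean (7.50) and standard deviation (2.61) from the summary tables, then confirms on a short vector that `scale()` matches the formula (x - mean) / sd. The six values in `x` are the first entries shown in the `str()` output and serve only as an illustration.

```r
# Rounded summary statistics reported earlier for daily_screen_time_hours.
mean_daily <- 7.50
sd_daily   <- 2.61

# First observation in the dataset (3.23 hours).
z_first <- (3.23 - mean_daily) / sd_daily
cat("z-score by hand:", round(z_first, 2), "\n")

# scale() applies the same formula column by column.
x <- c(3.23, 5.09, 6.06, 7.83, 9.96, 9.32)
z_scale  <- as.numeric(scale(x))
z_manual <- (x - mean(x)) / sd(x)
stopifnot(isTRUE(all.equal(z_scale, z_manual)))
```

The hand-computed value is close to the -1.636 in the first row of the scaled output; the small difference comes from using rounded summary statistics.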
Variance Inflation Factor (VIF) analysis was conducted to assess
multicollinearity among predictor variables. Since VIF depends only on
the relationships among the predictor variables themselves (not on the
response variable), addicted_label was used here purely as
a placeholder outcome to allow lm() and
car::vif() to be run on the predictor set.
vif_model <- lm(
  as.numeric(as.character(addicted_label)) ~
    age +
    daily_screen_time_hours +
    social_media_hours +
    gaming_hours +
    work_study_hours +
    notifications_per_day,
  data = df
)
car::vif(vif_model)
## age daily_screen_time_hours social_media_hours
## 1.001138 1.000284 1.000189
## gaming_hours work_study_hours notifications_per_day
## 1.000945 1.000638 1.000627
Findings
All VIF values were below the commonly accepted threshold of 5, indicating that multicollinearity is not a major concern among the candidate predictors.
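The statement that VIF depends only on the predictors can be illustrated by computing it by hand: for each predictor j, VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the remaining predictors. The sketch below uses simulated data with hypothetical columns `x1`, `x2`, and `x3` (where `x3` is deliberately built from `x1`), not the study dataset.

```r
set.seed(1)
n <- 500

# Simulated predictors: x1 and x2 are independent, x3 is built from x1.
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$x3 <- 0.9 * sim$x1 + rnorm(n, sd = 0.3)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the other predictors. No response variable is needed.
vif_by_hand <- sapply(names(sim), function(v) {
  others <- setdiff(names(sim), v)
  fit <- lm(reformulate(others, response = v), data = sim)
  1 / (1 - summary(fit)$r.squared)
})
print(round(vif_by_hand, 2))
```

In this simulation `x1` and `x3` show inflated values while `x2` stays near 1, which is the pattern all six predictors in the report follow.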
A boxplot was generated to examine the relationship between daily screen time and addiction level.
ggplot(
  df,
  aes(
    x = addiction_level,
    y = daily_screen_time_hours,
    fill = addiction_level
  )
) +
  geom_boxplot() +
  labs(
    title = "Daily Screen Time by Addiction Level",
    x = "Addiction Level",
    y = "Daily Screen Time (Hours)"
  ) +
  theme_minimal()
Findings
Users with higher addiction levels generally tend to spend more time on their smartphones compared to users with lower addiction levels.
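Summary of EDA Findings
The exploratory analysis revealed several important insights:
- The dataset contains no missing values and no duplicate rows.
- In 4,553 observations (60.7%), the combined social media, gaming, and work/study hours exceed the reported daily screen time.
- weekend_screen_time and daily_screen_time_hours are strongly correlated (r ≈ 0.96), which raises a target leakage concern for regression modelling.
- All VIF values are close to 1, so multicollinearity among the candidate predictors is negligible.
- Daily screen time tends to be higher among users with higher addiction levels.
These findings provide useful guidance for the regression and classification modelling stages.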
Objective
The objective of this analysis is to predict users’ daily screen time using demographic and behavioural variables.
# Remove ID columns
df_model <- df %>%
  select(-transaction_id, -user_id)
# Remove leakage variables
df_model <- df_model %>%
  select(
    -weekend_screen_time,
    -addiction_level,
    -addicted_label
  )
str(df_model)
## 'data.frame': 7500 obs. of 11 variables:
## $ age : int 21 24 31 32 25 26 25 26 21 35 ...
## $ gender : Factor w/ 3 levels "Female","Male",..: 2 3 3 3 2 2 2 2 3 3 ...
## $ daily_screen_time_hours: num 3.23 5.09 6.06 7.83 9.96 9.32 10.4 4.26 4.38 9.76 ...
## $ social_media_hours : num 2.01 3.81 1.36 5.85 5.92 4.26 4.93 4.6 1.38 4.73 ...
## $ gaming_hours : num 0.89 2.24 3.83 1.51 3.42 0.29 1.6 2.16 2.72 1.36 ...
## $ work_study_hours : num 4.55 4.44 2.35 3.54 5.27 3.99 0.86 4.61 3.78 2.11 ...
## $ sleep_hours : num 7.55 7.66 4.92 8.23 6.21 6.9 8.61 6.43 6.23 5.21 ...
## $ notifications_per_day : int 248 127 44 178 136 82 165 169 172 20 ...
## $ app_opens_per_day : int 154 71 106 107 177 56 95 117 134 82 ...
## $ stress_level : Factor w/ 3 levels "High","Low","Medium": 3 3 1 1 2 3 3 2 1 2 ...
## $ academic_work_impact : Factor w/ 2 levels "No","Yes": 2 2 1 2 1 2 1 1 2 2 ...
Train-Test Split
The dataset was divided into 80% training data and 20% testing data.
set.seed(42)
train_index <- sample(
  1:nrow(df_model),
  size = round(0.8 * nrow(df_model))
)
reg_train <- df_model[train_index, ]
reg_test <- df_model[-train_index, ]
##
## Call:
## lm(formula = daily_screen_time_hours ~ ., data = reg_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7187 -2.2511 -0.0103 2.2741 4.8068
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.7957174 0.3014412 22.544 <2e-16 ***
## age 0.0086624 0.0064524 1.343 0.1795
## genderMale 0.1310800 0.0822951 1.593 0.1113
## genderOther 0.0442823 0.0826324 0.536 0.5921
## social_media_hours 0.0177288 0.0212670 0.834 0.4045
## gaming_hours -0.0035257 0.0294221 -0.120 0.9046
## work_study_hours 0.0122431 0.0210255 0.582 0.5604
## sleep_hours 0.0357577 0.0262755 1.361 0.1736
## notifications_per_day 0.0002135 0.0005029 0.424 0.6713
## app_opens_per_day 0.0013557 0.0006939 1.954 0.0508 .
## stress_levelLow -0.0235104 0.0818911 -0.287 0.7741
## stress_levelMedium -0.1996135 0.0824594 -2.421 0.0155 *
## academic_work_impactYes -0.0254948 0.0672433 -0.379 0.7046
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.603 on 5987 degrees of freedom
## Multiple R-squared: 0.003012, Adjusted R-squared: 0.001014
## F-statistic: 1.507 on 12 and 5987 DF, p-value: 0.1134
## n= 6000
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 6000 40673.78 7.492087 *
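The code that fitted the linear model and the decision tree is not shown; only their printed output appears above. The sketch below is an assumption about how such models could be fitted: `lm()` follows the `Call` line in the summary, while the use of `rpart()` with default settings is inferred from the format of the tree printout. It runs on simulated data in which the response is unrelated to the predictors, and the names `sim`, `mlr_sim`, and `dt_sim` are hypothetical.

```r
library(rpart)  # assumption: the tree printout above resembles rpart's print method

set.seed(42)
n <- 5000

# Simulated data in which the response is pure noise between 3 and 12 hours.
sim <- data.frame(
  age          = sample(18:35, n, replace = TRUE),
  gaming_hours = runif(n, 0, 4),
  screen_time  = runif(n, 3, 12)
)

# Multiple linear regression on all predictors.
mlr_sim <- lm(screen_time ~ ., data = sim)
cat("R-squared:", round(summary(mlr_sim)$r.squared, 4), "\n")

# Regression tree with rpart defaults (complexity parameter cp = 0.01).
# A split is kept only if it improves the overall fit by at least cp,
# so a response unrelated to the predictors leaves just the root node.
dt_sim <- rpart(screen_time ~ ., data = sim, method = "anova")
print(dt_sim)
cat("Nodes in tree:", nrow(dt_sim$frame), "\n")
```

A root-only tree like the one printed in the report predicts the training mean for every observation, which is consistent with the near-zero R-squared of the linear model on the same data.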
rf_model <- randomForest(
  daily_screen_time_hours ~ .,
  data = reg_train,
  ntree = 100,
  importance = TRUE
)
rf_model
##
## Call:
## randomForest(formula = daily_screen_time_hours ~ ., data = reg_train, ntree = 100, importance = TRUE)
## Type of random forest: regression
## Number of trees: 100
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 7.061841
## % Var explained: -4.17
Objective
The objective of this analysis is to predict smartphone addiction status.
Feature Selection
classification_df <- df[, c(
  "addicted_label",
  "age",
  "gender",
  "daily_screen_time_hours",
  "social_media_hours",
  "gaming_hours",
  "work_study_hours",
  "notifications_per_day",
  "stress_level",
  "academic_work_impact"
)]
Train-Test Split
set.seed(123)
train_index <- sample(
  1:nrow(classification_df),
  0.8*nrow(classification_df)
)
train_data <- classification_df[train_index,]
test_data <- classification_df[-train_index,]
##
## Call:
## glm(formula = addicted_label ~ ., family = "binomial", data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.9191380 0.4067275 -12.094 <2e-16 ***
## age -0.0025476 0.0098552 -0.259 0.796
## genderMale 0.0922154 0.1245716 0.740 0.459
## genderOther 0.1041416 0.1242206 0.838 0.402
## daily_screen_time_hours 0.9024571 0.0370023 24.389 <2e-16 ***
## social_media_hours 0.6919513 0.0375378 18.433 <2e-16 ***
## gaming_hours -0.0103911 0.0446703 -0.233 0.816
## work_study_hours -0.0464739 0.0314214 -1.479 0.139
## notifications_per_day -0.0004680 0.0007618 -0.614 0.539
## stress_levelLow 0.1542275 0.1250333 1.233 0.217
## stress_levelMedium 0.0934942 0.1235301 0.757 0.449
## academic_work_impactYes 0.0304168 0.1015260 0.300 0.764
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4166.5 on 5999 degrees of freedom
## Residual deviance: 2533.0 on 5988 degrees of freedom
## AIC: 2557
##
## Number of Fisher Scoring iterations: 7
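Logistic regression coefficients are on the log-odds scale, which makes them hard to read directly. A short worked sketch, using the two significant coefficients copied from the summary above, converts them to odds ratios with `exp()`:

```r
# Coefficients (log-odds) copied from the model summary above.
log_odds <- c(
  daily_screen_time_hours = 0.9024571,
  social_media_hours      = 0.6919513
)

# Exponentiating a logistic regression coefficient gives an odds ratio:
# the multiplicative change in the odds of being labelled Addicted for a
# one-unit (here one-hour) increase, holding the other predictors fixed.
odds_ratios <- exp(log_odds)
print(round(odds_ratios, 2))
```

Under this model, each additional hour of daily screen time is associated with roughly 2.5 times the odds of being labelled Addicted, and each additional hour of social media use with roughly double the odds. With the fitted model object available, `exp(coef(log_model))` would give the full set.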
rf_clf <- randomForest(
  addicted_label ~ .,
  data = train_data,
  ntree = 200,
  importance = TRUE
)
rf_clf
##
## Call:
## randomForest(formula = addicted_label ~ ., data = train_data, ntree = 200, importance = TRUE)
## Type of random forest: classification
## Number of trees: 200
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 10.23%
## Confusion matrix:
## 0 1 class.error
## 0 340 322 0.48640483
## 1 292 5046 0.05470214
Variable Importance
Evaluation Metrics
get_metrics <- function(actual, predicted) {
  rmse <- sqrt(mean((actual - predicted)^2))
  r2 <- 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
  data.frame(RMSE = rmse, R2 = r2)
}
# Use regression test set
mlr_pred <- predict(mlr_model, reg_test)
dt_pred <- predict(dt_model, reg_test)
rf_pred <- predict(rf_model, reg_test)
results_df <- rbind(
  cbind(Model = "MLR", get_metrics(reg_test$daily_screen_time_hours, mlr_pred)),
  cbind(Model = "Decision Tree", get_metrics(reg_test$daily_screen_time_hours, dt_pred)),
  cbind(Model = "Random Forest", get_metrics(reg_test$daily_screen_time_hours, rf_pred))
)
knitr::kable(
  results_df,
  caption = "Regression Model Performance"
)
| Model | RMSE | R2 |
|---|---|---|
| MLR | 2.632864 | -0.0020657 |
| Decision Tree | 2.630440 | -0.0002213 |
| Random Forest | 2.644456 | -0.0109090 |
Findings
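Among the regression models, Decision Tree Regression achieved the lowest RMSE and the highest R² value. However, the fitted tree contains only the root node, so it predicts the training-set mean (about 7.49 hours) for every user rather than using any predictor. All three models produced negative R² values on the test set, indicating poor predictive performance. This suggests that the available demographic and behavioural variables are insufficient to accurately predict daily smartphone screen time.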
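A negative R² can look surprising, so the sketch below shows where it comes from using the same metric definitions as `get_metrics()`. The responses `y_train` and `y_test` are simulated (hypothetical uniform draws between 3 and 12 hours), not the study data.

```r
# Same metric definitions as get_metrics() in the report.
get_metrics <- function(actual, predicted) {
  rmse <- sqrt(mean((actual - predicted)^2))
  r2   <- 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
  data.frame(RMSE = rmse, R2 = r2)
}

set.seed(42)
# Hypothetical train and test responses drawn from the same range.
y_train <- runif(6000, 3, 12)
y_test  <- runif(1500, 3, 12)

# Predicting the test-set mean gives R2 = 0 by construction.
m_test <- get_metrics(y_test, rep(mean(y_test), length(y_test)))

# Predicting the training-set mean (what a root-only tree does)
# can only match or fall below that, so R2 <= 0.
m_train <- get_metrics(y_test, rep(mean(y_train), length(y_test)))

print(rbind(test_mean = m_test, train_mean = m_train))
```

Because the test-set mean is never available at prediction time, any constant prediction learned from the training data scores at or slightly below zero, which matches the small negative values in the table above.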
Evaluation Metrics
Logistic Regression
pred_prob <- predict(
  log_model,
  newdata = test_data,
  type = "response"
)
pred_class <- factor(
  ifelse(pred_prob > 0.5, 1, 0),
  levels = c(0,1)
)
cm <- table(
  Predicted = pred_class,
  Actual = test_data$addicted_label
)
cm
##          Actual
## Predicted   0    1
##         0  39   50
##         1 118 1293
accuracy <- sum(diag(cm))/sum(cm)
TP <- cm["1","1"]
TN <- cm["0","0"]
FP <- cm["1","0"]
FN <- cm["0","1"]
precision <- TP/(TP+FP)
recall <- TP/(TP+FN)
f1 <- 2*precision*recall/(precision+recall)
metrics <- data.frame(
  Accuracy = accuracy,
  Precision = precision,
  Recall = recall,
  F1_Score = f1
)
knitr::kable(
  metrics,
  caption = "Logistic Regression Performance Metrics"
)
| Accuracy | Precision | Recall | F1_Score |
|---|---|---|---|
| 0.888 | 0.9163714 | 0.9627699 | 0.9389978 |
Random Forest
rf_pred <- predict(
  rf_clf,
  newdata = test_data
)
rf_cm <- table(
  Predicted = rf_pred,
  Actual = test_data$addicted_label
)
rf_cm
##          Actual
## Predicted   0    1
##         0  79   75
##         1  78 1268
rf_accuracy <- sum(diag(rf_cm)) / sum(rf_cm)
rf_TP <- rf_cm["1","1"]
rf_FP <- rf_cm["1","0"]
rf_FN <- rf_cm["0","1"]
rf_precision <- rf_TP / (rf_TP + rf_FP)
rf_recall <- rf_TP / (rf_TP + rf_FN)
rf_f1 <- 2 * rf_precision * rf_recall /
  (rf_precision + rf_recall)
rf_metrics <- data.frame(
  Accuracy = rf_accuracy,
  Precision = rf_precision,
  Recall = rf_recall,
  F1_Score = rf_f1
)
knitr::kable(
  rf_metrics,
  caption = "Random Forest Performance Metrics"
)
| Accuracy | Precision | Recall | F1_Score |
|---|---|---|---|
| 0.898 | 0.9420505 | 0.9441549 | 0.9431015 |
ROC Curve and AUC
## Setting direction: controls < cases
## Area under the curve: 0.9136
Findings
The Logistic Regression model achieved an AUC of 0.914, indicating excellent ability to distinguish between addicted and non-addicted users.
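The ROC code is not shown in the report; the console messages suggest it was produced with the pROC package. To illustrate what the AUC value measures, the sketch below computes it in base R through the rank-sum identity on six hypothetical scores. This is not pROC's implementation, and the helper name `auc_rank` is illustrative.

```r
# AUC = probability that a randomly chosen positive case receives a
# higher score than a randomly chosen negative case (ties count half).
auc_rank <- function(labels, scores) {
  r     <- rank(scores)                 # average ranks handle ties
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical predicted probabilities for six users.
labels <- c(0, 0, 1, 0, 1, 1)
scores <- c(0.10, 0.40, 0.35, 0.20, 0.80, 0.90)

auc_value <- auc_rank(labels, scores)
cat("AUC:", round(auc_value, 3), "\n")
```

Here 8 of the 9 positive-negative pairs are ranked correctly, giving an AUC of about 0.889. Unlike accuracy, this measure does not depend on the 0.5 classification threshold.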
Classification Model Comparison
comparison <- data.frame(
  Model = c("Logistic Regression", "Random Forest"),
  Accuracy = c(accuracy, rf_accuracy),
  Precision = c(precision, rf_precision),
  Recall = c(recall, rf_recall),
  F1_Score = c(f1, rf_f1)
)
knitr::kable(
  comparison,
  caption = "Classification Model Comparison"
)
| Model | Accuracy | Precision | Recall | F1_Score |
|---|---|---|---|---|
| Logistic Regression | 0.888 | 0.9163714 | 0.9627699 | 0.9389978 |
| Random Forest | 0.898 | 0.9420505 | 0.9441549 | 0.9431015 |
Findings
Logistic Regression achieved an accuracy of 88.8%, a precision of 91.6%, a recall of 96.3%, and an F1-score of 93.9%. The model also achieved an AUC of 0.914, indicating excellent discrimination between addicted and non-addicted users.
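Random Forest achieved an accuracy of 89.8%, a precision of 94.2%, a recall of 94.4%, and an F1-score of 94.3%. Compared with Logistic Regression, Random Forest achieved higher accuracy, precision, and F1-score, at the cost of somewhat lower recall (94.4% versus 96.3%).
These figures should be read against the class imbalance in the test set: 1,343 of the 1,500 test users (89.5%) are labelled Addicted, so always predicting “Addicted” would already reach 89.5% accuracy. Logistic Regression (88.8%) falls just below this baseline and Random Forest (89.8%) just above it. The clearer difference lies in the Not Addicted class, where Random Forest correctly identifies 79 of 157 users compared with 39 for Logistic Regression.
Therefore, Random Forest Classification was selected as the best-performing classification model for predicting smartphone addiction status.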
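The reported metrics can be recomputed directly from the two test-set confusion matrices. The sketch below also adds specificity and the accuracy of always predicting the majority class; these two quantities are not part of the original metric tables, and the helper `summarise_cm` is a hypothetical name.

```r
# Confusion matrices copied from the test-set results above
# (rows = predicted, columns = actual; matrix() fills column by column).
cm_log <- matrix(c(39, 118, 50, 1293), nrow = 2,
                 dimnames = list(Predicted = c("0", "1"), Actual = c("0", "1")))
cm_rf  <- matrix(c(79, 78, 75, 1268), nrow = 2,
                 dimnames = list(Predicted = c("0", "1"), Actual = c("0", "1")))

summarise_cm <- function(cm) {
  tp <- cm["1", "1"]; tn <- cm["0", "0"]
  fp <- cm["1", "0"]; fn <- cm["0", "1"]
  c(
    accuracy    = (tp + tn) / sum(cm),
    recall      = tp / (tp + fn),             # share of Addicted users found
    specificity = tn / (tn + fp),             # share of Not Addicted users found
    baseline    = max(colSums(cm)) / sum(cm)  # always predict the majority class
  )
}

cm_summary <- rbind(Logistic     = summarise_cm(cm_log),
                    RandomForest = summarise_cm(cm_rf))
print(round(cm_summary, 3))
```

Reporting specificity alongside accuracy makes the difference between the two models easier to see when one class dominates the data.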
Regression Results
The regression models showed weak predictive performance for estimating daily smartphone screen time. Among the three models, Decision Tree Regression achieved the best performance with the lowest RMSE of 2.630 and the highest R² value of -0.0002. However, all regression models produced negative R² values, indicating that they performed worse than a simple baseline prediction using the average screen time. This suggests that the available demographic and behavioural variables were insufficient to accurately predict daily smartphone screen time.
Classification Results
The classification models performed substantially better than the regression models. Logistic Regression achieved an accuracy of 88.8%, a precision of 91.6%, a recall of 96.3%, an F1-score of 93.9%, and an AUC of 0.914, indicating excellent discrimination between addicted and non-addicted users.
Random Forest Classification achieved an accuracy of 89.8%, a precision of 94.2%, a recall of 94.4%, and an F1-score of 94.3%. The model outperformed Logistic Regression in terms of overall accuracy, precision, and F1-score, demonstrating its ability to capture complex relationships among smartphone usage behaviours.
The strong performance of both classification models suggests that smartphone addiction status is more predictable than daily screen time.
Best Performing Model
For the regression task, Decision Tree Regression was selected as the best-performing model because it achieved the lowest RMSE and highest R² among the regression models. Nevertheless, its predictive performance remained weak.
For the classification task, Random Forest Classification was selected as the best-performing model because it achieved the highest accuracy of 89.8%, together with a precision of 94.2%, recall of 94.4%, and F1-score of 94.3%. These results indicate a strong and balanced classification performance for predicting smartphone addiction status.
Overall, the findings indicate that smartphone addiction can be predicted more successfully than daily screen time using the available dataset.
For the regression task, Decision Tree Regression was selected as the best-performing model because it achieved the lowest RMSE of 2.630 and the highest R² value of -0.0002 among the regression models. However, all regression models demonstrated weak predictive performance, indicating that daily smartphone screen time could not be accurately predicted using the available variables.
For the classification task, Random Forest Classification was selected as the best-performing model because it achieved the highest classification accuracy of 89.8% and effectively captured complex relationships among smartphone usage behaviours.
Overall, the results suggest that smartphone addiction status can be predicted more effectively than daily smartphone screen time, with Random Forest Classification being the most suitable model for identifying smartphone addiction in this study.
Limitations and Future Work
This study has several limitations that should be acknowledged.
First, the weak performance of the regression models suggests that
daily_screen_time_hours may not be well explained by the
demographic and behavioural variables available in this dataset. Future
work could explore additional features (e.g. app category breakdowns,
time-of-day usage patterns, or self-reported motivation for phone use)
that may have stronger predictive relationships with screen time.
Second, the Hours Consistency Check revealed that 60.7% of observations had combined activity hours (social media, gaming, work/study) exceeding the reported daily screen time. This suggests possible self-reporting inconsistencies in the original data collection, which may limit the reliability of behavioural variables as predictors.
Third, the strong correlation (r ≈ 0.96) between
weekend_screen_time and
daily_screen_time_hours meant that
weekend_screen_time had to be excluded from the regression
model to avoid target leakage. Future studies could treat these as
related but distinct outcomes and model them separately, or investigate
the underlying relationship further.
Finally, the classification models were trained using a fixed
train-test split (80/20) with a single random seed. Future work could
apply k-fold cross-validation and hyperparameter tuning (e.g. via
caret::train()) to obtain more robust performance estimates
and potentially improve both the regression and classification
results.
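As an illustration of the k-fold idea mentioned above, the following sketch runs 5-fold cross-validation for a logistic regression in base R. It uses simulated data with a single hypothetical predictor (`screen_time`) and is not the study's pipeline; `caret::train()` would wrap the same loop and add hyperparameter tuning.

```r
set.seed(123)
n <- 600

# Simulated data: the outcome depends on one hypothetical predictor.
sim <- data.frame(screen_time = runif(n, 3, 12))
sim$addicted <- rbinom(n, 1, plogis(-4 + 0.7 * sim$screen_time))

# Assign each row to one of k folds at random.
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = n))

# Fit on k - 1 folds, evaluate on the held-out fold, and repeat.
fold_accuracy <- sapply(seq_len(k), function(i) {
  train <- sim[fold_id != i, ]
  test  <- sim[fold_id == i, ]
  fit   <- glm(addicted ~ screen_time, family = binomial, data = train)
  prob  <- predict(fit, newdata = test, type = "response")
  mean(as.integer(prob > 0.5) == test$addicted)
})

cat("Accuracy per fold:", round(fold_accuracy, 3), "\n")
cat("Mean accuracy:", round(mean(fold_accuracy), 3),
    " SD:", round(sd(fold_accuracy), 3), "\n")
```

The spread across folds gives a sense of how much a single 80/20 split could move the reported accuracy.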