Smartphones have become an essential part of modern life. While they provide numerous benefits in communication, education and entertainment, excessive usage may lead to behavioural addiction, stress, and reduced academic productivity.
The increasing prevalence of smartphone addiction among students and working adults has raised concerns regarding its impact on daily activities and mental well-being.
How many hours is a person likely to spend on their smartphone during the weekend, given their weekday behaviour and personal habits?
Can a user be classified as “Addicted” or “Not Addicted” based on their daily phone usage habits and lifestyle patterns?
The analysis loads the following packages: tidyverse 2.0.0 (dplyr 1.2.1, forcats 1.0.1, ggplot2 4.0.3, lubridate 1.9.5, purrr 1.2.2, readr 2.2.0, stringr 1.6.0, tibble 3.3.1, tidyr 1.3.2), corrplot 0.95, car, caret, randomForest 4.7-1.2, and pROC.
getCurrentFileLocation <- function() {
  cmdArgs <- commandArgs(trailingOnly = FALSE)
  needle <- "--file="
  match <- grep(needle, cmdArgs)
  if (length(match) > 0) {
    return(dirname(normalizePath(sub(needle, "", cmdArgs[match]))))
  } else {
    for (f in sys.frames()) {
      if (!is.null(f$ofile)) return(dirname(normalizePath(f$ofile)))
    }
  }
  # Fallback for interactive RStudio sessions:
  if (requireNamespace("rstudioapi", quietly = TRUE) &&
      rstudioapi::isAvailable()) {
    return(dirname(rstudioapi::getActiveDocumentContext()$path))
  }
  return(NULL)
}
script_dir <- getCurrentFileLocation()
if (!is.null(script_dir)) setwd(script_dir)
# Output folders (created inside Compile/)
dir.create("models", showWarnings = FALSE)
dir.create("plots", showWarnings = FALSE)
# Absolute paths to the raw and cleaned datasets; adjust these for your machine
RAW_DATA_PATH <- "/Users/thinkacademy/Downloads/Smartphone_Usage_And_Addiction_Analysis_7500_Rows.csv"
CLEANED_DATA_PATH <- "/Users/thinkacademy/Downloads/Smartphone_Cleaned-2.csv"
# Read the raw data through the constant defined above instead of repeating the path
df <- read.csv(
  RAW_DATA_PATH,
  stringsAsFactors = FALSE
)

The dataset used in this study is Smartphone Usage and Addiction Analysis, containing information on smartphone usage behaviour, demographic characteristics, stress levels, academic or work impact, and addiction status.
The dataset consists of 7,500 observations and 16 variables, resulting in 120,000 data points, which satisfies the project requirement of analysing more than 100,000 data points.
data.frame(
  Rows = nrow(df),
  Columns = ncol(df),
  Datapoints = nrow(df) * ncol(df)
) |>
  knitr::kable(caption = "Dataset Overview")

| Rows | Columns | Datapoints |
|---|---|---|
| 7500 | 16 | 120000 |
## 'data.frame': 7500 obs. of 16 variables:
## $ transaction_id : chr "TXN00001" "TXN00002" "TXN00003" "TXN00004" ...
## $ user_id : chr "U00001" "U00002" "U00003" "U00004" ...
## $ age : int 21 24 31 32 25 26 25 26 21 35 ...
## $ gender : chr "Male" "Other" "Other" "Other" ...
## $ daily_screen_time_hours: num 3.23 5.09 6.06 7.83 9.96 9.32 10.4 4.26 4.38 9.76 ...
## $ social_media_hours : num 2.01 3.81 1.36 5.85 5.92 4.26 4.93 4.6 1.38 4.73 ...
## $ gaming_hours : num 0.89 2.24 3.83 1.51 3.42 0.29 1.6 2.16 2.72 1.36 ...
## $ work_study_hours : num 4.55 4.44 2.35 3.54 5.27 3.99 0.86 4.61 3.78 2.11 ...
## $ sleep_hours : num 7.55 7.66 4.92 8.23 6.21 6.9 8.61 6.43 6.23 5.21 ...
## $ notifications_per_day : int 248 127 44 178 136 82 165 169 172 20 ...
## $ app_opens_per_day : int 154 71 106 107 177 56 95 117 134 82 ...
## $ weekend_screen_time : num 3.95 6.71 8.68 9.77 12.55 ...
## $ stress_level : chr "Medium" "Medium" "High" "High" ...
## $ academic_work_impact : chr "Yes" "Yes" "No" "Yes" ...
## $ addiction_level : chr "None" "None" "Mild" "Moderate" ...
## $ addicted_label : int 0 0 0 1 1 1 1 1 0 1 ...
## transaction_id user_id age gender
## Length:7500 Length:7500 Min. :18.00 Length:7500
## Class :character Class :character 1st Qu.:22.00 Class :character
## Mode :character Mode :character Median :27.00 Mode :character
## Mean :26.57
## 3rd Qu.:31.00
## Max. :35.00
## daily_screen_time_hours social_media_hours gaming_hours work_study_hours
## Min. : 3.000 Min. :0.500 Min. :0.000 Min. :0.500
## 1st Qu.: 5.220 1st Qu.:1.910 1st Qu.:1.020 1st Qu.:1.850
## Median : 7.525 Median :3.270 Median :2.040 Median :3.230
## Mean : 7.500 Mean :3.273 Mean :2.014 Mean :3.242
## 3rd Qu.: 9.810 3rd Qu.:4.630 3rd Qu.:2.990 3rd Qu.:4.640
## Max. :12.000 Max. :6.000 Max. :4.000 Max. :6.000
## sleep_hours notifications_per_day app_opens_per_day weekend_screen_time
## Min. :4.500 Min. : 20.0 Min. : 15.00 Min. : 3.580
## 1st Qu.:5.630 1st Qu.: 76.0 1st Qu.: 55.00 1st Qu.: 6.960
## Median :6.720 Median :134.0 Median : 98.00 Median : 9.260
## Mean :6.738 Mean :134.3 Mean : 97.83 Mean : 9.244
## 3rd Qu.:7.840 3rd Qu.:191.0 3rd Qu.:140.00 3rd Qu.:11.540
## Max. :9.000 Max. :250.0 Max. :180.00 Max. :14.880
## stress_level academic_work_impact addiction_level addicted_label
## Length:7500 Length:7500 Length:7500 Min. :0.0000
## Class :character Class :character Class :character 1st Qu.:0.0000
## Mode :character Mode :character Mode :character Median :1.0000
## Mean :0.7077
## 3rd Qu.:1.0000
## Max. :1.0000
The variables can be grouped into the following categories:
- Identifiers: transaction_id, user_id
- Demographics: age, gender
- Usage behaviour: daily_screen_time_hours, social_media_hours, gaming_hours, work_study_hours, sleep_hours, notifications_per_day, app_opens_per_day, weekend_screen_time
- Well-being and addiction outcomes: stress_level, academic_work_impact, addiction_level, addicted_label

Initial Findings
The dataset contains both numerical and categorical variables and is suitable for both regression and classification modelling. The large sample size provides sufficient data for reliable predictive analysis.
Data cleaning was performed to ensure the dataset was complete, consistent, and suitable for predictive modelling.
The dataset was first examined for missing values across all variables.
## transaction_id user_id age
## 0 0 0
## gender daily_screen_time_hours social_media_hours
## 0 0 0
## gaming_hours work_study_hours sleep_hours
## 0 0 0
## notifications_per_day app_opens_per_day weekend_screen_time
## 0 0 0
## stress_level academic_work_impact addiction_level
## 0 0 0
## addicted_label
## 0
Findings
No missing values were identified in any of the variables.
Although the variable addiction_level contains the
category “None”, this represents users with no
smartphone addiction and is a valid category rather than a missing
value.
Duplicate observations were checked to ensure that each row represented a unique user record.
## Number of duplicate rows: 0
Findings
No duplicate records were detected in the dataset. Therefore, no observations were removed.
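The code behind the missing-value and duplicate checks is not shown in the report. A minimal sketch of how such checks are commonly written in base R follows; the `toy` data frame is hypothetical and stands in for `df`.

```r
# Hypothetical toy data standing in for the real dataset
toy <- data.frame(
  user_id = c("U1", "U2", "U3", "U3"),
  age = c(21, NA, 30, 30),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
missing_per_column <- colSums(is.na(toy))

# Number of rows that exactly repeat an earlier row
n_duplicates <- sum(duplicated(toy))

print(missing_per_column)
cat("Number of duplicate rows:", n_duplicates, "\n")
```

In the toy data the fourth row repeats the third and one age is missing, so both checks report a non-zero count; on the real dataset both counts were zero.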
The relationship between addiction_level and
addicted_label was verified to ensure consistency.
##
## Mild Moderate None Severe
## 1373 2874 819 2434
##
## 0 1
## Mild 1373 0
## Moderate 0 2874
## None 819 0
## Severe 0 2434
Findings
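The cross-tabulation shows that, in the original addicted_label, users classified as None or Mild correspond to 0 (Not Addicted), while users classified as Moderate or Severe correspond to 1 (Addicted). Each addiction level maps to exactly one label, so the two variables are consistent.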
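The mapping can also be checked programmatically rather than by eye. The sketch below is illustrative only: the matrix `xtab` is typed in by hand from the cross-tabulation above, whereas in practice it would come from `table(df$addiction_level, df$addicted_label)`.

```r
# Counts copied from the cross-tabulation (rows: addiction level, columns: label)
xtab <- matrix(
  c(1373, 0,
    0, 2874,
    819, 0,
    0, 2434),
  nrow = 4, byrow = TRUE,
  dimnames = list(level = c("Mild", "Moderate", "None", "Severe"),
                  label = c("0", "1"))
)

# A level maps cleanly if exactly one label column is non-zero
maps_cleanly <- rowSums(xtab > 0) == 1

# Label that each level is assigned to
assigned_label <- colnames(xtab)[apply(xtab, 1, which.max)]
names(assigned_label) <- rownames(xtab)

print(maps_cleanly)
print(assigned_label)
```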
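To support classification modelling, a binary target variable was re-derived from the original addiction categories, with only “None” coded as 0. Note that this recodes Mild users as 1 (Addicted), whereas the original addicted_label coded them as 0.
# Recode: "None" -> 0 (Not Addicted); Mild, Moderate, Severe -> 1 (Addicted)
df$addicted_label <- ifelse(df$addiction_level == "None", 0, 1)
df$addicted_label <- factor(df$addicted_label, levels = c(0,1))

| Label | Description |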
|---|---|
| 0 | Not Addicted |
| 1 | Addicted |
This transformation simplifies the classification problem into a binary prediction task.
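To see the class balance this recoding produces, the category counts reported earlier can be aggregated by the new label. This is a sketch under the assumption that those counts are typed in by hand; `level_counts` is a hypothetical helper vector, not an object from the original script.

```r
# Category counts as reported in the addiction_level frequency table
level_counts <- c(Mild = 1373, Moderate = 2874, None = 819, Severe = 2434)

# Same rule as the recoding step: only "None" becomes 0
recoded <- ifelse(names(level_counts) == "None", 0, 1)

# Total users per binary class and their share of the dataset
class_counts <- tapply(level_counts, recoded, sum)
class_share <- class_counts / sum(level_counts)

print(class_counts)
print(round(class_share, 4))
```

Under this rule roughly 89% of users fall into class 1, so the binary target is strongly imbalanced, which matters when accuracy is later used as the main evaluation metric.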
Categorical variables were converted into factor data types to ensure correct handling during statistical analysis and machine learning modelling.
df$gender <- as.factor(df$gender)
df$stress_level <- as.factor(df$stress_level)
df$academic_work_impact <- as.factor(df$academic_work_impact)
df$addiction_level <- as.factor(df$addiction_level)
df$addicted_label <- as.factor(df$addicted_label)
str(df)
## 'data.frame': 7500 obs. of 16 variables:
## $ transaction_id : chr "TXN00001" "TXN00002" "TXN00003" "TXN00004" ...
## $ user_id : chr "U00001" "U00002" "U00003" "U00004" ...
## $ age : int 21 24 31 32 25 26 25 26 21 35 ...
## $ gender : Factor w/ 3 levels "Female","Male",..: 2 3 3 3 2 2 2 2 3 3 ...
## $ daily_screen_time_hours: num 3.23 5.09 6.06 7.83 9.96 9.32 10.4 4.26 4.38 9.76 ...
## $ social_media_hours : num 2.01 3.81 1.36 5.85 5.92 4.26 4.93 4.6 1.38 4.73 ...
## $ gaming_hours : num 0.89 2.24 3.83 1.51 3.42 0.29 1.6 2.16 2.72 1.36 ...
## $ work_study_hours : num 4.55 4.44 2.35 3.54 5.27 3.99 0.86 4.61 3.78 2.11 ...
## $ sleep_hours : num 7.55 7.66 4.92 8.23 6.21 6.9 8.61 6.43 6.23 5.21 ...
## $ notifications_per_day : int 248 127 44 178 136 82 165 169 172 20 ...
## $ app_opens_per_day : int 154 71 106 107 177 56 95 117 134 82 ...
## $ weekend_screen_time : num 3.95 6.71 8.68 9.77 12.55 ...
## $ stress_level : Factor w/ 3 levels "High","Low","Medium": 3 3 1 1 2 3 3 2 1 2 ...
## $ academic_work_impact : Factor w/ 2 levels "No","Yes": 2 2 1 2 1 2 1 1 2 2 ...
## $ addiction_level : Factor w/ 4 levels "Mild","Moderate",..: 3 3 1 2 4 4 4 2 3 4 ...
## $ addicted_label : Factor w/ 2 levels "0","1": 1 1 2 2 2 2 2 2 1 2 ...
Findings
The following variables were successfully converted into factors:
- gender
- stress_level
- academic_work_impact
- addiction_level
- addicted_label

The final dataset was reviewed before proceeding to exploratory analysis and predictive modelling.
data.frame(
  Rows = nrow(df),
  Columns = ncol(df),
  Missing_Values = sum(is.na(df)),
  Duplicate_Rows = sum(duplicated(df))
) |>
  knitr::kable(
    caption = "Final Dataset Summary"
  )

| Rows | Columns | Missing_Values | Duplicate_Rows |
|---|---|---|---|
| 7500 | 16 | 0 | 0 |
The dataset was successfully cleaned and prepared for analysis. No missing values or duplicate records were identified, addiction labels were validated, and categorical variables were transformed appropriately for modelling.
Exploratory Data Analysis (EDA) was conducted to understand smartphone usage patterns, identify relationships between variables, detect potential anomalies, and gain insights before predictive modelling.
The distribution of smartphone addiction levels was examined to understand the prevalence of addiction within the dataset.
ggplot(df,
       aes(x = addiction_level,
           fill = addiction_level)) +
  geom_bar() +
  labs(
    title = "Distribution of Addiction Levels",
    x = "Addiction Level",
    y = "Frequency"
  ) +
  theme_minimal()
Findings
The plot shows the number of users in each addiction category. Moderate is the largest group (2,874 users), followed by Severe (2,434), Mild (1,373), and None (819), so the target variable is skewed towards the higher addiction levels.
The gender distribution of users was analysed to understand the demographic composition of the dataset.
ggplot(df,
       aes(x = gender,
           fill = gender)) +
  geom_bar() +
  labs(
    title = "Gender Distribution",
    x = "Gender",
    y = "Count"
  ) +
  theme_minimal()
Findings
The dataset contains users from three gender groups (Female, Male, and Other), providing some demographic diversity within the sample.
The distribution of daily screen time was examined to understand overall smartphone usage behaviour.
ggplot(df,
       aes(x = daily_screen_time_hours)) +
  geom_histogram(
    bins = 30,
    fill = "steelblue",
    color = "white"
  ) +
  labs(
    title = "Distribution of Daily Screen Time",
    x = "Daily Screen Time (Hours)",
    y = "Frequency"
  ) +
  theme_minimal()
Findings
The histogram summarises users’ daily screen time, which ranges from 3 to 12 hours with a mean of 7.50 hours and a median of 7.53 hours.
Boxplots were used to identify potential outliers in the numerical variables.
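numeric_cols <- df %>%
  select(where(is.numeric))
# Nine numeric variables remain after the factor conversion,
# so a 3 x 3 grid keeps all boxplots on one page
old_par <- par(mfrow = c(3, 3))
for (col in names(numeric_cols)) {
  boxplot(
    numeric_cols[[col]],
    main = col,
    col = "lightblue"
  )
}
par(old_par)  # restore the previous plotting layout
Findings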
Several variables contain naturally occurring extreme values. However, no severe outliers were identified that required removal.
A consistency check was performed to determine whether the combined hours spent on social media, gaming, and work/study exceeded the reported daily screen time.
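inconsistent_rows <- sum(
  (df$social_media_hours +
     df$gaming_hours +
     df$work_study_hours) >
    df$daily_screen_time_hours
)
cat("Rows with hour inconsistency:", inconsistent_rows, "\n")
# Share of inconsistent rows, rounded to one decimal place
cat("Percentage:", round(100 * inconsistent_rows / nrow(df), 1), "%\n")
## Rows with hour inconsistency: 4553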
## Percentage: 60.7 %
Findings
A total of 4,553 observations (60.7%) exhibited inconsistencies where the combined activity hours exceeded the reported daily screen time.
Interpretation
This suggests that users may have reported overlapping activities or estimated usage differently. Therefore, these variables were treated as independent behavioural indicators.
A correlation heatmap was generated to examine relationships among numerical variables.
numeric_df <- df %>%
  select(where(is.numeric))
cor_matrix <- cor(
  numeric_df,
  use = "complete.obs"
)
corrplot(
  cor_matrix,
  method = "color",
  type = "upper",
  tl.cex = 0.7,
  number.cex = 0.7
)
Findings
The strongest correlation was observed between weekend_screen_time and daily_screen_time_hours (r ≈ 0.96), indicating a potential target leakage issue for regression modelling.
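Reading the strongest pair off a heatmap can be error-prone, so it may help to extract it from the correlation matrix directly. The following is a minimal sketch on simulated data; the columns `a`, `b`, and `c` are hypothetical and are not variables from the dataset.

```r
set.seed(1)
x <- rnorm(200)
toy <- data.frame(
  a = x,
  b = x + rnorm(200, sd = 0.2),  # constructed to be strongly related to a
  c = rnorm(200)                 # unrelated noise
)

cor_matrix <- cor(toy)

# Blank out the diagonal and the duplicated lower triangle
cor_matrix[lower.tri(cor_matrix, diag = TRUE)] <- NA

# Locate the pair with the largest absolute correlation
idx <- which(abs(cor_matrix) == max(abs(cor_matrix), na.rm = TRUE),
             arr.ind = TRUE)
strongest <- c(rownames(cor_matrix)[idx[1, "row"]],
               colnames(cor_matrix)[idx[1, "col"]])

print(strongest)
print(round(cor_matrix[idx[1, "row"], idx[1, "col"]], 2))
```

A pair flagged this way is a candidate for removal when one of the two variables is the prediction target, as was done with weekend_screen_time in the regression stage.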
Feature scaling was performed to standardise numerical variables before modelling.
numeric_features <- df %>%
  select(
    age,
    daily_screen_time_hours,
    social_media_hours,
    gaming_hours,
    work_study_hours,
    notifications_per_day
  )
scaled_features <- scale(numeric_features)
head(scaled_features)
## age daily_screen_time_hours social_media_hours gaming_hours
## [1,] -1.0715189 -1.6364910 -0.7969791 -0.9809287
## [2,] -0.4942749 -0.9236255 0.3384230 0.1970416
## [3,] 0.8526280 -0.5518622 -1.2069853 1.5844287
## [4,] 1.0450427 0.1265099 1.6252119 -0.4399349
## [5,] -0.3018602 0.9428560 1.6693665 1.2266748
## [6,] -0.1094455 0.6975689 0.6222735 -1.5044710
## work_study_hours notifications_per_day
## [1,] 0.8168470 1.70818428
## [2,] 0.7481298 -0.10899044
## [3,] -0.5574960 -1.35548219
## [4,] 0.1858986 0.65692618
## [5,] 1.2666320 0.02617132
## [6,] 0.4670142 -0.78479922
Findings
Feature scaling transformed variables to a common scale with mean 0 and standard deviation 1, preventing variables with larger ranges from dominating the models.
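As a small illustration of what `scale()` does, the sketch below standardises a hypothetical two-column data frame and compares one column with the manual z-score formula, (x - mean) / sd. The `toy` values are made up for the example.

```r
# Hypothetical values on very different scales
toy <- data.frame(
  hours = c(3, 5, 7, 9, 12),
  notifications = c(20, 80, 130, 190, 250)
)

# scale() centres each column on its mean and divides by its standard deviation
scaled <- scale(toy)

# The same transformation written out by hand for one column
manual_hours <- (toy$hours - mean(toy$hours)) / sd(toy$hours)

print(round(colMeans(scaled), 10))  # column means after scaling
print(apply(scaled, 2, sd))         # column standard deviations after scaling
print(all.equal(as.numeric(scaled[, "hours"]), manual_hours))
```

When scaled features feed a model that is evaluated on held-out data, the centring and scaling values are usually estimated on the training rows only and then applied to the test rows.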
Variance Inflation Factor (VIF) analysis was conducted to assess multicollinearity among predictor variables.
vif_model <- lm(
  as.numeric(as.character(addicted_label)) ~
    age +
    daily_screen_time_hours +
    social_media_hours +
    gaming_hours +
    work_study_hours +
    notifications_per_day,
  data = df
)
car::vif(vif_model)
## age daily_screen_time_hours social_media_hours
## 1.001138 1.000284 1.000189
## gaming_hours work_study_hours notifications_per_day
## 1.000945 1.000638 1.000627
Findings
All VIF values were below the commonly accepted threshold of 5, indicating that multicollinearity is not a major concern.
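For a numeric predictor, the VIF is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The response variable plays no role, which is why the choice of outcome in the model above does not affect the values. The sketch below computes this by hand on simulated data; `x1`, `x2`, and `x3` are hypothetical, and `x3` is deliberately built to be collinear with `x1`.

```r
set.seed(42)
n <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- 0.9 * x1 + rnorm(n, sd = 0.3)  # deliberately collinear with x1
predictors <- data.frame(x1, x2, x3)

# VIF for each predictor: regress it on the remaining predictors
manual_vif <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  fit <- lm(reformulate(others, response = v), data = predictors)
  1 / (1 - summary(fit)$r.squared)
})

print(round(manual_vif, 2))
```

Here `x1` and `x3` receive large VIF values while `x2` stays near 1, which is the pattern that values close to 1 in the report rule out.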
A boxplot was generated to examine the relationship between daily screen time and addiction level.
ggplot(
  df,
  aes(
    x = addiction_level,
    y = daily_screen_time_hours,
    fill = addiction_level
  )
) +
  geom_boxplot() +
  labs(
    title = "Daily Screen Time by Addiction Level",
    x = "Addiction Level",
    y = "Daily Screen Time (Hours)"
  ) +
  theme_minimal()
Findings
Users with higher addiction levels generally tend to spend more time on their smartphones compared to users with lower addiction levels.
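Summary of EDA Findings
The exploratory analysis revealed several important insights:
- Moderate and Severe are the most common addiction levels, while None is the least common.
- The combined social media, gaming, and work/study hours exceed the reported daily screen time in 60.7% of observations.
- weekend_screen_time and daily_screen_time_hours are very strongly correlated (r ≈ 0.96).
- All VIF values are close to 1, so multicollinearity among the predictors is not a concern.
- Daily screen time tends to be higher for users with higher addiction levels.

These findings provide useful guidance for the regression and classification modelling stages.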
Objective
The objective of this analysis is to predict users’ daily screen time using demographic and behavioural variables.
# Remove ID columns
df_model <- df %>%
  select(-transaction_id, -user_id)
# Remove leakage variables
df_model <- df_model %>%
  select(
    -weekend_screen_time,
    -addiction_level,
    -addicted_label
  )
str(df_model)
## 'data.frame': 7500 obs. of 11 variables:
## $ age : int 21 24 31 32 25 26 25 26 21 35 ...
## $ gender : Factor w/ 3 levels "Female","Male",..: 2 3 3 3 2 2 2 2 3 3 ...
## $ daily_screen_time_hours: num 3.23 5.09 6.06 7.83 9.96 9.32 10.4 4.26 4.38 9.76 ...
## $ social_media_hours : num 2.01 3.81 1.36 5.85 5.92 4.26 4.93 4.6 1.38 4.73 ...
## $ gaming_hours : num 0.89 2.24 3.83 1.51 3.42 0.29 1.6 2.16 2.72 1.36 ...
## $ work_study_hours : num 4.55 4.44 2.35 3.54 5.27 3.99 0.86 4.61 3.78 2.11 ...
## $ sleep_hours : num 7.55 7.66 4.92 8.23 6.21 6.9 8.61 6.43 6.23 5.21 ...
## $ notifications_per_day : int 248 127 44 178 136 82 165 169 172 20 ...
## $ app_opens_per_day : int 154 71 106 107 177 56 95 117 134 82 ...
## $ stress_level : Factor w/ 3 levels "High","Low","Medium": 3 3 1 1 2 3 3 2 1 2 ...
## $ academic_work_impact : Factor w/ 2 levels "No","Yes": 2 2 1 2 1 2 1 1 2 2 ...
Train-Test Split
The dataset was divided into 80% training data and 20% testing data.
set.seed(42)
train_index <- sample(
  1:nrow(df_model),
  size = 0.8 * nrow(df_model)
)
train_data <- df_model[train_index, ]
test_data <- df_model[-train_index, ]
##
## Call:
## lm(formula = daily_screen_time_hours ~ ., data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7187 -2.2511 -0.0103 2.2741 4.8068
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.7957174 0.3014412 22.544 <2e-16 ***
## age 0.0086624 0.0064524 1.343 0.1795
## genderMale 0.1310800 0.0822951 1.593 0.1113
## genderOther 0.0442823 0.0826324 0.536 0.5921
## social_media_hours 0.0177288 0.0212670 0.834 0.4045
## gaming_hours -0.0035257 0.0294221 -0.120 0.9046
## work_study_hours 0.0122431 0.0210255 0.582 0.5604
## sleep_hours 0.0357577 0.0262755 1.361 0.1736
## notifications_per_day 0.0002135 0.0005029 0.424 0.6713
## app_opens_per_day 0.0013557 0.0006939 1.954 0.0508 .
## stress_levelLow -0.0235104 0.0818911 -0.287 0.7741
## stress_levelMedium -0.1996135 0.0824594 -2.421 0.0155 *
## academic_work_impactYes -0.0254948 0.0672433 -0.379 0.7046
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.603 on 5987 degrees of freedom
## Multiple R-squared: 0.003012, Adjusted R-squared: 0.001014
## F-statistic: 1.507 on 12 and 5987 DF, p-value: 0.1134
## n= 6000
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 6000 40673.78 7.492087 *
rf_model <- randomForest(
  daily_screen_time_hours ~ .,
  data = train_data,
  ntree = 100,
  importance = TRUE
)
rf_model
##
## Call:
## randomForest(formula = daily_screen_time_hours ~ ., data = train_data, ntree = 100, importance = TRUE)
## Type of random forest: regression
## Number of trees: 100
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 7.061841
## % Var explained: -4.17
Regression Model Evaluation
Evaluation Metrics
get_metrics <- function(actual, predicted) {
  # Root mean squared error
  rmse <- sqrt(mean((actual - predicted)^2))
  # R-squared relative to predicting the mean of the actual values
  r2 <- 1 -
    sum((actual - predicted)^2) /
      sum((actual - mean(actual))^2)
  data.frame(
    RMSE = rmse,
    R2 = r2
  )
}
mlr_pred <- predict(mlr_model, test_data)
dt_pred <- predict(dt_model, test_data)
rf_pred <- predict(rf_model, test_data)
results_df <- rbind(
  cbind(Model = "MLR",
        get_metrics(test_data$daily_screen_time_hours,
                    mlr_pred)),
  cbind(Model = "Decision Tree",
        get_metrics(test_data$daily_screen_time_hours,
                    dt_pred)),
  cbind(Model = "Random Forest",
        get_metrics(test_data$daily_screen_time_hours,
                    rf_pred))
)
knitr::kable(
  results_df,
  caption = "Regression Model Performance"
)

| Model | RMSE | R2 |
|---|---|---|
| MLR | 2.632864 | -0.0020657 |
| Decision Tree | 2.630440 | -0.0002213 |
| Random Forest | 2.644456 | -0.0109090 |
Findings
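The model with the highest R² and lowest RMSE on the test set is considered the best regression model. By that criterion the Decision Tree ranks first (R² = -0.0002, RMSE = 2.630), but every model has a test R² at or below zero, so none improves on simply predicting the mean daily screen time.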
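A negative test R² can look surprising, so here is a small worked example using the same R² formula as `get_metrics`. The numbers are hypothetical: predicting the mean of the test values gives R² of exactly 0, and any constant prediction that misses that mean, such as a mean learned from different training data, gives a slightly negative value.

```r
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# Hypothetical test-set values with mean 8
actual <- c(4, 6, 8, 10, 12)

# Constant prediction equal to the test mean
r2_test_mean <- r_squared(actual, rep(mean(actual), length(actual)))

# Constant prediction slightly off the test mean (e.g. a training-set mean)
r2_off_mean <- r_squared(actual, rep(7.5, length(actual)))

print(r2_test_mean)
print(r2_off_mean)
```

This matches the behaviour of the decision tree in the report, which consists of a root node only and therefore predicts a single constant for every user.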
Objective
The objective of this analysis is to predict smartphone addiction status.
Feature Selection
classification_df <- df[, c(
  "addicted_label",
  "age",
  "gender",
  "daily_screen_time_hours",
  "social_media_hours",
  "gaming_hours",
  "work_study_hours",
  "notifications_per_day",
  "stress_level",
  "academic_work_impact"
)]
Train-Test Split
set.seed(123)
train_index <- sample(
  1:nrow(classification_df),
  0.8 * nrow(classification_df)
)
train_data <- classification_df[train_index, ]
test_data <- classification_df[-train_index, ]
##
## Call:
## glm(formula = addicted_label ~ ., family = "binomial", data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.9191380 0.4067275 -12.094 <2e-16 ***
## age -0.0025476 0.0098552 -0.259 0.796
## genderMale 0.0922154 0.1245716 0.740 0.459
## genderOther 0.1041416 0.1242206 0.838 0.402
## daily_screen_time_hours 0.9024571 0.0370023 24.389 <2e-16 ***
## social_media_hours 0.6919513 0.0375378 18.433 <2e-16 ***
## gaming_hours -0.0103911 0.0446703 -0.233 0.816
## work_study_hours -0.0464739 0.0314214 -1.479 0.139
## notifications_per_day -0.0004680 0.0007618 -0.614 0.539
## stress_levelLow 0.1542275 0.1250333 1.233 0.217
## stress_levelMedium 0.0934942 0.1235301 0.757 0.449
## academic_work_impactYes 0.0304168 0.1015260 0.300 0.764
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4166.5 on 5999 degrees of freedom
## Residual deviance: 2533.0 on 5988 degrees of freedom
## AIC: 2557
##
## Number of Fisher Scoring iterations: 7
rf_clf <- randomForest(
  addicted_label ~ .,
  data = train_data,
  ntree = 200,
  importance = TRUE
)
rf_clf
##
## Call:
## randomForest(formula = addicted_label ~ ., data = train_data, ntree = 200, importance = TRUE)
## Type of random forest: classification
## Number of trees: 200
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 10.23%
## Confusion matrix:
## 0 1 class.error
## 0 340 322 0.48640483
## 1 292 5046 0.05470214
Variable Importance
pred_prob <- predict(
  log_model,
  newdata = test_data,
  type = "response"
)
pred_class <- factor(
  ifelse(pred_prob > 0.5, 1, 0),
  levels = c(0, 1)
)
cm <- table(
  Predicted = pred_class,
  Actual = test_data$addicted_label
)
cm
## Actual
## Predicted 0 1
## 0 39 50
## 1 118 1293
Performance Metrics
accuracy <- sum(diag(cm)) / sum(cm)
# Rows of cm are predictions, columns are actual labels; class "1" is positive
TP <- cm["1", "1"]
TN <- cm["0", "0"]
FP <- cm["1", "0"]
FN <- cm["0", "1"]
precision <- TP / (TP + FP)
recall <- TP / (TP + FN)
f1 <- 2 * precision * recall /
  (precision + recall)
metrics <- data.frame(
  Accuracy = accuracy,
  Precision = precision,
  Recall = recall,
  F1_Score = f1
)
knitr::kable(metrics)

| Accuracy | Precision | Recall | F1_Score |
|---|---|---|---|
| 0.888 | 0.9163714 | 0.9627699 | 0.9389978 |
## Setting direction: controls < cases
## Area under the curve: 0.9136
Findings
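The area under the ROC curve is 0.9136. A higher AUC indicates better discrimination between addicted and non-addicted users, with 0.5 corresponding to random guessing and 1 to perfect separation.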
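The AUC can be read as the probability that a randomly chosen addicted user receives a higher predicted score than a randomly chosen non-addicted user. The sketch below computes it from ranks on a tiny made-up example; `auc_rank` is a hypothetical helper written for illustration and is not the pROC implementation used in the report.

```r
auc_rank <- function(scores, labels) {
  # labels: 1 = positive (case), 0 = negative (control)
  r <- rank(scores)  # average ranks are used for tied scores
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  # Rank-sum (Mann-Whitney) form of the AUC
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical predicted probabilities and true labels
scores <- c(0.10, 0.40, 0.35, 0.80, 0.70, 0.20)
labels <- c(0, 0, 1, 1, 1, 0)

auc_value <- auc_rank(scores, labels)
print(auc_value)
```

In this example 8 of the 9 positive-negative pairs are ordered correctly, so the AUC is 8/9.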
rf_pred <- predict(
  rf_clf,
  newdata = test_data
)
rf_cm <- confusionMatrix(
  rf_pred,
  test_data$addicted_label
)
rf_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 79 75
## 1 78 1268
##
## Accuracy : 0.898
## 95% CI : (0.8816, 0.9129)
## No Information Rate : 0.8953
## P-Value [Acc > NIR] : 0.3878
##
## Kappa : 0.4511
##
## Mcnemar's Test P-Value : 0.8715
##
## Sensitivity : 0.50318
## Specificity : 0.94415
## Pos Pred Value : 0.51299
## Neg Pred Value : 0.94205
## Prevalence : 0.10467
## Detection Rate : 0.05267
## Detection Prevalence : 0.10267
## Balanced Accuracy : 0.72367
##
## 'Positive' Class : 0
##
Classification Model Comparison
comparison <- data.frame(
  Model = c(
    "Logistic Regression",
    "Random Forest"
  ),
  # unname() drops the "Accuracy" name so it does not become a row name
  Accuracy = c(
    accuracy,
    unname(rf_cm$overall["Accuracy"])
  )
)
knitr::kable(
  comparison,
  caption = "Classification Model Comparison"
)

| Model | Accuracy |
|---|---|
| Logistic Regression | 0.888 |
| Random Forest | 0.898 |
Findings
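The model with the highest classification accuracy is considered the best-performing classifier. Random Forest achieved a test accuracy of 0.898 compared with 0.888 for logistic regression, although the margin is small and the Random Forest accuracy is not significantly above the no-information rate of 0.8953 (p = 0.3878).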
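Because about 90% of test users belong to class 1, accuracy alone says little about how well the minority class is detected. The sketch below recomputes accuracy, the no-information rate, and balanced accuracy from the Random Forest confusion matrix reported above; the counts are typed in by hand, and `rf_counts` is a hypothetical name.

```r
# Counts copied from the Random Forest confusion matrix
rf_counts <- matrix(
  c(79, 75,
    78, 1268),
  nrow = 2, byrow = TRUE,
  dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1"))
)

# Overall accuracy: correct predictions over all predictions
accuracy_rf <- sum(diag(rf_counts)) / sum(rf_counts)

# No-information rate: accuracy of always predicting the majority class
nir <- max(colSums(rf_counts)) / sum(rf_counts)

# Recall for each class, and their average (balanced accuracy)
recall_per_class <- diag(rf_counts) / colSums(rf_counts)
balanced_accuracy <- mean(recall_per_class)

print(round(c(accuracy = accuracy_rf, nir = nir,
              balanced_accuracy = balanced_accuracy), 4))
print(round(recall_per_class, 4))
```

The per-class recalls show that only about half of the class 0 users are identified, which the headline accuracy hides.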
The regression models had essentially no predictive power, with test R² at or below zero for all three, indicating that the available behavioural and demographic variables explain very little of the variation in daily screen time.
The classification models demonstrated stronger predictive performance than the regression models. In the logistic regression, smartphone addiction was significantly associated with daily screen time and social media usage, whereas gaming activity and notification frequency were not significant predictors.
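Based on the evaluation metrics, the best-performing model was selected according to the highest test R² and lowest RMSE for the regression task, and the highest test accuracy for the classification task.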
For the regression task, the Decision Tree achieved the highest test R² (-0.0002) and the lowest RMSE (2.630) among the regression models, narrowly ahead of Multiple Linear Regression (R² = -0.0021, RMSE = 2.633). However, the fitted tree contains only a root node, so it predicts a single constant value, and the overall predictive performance of all three models remained weak. This indicates that daily screen time could not be accurately predicted using the available variables.
For the classification task, Random Forest Classification was selected as the best-performing model because it achieved the highest test accuracy (0.898 versus 0.888 for logistic regression). Therefore, Random Forest is the most suitable model for predicting smartphone addiction status in this study.
Overall, the findings suggest that smartphone addiction can be predicted more effectively than daily screen time, with Random Forest Classification being the most suitable model for identifying smartphone addiction in this study.