1 Introduction

1.1 Background

Smartphones have become an essential part of modern life. While they provide numerous benefits in communication, education and entertainment, excessive usage may lead to behavioural addiction, stress, and reduced academic productivity.

1.2 Problem Statement

The increasing prevalence of smartphone addiction among students and working adults has raised concerns regarding its impact on daily activities and mental well-being.

1.3 Objectives

To identify factors associated with smartphone addiction.
To predict daily screen time using regression models.
To classify smartphone addiction status using machine learning techniques.

1.4 Research Questions

1.4.1 RQ1 (Regression)

How many hours is a person likely to spend on their smartphone during the weekend, given their weekday behaviour and personal habits?

1.4.2 RQ2 (Classification)

Can a user be classified as “Addicted” or “Not Addicted” based on their daily phone usage habits and lifestyle patterns?

2 Dataset Description

2.1 Packages and Working Directory

library(tidyverse)      # data manipulation (dplyr) + ggplot2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.1     ✔ readr     2.2.0
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.3     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(skimr)          # detailed data summaries
library(corrplot)       # correlation heatmap

## corrplot 0.95 loaded

library(car)            # VIF (multicollinearity check)

## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

library(caret)          # confusionMatrix for classification evaluation

## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift

library(randomForest)   # Random Forest (regression + classification)

## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin

library(rpart)          # Decision Tree regressor
library(pROC)           # ROC curve and AUC

## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

getCurrentFileLocation <- function() {
  cmdArgs <- commandArgs(trailingOnly = FALSE)
  needle <- "--file="
  match <- grep(needle, cmdArgs)
  if (length(match) > 0) {
    return(dirname(normalizePath(sub(needle, "", cmdArgs[match]))))
  } else {
    for (f in sys.frames()) {
      if (!is.null(f$ofile)) return(dirname(normalizePath(f$ofile)))
    }
  }
  # Fallback for interactive RStudio sessions:
  if (requireNamespace("rstudioapi", quietly = TRUE) &&
      rstudioapi::isAvailable()) {
    return(dirname(rstudioapi::getActiveDocumentContext()$path))
  }
  return(NULL)
}
script_dir <- getCurrentFileLocation()
if (!is.null(script_dir)) setwd(script_dir)

# Output folders (created inside Compile/)
dir.create("models", showWarnings = FALSE)
dir.create("plots", showWarnings = FALSE)

# Absolute paths to the raw and cleaned datasets (specific to the author's machine)
RAW_DATA_PATH     <- "/Users/thinkacademy/Downloads/Smartphone_Usage_And_Addiction_Analysis_7500_Rows.csv"
CLEANED_DATA_PATH <- "/Users/thinkacademy/Downloads/Smartphone_Cleaned-2.csv"

# Read the raw data through the path constant so the location is defined once
df <- read.csv(
  RAW_DATA_PATH,
  stringsAsFactors = FALSE
)

2.2 Data Description

The dataset used in this study is Smartphone Usage and Addiction Analysis, containing information on smartphone usage behaviour, demographic characteristics, stress levels, academic or work impact, and addiction status.

The dataset consists of 7,500 observations and 16 variables, resulting in 120,000 data points, which satisfies the project requirement of analysing more than 100,000 data points.

2.3 Dataset Overview

data.frame(
  Rows = nrow(df),
  Columns = ncol(df),
  Datapoints = nrow(df) * ncol(df)
) |>
  knitr::kable(caption = "Dataset Overview")

Dataset Overview
Rows	Columns	Datapoints
7500	16	120000

2.4 Dataset Structure

str(df)

## 'data.frame':    7500 obs. of  16 variables:
##  $ transaction_id         : chr  "TXN00001" "TXN00002" "TXN00003" "TXN00004" ...
##  $ user_id                : chr  "U00001" "U00002" "U00003" "U00004" ...
##  $ age                    : int  21 24 31 32 25 26 25 26 21 35 ...
##  $ gender                 : chr  "Male" "Other" "Other" "Other" ...
##  $ daily_screen_time_hours: num  3.23 5.09 6.06 7.83 9.96 9.32 10.4 4.26 4.38 9.76 ...
##  $ social_media_hours     : num  2.01 3.81 1.36 5.85 5.92 4.26 4.93 4.6 1.38 4.73 ...
##  $ gaming_hours           : num  0.89 2.24 3.83 1.51 3.42 0.29 1.6 2.16 2.72 1.36 ...
##  $ work_study_hours       : num  4.55 4.44 2.35 3.54 5.27 3.99 0.86 4.61 3.78 2.11 ...
##  $ sleep_hours            : num  7.55 7.66 4.92 8.23 6.21 6.9 8.61 6.43 6.23 5.21 ...
##  $ notifications_per_day  : int  248 127 44 178 136 82 165 169 172 20 ...
##  $ app_opens_per_day      : int  154 71 106 107 177 56 95 117 134 82 ...
##  $ weekend_screen_time    : num  3.95 6.71 8.68 9.77 12.55 ...
##  $ stress_level           : chr  "Medium" "Medium" "High" "High" ...
##  $ academic_work_impact   : chr  "Yes" "Yes" "No" "Yes" ...
##  $ addiction_level        : chr  "None" "None" "Mild" "Moderate" ...
##  $ addicted_label         : int  0 0 0 1 1 1 1 1 0 1 ...

2.5 Summary Statistics

summary(df)

##  transaction_id       user_id               age           gender         
##  Length:7500        Length:7500        Min.   :18.00   Length:7500       
##  Class :character   Class :character   1st Qu.:22.00   Class :character  
##  Mode  :character   Mode  :character   Median :27.00   Mode  :character  
##                                        Mean   :26.57                     
##                                        3rd Qu.:31.00                     
##                                        Max.   :35.00                     
##  daily_screen_time_hours social_media_hours  gaming_hours   work_study_hours
##  Min.   : 3.000          Min.   :0.500      Min.   :0.000   Min.   :0.500   
##  1st Qu.: 5.220          1st Qu.:1.910      1st Qu.:1.020   1st Qu.:1.850   
##  Median : 7.525          Median :3.270      Median :2.040   Median :3.230   
##  Mean   : 7.500          Mean   :3.273      Mean   :2.014   Mean   :3.242   
##  3rd Qu.: 9.810          3rd Qu.:4.630      3rd Qu.:2.990   3rd Qu.:4.640   
##  Max.   :12.000          Max.   :6.000      Max.   :4.000   Max.   :6.000   
##   sleep_hours    notifications_per_day app_opens_per_day weekend_screen_time
##  Min.   :4.500   Min.   : 20.0         Min.   : 15.00    Min.   : 3.580     
##  1st Qu.:5.630   1st Qu.: 76.0         1st Qu.: 55.00    1st Qu.: 6.960     
##  Median :6.720   Median :134.0         Median : 98.00    Median : 9.260     
##  Mean   :6.738   Mean   :134.3         Mean   : 97.83    Mean   : 9.244     
##  3rd Qu.:7.840   3rd Qu.:191.0         3rd Qu.:140.00    3rd Qu.:11.540     
##  Max.   :9.000   Max.   :250.0         Max.   :180.00    Max.   :14.880     
##  stress_level       academic_work_impact addiction_level    addicted_label  
##  Length:7500        Length:7500          Length:7500        Min.   :0.0000  
##  Class :character   Class :character     Class :character   1st Qu.:0.0000  
##  Mode  :character   Mode  :character     Mode  :character   Median :1.0000  
##                                                             Mean   :0.7077  
##                                                             3rd Qu.:1.0000  
##                                                             Max.   :1.0000

2.6 Variable Categories

The variables can be grouped into the following categories:

Identifiers: transaction_id, user_id
Demographics: age, gender
Behavioural Variables: daily_screen_time_hours, social_media_hours, gaming_hours, work_study_hours, sleep_hours, notifications_per_day, app_opens_per_day, weekend_screen_time
Outcome Variables: stress_level, academic_work_impact, addiction_level, addicted_label

Initial Findings

The dataset contains both numerical and categorical variables and is suitable for both regression and classification modelling. The large sample size provides sufficient data for reliable predictive analysis.

3 Data Cleaning and Preprocessing

Data cleaning was performed to ensure the dataset was complete, consistent, and suitable for predictive modelling.

3.1 Missing Value Analysis

The dataset was first examined for missing values across all variables.

colSums(is.na(df))

##          transaction_id                 user_id                     age 
##                       0                       0                       0 
##                  gender daily_screen_time_hours      social_media_hours 
##                       0                       0                       0 
##            gaming_hours        work_study_hours             sleep_hours 
##                       0                       0                       0 
##   notifications_per_day       app_opens_per_day     weekend_screen_time 
##                       0                       0                       0 
##            stress_level    academic_work_impact         addiction_level 
##                       0                       0                       0 
##          addicted_label 
##                       0

Findings

No missing values were identified in any of the variables.

Although the variable addiction_level contains the category “None”, this represents users with no smartphone addiction and is a valid category rather than a missing value.

3.2 Duplicate Record Detection

Duplicate observations were checked to ensure that each row represented a unique user record.

duplicate_rows <- sum(duplicated(df))
cat("Number of duplicate rows:", duplicate_rows)

## Number of duplicate rows: 0

Findings

No duplicate records were detected in the dataset. Therefore, no observations were removed.

3.3 Validation of Addiction Labels

The relationship between addiction_level and addicted_label was verified to ensure consistency.

table(df$addiction_level)

## 
##     Mild Moderate     None   Severe 
##     1373     2874      819     2434

table(df$addiction_level, df$addicted_label)

##           
##               0    1
##   Mild     1373    0
##   Moderate    0 2874
##   None      819    0
##   Severe      0 2434

Findings

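The cross-tabulation shows that, in the original data, users classified as “None” or “Mild” correspond to 0 (Not Addicted), while users classified as “Moderate” or “Severe” correspond to 1 (Addicted). No category is split across both labels, so the two columns are internally consistent.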
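The same consistency check can be automated rather than read off the table. The following is a minimal sketch on a small made-up data frame; `toy`, `addicted_levels` and the rows themselves are illustrative, and the mapping assumes what the cross-tabulation above shows, namely that only Moderate and Severe carry the label 1.

```r
# Illustrative rows only; these are not taken from the real dataset
toy <- data.frame(
  addiction_level = c("None", "Mild", "Moderate", "Severe", "Mild"),
  addicted_label  = c(0L, 0L, 1L, 1L, 0L),
  stringsAsFactors = FALSE
)

# Levels coded 1 in the cross-tabulation (assumption read off the table)
addicted_levels <- c("Moderate", "Severe")

# Rebuild the label from the level and count rows that disagree
expected_label <- as.integer(toy$addiction_level %in% addicted_levels)
n_mismatch <- sum(expected_label != toy$addicted_label)

cat("Mismatching rows:", n_mismatch, "\n")
```

Applied to the full data frame, a non-zero count would flag rows where the level and the label disagree.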

3.4 Binary Target Variable Creation

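To support classification modelling, addicted_label was redefined from the original addiction categories so that only “None” maps to 0, while Mild, Moderate, and Severe all map to 1. This overwrites the original column, in which Mild users were coded 0, so the 1,373 Mild users move from class 0 to class 1.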

df$addicted_label <- ifelse(df$addiction_level == "None", 0, 1)
df$addicted_label <- factor(df$addicted_label, levels = c(0,1))

3.4.1 Classification Labels

Label	Description
0	Not Addicted
1	Addicted

This transformation simplifies the classification problem into a binary prediction task.
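The recoding also changes the class balance, because `ifelse(addiction_level == "None", 0, 1)` labels Mild users as 1, whereas the cross-tabulation in Section 3.3 shows them as 0 in the original column. A short sketch of the arithmetic, using the category counts reported there (the variable names are illustrative):

```r
# Category counts as printed by table(df$addiction_level)
level_counts <- c(Mild = 1373, Moderate = 2874, None = 819, Severe = 2434)
n_total <- sum(level_counts)

# Original coding: only Moderate and Severe are labelled 1
share_original <- sum(level_counts[c("Moderate", "Severe")]) / n_total

# Recoded target: every level except None is labelled 1
share_recoded <- sum(level_counts[names(level_counts) != "None"]) / n_total

round(c(original = share_original, recoded = share_recoded), 4)
```

The first share agrees with the mean of addicted_label in the summary statistics (0.7077); the second is the positive-class share that the classifiers in Section 6 are trained on.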

3.5 Feature Transformation

Categorical variables were converted into factor data types to ensure correct handling during statistical analysis and machine learning modelling.

df$gender <- as.factor(df$gender)
df$stress_level <- as.factor(df$stress_level)
df$academic_work_impact <- as.factor(df$academic_work_impact)
df$addiction_level <- as.factor(df$addiction_level)
df$addicted_label <- as.factor(df$addicted_label)

str(df)

## 'data.frame':    7500 obs. of  16 variables:
##  $ transaction_id         : chr  "TXN00001" "TXN00002" "TXN00003" "TXN00004" ...
##  $ user_id                : chr  "U00001" "U00002" "U00003" "U00004" ...
##  $ age                    : int  21 24 31 32 25 26 25 26 21 35 ...
##  $ gender                 : Factor w/ 3 levels "Female","Male",..: 2 3 3 3 2 2 2 2 3 3 ...
##  $ daily_screen_time_hours: num  3.23 5.09 6.06 7.83 9.96 9.32 10.4 4.26 4.38 9.76 ...
##  $ social_media_hours     : num  2.01 3.81 1.36 5.85 5.92 4.26 4.93 4.6 1.38 4.73 ...
##  $ gaming_hours           : num  0.89 2.24 3.83 1.51 3.42 0.29 1.6 2.16 2.72 1.36 ...
##  $ work_study_hours       : num  4.55 4.44 2.35 3.54 5.27 3.99 0.86 4.61 3.78 2.11 ...
##  $ sleep_hours            : num  7.55 7.66 4.92 8.23 6.21 6.9 8.61 6.43 6.23 5.21 ...
##  $ notifications_per_day  : int  248 127 44 178 136 82 165 169 172 20 ...
##  $ app_opens_per_day      : int  154 71 106 107 177 56 95 117 134 82 ...
##  $ weekend_screen_time    : num  3.95 6.71 8.68 9.77 12.55 ...
##  $ stress_level           : Factor w/ 3 levels "High","Low","Medium": 3 3 1 1 2 3 3 2 1 2 ...
##  $ academic_work_impact   : Factor w/ 2 levels "No","Yes": 2 2 1 2 1 2 1 1 2 2 ...
##  $ addiction_level        : Factor w/ 4 levels "Mild","Moderate",..: 3 3 1 2 4 4 4 2 3 4 ...
##  $ addicted_label         : Factor w/ 2 levels "0","1": 1 1 2 2 2 2 2 2 1 2 ...

Findings

The following variables were successfully converted into factors:

gender
stress_level
academic_work_impact
addiction_level
addicted_label
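One detail visible in the str() output is that as.factor() orders levels alphabetically, so stress_level is stored as High, Low, Medium. If a natural ordering is preferred, for example so that Low becomes the reference level in a regression, the levels can be set explicitly. This is a sketch on a made-up vector, not a change applied in this report:

```r
# Illustrative values only
stress <- c("Medium", "Medium", "High", "High", "Low")

# Default behaviour: alphabetical level order
stress_default <- as.factor(stress)

# Explicit order from lowest to highest stress
stress_ordered <- factor(stress, levels = c("Low", "Medium", "High"))

levels(stress_default)
levels(stress_ordered)
```

Changing the level order changes which category acts as the baseline in model coefficients, so the coefficient names reported later would differ under this alternative.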

3.6 Final Dataset Summary

The final dataset was reviewed before proceeding to exploratory analysis and predictive modelling.

data.frame(
  Rows = nrow(df),
  Columns = ncol(df),
  Missing_Values = sum(is.na(df)),
  Duplicate_Rows = sum(duplicated(df))
) |>
knitr::kable(
  caption = "Final Dataset Summary"
)

Final Dataset Summary
Rows	Columns	Missing_Values	Duplicate_Rows
7500	16	0	0

The dataset was successfully cleaned and prepared for analysis. No missing values or duplicate records were identified, addiction labels were validated, and categorical variables were transformed appropriately for modelling.

4 Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) was conducted to understand smartphone usage patterns, identify relationships between variables, detect potential anomalies, and gain insights before predictive modelling.

4.1 Distribution of Addiction Levels

The distribution of smartphone addiction levels was examined to understand the prevalence of addiction within the dataset.

ggplot(df,
       aes(x = addiction_level,
           fill = addiction_level)) +
  geom_bar() +
  labs(
    title = "Distribution of Addiction Levels",
    x = "Addiction Level",
    y = "Frequency"
  ) +
  theme_minimal()

Findings

Moderate is the most common category (2,874 users), followed by Severe (2,434), Mild (1,373), and None (819), so most users in the dataset show at least some level of addiction.

4.2 Gender Distribution

The gender distribution of users was analysed to understand the demographic composition of the dataset.

ggplot(df,
       aes(x = gender,
           fill = gender)) +
  geom_bar() +
  labs(
    title = "Gender Distribution",
    x = "Gender",
    y = "Count"
  ) +
  theme_minimal()

Findings

The dataset contains users from three gender groups (Female, Male, and Other), and the bar chart shows how each group is represented in the sample.

4.3 Distribution of Daily Screen Time

The distribution of daily screen time was examined to understand overall smartphone usage behaviour.

ggplot(df,
       aes(x = daily_screen_time_hours)) +
  geom_histogram(
    bins = 30,
    fill = "steelblue",
    color = "white"
  ) +
  labs(
    title = "Distribution of Daily Screen Time",
    x = "Daily Screen Time (Hours)",
    y = "Frequency"
  ) +
  theme_minimal()

Findings

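Daily screen time ranges from 3 to 12 hours, with a mean of 7.50 hours and a median of 7.53 hours. The quartiles (5.22 and 9.81 hours) are almost evenly spaced across this range, which suggests that usage is spread fairly evenly rather than concentrated around a single peak.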

4.4 Boxplot

Boxplots were used to identify potential outliers in the numerical variables.

numeric_cols <- df %>%
  select(where(is.numeric))

par(mfrow = c(3, 3))  # nine numeric columns remain once addicted_label is a factor

for(col in names(numeric_cols)){
  boxplot(
    numeric_cols[[col]],
    main = col,
    col = "lightblue"
  )
}

Findings

Several variables contain naturally occurring extreme values. However, no severe outliers were identified that required removal.
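The visual check can be complemented with a count based on the usual boxplot rule, which flags values more than 1.5 times the interquartile range beyond the quartiles. The helper below is a hypothetical sketch (`count_iqr_outliers` is not part of the original analysis) and is shown on a made-up vector:

```r
# Count values outside [Q1 - k * IQR, Q3 + k * IQR]
count_iqr_outliers <- function(x, k = 1.5) {
  q <- unname(quantile(x, probs = c(0.25, 0.75), na.rm = TRUE))
  iqr <- q[2] - q[1]
  lower <- q[1] - k * iqr
  upper <- q[2] + k * iqr
  sum(x < lower | x > upper, na.rm = TRUE)
}

# Illustrative data: ten ordinary values and one extreme value
example_values <- c(1:10, 100)
n_flagged <- count_iqr_outliers(example_values)
n_flagged
```

Applied column by column, for instance with sapply(numeric_cols, count_iqr_outliers), this would give one outlier count per numeric variable.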

4.5 Hours Consistency Check

A consistency check was performed to determine whether the combined hours spent on social media, gaming, and work/study exceeded the reported daily screen time.

inconsistent_rows <- sum(
  (df$social_media_hours +
   df$gaming_hours +
   df$work_study_hours) >
   df$daily_screen_time_hours
)

cat("Rows with hour inconsistency:", inconsistent_rows, "\n")

## Rows with hour inconsistency: 4553

cat(
  "Percentage:",
  round(inconsistent_rows / nrow(df) * 100, 1),
  "%\n"
)

## Percentage: 60.7 %

Findings

A total of 4,553 observations (60.7%) exhibited inconsistencies where the combined activity hours exceeded the reported daily screen time.

Interpretation

This suggests that users may have reported overlapping activities or estimated usage differently. Therefore, these variables were treated as independent behavioural indicators.

4.6 Correlation Analysis

A correlation heatmap was generated to examine relationships among numerical variables.

numeric_df <- df %>%
  select(where(is.numeric))

cor_matrix <- cor(
  numeric_df,
  use = "complete.obs"
)

corrplot(
  cor_matrix,
  method = "color",
  type = "upper",
  tl.cex = 0.7,
  number.cex = 0.7
)

Findings

The strongest correlation was observed between weekend_screen_time and daily_screen_time_hours (r ≈ 0.96), indicating a potential target leakage issue for regression modelling.
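The strongest pair can also be extracted from a correlation matrix programmatically instead of being read from the heatmap. A minimal sketch on simulated data follows; the columns a, b and c are invented for illustration, with b constructed to be nearly a copy of a:

```r
set.seed(1)
base_signal <- rnorm(200)
toy <- data.frame(
  a = base_signal,
  b = base_signal + rnorm(200, sd = 0.1),  # nearly a copy of a
  c = rnorm(200)                           # unrelated noise
)

cor_toy <- cor(toy)

# Keep each pair once and drop the diagonal of ones
cor_toy[lower.tri(cor_toy, diag = TRUE)] <- NA

# Locate the largest absolute correlation among the remaining cells
best <- which(abs(cor_toy) == max(abs(cor_toy), na.rm = TRUE), arr.ind = TRUE)
strongest_pair <- c(rownames(cor_toy)[best[1, "row"]],
                    colnames(cor_toy)[best[1, "col"]])
strongest_r <- cor_toy[best[1, "row"], best[1, "col"]]

strongest_pair
round(strongest_r, 3)
```

The same steps applied to cor_matrix would name the pair of variables behind the strongest correlation reported above.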

4.7 Feature Scaling

Feature scaling was performed to standardise numerical variables before modelling.

numeric_features <- df %>%
  select(
    age,
    daily_screen_time_hours,
    social_media_hours,
    gaming_hours,
    work_study_hours,
    notifications_per_day
  )

scaled_features <- scale(numeric_features)

head(scaled_features)

##             age daily_screen_time_hours social_media_hours gaming_hours
## [1,] -1.0715189              -1.6364910         -0.7969791   -0.9809287
## [2,] -0.4942749              -0.9236255          0.3384230    0.1970416
## [3,]  0.8526280              -0.5518622         -1.2069853    1.5844287
## [4,]  1.0450427               0.1265099          1.6252119   -0.4399349
## [5,] -0.3018602               0.9428560          1.6693665    1.2266748
## [6,] -0.1094455               0.6975689          0.6222735   -1.5044710
##      work_study_hours notifications_per_day
## [1,]        0.8168470            1.70818428
## [2,]        0.7481298           -0.10899044
## [3,]       -0.5574960           -1.35548219
## [4,]        0.1858986            0.65692618
## [5,]        1.2666320            0.02617132
## [6,]        0.4670142           -0.78479922

Findings

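Feature scaling transforms variables to a common scale with mean 0 and standard deviation 1, which prevents variables with larger ranges from dominating scale-sensitive methods. The scaled matrix is not passed to the models in the later sections, which are fitted on the original, unscaled variables.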
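When scaled features are used for modelling, a common practice is to estimate the centring and scaling constants on the training rows only and reuse them for the test rows, so that no test-set information enters preprocessing. The sketch below illustrates this on simulated data; the data frame and split are invented, and only the two column names are borrowed from the dataset:

```r
set.seed(42)
toy <- data.frame(
  age = sample(18:35, 100, replace = TRUE),
  notifications_per_day = sample(20:250, 100, replace = TRUE)
)

# 80/20 split of the illustrative rows
train_idx <- sample(seq_len(nrow(toy)), size = 80)
train_raw <- toy[train_idx, ]
test_raw  <- toy[-train_idx, ]

# Estimate mean and standard deviation on the training rows only
train_scaled <- scale(train_raw)
train_center <- attr(train_scaled, "scaled:center")
train_scale  <- attr(train_scaled, "scaled:scale")

# Apply the training constants to the test rows
test_scaled <- scale(test_raw, center = train_center, scale = train_scale)

round(colMeans(train_scaled), 10)
```

The training columns end up with mean 0 and standard deviation 1 by construction, while the test columns will be close to, but not exactly, those values.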

4.8 Multicollinearity Analysis

Variance Inflation Factor (VIF) analysis was conducted to assess multicollinearity among predictor variables.

vif_model <- lm(
  as.numeric(as.character(addicted_label)) ~
    age +
    daily_screen_time_hours +
    social_media_hours +
    gaming_hours +
    work_study_hours +
    notifications_per_day,
  data = df
)

car::vif(vif_model)

##                     age daily_screen_time_hours      social_media_hours 
##                1.001138                1.000284                1.000189 
##            gaming_hours        work_study_hours   notifications_per_day 
##                1.000945                1.000638                1.000627

Findings

All VIF values were approximately 1.00, far below the commonly accepted threshold of 5, indicating that the predictors are essentially uncorrelated with one another and that multicollinearity is not a concern.
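For a numeric predictor, the VIF is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The following sketch computes it by hand on simulated data to show what low and high values look like; `manual_vif` is a hypothetical helper and x1, x2, x3 are invented columns, with x3 built to be almost collinear with x1:

```r
set.seed(1)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$x3 <- toy$x1 + rnorm(n, sd = 0.1)  # almost a copy of x1

# VIF for each column: regress it on the others and use 1 / (1 - R^2)
manual_vif <- function(data) {
  sapply(names(data), function(v) {
    others <- setdiff(names(data), v)
    fit <- lm(reformulate(others, response = v), data = data)
    1 / (1 - summary(fit)$r.squared)
  })
}

vif_values <- manual_vif(toy)
round(vif_values, 2)
```

In this toy example x1 and x3 receive very large values while x2 stays near 1, which is the pattern the values reported above rule out for the real predictors.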

4.9 Daily Screen Time by Addiction Level

A boxplot was generated to examine the relationship between daily screen time and addiction level.

ggplot(
  df,
  aes(
    x = addiction_level,
    y = daily_screen_time_hours,
    fill = addiction_level
  )
) +
  geom_boxplot() +
  labs(
    title = "Daily Screen Time by Addiction Level",
    x = "Addiction Level",
    y = "Daily Screen Time (Hours)"
  ) +
  theme_minimal()

Findings

Users with higher addiction levels generally tend to spend more time on their smartphones compared to users with lower addiction levels.

Summary of EDA Findings

The exploratory analysis revealed several important insights:

The dataset contains a mix of demographic, behavioural, and outcome variables.
Daily screen time varies considerably across users.
A strong relationship exists between weekend screen time and daily screen time.
No severe multicollinearity issues were detected among predictors.
Smartphone addiction appears to be associated with higher daily screen time.
Behavioural variables such as social media usage, gaming activity, and notifications may contribute to addiction prediction.

These findings provide useful guidance for the regression and classification modelling stages.

5 Regression Modelling

Objective

The objective of this analysis is to predict users’ daily screen time using demographic and behavioural variables.

# Remove ID columns
df_model <- df %>%
  select(-transaction_id, -user_id)

# Remove leakage variables
df_model <- df_model %>%
  select(
    -weekend_screen_time,
    -addiction_level,
    -addicted_label
  )

str(df_model)

## 'data.frame':    7500 obs. of  11 variables:
##  $ age                    : int  21 24 31 32 25 26 25 26 21 35 ...
##  $ gender                 : Factor w/ 3 levels "Female","Male",..: 2 3 3 3 2 2 2 2 3 3 ...
##  $ daily_screen_time_hours: num  3.23 5.09 6.06 7.83 9.96 9.32 10.4 4.26 4.38 9.76 ...
##  $ social_media_hours     : num  2.01 3.81 1.36 5.85 5.92 4.26 4.93 4.6 1.38 4.73 ...
##  $ gaming_hours           : num  0.89 2.24 3.83 1.51 3.42 0.29 1.6 2.16 2.72 1.36 ...
##  $ work_study_hours       : num  4.55 4.44 2.35 3.54 5.27 3.99 0.86 4.61 3.78 2.11 ...
##  $ sleep_hours            : num  7.55 7.66 4.92 8.23 6.21 6.9 8.61 6.43 6.23 5.21 ...
##  $ notifications_per_day  : int  248 127 44 178 136 82 165 169 172 20 ...
##  $ app_opens_per_day      : int  154 71 106 107 177 56 95 117 134 82 ...
##  $ stress_level           : Factor w/ 3 levels "High","Low","Medium": 3 3 1 1 2 3 3 2 1 2 ...
##  $ academic_work_impact   : Factor w/ 2 levels "No","Yes": 2 2 1 2 1 2 1 1 2 2 ...

Train-Test Split

The dataset was divided into 80% training data and 20% testing data.

set.seed(42)

train_index <- sample(
  1:nrow(df_model),
  size = 0.8 * nrow(df_model)
)

train_data <- df_model[train_index, ]
test_data  <- df_model[-train_index, ]

5.1 Multiple Linear Regression

mlr_model <- lm(
  daily_screen_time_hours ~ .,
  data = train_data
)

summary(mlr_model)

## 
## Call:
## lm(formula = daily_screen_time_hours ~ ., data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7187 -2.2511 -0.0103  2.2741  4.8068 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              6.7957174  0.3014412  22.544   <2e-16 ***
## age                      0.0086624  0.0064524   1.343   0.1795    
## genderMale               0.1310800  0.0822951   1.593   0.1113    
## genderOther              0.0442823  0.0826324   0.536   0.5921    
## social_media_hours       0.0177288  0.0212670   0.834   0.4045    
## gaming_hours            -0.0035257  0.0294221  -0.120   0.9046    
## work_study_hours         0.0122431  0.0210255   0.582   0.5604    
## sleep_hours              0.0357577  0.0262755   1.361   0.1736    
## notifications_per_day    0.0002135  0.0005029   0.424   0.6713    
## app_opens_per_day        0.0013557  0.0006939   1.954   0.0508 .  
## stress_levelLow         -0.0235104  0.0818911  -0.287   0.7741    
## stress_levelMedium      -0.1996135  0.0824594  -2.421   0.0155 *  
## academic_work_impactYes -0.0254948  0.0672433  -0.379   0.7046    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.603 on 5987 degrees of freedom
## Multiple R-squared:  0.003012,   Adjusted R-squared:  0.001014 
## F-statistic: 1.507 on 12 and 5987 DF,  p-value: 0.1134

5.2 Decision Tree Regression

dt_model <- rpart(
  daily_screen_time_hours ~ .,
  data = train_data,
  method = "anova"
)

dt_model

## n= 6000 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
## 1) root 6000 40673.78 7.492087 *

5.3 Random Forest Regression

rf_model <- randomForest(
  daily_screen_time_hours ~ .,
  data = train_data,
  ntree = 100,
  importance = TRUE
)

rf_model

## 
## Call:
##  randomForest(formula = daily_screen_time_hours ~ ., data = train_data,      ntree = 100, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 100
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 7.061841
##                     % Var explained: -4.17

Regression Model Evaluation

Evaluation Metrics

get_metrics <- function(actual, predicted){

  rmse <- sqrt(
    mean(
      (actual - predicted)^2
    )
  )

  r2 <- 1 -
    (
      sum((actual-predicted)^2) /
      sum((actual-mean(actual))^2)
    )

  data.frame(
    RMSE = rmse,
    R2 = r2
  )
}

mlr_pred <- predict(mlr_model,test_data)
dt_pred  <- predict(dt_model,test_data)
rf_pred  <- predict(rf_model,test_data)

results_df <- rbind(
  cbind(Model="MLR",
        get_metrics(test_data$daily_screen_time_hours,
                    mlr_pred)),
  cbind(Model="Decision Tree",
        get_metrics(test_data$daily_screen_time_hours,
                    dt_pred)),
  cbind(Model="Random Forest",
        get_metrics(test_data$daily_screen_time_hours,
                    rf_pred))
)

knitr::kable(
  results_df,
  caption="Regression Model Performance"
)

Regression Model Performance
Model	RMSE	R2
MLR	2.632864	-0.0020657
Decision Tree	2.630440	-0.0002213
Random Forest	2.644456	-0.0109090

Findings

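All three models produced a slightly negative test R² (MLR -0.0021, Decision Tree -0.0002, Random Forest -0.0109) with RMSE between 2.63 and 2.64 hours, so none improves on simply predicting the mean daily screen time. By the criterion of highest R² and lowest RMSE the Decision Tree ranks first, but the fitted tree contains only a root node and therefore predicts a constant value.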
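A useful reference point for the table above is a baseline that ignores every predictor and always predicts the training mean. Its test R², computed with the same formula as get_metrics, can never exceed zero. The sketch below demonstrates this on simulated values; drawing the target uniformly between 3 and 12 hours is an assumption made only for the illustration:

```r
set.seed(42)
# Simulated target values (illustrative, not the real data)
y_train <- runif(800, min = 3, max = 12)
y_test  <- runif(200, min = 3, max = 12)

# Baseline: predict the training mean for every test row
baseline_pred <- rep(mean(y_train), length(y_test))

baseline_rmse <- sqrt(mean((y_test - baseline_pred)^2))
baseline_r2 <- 1 -
  sum((y_test - baseline_pred)^2) /
  sum((y_test - mean(y_test))^2)

round(c(RMSE = baseline_rmse, R2 = baseline_r2), 4)
```

A model whose test R² sits at or just below zero is therefore performing no better than this constant prediction.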

6 Classification Modelling

Objective

The objective of this analysis is to predict smartphone addiction status.

Feature Selection

classification_df <- df[, c(
  "addicted_label",
  "age",
  "gender",
  "daily_screen_time_hours",
  "social_media_hours",
  "gaming_hours",
  "work_study_hours",
  "notifications_per_day",
  "stress_level",
  "academic_work_impact"
)]

Train-Test Split

set.seed(123)

train_index <- sample(
  1:nrow(classification_df),
  0.8*nrow(classification_df)
)

train_data <- classification_df[train_index,]
test_data  <- classification_df[-train_index,]

6.1 Logistic Regression

log_model <- glm(
  addicted_label ~ .,
  data = train_data,
  family = "binomial"
)

summary(log_model)

## 
## Call:
## glm(formula = addicted_label ~ ., family = "binomial", data = train_data)
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -4.9191380  0.4067275 -12.094   <2e-16 ***
## age                     -0.0025476  0.0098552  -0.259    0.796    
## genderMale               0.0922154  0.1245716   0.740    0.459    
## genderOther              0.1041416  0.1242206   0.838    0.402    
## daily_screen_time_hours  0.9024571  0.0370023  24.389   <2e-16 ***
## social_media_hours       0.6919513  0.0375378  18.433   <2e-16 ***
## gaming_hours            -0.0103911  0.0446703  -0.233    0.816    
## work_study_hours        -0.0464739  0.0314214  -1.479    0.139    
## notifications_per_day   -0.0004680  0.0007618  -0.614    0.539    
## stress_levelLow          0.1542275  0.1250333   1.233    0.217    
## stress_levelMedium       0.0934942  0.1235301   0.757    0.449    
## academic_work_impactYes  0.0304168  0.1015260   0.300    0.764    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4166.5  on 5999  degrees of freedom
## Residual deviance: 2533.0  on 5988  degrees of freedom
## AIC: 2557
## 
## Number of Fisher Scoring iterations: 7

6.2 Random Forest Classification

rf_clf <- randomForest(
  addicted_label ~ .,
  data = train_data,
  ntree = 200,
  importance = TRUE
)

rf_clf

## 
## Call:
##  randomForest(formula = addicted_label ~ ., data = train_data,      ntree = 200, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 200
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 10.23%
## Confusion matrix:
##     0    1 class.error
## 0 340  322  0.48640483
## 1 292 5046  0.05470214

Variable Importance

varImpPlot(rf_clf)

7 Model Evaluation and Results

7.1 Logistic Regression Evaluation

pred_prob <- predict(
  log_model,
  newdata=test_data,
  type="response"
)

pred_class <- factor(
  ifelse(pred_prob > 0.5,1,0),
  levels=c(0,1)
)

cm <- table(
  Predicted=pred_class,
  Actual=test_data$addicted_label
)

cm

##          Actual
## Predicted    0    1
##         0   39   50
##         1  118 1293

Performance Metrics

accuracy <- sum(diag(cm))/sum(cm)

TP <- cm["1","1"]
TN <- cm["0","0"]
FP <- cm["1","0"]
FN <- cm["0","1"]

precision <- TP/(TP+FP)
recall <- TP/(TP+FN)
f1 <- 2*precision*recall/
      (precision+recall)

metrics <- data.frame(
  Accuracy = accuracy,
  Precision = precision,
  Recall = recall,
  F1_Score = f1
)

knitr::kable(metrics)

Accuracy	Precision	Recall	F1_Score
0.888	0.9163714	0.9627699	0.9389978

7.2 ROC Curve and AUC

roc_curve <- roc(
  response = test_data$addicted_label,
  predictor = pred_prob,
  levels = c("0","1")
)

## Setting direction: controls < cases

plot(roc_curve)

auc(roc_curve)

## Area under the curve: 0.9136

Findings

A higher AUC indicates better discrimination between addicted and non-addicted users.
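The AUC can be read as the probability that a randomly chosen addicted user receives a higher predicted score than a randomly chosen non-addicted user. The sketch below computes it from ranks on four made-up observations; `auc_rank` is a hypothetical helper, independent of pROC, and it assumes labels coded 0 and 1:

```r
# Rank-based AUC (equivalent to the Mann-Whitney U statistic, scaled to [0, 1])
auc_rank <- function(labels, scores) {
  r  <- rank(scores)           # average ranks handle tied scores
  n1 <- sum(labels == 1)
  n0 <- sum(labels == 0)
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Illustrative labels and predicted probabilities
toy_labels <- c(0, 0, 1, 1)
toy_scores <- c(0.10, 0.40, 0.35, 0.80)

toy_auc <- auc_rank(toy_labels, toy_scores)
toy_auc
```

Here three of the four positive-negative pairs are ordered correctly, giving 0.75; a value of 0.5 corresponds to random ordering and 1 to perfect separation.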

7.3 Random Forest Evaluation

rf_pred <- predict(
  rf_clf,
  newdata=test_data
)

rf_cm <- confusionMatrix(
  rf_pred,
  test_data$addicted_label
)

rf_cm

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0   79   75
##          1   78 1268
##                                           
##                Accuracy : 0.898           
##                  95% CI : (0.8816, 0.9129)
##     No Information Rate : 0.8953          
##     P-Value [Acc > NIR] : 0.3878          
##                                           
##                   Kappa : 0.4511          
##                                           
##  Mcnemar's Test P-Value : 0.8715          
##                                           
##             Sensitivity : 0.50318         
##             Specificity : 0.94415         
##          Pos Pred Value : 0.51299         
##          Neg Pred Value : 0.94205         
##              Prevalence : 0.10467         
##          Detection Rate : 0.05267         
##    Detection Prevalence : 0.10267         
##       Balanced Accuracy : 0.72367         
##                                           
##        'Positive' Class : 0               
##

Classification Model Comparison

comparison <- data.frame(
  Model = c(
    "Logistic Regression",
    "Random Forest"
  ),
  Accuracy = c(
    accuracy,
    unname(rf_cm$overall["Accuracy"])  # drop the name so it does not become a row label
  )
)

knitr::kable(
  comparison,
  caption = "Classification Model Comparison"
)

Classification Model Comparison
Model	Accuracy
Logistic Regression	0.888
Random Forest	0.898

Findings

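Random Forest achieved the higher test accuracy (0.898) compared with Logistic Regression (0.888). Both figures are close to the no-information rate of 0.8953 reported by caret, that is, the accuracy obtained by always predicting the majority class, and the Random Forest's improvement over that rate is not statistically significant (p = 0.3878). The caret output treats class 0 as the positive class, so its Sensitivity (0.503) is the recall for Not Addicted users, whereas the Logistic Regression precision and recall in Section 7.1 were computed for class 1.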
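Accuracy alone does not show how each model handles the Addicted class. As a sketch, precision, recall and F1 for class 1 can be recomputed from the Random Forest confusion matrix counts printed in Section 7.3, which makes them directly comparable with the Logistic Regression metrics in Section 7.1. The object names below are illustrative:

```r
# Random Forest test-set counts copied from the confusion matrix in Section 7.3
rf_counts <- matrix(
  c(79, 78, 75, 1268),
  nrow = 2,
  dimnames = list(Predicted = c("0", "1"), Actual = c("0", "1"))
)

# Treat class 1 (Addicted) as the positive class
TP <- rf_counts["1", "1"]
FP <- rf_counts["1", "0"]
FN <- rf_counts["0", "1"]

rf_precision <- TP / (TP + FP)
rf_recall    <- TP / (TP + FN)
rf_f1        <- 2 * rf_precision * rf_recall / (rf_precision + rf_recall)

# Accuracy of always predicting the majority class
majority_rate <- sum(rf_counts[, "1"]) / sum(rf_counts)

round(c(Precision = rf_precision, Recall = rf_recall,
        F1 = rf_f1, Majority = majority_rate), 4)
```

With these counts the class 1 F1 score is roughly 0.943, against 0.939 for Logistic Regression, and the majority-class rate reproduces the no-information rate in the caret output.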

7.4 Discussion of Results

7.4.1 Regression Results

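The regression models achieved poor predictive performance: test R² was slightly below zero for all three models (between -0.0109 and -0.0002) and RMSE was about 2.63 to 2.64 hours. The available behavioural and demographic variables therefore explain essentially none of the variation in daily screen time.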

7.4.2 Classification Results

The classification models demonstrated stronger predictive performance than the regression models, with Logistic Regression reaching an AUC of 0.9136. In the logistic model, daily screen time and social media usage were the only statistically significant predictors (p < 0.001); gaming hours, notification frequency, and the remaining variables were not significant.
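The logistic regression coefficients in Section 6.1 are on the log-odds scale, which is hard to interpret directly. Exponentiating them gives odds ratios. The sketch below does this for the two coefficients with the smallest p-values, copying the estimates from the model summary (the vector name is illustrative):

```r
# Estimates copied from summary(log_model) in Section 6.1
log_odds <- c(
  daily_screen_time_hours = 0.9024571,
  social_media_hours      = 0.6919513
)

# Odds ratio per one-hour increase, holding the other predictors fixed
odds_ratios <- exp(log_odds)
round(odds_ratios, 2)
```

Under this model, each additional hour of daily screen time multiplies the odds of being labelled Addicted by about 2.47, and each additional hour of social media use by about 2.00.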

7.4.3 Best Performing Model

Based on the evaluation metrics, the best-performing model was selected according to:

Lowest RMSE and highest R² for regression.
Highest Accuracy, F1-Score, and AUC for classification.

8 Conclusion

For the regression task, no model performed well: all three had a slightly negative test R² and an RMSE of about 2.63 hours, so none improved on predicting the mean. Under the stated criteria the Decision Tree ranked first (test R² = -0.0002, RMSE = 2.630), narrowly ahead of Multiple Linear Regression (test R² = -0.0021, RMSE = 2.633), but the fitted tree contained only a root node and therefore predicts a constant. This indicates that daily screen time could not be accurately predicted using the available variables.

For the classification task, Random Forest Classification was selected as the best-performing model because it achieved the highest classification accuracy (0.898, against 0.888 for Logistic Regression). The margin is small, and both values are close to the no-information rate of 0.8953, so the ranking should be treated with caution.

Overall, the findings suggest that smartphone addiction status can be predicted more effectively than daily screen time.

Smartphone Usage and Addiction Analysis

WQD7004 Programming for Data Science — Group Project (Group 3)

Anis Sofea Bt Ikhsan (25089745)

Kanchanaa A/P Sivabalan (U2103645)

Farbod Salehi (25061576)

Ranjithaa A/P Vasu (U2103400)

2026-06-11
