The objective of this project is to identify and explain the factors that distinguish heavy smartphone users from regular users. Using a generalized linear model with a binomial family and a logit link function, the analysis aims to predict whether a user belongs to the heavy usage group based on behavioral, technical, and demographic variables. Understanding these patterns is relevant from both a business perspective (e.g., mobile app monetization, user segmentation) and a scientific perspective (digital behavior and technology dependence).
The dataset used in this project is the Mobile Device Usage and User Behavior Dataset, obtained from an online open-data repository (Kaggle). It contains detailed information on smartphone usage patterns, device characteristics, and user demographics.
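Each observation represents one smartphone user. The dataset includes variables related to usage behavior (app usage time, screen-on time, data usage, number of installed apps), device characteristics (operating system, battery drain), and demographics (age, gender).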
The dataset is well suited for classification tasks and allows the application of a generalized linear model with a binomial family, as discussed during the course.
Previous studies on smartphone usage behavior indicate that excessive usage is strongly associated with screen-on time, app engagement, and demographic factors such as age. Heavy usage patterns have been linked to digital addiction, productivity loss, and increased battery and data consumption. This analysis contributes to the literature by providing a data-driven classification framework that combines behavioral and technical indicators in a single predictive model.
The idea for this topic came from the study “Diversity in Smartphone Usage”, which approaches the problem in a similar way and analyses very similar variables.
The dependent variable is is_heavy_user (binary, nominal scale). It was constructed from the original user_behavior_class variable:
This transformation allows the use of logistic regression, which is appropriate for binary outcomes.
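The recoding step can be sketched in R as follows. This is a minimal illustration on toy data, not the project's actual code: the cutoff `heavy_cutoff` is a hypothetical value chosen for the example, since the rule used in the project is not reproduced in this text.

```r
# Toy data standing in for the real dataset (values are illustrative only).
users <- data.frame(user_behavior_class = c(1, 2, 3, 4, 5, 4, 2))

# ASSUMPTION: classes at or above this cutoff are treated as "heavy".
# The cutoff actually used in the project is not stated here.
heavy_cutoff <- 4

# Binary outcome: 1 = heavy user, 0 = non-heavy user.
users$is_heavy_user <- as.integer(users$user_behavior_class >= heavy_cutoff)

# Cross-tabulate to confirm that each class maps to exactly one group.
table(users$user_behavior_class, users$is_heavy_user)
```

Cross-tabulating the original classes against the new variable is a quick way to confirm that the recoding behaves as intended before any model is fitted.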
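The following predictors were considered: app_usage_time_min_day, screen_on_time_hours_day, battery_drain_mah_day, number_of_apps_installed, data_usage_mb_day, age, gender, and operating_system.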
Variables were selected based on prior expectations and exploratory data analysis results, with predictive performance in mind.
Distribution of User Behavior Classes
The distribution of user_behavior_class shows a clear separation between lower and higher usage categories, justifying the binary transformation into heavy vs. non-heavy users.
Heavy Users by Operating System
Bar plots indicate noticeable differences in the proportion of heavy users across operating systems, suggesting that OS may play a role in usage intensity.
Age Distribution
The age histogram shows a broad but non-uniform distribution, indicating that age could be an important explanatory variable.
Screen Time vs. App Usage
A strong positive relationship is observed between screen-on time and app usage time. Heavy users clearly cluster at higher values of both variables, confirming their relevance as predictors.
Because the dependent variable is binary, a generalized linear model with binomial family and logit link was selected.
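For reference, the model form implied by this choice can be written as below. The symbols ($p_i$, $\beta_j$, $x_{ij}$, $k$) are generic notation introduced here for illustration and do not appear in the original analysis.

```latex
\[
\operatorname{logit}(p_i) = \log\frac{p_i}{1 - p_i}
  = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik},
\qquad
p_i = P(\text{is\_heavy\_user}_i = 1 \mid x_{i1}, \dots, x_{ik})
\]
\[
p_i = \frac{1}{1 + \exp\!\bigl(-(\beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik})\bigr)}
\]
```

Each coefficient $\beta_j$ is the change in the log-odds of being a heavy user for a one-unit increase in $x_{ij}$, holding the other predictors fixed.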
Key considerations:
Predictors were selected based on:
Two models were estimated: a full model containing all predictors, and a reduced model obtained by stepwise selection.
The stepwise logistic regression model retained app_usage_time_min_day as the only predictor of heavy smartphone usage. The residual deviance (2.3e-06) is essentially zero compared to the null deviance (700.6), indicating that app usage time perfectly separates heavy users from non-heavy users in the training data. This also explains the very low AIC (AIC = 4): with a log-likelihood of practically zero, the AIC reduces to the penalty for the two estimated parameters, so it reflects the separation rather than an ordinary good fit. The very large coefficient estimates, the standard errors that dwarf them, the non-significant p-values, and the 25 Fisher scoring iterations (the default maximum in glm) are all symptoms of complete or quasi-complete separation: the maximum likelihood estimates do not converge to finite values, so standard inference is unreliable and the coefficients should not be interpreted. Despite this, the model is highly effective for classification purposes and points to daily app usage time as the key driver of heavy user behavior.
##
## Call:
## glm(formula = is_heavy_user ~ app_usage_time_min_day, family = binomial(link = "logit"),
## data = train_data)
##
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)
## (Intercept)            -2213.446  84680.778  -0.026    0.979
## app_usage_time_min_day     7.352    281.243   0.026    0.979
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7.0060e+02 on 524 degrees of freedom
## Residual deviance: 2.3283e-06 on 523 degrees of freedom
## AIC: 4
##
## Number of Fisher Scoring iterations: 25
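The separation seen in this output can be reproduced on a small example. The sketch below uses made-up toy data and a hypothetical helper, `is_separated`, that checks whether a single predictor splits the two classes without overlap. It is an illustration of the phenomenon, not the code used in the project.

```r
# Toy data (illustrative only): daily usage minutes and a binary label.
toy <- data.frame(
  app_usage = c(30, 90, 150, 210, 280, 320, 400, 520),
  heavy     = c(0, 0, 0, 0, 0, 1, 1, 1)
)

# Hypothetical helper: TRUE if every value in one class lies strictly
# below every value in the other class (complete separation by x).
is_separated <- function(x, y) {
  max(x[y == 0]) < min(x[y == 1]) || max(x[y == 1]) < min(x[y == 0])
}

is_separated(toy$app_usage, toy$heavy)  # TRUE: 280 < 320

# Fitting a logit model to separated data produces warnings (suppressed
# here) and the same symptoms as above: diverging estimates, huge
# standard errors, and a residual deviance close to zero.
fit <- suppressWarnings(
  glm(heavy ~ app_usage, family = binomial(link = "logit"), data = toy)
)
deviance(fit)
summary(fit)$coefficients
```

Running such a check on each numeric predictor before fitting makes it clear in advance when the usual Wald tests will be uninformative.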
The full logistic regression model includes all behavioral, technical, and demographic predictors to explain heavy smartphone usage. As in the stepwise model, the residual deviance is almost zero compared to the null deviance, indicating that the model fits the training data essentially perfectly. The very large standard errors and p-values close to 1 again reflect separation, compounded by strong relationships and multicollinearity among the predictors, particularly between app usage time and screen-on time, so the individual coefficients cannot be interpreted. Although the model reproduces the outcome very accurately, its higher AIC (AIC = 18) shows that the added complexity does not improve model quality compared to the simpler stepwise model, making the full model less parsimonious and harder to interpret.
##
## Call:
## glm(formula = is_heavy_user ~ app_usage_time_min_day + screen_on_time_hours_day +
## battery_drain_mah_day + number_of_apps_installed + data_usage_mb_day +
## age + gender + operating_system, family = binomial(link = "logit"),
## data = train_data)
##
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)
## (Intercept)              -1.898e+02  5.130e+04  -0.004    0.997
## app_usage_time_min_day    1.589e-01  1.753e+02   0.001    0.999
## screen_on_time_hours_day  5.794e+00  9.506e+03   0.001    1.000
## battery_drain_mah_day     2.143e-02  4.221e+01   0.001    1.000
## number_of_apps_installed  8.517e-01  1.075e+03   0.001    0.999
## data_usage_mb_day         1.883e-02  5.610e+01   0.000    1.000
## age                      -1.232e-01  4.746e+02   0.000    1.000
## genderMale                4.921e+00  1.251e+04   0.000    1.000
## operating_systemiOS      -1.437e+00  1.793e+04   0.000    1.000
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7.0060e+02 on 524 degrees of freedom
## Residual deviance: 3.2339e-08 on 516 degrees of freedom
## AIC: 18
##
## Number of Fisher Scoring iterations: 25
## [1] 4.000002
## [1] 18
The Akaike Information Criterion (AIC) comparison favors the stepwise logistic regression model (AIC ≈ 4) over the full model (AIC = 18). AIC rewards good fit and penalizes model complexity. Because both models fit the training data almost perfectly (residual deviance close to zero in both cases), the difference (ΔAIC ≈ 14) comes almost entirely from the penalty for the seven additional parameters in the full model. The extra predictors therefore add complexity without any measurable gain in fit, which makes the stepwise model the preferred specification for this analysis.
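The two AIC values can be checked by hand from the model summaries. For a 0/1 response the saturated log-likelihood is zero, so the residual deviance $D$ equals $-2\log L$; $k$ denotes the number of estimated coefficients (2 in the stepwise model, 9 in the full model). The notation is introduced here for illustration.

```latex
\[
\mathrm{AIC} = -2\log L + 2k = D + 2k
\]
\[
\mathrm{AIC}_{\text{stepwise}} = 2.3283\times 10^{-6} + 2 \cdot 2 \approx 4.000002,
\qquad
\mathrm{AIC}_{\text{full}} = 3.2339\times 10^{-8} + 2 \cdot 9 \approx 18
\]
\[
\Delta\mathrm{AIC} \approx 18 - 4 = 14 = 2 \cdot (9 - 2)
\]
```

Both results match the printed values, and the last line shows that the gap between the models equals the penalty for the seven extra coefficients.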
The dataset was split into:
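Model performance was evaluated using confusion matrices on the 175 observations not used for training. Both models showed strong classification ability: the first matrix below contains no misclassifications, and the second contains two heavy users predicted as non-heavy (173 of 175 correct). The stepwise model achieved comparable accuracy with reduced complexity, making it preferable.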
##          Actual
## Predicted   0   1
##         0 103   0
##         1   0  72
##          Actual
## Predicted   0   1
##         0 103   0
##         1   0  72
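The summary measures behind these two tables can be computed directly from the counts. The sketch below re-enters the matrices by hand and uses a hypothetical helper, `classification_metrics`. The names `cm_first` and `cm_second` refer only to the order in which the matrices are printed above, since the output does not label which model produced which table.

```r
# Confusion matrices as printed above (rows = predicted, columns = actual).
# matrix() fills column by column.
labels <- list(Predicted = c("0", "1"), Actual = c("0", "1"))
cm_first  <- matrix(c(103, 0, 0, 72), nrow = 2, dimnames = labels)
cm_second <- matrix(c(103, 0, 2, 70), nrow = 2, dimnames = labels)

# Hypothetical helper: accuracy, sensitivity and specificity, treating
# class "1" (heavy user) as the positive class.
classification_metrics <- function(cm) {
  tp <- cm["1", "1"]  # heavy, predicted heavy
  tn <- cm["0", "0"]  # non-heavy, predicted non-heavy
  fp <- cm["1", "0"]  # non-heavy, predicted heavy
  fn <- cm["0", "1"]  # heavy, predicted non-heavy
  c(accuracy    = (tp + tn) / sum(cm),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp))
}

round(classification_metrics(cm_first), 4)
round(classification_metrics(cm_second), 4)
```

For the second table this gives an accuracy of 173/175 and a sensitivity of 70/72, with specificity unchanged at 1, so the only errors are heavy users classified as non-heavy.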
The analysis identified the primary driver of heavy smartphone usage, although the results are shaped by a well-known statistical phenomenon.
The models exhibited complete or quasi-complete separation, particularly with the variable app_usage_time_min_day. This predictor distinguishes heavy users from regular users so sharply that the maximum likelihood estimates diverge, which inflates the standard errors and makes the p-values non-significant.
While the p-values are uninformative under separation, the confusion matrices and the near-zero residual deviance show that the model is highly effective as a classifier. We achieved near-perfect classification accuracy, which suggests that, in this dataset, “Heavy Usage” is a sharply delimited category rather than a gradual trend.
The Stepwise Model is preferred over the Full Model on AIC because it drops predictors that add nothing to the fit. It indicates that behavioral data (app usage time) is a far more reliable indicator of user type in this dataset than demographics or operating system.
The project produced a highly accurate classification framework. The “failed” significance tests are a consequence of the overwhelming predictive strength of the behavioral data rather than evidence against it, supporting daily app engagement as the central metric for user segmentation.
This project was developed with conceptual and structural support inspired by AI-based suggestions (Notebook LM and Gemini), used strictly as a learning and assistance tool. All code, methods, and interpretations were fully understood and verified by the author, in accordance with course guidelines.