1. Objective of the Analysis

The objective of this project is to identify and explain the factors that distinguish heavy smartphone users from regular users. Using a generalized linear model with a binomial (logit) link function, the analysis aims to predict whether a user belongs to the heavy usage group based on behavioral, technical, and demographic variables. Understanding these patterns is relevant from both a business perspective (e.g., mobile app monetization, user segmentation) and a scientific perspective (digital behavior and technology dependence).

2. Dataset Description and Source

The dataset used in this project is the "Mobile Device Usage and User Behavior Dataset", obtained from the open-data repository Kaggle. It contains detailed information on smartphone usage patterns, device characteristics, and user demographics.

Each observation represents one smartphone user. The dataset includes variables related to:

  • App usage intensity
  • Screen-on time
  • Battery and data consumption
  • Installed applications
  • Demographics (age, gender)
  • Operating system

The dataset is well suited for classification tasks and allows the application of a generalized linear model with a binomial family, as discussed during the course.
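
As a minimal loading sketch (the file name below is an assumption based on the Kaggle listing, not a verified path):

# Minimal loading sketch; the file name is an assumption based on the
# Kaggle dataset listing.
data <- read.csv("user_behavior_dataset.csv", stringsAsFactors = FALSE)

str(data)      # inspect variable types
summary(data)  # quick overview of ranges and potential issues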

3. Brief Literature Context

Previous studies on smartphone usage behavior indicate that excessive usage is strongly associated with screen-on time, app engagement, and demographic factors such as age. Heavy usage patterns have been linked to digital addiction, productivity loss, and increased battery and data consumption. This analysis contributes to the literature by providing a data-driven classification framework that combines behavioral and technical indicators in a single predictive model.

The idea for this topic came while reading the study "Diversity in Smartphone Usage", which takes a similar approach to the problem and analyses very similar variables (link to the study).

4. Dependent Variable

The dependent variable is is_heavy_user (binary, nominal scale). It was constructed from the original user_behavior_class variable:

  • 0 – light to moderate users (classes 1–3)
  • 1 – heavy users (classes 4–5)

This transformation allows the use of logistic regression, which is appropriate for binary outcomes.
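
A minimal sketch of this recoding in R, assuming the raw classes are stored in user_behavior_class with integer levels 1 to 5:

# Classes 4-5 are coded as heavy users, classes 1-3 as non-heavy.
data$is_heavy_user <- ifelse(data$user_behavior_class >= 4, 1, 0)

table(data$user_behavior_class, data$is_heavy_user)  # verify the mapping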

5. Independent Variables

The following predictors were considered:

  • app_usage_time_min_day
  • screen_on_time_hours_day
  • battery_drain_mah_day
  • number_of_apps_installed
  • data_usage_mb_day
  • age
  • gender
  • operating_system

Variables were selected based on prior expectations and exploratory data analysis results, with predictive performance as the guiding goal.

6. Exploratory Data Analysis (EDA)

Distribution of User Behavior Classes

The distribution of user_behavior_class shows a clear separation between lower and higher usage categories, justifying the binary transformation into heavy vs. non-heavy users.

Heavy Users by Operating System

Bar plots indicate noticeable differences in the proportion of heavy users across operating systems, suggesting that OS may play a role in usage intensity.

Age Distribution

The age histogram shows a broad but non-uniform distribution, indicating that age could be an important explanatory variable.

Screen Time vs. App Usage

A strong positive relationship is observed between screen-on time and app usage time. Heavy users clearly cluster at higher values of both variables, confirming their relevance as predictors.
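
A hedged ggplot2 sketch of this last relationship (variable and object names follow Sections 4-5 and are assumptions about the cleaned data frame):

library(ggplot2)

# App usage vs. screen-on time, coloured by the binary outcome.
ggplot(data, aes(x = screen_on_time_hours_day,
                 y = app_usage_time_min_day,
                 colour = factor(is_heavy_user))) +
  geom_point(alpha = 0.6) +
  labs(x = "Screen-on time (hours/day)",
       y = "App usage time (min/day)",
       colour = "Heavy user")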

7. Model Selection and Assumptions

Because the dependent variable is binary, a generalized linear model with binomial family and logit link was selected.

Key considerations:

  • The outcome variable is binary → OLS assumptions are violated
  • Logistic regression does not require normally distributed predictors
  • Independence of observations is assumed
  • Multicollinearity was monitored through model diagnostics (see the VIF sketch below)

The binomial GLM is therefore the most appropriate modeling framework for this problem.
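
One common way to monitor multicollinearity is through variance inflation factors; the sketch below assumes the full model object is called full_model (as in Section 9):

library(car)

# Generalised VIFs for the full logistic model; values well above ~5-10
# would flag strongly overlapping predictors such as app usage time and
# screen-on time.
vif(full_model)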

8. Predictor Selection Strategy

Predictors were selected based on:

  • Exploratory visualizations
  • Practical relevance to smartphone usage behavior
  • Inferential contribution assessed via stepwise model selection (AIC criterion)

A stepwise procedure (both directions, sketched below) was applied to reduce model complexity while maintaining explanatory power.
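
A sketch of this procedure in R, assuming the training frame and object names used later in Section 9 (train_data, full_model, step_model):

# Full specification, then bidirectional stepwise search on AIC.
full_model <- glm(is_heavy_user ~ app_usage_time_min_day + screen_on_time_hours_day +
                    battery_drain_mah_day + number_of_apps_installed +
                    data_usage_mb_day + age + gender + operating_system,
                  family = binomial(link = "logit"), data = train_data)

step_model <- step(full_model, direction = "both", trace = FALSE)
summary(step_model)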

9. Model Fitting and Interpretation

Two models were estimated:

  • Full logistic regression model (all predictors)
  • Stepwise-selected logistic regression model

9.1 Models Summary

Stepwise Model

The stepwise logistic regression model retained app_usage_time_min_day as the only predictor of heavy smartphone usage. The model shows an extremely low residual deviance compared to the null deviance, indicating that app usage time almost perfectly separates heavy users from non-heavy users in the training data. This explains the very low AIC value (AIC = 4), which suggests an excellent fit with minimal complexity. However, the unusually large coefficient estimates and non-significant p-values indicate quasi-complete separation, meaning the predictor is so strongly related to the outcome that standard inference becomes unstable. Despite this, the model is highly effective for classification purposes and confirms that daily app usage time is the key driver of heavy user behavior.

## 
## Call:
## glm(formula = is_heavy_user ~ app_usage_time_min_day, family = binomial(link = "logit"), 
##     data = train_data)
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)
## (Intercept)            -2213.446  84680.778  -0.026    0.979
## app_usage_time_min_day     7.352    281.243   0.026    0.979
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7.0060e+02  on 524  degrees of freedom
## Residual deviance: 2.3283e-06  on 523  degrees of freedom
## AIC: 4
## 
## Number of Fisher Scoring iterations: 25
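
One quick, hedged check of the separation issue described above is to inspect the fitted probabilities of the stepwise model (step_model is an assumed object name): under quasi-complete separation they collapse to numerical 0 or 1.

p_hat <- fitted(step_model)
summary(p_hat)        # expected to show values essentially at 0 and 1
table(round(p_hat))   # counts of near-0 vs. near-1 fitted probabilities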


Full Model

The full logistic regression model includes all behavioral, technical, and demographic predictors to explain heavy smartphone usage. Similar to the stepwise model, the residual deviance is almost zero compared to the null deviance, indicating that the model fits the training data extremely well. However, the very large standard errors and non-significant p-values suggest instability caused by strong relationships and multicollinearity among predictors, particularly between app usage time and screen-on time. Although the model explains the outcome very accurately, its higher AIC value (AIC = 18) shows that the added complexity does not improve model quality compared to the simpler stepwise model, making the full model less parsimonious and harder to interpret.

## 
## Call:
## glm(formula = is_heavy_user ~ app_usage_time_min_day + screen_on_time_hours_day + 
##     battery_drain_mah_day + number_of_apps_installed + data_usage_mb_day + 
##     age + gender + operating_system, family = binomial(link = "logit"), 
##     data = train_data)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)
## (Intercept)              -1.898e+02  5.130e+04  -0.004    0.997
## app_usage_time_min_day    1.589e-01  1.753e+02   0.001    0.999
## screen_on_time_hours_day  5.794e+00  9.506e+03   0.001    1.000
## battery_drain_mah_day     2.143e-02  4.221e+01   0.001    1.000
## number_of_apps_installed  8.517e-01  1.075e+03   0.001    0.999
## data_usage_mb_day         1.883e-02  5.610e+01   0.000    1.000
## age                      -1.232e-01  4.746e+02   0.000    1.000
## genderMale                4.921e+00  1.251e+04   0.000    1.000
## operating_systemiOS      -1.437e+00  1.793e+04   0.000    1.000
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7.0060e+02  on 524  degrees of freedom
## Residual deviance: 3.2339e-08  on 516  degrees of freedom
## AIC: 18
## 
## Number of Fisher Scoring iterations: 25

9.2 AIC

Stepwise Model

## [1] 4.000002

Full Model

## [1] 18

The Akaike Information Criterion (AIC) comparison shows that the stepwise logistic regression model (AIC ≈ 4) performs substantially better than the full model (AIC = 18). Since AIC penalizes unnecessary model complexity while rewarding good fit, the much lower value indicates that the stepwise model achieves a better balance between explanatory power and parsimony. The large difference in AIC (ΔAIC ≈ 14) provides strong evidence that removing non-informative predictors improves the model without sacrificing its ability to explain the data, making the stepwise model the preferred specification for this analysis.
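
The comparison can be reproduced directly from the fitted objects (assumed names as above):

AIC(step_model, full_model)        # per-model df and AIC
AIC(full_model) - AIC(step_model)  # Delta AIC of roughly 14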

9.3 Plots of Results and Performance


Full Model

(Figure: fitted results and performance plots for the full model.)

Stepwise Model

(Figure: fitted results and performance plots for the stepwise model.)
While the p-values suggest insignificance, the residual deviance and classification accuracy show the opposite: the predictors are highly influential. However, because the fitted probabilities have saturated at 0 and 1 under quasi-complete separation, the plotted regression curves collapse into near-step functions and do not provide meaningful incremental insight; they should be read as a demonstration of the model's stability limits rather than as evidence of a lack of correlation.
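
For completeness, a hedged sketch of how such a fitted-probability curve could be drawn for the stepwise model; under separation the curve degenerates into a near-vertical step at the class boundary (object names are assumptions):

library(ggplot2)

# Predicted probability of heavy usage across the observed range of
# app usage time.
grid <- data.frame(
  app_usage_time_min_day = seq(min(train_data$app_usage_time_min_day),
                               max(train_data$app_usage_time_min_day),
                               length.out = 500)
)
grid$p_heavy <- predict(step_model, newdata = grid, type = "response")

ggplot(grid, aes(app_usage_time_min_day, p_heavy)) +
  geom_line() +
  labs(x = "App usage time (min/day)", y = "Predicted P(heavy user)")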


10. Model Evaluation

The dataset was split into:

  • 75% training data
  • 25% testing data

Model performance was evaluated using confusion matrices on the test set (one per model, shown below). Both models showed strong classification ability, but the stepwise model achieved comparable accuracy with reduced complexity, making it preferable.

##          Actual
## Predicted   0   1
##         0 103   0
##         1   0  72

##          Actual
## Predicted   0   1
##         0 103   2
##         1   0  70
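
A sketch of how the split and the confusion matrices above could be produced (the seed value and the 0.5 probability threshold are assumptions; the report does not state them):

set.seed(123)  # assumed seed; not stated in the report

# 75/25 train-test split
n          <- nrow(data)
train_idx  <- sample(seq_len(n), size = floor(0.75 * n))
train_data <- data[train_idx, ]
test_data  <- data[-train_idx, ]

# Confusion matrix on the test set for the stepwise model, using a 0.5
# probability threshold.
pred_prob  <- predict(step_model, newdata = test_data, type = "response")
pred_class <- ifelse(pred_prob > 0.5, 1, 0)
table(Predicted = pred_class, Actual = test_data$is_heavy_user)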

11. Conclusions

The analysis successfully identified the primary drivers of heavy smartphone usage, though the results revealed a unique statistical phenomenon.

The models exhibited quasi-complete separation, particularly with the variable app_usage_time_min_day. This means the predictor is so powerful at distinguishing heavy users from regular users that it caused the standard errors to inflate and p-values to become non-significant.

While the p-values suggest instability, the Confusion Matrix and extremely low AIC (4) prove the model is highly effective. We achieved near-perfect classification accuracy, demonstrating that “Heavy Usage” is a distinct behavioral state rather than a gradual trend.

The Stepwise Model outperformed the Full Model by removing noise. It proved that behavioral data (app usage time) is a far more reliable indicator of user type than demographics or operating systems.

The project successfully created a high-precision classification framework. The “failed” significance tests are actually a testament to the overwhelming predictive strength of the behavioral data, confirming that daily app engagement is the definitive metric for user segmentation.

12. AI Tool Usage Disclosure

This project was developed with conceptual and structural support inspired by AI-based suggestions (Notebook LM and Gemini), used strictly as a learning and assistance tool. All code, methods, and interpretations were fully understood and verified by the author, in accordance with course guidelines.