This week’s goal is to practice identifying analytical, ethical, and epistemological issues with statistical models.
I selected my Week 8 Data Dive (“Regression Modeling”) for critique. Here is the link.
In that week, I built a simple linear regression model to predict Screen Time based on Sleep Time and other variables.
The original model:
# Simplified example:
model_week8 <- lm(ScreenTime ~ SleepTime + Age, data = screentime_data)
summary(model_week8)
At first glance, the model produced some reasonable-looking coefficients, but applying the Week 14 concepts now makes several issues visible.
Omitted Variables: Important factors such as device type, stress level, and school/work demands were missing. This introduces omitted variable bias: the estimated coefficient on Sleep Time can absorb the effect of these unmeasured drivers.
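As a sketch of what a fuller specification might look like (DeviceType and StressLevel are hypothetical columns that were not in the actual dataset):
# Hypothetical expanded model; DeviceType and StressLevel are assumed
# columns that did NOT exist in the original screentime_data:
model_expanded <- lm(ScreenTime ~ SleepTime + Age + DeviceType + StressLevel,
                     data = screentime_data)
summary(model_expanded)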
Assumption Violations: There was no clear check for linearity, normality of residuals, or constant variance (homoscedasticity), so these assumptions may well have been violated.
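Checking them is cheap in base R; a minimal diagnostic pass on the Week 8 model could have looked like this:
# Residuals-vs-fitted, normal Q-Q, scale-location, and leverage plots:
par(mfrow = c(2, 2))
plot(model_week8)
# Formal normality test on the residuals (sensitive with large samples):
shapiro.test(residuals(model_week8))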
Small Sample Bias: If the dataset was small or unbalanced (e.g., skewed toward certain age groups), the model could easily overfit or mislead.
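A quick sanity check on sample size and age balance, assuming Age is a numeric column (the break points below are arbitrary):
# How many observations, and how are they spread across age groups?
nrow(screentime_data)
table(cut(screentime_data$Age, breaks = c(0, 18, 25, 40, 65, Inf)))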
Misleading Interpretations: The model risked implying causation (“sleep causes screen time”) when only correlation was modeled. That’s misleading if shared without context.
Representation Problems: If the dataset mostly contained young people or certain groups, results wouldn’t generalize fairly to broader populations.
Data Source Transparency: Without explaining where the screen time data came from and what its limitations are, readers could place unwarranted trust in the model.
What can we know from this model? The model only shows associations within this dataset, not real-world cause and effect.
Overconfidence Risk: The relatively good fit statistics (e.g., R-squared) could falsely boost confidence in predictions of screen time behavior.
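One way to deflate that confidence is an out-of-sample check. This is only a sketch, assuming screentime_data has enough rows for an 80/20 split:
# Hold out 20% of rows and compare in-sample fit to out-of-sample error:
set.seed(42)  # arbitrary seed, for reproducibility only
n        <- nrow(screentime_data)
test_idx <- sample(n, size = floor(0.2 * n))
fit  <- lm(ScreenTime ~ SleepTime + Age, data = screentime_data[-test_idx, ])
pred <- predict(fit, newdata = screentime_data[test_idx, ])
# Root mean squared prediction error on unseen rows; often less flattering
# than the in-sample R-squared:
sqrt(mean((screentime_data$ScreenTime[test_idx] - pred)^2))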
Bias in Variable Choice: Sleep Time was treated as the main predictor, but in reality many unmeasured variables (e.g., stress, social media addiction) might be more powerful drivers.
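Had such variables been collected, a simple information-criterion comparison against the hypothetical expanded model sketched above could test that hunch:
# Lower AIC means a better fit/complexity trade-off (hypothetical comparison,
# since model_expanded relies on columns the real data lacks):
AIC(model_week8, model_expanded)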
Model diagnosis: The Week 8 regression was simple, but that simplicity hid several important risks.
Fixes going forward: