Introduction

For this data dive I am critiquing my Week 11 notebook, where I built a linear regression model using the Google Play Store dataset to predict app ratings from log-transformed review counts and app type (free vs paid). I want to be honest about what that model got wrong, what I would do differently, and what ethical questions it raises.


library(dplyr)
library(ggplot2)
library(plotly)
library(DT)
library(car)

play <- read.csv("googleplaystore.csv", stringsAsFactors = FALSE)

play_clean <- play %>%
  filter(!is.na(Rating), Rating <= 5, Rating >= 1) %>%
  mutate(
    Reviews  = as.numeric(Reviews),
    # Installs arrive as strings like "10,000+", so strip the punctuation before converting
    Installs = as.numeric(gsub("[+,]", "", Installs))
  ) %>%
  filter(!is.na(Reviews), !is.na(Installs), Type %in% c("Free", "Paid")) %>%
  mutate(
    log_Reviews  = log(Reviews + 1),
    log_Installs = log(Installs + 1)
  )

cat("Rows after cleaning:", nrow(play_clean))
## Rows after cleaning: 9366

Analytical Critique

Issue 1 - Rating is bounded and linear regression does not know that

I used Rating as the response variable in a standard lm(), but ratings only run from 1 to 5. Linear regression can happily predict a 5.9 or a 0.2, which is meaningless. I noticed the Q-Q plot looked off in Week 11, and this bounded, left-skewed response is a large part of why.
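
To make this concrete, here is a quick self-contained check. It refits the same Week 11 formula (model_original is only defined further down), and the extreme review count is purely hypothetical; the point is that nothing in the fitted lm object knows about the 1 to 5 bounds.

# Refit the Week 11 specification just for this illustration
lm_demo <- lm(Rating ~ log_Reviews + Type, data = play_clean)

# A hypothetical free app with ten times the largest observed review count
extreme_app <- data.frame(
  log_Reviews = log(10 * max(play_clean$Reviews) + 1),
  Type        = "Free"
)

# The model happily extrapolates; nothing clips this prediction to the 1-5 range
predict(lm_demo, newdata = extreme_app)

# The in-sample fitted values are not constrained to the 1-5 range either
range(fitted(lm_demo))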

p1 <- play_clean %>%
  ggplot(aes(x = Rating)) +
  geom_histogram(binwidth = 0.1, fill = "#e67e22", color = "white", alpha = 0.9) +
  geom_vline(xintercept = c(1, 5), linetype = "dashed", color = "#2d2d2d") +
  labs(
    title = "Ratings are left-skewed and pile up near 4.5",
    subtitle = "OLS puts no bounds on its predictions, and the skew strains the usual error assumptions",
    x = "Rating", y = "Count"
  ) +
  theme_minimal(base_size = 13)

ggplotly(p1)

I would recommend a beta regression here, since it is designed for bounded continuous responses. I would rescale the ratings onto the 0 to 1 interval first (beta regression needs values strictly inside the interval, so the boundary cases have to be nudged inward) and then fit that model instead; a sketch of this is below.
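
Here is a rough sketch of what that would look like, assuming the betareg package is available. The boundary squeeze is the usual Smithson and Verkuilen style adjustment, so the response stays strictly inside (0, 1).

library(betareg)

play_beta <- play_clean %>%
  mutate(
    # rescale the 1-5 ratings onto 0-1
    rating_01 = (Rating - 1) / 4,
    # squeeze exact 0s and 1s slightly inward so the beta likelihood is defined
    rating_01 = (rating_01 * (n() - 1) + 0.5) / n()
  )

model_beta <- betareg(rating_01 ~ log_Reviews + Type, data = play_beta)
summary(model_beta)

# predictions come back on the 0-1 scale, so they can never escape the bounds
summary(predict(model_beta, type = "response"))

The coefficients sit on the logit scale of the rescaled rating, so they are less directly readable than the lm ones, but the predictions respect the bounds by construction.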


Issue 2 - Two predictors leave a lot unexplained

The Week 11 model only used log_Reviews and Type, and the R-squared came out pretty low. I had log_Installs and Category sitting right there in the dataset and I did not use them, which I think was a mistake.

model_original <- lm(Rating ~ log_Reviews + Type, data = play_clean)

# keep the eight most common categories so Category does not explode into dozens of sparse dummy levels
top_cats <- play_clean %>%
  count(Category) %>%
  top_n(8, n) %>%
  pull(Category)

play_top <- play_clean %>%
  filter(Category %in% top_cats)

# refit the Week 11 model on the same filtered rows so the AIC values are comparable
# (AIC only means something across models fit to the same data)
model_original_top <- update(model_original, data = play_top)

model_extended <- lm(Rating ~ log_Reviews + log_Installs + Type + Category,
                     data = play_top)

compare_df <- data.frame(
  Model       = c("Week 11 model refit on the same rows", "Extended with installs and category"),
  AIC         = round(c(AIC(model_original_top), AIC(model_extended)), 1),
  R_squared   = round(c(summary(model_original_top)$r.squared,
                         summary(model_extended)$r.squared), 3)
)

datatable(compare_df, rownames = FALSE, options = list(dom = "t"),
          caption = "Lower AIC and higher R-squared both point to the extended model being a better fit")

The extended model fits noticeably better on both AIC and R-squared. I think adding Category is especially important because user expectations and rating behaviour are completely different across categories.


Issue 3 - I spotted heteroskedasticity but did not do anything about it

In Week 11 I correctly identified that the scale-location plot showed fanning residuals, which means the error variance is not constant. I just described it and moved on, but it actually means my standard errors are unreliable, so the interpretation I gave of the TypePaid coefficient might be off.

play_clean <- play_clean %>%
  mutate(
    fitted_orig    = fitted(model_original),
    resid_std_orig = rstandard(model_original)  # standardized residuals, as in the scale-location plot
  )

p2 <- play_clean %>%
  ggplot(aes(x = fitted_orig, y = sqrt(abs(resid_std_orig)))) +
  geom_point(alpha = 0.15, color = "#8e44ad") +
  geom_smooth(method = "loess", color = "#e74c3c", se = FALSE) +
  labs(
    title = "Residual spread fans out as fitted values increase",
    subtitle = "The upward trend confirms heteroskedasticity, so the usual standard errors are not trustworthy here",
    x = "Fitted Values", y = "sqrt of absolute standardized residuals"
  ) +
  theme_minimal(base_size = 13)

ggplotly(p2)

Since my proposed fix adds log_Installs alongside log_Reviews, I also checked that the two are not so correlated that the coefficient estimates become unstable:

vif_vals <- vif(lm(Rating ~ log_Reviews + log_Installs + Type, data = play_clean)) %>%
  as.data.frame() %>%
  tibble::rownames_to_column("Predictor") %>%
  rename(VIF = 2) %>%
  mutate(
    VIF  = round(VIF, 2),
    Flag = ifelse(VIF > 5, "high", "ok")
  )

datatable(vif_vals, rownames = FALSE, options = list(dom = "t"),
          caption = "VIF above 5 is a red flag for multicollinearity")

I would at minimum use robust standard errors to correct for heteroskedasticity before interpreting any of the coefficients, which I did not do in Week 11 at all; a sketch of that fix is below.
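
A minimal sketch, assuming the sandwich and lmtest packages are installed:

library(sandwich)
library(lmtest)

# Same point estimates as summary(model_original); only the standard errors,
# t-values, and p-values change, because the covariance matrix is swapped for
# the heteroskedasticity-consistent HC3 estimator
coeftest(model_original, vcov = vcovHC(model_original, type = "HC3"))

The coefficients themselves do not move, so this is purely about being honest regarding how much uncertainty sits around the TypePaid estimate before interpreting it.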


Ethical and Epistemological Concerns

Concern 1 - Survivorship bias in what got scraped

Only apps that made it onto the Play Store and survived long enough to be scraped show up in this dataset. Apps that were removed for low ratings or policy violations, or that simply died out, are invisible to me, which means my model was trained entirely on survivors. Any conclusions I draw about what predicts a good rating are really conclusions about what predicts surviving on the platform, which is a different thing.

Concern 2 - Ratings and reviews are gameable signals

The Week 11 model treats review count and rating as neutral signals of app quality. But there is a well-documented industry of fake reviews and coordinated rating boosts on the Play Store. I was basically modelling marketing budget as much as product quality, and I never acknowledged that.

Concern 3 - The data is not one stable population over time

The dataset is a static scrape but covers apps from very different time periods. App quality standards, user expectations, and Play Store policies all shifted significantly between 2015 and 2018. I am assuming these observations are independent and identically distributed, but if a policy change or platform shift happened mid-collection, that assumption breaks down. It is the same problem that comes up when you analyse Twitter data across an ownership change: the population before and after is not really the same, and treating it as one will bias the model without any obvious warning signs in the output. A rough look at the temporal drift is sketched below.
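
One rough check, assuming the Last.Updated column holds strings like "January 7, 2018" that lubridate::mdy() can parse; last-update year is only a proxy for when an app's ratings accumulated, so this is indicative at best.

library(lubridate)

play_clean %>%
  mutate(update_year = year(mdy(Last.Updated))) %>%
  filter(!is.na(update_year)) %>%
  group_by(update_year) %>%
  summarise(
    n_apps      = n(),
    mean_rating = round(mean(Rating), 2),
    .groups     = "drop"
  )

If the average rating or the volume of apps shifts sharply between years, that is a hint the one-population framing is already strained.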

Concern 4 - Category imbalance means majority patterns dominate

Games and tools have thousands of entries in this dataset while categories like Beauty or Events have under 50. A global model is going to learn patterns from the majority categories and quietly apply them everywhere, so I was effectively giving games-optimized recommendations to every app category, which is not a great epistemic position to be in. The sketch below quantifies the imbalance and checks whether a single global slope is even defensible.
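
Two quick checks, reusing the play_top subset from Issue 2; these are sketches of how I would probe the problem rather than a full fix.

# How lopsided the category counts actually are
play_clean %>%
  count(Category, sort = TRUE) %>%
  mutate(share_pct = round(100 * n / sum(n), 1))

# Does a single global log_Reviews slope hold up, or do the big categories
# demand their own? Compare a global-slope model against one where the slope
# varies by category, using an F-test on the interaction.
model_global      <- lm(Rating ~ log_Reviews + Type + Category, data = play_top)
model_by_category <- lm(Rating ~ log_Reviews * Category + Type, data = play_top)

anova(model_global, model_by_category)

If the interaction is significant, the honest options are either reporting per-category models or at least flagging that the global coefficients are dominated by the biggest categories.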