Last week I found that more reviews tend to push ratings up, but the R² was only 3.3% - which basically means log(reviews) alone is not really telling the whole story. That’s not surprising. One thing I kept wondering was whether free apps vs paid apps behave differently. My gut says free apps get harsher reviews because people hold them to a higher standard when they have nothing to lose by downloading. I also want to check whether the “Everyone” content label from Part 1 last week carries any extra weight once reviews are already in the model.
So this week, I’m keeping log_reviews from last week and trying to add:
is_free - a binary variable (1 = Free, 0 = Paid)
log_installs - considered, but evaluated for multicollinearity
is_everyone - a binary for the broadest content audience
library(dplyr)  # pipes and data-wrangling verbs used below
library(broom)  # tidy() and glance() for model summaries
playstore <- read.csv("C:/Users/IU Student/Downloads/Data Dive_week 9_Regression Diagnostics/googleplaystore.csv", stringsAsFactors = FALSE)
# Same cleaning pipeline as last week
playstore <- playstore %>%
  filter(!is.na(Rating), Rating <= 5, Rating >= 1) %>%
  filter(!is.na(Reviews)) %>%
  mutate(Reviews = as.numeric(Reviews)) %>%
  filter(!is.na(Reviews), Reviews > 0) %>%
  distinct(App, .keep_all = TRUE)
# Log-transform review counts (base 10)
playstore <- playstore %>%
  mutate(log_reviews = log10(Reviews))
# Binary indicator: 1 = Free, 0 = Paid
playstore <- playstore %>%
  mutate(is_free = ifelse(Type == "Free", 1, 0)) %>%
  filter(!is.na(is_free))
# Binary indicator: 1 = "Everyone" content rating, 0 = any other rating
playstore <- playstore %>%
  mutate(is_everyone = ifelse(Content.Rating == "Everyone", 1, 0))
# Strip "+" and "," from Installs, convert to numeric, then log-transform
playstore <- playstore %>%
  mutate(Installs_clean = as.numeric(gsub("[+,]", "", Installs))) %>%
  filter(!is.na(Installs_clean), Installs_clean > 0) %>%
  mutate(log_installs = log10(Installs_clean))
cat("Clean rows:", nrow(playstore))
## Clean rows: 8196
is_free (Binary - included)
Free vs. paid is one of the most obvious splits in the Play Store. As Leon mentioned in the lecture, R handles binary variables by treating 1 and 0 as a linear predictor, which is exactly what we want here because the relationship between “free or not” and rating is directional and clean. I’m expecting free apps to rate slightly lower on average because they attract more casual users who are quicker to leave a bad review.
playstore %>%
  group_by(is_free) %>%
  summarise(
    n = n(),
    mean_rating = round(mean(Rating), 3),
    sd_rating = round(sd(Rating), 3)
  )
## # A tibble: 2 × 4
## is_free n mean_rating sd_rating
## <dbl> <int> <dbl> <dbl>
## 1 0 604 4.26 0.56
## 2 1 7592 4.17 0.534
There is a small but real difference: paid apps rate about 0.09 stars higher on average (4.26 vs. 4.17). This is worth keeping in the model.
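As a rough check on whether that gap is more than noise, here is a minimal sketch that computes a Welch-style t statistic from the summary table alone. It assumes the rounded means, standard deviations, and group sizes printed above, so the result is approximate; the variable names are mine, not from the original analysis.

```r
# Summary statistics copied from the group_by(is_free) table above (rounded values)
n_paid <- 604;  mean_paid <- 4.26; sd_paid <- 0.56
n_free <- 7592; mean_free <- 4.17; sd_free <- 0.534

# Welch's t statistic: difference in means divided by its standard error
diff_means <- mean_paid - mean_free
se_diff    <- sqrt(sd_paid^2 / n_paid + sd_free^2 / n_free)
t_welch    <- diff_means / se_diff

cat("Difference:", round(diff_means, 3),
    " SE:", round(se_diff, 4),
    " t:", round(t_welch, 2), "\n")
```

With these inputs the statistic comes out at roughly 3.8, which fits the reading that the difference is small but not just sampling noise. Running t.test(Rating ~ is_free, data = playstore) on the full data would be the more precise version.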
log_installs (Continuous - check for multicollinearity, exclude)
Installs and reviews are both measures of how popular an app is. An app with millions of installs almost certainly also has a lot of reviews. That’s the classic multicollinearity trap: two predictors that are measuring basically the same underlying thing. If I include both, the model can’t tell which one is doing the work, and my coefficients become unreliable.
cor(playstore$log_reviews, playstore$log_installs, use = "complete.obs")
## [1] 0.9529982
The correlation between log_reviews and log_installs is very high (about 0.95). That’s a red flag. As mentioned in the lecture, independent variables should not be strongly linearly correlated with each other because it messes with how you interpret the coefficients. This is exactly that situation - so log_installs is out.
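To put a number on how bad that overlap is, here is a small sketch of the variance inflation factor (VIF) implied by the correlation. It uses the two-predictor shortcut VIF = 1 / (1 - r^2), which is an assumption on my part: the real model has other predictors too, so a full VIF (for example from car::vif on the fitted model) would differ somewhat.

```r
# Correlation between log_reviews and log_installs, as reported by cor() above
r <- 0.9529982

# With exactly two predictors, the VIF reduces to 1 / (1 - r^2)
vif_pair <- 1 / (1 - r^2)

# Standard errors are inflated by sqrt(VIF) relative to uncorrelated predictors
se_inflation <- sqrt(vif_pair)

cat("Approximate VIF:", round(vif_pair, 2),
    " SE inflation factor:", round(se_inflation, 2), "\n")
```

That works out to a VIF of roughly 10.9, meaning standard errors about 3.3 times larger than they would be with uncorrelated predictors. Common rules of thumb start flagging VIFs somewhere between 5 and 10, so dropping log_installs looks reasonable.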
is_everyone (Binary - include)
From last week’s ANOVA, I already know that “Everyone” apps rate lower on average than more targeted audiences. But does that effect hold once we control for reviews? Adding is_everyone as a binary lets me test whether audience breadth has an independent effect on rating beyond what engagement volume already explains.
playstore %>%
  group_by(is_everyone) %>%
  summarise(
    n = n(),
    mean_rating = round(mean(Rating), 3)
  )
## # A tibble: 2 × 3
## is_everyone n mean_rating
## <dbl> <int> <dbl>
## 1 0 1578 4.20
## 2 1 6618 4.17
The gap between “Everyone” and other content groups is small (4.17 vs. 4.20) but visible in the means. Including this makes intuitive sense - it’s not just about engagement, it’s also about audience fit.
The final model has three predictors plus an intercept:
\[R_i = \beta_0 + \beta_1 \cdot \log_{10}(\text{Reviews}_i) + \beta_2 \cdot \text{is\_free}_i + \beta_3 \cdot \text{is\_everyone}_i + \varepsilon_i\]
model2 <- lm(Rating ~ log_reviews + is_free + is_everyone, data = playstore)
summary(model2)
##
## Call:
## lm(formula = Rating ~ log_reviews + is_free + is_everyone, data = playstore)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.10267 -0.19874 0.06609 0.30073 1.06541
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.095669 0.027226 150.431 < 2e-16 ***
## log_reviews 0.065072 0.003723 17.478 < 2e-16 ***
## is_free -0.161083 0.022577 -7.135 1.05e-12 ***
## is_everyone 0.006998 0.014971 0.467 0.64
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5262 on 8192 degrees of freedom
## Multiple R-squared: 0.03868, Adjusted R-squared: 0.03833
## F-statistic: 109.9 on 3 and 8192 DF, p-value: < 2.2e-16
tidy(model2) %>%
  mutate(across(where(is.numeric), ~round(., 4)))
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 4.10 0.0272 150. 0
## 2 log_reviews 0.0651 0.0037 17.5 0
## 3 is_free -0.161 0.0226 -7.13 0
## 4 is_everyone 0.007 0.015 0.467 0.640
glance(model2) %>%
  select(r.squared, adj.r.squared, sigma, statistic, p.value, df) %>%
  mutate(across(where(is.numeric), ~round(., 4)))
## # A tibble: 1 × 6
## r.squared adj.r.squared sigma statistic p.value df
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.0387 0.0383 0.526 110. 0 3
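Interpreting the coefficients:
log_reviews (0.065, p < 0.001): holding the other predictors fixed, a tenfold increase in reviews is associated with a rating about 0.065 stars higher.
is_free (-0.161, p < 0.001): at the same review count and content label, free apps rate about 0.16 stars lower than paid apps.
is_everyone (0.007, p = 0.64): once reviews and free/paid status are in the model, the “Everyone” label has no detectable effect on rating.
The adjusted R² has improved from 3.27% to 3.83%, which is modest but expected as ratings are noisy by nature. log_reviews and is_free are statistically significant; is_everyone is not.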
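To make the coefficients more concrete, here is a minimal sketch that plugs the estimates from the summary output into the model equation. The helper predict_rating() and the example app with 10,000 reviews are hypothetical, chosen only for illustration; with the fitted object available, predict(model2, newdata = ...) would do the same job.

```r
# Coefficient estimates copied from summary(model2) above
coefs <- c(intercept   = 4.095669,
           log_reviews = 0.065072,
           is_free     = -0.161083,
           is_everyone = 0.006998)

# Hypothetical helper: predicted rating for a given review count and flags
predict_rating <- function(reviews, is_free, is_everyone) {
  unname(coefs["intercept"] +
           coefs["log_reviews"] * log10(reviews) +
           coefs["is_free"] * is_free +
           coefs["is_everyone"] * is_everyone)
}

# Example: an "Everyone" app with 10,000 reviews, free vs. paid
free_app <- predict_rating(10000, is_free = 1, is_everyone = 1)
paid_app <- predict_rating(10000, is_free = 0, is_everyone = 1)

cat("Free:", round(free_app, 3), " Paid:", round(paid_app, 3),
    " Gap:", round(paid_app - free_app, 3), "\n")
```

The predicted ratings are about 4.20 for the free app and 4.36 for the paid one. The gap equals the is_free coefficient regardless of review count, which is what an additive model implies.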
One thing worth considering: does the relationship between review volume and rating work the same way for free apps and paid apps? Maybe paid apps with high reviews are genuinely better products, while free apps accumulate reviews just from being widely available.
model_interaction <- lm(Rating ~ log_reviews * is_free + is_everyone, data = playstore)
summary(model_interaction)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.033608148 0.04915514 82.058718 0.000000e+00
## log_reviews 0.090458806 0.01715077 5.274329 1.366486e-07
## is_free -0.095257309 0.04892958 -1.946825 5.158990e-02
## is_everyone 0.007497276 0.01497296 0.500721 6.165809e-01
## log_reviews:is_free -0.026585455 0.01753238 -1.516363 1.294662e-01
The interaction term (log_reviews:is_free) turns out to be very small (about -0.027) and not statistically significant (p ≈ 0.13). This tells me the slope of log_reviews doesn’t meaningfully change depending on whether an app is free or paid. I’ll leave the interaction out and keep the simpler additive model.
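For a sense of what the interaction would have meant in practice, this short sketch works out the log_reviews slope each group gets under the interaction model. It only uses the estimates printed above; the variable names are mine.

```r
# Estimates copied from the interaction model's coefficient table above
b_log_reviews <- 0.090458806   # slope of log_reviews when is_free = 0 (paid)
b_interaction <- -0.026585455  # change in that slope when is_free = 1

# Implied slope of log_reviews within each group
slope_paid <- b_log_reviews
slope_free <- b_log_reviews + b_interaction

cat("Paid slope:", round(slope_paid, 4),
    " Free slope:", round(slope_free, 4), "\n")
```

The free-app slope (about 0.064) is nearly identical to the 0.065 from the additive model, which makes sense given that free apps are over 90% of the data. The paid-app slope is a bit steeper at about 0.090, but with only 604 paid apps its standard error is too large to separate the two.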
A good-looking coefficient table is not enough; we need to actually check whether our model assumptions hold. Every regression analysis should be paired with diagnostics. Here are the 5 plots:
par(mfrow = c(2, 3))
plot(model2, which = 1:5)
par(mfrow = c(1, 1))
Plot 1: Residuals vs. Fitted
What I looked for: Residuals should be randomly scattered around the dashed line at 0, with no obvious pattern or curve.
What I saw: The residuals showed a slight downward curve, and at low fitted values (apps predicted to rate around 3.9-4.0) the residuals fan out more than they do at high fitted values. There’s also a visible upper boundary artifact, which makes sense because ratings are capped at 5.0: an observed rating can’t exceed the ceiling, so each residual is bounded above by 5 minus its fitted value. The overall pattern is flatter than last week’s single-predictor model, which is an improvement, but there’s still some non-randomness on the left side of the plot.
Severity: Mild. The curve is not dramatic, and the bulk of the data (fitted values 4.0-4.4) looks reasonably flat. This is worth noting as a limitation but doesn’t invalidate the model.
Plot 2: Normal Q-Q
What to look for: Points should fall close to the diagonal line. Heavy tails or systematic curves indicate non-normal residuals.
What I see: The upper tail deviates pretty noticeably and points drift above the line in the top-right corner. This is a classic left-skewed residual pattern, which we’d actually expect here since ratings are bounded at 5.0. The lower tail is fairly close to the line.
Severity: Moderate on the upper end. The Q-Q plot is super sensitive and not always the best starting point. The Residuals vs. Fitted plot is more informative for diagnosing structural issues. Given our n is well over 8,000, the Central Limit Theorem means the sampling distribution of the coefficient estimates is still approximately normal, so the t-tests and p-values remain trustworthy even with some non-normality in residuals. I wouldn’t throw out the model over this, but it’s a real limitation to acknowledge.
Plot 3: Scale-Location
What to look for: The red smoothed line should be roughly horizontal, and the spread of points should be constant across fitted values. This checks homoscedasticity.
What I saw: The line has a slight downward trend as fitted values increase, meaning the variance of residuals decreases a bit at higher predicted ratings. This is partly a ceiling effect again, as apps predicted to rate near 4.5 have less room to deviate upward. The spread is not wildly inconsistent, but it’s not perfectly flat either.
Severity: Mild to moderate. Equal variance is somewhat met in the middle of the distribution. The ceiling effect is a structural feature of a bounded response variable and not something adding more predictors would fix. Confidence in this assumption is moderate.
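A numeric companion to the Scale-Location plot is a Breusch-Pagan style test. The sketch below hand-rolls what I understand to be the studentized (Koenker) version in base R: regress the squared residuals on the predictors and compare n times that regression’s R² to a chi-square distribution. It runs on simulated data where the noise shrinks as x grows (loosely mimicking the ceiling effect), not on the Play Store data, so the numbers are illustrative only.

```r
set.seed(42)

# Simulated stand-in data (an assumption): residual spread shrinks as x grows
n <- 2000
x <- runif(n, 0, 6)                               # plays the role of log_reviews
y <- 4 + 0.06 * x + rnorm(n, sd = 0.8 - 0.1 * x)  # sd falls from 0.8 to 0.2

fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the same predictor
aux <- lm(resid(fit)^2 ~ x)

# LM statistic = n * R^2 of the auxiliary regression;
# degrees of freedom = number of predictors in the auxiliary regression
lm_stat <- n * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)

cat("LM statistic:", round(lm_stat, 2), " p-value:", signif(p_value, 3), "\n")
```

A small p-value points to non-constant variance. On the real model the same idea would use resid(model2) and all three predictors with df = 3, or lmtest::bptest(model2) if that package is installed. If the test does reject, heteroscedasticity-robust standard errors are one common follow-up.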
Plot 4: Cook’s Distance
What to look for: Large Cook’s Distance values (above 0.5 or 1.0) indicate influential observations that are disproportionately pulling the regression line.
What I see: No single observation has a Cook’s Distance anywhere near 0.5. The distances are consistently very small. This is actually great news: with over 8,000 rows, no individual app is dominating the fit.
Severity: Not a concern. The model is not being driven by outliers.
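The Cook’s Distance reading can also be checked numerically rather than by eye. This sketch uses base R’s cooks.distance() and hatvalues() on a simulated data set shaped loosely like the model above; the data, the sim object, and the 0.5 cutoff are assumptions for illustration, and the same calls should work on model2 directly.

```r
set.seed(1)

# Simulated stand-in for the Play Store data (not the real observations)
n <- 1000
sim <- data.frame(
  log_reviews = runif(n, 0, 6),
  is_free     = rbinom(n, 1, 0.9)
)
sim$Rating <- 4.1 + 0.065 * sim$log_reviews - 0.16 * sim$is_free +
  rnorm(n, sd = 0.5)

fit <- lm(Rating ~ log_reviews + is_free, data = sim)

# Influence measures from base R's stats package
cd  <- cooks.distance(fit)
lev <- hatvalues(fit)

max_cd    <- max(cd)
n_flagged <- sum(cd > 0.5)   # observations above the 0.5 rule of thumb

cat("Max Cook's D:", signif(max_cd, 3),
    " Flagged (> 0.5):", n_flagged,
    " Mean leverage:", signif(mean(lev), 3), "\n")

# The five most influential rows, for a closer look if needed
head(sort(cd, decreasing = TRUE), 5)
```

If n_flagged is zero and the largest distance is far below 0.5, that backs up the visual impression that no single observation is steering the fit.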
Plot 5: Residuals vs. Leverage
What to look for: High-leverage points (far right on the x-axis) that also have large residuals are the ones to worry about. Cook’s Distance contour lines (dashed curves) in the upper/lower right corners are a warning sign.
What I see: Leverage values are very small and the Cook’s Distance contour lines are not even visible on the plot, which means no points are simultaneously high-leverage and high-residual. The few points with higher leverage sit close to 0 on the residual axis.
Severity: Not a concern. This reinforces what we saw in Plot 4: there are no influential outliers to worry about.
| Term | Included? | Reason |
|---|---|---|
| log_reviews | ✓ Yes | Carried over from last week, still significant |
| is_free | ✓ Yes | Binary, meaningful group difference, low correlation with reviews |
| log_installs | ✗ No | Highly correlated with log_reviews (r ≈ 0.95) - multicollinearity |
| is_everyone | ✓ Yes | Kept to test last week’s ANOVA result; not significant once reviews and is_free are controlled (p = 0.64) |
| log_reviews * is_free | ✗ No | Interaction term not significant, simpler model preferred |
Diagnostic summary:
| Plot | Assumption Tested | Verdict |
|---|---|---|
| Residuals vs. Fitted | Linearity, zero mean residuals | Mostly met, mild curve on the left |
| Normal Q-Q | Normality of residuals | Upper tail deviates; acceptable given n > 8,000 |
| Scale-Location | Equal variance (homoscedasticity) | Mild ceiling effect, moderate concern |
| Cook’s Distance | Influential observations | No influential observations - clean |
| Residuals vs. Leverage | High-leverage influential points | No concerns |
Adding is_free improved the model modestly in fit (adjusted R² went from 3.27% to 3.83%) and in interpretive richness, while is_everyone turned out not to matter once reviews and free/paid status were controlled. That said, the ceiling at 5.0 stars is a fundamental constraint that keeps some diagnostic plots from looking perfect, and there’s clearly still a lot of variance in ratings that these three variables can’t explain. Future directions could include app category, price for paid apps, or update frequency - anything that speaks more to the actual quality of the product rather than just its audience.