LOQ

Last week I found that more reviews tend to push ratings up, but the R² was only 3.3% - which basically means log(reviews) alone is not really telling the whole story. That’s not surprising. One thing I kept wondering was whether free apps vs paid apps behave differently. My gut says free apps get harsher reviews because people hold them to a higher standard when they have nothing to lose by downloading. I also want to check whether the “Everyone” content label from Part 1 last week carries any extra weight once reviews are already in the model.

So this week, I’m keeping log_reviews from last week and trying to add three candidate variables: is_free, log_installs, and is_everyone.


Cleaning and Loading the Data

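# dplyr supplies %>%, filter(), mutate(); broom supplies tidy() and glance() used later
library(dplyr)
library(broom)

playstore <- read.csv("C:/Users/IU Student/Downloads/Data Dive_week 9_Regression Diagnostics/googleplaystore.csv", stringsAsFactors = FALSE)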

# Same cleaning pipeline as last week
playstore <- playstore %>%
  filter(!is.na(Rating), Rating <= 5, Rating >= 1) %>%
  filter(!is.na(Reviews)) %>%
  mutate(Reviews = as.numeric(Reviews)) %>%
  filter(!is.na(Reviews), Reviews > 0) %>%
  distinct(App, .keep_all = TRUE)

playstore <- playstore %>%
  mutate(log_reviews = log10(Reviews))

playstore <- playstore %>%
  mutate(is_free = ifelse(Type == "Free", 1, 0)) %>%
  filter(!is.na(is_free))

playstore <- playstore %>%
  mutate(is_everyone = ifelse(Content.Rating == "Everyone", 1, 0))

playstore <- playstore %>%
  mutate(Installs_clean = as.numeric(gsub("[+,]", "", Installs))) %>%
  filter(!is.na(Installs_clean), Installs_clean > 0) %>%
  mutate(log_installs = log10(Installs_clean))

cat("Clean rows:", nrow(playstore))
## Clean rows: 8196
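
To make the Installs cleaning step concrete, here is a minimal sketch of what the gsub() call does. The input strings are made up for illustration (they only imitate the style of the Installs column), and the name raw_installs is hypothetical.

```r
# Hypothetical raw values in the style of the Installs column (not real rows)
raw_installs <- c("1,000+", "5,000,000+", "10+", "not a number")

# Strip "+" and "," and convert; strings that are not numeric become NA
installs_clean <- suppressWarnings(as.numeric(gsub("[+,]", "", raw_installs)))
installs_clean

# The log10 transform used for log_installs
log10(installs_clean)
```

Rows that come out as NA are the ones the filter(!is.na(Installs_clean), ...) line drops.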

Deciding What to Include

Variable 1: is_free (Binary - included)

Free vs. paid is one of the most obvious splits in the Play Store. As Leon mentioned in the lecture, R treats a 0/1 variable as an ordinary numeric predictor, so its coefficient is simply the average difference in rating between free and paid apps with the other predictors held fixed. That is exactly what we want here. I’m expecting free apps to rate slightly lower on average because they attract more casual users who are quicker to leave a bad review.

playstore %>%
  group_by(is_free) %>%
  summarise(
    n = n(),
    mean_rating = round(mean(Rating), 3),
    sd_rating = round(sd(Rating), 3)
  )
## # A tibble: 2 × 4
##   is_free     n mean_rating sd_rating
##     <dbl> <int>       <dbl>     <dbl>
## 1       0   604        4.26     0.56 
## 2       1  7592        4.17     0.534

There is a small but real difference: paid apps average 4.26 stars against 4.17 for free apps. This is worth keeping in the model.


Variable 2: log_installs (Continuous - check for multicollinearity, exclude)

Installs and reviews are both measures of how popular an app is. An app with millions of installs almost certainly also has a lot of reviews. That’s the classic multicollinearity trap: two predictors measuring basically the same underlying thing. If I include both, the model can’t tell which one is doing the work, and my coefficients become unreliable.

cor(playstore$log_reviews, playstore$log_installs, use = "complete.obs")
## [1] 0.9529982

The correlation between log_reviews and log_installs is very high (about 0.95). That’s a red flag. As mentioned in the lecture, independent variables should not be strongly linearly correlated with each other because it messes with how you interpret the coefficients. This is exactly that situation - so log_installs is out.
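
One way to put a number on this is the variance inflation factor, VIF = 1 / (1 - R²), where R² comes from regressing one predictor on the others. The sketch below is not part of the original analysis: the first part uses only the correlation reported above, and the second part repeats the idea on synthetic data (x1 and x2 are made-up variables).

```r
# Correlation reported above between log_reviews and log_installs
r <- 0.9529982

# With only two predictors, the R^2 of one regressed on the other is r^2
vif_two_predictors <- 1 / (1 - r^2)
vif_two_predictors   # roughly 10.9

# Same calculation on synthetic data (assumed values, not the Play Store data)
set.seed(1)
x1 <- rnorm(500)
x2 <- 0.95 * x1 + sqrt(1 - 0.95^2) * rnorm(500)
r2_aux <- summary(lm(x1 ~ x2))$r.squared
vif_x1 <- 1 / (1 - r2_aux)
vif_x1
```

A VIF near 11 means the sampling variance of the coefficient would be about 11 times larger than with an uncorrelated predictor. Common rules of thumb flag values above 5 or 10, so dropping log_installs is consistent with that.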


Variable 3: is_everyone (Binary - include)

From last week’s ANOVA, I already know that “Everyone” apps rate lower on average than more targeted audiences. But does that effect hold once we control for reviews? Adding is_everyone as a binary lets me test whether audience breadth has an independent effect on rating beyond what engagement volume already explains.

playstore %>%
  group_by(is_everyone) %>%
  summarise(
    n = n(),
    mean_rating = round(mean(Rating), 3)
  )
## # A tibble: 2 × 3
##   is_everyone     n mean_rating
##         <dbl> <int>       <dbl>
## 1           0  1578        4.20
## 2           1  6618        4.17

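The gap between “Everyone” and the other content groups is small in the raw means (4.17 vs. 4.20). Including it still makes intuitive sense - it’s not just about engagement, it’s also about audience fit - and the regression will show whether the gap survives once reviews are controlled.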


Building the Model

The final model has 3 terms:

\[R_i = \beta_0 + \beta_1 \cdot \log_{10}(\text{Reviews}_i) + \beta_2 \cdot \text{is\_free}_i + \beta_3 \cdot \text{is\_everyone}_i + \varepsilon_i\]

model2 <- lm(Rating ~ log_reviews + is_free + is_everyone, data = playstore)
summary(model2)
## 
## Call:
## lm(formula = Rating ~ log_reviews + is_free + is_everyone, data = playstore)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.10267 -0.19874  0.06609  0.30073  1.06541 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.095669   0.027226 150.431  < 2e-16 ***
## log_reviews  0.065072   0.003723  17.478  < 2e-16 ***
## is_free     -0.161083   0.022577  -7.135 1.05e-12 ***
## is_everyone  0.006998   0.014971   0.467     0.64    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5262 on 8192 degrees of freedom
## Multiple R-squared:  0.03868,    Adjusted R-squared:  0.03833 
## F-statistic: 109.9 on 3 and 8192 DF,  p-value: < 2.2e-16
tidy(model2) %>%
  mutate(across(where(is.numeric), ~round(., 4)))
## # A tibble: 4 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)   4.10      0.0272   150.      0    
## 2 log_reviews   0.0651    0.0037    17.5     0    
## 3 is_free      -0.161     0.0226    -7.13    0    
## 4 is_everyone   0.007     0.015      0.467   0.640
glance(model2) %>%
  select(r.squared, adj.r.squared, sigma, statistic, p.value, df) %>%
  mutate(across(where(is.numeric), ~round(., 4)))
## # A tibble: 1 × 6
##   r.squared adj.r.squared sigma statistic p.value    df
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>
## 1    0.0387        0.0383 0.526      110.       0     3

Interpreting the coefficients:

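Each tenfold increase in review count (one unit of log_reviews) is associated with a rating about 0.065 stars higher, holding the other predictors fixed. Free apps rate about 0.16 stars lower than paid apps with the same review count and content label. The is_everyone coefficient (0.007, p = 0.64) is indistinguishable from zero, so the “Everyone” effect from last week’s ANOVA does not hold up once reviews and app type are in the model.

The adjusted R² has improved from 3.27% to 3.83%, which is modest but expected as ratings are noisy by nature. Two of the three predictors, log_reviews and is_free, are statistically significant; is_everyone is not.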
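
As a quick worked example of reading the coefficients, the sketch below plugs the estimates from the summary(model2) output into the model equation by hand. The two apps are hypothetical, and predict_rating is a helper name I made up for illustration.

```r
# Coefficients copied from the summary(model2) output above
b <- c(intercept = 4.095669, log_reviews = 0.065072,
       is_free = -0.161083, is_everyone = 0.006998)

# Hypothetical helper: fitted rating for a given review count and flags
predict_rating <- function(reviews, is_free, is_everyone) {
  unname(b["intercept"] + b["log_reviews"] * log10(reviews) +
           b["is_free"] * is_free + b["is_everyone"] * is_everyone)
}

# Two hypothetical "Everyone" apps with 10,000 reviews each
free_app <- predict_rating(10000, is_free = 1, is_everyone = 1)
paid_app <- predict_rating(10000, is_free = 0, is_everyone = 1)

free_app              # about 4.20
paid_app              # about 4.36
free_app - paid_app   # the is_free coefficient, -0.161
```

The gap between the two predictions is the is_free coefficient no matter what review count is plugged in, which is what an additive model implies.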


Checking for an Interaction Term

One thing worth considering: does the relationship between review volume and rating work the same way for free apps and paid apps? Maybe paid apps with high reviews are genuinely better products, while free apps accumulate reviews just from being widely available.

model_interaction <- lm(Rating ~ log_reviews * is_free + is_everyone, data = playstore)
summary(model_interaction)$coefficients
##                         Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)          4.033608148 0.04915514 82.058718 0.000000e+00
## log_reviews          0.090458806 0.01715077  5.274329 1.366486e-07
## is_free             -0.095257309 0.04892958 -1.946825 5.158990e-02
## is_everyone          0.007497276 0.01497296  0.500721 6.165809e-01
## log_reviews:is_free -0.026585455 0.01753238 -1.516363 1.294662e-01

The interaction term (log_reviews:is_free) turns out to be small (-0.027) and not statistically significant (p = 0.13). This tells me the slope of log_reviews doesn’t meaningfully change depending on whether an app is free or paid. I’ll leave the interaction out and keep the simpler additive model.
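
Another way to frame the same decision is a nested-model F-test with anova(). This is a sketch on simulated data, not the Play Store data: the data frame sim and its coefficients are assumptions chosen to resemble the fitted model, with no true interaction built in.

```r
set.seed(42)
n <- 2000

# Simulated predictors (assumed ranges, for illustration only)
sim <- data.frame(
  log_reviews = runif(n, 0, 7),
  is_free     = rbinom(n, 1, 0.9)
)

# Simulated ratings with NO true interaction between the two predictors
sim$Rating <- 4.1 + 0.065 * sim$log_reviews - 0.16 * sim$is_free +
  rnorm(n, sd = 0.5)

additive         <- lm(Rating ~ log_reviews + is_free, data = sim)
with_interaction <- lm(Rating ~ log_reviews * is_free, data = sim)

# Partial F-test: does adding the interaction reduce residual error enough?
cmp <- anova(additive, with_interaction)
cmp
```

When only one term is added, this F-test gives the same p-value as the t-test on that term in the coefficient table, so it matters more when an interaction adds several coefficients at once.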


Diagnostic Plots

A good-looking coefficient table is not enough: we need to actually check whether our model assumptions hold. Every regression analysis should be paired with diagnostics. Here are the 5 plots:

par(mfrow = c(2, 3))
plot(model2, which = 1:5)
par(mfrow = c(1, 1))


Plot 1 - Residuals vs. Fitted

What I looked for: Residuals should be randomly scattered around the dashed line at 0, with no obvious pattern or curve.

What I saw: The residuals showed a slight downward curve, and at low fitted values (apps predicted to rate around 3.9-4.0) they fan out more than they do at high fitted values. There’s also a visible upper boundary, which makes sense: ratings are capped at 5.0, so a residual can never exceed 5 minus the fitted value. The overall pattern is flatter than last week’s single-predictor model, which is an improvement, but there’s still some non-randomness on the left side of the plot.

Severity: Mild. The curve is not dramatic, and the bulk of the data (fitted values 4.0-4.4) looks reasonably flat. This is worth noting as a limitation but doesn’t invalidate the model.


Plot 2 - Normal Q-Q

What to look for: Points should fall close to the diagonal line. Heavy tails or systematic curves indicate non-normal residuals.

What I see: The upper tail deviates pretty noticeably, with points drifting away from the line in the top-right corner. This fits a left-skewed residual distribution, which we’d actually expect here since ratings are bounded at 5.0. The lower tail is fairly close to the line.

Severity: Moderate on the upper end. The Q-Q plot is very sensitive and not always the best starting point; the Residuals vs. Fitted plot is more informative for diagnosing structural issues. With n well over 8,000, the Central Limit Theorem makes the sampling distribution of the coefficient estimates approximately normal, so the t-tests on the coefficients are still trustworthy even with some non-normality in the residuals. I wouldn’t throw out the model over this, but it’s a real limitation to acknowledge.


Plot 3 - Scale-Location (Spread-Location)

What to look for: The red smoothed line should be roughly horizontal, and the spread of points should be constant across fitted values. This checks homoscedasticity.

What I saw: The line has a slight downward trend as fitted values increase, meaning the variance of residuals decreases a bit at higher predicted ratings. This is partly a ceiling effect again, as apps predicted to rate near 4.5 have less room to deviate upward. The spread is not wildly inconsistent, but it’s not perfectly flat either.

Severity: Mild to moderate. Equal variance is somewhat met in the middle of the distribution. The ceiling effect is a structural feature of a bounded response variable and not something adding more predictors would fix. Confidence in this assumption is moderate.
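
A numeric companion to this plot is a Breusch-Pagan style test. The sketch below is not something I ran on model2: it computes Koenker’s studentized version by hand in base R on synthetic data whose noise is deliberately built to shrink as the predictor grows.

```r
set.seed(7)
n <- 1000
x <- runif(n, 0, 7)

# Synthetic response whose noise shrinks as x grows (assumed for illustration)
y <- 4 + 0.07 * x + rnorm(n, sd = 0.7 - 0.05 * x)

fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the model's predictor
aux <- lm(resid(fit)^2 ~ x)

# Koenker's studentized statistic: n * R^2 of the auxiliary regression,
# compared with a chi-square whose df is the number of predictors (1 here)
lm_stat <- n * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)
c(statistic = lm_stat, p_value = p_value)
```

A small p-value says the residual variance changes with the predictors. With 8,000+ rows even a mild ceiling effect would probably register, so the plot remains the better guide to how much it matters.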


Plot 4 - Cook’s Distance

What to look for: Large Cook’s Distance values (above 0.5 or 1.0) indicate influential observations that are disproportionately pulling the regression line.

What I see: No single observation has a Cook’s Distance anywhere near 0.5. The distances are consistently very small. This is actually great news: with over 8,000 rows, no individual app is dominating the fit.

Severity: Not a concern. The model is not being driven by outliers.
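
The same check can be done numerically instead of by eye. This is a minimal sketch on synthetic data (the fit object here is a stand-in, not model2) using base R’s cooks.distance().

```r
set.seed(3)
n <- 500

# Synthetic stand-in for the rating model (assumed values)
x <- rnorm(n)
y <- 4 + 0.1 * x + rnorm(n, sd = 0.5)
fit <- lm(y ~ x)

cd <- cooks.distance(fit)

max(cd)                               # largest influence of any single row
sum(cd > 0.5)                         # rows above the 0.5 threshold from the text
head(sort(cd, decreasing = TRUE), 3)  # the most influential rows
```

Swapping fit for model2 would give the exact maximum behind the plot, which is easier to report than a visual impression.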


Plot 5 - Residuals vs. Leverage

What to look for: High-leverage points (far right on the x-axis) that also have large residuals are the ones to worry about. Cook’s Distance contour lines (dashed curves) in the upper/lower right corners are a warning sign.

What I see: Leverage values are very small and the Cook’s Distance contour lines are not even visible on the plot, which means no points are simultaneously high-leverage and high-residual. The few points with higher leverage sit close to 0 on the residual axis.

Severity: Not a concern. This reinforces what we saw in Plot 4: there are no influential outliers to worry about.
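
Small leverage values are what a data set this size should produce. Leverages always sum to the number of coefficients p, so the average is p / n, which for model2 (4 coefficients, 8,196 rows) is about 0.0005. The sketch below shows this on synthetic data with hatvalues(). The 2p/n cutoff is a common rule of thumb, not something from the lecture.

```r
set.seed(3)
n <- 500

# Synthetic stand-in model (assumed values, not the Play Store data)
x <- rnorm(n)
y <- 4 + 0.1 * x + rnorm(n, sd = 0.5)
fit <- lm(y ~ x)

h <- hatvalues(fit)       # leverage of each observation
p <- length(coef(fit))    # number of coefficients, intercept included

sum(h)                    # equals p
mean(h)                   # equals p / n
sum(h > 2 * p / n)        # rows flagged by the 2p/n rule of thumb
```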


Summary

| Term | Included? | Reason |
|---|---|---|
| log_reviews | ✓ Yes | Carried over from last week, still significant |
| is_free | ✓ Yes | Binary, meaningful group difference, low correlation with reviews |
| log_installs | ✗ No | Highly correlated with log_reviews (r ≈ 0.95) - multicollinearity |
| is_everyone | ✓ Yes | Kept to test last week’s ANOVA result; not significant once reviews and is_free are controlled (p = 0.64) |
| log_reviews * is_free | ✗ No | Interaction term not significant, simpler model preferred |

Diagnostic summary:

| Plot | Assumption Tested | Verdict |
|---|---|---|
| Residuals vs. Fitted | Linearity, zero mean residuals | Mostly met, mild curve on the left |
| Normal Q-Q | Normality of residuals | Upper tail deviates; acceptable given n > 8,000 |
| Scale-Location | Equal variance (homoscedasticity) | Mild ceiling effect, moderate concern |
| Cook’s Distance | Influential observations | No influential observations - clean |
| Residuals vs. Leverage | High-leverage influential points | No concerns |

Adding is_free improved the model both in fit and in interpretive richness, while is_everyone turned out to add essentially nothing once reviews were in the model. The overall gain is small: adjusted R² moved from 3.27% to 3.83%. The ceiling at 5.0 stars is a fundamental constraint that keeps some diagnostic plots from looking perfect, and there’s clearly still a lot of variance in ratings that these three variables can’t explain. Future directions could include app category, price for paid apps, or update frequency: anything that speaks more to the actual quality of the product rather than just its audience.