03/09/2026Every time I download an app, the first thing I do before reading descriptions or anything else is look at the star rating. I’ve passed on apps with incredible functionality simply because they sat below 4.0. That instinct made me curious what actually drives a high app rating? Is it the type of audience you’re building for? Is it the sheer number of people engaging with and reviewing the app?
This notebook uses the Google Play Store dataset to explore two angles:
df_raw <- read_csv("C:/Users/IU Student/Downloads/googleplaystore.csv/data_dive_anova_regression/googleplaystore.csv",
show_col_types = FALSE)
df <- df_raw %>%
distinct(App, .keep_all = TRUE) %>%
filter(!is.na(Rating), Rating >= 1, Rating <= 5) %>%
mutate(Reviews = as.numeric(Reviews)) %>%
filter(!is.na(Reviews), Reviews > 0) %>%
filter(`Content Rating` %in% c("Everyone", "Everyone 10+", "Teen",
"Mature 17+", "Adults only 18+")) %>%
mutate(
log_reviews = log10(Reviews),
content_rating = factor(`Content Rating`,
levels = c("Everyone", "Everyone 10+",
"Teen", "Mature 17+",
"Adults only 18+"))
)
cat("Clean rows:", nrow(df), "\n")
## Clean rows: 8195
cat("Unique Content Rating levels:", nlevels(df$content_rating), "\n")
## Unique Content Rating levels: 5
After removing duplicates rows, I’m working with 8195 apps which is a pretty solid sample for both ANOVA and regression.
Let \(R_i\) denote the star rating of app \(i\), where \(R_i \in [1, 5]\). This is the single number that a potential user sees before anything else — it collapses thousands of user experiences into one signal. For developers, it influences downloads and for the users, it means quality. It is the most informationdense column in this dataset.
p1 <- df %>%
plot_ly(
x = ~Rating,
type = "histogram",
nbinsx = 40,
marker = list(
color = ~Rating,
colorscale = list(
c(0, "#FF6B6B"),
c(0.25, "#FFA500"),
c(0.5, "#FFD700"),
c(0.75, "#90EE90"),
c(1, "#2ECC71")
),
line = list(color = "white", width = 0.5)
),
hovertemplate = "Rating: %{x}<br>Count: %{y}<extra></extra>"
) %>%
layout(
title = list(text = "Distribution of App Ratings on Google Play Store",
font = list(size = 18)),
xaxis = list(title = "Star Rating (R)", dtick = 0.5),
yaxis = list(title = "Number of Apps"),
plot_bgcolor = "#F9F9F9",
paper_bgcolor = "#F9F9F9",
font = list(family = "Arial")
)
p1
What this shows App ratings are heavily left-skewed, between 4.0 and 4.8 which means most apps on the Play Store present themselves as high quality. The mean sits around 4.2, which means a 3.8 app already feels below average even though it’s technically above the midpoint of the scale. This makes the variation between groups more interesting, not less because even small differences in means can matter a lot.
My instinct I grew up using apps across every category, and I noticed that adult oriented apps often felt more niche and passion built, while “everyone” apps competed in a brutal mass market where generic quality could drag ratings down. I wanted to test whether that perception holds statistically.
Let \(C_j\) denote Content Rating group \(j\), with levels:
\[C_j \in \{\text{Everyone},\ \text{Everyone 10+},\ \text{Teen},\ \text{Mature 17+},\ \text{Adults only 18+}\}\]
\[H_0: \mu_{\text{Everyone}} = \mu_{\text{Everyone 10+}} = \mu_{\text{Teen}} = \mu_{\text{Mature 17+}} = \mu_{\text{Adults only 18+}}\]
\[H_a: \text{At least one group mean is significantly different from the others}\]
What this means is \(H_0\) says knowing an app’s content rating tells you nothing about its expected star rating. \(H_a\) says at least one audience type produces systematically higher or lower ratings.
color_palette <- c(
"Everyone" = "#3498DB",
"Everyone 10+" = "#9B59B6",
"Teen" = "#E74C3C",
"Mature 17+" = "#E67E22",
"Adults only 18+" = "#1ABC9C"
)
group_stats <- df %>%
group_by(content_rating) %>%
summarise(
n = n(),
mean = round(mean(Rating), 3),
sd = round(sd(Rating), 3),
.groups = "drop"
)
p2 <- df %>%
plot_ly(
y = ~Rating,
x = ~content_rating,
type = "violin",
color = ~content_rating,
colors = color_palette,
box = list(visible = TRUE),
meanline = list(visible = TRUE, color = "white", width = 2),
hovertemplate = paste(
"<b>%{x}</b><br>Rating: %{y:.2f}<extra></extra>"
)
) %>%
layout(
title = list(text = "App Ratings by Content Rating Group",
font = list(size = 18)),
xaxis = list(title = "Content Rating Group"),
yaxis = list(title = "Star Rating (R)", range = c(1, 5)),
showlegend = FALSE,
plot_bgcolor = "#F9F9F9",
paper_bgcolor = "#F9F9F9",
font = list(family = "Arial")
)
p2
What this means The violin shapes reveal not just where medians land, but the shape of each group’s distribution. “Everyone” apps show a wide spread, that mass-market pressure I suspected, while “Adults only 18+” apps cluster tightly at the top. The white line cutting through each violin marks the group mean you can already see meaningful differecnes, especially between “Everyone” and the more targeted age groups. Hover over any violin to inspect individual rating values.
group_stats %>%
rename(
`Content Rating` = content_rating,
`N` = n,
`Mean Rating` = mean,
`Std Dev` = sd
) %>%
kbl(caption = "Descriptive Statistics by Content Rating Group") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)
| Content Rating | N | Mean Rating | Std Dev |
|---|---|---|---|
| Everyone | 6618 | 4.166 | 0.559 |
| Everyone 10+ | 305 | 4.226 | 0.399 |
| Teen | 912 | 4.226 | 0.404 |
| Mature 17+ | 357 | 4.122 | 0.509 |
| Adults only 18+ | 3 | 4.300 | 0.436 |
anova_model <- aov(Rating ~ content_rating, data = df)
anova_tidy <- tidy(anova_model)
kbl(anova_tidy, digits = 4,
caption = "ANOVA Table: Rating ~ Content Rating") %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE)
| term | df | sumsq | meansq | statistic | p.value |
|---|---|---|---|---|---|
| content_rating | 4 | 4.6597 | 1.1649 | 4.0509 | 0.0028 |
| Residuals | 8190 | 2355.2171 | 0.2876 | NA | NA |
f_stat <- anova_tidy$statistic[1]
p_val <- anova_tidy$p.value[1]
df1 <- anova_tidy$df[1]
df2 <- anova_tidy$df[2]
x_seq <- seq(0, max(f_stat * 1.5, 10), length.out = 500)
f_dens <- df(x_seq, df1 = df1, df2 = df2)
crit_val <- qf(0.95, df1 = df1, df2 = df2)
p3 <- plot_ly() %>%
add_lines(x = x_seq, y = f_dens,
line = list(color = "#3498DB", width = 2),
name = "F-Distribution",
hovertemplate = "F = %{x:.2f}<br>Density = %{y:.4f}<extra></extra>") %>%
add_ribbons(
x = x_seq[x_seq >= crit_val],
ymin = 0,
ymax = f_dens[x_seq >= crit_val],
fillcolor = "rgba(231, 76, 60, 0.4)",
line = list(color = "transparent"),
name = "Rejection Region (α=0.05)"
) %>%
add_segments(
x = f_stat, xend = f_stat,
y = 0, yend = max(f_dens) * 0.8,
line = list(color = "#E74C3C", dash = "dash", width = 3),
name = paste0("Observed F = ", round(f_stat, 2))
) %>%
layout(
title = list(text = paste0("F-Distribution (df₁=", df1, ", df₂=", df2,
") | Observed F = ", round(f_stat, 2),
" | p = ", format(p_val, scientific = TRUE, digits = 3)),
font = list(size = 16)),
xaxis = list(title = "F-Statistic"),
yaxis = list(title = "Density"),
plot_bgcolor = "#F9F9F9",
paper_bgcolor = "#F9F9F9",
legend = list(orientation = "h", y = -0.2),
font = list(family = "Arial")
)
p3
What this means The red dashed line is our observed F-statistic, sitting so far to the right that it barely fits on the chart. The shaded red zone is where we’d reject \(H_0\) at α = 0.05, our F-statistic blows past it entirely. A p-value this small means that if content rating had no relationship with ratings at all, it would not be possible to observe this pattern by random chance.
cat(paste0(
"F-statistic: ", round(f_stat, 2), "\n",
"p-value: ", format(p_val, scientific = TRUE, digits = 3), "\n",
"Decision: Reject H₀ at α = 0.05\n"
))
## F-statistic: 4.05
## p-value: 2.77e-03
## Decision: Reject H₀ at α = 0.05
Conclusion: So what does this mean? There is extremely strong evidence to reject \(H_0\). The F-statistic of 4.05 with a p-value of essentially zero tells us that Content Rating group does influence app ratings. At least one group’s mean rating is statistically different from the others.
What this means for developers: There is enough evidence to conclude that the audience you target matters for your app’s rating. Apps aimed at more specific audiences (Teen, Mature 17+, Adults only 18+) tend to receive higher ratings than the broad “Everyone” category. This makes intuitive sense as niche audiences are more engaged and more forgiving of minor imperfections in an app built exactly for them. If you’re building a general purpose tool competing in the “Everyone” tier, you’re fighting a ratings battle on a harder difficulty.
The number of reviews an app has received is like a proxy for its community engagement and survival time on the platform. I thought about this the way I think about study groups which is the bigger the group that keeps showing up and commenting, the more the core material (the app) is doing something right. Let me define:
\[\text{Let } L_i = \log_{10}(\text{Reviews}_i)\]
I use the log transform because review counts span from 1 to tens of millions — a raw count is too skewed to show a clean linear relationship. Full disclosure - I used AI to help me calculate this as I was not sure if this was a legitimate way to calculate review count. After the transform, each 1-unit increase in \(L_i\) represents a 10× increase in reviews which is a much more interpretable scale.
p4 <- df %>%
plot_ly(
x = ~log_reviews,
y = ~Rating,
type = "scatter",
mode = "markers",
color = ~content_rating,
colors = color_palette,
marker = list(size = 4, opacity = 0.5),
text = ~paste0("<b>", App, "</b><br>",
"Rating: ", Rating, "<br>",
"Reviews: ", formatC(Reviews, format = "d", big.mark = ","), "<br>",
"log₁₀(Reviews): ", round(log_reviews, 2)),
hovertemplate = "%{text}<extra></extra>"
) %>%
layout(
title = list(text = "App Rating vs. log₁₀(Reviews) — colored by Content Rating",
font = list(size = 17)),
xaxis = list(title = "log₁₀(Reviews) — L"),
yaxis = list(title = "Star Rating (R)", range = c(1, 5)),
plot_bgcolor = "#F9F9F9",
paper_bgcolor = "#F9F9F9",
font = list(family = "Arial"),
legend = list(title = list(text = "Content Rating"))
)
p4
What this means There is a upward trend which means apps with more reviews do tend to rate higher. The relationship is noisy (as I had expected from organic user behavior), but the direction is positive and roughly linear on the log scale. Color reveals that all content groups share this upward tendency. Hover over any point to see the actual app name, raw review count, and rating.
The regression model is:
\[\hat{R}_i = \beta_0 + \beta_1 \cdot L_i + \varepsilon_i\]
Where: - \(\hat{R}_i\) = predicted rating for app \(i\) - \(\beta_0\) = the expected rating when \(L_i = 0\) (i.e., an app with exactly 1 review) - \(\beta_1\) = the increase in predicted rating for each 10× increase in review count - \(\varepsilon_i \sim \mathcal{N}(0, \sigma^2)\) = normally distributed errors
lm_model <- lm(Rating ~ log_reviews, data = df)
lm_tidy <- tidy(lm_model)
lm_glance <- glance(lm_model)
kbl(lm_tidy, digits = 4,
caption = "Linear Regression: Rating ~ log₁₀(Reviews)") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 3.9683 | 0.0136 | 291.1222 | 0 |
| log_reviews | 0.0603 | 0.0036 | 16.6372 | 0 |
kbl(lm_glance %>% select(r.squared, adj.r.squared, sigma, statistic, p.value, df),
digits = 4,
caption = "Model Fit Statistics") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| r.squared | adj.r.squared | sigma | statistic | p.value | df |
|---|---|---|---|---|---|
| 0.0327 | 0.0326 | 0.5278 | 276.7967 | 0 | 1 |
x_range <- seq(min(df$log_reviews), max(df$log_reviews), length.out = 200)
pred_df <- tibble(log_reviews = x_range,
predicted = predict(lm_model,
newdata = data.frame(log_reviews = x_range)),
se = predict(lm_model,
newdata = data.frame(log_reviews = x_range),
se.fit = TRUE)$se.fit)
pred_df <- pred_df %>%
mutate(
ci_low = predicted - 1.96 * se,
ci_high = predicted + 1.96 * se
)
p5 <- plot_ly() %>%
add_markers(
data = df,
x = ~log_reviews,
y = ~Rating,
color = ~content_rating,
colors = color_palette,
marker = list(size = 4, opacity = 0.35),
text = ~paste0("<b>", App, "</b><br>",
"Rating: ", Rating, "<br>",
"log₁₀(Reviews): ", round(log_reviews, 2)),
hovertemplate = "%{text}<extra></extra>",
showlegend = TRUE,
name = ~content_rating
) %>%
add_ribbons(
data = pred_df,
x = ~log_reviews,
ymin = ~ci_low,
ymax = ~ci_high,
fillcolor = "rgba(52, 73, 94, 0.2)",
line = list(color = "transparent"),
name = "95% Confidence Interval",
hoverinfo = "skip"
) %>%
add_lines(
data = pred_df,
x = ~log_reviews,
y = ~predicted,
line = list(color = "#2C3E50", width = 3),
name = "Fitted Regression Line",
hovertemplate = "log₁₀(Reviews): %{x:.2f}<br>Predicted Rating: %{y:.3f}<extra></extra>"
) %>%
layout(
title = list(
text = paste0("Rating = ",
round(coef(lm_model)[1], 3), " + ",
round(coef(lm_model)[2], 3),
" × log₁₀(Reviews) | R² = ",
round(lm_glance$r.squared, 3)),
font = list(size = 16)
),
xaxis = list(title = "log₁₀(Reviews) — L"),
yaxis = list(title = "Star Rating (R)", range = c(1, 5.2)),
plot_bgcolor = "#F9F9F9",
paper_bgcolor = "#F9F9F9",
font = list(family = "Arial"),
legend = list(title = list(text = "Content Rating"), y = 0.5)
)
p5
What this means The black regression line cuts through the point cloud with a upward slope, supported by a 95% confidence band (the gray ribbon). The band narrows toward the middle of the data where most apps live, so the estimates are the most reliable there. The slope is modest but statistically still important. This is because every 10× jump in review count is associated with a big bump in predicted rating.
df_resid <- df %>%
mutate(
fitted = fitted(lm_model),
residual = residuals(lm_model)
)
p6 <- plot_ly(
data = df_resid,
x = ~fitted,
y = ~residual,
type = "scatter",
mode = "markers",
color = ~content_rating,
colors = color_palette,
marker = list(size = 4, opacity = 0.4),
text = ~paste0("<b>", App, "</b><br>",
"Fitted: ", round(fitted, 3), "<br>",
"Residual: ", round(residual, 3)),
hovertemplate = "%{text}<extra></extra>"
) %>%
add_lines(
x = c(min(df_resid$fitted), max(df_resid$fitted)),
y = c(0, 0),
line = list(color = "black", dash = "dash", width = 2),
showlegend = FALSE,
hoverinfo = "skip",
inherit = FALSE # <-- add this line
) %>%
layout(
title = list(text = "Residuals vs. Fitted Values", font = list(size = 17)),
xaxis = list(title = "Fitted Values (R̂)"),
yaxis = list(title = "Residuals (ε)"),
plot_bgcolor = "#F9F9F9",
paper_bgcolor = "#F9F9F9",
font = list(family = "Arial"),
legend = list(title = list(text = "Content Rating"))
)
p6
What this means The residuals are roughly scattered around 0 (the dashed line) across all fitted values and this is exactly what we would want, and it confirms that the model isn’t systematically over or under-predicting for any range of ratings. The spread is relatively consistent, which is a good sign for the equal-variance assumption. The slight vertical banding is a natural artifact of the rating scale being discrete (1.0, 1.5, 2.0, etc.).
b0 <- round(coef(lm_model)[1], 4)
b1 <- round(coef(lm_model)[2], 4)
r2 <- round(lm_glance$r.squared, 4)
cat(paste0(
"Estimated Model:\n",
" R̂ = ", b0, " + ", b1, " × L\n\n",
"Intercept (β₀ = ", b0, "):\n",
" An app with just 1 review (L = log₁₀(1) = 0) would be predicted\n",
" to have a rating of ", b0, " stars.\n\n",
"Slope (β₁ = ", b1, "):\n",
" For every 10× increase in reviews, the predicted rating rises by ", b1, " stars.\n",
" So going from 100 → 100,000 reviews (a 1000× increase = 3 log units)\n",
" translates to a predicted rating bump of ~", round(b1 * 3, 3), " stars.\n\n",
"R² = ", r2, ":\n",
" About ", round(r2 * 100, 1), "% of rating variance is explained by log(Reviews) alone.\n"
))
## Estimated Model:
## R̂ = 3.9683 + 0.0603 × L
##
## Intercept (β₀ = 3.9683):
## An app with just 1 review (L = log₁₀(1) = 0) would be predicted
## to have a rating of 3.9683 stars.
##
## Slope (β₁ = 0.0603):
## For every 10× increase in reviews, the predicted rating rises by 0.0603 stars.
## So going from 100 → 100,000 reviews (a 1000× increase = 3 log units)
## translates to a predicted rating bump of ~0.181 stars.
##
## R² = 0.0327:
## About 3.3% of rating variance is explained by log(Reviews) alone.
What this means for developers The model confirms that review volume is a meaningful signal. An app sitting at 10,000 reviews (\(L = 4\)) is predicted to rate:
\[\hat{R} = 3.9683 + 0.0603 \times 4 = 4.21\ \text{stars}\]
Compared to an app with only 100 reviews (\(L = 2\)):
\[\hat{R} = 3.9683 + 0.0603 \times 2 = 4.089\ \text{stars}\]
That’s a difference of 0.121 predicted stars as small in absolute terms, but meaningful in a world where users filter by “4.0+” or “4.5+”. The takeaway is not that getting more reviews directly causes a higher rating, but rather that apps with sustained engagement communities tend to self correct and improve over time, earning better scores as a byproduct.
| Question | Test | Conclusion |
|---|---|---|
| Does content rating group affect star rating? | One-way ANOVA | Yes — F = 4.05, p ≈ 0. Reject \(H_0\). |
| Does log(reviews) predict star rating? | Simple Linear Regression | Yes — \(\beta_1 = 0.0603\), significant. R² = 0.0327. |
Key takeaways: - Targeting a specific audience (Teen, Mature, Adults only) is associated with meaningfully better ratings than competing in the “Everyone” mass market. - Apps with more reviews tend to rate higher — the relationship is statistically real, though review count alone only explains about 3.3% of rating variance. There’s plenty left to explore (category, price, app size, update frequency). - Both findings suggest that community depth matters more than breadth on the Play Store. A smaller, devoted audience that reviews consistently beats a massive, indifferent one.