What actually separates a paid app from a free one? This question felt genuinely interesting to me, not just as a modeling exercise, but as a reflection of developer strategy and user behavior in the Google Play ecosystem. My hypothesis going in was that higher-rated apps might be more likely to be paid (developers charging for quality), while apps with massive review counts, which often indicate free, viral downloads, would skew free.
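I decided to model whether an app is paid (1 = Paid, 0 = Free) using three explanatory variables: Rating (the app’s average user rating), log_Reviews (the log of the review count plus one), and is_mature (1 if the content rating is Mature 17+ or Adults only 18+, 0 otherwise).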
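# Load packages used throughout (dplyr for wrangling, ggplot2 for plots,
# scales for percent_format() in Figure 4)
library(dplyr)
library(ggplot2)
library(scales)

# Load data
df_raw <- read.csv("googleplaystore.csv", stringsAsFactors = FALSE)

# Keep Free/Paid apps with valid ratings, then build the model variables
df <- df_raw %>%
  filter(Type %in% c("Free", "Paid")) %>%
  mutate(
    Reviews = suppressWarnings(as.numeric(Reviews)),
    Rating  = suppressWarnings(as.numeric(Rating))
  ) %>%
  filter(!is.na(Rating), !is.na(Reviews), Rating <= 5) %>%
  mutate(
    is_paid     = as.integer(Type == "Paid"),
    log_Reviews = log(Reviews + 1),
    is_mature   = as.integer(Content.Rating %in% c("Mature 17+", "Adults only 18+"))
  )

cat("Final dataset dimensions:", nrow(df), "rows x", ncol(df), "columns\n")
## Final dataset dimensions: 9366 rows x 16 columns
cat("Paid apps:", sum(df$is_paid == 1), "| Free apps:", sum(df$is_paid == 0), "\n")
## Paid apps: 647 | Free apps: 8719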
After filtering, I have 9,366 apps: 647 paid and 8,719 free. The class imbalance is notable (about 7% paid), and I’ll keep it in mind when interpreting results. I chose not to downsample because I’m after inference, not prediction accuracy.
df %>%
  mutate(App_Type = ifelse(is_paid == 1, "Paid", "Free")) %>%
  ggplot(aes(x = App_Type, y = Rating, fill = App_Type)) +
  geom_violin(trim = FALSE, alpha = 0.75, color = NA) +
  geom_boxplot(width = 0.12, fill = "white", color = "grey30",
               outlier.shape = 21, outlier.size = 1.2,
               outlier.fill = "grey60", outlier.alpha = 0.4) +
  scale_fill_manual(values = c("Free" = "#5B9BD5", "Paid" = "#E8804A")) +
  scale_y_continuous(breaks = 1:5) +
  labs(
    title = "App Ratings by Pricing Type",
    subtitle = "Paid apps have a slightly higher median but the gap is modest",
    x = NULL,
    y = "Average Rating",
    fill = NULL
  ) +
  theme_minimal(base_size = 13) +
  theme(
    legend.position = "none",
    panel.grid.major.x = element_blank(),
    plot.title = element_text(face = "bold"),
    axis.text.x = element_text(size = 13, face = "bold")
  )

Figure 1: Rating distributions for Free vs. Paid apps. Paid apps show a subtly higher median rating, though both distributions are left-skewed.
Both distributions are left-skewed, since most apps are rated between 4.0 and 5.0. Paid apps do appear to have a slightly higher center, which already hints that rating may be a positive predictor of being paid.
df %>%
  mutate(App_Type = ifelse(is_paid == 1, "Paid", "Free")) %>%
  ggplot(aes(x = log_Reviews, fill = App_Type, color = App_Type)) +
  geom_density(alpha = 0.55, size = 0.8) +
  geom_rug(data = . %>% filter(App_Type == "Paid"),
           aes(x = log_Reviews), color = "#E8804A",
           alpha = 0.3, length = unit(0.04, "npc")) +
  scale_fill_manual(values = c("Free" = "#5B9BD5", "Paid" = "#E8804A")) +
  scale_color_manual(values = c("Free" = "#3a7ab8", "Paid" = "#c05e28")) +
  labs(
    title = "Distribution of log(Reviews + 1) by App Type",
    subtitle = "Paid apps (orange rug ticks) concentrate at lower review counts",
    x = "log(Reviews + 1)",
    y = "Density",
    fill = NULL, color = NULL
  ) +
  theme_minimal(base_size = 13) +
  theme(
    legend.position = c(0.1, 0.88),
    plot.title = element_text(face = "bold"),
    panel.grid.minor = element_blank()
  )

Figure 2: Density of log-transformed review counts. Paid apps cluster at lower log-review values, suggesting smaller but potentially more dedicated user bases.
The density plot makes the story clear. Free apps have a wide bimodal review distribution, while paid apps cluster heavily at lower log review counts. This supports my intuition that popularity (proxied by reviews) is strongly associated with being free.
As Leon said in the lecture, when our response variable is binary, we can’t just fit a straight line through it. Instead, we use the sigmoid function to model probability, and its inverse (the logit, or log-odds function) becomes our linear component. That is exactly the structure I’m using here.
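To make that relationship concrete, here is a minimal sketch in base R showing that the sigmoid and the logit undo each other. The helper names `sigmoid` and `logit` are my own, not from the lecture; `plogis()` and `qlogis()` are the equivalent built-ins in base R.

```r
# Sigmoid maps any real number (a log-odds value) to a probability in (0, 1)
sigmoid <- function(eta) 1 / (1 + exp(-eta))

# Logit is its inverse: probability -> log-odds
logit <- function(p) log(p / (1 - p))

p   <- c(0.05, 0.50, 0.90)
eta <- logit(p)

# Applying sigmoid after logit should return the original probabilities
round_trip_ok <- isTRUE(all.equal(sigmoid(eta), p))

# The hand-written helpers should agree with base R's plogis()/qlogis()
matches_base_r <- isTRUE(all.equal(sigmoid(eta), plogis(eta))) &&
  isTRUE(all.equal(logit(p), qlogis(p)))

print(data.frame(p = p, log_odds = eta))
print(c(round_trip_ok = round_trip_ok, matches_base_r = matches_base_r))
```

The model is linear on the log-odds scale, and the sigmoid is what turns that linear predictor back into a probability.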
model <- glm(is_paid ~ Rating + log_Reviews + is_mature,
             data = df,
             family = binomial(link = "logit"))
summary(model)
##
## Call:
## glm(formula = is_paid ~ Rating + log_Reviews + is_mature, family = binomial(link = "logit"),
##     data = df)
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.87135    0.33262  -8.633  < 2e-16 ***
## Rating       0.42805    0.07585   5.643 1.67e-08 ***
## log_Reviews -0.21246    0.01189 -17.867  < 2e-16 ***
## is_mature   -0.54714    0.25481  -2.147   0.0318 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 4706.4  on 9365  degrees of freedom
## Residual deviance: 4317.3  on 9362  degrees of freedom
## AIC: 4325.3
##
## Number of Fisher Scoring iterations: 6
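All three coefficients are statistically significant at the 5% level, although is_mature only clears that bar narrowly (p = 0.032), while Rating and log_Reviews are far below it. The model converged cleanly, in 6 Fisher scoring iterations.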
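Written out with the estimates from the summary, rounded to three decimals, the fitted model on the log-odds scale looks like this. The symbol $\hat{p}$ for the estimated probability that an app is paid is my own notation, not part of the R output.

```latex
\[
\log\!\left(\frac{\hat{p}}{1-\hat{p}}\right)
  = -2.871
  + 0.428\,\text{Rating}
  - 0.212\,\log(\text{Reviews}+1)
  - 0.547\,\text{is\_mature}
\]
```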
To interpret logistic regression coefficients, I exponentiate them to get odds ratios as these tell me how the odds of an app being paid change with each predictor.
coef_table <- data.frame(
  Variable   = c("Intercept", "Rating", "log_Reviews", "is_mature"),
  Estimate   = coef(model),
  Std_Error  = summary(model)$coefficients[, "Std. Error"],
  Odds_Ratio = exp(coef(model))
)
coef_table$OR_Lower <- exp(coef_table$Estimate - 1.96 * coef_table$Std_Error)
coef_table$OR_Upper <- exp(coef_table$Estimate + 1.96 * coef_table$Std_Error)
coef_table[, -1] <- round(coef_table[, -1], 4)
coef_table
##                Variable Estimate Std_Error Odds_Ratio OR_Lower OR_Upper
## (Intercept)   Intercept  -2.8714    0.3326     0.0566   0.0295   0.1087
## Rating           Rating   0.4281    0.0759     1.5343   1.3223   1.7802
## log_Reviews log_Reviews  -0.2125    0.0119     0.8086   0.7900   0.8277
## is_mature     is_mature  -0.5471    0.2548     0.5786   0.3511   0.9534
Here is what each coefficient tells me:
Rating (β̂ = 0.428, OR ≈ 1.53):
For every one-star increase in an app’s average rating, the odds of it
being a paid app increase by about 53%, holding all
else constant. This is the most substantively interesting finding to me
as it suggests that paid apps genuinely do tend to be higher-quality (as
rated by users), or at least that developers who charge for their apps
are incentivized to maintain higher quality to justify the price.
log_Reviews (β̂ = −0.213, OR ≈ 0.81):
For every one-unit increase in log(Reviews + 1), roughly corresponding to
a 2.7× increase in raw review count, the odds of an app being paid
decrease by about 19%. This makes intuitive sense:
viral, free apps accumulate massive review volumes because they face no
download barrier. Paid apps have a smaller, more selective user
base.
is_mature (β̂ = −0.547, OR ≈ 0.58):
Mature-rated apps (17+ or 18+) have odds of being paid that are about
42% lower than non-mature apps, all else equal. This
surprised me initially, as I expected niche mature content to command a
price. One possible explanation is that adult content platforms often
use free-with-ads or freemium models to maximize reach before monetizing
through subscriptions outside the Play Store.
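The percentages quoted above come straight from the odds ratios: the percent change in odds is (OR − 1) × 100. Here is a small sketch of that arithmetic, using the coefficients as printed in the model summary (the variable names are mine). As an extra illustration, it also rescales the log_Reviews effect to a doubling of review count, which is only approximate because the predictor is log(Reviews + 1) rather than log(Reviews).

```r
# Coefficients copied from the model summary (log-odds scale)
beta <- c(Rating = 0.42805, log_Reviews = -0.21246, is_mature = -0.54714)

# Odds ratios and the implied percent change in the odds of being paid
odds_ratio <- exp(beta)
pct_change <- (odds_ratio - 1) * 100

# Effect of (approximately) doubling the review count:
# a doubling adds log(2) to the log-scale predictor
or_doubling <- exp(beta[["log_Reviews"]] * log(2))

print(round(data.frame(odds_ratio, pct_change), 2))
print(round(or_doubling, 3))
```

On this scale a doubling of reviews is associated with somewhat lower odds of being paid, a smaller step than the 2.7× change implied by one log unit.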
I will manually build a 95% confidence interval for the rating coefficient, using its standard error from the model summary.
beta_rating <- coef(model)["Rating"]
se_rating   <- summary(model)$coefficients["Rating", "Std. Error"]

# Wald interval on the log-odds scale, then exponentiate to the odds-ratio scale
ci_log_odds_lower <- beta_rating - 1.96 * se_rating
ci_log_odds_upper <- beta_rating + 1.96 * se_rating
ci_or_lower <- exp(ci_log_odds_lower)
ci_or_upper <- exp(ci_log_odds_upper)

cat("95% CI for Rating coefficient (log-odds scale): [",
    round(ci_log_odds_lower, 3), ",", round(ci_log_odds_upper, 3), "]\n")
## 95% CI for Rating coefficient (log-odds scale): [ 0.279 , 0.577 ]
cat("95% CI for Rating odds ratio: [",
    round(ci_or_lower, 3), ",", round(ci_or_upper, 3), "]\n")
## 95% CI for Rating odds ratio: [ 1.322 , 1.78 ]
What this means:
I am 95% confident that the true odds ratio associated with a one-star
rating increase lies somewhere between 1.32 and 1.78. In other words,
across the range of values consistent with our data, a one-point rating
bump multiplies the odds of being a paid app by at least 1.32× and as
much as 1.78×, holding the other predictors fixed. Because this interval
excludes 1.0 (or 0 on the log-odds scale), the association is
statistically distinguishable from zero at the 5% level and is unlikely
to be just noise.
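The link between the interval and the significance test can be checked by hand. This is a minimal sketch using the Rating estimate and standard error as printed in the model summary (the variable names are mine). The Wald z statistic is the estimate divided by its standard error, and a 95% Wald interval excludes 0 exactly when |z| exceeds the critical value of about 1.96.

```r
est <- 0.42805   # Rating coefficient from summary(model)
se  <- 0.07585   # its standard error

z_crit <- qnorm(0.975)          # two-sided 5% critical value, about 1.96
z      <- est / se              # Wald z statistic
p_val  <- 2 * pnorm(-abs(z))    # two-sided p-value

# Wald interval on the log-odds scale, then on the odds-ratio scale
ci_log_odds <- est + c(-1, 1) * z_crit * se
ci_or       <- exp(ci_log_odds)

print(round(c(z = z, z_crit = z_crit), 3))
print(signif(p_val, 3))
print(round(ci_or, 3))
```

The z statistic and p-value computed this way should line up with the Rating row of the summary table, up to rounding of the printed inputs.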
If I were a developer trying to understand what distinguishes paid apps, the data suggest that rating quality is a genuine differentiator, not just a nice-to-have.
or_df <- data.frame(
  Variable = c("Rating", "log(Reviews)", "is_mature"),
  OR    = exp(coef(model)[-1]),
  Lower = exp(coef(model)[-1] - 1.96 * summary(model)$coefficients[-1, "Std. Error"]),
  Upper = exp(coef(model)[-1] + 1.96 * summary(model)$coefficients[-1, "Std. Error"])
)
or_df$Variable <- factor(or_df$Variable,
                         levels = c("is_mature", "log(Reviews)", "Rating"))

ggplot(or_df, aes(x = OR, y = Variable, color = OR > 1)) +
  geom_vline(xintercept = 1, linetype = "dashed",
             color = "grey50", size = 0.8) +
  geom_errorbarh(aes(xmin = Lower, xmax = Upper),
                 height = 0.2, size = 1.0) +
  geom_point(size = 5, shape = 18) +
  scale_color_manual(values = c("TRUE" = "#2e7d32", "FALSE" = "#c62828"),
                     labels = c("TRUE" = "Increases paid odds",
                                "FALSE" = "Decreases paid odds")) +
  scale_x_continuous(breaks = seq(0.5, 2.0, by = 0.25)) +
  labs(
    title = "Odds Ratios from Logistic Regression",
    subtitle = "Predicting whether a Google Play app is Paid (vs. Free)",
    x = "Odds Ratio (95% CI)",
    y = NULL,
    color = NULL
  ) +
  theme_minimal(base_size = 13) +
  theme(
    legend.position = "bottom",
    plot.title = element_text(face = "bold"),
    panel.grid.minor = element_blank(),
    axis.text.y = element_text(size = 12)
  )

Figure 3: Odds ratios with 95% confidence intervals. A value above 1 means the predictor is positively associated with being a paid app; below 1 means negatively associated.
The forest plot offers a clean at-a-glance summary: Rating is the only predictor associated with increased odds of being paid, while review volume and mature content rating both work against it.
To visualize what the model is actually doing, I plotted the predicted probability of being a paid app as a function of rating, holding log_Reviews at its median and is_mature = 0.
# Prediction grid: vary Rating, hold the other predictors fixed
pred_df <- data.frame(
  Rating = seq(1, 5, length.out = 300),
  log_Reviews = median(df$log_Reviews),
  is_mature = 0
)
pred_df$prob_paid <- predict(model, newdata = pred_df, type = "response")

# Observed outcomes, nudged vertically around 0 and 1 so points don't overlap
jitter_df <- df %>%
  mutate(jitter_y = ifelse(is_paid == 1, 1 + runif(n(), -0.02, 0.02),
                           0 + runif(n(), -0.02, 0.02)),
         App_Type = ifelse(is_paid == 1, "Paid", "Free"))

ggplot() +
  geom_jitter(data = jitter_df %>% sample_n(2000),
              aes(x = Rating, y = jitter_y, color = App_Type),
              height = 0.01, width = 0.06, alpha = 0.18, size = 0.9) +
  geom_line(data = pred_df, aes(x = Rating, y = prob_paid),
            color = "#1a237e", size = 1.4) +
  scale_color_manual(values = c("Free" = "#5B9BD5", "Paid" = "#E8804A")) +
  scale_y_continuous(labels = percent_format(), breaks = seq(0, 1, 0.25)) +
  labs(
    title = "Predicted Probability of Being a Paid App by Rating",
    subtitle = "Curve shown at median review count; jittered points show actual data (sample of 2,000)",
    x = "Average Rating",
    y = "Predicted Probability (Paid)",
    color = NULL
  ) +
  theme_minimal(base_size = 13) +
  theme(
    legend.position = "bottom",
    plot.title = element_text(face = "bold"),
    panel.grid.minor = element_blank()
  )

Figure 4: The logistic curve showing predicted probability of being a paid app across the rating range, at median review count.
Even at the highest ratings, the predicted probability of being paid tops out around 10–12%, a reminder that the baseline is strongly skewed toward free. But the upward slope of the curve confirms the positive relationship between rating and paid status.
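For anyone who wants to see what predict() is doing under the hood, here is a rough sketch that rebuilds the predicted probability by hand from the printed coefficients. The helper `predict_paid` is hypothetical (my own name), and the review counts of 100 and 10,000 are illustrative values I picked, not the dataset median used for the curve in the figure.

```r
# Coefficients copied from the model summary (rounded as printed)
b <- c(intercept = -2.87135, rating = 0.42805,
       log_reviews = -0.21246, is_mature = -0.54714)

# Hypothetical helper: predicted P(paid) for given inputs
predict_paid <- function(rating, reviews, is_mature = 0) {
  # Linear predictor on the log-odds scale
  eta <- b[["intercept"]] +
    b[["rating"]] * rating +
    b[["log_reviews"]] * log(reviews + 1) +
    b[["is_mature"]] * is_mature
  plogis(eta)  # inverse logit turns log-odds into a probability
}

ratings <- 1:5
# Illustrative review counts only; NOT the dataset median
p_100 <- predict_paid(ratings, reviews = 100)
p_10k <- predict_paid(ratings, reviews = 10000)

print(round(data.frame(Rating = ratings, p_100 = p_100, p_10k = p_10k), 3))
```

Both columns rise with rating, and the higher review count shifts the whole curve downward, which is the same pattern the coefficients describe.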
What I found
The logistic regression tells a coherent story. Higher-rated apps are somewhat more likely to be paid, perhaps because developers who charge for their work are more invested in product quality. Apps with large review counts are strongly associated with being free, which makes sense since free apps face no adoption barrier. Surprisingly, mature-content apps are less likely to be paid, possibly because adult platforms prefer to keep the Play Store entry point free and monetize elsewhere.
Why it matters
Understanding what predicts paid status has implications for both developers setting pricing strategies and researchers modeling app store economics.
Further questions I’d want to investigate
End of Data Dive