Data Dive Week 10?

1. Introduction

What actually separates a paid app from a free one? This question felt genuinely interesting to me, not just as a modeling exercise, but as a reflection of developer strategy and user behavior in the Google Play ecosystem. My hypothesis going in was that higher-rated apps might be more likely to be paid (developers charging for quality), while apps with massive review counts, which often indicate free, viral downloads, would skew free.

I decided to model whether an app is paid (1 = Paid, 0 = Free) using three explanatory variables:

Rating - the app’s average user rating (1–5 stars)
log(Reviews) - log-transformed review count, used because the raw counts are wildly right-skewed (some apps have tens of millions of reviews; most have very few). Taking the log compresses the scale and makes the relationship more tractable.
is_mature - a binary flag for whether the app's content rating is “Mature 17+” or “Adults only 18+”. I was curious whether content restrictions correlate with pricing strategy.
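To make the effect of the log transform concrete, here is a minimal sketch on a handful of made-up review counts. The numbers are hypothetical and not taken from the dataset; they are only meant to mimic the "mostly small, a few enormous" shape described above.

```r
# HYPOTHETICAL review counts for illustration only (not from googleplaystore.csv)
reviews <- c(0, 3, 12, 150, 2400, 50000, 30000000)

# Adding 1 before taking the log keeps zero-review apps defined (log(0) is -Inf)
log_reviews <- log(reviews + 1)

# Raw counts run from 0 to 30 million; the logged values run from 0 to about 17
range(reviews)
round(range(log_reviews), 2)

# Base R's log1p() computes the same quantity as log(x + 1)
all.equal(log_reviews, log1p(reviews))
```

The largest app still sits at the top of the scale, but it no longer dwarfs everything else, which is what makes a linear term in log(Reviews + 1) plausible.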

2. Data Prep

library(dplyr)
library(ggplot2)
library(scales)
library(tidyr)

# Load data
df_raw <- read.csv("googleplaystore.csv", stringsAsFactors = FALSE)

df <- df_raw %>%
  filter(Type %in% c("Free", "Paid")) %>%
  mutate(
    Reviews     = suppressWarnings(as.numeric(Reviews)),
    Rating      = suppressWarnings(as.numeric(Rating))
  ) %>%
  filter(!is.na(Rating), !is.na(Reviews), Rating <= 5) %>%
  mutate(
    is_paid     = as.integer(Type == "Paid"),
    log_Reviews = log(Reviews + 1),
    is_mature   = as.integer(Content.Rating %in% c("Mature 17+", "Adults only 18+"))
  )

cat("Final dataset dimensions:", nrow(df), "rows x", ncol(df), "columns\n")

## Final dataset dimensions: 9366 rows x 16 columns

cat("Paid apps:", sum(df$is_paid), "| Free apps:", sum(df$is_paid == 0), "\n")

## Paid apps: 647 | Free apps: 8719

After filtering, I have 9,366 apps - 647 paid and 8,719 free. The class imbalance is notable (about 7% paid), and I’ll keep it in mind when interpreting results. I chose not to downsample because I’m after inference and not prediction accuracy.
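As a quick check on the imbalance, the class counts printed above can be turned into a proportion, odds, and log-odds. This is a small sketch using only those two counts; the connection to an intercept-only logistic model is standard, but I did not fit that model separately here.

```r
# Class counts copied from the output above
n_paid <- 647
n_free <- 8719

prop_paid <- n_paid / (n_paid + n_free)   # share of apps that are paid
odds_paid <- n_paid / n_free              # odds of paid versus free
log_odds  <- log(odds_paid)               # what an intercept-only logit model would estimate

round(c(proportion = prop_paid, odds = odds_paid, log_odds = log_odds), 3)
```

The log-odds value is the baseline that the fitted coefficients later shift up or down.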

3. Exploratory Visualizations

3a. Distributions: Free vs. Paid

df %>%
  mutate(App_Type = ifelse(is_paid == 1, "Paid", "Free")) %>%
  ggplot(aes(x = App_Type, y = Rating, fill = App_Type)) +
  geom_violin(trim = FALSE, alpha = 0.75, color = NA) +
  geom_boxplot(width = 0.12, fill = "white", color = "grey30",
               outlier.shape = 21, outlier.size = 1.2,
               outlier.fill = "grey60", outlier.alpha = 0.4) +
  scale_fill_manual(values = c("Free" = "#5B9BD5", "Paid" = "#E8804A")) +
  scale_y_continuous(breaks = 1:5) +
  labs(
    title    = "App Ratings by Pricing Type",
    subtitle = "Paid apps have a slightly higher median but the gap is modest",
    x        = NULL,
    y        = "Average Rating",
    fill     = NULL
  ) +
  theme_minimal(base_size = 13) +
  theme(
    legend.position  = "none",
    panel.grid.major.x = element_blank(),
    plot.title       = element_text(face = "bold"),
    axis.text.x      = element_text(size = 13, face = "bold")
  )

Figure 1: Rating distributions for Free vs. Paid apps. Paid apps show a subtly higher median rating, though both distributions are left-skewed.

Both distributions are left-skewed, as most apps rate between 4.0 and 5.0. Paid apps do appear to have a slightly higher center, which already hints that rating might be a positive predictor of being paid.

3b. Review Counts on a Log Scale

df %>%
  mutate(App_Type = ifelse(is_paid == 1, "Paid", "Free")) %>%
  ggplot(aes(x = log_Reviews, fill = App_Type, color = App_Type)) +
  geom_density(alpha = 0.55, size = 0.8) +
  geom_rug(data = . %>% filter(App_Type == "Paid"),
           aes(x = log_Reviews), color = "#E8804A",
           alpha = 0.3, length = unit(0.04, "npc")) +
  scale_fill_manual(values  = c("Free" = "#5B9BD5", "Paid" = "#E8804A")) +
  scale_color_manual(values = c("Free" = "#3a7ab8", "Paid" = "#c05e28")) +
  labs(
    title    = "Distribution of log(Reviews + 1) by App Type",
    subtitle = "Paid apps (orange rug ticks) concentrate at lower review counts",
    x        = "log(Reviews + 1)",
    y        = "Density",
    fill     = NULL, color = NULL
  ) +
  theme_minimal(base_size = 13) +
  theme(
    legend.position = c(0.1, 0.88),
    plot.title      = element_text(face = "bold"),
    panel.grid.minor = element_blank()
  )

Figure 2: Density of log-transformed review counts. Paid apps cluster at lower log-review values, suggesting smaller but potentially more dedicated user bases.

The density plot makes the story clear. Free apps have a wide bimodal review distribution, while paid apps cluster heavily at lower log review counts. This supports my intuition that popularity (proxied by reviews) is strongly associated with being free.

4. The Logistic Regression Model

As Leon said in the lecture, when our response variable is binary, we can’t just fit a straight line through it. Instead, we use the sigmoid function to model probability, and its inverse (the logit, or log-odds function) becomes our linear component. That is exactly the structure I’m using here.
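The sigmoid and logit pair can be sketched in a few lines of base R. The helper names `sigmoid` and `logit` are my own for illustration; `plogis()` and `qlogis()` are the built-in equivalents.

```r
# Sigmoid (inverse logit): maps any log-odds value to a probability in (0, 1)
sigmoid <- function(eta) 1 / (1 + exp(-eta))

# Logit: maps a probability back to log-odds
logit <- function(p) log(p / (1 - p))

eta <- c(-4, -2, 0, 2, 4)   # illustrative log-odds values
p   <- sigmoid(eta)

# Applying the logit to the sigmoid output recovers the original log-odds
all.equal(logit(p), eta)

# Base R ships the same pair as plogis() and qlogis()
all.equal(p, plogis(eta))
all.equal(logit(p), qlogis(p))
```

A log-odds of 0 corresponds to a probability of exactly 0.5, and the model's linear predictor lives on the `eta` scale.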

model <- glm(is_paid ~ Rating + log_Reviews + is_mature,
             data   = df,
             family = binomial(link = "logit"))

summary(model)

## 
## Call:
## glm(formula = is_paid ~ Rating + log_Reviews + is_mature, family = binomial(link = "logit"), 
##     data = df)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.87135    0.33262  -8.633  < 2e-16 ***
## Rating       0.42805    0.07585   5.643 1.67e-08 ***
## log_Reviews -0.21246    0.01189 -17.867  < 2e-16 ***
## is_mature   -0.54714    0.25481  -2.147   0.0318 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4706.4  on 9365  degrees of freedom
## Residual deviance: 4317.3  on 9362  degrees of freedom
## AIC: 4325.3
## 
## Number of Fisher Scoring iterations: 6

All three coefficients are statistically significant (p < 0.05). The model converged cleanly.
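Beyond the per-coefficient z-tests, the two deviances in the summary allow a rough whole-model check. The sketch below is a likelihood-ratio (deviance) test computed by hand from the printed values; it is an extra check I am adding here, not something the summary reports directly.

```r
# Deviances and degrees of freedom copied from the summary() output above
null_dev  <- 4706.4; null_df  <- 9365
resid_dev <- 4317.3; resid_df <- 9362

lr_stat <- null_dev - resid_dev   # drop in deviance from adding the predictors (389.1)
lr_df   <- null_df - resid_df     # number of predictors added (3)

# Upper-tail chi-square probability of a drop this large if no predictor mattered
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

lr_stat
lr_df
p_value
```

With the fitted object in hand, `anova(model, test = "Chisq")` reports related deviance tests term by term.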

5. Coefficient Interpretation

To interpret logistic regression coefficients, I exponentiate them to get odds ratios, which tell me how the odds of an app being paid change with each predictor.

coef_table <- data.frame(
  Variable   = c("Intercept", "Rating", "log_Reviews", "is_mature"),
  Estimate   = coef(model),
  Std_Error  = summary(model)$coefficients[, "Std. Error"],
  Odds_Ratio = exp(coef(model))
)

coef_table$OR_Lower <- exp(coef_table$Estimate - 1.96 * coef_table$Std_Error)
coef_table$OR_Upper <- exp(coef_table$Estimate + 1.96 * coef_table$Std_Error)

coef_table[, -1] <- round(coef_table[, -1], 4)
coef_table

##                Variable Estimate Std_Error Odds_Ratio OR_Lower OR_Upper
## (Intercept)   Intercept  -2.8714    0.3326     0.0566   0.0295   0.1087
## Rating           Rating   0.4281    0.0759     1.5343   1.3223   1.7802
## log_Reviews log_Reviews  -0.2125    0.0119     0.8086   0.7900   0.8277
## is_mature     is_mature  -0.5471    0.2548     0.5786   0.3511   0.9534

Here is what each coefficient tells me:

Rating (β̂ = 0.428, OR ≈ 1.53):
For every one-star increase in an app’s average rating, the odds of it being a paid app increase by about 53%, holding all else constant. This is the most substantively interesting finding to me as it suggests that paid apps genuinely do tend to be higher-quality (as rated by users), or at least that developers who charge for their apps are incentivized to maintain higher quality to justify the price.

log_Reviews (β̂ = −0.213, OR ≈ 0.81):
For every one-unit increase in log(Reviews + 1), roughly corresponding to a 2.7× increase in raw review count, the odds of an app being paid decrease by about 19%. This makes intuitive sense: viral, free apps accumulate massive review volumes because they face no download barrier. Paid apps have a smaller, more selective user base.

is_mature (β̂ = −0.547, OR ≈ 0.58):
Mature-rated apps (17+ or 18+) have odds of being paid that are about 42% lower than non-mature apps, all else equal. This surprised me initially; I expected niche mature content to command a price. One possible explanation is that adult content platforms often use free-with-ads or freemium models to maximize reach before monetizing through subscriptions outside the Play Store.
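The percentages quoted in these interpretations come from a simple conversion, sketched below using the coefficients printed in the model summary. The helper names `pct_change` and `or_for_fold` are hypothetical, introduced only for this example, and the k-fold calculation treats log(Reviews + 1) as log(Reviews), which is a close approximation only for apps with many reviews.

```r
# Coefficients copied from the model summary above
b_rating  <- 0.42805
b_reviews <- -0.21246
b_mature  <- -0.54714

# Percent change in odds for a one-unit increase: (exp(beta) - 1) * 100
pct_change <- function(beta) (exp(beta) - 1) * 100
round(pct_change(c(Rating = b_rating, log_Reviews = b_reviews, is_mature = b_mature)), 1)

# A one-unit step on the log scale is an e-fold (about 2.7x) increase in reviews.
# For a k-fold increase the log changes by log(k), so the odds ratio is exp(beta * log(k)).
or_for_fold <- function(beta, k) exp(beta * log(k))
round(or_for_fold(b_reviews, 10), 3)   # odds ratio for a tenfold increase in reviews
```

Under that approximation, a tenfold increase in review count corresponds to an odds ratio of roughly 0.61, which is often an easier unit to reason about than an e-fold step.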

6. Confidence Interval for the Rating Coefficient

I will manually build a 95% confidence interval for the rating coefficient, using its standard error from the model summary.

beta_rating <- coef(model)["Rating"]
se_rating   <- summary(model)$coefficients["Rating", "Std. Error"]

ci_log_odds_lower <- beta_rating - 1.96 * se_rating
ci_log_odds_upper <- beta_rating + 1.96 * se_rating

ci_or_lower <- exp(ci_log_odds_lower)
ci_or_upper <- exp(ci_log_odds_upper)

cat("95% CI for Rating coefficient (log-odds scale): [",
    round(ci_log_odds_lower, 3), ",", round(ci_log_odds_upper, 3), "]\n")

## 95% CI for Rating coefficient (log-odds scale): [ 0.279 , 0.577 ]

cat("95% CI for Rating odds ratio:                   [",
    round(ci_or_lower, 3), ",", round(ci_or_upper, 3), "]\n")

## 95% CI for Rating odds ratio:                   [ 1.322 , 1.78 ]

What this means:
I am 95% confident that the true association between a one-star rating increase and the odds of an app being paid corresponds to an odds ratio somewhere between 1.32 and 1.78. Across the range of plausible values consistent with the data, a one-point rating bump multiplies the odds of being a paid app by at least 1.32× and as much as 1.78×. The fact that this interval does not cross 1.0 (or 0 on the log-odds scale) indicates that the relationship is statistically distinguishable from no association at the 5% level, rather than just noise.
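The same interval can be reproduced from nothing but the estimate and standard error printed in the model summary. This sketch uses `qnorm(0.975)` in place of the rounded 1.96; the variable names are mine.

```r
# Estimate and standard error for Rating, copied from the model summary above
beta <- 0.42805
se   <- 0.07585

z <- qnorm(0.975)   # 97.5th percentile of the standard normal, about 1.96

ci_log_odds <- beta + c(-1, 1) * z * se   # Wald interval on the log-odds scale
ci_or       <- exp(ci_log_odds)           # same interval on the odds-ratio scale

round(ci_log_odds, 3)
round(ci_or, 3)
```

To three decimals this agrees with the manual calculation. With the fitted object available, `confint.default(model)` returns Wald intervals of this kind for every coefficient, while `confint(model)` uses profile likelihood and can differ slightly.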

If I were a developer trying to understand what distinguishes paid apps, the data suggest that rating quality is a genuine differentiator, not just a nice-to-have.

7. Odds Ratio Forest Plot

or_df <- data.frame(
  Variable = c("Rating", "log(Reviews)", "is_mature"),
  OR       = exp(coef(model)[-1]),
  Lower    = exp(coef(model)[-1] - 1.96 * summary(model)$coefficients[-1, "Std. Error"]),
  Upper    = exp(coef(model)[-1] + 1.96 * summary(model)$coefficients[-1, "Std. Error"])
)

or_df$Variable <- factor(or_df$Variable,
                          levels = c("is_mature", "log(Reviews)", "Rating"))

ggplot(or_df, aes(x = OR, y = Variable, color = OR > 1)) +
  geom_vline(xintercept = 1, linetype = "dashed",
             color = "grey50", size = 0.8) +
  geom_errorbarh(aes(xmin = Lower, xmax = Upper),
                 height = 0.2, size = 1.0) +
  geom_point(size = 5, shape = 18) +
  scale_color_manual(values = c("TRUE" = "#2e7d32", "FALSE" = "#c62828"),
                     labels = c("TRUE"  = "Increases paid odds",
                                "FALSE" = "Decreases paid odds")) +
  scale_x_continuous(breaks = seq(0.5, 2.0, by = 0.25)) +
  labs(
    title    = "Odds Ratios from Logistic Regression",
    subtitle = "Predicting whether a Google Play app is Paid (vs. Free)",
    x        = "Odds Ratio (95% CI)",
    y        = NULL,
    color    = NULL
  ) +
  theme_minimal(base_size = 13) +
  theme(
    legend.position  = "bottom",
    plot.title       = element_text(face = "bold"),
    panel.grid.minor = element_blank(),
    axis.text.y      = element_text(size = 12)
  )

Figure 3: Odds ratios with 95% confidence intervals. A value above 1 means the predictor is positively associated with being a paid app; below 1 means negatively associated.

The forest plot offers a clean at-a-glance summary: Rating is the only predictor associated with increased odds of being paid, while review volume and mature content rating both work against it.

8. Fitted Probability Curve

To visualize what the model is actually doing, I plotted the predicted probability of being a paid app as a function of rating, holding log_Reviews at its median and is_mature = 0.

pred_df <- data.frame(
  Rating      = seq(1, 5, length.out = 300),
  log_Reviews = median(df$log_Reviews),
  is_mature   = 0
)
pred_df$prob_paid <- predict(model, newdata = pred_df, type = "response")

set.seed(42)  # arbitrary seed so the jitter and the 2,000-row sample are reproducible
jitter_df <- df %>%
  mutate(jitter_y = ifelse(is_paid == 1, 1 + runif(n(), -0.02, 0.02),
                                          0 + runif(n(), -0.02, 0.02)),
         App_Type = ifelse(is_paid == 1, "Paid", "Free"))

ggplot() +
  geom_jitter(data = jitter_df %>% sample_n(2000),
              aes(x = Rating, y = jitter_y, color = App_Type),
              height = 0.01, width = 0.06, alpha = 0.18, size = 0.9) +
  geom_line(data = pred_df, aes(x = Rating, y = prob_paid),
            color = "#1a237e", size = 1.4) +
  scale_color_manual(values = c("Free" = "#5B9BD5", "Paid" = "#E8804A")) +
  scale_y_continuous(labels = percent_format(), breaks = seq(0, 1, 0.25)) +
  labs(
    title    = "Predicted Probability of Being a Paid App by Rating",
    subtitle = "Curve shown at median review count; jittered points show actual data (sample of 2,000)",
    x        = "Average Rating",
    y        = "Predicted Probability (Paid)",
    color    = NULL
  ) +
  theme_minimal(base_size = 13) +
  theme(
    legend.position = "bottom",
    plot.title      = element_text(face = "bold"),
    panel.grid.minor = element_blank()
  )

Figure 4: The logistic curve showing predicted probability of being a paid app across the rating range, at median review count.

Even at the highest ratings, the predicted probability of being paid tops out around 10–12%, a reminder that the baseline is strongly skewed toward free. But the upward slope of the curve confirms the positive relationship between rating and paid status.
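To see where the curve comes from, the predicted probability can be computed by hand from the printed coefficients. This is a sketch only: the helper `prob_paid` is hypothetical, and the reference value for log(Reviews + 1) is an arbitrary placeholder, not the dataset median, so the numbers it prints will not match the figure unless the real `median(df$log_Reviews)` is substituted.

```r
# Coefficients copied from the model summary above
b <- c(intercept = -2.87135, rating = 0.42805,
       log_reviews = -0.21246, is_mature = -0.54714)

# Predicted probability = sigmoid of the linear predictor (plogis() is base R's sigmoid)
prob_paid <- function(rating, log_reviews, is_mature = 0) {
  eta <- b["intercept"] + b["rating"] * rating +
    b["log_reviews"] * log_reviews + b["is_mature"] * is_mature
  unname(plogis(eta))
}

# HYPOTHETICAL reference value; replace with median(df$log_Reviews) to match the plot
log_reviews_ref <- 6

ratings <- 1:5
round(prob_paid(ratings, log_reviews_ref), 3)
```

Whatever reference value is used, each extra star adds the same 0.428 to the log-odds, so the curve always rises with rating even though the probabilities themselves stay small.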

9. Insights, Significance, and Further Questions

What I found

The logistic regression tells a coherent story. Higher-rated apps are somewhat more likely to be paid — developers who charge for their work may be more invested in product quality. Apps with large review counts are strongly associated with being free, which makes sense since free apps face no adoption barrier. Surprisingly, mature-content apps are less likely to be paid, possibly because adult platforms prefer to keep the Play Store entry point free and monetize elsewhere.

Why it matters

Understanding what predicts paid status has implications for both developers setting pricing strategies and researchers modeling app store economics.

Further questions I’d want to investigate

Does the relationship between Rating and paid status differ by category?
How does install count (which I couldn’t easily clean for this model) factor in? Installs might mediate the relationship between being free and accumulating reviews, and including it could clarify whether review count is a cause or consequence of pricing.
Is the model stable over time? The dataset includes apps from different years, and the shift toward freemium models post-2015 might mean the rating–paid relationship has changed. A longitudinal split would be revealing.

End of Data Dive