04/05/2026
In this data dive, I build a linear model using the Google Play Store dataset to understand what factors are associated with an app’s user rating. I want to find out whether the number of reviews and whether an app is free or paid can help predict its rating.
I start by loading the necessary libraries and reading in the dataset.
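library(tidyverse)  # dplyr verbs and the %>% pipe
library(ggplot2)    # plotting (already attached by tidyverse)
library(broom)      # tidy() for model summaries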
play <- read.csv("googleplaystore.csv", stringsAsFactors = FALSE)
head(play)
##                                                  App       Category Rating
## 1     Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN    4.1
## 2                                Coloring book moana ART_AND_DESIGN    3.9
## 3 U Launcher Lite – FREE Live Cool Themes, Hide Apps ART_AND_DESIGN    4.7
## 4                              Sketch - Draw & Paint ART_AND_DESIGN    4.5
## 5              Pixel Draw - Number Art Coloring Book ART_AND_DESIGN    4.3
## 6                         Paper flowers instructions ART_AND_DESIGN    4.4
##   Reviews Size    Installs Type Price Content.Rating                    Genres
## 1     159  19M     10,000+ Free     0       Everyone              Art & Design
## 2     967  14M    500,000+ Free     0       Everyone Art & Design;Pretend Play
## 3   87510 8.7M  5,000,000+ Free     0       Everyone              Art & Design
## 4  215644  25M 50,000,000+ Free     0           Teen              Art & Design
## 5     967 2.8M    100,000+ Free     0       Everyone   Art & Design;Creativity
## 6     167 5.6M     50,000+ Free     0       Everyone              Art & Design
##       Last.Updated        Current.Ver  Android.Ver
## 1  January 7, 2018              1.0.0 4.0.3 and up
## 2 January 15, 2018              2.0.0 4.0.3 and up
## 3   August 1, 2018              1.2.4 4.0.3 and up
## 4     June 8, 2018 Varies with device   4.2 and up
## 5    June 20, 2018                1.1   4.4 and up
## 6   March 26, 2017                1.0   2.3 and up
The dataset has 13 columns including app name, category, rating,
number of reviews, size, installs, and whether the app is free or paid.
I will focus on Rating, Reviews, and
Type for this analysis.
Before building the model, I need to clean up the columns I plan to
use. The Reviews column is stored as a character string,
and there are some missing or invalid ratings, so I filter those
out.
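play_clean <- play %>%
  filter(!is.na(Rating)) %>%                   # drop apps with no rating
  filter(Rating <= 5 & Rating >= 1) %>%        # keep only valid ratings
  mutate(Reviews = as.numeric(Reviews)) %>%    # non-numeric strings become NA
  filter(!is.na(Reviews)) %>%
  filter(Type %in% c("Free", "Paid")) %>%      # drop rows with an invalid Type
  mutate(log_Reviews = log(Reviews + 1))       # +1 keeps zero counts defined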
nrow(play_clean)
## [1] 9366
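After removing rows with missing or out-of-range ratings, non-numeric
review counts, and an invalid Type, I have 9,366 apps to
work with. I also log-transform the Reviews column because review
counts are heavily right-skewed, with a few apps having millions of
reviews and most having far fewer. I add 1 before taking the log so
that any app with zero reviews would still have a defined value.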
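To see why both cleaning steps matter, here is a minimal sketch on a made-up character vector (the values are hypothetical, not rows from the dataset). It shows that as.numeric() turns unparseable strings into NA, and that log(x + 1) is the same transform as base R's log1p(x).

```r
# Hypothetical review counts stored as text, including one malformed entry
reviews_chr <- c("0", "159", "87510", "3.0M")

# as.numeric() returns NA (with a warning) for strings it cannot parse
reviews_num <- suppressWarnings(as.numeric(reviews_chr))

# Drop the NA, mirroring filter(!is.na(Reviews)) in the pipeline
reviews_ok <- reviews_num[!is.na(reviews_num)]

# log(x + 1) maps 0 reviews to 0 instead of -Inf; log1p() is equivalent
log_reviews <- log(reviews_ok + 1)

print(reviews_num)
print(round(log_reviews, 2))
```

The same idea applies to the full Reviews column: any value that fails to parse is removed before the log transform is applied.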
Before fitting a model, I want to get a feel for the distribution of ratings and how reviews relate to ratings.
ggplot(play_clean, aes(x = Rating)) +
  geom_histogram(binwidth = 0.1, fill = "steelblue", color = "white") +
  labs(title = "Distribution of App Ratings",
       x = "Rating",
       y = "Count")
Most apps have ratings clustered between 4.0 and 4.5, with a tail stretching toward lower ratings, so the response variable is clearly not normally distributed. This is something I will come back to when diagnosing the model.
ggplot(play_clean, aes(x = log_Reviews, y = Rating)) +
  geom_point(alpha = 0.2, color = "steelblue") +
  geom_smooth(method = "lm", se = TRUE, color = "firebrick") +
  labs(title = "Log Reviews vs App Rating",
       x = "Log(Reviews + 1)",
       y = "Rating")
There appears to be a mild positive relationship between the log number of reviews and the rating. Apps with more reviews tend to have slightly higher ratings, though the spread is quite wide.
ggplot(play_clean, aes(x = Type, y = Rating, fill = Type)) +
  geom_boxplot() +
  labs(title = "Rating by App Type (Free vs Paid)",
       x = "App Type",
       y = "Rating")
The boxplot shows that paid apps tend to have a slightly higher median rating than free apps. This is an interesting pattern worth including in the model.
I build a linear regression model using Rating as the
response variable, and log_Reviews and Type as
the explanatory variables.
model <- lm(Rating ~ log_Reviews + Type, data = play_clean)
summary(model)
##
## Call:
## lm(formula = Rating ~ log_Reviews + Type, data = play_clean)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -3.11384 -0.18001  0.05447  0.27715  1.05402
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.924866   0.012950  303.08  < 2e-16 ***
## log_Reviews 0.030462   0.001373   22.18  < 2e-16 ***
## TypePaid    0.167863   0.020825    8.06 8.53e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5019 on 9363 degrees of freedom
## Multiple R-squared: 0.05141, Adjusted R-squared: 0.05121
## F-statistic: 253.7 on 2 and 9363 DF, p-value: < 2.2e-16
This model estimates the average app rating based on how many reviews an app has received (on a log scale) and whether it is free or paid. The output gives me coefficient estimates, standard errors, and p-values for each predictor.
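To make the fitted equation concrete, the sketch below plugs the coefficients from the summary output into a small helper. The function name predict_rating and the 1,000-review example are my own illustration, not part of the original analysis.

```r
# Coefficients copied from the summary(model) output
b0     <- 3.924866   # intercept: free app with log_Reviews = 0
b_log  <- 0.030462   # slope for log_Reviews
b_paid <- 0.167863   # shift for paid apps relative to free apps

# Hypothetical helper: predicted rating for a review count and app type
predict_rating <- function(reviews, paid = FALSE) {
  b0 + b_log * log(reviews + 1) + b_paid * as.numeric(paid)
}

free_1k <- predict_rating(1000)               # roughly 4.14
paid_1k <- predict_rating(1000, paid = TRUE)  # roughly 4.30

# A tenfold increase in (Reviews + 1) adds b_log * log(10), about 0.07 points
tenfold_gain <- b_log * log(10)

print(round(c(free_1k, paid_1k, tenfold_gain), 3))
```

With the model object available, predict(model, newdata = ...) would give the same values without copying coefficients by hand.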
I use the standard four diagnostic plots to check whether the assumptions of linear regression are being met. These assumptions include linearity, constant variance of residuals, normality of residuals, and no influential outliers.
par(mfrow = c(2, 2))
plot(model)
Residuals vs Fitted: The residuals are not randomly scattered around zero. Instead, they fan out and show some curvature, which suggests the linearity and constant variance assumptions may not fully hold.
Normal Q-Q Plot: The residuals deviate noticeably from the diagonal line, especially at the tails. This tells me the residuals are not perfectly normally distributed.
Scale-Location: The spread of standardized residuals increases as the fitted values increase. This points to heteroscedasticity, meaning the variance of the residuals is not constant.
Residuals vs Leverage: A few points have high leverage and large residuals, which means they could be influencing the model more than they should.
Based on the diagnostic plots above, I can identify a few clear problems with this model.
ggplot(play_clean, aes(x = Rating)) +
  geom_histogram(binwidth = 0.1, fill = "coral", color = "white") +
  labs(title = "App Ratings Are Left-Skewed and Bounded",
       x = "Rating",
       y = "Count")
The biggest issue is that Rating is a bounded variable,
meaning it can only fall between 1 and 5. Linear regression does not
know this and can technically predict values outside that range. The
left skew in the ratings distribution also carries over to the residuals,
because the predictors explain only a small share of the variation in
Rating, and that is what the Q-Q plot showed me.
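As a rough check on how far the fitted line is from the upper bound, the sketch below solves for the review count at which the predicted rating would reach 5. It uses only the coefficients printed in the summary output; the variable names are my own.

```r
# Coefficients copied from the summary(model) output
b0     <- 3.924866
b_log  <- 0.030462
b_paid <- 0.167863

# Value of log_Reviews where the fitted line for a paid app reaches 5
log_cross_paid <- (5 - b0 - b_paid) / b_log

# Same threshold for a free app (no TypePaid shift)
log_cross_free <- (5 - b0) / b_log

# Convert back from log(Reviews + 1) to a raw review count
reviews_cross_paid <- exp(log_cross_paid) - 1

print(round(c(log_cross_paid, log_cross_free), 1))
print(signif(reviews_cross_paid, 2))
```

Under this fit the crossing point sits at trillions of reviews, so out-of-range predictions are a theoretical rather than a practical concern here. The bound shows up mainly through the skewed residuals.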
A second issue is heteroscedasticity. The Scale-Location plot showed that the residual spread grows as fitted values grow, which violates the constant variance assumption. This means my standard errors and p-values may not be fully reliable.
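A formal test can back up what the Scale-Location plot suggests. The sketch below computes a studentized Breusch-Pagan statistic by hand in base R. Because the dataset is not bundled here, it runs on simulated data in which the noise grows with the predictor; the variable names and simulation settings are assumptions for illustration only.

```r
set.seed(510)

# Simulated stand-in for the real data: noise sd grows with x
n <- 500
x <- runif(n, 0, 15)                                # plays the role of log_Reviews
y <- 4 + 0.03 * x + rnorm(n, sd = 0.1 + 0.05 * x)   # heteroscedastic errors

fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the same predictor
aux <- lm(resid(fit)^2 ~ x)

# Studentized Breusch-Pagan statistic: n times R^2 of the auxiliary fit,
# compared with a chi-squared distribution (df = number of predictors)
bp_stat <- n * summary(aux)$r.squared
bp_pval <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

print(c(statistic = bp_stat, p_value = bp_pval))
```

A small p-value indicates that the residual variance depends on the predictor. On the real model, the same steps would use resid(model)^2 regressed on log_Reviews and Type, with df = 2.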
A third issue is that log_Reviews and Type together are
weak predictors of rating. The R-squared from the model is about 0.05,
so the model explains only about 5% of the variation in rating. Many
other factors (like category, content type, or app quality) likely
explain rating better than review count and price type alone.
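The fit statistics in the summary output are all tied to R-squared, which can be verified with a few lines of arithmetic. This sketch recomputes the adjusted R-squared and the F-statistic from the reported values; nothing here is new data.

```r
# Values taken from the summary(model) output
r2 <- 0.05141   # multiple R-squared
n  <- 9366      # rows in play_clean
p  <- 2         # predictors: log_Reviews and TypePaid

# Adjusted R-squared penalizes for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# F-statistic for the overall model, expressed through R-squared
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

print(round(adj_r2, 5))
print(round(f_stat, 1))
```

Both results agree with the printed output to rounding, which shows that the tiny overall p-value comes from the large sample size rather than from strong explanatory power.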
tidy(model)
## # A tibble: 3 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)   3.92     0.0130     303.   0
## 2 log_Reviews   0.0305   0.00137     22.2  2.78e-106
## 3 TypePaid      0.168    0.0208       8.06 8.53e- 16
I want to interpret the coefficient for TypePaid. In the
model, Free is the reference category, so the
TypePaid coefficient tells me how much the predicted rating
changes when an app is paid compared to a free app, while holding the
number of reviews constant.
The coefficient for TypePaid is positive and
approximately 0.17. This means that, on average, a paid
app is predicted to have a rating about 0.17 points higher than a free
app with the same number of reviews. While this seems small in absolute
terms, it is statistically significant (p < 0.001). One possible
explanation is that users who choose to pay for an app are a more
motivated and satisfied group. Another is that paid apps face higher
quality expectations and tend to deliver on them. The model cannot
distinguish between these explanations.
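A confidence interval gives a better sense of the plausible size of this effect than the point estimate alone. The sketch below builds a 95% interval from the estimate and standard error reported in the model output. It assumes the usual t-based interval, which is also what confint(model) would return.

```r
# Values taken from the summary(model) output for TypePaid
est <- 0.167863   # coefficient estimate
se  <- 0.020825   # standard error
df  <- 9363       # residual degrees of freedom

# Critical value for a two-sided 95% interval
t_crit <- qt(0.975, df)

# Interval: estimate plus or minus t_crit standard errors
ci <- est + c(-1, 1) * t_crit * se

print(round(ci, 3))
```

The interval runs from roughly 0.13 to 0.21 rating points. Because the residuals are heteroscedastic, the standard error behind this interval may be somewhat off, so the bounds should be read as approximate.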
This analysis gave me a starting point for understanding what predicts app ratings, but it also raised more questions than it answered. The model as it stands has real limitations, namely the bounded response variable and heteroscedastic residuals, which suggest a linear regression may not be the most appropriate tool here.
Some questions I would want to investigate further include: Would
including the app Category as a categorical predictor
improve the model significantly? Would a beta regression model (designed
for proportion-like data) be a better fit once Rating, which is
bounded between 1 and 5, is rescaled to the interval from 0 to 1? And
finally, does the relationship between reviews and rating hold within
specific categories, or is it driven entirely by a few high-volume
categories like Games or Communication?
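Beta regression expects a response strictly between 0 and 1, so ratings on a 1 to 5 scale would need to be rescaled first. The sketch below shows one way to do that on a made-up vector of ratings. The compression step that pulls exact 0s and 1s inside the interval is a commonly used adjustment, and its use here is my assumption rather than part of the original analysis.

```r
# Hypothetical ratings covering both ends of the 1-to-5 scale
ratings <- c(1.0, 3.9, 4.1, 4.5, 5.0)

# Step 1: map the range [1, 5] onto [0, 1]
y01 <- (ratings - 1) / 4

# Step 2: compress slightly so exact 0s and 1s fall strictly inside (0, 1)
n <- length(y01)
y_open <- (y01 * (n - 1) + 0.5) / n

print(round(y_open, 3))

# With the betareg package installed (an assumption, not run here), a model
# could then be fit along these lines:
#   betareg::betareg(y_open ~ log_Reviews + Type, data = play_clean)
```

Coefficients from such a model would be on a logit scale by default, so they would not be directly comparable to the rating-point effects from the linear model.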
These are the kinds of questions I would want to explore in a follow-up analysis before drawing any firm conclusions from this model.
Dataset: Google Play Store Apps | Course: H510 Applied Statistics