Introduction

In this data dive, I build a linear model on the Google Play Store dataset to understand which factors are associated with an app’s user rating. Specifically, I want to know whether the number of reviews and the app’s type (free or paid) help predict its rating.


Loading the Data

I start by loading the necessary libraries and reading in the dataset.

library(tidyverse)  # loads dplyr and ggplot2, among others
library(broom)      # tidy() for model output

play <- read.csv("googleplaystore.csv", stringsAsFactors = FALSE)

head(play)
##                                                  App       Category Rating
## 1     Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN    4.1
## 2                                Coloring book moana ART_AND_DESIGN    3.9
## 3 U Launcher Lite – FREE Live Cool Themes, Hide Apps ART_AND_DESIGN    4.7
## 4                              Sketch - Draw & Paint ART_AND_DESIGN    4.5
## 5              Pixel Draw - Number Art Coloring Book ART_AND_DESIGN    4.3
## 6                         Paper flowers instructions ART_AND_DESIGN    4.4
##   Reviews Size    Installs Type Price Content.Rating                    Genres
## 1     159  19M     10,000+ Free     0       Everyone              Art & Design
## 2     967  14M    500,000+ Free     0       Everyone Art & Design;Pretend Play
## 3   87510 8.7M  5,000,000+ Free     0       Everyone              Art & Design
## 4  215644  25M 50,000,000+ Free     0           Teen              Art & Design
## 5     967 2.8M    100,000+ Free     0       Everyone   Art & Design;Creativity
## 6     167 5.6M     50,000+ Free     0       Everyone              Art & Design
##       Last.Updated        Current.Ver  Android.Ver
## 1  January 7, 2018              1.0.0 4.0.3 and up
## 2 January 15, 2018              2.0.0 4.0.3 and up
## 3   August 1, 2018              1.2.4 4.0.3 and up
## 4     June 8, 2018 Varies with device   4.2 and up
## 5    June 20, 2018                1.1   4.4 and up
## 6   March 26, 2017                1.0   2.3 and up

The dataset has 13 columns including app name, category, rating, number of reviews, size, installs, and whether the app is free or paid. I will focus on Rating, Reviews, and Type for this analysis.


Cleaning and Preparing the Data

Before building the model, I need to clean up the columns I plan to use. The Reviews column is stored as a character string, and there are some missing or invalid ratings, so I filter those out.

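play_clean <- play %>%
  # keep rows with a valid rating on the 1-5 scale
  filter(!is.na(Rating), Rating >= 1, Rating <= 5) %>%
  # Reviews is read as character; non-numeric entries become NA and are dropped
  mutate(Reviews = as.numeric(Reviews)) %>%
  filter(!is.na(Reviews)) %>%
  # keep only apps labelled Free or Paid
  filter(Type %in% c("Free", "Paid")) %>%
  # log(x + 1) keeps apps with zero reviews defined
  mutate(log_Reviews = log(Reviews + 1))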

nrow(play_clean)
## [1] 9366

After removing rows with missing or out-of-range ratings, non-numeric review counts, or a Type other than Free or Paid, I have 9,366 apps to work with. I also log-transform the Reviews column because review counts are heavily right-skewed: a few apps have millions of reviews while most have far fewer.
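
To illustrate the two transformations in the cleaning step, here is a minimal sketch on a few made-up review strings (the vector `reviews_raw` is hypothetical, not taken from the dataset). It shows how `as.numeric()` turns a non-numeric entry into `NA`, and why adding 1 before taking the log keeps zero-review apps defined.

```r
# Hypothetical raw values; "3.0M" stands in for a non-numeric entry
reviews_raw <- c("159", "967", "87510", "3.0M", "0")

# as.numeric() returns NA (with a warning) for strings it cannot parse
reviews_num <- suppressWarnings(as.numeric(reviews_raw))
is.na(reviews_num)

# Drop the unparseable entry, then apply the same log(x + 1) transform
reviews_ok  <- reviews_num[!is.na(reviews_num)]
log_reviews <- log(reviews_ok + 1)

# A zero review count maps to log(1) = 0 rather than -Inf
log_reviews
```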


Exploring the Variables

Before fitting a model, I want to get a feel for the distribution of ratings and how reviews relate to ratings.

ggplot(play_clean, aes(x = Rating)) +
  geom_histogram(binwidth = 0.1, fill = "steelblue", color = "white") +
  labs(title = "Distribution of App Ratings",
       x = "Rating",
       y = "Count")

Most apps have ratings clustered between 4.0 and 4.5, with a tail stretching toward lower ratings, so the response variable is left-skewed rather than normal. This is something I will come back to when diagnosing the model.

ggplot(play_clean, aes(x = log_Reviews, y = Rating)) +
  geom_point(alpha = 0.2, color = "steelblue") +
  geom_smooth(method = "lm", se = TRUE, color = "firebrick") +
  labs(title = "Log Reviews vs App Rating",
       x = "Log(Reviews + 1)",
       y = "Rating")

There appears to be a mild positive relationship between the log number of reviews and the rating. Apps with more reviews tend to have slightly higher ratings, though the spread is quite wide.

ggplot(play_clean, aes(x = Type, y = Rating, fill = Type)) +
  geom_boxplot() +
  labs(title = "Rating by App Type (Free vs Paid)",
       x = "App Type",
       y = "Rating")

The boxplot shows that paid apps tend to have a slightly higher median rating than free apps. This is an interesting pattern worth including in the model.


Building the Linear Model

I build a linear regression model using Rating as the response variable, and log_Reviews and Type as the explanatory variables.

model <- lm(Rating ~ log_Reviews + Type, data = play_clean)

summary(model)
## 
## Call:
## lm(formula = Rating ~ log_Reviews + Type, data = play_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.11384 -0.18001  0.05447  0.27715  1.05402 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.924866   0.012950  303.08  < 2e-16 ***
## log_Reviews 0.030462   0.001373   22.18  < 2e-16 ***
## TypePaid    0.167863   0.020825    8.06 8.53e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5019 on 9363 degrees of freedom
## Multiple R-squared:  0.05141,    Adjusted R-squared:  0.05121 
## F-statistic: 253.7 on 2 and 9363 DF,  p-value: < 2.2e-16

This model estimates the average app rating based on how many reviews an app has received (on a log scale) and whether it is free or paid. The output gives me coefficient estimates, standard errors, and p-values for each predictor.
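
To make the fitted equation concrete, the sketch below plugs the coefficient estimates from the summary output into the model formula by hand. The helper `predict_rating()` and the review count of 1,000 are hypothetical choices for illustration; with the fitted object available, `predict(model, newdata = ...)` would do the same job.

```r
# Coefficients copied from the summary(model) output above
b0     <- 3.924866   # intercept (free app, log_Reviews = 0)
b_log  <- 0.030462   # slope for log(Reviews + 1)
b_paid <- 0.167863   # shift for paid apps

# Hypothetical helper: predicted rating for a given review count and type
predict_rating <- function(reviews, paid = FALSE) {
  b0 + b_log * log(reviews + 1) + b_paid * as.numeric(paid)
}

pred_free <- predict_rating(1000, paid = FALSE)
pred_paid <- predict_rating(1000, paid = TRUE)
round(c(free = pred_free, paid = pred_paid), 2)

# Effect of a tenfold increase in (Reviews + 1) on the predicted rating
tenfold_effect <- b_log * log(10)
round(tenfold_effect, 3)
```

For an app with 1,000 reviews this gives a predicted rating of about 4.14 if it is free and about 4.30 if it is paid. A tenfold increase in reviews raises the predicted rating by only about 0.07 points, which shows how shallow the log_Reviews slope is.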


Diagnosing the Model

I use the standard four diagnostic plots to check whether the assumptions of linear regression are being met. These assumptions include linearity, constant variance of residuals, normality of residuals, and no influential outliers.

par(mfrow = c(2, 2))
plot(model)

Residuals vs Fitted: The residuals are not randomly scattered around zero. Instead, they fan out and show some curvature, which suggests the linearity and constant variance assumptions may not fully hold.

Normal Q-Q Plot: The residuals deviate noticeably from the diagonal line, especially at the tails. This tells me the residuals are not perfectly normally distributed.

Scale-Location: The spread of standardized residuals increases as the fitted values increase. This points to heteroscedasticity, meaning the variance of the residuals is not constant.

Residuals vs Leverage: A few points have high leverage and large residuals, which means they could be influencing the model more than they should.


Highlighting Issues with the Model

Based on the diagnostic plots above, I can identify a few clear problems with this model.

ggplot(play_clean, aes(x = Rating)) +
  geom_histogram(binwidth = 0.1, fill = "coral", color = "white") +
  labs(title = "App Ratings Are Left-Skewed and Bounded",
       x = "Rating",
       y = "Count")

The biggest issue is that Rating is a bounded variable, meaning it can only fall between 1 and 5. Linear regression does not know this and can technically predict values outside that range. The left skew in the ratings distribution also carries over into the residuals (they range from about -3.11 to 1.05 in the model summary), which is consistent with the departure from normality in the Q-Q plot.

A second issue is heteroscedasticity. The Scale-Location plot showed that the residual spread grows as fitted values grow, which violates the constant variance assumption. This means my standard errors and p-values may not be fully reliable.

A third issue is that log_Reviews and Type together are weak predictors of rating. The R-squared is only about 0.051, so the model explains roughly 5% of the variance in ratings. Other factors (like category, content type, or app quality) may explain rating better than review count and price type do.
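
One common response to the heteroscedasticity issue is to compare the usual standard errors with heteroscedasticity-consistent (White, HC0) ones. The sketch below is not part of the analysis above: it uses simulated data (`x`, `y`, and the noise pattern are assumptions chosen for illustration) and base R only, to show how both sets of standard errors are computed.

```r
set.seed(1)

# Simulated data where the noise grows with x (assumption for illustration)
n <- 500
x <- runif(n, 0, 10)
y <- 1 + 0.5 * x + rnorm(n, sd = 0.2 + 0.3 * x)

fit <- lm(y ~ x)
X <- model.matrix(fit)   # design matrix, including the intercept column
e <- residuals(fit)

# HC0 sandwich estimator: (X'X)^-1 [X' diag(e^2) X] (X'X)^-1
bread <- solve(crossprod(X))
meat  <- crossprod(X * e)   # each row of X scaled by its residual
vcov_hc0 <- bread %*% meat %*% bread

se_classic <- sqrt(diag(vcov(fit)))
se_robust  <- sqrt(diag(vcov_hc0))
cbind(classic = se_classic, robust = se_robust)
```

Applying the same steps to `model` would be one way to check whether the standard errors for log_Reviews and TypePaid change much once constant variance is no longer assumed.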


Interpreting a Coefficient

tidy(model)
## # A tibble: 3 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)   3.92     0.0130     303.   0        
## 2 log_Reviews   0.0305   0.00137     22.2  2.78e-106
## 3 TypePaid      0.168    0.0208       8.06 8.53e- 16

I want to interpret the coefficient for TypePaid. In the model, Free is the reference category, so the TypePaid coefficient tells me how much the predicted rating changes when an app is paid compared to a free app, while holding the number of reviews constant.

The coefficient for TypePaid is positive and approximately 0.17. This means that, on average, a paid app is predicted to have a rating about 0.17 points higher than a free app with the same number of reviews. While this seems small in absolute terms, it is statistically significant (p ≈ 8.5e-16). One possible explanation is that users who choose to pay for an app are a more motivated and satisfied group; another is that paid apps face higher quality expectations and tend to deliver on them. The model itself cannot distinguish between these explanations.
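
To put a range around that estimate, here is a minimal sketch of a 95% confidence interval for TypePaid, computed by hand from the estimate, standard error, and residual degrees of freedom reported in the model output. With the fitted object available, `confint(model, "TypePaid")` should give the same interval.

```r
# Values copied from the model output above
est <- 0.167863   # TypePaid estimate
se  <- 0.020825   # TypePaid standard error
df  <- 9363       # residual degrees of freedom

# 95% interval: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df)
ci <- c(lower = est - t_crit * se, upper = est + t_crit * se)
round(ci, 3)
```

The interval runs from roughly 0.127 to 0.209, so the data are consistent with a paid-app advantage anywhere in that range. Because it is built from the ordinary standard error, it inherits the constant-variance concern raised in the diagnostics.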


Reflection and Further Questions

This analysis gave me a starting point for understanding what predicts app ratings, but it also raised more questions than it answered. The model as it stands has real limitations, namely the bounded response variable and heteroscedastic residuals, which suggest a linear regression may not be the most appropriate tool here.

Some questions I would want to investigate further include: Would including the app Category as a categorical predictor improve the model significantly? Would a beta regression model (designed for proportion-like data) be a better fit once the 1-to-5 rating is rescaled to the unit interval? And finally, does the relationship between reviews and rating hold within specific categories, or is it driven entirely by a few high-volume categories like Games or Communication?

These are the kinds of questions I would want to explore in a follow-up analysis before drawing any firm conclusions from this model.
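
On the beta regression question, the response would first need to be mapped from the 1-to-5 scale onto the open interval (0, 1). The sketch below shows only that rescaling step on a few hypothetical ratings; the shrinkage formula for exact 0s and 1s is one common convention, and the model fitting itself (for example with a package such as betareg) is not shown.

```r
# Hypothetical ratings on the original 1-5 scale
rating <- c(1.0, 3.9, 4.1, 4.5, 5.0)

# Linear rescaling from [1, 5] to [0, 1]
rating01 <- (rating - 1) / 4

# Beta regression needs values strictly inside (0, 1), so exact 0s and 1s
# are pulled slightly inward; n is the number of observations
n <- length(rating01)
rating_open <- (rating01 * (n - 1) + 0.5) / n

range(rating_open)
```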


Dataset: Google Play Store Apps | Course: H510 Applied Statistics