Data Dive Week 12

install.packages("tsibble", repos = "https://cran.r-project.org")

## package 'tsibble' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\IU Student\AppData\Local\Temp\RtmpYnI4O5\downloaded_packages

install.packages("xts", repos = "https://cran.r-project.org")

## package 'xts' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\IU Student\AppData\Local\Temp\RtmpYnI4O5\downloaded_packages

install.packages("ggthemes", repos = "https://cran.r-project.org")

## package 'ggthemes' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\IU Student\AppData\Local\Temp\RtmpYnI4O5\downloaded_packages

library(tidyverse)
library(tsibble)
library(xts)
library(ggthemes)
library(lubridate)

Introduction

This week, I looked at how average app ratings in the Google Play Store have changed over time. The data set contains 10,840 app listings, each with a last-updated date. I’m treating that date as my time variable and average rating as the response variable, aggregated by month.

Before diving in, I want to flag a practical assumption I’m making here, which connects to something from the Week 13 lecture. The professor said that when you apply an IID assumption to your data, you’re assuming the population stays consistent throughout the period of observation; the example used was Twitter changing ownership and the user base shifting. A similar thing applies here. I’m assuming that the kinds of apps being published to the Play Store, and the kinds of users rating them, stayed roughly consistent from 2010 through 2018. In reality, the early years (2010–2013) have very few apps in the dataset, so those months may not be representative. I’ll try to keep that in mind when interpreting the early part of the series.

Step 1: Converting the Time Column

The “last updated” column comes in as a string like "January 7, 2018". I convert it to a proper Date using as.Date() with the matching format string.

playstore <- read.csv("GooglePlaystoreData.csv", stringsAsFactors = FALSE)

playstore <- playstore %>%
  mutate(
    date = as.Date(Last.Updated, format = "%B %d, %Y"),
    Rating = as.numeric(Rating)
  ) %>%
  filter(!is.na(date), !is.na(Rating), Rating <= 5)

head(playstore %>% select(App, Last.Updated, date, Rating))

##                                                  App     Last.Updated
## 1     Photo Editor & Candy Camera & Grid & ScrapBook  January 7, 2018
## 2                                Coloring book moana January 15, 2018
## 3 U Launcher Lite – FREE Live Cool Themes, Hide Apps   August 1, 2018
## 4                              Sketch - Draw & Paint     June 8, 2018
## 5              Pixel Draw - Number Art Coloring Book    June 20, 2018
## 6                         Paper flowers instructions   March 26, 2017
##         date Rating
## 1 2018-01-07    4.1
## 2 2018-01-15    3.9
## 3 2018-08-01    4.7
## 4 2018-06-08    4.5
## 5 2018-06-20    4.3
## 6 2017-03-26    4.4

The filter drops rows whose date or rating could not be parsed, along with any ratings above 5, since those are clearly errors in the raw data. After cleaning, I have 9,366 usable rows.
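
One thing worth checking at this step is whether every date string actually parsed, because as.Date() returns NA instead of raising an error when a string doesn’t match the format, and %B matches month names in the current locale. The sketch below uses a few made-up values (not rows from the data set) and assumes an English locale.

```r
# Hypothetical raw values; the malformed entry stands in for a corrupted row.
raw <- c("January 7, 2018", "August 1, 2018", "1.0.19", NA)

# %B = full month name (locale-dependent), %d = day, %Y = four-digit year.
parsed <- as.Date(raw, format = "%B %d, %Y")

# Strings that were present but could not be parsed end up as NA.
failed <- raw[!is.na(raw) & is.na(parsed)]

print(parsed)
print(failed)
```

Looking at the failed strings before filtering on !is.na(date) makes it clear whether rows are being dropped for a good reason.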

Step 2: Choosing a Response Variable

I’m using average rating per month as the response variable, where each app is assigned to the month it was last updated. The idea is to track whether user satisfaction with Play Store apps shifted over the 2010–2018 window. This is a natural response variable because it summarizes how users rated the apps associated with each point in time.

Step 3: Building the tsibble and Plotting Over Time

I aggregate to monthly average ratings, then build a tsibble.

monthly_ratings <- playstore %>%
  mutate(month = yearmonth(date)) %>%
  group_by(month) %>%
  summarise(avg_rating = mean(Rating, na.rm = TRUE)) %>%
  ungroup() %>%
  as_tsibble(index = month) %>%
  fill_gaps()

head(monthly_ratings)

## # A tsibble: 6 x 2 [1M]
##      month avg_rating
##      <mth>      <dbl>
## 1 2010 May        4.2
## 2 2010 Jun       NA  
## 3 2010 Jul       NA  
## 4 2010 Aug       NA  
## 5 2010 Sep       NA  
## 6 2010 Oct       NA

monthly_ratings %>%
  ggplot(aes(x = month, y = avg_rating)) +
  geom_line(color = "#0077A8") +
  labs(
    title = "Average App Rating Over Time",
    subtitle = "Monthly averages, Google Play Store (2010–2018)",
    x = "Month",
    y = "Average Rating"
  ) +
  theme_hc()

What stands out immediately: the early years (2010–2013) are very noisy. Average ratings swing wildly because very few apps were last updated in those months, so one or two outlier apps can drag the monthly average up or down. Starting around 2015, the series becomes much more stable, hovering consistently between 4.1 and 4.4. This suggests that the pre-2015 portion of the series is not reliable enough to analyze as a trend, because the sample size per month is just too small.
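
The sample-size point can be made concrete by carrying a per-month app count alongside the mean. This is a minimal sketch in base R on made-up data; the toy data frame and the cutoff of 4 apps are assumptions for illustration, not values from the analysis.

```r
# Toy data: one sparse month with an outlier, one dense month.
toy <- data.frame(
  date = as.Date(c("2012-03-05", "2012-03-20",
                   "2018-07-01", "2018-07-02", "2018-07-15", "2018-07-30")),
  Rating = c(2.0, 5.0, 4.3, 4.4, 4.2, 4.5)
)
toy$month <- format(toy$date, "%Y-%m")

# Mean rating and number of apps behind each monthly mean.
avg_per_month <- tapply(toy$Rating, toy$month, mean)
n_per_month <- tapply(toy$Rating, toy$month, length)

monthly <- data.frame(
  month = names(avg_per_month),
  avg_rating = as.numeric(avg_per_month),
  n_apps = as.integer(n_per_month)
)

# Hypothetical cutoff: keep only months with at least 4 apps.
reliable <- monthly[monthly$n_apps >= 4, ]
print(monthly)
print(reliable)
```

With the real data, the same count column would show directly how thin the early months are.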

Let me also look at just the 2015–2018 window where the data is denser.

monthly_ratings %>%
  filter(month >= yearmonth("2015 Jan")) %>%
  ggplot(aes(x = month, y = avg_rating)) +
  geom_line(color = "#0077A8") +
  labs(
    title = "Average App Rating Over Time (2015–2018)",
    subtitle = "Zoomed in on the stable, data-rich period",
    x = "Month",
    y = "Average Rating"
  ) +
  theme_hc()

This window is much cleaner. There’s a slight dip around early 2016, after which ratings seem to recover and hold steady through 2018. I’ll use this subset for the trend and seasonality analysis.

Step 4: Linear Regression to Detect Trend

I convert the 2015–2018 subset into an xts object and fit a linear regression with time as the predictor.

ratings_sub <- monthly_ratings %>%
  filter(month >= yearmonth("2015 Jan")) %>%
  mutate(date = as.Date(month))  # a yearmonth converts to the first day of its month

ratings_xts <- xts(
  x = ratings_sub$avg_rating,
  order.by = ratings_sub$date
)
ratings_xts <- setNames(ratings_xts, "avg_rating")

time_index <- 1:nrow(ratings_sub)
rating_vals <- as.numeric(ratings_xts)

trend_model <- lm(rating_vals ~ time_index)
summary(trend_model)

## 
## Call:
## lm(formula = rating_vals ~ time_index)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.207936 -0.072256  0.003676  0.042624  0.223475 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.025105   0.029242 137.649  < 2e-16 ***
## time_index  0.003264   0.001132   2.884  0.00617 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09534 on 42 degrees of freedom
## Multiple R-squared:  0.1653, Adjusted R-squared:  0.1454 
## F-statistic: 8.317 on 1 and 42 DF,  p-value: 0.006171

ratings_sub %>%
  mutate(time_index = row_number()) %>%
  ggplot(aes(x = month, y = avg_rating)) +
  geom_line(color = "#0077A8") +
  geom_smooth(method = "lm", se = TRUE, color = "#E8A838", linetype = "dashed") +
  labs(
    title = "App Ratings with Linear Trend (2015–2018)",
    x = "Month",
    y = "Average Rating"
  ) +
  theme_hc()

Interpretation: The linear model shows a very slight upward trend over the 2015–2018 period: the slope is about 0.0033 rating points per month (p ≈ 0.006), or roughly 0.14 points across the whole window. However, the R-squared is low (about 0.17), which means time alone doesn’t explain much of the variation in ratings. The trend is statistically detectable but weak: ratings didn’t dramatically improve or decline, they mostly stayed flat with some month-to-month noise. I don’t think I need to subset further for multiple trends within this window, since there’s no obvious structural break.
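
To put the slope in context, the fitted change over the window can be worked out from the reported coefficients. This is plain arithmetic on the numbers in the summary output; the count of 44 months is inferred from the 42 residual degrees of freedom plus the two estimated coefficients.

```r
# Coefficients copied from the summary(trend_model) output.
intercept <- 4.025105
slope <- 0.003264
n_months <- 44  # residual df (42) + 2 estimated coefficients

# Fitted values at the first and last time index.
fitted_first <- intercept + slope * 1
fitted_last <- intercept + slope * n_months
total_change <- fitted_last - fitted_first

print(round(c(first = fitted_first, last = fitted_last, change = total_change), 3))
```

A fitted change of about 0.14 on a 5-point scale is comparable in size to the residual standard error (about 0.095), which is another way of seeing that the trend is small relative to the noise.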

Further question: Is this slight upward trend driven by any particular category of apps improving over time, or is it evenly distributed across all categories?

Step 5: Smoothing to Detect Seasonality

I use a rolling average to smooth out the noise and look for any repeating seasonal pattern.

ratings_xts %>%
  rollapply(width = 3, FUN = function(x) mean(x, na.rm = TRUE), fill = NA) %>%
  ggplot(aes(x = Index, y = avg_rating)) +
  geom_line(color = "#0077A8") +
  labs(
    title = "App Ratings - 3-Month Rolling Average",
    subtitle = "Smoothed to reveal underlying pattern",
    x = "Date",
    y = "Avg Rating (smoothed)"
  ) +
  theme_hc()

Interpretation: After smoothing, a gentle wave pattern becomes slightly more visible. Ratings appear to dip in the first quarter of each year (roughly January–February) and recover by mid-year. This is a plausible seasonal pattern. One possible explanation, which I haven’t checked against the data, is that app updates cluster around the holiday season (Q4), and ratings after that burst of updates dip as newer, less polished apps hit the store.
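
For reference, here is what a centered 3-month rolling mean does to a short series, sketched in base R with made-up numbers. As far as I understand, rollapply() centers its window by default, so stats::filter() with sides = 2 should be the closest base R equivalent; treat that equivalence as an assumption.

```r
# Made-up monthly averages, for illustration only.
x <- c(4.0, 4.3, 4.1, 4.4, 4.2)

# Equal weights over a 3-month window; sides = 2 centers the window.
# stats:: is spelled out because dplyr masks filter() once the tidyverse is loaded.
smoothed <- as.numeric(stats::filter(x, rep(1 / 3, 3), sides = 2))

print(round(smoothed, 3))
```

The first and last values come back as NA because a centered window has no neighbor on one side there, which matches the fill = NA behavior in the plot code.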

ACF Plot

acf(na.omit(as.numeric(ratings_xts)),
    lag.max = 24,
    main = "ACF - Monthly App Ratings",
    xlab = "Lag (months)")

Interpretation: The ACF shows significant autocorrelation at lag 1 and lag 2, meaning this month’s average rating is correlated with the previous one or two months. This is consistent with what we see visually: the series doesn’t jump around randomly, it moves slowly. There’s a hint of a repeating pattern around lag 12, which would suggest an annual seasonal cycle, but it’s not strongly significant here given the limited number of years in the dataset.
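
As a sanity check on what "significant" means in the ACF plot, the sketch below computes the lag-1 autocorrelation by hand and the approximate 95% bound that acf() draws as dashed lines. The series is simulated as a stand-in for the 44 monthly averages; the AR coefficient and noise level are arbitrary choices, not estimates from the data.

```r
set.seed(42)
# Simulated stand-in for the 44 monthly averages (not the real series).
x <- 4.2 + as.numeric(arima.sim(model = list(ar = 0.5), n = 44, sd = 0.1))

n <- length(x)
m <- mean(x)

# Lag-1 sample autocorrelation: lagged cross-products over the total sum of squares.
r1_manual <- sum((x[-1] - m) * (x[-n] - m)) / sum((x - m)^2)
r1_acf <- acf(x, lag.max = 1, plot = FALSE)$acf[2]

# Approximate 95% bound used by the default acf() plot.
bound <- qnorm(0.975) / sqrt(n)

print(c(manual = r1_manual, acf = r1_acf, bound = bound))
```

With only 44 observations the bound is close to 0.3, so a lag-12 spike has to be fairly large before it clears the dashed line.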

PACF Plot

pacf(na.omit(as.numeric(ratings_xts)),
     lag.max = 24,
     main = "PACF - Monthly App Ratings",
     xlab = "Lag (months)")

Interpretation: The PACF cuts off sharply after lag 1, which suggests that most of the autocorrelation in this series is captured by just the previous month. In other words, once I account for last month’s rating, earlier months don’t add much additional information. This points toward an AR(1)-type structure in the data.
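
To illustrate what an AR(1)-type structure looks like, here is a minimal sketch that simulates an AR(1) series and fits it with arima() from base R. The coefficient of 0.6, the mean of 4.2, and the series length are illustrative assumptions, not estimates from the ratings data.

```r
set.seed(123)
# Simulate an AR(1) series with coefficient 0.6 around a mean of 4.2 (illustrative values).
x <- 4.2 + arima.sim(model = list(ar = 0.6), n = 500, sd = 0.1)

# Fit an AR(1) model: order = c(p, d, q) with p = 1, no differencing, no MA terms.
fit <- arima(x, order = c(1, 0, 0))
print(coef(fit))

# For an AR(1) process the PACF should be large at lag 1 and near zero afterwards.
p <- as.numeric(pacf(x, lag.max = 5, plot = FALSE)$acf)
print(round(p, 2))
```

To try this on the real series, x could be replaced with na.omit(as.numeric(ratings_xts)), keeping in mind that 44 points give a much noisier estimate than the 500 simulated here.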

Summary

Here’s what I found from the time series analysis of Google Play Store ratings:

1. The “Last Updated” column converted cleanly to dates, which I aggregated by month. Pre-2015 data is too sparse to be reliable for trend analysis.

2. Monthly average rating is a reasonable response variable: it’s a meaningful signal of how satisfied users were with apps in a given period.

3. There’s a very slight upward trend in ratings from 2015–2018, but it’s weak. Ratings were more or less stable across this period, hovering around 4.1–4.4.

4. I’m assuming the app population and user base stayed roughly consistent from 2010–2018. As the professor said in the Week 13 lecture, this kind of IID assumption is critical to flag. If the kinds of apps or users changed substantially over time (and they probably did to some extent), then my trend interpretation needs to be read with that caveat in mind.

Further questions: A natural next step would be to break this analysis down by category. For example, do Education apps show a stronger upward trend than Games? And is the seasonal dip in Q1 consistent across all categories, or concentrated in a few?