Homework 2

google-playstore

apps

reviews

Getting a working version of your final project

Author

Rohan Lekhwani

Published

January 22, 2024

Code

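library(tidyverse) # attaches dplyr, ggplot2, readr, stringr and forcats
library(lubridate) # mdy() is used below; loaded explicitly in case the installed tidyverse does not attach it
library(readxl)
library(here)
library(dplyr)
library(viridis)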

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Reading the Data

The working directory for RStudio has been set with setwd() so that “googleplaystore.csv” can be found at its root.

Code

googleplaystore_orig <- read_csv(here("googleplaystore.csv"))
googleplaystore_orig

# A tibble: 10,841 × 13
   App       Category Rating Reviews Size  Installs Type  Price `Content Rating`
   <chr>     <chr>     <dbl>   <dbl> <chr> <chr>    <chr> <chr> <chr>           
 1 Photo Ed… ART_AND…    4.1     159 19M   10,000+  Free  0     Everyone        
 2 Coloring… ART_AND…    3.9     967 14M   500,000+ Free  0     Everyone        
 3 U Launch… ART_AND…    4.7   87510 8.7M  5,000,0… Free  0     Everyone        
 4 Sketch -… ART_AND…    4.5  215644 25M   50,000,… Free  0     Teen            
 5 Pixel Dr… ART_AND…    4.3     967 2.8M  100,000+ Free  0     Everyone        
 6 Paper fl… ART_AND…    4.4     167 5.6M  50,000+  Free  0     Everyone        
 7 Smoke Ef… ART_AND…    3.8     178 19M   50,000+  Free  0     Everyone        
 8 Infinite… ART_AND…    4.1   36815 29M   1,000,0… Free  0     Everyone        
 9 Garden C… ART_AND…    4.4   13791 33M   1,000,0… Free  0     Everyone        
10 Kids Pai… ART_AND…    4.7     121 3.1M  10,000+  Free  0     Everyone        
# ℹ 10,831 more rows
# ℹ 4 more variables: Genres <chr>, `Last Updated` <chr>, `Current Ver` <chr>,
#   `Android Ver` <chr>

Dataset Description

High Level Description

The dataset comprises 10,841 rows and 13 columns.

Code

googleplaystore_orig

# A tibble: 10,841 × 13
   App       Category Rating Reviews Size  Installs Type  Price `Content Rating`
   <chr>     <chr>     <dbl>   <dbl> <chr> <chr>    <chr> <chr> <chr>           
 1 Photo Ed… ART_AND…    4.1     159 19M   10,000+  Free  0     Everyone        
 2 Coloring… ART_AND…    3.9     967 14M   500,000+ Free  0     Everyone        
 3 U Launch… ART_AND…    4.7   87510 8.7M  5,000,0… Free  0     Everyone        
 4 Sketch -… ART_AND…    4.5  215644 25M   50,000,… Free  0     Teen            
 5 Pixel Dr… ART_AND…    4.3     967 2.8M  100,000+ Free  0     Everyone        
 6 Paper fl… ART_AND…    4.4     167 5.6M  50,000+  Free  0     Everyone        
 7 Smoke Ef… ART_AND…    3.8     178 19M   50,000+  Free  0     Everyone        
 8 Infinite… ART_AND…    4.1   36815 29M   1,000,0… Free  0     Everyone        
 9 Garden C… ART_AND…    4.4   13791 33M   1,000,0… Free  0     Everyone        
10 Kids Pai… ART_AND…    4.7     121 3.1M  10,000+  Free  0     Everyone        
# ℹ 10,831 more rows
# ℹ 4 more variables: Genres <chr>, `Last Updated` <chr>, `Current Ver` <chr>,
#   `Android Ver` <chr>

The dataset has 11 <chr> columns; the remaining two columns (Rating and Reviews) are of the <dbl> type. Each observation pertains to a single app. The variables in the dataset are described below:

The App variable lists the name of the app under observation.
The Category variable consists of the category the app under observation is grouped into.
The Rating variable lists the rating of the app on the Google Play Store.
The Reviews variable represents the number of user reviews the app received on the Google Play Store.
The Size variable consists of the size of the app (either in KB, MB or varying by device).
The Installs variable marks the number of devices the app is installed on.
The Type variable lists whether an app is “Free” or “Paid”.
The Price variable marks the price of the app if the Type is “Paid” and 0 for “Free” apps.
The Content Rating variable mentions the suitable user group the app targets.
The Genres variable represents which genre the app falls under. This is similar to the Category variable, but an app can belong to multiple genres (apart from its main category). For example, a musical family game will belong to the Music, Game and Family genres.
The Last Updated variable gives the date on which the latest update was published on the Google Play Store.
The Current Ver variable marks the current published version of the app.
The Android Ver variable lists the target Android version of the device the app will work on.

How was the Data likely collected?

The dataset is a published Kaggle dataset. It was collected by web-scraping Google Play Store listings in 2018. The detail page for an app on the Google Play Store provides all of the information from which the dataset is built.

The following query gives the total apps under each category:

Code

googleplaystore_orig %>% 
  group_by(Category) %>%
  summarize(total_apps = n()) %>%
  arrange(desc(total_apps))

# A tibble: 33 × 2
   Category        total_apps
   <chr>                <int>
 1 FAMILY                1972
 2 GAME                  1144
 3 TOOLS                  844
 4 MEDICAL                463
 5 BUSINESS               460
 6 PRODUCTIVITY           424
 7 PERSONALIZATION        392
 8 COMMUNICATION          387
 9 SPORTS                 384
10 LIFESTYLE              382
# ℹ 23 more rows

We see that the “FAMILY” category has the most apps in the dataset.

Tidying the Data

Remove Duplicates

The dataset has 10,841 observations and 13 variables. However, it seems to contain duplicate rows, as the following count of distinct rows shows.

Code

n_distinct(googleplaystore_orig)

[1] 10358

The following query de-duplicates the dataset based on fully matching rows:

Code

googleplaystore_orig %>% distinct()

# A tibble: 10,358 × 13
   App       Category Rating Reviews Size  Installs Type  Price `Content Rating`
   <chr>     <chr>     <dbl>   <dbl> <chr> <chr>    <chr> <chr> <chr>           
 1 Photo Ed… ART_AND…    4.1     159 19M   10,000+  Free  0     Everyone        
 2 Coloring… ART_AND…    3.9     967 14M   500,000+ Free  0     Everyone        
 3 U Launch… ART_AND…    4.7   87510 8.7M  5,000,0… Free  0     Everyone        
 4 Sketch -… ART_AND…    4.5  215644 25M   50,000,… Free  0     Teen            
 5 Pixel Dr… ART_AND…    4.3     967 2.8M  100,000+ Free  0     Everyone        
 6 Paper fl… ART_AND…    4.4     167 5.6M  50,000+  Free  0     Everyone        
 7 Smoke Ef… ART_AND…    3.8     178 19M   50,000+  Free  0     Everyone        
 8 Infinite… ART_AND…    4.1   36815 29M   1,000,0… Free  0     Everyone        
 9 Garden C… ART_AND…    4.4   13791 33M   1,000,0… Free  0     Everyone        
10 Kids Pai… ART_AND…    4.7     121 3.1M  10,000+  Free  0     Everyone        
# ℹ 10,348 more rows
# ℹ 4 more variables: Genres <chr>, `Last Updated` <chr>, `Current Ver` <chr>,
#   `Android Ver` <chr>

However, this does not handle the case where the same app appears in several rows that differ only in other variables, such as the number of reviews:

Code

googleplaystore_orig %>% 
  distinct() %>%
  select(App, Category, Rating, Reviews) %>%
  arrange(desc(Reviews))

# A tibble: 10,358 × 4
   App                                      Category      Rating  Reviews
   <chr>                                    <chr>          <dbl>    <dbl>
 1 Facebook                                 SOCIAL           4.1 78158306
 2 Facebook                                 SOCIAL           4.1 78128208
 3 WhatsApp Messenger                       COMMUNICATION    4.4 69119316
 4 WhatsApp Messenger                       COMMUNICATION    4.4 69109672
 5 Instagram                                SOCIAL           4.5 66577446
 6 Instagram                                SOCIAL           4.5 66577313
 7 Instagram                                SOCIAL           4.5 66509917
 8 Messenger – Text and Video Chat for Free COMMUNICATION    4   56646578
 9 Messenger – Text and Video Chat for Free COMMUNICATION    4   56642847
10 Clash of Clans                           GAME             4.6 44893888
# ℹ 10,348 more rows

The following query de-duplicates the dataset based on the App variable:

Code

googleplaystore_distinct <- googleplaystore_orig %>%
  distinct(App, .keep_all = TRUE) # keeps the first row found for each App
googleplaystore_distinct

# A tibble: 9,660 × 13
   App       Category Rating Reviews Size  Installs Type  Price `Content Rating`
   <chr>     <chr>     <dbl>   <dbl> <chr> <chr>    <chr> <chr> <chr>           
 1 Photo Ed… ART_AND…    4.1     159 19M   10,000+  Free  0     Everyone        
 2 Coloring… ART_AND…    3.9     967 14M   500,000+ Free  0     Everyone        
 3 U Launch… ART_AND…    4.7   87510 8.7M  5,000,0… Free  0     Everyone        
 4 Sketch -… ART_AND…    4.5  215644 25M   50,000,… Free  0     Teen            
 5 Pixel Dr… ART_AND…    4.3     967 2.8M  100,000+ Free  0     Everyone        
 6 Paper fl… ART_AND…    4.4     167 5.6M  50,000+  Free  0     Everyone        
 7 Smoke Ef… ART_AND…    3.8     178 19M   50,000+  Free  0     Everyone        
 8 Infinite… ART_AND…    4.1   36815 29M   1,000,0… Free  0     Everyone        
 9 Garden C… ART_AND…    4.4   13791 33M   1,000,0… Free  0     Everyone        
10 Kids Pai… ART_AND…    4.7     121 3.1M  10,000+  Free  0     Everyone        
# ℹ 9,650 more rows
# ℹ 4 more variables: Genres <chr>, `Last Updated` <chr>, `Current Ver` <chr>,
#   `Android Ver` <chr>
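
Note that distinct(App, .keep_all = TRUE) keeps the first row it encounters for each app, which is not necessarily the row with the latest review count. As a minimal sketch on a hand-made data frame (the `apps` values are illustrative, not rows from the dataset), base R's duplicated() reproduces that rule, and sorting by Reviews first keeps the most-reviewed row instead:

```r
# Toy data: illustrative values only, not rows from googleplaystore.csv
apps <- data.frame(
  App     = c("Alpha", "Alpha", "Beta", "Beta", "Gamma"),
  Reviews = c(100, 120, 50, 40, 7),
  stringsAsFactors = FALSE
)

# Keep the first row seen for each App (same rule as distinct(App, .keep_all = TRUE))
first_seen <- apps[!duplicated(apps$App), ]

# Alternative: sort by Reviews (descending) first, so the row that is kept
# is the one with the most reviews for each App
by_reviews   <- apps[order(-apps$Reviews), ]
most_reviews <- by_reviews[!duplicated(by_reviews$App), ]

print(first_seen)
print(most_reviews)
```

Which rule is preferable depends on whether the first or the most-reviewed snapshot of an app is considered more representative.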

Missing Values

Using the following query, we observe that 1,465 apps do not have a valid Rating. This could be due to several reasons, including too few installations, too few reviews, a recently updated version of the app, or discrepancies while scraping the data from the Google Play Store.

Code

# Gets all observations with at least one NA column
googleplaystore_distinct[!complete.cases(googleplaystore_distinct), ] %>%
  select(App, Installs, Reviews, `Last Updated`, Rating)

# A tibble: 1,465 × 5
   App                                 Installs Reviews `Last Updated`    Rating
   <chr>                               <chr>      <dbl> <chr>              <dbl>
 1 Mcqueen Coloring pages              100,000+      61 March 7, 2018        NaN
 2 Wrinkles and rejuvenation           100,000+     182 September 20, 20…    NaN
 3 Manicure - nail design              50,000+      119 July 23, 2018        NaN
 4 Skin Care and Natural Beauty        100,000+     654 July 17, 2018        NaN
 5 Secrets of beauty, youth and health 10,000+       77 August 8, 2017       NaN
 6 Recipes and tips for losing weight  10,000+       35 December 11, 2017    NaN
 7 Lady adviser (beauty, health)       10,000+       30 January 24, 2018     NaN
 8 Anonymous caller detection          10,000+      161 July 13, 2018        NaN
 9 SH-02J Owner's Manual (Android 8.0) 50,000+        2 June 15, 2018        NaN
10 URBANO V 02 instruction manual      100,000+     114 August 7, 2015       NaN
# ℹ 1,455 more rows
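
Because complete.cases() flags a row when any column is missing, a row count obtained this way is an upper bound on the number of missing ratings. A per-column tally separates the two. The sketch below uses a small made-up data frame (`toy` and its columns are hypothetical) and base R only:

```r
# Toy data: illustrative values only
toy <- data.frame(
  App     = c("A", "B", "C", "D"),
  Rating  = c(4.1, NaN, NaN, 3.9),
  Version = c("1.0", "2.1", NA, NA),
  stringsAsFactors = FALSE
)

# Rows with at least one missing value in any column
incomplete_rows <- sum(!complete.cases(toy))

# Missing values per column; is.na() is TRUE for both NA and NaN
missing_per_column <- colSums(is.na(toy))

print(incomplete_rows)
print(missing_per_column)
```

Here three rows are incomplete, but only two of them lack a Rating; the third is flagged because of the Version column.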

Additionally, there is an app with a “NaN” Type, as shown by the following query. A possible reason could be that the app is new (as seen by its 0 Installs) and does not have an assigned Type yet.

Code

googleplaystore_distinct %>%
  filter(str_detect(Type, "NaN"))

# A tibble: 1 × 13
  App        Category Rating Reviews Size  Installs Type  Price `Content Rating`
  <chr>      <chr>     <dbl>   <dbl> <chr> <chr>    <chr> <chr> <chr>           
1 Command &… FAMILY      NaN       0 Vari… 0        NaN   0     Everyone 10+    
# ℹ 4 more variables: Genres <chr>, `Last Updated` <chr>, `Current Ver` <chr>,
#   `Android Ver` <chr>

We do not remove these observations from the dataset, since their variables other than Rating and Type still contain useful information.

Mutating the Data

The following mutations should be applied to the dataset to make it easier to analyze and work with:

The Last Updated variable should be read as a date type. The lubridate package is useful to do so.
The Category, Type and Content Rating variables can be read as factor.
The Price variable can be trimmed of the ‘$’ sign and be read as a dbl.
The Installs variable can be trimmed of the ‘+’ sign and be read as a dbl.

The following query achieves these mutations:

Code

googleplaystore <- googleplaystore_distinct %>%
  mutate(`Last Updated`= mdy(`Last Updated`),
         Type=as.factor(Type),
         Category=as.factor(Category),
         `Content Rating`=as.factor(`Content Rating`),
         Price=as.numeric(gsub("\\$", "", Price)),
         Installs=as.numeric(gsub("[^0-9]", "", Installs)))
googleplaystore

# A tibble: 9,660 × 13
   App       Category Rating Reviews Size  Installs Type  Price `Content Rating`
   <chr>     <fct>     <dbl>   <dbl> <chr>    <dbl> <fct> <dbl> <fct>           
 1 Photo Ed… ART_AND…    4.1     159 19M      10000 Free      0 Everyone        
 2 Coloring… ART_AND…    3.9     967 14M     500000 Free      0 Everyone        
 3 U Launch… ART_AND…    4.7   87510 8.7M   5000000 Free      0 Everyone        
 4 Sketch -… ART_AND…    4.5  215644 25M   50000000 Free      0 Teen            
 5 Pixel Dr… ART_AND…    4.3     967 2.8M    100000 Free      0 Everyone        
 6 Paper fl… ART_AND…    4.4     167 5.6M     50000 Free      0 Everyone        
 7 Smoke Ef… ART_AND…    3.8     178 19M      50000 Free      0 Everyone        
 8 Infinite… ART_AND…    4.1   36815 29M    1000000 Free      0 Everyone        
 9 Garden C… ART_AND…    4.4   13791 33M    1000000 Free      0 Everyone        
10 Kids Pai… ART_AND…    4.7     121 3.1M     10000 Free      0 Everyone        
# ℹ 9,650 more rows
# ℹ 4 more variables: Genres <chr>, `Last Updated` <date>, `Current Ver` <chr>,
#   `Android Ver` <chr>
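
To see what each conversion does in isolation, here is a small sketch on hand-written strings in the same formats as the raw columns. It sticks to base R, so as.Date() with an explicit format stands in for mdy(); this assumes the session locale uses English month names.

```r
# Hand-written examples in the same formats as the raw columns
price_raw    <- c("0", "$4.99", "$399.99")
installs_raw <- c("10,000+", "5,000,000+", "0")
updated_raw  <- c("January 7, 2018", "August 1, 2018")

# Strip the currency sign, then convert to numeric
price <- as.numeric(gsub("\\$", "", price_raw))

# Drop every non-digit character ("," and "+"), then convert to numeric
installs <- as.numeric(gsub("[^0-9]", "", installs_raw))

# Base R stand-in for mdy(); %B relies on English month names in the locale
updated <- as.Date(updated_raw, format = "%B %d, %Y")

# A factor stores the distinct labels once, as levels
type <- factor(c("Free", "Paid", "Free"))

print(price)
print(installs)
print(updated)
print(levels(type))
```

Stripping all non-digits is safe for Installs because its values are whole numbers; the same pattern would remove the decimal point from a value such as "$4.99", which is why Price only has the ‘$’ sign removed.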

Descriptive Statistics

Summary Statistics on Ratings

The following query summarizes the mean, median and standard deviation for the Rating variable across all apps.

Code

googleplaystore %>%
  summarize(mean_rating = mean(Rating, na.rm = T),
            median_rating = median(Rating, na.rm = T),
            sd_rating = sd(Rating, na.rm = T))

# A tibble: 1 × 3
  mean_rating median_rating sd_rating
        <dbl>         <dbl>     <dbl>
1        4.17           4.3     0.537

The above summary tibble shows that most of the apps are well-rated. Rating statistics when grouped by app category can be computed using the following query. These have been arranged in descending order of number of reviews received:

Code

googleplaystore %>%
  group_by(Category) %>%
  summarize(review_count = sum(Reviews),
            mean_rating = mean(Rating, na.rm = T),
            median_rating = median(Rating, na.rm = T),
            sd_rating = sd(Rating, na.rm = T)) %>%
  arrange(desc(review_count))

# A tibble: 33 × 5
   Category        review_count mean_rating median_rating sd_rating
   <fct>                  <dbl>       <dbl>         <dbl>     <dbl>
 1 GAME               622298709        4.25           4.3     0.384
 2 COMMUNICATION      285811368        4.12           4.2     0.470
 3 TOOLS              229356597        4.04           4.2     0.625
 4 SOCIAL             227927801        4.25           4.3     0.457
 5 FAMILY             143825488        4.18           4.3     0.523
 6 PHOTOGRAPHY        105351270        4.16           4.3     0.494
 7 VIDEO_PLAYERS       67484568        4.04           4.2     0.564
 8 PRODUCTIVITY        55590649        4.18           4.3     0.534
 9 PERSONALIZATION     53543080        4.33           4.4     0.359
10 SHOPPING            44551730        4.23           4.3     0.445
# ℹ 23 more rows

The above tibble is also complemented with the boxplot for the top 10 most reviewed categories below. We observe that apps categorized as “PERSONALIZATION” have the highest mean rating among these.

Code

top_categories <- googleplaystore %>%
  group_by(Category) %>%
  summarize(SumReviews = sum(Reviews), MeanRating = mean(Rating, na.rm = TRUE)) %>%
  slice_max(SumReviews, n = 10) # top_n() is superseded by slice_max()

googleplaystore %>%
  filter(Category %in% top_categories$Category) %>%
  # Order the x-axis by total reviews; arrange() on rows does not change a ggplot axis
  mutate(Category = fct_reorder(droplevels(Category), Reviews, .fun = sum, .desc = TRUE)) %>%
  ggplot(aes(x = Category, y = Rating, fill = Category)) +
  geom_boxplot() +
  labs(title = "Top 10 Most Reviewed Categories\nBox Plot of Ratings", x = "Category", y = "Rating") +
  theme(legend.position = "none", # To remove the legend
        plot.title = element_text(hjust=0.5),
        axis.text.x = element_text(angle = 45, hjust = 1))

Summary Statistics on Reviews

The number of reviews is an implicit indicator of the popularity of an app. The following query summarizes the mean, median and standard deviation of the Reviews variable across all apps.

Code

googleplaystore %>%
  summarize(mean_reviews = mean(Reviews),
            median_reviews = median(Reviews),
            sd_reviews = sd(Reviews))

# A tibble: 1 × 3
  mean_reviews median_reviews sd_reviews
         <dbl>          <dbl>      <dbl>
1      216570.            967   1831226.

From the above tibble we observe that mean_reviews is far larger than median_reviews, implying the presence of a few outlier apps with a huge number of reviews.
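
The gap between the mean and the median is a quick diagnostic for skew. A minimal sketch with made-up review counts (the numbers are illustrative, not taken from the dataset) shows how a single very large value pulls the mean far above the median:

```r
# Illustrative review counts: nine small apps and one very popular app
reviews <- c(10, 25, 40, 60, 90, 120, 200, 350, 500, 5000000)

mean_reviews   <- mean(reviews)
median_reviews <- median(reviews)

# Quartiles are far less sensitive to the extreme value than the mean
quartiles <- quantile(reviews, probs = c(0.25, 0.5, 0.75))

print(mean_reviews)
print(median_reviews)
print(quartiles)
```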

Code

googleplaystore %>%
  select(App, Reviews) %>%
  arrange(desc(Reviews)) %>%
  head()

# A tibble: 6 × 2
  App                                       Reviews
  <chr>                                       <dbl>
1 Facebook                                 78158306
2 WhatsApp Messenger                       69119316
3 Instagram                                66577313
4 Messenger – Text and Video Chat for Free 56642847
5 Clash of Clans                           44891723
6 Clean Master- Space Cleaner & Antivirus  42916526

From the above tibble we observe apps with the highest number of reviews. Review statistics can also be grouped by app category and can be computed using the following query. These have been arranged in descending order of mean reviews:

Code

googleplaystore %>%
  group_by(Category) %>%
  summarize(mean_reviews = mean(Reviews),
            median_reviews = median(Reviews),
            sd_reviews = sd(Reviews)) %>%
  arrange(desc(mean_reviews))

# A tibble: 33 × 4
   Category      mean_reviews median_reviews sd_reviews
   <fct>                <dbl>          <dbl>      <dbl>
 1 SOCIAL             953673.          3782    6753581.
 2 COMMUNICATION      907338.          1711    5335063.
 3 GAME               648904.         28510    2502348.
 4 VIDEO_PLAYERS      414016.          4585    2233854.
 5 PHOTOGRAPHY        374916.         26252    1147458.
 6 ENTERTAINMENT      340810.         37884.   1036260.
 7 TOOLS              277001.           475    2046758.
 8 SHOPPING           220553.         11076.    822197.
 9 WEATHER            155635.         11297     445494.
10 PRODUCTIVITY       148638.          1161     504486.
# ℹ 23 more rows

This matches our previous finding. Apps published within the “SOCIAL”, “COMMUNICATION” and “GAME” categories tend to have the most reviews even though they do not have the highest mean rating. This may be because these categories have more users, so their ratings are averaged over many more opinions and are less skewed. The following 2D density plot of mean reviews against mean rating per category illustrates this observation.

Code

googleplaystore %>%
  group_by(Category) %>%
  summarize(mean_reviews = mean(Reviews),
            mean_rating = mean(Rating, na.rm = T)) %>%
  ggplot(aes(x = mean_rating, y = mean_reviews)) +
  geom_density2d() +
  labs(title = "2D Density Plot\nMean Reviews vs Mean Ratings for App Categories", x = "Mean Ratings", y = "Mean Reviews") +
  scale_y_continuous(labels = scales::label_number(scale = 1e-3, suffix = "K")) +
  theme(plot.title = element_text(hjust=0.5),
        axis.text.x = element_text(angle = 45, hjust = 1))

Summary Statistics on Category

The following query provides a summary over the count of apps published under each category as well as their proportion within the total number of apps.

Code

category_statistics <- googleplaystore %>%
  group_by(Category) %>%
  summarize(app_count = n(),
            app_proportion = n() / nrow(googleplaystore)) %>%
  arrange(desc(app_count))
category_statistics

# A tibble: 33 × 3
   Category        app_count app_proportion
   <fct>               <int>          <dbl>
 1 FAMILY               1832         0.190 
 2 GAME                  959         0.0993
 3 TOOLS                 828         0.0857
 4 BUSINESS              420         0.0435
 5 MEDICAL               395         0.0409
 6 PERSONALIZATION       376         0.0389
 7 PRODUCTIVITY          374         0.0387
 8 LIFESTYLE             369         0.0382
 9 FINANCE               345         0.0357
10 SPORTS                325         0.0336
# ℹ 23 more rows

From the above tibble we observe that the largest number of apps in our dataset is published under the “FAMILY” category. The following pie chart visualizes the same proportions.

Code

# Refer to columns by name inside aes(); the data argument already supplies category_statistics
ggplot(category_statistics, aes(x = "", y = app_proportion, fill = Category)) +
  geom_bar(stat = "identity", width = 1, color = "white") +
  coord_polar("y") +
  theme_void() +
  labs(title = "Proportion of Apps by Category", fill = "Category") +
  theme(plot.title = element_text(hjust=0.5))

Summary Statistics on App Type

The following query counts the apps of each Type (“Free” or “Paid”) and their proportion within the dataset.

Code

googleplaystore %>%
  filter(!str_detect(Type, "NaN")) %>%
  group_by(Type) %>%
  summarize(count = n(),
            proportion = n() / nrow(googleplaystore))

# A tibble: 2 × 3
  Type  count proportion
  <fct> <int>      <dbl>
1 Free   8903     0.922 
2 Paid    756     0.0783

We observe that about 92% of the apps in our dataset are “Free”. Another interesting statistic is the set of most installed apps for each Type:

Code

googleplaystore %>%
  filter(!str_detect(Type, "NaN")) %>%
  # Installs is already numeric after the mutation step, so it is not parsed again
  group_by(Type) %>%
  filter(Installs == max(Installs)) %>%
  arrange(desc(Type))

# A tibble: 22 × 13
# Groups:   Type [2]
   App       Category Rating Reviews Size  Installs Type  Price `Content Rating`
   <chr>     <fct>     <dbl>   <dbl> <chr>    <dbl> <fct> <dbl> <fct>           
 1 Minecraft FAMILY      4.5  2.38e6 Vari…      1e7 Paid   6.99 Everyone 10+    
 2 Hitman S… GAME        4.6  4.08e5 29M        1e7 Paid   0.99 Mature 17+      
 3 Google P… BOOKS_A…    3.9  1.43e6 Vari…      1e9 Free   0    Teen            
 4 Messenge… COMMUNI…    4    5.66e7 Vari…      1e9 Free   0    Everyone        
 5 WhatsApp… COMMUNI…    4.4  6.91e7 Vari…      1e9 Free   0    Everyone        
 6 Google C… COMMUNI…    4.3  9.64e6 Vari…      1e9 Free   0    Everyone        
 7 Gmail     COMMUNI…    4.3  4.60e6 Vari…      1e9 Free   0    Everyone        
 8 Hangouts  COMMUNI…    4    3.42e6 Vari…      1e9 Free   0    Everyone        
 9 Skype - … COMMUNI…    4.1  1.05e7 Vari…      1e9 Free   0    Everyone        
10 Google P… ENTERTA…    4.3  7.17e6 Vari…      1e9 Free   0    Teen            
# ℹ 12 more rows
# ℹ 4 more variables: Genres <chr>, `Last Updated` <date>, `Current Ver` <chr>,
#   `Android Ver` <chr>

We observe that “Minecraft” and “Hitman Sniper” are the most installed “Paid” apps, each with at least 10 million installs, while the remaining apps are the most installed “Free” apps, each with at least a billion installs. An interesting observation is that a majority of the most downloaded “Free” apps on the Google Play Store are owned by Google itself!

Research Questions

We’ll now use the dataset to answer the following research questions:

Question 1 - Does the `Size` of an app relate to the number of `Installs`?

Users tend not to prefer bulky apps. Yet in our observations above, “GAME”, a category whose apps tend to be larger, was among the most reviewed categories. An important research question is therefore which app size, if any, maximizes the number of installs.

A normalized version of the Size variable, Size_KB, is created to express all sizes in kilobytes.

Code

googleplaystore %>%
  filter(!str_detect(Size, "Varies with device")) %>%
  # parse_number() keeps the decimal point ("8.7M" -> 8.7); stripping all non-digits would give 87
  mutate(Size_KB = ifelse(grepl("M", Size), parse_number(Size) * 1024, parse_number(Size)))

# A tibble: 8,433 × 14
   App       Category Rating Reviews Size  Installs Type  Price `Content Rating`
   <chr>     <fct>     <dbl>   <dbl> <chr>    <dbl> <fct> <dbl> <fct>           
 1 Photo Ed… ART_AND…    4.1     159 19M      10000 Free      0 Everyone        
 2 Coloring… ART_AND…    3.9     967 14M     500000 Free      0 Everyone        
 3 U Launch… ART_AND…    4.7   87510 8.7M   5000000 Free      0 Everyone        
 4 Sketch -… ART_AND…    4.5  215644 25M   50000000 Free      0 Teen            
 5 Pixel Dr… ART_AND…    4.3     967 2.8M    100000 Free      0 Everyone        
 6 Paper fl… ART_AND…    4.4     167 5.6M     50000 Free      0 Everyone        
 7 Smoke Ef… ART_AND…    3.8     178 19M      50000 Free      0 Everyone        
 8 Infinite… ART_AND…    4.1   36815 29M    1000000 Free      0 Everyone        
 9 Garden C… ART_AND…    4.4   13791 33M    1000000 Free      0 Everyone        
10 Kids Pai… ART_AND…    4.7     121 3.1M     10000 Free      0 Everyone        
# ℹ 8,423 more rows
# ℹ 5 more variables: Genres <chr>, `Last Updated` <date>, `Current Ver` <chr>,
#   `Android Ver` <chr>, Size_KB <dbl>
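
When normalizing Size, the decimal point has to survive the clean-up: removing every non-digit character would turn "8.7M" into 87. The following sketch, written in base R on hand-picked strings, keeps digits and the dot and scales megabytes to kilobytes. The helper name `size_to_kb` is hypothetical, and 1 MB is taken as 1024 KB to match the query above.

```r
# Hypothetical helper: convert size strings such as "19M" or "965k" to kilobytes
size_to_kb <- function(size) {
  # Keep digits and the decimal point; unit letters and other text are dropped
  value <- suppressWarnings(as.numeric(gsub("[^0-9.]", "", size)))
  # Megabyte values are scaled by 1024; kilobyte values are kept as they are
  ifelse(grepl("M", size, fixed = TRUE), value * 1024, value)
}

sizes   <- c("19M", "8.7M", "965k", "Varies with device")
size_kb <- size_to_kb(sizes)
print(size_kb)
```

Strings without a number, such as "Varies with device", come back as NA, so they can be filtered out either before or after the conversion.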

To visualize the trend of the number of installs with changing app size we plot the following connected scatterplot:

Code

googleplaystore %>%
  filter(!str_detect(Size, "Varies with device")) %>%
  # parse_number() keeps the decimal point ("8.7M" -> 8.7); stripping all non-digits would give 87
  mutate(Size_KB = ifelse(grepl("M", Size), parse_number(Size) * 1024, parse_number(Size))) %>%
  filter(Installs > 1000000) %>%
  ggplot(aes(x=Size_KB, y=Installs)) +
  geom_line() +
  geom_point() +
  labs(title = "Connected Scatterplot\nInstalls vs App Size", x = "App Size", y = "Installs") +
  scale_y_continuous(labels = scales::label_number(scale = 1e-6, suffix = "million")) +
  scale_x_continuous(labels = scales::label_number(scale = 1/1024, suffix = "MB")) +
  theme(plot.title = element_text(hjust=0.5),
        axis.text.x = element_text(angle = 45, hjust = 1))

In the above plot, we keep only apps with more than a million installs. App size does not seem to have any discernible relationship with the number of installs: there are apps of both approximately 20MB and 75MB with more than a billion installs.

A follow-up question would be whether some categories like “GAME” form outliers having apps with high size and installs. This is intuitive since games have more visual assets that tend to be bulky.

Code

googleplaystore %>%
  filter(!str_detect(Size, "Varies with device") & !str_detect(Category, "GAME")) %>%
  # parse_number() keeps the decimal point ("8.7M" -> 8.7); stripping all non-digits would give 87
  mutate(Size_KB = ifelse(grepl("M", Size), parse_number(Size) * 1024, parse_number(Size))) %>%
  filter(Installs > 1000000) %>%
  ggplot(aes(x=Size_KB, y=Installs)) +
  geom_line() +
  geom_point() +
  labs(title = "Connected Scatterplot\nInstalls vs App Size", x = "App Size", y = "Installs") +
  scale_y_continuous(labels = scales::label_number(scale = 1e-6, suffix = "million")) +
  scale_x_continuous(labels = scales::label_number(scale = 1/1024, suffix = "MB")) +
  theme(plot.title = element_text(hjust=0.5),
        axis.text.x = element_text(angle = 45, hjust = 1))

The above plot is consistent with our hypothesis. While a few outliers combine a larger app size with a high number of installs, the general trend is that users tend to install lighter apps.
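
A rank correlation can complement the visual reading of these scatterplots with a single number. The sketch below uses made-up sizes and install counts (not values from the dataset) and Spearman's coefficient, which is a reasonable fit for install counts because they are reported in coarse buckets. It is one possible check, not the analysis performed above.

```r
# Illustrative data: app size in KB and bucketed install counts
size_kb  <- c(2048, 5120, 10240, 20480, 40960, 81920)
installs <- c(5e7, 1e7, 1e7, 5e6, 1e6, 1e6)

# Spearman's rho is computed on ranks, so the skewed scale of installs
# matters less than it would for Pearson's r
rho <- cor(size_kb, installs, method = "spearman")
print(rho)
```

A value close to -1 would indicate that larger apps consistently have fewer installs, while a value near 0 would indicate no monotonic relationship.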

Note that the number of installs says nothing about whether users pay for these apps. As seen in our descriptive analysis, most apps with a billion installs are free. An important follow-up question is: do users pay more for apps that are lighter?

Before we proceed, it is worth noting that the most expensive apps are the “I am Rich” apps, which tend to skew the data:

Code

googleplaystore %>%
  top_n(10, wt=Price) %>%
  arrange(desc(Price))

# A tibble: 13 × 13
   App       Category Rating Reviews Size  Installs Type  Price `Content Rating`
   <chr>     <fct>     <dbl>   <dbl> <chr>    <dbl> <fct> <dbl> <fct>           
 1 I'm Rich… LIFESTY…    3.6     275 7.3M     10000 Paid   400  Everyone        
 2 most exp… FAMILY      4.3       6 1.5M       100 Paid   400. Everyone        
 3 💎 I'm r… LIFESTY…    3.8     718 26M      10000 Paid   400. Everyone        
 4 I am rich LIFESTY…    3.8    3547 1.8M    100000 Paid   400. Everyone        
 5 I am Ric… FAMILY      4       856 8.7M     10000 Paid   400. Everyone        
 6 I Am Ric… FINANCE     4.1    1867 4.7M     50000 Paid   400. Everyone        
 7 I am Ric… FINANCE     3.8      93 22M       1000 Paid   400. Everyone        
 8 I am ric… FINANCE     3.5     472 965k      5000 Paid   400. Everyone        
 9 I Am Ric… FAMILY      4.4     201 2.7M      5000 Paid   400. Everyone        
10 I am ric… FINANCE     4.1     129 2.7M      1000 Paid   400. Teen            
11 I am Rich FINANCE     4.3     180 3.8M      5000 Paid   400. Everyone        
12 I AM RIC… FINANCE     4        36 41M       1000 Paid   400. Everyone        
13 I'm Rich… LIFESTY…  NaN         0 40M          0 Paid   400. Everyone        
# ℹ 4 more variables: Genres <chr>, `Last Updated` <date>, `Current Ver` <chr>,
#   `Android Ver` <chr>

To get a more reliable plot, we only consider apps priced less than $20. The following query results in a connected scatter plot to visualize this.

Code

googleplaystore %>%
  filter(!str_detect(Size, "Varies with device") & Type == "Paid" & Price < 20) %>%
  # parse_number() keeps the decimal point ("8.7M" -> 8.7); stripping all non-digits would give 87
  mutate(Size_KB = ifelse(grepl("M", Size), parse_number(Size) * 1024, parse_number(Size))) %>%
  ggplot(aes(x=Size_KB, y=Price)) +
  geom_line() +
  geom_point() +
  labs(title = "Connected Scatterplot\nApp Price vs App Size", x = "App Size", y = "App Price") +
  scale_y_continuous(labels = scales::label_number(prefix = "$")) +
  scale_x_continuous(labels = scales::label_number(scale = 1/1024, suffix = "MB")) +
  theme(plot.title = element_text(hjust=0.5),
        axis.text.x = element_text(angle = 45, hjust = 1))

From the above plot, it appears that users do indeed tend to pay for apps that are lighter.

Question 2 - What is the best monetization strategy for “Paid” apps?

Relationships among the Price, Rating and Installs variables can hint at the optimal pricing for apps.

The Price and Rating variables can be binned into groups for better visibility.

Code

googleplaystore_price_rating_grouped <- googleplaystore %>%
  # Rating is numeric, so missing values are tested with is.na() (TRUE for NaN too)
  filter(Type == "Paid" & !is.na(Rating)) %>%
  mutate(Price_Group = case_when(
    Price < 1 ~ "< $1",
    Price <= 10 ~ "$1 - $10",
    Price <= 20 ~ "$10 - $20",
    Price > 20 ~ "> $20"
  ),
  # Fix the level order so the bars follow price order, not alphabetical order
  Price_Group = factor(Price_Group, levels = c("< $1", "$1 - $10", "$10 - $20", "> $20")),
  Rating_Group = case_when(
    Rating < 1 ~ "0.0 - 1.0",
    Rating <= 2 ~ "1.0 - 2.0",
    Rating <= 3 ~ "2.0 - 3.0",
    Rating <= 4 ~ "3.0 - 4.0",
    Rating > 4 ~ "4.0 - 5.0"))

googleplaystore_price_rating_grouped %>%
  ggplot(aes(x=Price_Group)) +
  geom_bar(fill="lightblue") +
  scale_fill_brewer(palette = "Set3") +
  facet_wrap(~ Rating_Group, scales = "free") +
  labs(title = "Faceted Bar Plot\nNumber of Apps within a Price Group for a Rating group", x = "Price Group", y = "Number of Apps") +
  theme_minimal() +
  theme(plot.title = element_text(hjust=0.5))
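
Binning with cut() is an alternative to case_when() that guarantees non-overlapping intervals and returns a factor whose levels follow the order of the breaks. A minimal sketch on made-up prices follows. With right = FALSE each bin includes its lower bound, so boundary values such as 10 are assigned differently than in the query above; the labels are chosen to make that explicit.

```r
# Illustrative prices in dollars
price <- c(0.99, 1, 4.99, 10, 19.99, 20, 399.99)

# Half-open bins [lower, upper): each price falls into exactly one group
price_group <- cut(
  price,
  breaks = c(-Inf, 1, 10, 20, Inf),
  labels = c("< $1", "$1 - <$10", "$10 - <$20", ">= $20"),
  right  = FALSE
)

print(table(price_group))
```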

We see that while the $1 - $10 price group is spread across the rating groups, apps priced above $20 or below $1 tend to receive noticeably higher ratings on the Play Store. This suggests that apps should be priced either high or very low.

A follow-up question that arises now is: is there a pricing threshold above which users prefer not to download apps? For the sake of brevity, we only consider apps with more than 1,000 installs and fewer than 10 million installs. The following grouped bar plot can help answer this. The mean price for each rating group is plotted and the entire plot is faceted by the number of installs.

Code

googleplaystore_price_rating_grouped %>%
  filter(Installs > 1000 & Installs < 10000000) %>%
  group_by(Rating_Group, Installs) %>%
  summarize(mean_price = mean(Price), .groups = "drop") %>%
  ggplot(aes(fill = Rating_Group, y = mean_price, x = Rating_Group)) +
  geom_bar(position = "dodge", stat = "identity") +
  facet_wrap(~ Installs, scales = "free") +
  labs(title = "Faceted Group Bar Plot\nMean Price for a Rating group\nVaried by Number of Installs", x = "Rating Group", y = "Mean Price") +
  theme(plot.title = element_text(hjust=0.5),
        axis.text.x = element_blank())

The above plot leads us to infer that users tend to download apps and leave a better rating if the app is priced close to $3. An interesting observation is that even though apps with a mean price of about $65 don’t get a great rating, however are installed much more than apps with a mid-price range ($10 - $30).

Question 3 - How does the last update date of the app impact its popularity among users?

The Last Updated variable provides the date when the latest update of the app was published on the Google Play Store. Do apps that have not been updated recently receive lower user ratings? Do such apps tend to have fewer installs? It could also be the case that Last Updated has no impact on any of the popularity metrics.

Code

# Installs of paid apps against the date of their last update.
googleplaystore %>%
  filter(Type == "Paid" & Installs > 0) %>%
  ggplot(aes(x=`Last Updated`, y=Installs)) +
  geom_line() + 
  labs(title = "Line Plot\nApp Installs varied by Last Update date", x = "Date Last Updated", y = "App Installs") +
  scale_y_continuous(labels = scales::label_number(scale=1e-6, suffix = " million")) +
  theme(plot.title = element_text(hjust=0.5))

From the above plot, we observe that apps with more recent updates (the data was collected in 2018) tend to have a much higher number of installs. Although a few apps last updated within 4 years of the data collection have a high number of installs, almost all apps last updated more than 4 years before it have very few installs.
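The visual impression from the line plot could be backed by a number. One option, sketched here on invented data, is a Spearman rank correlation between the last-update date and the install count, together with the median installs per update year. `toy_updates` and its values are hypothetical; the choice of Spearman over Pearson is an assumption made because install counts are heavily skewed.

```r
library(dplyr)

# Hypothetical toy data: one row per app (values are invented).
toy_updates <- tibble(
  `Last Updated` = as.Date(c("2012-03-01", "2014-06-15", "2016-01-10",
                             "2017-09-05", "2018-05-20", "2018-07-30")),
  Installs       = c(100, 1000, 5000, 10000, 500000, 1000000)
)

# Rank correlation between update recency and installs.
# Dates are converted to day counts so that cor() can rank them.
update_cor <- cor(
  as.numeric(toy_updates$`Last Updated`),
  toy_updates$Installs,
  method = "spearman"
)

# Median installs per update year, a compact alternative to the line plot.
installs_by_year <- toy_updates %>%
  mutate(Year = as.integer(format(`Last Updated`, "%Y"))) %>%
  group_by(Year) %>%
  summarize(n_apps = n(), median_installs = median(Installs), .groups = "drop")

update_cor
installs_by_year
```

A value of `update_cor` close to 1 would support the claim that more recently updated apps have more installs; a value near 0 would contradict it.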

A follow-up question is whether the last updated date of the app affects the ratings it receives. We plot a heatmap of the number of apps falling within a particular rating group as varied by the year they were last updated.

Code

# Count apps per rating group and year of last update, then draw a heatmap.
googleplaystore %>%
  filter(!is.na(Rating)) %>%
  mutate(
    Rating_Group = case_when(
      Rating < 1 ~ "0.0 - 1.0",
      between(Rating, 1, 2) ~ "1.0 - 2.0",
      between(Rating, 2.0, 3.0) ~ "2.0 - 3.0",
      between(Rating, 3.0, 4.0) ~ "3.0 - 4.0",
      Rating > 4 ~ "4.0 - 5.0"),
    Year = year(`Last Updated`)) %>%
  group_by(Year, Rating_Group) %>%
  summarize(App_Count = n(), .groups = "drop") %>%
  ggplot(aes(x = Year, y = Rating_Group, fill = App_Count)) +
  geom_tile() +
  scale_fill_viridis() +
  theme_minimal() +
  labs(title = "Heat Map\nApp Ratings varied by Last Update Year", x = "Year Last Updated", y = "Rating Group") +
  theme(plot.title = element_text(hjust=0.5))

The above plot shows that the majority of apps rated between 4.0 and 5.0 on the Play Store were updated recently.
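Raw counts in a heatmap are dominated by the years in which most apps were last updated, so a within-year share may be a fairer basis for comparing rating groups across years. The sketch below assumes a table shaped like the summarized data above (`Year`, `Rating_Group`, `App_Count`); `toy_counts` and its numbers are invented, and `Share` is a column name introduced here.

```r
library(dplyr)

# Hypothetical counts per year and rating group (values are invented).
toy_counts <- tibble(
  Year         = c(2016, 2016, 2017, 2017, 2018, 2018),
  Rating_Group = rep(c("3.0 - 4.0", "4.0 - 5.0"), times = 3),
  App_Count    = c(40, 60, 90, 210, 500, 2000)
)

rating_share <- toy_counts %>%
  group_by(Year) %>%
  # Fraction of that year's apps falling into each rating group.
  mutate(Share = App_Count / sum(App_Count)) %>%
  ungroup()

rating_share
```

Mapping `fill = Share` instead of `fill = App_Count` in `geom_tile()` would then show whether the proportion of highly rated apps actually rises with the update year.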

Limitations of Current Analysis

The following are some of the limitations of the current analysis:

For the first research question, instead of plotting against continuously varying app sizes, the App Size variable could be converted into a factor with only a few thresholds. Since close to 1000 observations are plotted in each graph, the data points are congested in the current plots.
The Price plots in the second research question group the Price variable into 4 categories. A finer granularity could be used for apps priced above $20, since that group contains about 40 apps.
Owing to time constraints for the current homework, a few more research questions concerning the Android Ver, Genre and Category variables are left unanswered. Another interesting research question is the correlation between the number of reviews and the number of installs for an app.
Comments need to be added to all code blocks.
The plots need color-blind-safe colors and further visual improvements, for example using the khroma package.
These improvements will be taken up in the final project report.
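The first limitation above, converting app size into a factor with a few thresholds, could be approached with base R's `cut()`. This is only a sketch under assumptions: `toy_sizes`, the column name `Size_MB` and the break points (10, 25 and 50 MB) are hypothetical and are not taken from the analysis.

```r
library(dplyr)

# Hypothetical toy data: app size in MB and install counts (values invented).
toy_sizes <- tibble(
  Size_MB  = c(2.8, 8.7, 14, 19, 25, 33, 60, 95),
  Installs = c(1e5, 5e6, 5e5, 1e4, 5e7, 1e6, 1e7, 1e5)
)

size_summary <- toy_sizes %>%
  # cut() builds right-closed intervals: (0,10], (10,25], (25,50], (50,Inf].
  mutate(Size_Group = cut(
    Size_MB,
    breaks = c(0, 10, 25, 50, Inf),
    labels = c("<= 10 MB", "10 - 25 MB", "25 - 50 MB", "> 50 MB")
  )) %>%
  group_by(Size_Group) %>%
  summarize(n_apps = n(), median_installs = median(Installs), .groups = "drop")

size_summary
```

With a handful of size groups, a box plot or bar plot per group would replace the congested scatter of individual apps.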

Bibliography

Gupta, Lavanya (2018). Google Play Store Dataset on Kaggle. URL https://www.kaggle.com/datasets/lava18/google-play-store-apps/data.
R Core Team (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
Wickham, H., & Grolemund, G. (2016). R for data science: Visualize, model, transform, tidy, and import data. O'Reilly Media.
Baumer, B. S., Kaplan, D. T., & Horton, N. J. (2017). Modern data science with R. CRC Press.
R Graph Gallery. URL https://r-graph-gallery.com/.
