library(tidyverse)
library(readxl)
library(here)
library(dplyr)
library(viridis)
library(lubridate) # mdy() and year() are used further below
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
The working directory for RStudio has been set with setwd() so that “googleplaystore.csv” can be found at the root of the working directory.
googleplaystore_orig <- read_csv(here("googleplaystore.csv"))
googleplaystore_orig
# A tibble: 10,841 × 13
App Category Rating Reviews Size Installs Type Price `Content Rating`
<chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
1 Photo Ed… ART_AND… 4.1 159 19M 10,000+ Free 0 Everyone
2 Coloring… ART_AND… 3.9 967 14M 500,000+ Free 0 Everyone
3 U Launch… ART_AND… 4.7 87510 8.7M 5,000,0… Free 0 Everyone
4 Sketch -… ART_AND… 4.5 215644 25M 50,000,… Free 0 Teen
5 Pixel Dr… ART_AND… 4.3 967 2.8M 100,000+ Free 0 Everyone
6 Paper fl… ART_AND… 4.4 167 5.6M 50,000+ Free 0 Everyone
7 Smoke Ef… ART_AND… 3.8 178 19M 50,000+ Free 0 Everyone
8 Infinite… ART_AND… 4.1 36815 29M 1,000,0… Free 0 Everyone
9 Garden C… ART_AND… 4.4 13791 33M 1,000,0… Free 0 Everyone
10 Kids Pai… ART_AND… 4.7 121 3.1M 10,000+ Free 0 Everyone
# ℹ 10,831 more rows
# ℹ 4 more variables: Genres <chr>, `Last Updated` <chr>, `Current Ver` <chr>,
# `Android Ver` <chr>
The dataset comprises 10,841 rows and 13 columns.
Eleven columns are of the <chr> type and the remaining two (Rating and Reviews) are of the <dbl> type. Each observation pertains to a single app. The variables in the dataset are described below:
- App lists the name of the app under observation.
- Category is the category the app is grouped into.
- Rating lists the rating of the app on the Google Play Store.
- Reviews is the number of user reviews the app received on the Google Play Store.
- Size is the size of the app (either in KB, MB or varying by device).
- Installs is the number of devices the app is installed on.
- Type states whether an app is “Free” or “Paid”.
- Price is the price of the app if the Type is “Paid”, and 0 for “Free” apps.
- Content Rating is the user group the app is suitable for.
- Genres is the genre the app falls under. This is similar to the Category variable, but an app can belong to multiple genres (apart from its main category). For example, a musical family game will belong to the Music, Game and Family genres.
- Last Updated is the date on which the latest update was published on the Google Play Store.
- Current Ver is the current published version of the app.
- Android Ver is the target Android version of the device the app will work on.
The dataset is a published Kaggle dataset. It was collected by web-scraping Google Play Store listings in 2018. The detail page for an app on the Google Play Store provides all of the information from which the dataset is built.
The following query gives the total apps under each category:
googleplaystore_orig %>%
group_by(Category) %>%
summarize(total_apps = n()) %>%
arrange(desc(total_apps))
# A tibble: 33 × 2
Category total_apps
<chr> <int>
1 FAMILY 1972
2 GAME 1144
3 TOOLS 844
4 MEDICAL 463
5 BUSINESS 460
6 PRODUCTIVITY 424
7 PERSONALIZATION 392
8 COMMUNICATION 387
9 SPORTS 384
10 LIFESTYLE 382
# ℹ 23 more rows
We see that the “FAMILY” category has the most apps in the dataset.
However, the dataset contains duplicate rows: only 10,358 of its 10,841 rows are distinct, as the following query shows.
n_distinct(googleplaystore_orig)
[1] 10358
The following query de-duplicates the dataset by dropping rows that match exactly:
googleplaystore_orig %>% distinct()
# A tibble: 10,358 × 13
App Category Rating Reviews Size Installs Type Price `Content Rating`
<chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
1 Photo Ed… ART_AND… 4.1 159 19M 10,000+ Free 0 Everyone
2 Coloring… ART_AND… 3.9 967 14M 500,000+ Free 0 Everyone
3 U Launch… ART_AND… 4.7 87510 8.7M 5,000,0… Free 0 Everyone
4 Sketch -… ART_AND… 4.5 215644 25M 50,000,… Free 0 Teen
5 Pixel Dr… ART_AND… 4.3 967 2.8M 100,000+ Free 0 Everyone
6 Paper fl… ART_AND… 4.4 167 5.6M 50,000+ Free 0 Everyone
7 Smoke Ef… ART_AND… 3.8 178 19M 50,000+ Free 0 Everyone
8 Infinite… ART_AND… 4.1 36815 29M 1,000,0… Free 0 Everyone
9 Garden C… ART_AND… 4.4 13791 33M 1,000,0… Free 0 Everyone
10 Kids Pai… ART_AND… 4.7 121 3.1M 10,000+ Free 0 Everyone
# ℹ 10,348 more rows
# ℹ 4 more variables: Genres <chr>, `Last Updated` <chr>, `Current Ver` <chr>,
# `Android Ver` <chr>
However, this does not handle the case where the same app appears in several rows that differ in another column, such as Reviews:
googleplaystore_orig %>%
distinct() %>%
select(App, Category, Rating, Reviews) %>%
arrange(desc(Reviews))
# A tibble: 10,358 × 4
App Category Rating Reviews
<chr> <chr> <dbl> <dbl>
1 Facebook SOCIAL 4.1 78158306
2 Facebook SOCIAL 4.1 78128208
3 WhatsApp Messenger COMMUNICATION 4.4 69119316
4 WhatsApp Messenger COMMUNICATION 4.4 69109672
5 Instagram SOCIAL 4.5 66577446
6 Instagram SOCIAL 4.5 66577313
7 Instagram SOCIAL 4.5 66509917
8 Messenger – Text and Video Chat for Free COMMUNICATION 4 56646578
9 Messenger – Text and Video Chat for Free COMMUNICATION 4 56642847
10 Clash of Clans GAME 4.6 44893888
# ℹ 10,348 more rows
The following query deduplicates the dataset based on the App variable, keeping the first row found for each app:
googleplaystore_distinct <- googleplaystore_orig %>%
  distinct(App, .keep_all = TRUE)
googleplaystore_distinct
# A tibble: 9,660 × 13
App Category Rating Reviews Size Installs Type Price `Content Rating`
<chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
1 Photo Ed… ART_AND… 4.1 159 19M 10,000+ Free 0 Everyone
2 Coloring… ART_AND… 3.9 967 14M 500,000+ Free 0 Everyone
3 U Launch… ART_AND… 4.7 87510 8.7M 5,000,0… Free 0 Everyone
4 Sketch -… ART_AND… 4.5 215644 25M 50,000,… Free 0 Teen
5 Pixel Dr… ART_AND… 4.3 967 2.8M 100,000+ Free 0 Everyone
6 Paper fl… ART_AND… 4.4 167 5.6M 50,000+ Free 0 Everyone
7 Smoke Ef… ART_AND… 3.8 178 19M 50,000+ Free 0 Everyone
8 Infinite… ART_AND… 4.1 36815 29M 1,000,0… Free 0 Everyone
9 Garden C… ART_AND… 4.4 13791 33M 1,000,0… Free 0 Everyone
10 Kids Pai… ART_AND… 4.7 121 3.1M 10,000+ Free 0 Everyone
# ℹ 9,650 more rows
# ℹ 4 more variables: Genres <chr>, `Last Updated` <chr>, `Current Ver` <chr>,
# `Android Ver` <chr>
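Keeping the first row per App is only one possible choice. As a minimal sketch in base R, assuming made-up rows (the `apps` data frame and its values are hypothetical, not taken from the dataset), the following compares "first row seen" with "row with the most reviews":

```r
# Toy data (hypothetical values, not taken from the real dataset)
apps <- data.frame(
  App     = c("Alpha", "Alpha", "Beta", "Gamma", "Gamma"),
  Reviews = c(100, 120, 50, 7, 5),
  stringsAsFactors = FALSE
)

# Keep the first row seen per app (the behaviour of keeping the first match)
first_seen <- apps[!duplicated(apps$App), ]

# Alternative: sort by Reviews (descending) first, so the first row per app
# is the one with the most reviews
sorted       <- apps[order(-apps$Reviews), ]
most_reviews <- sorted[!duplicated(sorted$App), ]
most_reviews <- most_reviews[order(most_reviews$App), ]

print(first_seen)
print(most_reviews)
```

Which rule is preferable depends on the question being asked; the analysis here continues with the first-row version.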
The following query shows that 1,465 observations contain at least one missing value; in the rows printed below, the missing value is the Rating. Because complete.cases() checks every column, this count is an upper bound on the number of apps without a valid Rating. A Rating could be missing for several reasons, including too few installations, too few reviews, a very recent update, or discrepancies while scraping the data from the Google Play Store.
# Gets all observations with at least one NA column
googleplaystore_distinct[!complete.cases(googleplaystore_distinct), ] %>%
  select(App, Installs, Reviews, `Last Updated`, Rating)
# A tibble: 1,465 × 5
App Installs Reviews `Last Updated` Rating
<chr> <chr> <dbl> <chr> <dbl>
1 Mcqueen Coloring pages 100,000+ 61 March 7, 2018 NaN
2 Wrinkles and rejuvenation 100,000+ 182 September 20, 20… NaN
3 Manicure - nail design 50,000+ 119 July 23, 2018 NaN
4 Skin Care and Natural Beauty 100,000+ 654 July 17, 2018 NaN
5 Secrets of beauty, youth and health 10,000+ 77 August 8, 2017 NaN
6 Recipes and tips for losing weight 10,000+ 35 December 11, 2017 NaN
7 Lady adviser (beauty, health) 10,000+ 30 January 24, 2018 NaN
8 Anonymous caller detection 10,000+ 161 July 13, 2018 NaN
9 SH-02J Owner's Manual (Android 8.0) 50,000+ 2 June 15, 2018 NaN
10 URBANO V 02 instruction manual 100,000+ 114 August 7, 2015 NaN
# ℹ 1,455 more rows
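To see which columns the missing values come from, the per-row check can be complemented by a per-column count. A minimal base R sketch on a hypothetical toy data frame (the column names and values are assumptions for illustration only):

```r
# Toy data (hypothetical): NaN in a numeric column, NA in a character column
apps <- data.frame(
  App     = c("Alpha", "Beta", "Gamma", "Delta"),
  Rating  = c(4.1, NaN, 3.9, NaN),
  Version = c("1.0", "2.3", NA, "1.1"),
  stringsAsFactors = FALSE
)

# Rows with at least one missing value in any column
incomplete_rows <- sum(!complete.cases(apps))

# Missing values counted separately for each column
missing_per_column <- colSums(is.na(apps))

print(incomplete_rows)
print(missing_per_column)
```

In this toy example three rows are incomplete, but only two of them are missing a Rating, which is the distinction the per-column count makes visible.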
Additionally, there is one app whose Type is recorded as the text “NaN”, as the following query shows. A possible reason is that the app is new (it has 0 Installs) and has not been assigned a Type yet.
googleplaystore_distinct %>%
  filter(str_detect(Type, "NaN"))
# A tibble: 1 × 13
App Category Rating Reviews Size Installs Type Price `Content Rating`
<chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
1 Command &… FAMILY NaN 0 Vari… 0 NaN 0 Everyone 10+
# ℹ 4 more variables: Genres <chr>, `Last Updated` <chr>, `Current Ver` <chr>,
# `Android Ver` <chr>
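The query matches on text because, judging by the printed output, the missing Type is stored as the literal string "NaN" in a <chr> column. A small base R sketch, assuming that is the case (the vector below is hypothetical), shows why is.na() alone would not find it:

```r
# Hypothetical Type values, with a missing entry stored as the text "NaN"
type_raw <- c("Free", "Paid", "NaN", "Free")

# The literal string is not recognised as missing
flagged_before <- is.na(type_raw)

# Convert the placeholder text into a real missing value
type_clean <- type_raw
type_clean[type_clean == "NaN"] <- NA

flagged_after <- is.na(type_clean)

print(flagged_before)
print(flagged_after)
```

With readr, a similar effect could probably be achieved at import time by listing "NaN" in the `na` argument of read_csv(); that option is not used in this analysis.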
We do not remove these observations from the dataset since variables other than Rating and Type contain useful information for the said apps.
The following mutations should be applied to the dataset to make it easier to analyze and work with:
- The Last Updated variable should be read as a date type. The lubridate package is useful to do so.
- The Category, Type and Content Rating variables can be read as factors.
- The Price variable can be trimmed of the ‘$’ sign and read as a dbl.
- The Installs variable can be trimmed of the ‘+’ sign and the ‘,’ separators and read as a dbl.
The following query achieves these mutations:
googleplaystore <- googleplaystore_distinct %>%
mutate(`Last Updated`= mdy(`Last Updated`),
Type=as.factor(Type),
Category=as.factor(Category),
`Content Rating`=as.factor(`Content Rating`),
Price=as.numeric(gsub("\\$", "", Price)),
Installs=as.numeric(gsub("[^0-9]", "", Installs)))
googleplaystore
# A tibble: 9,660 × 13
App Category Rating Reviews Size Installs Type Price `Content Rating`
<chr> <fct> <dbl> <dbl> <chr> <dbl> <fct> <dbl> <fct>
1 Photo Ed… ART_AND… 4.1 159 19M 10000 Free 0 Everyone
2 Coloring… ART_AND… 3.9 967 14M 500000 Free 0 Everyone
3 U Launch… ART_AND… 4.7 87510 8.7M 5000000 Free 0 Everyone
4 Sketch -… ART_AND… 4.5 215644 25M 50000000 Free 0 Teen
5 Pixel Dr… ART_AND… 4.3 967 2.8M 100000 Free 0 Everyone
6 Paper fl… ART_AND… 4.4 167 5.6M 50000 Free 0 Everyone
7 Smoke Ef… ART_AND… 3.8 178 19M 50000 Free 0 Everyone
8 Infinite… ART_AND… 4.1 36815 29M 1000000 Free 0 Everyone
9 Garden C… ART_AND… 4.4 13791 33M 1000000 Free 0 Everyone
10 Kids Pai… ART_AND… 4.7 121 3.1M 10000 Free 0 Everyone
# ℹ 9,650 more rows
# ℹ 4 more variables: Genres <chr>, `Last Updated` <date>, `Current Ver` <chr>,
# `Android Ver` <chr>
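The clean-up of Price and Installs comes down to two regular-expression substitutions. A minimal sketch on hypothetical raw values written in the same style as those columns:

```r
# Hypothetical raw values in the same style as the Price and Installs columns
price_raw    <- c("0", "$4.99", "$399.99")
installs_raw <- c("10,000+", "5,000,000+", "0")

# Drop the dollar sign, then convert to numeric
price <- as.numeric(gsub("\\$", "", price_raw))

# Keep digits only (removes "," and "+"), then convert to numeric
installs <- as.numeric(gsub("[^0-9]", "", installs_raw))

print(price)
print(installs)
```

The digits-only pattern is safe for Installs because those values never contain a decimal point; for a column with decimals, the pattern would need to keep the "." as well.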
The following query summarizes the mean, median and standard deviation for the Rating variable across all apps.
googleplaystore %>%
summarize(mean_rating = mean(Rating, na.rm = TRUE),
          median_rating = median(Rating, na.rm = TRUE),
          sd_rating = sd(Rating, na.rm = TRUE))
# A tibble: 1 × 3
mean_rating median_rating sd_rating
<dbl> <dbl> <dbl>
1 4.17 4.3 0.537
The summary above shows that apps are generally well rated, with a mean rating of 4.17 and a median of 4.3. Rating statistics grouped by app category can be computed using the following query, arranged in descending order of the number of reviews received:
googleplaystore %>%
group_by(Category) %>%
summarize(review_count = sum(Reviews),
          mean_rating = mean(Rating, na.rm = TRUE),
          median_rating = median(Rating, na.rm = TRUE),
          sd_rating = sd(Rating, na.rm = TRUE)) %>%
arrange(desc(review_count))
# A tibble: 33 × 5
Category review_count mean_rating median_rating sd_rating
<fct> <dbl> <dbl> <dbl> <dbl>
1 GAME 622298709 4.25 4.3 0.384
2 COMMUNICATION 285811368 4.12 4.2 0.470
3 TOOLS 229356597 4.04 4.2 0.625
4 SOCIAL 227927801 4.25 4.3 0.457
5 FAMILY 143825488 4.18 4.3 0.523
6 PHOTOGRAPHY 105351270 4.16 4.3 0.494
7 VIDEO_PLAYERS 67484568 4.04 4.2 0.564
8 PRODUCTIVITY 55590649 4.18 4.3 0.534
9 PERSONALIZATION 53543080 4.33 4.4 0.359
10 SHOPPING 44551730 4.23 4.3 0.445
# ℹ 23 more rows
The tibble above is complemented by the boxplot below for the 10 most reviewed categories. We observe that apps categorized as “PERSONALIZATION” have the highest mean rating among these.
# Ten categories with the largest total number of reviews
top_categories <- googleplaystore %>%
  group_by(Category) %>%
  summarize(SumReviews = sum(Reviews), MeanRating = mean(Rating, na.rm = TRUE)) %>%
  top_n(10, wt = SumReviews)
googleplaystore %>%
  filter(Category %in% top_categories$Category) %>%
  # Order the boxes by total reviews; sorting rows alone does not reorder a discrete axis
  mutate(Category = fct_reorder(droplevels(Category), Reviews, .fun = sum, .desc = TRUE)) %>%
  ggplot(aes(x = Category, y = Rating, fill = Category)) +
  geom_boxplot() +
  labs(title = "Top 10 Most Reviewed Categories\nBox Plot of Ratings", x = "Category", y = "Rating") +
  theme(legend.position = "none", # To remove the legend
        plot.title = element_text(hjust=0.5),
        axis.text.x = element_text(angle = 45, hjust = 1))
The number of Reviews is an implicit indicator of the popularity of an app. The following query summarizes the mean, median and standard deviation of the Reviews variable across all apps.
googleplaystore %>%
summarize(mean_reviews = mean(Reviews),
median_reviews = median(Reviews),
sd_reviews = sd(Reviews))
# A tibble: 1 × 3
mean_reviews median_reviews sd_reviews
<dbl> <dbl> <dbl>
1 216570. 967 1831226.
From the above tibble we observe that mean_reviews is far larger than median_reviews, which implies the presence of a few outlier apps with a huge number of reviews. The following query lists the most reviewed apps:
googleplaystore %>%
select(App, Reviews) %>%
arrange(desc(Reviews)) %>%
head()
# A tibble: 6 × 2
App Reviews
<chr> <dbl>
1 Facebook 78158306
2 WhatsApp Messenger 69119316
3 Instagram 66577313
4 Messenger – Text and Video Chat for Free 56642847
5 Clash of Clans 44891723
6 Clean Master- Space Cleaner & Antivirus 42916526
The tibble above lists the apps with the highest number of reviews. Review statistics can also be grouped by app category using the following query, arranged in descending order of mean reviews:
googleplaystore %>%
group_by(Category) %>%
summarize(mean_reviews = mean(Reviews),
median_reviews = median(Reviews),
sd_reviews = sd(Reviews)) %>%
arrange(desc(mean_reviews))
# A tibble: 33 × 4
Category mean_reviews median_reviews sd_reviews
<fct> <dbl> <dbl> <dbl>
1 SOCIAL 953673. 3782 6753581.
2 COMMUNICATION 907338. 1711 5335063.
3 GAME 648904. 28510 2502348.
4 VIDEO_PLAYERS 414016. 4585 2233854.
5 PHOTOGRAPHY 374916. 26252 1147458.
6 ENTERTAINMENT 340810. 37884. 1036260.
7 TOOLS 277001. 475 2046758.
8 SHOPPING 220553. 11076. 822197.
9 WEATHER 155635. 11297 445494.
10 PRODUCTIVITY 148638. 1161 504486.
# ℹ 23 more rows
This matches our previous finding. Apps published in the “SOCIAL”, “COMMUNICATION” and “GAME” categories tend to have the most reviews even though they do not have the highest mean rating. One possible explanation is that these categories have more users, so their ratings are averaged over a larger and more varied audience. The following 2D density plot illustrates that the most reviewed app categories do not have the highest mean rating.
googleplaystore %>%
group_by(Category) %>%
summarize(mean_reviews = mean(Reviews),
mean_rating = mean(Rating, na.rm = T)) %>%
ggplot(aes(x = mean_rating, y = mean_reviews)) +
geom_density2d() +
labs(title = "2D Density Plot\nMean Reviews vs Mean Ratings for App Categories", x = "Mean Ratings", y = "Mean Reviews") +
scale_y_continuous(labels = scales::label_number(scale = 1e-3, suffix = "K")) +
theme(plot.title = element_text(hjust=0.5),
axis.text.x = element_text(angle = 45, hjust = 1))
The following query summarizes the number of apps published under each category as well as their proportion of the total number of apps.
category_statistics <- googleplaystore %>%
group_by(Category) %>%
summarize(app_count = n(),
app_proportion = n() / nrow(googleplaystore)) %>%
arrange(desc(app_count))
category_statistics
# A tibble: 33 × 3
Category app_count app_proportion
<fct> <int> <dbl>
1 FAMILY 1832 0.190
2 GAME 959 0.0993
3 TOOLS 828 0.0857
4 BUSINESS 420 0.0435
5 MEDICAL 395 0.0409
6 PERSONALIZATION 376 0.0389
7 PRODUCTIVITY 374 0.0387
8 LIFESTYLE 369 0.0382
9 FINANCE 345 0.0357
10 SPORTS 325 0.0336
# ℹ 23 more rows
From the above tibble we observe that the largest number of apps in our dataset are published under the “FAMILY” category. The following pie chart visualizes the same proportions.
ggplot(category_statistics, aes(x = "", y = app_proportion, fill = Category)) +
geom_bar(stat = "identity", width = 1, color = "white") +
coord_polar("y") +
theme_void() +
labs(title = "Proportion of Apps by Category", fill = "Category") +
theme(plot.title = element_text(hjust=0.5))
The split between app Types (Free or Paid) is another interesting statistic.
googleplaystore %>%
filter(!str_detect(Type, "NaN")) %>%
group_by(Type) %>%
summarize(count = n(),
proportion = n() / nrow(googleplaystore))
# A tibble: 2 × 3
Type count proportion
<fct> <int> <dbl>
1 Free 8903 0.922
2 Paid 756 0.0783
We observe that about 92% of the apps in our dataset are “Free”. Another interesting statistic to observe is the most downloaded app for each Type:
googleplaystore %>%
  filter(!str_detect(Type, "NaN")) %>%
  # Installs was already converted to a number above, so it is not parsed again
  group_by(Type) %>%
  filter(Installs == max(Installs)) %>%
  arrange(desc(Type))
# A tibble: 22 × 13
# Groups: Type [2]
App Category Rating Reviews Size Installs Type Price `Content Rating`
<chr> <fct> <dbl> <dbl> <chr> <dbl> <fct> <dbl> <fct>
1 Minecraft FAMILY 4.5 2.38e6 Vari… 1e7 Paid 6.99 Everyone 10+
2 Hitman S… GAME 4.6 4.08e5 29M 1e7 Paid 0.99 Mature 17+
3 Google P… BOOKS_A… 3.9 1.43e6 Vari… 1e9 Free 0 Teen
4 Messenge… COMMUNI… 4 5.66e7 Vari… 1e9 Free 0 Everyone
5 WhatsApp… COMMUNI… 4.4 6.91e7 Vari… 1e9 Free 0 Everyone
6 Google C… COMMUNI… 4.3 9.64e6 Vari… 1e9 Free 0 Everyone
7 Gmail COMMUNI… 4.3 4.60e6 Vari… 1e9 Free 0 Everyone
8 Hangouts COMMUNI… 4 3.42e6 Vari… 1e9 Free 0 Everyone
9 Skype - … COMMUNI… 4.1 1.05e7 Vari… 1e9 Free 0 Everyone
10 Google P… ENTERTA… 4.3 7.17e6 Vari… 1e9 Free 0 Teen
# ℹ 12 more rows
# ℹ 4 more variables: Genres <chr>, `Last Updated` <date>, `Current Ver` <chr>,
# `Android Ver` <chr>
We observe that “Minecraft” and “Hitman Sniper” are the most installed “Paid” apps, with more than 10 million installs each, while the remaining apps are the most installed “Free” apps, with more than a billion installs each. An interesting observation is that a majority of the most downloaded “Free” apps on the Google Play Store are owned by Google itself!
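The query returns 22 rows rather than 2 because every app that ties for the maximum within its Type is kept. A minimal base R sketch of the same "rows at the group maximum" logic, on hypothetical apps and install counts:

```r
# Toy data (hypothetical apps and install counts)
apps <- data.frame(
  App      = c("A", "B", "C", "D", "E"),
  Type     = c("Free", "Free", "Free", "Paid", "Paid"),
  Installs = c(1e9, 1e9, 5e8, 1e7, 1e6),
  stringsAsFactors = FALSE
)

# ave() returns, for each row, the maximum Installs within that row's Type
group_max <- ave(apps$Installs, apps$Type, FUN = max)

# Keep every row that reaches its group's maximum, so ties are all retained
most_installed <- apps[apps$Installs == group_max, ]

print(most_installed)
```

Ties are common here because Installs is recorded in coarse brackets rather than as exact counts.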
We’ll now use the dataset to answer the following research questions:
Does the Size of an app relate to the number of Installs?
Users generally avoid installing bulky apps. Yet in our observations above, “GAME” apps, which tend to be larger, are among the most installed categories. An important research question is which app size maximizes the number of installs.
A new variable, Size_KB, is derived from Size so that all sizes are expressed in kilobytes.
googleplaystore %>%
  filter(!str_detect(Size, "Varies with device")) %>%
  # Keep the decimal point so that "8.7M" is read as 8.7 rather than 87
  mutate(Size_KB = ifelse(grepl("M", Size), as.numeric(gsub("[^0-9.]", "", Size)) * 1024, as.numeric(gsub("[^0-9.]", "", Size))))
# A tibble: 8,433 × 14
App Category Rating Reviews Size Installs Type Price `Content Rating`
<chr> <fct> <dbl> <dbl> <chr> <dbl> <fct> <dbl> <fct>
1 Photo Ed… ART_AND… 4.1 159 19M 10000 Free 0 Everyone
2 Coloring… ART_AND… 3.9 967 14M 500000 Free 0 Everyone
3 U Launch… ART_AND… 4.7 87510 8.7M 5000000 Free 0 Everyone
4 Sketch -… ART_AND… 4.5 215644 25M 50000000 Free 0 Teen
5 Pixel Dr… ART_AND… 4.3 967 2.8M 100000 Free 0 Everyone
6 Paper fl… ART_AND… 4.4 167 5.6M 50000 Free 0 Everyone
7 Smoke Ef… ART_AND… 3.8 178 19M 50000 Free 0 Everyone
8 Infinite… ART_AND… 4.1 36815 29M 1000000 Free 0 Everyone
9 Garden C… ART_AND… 4.4 13791 33M 1000000 Free 0 Everyone
10 Kids Pai… ART_AND… 4.7 121 3.1M 10000 Free 0 Everyone
# ℹ 8,423 more rows
# ℹ 5 more variables: Genres <chr>, `Last Updated` <date>, `Current Ver` <chr>,
# `Android Ver` <chr>, Size_KB <dbl>
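When converting sizes, the decimal point has to survive the clean-up step: a digits-only pattern would read "8.7M" as 87. A minimal base R sketch of the conversion, where `size_to_kb()` is a hypothetical helper introduced only for illustration:

```r
# Convert size strings such as "8.7M" or "965k" to kilobytes.
# size_to_kb() is a hypothetical helper, not part of the original analysis.
size_to_kb <- function(size) {
  # Keep digits and the decimal point; anything else (unit letters) is dropped
  value <- suppressWarnings(as.numeric(gsub("[^0-9.]", "", size)))
  # Megabyte values are scaled to kilobytes; "k" values are already in KB
  ifelse(grepl("M", size, fixed = TRUE), value * 1024, value)
}

sizes   <- c("19M", "8.7M", "965k", "Varies with device")
size_kb <- size_to_kb(sizes)

print(size_kb)
```

Entries without a numeric size, such as "Varies with device", come out as NA, which is why the analysis filters them out beforehand.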
To visualize the trend of the number of installs with changing app size we plot the following connected scatterplot:
googleplaystore %>%
filter(!str_detect(Size, "Varies with device")) %>%
mutate(Size_KB = ifelse(grepl("M", Size), as.numeric(gsub("[^0-9.]", "", Size)) * 1024, as.numeric(gsub("[^0-9.]", "", Size)))) %>% # "." kept so decimals such as 8.7 survive
filter(Installs > 1000000) %>%
ggplot(aes(x=Size_KB, y=Installs)) +
geom_line() +
geom_point() +
labs(title = "Connected Scatterplot\nInstalls vs App Size", x = "App Size", y = "Installs") +
scale_y_continuous(labels = scales::label_number(scale = 1e-6, suffix = "million")) +
scale_x_continuous(labels = scales::label_number(scale = 1/1024, suffix = "MB")) +
theme(plot.title = element_text(hjust=0.5),
axis.text.x = element_text(angle = 45, hjust = 1))
The plot above is restricted to apps with more than a million installs. App size shows no discernible relationship with the number of installs: there are apps of approximately 20MB and of approximately 75MB with more than a billion installs.
A follow-up question is whether categories such as “GAME” act as outliers, with apps that are both large and widely installed. This is plausible, since games ship more visual assets, which tend to be bulky. The following plot therefore excludes the “GAME” category.
googleplaystore %>%
filter(!str_detect(Size, "Varies with device") & !str_detect(Category, "GAME")) %>%
mutate(Size_KB = ifelse(grepl("M", Size), as.numeric(gsub("[^0-9.]", "", Size)) * 1024, as.numeric(gsub("[^0-9.]", "", Size)))) %>% # "." kept so decimals such as 8.7 survive
filter(Installs > 1000000) %>%
ggplot(aes(x=Size_KB, y=Installs)) +
geom_line() +
geom_point() +
labs(title = "Connected Scatterplot\nInstalls vs App Size", x = "App Size", y = "Installs") +
scale_y_continuous(labels = scales::label_number(scale = 1e-6, suffix = "million")) +
scale_x_continuous(labels = scales::label_number(scale = 1/1024, suffix = "MB")) +
theme(plot.title = element_text(hjust=0.5),
axis.text.x = element_text(angle = 45, hjust = 1))
The plot above appears to support this hypothesis. While a few outliers combine a larger app size with a high number of installs, the general trend is that users tend to install apps that are lighter.
Note that a high number of installs does not imply that users pay for these apps. As seen in our descriptive analysis, most billion-install apps are free. An important follow-up question is: do users pay more for apps that are lighter?
Before proceeding, note that apps such as the “I am Rich” apps are the highest priced in the dataset and tend to skew the data:
googleplaystore %>%
top_n(10, wt=Price) %>%
arrange(desc(Price))
# A tibble: 13 × 13
App Category Rating Reviews Size Installs Type Price `Content Rating`
<chr> <fct> <dbl> <dbl> <chr> <dbl> <fct> <dbl> <fct>
1 I'm Rich… LIFESTY… 3.6 275 7.3M 10000 Paid 400 Everyone
2 most exp… FAMILY 4.3 6 1.5M 100 Paid 400. Everyone
3 💎 I'm r… LIFESTY… 3.8 718 26M 10000 Paid 400. Everyone
4 I am rich LIFESTY… 3.8 3547 1.8M 100000 Paid 400. Everyone
5 I am Ric… FAMILY 4 856 8.7M 10000 Paid 400. Everyone
6 I Am Ric… FINANCE 4.1 1867 4.7M 50000 Paid 400. Everyone
7 I am Ric… FINANCE 3.8 93 22M 1000 Paid 400. Everyone
8 I am ric… FINANCE 3.5 472 965k 5000 Paid 400. Everyone
9 I Am Ric… FAMILY 4.4 201 2.7M 5000 Paid 400. Everyone
10 I am ric… FINANCE 4.1 129 2.7M 1000 Paid 400. Teen
11 I am Rich FINANCE 4.3 180 3.8M 5000 Paid 400. Everyone
12 I AM RIC… FINANCE 4 36 41M 1000 Paid 400. Everyone
13 I'm Rich… LIFESTY… NaN 0 40M 0 Paid 400. Everyone
# ℹ 4 more variables: Genres <chr>, `Last Updated` <date>, `Current Ver` <chr>,
# `Android Ver` <chr>
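The query asks for 10 rows but returns 13 because rows tied on Price are all kept. A minimal base R sketch of the same tie behaviour, using hypothetical prices and a rank with ties.method = "min":

```r
# Toy prices (hypothetical): several apps share the top price
apps <- data.frame(
  App   = c("A", "B", "C", "D", "E"),
  Price = c(399.99, 399.99, 399.99, 4.99, 0.99),
  stringsAsFactors = FALSE
)

# Rank from the highest price down; tied prices share the smallest rank
price_rank <- rank(-apps$Price, ties.method = "min")

# Asking for the top 2 still returns 3 rows, because three apps tie for rank 1
top_two <- apps[price_rank <= 2, ]

print(top_two)
```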
To get a more reliable plot, we only consider apps priced less than $20. The following query results in a connected scatter plot to visualize this.
googleplaystore %>%
filter(!str_detect(Size, "Varies with device") & Type == "Paid" & Price < 20) %>%
mutate(Size_KB = ifelse(grepl("M", Size), as.numeric(gsub("[^0-9.]", "", Size)) * 1024, as.numeric(gsub("[^0-9.]", "", Size)))) %>% # "." kept so decimals such as 8.7 survive
ggplot(aes(x=Size_KB, y=Price)) +
geom_line() +
geom_point() +
labs(title = "Connected Scatterplot\nApp Price vs App Size", x = "App Size", y = "App Price") +
scale_y_continuous(labels = scales::label_number(prefix = "$")) +
scale_x_continuous(labels = scales::label_number(scale = 1/1024, suffix = "MB")) +
theme(plot.title = element_text(hjust=0.5),
axis.text.x = element_text(angle = 45, hjust = 1))
From the above plot, we see that users do indeed tend to pay for apps that are lighter.
Relationships between the Price, Rating and Installs variables can tell us about the optimal pricing for apps.
The Price and Rating variables are binned into groups for better visibility.
googleplaystore_price_rating_grouped <- googleplaystore %>%
filter(Type == "Paid" & !is.na(Rating)) %>% # is.na() is TRUE for both NA and NaN ratings
mutate(Price_Group = case_when(
Price < 1 ~ "< $1",
between(Price, 1, 10) ~ "$1 - $10",
between(Price, 10, 20) ~ "$10 - $20",
Price > 20 ~ "> $20",
),
Rating_Group = case_when(
Rating < 1 ~ "0.0 - 1.0",
between(Rating, 1, 2) ~ "1.0 - 2.0",
between(Rating, 2.0, 3.0) ~ "2.0 - 3.0",
between(Rating, 3.0, 4.0) ~ "3.0 - 4.0",
Rating > 4 ~ "4.0 - 5.0"))
googleplaystore_price_rating_grouped %>%
ggplot(aes(x=Price_Group)) +
geom_bar(fill="lightblue") +
scale_fill_brewer(palette = "Set3") +
facet_wrap(~ Rating_Group, scales = "free") +
labs(title = "Faceted Bar Plot\nNumber of Apps within a Price Group for a Rating group", x = "Price Group", y = "Number of Apps") +
theme_minimal() +
theme(plot.title = element_text(hjust=0.5))
We see that while apps in the $1 - $10 price group are spread across the rating groups, apps priced above $20 or below $1 tend to receive noticeably higher ratings on the Play Store. This suggests that apps should be priced either high or very low.
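Because between() includes both end points and case_when() uses the first condition that matches, the boundary values 2, 3 and 4 are assigned to the lower rating group. A base R sketch of the same ordering logic, using hypothetical ratings and nested ifelse() calls:

```r
# Hypothetical ratings, including values that sit exactly on a boundary
rating <- c(0.5, 1, 2, 3.5, 4, 4.1)

# Conditions are checked in order, so a rating of exactly 4 stops at "3.0 - 4.0"
rating_group <- ifelse(rating < 1,  "0.0 - 1.0",
                ifelse(rating <= 2, "1.0 - 2.0",
                ifelse(rating <= 3, "2.0 - 3.0",
                ifelse(rating <= 4, "3.0 - 4.0", "4.0 - 5.0"))))

print(data.frame(rating, rating_group))
```

This matters when reading the facets: the "4.0 - 5.0" group contains only apps rated strictly above 4.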
A follow-up question is whether there is a pricing threshold above which users prefer not to download apps. For brevity, we only consider apps with more than 1,000 and fewer than 10 million installs. The following grouped bar plot can help answer this: the mean price for each rating group is plotted, and the plot is faceted by the number of installs.
googleplaystore_price_rating_grouped %>%
filter(Installs > 1000 & Installs < 10000000) %>%
group_by(Rating_Group, Installs) %>%
summarize(mean_price = mean(Price))%>%
ggplot(aes(fill=Rating_Group, y=mean_price, x=Rating_Group)) +
geom_bar(position="dodge", stat="identity") +
facet_wrap(~ Installs, scales = "free") +
labs(title = "Faceted Group Bar Plot\nMean Price for a Rating group\nVaried by Number of Installs", x = "Rating Group", y = "Mean Price") +
theme(plot.title = element_text(hjust=0.5),
axis.text.x = element_blank())
The above plot leads us to infer that users tend to download apps and leave a better rating if the app is priced close to $3. An interesting observation is that apps with a mean price of about $65 do not get a great rating, yet they are installed much more than apps in the mid-price range ($10 - $30).
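Each bar in the plot is a mean of Price within one combination of rating group and install bracket. A minimal base R sketch of that grouped mean, using aggregate() on hypothetical rows (the values are made up for illustration):

```r
# Toy data (hypothetical rating groups, install brackets and prices)
apps <- data.frame(
  Rating_Group = c("3.0 - 4.0", "3.0 - 4.0", "4.0 - 5.0", "4.0 - 5.0", "4.0 - 5.0"),
  Installs     = c(5000, 5000, 5000, 10000, 10000),
  Price        = c(2.99, 4.99, 0.99, 1.99, 3.99),
  stringsAsFactors = FALSE
)

# One mean price per (Rating_Group, Installs) combination that occurs in the data
mean_price <- aggregate(Price ~ Rating_Group + Installs, data = apps, FUN = mean)

print(mean_price)
```

Combinations with very few apps produce means that are sensitive to a single price, which is worth keeping in mind when comparing bars.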
The Last Updated variable provides the date when the latest update of the app was published on the Google Play Store. Do apps that have not been updated recently receive lower user ratings? Do they tend to be installed less? It could also be that Last Updated has no impact on any of the popularity metrics.
googleplaystore %>%
filter(Type == "Paid" & Installs > 0) %>%
ggplot(aes(x=`Last Updated`, y=Installs)) +
geom_line() +
labs(title = "Line Plot\nApp Installs varied by Last Update date", x = "Date Last Updated", y = "App Installs") +
scale_y_continuous(labels = scales::label_number(scale=1e-6, suffix = "million")) +
theme(plot.title = element_text(hjust=0.5))
From the above plot, we observe that apps with more recent updates (the data was collected in 2018) tend to have a significantly higher number of installs. Although only a few apps last updated within 4 years of the data collection have a high number of installs, almost all apps last updated more than 4 years earlier have very few installs.
A follow-up question is whether the last updated date of the app affects the ratings it receives. We plot a heatmap of the number of apps falling within a particular rating group as varied by the year they were last updated.
googleplaystore %>%
filter(!is.na(Rating)) %>% # is.na() is TRUE for both NA and NaN ratings
mutate(
Rating_Group = case_when(
Rating < 1 ~ "0.0 - 1.0",
between(Rating, 1, 2) ~ "1.0 - 2.0",
between(Rating, 2.0, 3.0) ~ "2.0 - 3.0",
between(Rating, 3.0, 4.0) ~ "3.0 - 4.0",
Rating > 4 ~ "4.0 - 5.0"),
Year = year(`Last Updated`)) %>%
group_by(Year, Rating_Group) %>%
summarize(App_Count = n()) %>%
ggplot(aes(x = Year, y = Rating_Group, fill = App_Count)) +
geom_tile() +
scale_fill_viridis() +
theme_minimal() +
labs(title = "Heat Map\nApp Ratings varied by Last Update Year", x = "Year Last Updated", y = "Rating Group", fill = "App Count") +
theme(plot.title = element_text(hjust=0.5))
Our finding from the above plot is that the majority of apps rated between 4.0 and 5.0 on the Play Store were recently updated.
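The counts behind the heat map are a cross-tabulation of update year against rating group. A minimal base R sketch with hypothetical dates and groups (the `Last_Updated` column name is an assumption made to keep the example free of backticks):

```r
# Hypothetical update dates and rating groups
apps <- data.frame(
  Last_Updated = as.Date(c("2018-07-13", "2018-03-07", "2017-12-11", "2015-08-07")),
  Rating_Group = c("4.0 - 5.0", "4.0 - 5.0", "3.0 - 4.0", "4.0 - 5.0"),
  stringsAsFactors = FALSE
)

# Extract the calendar year from each date
apps$Year <- as.integer(format(apps$Last_Updated, "%Y"))

# Cross-tabulate year against rating group; each cell corresponds to one tile
tile_counts <- table(Year = apps$Year, Rating_Group = apps$Rating_Group)

print(tile_counts)
```

Unlike the grouped summary used for the plot, table() also reports combinations with zero apps, which appear as blank tiles in the heat map.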
The following are some of the limitations of the current analysis:
- The App Size variable can be converted into a factor with only a few possible thresholds. Since close to 1,000 observations are plotted in each graph, the data points tend to get congested in the current plots.
- The Price variable is currently divided into 4 categories. A finer granularity can be used for apps priced >$20, since that division contains about 40 apps.
- Research questions involving Android Ver, Genres and Category are left unanswered. Additionally, another interesting research question is the correlation between the number of reviews and the number of installs for an app.
- Comments need to be added to all code blocks.
- Color-blind-safe colors for the plots and further visual improvements using the khroma package.
These improvements will be taken up in the final project report.