library(tidyverse)
library(readxl)
library(here)
library(dplyr)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
The RStudio working directory has been set with setwd() so that “googleplaystore.csv” can be found at its root.
googleplaystore <- read_csv(here("googleplaystore.csv"))
googleplaystore
# A tibble: 10,841 × 13
App Category Rating Reviews Size Installs Type Price `Content Rating`
<chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
1 Photo Ed… ART_AND… 4.1 159 19M 10,000+ Free 0 Everyone
2 Coloring… ART_AND… 3.9 967 14M 500,000+ Free 0 Everyone
3 U Launch… ART_AND… 4.7 87510 8.7M 5,000,0… Free 0 Everyone
4 Sketch -… ART_AND… 4.5 215644 25M 50,000,… Free 0 Teen
5 Pixel Dr… ART_AND… 4.3 967 2.8M 100,000+ Free 0 Everyone
6 Paper fl… ART_AND… 4.4 167 5.6M 50,000+ Free 0 Everyone
7 Smoke Ef… ART_AND… 3.8 178 19M 50,000+ Free 0 Everyone
8 Infinite… ART_AND… 4.1 36815 29M 1,000,0… Free 0 Everyone
9 Garden C… ART_AND… 4.4 13791 33M 1,000,0… Free 0 Everyone
10 Kids Pai… ART_AND… 4.7 121 3.1M 10,000+ Free 0 Everyone
# ℹ 10,831 more rows
# ℹ 4 more variables: Genres <chr>, `Last Updated` <chr>, `Current Ver` <chr>,
# `Android Ver` <chr>
The dataset has 10,841 observations and 13 variables. However, it appears to contain duplicate rows, as the count of distinct rows below shows.
n_distinct(googleplaystore)
[1] 10358
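As a rough illustration of how the number of duplicate rows follows from these two counts, here is a minimal sketch on a toy tibble. The `toy` data and its values are made up for illustration and are not taken from the Play Store file.

```r
library(dplyr)

# Toy data standing in for the real file; values are illustrative only.
toy <- tibble(
  App     = c("Alpha", "Alpha", "Beta", "Beta", "Gamma"),
  Reviews = c(10, 10, 20, 25, 30)
)

n_total    <- nrow(toy)            # all rows
n_unique   <- nrow(distinct(toy))  # rows that differ in at least one column
n_dup_rows <- n_total - n_unique   # exact copies of an earlier row

# duplicated() flags the second and later copies of a full row
dup_flags <- duplicated(toy)
sum(dup_flags)
```

Applied to the real data, the same subtraction would give 10,841 - 10,358 = 483 fully duplicated rows.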
The following query de-duplicates the dataset by dropping rows that match on every column:
googleplaystore %>% distinct()
# A tibble: 10,358 × 13
App Category Rating Reviews Size Installs Type Price `Content Rating`
<chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
1 Photo Ed… ART_AND… 4.1 159 19M 10,000+ Free 0 Everyone
2 Coloring… ART_AND… 3.9 967 14M 500,000+ Free 0 Everyone
3 U Launch… ART_AND… 4.7 87510 8.7M 5,000,0… Free 0 Everyone
4 Sketch -… ART_AND… 4.5 215644 25M 50,000,… Free 0 Teen
5 Pixel Dr… ART_AND… 4.3 967 2.8M 100,000+ Free 0 Everyone
6 Paper fl… ART_AND… 4.4 167 5.6M 50,000+ Free 0 Everyone
7 Smoke Ef… ART_AND… 3.8 178 19M 50,000+ Free 0 Everyone
8 Infinite… ART_AND… 4.1 36815 29M 1,000,0… Free 0 Everyone
9 Garden C… ART_AND… 4.4 13791 33M 1,000,0… Free 0 Everyone
10 Kids Pai… ART_AND… 4.7 121 3.1M 10,000+ Free 0 Everyone
# ℹ 10,348 more rows
# ℹ 4 more variables: Genres <chr>, `Last Updated` <chr>, `Current Ver` <chr>,
# `Android Ver` <chr>
However, this does not handle the case where the same app appears in several rows that differ only in other columns, such as Reviews:
googleplaystore %>%
  distinct() %>%
  select(App, Category, Rating, Reviews) %>%
  arrange(desc(Reviews))
# A tibble: 10,358 × 4
App Category Rating Reviews
<chr> <chr> <dbl> <dbl>
1 Facebook SOCIAL 4.1 78158306
2 Facebook SOCIAL 4.1 78128208
3 WhatsApp Messenger COMMUNICATION 4.4 69119316
4 WhatsApp Messenger COMMUNICATION 4.4 69109672
5 Instagram SOCIAL 4.5 66577446
6 Instagram SOCIAL 4.5 66577313
7 Instagram SOCIAL 4.5 66509917
8 Messenger – Text and Video Chat for Free COMMUNICATION 4 56646578
9 Messenger – Text and Video Chat for Free COMMUNICATION 4 56642847
10 Clash of Clans GAME 4.6 44893888
# ℹ 10,348 more rows
The following query de-duplicates the dataset on the App variable, keeping the first row found for each app name:
googleplaystore_distinct <- googleplaystore %>%
  distinct(App, .keep_all = TRUE)
googleplaystore_distinct
# A tibble: 9,660 × 13
App Category Rating Reviews Size Installs Type Price `Content Rating`
<chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
1 Photo Ed… ART_AND… 4.1 159 19M 10,000+ Free 0 Everyone
2 Coloring… ART_AND… 3.9 967 14M 500,000+ Free 0 Everyone
3 U Launch… ART_AND… 4.7 87510 8.7M 5,000,0… Free 0 Everyone
4 Sketch -… ART_AND… 4.5 215644 25M 50,000,… Free 0 Teen
5 Pixel Dr… ART_AND… 4.3 967 2.8M 100,000+ Free 0 Everyone
6 Paper fl… ART_AND… 4.4 167 5.6M 50,000+ Free 0 Everyone
7 Smoke Ef… ART_AND… 3.8 178 19M 50,000+ Free 0 Everyone
8 Infinite… ART_AND… 4.1 36815 29M 1,000,0… Free 0 Everyone
9 Garden C… ART_AND… 4.4 13791 33M 1,000,0… Free 0 Everyone
10 Kids Pai… ART_AND… 4.7 121 3.1M 10,000+ Free 0 Everyone
# ℹ 9,650 more rows
# ℹ 4 more variables: Genres <chr>, `Last Updated` <chr>, `Current Ver` <chr>,
# `Android Ver` <chr>
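Because distinct() keeps the first row it encounters for each App, the retained row is not necessarily the one with the most recent review count. One possible alternative, sketched below on made-up data, keeps the row with the highest Reviews for each app. This is only an illustration of the idea and is not the approach used in the rest of this analysis.

```r
library(dplyr)

# Toy data; the app names and review counts are hypothetical.
toy <- tibble(
  App     = c("Alpha", "Alpha", "Beta"),
  Reviews = c(100, 120, 50)
)

# Approach used above: keep the first row seen for each App
first_kept <- toy %>%
  distinct(App, .keep_all = TRUE)

# Alternative: keep the row with the largest review count for each App
most_reviewed <- toy %>%
  group_by(App) %>%
  slice_max(Reviews, n = 1, with_ties = FALSE) %>%
  ungroup()
```

Here `first_kept` retains the "Alpha" row with 100 reviews, while `most_reviewed` retains the one with 120.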
Further, the dataset contains missing values. Rows with a missing value in any column can be dropped using the following query.
googleplaystore_cleaned <- googleplaystore_distinct %>% drop_na()
The cleaned dataset comprises 8,195 rows and 13 columns.
googleplaystore_cleaned
# A tibble: 8,195 × 13
App Category Rating Reviews Size Installs Type Price `Content Rating`
<chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
1 Photo Ed… ART_AND… 4.1 159 19M 10,000+ Free 0 Everyone
2 Coloring… ART_AND… 3.9 967 14M 500,000+ Free 0 Everyone
3 U Launch… ART_AND… 4.7 87510 8.7M 5,000,0… Free 0 Everyone
4 Sketch -… ART_AND… 4.5 215644 25M 50,000,… Free 0 Teen
5 Pixel Dr… ART_AND… 4.3 967 2.8M 100,000+ Free 0 Everyone
6 Paper fl… ART_AND… 4.4 167 5.6M 50,000+ Free 0 Everyone
7 Smoke Ef… ART_AND… 3.8 178 19M 50,000+ Free 0 Everyone
8 Infinite… ART_AND… 4.1 36815 29M 1,000,0… Free 0 Everyone
9 Garden C… ART_AND… 4.4 13791 33M 1,000,0… Free 0 Everyone
10 Kids Pai… ART_AND… 4.7 121 3.1M 10,000+ Free 0 Everyone
# ℹ 8,185 more rows
# ℹ 4 more variables: Genres <chr>, `Last Updated` <chr>, `Current Ver` <chr>,
# `Android Ver` <chr>
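Since drop_na() with no arguments removes a row if any column is missing, it can help to check which columns are responsible before dropping. The sketch below uses a small made-up tibble (not the real data) to count missing values per column and to contrast dropping on all columns with dropping on a single column.

```r
library(dplyr)
library(tidyr)

# Toy data with a few missing values; contents are illustrative only.
toy <- tibble(
  App    = c("Alpha", "Beta", "Gamma"),
  Rating = c(4.1, NA, 3.9),
  Type   = c("Free", "Paid", NA)
)

# Number of missing values in each column
na_counts <- toy %>%
  summarize(across(everything(), ~ sum(is.na(.x))))

# Drop rows with a missing value in any column (as done above)
complete_rows <- toy %>% drop_na()

# Drop rows only when Rating is missing
rating_only <- toy %>% drop_na(Rating)
```

Restricting drop_na() to the columns an analysis actually needs would retain more rows, at the cost of having to handle the remaining missing values later.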
The dataset has 11 columns of type <chr>; the remaining two columns (Rating and Reviews) are of type <dbl>. Each observation pertains to a single app. The following are the descriptions of each of the variables in the dataset:
App variable lists the name of the app under observation.
Category variable consists of the category the app under observation is grouped into.
Rating variable lists the rating of the app on the Google Play Store.
Reviews variable represents the number of user reviews the app received on the Google Play Store.
Size variable consists of the size of the app (either in KB, MB or varying by device).
Installs variable gives the number of installs of the app, reported as a lower bound such as “10,000+”.
Type variable lists whether an app is “Free” or “Paid”.
Price variable gives the price of the app if the Type is “Paid” and 0 for “Free” apps.
Content Rating variable gives the user group the app is suitable for.
Genres variable represents which genres the app falls under. This is similar to the Category variable, but an app can belong to multiple genres (apart from its main category). For example, a musical family game will belong to the Music, Game and Family genres.
Last Updated variable lists the date on which the latest update was published on the Google Play Store.
Current Ver variable gives the currently published version of the app.
Android Ver variable lists the target Android version of the device the app will work on.
The dataset is a published Kaggle dataset. It was collected by web-scraping Google Play Store listings in 2018. The detail page for an app provides all of the information presented in the dataset.
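Because Size is stored as text, it would need to be converted before any numeric analysis. The following is a minimal sketch of one way to do that, assuming sizes end in "M" for megabytes or "k" for kilobytes and that device-dependent sizes appear as the string "Varies with device". These encodings, the toy values and the 1024 KB per MB conversion are all assumptions and should be checked against the actual file.

```r
library(dplyr)
library(readr)

# Toy values; the "k" suffix and the exact "Varies with device" string
# are assumptions about how the file encodes sizes.
toy <- tibble(Size = c("19M", "8.7M", "201k", "Varies with device"))

toy_sized <- toy %>%
  mutate(
    # Numeric part of the string; device-dependent sizes become NA
    size_value = parse_number(Size, na = "Varies with device"),
    # Convert everything to megabytes (assuming 1 MB = 1024 KB)
    size_mb = case_when(
      grepl("M$", Size) ~ size_value,
      grepl("k$", Size) ~ size_value / 1024,
      TRUE              ~ NA_real_
    )
  )
```

Rows whose size varies by device end up as NA in `size_mb`, so they would have to be excluded or handled separately in any size-based analysis.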
The following query gives the total apps under each category:
googleplaystore_cleaned %>%
  group_by(Category) %>%
  summarize(total_apps = n()) %>%
  arrange(desc(total_apps))
# A tibble: 33 × 2
Category total_apps
<chr> <int>
1 FAMILY 1608
2 GAME 912
3 TOOLS 718
4 FINANCE 302
5 LIFESTYLE 301
6 PRODUCTIVITY 301
7 PERSONALIZATION 298
8 MEDICAL 290
9 BUSINESS 263
10 PHOTOGRAPHY 263
# ℹ 23 more rows
We see that the “FAMILY” category has the largest number of apps in the dataset.
The following query summarizes the mean, median and standard deviation of the Rating variable across all apps.
googleplaystore_cleaned %>%
  summarize(mean_rating = mean(Rating),
            median_rating = median(Rating),
            sd_rating = sd(Rating))
# A tibble: 1 × 3
mean_rating median_rating sd_rating
<dbl> <dbl> <dbl>
1 4.17 4.3 0.537
The above summary tibble shows that most of the apps are well-rated: the mean rating is 4.17 and the median is 4.3. Rating statistics grouped by app category can be computed using the following query, with results arranged in descending order of mean rating:
googleplaystore_cleaned %>%
  group_by(Category) %>%
  summarize(mean_rating = mean(Rating),
            median_rating = median(Rating),
            sd_rating = sd(Rating)) %>%
  arrange(desc(mean_rating))
# A tibble: 33 × 4
Category mean_rating median_rating sd_rating
<chr> <dbl> <dbl> <dbl>
1 EVENTS 4.44 4.5 0.419
2 EDUCATION 4.36 4.4 0.264
3 ART_AND_DESIGN 4.36 4.4 0.361
4 BOOKS_AND_REFERENCE 4.34 4.5 0.438
5 PERSONALIZATION 4.33 4.4 0.359
6 PARENTING 4.3 4.4 0.518
7 BEAUTY 4.28 4.3 0.363
8 GAME 4.25 4.3 0.384
9 SOCIAL 4.25 4.3 0.457
10 WEATHER 4.24 4.3 0.338
# ℹ 23 more rows
We observe that apps categorized as “EVENTS” have the highest mean rating in the dataset.
The number of Reviews is an implicit indicator of the popularity of an app. The following query summarizes the mean, median and standard deviation of the Reviews variable across all apps.
googleplaystore_cleaned %>%
  summarize(mean_reviews = mean(Reviews),
            median_reviews = median(Reviews),
            sd_reviews = sd(Reviews))
# A tibble: 1 × 3
mean_reviews median_reviews sd_reviews
<dbl> <dbl> <dbl>
1 255280. 3003 1985713.
From the above tibble we observe that mean_reviews is far larger than median_reviews, which implies that a few outlier apps have a huge number of reviews. The following query lists the apps with the most reviews:
googleplaystore_cleaned %>%
  select(App, Reviews) %>%
  arrange(desc(Reviews)) %>%
  head()
# A tibble: 6 × 2
App Reviews
<chr> <dbl>
1 Facebook 78158306
2 WhatsApp Messenger 69119316
3 Instagram 66577313
4 Messenger – Text and Video Chat for Free 56642847
5 Clash of Clans 44891723
6 Clean Master- Space Cleaner & Antivirus 42916526
The above tibble lists the apps with the highest number of reviews. Review statistics can also be grouped by app category using the following query, with results arranged in descending order of mean reviews:
googleplaystore_cleaned %>%
  group_by(Category) %>%
  summarize(mean_reviews = mean(Reviews),
            median_reviews = median(Reviews),
            sd_reviews = sd(Reviews)) %>%
  arrange(desc(mean_reviews))
# A tibble: 33 × 4
Category mean_reviews median_reviews sd_reviews
<chr> <dbl> <dbl> <dbl>
1 SOCIAL 1122795. 9606 7317698.
2 COMMUNICATION 1116449. 15162. 5900344.
3 GAME 682342. 32840 2561632.
4 VIDEO_PLAYERS 455973. 6567 2340948.
5 PHOTOGRAPHY 400575. 31985 1181863.
6 ENTERTAINMENT 340810. 37884. 1036260.
7 TOOLS 319437. 1038. 2195072.
8 SHOPPING 247509. 20362. 867401.
9 PRODUCTIVITY 184686. 6752 556558.
10 PERSONALIZATION 179674. 1516. 835898.
# ℹ 23 more rows
This matches our previous finding: the most-reviewed apps listed earlier (Facebook, WhatsApp Messenger, Instagram and Clash of Clans) belong to these categories. Apps published within the “SOCIAL”, “COMMUNICATION” and “GAME” categories tend to have the highest mean review counts even though they do not have the highest mean ratings. This may be because apps in these categories have more users, so their ratings reflect a broader range of opinions and are less skewed.
The following query summarizes the number of apps published under each category as well as their proportion of the total number of apps.
googleplaystore_cleaned %>%
  group_by(Category) %>%
  summarize(app_count = n(),
            app_proportion = n() / nrow(googleplaystore_cleaned)) %>%
  arrange(desc(app_count))
# A tibble: 33 × 3
Category app_count app_proportion
<chr> <int> <dbl>
1 FAMILY 1608 0.196
2 GAME 912 0.111
3 TOOLS 718 0.0876
4 FINANCE 302 0.0369
5 LIFESTYLE 301 0.0367
6 PRODUCTIVITY 301 0.0367
7 PERSONALIZATION 298 0.0364
8 MEDICAL 290 0.0354
9 BUSINESS 263 0.0321
10 PHOTOGRAPHY 263 0.0321
# ℹ 23 more rows
From the above tibble we observe that the largest share of apps in our dataset (about 20%) is published under the “FAMILY” category.
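The same count-and-proportion summary can also be written more compactly with dplyr's count(). The sketch below shows the idea on a made-up tibble; `toy` and its values are illustrative only.

```r
library(dplyr)

# Toy data; the category labels are reused for illustration only.
toy <- tibble(Category = c("FAMILY", "FAMILY", "GAME", "TOOLS"))

category_share <- toy %>%
  # count() groups, tallies and (with sort = TRUE) orders in one step
  count(Category, sort = TRUE, name = "app_count") %>%
  # sum(app_count) is the total number of rows, so the shares add up to 1
  mutate(app_proportion = app_count / sum(app_count))
```

Dividing by `sum(app_count)` avoids referring back to the original data frame by name, which can make the pipeline easier to reuse on a filtered subset.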
App Type (Free or Paid) is another interesting variable. The following query gives the count and proportion of apps of each Type:
googleplaystore_cleaned %>%
  group_by(Type) %>%
  summarize(count = n(),
            proportion = n() / nrow(googleplaystore_cleaned))
# A tibble: 2 × 3
Type count proportion
<chr> <int> <dbl>
1 Free 7591 0.926
2 Paid 604 0.0737
We observe that about 93% of the apps in our dataset are “Free”. Another interesting statistic to observe is the most downloaded app for each Type:
googleplaystore_cleaned %>%
  mutate(Installs = parse_number(gsub("[+,]", "", Installs))) %>%
  group_by(Type) %>%
  filter(Installs == max(Installs)) %>%
  arrange(desc(Type))
# A tibble: 22 × 13
# Groups: Type [2]
App Category Rating Reviews Size Installs Type Price `Content Rating`
<chr> <chr> <dbl> <dbl> <chr> <dbl> <chr> <chr> <chr>
1 Minecraft FAMILY 4.5 2.38e6 Vari… 1e7 Paid $6.99 Everyone 10+
2 Hitman S… GAME 4.6 4.08e5 29M 1e7 Paid $0.99 Mature 17+
3 Google P… BOOKS_A… 3.9 1.43e6 Vari… 1e9 Free 0 Teen
4 Messenge… COMMUNI… 4 5.66e7 Vari… 1e9 Free 0 Everyone
5 WhatsApp… COMMUNI… 4.4 6.91e7 Vari… 1e9 Free 0 Everyone
6 Google C… COMMUNI… 4.3 9.64e6 Vari… 1e9 Free 0 Everyone
7 Gmail COMMUNI… 4.3 4.60e6 Vari… 1e9 Free 0 Everyone
8 Hangouts COMMUNI… 4 3.42e6 Vari… 1e9 Free 0 Everyone
9 Skype - … COMMUNI… 4.1 1.05e7 Vari… 1e9 Free 0 Everyone
10 Google P… ENTERTA… 4.3 7.17e6 Vari… 1e9 Free 0 Teen
# ℹ 12 more rows
# ℹ 4 more variables: Genres <chr>, `Last Updated` <chr>, `Current Ver` <chr>,
# `Android Ver` <chr>
We observe that “Minecraft” and “Hitman Sniper” are the most installed “Paid” apps, with 10 million or more installs each, while the remaining apps are the most installed “Free” apps, with a billion or more installs each. An interesting observation is that a majority of the most downloaded “Free” apps on the Google Play Store are owned by Google itself!
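The query returns several rows per Type because Installs is a binned lower bound, so many apps tie for the maximum. A minimal sketch of the same idea with slice_max(), which keeps ties by default, is shown below on made-up data. It also relies on parse_number() ignoring the grouping commas and the trailing "+" on its own.

```r
library(dplyr)
library(readr)

# Toy data; app names and install buckets are hypothetical.
toy <- tibble(
  App      = c("A", "B", "C", "D"),
  Type     = c("Free", "Free", "Paid", "Paid"),
  Installs = c("1,000,000+", "500+", "10,000+", "10,000+")
)

top_by_type <- toy %>%
  mutate(installs_num = parse_number(Installs)) %>%
  group_by(Type) %>%
  # with_ties = TRUE (the default) keeps every app tied for the maximum
  slice_max(installs_num, n = 1) %>%
  ungroup()
```

In this toy example both "Paid" apps share the top bucket, so `top_by_type` contains three rows rather than two.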
The following are some of the potential research questions that the dataset can be used to answer:
Does the Size of an app relate to the number of Installs? Users are generally assumed to avoid installing bulky apps. Yet in our observations above, “GAME”, a category whose apps tend to be larger in size, was among the most popular categories. An important research question is whether there is an optimal app size that maximizes the number of installs. A follow-up question is whether some categories, such as “GAME”, are outliers whose apps are both large and widely installed. Do users pay more for apps that are lighter?
What is the best monetization strategy for “Paid” apps? Relationships between the Price, Rating and Installs variables can tell us how pricing relates to an app’s ratings and installs. Is there a pricing threshold above which users avoid downloading apps? Answering this question would help deduce the best monetization strategy for a new app.
Are apps that target older Android versions less popular? If an app targets an older Android Ver, it most likely has backward-compatible components and is less likely to use the newest Android features. Are these apps less popular in terms of ratings, reviews and number of installs? An alternative hypothesis is that apps targeting newer Android versions have not been on the Play Store for long and might consequently be less popular. These follow-up questions can be answered by analyzing the dataset.
How does the distribution of Genres vary across categories? A genre might appear under multiple categories. An important question is which genre belongs to the largest number of categories. Is this genre also popular among users? Which genres tend to be the least popular among users? Which genres have the highest-priced apps?
How does the last update date of an app impact its popularity among users? The Last Updated variable provides the date when the latest update of the app was published on the Google Play Store. Do apps that have not been updated recently receive lower user ratings? Do these apps tend to be installed less? It could also be the case that Last Updated has no impact on any of the popularity metrics.
Does the number of reviews positively correlate with the number of installs? Active users tend to leave reviews, but does that also mean that the app is installed more or has a higher rating on the Play Store? This might be especially important for paid apps, since users tend to read other people’s reviews before deciding to pay for an app.
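Most of the questions above depend on columns that are currently stored as text. The following is a minimal sketch of the preparation they might need, run on a made-up tibble rather than the real data. The "$" price prefix matches the output shown earlier, but the ";" separator inside Genres and the "Month day, year" format of Last Updated are assumptions that should be verified against the file. Spearman's rank correlation is used here only because Installs is a binned, ordinal-like value; it is one reasonable choice, not the only one.

```r
library(dplyr)
library(tidyr)
library(readr)

# Toy rows standing in for googleplaystore_cleaned; all values are made up.
toy <- tibble(
  App            = c("A", "B", "C", "D"),
  Category       = c("FAMILY", "GAME", "FAMILY", "TOOLS"),
  Reviews        = c(120, 45000, 900, 15),
  Installs       = c("10,000+", "1,000,000+", "50,000+", "1,000+"),
  Price          = c("0", "$2.99", "0", "$0.99"),
  Genres         = c("Music;Game", "Game", "Education", "Tools"),
  `Last Updated` = c("January 7, 2018", "July 31, 2018",
                     "March 3, 2017", "June 15, 2018")
)

# Convert the text columns into types that can be analysed
prepared <- toy %>%
  mutate(
    installs_num = parse_number(Installs),   # "10,000+" -> 10000
    price_num    = parse_number(Price),      # "$2.99"   -> 2.99
    # Assumed date format: full month name, day, four-digit year
    last_updated = parse_date(`Last Updated`, format = "%B %d, %Y")
  )

# Reviews vs. installs: rank correlation, since installs are binned
rho <- cor(prepared$Reviews, prepared$installs_num, method = "spearman")

# Genres vs. categories: one row per (app, genre), then count categories
genre_spread <- prepared %>%
  separate_rows(Genres, sep = ";") %>%
  group_by(Genres) %>%
  summarize(n_categories = n_distinct(Category),
            n_apps = n()) %>%
  arrange(desc(n_categories))
```

With columns such as `price_num`, `installs_num` and `last_updated` in place, the pricing, update-recency and genre questions could each be explored with the same group_by() and summarize() pattern used throughout this analysis.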