Introduction

This is an EDA for Google Play Store Applications, if you like this kernel or it was helpful, please upvote!

Load packages

for loading packages, I’m enjoying using the “pacman” package which is allow me to install and load the packages at once, as follows:

library(pacman)
p_load(tidyverse,highcharter,lubridate,stringr,xts)

Loading the data

Loading the data and take a look at it

##                                                  App       Category Rating
## 1     Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN    4.1
## 2                                Coloring book moana ART_AND_DESIGN    3.9
## 3 U Launcher Lite – FREE Live Cool Themes, Hide Apps ART_AND_DESIGN    4.7
## 4                              Sketch - Draw & Paint ART_AND_DESIGN    4.5
## 5              Pixel Draw - Number Art Coloring Book ART_AND_DESIGN    4.3
## 6                         Paper flowers instructions ART_AND_DESIGN    4.4
##   Reviews Size    Installs Type Price Content.Rating                    Genres
## 1     159  19M     10,000+ Free     0       Everyone              Art & Design
## 2     967  14M    500,000+ Free     0       Everyone Art & Design;Pretend Play
## 3   87510 8.7M  5,000,000+ Free     0       Everyone              Art & Design
## 4  215644  25M 50,000,000+ Free     0           Teen              Art & Design
## 5     967 2.8M    100,000+ Free     0       Everyone   Art & Design;Creativity
## 6     167 5.6M     50,000+ Free     0       Everyone              Art & Design
##       Last.Updated        Current.Ver  Android.Ver
## 1  January 7, 2018              1.0.0 4.0.3 and up
## 2 January 15, 2018              2.0.0 4.0.3 and up
## 3   August 1, 2018              1.2.4 4.0.3 and up
## 4     June 8, 2018 Varies with device   4.2 and up
## 5    June 20, 2018                1.1   4.4 and up
## 6   March 26, 2017                1.0   2.3 and up

Let’s see the structure of this data

str(data)
## 'data.frame':    10841 obs. of  13 variables:
##  $ App           : chr  "Photo Editor & Candy Camera & Grid & ScrapBook" "Coloring book moana" "U Launcher Lite – FREE Live Cool Themes, Hide Apps" "Sketch - Draw & Paint" ...
##  $ Category      : chr  "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" ...
##  $ Rating        : num  4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 ...
##  $ Reviews       : chr  "159" "967" "87510" "215644" ...
##  $ Size          : chr  "19M" "14M" "8.7M" "25M" ...
##  $ Installs      : chr  "10,000+" "500,000+" "5,000,000+" "50,000,000+" ...
##  $ Type          : chr  "Free" "Free" "Free" "Free" ...
##  $ Price         : chr  "0" "0" "0" "0" ...
##  $ Content.Rating: chr  "Everyone" "Everyone" "Everyone" "Teen" ...
##  $ Genres        : chr  "Art & Design" "Art & Design;Pretend Play" "Art & Design" "Art & Design" ...
##  $ Last.Updated  : chr  "January 7, 2018" "January 15, 2018" "August 1, 2018" "June 8, 2018" ...
##  $ Current.Ver   : chr  "1.0.0" "2.0.0" "1.2.4" "Varies with device" ...
##  $ Android.Ver   : chr  "4.0.3 and up" "4.0.3 and up" "4.0.3 and up" "4.2 and up" ...

I see a lot of factor variables that should actually be numeric. So, Let’s do some data cleaning as follows:

data_clean <- data %>% 
  mutate(
    # Eliminate some characters to transform Installs to numeric
    Installs = gsub("\\+", "", as.character(Installs)),
    Installs = as.numeric(gsub(",", "", Installs)),
    # Eliminate M to transform Size to numeric
    Size = gsub("M", "", Size),
    # Replace cells with k to 0 since it is < 1MB
    Size = ifelse(grepl("k", Size), 0, as.numeric(Size)),
    # Transform reviews to numeric
    Reviews = as.numeric(Reviews),
    # Remove currency symbol from Price, change it to numeric
    Price = as.numeric(gsub("\\$", "", as.character(Price))),
    # Last update to date format
    Last_update = mdy(Last.Updated),
    # Replace "Varies with device" to NA since it is unknown
    Min_Android_Ver = gsub("Varies with device", NA, Android.Ver),
    # Keep only version number to 1 decimal
    Min_Android_Ver = as.numeric(substr(Min_Android_Ver, start = 1, stop = 3)),
    # Drop all Android version column
    Android.Ver = NULL
  ) %>% 
  filter(
    # Two apps had types as 0 or NA, they will be removed
    Type %in% c("Free","Paid")
  )

str(data_clean)
## 'data.frame':    10839 obs. of  14 variables:
##  $ App            : chr  "Photo Editor & Candy Camera & Grid & ScrapBook" "Coloring book moana" "U Launcher Lite – FREE Live Cool Themes, Hide Apps" "Sketch - Draw & Paint" ...
##  $ Category       : chr  "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" ...
##  $ Rating         : num  4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 ...
##  $ Reviews        : num  159 967 87510 215644 967 ...
##  $ Size           : num  19 14 8.7 25 2.8 5.6 19 29 33 3.1 ...
##  $ Installs       : num  1e+04 5e+05 5e+06 5e+07 1e+05 5e+04 5e+04 1e+06 1e+06 1e+04 ...
##  $ Type           : chr  "Free" "Free" "Free" "Free" ...
##  $ Price          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Content.Rating : chr  "Everyone" "Everyone" "Everyone" "Teen" ...
##  $ Genres         : chr  "Art & Design" "Art & Design;Pretend Play" "Art & Design" "Art & Design" ...
##  $ Last.Updated   : chr  "January 7, 2018" "January 15, 2018" "August 1, 2018" "June 8, 2018" ...
##  $ Current.Ver    : chr  "1.0.0" "2.0.0" "1.2.4" "Varies with device" ...
##  $ Last_update    : Date, format: "2018-01-07" "2018-01-15" ...
##  $ Min_Android_Ver: num  4 4 4 4.2 4.4 2.3 4 4.2 3 4 ...

EDA

Duplicate records

First,we need to check if there any duplicate rows (records)

nrow(data_clean %>% distinct())
## [1] 10356

The above should have listed 10841, so there are 483 rows duplicated. We’ll remove them.

data_clean <- data_clean %>% distinct()

NA Analysis

Let’s see if the data is complete, going through all columns to see if we find any NA values

data_clean %>% 
  summarise_all(funs(sum(is.na(.)))) %>% 
  gather() %>% 
  # Only show columns with NA
  filter(value>1) %>% 
  arrange(-value) %>% 
  hchart("column", hcaes(x = "key", y = "value", color = "key")) %>% 
  hc_add_theme(hc_theme_bloom()) %>% 
  hc_title(text = "Columns with NA Values")

We have 3 columns with NA values, some of them part of the data cleaning process above (Transforming factor variables to numeric). Let’s take a close look at these NA

NA values in Rating

Let’s take a look at how many installs applications with NA rating have:

data_clean %>% 
  filter(is.na(Rating)) %>% 
  count(Installs) %>% 
  arrange(-n) %>% 
  hchart("column", hcaes(x = "Installs", y = "n")) %>% 
  hc_add_theme(hc_theme_bloom()) %>% 
  hc_title(text = "Installations with no rating")

Most applications without rating have very few installs, which could make sense, If you have low amount of installs, people haven’t reviewed it yet. But even an application with 1 million installs is listed without rating. something maybe went wrong with scrapping?

NA values in Size

On the data cleaning process above, I saw “Varies with device” was present under “Size”. There were a total of 1695 applications with “Varies with device” which we transformed to NA. So, that explains NA in this column.

data %>% 
  filter(Size %in% "Varies with device") %>%
  count()
##      n
## 1 1695

NA values in Minimum Android Version

Same as above, with “Varies with device” label present. There were a total of 1362 applications with “Varies with device” which we transformed to NA. So, that explains NA in this column.

data %>% 
  filter(Android.Ver %in% "Varies with device") %>% count()
##      n
## 1 1362

Category

Most popular category, by number of installs

data_clean %>% 
  count(Category, Installs) %>% 
  group_by(Category) %>% 
  summarise(Total_Installs = sum(as.numeric(Installs))) %>% 
  arrange(-Total_Installs) %>% 
  hchart("scatter", hcaes(x = Category, y = Total_Installs, size = Total_Installs, color = Category)) %>% 
  hc_add_theme(hc_theme_bloom()) %>% 
  hc_title(text = "Most Popular categories (# of installs")

Application Size

Now let’s take a look at applications size (in MB)

data_clean %>% 
  count(Size) %>% 
  hchart("area", hcaes(x = Size, y = n)) %>% 
  hc_add_theme(hc_theme_bloom()) %>% 
  hc_title(text = "Distribution of application size (in MB)")
hcboxplot(x = data_clean$Size, var = data_clean$Type, outliers = TRUE, color = "#fb4901", fillColor = "lightblue") %>% 
  hc_chart(type = "column") %>% 
  hc_add_theme(hc_theme_bloom()) %>% 
  hc_title(text = "Application size range (in MB) by Application Type")
## Warning: 'hcboxplot' is deprecated.
## Use 'data_to_boxplot' instead.
## See help("Deprecated")

There are a lot of apps under 10MB, in general applications are between 5 MB to 30 MB, Paid applications are slightly smaller.

Installs

Applications in this dataset range from 0 to 10^{9} installations. The full range is,

summary(data_clean$Installs)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.000e+00 1.000e+03 1.000e+05 1.416e+07 1.000e+06 1.000e+09

If we divide this range by groups of 10k, we can see the following distribution as follows:

tmp <- data_clean %>% 
  group_by(
    Installs.Group = cut(Installs, breaks = seq(0, 1000000000, by = 10000))
  ) %>% 
  summarise(n = n())

hchart(tmp, type = "pie", hcaes(tmp$Installs.Group, tmp$n)) %>% 
  hc_title(text = "Number of installs (group per 10k)") %>% 
  hc_add_theme(hc_theme_bloom())

The largest group is made of applications with up to 10k downloads, over half of the applications had less than 100k installs.

Type

Let’s take a look at the total split between free and paid applications.

tmp <- data_clean %>% 
  count(Type) %>% 
  mutate(perc = round((n/sum(n))*100)) %>% 
  arrange(desc(perc))

hchart(tmp, "bar", hcaes(Type, perc)) %>% 
  hc_title(text = "Percentage of Paid vs Free Apps") %>% 
  hc_add_theme(hc_theme_bloom())

for every 100 applications, there are 7 are paid applications. Let’s see if how much this ratio changes per category as follows:

data_clean %>% 
  group_by(Category, Type) %>% 
  summarise(n = n()) %>% 
  mutate(perc = round((n/sum(n))*100)) %>% 
  hchart("bar", hcaes(Category, perc, group = Type)) %>% 
  hc_plotOptions(series = list(stacking = "normal")) %>% 
  hc_title(text = "Percentage of Free vs Paid Apps Per Category") %>% 
  hc_add_theme(hc_theme_bloom())
## `summarise()` has grouped output by 'Category'. You can override using the
## `.groups` argument.

There are two categories “Medical” and “Personalization” have a lot more paid then free applications (>20% of the apps are paid).

Price

tmp <- data_clean %>% 
  mutate(Total_Paid = Price * Installs)

A total of 3.6747185^{8} was spent on all of these apps, combined. Let’s now see pricing per category. I’ll avoid using mean because outliers could skew the data a lot. So, using median Let’s see which categories have most expensive apps

data_clean %>% 
  filter(Type == "Paid") %>% 
  group_by(Category) %>% 
  summarise(Median_Price = median(Price)) %>% 
  arrange(-Median_Price) %>% 
  hchart("treemap", hcaes(x = Category, value = Median_Price, color = Median_Price)) %>% 
  hc_add_theme(hc_theme_bloom()) %>% 
  hc_title(text = "Median Price Per Category")

We can see that the “Events” is the leader (but it has 1 paid app only). “Finance” is the category which most expensive apps, a median of almost 29 US Dollars.

and whatever, let’s use mean instead of median and see what we are going to have

data_clean %>% 
  filter(Type == "Paid") %>% 
  group_by(Category) %>% 
  summarise(Mean_Price = mean(Price)) %>% 
  arrange(-Mean_Price) %>% 
  hchart("treemap", hcaes(x = Category, value = Mean_Price, color = Mean_Price)) %>% 
  hc_add_theme(hc_theme_bloom()) %>% 
  hc_title(text = "Median Price Per Category")

“Finance” is the leader with 170 US Dollars, and “Events” 110, while in between there are the “lifestyle” apps with 124 US Dollars.

Finally, Let’s see how much money was spent per category

data_clean %>% 
  filter(Type == "Paid") %>% 
  mutate(Total_Paid = Price * Installs) %>% 
  group_by(Category) %>% 
  summarise(Paid_USD = sum(Total_Paid)) %>% 
  arrange(-Paid_USD) %>% 
  hchart("treemap", hcaes(x = Category, value = Paid_USD, color = Paid_USD)) %>% 
  hc_add_theme(hc_theme_bloom()) %>% 
  hc_title(text = "Total Amount Spent by Category (Price * Installs")

“Family” is the category that generated the highest revenue, probably due to Minecraft’s huge popularity.

Content Rating

How many installations were there by “content rating”?

tmp <- data_clean %>% 
  group_by(Content.Rating) %>% 
  summarise(Total_Installs = sum(Installs)) %>% 
  arrange(-Total_Installs)

tmp %>% 
  hchart("pie", hcaes(x = Content.Rating, y = Total_Installs)) %>% 
  hc_title(text = "Number of Installs by Content Rating") %>% 
  hc_add_theme(hc_theme_bloom())

“Everyone” has the largest share of installations, followed by “Teen”.

Genres

There are a total of 119 unique “Genres” on this dataset. Let’s take a look at the most popular, by number of installs

tmp <- data_clean %>% 
  group_by(Genres, Type) %>% 
  summarise(Total_Installs = sum(Installs)) %>% 
  arrange(-Total_Installs)
## `summarise()` has grouped output by 'Genres'. You can override using the
## `.groups` argument.
highchart() %>% 
  hc_chart(type = "column") %>%
  hc_xAxis(categories = tmp$Genres) %>% 
  hc_add_series(data = tmp$Total_Installs, name = "Total_Installs") %>% 
  hc_add_theme(hc_theme_bloom()) %>% 
  hc_title(text = "Number of Installs by Genre (All Apps)")

“Communications” is by far the most popular genre, by number of installs. Which are the top 10 communication apps?

data_clean %>% 
  select(App, Installs) %>% 
  arrange(-Installs) %>% 
  head(10)
##                                         App Installs
## 1                         Google Play Books    1e+09
## 2  Messenger – Text and Video Chat for Free    1e+09
## 3                        WhatsApp Messenger    1e+09
## 4              Google Chrome: Fast & Secure    1e+09
## 5                                     Gmail    1e+09
## 6                                  Hangouts    1e+09
## 7  Messenger – Text and Video Chat for Free    1e+09
## 8                                  Hangouts    1e+09
## 9             Skype - free IM & video calls    1e+09
## 10             Google Chrome: Fast & Secure    1e+09

There are actually more apps listed with the same number of installs, but these are the first 10, next we see which were the top 10 paid and free apps by genre

tmp_paid <- tmp %>% filter(Type == "Paid") %>% head(10)

highchart() %>% 
  hc_chart(type = "column") %>% 
  hc_xAxis(categories = tmp_paid$Genres) %>% 
  hc_add_series(data = tmp_paid$Total_Installs, name = "Total Installs") %>% 
  hc_add_theme(hc_theme_bloom()) %>% 
  hc_title(text = "Top 10 Number of Installs by Genre") %>% 
  hc_subtitle(text = "Paid Apps Only")
rm(tmp_paid)
tmp_free <- tmp %>% filter(Type == "Free") %>% head(10)

highchart() %>% 
  hc_chart(type = "column") %>% 
  hc_xAxis(categories = tmp_free$Genres) %>% 
  hc_add_series(data = tmp_free$Total_Installs, name = "Total Installs") %>% 
  hc_add_theme(hc_theme_bloom()) %>% 
  hc_title(text = "Top 10 Number of Installs by Genre") %>% 
  hc_subtitle(text = "Free Apps Only")
rm(tmp_free)

The top 10 application genre changes quite a bit on free vs paid. Top free applications are more “social” (browsers, chat apps, social network apps) while top paid genres are games.

Last Updated

Let’s now analyse when applications were last updated by making a time series chart.

# Get number of apps by last uodated date
tmp <- data_clean %>% count(Last.Updated)

# Transform data into time series (it needed some edjustment to edit the column of Last.Updated to be as date)
tmp$Last.Updated <- mdy(tmp$Last.Updated)
time_series <- xts(tmp$n, order.by = tmp$Last.Updated)

highchart(type = "stock") %>% 
  hc_title(text = "Last Updated Date") %>% 
  hc_subtitle(text = "Number of Applications by Date of Last Updates") %>% 
  hc_add_series(time_series) %>% 
  hc_add_theme(hc_theme_bloom())
rm(time_series)

Most applications have been updated within the last 6 months but there are applications that haven’t not seen an update for 5 years!!

Android Version

The column “Android version” actually relates to the minimum Android version the application supports. Let’s analyse minimum android version by number of installations.

data_clean %>% 
  filter(Min_Android_Ver > 0, Type %in% c("Free","Paid")) %>% 
  group_by(as.factor(Min_Android_Ver), Type) %>% 
  rename(Minimum_Android_Version = "as.factor(Min_Android_Ver)") %>% 
  summarise(Total_Installs = sum(Installs)) %>% 
  hchart(type = "bar", 
         hcaes(x = Minimum_Android_Version, y = Total_Installs, 
               group = Type)) %>% 
  hc_title(text = "Minimum Android Version (By Number of Installs)") %>% 
  hc_add_theme(hc_theme_bloom())
## `summarise()` has grouped output by 'Minimum_Android_Version'. You can override
## using the `.groups` argument.

Applications rely mostly that users have Android version 4.1 and up. This is good news for legacy devices :)