This is an EDA for Google Play Store Applications, if you like this kernel or it was helpful, please upvote!
for loading packages, I’m enjoying using the “pacman” package which is allow me to install and load the packages at once, as follows:
library(pacman)
p_load(tidyverse,highcharter,lubridate,stringr,xts)
Loading the data and take a look at it
## App Category Rating
## 1 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1
## 2 Coloring book moana ART_AND_DESIGN 3.9
## 3 U Launcher Lite – FREE Live Cool Themes, Hide Apps ART_AND_DESIGN 4.7
## 4 Sketch - Draw & Paint ART_AND_DESIGN 4.5
## 5 Pixel Draw - Number Art Coloring Book ART_AND_DESIGN 4.3
## 6 Paper flowers instructions ART_AND_DESIGN 4.4
## Reviews Size Installs Type Price Content.Rating Genres
## 1 159 19M 10,000+ Free 0 Everyone Art & Design
## 2 967 14M 500,000+ Free 0 Everyone Art & Design;Pretend Play
## 3 87510 8.7M 5,000,000+ Free 0 Everyone Art & Design
## 4 215644 25M 50,000,000+ Free 0 Teen Art & Design
## 5 967 2.8M 100,000+ Free 0 Everyone Art & Design;Creativity
## 6 167 5.6M 50,000+ Free 0 Everyone Art & Design
## Last.Updated Current.Ver Android.Ver
## 1 January 7, 2018 1.0.0 4.0.3 and up
## 2 January 15, 2018 2.0.0 4.0.3 and up
## 3 August 1, 2018 1.2.4 4.0.3 and up
## 4 June 8, 2018 Varies with device 4.2 and up
## 5 June 20, 2018 1.1 4.4 and up
## 6 March 26, 2017 1.0 2.3 and up
Let’s see the structure of this data
str(data)
## 'data.frame': 10841 obs. of 13 variables:
## $ App : chr "Photo Editor & Candy Camera & Grid & ScrapBook" "Coloring book moana" "U Launcher Lite – FREE Live Cool Themes, Hide Apps" "Sketch - Draw & Paint" ...
## $ Category : chr "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" ...
## $ Rating : num 4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 ...
## $ Reviews : chr "159" "967" "87510" "215644" ...
## $ Size : chr "19M" "14M" "8.7M" "25M" ...
## $ Installs : chr "10,000+" "500,000+" "5,000,000+" "50,000,000+" ...
## $ Type : chr "Free" "Free" "Free" "Free" ...
## $ Price : chr "0" "0" "0" "0" ...
## $ Content.Rating: chr "Everyone" "Everyone" "Everyone" "Teen" ...
## $ Genres : chr "Art & Design" "Art & Design;Pretend Play" "Art & Design" "Art & Design" ...
## $ Last.Updated : chr "January 7, 2018" "January 15, 2018" "August 1, 2018" "June 8, 2018" ...
## $ Current.Ver : chr "1.0.0" "2.0.0" "1.2.4" "Varies with device" ...
## $ Android.Ver : chr "4.0.3 and up" "4.0.3 and up" "4.0.3 and up" "4.2 and up" ...
I see a lot of factor variables that should actually be numeric. So, Let’s do some data cleaning as follows:
data_clean <- data %>%
mutate(
# Eliminate some characters to transform Installs to numeric
Installs = gsub("\\+", "", as.character(Installs)),
Installs = as.numeric(gsub(",", "", Installs)),
# Eliminate M to transform Size to numeric
Size = gsub("M", "", Size),
# Replace cells with k to 0 since it is < 1MB
Size = ifelse(grepl("k", Size), 0, as.numeric(Size)),
# Transform reviews to numeric
Reviews = as.numeric(Reviews),
# Remove currency symbol from Price, change it to numeric
Price = as.numeric(gsub("\\$", "", as.character(Price))),
# Last update to date format
Last_update = mdy(Last.Updated),
# Replace "Varies with device" to NA since it is unknown
Min_Android_Ver = gsub("Varies with device", NA, Android.Ver),
# Keep only version number to 1 decimal
Min_Android_Ver = as.numeric(substr(Min_Android_Ver, start = 1, stop = 3)),
# Drop all Android version column
Android.Ver = NULL
) %>%
filter(
# Two apps had types as 0 or NA, they will be removed
Type %in% c("Free","Paid")
)
str(data_clean)
## 'data.frame': 10839 obs. of 14 variables:
## $ App : chr "Photo Editor & Candy Camera & Grid & ScrapBook" "Coloring book moana" "U Launcher Lite – FREE Live Cool Themes, Hide Apps" "Sketch - Draw & Paint" ...
## $ Category : chr "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" ...
## $ Rating : num 4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 ...
## $ Reviews : num 159 967 87510 215644 967 ...
## $ Size : num 19 14 8.7 25 2.8 5.6 19 29 33 3.1 ...
## $ Installs : num 1e+04 5e+05 5e+06 5e+07 1e+05 5e+04 5e+04 1e+06 1e+06 1e+04 ...
## $ Type : chr "Free" "Free" "Free" "Free" ...
## $ Price : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Content.Rating : chr "Everyone" "Everyone" "Everyone" "Teen" ...
## $ Genres : chr "Art & Design" "Art & Design;Pretend Play" "Art & Design" "Art & Design" ...
## $ Last.Updated : chr "January 7, 2018" "January 15, 2018" "August 1, 2018" "June 8, 2018" ...
## $ Current.Ver : chr "1.0.0" "2.0.0" "1.2.4" "Varies with device" ...
## $ Last_update : Date, format: "2018-01-07" "2018-01-15" ...
## $ Min_Android_Ver: num 4 4 4 4.2 4.4 2.3 4 4.2 3 4 ...
First,we need to check if there any duplicate rows (records)
nrow(data_clean %>% distinct())
## [1] 10356
The above should have listed 10841, so there are 483 rows duplicated. We’ll remove them.
data_clean <- data_clean %>% distinct()
Let’s see if the data is complete, going through all columns to see if we find any NA values
data_clean %>%
summarise_all(funs(sum(is.na(.)))) %>%
gather() %>%
# Only show columns with NA
filter(value>1) %>%
arrange(-value) %>%
hchart("column", hcaes(x = "key", y = "value", color = "key")) %>%
hc_add_theme(hc_theme_bloom()) %>%
hc_title(text = "Columns with NA Values")
We have 3 columns with NA values, some of them part of the data cleaning process above (Transforming factor variables to numeric). Let’s take a close look at these NA
Let’s take a look at how many installs applications with NA rating have:
data_clean %>%
filter(is.na(Rating)) %>%
count(Installs) %>%
arrange(-n) %>%
hchart("column", hcaes(x = "Installs", y = "n")) %>%
hc_add_theme(hc_theme_bloom()) %>%
hc_title(text = "Installations with no rating")
Most applications without rating have very few installs, which could make sense, If you have low amount of installs, people haven’t reviewed it yet. But even an application with 1 million installs is listed without rating. something maybe went wrong with scrapping?
On the data cleaning process above, I saw “Varies with device” was present under “Size”. There were a total of 1695 applications with “Varies with device” which we transformed to NA. So, that explains NA in this column.
data %>%
filter(Size %in% "Varies with device") %>%
count()
## n
## 1 1695
Same as above, with “Varies with device” label present. There were a total of 1362 applications with “Varies with device” which we transformed to NA. So, that explains NA in this column.
data %>%
filter(Android.Ver %in% "Varies with device") %>% count()
## n
## 1 1362
Most popular category, by number of installs
data_clean %>%
count(Category, Installs) %>%
group_by(Category) %>%
summarise(Total_Installs = sum(as.numeric(Installs))) %>%
arrange(-Total_Installs) %>%
hchart("scatter", hcaes(x = Category, y = Total_Installs, size = Total_Installs, color = Category)) %>%
hc_add_theme(hc_theme_bloom()) %>%
hc_title(text = "Most Popular categories (# of installs")
Now let’s take a look at applications size (in MB)
data_clean %>%
count(Size) %>%
hchart("area", hcaes(x = Size, y = n)) %>%
hc_add_theme(hc_theme_bloom()) %>%
hc_title(text = "Distribution of application size (in MB)")
hcboxplot(x = data_clean$Size, var = data_clean$Type, outliers = TRUE, color = "#fb4901", fillColor = "lightblue") %>%
hc_chart(type = "column") %>%
hc_add_theme(hc_theme_bloom()) %>%
hc_title(text = "Application size range (in MB) by Application Type")
## Warning: 'hcboxplot' is deprecated.
## Use 'data_to_boxplot' instead.
## See help("Deprecated")
There are a lot of apps under 10MB, in general applications are between 5 MB to 30 MB, Paid applications are slightly smaller.
Applications in this dataset range from 0 to 10^{9} installations. The full range is,
summary(data_clean$Installs)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000e+00 1.000e+03 1.000e+05 1.416e+07 1.000e+06 1.000e+09
If we divide this range by groups of 10k, we can see the following distribution as follows:
tmp <- data_clean %>%
group_by(
Installs.Group = cut(Installs, breaks = seq(0, 1000000000, by = 10000))
) %>%
summarise(n = n())
hchart(tmp, type = "pie", hcaes(tmp$Installs.Group, tmp$n)) %>%
hc_title(text = "Number of installs (group per 10k)") %>%
hc_add_theme(hc_theme_bloom())
The largest group is made of applications with up to 10k downloads, over half of the applications had less than 100k installs.
Let’s take a look at the total split between free and paid applications.
tmp <- data_clean %>%
count(Type) %>%
mutate(perc = round((n/sum(n))*100)) %>%
arrange(desc(perc))
hchart(tmp, "bar", hcaes(Type, perc)) %>%
hc_title(text = "Percentage of Paid vs Free Apps") %>%
hc_add_theme(hc_theme_bloom())
for every 100 applications, there are 7 are paid applications. Let’s see if how much this ratio changes per category as follows:
data_clean %>%
group_by(Category, Type) %>%
summarise(n = n()) %>%
mutate(perc = round((n/sum(n))*100)) %>%
hchart("bar", hcaes(Category, perc, group = Type)) %>%
hc_plotOptions(series = list(stacking = "normal")) %>%
hc_title(text = "Percentage of Free vs Paid Apps Per Category") %>%
hc_add_theme(hc_theme_bloom())
## `summarise()` has grouped output by 'Category'. You can override using the
## `.groups` argument.
There are two categories “Medical” and “Personalization” have a lot more paid then free applications (>20% of the apps are paid).
tmp <- data_clean %>%
mutate(Total_Paid = Price * Installs)
A total of 3.6747185^{8} was spent on all of these apps, combined. Let’s now see pricing per category. I’ll avoid using mean because outliers could skew the data a lot. So, using median Let’s see which categories have most expensive apps
data_clean %>%
filter(Type == "Paid") %>%
group_by(Category) %>%
summarise(Median_Price = median(Price)) %>%
arrange(-Median_Price) %>%
hchart("treemap", hcaes(x = Category, value = Median_Price, color = Median_Price)) %>%
hc_add_theme(hc_theme_bloom()) %>%
hc_title(text = "Median Price Per Category")
We can see that the “Events” is the leader (but it has 1 paid app only). “Finance” is the category which most expensive apps, a median of almost 29 US Dollars.
and whatever, let’s use mean instead of median and see what we are going to have
data_clean %>%
filter(Type == "Paid") %>%
group_by(Category) %>%
summarise(Mean_Price = mean(Price)) %>%
arrange(-Mean_Price) %>%
hchart("treemap", hcaes(x = Category, value = Mean_Price, color = Mean_Price)) %>%
hc_add_theme(hc_theme_bloom()) %>%
hc_title(text = "Median Price Per Category")
“Finance” is the leader with 170 US Dollars, and “Events” 110, while in between there are the “lifestyle” apps with 124 US Dollars.
Finally, Let’s see how much money was spent per category
data_clean %>%
filter(Type == "Paid") %>%
mutate(Total_Paid = Price * Installs) %>%
group_by(Category) %>%
summarise(Paid_USD = sum(Total_Paid)) %>%
arrange(-Paid_USD) %>%
hchart("treemap", hcaes(x = Category, value = Paid_USD, color = Paid_USD)) %>%
hc_add_theme(hc_theme_bloom()) %>%
hc_title(text = "Total Amount Spent by Category (Price * Installs")
“Family” is the category that generated the highest revenue, probably due to Minecraft’s huge popularity.
How many installations were there by “content rating”?
tmp <- data_clean %>%
group_by(Content.Rating) %>%
summarise(Total_Installs = sum(Installs)) %>%
arrange(-Total_Installs)
tmp %>%
hchart("pie", hcaes(x = Content.Rating, y = Total_Installs)) %>%
hc_title(text = "Number of Installs by Content Rating") %>%
hc_add_theme(hc_theme_bloom())
“Everyone” has the largest share of installations, followed by “Teen”.
There are a total of 119 unique “Genres” on this dataset. Let’s take a look at the most popular, by number of installs
tmp <- data_clean %>%
group_by(Genres, Type) %>%
summarise(Total_Installs = sum(Installs)) %>%
arrange(-Total_Installs)
## `summarise()` has grouped output by 'Genres'. You can override using the
## `.groups` argument.
highchart() %>%
hc_chart(type = "column") %>%
hc_xAxis(categories = tmp$Genres) %>%
hc_add_series(data = tmp$Total_Installs, name = "Total_Installs") %>%
hc_add_theme(hc_theme_bloom()) %>%
hc_title(text = "Number of Installs by Genre (All Apps)")
“Communications” is by far the most popular genre, by number of installs. Which are the top 10 communication apps?
data_clean %>%
select(App, Installs) %>%
arrange(-Installs) %>%
head(10)
## App Installs
## 1 Google Play Books 1e+09
## 2 Messenger – Text and Video Chat for Free 1e+09
## 3 WhatsApp Messenger 1e+09
## 4 Google Chrome: Fast & Secure 1e+09
## 5 Gmail 1e+09
## 6 Hangouts 1e+09
## 7 Messenger – Text and Video Chat for Free 1e+09
## 8 Hangouts 1e+09
## 9 Skype - free IM & video calls 1e+09
## 10 Google Chrome: Fast & Secure 1e+09
There are actually more apps listed with the same number of installs, but these are the first 10, next we see which were the top 10 paid and free apps by genre
tmp_paid <- tmp %>% filter(Type == "Paid") %>% head(10)
highchart() %>%
hc_chart(type = "column") %>%
hc_xAxis(categories = tmp_paid$Genres) %>%
hc_add_series(data = tmp_paid$Total_Installs, name = "Total Installs") %>%
hc_add_theme(hc_theme_bloom()) %>%
hc_title(text = "Top 10 Number of Installs by Genre") %>%
hc_subtitle(text = "Paid Apps Only")
rm(tmp_paid)
tmp_free <- tmp %>% filter(Type == "Free") %>% head(10)
highchart() %>%
hc_chart(type = "column") %>%
hc_xAxis(categories = tmp_free$Genres) %>%
hc_add_series(data = tmp_free$Total_Installs, name = "Total Installs") %>%
hc_add_theme(hc_theme_bloom()) %>%
hc_title(text = "Top 10 Number of Installs by Genre") %>%
hc_subtitle(text = "Free Apps Only")
rm(tmp_free)
The top 10 application genre changes quite a bit on free vs paid. Top free applications are more “social” (browsers, chat apps, social network apps) while top paid genres are games.
Let’s now analyse when applications were last updated by making a time series chart.
# Get number of apps by last uodated date
tmp <- data_clean %>% count(Last.Updated)
# Transform data into time series (it needed some edjustment to edit the column of Last.Updated to be as date)
tmp$Last.Updated <- mdy(tmp$Last.Updated)
time_series <- xts(tmp$n, order.by = tmp$Last.Updated)
highchart(type = "stock") %>%
hc_title(text = "Last Updated Date") %>%
hc_subtitle(text = "Number of Applications by Date of Last Updates") %>%
hc_add_series(time_series) %>%
hc_add_theme(hc_theme_bloom())
rm(time_series)
Most applications have been updated within the last 6 months but there are applications that haven’t not seen an update for 5 years!!
The column “Android version” actually relates to the minimum Android version the application supports. Let’s analyse minimum android version by number of installations.
data_clean %>%
filter(Min_Android_Ver > 0, Type %in% c("Free","Paid")) %>%
group_by(as.factor(Min_Android_Ver), Type) %>%
rename(Minimum_Android_Version = "as.factor(Min_Android_Ver)") %>%
summarise(Total_Installs = sum(Installs)) %>%
hchart(type = "bar",
hcaes(x = Minimum_Android_Version, y = Total_Installs,
group = Type)) %>%
hc_title(text = "Minimum Android Version (By Number of Installs)") %>%
hc_add_theme(hc_theme_bloom())
## `summarise()` has grouped output by 'Minimum_Android_Version'. You can override
## using the `.groups` argument.
Applications rely mostly that users have Android version 4.1 and up. This is good news for legacy devices :)