In this project, we analyze the Google Play Store dataset and visualize the insights we can draw from it. The dataset was obtained from Kaggle. The data covers the period from 21 May 2010 to 8 August 2018.
We will use the packages below to analyze this dataset.
library(lubridate)
library(ggplot2)
library(plotly)
library(scales)
library(glue)
library(GGally)
library(tidyverse)
# Reading the data
pstore <- read.csv("googleplaystore.csv")
head(pstore)
str(pstore)
## 'data.frame': 10841 obs. of 13 variables:
## $ App : chr "Photo Editor & Candy Camera & Grid & ScrapBook" "Coloring book moana" "U Launcher Lite – FREE Live Cool Themes, Hide Apps" "Sketch - Draw & Paint" ...
## $ Category : chr "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" ...
## $ Rating : num 4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 ...
## $ Reviews : chr "159" "967" "87510" "215644" ...
## $ Size : chr "19M" "14M" "8.7M" "25M" ...
## $ Installs : chr "10,000+" "500,000+" "5,000,000+" "50,000,000+" ...
## $ Type : chr "Free" "Free" "Free" "Free" ...
## $ Price : chr "0" "0" "0" "0" ...
## $ Content.Rating: chr "Everyone" "Everyone" "Everyone" "Teen" ...
## $ Genres : chr "Art & Design" "Art & Design;Pretend Play" "Art & Design" "Art & Design" ...
## $ Last.Updated : chr "January 7, 2018" "January 15, 2018" "August 1, 2018" "June 8, 2018" ...
## $ Current.Ver : chr "1.0.0" "2.0.0" "1.2.4" "Varies with device" ...
## $ Android.Ver : chr "4.0.3 and up" "4.0.3 and up" "4.0.3 and up" "4.2 and up" ...
First, we convert each column to an appropriate data type.
pstore$Category <- as.factor(pstore$Category)
pstore$Type <- as.factor(pstore$Type)
pstore$Content.Rating <- as.factor(pstore$Content.Rating)
pstore$Genres <- as.factor(pstore$Genres)
pstore$Installs <- as.factor(x = pstore$Installs)
pstore$Size <- as.factor(pstore$Size)
pstore$Last.Updated <- mdy(pstore$Last.Updated)
pstore$Current.Ver <- as.factor(pstore$Current.Ver)
pstore$Android.Ver <- as.factor(pstore$Android.Ver)
# Values that cannot be parsed as integers become NA and are removed by na.omit() below
pstore$Reviews <- as.integer(pstore$Reviews)
# Strip commas and the trailing "+" so that Installs can be parsed as a number
pstore$Installs <- str_replace_all(string = pstore$Installs, pattern = "[,+]", replacement = "")
pstore$Installs <- as.integer(pstore$Installs)
Because there are some missing values and duplicate rows in the dataset, we will remove both.
pstore <- na.omit(pstore)
pstore <- pstore[which(!duplicated(pstore)),]
summary(pstore)
## App Category Rating Reviews
## Length:8892 FAMILY :1718 Min. :1.000 Min. : 1
## Class :character GAME :1074 1st Qu.:4.000 1st Qu.: 164
## Mode :character TOOLS : 734 Median :4.300 Median : 4714
## PRODUCTIVITY : 334 Mean :4.188 Mean : 472776
## FINANCE : 317 3rd Qu.:4.500 3rd Qu.: 71267
## PERSONALIZATION: 310 Max. :5.000 Max. :78158306
## (Other) :4405
## Size Installs Type Price
## Varies with device:1468 Min. : 1 0 : 0 Length:8892
## 14M : 154 1st Qu.: 10000 Free:8279 Class :character
## 13M : 152 Median : 500000 NaN : 0 Mode :character
## 12M : 151 Mean : 16489648 Paid: 613
## 11M : 150 3rd Qu.: 5000000
## 15M : 149 Max. :1000000000
## (Other) :6668
## Content.Rating Genres Last.Updated
## : 0 Tools : 733 Min. :2010-05-21
## Adults only 18+: 3 Entertainment: 498 1st Qu.:2017-09-21
## Everyone :7095 Education : 446 Median :2018-05-28
## Everyone 10+ : 360 Action : 349 Mean :2017-11-21
## Mature 17+ : 411 Productivity : 334 3rd Qu.:2018-07-23
## Teen :1022 Finance : 317 Max. :2018-08-08
## Unrated : 1 (Other) :6215
## Current.Ver Android.Ver
## Varies with device:1258 4.1 and up :1987
## 1.0 : 451 4.0.3 and up :1197
## 1.1 : 191 Varies with device:1178
## 1.2 : 126 4.0 and up :1094
## 1.3 : 117 4.4 and up : 789
## 2.0 : 117 2.3 and up : 573
## (Other) :6632 (Other) :2074
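The cleaning steps above can be tried out on a small example before running them on the full dataset. The sketch below is only an illustration: the `toy` data frame and its values are made up, and base R's `gsub()` is used in place of `str_replace_all()` so that the example needs no extra packages.

```r
# Toy data frame mimicking a few columns of the dataset (values are made up)
toy <- data.frame(
  App      = c("A", "B", "B", "C"),
  Rating   = c(4.1, 3.9, 3.9, NaN),
  Installs = c("10,000+", "500,000+", "500,000+", "1,000+"),
  stringsAsFactors = FALSE
)

# Strip thousands separators and the trailing "+" before converting to integer
toy$Installs <- as.integer(gsub("[,+]", "", toy$Installs))

# Drop rows with missing values (NaN counts as missing), then exact duplicate rows
toy <- na.omit(toy)
toy <- toy[!duplicated(toy), ]

print(toy)
```

Here the row with a `NaN` rating and the repeated row for app "B" are both removed, leaving two rows.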
Because most of the apps are free (8,279 of 8,892), we will drop the Price and Type columns. This analysis focuses on the Rating of each app as its main parameter. We also do not need the Current.Ver, Size, and Android.Ver columns.
pstore <- pstore[,-c(5, 7, 8, 12, 13)]
head(pstore)
First, we aggregate the Rating column by Category and plot the average rating of ten categories.
# Data Wrangling
gen_rat <- aggregate(Rating ~ Category, data = pstore, FUN = "mean")
# Visualization
# Note: gen_rat is ordered by Category level, so head(gen_rat, 10) keeps the first
# ten categories alphabetically; reorder() only sorts the bars within that subset.
ggplot(head(gen_rat, 10), mapping = aes(x = Rating, y = reorder(Category, Rating))) +
  geom_col(aes(fill = Rating)) +
  scale_fill_gradient(low = "yellow", high = "purple") +
  labs(title = "Top 10 Category based on Rating",
       y = NULL) +
  theme(plot.title = element_text(hjust = 0.5))
As we can see in the visualization above, among the ten categories plotted, the Education category is in first place, followed by the Books and Reference category in second place and the Art and Design category in third place.
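One point worth checking in the step above is that `aggregate()` returns its rows in the order of the Category levels, so `head()` alone does not pick the highest averages. A minimal sketch of the difference, using a made-up `toy` data frame (the category names and ratings are assumptions for illustration only):

```r
# Hypothetical ratings for four categories (values are made up)
toy <- data.frame(
  Category = factor(c("ART", "ART", "BEAUTY", "EVENTS", "EVENTS", "TOOLS")),
  Rating   = c(4.0, 4.4, 3.8, 4.6, 4.8, 4.1)
)

avg <- aggregate(Rating ~ Category, data = toy, FUN = mean)

# head() on the raw aggregate keeps factor-level (alphabetical) order
first_two <- head(avg, 2)

# Sorting by Rating first gives the two highest-rated categories
top_two <- head(avg[order(avg$Rating, decreasing = TRUE), ], 2)

print(first_two)
print(top_two)
```

In this toy case `first_two` holds ART and BEAUTY, while `top_two` holds EVENTS and ART. Applying the same `order()` step to `gen_rat` before `head()` would give a ranking over all categories.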
As the previous visualization shows, Education has the highest average rating among the categories plotted. We will analyze this category further. First, we keep only the apps in the Education category. Then, we sort the apps from the highest rating to the lowest.
# Data Wrangling
cond1 <- pstore[pstore$Category == "EDUCATION", ]
cond1 <- cond1[order(cond1$Rating, decreasing = TRUE), ]
# Visualization
ggplot(head(cond1, 10), mapping = aes(x = Rating, y = reorder(App, Rating))) +
  geom_col(aes(fill = Rating)) +
  scale_fill_gradient(low = "yellow", high = "purple") +
  labs(title = "Top 10 Apps for Education Category based on Rating",
       y = NULL) +
  scale_y_discrete(labels = wrap_format(25)) +
  theme(plot.title = element_text(hjust = 0.5))
As we can see in the plot above, Sago Mini Hat Maker and Learn Japanese, Korean, Chinese Offline & Free have the highest rating of all apps in the Education category. After these two apps, SoloLearn: Learn to Code for Free and English Grammar Test share second place.
Now, we will analyze the categories in the Google Play Store based on installs. Because the aggregation uses the mean, each bar shows the average number of installs per app in a category.
# Data Wrangling
ins_cat <- aggregate(Installs ~ Category, data = pstore, FUN = "mean")
# Visualization
# Note: head(ins_cat, 10) keeps the first ten categories in Category-level order,
# not the ten with the highest values.
ggplot(head(ins_cat, 10), mapping = aes(x = Installs / 1000, y = reorder(Category, Installs))) +
  geom_col(aes(fill = Installs)) +
  scale_fill_gradient(low = "yellow", high = "purple") +
  labs(title = "Top 10 Category based on Average Installation",
       x = "Average Installation per App (Thousand)",
       y = NULL) +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5))
As we can see in the visualization above, among the ten categories plotted, the Communication category is in first place, Entertainment is in second place, and Books and Reference is in third place.
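Whether a category ranks first can depend on whether installs are averaged or summed. The sketch below illustrates this with a made-up `toy` data frame (names and numbers are assumptions). It also converts the integer column to numeric first, because summing a large integer column in R can exceed the integer range (about 2.1 billion) and return `NA` with a warning.

```r
# Hypothetical installs: three small games and one large communication app
toy <- data.frame(
  Category = factor(c("GAME", "GAME", "GAME", "COMMUNICATION")),
  Installs = c(1000L, 1000L, 1000L, 2000L)
)

# Convert to numeric so that large sums cannot overflow the integer range
toy$Installs <- as.numeric(toy$Installs)

avg_installs   <- aggregate(Installs ~ Category, data = toy, FUN = mean)
total_installs <- aggregate(Installs ~ Category, data = toy, FUN = sum)

# The leading category differs between the two summaries
top_by_mean  <- as.character(avg_installs$Category[which.max(avg_installs$Installs)])
top_by_total <- as.character(total_installs$Category[which.max(total_installs$Installs)])

print(top_by_mean)
print(top_by_total)
```

With these toy values the mean favors COMMUNICATION while the sum favors GAME, so the choice of `FUN` should match the question being asked.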
Now we will take a closer look at the Communication category.
# Data Wrangling
cond2 <- pstore[pstore$Category == "COMMUNICATION", ]
cond2 <- aggregate(Installs ~ App, data = cond2, FUN = "mean")
cond2 <- cond2[order(cond2$Installs, decreasing = TRUE), ]
# Visualization
ggplot(head(cond2, 10), mapping = aes(x = Installs / 1000, y = reorder(App, Installs))) +
  geom_col(aes(fill = Installs)) +
  scale_fill_gradient(low = "yellow", high = "purple") +
  labs(title = "Top 10 Apps for Communication Category based on Total Installation",
       x = "Total Installation (Thousand)",
       y = NULL) +
  scale_y_discrete(labels = wrap_format(25)) +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5))
As we can see in the visualization above, WhatsApp Messenger, Skype, Messenger, Hangouts, Google Chrome, and Gmail have the highest total installation.
Next, we compare the categories by their average number of reviews.
# Data Wrangling
rev_cat <- aggregate(Reviews ~ Category, data = pstore, FUN = "mean")
# Visualization
# Note: head(rev_cat, 10) keeps the first ten categories in Category-level order,
# not the ten with the highest values.
ggplot(head(rev_cat, 10), mapping = aes(x = Reviews, y = reorder(Category, Reviews))) +
  geom_col(aes(fill = Reviews)) +
  scale_fill_gradient(low = "yellow", high = "purple") +
  labs(title = "Top 10 Category based on Reviews",
       y = NULL) +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5))
The visualization above shows that the first and second places are held by the same categories as in the ranking based on installs. Meanwhile, the third and fourth places are swapped.
We then rank the apps in the Communication category by their number of reviews.
# Data Wrangling
cond3 <- pstore[pstore$Category == "COMMUNICATION", ]
cond3 <- aggregate(Reviews ~ App, data = cond3, FUN = "mean")
cond3 <- cond3[order(cond3$Reviews, decreasing = TRUE), ]
# Visualization
ggplot(head(cond3, 10), mapping = aes(x = Reviews, y = reorder(App, Reviews))) +
  geom_col(aes(fill = Reviews)) +
  scale_fill_gradient(low = "yellow", high = "purple") +
  labs(title = "Top 10 Apps for Communication Category based on Reviews",
       y = NULL) +
  scale_y_discrete(labels = wrap_format(25)) +
  theme(plot.title = element_text(hjust = 0.5))
From the visualization above, we can see that WhatsApp Messenger is still in first place. Meanwhile, the other apps rank differently than in the visualization based on total installation.
Based on all of the visualizations, we want to know whether there is a correlation among Rating, Reviews, and Installs.
# Data Wrangling
corr <- pstore[, c("Rating", "Reviews", "Installs")]
# Visualization
ggcorr(corr, label = TRUE)
As the result shows, Reviews and Installs are highly correlated. In other words, apps with more reviews tend to have more installs.
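The correlation reading above can be cross-checked numerically with base R's `cor()`. The sketch below uses a made-up `toy` data frame (all values are assumptions) and compares Pearson with Spearman correlation. The comparison is our own suggestion rather than part of the original analysis: review and install counts are heavily skewed, and a rank-based measure is less sensitive to a few very large apps. We are also assuming that `ggcorr()` is being used with its default method, which to our understanding is Pearson.

```r
# Hypothetical apps with skewed review and install counts (values are made up)
toy <- data.frame(
  Rating   = c(4.5, 4.0, 4.2, 3.8, 4.7),
  Reviews  = c(10, 200, 3000, 40000, 5000000),
  Installs = c(1000, 10000, 100000, 1000000, 100000000)
)

# Linear (Pearson) and rank-based (Spearman) correlation matrices
pearson  <- cor(toy, method = "pearson")
spearman <- cor(toy, method = "spearman")

print(round(pearson, 2))
print(round(spearman, 2))
```

In this toy data Reviews and Installs rise together in every row, so their Spearman correlation is exactly 1 regardless of how skewed the raw values are.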
Because there is a correlation, we want to know which app has both the most reviews and the most installs. The plot uses cond1, so it covers the apps in the Education category.
# Data Wrangling
rev_ins <- cond1 %>%
  mutate(label = glue("App: {App}
Reviews: {Reviews}
Install: {Installs}"))
# Visualization
ri_plot <- ggplot(rev_ins, mapping = aes(x = Reviews / 1000, y = Installs / 1000, text = label)) +
  labs(title = "Reviews vs Total Installation",
       x = "Reviews (Thousand)",
       y = "Total Installation (Thousand)") +
  geom_point(aes(fill = App)) +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5))
ggplotly(ri_plot, tooltip = "text")
As we can see from the visualization above, the app with the most reviews and the most installs is Duolingo: Learn Languages Free.
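The same question can also be answered without reading values off the plot. The sketch below is an illustration on a made-up `toy` data frame (app names and counts are assumptions). It uses `which.max()` and also checks for ties, since install counts in this dataset are binned (for example "1,000,000+") and several apps can share the maximum.

```r
# Hypothetical apps (names and counts are made up)
toy <- data.frame(
  App      = c("App A", "App B", "App C"),
  Reviews  = c(120, 95000, 4300),
  Installs = c(10000, 5000000, 100000),
  stringsAsFactors = FALSE
)

# which.max() returns the position of the first maximum
top_reviews  <- toy$App[which.max(toy$Reviews)]
top_installs <- toy$App[which.max(toy$Installs)]

# List every app that reaches the maximum install count, in case of ties
tied_installs <- toy$App[toy$Installs == max(toy$Installs)]

print(top_reviews)
print(top_installs)
print(tied_installs)
```

The same three lines could be applied to `cond1` in place of `toy` to confirm the app identified in the plot.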
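The highest-rated apps in the Education category are Sago Mini Hat Maker and Learn Japanese, Korean, Chinese Offline & Free.
The most-installed apps in the Communication category are WhatsApp Messenger, Skype, Messenger, Hangouts, Google Chrome, and Gmail.
The most-reviewed app in the Communication category is WhatsApp Messenger.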