In this project, we analyze the Google Play Store dataset and visualize the insights we can draw from it. The dataset was obtained from Kaggle. The data covers the period from 21 May 2010 to 8 August 2018.

We will use the packages below to analyze this dataset.

library(lubridate)
library(ggplot2)
library(plotly)
library(scales)
library(glue)
library(GGally)
library(tidyverse)

Sneak Peek into the Data

# Reading the data
pstore <- read.csv("googleplaystore.csv")
head(pstore)
str(pstore)
## 'data.frame':    10841 obs. of  13 variables:
##  $ App           : chr  "Photo Editor & Candy Camera & Grid & ScrapBook" "Coloring book moana" "U Launcher Lite – FREE Live Cool Themes, Hide Apps" "Sketch - Draw & Paint" ...
##  $ Category      : chr  "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" ...
##  $ Rating        : num  4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 ...
##  $ Reviews       : chr  "159" "967" "87510" "215644" ...
##  $ Size          : chr  "19M" "14M" "8.7M" "25M" ...
##  $ Installs      : chr  "10,000+" "500,000+" "5,000,000+" "50,000,000+" ...
##  $ Type          : chr  "Free" "Free" "Free" "Free" ...
##  $ Price         : chr  "0" "0" "0" "0" ...
##  $ Content.Rating: chr  "Everyone" "Everyone" "Everyone" "Teen" ...
##  $ Genres        : chr  "Art & Design" "Art & Design;Pretend Play" "Art & Design" "Art & Design" ...
##  $ Last.Updated  : chr  "January 7, 2018" "January 15, 2018" "August 1, 2018" "June 8, 2018" ...
##  $ Current.Ver   : chr  "1.0.0" "2.0.0" "1.2.4" "Varies with device" ...
##  $ Android.Ver   : chr  "4.0.3 and up" "4.0.3 and up" "4.0.3 and up" "4.2 and up" ...

Data Cleaning

First, we convert each column to an appropriate type.

# Categorical columns
pstore$Category <- as.factor(pstore$Category)
pstore$Type <- as.factor(pstore$Type)
pstore$Content.Rating <- as.factor(pstore$Content.Rating)
pstore$Genres <- as.factor(pstore$Genres)
pstore$Size <- as.factor(pstore$Size)
pstore$Current.Ver <- as.factor(pstore$Current.Ver)
pstore$Android.Ver <- as.factor(pstore$Android.Ver)
# Dates such as "January 7, 2018"
pstore$Last.Updated <- mdy(pstore$Last.Updated)
# Counts: values that are not valid integers become NA
pstore$Reviews <- as.integer(pstore$Reviews)
# Strip "," and "+" from values such as "10,000+" before converting
pstore$Installs <- str_replace_all(string = pstore$Installs, pattern = "[,+]", replacement = "")
pstore$Installs <- as.integer(pstore$Installs)
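The conversion of Installs can be checked on a few values before it is applied to the whole column. The sketch below uses made-up strings in the same shape as the Installs column; "Free" is a hypothetical malformed entry, included only to show that text which is not a number turns into NA.

```r
# Made-up values in the same shape as the Installs column; "Free" stands in
# for a hypothetical malformed entry.
installs_raw <- c("10,000+", "500,000+", "1,000,000,000+", "Free")

# Strip commas and plus signs, then convert; non-numeric text becomes NA.
installs_num <- suppressWarnings(as.integer(gsub("[,+]", "", installs_raw)))

print(installs_num)
# Entries that failed to parse, worth inspecting before na.omit() drops them
print(installs_raw[is.na(installs_num)])
```

Listing the values that became NA makes it clear which rows a later na.omit() call will remove.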

Because the dataset contains missing values and duplicate rows, we drop both.

pstore <- na.omit(pstore)
pstore <- pstore[!duplicated(pstore), ]
summary(pstore)
##      App                       Category        Rating         Reviews        
##  Length:8892        FAMILY         :1718   Min.   :1.000   Min.   :       1  
##  Class :character   GAME           :1074   1st Qu.:4.000   1st Qu.:     164  
##  Mode  :character   TOOLS          : 734   Median :4.300   Median :    4714  
##                     PRODUCTIVITY   : 334   Mean   :4.188   Mean   :  472776  
##                     FINANCE        : 317   3rd Qu.:4.500   3rd Qu.:   71267  
##                     PERSONALIZATION: 310   Max.   :5.000   Max.   :78158306  
##                     (Other)        :4405                                     
##                  Size         Installs            Type         Price          
##  Varies with device:1468   Min.   :         1   0   :   0   Length:8892       
##  14M               : 154   1st Qu.:     10000   Free:8279   Class :character  
##  13M               : 152   Median :    500000   NaN :   0   Mode  :character  
##  12M               : 151   Mean   :  16489648   Paid: 613                     
##  11M               : 150   3rd Qu.:   5000000                                 
##  15M               : 149   Max.   :1000000000                                 
##  (Other)           :6668                                                      
##          Content.Rating           Genres      Last.Updated       
##                 :   0   Tools        : 733   Min.   :2010-05-21  
##  Adults only 18+:   3   Entertainment: 498   1st Qu.:2017-09-21  
##  Everyone       :7095   Education    : 446   Median :2018-05-28  
##  Everyone 10+   : 360   Action       : 349   Mean   :2017-11-21  
##  Mature 17+     : 411   Productivity : 334   3rd Qu.:2018-07-23  
##  Teen           :1022   Finance      : 317   Max.   :2018-08-08  
##  Unrated        :   1   (Other)      :6215                       
##              Current.Ver               Android.Ver  
##  Varies with device:1258   4.1 and up        :1987  
##  1.0               : 451   4.0.3 and up      :1197  
##  1.1               : 191   Varies with device:1178  
##  1.2               : 126   4.0 and up        :1094  
##  1.3               : 117   4.4 and up        : 789  
##  2.0               : 117   2.3 and up        : 573  
##  (Other)           :6632   (Other)           :2074
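Before dropping rows, it can help to count how many are affected. This is a minimal sketch on a made-up data frame (the column names mirror the dataset, the values do not); the variable names are ours.

```r
# Made-up data frame; column names mirror the dataset, values do not.
toy <- data.frame(
  App     = c("A", "A", "B", "C"),
  Rating  = c(4.5, 4.5, NA, 3.9),
  Reviews = c(10L, 10L, 20L, NA)
)

# Missing values per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Rows that na.omit() would drop, and duplicates among the remaining rows
complete  <- na.omit(toy)
n_dropped <- nrow(toy) - nrow(complete)
n_dupes   <- sum(duplicated(complete))
print(c(dropped = n_dropped, duplicates = n_dupes))
```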

Most of the apps are free (8,279 of 8,892), so we drop the Price and Type columns. This analysis focuses on the Rating of each app, so we also drop the Current.Ver, Size, and Android.Ver columns.

pstore <- pstore[,-c(5, 7, 8, 12, 13)]
head(pstore)
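Dropping columns by position works only while the column order stays the same. A sketch of the same selection by name, using an empty stand-in data frame that carries the 13 column names reported by str(pstore):

```r
# Empty stand-in with the same 13 column names as str(pstore) reports
cols <- c("App", "Category", "Rating", "Reviews", "Size", "Installs", "Type",
          "Price", "Content.Rating", "Genres", "Last.Updated", "Current.Ver",
          "Android.Ver")
toy <- as.data.frame(matrix(NA, nrow = 1, ncol = length(cols),
                            dimnames = list(NULL, cols)))

# Drop by name instead of by index
drop_cols <- c("Size", "Type", "Price", "Current.Ver", "Android.Ver")
kept <- toy[, !(names(toy) %in% drop_cols)]
print(names(kept))
```

The result has the same columns as the index-based version, with Rating, Reviews, and Installs in positions 3, 4, and 5.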

Data Visualization

Based on Rating

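First, we aggregate Rating by Category using the mean. Note that aggregate() returns the categories in alphabetical order, so head(gen_rat, 10) keeps the first 10 categories alphabetically; the plot then ranks those 10 by average rating.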

# Data Wrangling
gen_rat <- aggregate(Rating~Category, data=pstore, FUN = "mean")
# Visualization
ggplot(head(gen_rat,10), mapping=aes(x=Rating, y=reorder(Category,Rating)))+
  geom_col(aes(fill=Rating))+
  scale_fill_gradient(low = "yellow", high = "purple")+
  labs(title = "Top 10 Category based on Rating",
       y=NULL)+
  theme(plot.title = element_text(hjust = 0.5))

As we can see in the visualization above, among these categories Education ranks first, followed by Books and Reference in second place and Art and Design in third.
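To rank across all categories rather than the first 10 rows, the aggregated table has to be sorted before head() is applied. A minimal sketch on made-up ratings (the category labels and values here are illustrative only, not results from the dataset):

```r
# Made-up ratings; labels and values are illustrative only
toy <- data.frame(
  Category = c("ART", "ART", "BEAUTY", "EVENTS", "EVENTS", "TOOLS"),
  Rating   = c(4.0, 4.4, 3.8, 4.6, 4.8, 4.1)
)

avg <- aggregate(Rating ~ Category, data = toy, FUN = mean)
# aggregate() returns groups in sorted order of Category, not of Rating
print(avg$Category)

# Sort by the aggregated value first, then keep the top n
top2 <- head(avg[order(avg$Rating, decreasing = TRUE), ], 2)
print(top2)
```

Applied to gen_rat, the same order() call followed by head(..., 10) would give the 10 categories with the highest average rating, which may differ from the ones plotted above.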

Top 10 Apps in Education Category based on Rating

As the previous visualization shows, Education has the highest average rating of the categories plotted. To look more closely at this category, we first filter the data to Education apps and then sort them from highest to lowest rating.

# Data Wrangling
cond1 <- pstore[pstore$Category=="EDUCATION",]
cond1 <- cond1[order(cond1$Rating, decreasing = TRUE),]
# Visualization
ggplot(head(cond1,10), mapping = aes(x = Rating, y = reorder(App, Rating)))+
  geom_col(aes(fill=Rating))+
  scale_fill_gradient(low = "yellow", high = "purple")+
  labs(title = "Top 10 Apps for Education Category based on Rating",
       y=NULL)+
  scale_y_discrete(labels = wrap_format(25))+
  theme(plot.title = element_text(hjust = 0.5))

As the plot above shows, Sago Mini Hat Maker and Learn Japanese, Korean, Chinese Offline & Free have the highest rating of all apps in the Education category. They are followed by SoloLearn: Learn to Code for Free and English Grammar Test, which share the second-highest rating.

Based on Total Installation

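Next, we look at the top categories on the Google Play Store by installation. Note that the code below aggregates with the mean, so the values are average installs per app in each category, not category totals. As before, head() keeps the first 10 categories alphabetically.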

# Data Wrangling
ins_cat <- aggregate(Installs~Category, data=pstore, FUN = "mean")
# Visualization
ggplot(head(ins_cat,10), mapping=aes(x=Installs/1000, y=reorder(Category,Installs)))+
  geom_col(aes(fill=Installs))+
  scale_fill_gradient(low = "yellow", high = "purple")+
  labs(title = "Top 10 Category based on Average Installation",
       x="Average Installation per App (Thousand)",
       y=NULL)+
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5))

As we can see in the visualization above, Communication ranks first, Entertainment second, and Books and Reference third.
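Whether a category ranks first can depend on whether installs are averaged or summed. The sketch below shows the difference on made-up numbers; a category with many small apps can lead on the total while trailing on the mean.

```r
# Made-up install counts; values are illustrative only
toy <- data.frame(
  Category = c("GAME", "GAME", "GAME", "COMMUNICATION"),
  Installs = c(1000, 5000, 10000, 12000)
)

# Average installs per app versus total installs per category
mean_installs  <- aggregate(Installs ~ Category, data = toy, FUN = mean)
total_installs <- aggregate(Installs ~ Category, data = toy, FUN = sum)

top_by_mean  <- as.character(mean_installs$Category[which.max(mean_installs$Installs)])
top_by_total <- as.character(total_installs$Category[which.max(total_installs$Installs)])
print(c(mean = top_by_mean, total = top_by_total))
```

Replacing FUN = "mean" with FUN = "sum" in the aggregate() call is all it takes to rank categories by their true total.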

Top 10 Apps in Communication Category based on Total Installation

Now we look more closely at the Communication category.

# Data Wrangling
cond2 <- pstore[pstore$Category=="COMMUNICATION",]
cond2 <- aggregate(Installs~App, data = cond2, FUN="mean")
cond2 <- cond2[order(cond2$Installs, decreasing = TRUE),]
# Visualization
ggplot(head(cond2,10), mapping = aes(x = Installs/1000, y = reorder(App,Installs)))+
  geom_col(aes(fill=Installs))+
  scale_fill_gradient(low = "yellow", high = "purple")+
  labs(title = "Top 10 Apps for Communication Category based on Total Installation",
       x="Total Installation (Thousand)",
       y=NULL)+
  scale_y_discrete(labels = wrap_format(25))+
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5))

As the visualization above shows, WhatsApp Messenger, Skype, Messenger, Hangouts, Google Chrome, and Gmail have the highest total installation.

Based on Review

# Data Wrangling
rev_cat <- aggregate(Reviews~Category, data=pstore, FUN = "mean")
# Visualization
ggplot(head(rev_cat,10), mapping=aes(x=Reviews, y=reorder(Category,Reviews)))+
  geom_col(aes(fill=Reviews))+
  scale_fill_gradient(low = "yellow", high = "purple")+
  labs(title = "Top 10 Category based on Reviews",
       y=NULL)+
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5))

The visualization above shows that the first and second places are held by the same categories as in the ranking by installation. Meanwhile, the third and fourth places are swapped.

Top 10 Apps in Communication Category based on Reviews

# Data Wrangling
cond3 <- pstore[pstore$Category=="COMMUNICATION",]
cond3 <- aggregate(Reviews~App, data = cond3, FUN="mean")
cond3 <- cond3[order(cond3$Reviews, decreasing = TRUE),]
# Visualization
ggplot(head(cond3,10), mapping = aes(x = Reviews, y = reorder(App,Reviews)))+
  geom_col(aes(fill=Reviews))+
  scale_fill_gradient(low = "yellow", high = "purple")+
  labs(title = "Top 10 Apps for Communication Category based on Reviews",
       y=NULL)+
  scale_y_discrete(labels = wrap_format(25))+
  theme(plot.title = element_text(hjust=0.5))

From the visualization above, we can see that WhatsApp Messenger is still in first place. The other apps, however, rank differently than they did by total installation.

Correlation

Having looked at these visualizations, we want to know whether Rating, Reviews, and Installs are correlated.

# Data Wrangling
corr <- pstore[,c(3,4,5)]
# Visualization
ggcorr(corr, label = TRUE)

The result shows that Reviews and Installs are positively correlated, with a coefficient of 0.6. In other words, apps with more reviews tend to have more installs, although the correlation alone does not tell us which one drives the other.
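Reviews and Installs are both heavily skewed (the summary shows medians far below the means), and a few very large apps can dominate a Pearson coefficient. As a cross-check, a rank-based or log-scale correlation can be computed with base R's cor(). The sketch assumes the plot above used the Pearson coefficient, and it runs on made-up numbers, not the dataset.

```r
# Made-up, strongly skewed values; illustrative only
toy <- data.frame(
  Reviews  = c(10, 150, 2000, 30000, 900000),
  Installs = c(1000, 10000, 100000, 5000000, 100000000)
)

# Pearson on raw values, Spearman on ranks, Pearson on log10 values
pearson     <- cor(toy$Reviews, toy$Installs)
spearman    <- cor(toy$Reviews, toy$Installs, method = "spearman")
log_pearson <- cor(log10(toy$Reviews), log10(toy$Installs))

print(c(pearson = pearson, spearman = spearman, log_pearson = log_pearson))
```

If the three values disagree noticeably on the real data, the rank-based figure is usually the safer summary of how the two columns move together.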

Given this correlation, we want to find the app with the most reviews and the most installs. The code below uses cond1, so the search is limited to the Education category.

# Data Wrangling
rev_ins <- cond1 %>% 
  mutate(label = glue("App: {App}
                  Reviews: {Reviews}
                  Install: {Installs}"))
# Visualization
ri_plot <- ggplot(rev_ins, mapping = aes(x=Reviews/1000, y=Installs/1000, text=label))+
  labs(title = "Reviews vs Total Installation",
       x = "Reviews (Thousand)",
       y = "Total Installation (Thousand)")+
  geom_point(aes(color=App))+
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5))
ggplotly(ri_plot, tooltip = "text")

As the visualization above shows, the Education app with the most reviews and the highest total installation is Duolingo: Learn Languages Free.
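The same reading can be confirmed without the plot by looking up the row with the maximum of each column. A minimal sketch on made-up apps (the names and counts are hypothetical):

```r
# Made-up apps; names and counts are hypothetical
toy <- data.frame(
  App      = c("App A", "App B", "App C"),
  Reviews  = c(1200, 56000, 800),
  Installs = c(100000, 5000000, 1000000)
)

# App with the most reviews and app with the most installs
top_reviews  <- as.character(toy$App[which.max(toy$Reviews)])
top_installs <- as.character(toy$App[which.max(toy$Installs)])

# TRUE when one app leads on both measures
same_app <- identical(top_reviews, top_installs)
print(c(top_reviews = top_reviews, top_installs = top_installs))
```

Note that which.max() returns only the first maximum, so ties on Installs would need a separate check.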

Conclusion

  • Based on rating, Education has the highest average rating among the categories plotted. Within Education, the highest-rated apps are Sago Mini Hat Maker and Learn Japanese, Korean, Chinese Offline & Free.
  • Based on installation, Communication has the highest average installs per app among the categories plotted. Within Communication, the apps with the highest total installation are WhatsApp Messenger, Skype, Messenger, Hangouts, Google Chrome, and Gmail.
  • Based on reviews, Communication has the highest average number of reviews per app among the categories plotted. Within Communication, the app with the most reviews is WhatsApp Messenger.
  • Reviews are positively correlated with Installs, with a coefficient of 0.6. In the Education category, the app with the most reviews and the highest total installation is Duolingo: Learn Languages Free.