Google Play Store Apps Analysis

Project Overview

Mr. X plans to start 2021 by developing an Android mobile application, and he wants to know what makes a good one. He asks two of his data scientist friends, Mr Phoon and Mr Sii, to help him study current trends and insights from the Google Play Store.

Objective

  1. To analyze Google Play Store apps by applying the data science process
  2. To determine trends and patterns in Google Play Store Apps

Hence, we have two research questions to answer for Mr X.

Question 1

What is the relationship between application rating and installation?

Question 2

What is the rating of an application with a given number of reviews and installations?

Dataset and Data Preparing

The details of the dataset we have chosen are as follows:

Title : Google Play Store Apps
Source : Kaggle (https://www.kaggle.com/lava18/google-play-store-apps)
Year : 2018
Purpose : Web scraped data of 10k Play Store apps for analyzing the Android market.

#list.files("../Downloads")
library(plyr)       # load before tidyverse so dplyr verbs are not masked by plyr
library(tidyverse)  # data manipulation and plotting (attaches dplyr, ggplot2, stringr)
library(magrittr)   # pipe operators
library(cluster)    # clustering algorithms
library(factoextra) # clustering algorithms & visualization
library(class)      # knn()
library(caTools)    # sample.split()
library(DT)         # datatable()

Reading the data from the CSV

googleplay <- read.csv("../googleplaystore.csv")
datatable(googleplay)
## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html

Dimension and Structure of the dataset

print(paste("Dimensions of dataset:", paste(dim(googleplay), collapse = " x ")))
## [1] "Dimensions of dataset: 10841 x 13"
str(googleplay)
## 'data.frame':    10841 obs. of  13 variables:
##  $ App           : chr  "Photo Editor & Candy Camera & Grid & ScrapBook" "Coloring book moana" "U Launcher Lite – FREE Live Cool Themes, Hide Apps" "Sketch - Draw & Paint" ...
##  $ Category      : chr  "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" ...
##  $ Rating        : num  4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 ...
##  $ Reviews       : chr  "159" "967" "87510" "215644" ...
##  $ Size          : chr  "19M" "14M" "8.7M" "25M" ...
##  $ Installs      : chr  "10,000+" "500,000+" "5,000,000+" "50,000,000+" ...
##  $ Type          : chr  "Free" "Free" "Free" "Free" ...
##  $ Price         : chr  "0" "0" "0" "0" ...
##  $ Content.Rating: chr  "Everyone" "Everyone" "Everyone" "Teen" ...
##  $ Genres        : chr  "Art & Design" "Art & Design;Pretend Play" "Art & Design" "Art & Design" ...
##  $ Last.Updated  : chr  "January 7, 2018" "January 15, 2018" "August 1, 2018" "June 8, 2018" ...
##  $ Current.Ver   : chr  "1.0.0" "2.0.0" "1.2.4" "Varies with device" ...
##  $ Android.Ver   : chr  "4.0.3 and up" "4.0.3 and up" "4.0.3 and up" "4.2 and up" ...
summary(googleplay)
##      App              Category             Rating         Reviews         
##  Length:10841       Length:10841       Min.   : 1.000   Length:10841      
##  Class :character   Class :character   1st Qu.: 4.000   Class :character  
##  Mode  :character   Mode  :character   Median : 4.300   Mode  :character  
##                                        Mean   : 4.193                     
##                                        3rd Qu.: 4.500                     
##                                        Max.   :19.000                     
##                                        NA's   :1474                       
##      Size             Installs             Type              Price          
##  Length:10841       Length:10841       Length:10841       Length:10841      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Content.Rating        Genres          Last.Updated       Current.Ver       
##  Length:10841       Length:10841       Length:10841       Length:10841      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Android.Ver       
##  Length:10841      
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

Data Cleaning

After loading the data successfully, we proceed to the data cleaning process.

(1) NA Values

First of all, the most important step in the cleaning process is to remove NA values, because they would affect the accuracy of the results.

Step 1 : Determine the existence of NA in the dataset
sum(is.na(googleplay))
## [1] 1474
Step 2 : Remove the NA data
googleplay<-na.omit(googleplay)
Step 3 : Verify the NA is gone from data
sum(is.na(googleplay))
## [1] 0
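
Before dropping rows, it can help to see which columns hold the missing values. The sketch below uses a small made-up data frame (`toy` is hypothetical, not the Play Store data) and base R only.

```r
# Hypothetical toy data standing in for the real dataset
toy <- data.frame(
  App    = c("A", "B", "C", "D"),
  Rating = c(4.1, NA, 3.9, NA),
  Price  = c("0", "0", "$1.99", "0")
)

# Count missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# na.omit() drops every row that has at least one NA
toy_complete <- na.omit(toy)
print(nrow(toy_complete))
```

In this report, the total of 1474 equals the NA count that summary() reports for Rating, so all of the missing values sit in that one column.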

(2) Remove symbols

Some values in the dataset include symbols, which we need to remove before performing our analysis.

Price - Remove “$” sign
googleplay$Price <- str_replace_all(googleplay$Price, "\\$","")
Installs - Remove “+” , “,” symbols
googleplay$Installs <- str_replace_all(googleplay$Installs, "\\+","")
googleplay$Installs <- str_replace_all(googleplay$Installs, "\\,", "")

(3) Standardize types

Some columns are in character type whereas they should be numeric or integer.

Rating, Reviews , Price, Installs
googleplay$Rating <- as.numeric(googleplay$Rating)
googleplay$Reviews <- as.numeric(googleplay$Reviews)
## Warning: NAs introduced by coercion
googleplay$Price <- as.numeric(googleplay$Price)
## Warning: NAs introduced by coercion
googleplay$Installs <- as.integer(googleplay$Installs)
## Warning: NAs introduced by coercion
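
The symbol removal and type conversion can also be wrapped into small helpers. This is a minimal sketch using base R; `parse_installs` and `parse_price` are hypothetical names, and the inputs are made up to show that any non-numeric string ends up as NA, which is what the coercion warnings above report.

```r
# Hypothetical helper: turn strings such as "10,000+" into numbers
parse_installs <- function(x) {
  cleaned <- gsub("[+,]", "", x)         # drop "+" and "," characters
  suppressWarnings(as.numeric(cleaned))  # non-numeric strings become NA
}

# Hypothetical helper: turn strings such as "$4.99" into numbers
parse_price <- function(x) {
  suppressWarnings(as.numeric(sub("^\\$", "", x)))
}

# Made-up inputs, including one non-numeric value in each vector
installs <- parse_installs(c("10,000+", "5,000,000+", "Free"))
prices   <- parse_price(c("0", "$4.99", "Everyone"))
print(installs)
print(prices)
```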

(4) Remove invalid data

We found one row with invalid data: its Category is “1.9” and its Rating is 19, even though the maximum rating is 5. We remove this row.

googleplay <- subset(googleplay, Category != "1.9")

(5) Focus on Last 5 years

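We will focus on applications updated in the latest 5 years (2014 to 2018) for the following reasons:

  1. Older applications might no longer be relevant to current Android versions
  2. They may no longer receive support or updates
  3. They are more likely to be junk applications

Note that the filter below is applied only to the table being displayed; googleplay_clean itself still contains every year.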

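googleplay_clean <- googleplay %>%
  # drop the columns that are not used in the analysis (same as positions 5,7,9,10,12,13)
  select(-c(Size, Type, Content.Rating, Genres, Current.Ver, Android.Ver)) %>%
  distinct()  # drop exact duplicate rows
datatable(filter(googleplay_clean, str_detect(Last.Updated, "2014|2015|2016|2017|2018")))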
## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html
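
The filter above only affects the displayed table. If the year restriction should carry through to later steps, one option is to extract the year and store the filtered rows. The sketch below shows the idea on made-up rows in the same "Month day, year" format; `toy`, `Year` and `recent` are illustrative names, not part of the original analysis.

```r
# Made-up dates in the same format as the Last.Updated column
toy <- data.frame(
  App          = c("A", "B", "C"),
  Last.Updated = c("January 7, 2018", "May 21, 2013", "August 1, 2016"),
  stringsAsFactors = FALSE
)

# Take the four digits at the end of the string as the year
toy$Year <- as.integer(sub(".*([0-9]{4})$", "\\1", toy$Last.Updated))

# Keep apps updated in 2014 or later and store the result
recent <- toy[toy$Year >= 2014, ]
print(recent)
```

Extracting the trailing digits avoids parsing month names, which depends on the locale when as.Date() is used with a "%B" format.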

Final Cleansed Dataset

datatable(googleplay_clean)
## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html

Exploratory Data Analysis (EDA)

In this step, we will perform some initial analysis and visualizations.

To understand which category has the largest number of application installations in the dataset, we made a bar plot to visualize it.

options(scipen = 999)
ggplot(googleplay_clean, aes(x = Category, y = Installs)) +
  geom_bar(stat = "identity", width = 0.7, fill = "skyblue") +
  coord_flip() +
  labs(title = "Total App Installation for Each Category") +
  theme(axis.text.x = element_text(angle = 90))

From the plot, we can see that the Game category has the highest number of installations. The bar plot below shows that the Game category has the highest number of reviews too.

ggplot(googleplay_clean, aes(x = Category, y = Reviews)) +
  geom_bar(stat = "identity", width = 0.7, fill = "indianred") +
  coord_flip() +
  labs(title = "Total App Reviews for Each Category") +
  theme(axis.text.x = element_text(angle = 90))
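
With stat = "identity", geom_bar stacks the values of all apps in a category, so each bar's length is the category total. The same totals can be checked numerically. This is a small sketch on made-up data with base R; `toy` and `totals` are illustrative names.

```r
# Made-up rows standing in for googleplay_clean
toy <- data.frame(
  Category = c("GAME", "GAME", "TOOLS", "TOOLS", "FAMILY"),
  Installs = c(1000, 5000, 100, 500, 10000)
)

# Sum installs within each category
totals <- aggregate(Installs ~ Category, data = toy, FUN = sum)

# Sort from largest to smallest total
totals <- totals[order(-totals$Installs), ]
print(totals)
```

Swapping Installs for Reviews in the formula would give the totals behind the second bar plot.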

However, to answer our question of whether rating is related to installations in the Google Play Store, we perform some further EDA. We explore the variables Rating, Installs and Reviews from the dataset by applying univariate analysis.

Brief Description of EDA variables

Rating : A numeric value ranging from 1 to 5, given by users to rate a particular application, where the minimum rating is 1 and the maximum rating is 5.
Installs : A numeric value giving the total number of installs of each application.

google_eda <- googleplay_clean %>% select(Rating, Installs, Reviews)
summary(google_eda)
##      Rating         Installs             Reviews        
##  Min.   :1.000   Min.   :         1   Min.   :       1  
##  1st Qu.:4.000   1st Qu.:     10000   1st Qu.:     164  
##  Median :4.300   Median :    500000   Median :    4714  
##  Mean   :4.188   Mean   :  16489648   Mean   :  472776  
##  3rd Qu.:4.500   3rd Qu.:   5000000   3rd Qu.:   71267  
##  Max.   :5.000   Max.   :1000000000   Max.   :78158306
glimpse(google_eda)
## Rows: 8,892
## Columns: 3
## $ Rating   <dbl> 4.1, 3.9, 4.7, 4.5, 4.3, 4.4, 3.8, 4.1, 4.4, 4.7, 4.4, 4.4...
## $ Installs <int> 10000, 500000, 5000000, 50000000, 100000, 50000, 50000, 10...
## $ Reviews  <dbl> 159, 967, 87510, 215644, 967, 167, 178, 36815, 13791, 121,...
ggplot(google_eda, aes(y=Rating)) +
  geom_boxplot() +
  ggtitle("Boxplot of App Rating") +
  ylab("Rating")

ggplot(google_eda, aes(x=Reviews)) +
  geom_histogram(fill = "indianred", col = "black") +
  ggtitle("Histogram Distribution of Reviews") +
  xlab("Reviews")+
  xlim(0, 20000000)+
  ylim(0,1000)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 30 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).

ggplot(google_eda, aes(x=Installs)) +
  geom_histogram(fill = "orange", col = "black") +
  ggtitle("Histogram Distribution of Installs") +
  xlab("Installs")+
  xlim(0, 1000000000)+
  ylim(0, 2500)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing missing values (geom_bar).

google_eda %>% filter(Installs >= 500000000)
##     Rating   Installs  Reviews
## 1      3.9 1000000000  1433233
## 2      4.0 1000000000 56642847
## 3      4.4 1000000000 69119316
## 4      4.3 1000000000  9642995
## 5      4.3 1000000000  4604324
## 6      4.0 1000000000  3419249
## 7      4.3  500000000 11334799
## 8      4.3  500000000  4785892
## 9      4.6  500000000  2083237
## 10     4.5  500000000 17712922
## 11     4.0 1000000000 56646578
## 12     4.3  500000000  4785988
## 13     4.3  500000000 11334973
## 14     4.0 1000000000  3419433
## 15     4.1 1000000000 10484169
## 16     4.2  500000000 10790289
## 17     4.3 1000000000  9643041
## 18     4.5  500000000 17714850
## 19     4.3 1000000000  4604483
## 20     4.0 1000000000  3419513
## 21     4.3  500000000 11335255
## 22     4.3 1000000000  7165362
## 23     4.5 1000000000 27722264
## 24     4.4  500000000 22426677
## 25     4.3  500000000  8118609
## 26     4.3  500000000 10485308
## 27     4.5 1000000000 27723193
## 28     4.3  500000000 10485334
## 29     4.4  500000000 22428456
## 30     4.5  500000000 14891223
## 31     4.3  500000000  8118937
## 32     4.5 1000000000 27724094
## 33     4.4  500000000 22429716
## 34     4.4  500000000 22430188
## 35     4.5 1000000000 27725352
## 36     4.3  500000000 10486018
## 37     4.3  500000000  8119151
## 38     4.5  500000000 14892469
## 39     4.3  500000000  8119154
## 40     4.1 1000000000 78158306
## 41     4.5 1000000000 66577313
## 42     4.3  500000000  8606259
## 43     4.0  500000000 17014787
## 44     4.2 1000000000  4831125
## 45     4.0  500000000 17014705
## 46     4.5 1000000000 66577446
## 47     4.0  500000000 17015352
## 48     4.5 1000000000 10858556
## 49     4.5 1000000000 10858538
## 50     4.5 1000000000 10859051
## 51     4.3 1000000000  9235155
## 52     4.2 1000000000  2129689
## 53     4.3 1000000000  9235373
## 54     4.2 1000000000  2129707
## 55     4.4 1000000000  8033493
## 56     4.4  500000000  5745093
## 57     4.6  500000000  7790693
## 58     4.2  500000000  1859115
## 59     4.2  500000000  1859109
## 60     4.5  500000000  2084126
## 61     4.4 1000000000  2731171
## 62     4.4  500000000  1861310
## 63     4.2  500000000   858208
## 64     4.5  500000000  2084125
## 65     4.4 1000000000  2731211
## 66     4.2  500000000   858227
## 67     4.2  500000000   858230
## 68     4.4  500000000  1861309
## 69     4.1  500000000   282460
## 70     4.3 1000000000 25655305
## 71     3.7 1000000000   906384
## 72     4.5  500000000  6474426
## 73     4.5  500000000  6474672
## 74     3.9 1000000000   877635
## 75     4.3  500000000 11667403
## 76     4.4  500000000  1284017
## 77     3.9 1000000000   877643
## 78     4.4  500000000  1284018
## 79     4.0  500000000 17000166
## 80     4.3  500000000 10483141
## 81     4.5  500000000 14885236
## 82     4.5 1000000000 27711703
## 83     4.4 1000000000 69109672
## 84     4.4  500000000  5741684
## 85     4.5 1000000000 66509917
## 86     4.3 1000000000 25623548
## 87     4.5  500000000  2078744
## 88     4.1 1000000000 78128208
## 89     4.4  500000000 22419455
## 90     4.3 1000000000  9642112
## 91     4.7  500000000 42916526
## 92     4.3  500000000  8116142
## 93     4.4  500000000  1860844
## 94     4.3 1000000000  9231613
## 95     4.3  500000000  8595964
## 96     4.3  500000000 11657972
## 97     4.2  500000000 10790092
## 98     4.2 1000000000  4828372
## 99     4.2  500000000  1855262
## 100    4.4 1000000000  8021623
## 101    4.0 1000000000  3419464
## 102    4.4 1000000000  2728941
## 103    4.5  500000000  6469179
## 104    4.6  500000000  7775146
## 105    4.3  500000000 11335481
## 106    4.5 1000000000 10847682
## 107    4.3  500000000   480208
## 108    4.3 1000000000  7168735
## 109    4.7  500000000 24900999
## 110    3.9 1000000000   878065

We subset the dataset into google_eda, which contains only Rating, Installs and Reviews for analysis. It consists of 8,892 rows of data. From the summary, we know that the minimum rating is 1 and the maximum rating is 5. The mean rating is 4.188.

The minimum number of installations is 1 and the maximum is 1 billion. The mean is approximately 16.5 million installations.

Visualization is used to represent the summary of the dataset. From the boxplot, we can see that most ratings are between 3 and 5. Next, most of the installation counts are below 500 million: after filtering, only 110 of the 8,892 rows have 500 million or more installations.

Linear Regression

Brief Overview

Linear regression is one of the simplest techniques to describe or predict the relationship between two variables.

Step 1 : Scatter Plot

The code below draws a scatter plot with a fitted regression line. The line slopes slightly upward, which suggests a positive relationship between application rating and installations, but the points are widely scattered around it, so the relationship does not look strong.

ggplot(googleplay_clean, aes(x=Installs, y=Rating)) +
  geom_point(shape = 3, alpha = 0.3, col="Darkblue")+
  geom_smooth(method = 'lm', se = FALSE) +
  xlab("Total Number of Installations") +
  ylab("App Rating") +
  ggtitle("Relationship between App Rating and Installations")
## `geom_smooth()` using formula 'y ~ x'

Step 2 : Calculating the correlation values

cor((googleplay_clean$Rating), googleplay_clean$Installs)
## [1] 0.05088596

The correlation coefficient is only about 0.05, which indicates a very weak linear relationship, or practically no correlation.

Step 3 : Linear Regression modelling

linearmodel = lm(Rating~Installs, data = googleplay_clean)
linearmodel
## 
## Call:
## lm(formula = Rating ~ Installs, data = googleplay_clean)
## 
## Coefficients:
##     (Intercept)         Installs  
## 4.1828021632893  0.0000000003077
summary(linearmodel)
## 
## Call:
## lm(formula = Rating ~ Installs, data = googleplay_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1828 -0.1830  0.1157  0.3170  0.8172 
## 
## Coefficients:
##                     Estimate       Std. Error t value             Pr(>|t|)    
## (Intercept) 4.18280216328931 0.00563273674630 742.588 < 0.0000000000000002 ***
## Installs    0.00000000030774 0.00000000006406   4.804           0.00000158 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5217 on 8890 degrees of freedom
## Multiple R-squared:  0.002589,   Adjusted R-squared:  0.002477 
## F-statistic: 23.08 on 1 and 8890 DF,  p-value: 0.00000158

From the scatter plot, most ratings lie between 4 and 5. The lm model gives us the intercept and slope, the two elements of the best fit line for predicting the response variable. The R-squared value represents the proportion of variability in the response variable that is explained by the explanatory variable. For this model, only about 0.26% of the variability in Rating is explained by Installs, which is practically nothing.
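
For a regression with a single predictor, R-squared equals the square of the Pearson correlation, which links Step 2 and Step 3. The sketch below checks this on simulated data (not the Play Store data; `x`, `y` and `fit` are illustrative) and then applies the identity to the correlation reported in Step 2.

```r
# Simulated data: a weak linear signal plus noise
set.seed(1)
x <- runif(200, min = 0, max = 100)
y <- 4 + 0.001 * x + rnorm(200, sd = 0.5)

fit <- lm(y ~ x)
r_squared_lm  <- summary(fit)$r.squared
r_squared_cor <- cor(x, y)^2

# Both routes give the same value
print(c(lm = r_squared_lm, cor = r_squared_cor))

# Apply the same identity to the correlation reported in Step 2
r_reported <- 0.05088596
print(round(r_reported^2, 6))
```

Squaring 0.05088596 gives about 0.002589, which agrees with the Multiple R-squared in the model summary, i.e. roughly 0.26% of the variance.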

Step 4 : Draw the Best Fit Line

Best Fit Line

ggplot(google_eda, aes(x=Installs, y=Rating)) +
  geom_point(alpha = 0.3, col="magenta") +
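  # use the coefficients fitted in Step 3 instead of hard-coded values
  geom_abline(slope = coef(linearmodel)[["Installs"]], intercept = coef(linearmodel)[["(Intercept)"]], col = "Blue") +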
  xlab("Total of Installations") +
  ylab("App Rating") +
  ggtitle("Scatter plot with Best Fit Line")

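Hence, to answer the question of what the relationship between rating and installations is, we can conclude that there is practically no linear relationship between them. The slope is statistically significant, but it explains well under 1% of the variability in Rating, so in this dataset the number of installations tells us almost nothing about an application's rating.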

Discussion & Future works

For future work, to look for a clearer relationship between rating and installations, we may consider limiting the dataset to apps rated 4 and above. The number of installations also spans too wide a range to be helpful for prediction in its raw form.

We can also study the relationship between the Last Updated variable and Rating, as new features or bug fixes introduced by the developers may increase the app rating.

KNN Classification

Brief Overview

To answer the second question, we apply the classification method called K-nearest neighbors (KNN).

Reasons to choose KNN :

  1. One of the simplest machine learning algorithms
  2. Low training time, since the model simply stores the training data
  3. Easy to interpret results

Basic concept :

  1. The distance between each stored data point and the new instance is calculated with a similarity measure, e.g. Euclidean distance
  2. The labels of the most similar (nearest) stored points are used to make the prediction
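
The two steps of the basic concept can be sketched in a few lines of base R. This is an illustration only, not the implementation inside the class package; `knn_predict_one` is a hypothetical helper and the training data is made up.

```r
# Hypothetical helper: predict the class of one new point with k nearest neighbours
knn_predict_one <- function(train_x, train_y, new_x, k = 3) {
  # Euclidean distance from the new point to every training row
  diffs <- sweep(as.matrix(train_x), 2, as.numeric(new_x))
  dists <- sqrt(rowSums(diffs^2))

  # Labels of the k closest training rows
  nearest <- order(dists)[seq_len(k)]
  votes   <- table(train_y[nearest])

  # Majority vote (the first label wins on a tie)
  names(votes)[which.max(votes)]
}

# Made-up training data: two features, two classes
train_x <- data.frame(f1 = c(1, 2, 1, 8, 9, 8),
                      f2 = c(1, 1, 2, 8, 8, 9))
train_y <- c("low", "low", "low", "high", "high", "high")

pred_a <- knn_predict_one(train_x, train_y, c(1.5, 1.5), k = 3)
pred_b <- knn_predict_one(train_x, train_y, c(8.5, 8.5), k = 3)
print(c(pred_a, pred_b))
```

The first query point sits among the "low" rows and the second among the "high" rows, so each prediction follows its three nearest neighbours.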

Step 1 : Subset the column to a new data frame

In this assignment we focus on the number of reviews and installations to predict the rating of an application. Hence, we form a new dataset with 3 columns: Rating, Reviews and Installs.

googleplay_clean_knn <- googleplay_clean %>%
                        select(Rating, Reviews, Installs)  # columns 3, 4 and 5

datatable(googleplay_clean_knn)

Step 2 : Splitting the Data

We split the dataset from Step 1 into two parts with a ratio of 7:3: 70% is the training data and 30% is the testing data.

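set.seed(123)  # arbitrary seed so the split is reproducible
# sample.split() expects the label vector, not the whole data frame
split <- sample.split(googleplay_clean_knn$Rating, SplitRatio = 0.7)

train_data <- subset(googleplay_clean_knn, split == TRUE)
test_data <- subset(googleplay_clean_knn, split == FALSE)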

Step 3 : Building KNN Model using Class library

In this project, we use k = 5, which means the model finds the 5 nearest neighbours and uses them to predict the rating.

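# Use only the predictors as features; Rating is the label and must not be a feature
classifier_knn <- knn(train = train_data[, c("Reviews", "Installs")],
                      test = test_data[, c("Reviews", "Installs")],
                      cl = train_data$Rating,
                      k = 5)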
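
KNN relies on distances, and in this dataset Installs reaches 1,000,000,000 while Reviews reaches about 78 million, so the larger column can dominate the distance. A common remedy, shown here as a sketch on made-up values rather than as part of the original analysis, is to standardize the features using statistics learned from the training rows only.

```r
# Made-up feature values on very different scales
train_x <- data.frame(Reviews  = c(100, 2000, 50000, 900000),
                      Installs = c(1e4, 1e5, 1e7, 1e9))
test_x  <- data.frame(Reviews  = c(1500, 300000),
                      Installs = c(5e4, 5e8))

# Learn centre and spread from the training rows only
train_scaled <- scale(train_x)
centre <- attr(train_scaled, "scaled:center")
spread <- attr(train_scaled, "scaled:scale")

# Apply the same transformation to the test rows
test_scaled <- scale(test_x, center = centre, scale = spread)

print(round(colMeans(train_scaled), 10))
print(test_scaled)
```

The scaled matrices could then be passed as the train and test arguments of knn() in place of the raw columns.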

Step 4 : Result

The table below shows the predicted rating for each test row, based on its number of installations and reviews.

test_data$Rating[test_data$Rating >0]<-0
test_data$Result<- classifier_knn
datatable(test_data)

Discussion & Future Work

As you can see, conducting a KNN classification in R is not complex and is easy to understand. Mr X can simply create a data frame with his desired number of reviews and installations to predict the rating of his mobile application.

In the future we can do more to improve the prediction model, such as:

  1. Accuracy: We can evaluate the model accuracy and tune the model to achieve more accurate results.

  2. Optimization: We could optimize the model by using the Elbow method to find the best number of neighbours K.

  3. More variables: Adding more variables would give a greater variety of data for predicting the ratings.
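
The first two points can be sketched together: compute the accuracy on held-out rows for several values of K and compare them. The example below uses the built-in iris data purely for illustration (not the Play Store data) and assumes the class package is installed; the seed, the 70/30 split and the list of K values are arbitrary choices.

```r
library(class)  # provides knn()

# Built-in iris data used purely for illustration
set.seed(42)
n         <- nrow(iris)
train_idx <- sample(n, size = round(0.7 * n))

# Standardize features with training statistics
train_x <- scale(iris[train_idx, 1:4])
test_x  <- scale(iris[-train_idx, 1:4],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))
train_y <- iris$Species[train_idx]
test_y  <- iris$Species[-train_idx]

# Accuracy on the held-out rows for several values of k
k_values <- c(1, 3, 5, 7, 9, 11)
accuracy <- sapply(k_values, function(k) {
  pred <- knn(train = train_x, test = test_x, cl = train_y, k = k)
  mean(pred == test_y)
})

results <- data.frame(k = k_values, accuracy = accuracy)
print(results)
```

Plotting accuracy (or error rate) against k is one way to look for an elbow. Choosing K on the same rows used for the final evaluation gives an optimistic estimate, so cross-validation on the training data would be a more careful alternative.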

Conclusion

In this group project analyzing Google Play Store applications, we created 2 research questions for Mr X as a reference for building his mobile application. The first question asks about the relationship between Rating and the number of installations, and the second asks what rating to expect given the number of reviews and installations.

Before starting, we selected a Google Play Store dataset from Kaggle and began our data science process of Data Preparing, Data Cleansing and Data Analysis. In Data Cleansing, we performed several steps to ensure data quality, such as removing NA values. With the cleansed data, we performed Exploratory Data Analysis to understand our dataset, for example the number of installations for each category. After that, we proceeded with Linear Regression and K-Nearest Neighbours (KNN) modeling to answer our research questions.

From the results, we know that the relationship between ratings and installations is very weak, close to no relationship at all. We can tell Mr X that Ratings and Installations are not dependent on each other, so he can develop an application without worrying that its installations will affect its ratings. As for the second question, we implemented the K-nearest Neighbour classification algorithm to predict the ratings. As observed from the result, Mr X can use the KNN model to predict his application’s rating by inputting the number of installations and reviews.

For future work, we can tackle more complex questions, such as adding more variables to the KNN model to provide more holistic predictions. We can also look for relationships beyond Ratings and Installations, for example by considering Category and the app’s Last Updated date with other regression methods.

From the results and the process we implemented, we conclude that we have achieved the objectives of this group project, which are to analyze the Google Play Store apps and determine their trends, and that we have answered both of our research questions.