Mr. X plans to start 2021 by developing an Android mobile application, and he wants to know what makes a good Android application. He asks two of his data scientist friends, Mr. Phoon and Mr. Sii, to study current trends and insights from the Google Play Store.
We therefore set out to answer two research questions for Mr. X:
1. What is the relationship between an application's rating and its number of installations?
2. What rating can an application expect, given its number of reviews and installations?
The details of the dataset we have chosen are as follows:
Title : Google Play Store Apps
Source : Kaggle (https://www.kaggle.com/lava18/google-play-store-apps)
Year : 2018
Purpose : Web scraped data of 10k Play Store apps for analyzing the Android market.
library(tidyverse) # data manipulation and plotting (includes dplyr, ggplot2, stringr)
library(magrittr)
library(cluster) # clustering algorithms
library(factoextra) # clustering algorithms & visualization
library(class) # knn()
library(caTools) # sample.split()
library(DT) # datatable()
googleplay <- read.csv("../googleplaystore.csv")
datatable(googleplay)
## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html
print(paste("Dimensions of dataset:", paste(dim(googleplay), collapse = " x ")))
## [1] "Dimensions of dataset: 10841 x 13"
str(googleplay)
## 'data.frame': 10841 obs. of 13 variables:
## $ App : chr "Photo Editor & Candy Camera & Grid & ScrapBook" "Coloring book moana" "U Launcher Lite – FREE Live Cool Themes, Hide Apps" "Sketch - Draw & Paint" ...
## $ Category : chr "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" ...
## $ Rating : num 4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 ...
## $ Reviews : chr "159" "967" "87510" "215644" ...
## $ Size : chr "19M" "14M" "8.7M" "25M" ...
## $ Installs : chr "10,000+" "500,000+" "5,000,000+" "50,000,000+" ...
## $ Type : chr "Free" "Free" "Free" "Free" ...
## $ Price : chr "0" "0" "0" "0" ...
## $ Content.Rating: chr "Everyone" "Everyone" "Everyone" "Teen" ...
## $ Genres : chr "Art & Design" "Art & Design;Pretend Play" "Art & Design" "Art & Design" ...
## $ Last.Updated : chr "January 7, 2018" "January 15, 2018" "August 1, 2018" "June 8, 2018" ...
## $ Current.Ver : chr "1.0.0" "2.0.0" "1.2.4" "Varies with device" ...
## $ Android.Ver : chr "4.0.3 and up" "4.0.3 and up" "4.0.3 and up" "4.2 and up" ...
summary(googleplay)
## App Category Rating Reviews
## Length:10841 Length:10841 Min. : 1.000 Length:10841
## Class :character Class :character 1st Qu.: 4.000 Class :character
## Mode :character Mode :character Median : 4.300 Mode :character
## Mean : 4.193
## 3rd Qu.: 4.500
## Max. :19.000
## NA's :1474
## Size Installs Type Price
## Length:10841 Length:10841 Length:10841 Length:10841
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Content.Rating Genres Last.Updated Current.Ver
## Length:10841 Length:10841 Length:10841 Length:10841
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Android.Ver
## Length:10841
## Class :character
## Mode :character
##
##
##
##
After loading the data successfully, we proceed to the data cleaning process.
First, we remove rows containing NA values, because missing values would affect the accuracy of the results. As the summary above shows, all 1,474 NA values are in the Rating column.
sum(is.na(googleplay))
## [1] 1474
googleplay<-na.omit(googleplay)
sum(is.na(googleplay))
## [1] 0
Some values in the dataset include symbols such as "$", "+" and ",". We strip these symbols so that the columns can be converted to numbers.
googleplay$Price <- str_replace_all(googleplay$Price, "\\$","")
googleplay$Installs <- str_replace_all(googleplay$Installs, "\\+","")
googleplay$Installs <- str_replace_all(googleplay$Installs, "\\,", "")
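The symbol removal above can also be sketched on a small toy vector. This is a minimal illustration in base R, assuming values shaped like those in the Installs and Price columns; the vectors `installs_raw` and `price_raw` are hypothetical and not taken from the dataset.

```r
# Toy values in the same style as the Installs and Price columns (illustrative only)
installs_raw <- c("10,000+", "500,000+", "5,000,000+")
price_raw <- c("0", "$4.99", "$0.99")

# Remove "+" and "," in one pass with a character class, then convert to numbers
installs_num <- as.numeric(gsub("[+,]", "", installs_raw))

# fixed = TRUE treats "$" as a literal character rather than a regex anchor
price_num <- as.numeric(gsub("$", "", price_raw, fixed = TRUE))

print(installs_num)
print(price_num)
```

Using one character class for both symbols avoids a second pass over the column, but the two-step version in the report works just as well.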
Some columns are stored as character type although they should be numeric or integer.
googleplay$Rating <- as.numeric(googleplay$Rating)
googleplay$Reviews <- as.numeric(googleplay$Reviews)
## Warning: NAs introduced by coercion
googleplay$Price <- as.numeric(googleplay$Price)
## Warning: NAs introduced by coercion
googleplay$Installs <- as.integer(googleplay$Installs)
## Warning: NAs introduced by coercion
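The "NAs introduced by coercion" warnings mean that some text values could not be read as numbers. A minimal sketch of how such values could be located before they are silently turned into NA is shown here; the vector `reviews_raw` is a hypothetical toy example, not data from the report.

```r
# Toy vector: one entry is not a plain number (hypothetical values)
reviews_raw <- c("159", "967", "3.0M", "87510")

# suppressWarnings() hides the expected "NAs introduced by coercion" message
reviews_num <- suppressWarnings(as.numeric(reviews_raw))

# Entries that were present as text but failed to convert
bad_values <- reviews_raw[is.na(reviews_num) & !is.na(reviews_raw)]
print(bad_values)
```

Inspecting the offending values first makes it easier to decide whether to repair them or drop the affected rows.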
We noticed one row with corrupted data: its Category is “1.9” and its Rating is 19, although the maximum possible rating is 5. We remove this row.
googleplay <- subset(googleplay, Category != "1.9")
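We want to focus on applications updated in the latest 5 years of the dataset (2014 to 2018), for the following reasons:
1. Older applications might not be relevant to current Android versions.
2. They may no longer receive support or updates.
3. They are more likely to be junk applications.
# drop the columns not used in the analysis (positions 5, 7, 9, 10, 12 and 13)
googleplay_clean <- googleplay %>%
select(-c(Size, Type, Content.Rating, Genres, Current.Ver, Android.Ver)) %>%
distinct()
# note: this filter only affects the displayed table; googleplay_clean still contains all years
datatable(filter(googleplay_clean, str_detect(Last.Updated, "2014|2015|2016|2017|2018")))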
## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html
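The year filter above is applied only to the table being displayed, so the result is not stored. A minimal sketch of keeping the filtered rows in a new object is shown here, assuming the year is always the last four characters of Last.Updated; the data frame `apps` and the name `apps_recent` are hypothetical.

```r
# Toy data in the same "Month day, year" style as the Last.Updated column
apps <- data.frame(
  App = c("A", "B", "C"),
  Last.Updated = c("January 7, 2018", "March 3, 2012", "June 8, 2016"),
  stringsAsFactors = FALSE
)

# Take the last four characters of each date string as the year
year <- as.integer(substring(apps$Last.Updated, nchar(apps$Last.Updated) - 3))

# Keep only rows updated between 2014 and 2018, and store the result
apps_recent <- apps[year >= 2014 & year <= 2018, ]
print(apps_recent)
```

Extracting the year as a number avoids matching the digits "2014" to "2018" anywhere else in the string.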
datatable(googleplay_clean)
## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html
In this step, we perform some initial analysis and visualization.
To understand which category has the highest number of application installations in the dataset, we draw a bar plot.
options(scipen = 999)
ggplot(googleplay_clean, aes(x = Category, y = Installs)) +
geom_bar(stat = "identity", width = 0.7, fill = "skyblue") +
coord_flip() +
labs(title = "Total App Installation for Each Category") +
theme(axis.text.x = element_text(angle = 90))
From the plot, we can see that the Game category has the highest number of installations. The bar plot below shows that the Game category also has the highest number of reviews.
ggplot(googleplay_clean, aes(x = Category, y = Reviews)) +
geom_bar(stat = "identity", width = 0.7, fill = "indianred") +
coord_flip() +
labs(title = "Total App Reviews for Each Category") +
theme(axis.text.x = element_text(angle = 90))
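With stat = "identity", geom_bar stacks one segment per row, so the length of each bar is the category total. A minimal sketch of computing those totals explicitly is shown here, which makes it possible to read off the exact figures; the data frame `apps` holds hypothetical toy values.

```r
# Toy data with the same column names as googleplay_clean (values are made up)
apps <- data.frame(
  Category = c("GAME", "GAME", "TOOLS", "FAMILY"),
  Installs = c(1000, 5000, 500, 100),
  stringsAsFactors = FALSE
)

# Sum the installs within each category
totals <- aggregate(Installs ~ Category, data = apps, FUN = sum)

# Sort from the largest total to the smallest
totals <- totals[order(-totals$Installs), ]
print(totals)
```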
However, to answer our question of whether rating is related to installations on the Google Play Store, we perform some exploratory data analysis (EDA). We explore the variables Rating and Installs from the dataset using univariate analysis.
Brief description of the EDA variables:
Rating: A numeric value from 1 to 5, given by users to rate a particular application. The minimum rating is 1 and the maximum rating is 5.
Installs: A numeric value giving the total number of installs of each application.
google_eda <- googleplay_clean %>% select(Rating, Installs, Reviews)
summary(google_eda)
## Rating Installs Reviews
## Min. :1.000 Min. : 1 Min. : 1
## 1st Qu.:4.000 1st Qu.: 10000 1st Qu.: 164
## Median :4.300 Median : 500000 Median : 4714
## Mean :4.188 Mean : 16489648 Mean : 472776
## 3rd Qu.:4.500 3rd Qu.: 5000000 3rd Qu.: 71267
## Max. :5.000 Max. :1000000000 Max. :78158306
glimpse(google_eda)
## Rows: 8,892
## Columns: 3
## $ Rating <dbl> 4.1, 3.9, 4.7, 4.5, 4.3, 4.4, 3.8, 4.1, 4.4, 4.7, 4.4, 4.4...
## $ Installs <int> 10000, 500000, 5000000, 50000000, 100000, 50000, 50000, 10...
## $ Reviews <dbl> 159, 967, 87510, 215644, 967, 167, 178, 36815, 13791, 121,...
ggplot(google_eda, aes(y=Rating)) +
geom_boxplot() +
ggtitle("Boxplot of App Rating") +
ylab("Rating")
ggplot(google_eda, aes(x=Reviews)) +
geom_histogram(fill = "indianred", col = "black") +
ggtitle("Histogram Distribution of Reviews") +
xlab("Reviews")+
xlim(0, 20000000)+
ylim(0,1000)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 30 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
ggplot(google_eda, aes(x=Installs)) +
geom_histogram(fill = "orange", col = "black") +
ggtitle("Histogram Distribution of Installs") +
xlab("Installs")+
xlim(0, 1000000000)+
ylim(0, 2500)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing missing values (geom_bar).
google_eda %>% filter(Installs >= 500000000)
## Rating Installs Reviews
## 1 3.9 1000000000 1433233
## 2 4.0 1000000000 56642847
## 3 4.4 1000000000 69119316
## 4 4.3 1000000000 9642995
## 5 4.3 1000000000 4604324
## 6 4.0 1000000000 3419249
## 7 4.3 500000000 11334799
## 8 4.3 500000000 4785892
## 9 4.6 500000000 2083237
## 10 4.5 500000000 17712922
## 11 4.0 1000000000 56646578
## 12 4.3 500000000 4785988
## 13 4.3 500000000 11334973
## 14 4.0 1000000000 3419433
## 15 4.1 1000000000 10484169
## 16 4.2 500000000 10790289
## 17 4.3 1000000000 9643041
## 18 4.5 500000000 17714850
## 19 4.3 1000000000 4604483
## 20 4.0 1000000000 3419513
## 21 4.3 500000000 11335255
## 22 4.3 1000000000 7165362
## 23 4.5 1000000000 27722264
## 24 4.4 500000000 22426677
## 25 4.3 500000000 8118609
## 26 4.3 500000000 10485308
## 27 4.5 1000000000 27723193
## 28 4.3 500000000 10485334
## 29 4.4 500000000 22428456
## 30 4.5 500000000 14891223
## 31 4.3 500000000 8118937
## 32 4.5 1000000000 27724094
## 33 4.4 500000000 22429716
## 34 4.4 500000000 22430188
## 35 4.5 1000000000 27725352
## 36 4.3 500000000 10486018
## 37 4.3 500000000 8119151
## 38 4.5 500000000 14892469
## 39 4.3 500000000 8119154
## 40 4.1 1000000000 78158306
## 41 4.5 1000000000 66577313
## 42 4.3 500000000 8606259
## 43 4.0 500000000 17014787
## 44 4.2 1000000000 4831125
## 45 4.0 500000000 17014705
## 46 4.5 1000000000 66577446
## 47 4.0 500000000 17015352
## 48 4.5 1000000000 10858556
## 49 4.5 1000000000 10858538
## 50 4.5 1000000000 10859051
## 51 4.3 1000000000 9235155
## 52 4.2 1000000000 2129689
## 53 4.3 1000000000 9235373
## 54 4.2 1000000000 2129707
## 55 4.4 1000000000 8033493
## 56 4.4 500000000 5745093
## 57 4.6 500000000 7790693
## 58 4.2 500000000 1859115
## 59 4.2 500000000 1859109
## 60 4.5 500000000 2084126
## 61 4.4 1000000000 2731171
## 62 4.4 500000000 1861310
## 63 4.2 500000000 858208
## 64 4.5 500000000 2084125
## 65 4.4 1000000000 2731211
## 66 4.2 500000000 858227
## 67 4.2 500000000 858230
## 68 4.4 500000000 1861309
## 69 4.1 500000000 282460
## 70 4.3 1000000000 25655305
## 71 3.7 1000000000 906384
## 72 4.5 500000000 6474426
## 73 4.5 500000000 6474672
## 74 3.9 1000000000 877635
## 75 4.3 500000000 11667403
## 76 4.4 500000000 1284017
## 77 3.9 1000000000 877643
## 78 4.4 500000000 1284018
## 79 4.0 500000000 17000166
## 80 4.3 500000000 10483141
## 81 4.5 500000000 14885236
## 82 4.5 1000000000 27711703
## 83 4.4 1000000000 69109672
## 84 4.4 500000000 5741684
## 85 4.5 1000000000 66509917
## 86 4.3 1000000000 25623548
## 87 4.5 500000000 2078744
## 88 4.1 1000000000 78128208
## 89 4.4 500000000 22419455
## 90 4.3 1000000000 9642112
## 91 4.7 500000000 42916526
## 92 4.3 500000000 8116142
## 93 4.4 500000000 1860844
## 94 4.3 1000000000 9231613
## 95 4.3 500000000 8595964
## 96 4.3 500000000 11657972
## 97 4.2 500000000 10790092
## 98 4.2 1000000000 4828372
## 99 4.2 500000000 1855262
## 100 4.4 1000000000 8021623
## 101 4.0 1000000000 3419464
## 102 4.4 1000000000 2728941
## 103 4.5 500000000 6469179
## 104 4.6 500000000 7775146
## 105 4.3 500000000 11335481
## 106 4.5 1000000000 10847682
## 107 4.3 500000000 480208
## 108 4.3 1000000000 7168735
## 109 4.7 500000000 24900999
## 110 3.9 1000000000 878065
We collected the Rating, Installs and Reviews columns into google_eda for analysis. This dataset consists of 8,892 rows. From the summary, the minimum rating is 1, the maximum rating is 5, and the mean rating is 4.188.
The minimum number of installations is 1 and the maximum is 1 billion. The mean is approximately 16.5 million, far above the median of 500,000, which indicates a strongly right-skewed distribution.
The visualizations summarize the dataset. The boxplot shows that most ratings lie between 3 and 5, and the histogram shows that most applications have fewer than 500 million installations. After filtering, only 110 of the 8,892 rows have 500 million installations or more.
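The figures in this summary can be checked with a short calculation. This sketch uses only numbers printed in the output above (110 filtered rows, 8,892 total rows, and the mean and median of Installs); the variable names are illustrative.

```r
# Counts taken from the filter output and from glimpse(google_eda)
n_large <- 110
n_total <- 8892

# Share of rows with 500 million installs or more, in percent
share <- 100 * n_large / n_total

# Mean and median of Installs, taken from summary(google_eda)
mean_installs <- 16489648
median_installs <- 500000

# A mean far above the median is a sign of a right-skewed distribution
ratio <- mean_installs / median_installs

cat(sprintf("Share of rows with >= 500M installs: %.2f%%\n", share))
cat(sprintf("Mean / median installs: %.1f\n", ratio))
```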
Linear regression is a simple technique for describing or predicting the relationship between two variables.
The code below draws a scatter plot with a fitted regression line. The line slopes slightly upward, which suggests a positive, though possibly weak, relationship between application rating and installations.
ggplot(googleplay_clean, aes(x=Installs, y=Rating)) +
geom_point(shape = 3, alpha = 0.3, col="Darkblue")+
geom_smooth(method = 'lm', se = FALSE) +
xlab("Total Number of Installations") +
ylab("App Rating") +
ggtitle("Relationship between App Rating and Installations")
## `geom_smooth()` using formula 'y ~ x'
cor((googleplay_clean$Rating), googleplay_clean$Installs)
## [1] 0.05088596
The correlation coefficient is only about 0.05, which indicates a very weak linear relationship, or practically no correlation.
linearmodel = lm(Rating~Installs, data = googleplay_clean)
linearmodel
##
## Call:
## lm(formula = Rating ~ Installs, data = googleplay_clean)
##
## Coefficients:
## (Intercept) Installs
## 4.1828021632893 0.0000000003077
summary(linearmodel)
##
## Call:
## lm(formula = Rating ~ Installs, data = googleplay_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1828 -0.1830 0.1157 0.3170 0.8172
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.18280216328931 0.00563273674630 742.588 < 0.0000000000000002 ***
## Installs 0.00000000030774 0.00000000006406 4.804 0.00000158 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5217 on 8890 degrees of freedom
## Multiple R-squared: 0.002589, Adjusted R-squared: 0.002477
## F-statistic: 23.08 on 1 and 8890 DF, p-value: 0.00000158
From the observations, most ratings are between 4 and 5. The lm model gives us the intercept and slope, the two elements of the best-fit line for predicting the response variable. The R-squared value represents the proportion of variability in the response variable that is explained by the explanatory variable. For this model, approximately 0.26% of the variability in Rating is explained by Installs, which is practically nothing.
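For a linear regression with a single explanatory variable, R-squared equals the square of the correlation coefficient. A short check using the correlation printed earlier (0.05088596) reproduces the Multiple R-squared reported by summary(linearmodel); the variable names are illustrative.

```r
# Correlation between Rating and Installs, as printed by cor() above
r <- 0.05088596

# In simple linear regression, R-squared is the squared correlation
r_squared <- r^2

cat(sprintf("R-squared = %.6f (%.2f%% of the variability)\n",
            r_squared, 100 * r_squared))
```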
Best Fit Line
ggplot(google_eda, aes(x=Installs, y=Rating)) +
geom_point(alpha = 0.3, col="magenta") +
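# take slope and intercept from the fitted model instead of hard-coded values
geom_abline(slope = coef(linearmodel)[["Installs"]], intercept = coef(linearmodel)[["(Intercept)"]], col = "Blue") +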
xlab("Total of Installations") +
ylab("App Rating") +
ggtitle("Scatter plot with Best Fit Line")
Hence, to answer the question of what the relationship between rating and installations is, we conclude that there is practically no linear relationship between them. Within this dataset, an application's rating does not appear to be linked to its number of installations.
For future work, to look for a clearer relationship between rating and installations, we may consider limiting the dataset to ratings of 4 and above. The number of installations also spans a very wide range (from 1 to 1 billion), which makes it hard to use directly for prediction.
We could also examine the relationship between the Last Updated variable and Rating, since new features or bug fixes introduced by developers may increase the app rating.
To answer the second research question, we apply a classification method called K-nearest neighbours (KNN).
Reasons for choosing KNN:
1. It is one of the simplest machine learning algorithms.
2. It needs little calculation time on a dataset of this size.
3. Its results are easy to interpret.
Basic concept:
1. The distance between each stored data point and the new instance is calculated using a similarity measure, e.g. Euclidean distance.
2. The most similar stored points are then used to make the prediction.
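The two steps of the basic concept can be sketched from scratch in base R. This is a minimal illustration and not the implementation used by the class package; the function `knn_predict_one` and the toy data are hypothetical.

```r
# Predict the label of one new point by majority vote among its k nearest neighbours
knn_predict_one <- function(train_x, train_y, new_x, k = 3) {
  # Step 1: Euclidean distance from new_x to every training row
  diffs <- sweep(as.matrix(train_x), 2, as.numeric(new_x))
  dists <- sqrt(rowSums(diffs^2))

  # Step 2: take the k closest rows and return the most common label among them
  nearest <- order(dists)[seq_len(k)]
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# Toy training data: two well-separated groups (made-up values)
train_x <- data.frame(Reviews = c(10, 12, 11, 200, 210, 190),
                      Installs = c(100, 110, 90, 5000, 5200, 4800))
train_y <- c("low", "low", "low", "high", "high", "high")

pred <- knn_predict_one(train_x, train_y, c(205, 5100), k = 3)
print(pred)
```

Because the new point lies next to the second group, its three nearest neighbours all carry the label "high".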
In this assignment, we use the number of reviews and installations to predict the rating of an application. Hence, we form a new dataset with 3 columns: Rating, Reviews and Installs.
googleplay_clean_knn <- googleplay_clean %>%
select(c(3,4,5))
datatable(googleplay_clean_knn)
We split this dataset into 2 parts with a ratio of 7:3: 70% is the training data and 30% is the testing data.
set.seed(123) # arbitrary seed so that the split is reproducible
# sample.split() expects the label vector, not the whole data frame
split <- sample.split(googleplay_clean_knn$Rating, SplitRatio = 0.7)
train_data <- subset(googleplay_clean_knn, split == TRUE)
test_data <- subset(googleplay_clean_knn, split == FALSE)
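A 70/30 split can also be sketched with base R alone, which makes it easy to confirm that rows are assigned individually and that no row ends up in both parts. The data frame `apps`, the seed, and the names `train_toy` and `test_toy` are hypothetical.

```r
set.seed(42)  # any fixed seed makes the split reproducible

# Toy data with the same three columns as googleplay_clean_knn (made-up values)
apps <- data.frame(Rating = rep(c(4.1, 4.3, 4.5, 3.9, 4.0), 20),
                   Reviews = 1:100,
                   Installs = (1:100) * 10)

# Draw 70% of the row indices at random for training
n <- nrow(apps)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))

train_toy <- apps[train_idx, ]
test_toy <- apps[-train_idx, ]  # the remaining 30%

cat(nrow(train_toy), nrow(test_toy), "\n")
```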
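In this project, we use k = 5, which means the model finds the 5 nearest neighbours of each test row and predicts its rating from them.
# use only the predictors as features; Rating is the label and must not be part of train/test
classifier_knn <- knn(train = train_data[, c("Reviews", "Installs")],
test = test_data[, c("Reviews", "Installs")],
cl = factor(train_data$Rating),
k = 5)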
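KNN relies on distances, so a variable with a much larger range (here Installs) can dominate the result. One common option, which the report itself does not apply, is to standardize the predictors using the training set's mean and standard deviation. The sketch below assumes that approach; the small data frames reuse a few values shown in the str() output, and all object names are hypothetical.

```r
# A few Reviews/Installs values as shown in the str() output above
train_x <- data.frame(Reviews = c(159, 967, 87510),
                      Installs = c(10000, 500000, 5000000))
test_x <- data.frame(Reviews = 215644, Installs = 50000000)

# Compute the scaling parameters on the training data only
centers <- colMeans(train_x)
spreads <- apply(train_x, 2, sd)

# Apply the same parameters to both sets so they share one scale
train_scaled <- scale(train_x, center = centers, scale = spreads)
test_scaled <- scale(test_x, center = centers, scale = spreads)

print(round(colMeans(train_scaled), 10))
```

Reusing the training parameters for the test set keeps information from the test rows out of the preprocessing step.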
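The following table shows the predicted rating of each test row, based on its number of reviews and installations.
# blank out the actual rating so that the table shows only the predicted value in Result
test_data$Rating[test_data$Rating > 0] <- 0
test_data$Result <- classifier_knn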
datatable(test_data)
Discussion & Future Work
As shown above, conducting a KNN classification in R is not complex and is easy to understand. Mr. X can simply create a data frame with his expected number of reviews and installations to predict the rating of his mobile application.
In the future, we can improve the prediction model in the following ways:
Accuracy: We can evaluate the model accuracy and tune the model to achieve more accurate results.
Optimization: We could optimize the model by using the elbow method to find the best number of neighbours K.
More variables: Adding more variables would give the model a wider variety of data for predicting the ratings.
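The accuracy and optimization ideas above can be sketched together: compute the accuracy on held-out rows for several values of k and compare them. This is a minimal illustration that uses R's built-in iris data as a stand-in, not the Play Store dataset; the seed, the candidate values of k, and the object names are assumptions.

```r
library(class)  # provides knn()

set.seed(1)  # knn() breaks ties at random, so fix the seed

# Stand-in data: 70% of iris for training, 30% for testing
idx <- sample(seq_len(nrow(iris)), size = 105)
train_x <- iris[idx, 1:4]
test_x <- iris[-idx, 1:4]
train_y <- iris$Species[idx]
test_y <- iris$Species[-idx]

# Accuracy = share of test rows whose predicted class matches the true class
ks <- c(1, 3, 5, 7, 9)
accuracy <- sapply(ks, function(k) {
  pred <- knn(train = train_x, test = test_x, cl = train_y, k = k)
  mean(pred == test_y)
})

results <- data.frame(k = ks, accuracy = accuracy)
print(results)

# The k with the highest held-out accuracy
best_k <- results$k[which.max(results$accuracy)]
```

Plotting accuracy (or error) against k and looking for the point where the curve flattens is the idea behind the elbow method. In practice a separate validation set or cross-validation would be a safer basis for choosing k than the final test set.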
In this group project on analyzing Google Play Store applications, we created 2 research questions for Mr. X as a reference for building his mobile application. The first question is what the relationship between rating and number of installations is, and the second is what rating to expect when the number of reviews and installations is given.
We selected a Google Play Store dataset from Kaggle and followed a data science process of data preparation, data cleansing and data analysis. In data cleansing, we performed several steps to ensure data quality, such as removing NA values. With the cleansed data, we performed exploratory data analysis to understand the dataset, for example the number of installations in each category. After that, we proceeded with modelling, using linear regression and K-nearest neighbours (KNN) to answer our research questions.
From the results, the relationship between ratings and installations is very weak, close to no relationship at all. We can tell Mr. X that ratings and installations do not depend on each other in this dataset, so he can develop an application without worrying that its installations will affect its ratings. As for KNN, we implemented the K-nearest neighbours classification algorithm to predict ratings. Mr. X can use the KNN model to predict his application's rating by inputting the number of installations and reviews.
For future work, we can address more complex questions, such as adding more variables to the KNN model to provide more holistic predictions. We can also look for relationships other than the one between ratings and installations, for example by considering Category and Last Updated date with other regression methods.
From the results and the process we have implemented, we conclude that we have achieved the objectives of this group project, which are to analyze the Google Play Store apps, determine trends in the Google Play Store, and answer both of our research questions.