According to Statista, as of September 2019 there were approximately 590,000 apps in the Google Play Store with an average rating of 4.5 out of 5 stars or higher. In total, 1.5 million apps had a rating, and a further 1.24 million apps in the Google Play Store had fewer than three user ratings.
Along with the growth of mobile apps and the increasing number of developers who make their living from mobile development alone, it has become important for developers to be able to predict the success of their applications. The purpose of this project is to predict the rating of a Google Play Store application, because ratings strongly influence how users perceive an app: applications with higher ratings are more likely to be recommended and trusted by users.
This project uses a dataset obtained from Kaggle. The main dataset has 12 features and one target variable (the rating), with about 10.8k entries. The user reviews dataset contains the 100 most relevant reviews per app and 5 features, for a total of 64.3k entries. All the data was acquired by scraping the Google Play Store directly and was last updated 10 months ago (Version 6).
The initial data looked like the dataframe below.
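As a sketch, the data can be loaded and inspected like this; the file name googleplaystore.csv is an assumption based on the Kaggle dataset's naming:

# Load the scraped Play Store data; the file name is an assumption based
# on the Kaggle dataset. stringsAsFactors = TRUE mirrors the factor
# columns visible in the summary output further below.
gps <- read.csv("googleplaystore.csv", stringsAsFactors = TRUE)
head(gps)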
The following is an explanation of the features contained in the entire dataset.
App : Application name
Category : Category the app belongs to
Rating : Overall user rating of the app (as when scraped)
Reviews : Number of user reviews for the app (as when scraped)
Size : Size of the app (as when scraped)
Installs : Number of user downloads/installs for the app (as when scraped)
Type : Paid or Free
Price : Price of the app (as when scraped)
Content Rating : Age group the app is targeted at - Children / Mature 21+ / Adult
Genres : An app can belong to multiple genres (apart from its main category). For example, a musical family game will belong to the Music, Game, and Family genres.
Last Updated : Date when the app was last updated on the Play Store (as when scraped)
Current Ver : Current version of the app available on the Play Store (as when scraped)
Android Ver : Minimum required Android version (as when scraped)

After the dataset is collected, the next step in the process is preprocessing. At this stage we do data wrangling (sometimes called data munging), that is, transforming the raw data into a tidy form that is ready to be analyzed. After observing the data, the preprocessing steps we apply to the dataset are as follows:
library(dplyr)     # data wrangling verbs: mutate(), select(), distinct()
library(lubridate) # date helpers: mdy(), year(), month()

gps <- gps %>%
  mutate(
    App = as.character(App),
    Reviews = as.numeric(as.character(Reviews)), # via character, in case Reviews was read as a factor
    Size = gsub("M", "", Size),                           # strip the "M" (megabyte) suffix
    Size = ifelse(grepl("k", Size), 0, as.numeric(Size)), # treat kilobyte-sized apps as 0 MB
    Installs = gsub("\\+", "", as.character(Installs)),   # drop the trailing "+"
    Installs = as.numeric(gsub(",", "", Installs)),       # drop thousands separators
    Price = as.numeric(gsub("\\$", "", as.character(Price))), # drop the "$" sign
    Last.Updated = mdy(Last.Updated),
    Year.Updated = year(Last.Updated),
    Month.Updated = month(Last.Updated)
  ) %>%
  select(-c(12:13)) %>% # remove unused columns (Current.Ver, Android.Ver)
  distinct()            # remove duplicated rows

We can now look at a summary of the data before continuing with the preprocessing. The following is a summary of the data we have now.
#> App Category Rating Reviews
#> Length:10358 FAMILY :1943 Min. : 1.000 Min. : 1
#> Class :character GAME :1121 1st Qu.: 4.000 1st Qu.:1123
#> Mode :character TOOLS : 843 Median : 4.300 Median :2738
#> BUSINESS : 427 Mean : 4.189 Mean :2728
#> MEDICAL : 408 3rd Qu.: 4.500 3rd Qu.:4307
#> PRODUCTIVITY: 407 Max. :19.000 Max. :6002
#> (Other) :5209 NA's :1465
#> Size Installs Type Price
#> Min. : 0.00 Min. : 0 0 : 1 Min. : 0.000
#> 1st Qu.: 4.70 1st Qu.: 1000 Free:9591 1st Qu.: 0.000
#> Median : 13.00 Median : 100000 NaN : 1 Median : 0.000
#> Mean : 21.27 Mean : 14157759 Paid: 765 Mean : 1.031
#> 3rd Qu.: 29.00 3rd Qu.: 1000000 3rd Qu.: 0.000
#> Max. :100.00 Max. :1000000000 Max. :400.000
#> NA's :1527 NA's :1 NA's :1
#> Content.Rating Genres Last.Updated
#> : 1 Tools : 842 Min. :2010-05-21
#> Adults only 18+: 3 Entertainment: 588 1st Qu.:2017-09-03
#> Everyone :8382 Education : 527 Median :2018-05-20
#> Everyone 10+ : 377 Business : 427 Mean :2017-11-14
#> Mature 17+ : 447 Medical : 408 3rd Qu.:2018-07-19
#> Teen :1146 Productivity : 407 Max. :2018-08-08
#> Unrated : 2 (Other) :7159 NA's :1
#> Year.Updated Month.Updated
#> Min. :2010 Min. : 1.000
#> 1st Qu.:2017 1st Qu.: 5.000
#> Median :2018 Median : 7.000
#> Mean :2017 Mean : 6.397
#> 3rd Qu.:2018 3rd Qu.: 8.000
#> Max. :2018 Max. :12.000
#> NA's :1 NA's :1
In addition to the transformations above, we can also do some feature engineering, for example by adding a variable containing the category/grade of an app's number of installs, as sketched below.
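A minimal sketch of that feature: the under-10k boundary for Grade C matches the observation made later in this post, while the other cut-offs are illustrative assumptions.

# Hypothetical install-grade feature. The <10k boundary for Grade C is
# taken from the observation below; the other thresholds are illustrative.
gps <- gps %>%
  mutate(
    Installs.Grade = case_when(
      is.na(Installs) ~ NA_character_,
      Installs >= 1e6 ~ "A", # 1M+ installs (illustrative cut-off)
      Installs >= 1e4 ~ "B", # 10k up to 1M installs (illustrative cut-off)
      TRUE            ~ "C"  # fewer than 10k installs
    )
  )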
Missing data can be a nontrivial problem when analyzing a dataset, and accounting for it is usually not straightforward either. There are many ways to approach missing data; one of them is imputation. Imputation simply means replacing the missing values with estimates, then analyzing the full dataset as if the imputed values were actual observed values.
Then we can analyze the variables that have NA values and impute them based on the data requirements for further analysis.
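For instance, a quick check of where the NA values sit, and whether the missing ratings line up with low-install apps:

# Count missing values per column
colSums(is.na(gps))

# Do apps with fewer than 10k installs account for the missing ratings?
table(gps$Installs < 1e4, is.na(gps$Rating))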
Most of the missing Rating values come from the Grade C category of Installs, meaning that apps with fewer than 10k installs tend to have no rating.
Another way to impute missing values is to use another type of model, such as linear regression, KNN, or bagged-tree imputation, to predict what the missing Size values would be based on the other features in the dataset.
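A minimal sketch of the bagged-tree approach using caret's preProcess(); the set of columns fed to the imputer is an assumption, not necessarily the setup used in this project:

# Model-based imputation with caret's bagged trees (requires the ipred
# package). Rating is left out on purpose, since it is the target.
library(caret)

num_cols <- c("Size", "Reviews", "Installs", "Price")
pre <- preProcess(gps[, num_cols], method = "bagImpute")
gps[, num_cols] <- predict(pre, gps[, num_cols])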
The following is a summary of the Size feature after imputation:
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.00 6.00 16.00 22.64 29.00 100.00
Before we proceed to modeling, we need to do Exploratory Data Analysis. At this point we can analyze the correlations between features/variables and, of course, observe the target variable Rating that we will later predict. As you can see in the plot below, features like Reviews, Installs, and Price are not normally distributed.
Therefore, before fitting the data to a model later, we have to apply a logarithmic transformation to the Reviews, Installs, and Price columns.
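A sketch of that transformation; using log1p, i.e. log(1 + x), is an assumption here, chosen so that zero installs or zero prices do not produce -Inf:

# Log-transform the right-skewed columns (dplyr is already loaded above)
gps <- gps %>%
  mutate(
    Log.Reviews  = log1p(Reviews),
    Log.Installs = log1p(Installs),
    Log.Price    = log1p(Price)
  )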
We can then observe the target variable Rating. Because Rating is a continuous numeric value, its distribution is an important thing to examine.
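For example, with ggplot2 (the binwidth here is an arbitrary choice):

# Plot the distribution of the target variable Rating
library(ggplot2)

ggplot(gps, aes(x = Rating)) +
  geom_histogram(binwidth = 0.1, na.rm = TRUE) +
  labs(title = "Distribution of App Ratings", x = "Rating", y = "Count")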
The result we got looks different from the graphs derived from Statista; this may be because the data we used is not the same, or is incomplete.
Next, we try to find out how the Google Play Store market itself is distributed. This can be seen from the number of Google Play Store applications in each category and, of course, the average rating of the applications in those categories.
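A sketch of the per-category summary behind such a plot:

# Number of apps and average rating per category
category_summary <- gps %>%
  group_by(Category) %>%
  summarise(
    Total.Apps = n(),
    Avg.Rating = mean(Rating, na.rm = TRUE)
  ) %>%
  arrange(desc(Total.Apps))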
We can also observe the same thing for the Type feature. Let's take a look at the total split between free and paid applications.
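For instance:

# Split between free and paid apps, with their average ratings
gps %>%
  group_by(Type) %>%
  summarise(
    Total.Apps = n(),
    Avg.Rating = mean(Rating, na.rm = TRUE)
  )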
For this project, we took the Google Play Store dataset, then analyzed and processed the data. After the data was transformed into a usable set, we used plots and summary functions to understand the correlations between the features. We then used this knowledge to build the best model we could for predicting ratings from the cleaned dataset.
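The post does not show the final model code; purely as an illustrative baseline under the preprocessing above (the model family and the predictor set are assumptions, not the project's actual model):

# A hypothetical baseline: linear regression of Rating on the engineered
# features. lm() drops rows with remaining NA values by default.
model <- lm(Rating ~ Log.Reviews + Log.Installs + Log.Price + Size + Type + Category,
            data = gps)
summary(model)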