According to Statista, as of September 2019 there were approximately 590,000 apps in the Google Play Store with an average rating of 4.5 out of 5 stars or higher. In total, 1.5 million apps had a rating, and a further 1.24 million apps in the Google Play Store had fewer than three user ratings.
Along with the growth of mobile apps and the increasing number of developers who make their living from mobile development alone, it has become important for developers to be able to predict the success of their applications. The purpose of this project is to predict the rating of a Google Play Store application, because ratings strongly influence how users perceive an app: applications with higher ratings are more likely to be recommended and trusted by users.
This project uses a dataset obtained from Kaggle. The main dataset has 12 features and one target variable (the rating), with about 10.8k entries. The user reviews dataset contains the 100 most relevant reviews per app and 5 features, for a total of 64.3k entries. All the data was acquired by scraping the Google Play Store directly and was last updated 10 months ago (Version 6).
The initial data looked like the dataframe below.
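As a sketch, the data can be loaded and inspected like this; the file name googleplaystore.csv is an assumption based on the Kaggle dataset's naming:

# Load the scraped Play Store data; the file name is an assumption based
# on the Kaggle dataset. stringsAsFactors = TRUE mirrors the factor
# columns visible in the summary output further below.
gps <- read.csv("googleplaystore.csv", stringsAsFactors = TRUE)
head(gps)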
The following is an explanation of the features contained in the entire dataset.
App : Application name
Category : Category the app belongs to
Rating : Overall user rating of the app (as when scraped)
Reviews : Number of user reviews for the app (as when scraped)
Size : Size of the app (as when scraped)
Installs : Number of user downloads/installs for the app (as when scraped)
Type : Paid or Free
Price : Price of the app (as when scraped)
Content Rating : Age group the app is targeted at - Children / Mature 21+ / Adult
Genres : An app can belong to multiple genres (apart from its main category). For example, a musical family game will belong to the Music, Game, and Family genres.
Last Updated : Date when the app was last updated on the Play Store (as when scraped)
Current Ver : Current version of the app available on the Play Store (as when scraped)
Android Ver : Minimum required Android version (as when scraped)

After the dataset is collected, the next step in the process is preprocessing. At this stage we do data wrangling (sometimes called data munging), that is, transforming the raw data into a tidy form that is ready to be analyzed. After observing the data, the preprocessing steps we apply to the dataset are as follows:
library(dplyr)     # data wrangling verbs: mutate(), select(), distinct()
library(lubridate) # date helpers: mdy(), year(), month()

gps <- gps %>%
  mutate(
    App = as.character(App),
    Reviews = as.numeric(as.character(Reviews)), # via character, in case Reviews was read as a factor
    Size = gsub("M", "", Size),                           # strip the "M" (megabyte) suffix
    Size = ifelse(grepl("k", Size), 0, as.numeric(Size)), # treat kilobyte-sized apps as 0 MB
    Installs = gsub("\\+", "", as.character(Installs)),   # drop the trailing "+"
    Installs = as.numeric(gsub(",", "", Installs)),       # drop thousands separators
    Price = as.numeric(gsub("\\$", "", as.character(Price))), # drop the "$" sign
    Last.Updated = mdy(Last.Updated),
    Year.Updated = year(Last.Updated),
    Month.Updated = month(Last.Updated)
  ) %>%
  select(-c(12:13)) %>% # remove unused columns (Current.Ver, Android.Ver)
  distinct()            # remove duplicated rows

We can now look at a summary of the data before continuing with the preprocessing. The following is a summary of the data we have now.
#> App Category Rating Reviews
#> Length:10358 FAMILY :1943 Min. : 1.000 Min. : 1
#> Class :character GAME :1121 1st Qu.: 4.000 1st Qu.:1123
#> Mode :character TOOLS : 843 Median : 4.300 Median :2738
#> BUSINESS : 427 Mean : 4.189 Mean :2728
#> MEDICAL : 408 3rd Qu.: 4.500 3rd Qu.:4307
#> PRODUCTIVITY: 407 Max. :19.000 Max. :6002
#> (Other) :5209 NA's :1465
#> Size Installs Type Price
#> Min. : 0.00 Min. : 0 0 : 1 Min. : 0.000
#> 1st Qu.: 4.70 1st Qu.: 1000 Free:9591 1st Qu.: 0.000
#> Median : 13.00 Median : 100000 NaN : 1 Median : 0.000
#> Mean : 21.27 Mean : 14157759 Paid: 765 Mean : 1.031
#> 3rd Qu.: 29.00 3rd Qu.: 1000000 3rd Qu.: 0.000
#> Max. :100.00 Max. :1000000000 Max. :400.000
#> NA's :1527 NA's :1 NA's :1
#> Content.Rating Genres Last.Updated
#> : 1 Tools : 842 Min. :2010-05-21
#> Adults only 18+: 3 Entertainment: 588 1st Qu.:2017-09-03
#> Everyone :8382 Education : 527 Median :2018-05-20
#> Everyone 10+ : 377 Business : 427 Mean :2017-11-14
#> Mature 17+ : 447 Medical : 408 3rd Qu.:2018-07-19
#> Teen :1146 Productivity : 407 Max. :2018-08-08
#> Unrated : 2 (Other) :7159 NA's :1
#> Year.Updated Month.Updated
#> Min. :2010 Min. : 1.000
#> 1st Qu.:2017 1st Qu.: 5.000
#> Median :2018 Median : 7.000
#> Mean :2017 Mean : 6.397
#> 3rd Qu.:2018 3rd Qu.: 8.000
#> Max. :2018 Max. :12.000
#> NA's :1 NA's :1
In addition to the transformations above, we can also do some feature engineering, for example by adding a variable containing the category/grade of an app's number of installs, as sketched below.
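A minimal sketch of that feature: the under-10k boundary for Grade C matches the observation made later in this post, while the other cut-offs are illustrative assumptions.

# Hypothetical install-grade feature. The <10k boundary for Grade C is
# taken from the observation below; the other thresholds are illustrative.
gps <- gps %>%
  mutate(
    Installs.Grade = case_when(
      is.na(Installs) ~ NA_character_,
      Installs >= 1e6 ~ "A", # 1M+ installs (illustrative cut-off)
      Installs >= 1e4 ~ "B", # 10k up to 1M installs (illustrative cut-off)
      TRUE            ~ "C"  # fewer than 10k installs
    )
  )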
Missing data can be a nontrivial problem when analyzing a dataset, and accounting for it is usually not straightforward either. There are many ways to approach missing data; one of them is imputation. Imputation simply means replacing the missing values with estimates, then analyzing the full dataset as if the imputed values were actual observed values.
Then we can analyze the variables that have NA values and impute them based on the data requirements for further analysis.
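For instance, a quick check of where the NA values sit, and whether the missing ratings line up with low-install apps:

# Count missing values per column
colSums(is.na(gps))

# Do apps with fewer than 10k installs account for the missing ratings?
table(gps$Installs < 1e4, is.na(gps$Rating))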
Most of the missing Rating values come from the Grade C category of Installs, meaning that apps with fewer than 10k installs tend to have no rating.
Another way to impute missing values is to use another type of model, such as linear regression, KNN, or bagged-tree imputation, to predict what the missing Size values would be based on the other features in the dataset.
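A minimal sketch of the bagged-tree approach using caret's preProcess(); the set of columns fed to the imputer is an assumption, not necessarily the setup used in this project:

# Model-based imputation with caret's bagged trees (requires the ipred
# package). Rating is left out on purpose, since it is the target.
library(caret)

num_cols <- c("Size", "Reviews", "Installs", "Price")
pre <- preProcess(gps[, num_cols], method = "bagImpute")
gps[, num_cols] <- predict(pre, gps[, num_cols])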
The following is a summary of the Size feature after imputation:
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.00 6.00 16.00 22.64 29.00 100.00
Before we proceed to modeling, we need to do Exploratory Data Analysis. At this point we can analyze the correlations between features/variables and, of course, observe the target variable Rating that we will later predict. As you can see in the plot below, features like Reviews, Installs, and Price are not normally distributed.
Therefore, before fitting the data to a model later, we have to apply a logarithmic transformation to the Reviews, Installs, and Price columns.
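A sketch of that transformation; using log1p, i.e. log(1 + x), is an assumption here, chosen so that zero installs or zero prices do not produce -Inf:

# Log-transform the right-skewed columns (dplyr is already loaded above)
gps <- gps %>%
  mutate(
    Log.Reviews  = log1p(Reviews),
    Log.Installs = log1p(Installs),
    Log.Price    = log1p(Price)
  )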
We can then observe the target variable Rating. Because Rating is a continuous numeric value, its distribution is an important thing to examine.
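For example, with ggplot2 (the binwidth here is an arbitrary choice):

# Plot the distribution of the target variable Rating
library(ggplot2)

ggplot(gps, aes(x = Rating)) +
  geom_histogram(binwidth = 0.1, na.rm = TRUE) +
  labs(title = "Distribution of App Ratings", x = "Rating", y = "Count")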
The result we got looks different from the graphs derived from Statista; this may be because the data we used is not the same, or is incomplete.
Next, we try to find out how the Google Play Store market itself is distributed. This can be seen from the number of Google Play Store applications in each category and, of course, the average rating of the applications in those categories.
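A sketch of the per-category summary behind such a plot:

# Number of apps and average rating per category
category_summary <- gps %>%
  group_by(Category) %>%
  summarise(
    Total.Apps = n(),
    Avg.Rating = mean(Rating, na.rm = TRUE)
  ) %>%
  arrange(desc(Total.Apps))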
We can also observe the same thing for the Type feature. Let's take a look at the total split between free and paid applications.
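For instance:

# Split between free and paid apps, with their average ratings
gps %>%
  group_by(Type) %>%
  summarise(
    Total.Apps = n(),
    Avg.Rating = mean(Rating, na.rm = TRUE)
  )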
For this project, we took the Google Play Store dataset, then analyzed and processed the data. After the data was transformed into a usable set, we used plots and summary functions to understand the correlations between the features. We then used this knowledge to build the best model we could for predicting ratings from the cleaned dataset.
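The post does not show the final model code; purely as an illustrative baseline under the preprocessing above (the model family and the predictor set are assumptions, not the project's actual model):

# A hypothetical baseline: linear regression of Rating on the engineered
# features. lm() drops rows with remaining NA values by default.
model <- lm(Rating ~ Log.Reviews + Log.Installs + Log.Price + Size + Type + Category,
            data = gps)
summary(model)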