1. INTRODUCTION
2. EXPLORATORY DATA ANALYSIS
- 2.1. BASIC STATISTICS
- 2.2. MODELLING

1. INTRODUCTION

1.1. MOBILE APPLICATIONS

Mobile applications (or mobile apps) are computer programs created for use on mobile phones, tablets or smartwatches. Originally, they were designed for basic needs such as e-mail or calendars. As mobile devices developed, demand grew for new applications that make everyday life easier, such as mobile games, social networking apps, and navigation and location-based services. These applications can be free of charge or paid. Mobile applications can be downloaded from distribution platforms such as the App Store or Google Play Store.

1.2. DATASET

The dataset of Google Play Store apps was downloaded from the kaggle.com website.

Before cleaning, the dataset had almost 11 thousand observations (mobile applications) and 13 variables: Application name, Category, Rating, Reviews, Size, Installs, Type, Price, Content Rating, Genres, Last Updated, Current version and Android version. After rows with missing values were omitted, 9367 observations remained; after cleaning, the dataset had 9366 observations and 11 variables, because we dropped Category and Current version. The cleaning procedure is described in detail in the next section.

1.2.1. CLEANING THE DATA

# Some basic summary
str(mainTable)

## 'data.frame':    9367 obs. of  13 variables:
##  $ App           : Factor w/ 9660 levels "- Free Comics - Comic Apps",..: 7229 2563 8998 8113 7294 7125 8171 5589 4948 5826 ...
##  $ Category      : Factor w/ 34 levels "1.9","ART_AND_DESIGN",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Rating        : num  4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 ...
##  $ Reviews       : Factor w/ 6002 levels "0","1","10","100",..: 1183 5924 5681 1947 5924 1310 1464 3385 816 485 ...
##  $ Size          : Factor w/ 462 levels "1,000+","1.0M",..: 55 30 368 102 64 222 55 118 146 120 ...
##  $ Installs      : Factor w/ 22 levels "0","0+","1,000,000,000+",..: 8 20 13 16 11 17 17 4 4 8 ...
##  $ Type          : Factor w/ 4 levels "0","Free","NaN",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Price         : Factor w/ 93 levels "$0.99","$1.00",..: 92 92 92 92 92 92 92 92 92 92 ...
##  $ Content.Rating: Factor w/ 7 levels "","Adults only 18+",..: 3 3 3 6 3 3 3 3 3 3 ...
##  $ Genres        : Factor w/ 120 levels "Action","Action;Action & Adventure",..: 10 13 10 10 12 10 10 10 10 12 ...
##  $ Last.Updated  : Factor w/ 1378 levels "1.0.19","April 1, 2016",..: 562 482 117 825 757 901 76 726 1317 670 ...
##  $ Current.Ver   : Factor w/ 2834 levels "","0.0.0.2","0.0.1",..: 121 1020 466 2827 279 115 279 2393 1457 1431 ...
##  $ Android.Ver   : Factor w/ 35 levels "","1.0 and up",..: 17 17 17 20 22 10 17 20 12 17 ...
##  - attr(*, "na.action")= 'omit' Named int  24 114 124 127 130 131 135 164 181 186 ...
##   ..- attr(*, "names")= chr  "24" "114" "124" "127" ...

summary(mainTable)

##                                                 App      
##  ROBLOX                                           :   9  
##  CBS Sports App - Scores, News, Stats & Watch Live:   8  
##  8 Ball Pool                                      :   7  
##  Candy Crush Saga                                 :   7  
##  Duolingo: Learn Languages Free                   :   7  
##  ESPN                                             :   7  
##  (Other)                                          :9322  
##           Category        Rating          Reviews    
##  FAMILY       :1747   Min.   : 1.000   2      :  83  
##  GAME         :1097   1st Qu.: 4.000   3      :  78  
##  TOOLS        : 734   Median : 4.300   4      :  74  
##  PRODUCTIVITY : 351   Mean   : 4.193   5      :  74  
##  MEDICAL      : 350   3rd Qu.: 4.500   1      :  67  
##  COMMUNICATION: 328   Max.   :19.000   6      :  62  
##  (Other)      :4760                    (Other):8929  
##                  Size             Installs      Type          Price     
##  Varies with device:1637   1,000,000+ :1577   0   :   1   0      :8719  
##  14M               : 166   10,000,000+:1252   Free:8719   $2.99  : 114  
##  12M               : 161   100,000+   :1150   NaN :   0   $0.99  : 107  
##  11M               : 160   10,000+    :1010   Paid: 647   $4.99  :  70  
##  15M               : 159   5,000,000+ : 752               $1.99  :  59  
##  13M               : 157   1,000+     : 713               $3.99  :  58  
##  (Other)           :6927   (Other)    :2913               (Other): 240  
##          Content.Rating           Genres             Last.Updated 
##                 :   1   Tools        : 733   August 3, 2018: 319  
##  Adults only 18+:   3   Entertainment: 533   August 2, 2018: 284  
##  Everyone       :7420   Education    : 468   July 31, 2018 : 279  
##  Everyone 10+   : 397   Action       : 358   August 1, 2018: 275  
##  Mature 17+     : 461   Productivity : 351   July 30, 2018 : 199  
##  Teen           :1084   Medical      : 350   July 25, 2018 : 157  
##  Unrated        :   1   (Other)      :6574   (Other)       :7854  
##              Current.Ver               Android.Ver  
##  Varies with device:1415   4.1 and up        :2059  
##  1.0               : 458   Varies with device:1319  
##  1.1               : 195   4.0.3 and up      :1240  
##  1.2               : 126   4.0 and up        :1131  
##  1.3               : 120   4.4 and up        : 875  
##  2.0               : 119   2.3 and up        : 582  
##  (Other)           :6934   (Other)           :2161

# Renaming columns
columnsToRename = c('Reviews' = 'Reviews.Count', 'Current.Ver' = 'Current.Software.Version', 'Android.Ver' = 'Android.Version')
mainTable <- mainTable %>% plyr::rename(columnsToRename)

# FIXING DATA
# All - Setting "Varies with device" as NA
mainTable[mainTable == "Varies with device"] <- NA
# Rating - Fixing outliers (a rating cannot exceed 5)
mainTable$Rating[mainTable$Rating > 5] <- NA
# Size - Delete the trailing unit character (e.g. M for megabytes); values are not rescaled
mainTable[5] <- lapply(mainTable[5], as.character)
mainTable$Size <- substr(mainTable$Size, 1, nchar(mainTable$Size) - 1)
# Type - Fixing the 0
mainTable$Type[mainTable$Type == 0] <- "Free"
# Price - Delete dollars, fix outliers
mainTable$Price <- substring(mainTable$Price,2)
mainTable$Price[mainTable$Price == "" | mainTable$Price == "veryone"] <- "0"
# Content.Rating - Recoding empty and "Unrated" values as "Everyone"
mainTable$Content.Rating[mainTable$Content.Rating == '' | mainTable$Content.Rating == "Unrated"] <- "Everyone"
# Genres - Taking only the main genre 
mainTable$Genres <- gsub(";.*","",mainTable$Genres)
# Genres - Education and Educational are the same type of apps
mainTable$Genres[mainTable$Genres == "Educational"] <- "Education"
# Last.Updated - Dropping the malformed row and removing commas (conversion to Date is not working yet)
mainTable <- mainTable[!mainTable$Last.Updated == "1.0.19",]
mainTable$Last.Updated <- gsub(",","",mainTable$Last.Updated)
# placeholder <- as.Date(mainTable$Last.Updated, format = "%B %d %Y", optional = TRUE)
# Android.Version - Take only the main part, eg. 4.3
mainTable$Android.Version <- substr(mainTable$Android.Version, 1, 3)

######DATA TYPES######
# Convert to character
for (i in c(1, 4)){
  mainTable[i] <- lapply(mainTable[i], as.character)
}
# Convert to numeric 
for (i in c(4, 5, 8)){
  mainTable[i] <- lapply(mainTable[i], as.numeric)
}
# Convert to factors
for (i in c(10, 13)){
  mainTable[i] <- lapply(mainTable[i], as.factor)
}
remove(i)
# Drop all unused factor levels
mainTable <- droplevels(mainTable)
# Drop the Category and Current.Software.Version columns (columns 2 and 12)
mainTable <- mainTable[-c(2,12)]
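
Two of the cleaning steps above could be tightened. The sketch below is not part of the original analysis and uses only base R. The helpers `parse_size_mb` and `parse_updated` are hypothetical names. It assumes that the raw Size column also holds kilobyte values with a "k" suffix (the maximum Size of 994 in the later summary suggests so), and that the date conversion may have failed because `%B` in `as.Date()` matches month names in the current locale, which is only a guess about the cause.

```r
# Hypothetical helper: convert sizes such as "19M" or "512k" to megabytes.
# Assumes "k" means kilobytes and 1 MB = 1024 kB; anything else becomes NA.
parse_size_mb <- function(x) {
  x <- as.character(x)
  value <- suppressWarnings(as.numeric(sub("[kM]$", "", x)))
  unit <- ifelse(grepl("[kM]$", x), sub("^.*([kM])$", "\\1", x), NA_character_)
  ifelse(is.na(unit), NA_real_, ifelse(unit == "k", value / 1024, value))
}

# Hypothetical helper: parse dates such as "January 7, 2018" without relying
# on the locale, by matching the month against the built-in month.name vector.
parse_updated <- function(x) {
  parts <- strsplit(gsub(",", "", as.character(x)), " +")
  iso <- vapply(parts, function(p) {
    month <- match(p[1], month.name)
    if (length(p) != 3 || is.na(month)) return(NA_character_)
    sprintf("%s-%02d-%02d", p[3], month, as.integer(p[2]))
  }, character(1))
  as.Date(iso, format = "%Y-%m-%d")
}

sizes <- parse_size_mb(c("19M", "8.7M", "512k", "Varies with device", NA))
dates <- parse_updated(c("January 7, 2018", "1.0.19"))
print(sizes)
print(dates)
```

With these helpers the malformed "1.0.19" entry simply becomes NA instead of needing a separate filter, and kilobyte sizes end up on the same scale as megabyte sizes.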

1.3. RESEARCH OBJECTIVE

The aim of our research is to analyse a large dataset on customer behaviour in downloading free mobile applications and purchasing paid ones, to find patterns in it, and to identify the factors that affect the number of installs.

2. EXPLORATORY DATA ANALYSIS

2.1. BASIC STATISTICS

# Summary
str(mainTable)

## 'data.frame':    9366 obs. of  11 variables:
##  $ App            : chr  "Photo Editor & Candy Camera & Grid & ScrapBook" "Coloring book moana" "U Launcher Lite – FREE Live Cool Themes, Hide Apps" "Sketch - Draw & Paint" ...
##  $ Rating         : num  4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 ...
##  $ Reviews.Count  : num  159 967 87510 215644 967 ...
##  $ Size           : num  19 14 8.7 25 2.8 5.6 19 29 33 3.1 ...
##  $ Installs       : Factor w/ 19 levels "1,000,000,000+",..: 6 18 11 14 9 15 15 2 2 6 ...
##  $ Type           : Factor w/ 2 levels "Free","Paid": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Price          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Content.Rating : Factor w/ 5 levels "Adults only 18+",..: 2 2 2 5 2 2 2 2 2 2 ...
##  $ Genres         : Factor w/ 47 levels "Action","Adventure",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Last.Updated   : chr  "January 7 2018" "January 15 2018" "August 1 2018" "June 8 2018" ...
##  $ Android.Version: Factor w/ 22 levels "1.0","1.5","1.6",..: 11 11 11 13 15 7 11 13 8 11 ...

summary(mainTable)

##      App                Rating      Reviews.Count           Size       
##  Length:9366        Min.   :1.000   Min.   :       1   Min.   :  1.00  
##  Class :character   1st Qu.:4.000   1st Qu.:     186   1st Qu.:  6.10  
##  Mode  :character   Median :4.300   Median :    5930   Median : 16.00  
##                     Mean   :4.192   Mean   :  514050   Mean   : 37.28  
##                     3rd Qu.:4.500   3rd Qu.:   81533   3rd Qu.: 37.00  
##                     Max.   :5.000   Max.   :78158306   Max.   :994.00  
##                                                        NA's   :1637    
##         Installs      Type          Price                  Content.Rating
##  1,000,000+ :1577   Free:8719   Min.   :  0.0000   Adults only 18+:   3  
##  10,000,000+:1252   Paid: 647   1st Qu.:  0.0000   Everyone       :7421  
##  100,000+   :1150               Median :  0.0000   Everyone 10+   : 397  
##  10,000+    :1010               Mean   :  0.9609   Mature 17+     : 461  
##  5,000,000+ : 752               3rd Qu.:  0.0000   Teen           :1084  
##  1,000+     : 713               Max.   :400.0000                         
##  (Other)    :2912                                                        
##            Genres     Last.Updated       Android.Version
##  Tools        : 734   Length:9366        4.0    :2373   
##  Education    : 666   Class :character   4.1    :2060   
##  Entertainment: 577   Mode  :character   4.4    : 881   
##  Action       : 375                      2.3    : 822   
##  Productivity : 351                      5.0    : 538   
##  Medical      : 350                      (Other):1373   
##  (Other)      :6313                      NA's   :1319

# Factors summary
summary(mainTable[c(5,6,8,9,11)])

##         Installs      Type              Content.Rating
##  1,000,000+ :1577   Free:8719   Adults only 18+:   3  
##  10,000,000+:1252   Paid: 647   Everyone       :7421  
##  100,000+   :1150               Everyone 10+   : 397  
##  10,000+    :1010               Mature 17+     : 461  
##  5,000,000+ : 752               Teen           :1084  
##  1,000+     : 713                                     
##  (Other)    :2912                                     
##            Genres     Android.Version
##  Tools        : 734   4.0    :2373   
##  Education    : 666   4.1    :2060   
##  Entertainment: 577   4.4    : 881   
##  Action       : 375   2.3    : 822   
##  Productivity : 351   5.0    : 538   
##  Medical      : 350   (Other):1373   
##  (Other)      :6313   NA's   :1319

# Histograms
p1 <- ggplot(mainTable, aes(x = Rating)) + geom_histogram() 
p2 <- ggplot(mainTable, aes(x = Reviews.Count)) + geom_histogram()
p3 <- ggplot(mainTable, aes(x = Size)) + geom_histogram()
p4 <- ggplot(mainTable, aes(x = Price)) + geom_histogram()
grid.arrange(p1, p2, p3, p4, nrow = 2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 1637 rows containing non-finite values (stat_bin).

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Box plots 
p1 <- ggplot(mainTable, aes(y = Rating, 1)) + geom_boxplot() 
p2 <- ggplot(mainTable, aes(y = Reviews.Count, 1)) + geom_boxplot()
p3 <- ggplot(mainTable, aes(y = Size, 1)) + geom_boxplot()
p4 <- ggplot(mainTable, aes(y = Price, 1)) + geom_boxplot()
grid.arrange(p1, p2, p3, p4, nrow = 2)

## Warning: Removed 1637 rows containing non-finite values (stat_boxplot).

# Scatter plots
ggplot(mainTable, aes(y = Reviews.Count, x = seq(1,length(mainTable$Reviews.Count)))) + geom_point()

head(mainTable)

##                                                    App Rating
## 1       Photo Editor & Candy Camera & Grid & ScrapBook    4.1
## 2                                  Coloring book moana    3.9
## 3   U Launcher Lite – FREE Live Cool Themes, Hide Apps    4.7
## 4                                Sketch - Draw & Paint    4.5
## 5                Pixel Draw - Number Art Coloring Book    4.3
## 6                           Paper flowers instructions    4.4
##   Reviews.Count Size    Installs Type Price Content.Rating       Genres
## 1           159 19.0     10,000+ Free     0       Everyone Art & Design
## 2           967 14.0    500,000+ Free     0       Everyone Art & Design
## 3         87510  8.7  5,000,000+ Free     0       Everyone Art & Design
## 4        215644 25.0 50,000,000+ Free     0           Teen Art & Design
## 5           967  2.8    100,000+ Free     0       Everyone Art & Design
## 6           167  5.6     50,000+ Free     0       Everyone Art & Design
##      Last.Updated Android.Version
## 1  January 7 2018             4.0
## 2 January 15 2018             4.0
## 3   August 1 2018             4.0
## 4     June 8 2018             4.2
## 5    June 20 2018             4.4
## 6   March 26 2017             2.3

pie(table(mainTable$Genres), radius=1)

ggplot(mainTable, aes(fill=Type, y=1, x=Genres)) + 
    geom_bar( stat="identity", position="fill") + theme(axis.text.x = element_text(angle = 90, hjust = 1))

2.2. MODELLING

head(df_selected)

##   Rating    Installs Type       Genres
## 1    4.1     10,000+ Free Art & Design
## 2    3.9    500,000+ Free Art & Design
## 3    4.7  5,000,000+ Free Art & Design
## 4    4.5 50,000,000+ Free Art & Design
## 5    4.3    100,000+ Free Art & Design
## 6    4.4     50,000+ Free Art & Design

summary(df_selected)

##      Rating             Installs      Type                Genres    
##  Min.   :1.000   1,000,000+ :1577   Free:8719   Tools        : 734  
##  1st Qu.:4.000   10,000,000+:1252   Paid: 647   Education    : 666  
##  Median :4.300   100,000+   :1150               Entertainment: 577  
##  Mean   :4.192   10,000+    :1010               Action       : 375  
##  3rd Qu.:4.500   5,000,000+ : 752               Productivity : 351  
##  Max.   :5.000   1,000+     : 713               Medical      : 350  
##                  (Other)    :2912               (Other)      :6313

The distribution of current number of Installs

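# Strip the trailing "+" and the thousands separators, then convert to numbers
df_selected$Installs <- as.character(df_selected$Installs)
df_selected$Installs <- substr(df_selected$Installs,1,nchar(df_selected$Installs)-1)
df_selected$Installs<- as.numeric(gsub(",", "", df_selected$Installs))
# Converting from numeric gives factor levels in increasing numeric order
df_selected$Installs <- as.factor(df_selected$Installs)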
ggplot(df_selected,aes(Installs))+
  geom_bar()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))
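
The conversion through numeric values matters because the order of the factor levels defines the order of the categories in the model that follows. The base R sketch below uses a few made-up values in the same format as the Installs column to show the difference. The explicit `labels` argument is an addition for illustration, so that large numbers are not printed in scientific notation.

```r
# Illustrative values only, in the same format as the raw Installs column
raw <- c("10,000+", "500+", "1,000,000+", "50+", "5,000+")

# Levels of a factor built from the strings are sorted as text, not as numbers
text_levels <- levels(factor(raw))
print(text_levels)

# Remove "," and "+", convert to numbers, and build an ordered factor
n <- as.numeric(gsub("[,+]", "", raw))
lv <- sort(unique(n))
installs <- factor(n, levels = lv, labels = sprintf("%.0f", lv), ordered = TRUE)
print(levels(installs))
```

An ordered factor makes the intent explicit, although `polr()` also accepts a plain factor and uses its level order.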

Ordinal Logistic Regression model

Since the number of installs in this dataset is recorded as an ordered categorical variable (install brackets such as 1,000+ and 5,000+), ordinal logistic regression is a suitable model.

# polr() comes from the MASS package; the level order of Installs defines the ordinal scale
model_fit <- polr(Installs ~ Rating + Type + Genres, data = df_selected)
summary(model_fit)

## 
## Re-fitting to get Hessian

## Call:
## polr(formula = Installs ~ Rating + Type + Genres, data = df_selected)
## 
## Coefficients:
##                                  Value Std. Error  t value
## Rating                         0.62259    0.03668  16.9736
## TypePaid                      -2.03432    0.07278 -27.9498
## GenresAdventure               -0.52513    0.20695  -2.5375
## GenresArcade                   0.11714    0.15036   0.7791
## GenresArt & Design            -1.65863    0.21737  -7.6304
## GenresAuto & Vehicles         -1.83716    0.21111  -8.7024
## GenresBeauty                  -1.90267    0.26394  -7.2086
## GenresBoard                   -1.22331    0.23470  -5.2121
## GenresBooks & Reference       -1.64750    0.15884 -10.3719
## GenresBusiness                -1.92984    0.13875 -13.9085
## GenresCard                    -0.92821    0.25612  -3.6242
## GenresCasino                  -1.16408    0.28504  -4.0840
## GenresCasual                  -0.22335    0.14278  -1.5642
## GenresComics                  -1.61782    0.22954  -7.0481
## GenresCommunication           -0.03472    0.13940  -0.2491
## GenresDating                  -1.51536    0.15207  -9.9652
## GenresEducation               -1.84630    0.11303 -16.3342
## GenresEntertainment           -1.25418    0.11705 -10.7147
## GenresEvents                  -2.76011    0.28103  -9.8213
## GenresFinance                 -1.65130    0.13044 -12.6596
## GenresFood & Drink            -0.85167    0.17835  -4.7752
## GenresHealth & Fitness        -0.98743    0.13126  -7.5228
## GenresHouse & Home            -0.86986    0.20013  -4.3465
## GenresLibraries & Demo        -1.97983    0.22270  -8.8900
## GenresLifestyle               -1.76976    0.13440 -13.1680
## GenresMaps & Navigation       -1.19365    0.17991  -6.6347
## GenresMedical                 -2.25318    0.12892 -17.4778
## GenresMusic                   -0.40686    0.36220  -1.1233
## GenresMusic & Audio           -1.22405    1.42069  -0.8616
## GenresNews & Magazines        -1.23788    0.14683  -8.4308
## GenresParenting               -1.69198    0.24160  -7.0033
## GenresPersonalization         -1.30349    0.13597  -9.5867
## GenresPhotography              0.04502    0.13337   0.3376
## GenresProductivity            -0.64586    0.13436  -4.8070
## GenresPuzzle                  -0.35409    0.16668  -2.1244
## GenresRacing                   0.09373    0.18263   0.5132
## GenresRole Playing            -0.52749    0.17239  -3.0599
## GenresShopping                -0.31272    0.14322  -2.1836
## GenresSimulation              -0.84630    0.14499  -5.8371
## GenresSocial                  -0.51920    0.14683  -3.5361
## GenresSports                  -0.74403    0.12962  -5.7399
## GenresStrategy                -0.03122    0.18439  -0.1693
## GenresTools                   -1.24117    0.11172 -11.1099
## GenresTravel & Local          -0.67850    0.14705  -4.6139
## GenresTrivia                  -1.78539    0.32798  -5.4436
## GenresVideo Players & Editors -0.50855    0.16645  -3.0554
## GenresWeather                 -0.54554    0.20557  -2.6538
## GenresWord                    -0.30614    0.33508  -0.9136
## 
## Intercepts:
##                      Value    Std. Error t value 
## 1|5                   -6.9366   0.6094   -11.3823
## 5|10                  -5.7091   0.3385   -16.8648
## 10|50                 -3.8108   0.2076   -18.3551
## 50|100                -3.2700   0.1952   -16.7537
## 100|500               -2.0129   0.1823   -11.0412
## 500|1000              -1.5981   0.1805    -8.8532
## 1000|5000             -0.6864   0.1788    -3.8393
## 5000|10000            -0.3075   0.1786    -1.7218
## 10000|50000            0.3859   0.1788     2.1588
## 50000|100000           0.6591   0.1790     3.6830
## 100000|500000          1.2788   0.1796     7.1189
## 500000|1000000         1.5576   0.1800     8.6536
## 1000000|5000000        2.4074   0.1810    13.3017
## 5000000|10000000       2.8865   0.1816    15.8953
## 10000000|50000000      4.0732   0.1840    22.1410
## 50000000|100000000     4.5642   0.1859    24.5583
## 100000000|500000000    6.0698   0.2012    30.1712
## 500000000|1000000000   6.8903   0.2237    30.8011
## 
## Residual Deviance: 44546.32 
## AIC: 44678.32

summary_table <- coef(summary(model_fit))

## 
## Re-fitting to get Hessian

pval <- pnorm(abs(summary_table[, "t value"]),lower.tail = FALSE)* 2
summary_table <- cbind(summary_table, "p value" = round(pval,3))
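
The two-sided p value can be checked by hand for a single coefficient. The sketch below takes the t value of GenresAdventure from the coefficient table above and, like the code above, treats it as approximately standard normal, which is an approximation that relies on the large sample size.

```r
# t value of GenresAdventure, copied from the coefficient table above
t_value <- -2.5375

# Two-sided p value under the normal approximation
p_value <- 2 * pnorm(abs(t_value), lower.tail = FALSE)
print(round(p_value, 3))
```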

We calculate a two-sided p value for each term from its t value and keep only the terms with p value <= 0.05, i.e. those with a statistically significant effect in the model.

# as_tibble() replaces the deprecated as_data_frame(); row names are kept in the 'id' column
summary_table_filtered <- as_tibble(summary_table, rownames = 'id')
summary_table_filtered <- summary_table_filtered %>%
                            filter(`p value` <= 0.05)
print.data.frame(summary_table_filtered)

##                               id      Value Std. Error    t value p value
## 1                         Rating  0.6225871 0.03667970  16.973614   0.000
## 2                       TypePaid -2.0343163 0.07278475 -27.949762   0.000
## 3                GenresAdventure -0.5251321 0.20695255  -2.537452   0.011
## 4             GenresArt & Design -1.6586254 0.21737038  -7.630411   0.000
## 5          GenresAuto & Vehicles -1.8371625 0.21110924  -8.702426   0.000
## 6                   GenresBeauty -1.9026701 0.26394414  -7.208609   0.000
## 7                    GenresBoard -1.2233055 0.23470351  -5.212131   0.000
## 8        GenresBooks & Reference -1.6475037 0.15884238 -10.371941   0.000
## 9                 GenresBusiness -1.9298429 0.13875288 -13.908488   0.000
## 10                    GenresCard -0.9282061 0.25611597  -3.624163   0.000
## 11                  GenresCasino -1.1640815 0.28503661  -4.083972   0.000
## 12                  GenresComics -1.6178169 0.22953987  -7.048086   0.000
## 13                  GenresDating -1.5153634 0.15206543  -9.965207   0.000
## 14               GenresEducation -1.8462959 0.11303237 -16.334223   0.000
## 15           GenresEntertainment -1.2541776 0.11705219 -10.714687   0.000
## 16                  GenresEvents -2.7601094 0.28103168  -9.821346   0.000
## 17                 GenresFinance -1.6512960 0.13043777 -12.659646   0.000
## 18            GenresFood & Drink -0.8516658 0.17835047  -4.775237   0.000
## 19        GenresHealth & Fitness -0.9874314 0.13125765  -7.522848   0.000
## 20            GenresHouse & Home -0.8698603 0.20012751  -4.346531   0.000
## 21        GenresLibraries & Demo -1.9798288 0.22270256  -8.890013   0.000
## 22               GenresLifestyle -1.7697624 0.13439915 -13.167959   0.000
## 23       GenresMaps & Navigation -1.1936537 0.17990954  -6.634744   0.000
## 24                 GenresMedical -2.2531791 0.12891667 -17.477795   0.000
## 25        GenresNews & Magazines -1.2378785 0.14682848  -8.430779   0.000
## 26               GenresParenting -1.6919771 0.24159587  -7.003336   0.000
## 27         GenresPersonalization -1.3034913 0.13596937  -9.586654   0.000
## 28            GenresProductivity -0.6458617 0.13435954  -4.806966   0.000
## 29                  GenresPuzzle -0.3540949 0.16668063  -2.124391   0.034
## 30            GenresRole Playing -0.5274912 0.17238778  -3.059911   0.002
## 31                GenresShopping -0.3127235 0.14321506  -2.183594   0.029
## 32              GenresSimulation -0.8462998 0.14498544  -5.837136   0.000
## 33                  GenresSocial -0.5191964 0.14682936  -3.536053   0.000
## 34                  GenresSports -0.7440328 0.12962455  -5.739906   0.000
## 35                   GenresTools -1.2411722 0.11171773 -11.109894   0.000
## 36          GenresTravel & Local -0.6784979 0.14705420  -4.613930   0.000
## 37                  GenresTrivia -1.7853906 0.32797952  -5.443604   0.000
## 38 GenresVideo Players & Editors -0.5085516 0.16644510  -3.055372   0.002
## 39                 GenresWeather -0.5455423 0.20557173  -2.653780   0.008
## 40                           1|5 -6.9365660 0.60941915 -11.382258   0.000
## 41                          5|10 -5.7090990 0.33852096 -16.864832   0.000
## 42                         10|50 -3.8108211 0.20761617 -18.355127   0.000
## 43                        50|100 -3.2700147 0.19518158 -16.753705   0.000
## 44                       100|500 -2.0128767 0.18230586 -11.041207   0.000
## 45                      500|1000 -1.5981047 0.18051234  -8.853161   0.000
## 46                     1000|5000 -0.6864149 0.17878706  -3.839288   0.000
## 47                   10000|50000  0.3859030 0.17876132   2.158761   0.031
## 48                  50000|100000  0.6591469 0.17897046   3.682993   0.000
## 49                 100000|500000  1.2788362 0.17963900   7.118923   0.000
## 50                500000|1000000  1.5575692 0.17999176   8.653558   0.000
## 51               1000000|5000000  2.4073653 0.18098181  13.301698   0.000
## 52              5000000|10000000  2.8864919 0.18159372  15.895329   0.000
## 53             10000000|50000000  4.0732434 0.18396854  22.140978   0.000
## 54            50000000|100000000  4.5641971 0.18585117  24.558345   0.000
## 55           100000000|500000000  6.0698011 0.20117845  30.171229   0.000
## 56          500000000|1000000000  6.8902784 0.22370249  30.801080   0.000

Explaining the model (created and improved by Duy Tuan)

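The proportional odds model has the following mathematical formulation (written in the parameterization used by polr, with thresholds \zeta_j and coefficients \beta_i):

$$\operatorname{logit} P(Y \le j) = \log \frac{P(Y \le j)}{P(Y > j)} = \zeta_j - \sum_{i=1}^{M} \beta_i x_i, \qquad j = 1, \dots, J$$

Here ‘J’ is the number of thresholds between the ordered levels of Installs (19 levels, so J=18) and ‘M’ is the total number of independent variables (M=3).

‘j’ indexes the thresholds of Installs, while ‘i’ indexes the independent variables: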

i =1 refers to Rating
i = 2 refers to Type
i = 3 refers to Genres
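
To make the formulation concrete, the sketch below recomputes the category probabilities for a free Education app with a rating of 4 from the rounded estimates printed in the model summary above. It assumes the parameterization documented for `polr()`, logit P(Y <= j) = zeta_j - eta, where eta is the sum of the coefficients times the predictor values. The variable names are illustrative only.

```r
# Estimates copied (rounded) from summary(model_fit) above
beta_rating    <- 0.62259
beta_education <- -1.84630   # GenresEducation; the baseline genre has no coefficient
zeta <- c(-6.9366, -5.7091, -3.8108, -3.2700, -2.0129, -1.5981, -0.6864,
          -0.3075, 0.3859, 0.6591, 1.2788, 1.5576, 2.4074, 2.8865,
          4.0732, 4.5642, 6.0698, 6.8903)
brackets <- c(1, 5, 10, 50, 100, 500, 1e3, 5e3, 1e4, 5e4, 1e5, 5e5,
              1e6, 5e6, 1e7, 5e7, 1e8, 5e8, 1e9)

# Linear predictor for Rating = 4, Type = Free (TypePaid term is 0), Genres = Education
eta <- beta_rating * 4 + beta_education

# Cumulative probabilities P(Y <= j); the last category closes at 1
cumulative <- c(plogis(zeta - eta), 1)

# Category probabilities are differences of consecutive cumulative probabilities
p <- diff(c(0, cumulative))
names(p) <- sprintf("%.0f", brackets)
print(round(p, 3))
```

Up to rounding of the inputs, the values should agree with what `predict(model_fit, type = "p")` returns for the same app in the examples that follow.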

Interpretation:

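Comments on coefficients: among the significant terms, only the rating of the app has a positive effect on the number of installs. Being a paid app, or belonging to one of the genres listed in the filtered table above, has a negative effect on the number of installs. Genre effects are measured relative to the baseline genre, Action.

Comments on intercepts: taking 1|5 as an example, the intercept is the log odds that an app falls in the lowest install bracket (1) rather than in any higher bracket (5 or more) when the linear predictor is zero.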

Examples of the model

Let’s predict the install probabilities for apps that a developer might create with the following characteristics. The cost of app development is based on this report

# MASS also defines select(), so make sure the dplyr version is used
select <- dplyr::select
df_price <- df_load %>% select(Type, Price) %>% drop_na(.) %>% filter(Type == 'Paid' & Price <=50)
# Histogram of paid app prices
ggplot(df_price, aes(x= Price)) +
  geom_histogram(color ='blue', fill = 'white') +
  theme_light()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Prices up to 50 (the majority of paid apps lie in this range)

Build a function to calculate the break-even ad rate / price for each app

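# result: named vector of predicted probabilities (names are the install brackets)
# cost:   development cost to be recovered
costCalculator <- function(result, cost){
  # Turn the named vector into a two-column data frame (rownames_to_column() is from tibble)
  result_1 <- as.data.frame(result)
  result_1 <- result_1 %>% rownames_to_column()
  colnames(result_1) <- c('Installs','Probabbility')
  result_1$Installs <- as.numeric(result_1$Installs)
  # Contribution of each install bracket to the expected number of installs
  result_1$Expected_value <- result_1$Installs * result_1$Probabbility
  print(result_1)
  total_install <- sum(result_1$Expected_value)
  # Break-even revenue per install: cost divided by the expected number of installs
  average_cost = cost/total_install
  cat('The total sum of expected value in total Installs is',total_install,'and the ideal ad rate/ price should be minimum at', average_cost)
}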

Suppose one app is a free, high-quality educational app (rating 4) that costs $150,000 to develop

new_app <- data.frame('Rating'=4,'Type'='Free','Genres'='Education')
result1<-round(predict(model_fit,new_app,type = "p"), 3)
result1

##          1          5         10         50        100        500 
##      0.001      0.001      0.010      0.008      0.046      0.030 
##       1000       5000      10000      50000     100000     500000 
##      0.113      0.069      0.157      0.068      0.150      0.060 
##    1000000    5000000   10000000   50000000  100000000  500000000 
##      0.140      0.050      0.065      0.012      0.015      0.002 
## 1000000000 
##      0.002

The most likely outcome for this app (15.7%) is the 10,000 installs bracket. The break-even ad rate is calculated as follows

costCalculator(result=result1,cost= 150000)

##      Installs Probabbility Expected_value
## 1           1        0.001          0.001
## 2           5        0.001          0.005
## 3          10        0.010          0.100
## 4          50        0.008          0.400
## 5         100        0.046          4.600
## 6         500        0.030         15.000
## 7        1000        0.113        113.000
## 8        5000        0.069        345.000
## 9       10000        0.157       1570.000
## 10      50000        0.068       3400.000
## 11     100000        0.150      15000.000
## 12     500000        0.060      30000.000
## 13    1000000        0.140     140000.000
## 14    5000000        0.050     250000.000
## 15   10000000        0.065     650000.000
## 16   50000000        0.012     600000.000
## 17  100000000        0.015    1500000.000
## 18  500000000        0.002    1000000.000
## 19 1000000000        0.002    2000000.000
## The total sum of expected value in total Installs is 6190448 and the ideal ad rate/ price should be minimum at 0.02423088
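
The arithmetic behind this result can be reproduced in a few lines of base R. The sketch below copies the brackets and rounded probabilities from the output above. The last two quantities are additions for illustration, not part of the original analysis: they show how much of the expected value comes from the three largest brackets and which bracket the cumulative probability first reaches 50% at.

```r
# Install brackets and rounded probabilities copied from result1 above
installs <- c(1, 5, 10, 50, 100, 500, 1e3, 5e3, 1e4, 5e4, 1e5, 5e5,
              1e6, 5e6, 1e7, 5e7, 1e8, 5e8, 1e9)
prob <- c(0.001, 0.001, 0.010, 0.008, 0.046, 0.030, 0.113, 0.069, 0.157,
          0.068, 0.150, 0.060, 0.140, 0.050, 0.065, 0.012, 0.015, 0.002, 0.002)

# Expected number of installs: sum of bracket value times probability
expected_installs <- sum(installs * prob)

# Break-even revenue per install for a development cost of 150000
break_even <- 150000 / expected_installs

# Share of the expected value contributed by the three largest brackets
top_share <- sum((installs * prob)[17:19]) / expected_installs

# Lowest bracket at which the cumulative probability reaches 50%
median_bracket <- installs[which(cumsum(prob) >= 0.5)[1]]

print(c(expected_installs = expected_installs, break_even = break_even,
        top_share = top_share, median_bracket = median_bracket))
```

Because the three largest brackets carry only about 2% of the probability but most of the expected value, the break-even figure is sensitive to the small probabilities in the tail and is best read as a rough lower bound.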

Another app is a paid racing game of average quality (rating 3.5) that costs $200,000 to develop

new_app_2 <- data.frame('Rating'=3.5,'Type'='Paid','Genres'='Racing')
result2<-round(predict(model_fit,new_app_2,type = "p"), 3)
result2

##          1          5         10         50        100        500 
##      0.001      0.002      0.015      0.012      0.066      0.042 
##       1000       5000      10000      50000     100000     500000 
##      0.147      0.083      0.170      0.067      0.135      0.050 
##    1000000    5000000   10000000   50000000  100000000  500000000 
##      0.108      0.036      0.045      0.008      0.010      0.002 
## 1000000000 
##      0.001

Comments: the most likely outcome for this app (17%) is the 10,000 installs bracket

costCalculator(result=result2,cost= 200000)

##      Installs Probabbility Expected_value
## 1           1        0.001          0.001
## 2           5        0.002          0.010
## 3          10        0.015          0.150
## 4          50        0.012          0.600
## 5         100        0.066          6.600
## 6         500        0.042         21.000
## 7        1000        0.147        147.000
## 8        5000        0.083        415.000
## 9       10000        0.170       1700.000
## 10      50000        0.067       3350.000
## 11     100000        0.135      13500.000
## 12     500000        0.050      25000.000
## 13    1000000        0.108     108000.000
## 14    5000000        0.036     180000.000
## 15   10000000        0.045     450000.000
## 16   50000000        0.008     400000.000
## 17  100000000        0.010    1000000.000
## 18  500000000        0.002    1000000.000
## 19 1000000000        0.001    1000000.000
## The total sum of expected value in total Installs is 4182140 and the ideal ad rate/ price should be minimum at 0.0478224

Based on the graph of the price distribution, and using ‘herd behavior’ as a guideline, the developer should set the price at 0.99

Analysis of purchases of mobile applications

Michał Szałański , Duy Tuan Doan, Aleksandra Bednarczuk