Mobile applications (or mobile apps) are software applications designed to run on mobile phones, tablets or smartwatches. Originally, they covered basic needs such as e-mail and calendars. As mobile devices developed, demand grew for new applications that make everyday life easier, such as mobile games, social networking apps, and navigation and location-based services. These applications can be free or paid. They can be downloaded from distribution platforms such as the App Store or Google Play Store.
The dataset on Google Play Store apps was downloaded from the kaggle.com website.
Before cleaning, the dataset had almost 11 thousand observations (mobile applications) and 13 variables: Application name, Category, Rating, Reviews, Size, Installs, Type, Price, Content Rating, Genres, Last Updated, Current version and Android version. After cleaning, 9366 observations and 11 variables remained; we dropped Category and Current version. The cleaning procedure is described in the next section.
# Some basic summary
str(mainTable)
## 'data.frame': 9367 obs. of 13 variables:
## $ App : Factor w/ 9660 levels "- Free Comics - Comic Apps",..: 7229 2563 8998 8113 7294 7125 8171 5589 4948 5826 ...
## $ Category : Factor w/ 34 levels "1.9","ART_AND_DESIGN",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Rating : num 4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 ...
## $ Reviews : Factor w/ 6002 levels "0","1","10","100",..: 1183 5924 5681 1947 5924 1310 1464 3385 816 485 ...
## $ Size : Factor w/ 462 levels "1,000+","1.0M",..: 55 30 368 102 64 222 55 118 146 120 ...
## $ Installs : Factor w/ 22 levels "0","0+","1,000,000,000+",..: 8 20 13 16 11 17 17 4 4 8 ...
## $ Type : Factor w/ 4 levels "0","Free","NaN",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Price : Factor w/ 93 levels "$0.99","$1.00",..: 92 92 92 92 92 92 92 92 92 92 ...
## $ Content.Rating: Factor w/ 7 levels "","Adults only 18+",..: 3 3 3 6 3 3 3 3 3 3 ...
## $ Genres : Factor w/ 120 levels "Action","Action;Action & Adventure",..: 10 13 10 10 12 10 10 10 10 12 ...
## $ Last.Updated : Factor w/ 1378 levels "1.0.19","April 1, 2016",..: 562 482 117 825 757 901 76 726 1317 670 ...
## $ Current.Ver : Factor w/ 2834 levels "","0.0.0.2","0.0.1",..: 121 1020 466 2827 279 115 279 2393 1457 1431 ...
## $ Android.Ver : Factor w/ 35 levels "","1.0 and up",..: 17 17 17 20 22 10 17 20 12 17 ...
## - attr(*, "na.action")= 'omit' Named int 24 114 124 127 130 131 135 164 181 186 ...
## ..- attr(*, "names")= chr "24" "114" "124" "127" ...
summary(mainTable)
## App
## ROBLOX : 9
## CBS Sports App - Scores, News, Stats & Watch Live: 8
## 8 Ball Pool : 7
## Candy Crush Saga : 7
## Duolingo: Learn Languages Free : 7
## ESPN : 7
## (Other) :9322
## Category Rating Reviews
## FAMILY :1747 Min. : 1.000 2 : 83
## GAME :1097 1st Qu.: 4.000 3 : 78
## TOOLS : 734 Median : 4.300 4 : 74
## PRODUCTIVITY : 351 Mean : 4.193 5 : 74
## MEDICAL : 350 3rd Qu.: 4.500 1 : 67
## COMMUNICATION: 328 Max. :19.000 6 : 62
## (Other) :4760 (Other):8929
## Size Installs Type Price
## Varies with device:1637 1,000,000+ :1577 0 : 1 0 :8719
## 14M : 166 10,000,000+:1252 Free:8719 $2.99 : 114
## 12M : 161 100,000+ :1150 NaN : 0 $0.99 : 107
## 11M : 160 10,000+ :1010 Paid: 647 $4.99 : 70
## 15M : 159 5,000,000+ : 752 $1.99 : 59
## 13M : 157 1,000+ : 713 $3.99 : 58
## (Other) :6927 (Other) :2913 (Other): 240
## Content.Rating Genres Last.Updated
## : 1 Tools : 733 August 3, 2018: 319
## Adults only 18+: 3 Entertainment: 533 August 2, 2018: 284
## Everyone :7420 Education : 468 July 31, 2018 : 279
## Everyone 10+ : 397 Action : 358 August 1, 2018: 275
## Mature 17+ : 461 Productivity : 351 July 30, 2018 : 199
## Teen :1084 Medical : 350 July 25, 2018 : 157
## Unrated : 1 (Other) :6574 (Other) :7854
## Current.Ver Android.Ver
## Varies with device:1415 4.1 and up :2059
## 1.0 : 458 Varies with device:1319
## 1.1 : 195 4.0.3 and up :1240
## 1.2 : 126 4.0 and up :1131
## 1.3 : 120 4.4 and up : 875
## 2.0 : 119 2.3 and up : 582
## (Other) :6934 (Other) :2161
# Renaming columns
columnsToRename = c('Reviews' = 'Reviews.Count', 'Current.Ver' = 'Current.Software.Version', 'Android.Ver' = 'Android.Version')
mainTable <- mainTable %>% plyr::rename(columnsToRename)
# FIXING DATA
# All - Setting "Varies with device" as NA
mainTable[mainTable == "Varies with device"] <- NA
# Rating - Fixing outliers (ratings above 5 are invalid)
mainTable$Rating[mainTable$Rating > 5] <- NA
# Size - Drop the trailing unit letter (M for megabytes; k values are not rescaled)
mainTable$Size <- as.character(mainTable$Size)
mainTable$Size <- substr(mainTable$Size, 1, nchar(mainTable$Size) - 1)
# Type - Fixing the 0
mainTable$Type[mainTable$Type == 0] <- "Free"
# Price - Delete dollars, fix outliers
mainTable$Price <- substring(mainTable$Price,2)
mainTable$Price[mainTable$Price == "" | mainTable$Price == "veryone"] <- "0"
# Content.Rating - Recoding empty and "Unrated" values as "Everyone"
mainTable$Content.Rating[mainTable$Content.Rating == '' | mainTable$Content.Rating == "Unrated"] <- "Everyone"
# Genres - Taking only the main genre
mainTable$Genres <- gsub(";.*","",mainTable$Genres)
# Genres - Education and Educational are the same type of apps
mainTable$Genres[mainTable$Genres == "Educational"] <- "Education"
# Last.Updated - Drop the malformed row and strip commas (conversion to Date is not working yet)
mainTable <- mainTable[mainTable$Last.Updated != "1.0.19",]
mainTable$Last.Updated <- gsub(",","",mainTable$Last.Updated)
# placeholder <- as.Date(mainTable$Last.Updated, format = "%B %d %Y", optional = TRUE)
# Android.Version - Take only the main part, eg. 4.3
mainTable$Android.Version <- substr(mainTable$Android.Version, 1, 3)
###### DATA TYPES ######
# Convert App and Reviews.Count to character
mainTable[c(1, 4)] <- lapply(mainTable[c(1, 4)], as.character)
# Convert Reviews.Count, Size and Price to numeric
mainTable[c(4, 5, 8)] <- lapply(mainTable[c(4, 5, 8)], as.numeric)
# Convert Genres and Android.Version to factors
mainTable[c(10, 13)] <- lapply(mainTable[c(10, 13)], as.factor)
# Drop all unused factor levels
mainTable <- droplevels(mainTable)
# Drop the Category and Current.Software.Version columns
mainTable <- mainTable[-c(2, 12)]
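The date conversion left commented out in the cleaning code could be completed along the following lines. This is a minimal sketch, assuming the month names are in English and the commas have already been removed. The sample vector `last_updated` is made up for illustration, and the `LC_TIME` locale is temporarily switched to `"C"` so that `%B` matches English month names regardless of the system language.

```r
# Hypothetical sample in the same shape as Last.Updated after comma removal
last_updated <- c("January 7 2018", "August 1 2018", "not a date")

# %B matches full month names in the current locale; "C" gives English names
old_locale <- Sys.getlocale("LC_TIME")
Sys.setlocale("LC_TIME", "C")

# Entries that do not match the format become NA instead of raising an error
parsed <- as.Date(last_updated, format = "%B %d %Y")

# Restore the previous locale
Sys.setlocale("LC_TIME", old_locale)

print(parsed)
```

With an explicit `format`, values that cannot be parsed are returned as `NA`, so malformed entries can be inspected afterwards with `is.na(parsed)`.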
The aim of our research is to analyse this large dataset on customer behaviour in downloading free mobile applications and purchasing paid ones, to find interesting patterns, and to detect the factors affecting the number of installs.
# Summary
str(mainTable)
## 'data.frame': 9366 obs. of 11 variables:
## $ App : chr "Photo Editor & Candy Camera & Grid & ScrapBook" "Coloring book moana" "U Launcher Lite – FREE Live Cool Themes, Hide Apps" "Sketch - Draw & Paint" ...
## $ Rating : num 4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 ...
## $ Reviews.Count : num 159 967 87510 215644 967 ...
## $ Size : num 19 14 8.7 25 2.8 5.6 19 29 33 3.1 ...
## $ Installs : Factor w/ 19 levels "1,000,000,000+",..: 6 18 11 14 9 15 15 2 2 6 ...
## $ Type : Factor w/ 2 levels "Free","Paid": 1 1 1 1 1 1 1 1 1 1 ...
## $ Price : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Content.Rating : Factor w/ 5 levels "Adults only 18+",..: 2 2 2 5 2 2 2 2 2 2 ...
## $ Genres : Factor w/ 47 levels "Action","Adventure",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Last.Updated : chr "January 7 2018" "January 15 2018" "August 1 2018" "June 8 2018" ...
## $ Android.Version: Factor w/ 22 levels "1.0","1.5","1.6",..: 11 11 11 13 15 7 11 13 8 11 ...
summary(mainTable)
## App Rating Reviews.Count Size
## Length:9366 Min. :1.000 Min. : 1 Min. : 1.00
## Class :character 1st Qu.:4.000 1st Qu.: 186 1st Qu.: 6.10
## Mode :character Median :4.300 Median : 5930 Median : 16.00
## Mean :4.192 Mean : 514050 Mean : 37.28
## 3rd Qu.:4.500 3rd Qu.: 81533 3rd Qu.: 37.00
## Max. :5.000 Max. :78158306 Max. :994.00
## NA's :1637
## Installs Type Price Content.Rating
## 1,000,000+ :1577 Free:8719 Min. : 0.0000 Adults only 18+: 3
## 10,000,000+:1252 Paid: 647 1st Qu.: 0.0000 Everyone :7421
## 100,000+ :1150 Median : 0.0000 Everyone 10+ : 397
## 10,000+ :1010 Mean : 0.9609 Mature 17+ : 461
## 5,000,000+ : 752 3rd Qu.: 0.0000 Teen :1084
## 1,000+ : 713 Max. :400.0000
## (Other) :2912
## Genres Last.Updated Android.Version
## Tools : 734 Length:9366 4.0 :2373
## Education : 666 Class :character 4.1 :2060
## Entertainment: 577 Mode :character 4.4 : 881
## Action : 375 2.3 : 822
## Productivity : 351 5.0 : 538
## Medical : 350 (Other):1373
## (Other) :6313 NA's :1319
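The maximum Size of 994 in the summary above most likely comes from sizes given in kilobytes (for example "994k"), because the cleaning step strips the last character without looking at the unit. A unit-aware conversion could look like the sketch below. The function name `size_to_mb` and the sample values are hypothetical, and the sketch assumes that every valid size ends in either `M` or `k` and that kilobytes are binary (1024 per megabyte).

```r
# Convert size strings such as "19M" or "512k" to megabytes (hypothetical helper)
size_to_mb <- function(size) {
  size <- as.character(size)
  # Last character is the unit: "M" or "k"
  unit <- substr(size, nchar(size), nchar(size))
  # Everything before the unit is the numeric part; non-numeric text becomes NA
  value <- suppressWarnings(as.numeric(substr(size, 1, nchar(size) - 1)))
  # Assumption: 1 MB = 1024 kB; anything that is not M or k becomes NA
  ifelse(unit == "M", value, ifelse(unit == "k", value / 1024, NA_real_))
}

sizes_mb <- size_to_mb(c("19M", "8.7M", "512k", "Varies with device", NA))
print(sizes_mb)
```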
# Factors summary
summary(mainTable[c(5,6,8,9,11)])
## Installs Type Content.Rating
## 1,000,000+ :1577 Free:8719 Adults only 18+: 3
## 10,000,000+:1252 Paid: 647 Everyone :7421
## 100,000+ :1150 Everyone 10+ : 397
## 10,000+ :1010 Mature 17+ : 461
## 5,000,000+ : 752 Teen :1084
## 1,000+ : 713
## (Other) :2912
## Genres Android.Version
## Tools : 734 4.0 :2373
## Education : 666 4.1 :2060
## Entertainment: 577 4.4 : 881
## Action : 375 2.3 : 822
## Productivity : 351 5.0 : 538
## Medical : 350 (Other):1373
## (Other) :6313 NA's :1319
# Histograms
p1 <- ggplot(mainTable, aes(x = Rating)) + geom_histogram()
p2 <- ggplot(mainTable, aes(x = Reviews.Count)) + geom_histogram()
p3 <- ggplot(mainTable, aes(x = Size)) + geom_histogram()
p4 <- ggplot(mainTable, aes(x = Price)) + geom_histogram()
grid.arrange(p1, p2, p3, p4, nrow = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1637 rows containing non-finite values (stat_bin).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Box plots
p1 <- ggplot(mainTable, aes(y = Rating, 1)) + geom_boxplot()
p2 <- ggplot(mainTable, aes(y = Reviews.Count, 1)) + geom_boxplot()
p3 <- ggplot(mainTable, aes(y = Size, 1)) + geom_boxplot()
p4 <- ggplot(mainTable, aes(y = Price, 1)) + geom_boxplot()
grid.arrange(p1, p2, p3, p4, nrow = 2)
## Warning: Removed 1637 rows containing non-finite values (stat_boxplot).
# Scatter plots
ggplot(mainTable, aes(y = Reviews.Count, x = seq_along(Reviews.Count))) + geom_point()
head(mainTable)
## App Rating
## 1 Photo Editor & Candy Camera & Grid & ScrapBook 4.1
## 2 Coloring book moana 3.9
## 3 U Launcher Lite – FREE Live Cool Themes, Hide Apps 4.7
## 4 Sketch - Draw & Paint 4.5
## 5 Pixel Draw - Number Art Coloring Book 4.3
## 6 Paper flowers instructions 4.4
## Reviews.Count Size Installs Type Price Content.Rating Genres
## 1 159 19.0 10,000+ Free 0 Everyone Art & Design
## 2 967 14.0 500,000+ Free 0 Everyone Art & Design
## 3 87510 8.7 5,000,000+ Free 0 Everyone Art & Design
## 4 215644 25.0 50,000,000+ Free 0 Teen Art & Design
## 5 967 2.8 100,000+ Free 0 Everyone Art & Design
## 6 167 5.6 50,000+ Free 0 Everyone Art & Design
## Last.Updated Android.Version
## 1 January 7 2018 4.0
## 2 January 15 2018 4.0
## 3 August 1 2018 4.0
## 4 June 8 2018 4.2
## 5 June 20 2018 4.4
## 6 March 26 2017 2.3
pie(table(mainTable$Genres), radius=1)
ggplot(mainTable, aes(fill=Type, y=1, x=Genres)) +
geom_bar( stat="identity", position="fill") + theme(axis.text.x = element_text(angle = 90, hjust = 1))
head(df_selected)
## Rating Installs Type Genres
## 1 4.1 10,000+ Free Art & Design
## 2 3.9 500,000+ Free Art & Design
## 3 4.7 5,000,000+ Free Art & Design
## 4 4.5 50,000,000+ Free Art & Design
## 5 4.3 100,000+ Free Art & Design
## 6 4.4 50,000+ Free Art & Design
summary(df_selected)
## Rating Installs Type Genres
## Min. :1.000 1,000,000+ :1577 Free:8719 Tools : 734
## 1st Qu.:4.000 10,000,000+:1252 Paid: 647 Education : 666
## Median :4.300 100,000+ :1150 Entertainment: 577
## Mean :4.192 10,000+ :1010 Action : 375
## 3rd Qu.:4.500 5,000,000+ : 752 Productivity : 351
## Max. :5.000 1,000+ : 713 Medical : 350
## (Other) :2912 (Other) :6313
The distribution of the current number of Installs
df_selected$Installs <- as.character(df_selected$Installs)
df_selected$Installs <- substr(df_selected$Installs,1,nchar(df_selected$Installs)-1)
df_selected$Installs<- as.numeric(gsub(",", "", df_selected$Installs))
df_selected$Installs <- as.factor(df_selected$Installs)
ggplot(df_selected,aes(Installs))+
geom_bar()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
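The install labels have a natural order, which can be made explicit with an ordered factor rather than relying on the default level sorting. The following is a minimal sketch on made-up labels; `installs_raw` is a hypothetical stand-in for the Installs column, and `format(..., scientific = FALSE)` is used so that large values are not labelled in scientific notation such as `1e+06`.

```r
# Hypothetical install labels in the format used by the dataset
installs_raw <- c("10,000+", "500+", "1,000,000+", "500+", "50+")

# Strip commas and the trailing plus sign, then convert to numbers
installs_num <- as.numeric(gsub("[,+]", "", installs_raw))

# Distinct values in increasing order define the level order
lv <- sort(unique(installs_num))

# Ordered factor with plain (non-scientific) labels
installs_ord <- factor(installs_num,
                       levels = lv,
                       labels = format(lv, scientific = FALSE, trim = TRUE),
                       ordered = TRUE)

print(levels(installs_ord))
# Comparisons are meaningful on ordered factors
print(installs_ord[1] > installs_ord[2])
```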
Since the number of Installs in this dataset is recorded as a categorical variable with naturally ordered levels, ordinal logistic regression seems to be the most suitable model.
model_fit <-polr(Installs~Rating+Type+Genres, data = df_selected)
summary(model_fit)
##
## Re-fitting to get Hessian
## Call:
## polr(formula = Installs ~ Rating + Type + Genres, data = df_selected)
##
## Coefficients:
## Value Std. Error t value
## Rating 0.62259 0.03668 16.9736
## TypePaid -2.03432 0.07278 -27.9498
## GenresAdventure -0.52513 0.20695 -2.5375
## GenresArcade 0.11714 0.15036 0.7791
## GenresArt & Design -1.65863 0.21737 -7.6304
## GenresAuto & Vehicles -1.83716 0.21111 -8.7024
## GenresBeauty -1.90267 0.26394 -7.2086
## GenresBoard -1.22331 0.23470 -5.2121
## GenresBooks & Reference -1.64750 0.15884 -10.3719
## GenresBusiness -1.92984 0.13875 -13.9085
## GenresCard -0.92821 0.25612 -3.6242
## GenresCasino -1.16408 0.28504 -4.0840
## GenresCasual -0.22335 0.14278 -1.5642
## GenresComics -1.61782 0.22954 -7.0481
## GenresCommunication -0.03472 0.13940 -0.2491
## GenresDating -1.51536 0.15207 -9.9652
## GenresEducation -1.84630 0.11303 -16.3342
## GenresEntertainment -1.25418 0.11705 -10.7147
## GenresEvents -2.76011 0.28103 -9.8213
## GenresFinance -1.65130 0.13044 -12.6596
## GenresFood & Drink -0.85167 0.17835 -4.7752
## GenresHealth & Fitness -0.98743 0.13126 -7.5228
## GenresHouse & Home -0.86986 0.20013 -4.3465
## GenresLibraries & Demo -1.97983 0.22270 -8.8900
## GenresLifestyle -1.76976 0.13440 -13.1680
## GenresMaps & Navigation -1.19365 0.17991 -6.6347
## GenresMedical -2.25318 0.12892 -17.4778
## GenresMusic -0.40686 0.36220 -1.1233
## GenresMusic & Audio -1.22405 1.42069 -0.8616
## GenresNews & Magazines -1.23788 0.14683 -8.4308
## GenresParenting -1.69198 0.24160 -7.0033
## GenresPersonalization -1.30349 0.13597 -9.5867
## GenresPhotography 0.04502 0.13337 0.3376
## GenresProductivity -0.64586 0.13436 -4.8070
## GenresPuzzle -0.35409 0.16668 -2.1244
## GenresRacing 0.09373 0.18263 0.5132
## GenresRole Playing -0.52749 0.17239 -3.0599
## GenresShopping -0.31272 0.14322 -2.1836
## GenresSimulation -0.84630 0.14499 -5.8371
## GenresSocial -0.51920 0.14683 -3.5361
## GenresSports -0.74403 0.12962 -5.7399
## GenresStrategy -0.03122 0.18439 -0.1693
## GenresTools -1.24117 0.11172 -11.1099
## GenresTravel & Local -0.67850 0.14705 -4.6139
## GenresTrivia -1.78539 0.32798 -5.4436
## GenresVideo Players & Editors -0.50855 0.16645 -3.0554
## GenresWeather -0.54554 0.20557 -2.6538
## GenresWord -0.30614 0.33508 -0.9136
##
## Intercepts:
## Value Std. Error t value
## 1|5 -6.9366 0.6094 -11.3823
## 5|10 -5.7091 0.3385 -16.8648
## 10|50 -3.8108 0.2076 -18.3551
## 50|100 -3.2700 0.1952 -16.7537
## 100|500 -2.0129 0.1823 -11.0412
## 500|1000 -1.5981 0.1805 -8.8532
## 1000|5000 -0.6864 0.1788 -3.8393
## 5000|10000 -0.3075 0.1786 -1.7218
## 10000|50000 0.3859 0.1788 2.1588
## 50000|100000 0.6591 0.1790 3.6830
## 100000|500000 1.2788 0.1796 7.1189
## 500000|1000000 1.5576 0.1800 8.6536
## 1000000|5000000 2.4074 0.1810 13.3017
## 5000000|10000000 2.8865 0.1816 15.8953
## 10000000|50000000 4.0732 0.1840 22.1410
## 50000000|100000000 4.5642 0.1859 24.5583
## 100000000|500000000 6.0698 0.2012 30.1712
## 500000000|1000000000 6.8903 0.2237 30.8011
##
## Residual Deviance: 44546.32
## AIC: 44678.32
summary_table <- coef(summary(model_fit))
##
## Re-fitting to get Hessian
pval <- pnorm(abs(summary_table[, "t value"]),lower.tail = FALSE)* 2
summary_table <- cbind(summary_table, "p value" = round(pval,3))
We compute p-values from the t values and keep only the terms with a p-value of at most 0.05, i.e. those with a statistically significant effect in the model.
summary_table_filtered <- as_tibble(summary_table, rownames = 'id')
summary_table_filtered <- summary_table_filtered %>%
filter(`p value` <= 0.05)
print.data.frame(summary_table_filtered)
## id Value Std. Error t value p value
## 1 Rating 0.6225871 0.03667970 16.973614 0.000
## 2 TypePaid -2.0343163 0.07278475 -27.949762 0.000
## 3 GenresAdventure -0.5251321 0.20695255 -2.537452 0.011
## 4 GenresArt & Design -1.6586254 0.21737038 -7.630411 0.000
## 5 GenresAuto & Vehicles -1.8371625 0.21110924 -8.702426 0.000
## 6 GenresBeauty -1.9026701 0.26394414 -7.208609 0.000
## 7 GenresBoard -1.2233055 0.23470351 -5.212131 0.000
## 8 GenresBooks & Reference -1.6475037 0.15884238 -10.371941 0.000
## 9 GenresBusiness -1.9298429 0.13875288 -13.908488 0.000
## 10 GenresCard -0.9282061 0.25611597 -3.624163 0.000
## 11 GenresCasino -1.1640815 0.28503661 -4.083972 0.000
## 12 GenresComics -1.6178169 0.22953987 -7.048086 0.000
## 13 GenresDating -1.5153634 0.15206543 -9.965207 0.000
## 14 GenresEducation -1.8462959 0.11303237 -16.334223 0.000
## 15 GenresEntertainment -1.2541776 0.11705219 -10.714687 0.000
## 16 GenresEvents -2.7601094 0.28103168 -9.821346 0.000
## 17 GenresFinance -1.6512960 0.13043777 -12.659646 0.000
## 18 GenresFood & Drink -0.8516658 0.17835047 -4.775237 0.000
## 19 GenresHealth & Fitness -0.9874314 0.13125765 -7.522848 0.000
## 20 GenresHouse & Home -0.8698603 0.20012751 -4.346531 0.000
## 21 GenresLibraries & Demo -1.9798288 0.22270256 -8.890013 0.000
## 22 GenresLifestyle -1.7697624 0.13439915 -13.167959 0.000
## 23 GenresMaps & Navigation -1.1936537 0.17990954 -6.634744 0.000
## 24 GenresMedical -2.2531791 0.12891667 -17.477795 0.000
## 25 GenresNews & Magazines -1.2378785 0.14682848 -8.430779 0.000
## 26 GenresParenting -1.6919771 0.24159587 -7.003336 0.000
## 27 GenresPersonalization -1.3034913 0.13596937 -9.586654 0.000
## 28 GenresProductivity -0.6458617 0.13435954 -4.806966 0.000
## 29 GenresPuzzle -0.3540949 0.16668063 -2.124391 0.034
## 30 GenresRole Playing -0.5274912 0.17238778 -3.059911 0.002
## 31 GenresShopping -0.3127235 0.14321506 -2.183594 0.029
## 32 GenresSimulation -0.8462998 0.14498544 -5.837136 0.000
## 33 GenresSocial -0.5191964 0.14682936 -3.536053 0.000
## 34 GenresSports -0.7440328 0.12962455 -5.739906 0.000
## 35 GenresTools -1.2411722 0.11171773 -11.109894 0.000
## 36 GenresTravel & Local -0.6784979 0.14705420 -4.613930 0.000
## 37 GenresTrivia -1.7853906 0.32797952 -5.443604 0.000
## 38 GenresVideo Players & Editors -0.5085516 0.16644510 -3.055372 0.002
## 39 GenresWeather -0.5455423 0.20557173 -2.653780 0.008
## 40 1|5 -6.9365660 0.60941915 -11.382258 0.000
## 41 5|10 -5.7090990 0.33852096 -16.864832 0.000
## 42 10|50 -3.8108211 0.20761617 -18.355127 0.000
## 43 50|100 -3.2700147 0.19518158 -16.753705 0.000
## 44 100|500 -2.0128767 0.18230586 -11.041207 0.000
## 45 500|1000 -1.5981047 0.18051234 -8.853161 0.000
## 46 1000|5000 -0.6864149 0.17878706 -3.839288 0.000
## 47 10000|50000 0.3859030 0.17876132 2.158761 0.031
## 48 50000|100000 0.6591469 0.17897046 3.682993 0.000
## 49 100000|500000 1.2788362 0.17963900 7.118923 0.000
## 50 500000|1000000 1.5575692 0.17999176 8.653558 0.000
## 51 1000000|5000000 2.4073653 0.18098181 13.301698 0.000
## 52 5000000|10000000 2.8864919 0.18159372 15.895329 0.000
## 53 10000000|50000000 4.0732434 0.18396854 22.140978 0.000
## 54 50000000|100000000 4.5641971 0.18585117 24.558345 0.000
## 55 100000000|500000000 6.0698011 0.20117845 30.171229 0.000
## 56 500000000|1000000000 6.8902784 0.22370249 30.801080 0.000
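The proportional odds model, in the parameterisation used by polr, has the following mathematical formulation:
$$\operatorname{logit} P(Y \le j) = \zeta_j - \sum_{i=1}^{M} \beta_i x_i, \qquad j = 1, \dots, J$$
Here J is the number of thresholds between adjacent levels of Installs (J = 18) and M is the total number of independent variables (M = 3).
The index j runs over the thresholds of Installs, while i runs over the independent variables (Type and Genres enter as dummy variables, with one coefficient per non-baseline level). Simply put:
i = 1 refers to Rating
i = 2 refers to Type
i = 3 refers to Genres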
Interpretation:
Comments on coefficients: among the statistically significant terms, only the Rating of the app has a positive effect on the number of installs. Being a paid app, or belonging to any of the genres listed above rather than the baseline genre (Action), lowers the odds of reaching a higher install level.
Comments on intercepts: take 1|5 as an example. It is the log odds that an app has only 1 install versus more than 1, for an app at the baseline of all predictors (Rating of 0, Free, Action).
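To make the interpretation concrete, the printed coefficients can be turned into odds ratios and a cumulative probability by hand. The sketch below copies values from the summary output above and assumes the parameterisation used by polr, logit P(Y ≤ j) = ζ_j − η. The example app (Rating 4, Free, Education) is chosen only for illustration.

```r
# Coefficients copied from the summary output above
beta_rating      <- 0.62259
beta_paid        <- -2.03432
beta_education   <- -1.84630
zeta_10000_50000 <- 0.3859   # threshold between the 10000 and 50000 levels

# Odds ratios: multiplicative change in the odds of a higher install level
or_rating <- exp(beta_rating)  # per one-point increase in Rating
or_paid   <- exp(beta_paid)    # Paid relative to Free

# Linear predictor for a free Education app rated 4
# (Free is the baseline level of Type, so it adds no term)
eta <- beta_rating * 4 + beta_education

# Cumulative probability of ending up at the 10000 level or below
p_up_to_10000 <- plogis(zeta_10000_50000 - eta)

print(round(c(or_rating = or_rating, or_paid = or_paid,
              p_up_to_10000 = p_up_to_10000), 3))
```

Under these assumptions, a one-point higher Rating multiplies the odds of a higher install level by roughly 1.86, while a paid app has roughly 0.13 times the odds of a comparable free app.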
Let’s predict how popular an app is likely to be for a developer who creates apps with the following characteristics. The development costs are based on this report.
select <- dplyr::select
df_price <- df_load %>% select(Type, Price) %>% drop_na(.) %>% filter(Type == 'Paid' & Price <=50)
#graph
ggplot(df_price, aes(x= Price)) +
geom_histogram(color ='blue', fill = 'white') +
theme_light()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We restrict the prices to at most 50, since the majority of paid apps fall within this range.
Next, we build a function to calculate the break-even ad rate or price for each app.
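# Print the expected number of installs and the break-even revenue per install.
# result: named vector of class probabilities from predict(); cost: development cost
costCalculator <- function(result, cost){
  result_1 <- as.data.frame(result)
  result_1 <- result_1 %>% rownames_to_column()
  colnames(result_1) <- c('Installs','Probabbility')
  result_1$Installs <- as.numeric(result_1$Installs)
  # Contribution of each install level to the expected number of installs
  result_1$Expected_value <- result_1$Installs * result_1$Probabbility
  print(result_1)
  total_install <- sum(result_1$Expected_value)
  # Revenue needed per install to recover the development cost
  average_cost <- cost/total_install
  cat('The total sum of expected value in total Installs is',total_install,'and the ideal ad rate/ price should be minimum at', average_cost)
}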
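A variant of this calculation that returns the numbers instead of printing them can be easier to reuse. The following is a sketch in base R, not the function used in this report. The name `break_even` and the sample probabilities are hypothetical, and the sketch treats each install label as the exact number of installs, which is a lower bound because the labels mean "at least this many".

```r
# Expected installs and break-even revenue per install (hypothetical helper)
# probs: named numeric vector of probabilities; the names are install levels
# cost:  development cost in dollars
break_even <- function(probs, cost) {
  installs <- as.numeric(names(probs))
  # Probability-weighted sum of the install levels
  expected_installs <- sum(installs * probs)
  list(expected_installs = expected_installs,
       per_install = cost / expected_installs)
}

# Made-up probabilities over three install levels
probs <- c("1000" = 0.5, "10000" = 0.3, "100000" = 0.2)
res <- break_even(probs, cost = 150000)
print(res$expected_installs)
print(res$per_install)
```

Returning a list keeps the two quantities available for further comparison between apps, for example in a table of scenarios.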
Suppose one app is a free, high-quality educational app that costs $150,000 to develop.
new_app <- data.frame('Rating'=4,'Type'='Free','Genres'='Education')
result1<-round(predict(model_fit,new_app,type = "p"), 3)
result1
## 1 5 10 50 100 500
## 0.001 0.001 0.010 0.008 0.046 0.030
## 1000 5000 10000 50000 100000 500000
## 0.113 0.069 0.157 0.068 0.150 0.060
## 1000000 5000000 10000000 50000000 100000000 500000000
## 0.140 0.050 0.065 0.012 0.015 0.002
## 1000000000
## 0.002
The most likely outcome for this app (15.7%) is the 10,000 installs level. Its break-even ad rate per install is therefore:
costCalculator(result=result1,cost= 150000)
## Installs Probabbility Expected_value
## 1 1 0.001 0.001
## 2 5 0.001 0.005
## 3 10 0.010 0.100
## 4 50 0.008 0.400
## 5 100 0.046 4.600
## 6 500 0.030 15.000
## 7 1000 0.113 113.000
## 8 5000 0.069 345.000
## 9 10000 0.157 1570.000
## 10 50000 0.068 3400.000
## 11 100000 0.150 15000.000
## 12 500000 0.060 30000.000
## 13 1000000 0.140 140000.000
## 14 5000000 0.050 250000.000
## 15 10000000 0.065 650000.000
## 16 50000000 0.012 600000.000
## 17 100000000 0.015 1500000.000
## 18 500000000 0.002 1000000.000
## 19 1000000000 0.002 2000000.000
## The total sum of expected value in total Installs is 6190448 and the ideal ad rate/ price should be minimum at 0.02423088
Another app is a paid Racing game of average quality that costs $200,000 to develop.
new_app_2 <- data.frame('Rating'=3.5,'Type'='Paid','Genres'='Racing')
result2<-round(predict(model_fit,new_app_2,type = "p"), 3)
result2
## 1 5 10 50 100 500
## 0.001 0.002 0.015 0.012 0.066 0.042
## 1000 5000 10000 50000 100000 500000
## 0.147 0.083 0.170 0.067 0.135 0.050
## 1000000 5000000 10000000 50000000 100000000 500000000
## 0.108 0.036 0.045 0.008 0.010 0.002
## 1000000000
## 0.001
Comment: the most likely outcome for this app (17%) is the 10,000 installs level.
costCalculator(result=result2,cost= 200000)
## Installs Probabbility Expected_value
## 1 1 0.001 0.001
## 2 5 0.002 0.010
## 3 10 0.015 0.150
## 4 50 0.012 0.600
## 5 100 0.066 6.600
## 6 500 0.042 21.000
## 7 1000 0.147 147.000
## 8 5000 0.083 415.000
## 9 10000 0.170 1700.000
## 10 50000 0.067 3350.000
## 11 100000 0.135 13500.000
## 12 500000 0.050 25000.000
## 13 1000000 0.108 108000.000
## 14 5000000 0.036 180000.000
## 15 10000000 0.045 450000.000
## 16 50000000 0.008 400000.000
## 17 100000000 0.010 1000000.000
## 18 500000000 0.002 1000000.000
## 19 1000000000 0.001 1000000.000
## The total sum of expected value in total Installs is 4182140 and the ideal ad rate/ price should be minimum at 0.0478224
Based on the price distribution graph, and using ‘herd behavior’ as a guideline, the developer should set the price at $0.99.