1. Variable overview

The dataset has 13 variables and 10,841 observations.

App:Application name
Category:Category the app belongs to
Rating:Overall user rating of the app (as when scraped)
Reviews:Number of user reviews for the app (as when scraped)
Size:Size of the app (as when scraped)
Installs:Number of user downloads/installs for the app (as when scraped)
Type:Paid or Free
Price:Price of the app (as when scraped)
Content Rating:Age group the app is targeted at - Children / Mature 21+ / Adult
Genres:An app can belong to multiple genres (apart from its main category). For example, a musical family game will belong to the Music, Game, and Family genres.
Last Updated:Date when the app was last updated on Play Store (as when scraped)
Current Ver:Current version of the app available on Play Store (as when scraped)
Android Ver:Min required Android version (as when scraped)

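2. Data processing

2.1 Dropped variables

We drop four variables: App, Last Updated, Current Ver, and Android Ver.

2.2 Variable transformations

Convert Size to megabytes (stored as size_mb) and drop the apps whose size varies with the device.
Strip the "+" sign from Installs.
Drop the observations whose Rating is NA.
Add a new variable, small_app, that flags apps smaller than 1 MB.

This leaves 7,729 observations and 10 variables.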
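
The cleaning steps listed above can be sketched in base R roughly as follows. This is only an illustration on four made-up rows: the raw string formats ("19M", "201k", "Varies with device", "10,000+"), the helper `size_to_mb()`, and the 1024 kB per MB conversion are assumptions, not taken from the original script.

```r
# Toy rows standing in for the raw data; the string formats are assumed.
raw <- data.frame(
  Size     = c("19M", "201k", "Varies with device", "8.7M"),
  Installs = c("10,000+", "500+", "1,000,000+", "100+"),
  Rating   = c(4.1, 4.0, 4.5, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: convert a size string to megabytes.
# Strings that do not end in "M" or "k" (e.g. "Varies with device") become NA.
size_to_mb <- function(x) {
  num <- suppressWarnings(as.numeric(sub("[Mk]$", "", x)))
  ifelse(grepl("M$", x), num,
         ifelse(grepl("k$", x), num / 1024, NA_real_))
}

clean <- raw
clean$size_mb  <- size_to_mb(clean$Size)
# Remove "+" and thousands separators, then convert to numeric.
clean$Installs <- as.numeric(gsub("[+,]", "", clean$Installs))
# Drop device-dependent sizes and missing ratings.
clean <- clean[!is.na(clean$size_mb) & !is.na(clean$Rating), ]
# Flag apps smaller than 1 MB.
clean$small_app <- as.integer(clean$size_mb < 1)

clean
```

On these toy rows, the device-dependent row and the row with a missing rating are dropped, and only the 201 kB app is flagged as a small app.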

Summary statistics and the correlation matrix in R:

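datause3 <- as.data.frame(datause2)
# stargazer(datause3, omit.summary.stat = c("p25", "p75"))

# Drop columns 1, 5, 7 and 8 (presumably the non-numeric ones) so that cor() works
df <- datause3[, -c(1, 5, 7, 8)]
correlation.matrix <- cor(df)

# stargazer(correlation.matrix, title = "Correlation matrix of the Android apps")

# Reuse the matrix computed above for the plot
M <- correlation.matrix
library(corrplot)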

## Warning: package 'corrplot' was built under R version 3.6.3

## corrplot 0.84 loaded

corrplot(M, method="number")
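
Hard-coded column indices such as `-c(1, 5, 7, 8)` break silently if the column order changes. A minimal alternative sketch, on a made-up data frame (`toy` and its values are illustrative only), selects the numeric columns by type before calling `cor()`:

```r
# Made-up data frame with one non-numeric column.
toy <- data.frame(
  Category = c("GAME", "TOOLS", "GAME", "FAMILY", "TOOLS"),
  Rating   = c(4.1, 3.9, 4.5, 4.0, 4.3),
  Reviews  = c(159, 967, 87510, 215644, 1200),
  size_mb  = c(19, 14, 8.7, 25, 2.8),
  stringsAsFactors = FALSE
)

# Keep only the numeric columns, whatever their position.
num_cols <- vapply(toy, is.numeric, logical(1))
M_toy <- cor(toy[, num_cols])

round(M_toy, 2)
```

The resulting matrix can be passed to `corrplot()` in the same way as `M`.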

3. Questions of interest

library(ggplot2)  # provides qplot()

# Density of Rating, split by Type (Paid or Free)
qplot(Rating, data = datause3, geom = "density",
  fill = Type, alpha = I(.5),
  main = "Distribution of App rating",
  xlab = "Rating",
  ylab = "Density")

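3.1 Which factors affect app pricing?

Here we fit the corner-solution version of the Tobit model, with Price as the outcome and Reviews, Rating, Installs, size_mb, and small_app as features. We use a Tobit model because a large share of the apps have a price of exactly 0.

Note: I followed an example that fits this model to the extramarital-affairs data, and I am not sure whether left-censoring at 0 is equivalent to the corner-solution Tobit model.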
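
As far as estimation goes, the two should coincide: the standard (Type I) Tobit likelihood is the same whether the zeros are read as censored values or as a corner solution, and only the interpretation of the coefficients differs. A sketch of the model that `tobit()` fits with its default `left = 0`, writing $y_i$ for Price and $x_i$ for the regressors (notation introduced here, not in the original):

```latex
\begin{aligned}
y_i^{*} &= x_i^{\top}\beta + \varepsilon_i, \qquad
\varepsilon_i \mid x_i \sim \mathcal{N}(0, \sigma^{2}), \qquad
y_i = \max(0,\, y_i^{*}), \\[4pt]
\log L(\beta, \sigma) &=
\sum_{i:\, y_i = 0} \log \Phi\!\left(-\frac{x_i^{\top}\beta}{\sigma}\right)
+ \sum_{i:\, y_i > 0} \left[ \log \phi\!\left(\frac{y_i - x_i^{\top}\beta}{\sigma}\right) - \log \sigma \right].
\end{aligned}
```

Here $\Phi$ and $\phi$ are the standard normal CDF and density. In the model output, `Log(scale)` is the estimate of $\log\sigma$ and `Scale` is $\sigma$ itself.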

library(AER)  # provides tobit()

# Nested Tobit models, adding one regressor at a time (left-censored at 0 by default)
fm.tobit <- tobit(Price ~ Reviews,
                  data = datause2)
m1 <- summary(fm.tobit)

fm.tobit <- tobit(Price ~ Reviews + Rating,
                  data = datause2)
m2 <- summary(fm.tobit)

fm.tobit <- tobit(Price ~ Reviews + Rating + Installs,
                  data = datause2)
m3 <- summary(fm.tobit)

fm.tobit <- tobit(Price ~ Reviews + Rating + Installs + size_mb,
                  data = datause2)
m4 <- summary(fm.tobit)

fm.tobit <- tobit(Price ~ Reviews + Rating + Installs + size_mb + small_app,
                  data = datause2)
m5 <- summary(fm.tobit)

m1

## 
## Call:
## tobit(formula = Price ~ Reviews, data = datause2)
## 
## Observations:
##          Total  Left-censored     Uncensored Right-censored 
##           7729           7150            579              0 
## 
## Coefficients:
##                 Estimate   Std. Error z value Pr(>|z|)    
## (Intercept) -101.9994121    3.8203521 -26.699  < 2e-16 ***
## Reviews       -0.0004762    0.0000610  -7.806 5.91e-15 ***
## Log(scale)     4.3160207    0.0317690 135.856  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Scale: 74.89 
## 
## Gaussian distribution
## Number of Newton-Raphson Iterations: 10 
## Log-likelihood: -4461 on 3 Df
## Wald-statistic: 60.93 on 1 Df, p-value: 5.9081e-15

m2

## 
## Call:
## tobit(formula = Price ~ Reviews + Rating, data = datause2)
## 
## Observations:
##          Total  Left-censored     Uncensored Right-censored 
##           7729           7150            579              0 
## 
## Coefficients:
##                  Estimate    Std. Error z value Pr(>|z|)    
## (Intercept) -153.96592018   13.96637658 -11.024  < 2e-16 ***
## Reviews       -0.00050063    0.00006238  -8.025 1.01e-15 ***
## Rating        12.41740500    3.08860459   4.020 5.81e-05 ***
## Log(scale)     4.31924525    0.03182514 135.718  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Scale: 75.13 
## 
## Gaussian distribution
## Number of Newton-Raphson Iterations: 10 
## Log-likelihood: -4452 on 4 Df
## Wald-statistic: 73.94 on 2 Df, p-value: < 2.22e-16

m3

## 
## Call:
## tobit(formula = Price ~ Reviews + Rating + Installs, data = datause2)
## 
## Observations:
##          Total  Left-censored     Uncensored Right-censored 
##           7729           7150            579              0 
## 
## Coefficients:
##                   Estimate     Std. Error z value Pr(>|z|)    
## (Intercept) -139.367794518   13.415244106 -10.389  < 2e-16 ***
## Reviews        0.000122725    0.000012106  10.138  < 2e-16 ***
## Rating        11.381540679    2.981341125   3.818 0.000135 ***
## Installs      -0.000055951    0.000005205 -10.749  < 2e-16 ***
## Log(scale)     4.305359970    0.031654281 136.012  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Scale: 74.1 
## 
## Gaussian distribution
## Number of Newton-Raphson Iterations: 11 
## Log-likelihood: -4346 on 5 Df
## Wald-statistic: 124.7 on 3 Df, p-value: < 2.22e-16

m4

## 
## Call:
## tobit(formula = Price ~ Reviews + Rating + Installs + size_mb, 
##     data = datause2)
## 
## Observations:
##          Total  Left-censored     Uncensored Right-censored 
##           7729           7150            579              0 
## 
## Coefficients:
##                   Estimate     Std. Error z value Pr(>|z|)    
## (Intercept) -142.022112660   13.523809728 -10.502  < 2e-16 ***
## Reviews        0.000126464    0.000012288  10.291  < 2e-16 ***
## Rating        11.160540722    2.988883832   3.734 0.000188 ***
## Installs      -0.000057796    0.000005296 -10.914  < 2e-16 ***
## size_mb        0.194043497    0.081303242   2.387 0.017002 *  
## Log(scale)     4.306681251    0.031677698 135.953  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Scale: 74.19 
## 
## Gaussian distribution
## Number of Newton-Raphson Iterations: 11 
## Log-likelihood: -4343 on 6 Df
## Wald-statistic: 128.3 on 4 Df, p-value: < 2.22e-16

m5

## 
## Call:
## tobit(formula = Price ~ Reviews + Rating + Installs + size_mb + 
##     small_app, data = datause2)
## 
## Observations:
##          Total  Left-censored     Uncensored Right-censored 
##           7729           7150            579              0 
## 
## Coefficients:
##                  Estimate    Std. Error z value   Pr(>|z|)    
## (Intercept) -148.91289953   13.84169619 -10.758    < 2e-16 ***
## Reviews        0.00012537    0.00001232  10.176    < 2e-16 ***
## Rating        11.95001211    3.03144860   3.942 0.00008080 ***
## Installs      -0.00005735    0.00000531 -10.801    < 2e-16 ***
## size_mb        0.27169760    0.08312430   3.269    0.00108 ** 
## small_app     32.62328616    7.25582737   4.496 0.00000692 ***
## Log(scale)     4.30739311    0.03168942 135.925    < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Scale: 74.25 
## 
## Gaussian distribution
## Number of Newton-Raphson Iterations: 11 
## Log-likelihood: -4333 on 7 Df
## Wald-statistic: 142.4 on 5 Df, p-value: < 2.22e-16
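
The coefficients above are on the scale of the latent price, so under the corner-solution reading they are not the effect on the observed price. In that reading the partial effect of a regressor on the expected observed outcome is the coefficient scaled by $\Phi(x^{\top}\beta/\sigma)$. The sketch below illustrates this on simulated data; it uses `survival::survreg()`, which `AER::tobit()` is a convenience interface to, and every variable name and parameter value in it is made up for illustration.

```r
library(survival)  # recommended package that ships with R

set.seed(42)
n <- 2000
x <- rnorm(n)
y_star <- 0.5 + 2 * x + rnorm(n)  # latent outcome, true sigma = 1
y <- pmax(y_star, 0)              # observed outcome, piled up at 0

# Left-censored Gaussian regression: the event flag is TRUE when y is above 0.
fit <- survreg(Surv(y, y > 0, type = "left") ~ x, dist = "gaussian")

beta  <- coef(fit)
sigma <- fit$scale

# Average partial effect of x on E[y | x]: beta_x * mean(Phi(x'beta / sigma)).
xb  <- beta[1] + beta[2] * x
ape <- unname(beta[2] * mean(pnorm(xb / sigma)))

c(latent_effect = unname(beta[2]), average_partial_effect = ape)
```

The average partial effect is always smaller in magnitude than the latent coefficient, and the gap widens as the share of zeros grows. With 7,150 of 7,729 prices equal to 0, the gap is likely to be large for the models above.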

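Installs is binned into three ordered levels, which serve as the outcome of the ordered logistic regressions (polr) below:

low (up to 10,000)

medium (above 10,000, up to 1,000,000)

high (above 1,000,000)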
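
How `cut()` treats the boundary values is easy to check on a few hand-picked install counts (illustrative values, not taken from the data). With `right = TRUE` each interval is closed on the right, so 10,000 falls into "low" and 1,000,000 into "medium":

```r
# Hand-picked install counts around the two break points.
installs <- c(1, 10000, 10001, 1000000, 1000001)

level <- cut(installs,
             breaks = c(-Inf, 10000, 1000000, Inf),
             labels = c("low", "medium", "high"),
             right = TRUE)

data.frame(installs, level)
```

If Installs only takes round values such as 10000 and 1000000, the choice between `right = TRUE` and `right = FALSE` moves every app that sits exactly on a break point into a different level.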

library(MASS)  # provides polr()

# Bin Installs into three levels: (-Inf, 1e4], (1e4, 1e6], (1e6, Inf)
a <- datause2$Installs
ordinal <- cut(a, breaks = c(-Inf, 10000, 1000000, Inf),
               labels = c("low", "medium", "high"), right = TRUE)

datause2$ordinal <- ordinal

df <- datause2[, -c(1, 4, 8)]

# Ordered logistic regression of the install level on all four regressors
m <- polr(ordinal ~ Price + Rating + size_mb + small_app, data = df, Hess = TRUE)
summary(m)

## Call:
## polr(formula = ordinal ~ Price + Rating + size_mb + small_app, 
##     data = df, Hess = TRUE)
## 
## Coefficients:
##              Value Std. Error t value
## Price     -0.06542   0.012292  -5.322
## Rating     0.36387   0.041137   8.845
## size_mb    0.02600   0.000997  26.084
## small_app -0.58709   0.129758  -4.524
## 
## Intercepts:
##             Value   Std. Error t value
## low|medium   1.3699  0.1726     7.9375
## medium|high  3.4158  0.1763    19.3716
## 
## Residual Deviance: 15498.17 
## AIC: 15510.17

(ctable <- coef(summary(m)))

##                   Value   Std. Error   t value
## Price       -0.06542291 0.0122922065 -5.322308
## Rating       0.36386696 0.0411366445  8.845324
## size_mb      0.02600456 0.0009969589 26.083883
## small_app   -0.58708625 0.1297579377 -4.524473
## low|medium   1.36991449 0.1725880531  7.937482
## medium|high  3.41579831 0.1763302744 19.371593

p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
(ctable <- cbind(ctable, "p value" = p))

##                   Value   Std. Error   t value       p value
## Price       -0.06542291 0.0122922065 -5.322308  1.024589e-07
## Rating       0.36386696 0.0411366445  8.845324  9.126290e-19
## size_mb      0.02600456 0.0009969589 26.083883 5.555030e-150
## small_app   -0.58708625 0.1297579377 -4.524473  6.054626e-06
## low|medium   1.36991449 0.1725880531  7.937482  2.063277e-15
## medium|high  3.41579831 0.1763302744 19.371593  1.340454e-83
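
The polr coefficients are on the log-odds scale, so exponentiating them gives odds ratios. For the Price coefficient above, exp(-0.0654) is roughly 0.94, which would mean that each extra unit of price multiplies the odds of being in a higher install level by about 0.94, holding the other regressors fixed. A minimal sketch on simulated data (the data frame `toy`, its columns, and the true coefficients are made up), using the same normal approximation for the p-values as above:

```r
library(MASS)

set.seed(7)
n <- 1500
toy <- data.frame(Price = rexp(n), Rating = runif(n, 1, 5))

# Latent score with logistic noise, cut into three ordered levels.
latent <- -0.5 * toy$Price + 0.4 * toy$Rating + rlogis(n)
toy$level <- cut(latent, breaks = c(-Inf, 0.5, 2, Inf),
                 labels = c("low", "medium", "high"),
                 ordered_result = TRUE)

fit  <- polr(level ~ Price + Rating, data = toy, Hess = TRUE)
ctab <- coef(summary(fit))

# Two-sided p-values from the normal approximation to the t values.
p <- 2 * pnorm(abs(ctab[, "t value"]), lower.tail = FALSE)

# Odds ratios for the slope coefficients only (not the cut points).
odds_ratio <- exp(coef(fit))

cbind(ctab, "p value" = p)
odds_ratio
```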

m <- polr(ordinal~ Price, data = df, Hess=TRUE)

(ctable <- coef(summary(m)))

##                   Value Std. Error    t value
## Price       -0.05944354 0.01192488  -4.984833
## low|medium  -0.64332011 0.02433648 -26.434391
## medium|high  1.21310195 0.02733553  44.378205

p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
(ctable <- cbind(ctable, "p value" = p))

##                   Value Std. Error    t value       p value
## Price       -0.05944354 0.01192488  -4.984833  6.201541e-07
## low|medium  -0.64332011 0.02433648 -26.434391 5.516235e-154
## medium|high  1.21310195 0.02733553  44.378205  0.000000e+00

summary(m)

## Call:
## polr(formula = ordinal ~ Price, data = df, Hess = TRUE)
## 
## Coefficients:
##          Value Std. Error t value
## Price -0.05944    0.01192  -4.985
## 
## Intercepts:
##             Value    Std. Error t value 
## low|medium   -0.6433   0.0243   -26.4344
## medium|high   1.2131   0.0273    44.3782
## 
## Residual Deviance: 16440.95 
## AIC: 16446.95

m <- polr(ordinal~ Price+Rating, data = df, Hess=TRUE)

(ctable <- coef(summary(m)))

##                  Value Std. Error   t value
## Price       -0.0645582 0.01222475 -5.280941
## Rating       0.4425909 0.04043985 10.944424
## low|medium   1.1880031 0.16888575  7.034360
## medium|high  3.0656636 0.17181312 17.843012

p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
(ctable <- cbind(ctable, "p value" = p))

##                  Value Std. Error   t value      p value
## Price       -0.0645582 0.01222475 -5.280941 1.285220e-07
## Rating       0.4425909 0.04043985 10.944424 7.066529e-28
## low|medium   1.1880031 0.16888575  7.034360 2.001784e-12
## medium|high  3.0656636 0.17181312 17.843012 3.275527e-71

summary(m)

## Call:
## polr(formula = ordinal ~ Price + Rating, data = df, Hess = TRUE)
## 
## Coefficients:
##           Value Std. Error t value
## Price  -0.06456    0.01222  -5.281
## Rating  0.44259    0.04044  10.944
## 
## Intercepts:
##             Value   Std. Error t value
## low|medium   1.1880  0.1689     7.0344
## medium|high  3.0657  0.1718    17.8430
## 
## Residual Deviance: 16318.54 
## AIC: 16326.54

m <- polr(ordinal~ Price+Rating+small_app, data = df, Hess=TRUE)
summary(m)

## Call:
## polr(formula = ordinal ~ Price + Rating + small_app, data = df, 
##     Hess = TRUE)
## 
## Coefficients:
##              Value Std. Error t value
## Price     -0.06126    0.01204  -5.090
## Rating     0.42539    0.04053  10.496
## small_app -1.13928    0.12764  -8.926
## 
## Intercepts:
##             Value   Std. Error t value
## low|medium   1.0757  0.1695     6.3472
## medium|high  2.9672  0.1723    17.2189
## 
## Residual Deviance: 16232.30 
## AIC: 16242.30

(ctable <- coef(summary(m)))

##                   Value Std. Error   t value
## Price       -0.06126175 0.01203559 -5.090050
## Rating       0.42538543 0.04052946 10.495709
## small_app   -1.13928306 0.12763955 -8.925784
## low|medium   1.07570076 0.16947749  6.347160
## medium|high  2.96719417 0.17232198 17.218895

p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
(ctable <- cbind(ctable, "p value" = p))

##                   Value Std. Error   t value      p value
## Price       -0.06126175 0.01203559 -5.090050 3.579683e-07
## Rating       0.42538543 0.04052946 10.495709 9.039647e-26
## small_app   -1.13928306 0.12763955 -8.925784 4.425521e-19
## low|medium   1.07570076 0.16947749  6.347160 2.193258e-10
## medium|high  2.96719417 0.17232198 17.218895 1.916104e-66

m <- polr(ordinal~ Price+Rating+small_app+size_mb, data = df, Hess=TRUE)
summary(m)

## Call:
## polr(formula = ordinal ~ Price + Rating + small_app + size_mb, 
##     data = df, Hess = TRUE)
## 
## Coefficients:
##              Value Std. Error t value
## Price     -0.06543   0.012288  -5.325
## Rating     0.36376   0.041181   8.833
## small_app -0.58784   0.129778  -4.530
## size_mb    0.02601   0.000997  26.085
## 
## Intercepts:
##             Value   Std. Error t value
## low|medium   1.3695  0.1727     7.9301
## medium|high  3.4153  0.1765    19.3529
## 
## Residual Deviance: 15498.17 
## AIC: 15510.17

(ctable <- coef(summary(m)))

##                   Value   Std. Error   t value
## Price       -0.06542820 0.0122879694 -5.324574
## Rating       0.36375788 0.0411810589  8.833136
## small_app   -0.58784489 0.1297776958 -4.529630
## size_mb      0.02600585 0.0009969815 26.084583
## low|medium   1.36945439 0.1726906498  7.930102
## medium|high  3.41530378 0.1764750775 19.352896

p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
(ctable <- cbind(ctable, "p value" = p))

##                   Value   Std. Error   t value       p value
## Price       -0.06542820 0.0122879694 -5.324574  1.011901e-07
## Rating       0.36375788 0.0411810589  8.833136  1.017814e-18
## small_app   -0.58784489 0.1297776958 -4.529630  5.908718e-06
## size_mb      0.02600585 0.0009969815 26.084583 5.454428e-150
## low|medium   1.36945439 0.1726906498  7.930102  2.189668e-15
## medium|high  3.41530378 0.1764750775 19.352896  1.927048e-83
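
The ordered-logit fits above are nested, so their residual deviances can be compared directly. For instance, adding size_mb lowers the reported deviance from 16232.30 to 15498.17, a drop of 734.13 on one degree of freedom. A hedged sketch of such a likelihood-ratio comparison on simulated data (the names `toy`, `small`, and `full` and all parameter values are made up for illustration):

```r
library(MASS)

set.seed(11)
n <- 1200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
latent <- 0.8 * toy$x1 + 0.6 * toy$x2 + rlogis(n)
toy$level <- cut(latent, breaks = c(-Inf, -1, 1, Inf),
                 labels = c("low", "medium", "high"),
                 ordered_result = TRUE)

small <- polr(level ~ x1, data = toy, Hess = TRUE)
full  <- polr(level ~ x1 + x2, data = toy, Hess = TRUE)

# Likelihood-ratio statistic: the drop in residual deviance.
lr_stat <- small$deviance - full$deviance
lr_df   <- length(coef(full)) - length(coef(small))
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# AIC penalises the extra parameter; lower is better.
aic_values <- c(small = AIC(small), full = AIC(full))

c(lr_stat = lr_stat, df = lr_df, p_value = lr_p)
aic_values
```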

googleplaydataset

Chen Ning Kuan