The dataset has 13 variables and 10,841 observations.
App: Application name
Category: Category the app belongs to
Rating: Overall user rating of the app (at the time of scraping)
Reviews: Number of user reviews for the app (at the time of scraping)
Size: Size of the app (at the time of scraping)
Installs: Number of user downloads/installs for the app (at the time of scraping)
Type: Paid or Free
Price: Price of the app (at the time of scraping)
Content Rating: Age group the app is targeted at (Children / Mature 21+ / Adult)
Genres: An app can belong to multiple genres (apart from its main category). For example, a musical family game will belong to the Music, Game, and Family genres.
Last Updated: Date when the app was last updated on Play Store (at the time of scraping)
Current Ver: Current version of the app available on Play Store (at the time of scraping)
Android Ver: Minimum required Android version (at the time of scraping)
We drop four variables: App, Last Updated, Current Ver, and Android Ver.
In the end, 7,729 observations and 10 variables remain.
Summary statistics and the correlation matrix in R:
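datause3 <- as.data.frame(datause2)
# stargazer(datause3, omit.summary.stat = c("p25", "p75"))
# Drop columns 1, 5, 7 and 8 (presumably the non-numeric ones, since cor() requires numeric input)
df <- datause3[, -c(1, 5, 7, 8)]
correlation.matrix <- cor(df)
# stargazer(correlation.matrix, title = "Correlation matrix of the Android app variables")
M <- correlation.matrix
library(corrplot)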
corrplot(M, method="number")
library(ggplot2)  # provides qplot()
qplot(Rating, data = datause3, geom = "density",
      fill = Type, alpha = I(.5),
      main = "Distribution of App rating",
      xlab = "Rating",
      ylab = "Density")
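Here we fit the corner solution version of the Tobit model, with Price as the outcome and Reviews, Rating, Installs, size_mb, and small_app as features. We use a Tobit model because a large share of the apps have a price of exactly 0.
Note: I followed an example that fits a Tobit model to the extramarital affairs data, and I am not sure whether left-censoring at 0 is equivalent to the corner solution case of the Tobit model.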
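For reference, here is a sketch of the standard (type I) Tobit setup with the lower limit at 0. The symbols $y_i^*$, $x_i$, $\beta$, $\sigma$, $\Phi$ and $\phi$ are notation introduced here, not taken from the original. To my understanding, the likelihood is the same whether the zeros are read as censored values or as a corner solution. The two readings differ in interpretation: under a corner solution the quantities of interest are $P(y > 0 \mid x)$ and $E[y \mid x]$ rather than the latent $y^*$.

```latex
\begin{aligned}
y_i^* &= x_i^\top \beta + u_i, \qquad u_i \mid x_i \sim \mathcal{N}(0, \sigma^2), \\
y_i &= \max(0,\, y_i^*), \\
\log L(\beta, \sigma) &= \sum_{i:\, y_i = 0} \log \Phi\!\left(-\frac{x_i^\top \beta}{\sigma}\right)
 + \sum_{i:\, y_i > 0} \left[ \log \phi\!\left(\frac{y_i - x_i^\top \beta}{\sigma}\right) - \log \sigma \right].
\end{aligned}
```

In the output that follows, `Scale` corresponds to $\sigma$ and `Log(scale)` to $\log \sigma$.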
library(AER)  # provides tobit(); by default the outcome is left-censored at 0
# Fit nested models, adding one feature at a time, and keep each summary
fm.tobit <- tobit(Price ~ Reviews,
                  data = datause2)
m1 <- summary(fm.tobit)
fm.tobit <- tobit(Price ~ Reviews + Rating,
                  data = datause2)
m2 <- summary(fm.tobit)
fm.tobit <- tobit(Price ~ Reviews + Rating + Installs,
                  data = datause2)
m3 <- summary(fm.tobit)
fm.tobit <- tobit(Price ~ Reviews + Rating + Installs + size_mb,
                  data = datause2)
m4 <- summary(fm.tobit)
fm.tobit <- tobit(Price ~ Reviews + Rating + Installs + size_mb + small_app,
                  data = datause2)
m5 <- summary(fm.tobit)
m1
##
## Call:
## tobit(formula = Price ~ Reviews, data = datause2)
##
## Observations:
## Total Left-censored Uncensored Right-censored
## 7729 7150 579 0
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -101.9994121 3.8203521 -26.699 < 2e-16 ***
## Reviews -0.0004762 0.0000610 -7.806 5.91e-15 ***
## Log(scale) 4.3160207 0.0317690 135.856 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Scale: 74.89
##
## Gaussian distribution
## Number of Newton-Raphson Iterations: 10
## Log-likelihood: -4461 on 3 Df
## Wald-statistic: 60.93 on 1 Df, p-value: 5.9081e-15
m2
##
## Call:
## tobit(formula = Price ~ Reviews + Rating, data = datause2)
##
## Observations:
## Total Left-censored Uncensored Right-censored
## 7729 7150 579 0
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -153.96592018 13.96637658 -11.024 < 2e-16 ***
## Reviews -0.00050063 0.00006238 -8.025 1.01e-15 ***
## Rating 12.41740500 3.08860459 4.020 5.81e-05 ***
## Log(scale) 4.31924525 0.03182514 135.718 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Scale: 75.13
##
## Gaussian distribution
## Number of Newton-Raphson Iterations: 10
## Log-likelihood: -4452 on 4 Df
## Wald-statistic: 73.94 on 2 Df, p-value: < 2.22e-16
m3
##
## Call:
## tobit(formula = Price ~ Reviews + Rating + Installs, data = datause2)
##
## Observations:
## Total Left-censored Uncensored Right-censored
## 7729 7150 579 0
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -139.367794518 13.415244106 -10.389 < 2e-16 ***
## Reviews 0.000122725 0.000012106 10.138 < 2e-16 ***
## Rating 11.381540679 2.981341125 3.818 0.000135 ***
## Installs -0.000055951 0.000005205 -10.749 < 2e-16 ***
## Log(scale) 4.305359970 0.031654281 136.012 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Scale: 74.1
##
## Gaussian distribution
## Number of Newton-Raphson Iterations: 11
## Log-likelihood: -4346 on 5 Df
## Wald-statistic: 124.7 on 3 Df, p-value: < 2.22e-16
m4
##
## Call:
## tobit(formula = Price ~ Reviews + Rating + Installs + size_mb,
## data = datause2)
##
## Observations:
## Total Left-censored Uncensored Right-censored
## 7729 7150 579 0
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -142.022112660 13.523809728 -10.502 < 2e-16 ***
## Reviews 0.000126464 0.000012288 10.291 < 2e-16 ***
## Rating 11.160540722 2.988883832 3.734 0.000188 ***
## Installs -0.000057796 0.000005296 -10.914 < 2e-16 ***
## size_mb 0.194043497 0.081303242 2.387 0.017002 *
## Log(scale) 4.306681251 0.031677698 135.953 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Scale: 74.19
##
## Gaussian distribution
## Number of Newton-Raphson Iterations: 11
## Log-likelihood: -4343 on 6 Df
## Wald-statistic: 128.3 on 4 Df, p-value: < 2.22e-16
m5
##
## Call:
## tobit(formula = Price ~ Reviews + Rating + Installs + size_mb +
## small_app, data = datause2)
##
## Observations:
## Total Left-censored Uncensored Right-censored
## 7729 7150 579 0
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -148.91289953 13.84169619 -10.758 < 2e-16 ***
## Reviews 0.00012537 0.00001232 10.176 < 2e-16 ***
## Rating 11.95001211 3.03144860 3.942 0.00008080 ***
## Installs -0.00005735 0.00000531 -10.801 < 2e-16 ***
## size_mb 0.27169760 0.08312430 3.269 0.00108 **
## small_app 32.62328616 7.25582737 4.496 0.00000692 ***
## Log(scale) 4.30739311 0.03168942 135.925 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Scale: 74.25
##
## Gaussian distribution
## Number of Newton-Raphson Iterations: 11
## Log-likelihood: -4333 on 7 Df
## Wald-statistic: 142.4 on 5 Df, p-value: < 2.22e-16
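The Tobit coefficients above describe the latent index, not the observed price. Under the corner solution reading, the effect of a feature on the expected observed price is the coefficient scaled by $\Phi(x^\top\beta/\sigma)$. The sketch below works this out from the m5 estimates for one hypothetical app. The covariate values in `x` are made up for illustration and are not taken from the data.

```r
# Coefficients and scale copied from the m5 output above
beta <- c("(Intercept)" = -148.91289953, Reviews = 0.00012537,
          Rating = 11.95001211, Installs = -0.00005735,
          size_mb = 0.27169760, small_app = 32.62328616)
sigma <- 74.25

# Hypothetical app (illustrative values, NOT taken from the data)
x <- c("(Intercept)" = 1, Reviews = 1000, Rating = 4.5,
       Installs = 10000, size_mb = 20, small_app = 0)

xb <- sum(beta * x[names(beta)])  # latent index x'beta
z <- xb / sigma

p_positive <- pnorm(z)                          # P(Price > 0 | x)
e_price <- pnorm(z) * xb + sigma * dnorm(z)     # E[Price | x]
pe_rating <- unname(beta["Rating"]) * pnorm(z)  # dE[Price | x] / dRating

print(c(p_positive = p_positive, e_price = e_price, pe_rating = pe_rating))
```

Because $\Phi(\cdot)$ lies strictly between 0 and 1, the partial effect on the expected observed price is always smaller in magnitude than the raw coefficient.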
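Installs is grouped into three levels (the boundaries follow the cut() call used here, with right = TRUE):
low (Installs <= 10,000)
medium (10,000 < Installs <= 1,000,000)
high (Installs > 1,000,000)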
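A quick check of where the bin edges fall. This sketch uses the same breaks and labels as the `cut()` call in this section, applied to a few made-up install counts (`installs` is a hypothetical vector, not the real data). It compares `right = TRUE`, which the section uses, with `right = FALSE`.

```r
# Hypothetical install counts, chosen to sit on and around the bin edges
installs <- c(500, 10000, 10001, 1000000, 1000001)

# right = TRUE closes each interval on the right: (-Inf, 1e4], (1e4, 1e6], (1e6, Inf)
right_closed <- cut(installs, c(-Inf, 10000, 1000000, Inf),
                    c("low", "medium", "high"), right = TRUE)

# right = FALSE closes each interval on the left: [-Inf, 1e4), [1e4, 1e6), [1e6, Inf)
left_closed <- cut(installs, c(-Inf, 10000, 1000000, Inf),
                   c("low", "medium", "high"), right = FALSE)

print(data.frame(installs, right_closed, left_closed))
```

With `right = TRUE`, an app with exactly 10,000 installs is "low" and one with exactly 1,000,000 is "medium". With `right = FALSE` they would be "medium" and "high". If install counts in the data often sit exactly on these round values, the choice can change the class sizes.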
library(MASS)  # provides polr()
# Bin Installs into a three-level factor; right = TRUE closes each interval on the right
a <- datause2$Installs
ordinal <- cut(a, breaks = c(-Inf, 10000, 1000000, Inf),
               labels = c("low", "medium", "high"), right = TRUE)
datause2$ordinal <- ordinal
df <- datause2[, -c(1, 4, 8)]
# Proportional odds logistic regression; Hess = TRUE keeps the Hessian for standard errors
m <- polr(ordinal ~ Price + Rating + size_mb + small_app, data = df, Hess = TRUE)
summary(m)
## Call:
## polr(formula = ordinal ~ Price + Rating + size_mb + small_app,
## data = df, Hess = TRUE)
##
## Coefficients:
## Value Std. Error t value
## Price -0.06542 0.012292 -5.322
## Rating 0.36387 0.041137 8.845
## size_mb 0.02600 0.000997 26.084
## small_app -0.58709 0.129758 -4.524
##
## Intercepts:
## Value Std. Error t value
## low|medium 1.3699 0.1726 7.9375
## medium|high 3.4158 0.1763 19.3716
##
## Residual Deviance: 15498.17
## AIC: 15510.17
(ctable <- coef(summary(m)))
## Value Std. Error t value
## Price -0.06542291 0.0122922065 -5.322308
## Rating 0.36386696 0.0411366445 8.845324
## size_mb 0.02600456 0.0009969589 26.083883
## small_app -0.58708625 0.1297579377 -4.524473
## low|medium 1.36991449 0.1725880531 7.937482
## medium|high 3.41579831 0.1763302744 19.371593
p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
(ctable <- cbind(ctable, "p value" = p))
## Value Std. Error t value p value
## Price -0.06542291 0.0122922065 -5.322308 1.024589e-07
## Rating 0.36386696 0.0411366445 8.845324 9.126290e-19
## size_mb 0.02600456 0.0009969589 26.083883 5.555030e-150
## small_app -0.58708625 0.1297579377 -4.524473 6.054626e-06
## low|medium 1.36991449 0.1725880531 7.937482 2.063277e-15
## medium|high 3.41579831 0.1763302744 19.371593 1.340454e-83
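To make the coefficients easier to read, the sketch below turns the estimates above into odds ratios and into class probabilities for one hypothetical app. It relies on the parameterisation `polr` uses, logit P(Y <= k) = zeta_k - eta. The covariate values in `x` are made up for illustration and are not taken from the data.

```r
# Estimates copied from the polr output above (full model)
coefs <- c(Price = -0.06542291, Rating = 0.36386696,
           size_mb = 0.02600456, small_app = -0.58708625)
zeta <- c("low|medium" = 1.36991449, "medium|high" = 3.41579831)

# Odds ratios: multiplicative change in the odds of a higher install class
# for a one-unit increase in the feature
odds_ratios <- exp(coefs)

# Hypothetical app (illustrative values, NOT taken from the data)
x <- c(Price = 0, Rating = 4, size_mb = 20, small_app = 0)
eta <- sum(coefs * x[names(coefs)])  # linear predictor

# Cumulative probabilities P(Y <= k) = plogis(zeta_k - eta), then class probabilities
cum_p <- plogis(zeta - eta)
probs <- c(low = cum_p[[1]],
           medium = cum_p[[2]] - cum_p[[1]],
           high = 1 - cum_p[[2]])

print(round(odds_ratios, 3))
print(round(probs, 3))
```

An odds ratio below 1 (Price, small_app) means the feature is associated with lower odds of falling into a higher install class. An odds ratio above 1 (Rating, size_mb) means higher odds.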
m <- polr(ordinal~ Price, data = df, Hess=TRUE)
(ctable <- coef(summary(m)))
## Value Std. Error t value
## Price -0.05944354 0.01192488 -4.984833
## low|medium -0.64332011 0.02433648 -26.434391
## medium|high 1.21310195 0.02733553 44.378205
p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
(ctable <- cbind(ctable, "p value" = p))
## Value Std. Error t value p value
## Price -0.05944354 0.01192488 -4.984833 6.201541e-07
## low|medium -0.64332011 0.02433648 -26.434391 5.516235e-154
## medium|high 1.21310195 0.02733553 44.378205 0.000000e+00
summary(m)
## Call:
## polr(formula = ordinal ~ Price, data = df, Hess = TRUE)
##
## Coefficients:
## Value Std. Error t value
## Price -0.05944 0.01192 -4.985
##
## Intercepts:
## Value Std. Error t value
## low|medium -0.6433 0.0243 -26.4344
## medium|high 1.2131 0.0273 44.3782
##
## Residual Deviance: 16440.95
## AIC: 16446.95
m <- polr(ordinal~ Price+Rating, data = df, Hess=TRUE)
(ctable <- coef(summary(m)))
## Value Std. Error t value
## Price -0.0645582 0.01222475 -5.280941
## Rating 0.4425909 0.04043985 10.944424
## low|medium 1.1880031 0.16888575 7.034360
## medium|high 3.0656636 0.17181312 17.843012
p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
(ctable <- cbind(ctable, "p value" = p))
## Value Std. Error t value p value
## Price -0.0645582 0.01222475 -5.280941 1.285220e-07
## Rating 0.4425909 0.04043985 10.944424 7.066529e-28
## low|medium 1.1880031 0.16888575 7.034360 2.001784e-12
## medium|high 3.0656636 0.17181312 17.843012 3.275527e-71
summary(m)
## Call:
## polr(formula = ordinal ~ Price + Rating, data = df, Hess = TRUE)
##
## Coefficients:
## Value Std. Error t value
## Price -0.06456 0.01222 -5.281
## Rating 0.44259 0.04044 10.944
##
## Intercepts:
## Value Std. Error t value
## low|medium 1.1880 0.1689 7.0344
## medium|high 3.0657 0.1718 17.8430
##
## Residual Deviance: 16318.54
## AIC: 16326.54
m <- polr(ordinal~ Price+Rating+small_app, data = df, Hess=TRUE)
summary(m)
## Call:
## polr(formula = ordinal ~ Price + Rating + small_app, data = df,
## Hess = TRUE)
##
## Coefficients:
## Value Std. Error t value
## Price -0.06126 0.01204 -5.090
## Rating 0.42539 0.04053 10.496
## small_app -1.13928 0.12764 -8.926
##
## Intercepts:
## Value Std. Error t value
## low|medium 1.0757 0.1695 6.3472
## medium|high 2.9672 0.1723 17.2189
##
## Residual Deviance: 16232.30
## AIC: 16242.30
(ctable <- coef(summary(m)))
## Value Std. Error t value
## Price -0.06126175 0.01203559 -5.090050
## Rating 0.42538543 0.04052946 10.495709
## small_app -1.13928306 0.12763955 -8.925784
## low|medium 1.07570076 0.16947749 6.347160
## medium|high 2.96719417 0.17232198 17.218895
p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
(ctable <- cbind(ctable, "p value" = p))
## Value Std. Error t value p value
## Price -0.06126175 0.01203559 -5.090050 3.579683e-07
## Rating 0.42538543 0.04052946 10.495709 9.039647e-26
## small_app -1.13928306 0.12763955 -8.925784 4.425521e-19
## low|medium 1.07570076 0.16947749 6.347160 2.193258e-10
## medium|high 2.96719417 0.17232198 17.218895 1.916104e-66
m <- polr(ordinal~ Price+Rating+small_app+size_mb, data = df, Hess=TRUE)
summary(m)
## Call:
## polr(formula = ordinal ~ Price + Rating + small_app + size_mb,
## data = df, Hess = TRUE)
##
## Coefficients:
## Value Std. Error t value
## Price -0.06543 0.012288 -5.325
## Rating 0.36376 0.041181 8.833
## small_app -0.58784 0.129778 -4.530
## size_mb 0.02601 0.000997 26.085
##
## Intercepts:
## Value Std. Error t value
## low|medium 1.3695 0.1727 7.9301
## medium|high 3.4153 0.1765 19.3529
##
## Residual Deviance: 15498.17
## AIC: 15510.17
(ctable <- coef(summary(m)))
## Value Std. Error t value
## Price -0.06542820 0.0122879694 -5.324574
## Rating 0.36375788 0.0411810589 8.833136
## small_app -0.58784489 0.1297776958 -4.529630
## size_mb 0.02600585 0.0009969815 26.084583
## low|medium 1.36945439 0.1726906498 7.930102
## medium|high 3.41530378 0.1764750775 19.352896
p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
(ctable <- cbind(ctable, "p value" = p))
## Value Std. Error t value p value
## Price -0.06542820 0.0122879694 -5.324574 1.011901e-07
## Rating 0.36375788 0.0411810589 8.833136 1.017814e-18
## small_app -0.58784489 0.1297776958 -4.529630 5.908718e-06
## size_mb 0.02600585 0.0009969815 26.084583 5.454428e-150
## low|medium 1.36945439 0.1726906498 7.930102 2.189668e-15
## medium|high 3.41530378 0.1764750775 19.352896 1.927048e-83
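The four nested polr models can be compared with likelihood ratio tests, since the residual deviance reported by `summary(m)` is minus twice the log-likelihood. The sketch below uses only the deviances printed above. The vector name `deviance_by_model` is introduced here for illustration.

```r
# Residual deviances copied from the summary(m) outputs above
deviance_by_model <- c(
  "Price"                                = 16440.95,
  "Price + Rating"                       = 16318.54,
  "Price + Rating + small_app"           = 16232.30,
  "Price + Rating + small_app + size_mb" = 15498.17
)

# Each step adds one coefficient, so each comparison has 1 degree of freedom
lr_stat <- -diff(deviance_by_model)
p_value <- pchisq(lr_stat, df = 1, lower.tail = FALSE)

print(data.frame(added = c("Rating", "small_app", "size_mb"),
                 lr_stat = unname(lr_stat),
                 p_value = unname(p_value)))
```

The same comparison is visible in the AIC values: each added feature lowers the AIC, with the largest drop coming from size_mb.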