We have 16 variables:
"id": App ID
"track_name": App name
"size_bytes": Size (in bytes)
"currency": Currency type
"price": Price amount
"rating_count_tot": User rating count (all versions)
"rating_count_ver": User rating count (current version)
"user_rating": Average user rating (all versions)
"user_rating_ver": Average user rating (current version)
"ver": Latest version code
"cont_rating": Content rating, i.e. the age group the app is suitable for. It has 4 levels:
4+, 9+, 12+, 17+
"prime_genre": Primary genre
"sup_devices.num": Number of supported devices
"ipadSc_urls.num": Number of screenshots shown for display (can be viewed as a showcase of the app's features)
"lang.num": Number of supported languages
"vpp_lic": VPP device-based licensing enabled
Note: Apple's Volume Purchase Program (VPP) is a service that lets organizations enrolled in Apple VPP buy iOS apps in bulk, but not at a discounted price. It is presumably intended mainly for large corporate purchases. This variable is binary.
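Of these, id and track_name carry no useful information, every app is priced in US dollars (so currency is constant), and ver is also dropped because version numbering is too inconsistent across developers.
That leaves 12 variables, of which only prime_genre, vpp_lic, and cont_rating are categorical; the others are treated as continuous.
I then add a dummy variable indicating whether the app is paid.
Because bytes are not a commonly used unit, the size is converted to MB.
Unfortunately, the number of downloads per app is not published.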
# Paid/free indicator derived from price
ios$charge <- as.factor(ifelse(ios$price > 0, "paid", "free"))
ios$cont_rating <- as.factor(ios$cont_rating)
ios$prime_genre <- as.factor(ios$prime_genre)
# Size in decimal megabytes (1 MB = 1,000,000 bytes)
ios$size_MB <- ios$size_bytes / 1000000
# Drop the first column (size_bytes), which size_MB now replaces
ios <- ios[, -1]
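The two derived columns can be checked on a small example. The sketch below uses a made-up data frame named `apps`; only the column names `price` and `size_bytes` come from the data set, and the values are hypothetical.

```r
# Hypothetical toy data; only the column names mirror the App Store data.
apps <- data.frame(
  price      = c(0, 0.99, 0, 4.99),
  size_bytes = c(100788224, 158578688, 5900000, 2500000000)
)

# Paid/free indicator: any strictly positive price counts as "paid".
apps$charge <- factor(ifelse(apps$price > 0, "paid", "free"),
                      levels = c("free", "paid"))

# Decimal megabytes (1 MB = 1e6 bytes), matching the division by 1000000.
apps$size_MB <- apps$size_bytes / 1e6

table(apps$charge)
```

Dividing by `1024^2` instead would give mebibytes; the analysis here uses the decimal convention.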
Here we can see the full data set. At this point we have 13 variables and 7197 observations, and the data contain no missing values.
Display all of the data
DT::datatable(ios, options = list(
pageLength=50, scrollX='400px'), filter = 'top')
# count the NA values in the whole data frame
sum(is.na(ios))
## [1] 0
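A single total of zero is enough here, but a per-column count is often more useful when some values are missing. A minimal sketch on a hypothetical data frame `toy` (the values are invented for illustration):

```r
# Hypothetical toy data with a few deliberately missing values.
toy <- data.frame(
  price       = c(0, 1.99, NA, 0),
  user_rating = c(4.5, NA, 3.0, NA)
)

total_missing  <- sum(is.na(toy))       # one number for the whole table
missing_by_col <- colSums(is.na(toy))   # one count per column

missing_by_col
```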
Descriptive statistics for our variables
m<- summary(ios)
knitr::kable(m)
| price | rating_count_tot | rating_count_ver | user_rating | user_rating_ver | cont_rating | prime_genre | sup_devices.num | ipadSc_urls.num | lang.num | vpp_lic | charge | size_MB |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 0.000 | Min. : 0 | Min. : 0.0 | Min. :0.000 | Min. :0.000 | 12+:1155 | Games :3862 | Min. : 9.00 | Min. :0.000 | Min. : 0.000 | Min. :0.0000 | free:4056 | Min. : 0.59 |
| 1st Qu.: 0.000 | 1st Qu.: 28 | 1st Qu.: 1.0 | 1st Qu.:3.500 | 1st Qu.:2.500 | 17+: 622 | Entertainment : 535 | 1st Qu.:37.00 | 1st Qu.:3.000 | 1st Qu.: 1.000 | 1st Qu.:1.0000 | paid:3141 | 1st Qu.: 46.92 |
| Median : 0.000 | Median : 300 | Median : 23.0 | Median :4.000 | Median :4.000 | 4+ :4433 | Education : 453 | Median :37.00 | Median :5.000 | Median : 1.000 | Median :1.0000 | NA | Median : 97.15 |
| Mean : 1.726 | Mean : 12893 | Mean : 460.4 | Mean :3.527 | Mean :3.254 | 9+ : 987 | Photo & Video : 349 | Mean :37.36 | Mean :3.707 | Mean : 5.435 | Mean :0.9931 | NA | Mean : 199.13 |
| 3rd Qu.: 1.990 | 3rd Qu.: 2793 | 3rd Qu.: 140.0 | 3rd Qu.:4.500 | 3rd Qu.:4.500 | NA | Utilities : 248 | 3rd Qu.:38.00 | 3rd Qu.:5.000 | 3rd Qu.: 8.000 | 3rd Qu.:1.0000 | NA | 3rd Qu.: 181.93 |
| Max. :299.990 | Max. :2974676 | Max. :177050.0 | Max. :5.000 | Max. :5.000 | NA | Health & Fitness: 180 | Max. :47.00 | Max. :5.000 | Max. :75.000 | Max. :1.0000 | NA | Max. :4025.97 |
| NA | NA | NA | NA | NA | NA | (Other) :1570 | NA | NA | NA | NA | NA | NA |
library(corrplot)
## corrplot 0.84 loaded
df<- ios[,-c(6,7,12)]
df <- as.matrix(df)
M<- cor(df)
corrplot(M, method="circle")
corrplot(M, method="number")
The plots show that, apart from the current-version user rating and the all-version user rating, no pair of variables has a strong linear correlation.
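Reading the strongest pair off a correlation plot can also be done programmatically. The sketch below is a hedged illustration on simulated data; the data frame `toy` and its columns `x1`, `x2`, `x3` are hypothetical and are not part of the app data.

```r
set.seed(1)
# Hypothetical data: x2 is built to track x1; x3 is independent noise.
n  <- 200
x1 <- rnorm(n)
toy <- data.frame(x1 = x1, x2 = x1 + rnorm(n, sd = 0.3), x3 = rnorm(n))

M <- cor(toy)

# Keep each pair once by blanking the diagonal and the lower triangle.
M_upper <- M
M_upper[lower.tri(M_upper, diag = TRUE)] <- NA

# Locate the largest absolute correlation among the remaining pairs.
idx <- which(abs(M_upper) == max(abs(M_upper), na.rm = TRUE), arr.ind = TRUE)
strongest <- c(rownames(M)[idx[1, "row"]], colnames(M)[idx[1, "col"]])
strongest
```

Note that `cor()` measures linear association only, so a small coefficient does not rule out a nonlinear relationship.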
m1<- table(ios$cont_rating)
table(ios$cont_rating)
##
## 12+ 17+ 4+ 9+
## 1155 622 4433 987
ios$cont_rating <- factor(ios$cont_rating,levels = c("4+", "9+", "12+", "17+"))
m1<- table(ios$cont_rating)
barplot(m1)
knitr::kable(m1)
| Var1 | Freq |
|---|---|
| 4+ | 4433 |
| 9+ | 987 |
| 12+ | 1155 |
| 17+ | 622 |
Most apps are rated 4+, i.e. designed to be suitable for anyone aged four and above.
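The first table above lists "12+" and "17+" before "4+" because factor levels default to string order. A minimal sketch of the reordering step, using a hypothetical vector `ratings`:

```r
# Hypothetical sample of content ratings.
ratings <- c("4+", "12+", "9+", "17+", "4+", "4+")

# Default: levels are sorted as strings, so "12+" and "17+" precede "4+".
default_order <- levels(factor(ratings))

# Explicit levels restore the age order used for tables and bar plots.
age_order <- c("4+", "9+", "12+", "17+")
ratings_f <- factor(ratings, levels = age_order)

table(ratings_f)
```

Setting the levels explicitly also fixes the order of the bars in `barplot()` and the baseline category used in regression dummies.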
1. Which variables affect an app's rating?
2. Do paid apps receive better ratings?
3. What is the pricing trend for most apps?
m1 <- lm(user_rating ~. ,ios)
m2<- summary(m1)
data.frame(summary(m1)$coef[summary(m1)$coef[,4] <= .05, 4])
## summary.m1..coef.summary.m1..coef...4.....0.05..4.
## (Intercept) 0.000885712962934446244
## price 0.017149274322532933462
## user_rating_ver 0.000000000000000000000
## cont_rating12+ 0.030125854628497372723
## prime_genreBusiness 0.008332204043162476370
## prime_genreEducation 0.013074676199248123562
## prime_genreEntertainment 0.008985110250370754656
## prime_genreFinance 0.008459562324483195853
## prime_genreFood & Drink 0.000205487435224537907
## prime_genreGames 0.011396798805536760210
## prime_genreHealth & Fitness 0.000255245793004355188
## prime_genreLifestyle 0.008410693519466458032
## prime_genreMusic 0.006334849962495827866
## prime_genreNews 0.001242261154232396098
## prime_genrePhoto & Video 0.000063520404373961997
## prime_genreProductivity 0.000544045272050046173
## prime_genreShopping 0.000000000149900693008
## prime_genreSocial Networking 0.007537393754674120809
## prime_genreSports 0.046849439600335426870
## prime_genreTravel 0.000762921294891544624
## prime_genreUtilities 0.003950537294257992597
## prime_genreWeather 0.008810318304743505746
## sup_devices.num 0.042884258425487840893
## ipadSc_urls.num 0.000000000000001649702
## lang.num 0.001047715801772405907
## vpp_lic 0.000409720710488918757
## chargepaid 0.031491392856383318422
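The one-column output above keeps only the p-values. A slightly more readable variant keeps the estimates as well and sorts by significance. This is a sketch on the built-in `mtcars` data, which merely stands in for the app data; the object names `fit`, `coefs`, and `signif_coefs` are illustrative.

```r
# `mtcars` ships with R; it stands in for the app data here.
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

coefs <- as.data.frame(coef(summary(fit)))
names(coefs) <- c("estimate", "std_error", "t_value", "p_value")

# Keep rows significant at the 5% level, smallest p-value first.
signif_coefs <- coefs[coefs$p_value <= 0.05, ]
signif_coefs <- signif_coefs[order(signif_coefs$p_value), ]
signif_coefs
```

For a factor such as prime_genre, each dummy coefficient compares one level with the baseline level, so a significant row means "different from the baseline genre", not "important in general".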
set.seed(1)
rf<- ranger(user_rating~. ,ios, quantreg = TRUE,importance='impurity')
rf$variable.importance %>%
  as.matrix() %>%
  as.data.frame() %>%
  tibble::rownames_to_column() %>%  # add_rownames() is deprecated in dplyr
  `colnames<-`(c("varname","imp")) %>%
  arrange(desc(imp)) %>%
  top_n(12,wt = imp) %>%  # the model has 12 predictors, matching the title
  ggplot(mapping = aes(x = reorder(varname, imp), y = imp)) +
  geom_col() +
  coord_flip() +
  ggtitle(label = "Top 12 important variables") +
  theme(
    axis.title = element_blank()
  )
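The random forest importance plot ranks the variables by their influence on the user rating; the top three can be read directly from the top of the chart.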
qplot(user_rating, data = ios, geom = "density",
fill = charge, alpha = I(.5),
main="Distribution of App rating",
xlab="Rating",
ylab="Density")
mean(ios$user_rating)
## [1] 3.526956
mean(ios$user_rating[which(ios$charge=="paid" )])
## [1] 3.720949
mean(ios$user_rating[which(ios$charge=="free" )])
## [1] 3.376726
The mean rating across all apps is 3.526956; the mean for paid apps is 3.720949 and for free apps 3.376726.
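Group means like these can be computed in one call rather than one subset per group. A minimal sketch with a hypothetical data frame `toy`; the column names mirror those used in the analysis, the values are invented.

```r
# Hypothetical ratings; the column names mirror those used in the analysis.
toy <- data.frame(
  user_rating = c(4.5, 3.0, 4.0, 5.0, 2.5, 3.5),
  charge      = factor(c("paid", "free", "paid", "paid", "free", "free"))
)

# One mean per level of `charge`.
group_means <- tapply(toy$user_rating, toy$charge, mean)
group_means
```

Because the grouping column is referenced by name, a renamed or missing column raises an error or returns `NULL` early instead of silently yielding `NaN`.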
# Compute the analysis of variance
res.aov <- aov(user_rating ~ charge, data = ios)
# Summary of the analysis
summary(res.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## charge 1 210 209.75 92.18 <0.0000000000000002 ***
## Residuals 7195 16371 2.28
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The one-way ANOVA table also tells us that, if the rating reflects an app's quality, paid apps are of statistically significantly higher quality than free apps.
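With only two groups, a one-way ANOVA is equivalent to a pooled two-sample t-test: the F statistic equals the squared t statistic. The sketch below checks this on simulated data (the data frame `toy` and its values are hypothetical) and also runs Welch's test, which does not assume equal variances.

```r
set.seed(123)
# Simulated ratings for two hypothetical groups of apps.
toy <- data.frame(
  rating = c(rnorm(60, mean = 3.4, sd = 1.5), rnorm(40, mean = 3.7, sd = 1.5)),
  charge = factor(rep(c("free", "paid"), times = c(60, 40)))
)

f_stat <- summary(aov(rating ~ charge, data = toy))[[1]][["F value"]][1]
t_stat <- unname(t.test(rating ~ charge, data = toy, var.equal = TRUE)$statistic)

# With two groups the ANOVA F statistic is the squared pooled t statistic.
all.equal(f_stat, t_stat^2)

# Welch's test drops the equal-variance assumption.
welch <- t.test(rating ~ charge, data = toy)
welch$p.value
```

A significant difference in means is still only an association; it does not by itself show that charging for an app causes higher ratings.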
sum(is.na(ios$price))
## [1] 0
#there is no NA in price
#we draw the ecdf of this data
plot(ecdf(ios$price ))
object<- table(ios$price )
barplot(log(object))
#plot(sort(unique(applestore$price)) ,log(object) )
#log(table(applestore$price ))
#qplot(price,data=applestore,geom="histogram" )
#qplot(price,data=applestore,geom="histogram",log = "y")
#plot(applestore$price, log="y", type='histogram')
Most apps are clearly free, and the price distribution shows a roughly exponential decay.
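The share of free apps can be read from the ECDF at zero; in the summary table, 4056 of 7197 apps (about 56%) are free. A minimal sketch of that reading and of the log-scaled counts, using a hypothetical `price` vector:

```r
# Hypothetical prices with many zeros, mimicking a mostly-free catalogue.
price <- c(0, 0, 0, 0, 0, 0.99, 0.99, 1.99, 2.99, 9.99)

# The ECDF evaluated at 0 is the proportion of free apps.
price_ecdf <- ecdf(price)
share_free <- price_ecdf(0)

# Counts per price point; log10 makes the rare, expensive tiers visible.
price_counts <- table(price)
log_counts   <- log10(as.vector(price_counts))

share_free
```

A price tier with a single app has a log count of zero, so its bar has no height on the log scale even though the tier exists.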
fm.tobit <- tobit(price~.-cont_rating-charge-prime_genre,
data = ios)
summary(fm.tobit)
##
## Call:
## tobit(formula = price ~ . - cont_rating - charge - prime_genre,
## data = ios)
##
## Observations:
## Total Left-censored Uncensored Right-censored
## 7197 4056 3141 0
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.048136359 2.049136950 0.512 0.60900
## rating_count_tot -0.000050599 0.000005149 -9.828 < 0.0000000000000002 ***
## rating_count_ver 0.000025065 0.000038847 0.645 0.51879
## user_rating 0.655692302 0.142783884 4.592 0.000004386 ***
## user_rating_ver 0.141256047 0.117381845 1.203 0.22883
## sup_devices.num -0.164158253 0.033108869 -4.958 0.000000712 ***
## ipadSc_urls.num 0.219992391 0.071121613 3.093 0.00198 **
## lang.num -0.046153546 0.017895840 -2.579 0.00991 **
## vpp_lic -2.357705841 1.535457318 -1.536 0.12466
## size_MB 0.004816440 0.000340466 14.147 < 0.0000000000000002 ***
## Log(scale) 2.257355544 0.013342801 169.182 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Scale: 9.558
##
## Gaussian distribution
## Number of Newton-Raphson Iterations: 4
## Log-likelihood: -1.365e+04 on 11 Df
## Wald-statistic: 471.4 on 9 Df, p-value: < 0.000000000000000222
#colnames(ios)
# This can be reported to the instructor; see Wooldridge's textbook
fm.tobit$scale
## [1] 9.557781
exp(2.257355544)
## [1] 9.557781
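The check above shows that the reported scale is exp(Log(scale)): the model is estimated on log sigma so that sigma stays positive. The sketch below writes out the same left-censored normal likelihood in base R on simulated data. Everything in it (the simulated data, the function `negll`, the true values 1, 2 and 1.5) is hypothetical, and it is not how `AER::tobit` works internally, which is a wrapper around `survival::survreg`.

```r
set.seed(42)
# Simulated latent outcome, left-censored at zero as in a tobit model.
n      <- 2000
x      <- rnorm(n)
y_star <- 1 + 2 * x + rnorm(n, sd = 1.5)
y      <- pmax(y_star, 0)

# Negative log-likelihood; the third parameter is log(sigma).
negll <- function(par) {
  mu    <- par[1] + par[2] * x
  sigma <- exp(par[3])
  cens  <- y <= 0
  ll <- ifelse(cens,
               pnorm(-mu / sigma, log.p = TRUE),            # P(y* <= 0)
               dnorm(y, mean = mu, sd = sigma, log = TRUE)) # density of observed y
  -sum(ll)
}

fit <- optim(c(0, 0, 0), negll, method = "BFGS")
est <- c(b0 = fit$par[1], b1 = fit$par[2], sigma = exp(fit$par[3]))
est
```

The coefficients describe the latent variable, not the observed price; the effect on the observed outcome is smaller because part of the change is absorbed by the censoring at zero.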
require(AER)
require(wooldridge)
## Loading required package: wooldridge
require(npsf)
## Loading required package: npsf
## Loading required package: Formula
## Loading required package: randtoolbox
## Loading required package: rngWELL
## This is randtoolbox. For an overview, type 'help("randtoolbox")'.
## Loading required package: sfsmisc
##
## Attaching package: 'sfsmisc'
## The following object is masked from 'package:rminer':
##
## factorize
## The following object is masked from 'package:dplyr':
##
## last
data(mroz)
names(mroz)
## [1] "inlf" "hours" "kidslt6" "kidsge6" "age" "educ"
## [7] "wage" "repwage" "hushrs" "husage" "huseduc" "huswage"
## [13] "faminc" "mtr" "motheduc" "fatheduc" "unem" "city"
## [19] "exper" "nwifeinc"
fm.tobit <- tobit(hours~nwifeinc+educ+exper+I(exper^2)+age+kidslt6+kidsge6,
data = mroz)
summary(fm.tobit)
##
## Call:
## tobit(formula = hours ~ nwifeinc + educ + exper + I(exper^2) +
## age + kidslt6 + kidsge6, data = mroz)
##
## Observations:
## Total Left-censored Uncensored Right-censored
## 753 325 428 0
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 965.30528 446.43614 2.162 0.030599 *
## nwifeinc -8.81424 4.45910 -1.977 0.048077 *
## educ 80.64561 21.58324 3.736 0.000187 ***
## exper 131.56430 17.27939 7.614 0.00000000000002659 ***
## I(exper^2) -1.86416 0.53766 -3.467 0.000526 ***
## age -54.40501 7.41850 -7.334 0.00000000000022390 ***
## kidslt6 -894.02174 111.87804 -7.991 0.00000000000000134 ***
## kidsge6 -16.21800 38.64139 -0.420 0.674701
## Log(scale) 7.02289 0.03706 189.514 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Scale: 1122
##
## Gaussian distribution
## Number of Newton-Raphson Iterations: 4
## Log-likelihood: -3819 on 9 Df
## Wald-statistic: 253.9 on 7 Df, p-value: < 0.000000000000000222
Word cloud results
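library(jiebaR)  # provides worker() for word segmentation
mixseg<-worker()
# dff must still contain track_name (that column was dropped from ios)
seg <- mixseg[dff$track_name]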
segA<-data.frame(table(seg))
segC<-data.frame(table(seg[nchar(seg)>1]))#data.frame
segC_top50<-head(segC[order(segC$Freq,decreasing = TRUE),],50)
library(wordcloud)
## Loading required package: RColorBrewer
par(family=("Heiti TC Light"))
wordcloud(
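words = segC_top50[,1], # or segC_top50$Var1
freq = segC_top50$Freq,
scale = c(4,.1), # range of font sizes (vector: largest, smallest)
random.order = FALSE,# plot words in decreasing frequency, not randomly
ordered.colors = FALSE,# do not assign colors in word order
rot.per = 0,# proportion of rotated words; 0 disables rotation
min.freq = 7,# minimum frequency for a word to be plotted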
colors = brewer.pal(8,"Dark2")
)
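The frequency table that feeds the word cloud can be illustrated without a segmentation package. The sketch below uses base R only and a hypothetical vector `app_names`; it splits on non-alphanumeric characters, so unlike jiebaR it handles only ASCII names and cannot segment Chinese text.

```r
# Hypothetical app names, used only for illustration.
app_names <- c("Photo Editor Pro", "Puzzle Game", "Photo Collage Maker",
               "Game Box", "Weather Pro")

# Lower-case, split on anything that is not a letter or digit, then flatten.
tokens <- unlist(strsplit(tolower(app_names), "[^a-z0-9]+"))

# Drop empty strings and one-character tokens, as the word cloud code does.
tokens <- tokens[nchar(tokens) > 1]

# Frequency table sorted from most to least common.
freq <- sort(table(tokens), decreasing = TRUE)
head(freq, 3)
```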