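1 Variable overview

The content rating (cont_rating) takes four levels: 4+, 9+, 12+, and 17+.

Note: The Apple Volume Purchase Program (VPP) is a service that lets organizations enrolled in Apple VPP buy iOS apps in bulk, though not at a discounted price. It is presumably aimed mainly at large corporate purchases. This variable (vpp_lic) is binary.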

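2 Variable processing

Neither id nor the app name contains useful information, and every app is priced in US dollars, so the currency adds nothing either. The ver (version) column is also dropped because version numbering is too inconsistent across developers.

I therefore keep only 12 variables. Of these, only prime_genre, vpp_lic, and cont_rating are categorical; the rest are treated as continuous.

I then add a dummy variable indicating whether the app is paid or free.

Because bytes are not a convenient unit, the app size is converted to MB.

Unfortunately, the dataset does not include download counts.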

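# Dummy variable: "paid" if price > 0, otherwise "free"
ios$charge <- as.factor(ifelse(ios$price > 0, "paid", "free"))

# Treat the content rating and the genre as categorical
ios$cont_rating <- as.factor(ios$cont_rating)
ios$prime_genre <- as.factor(ios$prime_genre)

# Convert the size from bytes to MB, then drop the first column (size_bytes)
ios$size_MB <- ios$size_bytes / 1000000
ios <- ios[, -1]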

At this point the data have 13 variables and 7197 observations, with no missing values.

The full dataset is shown below.

DT::datatable(ios, options = list(
pageLength=50, scrollX='400px'), filter = 'top')
# Count missing values across the whole data frame
sum(is.na(ios))
## [1] 0

Descriptive statistics for the variables

m<- summary(ios)

knitr::kable(m)
Variable            Min.   1st Qu.   Median      Mean   3rd Qu.       Max.
price              0.000     0.000    0.000     1.726     1.990    299.990
rating_count_tot       0        28      300     12893      2793    2974676
rating_count_ver     0.0       1.0     23.0     460.4     140.0   177050.0
user_rating        0.000     3.500    4.000     3.527     4.500      5.000
user_rating_ver    0.000     2.500    4.000     3.254     4.500      5.000
sup_devices.num     9.00     37.00    37.00     37.36     38.00      47.00
ipadSc_urls.num    0.000     3.000    5.000     3.707     5.000      5.000
lang.num           0.000     1.000    1.000     5.435     8.000     75.000
vpp_lic           0.0000    1.0000   1.0000    0.9931    1.0000     1.0000
size_MB             0.59     46.92    97.15    199.13    181.93    4025.97

cont_rating: 12+ 1155, 17+ 622, 4+ 4433, 9+ 987
prime_genre: Games 3862, Entertainment 535, Education 453, Photo & Video 349, Utilities 248, Health & Fitness 180, (Other) 1570
charge: free 4056, paid 3141

3 Data visualization

3.1 Correlation matrix

library(corrplot)
## corrplot 0.84 loaded
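# Drop the categorical columns (cont_rating, prime_genre, charge) before computing correlations
df <- ios[, -c(6, 7, 12)]
df <- as.matrix(df)
M <- cor(df)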

corrplot(M, method="circle")

corrplot(M, method="number")

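Apart from the current-version user rating (user_rating_ver) and the all-version user rating (user_rating), no pair of variables shows a clear linear relationship.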
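Reading the strongest pairs off a correlation plot by eye is error-prone, so a ranked list can help. The following is a minimal sketch, assuming the built-in mtcars data as a stand-in for the numeric columns of ios; the names M_demo, pair_df, and ranked are illustrative only.

```r
# Built-in mtcars stands in for the numeric columns of ios (an assumption for illustration)
M_demo <- cor(mtcars[, c("mpg", "wt", "hp", "qsec")])

# Keep each pair once by blanking the diagonal and the lower triangle
M_demo[lower.tri(M_demo, diag = TRUE)] <- NA

# Reshape to one row per variable pair and drop the blanked entries
pair_df <- as.data.frame(as.table(M_demo))
pair_df <- pair_df[!is.na(pair_df$Freq), ]
names(pair_df) <- c("var1", "var2", "r")

# Rank the pairs from the strongest to the weakest absolute correlation
ranked <- pair_df[order(-abs(pair_df$r)), ]
print(ranked)
```

Applied to the matrix M above, the same steps would put the pair of rating variables at the top of the list.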

3.2 Age ratings of most apps

m1<- table(ios$cont_rating)

table(ios$cont_rating)
## 
##  12+  17+   4+   9+ 
## 1155  622 4433  987
ios$cont_rating <- factor(ios$cont_rating,levels = c("4+", "9+", "12+", "17+"))
m1<- table(ios$cont_rating)
barplot(m1)

knitr::kable(m1)
Var1 Freq
4+ 4433
9+ 987
12+ 1155
17+ 622

Most apps are rated 4+, that is, designed to be suitable for users aged four and above.
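To put a number on "most", the counts can be turned into percentage shares. This is a small sketch that re-enters the four counts from the table above by hand; with the data loaded, prop.table(table(ios$cont_rating)) should give the same shares.

```r
# Counts copied from the frequency table above
counts <- c("4+" = 4433, "9+" = 987, "12+" = 1155, "17+" = 622)

# Percentage share of each content rating, rounded to one decimal place
shares <- round(100 * prop.table(counts), 1)
print(shares)  # the 4+ share comes out at 61.6
```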

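4 Questions of interest

1. Which variables affect an app's rating?

2. Do paid apps receive better ratings?

3. What is the pricing pattern of most apps?

4.1 Which variables affect an app's rating?

4.1.1 A simple linear regression first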

# Regress user_rating on all other variables
m1 <- lm(user_rating ~ ., data = ios)
m2 <- summary(m1)

# List the p-values of the coefficients that are significant at the 5% level
data.frame(p_value = m2$coefficients[m2$coefficients[, 4] <= .05, 4])
##                                                                          p_value
## (Intercept)                                             0.000885712962934446244
## price                                                   0.017149274322532933462
## user_rating_ver                                         0.000000000000000000000
## cont_rating12+                                          0.030125854628497372723
## prime_genreBusiness                                     0.008332204043162476370
## prime_genreEducation                                    0.013074676199248123562
## prime_genreEntertainment                                0.008985110250370754656
## prime_genreFinance                                      0.008459562324483195853
## prime_genreFood & Drink                                 0.000205487435224537907
## prime_genreGames                                        0.011396798805536760210
## prime_genreHealth & Fitness                             0.000255245793004355188
## prime_genreLifestyle                                    0.008410693519466458032
## prime_genreMusic                                        0.006334849962495827866
## prime_genreNews                                         0.001242261154232396098
## prime_genrePhoto & Video                                0.000063520404373961997
## prime_genreProductivity                                 0.000544045272050046173
## prime_genreShopping                                     0.000000000149900693008
## prime_genreSocial Networking                            0.007537393754674120809
## prime_genreSports                                       0.046849439600335426870
## prime_genreTravel                                       0.000762921294891544624
## prime_genreUtilities                                    0.003950537294257992597
## prime_genreWeather                                      0.008810318304743505746
## sup_devices.num                                         0.042884258425487840893
## ipadSc_urls.num                                         0.000000000000001649702
## lang.num                                                0.001047715801772405907
## vpp_lic                                                 0.000409720710488918757
## chargepaid                                              0.031491392856383318422
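The p-values alone do not show the direction or size of each effect. As a hedged sketch, the same filter can keep the whole coefficient row (estimate, standard error, t value, p-value). The example uses the built-in mtcars data as a stand-in for ios, and the names fit_demo, coefs, and sig are illustrative.

```r
# mtcars stands in for ios; mpg plays the role of user_rating (illustration only)
fit_demo <- lm(mpg ~ wt + hp + qsec + factor(cyl), data = mtcars)

# Full coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- summary(fit_demo)$coefficients

# Keep the rows significant at the 5% level; drop = FALSE preserves the matrix shape
sig <- coefs[coefs[, "Pr(>|t|)"] <= 0.05, , drop = FALSE]
print(round(sig, 4))
```

For factor predictors such as prime_genre, each row is a contrast against the reference level (the first factor level), so a significant genre coefficient means that genre differs from the reference genre, not from the overall average.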

4.1.2 A random forest to see which variables affect the rating

library(ranger)
library(dplyr)
library(ggplot2)

set.seed(1)

# Fit a random forest for user_rating and record impurity-based importance
rf <- ranger(user_rating ~ ., data = ios, quantreg = TRUE, importance = "impurity")

# Plot all 12 predictors in decreasing order of importance
tibble::enframe(rf$variable.importance, name = "varname", value = "imp") %>%
  arrange(desc(imp)) %>%
  ggplot(mapping = aes(x = reorder(varname, imp), y = imp)) +
  geom_col() +
  coord_flip() +
  ggtitle(label = "Top 12 important variables") +
  theme(
    axis.title = element_blank()
  )

The three variables that matter most for the user rating can be read off the top of the random forest importance plot.
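The top-ranked variables can also be extracted directly instead of being read from the plot. The sketch below assumes the ranger package is installed and uses the built-in mtcars data in place of ios; rf_demo and top3 are illustrative names, and the variables it returns say nothing about the App Store data.

```r
library(ranger)

# mtcars stands in for ios; mpg plays the role of user_rating (illustration only)
rf_demo <- ranger(mpg ~ ., data = mtcars, importance = "impurity", seed = 1)

# Sort the importance scores and keep the three largest
top3 <- head(sort(rf_demo$variable.importance, decreasing = TRUE), 3)
print(names(top3))
```

Replacing rf_demo with the rf object fitted above would list the three leading predictors of user_rating.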

4.2 Do paid apps receive better ratings?

qplot(user_rating, data = ios, geom = "density",
  fill = charge, alpha = I(.5),
  main="Distribution of App rating",
  xlab="Rating",
  ylab="Density")

mean(ios$user_rating)
## [1] 3.526956
mean(ios$user_rating[ios$charge == "paid"])
## [1] 3.720949
mean(ios$user_rating[ios$charge == "free"])
## [1] 3.376726

The mean rating across all apps is 3.526956; paid apps average 3.720949 and free apps 3.376726.

# Compute the analysis of variance
res.aov <- aov(user_rating ~ charge, data = ios)
# Summary of the analysis
summary(res.aov)
##               Df Sum Sq Mean Sq F value              Pr(>F)    
## charge         1    210  209.75   92.18 <0.0000000000000002 ***
## Residuals   7195  16371    2.28                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

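The one-way ANOVA table confirms that the difference in mean rating between paid and free apps is statistically significant. If ratings reflect app quality, paid apps are therefore of significantly higher quality than free apps.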
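With only two groups, a one-way ANOVA is equivalent to a pooled-variance two-sample t-test: the F statistic equals the squared t statistic. The sketch below checks this on the built-in mtcars data, which stands in for ios here (mpg for user_rating, transmission type for charge); all names are illustrative.

```r
# mtcars stands in for ios: mpg plays the role of user_rating, am the role of charge
d <- data.frame(
  y   = mtcars$mpg,
  grp = factor(mtcars$am, labels = c("automatic", "manual"))
)

# Group means show the direction of the difference, which the ANOVA table does not
group_means <- tapply(d$y, d$grp, mean)

# F statistic from the one-way ANOVA
f_stat <- summary(aov(y ~ grp, data = d))[[1]][["F value"]][1]

# t statistic from the pooled-variance two-sample t-test
t_stat <- unname(t.test(y ~ grp, data = d, var.equal = TRUE)$statistic)

print(group_means)
print(c(F = f_stat, t_squared = t_stat^2))
```

The ANOVA only says the means differ; it is the group means that show the higher mean belongs to paid apps.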

4.3 What is the pricing pattern of most apps?

sum(is.na(ios$price))
## [1] 0
#there is no NA in price
#we draw the ecdf of this data

plot(ecdf(ios$price  ))

object<- table(ios$price  )
barplot(log(object))

#plot(sort(unique(applestore$price)) ,log(object)     )

#log(table(applestore$price  ))
#qplot(price,data=applestore,geom="histogram"     )

#qplot(price,data=applestore,geom="histogram",log = "y")

#plot(applestore$price, log="y", type='histogram')

Most apps are clearly free, and among the rest the number of apps falls off roughly exponentially as the price rises.
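Both statements can be checked numerically: the ECDF evaluated at 0 gives the share of free apps, and a negative slope when regressing log counts on price indicates roughly exponential decay. The sketch below uses a small hypothetical price vector, not the App Store data; price_demo, Fn, and decay_fit are illustrative names.

```r
# Hypothetical toy prices (not the App Store data): mostly free, a few paid
price_demo <- c(rep(0, 6), 0.99, 0.99, 0.99, 1.99, 1.99, 2.99, 4.99)

# The ECDF at 0 is the share of free apps
Fn <- ecdf(price_demo)
share_free <- Fn(0)

# Number of apps at each distinct price
counts <- table(price_demo)
price_levels <- as.numeric(names(counts))
log_counts <- log(as.numeric(counts))

# A negative slope means the counts shrink as the price rises
decay_fit <- lm(log_counts ~ price_levels)
slope <- unname(coef(decay_fit)[2])

print(share_free)
print(slope)
```

On the real data the equivalent calls would be ecdf(ios$price)(0) and a regression of log(table(ios$price)) on the price levels.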

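4.4 Which factors affect app pricing?

library(AER)  # provides tobit()

# Tobit regression of price, left-censored at 0 (the tobit() default), without the categorical predictors
fm.tobit <- tobit(price ~ . - cont_rating - charge - prime_genre, data = ios)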

summary(fm.tobit)
## 
## Call:
## tobit(formula = price ~ . - cont_rating - charge - prime_genre, 
##     data = ios)
## 
## Observations:
##          Total  Left-censored     Uncensored Right-censored 
##           7197           4056           3141              0 
## 
## Coefficients:
##                      Estimate   Std. Error z value             Pr(>|z|)    
## (Intercept)       1.048136359  2.049136950   0.512              0.60900    
## rating_count_tot -0.000050599  0.000005149  -9.828 < 0.0000000000000002 ***
## rating_count_ver  0.000025065  0.000038847   0.645              0.51879    
## user_rating       0.655692302  0.142783884   4.592          0.000004386 ***
## user_rating_ver   0.141256047  0.117381845   1.203              0.22883    
## sup_devices.num  -0.164158253  0.033108869  -4.958          0.000000712 ***
## ipadSc_urls.num   0.219992391  0.071121613   3.093              0.00198 ** 
## lang.num         -0.046153546  0.017895840  -2.579              0.00991 ** 
## vpp_lic          -2.357705841  1.535457318  -1.536              0.12466    
## size_MB           0.004816440  0.000340466  14.147 < 0.0000000000000002 ***
## Log(scale)        2.257355544  0.013342801 169.182 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Scale: 9.558 
## 
## Gaussian distribution
## Number of Newton-Raphson Iterations: 4 
## Log-likelihood: -1.365e+04 on 11 Df
## Wald-statistic: 471.4 on 9 Df, p-value: < 0.000000000000000222
#colnames(ios)

# Worth reporting to the instructor; see Wooldridge's textbook

fm.tobit$scale
## [1] 9.557781
exp(2.257355544)
## [1] 9.557781
require(AER)
require(wooldridge)
## Loading required package: wooldridge
require(npsf)
data(mroz)
names(mroz)
##  [1] "inlf"     "hours"    "kidslt6"  "kidsge6"  "age"      "educ"    
##  [7] "wage"     "repwage"  "hushrs"   "husage"   "huseduc"  "huswage" 
## [13] "faminc"   "mtr"      "motheduc" "fatheduc" "unem"     "city"    
## [19] "exper"    "nwifeinc"
fm.tobit <- tobit(hours~nwifeinc+educ+exper+I(exper^2)+age+kidslt6+kidsge6,
data = mroz)
summary(fm.tobit)
## 
## Call:
## tobit(formula = hours ~ nwifeinc + educ + exper + I(exper^2) + 
##     age + kidslt6 + kidsge6, data = mroz)
## 
## Observations:
##          Total  Left-censored     Uncensored Right-censored 
##            753            325            428              0 
## 
## Coefficients:
##               Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)  965.30528  446.43614   2.162             0.030599 *  
## nwifeinc      -8.81424    4.45910  -1.977             0.048077 *  
## educ          80.64561   21.58324   3.736             0.000187 ***
## exper        131.56430   17.27939   7.614  0.00000000000002659 ***
## I(exper^2)    -1.86416    0.53766  -3.467             0.000526 ***
## age          -54.40501    7.41850  -7.334  0.00000000000022390 ***
## kidslt6     -894.02174  111.87804  -7.991  0.00000000000000134 ***
## kidsge6      -16.21800   38.64139  -0.420             0.674701    
## Log(scale)     7.02289    0.03706 189.514 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Scale: 1122 
## 
## Gaussian distribution
## Number of Newton-Raphson Iterations: 4 
## Log-likelihood: -3819 on 9 Df
## Wald-statistic: 253.9 on 7 Df, p-value: < 0.000000000000000222
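Tobit coefficients describe the latent outcome, so they are not directly the effect on the observed one. A standard result for the Tobit model is that the partial effect of x_j on E[y | x] is beta_j * Phi(x beta / sigma). As a rough sketch for the price model fitted earlier, the scale factor is evaluated at the sample means. The coefficients and sigma are copied from the printed Tobit output for price and the means from the descriptive statistics table; evaluating at the means is my own assumption for illustration.

```r
# Coefficients copied from the Tobit output for price (intercept first)
beta <- c(
  intercept        =  1.048136359,
  rating_count_tot = -0.000050599,
  rating_count_ver =  0.000025065,
  user_rating      =  0.655692302,
  user_rating_ver  =  0.141256047,
  sup_devices.num  = -0.164158253,
  ipadSc_urls.num  =  0.219992391,
  lang.num         = -0.046153546,
  vpp_lic          = -2.357705841,
  size_MB          =  0.004816440
)

# Sample means copied from the descriptive statistics table (1 for the intercept)
xbar <- c(1, 12893, 460.4, 3.527, 3.254, 37.36, 3.707, 5.435, 0.9931, 199.13)

# Scale reported by the model, exp(Log(scale))
sigma <- 9.557781

# Linear index at the means and the Tobit scale factor Phi(x beta / sigma)
index <- sum(beta * xbar)
scale_factor <- pnorm(index / sigma)

# Approximate partial effects on the expected observed price
partial_effects <- beta[-1] * scale_factor
print(scale_factor)
print(round(partial_effects, 5))
```

Because the scale factor lies between 0 and 1, each partial effect on the observed price is smaller in magnitude than the raw coefficient, while the signs stay the same.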

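Word cloud results

library(jiebaR)

# Create a jiebaR segmentation worker with default settings
mixseg <- worker()

# Segment the app names; dff is assumed to be the data frame that still contains track_name
seg <- mixseg[dff$track_name]
segA <- data.frame(table(seg))

# Count tokens longer than one character and keep the 50 most frequent
segC <- data.frame(table(seg[nchar(seg) > 1]))
segC_top50 <- head(segC[order(segC$Freq, decreasing = TRUE), ], 50)

library(wordcloud)
## Loading required package: RColorBrewer
par(family = "Heiti TC Light")
wordcloud(
  words = segC_top50[, 1],   # or segC_top50$Var1
  freq = segC_top50$Freq,
  scale = c(4, .1),          # range of text sizes (vector)
  random.order = FALSE,      # plot words in decreasing frequency, not at random
  ordered.colors = FALSE,    # do not assign colors in order
  rot.per = 0,               # no rotated words
  min.freq = 7,              # minimum frequency for a word to appear
  colors = brewer.pal(8, "Dark2")
)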