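1. Variable introduction:
2. Variable processing
3. Data visualization
- 3.1 Plotting the correlation matrix
- 3.2 Content rating (age group) of most apps
4. Questions of interest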

1. Variable introduction:

The dataset has 16 variables:
“id”: App ID
“track_name”: App name
“size_bytes”: Size (in bytes)
“currency”: Currency type
“price”: Price amount
“rating_count_tot”: User rating count (all versions)
“rating_count_ver”: User rating count (current version)
“user_rating”: Average user rating (all versions)
“user_rating_ver”: Average user rating (current version)
“ver”: Latest version code
“cont_rating”: Content rating, i.e. the age group the app is suitable for; it has 4 levels: 4+, 9+, 12+, 17+
“prime_genre”: Primary genre
“sup_devices.num”: Number of supported devices
“ipadSc_urls.num”: Number of screenshots shown for display (can be viewed as a showcase of the app's features)
“lang.num”: Number of supported languages
“vpp_lic”: VPP device-based licensing enabled

Note: The Apple Volume Purchase Program (VPP) is a service that lets organizations registered with Apple VPP buy iOS apps in bulk, but not at a discounted price. It is presumably used mainly for large purchases by companies. This variable is binary.

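2. Variable processing

id and track_name (the app name) carry no useful information, and every app is priced in US dollars, so currency is constant. ver (the version) is also dropped because version numbering is too inconsistent across developers.

I therefore keep only 12 variables, of which only prime_genre, vpp_lic and cont_rating are categorical; the others are treated as continuous.

I then add a dummy variable indicating whether an app is paid or free.

In addition, since bytes are not a commonly used unit, the size is converted to MB.

Unfortunately, the number of downloads per app is not published.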

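# Flag each app as paid or free based on its price
ios$charge <- as.factor(ifelse(ios$price > 0, "paid", "free"))

# Treat content rating and genre as categorical variables
ios$cont_rating <- as.factor(ios$cont_rating)
ios$prime_genre <- as.factor(ios$prime_genre)

# Convert the size from bytes to MB, then drop the original size_bytes column (column 1)
ios$size_MB <- ios$size_bytes / 1000000
ios <- ios[, -1]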
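As a small illustration of the preprocessing above, the following sketch applies the same two transformations to a made-up three-row data frame. The `toy` data frame and its values are hypothetical and are not taken from the dataset.

```r
# Hypothetical toy data with the two columns the preprocessing touches
toy <- data.frame(
  size_bytes = c(100788224, 158578688, 100524032),
  price      = c(3.99, 0, 0)
)

# Paid/free indicator: any strictly positive price counts as "paid"
toy$charge <- factor(ifelse(toy$price > 0, "paid", "free"))

# Bytes to megabytes, using the decimal convention (1 MB = 1,000,000 bytes)
toy$size_MB <- toy$size_bytes / 1e6

print(toy)
```

Note that dividing by 1,000,000 gives decimal megabytes; dividing by 1024^2 would give mebibytes instead, so the choice should be stated when reporting sizes.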

Here we can view the full dataset. At this point we have 13 variables and 7197 observations, and the data contain no missing values.

Display all of the data

DT::datatable(ios, options = list(
pageLength=50, scrollX='400px'), filter = 'top')


# Count the missing values across the whole data frame
sum(is.na(ios))

## [1] 0

Show the summary statistics of our variables

m<- summary(ios)

knitr::kable(m)

price	rating_count_tot	rating_count_ver	user_rating	user_rating_ver	cont_rating	prime_genre	sup_devices.num	ipadSc_urls.num	lang.num	vpp_lic	charge	size_MB
Min. : 0.000	Min. : 0	Min. : 0.0	Min. :0.000	Min. :0.000	12+:1155	Games :3862	Min. : 9.00	Min. :0.000	Min. : 0.000	Min. :0.0000	free:4056	Min. : 0.59
1st Qu.: 0.000	1st Qu.: 28	1st Qu.: 1.0	1st Qu.:3.500	1st Qu.:2.500	17+: 622	Entertainment : 535	1st Qu.:37.00	1st Qu.:3.000	1st Qu.: 1.000	1st Qu.:1.0000	paid:3141	1st Qu.: 46.92
Median : 0.000	Median : 300	Median : 23.0	Median :4.000	Median :4.000	4+ :4433	Education : 453	Median :37.00	Median :5.000	Median : 1.000	Median :1.0000	NA	Median : 97.15
Mean : 1.726	Mean : 12893	Mean : 460.4	Mean :3.527	Mean :3.254	9+ : 987	Photo & Video : 349	Mean :37.36	Mean :3.707	Mean : 5.435	Mean :0.9931	NA	Mean : 199.13
3rd Qu.: 1.990	3rd Qu.: 2793	3rd Qu.: 140.0	3rd Qu.:4.500	3rd Qu.:4.500	NA	Utilities : 248	3rd Qu.:38.00	3rd Qu.:5.000	3rd Qu.: 8.000	3rd Qu.:1.0000	NA	3rd Qu.: 181.93
Max. :299.990	Max. :2974676	Max. :177050.0	Max. :5.000	Max. :5.000	NA	Health & Fitness: 180	Max. :47.00	Max. :5.000	Max. :75.000	Max. :1.0000	NA	Max. :4025.97
NA	NA	NA	NA	NA	NA	(Other) :1570	NA	NA	NA	NA	NA	NA

3. Data visualization

3.1 Plotting the correlation matrix

library(corrplot)

## corrplot 0.84 loaded

df<- ios[,-c(6,7,12)]
df <- as.matrix(df)
M<- cor(df)

corrplot(M, method="circle")

corrplot(M, method="number")

The plots show that, apart from the current-version user rating and the all-version user rating, no pair of variables has a notable linear relationship.
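To make the reading of a correlation matrix concrete, here is a minimal sketch on simulated data that finds the most strongly correlated pair programmatically. The variables `rating_all`, `rating_cur` and `lang_num` are hypothetical stand-ins generated at random, not the real columns.

```r
set.seed(1)
n <- 500

# Hypothetical stand-ins: two strongly related ratings and one unrelated count
rating_all <- runif(n, 0, 5)
rating_cur <- rating_all + rnorm(n, sd = 0.5)
lang_num   <- rpois(n, 5)

M_toy <- cor(cbind(rating_all, rating_cur, lang_num))

# Zero out the diagonal, then locate the largest absolute correlation
off <- abs(M_toy)
diag(off) <- 0
idx <- which(off == max(off), arr.ind = TRUE)[1, ]
strongest <- c(rownames(M_toy)[idx[1]], colnames(M_toy)[idx[2]])

print(round(M_toy, 2))
print(strongest)
```

Keep in mind that `cor()` uses the Pearson coefficient by default, so it only measures linear association; a pair can be strongly related in a non-linear way and still show a coefficient near zero.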

3.2 Content rating (age group) of most apps

m1<- table(ios$cont_rating)

table(ios$cont_rating)

## 
##  12+  17+   4+   9+ 
## 1155  622 4433  987

ios$cont_rating <- factor(ios$cont_rating,levels = c("4+", "9+", "12+", "17+"))
m1<- table(ios$cont_rating)
barplot(m1)

knitr::kable(m1)

Var1	Freq
4+	4433
9+	987
12+	1155
17+	622

Most apps (4433 of 7197) are rated 4+, meaning they are designed to be suitable for anyone aged four and above.
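The counts in the table can be turned into percentage shares with a few lines. This sketch hard-codes the four counts shown above; the object names `counts` and `share` are illustrative only.

```r
# Counts copied from the table above, ordered by increasing age threshold
counts <- c("4+" = 4433, "9+" = 987, "12+" = 1155, "17+" = 622)

# Percentage share of each content rating, rounded to one decimal place
share <- round(100 * counts / sum(counts), 1)
print(share)
```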

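4. Questions of interest

1. Which variables affect an app's rating?

2. Do paid apps get better ratings?

3. What is the pricing trend for most apps?

4.1 Which variables affect an app's rating?

4.1.1 First, a look with a plain linear regression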

m1 <- lm(user_rating   ~. ,ios)
m2<- summary(m1)

data.frame(summary(m1)$coef[summary(m1)$coef[,4] <= .05, 4])

##                              summary.m1..coef.summary.m1..coef...4.....0.05..4.
## (Intercept)                                             0.000885712962934446244
## price                                                   0.017149274322532933462
## user_rating_ver                                         0.000000000000000000000
## cont_rating12+                                          0.030125854628497372723
## prime_genreBusiness                                     0.008332204043162476370
## prime_genreEducation                                    0.013074676199248123562
## prime_genreEntertainment                                0.008985110250370754656
## prime_genreFinance                                      0.008459562324483195853
## prime_genreFood & Drink                                 0.000205487435224537907
## prime_genreGames                                        0.011396798805536760210
## prime_genreHealth & Fitness                             0.000255245793004355188
## prime_genreLifestyle                                    0.008410693519466458032
## prime_genreMusic                                        0.006334849962495827866
## prime_genreNews                                         0.001242261154232396098
## prime_genrePhoto & Video                                0.000063520404373961997
## prime_genreProductivity                                 0.000544045272050046173
## prime_genreShopping                                     0.000000000149900693008
## prime_genreSocial Networking                            0.007537393754674120809
## prime_genreSports                                       0.046849439600335426870
## prime_genreTravel                                       0.000762921294891544624
## prime_genreUtilities                                    0.003950537294257992597
## prime_genreWeather                                      0.008810318304743505746
## sup_devices.num                                         0.042884258425487840893
## ipadSc_urls.num                                         0.000000000000001649702
## lang.num                                                0.001047715801772405907
## vpp_lic                                                 0.000409720710488918757
## chargepaid                                              0.031491392856383318422
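The filtering step above (keep only the coefficients whose p-value is at most 0.05) can be sketched on the built-in `mtcars` data. The model `mpg ~ wt + hp` is purely illustrative and unrelated to the app data; selecting the p-value column by name rather than by position is a stylistic assumption, not the author's code.

```r
# mtcars ships with base R; this model is only an illustration
fit <- lm(mpg ~ wt + hp, data = mtcars)
coefs <- summary(fit)$coefficients

# The column "Pr(>|t|)" holds the p-values; keep rows at or below 0.05
sig <- coefs[coefs[, "Pr(>|t|)"] <= 0.05, , drop = FALSE]
print(round(sig, 4))
```

When a predictor is a factor, each listed level is compared against the baseline level, so a significant dummy coefficient means that level differs from the baseline, not that the factor as a whole matters.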

4.1.2 Using a random forest to see which variables affect the rating

library(ranger)
library(dplyr)
library(ggplot2)

set.seed(1)

# Fit a random forest and record impurity-based variable importance
rf<- ranger(user_rating~. ,ios, quantreg = TRUE,importance='impurity')

# Plot the predictors ordered by importance (there are only 12, so all are shown)
data.frame(varname = names(rf$variable.importance),
           imp = rf$variable.importance) %>%
  arrange(desc(imp)) %>%
  ggplot(mapping = aes(x = reorder(varname, imp), y = imp)) +
  geom_col() +
  coord_flip() +
  ggtitle(label = "Top 12 important variables") +
  theme(
    axis.title = element_blank()
  )

From the random forest results, the three variables with the greatest influence on the user rating are, in order:

4.2 Do paid apps get better ratings?

qplot(user_rating, data = ios, geom = "density",
  fill = charge, alpha = I(.5),
  main="Distribution of App rating",
  xlab="Rating",
  ylab="Density")

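mean(ios$user_rating)

## [1] 3.526956

# The paid/free indicator is stored in the charge column
mean(ios$user_rating[which(ios$charge=="paid")])

## [1] 3.720949

mean(ios$user_rating[which(ios$charge=="free")])

## [1] 3.376726

The mean rating over all apps is 3.526956; paid apps average 3.720949 and free apps 3.376726.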

# Compute the analysis of variance
res.aov <- aov(user_rating ~ charge, data = ios)
# Summary of the analysis
summary(res.aov)

##               Df Sum Sq Mean Sq F value              Pr(>F)    
## charge         1    210  209.75   92.18 <0.0000000000000002 ***
## Residuals   7195  16371    2.28                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

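The one-way ANOVA table tells the same story (F = 92.18, p < 2e-16): if ratings reflect app quality, then paid apps are statistically significantly better than free apps.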
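With only two groups, a one-way ANOVA is equivalent to a pooled-variance two-sample t-test: the F statistic equals the square of the t statistic. The sketch below checks this on simulated ratings; the `toy` data frame, group sizes and means are made up for illustration.

```r
set.seed(2)

# Simulated ratings for two hypothetical groups (not the real data)
toy <- data.frame(
  rating = c(rnorm(60, mean = 3.7), rnorm(80, mean = 3.4)),
  charge = factor(rep(c("paid", "free"), times = c(60, 80)))
)

# One-way ANOVA F statistic for the group effect
f_stat <- summary(aov(rating ~ charge, data = toy))[[1]][["F value"]][1]

# Pooled-variance two-sample t statistic
t_stat <- unname(t.test(rating ~ charge, data = toy, var.equal = TRUE)$statistic)

# With two groups the two tests coincide: F equals t squared
print(c(F = f_stat, t_squared = t_stat^2))
```

The equivalence holds only for the equal-variance t-test; the default Welch test in `t.test()` does not pool variances and will generally give a slightly different statistic.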

4.3 What is the pricing trend for most apps?

sum(is.na(ios$price))

## [1] 0

#there is no NA in price
#we draw the ecdf of this data

plot(ecdf(ios$price  ))

object<- table(ios$price  )
barplot(log(object))

#plot(sort(unique(applestore$price)) ,log(object)     )

#log(table(applestore$price  ))
#qplot(price,data=applestore,geom="histogram"     )

#qplot(price,data=applestore,geom="histogram",log = "y")

#plot(applestore$price, log="y", type='histogram')

Free apps clearly dominate (4056 of 7197), and the prices appear to follow a roughly exponential distribution.
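An empirical CDF can be queried directly, which makes statements such as "most apps are free" easy to quantify. A minimal sketch, assuming a hypothetical vector of ten prices rather than the real price column:

```r
# Hypothetical prices: half free, the rest at a few common price points
prices <- c(0, 0, 0, 0, 0, 0.99, 0.99, 1.99, 2.99, 4.99)

# ecdf() returns a step function: Fn(x) is the share of prices <= x
Fn <- ecdf(prices)

share_free      <- Fn(0)      # proportion of free apps
share_up_to_199 <- Fn(1.99)   # proportion priced at 1.99 or less
print(c(free = share_free, up_to_1.99 = share_up_to_199))
```

On the real data, the same call with `ecdf(ios$price)` would give the height of the first jump in the ECDF plot above.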

4.4 Which factors affect app pricing?

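library(AER)  # provides tobit()

# Tobit regression of price, left-censored at 0 (the tobit() default)
fm.tobit <- tobit(price~.-cont_rating-charge-prime_genre,
data = ios)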

summary(fm.tobit)

## 
## Call:
## tobit(formula = price ~ . - cont_rating - charge - prime_genre, 
##     data = ios)
## 
## Observations:
##          Total  Left-censored     Uncensored Right-censored 
##           7197           4056           3141              0 
## 
## Coefficients:
##                      Estimate   Std. Error z value             Pr(>|z|)    
## (Intercept)       1.048136359  2.049136950   0.512              0.60900    
## rating_count_tot -0.000050599  0.000005149  -9.828 < 0.0000000000000002 ***
## rating_count_ver  0.000025065  0.000038847   0.645              0.51879    
## user_rating       0.655692302  0.142783884   4.592          0.000004386 ***
## user_rating_ver   0.141256047  0.117381845   1.203              0.22883    
## sup_devices.num  -0.164158253  0.033108869  -4.958          0.000000712 ***
## ipadSc_urls.num   0.219992391  0.071121613   3.093              0.00198 ** 
## lang.num         -0.046153546  0.017895840  -2.579              0.00991 ** 
## vpp_lic          -2.357705841  1.535457318  -1.536              0.12466    
## size_MB           0.004816440  0.000340466  14.147 < 0.0000000000000002 ***
## Log(scale)        2.257355544  0.013342801 169.182 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Scale: 9.558 
## 
## Gaussian distribution
## Number of Newton-Raphson Iterations: 4 
## Log-likelihood: -1.365e+04 on 11 Df
## Wald-statistic: 471.4 on 9 Df, p-value: < 0.000000000000000222

#colnames(ios)

# This can be reported to the instructor; see Wooldridge's textbook

fm.tobit$scale

## [1] 9.557781

exp(2.257355544)

## [1] 9.557781
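The check above works because the `Log(scale)` row is the log of the error standard deviation, so exponentiating it recovers `fm.tobit$scale`. To show where that parameter comes from, here is a minimal base-R sketch of a Tobit likelihood left-censored at zero, fitted with `optim()` on simulated data. It is not the routine `AER::tobit()` uses internally; the true values (intercept 1, slope 2, scale 1.5) and all object names are assumptions chosen for the simulation.

```r
set.seed(42)
n <- 2000

# Latent outcome y* = 1 + 2x + e with sd(e) = 1.5; we only observe max(y*, 0)
x      <- rnorm(n)
y_star <- 1 + 2 * x + rnorm(n, sd = 1.5)
y      <- pmax(y_star, 0)

# Average negative log-likelihood; par = (intercept, slope, log(scale))
negloglik <- function(par) {
  mu    <- par[1] + par[2] * x
  sigma <- exp(par[3])          # optimizing log(scale) keeps sigma positive
  cens  <- y <= 0
  ll <- ifelse(
    cens,
    pnorm((0 - mu) / sigma, log.p = TRUE),        # P(y* <= 0) for censored rows
    dnorm(y, mean = mu, sd = sigma, log = TRUE)   # normal density for observed rows
  )
  -mean(ll)
}

fit <- optim(c(0, 0, 0), negloglik, method = "BFGS")
est <- c(intercept = fit$par[1], slope = fit$par[2], scale = exp(fit$par[3]))
print(round(est, 3))
```

Censored observations contribute a probability and uncensored ones a density, which is why the `tobit()` summary reports the left-censored and uncensored counts separately.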

require(AER)
require(wooldridge)

## Loading required package: wooldridge

require(npsf)

## Loading required package: npsf

## Loading required package: Formula

## Loading required package: randtoolbox

## Loading required package: rngWELL

## This is randtoolbox. For an overview, type 'help("randtoolbox")'.

## Loading required package: sfsmisc

## 
## Attaching package: 'sfsmisc'

## The following object is masked from 'package:rminer':
## 
##     factorize

## The following object is masked from 'package:dplyr':
## 
##     last

data(mroz)
names(mroz)

##  [1] "inlf"     "hours"    "kidslt6"  "kidsge6"  "age"      "educ"    
##  [7] "wage"     "repwage"  "hushrs"   "husage"   "huseduc"  "huswage" 
## [13] "faminc"   "mtr"      "motheduc" "fatheduc" "unem"     "city"    
## [19] "exper"    "nwifeinc"

fm.tobit <- tobit(hours~nwifeinc+educ+exper+I(exper^2)+age+kidslt6+kidsge6,
data = mroz)
summary(fm.tobit)

## 
## Call:
## tobit(formula = hours ~ nwifeinc + educ + exper + I(exper^2) + 
##     age + kidslt6 + kidsge6, data = mroz)
## 
## Observations:
##          Total  Left-censored     Uncensored Right-censored 
##            753            325            428              0 
## 
## Coefficients:
##               Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)  965.30528  446.43614   2.162             0.030599 *  
## nwifeinc      -8.81424    4.45910  -1.977             0.048077 *  
## educ          80.64561   21.58324   3.736             0.000187 ***
## exper        131.56430   17.27939   7.614  0.00000000000002659 ***
## I(exper^2)    -1.86416    0.53766  -3.467             0.000526 ***
## age          -54.40501    7.41850  -7.334  0.00000000000022390 ***
## kidslt6     -894.02174  111.87804  -7.991  0.00000000000000134 ***
## kidsge6      -16.21800   38.64139  -0.420             0.674701    
## Log(scale)     7.02289    0.03706 189.514 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Scale: 1122 
## 
## Gaussian distribution
## Number of Newton-Raphson Iterations: 4 
## Log-likelihood: -3819 on 9 Df
## Wald-statistic: 253.9 on 7 Df, p-value: < 0.000000000000000222

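Word cloud results

library(jiebaR)

# Create a jiebaR segmentation worker
mixseg<-worker()

# dff is assumed here to be the raw data frame that still contains track_name
seg <- mixseg[dff$track_name]
segA<-data.frame(table(seg))

# Keep only tokens longer than one character and take the 50 most frequent
segC<-data.frame(table(seg[nchar(seg)>1]))
segC_top50<-head(segC[order(segC$Freq,decreasing = TRUE),],50)

library(wordcloud)

## Loading required package: RColorBrewer

par(family=("Heiti TC Light"))
wordcloud(
  words = segC_top50[,1], # or segC_top50$Var1
  freq =  segC_top50$Freq, 
  scale = c(4,.1), # range of text sizes (vector)
  random.order = FALSE, # plot words in decreasing frequency instead of random order
  ordered.colors = FALSE, # do not assign colors in word order
  rot.per = 0, # proportion of rotated words: none
  min.freq = 7, # minimum frequency for a word to be shown
  colors = brewer.pal(8,"Dark2")
)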

iosapp

Chen Ning Kuan

2020/3/11
