## color director_name num_critic_for_reviews
## Length:5043 Length:5043 Min. : 1.0
## Class :character Class :character 1st Qu.: 50.0
## Mode :character Mode :character Median :110.0
## Mean :140.2
## 3rd Qu.:195.0
## Max. :813.0
## NA's :50
## duration director_facebook_likes actor_3_facebook_likes
## Min. : 7.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 93.0 1st Qu.: 7.0 1st Qu.: 133.0
## Median :103.0 Median : 49.0 Median : 371.5
## Mean :107.2 Mean : 686.5 Mean : 645.0
## 3rd Qu.:118.0 3rd Qu.: 194.5 3rd Qu.: 636.0
## Max. :511.0 Max. :23000.0 Max. :23000.0
## NA's :15 NA's :104 NA's :23
## actor_2_name actor_1_facebook_likes gross
## Length:5043 Min. : 0 Min. : 162
## Class :character 1st Qu.: 614 1st Qu.: 5340988
## Mode :character Median : 988 Median : 25517500
## Mean : 6560 Mean : 48468408
## 3rd Qu.: 11000 3rd Qu.: 62309438
## Max. :640000 Max. :760505847
## NA's :7 NA's :884
## genres actor_1_name movie_title
## Length:5043 Length:5043 Length:5043
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## num_voted_users cast_total_facebook_likes actor_3_name
## Min. : 5 Min. : 0 Length:5043
## 1st Qu.: 8594 1st Qu.: 1411 Class :character
## Median : 34359 Median : 3090 Mode :character
## Mean : 83668 Mean : 9699
## 3rd Qu.: 96309 3rd Qu.: 13756
## Max. :1689764 Max. :656730
##
## facenumber_in_poster plot_keywords movie_imdb_link
## Min. : 0.000 Length:5043 Length:5043
## 1st Qu.: 0.000 Class :character Class :character
## Median : 1.000 Mode :character Mode :character
## Mean : 1.371
## 3rd Qu.: 2.000
## Max. :43.000
## NA's :13
## num_user_for_reviews language country
## Min. : 1.0 Length:5043 Length:5043
## 1st Qu.: 65.0 Class :character Class :character
## Median : 156.0 Mode :character Mode :character
## Mean : 272.8
## 3rd Qu.: 326.0
## Max. :5060.0
## NA's :21
## content_rating budget title_year
## Length:5043 Min. :2.180e+02 Min. :1916
## Class :character 1st Qu.:6.000e+06 1st Qu.:1999
## Mode :character Median :2.000e+07 Median :2005
## Mean :3.975e+07 Mean :2002
## 3rd Qu.:4.500e+07 3rd Qu.:2011
## Max. :1.222e+10 Max. :2016
## NA's :492 NA's :108
## actor_2_facebook_likes imdb_score aspect_ratio
## Min. : 0 Min. :1.600 Min. : 1.18
## 1st Qu.: 281 1st Qu.:5.800 1st Qu.: 1.85
## Median : 595 Median :6.600 Median : 2.35
## Mean : 1652 Mean :6.442 Mean : 2.22
## 3rd Qu.: 918 3rd Qu.:7.200 3rd Qu.: 2.35
## Max. :137000 Max. :9.500 Max. :16.00
## NA's :13 NA's :329
## movie_facebook_likes
## Min. : 0
## 1st Qu.: 0
## Median : 166
## Mean : 7526
## 3rd Qu.: 3000
## Max. :349000
##
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.0 50.0 110.0 140.2 195.0 813.0 50
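Across these 5043 movies, the number of critic reviews ranges from 1 to 813, and most movies received fewer than 200 (the third quartile is 195).
Most of the movies in the data were made after 2000 (the median title_year is 2005).
The United States produced the largest number of movies.
Because IMDB only started in 1990 and the data consist mainly of US movies, I preprocess the data first and narrow the analysis to US movies released after 2000.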
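The filtering step described above might look roughly like the following sketch. The data frame names `movie` and `movie_us`, the country code `"USA"`, and treating "after 2000" as `title_year >= 2000` are all assumptions; the toy rows are made up for illustration and are not taken from the real data.

```r
# Toy stand-in for the movie data (hypothetical rows, not real movies).
movie <- data.frame(
  movie_title = c("A", "B", "C", "D"),
  country     = c("USA", "USA", "UK", "USA"),
  title_year  = c(1995, 2004, 2010, NA),
  imdb_score  = c(7.1, 6.3, 5.9, 6.8),
  stringsAsFactors = FALSE
)

# Keep US movies released in 2000 or later. which() silently drops rows
# where the condition is NA (title_year has missing values in the real data).
keep     <- which(movie$country == "USA" & movie$title_year >= 2000)
movie_us <- movie[keep, ]

movie_us
```

Using `which()` rather than a bare logical index avoids rows of `NA` appearing in the result when `title_year` is missing.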
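I want to see whether the IMDB score is related to the other variables, so I use regression analysis.
The following are the independent variables I will analyze: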
num_critic_for_reviews
duration
director_facebook_likes
actor_1_facebook_likes
gross
cast_total_facebook_likes
facenumber_in_poster
budget
movie_facebook_likes
Select this subset of numeric variables for regression modeling.
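A minimal sketch of how the subset and the first model could be built is given here. Only the formula and the name `movie_sub` come from the report's output; `movie_us`, `fit1`, the use of `na.omit()`, and the simulated numbers are assumptions made so the example runs on its own.

```r
set.seed(1)
n <- 200
vars <- c("imdb_score", "num_critic_for_reviews", "duration",
          "director_facebook_likes", "actor_1_facebook_likes", "gross",
          "cast_total_facebook_likes", "facenumber_in_poster", "budget",
          "movie_facebook_likes")

# Simulated stand-in for the filtered data (random numbers, not real movies).
movie_us <- as.data.frame(matrix(runif(n * length(vars), 1, 100), nrow = n))
names(movie_us) <- vars
movie_us$country  <- "USA"   # an extra non-numeric column that gets dropped
movie_us$gross[1:5] <- NA    # mimic the missing values seen in the summary

# Keep the response plus the nine predictors, then drop incomplete rows.
movie_sub <- na.omit(movie_us[, vars])

# Same formula as in the regression output.
fit1 <- lm(imdb_score ~ num_critic_for_reviews + duration +
             director_facebook_likes + actor_1_facebook_likes + gross +
             cast_total_facebook_likes + facenumber_in_poster + budget +
             movie_facebook_likes, data = movie_sub)
summary(fit1)
```

`lm()` would also drop incomplete rows by default, so calling `na.omit()` first mainly makes the number of rows used explicit.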
##
## Call:
## lm(formula = imdb_score ~ num_critic_for_reviews + duration +
## director_facebook_likes + actor_1_facebook_likes + gross +
## cast_total_facebook_likes + facenumber_in_poster + budget +
## movie_facebook_likes, data = movie_sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.6606 -0.5469 0.0856 0.6668 3.1902
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.510e+00 1.000e-01 55.074 < 2e-16 ***
## num_critic_for_reviews 2.815e-03 2.360e-04 11.931 < 2e-16 ***
## duration 2.940e-03 1.029e-03 2.858 0.004296 **
## director_facebook_likes 3.126e-05 7.315e-06 4.273 2.00e-05 ***
## actor_1_facebook_likes 1.096e-05 3.644e-06 3.007 0.002665 **
## gross 1.325e-09 4.062e-10 3.262 0.001120 **
## cast_total_facebook_likes -7.425e-06 3.152e-06 -2.356 0.018563 *
## facenumber_in_poster -2.935e-02 8.839e-03 -3.321 0.000909 ***
## budget -2.651e-09 6.285e-10 -4.218 2.54e-05 ***
## movie_facebook_likes 2.195e-06 1.201e-06 1.827 0.067763 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.024 on 2724 degrees of freedom
## Multiple R-squared: 0.1757, Adjusted R-squared: 0.173
## F-statistic: 64.5 on 9 and 2724 DF, p-value: < 2.2e-16
Only movie_facebook_likes has a p-value above 0.05 (0.068), so I drop this variable and refit the regression.
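One way to drop a term and refit is `update()`, sketched here on simulated data with a shortened formula. The object names `fit1` and `fit2`, the reduced set of predictors, and the generated values are assumptions for illustration only.

```r
set.seed(2)
n <- 150

# Simulated stand-in data (not real movies); only a few predictors are used.
movie_sub <- data.frame(
  duration             = runif(n, 80, 180),
  budget               = runif(n, 1e6, 1e8),
  movie_facebook_likes = runif(n, 0, 5e4)
)
movie_sub$imdb_score <- 5.5 + 0.003 * movie_sub$duration + rnorm(n)

fit1 <- lm(imdb_score ~ duration + budget + movie_facebook_likes,
           data = movie_sub)

# Remove one term from the existing model and refit on the same data.
fit2 <- update(fit1, . ~ . - movie_facebook_likes)

# Compare the nested models with an F-test.
anova(fit2, fit1)
```

`update()` keeps the rest of the call unchanged, which avoids retyping the long formula when a single predictor is removed.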
##
## Call:
## lm(formula = imdb_score ~ num_critic_for_reviews + duration +
## director_facebook_likes + actor_1_facebook_likes + gross +
## cast_total_facebook_likes + facenumber_in_poster + budget,
## data = movie_sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5274 -0.5455 0.0779 0.6589 3.2994
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.488e+00 9.935e-02 55.235 < 2e-16 ***
## num_critic_for_reviews 3.062e-03 1.936e-04 15.813 < 2e-16 ***
## duration 2.930e-03 1.029e-03 2.848 0.004438 **
## director_facebook_likes 3.190e-05 7.310e-06 4.363 1.33e-05 ***
## actor_1_facebook_likes 1.029e-05 3.627e-06 2.838 0.004573 **
## gross 1.402e-09 4.042e-10 3.469 0.000531 ***
## cast_total_facebook_likes -6.851e-06 3.138e-06 -2.183 0.029094 *
## facenumber_in_poster -2.861e-02 8.833e-03 -3.239 0.001213 **
## budget -2.738e-09 6.270e-10 -4.368 1.30e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.024 on 2725 degrees of freedom
## Multiple R-squared: 0.1747, Adjusted R-squared: 0.1722
## F-statistic: 72.09 on 8 and 2725 DF, p-value: < 2.2e-16
Draw the four diagnostic plots.
Make predictions.
Compute the MAE.
## [1] 1
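The three steps above could be sketched as follows. The report does not show its code, so this is only one plausible reading: the four plots are assumed to be the default `plot()` output for an `lm` object, the predictions are assumed to be made on a held-out 20% split, and the data are simulated with only two predictors.

```r
set.seed(3)
n <- 300

# Simulated stand-in data (not real movies).
movie_sub <- data.frame(
  duration             = runif(n, 80, 180),
  facenumber_in_poster = rpois(n, 1.4)
)
movie_sub$imdb_score <- 5.5 + 0.003 * movie_sub$duration -
  0.03 * movie_sub$facenumber_in_poster + rnorm(n)

# Hypothetical 80/20 train/test split.
train_idx <- sample(n, size = 240)
train <- movie_sub[train_idx, ]
test  <- movie_sub[-train_idx, ]

fit <- lm(imdb_score ~ duration + facenumber_in_poster, data = train)

# Four default diagnostic plots: residuals vs fitted, normal Q-Q,
# scale-location, residuals vs leverage.
par(mfrow = c(2, 2))
plot(fit)

# Predict on the held-out rows and compute the mean absolute error.
pred <- predict(fit, newdata = test)
mae  <- mean(abs(test$imdb_score - pred))
mae
```

The MAE is in the same units as `imdb_score`, so a value near 1 means predictions are off by about one rating point on average.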
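Duration is a significant positive factor for the movie score: the longer the movie, the higher the score.
The number of critic reviews matters: the more reviews a movie receives, the higher its score.
The number of faces on the poster has a negative effect on the movie score: the more faces on the poster, the lower the score.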