The data come from the IMDB movie dataset on Kaggle.

The dataset contains 5,043 records, covering movies released from 1916 through 2016.

First, handle anomalous values: missing values are filled with the column mean.
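As a minimal sketch of mean imputation, the snippet below fills `NA` values in numeric columns with the column mean. The data frame `movies`, its values, and the helper `impute_mean` are made up for illustration and are not the original preprocessing code; only the column names are taken from the dataset.

```r
# Toy data standing in for a few columns of the movie data (values are invented).
movies <- data.frame(
  duration = c(90, NA, 120, 150),
  gross    = c(1e6, 5e6, NA, 3e6),
  color    = c("Color", "Color", NA, "Black and White"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace NA in a numeric vector with the mean of the
# non-missing values; non-numeric vectors are returned unchanged.
impute_mean <- function(x) {
  if (is.numeric(x)) {
    x[is.na(x)] <- mean(x, na.rm = TRUE)
  }
  x
}

# Apply the helper column by column while keeping the data frame structure.
movies_filled <- movies
movies_filled[] <- lapply(movies, impute_mean)

movies_filled
```

Character columns such as `color` are left as they are, since a mean is not defined for them.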

Start with the summary:

##     color           director_name      num_critic_for_reviews
##  Length:5043        Length:5043        Min.   :  1.0         
##  Class :character   Class :character   1st Qu.: 50.0         
##  Mode  :character   Mode  :character   Median :110.0         
##                                        Mean   :140.2         
##                                        3rd Qu.:195.0         
##                                        Max.   :813.0         
##                                        NA's   :50            
##     duration     director_facebook_likes actor_3_facebook_likes
##  Min.   :  7.0   Min.   :    0.0         Min.   :    0.0       
##  1st Qu.: 93.0   1st Qu.:    7.0         1st Qu.:  133.0       
##  Median :103.0   Median :   49.0         Median :  371.5       
##  Mean   :107.2   Mean   :  686.5         Mean   :  645.0       
##  3rd Qu.:118.0   3rd Qu.:  194.5         3rd Qu.:  636.0       
##  Max.   :511.0   Max.   :23000.0         Max.   :23000.0       
##  NA's   :15      NA's   :104             NA's   :23            
##  actor_2_name       actor_1_facebook_likes     gross          
##  Length:5043        Min.   :     0         Min.   :      162  
##  Class :character   1st Qu.:   614         1st Qu.:  5340988  
##  Mode  :character   Median :   988         Median : 25517500  
##                     Mean   :  6560         Mean   : 48468408  
##                     3rd Qu.: 11000         3rd Qu.: 62309438  
##                     Max.   :640000         Max.   :760505847  
##                     NA's   :7              NA's   :884        
##     genres          actor_1_name       movie_title       
##  Length:5043        Length:5043        Length:5043       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  num_voted_users   cast_total_facebook_likes actor_3_name      
##  Min.   :      5   Min.   :     0            Length:5043       
##  1st Qu.:   8594   1st Qu.:  1411            Class :character  
##  Median :  34359   Median :  3090            Mode  :character  
##  Mean   :  83668   Mean   :  9699                              
##  3rd Qu.:  96309   3rd Qu.: 13756                              
##  Max.   :1689764   Max.   :656730                              
##                                                                
##  facenumber_in_poster plot_keywords      movie_imdb_link   
##  Min.   : 0.000       Length:5043        Length:5043       
##  1st Qu.: 0.000       Class :character   Class :character  
##  Median : 1.000       Mode  :character   Mode  :character  
##  Mean   : 1.371                                            
##  3rd Qu.: 2.000                                            
##  Max.   :43.000                                            
##  NA's   :13                                                
##  num_user_for_reviews   language           country         
##  Min.   :   1.0       Length:5043        Length:5043       
##  1st Qu.:  65.0       Class :character   Class :character  
##  Median : 156.0       Mode  :character   Mode  :character  
##  Mean   : 272.8                                            
##  3rd Qu.: 326.0                                            
##  Max.   :5060.0                                            
##  NA's   :21                                                
##  content_rating         budget            title_year  
##  Length:5043        Min.   :2.180e+02   Min.   :1916  
##  Class :character   1st Qu.:6.000e+06   1st Qu.:1999  
##  Mode  :character   Median :2.000e+07   Median :2005  
##                     Mean   :3.975e+07   Mean   :2002  
##                     3rd Qu.:4.500e+07   3rd Qu.:2011  
##                     Max.   :1.222e+10   Max.   :2016  
##                     NA's   :492         NA's   :108   
##  actor_2_facebook_likes   imdb_score     aspect_ratio  
##  Min.   :     0         Min.   :1.600   Min.   : 1.18  
##  1st Qu.:   281         1st Qu.:5.800   1st Qu.: 1.85  
##  Median :   595         Median :6.600   Median : 2.35  
##  Mean   :  1652         Mean   :6.442   Mean   : 2.22  
##  3rd Qu.:   918         3rd Qu.:7.200   3rd Qu.: 2.35  
##  Max.   :137000         Max.   :9.500   Max.   :16.00  
##  NA's   :13                             NA's   :329    
##  movie_facebook_likes
##  Min.   :     0      
##  1st Qu.:     0      
##  Median :   166      
##  Mean   :  7526      
##  3rd Qu.:  3000      
##  Max.   :349000      
## 

Data distributions

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     1.0    50.0   110.0   140.2   195.0   813.0      50

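Among these 5,043 movies, the lowest number of critic reviews is 1 and the highest is 813; most movies received fewer than 200 reviews (the third quartile is 195).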

Most of the movies in the data were made after 2000.

The United States produced the largest number of movies.

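Because IMDB did not start until 1990 and the data are mostly American movies, I preprocess the data first and narrow the analysis to "US movies released after 2000".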
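The filtering step might look like the sketch below. It assumes that the country is coded as `"USA"` and that "after 2000" includes the year 2000 itself; both are assumptions, and the toy data frame is invented for illustration.

```r
# Toy data with the two columns used for filtering (values are invented).
movies <- data.frame(
  movie_title = c("A", "B", "C", "D"),
  country     = c("USA", "UK", "USA", "USA"),
  title_year  = c(1995, 2005, 2010, NA),
  stringsAsFactors = FALSE
)

# Keep US movies from 2000 onward. subset() treats an NA condition as FALSE,
# so movies with a missing title_year are dropped as well.
movie_us <- subset(movies, country == "USA" & title_year >= 2000)

movie_us
```

Using `subset()` rather than bracket indexing avoids rows of `NA` appearing in the result when `title_year` is missing.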

I want to see whether the IMDB score is related to the other variables, so I use regression analysis.

The variables below are the independent variables I will analyze:

Select a subset of the numeric variables for the regression model.
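A rough sketch of this step, on simulated data, is shown here. The column names match the dataset and the object name `movie_sub` matches the model call, but the values are randomly generated and only three predictors are used to keep the example short, so the numbers it prints will not match the output in this report.

```r
set.seed(1)
n <- 200

# Simulated stand-in for the movie data (not the real Kaggle values).
movies <- data.frame(
  movie_title            = paste0("movie_", seq_len(n)),
  num_critic_for_reviews = rpois(n, 140),
  duration               = round(rnorm(n, mean = 107, sd = 20)),
  facenumber_in_poster   = rpois(n, 1.4),
  stringsAsFactors = FALSE
)
movies$imdb_score <- 5.5 + 0.003 * movies$num_critic_for_reviews +
  0.003 * movies$duration - 0.03 * movies$facenumber_in_poster +
  rnorm(n, sd = 1)

# Keep only the numeric columns; character columns such as the title are dropped.
num_cols  <- vapply(movies, is.numeric, logical(1))
movie_sub <- movies[, num_cols]

# Ordinary least squares with imdb_score as the response.
fit <- lm(imdb_score ~ num_critic_for_reviews + duration + facenumber_in_poster,
          data = movie_sub)
summary(fit)
```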

## 
## Call:
## lm(formula = imdb_score ~ num_critic_for_reviews + duration + 
##     director_facebook_likes + actor_1_facebook_likes + gross + 
##     cast_total_facebook_likes + facenumber_in_poster + budget + 
##     movie_facebook_likes, data = movie_sub)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.6606 -0.5469  0.0856  0.6668  3.1902 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                5.510e+00  1.000e-01  55.074  < 2e-16 ***
## num_critic_for_reviews     2.815e-03  2.360e-04  11.931  < 2e-16 ***
## duration                   2.940e-03  1.029e-03   2.858 0.004296 ** 
## director_facebook_likes    3.126e-05  7.315e-06   4.273 2.00e-05 ***
## actor_1_facebook_likes     1.096e-05  3.644e-06   3.007 0.002665 ** 
## gross                      1.325e-09  4.062e-10   3.262 0.001120 ** 
## cast_total_facebook_likes -7.425e-06  3.152e-06  -2.356 0.018563 *  
## facenumber_in_poster      -2.935e-02  8.839e-03  -3.321 0.000909 ***
## budget                    -2.651e-09  6.285e-10  -4.218 2.54e-05 ***
## movie_facebook_likes       2.195e-06  1.201e-06   1.827 0.067763 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.024 on 2724 degrees of freedom
## Multiple R-squared:  0.1757, Adjusted R-squared:  0.173 
## F-statistic:  64.5 on 9 and 2724 DF,  p-value: < 2.2e-16

Only movie_facebook_likes has a p-value above 0.05 (p = 0.068), so I drop this variable and refit the regression.
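One way to drop a single predictor and refit is `update()`, sketched here on simulated data. The variable names `x1`, `x2`, `x3` and `y` are placeholders, and it is an assumption that the original refit was done this way rather than by retyping the formula.

```r
set.seed(2)
n <- 100

# Simulated data: x3 has no real effect on y.
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(n)

full <- lm(y ~ x1 + x2 + x3, data = d)

# Remove x3 from the formula and refit on the same data.
reduced <- update(full, . ~ . - x3)

# F-test comparing the nested models; a large p-value suggests that
# dropping x3 does not make the fit noticeably worse.
anova(reduced, full)
```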

## 
## Call:
## lm(formula = imdb_score ~ num_critic_for_reviews + duration + 
##     director_facebook_likes + actor_1_facebook_likes + gross + 
##     cast_total_facebook_likes + facenumber_in_poster + budget, 
##     data = movie_sub)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5274 -0.5455  0.0779  0.6589  3.2994 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                5.488e+00  9.935e-02  55.235  < 2e-16 ***
## num_critic_for_reviews     3.062e-03  1.936e-04  15.813  < 2e-16 ***
## duration                   2.930e-03  1.029e-03   2.848 0.004438 ** 
## director_facebook_likes    3.190e-05  7.310e-06   4.363 1.33e-05 ***
## actor_1_facebook_likes     1.029e-05  3.627e-06   2.838 0.004573 ** 
## gross                      1.402e-09  4.042e-10   3.469 0.000531 ***
## cast_total_facebook_likes -6.851e-06  3.138e-06  -2.183 0.029094 *  
## facenumber_in_poster      -2.861e-02  8.833e-03  -3.239 0.001213 ** 
## budget                    -2.738e-09  6.270e-10  -4.368 1.30e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.024 on 2725 degrees of freedom
## Multiple R-squared:  0.1747, Adjusted R-squared:  0.1722 
## F-statistic: 72.09 on 8 and 2725 DF,  p-value: < 2.2e-16

Draw the four plots for the model.
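Assuming the four plots are the default diagnostic plots that `plot()` produces for an `lm` object (residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage), they can be drawn on one page as in this sketch. The model here is fitted to simulated data and is only a placeholder.

```r
set.seed(3)

# Placeholder model on simulated data.
d <- data.frame(x = rnorm(50))
d$y <- 1 + 2 * d$x + rnorm(50)
fit <- lm(y ~ x, data = d)

# Arrange the four default diagnostic plots in a 2 x 2 grid,
# then restore the previous graphics settings.
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)
```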

Make predictions.

Compute the MAE.

## [1] 1
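The mean absolute error is the average of |actual - predicted|. The sketch below shows one way to get predictions and the MAE, assuming an 80/20 train/test split; the report does not say whether the MAE above was computed on held-out data, and the data here are simulated.

```r
set.seed(4)
n <- 200

# Simulated data standing in for the movie subset.
d <- data.frame(x = rnorm(n))
d$y <- 6 + 0.5 * d$x + rnorm(n)

# Hold out 20% of the rows for testing (the split ratio is an assumption).
train_idx <- sample(seq_len(n), size = 0.8 * n)
train <- d[train_idx, ]
test  <- d[-train_idx, ]

# Fit on the training rows and predict the held-out rows.
fit  <- lm(y ~ x, data = train)
pred <- predict(fit, newdata = test)

# Mean absolute error between observed and predicted values.
mae <- mean(abs(test$y - pred))
mae
```

An MAE of about 1 on the IMDB scale would mean that predictions are off by roughly one rating point on average.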

Conclusion

Duration is a significant factor in the movie score: the longer the movie, the higher the score.

The number of critic reviews matters: the more critic reviews a movie receives, the higher its score.

The number of faces in the poster has a negative effect on the movie score: the more faces on the poster, the lower the score.