双变量关系及建模

两个变量都是数值型（numeric）
应变量（response, dependent variable）
解释变量（explanatory）
你认为与应变量有关的东西
也可以叫做，independent variable, predictor

一般的模型表达式：

\[应变量=f(解释变量)+噪音\]

线性回归模型表达式：

\[应变量=截距+斜率*解释变量+噪音\]

线性回归模型–为什么这一条线最合适？

\[y_i= a + b*x_i+e\]

最小二乘法（least squares method ）

简单地说，最小二乘的思想就是要使得

** 观测点和估计点的距离的平方和达到最小 **.

这里的“二乘”指的是用平方来度量观测点与估计点的远近（在古汉语中“平方”称为“二乘”），“最小”指的是参数的估计值要保证各个观测点与估计点的距离的平方和达到最小。

线性回归模型:

\[y_i= a + b*x_i+e_i\]

\[e_i ~ N(0, \sigma)\]

预测值：

\[\hat{y_i} = \hat{a} + \hat{b}x_i\]

残差（error）:

\[e_i = y_i-\hat{y_i}\]

拟合的过程：有n个样本量，也就是n个数据对 \((x_i,y_i)\), 找到截距 \(a\) 和斜率 \(b\) ，使得残差平方和最小:

\[\sum_1^{n} e_i^2\]

** 关键概念 ** :

\(\hat{y_i}\) 是给定 \(x_i\) 值时， \(y\) 的期望值

\(\hat{b}\) 是真实但是未知的 \(b\) 的估计值

残差 \(e\)：residual, error, noise

最小二乘法求线性回归系数

\[y_i= a + b*x_i+e_i\]

\[\hat{y_i} = \hat{a} + \hat{b}x_i\]

\[e_i = y_i-\hat{y_i}\]

\[\sum_1^{n} e_i^2 = \sum (y_i-\hat{y_i})^2=\sum (y_i- \hat{a} -\hat{b}x_i)^2\]

最小二乘法求线性回归系数:

\[f=\sum_1^{n} e_i^2 =\sum (y_i- \hat{a} -\hat{b}x_i)^2\]

一阶导数等于 0：

\[\frac{\partial f}{\partial\hat{a}} = -2 \sum (y_i- \hat{a} -\hat{b}x_i)=0\]

\[\frac{\partial f}{\partial\hat{b}} = -2 \sum (y_i- \hat{a} -\hat{b}x_i)x_i=0\]

联立上面的二元一次方程组，求解 \(\hat{a}\) 与 \(\hat{b}\)

\[\hat{b}=\frac{\sum (x_i-\bar{x}) (y_i-\bar{y})}{\sum (x_i-\bar{x})^2}\]

\[\hat{a}= \bar{y} - \hat{b}\bar{x}\]

最小二乘法的特点:

唯一性
残差之和等于0
回归线必过

\((\bar{x}, \bar{y})\)

out <- lm(mpg~wt,data=mtcars)
summary(out)

## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

轶事：回归的由来

Regression to the mean is a concept attributed to Sir Francis Galton. The basic idea is that extreme random observations will tend to be less extreme upon a second trial. This is simply due to chance alone. Note that “regression to the mean” and “linear regression” are not the same thing.

孩子的身高和父母的身高：高个子的父母一般生的孩子个子也高，但是孩子一般没有父母那么高，孩子的身高会趋向于平均数。
这就是我们没有看到乔丹的儿子能够和乔丹一样成为飞人的原因

\[y_i= a + b*x_i+e_i\]

回归模型的目标是量化变量之间的关系：

关键系数的大小
关键系数的显著度
模型的解释力
模型的预测力

## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

理解模型：

残差（Residuals）： Residuals show if the predicted response values are close or not to the response values that the model predicts.

\[e_i = y_i-\hat{y_i}\]

系数估计（Estimate coefficient）： How y changes when x changes by one unit

\[\hat{b}=\frac{\sum (x_i-\bar{x}) (y_i-\bar{y})}{\sum (x_i-\bar{x})^2}\]

\[\hat{a}= \bar{y} - \hat{b}\bar{x}\]

系数估计的标准差（Standard error）： Standard error measures how the coefficient estimates can vary from the actual average value of the response variable (i.e. if the model is run more times). 每一次数据不同，结果都会有差异，但是该差异在一定范围内。
显著度与P值（Significance and P value）： Test of significance of the model shows that there is strong evidence of a linear relationship between the variables. This is visually interpreted by the significance stars *** at the end of the row. This level is a threshold that is used to indicate real findings, and not the ones by chance alone.

For each estimated regression coefficient, the variable’s p-Value Pr(>|t|) provides an estimate of the probability that the true coefficient is zero given the value of the estimate. More the number of stars near the p-Value are, more significant is the variable.

P值多少算显著？ Typically, a p-value < 0.05 is a good cut-off point), null hypothesis can be safely rejected. In the current case, p-values are well below the 0.05 threshold, so the model is indeed statistically significant.

R-squared 以及 adjusted R-squared

0-1 之间
越大说明自变量对因变量的解释越多
但模型的好坏并不完全取决于R

\[R^2=1-\frac{var(e)}{var(y)}\]

电影票房收入预测

读取数据：

my_data <- read.csv("https://www.dropbox.com/s/sy790xviuxn8psp/movie_metadata.csv?dl=1")

观察数据结构：

str(my_data)

## 'data.frame':    5043 obs. of  28 variables:
##  $ color                    : Factor w/ 3 levels ""," Black and White",..: 3 3 3 3 1 3 3 3 3 3 ...
##  $ director_name            : Factor w/ 2399 levels "","A. Raven Cruz",..: 927 801 2027 377 603 106 2030 1652 1228 551 ...
##  $ num_critic_for_reviews   : int  723 302 602 813 NA 462 392 324 635 375 ...
##  $ duration                 : int  178 169 148 164 NA 132 156 100 141 153 ...
##  $ director_facebook_likes  : int  0 563 0 22000 131 475 0 15 0 282 ...
##  $ actor_3_facebook_likes   : int  855 1000 161 23000 NA 530 4000 284 19000 10000 ...
##  $ actor_2_name             : Factor w/ 3033 levels "","50 Cent","A. Michael Baldwin",..: 1407 2218 2488 534 2432 2549 1227 801 2439 653 ...
##  $ actor_1_facebook_likes   : int  1000 40000 11000 27000 131 640 24000 799 26000 25000 ...
##  $ gross                    : int  760505847 309404152 200074175 448130642 NA 73058679 336530303 200807262 458991599 301956980 ...
##  $ genres                   : Factor w/ 914 levels "Action","Action|Adventure",..: 107 101 128 288 754 126 120 308 126 447 ...
##  $ actor_1_name             : Factor w/ 2098 levels "","50 Cent","A.J. Buckley",..: 302 979 353 1968 526 440 785 221 336 32 ...
##  $ movie_title              : Factor w/ 4917 levels "[Rec] ","[Rec] 2 ",..: 398 2731 3279 3708 3332 1961 3291 3459 399 1631 ...
##  $ num_voted_users          : int  886204 471220 275868 1144337 8 212204 383056 294810 462669 321795 ...
##  $ cast_total_facebook_likes: int  4834 48350 11700 106759 143 1873 46055 2036 92000 58753 ...
##  $ actor_3_name             : Factor w/ 3522 levels "","50 Cent","A.J. Buckley",..: 3442 1392 3134 1769 1 2714 1969 2162 3018 2941 ...
##  $ facenumber_in_poster     : int  0 0 1 0 0 1 0 1 4 3 ...
##  $ plot_keywords            : Factor w/ 4761 levels "","10 year old|dog|florida|girl|supermarket",..: 1320 4283 2076 3484 1 651 4745 29 1142 2005 ...
##  $ movie_imdb_link          : Factor w/ 4919 levels "http://www.imdb.com/title/tt0006864/?ref_=fn_tt_tt_1",..: 2965 2721 4533 3756 4918 2476 2526 2458 4546 2551 ...
##  $ num_user_for_reviews     : int  3054 1238 994 2701 NA 738 1902 387 1117 973 ...
##  $ language                 : Factor w/ 48 levels "","Aboriginal",..: 13 13 13 13 1 13 13 13 13 13 ...
##  $ country                  : Factor w/ 66 levels "","Afghanistan",..: 65 65 63 65 1 65 65 65 65 63 ...
##  $ content_rating           : Factor w/ 19 levels "","Approved",..: 10 10 10 10 1 10 10 9 10 9 ...
##  $ budget                   : num  2.37e+08 3.00e+08 2.45e+08 2.50e+08 NA ...
##  $ title_year               : int  2009 2007 2015 2012 NA 2012 2007 2010 2015 2009 ...
##  $ actor_2_facebook_likes   : int  936 5000 393 23000 12 632 11000 553 21000 11000 ...
##  $ imdb_score               : num  7.9 7.1 6.8 8.5 7.1 6.6 6.2 7.8 7.5 7.5 ...
##  $ aspect_ratio             : num  1.78 2.35 2.35 2.35 NA 2.35 2.35 1.85 2.35 2.35 ...
##  $ movie_facebook_likes     : int  33000 0 85000 164000 0 24000 0 29000 118000 10000 ...

仅查看变量名称：

names(my_data)

##  [1] "color"                     "director_name"            
##  [3] "num_critic_for_reviews"    "duration"                 
##  [5] "director_facebook_likes"   "actor_3_facebook_likes"   
##  [7] "actor_2_name"              "actor_1_facebook_likes"   
##  [9] "gross"                     "genres"                   
## [11] "actor_1_name"              "movie_title"              
## [13] "num_voted_users"           "cast_total_facebook_likes"
## [15] "actor_3_name"              "facenumber_in_poster"     
## [17] "plot_keywords"             "movie_imdb_link"          
## [19] "num_user_for_reviews"      "language"                 
## [21] "country"                   "content_rating"           
## [23] "budget"                    "title_year"               
## [25] "actor_2_facebook_likes"    "imdb_score"               
## [27] "aspect_ratio"              "movie_facebook_likes"

查看数据的描述性统计：

summary(my_data)

##               color               director_name  num_critic_for_reviews
##                  :  19                   : 104   Min.   :  1.0         
##   Black and White: 209   Steven Spielberg:  26   1st Qu.: 50.0         
##  Color           :4815   Woody Allen     :  22   Median :110.0         
##                          Clint Eastwood  :  20   Mean   :140.2         
##                          Martin Scorsese :  20   3rd Qu.:195.0         
##                          Ridley Scott    :  17   Max.   :813.0         
##                          (Other)         :4834   NA's   :50            
##     duration     director_facebook_likes actor_3_facebook_likes
##  Min.   :  7.0   Min.   :    0.0         Min.   :    0.0       
##  1st Qu.: 93.0   1st Qu.:    7.0         1st Qu.:  133.0       
##  Median :103.0   Median :   49.0         Median :  371.5       
##  Mean   :107.2   Mean   :  686.5         Mean   :  645.0       
##  3rd Qu.:118.0   3rd Qu.:  194.5         3rd Qu.:  636.0       
##  Max.   :511.0   Max.   :23000.0         Max.   :23000.0       
##  NA's   :15      NA's   :104             NA's   :23            
##           actor_2_name  actor_1_facebook_likes     gross          
##  Morgan Freeman :  20   Min.   :     0         Min.   :      162  
##  Charlize Theron:  15   1st Qu.:   614         1st Qu.:  5340988  
##  Brad Pitt      :  14   Median :   988         Median : 25517500  
##                 :  13   Mean   :  6560         Mean   : 48468408  
##  James Franco   :  11   3rd Qu.: 11000         3rd Qu.: 62309438  
##  Meryl Streep   :  11   Max.   :640000         Max.   :760505847  
##  (Other)        :4959   NA's   :7              NA's   :884        
##                   genres                actor_1_name 
##  Drama               : 236   Robert De Niro   :  49  
##  Comedy              : 209   Johnny Depp      :  41  
##  Comedy|Drama        : 191   Nicolas Cage     :  33  
##  Comedy|Drama|Romance: 187   J.K. Simmons     :  31  
##  Comedy|Romance      : 158   Bruce Willis     :  30  
##  Drama|Romance       : 152   Denzel Washington:  30  
##  (Other)             :3910   (Other)          :4829  
##                     movie_title   num_voted_users  
##  Ben-Hur                  :   3   Min.   :      5  
##  Halloween                :   3   1st Qu.:   8594  
##  Home                     :   3   Median :  34359  
##  King Kong                :   3   Mean   :  83668  
##  Pan                      :   3   3rd Qu.:  96309  
##  The Fast and the Furious :   3   Max.   :1689764  
##  (Other)                  :5025                    
##  cast_total_facebook_likes         actor_3_name  facenumber_in_poster
##  Min.   :     0                          :  23   Min.   : 0.000      
##  1st Qu.:  1411            Ben Mendelsohn:   8   1st Qu.: 0.000      
##  Median :  3090            John Heard    :   8   Median : 1.000      
##  Mean   :  9699            Steve Coogan  :   8   Mean   : 1.371      
##  3rd Qu.: 13756            Anne Hathaway :   7   3rd Qu.: 2.000      
##  Max.   :656730            Jon Gries     :   7   Max.   :43.000      
##                            (Other)       :4982   NA's   :13          
##                                                                            plot_keywords 
##                                                                                   : 153  
##  based on novel                                                                   :   4  
##  1940s|child hero|fantasy world|orphan|reference to peter pan                     :   3  
##  alien friendship|alien invasion|australia|flying car|mother daughter relationship:   3  
##  animal name in title|ape abducts a woman|gorilla|island|king kong                :   3  
##  assistant|experiment|frankenstein|medical student|scientist                      :   3  
##  (Other)                                                                          :4874  
##                                              movie_imdb_link
##  http://www.imdb.com/title/tt0077651/?ref_=fn_tt_tt_1:   3  
##  http://www.imdb.com/title/tt0232500/?ref_=fn_tt_tt_1:   3  
##  http://www.imdb.com/title/tt0360717/?ref_=fn_tt_tt_1:   3  
##  http://www.imdb.com/title/tt1976009/?ref_=fn_tt_tt_1:   3  
##  http://www.imdb.com/title/tt2224026/?ref_=fn_tt_tt_1:   3  
##  http://www.imdb.com/title/tt2638144/?ref_=fn_tt_tt_1:   3  
##  (Other)                                             :5025  
##  num_user_for_reviews     language         country       content_rating
##  Min.   :   1.0       English :4704   USA      :3807   R        :2118  
##  1st Qu.:  65.0       French  :  73   UK       : 448   PG-13    :1461  
##  Median : 156.0       Spanish :  40   France   : 154   PG       : 701  
##  Mean   : 272.8       Hindi   :  28   Canada   : 126            : 303  
##  3rd Qu.: 326.0       Mandarin:  26   Germany  :  97   Not Rated: 116  
##  Max.   :5060.0       German  :  19   Australia:  55   G        : 112  
##  NA's   :21           (Other) : 153   (Other)  : 356   (Other)  : 232  
##      budget            title_year   actor_2_facebook_likes   imdb_score   
##  Min.   :2.180e+02   Min.   :1916   Min.   :     0         Min.   :1.600  
##  1st Qu.:6.000e+06   1st Qu.:1999   1st Qu.:   281         1st Qu.:5.800  
##  Median :2.000e+07   Median :2005   Median :   595         Median :6.600  
##  Mean   :3.975e+07   Mean   :2002   Mean   :  1652         Mean   :6.442  
##  3rd Qu.:4.500e+07   3rd Qu.:2011   3rd Qu.:   918         3rd Qu.:7.200  
##  Max.   :1.222e+10   Max.   :2016   Max.   :137000         Max.   :9.500  
##  NA's   :492         NA's   :108    NA's   :13                            
##   aspect_ratio   movie_facebook_likes
##  Min.   : 1.18   Min.   :     0      
##  1st Qu.: 1.85   1st Qu.:     0      
##  Median : 2.35   Median :   166      
##  Mean   : 2.22   Mean   :  7526      
##  3rd Qu.: 2.35   3rd Qu.:  3000      
##  Max.   :16.00   Max.   :349000      
##  NA's   :329

将票房收入和投资转换成以百万为单位。
除掉有缺失值的样本，除去不相关变量，例如 “movie_imdb_link”，或者“director_name”。
将电影分级这一变量里面的类别重新划分，限制级，非限制级： movie rating system: https://en.wikipedia.org/wiki/Motion_Picture_Association_of_America_film_rating_system

my_data <- read.csv("https://www.dropbox.com/s/sy790xviuxn8psp/movie_metadata.csv?dl=1") 

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

my_data <- my_data %>%
  na.omit() %>%
  mutate(gross=gross/1000000, budget=budget/1000000, USA = as.numeric(country== "USA"), restricted=as.numeric(content_rating== "PG-13"|content_rating=="R"|content_rating== "NC-17")) %>%
  select(-actor_1_name, -actor_2_name, -actor_3_name, -actor_2_facebook_likes,  -actor_3_facebook_likes, -movie_imdb_link, -plot_keywords, -color, -director_name, -cast_total_facebook_likes, -country, -content_rating)

summary(my_data)

##  num_critic_for_reviews    duration     director_facebook_likes
##  Min.   :  1.0          Min.   : 37.0   Min.   :    0.0        
##  1st Qu.: 75.0          1st Qu.: 96.0   1st Qu.:   11.0        
##  Median :137.0          Median :106.0   Median :   62.0        
##  Mean   :165.8          Mean   :110.2   Mean   :  798.6        
##  3rd Qu.:223.0          3rd Qu.:120.0   3rd Qu.:  234.0        
##  Max.   :813.0          Max.   :330.0   Max.   :23000.0        
##                                                                
##  actor_1_facebook_likes     gross                           genres    
##  Min.   :     0         Min.   :  0.0002   Drama               : 150  
##  1st Qu.:   736         1st Qu.:  7.6895   Comedy|Drama|Romance: 148  
##  Median :  1000         Median : 29.2000   Comedy|Drama        : 142  
##  Mean   :  7673         Mean   : 52.0058   Comedy              : 138  
##  3rd Qu.: 13000         3rd Qu.: 66.4664   Comedy|Romance      : 131  
##  Max.   :640000         Max.   :760.5058   Drama|Romance       : 116  
##                                            (Other)             :2976  
##                     movie_title   num_voted_users   facenumber_in_poster
##  Halloween                :   3   Min.   :      5   Min.   : 0.000      
##  Home                     :   3   1st Qu.:  18915   1st Qu.: 0.000      
##  King Kong                :   3   Median :  53028   Median : 1.000      
##  Pan                      :   3   Mean   : 104677   Mean   : 1.379      
##  The Fast and the Furious :   3   3rd Qu.: 126916   3rd Qu.: 2.000      
##  Victor Frankenstein      :   3   Max.   :1689764   Max.   :43.000      
##  (Other)                  :3783                                         
##  num_user_for_reviews     language        budget           title_year  
##  Min.   :   1.0       English :3626   Min.   :    0.00   Min.   :1920  
##  1st Qu.: 107.0       French  :  36   1st Qu.:   10.00   1st Qu.:1999  
##  Median : 208.0       Spanish :  23   Median :   25.00   Median :2005  
##  Mean   : 333.6       Mandarin:  15   Mean   :   45.85   Mean   :2003  
##  3rd Qu.: 397.0       German  :  13   3rd Qu.:   50.00   3rd Qu.:2010  
##  Max.   :5060.0       Japanese:  12   Max.   :12215.50   Max.   :2016  
##                       (Other) :  76                                    
##    imdb_score     aspect_ratio   movie_facebook_likes      USA        
##  Min.   :1.600   Min.   : 1.18   Min.   :     0       Min.   :0.0000  
##  1st Qu.:5.900   1st Qu.: 1.85   1st Qu.:     0       1st Qu.:1.0000  
##  Median :6.600   Median : 2.35   Median :   224       Median :1.0000  
##  Mean   :6.466   Mean   : 2.11   Mean   :  9262       Mean   :0.7906  
##  3rd Qu.:7.200   3rd Qu.: 2.35   3rd Qu.: 11000       3rd Qu.:1.0000  
##  Max.   :9.300   Max.   :16.00   Max.   :349000       Max.   :1.0000  
##                                                                       
##    restricted    
##  Min.   :0.0000  
##  1st Qu.:1.0000  
##  Median :1.0000  
##  Mean   :0.7966  
##  3rd Qu.:1.0000  
##  Max.   :1.0000  
##

发现有的电影budget异常的高，剔除：

library(dplyr)

my_data <- my_data %>%
  filter(budget<200)

summary(my_data)

##  num_critic_for_reviews    duration     director_facebook_likes
##  Min.   :  1.0          Min.   : 37.0   Min.   :    0.0        
##  1st Qu.: 75.0          1st Qu.: 95.0   1st Qu.:   11.0        
##  Median :135.0          Median :106.0   Median :   62.0        
##  Mean   :161.7          Mean   :109.6   Mean   :  794.8        
##  3rd Qu.:219.0          3rd Qu.:119.0   3rd Qu.:  232.0        
##  Max.   :775.0          Max.   :330.0   Max.   :23000.0        
##                                                                
##  actor_1_facebook_likes     gross                           genres    
##  Min.   :     0         Min.   :  0.0002   Drama               : 149  
##  1st Qu.:   733         1st Qu.:  7.5634   Comedy|Drama|Romance: 148  
##  Median :  1000         Median : 28.6375   Comedy|Drama        : 142  
##  Mean   :  7555         Mean   : 48.8620   Comedy              : 138  
##  3rd Qu.: 12000         3rd Qu.: 64.3712   Comedy|Romance      : 131  
##  Max.   :640000         Max.   :652.1773   Drama|Romance       : 115  
##                                            (Other)             :2910  
##                      movie_title   num_voted_users   facenumber_in_poster
##  Halloween                 :   3   Min.   :      5   Min.   : 0.000      
##  Home                      :   3   1st Qu.:  18632   1st Qu.: 0.000      
##  Pan                       :   3   Median :  51996   Median : 1.000      
##  The Fast and the Furious  :   3   Mean   : 100764   Mean   : 1.383      
##  Victor Frankenstein       :   3   3rd Qu.: 120798   3rd Qu.: 2.000      
##  A Nightmare on Elm Street :   2   Max.   :1689764   Max.   :43.000      
##  (Other)                   :3716                                         
##  num_user_for_reviews      language        budget           title_year  
##  Min.   :   1.0       English  :3571   Min.   :2.20e-04   Min.   :1920  
##  1st Qu.: 106.0       French   :  36   1st Qu.:1.00e+01   1st Qu.:1999  
##  Median : 204.0       Spanish  :  22   Median :2.40e+01   Median :2004  
##  Mean   : 320.6       Mandarin :  14   Mean   :3.58e+01   Mean   :2003  
##  3rd Qu.: 388.0       German   :  13   3rd Qu.:5.00e+01   3rd Qu.:2010  
##  Max.   :5060.0       Cantonese:   8   Max.   :1.95e+02   Max.   :2016  
##                       (Other)  :  69                                    
##    imdb_score     aspect_ratio    movie_facebook_likes      USA        
##  Min.   :1.600   Min.   : 1.180   Min.   :     0       Min.   :0.0000  
##  1st Qu.:5.900   1st Qu.: 1.850   1st Qu.:     0       1st Qu.:1.0000  
##  Median :6.600   Median : 2.350   Median :   206       Median :1.0000  
##  Mean   :6.456   Mean   : 2.108   Mean   :  8808       Mean   :0.7932  
##  3rd Qu.:7.200   3rd Qu.: 2.350   3rd Qu.: 11000       3rd Qu.:1.0000  
##  Max.   :9.300   Max.   :16.000   Max.   :349000       Max.   :1.0000  
##                                                                        
##    restricted    
##  Min.   :0.0000  
##  1st Qu.:1.0000  
##  Median :1.0000  
##  Mean   :0.7961  
##  3rd Qu.:1.0000  
##  Max.   :1.0000  
##

相关系数，精确到三位小数点（用round这个函数）

library(corrplot)

## corrplot 0.84 loaded

my_data %>%
  select(-movie_title, -genres, -language, -title_year) %>%
  cor() %>% 
  round(3) %>%
  corrplot(tl.srt=45)

使用下面的公式，计算相关性的P值，是否显著相关：

library(corrplot)

# mat : is a matrix of data
# ... : further arguments to pass to the native R cor.test function
cor.mtest <- function(mat, ...) {
    mat <- as.matrix(mat)
    n <- ncol(mat)
    p.mat<- matrix(NA, n, n)
    diag(p.mat) <- 0
    for (i in 1:(n - 1)) {
        for (j in (i + 1):n) {
            tmp <- cor.test(mat[, i], mat[, j], ...)
            p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
        }
    }
  colnames(p.mat) <- rownames(p.mat) <- colnames(mat)
  p.mat
}
# matrix of the p-value of the correlation
p.mat <- my_data %>% 
  select(-movie_title, -genres, -language, -title_year) %>%
  cor.mtest()

绘制关联矩阵：

my_data %>%
  select(-movie_title, -genres, -language, -title_year) %>%
  cor() %>% 
  round(3) %>%
  corrplot(type="upper", p.mat = p.mat, sig.level = 0.05, tl.srt=45, method = "circle")

将数据随机分成训练数据和测试数据：

set.seed(100) 
trainingRowIndex <- sample(1:nrow(my_data), 0.7*nrow(my_data)) 

training_data <- my_data[trainingRowIndex, ] # model training data
testing_data <- my_data[-trainingRowIndex, ] # test data

最关心的变量 gross, budget, imdb_score，看一看他们的分布：

library(ggplot2)
library(dplyr)

training_data %>%
  ggplot(aes(x=gross))+
  geom_histogram(color='black', fill='white') + 
  ylab('counts') + 
  xlab('Gross Income (million)')

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  ggtitle('gross income')

## $title
## [1] "gross income"
## 
## attr(,"class")
## [1] "labels"

library(ggplot2)
library(dplyr)

training_data %>%
  ggplot(aes(x=gross))+
  geom_histogram(color='black', fill='white') + 
  geom_vline(aes(xintercept=mean(gross)), linetype="dashed", color="blue", size=1)+
  geom_vline(aes(xintercept=median(gross)), linetype="dashed", color="red", size=1)+
  ylab('counts') + 
  xlab('Gross Income (million)')

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  ggtitle('gross income')

## $title
## [1] "gross income"
## 
## attr(,"class")
## [1] "labels"

library(ggplot2)
library(dplyr)

training_data %>%
  filter(gross<400) %>%
  ggplot(aes(x=gross))+
  geom_histogram(color='black', fill='white') + 
  geom_vline(aes(xintercept=mean(gross)), linetype="dashed", color="blue", size=1)+
  geom_vline(aes(xintercept=median(gross)), linetype="dashed", color="red", size=1)+
  ylab('counts') + 
  xlab('Gross Income (million)')

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  ggtitle('gross income')

## $title
## [1] "gross income"
## 
## attr(,"class")
## [1] "labels"

library(ggplot2)
library(dplyr)

training_data %>%
  ggplot(aes(x=budget))+
  geom_histogram(color='black', fill='white') + 
  geom_vline(aes(xintercept=mean(budget)), linetype="dashed", color="blue", size=1)+
  geom_vline(aes(xintercept=median(budget)), linetype="dashed", color="red", size=1)+
  ylab('counts') + 
  xlab('Budget (million)')

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  ggtitle('Budget')

## $title
## [1] "Budget"
## 
## attr(,"class")
## [1] "labels"

library(ggplot2)
library(dplyr)

training_data %>%
  ggplot(aes(x=imdb_score))+
  geom_histogram(color='black', fill='white') + 
  ylab('counts') + 
  ggtitle('imdb_score')

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

library(ggplot2)
library(dplyr)

training_data %>%
  ggplot(aes(x=imdb_score))+
  geom_histogram(aes(y=..density..), color='black', fill='white') + 
  ylab('percent') + 
  ggtitle('imdb_score')

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

library(ggplot2)
library(dplyr)

training_data %>%
  ggplot(aes(x=imdb_score))+
  geom_histogram(aes(y=..density..), binwidth=0.5, color='black', fill='white') + 
  geom_vline(aes(xintercept=mean(imdb_score)), linetype="dashed", color="blue", size=1)+
  ylab('percent') + 
  ggtitle('imdb_score')

str(training_data)

## 'data.frame':    2613 obs. of  18 variables:
##  $ num_critic_for_reviews : int  303 72 20 354 141 229 14 135 81 249 ...
##  $ duration               : int  136 127 106 129 124 121 108 105 101 119 ...
##  $ director_facebook_likes: int  3000 65 277 235 159 353 208 125 13 218 ...
##  $ actor_1_facebook_likes : int  14000 788 22000 13000 288 45 876 18000 651 13000 ...
##  $ gross                  : num  42.6 14.9 10.6 134.5 30 ...
##  $ genres                 : Factor w/ 914 levels "Action","Action|Adventure",..: 708 536 571 128 550 844 357 700 697 280 ...
##  $ movie_title            : Factor w/ 4917 levels "[Rec] ","[Rec] 2 ",..: 123 3938 4756 2163 1186 568 3506 2744 106 3604 ...
##  $ num_voted_users        : int  136093 2581 17349 336235 75973 59462 6384 23916 14147 407601 ...
##  $ facenumber_in_poster   : int  3 1 0 1 1 0 0 5 3 1 ...
##  $ num_user_for_reviews   : int  416 29 42 782 409 300 54 50 92 849 ...
##  $ language               : Factor w/ 48 levels "","Aboriginal",..: 13 13 13 13 13 13 13 13 13 13 ...
##  $ budget                 : num  40 25 20 110 25 25 6.5 35 20 60 ...
##  $ title_year             : int  2014 2016 1989 2007 1998 2008 1993 2012 2003 2002 ...
##  $ imdb_score             : num  6.1 7.3 6 7.2 7.5 6.6 6.2 5.7 5.6 7.9 ...
##  $ aspect_ratio           : num  2.35 2.35 2.35 2.35 1.85 1.85 1.85 2.35 1.85 2.35 ...
##  $ movie_facebook_likes   : int  24000 0 855 0 0 0 391 0 401 0 ...
##  $ USA                    : num  1 0 1 1 0 0 1 1 1 1 ...
##  $ restricted             : num  1 1 1 1 1 1 0 1 1 1 ...

# Model specification using lm
fullModel <- lm(gross ~.-genres -movie_title -language, 
                        data = my_data)

# Looking at model summary
summary(fullModel)

## 
## Call:
## lm(formula = gross ~ . - genres - movie_title - language, data = my_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -322.28  -19.02   -5.07   12.45  452.31 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              9.001e+02  1.726e+02   5.216 1.93e-07 ***
## num_critic_for_reviews   2.760e-02  1.011e-02   2.730 0.006363 ** 
## duration                -1.183e-01  3.666e-02  -3.226 0.001267 ** 
## director_facebook_likes -9.040e-04  2.376e-04  -3.805 0.000144 ***
## actor_1_facebook_likes  -3.751e-05  4.535e-05  -0.827 0.408228    
## num_voted_users          1.818e-04  8.823e-06  20.601  < 2e-16 ***
## facenumber_in_poster    -3.166e-02  3.357e-01  -0.094 0.924872    
## num_user_for_reviews     4.483e-03  3.020e-03   1.484 0.137841    
## budget                   7.181e-01  2.237e-02  32.098  < 2e-16 ***
## title_year              -4.395e-01  8.602e-02  -5.109 3.40e-07 ***
## imdb_score               4.169e-01  8.269e-01   0.504 0.614202    
## aspect_ratio            -4.289e+00  2.022e+00  -2.121 0.033968 *  
## movie_facebook_likes    -3.824e-06  4.850e-05  -0.079 0.937159    
## USA                      1.777e+01  1.743e+00  10.196  < 2e-16 ***
## restricted              -1.845e+01  1.816e+00 -10.158  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41.58 on 3718 degrees of freedom
## Multiple R-squared:  0.5592, Adjusted R-squared:  0.5575 
## F-statistic: 336.9 on 14 and 3718 DF,  p-value: < 2.2e-16

使用step函数，剔除无关变量： Choose a model by AIC in a Stepwise Algorithm.

In the absence of subject-matter expertise, stepwise regression can assist with the search for the most important predictors of the outcome of interest.

# Model specification using lm
fullModel <- lm(gross ~.-genres -movie_title -language, 
                        data = my_data)

newModel <- step(fullModel)

## Start:  AIC=27846.05
## gross ~ (num_critic_for_reviews + duration + director_facebook_likes + 
##     actor_1_facebook_likes + genres + movie_title + num_voted_users + 
##     facenumber_in_poster + num_user_for_reviews + language + 
##     budget + title_year + imdb_score + aspect_ratio + movie_facebook_likes + 
##     USA + restricted) - genres - movie_title - language
## 
##                           Df Sum of Sq     RSS   AIC
## - movie_facebook_likes     1        11 6429210 27844
## - facenumber_in_poster     1        15 6429214 27844
## - imdb_score               1       439 6429638 27844
## - actor_1_facebook_likes   1      1183 6430382 27845
## <none>                                 6429199 27846
## - num_user_for_reviews     1      3809 6433008 27846
## - aspect_ratio             1      7781 6436980 27849
## - num_critic_for_reviews   1     12888 6442087 27852
## - duration                 1     17993 6447192 27854
## - director_facebook_likes  1     25038 6454237 27859
## - title_year               1     45133 6474331 27870
## - restricted               1    178429 6607628 27946
## - USA                      1    179759 6608958 27947
## - num_voted_users          1    733845 7163044 28248
## - budget                   1   1781578 8210777 28757
## 
## Step:  AIC=27844.06
## gross ~ num_critic_for_reviews + duration + director_facebook_likes + 
##     actor_1_facebook_likes + num_voted_users + facenumber_in_poster + 
##     num_user_for_reviews + budget + title_year + imdb_score + 
##     aspect_ratio + USA + restricted
## 
##                           Df Sum of Sq     RSS   AIC
## - facenumber_in_poster     1        16 6429226 27842
## - imdb_score               1       448 6429657 27842
## - actor_1_facebook_likes   1      1180 6430390 27843
## <none>                                 6429210 27844
## - num_user_for_reviews     1      4082 6433291 27844
## - aspect_ratio             1      7770 6436980 27847
## - num_critic_for_reviews   1     17160 6446369 27852
## - duration                 1     18274 6447484 27853
## - director_facebook_likes  1     25043 6454252 27857
## - title_year               1     45529 6474739 27868
## - restricted               1    178817 6608026 27944
## - USA                      1    179753 6608962 27945
## - num_voted_users          1    791187 7220396 28275
## - budget                   1   1786771 8215981 28758
## 
## Step:  AIC=27842.07
## gross ~ num_critic_for_reviews + duration + director_facebook_likes + 
##     actor_1_facebook_likes + num_voted_users + num_user_for_reviews + 
##     budget + title_year + imdb_score + aspect_ratio + USA + restricted
## 
##                           Df Sum of Sq     RSS   AIC
## - imdb_score               1       460 6429686 27840
## - actor_1_facebook_likes   1      1200 6430425 27841
## <none>                                 6429226 27842
## - num_user_for_reviews     1      4146 6433372 27842
## - aspect_ratio             1      7772 6436997 27845
## - num_critic_for_reviews   1     17207 6446432 27850
## - duration                 1     18499 6447725 27851
## - director_facebook_likes  1     25035 6454261 27855
## - title_year               1     45877 6475102 27867
## - restricted               1    178986 6608212 27943
## - USA                      1    179791 6609016 27943
## - num_voted_users          1    793978 7223203 28275
## - budget                   1   1789904 8219130 28757
## 
## Step:  AIC=27840.33
## gross ~ num_critic_for_reviews + duration + director_facebook_likes + 
##     actor_1_facebook_likes + num_voted_users + num_user_for_reviews + 
##     budget + title_year + aspect_ratio + USA + restricted
## 
##                           Df Sum of Sq     RSS   AIC
## - actor_1_facebook_likes   1      1150 6430836 27839
## <none>                                 6429686 27840
## - num_user_for_reviews     1      3844 6433530 27841
## - aspect_ratio             1      7827 6437514 27843
## - duration                 1     18318 6448004 27849
## - num_critic_for_reviews   1     19754 6449440 27850
## - director_facebook_likes  1     24878 6454565 27853
## - title_year               1     48581 6478268 27866
## - restricted               1    180981 6610667 27942
## - USA                      1    181609 6611295 27942
## - num_voted_users          1    900134 7329820 28328
## - budget                   1   1877049 8306735 28794
## 
## Step:  AIC=27839
## gross ~ num_critic_for_reviews + duration + director_facebook_likes + 
##     num_voted_users + num_user_for_reviews + budget + title_year + 
##     aspect_ratio + USA + restricted
## 
##                           Df Sum of Sq     RSS   AIC
## <none>                                 6430836 27839
## - num_user_for_reviews     1      4066 6434903 27839
## - aspect_ratio             1      7890 6438726 27842
## - duration                 1     18513 6449349 27848
## - num_critic_for_reviews   1     19493 6450330 27848
## - director_facebook_likes  1     25249 6456085 27852
## - title_year               1     49239 6480076 27866
## - USA                      1    180592 6611428 27940
## - restricted               1    182472 6613308 27941
## - num_voted_users          1    901931 7332767 28327
## - budget                   1   1880622 8311458 28795

summary(newModel)

## 
## Call:
## lm(formula = gross ~ num_critic_for_reviews + duration + director_facebook_likes + 
##     num_voted_users + num_user_for_reviews + budget + title_year + 
##     aspect_ratio + USA + restricted, data = my_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -322.59  -19.11   -4.87   12.50  452.39 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              9.240e+02  1.685e+02   5.485 4.42e-08 ***
## num_critic_for_reviews   2.808e-02  8.361e-03   3.359 0.000790 ***
## duration                -1.143e-01  3.490e-02  -3.273 0.001073 ** 
## director_facebook_likes -9.062e-04  2.371e-04  -3.823 0.000134 ***
## num_voted_users          1.823e-04  7.980e-06  22.848  < 2e-16 ***
## num_user_for_reviews     4.462e-03  2.908e-03   1.534 0.125089    
## budget                   7.143e-01  2.165e-02  32.992  < 2e-16 ***
## title_year              -4.503e-01  8.435e-02  -5.338 9.94e-08 ***
## aspect_ratio            -4.316e+00  2.019e+00  -2.137 0.032666 *  
## USA                      1.755e+01  1.716e+00  10.224  < 2e-16 ***
## restricted              -1.856e+01  1.807e+00 -10.277  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41.57 on 3722 degrees of freedom
## Multiple R-squared:  0.5591, Adjusted R-squared:  0.5579 
## F-statistic: 471.9 on 10 and 3722 DF,  p-value: < 2.2e-16

newFormula <- as.formula(summary(newModel)$call)
newFormula

## gross ~ num_critic_for_reviews + duration + director_facebook_likes + 
##     num_voted_users + num_user_for_reviews + budget + title_year + 
##     aspect_ratio + USA + restricted

bestModel<- training_data %>%
  lm(formula=newFormula)

summary(bestModel)

## 
## Call:
## lm(formula = newFormula, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -291.22  -18.85   -4.99   11.84  450.89 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              9.283e+02  2.062e+02   4.502 7.02e-06 ***
## num_critic_for_reviews   3.535e-02  1.016e-02   3.480 0.000511 ***
## duration                -1.074e-01  4.183e-02  -2.568 0.010284 *  
## director_facebook_likes -6.703e-04  2.922e-04  -2.294 0.021863 *  
## num_voted_users          1.604e-04  9.745e-06  16.461  < 2e-16 ***
## num_user_for_reviews     5.344e-03  3.554e-03   1.504 0.132823    
## budget                   7.503e-01  2.607e-02  28.786  < 2e-16 ***
## title_year              -4.527e-01  1.032e-01  -4.388 1.19e-05 ***
## aspect_ratio            -5.096e+00  2.241e+00  -2.274 0.023045 *  
## USA                      1.624e+01  2.069e+00   7.849 6.07e-15 ***
## restricted              -1.665e+01  2.190e+00  -7.603 4.01e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.17 on 2602 degrees of freedom
## Multiple R-squared:  0.5402, Adjusted R-squared:  0.5385 
## F-statistic: 305.7 on 10 and 2602 DF,  p-value: < 2.2e-16

多元回归的共线性问题处理： Estimate the variance inflation factors using the vif() function from the rms package. A variance inflation factor greater than 5 hints to multicollinearity, greater than 10 indicates unstable regression estimates.

library(rms)

## Loading required package: Hmisc

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

## Loading required package: SparseM

## 
## Attaching package: 'SparseM'

## The following object is masked from 'package:base':
## 
##     backsolve

# Checking variance inflation factors
vif(bestModel)

##  num_critic_for_reviews                duration director_facebook_likes 
##                2.144584                1.261584                1.121974 
##         num_voted_users    num_user_for_reviews                  budget 
##                2.783532                2.598297                1.400162 
##              title_year            aspect_ratio                     USA 
##                1.554954                1.089147                1.046986 
##              restricted 
##                1.146993

Since none of the variance inflation factors is greater than 10 we can certainly accept our bestModel.

summary(bestModel)

## 
## Call:
## lm(formula = newFormula, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -291.22  -18.85   -4.99   11.84  450.89 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              9.283e+02  2.062e+02   4.502 7.02e-06 ***
## num_critic_for_reviews   3.535e-02  1.016e-02   3.480 0.000511 ***
## duration                -1.074e-01  4.183e-02  -2.568 0.010284 *  
## director_facebook_likes -6.703e-04  2.922e-04  -2.294 0.021863 *  
## num_voted_users          1.604e-04  9.745e-06  16.461  < 2e-16 ***
## num_user_for_reviews     5.344e-03  3.554e-03   1.504 0.132823    
## budget                   7.503e-01  2.607e-02  28.786  < 2e-16 ***
## title_year              -4.527e-01  1.032e-01  -4.388 1.19e-05 ***
## aspect_ratio            -5.096e+00  2.241e+00  -2.274 0.023045 *  
## USA                      1.624e+01  2.069e+00   7.849 6.07e-15 ***
## restricted              -1.665e+01  2.190e+00  -7.603 4.01e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.17 on 2602 degrees of freedom
## Multiple R-squared:  0.5402, Adjusted R-squared:  0.5385 
## F-statistic: 305.7 on 10 and 2602 DF,  p-value: < 2.2e-16

我们使用刚刚建立的模型来做预测：

grossPred <- predict(bestModel, testing_data) # predict distance

检验一下，我们的预测效果

actuals_preds <- data.frame(cbind(actuals=testing_data$gross, predicteds=grossPred)) # make actuals_predicteds dataframe.
attach(actuals_preds)
correlation_accuracy <- cor(actuals,predicteds) 
correlation_accuracy

## [1] 0.7727497

实际对比一下原始数据和预测数据

head(actuals_preds, n=10)

##      actuals predicteds
## 1   70.08352   166.5624
## 4  350.03411   194.4259
## 9  101.78548   223.0923
## 20 533.31606   431.2951
## 23 318.29818   230.2883
## 25 113.74541   156.8614
## 27 161.08718   160.1627
## 32 356.45437   222.5212
## 35 241.40733   203.4240
## 45 217.38800   226.4663

还可以再进一步，交叉验证：从上面6，4分的训练测试数据得到的模型表现不错，这是不是偶然碰巧的结果？随机把数据分成K等份，K-1做训练，1做测试。得到k种结果，看看预测结果之间偏差如何？这个方法通常在数据量不足以支持数据分成训练组和测试组的时候 k- Fold Cross validation 曲线是否平行

library(DAAG)

## 
## Attaching package: 'DAAG'

## The following object is masked from 'package:rms':
## 
##     vif

## The following object is masked from 'package:survival':
## 
##     lung

kfold <- CVlm(data = my_data, form.lm = formula(newFormula), m=5, 
                   dots = FALSE, seed=123, legend.pos="topleft",
                   main="Cross Validation; k=5",
                   plotit=TRUE, printit=FALSE)

## Warning in CVlm(data = my_data, form.lm = formula(newFormula), m = 5, dots = FALSE, : 
## 
##  As there is >1 explanatory variable, cross-validation
##  predicted values for a fold are not a linear function
##  of corresponding overall predicted values.  Lines that
##  are shown for the different folds are approximate

Remember, cross-validation validates the modeling process, not an actual model

还有一种 kWayCrossValidation(nRows, nSplits, NULL, NULL)

library(vtreat)

完结撒花！

Linear Regression and Predictive Model

Chris Qi

2018

双变量关系及建模

最小二乘法（least squares method ）

电影票房收入预测