Regression Analysis

双变量关系及建模

两个变量都是数值型（numeric）
应变量（response, dependent variable）
解释变量（explanatory）
你认为与应变量有关的东西
也可以叫做，independent variable, predictor

一般的模型表达式

\[应变量=f(解释变量)+噪音\]

线性回归模型表达式

\[应变量=截距+斜率*解释变量+噪音\]

线性回归模型–为什么这一条线最合适？

\[y_i= a + b*x_i+e\]

最小二乘法（least squares method ）

简单地说，最小二乘的思想就是要使得

** 观测点和估计点的距离的平方和达到最小 **.

这里的“二乘”指的是用平方来度量观测点与估计点的远近（在古汉语中“平方”称为“二乘”），“最小”指的是参数的估计值要保证各个观测点与估计点的距离的平方和达到最小。

最小二乘法（least squares method ）

线性回归模型:

\[y_i= a + b*x_i+e_i\]

\[e_i ~ N(0, \sigma)\]

预测值：

\[\hat{y_i} = \hat{a} + \hat{b}x_i\]

残差（error）:

\[e_i = y_i-\hat{y_i}\]

拟合的过程：有n个样本量，也就是n个数据对 \((x_i,y_i)\), 找到截距 \(a\) 和斜率 \(b\) ，使得残差平方和最小:

\[\sum_1^{n} e_i^2\]

** 关键概念 ** :

\(\hat{y_i}\) 是给定 \(x_i\) 值时， \(y\) 的期望值

\(\hat{b}\) 是真实但是未知的 \(b\) 的估计值

残差 \(e\)：residual, error, noise

最小二乘法求线性回归系数

\[y_i= a + b*x_i+e_i\]

\[\hat{y_i} = \hat{a} + \hat{b}x_i\]

\[e_i = y_i-\hat{y_i}\]

\[\sum_1^{n} e_i^2 = \sum (y_i-\hat{y_i})^2=\sum (y_i- \hat{a} -\hat{b}x_i)^2\]

最小二乘法求线性回归系数:

\[f=\sum_1^{n} e_i^2 =\sum (y_i- \hat{a} -\hat{b}x_i)^2\]

一阶导数等于 0：

\[\frac{\partial f}{\partial\hat{a}} = -2 \sum (y_i- \hat{a} -\hat{b}x_i)=0\]

\[\frac{\partial f}{\partial\hat{b}} = -2 \sum (y_i- \hat{a} -\hat{b}x_i)x_i=0\]

联立上面的二元一次方程组，求解 \(\hat{a}\) 与 \(\hat{b}\)

\[\hat{b}=\frac{\sum (x_i-\bar{x}) (y_i-\bar{y})}{\sum (x_i-\bar{x})^2}\]

\[\hat{a}= \bar{y} - \hat{b}\bar{x}\]

最小二乘法的特点:

唯一性
简单（相对于非线性）
残差之和等于0
回归线必过

\((\bar{x}, \bar{y})\)

out <- lm(mpg~wt,data=mtcars)
summary(out)

## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

轶事：回归的由来

Regression to the mean is a concept attributed to Sir Francis Galton. The basic idea is that extreme random observations will tend to be less extreme upon a second trial. This is simply due to chance alone. Note that “regression to the mean” and “linear regression” are not the same thing.

孩子的身高和父母的身高：高个子的父母一般生的孩子个子也高，但是孩子一般没有父母那么高，孩子的身高会趋向于平均数。
这就是我们没有看到乔丹的儿子能够和乔丹一样成为飞人的原因

\[y_i= a + b*x_i+e_i\]

回归模型的目标是量化变量之间的关系：

关键系数的大小
关键系数的显著度
模型的解释力
模型的预测力

## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

理解模型：

残差（Residuals）： Residuals show if the predicted response values are close or not to the response values that the model predicts.

\[e_i = y_i-\hat{y_i}\]

系数估计（Estimate coefficient）： How y changes when x changes by one unit

\[\hat{b}=\frac{\sum (x_i-\bar{x}) (y_i-\bar{y})}{\sum (x_i-\bar{x})^2}\]

\[\hat{a}= \bar{y} - \hat{b}\bar{x}\]

系数估计的标准差（Standard error）： Standard error measures how the coefficient estimates can vary from the actual average value of the response variable (i.e. if the model is run more times). 每一次数据不同，结果都会有差异，但是该差异在一定范围内。
显著度与P值（Significance and P value）： Test of significance of the model shows that there is strong evidence of a linear relationship between the variables. This is visually interpreted by the significance stars *** at the end of the row. This level is a threshold that is used to indicate real findings, and not the ones by chance alone.

For each estimated regression coefficient, the variable’s p-Value Pr(>|t|) provides an estimate of the probability that the true coefficient is zero given the value of the estimate. More the number of stars near the p-Value are, more significant is the variable.

With the presence of the p-value, there is a test of hypothesis associated with it. In Linear Regression, the Null Hypothesis is that the coefficient associated with the variables is equal to zero. Instead, the alternative hypothesis is that the coefficient is not equal to zero and then exists a relationship between the independent variable and the dependent variable.

So, if p-values are less than significance level (typically, a p-value < 0.05 is a good cut-off point), null hypothesis can be safely rejected. In the current case, p-values are well below the 0.05 threshold, so the model is indeed statistically significant.

R-squared 以及 adjusted R-squared For the simple linear regression, R-squared is the square of the correlation between two variables. Its value can vary between 0 and 1: a value close to 0 means that the regression model does not explain the variance in the response variable, while a number close to 1 that the observed variance in the response variable is well explained. In the current case, R-squared suggests the linear model fit explains about 75% of the variance observed in the data.

\[R^2=1-\frac{var(e)}{var(y)}\]

6.F statistic Basically, F-test compares the model with zero predictor variables (the intercept only mod1el), and suggests whether the added coefficients improve the model. If a significant result is obtained, then the coefficients (in the current case, being a simple regression, only one predictor is entered) included in the model improve the model’s fit. So, F statistic defines the collective effect of all predictor variables on the response variable. In this model, F=119.8 is far greater than 1.

Import Data

my_data <- read.csv("https://www.dropbox.com/s/sy790xviuxn8psp/movie_metadata.csv?dl=1")

str(my_data)

## 'data.frame':    5043 obs. of  28 variables:
##  $ color                    : Factor w/ 3 levels ""," Black and White",..: 3 3 3 3 1 3 3 3 3 3 ...
##  $ director_name            : Factor w/ 2399 levels "","A. Raven Cruz",..: 927 801 2027 377 603 106 2030 1652 1228 551 ...
##  $ num_critic_for_reviews   : int  723 302 602 813 NA 462 392 324 635 375 ...
##  $ duration                 : int  178 169 148 164 NA 132 156 100 141 153 ...
##  $ director_facebook_likes  : int  0 563 0 22000 131 475 0 15 0 282 ...
##  $ actor_3_facebook_likes   : int  855 1000 161 23000 NA 530 4000 284 19000 10000 ...
##  $ actor_2_name             : Factor w/ 3033 levels "","50 Cent","A. Michael Baldwin",..: 1407 2218 2488 534 2432 2549 1227 801 2439 653 ...
##  $ actor_1_facebook_likes   : int  1000 40000 11000 27000 131 640 24000 799 26000 25000 ...
##  $ gross                    : int  760505847 309404152 200074175 448130642 NA 73058679 336530303 200807262 458991599 301956980 ...
##  $ genres                   : Factor w/ 914 levels "Action","Action|Adventure",..: 107 101 128 288 754 126 120 308 126 447 ...
##  $ actor_1_name             : Factor w/ 2098 levels "","50 Cent","A.J. Buckley",..: 302 979 353 1968 526 440 785 221 336 32 ...
##  $ movie_title              : Factor w/ 4917 levels "[Rec] ","[Rec] 2 ",..: 398 2731 3279 3708 3332 1961 3291 3459 399 1631 ...
##  $ num_voted_users          : int  886204 471220 275868 1144337 8 212204 383056 294810 462669 321795 ...
##  $ cast_total_facebook_likes: int  4834 48350 11700 106759 143 1873 46055 2036 92000 58753 ...
##  $ actor_3_name             : Factor w/ 3522 levels "","50 Cent","A.J. Buckley",..: 3442 1392 3134 1769 1 2714 1969 2162 3018 2941 ...
##  $ facenumber_in_poster     : int  0 0 1 0 0 1 0 1 4 3 ...
##  $ plot_keywords            : Factor w/ 4761 levels "","10 year old|dog|florida|girl|supermarket",..: 1320 4283 2076 3484 1 651 4745 29 1142 2005 ...
##  $ movie_imdb_link          : Factor w/ 4919 levels "http://www.imdb.com/title/tt0006864/?ref_=fn_tt_tt_1",..: 2965 2721 4533 3756 4918 2476 2526 2458 4546 2551 ...
##  $ num_user_for_reviews     : int  3054 1238 994 2701 NA 738 1902 387 1117 973 ...
##  $ language                 : Factor w/ 48 levels "","Aboriginal",..: 13 13 13 13 1 13 13 13 13 13 ...
##  $ country                  : Factor w/ 66 levels "","Afghanistan",..: 65 65 63 65 1 65 65 65 65 63 ...
##  $ content_rating           : Factor w/ 19 levels "","Approved",..: 10 10 10 10 1 10 10 9 10 9 ...
##  $ budget                   : num  2.37e+08 3.00e+08 2.45e+08 2.50e+08 NA ...
##  $ title_year               : int  2009 2007 2015 2012 NA 2012 2007 2010 2015 2009 ...
##  $ actor_2_facebook_likes   : int  936 5000 393 23000 12 632 11000 553 21000 11000 ...
##  $ imdb_score               : num  7.9 7.1 6.8 8.5 7.1 6.6 6.2 7.8 7.5 7.5 ...
##  $ aspect_ratio             : num  1.78 2.35 2.35 2.35 NA 2.35 2.35 1.85 2.35 2.35 ...
##  $ movie_facebook_likes     : int  33000 0 85000 164000 0 24000 0 29000 118000 10000 ...

查看数据的描述性统计

summary(my_data)

##               color               director_name  num_critic_for_reviews
##                  :  19                   : 104   Min.   :  1.0         
##   Black and White: 209   Steven Spielberg:  26   1st Qu.: 50.0         
##  Color           :4815   Woody Allen     :  22   Median :110.0         
##                          Clint Eastwood  :  20   Mean   :140.2         
##                          Martin Scorsese :  20   3rd Qu.:195.0         
##                          Ridley Scott    :  17   Max.   :813.0         
##                          (Other)         :4834   NA's   :50            
##     duration     director_facebook_likes actor_3_facebook_likes
##  Min.   :  7.0   Min.   :    0.0         Min.   :    0.0       
##  1st Qu.: 93.0   1st Qu.:    7.0         1st Qu.:  133.0       
##  Median :103.0   Median :   49.0         Median :  371.5       
##  Mean   :107.2   Mean   :  686.5         Mean   :  645.0       
##  3rd Qu.:118.0   3rd Qu.:  194.5         3rd Qu.:  636.0       
##  Max.   :511.0   Max.   :23000.0         Max.   :23000.0       
##  NA's   :15      NA's   :104             NA's   :23            
##           actor_2_name  actor_1_facebook_likes     gross          
##  Morgan Freeman :  20   Min.   :     0         Min.   :      162  
##  Charlize Theron:  15   1st Qu.:   614         1st Qu.:  5340988  
##  Brad Pitt      :  14   Median :   988         Median : 25517500  
##                 :  13   Mean   :  6560         Mean   : 48468408  
##  James Franco   :  11   3rd Qu.: 11000         3rd Qu.: 62309438  
##  Meryl Streep   :  11   Max.   :640000         Max.   :760505847  
##  (Other)        :4959   NA's   :7              NA's   :884        
##                   genres                actor_1_name 
##  Drama               : 236   Robert De Niro   :  49  
##  Comedy              : 209   Johnny Depp      :  41  
##  Comedy|Drama        : 191   Nicolas Cage     :  33  
##  Comedy|Drama|Romance: 187   J.K. Simmons     :  31  
##  Comedy|Romance      : 158   Bruce Willis     :  30  
##  Drama|Romance       : 152   Denzel Washington:  30  
##  (Other)             :3910   (Other)          :4829  
##                     movie_title   num_voted_users  
##  Ben-Hur                  :   3   Min.   :      5  
##  Halloween                :   3   1st Qu.:   8594  
##  Home                     :   3   Median :  34359  
##  King Kong                :   3   Mean   :  83668  
##  Pan                      :   3   3rd Qu.:  96309  
##  The Fast and the Furious :   3   Max.   :1689764  
##  (Other)                  :5025                    
##  cast_total_facebook_likes         actor_3_name  facenumber_in_poster
##  Min.   :     0                          :  23   Min.   : 0.000      
##  1st Qu.:  1411            Ben Mendelsohn:   8   1st Qu.: 0.000      
##  Median :  3090            John Heard    :   8   Median : 1.000      
##  Mean   :  9699            Steve Coogan  :   8   Mean   : 1.371      
##  3rd Qu.: 13756            Anne Hathaway :   7   3rd Qu.: 2.000      
##  Max.   :656730            Jon Gries     :   7   Max.   :43.000      
##                            (Other)       :4982   NA's   :13          
##                                                                            plot_keywords 
##                                                                                   : 153  
##  based on novel                                                                   :   4  
##  1940s|child hero|fantasy world|orphan|reference to peter pan                     :   3  
##  alien friendship|alien invasion|australia|flying car|mother daughter relationship:   3  
##  animal name in title|ape abducts a woman|gorilla|island|king kong                :   3  
##  assistant|experiment|frankenstein|medical student|scientist                      :   3  
##  (Other)                                                                          :4874  
##                                              movie_imdb_link
##  http://www.imdb.com/title/tt0077651/?ref_=fn_tt_tt_1:   3  
##  http://www.imdb.com/title/tt0232500/?ref_=fn_tt_tt_1:   3  
##  http://www.imdb.com/title/tt0360717/?ref_=fn_tt_tt_1:   3  
##  http://www.imdb.com/title/tt1976009/?ref_=fn_tt_tt_1:   3  
##  http://www.imdb.com/title/tt2224026/?ref_=fn_tt_tt_1:   3  
##  http://www.imdb.com/title/tt2638144/?ref_=fn_tt_tt_1:   3  
##  (Other)                                             :5025  
##  num_user_for_reviews     language         country       content_rating
##  Min.   :   1.0       English :4704   USA      :3807   R        :2118  
##  1st Qu.:  65.0       French  :  73   UK       : 448   PG-13    :1461  
##  Median : 156.0       Spanish :  40   France   : 154   PG       : 701  
##  Mean   : 272.8       Hindi   :  28   Canada   : 126            : 303  
##  3rd Qu.: 326.0       Mandarin:  26   Germany  :  97   Not Rated: 116  
##  Max.   :5060.0       German  :  19   Australia:  55   G        : 112  
##  NA's   :21           (Other) : 153   (Other)  : 356   (Other)  : 232  
##      budget            title_year   actor_2_facebook_likes   imdb_score   
##  Min.   :2.180e+02   Min.   :1916   Min.   :     0         Min.   :1.600  
##  1st Qu.:6.000e+06   1st Qu.:1999   1st Qu.:   281         1st Qu.:5.800  
##  Median :2.000e+07   Median :2005   Median :   595         Median :6.600  
##  Mean   :3.975e+07   Mean   :2002   Mean   :  1652         Mean   :6.442  
##  3rd Qu.:4.500e+07   3rd Qu.:2011   3rd Qu.:   918         3rd Qu.:7.200  
##  Max.   :1.222e+10   Max.   :2016   Max.   :137000         Max.   :9.500  
##  NA's   :492         NA's   :108    NA's   :13                            
##   aspect_ratio   movie_facebook_likes
##  Min.   : 1.18   Min.   :     0      
##  1st Qu.: 1.85   1st Qu.:     0      
##  Median : 2.35   Median :   166      
##  Mean   : 2.22   Mean   :  7526      
##  3rd Qu.: 2.35   3rd Qu.:  3000      
##  Max.   :16.00   Max.   :349000      
##  NA's   :329

names(my_data)

##  [1] "color"                     "director_name"            
##  [3] "num_critic_for_reviews"    "duration"                 
##  [5] "director_facebook_likes"   "actor_3_facebook_likes"   
##  [7] "actor_2_name"              "actor_1_facebook_likes"   
##  [9] "gross"                     "genres"                   
## [11] "actor_1_name"              "movie_title"              
## [13] "num_voted_users"           "cast_total_facebook_likes"
## [15] "actor_3_name"              "facenumber_in_poster"     
## [17] "plot_keywords"             "movie_imdb_link"          
## [19] "num_user_for_reviews"      "language"                 
## [21] "country"                   "content_rating"           
## [23] "budget"                    "title_year"               
## [25] "actor_2_facebook_likes"    "imdb_score"               
## [27] "aspect_ratio"              "movie_facebook_likes"

movie rating system: https://en.wikipedia.org/wiki/Motion_Picture_Association_of_America_film_rating_system

除掉有缺失值的样本，除去不相关变量，例如 “movie_imdb_link”，或者“director_name”。

my_data <- read.csv("https://www.dropbox.com/s/sy790xviuxn8psp/movie_metadata.csv?dl=1") 

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

my_data <- my_data %>%
  na.omit() %>%
  mutate(gross=gross/1000000, budget=budget/1000000, USA = as.numeric(country== "USA"), restricted=as.numeric(content_rating== "PG-13"|content_rating=="R"|content_rating== "NC-17")) %>%
  select(-actor_1_name, -actor_2_name, -actor_3_name, -actor_2_facebook_likes,  -actor_3_facebook_likes, -movie_imdb_link, -plot_keywords, -color, -director_name, -cast_total_facebook_likes, -country, -content_rating)

发现有的电影budget异常的高，剔除：

summary(my_data)

##  num_critic_for_reviews    duration     director_facebook_likes
##  Min.   :  1.0          Min.   : 37.0   Min.   :    0.0        
##  1st Qu.: 75.0          1st Qu.: 96.0   1st Qu.:   11.0        
##  Median :137.0          Median :106.0   Median :   62.0        
##  Mean   :165.8          Mean   :110.2   Mean   :  798.6        
##  3rd Qu.:223.0          3rd Qu.:120.0   3rd Qu.:  234.0        
##  Max.   :813.0          Max.   :330.0   Max.   :23000.0        
##                                                                
##  actor_1_facebook_likes     gross                           genres    
##  Min.   :     0         Min.   :  0.0002   Drama               : 150  
##  1st Qu.:   736         1st Qu.:  7.6895   Comedy|Drama|Romance: 148  
##  Median :  1000         Median : 29.2000   Comedy|Drama        : 142  
##  Mean   :  7673         Mean   : 52.0058   Comedy              : 138  
##  3rd Qu.: 13000         3rd Qu.: 66.4664   Comedy|Romance      : 131  
##  Max.   :640000         Max.   :760.5058   Drama|Romance       : 116  
##                                            (Other)             :2976  
##                     movie_title   num_voted_users   facenumber_in_poster
##  Halloween                :   3   Min.   :      5   Min.   : 0.000      
##  Home                     :   3   1st Qu.:  18915   1st Qu.: 0.000      
##  King Kong                :   3   Median :  53028   Median : 1.000      
##  Pan                      :   3   Mean   : 104677   Mean   : 1.379      
##  The Fast and the Furious :   3   3rd Qu.: 126916   3rd Qu.: 2.000      
##  Victor Frankenstein      :   3   Max.   :1689764   Max.   :43.000      
##  (Other)                  :3783                                         
##  num_user_for_reviews     language        budget           title_year  
##  Min.   :   1.0       English :3626   Min.   :    0.00   Min.   :1920  
##  1st Qu.: 107.0       French  :  36   1st Qu.:   10.00   1st Qu.:1999  
##  Median : 208.0       Spanish :  23   Median :   25.00   Median :2005  
##  Mean   : 333.6       Mandarin:  15   Mean   :   45.85   Mean   :2003  
##  3rd Qu.: 397.0       German  :  13   3rd Qu.:   50.00   3rd Qu.:2010  
##  Max.   :5060.0       Japanese:  12   Max.   :12215.50   Max.   :2016  
##                       (Other) :  76                                    
##    imdb_score     aspect_ratio   movie_facebook_likes      USA        
##  Min.   :1.600   Min.   : 1.18   Min.   :     0       Min.   :0.0000  
##  1st Qu.:5.900   1st Qu.: 1.85   1st Qu.:     0       1st Qu.:1.0000  
##  Median :6.600   Median : 2.35   Median :   224       Median :1.0000  
##  Mean   :6.466   Mean   : 2.11   Mean   :  9262       Mean   :0.7906  
##  3rd Qu.:7.200   3rd Qu.: 2.35   3rd Qu.: 11000       3rd Qu.:1.0000  
##  Max.   :9.300   Max.   :16.00   Max.   :349000       Max.   :1.0000  
##                                                                       
##    restricted    
##  Min.   :0.0000  
##  1st Qu.:1.0000  
##  Median :1.0000  
##  Mean   :0.7966  
##  3rd Qu.:1.0000  
##  Max.   :1.0000  
##

library(dplyr)

my_data <- my_data %>%
  filter(budget<200)

相关系数，精确到三位小数点（用round这个函数）

library(corrplot)

## corrplot 0.84 loaded

my_data %>%
  select(-movie_title, -genres, -language, -title_year) %>%
  cor() %>% 
  round(3) %>%
  corrplot(tl.srt=45)

使用下面的公式，计算相关性的P值，是否显著相关：

library(corrplot)

# mat : is a matrix of data
# ... : further arguments to pass to the native R cor.test function
cor.mtest <- function(mat, ...) {
    mat <- as.matrix(mat)
    n <- ncol(mat)
    p.mat<- matrix(NA, n, n)
    diag(p.mat) <- 0
    for (i in 1:(n - 1)) {
        for (j in (i + 1):n) {
            tmp <- cor.test(mat[, i], mat[, j], ...)
            p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
        }
    }
  colnames(p.mat) <- rownames(p.mat) <- colnames(mat)
  p.mat
}
# matrix of the p-value of the correlation
p.mat <- my_data %>% 
  select(-movie_title, -genres, -language, -title_year) %>%
  cor.mtest()

my_data %>%
  select(-movie_title, -genres, -language, -title_year) %>%
  cor() %>% 
  round(3) %>%
  corrplot(type="upper", p.mat = p.mat, sig.level = 0.05, tl.srt=45, method = "circle")

将数据随机分成训练数据和测试数据

set.seed(100) 
trainingRowIndex <- sample(1:nrow(my_data), 0.7*nrow(my_data)) 

training_data <- my_data[trainingRowIndex, ] # model training data
testing_data <- my_data[-trainingRowIndex, ] # test data

summary(training_data)

##  num_critic_for_reviews    duration     director_facebook_likes
##  Min.   :  1.0          Min.   : 45.0   Min.   :    0.0        
##  1st Qu.: 75.0          1st Qu.: 96.0   1st Qu.:   10.0        
##  Median :134.0          Median :106.0   Median :   58.0        
##  Mean   :161.5          Mean   :109.8   Mean   :  773.1        
##  3rd Qu.:218.0          3rd Qu.:120.0   3rd Qu.:  219.0        
##  Max.   :775.0          Max.   :330.0   Max.   :22000.0        
##                                                                
##  actor_1_facebook_likes     gross                           genres    
##  Min.   :     0         Min.   :  0.0007   Comedy|Drama|Romance: 108  
##  1st Qu.:   734         1st Qu.:  7.9169   Comedy              : 102  
##  Median :  1000         Median : 28.0145   Drama               : 102  
##  Mean   :  7465         Mean   : 48.6357   Comedy|Drama        :  98  
##  3rd Qu.: 12000         3rd Qu.: 63.8266   Comedy|Romance      :  84  
##  Max.   :260000         Max.   :652.1773   Drama|Romance       :  84  
##                                            (Other)             :2035  
##                        movie_title   num_voted_users  
##  A Nightmare on Elm Street   :   2   Min.   :      5  
##  Around the World in 80 Days :   2   1st Qu.:  18313  
##  Big Fat Liar                :   2   Median :  51252  
##  Carrie                      :   2   Mean   :  99977  
##  Clash of the Titans         :   2   3rd Qu.: 123090  
##  Conan the Barbarian         :   2   Max.   :1689764  
##  (Other)                     :2601                    
##  facenumber_in_poster num_user_for_reviews      language   
##  Min.   : 0.000       Min.   :   1.0       English  :2501  
##  1st Qu.: 0.000       1st Qu.: 107.0       French   :  28  
##  Median : 1.000       Median : 207.0       Spanish  :  13  
##  Mean   : 1.395       Mean   : 318.6       Cantonese:   7  
##  3rd Qu.: 2.000       3rd Qu.: 384.0       German   :   7  
##  Max.   :43.000       Max.   :4144.0       Italian  :   6  
##                                            (Other)  :  51  
##      budget          title_year     imdb_score    aspect_ratio   
##  Min.   :  0.007   Min.   :1920   Min.   :1.60   Min.   : 1.330  
##  1st Qu.: 10.000   1st Qu.:1999   1st Qu.:5.90   1st Qu.: 1.850  
##  Median : 24.000   Median :2004   Median :6.60   Median : 2.350  
##  Mean   : 35.998   Mean   :2003   Mean   :6.46   Mean   : 2.108  
##  3rd Qu.: 50.000   3rd Qu.:2010   3rd Qu.:7.20   3rd Qu.: 2.350  
##  Max.   :195.000   Max.   :2016   Max.   :9.30   Max.   :16.000  
##                                                                  
##  movie_facebook_likes      USA           restricted    
##  Min.   :     0       Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:     0       1st Qu.:1.0000   1st Qu.:1.0000  
##  Median :   193       Median :1.0000   Median :1.0000  
##  Mean   :  8649       Mean   :0.7891   Mean   :0.7953  
##  3rd Qu.: 11000       3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :191000       Max.   :1.0000   Max.   :1.0000  
##

boxplot(imdb_score ~ title_year, data=training_data, col='red')
title("IMDb score vs year")

最关心变量：gross, budget, imdb_score，使用直方图看一看他们的分布：

gross 毛收入：

library(ggplot2)
library(dplyr)

training_data %>%
  ggplot(aes(x=gross))+
  geom_histogram(color='black', fill='white') + 
  ylab('counts') + 
  xlab('Gross Income (million)')

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  ggtitle('gross income')

## $title
## [1] "gross income"
## 
## $subtitle
## NULL
## 
## attr(,"class")
## [1] "labels"

添加均值和中位数的直线：

library(ggplot2)
library(dplyr)

training_data %>%
  ggplot(aes(x=gross))+
  geom_histogram(color='black', fill='white') + 
  geom_vline(aes(xintercept=mean(gross)), linetype="dashed", color="blue", size=1)+
  geom_vline(aes(xintercept=median(gross)), linetype="dashed", color="red", size=1)+
  ylab('counts') + 
  xlab('Gross Income (million)')

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  ggtitle('gross income')

## $title
## [1] "gross income"
## 
## $subtitle
## NULL
## 
## attr(,"class")
## [1] "labels"

budget 投资预算的分布：

library(ggplot2)
library(dplyr)

training_data %>%
  ggplot(aes(x=budget))+
  geom_histogram(color='black', fill='white') + 
  geom_vline(aes(xintercept=mean(budget)), linetype="dashed", color="blue", size=1)+
  geom_vline(aes(xintercept=median(budget)), linetype="dashed", color="red", size=1)+
  ylab('counts') + 
  xlab('Budget (million)')

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  ggtitle('Budget')

## $title
## [1] "Budget"
## 
## $subtitle
## NULL
## 
## attr(,"class")
## [1] "labels"

imdb评分的直方图分布（频数）：

library(ggplot2)
library(dplyr)

training_data %>%
  ggplot(aes(x=imdb_score))+
  geom_histogram(color='black', fill='white') + 
  ylab('counts') + 
  ggtitle('imdb_score')

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

imdb评分的直方图分布（频率）：

library(ggplot2)
library(dplyr)

training_data %>%
  ggplot(aes(x=imdb_score))+
  geom_histogram(aes(y=..density..), color='black', fill='white') + 
  ylab('percent') + 
  ggtitle('imdb_score')

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

imdb评分的直方图分布（频率），添加均值线：

library(ggplot2)
library(dplyr)

training_data %>%
  ggplot(aes(x=imdb_score))+
  geom_histogram(aes(y=..density..), binwidth=0.5, color='black', fill='white') + 
  geom_vline(aes(xintercept=mean(imdb_score)), linetype="dashed", color="blue", size=1)+
  ylab('percent') + 
  ggtitle('imdb_score')

str(training_data)

## 'data.frame':    2613 obs. of  18 variables:
##  $ num_critic_for_reviews : int  303 72 20 354 141 229 14 135 81 249 ...
##  $ duration               : int  136 127 106 129 124 121 108 105 101 119 ...
##  $ director_facebook_likes: int  3000 65 277 235 159 353 208 125 13 218 ...
##  $ actor_1_facebook_likes : int  14000 788 22000 13000 288 45 876 18000 651 13000 ...
##  $ gross                  : num  42.6 14.9 10.6 134.5 30 ...
##  $ genres                 : Factor w/ 914 levels "Action","Action|Adventure",..: 708 536 571 128 550 844 357 700 697 280 ...
##  $ movie_title            : Factor w/ 4917 levels "[Rec] ","[Rec] 2 ",..: 123 3938 4756 2163 1186 568 3506 2744 106 3604 ...
##  $ num_voted_users        : int  136093 2581 17349 336235 75973 59462 6384 23916 14147 407601 ...
##  $ facenumber_in_poster   : int  3 1 0 1 1 0 0 5 3 1 ...
##  $ num_user_for_reviews   : int  416 29 42 782 409 300 54 50 92 849 ...
##  $ language               : Factor w/ 48 levels "","Aboriginal",..: 13 13 13 13 13 13 13 13 13 13 ...
##  $ budget                 : num  40 25 20 110 25 25 6.5 35 20 60 ...
##  $ title_year             : int  2014 2016 1989 2007 1998 2008 1993 2012 2003 2002 ...
##  $ imdb_score             : num  6.1 7.3 6 7.2 7.5 6.6 6.2 5.7 5.6 7.9 ...
##  $ aspect_ratio           : num  2.35 2.35 2.35 2.35 1.85 1.85 1.85 2.35 1.85 2.35 ...
##  $ movie_facebook_likes   : int  24000 0 855 0 0 0 391 0 401 0 ...
##  $ USA                    : num  1 0 1 1 0 0 1 1 1 1 ...
##  $ restricted             : num  1 1 1 1 1 1 0 1 1 1 ...

# Model specification using lm
fullModel <- lm(gross ~.-genres -movie_title -language, 
                        data = my_data)

# Looking at model summary
summary(fullModel)

## 
## Call:
## lm(formula = gross ~ . - genres - movie_title - language, data = my_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -322.28  -19.02   -5.07   12.45  452.31 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              9.001e+02  1.726e+02   5.216 1.93e-07 ***
## num_critic_for_reviews   2.760e-02  1.011e-02   2.730 0.006363 ** 
## duration                -1.183e-01  3.666e-02  -3.226 0.001267 ** 
## director_facebook_likes -9.040e-04  2.376e-04  -3.805 0.000144 ***
## actor_1_facebook_likes  -3.751e-05  4.535e-05  -0.827 0.408228    
## num_voted_users          1.818e-04  8.823e-06  20.601  < 2e-16 ***
## facenumber_in_poster    -3.166e-02  3.357e-01  -0.094 0.924872    
## num_user_for_reviews     4.483e-03  3.020e-03   1.484 0.137841    
## budget                   7.181e-01  2.237e-02  32.098  < 2e-16 ***
## title_year              -4.395e-01  8.602e-02  -5.109 3.40e-07 ***
## imdb_score               4.169e-01  8.269e-01   0.504 0.614202    
## aspect_ratio            -4.289e+00  2.022e+00  -2.121 0.033968 *  
## movie_facebook_likes    -3.824e-06  4.850e-05  -0.079 0.937159    
## USA                      1.777e+01  1.743e+00  10.196  < 2e-16 ***
## restricted              -1.845e+01  1.816e+00 -10.158  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41.58 on 3718 degrees of freedom
## Multiple R-squared:  0.5592, Adjusted R-squared:  0.5575 
## F-statistic: 336.9 on 14 and 3718 DF,  p-value: < 2.2e-16

使用step函数，剔除无关变量，得到最佳模型： Choose a model by AIC in a Stepwise Algorithm

# Model specification using lm
fullModel <- lm(gross ~.-genres -movie_title -language, 
                        data = my_data)

newModel <- step(fullModel)

## Start:  AIC=27846.05
## gross ~ (num_critic_for_reviews + duration + director_facebook_likes + 
##     actor_1_facebook_likes + genres + movie_title + num_voted_users + 
##     facenumber_in_poster + num_user_for_reviews + language + 
##     budget + title_year + imdb_score + aspect_ratio + movie_facebook_likes + 
##     USA + restricted) - genres - movie_title - language
## 
##                           Df Sum of Sq     RSS   AIC
## - movie_facebook_likes     1        11 6429210 27844
## - facenumber_in_poster     1        15 6429214 27844
## - imdb_score               1       439 6429638 27844
## - actor_1_facebook_likes   1      1183 6430382 27845
## <none>                                 6429199 27846
## - num_user_for_reviews     1      3809 6433008 27846
## - aspect_ratio             1      7781 6436980 27849
## - num_critic_for_reviews   1     12888 6442087 27852
## - duration                 1     17993 6447192 27854
## - director_facebook_likes  1     25038 6454237 27859
## - title_year               1     45133 6474331 27870
## - restricted               1    178429 6607628 27946
## - USA                      1    179759 6608958 27947
## - num_voted_users          1    733845 7163044 28248
## - budget                   1   1781578 8210777 28757
## 
## Step:  AIC=27844.06
## gross ~ num_critic_for_reviews + duration + director_facebook_likes + 
##     actor_1_facebook_likes + num_voted_users + facenumber_in_poster + 
##     num_user_for_reviews + budget + title_year + imdb_score + 
##     aspect_ratio + USA + restricted
## 
##                           Df Sum of Sq     RSS   AIC
## - facenumber_in_poster     1        16 6429226 27842
## - imdb_score               1       448 6429657 27842
## - actor_1_facebook_likes   1      1180 6430390 27843
## <none>                                 6429210 27844
## - num_user_for_reviews     1      4082 6433291 27844
## - aspect_ratio             1      7770 6436980 27847
## - num_critic_for_reviews   1     17160 6446369 27852
## - duration                 1     18274 6447484 27853
## - director_facebook_likes  1     25043 6454252 27857
## - title_year               1     45529 6474739 27868
## - restricted               1    178817 6608026 27944
## - USA                      1    179753 6608962 27945
## - num_voted_users          1    791187 7220396 28275
## - budget                   1   1786771 8215981 28758
## 
## Step:  AIC=27842.07
## gross ~ num_critic_for_reviews + duration + director_facebook_likes + 
##     actor_1_facebook_likes + num_voted_users + num_user_for_reviews + 
##     budget + title_year + imdb_score + aspect_ratio + USA + restricted
## 
##                           Df Sum of Sq     RSS   AIC
## - imdb_score               1       460 6429686 27840
## - actor_1_facebook_likes   1      1200 6430425 27841
## <none>                                 6429226 27842
## - num_user_for_reviews     1      4146 6433372 27842
## - aspect_ratio             1      7772 6436997 27845
## - num_critic_for_reviews   1     17207 6446432 27850
## - duration                 1     18499 6447725 27851
## - director_facebook_likes  1     25035 6454261 27855
## - title_year               1     45877 6475102 27867
## - restricted               1    178986 6608212 27943
## - USA                      1    179791 6609016 27943
## - num_voted_users          1    793978 7223203 28275
## - budget                   1   1789904 8219130 28757
## 
## Step:  AIC=27840.33
## gross ~ num_critic_for_reviews + duration + director_facebook_likes + 
##     actor_1_facebook_likes + num_voted_users + num_user_for_reviews + 
##     budget + title_year + aspect_ratio + USA + restricted
## 
##                           Df Sum of Sq     RSS   AIC
## - actor_1_facebook_likes   1      1150 6430836 27839
## <none>                                 6429686 27840
## - num_user_for_reviews     1      3844 6433530 27841
## - aspect_ratio             1      7827 6437514 27843
## - duration                 1     18318 6448004 27849
## - num_critic_for_reviews   1     19754 6449440 27850
## - director_facebook_likes  1     24878 6454565 27853
## - title_year               1     48581 6478268 27866
## - restricted               1    180981 6610667 27942
## - USA                      1    181609 6611295 27942
## - num_voted_users          1    900134 7329820 28328
## - budget                   1   1877049 8306735 28794
## 
## Step:  AIC=27839
## gross ~ num_critic_for_reviews + duration + director_facebook_likes + 
##     num_voted_users + num_user_for_reviews + budget + title_year + 
##     aspect_ratio + USA + restricted
## 
##                           Df Sum of Sq     RSS   AIC
## <none>                                 6430836 27839
## - num_user_for_reviews     1      4066 6434903 27839
## - aspect_ratio             1      7890 6438726 27842
## - duration                 1     18513 6449349 27848
## - num_critic_for_reviews   1     19493 6450330 27848
## - director_facebook_likes  1     25249 6456085 27852
## - title_year               1     49239 6480076 27866
## - USA                      1    180592 6611428 27940
## - restricted               1    182472 6613308 27941
## - num_voted_users          1    901931 7332767 28327
## - budget                   1   1880622 8311458 28795

summary(newModel)

## 
## Call:
## lm(formula = gross ~ num_critic_for_reviews + duration + director_facebook_likes + 
##     num_voted_users + num_user_for_reviews + budget + title_year + 
##     aspect_ratio + USA + restricted, data = my_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -322.59  -19.11   -4.87   12.50  452.39 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              9.240e+02  1.685e+02   5.485 4.42e-08 ***
## num_critic_for_reviews   2.808e-02  8.361e-03   3.359 0.000790 ***
## duration                -1.143e-01  3.490e-02  -3.273 0.001073 ** 
## director_facebook_likes -9.062e-04  2.371e-04  -3.823 0.000134 ***
## num_voted_users          1.823e-04  7.980e-06  22.848  < 2e-16 ***
## num_user_for_reviews     4.462e-03  2.908e-03   1.534 0.125089    
## budget                   7.143e-01  2.165e-02  32.992  < 2e-16 ***
## title_year              -4.503e-01  8.435e-02  -5.338 9.94e-08 ***
## aspect_ratio            -4.316e+00  2.019e+00  -2.137 0.032666 *  
## USA                      1.755e+01  1.716e+00  10.224  < 2e-16 ***
## restricted              -1.856e+01  1.807e+00 -10.277  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41.57 on 3722 degrees of freedom
## Multiple R-squared:  0.5591, Adjusted R-squared:  0.5579 
## F-statistic: 471.9 on 10 and 3722 DF,  p-value: < 2.2e-16

newFormula <- as.formula(summary(newModel)$call)
newFormula

## gross ~ num_critic_for_reviews + duration + director_facebook_likes + 
##     num_voted_users + num_user_for_reviews + budget + title_year + 
##     aspect_ratio + USA + restricted

bestModel<- training_data %>%
  lm(formula=newFormula)

summary(bestModel)

## 
## Call:
## lm(formula = newFormula, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -291.22  -18.85   -4.99   11.84  450.89 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              9.283e+02  2.062e+02   4.502 7.02e-06 ***
## num_critic_for_reviews   3.535e-02  1.016e-02   3.480 0.000511 ***
## duration                -1.074e-01  4.183e-02  -2.568 0.010284 *  
## director_facebook_likes -6.703e-04  2.922e-04  -2.294 0.021863 *  
## num_voted_users          1.604e-04  9.745e-06  16.461  < 2e-16 ***
## num_user_for_reviews     5.344e-03  3.554e-03   1.504 0.132823    
## budget                   7.503e-01  2.607e-02  28.786  < 2e-16 ***
## title_year              -4.527e-01  1.032e-01  -4.388 1.19e-05 ***
## aspect_ratio            -5.096e+00  2.241e+00  -2.274 0.023045 *  
## USA                      1.624e+01  2.069e+00   7.849 6.07e-15 ***
## restricted              -1.665e+01  2.190e+00  -7.603 4.01e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.17 on 2602 degrees of freedom
## Multiple R-squared:  0.5402, Adjusted R-squared:  0.5385 
## F-statistic: 305.7 on 10 and 2602 DF,  p-value: < 2.2e-16

lets’s interpretate of the coefficients. Look at the model fit statistics. How much of the dependent variable’s variation is explained by the explanatory variables?

我们使用刚刚建立的模型来做预测：

grossPred <- predict(bestModel, testing_data) # predict distance

检验一下，我们的预测效果

actuals_preds <- data.frame(cbind(actuals=testing_data$gross, predicteds=grossPred)) # make actuals_predicteds dataframe.
attach(actuals_preds)
correlation_accuracy <- cor(actuals,predicteds) 
correlation_accuracy

## [1] 0.7727497

Regression Analysis

Chris Qi

2018

双变量关系及建模

一般的模型表达式

线性回归模型表达式

线性回归模型–为什么这一条线最合适？

最小二乘法（least squares method ）

最小二乘法（least squares method ）

轶事：回归的由来

回归模型的目标是量化变量之间的关系：