两个变量都是数值型(numeric)
应变量(response, dependent variable)
解释变量(explanatory)
也可以叫做,independent variable, predictor
一般的模型表达式:
\[应变量=f(解释变量)+噪音\]
线性回归模型表达式:
\[应变量=截距+斜率*解释变量+噪音\]
线性回归模型–为什么这一条线最合适?
\[y_i= a + b*x_i+e\]
简单地说,最小二乘的思想就是要使得
** 观测点和估计点的距离的平方和达到最小 **.
这里的“二乘”指的是用平方来度量观测点与估计点的远近(在古汉语中“平方”称为“二乘”),“最小”指的是参数的估计值要保证各个观测点与估计点的距离的平方和达到最小。
线性回归模型:
\[y_i= a + b*x_i+e_i\]
\[e_i ~ N(0, \sigma)\]
预测值:
\[\hat{y_i} = \hat{a} + \hat{b}x_i\]
残差(error):
\[e_i = y_i-\hat{y_i}\]
拟合的过程:有n个样本量,也就是n个数据对 \((x_i,y_i)\), 找到 截距 \(a\) 和斜率 \(b\) ,使得残差平方和最小:
\[\sum_1^{n} e_i^2\]
** 关键概念 ** :
\(\hat{y_i}\) 是给定 \(x_i\) 值时, \(y\) 的期望值
\(\hat{b}\) 是真实但是未知的 \(b\) 的估计值
残差 \(e\):residual, error, noise
最小二乘法求线性回归系数
\[y_i= a + b*x_i+e_i\]
\[\hat{y_i} = \hat{a} + \hat{b}x_i\]
\[e_i = y_i-\hat{y_i}\]
\[\sum_1^{n} e_i^2 = \sum (y_i-\hat{y_i})^2=\sum (y_i- \hat{a} -\hat{b}x_i)^2\]
最小二乘法求线性回归系数:
\[f=\sum_1^{n} e_i^2 =\sum (y_i- \hat{a} -\hat{b}x_i)^2\]
一阶导数等于 0:
\[\frac{\partial f}{\partial\hat{a}} = -2 \sum (y_i- \hat{a} -\hat{b}x_i)=0\]
\[\frac{\partial f}{\partial\hat{b}} = -2 \sum (y_i- \hat{a} -\hat{b}x_i)x_i=0\]
联立上面的二元一次方程组,求解 \(\hat{a}\) 与 \(\hat{b}\)
\[\hat{b}=\frac{\sum (x_i-\bar{x}) (y_i-\bar{y})}{\sum (x_i-\bar{x})^2}\]
\[\hat{a}= \bar{y} - \hat{b}\bar{x}\]
最小二乘法的特点:
唯一性
残差之和等于0
回归线必过
\((\bar{x}, \bar{y})\)
out <- lm(mpg~wt,data=mtcars)
summary(out)
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5432 -2.3647 -0.1252 1.4096 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
## wt -5.3445 0.5591 -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
轶事:回归的由来
Regression to the mean is a concept attributed to Sir Francis Galton. The basic idea is that extreme random observations will tend to be less extreme upon a second trial. This is simply due to chance alone. Note that “regression to the mean” and “linear regression” are not the same thing.
孩子的身高和父母的身高:高个子的父母一般生的孩子个子也高,但是孩子一般没有父母那么高,孩子的身高会趋向于平均数。
这就是我们没有看到乔丹的儿子能够和乔丹一样成为飞人的原因
\[y_i= a + b*x_i+e_i\]
回归模型的目标是量化变量之间的关系:
关键系数的大小
关键系数的显著度
模型的解释力
模型的预测力
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5432 -2.3647 -0.1252 1.4096 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
## wt -5.3445 0.5591 -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
理解模型:
\[e_i = y_i-\hat{y_i}\]
\[\hat{b}=\frac{\sum (x_i-\bar{x}) (y_i-\bar{y})}{\sum (x_i-\bar{x})^2}\]
\[\hat{a}= \bar{y} - \hat{b}\bar{x}\]
系数估计的标准差(Standard error): Standard error measures how the coefficient estimates can vary from the actual average value of the response variable (i.e. if the model is run more times). 每一次数据不同,结果都会有差异,但是该差异在一定范围内。
显著度与P值 (Significance and P value): Test of significance of the model shows that there is strong evidence of a linear relationship between the variables. This is visually interpreted by the significance stars *** at the end of the row. This level is a threshold that is used to indicate real findings, and not the ones by chance alone.
For each estimated regression coefficient, the variable’s p-Value Pr(>|t|) provides an estimate of the probability that the true coefficient is zero given the value of the estimate. More the number of stars near the p-Value are, more significant is the variable.
P值 多少算显著? Typically, a p-value < 0.05 is a good cut-off point), null hypothesis can be safely rejected. In the current case, p-values are well below the 0.05 threshold, so the model is indeed statistically significant.
\[R^2=1-\frac{var(e)}{var(y)}\]
my_data <- read.csv("https://www.dropbox.com/s/sy790xviuxn8psp/movie_metadata.csv?dl=1")
str(my_data)
## 'data.frame': 5043 obs. of 28 variables:
## $ color : Factor w/ 3 levels ""," Black and White",..: 3 3 3 3 1 3 3 3 3 3 ...
## $ director_name : Factor w/ 2399 levels "","A. Raven Cruz",..: 927 801 2027 377 603 106 2030 1652 1228 551 ...
## $ num_critic_for_reviews : int 723 302 602 813 NA 462 392 324 635 375 ...
## $ duration : int 178 169 148 164 NA 132 156 100 141 153 ...
## $ director_facebook_likes : int 0 563 0 22000 131 475 0 15 0 282 ...
## $ actor_3_facebook_likes : int 855 1000 161 23000 NA 530 4000 284 19000 10000 ...
## $ actor_2_name : Factor w/ 3033 levels "","50 Cent","A. Michael Baldwin",..: 1407 2218 2488 534 2432 2549 1227 801 2439 653 ...
## $ actor_1_facebook_likes : int 1000 40000 11000 27000 131 640 24000 799 26000 25000 ...
## $ gross : int 760505847 309404152 200074175 448130642 NA 73058679 336530303 200807262 458991599 301956980 ...
## $ genres : Factor w/ 914 levels "Action","Action|Adventure",..: 107 101 128 288 754 126 120 308 126 447 ...
## $ actor_1_name : Factor w/ 2098 levels "","50 Cent","A.J. Buckley",..: 302 979 353 1968 526 440 785 221 336 32 ...
## $ movie_title : Factor w/ 4917 levels "[Rec] ","[Rec] 2 ",..: 398 2731 3279 3708 3332 1961 3291 3459 399 1631 ...
## $ num_voted_users : int 886204 471220 275868 1144337 8 212204 383056 294810 462669 321795 ...
## $ cast_total_facebook_likes: int 4834 48350 11700 106759 143 1873 46055 2036 92000 58753 ...
## $ actor_3_name : Factor w/ 3522 levels "","50 Cent","A.J. Buckley",..: 3442 1392 3134 1769 1 2714 1969 2162 3018 2941 ...
## $ facenumber_in_poster : int 0 0 1 0 0 1 0 1 4 3 ...
## $ plot_keywords : Factor w/ 4761 levels "","10 year old|dog|florida|girl|supermarket",..: 1320 4283 2076 3484 1 651 4745 29 1142 2005 ...
## $ movie_imdb_link : Factor w/ 4919 levels "http://www.imdb.com/title/tt0006864/?ref_=fn_tt_tt_1",..: 2965 2721 4533 3756 4918 2476 2526 2458 4546 2551 ...
## $ num_user_for_reviews : int 3054 1238 994 2701 NA 738 1902 387 1117 973 ...
## $ language : Factor w/ 48 levels "","Aboriginal",..: 13 13 13 13 1 13 13 13 13 13 ...
## $ country : Factor w/ 66 levels "","Afghanistan",..: 65 65 63 65 1 65 65 65 65 63 ...
## $ content_rating : Factor w/ 19 levels "","Approved",..: 10 10 10 10 1 10 10 9 10 9 ...
## $ budget : num 2.37e+08 3.00e+08 2.45e+08 2.50e+08 NA ...
## $ title_year : int 2009 2007 2015 2012 NA 2012 2007 2010 2015 2009 ...
## $ actor_2_facebook_likes : int 936 5000 393 23000 12 632 11000 553 21000 11000 ...
## $ imdb_score : num 7.9 7.1 6.8 8.5 7.1 6.6 6.2 7.8 7.5 7.5 ...
## $ aspect_ratio : num 1.78 2.35 2.35 2.35 NA 2.35 2.35 1.85 2.35 2.35 ...
## $ movie_facebook_likes : int 33000 0 85000 164000 0 24000 0 29000 118000 10000 ...
names(my_data)
## [1] "color" "director_name"
## [3] "num_critic_for_reviews" "duration"
## [5] "director_facebook_likes" "actor_3_facebook_likes"
## [7] "actor_2_name" "actor_1_facebook_likes"
## [9] "gross" "genres"
## [11] "actor_1_name" "movie_title"
## [13] "num_voted_users" "cast_total_facebook_likes"
## [15] "actor_3_name" "facenumber_in_poster"
## [17] "plot_keywords" "movie_imdb_link"
## [19] "num_user_for_reviews" "language"
## [21] "country" "content_rating"
## [23] "budget" "title_year"
## [25] "actor_2_facebook_likes" "imdb_score"
## [27] "aspect_ratio" "movie_facebook_likes"
summary(my_data)
## color director_name num_critic_for_reviews
## : 19 : 104 Min. : 1.0
## Black and White: 209 Steven Spielberg: 26 1st Qu.: 50.0
## Color :4815 Woody Allen : 22 Median :110.0
## Clint Eastwood : 20 Mean :140.2
## Martin Scorsese : 20 3rd Qu.:195.0
## Ridley Scott : 17 Max. :813.0
## (Other) :4834 NA's :50
## duration director_facebook_likes actor_3_facebook_likes
## Min. : 7.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 93.0 1st Qu.: 7.0 1st Qu.: 133.0
## Median :103.0 Median : 49.0 Median : 371.5
## Mean :107.2 Mean : 686.5 Mean : 645.0
## 3rd Qu.:118.0 3rd Qu.: 194.5 3rd Qu.: 636.0
## Max. :511.0 Max. :23000.0 Max. :23000.0
## NA's :15 NA's :104 NA's :23
## actor_2_name actor_1_facebook_likes gross
## Morgan Freeman : 20 Min. : 0 Min. : 162
## Charlize Theron: 15 1st Qu.: 614 1st Qu.: 5340988
## Brad Pitt : 14 Median : 988 Median : 25517500
## : 13 Mean : 6560 Mean : 48468408
## James Franco : 11 3rd Qu.: 11000 3rd Qu.: 62309438
## Meryl Streep : 11 Max. :640000 Max. :760505847
## (Other) :4959 NA's :7 NA's :884
## genres actor_1_name
## Drama : 236 Robert De Niro : 49
## Comedy : 209 Johnny Depp : 41
## Comedy|Drama : 191 Nicolas Cage : 33
## Comedy|Drama|Romance: 187 J.K. Simmons : 31
## Comedy|Romance : 158 Bruce Willis : 30
## Drama|Romance : 152 Denzel Washington: 30
## (Other) :3910 (Other) :4829
## movie_title num_voted_users
## Ben-Hur : 3 Min. : 5
## Halloween : 3 1st Qu.: 8594
## Home : 3 Median : 34359
## King Kong : 3 Mean : 83668
## Pan : 3 3rd Qu.: 96309
## The Fast and the Furious : 3 Max. :1689764
## (Other) :5025
## cast_total_facebook_likes actor_3_name facenumber_in_poster
## Min. : 0 : 23 Min. : 0.000
## 1st Qu.: 1411 Ben Mendelsohn: 8 1st Qu.: 0.000
## Median : 3090 John Heard : 8 Median : 1.000
## Mean : 9699 Steve Coogan : 8 Mean : 1.371
## 3rd Qu.: 13756 Anne Hathaway : 7 3rd Qu.: 2.000
## Max. :656730 Jon Gries : 7 Max. :43.000
## (Other) :4982 NA's :13
## plot_keywords
## : 153
## based on novel : 4
## 1940s|child hero|fantasy world|orphan|reference to peter pan : 3
## alien friendship|alien invasion|australia|flying car|mother daughter relationship: 3
## animal name in title|ape abducts a woman|gorilla|island|king kong : 3
## assistant|experiment|frankenstein|medical student|scientist : 3
## (Other) :4874
## movie_imdb_link
## http://www.imdb.com/title/tt0077651/?ref_=fn_tt_tt_1: 3
## http://www.imdb.com/title/tt0232500/?ref_=fn_tt_tt_1: 3
## http://www.imdb.com/title/tt0360717/?ref_=fn_tt_tt_1: 3
## http://www.imdb.com/title/tt1976009/?ref_=fn_tt_tt_1: 3
## http://www.imdb.com/title/tt2224026/?ref_=fn_tt_tt_1: 3
## http://www.imdb.com/title/tt2638144/?ref_=fn_tt_tt_1: 3
## (Other) :5025
## num_user_for_reviews language country content_rating
## Min. : 1.0 English :4704 USA :3807 R :2118
## 1st Qu.: 65.0 French : 73 UK : 448 PG-13 :1461
## Median : 156.0 Spanish : 40 France : 154 PG : 701
## Mean : 272.8 Hindi : 28 Canada : 126 : 303
## 3rd Qu.: 326.0 Mandarin: 26 Germany : 97 Not Rated: 116
## Max. :5060.0 German : 19 Australia: 55 G : 112
## NA's :21 (Other) : 153 (Other) : 356 (Other) : 232
## budget title_year actor_2_facebook_likes imdb_score
## Min. :2.180e+02 Min. :1916 Min. : 0 Min. :1.600
## 1st Qu.:6.000e+06 1st Qu.:1999 1st Qu.: 281 1st Qu.:5.800
## Median :2.000e+07 Median :2005 Median : 595 Median :6.600
## Mean :3.975e+07 Mean :2002 Mean : 1652 Mean :6.442
## 3rd Qu.:4.500e+07 3rd Qu.:2011 3rd Qu.: 918 3rd Qu.:7.200
## Max. :1.222e+10 Max. :2016 Max. :137000 Max. :9.500
## NA's :492 NA's :108 NA's :13
## aspect_ratio movie_facebook_likes
## Min. : 1.18 Min. : 0
## 1st Qu.: 1.85 1st Qu.: 0
## Median : 2.35 Median : 166
## Mean : 2.22 Mean : 7526
## 3rd Qu.: 2.35 3rd Qu.: 3000
## Max. :16.00 Max. :349000
## NA's :329
将票房收入和投资转换成以百万为单位。
除掉有缺失值的样本,除去不相关变量,例如 “movie_imdb_link”,或者“director_name”。
将电影分级这一变量里面的类别重新划分,限制级,非限制级: movie rating system: https://en.wikipedia.org/wiki/Motion_Picture_Association_of_America_film_rating_system
my_data <- read.csv("https://www.dropbox.com/s/sy790xviuxn8psp/movie_metadata.csv?dl=1")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
my_data <- my_data %>%
na.omit() %>%
mutate(gross=gross/1000000, budget=budget/1000000, USA = as.numeric(country== "USA"), restricted=as.numeric(content_rating== "PG-13"|content_rating=="R"|content_rating== "NC-17")) %>%
select(-actor_1_name, -actor_2_name, -actor_3_name, -actor_2_facebook_likes, -actor_3_facebook_likes, -movie_imdb_link, -plot_keywords, -color, -director_name, -cast_total_facebook_likes, -country, -content_rating)
summary(my_data)
## num_critic_for_reviews duration director_facebook_likes
## Min. : 1.0 Min. : 37.0 Min. : 0.0
## 1st Qu.: 75.0 1st Qu.: 96.0 1st Qu.: 11.0
## Median :137.0 Median :106.0 Median : 62.0
## Mean :165.8 Mean :110.2 Mean : 798.6
## 3rd Qu.:223.0 3rd Qu.:120.0 3rd Qu.: 234.0
## Max. :813.0 Max. :330.0 Max. :23000.0
##
## actor_1_facebook_likes gross genres
## Min. : 0 Min. : 0.0002 Drama : 150
## 1st Qu.: 736 1st Qu.: 7.6895 Comedy|Drama|Romance: 148
## Median : 1000 Median : 29.2000 Comedy|Drama : 142
## Mean : 7673 Mean : 52.0058 Comedy : 138
## 3rd Qu.: 13000 3rd Qu.: 66.4664 Comedy|Romance : 131
## Max. :640000 Max. :760.5058 Drama|Romance : 116
## (Other) :2976
## movie_title num_voted_users facenumber_in_poster
## Halloween : 3 Min. : 5 Min. : 0.000
## Home : 3 1st Qu.: 18915 1st Qu.: 0.000
## King Kong : 3 Median : 53028 Median : 1.000
## Pan : 3 Mean : 104677 Mean : 1.379
## The Fast and the Furious : 3 3rd Qu.: 126916 3rd Qu.: 2.000
## Victor Frankenstein : 3 Max. :1689764 Max. :43.000
## (Other) :3783
## num_user_for_reviews language budget title_year
## Min. : 1.0 English :3626 Min. : 0.00 Min. :1920
## 1st Qu.: 107.0 French : 36 1st Qu.: 10.00 1st Qu.:1999
## Median : 208.0 Spanish : 23 Median : 25.00 Median :2005
## Mean : 333.6 Mandarin: 15 Mean : 45.85 Mean :2003
## 3rd Qu.: 397.0 German : 13 3rd Qu.: 50.00 3rd Qu.:2010
## Max. :5060.0 Japanese: 12 Max. :12215.50 Max. :2016
## (Other) : 76
## imdb_score aspect_ratio movie_facebook_likes USA
## Min. :1.600 Min. : 1.18 Min. : 0 Min. :0.0000
## 1st Qu.:5.900 1st Qu.: 1.85 1st Qu.: 0 1st Qu.:1.0000
## Median :6.600 Median : 2.35 Median : 224 Median :1.0000
## Mean :6.466 Mean : 2.11 Mean : 9262 Mean :0.7906
## 3rd Qu.:7.200 3rd Qu.: 2.35 3rd Qu.: 11000 3rd Qu.:1.0000
## Max. :9.300 Max. :16.00 Max. :349000 Max. :1.0000
##
## restricted
## Min. :0.0000
## 1st Qu.:1.0000
## Median :1.0000
## Mean :0.7966
## 3rd Qu.:1.0000
## Max. :1.0000
##
发现有的电影budget异常的高,剔除:
library(dplyr)
my_data <- my_data %>%
filter(budget<200)
summary(my_data)
## num_critic_for_reviews duration director_facebook_likes
## Min. : 1.0 Min. : 37.0 Min. : 0.0
## 1st Qu.: 75.0 1st Qu.: 95.0 1st Qu.: 11.0
## Median :135.0 Median :106.0 Median : 62.0
## Mean :161.7 Mean :109.6 Mean : 794.8
## 3rd Qu.:219.0 3rd Qu.:119.0 3rd Qu.: 232.0
## Max. :775.0 Max. :330.0 Max. :23000.0
##
## actor_1_facebook_likes gross genres
## Min. : 0 Min. : 0.0002 Drama : 149
## 1st Qu.: 733 1st Qu.: 7.5634 Comedy|Drama|Romance: 148
## Median : 1000 Median : 28.6375 Comedy|Drama : 142
## Mean : 7555 Mean : 48.8620 Comedy : 138
## 3rd Qu.: 12000 3rd Qu.: 64.3712 Comedy|Romance : 131
## Max. :640000 Max. :652.1773 Drama|Romance : 115
## (Other) :2910
## movie_title num_voted_users facenumber_in_poster
## Halloween : 3 Min. : 5 Min. : 0.000
## Home : 3 1st Qu.: 18632 1st Qu.: 0.000
## Pan : 3 Median : 51996 Median : 1.000
## The Fast and the Furious : 3 Mean : 100764 Mean : 1.383
## Victor Frankenstein : 3 3rd Qu.: 120798 3rd Qu.: 2.000
## A Nightmare on Elm Street : 2 Max. :1689764 Max. :43.000
## (Other) :3716
## num_user_for_reviews language budget title_year
## Min. : 1.0 English :3571 Min. :2.20e-04 Min. :1920
## 1st Qu.: 106.0 French : 36 1st Qu.:1.00e+01 1st Qu.:1999
## Median : 204.0 Spanish : 22 Median :2.40e+01 Median :2004
## Mean : 320.6 Mandarin : 14 Mean :3.58e+01 Mean :2003
## 3rd Qu.: 388.0 German : 13 3rd Qu.:5.00e+01 3rd Qu.:2010
## Max. :5060.0 Cantonese: 8 Max. :1.95e+02 Max. :2016
## (Other) : 69
## imdb_score aspect_ratio movie_facebook_likes USA
## Min. :1.600 Min. : 1.180 Min. : 0 Min. :0.0000
## 1st Qu.:5.900 1st Qu.: 1.850 1st Qu.: 0 1st Qu.:1.0000
## Median :6.600 Median : 2.350 Median : 206 Median :1.0000
## Mean :6.456 Mean : 2.108 Mean : 8808 Mean :0.7932
## 3rd Qu.:7.200 3rd Qu.: 2.350 3rd Qu.: 11000 3rd Qu.:1.0000
## Max. :9.300 Max. :16.000 Max. :349000 Max. :1.0000
##
## restricted
## Min. :0.0000
## 1st Qu.:1.0000
## Median :1.0000
## Mean :0.7961
## 3rd Qu.:1.0000
## Max. :1.0000
##
相关系数,精确到三位小数点(用round这个函数)
library(corrplot)
## corrplot 0.84 loaded
my_data %>%
select(-movie_title, -genres, -language, -title_year) %>%
cor() %>%
round(3) %>%
corrplot(tl.srt=45)
使用下面的公式,计算相关性的P值,是否显著相关:
library(corrplot)
# mat : is a matrix of data
# ... : further arguments to pass to the native R cor.test function
cor.mtest <- function(mat, ...) {
mat <- as.matrix(mat)
n <- ncol(mat)
p.mat<- matrix(NA, n, n)
diag(p.mat) <- 0
for (i in 1:(n - 1)) {
for (j in (i + 1):n) {
tmp <- cor.test(mat[, i], mat[, j], ...)
p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
}
}
colnames(p.mat) <- rownames(p.mat) <- colnames(mat)
p.mat
}
# matrix of the p-value of the correlation
p.mat <- my_data %>%
select(-movie_title, -genres, -language, -title_year) %>%
cor.mtest()
绘制关联矩阵:
my_data %>%
select(-movie_title, -genres, -language, -title_year) %>%
cor() %>%
round(3) %>%
corrplot(type="upper", p.mat = p.mat, sig.level = 0.05, tl.srt=45, method = "circle")
将数据随机分成训练数据和测试数据:
set.seed(100)
trainingRowIndex <- sample(1:nrow(my_data), 0.7*nrow(my_data))
training_data <- my_data[trainingRowIndex, ] # model training data
testing_data <- my_data[-trainingRowIndex, ] # test data
最关心的变量 gross, budget, imdb_score,看一看他们的分布:
library(ggplot2)
library(dplyr)
training_data %>%
ggplot(aes(x=gross))+
geom_histogram(color='black', fill='white') +
ylab('counts') +
xlab('Gross Income (million)')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggtitle('gross income')
## $title
## [1] "gross income"
##
## attr(,"class")
## [1] "labels"
library(ggplot2)
library(dplyr)
training_data %>%
ggplot(aes(x=gross))+
geom_histogram(color='black', fill='white') +
geom_vline(aes(xintercept=mean(gross)), linetype="dashed", color="blue", size=1)+
geom_vline(aes(xintercept=median(gross)), linetype="dashed", color="red", size=1)+
ylab('counts') +
xlab('Gross Income (million)')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggtitle('gross income')
## $title
## [1] "gross income"
##
## attr(,"class")
## [1] "labels"
library(ggplot2)
library(dplyr)
training_data %>%
filter(gross<400) %>%
ggplot(aes(x=gross))+
geom_histogram(color='black', fill='white') +
geom_vline(aes(xintercept=mean(gross)), linetype="dashed", color="blue", size=1)+
geom_vline(aes(xintercept=median(gross)), linetype="dashed", color="red", size=1)+
ylab('counts') +
xlab('Gross Income (million)')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggtitle('gross income')
## $title
## [1] "gross income"
##
## attr(,"class")
## [1] "labels"
library(ggplot2)
library(dplyr)
training_data %>%
ggplot(aes(x=budget))+
geom_histogram(color='black', fill='white') +
geom_vline(aes(xintercept=mean(budget)), linetype="dashed", color="blue", size=1)+
geom_vline(aes(xintercept=median(budget)), linetype="dashed", color="red", size=1)+
ylab('counts') +
xlab('Budget (million)')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggtitle('Budget')
## $title
## [1] "Budget"
##
## attr(,"class")
## [1] "labels"
library(ggplot2)
library(dplyr)
training_data %>%
ggplot(aes(x=imdb_score))+
geom_histogram(color='black', fill='white') +
ylab('counts') +
ggtitle('imdb_score')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
library(ggplot2)
library(dplyr)
training_data %>%
ggplot(aes(x=imdb_score))+
geom_histogram(aes(y=..density..), color='black', fill='white') +
ylab('percent') +
ggtitle('imdb_score')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
library(ggplot2)
library(dplyr)
training_data %>%
ggplot(aes(x=imdb_score))+
geom_histogram(aes(y=..density..), binwidth=0.5, color='black', fill='white') +
geom_vline(aes(xintercept=mean(imdb_score)), linetype="dashed", color="blue", size=1)+
ylab('percent') +
ggtitle('imdb_score')
str(training_data)
## 'data.frame': 2613 obs. of 18 variables:
## $ num_critic_for_reviews : int 303 72 20 354 141 229 14 135 81 249 ...
## $ duration : int 136 127 106 129 124 121 108 105 101 119 ...
## $ director_facebook_likes: int 3000 65 277 235 159 353 208 125 13 218 ...
## $ actor_1_facebook_likes : int 14000 788 22000 13000 288 45 876 18000 651 13000 ...
## $ gross : num 42.6 14.9 10.6 134.5 30 ...
## $ genres : Factor w/ 914 levels "Action","Action|Adventure",..: 708 536 571 128 550 844 357 700 697 280 ...
## $ movie_title : Factor w/ 4917 levels "[Rec] ","[Rec] 2 ",..: 123 3938 4756 2163 1186 568 3506 2744 106 3604 ...
## $ num_voted_users : int 136093 2581 17349 336235 75973 59462 6384 23916 14147 407601 ...
## $ facenumber_in_poster : int 3 1 0 1 1 0 0 5 3 1 ...
## $ num_user_for_reviews : int 416 29 42 782 409 300 54 50 92 849 ...
## $ language : Factor w/ 48 levels "","Aboriginal",..: 13 13 13 13 13 13 13 13 13 13 ...
## $ budget : num 40 25 20 110 25 25 6.5 35 20 60 ...
## $ title_year : int 2014 2016 1989 2007 1998 2008 1993 2012 2003 2002 ...
## $ imdb_score : num 6.1 7.3 6 7.2 7.5 6.6 6.2 5.7 5.6 7.9 ...
## $ aspect_ratio : num 2.35 2.35 2.35 2.35 1.85 1.85 1.85 2.35 1.85 2.35 ...
## $ movie_facebook_likes : int 24000 0 855 0 0 0 391 0 401 0 ...
## $ USA : num 1 0 1 1 0 0 1 1 1 1 ...
## $ restricted : num 1 1 1 1 1 1 0 1 1 1 ...
# Model specification using lm
fullModel <- lm(gross ~.-genres -movie_title -language,
data = my_data)
# Looking at model summary
summary(fullModel)
##
## Call:
## lm(formula = gross ~ . - genres - movie_title - language, data = my_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -322.28 -19.02 -5.07 12.45 452.31
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.001e+02 1.726e+02 5.216 1.93e-07 ***
## num_critic_for_reviews 2.760e-02 1.011e-02 2.730 0.006363 **
## duration -1.183e-01 3.666e-02 -3.226 0.001267 **
## director_facebook_likes -9.040e-04 2.376e-04 -3.805 0.000144 ***
## actor_1_facebook_likes -3.751e-05 4.535e-05 -0.827 0.408228
## num_voted_users 1.818e-04 8.823e-06 20.601 < 2e-16 ***
## facenumber_in_poster -3.166e-02 3.357e-01 -0.094 0.924872
## num_user_for_reviews 4.483e-03 3.020e-03 1.484 0.137841
## budget 7.181e-01 2.237e-02 32.098 < 2e-16 ***
## title_year -4.395e-01 8.602e-02 -5.109 3.40e-07 ***
## imdb_score 4.169e-01 8.269e-01 0.504 0.614202
## aspect_ratio -4.289e+00 2.022e+00 -2.121 0.033968 *
## movie_facebook_likes -3.824e-06 4.850e-05 -0.079 0.937159
## USA 1.777e+01 1.743e+00 10.196 < 2e-16 ***
## restricted -1.845e+01 1.816e+00 -10.158 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 41.58 on 3718 degrees of freedom
## Multiple R-squared: 0.5592, Adjusted R-squared: 0.5575
## F-statistic: 336.9 on 14 and 3718 DF, p-value: < 2.2e-16
使用step函数,剔除无关变量: Choose a model by AIC in a Stepwise Algorithm.
In the absence of subject-matter expertise, stepwise regression can assist with the search for the most important predictors of the outcome of interest.
# Model specification using lm
fullModel <- lm(gross ~.-genres -movie_title -language,
data = my_data)
newModel <- step(fullModel)
## Start: AIC=27846.05
## gross ~ (num_critic_for_reviews + duration + director_facebook_likes +
## actor_1_facebook_likes + genres + movie_title + num_voted_users +
## facenumber_in_poster + num_user_for_reviews + language +
## budget + title_year + imdb_score + aspect_ratio + movie_facebook_likes +
## USA + restricted) - genres - movie_title - language
##
## Df Sum of Sq RSS AIC
## - movie_facebook_likes 1 11 6429210 27844
## - facenumber_in_poster 1 15 6429214 27844
## - imdb_score 1 439 6429638 27844
## - actor_1_facebook_likes 1 1183 6430382 27845
## <none> 6429199 27846
## - num_user_for_reviews 1 3809 6433008 27846
## - aspect_ratio 1 7781 6436980 27849
## - num_critic_for_reviews 1 12888 6442087 27852
## - duration 1 17993 6447192 27854
## - director_facebook_likes 1 25038 6454237 27859
## - title_year 1 45133 6474331 27870
## - restricted 1 178429 6607628 27946
## - USA 1 179759 6608958 27947
## - num_voted_users 1 733845 7163044 28248
## - budget 1 1781578 8210777 28757
##
## Step: AIC=27844.06
## gross ~ num_critic_for_reviews + duration + director_facebook_likes +
## actor_1_facebook_likes + num_voted_users + facenumber_in_poster +
## num_user_for_reviews + budget + title_year + imdb_score +
## aspect_ratio + USA + restricted
##
## Df Sum of Sq RSS AIC
## - facenumber_in_poster 1 16 6429226 27842
## - imdb_score 1 448 6429657 27842
## - actor_1_facebook_likes 1 1180 6430390 27843
## <none> 6429210 27844
## - num_user_for_reviews 1 4082 6433291 27844
## - aspect_ratio 1 7770 6436980 27847
## - num_critic_for_reviews 1 17160 6446369 27852
## - duration 1 18274 6447484 27853
## - director_facebook_likes 1 25043 6454252 27857
## - title_year 1 45529 6474739 27868
## - restricted 1 178817 6608026 27944
## - USA 1 179753 6608962 27945
## - num_voted_users 1 791187 7220396 28275
## - budget 1 1786771 8215981 28758
##
## Step: AIC=27842.07
## gross ~ num_critic_for_reviews + duration + director_facebook_likes +
## actor_1_facebook_likes + num_voted_users + num_user_for_reviews +
## budget + title_year + imdb_score + aspect_ratio + USA + restricted
##
## Df Sum of Sq RSS AIC
## - imdb_score 1 460 6429686 27840
## - actor_1_facebook_likes 1 1200 6430425 27841
## <none> 6429226 27842
## - num_user_for_reviews 1 4146 6433372 27842
## - aspect_ratio 1 7772 6436997 27845
## - num_critic_for_reviews 1 17207 6446432 27850
## - duration 1 18499 6447725 27851
## - director_facebook_likes 1 25035 6454261 27855
## - title_year 1 45877 6475102 27867
## - restricted 1 178986 6608212 27943
## - USA 1 179791 6609016 27943
## - num_voted_users 1 793978 7223203 28275
## - budget 1 1789904 8219130 28757
##
## Step: AIC=27840.33
## gross ~ num_critic_for_reviews + duration + director_facebook_likes +
## actor_1_facebook_likes + num_voted_users + num_user_for_reviews +
## budget + title_year + aspect_ratio + USA + restricted
##
## Df Sum of Sq RSS AIC
## - actor_1_facebook_likes 1 1150 6430836 27839
## <none> 6429686 27840
## - num_user_for_reviews 1 3844 6433530 27841
## - aspect_ratio 1 7827 6437514 27843
## - duration 1 18318 6448004 27849
## - num_critic_for_reviews 1 19754 6449440 27850
## - director_facebook_likes 1 24878 6454565 27853
## - title_year 1 48581 6478268 27866
## - restricted 1 180981 6610667 27942
## - USA 1 181609 6611295 27942
## - num_voted_users 1 900134 7329820 28328
## - budget 1 1877049 8306735 28794
##
## Step: AIC=27839
## gross ~ num_critic_for_reviews + duration + director_facebook_likes +
## num_voted_users + num_user_for_reviews + budget + title_year +
## aspect_ratio + USA + restricted
##
## Df Sum of Sq RSS AIC
## <none> 6430836 27839
## - num_user_for_reviews 1 4066 6434903 27839
## - aspect_ratio 1 7890 6438726 27842
## - duration 1 18513 6449349 27848
## - num_critic_for_reviews 1 19493 6450330 27848
## - director_facebook_likes 1 25249 6456085 27852
## - title_year 1 49239 6480076 27866
## - USA 1 180592 6611428 27940
## - restricted 1 182472 6613308 27941
## - num_voted_users 1 901931 7332767 28327
## - budget 1 1880622 8311458 28795
summary(newModel)
##
## Call:
## lm(formula = gross ~ num_critic_for_reviews + duration + director_facebook_likes +
## num_voted_users + num_user_for_reviews + budget + title_year +
## aspect_ratio + USA + restricted, data = my_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -322.59 -19.11 -4.87 12.50 452.39
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.240e+02 1.685e+02 5.485 4.42e-08 ***
## num_critic_for_reviews 2.808e-02 8.361e-03 3.359 0.000790 ***
## duration -1.143e-01 3.490e-02 -3.273 0.001073 **
## director_facebook_likes -9.062e-04 2.371e-04 -3.823 0.000134 ***
## num_voted_users 1.823e-04 7.980e-06 22.848 < 2e-16 ***
## num_user_for_reviews 4.462e-03 2.908e-03 1.534 0.125089
## budget 7.143e-01 2.165e-02 32.992 < 2e-16 ***
## title_year -4.503e-01 8.435e-02 -5.338 9.94e-08 ***
## aspect_ratio -4.316e+00 2.019e+00 -2.137 0.032666 *
## USA 1.755e+01 1.716e+00 10.224 < 2e-16 ***
## restricted -1.856e+01 1.807e+00 -10.277 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 41.57 on 3722 degrees of freedom
## Multiple R-squared: 0.5591, Adjusted R-squared: 0.5579
## F-statistic: 471.9 on 10 and 3722 DF, p-value: < 2.2e-16
newFormula <- as.formula(summary(newModel)$call)
newFormula
## gross ~ num_critic_for_reviews + duration + director_facebook_likes +
## num_voted_users + num_user_for_reviews + budget + title_year +
## aspect_ratio + USA + restricted
bestModel<- training_data %>%
lm(formula=newFormula)
summary(bestModel)
##
## Call:
## lm(formula = newFormula, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -291.22 -18.85 -4.99 11.84 450.89
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.283e+02 2.062e+02 4.502 7.02e-06 ***
## num_critic_for_reviews 3.535e-02 1.016e-02 3.480 0.000511 ***
## duration -1.074e-01 4.183e-02 -2.568 0.010284 *
## director_facebook_likes -6.703e-04 2.922e-04 -2.294 0.021863 *
## num_voted_users 1.604e-04 9.745e-06 16.461 < 2e-16 ***
## num_user_for_reviews 5.344e-03 3.554e-03 1.504 0.132823
## budget 7.503e-01 2.607e-02 28.786 < 2e-16 ***
## title_year -4.527e-01 1.032e-01 -4.388 1.19e-05 ***
## aspect_ratio -5.096e+00 2.241e+00 -2.274 0.023045 *
## USA 1.624e+01 2.069e+00 7.849 6.07e-15 ***
## restricted -1.665e+01 2.190e+00 -7.603 4.01e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.17 on 2602 degrees of freedom
## Multiple R-squared: 0.5402, Adjusted R-squared: 0.5385
## F-statistic: 305.7 on 10 and 2602 DF, p-value: < 2.2e-16
多元回归的共线性问题处理: Estimate the variance inflation factors using the vif() function from the rms package. A variance inflation factor greater than 5 hints to multicollinearity, greater than 10 indicates unstable regression estimates.
library(rms)
## Loading required package: Hmisc
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
## Loading required package: SparseM
##
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
##
## backsolve
# Checking variance inflation factors
vif(bestModel)
## num_critic_for_reviews duration director_facebook_likes
## 2.144584 1.261584 1.121974
## num_voted_users num_user_for_reviews budget
## 2.783532 2.598297 1.400162
## title_year aspect_ratio USA
## 1.554954 1.089147 1.046986
## restricted
## 1.146993
Since none of the variance inflation factors is greater than 10 we can certainly accept our bestModel.
summary(bestModel)
##
## Call:
## lm(formula = newFormula, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -291.22 -18.85 -4.99 11.84 450.89
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.283e+02 2.062e+02 4.502 7.02e-06 ***
## num_critic_for_reviews 3.535e-02 1.016e-02 3.480 0.000511 ***
## duration -1.074e-01 4.183e-02 -2.568 0.010284 *
## director_facebook_likes -6.703e-04 2.922e-04 -2.294 0.021863 *
## num_voted_users 1.604e-04 9.745e-06 16.461 < 2e-16 ***
## num_user_for_reviews 5.344e-03 3.554e-03 1.504 0.132823
## budget 7.503e-01 2.607e-02 28.786 < 2e-16 ***
## title_year -4.527e-01 1.032e-01 -4.388 1.19e-05 ***
## aspect_ratio -5.096e+00 2.241e+00 -2.274 0.023045 *
## USA 1.624e+01 2.069e+00 7.849 6.07e-15 ***
## restricted -1.665e+01 2.190e+00 -7.603 4.01e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.17 on 2602 degrees of freedom
## Multiple R-squared: 0.5402, Adjusted R-squared: 0.5385
## F-statistic: 305.7 on 10 and 2602 DF, p-value: < 2.2e-16
我们使用刚刚建立的模型来做预测:
grossPred <- predict(bestModel, testing_data) # predict distance
检验一下,我们的预测效果
actuals_preds <- data.frame(cbind(actuals=testing_data$gross, predicteds=grossPred)) # make actuals_predicteds dataframe.
attach(actuals_preds)
correlation_accuracy <- cor(actuals,predicteds)
correlation_accuracy
## [1] 0.7727497
实际对比一下原始数据和预测数据
head(actuals_preds, n=10)
## actuals predicteds
## 1 70.08352 166.5624
## 4 350.03411 194.4259
## 9 101.78548 223.0923
## 20 533.31606 431.2951
## 23 318.29818 230.2883
## 25 113.74541 156.8614
## 27 161.08718 160.1627
## 32 356.45437 222.5212
## 35 241.40733 203.4240
## 45 217.38800 226.4663
还可以再进一步,交叉验证:从上面6,4分的训练测试数据得到的模型表现不错,这是不是偶然碰巧的结果? 随机把数据分成K等份,K-1做训练,1做测试。得到k种结果,看看预测结果之间偏差如何? 这个方法通常在数据量不足以支持数据分成训练组和测试组的时候 k- Fold Cross validation 曲线是否平行
library(DAAG)
##
## Attaching package: 'DAAG'
## The following object is masked from 'package:rms':
##
## vif
## The following object is masked from 'package:survival':
##
## lung
kfold <- CVlm(data = my_data, form.lm = formula(newFormula), m=5,
dots = FALSE, seed=123, legend.pos="topleft",
main="Cross Validation; k=5",
plotit=TRUE, printit=FALSE)
## Warning in CVlm(data = my_data, form.lm = formula(newFormula), m = 5, dots = FALSE, :
##
## As there is >1 explanatory variable, cross-validation
## predicted values for a fold are not a linear function
## of corresponding overall predicted values. Lines that
## are shown for the different folds are approximate
Remember, cross-validation validates the modeling process, not an actual model
还有一种 kWayCrossValidation(nRows, nSplits, NULL, NULL)
library(vtreat)
完结撒花!