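The goal of a regression model is to quantify the relationship between variables:
- the size of the key coefficients
- the significance of the key coefficients
- the explanatory power of the model
- the predictive power of the model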
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.5432 -2.3647 -0.1252  1.4096  6.8727
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
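The output above comes from the call shown in its `Call:` line. As a minimal sketch, assuming only the `mtcars` data that ships with base R, the same fit can be reproduced and its individual pieces pulled out by name (the object names `fit` and `s` are illustrative, not from the original):

```r
# Fit the simple regression named in the Call line of the output above.
fit <- lm(mpg ~ wt, data = mtcars)   # mtcars is a built-in data set
s   <- summary(fit)

coef(fit)          # point estimates: intercept and slope on wt
s$coefficients     # estimates, standard errors, t values, p-values
s$sigma            # residual standard error
s$r.squared        # multiple R-squared
s$adj.r.squared    # adjusted R-squared
s$fstatistic       # F statistic with its numerator/denominator df
```

Extracting values this way avoids copying numbers from the printed summary by hand.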
Understanding the model:
- Residuals: the residuals show how close the values predicted by the model are to the observed response values.
\[e_i = y_i-\hat{y}_i\]
- Coefficient estimates (Estimate): the slope estimate tells how much y changes, on average, when x increases by one unit.
\[\hat{b}=\frac{\sum (x_i-\bar{x}) (y_i-\bar{y})}{\sum (x_i-\bar{x})^2}\]
\[\hat{a}= \bar{y} - \hat{b}\bar{x}\]
- Standard error of the coefficient estimates (Std. Error): the standard error measures how much a coefficient estimate would vary if the model were re-estimated on new samples from the same population. Each new data set gives a slightly different result, but the differences stay within a certain range.
- Significance and p-value: the significance test asks whether the data provide evidence of a linear relationship between the variables. The significance stars (***) at the end of each row summarize the result visually. The significance level is a threshold used to separate findings that are unlikely to arise by chance alone from those that could plausibly be chance.
For each estimated regression coefficient, the p-value Pr(>|t|) is the probability of observing a t value at least as extreme as the one obtained if the true coefficient were zero. It is not the probability that the true coefficient is zero. The more stars next to the p-value, the smaller the p-value and the stronger the evidence against a zero coefficient.
Every p-value belongs to a hypothesis test. In linear regression, the null hypothesis is that the coefficient of the variable equals zero. The alternative hypothesis is that the coefficient is not zero, so that a relationship exists between the independent variable and the dependent variable.
If the p-value is below the chosen significance level (0.05 is a conventional cut-off), the null hypothesis is rejected. In the current case, both p-values are far below 0.05, so the intercept and the coefficient on wt are statistically significant.
- R-squared and adjusted R-squared: for simple linear regression, R-squared is the square of the correlation between the two variables. Its value lies between 0 and 1: a value close to 0 means that the regression model explains almost none of the variance in the response variable, while a value close to 1 means that the observed variance in the response variable is well explained. In the current case, R-squared indicates that the linear fit explains about 75% of the variance observed in the data. Adjusted R-squared (0.7446 here) additionally penalizes the number of predictors, which makes it more suitable for comparing models of different size.
\[R^2=1-\frac{\operatorname{var}(e)}{\operatorname{var}(y)}\]
- F statistic: the F-test compares the fitted model with the intercept-only model (zero predictor variables) and indicates whether the added coefficients improve the model. If the result is significant, the predictors included in the model (only one here, since this is a simple regression) improve the model's fit. The F statistic therefore measures the joint effect of all predictor variables on the response variable. In this model, F = 91.38 on 1 and 30 degrees of freedom (p-value 1.294e-10), which is far greater than 1.
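The quantities described in this list can be checked by hand. The sketch below assumes the built-in `mtcars` data and the model mpg = a + b * wt. It applies the closed-form expressions for the slope and intercept and the variance-ratio form of R-squared given above. The standard-error, t, p and F formulas are the usual textbook ones for simple regression and are not spelled out in the original. All variable names are illustrative.

```r
x <- mtcars$wt
y <- mtcars$mpg
n <- length(y)

# Slope and intercept from the closed-form least-squares expressions
b_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
a_hat <- mean(y) - b_hat * mean(x)

# Residuals: e_i = y_i - y_hat_i
e <- y - (a_hat + b_hat * x)

# R-squared as 1 - var(e) / var(y)
r2 <- 1 - var(e) / var(y)

# Residual standard error, standard error of the slope, t value, p-value
# (textbook formulas for simple regression with n - 2 degrees of freedom)
sigma_hat <- sqrt(sum(e^2) / (n - 2))
se_b <- sigma_hat / sqrt(sum((x - mean(x))^2))
t_b  <- b_hat / se_b
p_b  <- 2 * pt(-abs(t_b), df = n - 2)

# F statistic for one predictor; it equals the squared t value of the slope
f_stat <- (r2 / 1) / ((1 - r2) / (n - 2))

# Compare with lm(): both comparisons should return TRUE
fit <- lm(mpg ~ wt, data = mtcars)
all.equal(unname(coef(fit)), c(a_hat, b_hat))
all.equal(summary(fit)$r.squared, r2)
```

The last two lines confirm that the hand computation agrees with `lm()` up to floating-point tolerance.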
Import data
SalariesIn <- read.csv("https://www.dropbox.com/s/h4wny8slh4x1v1s/Salaries.csv?dl=1")
str(SalariesIn)
Review of subsetting
SalariesIn$yrs.service        # yrs.service variable
SalariesIn[2,]                # second row (observation)
SalariesIn$discipline[6]      # 6th observation of the discipline variable
SalariesIn[5,"yrs.service"]   # row 5 of the yrs.service variable
SalariesIn[4,3]               # row 4, column 3
SalariesIn[2:6,]              # observations 2 through 6
SalariesIn[c(7,9,16),]        # observations 7, 9, and 16
Check that the data contain no missing values
sum( is.na(SalariesIn) )   # number of NA cells
sum( SalariesIn=="" )      # number of empty-string cells

salary <- SalariesIn       # working copy
Rescale the salary variable to thousands, take its logarithm, and store the result as a new variable logSalary
salary$salary <- salary$salary / 1000
logSalary <- log(salary$salary)
Split salary into three levels and store the result as a new variable salaryLevel
salaryLevel = ifelse(salary$salary>134, "high",
 ifelse(salary$salary<91, "low","middle"
 ) )
str(salaryLevel)
Add the new variables to the original data
salary <- data.frame(salary,
 logSalary = logSalary,
 salaryLevels = factor(salaryLevel)
)
Rename the variables
colnames(salary) <- c("rank","dscpl","yrSin","yrSer","sex",
                      "salary","logSal","salLev")
View descriptive statistics (data exploration)
summary(salary)
Convert professor rank into an ordered categorical variable
salary$rank <- ordered(salary$rank, levels=c("AsstProf","AssocProf","Prof"))
salaryNum <- salary
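Use an apply-style function to convert every column to numeric
# Character columns are turned into factors first, so that as.numeric() returns
# level codes instead of NA (read.csv() keeps strings as character since R 4.0).
salaryNum <- sapply(salaryNum, function(x) if (is.numeric(x)) x else as.numeric(as.factor(x)))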
Correlation coefficients, rounded to three decimal places (with the round function); -c(8) excludes the variable in column 8
round( cor(salaryNum[ ,-c(8)]), 3)
Professors' salaries versus years of service
library(ggplot2)   # needed for all ggplot() calls in this document

ggplot(data=salary, aes(x=yrSer, y=salary)) +
  geom_point() +
  theme_bw() +
  ggtitle("Professor's salaries from 2008-9")
Refinement: x-axis and y-axis labels
ggplot(data=salary, aes(x=yrSer, y=salary)) +
  geom_point() +
  theme_bw() +
  ggtitle("Professor's salaries from 2008-9") +
  theme( plot.title=element_text(vjust=1.0) ) +
  xlab("Years of service") +
  theme( axis.title.x = element_text(vjust=-.5) ) +
  ylab("Salary in thousands of dollars") +
  theme( axis.title.y = element_text(vjust=1.0) )
Refinement: one panel per professor rank, with a changed strip background
plotSalFacRank <- ggplot(data=salary, aes(x=yrSer, y=salary)) +
  geom_point() +
  theme_bw() +
  ggtitle("Professor's salaries from 2008-9") +
  theme( plot.title=element_text(vjust=1.0) ) +
  xlab("Years of service") +
  theme( axis.title.x = element_text(vjust=-.5) ) +
  ylab("Salary in thousands of dollars") +
  theme( axis.title.y = element_text(vjust=1.0) ) +
  facet_wrap(~rank) +
  theme(strip.background = element_rect(fill = "White"))

plotSalFacRank
Scatter plot: add a plot title and place the legend at the bottom
# The x variable is yrSer: the column yrs.service was renamed earlier.
ggplot(data=salary, aes(x=yrSer, y=salary)) +
  geom_point(aes(color=rank)) +
  theme_bw() +
  ggtitle("Professor's salaries from 2008-9") +
  theme( plot.title=element_text(vjust=1.0) ) +
  xlab("Years of service") +
  theme( axis.title.x = element_text(vjust=-.5) ) +
  ylab("Salary in thousands of dollars") +
  theme( axis.title.y = element_text(vjust=1.0) ) +
  theme(legend.position = "bottom")
plot(salary[,-c(8)])   # scatter plot matrix of all variables except column 8
Simple regression model
out <- lm(salary~sex,data=salary)
summary(out)
Model evaluation: adjusted R-squared, BIC, AIC. Start from the fullest models:
- salary ~ discipline + sex + rank + years of service + years of service squared
- salary ~ discipline + sex + rank + years since PhD + years since PhD squared
Which model is better? Compare the adjusted R-squared values.
salary$yrSinSqr <- salary$yrSin^2
salary$yrSerSqr <- salary$yrSer^2

outSin <- lm(salary~dscpl + sex + rank + yrSin+yrSinSqr,data=salary)
summary(outSin)

outSer <- lm(salary~dscpl + sex + rank + yrSer+yrSerSqr,data=salary)
summary(outSer)
Can any variables be removed? Use the step function. The best model (lowest AIC):
step(outSin)
best<-lm(salary~dscpl + rank + yrSer+yrSerSqr,data=salary)
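The comparison by adjusted R-squared, AIC and BIC, followed by stepwise elimination, can be sketched on a self-contained example. The salary data are downloaded from a URL, so this sketch uses the built-in `mtcars` data instead. The two candidate models and every object name are hypothetical and chosen only for illustration.

```r
# Two candidate models on the built-in mtcars data (illustrative only)
m_small <- lm(mpg ~ wt, data = mtcars)
m_large <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Collect the three criteria side by side:
# higher adjusted R-squared is better, lower AIC/BIC is better
comparison <- data.frame(
  model  = c("m_small", "m_large"),
  adj_r2 = c(summary(m_small)$adj.r.squared, summary(m_large)$adj.r.squared),
  AIC    = c(AIC(m_small), AIC(m_large)),
  BIC    = c(BIC(m_small), BIC(m_large))
)
comparison

# Backward elimination by AIC, starting from the larger model;
# trace = 0 suppresses the step-by-step printout
m_step <- step(m_large, trace = 0)
formula(m_step)
```

Both models here are fitted to the same rows, which is what makes their AIC and BIC values directly comparable.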
Model prediction:
set.seed(100) # set the seed so that the same result can be reproduced
trainingRowIndex <- sample(1:nrow(salary), 0.8*nrow(salary))
# split the data 80/20: training vs. testing
trainingData <- salary[trainingRowIndex, ] # model training data
testData <- salary[-trainingRowIndex, ] # test data
Then build a model on the training data
bestMod <- lm(salary~dscpl + rank + yrSer+yrSerSqr,data=trainingData) # build the model
Next, apply the model to the test data to make predictions
salaryPred <- predict(bestMod, testData) # predicted salary for the test data
Check how well the predictions perform
actuals_preds <- data.frame(actuals=testData$salary, predicteds=salaryPred) # actual vs. predicted values
# Refer to the columns explicitly instead of using attach()
correlation_accuracy <- cor(actuals_preds$actuals, actuals_preds$predicteds)
correlation_accuracy
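The whole train, predict and evaluate sequence above fits in a few lines. The sketch below assumes the built-in `mtcars` data in place of the salary data, and the object names are illustrative. Besides the correlation between actual and predicted values used in this document, it also reports the root mean squared error (RMSE). RMSE is not used in the original and is added here only because it is expressed in the units of the response.

```r
set.seed(100)                                   # reproducible split
n <- nrow(mtcars)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))   # 80% for training
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

mod  <- lm(mpg ~ wt, data = train)              # fit on the training rows only
pred <- predict(mod, newdata = test)            # predict the held-out rows

cor_acc <- cor(test$mpg, pred)                  # correlation of actual vs. predicted
rmse    <- sqrt(mean((test$mpg - pred)^2))      # typical prediction error, in mpg
c(correlation = cor_acc, rmse = rmse)
```

A high correlation shows that predictions move with the actual values, while RMSE shows how far off they typically are. The two can disagree, so it can help to look at both.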
Read the data
advertising <- read.csv("~/R Teaching/demo code/regression3/advertising.csv")

#advertising
colSums(is.na(advertising))   # missing values per column
summary(advertising)
par(mfrow=c(1, 2))  # divide the graph area into two parts

boxplot(advertising$sales, col = "yellow", border="blue",
        main = "SALES boxplot",
        ylab = "sales (units)")

boxplot(advertising$TV, col = "orange", border="blue",
        main = "TV boxplot",
        ylab = "AD budget in TV (in ten thousand dollars)")
Find the outliers
# sales outliers
boxplot.stats(advertising$sales)$out
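`boxplot.stats()` flags a point as an outlier when it lies more than 1.5 times the hinge spread beyond the lower or upper hinge. A minimal sketch of that rule, assuming the built-in `mtcars$hp` column as stand-in data (the advertising file is local and not available here), with illustrative variable names:

```r
x <- mtcars$hp                      # stand-in data from base R

out_bp <- boxplot.stats(x)$out      # outliers as reported by boxplot.stats()

# The same rule by hand: boxplot.stats() uses the hinges from fivenum(),
# which can differ slightly from quantile()-based quartiles.
five   <- fivenum(x)                # min, lower hinge, median, upper hinge, max
spread <- five[4] - five[2]         # distance between the hinges
lower  <- five[2] - 1.5 * spread
upper  <- five[4] + 1.5 * spread
out_manual <- x[x < lower | x > upper]

identical(out_bp, out_manual)       # both approaches flag the same points
```

The factor 1.5 is the default `coef` argument of `boxplot.stats()`. A larger value flags fewer points.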
Histogram of sales
# Written with ggplot() + geom_histogram() instead of the deprecated qplot();
# the invalid show_guide= and labels= arguments of geom_vline() are dropped.
ggplot(advertising, aes(x = sales)) +
  geom_histogram(binwidth = 1, fill = "azure4", colour = "azure3") +
  labs(title = "sales", x = "sales (units)", y = "Frequency") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_y_continuous(breaks = c(0,2,4,6,8,10,12,14,16,18), minor_breaks = NULL) +
  scale_x_continuous(breaks = c(1,5,10,20,30,50), minor_breaks = 5) +
  geom_vline(xintercept = mean(advertising$sales), color = "red") +    # red line: mean
  geom_vline(xintercept = median(advertising$sales), color = "blue")   # blue line: median
Histogram of TV
# Written with ggplot() + geom_histogram() instead of the deprecated qplot();
# the invalid show_guide= and labels= arguments of geom_vline() are dropped.
ggplot(advertising, aes(x = TV)) +
  geom_histogram(binwidth = 5, fill = "azure4", colour = "azure3") +
  labs(title = "TV", x = "AD budget in TV (in ten thousand dollars)", y = "Frequency") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_y_continuous(breaks = c(0,2,4,6,8,10,12,14,16,18), minor_breaks = NULL) +
  scale_x_continuous(limits = c(0, 300)) +
  geom_vline(xintercept = mean(advertising$TV), color = "red") +    # red line: mean
  geom_vline(xintercept = median(advertising$TV), color = "blue")   # blue line: median
Scatter plot
ggplot(advertising, aes(x=TV, y=sales))+
  geom_point(colour = "blue", size = 1.5)+
  scale_y_continuous(limits=c(0,50))+
  scale_x_continuous(limits=c(0,300))+
  theme(plot.title = element_text(hjust = 0.5)) +
  ggtitle("TV ad budget and sales relationship")
Add a linear regression fit line to the scatter plot
ggplot(advertising, aes(x=TV, y=sales))+
  geom_point(colour = "blue", size = 1.5)+
  scale_y_continuous(limits=c(0,50))+
  scale_x_continuous(limits=c(0,300))+
  theme(plot.title = element_text(hjust = 0.5)) +
  ggtitle("TV ad budget and sales relationship")+
  geom_smooth(method="lm", color="red")
Build the model
model1 <- lm(sales ~ TV, data = advertising)
summary(model1)
Regression diagnostic plots
# Residuals vs fitted values
plot(model1, pch=16, col="blue", lty=1, lwd=2, which=1)
# Normal Q-Q
plot(model1, pch=16, col="blue", lty=1, lwd=2, which=2)
# Cook's distance
plot(model1, pch=16, col="blue", lty=1, lwd=2, which=4)
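The third diagnostic plot (`which=4`) shows Cook's distance, a per-observation measure of how much the fitted model changes when that observation is left out. The numbers behind the plot can be inspected directly. This sketch assumes the built-in `mtcars` data rather than the advertising data. The cut-off of 4/n is a common rule of thumb, not something prescribed by the original.

```r
fit <- lm(mpg ~ wt, data = mtcars)

cd <- cooks.distance(fit)           # one value per observation, named by row

threshold   <- 4 / nrow(mtcars)     # rule-of-thumb cut-off (an assumption)
influential <- cd[cd > threshold]   # observations worth a closer look

sort(influential, decreasing = TRUE)
```

Observations flagged this way are candidates for inspection, not automatic removal. Whether to drop them depends on whether they are data errors or genuine extreme cases.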
Model improvement: flag the outliers
advertising$outlier = ifelse(advertising$sales>40,"Y","N") # flag Yes/No depending on whether the point is an outlier
ggplot(data=advertising, aes(x=TV, y=sales, col=as.factor(outlier)))+
  geom_point()+
  ggtitle("TV ads and sales")+
  scale_y_continuous(limits=c(0,50))+
  scale_x_continuous(limits=c(0,300))+
  theme(plot.title = element_text(hjust = 0.5))
Remove the outliers and create a new data set
ad_new<-subset(advertising, sales<40)
Re-estimate the model on the new data
model2 <- lm(sales ~ TV, data = ad_new)
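Compare the new model with the old one: the new model fits clearly better. Note that AIC and BIC are only strictly comparable between models fitted to the same observations, so the comparison below is indicative because model2 uses fewer rows.
summary(model2)
summary(model1)
AIC(model2)
AIC(model1)

BIC(model2)
BIC(model1)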
set.seed(100) # set the seed so that the same result can be reproduced
trainingRowIndex <- sample(1:nrow(ad_new), 0.7*nrow(ad_new))
# split the data 70/30: training vs. testing
train <- ad_new[trainingRowIndex, ] # model training data
test <- ad_new[-trainingRowIndex, ] # test data
Fit the model on the training data, then apply it to the test data to make predictions
modTrain <- lm(sales ~ TV, data=train)  # build the model
predict <- predict(modTrain, test)  # predicted values
summary(modTrain)
Put the predicted and actual values side by side (for the test data) and compute the correlation coefficient to assess the model
act_pred <- data.frame(cbind(actuals=test$sales, predicteds=predict)) # actual vs. predicted values
cor(act_pred) # correlation accuracy
Compare the actual and predicted values directly
head(act_pred, n=10)
We can go one step further with cross-validation. The model obtained from the 70/30 train/test split above performs well, but is that just a lucky draw? Randomly split the data into k equal parts, train on k-1 of them and test on the remaining one. This gives k sets of results; check how much the predictions differ between them. In the k-fold cross-validation plot, check whether the fitted lines of the folds are roughly parallel.
library(DAAG)   # CVlm() comes from the DAAG package

kfold <- CVlm(data = ad_new, form.lm = formula(sales ~ TV), m=5,
                   dots = FALSE, seed=123, legend.pos="topleft",
                   main="Cross Validation; k=5",
                   plotit=TRUE, printit=FALSE)
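The k-fold procedure described above can also be written out in base R, which makes each step explicit. The sketch below is a hand-rolled illustration and does not reproduce `CVlm()` internals. It assumes the built-in `mtcars` data as a stand-in for the advertising data, and all names are illustrative.

```r
set.seed(123)                                  # reproducible fold assignment
k <- 5
n <- nrow(mtcars)

# Assign every row to one of k folds of (nearly) equal size, in random order
fold_id <- sample(rep(seq_len(k), length.out = n))

fold_mse <- numeric(k)
for (i in seq_len(k)) {
  train <- mtcars[fold_id != i, ]              # k - 1 folds for training
  test  <- mtcars[fold_id == i, ]              # the remaining fold for testing
  mod   <- lm(mpg ~ wt, data = train)
  pred  <- predict(mod, newdata = test)
  fold_mse[i] <- mean((test$mpg - pred)^2)     # mean squared prediction error
}

fold_mse          # one error estimate per fold
mean(fold_mse)    # overall cross-validated error
```

If the per-fold errors are similar, the model's performance does not hinge on one particular split. Large differences between folds suggest the result from a single train/test split may have been a matter of chance.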