\[y_i= a + b*x_i+e_i\]
关键系数的大小
关键系数的显著度
模型的解释力
模型的预测力
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5432 -2.3647 -0.1252 1.4096 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
## wt -5.3445 0.5591 -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
理解模型:
\[e_i = y_i-\hat{y_i}\]
\[\hat{b}=\frac{\sum (x_i-\bar{x}) (y_i-\bar{y})}{\sum (x_i-\bar{x})^2}\]
\[\hat{a}= \bar{y} - \hat{b}\bar{x}\]
系数估计的标准差(Standard error): Standard error measures how the coefficient estimates can vary from the actual average value of the response variable (i.e. if the model is run more times). 每一次数据不同,结果都会有差异,但是该差异在一定范围内。
显著度与P值 (Significance and P value): Test of significance of the model shows that there is strong evidence of a linear relationship between the variables. This is visually interpreted by the significance stars *** at the end of the row. This level is a threshold that is used to indicate real findings, and not the ones by chance alone.
For each estimated regression coefficient, the variable’s p-Value Pr(>|t|) provides an estimate of the probability that the true coefficient is zero given the value of the estimate. More the number of stars near the p-Value are, more significant is the variable.
With the presence of the p-value, there is a test of hypothesis associated with it. In Linear Regression, the Null Hypothesis is that the coefficient associated with the variables is equal to zero. Instead, the alternative hypothesis is that the coefficient is not equal to zero and then exists a relationship between the independent variable and the dependent variable.
So, if p-values are less than significance level (typically, a p-value < 0.05 is a good cut-off point), null hypothesis can be safely rejected. In the current case, p-values are well below the 0.05 threshold, so the model is indeed statistically significant.
\[R^2=1-\frac{var(e)}{var(y)}\] \[R^2_{adj}=1-\frac{var(e)/(n-q)}{var(y)/(n-1)}=1-\frac{(1-R^2)(n-1)}{n-q}\]
6.F statistic Basically, F-test compares the model with zero predictor variables (the intercept only mod1el), and suggests whether the added coefficients improve the model. If a significant result is obtained, then the coefficients (in the current case, being a simple regression, only one predictor is entered) included in the model improve the model’s fit. So, F statistic defines the collective effect of all predictor variables on the response variable. In this model, F=119.8 is far greater than 1.
模型评估标准汇总:
记得在自己本地电脑上安装install.packages("corrplot")
查看数据前几名:
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
所有变量间的相关系数:
cor(mtcars)
## mpg cyl disp hp drat wt
## mpg 1.0000000 -0.8521620 -0.8475514 -0.7761684 0.68117191 -0.8676594
## cyl -0.8521620 1.0000000 0.9020329 0.8324475 -0.69993811 0.7824958
## disp -0.8475514 0.9020329 1.0000000 0.7909486 -0.71021393 0.8879799
## hp -0.7761684 0.8324475 0.7909486 1.0000000 -0.44875912 0.6587479
## drat 0.6811719 -0.6999381 -0.7102139 -0.4487591 1.00000000 -0.7124406
## wt -0.8676594 0.7824958 0.8879799 0.6587479 -0.71244065 1.0000000
## qsec 0.4186840 -0.5912421 -0.4336979 -0.7082234 0.09120476 -0.1747159
## vs 0.6640389 -0.8108118 -0.7104159 -0.7230967 0.44027846 -0.5549157
## am 0.5998324 -0.5226070 -0.5912270 -0.2432043 0.71271113 -0.6924953
## gear 0.4802848 -0.4926866 -0.5555692 -0.1257043 0.69961013 -0.5832870
## carb -0.5509251 0.5269883 0.3949769 0.7498125 -0.09078980 0.4276059
## qsec vs am gear carb
## mpg 0.41868403 0.6640389 0.59983243 0.4802848 -0.55092507
## cyl -0.59124207 -0.8108118 -0.52260705 -0.4926866 0.52698829
## disp -0.43369788 -0.7104159 -0.59122704 -0.5555692 0.39497686
## hp -0.70822339 -0.7230967 -0.24320426 -0.1257043 0.74981247
## drat 0.09120476 0.4402785 0.71271113 0.6996101 -0.09078980
## wt -0.17471588 -0.5549157 -0.69249526 -0.5832870 0.42760594
## qsec 1.00000000 0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs 0.74453544 1.0000000 0.16834512 0.2060233 -0.56960714
## am -0.22986086 0.1683451 1.00000000 0.7940588 0.05753435
## gear -0.21268223 0.2060233 0.79405876 1.0000000 0.27407284
## carb -0.65624923 -0.5696071 0.05753435 0.2740728 1.00000000
保存所有相关系数为M
library(corrplot)
## corrplot 0.84 loaded
M<-cor(mtcars)
head(round(M,2))
## mpg cyl disp hp drat wt qsec vs am gear carb
## mpg 1.00 -0.85 -0.85 -0.78 0.68 -0.87 0.42 0.66 0.60 0.48 -0.55
## cyl -0.85 1.00 0.90 0.83 -0.70 0.78 -0.59 -0.81 -0.52 -0.49 0.53
## disp -0.85 0.90 1.00 0.79 -0.71 0.89 -0.43 -0.71 -0.59 -0.56 0.39
## hp -0.78 0.83 0.79 1.00 -0.45 0.66 -0.71 -0.72 -0.24 -0.13 0.75
## drat 0.68 -0.70 -0.71 -0.45 1.00 -0.71 0.09 0.44 0.71 0.70 -0.09
## wt -0.87 0.78 0.89 0.66 -0.71 1.00 -0.17 -0.55 -0.69 -0.58 0.43
corrplot主要有数种展现相关性的方法:“circle”,“pie”, “color”, “number”。 一般表达式为:
corrplot(corr, method="circle")
library(corrplot)
M<-cor(mtcars)
corrplot(M, method="circle")
library(corrplot)
M<-cor(mtcars)
corrplot(M, method="pie")
library(corrplot)
M<-cor(mtcars)
corrplot(M, method="color")
library(corrplot)
M<-cor(mtcars)
corrplot(M, method="number")
library(corrplot)
M<-cor(mtcars)
corrplot(M, type="upper")
library(corrplot)
M<-cor(mtcars)
corrplot(M, type="lower")
library(corrplot)
library(RColorBrewer)
M<-cor(mtcars)
corrplot(M, type="upper",
col=brewer.pal(n=8, name="RdBu"))
library(corrplot)
library(RColorBrewer)
corrplot(M, type="upper",
col=brewer.pal(n=8, name="RdYlBu"))
library(corrplot)
library(RColorBrewer)
corrplot(M, type="upper",
col=brewer.pal(n=8, name="PuOr"))
改字体颜色以及字体方向:
library(corrplot)
library(RColorBrewer)
corrplot(M, type="upper", tl.col="black", tl.srt=45)
改变背景颜色:
library(corrplot)
M<-cor(mtcars)
# Change background color to lightblue
corrplot(M, type="upper", col=c("black", "white"),
bg="lightblue")
使用下面的公式,计算相关性的P值,是否显著相关:
# mat : is a matrix of data
# ... : further arguments to pass to the native R cor.test function
cor.mtest <- function(mat, ...) {
mat <- as.matrix(mat)
n <- ncol(mat)
p.mat<- matrix(NA, n, n)
diag(p.mat) <- 0
for (i in 1:(n - 1)) {
for (j in (i + 1):n) {
tmp <- cor.test(mat[, i], mat[, j], ...)
p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
}
}
colnames(p.mat) <- rownames(p.mat) <- colnames(mat)
p.mat
}
# matrix of the p-value of the correlation
p.mat <- cor.mtest(mtcars)
head(p.mat[, 1:5])
## mpg cyl disp hp drat
## mpg 0.000000e+00 6.112687e-10 9.380327e-10 1.787835e-07 1.776240e-05
## cyl 6.112687e-10 0.000000e+00 1.802838e-12 3.477861e-09 8.244636e-06
## disp 9.380327e-10 1.802838e-12 0.000000e+00 7.142679e-08 5.282022e-06
## hp 1.787835e-07 3.477861e-09 7.142679e-08 0.000000e+00 9.988772e-03
## drat 1.776240e-05 8.244636e-06 5.282022e-06 9.988772e-03 0.000000e+00
## wt 1.293959e-10 1.217567e-07 1.222320e-11 4.145827e-05 4.784260e-06
library(corrplot)
M<-cor(mtcars)
# Specialized the insignificant value according to the significant level
corrplot(M, type="upper",
p.mat = p.mat, sig.level = 0.01)
library(corrplot)
M<-cor(mtcars)
# Leave blank on no significant coefficient
corrplot(M, type="upper",
p.mat = p.mat, sig.level = 0.01, insig = "blank")
library(corrplot)
M<-cor(mtcars)
col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
corrplot(M, method="color", col=col(200),
type="upper",
addCoef.col = "black", # Add coefficient of correlation
tl.col="black", tl.srt=45, #Text label color and rotation
# Combine with significance
p.mat = p.mat, sig.level = 0.01, insig = "blank",
# hide correlation coefficient on the principal diagonal
diag=FALSE
)
使用 corrplot()
函数 作出炫酷的相关性矩阵 correlation matrix,你学会了吗?
再教你一招,来补一刀! install.packages("PerformanceAnalytics")
library("PerformanceAnalytics")
## Loading required package: xts
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
##
## legend
my_data <- mtcars[, c(1,3,4,5,6,7)]
chart.Correlation(my_data, histogram=TRUE, pch=19)
用箱图检查异常值:
Generally, any datapoint that lies outside the 1.5 X interquartile-range (1.5 X IQR) is considered an outlier, where, IQR is calculated as the distance between the 25th percentile and 75th percentile values for that variable.
boxplot(mtcars)
par(mfrow=c(1, 2)) # divide graph area in 2 columns
boxplot(mtcars$wt, main="Weigt")
boxplot(mtcars$hp, main="horse power")
线性回归,一起来理解一下下面两个模型的结果:
library(ggplot2)
ggplot(data = mtcars, aes(x = wt, y = hp)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
model_1 <- lm(hp~wt,data=mtcars)
summary(model_1)
##
## Call:
## lm(formula = hp ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -83.430 -33.596 -13.587 7.913 172.030
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.821 32.325 -0.056 0.955
## wt 46.160 9.625 4.796 4.15e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 52.44 on 30 degrees of freedom
## Multiple R-squared: 0.4339, Adjusted R-squared: 0.4151
## F-statistic: 23 on 1 and 30 DF, p-value: 4.146e-05
fitted.values(model_1)
## Mazda RX4 Mazda RX4 Wag Datsun 710
## 119.11841 130.88922 105.27039
## Hornet 4 Drive Hornet Sportabout Valiant
## 146.58364 156.96965 157.89285
## Duster 360 Merc 240D Merc 230
## 162.97046 145.42964 143.58324
## Merc 280 Merc 280C Merc 450SE
## 156.96965 156.96965 186.05048
## Merc 450SL Merc 450SLC Cadillac Fleetwood
## 170.35607 172.66407 240.51934
## Lincoln Continental Chrysler Imperial Fiat 128
## 248.55119 244.90455 99.73119
## Honda Civic Toyota Corolla Toyota Corona
## 72.72756 82.88277 111.96360
## Dodge Challenger AMC Javelin Camaro Z28
## 160.66246 156.73885 175.43367
## Pontiac Firebird Fiat X1-9 Porsche 914-2
## 175.66447 87.49878 96.96159
## Lotus Europa Ford Pantera L Ferrari Dino
## 68.01923 144.50644 126.04242
## Maserati Bora Volvo 142E
## 162.97046 126.50402
residuals(model_1)
## Mazda RX4 Mazda RX4 Wag Datsun 710
## -9.1184100 -20.8892228 -12.2703949
## Hornet 4 Drive Hornet Sportabout Valiant
## -36.5836399 18.0303488 -52.8928522
## Duster 360 Merc 240D Merc 230
## 82.0295423 -83.4296386 -48.5832366
## Merc 280 Merc 280C Merc 450SE
## -33.9696512 -33.9696512 -6.0504829
## Merc 450SL Merc 450SLC Cadillac Fleetwood
## 9.6439342 7.3359317 -35.5193422
## Lincoln Continental Chrysler Imperial Fiat 128
## -33.5511910 -14.9045470 -33.7311889
## Honda Civic Toyota Corolla Toyota Corona
## -20.7275594 -17.8827705 -14.9636022
## Dodge Challenger AMC Javelin Camaro Z28
## -10.6624552 -6.7388509 69.5663287
## Pontiac Firebird Fiat X1-9 Porsche 914-2
## -0.6644716 -21.4987755 -5.9615858
## Lotus Europa Ford Pantera L Ferrari Dino
## 44.9807657 119.4935624 48.9575825
## Maserati Bora Volvo 142E
## 172.0295423 -17.5040180
# Scatterplot with regression line
ggplot(data = airquality, aes(x = Temp, y = Wind)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
model_2 <- lm(Wind~Temp,data=airquality)
summary(model_2)
##
## Call:
## lm(formula = Wind ~ Temp, data = airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.5784 -2.4489 -0.2261 1.9853 9.7398
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.23369 2.11239 10.999 < 2e-16 ***
## Temp -0.17046 0.02693 -6.331 2.64e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.142 on 151 degrees of freedom
## Multiple R-squared: 0.2098, Adjusted R-squared: 0.2045
## F-statistic: 40.08 on 1 and 151 DF, p-value: 2.642e-09
fitted.values(model_2)
## 1 2 3 4 5 6 7
## 11.812571 10.960248 10.619319 12.664893 13.687679 11.983035 12.153499
## 8 9 10 11 12 13 14
## 13.176286 12.835357 11.471642 10.619319 11.471642 11.983035 11.642106
## 15 16 17 18 19 20 21
## 13.346751 12.323964 11.983035 13.517215 11.642106 12.664893 13.176286
## 22 23 24 25 26 27 28
## 10.789784 12.835357 12.835357 13.517215 13.346751 13.517215 11.812571
## 29 30 31 32 33 34 35
## 9.426068 9.766997 10.278391 9.937462 10.619319 11.812571 8.914675
## 36 37 38 39 40 41 42
## 8.744211 9.766997 9.255604 8.403282 7.891888 8.403282 7.380495
## 43 44 45 46 47 48 49
## 7.550960 9.255604 9.596533 9.766997 10.107926 10.960248 12.153499
## 50 51 52 53 54 55 56
## 10.789784 10.278391 10.107926 10.278391 10.278391 10.278391 10.448855
## 57 58 59 60 61 62 63
## 9.937462 10.789784 9.596533 10.107926 9.085139 8.914675 8.744211
## 64 65 66 67 68 69 70
## 9.426068 8.914675 9.085139 9.085139 8.232817 7.550960 7.550960
## 71 72 73 74 75 76 77
## 8.062353 9.255604 10.789784 9.426068 7.721424 9.596533 9.426068
## 78 79 80 81 82 83 84
## 9.255604 8.914675 8.403282 8.744211 10.619319 9.426068 9.255604
## 85 86 87 88 89 90 91
## 8.573746 8.744211 9.255604 8.573746 8.232817 8.573746 9.085139
## 92 93 94 95 96 97 98
## 9.426068 9.426068 9.426068 9.255604 8.573746 8.744211 8.403282
## 99 100 101 102 103 104 105
## 8.062353 7.891888 7.891888 7.550960 8.573746 8.573746 9.255604
## 106 107 108 109 110 111 112
## 9.596533 9.766997 10.107926 9.766997 10.278391 9.937462 9.937462
## 113 114 115 116 117 118 119
## 10.107926 10.960248 10.448855 9.766997 9.426068 8.573746 8.232817
## 120 121 122 123 124 125 126
## 6.698637 7.210031 6.869102 7.210031 7.721424 7.550960 7.380495
## 127 128 129 130 131 132 133
## 7.380495 8.403282 8.914675 9.596533 9.937462 10.448855 10.789784
## 134 135 136 137 138 139 140
## 9.426068 10.278391 10.107926 11.130713 11.130713 9.937462 11.812571
## 141 142 143 144 145 146 147
## 10.278391 11.642106 9.255604 12.323964 11.130713 9.426068 11.471642
## 148 149 150 151 152 153
## 12.494428 11.301177 10.107926 10.448855 10.278391 11.642106
residuals(model_2)
## 1 2 3 4 5 6
## -4.41257055 -2.96024835 1.98068054 -1.16489276 0.61232059 2.91696501
## 7 8 9 10 11 12
## -3.55349944 0.62371392 7.26464280 -2.87164167 -3.71931946 -1.77164167
## 13 14 15 16 17 18
## -2.78303499 -0.74210611 -0.14675052 -0.82396388 0.01696501 4.88278503
## 19 20 21 22 23 24
## -0.14210611 -2.96489276 -3.47628608 5.81021609 -3.13535720 -0.83535720
## 25 26 27 28 29 30
## 3.08278503 1.55324948 -5.51721497 0.18742945 5.47393162 -4.06699726
## 31 32 33 34 35 36
## -2.87839058 -1.33746170 -0.91931946 4.28742945 0.28532495 -0.14421061
## 37 38 39 40 41 42
## 4.53300274 0.44439607 -1.50328173 5.90811159 3.09671827 3.51950492
## 43 44 45 46 47 48
## 1.64904048 -1.25560393 4.20346718 1.73300274 4.79207386 9.73975165
## 49 50 51 52 53 54
## -2.95349944 0.71021609 0.02160942 -3.80792614 -8.57839058 -5.67839058
## 55 56 57 58 59 60
## -3.97839058 -2.44885502 -1.93746170 -0.48978391 1.90346718 4.79207386
## 61 62 63 64 65 66
## -1.08513949 -4.81467505 0.45578939 -0.22606838 1.98532495 -4.48513949
## 67 68 69 70 71 72
## 1.81486051 -3.13281729 -1.25095952 -1.85095952 -0.66235285 -0.65560393
## 73 74 75 76 77 78
## 3.51021609 5.47393162 7.17857604 4.70346718 -2.52606838 1.04439607
## 79 80 81 82 83 84
## -2.61467505 -3.30328173 2.75578939 -3.71931946 0.27393162 2.24439607
## 85 86 87 88 89 90
## 0.02625383 -0.74421061 -0.65560393 3.42625383 -0.83281729 -1.17374617
## 91 92 93 94 95 96
## -1.68513949 -0.22606838 -2.52606838 4.37393162 -1.85560393 -1.67374617
## 97 98 99 100 101 102
## -1.34421061 -3.80328173 -4.06235285 2.40811159 0.10811159 1.04904048
## 103 104 105 106 107 108
## 2.92625383 2.92625383 2.24439607 0.10346718 1.73300274 0.19207386
## 109 110 111 112 113 114
## -3.46699726 -2.87839058 0.96253830 0.36253830 5.39207386 3.33975165
## 115 116 117 118 119 120
## 2.15114498 -0.06699726 -6.02606838 -0.57374617 -2.53281729 3.00136268
## 121 122 123 124 125 126
## -4.91003064 -0.56910176 -0.91003064 -0.82142396 -2.45095952 -4.58049508
## 127 128 129 130 131 132
## -2.78049508 -1.00328173 6.58532495 1.30346718 0.36253830 0.45114498
## 133 134 135 136 137 138
## -1.08978391 5.47393162 5.22160942 -3.80792614 -0.23071279 0.36928721
## 139 140 141 142 143 144
## -3.03746170 1.98742945 0.02160942 -1.34210611 -1.25560393 0.27603612
## 145 146 147 148 149 150
## -1.93071279 0.87393162 -1.17164167 4.10557168 -4.40117723 3.09207386
## 151 152 153
## 3.85114498 -2.27839058 -0.14210611
我们可以将最终得到的模型用于新的数据来做预测,这是机器学习的基础。