Course replay 1: (https://www.cctalk.com/v/15406449969744?xh_fshareuid=87b7f11f-6ceb-c339-6ae2-1632fe5f61ea&xh_preshareid=f0b6bd9c-0d1e-4cca-9c81-56e1d163e617)

Course replay 2: (https://www.cctalk.com/v/15406498208704?xh_fshareuid=87b7f11f-6ceb-c339-6ae2-1632fe5f61ea&xh_preshareid=e5305ddb-7fff-42d3-ba04-b984d1830612)

\[y_i = a + b x_i + e_i\]

The goal of a regression model is to quantify the relationship between the variables.

## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

Interpreting the model:

  1. Residuals: Residuals measure how far the observed response values are from the values the model predicts.

\[e_i = y_i - \hat{y}_i\]

  2. Coefficient estimates: How y changes when x changes by one unit; the formulas below are verified by hand in the sketch after this list.

\[\hat{b}=\frac{\sum (x_i-\bar{x}) (y_i-\bar{y})}{\sum (x_i-\bar{x})^2}\]

\[\hat{a}= \bar{y} - \hat{b}\bar{x}\]

  3. Standard error: The standard error measures how much the coefficient estimates would vary if the model were fitted to repeated samples of data. Each sample gives a slightly different estimate, but the variation stays within a predictable range.

  4. Significance and p-value: A significance test of the model shows whether there is strong evidence of a linear relationship between the variables. Visually, this is indicated by the significance stars (***) at the end of each coefficient row. The significance level is a threshold used to distinguish real findings from those that could arise by chance alone.

For each estimated regression coefficient, the p-value Pr(>|t|) gives the probability of obtaining an estimate at least as extreme as the observed one if the true coefficient were zero. The more stars next to a p-value, the more significant the variable.

Each p-value has a hypothesis test associated with it. In linear regression, the null hypothesis is that the coefficient of a variable equals zero; the alternative hypothesis is that it does not, i.e. that a relationship exists between the independent and the dependent variable.

So, if a p-value is below the significance level (p < 0.05 is a common cutoff), the null hypothesis can be safely rejected. In the current case, the p-values are well below the 0.05 threshold, so the model is indeed statistically significant.

  5. R-squared and adjusted R-squared: For simple linear regression, R-squared is the square of the correlation between the two variables. Its value ranges from 0 to 1: a value close to 0 means the regression model explains little of the variance in the response variable, while a value close to 1 means the observed variance is well explained. In the current case, R-squared suggests the linear fit explains about 75% of the variance in the data.

\[R^2=1-\frac{var(e)}{var(y)}\]

\[R^2_{adj}=1-\frac{var(e)/(n-q)}{var(y)/(n-1)}=1-\frac{(1-R^2)(n-1)}{n-q}\]

where n is the number of observations and q the number of estimated parameters.

  6. F statistic: The F-test compares the fitted model with the intercept-only model (zero predictor variables) and indicates whether the added coefficients improve the fit. A significant result means the predictors included in the model (here, a simple regression with a single predictor) improve the model's fit, so the F statistic captures the collective effect of all predictors on the response. In this model, F = 91.38 is far greater than 1.

A summary of the model evaluation criteria:

| Criterion | What it tells you |
|---|---|
| Residuals | How far the observations fall from the fitted values |
| Coefficient estimates | Direction and size of each effect |
| Standard error | Uncertainty of each coefficient estimate |
| p-value | Whether an effect is statistically significant |
| R-squared / adjusted R-squared | Share of variance explained |
| F statistic | Whether the model beats the intercept-only model |

Correlation coefficient analysis

Remember to install the package on your own machine first: install.packages("corrplot")

View the first few rows of the data:

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Correlation coefficients between all the variables:

cor(mtcars)
##             mpg        cyl       disp         hp        drat         wt
## mpg   1.0000000 -0.8521620 -0.8475514 -0.7761684  0.68117191 -0.8676594
## cyl  -0.8521620  1.0000000  0.9020329  0.8324475 -0.69993811  0.7824958
## disp -0.8475514  0.9020329  1.0000000  0.7909486 -0.71021393  0.8879799
## hp   -0.7761684  0.8324475  0.7909486  1.0000000 -0.44875912  0.6587479
## drat  0.6811719 -0.6999381 -0.7102139 -0.4487591  1.00000000 -0.7124406
## wt   -0.8676594  0.7824958  0.8879799  0.6587479 -0.71244065  1.0000000
## qsec  0.4186840 -0.5912421 -0.4336979 -0.7082234  0.09120476 -0.1747159
## vs    0.6640389 -0.8108118 -0.7104159 -0.7230967  0.44027846 -0.5549157
## am    0.5998324 -0.5226070 -0.5912270 -0.2432043  0.71271113 -0.6924953
## gear  0.4802848 -0.4926866 -0.5555692 -0.1257043  0.69961013 -0.5832870
## carb -0.5509251  0.5269883  0.3949769  0.7498125 -0.09078980  0.4276059
##             qsec         vs          am       gear        carb
## mpg   0.41868403  0.6640389  0.59983243  0.4802848 -0.55092507
## cyl  -0.59124207 -0.8108118 -0.52260705 -0.4926866  0.52698829
## disp -0.43369788 -0.7104159 -0.59122704 -0.5555692  0.39497686
## hp   -0.70822339 -0.7230967 -0.24320426 -0.1257043  0.74981247
## drat  0.09120476  0.4402785  0.71271113  0.6996101 -0.09078980
## wt   -0.17471588 -0.5549157 -0.69249526 -0.5832870  0.42760594
## qsec  1.00000000  0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs    0.74453544  1.0000000  0.16834512  0.2060233 -0.56960714
## am   -0.22986086  0.1683451  1.00000000  0.7940588  0.05753435
## gear -0.21268223  0.2060233  0.79405876  1.0000000  0.27407284
## carb -0.65624923 -0.5696071  0.05753435  0.2740728  1.00000000

Save the correlation matrix as M:

library(corrplot)
## corrplot 0.84 loaded
M<-cor(mtcars)
head(round(M,2))
##        mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
## mpg   1.00 -0.85 -0.85 -0.78  0.68 -0.87  0.42  0.66  0.60  0.48 -0.55
## cyl  -0.85  1.00  0.90  0.83 -0.70  0.78 -0.59 -0.81 -0.52 -0.49  0.53
## disp -0.85  0.90  1.00  0.79 -0.71  0.89 -0.43 -0.71 -0.59 -0.56  0.39
## hp   -0.78  0.83  0.79  1.00 -0.45  0.66 -0.71 -0.72 -0.24 -0.13  0.75
## drat  0.68 -0.70 -0.71 -0.45  1.00 -0.71  0.09  0.44  0.71  0.70 -0.09
## wt   -0.87  0.78  0.89  0.66 -0.71  1.00 -0.17 -0.55 -0.69 -0.58  0.43

corrplot offers several methods for displaying correlations: "circle", "pie", "color", "number". The general form is:

corrplot(corr, method="circle")
library(corrplot)
M<-cor(mtcars)
corrplot(M, method="circle")

library(corrplot)
M<-cor(mtcars)
corrplot(M, method="pie")

library(corrplot)
M<-cor(mtcars)
corrplot(M, method="color")

library(corrplot)
M<-cor(mtcars)
corrplot(M, method="number")

The type argument restricts the plot to the upper or lower triangle of the matrix:

library(corrplot)
M<-cor(mtcars)
corrplot(M, type="upper")

library(corrplot)
M<-cor(mtcars)
corrplot(M, type="lower")

The colors can be customized with RColorBrewer palettes:

library(corrplot)
library(RColorBrewer)

M<-cor(mtcars)
corrplot(M, type="upper", 
         col=brewer.pal(n=8, name="RdBu"))

library(corrplot)
library(RColorBrewer)

corrplot(M, type="upper",
         col=brewer.pal(n=8, name="RdYlBu"))

library(corrplot)
library(RColorBrewer)

corrplot(M, type="upper",
         col=brewer.pal(n=8, name="PuOr"))

Change the label text color and rotation:

library(corrplot)
library(RColorBrewer)

corrplot(M, type="upper", tl.col="black", tl.srt=45)

Change the background color:

library(corrplot)
M<-cor(mtcars)
# Change background color to lightblue
corrplot(M, type="upper", col=c("black", "white"),
         bg="lightblue")

Use the function below to compute p-values for the correlations and check which ones are significant:

# mat : a matrix of data
# ... : further arguments to pass to the native R cor.test function
cor.mtest <- function(mat, ...) {
  mat <- as.matrix(mat)
  n <- ncol(mat)
  p.mat <- matrix(NA, n, n)
  diag(p.mat) <- 0
  # Run cor.test on every pair of columns and store the p-value symmetrically
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      tmp <- cor.test(mat[, i], mat[, j], ...)
      p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
    }
  }
  colnames(p.mat) <- rownames(p.mat) <- colnames(mat)
  p.mat
}
# matrix of the p-value of the correlation
p.mat <- cor.mtest(mtcars)
head(p.mat[, 1:5])
##               mpg          cyl         disp           hp         drat
## mpg  0.000000e+00 6.112687e-10 9.380327e-10 1.787835e-07 1.776240e-05
## cyl  6.112687e-10 0.000000e+00 1.802838e-12 3.477861e-09 8.244636e-06
## disp 9.380327e-10 1.802838e-12 0.000000e+00 7.142679e-08 5.282022e-06
## hp   1.787835e-07 3.477861e-09 7.142679e-08 0.000000e+00 9.988772e-03
## drat 1.776240e-05 8.244636e-06 5.282022e-06 9.988772e-03 0.000000e+00
## wt   1.293959e-10 1.217567e-07 1.222320e-11 4.145827e-05 4.784260e-06
library(corrplot)
M<-cor(mtcars)

# Mark correlations that are insignificant at the given significance level
corrplot(M, type="upper",
         p.mat = p.mat, sig.level = 0.01)

library(corrplot)
M<-cor(mtcars)

# Leave cells blank where the correlation is not significant
corrplot(M, type="upper",
         p.mat = p.mat, sig.level = 0.01, insig = "blank")

library(corrplot)
M<-cor(mtcars)

col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
corrplot(M, method="color", col=col(200),
         type="upper",
         addCoef.col = "black",      # Add the correlation coefficients
         tl.col="black", tl.srt=45,  # Text label color and rotation
         # Combine with the significance test
         p.mat = p.mat, sig.level = 0.01, insig = "blank",
         # Hide the correlation coefficients on the principal diagonal
         diag=FALSE
         )

So that's how to draw an eye-catching correlation matrix with the corrplot() function. Did you get all that?

Here's one more trick to finish it off! install.packages("PerformanceAnalytics")

library("PerformanceAnalytics")
## Loading required package: xts
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
## 
##     legend
my_data <- mtcars[, c(1,3,4,5,6,7)]  # mpg, disp, hp, drat, wt, qsec
chart.Correlation(my_data, histogram=TRUE, pch=19)

Regression analysis

Check for outliers with boxplots:

Generally, any data point that lies more than 1.5 × IQR beyond the quartiles is considered an outlier, where the IQR (interquartile range) is the distance between the 25th and 75th percentile values of that variable; a numeric sketch follows the boxplots below.

boxplot(mtcars)

par(mfrow=c(1, 2))  # divide the graph area into 2 columns
boxplot(mtcars$wt, main="Weight") 
boxplot(mtcars$hp, main="Horsepower") 

Linear regression: let's work through the results of the two models below.

library(ggplot2)
ggplot(data = mtcars, aes(x = wt, y = hp)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)

model_1 <- lm(hp~wt,data=mtcars)
summary(model_1)
## 
## Call:
## lm(formula = hp ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -83.430 -33.596 -13.587   7.913 172.030 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -1.821     32.325  -0.056    0.955    
## wt            46.160      9.625   4.796 4.15e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 52.44 on 30 degrees of freedom
## Multiple R-squared:  0.4339, Adjusted R-squared:  0.4151 
## F-statistic:    23 on 1 and 30 DF,  p-value: 4.146e-05
fitted.values(model_1)
##           Mazda RX4       Mazda RX4 Wag          Datsun 710 
##           119.11841           130.88922           105.27039 
##      Hornet 4 Drive   Hornet Sportabout             Valiant 
##           146.58364           156.96965           157.89285 
##          Duster 360           Merc 240D            Merc 230 
##           162.97046           145.42964           143.58324 
##            Merc 280           Merc 280C          Merc 450SE 
##           156.96965           156.96965           186.05048 
##          Merc 450SL         Merc 450SLC  Cadillac Fleetwood 
##           170.35607           172.66407           240.51934 
## Lincoln Continental   Chrysler Imperial            Fiat 128 
##           248.55119           244.90455            99.73119 
##         Honda Civic      Toyota Corolla       Toyota Corona 
##            72.72756            82.88277           111.96360 
##    Dodge Challenger         AMC Javelin          Camaro Z28 
##           160.66246           156.73885           175.43367 
##    Pontiac Firebird           Fiat X1-9       Porsche 914-2 
##           175.66447            87.49878            96.96159 
##        Lotus Europa      Ford Pantera L        Ferrari Dino 
##            68.01923           144.50644           126.04242 
##       Maserati Bora          Volvo 142E 
##           162.97046           126.50402
residuals(model_1)
##           Mazda RX4       Mazda RX4 Wag          Datsun 710 
##          -9.1184100         -20.8892228         -12.2703949 
##      Hornet 4 Drive   Hornet Sportabout             Valiant 
##         -36.5836399          18.0303488         -52.8928522 
##          Duster 360           Merc 240D            Merc 230 
##          82.0295423         -83.4296386         -48.5832366 
##            Merc 280           Merc 280C          Merc 450SE 
##         -33.9696512         -33.9696512          -6.0504829 
##          Merc 450SL         Merc 450SLC  Cadillac Fleetwood 
##           9.6439342           7.3359317         -35.5193422 
## Lincoln Continental   Chrysler Imperial            Fiat 128 
##         -33.5511910         -14.9045470         -33.7311889 
##         Honda Civic      Toyota Corolla       Toyota Corona 
##         -20.7275594         -17.8827705         -14.9636022 
##    Dodge Challenger         AMC Javelin          Camaro Z28 
##         -10.6624552          -6.7388509          69.5663287 
##    Pontiac Firebird           Fiat X1-9       Porsche 914-2 
##          -0.6644716         -21.4987755          -5.9615858 
##        Lotus Europa      Ford Pantera L        Ferrari Dino 
##          44.9807657         119.4935624          48.9575825 
##       Maserati Bora          Volvo 142E 
##         172.0295423         -17.5040180
# Scatterplot with regression line
ggplot(data = airquality, aes(x = Temp, y = Wind)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)

model_2 <- lm(Wind~Temp,data=airquality)
summary(model_2)
## 
## Call:
## lm(formula = Wind ~ Temp, data = airquality)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.5784 -2.4489 -0.2261  1.9853  9.7398 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 23.23369    2.11239  10.999  < 2e-16 ***
## Temp        -0.17046    0.02693  -6.331 2.64e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.142 on 151 degrees of freedom
## Multiple R-squared:  0.2098, Adjusted R-squared:  0.2045 
## F-statistic: 40.08 on 1 and 151 DF,  p-value: 2.642e-09
fitted.values(model_2)
##         1         2         3         4         5         6         7 
## 11.812571 10.960248 10.619319 12.664893 13.687679 11.983035 12.153499 
##         8         9        10        11        12        13        14 
## 13.176286 12.835357 11.471642 10.619319 11.471642 11.983035 11.642106 
##        15        16        17        18        19        20        21 
## 13.346751 12.323964 11.983035 13.517215 11.642106 12.664893 13.176286 
##        22        23        24        25        26        27        28 
## 10.789784 12.835357 12.835357 13.517215 13.346751 13.517215 11.812571 
##        29        30        31        32        33        34        35 
##  9.426068  9.766997 10.278391  9.937462 10.619319 11.812571  8.914675 
##        36        37        38        39        40        41        42 
##  8.744211  9.766997  9.255604  8.403282  7.891888  8.403282  7.380495 
##        43        44        45        46        47        48        49 
##  7.550960  9.255604  9.596533  9.766997 10.107926 10.960248 12.153499 
##        50        51        52        53        54        55        56 
## 10.789784 10.278391 10.107926 10.278391 10.278391 10.278391 10.448855 
##        57        58        59        60        61        62        63 
##  9.937462 10.789784  9.596533 10.107926  9.085139  8.914675  8.744211 
##        64        65        66        67        68        69        70 
##  9.426068  8.914675  9.085139  9.085139  8.232817  7.550960  7.550960 
##        71        72        73        74        75        76        77 
##  8.062353  9.255604 10.789784  9.426068  7.721424  9.596533  9.426068 
##        78        79        80        81        82        83        84 
##  9.255604  8.914675  8.403282  8.744211 10.619319  9.426068  9.255604 
##        85        86        87        88        89        90        91 
##  8.573746  8.744211  9.255604  8.573746  8.232817  8.573746  9.085139 
##        92        93        94        95        96        97        98 
##  9.426068  9.426068  9.426068  9.255604  8.573746  8.744211  8.403282 
##        99       100       101       102       103       104       105 
##  8.062353  7.891888  7.891888  7.550960  8.573746  8.573746  9.255604 
##       106       107       108       109       110       111       112 
##  9.596533  9.766997 10.107926  9.766997 10.278391  9.937462  9.937462 
##       113       114       115       116       117       118       119 
## 10.107926 10.960248 10.448855  9.766997  9.426068  8.573746  8.232817 
##       120       121       122       123       124       125       126 
##  6.698637  7.210031  6.869102  7.210031  7.721424  7.550960  7.380495 
##       127       128       129       130       131       132       133 
##  7.380495  8.403282  8.914675  9.596533  9.937462 10.448855 10.789784 
##       134       135       136       137       138       139       140 
##  9.426068 10.278391 10.107926 11.130713 11.130713  9.937462 11.812571 
##       141       142       143       144       145       146       147 
## 10.278391 11.642106  9.255604 12.323964 11.130713  9.426068 11.471642 
##       148       149       150       151       152       153 
## 12.494428 11.301177 10.107926 10.448855 10.278391 11.642106
residuals(model_2)
##           1           2           3           4           5           6 
## -4.41257055 -2.96024835  1.98068054 -1.16489276  0.61232059  2.91696501 
##           7           8           9          10          11          12 
## -3.55349944  0.62371392  7.26464280 -2.87164167 -3.71931946 -1.77164167 
##          13          14          15          16          17          18 
## -2.78303499 -0.74210611 -0.14675052 -0.82396388  0.01696501  4.88278503 
##          19          20          21          22          23          24 
## -0.14210611 -2.96489276 -3.47628608  5.81021609 -3.13535720 -0.83535720 
##          25          26          27          28          29          30 
##  3.08278503  1.55324948 -5.51721497  0.18742945  5.47393162 -4.06699726 
##          31          32          33          34          35          36 
## -2.87839058 -1.33746170 -0.91931946  4.28742945  0.28532495 -0.14421061 
##          37          38          39          40          41          42 
##  4.53300274  0.44439607 -1.50328173  5.90811159  3.09671827  3.51950492 
##          43          44          45          46          47          48 
##  1.64904048 -1.25560393  4.20346718  1.73300274  4.79207386  9.73975165 
##          49          50          51          52          53          54 
## -2.95349944  0.71021609  0.02160942 -3.80792614 -8.57839058 -5.67839058 
##          55          56          57          58          59          60 
## -3.97839058 -2.44885502 -1.93746170 -0.48978391  1.90346718  4.79207386 
##          61          62          63          64          65          66 
## -1.08513949 -4.81467505  0.45578939 -0.22606838  1.98532495 -4.48513949 
##          67          68          69          70          71          72 
##  1.81486051 -3.13281729 -1.25095952 -1.85095952 -0.66235285 -0.65560393 
##          73          74          75          76          77          78 
##  3.51021609  5.47393162  7.17857604  4.70346718 -2.52606838  1.04439607 
##          79          80          81          82          83          84 
## -2.61467505 -3.30328173  2.75578939 -3.71931946  0.27393162  2.24439607 
##          85          86          87          88          89          90 
##  0.02625383 -0.74421061 -0.65560393  3.42625383 -0.83281729 -1.17374617 
##          91          92          93          94          95          96 
## -1.68513949 -0.22606838 -2.52606838  4.37393162 -1.85560393 -1.67374617 
##          97          98          99         100         101         102 
## -1.34421061 -3.80328173 -4.06235285  2.40811159  0.10811159  1.04904048 
##         103         104         105         106         107         108 
##  2.92625383  2.92625383  2.24439607  0.10346718  1.73300274  0.19207386 
##         109         110         111         112         113         114 
## -3.46699726 -2.87839058  0.96253830  0.36253830  5.39207386  3.33975165 
##         115         116         117         118         119         120 
##  2.15114498 -0.06699726 -6.02606838 -0.57374617 -2.53281729  3.00136268 
##         121         122         123         124         125         126 
## -4.91003064 -0.56910176 -0.91003064 -0.82142396 -2.45095952 -4.58049508 
##         127         128         129         130         131         132 
## -2.78049508 -1.00328173  6.58532495  1.30346718  0.36253830  0.45114498 
##         133         134         135         136         137         138 
## -1.08978391  5.47393162  5.22160942 -3.80792614 -0.23071279  0.36928721 
##         139         140         141         142         143         144 
## -3.03746170  1.98742945  0.02160942 -1.34210611 -1.25560393  0.27603612 
##         145         146         147         148         149         150 
## -1.93071279  0.87393162 -1.17164167  4.10557168 -4.40117723  3.09207386 
##         151         152         153 
##  3.85114498 -2.27839058 -0.14210611

We can use the final model to make predictions on new data; this is the foundation of machine learning.