Yi = β1X1i + β2X2i + ... + εi
Least squares chooses the coefficients that minimise: Σ (Yi - β1X1i - β2X2i - ...)²
The estimate of a regressor's coefficient in a multivariable regression model is the regression-through-the-origin estimate obtained after the linear relationship with the other regressors has been removed, by taking residuals, from both that regressor and the outcome.
x1<-rnorm(100)
x2<-rnorm(100)
x3<-rnorm(100)
y<-1+2*x1+3*x2+4*x3+rnorm(100,sd=0.1)
fit<-lm(y ~ x1+x2+x3)
summary(fit)
##
## Call:
## lm(formula = y ~ x1 + x2 + x3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.210169 -0.063073 -0.001123 0.066284 0.273272
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.013149 0.009560 106.0 <2e-16 ***
## x1 1.986485 0.010075 197.2 <2e-16 ***
## x2 2.995667 0.010577 283.2 <2e-16 ***
## x3 4.005139 0.009805 408.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09516 on 96 degrees of freedom
## Multiple R-squared: 0.9997, Adjusted R-squared: 0.9996
## F-statistic: 9.291e+04 on 3 and 96 DF, p-value: < 2.2e-16
# Regress x2 and x3 out of the outcome y and keep the residuals
ey<-resid(lm(y ~ x2+x3))
# Regress x2 and x3 out of the regressor x1 and keep the residuals
ex1<-resid(lm(x1 ~ x2+x3))
# Regression through origin estimate of x1 - Method1
beta1<-sum(ey*ex1)/sum(ex1^2)
cat("Coeff of x1: ",beta1)
## Coeff of x1: 1.986485
# Regression through origin estimate of x1 - Method2
beta1<-coef(lm(ey ~ ex1 -1))
cat("Coeff of x1: ",beta1)
## Coeff of x1: 1.986485
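The same residual trick works for every regressor, not only x1. The following is a minimal sketch under stated assumptions: the helper `partial_coef`, the matrix `X` and the seed are introduced here for illustration and are not part of the original notes, so the simulated numbers will differ from the output shown above.

```r
# Hypothetical helper: coefficient of column j of X, obtained by
# regressing the other columns out of both y and X[, j] and then
# fitting the residuals through the origin.
partial_coef <- function(y, X, j) {
  others <- X[, -j, drop = FALSE]
  ey <- resid(lm(y ~ others))       # outcome with the other regressors removed
  ex <- resid(lm(X[, j] ~ others))  # regressor j with the other regressors removed
  sum(ey * ex) / sum(ex^2)          # regression through the origin
}

set.seed(1)  # assumed seed, only to make this sketch reproducible
n <- 100
X <- cbind(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
y <- 1 + 2 * X[, "x1"] + 3 * X[, "x2"] + 4 * X[, "x3"] + rnorm(n, sd = 0.1)

by_residuals <- sapply(seq_len(ncol(X)), function(j) partial_coef(y, X, j))
by_lm <- coef(lm(y ~ X))[-1]  # drop the intercept

# The two sets of estimates should agree up to floating-point error
print(cbind(by_residuals, by_lm))
```

Because each auxiliary regression includes an intercept, the residuals have mean zero, which is why a fit through the origin is enough in the last step.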
Interpretation of a multivariable regression coefficient (expected change in response): βi = expected change in the response when regressor Xi changes by 1 unit while the other regressors are held fixed.
library(datasets)
library(ggplot2)
str(swiss)
## 'data.frame': 47 obs. of 6 variables:
## $ Fertility : num 80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
## $ Agriculture : num 17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
## $ Examination : int 15 6 5 12 17 9 16 14 12 16 ...
## $ Education : int 12 9 5 7 15 7 7 8 7 13 ...
## $ Catholic : num 9.96 84.84 93.4 33.77 5.16 ...
## $ Infant.Mortality: num 22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...
p1<- ggplot(swiss, aes(x=Agriculture, y=Fertility)) + geom_point() + geom_smooth(method = "lm")
p2<- ggplot(swiss, aes(x=Examination, y=Fertility)) + geom_point() + geom_smooth(method = "lm")
p3<- ggplot(swiss, aes(x=Education, y=Fertility)) + geom_point() + geom_smooth(method = "lm")
p4<- ggplot(swiss, aes(x=Catholic, y=Fertility)) + geom_point() + geom_smooth(method = "lm")
p5<- ggplot(swiss, aes(x=Infant.Mortality, y=Fertility)) + geom_point() + geom_smooth(method = "lm")
library(gridExtra)
#p1
grid.arrange(p1, p2, p3, p4, p5, ncol=3)
fit<- lm(Fertility ~ ., data = swiss)
summary(fit)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.9151817 10.70603759 6.250229 1.906051e-07
## Agriculture -0.1721140 0.07030392 -2.448142 1.872715e-02
## Examination -0.2580082 0.25387820 -1.016268 3.154617e-01
## Education -0.8709401 0.18302860 -4.758492 2.430605e-05
## Catholic 0.1041153 0.03525785 2.952969 5.190079e-03
## Infant.Mortality 1.0770481 0.38171965 2.821568 7.335715e-03
So for every 1 percentage point increase in the share of males involved in agriculture (holding the other variables constant), the expected decrease in Fertility is 0.17 units.
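This interpretation can be checked directly from the fitted model. The sketch below, which uses names (`fit_all`, `base`, `bumped`) introduced only for illustration, takes one row of `swiss`, raises Agriculture by 1 while leaving every other regressor unchanged, and compares the two predictions.

```r
# Refit the full model so that this block is self-contained
fit_all <- lm(Fertility ~ ., data = swiss)

base <- swiss[1, ]                                   # any row works
bumped <- base
bumped$Agriculture <- bumped$Agriculture + 1         # change one regressor by 1 unit

# Difference in predicted Fertility with all other regressors held fixed
delta <- predict(fit_all, newdata = bumped) - predict(fit_all, newdata = base)

print(unname(delta))
print(unname(coef(fit_all)["Agriculture"]))          # the same value
```

Since the model is linear, the difference does not depend on which row is chosen as the starting point.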
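Effect of missing important regressors: If we leave out regressors that should have been in the model, the result is confounding: the coefficients of the remaining regressors can be biased because they absorb part of the effect of the omitted variables they are correlated with. Let’s remove Education and Infant.Mortality from the model and see the effect on the coefficient values.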
fit_conf<- lm(Fertility ~ Agriculture+Examination+Catholic, data = swiss)
summary(fit_conf)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 90.86802624 8.63690935 10.520896 1.807221e-13
## Agriculture -0.09515863 0.08586671 -1.108213 2.739301e-01
## Examination -1.07034661 0.27315957 -3.918393 3.146567e-04
## Catholic 0.04240137 0.04147562 1.022320 3.123467e-01
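The shift is easier to read with both sets of estimates side by side. The following is a small sketch; the names `fit_full`, `fit_reduced`, `shared` and `comparison` are introduced here for illustration only.

```r
# Full model and the model with Education and Infant.Mortality left out
fit_full <- lm(Fertility ~ ., data = swiss)
fit_reduced <- lm(Fertility ~ Agriculture + Examination + Catholic, data = swiss)

# Regressors present in both models
shared <- c("Agriculture", "Examination", "Catholic")

comparison <- cbind(
  full = coef(fit_full)[shared],
  reduced = coef(fit_reduced)[shared],
  change = coef(fit_reduced)[shared] - coef(fit_full)[shared]
)
print(comparison)
```

In the output above, the Examination coefficient moves from about -0.26 to about -1.07, while the Agriculture coefficient shrinks from about -0.17 to about -0.10 and loses its significance at the 5% level.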
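Effect of including unnecessary variables: overfitting.
E.g. adding a regressor that is the sum of two other regressors. Such a regressor is an exact linear combination of the others and carries no new information.
R detects such cases and returns NA as the coefficient for the redundant regressor.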
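A minimal sketch of this behaviour on the swiss data follows. The column `z` and the names `swiss_z` and `fit_z` are hypothetical, introduced only for this illustration.

```r
# Copy the data and add a regressor that is the sum of two existing ones
swiss_z <- swiss
swiss_z$z <- swiss_z$Agriculture + swiss_z$Education

# z is an exact linear combination of Agriculture and Education,
# so lm() cannot estimate a separate coefficient for it
fit_z <- lm(Fertility ~ ., data = swiss_z)
print(coef(fit_z))

# Only the redundant regressor is reported as NA
print(names(coef(fit_z))[is.na(coef(fit_z))])
```

Which of the collinear regressors ends up as NA depends on their order in the model; here `z` is the last column, so it is the one dropped.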