General Linear Model

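Yi = β1X1i + β2X2i + ... + βpXpi + εi (setting X1i = 1 for all i gives an intercept term)
Least squares chooses the coefficients that minimise the sum of squared residuals: Σi (Yi - β1X1i - β2X2i - ... - βpXpi)²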
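
As a minimal sketch of what "minimising the sum of squares" means in practice, the coefficients can be obtained by solving the normal equations (X'X)β = X'y and compared with `lm()`. The simulated data, the true coefficients (2 and 3), and the names `X`, `beta_hat`, `beta_lm` and `rss` are assumptions made for illustration only.

```r
# Simulated data; the true coefficients 2 and 3 are illustrative assumptions
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 * x1 + 3 * x2 + rnorm(n, sd = 0.1)

# Design matrix with one column per regressor (no intercept column here)
X <- cbind(x1 = x1, x2 = x2)

# Least squares solution of the normal equations (X'X) beta = X'y
beta_hat <- drop(solve(crossprod(X), crossprod(X, y)))

# The same estimates from lm() with the intercept suppressed
beta_lm <- coef(lm(y ~ x1 + x2 - 1))

# Sum of squared residuals for a candidate coefficient vector b
rss <- function(b) sum((y - X %*% b)^2)

print(beta_hat)
print(beta_lm)
```

Moving `beta_hat` in any direction, for example `rss(beta_hat + c(0.01, 0))`, should give a larger sum of squares than `rss(beta_hat)`.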

The estimate of the coefficient of a regressor in a multivariable regression model is a regression through the origin, computed after the linear relationship with the other regressors has been removed from both that regressor and the outcome by taking residuals.

x1<-rnorm(100)
x2<-rnorm(100)
x3<-rnorm(100)
y<-1+2*x1+3*x2+4*x3+rnorm(100,sd=0.1)
fit<-lm(y ~ x1+x2+x3)
summary(fit)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.210169 -0.063073 -0.001123  0.066284  0.273272 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.013149   0.009560   106.0   <2e-16 ***
## x1          1.986485   0.010075   197.2   <2e-16 ***
## x2          2.995667   0.010577   283.2   <2e-16 ***
## x3          4.005139   0.009805   408.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09516 on 96 degrees of freedom
## Multiple R-squared:  0.9997, Adjusted R-squared:  0.9996 
## F-statistic: 9.291e+04 on 3 and 96 DF,  p-value: < 2.2e-16
# Regress x2 and x3 out of the outcome y and keep the residuals
ey<-resid(lm(y ~ x2+x3))
# Regress x2 and x3 out of the regressor x1 and keep the residuals
ex1<-resid(lm(x1 ~ x2+x3))

# Method 1: regression-through-the-origin estimate of the x1 coefficient, closed form
beta1<-sum(ey*ex1)/sum(ex1^2)
cat("Coeff of x1: ",beta1)
## Coeff of x1:  1.986485
# Method 2: the same estimate from lm() with the intercept removed
beta1<-coef(lm(ey ~ ex1 -1))
cat("Coeff of x1: ",beta1)
## Coeff of x1:  1.986485

Interpretation of Coefficients:

In multivariable regression, βi is the expected change in the response when regressor Xi increases by 1 unit while the other regressors are held fixed.

library(datasets)
library(ggplot2)
str(swiss)
## 'data.frame':    47 obs. of  6 variables:
##  $ Fertility       : num  80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
##  $ Agriculture     : num  17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
##  $ Examination     : int  15 6 5 12 17 9 16 14 12 16 ...
##  $ Education       : int  12 9 5 7 15 7 7 8 7 13 ...
##  $ Catholic        : num  9.96 84.84 93.4 33.77 5.16 ...
##  $ Infant.Mortality: num  22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...
p1<- ggplot(swiss, aes(x=Agriculture, y=Fertility)) + geom_point() + geom_smooth(method = "lm")
p2<- ggplot(swiss, aes(x=Examination, y=Fertility))  + geom_point() + geom_smooth(method = "lm")
p3<- ggplot(swiss, aes(x=Education, y=Fertility))  + geom_point() + geom_smooth(method = "lm")
p4<- ggplot(swiss, aes(x=Catholic, y=Fertility))  + geom_point() + geom_smooth(method = "lm")
p5<- ggplot(swiss, aes(x=Infant.Mortality, y=Fertility))  + geom_point() + geom_smooth(method = "lm")
library(gridExtra)
# Arrange the five scatterplots in a grid with 3 columns
grid.arrange(p1, p2, p3, p4, p5, ncol=3)

fit<- lm(Fertility ~ ., data = swiss)
summary(fit)$coefficients
##                    Estimate  Std. Error   t value     Pr(>|t|)
## (Intercept)      66.9151817 10.70603759  6.250229 1.906051e-07
## Agriculture      -0.1721140  0.07030392 -2.448142 1.872715e-02
## Examination      -0.2580082  0.25387820 -1.016268 3.154617e-01
## Education        -0.8709401  0.18302860 -4.758492 2.430605e-05
## Catholic          0.1041153  0.03525785  2.952969 5.190079e-03
## Infant.Mortality  1.0770481  0.38171965  2.821568 7.335715e-03

So for every 1 percentage point increase in the share of males involved in agriculture (holding the other variables constant), Fertility is expected to decrease by about 0.17.
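
The "held fixed" interpretation can be checked directly, as a small sketch: take one row of `swiss`, raise Agriculture by 1 while leaving every other column unchanged, and compare the two predictions. The names `base`, `shifted` and `delta` are introduced here for illustration only.

```r
library(datasets)

fit <- lm(Fertility ~ ., data = swiss)

# Take one province and raise Agriculture by 1, leaving all else unchanged
base    <- swiss[1, ]
shifted <- base
shifted$Agriculture <- shifted$Agriculture + 1

# Difference in predicted Fertility between the shifted and the original row
delta <- unname(predict(fit, newdata = shifted) - predict(fit, newdata = base))

print(delta)
print(coef(fit)["Agriculture"])
```

Because the model is linear, `delta` should match the Agriculture coefficient whichever row is used as the starting point.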

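Effect of missing important regressors: If we leave out regressors that belong in the model and are correlated with the included ones, the remaining coefficient estimates are biased (confounding). Let’s remove Education and Infant.Mortality from the model and see the effect on the coefficient values: in the output below, the Agriculture coefficient moves from -0.17 to about -0.10 and the Examination coefficient from -0.26 to about -1.07.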

fit_conf<- lm(Fertility ~ Agriculture+Examination+Catholic, data = swiss)
summary(fit_conf)$coefficients
##                Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 90.86802624 8.63690935 10.520896 1.807221e-13
## Agriculture -0.09515863 0.08586671 -1.108213 2.739301e-01
## Examination -1.07034661 0.27315957 -3.918393 3.146567e-04
## Catholic     0.04240137 0.04147562  1.022320 3.123467e-01
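
The same effect can be reproduced in a simulation where the true coefficients are known. This is only a sketch under assumed values: the true coefficients (1 and 2), the 0.8 dependence between `x1` and `x2`, and the names `full` and `omitted` are all made up for illustration.

```r
set.seed(42)
n  <- 1000
x2 <- rnorm(n)
x1 <- 0.8 * x2 + rnorm(n)            # x1 is correlated with x2
y  <- 1 * x1 + 2 * x2 + rnorm(n)     # assumed true coefficients: 1 and 2

# Coefficient of x1 when x2 is included in the model
full <- coef(lm(y ~ x1 + x2))["x1"]

# Coefficient of x1 when x2 is left out: it absorbs part of the x2 effect
omitted <- coef(lm(y ~ x1))["x1"]

print(full)
print(omitted)
```

With `x2` in the model the estimate for `x1` should be close to 1. Without it, the estimate drifts well above 1, because `x1` also acts as a proxy for the omitted `x2`.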

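Effect of including unnecessary variables: it inflates the standard errors of the other coefficients and risks overfitting.
An extreme case is adding a regressor that is an exact linear combination of existing ones, e.g. the sum of two regressors. The regressors are then perfectly collinear and the extra coefficient cannot be estimated.
R's lm() handles such cases by returning NA for the coefficient of the redundant regressor.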
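
A short sketch of this behaviour on the `swiss` data follows. The column `AgriPlusEdu` and the names `swiss2` and `fit_red` are hypothetical, added only to create an exactly redundant regressor.

```r
library(datasets)

# Hypothetical redundant regressor: the exact sum of two existing columns
swiss2 <- swiss
swiss2$AgriPlusEdu <- swiss2$Agriculture + swiss2$Education

fit_red <- lm(Fertility ~ ., data = swiss2)

# The redundant column gets an NA coefficient; the others are still estimated
print(coef(fit_red))

# The rank of the fit is one less than the number of coefficients
print(fit_red$rank)
print(length(coef(fit_red)))
```

The remaining coefficients should match those of the model fitted without the extra column, since the redundant regressor carries no new information.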