Econometrics is about learning the characteristics of a population from a sample.


1. Types of Data

  1. Cross-Section Data - \(X_i\) and \(Y_i\). Data is collected in one go; each individual contributes one data point.
  2. Time-Series Data - \(X_t\) and \(Y_t\). Data collected at regular time intervals for a single individual or item, e.g. GDP data, stock return data, etc.
  3. Panel Data - \(X_{it}\), \(Y_{it}\). Data collected for several individuals over several time periods.
plot(cars) #Cross-Section Data

plot(AirPassengers)#Time-Series Data


2. Important Formulas

Expected Value: E(X) = \(\sum_{i=1}^n p_ix_i\)
Variance: Var(X) = \(E[(X-\mu)^2]\) = \(E[X^2] - \mu^2\)
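
As a quick numerical illustration of both formulas, here is a minimal R sketch using a hypothetical discrete distribution (a fair six-sided die); the names x, p, EX and VarX are just illustrative.

x <- 1:6                    # possible values of X (fair die example)
p <- rep(1/6, 6)            # probabilities p_i
EX   <- sum(p * x)          # E(X) = sum of p_i * x_i = 3.5
VarX <- sum(p * x^2) - EX^2 # Var(X) = E[X^2] - mu^2, about 2.917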

2.1. Fixed and Random Component of a random variable

Population Mean of X: \(E(X_i) = \mu_X\)
Random component in observation i: \(u_i = X_i - \mu_X\), so \(X_i = \mu_X + u_i\)
Variance of X is the same as that of u: \(\sigma^2_X = \sigma^2_u = E(u_i^2)\)
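
A small simulation sketch of this decomposition (the values of \(\mu_X\), the error standard deviation and the sample size below are arbitrary choices):

set.seed(1)
mu_X <- 10
u <- rnorm(1e5, mean = 0, sd = 2) # random component with E(u) = 0
X <- mu_X + u                     # X_i = mu_X + u_i
mean(X)                           # close to 10
var(X)                            # identical to var(u), close to 4
var(u)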

2.2. Independence of two random variables

Two random variables X and Y are independent iff \(E[f(X)g(Y)] = E[f(X)]E[g(Y)]\) for any functions f and g.

If X and Y are independent, \(E(XY) = E(X)E(Y)\)
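
A quick simulation check of this property (the draws below are generated independently by construction):

set.seed(2)
x <- rnorm(1e5)
y <- rnorm(1e5)   # generated independently of x
mean(x * y)       # approximately equal to mean(x) * mean(y)
mean(x) * mean(y)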

2.3. Covariance

\(cov(X,Y) = \sigma_{XY}= E[(X - \mu_X)(Y - \mu_Y)]\)

If X and Y are independent, \(cov(X,Y) = 0\)

Covariance Rules
1. if \(Y = V + W\), then \(cov(X,Y) = cov(X,V) + cov(X,W)\)
2. if \(Y = bZ\), then \(cov(X,Y) = b\,cov(X,Z)\)
3. if \(Y = b\), then \(cov(X,Y) = 0\)
4. if \(Y = W + b\), then \(cov(X,Y) = cov(X,W)\)
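
These rules can be verified numerically, since sample covariances obey the same linearity; the simulated data below are only illustrative.

set.seed(3)
x <- rnorm(100); v <- rnorm(100); w <- rnorm(100); z <- rnorm(100)
b <- 5
cov(x, v + w) - (cov(x, v) + cov(x, w)) # rule 1: zero up to rounding
cov(x, b * z) - b * cov(x, z)           # rule 2: zero up to rounding
cov(x, rep(b, 100))                     # rule 3: exactly 0, a constant does not co-vary
cov(x, w + b) - cov(x, w)               # rule 4: zero up to rounding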

2.4. Variance

Var(X) = \(E[(X-\mu)^2]\) = \(E[X^2] - \mu^2\)

Variance Rules
1. if \(Y = V + W\), then \(Var(Y) = Var(V) + Var(W) + 2\,cov(V,W)\)
2. if \(Y = bZ\), then \(Var(Y) = b^2Var(Z)\)
3. if \(Y = b\), then \(Var(Y) = 0\)
4. if \(Y = W + b\), then \(Var(Y) = Var(W)\)
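
The same kind of numerical check works for the variance rules (again with purely illustrative simulated data):

set.seed(4)
v <- rnorm(100); w <- rnorm(100); z <- rnorm(100)
b <- 5
var(v + w) - (var(v) + var(w) + 2 * cov(v, w)) # rule 1: zero up to rounding
var(b * z) - b^2 * var(z)                      # rule 2: zero up to rounding
var(rep(b, 100))                               # rule 3: exactly 0
var(w + b) - var(w)                            # rule 4: zero up to rounding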

2.5. Correlation

\(\rho_{XY} = \frac{\sigma_{XY}}{\sqrt{\sigma_X^2\sigma_Y^2}}\)
This is a better measure of association than covariance, as it is dimensionless (its value does not depend on the units of X and Y).
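
In R, cor() computes this directly; the sketch below (with made-up data) also shows that rescaling X, e.g. changing its units, leaves the correlation unchanged, which is what being dimensionless buys us.

set.seed(5)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
cov(x, y) / sqrt(var(x) * var(y)) # correlation from the formula
cor(x, y)                         # same value
cor(1000 * x, y)                  # unchanged when x is rescaled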


3. Simple Regression

The correlation coefficient tells us whether two variables are associated, but it does not describe the form of the relationship between them.

\(Y_i = \beta_0 + \beta_1X_i + u_i\), where \(u_i\) is the random part or the error term.
\(Y_i\) is the dependent variable and \(X_i\) is the independent or explanatory variable. The above hypothetical equation is the regression model. \(\beta_0\) and \(\beta_1\) are fixed quantities known as parameters of the equation.

Why does the error term appear?
* Omission of other explanatory variable(s) - other relevant variables (e.g. an \(X_2\)) are left out of the model.
* Aggregation of variables - NEED TO READ MORE
* Model mis-specification or Functional Misspecification - maybe Y depends not on X but on \(X^2\), or it is a time series where Y depends on its own previous value.
* Measurement Error
* Purely random
* Linear approximation - NEED TO READ MORE

3.1. Fitting the Model

\(\hat{Y} = b_0 + b_1X\) is the fitted model that estimates the actual equation.

model1 <- lm(data = cars, speed~dist)
summary(model1)
## 
## Call:
## lm(formula = speed ~ dist, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.5293 -2.1550  0.3615  2.4377  6.4179 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.28391    0.87438   9.474 1.44e-12 ***
## dist         0.16557    0.01749   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.156 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Here speed = 8.28391 + 0.16557*dist
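
The fitted equation can be used directly for prediction; for example, for a hypothetical stopping distance of 50:

coef(model1)                                     # b0 = 8.28391, b1 = 0.16557
predict(model1, newdata = data.frame(dist = 50)) # 8.28391 + 0.16557*50, about 16.56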

IMPORTANT: One thing to accept is that you can never discover the true values of \(\beta_0\) and \(\beta_1\).

The residual for each observation is defined as \(Y_i - \hat{Y_i}\), i.e. \(e_i = Y_i - \hat{Y_i}\)
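
In R the residuals of the fitted model are available directly, and they match the definition above:

head(residuals(model1))           # e_i from the fitted model
head(cars$speed - fitted(model1)) # same thing, Y_i minus the fitted value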

Many candidate lines can be drawn through the data, so the best estimate of the actual line is the one with the smallest Residual Sum of Squares.

3.2. Residual Sum of Squares (RSS)

RSS = \(\sum_{i=1}^ne_i^2 = e_1^2 + e_2^2 + ... + e_n^2\)

In the above output, the value 3.156 is the residual standard error, \(\sqrt{RSS/df}\), not the RSS itself; with 48 degrees of freedom, RSS = \(3.156^2 \times 48 \approx 478\).
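
The RSS can also be recovered directly from the fitted model:

sum(residuals(model1)^2)    # RSS, about 478
deviance(model1)            # same quantity
sqrt(deviance(model1) / 48) # residual standard error, about 3.156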

3.3. Derivation of regression coefficient

  • Given a dataset, define RSS for the fitted line \(\hat{Y_i} = b_0 + b_1X_i\).
  • Choose \(b_0\) and \(b_1\) to minimize RSS.
  • To find \(b_0\), set \(\frac{\partial{RSS}}{\partial{b_0}} = 0\).
  • To find \(b_1\), set \(\frac{\partial{RSS}}{\partial{b_1}} = 0\).

3.4. Ordinary Least Square Regression (OLS Regression)

General case with n observations and two variables X and Y, where Y depends on X. We will fit the following equation:

\({\hat{Y} = b_0 + b_1X_i}\)

RSS = \(\sum_{i=1}^ne_i^2\)

\(b_1 = \frac{\sum_{i=1}^n(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^n(X_i - \bar{X})^2}\)

\(b_0 = \bar{Y} - b_1\bar{X}\)
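
Applying these two formulas to the cars data used above reproduces the coefficients reported by lm():

x <- cars$dist
y <- cars$speed
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
c(b0, b1) # matches coef(lm(speed ~ dist, data = cars)): 8.28391 and 0.16557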

3.5. Goodness of fit: \(R^2\)

TSS = ESS + RSS, where TSS is the Total Sum of Squares, ESS is the Explained Sum of Squares and RSS is the Residual Sum of Squares.

The aim of Regression Analysis is to explain the variation of the dependent variable Y.

The total variation of Y is \(\sum_{i=1}^n(Y_i - \bar{Y})^2\) (the TSS). Now, \(Y_i = \hat{Y_i} + e_i\), and this decomposition gives

\(\sum_{i=1}^n(Y_i - \bar{Y})^2 = \sum_{i=1}^n(\hat{Y_i} - \bar{Y})^2 + \sum_{i=1}^ne_i^2\)
TSS = ESS + RSS, ESS = \(\sum_{i=1}^n(\hat{Y_i}- \bar{Y})^2\)

Here “explained” should really be read as “apparently explained”.

Now, \(R^2\) is defined as ESS/TSS, the proportion of the total variation that is explained. The more of the variation the model explains, the higher the value of \(R^2\).

Alternatively, \(R^2 = 1 - \frac{RSS}{TSS}\).

model1 <- lm(data = cars, speed~dist)
summary(model1)
## 
## Call:
## lm(formula = speed ~ dist, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.5293 -2.1550  0.3615  2.4377  6.4179 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.28391    0.87438   9.474 1.44e-12 ***
## dist         0.16557    0.01749   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.156 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Here \(R^2\) = 0.6511 and, from the residual standard error, RSS = \(3.156^2 \times 48 \approx 478.1\). By the above definition,
\(TSS = \frac{RSS}{(1-R^2)}\), so TSS ≈ 1370.3.

ESS = TSS - RSS ≈ 1370.3 - 478.1 = 892.2.

Now, \(R^2 = \frac{ESS}{TSS}\) ≈ \(\frac{892.2}{1370.3}\) = 0.6511.

So, we can effectively say that the model apparently explains 65.11% of the variation in Y.

Also, the correlation coefficient between the actual Y and the fitted \(\hat{Y}\) is the square root of \(R^2\), i.e. \(R^2 = r_{Y,\hat{Y}}^2\).
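
This can be checked on the model fitted above:

cor(cars$speed, fitted(model1))^2 # 0.6511, the Multiple R-squared from summary(model1)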


4. Assumptions of Linear Regression Model

  1. The model is linear in parameters and correctly specified. (Correctness)

E.g. of linear model - \(Y = \beta_0 + \beta_1X + u\)

E.g. of non-linear models:

\(Y = \beta_1X^{\beta_2} + u\)

\(Y = \beta_0 + \beta_1\beta_2X + u\)

  2. The disturbance term (u) has zero expectation, i.e. \(E(u_i)=0\) for all i. (Unbiased)

  3. There should be some variation in X (trivial).

  4. The variance of the random error u is the same \(\sigma^2\) for every observation, i.e. homoscedasticity. (Efficient)
    \(Var(u_i) = E[(u_i - E(u_i))^2] = \sigma^2\)
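
As a quick, informal check of assumptions 2 and 4 on the model fitted above (an eyeball check, not a formal test):

mean(residuals(model1))                 # essentially 0: OLS residuals sum to zero
plot(fitted(model1), residuals(model1)) # look for roughly constant spread around 0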