Econometrics is about learning the characteristics of a population from a sample.
plot(cars) #Cross-Section Data
plot(AirPassengers)#Time-Series Data
Expected Value: E(X) = \(\sum_{i=1}^n p_ix_i\)
Variance: Var(X) = \(E[(X-\mu)^2]\) = \(E[X^2] - \mu^2\)
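As a quick numerical check of these two definitions, here is a minimal sketch in R using a made-up discrete random variable (the values and probabilities below are purely illustrative):
x <- c(1, 2, 3, 4)            # possible values of X (hypothetical)
p <- c(0.1, 0.2, 0.3, 0.4)    # their probabilities (hypothetical)
mu <- sum(p * x)              # E(X) = sum of p_i * x_i
sum(p * (x - mu)^2)           # Var(X) = E[(X - mu)^2]
sum(p * x^2) - mu^2           # same value via E[X^2] - mu^2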
Population Mean of X: \(E(X_i) = \mu_X\)
Random component in observation i: \(u_i = X_i - \mu_X\) and \(X_i = u_i + \mu_X\)
The variance of X is the same as that of u: \(\sigma^2_X = \sigma^2_u = E(u^2)\)
Two variables X and Y are independent iff \(E[f(X)g(Y)] = E[f(X)]\,E[g(Y)]\) for any functions f and g.
If X and Y are independent, \(E(XY) = E(X)E(Y)\)
\(cov(X,Y) = \sigma_{XY}= E[(X - \mu_X)(Y - \mu_Y)]\)
If X and Y are independent, \(cov(X,Y) = 0\)
Covariance Rules
1. if \(Y = V + W\), then \(cov(X,Y) = cov(X,V) + cov(X,W)\)
2. if \(Y = bZ\), then \(cov(X,Y) = b\,cov(X,Z)\)
3. if \(Y = b\), then \(cov(X,Y) = 0\)
4. if \(Y = W+ b\), then \(cov(X,Y) = cov(X,W)\)
Var(X) = \(E[(X-\mu)^2]\) = \(E[X^2] - \mu^2\)
Variance Rules
1. if \(Y = V + W\), then \(Var(Y) = Var(V) + Var(W) + 2\,cov(V,W)\)
2. if \(Y = bZ\), then \(Var(Y) = b^2Var(Z)\)
3. if \(Y = b\), then \(Var(Y) = 0\)
4. if \(Y = W+ b\), then \(Var(Y) = Var(W)\)
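These covariance and variance rules can be checked numerically; below is a minimal simulation sketch (the variables X, V, W and the constant b are made up), where each pair of calls on a line should print essentially the same value:
set.seed(1)
n <- 1e5
X <- rnorm(n); V <- rnorm(n); W <- 0.5 * X + rnorm(n); b <- 3
cov(X, V + W); cov(X, V) + cov(X, W)            # covariance rule 1
cov(X, b * W); b * cov(X, W)                    # covariance rule 2
cov(X, W + b); cov(X, W)                        # covariance rule 4
var(V + W); var(V) + var(W) + 2 * cov(V, W)     # variance rule 1
var(b * W); b^2 * var(W)                        # variance rule 2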
Correlation coefficient: \(\rho_{XY} = \frac{\sigma_{XY}}{\sqrt{\sigma_X^2\sigma_Y^2}}\)
This is a better measure of association as it is dimensionless.
The correlation coefficient tells us whether two variables are associated, but it does not describe the form of the relationship between them.
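For example, with the cars data used below, the sample correlation can be computed either with cor() or directly from the definition; both lines should print the same value (about 0.81):
cor(cars$speed, cars$dist)                                            # correlation coefficient
cov(cars$speed, cars$dist) / sqrt(var(cars$speed) * var(cars$dist))   # same value from the definition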
\(Y_i = \beta_0 + \beta_1X_i + u_i\), where \(u_i\) is the random part or the error term.
\(Y_i\) is the dependent variable and \(X_i\) is the independent or explanatory variable. The above hypothetical equation is the regression model. \(\beta_0\) and \(\beta_1\) are fixed quantities known as parameters of the equation.
Why does the error term appear?
* Omission of other explanatory variable(s) - other relevant variables (e.g. \(X_2\)) are not included in the model
* Aggregation of variables - NEED TO READ MORE
* Model mis-specification or functional misspecification - maybe Y depends not on X but on \(X^2\), or it is a time series where Y depends on its own previous value.
* Measurement Error
* Purely random
* Linear approximation - NEED TO READ MORE
\(\hat{Y} = b_0 + b_1X\) is the fitted model that estimates the actual equation, with \(b_0\) and \(b_1\) as estimates of \(\beta_0\) and \(\beta_1\).
model1 <- lm(data = cars, speed~dist)
summary(model1)
##
## Call:
## lm(formula = speed ~ dist, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.5293 -2.1550 0.3615 2.4377 6.4179
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.28391 0.87438 9.474 1.44e-12 ***
## dist 0.16557 0.01749 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.156 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
Here the fitted equation is speed = 8.28391 + 0.16557*dist.
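As a quick sketch of using the fitted model, the prediction at an arbitrary distance (50 is just an illustrative value) can be obtained either from predict() or from the equation above, and the two should agree up to rounding:
predict(model1, newdata = data.frame(dist = 50))   # fitted speed at dist = 50
8.28391 + 0.16557 * 50                             # same value from the fitted equation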
IMPORTANT: One thing to accept is that you can never discover the true values of \(\beta_0\) and \(\beta_1\).
The residual for each observation is defined as \(e_i = Y_i - \hat{Y}_i\).
Since many candidate lines exist, the best estimate of the actual line is the equation with the smallest Residual Sum of Squares (RSS).
RSS = \(\sum_{i=1}^ne_i^2 = e_1^2 + e_2^2 + ... + e_n^2\)
In the output above, the residual standard error is 3.156 on 48 degrees of freedom, so RSS = \(3.156^2 \times 48 \approx 478.1\) (note that 3.156 itself is the residual standard error, not the RSS).
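The RSS can also be read off the fitted model directly; assuming model1 from above, both lines below should give roughly 478:
sum(residuals(model1)^2)   # residual sum of squares
deviance(model1)           # same quantity, as reported by R for a linear model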
General case with n observations and two variables X and Y, where Y depends on X. We will fit the following equation:
\(\hat{Y}_i = b_0 + b_1X_i\)
Minimising RSS = \(\sum_{i=1}^ne_i^2\) with respect to \(b_0\) and \(b_1\) gives
\(b_1 = \frac{\sum_{i=1}^n(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^n(X_i - \bar{X})^2}\)
\(b_0 = \bar{Y} - b_1\bar{X}\)
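As a sketch, these two formulas can be applied directly to the cars data and compared with the coefficients from lm() (the variable names x, y, b1, b0 are mine):
x <- cars$dist; y <- cars$speed
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
c(b0, b1)        # should match the lm() estimates
coef(model1)     # intercept and slope from lm()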
TSS = ESS + RSS, where TSS is the Total Sum of Squares, ESS is the Explained Sum of Squares and RSS is the Residual Sum of Squares.
The aim of Regression Analysis is to explain the variation of the dependent variable Y.
The total variation in Y is \(\sum_{i=1}^n(Y_i - \bar{Y})^2\). Now, \(Y_i = \hat{Y}_i + e_i\), so
\(\sum_{i=1}^n(Y_i - \bar{Y})^2 = \sum_{i=1}^n(\hat{Y_i} - \bar{Y})^2 + \sum_{i=1}^ne_i^2\)
TSS = ESS + RSS, ESS = \(\sum_{i=1}^n(\hat{Y_i}- \bar{Y})^2\)
Here "explained" should really be read as "apparently explained".
Now, \(R^2\) is defined as ESS/TSS, the proportion of the total variation that is (apparently) explained. The more of the variation in Y the fitted model explains, the higher the value of \(R^2\).
Alternatively, \(R^2 = 1 - \frac{RSS}{TSS}\).
model1 <- lm(data = cars, speed~dist)
summary(model1)
##
## Call:
## lm(formula = speed ~ dist, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.5293 -2.1550 0.3615 2.4377 6.4179
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.28391 0.87438 9.474 1.44e-12 ***
## dist 0.16557 0.01749 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.156 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
Here RSS \(\approx\) 478.1 (computed above from the residual standard error) and \(R^2\) = 0.6511, so by the above definition,
\(TSS = \frac{RSS}{(1-R^2)}\), so TSS \(\approx\) 1370.3.
ESS = TSS - RSS \(\approx\) 1370.3 - 478.1 = 892.2.
Now, \(R^2 = \frac{ESS}{TSS}\) = \(\frac{892.2}{1370.3}\) \(\approx\) 0.6511.
So, we can say that the fitted model apparently explains about 65.11% of the variation in Y.
Also, the correlation coefficient between the actual Y and the fitted \(\hat{Y}\) is the square root of \(R^2\), i.e. \(R^2 = r_{Y,\hat{Y}}^2\).
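Both the decomposition and this relation can be verified on the fitted model; given the figures above, TSS, ESS and RSS should come out at roughly 1370, 892 and 478, and the last two lines should both print about 0.6511:
y <- cars$speed
TSS <- sum((y - mean(y))^2)                 # total sum of squares
ESS <- sum((fitted(model1) - mean(y))^2)    # explained sum of squares
RSS <- sum(residuals(model1)^2)             # residual sum of squares
c(TSS, ESS, RSS)                            # TSS = ESS + RSS
ESS / TSS                                   # R-squared
cor(y, fitted(model1))^2                    # squared correlation of Y and Y-hat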
Example of a linear model: \(Y = \beta_0 + \beta_1X + u\)
Examples of non-linear models:
\(Y = \beta_1X^{\beta_2} + u\)
\(Y = \beta_0 + \beta_1\beta_2X + u\)
The disturbance term u has zero expectation, i.e. \(E(u_i) = 0\) for all i (Unbiasedness).
There should be some variation in X (trivial).
The variance of the random error u is the same \(\sigma^2\) for all observations (homoskedasticity; needed for efficiency).
\(Var(u) = E[(u - E(u))^2] = E(u^2)\), since \(E(u) = 0\).
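A minimal simulation sketch (all parameter values below are made up) of a disturbance term that satisfies these assumptions, together with the regression it generates:
set.seed(42)
n <- 1e5
x <- runif(n, 0, 10)              # some variation in X
u <- rnorm(n, mean = 0, sd = 2)   # E(u) = 0, Var(u) = sigma^2 = 4, same for all i
y <- 1 + 0.5 * x + u              # true model with beta0 = 1, beta1 = 0.5
mean(u); var(u)                   # should be close to 0 and 4
coef(lm(y ~ x))                   # estimates should be close to 1 and 0.5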