\[ y = \beta_0 +\beta_1x + u \]
\(y\): the dependent variable (explained variable, response variable, predicted variable, regressand, outcome variable)
\(x\): the explanatory variable (independent variable, control variable, predictor variable, regressor, input variable)
Also called “bivariate linear regression model”
Purpose: explain the dependent variable \(y\) in terms of the explanatory variable \(x\)
\(\beta_0\): the intercept (also called constant term); the value of \(y\) when \(x = 0\).
\(\beta_1\): the regression slope; reflects the effect of \(x\) on \(y\) when the effects of other factors in \(u\) are held constant, that is, \(\Delta u = 0\)
\[ yield = \beta_0 + \beta_1 fertilizer + u \]
Slope parameter \(\beta_1\):
\(\Delta yield = \beta_1 \Delta fertilizer\)
Ceteris paribus, one unit change in the amount of fertilizer leads to \(\beta_1\) unit change in yield.
Random error term: \(u\)
Contains the effects of factors such as soil quality, rainfall, etc., which are unobserved.
Ceteris paribus \(\Leftrightarrow\) Holding all other factors fixed \(\Leftrightarrow \Delta u = 0\)
\[ wage = \beta_0 + \beta_1 educ + u \]
where: wage: hourly wage (in pesos), educ: education level (in years)
Slope parameter \(\beta_1\):
\(\Delta wage = \beta_1 \Delta educ\)
\(\beta_1\) measures the change in hourly wage given another year of education, holding all other factors fixed (ceteris paribus)
Random error term \(u\):
Other factors including labor force experience, innate ability, tenure with current employer, gender, quality of education, marital status, number of children, etc
Any factor that may potentially affect worker productivity
The linearity of the simple regression model means that a one-unit change in \(x\) has the same effect on \(y\) regardless of the initial value of \(x\).
This is unrealistic for many economic applications.
For example, if there are increasing or decreasing returns, this model is inappropriate.
In the wage equation, an additional year of education may have a larger effect on wages than the previous year did.
We will see how to allow for such possibilities in the following classes.
The expected value of the error term \(u\) is zero
If the model includes a constant term (\(\beta_0\)) then we can assume \[ E(u) = 0 \]
This assumption is about the distribution of the unobservables \(u\): some values of \(u\) are positive and some are negative, but on average \(u\) is zero
This assumption is always guaranteed to hold by redefining \(\beta_0\)
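To see why, suppose \(E(u) = \alpha_0 \neq 0\). The model can always be rewritten as \[ y = (\beta_0 + \alpha_0) + \beta_1 x + (u - \alpha_0), \] where the new error term \(u - \alpha_0\) has mean zero and only the intercept has been redefined.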
Conditional mean of \(u\) is zero
How can we be sure that the ceteris paribus notion (which means that \(\Delta u = 0\)) is valid?
For this to hold, \(x\) and \(u\) must be uncorrelated. But since the correlation coefficient measures only the linear association between two variables, zero correlation alone is not enough.
\(u\) must also be uncorrelated with functions of \(x\) (e.g., \(x^2\), \(\sqrt{x}\), etc.)
Zero Conditional Mean assumption ensures this: \[ E(u|x) = E(u) = 0 \]
This equation says that the average value of the unobservables is the same across all slices of the population determined by the value of \(x\).
This assumption means that the average value of \(u\) does not depend on \(x\).
For the wage equation, this assumption holds if, for example, the average level of innate ability is the same across all levels of education in the population
For the agricultural productivity equation, if the amount of fertilizer is assigned to plots independent of the soil quality then the zero-conditional-mean assumption will hold
\[\begin{align} E(y|x) &= \beta_0 + \beta_1 x + E(u|x) \notag \\ &=\beta_0 + \beta_1 x,\; \text{since}\; E(u|x) = 0 \notag \end{align}\]
This is called the population regression function (PRF). It shows that the conditional expectation of the dependent variable is a linear function of \(x\).
Linearity of PRF: for a one-unit change in \(x\) conditional expectation of \(y\) changes by \(\beta_1\).
The center of the conditional distribution of \(y\) for a given value of \(x\) is \(E(y|x)\).
NOTE:
In the simple regression model, \(y = \beta_0 + \beta_1 x + u\), under \(E(u|x) = 0\), the dependent variable \(y\) can be decomposed into two parts:
Systematic part: \(\beta_0 + \beta_1 x\). This is the part of \(y\) explained by \(x\).
Unsystematic part: \(u\). This is the part of \(y\) that cannot be explained by \(x\).
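A minimal simulation sketch (made-up numbers, not from the text) illustrating the zero-conditional-mean assumption and the linear PRF: when \(u\) is generated with mean zero and independently of \(x\), the average of \(y\) within narrow slices of \(x\) tracks the line \(\beta_0 + \beta_1 x\).
# Simulate y = b0 + b1*x + u with u independent of x and E(u) = 0,
# then compare the average of y within bins of x to the population line.
set.seed(123)
n  <- 10000
b0 <- 1; b1 <- 0.5
x  <- runif(n, 0, 10)
u  <- rnorm(n, mean = 0, sd = 2)              # unobservables, independent of x
y  <- b0 + b1 * x + u
bins <- cut(x, breaks = 0:10)
cbind(binned_mean_y = tapply(y, bins, mean),  # conditional means of y within each bin
      prf           = b0 + b1 * (0:9 + 0.5))  # PRF evaluated at the bin midpoints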
How can we estimate the unknown population parameters (\(\beta_0, \beta_1\)) given a cross-sectional data set?
Suppose that we have a random sample of \(n\) observations: \(\{(y_i, x_i): i = 1, 2, 3, \cdots, n\}\)
The regression model can be written for each observation as follows:
\[ y_i = \beta_0 + \beta_1 x_i + u_i, \; i=1, 2, 3, \cdots, n \]
\[\begin{align} y_1 &= \beta_0 + \beta_1 x_1 + u_1 \notag \\ y_2 &= \beta_0 + \beta_1 x_2 + u_2 \notag \\ y_3 &= \beta_0 + \beta_1 x_3 + u_3 \notag \\ \vdots &= \vdots \notag \\ y_n &= \beta_0 + \beta_1 x_n + u_n \notag \end{align}\]
\[\begin{align} E(u) &= 0 \notag \\ Cov(x, u) &= E(xu) = 0 \notag \end{align}\]
Population moment conditions: \[\begin{align} E(y - \beta_0 - \beta_1 x) &= 0 \notag \\ E[x(y - \beta_0 - \beta_1 x)] &= 0 \notag \end{align}\]
Replacing these with their sample analogs we obtain:
\[\begin{align} \frac{1}{n} \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) &= 0 \notag \\ \frac{1}{n} \sum_{i=1}^n x_i(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) &= 0 \notag \end{align}\]
This system of two equations in two unknowns can easily be solved for \(\hat{\beta}_0\) and \(\hat{\beta}_1\) using the sample data.
Note that \(\hat{\beta}_0\) and \(\hat{\beta}_1\) have hats on them: they are estimates, not fixed population quantities, and they change as the data change.
Using properties of the summation operator, we obtain from the first sample moment condition \[ \overline{y} = \hat{\beta}_0 + \hat{\beta}_1 \overline{x} \]
where: \(\overline{y}\) and \(\overline{x}\) are sample means.
\[ \hat{\beta}_0 = \overline{y} - \hat{\beta}_1 \overline{x} \]
Substituting this expression for \(\hat{\beta}_0\) into the second sample moment condition:
\[\begin{align} \sum_{i=1}^n x_i(y_i-(\overline{y}-\hat{\beta}_1 \overline{x})- \hat{\beta}_1 x_i) &= 0 \notag \\ \Rightarrow \sum_{i=1}^n x_i(y_i - \overline{y}) &= \hat{\beta}_1 \sum_{i=1}^n x_i(x_i - \overline{x}) \notag \end{align}\]
\[ \hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \overline{x})(y_i - \overline{y})}{\sum_{i=1}^n (x_i - \overline{x})^2} \]
based on the following identities:
\[\begin{align} \sum_{i=1}^nx_i(x_i - \overline{x}) &= \sum_{i=1}^n (x_i - \overline{x})^2 \notag \\ \sum_{i=1}^nx_i(y_i - \overline{y}) &= \sum_{i=1}^n (x_i - \overline{x})(y_i - \overline{y}) \notag \end{align}\]
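These identities hold because deviations from the sample mean sum to zero; for the first one, for example, \[ \sum_{i=1}^n (x_i - \overline{x})^2 = \sum_{i=1}^n x_i(x_i - \overline{x}) - \overline{x}\sum_{i=1}^n (x_i - \overline{x}) = \sum_{i=1}^n x_i(x_i - \overline{x}), \] since \(\sum_{i=1}^n (x_i - \overline{x}) = 0\); the second identity follows in the same way.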
The slope estimator is the ratio of the sample covariance between \(x\) and \(y\) to the sample variance of \(x\).
The sign of \(\hat{\beta}_1\) depends on the sign of the sample covariance: if \(x\) and \(y\) are positively correlated in the sample, \(\hat{\beta}_1\) is positive; if they are negatively correlated, \(\hat{\beta}_1\) is negative.
To be able to calculate \(\hat{\beta}_1\), \(x\) must have enough variability:
\[ \sum_{i=1}^n (x_i - \overline{x})^2 > 0 \]
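A minimal sketch (made-up numbers, not from the lecture data) that computes \(\hat{\beta}_1\) and \(\hat{\beta}_0\) from these formulas and checks them against R's lm():
# OLS slope and intercept computed directly from the formulas above
x <- c(1, 2, 3, 4, 5)
y <- c(2.0, 4.1, 5.9, 8.2, 9.8)
b1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b1_cov <- cov(x, y) / var(x)        # same value: sample covariance over sample variance
b0_hat <- mean(y) - b1_hat * mean(x)
c(b0_hat = b0_hat, b1_hat = b1_hat, b1_cov = b1_cov)
coef(lm(y ~ x))                     # intercept and slope should match b0_hat and b1_hat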
The fitted (predicted) values are \[ \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i \]
We define the residuals as the difference between the observed and the fitted values: \[ \hat{u}_i = y_i - \hat{y}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \]
The residuals are the sample counterparts of the unobserved errors \(u_i\): they estimate, but are not identical to, the \(u_i\).
The same estimates result from the least squares principle: choose \(\hat{\beta}_0\) and \(\hat{\beta}_1\) to minimize the sum of squared residuals, \(SSE = \sum_{i=1}^n \hat{u}_i^2\):
\[ \underset{(\hat{\beta}_0, \hat{\beta}_1)}{\min} \sum_{i=1}^n \hat{u}_i^2 = \underset{(\hat{\beta}_0, \hat{\beta}_1)}{\min} \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 \]
The first-order conditions are
\[ \frac{\partial SSE}{\partial \hat{\beta}_0} = -2 \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0 \]
\[ \frac{\partial SSE}{\partial \hat{\beta}_1} = -2 \sum_{i=1}^n x_i(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0 \]
which, after dividing by \(-2n\), are exactly the sample moment conditions above; the method-of-moments and OLS estimators therefore coincide.
\[ salary = \beta_0 + \beta_1 roe + u \]
salary: annual CEO salary (1000 US$)
roe: average return on equity for the last three years (%)
library(wooldridge)
result1 <- lm(salary ~ roe, data = ceosal1)
summary(result1)
##
## Call:
## lm(formula = salary ~ roe, data = ceosal1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1160.2 -526.0 -254.0 138.8 13499.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 963.19 213.24 4.517 1.05e-05 ***
## roe 18.50 11.12 1.663 0.0978 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1367 on 207 degrees of freedom
## Multiple R-squared: 0.01319, Adjusted R-squared: 0.008421
## F-statistic: 2.767 on 1 and 207 DF, p-value: 0.09777
\[ \widehat{salary} = 963.19 + 18.50 roe \]
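As a quick check, these coefficients can be reproduced directly from the covariance/variance formula (a sketch using the same ceosal1 data loaded above):
# Reproduce the OLS estimates in the fitted equation by hand
b1_check <- cov(ceosal1$roe, ceosal1$salary) / var(ceosal1$roe)
b0_check <- mean(ceosal1$salary) - b1_check * mean(ceosal1$roe)
round(c(intercept = b0_check, slope = b1_check), 2)   # about 963.19 and 18.50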
library(ggplot2)   # for ggplot(); assumed not loaded earlier
library(dplyr)     # for the %>% pipe; assumed not loaded earlier
ceosal1 %>%
  ggplot(aes(x = roe, y = salary)) +
  geom_point() +
  scale_x_continuous(expand = c(0, 0), limits = c(0, 60)) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 5000)) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Return on equity (%)",
       y = "Salary (1000 $)") +
  theme_classic()
salaryhat <- fitted(result1)
uhat <- resid(result1)
tab1 <- cbind(ceosal1$roe, ceosal1$salary, salaryhat, uhat)
colnames(tab1) <- c("roe", "salary", "salaryhat", "uhat")
head(tab1, n=15)
## roe salary salaryhat uhat
## 1 14.1 1095 1224.058 -129.058071
## 2 10.9 1001 1164.854 -163.854261
## 3 23.5 1122 1397.969 -275.969216
## 4 5.9 578 1072.348 -494.348338
## 5 13.8 1368 1218.508 149.492288
## 6 20.0 1145 1333.215 -188.215063
## 7 16.4 1078 1266.611 -188.610785
## 8 16.3 1094 1264.761 -170.760660
## 9 10.5 1237 1157.454 79.546207
## 10 26.3 833 1449.773 -616.772523
## 11 25.9 567 1442.372 -875.372056
## 12 26.8 933 1459.023 -526.023116
## 13 14.8 1339 1237.009 101.991102
## 14 22.3 937 1375.768 -438.767778
## 15 56.3 2011 2004.808 6.191886
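The residuals shown above satisfy the two first-order conditions up to rounding error: they sum to zero and are uncorrelated with roe in the sample. A quick check:
# The OLS residuals satisfy the sample moment conditions
sum(uhat)                  # approximately zero
sum(ceosal1$roe * uhat)    # approximately zero
cor(ceosal1$roe, uhat)     # approximately zero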
Each observation can be decomposed into a fitted value and a residual: \[ y_i = \hat{y}_i + \hat{u}_i \]
Based on this decomposition, we define the following sums of squares:
SST (Total Sum of Squares): gives the total variation in \(y\) (around \(\overline{y}\)) \[ SST = \sum_{i=1}^n (y_i - \overline{y})^2 \]
SSR (Explained/Regression Sum of Squares): measures the variation in the fitted values \(\hat{y}_i\) (around \(\overline{y}\)) \[ SSR = \sum_{i=1}^n (\hat{y}_i - \overline{y})^2 \]
SSE (Residual/Error Sum of Squares): measures the sample variation in the residuals \[ SSE = \sum_{i=1}^n \hat{u}_i^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2 \]
NOTES:
Total sample variation in \(y\) can be written as \(SST = SSE + SSR\)
Recall that the sample variance of \(y\) is \(V(y) = \frac{SST}{n-1}\)
\[ R^2 = \frac{SSR}{SST} \]
Since SSE can never be larger than SST we have \(0 \leq R^2 \leq 1\)
\(R^2\) is interpreted as the fraction of the sample variation in \(y\) that is explained by \(x\). After multiplying by 100 it can be interpreted as the percentage of the sample variation in \(y\) explained by \(x\).
In SLR, \(R^2\) can also be calculated as \(R^2 = Corr(x,y)^2\)
Typically, a high value of \(R^2\) indicates that the model fits the data well
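These quantities can be computed by hand for the CEO salary regression above (a sketch using result1); the ratios should match the Multiple R-squared of about 0.013 reported by summary(result1):
# Sums of squares and R-squared for result1 computed from their definitions
y_obs <- ceosal1$salary
SST <- sum((y_obs - mean(y_obs))^2)
SSR <- sum((fitted(result1) - mean(y_obs))^2)   # explained (regression) sum of squares
SSE <- sum(resid(result1)^2)                    # residual (error) sum of squares
c(R2 = SSR / SST, check = 1 - SSE / SST, corr_sq = cor(ceosal1$roe, y_obs)^2)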
The mean square error, \[ MSE = \frac{SSE}{n-p}, \; \text{where} \; p = \text{no. of parameters in the model}, \]
is an estimate of the variance of the error term, computed from the residuals \(\hat{u}_i\); in the simple regression model, \(p = 2\).
The estimated standard deviation of the errors is \(\sqrt{MSE}\), which we call the Root Mean Square Error (RMSE); this is the "Residual standard error" reported in the R output above.
Ideally, a small RMSE (close to zero) indicates that the model fits the data well
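For the CEO salary regression, a sketch of the MSE and RMSE computed from the residuals; the RMSE should match the residual standard error of about 1367 reported by summary(result1):
# MSE and RMSE for result1, with p = 2 parameters (intercept and slope)
n_obs <- nobs(result1)
p <- 2
MSE  <- sum(resid(result1)^2) / (n_obs - p)
RMSE <- sqrt(MSE)
c(MSE = MSE, RMSE = RMSE, residual_se = sigma(result1))   # RMSE equals sigma(result1)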
Overall significance of the regression: \(H_0: \beta_i = 0 \; \forall i\) (all slope coefficients are zero) versus \(H_1: \beta_i \neq 0\) for at least one \(i\)
Test statistic: \(F = \frac{MSR}{MSE}\), where \(MSR = SSR/(p-1)\)
Significance of an individual coefficient: \(H_0: \beta_i = 0\) versus \(H_1: \beta_i \neq 0\)
Test statistic: \(t = \frac{\hat{\beta}_i}{se(\hat{\beta}_i)}\)
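For the CEO salary regression above, a sketch computing both statistics by hand; in the simple regression model the overall F statistic equals the square of the slope's t statistic:
# Slope t statistic and overall F statistic for result1
b1_roe <- coef(summary(result1))["roe", "Estimate"]
se_roe <- coef(summary(result1))["roe", "Std. Error"]
t_roe  <- b1_roe / se_roe                      # about 1.663, as in the output above
atab   <- anova(result1)
F_stat <- atab["roe", "Mean Sq"] / atab["Residuals", "Mean Sq"]
c(t = t_roe, t_squared = t_roe^2, F = F_stat)  # in SLR, F = t^2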
gpareg <- lm(colGPA ~ hsGPA,data = gpa1)
summary(gpareg)
##
## Call:
## lm(formula = colGPA ~ hsGPA, data = gpa1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.85220 -0.26274 -0.04868 0.28902 0.88551
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.41543 0.30694 4.611 8.98e-06 ***
## hsGPA 0.48243 0.08983 5.371 3.21e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.34 on 139 degrees of freedom
## Multiple R-squared: 0.1719, Adjusted R-squared: 0.1659
## F-statistic: 28.85 on 1 and 139 DF, p-value: 3.211e-07
anova(gpareg)
## Analysis of Variance Table
##
## Response: colGPA
## Df Sum Sq Mean Sq F value Pr(>F)
## hsGPA 1 3.3351 3.3351 28.845 3.211e-07 ***
## Residuals 139 16.0710 0.1156
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
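From this table, \(SSR = 3.3351\) (hsGPA row) and \(SSE = 16.0710\) (Residuals row), so \(SST = 19.4061\) and \(R^2 = 3.3351/19.4061 \approx 0.172\), matching the Multiple R-squared in the summary above; likewise \(F = MSR/MSE = 3.3351/0.1156 \approx 28.85\).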