\[ y = \beta_0 +\beta_1x + u \]
\(y\): the dependent variable (explained variable, response variable, predicted variable, regressand, outcome variable)
\(x\): the explanatory variable (independent variable, control variable, predictor variable, regressor, input variable)
Also called “bivariate linear regression model”
Purpose: explain the dependent variable \(y\) in terms of the explanatory variable \(x\)
\(\beta_0\): the intercept (also called constant term); the value of \(y\) when \(x = 0\).
\(\beta_1\): the regression slope; reflects the effect of \(x\) on \(y\) when the effects of other factors in \(u\) are held constant, that is, \(\Delta u = 0\)
\[ yield = \beta_0 + \beta_1 fertilizer + u \]
Slope parameter \(\beta_1\):
\(\Delta yield = \beta_1 \Delta fertilizer\)
Ceteris paribus, one unit change in the amount of fertilizer leads to \(\beta_1\) unit change in yield.
Random error term: \(u\)
Contains the effects of factors such as soil quality, rainfall, etc., which are unobserved.
Ceteris paribus \(\Leftrightarrow\) Holding all other factors fixed \(\Leftrightarrow \Delta u = 0\)
\[ wage = \beta_0 + \beta_1 educ + u \]
where: wage: hourly wage (in pesos), educ: education level (in years)
Slope parameter \(\beta_1\):
\(\Delta wage = \beta_1 \Delta educ\)
\(\beta_1\) measures the change in hourly wage given another year of education, holding all other factors fixed (ceteris paribus)
Random error term \(u\):
Other factors including labor force experience, innate ability, tenure with current employer, gender, quality of education, marital status, number of children, etc
Any factor that may potentially affect worker productivity
The linearity of the simple regression model means that a one-unit change in \(x\) has the same effect on \(y\) regardless of the initial value of \(x\).
This is unrealistic for many economic applications.
For example, if there are increasing or decreasing returns, this model is inappropriate.
In the wage equation, an additional year of education may have a larger effect on wages than the previous year did.
We will see how to allow for such possibilities in the following classes.
The expected value of the error term \(u\) is zero
If the model includes a constant term (\(\beta_0\)) then we can assume \[ E(u) = 0 \]
This assumption is about the distribution of the unobservables \(u\): some values of \(u\) are positive and some are negative, but on average \(u\) is zero
This assumption is always guaranteed to hold by redefining \(\beta_0\)
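To see why, suppose \(E(u) = \alpha_0 \neq 0\). The model can always be rewritten as \[ y = (\beta_0 + \alpha_0) + \beta_1 x + (u - \alpha_0), \] where the new error term \(u - \alpha_0\) has mean zero and only the intercept has been redefined.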
Conditional mean of \(u\) is zero
How can we be sure that the ceteris paribus notion (which means that \(\Delta u = 0\)) is valid?
For this to hold, \(x\) and \(u\) must be uncorrelated. But since the correlation coefficient measures only the linear association between two variables, zero correlation alone is not enough.
\(u\) must also be uncorrelated with functions of \(x\) (e.g., \(x^2\), \(\sqrt{x}\), etc.)
Zero Conditional Mean assumption ensures this: \[ E(u|x) = E(u) = 0 \]
This equation says that the average value of the unobservables is the same across all slices of the population determined by the value of \(x\).
This assumption means that the average value of \(u\) does not depend on \(x\).
For the wage equation, this assumption holds if, for example, the average level of innate ability is the same across all levels of education in the population
For the agricultural productivity equation, if the amount of fertilizer is assigned to plots independent of the soil quality then the zero-conditional-mean assumption will hold
\[\begin{align} E(y|x) &= \beta_0 + \beta_1 x + E(u|x) \notag \\ &=\beta_0 + \beta_1 x,\; \text{since}\; E(u|x) = 0 \notag \end{align}\]
This is called the population regression function (PRF). It shows that the conditional expectation of the dependent variable is a linear function of \(x\).
Linearity of PRF: for a one-unit change in \(x\) conditional expectation of \(y\) changes by \(\beta_1\).
The center of the conditional distribution of \(y\) for a given value of \(x\) is \(E(y|x)\).
NOTE:
In the simple regression model, \(y = \beta_0 + \beta_1 x + u\), under \(E(u|x) = 0\), the dependent variable \(y\) can be decomposed into two parts:
Systematic part: \(\beta_0 + \beta_1 x\). This is the part of \(y\) explained by \(x\).
Unsystematic part: \(u\). This is the part of \(y\) that cannot be explained by \(x\).
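A minimal simulation sketch (made-up numbers, not from the text) illustrating the zero-conditional-mean assumption and the linear PRF: when \(u\) is generated with mean zero and independently of \(x\), the average of \(y\) within narrow slices of \(x\) tracks the line \(\beta_0 + \beta_1 x\).
# Simulate y = b0 + b1*x + u with u independent of x and E(u) = 0,
# then compare the average of y within bins of x to the population line.
set.seed(123)
n  <- 10000
b0 <- 1; b1 <- 0.5
x  <- runif(n, 0, 10)
u  <- rnorm(n, mean = 0, sd = 2)              # unobservables, independent of x
y  <- b0 + b1 * x + u
bins <- cut(x, breaks = 0:10)
cbind(binned_mean_y = tapply(y, bins, mean),  # conditional means of y within each bin
      prf           = b0 + b1 * (0:9 + 0.5))  # PRF evaluated at the bin midpoints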
How can we estimate the unknown population parameters (\(\beta_0, \beta_1\)) given a cross-sectional data set?
Suppose that we have a random sample of \(n\) observations: \(\{(y_i, x_i): i = 1, 2, 3, \cdots, n\}\)
The regression model can be written for each observation as follows:
\[ y_i = \beta_0 + \beta_1 x_i + u_i, \; i=1, 2, 3, \cdots, n \]
\[\begin{align} y_1 &= \beta_0 + \beta_1 x_1 + u_1 \notag \\ y_2 &= \beta_0 + \beta_1 x_2 + u_2 \notag \\ y_3 &= \beta_0 + \beta_1 x_3 + u_3 \notag \\ \vdots &= \vdots \notag \\ y_n &= \beta_0 + \beta_1 x_n + u_n \notag \end{align}\]
\[\begin{align} E(u) &= 0 \notag \\ Cov(x, u) &= E(xu) = 0 \notag \end{align}\]
Population moment conditions: \[\begin{align} E(y - \beta_0 - \beta_1 x) &= 0 \notag \\ E[x(y - \beta_0 - \beta_1 x)] &= 0 \notag \end{align}\]
Replacing these with their sample analogs we obtain:
\[\begin{align} \frac{1}{n} \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) &= 0 \notag \\ \frac{1}{n} \sum_{i=1}^n x_i(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) &= 0 \notag \end{align}\]
This system of two equations in two unknowns can easily be solved for \(\hat{\beta}_0\) and \(\hat{\beta}_1\) using the sample data.
Note that \(\hat{\beta}_0\) and \(\hat{\beta}_1\) have hats on them: they are estimates, not fixed population quantities, and they change as the data change.
Using properties of the summation operator, we obtain from the first sample moment condition \[ \overline{y} = \hat{\beta}_0 + \hat{\beta}_1 \overline{x} \]
where: \(\overline{y}\) and \(\overline{x}\) are sample means.
\[ \hat{\beta}_0 = \overline{y} - \hat{\beta}_1 \overline{x} \]
Substituting this expression for \(\hat{\beta}_0\) into the second sample moment condition:
\[\begin{align} \sum_{i=1}^n x_i(y_i-(\overline{y}-\hat{\beta}_1 \overline{x})- \hat{\beta}_1 x_i) &= 0 \notag \\ \Rightarrow \sum_{i=1}^n x_i(y_i - \overline{y}) &= \hat{\beta}_1 \sum_{i=1}^n x_i(x_i - \overline{x}) \notag \end{align}\]
\[ \hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \overline{x})(y_i - \overline{y})}{\sum_{i=1}^n (x_i - \overline{x})^2} \]
based on the following identities:
\[\begin{align} \sum_{i=1}^nx_i(x_i - \overline{x}) &= \sum_{i=1}^n (x_i - \overline{x})^2 \notag \\ \sum_{i=1}^nx_i(y_i - \overline{y}) &= \sum_{i=1}^n (x_i - \overline{x})(y_i - \overline{y}) \notag \end{align}\]
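These identities hold because deviations from the sample mean sum to zero; for the first one, for example, \[ \sum_{i=1}^n (x_i - \overline{x})^2 = \sum_{i=1}^n x_i(x_i - \overline{x}) - \overline{x}\sum_{i=1}^n (x_i - \overline{x}) = \sum_{i=1}^n x_i(x_i - \overline{x}), \] since \(\sum_{i=1}^n (x_i - \overline{x}) = 0\); the second identity follows in the same way.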
The slope estimator is the ratio of the sample covariance between \(x\) and \(y\) to the sample variance of \(x\).
The sign of \(\hat{\beta}_1\) depends on the sign of the sample covariance: if \(x\) and \(y\) are positively correlated in the sample, \(\hat{\beta}_1\) is positive; if they are negatively correlated, \(\hat{\beta}_1\) is negative.
To be able to calculate \(\hat{\beta}_1\), \(x\) must have enough variability:
\[ \sum_{i=1}^n (x_i - \overline{x})^2 > 0 \]
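A minimal sketch (made-up numbers, not from the lecture data) that computes \(\hat{\beta}_1\) and \(\hat{\beta}_0\) from these formulas and checks them against R's lm():
# OLS slope and intercept computed directly from the formulas above
x <- c(1, 2, 3, 4, 5)
y <- c(2.0, 4.1, 5.9, 8.2, 9.8)
b1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b1_cov <- cov(x, y) / var(x)        # same value: sample covariance over sample variance
b0_hat <- mean(y) - b1_hat * mean(x)
c(b0_hat = b0_hat, b1_hat = b1_hat, b1_cov = b1_cov)
coef(lm(y ~ x))                     # intercept and slope should match b0_hat and b1_hat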
The fitted (predicted) values are \[ \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i \]
We define the residuals as the difference between the observed and the fitted values: \[ \hat{u}_i = y_i - \hat{y}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \]
The residuals are the sample counterparts of the unobserved errors \(u_i\): they estimate, but are not identical to, the \(u_i\).
The same estimates result from the least squares principle: choose \(\hat{\beta}_0\) and \(\hat{\beta}_1\) to minimize the sum of squared residuals, \(SSE = \sum_{i=1}^n \hat{u}_i^2\):
\[ \underset{(\hat{\beta}_0, \hat{\beta}_1)}{\min} \sum_{i=1}^n \hat{u}_i^2 = \underset{(\hat{\beta}_0, \hat{\beta}_1)}{\min} \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 \]
The first-order conditions are
\[ \frac{\partial SSE}{\partial \hat{\beta}_0} = -2 \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0 \]
\[ \frac{\partial SSE}{\partial \hat{\beta}_1} = -2 \sum_{i=1}^n x_i(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0 \]
which, after dividing by \(-2n\), are exactly the sample moment conditions above; the method-of-moments and OLS estimators therefore coincide.
\[ salary = \beta_0 + \beta_1 roe + u \]
salary: annual CEO salary (1000 US$)
roe: average return on equity for the last three years (%)
library(wooldridge)
result1 <- lm(salary ~ roe, data = ceosal1)
summary(result1)
##
## Call:
## lm(formula = salary ~ roe, data = ceosal1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1160.2 -526.0 -254.0 138.8 13499.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 963.19 213.24 4.517 1.05e-05 ***
## roe 18.50 11.12 1.663 0.0978 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1367 on 207 degrees of freedom
## Multiple R-squared: 0.01319, Adjusted R-squared: 0.008421
## F-statistic: 2.767 on 1 and 207 DF, p-value: 0.09777
\[ \widehat{salary} = 963.19 + 18.50 roe \]
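As a quick check, these coefficients can be reproduced directly from the covariance/variance formula (a sketch using the same ceosal1 data loaded above):
# Reproduce the OLS estimates in the fitted equation by hand
b1_check <- cov(ceosal1$roe, ceosal1$salary) / var(ceosal1$roe)
b0_check <- mean(ceosal1$salary) - b1_check * mean(ceosal1$roe)
round(c(intercept = b0_check, slope = b1_check), 2)   # about 963.19 and 18.50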
library(ggplot2)   # for ggplot(); assumed not loaded earlier
library(dplyr)     # for the %>% pipe; assumed not loaded earlier
ceosal1 %>%
  ggplot(aes(x = roe, y = salary)) +
  geom_point() +
  scale_x_continuous(expand = c(0, 0), limits = c(0, 60)) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 5000)) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Return on equity (%)",
       y = "Salary (1000 $)") +
  theme_classic()
salaryhat <- fitted(result1)
uhat <- resid(result1)
tab1 <- cbind(ceosal1$roe, ceosal1$salary, salaryhat, uhat)
colnames(tab1) <- c("roe", "salary", "salaryhat", "uhat")
head(tab1, n=15)
## roe salary salaryhat uhat
## 1 14.1 1095 1224.058 -129.058071
## 2 10.9 1001 1164.854 -163.854261
## 3 23.5 1122 1397.969 -275.969216
## 4 5.9 578 1072.348 -494.348338
## 5 13.8 1368 1218.508 149.492288
## 6 20.0 1145 1333.215 -188.215063
## 7 16.4 1078 1266.611 -188.610785
## 8 16.3 1094 1264.761 -170.760660
## 9 10.5 1237 1157.454 79.546207
## 10 26.3 833 1449.773 -616.772523
## 11 25.9 567 1442.372 -875.372056
## 12 26.8 933 1459.023 -526.023116
## 13 14.8 1339 1237.009 101.991102
## 14 22.3 937 1375.768 -438.767778
## 15 56.3 2011 2004.808 6.191886
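The residuals shown above satisfy the two first-order conditions up to rounding error: they sum to zero and are uncorrelated with roe in the sample. A quick check:
# The OLS residuals satisfy the sample moment conditions
sum(uhat)                  # approximately zero
sum(ceosal1$roe * uhat)    # approximately zero
cor(ceosal1$roe, uhat)     # approximately zero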
Each observation can be decomposed into a fitted value and a residual: \[ y_i = \hat{y}_i + \hat{u}_i \]
Based on this decomposition, we define the following sums of squares:
SST (Total Sum of Squares): gives the total variation in \(y\) (around \(\overline{y}\)) \[ SST = \sum_{i=1}^n (y_i - \overline{y})^2 \]
SSR (Explained/Regression Sum of Squares): measures the variation in the fitted values \(\hat{y}_i\) (around \(\overline{y}\)) \[ SSR = \sum_{i=1}^n (\hat{y}_i - \overline{y})^2 \]
SSE (Residual/Error Sum of Squares): measures the sample variation in the residuals \[ SSE = \sum_{i=1}^n \hat{u}_i^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2 \]
NOTES:
Total sample variation in \(y\) can be written as \(SST = SSE + SSR\)
Recall that the sample variance of \(y\) is \(V(y) = \frac{SST}{n-1}\)
\[ R^2 = \frac{SSR}{SST} \]
Since SSE can never be larger than SST we have \(0 \leq R^2 \leq 1\)
\(R^2\) is interpreted as the fraction of the sample variation in \(y\) that is explained by \(x\). After multiplying by 100 it can be interpreted as the percentage of the sample variation in \(y\) explained by \(x\).
In SLR, \(R^2\) can also be calculated as \(R^2 = Corr(x,y)^2\)
Typically, a high value of \(R^2\) indicates that the model fits the data well
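These quantities can be computed by hand for the CEO salary regression above (a sketch using result1); the ratios should match the Multiple R-squared of about 0.013 reported by summary(result1):
# Sums of squares and R-squared for result1 computed from their definitions
y_obs <- ceosal1$salary
SST <- sum((y_obs - mean(y_obs))^2)
SSR <- sum((fitted(result1) - mean(y_obs))^2)   # explained (regression) sum of squares
SSE <- sum(resid(result1)^2)                    # residual (error) sum of squares
c(R2 = SSR / SST, check = 1 - SSE / SST, corr_sq = cor(ceosal1$roe, y_obs)^2)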
The mean square error, \[ MSE = \frac{SSE}{n-p}, \; \text{where} \; p = \text{no. of parameters in the model}, \]
is an estimate of the variance of the error term, computed from the residuals \(\hat{u}_i\); in the simple regression model, \(p = 2\).
The estimated standard deviation of the errors is \(\sqrt{MSE}\), which we call the Root Mean Square Error (RMSE); this is the "Residual standard error" reported in the R output above.
Ideally, a small RMSE (close to zero) indicates that the model fits the data well
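For the CEO salary regression, a sketch of the MSE and RMSE computed from the residuals; the RMSE should match the residual standard error of about 1367 reported by summary(result1):
# MSE and RMSE for result1, with p = 2 parameters (intercept and slope)
n_obs <- nobs(result1)
p <- 2
MSE  <- sum(resid(result1)^2) / (n_obs - p)
RMSE <- sqrt(MSE)
c(MSE = MSE, RMSE = RMSE, residual_se = sigma(result1))   # RMSE equals sigma(result1)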
Overall significance of the regression: \(H_0: \beta_i = 0 \; \forall i\) (all slope coefficients are zero) versus \(H_1: \beta_i \neq 0\) for at least one \(i\)
Test statistic: \(F = \frac{MSR}{MSE}\), where \(MSR = SSR/(p-1)\)
Significance of an individual coefficient: \(H_0: \beta_i = 0\) versus \(H_1: \beta_i \neq 0\)
Test statistic: \(t = \frac{\hat{\beta}_i}{se(\hat{\beta}_i)}\)
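For the CEO salary regression above, a sketch computing both statistics by hand; in the simple regression model the overall F statistic equals the square of the slope's t statistic:
# Slope t statistic and overall F statistic for result1
b1_roe <- coef(summary(result1))["roe", "Estimate"]
se_roe <- coef(summary(result1))["roe", "Std. Error"]
t_roe  <- b1_roe / se_roe                      # about 1.663, as in the output above
atab   <- anova(result1)
F_stat <- atab["roe", "Mean Sq"] / atab["Residuals", "Mean Sq"]
c(t = t_roe, t_squared = t_roe^2, F = F_stat)  # in SLR, F = t^2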
gpareg <- lm(colGPA ~ hsGPA,data = gpa1)
summary(gpareg)
##
## Call:
## lm(formula = colGPA ~ hsGPA, data = gpa1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.85220 -0.26274 -0.04868 0.28902 0.88551
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.41543 0.30694 4.611 8.98e-06 ***
## hsGPA 0.48243 0.08983 5.371 3.21e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.34 on 139 degrees of freedom
## Multiple R-squared: 0.1719, Adjusted R-squared: 0.1659
## F-statistic: 28.85 on 1 and 139 DF, p-value: 3.211e-07
anova(gpareg)
## Analysis of Variance Table
##
## Response: colGPA
## Df Sum Sq Mean Sq F value Pr(>F)
## hsGPA 1 3.3351 3.3351 28.845 3.211e-07 ***
## Residuals 139 16.0710 0.1156
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
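From this table, \(SSR = 3.3351\) (hsGPA row) and \(SSE = 16.0710\) (Residuals row), so \(SST = 19.4061\) and \(R^2 = 3.3351/19.4061 \approx 0.172\), matching the Multiple R-squared in the summary above; likewise \(F = MSR/MSE = 3.3351/0.1156 \approx 28.85\).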