1 Definition of the Simple Regression Model

\[ y = \beta_0 +\beta_1x + u \]
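To make this concrete, here is a minimal R sketch that simulates data from the model; the parameter values, sample size, and distributions are illustrative assumptions, not taken from the text.

set.seed(1)                        # for reproducibility
n <- 100
x <- rnorm(n, mean = 10, sd = 2)   # explanatory variable
u <- rnorm(n, mean = 0, sd = 1)    # unobserved error term with E(u) = 0
beta0 <- 2                         # assumed intercept
beta1 <- 0.5                       # assumed slope
y <- beta0 + beta1 * x + u         # the simple regression model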

2 Some examples of SLR models

2.1 Agricultural production and fertilizer usage

\[ yield = \beta_0 + \beta_1 fertilizer + u \]

  • Slope parameter \(\beta_1\):

    • \(\Delta yield = \beta_1 \Delta fertilizer\)

    • Ceteris paribus, a one-unit change in the amount of fertilizer leads to a \(\beta_1\)-unit change in yield.

  • Random error term: \(u\)

    • Contains the effects of factors such as soil quality and rainfall, which are assumed to be unobserved.

    • Ceteris paribus \(\Leftrightarrow\) Holding all other factors fixed \(\Leftrightarrow \Delta u = 0\)

2.2 A simple wage equation

\[ wage = \beta_0 + \beta_1 educ + u \]

where: \(wage\): hourly wage (in pesos), \(educ\): education level (in years)

  • Slope parameter \(\beta_1\):

    • \(\Delta wage = \beta_1 \Delta educ\)

    • \(\beta_1\) measures the change in hourly wage given another year of education, holding all other factors fixed (ceteris paribus)

  • Random error term \(u\):

    • Other factors include labor force experience, innate ability, tenure with current employer, gender, quality of education, marital status, number of children, etc.

    • Any factor that may potentially affect worker productivity

3 Linearity

The model is linear in the parameters \(\beta_0\) and \(\beta_1\). One implication is a constant marginal effect: a one-unit change in \(x\) changes \(y\) by \(\beta_1\) regardless of the starting value of \(x\).

4 Assumptions for Ceteris Paribus Conclusions

  1. The expected value of the error term \(u\) is zero

    • If the model includes a constant term (\(\beta_0\)) then we can assume \[ E(u) = 0 \]

    • This assumption is about the distribution of the unobservables \(u\): some \(u\) terms will be positive and some negative, but on average \(u\) is zero

    • This assumption can always be made to hold, because any nonzero mean of \(u\) can be absorbed into a redefined \(\beta_0\)

  2. Conditional mean of \(u\) is zero

    • \(E(u|x) = E(u) = 0\): the average value of the unobservables is the same across all values of \(x\) (mean independence). In the wage example, this requires average innate ability to be the same at every level of education; see the simulation sketch below.
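A minimal simulation sketch of what \(E(u|x) = 0\) looks like in data; the data-generating process below is an assumption chosen for illustration.

set.seed(42)
x <- sample(8:16, 1000, replace = TRUE)  # e.g., years of education
u <- rnorm(1000)                         # drawn independently of x, so E(u|x) = 0
round(tapply(u, x, mean), 2)             # average of u within each x level is near 0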

5 Population Regression Function (PRF)

\[\begin{align} E(y|x) &= \beta_0 + \beta_1 x + E(u|x) \notag \\ &=\beta_0 + \beta_1 x,\; \text{since}\; E(u|x) = 0 \notag \end{align}\]


6 Estimation of regression parameters

\[ y_i = \beta_0 + \beta_1 x_i + u_i, \; i=1, 2, 3, \cdots, n \]

\[\begin{align} y_1 &= \beta_0 + \beta_1 x_1 + u_1 \notag \\ y_2 &= \beta_0 + \beta_1 x_2 + u_2 \notag \\ y_3 &= \beta_0 + \beta_1 x_3 + u_3 \notag \\ \vdots &= \vdots \notag \\ y_n &= \beta_0 + \beta_1 x_n + u_n \notag \end{align}\]

Estimation relies on two assumptions about the error term; note that \(Cov(x, u) = E(xu)\) holds because \(E(u) = 0\):

\[\begin{align} E(u) &= 0 \notag \\ Cov(x, u) &= E(xu) = 0 \notag \end{align}\]

6.1 Method of moments estimation

  • Population moment conditions: \[\begin{align} E(y - \beta_0 - \beta_1 x) &= 0 \notag \\ E[x(y - \beta_0 - \beta_1 x)] &= 0 \notag \end{align}\]

  • Replacing these with their sample analogs we obtain:

\[\begin{align} \frac{1}{n} \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) &= 0 \notag \\ \frac{1}{n} \sum_{i=1}^n x_i(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) &= 0 \notag \end{align}\]

  • This system can easily be solved for \(\hat{\beta}_0\) and \(\hat{\beta}_1\) using sample data

  • Note that \(\hat{\beta}_0\) and \(\hat{\beta}_1\) have hats on them: they are estimators, not fixed population quantities, and they change as the data change.

  • Using properties of the summation operator, we obtain from the first sample moment condition \[ \overline{y} = \hat{\beta}_0 + \hat{\beta}_1 \overline{x} \]

where: \(\overline{y}\) and \(\overline{x}\) are sample means.

  • Thus,

\[ \hat{\beta}_0 = \overline{y} - \hat{\beta}_1 \overline{x} \]

  • Substituting \(\hat{\beta}_0\) into the second moment condition we get:

\[\begin{align} \sum_{i=1}^n x_i(y_i-(\overline{y}-\hat{\beta}_1 \overline{x})- \hat{\beta}_1 x_i) &= 0 \notag \\ \Rightarrow \sum_{i=1}^n x_i(y_i - \overline{y}) &= \hat{\beta}_1 \sum_{i=1}^n x_i(x_i - \overline{x}) \notag \end{align}\]

  • Therefore,

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \overline{x})(y_i - \overline{y})}{\sum_{i=1}^n (x_i - \overline{x})^2} \]

based on the following identities:

\[\begin{align} \sum_{i=1}^nx_i(x_i - \overline{x}) &= \sum_{i=1}^n (x_i - \overline{x})^2 \notag \\ \sum_{i=1}^nx_i(y_i - \overline{y}) &= \sum_{i=1}^n (x_i - \overline{x})(y_i - \overline{y}) \notag \end{align}\]

  • Slope estimator is the ratio of the sample covariance between \(x\) and \(y\) to the sample variance of \(x\).

  • The sign of \(\hat{\beta}_1\) is the sign of the sample covariance: if \(x\) and \(y\) are positively correlated in the sample, \(\hat{\beta}_1\) is positive; if they are negatively correlated, \(\hat{\beta}_1\) is negative.

  • To be able to calculate \(\hat{\beta}_1\), \(x\) must have enough variability:

\[ \sum_{i=1}^n (x_i - \overline{x})^2 > 0 \]

  • If all \(x\) values are the same then the sample variance will be 0. In this case, \(\hat{\beta}_1\) will be undefined. For example, if all employees have the same level of education, say 12 years, then it is not possible to measure the impact of education on wages.
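The closed-form estimators are easy to compute by hand. Below is a sketch using the ceosal1 data from the CEO salary example later in these notes (requires the wooldridge package); the n − 1 factors in cov() and var() cancel in the ratio, so the result matches the formula above.

library(wooldridge)   # provides the ceosal1 data frame
b1 <- cov(ceosal1$roe, ceosal1$salary) / var(ceosal1$roe)  # sample covariance / sample variance
b0 <- mean(ceosal1$salary) - b1 * mean(ceosal1$roe)        # from ybar = b0hat + b1hat * xbar
c(b0 = b0, b1 = b1)   # should match coef(lm(salary ~ roe, data = ceosal1))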

6.2 Ordinary Least Squares (OLS) Estimation

  • Fitted values of \(y\) can be calculated once \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are found, using the equation

\[ \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i \]

  • We define the residuals as the difference between the observed and the fitted values:

\[ \hat{u}_i = y_i - \hat{y}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \]

  • The residuals serve as sample estimates of the unobserved errors \(u_i\) (they are not the errors themselves)

  • OLS Objective Function: OLS estimators are found by making the sum of squared residuals (SSR) as small as possible, that is,

\[ \min_{\hat{\beta}_0, \hat{\beta}_1} \sum_{i=1}^n \hat{u}_i^2 = \min_{\hat{\beta}_0, \hat{\beta}_1} \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 \]

  • OLS First Order Conditions

\[ \frac{\partial SSR}{\partial \hat{\beta}_0} = -2 \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0 \]

\[ \frac{\partial SSR}{\partial \hat{\beta}_1} = -2 \sum_{i=1}^n x_i(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0 \]

  • The solution of this system is the same as the solution of the system obtained using the method of moments.
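As a sanity check, the SSR objective can also be minimized numerically. Here is a sketch using optim on the ceosal1 data from the next example; Nelder-Mead only gets close to the minimizer, whereas lm uses the exact closed form.

library(wooldridge)
ssr <- function(b, x, y) sum((y - b[1] - b[2] * x)^2)  # OLS objective as a function of (b0, b1)
optim(par = c(0, 0), fn = ssr, x = ceosal1$roe, y = ceosal1$salary)$par
# approximately (963.19, 18.50), the closed-form OLS estimates reported below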

7 Example: CEO Salary and Firm Performance

\[ salary = \beta_0 + \beta_1 roe + u \]

where: \(salary\): annual CEO salary (in thousands of dollars), \(roe\): return on equity (in percent)

library(wooldridge)
result1 <- lm(salary ~ roe, data = ceosal1)
summary(result1)
## 
## Call:
## lm(formula = salary ~ roe, data = ceosal1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1160.2  -526.0  -254.0   138.8 13499.9 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   963.19     213.24   4.517 1.05e-05 ***
## roe            18.50      11.12   1.663   0.0978 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1367 on 207 degrees of freedom
## Multiple R-squared:  0.01319,    Adjusted R-squared:  0.008421 
## F-statistic: 2.767 on 1 and 207 DF,  p-value: 0.09777

\[ \widehat{salary} = 963.19 + 18.50\, roe \]

  • Interpretation: a one-percentage-point increase in \(roe\) raises predicted salary by 18.50, i.e., $18,500, since salary is measured in thousands of dollars.

library(dplyr)    # for the pipe operator %>%
library(ggplot2)

ceosal1 %>% 
  ggplot(aes(x = roe,
             y = salary)) +
  geom_point() +
  scale_x_continuous(expand = c(0, 0), limits = c(0, 60)) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 5000)) +
  geom_smooth(method = "lm", se = FALSE) + 
  labs(x = "Return on equity (%)",
       y = "Salary (1000$)") +
  theme_classic()

7.1 Fitted values and residuals

salaryhat <- fitted(result1)
uhat <- resid(result1)
tab1 <- cbind(ceosal1$roe, ceosal1$salary, salaryhat, uhat)
colnames(tab1) <- c("roe", "salary", "salaryhat", "uhat")
head(tab1, n=15)
##     roe salary salaryhat        uhat
## 1  14.1   1095  1224.058 -129.058071
## 2  10.9   1001  1164.854 -163.854261
## 3  23.5   1122  1397.969 -275.969216
## 4   5.9    578  1072.348 -494.348338
## 5  13.8   1368  1218.508  149.492288
## 6  20.0   1145  1333.215 -188.215063
## 7  16.4   1078  1266.611 -188.610785
## 8  16.3   1094  1264.761 -170.760660
## 9  10.5   1237  1157.454   79.546207
## 10 26.3    833  1449.773 -616.772523
## 11 25.9    567  1442.372 -875.372056
## 12 26.8    933  1459.023 -526.023116
## 13 14.8   1339  1237.009  101.991102
## 14 22.3    937  1375.768 -438.767778
## 15 56.3   2011  2004.808    6.191886
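These fitted values and residuals satisfy the OLS first-order conditions, which can be checked numerically (continuing with the objects defined above):

mean(uhat)                                        # ~ 0: residuals average to zero
cor(ceosal1$roe, uhat)                            # ~ 0: residuals are uncorrelated with x
all.equal(mean(salaryhat), mean(ceosal1$salary))  # fitted values average to ybar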

8 Sum of Squares

\[ y_i = \hat{y}_i + \hat{u}_i \]

Subtracting \(\overline{y}\) from both sides, squaring, and summing over the sample (the cross term drops out because the residuals are uncorrelated with the fitted values) we obtain the following quantities:

\[\begin{align} SST &= \sum_{i=1}^n (y_i - \overline{y})^2 \; \text{(total sum of squares)} \notag \\ SSE &= \sum_{i=1}^n (\hat{y}_i - \overline{y})^2 \; \text{(explained sum of squares)} \notag \\ SSR &= \sum_{i=1}^n \hat{u}_i^2 \; \text{(residual sum of squares)} \notag \end{align}\]

which satisfy the decomposition

\[ SST = SSE + SSR \]
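A quick numerical check of the decomposition, reusing salaryhat and uhat from the CEO salary example above:

SST <- sum((ceosal1$salary - mean(ceosal1$salary))^2)  # total sum of squares
SSE <- sum((salaryhat - mean(ceosal1$salary))^2)       # explained sum of squares
SSR <- sum(uhat^2)                                     # residual sum of squares
all.equal(SST, SSE + SSR)                              # TRUE up to floating-point error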

9 Measures of Goodness-of-fit

9.1 Coefficient of Determination (\(R^2\))

  • The ratio of explained variation to the total variation is called the coefficient of determination and denoted by \(R^2\)

\[ R^2 = \frac{SSE}{SST} = 1 - \frac{SSR}{SST} \]

  • Since SSE can never be larger than SST we have \(0 \leq R^2 \leq 1\)

  • \(R^2\) is interpreted as the fraction of the sample variation in \(y\) that is explained by \(x\). After multiplying by 100 it can be interpreted as the percentage of the sample variation in \(y\) explained by \(x\).

  • In SLR, \(R^2\) can also be calculated as \(R^2 = Corr(x,y)^2\)

  • Typically, a high value of \(R^2\) indicates that the model fits the data well
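For the CEO salary regression, both routes to \(R^2\) agree; a quick check reusing result1 from above:

summary(result1)$r.squared           # R-squared reported by summary(): 0.01319
cor(ceosal1$roe, ceosal1$salary)^2   # squared sample correlation; same value in SLR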

9.2 Root Mean Square Error (RMSE)

  • Note that

\[ MSE = \frac{SSR}{n-p}, \; \text{where} \; p = \text{no. of parameters in the model} \]

is the estimated variance of the errors, based on the residuals \(\hat{u}_i\)

  • The estimated standard deviation of the errors is \(\sqrt{MSE}\), which we call the Root Mean Square Error (RMSE); R reports it as the "Residual standard error"

  • Ideally, a small RMSE (close to zero, relative to the scale of \(y\)) is desired in order to say that the model fits the data well
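A hand computation of the RMSE for the CEO salary regression; it should reproduce the residual standard error of 1367 on 207 degrees of freedom shown in the summary output above.

n <- nobs(result1)                      # number of observations
p <- length(coef(result1))              # number of parameters (2 in SLR)
sqrt(sum(resid(result1)^2) / (n - p))   # RMSE; equivalently sigma(result1)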

10 Test of Significance

10.1 Test of Overall Significance of Regression

  • \(H_0: \beta_i = 0, \forall i\) versus \(H_1: \beta_i \neq 0, \exists i\), where \(i\) runs over the slope coefficients only (in SLR this reduces to \(H_0: \beta_1 = 0\))

  • Test statistic: \(F = \frac{MSR}{MSE}\), where \(MSR = SSE/(p-1)\) is the mean square for regression and \(MSE = SSR/(n-p)\) is the mean square error; under \(H_0\), \(F \sim F_{p-1,\, n-p}\)

10.2 Test of Significance of Individual Regression Coefficient

  • \(H_0: \beta_i = 0\) versus \(H_1: \beta_i \neq 0\)

  • Test statistic: \(t = \frac{\hat{\beta}_i}{se(\hat{\beta}_i)}\), which under \(H_0\) follows a \(t\) distribution with \(n-p\) degrees of freedom
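For the CEO salary regression, the slope's t statistic can be reproduced from the coefficient table; a quick check reusing result1:

ct <- summary(result1)$coefficients               # matrix of estimates, SEs, t values, p values
ct["roe", "Estimate"] / ct["roe", "Std. Error"]   # 18.50 / 11.12, about 1.663 as in the summary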

11 Another Example: College GPA and High School GPA

gpareg <- lm(colGPA ~ hsGPA,data = gpa1)
summary(gpareg)
## 
## Call:
## lm(formula = colGPA ~ hsGPA, data = gpa1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.85220 -0.26274 -0.04868  0.28902  0.88551 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.41543    0.30694   4.611 8.98e-06 ***
## hsGPA        0.48243    0.08983   5.371 3.21e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.34 on 139 degrees of freedom
## Multiple R-squared:  0.1719, Adjusted R-squared:  0.1659 
## F-statistic: 28.85 on 1 and 139 DF,  p-value: 3.211e-07
anova(gpareg)
## Analysis of Variance Table
## 
## Response: colGPA
##            Df  Sum Sq Mean Sq F value    Pr(>F)    
## hsGPA       1  3.3351  3.3351  28.845 3.211e-07 ***
## Residuals 139 16.0710  0.1156                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA table reproduces the overall F test: \(F = MSR/MSE = 3.3351/0.1156 \approx 28.85\), which matches the F-statistic in the summary output (and, with a single slope, \(t^2 = 5.371^2 \approx 28.85\)).