Simple Linear Regression

Author : Gianfranco David Chamorro Rodriguez

Channel : Video

This document only estimates the model by ordinary least squares (OLS); it does not evaluate the assumptions of the model.

We use the data from Table 5.5 of Damodar Gujarati’s book Econometrics, Fifth Edition, which reports the average annual salary of public school teachers (in dollars) and public education spending per pupil (in dollars) for 1985 in the 50 states and the District of Columbia in the United States. To examine whether teacher salary is related to spending per pupil in public schools, the following model is proposed:

\[\begin{equation*} Salary = \alpha + \beta \; Spending + e \end{equation*}\]

In general notation, with \(Y\) for salary and \(X\) for spending, we have:

\[\begin{equation} Y = \alpha + \beta_1 X +e \end{equation}\]

so the error term is:

\[\begin{equation*} e= Y - \alpha - \beta_1 X \end{equation*}\]

Recall that the OLS method minimizes the sum of the squared vertical distances between the responses observed in the sample and the responses predicted by the model, that is, the residual sum of squares (RSS):

\[ \min_{\alpha, \beta_1} RSS = \min_{\alpha, \beta_1} \sum_{i=1}^{n}e_{i}^{2} \]

The first-order conditions for a minimum require the first derivatives of the objective function with respect to each coefficient to be zero:

\(\alpha\): \[ RSS = \sum_{i=1}^{n}e_{i}^{2}= \sum_{i=1}^{n} (Y_i - \alpha - \beta_1 X_i)^2 \]

\[ \dfrac{\partial RSS}{\partial \alpha} = 2 \sum_{i=1}^{n}( Y_i - \alpha - \beta_1 X_i) (-1) =0 \]

\[ \sum_{i=1}^{n}Y_i - n\alpha - \beta_1 \sum_{i=1}^{n}X_i = 0 \]

Dividing by \(n\) and solving for \(\alpha\):

\[\begin{equation*} \hat{\alpha}_{ols} = \bar{Y}-\hat{\beta}_1 \bar{X} \end{equation*}\]

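\(\beta_1\):

\[ \dfrac{\partial RSS}{\partial \beta_1} = 2 \sum_{i=1}^{n}( Y_i - \alpha - \beta_1 X_i) (-X_i) =0 \]

Substituting \(\alpha = \bar{Y} - \beta_1 \bar{X}\) and using \(\sum_{i=1}^{n} X_i = n\bar{X}\):

\[ \sum_{i=1}^{n}Y_i X_i - (\bar{Y}- \beta_1 \bar{X} ) \sum_{i=1}^{n}X_i -\beta_1 \sum_{i=1}^{n} X_i^2 = 0 \]

\[ \sum_{i=1}^{n} Y_i X_i - \bar{Y}\sum_{i=1}^{n} X_i + n\beta_1 \bar{X}^2 -\beta_1 \sum_{i=1}^{n} X_i^2 = 0 \]

\[ \beta_1 \left(\sum_{i=1}^{n} X_i^2 - n\bar{X}^2\right)= \sum_{i=1}^{n} Y_i X_i - \bar{Y} \sum_{i=1}^{n} X_i \]

\[ \beta_1 = \frac{\sum_{i=1}^{n} Y_i X_i -\bar{Y} \sum_{i=1}^{n} X_i }{\sum_{i=1}^{n} X_i^2 -n\bar{X}^2 } = \frac{\sum_{i=1}^{n} Y_i X_i - n \bar{Y} \bar{X}}{\sum_{i=1}^{n} X_i^2 -n\bar{X}^2} \]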
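The step from raw sums to a covariance/variance ratio relies on two centering identities, written here with explicit observation indices as a reminder:

\[ \sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y}) = \sum_{i=1}^{n} Y_i X_i - n\bar{Y}\bar{X} \]

\[ \sum_{i=1}^{n}(X_i-\bar{X})^2 = \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 \]

Dividing both expressions by \(n-1\) turns the numerator into the sample covariance of \(X\) and \(Y\) and the denominator into the sample variance of \(X\); the common factor cancels in the ratio.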

\[\begin{equation*} \hat{\beta}_{1,ols} = \frac{Cov(X,Y)}{Var(X)} \end{equation*}\]
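The closed-form estimators can be checked numerically. The sketch below is only an illustration: it uses the six observations that `head(datos)` prints later in this document rather than the full data set, so its estimates will not match the full-sample results. The names `salary`, `spending`, `beta_hat`, `alpha_hat` and `beta_sums` are chosen here for the example.

```r
# First six observations, copied from the head(datos) output in this document
salary   <- c(19583, 20263, 20325, 26800, 29470, 26610)
spending <- c(3346, 3114, 3554, 4642, 4669, 4888)

# Closed-form OLS estimates: slope = Cov(X, Y) / Var(X),
# intercept = mean(Y) - slope * mean(X)
beta_hat  <- cov(spending, salary) / var(spending)
alpha_hat <- mean(salary) - beta_hat * mean(spending)

# The same slope from the raw-sum form reached at the end of the derivation
n <- length(salary)
beta_sums <- (sum(salary * spending) - n * mean(salary) * mean(spending)) /
  (sum(spending^2) - n * mean(spending)^2)

# Compare with lm(); both comparisons should return TRUE
fit <- lm(salary ~ spending)
all.equal(unname(coef(fit)), c(alpha_hat, beta_hat))
all.equal(beta_hat, beta_sums)
```

Because `cov()` and `var()` both divide by n - 1, the factor cancels and the ratio equals the raw-sum expression.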

# Load the data set from GitHub
file <- "https://raw.githubusercontent.com/Gianfrancocr27/Data-Set/main/gujarati55.csv"

datos <- read.csv(file=file, header=TRUE)
head(datos) # shows the first 6 rows

##   salary spending
## 1  19583     3346
## 2  20263     3114
## 3  20325     3554
## 4  26800     4642
## 5  29470     4669
## 6  26610     4888

summary(datos) # Summary statistics for each variable

##      salary         spending   
##  Min.   :18095   Min.   :2297  
##  1st Qu.:21495   1st Qu.:2974  
##  Median :23382   Median :3554  
##  Mean   :24356   Mean   :3697  
##  3rd Qu.:26568   3rd Qu.:4082  
##  Max.   :41480   Max.   :8349

par(mfrow = c(2, 2)) # Set up a 2x2 grid: histograms on top, boxplots below
hist(datos$spending, breaks = 5, ylab = "Frequency", main = "", xlab = "Spending", col="pink", border="blue")
hist(datos$salary, breaks = 5, ylab = "Frequency", main = "", xlab = "Salary", col="pink", border="blue")
mtext("Histogram",                   
      side = 3,
      line = - 2,
      outer = TRUE)
mtext("Boxplot",
      side=3,
      line=-17,
      outer=TRUE)
boxplot(datos$spending, horizontal = TRUE, xlab = "Spending", border = "black",  whiskcol = "blue", col="pink")
boxplot(datos$salary, horizontal = TRUE, xlab = "Salary", border = "black", whiskcol = "blue", col="pink")

regresion <- lm(datos$salary ~ datos$spending) # lm() ("linear model") is R's main function for fitting linear models
regresion # Shows the estimated coefficients

## 
## Call:
## lm(formula = datos$salary ~ datos$spending)
## 
## Coefficients:
##    (Intercept)  datos$spending  
##      12129.371           3.308
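Written out with the printed coefficients, the fitted line is:

\[ \widehat{Salary} = 12129.371 + 3.308 \; Spending \]

so each additional dollar of spending per pupil is associated with an average salary that is about 3.31 dollars higher. As a quick check of \(\hat{\alpha} = \bar{Y} - \hat{\beta}_1 \bar{X}\), evaluating the line at the mean spending reported by `summary(datos)` should return approximately the mean salary:

\[ 12129.371 + 3.308 \times 3697 \approx 24359 \]

which is close to the reported mean salary of 24356. The small gap comes from the rounding of the printed slope and means.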

summary(regresion) # Full model summary: coefficients, standard errors, R-squared, F-statistic

## 
## Call:
## lm(formula = datos$salary ~ datos$spending)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3848.0 -1844.6  -217.5  1660.0  5529.3 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.213e+04  1.197e+03   10.13 1.31e-13 ***
## datos$spending 3.308e+00  3.117e-01   10.61 2.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2325 on 49 degrees of freedom
## Multiple R-squared:  0.6968, Adjusted R-squared:  0.6906 
## F-statistic: 112.6 on 1 and 49 DF,  p-value: 2.707e-14
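The t values in the coefficient table are the estimates divided by their standard errors, and the p-values are two-sided tail probabilities of a t distribution with 49 degrees of freedom. A minimal sketch, using the rounded numbers printed above (the vectors `est` and `se` are typed in by hand for this example, so the results only match up to rounding):

```r
# Estimates and standard errors copied from the summary(regresion) output
est <- c(intercept = 1.213e+04, spending = 3.308e+00)
se  <- c(intercept = 1.197e+03, spending = 3.117e-01)
df_resid <- 49  # residual degrees of freedom: n - 2 = 51 - 2

# t statistic for H0: coefficient = 0
t_val <- est / se

# Two-sided p-values from the t distribution
p_val <- 2 * pt(abs(t_val), df = df_resid, lower.tail = FALSE)

round(t_val, 2)
p_val

# With a single regressor, the F-statistic equals the squared slope t value
unname(t_val["spending"]^2)
```

The squared slope t value should land close to the printed F-statistic of 112.6, which is why the slope p-value and the overall F-test p-value coincide in this model.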

anova(regresion)  # Analysis of variance (ANOVA) table

## Analysis of Variance Table
## 
## Response: datos$salary
##                Df    Sum Sq   Mean Sq F value    Pr(>F)    
## datos$spending  1 608555015 608555015   112.6 2.707e-14 ***
## Residuals      49 264825250   5404597                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
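The ANOVA table contains everything needed to rebuild the fit statistics reported by `summary(regresion)`. The sketch below copies the sums of squares and degrees of freedom from the table above; the variable names are chosen for this example.

```r
# Values copied from the anova(regresion) table
ss_reg <- 608555015  # Sum Sq explained by spending
ss_res <- 264825250  # Sum Sq of the residuals
df_reg <- 1
df_res <- 49

ms_res <- ss_res / df_res                 # residual mean square
r2     <- ss_reg / (ss_reg + ss_res)      # R-squared = explained / total
adj_r2 <- 1 - (1 - r2) * (df_reg + df_res) / df_res  # adjusted R-squared
f_stat <- (ss_reg / df_reg) / ms_res      # F = MS regression / MS residual
rse    <- sqrt(ms_res)                    # residual standard error
p_f    <- pf(f_stat, df_reg, df_res, lower.tail = FALSE)

round(c(r2 = r2, adj_r2 = adj_r2), 4)
round(c(f_stat = f_stat, rse = rse), 1)
p_f
```

After rounding, these should agree with the R-squared, adjusted R-squared, F-statistic and residual standard error shown in the model summary.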

par(mfrow = c(1, 1))  # Reset to a single plot per figure
plot(datos$spending, datos$salary)
abline(regresion, col = "blue") # Add the line estimated by the model

residuo <- resid(regresion) # Get the residuals of the model
plot(fitted(regresion), residuo)  # Scatterplot of residuals against fitted values
abline(h = 0)  # Add a horizontal reference line at 0

plot(density(residuo))  # Kernel density estimate of the residuals

library(tseries)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

jarque.bera.test(residuo) # Jarque-Bera normality test of the residuals

## 
##  Jarque Bera Test
## 
## data:  residuo
## X-squared = 2.1963, df = 2, p-value = 0.3335
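The Jarque-Bera statistic combines the sample skewness S and kurtosis K as JB = n/6 * (S^2 + (K - 3)^2 / 4) and is compared with a chi-squared distribution with 2 degrees of freedom. The sketch below is a hand-written illustration of that formula, not the `tseries` source; `jb_test` is a hypothetical helper name, and the simulated vector only stands in for the residuals.

```r
# Hand-rolled Jarque-Bera test (illustrative helper, not from a package)
jb_test <- function(x) {
  n    <- length(x)
  d    <- x - mean(x)
  m2   <- mean(d^2)               # second central moment (divides by n)
  skew <- mean(d^3) / m2^(3 / 2)  # sample skewness
  kurt <- mean(d^4) / m2^2        # sample kurtosis
  stat <- n / 6 * (skew^2 + (kurt - 3)^2 / 4)
  p    <- pchisq(stat, df = 2, lower.tail = FALSE)
  c(statistic = stat, p.value = p)
}

# p-value implied by the statistic printed above
pchisq(2.1963, df = 2, lower.tail = FALSE)

# Example on simulated normal data of the same length as the sample
set.seed(1)
jb_test(rnorm(51))
```

With a p-value of about 0.33, the test does not reject normality of the residuals at the usual significance levels.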

summary(residuo) # Summary statistics of the residuals

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -3848.0 -1844.6  -217.5     0.0  1660.0  5529.3

qqnorm(residuo) # Quantile-Quantile plot of residuals