Linear Regression

\(y_i=\beta_0+\beta_1x_i+\epsilon_i\)

Least Squares method

\(\hat{\beta}_1=\frac{\sum x_iy_i-\frac{1}{n}\sum x_i\sum y_i}{\sum x_i^2-\frac{1}{n}(\sum x_i)^2}\)

\(\hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x}\)
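The two formulas above can be evaluated directly in R. The following is a minimal sketch, assuming the built-in mtcars data with disp as the predictor and hp as the outcome; the variable names x, y, n, b0 and b1 are chosen here for illustration only.

```r
# Closed-form least-squares estimates for hp on disp (built-in mtcars data).
x <- mtcars$disp   # predictor x_i
y <- mtcars$hp     # outcome y_i
n <- length(x)

# Slope: (sum(x*y) - sum(x)*sum(y)/n) / (sum(x^2) - sum(x)^2/n)
b1 <- (sum(x * y) - sum(x) * sum(y) / n) / (sum(x^2) - sum(x)^2 / n)

# Intercept: mean(y) - slope * mean(x)
b0 <- mean(y) - b1 * mean(x)

c(intercept = b0, slope = b1)
```

Because lm() solves the same least-squares problem, these values should agree with the estimates it reports for hp ~ disp (45.7345 and 0.4375 in the summary output in this section).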

In R, this regression model is written as the formula y ~ x, where y is the outcome and x is the predictor.

The summary() function reports the additional information you usually want from a regression (p-values, t-statistics, R-squared).

attach(mtcars)
f1 <- lm(hp ~ disp)
summary(f1)
## 
## Call:
## lm(formula = hp ~ disp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.623 -28.378  -6.558  13.588 157.562 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  45.7345    16.1289   2.836  0.00811 ** 
## disp          0.4375     0.0618   7.080 7.14e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.65 on 30 degrees of freedom
## Multiple R-squared:  0.6256, Adjusted R-squared:  0.6131 
## F-statistic: 50.13 on 1 and 30 DF,  p-value: 7.143e-08
names(f1)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"
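The components listed by names(f1) can be pulled out individually. The sketch below shows one way to do that; it refits the same model with the data = mtcars argument instead of attach() so that the block runs on its own, and the name s1 is introduced here only for illustration.

```r
# Same model as f1 in the text, fitted with data = mtcars so the block is self-contained.
f1 <- lm(hp ~ disp, data = mtcars)

coef(f1)              # named vector with "(Intercept)" and "disp"
head(residuals(f1))   # first few residuals (observed hp minus fitted hp)
head(fitted(f1))      # first few fitted values
f1$df.residual        # residual degrees of freedom

# summary() returns its own object with further components.
s1 <- summary(f1)
s1$coefficients       # matrix: Estimate, Std. Error, t value, Pr(>|t|)
s1$r.squared          # multiple R-squared
s1$sigma              # residual standard error
```

The accessor functions coef(), residuals() and fitted() are generally preferred over reaching into the object with $, since they also work for other model classes.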
  • Intercept: When disp is zero, the predicted horsepower is approximately 45.73 hp. No car has zero displacement, and mtcars contains no engines anywhere near that size, so this value is an extrapolation; the intercept mainly anchors the fitted line rather than describing a real car.

  • Slope: For each additional cubic inch of engine displacement, horsepower (hp) is expected to increase by approximately 0.4375 hp. This indicates a positive relationship between the two quantitative variables.

  • P-value: Both coefficients have p-values below 0.01 (0.00811 for the intercept and 7.14e-08 for disp), so both are statistically significant at the 1% significance level.

  • Goodness of Fit:

  • R-squared: Approximately 62.56% of the variability in horsepower (hp) is explained by the engine displacement (disp). R-squared values range from 0 to 1. A higher value indicates a better fit.

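  • Residual standard error: 42.65 on 30 degrees of freedom. It estimates the typical size of a residual, so observed horsepower deviates from the fitted line by roughly 43 hp. A smaller value means the predictions are closer to the actual values, which is a sign of a better fit.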

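  • F-statistic: Tests the overall significance of the model, that is, whether at least one predictor has a non-zero coefficient. Here F = 50.13 on 1 and 30 degrees of freedom with a p-value of 7.143e-08, so the model fits significantly better than an intercept-only model with no predictors. With a single predictor, this test is equivalent to the t-test on the slope (\(7.080^2 \approx 50.13\)).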
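The goodness-of-fit quantities discussed in the list above can be rebuilt from the residuals. This is a rough sketch, assuming the model is refitted with data = mtcars so the block is self-contained; the names rss, tss, r2, rse, f_stat and p_val are introduced here for illustration.

```r
# Rebuild R-squared, residual standard error and the F-statistic by hand.
f1 <- lm(hp ~ disp, data = mtcars)

rss <- sum(residuals(f1)^2)                   # residual sum of squares
tss <- sum((mtcars$hp - mean(mtcars$hp))^2)   # total sum of squares
df_res <- df.residual(f1)                     # n - 2 for one predictor

r2  <- 1 - rss / tss                          # share of variability explained
rse <- sqrt(rss / df_res)                     # residual standard error

# Overall F-test against the intercept-only model (1 numerator df).
f_stat <- ((tss - rss) / 1) / (rss / df_res)
p_val  <- pf(f_stat, 1, df_res, lower.tail = FALSE)

# With one predictor, F equals the squared t-statistic of the slope.
t_slope <- summary(f1)$coefficients["disp", "t value"]

c(r2 = r2, rse = rse, f_stat = f_stat, t_squared = t_slope^2, p_val = p_val)
```

Each value should match the corresponding line of the summary(f1) output, which is a useful check that the interpretation of those lines is right.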

Final Interpretation: As engine displacement increases, horsepower tends to increase.
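One way to make this interpretation concrete is to predict horsepower at a few displacement values. The sketch below is illustrative only: the displacements 100, 200 and 300 cu.in. and the names new_cars and pred are hypothetical choices, picked to lie inside the range of disp observed in mtcars.

```r
# Predict hp for a few hypothetical displacements using the fitted line.
f1 <- lm(hp ~ disp, data = mtcars)

new_cars <- data.frame(disp = c(100, 200, 300))  # hypothetical values (cu.in.)
pred <- predict(f1, newdata = new_cars)
pred

# The same numbers computed directly from the coefficients.
by_hand <- coef(f1)[1] + coef(f1)[2] * new_cars$disp
by_hand

# Predictions with 95% prediction intervals for individual cars.
predict(f1, newdata = new_cars, interval = "prediction", level = 0.95)
```

The column name in newdata must match the predictor name in the formula (disp), otherwise predict() cannot find the variable.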

f2 <- lm(hp ~ mpg + disp + drat)
summary(f2)
## 
## Call:
## lm(formula = hp ~ mpg + disp + drat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -48.74 -19.46 -11.69  17.51 139.93 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  34.3580    94.9085   0.362  0.72006   
## mpg          -5.2383     2.2384  -2.340  0.02663 * 
## disp          0.3401     0.1132   3.005  0.00555 **
## drat         38.6732    19.0215   2.033  0.05162 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38.96 on 28 degrees of freedom
## Multiple R-squared:  0.7084, Adjusted R-squared:  0.6771 
## F-statistic: 22.67 on 3 and 28 DF,  p-value: 1.189e-07
anova(f1,f2)
## Analysis of Variance Table
## 
## Model 1: hp ~ disp
## Model 2: hp ~ mpg + disp + drat
##   Res.Df   RSS Df Sum of Sq      F  Pr(>F)  
## 1     30 54560                              
## 2     28 42495  2     12065 3.9748 0.03023 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
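  • Interpretation:
  • The p-value (0.03023) is below 0.05, so the more complex model (f2) fits significantly better than the simpler model (f1) at the 5% level: adding mpg and drat reduces the residual sum of squares from 54560 to 42495.
  • This comparison is only valid for nested models, where the predictors of the simpler model are a subset of those in the larger model (here, disp appears in both).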
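The F-statistic in the anova() table can be reproduced from the two residual sums of squares. This is a minimal sketch, assuming both models are refitted with data = mtcars so the block runs on its own; the names rss1, rss2, df1, df2, f_stat and p_val are introduced here for illustration.

```r
# Partial F-test for nested models, computed by hand.
f1 <- lm(hp ~ disp, data = mtcars)               # simpler model
f2 <- lm(hp ~ mpg + disp + drat, data = mtcars)  # larger model containing disp

rss1 <- deviance(f1)      # residual sum of squares, simpler model
rss2 <- deviance(f2)      # residual sum of squares, larger model
df1  <- df.residual(f1)   # 30
df2  <- df.residual(f2)   # 28

# Drop in RSS per added parameter, relative to the larger model's error variance.
f_stat <- ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)
p_val  <- pf(f_stat, df1 - df2, df2, lower.tail = FALSE)

c(f_stat = f_stat, p_val = p_val)
```

For a linear model, deviance() returns the residual sum of squares, so these are the RSS values shown in the table, and the result should match the F and Pr(>F) columns of anova(f1, f2).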
# Scatter plot with regression line
plot(mtcars$disp, mtcars$hp,
     main = "Horsepower vs. Displacement",
     xlab = "Displacement (cu.in.)",
     ylab = "Gross Horsepower")
abline(f1, col = "red", lwd = 2) 

Exercise

x <- c(-2, -1, 0, 1, 2)
y <- c(0, 0, 1, 1, 3)