About Linear Regression

Linear regression is used to predict the value of one variable based on the value of another
(commonly labeled y and x).
In other words, it allows us to identify trends in data and make predictions about future observations.

Why use Linear Regression?

The results are easy to interpret, and linear regression is a reliable statistical model.


Where is Linear Regression Used?

  • Business (Ex: Forecasting sales)
  • Academia
  • Environmental and social sciences
  • Behavioral sciences
  • Biological sciences
  • A lot more areas!

Example of Linear Regression

The following is an example using linear regression with the R data set “Orange”, which is about the growth of orange trees.

Example of Linear Regression Cont.

The following code was used to produce the linear regression figure:

library(ggplot2)  # needed for ggplot() and its geoms

data(Orange)

# Scatter plot of age vs. circumference, colored and sized by tree,
# with a fitted regression line from geom_smooth(method = "lm")
orange_graph <- ggplot(Orange, aes(x = age, y = circumference)) +
  ggtitle("Orange Trees, Age vs Circumference") +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_point(aes(col = Tree, size = Tree)) +
  geom_smooth(method = "lm") +
  xlab("Age (days)") +
  ylab("Circumference (mm)")

orange_graph

Fit of the Model

The following allows us to look at the model in more detail.

data(Orange)

# Fit a simple linear model with age as the response and circumference
# as the predictor, then inspect the fit
testing <- lm(age ~ circumference, data = Orange)
summary(testing)
## 
## Call:
## lm(formula = age ~ circumference, data = Orange)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -317.88 -140.90  -17.20   96.54  471.16 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    16.6036    78.1406   0.212    0.833    
## circumference   7.8160     0.6059  12.900 1.93e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 203.1 on 33 degrees of freedom
## Multiple R-squared:  0.8345, Adjusted R-squared:  0.8295 
## F-statistic: 166.4 on 1 and 33 DF,  p-value: 1.931e-14

Equations Used

Equation for a Line:

\(Y = m \cdot x + b\)

Where

\(Y\) is the vertical value

\(m\) is the slope (rise over run)

\(x\) is the horizontal value

\(b\) is the y-intercept, or the value of \(Y\) when \(x = 0\)


Equation for a Simple Regression Line:

\(Y = \beta_0 + \beta_1 \cdot X_1\)

Where

\(\beta_0\) is the constant (the intercept)

\(\beta_1\) is the coefficient (the slope) for \(X_1\)
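
To see how these symbols map onto R's output, the coefficients of the model fitted earlier can be extracted directly; a quick sketch, assuming the testing object from the "Fit of the Model" slide:

# beta_0 (intercept) and beta_1 (slope for circumference); these match
# the Estimate column in the summary output above
round(coef(testing), 4)

##   (Intercept) circumference 
##       16.6036        7.8160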

Checking the Model Fit

To assess the fit of the model, we must look at the residuals.
Residuals are the differences between the observed values and the predicted values.

\(\text{Residual} = y - \hat{y}\)

The closer the residuals are to zero, the better the fit.
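
For the "Orange" model fitted earlier, the residuals can be inspected directly; a minimal sketch, again assuming the testing object from the "Fit of the Model" slide:

# Residuals stored in the fitted model: observed age minus predicted age
head(residuals(testing))

# The same values computed by hand as y - y_hat
head(Orange$age - predict(testing))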



The coefficient of determination, denoted \(R^2\), measures how well the model predicts the outcome.
The formula for the coefficient of determination is:

\(R^2 = 1 - \frac{RSS}{TSS}\)

Where

\(RSS\) is the residual sum of squares

\(TSS\) is the total sum of squares


Looking back at the data set “Orange”, \(R^2 = 0.8345\) indicates a strong fit: about 83% of the variation in age is explained by circumference.
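
As a sanity check, \(R^2\) can also be computed by hand from the fitted model and compared with the value reported by summary(); a short sketch:

rss <- sum(residuals(testing)^2)               # residual sum of squares
tss <- sum((Orange$age - mean(Orange$age))^2)  # total sum of squares
1 - rss / tss                                  # matches Multiple R-squared: 0.8345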

Example of Linear Regression II

Here is another look at linear regression using the data set “women”, which gives the average heights and weights of American women aged 30 to 39.
From this model we can observe that as weight increases, height increases in a predictable way as well.
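
The slide does not show the code for this figure; the sketch below reproduces a comparable fit with the same ggplot2 approach as before (here weight is modeled as a function of height; the original figure's styling, and which variable it treated as the response, are assumptions):

data(women)

# Simple regression on the built-in women data set
women_fit <- lm(weight ~ height, data = women)
summary(women_fit)

ggplot(women, aes(x = height, y = weight)) +
  geom_point() +
  geom_smooth(method = "lm") +
  xlab("Height (in)") +
  ylab("Weight (lbs)")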

Example of Linear Regression III

Here is a multiple regression, visualized as a 3D view of the previous “Orange” data.
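
The original 3D figure is not reproduced here. One way to sketch a comparable multiple regression is to treat the tree's numeric code as a second predictor (an assumption about what the slide plotted) and draw the three variables with the scatterplot3d package:

library(scatterplot3d)  # assumed plotting package for the 3D view

# Multiple regression: age predicted from circumference and the
# numeric code of each tree (Orange$Tree is an ordered factor)
Orange$tree_num <- as.numeric(Orange$Tree)
multi_fit <- lm(age ~ circumference + tree_num, data = Orange)
summary(multi_fit)

# 3D scatter of the two predictors against the response,
# with the fitted regression plane overlaid
s3d <- scatterplot3d(Orange$circumference, Orange$tree_num, Orange$age,
                     xlab = "Circumference (mm)", ylab = "Tree", zlab = "Age (days)")
s3d$plane3d(multi_fit)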