Linear regression is used to predict the value of one variable (y) based on the value of another variable (x).
In other words, it allows us to identify trends in the data and make predictions about future values.
The results are easy to interpret, and linear regression is a reliable statistical model.
Where is Linear Regression Used?
The following is an example using linear regression with the R data set “Orange”, which is about the growth of orange trees.
The following code was used to produce the linear regression figure:
library(ggplot2)

data(Orange)
orange_graph <- ggplot(Orange, aes(x = age, y = circumference)) +
  ggtitle("Orange Trees, Age vs Circumference") +
  theme(plot.title = element_text(hjust = 0.5)) +   # center the title
  geom_point(aes(col = Tree, size = Tree)) +        # color and size points by tree
  geom_smooth(method = "lm") +                      # overlay the fitted regression line
  xlab("Age (days)") +
  ylab("Circumference (mm)")
orange_graph
The following allows us to look at the model in more detail.
data(Orange)
testing <- lm(age ~ circumference, data = Orange)
summary(testing)
## 
## Call:
## lm(formula = age ~ circumference, data = Orange)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -317.88 -140.90  -17.20   96.54  471.16 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    16.6036    78.1406   0.212    0.833    
## circumference   7.8160     0.6059  12.900 1.93e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 203.1 on 33 degrees of freedom
## Multiple R-squared:  0.8345, Adjusted R-squared:  0.8295 
## F-statistic: 166.4 on 1 and 33 DF,  p-value: 1.931e-14
The regression line follows the familiar slope-intercept form:
\({Y} = {m} \cdot {x} + {b}\)
where \({Y}\) is the vertical value,
\({m}\) is the slope (rise over run),
\({x}\) is the horizontal value, and
\({b}\) is the y-intercept, or the value of \({Y}\) when \({x=0}\).
In regression notation, the same line is written as
\({Y} = \beta_0 + \beta_1 \cdot {X}_1\)
where \({\beta_0}\) is the constant (intercept) and
\({\beta_1}\) is the coefficient for \({X}_1\).
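Substituting the coefficient estimates from the “Orange” summary above into this form gives the fitted line, which predicts a tree's age from its circumference:
\(\widehat{Y} = 16.6036 + 7.8160 \cdot {X}_1\)
For example, a tree with a circumference of 100 mm has a predicted age of roughly \(16.6 + 7.816 \cdot 100 \approx 798\) days.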
A residual is the difference between an observed value and the value predicted by the line:
\(Residual = {y} - \hat{y}\)
The closer the residuals are to zero, the better the fit.
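As a quick sketch, the residuals of the "testing" model fitted above can be inspected directly in R with the built-in residuals() and fitted() functions:

# Residuals of the fitted Orange model: observed age minus predicted age
res <- residuals(testing)    # equivalent to Orange$age - fitted(testing)
summary(res)                 # closely matches the Residuals block in summary(testing)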
The \(R^2\) statistic measures the proportion of the variation in \(y\) that the model explains:
\(R^2 = 1 - RSS/TSS\)
where \(RSS\) is the residual sum of squares and
\(TSS\) is the total sum of squares.
Looking back at the data set “Orange”, \(R^2 = 0.8345\) means the model explains about 83% of the variation in age, indicating a strong linear relationship.
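As a check, \(R^2\) can be recomputed by hand from the same model; this short sketch reproduces the value reported by summary(testing):

rss <- sum(residuals(testing)^2)                # residual sum of squares
tss <- sum((Orange$age - mean(Orange$age))^2)   # total sum of squares
1 - rss / tss                                   # approximately 0.8345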
Here is another look at linear regression using the data set “women”, which gives the average heights and weights of American women aged 30 to 39.
From this model we can observe that as weight increases, height increases in a predictable way.
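The original figure is not reproduced here; a minimal sketch of an equivalent plot, assuming the same ggplot2 approach used for the “Orange” figure, might look like this:

library(ggplot2)

data(women)
ggplot(women, aes(x = weight, y = height)) +
  geom_point() +
  geom_smooth(method = "lm") +   # fitted regression line
  xlab("Weight (lbs)") +
  ylab("Height (in)")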
Here is a multiple regression, shown as a 3D view, of the previous “Orange” data.
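The original 3D figure and its code are not shown; as one possible sketch (the scatterplot3d package and the use of the tree identifier as a second predictor are assumptions, not the original method), a regression plane can be drawn over the Orange data like this:

library(scatterplot3d)   # one possible package choice for a 3D view (an assumption)

data(Orange)
Orange$tree_id <- as.numeric(Orange$Tree)   # encode the Tree factor numerically
# Multiple regression: circumference predicted from two variables
fit <- lm(circumference ~ age + tree_id, data = Orange)
s3d <- scatterplot3d(Orange$age, Orange$tree_id, Orange$circumference,
                     xlab = "Age (days)", ylab = "Tree",
                     zlab = "Circumference (mm)")
s3d$plane3d(fit)   # overlay the fitted regression plane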