These models can be fitted without any manual calculation, which is useful for larger or more complex data sets.
In R, a simple linear regression can be fitted with the lm function, and the fitted function returns the model's predicted values, as shown below using the mtcars data set.
library(plotly)  # provides plot_ly(), add_lines(), layout(), config() and %>%

data(mtcars)
# Fit horsepower as a linear function of displacement
mod <- lm(hp ~ disp, data = mtcars)
x <- mtcars$disp
y <- mtcars$hp

# Axis titles and fonts
xax <- list(
  title = "Displacement",
  titlefont = list(family = "Modern Computer Roman")
)
yax <- list(
  title = "Horsepower",
  titlefont = list(family = "Modern Computer Roman"),
  range = c(0, 300)
)

# Scatter plot of the data with the fitted regression line overlaid
fig <- plot_ly(x = x, y = y, type = "scatter", mode = "markers", name = "data",
               width = 800, height = 430) %>%
  add_lines(x = x, y = fitted(mod), name = "fitted") %>%
  layout(xaxis = xax, yaxis = yax) %>%
  layout(margin = list(l = 150, r = 50, b = 20, t = 40))
config(fig, displaylogo = TRUE)
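To make the role of fitted more concrete, the following minimal sketch rebuilds the fitted values from the estimated intercept and slope. The names b0, b1 and manual_fit are introduced here for illustration only and are not part of the original example.

```r
# Minimal sketch: reproduce fitted(mod) from the estimated coefficients.
data(mtcars)
mod <- lm(hp ~ disp, data = mtcars)

b0 <- coef(mod)[["(Intercept)"]]  # estimated intercept (illustrative name)
b1 <- coef(mod)[["disp"]]         # estimated slope for disp (illustrative name)

# Each fitted value is the point on the regression line at the observed disp.
manual_fit <- b0 + b1 * mtcars$disp

# TRUE when the manual values match fitted() up to floating-point tolerance.
all.equal(unname(fitted(mod)), manual_fit)
```

The comparison uses all.equal rather than == because floating-point results from two different computation paths are not guaranteed to be bit-identical.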
Determining the Accuracy of These Models
As mentioned earlier, the accuracy of these models can be assessed with \(R^2\): the closer it is to 1, the better the model fits the data.
Most statistical software, including R, reports this value.
Using Example 3, the summary function reports \(R^2\). The output shows that \(R^2 = 0.6256\) (adjusted \(R^2 = 0.6131\)), so displacement explains roughly 63% of the variation in horsepower. This indicates a moderately strong linear relationship, but the linear regression model still leaves a substantial share of the variation unexplained.
summary(mod)
##
## Call:
## lm(formula = hp ~ disp, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -48.623 -28.378  -6.558  13.588 157.562
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  45.7345    16.1289   2.836  0.00811 **
## disp          0.4375     0.0618   7.080 7.14e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.65 on 30 degrees of freedom
## Multiple R-squared: 0.6256, Adjusted R-squared: 0.6131
## F-statistic: 50.13 on 1 and 30 DF, p-value: 7.143e-08
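Rather than reading the values off the printed output, they can also be pulled out of the summary object directly. The sketch below assumes the same mtcars model; the variable name s is illustrative. It also checks that, for a regression with a single predictor, \(R^2\) equals the squared Pearson correlation between the two variables.

```r
# Minimal sketch: extract R-squared values from the summary object.
data(mtcars)
mod <- lm(hp ~ disp, data = mtcars)
s <- summary(mod)  # illustrative name for the summary object

s$r.squared      # multiple R-squared
s$adj.r.squared  # adjusted R-squared (penalized for the number of predictors)

# With one predictor, R-squared equals the squared correlation of x and y.
all.equal(s$r.squared, cor(mtcars$disp, mtcars$hp)^2)
```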
R-squared by Hand
\(R^2\) for simple regression can be calculated using the formula:
\(R^2 = \frac{(n\sum xy - (\sum x)(\sum y))^2}{(n\sum x^2 - (\sum x)^2)(n\sum y^2 - (\sum y)^2)}\)
where \(n\) is the number of observations in the data set.
This is a tedious computation that becomes more time-consuming as more observations are added.
For large data sets, it is therefore recommended to compute \(R^2\) with software or online tools.
Calculation Example Using Example 1 with Code
x <- c(1, 2, 3, 4, 5)
y <- c(3, 7, 10, 14, 17)
n <- length(x)
numerator <- (n * sum(x * y) - sum(x) * sum(y))^2
denominator <- (n * sum(x^2) - (sum(x))^2) * (n * sum(y^2) - (sum(y))^2)
R2 <- numerator / denominator
R2
## [1] 0.997557
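As a check on the result above, the same value follows from substituting the sums of the Example 1 data into the formula. The intermediate sums are worked out here for illustration:
\(n = 5,\ \sum x = 15,\ \sum y = 51,\ \sum xy = 188,\ \sum x^2 = 55,\ \sum y^2 = 643\)
\(R^2 = \frac{(5 \cdot 188 - 15 \cdot 51)^2}{(5 \cdot 55 - 15^2)(5 \cdot 643 - 51^2)} = \frac{175^2}{50 \cdot 614} = \frac{30625}{30700} \approx 0.9976\)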
Comparison of the Two Methods
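The hand formula and R's built-in functions can be checked against each other.
Going back to Example 1, the previous slide calculated \(R^2 = 0.9976\).
The same value can be extracted from the summary of an lm fit through its r.squared component.
For this example both values agree; in other data sets they may differ slightly because of rounding in hand calculations or because of how missing values are handled.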
summary(lm(y ~ x))$r.squared
## [1] 0.997557
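The same comparison can be repeated on a larger data set. The sketch below wraps the hand formula in a helper, r_squared_by_hand, which is a hypothetical name introduced here and not part of the original slides, and compares it with lm on the mtcars data. It assumes x and y contain no missing values.

```r
# Hypothetical helper (not from the slides): R^2 from the sums formula.
# Assumes x and y are numeric, equal in length, and free of NA values.
r_squared_by_hand <- function(x, y) {
  n <- length(x)
  numerator <- (n * sum(x * y) - sum(x) * sum(y))^2
  denominator <- (n * sum(x^2) - sum(x)^2) * (n * sum(y^2) - sum(y)^2)
  numerator / denominator
}

data(mtcars)
by_hand <- r_squared_by_hand(mtcars$disp, mtcars$hp)
from_lm <- summary(lm(hp ~ disp, data = mtcars))$r.squared

# all.equal() compares with a small numeric tolerance instead of exact equality.
all.equal(by_hand, from_lm)
```

If the data did contain NA values, the sums in the helper would return NA, whereas lm by default drops incomplete rows before fitting. This is one way the two methods can disagree.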
Thank you!