Preliminary
Simple linear regression is one of the most basic regression models. It uses only one independent variable. I will build a simple linear regression step by step and compare the result with the lm() function in R. Let's walk through it below!
Simple Linear Regression Formula
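\(\hat{Y} = b_{0} + b_{1}X\), where \(\hat{Y}\) is the predicted value of Y, \(b_{0}\) is the intercept, and \(b_{1}\) is the slope.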
The algorithm is as follows:
First, find \(X_{i} - \tilde{X}\) and \(Y_{i} - \tilde{Y}\), where \(\tilde{X}\) and \(\tilde{Y}\) are the sample means of X and Y.
Then find \((X_{i} - \tilde{X})^{2}\).
After that, multiply \((X_{i} - \tilde{X})\) by \((Y_{i} - \tilde{Y})\).
Fourth, find the sums \(\sum_{i=1}^{20} (X_{i} - \tilde{X})^{2}\) and \(\sum_{i=1}^{20} (X_{i} - \tilde{X})(Y_{i} - \tilde{Y})\).
The fifth step is to find \(b_{0}\) and \(b_{1}\).
You can get \(b_{1}\) with \(\frac{\sum_{i=1}^{20} (X_{i} - \tilde{X})(Y_{i} - \tilde{Y})}{\sum_{i=1}^{20} (X_{i} - \tilde{X})^{2}}\)
And \(b_{0}\) with \(\tilde{Y}-b_{1}\tilde{X}\)
Finally, substitute the values into the simple linear regression formula.
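For readers who wonder where the two formulas in the fifth step come from, here is a short sketch of the usual least-squares argument. It assumes the coefficients are chosen to minimize the sum of squared errors, which is the standard criterion behind lm() but is not stated explicitly in the steps above. The symbol \(S\) is introduced here only for illustration.

\[
S(b_{0}, b_{1}) = \sum_{i=1}^{20} \left(Y_{i} - b_{0} - b_{1}X_{i}\right)^{2}
\]

\[
\frac{\partial S}{\partial b_{0}} = -2\sum_{i=1}^{20} \left(Y_{i} - b_{0} - b_{1}X_{i}\right) = 0
\quad\Rightarrow\quad
b_{0} = \tilde{Y} - b_{1}\tilde{X}
\]

\[
\frac{\partial S}{\partial b_{1}} = -2\sum_{i=1}^{20} X_{i}\left(Y_{i} - b_{0} - b_{1}X_{i}\right) = 0
\quad\Rightarrow\quad
\sum_{i=1}^{20} X_{i}\left[(Y_{i} - \tilde{Y}) - b_{1}(X_{i} - \tilde{X})\right] = 0
\]

\[
b_{1} = \frac{\sum_{i=1}^{20} X_{i}(Y_{i} - \tilde{Y})}{\sum_{i=1}^{20} X_{i}(X_{i} - \tilde{X})}
= \frac{\sum_{i=1}^{20} (X_{i} - \tilde{X})(Y_{i} - \tilde{Y})}{\sum_{i=1}^{20} (X_{i} - \tilde{X})^{2}}
\]

The last equality holds because \(\sum_{i=1}^{20} (Y_{i} - \tilde{Y}) = 0\) and \(\sum_{i=1}^{20} (X_{i} - \tilde{X}) = 0\), so subtracting \(\tilde{X}\) from \(X_{i}\) in both sums changes nothing.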
Data Used
x <- c(48,63,36,24,24,55,25,24,46,55,20,51,41,21,49,31,31,26,43,52)
y <- c(45,46,49,56,68,62,56,58,45,36,74,53,75,57,50,39,62,73,57,50)
df <- data.frame(x,y)
df
## x y
## 1 48 45
## 2 63 46
## 3 36 49
## 4 24 56
## 5 24 68
## 6 55 62
## 7 25 56
## 8 24 58
## 9 46 45
## 10 55 36
## 11 20 74
## 12 51 53
## 13 41 75
## 14 21 57
## 15 49 50
## 16 31 39
## 17 31 62
## 18 26 73
## 19 43 57
## 20 52 50
Summary Statistics
summary(df$x)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20.00 24.75 38.50 38.25 49.50 63.00
summary(df$y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 36.00 48.25 56.00 55.55 62.00 75.00
Data Dimension
dim(df)
## [1] 20 2
Scatter Plot
plot(df$x,df$y,
pch = 19,
xlab='x',
ylab='y',
main= 'Scatter Plot',
col='red')
Making a Copy of the Original Data Frame
df_lm <- df
Find \(X_{i} - \tilde{X}\) Value
# Deviation of each x from the mean of x (vectorized, so no loop is needed)
xx <- x - mean(x)
df_lm$xx <- xx
df_lm
## x y xx
## 1 48 45 9.75
## 2 63 46 24.75
## 3 36 49 -2.25
## 4 24 56 -14.25
## 5 24 68 -14.25
## 6 55 62 16.75
## 7 25 56 -13.25
## 8 24 58 -14.25
## 9 46 45 7.75
## 10 55 36 16.75
## 11 20 74 -18.25
## 12 51 53 12.75
## 13 41 75 2.75
## 14 21 57 -17.25
## 15 49 50 10.75
## 16 31 39 -7.25
## 17 31 62 -7.25
## 18 26 73 -12.25
## 19 43 57 4.75
## 20 52 50 13.75
Find \(Y_{i} - \tilde{Y}\) Value
# Deviation of each y from the mean of y (vectorized, so no loop is needed)
yy <- y - mean(y)
df_lm$yy <- yy
df_lm
## x y xx yy
## 1 48 45 9.75 -10.55
## 2 63 46 24.75 -9.55
## 3 36 49 -2.25 -6.55
## 4 24 56 -14.25 0.45
## 5 24 68 -14.25 12.45
## 6 55 62 16.75 6.45
## 7 25 56 -13.25 0.45
## 8 24 58 -14.25 2.45
## 9 46 45 7.75 -10.55
## 10 55 36 16.75 -19.55
## 11 20 74 -18.25 18.45
## 12 51 53 12.75 -2.55
## 13 41 75 2.75 19.45
## 14 21 57 -17.25 1.45
## 15 49 50 10.75 -5.55
## 16 31 39 -7.25 -16.55
## 17 31 62 -7.25 6.45
## 18 26 73 -12.25 17.45
## 19 43 57 4.75 1.45
## 20 52 50 13.75 -5.55
Find \((X_{i} - \tilde{X})^{2}\) Value
xx2 <- xx^2
df_lm$xx2 <- xx2
df_lm
## x y xx yy xx2
## 1 48 45 9.75 -10.55 95.0625
## 2 63 46 24.75 -9.55 612.5625
## 3 36 49 -2.25 -6.55 5.0625
## 4 24 56 -14.25 0.45 203.0625
## 5 24 68 -14.25 12.45 203.0625
## 6 55 62 16.75 6.45 280.5625
## 7 25 56 -13.25 0.45 175.5625
## 8 24 58 -14.25 2.45 203.0625
## 9 46 45 7.75 -10.55 60.0625
## 10 55 36 16.75 -19.55 280.5625
## 11 20 74 -18.25 18.45 333.0625
## 12 51 53 12.75 -2.55 162.5625
## 13 41 75 2.75 19.45 7.5625
## 14 21 57 -17.25 1.45 297.5625
## 15 49 50 10.75 -5.55 115.5625
## 16 31 39 -7.25 -16.55 52.5625
## 17 31 62 -7.25 6.45 52.5625
## 18 26 73 -12.25 17.45 150.0625
## 19 43 57 4.75 1.45 22.5625
## 20 52 50 13.75 -5.55 189.0625
Find \((X_{i} - \tilde{X})(Y_{i} - \tilde{Y})\) Value
xy <- xx*yy
df_lm$xy <- xy
df_lm
## x y xx yy xx2 xy
## 1 48 45 9.75 -10.55 95.0625 -102.8625
## 2 63 46 24.75 -9.55 612.5625 -236.3625
## 3 36 49 -2.25 -6.55 5.0625 14.7375
## 4 24 56 -14.25 0.45 203.0625 -6.4125
## 5 24 68 -14.25 12.45 203.0625 -177.4125
## 6 55 62 16.75 6.45 280.5625 108.0375
## 7 25 56 -13.25 0.45 175.5625 -5.9625
## 8 24 58 -14.25 2.45 203.0625 -34.9125
## 9 46 45 7.75 -10.55 60.0625 -81.7625
## 10 55 36 16.75 -19.55 280.5625 -327.4625
## 11 20 74 -18.25 18.45 333.0625 -336.7125
## 12 51 53 12.75 -2.55 162.5625 -32.5125
## 13 41 75 2.75 19.45 7.5625 53.4875
## 14 21 57 -17.25 1.45 297.5625 -25.0125
## 15 49 50 10.75 -5.55 115.5625 -59.6625
## 16 31 39 -7.25 -16.55 52.5625 119.9875
## 17 31 62 -7.25 6.45 52.5625 -46.7625
## 18 26 73 -12.25 17.45 150.0625 -213.7625
## 19 43 57 4.75 1.45 22.5625 6.8875
## 20 52 50 13.75 -5.55 189.0625 -76.3125
Find \(\sum_{i=1}^{20} (X_{i} - \tilde{X})^{2}\) and \(\sum_{i=1}^{20} (X_{i} - \tilde{X})(Y_{i} - \tilde{Y})\) Value
sum_xx <- sum(xx2)
sum_xy <- sum(xy)
sum_xx
## [1] 3501.75
sum_xy
## [1] -1460.75
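As a sanity check on the two sums, they can also be recovered from R's built-in var() and cov(), which divide by n - 1. This is only a cross-check sketch, not part of the original walkthrough; the names sum_xx_check and sum_xy_check are made up for this example.

```r
x <- c(48,63,36,24,24,55,25,24,46,55,20,51,41,21,49,31,31,26,43,52)
y <- c(45,46,49,56,68,62,56,58,45,36,74,53,75,57,50,39,62,73,57,50)
n <- length(x)

# var() and cov() use the n - 1 denominator, so multiplying by (n - 1)
# gives back the raw sums of squares and cross-products.
sum_xx_check <- (n - 1) * var(x)
sum_xy_check <- (n - 1) * cov(x, y)

# Compare against the step-by-step sums
all.equal(sum_xx_check, sum((x - mean(x))^2))
all.equal(sum_xy_check, sum((x - mean(x)) * (y - mean(y))))
```

If both comparisons return TRUE, the manual columns xx2 and xy were built correctly.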
Find \(b_{0}\) and \(b_{1}\) Value
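# Slope: sum of cross-products divided by the sum of squared x deviations
b1 <- sum_xy/sum_xx
# Intercept: the fitted line passes through (mean(x), mean(y))
b0 <- mean(y) - b1*mean(x)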
b1
## [1] -0.4171486
b0
## [1] 71.50593
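The steps so far can be collected into one small function, which also makes the final "substitute into the formula" step concrete. This is a minimal sketch: fit_simple_lm is a hypothetical helper name, and x_new = 40 is an arbitrary value chosen only to illustrate a prediction.

```r
x <- c(48,63,36,24,24,55,25,24,46,55,20,51,41,21,49,31,31,26,43,52)
y <- c(45,46,49,56,68,62,56,58,45,36,74,53,75,57,50,39,62,73,57,50)

# Hypothetical helper (not from the original post): returns intercept and slope
fit_simple_lm <- function(x, y) {
  stopifnot(length(x) == length(y), length(x) > 1)
  dx <- x - mean(x)                 # deviations of x from its mean
  dy <- y - mean(y)                 # deviations of y from its mean
  b1 <- sum(dx * dy) / sum(dx^2)    # slope
  b0 <- mean(y) - b1 * mean(x)      # intercept
  list(b0 = b0, b1 = b1)
}

fit <- fit_simple_lm(x, y)

# Substitute into the regression formula to get fitted values and residuals
y_hat <- fit$b0 + fit$b1 * x
res <- y - y_hat
sum(res)   # should be numerically close to zero for a least-squares fit

# Prediction for an illustrative new x value
x_new <- 40
fit$b0 + fit$b1 * x_new
```

Keeping the computation in a function makes it easy to reuse on another pair of vectors without rebuilding the intermediate columns.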
Simple Linear Regression Using lm()
lm_model <- lm(y ~ x, data = df)
summary(lm_model)
##
## Call:
## lm(formula = y ~ x, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.5743 -5.9301 -0.4399 4.2000 20.5972
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71.5059 6.7316 10.622 3.5e-09 ***
## x -0.4171 0.1663 -2.508 0.0219 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.842 on 18 degrees of freedom
## Multiple R-squared: 0.259, Adjusted R-squared: 0.2178
## F-statistic: 6.291 on 1 and 18 DF, p-value: 0.02194
The values \(b_{0} = 71.5059\) and \(b_{1} = -0.4171\) are the same as the results we computed before.
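Instead of comparing the printed numbers by eye, the two sets of coefficients can be compared programmatically. The sketch below recomputes both so that it runs on its own; manual and from_lm are names invented for this example.

```r
x <- c(48,63,36,24,24,55,25,24,46,55,20,51,41,21,49,31,31,26,43,52)
y <- c(45,46,49,56,68,62,56,58,45,36,74,53,75,57,50,39,62,73,57,50)
df <- data.frame(x, y)

# Manual coefficients, computed as in the step-by-step sections
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
manual <- c(b0, b1)

# Coefficients from lm(); unname() drops the "(Intercept)" and "x" labels
from_lm <- unname(coef(lm(y ~ x, data = df)))

# all.equal() compares with a small numerical tolerance
all.equal(manual, from_lm)
```

A tolerance-based comparison is used rather than `==` because the two computations take different numerical routes and may differ in the last few floating-point digits.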
Conclusion
After computing simple linear regression both manually and with the lm() function, we get the same results, and we can understand the computation flow of simple linear regression. The fitted model is
\(\hat{Y}= 71.5059 - 0.4171X\)
This means x and y have a negative linear relationship: each one-unit increase in x lowers the predicted y by about 0.4171, so higher x values are associated with lower y values and vice versa.
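The link between the slope and the correlation can be checked directly. In simple linear regression the slope equals the correlation coefficient scaled by sd(y)/sd(x), and the squared correlation equals the Multiple R-squared reported by summary(). The sketch below assumes the same data as above; r, b1_from_r, and r_squared are names chosen for this example.

```r
x <- c(48,63,36,24,24,55,25,24,46,55,20,51,41,21,49,31,31,26,43,52)
y <- c(45,46,49,56,68,62,56,58,45,36,74,53,75,57,50,39,62,73,57,50)

# Pearson correlation between x and y; its sign matches the sign of the slope
r <- cor(x, y)

# Slope recovered from the correlation and the two standard deviations
b1_from_r <- r * sd(y) / sd(x)

# Squared correlation, comparable to "Multiple R-squared" in summary(lm_model)
r_squared <- r^2

r
b1_from_r
r_squared
```

Since the R-squared is about 0.259, x explains only roughly a quarter of the variation in y, so the negative relationship is present but fairly weak.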