Simple Linear Regression Without and With the lm() Function

Muhammad Athanabil Andi Fawazdhia

2022-12-28

Preliminary

Simple linear regression is one of the most basic models among the many types of regression in use today. It uses only one independent variable. I will build a simple linear regression step by step and compare it with the lm() function in R. Check it out below!

Simple Linear Regression Formula

\(\hat{Y} = b_{0} + b_{1}X\)

where \(\hat{Y}\) is the predicted value of Y, and \(\bar{X}\) and \(\bar{Y}\) denote the sample means. The algorithm is as follows:

First, find the \(X_{i} - \bar{X}\) and \(Y_{i} - \bar{Y}\) values.

Then find the \((X_{i} - \bar{X})^{2}\) values.

After that, multiply \((X_{i} - \bar{X})\) by \((Y_{i} - \bar{Y})\).

Find the \(\sum_{i=1}^{20} (X_{i} - \bar{X})^{2}\) and \(\sum_{i=1}^{20} (X_{i} - \bar{X})(Y_{i} - \bar{Y})\) values.

The fifth step is to find the \(b_{0}\) and \(b_{1}\) values.

You can get \(b_{1}\) with \(\frac{\sum_{i=1}^{20} (X_{i} - \bar{X})(Y_{i} - \bar{Y})}{\sum_{i=1}^{20} (X_{i} - \bar{X})^{2}}\)

And \(b_{0}\) with \(\bar{Y} - b_{1}\bar{X}\)

Finally, substitute the values into the simple linear regression formula.
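
As a quick preview, the whole computation can also be written in a few lines of vectorized R; a minimal sketch, assuming x and y are the numeric vectors defined in the next section:

# Slope: sum of cross-deviations divided by sum of squared x-deviations
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
# Intercept: the fitted line passes through the point of means
b0 <- mean(y) - b1 * mean(x)

The steps below reproduce exactly this calculation, one column at a time.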

Data Used

x <- c(48,63,36,24,24,55,25,24,46,55,20,51,41,21,49,31,31,26,43,52)
y <- c(45,46,49,56,68,62,56,58,45,36,74,53,75,57,50,39,62,73,57,50)

df <- data.frame(x,y)
df
##     x  y
## 1  48 45
## 2  63 46
## 3  36 49
## 4  24 56
## 5  24 68
## 6  55 62
## 7  25 56
## 8  24 58
## 9  46 45
## 10 55 36
## 11 20 74
## 12 51 53
## 13 41 75
## 14 21 57
## 15 49 50
## 16 31 39
## 17 31 62
## 18 26 73
## 19 43 57
## 20 52 50

Summary Statistics

summary(df$x)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20.00   24.75   38.50   38.25   49.50   63.00
summary(df$y)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   36.00   48.25   56.00   55.55   62.00   75.00

Data Dimension

dim(df)
## [1] 20  2

Scatter Plot

plot(df$x,df$y,
     pch = 19,
     xlab='x',
     ylab='y',
     main= 'Scatter Plot',
     col='red')

Making a Copy of the Original Data Frame

df_lm <- df

Find the \(X_{i} - \bar{X}\) Values

xx <- c()
for (i in seq_along(x)){
  # deviation of each x value from the mean of x
  xx <- append(xx, x[i] - mean(x))
}

df_lm$xx <- xx
df_lm
##     x  y     xx
## 1  48 45   9.75
## 2  63 46  24.75
## 3  36 49  -2.25
## 4  24 56 -14.25
## 5  24 68 -14.25
## 6  55 62  16.75
## 7  25 56 -13.25
## 8  24 58 -14.25
## 9  46 45   7.75
## 10 55 36  16.75
## 11 20 74 -18.25
## 12 51 53  12.75
## 13 41 75   2.75
## 14 21 57 -17.25
## 15 49 50  10.75
## 16 31 39  -7.25
## 17 31 62  -7.25
## 18 26 73 -12.25
## 19 43 57   4.75
## 20 52 50  13.75

Find the \(Y_{i} - \bar{Y}\) Values

yy <- c()
for (i in seq_along(y)){
  # deviation of each y value from the mean of y
  yy <- append(yy, y[i] - mean(y))
}

df_lm$yy <- yy
df_lm
##     x  y     xx     yy
## 1  48 45   9.75 -10.55
## 2  63 46  24.75  -9.55
## 3  36 49  -2.25  -6.55
## 4  24 56 -14.25   0.45
## 5  24 68 -14.25  12.45
## 6  55 62  16.75   6.45
## 7  25 56 -13.25   0.45
## 8  24 58 -14.25   2.45
## 9  46 45   7.75 -10.55
## 10 55 36  16.75 -19.55
## 11 20 74 -18.25  18.45
## 12 51 53  12.75  -2.55
## 13 41 75   2.75  19.45
## 14 21 57 -17.25   1.45
## 15 49 50  10.75  -5.55
## 16 31 39  -7.25 -16.55
## 17 31 62  -7.25   6.45
## 18 26 73 -12.25  17.45
## 19 43 57   4.75   1.45
## 20 52 50  13.75  -5.55
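
As a side note, both deviation columns can be obtained without explicit loops, because arithmetic in R is vectorized; a minimal sketch that should reproduce the same xx and yy values:

xx_vec <- x - mean(x)   # same values as the xx loop above
yy_vec <- y - mean(y)   # same values as the yy loop above
all.equal(xx, xx_vec)   # expected TRUE
all.equal(yy, yy_vec)   # expected TRUE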

Find the \((X_{i} - \bar{X})^{2}\) Values

xx2 <- xx^2
df_lm$xx2 <- xx2
df_lm
##     x  y     xx     yy      xx2
## 1  48 45   9.75 -10.55  95.0625
## 2  63 46  24.75  -9.55 612.5625
## 3  36 49  -2.25  -6.55   5.0625
## 4  24 56 -14.25   0.45 203.0625
## 5  24 68 -14.25  12.45 203.0625
## 6  55 62  16.75   6.45 280.5625
## 7  25 56 -13.25   0.45 175.5625
## 8  24 58 -14.25   2.45 203.0625
## 9  46 45   7.75 -10.55  60.0625
## 10 55 36  16.75 -19.55 280.5625
## 11 20 74 -18.25  18.45 333.0625
## 12 51 53  12.75  -2.55 162.5625
## 13 41 75   2.75  19.45   7.5625
## 14 21 57 -17.25   1.45 297.5625
## 15 49 50  10.75  -5.55 115.5625
## 16 31 39  -7.25 -16.55  52.5625
## 17 31 62  -7.25   6.45  52.5625
## 18 26 73 -12.25  17.45 150.0625
## 19 43 57   4.75   1.45  22.5625
## 20 52 50  13.75  -5.55 189.0625

Find the \((X_{i} - \bar{X})(Y_{i} - \bar{Y})\) Values

xy <- xx*yy
df_lm$xy <- xy
df_lm
##     x  y     xx     yy      xx2        xy
## 1  48 45   9.75 -10.55  95.0625 -102.8625
## 2  63 46  24.75  -9.55 612.5625 -236.3625
## 3  36 49  -2.25  -6.55   5.0625   14.7375
## 4  24 56 -14.25   0.45 203.0625   -6.4125
## 5  24 68 -14.25  12.45 203.0625 -177.4125
## 6  55 62  16.75   6.45 280.5625  108.0375
## 7  25 56 -13.25   0.45 175.5625   -5.9625
## 8  24 58 -14.25   2.45 203.0625  -34.9125
## 9  46 45   7.75 -10.55  60.0625  -81.7625
## 10 55 36  16.75 -19.55 280.5625 -327.4625
## 11 20 74 -18.25  18.45 333.0625 -336.7125
## 12 51 53  12.75  -2.55 162.5625  -32.5125
## 13 41 75   2.75  19.45   7.5625   53.4875
## 14 21 57 -17.25   1.45 297.5625  -25.0125
## 15 49 50  10.75  -5.55 115.5625  -59.6625
## 16 31 39  -7.25 -16.55  52.5625  119.9875
## 17 31 62  -7.25   6.45  52.5625  -46.7625
## 18 26 73 -12.25  17.45 150.0625 -213.7625
## 19 43 57   4.75   1.45  22.5625    6.8875
## 20 52 50  13.75  -5.55 189.0625  -76.3125

Find the \(\sum_{i=1}^{20} (X_{i} - \bar{X})^{2}\) and \(\sum_{i=1}^{20} (X_{i} - \bar{X})(Y_{i} - \bar{Y})\) Values

sum_xx <- sum(xx2)
sum_xy <- sum(xy)
sum_xx
## [1] 3501.75
sum_xy
## [1] -1460.75

Find the \(b_{0}\) and \(b_{1}\) Values

b1 <- sum_xy / sum_xx         # slope
b0 <- mean(y) - b1 * mean(x)  # intercept
b1
## [1] -0.4171486
b0
## [1] 71.50593
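
As a quick sanity check, \(b_{1}\) also equals the sample covariance of x and y divided by the sample variance of x, because the (n - 1) factors cancel; a minimal check:

cov(x, y) / var(x)                         # should match b1 (about -0.4171)
mean(y) - (cov(x, y) / var(x)) * mean(x)   # should match b0 (about 71.51)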

Simple Linear Regression Using lm()

lm_model <- lm(y ~ x, data = df)
summary(lm_model)
## 
## Call:
## lm(formula = y ~ x, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.5743  -5.9301  -0.4399   4.2000  20.5972 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  71.5059     6.7316  10.622  3.5e-09 ***
## x            -0.4171     0.1663  -2.508   0.0219 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.842 on 18 degrees of freedom
## Multiple R-squared:  0.259,  Adjusted R-squared:  0.2178 
## F-statistic: 6.291 on 1 and 18 DF,  p-value: 0.02194

The values \(b_{0} = 71.5059\) and \(b_{1} = -0.4171\) are the same results we computed before.
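
To compare the coefficients with the manual computation directly, they can also be extracted from the fitted model; for example:

coef(lm_model)   # intercept and slope, should match b0 and b1 above
# Optionally, draw the fitted line on the earlier scatter plot:
# abline(lm_model, col = 'blue')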

Conclusion

After computing simple linear regression without a function and with the lm() function, both approaches give the same results, and we can follow the full flow of the simple linear regression computation. The fitted model is

\(\hat{Y} = 71.5059 - 0.4171X\)

This means x and y have a negative relationship: a higher x value corresponds to a lower predicted y value, and vice versa.
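
For illustration, the fitted model can also be used to predict y for a new x value; a small sketch with a hypothetical x = 40:

# predicted y at x = 40: roughly 71.5059 - 0.4171 * 40, i.e. about 54.8
predict(lm_model, newdata = data.frame(x = 40))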