Simple linear regression is a statistical method used to model the relationship between two quantitative variables: one independent variable (X) and one dependent variable (Y).
2024-03-21
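The simple linear regression model can be written as follows, where \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and \(\epsilon\) is a random error term capturing the variation in \(Y\) that is not explained by \(X\):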
\[ Y = \beta_0 + \beta_1X + \epsilon \]
In simple linear regression, the parameters \(\beta_0\) and \(\beta_1\) are estimated by least squares, which chooses the values that minimize the sum of squared differences between the observed and fitted values of \(Y\). The resulting estimates are:
\[ \hat{\beta_1} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \]
\[ \hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x} \]
where:

- \(\hat{\beta_1}\) is the estimated slope.
- \(\hat{\beta_0}\) is the estimated intercept.
- \(\bar{x}\) is the mean of the independent variable \(X\).
- \(\bar{y}\) is the mean of the dependent variable \(Y\).
- \(n\) is the number of observations.
- \(x_i\) and \(y_i\) are the individual observations of \(X\) and \(Y\) respectively.
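The two estimation formulas translate almost directly into R. The following is a minimal sketch, assuming a small made-up dataset (the vectors `x` and `y` and all variable names are illustrative, not from the text); it also cross-checks the hand-computed values against R's built-in `lm()`.

```r
# Illustrative data, invented for this sketch
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Sample means of X and Y
x_bar <- mean(x)
y_bar <- mean(y)

# Slope: sum of cross-deviations divided by sum of squared deviations of x
beta1_hat <- sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)

# Intercept: forces the fitted line through the point (x_bar, y_bar)
beta0_hat <- y_bar - beta1_hat * x_bar

c(intercept = beta0_hat, slope = beta1_hat)

# Cross-check against lm(), which uses the same least squares criterion
fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(beta0_hat, beta1_hat))
```

Computing the estimates by hand like this is mainly useful for understanding; in practice `lm()` is the usual choice because it also returns standard errors and diagnostics.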
Once the parameters \(\beta_0\) and \(\beta_1\) are estimated, the prediction equation for simple linear regression is given by:
\[ \hat{Y} = \hat{\beta_0} + \hat{\beta_1}X \]
where:

- \(\hat{Y}\) is the predicted value of the dependent variable \(Y\).
- \(\hat{\beta_0}\) and \(\hat{\beta_1}\) are the estimated intercept and slope respectively.
- \(X\) is the value of the independent variable for which the prediction is being made.
This equation allows us to predict the value of the dependent variable \(Y\) for any given value of the independent variable \(X\) based on the estimated parameters of the regression model.
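As a small illustration, the prediction equation can be wrapped in a function. This is only a sketch: the function name `predict_y` and the coefficient values passed to it are hypothetical, chosen to make the arithmetic easy to follow.

```r
# Prediction equation: Y_hat = beta0_hat + beta1_hat * X
# Works for a single value or a vector of new X values
predict_y <- function(x_new, beta0_hat, beta1_hat) {
  beta0_hat + beta1_hat * x_new
}

# Hypothetical estimates: intercept 2, slope 0.5
predict_y(c(10, 20), beta0_hat = 2, beta1_hat = 0.5)
# returns 7 and 12
```

Predictions are most trustworthy for values of \(X\) within the range of the data used to estimate the parameters.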
Let’s consider a hypothetical dataset of house prices ($) and their corresponding areas (sq. ft). We want to predict house prices based on the area.
| Area (sq. ft) | Price ($) |
|---|---|
| 1200 | 150000 |
| 1400 | 170000 |
| 1600 | 190000 |
| 1800 | 210000 |
| 2000 | 230000 |
| 2200 | 250000 |
| 2400 | 270000 |
| 2600 | 290000 |
| 2800 | 310000 |
| 3000 | 330000 |
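Before turning to software, it may help to apply the least squares formulas to this table by hand. The following worked calculation uses only the ten rows above, with \(n = 10\):

\[ \bar{x} = \frac{21000}{10} = 2100, \qquad \bar{y} = \frac{2400000}{10} = 240000 \]

\[ \sum_{i=1}^{n} (x_i - \bar{x})^2 = 3300000, \qquad \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = 330000000 \]

\[ \hat{\beta_1} = \frac{330000000}{3300000} = 100, \qquad \hat{\beta_0} = 240000 - 100 \times 2100 = 30000 \]

So the fitted line is \(\hat{Y} = 30000 + 100X\): each additional square foot is associated with a $100 increase in predicted price. In this hypothetical dataset every point lies exactly on that line, so the fit is perfect.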
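```r
# Example dataset
house_data <- data.frame(
  area  = c(1200, 1400, 1600, 1800, 2000, 2200, 2400, 2600, 2800, 3000),
  price = c(150000, 170000, 190000, 210000, 230000, 250000, 270000, 290000, 310000, 330000)
)

# Fit the model shown in the output's "Call:" line
# (the object name `model` is an assumption)
model <- lm(price ~ area, data = house_data)
summary(model)
```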
```
## 
## Call:
## lm(formula = price ~ area, data = house_data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -2.369e-11 -1.433e-11 -2.239e-12  1.049e-11  4.196e-11 
## 
## Coefficients:
##              Estimate Std. Error   t value Pr(>|t|)    
## (Intercept) 3.000e+04  2.477e-11 1.211e+15   <2e-16 ***
## area        1.000e+02  1.138e-14 8.789e+15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.067e-11 on 8 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 7.724e+31 on 1 and 8 DF,  p-value: < 2.2e-16
```
```
##      1 
## 180000
```
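The call that produced this prediction is not shown. With the estimated intercept of 30000 and slope of 100, a predicted price of 180000 is consistent with an area of 1500 sq. ft, since \(30000 + 100 \times 1500 = 180000\). The sketch below reproduces it under that assumption; the names `model`, `new_house`, and `predicted_price` are illustrative, and the dataset is rebuilt with `seq()` so the block runs on its own.

```r
# Rebuild the example dataset (same values as the table)
house_data <- data.frame(
  area  = seq(1200, 3000, by = 200),
  price = seq(150000, 330000, by = 20000)
)

# Fit price as a linear function of area
model <- lm(price ~ area, data = house_data)
coef(model)

# Assumed input: a 1500 sq. ft house (inferred from the output, not stated in the text)
new_house <- data.frame(area = 1500)
predicted_price <- predict(model, newdata = new_house)
predicted_price
```

Because the example prices lie exactly on a straight line, the residuals are zero up to floating-point rounding and R-squared is 1. The standard errors, t values, and p-values in the summary are therefore not meaningful here, and R may warn that the summary of an essentially perfect fit is unreliable. Real data would show scatter around the line.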