Simple Linear Regression

2024-03-21

Introduction to Simple Linear Regression

Simple linear regression is a statistical method used to model the relationship between two quantitative variables: one independent variable (X) and one dependent variable (Y).

Mathematical Representation

The simple linear regression model can be represented as:

\[ Y = \beta_0 + \beta_1X + \epsilon \]

$Y$ is the dependent variable
$X$ is the independent variable
$\beta_0$ is the intercept
$\beta_1$ is the slope
$\epsilon$ is the error term

Estimation of Parameters

In simple linear regression, the parameters $\beta_0$ and $\beta_1$ are estimated using least squares estimation. The formulas for estimating these parameters are:

\[ \hat{\beta_1} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \]

\[ \hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x} \]

where:
$\hat{\beta_1}$ is the estimated slope
$\hat{\beta_0}$ is the estimated intercept
$\bar{x}$ is the mean of the independent variable $X$
$\bar{y}$ is the mean of the dependent variable $Y$
$n$ is the number of observations
$x_i$ and $y_i$ are the individual observations of $X$ and $Y$ respectively
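
The two formulas can be applied directly in R. The sketch below uses the area and price values from the housing example in this document; the variable names (x_bar, beta1_hat, and so on) are illustrative choices, not part of the original slides.

```r
# Area (sq. ft) and price ($) values copied from the housing example in this document
x <- c(1200, 1400, 1600, 1800, 2000, 2200, 2400, 2600, 2800, 3000)
y <- c(150000, 170000, 190000, 210000, 230000, 250000, 270000, 290000, 310000, 330000)

# Sample means of X and Y
x_bar <- mean(x)
y_bar <- mean(y)

# Slope: sum of cross-deviations divided by the sum of squared deviations of x
beta1_hat <- sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)

# Intercept: chosen so the fitted line passes through (x_bar, y_bar)
beta0_hat <- y_bar - beta1_hat * x_bar

c(intercept = beta0_hat, slope = beta1_hat)
```

For this dataset the hand computation gives a slope of 100 and an intercept of 30000, the same estimates that lm() reports in the model summary.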

Prediction Equation

Once the parameters $\beta_0$ and $\beta_1$ are estimated, the prediction equation for simple linear regression is given by:

\[ \hat{Y} = \hat{\beta_0} + \hat{\beta_1}X \]

where:
$\hat{Y}$ is the predicted value of the dependent variable $Y$
$\hat{\beta_0}$ and $\hat{\beta_1}$ are the estimated intercept and slope respectively
$X$ is the value of the independent variable for which the prediction is being made

This equation allows us to predict the value of the dependent variable $Y$ for any given value of the independent variable $X$ based on the estimated parameters of the regression model.

Example Dataset (Housing Data)

Let’s consider a hypothetical dataset of house prices ($) and their corresponding areas (sq. ft). We want to predict house prices based on the area.

Area (sq. ft)	Price ($)
1200	150000
1400	170000
1600	190000
1800	210000
2000	230000
2200	250000
2400	270000
2600	290000
2800	310000
3000	330000

R code for data entry

Below is the R code for creating the house_data data frame shown on the previous slide.

# Example dataset
house_data <- data.frame(area = c(1200, 1400, 1600, 1800, 2000, 2200, 2400, 2600, 2800, 3000),
                         price = c(150000, 170000, 190000, 210000, 230000, 250000, 270000, 290000, 310000, 330000))

Scatter plot (ggplot2)

Below is the scatter plot for the housing data.
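
A scatter plot of this kind could be produced with ggplot2 roughly as follows. This is a minimal sketch: the object name scatter_plot, the axis labels, and the title are assumptions rather than the original slide's settings, and house_data is repeated so the block runs on its own.

```r
library(ggplot2)

# Housing data from the example above, repeated so this block is self-contained
house_data <- data.frame(
  area  = c(1200, 1400, 1600, 1800, 2000, 2200, 2400, 2600, 2800, 3000),
  price = c(150000, 170000, 190000, 210000, 230000, 250000, 270000, 290000, 310000, 330000)
)

# One point per house: area on the x-axis, price on the y-axis
scatter_plot <- ggplot(house_data, aes(x = area, y = price)) +
  geom_point() +
  labs(x = "Area (sq. ft)", y = "Price ($)", title = "House price vs. area")

print(scatter_plot)
```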

3D Scatter plot (Plotly)

Let’s visualize the relationship between house prices, areas, and another variable (e.g., number of bedrooms).
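
One way to draw such a plot with the plotly package is sketched below. The housing dataset in this document has no bedrooms column, so the bedroom counts here are hypothetical placeholder values invented purely for illustration; the object name fig and the axis titles are also assumptions.

```r
library(plotly)

# Housing data from the example above, repeated so this block is self-contained
house_data <- data.frame(
  area  = c(1200, 1400, 1600, 1800, 2000, 2200, 2400, 2600, 2800, 3000),
  price = c(150000, 170000, 190000, 210000, 230000, 250000, 270000, 290000, 310000, 330000)
)

# HYPOTHETICAL third variable: placeholder bedroom counts, not part of the original data
house_data$bedrooms <- c(2, 2, 3, 3, 3, 4, 4, 4, 5, 5)

# 3D scatter: area on x, bedrooms on y, price on z
fig <- plot_ly(house_data, x = ~area, y = ~bedrooms, z = ~price,
               type = "scatter3d", mode = "markers")

# Label the three axes
fig <- layout(fig, scene = list(
  xaxis = list(title = "Area (sq. ft)"),
  yaxis = list(title = "Bedrooms"),
  zaxis = list(title = "Price ($)")
))

fig
```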

Fitting linear regression model
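
The summary shown on this slide corresponds to the call lm(formula = price ~ area, data = house_data). A minimal sketch that reproduces it, assuming the fitted object is stored under the name model (the name used by the later predict() call), with house_data repeated so the block runs on its own:

```r
# Housing data from the example above, repeated so this block is self-contained
house_data <- data.frame(
  area  = c(1200, 1400, 1600, 1800, 2000, 2200, 2400, 2600, 2800, 3000),
  price = c(150000, 170000, 190000, 210000, 230000, 250000, 270000, 290000, 310000, 330000)
)

# Fit price as a linear function of area by ordinary least squares
model <- lm(price ~ area, data = house_data)

# Coefficient table, residual summary, R-squared and F-statistic
summary(model)
```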

## 
## Call:
## lm(formula = price ~ area, data = house_data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -2.369e-11 -1.433e-11 -2.239e-12  1.049e-11  4.196e-11 
## 
## Coefficients:
##              Estimate Std. Error   t value Pr(>|t|)    
## (Intercept) 3.000e+04  2.477e-11 1.211e+15   <2e-16 ***
## area        1.000e+02  1.138e-14 8.789e+15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.067e-11 on 8 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 7.724e+31 on 1 and 8 DF,  p-value: < 2.2e-16

The example prices lie exactly on the line price = 30000 + 100 × area, so the fit is perfect: the residuals and standard errors are floating-point noise, R-squared is 1, and the t and F statistics are not meaningful. Real data would not fit this cleanly.

Line plot (ggplot2)
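
A plot of the data with the fitted regression line could be drawn with ggplot2 roughly as follows. This is a sketch under assumptions: the object name line_plot and the labels are illustrative, and geom_smooth(method = "lm") refits the same least-squares line rather than reusing a stored model object.

```r
library(ggplot2)

# Housing data from the example above, repeated so this block is self-contained
house_data <- data.frame(
  area  = c(1200, 1400, 1600, 1800, 2000, 2200, 2400, 2600, 2800, 3000),
  price = c(150000, 170000, 190000, 210000, 230000, 250000, 270000, 290000, 310000, 330000)
)

# Observed points plus the least-squares line, without a confidence band
line_plot <- ggplot(house_data, aes(x = area, y = price)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Area (sq. ft)", y = "Price ($)", title = "Fitted regression line")

print(line_plot)
```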

Predicting house price for an area of 1500 sq. ft

Suppose we want to predict the price of a house with an area of 1500 sq. ft using our linear regression model.
R code:
new_area <- 1500
predicted_price <- predict(model, newdata = data.frame(area = new_area))
predicted_price

##      1 
## 180000
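
As a check, the same value follows from substituting the estimated coefficients from the model summary (intercept 30000, slope 100) into the prediction equation:

\[ \hat{Y} = 30000 + 100 \times 1500 = 180000 \]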

Conclusion

Simple linear regression models the relationship between two quantitative variables with a straight line and uses that line for prediction.
The sign of the estimated slope gives the direction of the relationship, and R-squared measures how much of the variation in $Y$ the line explains, which supports informed decision-making in various fields.