Introduction to Simple Linear Regression

Simple linear regression models the relationship between two variables by fitting a straight line to observed data. One variable is treated as the explanatory (independent) variable, and the other as the dependent (response) variable.

Real-world applications include predicting house prices from square footage and predicting sales from advertising spend.

Mathematical Formulation

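The simple linear regression model is given by:

\[ y = \beta_0 + \beta_1 x + \epsilon \]

Here \( y \) is the dependent variable, \( x \) is the explanatory variable, \( \beta_0 \) is the intercept, \( \beta_1 \) is the slope, and \( \epsilon \) is a random error term.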
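To make the roles of these terms concrete, the following sketch simulates data from the model and then recovers the coefficients with `lm()`. The parameter values (`beta0 = 2`, `beta1 = 0.5`), the sample size, and the noise level are arbitrary assumptions chosen for illustration; they are not taken from the house-price example.

```r
# Simulate data from y = beta0 + beta1 * x + epsilon (all values are assumed)
set.seed(42)
n <- 200
beta0 <- 2      # assumed intercept
beta1 <- 0.5    # assumed slope

x <- runif(n, min = 0, max = 10)        # explanatory variable
epsilon <- rnorm(n, mean = 0, sd = 1)   # random error term
y <- beta0 + beta1 * x + epsilon        # dependent variable

# Fit the model and inspect the estimated intercept and slope
fit <- lm(y ~ x)
print(coef(fit))
```

The estimates should land close to, but not exactly on, the assumed values, because the error term adds noise to every observation.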

Assumptions of Linear Regression

  1. Linearity: The relationship between the independent and dependent variables should be linear.
  2. Independence: Observations should be independent of each other.
  3. Homoscedasticity: The residuals should have constant variance.
  4. Normality: The residuals should be normally distributed.
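
Some of these assumptions can be checked informally from the residuals of a fitted model. The sketch below uses R's built-in `cars` dataset as a stand-in (an assumption, chosen only so the example runs on its own). A residuals-versus-fitted plot helps judge linearity and constant variance, and a normal Q-Q plot together with `shapiro.test()` gives a rough view of normality. Independence usually has to be argued from how the data were collected rather than read off a plot.

```r
# Fit a simple linear regression on the built-in cars data (stand-in dataset)
fit <- lm(dist ~ speed, data = cars)
res <- residuals(fit)

op <- par(mfrow = c(1, 2))

# Linearity / homoscedasticity: look for no trend and roughly even spread
plot(fitted(fit), res,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. Fitted")
abline(h = 0, lty = 2)

# Normality: points should fall close to the reference line
qqnorm(res)
qqline(res)

par(op)

# Shapiro-Wilk test of normality on the residuals
sw <- shapiro.test(res)
print(sw)
```

These checks are diagnostic aids rather than proofs; a small p-value or a visible pattern suggests an assumption deserves a closer look.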

Estimating the Regression Line

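The parameters \( \beta_0 \) and \( \beta_1 \) are estimated using the least squares method, which minimizes the sum of the squared residuals:

\[ S(\beta_0, \beta_1) = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2 \]

Setting the partial derivatives of \( S \) with respect to \( \beta_0 \) and \( \beta_1 \) to zero gives the estimates:

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} \]

\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]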
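
The closed-form estimates can be computed directly and compared with the output of `lm()`. This is a minimal sketch that assumes R's built-in `cars` dataset as example data; the variable names `beta0_hat` and `beta1_hat` are chosen here for illustration.

```r
# Example data (assumed): stopping distance vs. speed from the built-in cars data
x <- cars$speed
y <- cars$dist

x_bar <- mean(x)
y_bar <- mean(y)

# Slope: sum of cross-deviations divided by sum of squared deviations of x
beta1_hat <- sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)

# Intercept: the fitted line passes through (x_bar, y_bar)
beta0_hat <- y_bar - beta1_hat * x_bar

# Compare with the coefficients estimated by lm()
fit <- lm(dist ~ speed, data = cars)
print(c(intercept = beta0_hat, slope = beta1_hat))
print(coef(fit))
```

Both approaches should agree up to floating-point rounding, since `lm()` solves the same least squares problem.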

Example: Predicting House Prices

We will use a simulated dataset of 100 house prices and their corresponding square footage to illustrate simple linear regression.

Visualizing the Data

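[Figure: Scatter Plot of House Prices vs. Square Footage (x-axis: Square Footage, y-axis: Price)]

[Figure: Linear Regression of House Prices on Square Footage (scatter plot with fitted regression line)]

[Figure: 3D Scatter Plot of House Prices (axes: Square Footage, Number of Rooms, Price)]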

R-Code

# Setup: load plotting and table-formatting packages

library(ggplot2)
library(plotly)
library(knitr)
library(kableExtra)

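# Sample data: 100 simulated houses (seed fixed for reproducibility).
# Note: price is drawn independently of sqft, so no true relationship
# is built into this data and the fitted slope reflects sampling noise.
set.seed(123)
data = data.frame(
  sqft = rnorm(100, mean=1500, sd=200),
  price = rnorm(100, mean=300000, sd=50000)
)

# Linear model: regress price on square footage
model = lm(price ~ sqft, data = data)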

# Scatter plot with ggplot2
scatter_plot = ggplot(data, aes(x=sqft, y=price)) +
  geom_point() +
  labs(title="Scatter Plot of House Prices vs. Square Footage",
       x="Square Footage", y="Price")

# Regression line plot with ggplot2
regression_plot = scatter_plot +
  geom_smooth(method="lm", se=FALSE, color="red") +
  labs(title="Linear Regression of House Prices on Square Footage")

# 3D plotly plot: add a simulated room count to use as the third axis
data$rooms = rnorm(100, mean=5, sd=1)
plot_3d = plot_ly(data, x = ~sqft, y = ~rooms, z = ~price, type = "scatter3d", mode = "markers") %>%
  layout(title = "3D Scatter Plot of House Prices",
         scene = list(xaxis = list(title = "Square Footage"),
                      yaxis = list(title = "Number of Rooms"),
                      zaxis = list(title = "Price")))
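
Once the model is fitted, a natural next step is to inspect the estimates and predict prices for new square-footage values. The sketch below rebuilds the same simulated data so that it runs on its own; the `new_houses` values (1200 and 1800 square feet) are hypothetical inputs chosen for illustration.

```r
# Recreate the simulated data and model from the example above
set.seed(123)
data <- data.frame(
  sqft = rnorm(100, mean = 1500, sd = 200),
  price = rnorm(100, mean = 300000, sd = 50000)
)
model <- lm(price ~ sqft, data = data)

# Coefficient estimates, standard errors, and R-squared
print(summary(model))

# Predict prices for hypothetical new houses, with 95% prediction intervals
new_houses <- data.frame(sqft = c(1200, 1800))
pred <- predict(model, newdata = new_houses, interval = "prediction")
print(pred)
```

Because the sample prices are generated without reference to `sqft`, the two predictions will likely be similar and the prediction intervals wide; with real housing data, a clearer slope would be expected.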