Simple Linear Regression in Statistics

2024-10-19

What is Simple Linear Regression?

Simple Linear Regression is a statistical method used to model the relationship between two variables by fitting a linear equation to observed data.

Variables:

Independent Variable (X): The variable you use to make predictions.
Dependent Variable (Y): The outcome variable you want to predict.

Goals:

Predict the value of the dependent variable from the value of the independent variable.
Find the best-fitting line through the data points, so that $Y$ can be predicted for any given value of $X$.

Mathematical Formula for Simple Linear Regression

The formula for Simple Linear Regression is:

\[Y = \beta_0 + \beta_1 X\]

Where:

$Y$ is the dependent variable (variable we want to predict).
$X$ is the independent variable (variable used to make predictions).
$\beta_0$ is the y-intercept (the value of $Y$ when $X = 0$).
$\beta_1$ is the slope, representing the change in $Y$ for a one-unit change in $X$.
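
The coefficients $\beta_0$ and $\beta_1$ are usually estimated by ordinary least squares, which is what R's `lm()` does. As a minimal sketch on synthetic data (the data, the seed, and the names `beta0_hat` and `beta1_hat` are assumptions for illustration, not part of the real estate example), the closed-form estimates can be compared with the coefficients `lm()` returns:

```r
# Synthetic data for illustration only (not from the real estate data set).
set.seed(1)
x <- runif(50, min = 0, max = 10)
y <- 3 + 2 * x + rnorm(50, sd = 1)

# Closed-form ordinary least squares estimates.
beta1_hat <- cov(x, y) / var(x)             # slope
beta0_hat <- mean(y) - beta1_hat * mean(x)  # intercept

# lm() fits the same model; its coefficients should agree with the above.
fit <- lm(y ~ x)
print(c(intercept = beta0_hat, slope = beta1_hat))
print(coef(fit))
```

The slope is the sample covariance of $X$ and $Y$ divided by the sample variance of $X$, and the intercept is chosen so that the line passes through the point of means.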

Assumptions:

Linearity: The relationship between $X$ and $Y$ is linear.
Independence: The observations are independent of each other.

Applications

Simple linear regression is widely used in various fields, such as economics, biology, and social sciences, for tasks like forecasting, trend analysis, and understanding relationships between variables.

Once the model is fitted, you can interpret the coefficients to understand how changes in the independent variable impact the dependent variable, helping to draw meaningful insights from the data.

Simple linear regression is a foundational tool in statistics that provides a clear and interpretable way to analyze relationships between two quantitative variables.

In the next few slides, we will look at an application of Simple Linear Regression to real estate data.

Data Set Example

Using a real estate data set from Kaggle, we have data on 414 houses that were sold by a real estate agent. We want to see how each attribute of a house correlates with its price per unit area.

Specifically, we will examine the correlation of each quantitative independent variable (house age, distance to the nearest MRT station, number of convenience stores) with the house price per unit area.

Independent Variables:

House Age: quantitative variable that represents the age of the house in years.
Distance to the Nearest MRT Station: quantitative variable that represents the distance from the house to the nearest MRT (Mass Rapid Transit) station in km.
Number of Convenience Stores: quantitative variable that represents the number of convenience stores within a 5 km radius of the house.

Dependent Variable (Output):

House Price of Unit Area: quantitative variable that represents the house’s price per unit area.
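
The column names used in the code on the following slides (for example `X2.house.age`) look like the result of `read.csv()` cleaning up headers that contain spaces. The loading code is not shown in the slides, so the sketch below is an assumption: it writes a tiny stand-in CSV whose header is assumed to mirror the Kaggle file and whose single data row is made up, then reads it back to show how the names arise.

```r
# Write a tiny stand-in CSV. The header is assumed to mirror the Kaggle file;
# the single data row is made up for illustration.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  paste("No", "X2 house age", "X3 distance to the nearest MRT station",
        "X4 number of convenience stores", "Y house price of unit area",
        sep = ","),
  "1,10.0,500.0,5,40.0"
), csv_path)

# read.csv() uses check.names = TRUE by default, which passes the header
# through make.names() and replaces spaces with dots.
df <- read.csv(csv_path)
print(names(df))
```

With the real file, the same call would be `df <- read.csv(<path to the downloaded CSV>)`, where the path depends on where the file was saved.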

ggplot (House Age & Price of Unit Area)

We will look at the correlation between House Ages and House Prices of Unit Area.

The $X$ value will represent the House’s Age.
The $Y$ value will represent the House’s Price of Unit Area.

Code:

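library(ggplot2)  # provides ggplot(), geom_point(), geom_smooth()

# df is the real estate data frame (loading code not shown in the slides).
# Scatter plot with a fitted line; level = 0.99 draws a 99% confidence band.
agePrice <- ggplot(df, aes(x = X2.house.age, y = Y.house.price.of.unit.area)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, level = 0.99) +
  xlab("House Age (Years)") +
  ylab("House Price of Unit Area (Dollars)") +
  theme(axis.text.x = element_text(size = 12),
        axis.text.y = element_text(size = 12))
agePrice  # display the plot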

ggplot Graph (House Age & Price of Unit Area)

Relationship Between House Age & Price of Unit Area

The correlation between House Age and House Price of Unit Area is represented by the value:

## [1] -0.210567

The correlation is negative but much closer to 0 than to -1, so there is only a weak negative linear relationship between House Age and House Price of Unit Area.

Therefore, a straight line describes this relationship poorly, and House Age alone is a weak linear predictor of price.
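
The slides show the correlation value but not the call that produced it; it was presumably something like `cor()` applied to the two columns. A minimal sketch, using a small made-up data frame in place of `df` (only the column names follow the slides), also shows how the correlation relates to the fit of the linear model:

```r
# Made-up stand-in for df; only the column names follow the slides.
df <- data.frame(
  X2.house.age = c(2, 5, 8, 13, 20, 31, 40),
  Y.house.price.of.unit.area = c(52, 47, 45, 38, 41, 30, 36)
)

# Pearson correlation (the default method of cor()).
r <- cor(df$X2.house.age, df$Y.house.price.of.unit.area)

# In simple linear regression, r squared equals the model's R-squared.
fit <- lm(Y.house.price.of.unit.area ~ X2.house.age, data = df)
r_squared <- summary(fit)$r.squared
print(c(r = r, r_squared = r_squared))
```

For the value reported above, $(-0.210567)^2 \approx 0.044$, so in a simple linear fit House Age accounts for roughly 4% of the variance in House Price of Unit Area.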

ggplot (Distance to Nearest MRT Station & Price of Unit Area)

We will look at the correlation between Distances to Nearest MRT Stations and House Prices of Unit Area.

The $X$ value will represent the House’s Distance to Nearest MRT Station.
The $Y$ value will represent the House’s Price of Unit Area.

Code:

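library(ggplot2)  # provides ggplot(), geom_point(), geom_smooth()

# df is the real estate data frame (loading code not shown in the slides).
# Scatter plot with a fitted line and the default 95% confidence band.
distancePrice <- ggplot(df, aes(x = X3.distance.to.the.nearest.MRT.station, y = Y.house.price.of.unit.area)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  xlab("Distance to Nearest MRT Station (km)") +
  ylab("House Price of Unit Area (Dollars)") +
  theme(axis.text.x = element_text(size = 12),
        axis.text.y = element_text(size = 12))
distancePrice  # display the plot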

ggplot Graph (Distance to Nearest MRT Station & Price of Unit Area)

Relationship Between Distance to Nearest MRT Station & Price of Unit Area

The correlation between Distance to Nearest MRT Station & Price of Unit Area is represented by the value:

## [1] -0.6736129

The correlation is negative and closer to -1 than to 0, meaning that there is a fairly strong negative correlation between the Distance to Nearest MRT Station and House Price of Unit Area. From the fitted linear model, we can extract the y-intercept and slope to obtain the line of best fit (the slope is shown to three significant figures, because rounding it to -0.01 would not reproduce the prediction below):

$y = -0.00726x + 45.85$

We can predict the Price of Unit Area with 1000 km as the Distance to Nearest MRT Station:

##        1 
## 38.58938

Therefore, with 1000 km as the Distance to Nearest MRT Station, we can predict the Price of Unit Area to be around $38.59.
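
The code that extracted the coefficients and made this prediction is not shown in the slides. A minimal sketch of how it could be done with `coef()` and `predict()`, using a small made-up data frame in place of `df` (the values and the names `fit`, `new_house`, and `by_hand` are assumptions for illustration):

```r
# Made-up stand-in for df; only the column names follow the slides.
df <- data.frame(
  X3.distance.to.the.nearest.MRT.station = c(100, 300, 500, 1200, 2500, 4000),
  Y.house.price.of.unit.area = c(48, 45, 42, 37, 28, 17)
)

fit <- lm(Y.house.price.of.unit.area ~ X3.distance.to.the.nearest.MRT.station,
          data = df)

# coef() returns the intercept first, then the slope.
b <- coef(fit)

# predict() needs a data frame whose column name matches the predictor.
new_house <- data.frame(X3.distance.to.the.nearest.MRT.station = 1000)
pred <- predict(fit, newdata = new_house)

# The same prediction by hand: intercept + slope * x.
by_hand <- b[[1]] + b[[2]] * 1000
print(c(predicted = unname(pred), by_hand = by_hand))
```

Working backwards from the numbers reported above, $(38.58938 - 45.85) / 1000 \approx -0.00726$, so the slope is needed to at least three significant figures to reproduce the prediction by hand.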

plotly (Number of Convenience Stores & Price of Unit Area)

We will look at the correlation between Number of Convenience Stores and House Prices of Unit Area.

The $X$ value will represent the Number of Convenience Stores within a 5 km radius of the House.
The $Y$ value will represent the House’s Price of Unit Area.

Code:

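library(plotly)  # provides plot_ly(), add_lines(), layout(), and the %>% pipe

# df is the real estate data frame (loading code not shown in the slides).
# Fit price per unit area against the number of convenience stores.
linearModel <- lm(Y.house.price.of.unit.area ~ X4.number.of.convenience.stores, data = df)

# Axis titles
xax <- list(title = "Number of Convenience Stores")
yax <- list(title = "Price of Unit Area ($)")

# Scatter plot of the data with the fitted regression line overlaid.
storesPrice <- plot_ly(x = df$X4.number.of.convenience.stores,
                       y = df$Y.house.price.of.unit.area,
                       type = "scatter",
                       mode = "markers",
                       name = "Number of Convenience Stores / Price of Unit Area",
                       width = 690,
                       height = 300) %>%
  add_lines(x = df$X4.number.of.convenience.stores,
            y = predict(linearModel),  # fitted values, one per observation
            name = "Linear Fit") %>%
  layout(title = "Number of Convenience Stores vs. Price of Unit Area",
         xaxis = xax, yaxis = yax,
         margin = list(l = 150, r = 50, b = 20, t = 40))
storesPrice  # display the plot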

plotly Graph (Number of Convenience Stores & Price of Unit Area)

Relationship Between Number of Convenience Stores & Price of Unit Area

The correlation between Number of Convenience Stores & Price of Unit Area is represented by the value:

## [1] 0.5710049

The correlation is positive and slightly closer to 1 than to 0, meaning that there is a moderately strong positive correlation between the Number of Convenience Stores & Price of Unit Area. From the fitted linear model, we can extract the y-intercept and slope to obtain the line of best fit:

$y = 2.64x + 27.18$

We can predict the Price of Unit Area with 100 as the Number of Convenience Stores:

##        1 
## 290.9465

Therefore, with 100 as the Number of Convenience Stores, we can predict the Price of Unit Area to be around $290.95.
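
A prediction is most trustworthy when the new $X$ value lies inside the range of values the model was fitted on; the slides do not state the observed range of the Number of Convenience Stores, so whether 100 is inside it is worth checking. A minimal sketch of such a check, using a made-up data frame in place of `df` and a hypothetical helper `in_observed_range()`:

```r
# Made-up stand-in for df; only the column names follow the slides.
df <- data.frame(
  X4.number.of.convenience.stores = c(0, 1, 3, 5, 6, 8, 10),
  Y.house.price.of.unit.area = c(25, 31, 34, 42, 40, 50, 53)
)
linearModel <- lm(Y.house.price.of.unit.area ~ X4.number.of.convenience.stores,
                  data = df)

# Hypothetical helper: TRUE when x lies inside the observed predictor range.
in_observed_range <- function(x, observed) {
  x >= min(observed) & x <= max(observed)
}

new_x <- 100
inside <- in_observed_range(new_x, df$X4.number.of.convenience.stores)
pred <- predict(linearModel,
                newdata = data.frame(X4.number.of.convenience.stores = new_x))

if (!inside) {
  message("x = ", new_x,
          " is outside the fitted range; treat the prediction as an extrapolation.")
}
```

With the real data, `range(df$X4.number.of.convenience.stores)` gives the observed minimum and maximum directly.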