2025-03-16

Simple Linear Regression (SLR)

The aim of simple linear regression is to find a linear relationship that describes the pattern between an independent variable and a (possibly) dependent variable.

The formula for SLR is (explained in detail in the next slide): \[ y = \alpha + \beta x \]

where \(x\), the independent variable, is the input that influences change and \(y\), the dependent variable, is affected by it.

Examples

  • Predicting property prices
  • Predicting weight based on height
  • Predicting exam score based on study time
  • Modeling the relationship between spending and sales in a company

SLR Formula for 2 variables

\[ y = \alpha + \beta x \]

where

  • \(y\) = estimated point
  • \(\alpha\) = y-intercept
  • \(\beta\) = slope
  • \(x\) = input data point
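As a minimal sketch of how \(\alpha\) and \(\beta\) can be estimated, the block below applies the standard least-squares formulas to a small toy dataset and compares the result with R's built-in lm(). The values of x and y are hypothetical and are not taken from the housing dataset.

```r
# Toy data (hypothetical values, not from the housing dataset)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Closed-form least-squares estimates
beta  <- cov(x, y) / var(x)        # slope
alpha <- mean(y) - beta * mean(x)  # y-intercept

# The same estimates from R's built-in linear model
fit <- lm(y ~ x)
coef(fit)

# Estimated point for a new input x = 6
y_hat <- alpha + beta * 6
```

The coefficients returned by coef(fit) should match alpha and beta up to floating-point rounding.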

Similarly, a model with 3 variables is represented as: \[ y = \alpha + \beta_1 x_1 + \beta_2 x_2 \]
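A model with two input variables can be fitted the same way by adding a second term to the lm() formula. The sketch below uses hypothetical toy data generated exactly from an intercept of 1 and slopes of 2 and 3, so the fitted coefficients can be checked against known values.

```r
# Toy data (hypothetical), generated exactly from y = 1 + 2*x1 + 3*x2
x1 <- c(1, 2, 3, 4, 5, 6)
x2 <- c(2, 1, 4, 3, 6, 5)
y  <- 1 + 2 * x1 + 3 * x2

# Fit a linear model with two input variables
fit2 <- lm(y ~ x1 + x2)

# Intercept, then one slope per input variable
coef(fit2)
```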

Introducing House Pricing Dataset

For this project we use the Kaggle housing prices dataset, available at https://www.kaggle.com/datasets/yasserh/housing-prices-dataset?resource=download

Each data point records the house price along with the area, bedrooms, bathrooms, stories, mainroad, guestroom, basement, hot water heating, air conditioning, parking, preferred area, and furnishing status of the house.

We could use each of these as a variable, but we will predict the house price using only the area.

Additionally, for simplicity and clear visualization we will only take the first 50 data points.

Plotting House Pricing vs Area

Our initial intuition is that a larger area corresponds to a higher house price. However, we notice that the house with the largest area is not the most expensive. One explanation might be that we are only considering one variable out of the many mentioned before.
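An observation like this can also be checked numerically rather than by eye. The sketch below uses a small stand-in data frame with hypothetical values (the name houses and its numbers are assumptions for illustration); with the real dataset, the same calls would be applied to its area and price columns.

```r
# Stand-in data (hypothetical values, not from the housing dataset)
houses <- data.frame(
  area  = c(1000, 1500, 2000, 2500, 3000),
  price = c(200, 320, 380, 500, 450)
)

# Row index of the largest house and of the most expensive house
largest  <- which.max(houses$area)
priciest <- which.max(houses$price)

# TRUE only if the largest house is also the most expensive one
same_house <- largest == priciest

# Strength of the linear association between area and price
area_price_cor <- cor(houses$area, houses$price)
```

A positive correlation together with same_house being FALSE would match the pattern described above: price tends to rise with area, but the largest house is not the priciest.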

Applying SLR on Pricing vs Area

## `geom_smooth()` using formula = 'y ~ x'

Explaining SLR Plotting

library(ggplot2)

# Load the dataset and keep the first 50 observations (R indexing starts at 1)
data <- read.csv("Housing.csv")
house_price <- data$price[1:50]
house_area <- data$area[1:50]

# Area is the predictor (x) and price is the response (y)
data_sub <- data.frame(price = house_price, area = house_area)
ggplot(data_sub, aes(x = area, y = price)) + geom_point() +
  geom_smooth(method = "lm", color = "blue", se = FALSE) +
  labs(x = "House area", y = "House price",
      title = "Fitting the Regression Line") + theme_minimal()

First, we take the first 50 data points of price and area and plot them with ggplot.

Further:

  • geom_point() adds the scatter plot points.
  • method = "lm" in geom_smooth() fits a linear model and draws the SLR line.
  • theme_minimal() removes the grey panel background for a neater look.
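The line drawn with method = "lm" comes from an ordinary linear model, so the intercept and slope can also be read off by calling lm() directly. The sketch below assumes a small stand-in data frame with hypothetical values (houses is not part of the original code); with the real data, data_sub could be passed instead.

```r
# Stand-in data (hypothetical values, not from the housing dataset)
houses <- data.frame(
  area  = c(1000, 1500, 2000, 2500, 3000),
  price = c(200, 320, 380, 500, 450)
)

# Fit price as a linear function of area
fit <- lm(price ~ area, data = houses)

# Intercept (alpha) and slope (beta) of the fitted line
coefs <- coef(fit)

# Predicted price for a house with an area of 2200
predicted <- predict(fit, newdata = data.frame(area = 2200))
```

Here predict() simply evaluates alpha + beta * 2200 using the fitted coefficients.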

Observing dataset in 3D

We can also visualize the dataset in 3D. Out of the several available parameters, let's look at the price of a house with the number of bedrooms and bathrooms as our independent variables.

We notice that the price of a house tends to be higher when the numbers of bedrooms and bathrooms are the same. The highest is a house with 4 bedrooms and 4 bathrooms, at approximately $12 million.
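One way to check a pattern like this outside the 3D plot is to average the price for each combination of bedrooms and bathrooms. The sketch below uses a stand-in data frame with hypothetical values chosen only to illustrate the call; with the real dataset, the data frame read from Housing.csv would be used instead.

```r
# Stand-in data (hypothetical values, not from the housing dataset)
houses <- data.frame(
  bedrooms  = c(3, 3, 4, 4, 4, 2),
  bathrooms = c(1, 1, 2, 4, 2, 1),
  price     = c(300, 340, 500, 900, 540, 250)
)

# Mean price for every bedrooms/bathrooms combination present in the data
avg_price <- aggregate(price ~ bedrooms + bathrooms, data = houses, FUN = mean)

# Combinations sorted from most to least expensive
avg_price[order(-avg_price$price), ]
```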

Conclusion

The slides showed some trends that go against what might seem obvious, such as the relationship between house price and area. We would expect that the larger the area of a house, the more expensive it is. However, we noticed that this is not always true. Other factors, such as the number of bedrooms and bathrooms and the heating, also play a big role in determining the price of a house.

We also noticed that the number of bedrooms and bathrooms is a factor in the price of a house. Houses with an imbalance of bedrooms and bathrooms are not as pricey as houses with the same number of each.

Finally, we also learned how simple linear regression works and how it can help make important predictions and decisions.