Linear Regression

Machine Learning With Python: Linear Regression With Three Variables

Problem Statement

Given the data below, build a machine learning model that predicts home prices from area in square feet, number of bedrooms, and age.

area  bedrooms   age  price
2600  3          20   550000
3000  4          15   565000
3200  (missing)  18   610000
3600  3          30   595000
4000  5          8    760000
4100  6          8    810000

Mean Squared Error (MSE)

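Many different lines can be drawn through the data points. We choose the one for which the total squared error between the actual and predicted prices is smallest; averaging that total gives the mean squared error (MSE).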
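To make "minimum error" concrete, the sketch below scores two candidate lines against the area and price columns from the table using mean squared error, then compares them with a least-squares fit. The two hand-picked slopes and intercepts are arbitrary values chosen for illustration, not fitted ones.

```python
import numpy as np

# Area and price columns from the table in the problem statement
area = np.array([2600, 3000, 3200, 3600, 4000, 4100], dtype=float)
price = np.array([550000, 565000, 610000, 595000, 760000, 810000], dtype=float)

def mse_for_line(m, b):
    """Mean squared error of the line price = m * area + b."""
    predicted = m * area + b
    return np.mean((price - predicted) ** 2)

# Two hypothetical candidate lines (values picked by hand)
mse_a = mse_for_line(m=150.0, b=130000.0)
mse_b = mse_for_line(m=100.0, b=100000.0)

# Least-squares fit for comparison; np.polyfit returns [slope, intercept]
m_best, b_best = np.polyfit(area, price, deg=1)
mse_best = mse_for_line(m_best, b_best)

print(mse_a, mse_b, mse_best)
```

The least-squares line has the lowest MSE of the three, which is the criterion linear regression uses to pick its line.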

You might remember the linear equation from your high school math class. Home prices can be represented by the following equation:

home price = m * (area) + b

Intercept and Slope

The generic form of the same equation is y = m * x + b, where m is the slope and b is the intercept.
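With three input variables, as in this problem, the single-slope form extends to one coefficient per feature. Written out with the column names from the data (the symbols m_1, m_2, m_3 are notation introduced here for illustration):

```latex
\text{price} = m_1 \cdot \text{area} + m_2 \cdot \text{bedrooms} + m_3 \cdot \text{age} + b
```

Each m_i is the slope for its feature and b is still the intercept.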

Download CSV file

[homeprices data (Google Sheets)](https://docs.google.com/spreadsheets/d/1C0FC0UnnH8WXzb85RTAaDKYaoxuZ1cWdkc8n2DJ3CDA/edit?usp=sharing)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

Load the data

df = pd.read_csv('homeprices.csv')
df
##    area  bedrooms  age   price
## 0  2600       3.0   20  550000
## 1  3000       4.0   15  565000
## 2  3200       NaN   18  610000
## 3  3600       3.0   30  595000
## 4  4000       5.0    8  760000
## 5  4100       6.0    8  810000
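If homeprices.csv is not available locally, an equivalent frame can be built inline from the table in the problem statement. This is a sketch for following along; the missing bedrooms entry is written as np.nan, which is how an empty numeric cell shows up after read_csv.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'area': [2600, 3000, 3200, 3600, 4000, 4100],
    'bedrooms': [3, 4, np.nan, 3, 5, 6],
    'age': [20, 15, 18, 30, 8, 8],
    'price': [550000, 565000, 610000, 595000, 760000, 810000],
})

# Count missing values per column to see what needs cleaning
print(df.isna().sum())
```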

Plot price against area

plt.xlabel('area')
plt.ylabel('price')
plt.scatter(df.area, df.price, color='red', marker='+')

![Scatter plot of price against area](scatterplot.png)

Handle missing data

The bedrooms value is missing for one row. Fill it with the column median, assigning the result back to the column rather than using inplace=True on a column selection (which raises a FutureWarning and will stop working in pandas 3.0).

df['bedrooms'] = df['bedrooms'].fillna(df['bedrooms'].median())
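As a quick check of what the fill does, the sketch below applies the same median fill to the bedrooms values from the table. The median of the five known values (3, 4, 3, 5, 6) is 4, so the missing entry becomes 4.0.

```python
import numpy as np
import pandas as pd

bedrooms = pd.Series([3, 4, np.nan, 3, 5, 6], name='bedrooms')

# median() skips NaN by default, so only the five known values count
median_bedrooms = bedrooms.median()

# Assign the result back instead of using inplace=True on a column selection
filled = bedrooms.fillna(median_bedrooms)

print(median_bedrooms)       # 4.0
print(filled.isna().sum())   # 0
```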

Features and target

# Features and target
X = df[['area', 'bedrooms', 'age']]  # Independent variables
y = df['price']  # Dependent variable (target)

Split the dataset into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
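With only six rows, it helps to see how the split divides them. The sketch below uses a stand-in array of six samples (placeholder values, not the home price data); with test_size=0.2, scikit-learn rounds the test share up, which gives 2 test rows and 4 training rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: six samples, three features each (values are placeholders)
X_demo = np.arange(18).reshape(6, 3)
y_demo = np.arange(6)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)  # (4, 3) (2, 3)
```

A model trained on four rows and evaluated on two gives only a rough picture of accuracy, which is worth keeping in mind when reading the error figures.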

Create and train the Linear Regression model

model = LinearRegression()
model.fit(X_train, y_train)
## LinearRegression()
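Once fitted, the model exposes the learned slopes and intercept as coef_ and intercept_, which correspond to m and b in the equation earlier. The sketch below refits on all six rows (an assumption made to keep the example short; the walkthrough fits on the training split only, so its coefficients will differ) and checks that a prediction is the weighted sum of the features plus the intercept.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Data from the problem statement, with the missing bedrooms value
# already filled by the median (4)
df_demo = pd.DataFrame({
    'area': [2600, 3000, 3200, 3600, 4000, 4100],
    'bedrooms': [3, 4, 4, 3, 5, 6],
    'age': [20, 15, 18, 30, 8, 8],
    'price': [550000, 565000, 610000, 595000, 760000, 810000],
})
X_demo = df_demo[['area', 'bedrooms', 'age']]
y_demo = df_demo['price']

demo_model = LinearRegression().fit(X_demo, y_demo)

print(demo_model.coef_)       # one slope per feature: area, bedrooms, age
print(demo_model.intercept_)  # the constant term b

# Reproduce predict() by hand: X @ coef_ + intercept_
manual = X_demo.to_numpy() @ demo_model.coef_ + demo_model.intercept_
print(np.allclose(manual, demo_model.predict(X_demo)))
```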

Predict on the test set

y_pred = model.predict(X_test)

Evaluate the model (optional: print Mean Squared Error)

mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
## Mean Squared Error: 1713617314.5467577
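MSE is expressed in squared price units, which makes the figure above hard to interpret. Taking the square root gives the root mean squared error (RMSE) in the same units as price. The sketch below applies that to the MSE value printed above, which works out to roughly 41,400; with a test set this small, it should be read as a rough estimate.

```python
import math

mse = 1713617314.5467577  # value printed by the evaluation step above

# RMSE is in the same units as the target (price)
rmse = math.sqrt(mse)
print(f"Root Mean Squared Error: {rmse:.0f}")
```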

Predict the price for new data (example: area=3200, bedrooms=3, age=18)

# Pass a DataFrame with the training column names; a bare NumPy array
# triggers a "X does not have valid feature names" UserWarning
new_data = pd.DataFrame([[3200, 3, 18]], columns=['area', 'bedrooms', 'age'])
predicted_price = model.predict(new_data)
print(f"Predicted Price: {predicted_price[0]}")
## Predicted Price: 571567.1641791033

Exercise: Predict prices for the data in the given Excel file and generate a list of predictions

[Exercise data (Google Sheets)](https://docs.google.com/spreadsheets/d/1jDsPOTB5co7rcW66AVcRsQYrQRgYNXxC44XI-rSI7s4/edit?usp=sharing)
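One way to approach the exercise is to load the new rows into a DataFrame with the same feature columns and pass the whole frame to predict. The sketch below is self-contained: it refits a model on the six rows from the problem statement, and new_homes is a hypothetical stand-in for the spreadsheet contents (its values are made up for illustration). For the real exercise, new_homes would be loaded from the linked file instead.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

features = ['area', 'bedrooms', 'age']

# Training data from the problem statement (missing bedrooms filled with 4)
train = pd.DataFrame({
    'area': [2600, 3000, 3200, 3600, 4000, 4100],
    'bedrooms': [3, 4, 4, 3, 5, 6],
    'age': [20, 15, 18, 30, 8, 8],
    'price': [550000, 565000, 610000, 595000, 760000, 810000],
})
model_demo = LinearRegression().fit(train[features], train['price'])

# Hypothetical rows standing in for the spreadsheet (made-up values)
new_homes = pd.DataFrame({
    'area': [2800, 3500, 4200],
    'bedrooms': [3, 4, 5],
    'age': [12, 25, 5],
})

# Predict all rows at once and collect the results as a plain list
predictions = model_demo.predict(new_homes[features]).tolist()
new_homes['predicted_price'] = predictions
print(new_homes)
```

Selecting new_homes[features] keeps the column order identical to the one used for fitting, which avoids silently mixing up features.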