2024-09-19

What is Linear Regression?

Linear regression is a type of analysis used to predict the value of one variable from the value of another. The variable we want to predict is the dependent variable, and the variable we use to predict it is the independent variable.

Some key use cases of linear regression:

  1. Predicting continuous outcomes (e.g., house price prediction based on features like square footage, area, and location).
  2. Sales forecasting.
  3. Stock price forecasting based on historical data.
  4. Trend analysis - identifying trends over time in time-series data.

Problems that can occur with Linear Regression

  • Overfitting
    • Including too many predictors can produce a model that fits the training data too closely but does not generalize to unseen data.
  • Underfitting
    • If the model is missing important predictors, it can underfit the data.
  • Non-linearity of Data
    • Linear regression assumes that there exists a linear relationship between the independent and dependent variables. If the relationship is non-linear, there is a high chance that linear regression will not capture the pattern in the data.
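
The non-linearity problem can be illustrated with a small sketch. The data here is synthetic and made up for illustration (a quadratic relationship plus noise), and the variable names are assumptions, not part of the housing example. A straight line explains almost none of the variance, while adding a squared term recovers the pattern.

```r
set.seed(42)

# Synthetic data: y depends on x quadratically, not linearly
x <- seq(-3, 3, length.out = 100)
y <- x^2 + rnorm(100, sd = 0.5)

# A straight-line model versus a model with an added squared term
linear_fit    <- lm(y ~ x)
quadratic_fit <- lm(y ~ x + I(x^2))

# R-squared: share of the variance in y explained by each model
r2_linear    <- summary(linear_fit)$r.squared
r2_quadratic <- summary(quadratic_fit)$r.squared

cat("R^2 linear:   ", round(r2_linear, 3), "\n")
cat("R^2 quadratic:", round(r2_quadratic, 3), "\n")
```

Note that the second model is still fitted with lm(): it is linear in its coefficients even though it is not linear in x.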

Some Optimizations for Linear Regression

  • Regularizing the model
  • Using gradient descent
  • Scaling the data
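
Two of the points above, scaling and gradient descent, can be sketched together. This is a minimal illustration on synthetic data, not the housing dataset; the names size and rent, the learning rate, and the number of steps are all assumptions chosen for the example. The feature is standardized first so that a single learning rate works well, and the result is compared against lm().

```r
set.seed(1)

# Synthetic data (hypothetical values): a feature on a large scale and a noisy linear response
size <- runif(200, 300, 3000)
rent <- 5000 + 20 * size + rnorm(200, sd = 2000)

# Scaling: standardize the feature to mean 0 and standard deviation 1
x <- as.numeric(scale(size))

# Gradient descent on the mean squared error
b0 <- 0     # intercept
b1 <- 0     # slope
lr <- 0.1   # learning rate (assumed value)
for (step in 1:500) {
  err <- (b0 + b1 * x) - rent          # prediction error for every point
  b0  <- b0 - lr * 2 * mean(err)       # partial derivative w.r.t. the intercept
  b1  <- b1 - lr * 2 * mean(err * x)   # partial derivative w.r.t. the slope
}

# Reference solution from R's built-in least-squares fit
ols <- coef(lm(rent ~ x))
print(c(b0 = b0, b1 = b1))
print(ols)
```

On the unscaled feature the same learning rate would diverge, which is why scaling and gradient descent are usually applied together.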

Some Equations used in Linear Regression

Simple Linear Regression \[ y = \beta_0 + \beta_1x \]

  • \(y\): Dependent Variable (response)
  • \(\beta_0\): Intercept (constant)
  • \(\beta_1\): Slope (coefficient)
  • \(x\): Independent Variable
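
As a small worked example of the equation above, the least-squares estimates can be computed directly as \(\beta_1 = \mathrm{cov}(x, y) / \mathrm{var}(x)\) and \(\beta_0 = \bar{y} - \beta_1 \bar{x}\), and checked against lm(). The data is synthetic and the true values 2 and 3 are assumptions made up for the sketch.

```r
set.seed(123)

# Synthetic data generated from y = 2 + 3x plus noise
x <- runif(100, 0, 10)
y <- 2 + 3 * x + rnorm(100, sd = 1)

# Closed-form least-squares estimates for simple linear regression
beta1 <- cov(x, y) / var(x)           # slope
beta0 <- mean(y) - beta1 * mean(x)    # intercept

# The same estimates from R's built-in fit
fit <- lm(y ~ x)
print(c(beta0 = beta0, beta1 = beta1))
print(coef(fit))

# Predict the response for a new value of x
predict(fit, newdata = data.frame(x = 4))
```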

Multiple Linear Regression \[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n \]
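
In R, a multiple regression is fitted by listing the predictors on the right-hand side of the formula, and each predictor gets its own coefficient. The sketch below uses two synthetic predictors with made-up true coefficients (1, 2 and -0.5); the names x1, x2 and n_obs are assumptions for illustration.

```r
set.seed(7)

# Synthetic data with two predictors
n_obs <- 200
x1 <- rnorm(n_obs)
x2 <- rnorm(n_obs)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n_obs, sd = 0.3)

# Fit y = beta_0 + beta_1 * x1 + beta_2 * x2
fit <- lm(y ~ x1 + x2)
estimates <- coef(fit)
print(estimates)

# Predict for a new observation
predict(fit, newdata = data.frame(x1 = 0.5, x2 = -1))
```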

Loss and Cost functions

  • Loss Function
    • It measures the error for a single data point. For linear regression the most common choice is the squared error, where \(\hat{y_i}\) is the model's prediction for the \(i\)-th point. \[ \text{Loss}(y_i, \hat{y_i}) = (y_i - \hat{y_i})^2 \]
  • Cost Function
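    • It aggregates the error across all data points in the dataset. In linear regression, the cost function is the average of the loss over all \(m\) data points, which is the Mean Squared Error (MSE). Here \(\theta\) denotes the vector of coefficients \((\beta_0, \beta_1, \dots, \beta_n)\).

\[ J(\theta) = \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y_i})^2 \]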
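
The difference between the two can be seen on a tiny example. The four observed values and predictions below are made up for illustration: the loss is one number per data point, and the cost is their average.

```r
# Hypothetical observed values and model predictions
y     <- c(3, 5, 7, 9)
y_hat <- c(2.5, 5.5, 7, 10)

# Loss: squared error for each individual data point
loss <- (y - y_hat)^2
print(loss)   # 0.25 0.25 0.00 1.00

# Cost: average of the per-point losses
cost <- mean(loss)
print(cost)   # 0.375
```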

Housing Data

We use a housing dataset to showcase the graphs and plots. All the columns of the dataset:

data = read.csv("housing.csv")
colnames(data)
##  [1] "Posted.On"         "BHK"               "Rent"             
##  [4] "Size"              "Floor"             "Area.Type"        
##  [7] "Area.Locality"     "City"              "Furnishing.Status"
## [10] "Tenant.Preferred"  "Bathroom"          "Point.of.Contact"

Scatter Plot using plotly for Size, BHK and Rent

Linear Plot for Rent and Size

Scatter Plot for Rent and Size
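
The code for the two Rent and Size plots is not shown here. One way to produce a comparable figure is sketched below in base R. The helper name plot_rent_vs_size is hypothetical, and the only assumption about the data is that it has numeric Rent and Size columns, as listed in the column names above.

```r
# Hypothetical helper: fit Rent on Size, draw the scatter plot and overlay the fitted line
plot_rent_vs_size <- function(df) {
  fit <- lm(Rent ~ Size, data = df)
  plot(df$Size, df$Rent, xlab = "Size", ylab = "Rent", main = "Rent vs Size")
  abline(fit, col = "red", lwd = 2)   # fitted regression line
  invisible(fit)                      # return the model so coefficients can be inspected
}

# Usage with the housing data (assumes housing.csv is in the working directory):
# data <- read.csv("housing.csv")
# fit <- plot_rent_vs_size(data)
# coef(fit)
```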

Code for the 3-D Plot

library(plotly)  # provides plot_ly(), layout() and the %>% pipe

# 3-D scatter of Size, BHK and Rent, with markers colored by City
p <- plot_ly(data, x = ~Size, y = ~BHK, z = ~Rent,
             type = 'scatter3d', mode = 'markers',
             marker = list(size = 5), color = ~City) %>%
  layout(title = "3d scatter plot",
         scene = list(xaxis = list(title = 'Size'),
                      yaxis = list(title = 'BHK'),
                      zaxis = list(title = 'Rent')))
p  # display the plot