2023-10-11

Linear Regression

Linear regression is a statistical method used in data analysis and machine learning to model the relationship between a dependent variable and one or more independent variables. It is widely used both to quantify how variables are related and to predict the dependent variable from new values of the independent variables.

The primary goal of linear regression is to find a linear equation that best describes how changes in the independent variables are associated with changes in the dependent variable.

Plant Growth Example

Suppose you are a biologist studying the growth of a specific type of plant, and you want to understand how an environmental factor, such as the amount of sunlight, influences the plant’s growth. You collect data on the amount of sunlight (in hours per day) and the corresponding plant growth (measured in centimeters) for a group of plants.

Figure: Plant growth versus hours of sunlight (Plotly)

Figure: Plant growth versus hours of sunlight (ggplot2)

Formula for Simple Linear Regression

The equation for simple linear regression is:

\[ Y = \beta_0 + \beta_1X + \epsilon \]

where:

  • \(Y\) is the dependent variable.

  • \(X\) is the independent variable.

  • \(\beta_0\) is the intercept (y-intercept) of the regression line.

  • \(\beta_1\) is the slope of the regression line.

  • \(\epsilon\) represents the error term.
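The text above does not say how \(\beta_0\) and \(\beta_1\) are chosen. Assuming the usual ordinary least squares criterion (the one R's lm() function applies), the estimates are the values that minimize the sum of squared errors over the observed data. The notation here is introduced only for this sketch: \((x_i, y_i)\) are the \(n\) observed pairs, \(\bar{x}\) and \(\bar{y}\) are their sample means, and hats mark estimated values.

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]

The second expression shows that the fitted line always passes through the point \((\bar{x}, \bar{y})\).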

Formula for Multiple Linear Regression

The equation for multiple linear regression is:

\[ Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_kX_k + \epsilon \]

where:

  • \(Y\) is the dependent variable.

  • \(X_1, X_2, \ldots, X_k\) are the independent variables.

  • \(\beta_0\) is the intercept.

  • \(\beta_1, \beta_2, \ldots, \beta_k\) are the coefficients associated with each independent variable.

  • \(\epsilon\) represents the error term.
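As a minimal sketch of the multiple regression equation above, the following fits a model with two independent variables using lm(). The data are simulated, and the second predictor (water), the variable names, and the "true" coefficients are assumptions made up for illustration; they are not part of the plant growth dataset.

```r
# Simulated data for illustration only; names and coefficients are hypothetical.
set.seed(42)
n <- 200
sunlight <- runif(n, min = 2, max = 10)  # X_1: hours of sunlight per day
water    <- runif(n, min = 0, max = 2)   # X_2: litres of water per day

# Arbitrary "true" model: beta_0 = 1, beta_1 = 0.8, beta_2 = 3, plus noise (epsilon)
growth <- 1 + 0.8 * sunlight + 3 * water + rnorm(n, sd = 0.5)
plants <- data.frame(sunlight, water, growth)

# Fit Y = beta_0 + beta_1 * X_1 + beta_2 * X_2 + epsilon
multi_model <- lm(growth ~ sunlight + water, data = plants)

# Estimated intercept and one coefficient per independent variable
estimates <- coef(multi_model)
print(round(estimates, 2))
```

Each estimated coefficient should land close to the value used in the simulation, and it is read as the expected change in growth for a one-unit change in that variable while the other variable is held fixed.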

R Code

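# Load plotly for interactive plots (it also provides the %>% pipe)
library(plotly)

# Read the data: hours of sunlight per day (X) and plant growth in cm (Y)
data1 <- read.csv('Plantgrowth.csv')

# Scatter plot of the observations
scatter_plot <- plot_ly(data1, x = ~Hours_of_Sunlight_X, y = ~Plant_Growth_Y, type = "scatter", mode = "markers", name = "Observations")

# Fit the simple linear regression Y = beta_0 + beta_1 * X + epsilon
regression_model <- lm(Plant_Growth_Y ~ Hours_of_Sunlight_X, data = data1)
# Fitted values on the regression line for each observed X
predicted_values <- predict(regression_model)

# Overlay the fitted line on the scatter plot
scatter_plot <- scatter_plot %>%
  add_trace(
    x = data1$Hours_of_Sunlight_X,
    y = predicted_values,
    type = "scatter",
    mode = "lines",
    name = "Linear Regression Line",
    line = list(color = "red")
  )

# Display the plot
scatter_plot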
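
The ggplot2 version of the figure is not accompanied by code here. One way it could be drawn is sketched below, assuming the ggplot2 package is installed. Because Plantgrowth.csv is not reproduced in this document, the sketch uses a simulated stand-in with the same column names; the values and the axis labels are assumptions for illustration.

```r
library(ggplot2)

# Simulated stand-in for Plantgrowth.csv (illustrative values only)
set.seed(1)
demo_data <- data.frame(Hours_of_Sunlight_X = seq(1, 12, by = 0.5))
demo_data$Plant_Growth_Y <- 2 + 1.5 * demo_data$Hours_of_Sunlight_X +
  rnorm(nrow(demo_data), sd = 1)

# Scatter plot with a least-squares line added by geom_smooth(method = "lm")
growth_plot <- ggplot(demo_data, aes(x = Hours_of_Sunlight_X, y = Plant_Growth_Y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "red") +
  labs(x = "Hours of sunlight per day", y = "Plant growth (cm)")

print(growth_plot)
```

Replacing demo_data with the data frame read from the CSV file should give the corresponding plot for the real measurements.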

R Code Explained

The X variable (Hours_of_Sunlight_X) is the number of hours of sunlight per day that the plants are exposed to.

The Y variable (Plant_Growth_Y) measures plant growth, that is, the increase in the size or height of the plants, in centimeters.

\(\beta_0\) is the intercept: the expected plant growth when there is no sunlight (\(X = 0\)).

\(\beta_1\) is the slope: the expected change in plant growth for each additional hour of sunlight. A positive \(\beta_1\) indicates that an increase in hours of sunlight is associated with an increase in plant growth, while a negative \(\beta_1\) suggests the opposite.

\(\epsilon\), the error term, captures all other factors that affect plant growth but are not included in the model.
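
To make this interpretation concrete, the sketch below extracts the intercept, slope, and residuals from a fitted model and predicts growth for a new sunlight value. The five measurements are made up for illustration and are not the contents of Plantgrowth.csv; only the column names are taken from the code above.

```r
# Made-up measurements for illustration; not the contents of Plantgrowth.csv.
example_data <- data.frame(
  Hours_of_Sunlight_X = c(2, 4, 6, 8, 10),
  Plant_Growth_Y      = c(5.1, 8.9, 13.2, 16.8, 21.0)
)
model <- lm(Plant_Growth_Y ~ Hours_of_Sunlight_X, data = example_data)

# beta_0: expected growth with zero hours of sunlight
beta_0 <- unname(coef(model)["(Intercept)"])
# beta_1: expected change in growth per additional hour of sunlight
beta_1 <- unname(coef(model)["Hours_of_Sunlight_X"])
# Residuals: the part of each observation the line does not explain
# (the sample counterpart of the error term epsilon)
residuals_hat <- residuals(model)

# Predicted growth for a plant receiving 7 hours of sunlight per day
predicted_7h <- unname(predict(model, newdata = data.frame(Hours_of_Sunlight_X = 7)))

print(c(intercept = beta_0, slope = beta_1, predicted_7h = predicted_7h))
```

Note that the intercept is an extrapolation whenever the data contain no plants grown without sunlight, so it should be read with more caution than the slope.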