February 15, 2023

Introduction to Linear Regression

Linear regression is a statistical methodology for modeling the relationship between one or more independent variables with a single dependent variable. It is very useful for identifying, understanding and analyzing the relationships between variables in a data set. When using this tool the analyst is seeking to identify the line that best fits the relationship between a dependent variable and the independent variables.

Linear regression is used in a wide variety of fields, including but not limited to economics, finance, medicine, and social sciences. It allows analysts to make predictions about the value of a dependent variable in relation to the values of one or more independent variables and to identify possible strength and direction of the correlation between them.

Mathematical Formulation of Linear Regression

The mathematical formulation of linear regression is used to describe the relationship between the dependent variable (Y) and one or more independent variables (X). The basic equation for a linear regression model can be represented as:


\(Y_i = f(X_i,\beta) + \varepsilon_i\)

\(Y_i = \text{independent variable};\)   \(f = function;\)
\(\beta = \text{unknown parameters}\)   \(\varepsilon_i = \text{error terms}\)

Types of Linear Regression

Linear regression can be classified into two main types:
Simple and Multiple.

Simple linear regression is used when there is a relationship between one independent variable and one dependent variable.

Multiple linear regression is used when there is a relationship between one dependent variable and multiple independent variables.

The type of linear regression used depends on the nature of the data and the relationship between the dependent and independent variables.

Linear Regression using plotly

Here we have used plotly to demonstrate the relationship between the age of orange trees in days and the tree’s trunk circumference in mm.

model: \(\text{Circumference} = \beta_0 + \beta_1\cdot \text{Age} + \varepsilon; \hspace{1cm} \varepsilon \sim \mathcal{N}(0; \sigma^2)\)
fitted: \(\text{Circumference} = \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{Age}\)                      \(\hat{\beta}_0 = b_0 - \text{estimate of}\beta_0\);  \(\hat{\beta}_1 = b_1 - \text{estimate of}\beta_1\)

Simple Linear Regression using ggplot

The scatter plot shows the relationship between the weight of a car (independent variable) and its miles per gallon (dependent variable). The line of best fit represents the simple linear regression model, and its equation can be used to predict the miles per gallon of a car based on its weight.

## `geom_smooth()` using formula = 'y ~ x'

Multiple Linear Regression

Multiple linear regression is a statistical technique employed to create a model that explains the association between one dependent variable and two or more independent variables. The goal of multiple linear regression is to determine the line that best represents the relation between the dependent variable and the independent variables. The example below shows the correlation between the price of a diamond (dependent variable) and the carat weight and clarity (both independent variables).

## `geom_smooth()` using formula = 'y ~ x'

Assumptions of Linear Regression

Linear regression is based on several assumptions that must be met in order for the results to be valid. These assumptions are:


1/ The relationship between the independent and dependent variables is linear.

2/ The observations are independent of each other.

3/ The variance of the errors is constant for all levels of the independent variable.

4/ The errors are normally distributed.

5/ The independent variables are not highly correlated with each other.


It’s important to check these assumptions before interpreting the results of a linear regression model. If these assumptions are not met, the results may be biased or inaccurate. To check the assumptions of a linear regression model, various diagnostic plots can be used, such as residual plots, normal probability plots, and scatter plots. These plots can help identify any violations of the assumptions and suggest ways to address them.The validity of a linear regression model depends on the assumptions being met. It’s important to check these assumptions before interpreting the results.

Conclusion

In this presentation, we covered the mathematical formulation of linear regression and its various types including simple linear regression and multiple linear regression.

Linear regression is a widely used statistical method for modeling the relationship between a dependent variable and one or more independent variables. It is used to understand and make predictions about data. Linear regression is an essential tool for data analysis and is used in a variety of fields including finance, marketing, and medicine. By understanding its underlying concepts, we can build accurate and effective models that can be used to make informed decisions and predictions based on data.

Code used in the above slides:

GGPLOT

# Load the mtcars dataset
data("mtcars")

# Create a scatter plot of mpg (dependent variable) vs. wt (independent variable)
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Weight (wt)", y = "Miles per Gallon (mpg)")
config(fig, displaylogo=FALSE)

Code used in the above slides continued:

GGPLOT_2

# Load the diamonds data set
data("diamonds")

# Create a scatter plot of price (dependent variable) vs. carat and clarity (independent variables)
ggplot(diamonds, aes(x = carat, y = price, color = clarity)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Carat Weight", y = "Price (in USD)")