Linear Regression Introduction

2025-06-07

Intro to Linear Regression

In this presentation we will discuss the importance of linear regression, including its common uses, and walk through several examples of it in RStudio.

Why Use Linear Regression?

Linear regression analysis is used to predict the value of one variable based on the value of another. The variable you want to predict is called the dependent variable. The variable you use to make the prediction is called the independent variable. In data analysis, understanding the relationship between two variables is often the starting point for deeper analysis and more detailed conclusions.

Summary of Data Set: Trees

In this presentation, we will be using the built-in dataset “trees” that ships with R. Here is a brief summary of the data to refer back to as we deepen our understanding of linear regression through graphs.

data(trees)
head(trees)

##   Girth Height Volume
## 1   8.3     70   10.3
## 2   8.6     65   10.3
## 3   8.8     63   10.2
## 4  10.5     72   16.4
## 5  10.7     81   18.8
## 6  10.8     83   19.7

Summary of Data Set: Cars

We will also be using the dataset “cars.” Here is a similar summary:

data(cars)
head(cars)

##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10

Using Plotly for Linear Regression

The yellow line through this graph is called the “line of best fit.” It condenses the points of a scatter plot into a single line that shows the trend of the relationship between the independent and dependent variables. In this case the line slopes upward, so the regression indicates a positive relationship: as the independent variable increases, the dependent variable tends to increase as well.
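
A graph of this kind could be built roughly as follows. This is only a sketch: it assumes the plot shows Girth against Volume from the “trees” dataset and that the plotly package is installed. The names fit, trees_fit and fig are illustrative and not taken from the slides.

```r
# Fit the straight line with base R; this part needs no extra packages.
data(trees)
fit <- lm(Volume ~ Girth, data = trees)

# Keep the fitted values next to the data, sorted by Girth so the
# line is drawn from left to right.
trees_fit <- data.frame(
  Girth  = trees$Girth,
  Volume = trees$Volume,
  Fitted = fitted(fit)
)
trees_fit <- trees_fit[order(trees_fit$Girth), ]

# Draw the points and the line of best fit (only if plotly is available).
if (requireNamespace("plotly", quietly = TRUE)) {
  fig <- plotly::plot_ly(trees_fit, x = ~Girth, y = ~Volume,
                         type = "scatter", mode = "markers", name = "Trees")
  fig <- plotly::add_lines(fig, y = ~Fitted, name = "Line of best fit",
                           line = list(color = "yellow"))
  fig
}
```

The line is computed by lm() and only drawn by plotly, so the same fitted values could be passed to any other plotting library.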

Using ggplot2 for Linear Regression: Code

Before moving on to the next graph, first we should look at the kind of code that goes into creating a graph for linear regression using RStudio:

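# Load ggplot2 (assumed to be installed) and the built-in mtcars data
library(ggplot2)
data(mtcars)

# Scatter plot of car weight (wt) against miles per gallon (mpg)
p <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

# Add the line of best fit from a linear model
p + geom_smooth(method = "lm")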

As we will see in the next slide, this code produces a scatter plot with a line of best fit running through it.

Finding L.R. in “mtcars” using ggplot2

This scatter plot displays the relationship between the miles per gallon (mpg) and the weight (wt) of the cars in the “mtcars” data set. The line of best fit slopes downward, so this is a simple linear regression with a negative slope: heavier cars tend to get fewer miles per gallon.
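
The direction of the line can also be checked numerically. The sketch below fits the same mpg-against-wt model with base R's lm(); the names fit_mtcars and slope are illustrative.

```r
# Fit mpg as a linear function of wt (weight in units of 1000 lbs).
data(mtcars)
fit_mtcars <- lm(mpg ~ wt, data = mtcars)

# The coefficients are the intercept and the slope of the line of best fit.
coef(fit_mtcars)

# A negative slope means mpg tends to fall as weight rises.
slope <- coef(fit_mtcars)[["wt"]]
slope < 0
```

The slope is the estimated change in mpg for each additional 1000 lbs of weight, which is the quantity the plotted line shows visually.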

Finding L.R. in “iris” using ggplot2

This scatter plot shows how the measurements of the iris flower vary across its three species. The line of best fit slopes upward, indicating a simple linear regression with a positive slope.
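
The slide does not say which iris measurements are plotted, so the sketch below assumes Petal.Length on the x-axis and Petal.Width on the y-axis purely for illustration. It shows that the sign of the slope matches the sign of the correlation between the two variables.

```r
# Assumed pair of variables: petal length (x) and petal width (y).
data(iris)
fit_iris <- lm(Petal.Width ~ Petal.Length, data = iris)

# Slope of the line of best fit.
iris_slope <- coef(fit_iris)[["Petal.Length"]]

# Pearson correlation between the same two variables.
iris_cor <- cor(iris$Petal.Length, iris$Petal.Width)

# In simple linear regression the slope and the correlation share a sign.
sign(iris_slope) == sign(iris_cor)
```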

Using Linear Regression to Predict Dependent Variables (without error)

The Linear Regression Equation (without error) looks like this:

\(\text{Y} = \beta\cdot \text{X} + \alpha\)

Where \(\text{Y}\) is the predicted value of the dependent variable, \(\beta\) is the slope (the change in Y for each one-unit increase in X), \(\text{X}\) is the value of the independent variable, and \(\alpha\) is the Y-intercept (the value of Y when X = 0).
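
To make the equation concrete, the sketch below estimates \(\alpha\) and \(\beta\) from the “trees” data and plugs a value of X into the formula by hand. The girth of 12 is a hypothetical input chosen for illustration, and the variable names are not from the slides.

```r
# Estimate the intercept (alpha) and slope (beta) for Volume against Girth.
data(trees)
fit <- lm(Volume ~ Girth, data = trees)
alpha <- coef(fit)[["(Intercept)"]]
beta  <- coef(fit)[["Girth"]]

# Hypothetical tree with a girth of 12 (illustrative value).
new_girth <- 12

# Apply Y = beta * X + alpha by hand.
predicted_volume <- beta * new_girth + alpha

# predict() performs the same calculation, so the two should agree.
predict_check <- predict(fit, newdata = data.frame(Girth = new_girth))
all.equal(predicted_volume, unname(predict_check))
```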

Using Linear Regression to Predict Dependent Variables (with error)

Real data rarely fall exactly on a straight line, so our predictions carry some error. To account for this, we add an error term to the equation.

We can find the dependent variable (Y) by using the simple linear regression model as shown:

\(\text{Y} = \beta_0 + \beta_1\cdot \text{X} + \varepsilon; \hspace{1cm}\varepsilon \sim \mathcal{N}(0, \sigma^2)\)

Where Y is our dependent variable, \(\beta_0\) is the Y-intercept, \(\beta_1\) is the slope coefficient, X is our independent variable, and \(\varepsilon\) is the random error term, assumed to be normally distributed with mean 0 and variance \(\sigma^2\).
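
One way to see what the error term does is to simulate data from this model and fit a line to it. In the sketch below the "true" values of \(\beta_0\), \(\beta_1\) and \(\sigma\) are made up for illustration; they do not come from any dataset in this presentation.

```r
# Made-up parameter values for the simulation (illustrative only).
set.seed(42)
n      <- 200
beta_0 <- 2
beta_1 <- 0.5
sigma  <- 1

# Generate X, draw normally distributed errors, and build Y from the model.
x       <- runif(n, min = 0, max = 10)
epsilon <- rnorm(n, mean = 0, sd = sigma)
y       <- beta_0 + beta_1 * x + epsilon

# Fitting a line to the noisy data gives estimates close to the true values.
sim_fit <- lm(y ~ x)
coef(sim_fit)

# The residual standard error is the estimate of sigma.
summary(sim_fit)$sigma
```

The estimates do not match the chosen values exactly because of the random error, which is the point of including \(\varepsilon\) in the model.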

So in the case of our Girth versus Volume graph, we can find Volume by filling in parts of this equation, using Girth for X: \(\text{Volume} = \beta_0 + \beta_1\cdot \text{Girth} + \varepsilon; \hspace{1cm}\varepsilon \sim \mathcal{N}(0, \sigma^2)\)
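
The pieces of this equation can be estimated from the “trees” data. The sketch below uses lm() and a prediction interval to show how the error term widens the range of plausible volumes; the girth of 12 and the object names are illustrative assumptions.

```r
# Estimate beta_0 and beta_1 for Volume against Girth.
data(trees)
volume_fit <- lm(Volume ~ Girth, data = trees)
coef(volume_fit)

# Estimate of sigma, the standard deviation of the error term.
sigma_hat <- summary(volume_fit)$sigma

# With an intercept in the model, the residuals sum to (numerically) zero.
residual_sum <- sum(residuals(volume_fit))

# A prediction interval for a hypothetical girth of 12 reflects the error
# term: it gives a range rather than a single value.
volume_interval <- predict(volume_fit,
                           newdata  = data.frame(Girth = 12),
                           interval = "prediction")
volume_interval
```

The columns fit, lwr and upr of volume_interval hold the point prediction and the lower and upper bounds of the default 95% prediction interval.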

Goodbye Slide

Thank you for watching my presentation on Linear Regression.