2023-06-09

What is Linear Regression?

Linear regression is a statistical method for describing the relationship between two variables. It also helps predict the value of one variable from the value of the other: one or more independent variables are used to predict the outcome of the dependent variable. The fitted curve can be a straight line or a polynomial, depending on what fits the data. For this lesson, we will focus on a straight line.

Slope-Intercept Form

\[ y = mx+b \]
This is the equation of the line that will be plotted once the regression is found. \(x\) and \(y\) are variables, \(m\) is the slope, and \(b\) is the \(y\)-intercept. Finding the regression line is not just a matter of plugging a pair of points into this equation, or into the point-slope form (\(y-y_1=m(x-x_1)\)): each pair of points on the scatter plot would give a different line. On the next slide, we will discuss how to find the regression line manually.
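
To see why a single pair of points is not enough, the short R sketch below applies the two-point slope formula to two different pairs of observations taken from the temperature and sales data used later in this lesson. The helper name `slope_between` is made up for illustration and is not part of any package.

```r
# Slope of the line through two points (x1, y1) and (x2, y2).
# slope_between is a hypothetical helper written for this example.
slope_between <- function(x1, y1, x2, y2) (y2 - y1) / (x2 - x1)

# Two pairs of (Temperature, Sale) observations from the lesson's data
m_low  <- slope_between(50, 206, 53, 246)  # first two rows of the table
m_high <- slope_between(78, 523, 80, 499)  # last two rows of the table

m_low   # about 13.33
m_high  # -12
```

The two pairs give very different slopes, which is why the regression formulas use every point at once.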

To Find Slope and \(y\)-Intercept

To find the slope:

\[ m = \frac{n(\Sigma xy)-(\Sigma x)(\Sigma y)}{n(\Sigma x^2)-(\Sigma x)^2} \]

To find the \(y\)-intercept:

\[ b = \frac{\Sigma y-m(\Sigma x)}{n} \]

NOTE: \(n\) is the number of values in the dataset.

Solved Slope and \(y\)-Intercept Table

##    Temperature Sale    xy x_squared y_squared
## 1           50  206 10300      2500     42436
## 2           53  246 13038      2809     60516
## 3           54  266 14364      2916     70756
## 4           62  301 18662      3844     90601
## 5           65  389 25285      4225    151321
## 6           68  411 27948      4624    168921
## 7           73  438 31974      5329    191844
## 8           77  478 36806      5929    228484
## 9           78  523 40794      6084    273529
## 10          80  499 39920      6400    249001
##      sum x sum y sum xy sum x^2 sum y^2
## [1,]   660  3757 259091   44660 1527409
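
As a minimal sketch, the slope and intercept formulas can be evaluated in R directly from the ten rows of the table. The data frame name `df1` and its column names follow the `lm()` call used later in the lesson; the variable names `m`, `b`, and `n` mirror the symbols in the formulas.

```r
# Data copied from the Temperature and Sale columns of the table
df1 <- data.frame(
  Temperature = c(50, 53, 54, 62, 65, 68, 73, 77, 78, 80),
  Sale        = c(206, 246, 266, 301, 389, 411, 438, 478, 523, 499)
)
x <- df1$Temperature
y <- df1$Sale
n <- nrow(df1)  # number of rows in the dataset

# Slope and y-intercept from the closed-form formulas
m <- (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x^2) - sum(x)^2)
b <- (sum(y) - m * sum(x)) / n

round(c(slope = m, intercept = b), 2)  # slope 10.12, intercept -292.04
```

These values agree with the coefficients that `lm()` reports for the same data.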

Scatterplot of Data

Code of Scatter Plot Using Plotly

library(plotly)

# df1 holds the Temperature and Sale columns from the table
mod <- lm(Sale ~ Temperature, data = df1)
x <- df1$Temperature
y <- df1$Sale

# Axis settings
xax <- list(
  title = "Temperature (°F)",
  titlefont = list(family = "Times New Roman")
)

yax <- list(
  title = "Sale ($)",
  titlefont = list(family = "Times New Roman"),
  range = c(200, 530)  # named range; upper bound sits above the largest sale (523)
)

# Scatter plot of the data with the fitted regression line drawn on top
fig <- plot_ly(x = x, y = y, type = "scatter", mode = "markers", name = "data",
               width = 800, height = 430) %>%
  add_lines(x = x, y = fitted(mod), name = "fitted") %>%
  layout(xaxis = xax, yaxis = yax,
         margin = list(l = 150, r = 50, b = 20, t = 40))

config(fig, displaylogo = FALSE)

Linear Equation Used by lm()

\[ \text{Sale} = \beta_0 + \beta_1 \cdot \text{Temperature} + \epsilon \]

Where \(\beta_0\) is the \(y\)-intercept (the predicted amount of sales if the Temperature is 0), \(\beta_1\) is the slope of the fitted line, and \(\epsilon\) is the error term: the difference between an observed sale and the value on the line. For our case, we will only be calculating \(\beta_0\) and \(\beta_1\).

How Do You Calculate Linear Regression Using lm()?

Fit the model with the command
mod <- lm(Sale ~ Temperature, data = df1)
and then print mod to see the result:

Call:
lm(formula = Sale ~ Temperature, data = df1)

Coefficients:
(Intercept)  Temperature  
    -292.04        10.12  


\(\beta_0\) is the value under (Intercept) and \(\beta_1\) is the value under Temperature. This makes the equation:
\[ \text{Sale} = -292.04 + 10.12 \cdot \text{Temperature} \]
Now, all that needs to be done is to plug in a value for Temperature to get the predicted Sale, which is a point on the regression line.
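
As a rough illustration of plugging a temperature into the fitted equation, the sketch below predicts sales at 70 °F, first by hand from the coefficients and then with `predict()`. The temperature of 70 is an arbitrary value chosen for the example, and `df1` is rebuilt here from the table so the block runs on its own.

```r
# Data copied from the Temperature and Sale columns of the table
df1 <- data.frame(
  Temperature = c(50, 53, 54, 62, 65, 68, 73, 77, 78, 80),
  Sale        = c(206, 246, 266, 301, 389, 411, 438, 478, 523, 499)
)
mod <- lm(Sale ~ Temperature, data = df1)

# Pull the two coefficients out of the fitted model
b0 <- coef(mod)[["(Intercept)"]]
b1 <- coef(mod)[["Temperature"]]

# Prediction by hand: beta_0 + beta_1 * Temperature
by_hand <- b0 + b1 * 70

# Same prediction using predict() on a one-row data frame
by_predict <- predict(mod, newdata = data.frame(Temperature = 70))

by_hand     # about 416.17
by_predict  # same value
```

Using the rounded coefficients from the printed output gives a slightly different number (416.36), so it is safer to take the coefficients from the model object.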

How Do You Plot Linear Regression Using Plotly?

In the plotly pipeline, the line

add_lines(x=x, y = fitted(mod), name = "fitted")

draws the regression line on top of the scatter plot. For x, plug in your x values, which for this case is our Temperature data. For y, call fitted() with the model created on the previous slide as its argument; it returns the predicted Sale for each Temperature. The argument name = "fitted" only sets the label shown for this line in the legend.
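
A quick way to check what `fitted()` returns is to compare it with the line equation evaluated at each observed temperature. This is only a sketch; `df1` is rebuilt from the table so the block is self-contained, and `line_y` is a name introduced for the example.

```r
# Data copied from the Temperature and Sale columns of the table
df1 <- data.frame(
  Temperature = c(50, 53, 54, 62, 65, 68, 73, 77, 78, 80),
  Sale        = c(206, 246, 266, 301, 389, 411, 438, 478, 523, 499)
)
mod <- lm(Sale ~ Temperature, data = df1)

# Evaluate beta_0 + beta_1 * Temperature at every observed temperature
line_y <- coef(mod)[["(Intercept)"]] + coef(mod)[["Temperature"]] * df1$Temperature

# fitted() gives the same numbers, one predicted Sale per row of df1
all.equal(unname(fitted(mod)), line_y)  # TRUE
```

Because the fitted values all lie on one straight line, connecting them with `add_lines()` draws the regression line.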

What Kinds of Linear Regression Can You Have?

In the previous example, the regression line has a positive slope: sales rise as temperature rises. The other kind you can have is a negative slope, where one variable falls as the other rises. If there is no correlation, the slope is close to zero and the points scatter widely around the line, so the error is larger. The next couple of slides show examples of a negative correlation and no correlation using ggplot.
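
The sketch below shows one way such examples could be produced. The data are simulated, not taken from the lesson: the seed, the sample size, and the names `sim`, `y_neg`, and `y_none` are all assumptions made for illustration. The plotting step is skipped if ggplot2 is not installed.

```r
set.seed(42)  # arbitrary seed so the simulated data are reproducible

# Simulated (hypothetical) data: one downward trend, one with no trend
x <- seq(50, 80, length.out = 40)
sim <- data.frame(
  x      = x,
  y_neg  = 600 - 5 * x + rnorm(40, sd = 15),  # falls as x rises, plus noise
  y_none = rnorm(40, mean = 300, sd = 15)     # unrelated to x
)

# Fit a straight line to each case
fit_neg  <- lm(y_neg ~ x, data = sim)
fit_none <- lm(y_none ~ x, data = sim)

slope_neg  <- coef(fit_neg)[["x"]]   # clearly negative
slope_none <- coef(fit_none)[["x"]]  # close to zero

# R-squared: share of the variation in y explained by the line
r2_neg  <- summary(fit_neg)$r.squared
r2_none <- summary(fit_none)$r.squared

# Scatter plots with a fitted straight line, if ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p_neg <- ggplot(sim, aes(x = x, y = y_neg)) +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
  p_none <- ggplot(sim, aes(x = x, y = y_none)) +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
}
```

Printing `p_neg` or `p_none` displays the corresponding plot. Comparing `r2_neg` with `r2_none` shows how much less of the variation the line explains when there is no correlation.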

Negative Correlation

No Correlation