Linear Regression

2024-03-18

What is Simple Linear Regression?

It is a statistical method for analyzing the relationship between quantitative variables.
We can use it to predict the value of an output given information about the input variable.
Simple linear regression (SLR) works with two variables and produces a best-fit line by minimizing the error between the line and the data.
The two variables involved in SLR are the predictor (independent variable) and the response (dependent variable).

Let’s Look at Some Data

We can import a dataset on the amount of money spent advertising a product through TV, radio, and newspaper, along with the product's sales.

'data.frame':   200 obs. of  4 variables:
 $ TV       : num  230.1 44.5 17.2 151.5 180.8 ...
 $ Radio    : num  37.8 39.3 45.9 41.3 10.8 48.9 32.8 19.6 2.1 2.6 ...
 $ Newspaper: num  69.2 45.1 69.3 58.5 58.4 75 23.5 11.6 1 21.2 ...
 $ Sales    : int  22100 10400 12000 16500 17900 7200 11800 13200 4800 15600 ...

In this data frame, three quantitative variables could serve as the predictor: money spent on TV advertising, money spent on radio advertising, and money spent on newspaper advertising.

The response variable is the sales of the product, since sales can be considered dependent on the other variables.

Let’s consider our first example with money spent on TV advertising as the predictor.

Code for Linear Regression

library(plotly) # provides plot_ly(), add_lines(), layout(), and %>%

# assumes the advertising data frame shown above is already loaded
mod <- lm(Sales ~ TV, data = advertising) # fit a linear regression model
x <- advertising$TV
y <- advertising$Sales

# axis titles; the y-axis range is fixed so the whole cloud of points is visible
xax <- list(title = "Money Spent on TV Advertising")
yax <- list(title = "Product Sales", range = c(0, 40000))

# scatter plot using plotly with best-fit line
fig <- plot_ly(x = x, y = y, type = "scatter", mode = "markers", name = "data",
               width = 690, height = 300) %>%
  add_lines(x = x, y = fitted(mod), name = "fitted") %>%
  layout(xaxis = xax, yaxis = yax,
         margin = list(l = 150, r = 50, b = 20, t = 40))

Show The Plot We Created

We can see from the above plot that there is a positive linear relationship between these two variables. We can measure how strong this relationship is with the cor function: cor(advertising$TV, advertising$Sales) returns 0.9012079.

A correlation of 0.901 indicates a very strong positive correlation between these two variables. In general, a correlation coefficient close to 1 (positive) or -1 (negative) indicates a strong linear relationship, while a correlation coefficient close to 0 indicates a weak linear relationship.
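
To make the coefficient less of a black box, here is a minimal sketch that computes Pearson's correlation by hand and checks it against cor. It uses R's built-in cars dataset (speed and stopping distance) as a stand-in, because the advertising file does not ship with R; the variable names r_manual and r_builtin are illustrative only.

```r
# Stand-in data: R's built-in cars dataset (not the advertising data)
x <- cars$speed
y <- cars$dist

# Pearson correlation: how x and y vary together,
# scaled by how much each varies on its own
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

r_builtin <- cor(x, y) # cor() uses the Pearson method by default

# TRUE when the two values agree up to floating-point error
isTRUE(all.equal(r_manual, r_builtin))
```

With the advertising data loaded, swapping in advertising$TV and advertising$Sales for x and y should reproduce the value reported above.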

What Does the Simple Linear Regression Model Look Like?

The model can simply be expressed as a line with the following format:

$y = \beta_0 + \beta_1 x + \varepsilon$

$\beta_0$ is the y-intercept (the constant), $\beta_1$ is the slope, and $\varepsilon$ is the random error component.

From the model fitted for our previous plot, we can extract estimates of these values (for example with coef(mod)). We get Intercept: 6974.82 and TV: 55.46.

This means that our equation looks like this:

$y = 6974.82 + 55.46x$

Using this best-fit line, we can predict the value of the sales given how much money is spent advertising via TV. For instance, let’s estimate how many sales a product will have if the money spent is $250.

$y = 6974.82 + 55.46\cdot250$

$y = 20839.82$, which seems consistent with our graph.
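
The same arithmetic can be sketched in R. The helper name predict_sales is hypothetical, and the coefficients are the rounded values reported above rather than ones read from a fitted model.

```r
# Rounded coefficients reported for the TV model
b0 <- 6974.82 # intercept
b1 <- 55.46   # slope for TV

# hypothetical helper: predicted sales for a given TV spend
predict_sales <- function(tv) b0 + b1 * tv

predict_sales(250)
```

When the fitted model is available, predict(mod, newdata = data.frame(TV = 250)) performs the same calculation with the unrounded coefficients, so its result may differ slightly in the last digits.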

Let’s Look at Another Predictor Variable for this Data

This time the predictor is money spent on newspaper advertising. Compared to the last example, this graph does not show as strong a positive linear relationship. There is a slight upward trend, but not a strong one. The correlation for these variables is 0.15796, which is very close to 0. We can say that there is a weak association between money spent on newspaper advertising and product sales.

Minimizing Error

As mentioned earlier, linear regression finds the best-fit line by minimizing error. But how does it do this? Consider this graph, which uses money spent on radio advertising as the predictor. The red lines are vertical segments from each data point to the best-fit line.
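
For a single predictor, the usual least-squares criterion (the one lm uses) has a closed-form answer: the slope and intercept that minimize the sum of the squared lengths of those vertical segments can be computed directly from the data. The sketch below does this by hand and compares the result with lm. It uses the built-in cars dataset as a stand-in for the advertising data, and the names b0_hat, b1_hat, manual, and from_lm are illustrative.

```r
# Stand-in data: R's built-in cars dataset (not the advertising data)
x <- cars$speed
y <- cars$dist

# least-squares estimates for one predictor
b1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2) # slope
b0_hat <- mean(y) - b1_hat * mean(x)                                # intercept

# the same line, fitted by lm()
fit <- lm(dist ~ speed, data = cars)

manual  <- c(b0_hat, b1_hat)
from_lm <- unname(coef(fit)) # c(intercept, slope)

# TRUE when both routes give the same line
isTRUE(all.equal(manual, from_lm))
```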

Mean Squared Error (MSE)

MSE is a metric used to evaluate how well a linear regression model performs. It is the average squared difference between the predicted values and the actual values of the response variable.

Each red line on the previous slide represents the difference between a predicted value and the corresponding actual value. These differences are called residuals.

MSE can be mathematically written as: $MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

$n$ is the number of observations in the dataset.
$y_i$ is the actual value of the response variable for observation $i$.
$\hat{y}_i$ is the predicted value of the response variable for observation $i$.
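
As a quick check of the formula, here is a minimal sketch with made-up numbers. The helper name mse and both vectors are hypothetical and are not taken from the advertising data.

```r
# hypothetical helper implementing the MSE formula
mse <- function(actual, predicted) mean((actual - predicted)^2)

actual    <- c(3, 5, 7) # made-up observed values
predicted <- c(2, 5, 9) # made-up model predictions

# errors are 1, 0, -2; squared errors are 1, 0, 4; their mean is 5/3
mse(actual, predicted)
```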

Why do we square the errors?

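We want there to be more of a “penalty” the further away a point is from the line, and squaring makes large errors count disproportionately. Squaring also makes every error non-negative, so points above the line cannot cancel out points below it.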
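
A small made-up example shows what squaring changes. The errors vector below is invented for illustration only.

```r
errors <- c(-3, 1, 2) # made-up errors: one large miss below, two small misses above

mean_raw     <- mean(errors)   # the signed errors sum to zero, hiding every miss
mean_squared <- mean(errors^2) # (9 + 1 + 4) / 3, dominated by the largest miss

c(raw = mean_raw, squared = mean_squared)
```

The error of size 3 contributes 9 to the squared total, more than the other two errors combined, which is the extra penalty for distant points.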

A lower MSE means better performance. Let’s calculate the MSE of all three plots on the next slide.

MSE of Our Plots

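# TV model: fit, take the fitted values, and average the squared errors
mod_TV <- lm(Sales ~ TV, data = advertising)
predicted_values <- predict(mod_TV) # with no new data, returns the fitted values
residuals <- advertising$Sales - predicted_values
mse_TV <- mean(residuals^2)
mse_TV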

[1] 5217744

# Radio model: same steps with Radio as the predictor
mod_Radio <- lm(Sales ~ Radio, data = advertising)
predicted_values <- predict(mod_Radio)
residuals <- advertising$Sales - predicted_values
mse_Radio <- mean(residuals^2)
mse_Radio

[1] 24384049

# Newspaper model: same steps with Newspaper as the predictor
mod_Newspaper <- lm(Sales ~ Newspaper, data = advertising)
predicted_values <- predict(mod_Newspaper)
residuals <- advertising$Sales - predicted_values
mse_Newspaper <- mean(residuals^2)
mse_Newspaper

[1] 27086773

The TV model had the lowest MSE, which implies that it fit the data best out of our three predictors. On the other hand, the newspaper model had the highest MSE, which means that it had the poorest fit to the data.
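
Since the three calculations differ only in the predictor, they could be wrapped in a small helper. This is one possible sketch, not the only way to organize it: the function name mse_for is hypothetical, and the built-in mtcars dataset stands in for the advertising data so that the example runs on its own.

```r
# hypothetical helper: fit response ~ predictor and return the MSE
# on the same data the model was fitted to
mse_for <- function(predictor, response, data) {
  fit <- lm(reformulate(predictor, response = response), data = data)
  mean(residuals(fit)^2)
}

# Stand-in data: built-in mtcars, predicting mpg from three candidate predictors
predictors <- c("wt", "hp", "qsec")
mse_all <- sapply(predictors, mse_for, response = "mpg", data = mtcars)

sort(mse_all) # lowest MSE (best fit to this data) first
```

With the advertising data loaded, sapply(c("TV", "Radio", "Newspaper"), mse_for, response = "Sales", data = advertising) should give the same three values as the slides above.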