Linear Regression

2023-09-16

Slide 1: Simple Linear Regression

Formula

Simple linear regression models the relationship between two variables using the following formula:

\[Y = \beta_0 + \beta_1X + \epsilon\]

\(Y\) is the dependent variable we are trying to predict.
\(X\) is the independent variable.
\(\beta_0\) is the y-intercept, representing the value of \(Y\) when \(X\) is zero.
\(\beta_1\) is the slope of the regression line.
\(\epsilon\) represents the error term, accounting for the variability in \(Y\) not explained by the model.

Slide 2: How does it fit?

R-squared (R²) Model

The R-squared (\(R^2\)) statistic quantifies the of fit of the linear regression model. It tells us how much of the variance in the dependent variable is explained by the independent variable. The formula for \(R^2\) is:

\[R^2 = \frac{{\text{Explained Variance}}}{{\text{Total Variance}}} = 1 - \frac{{\text{Residual Variance}}}{{\text{Total Variance}}}\]

\(R^2\) ranges from 0 to 1, where 0 indicates no relationship between variables, and 1 means a perfect fit.
Explained Variance represents the variance in \(Y\) explained by the regression model.
Total Variance is the total variance in \(Y\).
Residual Variance is the variance that remains unexplained by the model.

A higher \(R^2\) value indicates that the model explains a larger proportion of the variance in the data, suggesting a better fit.

Slide 3: No easy task

Calculating the slope \(\beta_1\) and intercept \(\beta_0\)

The slope (\(\beta_1\)) of the line of best fit comes from the formula:

\[\beta_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}\]

Where: - \(n\) is the number of data points. - \(X_i\) and \(Y_i\) are the individual data points. - \(\bar{X}\) and \(\bar{Y}\) are the means of the variables

-The intercept is found with the formula: \[\beta_0 = \bar{Y} - \beta_1\bar{X}\]

Slide 4: Welcome to Delaware

My real world data

The sales data used on the following slides is taken from the last 4 years by fiscal quarter of my largest account in my Delaware territory. I work full time in a business to business sales role. We sell several types of products, but most fall into two categories: High Power and Low Power.

Of note are the beginning of the COVID pandemic in calendar year 2020 and my arrival to the territory about 18 months later, as they are relevant to the data.

Slide 5: So am I any good at my job?

Model Evaluation

I am interested to know how I am performing as a sales rep post-COVID compared to pre-COVID data. First lets see how our data is structured

raw_data <- read.csv("Delaware_Data.csv", header = TRUE, sep = ",")
str(raw_data)

## 'data.frame':    18 obs. of  4 variables:
##  $ Quarter   : chr  "FY2020 Q1" "FY2020 Q2" "FY2020 Q3" "FY2020 Q4" ...
##  $ High.Power: num  640399 721478 615855 559149 644692 ...
##  $ Low.Power : num  250590 241096 214646 242906 224472 ...
##  $ Revenue   : num  997672 1011024 951628 850505 907925 ...

head(raw_data,1)

##     Quarter High.Power Low.Power  Revenue
## 1 FY2020 Q1     640399    250590 997671.8

Slide 5: But wait, there’s more!

Too many rows

The structure of the data shows one more row than I was expecting.

##      Quarter High.Power Low.Power   Revenue
## 13 FY2023 Q1   837655.0  334209.0 1191244.0
## 14 FY2023 Q2   753364.7  335483.5 1122763.2
## 15 FY2023 Q3   744871.8  317669.0 1077075.8
## 16 FY2023 Q4   809482.0  269253.5 1116950.5
## 17 FY2024 Q1   711511.0  282812.8 1090648.9
## 18 FY2024 Q2   300113.0  173732.5  480840.5

Upon investigation, we see that it contains the current quarter, which isn’t over yet. Lets remove that entry before we plot.

df <- raw_data[-nrow(raw_data), ]

Slide 6a: How do I look?

Code for the plot with ggplot2

ggplot(df, aes(x = Quarter, y = Revenue, group = 1)) + 
  geom_point(col = "blue")+
  geom_line(col = "blue", 
            linewidth = 1, 
            linetype = "solid") +
  labs(x = "Quarter", 
       y = "Quarterly Revenue", 
       title = "Revenue by Quarter") +
  geom_vline(xintercept = which(df$Quarter == "FY2021 Q3") - 0.5, 
             linetype = "dashed", 
             col = "red") +
  geom_text(x = which(df$Quarter == "FY2021 Q3"), 
            y = max(df$Revenue), 
            label = "COVID outbreak", 
            vjust = 1.5,
            hjust = 0.1,
            col = "red") +
  geom_vline(xintercept = which(df$Quarter == "FY2022 Q2") + 0.5, 
             linetype = "dashed", 
             col = "black") +
  geom_text(x = which(df$Quarter == "FY2022 Q2"), 
            y = max(df$Revenue), 
            label = "Dave moves to Delaware", 
            vjust = 18,
            hjust = -0.15,
            col = "black") +
  geom_smooth(method = "lm", formula = y ~ x, col = "blue") +
  
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Slide 6b: The visual

The plot

Slide 7: Break it down

High Power, Low Power, and Total Revenue

Slide 8: The COVID impact

Low power stayed steady

We can see that Low Power sales were minimally impacted during the pandemic and tightly hugged our line of best fit.

Slide 9: Bring it home, this is getting long

Using R for linear regression

We can see from our models, COVID seems to have impacted our High Power sales, while Low Power held tighter to our best fit line. Linear Regression would have been cumbersome, at best, to do by hand. On a much larger data set, it would have been impossible.