2023-09-16
Simple linear regression models the relationship between two variables using the following formula:
\[Y = \beta_0 + \beta_1X + \epsilon\]
The R-squared (\(R^2\)) statistic quantifies the of fit of the linear regression model. It tells us how much of the variance in the dependent variable is explained by the independent variable. The formula for \(R^2\) is:
\[R^2 = \frac{{\text{Explained Variance}}}{{\text{Total Variance}}} = 1 - \frac{{\text{Residual Variance}}}{{\text{Total Variance}}}\]
A higher \(R^2\) value indicates that the model explains a larger proportion of the variance in the data, suggesting a better fit.
\[\beta_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}\]
Where: - \(n\) is the number of data points. - \(X_i\) and \(Y_i\) are the individual data points. - \(\bar{X}\) and \(\bar{Y}\) are the means of the variables
-The intercept is found with the formula: \[\beta_0 = \bar{Y} - \beta_1\bar{X}\]
The sales data used on the following slides is taken from the last 4 years by fiscal quarter of my largest account in my Delaware territory. I work full time in a business to business sales role. We sell several types of products, but most fall into two categories: High Power and Low Power.
Of note are the beginning of the COVID pandemic in calendar year 2020 and my arrival to the territory about 18 months later, as they are relevant to the data.
I am interested to know how I am performing as a sales rep post-COVID compared to pre-COVID data. First lets see how our data is structured
raw_data <- read.csv("Delaware_Data.csv", header = TRUE, sep = ",")
str(raw_data)
## 'data.frame': 18 obs. of 4 variables: ## $ Quarter : chr "FY2020 Q1" "FY2020 Q2" "FY2020 Q3" "FY2020 Q4" ... ## $ High.Power: num 640399 721478 615855 559149 644692 ... ## $ Low.Power : num 250590 241096 214646 242906 224472 ... ## $ Revenue : num 997672 1011024 951628 850505 907925 ...
head(raw_data,1)
## Quarter High.Power Low.Power Revenue ## 1 FY2020 Q1 640399 250590 997671.8
The structure of the data shows one more row than I was expecting.
## Quarter High.Power Low.Power Revenue ## 13 FY2023 Q1 837655.0 334209.0 1191244.0 ## 14 FY2023 Q2 753364.7 335483.5 1122763.2 ## 15 FY2023 Q3 744871.8 317669.0 1077075.8 ## 16 FY2023 Q4 809482.0 269253.5 1116950.5 ## 17 FY2024 Q1 711511.0 282812.8 1090648.9 ## 18 FY2024 Q2 300113.0 173732.5 480840.5
Upon investigation, we see that it contains the current quarter, which isn’t over yet. Lets remove that entry before we plot.
df <- raw_data[-nrow(raw_data), ]
ggplot(df, aes(x = Quarter, y = Revenue, group = 1)) +
geom_point(col = "blue")+
geom_line(col = "blue",
linewidth = 1,
linetype = "solid") +
labs(x = "Quarter",
y = "Quarterly Revenue",
title = "Revenue by Quarter") +
geom_vline(xintercept = which(df$Quarter == "FY2021 Q3") - 0.5,
linetype = "dashed",
col = "red") +
geom_text(x = which(df$Quarter == "FY2021 Q3"),
y = max(df$Revenue),
label = "COVID outbreak",
vjust = 1.5,
hjust = 0.1,
col = "red") +
geom_vline(xintercept = which(df$Quarter == "FY2022 Q2") + 0.5,
linetype = "dashed",
col = "black") +
geom_text(x = which(df$Quarter == "FY2022 Q2"),
y = max(df$Revenue),
label = "Dave moves to Delaware",
vjust = 18,
hjust = -0.15,
col = "black") +
geom_smooth(method = "lm", formula = y ~ x, col = "blue") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
We can see that Low Power sales were minimally impacted during the pandemic and tightly hugged our line of best fit.
We can see from our models, COVID seems to have impacted our High Power sales, while Low Power held tighter to our best fit line. Linear Regression would have been cumbersome, at best, to do by hand. On a much larger data set, it would have been impossible.