Is the money your organization spends on marketing having any meaningful impact on sales? The effect of each dollar spent on marketing on an organization’s sales is something all organizations should consider. A fiscally prudent organization should use its relatively scarce resources wisely. Thus, all organizations need to ask themselves, “Is the money I’m spending worth the return on sales?” Furthermore, organizations can delve deeper by asking, “For every dollar spent on marketing, how much are we getting in return on sales?” One can answer these questions using a simple linear regression model. As always, we will use a fabricated example to examine a store’s marketing efforts and their impact on sales. This will also serve as a more comprehensive primer on the simple linear regression model, the model that the majority of econometrics students are first exposed to. Graphically, linear regression can be aptly summarized with a scatter plot and a fitted regression line, like the one we will produce later in this tutorial.
Linear regression can be used for both prediction and confidence interval estimation.
In this tutorial, we will be using simple linear regression for confidence interval estimation to assess the dollar effect of a burger chain’s marketing efforts on its sales.
A simple linear regression model - a model with one dependent variable and only one independent variable - can be expressed mathematically as
\[ \begin{aligned} \mathbb{E}(y \vert x) = \mu_{y \vert x} = \beta_{0} + \beta_{1}x \end{aligned} \]
where \(\beta_{0}\) and \(\beta_{1}\) are the regression parameters that represent the intercept and slope, respectively. We can interpret the above equation as, “The expected value of the dependent variable \(y\) given \(x\) is equal to the mean of \(y\) given \(x\).” That expected value, according to the equation, is a linear function of \(x\) with intercept \(\beta_{0}\) and slope coefficient \(\beta_{1}\). The conditional mean \(\mathbb{E}(y \lvert x)\) is called a simple linear regression model because it contains only one non-random independent variable \(x\).
The simple linear regression model relies on a few assumptions. First, if \(\text{var}(y \lvert x) = \sigma^{2}\), then the data are said to be homoscedastic. This means that for any value of \(x\), the variance of \(y\) remains the same. However, this is often an unrealistic assumption in practice. Most “real world data” that you will work with are heteroscedastic, meaning the variance of \(y\) changes with the value of \(x\). Heteroscedasticity is not such a grave sin, depending on your objective and whether you have a large enough sample size; however, overcoming this issue is not within the scope of this tutorial. It is just one of the assumptions of the simple linear regression model.
The five assumptions of the simple linear regression model are

1. The mean value of \(y\) for each value of \(x\) is given by the linear function \(\mathbb{E}(y \lvert x) = \beta_{0} + \beta_{1}x\).
2. For each value of \(x\), the values of \(y\) are distributed about their mean with the same variance, \(\text{var}(y \lvert x) = \sigma^{2}\).
3. The sample values of \(y\) are uncorrelated with one another, \(\text{cov}(y_{i}, y_{j}) = 0\).
4. The variable \(x\) is not random and must take at least two different values.
5. (Optional) For each value of \(x\), the values of \(y\) are normally distributed about their mean.
Linear regression models are composed of two parts: a systematic component and a random component. The systematic component is
\[ \begin{aligned} \mathbb{E}(y \lvert x) = \beta_{0} + \beta_{1}x \end{aligned} \]
while the random component of \(y\) is the difference between \(y\) and its conditional mean value \(\mathbb{E}(y \lvert x) = \mu_{y \lvert x}\). Mathematically, this random component \(e\) can then be expressed as
\[ \begin{aligned} e = y - \mathbb{E}(y \lvert x) = y - \mu_{y \lvert x} = y - \beta_{0} - \beta_{1}x \end{aligned} \]
A more intuitive way of explaining the random component is as the difference between the actual observed value \(y_{i}\) and its estimated value \(\hat{y}_{i}\) (strictly speaking, this sample counterpart of \(e\) is the least squares residual \(\hat{e}_{i}\), which we return to below).
\[ \begin{aligned} \hat{e}_{i} = y_{i} - \hat{y}_{i} \end{aligned} \]
Knowing this, we can rewrite the previous assumptions in terms of the error term \(e\)

1. \(y = \beta_{0} + \beta_{1}x + e\), where \(\mathbb{E}(e) = 0\)
2. \(\text{var}(e) = \text{var}(y \lvert x) = \sigma^{2}\)
3. \(\text{cov}(e_{i}, e_{j}) = 0\)
4. \(x\) is not random and must take at least two different values
5. (Optional) \(e \sim N(0, \sigma^{2})\)
The goal in simple linear regression is to estimate the parameters (the \(\beta\)’s). To do so, we must first introduce the least squares principle, the method we will use to estimate these parameters. The least squares principle states that the “line of best fit” is the one that minimizes the sum of the squared vertical distances from each point \((x_{i}, y_{i})\) to the regression line. In terms of notation, the parameter estimates of \(\beta_{0}\) and \(\beta_{1}\) are generally denoted as \(b_{0}\) and \(b_{1}\), respectively.
The equation for the line of best fit is given by
\[ \begin{aligned} \hat{y}_{i} = b_{0} + b_{1}x_{i} \end{aligned} \]
The vertical distances between the line of best fit and each point \((x_{i}, y_{i})\) are referred to as the least squares residuals, which are given by
\[ \begin{aligned} \hat{e}_{i} = y_{i} - \hat{y}_{i} = y_{i} - b_{0} - b_{1}x_{i} \end{aligned} \]
In order to find \(b_{0}\) and \(b_{1}\), we want to find the values of the unknown parameters \(\beta_{0}\) and \(\beta_{1}\) that minimize the sum of squares function
\[ \begin{aligned} S(\beta_{0}, \beta_{1}) = \sum_{i = 1}^{N} e^{2}_{i} = \sum_{i = 1}^{N}(y_{i} - \beta_{0} - \beta_{1}x_{i})^{2} \end{aligned} \]
The least squares estimators \(b_{0}\) and \(b_{1}\) can be obtained using the following equations
\[ \begin{aligned} b_{1} = \frac{\sum_{i=1}^{N}(x_{i} - \bar{x})(y_{i} - \bar{y})}{\sum_{i=1}^{N}(x_{i} - \bar{x})^{2}} \end{aligned} \]
\[ \begin{aligned} b_{0} = \bar{y} - b_{1}\bar{x} \end{aligned} \]
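To make these formulas concrete, here is a minimal R sketch (using made-up numbers, not the burger data we analyze below) that computes \(b_{0}\) and \(b_{1}\) directly from the formulas above and checks them against R’s built-in \(\text{lm}()\) function.

x <- c(1, 2, 3, 4, 5)            # made-up regressor values
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)  # made-up response values
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
c(b0 = b0, b1 = b1)              # should match coef(lm(y ~ x))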
It is important to note that we may never know the true values of \(\beta_{0}\) and \(\beta_{1}\), and consequently may never know whether our estimates \(b_{0}\) and \(b_{1}\) are close to them. The least squares estimators are random variables because they depend on the random variable \(y\). However, if all of our assumptions hold, then \(\mathbb{E}(b_{1}) = \beta_{1}\). When an estimator’s expected value equals the true parameter value, the estimator is said to be unbiased. If all of the assumptions of the simple linear regression model hold, then the average of the estimates \(b_{0}\) and \(b_{1}\) obtained from many samples of size \(N\) will equal \(\beta_{0}\) and \(\beta_{1}\), respectively. The unbiasedness property depends on having many samples of data from the same population.
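Unbiasedness is easy to see in a quick simulation. The sketch below uses arbitrary “true” parameter values, chosen purely for illustration: it repeatedly draws samples that satisfy our assumptions, estimates the slope in each, and averages the estimates.

set.seed(42)            # for reproducibility
beta0 <- 2; beta1 <- 3  # arbitrary "true" parameters, for illustration only
b1.draws <- replicate(5000, {
  x <- runif(50, 0, 10)                       # regressor values for this sample
  y <- beta0 + beta1 * x + rnorm(50, sd = 2)  # errors with mean 0, constant variance
  coef(lm(y ~ x))[2]                          # slope estimate from this sample
})
mean(b1.draws)          # very close to beta1 = 3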
The variance of the random variable \(b_{1}\) is the average of the squared distance between the possible values of \(b_{1}\) and its mean \(\mathbb{E}(b_{1})\). That is to say, \(\text{var}(b_{1}) = \mathbb{E}[b_{1} - \mathbb{E}(b_{1})]^{2}\). The variance of an estimator measures the precision of the estimator in the sense that it tells us how much the estimates can vary from sample to sample. The smaller the variance, the greater the sampling precision.
If all of the assumptions of the linear regression model hold, then the variances and covariance of the estimators are
\[ \begin{aligned} \text{var}(b_{0}) = \sigma^{2} \Bigg[\frac{\sum_{i=1}^{N}x_{i}^{2}}{N \sum_{i=1}^{N}(x_{i} - \bar{x})^{2}} \Bigg] \end{aligned} \]
\[ \begin{aligned} \text{var}(b_{1}) = \frac{\sigma^{2}}{\sum{(x_{i} - \bar{x})^{2}}} \end{aligned} \]
\[ \begin{aligned} \text{cov}(b_{0}, b_{1}) = \sigma^{2} \Bigg[ \frac{-\bar{x}}{\sum_{i=1}^{N}{(x_{i} - \bar{x})^{2}}} \Bigg] \end{aligned} \]
If the regression assumptions hold, then the Gauss-Markov Theorem states that the estimators \(b_{0}\) and \(b_{1}\) have the smallest variance of all linear and unbiased estimators of \(\beta_{0}\) and \(\beta_{1}\). Such estimators are commonly referred to as the best linear unbiased estimators (BLUE).
Recall that the variance of the random error term \(e\) is
\[ \begin{aligned} \text{var}(e_{i}) = \mathbb{E}[e_{i} - \mathbb{E}(e_{i})]^{2} \end{aligned} \]
Recall the assumption that \(\mathbb{E}(e_{i}) = 0\). If this assumption holds, we can then rewrite the variance of the random error term \(e\) as
\[ \begin{aligned} \text{var}(e_{i}) = \sigma^{2} = \mathbb{E}[e_{i}^{2}] \quad \Longrightarrow \quad \hat{\sigma}^{2} = \frac{1}{N}\sum_{i=1}^{N}{e_{i}^{2}} \end{aligned} \]
Unfortunately, the above formula is of little to no use since the random error term \(e_{i}\) is unobservable. Fortunately, since \(e_{i} = y_{i} - \mathbb{E}(y \vert x) = y_{i} - \beta_{0} - \beta_{1}x_{i}\), we can replace the unknown parameters \(\beta_{0}\) and \(\beta_{1}\) with their least squares estimates \(b_{0}\) and \(b_{1}\) to obtain
\[ \begin{aligned} \hat{e}_{i} = y_{i} - \hat{y}_{i} = y_{i} - b_{0} - b_{1}x_{i} \end{aligned} \]
The unbiased estimator of the variance of the random error term is
\[ \begin{aligned} \hat{\sigma}^{2} = \frac{\sum{\hat{e}_{i}^{2}}}{N - 2} \end{aligned} \]
where \(2\) represents the number of parameters (in this case \(\beta_{0}\) and \(\beta_{1}\)). We refer to this as having \(N - 2\) degrees of freedom.
Now that we have an unbiased estimator of the variance of our random error term \(e\), we can estimate the variances and covariance of the least squares estimators.
\[ \begin{aligned} \widehat{\text{var}(b_{0})} = \hat{\sigma}^{2} \Bigg[ \frac{\sum{x_{i}^{2}}}{N \sum_{i=1}^{N}{(x_{i} - \bar{x})^{2}}} \Bigg] \qquad \widehat{\text{var}(b_{1})} = \frac{\hat{\sigma}^{2}}{\sum{(x_{i} - \bar{x})^{2}}} \qquad \widehat{\text{cov}(b_{0}, b_{1})} = \hat{\sigma}^{2} \Bigg[ \frac{-\bar{x}}{\sum_{i=1}^{N}{(x_{i} - \bar{x})^{2}}} \Bigg] \end{aligned} \]
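As a sanity check, these estimated variances and the covariance can be computed by hand and compared with what R reports. The following sketch again uses made-up numbers rather than the burger data; \(\text{vcov}()\) applied to a fitted \(\text{lm}\) object returns the same quantities arranged as a matrix.

x <- c(1, 2, 3, 4, 5)            # made-up data, as before
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
fit <- lm(y ~ x)
N <- length(y)
sigma2.hat <- sum(resid(fit)^2) / (N - 2)  # unbiased estimate of the error variance
var.b0 <- sigma2.hat * sum(x^2) / (N * sum((x - mean(x))^2))
var.b1 <- sigma2.hat / sum((x - mean(x))^2)
cov.b0.b1 <- sigma2.hat * -mean(x) / sum((x - mean(x))^2)
matrix(c(var.b0, cov.b0.b1, cov.b0.b1, var.b1), nrow = 2)  # should match vcov(fit)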
Our example will be a fictitious burger chain called “Andy’s”. The owner of Andy’s burger restaurant wants to know if his marketing efforts are having any impact on his sales (i.e. is the money he’s spending on advertising/marketing even worth it?).
First, let’s do some exploratory data analysis by gathering summary statistics and plotting a scatterplot to see if a relationship exists between the dependent and independent variables. Let’s also plot a histogram of each variable.
library(foreign)  # for read.dta(), which reads Stata .dta files
andy <- read.dta('/Users/czar.yobero/R/Data Sets/andy.dta')  # load the burger chain data
summary(andy)
## sales price advert
## Min. :62.40 Min. :4.830 Min. :0.500
## 1st Qu.:73.20 1st Qu.:5.220 1st Qu.:1.100
## Median :76.50 Median :5.690 Median :1.800
## Mean :77.37 Mean :5.687 Mean :1.844
## 3rd Qu.:82.20 3rd Qu.:6.210 3rd Qu.:2.700
## Max. :91.20 Max. :6.490 Max. :3.100
Our data set contains three variables: sales, price and advert. Since this is a tutorial on simple linear regression, we will only be using the advert variable as our independent variable for our regression model. Let’s examine the distribution of the dependent variable \(\text{sales}\) and the independent variable \(\text{advert}\).
library(ggplot2)
theme_set(theme_bw())
sales.hist <- ggplot(andy, aes(x = sales)) +
geom_histogram(fill = 'tomato2', col = 'black', aes(y = ..density..)) +
geom_density(adjust = 2)
advert.hist <- ggplot(andy, aes(x = advert)) +
geom_histogram(fill = 'yellow', col = 'black', aes(y = ..density..)) +
geom_density(adjust = 2)
sales.hist
advert.hist
Now, let’s see if we can spot a semblance of a relationship between sales and advertising.
ggplot(andy, aes(y = sales, x = advert)) +
geom_point(col = 'blue') +
geom_smooth(method = 'lm', se = FALSE, col = 'red')
There does appear to be somewhat of a positive linear relationship between sales and advertising. Now, let’s fit our model to see what the effects of advertising are on sales.
andy.fit <- lm(sales ~ advert, data = andy)
summary(andy.fit)
##
## Call:
## lm(formula = sales ~ advert, data = andy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.1658 -4.1950 -0.5776 4.9946 14.2481
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 74.1797 1.7990 41.234 <2e-16 ***
## advert 1.7326 0.8903 1.946 0.0555 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.37 on 73 degrees of freedom
## Multiple R-squared: 0.04932, Adjusted R-squared: 0.0363
## F-statistic: 3.787 on 1 and 73 DF, p-value: 0.0555
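As an aside, the t value and p-value that \(\text{summary}()\) reports for \(\text{advert}\) can be reproduced by hand from the estimate and its standard error, using the t-distribution with \(N - 2 = 73\) degrees of freedom:

1.7326 / 0.8903          # t value: estimate divided by its standard error, about 1.946
2 * pt(-1.946, df = 73)  # two-sided p-value, about 0.0555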
Both of our coefficients appear to be statistically significant, although the coefficient of \(\text{advert}\) is significant only at the \(10\%\) significance level. So, how can we interpret this? The summary tells us that for every one-dollar increase in advertising, sales increase by an average of $1.73. From a simple arithmetic standpoint, the money spent on advertising is not a waste, since the return in sales is greater than the amount spent on advertising. However, it is important to note that the model is not a very good one if judged on R-squared alone (the value suggests that the model explains only about 4.9% of the variance in sales).
Also note that the $1.73 increase in sales is the expected value of sales given advertising. This obviously does not mean that a dollar spent on advertising will always lead to a $1.73 increase in sales. For this reason, it is often useful to calculate confidence intervals: a range within which one can expect sales to change as a result of a one-dollar increase in advertising. The first step in obtaining a confidence interval for \(b_{1}\) is to choose a level of confidence \(1 - \alpha\). Generally, a 95% confidence level is chosen, but it’s really at your discretion (the default in R is \(0.95\)). We will calculate the confidence intervals with \(\alpha = 0.05\). Once we have chosen our level of confidence, we need to look up the critical value \(t_{c}\) that corresponds with it. Since the least squares estimators follow a t-distribution, we use the t-distribution with \(N - 2 = 73\) degrees of freedom. This can easily be done in R using the following
qt(1 - .05/2, df = 73) # This gives us the critical value
## [1] 1.992997
Like the normal distribution, the Student’s t-distribution is symmetrical, meaning that the value of \(t_{c}\) at \(1 - \alpha / 2\) is the same as the value at \(\alpha / 2\), except negative. Here is a quick demonstration in R.
qt(.05/2, df = 73)
## [1] -1.992997
The interval estimator for the parameter \(\beta_{1}\) is given by
\[ \begin{aligned} P[b_{1} - t_{c}\text{se}(b_{1}) \leq \beta_{1} \leq b_{1} + t_{c}\text{se}(b_{1})] = 1 - \alpha \end{aligned} \]
where \(\text{se}(b_{1})\) is the standard error of \(b_{1}\) and can be found by passing an \(\text{lm}\) object to R’s \(\text{summary}()\) function. We can easily compute the interval estimates at the 95% confidence level for \(\beta_{1}\) using
confint(andy.fit)
## 2.5 % 97.5 %
## (Intercept) 70.59435560 77.765090
## advert -0.04179661 3.507028
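These intervals can also be reproduced by hand from the interval estimator formula above, pulling \(b_{1}\) and \(\text{se}(b_{1})\) out of the fitted model:

b1 <- coef(andy.fit)['advert']                                   # slope estimate
se.b1 <- summary(andy.fit)$coefficients['advert', 'Std. Error']  # its standard error
b1 + c(-1, 1) * qt(1 - .05/2, df = 73) * se.b1                   # matches the advert row above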
According to the output, if one were to resample indefinitely and construct an interval from each sample in this way, about 95% of those intervals would contain the true value of \(\beta_{1}\); our 95% confidence is in the procedure, not in the particular interval estimate \([-0.04, 3.51]\). Obtaining the confidence interval sheds new light on the advertising efforts of Andy’s burger restaurant. It is entirely possible that a dollar spent on advertising actually decreases sales by as much as $0.04. The effect may be small, but it is definitely well worth noting.
Andy’s ROI can also be calculated as
\[ \begin{aligned} \text{ROI} = \frac{1.73 - 1.00}{1.00} \times 100 = 73 \% \end{aligned} \]
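The same arithmetic in R, using the unrounded slope estimate rather than the rounded $1.73, would look something like this:

b1 <- unname(coef(andy.fit)['advert'])  # estimated return per advertising dollar
(b1 - 1.00) / 1.00 * 100                # ROI in percent, roughly 73%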
Note that because we didn’t include the \(\text{price}\) variable in our model, our fitted regression model may not be as reliable as it could be. After all, price almost certainly has an effect on sales; it’s supply and demand 101.
Simple linear regression is a powerful tool for describing the relationship between a dependent variable \(y\) and a single independent variable \(x\). It can be used for both prediction and confidence interval estimation. Regression models are among the more versatile models available in the data scientist’s tool kit. In this case, we used a fictitious burger restaurant to see if the owner’s advertising efforts had any impact on its sales. We concluded that for every $1.00 spent on advertising, there was an average increase of $1.73 in sales.