A linear relationship between one or more explanatory variables and a response variable is one of the most common, and often most effective, ways to model the impact an explanatory variable has on a response. Sometimes, however, this simple modeling approach does not capture the relationship well.
Let’s look at the weekly COVID-19 death rate (COVID Deaths/Total Deaths) in the United States starting February 1st, 2020 and ending December 5th, 2020.
COVID = read.csv("C:\\Users\\jloda\\Desktop\\STAT 3615\\Loda\\Fall 2021\\Lectures\\Lecture 19 - Polynomial and MLR\\COVID.csv")
plot(COVID$Week.Index, COVID$Death.Rate, pch = 16, xlab = 'Week Index', ylab = 'COVID Death Rate', col = 'maroon')
abline(v = c(7,21), col = 'blue', lwd = 2)
To simplify this example, instead of looking at every week since February 1st, we will restrict ourselves to the initial peak period (March 14th - June 20th). These data are shown between the two vertical blue lines and can be found in the \(\texttt{COVID2.csv}\) file.
COVID2 = COVID[7:21,]
We could naively try and fit a simple linear regression, \(y = \beta_0 + \beta_1 x + \varepsilon\), to these data:
SLR = lm(Death.Rate ~ Week.Index, data = COVID2)
summary(SLR)
##
## Call:
## lm(formula = Death.Rate ~ Week.Index, data = COVID2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.108833 -0.059083 -0.003833 0.059917 0.103667
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.098333 0.064919 1.515 0.154
## Week.Index 0.001500 0.004431 0.339 0.740
##
## Residual standard error: 0.07414 on 13 degrees of freedom
## Multiple R-squared: 0.008739, Adjusted R-squared: -0.06751
## F-statistic: 0.1146 on 1 and 13 DF, p-value: 0.7404
plot(COVID2$Week.Index, COVID2$Death.Rate, pch = 16, xlab = 'Week Index', ylab = 'COVID Death Rate', col = 'maroon')
abline(SLR, lwd = 2)
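A useful diagnostic here is to plot the residuals from this fit against the explanatory variable; a minimal sketch (the axis labels are our own choice):
plot(COVID2$Week.Index, SLR$residuals, pch = 16, xlab = 'Week Index', ylab = 'Residual', col = 'maroon')
abline(h = 0, lty = 2)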
It is obvious this model does not fit the data well. There appears to be a roughly quadratic trend in the data (the residual plot above shows the same curvature). We can fit a quadratic relationship using a linear model. The model is \[y = \beta_0 + \beta_1 x + \beta_{11} x^2 + \varepsilon\] Using the \(\texttt{poly()}\) function, we can specify the polynomial degree for our explanatory variable. In this case, we want to model the response Death Rate as a second-order polynomial in the week index.
Quad_Reg = lm(Death.Rate ~ poly(Week.Index, degree = 2), COVID2)
summary(Quad_Reg)
##
## Call:
## lm(formula = Death.Rate ~ poly(Week.Index, degree = 2), data = COVID2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.038920 -0.035290 0.001265 0.022516 0.060080
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.119333 0.009751 12.238 3.89e-08 ***
## poly(Week.Index, degree = 2)1 0.025100 0.037767 0.665 0.519
## poly(Week.Index, degree = 2)2 -0.233125 0.037767 -6.173 4.78e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03777 on 12 degrees of freedom
## Multiple R-squared: 0.7626, Adjusted R-squared: 0.723
## F-statistic: 19.27 on 2 and 12 DF, p-value: 0.0001791
plot(COVID2$Week.Index, COVID2$Death.Rate, pch = 16, xlab = 'Week Index', ylab = 'COVID Death Rate', col = 'maroon')
abline(SLR, lwd = 2)
lines(COVID2$Week.Index, Quad_Reg$fitted.values, lwd = 2, col='Orange')
We see that the quadratic term is significant (\(P\)-value \(< 0.05\)) and the \(R^2\) improved dramatically. We could also include a cubic term in our model: \[y = \beta_0 + \beta_1 x + \beta_{11} x^2 + \beta_{111}x^3 + \varepsilon\]
Cubic_Reg = lm(Death.Rate ~ poly(Week.Index, degree = 3), COVID2)
summary(Cubic_Reg)
##
## Call:
## lm(formula = Death.Rate ~ poly(Week.Index, degree = 3), data = COVID2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.055176 -0.010266 -0.001486 0.011627 0.043722
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.119333 0.007159 16.669 3.74e-09 ***
## poly(Week.Index, degree = 3)1 0.025100 0.027727 0.905 0.38473
## poly(Week.Index, degree = 3)2 -0.233125 0.027727 -8.408 4.06e-06 ***
## poly(Week.Index, degree = 3)3 0.093056 0.027727 3.356 0.00641 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02773 on 11 degrees of freedom
## Multiple R-squared: 0.8827, Adjusted R-squared: 0.8507
## F-statistic: 27.59 on 3 and 11 DF, p-value: 2.037e-05
plot(COVID2$Week.Index, COVID2$Death.Rate, pch = 16, xlab = 'Week Index', ylab = 'COVID Death Rate', col = 'maroon')
abline(SLR, lwd = 2)
lines(COVID2$Week.Index, Quad_Reg$fitted.values, lwd = 2, col='Orange')
lines(COVID2$Week.Index, Cubic_Reg$fitted.values, lwd = 2, col='blue')
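Since the three fits are nested, they can also be compared directly with sequential \(F\)-tests via \(\texttt{anova()}\); a minimal sketch (output omitted here):
# Sequential F-tests: linear vs. quadratic vs. cubic
anova(SLR, Quad_Reg, Cubic_Reg)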
Be careful not to overfit the data. We typically limit ourselves to low-order polynomials because a polynomial of degree \(n-1\), where \(n\) is the number of observations, can fit the data perfectly, but the interpretation of the model coefficients is then lost.
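As an illustration (a sketch only, not a recommended analysis), with these \(n = 15\) weeks a degree-14 polynomial has as many coefficients as observations and can pass through every point, leaving no residual degrees of freedom:
# A degree n - 1 = 14 polynomial interpolates all 15 observations exactly
Overfit_Reg = lm(Death.Rate ~ poly(Week.Index, degree = 14), COVID2)
max(abs(Overfit_Reg$residuals)) # essentially zero: the fitted curve passes through every point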
Now that we’ve discussed simple linear regression (a single explanatory variable), we can add more variables into the mix. The mathematics is quite similar, but it is much easier to work with in matrix notation, so a linear algebra course will give you some familiarity with it. We’re going to focus on the implementation and interpretation in this course, not the mathematical derivations, but keep in mind that the underlying mathematics is not especially complex.
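For reference only (we will not use this directly in this course), the multiple linear regression model and its least-squares estimate can be written compactly in matrix notation as \[ \mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad \hat{\boldsymbol{\beta}} = (X^{\top}X)^{-1}X^{\top}\mathbf{y}. \]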
It’s reasonable to assume that a response might depend on multiple predictors - hence multiple linear regression. Load the ads.csv data set. A sales company wishes to model the sales (in thousands of dollars) of a product as a function of four advertising types: TV, radio, YouTube, and sidewalk ads. Do you think every dollar of advertising spent on YouTube will be as effective as a dollar spent on radio ads? Do you think a dollar spent on TV ads is even worth it at all?
ads = read.csv("C:\\Users\\jloda\\Desktop\\STAT 3615\\Loda\\Fall 2021\\Lectures\\Lecture 19 - Polynomial and MLR\\ads.csv")
First, just look at the data to make sure it imported correctly:
head(ads, 10)
## sales radio youtube tv sidewalk
## 1 423.8 73 94 89 29
## 2 334.6 84 97 71 70
## 3 316.6 36 66 83 73
## 4 217.3 41 47 56 62
## 5 145.5 49 68 12 23
## 6 411.7 99 73 94 18
## 7 335.5 1 91 69 51
## 8 302.1 24 43 74 22
## 9 329.4 66 87 66 43
## 10 413.1 63 80 85 6
Next, we should still plot the data. A two-dimensional plot is easy, but higher dimensions become more difficult, so instead we plot each pair of variables against one another. Remember, we’re most interested in which independent variables (ad types) are significantly correlated with the response (sales).
pairs(ads, pch=16, cex=0.5, main="Sales Data")
The variable names on the diagonal tell you which variable is plotted on the x- and y-axes. So the plot in the top row, second column shows sales on the y-axis and radio ad spending on the x-axis. Does there appear to be a significant relationship? Not really - it doesn’t look like radio ad spending is very strongly related to our sales. Notice this means that if you imagine the line of best fit through this plot, it’s mostly flat - that is, \(\beta_{\text{radio}}\approx 0\).
On the other hand, look at row 1, column 4, tv spending vs. sales. Here there does appear to be a strong relationship (and hence a slope \(\beta_{\text{tv}}\ne0\)). It looks like as ad spending on TV increases, sales increases as well.
But look carefully at row 1, column 5: this seems to show that as sidewalk spending increases, sales actually decreases, because of the downward trend. Could it be that \(\beta_{\text{sidewalk}}<0\)?
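A quick numerical complement to the pairs plot is the sample correlation matrix; a minimal sketch (output omitted, but the signs should match the trends described above):
# Pairwise correlations between sales and each advertising type
round(cor(ads), 2)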
So by this logic, our hunch is that YouTube and TV ads are significantly correlated with higher sales, sidewalk ads are correlated with lower sales, and radio ads aren’t strongly correlated with sales. Let’s fit a model with each predictor to check this. The FULL model (full because it will include all of the predictors) is something like:
\[ y = \beta_0 + \beta_{\text{radio}}x_{\text{radio}} + \beta_{\text{youtube}}x_{\text{youtube}} + \beta_{\text{tv}} x_{\text{tv}} + \beta_{\text{sidewalk}} x_{\text{sidewalk}} + \varepsilon\] where \(y\) is the sales, the \(x\)’s are the dollars spent on each individual advertising type, and the \(\beta\)’s, not including the intercept \(\beta_0\), are the regression coefficients (think of them as a slope for each predictor individually).
mlr_model = lm(sales ~ radio + youtube + tv + sidewalk, data=ads)
summary(mlr_model)
##
## Call:
## lm(formula = sales ~ radio + youtube + tv + sidewalk, data = ads)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.7015 -2.1641 -0.2385 2.3062 7.6927
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.77095 1.96507 14.132 4.97e-16 ***
## radio 0.01264 0.01680 0.752 0.457
## youtube 1.53533 0.01858 82.620 < 2e-16 ***
## tv 3.07179 0.02040 150.610 < 2e-16 ***
## sidewalk -0.88406 0.01645 -53.727 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.234 on 35 degrees of freedom
## Multiple R-squared: 0.9992, Adjusted R-squared: 0.9991
## F-statistic: 1.04e+04 on 4 and 35 DF, p-value: < 2.2e-16
We first test \(H_0: y_{\text{sales}}=\beta_0+\varepsilon\) vs. \(H_A: y_{\text{sales}}=\beta_0+\beta_{\text{radio}}x_{\text{radio}}+\beta_{\text{youtube}}x_{\text{youtube}}+\beta_{\text{tv}}x_{\text{tv}}+\beta_{\text{sidewalk}}x_{\text{sidewalk}}+\varepsilon\), where the alternative is the full model. Without getting into too much detail, the \(F\)-test at the bottom of the output is what tests this hypothesis. The \(F\) statistic, whose distribution is a ratio of (scaled) chi-squared variables, tests all of the slope coefficients simultaneously. Note its \(P\)-value is quite small (below the machine precision of R), so we reject \(H_0\) and conclude that the full model is preferred.
It is important to note that the null hypothesis above, \(H_0: y_{\text{sales}}=\beta_0+\varepsilon\), assumes that the best we can do is model the sales with the sample mean. That is, our best estimate of the product sales is just the average, \(\hat{y} = \bar{y}\), implying that none of the advertising types affect sales. In generic terms, the alternative hypothesis is that at least one of the advertising types affects sales in some way.
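This intercept-only model can be fit explicitly and compared to the full model with \(\texttt{anova()}\), which reproduces the overall \(F\)-test; a minimal sketch (output omitted):
null_model = lm(sales ~ 1, data = ads) # intercept-only model: predicts the sample mean of sales
anova(null_model, mlr_model) # F-test of the full model against the intercept-only model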
After we have obtained a significant global result - that is, our model accounts for some of the variation in the response variable - we move on to investigate each individual effect. We obtain estimates, standard errors, \(t\)-statistics (t value), and \(P\)-values (Pr(>|t|)) for each independent variable. Recall that the intercept is typically not of direct interest for our analysis, but we still leave it in the model.
The \(P\)-values give us a quick test result for the linear effect of each variable. We see that radio is not significant, TV and YouTube are significantly correlated with an increase in sales (but don’t necessarily cause increased sales), and sidewalk ads are significantly correlated with a decrease in sales (but don’t necessarily cause the decrease).
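Confidence intervals for the individual coefficients present the same information on an interpretable scale; a minimal sketch (output omitted):
# 95% confidence intervals for each regression coefficient
confint(mlr_model, level = 0.95)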
These inferences about the independent variables can be used for decision making, generating follow-up hypotheses, or driving further research.