Example 1: Run a simple linear regression on dataset Reynolds.

The problem: We have data on length of employment and sales. Can we use length of employment to predict sales?

Read in data:

library(readxl)
Reynolds <- read_excel("Reynolds.xlsx")
Reynolds
## # A tibble: 15 × 2
##    `Months Employed` `Scales Sold`
##                <dbl>         <dbl>
##  1                41           275
##  2               106           296
##  3                76           317
##  4               100           376
##  5                22           162
##  6                12           150
##  7                85           367
##  8               111           308
##  9                40           189
## 10                51           235
## 11                 0            83
## 12                12           112
## 13                 6            67
## 14                56           325
## 15                19           189

Rename columns, run model.

names(Reynolds) = c("Months", "Sold")
M1 = lm(Sold~Months, Reynolds)
summary(M1)
## 
## Call:
## lm(formula = Sold ~ Months, data = Reynolds)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -68.696 -30.450   0.514  27.891  78.677 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 113.7453    20.8135   5.465 0.000108 ***
## Months        2.3675     0.3384   6.996  9.4e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 48.49 on 13 degrees of freedom
## Multiple R-squared:  0.7901, Adjusted R-squared:  0.774 
## F-statistic: 48.95 on 1 and 13 DF,  p-value: 9.395e-06

Interpret the results: write out the model.

Our model: Scales Sold = 113.7453 + 2.3675 * Months Employed

Business interpretation: For every additional month of employment, we expect about 2.37 more scales sold.
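
As a quick sanity check, we can plug a value into the fitted equation by hand (our own illustration; 50 months is an arbitrary choice):

113.7453 + 2.3675 * 50   # predicted scales sold at 50 months: about 232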

Since this is a simple linear regression, we can use the multiple R-squared of 0.7901. Since the adjusted R-squared is above 65%, our model is pretty good. The final p-value is the result of a hypothesis test that validates our model (see the code sketch after this list):

  • H0: our model is bad (doesn’t explain the data)
  • HA: our model is good (explains the data)
  • Since the p-value 9.395e-06 < 5%, we reject H0 and say that our model is good.
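
If you would rather pull these numbers out of the fitted object than read them off the printed summary, base R exposes them directly; a minimal sketch:

coef(M1)                   # intercept and slope
summary(M1)$r.squared      # multiple R-squared: 0.7901
summary(M1)$adj.r.squared  # adjusted R-squared: 0.774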

Plot the residuals

par(mfrow=c(2,2)) # parameter for the plot: lay out the graphs 2 x 2
plot(M1)

Interpreting the residuals: (more info in the textbook)

  • Residuals vs Fitted: randomness, with roughly half the points above and half below: looks good.
  • Normal Q-Q: we want the dots to be close to the line: looks good.
  • Scale-Location: similar to Residuals vs Fitted, good randomness: looks good.
  • Residuals vs Leverage: most of the dots are between the two dashed lines: looks good.

Make a scatter plot

plot(Sold ~ Months, Reynolds, xlab="Length of Employment", ylab="Scales Sold", main="Reynolds Data")
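
One optional addition, not in the original call: overlay the fitted line from M1 so we can see the linear fit against the points.

plot(Sold ~ Months, Reynolds, xlab="Length of Employment", ylab="Scales Sold", main="Reynolds Data")
abline(M1, col="red")   # draw the fitted regression line over the scatter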

Additional exploration: run a quadratic regression model

Because the scatter plot is curved, we think: hm, there might be something nonlinear happening. Let’s try adding the square of Months.

M2 = lm(Sold~Months + I(Months^2), Reynolds)
summary(M2)
## 
## Call:
## lm(formula = Sold ~ Months + I(Months^2), data = Reynolds)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -55.606 -22.795  -7.623  27.637  42.686 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 61.429935  20.574335   2.986  0.01136 *  
## Months       5.819797   0.969767   6.001  6.2e-05 ***
## I(Months^2) -0.031010   0.008436  -3.676  0.00317 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34.61 on 12 degrees of freedom
## Multiple R-squared:  0.9013, Adjusted R-squared:  0.8848 
## F-statistic: 54.78 on 2 and 12 DF,  p-value: 9.252e-07

Interpreting the results:

Both terms are significant: p < 0.05 (6.2e-05 and 0.00317). The adjusted R-squared has improved, closer to 88%. The final p-value of 9.252e-07 is our hypothesis test that the model is good (see above).

The new model is:

  • Sold = 61.429935 + 5.819797 * Months - 0.031010 * Months^2
  • Since it’s not linear, I can’t give a single-slope business explanation: the effect of one more month depends on how many months you already have (see the sketch below).
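
A small sketch (our own helper, built from the fitted coefficients) computes the effect of one extra month at a given tenure:

# d(Sold)/d(Months) = 5.8198 + 2 * (-0.03101) * Months
marginal = function(m) coef(M2)[["Months"]] + 2 * coef(M2)[["I(Months^2)"]] * m
marginal(20)   # about 4.6 extra scales per month early on
marginal(90)   # about 0.2 extra scales per month later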

Plot the residuals.

par(mfrow=c(2,2)) # parameter for the plot: lay out the graphs 2 x 2
plot(M2)

Interpreting the results:

  • For Residuals vs Fitted and Scale-Location, we are looking for roughly equal variation above and below the line.
  • For Normal Q-Q, we are looking for points roughly along the line.
  • For Residuals vs Leverage, we want most points inside the dashed Cook’s distance (0.5) lines.

So it looks good!

Make a Prediction based on the model

If someone worked for the company 93 months, how much would they sell?

predict.lm(M2, newdata = data.frame(Months=93))
##        1 
## 334.4691

Interpretation: They would sell about 335 units. (Rounding up, since you can’t sell half a scale.)
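
A point estimate hides the uncertainty around it. The same predict() call can also return a prediction interval; a sketch (one extra argument, output not shown):

predict.lm(M2, newdata = data.frame(Months=93), interval = "prediction")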

Example 2: Run a linear model with 2 predictors on dataset Tyler.

The problem: we have data on price, expenditure, and sales. Can we use price and/or expenditure to predict sales?

Read in the data.

library(readxl)
Tyler <- read_excel("Tyler.xlsx")
Tyler
## # A tibble: 24 × 3
##    Price Expenditure Sales
##    <dbl>       <dbl> <dbl>
##  1   2            50   478
##  2   2.5          50   373
##  3   3            50   335
##  4   2            50   473
##  5   2.5          50   358
##  6   3            50   329
##  7   2            50   456
##  8   2.5          50   360
##  9   3            50   322
## 10   2            50   437
## # … with 14 more rows

Create a linear model with both predictors.

M3 = lm(Sales~Price + Expenditure, Tyler)
summary(M3)
## 
## Call:
## lm(formula = Sales ~ Price + Expenditure, data = Tyler)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -124.167  -57.417   -1.167   59.958   96.833 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  864.1667   101.0249   8.554 2.77e-08 ***
## Price       -281.0000    35.9052  -7.826 1.17e-07 ***
## Expenditure    4.4800     0.5863   7.641 1.71e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 71.81 on 21 degrees of freedom
## Multiple R-squared:  0.8507, Adjusted R-squared:  0.8365 
## F-statistic: 59.82 on 2 and 21 DF,  p-value: 2.13e-09

Interpretation: Both predictors are highly significant, and yet the R-squared is lower than we might like. I wonder what would happen if we added an interaction term?

M4 = lm(Sales~Price + Expenditure + I(Price*Expenditure), Tyler)
summary(M4)
## 
## Call:
## lm(formula = Sales ~ Price + Expenditure + I(Price * Expenditure), 
##     data = Tyler)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.167 -20.792   0.333  15.583  60.333 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -275.8333   112.8421  -2.444 0.023898 *  
## Price                   175.0000    44.5468   3.928 0.000832 ***
## Expenditure              19.6800     1.4274  13.788 1.13e-11 ***
## I(Price * Expenditure)   -6.0800     0.5635 -10.790 8.68e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28.17 on 20 degrees of freedom
## Multiple R-squared:  0.9781, Adjusted R-squared:  0.9748 
## F-statistic: 297.9 on 3 and 20 DF,  p-value: < 2.2e-16
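
A side note on syntax: I(Price*Expenditure) works, but R also has a shorthand for interactions. This sketch fits the identical model:

# Price * Expenditure expands to Price + Expenditure + Price:Expenditure
M4b = lm(Sales ~ Price * Expenditure, Tyler)   # same fit as M4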

Interpreting the summary of M4: all terms are significant, the adjusted R-squared is very close to 1, and the hypothesis-test p-value < 2.2e-16 is very significant. This looks like an excellent model, significantly improved from the last one.

Interpretation:

Sales = -275.8333 + 175.0000 * Price + 19.6800 * Expenditure - 6.0800 * Price * Expenditure

Business interpretation:

  • Because of the interaction term, the coefficients can’t be read one at a time: raising Price by 1 changes Sales by 175 - 6.08 * Expenditure, not by a flat 175.
  • Likewise, raising Expenditure by 1 changes Sales by 19.68 - 6.08 * Price.
  • The negative interaction coefficient (-6.08) means that high Price and high Expenditure together pull Sales down (see the sketch below).
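
A minimal sketch of that calculation (our own helper function, reading the fitted coefficients):

# Effect on Sales of a 1-unit price increase at a given Expenditure:
# d(Sales)/d(Price) = 175 - 6.08 * Expenditure
price_effect = function(e) coef(M4)[["Price"]] + coef(M4)[["I(Price * Expenditure)"]] * e
price_effect(50)   # 175 - 6.08*50 = -129: at Expenditure 50, raising price cuts sales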

Make a prediction.

If price is 2 and expenditure is 50, what do we expect sales to be?

predict.lm(M3, newdata=data.frame(Price=2,Expenditure=50))
##        1 
## 526.1667

Answer: about 526. (Note this uses the two-predictor model M3; a prediction from the interaction model M4 is sketched below.)
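
Since M4 fit much better, we would normally predict with it instead; a sketch (working from the printed coefficients, the answer comes out to about 450):

predict.lm(M4, newdata=data.frame(Price=2, Expenditure=50))
# by hand: -275.8333 + 175*2 + 19.68*50 - 6.08*2*50 = 450.17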

Example 3: Run a nonlinear regression (after a linear regression) on dataset MPG.

If the data is not normal, we need to do a nonlinear regression. What follows is one type of nonlinear regression.

No underlying assumptions to check. (Though you should always have more than 30 data points for any of this; if not, we need to be very cautious about the conclusions and make a strong disclaimer.)

Read the data and create a model.

library(readxl)
MPG <- read_excel("MPG.xlsx")
M5 = lm(MPG~Weight, MPG)   # fresh name: M4 is already taken by the Tyler model above
summary(M5)                # the interpretation below reads from this output (not reproduced here)

Interpretation: The predictor is highly significant, the R-squared is very good, and the hypothesis test tells us we have a good model.

Every time the weight increases 1 unit, the MPG drops 0.012 units.

par(mfrow=c(2,2)) # parameter for the plot: lay out the graphs 2 x 2
plot(M5)

Residuals look good. (see above)

This model looks pretty good, so we might stick with it.
However, there are cases where we want to try transforming. If the data is not normal, the standard move is to take the log of the response (one way to check is sketched below).
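
A quick optional check on normality before transforming (our own addition, using base R on the residuals of M5):

par(mfrow=c(1,1))        # back to a single plot
hist(resid(M5))          # roughly bell-shaped?
shapiro.test(resid(M5))  # formal test: a small p-value suggests non-normality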

MPG$LogM = log(MPG$MPG)   # add a log-transformed copy of the response
M6 = lm(LogM~Weight, MPG)
summary(M6)
## 
## Call:
## lm(formula = LogM ~ Weight, data = MPG)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.09125 -0.04079 -0.01536  0.03736  0.10310 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.524e+00  9.932e-02   45.55 6.26e-13 ***
## Weight      -5.011e-04  3.722e-05  -13.46 9.84e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06425 on 10 degrees of freedom
## Multiple R-squared:  0.9477, Adjusted R-squared:  0.9425 
## F-statistic: 181.2 on 1 and 10 DF,  p-value: 9.842e-08
plot(LogM~Weight, MPG, xlab="Weight", ylab="log(MPG)")

par(mfrow=c(2,2))
plot(M6)

If we have a weight of 2500, what is the MPG?

logM = predict.lm(M6, newdata=data.frame(Weight=2500))  # prediction on the log scale
exp(logM)                                               # back-transform to MPG
##        1 
## 26.35056
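
Interpretation: a car weighing 2500 would get about 26.4 MPG. If we also wanted a range, the same back-transform applies to a prediction interval; a sketch (output not shown):

exp(predict.lm(M6, newdata=data.frame(Weight=2500), interval="prediction"))
# exponentiate the fit and both interval endpoints to return to the MPG scale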