The problem: We have data on time of emploment and sales. Can we use length of employment to predict sales?
library(readxl)
Reynolds <- read_excel("Reynolds.xlsx")
Reynolds
## # A tibble: 15 × 2
## `Months Employed` `Scales Sold`
## <dbl> <dbl>
## 1 41 275
## 2 106 296
## 3 76 317
## 4 100 376
## 5 22 162
## 6 12 150
## 7 85 367
## 8 111 308
## 9 40 189
## 10 51 235
## 11 0 83
## 12 12 112
## 13 6 67
## 14 56 325
## 15 19 189
names(Reynolds) = c("Months", "Sold")
M1 = lm(Sold~Months, Reynolds)
summary(M1)
##
## Call:
## lm(formula = Sold ~ Months, data = Reynolds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -68.696 -30.450 0.514 27.891 78.677
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 113.7453 20.8135 5.465 0.000108 ***
## Months 2.3675 0.3384 6.996 9.4e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 48.49 on 13 degrees of freedom
## Multiple R-squared: 0.7901, Adjusted R-squared: 0.774
## F-statistic: 48.95 on 1 and 13 DF, p-value: 9.395e-06
Our model: Scale Sold = 113.7453 + 2.364*Months Employed
Business interpretation: For every additional month added, we will get 2.3675 more scales sold.
Since this is linear, we can use the multiple R-squared of 0.7901. Since the Adjusted R-squared is above 65%, our model is pretty good. The final p-value is the results of running a hypothesis test to validate our model:
par(mfrow=c(2,2)) # parameter for the plot: lay out the graphs 2 x 2
plot(M1)
Interpreting the residuals: (more info in the textbook)
plot(Sold ~ Months, Reynolds, xlab="Length of Employment", ylab="Scale Sold", main="Reynolds Data")
Because the scatter plot is curved, we think: hm, there might be something nonlinear happening. Let’s try adding the square of months
M2 = lm(Sold~Months + I(Months^2), Reynolds)
summary(M2)
##
## Call:
## lm(formula = Sold ~ Months + I(Months^2), data = Reynolds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.606 -22.795 -7.623 27.637 42.686
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.429935 20.574335 2.986 0.01136 *
## Months 5.819797 0.969767 6.001 6.2e-05 ***
## I(Months^2) -0.031010 0.008436 -3.676 0.00317 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34.61 on 12 degrees of freedom
## Multiple R-squared: 0.9013, Adjusted R-squared: 0.8848
## F-statistic: 54.78 on 2 and 12 DF, p-value: 9.252e-07
Interpreting the results:
Both terms are signifcant: P < 0.05 (6.2e-05 and 0.00317) Adjusted R-squared has improved– closer to 88%. Final p-value 9.252e-07– this is our hypothesis test that the model is good (see above)
The new model is:
par(mfrow=c(2,2)) # parameter for the plot: lay out the graphs 2 x 2
plot(M2)
Interpreting the results:
So it looks good!
If someone worked for the company 93 months, how much would they sell?
predict.lm(M2, newdata = data.frame(Months=93))
## 1
## 334.4691
Interpretation: They would sell 335 units. (Rounding up assuming you can’t sell half a sale.)
The problem: we have data on price, expenditures, and sales. Can we use price and/ or expenditure to predict sales?
library(readxl)
Tyler <- read_excel("Tyler.xlsx")
Tyler
## # A tibble: 24 × 3
## Price Expenditure Sales
## <dbl> <dbl> <dbl>
## 1 2 50 478
## 2 2.5 50 373
## 3 3 50 335
## 4 2 50 473
## 5 2.5 50 358
## 6 3 50 329
## 7 2 50 456
## 8 2.5 50 360
## 9 3 50 322
## 10 2 50 437
## # … with 14 more rows
M3 = lm(Sales~Price + Expenditure, Tyler)
summary(M3)
##
## Call:
## lm(formula = Sales ~ Price + Expenditure, data = Tyler)
##
## Residuals:
## Min 1Q Median 3Q Max
## -124.167 -57.417 -1.167 59.958 96.833
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 864.1667 101.0249 8.554 2.77e-08 ***
## Price -281.0000 35.9052 -7.826 1.17e-07 ***
## Expenditure 4.4800 0.5863 7.641 1.71e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 71.81 on 21 degrees of freedom
## Multiple R-squared: 0.8507, Adjusted R-squared: 0.8365
## F-statistic: 59.82 on 2 and 21 DF, p-value: 2.13e-09
Interpretation: Both predictors are highly significant, and yet my R-squared is still sort of low. I wonder what would happen if we added an interaction term?
M4 = lm(Sales~Price + Expenditure + I(Price*Expenditure), Tyler)
summary(M4)
##
## Call:
## lm(formula = Sales ~ Price + Expenditure + I(Price * Expenditure),
## data = Tyler)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.167 -20.792 0.333 15.583 60.333
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -275.8333 112.8421 -2.444 0.023898 *
## Price 175.0000 44.5468 3.928 0.000832 ***
## Expenditure 19.6800 1.4274 13.788 1.13e-11 ***
## I(Price * Expenditure) -6.0800 0.5635 -10.790 8.68e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28.17 on 20 degrees of freedom
## Multiple R-squared: 0.9781, Adjusted R-squared: 0.9748
## F-statistic: 297.9 on 3 and 20 DF, p-value: < 2.2e-16
All terms are significant, the adjusted R-squared is very close to 1, and the hypothesis team p-value < 2.2e-16 is very signficant. This looks like an excellent model– significantly improved from the last one.
Interpretation:
Sales = -275.8333 + 175.0000 * Price + 19.6800 * Expenditure -6.0800 * Price * Expenditure
Business interpretation:
If price is 2 and expenditure is 50, what do we expect sales to be?
predict.lm(M3, newdata=data.frame(Price=2,Expenditure=50))
## 1
## 526.1667
Answer: 526.
If the data is not normal, need to do a nonlinear regression. This is one type of nonlinear regression.
No underlying assumptions. (Though the datapoints should always be above 30 for anything. If not we need to be very cautious about the conclusion– make a strong disclaimer.)
library(readxl)
MPG <- read_excel("MPG.xlsx")
M4 = lm(MPG~Weight, MPG)
Interpretation: The predictor is highly signifcant, the R squared is very good, and the hypothesis test tells us we have a good model.
Every time the weight increases 1 unit, the MPG drops 0.012 units.
par(mfrow=(c(2,2)))
plot(M4)
Residuals look good. (see above)
This model looks pretty good, so we might stick with it.
However, there might be some cases where we want to try transforming. If the data is not normal, you need to do a log.
MPG$LogM = log(MPG$MPG)
M5 = lm(LogM~Weight, MPG )
summary(M5)
##
## Call:
## lm(formula = LogM ~ Weight, data = MPG)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.09125 -0.04079 -0.01536 0.03736 0.10310
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.524e+00 9.932e-02 45.55 6.26e-13 ***
## Weight -5.011e-04 3.722e-05 -13.46 9.84e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06425 on 10 degrees of freedom
## Multiple R-squared: 0.9477, Adjusted R-squared: 0.9425
## F-statistic: 181.2 on 1 and 10 DF, p-value: 9.842e-08
plot(LogM~Weight, MPG, xlab="weight", ylab="mpg")
par(mfrow=c(2,2))
plot(M5)
If we have a weight of 2500, what is the MPG?
logM=predict.lm(M5, newdata=data.frame(Weight=2500))
exp(logM)
## 1
## 26.35056