LEARNING OUTCOMES
library(readr)
library(ggplot2)
library(gt)
library(gtExtras)
Returns <- read_csv("01-H Microsoft_VS_Walmart.csv")
head(Returns,5)
## # A tibble: 5 × 2
## MsftReturn WmtReturn
## <dbl> <dbl>
## 1 3.62 -1.53
## 2 -7.55 -0.0421
## 3 2.21 1.2
## 4 2.16 3.41
## 5 4.28 -3.52
ggplot(data=Returns,aes(x=WmtReturn, y=MsftReturn)) + geom_point(color="blue") + labs(x="Walmart returns", y="Microsoft returns")
attach(Returns)
cor(MsftReturn,WmtReturn)
## [1] 0.1962059
detach(Returns)
The correlation between Microsoft and Walmart returns is 19.62%
(Same data from Handout)
head(Returns, 5)
## # A tibble: 5 × 2
## MsftReturn WmtReturn
## <dbl> <dbl>
## 1 3.62 -1.53
## 2 -7.55 -0.0421
## 3 2.21 1.2
## 4 2.16 3.41
## 5 4.28 -3.52
Each data point is a pair of returns observed in the same month for
Walmart and Microsoft. We are looking at how much, and in which
direction, the two series move together.
For example, in the table above, Microsoft's return in one month is
3.62% and Walmart's return in the same month is -1.53%. In another
month Microsoft returned -7.55% while Walmart returned -0.04%.
Is there a relationship between Microsoft and Walmart returns? First
step: make a scatterplot
ggplot(
data=Returns,
aes(
x=WmtReturn,
y=MsftReturn
)
) + geom_point(color="blue") + labs(x="Walmart returns", y="Microsoft returns")
Each point in the plot corresponds to a pair of Microsoft and Walmart
returns for a month. (There are 59 months, so there are 59 points in the
plot.)
Correlation is a single number computed to measure the tendency of the
data to cluster around a line.
THEREFORE: correlation is a measure of linear association.
Closer to the line, closer to 1 or -1.
Farther from the line, closer to 0.
CAUTION: r does not measure the slope; r is a
measure of linear association.
Example: r can be 0 even when there is an obvious (nonlinear) relationship; r only captures linear association.
\[ \begin{align} \text{Covariance(X,Y)}&=\text{Correlation(X,Y)} \times \left [\text{SD}(X)\times \text{SD}(Y)\right ] \end{align} \]
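As a quick sanity check of this identity, we can compute both sides on the Microsoft/Walmart returns loaded above (a minimal sketch; cov(), cor(), and sd() all use the same n-1 denominator, so the two results agree exactly):
cov(Returns$MsftReturn, Returns$WmtReturn)   # left-hand side: covariance
cor(Returns$MsftReturn, Returns$WmtReturn) * sd(Returns$MsftReturn) * sd(Returns$WmtReturn)   # right-hand side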
Returns <- readr::read_csv("01-E IbmYhooAaplReturnData copy.csv")
library(ggplot2)
library(tidyr)
library(cowplot)
library(gt)
Make scatter plots of each pair of variables.
IBM_YAHOO <- Returns %>%
ggplot(aes(IBMreturn, YhooReturn)) +
geom_point() +
labs(x="IBM Returns",y="Yahoo returns") + geom_smooth(method="lm")
IBM_APPLE <- Returns %>%
ggplot(aes(IBMreturn, Aaplreturn)) +
geom_point() +
labs(x="IBM Returns",y="Apple returns") + geom_smooth(method="lm")
YAHOO_APPLE <- Returns %>%
ggplot(aes(YhooReturn, Aaplreturn)) +
geom_point() +
labs(x="Yahoo Returns",y="Apple returns") + geom_smooth(method="lm")
plot_grid(IBM_YAHOO,IBM_APPLE,YAHOO_APPLE,
labels = c("Yahoo X IBM", "Apple X IBM","Apple X Yahoo"),
label_colour = "purple")
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
Compute the correlation in the data for each pair of variables.
attach(Returns)
cor_labels <- c("Apple","IBM","Yahoo")
Apple <- c(
cor(Aaplreturn,Aaplreturn),
cor(IBMreturn,Aaplreturn),
cor(YhooReturn,Aaplreturn))|> round(2)
IBM <- c(
cor(Aaplreturn,IBMreturn),
cor(IBMreturn,IBMreturn),
cor(YhooReturn,IBMreturn)) |> round(2)
Yahoo <- c(
cor(Aaplreturn,YhooReturn),
cor(IBMreturn,YhooReturn),
cor(YhooReturn,YhooReturn)) |> round(2)
data.frame(
cor_labels,
Apple,
IBM,
Yahoo) |>
gt() |>
cols_label(
cor_labels="",
Apple="Apple",
IBM="IBM",
Yahoo="Yahoo")
| | Apple | IBM | Yahoo |
|---|---|---|---|
| Apple | 1.00 | 0.36 | 0.28 |
| IBM | 0.36 | 1.00 | 0.31 |
| Yahoo | 0.28 | 0.31 | 1.00 |
detach(Returns)
cor(Returns) |> knitr::kable()
| | IBMreturn | YhooReturn | Aaplreturn |
|---|---|---|---|
| IBMreturn | 1.0000000 | 0.3143849 | 0.3603621 |
| YhooReturn | 0.3143849 | 1.0000000 | 0.2754087 |
| Aaplreturn | 0.3603621 | 0.2754087 | 1.0000000 |
Are the correlation values that you get in accord with what you see in the
scatter plots?
Yes, each scatter plot shows a positive association, which is
reflected in the positive signs of the correlation values. Each
correlation value in the table is fairly close to 0 in magnitude (except for
the correlations of each variable with itself: Apple against Apple, etc.), meaning
that many points stray far from the
line, just as the plots show.
LEARNING OUTCOMES
## # A tibble: 14 × 2
## sales advtexp
## <dbl> <dbl>
## 1 432 5
## 2 723 7
## 3 578 6
## 4 600 6.5
## 5 950 8
## 6 106 3.5
## 7 282 4
## 8 233. 4.5
## 9 746 6.5
## 10 812. 7
## 11 800 7.5
## 12 929 7.5
## 13 1004 8.5
## 14 632 7
cor(AdExp$sales, AdExp$advtexp)
## [1] 0.9743532
lm([Dependent Variable]~[Independent Variables], data=[name of dataset]):
used to build linear regression models
summary(): used to get full output of the model
linearRegModel <- lm(sales~advtexp, data=AdExp)
summary(linearRegModel)
##
## Call:
## lm(formula = sales ~ advtexp, data = AdExp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -119.346 -37.380 -4.869 54.998 88.609
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -495.28 77.06 -6.427 3.27e-05 ***
## advtexp 178.09 11.87 15.000 3.89e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 65.36 on 12 degrees of freedom
## Multiple R-squared: 0.9494, Adjusted R-squared: 0.9451
## F-statistic: 225 on 1 and 12 DF, p-value: 3.888e-09
linearRegModel.StdRes <- rstandard(linearRegModel) # obtaining standard residuals
linearRegModel.Fit <- fitted.values(linearRegModel) # obtaining fitted values
ggplot(data=AdExp,aes(x=linearRegModel.Fit,y=linearRegModel.StdRes))+
geom_point(color='blue')+
geom_smooth( method= 'lm', se= F, col= "red")+ labs(x= "Fitted values", y= "Standard Residuals")
## `geom_smooth()` using formula = 'y ~ x'
Regression assists with:
Least squares (fitted) line: A scatter plot of the data shows that
the two variables seem to be related and the data seem to fall around a
line with positive slope. We do not know the exact equation of the
line that captures the linear relationship between the two variables,
but it makes sense to postulate that it exists. Once we have
actual data on two such variables that seem to be linearly related,
regression allows us to generate the equation of a line that is our best
estimate of this linear relationship.
The actual equation of this line is given by the values of its intercept and
slope. The regression procedure finds this equation by considering all
lines that one could possibly draw through the data and then choosing
the line that best fits the data. More concretely, it
finds the line that has the smallest total squared vertical distance
from the points, and hence it is called the least squares line. Once we
have the fitted line, we can use it for prediction.
Be careful not to extrapolate the fitted line too far for
prediction.
For example, if we had an ad campaign with $15,000, then it is tempting to plug in 15 into the equation to predict sales, but we only have data for ad campaigns where the ad expenditure goes up to around $9,000 and we don’t know whether the relationship between sales and ad expenditure will continue to be linear beyond the observed range or may saturate and flatten out. In the latter case, using the line for prediction at X equal to 15 could seriously overpredict sales. A similar remark can be made about extrapolating in the lower range of the X values. What we have seen here so far is the use of regression to generate a prediction of sales for a given value of ad expenditure.
Point prediction: generating an exact prediction of sales for a value
of ad expenditure.
Prediction interval (PI): The actual sales values are not going to fall
perfectly along our best-fit line. We can allow for this by generating an
interval around the line such that the sales value has a high
pre-specified probability, like 95%, of falling in that interval.
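For instance, a point prediction and a 95% prediction interval can be generated with predict(); this is a minimal sketch using the linearRegModel fitted above and a hypothetical ad expenditure of 7 (i.e., $7,000), which lies within the observed range:
predict(linearRegModel,
newdata = data.frame(advtexp = 7),   # hypothetical ad expenditure within the observed range
interval = "prediction", level = 0.95)   # returns fit, lwr, upr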
Outliers: unusually high or unusually low values of sales relative to
the ad expenditure. Regression gives us an objective way to detect
them.
Goal: given a pair of variables, with an eye towards using one of
them to predict the other, determine whether the data objectively give
evidence that a linear relationship exists.
The regression procedure provides a standard output along with a fitted
line.
Regression assumes the existence of a regression line that
captures the linear relationship between the two variables, and this
regression line is estimated from observed data by the fitted line,
which allows us to generate point predictions. However, we cannot
reasonably expect all values of the Y variable to fall exactly on the
regression line that relates the Y variable to the X variable. Hence, it
makes sense to generate not just a point prediction, but also an
interval prediction.
Interval prediction: an interval which will contain the values of the Y
variable, with a high pre-specified probability.
Assumptions:
for any fixed value of ad expenditure the sales values fall around the regression line at that point, according to a normal curve.
the standard deviation of the sales values around the regression line at any point is the same, regardless of where the point is.
lm(AdExp$sales~AdExp$advtexp,AdExp) |> summary()
##
## Call:
## lm(formula = AdExp$sales ~ AdExp$advtexp, data = AdExp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -119.346 -37.380 -4.869 54.998 88.609
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -495.28 77.06 -6.427 3.27e-05 ***
## AdExp$advtexp 178.09 11.87 15.000 3.89e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 65.36 on 12 degrees of freedom
## Multiple R-squared: 0.9494, Adjusted R-squared: 0.9451
## F-statistic: 225 on 1 and 12 DF, p-value: 3.888e-09
Answers the question: Is there evidence of a linear relationship?
If the t-value is LARGE, i.e. outside of (-2,2), then reject \(H_0\).
Answers the question: Is there evidence of a linear relationship?
If the p-value is SMALL, i.e. < 0.05, then there is evidence to reject
\(H_0\).
\[
\text{p-value} \approx 0 \therefore\text{ reject }H_0 \text{ in favor of
}H_a\text{ that there is a linear relationship}
\]
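If you prefer to pull these numbers out programmatically rather than reading the printed output, the coefficient table of the summary object contains them (a sketch using the linearRegModel fitted above):
coef_tab <- summary(linearRegModel)$coefficients  # matrix of estimates, std. errors, t values, p-values
coef_tab["advtexp", c("t value", "Pr(>|t|)")]     # slope t-statistic and its p-value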
Answers the question: How strong is the relationship?
Use Multiple R-squared \[
\begin{matrix}
0\% &<& R^2 & <&100\%\\
\leftarrow weaker& & & & stronger\rightarrow
\end{matrix}
\]
Can have a small p-value (there is a linear relationship) AND a large
R-squared: the linear relationship is also strong.
Can have a small p-value (there is a linear relationship) but a SMALL
R-squared: the linear relationship is real but weak.
Use Residual standard error
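Both quantities can be pulled directly from the fitted model (a sketch using the linearRegModel fitted above):
summary(linearRegModel)$r.squared   # Multiple R-squared: strength of the linear relationship
sigma(linearRegModel)               # Residual standard error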
LEARNING OUTCOMES
Task: The purpose of this analysis/tutorial is to use simple regression to model future movie revenue based on the opening weekend revenue
Movies <- readr::read_csv("03-H MovieRevenues1998.csv")
head(Movies, 10)
## # A tibble: 10 × 4
## Movie `opening week` `total gross` `future gross`
## <chr> <dbl> <dbl> <dbl>
## 1 3 Ninjas: High Moon at Mega Moun… 0.150 0.308 0.158
## 2 54 6.61 16.6 9.96
## 3 Air Bud: Golden Receiver 2.60 10.2 7.61
## 4 Almost Heroes 2.84 6.11 3.28
## 5 American History X 1.34 6.71 5.37
## 6 Antz 17.2 90.6 73.5
## 7 Apostle, The 2.40 20.7 18.3
## 8 Apt Pupil 3.58 8.84 5.25
## 9 Armageddon 36.1 202. 165.
## 10 Avengers, The 10.3 23.3 13.0
Movies |> ggplot(aes(x=`opening week`, y=`future gross`))+geom_point(color='blue')+labs(x="Opening Week", y="Future Gross")
As we can see, there is a hint of a linear relationship. However, if we were to plot a fitted line here, we would see that the spread of the data points around the line keeps increasing as we move to the right. Due to this non-constant variance, we need to take corrective action (natural logs).
Add columns to the dataset
Movies$`log opening week` <- log(Movies$`opening week`)
Movies$`log future revenue` <- log(Movies$`future gross`)
head(Movies,10)
## # A tibble: 10 × 6
## Movie `opening week` `total gross` `future gross` `log opening week`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 3 Ninjas: Hig… 0.150 0.308 0.158 -1.90
## 2 54 6.61 16.6 9.96 1.89
## 3 Air Bud: Gold… 2.60 10.2 7.61 0.957
## 4 Almost Heroes 2.84 6.11 3.28 1.04
## 5 American Hist… 1.34 6.71 5.37 0.293
## 6 Antz 17.2 90.6 73.5 2.84
## 7 Apostle, The 2.40 20.7 18.3 0.877
## 8 Apt Pupil 3.58 8.84 5.25 1.28
## 9 Armageddon 36.1 202. 165. 3.59
## 10 Avengers, The 10.3 23.3 13.0 2.33
## # ℹ 1 more variable: `log future revenue` <dbl>
Movies |> ggplot(aes(x=`log opening week`, y=`log future revenue`))+geom_point(color='blue')+labs(x="Opening Week (log)", y="Future Gross (log)")
cor(Movies$`log opening week`, Movies$`log future revenue`)
## [1] 0.9387172
The correlation between log opening week and log future revenue is 93.87%.
logRegModel <- lm(Movies$`log future revenue` ~ Movies$`log opening week`,Movies)
summary(logRegModel)
##
## Call:
## lm(formula = Movies$`log future revenue` ~ Movies$`log opening week`,
## data = Movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.93903 -0.35069 -0.08075 0.27636 2.20801
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.41054 0.07540 5.445 2.01e-07 ***
## Movies$`log opening week` 1.19797 0.03545 33.796 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5118 on 154 degrees of freedom
## Multiple R-squared: 0.8812, Adjusted R-squared: 0.8804
## F-statistic: 1142 on 1 and 154 DF, p-value: < 2.2e-16
Movies |> ggplot(aes(`log opening week`,`log future revenue`)) +
geom_point(color="blue") +
geom_smooth(method="lm", se=F, col="red") +
labs("Opening week (log)","Future Revenue (log)")
logRegModel.StdRes <- rstandard(logRegModel)
logRegModel.Fit <- fitted.values(logRegModel)
Movies |> ggplot(aes(logRegModel.Fit,logRegModel.StdRes)) +
geom_point(color='blue') +
geom_smooth(method="lm", se=F, col="red") +
labs(x= "Fitted values", y= "Standard Residuals")
Outlier: a value of the Y variable in the data that is unusually high or
unusually low relative to what one would expect given the corresponding
value of the X variable.
Residual: the vertical distance of the Y value from the fitted
line; it represents the part of the Y value that has not been explained by
the fitted line.
Standardized residual: a residual expressed in standard-deviation units, which defines what constitutes a large value; a standardized residual outside plus or minus three corresponds to an unusually high or low residual.
Movies |> ggplot(aes(logRegModel.Fit,logRegModel.StdRes)) +
geom_point(color='blue') +
geom_smooth(method="lm", se=F, col="red") +
labs(x= "Fitted values", y= "Standard Residuals")
“Fitted values”: the predicted Y values from the regression model (the value
obtained by plugging each X value into the fitted line).
Note: It is good practice to see if there’s an explanation as to why
these outliers behaved differently from the others in the dataset.
Outliers are deleted from the dataset and the regression analysis is
redone to ensure that the outliers aren’t affecting the analysis.
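A minimal sketch of how one might flag and drop such outliers, using the standardized residuals computed above and the |value| > 3 rule of thumb:
out_rows <- abs(logRegModel.StdRes) > 3   # TRUE for rows flagged as outliers
Movies$Movie[out_rows]                    # identify the flagged movies
Movies_clean <- Movies[!out_rows, ]       # keep only the non-outlying rows before refitting, if desired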
Transforming data before running a regression model
Movies_loglog <- readr::read_csv("03-H MovieRevenues1998.csv")
Movies_loglog |> ggplot(aes(`opening week`, `future gross`)) +
geom_point(color="blue") +
labs(x="Opening week revenue",y="Future gross revenue") +
geom_smooth(method="lm",se=F,col="red")
REASON FOR DATA TRANSFORMATION: the data violate one of the assumptions of
the regression model: that the points have a constant spread around the line.
Therefore, we need to adjust the data before doing the analysis.
SOLUTION: the original Y variable is replaced by its natural log
Movies_loglog |> ggplot(aes(`opening week`, `future gross` |> log())) +
geom_point(color="blue") +
labs(x="Opening week revenue",y="(LOG) Future gross revenue") +
geom_smooth(se=F,col="red")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
NEW PROBLEM: the plot now shows a curvilinear relationship.
SOLUTION: replace the X variable by its natural log
Movies_loglog |> ggplot(
aes(
`opening week` |> log(),
`future gross` |> log())) +
geom_point(color="blue") +
labs(x="(LOG) Opening week revenue",y="(LOG) Future gross revenue") +
geom_smooth(method="lm",se=F,col="red")
CAUTION: back transform the Y values when reporting
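As a minimal sketch of that back-transformation, suppose (hypothetically) we want to predict future revenue for a movie with a $10 million opening week, using the coefficients and residual standard error from the fitted model above:
log_opening <- log(10)                       # hypothetical opening-week revenue, on the log scale
log_pred <- 0.41054 + 1.19797 * log_opening  # point prediction in log units
log_pi <- log_pred + c(-2, 2) * 0.5118       # rough 95% prediction interval on the log scale
exp(log_pred)                                # back-transformed point prediction (original revenue units)
exp(log_pi)                                  # back-transformed prediction interval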
How then can we detect the presence of non-constant variance?
# directly plotted residuals (funnel-shaped; non-constant variance)
moviesRegMod <- lm(Movies$`future gross`~Movies$`opening week`,Movies)
stdresDirectMovies <- Movies |> ggplot(
aes(
fitted.values(moviesRegMod),
rstandard(moviesRegMod))) +
geom_point(color='blue') +
geom_smooth(method="lm", se=F, col="red") +
labs(
x= "(Direct) Fitted values",
y= "(Direct) Standard Residuals")
# residuals with log values
moviesRegMod2 <- lm(Movies$`log future revenue`~Movies$`log opening week`,Movies)
stdresLogMovies <- Movies |> ggplot(
aes(
fitted.values(moviesRegMod2),
rstandard(moviesRegMod2))) +
geom_point(color='blue') +
geom_smooth(method="lm", se=F, col="red") +
labs(
x= "(LOG) Fitted values",
y= "(LOG) Standard Residuals")
stdresDirectMovies
stdresLogMovies
library(cowplot)
plot_grid(stdresDirectMovies, stdresLogMovies, labels = c('Non-constant variance', 'Constant variance'))
If we fit the line through the data and plot it, it is easy to see that the residuals get larger in magnitude as we go from left to right. If one were to make a plot of the standardized residuals, this pattern shows up as an approximate funnel shape from left to right, and that funnel shape is the red flag that points to the presence of non-constant variance and suggests that the original Y variable be replaced by its logarithm.
Let us now continue with our movie data analysis. We have already established that we needed to take logarithms of both the future and opening revenue, so we start the analysis on these transformed variables. Upon fitting the regression line and plotting the standardized residuals versus the fitted values, one sees immediately that there are three outliers, all corresponding to movies that did much better in terms of future revenue relative to what one would expect given their opening revenue. These movies can be identified as Sliding Doors, Good Will Hunting, and There’s Something About Mary. Sliding Doors is a large outlier with a standardized residual of about 4.4, meaning that its log future revenue is 4.4 standard deviations above what one would expect given its log opening revenue. Ideally one would remove these three outliers and re-run the model, but for the sake of simplicity we will continue here with the analysis without doing so. We can now inspect the fitted equation. The t statistic for the slope coefficient is 33.8 with a p-value of essentially zero; both of these numbers point to sufficient evidence of a linear relationship. This is not surprising, since we can see that visually in the scatterplot itself. The R-squared value is 88 percent, showing that on the log scale the linear relationship is quite strong. Finally, if we were interested in making a prediction, we could use the fitted equation above to do so.
LEARNING OUTCOMES
Zagat <- readr::read_csv("04-H Zagat2003.csv")
head(Zagat,10)
## # A tibble: 10 × 5
## Name Food Decor Service Price
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 107 West 16 13 16 26
## 2 44 & Hell's kitchen 22 19 19 42
## 3 55 wall street 21 22 21 54
## 4 92 15 15 15 43
## 5 Angelica kitchen 20 14 15 22
## 6 Angelo's 21 11 14 22
## 7 Avenue 18 14 14 36
## 8 Avra estiatorio 24 21 20 50
## 9 AZ 23 25 22 60
## 10 babbo 27 23 24 66
Zagat$log.Price <- log(Zagat$Price)
head(Zagat, 10)
## # A tibble: 10 × 6
## Name Food Decor Service Price log.Price
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 107 West 16 13 16 26 3.26
## 2 44 & Hell's kitchen 22 19 19 42 3.74
## 3 55 wall street 21 22 21 54 3.99
## 4 92 15 15 15 43 3.76
## 5 Angelica kitchen 20 14 15 22 3.09
## 6 Angelo's 21 11 14 22 3.09
## 7 Avenue 18 14 14 36 3.58
## 8 Avra estiatorio 24 21 20 50 3.91
## 9 AZ 23 25 22 60 4.09
## 10 babbo 27 23 24 66 4.19
MultipleRegModel<- lm(log.Price~ Food + Decor + Service, data= Zagat)
summary(MultipleRegModel)
##
## Call:
## lm(formula = log.Price ~ Food + Decor + Service, data = Zagat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.89913 -0.11587 0.00931 0.12680 0.61914
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.012786 0.069582 28.927 <2e-16 ***
## Food -0.009633 0.005477 -1.759 0.0797 .
## Decor 0.034915 0.003972 8.791 <2e-16 ***
## Service 0.067977 0.007650 8.886 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.209 on 280 degrees of freedom
## Multiple R-squared: 0.753, Adjusted R-squared: 0.7504
## F-statistic: 284.6 on 3 and 280 DF, p-value: < 2.2e-16
anova(MultipleRegModel)
## Analysis of Variance Table
##
## Response: log.Price
## Df Sum Sq Mean Sq F value Pr(>F)
## Food 1 12.4577 12.4577 285.134 < 2.2e-16 ***
## Decor 1 21.3911 21.3911 489.604 < 2.2e-16 ***
## Service 1 3.4496 3.4496 78.954 < 2.2e-16 ***
## Residuals 280 12.2334 0.0437
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
MultipleRegModel.StdRes <- rstandard(MultipleRegModel)
MultipleRegModel.Fit <- fitted.values(MultipleRegModel)
Zagat |> ggplot(aes(x=MultipleRegModel.Fit,y=MultipleRegModel.StdRes)) +
geom_point(color='blue') +
geom_smooth( method= 'lm', se= F, col= "red") +
labs(x= "Fitted values", y= "Standard Residuals")
Multiple Regression: determining whether there is a relationship between a
y variable and two or more x variables. \[
y=a+ b_1 x_1 + b_2 x_2 +b_3 x_3
\] Note that each \(x\) variable
has its own slope.
MultipleRegModel|>summary()
##
## Call:
## lm(formula = log.Price ~ Food + Decor + Service, data = Zagat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.89913 -0.11587 0.00931 0.12680 0.61914
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.012786 0.069582 28.927 <2e-16 ***
## Food -0.009633 0.005477 -1.759 0.0797 .
## Decor 0.034915 0.003972 8.791 <2e-16 ***
## Service 0.067977 0.007650 8.886 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.209 on 280 degrees of freedom
## Multiple R-squared: 0.753, Adjusted R-squared: 0.7504
## F-statistic: 284.6 on 3 and 280 DF, p-value: < 2.2e-16
The questions that multiple regression analysis can answer:
Fitted equation for the example: \[ \text{LogCost}=2.012786-0.00963\text{ Food }+0.03491\text{ Decor }+0.06798\text{ Service } \]
Note: the y variable is in units of log dollars while the service
rating is in units of just the rating points.
Coefficient of Service: 0.06798; LogCost is estimated to increase on average by 0.067 log dollars for every increase of one rating point in the service rating (all other x variables kept fixed):
Is there evidence that at least one x variable in the group is
related to the y variable?
This question can be framed in the context of a hypothesis test:
F statistic: based on a combination of the magnitudes of all the estimated slope coefficients, and is always positive.
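The F statistic and its p-value can also be pulled out of the summary object directly (a sketch using the MultipleRegModel fitted above):
fstat <- summary(MultipleRegModel)$fstatistic   # named vector: value, numdf, dendf
fstat
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)   # overall p-value for the model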
LEARNING OUTCOMES
(Same data from lesson 4; Handout is also the same from lesson 4)
Zagat <- readr::read_csv("04-H Zagat2003.csv")
One, in the case of multiple regression, one cannot make a plot of the y variable against all the x variables simultaneously if there are more than two x variables. Some people suggest inspecting individual two-dimensional scatterplots of the y variable against each x variable but this can get overwhelming quickly if there are many x variables. I suggest that one go directly to step two.
Two, run the multiple regression model and plot the standardized residuals against the fitted values. Inspect this plot to see if there is a suggestion of a funnel shape, which would indicate the presence of non-constant variance in the Y variable. If this exists, the Y variable should be replaced by its log values. Note that the log transformation may not be necessary in every dataset.
Three, once one has done the transformation check and carried it out if necessary, run the multiple regression analysis again and plot the standardized residuals against the fitted values. This time, inspect the plot for outliers, indicated by standardized residuals outside plus or minus three. If outliers exist, locate the corresponding cases in the data and remove them from the analysis. It is worth inspecting these cases to see if there is some reason why they behave unusually compared to the rest of the data.
Four, after removing the outliers, run the regression again and inspect the output. First, check the f-statistic and its p-value to see if there is evidence that at least one x variable is useful. If there isn’t, then stop because the model is not useful and there is no point in going further.
Five, if the f-statistic and its p-value indicate that at least one x variable is useful in the model, then one can continue the analysis. Now inspect the t-statistic of every individual x variable to test whether each of the x variables is found to be related to the y variable in the presence of the remaining x variables.
Six, if some x variables are not found to be useful, it is good practice to drop some of these variables from the model and work with a reduced model. One major reason to do this is that it results in better estimates of the slope coefficients of the remaining relevant x variables and also provides better predictions of the y variable. However, the question of which x variables should be dropped and which ones should be kept in, a question that broadly can be described as that of model selection, is not an easy one to answer, and many methods have been proposed to address it. These methods are beyond the scope of our class, and so we will not pursue this further, though we note that it is an important issue.
Seven, one should now inspect the r-squared value to get the numerical measure of how well the x variables are explaining the y variable.
One can also use the fitted model, along with the estimated standard deviation of the y values around the line, to generate point predictions and prediction intervals of the y variable.
If the F-statistic was large enough to conclude that the model is useful, then we know that there is at least one useful variable. To check which ones are useful, look at each t-statistic or p-value and interpret it as in simple regression.
MultipleRegModel<- lm(log(Price)~ Food + Decor + Service, data= Zagat)
summary(MultipleRegModel)
##
## Call:
## lm(formula = log(Price) ~ Food + Decor + Service, data = Zagat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.89913 -0.11587 0.00931 0.12680 0.61914
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.012786 0.069582 28.927 <2e-16 ***
## Food -0.009633 0.005477 -1.759 0.0797 .
## Decor 0.034915 0.003972 8.791 <2e-16 ***
## Service 0.067977 0.007650 8.886 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.209 on 280 degrees of freedom
## Multiple R-squared: 0.753, Adjusted R-squared: 0.7504
## F-statistic: 284.6 on 3 and 280 DF, p-value: < 2.2e-16
T statistic for decor is 8.79 AND is outside \(\pm 2\), leading us to conclude that the X
variable decor rating is useful in the model.
NOTE: Since we are testing the usefulness of an X variable in a multiple
regression model, we are testing whether there is evidence that the X
variable is related to the Y variable in the presence of the
other X variables. This distinction matters especially when one has several X
variables in the model and more than one of them has a small T
statistic.
Specifically, for decor rating, we can conclude that there is evidence
that it is related to log cost in the presence of the other X variables
in the multiple regression model.
T statistic for food rating is -1.76 and thus, not considered large
enough. Its p-value is 8% and so above the 5% rule of thumb value. Thus,
we conclude there is not enough evidence that the food rating is related
to the log cost of dinner in the presence of the other two X variables
in the model.
If you run this simple regression, the food rating has a T statistic of 9.73, which is large, implying that in simple regression it is indeed useful. So what we conclude from its T statistic in the multiple regression model is that we don’t have evidence that the food rating is related to log cost when the other two ratings are also included in the model.
Our conclusion about the lack of usefulness of the food rating in the multiple regression model also helps us reconcile the puzzling finding noted earlier: its coefficient is negative in sign, contradicting our intuition that log cost of dinner would be positively related to the food rating. The T statistic for the food rating implies that we cannot reject the null hypothesis that the true unknown coefficient of food rating is zero, and so the estimated slope coefficient is estimating zero. Thus, the estimated slope of the food rating could just as well have fallen to the left of zero, yielding a negative sign.
Measure how well the X variables in the model help explain the y variable
One should not be using the R squared value to test for whether the relationship exists between the Y and the X variables (like simple regression, it should only be used to measure magnitude of relationship once determined that there’s a relationship)
REMINDER: back-transform log values to the original units when reporting
Example: suppose we are interested in restaurants with a
given profile of food, decor, and service ratings. The estimated average log cost of such restaurants is obtained by plugging those ratings into the fitted equation, and
the 95% prediction interval for log cost for such restaurants would be \(3.91 \pm (2\times 0.209) = (3.49,4.33) \text{ log dollars}\), where 0.209 is the estimated standard deviation of log cost values around the regression line. The estimated average log cost value, as well as the 95% prediction interval for log cost, can then be back-transformed into dollar values.
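A minimal sketch of generating such an interval with predict() and back-transforming it, assuming a hypothetical profile of Food = 20, Decor = 20, and Service = 20 (any ratings within the observed range would do):
new_rest <- data.frame(Food = 20, Decor = 20, Service = 20)   # hypothetical ratings profile
log_pi <- predict(MultipleRegModel, newdata = new_rest,
interval = "prediction", level = 0.95)                        # point prediction and 95% PI in log dollars
exp(log_pi)                                                   # back-transformed into dollars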
We now discuss the question of detecting outliers and checking for the presence of non-constant variance in the y variable. As in the case of simple regression, both of these checks are done through the plot of the standardized residuals versus the fitted values. Note that even in multiple regression, regardless of the number of X variables in the model, there are only as many fitted values as there are cases in the data set, since there is one fitted value per case. For example, the Zagat data set has 284 cases, and so there will be 284 standardized residuals and 284 fitted values, one per case. Hence, the plot of the standardized residuals versus the fitted values can be made even in multiple regression. For the Zagat data set, this plot is as follows.
REMINDER: Let us remind ourselves that outliers will show up in this plot as standardized residuals that are too large in magnitude, with the rule of thumb being that values outside plus or minus three are considered large.
It is immediately obvious that there are no outliers on the high side, since all the standardized residuals are less than or almost exactly three. However, there are two outliers on the low side, one that is just crossing minus three and another that is beyond even minus four. The latter corresponds to a restaurant whose log cost is approximately four standard deviations below what one would expect for a restaurant with its profile in terms of food, decor and service ratings. As stated earlier, one would go through the data set to pinpoint which cases these are, examine whether there is any reason why they behave unusually, remove them from the data set, and re-analyze the data to ensure that the analysis is not affected by the outliers. Here, for convenience, I will proceed with the analysis without dropping any of the outliers.
Another use of the plot is in determining whether the y variable has non-constant variance. Recall from the discussion of simple regression that the presence of non-constant variance in Y will manifest itself as an approximate funnel shape in the plot, and the corrective action for it is to convert the y variable to its logarithm and use the log values as the new y variable. The reason why I chose to use log cost in the Zagat analysis now becomes clear. When I ran the multiple regression analysis with the untransformed cost variable as the y variable, I detected an approximate funnel shape in the standardized residual plot, and that is what prompted me to convert the cost variable into log cost and use the log values as the y variable. Here is the plot of the standardized residuals versus the fitted values in the multiple regression of the original untransformed cost on the three rating variables. One should note that the approximate funnel shape, if it does occur in a data set, may not be immediately obvious, particularly to the inexperienced eye. As one gets more experienced with regression, one gets better at detecting these shapes when they occur. Furthermore, since computing power is so cheap these days, it never hurts to fit both regression models, one with the original untransformed y variable and the other with the log y variable, and see which plot of the standardized residuals looks more random and does not show a funnel shape. However, one should not follow a policy of blindly always transforming the y variable into its log values, since such transformations are not always called for, and transforming the data when not needed could result in a poor model.
Steps for modeling
| Term | Symbol | Description¹ |
|---|---|---|
| y-intercept | $$\hat{\beta}_0$$ | predicted value of y when x=0 |
| slope | $$\hat{\beta}_1$$ | increase or decrease in y for every 1 unit increase in x |
| ¹ This interpretation is valid only for x-values within the range of the sample data | | |
adsale <- readr::read_tsv("01-R ADSALE.txt")
scatter<- ggplot(data=adsale,aes(x=ADVEXP_X, y=SALES_Y)) + geom_point(color="blue")
fitted <- adsale %>%
ggplot(aes(x=ADVEXP_X, y=SALES_Y)) +
geom_point(color="blue") +
scale_x_continuous(limits=c(0,6)) +
scale_y_continuous(limits=c(-1,6)) +
theme_bw() +
geom_abline(intercept = -1,slope=1)
plot_grid(scatter,fitted,labels = c("A","B"))
Graph B shows a visually fitted line. Use the “least squares” method to
fit this line quantitatively.
CAUTION: model parameters should be interpreted only within the sample range of the independent variable
\[ \begin{align} \text{slope: }& \hat{\beta}_1=\frac{\text{SS}_{xy}}{\text{SS}_{xx}}\\ \text{y-intercept: }&\hat{\beta}_0 =\bar{y}-\hat{\beta}_1\bar{x}\\ \text{SS}_{xy}&=\sum(x-\bar{x})(y-\bar{y})\\ \text{SS}_{xx}&=\sum(x-\bar{x})^2 \end{align} \]
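As a sketch, these sums of squares and the resulting estimates can be computed by hand from the adsale data read above (columns ADVEXP_X and SALES_Y):
x <- adsale$ADVEXP_X
y <- adsale$SALES_Y
SSxy <- sum((x - mean(x)) * (y - mean(y)))     # sum of cross-deviations
SSxx <- sum((x - mean(x))^2)                   # sum of squared deviations of x
beta1_hat <- SSxy / SSxx                       # slope estimate
beta0_hat <- mean(y) - beta1_hat * mean(x)     # y-intercept estimate
c(intercept = beta0_hat, slope = beta1_hat)    # should agree with coef(lm(SALES_Y ~ ADVEXP_X, data = adsale))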
Specifying the probability distribution of the random error \(\varepsilon\)
The following assumptions about the random error \(\varepsilon\) make it possible for us to make inferences about the model.
Note: the assumptions will be adequately satisfied for many applications encountered in practice.
\[ \therefore E(y)=\beta_0+\beta_1x \]
Most practical situations: \(\sigma^2\) is unknown, so we estimate \(\sigma^2\) with \(s^2\).
divide the sum of squares of deviations of the y-values from the prediction line
(\(\text{SSE }=\sum(y_i-\hat{y}_i)^2\))
by the degrees of freedom (\(n-2\))
\[ \begin{align} s^2&=\frac{\text{SSE}}{\text{degrees of freedom}}\\ &=\frac{\text{SSE}}{n-2}\\ \text{SSE }&=\sum(y_i-\color{red}{\hat{y}}_i)^2\color{grey}{=\text{SS}_{yy}-\hat{\beta}_1\text{SS}_{xy}}\\ \text{SS}_{\color{blue}{yy}}&=\sum(y_i-\color{red}{\bar{y}}_i)^2\\ \text{remember: }&\\ \text{SS}_{\color{green}{x}\color{blue}{y}}&=\sum(x-\bar{x})(y-\bar{y})\\ \text{SS}_{\color{green}{xx}}&=\sum(x-\bar{x})^2 \end{align} \]
\(\sqrt{s^2}=s\): estimated standard error of the regression model
We expect most (\(\approx 95\%\)) of the observed \(y\)-values to lie within \(2s\) of their respective least squares predicted values, \(\hat{y}\).
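A short sketch of these calculations on the adsale data, checking that the hand-computed s matches the “Residual standard error” reported by lm():
fit <- lm(SALES_Y ~ ADVEXP_X, data = adsale)
SSE <- sum(residuals(fit)^2)      # sum of squared deviations from the fitted line
s2 <- SSE / (nrow(adsale) - 2)    # SSE divided by n - 2 degrees of freedom
sqrt(s2)                          # s; should match sigma(fit)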
Assessing the Utility of the Model: Making Inferences about the Slope
\(\beta_1\)
What if revenue is completely unrelated to expenditure? This implies
that the mean of \(y\), \(E(y)\), does not change as \(x\) changes; that is, \(\beta_1=0\).
\[ E(y)=\beta_0+\beta_1x=\beta_0 \]
Test the null hypothesis that the linear model contributes no information for the prediction of \(y\) (no evidence of correlation) against the alternative hypothesis that the linear model is useful in predicting \(y\) (There is suggestion of correlation):
\[ H_0: \beta_1=0\\ H_a: \beta_1\neq 0 \]
IF the data support \(H_0\),
THEN \(x\) isn’t helpful for predicting \(y\) with the
straight-line model.
To reject null hypothesis means to say that the slope is not 0 (meaning
that we can use the linear model to make prediction). Rejecting null
hypothesis means that the sample evidence indicates that \(x\) contributes to \(y\) in the linear model.
\[ s_{\hat{\beta}_1}=\frac{s}{\sqrt{\text{SS}_{xx}}} \]
\(t\)-stat (t value)
Also known as test-statistic. Can also be found in
lm()
\[ t=\frac{\hat{\beta}_1-\text{hypothesized }\beta_1}{s_{\hat{\beta}_1}} \]
t_stat <- function(b_hat, hypothesis, s, SSxx){
# t = (estimated slope - hypothesized slope) / standard error of the slope
tstat <- (b_hat - hypothesis)/(s/sqrt(SSxx))
return(tstat)
}
Comparison value
comparison <- function(sided, confidence, df, n){
# sided: 1 for a one-tailed test, 2 for a two-tailed test
# confidence: the significance level alpha (e.g., .05)
# df: number of estimated parameters (2 in simple regression), so n - df = n - 2
if (sided == 1) {
alpha <- confidence
} else if (sided == 2){
alpha <- confidence/2
}
value <- qt((1-alpha),(n-df))
return(value)
}
\(H_a:\beta_1\neq 0\)
Rejection region: \(|t|>t_{\alpha/2}\) (remember: \(\alpha\) = significance level)
p-value: \(2P(t>|t_c|)\)
\(H_a:\beta_1<0\)
Rejection region: \(t<-t_{\alpha}\) (remember: \(\alpha\) = significance level)
p-value: \(P(t<t_c)\)
\(H_a:\beta_1>0\)
Rejection region: \(t>t_{\alpha}\) (remember: \(\alpha\) = significance level)
p-value: \(P(t>t_c)\)
| Table 11.2: Comparing Observed and Predicted Values for the Visual Model | ||||
| $$x$$ | $$y$$ | $$\tilde{y}=-1+x$$ | $$(y-\tilde{y})$$ | $$(y-\tilde{y})^2$$ |
|---|---|---|---|---|
| 1 | 1 | 0 | 1 | 1 |
| 2 | 1 | 1 | 0 | 0 |
| 3 | 2 | 2 | 0 | 0 |
| 4 | 2 | 3 | -1 | 1 |
| 5 | 4 | 4 | 0 | 0 |
Sum of errors: \(\sum
(y-\tilde{y})=0\)
Sum of squared errors: \(\sum
(y-\tilde{y})^2=2\)
Consider the straight-line model: \(E(y)=\beta_0+\beta_1x\), where \(\beta_0\) is the y-intercept and \(\beta_1\) is the slope.
| Sums | ||||
| $$x$$ | $$y$$ | $$\tilde{y}=-1+x$$ | $$(y-\tilde{y})$$ | $$(y-\tilde{y})^2$$ |
|---|---|---|---|---|
| 15 | 10 | 10 | 0 | 2 |
\[ \begin{align} \bar{x}&=\frac{\sum x}{n} =\frac{15}{5} = 3\\ \bar{y}&=\frac{\sum y}{n} =\frac{10}{5} = 2\\ \end{align} \]
| SS calculations | |||
| x | y | (x-3)(y-2) | (x-3)^2 |
|---|---|---|---|
| 1 | 1 | 2 | 4 |
| 2 | 1 | 1 | 1 |
| 3 | 2 | 0 | 0 |
| 4 | 2 | 0 | 1 |
| 5 | 4 | 4 | 4 |
\[ \begin{align} \text{SS}_{xy}&=\sum(x-\bar{x})(y-\bar{y})\\ &=(2+1+0+0+4)=7\\ \text{SS}_{xx}&=\sum(x-\bar{x})^2\\ &=(4+1+0+1+4)=10\\ \text{slope: }\hat{\beta}_1 &=\frac{\text{SS}_{xy}}{\text{SS}_{xx}}\\ &=\frac{7}{10}=.7\\ \text{y-intercept: }\hat{\beta}_0 &=\bar{y}-\hat{\beta}_1\bar{x}\\ &=2-(.7)(3)\\ \therefore \text{least squares line: }\hat{y}&=\hat{\beta}_0+\hat{\beta}_1x\\ &=-.1+.7x \end{align} \]
\[ \begin{align} \text{least squares line: }\hat{y}&=-.1+.7x\\ f(2)&=-.1+(.7)(2)\\ &=1.3\\ \therefore &\text{\$1,300} \end{align} \]
| SSE analysis | ||||
| x | y | $$\hat{y}=-.1+.7x$$ | $$(y-\hat{y})$$ | $$(y-\hat{y})^2$$ |
|---|---|---|---|---|
| 1 | 1 | 0.6 | 0.4 | 0.16 |
| 2 | 1 | 1.3 | -0.3 | 0.09 |
| 3 | 2 | 2.0 | 0.0 | 0.00 |
| 4 | 2 | 2.7 | -0.7 | 0.49 |
| 5 | 4 | 3.4 | 0.6 | 0.36 |
\[ \sum (y-\hat{y})^2=(.16+.09+0+.49+.36)=1.1 \] (See the “Sums” table; notice that this is smaller than the 2 obtained for the visually fitted line)
The results imply that the revenue is equal to -$100 when expenditure is $0. Remember that model parameters should be interpreted only within the sampled range of the independent variable. In this example, the range is $100 to $500. The y-intercept isn’t within this range and is therefore not subject to meaningful interpretation.
(From “Model Assumptions”)
\[ \begin{align} s&=\sqrt{\frac{\text{SSE}}{n-2}}\\ &=\sqrt{\frac{1.1}{5-2}}\\ &=.61 \end{align} \]
\(s=.61\) is the standard error of the regression model
Because \(s\) measures the spread of the distribution of y-values about the least squares line, we should not be surprised to find that most of the observations lie within \(2s\) of the least squares line: \(2s=2\times.61=1.22\approx\text{\$1,220}\) of the least squares line.
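A quick sketch of checking that claim on the adsale data: count how many of the five observations fall within \(2s\) of the fitted line.
fit <- lm(SALES_Y ~ ADVEXP_X, data = adsale)
sum(abs(residuals(fit)) <= 2 * sigma(fit))   # number of observations within 2s of the least squares line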
Testing the Regression Slope \(\beta_1\)–Sales Revenue Model
Conduct a test (at \(\alpha=.05\)) to
determine if sales revenue (\(y\)) is
linearly related to advertising expenditure (\(x\)).
\[ \begin{align} \text{In previous section: }&n=5\\ &\text{df}=n-2=3\\ &\hat{\beta}_1=.7\\ &s=.61\\ &\text{SS}_{xx}=10 \end{align} \]
comparison(2,.05,2,5)
## [1] 3.182446
\[ |t|>(t_{.025}=3.182) \]
lm(adsale$SALES_Y~adsale$ADVEXP_X,adsale) |> summary()
##
## Call:
## lm(formula = adsale$SALES_Y ~ adsale$ADVEXP_X, data = adsale)
##
## Residuals:
## 1 2 3 4 5
## 4.000e-01 -3.000e-01 6.478e-17 -7.000e-01 6.000e-01
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.1000 0.6351 -0.157 0.8849
## adsale$ADVEXP_X 0.7000 0.1915 3.656 0.0354 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6055 on 3 degrees of freedom
## Multiple R-squared: 0.8167, Adjusted R-squared: 0.7556
## F-statistic: 13.36 on 1 and 3 DF, p-value: 0.03535
\[ \text{t-stat: }3.656 \] \[ \begin{align} H_0: \beta_1 &= 0\\ H_a: \beta_1 &\neq 0 \end{align} \]
The t-stat 3.656 exceeds the critical value 3.182 (it falls in the rejection
region \(|t|>3.182\)), so we reject the null
hypothesis and say that the slope \(\beta_1\) is not 0. This means that there
is evidence that expenditure \(x\)
contributes information for revenue \(y\) when a linear model is used.
Also, we can reach the same conclusion by using the p-value (shown as
Pr(>|t|) in the lm summary; the asterisks show
when we can reject the null hypothesis), which is 0.0354. Since \(0.0354<.05\), we reject the null
hypothesis.
Multiple regression models: probabilistic models that include more than one independent variable
\[ y=\beta_0 +\beta_1 x_1+\beta_2 x_2+\dots+\beta_k x_k + \varepsilon \]