LEARNING OUTCOMES
library(readr)
library(ggplot2)
library(gt)
library(gtExtras)
Returns <- read_csv("01-H Microsoft_VS_Walmart.csv")
head(Returns,5)
## # A tibble: 5 × 2
## MsftReturn WmtReturn
## <dbl> <dbl>
## 1 3.62 -1.53
## 2 -7.55 -0.0421
## 3 2.21 1.2
## 4 2.16 3.41
## 5 4.28 -3.52
ggplot(data=Returns,aes(x=WmtReturn, y=MsftReturn)) + geom_point(color="blue") + labs(x="Walmart returns", y="Microsoft returns")
attach(Returns)
cor(MsftReturn,WmtReturn)
## [1] 0.1962059
detach(Returns)
The correlation between Microsoft and Walmart returns is 19.62%
(Same data from Handout)
head(Returns, 5)
## # A tibble: 5 × 2
## MsftReturn WmtReturn
## <dbl> <dbl>
## 1 3.62 -1.53
## 2 -7.55 -0.0421
## 3 2.21 1.2
## 4 2.16 3.41
## 5 4.28 -3.52
Each data point is a pair of returns observed in the same month for
Walmart and Microsoft. We are looking at how much, and in which
direction, the two series move together.
For example, in the table above, Microsoft's return in one month is
3.62% and Walmart's return in the same month is -1.53%. In another
month Microsoft returned -7.55% while Walmart returned -0.04%.
Is there a relationship between Microsoft and Walmart returns? First
step: make a scatterplot
ggplot(
data=Returns,
aes(
x=WmtReturn,
y=MsftReturn
)
) + geom_point(color="blue") + labs(x="Walmart returns", y="Microsoft returns")
Each point in the plot corresponds to a pair of Microsoft and Walmart
returns for a month. (There are 59 months, so there are 59 points in the
plot.)
Correlation is a single number computed to measure the tendency of the
data to cluster around a line.
THEREFORE: correlation is a measure of linear association.
Closer to the line, closer to 1 or -1.
Farther from the line, closer to 0.
CAUTION: r does not measure the slope; r is a
measure of linear association.
Example: r can be 0 even when there is an obvious (nonlinear) relationship; r only captures linear association.
\[ \begin{align} \text{Covariance(X,Y)}&=\text{Correlation(X,Y)} \times \left [\text{SD}(X)\times \text{SD}(Y)\right ] \end{align} \]
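As a quick sanity check of this identity, we can compute both sides on the Microsoft/Walmart returns loaded above (a minimal sketch; cov(), cor(), and sd() all use the same n-1 denominator, so the two results agree exactly):
cov(Returns$MsftReturn, Returns$WmtReturn)   # left-hand side: covariance
cor(Returns$MsftReturn, Returns$WmtReturn) * sd(Returns$MsftReturn) * sd(Returns$WmtReturn)   # right-hand side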
Returns <- readr::read_csv("01-E IbmYhooAaplReturnData copy.csv")
library(ggplot2)
library(tidyr)
library(cowplot)
library(gt)
Make scatter plots of each pair of variables.
IBM_YAHOO <- Returns %>%
ggplot(aes(IBMreturn, YhooReturn)) +
geom_point() +
labs(x="IBM Returns",y="Yahoo returns") + geom_smooth(method="lm")
IBM_APPLE <- Returns %>%
ggplot(aes(IBMreturn, Aaplreturn)) +
geom_point() +
labs(x="IBM Returns",y="Apple returns") + geom_smooth(method="lm")
YAHOO_APPLE <- Returns %>%
ggplot(aes(YhooReturn, Aaplreturn)) +
geom_point() +
labs(x="Yahoo Returns",y="Apple returns") + geom_smooth(method="lm")
plot_grid(IBM_YAHOO,IBM_APPLE,YAHOO_APPLE,
labels = c("Yahoo X IBM", "Apple X IBM","Apple X Yahoo"),
label_colour = "purple")
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
Compute the correlation in the data for each pair of variables.
attach(Returns)
cor_labels <- c("Apple","IBM","Yahoo")
Apple <- c(
cor(Aaplreturn,Aaplreturn),
cor(IBMreturn,Aaplreturn),
cor(YhooReturn,Aaplreturn))|> round(2)
IBM <- c(
cor(Aaplreturn,IBMreturn),
cor(IBMreturn,IBMreturn),
cor(YhooReturn,IBMreturn)) |> round(2)
Yahoo <- c(
cor(Aaplreturn,YhooReturn),
cor(IBMreturn,YhooReturn),
cor(YhooReturn,YhooReturn)) |> round(2)
data.frame(
cor_labels,
Apple,
IBM,
Yahoo) |>
gt() |>
cols_label(
cor_labels="",
Apple="Apple",
IBM="IBM",
Yahoo="Yahoo")
| | Apple | IBM | Yahoo |
|---|---|---|---|
| Apple | 1.00 | 0.36 | 0.28 |
| IBM | 0.36 | 1.00 | 0.31 |
| Yahoo | 0.28 | 0.31 | 1.00 |
detach(Returns)
cor(Returns) |> knitr::kable()
| | IBMreturn | YhooReturn | Aaplreturn |
|---|---|---|---|
| IBMreturn | 1.0000000 | 0.3143849 | 0.3603621 |
| YhooReturn | 0.3143849 | 1.0000000 | 0.2754087 |
| Aaplreturn | 0.3603621 | 0.2754087 | 1.0000000 |
Are the correlation values that you get in accord with what you see in the
scatter plots?
Yes, each scatter plot shows a positive association, which is
reflected in the positive signs of the correlation values. Each
correlation value in the table is fairly close to 0 in magnitude (except for
the correlations of each variable with itself: Apple against Apple, etc.), meaning
that many points stray far from the
line, just as the plots show.
LEARNING OUTCOMES
## # A tibble: 14 × 2
## sales advtexp
## <dbl> <dbl>
## 1 432 5
## 2 723 7
## 3 578 6
## 4 600 6.5
## 5 950 8
## 6 106 3.5
## 7 282 4
## 8 233. 4.5
## 9 746 6.5
## 10 812. 7
## 11 800 7.5
## 12 929 7.5
## 13 1004 8.5
## 14 632 7
cor(AdExp$sales, AdExp$advtexp)
## [1] 0.9743532
lm([Dependent Variable]~[Independent Variables], data=[name of dataset]):
used to build linear regression models
summary(): used to get full output of the model
linearRegModel <- lm(sales~advtexp, data=AdExp)
summary(linearRegModel)
##
## Call:
## lm(formula = sales ~ advtexp, data = AdExp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -119.346 -37.380 -4.869 54.998 88.609
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -495.28 77.06 -6.427 3.27e-05 ***
## advtexp 178.09 11.87 15.000 3.89e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 65.36 on 12 degrees of freedom
## Multiple R-squared: 0.9494, Adjusted R-squared: 0.9451
## F-statistic: 225 on 1 and 12 DF, p-value: 3.888e-09
linearRegModel.StdRes <- rstandard(linearRegModel) # obtaining standard residuals
linearRegModel.Fit <- fitted.values(linearRegModel) # obtaining fitted values
ggplot(data=AdExp,aes(x=linearRegModel.Fit,y=linearRegModel.StdRes))+
geom_point(color='blue')+
geom_smooth( method= 'lm', se= F, col= "red")+ labs(x= "Fitted values", y= "Standard Residuals")
## `geom_smooth()` using formula = 'y ~ x'
Regression assists with:
Least squares (fitted) line: A scatter plot of the data shows that
the two variables seem to be related and the data seem to fall around a
line with positive slope. We do not know the exact equation of the
line that captures the linear relationship between the two variables,
but it makes sense to postulate that it exists. Once we have
actual data on two such variables that seem to be linearly related,
regression allows us to generate the equation of a line that is our best
estimate of this linear relationship.
The actual equation of this line is given by the values of its intercept and
slope. The regression procedure finds this equation by considering all
lines that one could possibly draw through the data and then choosing
the line that best fits the data. More concretely, it
finds the line that has the smallest total squared vertical distance
from the points, and hence it is called the least squares line. Once we
have the fitted line, we can use it for prediction.
Be careful not to extrapolate the fitted line too far for
prediction.
For example, if we had an ad campaign with $15,000, then it is tempting to plug in 15 into the equation to predict sales, but we only have data for ad campaigns where the ad expenditure goes up to around $9,000 and we don’t know whether the relationship between sales and ad expenditure will continue to be linear beyond the observed range or may saturate and flatten out. In the latter case, using the line for prediction at X equal to 15 could seriously overpredict sales. A similar remark can be made about extrapolating in the lower range of the X values. What we have seen here so far is the use of regression to generate a prediction of sales for a given value of ad expenditure.
Point prediction: generating an exact prediction of sales for a value
of ad expenditure.
Prediction interval (PI): The actual sales values are not going to fall
perfectly along our best-fit line. We can allow for this by generating an
interval around the line such that the sales value has a high
pre-specified probability, like 95%, of falling in that interval.
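For instance, a point prediction and a 95% prediction interval can be generated with predict(); this is a minimal sketch using the linearRegModel fitted above and a hypothetical ad expenditure of 7 (i.e., $7,000), which lies within the observed range:
predict(linearRegModel,
newdata = data.frame(advtexp = 7),   # hypothetical ad expenditure within the observed range
interval = "prediction", level = 0.95)   # returns fit, lwr, upr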
Outliers: unusually high or unusually low values of sales relative to
the ad expenditure. Regression gives us an objective way to detect
them.
Goal: given a pair of variables, with an eye towards using one of
them to predict the other, determine whether the data objectively give
evidence that a linear relationship exists.
The regression procedure provides a standard output along with a fitted
line.
Regression assumes the existence of a regression line that
captures the linear relationship between the two variables, and this
regression line is estimated from observed data by the fitted line,
which allows us to generate point predictions. However, we cannot
reasonably expect all values of the Y variable to fall exactly on the
regression line that relates the Y variable to the X variable. Hence, it
makes sense to generate not just a point prediction, but also an
interval prediction.
Interval prediction: an interval which will contain the values of the Y
variable, with a high pre-specified probability.
Assumptions:
for any fixed value of ad expenditure the sales values fall around the regression line at that point, according to a normal curve.
the standard deviation of the sales values around the regression line at any point is the same, regardless of where the point is.
lm(AdExp$sales~AdExp$advtexp,AdExp) |> summary()
##
## Call:
## lm(formula = AdExp$sales ~ AdExp$advtexp, data = AdExp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -119.346 -37.380 -4.869 54.998 88.609
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -495.28 77.06 -6.427 3.27e-05 ***
## AdExp$advtexp 178.09 11.87 15.000 3.89e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 65.36 on 12 degrees of freedom
## Multiple R-squared: 0.9494, Adjusted R-squared: 0.9451
## F-statistic: 225 on 1 and 12 DF, p-value: 3.888e-09
Answers the question: Is there evidence of a linear relationship?
If the t-value is LARGE, i.e. outside of (-2,2), then reject \(H_0\).
Answers the question: Is there evidence of a linear relationship?
If the p-value is SMALL, i.e. < 0.05, then there is evidence to reject
\(H_0\).
\[
\text{p-value} \approx 0 \therefore\text{ reject }H_0 \text{ in favor of
}H_a\text{ that there is a linear relationship}
\]
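If you prefer to pull these numbers out programmatically rather than reading the printed output, the coefficient table of the summary object contains them (a sketch using the linearRegModel fitted above):
coef_tab <- summary(linearRegModel)$coefficients  # matrix of estimates, std. errors, t values, p-values
coef_tab["advtexp", c("t value", "Pr(>|t|)")]     # slope t-statistic and its p-value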
Answers the question: How strong is the relationship?
Use Multiple R-squared \[
\begin{matrix}
0\% &<& R^2 & <&100\%\\
\leftarrow weaker& & & & stronger\rightarrow
\end{matrix}
\]
Can have a small p-value (there is a linear relationship) AND a large
R-squared: the linear relationship is also strong.
Can have a small p-value (there is a linear relationship) but a SMALL
R-squared: the linear relationship is real but weak.
Use Residual standard error
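Both quantities can be pulled directly from the fitted model (a sketch using the linearRegModel fitted above):
summary(linearRegModel)$r.squared   # Multiple R-squared: strength of the linear relationship
sigma(linearRegModel)               # Residual standard error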
LEARNING OUTCOMES
Task: The purpose of this analysis/tutorial is to use simple regression to model future movie revenue based on the opening weekend revenue
Movies <- readr::read_csv("03-H MovieRevenues1998.csv")
head(Movies, 10)
## # A tibble: 10 × 4
## Movie `opening week` `total gross` `future gross`
## <chr> <dbl> <dbl> <dbl>
## 1 3 Ninjas: High Moon at Mega Moun… 0.150 0.308 0.158
## 2 54 6.61 16.6 9.96
## 3 Air Bud: Golden Receiver 2.60 10.2 7.61
## 4 Almost Heroes 2.84 6.11 3.28
## 5 American History X 1.34 6.71 5.37
## 6 Antz 17.2 90.6 73.5
## 7 Apostle, The 2.40 20.7 18.3
## 8 Apt Pupil 3.58 8.84 5.25
## 9 Armageddon 36.1 202. 165.
## 10 Avengers, The 10.3 23.3 13.0
Movies |> ggplot(aes(x=`opening week`, y=`future gross`))+geom_point(color='blue')+labs(x="Opening Week", y="Future Gross")
As we can see, there is a hint of a linear relationship. However, if we were to plot a fitted line here, we would see that the spread of the data points around the line keeps increasing as we move to the right. Due to this non-constant variance, we need to take corrective action (natural logs).
Add columns to the dataset
Movies$`log opening week` <- log(Movies$`opening week`)
Movies$`log future revenue` <- log(Movies$`future gross`)
head(Movies,10)
## # A tibble: 10 × 6
## Movie `opening week` `total gross` `future gross` `log opening week`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 3 Ninjas: Hig… 0.150 0.308 0.158 -1.90
## 2 54 6.61 16.6 9.96 1.89
## 3 Air Bud: Gold… 2.60 10.2 7.61 0.957
## 4 Almost Heroes 2.84 6.11 3.28 1.04
## 5 American Hist… 1.34 6.71 5.37 0.293
## 6 Antz 17.2 90.6 73.5 2.84
## 7 Apostle, The 2.40 20.7 18.3 0.877
## 8 Apt Pupil 3.58 8.84 5.25 1.28
## 9 Armageddon 36.1 202. 165. 3.59
## 10 Avengers, The 10.3 23.3 13.0 2.33
## # ℹ 1 more variable: `log future revenue` <dbl>
Movies |> ggplot(aes(x=`log opening week`, y=`log future revenue`))+geom_point(color='blue')+labs(x="Opening Week (log)", y="Future Gross (log)")
cor(Movies$`log opening week`, Movies$`log future revenue`)
## [1] 0.9387172
The correlation between log opening week and log future revenue is 93.87%.
logRegModel <- lm(Movies$`log future revenue` ~ Movies$`log opening week`,Movies)
summary(logRegModel)
##
## Call:
## lm(formula = Movies$`log future revenue` ~ Movies$`log opening week`,
## data = Movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.93903 -0.35069 -0.08075 0.27636 2.20801
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.41054 0.07540 5.445 2.01e-07 ***
## Movies$`log opening week` 1.19797 0.03545 33.796 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5118 on 154 degrees of freedom
## Multiple R-squared: 0.8812, Adjusted R-squared: 0.8804
## F-statistic: 1142 on 1 and 154 DF, p-value: < 2.2e-16
Movies |> ggplot(aes(`log opening week`,`log future revenue`)) +
geom_point(color="blue") +
geom_smooth(method="lm", se=F, col="red") +
labs("Opening week (log)","Future Revenue (log)")
logRegModel.StdRes <- rstandard(logRegModel)
logRegModel.Fit <- fitted.values(logRegModel)
Movies |> ggplot(aes(logRegModel.Fit,logRegModel.StdRes)) +
geom_point(color='blue') +
geom_smooth(method="lm", se=F, col="red") +
labs(x= "Fitted values", y= "Standard Residuals")
Outlier: a value of the Y variable in the data that is unusually high or
unusually low relative to what one would expect given the corresponding
value of the X variable.
Residual: the vertical distance of the Y value from the fitted
line; it represents the part of the Y value that has not been explained by
the fitted line.
Standardized residual: a residual expressed in standard-deviation units, which defines what constitutes a large value; a standardized residual outside plus or minus three corresponds to an unusually high or low residual.
Movies |> ggplot(aes(logRegModel.Fit,logRegModel.StdRes)) +
geom_point(color='blue') +
geom_smooth(method="lm", se=F, col="red") +
labs(x= "Fitted values", y= "Standard Residuals")
“Fitted values”: the predicted Y values from the regression model (the value
obtained by plugging each X value into the fitted line).
Note: It is good practice to see if there’s an explanation as to why
these outliers behaved differently from the others in the dataset.
Outliers are deleted from the dataset and the regression analysis is
redone to ensure that the outliers aren’t affecting the analysis.
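A minimal sketch of how one might flag and drop such outliers, using the standardized residuals computed above and the |value| > 3 rule of thumb:
out_rows <- abs(logRegModel.StdRes) > 3   # TRUE for rows flagged as outliers
Movies$Movie[out_rows]                    # identify the flagged movies
Movies_clean <- Movies[!out_rows, ]       # keep only the non-outlying rows before refitting, if desired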
Transforming data before running a regression model
Movies_loglog <- readr::read_csv("03-H MovieRevenues1998.csv")
Movies_loglog |> ggplot(aes(`opening week`, `future gross`)) +
geom_point(color="blue") +
labs(x="Opening week revenue",y="Future gross revenue") +
geom_smooth(method="lm",se=F,col="red")
REASON FOR DATA TRANSFORMATION: the data violate one of the assumptions of
the regression model: that the points have a constant spread around the line.
Therefore, we need to adjust the data before doing the analysis.
SOLUTION: the original Y variable is replaced by its natural log
Movies_loglog |> ggplot(aes(`opening week`, `future gross` |> log())) +
geom_point(color="blue") +
labs(x="Opening week revenue",y="(LOG) Future gross revenue") +
geom_smooth(se=F,col="red")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
NEW PROBLEM: the plot now shows a curvilinear relationship.
SOLUTION: replace the X variable by its natural log
Movies_loglog |> ggplot(
aes(
`opening week` |> log(),
`future gross` |> log())) +
geom_point(color="blue") +
labs(x="(LOG) Opening week revenue",y="(LOG) Future gross revenue") +
geom_smooth(method="lm",se=F,col="red")
CAUTION: back transform the Y values when reporting
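As a minimal sketch of that back-transformation, suppose (hypothetically) we want to predict future revenue for a movie with a $10 million opening week, using the coefficients and residual standard error from the fitted model above:
log_opening <- log(10)                       # hypothetical opening-week revenue, on the log scale
log_pred <- 0.41054 + 1.19797 * log_opening  # point prediction in log units
log_pi <- log_pred + c(-2, 2) * 0.5118       # rough 95% prediction interval on the log scale
exp(log_pred)                                # back-transformed point prediction (original revenue units)
exp(log_pi)                                  # back-transformed prediction interval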
How then can we detect the presence of non-constant variance?
# directly plotted residuals (funnel-shaped; non-constant variance)
moviesRegMod <- lm(Movies$`future gross`~Movies$`opening week`,Movies)
stdresDirectMovies <- Movies |> ggplot(
aes(
fitted.values(moviesRegMod),
rstandard(moviesRegMod))) +
geom_point(color='blue') +
geom_smooth(method="lm", se=F, col="red") +
labs(
x= "(Direct) Fitted values",
y= "(Direct) Standard Residuals")
# residuals with log values
moviesRegMod2 <- lm(Movies$`log future revenue`~Movies$`log opening week`,Movies)
stdresLogMovies <- Movies |> ggplot(
aes(
fitted.values(moviesRegMod2),
rstandard(moviesRegMod2))) +
geom_point(color='blue') +
geom_smooth(method="lm", se=F, col="red") +
labs(
x= "(LOG) Fitted values",
y= "(LOG) Standard Residuals")
stdresDirectMovies
stdresLogMovies
library(cowplot)
plot_grid(stdresDirectMovies, stdresLogMovies, labels = c('Non-constant variance', 'Constant variance'))
If we fit the line through the data and plot it, it is easy to see that the residuals get larger in magnitude as we go from left to right. If one were to make a plot of the standardized residuals, this pattern shows up as an approximate funnel shape from left to right, and that funnel shape is the red flag that points to the presence of non-constant variance and suggests that the original Y variable be replaced by its logarithm.
Let us now continue with our movie data analysis. We have already established that we needed to take logarithms of both the future and opening revenue, so we start the analysis on these transformed variables. Upon fitting the regression line and plotting the standardized residuals versus the fitted values, one sees immediately that there are three outliers, all corresponding to movies that did much better in terms of future revenue relative to what one would expect given their opening revenue. These movies can be identified as Sliding Doors, Good Will Hunting, and There’s Something About Mary. Sliding Doors is a large outlier with a standardized residual of about 4.4, meaning that its log future revenue is 4.4 standard deviations above what one would expect given its log opening revenue. Ideally one would remove these three outliers and re-run the model, but for the sake of simplicity we will continue here with the analysis without doing so. We can now inspect the fitted equation. The t statistic for the slope coefficient is 33.8 with a p-value of essentially zero; both of these numbers point to sufficient evidence of a linear relationship. This is not surprising, since we can see that visually in the scatterplot itself. The R-squared value is 88 percent, showing that on the log scale the linear relationship is quite strong. Finally, if we were interested in making a prediction, we could use the fitted equation above to do so.
LEARNING OUTCOMES
Zagat <- readr::read_csv("04-H Zagat2003.csv")
head(Zagat,10)
## # A tibble: 10 × 5
## Name Food Decor Service Price
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 107 West 16 13 16 26
## 2 44 & Hell's kitchen 22 19 19 42
## 3 55 wall street 21 22 21 54
## 4 92 15 15 15 43
## 5 Angelica kitchen 20 14 15 22
## 6 Angelo's 21 11 14 22
## 7 Avenue 18 14 14 36
## 8 Avra estiatorio 24 21 20 50
## 9 AZ 23 25 22 60
## 10 babbo 27 23 24 66
Zagat$log.Price <- log(Zagat$Price)
head(Zagat, 10)
## # A tibble: 10 × 6
## Name Food Decor Service Price log.Price
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 107 West 16 13 16 26 3.26
## 2 44 & Hell's kitchen 22 19 19 42 3.74
## 3 55 wall street 21 22 21 54 3.99
## 4 92 15 15 15 43 3.76
## 5 Angelica kitchen 20 14 15 22 3.09
## 6 Angelo's 21 11 14 22 3.09
## 7 Avenue 18 14 14 36 3.58
## 8 Avra estiatorio 24 21 20 50 3.91
## 9 AZ 23 25 22 60 4.09
## 10 babbo 27 23 24 66 4.19
MultipleRegModel<- lm(log.Price~ Food + Decor + Service, data= Zagat)
summary(MultipleRegModel)
##
## Call:
## lm(formula = log.Price ~ Food + Decor + Service, data = Zagat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.89913 -0.11587 0.00931 0.12680 0.61914
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.012786 0.069582 28.927 <2e-16 ***
## Food -0.009633 0.005477 -1.759 0.0797 .
## Decor 0.034915 0.003972 8.791 <2e-16 ***
## Service 0.067977 0.007650 8.886 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.209 on 280 degrees of freedom
## Multiple R-squared: 0.753, Adjusted R-squared: 0.7504
## F-statistic: 284.6 on 3 and 280 DF, p-value: < 2.2e-16
anova(MultipleRegModel)
## Analysis of Variance Table
##
## Response: log.Price
## Df Sum Sq Mean Sq F value Pr(>F)
## Food 1 12.4577 12.4577 285.134 < 2.2e-16 ***
## Decor 1 21.3911 21.3911 489.604 < 2.2e-16 ***
## Service 1 3.4496 3.4496 78.954 < 2.2e-16 ***
## Residuals 280 12.2334 0.0437
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
MultipleRegModel.StdRes <- rstandard(MultipleRegModel)
MultipleRegModel.Fit <- fitted.values(MultipleRegModel)
Zagat |> ggplot(aes(x=MultipleRegModel.Fit,y=MultipleRegModel.StdRes)) +
geom_point(color='blue') +
geom_smooth( method= 'lm', se= F, col= "red") +
labs(x= "Fitted values", y= "Standard Residuals")
Multiple Regression: determining whether there is a relationship between a
y variable and two or more x variables. \[
y=a+ b_1 x_1 + b_2 x_2 +b_3 x_3
\] Note that each \(x\) variable
has its own slope.
MultipleRegModel|>summary()
##
## Call:
## lm(formula = log.Price ~ Food + Decor + Service, data = Zagat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.89913 -0.11587 0.00931 0.12680 0.61914
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.012786 0.069582 28.927 <2e-16 ***
## Food -0.009633 0.005477 -1.759 0.0797 .
## Decor 0.034915 0.003972 8.791 <2e-16 ***
## Service 0.067977 0.007650 8.886 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.209 on 280 degrees of freedom
## Multiple R-squared: 0.753, Adjusted R-squared: 0.7504
## F-statistic: 284.6 on 3 and 280 DF, p-value: < 2.2e-16
The questions that multiple regression analysis can answer:
Fitted equation for the example: \[ \text{LogCost}=2.012786-0.00963\text{ Food }+0.03491\text{ Decor }+0.06798\text{ Service } \]
Note: the y variable is in units of log dollars while the service
rating is in units of just the rating points.
Coefficient of Service: 0.06798; LogCost is estimated to increase on average by 0.067 log dollars for every increase of one rating point in the service rating (all other x variables kept fixed):
Is there evidence that at least one x variable in the group is
related to the y variable?
This question can be framed in the context of a hypothesis test:
F statistic: based on a combination of the magnitudes of all the estimated slope coefficients, and is always positive.
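The F statistic and its p-value can also be pulled out of the summary object directly (a sketch using the MultipleRegModel fitted above):
fstat <- summary(MultipleRegModel)$fstatistic   # named vector: value, numdf, dendf
fstat
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)   # overall p-value for the model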
LEARNING OUTCOMES
(Same data from lesson 4; Handout is also the same from lesson 4)
Zagat <- readr::read_csv("04-H Zagat2003.csv")
One, in the case of multiple regression, one cannot make a plot of the y variable against all the x variables simultaneously if there are more than two x variables. Some people suggest inspecting individual two-dimensional scatterplots of the y variable against each x variable but this can get overwhelming quickly if there are many x variables. I suggest that one go directly to step two.
Two, run the multiple regression model and plot the standardized residuals against the fitted values. Inspect this plot to see if there is a suggestion of a funnel shape, which would indicate the presence of non-constant variance in the Y variable. If this exists, the Y variable should be replaced by its log values. Note that the log transformation may not be necessary in every dataset.
Three, once one has done the transformation check and carried it out if necessary, run the multiple regression analysis again and plot the standardized residuals against the fitted values. This time, inspect the plot for outliers, indicated by standardized residuals outside plus or minus three. If outliers exist, locate the corresponding cases in the data and remove them from the analysis. It is worth inspecting these cases to see if there is some reason why they behave unusually compared to the rest of the data.
Four, after removing the outliers, run the regression again and inspect the output. First, check the f-statistic and its p-value to see if there is evidence that at least one x variable is useful. If there isn’t, then stop because the model is not useful and there is no point in going further.
Five, if the f-statistic and its p-value indicate that at least one x variable is useful in the model, then one can continue the analysis. Now inspect the t-statistic of every individual x variable to test whether each of the x variables is found to be related to the y variable in the presence of the remaining x variables.
Six, if some x variables are not found to be useful, it is good practice to drop some of these variables from the model and work with a reduced model. One major reason to do this is that it results in better estimates of the slope coefficients of the remaining relevant x variables and also provides better predictions of the y variable. However, the question of which x variables should be dropped and which ones should be kept in, a question that broadly can be described as that of model selection, is not an easy one to answer, and many methods have been proposed to address it. These methods are beyond the scope of our class, and so we will not pursue this further, though we note that it is an important issue.
Seven, one should now inspect the r-squared value to get the numerical measure of how well the x variables are explaining the y variable.
One can also use the fitted model, along with the estimated standard deviation of the y values around the line, to generate point predictions and prediction intervals of the y variable.
If the F-statistic was large enough to conclude that the model is useful, then we know that there is at least one useful variable. To check which ones are useful, look at each t-statistic or p-value and interpret it as in simple regression.
MultipleRegModel<- lm(log(Price)~ Food + Decor + Service, data= Zagat)
summary(MultipleRegModel)
##
## Call:
## lm(formula = log(Price) ~ Food + Decor + Service, data = Zagat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.89913 -0.11587 0.00931 0.12680 0.61914
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.012786 0.069582 28.927 <2e-16 ***
## Food -0.009633 0.005477 -1.759 0.0797 .
## Decor 0.034915 0.003972 8.791 <2e-16 ***
## Service 0.067977 0.007650 8.886 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.209 on 280 degrees of freedom
## Multiple R-squared: 0.753, Adjusted R-squared: 0.7504
## F-statistic: 284.6 on 3 and 280 DF, p-value: < 2.2e-16
T statistic for decor is 8.79 AND is outside \(\pm 2\), leading us to conclude that the X
variable decor rating is useful in the model.
NOTE: Since we are testing the usefulness of an X variable in a multiple
regression model, we are testing whether there is evidence that the X
variable is related to the Y variable in the presence of the
other X variables. This distinction matters especially when one has several X
variables in the model and more than one of them has a small T
statistic.
Specifically, for decor rating, we can conclude that there is evidence
that it is related to log cost in the presence of the other X variables
in the multiple regression model.
T statistic for food rating is -1.76 and thus, not considered large
enough. Its p-value is 8% and so above the 5% rule of thumb value. Thus,
we conclude there is not enough evidence that the food rating is related
to the log cost of dinner in the presence of the other two X variables
in the model.
If you run this simple regression, the food rating has a T statistic of 9.73, which is large, implying that in simple regression it is indeed useful. So what we conclude from its T statistic in the multiple regression model is that we don’t have evidence that the food rating is related to log cost when the other two ratings are also included in the model.
Our conclusion about the lack of usefulness of the food rating in the multiple regression model also helps us reconcile the puzzling finding noted earlier: its coefficient is negative in sign, contradicting our intuition that log cost of dinner would be positively related to the food rating. The T statistic for the food rating implies that we cannot reject the null hypothesis that the true unknown coefficient of food rating is zero, and so the estimated slope coefficient is estimating zero. Thus, the estimated slope of the food rating could just as well have fallen to the left of zero, yielding a negative sign.
Measure how well the X variables in the model help explain the y variable
One should not be using the R squared value to test for whether the relationship exists between the Y and the X variables (like simple regression, it should only be used to measure magnitude of relationship once determined that there’s a relationship)
REMINDER: back-transform log values to the original units when reporting
Example: suppose we are interested in restaurants with a
given profile of food, decor, and service ratings. The estimated average log cost of such restaurants is obtained by plugging those ratings into the fitted equation, and
the 95% prediction interval for log cost for such restaurants would be \(3.91 \pm (2\times 0.209) = (3.49,4.33) \text{ log dollars}\), where 0.209 is the estimated standard deviation of log cost values around the regression line. The estimated average log cost value, as well as the 95% prediction interval for log cost, can then be back-transformed into dollar values.
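A minimal sketch of generating such an interval with predict() and back-transforming it, assuming a hypothetical profile of Food = 20, Decor = 20, and Service = 20 (any ratings within the observed range would do):
new_rest <- data.frame(Food = 20, Decor = 20, Service = 20)   # hypothetical ratings profile
log_pi <- predict(MultipleRegModel, newdata = new_rest,
interval = "prediction", level = 0.95)                        # point prediction and 95% PI in log dollars
exp(log_pi)                                                   # back-transformed into dollars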
We now discuss the question of detecting outliers and checking for the presence of non-constant variance in the y variable. As in the case of simple regression, both of these checks are done through the plot of the standardized residuals versus the fitted values. Note that even in multiple regression, regardless of the number of X variables in the model, there are only as many fitted values as there are cases in the data set, since there is one fitted value per case. For example, the Zagat data set has 284 cases, and so there will be 284 standardized residuals and 284 fitted values, one per case. Hence, the plot of the standardized residuals versus the fitted values can be made even in multiple regression. For the Zagat data set, this plot is as follows.
REMINDER: Let us remind ourselves that outliers will show up in this plot as standardized residuals that are too large in magnitude, with the rule of thumb being that values outside plus or minus three are considered large.
It is immediately obvious that there are no outliers on the high side, since all the standardized residuals are less than or almost exactly three. However, there are two outliers on the low side, one that is just crossing minus three and another that is beyond even minus four. The latter corresponds to a restaurant whose log cost is approximately four standard deviations below what one would expect for a restaurant with its profile in terms of food, decor and service ratings. As stated earlier, one would go through the data set to pinpoint which cases these are, examine whether there is any reason why they behave unusually, remove them from the data set, and re-analyze the data to ensure that the analysis is not affected by the outliers. Here, for convenience, I will proceed with the analysis without dropping any of the outliers.
Another use of the plot is in determining whether the y variable has non-constant variance. Recall from the discussion of simple regression that the presence of non-constant variance in Y will manifest itself as an approximate funnel shape in the plot, and the corrective action for it is to convert the y variable to its logarithm and use the log values as the new y variable. The reason why I chose to use log cost in the Zagat analysis now becomes clear. When I ran the multiple regression analysis with the untransformed cost variable as the y variable, I detected an approximate funnel shape in the standardized residual plot, and that is what prompted me to convert the cost variable into log cost and use the log values as the y variable. Here is the plot of the standardized residuals versus the fitted values in the multiple regression of the original untransformed cost on the three rating variables. One should note that the approximate funnel shape, if it does occur in a data set, may not be immediately obvious, particularly to the inexperienced eye. As one gets more experienced with regression, one gets better at detecting these shapes when they occur. Furthermore, since computing power is so cheap these days, it never hurts to fit both regression models, one with the original untransformed y variable and the other with the log y variable, and see which plot of the standardized residuals looks more random and does not show a funnel shape. However, one should not follow a policy of blindly always transforming the y variable into its log values, since such transformations are not always called for, and transforming the data when not needed could result in a poor model.
Steps for modeling
| Term | Symbol | Description¹ |
|---|---|---|
| y-intercept | $$\hat{\beta}_0$$ | predicted value of y when x=0 |
| slope | $$\hat{\beta}_1$$ | increase or decrease in y for every 1 unit increase in x |
| ¹ This interpretation is valid only for x-values within the range of the sample data | | |
adsale <- readr::read_tsv("01-R ADSALE.txt")
scatter<- ggplot(data=adsale,aes(x=ADVEXP_X, y=SALES_Y)) + geom_point(color="blue")
fitted <- adsale %>%
ggplot(aes(x=ADVEXP_X, y=SALES_Y)) +
geom_point(color="blue") +
scale_x_continuous(limits=c(0,6)) +
scale_y_continuous(limits=c(-1,6)) +
theme_bw() +
geom_abline(intercept = -1,slope=1)
plot_grid(scatter,fitted,labels = c("A","B"))
Graph B shows a visually fitted line. Use the “least squares” method to
fit this line quantitatively.
CAUTION: model parameters should be interpreted only within the sample range of the independent variable
\[ \begin{align} \text{slope: }& \hat{\beta}_1=\frac{\text{SS}_{xy}}{\text{SS}_{xx}}\\ \text{y-intercept: }&\hat{\beta}_0 =\bar{y}-\hat{\beta}_1\bar{x}\\ \text{SS}_{xy}&=\sum(x-\bar{x})(y-\bar{y})\\ \text{SS}_{xx}&=\sum(x-\bar{x})^2 \end{align} \]
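As a sketch, these sums of squares and the resulting estimates can be computed by hand from the adsale data read above (columns ADVEXP_X and SALES_Y):
x <- adsale$ADVEXP_X
y <- adsale$SALES_Y
SSxy <- sum((x - mean(x)) * (y - mean(y)))     # sum of cross-deviations
SSxx <- sum((x - mean(x))^2)                   # sum of squared deviations of x
beta1_hat <- SSxy / SSxx                       # slope estimate
beta0_hat <- mean(y) - beta1_hat * mean(x)     # y-intercept estimate
c(intercept = beta0_hat, slope = beta1_hat)    # should agree with coef(lm(SALES_Y ~ ADVEXP_X, data = adsale))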
Specifying the probability distribution of the random error \(\varepsilon\)
The following assumptions about the random error \(\varepsilon\) make it possible for us to make inferences about the model.
Note: the assumptions will be adequately satisfied for many applications encountered in practice.
\[ \therefore E(y)=\beta_0+\beta_1x \]
Most practical situations: \(\sigma^2\) is unknown, so we estimate \(\sigma^2\) with \(s^2\).
divide the sum of squares of deviations of the y-values from the prediction line
(\(\text{SSE }=\sum(y_i-\hat{y}_i)^2\))
by the degrees of freedom (\(n-2\))
\[ \begin{align} s^2&=\frac{\text{SSE}}{\text{degrees of freedom}}\\ &=\frac{\text{SSE}}{n-2}\\ \text{SSE }&=\sum(y_i-\color{red}{\hat{y}}_i)^2\color{grey}{=\text{SS}_{yy}-\hat{\beta}_1\text{SS}_{xy}}\\ \text{SS}_{\color{blue}{yy}}&=\sum(y_i-\color{red}{\bar{y}}_i)^2\\ \text{remember: }&\\ \text{SS}_{\color{green}{x}\color{blue}{y}}&=\sum(x-\bar{x})(y-\bar{y})\\ \text{SS}_{\color{green}{xx}}&=\sum(x-\bar{x})^2 \end{align} \]
\(\sqrt{s^2}=s\): estimated standard error of the regression model
We expect most (\(\approx 95\%\)) of the observed \(y\)-values to lie within \(2s\) of their respective least squares predicted values, \(\hat{y}\).
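A short sketch of these calculations on the adsale data, checking that the hand-computed s matches the “Residual standard error” reported by lm():
fit <- lm(SALES_Y ~ ADVEXP_X, data = adsale)
SSE <- sum(residuals(fit)^2)      # sum of squared deviations from the fitted line
s2 <- SSE / (nrow(adsale) - 2)    # SSE divided by n - 2 degrees of freedom
sqrt(s2)                          # s; should match sigma(fit)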
Assessing the Utility of the Model: Making Inferences about the Slope
\(\beta_1\)
What if revenue is completely unrelated to expenditure? This implies
that the mean of \(y\), \(E(y)\), does not change as \(x\) changes; that is, \(\beta_1=0\).
\[ E(y)=\beta_0+\beta_1x=\beta_0 \]
Test the null hypothesis that the linear model contributes no information for the prediction of \(y\) (no evidence of correlation) against the alternative hypothesis that the linear model is useful in predicting \(y\) (There is suggestion of correlation):
\[ H_0: \beta_1=0\\ H_a: \beta_1\neq 0 \]
IF the data support \(H_0\),
THEN \(x\) isn’t helpful for predicting \(y\) with the
straight-line model.
To reject null hypothesis means to say that the slope is not 0 (meaning
that we can use the linear model to make prediction). Rejecting null
hypothesis means that the sample evidence indicates that \(x\) contributes to \(y\) in the linear model.
\[ s_{\hat{\beta}_1}=\frac{s}{\sqrt{\text{SS}_{xx}}} \]
\(t\)-stat (t value)
Also known as test-statistic. Can also be found in
lm()
\[ t=\frac{\hat{\beta}_1-\text{hypothesized }\beta_1}{s_{\hat{\beta}_1}} \]
t_stat <- function(b_hat, hypothesis, s, SSxx){
# t = (estimated slope - hypothesized slope) / standard error of the slope
tstat <- (b_hat - hypothesis)/(s/sqrt(SSxx))
return(tstat)
}
Comparison value
comparison <- function(sided, confidence, df, n){
# sided: 1 for a one-tailed test, 2 for a two-tailed test
# confidence: the significance level alpha (e.g., .05)
# df: number of estimated parameters (2 in simple regression), so n - df = n - 2
if (sided == 1) {
alpha <- confidence
} else if (sided == 2){
alpha <- confidence/2
}
value <- qt((1-alpha),(n-df))
return(value)
}
\(H_a:\beta_1\neq 0\)
Rejection region: \(|t|>t_{\alpha/2}\) (remember: \(\alpha\) = significance level)
p-value: \(2P(t>|t_c|)\)
\(H_a:\beta_1<0\)
Rejection region: \(t<-t_{\alpha}\) (remember: \(\alpha\) = significance level)
p-value: \(P(t<t_c)\)
\(H_a:\beta_1>0\)
Rejection region: \(t>t_{\alpha}\) (remember: \(\alpha\) = significance level)
p-value: \(P(t>t_c)\)
| Table 11.2: Comparing Observed and Predicted Values for the Visual Model | ||||
| $$x$$ | $$y$$ | $$\tilde{y}=-1+x$$ | $$(y-\tilde{y})$$ | $$(y-\tilde{y})^2$$ |
|---|---|---|---|---|
| 1 | 1 | 0 | 1 | 1 |
| 2 | 1 | 1 | 0 | 0 |
| 3 | 2 | 2 | 0 | 0 |
| 4 | 2 | 3 | -1 | 1 |
| 5 | 4 | 4 | 0 | 0 |
Sum of errors: \(\sum
(y-\tilde{y})=0\)
Sum of squared errors: \(\sum
(y-\tilde{y})^2=2\)
Consider the straight-line model: \(E(y)=\beta_0+\beta_1x\), where \(\beta_0\) is the y-intercept and \(\beta_1\) is the slope.
| Sums | ||||
| $$x$$ | $$y$$ | $$\tilde{y}=-1+x$$ | $$(y-\tilde{y})$$ | $$(y-\tilde{y})^2$$ |
|---|---|---|---|---|
| 15 | 10 | 10 | 0 | 2 |
\[ \begin{align} \bar{x}&=\frac{\sum x}{n} =\frac{15}{5} = 3\\ \bar{y}&=\frac{\sum y}{n} =\frac{10}{5} = 2\\ \end{align} \]
| SS calculations | |||
| x | y | (x-3)(y-2) | (x-3)^2 |
|---|---|---|---|
| 1 | 1 | 2 | 4 |
| 2 | 1 | 1 | 1 |
| 3 | 2 | 0 | 0 |
| 4 | 2 | 0 | 1 |
| 5 | 4 | 4 | 4 |
\[ \begin{align} \text{SS}_{xy}&=\sum(x-\bar{x})(y-\bar{y})\\ &=(2+1+0+0+4)=7\\ \text{SS}_{xx}&=\sum(x-\bar{x})^2\\ &=(4+1+0+1+4)=10\\ \text{slope: }\hat{\beta}_1 &=\frac{\text{SS}_{xy}}{\text{SS}_{xx}}\\ &=\frac{7}{10}=.7\\ \text{y-intercept: }\hat{\beta}_0 &=\bar{y}-\hat{\beta}_1\bar{x}\\ &=2-(.7)(3)\\ \therefore \text{least squares line: }\hat{y}&=\hat{\beta}_0+\hat{\beta}_1x\\ &=-.1+.7x \end{align} \]
\[ \begin{align} \text{least squares line: }\hat{y}&=-.1+.7x\\ f(2)&=-.1+(.7)(2)\\ &=1.3\\ \therefore &\text{\$1,300} \end{align} \]
| SSE analysis | ||||
| x | y | $$\hat{y}=-.1+.7x$$ | $$(y-\hat{y})$$ | $$(y-\hat{y})^2$$ |
|---|---|---|---|---|
| 1 | 1 | 0.6 | 0.4 | 0.16 |
| 2 | 1 | 1.3 | -0.3 | 0.09 |
| 3 | 2 | 2.0 | 0.0 | 0.00 |
| 4 | 2 | 2.7 | -0.7 | 0.49 |
| 5 | 4 | 3.4 | 0.6 | 0.36 |
\[ \sum (y-\hat{y})^2=(.16+.09+0+.49+.36)=1.1 \] (See the “Sums” table; notice that this is smaller than the 2 obtained for the visually fitted line)
The results imply that the revenue is equal to -$100 when expenditure is $0. Remember that model parameters should be interpreted only within the sampled range of the independent variable. In this example, the range is $100 to $500. The y-intercept isn’t within this range and is therefore not subject to meaningful interpretation.
(From “Model Assumptions”)
\[ \begin{align} s&=\sqrt{\frac{\text{SSE}}{n-2}}\\ &=\sqrt{\frac{1.1}{5-2}}\\ &=.61 \end{align} \]
\(s=.61\) is the standard error of the regression model
Because \(s\) measures the spread of the distribution of y-values about the least squares line, we should not be surprised to find that most of the observations lie within \(2s\) of the least squares line: \(2s=2\times.61=1.22\approx\text{\$1,220}\) of the least squares line.
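A quick sketch of checking that claim on the adsale data: count how many of the five observations fall within \(2s\) of the fitted line.
fit <- lm(SALES_Y ~ ADVEXP_X, data = adsale)
sum(abs(residuals(fit)) <= 2 * sigma(fit))   # number of observations within 2s of the least squares line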
Testing the Regression Slope \(\beta_1\)–Sales Revenue Model
Conduct a test (at \(\alpha=.05\)) to
determine if sales revenue (\(y\)) is
linearly related to advertising expenditure (\(x\)).
\[ \begin{align} \text{In previous section: }&n=5\\ &\text{df}=n-2=3\\ &\hat{\beta}_1=.7\\ &s=.61\\ &\text{SS}_{xx}=10 \end{align} \]
comparison(2,.05,2,5)
## [1] 3.182446
\[ |t|>(t_{.025}=3.182) \]
lm(adsale$SALES_Y~adsale$ADVEXP_X,adsale) |> summary()
##
## Call:
## lm(formula = adsale$SALES_Y ~ adsale$ADVEXP_X, data = adsale)
##
## Residuals:
## 1 2 3 4 5
## 4.000e-01 -3.000e-01 6.478e-17 -7.000e-01 6.000e-01
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.1000 0.6351 -0.157 0.8849
## adsale$ADVEXP_X 0.7000 0.1915 3.656 0.0354 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6055 on 3 degrees of freedom
## Multiple R-squared: 0.8167, Adjusted R-squared: 0.7556
## F-statistic: 13.36 on 1 and 3 DF, p-value: 0.03535
\[ \text{t-stat: }3.656 \] \[ \begin{align} H_0: \beta_1 &= 0\\ H_a: \beta_1 &\neq 0 \end{align} \]
The t-stat 3.656 exceeds the critical value 3.182 (it falls in the rejection
region \(|t|>3.182\)), so we reject the null
hypothesis and say that the slope \(\beta_1\) is not 0. This means that there
is evidence that expenditure \(x\)
contributes information for revenue \(y\) when a linear model is used.
Also, we can reach the same conclusion by using the p-value (shown as
Pr(>|t|) in the lm summary; the asterisks show
when we can reject the null hypothesis), which is 0.0354. Since \(0.0354<.05\), we reject the null
hypothesis.
Multiple regression models: probabilistic models that include more than one independent variable
\[ y=\beta_0 +\beta_1 x_1+\beta_2 x_2+\dots+\beta_k x_k + \varepsilon \]