Multiple Regression
Name(s):
Name 1, Name 2, Name 3, Name 4, Name 5
Questions
Multiple Regression
Load any packages (installing if necessary).
library("tidyverse")
library("openxlsx")
The estimated regression equation for a model involving two independent variables and 10 observations is as follows: \(y = 37.814 + .6023x_1 + .0199x_2\). The values of \(SS_{Total}\) and \(SS_{Regression}\) were \(7819.661\) and \(7145.691\), respectively.
Find \(SS_{Residual}\) (also called \(SS_{Error}\)).
7819.661-7145.691
## [1] 673.97
sserror =673.97
Compute \(R^2\).
7145.691/7819.661
## [1] 0.9138108
Compute \(R^2_{Adjusted}\).
1-(1-.914)*(10-1)/(10-2-1)
## [1] 0.8894286
\(R^2\) adjusted = 0.889
Compute the \(F\).
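A minimal sketch of this computation in R, using only the sums of squares given above (two regressors, \(n = 10\)):

```r
ss_reg <- 7145.691
ss_err <- 673.97
p <- 2    # number of regressors
n <- 10   # number of observations
ms_reg <- ss_reg / p             # mean square regression
ms_err <- ss_err / (n - p - 1)   # mean square error
f_stat <- ms_reg / ms_err        # roughly 37.1
f_stat
pf(f_stat, p, n - p - 1, lower.tail = FALSE)   # p-value for the overall F-test
```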
Comment on how well the model, overall, fits.
Answer:
The auditing division within a large accounting firm investigated the link between audit delay (Delay), the length of time from a company’s fiscal year-end to date of the auditor’s report, and variables that describe the client and the auditor. Some of the independent variables that were included in the study were:
Industry: A dummy variable coded 1 if the firm was an industrial company or 0 if the firm was a bank, savings and loan, or insurance company.
Public: A dummy variable coded 1 if the company was traded on an organized exchange or over the counter; otherwise coded 0.
Quality: A measure of overall quality of internal controls, as judged by the auditor, on a five-point scale ranging from “virtually none” (1) to “excellent” (5).
Finished: A measure ranging from 1 to 4, as judged by the auditor, where 1 indicates “all work performed subsequent to year-end” and 4 indicates “most work performed prior to year-end.”
The AuditDelay data file for 40 companies can be found here as CSV file, here as an SPSS file, or in the following directory: http://bit.ly/MSBA_Data.
```r
data = read_csv('https://www.dropbox.com/s/2796a0zvgek8yip/AuditDelay.csv?dl=1')
```
```
## Parsed with column specification:
## cols(
## Delay = col_double(),
## Industry = col_double(),
## Public = col_double(),
## Quality = col_double(),
## Finished = col_double()
## )
```
Develop the estimated regression equation using all of the independent variables.
lm(Delay ~ Industry+Public+Quality+Finished,data = data)
##
## Call:
## lm(formula = Delay ~ Industry + Public + Quality + Finished,
## data = data)
##
## Coefficients:
## (Intercept) Industry Public Quality Finished
## 80.429 11.944 -4.816 -2.624 -4.073
Develop a scatter diagram showing Delay as a function of Finished.
plot(Delay ~ Finished, data = data)
Develop an estimated regression equation that can be used to predict Delay by using Industry and Quality.
lm(Delay ~ Industry+Quality,data = data)
##
## Call:
## lm(formula = Delay ~ Industry + Quality, data = data)
##
## Coefficients:
## (Intercept) Industry Quality
## 70.634 12.737 -2.919
As a consultant for a movie theater, you have been asked to estimate weekly gross revenue as a function of advertising expenditures. Historical data for a sample of eight weeks during the summer in which advertising campaigns were used follow. The data file Showtime can be found here as CSV file, here as an SPSS file, or in the following directory: http://bit.ly/MSBA_Data. Hint: Be careful with the units of the data and the units of the question.
```r
consult <- read_csv('https://www.dropbox.com/s/9w0zdr7899qcasz/Showtime.csv?dl=1')
```
```
## Parsed with column specification:
## cols(
## WeeklyGross = col_double(),
## TelevisionAds = col_double(),
## NewspaperAds = col_double()
## )
```
Develop an estimated regression equation with the amount of television advertising as the independent variable.
lm(WeeklyGross ~ TelevisionAds, data = consult)
##
## Call:
## lm(formula = WeeklyGross ~ TelevisionAds, data = consult)
##
## Coefficients:
## (Intercept) TelevisionAds
## 88.638 1.604
Develop an estimated regression equation with both television advertising and newspaper advertising as the independent variables.
lm(WeeklyGross ~ TelevisionAds + NewspaperAds, data = consult)
##
## Call:
## lm(formula = WeeklyGross ~ TelevisionAds + NewspaperAds, data = consult)
##
## Coefficients:
## (Intercept) TelevisionAds NewspaperAds
## 83.230 2.290 1.301
Is the estimated regression equation coefficient for television advertising expenditures the same in part (a) and in part (b)? Interpret the coefficient in each case.
Answer:
What is the estimate of the weekly gross revenue for a week when $3,500 is spent on television advertising and $1800 is spent on newspaper advertising?
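A minimal sketch of the same estimate using `predict()`, refitting the two-predictor model from above; the advertising amounts are entered in $1,000s to match the units of the data:

```r
showtime_both <- lm(WeeklyGross ~ TelevisionAds + NewspaperAds, data = consult)
# $3,500 of television and $1,800 of newspaper advertising, entered in $1,000s
predict(showtime_both, newdata = data.frame(TelevisionAds = 3.5, NewspaperAds = 1.8))
```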
83.230 + (3.5 * 2.290) +(1.8 * 1.301)
## [1] 93.5868
Note that this question extends question 3 above (i.e., the Showtime data) and is thus based on the same data.
For the model \(y=\beta_0+\beta_1x_1+\beta_2x_2+\epsilon\), where \(x_1\) = television advertising ($1,000s) and \(x_2\) = newspaper advertising ($1,000s), use \(\alpha\)=.05 to test the null hypothesis \(H_0: \beta_1=\beta_2=0\) against the alternative hypothesis \(H_a\): \(\beta_1\) and/or \(\beta_2\) is not equal to zero.
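A minimal sketch of the overall test, assuming the `consult` data loaded above; `summary()` reports the F-statistic and its p-value for this null hypothesis, and `qf()` gives the critical value:

```r
showtime_both <- lm(WeeklyGross ~ TelevisionAds + NewspaperAds, data = consult)
summary(showtime_both)                         # overall F on 2 and 5 df, with its p-value
qf(.95, df1 = 2, df2 = nrow(consult) - 2 - 1)  # critical value at alpha = .05
```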
Use \(\alpha\)=.05 to test the significance of \(\beta_1\). Should \(x_1\) be dropped from the model? Why or why not?
Answer:
summary(lm(WeeklyGross ~ NewspaperAds, data = consult))
##
## Call:
## lm(formula = WeeklyGross ~ NewspaperAds, data = consult)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7704 -0.2554 0.2876 1.2274 2.2081
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 93.85641 2.23745 41.95 1.23e-08 ***
## NewspaperAds -0.04299 0.85473 -0.05 0.962
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.061 on 6 degrees of freedom
## Multiple R-squared: 0.0004215, Adjusted R-squared: -0.1662
## F-statistic: 0.00253 on 1 and 6 DF, p-value: 0.9615
The p-value shown above is 0.962, which is greater than .05, so we fail to reject the null hypothesis that this coefficient is zero.
Use \(\alpha\)=.05 to test the significance of \(\beta_2\). Should \(x_2\) be dropped from the model? Why or why not?
Answer:
summary(lm(WeeklyGross ~ TelevisionAds, data = consult))
##
## Call:
## lm(formula = WeeklyGross ~ TelevisionAds, data = consult)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8454 -0.6498 -0.1522 0.7512 1.5507
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 88.6377 1.5824 56.016 2.17e-09 ***
## TelevisionAds 1.6039 0.4778 3.357 0.0153 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.215 on 6 degrees of freedom
## Multiple R-squared: 0.6526, Adjusted R-squared: 0.5946
## F-statistic: 11.27 on 1 and 6 DF, p-value: 0.01529
The p-value shown above is 0.0153, which is less than .05, so we reject the null hypothesis that this coefficient is zero.
Given the following incomplete ANOVA table for a multiple regression analysis of 50 individuals, fill in the appropriate values (a–g).
ANOVA_Table <- as.data.frame(
rbind(
c("Regression", "a", 4, 8, "b", "c"),
c("Residual", 184, "d", "e", "", ""),
c("Total", "f", "g", "", "", "")
))
colnames(ANOVA_Table) <- c("Source", "$SS$", "$df$", "$MS$", "$F$", "Significance $F$")
knitr::kable(ANOVA_Table, caption="ANOVA Source Table")
| Source | \(SS\) | \(df\) | \(MS\) | \(F\) | Significance \(F\) |
|---|---|---|---|---|---|
| Regression | a | 4 | 8 | b | c |
| Residual | 184 | d | e | | |
| Total | f | g | | | |
ANOVA_Table <- as.data.frame(
rbind(
c("Regression", 4*8, 4, 8, round(8/(184/(50-4-1)),2), 1-pf(1.96,4,45)),
c("Residual", 184, 50-4-1, round(184/(50-4-1),2), "", ""),
c("Total",(4*8)+184 , 50-1, "", "", "")
))
colnames(ANOVA_Table) <- c("Source", "$SS$", "$df$", "$MS$", "$F$", "Significance $F$")
knitr::kable(ANOVA_Table, caption="ANOVA Source Table")
| Source | \(SS\) | \(df\) | \(MS\) | \(F\) | Significance \(F\) |
|---|---|---|---|---|---|
| Regression | 32 | 4 | 8 | 1.96 | 0.117 |
| Residual | 184 | 45 | 4.09 | | |
| Total | 216 | 49 | | | |
Barron’s conducts an annual review of online brokers, including both Web-based brokers, which can be accessed via a Web browser, and direct-access brokers, which connect customers directly with the broker’s network server. Each broker’s offerings and performance are evaluated in six areas, using a scale of 0–5 in each category (the higher the better). The results are weighted to obtain an overall score, and a final star rating, ranging from zero to five stars, is assigned to each broker. Trade execution, ease of use, and range of offerings are three of the areas evaluated. The following data show the point values for trade execution, ease of use, range of offerings, and the star rating for a sample of 10 of the online brokers that Barron’s evaluated. The Brokers data file can be found here as CSV file, here as an SPSS file, or in the following directory: http://bit.ly/MSBA_Data.
brokers <- read_csv('https://www.dropbox.com/s/jfu08pc6lrg00dy/Brokers.csv?dl=1')
## Parsed with column specification:
## cols(
## Broker = col_character(),
## TradeEx = col_double(),
## Use = col_double(),
## Range = col_double(),
## Rating = col_double()
## )
model <- lm(Rating ~ TradeEx + Use+Range, data = brokers)
model
##
## Call:
## lm(formula = Rating ~ TradeEx + Use + Range, data = brokers)
##
## Coefficients:
## (Intercept) TradeEx Use Range
## 0.3451 0.2548 0.1325 0.4585
brokers <- brokers %>%
mutate(TradeEx = as.factor(TradeEx), Use = as.factor(Use), Range = as.factor(Range),Rating = as.factor(Rating))
Determine the estimated regression equation (using all predictor variables) that can be used to predict the star rating given the point values for execution, ease of use, and range of offerings.
Answer:
\(\hat{y} = 0.3451 + 0.2548x_1 + 0.1325x_2 + 0.4585x_3\), where \(x_1\) = trade execution, \(x_2\) = ease of use, and \(x_3\) = range of offerings.
Use the \(F\)-test to determine the overall significance of the relationship. What is the conclusion at the \(\alpha\)=.05 level of significance?
summary(model)
##
## Call:
## lm(formula = Rating ~ TradeEx + Use + Range, data = brokers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3251 -0.1171 -0.0599 0.1460 0.3366
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.34510 0.53067 0.650 0.53958
## TradeEx 0.25482 0.08556 2.978 0.02469 *
## Use 0.13249 0.14043 0.944 0.38185
## Range 0.45852 0.12319 3.722 0.00983 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2431 on 6 degrees of freedom
## Multiple R-squared: 0.8856, Adjusted R-squared: 0.8284
## F-statistic: 15.49 on 3 and 6 DF, p-value: 0.00313
qf(.95,3,6)
## [1] 4.757063
The F-statistic (15.49 on 3 and 6 df) exceeds the critical value of 4.76 and the p-value (.00313) is less than .05, so the overall relationship is statistically significant.
Use the \(t\)-test to determine the significance of each independent variable. What is your conclusion at the \(\alpha\)=.05 level of significance?
summary(model)
##
## Call:
## lm(formula = Rating ~ TradeEx + Use + Range, data = brokers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3251 -0.1171 -0.0599 0.1460 0.3366
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.34510 0.53067 0.650 0.53958
## TradeEx 0.25482 0.08556 2.978 0.02469 *
## Use 0.13249 0.14043 0.944 0.38185
## Range 0.45852 0.12319 3.722 0.00983 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2431 on 6 degrees of freedom
## Multiple R-squared: 0.8856, Adjusted R-squared: 0.8284
## F-statistic: 15.49 on 3 and 6 DF, p-value: 0.00313
Answer:
Remove any independent variable that is not significant from the estimated regression equation. What is your recommended estimated regression equation? Compare the \(R^2\) with the value of \(R^2\) from part (a). Discuss the differences.
Answer:
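A minimal sketch of the refit, assuming `Use` is the variable removed (it was not significant above) and that the broker variables are numeric as originally read (the factor conversion earlier would need to be skipped or undone for this to run):

```r
# Drop the non-significant predictor and compare R-squared with the full model's 0.8856
reduced_model <- lm(Rating ~ TradeEx + Range, data = brokers)
summary(reduced_model)
```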
Consider multicollinearity in the context of regression.
In your own words, what does multicollinearity mean?
Answer:
In your own words, what are the implications of multicollinearity?
Answer:
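As an illustration only, a minimal sketch of two common multicollinearity checks, assuming the `brokers` columns are numeric as originally read and that the `car` package (not loaded in this document) is installed:

```r
# Pairwise correlations among the predictors; very high values hint at collinearity
cor(brokers[, c("TradeEx", "Use", "Range")])
# Variance inflation factors; values well above roughly 5 to 10 suggest a problem
car::vif(lm(Rating ~ TradeEx + Use + Range, data = brokers))
```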
Understanding the factors that contribute to housing values is an important consideration of investment firms interested in the residential market. The Mishawaka residential real estate market was evaluated as “opportunistic” by an investment firm. To better understand the Mishawaka market, the firm seeks to model the asking price of a house by various characteristics of the house. This group hires you as a consultant to help them understand the market as it exists today, in hopes of capitalizing on changes in the marketplace in the future. The data set is available for download here as CSV file, here as an SPSS file, or in the following directory: http://bit.ly/MSBA_Data.
house <- read_csv('https://www.dropbox.com/sh/4wm9mff88ium0wv/AABgY6zWFMOt9wd7FIj3pFGIa/Mishawaka_Asking.csv?dl=1')
## Parsed with column specification:
## cols(
## MLS = col_double(),
## Asking = col_double(),
## Square_Footage = col_double(),
## Proportion_Brick_or_Stone = col_double(),
## Baths = col_double(),
## Bedrooms = col_double(),
## Bed_and_Bathrooms = col_double(),
## Garage = col_double(),
## Proportion_Granite = col_double(),
## Proportion_Hardwood = col_double(),
## Basement = col_double(),
## Basement_Finished = col_double(),
## Stories2 = col_double(),
## Lot_Size = col_double(),
## Acreage = col_double(),
## Culdasac = col_double(),
## Fireplaces = col_double(),
## Year_Built = col_double()
## )
What are some relevant descriptive statistics of the data?
summary(house)
## MLS Asking Square_Footage Proportion_Brick_or_Stone
## Min. : 111113 Min. : 16900 Min. : 520 Min. :0.0000
## 1st Qu.:3567585 1st Qu.: 64900 1st Qu.:1008 1st Qu.:0.0000
## Median :6016570 Median : 112450 Median :1366 Median :0.0000
## Mean :5638786 Mean : 152970 Mean :1548 Mean :0.0675
## 3rd Qu.:7837547 3rd Qu.: 170475 3rd Qu.:1797 3rd Qu.:0.0000
## Max. :9803406 Max. :2500000 Max. :6451 Max. :1.0000
## Baths Bedrooms Bed_and_Bathrooms Garage
## Min. :1.00 Min. :1.00 Min. :2.00 Min. :0.00
## 1st Qu.:1.00 1st Qu.:2.00 1st Qu.:3.50 1st Qu.:1.00
## Median :1.50 Median :3.00 Median :4.50 Median :2.00
## Mean :1.64 Mean :2.93 Mean :4.57 Mean :1.48
## 3rd Qu.:2.00 3rd Qu.:3.00 3rd Qu.:5.50 3rd Qu.:2.00
## Max. :4.00 Max. :6.00 Max. :9.00 Max. :4.00
## Proportion_Granite Proportion_Hardwood Basement Basement_Finished
## Min. :0.0000 Min. :0.0000 Min. :0.00 Min. :0.0
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:1.00 1st Qu.:0.0
## Median :0.0000 Median :0.1000 Median :1.00 Median :0.0
## Mean :0.0585 Mean :0.1005 Mean :0.83 Mean :0.3
## 3rd Qu.:0.0000 3rd Qu.:0.1000 3rd Qu.:1.00 3rd Qu.:1.0
## Max. :1.0000 Max. :1.0000 Max. :1.00 Max. :1.0
## Stories2 Lot_Size Acreage Culdasac Fireplaces
## Min. :0.00 Min. : 1914 Min. :0.00 Min. :0.00 Min. :0.00
## 1st Qu.:0.00 1st Qu.: 5040 1st Qu.:0.00 1st Qu.:0.00 1st Qu.:0.00
## Median :0.00 Median : 8955 Median :0.00 Median :0.00 Median :0.00
## Mean :0.28 Mean : 19751 Mean :0.07 Mean :0.01 Mean :0.41
## 3rd Qu.:1.00 3rd Qu.: 18578 3rd Qu.:0.00 3rd Qu.:0.00 3rd Qu.:1.00
## Max. :1.00 Max. :261360 Max. :1.00 Max. :1.00 Max. :1.00
## Year_Built
## Min. :1882
## 1st Qu.:1922
## Median :1950
## Mean :1948
## 3rd Qu.:1972
## Max. :2008
What are some relevant visualizations that help one understand the data?
pairs(house[2:18])
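Beyond the scatterplot matrix, a couple of hedged ggplot2 sketches (tidyverse is already loaded) that could help describe the data:

```r
# Distribution of asking prices
ggplot(house, aes(x = Asking)) +
  geom_histogram(bins = 30)
# Asking price versus square footage
ggplot(house, aes(x = Square_Footage, y = Asking)) +
  geom_point()
```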
The data set contains, among other things, the variables Baths (total number of bathrooms), Bedrooms (total number of bedrooms), and another variable, Bed_and_Bathrooms (which is the sum of Baths and Bedrooms). Explain why including (a) Baths, (b) Bedrooms, and (c) Bed_and_Bathrooms would or would not be a problem in a regression model (i.e., a model that includes all three variables)?
Answer:
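Because Bed_and_Bathrooms is the exact sum of Baths and Bedrooms, a model with all three predictors is perfectly multicollinear. A minimal sketch of what R does in that case (it drops one aliased coefficient and reports it as `NA`):

```r
collinear_fit <- lm(Asking ~ Baths + Bedrooms + Bed_and_Bathrooms, data = house)
coef(collinear_fit)    # one of the three coefficients is NA (aliased)
alias(collinear_fit)   # shows the exact linear dependency among the predictors
```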
Fit a model in which the only predictors are Bed_and_Bathrooms (variable 1) and Square_Footage (variable 2). Provide the estimated regression equation.
lm(Asking ~ Bed_and_Bathrooms + Square_Footage, data = house)
##
## Call:
## lm(formula = Asking ~ Bed_and_Bathrooms + Square_Footage, data = house)
##
## Coefficients:
## (Intercept) Bed_and_Bathrooms Square_Footage
## -89587.1 -45720.7 291.7
Remove MLS=1575935 and rerun the model above (i.e., with Bed and Bathrooms and Square Footage).
house2 <- house %>%
filter(MLS != 1575935)
lm(Asking ~ Bed_and_Bathrooms + Square_Footage, data = house2)
##
## Call:
## lm(formula = Asking ~ Bed_and_Bathrooms + Square_Footage, data = house2)
##
## Coefficients:
## (Intercept) Bed_and_Bathrooms Square_Footage
## -18931.68 227.37 98.23
Describe what major change happened in the output from d and e above. Explain why.
Answer: MLS 1575935 is an extreme outlier (the $2,500,000 house); removing it changes the coefficients considerably. In particular, the Bed_and_Bathrooms coefficient flips from about -45,721 to 227.
Again with MLS=1575935 removed, fit a model with Square Footage, Year Built, Basement Finished, and Lot Size and save the unstandardized residuals and the unstandardized predicted values. Provide a visual assessment and comments on the normality of errors assumption.
lm(Asking ~ Square_Footage + Year_Built+ Basement_Finished+Lot_Size, data = house2)
##
## Call:
## lm(formula = Asking ~ Square_Footage + Year_Built + Basement_Finished +
## Lot_Size, data = house2)
##
## Coefficients:
## (Intercept) Square_Footage Year_Built Basement_Finished
## -1.583e+06 6.647e+01 8.172e+02 5.078e+04
## Lot_Size
## 3.356e-01
For the model and data in part h, provide comments and a visual assessment of the homoscedasticity assumption. Do this by plotting the saved values (with the unstandardized residuals on the ordinate [\(y\)-axis] and the unstandardized predicted values on the abscissa [\(x\)-axis]).
```r
model <-lm(Asking ~ Square_Footage + Year_Built+ Basement_Finished+Lot_Size, data = house2)
```
plot(model)
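A minimal sketch that saves the unstandardized residuals and predicted values from `model` explicitly (as parts h and i ask) and plots them with residuals on the \(y\)-axis and predicted values on the \(x\)-axis, plus quick normality checks:

```r
resid_unstd <- resid(model)    # unstandardized residuals
pred_unstd  <- fitted(model)   # unstandardized predicted values
# Homoscedasticity: residuals versus predicted values
plot(pred_unstd, resid_unstd,
     xlab = "Unstandardized predicted values",
     ylab = "Unstandardized residuals")
abline(h = 0, lty = 2)
# Normality of errors
hist(resid_unstd)
qqnorm(resid_unstd); qqline(resid_unstd)
```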
For the model and data in part h, remove all houses that have an Asking price of $225,000 or above. Now, like h, provide a visual assessment and comments on the normality of errors assumption.
house3 <- house2 %>%
filter(Asking < 225000)
model.2 <- lm(Asking ~ Square_Footage + Year_Built+ Basement_Finished+Lot_Size, data = house3)
plot(model.2)
For the model and data in part j, provide comments and a visual assessment of the homoscedasticity assumption as you did in part i. Again, do this by plotting the saved values (with the unstandardized residuals on the ordinate [\(y\)-axis] and the unstandardized predicted values on the abscissa [\(x\)-axis]).
plot(model.2)
Do these data (i.e., no houses at or above $225,000 [as in j and k]) satisfy the assumptions any better or any worse than the data that excluded only the house with an asking price of $2,500,000 (as in h and i)?
Answer: Removing these houses greatly improves homoscedasticity, so the assumptions are satisfied better than in parts h and i.
What are the implications and are there any cautions to consider by conditioning the analysis to houses that are less than $225,000?
Answer: We need to use caution when predicting asking prices at or above $225,000 because those houses were removed; the model is only calibrated for houses below that price.
Provide a matrix plot for Asking, Square Footage, Year Built, Basement Finished, and Lot Size.
house3 %>%
select(Asking,Square_Footage, Year_Built,Basement_Finished,Lot_Size) %>%
pairs()
What is the predicted asking price for a house with Square Footage = 1,800, Year Built = 1996, Basement Finished = 1, and Lot Size = 43,560?
new <- data.frame(Square_Footage =1800,
Year_Built =1996,
Basement_Finished=1,
Lot_Size=43560)
predict(model.2, newdata = new)
## 1
## 190527.7
i. What is the 95% confidence interval for the mean asking price of houses like this?
new <- data.frame(Square_Footage =1800,
Year_Built =1996,
Basement_Finished=1,
Lot_Size=43560)
predict(model.2, newdata = new,interval ='confidence')
## fit lwr upr
## 1 190527.7 174407.1 206648.3
ii. What is the 95% prediction interval for such a house you are considering purchasing?
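Because this asks about an individual house rather than the mean, a minimal sketch using `interval = "prediction"`, which produces the wider interval appropriate for a single observation (same house characteristics as in the answer below):

```r
new2 <- data.frame(Square_Footage = 2628,
                   Year_Built = 2015,
                   Basement_Finished = 0,
                   Lot_Size = 26136)
# interval = "prediction" accounts for the residual variance of an individual
# house, not just the uncertainty in the estimated mean
predict(model.2, newdata = new2, interval = "prediction")
```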
new2 <- data.frame(Square_Footage =2628,
Year_Built =2015,
Basement_Finished=0,
Lot_Size=26136)
predict(model.2, newdata = new2, interval ='confidence')
## fit lwr upr
## 1 221725.9 201907.5 241544.3
In the last assignment you used the FairValue data to find three correlation coefficients. Take the FairValue data and now apply a multiple regression model in which FairValue is modeled from SharePrice and EarningsPerShare. The Fair Value data set is available here.
fair <- read_csv('https://www.dropbox.com/s/nr20yk7mcmg8s49/FairValue.csv?dl=1')
## Parsed with column specification:
## cols(
## Company = col_character(),
## Fair_Value = col_double(),
## Share_Price = col_double(),
## Earnings_Per_Share = col_double()
## )
fairmodel <- lm(Fair_Value ~ Share_Price+Earnings_Per_Share,data = fair)
fairmodel
##
## Call:
## lm(formula = Fair_Value ~ Share_Price + Earnings_Per_Share, data = fair)
##
## Coefficients:
## (Intercept) Share_Price Earnings_Per_Share
## 17.5142 0.7109 1.3834
summary(fairmodel)
##
## Call:
## lm(formula = Fair_Value ~ Share_Price + Earnings_Per_Share, data = fair)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.1698 -8.7624 -0.7402 9.8206 18.7756
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.51418 3.70684 4.725 6.38e-05 ***
## Share_Price 0.71085 0.05527 12.861 5.01e-13 ***
## Earnings_Per_Share 1.38339 1.27796 1.082 0.289
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.65 on 27 degrees of freedom
## Multiple R-squared: 0.9144, Adjusted R-squared: 0.908
## F-statistic: 144.1 on 2 and 27 DF, p-value: 3.896e-15
Overall, does the model provide any inferential evidence that it accounts for more variance in \(y\) (the dependent variable) than would have been expected by chance alone? Describe how you know this.
Answer:
What is a descriptive summary of the overall fit of the model (specifically, how much variance in \(y\) do you estimate that the model accounts for in the population)?
Answer:
Describe, separately, how well each of the regressors “worked” or “did not work” in modeling the outcome variable.
Answer:
Describe, in words, an overall managerial summary of the model and the implications of the model that would be relevant to the top-management team (who are interested in the findings).
Answer:
Provide output in a way that would be reasonable in a report to the top-management team (do not include the raw data).
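One possible presentation, as a minimal sketch: a rounded coefficient table from `fairmodel` rendered with `knitr::kable()`, which is already used earlier in this document:

```r
coef_table <- summary(fairmodel)$coefficients
knitr::kable(round(coef_table, 3),
             caption = "Multiple regression of Fair Value on Share Price and Earnings Per Share")
```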
# The answer is....
Here is a college football data set on bowl games. The code below downloads the Excel data set and then uses the openxlsx package to load it into the R workspace.
Bowl_Game <- openxlsx::read.xlsx("http://www.dropbox.com/s/yw2wj8htfew7i5i/Bowl_Game_Spreads.xlsx?dl=1")
# tibble::glimpse(Bowl_Game)
# summary(Bowl_Game)
Fit a simple or multiple regression model of your choice and summarize your findings.
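A hypothetical sketch of the pattern only; the column names `Spread` and `Actual_Margin` are assumptions and are not confirmed by the data set (uncomment `glimpse(Bowl_Game)` above to see the real names):

```r
# Hypothetical: replace Spread and Actual_Margin with actual column names
bowl_model <- lm(Actual_Margin ~ Spread, data = Bowl_Game)
summary(bowl_model)
```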
# The answer is....
In the last assignment you fitted multiple simple regressions. Here, fit the multiple regression model with all of the regressors.
# The answer is....