Multiple Regression
Name(s):
Mary Keller
Questions
Multiple Regression
Load any packages (installing if necessary).
library("tidyverse")
library("openxlsx")
library("tinytex")
# The answer is....
SSTotal <- 7819.661
SSRegression <- 7145.691
SSError <- SSTotal-SSRegression
SSError
## [1] 673.97
# The answer is....
R2 <- SSRegression/SSTotal
R2
## [1] 0.9138108
# The answer is....
R2Adjusted <- (1 - ((1-R2) * ((10-1)/(10-2-1))))
R2Adjusted
## [1] 0.8891854
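For reference, the computation above applies the adjusted R^2 formula with \(n = 10\) observations and \(k = 2\) predictors:

$$R^2_{adj} = 1 - (1 - R^2)\,\frac{n-1}{n-k-1} = 1 - (1 - .9138)\,\frac{9}{7} \approx .8892$$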
# The answer is....
df1 <- 2
df2 <- 10-3
MSResidual <- SSError/df2
MSRegression <- SSRegression/df1
F <- MSRegression/MSResidual
F
## [1] 37.10836
Pvalue <- 1-pf(F,df1,df2, lower.tail = TRUE)
Pvalue
## [1] 0.0001879681
a. Comment on how well the model, overall, fits.
**Answer**: With F(2, 7) = 37.11, the p-value is less than .001, so the overall model is statistically significant and fits well. The adjusted R^2 of .8892 estimates that the two predictors account for about 89% of the variance in the outcome in the population. Because R^2 ranges from 0 to 1, a value this close to 1 indicates the model is doing a good job of predicting Y.
AuditDelay <- AuditDelayOriginal <- read_csv('https://www.dropbox.com/s/2796a0zvgek8yip/AuditDelay.csv?dl=1')
##
## -- Column specification --------------------------------------------------------
## cols(
## Delay = col_double(),
## Industry = col_double(),
## Public = col_double(),
## Quality = col_double(),
## Finished = col_double()
## )
glimpse(AuditDelay)
## Rows: 40
## Columns: 5
## $ Delay <dbl> 62, 45, 54, 71, 91, 62, 61, 69, 80, 52, 47, 65, 60, 81, 73...
## $ Industry <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1...
## $ Public <dbl> 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Quality <dbl> 3, 3, 2, 1, 1, 4, 3, 5, 1, 5, 3, 2, 1, 1, 2, 2, 5, 2, 1, 5...
## $ Finished <dbl> 1, 3, 2, 2, 1, 4, 2, 2, 1, 3, 2, 3, 3, 2, 2, 1, 4, 2, 2, 2...
Model.1 <- lm(Delay~1 + Industry + Public + Quality + Finished, data=AuditDelay)
summary(Model.1)
##
## Call:
## lm(formula = Delay ~ 1 + Industry + Public + Quality + Finished,
## data = AuditDelay)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.8444 -6.8409 0.6387 7.7526 18.5409
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 80.429 5.916 13.595 1.57e-15 ***
## Industry 11.944 3.798 3.145 0.00338 **
## Public -4.816 4.229 -1.139 0.26252
## Quality -2.624 1.184 -2.217 0.03324 *
## Finished -4.073 1.851 -2.200 0.03453 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.92 on 35 degrees of freedom
## Multiple R-squared: 0.3826, Adjusted R-squared: 0.312
## F-statistic: 5.422 on 4 and 35 DF, p-value: 0.001666
# The answer is....
# Scatter plot of Delay (vertical axis) against Finished (horizontal axis)
plot(AuditDelay$Finished, AuditDelay$Delay)
i. What does this figure indicate about the relationship between Delay and Finished?
Answer: The scatter plot indicates a curvilinear relationship between Delay and Finished.
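If one wanted to accommodate that curvature directly, a quadratic term for Finished could be added; this is only a sketch (hypothetical model name `Model.curve`), not the model fit next:

```r
# Hedged sketch: allow a curvilinear effect of Finished via a squared term
Model.curve <- lm(Delay ~ 1 + Industry + Public + Quality + Finished + I(Finished^2),
                  data = AuditDelay)
summary(Model.curve)
```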
```r
Model.2 <- lm(Delay~1 + Industry + Quality , data=AuditDelay)
summary(Model.2)
```
```
##
## Call:
## lm(formula = Delay ~ 1 + Industry + Quality, data = AuditDelay)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.6958 -7.9195 0.1632 8.9057 23.2851
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 70.634 4.558 15.498 < 2e-16 ***
## Industry 12.737 3.966 3.212 0.00273 **
## Quality -2.919 1.238 -2.357 0.02383 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.56 on 37 degrees of freedom
## Multiple R-squared: 0.2689, Adjusted R-squared: 0.2293
## F-statistic: 6.803 on 2 and 37 DF, p-value: 0.003048
```
Showtime <- read_csv('https://www.dropbox.com/s/9w0zdr7899qcasz/Showtime.csv?dl=1')
##
## -- Column specification --------------------------------------------------------
## cols(
## WeeklyGross = col_double(),
## TelevisionAds = col_double(),
## NewspaperAds = col_double()
## )
summary(Showtime)
## WeeklyGross TelevisionAds NewspaperAds
## Min. :90.00 Min. :2.000 Min. :1.500
## 1st Qu.:93.50 1st Qu.:2.500 1st Qu.:1.875
## Median :94.00 Median :3.000 Median :2.400
## Mean :93.75 Mean :3.188 Mean :2.475
## 3rd Qu.:95.00 3rd Qu.:3.625 3rd Qu.:2.700
## Max. :96.00 Max. :5.000 Max. :4.200
Develop an estimated regression equation with the amount of television advertising as the independent variable.
# The answer is....
model.3 <- lm(WeeklyGross~1 +TelevisionAds, data=Showtime)
Answer: The estimated regression equation is Estimated WeeklyGross = 88.638 + 1.604(TelevisionAds).
Develop an estimated regression equation with both television advertising and newspaper advertising as the independent variables.
# The answer is....
model.4 <- lm(WeeklyGross~1 +TelevisionAds + NewspaperAds, data=Showtime)
Answer: The estimated regression equation is Estimated WeeklyGross = 83.230 + 2.290(TelevisionAds) + 1.301(NewspaperAds).
Is the estimated regression equation coefficient for television advertising expenditures the same in part (a) and in part (b)? Interpret the coefficient in each case.
Answer: No, the coefficient is not the same. In multiple regression all of the independent variables are considered simultaneously, so each coefficient reflects a variable's unique effect on the dependent variable.
In part (a), a one-unit increase in television advertising is associated with a 1.604 increase in weekly gross revenue. In part (b), a one-unit increase in television advertising is associated with a 2.290 increase in weekly gross revenue, holding newspaper advertising constant.
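A quick way to see this change in the television-advertising coefficient is to pull it from each fitted model (a sketch using the objects fit above):

```r
# TelevisionAds slope without and with NewspaperAds in the model
coef(model.3)["TelevisionAds"]
coef(model.4)["TelevisionAds"]
```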
anova(model.3, model.4)
What is the estimate of the weekly gross revenue for a week when $3,500 is spent on television advertising and $1800 is spent on newspaper advertising?
# The answer is....
# Given the scale of the data (weekly gross runs from 90 to 96), the advertising
# variables appear to be recorded in thousands of dollars, so $3,500 and $1,800
# enter the equation as 3.5 and 1.8.
83.230 + (2.290*3.5) + (1.301*1.8)
## [1] 93.5868
Answer: The estimated weekly gross revenue is about 93.59, i.e., roughly $93,590.
summary(lm(WeeklyGross~1 +TelevisionAds + NewspaperAds, data=Showtime))
##
## Call:
## lm(formula = WeeklyGross ~ 1 + TelevisionAds + NewspaperAds,
## data = Showtime)
##
## Residuals:
## 1 2 3 4 5 6 7 8
## -0.6325 -0.4124 0.6577 -0.2080 0.6061 -0.2380 -0.4197 0.6469
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 83.2301 1.5739 52.882 4.57e-08 ***
## TelevisionAds 2.2902 0.3041 7.532 0.000653 ***
## NewspaperAds 1.3010 0.3207 4.057 0.009761 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6426 on 5 degrees of freedom
## Multiple R-squared: 0.919, Adjusted R-squared: 0.8866
## F-statistic: 28.38 on 2 and 5 DF, p-value: 0.001865
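As a cross-check on the prediction computed above, the same value can be obtained from the fitted model with `predict()` (a sketch; it assumes, as above, that the advertising variables are recorded in thousands of dollars):

```r
# Predicted weekly gross revenue for $3,500 of TV ads and $1,800 of newspaper ads
predict(model.4, newdata = data.frame(TelevisionAds = 3.5, NewspaperAds = 1.8))
```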
Answer: H0: β1 = β2 = 0; Ha: at least one of β1, β2 is not equal to zero. Because the significance F (p-value = .001865) is less than .05, we reject the null hypothesis and conclude that at least one of β1 and β2 differs from zero.
ANOVA_Table <- as.data.frame(
rbind(
c("Regression", "a", 4, 8, "b", "c"),
c("Residual", 184, "d", "e", "", ""),
c("Total", "f", "g", "", "", "")
))
colnames(ANOVA_Table) <- c("Source", "$SS$", "$df$", "$MS$", "$F$", "Significance $F$")
knitr::kable(ANOVA_Table, caption="ANOVA Source Table")
| Source | \(SS\) | \(df\) | \(MS\) | \(F\) | Significance \(F\) |
|---|---|---|---|---|---|
| Regression | a | 4 | 8 | b | c |
| Residual | 184 | d | e | | |
| Total | f | g | | | |
n <- 50
df1 <- 4
p <- 4
MSRegression2 <- 8
SSResidual2 <- 184
df2 <- 50-4-1
SSRegression2 <- MSRegression2*p
SSRegression2
## [1] 32
MSResidual2 <- SSResidual2/df2
MSResidual2
## [1] 4.088889
F2 <- MSRegression2/MSResidual2
F2
## [1] 1.956522
SSTotal2 <- SSRegression2 + SSResidual2
SSTotal2
## [1] 216
PValue2 <- 1-pf(F2,df1,df2, lower.tail = TRUE)
PValue2
## [1] 0.1174935
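Putting the computed quantities back into the source-table layout (a sketch assembled only from the values calculated above; `ANOVA_Completed` is a hypothetical name):

```r
# Completed ANOVA source table from the quantities computed above
ANOVA_Completed <- data.frame(
  Source = c("Regression", "Residual", "Total"),
  SS = c(SSRegression2, SSResidual2, SSTotal2),
  df = c(df1, df2, df1 + df2),
  MS = c(MSRegression2, round(MSResidual2, 3), NA),
  F = c(round(F2, 3), NA, NA),
  Significance_F = c(round(PValue2, 4), NA, NA)
)
knitr::kable(ANOVA_Completed, caption = "Completed ANOVA Source Table")
```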
Brokers <- read_csv('https://www.dropbox.com/s/jfu08pc6lrg00dy/Brokers.csv?dl=1')
##
## -- Column specification --------------------------------------------------------
## cols(
## Broker = col_character(),
## TradeEx = col_double(),
## Use = col_double(),
## Range = col_double(),
## Rating = col_double()
## )
summary(Brokers)
## Broker TradeEx Use Range
## Length:10 Min. :1.400 Min. :2.500 Min. :2.500
## Class :character 1st Qu.:2.275 1st Qu.:3.000 1st Qu.:3.125
## Mode :character Median :2.850 Median :3.500 Median :3.350
## Mean :2.940 Mean :3.400 Mean :3.610
## 3rd Qu.:3.625 3rd Qu.:3.675 3rd Qu.:4.150
## Max. :4.800 Max. :4.500 Max. :4.800
## Rating
## Min. :2.0
## 1st Qu.:3.0
## Median :3.5
## Mean :3.2
## 3rd Qu.:3.5
## Max. :4.0
Model.5 <- lm(Rating~1 + TradeEx + Use + Range, data=Brokers)
Model.5
##
## Call:
## lm(formula = Rating ~ 1 + TradeEx + Use + Range, data = Brokers)
##
## Coefficients:
## (Intercept) TradeEx Use Range
## 0.3451 0.2548 0.1325 0.4585
# The answer is.... Estimated Rating = 0.34510 + 0.25482(TradeEx) + 0.13249(Use) + 0.45852(Range)
F <- qf(.95, 3, 6,lower.tail = TRUE)
F
## [1] 4.757063
summary(lm(Rating~1 + TradeEx + Use + Range, data=Brokers))
##
## Call:
## lm(formula = Rating ~ 1 + TradeEx + Use + Range, data = Brokers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3251 -0.1171 -0.0599 0.1460 0.3366
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.34510 0.53067 0.650 0.53958
## TradeEx 0.25482 0.08556 2.978 0.02469 *
## Use 0.13249 0.14043 0.944 0.38185
## Range 0.45852 0.12319 3.722 0.00983 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2431 on 6 degrees of freedom
## Multiple R-squared: 0.8856, Adjusted R-squared: 0.8284
## F-statistic: 15.49 on 3 and 6 DF, p-value: 0.00313
**Answer**: The observed F(3, 6) = 15.49 exceeds the critical F of 4.76, and the p-value is .00313, so the overall model is statistically significant and fits well.
The adjusted R^2 of .8284 estimates that the three predictors account for about 83% of the variance in Rating in the population. Because R^2 ranges from 0 to 1, a value this high indicates the model is predicting Y well.
Model.6 <- lm(formula = Rating ~ 1 + TradeEx + Range, data = Brokers)
Model.6
##
## Call:
## lm(formula = Rating ~ 1 + TradeEx + Range, data = Brokers)
##
## Coefficients:
## (Intercept) TradeEx Range
## 0.6718 0.2641 0.4853
summary(lm(formula = Rating ~ 1 + TradeEx + Range, data = Brokers))
##
## Call:
## lm(formula = Rating ~ 1 + TradeEx + Range, data = Brokers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.25470 -0.17414 -0.03772 0.16976 0.37492
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.67184 0.39892 1.684 0.13603
## TradeEx 0.26406 0.08432 3.131 0.01658 *
## Range 0.48527 0.11893 4.080 0.00469 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2412 on 7 degrees of freedom
## Multiple R-squared: 0.8686, Adjusted R-squared: 0.8311
## F-statistic: 23.15 on 2 and 7 DF, p-value: 0.0008214
Recommended regression equation: Estimated Rating = 0.67184 + 0.26406(TradeEx) + 0.48527(Range)
The adjusted R^2 in part (a) is .828; when Use is removed as an independent variable, the multiple R^2 drops only slightly (from .886 to .869) while the adjusted R^2 rises to .831. R^2 summarizes how tightly the data points scatter around the fitted regression line, and the adjusted version penalizes predictors that add little beyond chance, so dropping Use costs almost no explained variance and gives a slightly better-adjusted fit.
With F(2, 7) = 23.15 and a p-value of .0008, the overall model is statistically significant and a good fit. The adjusted R^2 of .831 estimates that TradeEx and Range account for about 83% of the variance in Rating in the population; since R^2 ranges from 0 to 1, this indicates the model predicts Y well.
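An incremental F test comparing the two broker models makes the same point formally; this sketch tests whether adding Use improves fit beyond TradeEx and Range:

```r
# Compare the reduced model (TradeEx + Range) against the full model (adds Use)
anova(Model.6, Model.5)
```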
Step 1: Load the data
# Use dl=1 so Dropbox serves the raw CSV rather than its HTML preview page
Mishawaka_Asking_Original <- read_csv('https://www.dropbox.com/sh/4wm9mff88ium0wv/AABgY6zWFMOt9wd7FIj3pFGIa/Mishawaka_Asking.csv?dl=1')
##
## -- Column specification --------------------------------------------------------
## cols(
## MLS = col_double(),
## Asking = col_double(),
## Square_Footage = col_double(),
## Proportion_Brick_or_Stone = col_double(),
## Baths = col_double(),
## Bedrooms = col_double(),
## Bed_and_Bathrooms = col_double(),
## Garage = col_double(),
## Proportion_Granite = col_double(),
## Proportion_Hardwood = col_double(),
## Basement = col_double(),
## Basement_Finished = col_double(),
## Stories2 = col_double(),
## Lot_Size = col_double(),
## Acreage = col_double(),
## Culdasac = col_double(),
## Fireplaces = col_double(),
## Year_Built = col_double()
## )
Step 2: Glimpse the data
glimpse(Mishawaka_Asking_Original)
b. What are some relevant visualizations that help one understand the data?
**Answer**:
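A few candidate plots, sketched under the assumption that the CSV loads with the column names shown in the specification above:

```r
# Distribution of the outcome and its relationship with a key predictor (sketch)
hist(Mishawaka_Asking_Original$Asking, main = "Asking price", xlab = "Asking")
plot(Mishawaka_Asking_Original$Square_Footage, Mishawaka_Asking_Original$Asking,
     xlab = "Square footage", ylab = "Asking price")
```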
a. The data set contains, among other things, the variables Baths (total number of bathrooms), Bedrooms (total number of bedrooms), and another variable, Bed_and_Bathrooms (which is the sum of Baths and Bedrooms). Explain why including (a) Baths, (b) Bedrooms, and (c) Bed_and_Bathrooms would or would not be a problem in a regression model (i.e., a model that includes all three variables)?
**Answer**:
a. Fit a model in which the only predictors are Bed_and_Bathrooms (variable 1) and Square_Footage (variable 2). Provide the estimated regression equation.
```r
# The answer is....
```
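A minimal sketch of the requested model, assuming the data frame loaded with the column names listed above (`Model.d` is a hypothetical name):

```r
# Asking price modeled from Bed_and_Bathrooms and Square_Footage (sketch)
Model.d <- lm(Asking ~ 1 + Bed_and_Bathrooms + Square_Footage,
              data = Mishawaka_Asking_Original)
summary(Model.d)
```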
a. Remove MLS=1575935 and rerun the model above (i.e., with Bed and Bathrooms and Square Footage).
```r
# The answer is....
```
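One way to drop that listing and refit, sketched with the hypothetical names `Mishawaka_No_Outlier` and `Model.e`:

```r
# Remove MLS 1575935 and refit the same model (sketch)
Mishawaka_No_Outlier <- dplyr::filter(Mishawaka_Asking_Original, MLS != 1575935)
Model.e <- lm(Asking ~ 1 + Bed_and_Bathrooms + Square_Footage,
              data = Mishawaka_No_Outlier)
summary(Model.e)
```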
a. Describe what *major* change happened in the output from d and e above. Explain why.
**Answer**:
a. Again with MLS=1575935 removed, fit a model with Square Footage, Year Built, Basement Finished, and Lot Size and save the unstandardized residuals and the unstandardized predicted values. Provide a visual assessment and comments on the normality of errors assumption.
**Answer**:
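One way to fit this model, save the requested values, and inspect normality, sketched with the `Mishawaka_No_Outlier` data frame from above and the hypothetical name `Model.h`:

```r
# Fit the model and save unstandardized residuals and predicted values (sketch)
Model.h <- lm(Asking ~ 1 + Square_Footage + Year_Built + Basement_Finished + Lot_Size,
              data = Mishawaka_No_Outlier)
Residuals.h <- resid(Model.h)      # unstandardized residuals
Predicted.h <- predict(Model.h)    # unstandardized predicted values
# Visual checks of the normality-of-errors assumption
hist(Residuals.h, main = "Model residuals", xlab = "Unstandardized residual")
qqnorm(Residuals.h); qqline(Residuals.h)
```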
a. For the model and data in part h, provide comments and a visual assessment of the homoscedasticity assumption. Do this by plotting the saved values (with the unstandardized residuals on the ordinate [$y$-axis] and the unstandardized predicted values on the abscissa [$x$-axis]).
**Answer**:
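A sketch of the requested plot, using the values saved in the sketch above:

```r
# Residuals (ordinate) against predicted values (abscissa) to assess homoscedasticity
plot(Predicted.h, Residuals.h,
     xlab = "Unstandardized predicted value", ylab = "Unstandardized residual")
abline(h = 0, lty = 2)
```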
a. For the model and data in part h, remove all houses that have an Asking price of \$225,000 or above. Now, like h, provide a visual assessment and comments on the normality of errors assumption.
**Answer**:
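A sketch restricting to houses under \$225,000 and re-checking normality (hypothetical names `Mishawaka_Under_225K` and `Model.j`):

```r
# Keep only houses asking less than $225,000, refit, and check normality of errors
Mishawaka_Under_225K <- dplyr::filter(Mishawaka_No_Outlier, Asking < 225000)
Model.j <- lm(Asking ~ 1 + Square_Footage + Year_Built + Basement_Finished + Lot_Size,
              data = Mishawaka_Under_225K)
Residuals.j <- resid(Model.j)
Predicted.j <- predict(Model.j)
hist(Residuals.j, main = "Residuals (Asking < $225,000)", xlab = "Unstandardized residual")
qqnorm(Residuals.j); qqline(Residuals.j)
```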
a. For the model and data in part j, provide comments and a visual assessment of the homoscedasticity assumption as you did in part i. Again, do this by plotting the saved values (with the unstandardized residuals on the ordinate [$y$-axis] and the unstandardized predicted values on the abscissa [$x$-axis]).
**Answer**:
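A sketch of the corresponding homoscedasticity plot for the under-\$225,000 model:

```r
# Residuals against predicted values for the restricted model
plot(Predicted.j, Residuals.j,
     xlab = "Unstandardized predicted value", ylab = "Unstandardized residual")
abline(h = 0, lty = 2)
```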
a. Does this data (i.e., no houses at or above \$225,000 [as in j and k]) satisfy the assumptions any better or any worse than the data that excluded only the house with an asking price of \$2,500,000 (as in h and i)?
**Answer**:
a. What are the implications and are there any cautions to consider by conditioning the analysis to houses that are less than \$225,000?
**Answer**:
a. Provide a matrix plot for Asking, Square Footage, Year Built, Basement Finished, and Lot Size.
```r
# The answer is....
```
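A sketch of the matrix plot using base R's `pairs()` (shown here for the under-\$225,000 data):

```r
# Scatterplot matrix of the asking price and the four predictors
pairs(~ Asking + Square_Footage + Year_Built + Basement_Finished + Lot_Size,
      data = Mishawaka_Under_225K)
```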
a. Again using the data for houses less than \$225,000 as a basis of your model, consider a house in the area that has the following properties: `Square Footage=1,800`, `Year Built=1996`, `Basement Finished=1`, and `Lot Size=43,560`.
i. What is the optimal predicted value for the Asking Price?
```r
# The answer is....
```
i. What is the 95% prediction interval for the mean for houses like this?
```r
# The answer is....
```
i. What is the 95% prediction interval for such a house you are considering purchasing?
```r
# The answer is....
```
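A sketch covering the three sub-parts with `predict()`, using the under-\$225,000 model and the hypothetical names introduced above:

```r
# House of interest: 1,800 sq ft, built 1996, finished basement, 43,560 sq ft lot
New_House <- data.frame(Square_Footage = 1800, Year_Built = 1996,
                        Basement_Finished = 1, Lot_Size = 43560)
predict(Model.j, newdata = New_House)                           # optimal (point) prediction
predict(Model.j, newdata = New_House, interval = "confidence")  # 95% interval for the mean of such houses
predict(Model.j, newdata = New_House, interval = "prediction")  # 95% interval for an individual house
```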
In the last assignment you used the FairValue data to find three correlation coefficients. Take the FairValue data and now apply a multiple regression model in which FairValue is modeled from SharePrice and EarningsPerShare. The Fair Value data set is available here.
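A sketch of the requested model, assuming the FairValue data have already been read into a data frame named `FairValue` with columns `FairValue`, `SharePrice`, and `EarningsPerShare` (names assumed from the question; the link is not reproduced here):

```r
# FairValue modeled from SharePrice and EarningsPerShare (sketch)
Model.FV <- lm(FairValue ~ 1 + SharePrice + EarningsPerShare, data = FairValue)
summary(Model.FV)
```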
Overall, does the model provide any inferential evidence that it accounts for more variance in \(y\) (the dependent variable) than would have been expected by chance alone? Describe how you know this.
Answer:
What is a descriptive summary of the overall fit of the model (specifically, how much variance in \(y\) do you estimate that the model accounts for in the population)?
Answer:
Describe, separately, how well each of the regressors “worked” or “did not work” in modeling the outcome variable.
Answer:
Describe, in words, an overall managerial summary of the model and the implications of the model that would be relevant to the top-management team (who are interested in the findings).
Answer:
Provide output in a way that would be reasonable in a report to the top-management team (do not include the raw data).
# The answer is....
Here is a college football data set on bowl games. The code below downloads the Excel data set and then uses the openxlsx package to read the file into the R workspace.
Bowl_Game <- openxlsx::read.xlsx("http://www.dropbox.com/s/yw2wj8htfew7i5i/Bowl_Game_Spreads.xlsx?dl=1")
# tibble::glimpse(Bowl_Game)
# summary(Bowl_Game)
Fit a simple or multiple regression model of your choice and summarize your findings.
```r
# The answer is....
```
In the last assignment you fitted multiple simple regressions. Here, fit the multiple regression model with all of the regressors.
```r
# The answer is....
```