Notre Dame’s MSBA Program

Multiple Regression

Name(s):
Mary Keller

Questions
Multiple Regression

Load any packages (installing if necessary).

library("tidyverse")
library("openxlsx")
library("tinytex")

The estimated regression equation for a model involving two independent variables and 10 observations is as follows: $y = 37.814 + .6023x_1 + .0199x_2$. The values of $SS_{Total}$ and $SS_{Regression}$ were $7819.661$ and $7145.691$, respectively.
1. Find $SS_{Residual}$ (also called $SS_{Error}$).
```
    # The answer is.... 
    SSTotal <- 7819.661
    SSRegression <- 7145.691
    SSError <- SSTotal-SSRegression
    SSError
```
```
## [1] 673.97
```
1. Compute $R^2$.
```
    # The answer is.... 
    R2 <- SSRegression/SSTotal
    R2
```
```
## [1] 0.9138108
```
1. Compute $R^2_{Adjusted}$.
```
    # The answer is.... 
    R2Adjusted <- (1 - ((1-R2) * ((10-1)/(10-2-1))))
    R2Adjusted 
```
```
## [1] 0.8891854
```
1. Compute the $F$.

        # The answer is....
    df1 <- 2
    df2 <- 10-3
    MSResidual <-  SSError/df2   
    MSRegression <- SSRegression/df1
        F <- MSRegression/MSResidual
        F

## [1] 37.10836

   Pvalue <- 1-pf(F,df1,df2, lower.tail = TRUE)
   Pvalue

## [1] 0.0001879681

a. Comment on how well the model, overall, fits.  
    **Answer**:   P-Value is less than <.001 because F Value is 37. Overall model is <.001. Yes, it is a good fit. The AdjR^2 statistically significant at 89% of the variance of hte population. Therefore, due to the high R^2 we are doing a good job predicting the Y with the two R^2. Recalling R^2 metric from 0-1. Fitting really well at Adj R^2 .8892 then 89% as R^2 max is 1.

The auditing division within a large accounting firm investigated the link between audit delay (Delay), the length of time from a company’s fiscal year-end to date of the auditor’s report, and variables that describe the client and the auditor. Some of the independent variables that were included in the study were:
Industry: A dummy variable coded 1 if the firm was an industrial company or 0 if the firm was a bank, savings and loan, or insurance company.
Public: A dummy variable coded 1 if the company was traded on an organized exchange or over the counter; otherwise coded 0.
Quality: A measure of overall quality of internal controls, as judged by the auditor, on a five-point scale ranging from “virtually none” (1) to “excellent” (5).
Finished: A measure ranging from 1 to 4, as judged by the auditor, where 1 indicates “all work performed subsequent to year-end” and 4 indicates “most work performed prior to year-end.”
The AuditDelay data file for 40 companies can be found here as CSV file, here as an SPSS file, or in the following directory: http://bit.ly/MSBA_Data.

    AuditDelay <- AuditDelayOriginal <- read_csv('https://www.dropbox.com/s/2796a0zvgek8yip/AuditDelay.csv?dl=1')

## 
## -- Column specification --------------------------------------------------------
## cols(
##   Delay = col_double(),
##   Industry = col_double(),
##   Public = col_double(),
##   Quality = col_double(),
##   Finished = col_double()
## )

    glimpse(AuditDelay)

## Rows: 40
## Columns: 5
## $ Delay    <dbl> 62, 45, 54, 71, 91, 62, 61, 69, 80, 52, 47, 65, 60, 81, 73...
## $ Industry <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1...
## $ Public   <dbl> 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Quality  <dbl> 3, 3, 2, 1, 1, 4, 3, 5, 1, 5, 3, 2, 1, 1, 2, 2, 5, 2, 1, 5...
## $ Finished <dbl> 1, 3, 2, 2, 1, 4, 2, 2, 1, 3, 2, 3, 3, 2, 2, 1, 4, 2, 2, 2...

Model.1 <- lm(Delay~1 + Industry + Public + Quality + Finished, data=AuditDelay)
summary(Model.1)

## 
## Call:
## lm(formula = Delay ~ 1 + Industry + Public + Quality + Finished, 
##     data = AuditDelay)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.8444  -6.8409   0.6387   7.7526  18.5409 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   80.429      5.916  13.595 1.57e-15 ***
## Industry      11.944      3.798   3.145  0.00338 ** 
## Public        -4.816      4.229  -1.139  0.26252    
## Quality       -2.624      1.184  -2.217  0.03324 *  
## Finished      -4.073      1.851  -2.200  0.03453 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.92 on 35 degrees of freedom
## Multiple R-squared:  0.3826, Adjusted R-squared:  0.312 
## F-statistic: 5.422 on 4 and 35 DF,  p-value: 0.001666

Develop the estimated regression equation using all of the independent variables. Answer: The answer is…. y=80.429 + 11.944Ind - 4.816Pub -2.624Qual - 4.073Fin

```r
Model.1 <- lm(Delay~1 + Industry + Public + Quality + Finished, data=AuditDelay)
summary(Model.1)
```

```
## 
## Call:
## lm(formula = Delay ~ 1 + Industry + Public + Quality + Finished, 
##     data = AuditDelay)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.8444  -6.8409   0.6387   7.7526  18.5409 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   80.429      5.916  13.595 1.57e-15 ***
## Industry      11.944      3.798   3.145  0.00338 ** 
## Public        -4.816      4.229  -1.139  0.26252    
## Quality       -2.624      1.184  -2.217  0.03324 *  
## Finished      -4.073      1.851  -2.200  0.03453 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.92 on 35 degrees of freedom
## Multiple R-squared:  0.3826, Adjusted R-squared:  0.312 
## F-statistic: 5.422 on 4 and 35 DF,  p-value: 0.001666
```

Develop a scatter diagram showing Delay as a function of Finished.

        # The answer is.... 
plot(AuditDelay$Finished, AuditDelay$Delay)

i. What does this figure indicate about the relationship between Delay and Finished?
Answer: The scatter plot indicated that there is a curvilinar relations between Delay and Finished.

Develop an estimated regression equation that can be used to predict Delay by using Industry and Quality. AnswerThe answer is….y=80.429 + 11.944Ind -2.624Qual

```r
    Model.2 <- lm(Delay~1 + Industry + Quality , data=AuditDelay)
    summary(Model.2)
```

```
## 
## Call:
## lm(formula = Delay ~ 1 + Industry + Quality, data = AuditDelay)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.6958  -7.9195   0.1632   8.9057  23.2851 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   70.634      4.558  15.498  < 2e-16 ***
## Industry      12.737      3.966   3.212  0.00273 ** 
## Quality       -2.919      1.238  -2.357  0.02383 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.56 on 37 degrees of freedom
## Multiple R-squared:  0.2689, Adjusted R-squared:  0.2293 
## F-statistic: 6.803 on 2 and 37 DF,  p-value: 0.003048
```

As a consultant for a movie theater, you have been asked to estimate weekly gross revenue as a function of advertising expenditures. Historical data for a sample of eight weeks during the summer in which advertising campaigns were used follow. The data file Showtime can be found here as CSV file, here as an SPSS file, or in the following directory: http://bit.ly/MSBA_Data. Hint: Be careful with the units of the data and the units of the question.

Showtime <- read_csv('https://www.dropbox.com/s/9w0zdr7899qcasz/Showtime.csv?dl=1')

## 
## -- Column specification --------------------------------------------------------
## cols(
##   WeeklyGross = col_double(),
##   TelevisionAds = col_double(),
##   NewspaperAds = col_double()
## )

summary(Showtime)

##   WeeklyGross    TelevisionAds    NewspaperAds  
##  Min.   :90.00   Min.   :2.000   Min.   :1.500  
##  1st Qu.:93.50   1st Qu.:2.500   1st Qu.:1.875  
##  Median :94.00   Median :3.000   Median :2.400  
##  Mean   :93.75   Mean   :3.188   Mean   :2.475  
##  3rd Qu.:95.00   3rd Qu.:3.625   3rd Qu.:2.700  
##  Max.   :96.00   Max.   :5.000   Max.   :4.200

Develop an estimated regression equation with the amount of television advertising as the independent variable.
```
    # The answer is.... 
    model.3 <- lm(WeeklyGross~1 +TelevisionAds, data=Showtime)
```
Answer y=88.638Wg + 1.604tva
Develop an estimated regression equation with both television advertising and newspaper advertising as the independent variables.
```
    # The answer is....
  model.4 <- lm(WeeklyGross~1 +TelevisionAds + NewspaperAds, data=Showtime)
```
Answer y=83.230wg + 2.290tva + 1.301npa
Is the estimated regression equation coefficient for television advertising expenditures the same in part (a) and in part (b)? Interpret the coefficient in each case.
Answer: The beauty of multiple regression is that all independent variables are considered simultaneously and their unique efforts on the dependent variable are quantified.

In Part A, revenue increases at 1.604 when television advertising is the independent variable. Alternatively, in Part B, television advertising increases at 2.290 when the newspaper advertising is included in the model.

anova(model.3, model.4)

What is the estimate of the weekly gross revenue for a week when $3,500 is spent on television advertising and $1800 is spent on newspaper advertising?
```
    # The answer is.... 
83.230 + (2.290*3500) + (1.301*1800)
```
```
## [1] 10440.03
```

Note that this question extends question 3 above (i.e., the Showtime data) and is thus based on the same data.

Use $\alpha$=.05 to test the null hypothesis of $H_0: \beta_1=\beta_2=0$ with the alternative hypothesis being $H_a$: $\beta_1$ and/or $\beta_2$ is not equal to zero. For the model $y=\beta_0+\beta_1x_1+\beta_2x_2+\epsilon$, where $x_1$=television advertising ($1,000s) and $x_2$=newspaper advertising ($1,000s)

summary(lm(WeeklyGross~1 +TelevisionAds + NewspaperAds, data=Showtime))

## 
## Call:
## lm(formula = WeeklyGross ~ 1 + TelevisionAds + NewspaperAds, 
##     data = Showtime)
## 
## Residuals:
##       1       2       3       4       5       6       7       8 
## -0.6325 -0.4124  0.6577 -0.2080  0.6061 -0.2380 -0.4197  0.6469 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    83.2301     1.5739  52.882 4.57e-08 ***
## TelevisionAds   2.2902     0.3041   7.532 0.000653 ***
## NewspaperAds    1.3010     0.3207   4.057 0.009761 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6426 on 5 degrees of freedom
## Multiple R-squared:  0.919,  Adjusted R-squared:  0.8866 
## F-statistic: 28.38 on 2 and 5 DF,  p-value: 0.001865

Answer H0: β1 = β2 = 0 Ha: β1 and/or β2 is not equal to zero Thus we reject the null hypothesis because Significance P-Value <.05. At least B1 or B2 is different than zero.

Use $\alpha$=.05 to test the significance of $\beta_1$. Should $x_1$ be dropped from the model? Why or why not?
Answer: H0: β1 = β2 = 0 Ha: β1!= 0 Thus, we reject the null hypothesis that x1(tva) should be removed with a P-value >.05 there fore we should not revmoe X1 from the model.
c. Use $\alpha$=.05 to test the significance of $\beta_2$. Should $x_2$ be dropped from the model? Why or why not?
Answer:
H0: β2 = β2 = 0 Ha: β2!= 0 Thus, we reject the null hypothesis that x2(np) should be removed with a p value <.05 therefore we should not remove x2 from the model.

From the following incomplete ANOVA table for a multiple regression analysis for 50 individuals. Fill in the appropriate values (a–g).

ANOVA_Table <- as.data.frame(
  rbind(
  c("Regression", "a", 4, 8, "b", "c"),
  c("Residual", 184, "d", "e", "", ""),
  c("Total", "f", "g", "", "", "")
  ))
colnames(ANOVA_Table) <- c("Source", "$SS$", "$df$", "$MS$", "$F$", "Significance $F$")
knitr::kable(ANOVA_Table, caption="ANOVA Source Table")

ANOVA Source Table
Source	$SS$	$df$	$MS$	$F$	Significance $F$
Regression	a	4	8	b	c
Residual	184	d	e
Total	f	g

n <- 50
df1 <- 4
p <- 4
MSRegression2 <- 8
SSResidual2 <- 184
df2 <- 50-4-1
SSRegression2 <- MSRegression2*p
SSRegression2

## [1] 32

MSResidual2 <- SSResidual2/df2
MSResidual2

## [1] 4.088889

F2 <- MSRegression2/MSResidual2
F2

## [1] 1.956522

SSTotal2 <- SSRegression2 + SSResidual2
SSTotal2

## [1] 216

PValue2 <- 1-pf(F2,df1,df2, lower.tail = TRUE)
PValue2

## [1] 0.1174935

SSRegresstion: 32
F: 1.956522
Pvalue: 0.1174935
df1 : 45
MSResidual: 4.088889
SSTotal: 216
df3: 49

Barron’s conducts an annual review of online brokers, including both brokers, who can be accessed via a Web browser, as well as direct-access brokers who connect customers directly with the broker’s network server. Each broker’s offerings and performance are evaluated in six areas, using a scale of 0–5 in each category (the higher the better). The results are weighted to obtain an overall score, and a final star rating, ranging from zero to five stars, is assigned to each broker. Trade execution, ease of use, and range of offerings are three of the areas evaluated. The following data show the point values for trade execution, ease of use, range of offerings, and the star rating for a sample of 10 of the online brokers that Barron’s evaluated. The Brokers data file can be found here as CSV file, here as an SPSS file, or in the following directory: http://bit.ly/MSBA_Data.

Determine the estimated regression equation (using all predictor variables) that can be used to predict the star rating given the point values for execution, ease of use, and range of offerings.

Brokers <- read_csv('https://www.dropbox.com/s/jfu08pc6lrg00dy/Brokers.csv?dl=1')

## 
## -- Column specification --------------------------------------------------------
## cols(
##   Broker = col_character(),
##   TradeEx = col_double(),
##   Use = col_double(),
##   Range = col_double(),
##   Rating = col_double()
## )

summary(Brokers)

##     Broker             TradeEx           Use            Range      
##  Length:10          Min.   :1.400   Min.   :2.500   Min.   :2.500  
##  Class :character   1st Qu.:2.275   1st Qu.:3.000   1st Qu.:3.125  
##  Mode  :character   Median :2.850   Median :3.500   Median :3.350  
##                     Mean   :2.940   Mean   :3.400   Mean   :3.610  
##                     3rd Qu.:3.625   3rd Qu.:3.675   3rd Qu.:4.150  
##                     Max.   :4.800   Max.   :4.500   Max.   :4.800  
##      Rating   
##  Min.   :2.0  
##  1st Qu.:3.0  
##  Median :3.5  
##  Mean   :3.2  
##  3rd Qu.:3.5  
##  Max.   :4.0

Model.5 <- lm(Rating~1 + TradeEx + Use + Range, data=Brokers)
Model.5

## 
## Call:
## lm(formula = Rating ~ 1 + TradeEx + Use + Range, data = Brokers)
## 
## Coefficients:
## (Intercept)      TradeEx          Use        Range  
##      0.3451       0.2548       0.1325       0.4585

    # The answer is.... y= ..34510rate +  0.25482TradeEx + 0.13249Use + 0.45852Range

Use the $F$-test to determine the overall significance of the relationship. What is the conclusion at the $\alpha$=.05 level of significance?

   F <- qf(.95, 3, 6,lower.tail = TRUE)
F

## [1] 4.757063

summary(lm(Rating~1 + TradeEx + Use + Range, data=Brokers))

## 
## Call:
## lm(formula = Rating ~ 1 + TradeEx + Use + Range, data = Brokers)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3251 -0.1171 -0.0599  0.1460  0.3366 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  0.34510    0.53067   0.650  0.53958   
## TradeEx      0.25482    0.08556   2.978  0.02469 * 
## Use          0.13249    0.14043   0.944  0.38185   
## Range        0.45852    0.12319   3.722  0.00983 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2431 on 6 degrees of freedom
## Multiple R-squared:  0.8856, Adjusted R-squared:  0.8284 
## F-statistic: 15.49 on 3 and 6 DF,  p-value: 0.00313

 **Answer**:  P-Value is  <.00313 because F Value is 15.49 with a Critical F of 2.75 it is a good fit and  statistically significant. 
 The AdjR^2 statistically significant at82% of the variance of the population. Therefore, due to the high R^2 we are doing a good job predicting the Y with the two R^2. Recalling R^2 metric from 0-1. Fitting really well at Adj R^2 .0.8284 then 83% as R^2 max is 1.

Use the $t$-test to determine the significance of each independent variable. What is your conclusion at the $\alpha$.05 level of significance?
Answer: Using the t-Test on the regression model we are able to check the significance of individual regression coefficients in the multiple linear regression model. Adding a significant variable to a regression model makes the model more effective, while adding an unimportant variable may make the model worse. By using the t-test we see that the independent variable of “Use” is not statistically significant with >.05 pvalue.

Remove any independent variable that is not significant from the estimated regression equation. What is your recommended estimated regression equation? Compare the $R^2$ with the value of $R^2$ from part (a). Discuss the differences.
Answer:

Model.6 <- lm(formula = Rating ~ 1 + TradeEx + Range, data = Brokers)
Model.6

## 
## Call:
## lm(formula = Rating ~ 1 + TradeEx + Range, data = Brokers)
## 
## Coefficients:
## (Intercept)      TradeEx        Range  
##      0.6718       0.2641       0.4853

summary(lm(formula = Rating ~ 1 + TradeEx + Range, data = Brokers))

## 
## Call:
## lm(formula = Rating ~ 1 + TradeEx + Range, data = Brokers)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.25470 -0.17414 -0.03772  0.16976  0.37492 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  0.67184    0.39892   1.684  0.13603   
## TradeEx      0.26406    0.08432   3.131  0.01658 * 
## Range        0.48527    0.11893   4.080  0.00469 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2412 on 7 degrees of freedom
## Multiple R-squared:  0.8686, Adjusted R-squared:  0.8311 
## F-statistic: 23.15 on 2 and 7 DF,  p-value: 0.0008214

Recommended Regression Equation : y= 0.67184rate + 0.26406TradeEx + 0.48527Range

The R^2 value in part a is .828 and when “Use” as the independent variable is removed the R^2 is .86. The R^2 evaluates the scatter of the data points around the fitted regression line. The second model has a higher value representing smaller differences between the observed data and the fitted values.

P-Value is less than <.001 because F Value is 23.15. Overall model is <.001. Yes, it is a good fit. The AdjR^2 statistically significant at 86% of the variance of hte population. Therefore, due to the high R^2 we are doing a good job predicting the Y with the two R^2. Recalling R^2 metric from 0-1. Fitting really well at Adj R^2 0.8686 then 86% as R^2 max is 1.

Consider multicolinearity in the context of regression.
1. In your own words, what does multicollinearity mean?
  Answer: Multicollinearity is the occurrence of high intercorrelations among two or more independent variables in a multiple regression model.

In your own words, what are the implications of multicollinearity?
Answer: Multicollinearity among independent variables will result in less reliable statistical inferences. This is because multicollinearity can lead to wider confidence intervals.

Understanding the factors that contribute to housing values is an important consideration of investment firms interested in the residential market. The Mishawaka residential real estate market was evaluated as “opportunistic” by an investment firm. To better understand the Mishawaka market, the firm seeks to model the asking price a house by various characteristics of the house. This group hires you as a consultant to help them understand the market as it exists today, in hopes of capitalizing on changes in the marketplace in the future. The data set is available for download here as CSV file, here as an SPSS file, or in the following http://bit.ly/MSBA_Data directory.

Step 1: Load the data

Mishawaka_Asking_Original <- read_csv('https://www.dropbox.com/sh/4wm9mff88ium0wv/AABgY6zWFMOt9wd7FIj3pFGIa/Mishawaka_Asking.csv?dl=0')

## 
## -- Column specification --------------------------------------------------------
## cols(
##   `<!DOCTYPE html><html class="maestro" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml"><head><script nonce="6M64r1pnBU1QT0cpJmI2">` = col_character()
## )

## Warning: 4625 parsing failures.
## row col  expected     actual                                                                                             file
##   2  -- 1 columns 2 columns  'https://www.dropbox.com/sh/4wm9mff88ium0wv/AABgY6zWFMOt9wd7FIj3pFGIa/Mishawaka_Asking.csv?dl=0'
##   7  -- 1 columns 2 columns  'https://www.dropbox.com/sh/4wm9mff88ium0wv/AABgY6zWFMOt9wd7FIj3pFGIa/Mishawaka_Asking.csv?dl=0'
##  11  -- 1 columns 2 columns  'https://www.dropbox.com/sh/4wm9mff88ium0wv/AABgY6zWFMOt9wd7FIj3pFGIa/Mishawaka_Asking.csv?dl=0'
##  12  -- 1 columns 3 columns  'https://www.dropbox.com/sh/4wm9mff88ium0wv/AABgY6zWFMOt9wd7FIj3pFGIa/Mishawaka_Asking.csv?dl=0'
##  13  -- 1 columns 23 columns 'https://www.dropbox.com/sh/4wm9mff88ium0wv/AABgY6zWFMOt9wd7FIj3pFGIa/Mishawaka_Asking.csv?dl=0'
## ... ... ......... .......... ................................................................................................
## See problems(...) for more details.

Parsed with columns specifications

cols(
 MLS = col_double(),
 Asking = col_double(),
 Square_Footage = col_double(),
 Proportion_Brick_or_Stone = col_double(),
 Baths = col_double(),
 Bedrooms = col_double(),
 Bed_and_Bathrooms = col_double(),
 Garage = col_double(),
 Proportion_Granite = col_double(),
 Proportion_Hardwood = col_double(),
 Basement = col_double(),
 Basement_Finished = col_double(),
 Stories2 = col_double(),
 Lot_Size = col_double(),
 Acreage = col_double(),
 Culdasac = col_double(),
 Fireplaces = col_double(),
 Year_Built = col_double()
 )

## cols(
##   MLS = col_double(),
##   Asking = col_double(),
##   Square_Footage = col_double(),
##   Proportion_Brick_or_Stone = col_double(),
##   Baths = col_double(),
##   Bedrooms = col_double(),
##   Bed_and_Bathrooms = col_double(),
##   Garage = col_double(),
##   Proportion_Granite = col_double(),
##   Proportion_Hardwood = col_double(),
##   Basement = col_double(),
##   Basement_Finished = col_double(),
##   Stories2 = col_double(),
##   Lot_Size = col_double(),
##   Acreage = col_double(),
##   Culdasac = col_double(),
##   Fireplaces = col_double(),
##   Year_Built = col_double()
## )

Step 2:

glimpse(Mishawaka_Asking_Original)

## Rows: 140
## Columns: 1
## $ `<!DOCTYPE html><html class="maestro" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml"><head><script nonce="6M64r1pnBU1QT0cpJmI2">` <chr> ...

What are some relevant descriptive statistics of the data?
Answer: summary(data)

b. What are some relevant visualizations that help one understand the data?\        **Answer**:         
a. The data set contains, among other things, the variables Baths (total number of bathrooms), Bedrooms (total number of bedrooms), and another variable, Bed_and_Bathrooms (which is the sum of Bath and Bedrooms). Explain why including (a) Baths, (b) Bedrooms, and (c) Bed_and_Bathrooms would or would not be a problem in a regression model (i.e., a model that includes all three variables?  
      **Answer**:    
a. Fit a model in which the only predictors are Bed_and_Bathrooms (variable 1) and Square_Footage (variable 2). Provided the estimated regression equation.  
    
    ```r
    # The answer is.... 
    ```
a. Remove MLS=1575935 and rerun the model above (i.e., with Bed and Bathrooms and Square Footage).  
    
    ```r
    # The answer is.... 
    ```
  a. Describe what *major* change happened in the output from d and e above. Explain why.  
      **Answer**:         
  a. Describe what major change happened in the output from d and e above. Explain why.  
      **Answer**:         
  a. Again with MLS=1575935 removed, fit a model with Square Footage, Year Built, Basement Finished, and Lot Size and save the unstandardized residuals and the unstandardized predicted values. Provide a visual assessment and comments on the normality of errors assumption.  
        **Answer**:         
a. For the model and data in part h, provide comments and a visual assessment of the homoscedasticity assumption. Do this by plotting the saved values (with the unstandardized residuals on the ordinate [$y$-axis] and the unstandardized predicted values on the abscissa [$x$-axis]).  
        **Answer**:         
a. For the model and data in part h, remove all houses that have an Asking price of \$225,000 or above. Now, like h, provide a visual assessment and comments on the normality of errors assumption.  
    **Answer**:       
a. For the model and data in part j, provide comments and a visual assessment of the homoscedasticity assumption as you did in part i. Again, do this by plotting the saved values (with the unstandardized residuals on the ordinate [$y$-axis] and the unstandardized predicted values on the abscissa [$x$-axis]).  
    **Answer**:         
a Does this data (i.e., no houses at or above \$225,000 [as in j and k]) satisfy the assumptions any better or any worse than the data that excluded only the house with an asking price of \$2,500,000 (as in h and i)?  
    **Answer**:         
a. What are the implications and are there any cautions to consider by conditioning the analysis to houses that are less than \$225,000?  
    **Answer**:         
a. Provide a matrix plot for Asking, Square Footage, Year Built, Basement Finished, and Lot Size.  
    
    ```r
    # The answer is.... 
    ```
a. Again using the data for houses less than \$225,000 as a basis of your model, consider a house in the area that has the following properties: `Square Footage=1,800`, `Year Built=1996`, `Basement Finished=1`, and `Lot Size=43,560`.  
    i. What is the optimal predicted value for the Asking Price?  
        
        ```r
        # The answer is.... 
        ```
    i. What is the 95% prediction interval for the mean for houses like this?
        
        ```r
        # The answer is.... 
        ```
    i. What is the 95% prediction interval for such a house you are considering purchasing?  
        
        ```r
        # The answer is.... 
        ```

In the last assignment you used the FairValue data to find three correlation coefficients. Take the FairValue data and now apply a multiple regression model in which FairValue is modeled from SharePrice and EarningsPerShare. The Fair Value data set is available here.
1. Overall, does the model provide any inferential evidence that is accounts for more variance in $y$ (the dependent variable) than would have been expected by chance alone? Describe how you know this?
  Answer:
2. What is a descriptive summary of the overall fit of the model (specially how much variance in $y$ do you estimate that the model accounts for in the population?
  Answer:
3. Describe, separately, how well each of the regressors “worked” or “did not work” in modeling the outcome variable.
  Answer:
4. Describe, in words, what is an overall managerial summary of the model and the implications of the model that would be relevant to the top-management team (who are interested in the findings).
  Answer:
5. Provide output in a way that would be reasonable in a report to the top-management team (do not include the raw data).
```
# The answer is.... 
```
Here is a college football data set on bowl game. The code below downloads the Excel data set and then uses the readxl (read Excel) package to load into the R workspace.
```
Bowl_Game <- openxlsx::read.xlsx("http://www.dropbox.com/s/yw2wj8htfew7i5i/Bowl_Game_Spreads.xlsx?dl=1")
# tibble::glimpse(Bowl_Game)
# summary(Bowl_Game)
```
Fit a simple or multiple regression model of your choice and summarize your findings.
```
 ```r
 # The answer is.... 
 ```
```
In the last assignment you fitted multiple simple regressions. Here, fit the multiple regression model with all of the regressors
1. Fit multiple models in a model building framework.
```
    # The answer is.... 
```
1. Provide one or more model comparisons (i.e., with $F$-tests).
```
    # The answer is.... 
```
1. From the various models, pick the one that you think is the best of the alteratives to use for purposes of a managerial decision. State why you choose this model using formal statistical reasoning.
  Answer:
2. How does this model seem better – or does it – over multiple single regressor models?
3. Can causality be legitimately claimed – why or why not.
  Answer:
4. Provide an overall summary of what you learned from applying multiple regression to the data from a statistical perspective.
  Answer:
5. What is some managerial guidance you would advise based on the data and analysis?
  Answer:

Notre Dame’s MSBA Program

Ken Kelley’s Statistics for Managerial Decision Making