Multiple regression

Exercises:

Exercise 1:

Use the commands colnames and head to see which variables are included in the data set.

colnames(houses)

##  [1] "House.Price..USD." "HP.in.thousands"   "House.Size"       
##  [4] "Acres"             "Lot.Size"          "Bedrooms"         
##  [7] "T.Bath"            "Age"               "Garage"           
## [10] "Condition"         "Age.Category"

head(houses)

##   House.Price..USD. HP.in.thousands House.Size Acres Lot.Size Bedrooms T.Bath
## 1            232500           232.5       1679  0.23  10018.8        3    1.5
## 2            470000           470.0       4494  0.52  22651.2        5    4.0
## 3            150000           150.0       2542  0.11   4791.6        4    0.0
## 4            167500           167.5       1094  0.18   7840.8        2    1.0
## 5            210000           210.0       1838  0.19   8276.4        4    2.0
## 6            522000           522.0       4156  0.22   9583.2        3    3.5
##   Age Garage Condition Age.Category
## 1  35      1         0            M
## 2  38      1         0            M
## 3   5      1         0            N
## 4  65      0         0            O
## 5  33      1         0            M
## 6   3      1         0            N

Exercise 2:

Use lm to fit a multiple regression model with selling price as the response variable and with house size, number of bedrooms, and number of bathrooms as the explanatory variables. Use summary to see the usual R regression output.

model <- lm(House.Price..USD. ~ House.Size + Bedrooms + T.Bath, data = houses)
summary(model)

## 
## Call:
## lm(formula = House.Price..USD. ~ House.Size + Bedrooms + T.Bath, 
##     data = houses)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -291667  -39942   -2685   33576  426793 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39000.83   17396.58   2.242   0.0261 *  
## House.Size     53.21       4.63  11.493  < 2e-16 ***
## Bedrooms    -7885.09    6130.95  -1.286   0.1999    
## T.Bath      57796.09    9293.06   6.219 2.96e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 73540 on 196 degrees of freedom
## Multiple R-squared:  0.6028, Adjusted R-squared:  0.5967 
## F-statistic: 99.14 on 3 and 196 DF,  p-value: < 2.2e-16

Exercise 3:

For fixed number of bedrooms, how much is the house selling price predicted to increase for each square foot increase in house size? Why?

Answer: Holding the number of bedrooms and bathrooms fixed, the predicted selling price increases by $53.21 for each additional square foot of house size. This is because the House.Size coefficient in a multiple regression is a partial slope: it gives the change in predicted price per unit change in house size with the other predictors held constant.

Exercise 4:

For a fixed house size of 2,000 square feet, how does the predicted selling price change for two, three, and four bedrooms?

# Prediction equation with House.Size fixed at 2000:
# 39000.83 + 53.21 * 2000 = 145420.83
# Predicted = 145420.83 - 7885.09 * Bedrooms + 57796.09 * T.Bath

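Answer: With house size fixed at 2,000 square feet and the number of bathrooms held fixed, the predicted selling price changes by the Bedrooms coefficient for each additional bedroom. Taking two bedrooms as the baseline, three bedrooms lowers the predicted price by $7,885.09 and four bedrooms lowers it by another $7,885.09, for a total decrease of $15,770.18.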
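The arithmetic in this answer can be sketched in R using the rounded coefficients from the summary output in (2). The exercise does not fix the number of bathrooms, so the value of 2 used below is an assumption for illustration only, and predict_price is a hypothetical helper name. The differences between bedroom counts do not depend on the bathroom value chosen.

```r
# Coefficients copied (rounded) from summary(model) in Exercise 2
b0 <- 39000.83; b_size <- 53.21; b_bed <- -7885.09; b_bath <- 57796.09

# Hypothetical helper: predicted price from the fitted equation
predict_price <- function(size, bedrooms, baths) {
  b0 + b_size * size + b_bed * bedrooms + b_bath * baths
}

# House size fixed at 2000 sq ft; baths = 2 is an assumed value
pred <- predict_price(size = 2000, bedrooms = 2:4, baths = 2)
names(pred) <- paste(2:4, "bedrooms")
pred
diff(pred)  # each extra bedroom changes the prediction by b_bed
```

With the fitted object available, predict(model, newdata = data.frame(House.Size = 2000, Bedrooms = 2:4, T.Bath = 2)) should give the same values up to rounding of the coefficients.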

Exercise 5:

Interpret the value of the multiple correlation coefficient.

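Answer: The multiple correlation coefficient R is the correlation between the observed values of the response variable and the values predicted by the model. Here R2 = 0.6028, so R = sqrt(0.6028), which is about 0.776, indicating a fairly strong association between observed and predicted selling prices. Equivalently, about 60.28% of the variability in selling prices is explained by house size, number of bedrooms, and number of bathrooms.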
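As a small check, the multiple correlation can be recovered from the reported Multiple R-squared. This sketch only uses the rounded value from the summary output, so the result is approximate.

```r
# Multiple R-squared copied from summary(model)
r_squared <- 0.6028

# The multiple correlation is the non-negative square root of R-squared
R <- sqrt(r_squared)
round(R, 3)
```

If the data are loaded, cor(fitted(model), houses$House.Price..USD.) should give essentially the same number, since R is the correlation between observed and fitted values.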

Exercise 6:

Suppose the variable house selling prices House.Price..USD. are changed from dollars to thousands of dollars resulting in HP.in.thousands. Calculate a new prediction equation using HP.in.thousands as the response. Compare the new regression coefficients and multiple correlation coefficient to the original fit from (2). Which of these change and why?

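# Divide each coefficient from (2) by 1000 to convert dollars to thousands of dollars
# HP.in.thousands = 39.00083 + 0.05321 * House.Size - 7.88509 * Bedrooms + 57.79609 * T.Bath

Answer: Every regression coefficient (and its standard error) is divided by 1,000 because the response is now measured in thousands of dollars. The multiple correlation coefficient and R2 do not change, because correlation does not depend on the units of measurement.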
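The effect of rescaling the response can be illustrated on simulated data. The sketch below does not use the houses data set: the variables size, beds, and price are made up for illustration, and only the general behavior (coefficients scale by 1/1000, R-squared stays the same) carries over.

```r
set.seed(1)  # simulated data, not the houses data set
n <- 100
size <- runif(n, 1000, 4000)
beds <- sample(2:5, n, replace = TRUE)
price <- 40000 + 50 * size - 8000 * beds + rnorm(n, sd = 50000)
sim <- data.frame(price, price_k = price / 1000, size, beds)

fit_usd <- lm(price ~ size + beds, data = sim)    # response in dollars
fit_k   <- lm(price_k ~ size + beds, data = sim)  # response in thousands

# Coefficients scale by 1/1000
all.equal(coef(fit_k), coef(fit_usd) / 1000)
# R-squared is unchanged
all.equal(summary(fit_k)$r.squared, summary(fit_usd)$r.squared)
```

The same comparison could be run on the real data by fitting lm(HP.in.thousands ~ House.Size + Bedrooms + T.Bath, data = houses) and comparing it with the fit from (2).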

Exercise 7:

Go back to the original fit from (2). Report the F statistic and state the hypotheses to which it refers. Report its P-value and interpret. Why is it not surprising to get a small P-value for this test?

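Answer: The F statistic is 99.14 on 3 and 196 degrees of freedom. It tests H0: β1 = β2 = β3 = 0 (none of the explanatory variables is associated with selling price) against Ha: at least one βj ≠ 0. The P-value is less than 2.2e-16, which is very small, so we reject H0 and conclude that at least one of the explanatory variables is associated with selling price. This is not surprising, since we would expect at least one of house size, number of bedrooms, and number of bathrooms to be related to selling price.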
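The reported F statistic and P-value can be checked from the summary output alone. The sketch below uses the rounded values printed in (2), so the F statistic recomputed from R-squared agrees only up to rounding.

```r
# Values copied from summary(model)
f_stat <- 99.14; df1 <- 3; df2 <- 196
r_squared <- 0.6028

# Upper-tail probability of the F distribution
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)
p_value < 2.2e-16

# The overall F statistic can also be written in terms of R-squared
f_from_r2 <- (r_squared / df1) / ((1 - r_squared) / df2)
f_from_r2
```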

Exercise 8:

Report and interpret the t statistic and P-value for testing H0 : β2 = 0 against Ha : β2 ≠ 0.

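Answer: The t statistic is -1.286, which means the estimated Bedrooms coefficient is about 1.286 standard errors below zero. The P-value of 0.1999 is larger than 0.05, so we fail to reject H0. There is not enough evidence that the number of bedrooms is associated with selling price once house size and number of bathrooms are in the model.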
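The t statistic and two-sided P-value can be reproduced from the Bedrooms row of the coefficient table. This is a sketch using the rounded estimate and standard error from (2), so the results match the printed output only up to rounding.

```r
# Bedrooms estimate and standard error copied from summary(model)
est <- -7885.09; se <- 6130.95; df_resid <- 196

t_stat <- est / se                               # estimate in standard-error units
p_value <- 2 * pt(-abs(t_stat), df = df_resid)   # two-sided P-value

round(c(t = t_stat, p = p_value), 4)
```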

Exercise 9:

Construct a 95% confidence interval for β2 and interpret. This inference is more informative than the test in (8). Explain why.

confint(model, level = 0.95)

##                    2.5 %      97.5 %
## (Intercept)   4692.30956 73309.34597
## House.Size      44.07971    62.34086
## Bedrooms    -19976.18630  4206.00664
## T.Bath       39468.87156 76123.30495

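Answer: The 95% confidence interval for β2 (the Bedrooms coefficient) is (-19976.19, 4206.01). We are 95% confident that, for fixed house size and number of bathrooms, each additional bedroom changes the mean selling price by somewhere between a decrease of $19,976 and an increase of $4,206. The interval contains 0, which agrees with the test in (8). The t test only tells us whether there is enough evidence to reject H0, whereas the confidence interval shows the whole range of plausible values for β2, including the size and direction of the effect.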
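The Bedrooms row of the confint output can be reconstructed by hand as estimate plus or minus a t critical value times the standard error. The sketch below uses the rounded values from the coefficient table in (2), so it matches the confint output only approximately.

```r
# Bedrooms estimate and standard error copied from summary(model)
est <- -7885.09; se <- 6130.95; df_resid <- 196

t_crit <- qt(0.975, df = df_resid)     # critical value for a 95% interval
ci <- est + c(-1, 1) * t_crit * se     # lower and upper limits
ci
```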

Exercise 10:

Produce a histogram of the standardized residuals. What assumption does this check? What do you conclude from the plot?

std_res <- rstandard(model)
hist(std_res,
     main = "Standardized Residuals Histogram",
     xlab = "Standardized Residuals",
     col = "grey",
     breaks = 20)

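Answer: The histogram checks the assumption that the regression errors are normally distributed. The distribution of the standardized residuals looks approximately normal, so this assumption appears reasonable.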

Exercise 11:

Use plot on your original fit from (2), which will produce four default plots. Choose one of these plots and explain what regression assumption(s) it can check. What do you conclude from the plot?

plot(model)

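Answer: The Normal Q-Q plot checks normality of the residuals, which appear approximately normal. The Residuals vs Fitted plot checks the constant variance assumption, which does not appear to be met because the spread of the residuals is not the same across the range of fitted values.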

Exercise 12:

Fit a new multiple regression model with House.Price..USD. as the response variable and with House.Size and Condition as the explanatory variables. Use summary to see the usual R regression output. Report the regression equation. Find and interpret the separate lines relating predicted selling price to house size for good condition homes and for homes in not good condition.

Exercise 13:

Plot how selling price varies as a function of house size for homes in good condition and for homes in not good condition.

Exercise 14:

Estimate the difference, using an appropriate confidence interval, between the mean selling price of homes in good and in not good condition, controlling for house size.

model2 <- lm(House.Price..USD. ~ House.Size + Condition, data = houses)

summary(model2)

## 
## Call:
## lm(formula = House.Price..USD. ~ House.Size + Condition, data = houses)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -312024  -33585    -852   28105  382876 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 96270.971  13464.912   7.150 1.66e-11 ***
## House.Size     66.463      4.682  14.196  < 2e-16 ***
## Condition   12926.940  17196.712   0.752    0.453    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 81790 on 197 degrees of freedom
## Multiple R-squared:  0.5062, Adjusted R-squared:  0.5012 
## F-statistic:   101 on 2 and 197 DF,  p-value: < 2.2e-16
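From this output, the regression equation is Predicted price = 96270.971 + 66.463 * House.Size + 12926.940 * Condition. The separate lines asked for in (12) and the plot asked for in (13) can be sketched from these coefficients. The sketch assumes that Condition = 1 codes good condition and Condition = 0 codes not good condition; that coding is not stated in the output, so it should be checked against the data documentation. The plotting range of 1000 to 5000 square feet is also an arbitrary choice.

```r
# Coefficients copied from summary(model2)
b0 <- 96270.971; b_size <- 66.463; b_cond <- 12926.940

# Assumption: Condition = 1 means good condition, 0 means not good
line_not_good <- c(intercept = b0, slope = b_size)
line_good     <- c(intercept = b0 + b_cond, slope = b_size)
rbind(not_good = line_not_good, good = line_good)

# Two parallel lines: same slope, intercepts differ by b_cond
sizes <- c(1000, 5000)
y_good     <- line_good["intercept"] + line_good["slope"] * sizes
y_not_good <- line_not_good["intercept"] + line_not_good["slope"] * sizes
plot(sizes, y_good, type = "l", ylim = range(y_good, y_not_good),
     xlab = "House size (sq ft)", ylab = "Predicted price (USD)")
lines(sizes, y_not_good, lty = 2)
legend("topleft", legend = c("Good", "Not good"), lty = c(1, 2))
```

Because the model has no interaction term, the two lines are parallel: the predicted price rises by $66.463 per square foot in both groups, and the lines differ only by the Condition coefficient.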

# Condition is a numeric 0/1 variable, so its coefficient is named "Condition", not "ConditionGood"
confint(model2, "Condition", level = 0.95)
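Because Condition enters the model as a numeric variable, its coefficient is labelled Condition in the summary output, and the interval for (14) can be computed directly from that row. The sketch below uses the printed estimate and standard error, and again assumes that Condition = 1 codes good condition.

```r
# Condition estimate and standard error copied from summary(model2)
est <- 12926.940; se <- 17196.712; df_resid <- 197

# 95% confidence interval: estimate +/- t critical value * standard error
ci <- est + c(-1, 1) * qt(0.975, df = df_resid) * se
ci
```

Under that assumption, the interval runs from roughly -$21,000 to roughly $46,800. It contains 0, so, controlling for house size, the data do not show a clear difference in mean selling price between homes in good and in not good condition, which is consistent with the P-value of 0.453 for Condition.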