Use the commands colnames and head to see which variables are included in the data set.
## [1] "House.Price..USD." "HP.in.thousands" "House.Size"
## [4] "Acres" "Lot.Size" "Bedrooms"
## [7] "T.Bath" "Age" "Garage"
## [10] "Condition" "Age.Category"
##   House.Price..USD. HP.in.thousands House.Size Acres Lot.Size Bedrooms T.Bath
## 1            232500           232.5       1679  0.23  10018.8        3    1.5
## 2            470000           470.0       4494  0.52  22651.2        5    4.0
## 3            150000           150.0       2542  0.11   4791.6        4    0.0
## 4            167500           167.5       1094  0.18   7840.8        2    1.0
## 5            210000           210.0       1838  0.19   8276.4        4    2.0
## 6            522000           522.0       4156  0.22   9583.2        3    3.5
##   Age Garage Condition Age.Category
## 1  35      1         0            M
## 2  38      1         0            M
## 3   5      1         0            N
## 4  65      0         0            O
## 5  33      1         0            M
## 6   3      1         0            N
Use lm to fit a multiple regression model with selling price as the response variable and with house size, number of bedrooms, and number of bathrooms as the explanatory variables. Use summary to see the usual R regression output.
##
## Call:
## lm(formula = House.Price..USD. ~ House.Size + Bedrooms + T.Bath,
## data = houses)
##
## Residuals:
## Min 1Q Median 3Q Max
## -291667 -39942 -2685 33576 426793
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39000.83 17396.58 2.242 0.0261 *
## House.Size 53.21 4.63 11.493 < 2e-16 ***
## Bedrooms -7885.09 6130.95 -1.286 0.1999
## T.Bath 57796.09 9293.06 6.219 2.96e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 73540 on 196 degrees of freedom
## Multiple R-squared: 0.6028, Adjusted R-squared: 0.5967
## F-statistic: 99.14 on 3 and 196 DF, p-value: < 2.2e-16
For fixed number of bedrooms, how much is the house selling price predicted to increase for each square foot increase in house size? Why?
Answer: The predicted selling price increases by $53.21 for each additional square foot of house size. The House.Size coefficient is a partial slope: it gives the change in predicted price per square foot when the number of bedrooms and the number of bathrooms are held fixed.
For a fixed house size of 2,000 square feet, how does the predicted selling price change for two, three, and four bedrooms?
## [1] -7885.09
## [1] 57796.09
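Answer: With house size (and number of bathrooms) held fixed, the predicted selling price changes by the Bedrooms coefficient for each additional bedroom. Relative to two bedrooms, the prediction for three bedrooms is $7,885 lower, and the prediction for four bedrooms is a further $7,885 lower, for a total decrease of $15,770.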
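As a rough check, the predictions can be computed directly from the coefficients reported in the summary output. This is only a sketch: the helper `predict_price` and the value `baths_fixed <- 2` are assumptions introduced for illustration, since the question does not specify the number of bathrooms.

```r
# Coefficients copied from the summary output of the first model
b0     <- 39000.83   # intercept
b_size <- 53.21      # House.Size
b_bed  <- -7885.09   # Bedrooms
b_bath <- 57796.09   # T.Bath

# Hypothetical helper: predicted price from the fitted equation
predict_price <- function(size, bedrooms, baths) {
  b0 + b_size * size + b_bed * bedrooms + b_bath * baths
}

baths_fixed <- 2     # assumption: any fixed value gives the same differences
bedrooms    <- 2:4
pred        <- predict_price(2000, bedrooms, baths_fixed)

print(data.frame(bedrooms, pred))
print(diff(pred))    # change in prediction per additional bedroom
```

Whatever value is chosen for `baths_fixed`, the differences between consecutive predictions equal the Bedrooms coefficient.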
Interpret the value of the multiple correlation coefficient.
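Answer: The multiple correlation coefficient R is the correlation between the observed values of the response variable and the values predicted by the model. Here R2 = 0.6028, so R = sqrt(0.6028) ≈ 0.776, which indicates a fairly strong positive association between observed and predicted selling prices. Equivalently, the model explains about 60.28% of the variability in house selling prices.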
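The multiple correlation coefficient is the square root of the Multiple R-squared value in the summary output. A minimal sketch, using the reported value rather than the fitted model object:

```r
# Multiple R-squared copied from the summary output
r_squared  <- 0.6028

# Multiple correlation coefficient: correlation between observed and fitted values
multiple_r <- sqrt(r_squared)
print(round(multiple_r, 3))
```

With the fitted model available, the same quantity could also be obtained as the correlation between the observed prices and the fitted values.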
Suppose the variable house selling prices House.Price..USD. are changed from dollars to thousands of dollars resulting in HP.in.thousands. Calculate a new prediction equation using HP.in.thousands as the response. Compare the new regression coefficients and multiple correlation coefficient to the original fit from (2). Which of these change and why?
## [1] 0.05321
## [1] -7.88509
## [1] 57.79609
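The three values above are the original coefficients divided by 1000 (for example, 53.21 becomes 0.05321). The sketch below illustrates why on synthetic data, not the housing data: dividing the response by 1000 divides every coefficient by 1000 and leaves R-squared unchanged. All variable names and simulated values here are made up for illustration.

```r
# Synthetic data (an assumption for illustration, not the houses data set)
set.seed(1)
n     <- 50
size  <- runif(n, 1000, 4000)
price <- 40000 + 50 * size + rnorm(n, sd = 20000)

# Fit the same model with the response in dollars and in thousands of dollars
fit_usd <- lm(price ~ size)
price_k <- price / 1000
fit_k   <- lm(price_k ~ size)

# Each coefficient in dollars is 1000 times the coefficient in thousands
ratio  <- coef(fit_usd) / coef(fit_k)
r2_usd <- summary(fit_usd)$r.squared
r2_k   <- summary(fit_k)$r.squared

print(ratio)
print(c(r2_usd, r2_k))   # identical: correlation does not depend on units
```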
Go back to the original fit from (2). Report the F statistic and state the hypotheses to which it refers. Report its P-value and interpret. Why is it not surprising to get a small P-value for this test?
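Answer: The F statistic is 99.14 on 3 and 196 degrees of freedom. It tests H0: β1 = β2 = β3 = 0 (none of the explanatory variables is related to selling price) against Ha: at least one βj ≠ 0. The P-value is less than 2.2e-16, which is very small, so we reject H0 and conclude that at least one of the explanatory variables is associated with selling price. This is not surprising, since house size, number of bedrooms, and number of bathrooms are all characteristics we would expect to be related to price.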
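The F statistic can be recovered, up to rounding, from R-squared and the degrees of freedom in the summary output. A short sketch using the reported values:

```r
# Values copied from the summary output of the first model
r_squared <- 0.6028
k         <- 3     # number of explanatory variables
df_resid  <- 196   # residual degrees of freedom

# F = (R^2 / k) / ((1 - R^2) / df_resid)
f_stat  <- (r_squared / k) / ((1 - r_squared) / df_resid)

# Upper-tail probability under the F distribution with (k, df_resid) df
p_value <- pf(f_stat, k, df_resid, lower.tail = FALSE)

print(f_stat)
print(p_value)
```

Because R-squared is rounded to four decimals, `f_stat` agrees with the reported 99.14 only approximately.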
Report and interpret the t statistic and P-value for testing H0 : β2 = 0 against Ha : β2 ≠ 0.
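Answer: The t statistic is -1.286, which means the estimated Bedrooms coefficient is about 1.286 standard errors below zero. The P-value of 0.1999 is larger than 0.05, so we fail to reject H0: after accounting for house size and number of bathrooms, there is no significant evidence that number of bedrooms is related to selling price.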
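The t statistic and its two-sided P-value can be reproduced from the Bedrooms row of the coefficient table. A minimal sketch, using the reported estimate and standard error:

```r
# Bedrooms row of the coefficient table
estimate  <- -7885.09
std_error <- 6130.95
df_resid  <- 196

# t statistic: estimate divided by its standard error
t_stat  <- estimate / std_error

# Two-sided P-value from the t distribution with 196 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df_resid)

print(t_stat)
print(p_value)
```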
Construct a 95% confidence interval for β2 and interpret. This inference is more informative than the test in (8). Explain why.
## 2.5 % 97.5 %
## (Intercept) 4692.30956 73309.34597
## House.Size 44.07971 62.34086
## Bedrooms -19976.18630 4206.00664
## T.Bath 39468.87156 76123.30495
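Answer: The 95% confidence interval for β2 is (-19976.19, 4206.01). Holding house size and number of bathrooms fixed, each additional bedroom changes the mean selling price by somewhere between a decrease of about $19,976 and an increase of about $4,206. The interval contains 0, which agrees with the test result. The t test only tells us whether there is enough evidence to reject H0, whereas the confidence interval also shows the plausible size and direction of the effect.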
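The Bedrooms interval can be checked by hand as estimate ± t critical value × standard error. A short sketch using the values from the coefficient table:

```r
# Bedrooms row of the coefficient table
estimate  <- -7885.09
std_error <- 6130.95
df_resid  <- 196

# Critical value for a 95% interval
t_crit <- qt(0.975, df_resid)

# Lower and upper limits of the interval
ci <- estimate + c(-1, 1) * t_crit * std_error
print(ci)
```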
Produce a histogram of the standardized residuals. What assumption does this check? What do you conclude from the plot?
std_res <- rstandard(model)
hist(std_res,
     main = "Standardized Residuals Histogram",
     xlab = "Standardized Residuals",
     col = "grey",
     breaks = 20)
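Answer: The histogram checks the assumption that the errors are normally distributed. The distribution of the standardized residuals looks approximately normal.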
Use plot on your original fit from (2), which will produce four default plots. Choose one of these plots and explain what regression assumption(s) it can check. What do you conclude from the plot?
Answer: The residuals are approximately normal. The constant variability condition does not appear to be met, since the spread of the residuals does not appear to be the same for all data points.
Fit a new multiple regression model with House.Price..USD. as the response variable and with House.Size and Condition as the explanatory variables. Use summary to see the usual R regression output. Report the regression equation. Find and interpret the separate lines relating predicted selling price to house size for good condition homes and for homes in not good condition.
Plot how selling price varies as a function of house size for homes in good condition and for homes in not good condition.
Estimate the difference, using an appropriate confidence interval, between the mean selling price of homes in good and in not good condition, controlling for house size.
##
## Call:
## lm(formula = House.Price..USD. ~ House.Size + Condition, data = houses)
##
## Residuals:
## Min 1Q Median 3Q Max
## -312024 -33585 -852 28105 382876
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 96270.971 13464.912 7.150 1.66e-11 ***
## House.Size 66.463 4.682 14.196 < 2e-16 ***
## Condition 12926.940 17196.712 0.752 0.453
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 81790 on 197 degrees of freedom
## Multiple R-squared: 0.5062, Adjusted R-squared: 0.5012
## F-statistic: 101 on 2 and 197 DF, p-value: < 2.2e-16
## 2.5 % 97.5 %
## ConditionGood NA NA
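The NA row above most likely arises because the interval was requested for a parameter named ConditionGood, while the coefficient in the summary output is named Condition. The sketch below works from the reported estimates instead of the model object. It assumes that Condition = 1 codes a home in good condition and Condition = 0 a home in not good condition; that coding is not confirmed by the output shown here.

```r
# Coefficients copied from the summary output of the second model
b0        <- 96270.971   # intercept
b_size    <- 66.463      # House.Size
b_cond    <- 12926.940   # Condition
se_cond   <- 17196.712   # standard error of Condition
df_resid  <- 197

# Separate lines (assumption: Condition = 1 means good condition)
intercept_not_good <- b0
intercept_good     <- b0 + b_cond
cat("Not good: price =", intercept_not_good, "+", b_size, "* House.Size\n")
cat("Good:     price =", intercept_good, "+", b_size, "* House.Size\n")

# 95% confidence interval for the Condition coefficient
t_crit  <- qt(0.975, df_resid)
ci_cond <- b_cond + c(-1, 1) * t_crit * se_cond
print(ci_cond)
```

The two lines are parallel, with a common slope of 66.463 and intercepts that differ by the Condition coefficient. The interval runs from roughly -$21,000 to $47,000 and contains 0, which is consistent with the P-value of 0.453 for Condition in the summary output. With the model object available, requesting the interval for the parameter name "Condition" should give the same result.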