Plot price against sqft with a summary linear regression line.
Determine the equation of the summary line in the plot (by fitting the corresponding regression model) and interpret the coefficients for the intercept and sqft in the equation.
In this and subsequent questions interpret means: write down in concrete terms what each coefficient says about the value of a home, or the change in the value of a home, conditional on predictors.
ggplot(data = p, mapping = aes(x = sqft, y = price)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "price ~ sqft")
## `geom_smooth()` using formula = 'y ~ x'
lm(price ~ sqft, data = p) %>%
summary()
##
## Call:
## lm(formula = price ~ sqft, data = p)
##
## Residuals:
## Min 1Q Median 3Q Max
## -622948 -151283 -1650 138951 804553
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40623.019 15862.454 2.561 0.0105 *
## sqft 269.345 7.742 34.791 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 210300 on 4989 degrees of freedom
## Multiple R-squared: 0.1953, Adjusted R-squared: 0.1951
## F-statistic: 1210 on 1 and 4989 DF, p-value: < 2.2e-16
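The fitted equation is price = 40,623.019 + 269.345 × sqft. The intercept, $40,623.019, is the predicted average value of a house in LA with zero sqft, which is an extrapolation with no practical meaning on its own. The slope, $269.345, is the expected change in the average price of a house for each additional square foot.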
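To make the interpretation concrete, the fitted line can be used to compute an expected price. The sketch below copies the two coefficients from the summary output above. The helper `predict_price()` and the 2,000 sqft example are illustrative assumptions, not part of the original analysis.

```r
# Coefficients copied from the summary(lm(price ~ sqft)) output above
intercept  <- 40623.019
slope_sqft <- 269.345

# Hypothetical helper: expected price for a given living area
predict_price <- function(sqft) intercept + slope_sqft * sqft

predict_price(2000)                        # expected price of a 2,000 sqft home
predict_price(2001) - predict_price(2000)  # one extra sqft adds the slope
```

With these coefficients a 2,000 sqft home has an expected price of about $579,313, and each extra square foot adds about $269.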
Fit a multiple regression model of price with all the available predictors entered additively (+). However, create a centered version of sqft (by subtracting the mean from each observation) and use this as a predictor rather than the original variable. (There should be 8 coefficients in this model.)
Interpret the 4 coefficients for the intercept, centered sqft, and city. Remember that Long Beach is the (missing) reference city in the model, assuming that factor levels have been assigned alphabetically.
# non-interaction model
(non_int <- lm(price ~ sqft + city + pool + garage + bed + bath, data = p)) %>%
summary()
##
## Call:
## lm(formula = price ~ sqft + city + pool + garage + bed + bath,
## data = p)
##
## Residuals:
## Min 1Q Median 3Q Max
## -539286 -137407 -3532 124838 852187
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -18431.284 16327.524 -1.129 0.2590
## sqft 271.561 7.515 36.138 <2e-16 ***
## citySanta Monica 190239.704 6757.751 28.151 <2e-16 ***
## cityWestwood 88020.719 6794.984 12.954 <2e-16 ***
## poolyes 10124.630 6760.090 1.498 0.1343
## garageyes -14195.911 6120.799 -2.319 0.0204 *
## bed 41.553 8420.185 0.005 0.9961
## bath -3092.909 9439.900 -0.328 0.7432
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 195000 on 4983 degrees of freedom
## Multiple R-squared: 0.3085, Adjusted R-squared: 0.3075
## F-statistic: 317.6 on 7 and 4983 DF, p-value: < 2.2e-16
# center sqft (subtract its mean) so the intercept refers to a home of average size
p$c_sqft <- p$sqft - mean(p$sqft)
(c_non_int <- lm(price ~ c_sqft + city + pool + garage + bed + bath, data = p)) %>%
summary()
##
## Call:
## lm(formula = price ~ c_sqft + city + pool + garage + bed + bath,
## data = p)
##
## Residuals:
## Min 1Q Median 3Q Max
## -539286 -137407 -3532 124838 852187
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 528103.213 9291.990 56.834 <2e-16 ***
## c_sqft 271.561 7.515 36.138 <2e-16 ***
## citySanta Monica 190239.704 6757.751 28.151 <2e-16 ***
## cityWestwood 88020.719 6794.984 12.954 <2e-16 ***
## poolyes 10124.630 6760.090 1.498 0.1343
## garageyes -14195.911 6120.799 -2.319 0.0204 *
## bed 41.553 8420.185 0.005 0.9961
## bath -3092.909 9439.900 -0.328 0.7432
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 195000 on 4983 degrees of freedom
## Multiple R-squared: 0.3085, Adjusted R-squared: 0.3075
## F-statistic: 317.6 on 7 and 4983 DF, p-value: < 2.2e-16
‘intercept’: $528,103.213 is the average value of a house in Long Beach when all other predictors equal 0, that is, when centered sqft is 0 (a house of average square footage) and there is no pool, no garage, and zero bedrooms and bathrooms. Zero bedrooms and bathrooms is an unrealistic situation. ‘c_sqft’: $271.561 is the expected change in house value for each additional square foot, holding the other predictors constant. Centering only subtracts the mean from sqft and does not rescale it, so a one-unit increase in c_sqft is still one square foot, not one standard deviation. ‘citySanta Monica’: on average, a house in Santa Monica is worth $190,239.704 more than a comparable house in Long Beach, holding the other predictors constant. ‘cityWestwood’: on average, a house in Westwood is worth $88,020.719 more than a comparable house in Long Beach, holding the other predictors constant.
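Centering shifts where sqft = 0 sits without rescaling it, so every slope is unchanged and only the intercept moves, by the slope times the mean of sqft. The check below is a sketch using the coefficients printed in the two summaries above. The variable names are illustrative, and the implied mean is back-calculated from rounded output rather than read from the data.

```r
# Values copied from the two model summaries above
intercept_raw      <- -18431.284   # model with sqft
intercept_centered <- 528103.213   # model with c_sqft
slope_sqft         <- 271.561      # identical in both models

# intercept_centered = intercept_raw + slope_sqft * mean(sqft),
# so the mean of sqft can be recovered from the printed output
implied_mean_sqft <- (intercept_centered - intercept_raw) / slope_sqft
implied_mean_sqft
```

The implied mean is roughly 2,013 sqft, so the centered intercept describes a Long Beach home of about that size, with the remaining predictors at 0 or their reference level.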
To the above model add an interaction between centered sqft and city. This means that you combine these terms multiplicatively (*) rather than additively (+).
Create a visualization of this interaction model, making sure to use centered sqft in the plot.
Interpret 6 coefficients from this model: the intercept, the main effects for centered sqft and city, and the interaction effects.
Interaction models can be tricky to understand. Here is some guidance:
The intercept is the average value of the target when the inputs are 0 (for numeric variables) or the reference category (for categorical variables).
The main effects are the first 7 slope coefficients in the output. You should interpret the first 3. In the plot you created for this question you can see that there is a regression line for each city. Similarly in the interaction model: there is no single relationship between price and sqft, and consequently the main effect for sqft is conditional on city. Specifically, it denotes the relationship between sqft and price for the reference city. The main effects for a predictor in an interaction model will always be conditional on the levels of the variable with which it has been interacted.
The interaction effects are the final 2 coefficients in the output. (The colon indicates the interaction, as in sqft:citySanta Monica.) These coefficients estimate the change in the slope of the regression line for each city compared to the reference city. If the interaction coefficients are positive that means that the regression line relating sqft to price is steeper for that particular city in comparison to the reference city, or, equivalently, that the relationship is stronger.
# interaction model
(int <- lm(price ~ c_sqft * city, data = p)) %>%
summary()
##
## Call:
## lm(formula = price ~ c_sqft * city, data = p)
##
## Residuals:
## Min 1Q Median 3Q Max
## -547521 -139440 -1956 126804 866123
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.137e+05 3.878e+03 132.464 < 2e-16 ***
## c_sqft 2.385e+02 9.856e+00 24.197 < 2e-16 ***
## citySanta Monica 1.896e+05 6.742e+03 28.121 < 2e-16 ***
## cityWestwood 8.828e+04 6.781e+03 13.019 < 2e-16 ***
## c_sqft:citySanta Monica 9.001e+01 1.749e+01 5.146 2.76e-07 ***
## c_sqft:cityWestwood 3.722e+01 1.804e+01 2.063 0.0392 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 194700 on 4985 degrees of freedom
## Multiple R-squared: 0.3109, Adjusted R-squared: 0.3102
## F-statistic: 449.7 on 5 and 4985 DF, p-value: < 2.2e-16
# plot
ggplot(data = p, mapping = aes(x = c_sqft, y = price, color = city)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "price ~ sqft * city")
## `geom_smooth()` using formula = 'y ~ x'
> ‘intercept’: $513,700 is the average value of a house in Long Beach
when centered sqft is 0, that is, for a house of average square footage.
‘c_sqft’: $238.5 is the change in house value in Long Beach for each
additional square foot. ‘citySanta Monica’: at centered sqft of 0, a
house in Santa Monica is worth $189,600 more on average than one in Long
Beach. ‘cityWestwood’: at centered sqft of 0, a house in Westwood is
worth $88,280 more on average than one in Long Beach. ‘c_sqft:citySanta
Monica’: the price per additional square foot in Santa Monica is $90.01
higher than in Long Beach, giving a Santa Monica slope of $328.51
($238.5 + $90.01). ‘c_sqft:cityWestwood’: the price per additional
square foot in Westwood is $37.22 higher than in Long Beach, giving a
Westwood slope of $275.72 ($238.5 + $37.22).
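The city-specific regression lines can be reconstructed by adding each interaction (or city) coefficient to the corresponding reference-city coefficient. The sketch below uses the rounded coefficients printed by `summary(int)`. The variable names are illustrative only.

```r
# Coefficients copied from summary(int) above (rounded to 4 significant digits)
b_intercept  <- 513700   # (Intercept): Long Beach at average sqft
b_santa      <- 189600   # citySanta Monica
b_westwood   <- 88280    # cityWestwood
b_sqft       <- 238.5    # c_sqft: slope for Long Beach (reference city)
b_sqft_santa <- 90.01    # c_sqft:citySanta Monica
b_sqft_westw <- 37.22    # c_sqft:cityWestwood

# Slope of each city's line = reference slope + interaction term
city_slopes <- c(
  "Long Beach"   = b_sqft,
  "Santa Monica" = b_sqft + b_sqft_santa,
  "Westwood"     = b_sqft + b_sqft_westw
)

# Average price at centered sqft = 0 in each city
city_intercepts <- c(
  "Long Beach"   = b_intercept,
  "Santa Monica" = b_intercept + b_santa,
  "Westwood"     = b_intercept + b_westwood
)

city_slopes
city_intercepts
```

The interaction coefficients themselves are differences in slope relative to Long Beach. The sums computed here are the slopes of the individual lines in the plot.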
Is this a good model? To assess model fit create three plots:
A residual plot with model residuals on the vertical axis and the fitted values on the horizontal axis. Add a summary line.
A plot of the model’s fitted values (again on the horizontal axis) against observed values of price. Add a summary line.
A histogram of the residuals.
Two functions will extract the fitted values from a model object: fitted(object) and predict(object). (If the newdata argument is omitted, predict() just returns the fitted values.)
Comment on model fit.
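The equivalence of `fitted()` and `predict()` without `newdata` can be checked on any fitted model. The sketch below uses R's built-in `mtcars` data purely as a stand-in, not the housing data, and also confirms that residuals are observed minus fitted values.

```r
# Illustration on the built-in mtcars data (not the housing data)
m <- lm(mpg ~ wt, data = mtcars)

# With newdata omitted, predict() returns the same values as fitted()
same_values <- isTRUE(all.equal(fitted(m), predict(m)))
same_values

# Residuals are observed values minus fitted values
manual_resid <- mtcars$mpg - fitted(m)
isTRUE(all.equal(manual_resid, residuals(m)))
```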
# plots for interaction model
plot(int, which = 1)
p %>%
mutate(fitted_int = fitted(int),
residuals_int = price - fitted_int) %>%
ggplot(mapping = aes(x = fitted_int, y = price)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, col = "red") +
labs(title = "price ~ fitted values of interaction model")
## `geom_smooth()` using formula = 'y ~ x'
p %>%
mutate(fitted_int = fitted(int),
residuals_int = price - fitted_int) %>%
ggplot(aes(x = residuals_int)) +
geom_histogram() +
labs(title = "distribution of residuals in interaction model")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
> The residual plot shows the residuals scattered randomly around the
zero line with no obvious pattern. In the second plot, where the fitted
values are plotted against price, the red summary line shows that the
relationship is linear and can be described with a line. The histogram
shows an approximately normal distribution of residuals centered at 0,
which further supports the absence of patterns in the residuals.
Therefore, the interaction model fits the data reasonably well, although
its R-squared of 0.31 means it explains only about 31% of the variation
in price.
What should Andrew say in his presentation? Write a brief summary of the quantitative evidence that he should use to support this recommendation.
Andrew should recommend that PacDev focus on Santa Monica among the three cities in LA, since it shows the largest price increase associated with additional square footage according to the interaction model. The model estimates that each additional square foot is associated with a price increase of $328.51 in Santa Monica, compared with $275.72 in Westwood and $238.5 in Long Beach. Both interaction terms are statistically significant (p < 0.001 for Santa Monica, p = 0.039 for Westwood). Santa Monica has the highest increase in house value per sqft, and hence PacDev should focus on the SFR in Santa Monica.
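One way to quantify the uncertainty behind this recommendation is an approximate 95% confidence interval for the extra price per sqft in Santa Monica relative to Long Beach. The sketch below rebuilds it from the estimate, standard error, and residual degrees of freedom printed in the `c_sqft:citySanta Monica` row of the interaction model output. It is an illustration rather than part of the original analysis. On the fitted model object itself, `confint(int)` should give the same interval up to rounding.

```r
# Values copied from the c_sqft:citySanta Monica row of summary(int)
est <- 90.01   # extra price per sqft in Santa Monica vs. Long Beach
se  <- 17.49   # its standard error
df  <- 4985    # residual degrees of freedom

# Approximate 95% confidence interval: estimate +/- t quantile * SE
ci <- est + c(-1, 1) * qt(0.975, df) * se
round(ci, 1)
```

The interval lies entirely above zero, which is consistent with a steeper price-per-sqft relationship in Santa Monica than in Long Beach. The printed output does not include the standard error of the Santa Monica versus Westwood difference, so that comparison would need an additional step, such as refitting with a different reference city.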