Q1

Plot price against sqft with a summary linear regression line.

Determine the equation of the summary line in the plot (by fitting the corresponding regression model) and interpret the coefficients for the intercept and sqft in the equation.

In this and subsequent questions interpret means: write down in concrete terms what each coefficient says about the value of a home, or the change in the value of a home, conditional on predictors.

ggplot(data = p, mapping = aes(x = sqft, y = price)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +  
  labs(title = "price ~ sqft")   
## `geom_smooth()` using formula = 'y ~ x'

lm(price ~ sqft, data = p) %>% 
  summary()
## 
## Call:
## lm(formula = price ~ sqft, data = p)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -622948 -151283   -1650  138951  804553 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 40623.019  15862.454   2.561   0.0105 *  
## sqft          269.345      7.742  34.791   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 210300 on 4989 degrees of freedom
## Multiple R-squared:  0.1953, Adjusted R-squared:  0.1951 
## F-statistic:  1210 on 1 and 4989 DF,  p-value: < 2.2e-16

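The fitted equation is price = 40,623.019 + 269.345 * sqft. The intercept, $40,623.019, is the predicted average price of a house in LA with 0 sqft; no such house exists, so it serves only as the baseline of the line. The sqft coefficient, $269.345, is the predicted change in the average price of a house for each additional square foot.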
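As a quick check of this interpretation, the fitted line can be evaluated by hand. The sketch below hard-codes the two estimates from the summary output above; the 2,000 sqft home (`example_sqft`) is a hypothetical value chosen for illustration, not an observation from the data.

```r
# Estimates copied from the lm(price ~ sqft) summary output above
b0 <- 40623.019   # intercept: predicted price at sqft = 0
b1 <- 269.345     # slope: dollars per additional square foot

# Hypothetical home size, used only for illustration
example_sqft <- 2000

# Predicted average price from the fitted line
price_hat <- b0 + b1 * example_sqft
price_hat

# Adding one square foot changes the prediction by exactly the slope
price_hat_plus_one <- b0 + b1 * (example_sqft + 1)
price_hat_plus_one - price_hat
```

Under these numbers the predicted price of a 2,000 sqft home is about $579,313, and one more square foot adds the slope, about $269.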

Q2

Fit a multiple regression model of price with all the available predictors entered additively (+). However, create a centered version of sqft (by subtracting the mean from each observation) and use this as a predictor rather than the original variable. (There should be 8 coefficients in this model.)

Interpret the 4 coefficients for the intercept, centered sqft, and city. Remember that Long Beach is the (missing) reference city in the model, assuming that factor levels have been assigned alphabetically.

# non-interaction model
(non_int <- lm(price ~ sqft + city + pool + garage + bed + bath, data = p)) %>% 
  summary()
## 
## Call:
## lm(formula = price ~ sqft + city + pool + garage + bed + bath, 
##     data = p)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -539286 -137407   -3532  124838  852187 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -18431.284  16327.524  -1.129   0.2590    
## sqft                271.561      7.515  36.138   <2e-16 ***
## citySanta Monica 190239.704   6757.751  28.151   <2e-16 ***
## cityWestwood      88020.719   6794.984  12.954   <2e-16 ***
## poolyes           10124.630   6760.090   1.498   0.1343    
## garageyes        -14195.911   6120.799  -2.319   0.0204 *  
## bed                  41.553   8420.185   0.005   0.9961    
## bath              -3092.909   9439.900  -0.328   0.7432    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 195000 on 4983 degrees of freedom
## Multiple R-squared:  0.3085, Adjusted R-squared:  0.3075 
## F-statistic: 317.6 on 7 and 4983 DF,  p-value: < 2.2e-16
# center sqft by subtracting its mean; store it in p so that data = p finds it
p$c_sqft <- p$sqft - mean(p$sqft)

(c_non_int <- lm(price ~ c_sqft + city + pool + garage + bed + bath, data = p)) %>% 
  summary
## 
## Call:
## lm(formula = price ~ c_sqft + city + pool + garage + bed + bath, 
##     data = p)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -539286 -137407   -3532  124838  852187 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      528103.213   9291.990  56.834   <2e-16 ***
## c_sqft              271.561      7.515  36.138   <2e-16 ***
## citySanta Monica 190239.704   6757.751  28.151   <2e-16 ***
## cityWestwood      88020.719   6794.984  12.954   <2e-16 ***
## poolyes           10124.630   6760.090   1.498   0.1343    
## garageyes        -14195.911   6120.799  -2.319   0.0204 *  
## bed                  41.553   8420.185   0.005   0.9961    
## bath              -3092.909   9439.900  -0.328   0.7432    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 195000 on 4983 degrees of freedom
## Multiple R-squared:  0.3085, Adjusted R-squared:  0.3075 
## F-statistic: 317.6 on 7 and 4983 DF,  p-value: < 2.2e-16

‘intercept’: $528,103.213 is the predicted average value of a house in the Long Beach area when the other variables equal 0, that is, when centered sqft is 0 (a house of average square footage) and there is no pool, no garage, and zero bedrooms and bathrooms. A house with no bedrooms or bathrooms is an unrealistic situation. ‘c_sqft’: $271.561 is the expected change in house value for each additional square foot, holding the other predictors constant. Centering only subtracts the mean and does not rescale the variable, so a one-unit increase in centered sqft is still one square foot; this is why the coefficient is identical to the sqft coefficient in the uncentered model. ‘citySanta Monica’: a house in the Santa Monica area is worth $190,239.704 more on average than a house in the Long Beach area, holding the other predictors constant. ‘cityWestwood’: a house in the Westwood area is worth $88,020.719 more on average than a house in the Long Beach area, holding the other predictors constant.
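The two summaries above differ only in the intercept: the sqft slope is 271.561 in both, and the intercept moves by 528,103.213 - (-18,431.284) = 546,534.497, which is the slope times the mean of sqft (implying a mean of about 2,013 sqft). The sketch below illustrates the same behavior on simulated data; the data frame `toy` and all of its values are made up for illustration and are not the housing data.

```r
# Simulated data, for illustration only (not the housing data)
set.seed(1)
n <- 200
toy <- data.frame(sqft = runif(n, 800, 3500))
toy$price <- 50000 + 250 * toy$sqft + rnorm(n, sd = 40000)

# Centering subtracts the mean; it does not rescale the variable
toy$c_sqft <- toy$sqft - mean(toy$sqft)

raw_fit      <- lm(price ~ sqft,   data = toy)
centered_fit <- lm(price ~ c_sqft, data = toy)

slope_raw          <- unname(coef(raw_fit)["sqft"])
slope_centered     <- unname(coef(centered_fit)["c_sqft"])
intercept_raw      <- unname(coef(raw_fit)["(Intercept)"])
intercept_centered <- unname(coef(centered_fit)["(Intercept)"])

# The slope is unchanged by centering
same_slope <- isTRUE(all.equal(slope_raw, slope_centered))

# The intercept shifts by slope * mean(sqft): it is now the prediction
# at the average square footage instead of at sqft = 0
shifted_intercept <- isTRUE(all.equal(
  intercept_centered,
  intercept_raw + slope_raw * mean(toy$sqft)
))

c(same_slope = same_slope, shifted_intercept = shifted_intercept)
```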

Q3

To the above model add an interaction between centered sqft and city. This means that you combine these terms multiplicatively (*) rather than additively (+).

Create a visualization of this interaction model, making sure to use centered sqft in the plot.

Interpret 6 coefficients from this model: the intercept, the main effects for centered sqft and city, and the interaction effects.

Interaction models can be tricky to understand. Here is some guidance:

The intercept is the average value of the target when the inputs are 0 (for numeric variables) or the reference category (for categorical variables).

The main effects are first 7 slope coefficients in the output. You should interpret the first 3. In the plot you created for this question you can see that there is a regression line for each city. Similarly in the interaction model: there is no single relationship between price and sqft, and consequently the main effect for sqft is conditional on city. Specifically, it denotes the relationship between sqft and price for the reference city. The main effects for a predictor in an interaction model will always be conditional on the levels of the variable with which it has been interacted.

The interaction effects are the final 2 coefficients in the output. (The colon indicates the interaction, as in sqft:citySanta Monica.) These coefficients estimate the change in the slope of the regression line for each city compared to the reference city. If the interaction coefficients are positive that means that the regression line relating sqft to price is steeper for that particular city in comparison to the reference city, or, equivalently, that the relationship is stronger.
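The guidance above can be illustrated with a small simulation. The sketch below uses made-up data (the data frame `toy`, its slopes, and its noise level are all assumptions for illustration) and checks that, in a model containing only c_sqft * city, the main effect plus an interaction coefficient reproduces the slope obtained by fitting that city on its own.

```r
# Simulated data, for illustration only (not the housing data)
set.seed(42)
city <- factor(rep(c("Long Beach", "Santa Monica", "Westwood"), each = 100))
c_sqft <- rnorm(300, mean = 0, sd = 500)

# Hypothetical city-specific baselines and slopes
true_base  <- c(510000, 700000, 600000)[as.integer(city)]
true_slope <- c(240, 330, 275)[as.integer(city)]
price <- true_base + true_slope * c_sqft + rnorm(300, sd = 50000)
toy <- data.frame(price, c_sqft, city)

# Levels are alphabetical, so Long Beach is the reference city
fit <- lm(price ~ c_sqft * city, data = toy)
b <- coef(fit)

# Slope for Santa Monica = reference slope + interaction coefficient
slope_sm_interaction <- unname(b["c_sqft"] + b["c_sqft:citySanta Monica"])

# The same slope from a regression on Santa Monica alone
sm_only <- lm(price ~ c_sqft, data = subset(toy, city == "Santa Monica"))
slope_sm_separate <- unname(coef(sm_only)["c_sqft"])

slopes_agree <- isTRUE(all.equal(slope_sm_interaction, slope_sm_separate))
slopes_agree
```

The main effect `c_sqft` on its own is the Long Beach slope, which is why it must be read as conditional on the reference city.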

# interaction model
(int <- lm(price ~ c_sqft * city, data = p)) %>% 
  summary
## 
## Call:
## lm(formula = price ~ c_sqft * city, data = p)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -547521 -139440   -1956  126804  866123 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             5.137e+05  3.878e+03 132.464  < 2e-16 ***
## c_sqft                  2.385e+02  9.856e+00  24.197  < 2e-16 ***
## citySanta Monica        1.896e+05  6.742e+03  28.121  < 2e-16 ***
## cityWestwood            8.828e+04  6.781e+03  13.019  < 2e-16 ***
## c_sqft:citySanta Monica 9.001e+01  1.749e+01   5.146 2.76e-07 ***
## c_sqft:cityWestwood     3.722e+01  1.804e+01   2.063   0.0392 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 194700 on 4985 degrees of freedom
## Multiple R-squared:  0.3109, Adjusted R-squared:  0.3102 
## F-statistic: 449.7 on 5 and 4985 DF,  p-value: < 2.2e-16
# plot
ggplot(data = p, mapping = aes(x = c_sqft, y = price, color = city)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +  
  labs(title = "price ~ sqft * city")   
## `geom_smooth()` using formula = 'y ~ x'

> ‘intercept’: $513,700 is the predicted average value of a house in the Long Beach area when centered sqft is 0, that is, for a house of average square footage. ‘c_sqft’: $238.5 is the change in average house value in Long Beach for each additional square foot. ‘citySanta Monica’: when centered sqft is 0, a house in Santa Monica is worth $189,600 more on average than a house in Long Beach. ‘cityWestwood’: when centered sqft is 0, a house in Westwood is worth $88,280 more on average than a house in Long Beach. ‘c_sqft:citySanta Monica’: $90.01 is the additional change in house value per square foot in Santa Monica compared with Long Beach, so the slope for Santa Monica is $328.51 ($238.5 + $90.01) per square foot. ‘c_sqft:cityWestwood’: $37.22 is the additional change in house value per square foot in Westwood compared with Long Beach, so the slope for Westwood is $275.72 ($238.5 + $37.22) per square foot.
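The city-specific lines implied by this model can be assembled from the coefficients directly. The sketch below hard-codes the rounded estimates from the interaction model summary above (the variable names are hypothetical) and adds each city's shift and interaction term to the Long Beach baseline.

```r
# Rounded estimates copied from the interaction model summary above
intercept <- 513700   # Long Beach, average sqft
slope_lb  <- 238.5    # c_sqft main effect (Long Beach slope)
shift_sm  <- 189600   # citySanta Monica
shift_ww  <- 88280    # cityWestwood
extra_sm  <- 90.01    # c_sqft:citySanta Monica
extra_ww  <- 37.22    # c_sqft:cityWestwood

# Dollars per additional square foot, by city
slopes <- c("Long Beach"   = slope_lb,
            "Santa Monica" = slope_lb + extra_sm,
            "Westwood"     = slope_lb + extra_ww)

# Predicted average price at average sqft (c_sqft = 0), by city
at_mean_sqft <- c("Long Beach"   = intercept,
                  "Santa Monica" = intercept + shift_sm,
                  "Westwood"     = intercept + shift_ww)

slopes
at_mean_sqft
```

With these rounded inputs the slopes are 238.5, 328.51 and 275.72 dollars per square foot for Long Beach, Santa Monica and Westwood respectively.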

Q4

Is this a good model? To assess model fit create three plots:

A residual plot with model residuals on the vertical axis and the fitted values on the horizontal axis. Add a summary line.

A plot of the model’s fitted values (again on the horizontal axis) against observed values of price. Add a summary line.

A histogram of the residuals.

Two functions will extract the fitted values from a model object: fitted(object) and predict(object). (If the newdata argument is omitted, predict() just returns the fitted values.)

Comment on model fit.
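The note about fitted() and predict() can be checked on any linear model. The sketch below uses R's built-in `cars` data purely as a stand-in (it is not the housing data) and also confirms that residuals are observed values minus fitted values.

```r
# Built-in data set, used only as a stand-in for the housing data
toy_fit <- lm(dist ~ speed, data = cars)

fitted_vals  <- fitted(toy_fit)
predict_vals <- predict(toy_fit)   # newdata omitted, so these are fitted values

# Both functions return the same values
same_values <- isTRUE(all.equal(fitted_vals, predict_vals))

# Residuals are observed minus fitted
manual_resid <- cars$dist - fitted_vals
resid_match <- isTRUE(all.equal(unname(manual_resid),
                                unname(residuals(toy_fit))))

c(same_values = same_values, resid_match = resid_match)
```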

# plots for interaction model
plot(int, which = 1)

p %>% 
  mutate(fitted_int = fitted(int),
         residuals_int = price - fitted_int) %>%
  ggplot(mapping = aes(x = fitted_int, y = price)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE, col = "red") +
  labs(title = "price ~ fitted values of interaction model")
## `geom_smooth()` using formula = 'y ~ x'

p %>% 
  mutate(fitted_int = fitted(int),
         residuals_int = price - fitted_int) %>%
  ggplot(aes(x = residuals_int)) + 
  geom_histogram() + 
  labs(title = "distribution of residuals in interaction model")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

> The residual plot shows the residuals scattered randomly around the zero line with no obvious pattern. In the second plot, where the fitted values are plotted against observed price, the red summary line indicates that the relationship is approximately linear and can be described with a line. The histogram shows a roughly normal distribution of residuals centered at 0, which further suggests that there are no remaining patterns in the residuals. Therefore, the interaction model is a reasonably good model, although its R-squared of 0.3109 means it explains only about 31% of the variation in price.

Q5

What should Andrew say in his presentation? Write a brief summary of the quantitative evidence that he should use to support this recommendation.

Andrew should recommend that PacDev focus on Santa Monica among the three cities in LA, since it shows the largest price increase associated with additional square footage in the interaction model. The model estimates that each additional square foot is associated with a price increase of $328.51 in Santa Monica, compared with $275.72 in Westwood and $238.5 in Long Beach. The extra $90.01 per square foot in Santa Monica relative to Long Beach is statistically significant (p = 2.76e-07). Santa Monica has the highest increase in house value per sqft, and hence PacDev should focus on SFR in Santa Monica.