Multiple linear regression

Today we will work with data on flats features: their price, square (total and of living space), distance from the closest metro station, type of block of flats. etc.

Variables:

Let us load this data set:

#flats <- read.csv("http://math-info.hse.ru/f/2018-19/pep/flats.csv")
flats <- read.csv("flats.csv")

Now we can model how the price of flats depends on the square of living space, square of living space and distance from the closest metro station:

\[ \text{price} \sim \text{livesp} + \text{kitsp} + \text{metrdist}. \] Run this model in R:

modelf <- lm(data = flats, price ~ livesp + kitsp + metrdist)
summary(modelf)
## 
## Call:
## lm(formula = price ~ livesp + kitsp + metrdist, data = flats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -116.26  -17.69   -3.46   12.76  391.22 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -77.7859     4.7209 -16.477  < 2e-16 ***
## livesp        3.7504     0.1135  33.039  < 2e-16 ***
## kitsp         4.9634     0.3212  15.452  < 2e-16 ***
## metrdist     -1.5606     0.1925  -8.108 8.78e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33.12 on 2036 degrees of freedom
## Multiple R-squared:  0.5931, Adjusted R-squared:  0.5925 
## F-statistic: 989.2 on 3 and 2036 DF,  p-value: < 2.2e-16

What do we have here?

\[ \text{price} = -77.78 + 3.75 \times \text{livesp} + 4.96 \times \text{kitsp} - 1.56 \times \text{metrdist}. \]

\[ H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0 \text{ (model is not better than the model with the intercept only)} \]

\[ H_1: \text{at least one } \beta_i \ne 0 \text{ (model is better than the model with the intercept only)} \]

Informally, here we test the hypothesis that a model constructed is needed (not useless). Here p-value is close to 0, so we reject \(H_0\) and conclude that at least one coefficient is statistically different from zero, model is worth considering. Frankly speaking, seldom can we construct a model that is completely useless, so where p-value is too high.

We can extract some elements of this output separately, for example, coefficients, residuals or fitted values (values of reaction time predicted by our model).

modelf$coefficients
## (Intercept)      livesp       kitsp    metrdist 
##  -77.785895    3.750387    4.963444   -1.560587
modelf$residuals[0:10]  # first 10 residuals
##         1         2         3         4         5         6         7 
## 391.21785 133.10883 208.88652 164.67536 -52.01467 147.51443 119.83749 
##         8         9        10 
## 153.64044 -85.56008 192.77074
modelf$fitted.values[0:10] # first 10 predicted values of price
##        1        2        3        4        5        6        7        8 
## 338.7822 343.8912 141.1135 245.3246 102.0147 192.4856 165.1625 316.3596 
##        9       10 
## 220.5601 277.2293

To check the fit of our model we can look at several graphs. First, we can check whether the relationships between price and independent variables in our model are linear. To do so we can plot a matrix of scatterplots via the library GGally. Install and load it:

install.packages("GGally")
library(GGally)

Now let’s create a smaller data set with only variables from the model included:

# choose columns 2, 4, 5, 7 from flats
small <- flats[c(2, 4, 5, 7)]

Now we can plot:

ggpairs(small)

No certain non-linear relationships detected.

Technical comments on this plot:

  1. On the diagonal we see density plots for each variable, smoothed versions of histograms that show us the shape of distributions.

  2. Under the main diagonal there are simple scatterplots for each pair of variables. Above the main diagonal there are correlation coefficients for each pair of variables.

Secondly, we can plot graphs independent variable vs residuals for every independent variable in our model. For convenience we can save residuals as a separate column in flats dataframe.

flats$res <-modelf$residuals
head(flats$res)  # some first values of residuals
## [1] 391.21785 133.10883 208.88652 164.67536 -52.01467 147.51443

Living space vs residuals plot:

library(ggplot2)
ggplot(data = flats, aes(x = livesp, 
                         y = res)) + 
  geom_point()

Distance to the metro station vs residuals plot:

ggplot(data = flats, aes(x = metrdist, 
                         y = res)) + 
  geom_point()

Your can train and plot plots for other independent variables on your own. In these graphs there are no signs of non-linear patterns (no parabolas, U-shaped curves or something like this).