Today we will work with data on flats' features: their price, area (total and living), distance from the closest metro station, type of building, etc.
Variables:

price: price per square meter;
totsp: total space (area);
livesp: living space;
kitsp: kitchen space;
metrdist: distance from the closest metro station;
walk: whether the flat is reachable from the metro on foot;
brick: whether the building is made of brick;
floor: whether the flat is on the first or the last floor.

Let us load this data set:
#flats <- read.csv("http://math-info.hse.ru/f/2018-19/pep/flats.csv")
flats <- read.csv("flats.csv")
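Before modelling, it is a good habit to inspect the data set. A quick sanity check might look like this (output not shown here):

str(flats)     # structure: variable names and types
head(flats)    # first six rows
summary(flats) # descriptive statistics for each variable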
Now we can model how the price of a flat depends on its living space, kitchen space, and the distance from the closest metro station:
\[ \text{price} \sim \text{livesp} + \text{kitsp} + \text{metrdist}. \] Run this model in R:
modelf <- lm(data = flats, price ~ livesp + kitsp + metrdist)
summary(modelf)
##
## Call:
## lm(formula = price ~ livesp + kitsp + metrdist, data = flats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -116.26 -17.69 -3.46 12.76 391.22
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -77.7859 4.7209 -16.477 < 2e-16 ***
## livesp 3.7504 0.1135 33.039 < 2e-16 ***
## kitsp 4.9634 0.3212 15.452 < 2e-16 ***
## metrdist -1.5606 0.1925 -8.108 8.78e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33.12 on 2036 degrees of freedom
## Multiple R-squared: 0.5931, Adjusted R-squared: 0.5925
## F-statistic: 989.2 on 3 and 2036 DF, p-value: < 2.2e-16
What do we have here?
\[ \text{price} = -77.79 + 3.75 \times \text{livesp} + 4.96 \times \text{kitsp} - 1.56 \times \text{metrdist}. \]
The effect of every feature on the price is statistically significant at any conventional significance level: the p-values are approximately zero, so we reject the null hypothesis that each coefficient equals zero.
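The same conclusion can be read off confidence intervals for the coefficients; since all effects are significant, none of the intervals should contain zero:

confint(modelf) # 95% confidence intervals for all coefficients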
All else equal (ceteris paribus), if the living space increases by one unit, the price increases by 3.75 on average.
All else equal (ceteris paribus), if the distance to the closest metro station increases by one unit, the price decreases by 1.56 on average.
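To make these interpretations concrete, we can predict the price for a hypothetical flat; the values below (livesp = 40, kitsp = 9, metrdist = 10) are made up purely for illustration:

new_flat <- data.frame(livesp = 40, kitsp = 9, metrdist = 10)
predict(modelf, newdata = new_flat)
# by hand: -77.79 + 3.75 * 40 + 4.96 * 9 - 1.56 * 10, roughly 101.3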
As for model quality, we can look at the last lines of the output. Here R-squared
equals approximately 0.6. There are no strict rules for deciding which \(R^2\) is high enough; it all depends on the field. As a rough guideline, values greater than 0.3–0.4 are often considered sufficient. However, since in statistics we are mostly interested in relationships between variables, in interactions between them, and in model design in general, a substantively plausible model is better than a less plausible one with a higher \(R^2\).
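By the way, we do not have to read \(R^2\) off the printed summary; both versions are standard components of the summary object and can be extracted directly:

summary(modelf)$r.squared     # multiple R-squared, 0.5931
summary(modelf)$adj.r.squared # adjusted R-squared, 0.5925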
In the last line of the output there is also one more p-value. What hypothesis is tested here?
\[ H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0 \text{ (model is not better than the model with the intercept only)} \]
\[ H_1: \text{at least one } \beta_i \ne 0 \text{ (model is better than the model with the intercept only)} \]
Informally, here we test the hypothesis that the model we constructed is needed at all (not useless). Here the p-value is close to 0, so we reject \(H_0\) and conclude that at least one coefficient is statistically different from zero: the model is worth considering. Frankly speaking, we can seldom construct a model that is so completely useless that this p-value is high.
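To see exactly which comparison this F-test makes, we can fit the intercept-only model ourselves and compare the two models with anova(); the resulting F statistic and p-value should match the last line of the summary:

model0 <- lm(data = flats, price ~ 1) # model with the intercept only
anova(model0, modelf)                 # F-test of modelf against model0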
We can extract some elements of this output separately, for example, coefficients, residuals or fitted values (values of price predicted by our model).
modelf$coefficients
## (Intercept) livesp kitsp metrdist
## -77.785895 3.750387 4.963444 -1.560587
modelf$residuals[1:10] # first 10 residuals
## 1 2 3 4 5 6 7
## 391.21785 133.10883 208.88652 164.67536 -52.01467 147.51443 119.83749
## 8 9 10
## 153.64044 -85.56008 192.77074
modelf$fitted.values[1:10] # first 10 predicted values of price
## 1 2 3 4 5 6 7 8
## 338.7822 343.8912 141.1135 245.3246 102.0147 192.4856 165.1625 316.3596
## 9 10
## 220.5601 277.2293
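Equivalently, R provides standard accessor functions that return the same objects; they are often preferred to $-extraction in scripts:

coef(modelf)                # same as modelf$coefficients
head(residuals(modelf), 10) # same as modelf$residuals[1:10]
head(fitted(modelf), 10)    # same as modelf$fitted.values[1:10]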
To check the fit of our model we can look at several graphs. First, we can check whether the relationships between price and the independent variables in our model are linear. To do so we can plot a matrix of scatterplots via the GGally library. Install and load it:
install.packages("GGally")
library(GGally)
Now let’s create a smaller data set that includes only the variables from the model:
# choose columns 2, 4, 5, 7 from flats
small <- flats[c(2, 4, 5, 7)]
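Alternatively, if we do not want to rely on column positions, the same subset can be selected by column names (this sketch assumes the columns are named exactly as in the variable list above):

small <- flats[c("price", "livesp", "kitsp", "metrdist")]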
Now we can plot:
ggpairs(small)
No clear non-linear relationships are detected.
Technical comments on this plot:

On the diagonal we see density plots for each variable: smoothed versions of histograms that show the shape of each distribution.

Below the main diagonal there are simple scatterplots for each pair of variables.

Above the main diagonal there are correlation coefficients for each pair of variables.
Secondly, we can plot residuals against each independent variable in our model. For convenience we can save the residuals as a separate column in the flats dataframe.
flats$res <- modelf$residuals
head(flats$res) # first few values of the residuals
## [1] 391.21785 133.10883 208.88652 164.67536 -52.01467 147.51443
Living space vs residuals plot:
library(ggplot2)
ggplot(data = flats, aes(x = livesp, y = res)) +
  geom_point()
Distance to the metro station vs residuals plot:
ggplot(data = flats, aes(x = metrdist, y = res)) +
  geom_point()
You can practice by plotting similar graphs for the other independent variables on your own. In these graphs there are no signs of non-linear patterns (no parabolas, U-shaped curves or anything like this).
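As an optional extra check, a smoothing line can be added to any of these residual plots; if it is roughly horizontal, there is no sign of curvature (geom_smooth() below uses its default smoother):

ggplot(data = flats, aes(x = livesp, y = res)) +
  geom_point() +
  geom_smooth(se = FALSE) # a roughly flat line means no non-linear pattern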