Today we will work with data on flats' features: their price, area (total and living), distance from the closest metro station, type of building, etc.
Variables:

price: price per square meter;
totsp: total space (area);
livesp: living space;
kitsp: kitchen space;
metrdist: distance from the closest metro station;
walk: whether the flat is reachable from the metro on foot;
brick: whether the building is made of brick;
floor: whether the flat is on the first or the last floor.

Let us load this data set:
#flats <- read.csv("http://math-info.hse.ru/f/2018-19/pep/flats.csv")
flats <- read.csv("flats.csv")
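Before modelling, it is a good habit to inspect the data set. A quick sanity check might look like this (output not shown here):

str(flats)     # structure: variable names and types
head(flats)    # first six rows
summary(flats) # descriptive statistics for each variable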
Now we can model how the price of a flat depends on its living space, kitchen space, and the distance from the closest metro station:
\[ \text{price} \sim \text{livesp} + \text{kitsp} + \text{metrdist}. \] Run this model in R:
modelf <- lm(data = flats, price ~ livesp + kitsp + metrdist)
summary(modelf)
##
## Call:
## lm(formula = price ~ livesp + kitsp + metrdist, data = flats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -116.26 -17.69 -3.46 12.76 391.22
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -77.7859 4.7209 -16.477 < 2e-16 ***
## livesp 3.7504 0.1135 33.039 < 2e-16 ***
## kitsp 4.9634 0.3212 15.452 < 2e-16 ***
## metrdist -1.5606 0.1925 -8.108 8.78e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33.12 on 2036 degrees of freedom
## Multiple R-squared: 0.5931, Adjusted R-squared: 0.5925
## F-statistic: 989.2 on 3 and 2036 DF, p-value: < 2.2e-16
What do we have here?
\[ \text{price} = -77.79 + 3.75 \times \text{livesp} + 4.96 \times \text{kitsp} - 1.56 \times \text{metrdist}. \]
The effect of every feature on the price is statistically significant at any conventional significance level: the p-values are approximately zero, so we reject the null hypothesis that each coefficient equals zero.
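The same conclusion can be read off confidence intervals for the coefficients; since all effects are significant, none of the intervals should contain zero:

confint(modelf) # 95% confidence intervals for all coefficients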
All else equal (ceteris paribus), if the living space increases by one unit, the price increases by 3.75 on average.
All else equal (ceteris paribus), if the distance to the closest metro station increases by one unit, the price decreases by 1.56 on average.
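To make these interpretations concrete, we can predict the price for a hypothetical flat; the values below (livesp = 40, kitsp = 9, metrdist = 10) are made up purely for illustration:

new_flat <- data.frame(livesp = 40, kitsp = 9, metrdist = 10)
predict(modelf, newdata = new_flat)
# by hand: -77.79 + 3.75 * 40 + 4.96 * 9 - 1.56 * 10, roughly 101.3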
As for model quality, we can look at the last lines of the output. Here R-squared
equals approximately 0.6. There are no strict rules for deciding which \(R^2\) is high enough; it all depends on the field. As a rough guideline, values greater than 0.3–0.4 are often considered sufficient. However, since in statistics we are mostly interested in relationships between variables, in interactions between them, and in model design in general, a substantively plausible model is better than a less plausible one with a higher \(R^2\).
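By the way, we do not have to read \(R^2\) off the printed summary; both versions are standard components of the summary object and can be extracted directly:

summary(modelf)$r.squared     # multiple R-squared, 0.5931
summary(modelf)$adj.r.squared # adjusted R-squared, 0.5925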
In the last line of the output there is also one more p-value. What hypothesis is tested here?
\[ H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0 \text{ (model is not better than the model with the intercept only)} \]
\[ H_1: \text{at least one } \beta_i \ne 0 \text{ (model is better than the model with the intercept only)} \]
Informally, here we test the hypothesis that the model we constructed is needed at all (not useless). Here the p-value is close to 0, so we reject \(H_0\) and conclude that at least one coefficient is statistically different from zero: the model is worth considering. Frankly speaking, we can seldom construct a model that is so completely useless that this p-value is high.
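To see exactly which comparison this F-test makes, we can fit the intercept-only model ourselves and compare the two models with anova(); the resulting F statistic and p-value should match the last line of the summary:

model0 <- lm(data = flats, price ~ 1) # model with the intercept only
anova(model0, modelf)                 # F-test of modelf against model0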
We can extract some elements of this output separately, for example, coefficients, residuals or fitted values (values of price predicted by our model).
modelf$coefficients
## (Intercept) livesp kitsp metrdist
## -77.785895 3.750387 4.963444 -1.560587
modelf$residuals[1:10] # first 10 residuals
## 1 2 3 4 5 6 7
## 391.21785 133.10883 208.88652 164.67536 -52.01467 147.51443 119.83749
## 8 9 10
## 153.64044 -85.56008 192.77074
modelf$fitted.values[1:10] # first 10 predicted values of price
## 1 2 3 4 5 6 7 8
## 338.7822 343.8912 141.1135 245.3246 102.0147 192.4856 165.1625 316.3596
## 9 10
## 220.5601 277.2293
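Equivalently, R provides standard accessor functions that return the same objects; they are often preferred to $-extraction in scripts:

coef(modelf)                # same as modelf$coefficients
head(residuals(modelf), 10) # same as modelf$residuals[1:10]
head(fitted(modelf), 10)    # same as modelf$fitted.values[1:10]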
To check the fit of our model we can look at several graphs. First, we can check whether the relationships between price and the independent variables in our model are linear. To do so we can plot a matrix of scatterplots via the GGally library. Install and load it:
install.packages("GGally")
library(GGally)
Now let’s create a smaller data set that includes only the variables from the model:
# choose columns 2, 4, 5, 7 from flats
small <- flats[c(2, 4, 5, 7)]
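Alternatively, if we do not want to rely on column positions, the same subset can be selected by column names (this sketch assumes the columns are named exactly as in the variable list above):

small <- flats[c("price", "livesp", "kitsp", "metrdist")]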
Now we can plot:
ggpairs(small)
No clear non-linear relationships are detected.
Technical comments on this plot:

On the diagonal we see density plots for each variable: smoothed versions of histograms that show the shape of each distribution.

Below the main diagonal there are simple scatterplots for each pair of variables.

Above the main diagonal there are correlation coefficients for each pair of variables.
Secondly, we can plot residuals against each independent variable in our model. For convenience we can save the residuals as a separate column in the flats dataframe.
flats$res <- modelf$residuals
head(flats$res) # first few values of the residuals
## [1] 391.21785 133.10883 208.88652 164.67536 -52.01467 147.51443
Living space vs residuals plot:
library(ggplot2)
ggplot(data = flats, aes(x = livesp, y = res)) +
  geom_point()
Distance to the metro station vs residuals plot:
ggplot(data = flats, aes(x = metrdist, y = res)) +
  geom_point()
You can practice by plotting similar graphs for the other independent variables on your own. In these graphs there are no signs of non-linear patterns (no parabolas, U-shaped curves or anything like this).
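As an optional extra check, a smoothing line can be added to any of these residual plots; if it is roughly horizontal, there is no sign of curvature (geom_smooth() below uses its default smoother):

ggplot(data = flats, aes(x = livesp, y = res)) +
  geom_point() +
  geom_smooth(se = FALSE) # a roughly flat line means no non-linear pattern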