The readxl library has to be loaded first:
library(readxl)
Adata <- read_xlsx("Apartments.xlsx")
Description:
Which features most significantly affect the price of an apartment?
Does the availability of parking or a balcony influence the price more?
Do newer apartments command a price premium, even if they are farther from the city center?
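Before answering these questions, the data can be inspected briefly (a small sketch using base R; no output shown), and the binary indicators Parking and Balcony are then recoded as factors below:
str(Adata)     # variable types and example values
summary(Adata) # basic descriptive statistics for each variable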
Adata$Parking <- factor(Adata$Parking, levels = c(0, 1), labels = c("No", "Yes"))
Adata$Balcony <- factor(Adata$Balcony, levels = c(0, 1), labels = c("No", "Yes"))
H0: The average price per m2 is EUR 1900.
H1: The average price per m2 is not EUR 1900.
To decide between the parametric t-test and a non-parametric alternative, the Shapiro-Wilk normality test has to be performed first:
shapiro.test(Adata$Price)
##
## Shapiro-Wilk normality test
##
## data: Adata$Price
## W = 0.94017, p-value = 0.0006513
Based on the p-value, which is below the typical threshold of 0.05, I reject the null hypothesis that the prices per m2 are normally distributed. Consequently, I proceed with the non-parametric median test.
H0: The median price per m2 is equal to EUR 1900.
H1: The median price per m2 is not equal to EUR 1900.
wilcox.test(Adata$Price, mu=1900, alternative = "two.sided")
##
## Wilcoxon signed rank test with continuity correction
##
## data: Adata$Price
## V = 2328, p-value = 0.02844
## alternative hypothesis: true location is not equal to 1900
The null hypothesis can be rejected at the 5% significance level (p = 0.02844): the median price per m2 is not equal to EUR 1900.
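As a quick check of the direction of this difference, the sample median can be computed (a minimal sketch; the output is not reproduced here):
median(Adata$Price) # sample median price per m2, to compare against EUR 1900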
fit1 <- lm(Price ~ Age, data = Adata)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = Adata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
Regression coefficient (slope): -8.975. For every additional year of apartment age, the price per m2 decreases by approximately EUR 8.98, on average, assuming a linear relationship.
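For illustration, the fitted simple regression can also be used for prediction; the chosen age of 10 years is an assumption for the example only:
predict(fit1, newdata = data.frame(Age = 10)) # roughly 2185.455 - 8.975 * 10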
cor(Adata$Price, Adata$Age)
## [1] -0.230255
Coefficient of correlation: -0.23. There is a weak negative relationship between the price per m2 and the age of the apartments.
Coefficient of determination: 0.053. About 5.3% of the variation in price per m2 is explained by the age of the apartment.
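Because this simple regression has a single predictor, the coefficient of determination is just the squared correlation coefficient, which can be verified directly:
cor(Adata$Price, Adata$Age)^2 # equals the Multiple R-squared from summary(fit1), about 0.053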
library(car)
scatterplotMatrix(Adata[ , c(3,1,2)], smooth = FALSE)
A slight negative linear relationship exists between price and age — as age increases, price per m2 tends to decrease. A clear negative relationship between the price and distance from the city centre is apparent — apartments farther from the city center have lower prices. The predictor variables (Age, Distance) are not highly correlated with each other, which means multicollinearity should not distort the regression.
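The relationships seen in the scatterplot matrix can also be summarized numerically (a small sketch using the same three variables):
cor(Adata[, c("Price", "Age", "Distance")]) # pairwise correlation matrix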
fit2 <- lm(Price ~ Age + Distance, data = Adata)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Adata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
vif(fit2)
## Age Distance
## 1.001845 1.001845
None of the explanatory variables cause multicollinearity as all VIF statistics are very close to 1.
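As a reminder, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors; with only two predictors this can be checked by hand (a minimal sketch):
r2_age <- summary(lm(Age ~ Distance, data = Adata))$r.squared # R^2 from regressing Age on Distance
1 / (1 - r2_age) # should match vif(fit2) for Age, roughly 1.0018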
Adata$StdResid <- round(rstandard(fit2), 3) # standardized residuals
Adata$CooksD <- round(cooks.distance(fit2), 3) # Cook's distances
hist(Adata$StdResid, xlab = "Standardized residuals", ylab = "Frequency", main = "Histogram of standardized residuals")
The histogram of the standardized residuals indicates that the errors are most likely not normally distributed. It can also be seen from the histogram that the sample does not contain any units that are outliers (standardized residuals outside ±3).
hist(Adata$CooksD, xlab = "Cook's distance", ylab = "Frequency", main = "Histogram of Cook's distances")
From the distribution of Cook's distances I observe that some units have a markedly stronger impact on the estimated partial regression coefficients (Cook's distance of 0.10 and higher).
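The number of units above the chosen cutoff can be counted directly (a small sketch using the 0.10 threshold stated above):
sum(Adata$CooksD >= 0.1) # observations with Cook's distance of 0.10 or higher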
head(Adata[order(-Adata$CooksD),])
## # A tibble: 6 × 7
## Age Distance Price Parking Balcony StdResid CooksD
## <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 5 45 2180 Yes Yes 2.58 0.32
## 2 43 37 1740 No No 1.44 0.104
## 3 2 11 2790 Yes No 2.05 0.069
## 4 7 2 1760 No Yes -2.15 0.066
## 5 37 3 2540 Yes Yes 1.58 0.061
## 6 40 2 2400 No Yes 1.09 0.038
The two observations with high influence (Cook’s distance 0.10 and higher) can now be removed from the sample:
Adata <- Adata[Adata$CooksD < 0.1, ]
The regression is now re-estimated:
fit2 <- lm(Price ~ Age + Distance, data = Adata)
Adata$StdResid <- round(rstandard(fit2), 3) # standardized residuals
Adata$CooksD <- round(cooks.distance(fit2), 3) # Cook's distances
Adata$StdFitted <- scale(fit2$fitted.values) # standardized fitted values
scatterplot(y = Adata$StdResid, x = Adata$StdFitted,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE, regLine = FALSE, smooth = FALSE)
From the scatter plot it cannot be concluded that homoskedasticity is violated, as the variance of the errors appears to be relatively constant across the fitted values. I additionally perform the Breusch-Pagan test for heteroskedasticity:
library(olsrr)
ols_test_breusch_pagan(fit2)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## -----------------------------
## DF = 1
## Chi2 = 3.775135
## Prob > Chi2 = 0.05201969
The Breusch-Pagan test for heteroskedasticity shows that the null hypothesis (the variance of the errors is constant) should not be rejected at the 5% level (p = 0.052).
hist(Adata$StdResid, xlab = "Standardized residuals", ylab = "Frequency", main = "Histogram of standardized residuals")
The histogram of the standardized residuals indicates that the errors are most likely not normally distributed. I also formally test this assumption:
shapiro.test(Adata$StdResid)
##
## Shapiro-Wilk normality test
##
## data: Adata$StdResid
## W = 0.95952, p-value = 0.01044
The null hypothesis that the errors are normally distributed should be rejected at p < 0.05.
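As an additional visual check of this assumption (not part of the original output), a normal Q-Q plot of the standardized residuals can be drawn:
qqnorm(Adata$StdResid, main = "Normal Q-Q plot of standardized residuals")
qqline(Adata$StdResid)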
The re-estimated regression is now summarized:
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Adata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -627.27 -212.96 -46.23 205.05 578.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2490.112 76.189 32.684 < 2e-16 ***
## Age -7.850 3.244 -2.420 0.0178 *
## Distance -23.945 2.826 -8.473 9.53e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared: 0.4968, Adjusted R-squared: 0.4842
## F-statistic: 39.49 on 2 and 80 DF, p-value: 1.173e-12
For each additional year of apartment age, the price per m2 decreases on average by approximately EUR 7.85, holding distance from the city centre constant.
For each additional km from the city centre, the price per m2 drops by approximately EUR 23.95, on average, holding the age of the apartment constant.
About 49.7% of the variation in price per m2 is explained by age and distance from the centre together. The model as a whole is highly statistically significant, meaning at least one predictor has a non-zero coefficient (F = 39.49, p < 0.001).
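For illustration, the estimated model can be used to predict the price per m2 for a hypothetical apartment; the chosen values (Age = 10 years, Distance = 5 km) are assumptions for the example only:
predict(fit2, newdata = data.frame(Age = 10, Distance = 5), interval = "confidence")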
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = Adata)
#parking and balcony are just included directly
#they were previously converted to factors
#No was the reference category for both
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 80 5982100
## 2 78 5458696 2 523404 3.7395 0.02813 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the p-value = 0.028, I reject the null hypothesis that both models fit the data equally well. The more complex model fit3 provides a statistically significant improvement in model fit over fit2. That means the additional variables in fit3 add explanatory power to my regression.
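The same comparison can also be made with an information criterion (a small sketch; the model with the lower AIC is preferred):
AIC(fit2, fit3) # compares the two nested models by AIC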
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Adata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -499.06 -194.33 -32.04 219.03 544.31
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2358.900 93.664 25.185 < 2e-16 ***
## Age -7.197 3.148 -2.286 0.02499 *
## Distance -21.241 2.911 -7.296 2.14e-10 ***
## ParkingYes 168.921 62.166 2.717 0.00811 **
## BalconyYes -6.985 58.745 -0.119 0.90566
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 264.5 on 78 degrees of freedom
## Multiple R-squared: 0.5408, Adjusted R-squared: 0.5173
## F-statistic: 22.97 on 4 and 78 DF, p-value: 1.449e-12
On average, having parking increases the price per m2 by approximately EUR 169, compared to not having parking, holding all else constant.
On average, having a balcony is associated with a price per m2 about EUR 7 lower, but this effect is not statistically significant (p = 0.906), so no real impact of a balcony on the price per m2 can be concluded.
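To complement the point estimates, 95% confidence intervals for the coefficients can be computed (a minimal sketch); the interval for BalconyYes should include zero, consistent with its insignificance:
confint(fit3, level = 0.95)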
The F-test checks the overall significance of the regression model.
H0: All slope coefficients are zero.
H1: At least one of the coefficients is not zero.
Since the p-value = 1.449e-12 < 0.001, I reject the null hypothesis. At least one of the predictors (Age, Distance, Parking, or Balcony) contributes significantly to explaining variation in price per m2.
fitted_val <- fit3$fitted.values
The residual is the difference between the actual price and the predicted (fitted) value:
Adata$Price[2] - fitted_val[2]
## 2
## 422.9572
Check the result:
fit3$residuals[2]
## 2
## 422.9572
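The same identity holds for every observation, which can be verified in one step (a small sketch):
all.equal(unname(Adata$Price - fitted_val), unname(fit3$residuals)) # TRUE if actual minus fitted equals the residual for all units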