library(readxl)
apartments <- read_excel("Apartments.xlsx")
head(apartments)
## # A tibble: 6 × 5
## Age Distance Price Parking Balcony
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Description:
How do apartment’s age, proximity to the city center, along with conveniences like having parking and balcony, affect its price per square meter?
apartments$ParkingF <- factor(apartments$Parking,
levels = c(0, 1),
labels = c("No", "Yes"))
apartments$BalconyF <- factor(apartments$Balcony,
levels = c(0, 1),
labels = c("No", "Yes"))
str(apartments)
## tibble [85 × 7] (S3: tbl_df/tbl/data.frame)
## $ Age : num [1:85] 7 18 7 28 18 28 14 18 22 25 ...
## $ Distance: num [1:85] 28 1 28 29 18 12 20 6 7 2 ...
## $ Price : num [1:85] 1640 2800 1660 1850 1640 1770 1850 1970 2270 2570 ...
## $ Parking : num [1:85] 0 1 0 0 1 0 0 1 1 1 ...
## $ Balcony : num [1:85] 1 0 0 1 1 1 1 1 0 0 ...
## $ ParkingF: Factor w/ 2 levels "No","Yes": 1 2 1 1 2 1 1 2 2 2 ...
## $ BalconyF: Factor w/ 2 levels "No","Yes": 2 1 1 2 2 2 2 2 1 1 ...
This is a one-sample t-test with:
t.test(apartments$Price,
mu = 1900,
alternative = "two.sided")
##
## One Sample t-test
##
## data: apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
The average price per square meter is significantly different from 1900 EUR (at p-value = 0.005), with a 95% confidence interval ranging from 1937 to 2100 EUR.
fit1 <- lm(Price ~ Age, data = apartments)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
cor(apartments$Price, apartments$Age) #check for the sign
## [1] -0.230255
sqrt(summary(fit1)$r.squared)
## [1] 0.230255
The slope estimate for Age is -8.975, meaning that for each additional year of apartment age, the price per m\(^2\) decreases by approximately EUR 8.98, on average. The correlation (\(r\)) of -0.23 indicates a weak negative linear relationship between the variables, and the \(R^2\) of 0.053 means that about 5.3% of the variation in apartment prices can be explained by its age.
library(car)
scatterplotMatrix(~ Price + Age + Distance, data = apartments,
smooth = FALSE)
From the scatterplot of Age and Distance, we can clearly see there is almost no correlation between two, hence multicollinearity is not a concern in our case.
fit2 <- lm(Price ~ Age + Distance, data = apartments)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
vif(fit2)
## Age Distance
## 1.001845 1.001845
Both Age and Distance have VIF values very close to 1, therefore we observe no multicollinearity.
apartments$StdResid <- round(rstandard(fit2), 3)
apartments$CooksD <- round(cooks.distance(fit2), 3)
hist(apartments$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
shapiro.test(apartments$StdResid)
##
## Shapiro-Wilk normality test
##
## data: apartments$StdResid
## W = 0.95303, p-value = 0.003645
According to Shapiro-Wilk normality test, we would reject the null hypothesis that errors are normally distributed. However, our sample is fairly large (>80), so the central limit theorem helps the t and F statistics remain reasonably robust, even with mild departures from normality. Also, with big samples, any small deviation from normality can show up as significant in Shapiro-Wilk test. So the assumption of normality is less important to us and we can continue even while it is violated, in our case.
hist(apartments$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
We can see a big jump in the histogram, so we will definitely have to remove some potential observations with high influence.
apartments$ID <- seq(1, nrow(apartments))
head(apartments[order(apartments$StdResid), c("ID", "StdResid")], 3)
## # A tibble: 3 × 2
## ID StdResid
## <int> <dbl>
## 1 53 -2.15
## 2 13 -1.50
## 3 72 -1.50
head(apartments[order(-apartments$StdResid), c("ID", "StdResid")], 3)
## # A tibble: 3 × 2
## ID StdResid
## <int> <dbl>
## 1 38 2.58
## 2 33 2.05
## 3 2 1.78
We can see that the standard residuals fall between [-3; 3] interval. So, we do not observe any outliers.
head(apartments[order(-apartments$CooksD), c("ID", "CooksD")], 6)
## # A tibble: 6 × 2
## ID CooksD
## <int> <dbl>
## 1 38 0.32
## 2 55 0.104
## 3 33 0.069
## 4 53 0.066
## 5 22 0.061
## 6 39 0.038
However, we can clearly see that ID 38 is an observation with a high influence, even though it is less than 0.5 or 1 (general rule of thumb), because of the huge difference in values between the next biggest value (0.32 vs 0.10). Therefore, we remove the ID 38.
library(dplyr)
apartments <- apartments %>%
filter(!ID == 38)
Now, when we dropped one observation, let’s fit the function and calculate the values again.
fit2_new <- lm(Price ~ Age + Distance, data = apartments)
apartments$StdResid <- round(rstandard(fit2_new), 3)
apartments$CooksD <- round(cooks.distance(fit2_new), 3)
hist(apartments$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
head(apartments[order(apartments$StdResid), c("ID", "StdResid")], 3)
## # A tibble: 3 × 2
## ID StdResid
## <int> <dbl>
## 1 53 -2.24
## 2 13 -1.49
## 3 72 -1.49
head(apartments[order(-apartments$StdResid), c("ID", "StdResid")], 3)
## # A tibble: 3 × 2
## ID StdResid
## <int> <dbl>
## 1 33 2.22
## 2 25 1.82
## 3 2 1.78
Standard residuals are between [-3; 3].
hist(apartments$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
No big jumps observed, so we can safely move on.
library(car)
apartments$StdFitted <- scale(fit2_new$fitted.values)
scatterplot(y = apartments$StdResid, x = apartments$StdFitted,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
library(olsrr)
ols_test_breusch_pagan(fit2_new)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## -----------------------------
## DF = 1
## Chi2 = 2.927455
## Prob > Chi2 = 0.08708469
From Breusch Pagan Test for Heteroskedasticity, we reject the null hypothesis that the variance is constant (p-value = 0.09), so we would have to use robust standard errors.
As already done in task 9, we concluded that standardized residuals are not normally distributed. Let’s check again, but this time with Q-Q plot.
library(ggpubr)
ggqqplot(rstandard(fit2_new))
shapiro.test(rstandard(fit2_new))
##
## Shapiro-Wilk normality test
##
## data: rstandard(fit2_new)
## W = 0.9565, p-value = 0.00636
We can see that some of the points clearly fall outside the CI, and the Shapiro-Wilk normality test confirms non-normal distribution of errors. However, as mentioned already, our sample is fairly large (>80), so the central limit theorem helps the t and F statistics remain reasonably robust, even with mild departures from normality. Also, with big samples, any small deviation from normality can show up as significant in Shapiro-Wilk test. So the assumption of normality is less important to us and we can continue even while it is violated, in our case.
library(estimatr)
fit2_robust <- lm_robust(Price ~ Age + Distance,
se_type = "HC1", # White's robust standard errors
data = apartments)
summary(fit2_robust)
##
## Call:
## lm_robust(formula = Price ~ Age + Distance, data = apartments,
## se_type = "HC1")
##
## Standard error type: HC1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
## (Intercept) 2456.076 86.333 28.449 6.358e-44 2284.30 2627.8528 81
## Age -6.464 3.338 -1.936 5.633e-02 -13.11 0.1786 81
## Distance -22.955 2.598 -8.837 1.672e-13 -28.12 -17.7860 81
##
## Multiple R-squared: 0.4838 , Adjusted R-squared: 0.4711
## F-statistic: 40.12 on 2 and 81 DF, p-value: 7.781e-13
Intercept: The estimated price when
Age = 0 and Distance = 0 is 2456.
Coefficient for Age: The price decreases by 6.464 (in EUR/m\(^2\)) on average for each additional year of apartment age, holding distance constant (p-value = 0.056, not significant).
Coefficient for Distance: The price decreases by 22.955 (in EUR/m\(^2\)) on average for each additional 1 km from the center, holding age constant (p-value < 0.001, highly significant).
48.38% of variability in Price is explained by the linear effect of Age and Distance.
fit3 <- lm_robust(Price ~ Age + Distance + ParkingF + BalconyF,
se_type = "HC1",
data = apartments)
summary(fit3)
##
## Call:
## lm_robust(formula = Price ~ Age + Distance + ParkingF + BalconyF,
## data = apartments, se_type = "HC1")
##
## Standard error type: HC1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
## (Intercept) 2329.724 102.132 22.8109 1.659e-36 2126.43 2533.0123 79
## Age -5.821 3.227 -1.8038 7.507e-02 -12.24 0.6022 79
## Distance -20.279 2.860 -7.0902 5.021e-10 -25.97 -14.5861 79
## ParkingFYes 167.531 64.716 2.5887 1.146e-02 38.72 296.3457 79
## BalconyFYes -15.207 59.330 -0.2563 7.984e-01 -133.30 102.8873 79
##
## Multiple R-squared: 0.5275 , Adjusted R-squared: 0.5035
## F-statistic: 24.02 on 4 and 79 DF, p-value: 5.078e-13
anova(fit2_robust, fit3)
## Error in UseMethod("anova"): no applicable method for 'anova' applied to an object of class "lm_robust"
Since we are using lm_robust() instead of lm(), the resulting object is of class “lm_robust”, which does not have a built‐in anova() method, therefore we get the error: Error in UseMethod(“anova”) : no applicable method for ‘anova’ applied to an object of class “lm_robust”. In this case, we could just compare adjusted R-squared – and if we do, then model 3 fits data better than model 2 according to the adjusted R-squared (0.5035 > 0.4711). However, let’s run the models without using robust standard errors, in order to make use of anova() function (we will see later that all the coefficients remain the same).
library(stargazer)
fit2_lm <- lm(Price ~ Age + Distance, data = apartments)
fit3_lm <- lm(Price ~ Age + Distance + ParkingF + BalconyF, data = apartments)
stargazer(fit2_lm, fit3_lm, type="text",
title = "Fit 2 vs Fit 3",
align = TRUE,
no.space = TRUE)
##
## Fit 2 vs Fit 3
## =================================================================
## Dependent variable:
## ---------------------------------------------
## Price
## (1) (2)
## -----------------------------------------------------------------
## Age -6.464** -5.821*
## (3.159) (3.074)
## Distance -22.955*** -20.279***
## (2.786) (2.886)
## ParkingFYes 167.531***
## (62.864)
## BalconyFYes -15.207
## (59.201)
## Constant 2,456.076*** 2,329.724***
## (73.931) (93.066)
## -----------------------------------------------------------------
## Observations 84 84
## R2 0.484 0.527
## Adjusted R2 0.471 0.504
## Residual Std. Error 276.146 (df = 81) 267.536 (df = 79)
## F Statistic 37.959*** (df = 2; 81) 22.045*** (df = 4; 79)
## =================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
anova(fit2_lm, fit3_lm)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingF + BalconyF
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 81 6176767
## 2 79 5654480 2 522287 3.6485 0.03051 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can see that all the coefficients in both models are the same in both cases (with and without White’s robust standard errors).
Model fit3_lm does indeed fit data better than model fit2_lm (p-value = 0.0306).
summary(fit3)
##
## Call:
## lm_robust(formula = Price ~ Age + Distance + ParkingF + BalconyF,
## data = apartments, se_type = "HC1")
##
## Standard error type: HC1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
## (Intercept) 2329.724 102.132 22.8109 1.659e-36 2126.43 2533.0123 79
## Age -5.821 3.227 -1.8038 7.507e-02 -12.24 0.6022 79
## Distance -20.279 2.860 -7.0902 5.021e-10 -25.97 -14.5861 79
## ParkingFYes 167.531 64.716 2.5887 1.146e-02 38.72 296.3457 79
## BalconyFYes -15.207 59.330 -0.2563 7.984e-01 -133.30 102.8873 79
##
## Multiple R-squared: 0.5275 , Adjusted R-squared: 0.5035
## F-statistic: 24.02 on 4 and 79 DF, p-value: 5.078e-13
Intercept (\(\beta_0\)):
The predicted price per m\(^2\) when
\(\text{Age} = 0\), \(\text{Distance} = 0\) and with the baseline
categories (Parking = No, Balcony = No) is about 2329.72
EUR/m\(^2\).
Age (\(\beta_1\)):
Each additional 1 year in apartment age is associated
with 5.82 EUR less in price per m\(^2\), on average, holding everything
else constant.
The p-value \(\approx 0.075\) suggests
this effect is not significant.
Distance (\(\beta_2\)):
For every 1 km further from the city center, the price
per m\(^2\) decreases by about
20.28 EUR, on average, holding everything else
constant (\(p < 0.001\), highly
significant).
ParkingFYes (\(\beta_3\)):
Holding age, distance, and balcony the same, having a parking spot adds
about 167.53 EUR to the price per m\(^2\) versus no parking, on average
(\(p \approx 0.012\), highly
significant).
BalconyFYes (\(\beta_4\)):
Holding age, distance, and parking the same, having a balcony
reduces price by about 15.21 EUR per m\(^2\) relative to no balcony, on average
(\(p \approx 0.8\), not
significant).
Null Hypothesis \(H_0\):
All of partial regression coefficients are 0.
Alternative Hypothesis \(H_1\):
At least one of the coefficients is not zero.
Because the p-value is extremely small (\(5.078 \times 10^{-13}\)), we reject \(H_0\) and conclude that, collectively, these predictors do indeed help explain variability in apartment price.
apartments$Fit3_Fitted <- fit3$fitted.values
apartments$Price[apartments$ID=="2"] - apartments$Fit3_Fitted[apartments$ID=="2"]
## 2
## 427.8029