Abel Mate Nagy
library(readxl)
df <- read_xlsx("Apartments.xlsx")
Description:
Can differences in price per square meter of apartments be explained by the distance of the apartment from the city center and whether there is parking available?
# encode the binary 0/1 indicators as labelled factors
df$Parking <- factor(df$Parking, levels = c(0,1), labels = c("No", "Yes"))
df$Balcony <- factor(df$Balcony, levels = c(0,1), labels = c("No", "Yes"))
t.test(df$Price, mu = 1900)
##
## One Sample t-test
##
## data: df$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
Since the exercise asks for a test of the arithmetic mean, I use the one-sample t-test. With a sample size of 85, it is safe to assume that the Central Limit Theorem kicks in, which justifies the t-test even without assuming normality of the underlying distribution.
The conclusion of the test is that \(H_0\) can be rejected at \(p = 0.005\), meaning \(\mu_{Price} \neq 1900\). The sample mean of 2018.94 shows that the mean price is in fact higher than 1900.
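The same t statistic can be reproduced by hand from its definition, \(t = (\bar{x} - \mu_0) / (s / \sqrt{n})\); a quick sketch:
# manual check of the one-sample t statistic
x_bar <- mean(df$Price)
s <- sd(df$Price)
n <- nrow(df)
(x_bar - 1900) / (s / sqrt(n)) # should reproduce t = 2.9022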
fit1 <- lm(Price ~ Age, data = df)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
Regression coefficient estimate: a one-year increase in the age of the apartment is associated with an 8.975 EUR decrease in the price per square meter of the apartment on average.
Coefficient of correlation: We can obtain the correlation coefficient by taking the square root of R squared and attaching the sign of the slope. Since the slope is negative, \(r = -\sqrt{0.05302} = -0.2303\). This indicates a weak negative correlation between the age and the price per square meter of apartments.
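As a quick check, the signed correlation can also be computed directly:
cor(df$Price, df$Age) # should be about -0.230, matching the signed square root of R-squared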
Coefficient of determination: The coefficient of determination, or R squared, shows that \(5.3\%\) of the variability in price per square meter can be explained by the age of the apartment.
library(car)
scatterplotMatrix(df[, c("Price", "Age", "Distance")], smooth = FALSE)
As Age and Distance are the explanatory variables, I look at the plot between these two. There does not seem to be a strong correlation between them; the scatterplot looks rather random.
There is a clearer correlation between, e.g., Price and Age, but since Price is the dependent variable, this is not a multicollinearity concern: multicollinearity only involves relationships among the explanatory variables.
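A numeric counterpart to the scatterplot matrix is the correlation between the two explanatory variables; a quick sketch:
cor(df[, c("Age", "Distance")]) # the off-diagonal entry should be close to 0, consistent with the plot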
fit2 <- lm(Price ~ Age + Distance, data = df)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
vif(fit2)
## Age Distance
## 1.001845 1.001845
As the VIF statistic is close to 1 for both variables, there is no indication of multicollinearity. If either of them were above 5, it would be necessary to consider dropping one of the variables.
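With two explanatory variables the VIF can also be reproduced from the auxiliary regression of one on the other, since \(VIF = 1 / (1 - R^2_{aux})\); a minimal sketch:
# auxiliary regression of Age on Distance; its R-squared determines the VIF
r2_aux <- summary(lm(Age ~ Distance, data = df))$r.squared
1 / (1 - r2_aux) # should match vif(fit2), about 1.0018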
df$std_res <- round(rstandard(fit2), 3)
df$cook_d <- round(cooks.distance(fit2), 3)
hist(df$std_res)
hist(df$cook_d)
Based on the standardized residuals, the highest value is between 2.5 and 3, so none of the observations lie more than 3 standard deviations from the mean. To double-check, we can list the observations with the highest standardized residuals.
head(df[order(-df$std_res), ])
## # A tibble: 6 × 7
## Age Distance Price Parking Balcony std_res cook_d
## <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 5 45 2180 Yes Yes 2.58 0.32
## 2 2 11 2790 Yes No 2.05 0.069
## 3 18 1 2800 Yes No 1.78 0.03
## 4 18 1 2800 Yes Yes 1.78 0.03
## 5 8 2 2820 Yes No 1.66 0.037
## 6 10 1 2810 No No 1.60 0.032
Based on the Cook's distance values in the table, we can suspect a unit with high influence. We can also list the highest Cook's distances.
head(df[order(-df$cook_d), ])
## # A tibble: 6 × 7
## Age Distance Price Parking Balcony std_res cook_d
## <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 5 45 2180 Yes Yes 2.58 0.32
## 2 43 37 1740 No No 1.44 0.104
## 3 2 11 2790 Yes No 2.05 0.069
## 4 7 2 1760 No Yes -2.15 0.066
## 5 37 3 2540 Yes Yes 1.58 0.061
## 6 40 2 2400 No Yes 1.09 0.038
As this unit's Cook's distance clearly stands apart from the others, we can drop it.
library(dplyr)
# we know the exact cook_d, and that there is only one of these, so we can drop it based on that
df <- df %>% filter(cook_d != 0.320)
# fit the model again after dropping the influential observation
fit2_new <- lm(Price ~ Age + Distance, data = df)
# recompute the standardized residuals so the diagnostics below use the new fit
df$std_res <- round(rstandard(fit2_new), 3)
df$std_fitted <- scale(fit2_new$fitted.values)
car::scatterplot(x = df$std_fitted, y = df$std_res,
                 boxplots = FALSE,
                 smooth = FALSE,
                 regLine = FALSE)
The variance of the errors appears more or less constant, so the homoskedasticity assumption seems to be met.
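A formal test can complement the visual check; assuming the lmtest package is available, a Breusch-Pagan test is one option (a sketch, not part of the exercise output):
library(lmtest)
bptest(fit2_new) # H0: homoskedasticity; a large p-value would support the visual impression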
hist(df$std_res)
shapiro.test(df$std_res)
##
## Shapiro-Wilk normality test
##
## data: df$std_res
## W = 0.94879, p-value = 0.002187
The errors do not seem to be normally distributed based on the histogram. The Shapiro-Wilk test confirms this: we reject the null hypothesis that the errors are normally distributed at \(p = 0.002\).
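A normal Q-Q plot of the standardized residuals makes the departure visible as well; a quick sketch:
qqnorm(df$std_res) # tails deviating from the reference line indicate non-normality
qqline(df$std_res)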
fit2 <- lm(Price ~ Age + Distance, data = df)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -604.92 -229.63 -56.49 192.97 599.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2456.076 73.931 33.221 < 2e-16 ***
## Age -6.464 3.159 -2.046 0.044 *
## Distance -22.955 2.786 -8.240 2.52e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared: 0.4838, Adjusted R-squared: 0.4711
## F-statistic: 37.96 on 2 and 81 DF, p-value: 2.339e-12
Explanation of coefficients:
Age: a one-year increase in the age of the apartment is associated with a 6.464 EUR decrease in the price per square meter on average, provided all other variables are held constant.
Distance: a one-km increase in the distance of the apartment from the city center is associated with a 22.955 EUR decrease in the price per square meter on average, provided all other variables are held constant.
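Interval estimates for these effects can be obtained with confint; a quick sketch:
confint(fit2, level = 0.95) # 95% confidence intervals for the intercept, Age and Distance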
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = df)
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -473.21 -192.37 -28.89 204.17 558.77
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2329.724 93.066 25.033 < 2e-16 ***
## Age -5.821 3.074 -1.894 0.06190 .
## Distance -20.279 2.886 -7.026 6.66e-10 ***
## ParkingYes 167.531 62.864 2.665 0.00933 **
## BalconyYes -15.207 59.201 -0.257 0.79795
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared: 0.5275, Adjusted R-squared: 0.5035
## F-statistic: 22.04 on 4 and 79 DF, p-value: 3.018e-12
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 81 6176767
## 2 79 5654480 2 522287 3.6485 0.03051 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As we reject \(H_0\) at \(p = 0.031\) in this partial F-test, we can conclude that fit3 fits the data better than fit2.
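The partial F statistic in the table can be reproduced by hand from the two residual sums of squares; a sketch using the values above:
# F = ((RSS_restricted - RSS_full) / number of restrictions) / (RSS_full / residual df of full model)
rss_restricted <- 6176767
rss_full <- 5654480
((rss_restricted - rss_full) / 2) / (rss_full / 79) # should reproduce F = 3.6485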
Parking: Holding all other variables constant, apartments with a parking space have a 167.531 EUR higher price per square meter on average than those without.
Balcony: Holding all other variables constant, apartments with a balcony have a 15.207 EUR lower price per square meter on average than those without. Note, however, that this coefficient is far from significant (\(p = 0.798\)), so the data give no real evidence of a balcony effect.
Hypothesis being tested:
\[H_0: \rho^2 = 0\] \[H_1: \rho^2 > 0\]
where \(\rho^2\) is the population coefficient of determination (the alternative is one-sided because \(\rho^2\) cannot be negative). Based on the F-statistic of 22.04 on 4 and 79 degrees of freedom (\(p = 3.018 \times 10^{-12}\)), we reject \(H_0\): the model explains a statistically significant share of the variability in price per square meter.
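The reported p-value of the overall F-test can be recovered from the F distribution directly; a quick sketch:
pf(22.04, df1 = 4, df2 = 79, lower.tail = FALSE) # should be about 3.018e-12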
df$fitted_3 <- fit3$fitted.values
df$id <- seq(1, nrow(df))
# residual for the unit with id 2: observed price minus fitted value
res_id2 <- df$Price[df$id == 2] - df$fitted_3[df$id == 2]
res_id2
## 2
## 427.8029
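The same residual is also available directly from the fitted model object, assuming the row order of df matches the id column created above:
residuals(fit3)[2] # should match res_id2 computed above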