library(readxl)
mydata <- read_xlsx("Apartments.xlsx")
mydata <- as.data.frame(mydata)
head(mydata)
## Age Distance Price Parking Balcony
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Description: the dataset contains 85 apartments with their Age (in years), Distance, Price (in € per square meter), and two 0/1 indicators, Parking and Balcony. We first recode the two indicators as labelled factors:
library(dplyr)
mydata$ParkingF <- factor(mydata$Parking,
                          levels = c(0, 1),
                          labels = c("No", "Yes"))
mydata$BalconyF <- factor(mydata$Balcony,
                          levels = c(0, 1),
                          labels = c("No", "Yes"))
head(mydata)
## Age Distance Price Parking Balcony ParkingF BalconyF
## 1 7 28 1640 0 1 No Yes
## 2 18 1 2800 1 0 Yes No
## 3 7 28 1660 0 0 No No
## 4 28 29 1850 0 1 No Yes
## 5 18 18 1640 1 1 Yes Yes
## 6 28 12 1770 0 1 No Yes
t.test(mydata$Price,
       mu = 1900,
       alternative = "two.sided")
##
## One Sample t-test
##
## data: mydata$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
Conclusion
Since the p-value is 0.005, we reject the null hypothesis.
This shows that the average price per square meter of an apartment is significantly different from 1900€.
The sample mean is 2018.94€, which suggests the true mean is higher than 1900€.
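As a quick cross-check, the t-statistic can be reproduced by hand from the sample mean, standard deviation, and sample size; a minimal base-R sketch:
# Reproduce the one-sample t-test by hand (sketch; uses mydata from above)
xbar <- mean(mydata$Price)                  # sample mean, ≈ 2018.94
s    <- sd(mydata$Price)                    # sample standard deviation
n    <- nrow(mydata)                        # sample size, 85 (so df = 84)
t_stat <- (xbar - 1900) / (s / sqrt(n))     # should match t = 2.9022
p_val  <- 2 * pt(-abs(t_stat), df = n - 1)  # two-sided p-value, ≈ 0.0047
c(t = t_stat, p = p_val)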
fit1 <- lm(Price ~ Age, data = mydata)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
cor(mydata$Price, mydata$Age)
## [1] -0.230255
Conclusion
The estimated regression function is: Price = 2185.45 − 8.98 × Age
Regression coefficient: for each additional year of apartment age, the price per square meter decreases by approximately 8.98€ on average, assuming all other factors remain the same.
Coefficient of correlation: the correlation between Price and Age is approximately −0.23, indicating a weak negative linear relationship.
Coefficient of determination (R²): the R² value of 0.053 shows that about 5.3% of the variation in apartment prices is explained by their age.
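In simple linear regression R² is just the squared correlation coefficient, which is easy to verify directly:
# R² equals the squared Price-Age correlation in simple regression
cor(mydata$Price, mydata$Age)^2   # (-0.230255)^2 ≈ 0.053, matches Multiple R-squared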
library(car)
scatterplotMatrix(mydata[, c("Price", "Age", "Distance")],
                  smooth = FALSE,
                  main = "Scatterplot Matrix: Price, Age, Distance")
Conclusion
Based on a visual check of the scatterplot matrix, there is no strong correlation between the independent variables, so we do not observe any strong signs of multicollinearity.
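To back the visual impression with numbers, the pairwise correlations can be inspected as well (a small sketch):
# Pairwise correlations among the three plotted variables
cor(mydata[, c("Price", "Age", "Distance")])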
fit2 <- lm(Price ~ Age + Distance, data = mydata)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
library(car)
vif(fit2)
## Age Distance
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845
Conclusion:
In this case, both variables Age and Distance have VIF values of 1.0018, and the mean VIF is also 1.0018.
Since all VIF values are well below the threshold of 5, there is no indication of multicollinearity in the model.
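For intuition, each VIF can be reproduced by hand as 1 / (1 − R²j), where R²j comes from regressing one predictor on the other(s); a minimal sketch for Age:
# VIF for Age by hand: regress Age on the remaining predictor
r2_age <- summary(lm(Age ~ Distance, data = mydata))$r.squared
1 / (1 - r2_age)   # ≈ 1.0018, matches vif(fit2)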
mydata$StdResid <- round(rstandard(fit2), 3)
mydata$CooksD <- round(cooks.distance(fit2), 3)
hist(mydata$CooksD,
     xlab = "Cook's distance",
     ylab = "Frequency",
     main = "Histogram of Cook's distances")
hist(mydata$StdResid,
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals")
mydata$ID <- seq(1, nrow(mydata))
head(mydata[order(mydata$StdResid), c("ID", "StdResid")], 3)
## ID StdResid
## 53 53 -2.152
## 13 13 -1.499
## 72 72 -1.499
head(mydata[order(-mydata$StdResid), c("ID", "StdResid")], 3)
## ID StdResid
## 38 38 2.577
## 33 33 2.051
## 2 2 1.783
head(mydata[order(-mydata$CooksD), c("ID", "CooksD")], 6)
## ID CooksD
## 38 38 0.320
## 55 55 0.104
## 33 33 0.069
## 53 53 0.066
## 22 22 0.061
## 39 39 0.038
mydata_clean <- mydata %>%
  filter(!ID %in% c(38, 55))
Conclusion:
We used standardized residuals and Cook's distance to check for outliers and high-impact units in the model fit2.
We then created a new dataset without the two high-impact units (ID 38 and ID 55) to improve the model.
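As a complement to the visual inspection, a common rule of thumb (not used in the source analysis, added here for illustration) flags observations with Cook's distance above 4/n:
# Rule-of-thumb check: Cook's distance > 4/n (a common convention, assumption)
cutoff <- 4 / nrow(mydata)                         # 4/85 ≈ 0.047
mydata[mydata$CooksD > cutoff, c("ID", "CooksD")]  # flags ID 38 and 55, among a few others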
mydata$StdFitted <- scale(fit2$fitted.values)
library(car)
scatterplot(y = mydata$StdResid,
            x = mydata$StdFitted,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)
Conclusion:
From the plot we do not see any clear pattern, such as a funnel shape or increasing spread.
The residuals seem to be randomly scattered around zero across all fitted values. Therefore we assume homoscedasticity.
Below you can also see the formal test for homoscedasticity, which confirms that the error variance is constant (p = 0.33).
library(olsrr)
ols_test_breusch_pagan(fit2)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## ---------------------------
## DF = 1
## Chi2 = 0.968106
## Prob > Chi2 = 0.325153
hist(mydata$StdResid,
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals")
shapiro.test(mydata$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata$StdResid
## W = 0.95303, p-value = 0.003645
Conclusion
Since the p-value is 0.004, we reject the null hypothesis of the Shapiro-Wilk test. This means the errors are not normally distributed.
With a relatively large sample, the test detects even modest deviations from normality, so this result was expected, but it still needs to be considered when checking the assumptions of linear regression.
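A Q-Q plot is a useful visual complement to the Shapiro-Wilk test:
# Visual normality check of the standardized residuals
qqnorm(mydata$StdResid, main = "Normal Q-Q plot of standardized residuals")
qqline(mydata$StdResid)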
The estimated regression function of fit2 is: Price = 2460.10 − 7.93 × Age − 20.67 × Distance
Interpretation of coefficients: holding Distance constant, each additional year of age lowers the expected price per square meter by about 7.93€ on average; holding Age constant, each additional unit of Distance lowers it by about 20.67€ on average.
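To illustrate how the regression function is used, here is a small prediction sketch for a hypothetical apartment (the values Age = 10 and Distance = 5 are made up for illustration):
# Predicted price per m² for a hypothetical apartment
predict(fit2, newdata = data.frame(Age = 10, Distance = 5))
# by hand: 2460.101 - 7.934 * 10 - 20.667 * 5 ≈ 2277.4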
fit3 <- lm(Price ~ Age + Distance + ParkingF + BalconyF, data = mydata_clean)
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = mydata_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -499.06 -194.33 -32.04 219.03 544.31
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2358.900 93.664 25.185 < 2e-16 ***
## Age -7.197 3.148 -2.286 0.02499 *
## Distance -21.241 2.911 -7.296 2.14e-10 ***
## ParkingFYes 168.921 62.166 2.717 0.00811 **
## BalconyFYes -6.985 58.745 -0.119 0.90566
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 264.5 on 78 degrees of freedom
## Multiple R-squared: 0.5408, Adjusted R-squared: 0.5173
## F-statistic: 22.97 on 4 and 78 DF, p-value: 1.449e-12
fit2_clean <- lm(Price ~ Age + Distance, data = mydata_clean)
fit3 <- lm(Price ~ Age + Distance + ParkingF + BalconyF, data = mydata_clean)
anova(fit2_clean, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingF + BalconyF
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 80 5982100
## 2 78 5458696 2 523404 3.7395 0.02813 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Conclusion
Since the p-value is 0.03, we reject the null hypothesis. This means that adding the categorical variables Parking and Balcony leads to a significant improvement in the model. These variables therefore explain additional variation in apartment prices beyond what is already explained by Age and Distance.
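The partial F-statistic in the table can be reproduced by hand from the two residual sums of squares reported above:
# Partial F-test by hand: (ΔRSS / Δdf) / (RSS_full / df_full)
F_stat <- ((5982100 - 5458696) / 2) / (5458696 / 78)   # ≈ 3.74
pf(F_stat, df1 = 2, df2 = 78, lower.tail = FALSE)      # ≈ 0.028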
Categorical variables (from the fit3 summary above):
The coefficient of ParkingFYes is 168.92 (p-value = 0.0081): apartments with parking cost on average 168.92€ per m² more than those without parking, holding all other variables constant.
The coefficient of BalconyFYes is −6.99 (p-value = 0.9057), which is not statistically significant: there is no significant effect of having a balcony on the apartment price.
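The parking effect can be made concrete by predicting prices for two hypothetical apartments that differ only in parking (the values Age = 15 and Distance = 10 are made up for illustration):
# Two otherwise identical hypothetical apartments, without and with parking
newapts <- data.frame(Age = 15, Distance = 10,
                      ParkingF = factor(c("No", "Yes"), levels = c("No", "Yes")),
                      BalconyF = factor(c("No", "No"), levels = c("No", "Yes")))
diff(predict(fit3, newdata = newapts))   # ≈ 168.92€, the ParkingFYes coefficient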
The F-test at the bottom of the summary output tests the hypothesis H0: all regression coefficients are equal to zero, against H1: at least one coefficient differs from zero.
Since the p-value is 1.449e-12, we reject H0. This means the model with Age, Distance, Parking, and Balcony explains a significant amount of variation in average apartment prices.
fitted_vals <- fitted(fit3)
residual_vals <- resid(fit3)
fitted_vals[2]
## 2
## 2377.043
residual_vals[2]
## 2
## 422.9572
Manually:
2358.900 + (−7.197 ⋅ 18) + (−21.241 ⋅ 1) + (168.921 ⋅ 1) + (−6.985 ⋅ 0) = 2377.03€ (the small difference from 2377.043 is due to coefficient rounding)
Residual for ID 2: 2800€ − 2377.03€ ≈ 423€
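The same fitted value is returned directly by predict(), confirming the manual calculation:
# Cross-check: row 2 of the cleaned data is apartment ID 2
predict(fit3, newdata = mydata_clean[2, ])   # ≈ 2377.04, as above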