library(readxl)
data <- read_xlsx("Apartments.xlsx")
Description:
data$Parking <- factor(data$Parking, levels = c(0, 1), labels = c("No", "Yes"))
data$Balcony <- factor(data$Balcony, levels = c(0, 1), labels = c("No", "Yes"))
t.test(data$Price, muh = 1900, alternative = "two.sided")
##
## One Sample t-test
##
## data: data$Price
## t = 49.263, df = 84, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
It can be concluded that the arithmetic mean of price per m² is significantly different from 1900 (p<0.001). It is estimated to be higher at 2018.941.
fit1 <- lm(Price ~ Age, data = data)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
cor(data$Age, data$Price)
## [1] -0.230255
The value for b0 is 2185.455. This means that if the apartment is 0 years old, its price per m² is on average 2185.455 euros (p<0.001).
The value for b1 is -8.975. This means that if an apartment age increases by one year, its price per m² decreases by 8.975 euros on average (p=0.035).
The coefficient of correlation is -0.230255. This means that Age and Price are weakly negatively correlated. Meaning that when the Age of the apartment risis, the Price per m² tends to go down.
The coefficient of determination is 0.05302. This means that 5% of the variability of Price per m² can be explained by the age of the apartment. As this number is really low, we should consider adding a few variables.
library(car)
scatterplotMatrix(data[ ,c(-4,-5), smooth = FALSE])
There does not seem to be a potential problem with multicolinearity. There is only a slight negative relationship between the age of the apartment in years and the price in m². There is a stronger negative correlation between the distance from the city center and the price per m², but it also does not look like there is a problem with multicolinearity.
fit2 <- lm(Price ~ Age + Distance, data = data)
vif(fit2)
## Age Distance
## 1.001845 1.001845
The values are very close to one. This means that there is almost no multicolinearity in the model.
data$StdResid <- round(rstandard(fit2), 3)
data$CooksD <- round(cooks.distance(fit2), 3)
hist(data$StdResid, xlab = "Standardized residuals", ylab = "Frequency", main = "Histogram of Standardized Residuals") #no problematic units
hist(data$CooksD, xlab = "Cooks Distance", ylab = "Frequency", main = "Histogram of Cooks distances")
head(data[order(-data$CooksD),],5 )
## # A tibble: 5 × 7
## Age Distance Price Parking Balcony StdResid CooksD
## <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 5 45 2180 Yes Yes 2.58 0.32
## 2 43 37 1740 No No 1.44 0.104
## 3 2 11 2790 Yes No 2.05 0.069
## 4 7 2 1760 No Yes -2.15 0.066
## 5 37 3 2540 Yes Yes 1.58 0.061
which(data$Price == 2180) # checking if any other observation has this value
## [1] 38
library(dplyr)
data <- data %>% filter(!Price == 2180)
hist(data$CooksD, xlab = "Cooks Distance", ylab = "Frequency", main = "Histogram of Cooks distances") # There are still "jumps"
which(data$Price == 1740) #checking if there are any observations which have these values as well
## [1] 54
which(data$Price == 2790)
## [1] 33
which(data$Price == 1760)
## [1] 52
which(data$Price == 2540)
## [1] 22
data <- data %>% filter(!Price %in% c(1740, 2790, 1760, 2540))
hist(data$CooksD, xlab = "Cooks Distance", ylab = "Frequency", main = "Histogram of Cooks distances")
fit2 <- lm(Price ~ Age + Distance, data = data)
data$StdResid <- round(rstandard(fit2), 3)
data$StdFitted <- scale(fit2$fitted.values)
plot(x = data$StdFitted, y = data$StdResid, xlab = "Standardized residuals", ylab = "Standardized fitted values") #There might be Heteroskedasticity present
abline(0,0, col = "red")
library(olsrr)
ols_test_breusch_pagan(fit2)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## ----------------------------
## DF = 1
## Chi2 = 1.738591
## Prob > Chi2 = 0.1873174
The nullhypothesis can not be rejected (p=0.188). Therefore the assumption of homoscedasticity is not violated.
hist(data$StdResid, xlab = "Standardized residuals", ylab = "Frequency", main = "Histogram of standardized residuals")
shapiro.test(data$StdResid)
##
## Shapiro-Wilk normality test
##
## data: data$StdResid
## W = 0.94154, p-value = 0.001166
As p<0.005 we reject the nullhypothesis that the errors are normally distributed. We can also see this when looking at the distribution of the standardized residuals in the Histogramm, as it does not resemble a normal distribution.
#fit2 was already estimated again in 10.
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -411.50 -203.69 -45.24 191.11 492.56
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2502.467 75.024 33.356 < 2e-16 ***
## Age -8.674 3.221 -2.693 0.00869 **
## Distance -24.063 2.692 -8.939 1.57e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared: 0.5361, Adjusted R-squared: 0.524
## F-statistic: 44.49 on 2 and 77 DF, p-value: 1.437e-13
The intercept is 2502.467 and is the point where the regression line intersects with the y-axis. This means that when an apartment is 0 years old and is in the city center, its price per m² is on average 2329.724 euros (p>0.001).
The value for b1 is -8.674. This means, when an apartments ages by one more year, its price per m² decreases by -8.674 euros on average, when all other variables stay the same (p=0.009).
The value for b2 is -24.063. This means that the price of an apartment per m² decreases by 24.063 euros on average, for every additional kilometer it is distanted from the city center, when all other variables stay the same (p<0.001).
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = data)
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 77 5077362
## 2 75 4791128 2 286234 2.2403 0.1135
We can not reject the nullhypothesis. This means that the model fit3 does not fit the data significantly better than model fit2.
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -390.93 -198.19 -53.64 186.73 518.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2393.316 93.930 25.480 < 2e-16 ***
## Age -7.970 3.191 -2.498 0.0147 *
## Distance -21.961 2.830 -7.762 3.39e-11 ***
## ParkingYes 128.700 60.801 2.117 0.0376 *
## BalconyYes 6.032 57.307 0.105 0.9165
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared: 0.5623, Adjusted R-squared: 0.5389
## F-statistic: 24.08 on 4 and 75 DF, p-value: 7.764e-13
The value for b3 is 128.7. This means that when controlling for all other variables, an apartment is on average 128.7 euros per m² more expensive, when it has parking, compared to when it has not (p=0.038)
The value for b4 is 6.032. The test for the linear regression coefficient is not significant. This means that it wether the apartment has a balcony or not, has no significant influence on the price per m².
F-Test
We can reject the nullhypothesis and can say that at least one explanatory variable has a significant influence on the price per m² of an apartment.
data$Fitted <- fitted.values(fit3)
data$Residuals <- residuals(fit3)
data[2, c(-6,-7,-8)]
## # A tibble: 1 × 7
## Age Distance Price Parking Balcony Fitted Residuals
## <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 18 1 2800 Yes No 2357. 443.