library(readxl)
apartments_db <- read_xlsx("./Apartments.xlsx")
apartments_db <- as.data.frame(apartments_db)
Description:
A possible, and the most reasonable research question with the data that we have is:
To what extent do apartment characteristics such as age, distance from the city center, and the presence of a parking space or balcony influence the price per square meter?
Alternatives research questions could be:
apartments_db$ParkingFactor <- factor(apartments_db$Parking,
levels = c(0,1),
labels = c("No","Yes"))
apartments_db$BalconyFactor <- factor(apartments_db$Balcony,
levels = c(0,1),
labels = c("No","Yes"))
To perform this test, we first need to select which type of hypothesis fits the best with the null hypothesis. Then, we have to check the parametric test assumptions in order to see which subtype of test to perform.
Assumption #1: The research variable (income) is numeric
str(apartments_db$Price)
## num [1:85] 1640 2800 1660 1850 1640 1770 1850 1970 2270 2570 ...
Conclusion: The variable is numerical
Assumption #2: The variable of interest is normally distributed
For cheking this assumption, first, a histogram will be plotted and then analyzed
library(ggplot2)
ggplot(data = apartments_db, aes(x=Price)) +
geom_histogram(binwidth = 100, colour="gray", fill = "blue") +
ylab("Frequency")
At first look, the histogram look a little skewed to the right. As no conclusions can be extracted from this graph, in order to check normality, a more formal test will be conducted
library(rstatix)
shapiro_test(apartments_db$Price)
## # A tibble: 1 × 3
## variable statistic p.value
## <chr> <dbl> <dbl>
## 1 apartments_db$Price 0.940 0.000651
Conclusion: We reject \(H_0\) with a P-value < 0.001, meaning that the Price variable is not normally distributed, and thus we need to perform a wilcoxon signed-rank test instead of a t-test
Wilcoxon Signed Rank Test
mu_price <- 1900
wilcox.test(apartments_db$Price,
mu = mu_price,
correct = FALSE)
##
## Wilcoxon signed rank test
##
## data: apartments_db$Price
## V = 2328, p-value = 0.02828
## alternative hypothesis: true location is not equal to 1900
Conclusion: We reject \(H_0\) with a P-value = 0.03, meaning that the median price does not equal 1,900 EUR, and that 50% of the data is equal or less than this figure. In order to assess the effect the size of the deviation from the sample estimate and the one from the \(H_0\), the effect size will be calculated and interpreted
library(effectsize)
effectsize <- effectsize(wilcox.test(apartments_db$Price,
mu = mu_price,
correct = FALSE))
interpret_rank_biserial(effectsize)
## r (rank biserial) | 95% CI | Interpretation
## -------------------------------------------------
## 0.27 | [0.04, 0.48] | medium
##
## - Deviation from a difference of 1900.
## - Interpretation rule: funder2019
Even though the null hypothesis was rejected, the deviation of the sample estimate was medium, meaning that the size of the deviation of the sample from the null hypothesis was neither large nor small
For this simple regression we will estimate the simple following function: \(\hat{Price}=b_0+b_1\times age\)
fit1 <- lm(Price ~ Age,
data = apartments_db)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = apartments_db)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
sqrt(summary(fit1)$r.squared)
## [1] 0.230255
The explanations of the values are as followed:
library(car)
scatterplot_subset <- apartments_db[,c(3,1,2)]
scatterplotMatrix(scatterplot_subset,
smooth = FALSE)
Conclusion: Based on the scatterplot matrix, no multiconllinearity was detected among all the variables. Moreover, in the theory, there shouldn’t be also a potential problem of perfect (or strong) correlation among variables. This since they are different from the conceptual basis, and thus none of the explanatory variables should be a linear combination of the other.
fit2 <- lm(Price ~ Age + Distance,
data = apartments_db)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = apartments_db)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
vif(fit2)
## Age Distance
## 1.001845 1.001845
Findings: Just as previously mentioned and analyzed in the scatterplot, since both figures are below 5, we can conclude that there is no multicolinearity among variables included in the model. Another interpretation is that none of the explanatory variables (age or distance) is a linear combination of the other.
First, an standardized residuals histogram will be plotted in order to see if any values are outside +-3 (which we could consider as outliers)
apartments_db$SD_Residuals <- round(rstandard(fit2), 3)
hist(apartments_db$SD_Residuals,
xlab = "Standardized Residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
Conclusion: As all values for standardized residuals are fitted between +- 3, no outliers were detected, and thus there is no necessity to drop any values
Now, we will calculate Cook’s Distances in order to check whether there are units with high impact or not
apartments_db$CD <- round(cooks.distance(fit2), 3)
hist(apartments_db$CD,
xlab = "Cook's Distances",
ylab = "Frequency",
main = "Histogram of Cook's Distances")
Conclusion: A jump in the values can be appreciated for Cook’s Distances above from 0.15. Since this would affect our model, and the conclusions we could take from it, those units with high impact (with a distance higher than 0.15) will be excluded.
For the latter, first we will check which units are the ones with high impact, then we will filter them, and then we will check again to make sure they were removed
# Checking units again
head(apartments_db[order(-apartments_db$CD),],5)
## Age Distance Price Parking Balcony ParkingFactor BalconyFactor SD_Residuals CD
## 38 5 45 2180 1 1 Yes Yes 2.577 0.320
## 55 43 37 1740 0 0 No No 1.445 0.104
## 33 2 11 2790 1 0 Yes No 2.051 0.069
## 53 7 2 1760 0 1 No Yes -2.152 0.066
## 22 37 3 2540 1 1 Yes Yes 1.576 0.061
We need to remove the first, but is easier to just exclude everything above 0.15
# Filtering units
library(dplyr)
apartments_db <- apartments_db %>%
filter(CD < 0.15)
# Checking units
head(apartments_db[order(-apartments_db$CD),],5)
## Age Distance Price Parking Balcony ParkingFactor BalconyFactor SD_Residuals CD
## 54 43 37 1740 0 0 No No 1.445 0.104
## 33 2 11 2790 1 0 Yes No 2.051 0.069
## 52 7 2 1760 0 1 No Yes -2.152 0.066
## 22 37 3 2540 1 1 Yes Yes 1.576 0.061
## 38 40 2 2400 0 1 No Yes 1.091 0.038
The process is now done, and the unit with high impact was removed. This will not be repeated, since we would be trapped in a loop where always there would be units with a higher CD.
First, in order for interpret the results with the variable we just excluded, we have to refit the model. This, since the amount of fitted values that we are trying to set is 85, but now as we removed 1 apartment from our data, an error will be generated when trying to plug them into 84 rows. After that, the scatterplot was plotted.
fit2 <- lm(Price ~ Age + Distance,
data = apartments_db)
apartments_db$SD_Fitted <- scale(fit2$fitted.values)
scatterplot(y = apartments_db$SD_Residuals, x = apartments_db$SD_Fitted,
xlab = "Standardized Fitted",
ylab = "Standardized Residuals",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
Findings: Even though we can observe a concentration of values around the right hand side of the graph, is not sufficient to state that there is an “opening” pattern. Moreover, as this is just an appreciation, a formal test will be conducted (even though it was not asked), just the prove the intuition
#install.packages("olsrr")
library(olsrr)
ols_test_breusch_pagan(fit2)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## -----------------------------
## DF = 1
## Chi2 = 2.927455
## Prob > Chi2 = 0.08708469
In fact, we cannot reject \(H_0\), which means that the variance is constant, and thus Homokedasticity is assumed
apartments_db$SD_Residuals <- round(rstandard(fit2), 3)
hist(apartments_db$SD_Residuals,
xlab = "Standardized Residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals",
breaks = seq(from = -3, to = 3, by = 0.20))
shapiro.test(apartments_db$SD_Residuals)
##
## Shapiro-Wilk normality test
##
## data: apartments_db$SD_Residuals
## W = 0.95649, p-value = 0.006355
Findings: First, when analyzing the corrected standardized residuals histogram (this since the breaks were adjusted), we can appreciate that the distribution of the former it does not look normally distributed (even having more than one mode or peaks). This was further confirmed when doing the formal Shapiro-Wilk test, where we can reject \(H_0\) with a P-Value = 0.007, meaning that the errors are not normally distributed. Moreover, as the sample is higher than 30, due to the central limit theorem, we can assume that the errors are normally distributed, so we can continue with the experiment.
The model was previously refitted excluding the units with high impact, so the summary will be shown, and then the variables will be explained directly
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = apartments_db)
##
## Residuals:
## Min 1Q Median 3Q Max
## -604.92 -229.63 -56.49 192.97 599.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2456.076 73.931 33.221 < 2e-16 ***
## Age -6.464 3.159 -2.046 0.044 *
## Distance -22.955 2.786 -8.240 2.52e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared: 0.4838, Adjusted R-squared: 0.4711
## F-statistic: 37.96 on 2 and 81 DF, p-value: 2.339e-12
fit3 <- lm(Price ~ Age + Distance + ParkingFactor + BalconyFactor,
data = apartments_db)
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingFactor + BalconyFactor
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 81 6176767
## 2 79 5654480 2 522287 3.6485 0.03051 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Conclusion: We reject \(H_0\) with a P-Value = 0.04, meaning that the the differences are statistically significant, and we should choose fit3 over fit2
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingFactor + BalconyFactor,
## data = apartments_db)
##
## Residuals:
## Min 1Q Median 3Q Max
## -473.21 -192.37 -28.89 204.17 558.77
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2329.724 93.066 25.033 < 2e-16 ***
## Age -5.821 3.074 -1.894 0.06190 .
## Distance -20.279 2.886 -7.026 6.66e-10 ***
## ParkingFactorYes 167.531 62.864 2.665 0.00933 **
## BalconyFactorYes -15.207 59.201 -0.257 0.79795
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared: 0.5275, Adjusted R-squared: 0.5035
## F-statistic: 22.04 on 4 and 79 DF, p-value: 3.018e-12
#Saving fitted values
apartments_db$SD_Fitted3 <- scale(fit3$fitted.values)
apartments_db$Residual <- apartments_db$Price - apartments_db$SD_Fitted3
apartments_db$ID <- seq(1,nrow(apartments_db))
apartments_db <- apartments_db %>% relocate(ID)
apartments_db[apartments_db$ID == 2,]$Residual
## [,1]
## 2 2798.712
The residual for apartment with ID #2, is 2,798.71 EUR