library(readxl)
aptdata <- read_xlsx("./Apartments.xlsx")
head(aptdata, 5)
## # A tibble: 5 × 5
## Age Distance Price Parking Balcony
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
Description:
Given the dataset above, a possible research question we could look into is to find out whether the proximity to the city center in Vienna (“distance”) have an impact on apartment prices.
# Categorical variables include "Parking" and "Balcony"
aptdata$ParkingF <- factor(aptdata$Parking,
levels = c(1, 0),
labels = c("Yes", "No")) # Found that "Yes" has more units which should be the reference point
aptdata$BalconyF <- factor(aptdata$Balcony,
levels = c(0, 1),
labels = c("No", "Yes"))
summary(aptdata[c(-4, -5)])
## Age Distance Price ParkingF BalconyF
## Min. : 1.00 Min. : 1.00 Min. :1400 Yes:43 No :48
## 1st Qu.:12.00 1st Qu.: 4.00 1st Qu.:1710 No :42 Yes:37
## Median :18.00 Median :12.00 Median :1950
## Mean :18.55 Mean :14.22 Mean :2019
## 3rd Qu.:24.00 3rd Qu.:20.00 3rd Qu.:2290
## Max. :45.00 Max. :45.00 Max. :2820
Defining hypotheses:
# Description summary of data
library(psych)
describe(aptdata$Price)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 85 2018.94 377.84 1950 1990.29 429.95 1400 2820 1420 0.54 -0.69 40.98
Explanation of some parameters
# Normality testing
# Using graphic visualisation (Histogram)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(aptdata, aes(x = Price)) +
geom_histogram(binwidth = 100, color = "black") +
xlab("Price (in EUR)") +
ylab("No. of Apartments")
# Using Shapiro Test
shapiro.test(aptdata$Price)
##
## Shapiro-Wilk normality test
##
## data: aptdata$Price
## W = 0.94017, p-value = 0.0006513
Assumption on normality
Looking at the histogram, the distribution appears somewhat asymmetric and slightly right-skewed, which may indicate that there are more apartments in the sample priced in the lower range (at around 1600 EUR) while only a few are priced at the higher end (above 2500 EUR). While this histogram may have shown that normality is not met to a certain extent, a Shapiro-Wilk test would provide a more accurate assessment.
Using the Shapiro test, we reject null hypothesis (H0) that the data is normally distributed as p-value is <0.0001. Hence, we can formally conclude that the data did not meet the normality requirement to continue with T-test. A non-parametric testing would be more appropriate.
Wilcoxon Signed Rank Test
As Price did not meet the normality requirement, non-parametric testing shall be utilised to determine whether Mu_Price = 1900 EUR
median(aptdata$Price)
## [1] 1950
wilcox.test(aptdata$Price,
mu = 1900,
correct = FALSE)
##
## Wilcoxon signed rank test
##
## data: aptdata$Price
## V = 2328, p-value = 0.02828
## alternative hypothesis: true location is not equal to 1900
library(effectsize)
##
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
##
## phi
effectsize(wilcox.test(aptdata$Price,
mu = 1900,
correct = FALSE))
## r (rank biserial) | 95% CI
## --------------------------------
## 0.27 | [0.04, 0.48]
##
## - Deviation from a difference of 1900.
interpret_rank_biserial(0.27, rules = "funder2019")
## [1] "medium"
## (Rules: funder2019)
Conclusion:
Using the sample data of apartments through the Wilcoson Signed Rank Test, we reject null hypothesis (Mu = 1900) as the results show that the median price of apartments in the Vienna region has a moderate significant difference from 1900 EUR (p < 0.05, r = 0.27 - medium effect). Median is found to be at 1950 EUR per m2, which is not very far from 1900 EUR in our hypothesis.
# Graphical visualisation (Scatterplot)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
scatterplot(aptdata$Price ~ aptdata$Age,
smooth = FALSE,
boxplots = FALSE,
ylab = "Price (in EUR)",
xlab = "Age of Apartment (in years)")
Observations
From the scatterplot, we can see that there is a negative correlation relationship between “Price” and “Age”. This suggests that the older the apartment is, the price of apartment would relatively be lower.
# Regression model (1)
fit1 <- lm(Price ~ Age,
data = aptdata)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = aptdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
# Coefficient of correlation
sqrt(summary(fit1)$r.squared)
## [1] 0.230255
Regression Results
(Intercept): The price of a brand-new apartment in the Vienna region is estimated to be priced at 2185.46 EUR per m2.
Age: When the age of an apartment increases by 1 year, its price per m2 would decrease 8.98 EUR on average.
# By scatterplot matrix
library(car)
scatterplotMatrix(aptdata[c("Price", "Age", "Distance")])
Analysis of scatterplot matrix
From the above scatterplot matrix between the dependent variable (Price) and its explanatory variables (Age, Distance), we can see that there is a negative linear relationship between Price and Age, as well as between Price and Distance. This suggests that price per m2 decreases if the age of the apartment is older and if it is located further away from the city center, assuming all other explanatory variables remain constant.
# By correlation matrix (in table form)
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
##
## describe
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(aptdata[c("Price", "Age", "Distance")]))
## Price Age Distance
## Price 1.00 -0.23 -0.63
## Age -0.23 1.00 0.04
## Distance -0.63 0.04 1.00
##
## n= 85
##
##
## P
## Price Age Distance
## Price 0.0340 0.0000
## Age 0.0340 0.6966
## Distance 0.0000 0.6966
Analysis of the correlation matrix
As seen from the correlation matrix above, there is strong negative linear relationship between Price (dependent variable) and its explanatory variables (Age and Distance) as they have a Pearson correlation coefficient of -0.23 and -0.63 respectively.
Conclusion on multicollinearity
From the scatterplot matrix, it can be seen that there is no strong relationship between the explanatory variables (Age and Distance). This can also be supported from the correlation matrix which indicates a Pearson correlation coefficient of 0.04 between Age and Distance. Hence, we can conclude that there is no potential problem of multicolinearity.
# Regression model (2)
fit2 <- lm(Price ~ Age + Distance,
data = aptdata)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = aptdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
Regression Results
vif(fit2)
## Age Distance
## 1.001845 1.001845
The VIF statistics calculated above to check for the strength of correlation between the explanatory variables of Age and Distance falls well below the threshold of 5. It can be concluded that there is no problem with too strong multicolinearity between Age and Distance.
aptdata$StdResid <- round(rstandard(fit2), 3)
aptdata$CooksD <- round(cooks.distance(fit2), 3)
aptdata$ID <- seq(1, nrow(aptdata))
head(aptdata[order(aptdata$StdResid), c("ID", "StdResid")], 5)
## # A tibble: 5 × 2
## ID StdResid
## <int> <dbl>
## 1 53 -2.15
## 2 13 -1.50
## 3 72 -1.50
## 4 20 -1.38
## 5 35 -1.26
head(aptdata[order(-aptdata$StdResid), c("ID", "StdResid")], 5)
## # A tibble: 5 × 2
## ID StdResid
## <int> <dbl>
## 1 38 2.58
## 2 33 2.05
## 3 2 1.78
## 4 61 1.78
## 5 58 1.66
Test for outliers
In determining for outliers, a range between -3 and +3 is normally used. Based on the above observations, all the values fall well between the range. Hence, there is no outliers in our data.
hist(aptdata$CooksD,
xlab = "Cooks Distance",
ylab = "Frequency",
main = "Histogram of Cooks Distance")
head(aptdata[order(-aptdata$CooksD), c("ID", "CooksD"), 8])
## # A tibble: 6 × 2
## ID CooksD
## <int> <dbl>
## 1 38 0.32
## 2 55 0.104
## 3 33 0.069
## 4 53 0.066
## 5 22 0.061
## 6 39 0.038
Test for units with high influence
From our test above, it can be seen that the unit with the largest impact that is significantly different from the rest is unit ID 38, scoring a Cook’s distance of 0.320. In addition, unit ID 55 seems to also have a slightly higher value as compared to the rest of the units. Hence, both unit ID 38 and 55 are to be removed from the data as shown below.
# To remove the units with high influence (i.e. CooksD > 0.3)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
##
## src, summarize
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
aptdata <- aptdata %>%
filter(!ID %in% c(38, 55))
# Re-estimation of regression model (2)
fit2 <- lm(Price ~ Age + Distance,
data = aptdata)
# Check for potential heteroskedasticity
aptdata$StdFitted <- scale(fit2$fitted.values)
aptdata$StdResid <- round(rstandard(fit2), 3)
library(car)
scatterplot(y = aptdata$StdResid, x = aptdata$StdFitted,
ylab = "Standardised Residuals",
xlab = "Standardised Fitted Values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
Observations
Based on the scatterplot graph between standardised residuals and standardised fitted values, the variability seems to slightly increase as we look from the negative values of standardised fitted values towards the positive values. Additionally, it can also be seen that there are more units distributed on the positive side of standardised fitted values. Hence, this could suggest that potential heteroskedasticity may be present in the dataset.
# Formal testing of heteroskedasticity
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit2)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## -----------------------------
## DF = 1
## Chi2 = 3.775135
## Prob > Chi2 = 0.05201969
Breusch Pagan Test - Findings
From the results above, we cannot reject the null hypothesis that the variance of errors is constant as p-value is above 0.05. Hence, we can formally verify that there is homoskedasticity in the variance of errors.
# Graphic Visualisation
hist(aptdata$StdResid,
xlab = "Standardised residuals",
ylab = "Frequency",
main = "Histogram of standardised residuals")
# Using Shapiro Test
shapiro.test(aptdata$StdResid)
##
## Shapiro-Wilk normality test
##
## data: aptdata$StdResid
## W = 0.95952, p-value = 0.01044
Findings on normal distribution of standardised residuals
Based on the distribution of the standardised residuals through the histogram graph, we can see that the residuals appears to be somewhat asymmetrical and bimodal (having 2 peaks). Through the formal Shapiro-Wilks normality test, its p-value is calculated to be lesser than 0.05. Hence, we have to reject the null hypothesis that errors are normally distributed, which means the errors in the population are most likely not normally distributed.
Fortunately, our sample size is large enough which includes 85 apartments (including the units of high influence that we have removed earlier on) so the assumption of normal distribution of errors is not as important.
fit2 <- lm(Price ~ Age + Distance,
data = aptdata)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = aptdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -627.27 -212.96 -46.23 205.05 578.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2490.112 76.189 32.684 < 2e-16 ***
## Age -7.850 3.244 -2.420 0.0178 *
## Distance -23.945 2.826 -8.473 9.53e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared: 0.4968, Adjusted R-squared: 0.4842
## F-statistic: 39.49 on 2 and 80 DF, p-value: 1.173e-12
# Coefficient of correlation
sqrt(summary(fit2)$r.squared)
## [1] 0.7048456
Explanation of regression results
fit3 <- lm(Price ~ Age + Distance + ParkingF + BalconyF,
data = aptdata)
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingF + BalconyF
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 80 5982100
## 2 78 5458696 2 523404 3.7395 0.02813 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Conclusion
Using the ANOVA function to determine which model (model fit2 or fit3) fits the data better, the results above show that model fit3 fits the data better (p = 0.03) as compared to model fit2.
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = aptdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -499.06 -194.33 -32.04 219.03 544.31
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2527.821 79.105 31.955 < 2e-16 ***
## Age -7.197 3.148 -2.286 0.02499 *
## Distance -21.241 2.911 -7.296 2.14e-10 ***
## ParkingFNo -168.921 62.166 -2.717 0.00811 **
## BalconyFYes -6.985 58.745 -0.119 0.90566
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 264.5 on 78 degrees of freedom
## Multiple R-squared: 0.5408, Adjusted R-squared: 0.5173
## F-statistic: 22.97 on 4 and 78 DF, p-value: 1.449e-12
Explanation of coefficients for categorical variables in regression model 3
Hypothesis tested with F-statistics
F-statistics is to test whether our linear regression model that we are working on fits the data better than a model that contains no explanatory variables.
Hypotheses in F-statistics:
From the regression result above, we reject null hypothesis at p < 0.001. This suggests that population determination coefficient is greater than 0, where at least part of the variability of the dependent variable can be explained by the linear effect of the explantory variable. Hence, we can conclue that the overall regression model is significant.
aptdata$Fitted <- fitted.values(fit3)
aptdata$Residuals <- residuals(fit3)
head(aptdata[colnames(aptdata) %in% c("ID", "Price", "Fitted", "Residuals")], 3)
## # A tibble: 3 × 4
## Price ID Fitted Residuals
## <dbl> <int> <dbl> <dbl>
## 1 1640 1 1707. -66.8
## 2 2800 2 2377. 423.
## 3 1660 3 1714. -53.8
Based on the results of model fit3, the expected price per m2 for the apartment ID2 would be 2377 EUR. However, its actual price is listed higher than expected at 2800 EUR per m2. With these 2 values, the difference between the actual and estimated price per m2 (residual) for apartment ID2 is 423 EUR