library(readxl)
## Warning: package 'readxl' was built under R version 4.2.3
df <- as.data.frame(read_xlsx("./Apartments.xlsx"))
Description:
“Does distance from city centre affect the price of an apartment? If so, how big is its impact?”
df$Parking <- factor(df$Parking, levels = c(0, 1), labels = c("No", "Yes"))
df$Balcony <- factor(df$Balcony, levels = c(0, 1), labels = c("No", "Yes"))
A one-sample t-test is employed to test whether the population mean deviates from a hypothesised value, here 1900. Based on the p-value, \(0.004731\), the null hypothesis can be rejected at \(p < 0.05\); the 95% confidence interval (1937.4 to 2100.4) likewise excludes 1900. As such, there is evidence that the true mean price differs from 1900.
t.test(df$Price,
mu = 1900,
alternative = "two.sided")
##
## One Sample t-test
##
## data: df$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
Regression coefficient: The regression coefficient shows the expected change in the dependent variable when the independent variable increases by one unit, holding all other variables constant.
Correlation coefficient: The correlation coefficient (\(\rho\)) measures the direction and strength of the linear relationship between two variables. If \(\rho\) is positive, the two variables move together; if \(\rho\) is negative, they move in opposite directions.
Coefficient of determination: The coefficient of determination (\(R^2\)) gives the proportion of the variance in the dependent variable that is explained by the independent variable(s).
fit1 <- lm(Price ~ Age, data = df)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
print(paste("Regression Coefficient:", round(coef(fit1)[2], 4)))
## [1] "Regression Coefficient: -8.9751"
print(paste("Correlation Coefficient:", round(cor(df$Price, df$Age), 4)))
## [1] "Correlation Coefficient: -0.2303"
print(paste("Coefficient of Determination:", round(summary(fit1)$r.squared, 4)))
## [1] "Coefficient of Determination: 0.053"
Multicollinearity describes a situation where two or more independent variables correlate strongly with one another.
Based on the scatter plot matrix, the correlations of Age~Distance and Age~Price are close to zero, i.e. these variables are essentially uncorrelated. The correlation between Distance and Price appears negative, though not strongly so. Since the two independent variables, Age and Distance, are essentially uncorrelated, multicollinearity does not appear to be a concern for this model.
library(car)
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.2.3
scatterplotMatrix(df[, c(1, 2, 3)], smooth = FALSE)
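As a numeric complement to the visual inspection, the pairwise correlations can be printed directly (assuming, as in the scatter plot matrix, that the first three columns of df are Age, Distance and Price):
round(cor(df[, c(1, 2, 3)]), 3)  # pairwise correlation matrix of Age, Distance, Price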
Having fitted the model, it is necessary to check the assumptions before analysing the results from the regression model.
fit2 <- lm(Price ~ Age + Distance, data = df)
VIF measures how much the variance of a regression coefficient is inflated due to multicollinearity.
vif(fit2)
## Age Distance
## 1.001845 1.001845
print(paste("VIF Statistics:", mean(vif(fit2))))
## [1] "VIF Statistics: 1.0018445034083"
Before removing any potential outliers or units with high influence, the data were manually examined by ordering the dataframe according to each criterion: standardised residuals and Cook’s distance.
A commonly used cut-off point for identifying potential outliers based on standardised residuals is ±3. Upon inspection, all standardised residuals in the dataset fall within this range. Therefore, no records were removed on the basis of residual size, as none of the observations exhibited extreme deviations from the model’s expectations.
In contrast, units with high influence are assessed using Cook’s distance. In this dataset, two observations, namely records indexed 38 and 55, display Cook’s distance values that noticeably deviate from the rest. Although they do not exceed the conventional threshold of one, their relative magnitude suggests that they could have a substantial impact on the regression results. Including such points may increase model instability and reduce the explanatory power of the regression model.
df$fit2_StdResid <- round(rstandard(fit2), 3)
df$fit2_CooksD <- round(cooks.distance(fit2), 3)
head(df[order(abs(df$fit2_StdResid), decreasing = TRUE),], 5)
## Age Distance Price Parking Balcony fit2_StdResid fit2_CooksD
## 38 5 45 2180 Yes Yes 2.577 0.320
## 53 7 2 1760 No Yes -2.152 0.066
## 33 2 11 2790 Yes No 2.051 0.069
## 2 18 1 2800 Yes No 1.783 0.030
## 61 18 1 2800 Yes Yes 1.783 0.030
head(df[order(df$fit2_CooksD, decreasing = TRUE),], 5)
## Age Distance Price Parking Balcony fit2_StdResid fit2_CooksD
## 38 5 45 2180 Yes Yes 2.577 0.320
## 55 43 37 1740 No No 1.445 0.104
## 33 2 11 2790 Yes No 2.051 0.069
## 53 7 2 1760 No Yes -2.152 0.066
## 22 37 3 2540 Yes Yes 1.576 0.061
hist(x = df$fit2_CooksD, xlab = "Cook's Distance", ylab = "Frequency", main = "Histogram of Cook's Distance")
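The same screening can be expressed programmatically. A minimal sketch, using the ±3 rule for standardised residuals and the common 4/n rule of thumb for Cook’s distance (note that 4/n is a stricter heuristic and may flag more rows than the two removed below):
which(abs(rstandard(fit2)) > 3)             # potential outliers; none expected here
which(cooks.distance(fit2) > 4 / nrow(df))  # potentially influential observations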
Based on the preceding analysis, the most appropriate course of action is to remove records that may adversely affect the performance of the regression model. Accordingly, the observations at indices 38 and 55 were excluded from the dataset.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
df_cleaned <- df %>%
filter(!row_number() %in% c(38, 55))
head(df_cleaned[order(df_cleaned$fit2_CooksD, decreasing = TRUE),], 5)
## Age Distance Price Parking Balcony fit2_StdResid fit2_CooksD
## 33 2 11 2790 Yes No 2.051 0.069
## 52 7 2 1760 No Yes -2.152 0.066
## 22 37 3 2540 Yes Yes 1.576 0.061
## 38 40 2 2400 No Yes 1.091 0.038
## 56 8 2 2820 Yes No 1.655 0.037
Based on the scatter plot, the spread of the standardised residuals appears to widen as the fitted values increase, suggesting potential heteroscedasticity in the regression model. However, the Breusch-Pagan test did not reject the null hypothesis at the 5% significance level (\(p = 0.052\)), providing no strong evidence against homoscedasticity.
fit2_cleaned <- lm(Price ~ Age + Distance, data = df_cleaned)
df_cleaned$fit2_StdFitted <- scale(fit2_cleaned$fitted.values)
library(car)
scatterplot(y = df_cleaned$fit2_StdResid, x = df_cleaned$fit2_StdFitted,
ylab = "Standardised residuals",
xlab = "Standardised fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit2_cleaned)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## -----------------------------
## DF = 1
## Chi2 = 3.775135
## Prob > Chi2 = 0.05201969
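As a cross-check, ncvTest() from the already loaded car package performs a closely related score test for non-constant error variance and would be expected to lead to the same conclusion:
ncvTest(fit2_cleaned)  # score test of constant variance against the fitted values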
The histogram below displays the distribution of the standardised residuals. At first glance, it is evident that the distribution does not follow a normal shape, appearing positively skewed.
To formally assess normality, the Shapiro-Wilk test was conducted. The resulting p-value, \(0.0036\), is below \(0.05\), providing statistical evidence that the standardised residuals deviate from a normal distribution.
Given this result, potential bias in the model’s inference could be argued. However, the sample size (n = 83) is sufficiently large for the Central Limit Theorem to apply. As such, while the residuals are not perfectly normal, inference on the coefficients may still be valid.
hist(df_cleaned$fit2_StdResid,
xlab = "Standardised residuals",
ylab = "Frequency",
main = "Histogram of standardised residuals")
shapiro.test(df$fit2_StdResid)
##
## Shapiro-Wilk normality test
##
## data: df$fit2_StdResid
## W = 0.95303, p-value = 0.003645
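Note that fit2_StdResid was computed from fit2 before the two influential rows were removed, and the test above was run on the full df. For consistency, one could recompute the standardised residuals from the refitted model and rerun the test; a minimal sketch (fit2c_StdResid is an illustrative column name, and the conclusion would not be expected to change materially):
df_cleaned$fit2c_StdResid <- rstandard(fit2_cleaned)  # residuals of the refitted model
shapiro.test(df_cleaned$fit2c_StdResid)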
Regression coefficient:
Age: Holding all else constant, the price per m² of an apartment declines, on average, by approximately 7.85 for each additional year of the apartment’s age.
Distance: Holding all else constant, the price per m² of an apartment declines, on average, by approximately 23.95 for each additional kilometre from the city centre.
fit2_cleaned <- lm(Price ~ Age + Distance,
data = df_cleaned)
summary(fit2_cleaned)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = df_cleaned)
##
## Residuals:
## Min 1Q Median 3Q Max
## -627.27 -212.96 -46.23 205.05 578.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2490.112 76.189 32.684 < 2e-16 ***
## Age -7.850 3.244 -2.420 0.0178 *
## Distance -23.945 2.826 -8.473 9.53e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared: 0.4968, Adjusted R-squared: 0.4842
## F-statistic: 39.49 on 2 and 80 DF, p-value: 1.173e-12
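Interval estimates for these effects are available directly from the fitted model; confint() returns 95% confidence intervals by default:
confint(fit2_cleaned)  # 95% confidence intervals for the intercept, Age and Distance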
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony,
data = df)
The ANOVA table comparing the two regression models suggests that Model 2 (fit3) has greater explanatory power than Model 1 (fit2), as evidenced by its lower residual sum of squares. The improvement in model fit is statistically significant (p = 0.01007 < 0.05), suggesting that the additional predictors (Parking and Balcony) jointly contribute meaningfully to explaining variation in Price.
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 82 6720983
## 2 80 5991088 2 729894 4.8732 0.01007 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Regression Coefficient:
Parking: Holding all else constant, the price per m² of an apartment with parking is, on average, approximately 196.17 higher than that of an apartment without parking.
Balcony: Holding all else constant, the price per m² of an apartment with a balcony is, on average, approximately 1.94 higher than that of an apartment without a balcony. However, the coefficient is not statistically significant (\(p = 0.97\)), indicating no meaningful effect of having a balcony on price in this model.
The F-statistic tests whether any of the independent variables has a linear relationship with the dependent variable. As such:
\[H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = 0\]
\[H_1: \text{At least one } \beta_j \neq 0\]
The p-value is extremely small, \(1.849 \times 10^{-11}\), so the null hypothesis can be rejected: at least one independent variable is significantly related to the variation in the dependent variable.
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -459.92 -200.66 -57.48 260.08 594.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2301.667 94.271 24.415 < 2e-16 ***
## Age -6.799 3.110 -2.186 0.03172 *
## Distance -18.045 2.758 -6.543 5.28e-09 ***
## ParkingYes 196.168 62.868 3.120 0.00251 **
## BalconyYes 1.935 60.014 0.032 0.97436
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared: 0.5004, Adjusted R-squared: 0.4754
## F-statistic: 20.03 on 4 and 80 DF, p-value: 1.849e-11
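As a sanity check, the overall F-statistic can be reproduced from \(R^2\) using \[F = \frac{R^2 / k}{(1 - R^2)/(n - k - 1)},\] where \(k\) is the number of predictors:
r2 <- summary(fit3)$r.squared        # 0.5004
k <- 4                               # number of predictors
n <- nrow(df)                        # 85 observations
(r2 / k) / ((1 - r2) / (n - k - 1))  # approximately 20.03, matching the summary output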
df$Resid3 <- residuals(fit3)  # equivalent to df$Price - fitted(fit3)
index <- 2
print(paste("The residual for apartment", index, "is", round(df$Resid3[index], 2)))
## [1] "The residual for apartment 2 is 442.59"