library(readxl)
data <- read_xlsx("Apartments(1).xlsx")
Description:
data$BalconyFactor <- factor(x=data$Balcony,
levels = c(1, 0),
labels = c("Balcony", "No Balcony"))
data$ParkingFactor <- factor(x=data$Parking,
levels = c(1, 0),
labels = c("Parking", "No Parking"))
shapiro.test(data$Price)
##
## Shapiro-Wilk normality test
##
## data: data$Price
## W = 0.94017, p-value = 0.0006513
#P-value < 0.001, hence we reject the Null Hypothesis.
#As normality distribution assumption not fulfilled, a Wilcoxon Signed Rank Test is conducted.
wilcox.test(data$Price,
mu=1900,
correct = FALSE)
##
## Wilcoxon signed rank test
##
## data: data$Price
## V = 2328, p-value = 0.02828
## alternative hypothesis: true location is not equal to 1900
#We reject the Null Hypothesis at p = 0.03.
H0: Mu_Price = EUR 1900 H1: Mu_Price =/ EUR 1900
Assumptions and Requirements:
1.Variable is numeric - TRUE
2.Normality
H0: Variable normally distributed. H1: Variable not normally distributed.
P-value < 0.001, hence we reject the Null Hypothesis.
As normality distribution assumption not fulfilled, a Wilcoxon Signed Rank Test is conducted. We reject the Null Hypothesis at p = 0.03.
fit1 <- lm(Price ~ Age,
data = data)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
cor(data$Price, data$Age)
## [1] -0.230255
Regression coefficient: Everything else equal, a one year increase in age decreases price per square meter by EUR 8.975, on average (p<0.05).
Coefficient of correlation: There is a negative correlation between price per square meter and age of the apartment. This relationship is weak, however. This could also be calculated by taking the root of the r squared value.
Coefficient of determination: The total variation in price that is explained by age is 5.302%. This is a medium effect size.
library(car)
## Loading required package: carData
scatterplotMatrix(data[,1:3], smooth = FALSE)
The correlation between distance and age is comparatively low. Hence, I would expect this not to be too strong. Nevertheless, to be sure, the VIF should be calculated (done below).
fit2 <- lm(Price ~ Age + Distance,
data = data)
vif(fit2)
## Age Distance
## 1.001845 1.001845
With VIF values of less than 5, we can assume that there is no multicolinearity.
data$StdResid <- round(rstandard(fit2), 3)
data$CooksDist <- round(cooks.distance(fit2), 3)
hist(data$CooksDist,
xlab = "Cook's Distance",
ylab = "Frequency",
main = "Histogram of Cook's Distance")
#Removing any problematic values
#For the standardized residuals we only remove values larger/smaller than +3 or -3. There are no such values in the dataset.
#For Cook's Distance, we are removing values that are relatively large compared to the other ones. For example the one between 0.30 and 0.35 in the second histogram plot above. We do so until the histogram plot is continuous.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data1 <- data %>%
filter(!Age=="5")
data1 <- data1 %>%
filter(!Age=="43")
data1 <- data1 %>%
filter(!Age=="2")
data1 <- data1 %>%
filter(!Price=="1760")
data1 <- data1 %>%
filter(!Age=="37")
hist(data1$CooksDist,
xlab = "Cook's Distance",
ylab = "Frequency",
main = "Histogram of Cook's Distance")
#Again calculated fit2 with new data = data1 which excludes the outliers.
removed_values_fit2 <- lm(Price ~ Age + Distance,
data = data1)
summary(removed_values_fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -411.50 -203.69 -45.24 191.11 492.56
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2502.467 75.024 33.356 < 2e-16 ***
## Age -8.674 3.221 -2.693 0.00869 **
## Distance -24.063 2.692 -8.939 1.57e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared: 0.5361, Adjusted R-squared: 0.524
## F-statistic: 44.49 on 2 and 77 DF, p-value: 1.437e-13
fitted_values <- fitted(removed_values_fit2)
std_fitted_values <- (fitted_values - mean(fitted_values)) / sd(fitted_values)
# Create scatterplot
plot(std_fitted_values, data1$StdResid,
xlab = "Standardized Fitted Values",
ylab = "Standardized Residuals",
main = "Scatterplot of Standardized Residuals vs Standardized Fitted Values")
#install.packages("olsrr")
library(olsrr)
## Warning: package 'olsrr' was built under R version 4.3.2
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(removed_values_fit2)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## ----------------------------
## DF = 1
## Chi2 = 1.738591
## Prob > Chi2 = 0.1873174
This does not look like there is heteroskedasticity. This is because all the datapoints are roughly fitting inbetween two parallel lines. There is also not a non-linear relationship. The BP-Test confirms that we cannot reject the Null Hypothesis of constant variance (p>0.005).
shapiro.test(data1$StdResid)
##
## Shapiro-Wilk normality test
##
## data: data1$StdResid
## W = 0.93418, p-value = 0.0004761
hist(data1$StdResid,
main = "Histogram of Standardized Residuals",
xlab = "Standardized Residuals",
ylab = "Frequency",
col = "grey",
border = "black",
breaks = 20)
No, the standardized residuals are not normally distributed. We reject the Null Hypothesis of the Shapiro Wilk Test that assumes normal distribution of the standardized residuals. We reject it at p<0.001. This would mean that the assumption for the linear regression is violated. Since n=80, we can still conduct a linear regression analysis, however, we require the significance level of the coefficients to be further away from 0.05.
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
Coefficients:
#Here outliers were excluded.
removed_outliers_fit3 <- lm(Price ~ Age + Distance + ParkingFactor + BalconyFactor,
data = data1)
summary(removed_outliers_fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingFactor + BalconyFactor,
## data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -390.93 -198.19 -53.64 186.73 518.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2528.049 80.942 31.233 < 2e-16 ***
## Age -7.970 3.191 -2.498 0.0147 *
## Distance -21.961 2.830 -7.762 3.39e-11 ***
## ParkingFactorNo Parking -128.700 60.801 -2.117 0.0376 *
## BalconyFactorNo Balcony -6.032 57.307 -0.105 0.9165
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared: 0.5623, Adjusted R-squared: 0.5389
## F-statistic: 24.08 on 4 and 75 DF, p-value: 7.764e-13
#I compare the models where not outliers are excluded.
anova(removed_values_fit2, removed_outliers_fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingFactor + BalconyFactor
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 77 5077362
## 2 75 4791128 2 286234 2.2403 0.1135
Including the two dummy variables does not significantly increase the fit of the model (p=0.11).
summary(removed_outliers_fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingFactor + BalconyFactor,
## data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -390.93 -198.19 -53.64 186.73 518.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2528.049 80.942 31.233 < 2e-16 ***
## Age -7.970 3.191 -2.498 0.0147 *
## Distance -21.961 2.830 -7.762 3.39e-11 ***
## ParkingFactorNo Parking -128.700 60.801 -2.117 0.0376 *
## BalconyFactorNo Balcony -6.032 57.307 -0.105 0.9165
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared: 0.5623, Adjusted R-squared: 0.5389
## F-statistic: 24.08 on 4 and 75 DF, p-value: 7.764e-13
Regression coefficient “ParkingFactorNo Parking”: Everything else equal, the effect of not having parking included in the apartment reduces the price per square meter by EUR 196.168 on average to an apartment with parking.
Regression coefficient “ParkingFactorNo Balcony”: This is not statistically significant. Hence, no interpretation.
Null hypothesis F-Statistic: There is no predictive relationship between any of the explanatory variables included in the model and the dependent variable (in other words R^2 squared is equal to zero).
We reject the Null hypothesis at a p-value of less than 0.001.
data1$fitted_values_fit3 <- fitted(removed_outliers_fit3)
residual_apartment2 <- 2800 - 2356.597
The fitted value of apartment ID2 is EUR 2356.597 per square meter and the actual value is EUR 2800. Hence, the residual is 443.403.