library(readxl)
mydata<-read_xlsx("Apartments.xlsx")
mydata<-as.data.frame(mydata)
Description:
A possible research question can be whether the availability of
parking affects the price per m2 of the house price to be different
(Hypo testing)
Another possible research question can be whether the price per m2 is
affected linearly by the distance from city centre in km and the age of
an apartment in years (Linear regression)
mydata$Parking <- factor(mydata$Parking,
levels = c(0,1),
labels = c("No","Yes"))
mydata$Balcony <- factor(mydata$Balcony,
levels = c(0,1),
labels = c("No","Yes"))
shapiro.test(mydata$Price)
##
## Shapiro-Wilk normality test
##
## data: mydata$Price
## W = 0.94017, p-value = 0.0006513
Since the the Shapiro wilk test gives us a p-value < 0.001, we
reject the null hypothesis that the data of price is normally
distributed.
Hence, we use the non parametric mean method instead.
So we are testing whether median is 1900 eur instead.
median(mydata$Price)
## [1] 1950
wilcox.test(mydata$Price,
mu = 1900,
correct = FALSE)
##
## Wilcoxon signed rank test
##
## data: mydata$Price
## V = 2328, p-value = 0.02828
## alternative hypothesis: true location is not equal to 1900
library(effectsize)
effectsize(wilcox.test(mydata$Price, mu = 1900, correct = FALSE))
## r (rank biserial) | 95% CI
## --------------------------------
## 0.27 | [0.04, 0.48]
##
## - Deviation from a difference of 1900.
interpret_rank_biserial(0.27, rules = "funder2019")
## [1] "medium"
## (Rules: funder2019)
Based on the sample data, we found that the median price per m2 is
not equal to 1900 as the p-value is 0.029. So we reject the null
hypothesis that the median is equal to 1900. And the change is medium, r
= 0.27.
The median is instead 1950 and is more than 1900.
fit1 <- lm(Price ~ Age,
data = mydata)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
cor(mydata$Price,mydata$Age)
## [1] -0.230255
library(car)
## Loading required package: carData
scatterplotMatrix(mydata[c(-4,-5)], smooth=FALSE)
Based on the scatter plots, there does not seem to be a problem with
multicolinearity as the 2 other explanatory variables Distance and Age
do not have a show a strong relationship across.
Like there is no
diagonal line with all the points being along the line.
mydata_2 <- mydata
fit2 <- lm(Price ~ Age + Distance,
data = mydata_2)
vif(fit2)
## Age Distance
## 1.001845 1.001845
Based on the VIF statistics, we have to be cautious of it being more
than 5 then it suggests that there is the multicolinearity problem.
And if it is like 3 then maybe we would have to consider it as well. But
since the numbers are low at 1 (which is what we want, closer to 1 the
better), then multicolinearity will not be a problem. Hence, we can
carry on with the other parts.
mydata_2$StdResid <- round(rstandard(fit2), 3)
mydata_2$CooksD <- round(cooks.distance(fit2), 3)
hist(mydata_2$StdResid,
xlab = "standardised residuals",
ylab = "Frequency",
main = "Histogram of Standardised residuals")
hist(mydata_2$CooksD,
xlab ="Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distance")
#head(mydata[order(-mydata$StdResid),],6)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydata_2 <-mydata_2 %>%
filter(CooksD <= 0.103)
head(mydata_2[order(-mydata_2$CooksD),],7)
## Age Distance Price Parking Balcony StdResid CooksD
## 33 2 11 2790 Yes No 2.051 0.069
## 52 7 2 1760 No Yes -2.152 0.066
## 22 37 3 2540 Yes Yes 1.576 0.061
## 38 40 2 2400 No Yes 1.091 0.038
## 56 8 2 2820 Yes No 1.655 0.037
## 25 8 26 2300 Yes Yes 1.571 0.034
## 55 10 1 2810 No No 1.601 0.032
Based on the first histogram, I believe that there is no outliers to
be removed as both the ends, the data are within the 3 range, so we do
not have to remove any outliers.
But it is not the same case for the Cooks distance as there are 2 data point which jumps from the rest in the histogram. So, we will have to remove those points before we carry on.
fit2 <- lm(Price ~ Age + Distance, data = mydata_2)
mydata_2$StdFitted <- scale(fit2$fitted.values)
scatterplot(y=mydata_2$StdResid, x=mydata_2$StdFitted,
ylab="Standardised residuals",
xlab="Standardised fitted values",
boxplots = FALSE,
regLine= FALSE,
smooth = FALSE)
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit2)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## -----------------------------
## DF = 1
## Chi2 = 3.775135
## Prob > Chi2 = 0.05201969
Based on the scatter plot, I think the data passes as they seem not
to be heteroskedasticity.
The points are random and they do not
seem to be opening up too much. They are opening up a bit towards the
bottom so to confirm I did the Breusch Pagan test.
Based on the test, we test that null hypothesis is that the variance
is constant and the H1 is that the variance is not constant.
Since
the P-value is 0.053, we can accept the null hypothesis and move on
since I can deduce that there is no potential for heteroskedasticity.
hist(mydata_2$StdResid,
xlab = "Standardised Residuals",
ylab = "Frequncy",
main = "Histogram of the Standardised Residuals")
shapiro.test(mydata_2$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata_2$StdResid
## W = 0.94963, p-value = 0.002636
Based on the Shapiro test, the null Hypothesis is that the errors are
normally distributed and H1 is that the errors are not normally
distributed.
Since, the P-value is 0.0027, we can reject the null
hypothesis and understand that the errors are not normally
distributed.
But since the number of samples tested is 83 which is
large, we can assume that they are indeed normally distributed due to
the central limit theorem. So we do not have to worry about this
assumption and we can move on.
fit2 <- lm(Price ~ Age + Distance, data = mydata)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
sqrt(summary(fit2)$r.squared)
## [1] 0.6629887
Based on the model, as the age of the apartment increases by 1 year,
the price per m2 on average decreases by 7.934 given the other variables
are constant (p = 0.016).
And, as the distance from the city centre increases by 1km, the price
per m2 on average decreases by 20.667 given that the other variable is
constant. (p <0.001).
And 43.96% of the variability in price per m2 can be explained by
both the age of the apartment and the distance from the city centre for
the apartment.
Based on the r value, we can deduce that there is a linear and semi
strong relationship between the price per m2 and the Age of the
apartment and the distance from the city centre, r = 0.663.
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data=mydata)
anova(fit2,fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 82 6720983
## 2 80 5991088 2 729894 4.8732 0.01007 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the Multiple R-square, it seems that fit 3 is a better
model.
But to confirm I first compared the Adjusted R-square and
that was higher for fit 3 compared to fit 2 (it is given below). Hence,
fit 3 is a better model.
Secondly, I did the Anova test and the null
hypothesis was that both the models were the same and the H1 was that
the models were different. Since P-value = 0.01, we can reject H0, and
believe that the models are different. Which further confirms that Fit 3
is a better model.
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -459.92 -200.66 -57.48 260.08 594.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2301.667 94.271 24.415 < 2e-16 ***
## Age -6.799 3.110 -2.186 0.03172 *
## Distance -18.045 2.758 -6.543 5.28e-09 ***
## ParkingYes 196.168 62.868 3.120 0.00251 **
## BalconyYes 1.935 60.014 0.032 0.97436
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared: 0.5004, Adjusted R-squared: 0.4754
## F-statistic: 20.03 on 4 and 80 DF, p-value: 1.849e-11
For parkingYes, given the values of other explainatory variables, the
group of apartments which have parking on average has a higher price per
m2 by around 196.17 euros (p-value = 0.00251) compared to the apartments
without parking.
For balconyYes, given the values of other explainatory variables, the
group of apartments with have balcony on average has a higher price per
m2 by around 1.94 euros compared to the apartments without Balcony. But
this is not statistically siginificant as p-value is high at
0.98.
The F-test at the bottom tests where the coefficient of determination
is 0 or not. It tells us whether the model is a good model or a bad
model. The null hypothesis is that coefficient of determination is 0 and
the alternative hypothesis is that it is more than 0. Since, the p-value
< 0.001, we can reject H0 and understand that the coeffficient of
determination is more than 0. Hence, I think this is a good model to
use.
mydata$fitted<-fitted.values(fit3)
mydata$residuals <-residuals(fit3)
print(mydata[2,])
## Age Distance Price Parking Balcony fitted residuals
## 2 18 1 2800 Yes No 2357.411 442.5889