mydata_full <- read.table("C:/Julia/SEB LU/IMB/SEMESTER 2/Multivariate Analysis/HOMEWORK/Real estate.csv", header=TRUE, sep=",", dec=".")
mydata <- mydata_full [c(-2)]
# Renaming columns
colnames(mydata) <- c("ID", "House_age", "MRT_distance", "No_stores", "Latitude", "Longitude", "House_price")
# Dummy variable from no of stores (if no stores =0, if 1+ stores =1)
No_stores <- mydata$No_stores
dummy_No_stores <- ifelse(No_stores > 0, 1, 0)
mydata$dummy_No_stores <- dummy_No_stores
# Reorder the columns
mydata <- mydata[, c("ID", "House_price", "House_age", "MRT_distance", "No_stores", "Latitude", "Longitude", "dummy_No_stores")]
head(mydata, 10)
## ID House_price House_age MRT_distance No_stores Latitude Longitude dummy_No_stores
## 1 1 37.9 32.0 84.87882 10 24.98298 121.5402 1
## 2 2 42.2 19.5 306.59470 9 24.98034 121.5395 1
## 3 3 47.3 13.3 561.98450 5 24.98746 121.5439 1
## 4 4 54.8 13.3 561.98450 5 24.98746 121.5439 1
## 5 5 43.1 5.0 390.56840 5 24.97937 121.5425 1
## 6 6 32.1 7.1 2175.03000 3 24.96305 121.5125 1
## 7 7 40.3 34.5 623.47310 7 24.97933 121.5364 1
## 8 8 46.7 20.3 287.60250 6 24.98042 121.5423 1
## 9 9 18.8 31.7 5512.03800 1 24.95095 121.4846 1
## 10 10 22.1 17.9 1783.18000 3 24.96731 121.5149 1
The dataset was found on kaggle.com, and investigates the price of houses on the real estate market. The unit of observation is a house. It incorporates 414 observations.
Description of variables:
ID: house identifier
House_price: price of the house on the real estate market
House_age: age of the house
MRT_distance: distance to the nearest MRT (Mass Rapid Transit) station
No_stores: number of convenience stores close to the house
Latitude: latitude of the property
Longitude: longitude of the property
RQ: What influences the price of houses in the real estate market?
library(pastecs) # However, it makes more sense to do this later when we remove data
round(stat.desc(mydata[ , ]), 2)
## ID House_price House_age MRT_distance No_stores Latitude Longitude dummy_No_stores
## nbr.val 414.00 414.00 414.00 414.00 414.00 414.00 414.00 414.00
## nbr.null 0.00 0.00 17.00 0.00 67.00 0.00 0.00 67.00
## nbr.na 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## min 1.00 7.60 0.00 23.38 0.00 24.93 121.47 0.00
## max 414.00 117.50 43.80 6488.02 10.00 25.01 121.57 1.00
## range 413.00 109.90 43.80 6464.64 10.00 0.08 0.09 1.00
## sum 85905.00 15723.80 7333.00 448728.68 1695.00 10337.18 50314.81 347.00
## median 207.50 38.45 16.10 492.23 4.00 24.97 121.54 1.00
## mean 207.50 37.98 17.71 1083.89 4.09 24.97 121.53 0.84
## SE.mean 5.88 0.67 0.56 62.03 0.14 0.00 0.00 0.02
## CI.mean.0.95 11.56 1.31 1.10 121.93 0.28 0.00 0.00 0.04
## var 14317.50 185.14 129.79 1592920.63 8.68 0.00 0.00 0.14
## std.dev 119.66 13.61 11.39 1262.11 2.95 0.01 0.02 0.37
## coef.var 0.58 0.36 0.64 1.16 0.72 0.00 0.00 0.44
library(car)
RegLine = T
scatterplotMatrix(mydata[ , c(-1, -5, -8)], #only used between numeric variables
smooth = FALSE)
First row we see the relationship between house price and other variables. Positively related to latitude and longitude, but negatively to house age ans MRT distance.
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(mydata[ , c(-1, -5, -8)]))
## House_price House_age MRT_distance Latitude Longitude
## House_price 1.00 -0.21 -0.67 0.55 0.52
## House_age -0.21 1.00 0.03 0.05 -0.05
## MRT_distance -0.67 0.03 1.00 -0.59 -0.81
## Latitude 0.55 0.05 -0.59 1.00 0.41
## Longitude 0.52 -0.05 -0.81 0.41 1.00
##
## n= 414
##
##
## P
## House_price House_age MRT_distance Latitude Longitude
## House_price 0.0000 0.0000 0.0000 0.0000
## House_age 0.0000 0.6032 0.2693 0.3247
## MRT_distance 0.0000 0.6032 0.0000 0.0000
## Latitude 0.0000 0.2693 0.0000 0.0000
## Longitude 0.0000 0.3247 0.0000 0.0000
We can also describe it with the above correlation matrix (actually shows it with the Pearson correlation coefficient).
MRT_distance (distance to the nearest Mass Rapid Transport station) might be problematic, due to correlation to Longitude. However, content wise, for the purpose of this homework, this does not make sense, so I will not remove it, but we could also remove MRT_distance due to high correlation.
For this regression we are checking if house price is the function of house age, distance to a MRT station, latitude, and longitude, and if there is a store nearby.
Explanation: according to Opendoor, a US real estate selling platform, the pricing of a house should be dependent on neighborhood comps, location, home size and usable space, age and condition, upgrades and updates, the local market, economic indicators, and interest rates. This is why I expect house prices to be a function of the house’s age, the distance to the nearest MRT station, and the house’s latitude and longitude.
fit1 <- lm(House_price ~ House_age + MRT_distance + Latitude + Longitude + dummy_No_stores,
data = mydata)
vif(fit1) #used to check multicolinearity
## House_age MRT_distance Latitude Longitude dummy_No_stores
## 1.009518 4.164963 1.613769 3.000454 1.368039
mean(vif(fit1))
## [1] 2.231349
Must be less than 5 and should be around 1, so that there is not a problem with multicolinearity.
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit1)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------------
## Response : House_price
## Variables: fitted values of House_price
##
## Test Summary
## ----------------------------
## DF = 1
## Chi2 = 1.493037
## Prob > Chi2 = 0.2217459
Checking homoskedasticity of the model:
H0: The variance of the errors is constant
H1: The variance is not constant
=> We cannot reject the null hypothesis, which states that the variance is constant. (Constant variance implies homoskedasticity, which we desire, so that we can trust the results of the regression).
mydata$StdResid <- round(rstandard(fit1), 3) #used to find outliers
mydata$CooksD <- round(cooks.distance(fit1), 3) #for finding units with high impact
hist(mydata$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
shapiro.test(mydata$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata$StdResid
## W = 0.90077, p-value = 8.879e-16
H0: Normally distributed
H1: Not normally distributed
=> Reject null-hypothesis. Rejection implies that the normality assumption of regression is not met.
hist(mydata$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
There are gaps in between, so I will now remove them.
head(mydata[order(mydata$StdResid),], 3)
## ID House_price House_age MRT_distance No_stores Latitude Longitude dummy_No_stores StdResid CooksD
## 114 114 7.6 14.8 393.2606 6 24.96172 121.5381 1 -3.559 0.015
## 257 257 26.5 14.6 339.2289 1 24.97519 121.5315 1 -1.967 0.004
## 335 335 22.8 30.0 1013.3410 5 24.99006 121.5346 1 -1.963 0.010
head(mydata[order(-mydata$StdResid),], 10)
## ID House_price House_age MRT_distance No_stores Latitude Longitude dummy_No_stores StdResid CooksD
## 271 271 117.5 10.8 252.5822 1 24.97460 121.5305 1 7.653 0.080
## 313 313 78.0 35.4 318.5292 9 24.97071 121.5407 1 4.256 0.030
## 221 221 78.3 37.2 186.5101 9 24.97703 121.5426 1 4.083 0.031
## 149 149 45.1 16.4 3780.5900 0 24.93293 121.5120 0 3.495 0.055
## 390 390 67.7 40.9 122.3619 8 24.96756 121.5423 1 3.283 0.029
## 127 127 62.9 38.6 804.6897 4 24.97838 121.5348 1 2.790 0.016
## 362 362 63.3 41.4 281.2050 8 24.97345 121.5409 1 2.746 0.018
## 167 167 73.6 0.0 292.9978 6 24.97744 121.5446 1 2.625 0.012
## 129 129 60.7 41.3 124.9912 6 24.96674 121.5404 1 2.562 0.019
## 48 48 61.5 35.9 640.7391 3 24.97563 121.5371 1 2.546 0.010
head(mydata[order(-mydata$CooksD),], 5)
## ID House_price House_age MRT_distance No_stores Latitude Longitude dummy_No_stores StdResid CooksD
## 271 271 117.5 10.8 252.5822 1 24.97460 121.5305 1 7.653 0.080
## 149 149 45.1 16.4 3780.5900 0 24.93293 121.5120 0 3.495 0.055
## 221 221 78.3 37.2 186.5101 9 24.97703 121.5426 1 4.083 0.031
## 313 313 78.0 35.4 318.5292 9 24.97071 121.5407 1 4.256 0.030
## 390 390 67.7 40.9 122.3619 8 24.96756 121.5423 1 3.283 0.029
Remove units of observation:
mydata <- mydata[c(-114, -149, -221, -271, -313, -390), ] #removing outliers based on standardised residuals (those not between -3 and +3)
mydata <- mydata[mydata$CooksD <0.04,] #removing gaps (based on Cooks distance)
Now because we removed units we check the model again:
fit1 <- lm(House_price ~ House_age + MRT_distance + Latitude + Longitude + dummy_No_stores,
data = mydata)
mydata$StdResid <- round(rstandard(fit1), 3)
mydata$CooksD <- round(cooks.distance(fit1), 3)
hist(mydata$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
shapiro.test(mydata$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata$StdResid
## W = 0.98247, p-value = 7.49e-05
hist(mydata$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
mydata$StdFittedValues <- scale(fit1$fitted.values)
library(car)
scatterplot(y = mydata$StdResid, x = mydata$StdFittedValues,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
Homoskedasticity means that these units should be as randomly as possible. They are not the most randomly distributed, but it is not terrible.
Formally check it:
library(olsrr)
ols_test_breusch_pagan(fit1)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------------
## Response : House_price
## Variables: fitted values of House_price
##
## Test Summary
## ----------------------------
## DF = 1
## Chi2 = 0.1212683
## Prob > Chi2 = 0.727663
Checking homoskedasticity of the model, after removing units:
H0: The variance of the errors is constant
H1: The variance is not constant
=> We have homoskedasticity, so we do not need to use lm.robust (no need to correct standard errors).
summary(fit1)
##
## Call:
## lm(formula = House_price ~ House_age + MRT_distance + Latitude +
## Longitude + dummy_No_stores, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.1096 -5.1867 -0.6227 4.8800 27.0080
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.005e+03 5.458e+03 -1.283 0.200
## House_age -2.863e-01 3.425e-02 -8.360 1.04e-15 ***
## MRT_distance -5.295e-03 6.221e-04 -8.512 3.45e-16 ***
## Latitude 2.813e+02 3.952e+01 7.117 5.09e-12 ***
## Longitude 2.353e-01 4.336e+01 0.005 0.996
## dummy_No_stores 1.409e+00 1.218e+00 1.157 0.248
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.791 on 402 degrees of freedom
## Multiple R-squared: 0.6245, Adjusted R-squared: 0.6198
## F-statistic: 133.7 on 5 and 402 DF, p-value: < 2.2e-16
House_age: on average, as House_age increases by one unit, House_price is expected to decrease by 0.286 units, holding all other predictor variables constant.
MRT_distance: on average, as MRT_distance increases by one unit, House_price is expected to decrease by 0.005295 units, holding all other predictor variables constant.
Latitude: on average, as Latitude increases by one unit, House_price is expected to increase by 282.13 units, holding all other predictor variables constant.
Multiple R2: 0.6245 ==> 62% of variability of house price is explained by linear effect of house age, distance to nearest MRT station, latitude and longitude of the house.
F-statistic: The F-statistic reported here is 133.7. The p-value is less than 2.2e-16, which provides strong evidence that the model is a good fit. (Because it indicates that there is a strong evidence against the null hypothesis and that the independent variables in the model have a significant effect on the dependent variable House_price.)
sqrt(summary(fit1)$r.squared)
## [1] 0.7902574
This value provides understanding of the strength and direction of the relationship between the two variables, where a value of 1 indicates a perfect positive relationship, a value of -1 indicates a perfect negative relationship, and a value of 0 indicates no relationship.
In the case there is a high correlation coefficient which suggests that the independent variables in the model have a strong linear relationship with the dependent variable.
library(lm.beta)
lm.beta(fit1)
##
## Call:
## lm(formula = House_price ~ House_age + MRT_distance + Latitude +
## Longitude + dummy_No_stores, data = mydata)
##
## Standardized Coefficients::
## (Intercept) House_age MRT_distance Latitude Longitude dummy_No_stores
## NA -0.2568936431 -0.5283621469 0.2750645832 0.0002868856 0.0411159828
The larger the coefficient, the more important it is regarding house price (absolute values). => MRT_distance is the most important.
In this case, the standardized coefficient of dummy_No_stores is 0.0411, which suggests a small positive relationship between the presence of stores and the house price.