mydata_full <- read.table("C:/Julia/SEB LU/IMB/SEMESTER 2/Multivariate Analysis/HOMEWORK/Real estate.csv", header=TRUE, sep=",", dec=".")

mydata <- mydata_full [c(-2)]

# Renaming columns
colnames(mydata) <- c("ID", "House_age", "MRT_distance", "No_stores", "Latitude", "Longitude", "House_price")

# Dummy variable from no of stores (if no stores =0, if 1+ stores =1)
No_stores <- mydata$No_stores
dummy_No_stores <- ifelse(No_stores > 0, 1, 0)
mydata$dummy_No_stores <- dummy_No_stores

# Reorder the columns
mydata <- mydata[, c("ID", "House_price", "House_age", "MRT_distance", "No_stores", "Latitude", "Longitude", "dummy_No_stores")]

head(mydata, 10)
##    ID House_price House_age MRT_distance No_stores Latitude Longitude dummy_No_stores
## 1   1        37.9      32.0     84.87882        10 24.98298  121.5402               1
## 2   2        42.2      19.5    306.59470         9 24.98034  121.5395               1
## 3   3        47.3      13.3    561.98450         5 24.98746  121.5439               1
## 4   4        54.8      13.3    561.98450         5 24.98746  121.5439               1
## 5   5        43.1       5.0    390.56840         5 24.97937  121.5425               1
## 6   6        32.1       7.1   2175.03000         3 24.96305  121.5125               1
## 7   7        40.3      34.5    623.47310         7 24.97933  121.5364               1
## 8   8        46.7      20.3    287.60250         6 24.98042  121.5423               1
## 9   9        18.8      31.7   5512.03800         1 24.95095  121.4846               1
## 10 10        22.1      17.9   1783.18000         3 24.96731  121.5149               1

The dataset was found on kaggle.com, and investigates the price of houses on the real estate market. The unit of observation is a house. It incorporates 414 observations.

Description of variables:

RQ: What influences the price of houses in the real estate market?

library(pastecs) # However, it makes more sense to do this later when we remove data
round(stat.desc(mydata[ , ]), 2)
##                    ID House_price House_age MRT_distance No_stores Latitude Longitude dummy_No_stores
## nbr.val        414.00      414.00    414.00       414.00    414.00   414.00    414.00          414.00
## nbr.null         0.00        0.00     17.00         0.00     67.00     0.00      0.00           67.00
## nbr.na           0.00        0.00      0.00         0.00      0.00     0.00      0.00            0.00
## min              1.00        7.60      0.00        23.38      0.00    24.93    121.47            0.00
## max            414.00      117.50     43.80      6488.02     10.00    25.01    121.57            1.00
## range          413.00      109.90     43.80      6464.64     10.00     0.08      0.09            1.00
## sum          85905.00    15723.80   7333.00    448728.68   1695.00 10337.18  50314.81          347.00
## median         207.50       38.45     16.10       492.23      4.00    24.97    121.54            1.00
## mean           207.50       37.98     17.71      1083.89      4.09    24.97    121.53            0.84
## SE.mean          5.88        0.67      0.56        62.03      0.14     0.00      0.00            0.02
## CI.mean.0.95    11.56        1.31      1.10       121.93      0.28     0.00      0.00            0.04
## var          14317.50      185.14    129.79   1592920.63      8.68     0.00      0.00            0.14
## std.dev        119.66       13.61     11.39      1262.11      2.95     0.01      0.02            0.37
## coef.var         0.58        0.36      0.64         1.16      0.72     0.00      0.00            0.44
library(car)
scatterplotMatrix(mydata[ , c(-1, -5, -8)], #only used between numeric variables
                  smooth = FALSE) 

First row we see the relationship between house price and other variables. Positively related to latitude and longitude, but negatively to house age ans MRT distance.

library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
rcorr(as.matrix(mydata[ , c(-1, -5, -8)])) 
##              House_price House_age MRT_distance Latitude Longitude
## House_price         1.00     -0.21        -0.67     0.55      0.52
## House_age          -0.21      1.00         0.03     0.05     -0.05
## MRT_distance       -0.67      0.03         1.00    -0.59     -0.81
## Latitude            0.55      0.05        -0.59     1.00      0.41
## Longitude           0.52     -0.05        -0.81     0.41      1.00
## 
## n= 414 
## 
## 
## P
##              House_price House_age MRT_distance Latitude Longitude
## House_price              0.0000    0.0000       0.0000   0.0000   
## House_age    0.0000                0.6032       0.2693   0.3247   
## MRT_distance 0.0000      0.6032                 0.0000   0.0000   
## Latitude     0.0000      0.2693    0.0000                0.0000   
## Longitude    0.0000      0.3247    0.0000       0.0000

We can also describe it with the above correlation matrix (actually shows it with the Pearson correlation coefficient).

MRT_distance (distance to the nearest Mass Rapid Transport station) might be problematic, due to correlation to Longitude. However, content wise, for the purpose of this homework, this does not make sense, so I will not remove it, but we could also remove MRT_distance due to high correlation.

For this regression we are checking if house price is the function of house age, distance to a MRT station, latitude, and longitude, and if there is a store nearby.

Explanation: according to Opendoor, a US real estate selling platform, the pricing of a house should be dependent on neighborhood comps, location, home size and usable space, age and condition, upgrades and updates, the local market, economic indicators, and interest rates. This is why I expect house prices to be a function of the house’s age, the distance to the nearest MRT station, and the house’s latitude and longitude.

fit1 <- lm(House_price ~ House_age + MRT_distance + Latitude + Longitude + dummy_No_stores,
           data = mydata)
vif(fit1) #used to check multicolinearity
##       House_age    MRT_distance        Latitude       Longitude dummy_No_stores 
##        1.009518        4.164963        1.613769        3.000454        1.368039
mean(vif(fit1))
## [1] 2.231349

Must be less than 5; and average should be around 1 (under [1]). => Mean is not great.

mydata$StdResid <- round(rstandard(fit1), 3) #used to find outliers

mydata$CooksD <- round(cooks.distance(fit1), 3) #for finding units with high impact

hist(mydata$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

H0: Variable is normally distributed

H1: Variable is not normally distributed

->test:

shapiro.test(mydata$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$StdResid
## W = 0.90077, p-value = 8.879e-16

=> Reject null-hypothesis.

hist(mydata$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances")

There are gaps in between, so I will now remove them.

head(mydata[order(mydata$StdResid),], 3)
##      ID House_price House_age MRT_distance No_stores Latitude Longitude dummy_No_stores StdResid CooksD
## 114 114         7.6      14.8     393.2606         6 24.96172  121.5381               1   -3.559  0.015
## 257 257        26.5      14.6     339.2289         1 24.97519  121.5315               1   -1.967  0.004
## 335 335        22.8      30.0    1013.3410         5 24.99006  121.5346               1   -1.963  0.010
head(mydata[order(-mydata$StdResid),], 10)
##      ID House_price House_age MRT_distance No_stores Latitude Longitude dummy_No_stores StdResid CooksD
## 271 271       117.5      10.8     252.5822         1 24.97460  121.5305               1    7.653  0.080
## 313 313        78.0      35.4     318.5292         9 24.97071  121.5407               1    4.256  0.030
## 221 221        78.3      37.2     186.5101         9 24.97703  121.5426               1    4.083  0.031
## 149 149        45.1      16.4    3780.5900         0 24.93293  121.5120               0    3.495  0.055
## 390 390        67.7      40.9     122.3619         8 24.96756  121.5423               1    3.283  0.029
## 127 127        62.9      38.6     804.6897         4 24.97838  121.5348               1    2.790  0.016
## 362 362        63.3      41.4     281.2050         8 24.97345  121.5409               1    2.746  0.018
## 167 167        73.6       0.0     292.9978         6 24.97744  121.5446               1    2.625  0.012
## 129 129        60.7      41.3     124.9912         6 24.96674  121.5404               1    2.562  0.019
## 48   48        61.5      35.9     640.7391         3 24.97563  121.5371               1    2.546  0.010
head(mydata[order(-mydata$CooksD),], 5) 
##      ID House_price House_age MRT_distance No_stores Latitude Longitude dummy_No_stores StdResid CooksD
## 271 271       117.5      10.8     252.5822         1 24.97460  121.5305               1    7.653  0.080
## 149 149        45.1      16.4    3780.5900         0 24.93293  121.5120               0    3.495  0.055
## 221 221        78.3      37.2     186.5101         9 24.97703  121.5426               1    4.083  0.031
## 313 313        78.0      35.4     318.5292         9 24.97071  121.5407               1    4.256  0.030
## 390 390        67.7      40.9     122.3619         8 24.96756  121.5423               1    3.283  0.029

Remove units of observation:

mydata <- mydata[c(-114, -149, -221, -271, -313, -390), ] #removing outliers based on standardised residuals (those not between -3 and +3)

mydata <- mydata[mydata$CooksD <0.04,] #removing gaps (based on Cooks distance)

Now because we removed units we check the model again:

fit1 <- lm(House_price ~ House_age + MRT_distance + Latitude + Longitude + dummy_No_stores,
           data = mydata)
mydata$StdResid <- round(rstandard(fit1), 3) 
mydata$CooksD <- round(cooks.distance(fit1), 3) 

hist(mydata$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

shapiro.test(mydata$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$StdResid
## W = 0.98247, p-value = 7.49e-05
hist(mydata$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances")

mydata$StdFittedValues <- scale(fit1$fitted.values)
library(car)
scatterplot(y = mydata$StdResid, x = mydata$StdFittedValues,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

Homoskedasticity means that these units should be as randomly as possible. They are not the most randomly distributed, but it is not terrible.

Formally check it:

library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
ols_test_breusch_pagan(fit1)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                  Data                   
##  ---------------------------------------
##  Response : House_price 
##  Variables: fitted values of House_price 
## 
##         Test Summary         
##  ----------------------------
##  DF            =    1 
##  Chi2          =    0.1212683 
##  Prob > Chi2   =    0.727663

=> We have homoskedasticity, so we do not need to use lm.robust (no need to correct standard errors).

summary(fit1)
## 
## Call:
## lm(formula = House_price ~ House_age + MRT_distance + Latitude + 
##     Longitude + dummy_No_stores, data = mydata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.1096  -5.1867  -0.6227   4.8800  27.0080 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -7.005e+03  5.458e+03  -1.283    0.200    
## House_age       -2.863e-01  3.425e-02  -8.360 1.04e-15 ***
## MRT_distance    -5.295e-03  6.221e-04  -8.512 3.45e-16 ***
## Latitude         2.813e+02  3.952e+01   7.117 5.09e-12 ***
## Longitude        2.353e-01  4.336e+01   0.005    0.996    
## dummy_No_stores  1.409e+00  1.218e+00   1.157    0.248    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.791 on 402 degrees of freedom
## Multiple R-squared:  0.6245, Adjusted R-squared:  0.6198 
## F-statistic: 133.7 on 5 and 402 DF,  p-value: < 2.2e-16

House_age: If House_age increases by 1 unit, then House_price decreases on average by 100 * -0.0286 = -2.86% while holding all other predictor variables constant

MRT_distance: If MRT_distance increases by 1 unit, then House_price decreases on average by 100 * -0.0053 = -0.53% while holding all other predictor variables constant.

Latitude: If Latitude increases by 1 unit, then House_price increases on average by 100 * 2.813 = 281.3% while holding all other predictor variables constant.

Multiple R2: 0.6245 ==> 62% of variability of house price is explained by linear effect of house age, distance to nearest MRT station, latitude and longitude of the house.

F-statistic: The F-statistic reported here is 133.7. The p-value is less than 2.2e-16, which provides strong evidence that the model is a good fit.

sqrt(summary(fit1)$r.squared) 
## [1] 0.7902574

Multiple Coefficient of Correlation: 0.79 ==> Linear relationship between house price and explanatory variables is strong. (In other words: 76% of the variability in the dependent variable is explained by the independent variables in the model.)

library(lm.beta)
lm.beta(fit1)
## 
## Call:
## lm(formula = House_price ~ House_age + MRT_distance + Latitude + 
##     Longitude + dummy_No_stores, data = mydata)
## 
## Standardized Coefficients::
##     (Intercept)       House_age    MRT_distance        Latitude       Longitude dummy_No_stores 
##              NA   -0.2568936431   -0.5283621469    0.2750645832    0.0002868856    0.0411159828

The larger the coefficient, the more important it is regarding house price (absolute values). => MRT_distance is the most important.

In this case, the standardized coefficient of dummy_No_stores is 0.0411, which suggests a small positive relationship between the presence of stores and the house price.