The dataset comes from https://www.edgeprop.sg/pullout/the-edge-property. I took the property profit and loss transactions covering 2 weeks of September and 3 weeks of October.
## 'data.frame': 100 obs. of 9 variables:
## $ PROJECT : chr "Beverly Hill" "Orchard Bel Air" "The Waterside" "Bayshore Park" ...
## $ DISTRICT : Factor w/ 18 levels "1","3","4","5",..: 7 7 11 12 8 3 16 8 16 16 ...
## $ AREA : num 3778 3229 2433 3531 1238 ...
## $ SALE_PSF : num 1747 1208 1554 765 1793 ...
## $ PURCHASE_PSF: num 709 483 848 331 798 524 637 803 924 301 ...
## $ HOLD_YEARS : num 14.4 12.2 17.6 11.6 12.4 22.2 11 15.3 8.2 16.3 ...
## $ MAKE : Factor w/ 2 levels "L","P": 2 2 2 2 2 2 2 2 2 2 ...
## $ PROFIT_LOSS : num 3920000 2340000 1717000 1531112 1232000 ...
## $ ANNUALISED : num 6 8 4 7 7 3 8 5 6 6 ...
## PROJECT DISTRICT AREA SALE_PSF PURCHASE_PSF HOLD_YEARS MAKE
## 1 Beverly Hill 10 3778 1747 709 14.4 P
## 2 Orchard Bel Air 10 3229 1208 483 12.2 P
## 3 The Waterside 15 2433 1554 848 17.6 P
## 4 Bayshore Park 16 3531 765 331 11.6 P
## 5 Newton Suites 11 1238 1793 798 12.4 P
## PROFIT_LOSS ANNUALISED
## 1 3920000 6
## 2 2340000 8
## 3 1717000 4
## 4 1531112 7
## 5 1232000 7
PROJECT - name of the condo.
AREA - floor area in square feet.
SALE_PSF - sale price per square foot (sold in 2017).
PURCHASE_PSF - purchase price per square foot (bought some years earlier).
HOLD_YEARS - number of years the property was held before it was sold.
MAKE - a factor in which I indicate Profit (P) or Loss (L). It is the class label: the goal is to classify a property as a likely PROFIT or LOSS based on the other independent variables.
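The notebook does not show how MAKE was built. A minimal sketch of one way to derive it from PROFIT_LOSS follows. The helper name `make_label` is hypothetical, and treating a zero or negative PROFIT_LOSS as a loss is an assumption, not something the source states.

```r
# Hypothetical helper: label a transaction "P" when PROFIT_LOSS is positive,
# otherwise "L". Assumption: a value of exactly zero is counted as a loss.
make_label <- function(profit_loss) {
  factor(ifelse(profit_loss > 0, "P", "L"), levels = c("L", "P"))
}

# Usage with the data frame shown above:
# data$MAKE <- make_label(data$PROFIT_LOSS)
```

Fixing the level order to `c("L", "P")` matches the `Factor w/ 2 levels "L","P"` shown in the `str()` output.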
# MAKE is the dependent categorical variable. The remaining variables must be numeric to use LDA.
# I removed PROJECT and DISTRICT, which I think will not affect MAKE (profit/loss).
# Since MAKE already records the outcome (profit/loss), PROFIT_LOSS and ANNUALISED are not needed,
# so I keep only columns 3 to 7 (AREA, SALE_PSF, PURCHASE_PSF, HOLD_YEARS, MAKE).
data1 <- subset(data, select = 3:7)
str(data1)
## 'data.frame': 100 obs. of 5 variables:
## $ AREA : num 3778 3229 2433 3531 1238 ...
## $ SALE_PSF : num 1747 1208 1554 765 1793 ...
## $ PURCHASE_PSF: num 709 483 848 331 798 524 637 803 924 301 ...
## $ HOLD_YEARS : num 14.4 12.2 17.6 11.6 12.4 22.2 11 15.3 8.2 16.3 ...
## $ MAKE : Factor w/ 2 levels "L","P": 2 2 2 2 2 2 2 2 2 2 ...
# Fit the LDA model and inspect it
library("MASS")
fit<-lda(MAKE ~ ., data=data1)
fit
## Call:
## lda(MAKE ~ ., data = data1)
##
## Prior probabilities of groups:
## L P
## 0.5 0.5
##
## Group means:
## AREA SALE_PSF PURCHASE_PSF HOLD_YEARS
## L 1574.56 1497.56 1704.58 6.666
## P 1987.86 1408.86 759.70 14.394
##
## Coefficients of linear discriminants:
## LD1
## AREA 0.0001890049
## SALE_PSF 0.0038187891
## PURCHASE_PSF -0.0043282253
## HOLD_YEARS 0.1545673822
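The coefficients above define a single discriminant score, LD1, for each property. As a rough sketch of how they are used: each predictor is centred on the prior-weighted average of the group means and then multiplied by its coefficient. The helper name `ld_scores` is hypothetical. It assumes `fit` is an `lda` object from MASS and that `df` contains the predictor columns used in the fit.

```r
library(MASS)

# Hypothetical helper: compute discriminant scores by hand from an lda fit.
# fit$prior, fit$means and fit$scaling are components of the MASS lda object.
ld_scores <- function(fit, df) {
  vars <- rownames(fit$scaling)
  # Prior-weighted average of the group means, one value per predictor
  centre <- colSums(fit$prior * fit$means)
  X <- as.matrix(df[, vars, drop = FALSE])
  # Subtract the centre from every row, then apply the LD coefficients
  sweep(X, 2, centre) %*% fit$scaling
}

# Usage with the objects from above:
# head(ld_scores(fit, data1))
```

With two groups and equal priors, properties with a score above zero fall on one side of the decision boundary and those below zero on the other.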
plot(fit)
library(klaR)
## Warning: package 'klaR' was built under R version 3.4.4
partimat(MAKE ~ .,data=data1,method="lda")
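The plots show how the groups separate, but not how often the model classifies a transaction correctly. One possible check, sketched here, is leave-one-out cross-validation through the `CV = TRUE` argument of `lda()`. The helper name `loo_confusion` is hypothetical, and the sketch assumes a data frame with a factor column MAKE and numeric predictors, as in `data1`.

```r
library(MASS)

# Hypothetical helper: leave-one-out cross-validated confusion matrix.
# With CV = TRUE, lda() returns the held-out class prediction for each row.
loo_confusion <- function(df) {
  cv <- lda(MAKE ~ ., data = df, CV = TRUE)
  table(actual = df$MAKE, predicted = cv$class)
}

# Usage with the data frame from above:
# cm <- loo_confusion(data1)
# cm
# sum(diag(cm)) / sum(cm)   # overall cross-validated accuracy
```

Cross-validated accuracy is usually a fairer estimate than accuracy on the training rows, since each prediction comes from a model that did not see that row.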
# Check whether the categorical variable DISTRICT is associated with MAKE
district_tb <- table(data$DISTRICT, data$MAKE)
chisq.test(district_tb)
## Warning in chisq.test(district_tb): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: district_tb
## X-squared = 21.738, df = 17, p-value = 0.195
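With a p-value of 0.195, the test gives no evidence at the 5% level that DISTRICT and MAKE are associated, which is consistent with leaving DISTRICT out of the model. The warning most likely appears because 100 observations spread over an 18 x 2 table leave many cells with small expected counts. One way to avoid relying on the chi-squared approximation, sketched below, is to ask `chisq.test()` for a Monte Carlo p-value via `simulate.p.value = TRUE`. The helper name `simulated_chisq` and the choice of `B = 10000` replicates are assumptions for illustration.

```r
# Hypothetical helper: chi-squared test of independence with a
# Monte Carlo p-value instead of the asymptotic approximation.
simulated_chisq <- function(group, outcome, B = 10000) {
  tb <- table(group, outcome)
  chisq.test(tb, simulate.p.value = TRUE, B = B)
}

# Usage with the data frame from above:
# set.seed(1)   # the simulated p-value varies slightly between runs
# simulated_chisq(data$DISTRICT, data$MAKE)
```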