Dataset from https://www.edgeprop.sg/pullout/the-edge-property. I took the property profit-and-loss transactions published over two weeks of September and three weeks of October.
setwd("E:\\interim\\property")
data <- read.csv("profit_loss.csv", header = TRUE, stringsAsFactors = FALSE)
# Convert columns to the correct classes
data$DISTRICT <- as.factor(data$DISTRICT)
data$MAKE <- as.factor(data$MAKE)
data$AREA <- as.numeric(data$AREA)
data$SALE_PSF <- as.numeric(data$SALE_PSF)
data$PURCHASE_PSF <- as.numeric(data$PURCHASE_PSF)
data$PROFIT_LOSS <- as.numeric(data$PROFIT_LOSS)
head(data)
##             PROJECT DISTRICT AREA SALE_PSF PURCHASE_PSF HOLD_YEARS MAKE
## 1      Beverly Hill       10 3778     1747          709       14.4    P
## 2   Orchard Bel Air       10 3229     1208          483       12.2    P
## 3     The Waterside       15 2433     1554          848       17.6    P
## 4     Bayshore Park       16 3531      765          331       11.6    P
## 5     Newton Suites       11 1238     1793          798       12.4    P
## 6 Mount Faber Lodge        4 2594      964          524       22.2    P
##   PROFIT_LOSS ANNUALISED
## 1     3920000          6
## 2     2340000          8
## 3     1717000          4
## 4     1531112          7
## 5     1232000          7
## 6     1140000          3
str(data)
## 'data.frame': 100 obs. of 9 variables:
## $ PROJECT : chr "Beverly Hill" "Orchard Bel Air" "The Waterside" "Bayshore Park" ...
## $ DISTRICT : Factor w/ 18 levels "1","3","4","5",..: 7 7 11 12 8 3 16 8 16 16 ...
## $ AREA : num 3778 3229 2433 3531 1238 ...
## $ SALE_PSF : num 1747 1208 1554 765 1793 ...
## $ PURCHASE_PSF: num 709 483 848 331 798 524 637 803 924 301 ...
## $ HOLD_YEARS : num 14.4 12.2 17.6 11.6 12.4 22.2 11 15.3 8.2 16.3 ...
## $ MAKE : Factor w/ 2 levels "L","P": 2 2 2 2 2 2 2 2 2 2 ...
## $ PROFIT_LOSS : num 3920000 2340000 1717000 1531112 1232000 ...
## $ ANNUALISED : num 6 8 4 7 7 3 8 5 6 6 ...
PROJECT - name of the condo.
AREA - floor area in square feet.
SALE_PSF - sale price per square foot (psf) in 2017.
PURCHASE_PSF - original purchase price psf, paid some years earlier.
HOLD_YEARS - number of years the property was held before it was sold.
PROFIT_LOSS - absolute profit or loss in dollars.
ANNUALISED - profit or loss annualised over the holding period, in %. Negative means a loss, positive a profit.
MAKE - a factor I added to flag each sale as a Profit (P) or a Loss (L). It is the target for classifying a likely profit or loss from the other, independent variables.
The variables I would use as predictors are AREA, HOLD_YEARS and PURCHASE_PSF. Why? Because these are the choices a buyer controls: how long to hold a property, how large an area to buy, and at what price to purchase at that point in time.
The condo name is not useful because it cannot be used to determine profit or loss. SALE_PSF could be a useful variable, but sellers usually go with the market price and cannot dictate it. PROFIT_LOSS and ANNUALISED are derived values, so they are not useful as predictors.
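To illustrate why PROFIT_LOSS and ANNUALISED count as derived, the sketch below recomputes them for the six rows printed by head(data). Both formulas are assumptions, not something the dataset documents: profit as AREA times the psf difference, and the annualised figure as a compound annual growth rate of psf over HOLD_YEARS. The column names profit_est and annualised_est are made up for this example.

```r
# First six rows, copied from the head(data) output above
rows <- data.frame(
  AREA         = c(3778, 3229, 2433, 3531, 1238, 2594),
  SALE_PSF     = c(1747, 1208, 1554, 765, 1793, 964),
  PURCHASE_PSF = c(709, 483, 848, 331, 798, 524),
  HOLD_YEARS   = c(14.4, 12.2, 17.6, 11.6, 12.4, 22.2),
  PROFIT_LOSS  = c(3920000, 2340000, 1717000, 1531112, 1232000, 1140000),
  ANNUALISED   = c(6, 8, 4, 7, 7, 3)
)

# Assumed formula: profit = area * (sale psf - purchase psf)
rows$profit_est <- rows$AREA * (rows$SALE_PSF - rows$PURCHASE_PSF)

# Assumed formula: compound annual growth rate of psf, in percent
rows$annualised_est <- 100 *
  ((rows$SALE_PSF / rows$PURCHASE_PSF)^(1 / rows$HOLD_YEARS) - 1)

# Compare the estimates with the published columns
print(rows[, c("PROFIT_LOSS", "profit_est", "ANNUALISED", "annualised_est")])
```

On these six rows the estimates land within 1% of PROFIT_LOSS and within one percentage point of ANNUALISED. If both columns can be rebuilt from AREA, the two psf prices and HOLD_YEARS, using them as predictors would leak the outcome into the model.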
# Split the data into profit and loss subsets
profit <- subset(data, MAKE == "P")
loss <- subset(data, MAKE == "L")
# Scatter plot of holding years against area, one symbol per class.
# Note: xlim = c(1, 20) clips holdings longer than 20 years (e.g. Mount Faber Lodge at 22.2).
plot(profit$HOLD_YEARS, profit$AREA, pch = 20, col = 2, xlim = c(1, 20), ylim = c(500, 5000),
     main = "Holding Years vs Area", xlab = "Holding Years", ylab = "Area")
points(loss$HOLD_YEARS, loss$AREA, col = 3, pch = 15)
legend("topright", legend = c("Profit", "Loss"), pch = c(20, 15), col = 2:3)
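The output that follows echoes a call to lda(), which lives in the MASS package, but the code that produced it is not shown. Below is a minimal sketch of such a fit. Because profit_loss.csv is not reproduced here, it runs on simulated stand-in data: the group means are borrowed from the printed output, while the standard deviations, the seed and the name sim are arbitrary assumptions, so its numbers will not match the real results.

```r
library(MASS)  # provides lda()

set.seed(1)  # arbitrary seed, only to make the simulation repeatable
n <- 50      # 50 sales per class, in line with the 0.5 / 0.5 priors

# Simulated stand-in for profit_loss.csv (NOT the real data)
sim <- data.frame(
  MAKE       = factor(rep(c("L", "P"), each = n)),
  HOLD_YEARS = c(rnorm(n, mean = 6.666, sd = 3),
                 rnorm(n, mean = 14.394, sd = 3)),
  AREA       = c(rnorm(n, mean = 1574.56, sd = 600),
                 rnorm(n, mean = 1987.86, sd = 600))
)

# Passing data = sim (rather than sim$MAKE ~ sim$HOLD_YEARS + ...) keeps
# predict() usable on new data frames later
lda.fit <- lda(MAKE ~ HOLD_YEARS + AREA, data = sim)
print(lda.fit)

# Confusion matrix on the training data: rows are actual, columns predicted
pred <- predict(lda.fit)$class
conf <- table(actual = sim$MAKE, predicted = pred)
print(conf)
```

Swapping sim for the real data frame gives the same two-predictor model as the echoed call, with the formula written against column names instead of data$ references.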
## Call:
## lda(data$MAKE ~ data$HOLD_YEARS + data$AREA, na.action = "na.omit")
##
## Prior probabilities of groups:
##   L   P
## 0.5 0.5
##
## Group means:
##   data$HOLD_YEARS data$AREA
## L           6.666   1574.56
## P          14.394   1987.86
##
## Coefficients of linear discriminants:
##                          LD1
## data$HOLD_YEARS 0.2877120804
## data$AREA       0.0002344837
##
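As a worked example of how these numbers separate the two groups, the sketch below plugs the printed coefficients and group means into the linear discriminant score LD1 = 0.2877 * HOLD_YEARS + 0.00023 * AREA. With equal priors, a two-class LDA assigns an observation to the group whose projected mean is nearer, which amounts to a cut-off at the midpoint of the two projected means. The helper classify_sale() is hypothetical, and it works on uncentred scores, which shifts the cut-off relative to predict() but not the resulting classes.

```r
# Coefficients and group means copied from the lda() output above
coefs  <- c(HOLD_YEARS = 0.2877120804, AREA = 0.0002344837)
mean_L <- c(HOLD_YEARS = 6.666,  AREA = 1574.56)
mean_P <- c(HOLD_YEARS = 14.394, AREA = 1987.86)

# Project each group mean onto LD1
score_L <- sum(coefs * mean_L)
score_P <- sum(coefs * mean_P)

# Equal priors: the cut-off is the midpoint of the projected means
cutoff <- (score_L + score_P) / 2

# Hypothetical helper: classify a sale from its holding years and area
classify_sale <- function(hold_years, area) {
  score <- coefs[["HOLD_YEARS"]] * hold_years + coefs[["AREA"]] * area
  ifelse(score > cutoff, "P", "L")
}

# The six sales shown by head(data), all recorded as profits
classify_sale(
  hold_years = c(14.4, 12.2, 17.6, 11.6, 12.4, 22.2),
  area       = c(3778, 3229, 2433, 3531, 1238, 2594)
)
```

All six sales score above the cut-off of about 3.45, so the rule labels them P, in line with their MAKE value. The HOLD_YEARS term dominates the score: one extra year of holding moves it as much as roughly 1,200 extra square feet.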
##      L  P
##   L 48  2
##   P  8 42
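The confusion matrix above can be summarised as an accuracy figure. A short sketch, assuming rows are the actual classes and columns the predicted ones. The source does not label the axes, but the row totals of 50 each agree with the equal group sizes implied by the priors.

```r
# Counts copied from the confusion matrix above
# Assumption: rows = actual class, columns = predicted class
conf <- matrix(c(48,  2,
                  8, 42),
               nrow = 2, byrow = TRUE,
               dimnames = list(actual = c("L", "P"), predicted = c("L", "P")))

accuracy <- sum(diag(conf)) / sum(conf)  # share of correct predictions
recall   <- diag(conf) / rowSums(conf)   # per-class share correctly identified

print(accuracy)
print(recall)
```

That is 90 of 100 sales classified correctly. Under the stated assumption, 48 of 50 losses and 42 of 50 profits are recovered, so the model misses profits more often than it misses losses.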