Project Description

This project is a study of machine learning algorithms, covering classical methods such as linear regression, regularized regression, decision trees, and random forests. It aims to predict house prices from features such as the number of bedrooms and bathrooms, living area, and number of views. The project also covers exporting the trained model for further use.

About Dataset

This dataset contains information about the prices of houses in India. It includes several features that affect the price of a house, including living area, number of views, grade of the house, number of bedrooms, number of bathrooms, number of floors, built year, latitude, longitude, and others. You can access the source of the data from this link: House Price in India

No.  Attribute                                Type of Data
1    id                                       Integer
2    Date                                     Integer (Excel date serial)
3    No. of bedrooms                          Integer
4    No. of bathrooms                         Decimal
5    Living area                              Integer
6    Lot area                                 Integer
7    No. of floors                            Decimal
8    Waterfront present                       Integer
9    No. of views                             Integer
10   Condition of the house                   Integer
11   Grade of the house                       Integer
12   Area of the house (excluding basement)   Integer
13   Area of the basement                     Integer
14   Built Year                               Integer (year)
15   Renovation Year                          Integer
16   Postal Code                              Integer
17   Lattitude                                Decimal
18   Longitude                                Decimal
19   Living_area_renov                        Integer
20   Lot_area_renov                           Integer
21   Number of schools nearby                 Integer
22   Distance from the airport                Integer
23   Price                                    Integer

Scope of Project

Let’s start!

Environment Set up

  • Import all related libraries as follows:
    • tidyverse : for data manipulation, analysis, and visualization in R
    • readxl : reads Excel files into R data frames
    • caret : provides tools for model training, tuning, and evaluation in R
    • rpart : builds and analyzes regression and classification trees in R
    • rpart.plot : generates graphical representations of tree structure, branch splits, and predictions
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.1
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(rpart)
## Warning: package 'rpart' was built under R version 4.3.1
library(rpart.plot)

Simple Machine Learning Pipeline

  1. Data Preparation / Data Cleansing
  2. Split Data -> Train data and Test data
  3. Train Model
  4. Score Model or Prediction
  5. Evaluate Model

Data Preparation / Data Cleansing

(0-1) Read the House Price India Excel file into a data frame named “dataset” and preview it.

# Read the Excel file into a data frame (read_excel already returns a tibble)
dataset <- read_excel('/Users/j.nrup/Documents/Data Project/House Price India.xlsx')
head(dataset)
## # A tibble: 6 × 23
##           id  Date `number of bedrooms` `number of bathrooms` `living area`
##        <dbl> <dbl>                <dbl>                 <dbl>         <dbl>
## 1 6762810145 42491                    5                  2.5           3650
## 2 6762810635 42491                    4                  2.5           2920
## 3 6762810998 42491                    5                  2.75          2910
## 4 6762812605 42491                    4                  2.5           3310
## 5 6762812919 42491                    3                  2             2710
## 6 6762813105 42491                    3                  2.5           2600
## # ℹ 18 more variables: `lot area` <dbl>, `number of floors` <dbl>,
## #   `waterfront present` <dbl>, `number of views` <dbl>,
## #   `condition of the house` <dbl>, `grade of the house` <dbl>,
## #   `Area of the house(excluding basement)` <dbl>,
## #   `Area of the basement` <dbl>, `Built Year` <dbl>, `Renovation Year` <dbl>,
## #   `Postal Code` <dbl>, Lattitude <dbl>, Longitude <dbl>,
## #   living_area_renov <dbl>, lot_area_renov <dbl>, …

(0-2) Check for null values and remove them if present.

if (mean(complete.cases(dataset)) != 1) {
  # Drop rows containing missing values
  clean_data <- drop_na(dataset)
  print("Removed nulls completely!")
  mean(complete.cases(clean_data))
} else {
  clean_data <- dataset
  print("Data was clean!")
}
## [1] "Data was clean!"

(0-3) Remove columns that will not be used as features (independent variables), such as identifier, date, and location columns.

clean_data <- clean_data[, !(names(clean_data) %in% c("id","Date","Built Year","Renovation Year","Postal Code","Lattitude","Longitude"))]
clean_data
## # A tibble: 14,620 × 16
##    `number of bedrooms` `number of bathrooms` `living area` `lot area`
##                   <dbl>                 <dbl>         <dbl>      <dbl>
##  1                    5                  2.5           3650       9050
##  2                    4                  2.5           2920       4000
##  3                    5                  2.75          2910       9480
##  4                    4                  2.5           3310      42998
##  5                    3                  2             2710       4500
##  6                    3                  2.5           2600       4750
##  7                    5                  3.25          3660      11995
##  8                    3                  1.75          2240      10578
##  9                    3                  2.5           2390       6550
## 10                    4                  2.25          2200      11250
## # ℹ 14,610 more rows
## # ℹ 12 more variables: `number of floors` <dbl>, `waterfront present` <dbl>,
## #   `number of views` <dbl>, `condition of the house` <dbl>,
## #   `grade of the house` <dbl>, `Area of the house(excluding basement)` <dbl>,
## #   `Area of the basement` <dbl>, living_area_renov <dbl>,
## #   lot_area_renov <dbl>, `Number of schools nearby` <dbl>,
## #   `Distance from the airport` <dbl>, Price <dbl>

(0-4) Rename all columns to snake_case.

col_name_mappings <- c(
  "number of bedrooms" = "no_of_bedrooms",
  "number of bathrooms" = "no_of_bathrooms",
  "living area" = "living_area",
  "lot area" = "lot_area",
  "number of floors" = "no_of_floors",
  "waterfront present" = "waterfront",
  "number of views" = "no_of_views",
  "condition of the house" = "condition_house",
  "grade of the house" = "grade_house",
  "Area of the house(excluding basement)" = "area_house",
  "Area of the basement" = "area_basement",
  "living_area_renov" = "living_renov",
  "lot_area_renov" = "lot_renov",
  "Number of schools nearby" = "no_of_schools_nearby",
  "Distance from the airport" = "distance_airport",
  "Price" = "price"
)

# Rename columns using the mappings
colnames(clean_data) <- sapply(colnames(clean_data), function(col) col_name_mappings[col])
clean_data
## # A tibble: 14,620 × 16
##    no_of_bedrooms no_of_bathrooms living_area lot_area no_of_floors waterfront
##             <dbl>           <dbl>       <dbl>    <dbl>        <dbl>      <dbl>
##  1              5            2.5         3650     9050          2            0
##  2              4            2.5         2920     4000          1.5          0
##  3              5            2.75        2910     9480          1.5          0
##  4              4            2.5         3310    42998          2            0
##  5              3            2           2710     4500          1.5          0
##  6              3            2.5         2600     4750          1            0
##  7              5            3.25        3660    11995          2            0
##  8              3            1.75        2240    10578          2            0
##  9              3            2.5         2390     6550          1            0
## 10              4            2.25        2200    11250          1.5          0
## # ℹ 14,610 more rows
## # ℹ 10 more variables: no_of_views <dbl>, condition_house <dbl>,
## #   grade_house <dbl>, area_house <dbl>, area_basement <dbl>,
## #   living_renov <dbl>, lot_renov <dbl>, no_of_schools_nearby <dbl>,
## #   distance_airport <dbl>, price <dbl>
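For reference, the same rename can be done in a single vectorized step, assuming every current column name appears in the mapping:

# Sketch: look up all column names in the mapping at once
colnames(clean_data) <- unname(col_name_mappings[colnames(clean_data)])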

(0-5) Check the distribution of house prices with a histogram.

ggplot(data = clean_data, mapping = aes(x = price)) + 
  geom_histogram(bins=30, fill = "#F5AD9E") + 
  labs(title = "Distribution of House Price") + 
  theme_minimal()

Analyzing the histogram reveals a right-skewed distribution in the price variable. A skewed target tends to produce skewed residuals in a linear regression model, violating its assumptions and hurting accuracy. To address this, we can apply a log transformation to the target variable (price), making its distribution more symmetric and improving the model’s fit.

(0-6) Apply a log transformation.

data_lm <- clean_data %>%
  mutate(log_price = log(price))
data_lm
## # A tibble: 14,620 × 17
##    no_of_bedrooms no_of_bathrooms living_area lot_area no_of_floors waterfront
##             <dbl>           <dbl>       <dbl>    <dbl>        <dbl>      <dbl>
##  1              5            2.5         3650     9050          2            0
##  2              4            2.5         2920     4000          1.5          0
##  3              5            2.75        2910     9480          1.5          0
##  4              4            2.5         3310    42998          2            0
##  5              3            2           2710     4500          1.5          0
##  6              3            2.5         2600     4750          1            0
##  7              5            3.25        3660    11995          2            0
##  8              3            1.75        2240    10578          2            0
##  9              3            2.5         2390     6550          1            0
## 10              4            2.25        2200    11250          1.5          0
## # ℹ 14,610 more rows
## # ℹ 11 more variables: no_of_views <dbl>, condition_house <dbl>,
## #   grade_house <dbl>, area_house <dbl>, area_basement <dbl>,
## #   living_renov <dbl>, lot_renov <dbl>, no_of_schools_nearby <dbl>,
## #   distance_airport <dbl>, price <dbl>, log_price <dbl>

(0-7) Check the distribution of house prices after applying the log transformation with a histogram.

ggplot(data = data_lm, mapping = aes(x = log_price)) +
  geom_histogram(bins = 30, fill = "#D9F588") +
  labs(title = "Distribution of Log Price") +
  theme_minimal()

Analyzing the histogram of house prices after the log transformation reveals an approximately normal, bell-shaped distribution. We can therefore use the transformed target to build a linear regression model.

Split Data : create a function to split the data into train and test sets.

split_func <- function(data, train_size = 0.8) {
  set.seed(42)                               # fix the seed for a reproducible split
  n <- nrow(data)
  id <- sample(1:n, size = n * train_size)   # randomly pick row indices for training
  train_data <- data[id, ]
  test_data <- data[-id, ]
  list(train = train_data, test = test_data)
}

pre_data <- split_func(data_lm)

trainData <- pre_data[[1]]
testData <- pre_data[[2]]
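caret also provides a ready-made splitter that stratifies on the outcome; a minimal sketch, using the hypothetical names train_alt/test_alt so as not to overwrite the split above:

# Sketch: stratified 80/20 split with caret (balances the outcome across splits)
set.seed(42)
idx <- createDataPartition(data_lm$log_price, p = 0.8, list = FALSE)
train_alt <- data_lm[idx, ]
test_alt <- data_lm[-idx, ]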

Train Model
(2-1) Train Model : Algorithm Selection -> Linear Regression

# Train Model
set.seed(40)
lmModel <- train(log_price ~ . - price,
                  data = trainData,
                  method = "lm")
lmModel
## Linear Regression 
## 
## 11696 samples
##    16 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 11696, 11696, 11696, 11696, 11696, 11696, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.3258655  0.6185114  0.2622014
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

(2-2) Show Linear Regression Equation.

print("Regression Equation: ") 
## [1] "Regression Equation: "
lmModel$finalModel
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Coefficients:
##          (Intercept)        no_of_bedrooms       no_of_bathrooms  
##            1.074e+01            -2.616e-02            -1.031e-02  
##          living_area              lot_area          no_of_floors  
##            2.759e-04             1.800e-07             8.001e-02  
##           waterfront           no_of_views       condition_house  
##            3.760e-01             5.577e-02             1.025e-01  
##          grade_house            area_house         area_basement  
##            1.804e-01            -1.250e-04                    NA  
##         living_renov             lot_renov  no_of_schools_nearby  
##            8.597e-05            -8.637e-07             3.048e-03  
##     distance_airport  
##            2.747e-04

(2-3) Analyze significant predictors. (The NA coefficient for area_basement above signals perfect collinearity, likely because the living area is the sum of the above-ground and basement areas, so lm() dropped the redundant term.)

varImp(lmModel)
## lm variable importance
## 
##                        Overall
## grade_house          100.00000
## living_area           74.04746
## condition_house       53.82683
## area_house            32.91557
## no_of_views           28.94938
## living_renov          27.95058
## no_of_floors          24.83735
## waterfront            23.97325
## no_of_bedrooms        13.53206
## lot_renov             11.63298
## lot_area               2.18204
## no_of_bathrooms        1.86223
## no_of_schools_nearby   0.03208
## distance_airport       0.00000
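The same ranking can be plotted directly; a minimal sketch:

# Sketch: visualize the top ten predictors by importance
plot(varImp(lmModel), top = 10)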

Score Model

# Predict unseen data
pTrain_LM <- predict(lmModel, newdata = trainData)
unlog_pTrain_LM <- exp(pTrain_LM)   # back-transform predictions from the log scale
pTest_LM <- predict(lmModel, newdata = testData)
unlog_pTest_LM <- exp(pTest_LM)

Evaluate Model using MAE, MSE, RMSE

# Function to calculate MAE (mean absolute error)
calcu_mae <- function(actual, pred) {
  error <- actual - pred
  mean(abs(error))
}

# Function to calculate MSE (mean squared error)
calcu_mse <- function(actual, pred) {
  error <- actual - pred
  mean(error^2)
}

# Function to calculate RMSE (root mean squared error)
calcu_rmse <- function(actual, pred) {
  error <- actual - pred
  sqrt(mean(error^2))
}
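caret also ships an equivalent helper, which is handy for cross-checking the hand-rolled functions; a minimal sketch:

# Sketch: postResample() returns RMSE, R-squared, and MAE in one call
postResample(pred = unlog_pTest_LM, obs = testData$price)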

MAETrain <- calcu_mae(trainData$price, unlog_pTrain_LM)
MAETest <- calcu_mae(testData$price, unlog_pTest_LM)
MSETrain <- calcu_mse(trainData$price, unlog_pTrain_LM)
MSETest <- calcu_mse(testData$price, unlog_pTest_LM)
RMSETrain <- calcu_rmse(trainData$price, unlog_pTrain_LM)
RMSETest <- calcu_rmse(testData$price, unlog_pTest_LM)

result <- c("MAE", "MSE", "RMSE")
Train <- c(MAETrain, MSETrain, RMSETrain)
Test <- c(MAETest, MSETest, RMSETest)

result_df <- data.frame(result,Train, Test)
result_df
##   result        Train         Test
## 1    MAE 1.373709e+05 1.431174e+05
## 2    MSE 4.497051e+10 8.886660e+10
## 3   RMSE 2.120625e+05 2.981050e+05

To evaluate the performance of various algorithms, we employ re-sampling techniques, pre-processing, hyper-parameter tuning, and a diverse set of machine learning models, as follows:

First step

  • Algorithm : Regularized Regression
  • Re-sampling Technique : K-Fold Cross Validation (create a train control object to configure the training process)
  • Hyper-parameter tuning : create my_grid for the hyper-parameter tuning process
  1. Train model
## create train control
set.seed(42)
ctrl_cv <- trainControl(method = "cv",
                        number = 8,
                        verboseIter = TRUE)
## create my_grid
my_grid <- expand.grid(alpha = 0:1,
                       lambda = seq(0.0005, 0.05, length = 20))
## train model
glmModel_cv <- train(log_price ~ . - price,
                     data = trainData,
                     method = "glmnet",
                     tuneGrid = my_grid,
                     trControl = ctrl_cv)
## + Fold1: alpha=0, lambda=0.05 
## - Fold1: alpha=0, lambda=0.05 
## + Fold1: alpha=1, lambda=0.05 
## - Fold1: alpha=1, lambda=0.05 
## + Fold2: alpha=0, lambda=0.05 
## - Fold2: alpha=0, lambda=0.05 
## + Fold2: alpha=1, lambda=0.05 
## - Fold2: alpha=1, lambda=0.05 
## + Fold3: alpha=0, lambda=0.05 
## - Fold3: alpha=0, lambda=0.05 
## + Fold3: alpha=1, lambda=0.05 
## - Fold3: alpha=1, lambda=0.05 
## + Fold4: alpha=0, lambda=0.05 
## - Fold4: alpha=0, lambda=0.05 
## + Fold4: alpha=1, lambda=0.05 
## - Fold4: alpha=1, lambda=0.05 
## + Fold5: alpha=0, lambda=0.05 
## - Fold5: alpha=0, lambda=0.05 
## + Fold5: alpha=1, lambda=0.05 
## - Fold5: alpha=1, lambda=0.05 
## + Fold6: alpha=0, lambda=0.05 
## - Fold6: alpha=0, lambda=0.05 
## + Fold6: alpha=1, lambda=0.05 
## - Fold6: alpha=1, lambda=0.05 
## + Fold7: alpha=0, lambda=0.05 
## - Fold7: alpha=0, lambda=0.05 
## + Fold7: alpha=1, lambda=0.05 
## - Fold7: alpha=1, lambda=0.05 
## + Fold8: alpha=0, lambda=0.05 
## - Fold8: alpha=0, lambda=0.05 
## + Fold8: alpha=1, lambda=0.05 
## - Fold8: alpha=1, lambda=0.05 
## Aggregating results
## Selecting tuning parameters
## Fitting alpha = 1, lambda = 5e-04 on full training set
print("Regularized Regression with K-Fold Cross Validation")
## [1] "Regularized Regression with K-Fold Cross Validation"
print(glmModel_cv)
## glmnet 
## 
## 11696 samples
##    16 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (8 fold) 
## Summary of sample sizes: 10233, 10235, 10234, 10233, 10234, 10235, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda       RMSE       Rsquared   MAE      
##   0      0.000500000  0.3257580  0.6167350  0.2628317
##   0      0.003105263  0.3257580  0.6167350  0.2628317
##   0      0.005710526  0.3257580  0.6167350  0.2628317
##   0      0.008315789  0.3257580  0.6167350  0.2628317
##   0      0.010921053  0.3257580  0.6167350  0.2628317
##   0      0.013526316  0.3257580  0.6167350  0.2628317
##   0      0.016131579  0.3257580  0.6167350  0.2628317
##   0      0.018736842  0.3257580  0.6167350  0.2628317
##   0      0.021342105  0.3257580  0.6167350  0.2628317
##   0      0.023947368  0.3257580  0.6167350  0.2628317
##   0      0.026552632  0.3257580  0.6167350  0.2628317
##   0      0.029157895  0.3257580  0.6167350  0.2628317
##   0      0.031763158  0.3257580  0.6167350  0.2628317
##   0      0.034368421  0.3257580  0.6167350  0.2628317
##   0      0.036973684  0.3257580  0.6167350  0.2628317
##   0      0.039578947  0.3258168  0.6166421  0.2628982
##   0      0.042184211  0.3258894  0.6165261  0.2629796
##   0      0.044789474  0.3259654  0.6164048  0.2630629
##   0      0.047394737  0.3260421  0.6162841  0.2631437
##   0      0.050000000  0.3261214  0.6161592  0.2632243
##   1      0.000500000  0.3252103  0.6176166  0.2619671
##   1      0.003105263  0.3253619  0.6173417  0.2621559
##   1      0.005710526  0.3257282  0.6166667  0.2624643
##   1      0.008315789  0.3263328  0.6155134  0.2629347
##   1      0.010921053  0.3269387  0.6144304  0.2634140
##   1      0.013526316  0.3275616  0.6133830  0.2639171
##   1      0.016131579  0.3282355  0.6122821  0.2644610
##   1      0.018736842  0.3289728  0.6110861  0.2650516
##   1      0.021342105  0.3298169  0.6096793  0.2657079
##   1      0.023947368  0.3307558  0.6080807  0.2664212
##   1      0.026552632  0.3316329  0.6066865  0.2670887
##   1      0.029157895  0.3324365  0.6055453  0.2677060
##   1      0.031763158  0.3333017  0.6042961  0.2683570
##   1      0.034368421  0.3342383  0.6029035  0.2690522
##   1      0.036973684  0.3352451  0.6013644  0.2697977
##   1      0.039578947  0.3363186  0.5996818  0.2705851
##   1      0.042184211  0.3373999  0.5980036  0.2713820
##   1      0.044789474  0.3385012  0.5962947  0.2722065
##   1      0.047394737  0.3396028  0.5946100  0.2730367
##   1      0.050000000  0.3407642  0.5927932  0.2739074
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 5e-04.
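The coefficients retained by the selected model can be inspected at the chosen penalty; a minimal sketch:

# Sketch: coefficients of the final glmnet fit at the selected lambda
coef(glmModel_cv$finalModel, s = glmModel_cv$bestTune$lambda)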
  2. Score model
## Predict Unseen data
pTrain_glm_cv <- predict(glmModel_cv, newdata = trainData)
unlog_pTrain_glm_cv <- exp(pTrain_glm_cv)
pTest_glm_cv <- predict(glmModel_cv, newdata = testData)
unlog_pTest_glm_cv <- exp(pTest_glm_cv)
  3. Evaluate model
MAETrain_glm_cv <- calcu_mae(trainData$price, unlog_pTrain_glm_cv)
MAETest_glm_cv <- calcu_mae(testData$price, unlog_pTest_glm_cv)
MSETrain_glm_cv <- calcu_mse(trainData$price, unlog_pTrain_glm_cv)
MSETest_glm_cv <- calcu_mse(testData$price, unlog_pTest_glm_cv)
RMSETrain_glm_cv <- calcu_rmse(trainData$price, unlog_pTrain_glm_cv)
RMSETest_glm_cv <- calcu_rmse(testData$price, unlog_pTest_glm_cv)

RMSE_of_glmnet <- c("MAE","MSE","RMSE")
Train_glmnet <- c(MAETrain_glm_cv,MSETrain_glm_cv,RMSETrain_glm_cv)
Test_glmnet <- c(MAETest_glm_cv,MSETest_glm_cv,RMSETest_glm_cv)

RMSE_of_glmnet_df <- data.frame(RMSE_of_glmnet,Train_glmnet, Test_glmnet)
RMSE_of_glmnet_df
##   RMSE_of_glmnet Train_glmnet  Test_glmnet
## 1            MAE 1.373003e+05 1.429573e+05
## 2            MSE 4.476356e+10 8.788811e+10
## 3           RMSE 2.115740e+05 2.964593e+05

Next step

  • Algorithm : Decision Tree
  • Re-sampling Technique : K-Fold Cross Validation
  • Hyper-parameter tuning
  1. Train model
## create train control
set.seed(42)
ctrl_tree <- trainControl(method = "cv",
                          number = 8,
                          verboseIter = TRUE)
## train model
tree_model <- train(log_price ~ . - price,
                    data = trainData,
                    method = "rpart",
                    tuneGrid = expand.grid(cp = c(0.02,0.1,0.25)),
                    trControl = ctrl_tree)
## + Fold1: cp=0.02 
## - Fold1: cp=0.02 
## + Fold2: cp=0.02 
## - Fold2: cp=0.02 
## + Fold3: cp=0.02 
## - Fold3: cp=0.02 
## + Fold4: cp=0.02 
## - Fold4: cp=0.02 
## + Fold5: cp=0.02 
## - Fold5: cp=0.02 
## + Fold6: cp=0.02 
## - Fold6: cp=0.02 
## + Fold7: cp=0.02 
## - Fold7: cp=0.02 
## + Fold8: cp=0.02 
## - Fold8: cp=0.02 
## Aggregating results
## Selecting tuning parameters
## Fitting cp = 0.02 on full training set
print("Decision Tree Model with K-Fold Cross Validation")
## [1] "Decision Tree Model with K-Fold Cross Validation"
print(tree_model)
## CART 
## 
## 11696 samples
##    16 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (8 fold) 
## Summary of sample sizes: 10233, 10235, 10234, 10233, 10234, 10235, ... 
## Resampling results across tuning parameters:
## 
##   cp    RMSE       Rsquared   MAE      
##   0.02  0.3719093  0.5000840  0.2975066
##   0.10  0.4270923  0.3410929  0.3393977
##   0.25  0.4270923  0.3410929  0.3393977
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.02.

Decision Tree Model Visualization

rpart.plot(tree_model$finalModel)
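rpart.plot accepts arguments to adjust the rendering; a minimal sketch with a few optional settings:

# Sketch: a more detailed rendering of the same tree
rpart.plot(tree_model$finalModel, type = 3, digits = 3, fallen.leaves = TRUE)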

  2. Score model
## Predict Unseen data
pTrain_tree_cv <- predict(tree_model, newdata = trainData)
unlog_pTrain_tree_cv <- exp(pTrain_tree_cv)
pTest_tree_cv <- predict(tree_model, newdata = testData)
unlog_pTest_tree_cv <- exp(pTest_tree_cv)
  3. Evaluate model
MAETrain_tree_cv <- calcu_mae(trainData$price, unlog_pTrain_tree_cv)
MAETest_tree_cv <- calcu_mae(testData$price, unlog_pTest_tree_cv)
MSETrain_tree_cv <- calcu_mse(trainData$price, unlog_pTrain_tree_cv)
MSETest_tree_cv <- calcu_mse(testData$price, unlog_pTest_tree_cv)
RMSETrain_tree_cv <- calcu_rmse(trainData$price, unlog_pTrain_tree_cv)
RMSETest_tree_cv <- calcu_rmse(testData$price, unlog_pTest_tree_cv)

RMSE_of_tree <- c("MAE","MSE","RMSE")
Train_tree <- c(MAETrain_tree_cv,MSETrain_tree_cv,RMSETrain_tree_cv)
Test_tree <- c(MAETest_tree_cv,MSETest_tree_cv,RMSETest_tree_cv)

RMSE_of_tree_df <- data.frame(RMSE_of_tree,Train_tree, Test_tree)
RMSE_of_tree_df
##   RMSE_of_tree   Train_tree    Test_tree
## 1          MAE 1.592545e+05 1.654061e+05
## 2          MSE 6.880517e+10 1.004400e+11
## 3         RMSE 2.623074e+05 3.169227e+05

Last step

  • Algorithm : Random Forest and Neural Network
  • Re-sampling Technique : K-Fold Cross Validation
  • Hyper-parameter tuning
  1. Train model
set.seed(42)
ctrl_rf_nn <- trainControl(method = "cv",
                           number = 5,
                           verboseIter = TRUE)
rf_mod <- train(log_price ~ . - price,
                data = trainData,
                method = "rf",
                tuneLength = 5,
                trControl = ctrl_rf_nn)
## + Fold1: mtry= 2 
## - Fold1: mtry= 2 
## + Fold1: mtry= 5 
## - Fold1: mtry= 5 
## + Fold1: mtry= 8 
## - Fold1: mtry= 8 
## + Fold1: mtry=11 
## - Fold1: mtry=11 
## + Fold1: mtry=15 
## - Fold1: mtry=15 
## + Fold2: mtry= 2 
## - Fold2: mtry= 2 
## + Fold2: mtry= 5 
## - Fold2: mtry= 5 
## + Fold2: mtry= 8 
## - Fold2: mtry= 8 
## + Fold2: mtry=11 
## - Fold2: mtry=11 
## + Fold2: mtry=15 
## - Fold2: mtry=15 
## + Fold3: mtry= 2 
## - Fold3: mtry= 2 
## + Fold3: mtry= 5 
## - Fold3: mtry= 5 
## + Fold3: mtry= 8 
## - Fold3: mtry= 8 
## + Fold3: mtry=11 
## - Fold3: mtry=11 
## + Fold3: mtry=15 
## - Fold3: mtry=15 
## + Fold4: mtry= 2 
## - Fold4: mtry= 2 
## + Fold4: mtry= 5 
## - Fold4: mtry= 5 
## + Fold4: mtry= 8 
## - Fold4: mtry= 8 
## + Fold4: mtry=11 
## - Fold4: mtry=11 
## + Fold4: mtry=15 
## - Fold4: mtry=15 
## + Fold5: mtry= 2 
## - Fold5: mtry= 2 
## + Fold5: mtry= 5 
## - Fold5: mtry= 5 
## + Fold5: mtry= 8 
## - Fold5: mtry= 8 
## + Fold5: mtry=11 
## - Fold5: mtry=11 
## + Fold5: mtry=15 
## - Fold5: mtry=15 
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 5 on full training set
nn_mod <- train(log_price ~ . - price,
                data = trainData,
                method = "nnet",
                tuneLength = 5,
                trControl = ctrl_rf_nn)
## + Fold1: size=1, decay=0e+00
## # weights:  18
## initial  value 1507155.247318
## final  value 1359992.241451
## converged
## - Fold1: size=1, decay=0e+00
## ... (verbose nnet training log for the remaining tuning parameters and folds truncated)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
## Aggregating results
## Selecting tuning parameters
## Fitting size = 1, decay = 0 on full training set
## # weights:  18
## initial  value 1832341.846574 
## final  value 1699711.313709 
## converged
print("Random Forest Model with K-Fold Cross Validation")
## [1] "Random Forest Model with K-Fold Cross Validation"
print(rf_mod)
## Random Forest 
## 
## 11696 samples
##    16 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 9356, 9358, 9357, 9357, 9356 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared   MAE      
##    2    0.2935395  0.6926958  0.2332635
##    5    0.2902269  0.6960681  0.2286713
##    8    0.2909088  0.6941460  0.2285693
##   11    0.2915702  0.6925247  0.2286666
##   15    0.2926614  0.6901003  0.2294052
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 5.
print("Neural Network Model with K-Fold Cross Validation")
## [1] "Neural Network Model with K-Fold Cross Validation"
print(nn_mod)
## Neural Network 
## 
## 11696 samples
##    16 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 9357, 9357, 9357, 9357, 9356 
## Resampling results across tuning parameters:
## 
##   size  decay  RMSE      Rsquared      MAE     
##   1     0e+00  12.05505           NaN  12.04357
##   1     1e-04  12.05505  1.493826e-04  12.04357
##   1     1e-03  12.05513  7.360731e-04  12.04366
##   1     1e-02  12.05505  5.420676e-04  12.04357
##   1     1e-01  12.05505           NaN  12.04358
##   3     0e+00  12.05505           NaN  12.04357
##   3     1e-04  12.05505  7.433719e-05  12.04358
##   3     1e-03  12.05505  1.105221e-03  12.04357
##   3     1e-02  12.05505  1.296808e-03  12.04357
##   3     1e-01  12.05505  2.347539e-03  12.04357
##   5     0e+00  12.05505           NaN  12.04357
##   5     1e-04  12.05505  3.843334e-04  12.04357
##   5     1e-03  12.05505           NaN  12.04357
##   5     1e-02  12.05505  7.755151e-04  12.04357
##   5     1e-01  12.05505  2.475860e-03  12.04357
##   7     0e+00  12.05505           NaN  12.04357
##   7     1e-04  12.05505  8.809214e-05  12.04357
##   7     1e-03  12.05505  9.256058e-04  12.04357
##   7     1e-02  12.05505  3.202389e-05  12.04357
##   7     1e-01  12.05505  1.696830e-03  12.04357
##   9     0e+00  12.05505           NaN  12.04357
##   9     1e-04  12.05505  3.132742e-04  12.04357
##   9     1e-03  12.05505  5.880642e-04  12.04357
##   9     1e-02  12.05505  6.748310e-04  12.04357
##   9     1e-01  12.05505  4.487665e-03  12.04357
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 1 and decay = 0.
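Note that the Neural Network’s RMSE (about 12.05 on the log scale, nearly the entire magnitude of log_price itself) indicates that nnet did not learn at all: by default nnet uses a sigmoid output unit bounded to [0, 1], which cannot reach log-price values near 12–13. A hedged fix, not run here, is to request a linear output unit (and scale the predictors, which helps nnet converge):

## Sketch: nnet regression with a linear output unit and scaled inputs
nn_mod <- train(log_price ~ . - price,
                data = trainData,
                method = "nnet",
                tuneLength = 5,
                trControl = ctrl_rf_nn,
                preProcess = c("center", "scale"),
                linout = TRUE,    # linear output unit for regression
                trace = FALSE)    # suppress the verbose optimizer log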
  2. Score model
## Predict Unseen data
pTrain_rf_cv <- predict(rf_mod, newdata = trainData)
unlog_pTrain_rf_cv <- exp(pTrain_rf_cv)
pTest_rf_cv <- predict(rf_mod, newdata = testData)
unlog_pTest_rf_cv <- exp(pTest_rf_cv)

pTrain_nn_cv <- predict(nn_mod, newdata = trainData)
unlog_pTrain_nn_cv <- exp(pTrain_nn_cv)
pTest_nn_cv <- predict(nn_mod, newdata = testData)
unlog_pTest_nn_cv <- exp(pTest_nn_cv)
  3. Evaluate model
MAETrain_rf_cv <- calcu_mae(trainData$price, unlog_pTrain_rf_cv)
MAETest_rf_cv <- calcu_mae(testData$price, unlog_pTest_rf_cv)
MSETrain_rf_cv <- calcu_mse(trainData$price, unlog_pTrain_rf_cv)
MSETest_rf_cv <- calcu_mse(testData$price, unlog_pTest_rf_cv)
RMSETrain_rf_cv <- calcu_rmse(trainData$price, unlog_pTrain_rf_cv)
RMSETest_rf_cv <- calcu_rmse(testData$price, unlog_pTest_rf_cv)

MAETrain_nn_cv <- calcu_mae(trainData$price, unlog_pTrain_nn_cv)
MAETest_nn_cv <- calcu_mae(testData$price, unlog_pTest_nn_cv)
MSETrain_nn_cv <- calcu_mse(trainData$price, unlog_pTrain_nn_cv)
MSETest_nn_cv <- calcu_mse(testData$price, unlog_pTest_nn_cv)
RMSETrain_nn_cv <- calcu_rmse(trainData$price, unlog_pTrain_nn_cv)
RMSETest_nn_cv <- calcu_rmse(testData$price, unlog_pTest_nn_cv)

RMSE_of_rf <- c("MAE","MSE","RMSE")
Train_rf <- c(MAETrain_rf_cv,MSETrain_rf_cv,RMSETrain_rf_cv)
Test_rf <- c(MAETest_rf_cv,MSETest_rf_cv,RMSETest_rf_cv)

RMSE_of_nn <- c("MAE","MSE","RMSE")
Train_nn <- c(MAETrain_nn_cv,MSETrain_nn_cv,RMSETrain_nn_cv)
Test_nn <- c(MAETest_nn_cv,MSETest_nn_cv,RMSETest_nn_cv)

RMSE_of_rf_df <- data.frame(RMSE_of_rf,Train_rf, Test_rf)
RMSE_of_nn_df <- data.frame(RMSE_of_nn,Train_nn, Test_nn)

print(RMSE_of_rf_df)
##   RMSE_of_rf     Train_rf      Test_rf
## 1        MAE 5.407660e+04 1.240817e+05
## 2        MSE 9.016143e+09 5.306791e+10
## 3       RMSE 9.495338e+04 2.303647e+05
print(RMSE_of_nn_df)
##   RMSE_of_nn     Train_nn      Test_nn
## 1        MAE 5.372778e+05 5.455362e+05
## 2        MSE 4.175130e+11 4.575273e+11
## 3       RMSE 6.461524e+05 6.764076e+05

Model Comparison with RMSE

comparison <- c("Linear Regression", "Regularized Regression", "Decision Tree", "Random Forest", "Neural Network")
train_rmse <- c(RMSETrain, RMSETrain_glm_cv, RMSETrain_tree_cv, RMSETrain_rf_cv, RMSETrain_nn_cv)
test_rmse <- c(RMSETest, RMSETest_glm_cv, RMSETest_tree_cv, RMSETest_rf_cv, RMSETest_nn_cv)

diff_lm <- abs(RMSETrain-RMSETest)
diff_glmnet <- abs(RMSETrain_glm_cv-RMSETest_glm_cv)
diff_tree <- abs(RMSETrain_tree_cv-RMSETest_tree_cv)
diff_rf <- abs(RMSETrain_rf_cv-RMSETest_rf_cv)
diff_nn <- abs(RMSETrain_nn_cv-RMSETest_nn_cv)
Difference <- c(diff_lm, diff_glmnet, diff_tree, diff_rf, diff_nn)

com_model <- data.frame(comparison, train_rmse, test_rmse, Difference)
print(com_model)
##               comparison train_rmse test_rmse Difference
## 1      Linear Regression  212062.51  298105.0   86042.50
## 2 Regularized Regression  211574.01  296459.3   84885.29
## 3          Decision Tree  262307.40  316922.7   54615.28
## 4          Random Forest   94953.38  230364.7  135411.36
## 5         Neural Network  646152.42  676407.6   30255.21
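To make the comparison easier to scan, the same table can be plotted with the tidyverse packages loaded earlier; a minimal sketch:

# Sketch: bar chart of train vs test RMSE per model
com_model %>%
  pivot_longer(c(train_rmse, test_rmse), names_to = "split", values_to = "rmse") %>%
  ggplot(aes(x = reorder(comparison, rmse), y = rmse, fill = split)) +
  geom_col(position = "dodge") +
  coord_flip() +
  labs(title = "Train vs Test RMSE by Model", x = NULL, y = "RMSE") +
  theme_minimal()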

Conclusion
Based on the table above, the Random Forest achieves by far the lowest test RMSE and is therefore the preferred model for predicting house prices in India. The Neural Network shows the smallest difference between train and test RMSE, but only because it failed to learn at all (see the note on linout above); a small train/test gap combined with the worst overall error does not make a good model. The Random Forest’s larger gap does indicate some overfitting, which further tuning could reduce.

Export the model for further use

We will export the best-performing model, the Random Forest, for future applications.

## save model .RDS
saveRDS(rf_mod, "/Users/j.nrup/Documents/Data Project/rf_model.RDS")
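Later, the saved model can be loaded back and applied to new data; a minimal sketch (predictions are on the log scale, so exp() converts them back to prices):

## Sketch: reload the model and predict prices
model <- readRDS("/Users/j.nrup/Documents/Data Project/rf_model.RDS")
predicted_price <- exp(predict(model, newdata = testData))
head(predicted_price)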

Summary

This project applies machine learning to house price prediction using several algorithms: linear regression, regularized regression, decision trees, random forests, and neural networks. It emphasizes data quality through cleansing and preparation, applies a log transformation to the skewed target, and uses re-sampling and hyper-parameter tuning to get the most out of each model.

Thank you for your interest. I hope this project will be beneficial to those who are interested. If there are any errors, I apologize in advance. - Narupong Jarasbunpaisarn