Executive Summary:

According to Statista, the number of existing homes sold in the United States increased consistently from 2018 to 2020, and according to the National Association of Realtors, the median existing-home price in September 2020 was up almost 15% compared with September 2019. Given this strong upward trend in house prices, a model with strong predictive power could help an investment company or real estate agency make the right decisions when pricing its houses or buying new ones.

In many house price prediction tasks, simple linear regression models and ensemble models, especially tree-based ensembles, perform very well. In many Kaggle competitions, Ridge regression, Lasso regression, and Extreme Gradient Boosting (XGBoost) dominate in terms of predictive power.

In this project, I will focus on designing a stacking model that combines and enhances the best-performing models, and on testing whether the stacking model really improves on those models for this particular regression task.

The project will contain 3 main sections:
I. Exploratory Data Analysis and Data Preprocessing:
In this part, I explore the data, then preprocess it and impute the missing values of the dataset.
II. Design the Stacking Model:
The stacking model will consist of 5 layers:
* Layer 1 - Data Preprocessing
* Layer 2 - Dimension Reduction: Feature Selection.
* Layer 3 - Training: Random Forest, XGBoost, and Neural Network.
* Layer 4 - Aggregation: Support vector machine with Poly kernel.
* Layer 5 - Output

III. Result and Conclusion:

In this project, the stacking model designed and trained in the second section outperforms the best single model (XGBoost) among the six models used in the training layer. The RMSE and MAE of the stacking model are 6.08% and 4.93% lower than those of the XGBoost model.

I. Exploratory Data Analysis and Data Preprocessing:

Data Description:

The dataset was downloaded from Kaggle. The original dataset does not contain any missing values; to meet the requirements of the project, I randomly introduced missing values into several variables.

The dataset contains information about houses in King County, Washington, sold between 2014 and 2015. It has more than 20,000 observations, 20 predictors, and 1 target column (price). The variables are described below:

  1. Id: Unique ID for each home sold.
  2. Date: Date of the home sale.
  3. Price: Price of each home sold.
  4. Bedrooms: Number of bedrooms.
  5. Bathrooms: Number of bathrooms, where .5 accounts for a room with a toilet but no shower.
  6. Sqft_living: Square footage of the apartment’s interior living space.
  7. Sqft_lot: Square footage of the land space.
  8. Floors: Number of floors.
  9. Waterfront: A dummy variable for whether the apartment was overlooking the waterfront or not.
  10. View: An index from 0 to 4 of how good the view of the property was.
  11. Condition: An index from 1 to 5 on the condition of the apartment.
  12. Grade: An index from 1 to 13, where 1-3 indicates low quality, 7 average quality, and 11-13 high quality.
  13. Sqft_above: The square footage of the interior housing space that is above ground level.
  14. Sqft_basement: The square footage of the interior housing space that is below ground.
  15. Yr_built: The year the house was initially built.
  16. Yr_renovated: The year of the house’s last renovation.
  17. Zipcode: What zip code area the house is in.
  18. Lat: Latitude.
  19. Long: Longitude
  20. Sqft_living15: The square footage of interior housing living space for the nearest 15 neighbors.
  21. Sqft_lot15: The square footage of the land lots of the nearest 15 neighbors.

Overview of the dataset:

The bar plot shows that the missing values are concentrated in five columns: bedrooms, condition, floors, sqft_living, and sqft_lot, all of which describe characteristics of the house. In the next section, I'll explore the data, identify the relationships between variables, and impute as many of the missing values as possible.

I will drop the id column because it is just an identifier and carries no information for the analysis or the modeling phase.

We can take a first glimpse at the distributions of the variables through the histogram plots (see Appendix A). Variables such as bedrooms, sqft_living, sqft_lot, sqft_basement, and sqft_lot15 appear to have significant outliers that could affect model performance. Therefore, I will take a closer look at these variables and eliminate unreasonable outliers before building the models.
The heat map shows the correlation between each predictor and the target variable (price). Variables such as sqft_living, grade, sqft_above, sqft_living15, bathrooms, and lat have strong correlations (both positive and negative) with the house price (absolute value over 0.5), while view, sqft_basement, bedrooms, waterfront, and floors have moderate correlations with price.

In typical house price datasets, sqft_lot (the total lot area) usually has a fairly strong correlation with price. In this dataset, however, sqft_lot appears almost uncorrelated with price, so I will take a closer look at this variable and try to extract more information from it.

In my opinion, zipcode is an important variable affecting house prices. However, zipcode has a very weak correlation with price here. I suspect that its current numeric format is not appropriate for extracting useful information for the prediction model, so I will also take a closer look at this variable.

Data Preprocessing:

For data preprocessing, I will go through the variables grouped by category. First, I will take an overall look at the target column (price); then I will go through the macro (location) features; finally, I will look at the variables describing the characteristics of the house.

Macro: Location Features

zipcode, lat, long features.

To display the prices of the houses on a map, I create a new variable indicating each house's price range. The summary and the histogram of the price variable show that prices range from $75,000 to $5,350,000 and that most houses are priced below $2,000,000. Therefore, I divide the price variable into nine ranges: under $250k, $250k-$500k, $500k-$750k, $750k-$1m, $1m-$1.25m, $1.25m-$1.5m, $1.5m-$1.75m, $1.75m-$2m, and over $2m.
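The binning itself can be done with cut(); below is a minimal sketch, assuming the data are in a data frame named house with a numeric price column (price_range is an illustrative name).

breaks <- c(0, 250000, 500000, 750000, 1000000, 1250000,
            1500000, 1750000, 2000000, Inf)
labels <- c("<$250k", "$250k-$500k", "$500k-$750k", "$750k-$1m", "$1m-$1.25m",
            "$1.25m-$1.5m", "$1.5m-$1.75m", "$1.75m-$2m", ">$2m")
# Assign each house to one of the nine price ranges
house$price_range <- cut(house$price, breaks = breaks, labels = labels,
                         include.lowest = TRUE)
table(house$price_range)  # quick check: most houses should fall below $2m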

The map indicates that the most expensive houses are mainly located in Seattle and near the water or with a view of the water. Prices drop significantly for houses south of Seattle and decline gradually for houses to the west of Seattle.

Micro elements:

Bedrooms and Bathrooms Features:

The two boxplots describe the outliers of the two variables bedrooms and bathrooms.

The boxplot of bedrooms shows unusual outliers with over 30 bedrooms or 0 bedrooms. I will take a closer look at these observations.

The bathrooms variable looks fine, with no unusual outliers. A house with more than 10 bedrooms can plausibly have around 8 bathrooms, so the maximum value of bathrooms is reasonable.

##        price bedrooms bathrooms sqft_living sqft_lot floors
## 15871 640000       33      1.75        1620     6000      1

This observation is most likely an entry error: a house with 33 bedrooms and 1.75 bathrooms has only 1,620 square feet of living area on a single floor. I will eliminate this observation.

Next, we will look at houses with 0 bedrooms or 0 bathrooms.

##         price bedrooms bathrooms sqft_living sqft_lot zipcode
## 876   1095000        0      0.00        3064     4764   98102
## 1150    75000        1      0.00         670    43377   98022
## 3120   380000        0      0.00        1470      979   98133
## 3468   288000        0      1.50        1430       NA   98125
## 5833   280000        1      0.00         600    24501   98045
## 6995  1295650        0      0.00        4810    28008   98053
## 8478   339950        0      2.50        2290     8319   98042
## 8485   240000        0      2.50        1810     5669   98038
## 9774   355000        0      0.00        2460     8049   98031
## 9855   235000        0      0.00        1470     4800   98065
## 10482  484000        1      0.00         690    23244   98053
## 12654  320000        0      2.50        1490     7111   98065
## 14424  139950        0      0.00         844     4269   98001
## 18380  265000        0      0.75         384   213444   98070
## 19453  142000        0      0.00         290       NA   98024

From my perspective, it is very unlikely that a house has no bedroom. Furthermore, many houses with no bedrooms also have no bathrooms or 2.5 bathrooms, yet their living and lot areas are quite large. This does not seem reasonable, so I will eliminate the observations with 0 bedrooms.

On the other hand, houses with 1 bedroom and 0 bathrooms have small living areas (around 600-700 sqft). These seem reasonable, so I will keep them.

The plot of bedrooms against price indicates a positive relationship between the number of bedrooms and house prices, but we cannot conclude that the relationship is strong because of the large number of outliers in each group.

In the box plot, we can see the median home price increasing as the number of bedrooms increases; once the number of bedrooms reaches 7, the trend starts to fluctuate.

Unlike the plot of bedrooms against price, the scatter plot of bathrooms against price shows a stronger and clearer positive relationship between the two variables.

As with bedrooms, the box plot shows the median home price increasing with the number of bathrooms until the number of bathrooms reaches 4.25, after which the median price fluctuates.

Area Features:

The 4 boxplots show the outliers of the four variables: sqft_living, sqft_lot, sqft_above, and sqft_basement.

Some houses have an exceptionally large living space (sqft_living). sqft_lot has some extremely large values that might affect the scaling process, and the same issue appears in sqft_above and sqft_basement. Therefore, I will take a closer look at these outliers.

I manually checked some of the houses with exceptionally large lot areas (> 750,000 sqft) on the Zillow website; they are mansions in isolated areas or farm properties with very large lots.

Since the goal is a model that captures the global trend of the data and correctly predicts the prices of the majority of houses in King County, I will eliminate these observations to improve the generalization power and robustness of the model.

The other three variables look reasonable, and I do not see any clearly erroneous outliers, so I will not eliminate any observations based on sqft_living, sqft_above, or sqft_basement.

As the correlation heat map shows, there is a fairly strong positive correlation between living area and house price, and a nearly equally strong positive correlation between above-ground area and price. Basement space shows the same trend, but the correlation is weaker.

On the other hand, there is little or no correlation between sqft_lot and house price, which suggests that sqft_lot on its own may not be useful for predicting price. However, as mentioned in the heat-map discussion, a correlation might emerge if we combine sqft_lot with zipcode. To examine this, I plot price against sqft_lot faceted by six randomly selected zip codes.
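A sketch of such a faceted plot with ggplot2, assuming the data frame house has price, sqft_lot, and zipcode columns:

library(ggplot2)
set.seed(42)
zips <- sample(unique(house$zipcode), 6)  # six randomly selected zip codes
ggplot(subset(house, zipcode %in% zips), aes(x = sqft_lot, y = price)) +
  geom_point(alpha = 0.4) +
  facet_wrap(~ zipcode, scales = "free_x") +  # one panel per zip code
  labs(x = "Lot area (sqft)", y = "Price")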

Even within individual zip codes, I still cannot see a clear correlation between sqft_lot and price. In King County, the lot size and the price may simply not be related, or sqft_lot may need to be combined with other variables to provide meaningful information.

The boxplot shows an increasing trend in the median house price as the number of floors increases, with fluctuations at 3.0 and 3.5 floors.

View and Quality Features:

The graphs below show the correlation between the four variables (waterfront, view, grade, and condition) and price.

The plots indicate that all four variables have some effect on the house price. The waterfront effect is the most noticeable: there is a large difference in median price between houses with and without a waterfront.

The view and grade variables show a similar but weaker pattern: the price differences are less pronounced, but both have a positive effect on the house price.

For the condition variable, there is no clear pattern in the price differences, and there are many outliers that significantly affect what the plot shows.

Year Built and Year Renovated:

The histogram of the yr_renovated seems to be unusual, so I will take a closer look at this variable.

For the yr_renovated variable, the value is the year of the renovation if the house was renovated and zero otherwise. Most of the houses have never been renovated, so I decided to transform this variable into a binary indicator.

This transformation discards some information. However, from my perspective, it is the best way to extract information from yr_renovated while avoiding the effect that the few large values (years) would have on the majority of observations (zeros) during scaling.
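A minimal sketch of the transformation (the renovated column name is illustrative):

# 1 if the house has ever been renovated, 0 otherwise
house$renovated <- as.integer(house$yr_renovated > 0)
house$yr_renovated <- NULL  # drop the original year column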

The plot of yr_renovated against price indicates that the variable might have a small effect on the house price. For the yr_built variable, on the other hand, we cannot see any noticeable impact on or correlation with the house price.

Date Feature:

As the housing market typically shows seasonality and cycles in prices, I will extract the month from the date variable, which captures seasonal information, and the year, which, from my perspective, can capture cyclical information.

Because the month variable is categorical, I will transform it into dummy variables (dropping December and treating it as the baseline), as sketched below.
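A sketch of the extraction and dummy coding, assuming the date column parses with as.Date (the year and month column names are illustrative):

house$date  <- as.Date(house$date)                         # assumes a parseable date format
house$year  <- as.integer(format(house$date, "%Y"))        # cycle information
house$month <- relevel(factor(format(house$date, "%m")), ref = "12")  # December = baseline
# model.matrix drops the baseline level, yielding 11 monthly dummies
month_dummies <- model.matrix(~ month, data = house)[, -1]
house <- cbind(house[, setdiff(names(house), c("date", "month"))], month_dummies)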

Zipcode Features:

Although zipcode has a numerical form, it cannot be treated as a continuous or ordinal variable because an increase or decrease in the numeric value does not correspond to any meaningful change. In fact, zipcode is categorical. Our dataset has 70 unique zip codes, so one-hot encoding zipcode would add many columns and could hurt model performance.

Therefore, I will collapse the zip codes into two categories: high-price area (represented by 1) and normal area (represented by 0). Based on information from the Property Shark website, I mark the following nine zip codes as high-median-price areas (see the sketch after this list):
1. 98039
2. 98004
3. 98040
4. 98112
5. 98006
6. 98105
7. 98033
8. 98199
9. 98119
Other zip codes will be categorized in the normal area.
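A minimal sketch of the encoding (high_price_area is an illustrative column name):

high_price_zips <- c(98039, 98004, 98040, 98112, 98006,
                     98105, 98033, 98199, 98119)
# 1 = high-price area, 0 = normal area
house$high_price_area <- as.integer(house$zipcode %in% high_price_zips)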

Imputing Missing Values:

Split the Data into Train and Test set:

Data leakage is a major concern when preprocessing data and building models. One possible source of leakage is imputing missing values using information from the whole dataset. Therefore, I first split the data into train and test sets, then impute the missing values in the train set using information from the train set only. Finally, I reuse the same imputation information to fill the missing values in the test set.
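A sketch of the split with caret (the 80/20 proportion and the seed are illustrative; the report does not specify them):

library(caret)
set.seed(123)
idx   <- createDataPartition(house$price, p = 0.8, list = FALSE)
train <- house[idx, ]
test  <- house[-idx, ]
# All imputation statistics below (medians, modes, group rules) are computed
# on train only and then re-applied to test, so no test information leaks in.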

Living Area and Lot Area:

For the living area and lot area, we can impute the missing values with sqft_living15 and sqft_lot15, respectively. These two variables give the square footage of the living area and lot area of the 15 nearest neighbors, so we assume that sqft_living15 and sqft_lot15 are close enough to the true values to serve as imputations for sqft_living and sqft_lot.
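A minimal sketch of this imputation on the train set:

# Fill missing living/lot areas with the corresponding neighbor-based values
na_liv <- is.na(train$sqft_living)
train$sqft_living[na_liv] <- train$sqft_living15[na_liv]
na_lot <- is.na(train$sqft_lot)
train$sqft_lot[na_lot] <- train$sqft_lot15[na_lot]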

Summary of sqft_living and sqft_lot after imputation:

##   sqft_living       sqft_lot     
##  Min.   :  370   Min.   :   520  
##  1st Qu.: 1430   1st Qu.:  5005  
##  Median : 1910   Median :  7628  
##  Mean   : 2079   Mean   : 14721  
##  3rd Qu.: 2540   3rd Qu.: 10720  
##  Max.   :13540   Max.   :623779

Bedrooms Feature:

As we can observe from the heat map, sqft_living and bedrooms are highly correlated. Therefore, the living area of a house gives rough information for imputing the missing values of bedrooms.
The box plot indicates that as the number of bedrooms increases, the median living area also increases. The trend starts to fluctuate once the number of bedrooms exceeds eight, because very few observations have more than eight bedrooms, which is not enough to establish a trend. Therefore, for the bedroom imputation I will only consider observations with fewer than eight bedrooms.

The imputation strategy is as follows: compute the median sqft_living for each value of bedrooms, then, for each observation with a missing bedroom count, assign the number of bedrooms whose median sqft_living is closest to that observation's sqft_living.
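A sketch of this strategy, assuming sqft_living has already been imputed in the previous step:

# Median living area for each bedroom count below eight
keep <- !is.na(train$bedrooms) & train$bedrooms < 8
med  <- tapply(train$sqft_living[keep], train$bedrooms[keep], median, na.rm = TRUE)
na_bed <- which(is.na(train$bedrooms))
# For each missing value, pick the bedroom count whose median living area
# is closest to the observation's own living area
train$bedrooms[na_bed] <- sapply(train$sqft_living[na_bed], function(s) {
  as.numeric(names(med))[which.min(abs(med - s))]
})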

Summary of the bedrooms variable after imputation:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   3.000   3.491   4.000  10.000

Floors Feature:

Based on the scatter plot, it does not seem feasible to impute the missing values of floors from sqft_living and sqft_above, even though these are the two variables most correlated with floors. Therefore, I move on to grade and yr_built, the next two variables with a relatively high correlation with floors. This makes sense: thanks to newer construction technology, newer homes are more likely to have more floors, and the number of floors can affect the grade (for example, a house with more floors may receive a higher grade).
This scatter plot shows some clusters that could be useful for imputing the missing values of floors. The values imputed from grade and yr_built will not be exactly correct, but they are still a good approximation of the real values, which is better than random or average imputation.

Based on the plot, I decided to create two categorical variables for grade and yr_built to group the data by the two new variables.

For yr_built, as we can observe from the graph, we will divide the observations into 3 groups:
1. before 1940 - denote as 0
2. 1940 to 1975 - denote as 1
3. after 1975 - denote as 2

For the grade variable, we will divide it into 2 main groups:
1. below 7 - denote as 0
2. greater than or equal to 7 - denote as 1

Based on this grouping, we can compute the mode (the most frequent value of floors in each group) and impute it for the observations with missing floors in the same group, as sketched below.
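A sketch of the group-mode imputation (yr_group and grade_group are illustrative names for the two new grouping variables):

mode_of <- function(x) as.numeric(names(which.max(table(x))))  # most frequent value
train$yr_group    <- cut(train$yr_built, breaks = c(-Inf, 1939, 1975, Inf),
                         labels = c(0, 1, 2))
train$grade_group <- ifelse(train$grade < 7, 0, 1)
# Mode of floors within each (yr_group, grade_group) cell; aggregate drops NA rows
group_mode <- aggregate(floors ~ yr_group + grade_group, data = train, FUN = mode_of)
na_fl  <- which(is.na(train$floors))
key_na <- paste(train$yr_group[na_fl], train$grade_group[na_fl])
key_gm <- paste(group_mode$yr_group, group_mode$grade_group)
train$floors[na_fl] <- group_mode$floors[match(key_na, key_gm)]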

Summary of floors after imputation:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   1.447   2.000   3.000

Condition Feature:

As the correlation heatmap shows, condition is only weakly correlated with the other variables; the strongest correlation is with yr_built (0.37). From my perspective, the other variables do not carry enough information to impute condition. Therefore, I will impute the most frequent value, condition = 3 (which accounts for over 65% of the values).
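A one-line sketch of this imputation on the train set:

# Impute the most frequent condition value (3) for missing entries
train$condition[is.na(train$condition)] <- 3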

Summary of condition after imputation:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   3.000   3.402   4.000   5.000

Impute Missing Values for the Test Set:

After imputing all missing values in the train set, I use the information derived from the train set (groupings, medians, modes, etc.) to impute the missing values in the test set.

II. Design of The Stacking Model:

The stacking model will include 5 main layers:

Layer 1 - Data Preprocessing:

The first layer includes all of the preprocessing and missing-value imputation steps performed so far in the project.

Layer 2 - Dimension Reduction:

For this layer, I will try two dimension-reduction methods: feature selection and principal component analysis (PCA).

Feature Selection:

For the feature selection method, I first fit a tree model and then choose some of the most important variables. The graph below shows the chosen variables and their importance according to the regression tree.
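A sketch of this step with rpart and caret's varImp (object names are illustrative):

library(rpart)
library(caret)
tree_fit <- rpart(price ~ ., data = train)      # regression tree on all predictors
imp <- varImp(tree_fit)                         # importance scores from the tree
imp[order(-imp$Overall), , drop = FALSE]        # rank variables; keep the top ones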

Principal Component Analysis (PCA):

For the PCA transformation, I keep components explaining at least 95% of the variance of the dataset. After the transformation, the dataset contains 23 columns (22 principal components and 1 column for price).
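A sketch of the PCA step with caret's preProcess (object names are illustrative):

library(caret)
predictors <- setdiff(names(train), "price")
pca_pre <- preProcess(train[, predictors],
                      method = c("center", "scale", "pca"), thresh = 0.95)
train_pca <- predict(pca_pre, train[, predictors])  # principal component scores
train_pca$price <- train$price
ncol(train_pca)  # in this project: 22 components plus price, i.e., 23 columns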

I will also try the full (untransformed) dataset in the training process and then decide whether to keep the whole dataset or use one of the two dimension-reduction methods. Therefore, in the training phase, I fit all of the layer 3 models on three datasets: the whole dataset, the feature-selection dataset, and the PCA dataset.

Layer 3 - Training Layer:

In this layer, I train six models in total, divided into two categories. The first category contains linear models: Linear Regression, Ridge Regression, and Lasso Regression. Linear models tend to be effective for house price prediction tasks and can generalize well.

The second category contains more complex and ensemble models: Random Forest, Extreme Gradient Boosting (XGBoost), and a neural network for regression. These three models can capture local structure and could perform well on a complex dataset such as ours.

I intend to add another layer that aggregates the results from the training layer to improve performance. This aggregation layer needs to be trained on data unseen by the training layer to avoid data leakage (training both layers on the same data would weaken generalization). Therefore, I further divide the held-out data into two sets: a test set and a validation set.

In this report, I focus on the results of the training process. For the best-tuned hyperparameters from cross-validation, please see Appendix B.

Summary of the Layer 3 - Whole Dataset:

The table below summarizes the results of all the models trained above.

##                            R2_score Train_RMSE_score Valid_RMSE_score Valid_MAE
## Linear Regression         0.7516120         186504.2         179385.5 116250.16
## Ridge Regression          0.7514570         187364.4         179441.5 116496.60
## Lasso Regression          0.7516120         236698.4         179385.5 116250.16
## Random Forest             0.8797898         134462.6         124793.7  68704.47
## Extreme Gradient Boosting 0.9050910         139892.2         110885.8  62444.46
## Neural Network            0.8698137         107256.3         129868.8  77940.15

Summary of the Layer 3 - Feature Selection Dataset:

The table below summarizes the results of all the models trained above.

##                            R2_score Train_RMSE_score Valid_RMSE_score Valid_MAE
## Linear Regression         0.7492711         187799.7         180228.9 116568.02
## Ridge Regression          0.7490986         188373.1         180290.8 116817.85
## Lasso Regression          0.7492711         242500.6         180228.9 116568.02
## Random Forest             0.8768896         133811.7         126290.2  69779.90
## Extreme Gradient Boosting 0.8967312         128307.1         115666.3  65401.23
## Neural Network            0.8881390         107576.1         120382.0  72573.09


Summary of the Layer 3 - PCA Dataset:

The table below summarizes the results of all the models trained above.

##                            R2_score Train_RMSE_score Valid_RMSE_score       MAE
## Linear Regression         0.7367729        191165.44         184666.2 120123.79
## Ridge Regression          0.7367729        191682.33         184666.2 120123.79
## Lasso Regression          0.7367729        240898.05         184666.2 120123.79
## Random Forest             0.8173886        163488.06         153810.5  90543.48
## Extreme Gradient Boosting 0.8356687        157954.85         145909.1  86506.43
## Neural Network            0.8444288         95704.92         141966.8  81363.93

Based on the results for the three datasets, PCA does not appear to be a good dimension-reduction method for this particular task: all six models fitted on the PCA-transformed dataset perform significantly worse than on the other two datasets. Therefore, I exclude the PCA transformation from layer 2 - dimension reduction.

Within each dataset, the three complex models - Random Forest, XGBoost, and Neural Network - perform significantly better than the three simple models. Therefore, only these three models are used in layer 3 of the final stacking model.

Layer 4 - Aggregation Layer:

I will use the models selected in layer 3 and train some simple models on the test set to aggregate their results. The validation set is then used to evaluate performance and choose the best aggregation model for the task.
The scatter plot indicates a strong correlation between each model's predictions and the price. However, once the price exceeds $1,500,000, the prediction variance increases and the predictions become less accurate.
From my perspective, this high variance at high prices may be due to unique characteristics of those houses that are not represented in the dataset.

Based on this plot, I will try models that can average out the errors in the predictions of the three layer-3 models: the simple average method, linear regression, polynomial regression of degree 2, K-nearest neighbors, and a support vector machine with a polynomial kernel of degree 2.

The average method is the simplest approach: it simply averages the predictions and, thanks to its simplicity, may generalize best. Based on the plot, linear regression, polynomial regression of degree 2, and the SVM with a degree-2 polynomial kernel also look like good fits for the task. The K-nearest-neighbors regressor, on the other hand, can group unusual observations together and average out their errors.

Although all of these models will be fitted, I will keep only the one with the best performance on the validation dataset.

Just as in layer 3, all of the information about the best-tuned hyper-parameters and validation scores is located in Appendix B.
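A sketch of training the SVM meta-model with caret on the layer-3 predictions. The prediction vectors and the valid data frame below are illustrative names for the layer-3 outputs on the test and validation sets; the degree-2 polynomial kernel matches the tuned values reported in Appendix B.

library(caret)
# rf_pred_test, xgb_pred_test, nn_pred_test: layer-3 predictions on the test set
meta_train <- data.frame(rf  = rf_pred_test,
                         xgb = xgb_pred_test,
                         nn  = nn_pred_test,
                         price = test$price)
set.seed(123)
svm_meta <- train(price ~ ., data = meta_train, method = "svmPoly",
                  trControl = trainControl(method = "cv", number = 5),
                  tuneLength = 3)
# Stacked predictions on the validation set
meta_valid <- data.frame(rf = rf_pred_valid, xgb = xgb_pred_valid, nn = nn_pred_valid)
stack_pred <- predict(svm_meta, meta_valid)
RMSE(stack_pred, valid$price); MAE(stack_pred, valid$price)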

Summary of Layer 4 - Whole Dataset:

##                        R2_score Train_RMSE_score Valid_RMSE_score Valid_MAE
## Average Method        0.9011222         120771.2         113180.5  64627.26
## Linear Regression     0.9045934         117869.8         111176.1  65389.38
## Polynomial Regression 0.9071733         109367.6         109662.6  62538.39
## KNN                   0.8955963         134280.3         116300.1  66268.01
## SVM with Poly Kernel  0.9089060         120163.9         108634.3  62177.23

Summary of Layer 4 - Feature Selection Dataset:

##                        R2_score Train_RMSE_score Valid_RMSE_score Valid_MAE
## Average Method        0.8987718         119590.9         114517.8  65472.09
## Linear Regression     0.9013858         115193.7         113029.5  66869.03
## Polynomial Regression 0.8803858         110313.3         124484.0  72100.76
## KNN                   0.8757407         132761.5         126878.0  75828.76
## SVM with Poly Kernel  0.9089060         120163.9         108634.3  62177.23

The results in the two tables show that the Support Vector Machine (SVM) with a polynomial kernel has the best performance, significantly better than all the other models. Therefore, I include only the SVM with the polynomial kernel in the aggregation layer.

Between the two datasets - the whole dataset and the feature-selection dataset - the difference in model performance is not significant. Therefore, I decided to use feature selection for the dimension-reduction layer, because it greatly reduces training and prediction time.

Layer 5 - Output:

After training all the models, the aggregation layer will generate the final result - the predicted house price.

The following diagram displays the structure of the final stacking model.

III. Result and Conclusion:

For the first layer of the stacking model, several strategies were applied to impute the missing values in the dataset. These methods, which exploit the correlations between variables, help preserve as much of the variation in the dataset as possible. I had considered clustering-based imputation before settling on these methods; however, with high-dimensional data such as ours, clustering may not perform well because it relies on similarity (the distance between observations), which becomes much less informative in high dimensions.

For the second layer, dimension reduction, the feature selection method proved both simple and effective across the two layers of training, so it is the only method kept in the final model.

In the third layer, the training layer, the more powerful regression models - Random Forest, XGBoost, and the Neural Network - demonstrated their ability to handle a complex dataset, and these three models form the core of the final stacking model.

In the fourth layer, the aggregation layer, the Support Vector Machine with a polynomial kernel turned out to be the best model for aggregating the final result.

Finally, the stacking model achieves an RMSE of nearly $109,000 and a Mean Absolute Error (MAE) of roughly $62,000, a clear improvement over the best model in the training layer: XGBoost with tuned hyper-parameters achieves an RMSE of over $115,000 and an MAE of over $65,000.

##           Stack_model XGBoosting
## RMSE       108634.320 115666.280
## MAE         62177.230  65401.230
## R_squared       0.909      0.897

In conclusion, the designed stacking model does improve on the best single model by combining the best-performing models in the training layer: the RMSE decreases by 6.08%, the MAE decreases by 4.93%, and the R-squared increases by about 0.01. This model could be a first step toward a more complex and efficient model for predicting house prices, not only in King County but potentially in many other parts of the U.S.

Appendix A:

Histogram of All of The Variables in the Dataset:

Appendix B:

Best Tuned hyper-parameters for Training Layer Models:

For the models fitted on the whole dataset:

Ridge Regression:

grid_ridgem$bestTune
##   lambda
## 3  0.005

Lasso Regression:

grid_lassom$bestTune
##   fraction
## 3        1

Random Forest:

grid_treem$bestTune
##   mtry
## 2   15

XGBoost:

grid_xgbm$bestTune
##   nrounds max_depth  eta gamma colsample_bytree min_child_weight subsample
## 5    2000         5 0.03     0                1                1      0.75

Neural Network:

For the models fitted on the feature selection dataset:

Ridge Regression:

grid_ridgem2$bestTune
##   lambda
## 3  0.005

Lasso Regression:

grid_lassom2$bestTune
##   fraction
## 3        1

Random Forest:

grid_treem2$bestTune
##   mtry
## 1   10

XGBoost:

grid_xgbm2$bestTune
##   nrounds max_depth  eta gamma colsample_bytree min_child_weight subsample
## 4    2100         5 0.03     0                1                1      0.75

Neural Network:

For the models fitted on the PCA dataset:

Ridge Regression:

grid_ridgem3$bestTune
##   lambda
## 5   0.01

Lasso Regression:

grid_lassom3$bestTune
##   fraction
## 3        1

Random Forest:

grid_treem3$bestTune
##   mtry
## 2   15

XGBoost:

grid_xgbm3$bestTune
##   nrounds max_depth  eta gamma colsample_bytree min_child_weight subsample
## 4    2100         5 0.03     0                1                1      0.75

Neural Network:

Best Tuned hyper-parameters for Aggregation Layer:

For whole dataset:
KNN Regression:

layer2_knn$bestTune
##    k
## 3 10

SVM with Poly kernel:

layer2_svm$bestTune
##   degree scale  C
## 3      2  TRUE 10

For Feature Selection Dataset:
KNN Regression:

layer2_knn2$bestTune
##   k
## 2 5

SVM with Poly kernel:

layer2_svm2$bestTune
##   degree scale  C
## 3      2  TRUE 10

References:

Book
Ethem Alpaydın. 2015. Introduction to Machine Learning 3rd Edition. MIT Press.

Website
Property Shark. 2017. “The Most Expensive Zip Codes in Washington State – Medina Homes 8x Pricier than U.S. Median.” https://www.propertyshark.com/Real-Estate-Reports/2017/10/04/expensive-zip-codes-washington-state-medina-homes-8x-pricier-u-s-median/#

Website
Statista. 2020. “Number of existing homes sold in the United States from 2005 to 2021.” https://www.statista.com/statistics/226144/us-existing-home-sales/

Website
National Association of Realtors. 2020. “Existing-Home Sales Soar 9.4% to 6.5 Million in September.” https://www.nar.realtor/newsroom/existing-home-sales-soar-9-4-to-6-5-million-in-september