According to Statista, the number of existing homes sold in the United States increased consistently from 2018 to 2020, and according to the National Association of Realtors, the median existing-home price in September 2020 was almost 15% higher than in September 2019. Given this strong upward trend in house prices, a model with strong predictive power could help investment companies or real estate agencies make the right decisions when pricing their houses or buying new ones.
In many house price prediction tasks, simple linear regression models and ensemble models, especially tree-based ensembles, perform very well. In many Kaggle competitions, Ridge Regression, Lasso Regression, and Extreme Gradient Boosting (XGBoosting) are dominant in terms of predictive power.
In this project, I will focus on designing a stacking model that combines the best-performing individual models, and I will test whether the stacking model really improves on their performance for this particular regression task.
The project will contain 3 main sections:
I. Exploratory Data Analysis and Data Preprocessing:
In this part, I will explore the data, then preprocess it and impute the missing values in the dataset.
II. Design the Stacking Model:
The stacking model will consist of 5 layers:
* Layer 1 - Data Processing
* Layer 2 - Dimension Reduction: Feature Selection.
* Layer 3 - Training: Random Forest, XGBoosting, and Neural Network.
* Layer 4 - Aggregation: Support vector machine with Poly kernel.
* Layer 5 - Output
The stacking model designed and trained in the second section performs better than the best single model (XGBoosting) among the six models used in the training layer: the RMSE and MAE of the stacking model are 5.89% and 5.17% lower, respectively, than those of the XGBoosting model.
The dataset was downloaded from the Kaggle website. Originally, the dataset did not have any missing values. To meet the requirements of the project, I randomly introduced missing values into some variables in the dataset.
The dataset contains information about houses in King County, Washington that were sold between 2014 and 2015. It contains more than 20,000 observations, with 20 predictors and 1 target column (price). The description of the dataset:
* Id: Unique ID for each home sold.
* Date: Date of the home sale.
* Price: Price of each home sold.
* Bedrooms: Number of bedrooms.
* Bathrooms: Number of bathrooms, where .5 accounts for a room with a toilet but no shower.
* Sqft_living: Square footage of the apartment's interior living space.
* Sqft_lot: Square footage of the land space.
* Floors: Number of floors.
* Waterfront: A dummy variable for whether the apartment overlooks the waterfront or not.
* View: An index from 0 to 4 of how good the view of the property is.
* Condition: An index from 1 to 5 on the condition of the apartment.
* Grade: An index from 1 to 13, where 1-3 indicates low quality, 7 average quality, and 11-13 high quality.
* Sqft_above: The square footage of the interior housing space that is above ground level.
* Sqft_basement: The square footage of the interior housing space that is below ground level.
* Yr_built: The year the house was initially built.
* Yr_renovated: The year of the house's last renovation.
* Zipcode: The zip code area the house is in.
* Lat: Latitude.
* Long: Longitude.
* Sqft_living15: The square footage of interior housing living space for the nearest 15 neighbors.
* Sqft_lot15: The square footage of the land lots of the nearest 15 neighbors.

The bar plot shows that the missing values are mainly located in 5 columns: bedrooms, condition, floors, sqft_living, and sqft_lot, all of which belong to the house-characteristic variables. In the next section, I'll explore the data, identify the relationships between variables, and impute as many of the missing values as possible.
I will drop the id column because it is just an identifier and carries no information for the analysis or the modeling phase.
We can take a first glimpse at the distributions of the variables through the histogram plots (please check Appendix A). Variables such as bedrooms, sqft_living, sqft_lot, sqft_basement, and sqft_lot15 might have some significant outliers that can affect the performance of the models. Therefore, I will take a closer look at these variables and eliminate the unreasonable or unnecessary outliers to improve the model-building process in the next steps.
The heat map shows the correlations between each of the predictors and the target variable (price). Variables such as sqft_living, grade, sqft_above, sqft_living15, bathrooms, and lat have strong correlations (both negative and positive) with the house price (absolute value over 0.5). Next, the view, sqft_basement, bedrooms, waterfront, and floors variables have moderate correlations with the price variable.
In typical house price datasets, sqft_lot (the total lot space of the house) usually has a fairly strong correlation with the house price. However, in this dataset, sqft_lot seems not to correlate with price. Therefore, I will also take a closer look at this variable and try to extract more information from it.
In my opinion, zipcode is an important variable that affects the house price. However, we can see that zipcode has a very weak correlation with the house price. I assume that the current format of zipcode is not appropriate for extracting useful information for the price prediction model. Therefore, I will also take a closer look at this variable.
For data preprocessing, I will go through all variables in the dataset by category. First, I will take an overall look at the target column (price); then I will go through the macro features, which contain the location features (zipcode, lat, and long). Finally, I will take a look at the house-characteristic variables.
To display the price of the houses, I create a new variable that indicates the price range of each house. The summary and the histogram of the price variable indicate that house prices range from $75,000 to $5,350,000 and that most prices are below $2,000,000. Therefore, I divide the price variable into nine ranges: under $250k, $250k-$500k, $500k-$750k, $750k-$1m, $1m-$1.25m, $1.25m-$1.5m, $1.5m-$1.75m, $1.75m-$2m, and over $2m.
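A minimal sketch of this binning step, assuming the data has been loaded into a data frame named house (the name is illustrative):

```r
# Bin price into the nine ranges used for the map.
breaks <- c(0, 250e3, 500e3, 750e3, 1e6, 1.25e6, 1.5e6, 1.75e6, 2e6, Inf)
labels <- c("<$250k", "$250k-$500k", "$500k-$750k", "$750k-$1m",
            "$1m-$1.25m", "$1.25m-$1.5m", "$1.5m-$1.75m", "$1.75m-$2m", ">$2m")
house$price_range <- cut(house$price, breaks = breaks, labels = labels)
table(house$price_range)  # how many houses fall into each range
```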
The map indicates that the high-priced houses are mainly located in Seattle, near the beach or with a view of it. House prices decrease significantly for houses located on the south side of Seattle and decrease gradually for houses located on the west side.
The two boxplots describe the outliers of the two variables bedrooms and bathrooms.
The boxplot of bedrooms indicates there are unusual outliers with over 30 bedrooms or 0 bedrooms. I will take a closer look at these observations. The bathrooms variable seems fine, without any unusual outliers: with more than 10 bedrooms, a house can have around 8 bathrooms, which is a reasonable maximum value for the bathrooms variable.
## price bedrooms bathrooms sqft_living sqft_lot floors
## 15871 640000 33 1.75 1620 6000 1
This observation is most likely an entry error: a house with 33 bedrooms and 1.75 bathrooms has a living area of only 1,620 square feet and only one floor. So, I will eliminate this observation.
Next, we will look at houses with 0 bedrooms or 0 bathrooms.
## price bedrooms bathrooms sqft_living sqft_lot zipcode
## 876 1095000 0 0.00 3064 4764 98102
## 1150 75000 1 0.00 670 43377 98022
## 3120 380000 0 0.00 1470 979 98133
## 3468 288000 0 1.50 1430 NA 98125
## 5833 280000 1 0.00 600 24501 98045
## 6995 1295650 0 0.00 4810 28008 98053
## 8478 339950 0 2.50 2290 8319 98042
## 8485 240000 0 2.50 1810 5669 98038
## 9774 355000 0 0.00 2460 8049 98031
## 9855 235000 0 0.00 1470 4800 98065
## 10482 484000 1 0.00 690 23244 98053
## 12654 320000 0 2.50 1490 7111 98065
## 14424 139950 0 0.00 844 4269 98001
## 18380 265000 0 0.75 384 213444 98070
## 19453 142000 0 0.00 290 NA 98024
From my perspective, it is very unlikely that a house has no bedrooms. Furthermore, many houses with no bedrooms also have either no bathrooms or 2.5 bathrooms, and the living and lot areas of these houses are also very large. This does not seem reasonable, so I will eliminate the observations with 0 bedrooms.
On the other hand, houses with 1 bedroom and 0 bathrooms have a small living area (around 600-700 sqft). From my perspective, these are reasonable, and I will keep these observations.
The scatter plot of bedrooms and price indicates a positive correlation between the number of bedrooms and house prices, but we cannot conclude that this is a strong correlation due to the large number of outliers in each box. In the box plot, we can see an increasing trend in the median home price as the number of bedrooms increases; once the number of bedrooms reaches 7, the trend begins to fluctuate.
Unlike the scatter plot of bedrooms and price, the scatter plot of bathrooms and price shows a stronger and clearer positive correlation between the two variables. Just as with bedrooms, the box plot shows an increasing trend in the median home price as the number of bathrooms increases, up to 4.25 bathrooms, after which the median home price seems to fluctuate.
The 4 boxplots show the outliers of the four variables: sqft_living, sqft_lot, sqft_above, and sqft_basement.
There are some houses with significantly large living space (sqft_living). sqft_lot seems to have some outliers with extremely high values that might affect the scaling process. The same issue appears in sqft_above and sqft_basement. Therefore, I will take a closer look at these outliers.
I have manually checked some of the houses that have exceptionally large lot areas (> 750,000 sqft) on the Zillow website; these houses are mansions in isolated areas or farmhouses with very large lots.
To generate models that capture the global trend of the data and correctly predict the prices of the majority of houses in King County, I will eliminate these observations to improve the generalization power and robustness of the model.
The other three variables seem reasonable, and I cannot see any exceptional or erroneous outliers. Therefore, I decided not to eliminate any observations based on the sqft_living, sqft_above, and sqft_basement variables.
As we can observe from the correlation heat map, there is a fairly strong positive correlation between the living area and the house price, and a nearly as strong positive correlation between price and the above-ground area. We can see the same trend, but a weaker correlation, between basement space and the house price.
On the other hand, we can see little to no correlation between sqft_lot and the house price, which means that sqft_lot might not be useful on its own to predict the house price. However, as mentioned in the heat map discussion, a correlation could be revealed if we combine sqft_lot and zipcode. To demonstrate this, I will plot price against sqft_lot faceted by six randomly selected zip codes.
I still cannot see a clear correlation between sqft_lot (the lot space) and the price. In King County, the lot size and the price might not correlate, or the price is not affected by the lot size, or sqft_lot needs to be used in conjunction with other variables to give meaningful information.
The boxplot shows an increasing trend in the median house price as the number of floors increases, with fluctuations at 3.0 and 3.5 floors.
The graphs below show the correlation between the four variables (waterfront, view, grade, and condition) and price.
The plots indicate that all four variables have some effect on the house price. The waterfront effect is the most noticeable: there is a significant difference in the median price between houses with a waterfront and those without. We can see a similar pattern for the view and grade variables; the price differences are not as significant, but both also have some positive effect on the house price. For the condition variable, we cannot see a clear pattern in the differences, and there are also many outliers that significantly affect what the plot indicates.
The histogram of the yr_renovated seems to be unusual, so I will take a closer look at this variable.
For the yr_renovated variable, if the house was renovated, the value is the year of the renovation; if it was not, the value is zero. We can see that most of the houses have not been renovated. Therefore, I decided to transform this variable into a binary variable.
This transformation discards some information. However, from my perspective, it is the best way to extract information from yr_renovated while also avoiding the effect of the few high values on the majority of observations (with a value of 0) during the scaling process.
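A minimal sketch of this transformation, again assuming a data frame named house:

```r
# yr_renovated stores the renovation year, or 0 if the house was never renovated;
# collapse it into a 0/1 indicator.
house$renovated <- ifelse(house$yr_renovated > 0, 1, 0)
table(house$renovated)  # the 0 group (never renovated) should dominate
```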
The plot of yr_renovated against price indicates that the yr_renovated variable might have a small effect on the house price. On the other hand, for the yr_built variable, we cannot see any noticeable impact on or correlation with the house price.
As we can observe from the housing market, there are usually seasonality and cycles in house prices. Therefore, from the date variable I will extract month, which can capture the seasonality, and year, which, from my perspective, could capture the cycles in the data.
Because of the categorical nature of the month variable, I will transform it into dummy variables (I will also eliminate December and treat it as the baseline).
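A minimal sketch of the date handling, assuming house$date has already been parsed into R's Date class (the column names are illustrative):

```r
# Extract year and month from the sale date, then build month dummies
# with December as the baseline level.
house$year  <- as.integer(format(house$date, "%Y"))
house$month <- factor(as.integer(format(house$date, "%m")), levels = 1:12)
house$month <- relevel(house$month, ref = "12")              # December = baseline
month_dummies <- model.matrix(~ month, data = house)[, -1]   # drop the intercept
house <- cbind(house, month_dummies)
```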
Although zipcode has a numerical form, it cannot be treated as a continuous or ordinal variable because an increase or decrease in its value does not carry meaning. In fact, the zipcode variable is categorical. In our dataset, there are 70 unique zip codes, so transforming zipcode into one-hot vectors would greatly increase the number of features and could hurt model performance.
Therefore, I will transform the zip codes into 2 main categories: high-price area (represented by 1) and normal area (represented by 0). I use information from the Property Shark website to select the nine zip codes with the highest median house prices (a small encoding sketch follows the list below):
1. 98039
2. 98004
3. 98040
4. 98112
5. 98006
6. 98105
7. 98033
8. 98199
9. 98119
Other zip codes will be categorized in the normal area.
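A minimal sketch of this encoding (the data frame name house is illustrative):

```r
# Flag the nine high-price zip codes listed above with 1; all others get 0.
high_price_zips <- c(98039, 98004, 98040, 98112, 98006,
                     98105, 98033, 98199, 98119)
house$high_price_area <- ifelse(house$zipcode %in% high_price_zips, 1, 0)
```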
Data leakage is a major problem when we preprocess data and build models. One possible cause of data leakage is imputing missing values based on information from the whole dataset. Therefore, I decided to split the data into train and test sets first, then use information from the train set to impute its missing values. Finally, I will use the same information that was used for the train set to impute the missing values in the test set.
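A minimal sketch of the split, with an illustrative 80/20 ratio and seed:

```r
# Split before any imputation so the statistics used to fill missing values
# come from the training data only.
set.seed(42)
train_idx <- sample(seq_len(nrow(house)), size = floor(0.8 * nrow(house)))
train <- house[train_idx, ]
test  <- house[-train_idx, ]
```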
For the living area and lot area, we can impute the missing values with the values of sqft_living15 and sqft_lot15. The reason is that sqft_living15 and sqft_lot15 are the average living area and lot area of the nearest 15 neighbors, respectively, so we assume they are close enough to the true values to impute the missing values in sqft_living and sqft_lot.
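A minimal sketch of this rule, applied to the train set created above:

```r
# Replace missing living/lot areas with the nearest-15-neighbors averages.
train$sqft_living <- ifelse(is.na(train$sqft_living),
                            train$sqft_living15, train$sqft_living)
train$sqft_lot    <- ifelse(is.na(train$sqft_lot),
                            train$sqft_lot15, train$sqft_lot)
summary(train[, c("sqft_living", "sqft_lot")])
```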
Summary of sqft_living and sqft_lot after imputation:
## sqft_living sqft_lot
## Min. : 370 Min. : 520
## 1st Qu.: 1430 1st Qu.: 5005
## Median : 1910 Median : 7628
## Mean : 2079 Mean : 14721
## 3rd Qu.: 2540 3rd Qu.: 10720
## Max. :13540 Max. :623779
As we can observe from the heat map, sqft_living and bedrooms are highly correlated. Therefore, I think that the living area of the house can give us rough information to impute the missing value of bedrooms.
The box plot indicates that as the number of bedrooms increases, the median living area of the house also increases. Fluctuation appears once the number of bedrooms exceeds eight, because the number of observations with more than eight bedrooms is too small to indicate a trend. Therefore, for the bedrooms imputation, I will only consider observations with fewer than eight bedrooms.
The imputation strategy is as follows: I find the median sqft_living for each value of bedrooms, then impute the missing bedrooms of each observation based on its sqft_living value and the number of bedrooms whose median sqft_living is closest.
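A minimal sketch of this closest-median rule on the train set:

```r
# Median living area for each bedroom count below eight.
keep <- !is.na(train$bedrooms) & train$bedrooms < 8
med_by_bed <- tapply(train$sqft_living[keep], train$bedrooms[keep],
                     median, na.rm = TRUE)

# Assign a missing bedrooms value to the count whose median living area
# is closest to that house's sqft_living.
impute_bedrooms <- function(sqft) {
  as.integer(names(med_by_bed)[which.min(abs(med_by_bed - sqft))])
}
na_idx <- which(is.na(train$bedrooms))
train$bedrooms[na_idx] <- sapply(train$sqft_living[na_idx], impute_bedrooms)
```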
Summary of the bedrooms variable after imputation:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 3.000 3.491 4.000 10.000
Based on the scatter plot, it would be impossible to impute the missing values of floors from sqft_living and sqft_above, the two variables with the highest correlation with floors. Therefore, I will move on to grade and yr_built (the next two variables with a high correlation with floors). This makes sense because, thanks to newer construction technology, newer homes are more likely to have more floors, and the number of floors could affect the grade (for example, a house with more floors could receive a higher grade).
This scatter plot shows some clusters that could be useful for imputing the missing values of floors. Although values imputed from the two variables grade and yr_built will not be exactly correct, they are still good approximations of the real values, which is better than random or average imputation.
Based on the plot, I decided to create two categorical variables for grade and yr_built to group the data by the two new variables.
For yr_built, as we can observe from the graph, we will divide the variable into 3 groups:
1. before 1940 - denote as 0
2. 1940 to 1975 - denote as 1
3. after 1975 - denote as 2
For the grade variable, we will divide it into 2 main groups:
1. below 7 - denote as 0
2. over or equal to 7 - denote as 1
Based on this grouping, we can calculate the mode (the most prevalent value of floors in each group) and then impute these values to the observations with missing floors that fall in the same group.
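A minimal sketch of this group-mode imputation (the cut points follow the grouping above; note that merge() reorders the rows):

```r
# Build the two helper groups, compute the most frequent floors value per group,
# and use it to fill missing floors.
train$yr_group    <- cut(train$yr_built, breaks = c(-Inf, 1939, 1975, Inf),
                         labels = c(0, 1, 2))
train$grade_group <- ifelse(train$grade >= 7, 1, 0)

mode_val   <- function(x) as.numeric(names(which.max(table(x))))
floor_mode <- aggregate(floors ~ yr_group + grade_group, data = train, FUN = mode_val)
names(floor_mode)[names(floor_mode) == "floors"] <- "floors_mode"

train <- merge(train, floor_mode, by = c("yr_group", "grade_group"), all.x = TRUE)
train$floors[is.na(train$floors)] <- train$floors_mode[is.na(train$floors)]
```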
Summary of floors after imputation:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 1.447 2.000 3.000
As we can observe from the correlation heatmap, the correlations between the condition variable and the other variables seem to be very weak. The highest among them is the correlation between condition and yr_built (0.37). From my perspective, there is not enough information in the other variables to impute condition. Therefore, I will impute the condition variable with its most prevalent value, condition = 3 (over 65% of the values).
Summary of condition after imputation:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 3.000 3.402 4.000 5.000
After imputing all the missing values in the train set, I will use the same information (groupings, variable transformations, imputation values, etc.) to impute the missing values in the test set.
The stacking model will include 5 main layers:
The first layer will include every step to preprocess and impute the missing values that we have done so far in the project.
For this layer, I will try two dimension reduction methods: feature selection and principal component analysis (PCA).
For the feature selection method, I first fit a tree model and then choose some of the most important variables. The graph below indicates the variables that I have chosen and their importance according to the regression tree.
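A minimal sketch of this step using rpart's built-in importance measure (the cutoff of ten variables is illustrative; the actual selection follows the graph):

```r
# Fit a single regression tree and rank predictors by importance.
library(rpart)
tree_fit   <- rpart(price ~ ., data = train)
importance <- sort(tree_fit$variable.importance, decreasing = TRUE)
head(importance, 10)                  # inspect the top-ranked predictors
selected <- names(importance)[1:10]
train_fs <- train[, c(selected, "price")]
```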
For the PCA transformation, I will keep at least 95% of the variance of the dataset. After the transformation, the dataset contains 23 columns (22 predictors and 1 column for price).
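A minimal sketch of the PCA step using caret's preProcess (the tooling choice is an assumption; the rotation is learned on the training predictors and the same object can later be applied to the test set):

```r
# Center, scale, and apply PCA, keeping components explaining 95% of the variance.
library(caret)
predictors <- setdiff(names(train), "price")
pca_pre    <- preProcess(train[, predictors],
                         method = c("center", "scale", "pca"), thresh = 0.95)
train_pca       <- predict(pca_pre, train[, predictors])
train_pca$price <- train$price
```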
I will also try the whole dataset in the training process and then decide whether to keep the whole dataset or use one of the two dimension reduction methods. Therefore, in the training phase, I will fit all of the layer-3 models with three datasets: the whole dataset, the feature-selection dataset, and the PCA dataset.
In this layer, I will train six models in total, divided into two main categories. The first category is linear models: Linear Regression, Ridge Regression, and Lasso Regression. The reason is that linear regressions tend to be very effective for house price prediction tasks and can have very good generalization power.
The second category consists of more complex and ensemble models: Random Forest, Extreme Gradient Boosting (XGBoost), and a neural network for regression. These three models are more complex and have the ability to capture local information; they could perform well on a complex dataset, which could be the case for ours.
I intend to have another layer that aggregates the results from the training layer to improve performance, and this aggregation layer needs unseen data for training to avoid data leakage (which would result in weak generalization power if both the training and aggregation layers were trained on the same data). Therefore, I will further divide the test data into two sets: a test set and a validation set.
In the report, I will focus only on the results of the training process. For more information about the best-tuned hyperparameters from cross-validation, please check Appendix B.
The table below summarizes the results of all the models above when fitted with the whole dataset.
## R2_score Train_RMSE_score Valid_RMSE_score Valid_MAE
## Linear Regression 0.7516120 186504.2 179385.5 116250.16
## Ridge Regression 0.7514570 187364.4 179441.5 116496.60
## Lasso Regression 0.7516120 236698.4 179385.5 116250.16
## Random Forest 0.8797898 134462.6 124793.7 68704.47
## Extreme Gradient Boosting 0.9050910 139892.2 110885.8 62444.46
## Neural Network 0.8698137 107256.3 129868.8 77940.15
The table below summarizes the results of all the models above when fitted with the feature-selection dataset.
## R2_score Train_RMSE_score Valid_RMSE_score Valid_MAE
## Linear Regression 0.7492711 187799.7 180228.9 116568.02
## Ridge Regression 0.7490986 188373.1 180290.8 116817.85
## Lasso Regression 0.7492711 242500.6 180228.9 116568.02
## Random Forest 0.8768896 133811.7 126290.2 69779.90
## Extreme Gradient Boosting 0.8967312 128307.1 115666.3 65401.23
## Neural Network 0.8881390 107576.1 120382.0 72573.09
The table below summarizes the results of all the models above when fitted with the PCA dataset.
## R2_score Train_RMSE_score Valid_RMSE_score MAE
## Linear Regression 0.7367729 191165.44 184666.2 120123.79
## Ridge Regression 0.7367729 191682.33 184666.2 120123.79
## Lasso Regression 0.7367729 240898.05 184666.2 120123.79
## Random Forest 0.8173886 163488.06 153810.5 90543.48
## Extreme Gradient Boosting 0.8356687 157954.85 145909.1 86506.43
## Neural Network 0.8444288 95704.92 141966.8 81363.93
Based on the results on the three datasets, we can see that the PCA transformation might not be a good dimension reduction method for this particular task: all six models fitted with the PCA-transformed dataset have significantly lower performance than those fitted with the other two datasets. Therefore, I will exclude the PCA transformation from layer 2 (dimension reduction).
Within the results for each dataset, the three complex models - Random Forest, XGBoost, and Neural Network - have significantly better performance than the three simple models. Therefore, I will use only these three models in layer 3 of the final stacking model.
I will take the models selected in step 1 and use the test set to train some simple models that aggregate their results. Then, the validation set will be used to evaluate the performance and choose the best aggregation model for the task.
The scatter plot indicates a strong correlation between each model's predictions and the price. However, as the price increases beyond $1,500,000, the prediction variance seems to grow, making the predictions less accurate.
From my perspective, this high variance at high prices may be due to unique characteristics of those houses that are not represented in the dataset.
Based on this plot, I will try models that could average out the errors in the results of the three models from the previous layer: the average method, linear regression, polynomial regression of degree 2, K-nearest neighbors, and a support vector machine with a polynomial kernel of degree 2.
The average method is the simplest approach: it simply averages the predictions and could have the highest generalization power due to its simplicity. Based on the plot, linear regression, polynomial regression of degree 2, and the support vector machine with a degree-2 polynomial kernel seem to be good fits for the task. The K-nearest neighbors regressor, on the other hand, could group some unusual observations and average out the errors in their results.
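A minimal sketch of the aggregation idea, assuming rf_fit, xgb_fit, and nn_fit are the fitted layer-3 models and test/valid are the held-out sets described earlier (the kernlab call and hyper-parameters are illustrative):

```r
# Layer-3 predictions on the test set become the inputs of the layer-4 model;
# the validation set is used only to compare aggregators.
library(kernlab)
stack_train <- data.frame(
  rf    = predict(rf_fit,  test),
  xgb   = predict(xgb_fit, test),
  nn    = predict(nn_fit,  test),
  price = test$price
)
svm_agg <- ksvm(price ~ rf + xgb + nn, data = stack_train,
                kernel = "polydot", kpar = list(degree = 2), C = 10)

stack_valid <- data.frame(
  rf  = predict(rf_fit,  valid),
  xgb = predict(xgb_fit, valid),
  nn  = predict(nn_fit,  valid)
)
valid_pred <- predict(svm_agg, stack_valid)
sqrt(mean((valid_pred - valid$price)^2))   # validation RMSE of the stack
```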
Although all the models will be fitted, I will just choose one model with the best performance on the validation dataset.
Just as in layer 3, all of the information about the best-tuned hyper-parameters and validation scores is located in Appendix B. The first table below shows the aggregation results when layer 3 is fitted with the whole dataset; the second table uses the feature-selection dataset.
## R2_score Train_RMSE_score Valid_RMSE_score Valid_MAE
## Average Method 0.9011222 120771.2 113180.5 64627.26
## Linear Regression 0.9045934 117869.8 111176.1 65389.38
## Polynomial Regression 0.9071733 109367.6 109662.6 62538.39
## KNN 0.8955963 134280.3 116300.1 66268.01
## SVM with Poly Kernel 0.9089060 120163.9 108634.3 62177.23
## R2_score Train_RMSE_score Valid_RMSE_score Valid_MAE
## Average Method 0.8987718 119590.9 114517.8 65472.09
## Linear Regression 0.9013858 115193.7 113029.5 66869.03
## Polynomial Regression 0.8803858 110313.3 124484.0 72100.76
## KNN 0.8757407 132761.5 126878.0 75828.76
## SVM with Poly Kernel 0.9089060 120163.9 108634.3 62177.23
Based on the results in the two tables, we can see that the Support Vector Machine (SVM) with a polynomial kernel has the best performance, significantly better than all of the other models. Therefore, I will include only the SVM with the polynomial kernel in the aggregation layer.
Between the whole dataset and the feature-selection dataset, the difference in model performance is not significant. Therefore, I decided to use feature selection for the dimension reduction layer because it greatly decreases training and prediction time.
After training all the models, the aggregation layer will generate the final result - the predicted house price.
The following diagram displays the structure of the final stacking model.
For the first layer of the stacking model, many strategies were applied to impute the missing values in the dataset. These methods, which utilize the correlations between variables, help preserve as much of the variation in the dataset as possible. I had considered clustering-based imputation before deciding on the methods used in the first layer. However, from my perspective, with high-dimensional data like ours, clustering might not perform well because such methods depend on similarity (the distance between observations), which becomes much less meaningful in high dimensions.
For the second layer, dimension reduction, the feature selection method demonstrated its simplicity and high performance through the two layers of training, so it is the only method kept for the final model.
In the third layer, the training layer, high-capacity regression models such as Random Forest, XGBoosting, and the Neural Network demonstrated their power in dealing with a complex dataset, and these three models became the core of the final stacking model.
In the fourth layer, the aggregation layer, the Support Vector Machine with a polynomial kernel turned out to be the best model for aggregating the final result.
Finally, the stacking model achieves an RMSE of nearly $109,000 and a Mean Absolute Error (MAE) of roughly $62,000, a significant improvement over the best model in the training layer: XGBoosting with tuned hyper-parameters achieves an RMSE of over $115,000 and an MAE of over $65,000.
## Stack_model XGBoosting
## RMSE 108634.320 115666.280
## MAE 62177.230 65401.230
## R_squared 0.909 0.897
In conclusion, the designed stacking model does improve on the performance of a single model by combining a group of the best models in the training layer: the RMSE decreases by 5.89%, the MAE decreases by 5.17%, and the R-squared increases by 0.01. This model could be a first step toward building a more complex and efficient model for predicting house prices, not only in King County but potentially in many other states in the U.S.
For the models fitted with the whole dataset:
Ridge Regression:
grid_ridgem$bestTune
## lambda
## 3 0.005
Lasso Regression:
grid_lassom$bestTune
## fraction
## 3 1
Random Forest:
grid_treem$bestTune
## mtry
## 2 15
XGBoosting:
grid_xgbm$bestTune
## nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 5 2000 5 0.03 0 1 1 0.75
Neural Network:
For the models fitted with the feature selection dataset:
Ridge Regression:
grid_ridgem2$bestTune
## lambda
## 3 0.005
Lasso Regression:
grid_lassom2$bestTune
## fraction
## 3 1
Random Forest:
grid_treem2$bestTune
## mtry
## 1 10
XGBoosting:
grid_xgbm2$bestTune
## nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 4 2100 5 0.03 0 1 1 0.75
Neural Network:
For the models fitted with the PCA dataset:
Ridge Regression:
grid_ridgem3$bestTune
## lambda
## 5 0.01
Lasso Regression:
grid_lassom3$bestTune
## fraction
## 3 1
Random Forest:
grid_treem3$bestTune
## mtry
## 2 15
XGBoosting:
grid_xgbm3$bestTune
## nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 4 2100 5 0.03 0 1 1 0.75
Neural Network:
For the whole dataset:
KNN Regression:
layer2_knn$bestTune
## k
## 3 10
SVM with Poly kernel:
layer2_svm$bestTune
## degree scale C
## 3 2 TRUE 10
For the feature selection dataset:
KNN Regression:
layer2_knn2$bestTune
## k
## 2 5
SVM with Poly kernel:
layer2_svm2$bestTune
## degree scale C
## 3 2 TRUE 10
Book
Alpaydın, Ethem. 2015. Introduction to Machine Learning, 3rd Edition. MIT Press.
Websites
Property Shark. 2017. "The Most Expensive Zip Codes in Washington State – Medina Homes 8x Pricier than U.S. Median." https://www.propertyshark.com/Real-Estate-Reports/2017/10/04/expensive-zip-codes-washington-state-medina-homes-8x-pricier-u-s-median/#
Statista. 2020. "Number of existing homes sold in the United States from 2005 to 2021." https://www.statista.com/statistics/226144/us-existing-home-sales/
National Association of Realtors. 2020. "Existing-Home Sales Soar 9.4% to 6.5 Million in September." https://www.nar.realtor/newsroom/existing-home-sales-soar-9-4-to-6-5-million-in-september