(Knitting the whole project takes about 13 minutes 52 seconds; chunk 44 accounts for most of that time.)
This data set appeared in the 1997 paper “Sparse Spatial Autoregressions” by R. Kelley Pace and Ronald Barry, published in Statistics and Probability Letters. It was built from the 1990 California housing census data.
The data set contains 20,640 observations of 10 variables. The response variable, “median house value”, gives the median house price of the corresponding block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data; in this data set the average block-group population is 1,425.5. The 9 other variables are treated as candidate predictors of house price: 8 are continuous and one is a geographical categorical variable.
## longitude latitude housing_median_age total_rooms
## Min. :-124.3 Min. :32.54 Min. : 1.00 Min. : 2
## 1st Qu.:-121.8 1st Qu.:33.93 1st Qu.:18.00 1st Qu.: 1448
## Median :-118.5 Median :34.26 Median :29.00 Median : 2127
## Mean :-119.6 Mean :35.63 Mean :28.64 Mean : 2636
## 3rd Qu.:-118.0 3rd Qu.:37.71 3rd Qu.:37.00 3rd Qu.: 3148
## Max. :-114.3 Max. :41.95 Max. :52.00 Max. :39320
##
## total_bedrooms population households median_income
## Min. : 1.0 Min. : 3 Min. : 1.0 Min. : 0.4999
## 1st Qu.: 296.0 1st Qu.: 787 1st Qu.: 280.0 1st Qu.: 2.5634
## Median : 435.0 Median : 1166 Median : 409.0 Median : 3.5348
## Mean : 537.9 Mean : 1425 Mean : 499.5 Mean : 3.8707
## 3rd Qu.: 647.0 3rd Qu.: 1725 3rd Qu.: 605.0 3rd Qu.: 4.7432
## Max. :6445.0 Max. :35682 Max. :6082.0 Max. :15.0001
## NA's :207
## median_house_value ocean_proximity
## Min. : 14999 Length:20640
## 1st Qu.:119600 Class :character
## Median :179700 Mode :character
## Mean :206856
## 3rd Qu.:264725
## Max. :500001
##
## 'data.frame': 20640 obs. of 10 variables:
## $ longitude : num -122 -122 -122 -122 -122 ...
## $ latitude : num 37.9 37.9 37.9 37.9 37.9 ...
## $ housing_median_age: num 41 21 52 52 52 52 52 52 42 52 ...
## $ total_rooms : num 880 7099 1467 1274 1627 ...
## $ total_bedrooms : num 129 1106 190 235 280 ...
## $ population : num 322 2401 496 558 565 ...
## $ households : num 126 1138 177 219 259 ...
## $ median_income : num 8.33 8.3 7.26 5.64 3.85 ...
## $ median_house_value: num 452600 358500 352100 341300 342200 ...
## $ ocean_proximity : chr "NEAR BAY" "NEAR BAY" "NEAR BAY" "NEAR BAY" ...
Detailed descriptions of all variables can be found at the link.
I will apply several methods discussed in the Foundations of Data Science course to predict the median house value, treating all other variables in the sample as candidate predictors. Because the full sample is too large for our purposes, I randomly subsample 30% of the observations, giving N = 6192.
## longitude latitude housing_median_age total_rooms
## Min. :-124.3 Min. :32.54 Min. : 1.00 Min. : 2
## 1st Qu.:-121.8 1st Qu.:33.94 1st Qu.:18.00 1st Qu.: 1452
## Median :-118.5 Median :34.28 Median :28.00 Median : 2154
## Mean :-119.6 Mean :35.67 Mean :28.41 Mean : 2647
## 3rd Qu.:-118.0 3rd Qu.:37.73 3rd Qu.:37.00 3rd Qu.: 3150
## Max. :-114.5 Max. :41.95 Max. :52.00 Max. :32054
##
## total_bedrooms population households median_income
## Min. : 2.0 Min. : 3 Min. : 2.0 Min. : 0.536
## 1st Qu.: 298.0 1st Qu.: 788 1st Qu.: 280.0 1st Qu.: 2.553
## Median : 438.0 Median : 1175 Median : 412.0 Median : 3.537
## Mean : 538.4 Mean : 1422 Mean : 499.9 Mean : 3.893
## 3rd Qu.: 648.0 3rd Qu.: 1725 3rd Qu.: 609.0 3rd Qu.: 4.805
## Max. :5290.0 Max. :15507 Max. :5050.0 Max. :15.000
## NA's :63
## median_house_value ocean_proximity
## Min. : 14999 Length:6192
## 1st Qu.:121400 Class :character
## Median :179650 Mode :character
## Mean :208370
## 3rd Qu.:268225
## Max. :500001
##
Before looking into the detailed data structure, we first visualize the response “median_house_value” geographically and compare the subsample with the original data. The two patterns look alike, so we regard this subset as usable.
In general, houses near the coast are more expensive, although some inland areas also command high prices. This agrees with common sense.
In this chapter we carry out exploratory data analysis.
First, we draw histograms and box plots to examine the variation and distribution of all continuous variables.
Insights:
1. Several variables have many outliers; we will address them in the next chapter.
2. The distributions of the last six variables are noticeably skewed, which is somewhat concerning.
Next, we examine the categorical variable with a bar chart.
Clearly, we should drop the “ISLAND” group, which has only 2 observations.
Turning to bivariate relationships, we first draw a scatterplot matrix of all continuous variables.
We observe no clear linear relationship between most predictors and the response; only median_income shows anything resembling a linear trend. This suggests that non-linear models may perform better than linear ones.
We also draw the correlation matrix for further insight.
Insights:
1. The house price is strongly and positively correlated with median income.
2. Longitude and latitude move in opposite directions, which makes sense: in California, the farther north one lives, the farther west one tends to live. We will combine them into a single variable in the next chapter.
3. Population, households and total_bedrooms are highly positively correlated, which is understandable since they are all demographic variables. We will combine their information into new variables in the next chapter.
Next, we look at the relationship between the categorical variable and the response. Two things stand out:
1. There is a dividing point around 1.8e05 between “INLAND” and “<1H OCEAN”: above it, “INLAND” blocks are rare; below it, “<1H OCEAN” blocks are rare.
2. “NEAR BAY” shows a slight tendency toward higher prices, while the pattern for “NEAR OCEAN” is vaguer.
First, we delete the 2 “ISLAND” observations. Then we deal with missing values.
We have 63 missing values, all in “total_bedrooms”. We impute them with the variable’s median rather than its mean, because we have already detected many outliers.
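In code (as in the appendix, where the working data frame after dropping the “ISLAND” rows is called df_no_na), the imputation is a single line:

# replace the missing total_bedrooms values with the column median
df_no_na$total_bedrooms[is.na(df_no_na$total_bedrooms)] <- median(df_no_na$total_bedrooms, na.rm = TRUE)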
## The number of missing values is 63
## The number of missing values of each column is
## 0 0 0 0 63 0 0 0 0 0
## The number of missing values after operation is 0
As shown earlier, some variables are heavily correlated. Since we do not have many variables to work with, we create new variables that carry their information and drop the old ones (a sketch follows the list):
1. Add longitude and latitude into a new variable “diagonal”, representing diagonal location, and drop the originals.
2. Divide total_bedrooms by total_rooms into a new variable “congestion_bedroom”, representing how crowded the rooms are, and drop the originals.
3. Drop “households”, because its correlation with “population” is almost 100%, so it adds essentially no new information.
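A minimal sketch of the two derived features, using the variable names from the appendix:

# combined geographic position (longitude + latitude)
diagonal <- df_no_na$longitude + df_no_na$latitude
# bedrooms per room, a measure of crowding
congestion_bedroom <- df_no_na$total_bedrooms / df_no_na$total_rooms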
After the transformation, all remaining variables roughly follow skewed, bell-shaped distributions. Their scales differ greatly, so we standardize them.
We deliberately handle outliers after the transformation in order to retain as much information as possible in the working sample.
Importantly, we do not fix the outliers of the response variable “median_house_value”, because they occur frequently and clearly carry information (assuming no collection errors; in fact, in the original data these outliers all equal “500001”, the value at which the researchers censored the much higher raw prices).
To handle outliers in the predictors, we use the IQR rule, keeping in line with the original authors; the rule is stated below.
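Concretely, the rule implemented in detect_outlier (see the appendix) flags a value \(x\) as an outlier when
\[
x < Q_1 - 1.5\,\mathrm{IQR} \quad\text{or}\quad x > Q_3 + 1.5\,\mathrm{IQR}, \qquad \mathrm{IQR} = Q_3 - Q_1,
\]
where \(Q_1\) and \(Q_3\) are the first and third quartiles of that variable.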
## The number of observations we dropped during outlier elimination is 969
We dropped 969 observations containing predictor outliers, and we can now check the distributions again.
Compared with the earlier plots, the outliers have been effectively removed.
We then encode the categorical variable “ocean_proximity”: it has 4 levels, and 3 dummy variables are enough to represent it (one level serves as the baseline).
Although we have already done some manual feature engineering, given the very limited number of variables we also run PCA on the original data set (with no variables deleted) as an alternative that discards no information up front. We do not expect much from PCA, since it is designed for high-dimensional settings, but we use it as a comparison with the transformed set and to see how PCA behaves with few features.
PCA is sensitive to scale and to outliers, so we deal with both before carrying it out. First, we recap the distributions.
## The number of observations we dropped during outlier elimination is 905
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.3988 1.0570 0.8499 0.8002 0.25905 0.2335 0.12646
## Proportion of Variance 0.4271 0.2439 0.1577 0.1398 0.01465 0.0119 0.00349
## Cumulative Proportion 0.4271 0.6710 0.8287 0.9685 0.98310 0.9950 0.99849
## PC8
## Standard deviation 0.08315
## Proportion of Variance 0.00151
## Cumulative Proportion 1.00000
We use the first k = 5 principal components as predictors, which account for about 98% of the total variation.
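For reference, the cumulative proportion of variance reported above is computed as
\[
\frac{\sum_{j=1}^{5} s_j^{2}}{\sum_{j=1}^{8} s_j^{2}} \approx 0.983,
\]
where \(s_j\) is the standard deviation of the \(j\)-th principal component.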
We use GLM, elastic net, Poly-GLM, Random Forest and GBM models to predict the median house price, and compare all of them on the transformed sample (T set) and the PCA sample (P set). First, we include the mean and kitchen-sink predictions as baselines.
Shrinkage methods are included because the data still contain a lot of correlation. We use the elastic net and tune the mixing hyperparameter alpha and the penalty weight lambda.
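For reference, glmnet fits the elastic net by minimizing
\[
\frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2 + \lambda\left[\frac{1-\alpha}{2}\lVert\beta\rVert_2^{2} + \alpha\lVert\beta\rVert_1\right],
\]
where \(\alpha\) mixes the ridge and lasso penalties and \(\lambda\) controls the overall amount of shrinkage.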
Considering possible non-linearity, we also apply a polynomial transformation to the columns and regress on the expanded set, tuning the degree p. Finally, to probe the potential non-linearity further (or rule it out), we introduce Random Forest and Gradient Boosting Machine. Although these two models are designed to shine on data with many features, we still try them and compare them with the other models.
All models are evaluated by test-set MSE. The aim is to compare predictive accuracy across models; the absolute error level matters less, since the original sample has few features and considerable correlation among them. We also compare the influence each feature/variable has in the different models.
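Concretely, for each model we report
\[
\mathrm{MSE} = \frac{1}{n_{\text{test}}}\sum_{i=1}^{n_{\text{test}}}\left(y_i - \hat{y}_i\right)^2 ,
\]
computed on the held-out test set.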
We randomly assign 90% of the data to the training set and 10% to the test set.
## Dimensions of training sets cut from T set are 4699 1 (response) and 4699 8 (predictor)
## Dimensions of test sets cut from T set are 522 1 (response) and 522 8 (predictor)
## Dimensions of training sets cut from P set are 4757 1 (response) and 4757 5 (predictor)
## Dimensions of test sets cut from P set are 528 1 (response) and 528 5 (predictor)
First, we use the mean forecast as a baseline. When analysing the results, every model should at least beat this naive benchmark; if one does not, we should suspect a modeling mistake or some peculiarity of the data that we overlooked.
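In other words, the naive forecast assigns the training-sample mean to every test observation:
\[
\hat{y}_i = \bar{y}_{\text{train}} \quad \text{for every } i \text{ in the test set.}
\]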
## The MSE from naive prediction for T set is 0.838671
## The MSE from naive prediction for P set is 0.8880077
Our kitchen sink, a linear regression on all predictors, is the simplest model after the mean prediction, and we include it for much the same reason. We should be concerned if any other model performs worse than these two baselines.
## The MSE from ksink prediction for T set is 0.3302361
## The MSE from ksink prediction for P set is 0.4281945
Using caret::train with tuneLength = 10, we estimate the best hyperparameters alpha and lambda.
## The best hyperparameters for T set are:
## alpha lambda
## 5 0.1 0.01895798
## The best hyperparameters for P set are:
## alpha lambda
## 2 0.1 0.003628609
## The MSE from elNet prediction for T set is 0.3316367
## The MSE from elNet prediction for P set is 0.4284595
We can plot the hyperparameter search to inspect how the best values were selected:
The selected alpha and lambda are small (close to zero) for both sets, indicating that very little shrinkage is applied and the elastic net adds little here.
Our exploratory analysis suggested that the true model is unlikely to be linear, so a polynomial transformation is worth trying. We test degrees p from 1 to 8.
First, we define a polynomial transformation function that expands every variable into p columns containing its powers from degree 1 to p.
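In matrix terms, the pf function in the appendix replaces each predictor column \(x_j\) by its first \(p\) powers,
\[
x_j \;\longmapsto\; \left(x_j,\, x_j^{2},\, \dots,\, x_j^{p}\right),
\]
so a predictor matrix with \(m\) columns becomes one with \(m \times p\) columns.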
Then we create a search function for the best degree: given a number M, it runs M predictions using predictor sets transformed with degrees 1 to M and returns all M MSEs together with the degree p whose MSE is smallest.
The best degree p is 5 for the T set and 8 for the P set. We then use these tuned degrees to predict.
## The MSE from Poly-GLM for T set is 0.3109698
## The MSE from Poly-GLM for P set is 0.3785283
In this section we have two parameters to tune, “mtry” and “ntree”. We do not tune mtry but fix it at 3, roughly one third of the number of variables, because we have so few features (and because the RF algorithm is computationally expensive). It is still worth exploring how “ntree” should be set.
We design a function that runs 10 random forests with ntree increasing from 100 to 1000 in steps of 100.
We then calculate the MSEs and plot them.
## The best ntree of RF for T set is 800
## The best ntree of RF for P set is 600
## The MSE from RF for T set is 0.2762568
## The MSE from RF for P set is 0.3253614
GBM usually performs better than RF thanks to its sequential learning process, but it carries a higher risk of getting stuck in local optima and of overfitting, so its hyperparameters need careful handling. We tune three of them: “interaction.depth”, “n.trees” and “shrinkage”.
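The caret tuning grid we search over (reproduced from the GBM chunk in the appendix) is:

gbmGrid <- expand.grid(interaction.depth = c(1, 2, 3),    # tree depth
                       n.trees = (1:15) * 100,            # number of boosting iterations
                       shrinkage = (1:10) * 0.01,         # learning rate
                       n.minobsinnode = 20)               # minimum node size, held fixed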
The plots support both the overfitting/local-optimum risk and the importance of tuning: as the hyperparameters vary, the accuracy changes considerably.
## The best models for T and P set are:
## shrinkage interaction.depth n.minobsinnode n.trees RMSE Rsquared
## 176 0.04 3 20 1100 0.5608684 0.6521633
## shrinkage interaction.depth n.minobsinnode n.trees RMSE Rsquared
## 264 0.06 3 20 900 0.5852266 0.6082758
Then we use the tuned GBM model to predict:
## The MSE from GBM for T set is 0.2734464
## The MSE from GBM for P set is 0.3334095
Having finished all the modelling and prediction, we now look more closely at the results and compare across samples and models.
First of all, every model is effective in the sense that it beats the naive mean prediction by a wide margin.
Secondly, the PCA transformation actually hurts accuracy for every model we covered. We attribute this to the limited number of features: with so few variables, PCA mostly discards variation and information rather than reducing redundancy as it does in high-dimensional settings. That said, the relative ranking of the models is the same on the PCA and transformed sets.
Model-wise, the ensemble models always give the best predictions in our case, although the ordering between GBM and RF is unclear: GBM is better on the T set and RF on the P set. The improvement over the kitchen sink is substantial (around 0.06 in MSE). This is somewhat surprising because, as stated at the outset, ensemble algorithms generally excel in high-dimensional settings and are weaker with few features. We suspect it is related to the spatially clustered distribution of house prices, which suits tree-based algorithms (a glance back at the map supports this), and it may also reflect non-linearity in the true model.
Poly-GLM also works well here and is the second-best model. We can look at the contributions of the variables created by the polynomial transformation.
From the plots (in the labels, the trailing number is the polynomial degree), some non-linear (higher-degree) terms carry substantial influence, especially in the PCA set. This explains why Poly-GLM performs somewhat better than the kitchen sink. On the T set, median_income, diagonal and congestion_bedroom are the most influential variables, with the higher-degree terms of the latter two also carrying weight.
Finally, we also care about whether the elastic net improves performance. Unfortunately, the answer is no: its accuracy is even slightly worse than the simple kitchen sink. The fitted model tells the same story: the chosen lambda is close to zero, so the penalty barely shrinks anything and the model it fits is essentially the kitchen sink.
We evaluate the contribution of each variable by its coefficient or importance in each model, scaling these values to make them comparable across models.
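Concretely, each column of coefficients/importances is standardized before plotting (this is what scale() does in the appendix code):
\[
z_{ij} = \frac{c_{ij} - \bar{c}_{j}}{\operatorname{sd}(c_{j})},
\]
where \(c_{ij}\) is the coefficient or importance of variable \(i\) in model \(j\).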
The top 3 influences are “median_income”, “diagonal” and “INLAND” for every model except RF, which has “NEAR.BAY” instead of “INLAND”; both, however, encode geographic information. These results differ slightly from the Poly-GLM model, which gives “INLAND” little weight but “congestion_bedroom” a lot.
Although the variables in the P set have no direct interpretation, we still use them to examine the robustness of the predictions.
We observe that PC4 is the most influential component for all models; the linear models rank PC5 second, while the ensemble models rank PC3 second (GBM technically ranks PC2 second, but its gap between PC2 and PC3 is minute).
We predicted the response with two baseline models (naive mean and kitchen sink), one penalized linear model (elastic net), one non-linear model (Poly-GLM) and two ensemble models (RF and GBM, also non-linear), each on the two constructed data sets. The results show that, even though the number of features is very limited and some correlation remains, the ensemble models still outperform the others. We also visualized the GBM hyperparameter search, which underlined the importance of tuning for this model. The PCA transformation, rather than helping, reduced accuracy for every model covered; we think this is because PCA pays off when there are many redundant features, and in our case it could not remove redundancy without also discarding useful information.
We see several ways to improve the project:
1. We could address the skewness of the data at the EDA stage, which the present work does not cover.
2. The Poly-GLM results show that higher-degree terms of some key variables carry information; including these terms in the other models might improve accuracy.
3. The RF model is remarkably expensive to run; in future work it is advisable to try GBM first in most cases.
Thanks for reading and have a good day!
knitr::opts_chunk$set(echo = FALSE, warning=FALSE,message=FALSE,fig.align = "center")
rm(list=ls())
set.seed(2022)
data <- read.csv("https://raw.githubusercontent.com/ageron/handson-ml/master/datasets/housing/housing.csv")
summary(data)
str(data)
library(dplyr)
N=6192
set.seed(2022)
df <- sample_n(data,N)
summary(df)
library(maps)
library(ggplot2)
library(gridExtra)
map <- map_data("state",region = "california")
g1 <- ggplot(df,aes(longitude,latitude))+
geom_polygon(data = map, aes(x=long, y = lat), fill="black", alpha=0.8)+
geom_point(size=1,aes(colour= median_house_value))+
scale_color_gradient(low = "light blue",high = "red")
g2 <- ggplot(data,aes(longitude,latitude))+
geom_polygon(data = map, aes(x=long, y = lat), fill="black", alpha=0.8)+
geom_point(size=1,aes(colour= median_house_value))+
scale_color_gradient(low = "light blue",high = "red")
grid.arrange(g1,g2,ncol=2)
# define a function that arranges the histogram and box plot of each variable side by side
his_box <- function(df,x){
g1 <- ggplot(df,mapping = aes(df[,x]))+
geom_histogram()+
xlab(names(df)[x])
g2 <- ggplot(df,
mapping = aes(df[,x]))+
geom_boxplot(outlier.colour="red", outlier.shape=16,outlier.size=2)+
xlab(names(df)[x])
return(grid.arrange(g1,g2,ncol=2))
}
his <- lapply(c(1:9), his_box,df=df)
grid.arrange(his[[1]],his[[2]],his[[3]],nrow=3)
grid.arrange(his[[4]],his[[5]],his[[6]],nrow=3)
grid.arrange(his[[7]],his[[8]],his[[9]],nrow=3)
ggplot(df,aes(x=ocean_proximity),na.rm = TRUE)+
geom_bar()+
xlab(names(df)[ncol(df)])
library(GGally)
ggpairs(df[,1:9],
lower = list(continuous = wrap("smooth", alpha = 1/10)),
upper = list(continuous="cor")
)
library(ggcorrplot)
ggcorrplot(cor(df[,1:9],use="complete.obs"),lab=TRUE,lab_size = 2,digits = 3)+
theme(axis.text.y = element_text(size = 8)) +
theme(axis.text.x = element_text(size = 8))
g1 <- ggplot(df,aes(median_house_value,fill=ocean_proximity))+
geom_histogram(position = 'fill')+
scale_fill_manual(values=c("darkgreen", "orange","black","darkblue","darkred")
)
g1 <- g1+geom_vline(xintercept=180000,linetype="dotted",size=1,color="white")
g2 <- ggplot(df,aes(ocean_proximity,fill=ocean_proximity))+
geom_bar()+
scale_fill_manual(values=c("darkgreen", "orange","black","darkblue","darkred")
)
grid.arrange(g1,g2,nrow=2)
df_no_island <- df[-which(df[,ncol(df)]=="ISLAND"),]
cat("The number of missing values is",sum(is.na(df_no_island)))
na <- matrix(NA,ncol = ncol(df_no_island))
for (i in 1:ncol(df_no_island)) {
na[,i] <- sum(is.na(df_no_island[,i]))
}
cat("The number of missing values of each colunms is","\n",na)
df_no_na <- df_no_island
df_no_na$total_bedrooms[is.na(df_no_na$total_bedrooms)] <- median(df_no_na$total_bedrooms,na.rm = TRUE)
cat("The number of missing values after operation is",sum(is.na(df_no_na)))
diagonal <- df_no_na$longitude+df_no_na$latitude
congestion_bedroom <- df_no_na$total_bedrooms/df_no_na$total_rooms
df_new <- cbind.data.frame(diagonal,congestion_bedroom,df_no_na[c(-1,-2,-4,-5,-7)])
library(ggcorrplot)
ggcorrplot(cor(df_new[,1:6],use="complete.obs"),lab=TRUE,lab_size = 2,digits = 3)+
theme(axis.text.y = element_text(size = 8)) +
theme(axis.text.x = element_text(size = 8))
his_trans <- lapply(c(1:6), his_box,df=df_new)
grid.arrange(his_trans[[1]],his_trans[[2]],his_trans[[3]],
his_trans[[4]],his_trans[[5]],his_trans[[6]],
nrow=3,ncol=2)
df_scaled <- cbind.data.frame(scale(df_new[,1:(ncol(df_new)-1)]),df_new[,ncol(df_new)])
colnames(df_scaled) <- colnames(df_new)
# create an outlier detection function
detect_outlier <- function(x){
Q1 <- quantile(x, probs=.25,na.rm=TRUE)
Q3 <- quantile(x, probs=.75,na.rm=TRUE)
IQR = Q3-Q1
# return logical type
x > Q3 + (IQR*1.5) | x < Q1 - (IQR*1.5)
}
df_clean_tail <- df_scaled
#here by"-2", we exclude the response "median_house_value" and the categorical variable.
for (i in 1:(ncol(df_clean_tail)-2)){
df_clean_tail <- rbind.data.frame(df_clean_tail[which(!detect_outlier(df_clean_tail[,i])),],df_clean_tail[which(is.na(df_clean_tail[,i])),])
# rbind keeps rows that are NA in the current column, which detect_outlier would otherwise drop
}
cat("The number of observations we dropped during outlier elimination is",nrow(df_scaled)-nrow(df_clean_tail))
his_clean <- lapply(c(1:6), his_box,df=df_clean_tail)
grid.arrange(his_clean[[1]],his_clean[[2]],his_clean[[3]],
his_clean[[4]],his_clean[[5]],his_clean[[6]],
nrow=3,ncol=2)
ggpairs(df_clean_tail[,1:6],
lower = list(continuous = wrap("smooth", alpha = 1/10)),
upper = list(continuous="cor")
)
library(dummy)
dummy <- data.frame(df_clean_tail$ocean_proximity)
df_trans <- cbind.data.frame(df_clean_tail[,1:(ncol(df_clean_tail)-1)],dummy(x=dummy,int = TRUE)[,-1])
#"ocean_proximity" has 4 levels and we only need 3 of them to make up dummy.So it is dummy()[,-1]
#scale before pca
df_no_na_scaled <- data.frame(scale(df_no_na[,1:(ncol(df_no_na)-1)]))
#distribution
his_scaled <- lapply(c(1:9), his_box,df=df_no_na_scaled)
grid.arrange(his_scaled [[1]],his_scaled [[2]],his_scaled [[3]],
his_scaled [[4]],his_scaled [[5]],his_scaled [[6]],
his_scaled [[7]],his_scaled [[8]],his_scaled [[9]],
nrow=3,ncol=3)
#outliers
df_nns_clean <- df_no_na_scaled
# we exclude the response "median_house_value" from outlier removal
for (i in 1:(ncol(df_nns_clean)-1)){
df_nns_clean <- df_nns_clean[which(!detect_outlier(df_nns_clean[,i])),]
# drop rows flagged as outliers in column i (no NAs remain at this point)
}
cat("The number of observations we dropped during outlier elimination is",nrow(df_no_na_scaled)-nrow(df_nns_clean))
#pca
df_pca_x <- data.frame(df_nns_clean[,1:(ncol(df_nns_clean)-1)])
pca <- prcomp(df_pca_x)
summary(pca)
#pca data
k <- 5
df_pca <- cbind.data.frame(data.frame(pca$x)[,1:k],df_nns_clean[,ncol(df_nns_clean)])
colnames(df_pca)[ncol(df_pca)] <- "median_hous_value"
his_pca <- lapply(c(1:6), his_box,df=df_pca)
grid.arrange(his_pca[[1]],his_pca[[2]],his_pca[[3]],
his_pca[[4]],his_pca[[5]],his_pca[[6]],
nrow=3,ncol=2)
test_rate <- 0.1
set.seed(2022)
picked <- sample(seq_len(nrow(df_trans)),size = nrow(df_trans)*test_rate)
df_trans_test <- df_trans[picked,]
df_trans_train <- df_trans[-picked,]
set.seed(2022)
picked <- sample(seq_len(nrow(df_pca)),size = nrow(df_pca)*test_rate)
df_pca_test <- df_pca[picked,]
df_pca_train <- df_pca[-picked,]
x_dft_test <- data.frame(df_trans_test[,which(names(df_trans_test)!="median_house_value")])
y_dft_test <-data.frame(df_trans_test[,which(names(df_trans_test)=="median_house_value")])
x_dft_train <- data.frame(df_trans_train[,which(names(df_trans_train)!="median_house_value")])
y_dft_train <-data.frame(df_trans_train[,which(names(df_trans_train)=="median_house_value")])
x_dfp_test <- data.frame(df_pca_test[,-ncol(df_pca_test)])
y_dfp_test <-data.frame(df_pca_test[,ncol(df_pca_test)])
x_dfp_train <- data.frame(df_pca_train[,-ncol(df_pca_train)])
y_dfp_train <-data.frame(df_pca_train[,ncol(df_pca_train)])
# give all four response data frames the same column name
colnames(y_dft_test) <- colnames(y_dft_train) <- "median_house_value"
colnames(y_dfp_test) <- colnames(y_dfp_train) <- "median_house_value"
cat("Dimensions of training sets cut from T set are",dim(y_dft_train),"(response)","and",dim(x_dft_train),"(predictor)")
cat("Dimensions of test sets cut from T set are",dim(y_dft_test),"(response)","and",dim(x_dft_test),"(predictor)")
cat("Dimensions of training sets cut from P set are",dim(y_dfp_train),"(response)","and",dim(x_dfp_train),"(predictor)")
cat("Dimensions of test sets cut from P set are",dim(y_dfp_test),"(response)","and",dim(x_dfp_test),"(predictor)")
#function
naiveForecast <- function(y_train,y_test){
# predict the training-sample mean for every test observation
rep(mean(y_train[,1]),nrow(y_test))
}
#predict1 is for T set and 2 for P set
naivePredict1 <- naiveForecast(y_dft_train,y_dft_test)
naivePredict2 <- naiveForecast(y_dfp_train,y_dfp_test)
mse_naive1 <- mean((y_dft_test[,1]-naivePredict1)^2)
mse_naive2 <- mean((y_dfp_test[,1]-naivePredict2)^2)
#print
cat("The MSE from naive prediction for T set is",mse_naive1)
cat("The MSE from naive prediction for P set is",mse_naive2)
#function
ksinkForecast <- function(y_train,x_train,x_test) {
colnames(y_train) <- "median_house_value"
data <- cbind.data.frame(y_train,x_train)
reg <- lm(median_house_value~.,data)
yhat <- predict(reg,newdata=as.data.frame(x_test))
#output prediction and coefficients
out <- list()
out[['prediction']] <- yhat
out[['coefs']] <- coef(reg)[-1]
return(out)
}
#predict
ksinkPredict1 <- ksinkForecast(y_dft_train,x_dft_train,x_dft_test)
ksinkPredict2 <- ksinkForecast(y_dfp_train,x_dfp_train,x_dfp_test)
mse_ksink1 <- mean((y_dft_test[,1]-ksinkPredict1$prediction)^2)
mse_ksink2 <- mean((y_dfp_test[,1]-ksinkPredict2$prediction)^2)
#print
cat("The MSE from ksink prediction for T set is",mse_ksink1)
cat("The MSE from ksink prediction for P set is",mse_ksink2)
library(glmnet)
#Elastic Net
library(vctrs)
library(caret)
elNetForecast <-function(y_train,x_train,x_test){
set.seed(2022)
control <- trainControl(method = "boot", number = 10, p = .9)
fit.cv <- train(x_train,as.numeric(as.matrix(y_train)),metric = 'RMSE',method = "glmnet",
trControl = control,
tuneLength = 10)
yhat <- data.frame(predict(fit.cv,x_test))
#output prediction and coefficients' contribution
out <- list()
out[['prediction']] <- yhat
#store the plot of hyper param
out[['HP_plot']] <- fit.cv$results
#store the hyper param
out[['HP']] <- fit.cv$bestTune
out[['coefs']] <- as.matrix(coef(fit.cv$finalModel,fit.cv$bestTune$lambda))[-1,]
return(out)
}
#predict
elNetPredict1 <- elNetForecast(y_dft_train,x_dft_train,x_dft_test)
elNetPredict2 <- elNetForecast(y_dfp_train,x_dfp_train,x_dfp_test)
mse_elNet1 <- mean((y_dft_test[,1]-elNetPredict1$prediction[,1])^2)
mse_elNet2 <- mean((y_dfp_test[,1]-elNetPredict2$prediction[,1])^2)
#print
cat("The best hyperparameters for T set are:")
elNetPredict1$HP
cat("The best hyperparameters for P set are:")
elNetPredict2$HP
cat("The MSE from elNet prediction for P set is",mse_elNet2)
cat("The MSE from elNet prediction for T set is",mse_elNet1)
cat("The MSE from elNet prediction for P set is",mse_elNet2)
library(dplyr)
g1 <- ggplot(elNetPredict1$HP_plot,aes(x=log(lambda),y=RMSE,color=alpha))+
geom_point(size=1.5)+
geom_point(data=filter(elNetPredict1$HP_plot,RMSE==min(elNetPredict1$HP_plot$RMSE)),
aes(x=log(lambda),y=RMSE),
color='red',
size=3,
show.legend=TRUE)+
ggtitle("Hyperparameter Optimization for Transformed set")+
theme(plot.title = element_text(size=8)
)
g2 <- ggplot(elNetPredict2$HP_plot,aes(x=log(lambda),y=RMSE,color=alpha))+
geom_point(size=1.5)+
geom_point(data=filter(elNetPredict2$HP_plot,RMSE==min(elNetPredict2$HP_plot$RMSE)),
aes(x=log(lambda),y=RMSE),
color='red',
size=3,
show.legend=TRUE)+
ggtitle("Hyperparameter Optimization for PCA set")+
theme(plot.title = element_text(size=8)
)
grid.arrange(g1,g2,ncol=2)
# polynomial transformation function
pf <- function(x,p){
x_p <- data.frame(matrix(NA,nrow = nrow(x),ncol=ncol(x)*p))
i=1
for (i in (1:ncol(x))) {
n1 <- p*(i-1)+1
n2 <- p*i
x_p[,n1:n2]<- sapply(1:p,function(w)x[,i]^w)
colnames(x_p)[n1:n2] <- paste(colnames(x)[i],1:p)
}
return(x_p)
}
#reg function
polyForecast <- function(y_train,x_train,x_test,p){
colnames(y_train) <- "median_house_value"
x_poly <- pf(x_train,p)
x_test_poly <- pf(x_test,p)
data_lm <- cbind.data.frame(y_train,x_poly)
reg <- lm(median_house_value~.,data = data_lm)
yhat <- predict(reg,newdata=as.data.frame(x_test_poly))
out <- list()
out[['prediction']] <- yhat
out[['coefs']] <- coef(reg)[-1]
return(out)
}
# degree-p search function
polySearch <- function(y_train,x_train,y_test,x_test,M){
colnames(y_train) <- "median_house_value"
mse <- matrix(NA,nrow = M)
for (p in 1:M) {
yhat <- polyForecast(y_train,x_train,x_test,p)
#store mse
mse[p,] <- mean((y_test[,1]-yhat$prediction)^2)
}
#output
out <- list()
out[['MSE']] <- as.data.frame(mse)
out[['pbest']] <- which(mse==min(mse))
return(out)
}
pSearch1 <- polySearch(y_dft_train,x_dft_train,y_dft_test,x_dft_test,8)
pbest1 <- pSearch1$pbest
pSearch2 <- polySearch(y_dfp_train,x_dfp_train,y_dfp_test,x_dfp_test,8)
pbest2 <- pSearch2$pbest
#plot
hp1 <- cbind.data.frame(pSearch1$MSE,rep("Trans",8),c(1:8))
hp2 <- cbind.data.frame(pSearch2$MSE,rep("PCA",8),c(1:8))
colnames(hp1) <- c("MSE","GROUP","p")
colnames(hp2) <- c("MSE","GROUP","p")
p_plot <- rbind.data.frame(hp1,hp2)
ggplot(p_plot,aes(x=p,y=MSE,color=GROUP))+
geom_point()+
ggtitle("Hyperparameter Optimization for degree p")+
geom_point(aes(x=p_plot$p[which(p_plot$MSE[1:8]==min(p_plot$MSE[1:8]))],y=min(MSE)),size = 5, pch = 1,color="black")+
geom_point(aes(x=p_plot$p[which(p_plot$MSE[9:16]==min(p_plot$MSE[9:16]))],y=MSE[16]),size = 5, pch = 1,color="black")
#save the best p
#predict
polyPredict1 <- polyForecast(y_dft_train,x_dft_train,x_dft_test,pSearch1$pbest)
polyPredict2 <- polyForecast(y_dfp_train,x_dfp_train,x_dfp_test,pSearch2$pbest)
mse_poly1 <- mean((y_dft_test[,1]-polyPredict1$prediction)^2)
mse_poly2 <- mean((y_dfp_test[,1]-polyPredict2$prediction)^2)
#print
cat("The MSE from Poly-GLM for T set is",mse_poly1)
cat("The MSE from Poly-GLM for P set is",mse_poly2)
library(randomForest)
# This chunk is computationally expensive and slow to run.
#Forecast function
rfForecast <- function(y_train,x_train,y_test,x_test){
predictions <- data.frame(matrix(NA,nrow = nrow(y_test),ncol = 10))
coefs<- data.frame(matrix(NA,nrow = ncol(x_test),ncol = 10))
outrf <- list()
i=100
for (i in c(seq(100,1000,100))) {
rf_cv <- randomForest(x=x_train,y=as.numeric(as.matrix(y_train)),mtry=3,ntree = i)
predictions[,i/100] <-predict(rf_cv,x_test)
coefs[,i/100]<- as.numeric(rf_cv$importance)
rownames(coefs) <- rownames((rf_cv$importance))
colnames(coefs)[i/100] <- i
#output
outrf[['predictions']] <- predictions
outrf[['coefs']] <- coefs
}
return(outrf)
}
#Predict
rfPredict1 <- rfForecast(y_dft_train,x_dft_train,y_dft_test,x_dft_test)
rfPredict2 <- rfForecast(y_dfp_train,x_dfp_train,y_dfp_test,x_dfp_test)
#mse dataframe in order to ggplot
mse_rf1 <- data.frame(matrix(NA,ncol=10))
colnames(mse_rf1) <- colnames(rfPredict1$coefs)
for (i in 1:10) {
mse_rf1[,i] <- mean((y_dft_test[,1]-rfPredict1$predictions[,i])^2)
}
mse_rf2 <- data.frame(matrix(NA,ncol=10))
colnames(mse_rf2) <- colnames(rfPredict2$coefs)
for (i in 1:10) {
mse_rf2[,i] <- mean((y_dfp_test[,1]-rfPredict2$predictions[,i])^2)
}
mse_rf <- as.data.frame(cbind.data.frame(data.frame(seq(100,1000,100)),t(mse_rf1),t(mse_rf2)))
colnames(mse_rf) <- c("ntree","Trans","PCA")
#ggplot
g1 <- ggplot(mse_rf)+
geom_point(aes(x=ntree,y=Trans),size=4)+
scale_x_continuous(breaks = seq(100,1000,100))+
labs(title="MSE-ntree for Trans set",y="MSE")+
geom_point(aes(x=which(mse_rf$Trans==min(mse_rf$Trans))*100),y=min(mse_rf$Trans),size = 7, pch = 2,colour="red")
g2 <-ggplot(mse_rf)+
geom_point(aes(x=ntree,y=PCA),size=4)+
scale_x_continuous(breaks = seq(100,1000,100))+
labs(title="MSE-ntree for PCA set",y="MSE")+
geom_point(aes(x=which(mse_rf$PCA==min(mse_rf$PCA))*100),y=min(mse_rf$PCA),
size = 7,pch = 2,colour="red")
grid.arrange(g1,g2,ncol=2)
best_ntree1 <- which(mse_rf$Trans==min(mse_rf$Trans))*100
best_ntree2 <- which(mse_rf$PCA==min(mse_rf$PCA))*100
#save the mse
mse_rf1 <- mse_rf$Trans[which(mse_rf$Trans==min(mse_rf$Trans))]
mse_rf2 <- mse_rf$PCA[which(mse_rf$PCA==min(mse_rf$PCA))]
cat("The best ntree of RF for T set is",best_ntree1)
cat("The best ntree of RF for T set is",best_ntree2)
cat("The MSE from RF for T set is",mse_rf1)
cat("The MSE from RF for P set is",mse_rf2)
library(gbm)
gbmSearch <- function(y_train,x_train){
ctrl <- trainControl(method = "cv",number = 10)
colnames(y_train) <- "median_house_value"
dt <- cbind.data.frame(y_train,x_train)
gbmGrid <- expand.grid(interaction.depth = c(1,2,3),
n.trees = (1:15)*100,
shrinkage = (1:10)*0.01,
n.minobsinnode = 20)
set.seed(2022)
fit.cv <- train(median_house_value~.,data=dt,method="gbm",
trControl=ctrl,
metric="RMSE",
verbose=FALSE,
tuneGrid = gbmGrid)
return(fit.cv)
}
#Search/Forecast
gbmFit1 <- gbmSearch(y_dft_train,x_dft_train)
gbmFit2 <- gbmSearch(y_dfp_train,x_dfp_train)
#plot
plot(gbmFit1,main="Hyperparameters optimization for GBM (T set)",adj=0)
plot(gbmFit2,main="Hyperparameters optimization for GBM (P set)",adj=0)
#best parameter
bestpar1 <-gbmFit1$results[best(gbmFit1$results,metric ="RMSE",maximize=FALSE),1:6]
bestpar2 <-gbmFit2$results[best(gbmFit2$results,metric ="RMSE",maximize=FALSE),1:6]
cat("The best models for T and P set are:")
bestpar1
bestpar2
# save the variable importance
gbmImp1 <- varImp(gbmFit1,scale=FALSE)
gbmImp2 <- varImp(gbmFit2,scale=FALSE)
#predict.train automatically choose the final model from training result list
gbmPredict1 <- predict.train(gbmFit1,newdata = x_dft_test)
gbmPredict2 <- predict.train(gbmFit2,newdata = x_dfp_test)
mse_gbm1 <- mean((y_dft_test[,1]-gbmPredict1)^2)
mse_gbm2 <- mean((y_dfp_test[,1]-gbmPredict2)^2)
cat("The MSE from GBM for T set is",mse_gbm1)
cat("The MSE from GBM for P set is",mse_gbm2)
#First,collect all mse.
MSE_all <- cbind.data.frame(c(mse_naive1,mse_naive2,mse_ksink1,mse_ksink2,mse_elNet1,
mse_elNet2,mse_poly1,mse_poly2,mse_rf1,mse_rf2,mse_gbm1,mse_gbm2),
c("mse_naive","mse_naive","mse_ksink","mse_ksink","mse_elNet",
"mse_elNet","mse_poly","mse_poly","mse_rf","mse_rf","mse_gbm","mse_gbm"),
rep(c("T set","P set"),6))
colnames(MSE_all) <- c("MSE","Model","Set")
ggplot(MSE_all,aes(x=Model,y=MSE,color=Set))+
geom_point(size=3)+
geom_label(aes(label = round(MSE, 4)),
hjust = 1.2,
color = "BLACK",
size=3,
label.padding = unit(0.2, "lines")
)
g1 <- ggplot(df,aes(longitude,latitude))+
geom_polygon(data = map, aes(x=long, y = lat), fill="black", alpha=0.8)+
geom_point(size=1,aes(color=ocean_proximity))
g2 <- ggplot(df,aes(longitude,latitude))+
geom_polygon(data = map, aes(x=long, y = lat), fill="black", alpha=0.8)+
geom_point(size=1,aes(colour= median_house_value))+
scale_color_gradient(low = "light blue",high = "red")
grid.arrange(g1,g2,ncol=2)
library(ggrepel)
coefdt1 <- cbind.data.frame(polyPredict1$coefs,names(polyPredict1$coefs))
colnames(coefdt1) <- c("coef","variable")
coefdt2 <- cbind.data.frame(polyPredict2$coefs,names(polyPredict2$coefs))
colnames(coefdt2) <- c("coef","variable")
g1 <- ggplot(coefdt1,aes(x=variable,y=coef))+
geom_point()+
geom_label(data=filter(coefdt1,coef>0.1|coef<(-0.1)),
aes(label = c(round(coef, 4))),
vjust = 1,
color = "BLACK",
size=2,
nudge_x = 1,
fill="lightblue")+
geom_label(data=filter(coefdt1,coef>0.1|coef<(-0.1)),
aes(label = variable),
vjust = 0.52,
color = "BLACK",
size=2,
label.padding = unit(0.3, "lines"),
nudge_x = 3, nudge_y = 0.1,
fill="yellow")+
ggtitle("coefs for Trans set")+
theme(aspect.ratio = 1)
g2 <- ggplot(coefdt2,aes(x=variable,y=coef))+
geom_point()+
geom_label(data=filter(coefdt2,coef>2|coef<(-2)),
aes(label = c(round(coef, 4))),
vjust = 1,
color = "BLACK",
size=2,
nudge_x = -3,
fill=NA)+
geom_label(data=filter(coefdt2,coef>2|coef<(-2)),
aes(label = variable),
vjust = 0.52,
color = "BLACK",
size=2,
label.padding = unit(0.3, "lines"),
nudge_x = -3, nudge_y = 0.5,
fill="yellow", check_overlap = TRUE)+
ggtitle("coefs for PCA set")+
theme(aspect.ratio = 1)
grid.arrange(g1,g2,ncol=2)
#scaled coefs/importance.
Contri_all <- cbind.data.frame(scale(cbind.data.frame(ksinkPredict1$coefs,elNetPredict1$coefs,rfPredict1$coefs[,which(mse_rf$Trans==min(mse_rf$Trans))], gbmImp1$importance)),colnames(x_dft_test))
colnames(Contri_all) <- c("kitchensink","elNet","RF","GBM","Variable")
Contri_all$Variable[6:8] <- c("INLAND","NEAR.BAY","NEAR.OCEAN")
g1 <- ggplot(Contri_all)+
geom_point(size=3,aes(x=Variable,y=kitchensink,color="kitchensink"))+
geom_point(size=3,aes(x=Variable,y=elNet,color="elNet"))+
geom_point(size=3,aes(x=Variable,y=RF,color="RF"))+
geom_point(size=3,aes(x=Variable,y=GBM,color="GBM"))+
geom_hline(yintercept=0, linetype="dashed",
color = "red", size=0.5)+
theme(axis.text = element_text(size = 7)) +
labs(title = "Influence of variables(T set)", y = "Influence")
# scaled absolute value
Contri_all_abs <- cbind.data.frame(abs(Contri_all[,1:4]),Contri_all[,5])
colnames(Contri_all_abs) <- c("kitchensink","elNet","RF","GBM","Variable")
g2 <- ggplot(Contri_all_abs)+
geom_point(size=3,aes(x=Variable,y=kitchensink,color="kitchensink"))+
geom_point(size=3,aes(x=Variable,y=elNet,color="elNet"))+
geom_point(size=3,aes(x=Variable,y=RF,color="RF"))+
geom_point(size=3,aes(x=Variable,y=GBM,color="GBM"))+
theme(axis.text = element_text(size = 7)) +
geom_point(aes(x=Contri_all_abs$Variable[1],y=1.2),size=20,fill=NA, pch = 1,color="red") +
geom_point(aes(x=Contri_all_abs$Variable[5],y=2),size=15,fill=NA, pch = 1,color="red") +
geom_point(aes(x=Contri_all_abs$Variable[6],y=0.6),size=24,fill=NA, pch = 1,color="red") +
geom_point(aes(x=Contri_all_abs$Variable[7],y=0.9),size=7,fill=NA, pch = 1,color="red")+
labs(title = "Absolute influence of variables(T set)", y = "Influence")
grid.arrange(g1,g2,nrow=2)
#scaled coefs/importance.
Contri_all2 <- cbind.data.frame(scale(cbind.data.frame(ksinkPredict2$coefs,elNetPredict2$coefs,rfPredict2$coefs[,which(mse_rf$PCA==min(mse_rf$PCA))], gbmImp2$importance)),colnames(x_dfp_test))
colnames(Contri_all2) <- c("kitchensink","elNet","RF","GBM","Variable")
g1 <- ggplot(Contri_all2)+
geom_point(size=3,aes(x=Variable,y=kitchensink,color="kitchensink"))+
geom_point(size=3,aes(x=Variable,y=elNet,color="elNet"))+
geom_point(size=3,aes(x=Variable,y=RF,color="RF"))+
geom_point(size=3,aes(x=Variable,y=GBM,color="GBM"))+
geom_hline(yintercept=0, linetype="dashed",
color = "black", size=0.5)+
theme(axis.text = element_text(size = 7)) +
labs(title = "Influence of variables(P set)", y = "Influence")
# scaled absolute value
Contri_all_abs2 <- cbind.data.frame(abs(Contri_all2[,1:4]),Contri_all2[,5])
colnames(Contri_all_abs2) <- c("kitchensink","elNet","RF","GBM","Variable")
g2 <- ggplot(Contri_all_abs2)+
geom_point(size=3,aes(x=Variable,y=kitchensink,color="kitchensink"))+
geom_point(size=3,aes(x=Variable,y=elNet,color="elNet"))+
geom_point(size=3,aes(x=Variable,y=RF,color="RF"))+
geom_point(size=3,aes(x=Variable,y=GBM,color="GBM"))+
theme(axis.text = element_text(size = 7)) +
labs(title = "Absolute influence of variables(P set)", y = "Influence")
grid.arrange(g1,g2,nrow=2)