Introduction

(Knitting the whole project takes about 13 minutes 52 seconds; chunk 44 accounts for most of that time.)

This data set appeared in a 1997 paper titled “Sparse Spatial Autoregressions” by R. Kelley Pace and Ronald Barry, published in Statistics and Probability Letters. They built it from the 1990 California housing census data.

The data set contains 20,640 observations of 10 variables. The response variable, “median house value”, indicates the median house price of the corresponding block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data; in this data set, the average block-group population is 1425.5. The 9 other variables are considered relevant to house prices and serve here as predictors; 8 of them are continuous and one is a geographical categorical variable.

##    longitude         latitude     housing_median_age  total_rooms   
##  Min.   :-124.3   Min.   :32.54   Min.   : 1.00      Min.   :    2  
##  1st Qu.:-121.8   1st Qu.:33.93   1st Qu.:18.00      1st Qu.: 1448  
##  Median :-118.5   Median :34.26   Median :29.00      Median : 2127  
##  Mean   :-119.6   Mean   :35.63   Mean   :28.64      Mean   : 2636  
##  3rd Qu.:-118.0   3rd Qu.:37.71   3rd Qu.:37.00      3rd Qu.: 3148  
##  Max.   :-114.3   Max.   :41.95   Max.   :52.00      Max.   :39320  
##                                                                     
##  total_bedrooms     population      households     median_income    
##  Min.   :   1.0   Min.   :    3   Min.   :   1.0   Min.   : 0.4999  
##  1st Qu.: 296.0   1st Qu.:  787   1st Qu.: 280.0   1st Qu.: 2.5634  
##  Median : 435.0   Median : 1166   Median : 409.0   Median : 3.5348  
##  Mean   : 537.9   Mean   : 1425   Mean   : 499.5   Mean   : 3.8707  
##  3rd Qu.: 647.0   3rd Qu.: 1725   3rd Qu.: 605.0   3rd Qu.: 4.7432  
##  Max.   :6445.0   Max.   :35682   Max.   :6082.0   Max.   :15.0001  
##  NA's   :207                                                        
##  median_house_value ocean_proximity   
##  Min.   : 14999     Length:20640      
##  1st Qu.:119600     Class :character  
##  Median :179700     Mode  :character  
##  Mean   :206856                       
##  3rd Qu.:264725                       
##  Max.   :500001                       
## 
## 'data.frame':    20640 obs. of  10 variables:
##  $ longitude         : num  -122 -122 -122 -122 -122 ...
##  $ latitude          : num  37.9 37.9 37.9 37.9 37.9 ...
##  $ housing_median_age: num  41 21 52 52 52 52 52 52 42 52 ...
##  $ total_rooms       : num  880 7099 1467 1274 1627 ...
##  $ total_bedrooms    : num  129 1106 190 235 280 ...
##  $ population        : num  322 2401 496 558 565 ...
##  $ households        : num  126 1138 177 219 259 ...
##  $ median_income     : num  8.33 8.3 7.26 5.64 3.85 ...
##  $ median_house_value: num  452600 358500 352100 341300 342200 ...
##  $ ocean_proximity   : chr  "NEAR BAY" "NEAR BAY" "NEAR BAY" "NEAR BAY" ...

Detailed descriptions of all variables can be found at the link.

I will apply different methods discussed in the foundations of data science course to predict the median house value, treating all other variables in the sample as candidate predictors. Since the full sample is too large for our purposes, I randomly subset 30% of the observations, i.e. N = 6192.
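
A minimal, illustrative sketch of this subsampling step (object names follow the appendix code):

library(dplyr)

set.seed(2022)              # same seed as in the appendix, for reproducibility
N  <- 6192                  # 30% of the 20,640 original observations
df <- sample_n(data, N)     # 'data' holds the full housing data set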

##    longitude         latitude     housing_median_age  total_rooms   
##  Min.   :-124.3   Min.   :32.54   Min.   : 1.00      Min.   :    2  
##  1st Qu.:-121.8   1st Qu.:33.94   1st Qu.:18.00      1st Qu.: 1452  
##  Median :-118.5   Median :34.28   Median :28.00      Median : 2154  
##  Mean   :-119.6   Mean   :35.67   Mean   :28.41      Mean   : 2647  
##  3rd Qu.:-118.0   3rd Qu.:37.73   3rd Qu.:37.00      3rd Qu.: 3150  
##  Max.   :-114.5   Max.   :41.95   Max.   :52.00      Max.   :32054  
##                                                                     
##  total_bedrooms     population      households     median_income   
##  Min.   :   2.0   Min.   :    3   Min.   :   2.0   Min.   : 0.536  
##  1st Qu.: 298.0   1st Qu.:  788   1st Qu.: 280.0   1st Qu.: 2.553  
##  Median : 438.0   Median : 1175   Median : 412.0   Median : 3.537  
##  Mean   : 538.4   Mean   : 1422   Mean   : 499.9   Mean   : 3.893  
##  3rd Qu.: 648.0   3rd Qu.: 1725   3rd Qu.: 609.0   3rd Qu.: 4.805  
##  Max.   :5290.0   Max.   :15507   Max.   :5050.0   Max.   :15.000  
##  NA's   :63                                                        
##  median_house_value ocean_proximity   
##  Min.   : 14999     Length:6192       
##  1st Qu.:121400     Class :character  
##  Median :179650     Mode  :character  
##  Mean   :208370                       
##  3rd Qu.:268225                       
##  Max.   :500001                       
## 

Before looking into the detailed data structure, we first visualize the response “median_house_value” geographically and compare it with the original sample. The two patterns are alike, so we regard this subset as usable.

Generally, we observe that houses near the coast are more expensive, and some inland areas also command higher prices. This is consistent with common sense.

Data

In this chapter, we do exploratory data analysis.

Univariate visualizations

First, we draw histograms and box plots to examine the variation and distribution of all continuous variables.

Insights:

1. There are many outliers for some variables; we will address them in the next chapter.

2. The distributions of the last six variables are noticeably skewed, which is slightly concerning.

Now we check the categorical variable with a bar chart.

Obviously, we should drop the “ISLAND” group which has only 2 observations.

Multivariate visualizations

Similarly, we first draw a scatter-plot matrix of all continuous variables.

We observe no clear linear relationships between the predictors and the response; only median_income shows something resembling a linear trend. This suggests that non-linear models might perform better than linear ones.

Furthermore, we draw the correlation matrix to gain more insight.

Insights:

1. The house price is highly positively correlated with median income.

2. Longitude and latitude move in opposite directions. This is sensible because in California, the farther north a location lies, the farther west it tends to lie. We will create a new variable combining them in the feature-engineering chapter.

3. Population, households and total_bedrooms are highly positively correlated, which is understandable since they are all demographic variables. We will combine their information into a new variable in the next chapter.

Next, we examine the relationship between the categorical variable and the response. Two points stand out:

1. There is a dividing point around 180,000 between “INLAND” and “<1H OCEAN”: above it, “INLAND” houses are rare, and below it, “<1H OCEAN” houses are rare.

2. “NEAR BAY” shows a slight tendency toward higher prices, while the trend for “NEAR OCEAN” is vaguer.

Cleaning and preparation

First, we delete the 2 “ISLAND” observations. Then we deal with the missing values.

We have 63 missing values, all in “total_bedrooms”. We impute them with the variable’s median rather than its mean, because of the many outliers detected earlier.

## The number of missing values is 63
## The number of missing values in each column is 
##  0 0 0 0 63 0 0 0 0 0
## The number of missing values after operation is 0
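
A condensed, illustrative sketch of these two cleaning steps (object names roughly follow the appendix code):

# drop the two ISLAND rows, then impute total_bedrooms by its median
df_no_na <- df[df$ocean_proximity != "ISLAND", ]
med      <- median(df_no_na$total_bedrooms, na.rm = TRUE)
df_no_na$total_bedrooms[is.na(df_no_na$total_bedrooms)] <- med
sum(is.na(df_no_na))        # 0 after the imputation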

Feature engineering

Reduction

As shown before, some variables are heavily correlated. Since we do not have many variables to work with, we choose to create new variables that carry their information and drop the old ones (a sketch follows the list):

1. Add longitude and latitude together into a new variable “diagonal”, representing the location along a diagonal, and drop the originals.

2. Divide total_bedrooms by total_rooms into a new variable “congestion_bedroom”, representing how crowded the rooms are, and drop the originals.

3. Drop “households”, because its correlation with “population” is extremely high, meaning it brings essentially no additional information.
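
A condensed, illustrative sketch of the three transformations (helper names are illustrative; the full code is in the appendix):

# 1. diagonal location, 2. bedroom share per room, 3. drop the redundant columns
diagonal           <- df_no_na$longitude + df_no_na$latitude
congestion_bedroom <- df_no_na$total_bedrooms / df_no_na$total_rooms

dropped <- c("longitude", "latitude", "total_rooms", "total_bedrooms", "households")
df_new  <- cbind.data.frame(diagonal, congestion_bedroom,
                            df_no_na[, !(names(df_no_na) %in% dropped)])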

Scaling with normalization

After the transformation, the remaining variables roughly follow skewed, bell-shaped distributions. Their scales differ greatly, so we normalize them.

Outliers

I deliberately handle outliers after the transformation, in order to keep as much information as possible in the working sample.

Importantly, we do not fix the outliers of the response variable “median_house_value”, because they occur frequently and clearly carry information (assuming, of course, that there are no collection errors; in fact, in the original data these outliers all take the value 500001, because the researchers had already censored raw values that were much higher than 500001).

To handle outliers in the predictors, we use the IQR method, keeping in line with the original authors, who appear to have applied a similar rule (hence the values censored identically at 500001).
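
The IQR rule applied to each predictor, as in the appendix code:

# flag values lying more than 1.5 * IQR outside the interquartile range
detect_outlier <- function(x) {
  q1  <- quantile(x, probs = 0.25, na.rm = TRUE)
  q3  <- quantile(x, probs = 0.75, na.rm = TRUE)
  iqr <- q3 - q1
  x > q3 + 1.5 * iqr | x < q1 - 1.5 * iqr
}
# rows flagged in any scaled predictor column are dropped; the response and
# the categorical column are excluded from the check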

## The number of observations we dropped during outlier elimination is 969

We dropped 969 observations with outlying predictor values. We can now check the distributions again.

Compared with the earlier plots, the outliers have clearly been cut off.

Encoding the categorical variable

We encode the categorical variable “ocean_proximity”, which has 4 levels, so only 3 dummy variables are needed for a valid encoding.
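
The appendix uses the dummy package for this step; an equivalent, illustrative sketch with base R's model.matrix (the first level acts as the reference and gets no dummy):

# 3 dummy columns for the 4 remaining ocean_proximity levels
dummies  <- model.matrix(~ ocean_proximity, data = df_clean_tail)[, -1]
df_trans <- cbind.data.frame(df_clean_tail[, names(df_clean_tail) != "ocean_proximity"],
                             dummies)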

Construct an alternative set with PCA

Although we have already done some manual feature engineering, and given that we are working with a very limited number of variables, we also build an alternative predictor set by applying PCA to the original data set (with no variables deleted), so that no information from the original is cut off. I do not expect much from PCA, since it is designed for high-dimensional problems, but it serves as a comparison with the transformed set and as a test of how PCA behaves with few features.

PCA is sensitive to scale and to outliers, so we deal with both before carrying it out. First, a recap of the distributions.

## The number of observations we dropped during outlier elimination is 905
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5    PC6     PC7
## Standard deviation     1.3988 1.0570 0.8499 0.8002 0.25905 0.2335 0.12646
## Proportion of Variance 0.4271 0.2439 0.1577 0.1398 0.01465 0.0119 0.00349
## Cumulative Proportion  0.4271 0.6710 0.8287 0.9685 0.98310 0.9950 0.99849
##                            PC8
## Standard deviation     0.08315
## Proportion of Variance 0.00151
## Cumulative Proportion  1.00000

We use k = 5 principal components as predictors, which account for about 98% of the total variance.
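
A condensed sketch of the PCA step (at this point the 8 predictors are already standardized and outlier-trimmed, and the scaled response sits in the last column):

pca <- prcomp(df_nns_clean[, 1:8])    # PCA on the 8 standardized predictors
summary(pca)                          # cumulative proportion of variance

k      <- 5                           # keep the first five components
df_pca <- cbind.data.frame(data.frame(pca$x)[, 1:k],
                           median_house_value = df_nns_clean[, ncol(df_nns_clean)])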

Methodology and results

We use GLM, elastic net (ElNet), polynomial GLM (Poly-GLM), random forest (RF) and gradient boosting machine (GBM) models to predict the median house value, and compare every model on the transformed sample (T set) and the PCA sample (P set). First, we include the mean and kitchen-sink predictions as baselines.

Shrinkage methods are included because the set still contains considerable correlation. We use the elastic net and search for the best mixing hyperparameter alpha and penalty weight lambda.

To address possible non-linearity, we also apply a polynomial transformation to the columns and regress on the expanded set, with hyperparameter optimization of the degree p. Finally, to probe the potential non-linearity further (or rule it out), we introduce the random forest and the gradient boosting machine. Although these two models are designed to shine on data with many features, we still try them and compare them with the other models.

All models are evaluated by test MSE. The idea is to compare the predictive accuracy of the different models rather than to read much into the absolute values, since our original sample has few features and considerable correlation among them. We will also compare the influence that each feature has in the different models.

Sampling

We randomly assign 90% of the data to the training sets and 10% to the test sets.
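
A sketch of the 90/10 split for the T set (the P set is split with the same logic; see the appendix):

test_rate <- 0.1
set.seed(2022)
picked         <- sample(seq_len(nrow(df_trans)), size = nrow(df_trans) * test_rate)
df_trans_test  <- df_trans[picked, ]    # 10% held out for testing
df_trans_train <- df_trans[-picked, ]   # remaining 90% for training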

## Dimensions of training sets cut from T set are 4699 1 (response) and 4699 8 (predictor)
## Dimensions of test sets cut from T set are 522 1 (response) and 522 8 (predictor)
## Dimensions of training sets cut from P set are 4757 1 (response) and 4757 5 (predictor)
## Dimensions of test sets cut from P set are 528 1 (response) and 528 5 (predictor)

Naive mean

First, we use the mean forecast as a baseline. When we analyse the results, every model should perform better than this naive forecast; if one does not, we should suspect a problem with the modelling or some peculiarity of the data we have overlooked.
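
The naive forecast simply repeats the training mean, and every model in this report is scored with the same test MSE; an illustrative sketch for the T set:

# predict the training-set mean for every test observation
naive_pred <- rep(mean(y_dft_train[, 1]), nrow(y_dft_test))

# mean squared error on the held-out test set
mse_naive  <- mean((y_dft_test[, 1] - naive_pred)^2)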

## The MSE from naive prediction for T set is 0.838671
## The MSE from naive prediction for P set is 0.8880077

Kitchen sink

Our kitchen-sink model, an ordinary linear regression on all predictors, is the simplest model after the mean prediction and serves a similar purpose. We should be suspicious if any other model performs worse than these two baselines.

## The MSE from ksink prediction for T set is 0.3302361
## The MSE from ksink prediction for P set is 0.4281945

Elastic Net

Using caret::train with a tuning grid of length 10, we estimate the best hyperparameters alpha and lambda.
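
A condensed, illustrative sketch of the tuning for the T set (caret resamples every alpha/lambda candidate and keeps the pair with the lowest RMSE):

library(caret)
library(glmnet)

set.seed(2022)
ctrl   <- trainControl(method = "boot", number = 10, p = 0.9)
fit.cv <- train(x_dft_train, y_dft_train[, 1],
                method     = "glmnet",    # elastic net
                metric     = "RMSE",
                trControl  = ctrl,
                tuneLength = 10)          # tuning grid of length 10
fit.cv$bestTune                           # chosen alpha and lambda
elNet_pred <- predict(fit.cv, x_dft_test)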

## The best hyperparameters for T set are:
##   alpha     lambda
## 5   0.1 0.01895798
## The best hyperparameters for P set are:
##   alpha      lambda
## 2   0.1 0.003628609
## The MSE from elNet prediction for T set is 0.3316367
## The MSE from elNet prediction for P set is 0.4284595

We can draw a plot to look into the process of selecting the best hyperparameters:

The lambda and alpha chosen for both sets are close to zero, which indicates very little shrinkage: the method has only a weak effect here.

Polynomial

Our exploratory analysis suggests the true model is unlikely to be linear, so a polynomial transformation is worth a try. We test degrees p from 1 to 8.

First, we define a polynomial transformation function that expands each variable into p columns containing its powers from 1 to p.

Then we create a search function to find the best degree: given a number M, it runs M predictions using transformed predictor sets of degree 1 to M, and returns all M MSEs together with the best p, i.e. the degree with the smallest MSE.
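
A minimal sketch of such an expansion function, equivalent to the appendix's pf() except that an underscore separates variable name and power:

# expand every column of x into its powers 1..p
pf_sketch <- function(x, p) {
  expanded <- lapply(seq_len(ncol(x)), function(i) {
    powers <- outer(x[, i], 1:p, "^")             # n x p matrix of powers 1..p
    colnames(powers) <- paste0(colnames(x)[i], "_", 1:p)
    powers
  })
  as.data.frame(do.call(cbind, expanded))
}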

The best degree p is 5 for the T set and 8 for the P set. We then use the tuned parameters to predict.

## The MSE from Poly-GLM for T set is 0.3109698
## The MSE from Poly-GLM for P set is 0.3785283

Random Forest

In this section we have two parameters to tune, “mtry” and “ntree”. We do not tune mtry but fix it at 3, roughly one third of the number of variables, because we have very few features (and partly because the RF algorithm is computationally expensive). It is worth exploring, however, how “ntree” should be set.

We design a function that runs 10 RFs, with ntree increasing from 100 to 1000 in steps of 100.

We then calculate the MSEs and plot them.
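
A condensed, illustrative sketch of this search for the T set (mtry fixed at 3, ntree from 100 to 1000):

library(randomForest)

ntrees <- seq(100, 1000, by = 100)
rf_mse <- sapply(ntrees, function(nt) {
  rf   <- randomForest(x = x_dft_train, y = y_dft_train[, 1],
                       mtry = 3, ntree = nt)
  yhat <- predict(rf, x_dft_test)
  mean((y_dft_test[, 1] - yhat)^2)      # test MSE for this ntree
})
ntrees[which.min(rf_mse)]               # best ntree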

## The best ntree of RF for T set is 800
## The best ntree of RF for P set is 600
## The MSE from RF for T set is 0.2762568
## The MSE from RF for P set is 0.3253614

Gradient Boosting Machine

Usually GBM performs better than RF thanks to its sequential learning, but it comes with a higher risk of landing in a local optimum and of over-fitting, so its hyperparameters need careful handling. We tune three of them: “interaction.depth”, “n.trees” and “shrinkage”.
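
A condensed, illustrative sketch of the grid search for the T set (grid values as in the appendix; 10-fold cross-validation selects the combination with the lowest RMSE):

library(caret)
library(gbm)

gbmGrid <- expand.grid(interaction.depth = c(1, 2, 3),
                       n.trees           = (1:15) * 100,
                       shrinkage         = (1:10) * 0.01,
                       n.minobsinnode    = 20)

train_dt <- cbind.data.frame(median_house_value = y_dft_train[, 1], x_dft_train)

set.seed(2022)
gbmFit <- train(median_house_value ~ ., data = train_dt,
                method    = "gbm",
                metric    = "RMSE",
                trControl = trainControl(method = "cv", number = 10),
                tuneGrid  = gbmGrid,
                verbose   = FALSE)
gbmFit$bestTune                         # chosen shrinkage, depth and n.trees
gbm_pred <- predict(gbmFit, newdata = x_dft_test)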

The plots support both the over-fitting/local-optimum risk and the importance of tuning: as the hyperparameters vary, the accuracy changes greatly.

## The best models for T and P set are:
##     shrinkage interaction.depth n.minobsinnode n.trees      RMSE  Rsquared
## 176      0.04                 3             20    1100 0.5608684 0.6521633
##     shrinkage interaction.depth n.minobsinnode n.trees      RMSE  Rsquared
## 264      0.06                 3             20     900 0.5852266 0.6082758

Then we use the tuned GBM model to predict:

## The MSE from GBM for T set is 0.2734464
## The MSE from GBM for P set is 0.3334095

With this, all modelling and prediction is complete. We now look more closely at the results and compare the samples and models.

Analysis

MSE

First of all, all models are clearly effective, as they perform much better than the naive mean prediction.

Secondly, the PCA transformation actually drags the accuracy down for every model we covered. We attribute this to the limited number of features: in this setting, PCA mostly removes variation and information from the model instead of reducing redundancy, as it would in a high-dimensional case. The relative ranking of the models is, however, the same on the PCA and transformed sets.

Model-wise, the ensemble models always give the best predictions in our case, although the ranking between GBM and RF is not clear-cut: GBM is better on the T set and RF on the P set. Notably, the improvement over the kitchen sink is substantial (around 0.06 in MSE). This is somewhat surprising because, as stated at the beginning, ensemble algorithms generally excel in high-dimensional settings and are expected to be weaker when features are few. We think it may be related to the clustered spatial distribution of house prices, which suits tree-based algorithms; a second glance at the map supports this. It could also reflect non-linearity of the true model.

Poly-GLM also works well in our setting and comes second best. We can look at the contributions of the variables created by the polynomial transformation.

From the plots (in the labels, the last number is the power), we see that some non-linear (high-power) terms carry considerable influence, especially for the PCA set. This explains why Poly-GLM performs somewhat better than the kitchen sink. For the T set, we also see that median_income, diagonal and congestion_bedroom are the most influential variables, with the higher powers of the latter two carrying weight as well.

Finally, we also check whether the elastic-net model performs better. Sadly, the opposite is true: its accuracy is even slightly worse than the simple kitchen sink. The fitted model tells the same story: lambda is very close to zero, meaning almost no shrinkage is applied and what is actually being fitted is essentially the kitchen sink.

Contribution of variables

T set

We evaluate the contribution of the variables through their coefficients or importance scores in each model, scaled so that they are comparable.

The top three influences come from “median_income”, “diagonal” and “INLAND” for every model except RF, where “NEAR.BAY” takes the place of “INLAND”; both, however, encode geographical information. These results differ slightly from the Poly-GLM model, which gives “INLAND” little weight but values “congestion_bedroom” highly.

P set

Although the variables in the P set have no direct interpretation, we still use them to examine the robustness of the predictions.

PC4 is the most influential component for all models. The linear models rank PC5 second, while the ensemble models rank PC3 second (GBM nominally ranks PC2 second, but its gap between PC2 and PC3 is minute).

Conclusions

We predicted the response with two baseline models (naive mean and kitchen sink), one shrinkage linear model (elastic net), one non-linear model (Poly-GLM) and two ensemble models, also non-linear (RF and GBM), on two constructed data sets. The results show that even though the number of features is very limited, with some correlation remaining, the ensemble models still outperform the others. Importantly, we visualized the hyperparameter-selection process for GBM, which underlined how much tuning matters for this model. The PCA approach, instead of helping, dragged the accuracy down for every model covered; we think PCA pays off when there are many redundant features, and in our case it failed to remove redundancy without also discarding useful information.

Outlook

We think there are more to improve in our project:

1. We could address the skewness of the data at the EDA stage, which is not covered in the present work.

2. The Poly-GLM results showed that high powers of some crucial variables also carry information; including these high-degree terms when fitting the other models might improve accuracy.

3. The RF model is surprisingly time-consuming; on most occasions it is advisable to run GBM first.

Thanks for reading and have a good day!

Appendix: All code for this report

knitr::opts_chunk$set(echo = FALSE, warning=FALSE,message=FALSE,fig.align = "center")
                      
rm(list=ls())
set.seed(2022)
data <- read.csv("https://raw.githubusercontent.com/ageron/handson-ml/master/datasets/housing/housing.csv")
summary(data)
str(data)


library(dplyr)
N=6192
set.seed(2022)
df <-  sample_n(data,N)
summary(df)



library(maps)
library(ggplot2)
library(gridExtra)

map <- map_data("state",region = "california")

g1 <- ggplot(df,aes(longitude,latitude))+
  geom_polygon(data = map, aes(x=long, y = lat), fill="black", alpha=0.8)+
  geom_point(size=1,aes(colour= median_house_value))+
  scale_color_gradient(low = "light blue",high = "red")

g2 <- ggplot(data,aes(longitude,latitude))+
  geom_polygon(data = map, aes(x=long, y = lat), fill="black", alpha=0.8)+
  geom_point(size=1,aes(colour= median_house_value))+
  scale_color_gradient(low = "light blue",high = "red")

grid.arrange(g1,g2,ncol=2)





#we define a function to arrange the histogram and box plot of each variable side by side.
his_box <- function(df,x){
  g1 <- ggplot(df,mapping = aes(df[,x]))+
    geom_histogram()+
    xlab(names(df)[x])

  g2 <- ggplot(df,
               mapping = aes(df[,x]))+
    geom_boxplot(outlier.colour="red", outlier.shape=16,outlier.size=2)+
    xlab(names(df)[x])
  
  return(grid.arrange(g1,g2,ncol=2))
}

his <- lapply(c(1:9), his_box,df=df)




grid.arrange(his[[1]],his[[2]],his[[3]],nrow=3)
grid.arrange(his[[4]],his[[5]],his[[6]],nrow=3)

grid.arrange(his[[7]],his[[8]],his[[9]],nrow=3)
 ggplot(df,aes(x=ocean_proximity),na.rm = TRUE)+
    geom_bar()+
    xlab(names(df)[ncol(df)])
library(GGally)
  ggpairs(df[,1:9],
          lower = list(continuous=wrap("smooth",alpha=1/10)),
          upper = list(continuous="cor")
  )

  

library(ggcorrplot)
ggcorrplot(cor(df[,1:9],use="complete.obs"),lab=TRUE,lab_size = 2,digits = 3)+
  theme(axis.text.y = element_text(size = 8)) +
  theme(axis.text.x = element_text(size = 8))
g1 <- ggplot(df,aes(median_house_value,fill=ocean_proximity))+
  geom_histogram(position = 'fill')+
  scale_fill_manual(values=c("darkgreen", "orange","black","darkblue","darkred")
  )

g1 <- g1+geom_vline(xintercept=180000,linetype="dotted",size=1,color="white")

g2 <- ggplot(df,aes(ocean_proximity,fill=ocean_proximity))+
  geom_bar()+
  scale_fill_manual(values=c("darkgreen", "orange","black","darkblue","darkred")
  )

grid.arrange(g1,g2,nrow=2)
df_no_island <- df[-which(df[,ncol(df)]=="ISLAND"),]
cat("The number of missing values is",sum(is.na(df_no_island)))

na <- matrix(NA,ncol = ncol(df_no_island))
for (i in 1:ncol(df_no_island)) {
  na[,i] <- sum(is.na(df_no_island[,i]))
}
cat("The number of missing values of each colunms is","\n",na)

df_no_na <- df_no_island

df_no_na$total_bedrooms[is.na(df_no_na$total_bedrooms)] <- median(df_no_na$total_bedrooms,na.rm = TRUE)

cat("The number of missing values after operation is",sum(is.na(df_no_na)))
diagonal <- df_no_na$longitude+df_no_na$latitude
congestion_bedroom <- df_no_na$total_bedrooms/df_no_na$total_rooms

df_new <- cbind.data.frame(diagonal,congestion_bedroom,df_no_na[c(-1,-2,-4,-5,-7)])
library(ggcorrplot)
ggcorrplot(cor(df_new[,1:6],use="complete.obs"),lab=TRUE,lab_size = 2,digits = 3)+
  theme(axis.text.y = element_text(size = 8)) +
  theme(axis.text.x = element_text(size = 8))
his_trans <- lapply(c(1:6), his_box,df=df_new)
grid.arrange(his_trans[[1]],his_trans[[2]],his_trans[[3]],
             his_trans[[4]],his_trans[[5]],his_trans[[6]],
             nrow=3,ncol=2)

df_scaled <- cbind.data.frame(scale(df_new[,1:(ncol(df_new)-1)]),df_new[,ncol(df_new)])
colnames(df_scaled) <- colnames(df_new)


#create an outlier-detection function
detect_outlier <- function(x){
  Q1 <- quantile(x, probs=.25,na.rm=TRUE)
  Q3 <- quantile(x, probs=.75,na.rm=TRUE)
  IQR = Q3-Q1
  # return logical type
  x > Q3 + (IQR*1.5) | x < Q1 - (IQR*1.5)
}


df_clean_tail <- df_scaled


#here by"-2", we exclude the response "median_house_value" and the categorical variable.
for (i in 1:(ncol(df_clean_tail)-2)){
  df_clean_tail <- rbind.data.frame(df_clean_tail[which(!detect_outlier(df_clean_tail[,i])),],df_clean_tail[which(is.na(df_clean_tail[,i])),]) 
  #which() drops the rows where detect_outlier() returns NA, so we rbind those rows back
}

cat("The number of observations we dropped during outlier elimination is",nrow(df_scaled)-nrow(df_clean_tail))
his_clean <- lapply(c(1:6), his_box,df=df_clean_tail)
grid.arrange(his_clean[[1]],his_clean[[2]],his_clean[[3]],
             his_clean[[4]],his_clean[[5]],his_clean[[6]],
             nrow=3,ncol=2)
ggpairs(df_clean_tail[,1:6],
          lower = list(continuous=wrap("smooth",alpha=1/10)),
          upper = list(continuous="cor")
  )



library(dummy)

dummy <- data.frame(df_clean_tail$ocean_proximity)
df_trans <- cbind.data.frame(df_clean_tail[,1:(ncol(df_clean_tail)-1)],dummy(x=dummy,int = TRUE)[,-1])  
#"ocean_proximity" has 4 levels and we only need 3 of them to make up dummy.So it is dummy()[,-1]


#scale before pca
df_no_na_scaled <- data.frame(scale(df_no_na[,1:(ncol(df_no_na)-1)]))

#distribution
his_scaled <- lapply(c(1:9), his_box,df=df_no_na_scaled)

grid.arrange(his_scaled [[1]],his_scaled [[2]],his_scaled [[3]],
             his_scaled [[4]],his_scaled [[5]],his_scaled [[6]],
             his_scaled [[7]],his_scaled [[8]],his_scaled [[9]],
             nrow=3,ncol=3)
#outliers
df_nns_clean <- df_no_na_scaled


#we exclude the response "median_house_value" from outlier deletion.
for (i in 1:(ncol(df_nns_clean)-1)){
 df_nns_clean <- df_nns_clean[which(!detect_outlier(df_nns_clean[,i])),] 
  #no NA rows remain at this point, so nothing needs to be added back
}

cat("The number of observations we dropped during outlier elimination is",nrow(df_no_na_scaled)-nrow(df_nns_clean))


#pca
df_pca_x <- data.frame(df_nns_clean[,1:(ncol(df_nns_clean)-1)]) 

pca <- prcomp(df_pca_x)
summary(pca)
#pca data
k <- 5
df_pca <- cbind.data.frame(data.frame(pca$x)[,1:k],df_nns_clean[,ncol(df_nns_clean)]) 
colnames(df_pca)[ncol(df_pca)] <- "median_house_value"

his_pca <- lapply(c(1:6), his_box,df=df_pca)

grid.arrange(his_pca[[1]],his_pca[[2]],his_pca[[3]],
             his_pca[[4]],his_pca[[5]],his_pca[[6]],
             nrow=3,ncol=2)
test_rate <- 0.1

set.seed(2022)
picked <- sample(seq_len(nrow(df_trans)),size = nrow(df_trans)*test_rate)
df_trans_test <- df_trans[picked,]
df_trans_train <- df_trans[-picked,]

set.seed(2022)
picked <- sample(seq_len(nrow(df_pca)),size = nrow(df_pca)*test_rate)
df_pca_test <- df_pca[picked,]
df_pca_train <- df_pca[-picked,]

x_dft_test <- data.frame(df_trans_test[,which(names(df_trans_test)!="median_house_value")]) 
y_dft_test <-data.frame(df_trans_test[,which(names(df_trans_test)=="median_house_value")])

x_dft_train <- data.frame(df_trans_train[,which(names(df_trans_train)!="median_house_value")]) 
y_dft_train <-data.frame(df_trans_train[,which(names(df_trans_train)=="median_house_value")])

x_dfp_test <- data.frame(df_pca_test[,-ncol(df_pca_test)]) 
y_dfp_test <-data.frame(df_pca_test[,ncol(df_pca_test)])

x_dfp_train <- data.frame(df_pca_train[,-ncol(df_pca_train)]) 
y_dfp_train <-data.frame(df_pca_train[,ncol(df_pca_train)])

y_list <- list(y_dft_test,y_dft_train,y_dfp_test,y_dfp_train)
y_list <- lapply(y_list, function(x){colnames(x) <- "median_house_value"; x})

cat("Dimensions of training sets cut from T set are",dim(y_dft_train),"(response)","and",dim(x_dft_train),"(predictor)")

cat("Dimensions of test sets cut from T set are",dim(y_dft_test),"(response)","and",dim(x_dft_test),"(predictor)")

cat("Dimensions of training sets cut from P set are",dim(y_dfp_train),"(response)","and",dim(x_dfp_train),"(predictor)")

cat("Dimensions of test sets cut from P set are",dim(y_dfp_test),"(response)","and",dim(x_dfp_test),"(predictor)")

#function
naiveForecast <- function(y_train,y_test){
  rep(mean(y_train[,1]),nrow(y_test))
}

#predict1 is for T set and 2 for P set
naivePredict1 <- naiveForecast(y_dft_train,y_dft_test)
naivePredict2 <- naiveForecast(y_dfp_train,y_dfp_test)
mse_naive1 <- mean((y_dft_test[,1]-naivePredict1)^2)
mse_naive2 <- mean((y_dfp_test[,1]-naivePredict2)^2)

#print
cat("The MSE from naive prediction for T set is",mse_naive1)
cat("The MSE from naive prediction for P set is",mse_naive2)


#function
ksinkForecast <- function(y_train,x_train,x_test) {
  colnames(y_train) <- "median_house_value"
  data <- cbind.data.frame(y_train,x_train)
  reg <- lm(median_house_value~.,data)
  yhat <- predict(reg,newdata=as.data.frame(x_test))
  
  #output prediction and coefficients
  out <- list()
  out[['prediction']] <- yhat
  out[['coefs']] <- coef(reg)[-1]
  return(out)
}

#predict
ksinkPredict1 <- ksinkForecast(y_dft_train,x_dft_train,x_dft_test)
ksinkPredict2 <- ksinkForecast(y_dfp_train,x_dfp_train,x_dfp_test)
mse_ksink1 <- mean((y_dft_test[,1]-ksinkPredict1$prediction)^2)
mse_ksink2 <- mean((y_dfp_test[,1]-ksinkPredict2$prediction)^2)

#print
cat("The MSE from ksink prediction for T set is",mse_ksink1)
cat("The MSE from ksink prediction for P set is",mse_ksink2)

library(glmnet)

#Elastic Net
library(vctrs)
library(caret)

elNetForecast <-function(y_train,x_train,x_test){
  set.seed(2022)
  control <- trainControl(method = "boot", number = 10, p = .9)
  fit.cv <- train(x_train,as.numeric(as.matrix(y_train)),metric = 'RMSE',method = "glmnet",
                trControl = control,
                tuneLength = 10)
  yhat <- data.frame(predict(fit.cv,x_test))
  
  #output prediction and coefficients' contribution
  out <- list()
  out[['prediction']] <- yhat
  #store the plot of hyper param
  out[['HP_plot']] <- fit.cv$results
  #store the hyper param
  out[['HP']] <- fit.cv$bestTune
  out[['coefs']] <- as.matrix(coef(fit.cv$finalModel,fit.cv$bestTune$lambda))[-1,]
  return(out)
}

#predict
elNetPredict1 <- elNetForecast(y_dft_train,x_dft_train,x_dft_test)
elNetPredict2 <- elNetForecast(y_dfp_train,x_dfp_train,x_dfp_test)
mse_elNet1 <- mean((y_dft_test[,1]-elNetPredict1$prediction[,1])^2)
mse_elNet2 <- mean((y_dfp_test[,1]-elNetPredict2$prediction[,1])^2)

#print

cat("The best hyperparameters for T set are:")
elNetPredict1$HP

cat("The best hyperparameters for P set are:")
elNetPredict2$HP

cat("The MSE from elNet prediction for P set is",mse_elNet2)

cat("The MSE from elNet prediction for T set is",mse_elNet1)
cat("The MSE from elNet prediction for P set is",mse_elNet2)


library(dplyr)
g1 <- ggplot(elNetPredict1$HP_plot,aes(x=log(lambda),y=RMSE,color=alpha))+
  geom_point(size=1.5)+
  geom_point(data=filter(elNetPredict1$HP_plot,RMSE==min(elNetPredict1$HP_plot$RMSE)), 
             aes(x=log(lambda),y=RMSE), 
             color='red',
             size=3,
             show.legend=TRUE)+
  ggtitle("Hyperparameter Optimization for Transformed set")+
  theme(plot.title = element_text(size=8)
        )

g2 <- ggplot(elNetPredict2$HP_plot,aes(x=log(lambda),y=RMSE,color=alpha))+
  geom_point(size=1.5)+
  geom_point(data=filter(elNetPredict2$HP_plot,RMSE==min(elNetPredict2$HP_plot$RMSE)), 
             aes(x=log(lambda),y=RMSE), 
             color='red',
             size=3,
             show.legend=TRUE)+
  ggtitle("Hyperparameter Optimization for PCA set")+
  theme(plot.title = element_text(size=8)
        )
grid.arrange(g1,g2,ncol=2)

  #polynomial transformation function
pf <- function(x,p){
    x_p <- data.frame(matrix(NA,nrow = nrow(x),ncol=ncol(x)*p))
    i=1
    for (i in (1:ncol(x))) {
      n1 <- p*(i-1)+1
      n2 <- p*i
      x_p[,n1:n2]<- sapply(1:p,function(w)x[,i]^w)
      colnames(x_p)[n1:n2] <- paste(colnames(x)[i],1:p)
    }
    return(x_p)
    }

#reg function
polyForecast <- function(y_train,x_train,x_test,p){  
    colnames(y_train) <- "median_house_value"
    x_poly <- pf(x_train,p)
    x_test_poly <- pf(x_test,p)
    data_lm <- cbind.data.frame(y_train,x_poly) 
    reg <- lm(median_house_value~.,data = data_lm)
    yhat <- predict(reg,newdata=as.data.frame(x_test_poly))
    out <- list()
    out[['prediction']] <- yhat
    out[['coefs']] <- coef(reg)[-1]
    return(out)
  }
# p search function

polySearch <- function(y_train,x_train,y_test,x_test,M){
    colnames(y_train) <- "median_house_value"
    mse <- matrix(NA,nrow = M)
    for (p in 1:M) {
      yhat <- polyForecast(y_train,x_train,x_test,p)
  #store mse
      mse[p,] <- mean((y_test[,1]-yhat$prediction)^2)
      }
  #output
    out <- list()
    out[['MSE']] <- as.data.frame(mse)
    out[['pbest']] <- which(mse==min(mse))
    return(out)
    }

pSearch1 <- polySearch(y_dft_train,x_dft_train,y_dft_test,x_dft_test,8)
pbest1 <- pSearch1$pbest
pSearch2 <- polySearch(y_dfp_train,x_dfp_train,y_dfp_test,x_dfp_test,8)
pbest2 <- pSearch2$pbest
#plot
hp1 <- cbind.data.frame(pSearch1$MSE,rep("Trans",8),c(1:8))
hp2 <- cbind.data.frame(pSearch2$MSE,rep("PCA",8),c(1:8))
colnames(hp1) <- c("MSE","GROUP","p")
colnames(hp2) <- c("MSE","GROUP","p")
p_plot <- rbind.data.frame(hp1,hp2)
ggplot(p_plot,aes(x=p,y=MSE,color=GROUP))+
  geom_point()+
  ggtitle("Hyperparameter Optimization for degree p")+
  geom_point(aes(x=which.min(p_plot$MSE[1:8]),y=min(p_plot$MSE[1:8])),size = 5, pch = 1,color="black")+
  geom_point(aes(x=which.min(p_plot$MSE[9:16]),y=min(p_plot$MSE[9:16])),size = 5, pch = 1,color="black")


#save the best p
#predict
polyPredict1 <- polyForecast(y_dft_train,x_dft_train,x_dft_test,pSearch1$pbest)
polyPredict2 <- polyForecast(y_dfp_train,x_dfp_train,x_dfp_test,pSearch2$pbest)
mse_poly1 <- mean((y_dft_test[,1]-polyPredict1$prediction)^2)
mse_poly2 <- mean((y_dfp_test[,1]-polyPredict2$prediction)^2)

#print

cat("The MSE from Poly-GLM for T set is",mse_poly1)
cat("The MSE from Poly-GLM for P set is",mse_poly2)

library(randomForest)

# This chunk is seriously time-consuming.

#Forecast function
rfForecast <- function(y_train,x_train,y_test,x_test){
  predictions <- data.frame(matrix(NA,nrow = nrow(y_test),ncol = 10))
  coefs<- data.frame(matrix(NA,nrow = ncol(x_test),ncol = 10))
  outrf <- list()
  i=100
  for (i in c(seq(100,1000,100))) {
  rf_cv <- randomForest(x=x_train,y=as.numeric(as.matrix(y_train)),mtry=3,ntree = i)
  predictions[,i/100]  <-predict(rf_cv,x_test)
  coefs[,i/100]<- as.numeric(rf_cv$importance)
  rownames(coefs) <- rownames((rf_cv$importance))
  colnames(coefs)[i/100] <- i
  #output
  outrf[['predictions']] <- predictions
  outrf[['coefs']] <- coefs
  }
  return(outrf)
}

#Predict
rfPredict1 <- rfForecast(y_dft_train,x_dft_train,y_dft_test,x_dft_test)
rfPredict2 <- rfForecast(y_dfp_train,x_dfp_train,y_dfp_test,x_dfp_test)


#mse dataframe in order to ggplot
mse_rf1 <- data.frame(matrix(NA,ncol=10))
colnames(mse_rf1) <- colnames(rfPredict1$coefs)
for (i in 1:10) {
  mse_rf1[,i] <- mean((y_dft_test[,1]-rfPredict1$prediction[,i])^2)
}
mse_rf2 <- data.frame(matrix(NA,ncol=10))
colnames(mse_rf2) <- colnames(rfPredict2$coefs)
for (i in 1:10) {
  mse_rf2[,i] <- mean((y_dfp_test[,1]-rfPredict2$prediction[,i])^2)
}

mse_rf <- as.data.frame(cbind.data.frame(data.frame(seq(100,1000,100)),t(mse_rf1),t(mse_rf2)))
colnames(mse_rf) <- c("ntree","Trans","PCA")
#ggplot
g1 <- ggplot(mse_rf)+
  geom_point(aes(x=ntree,y=Trans),size=4)+
  scale_x_continuous(breaks = seq(100,1000,100))+
  labs(title="MSE-ntree for Trans set",y="MSE")+
  geom_point(aes(x=which(mse_rf$Trans==min(mse_rf$Trans))*100),y=min(mse_rf$Trans),size = 7, pch = 2,colour="red")

g2 <-ggplot(mse_rf)+
  geom_point(aes(x=ntree,y=PCA),size=4)+
  scale_x_continuous(breaks = seq(100,1000,100))+
  labs(title="MSE-ntree for PCA set",y="MSE")+
  geom_point(aes(x=which(mse_rf$PCA==min(mse_rf$PCA))*100),y=min(mse_rf$PCA),
                 size = 7,pch = 2,colour="red")

grid.arrange(g1,g2,ncol=2)
best_ntree1 <- which(mse_rf$Trans==min(mse_rf$Trans))*100
best_ntree2 <- which(mse_rf$PCA==min(mse_rf$PCA))*100

#save the mse
mse_rf1 <- mse_rf$Trans[which(mse_rf$Trans==min(mse_rf$Trans))]
mse_rf2 <- mse_rf$PCA[which(mse_rf$PCA==min(mse_rf$PCA))]

cat("The best ntree of RF for T set is",best_ntree1)
cat("The best ntree of RF for T set is",best_ntree2)

cat("The MSE from RF for T set is",mse_rf1)
cat("The MSE from RF for P set is",mse_rf2)

library(gbm)

gbmSearch <- function(y_train,x_train){
  ctrl <- trainControl(method = "cv",number = 10)
  colnames(y_train) <- "median_house_value"
  dt <- cbind.data.frame(y_train,x_train)
  gbmGrid <-  expand.grid(interaction.depth = c(1,2,3), 
                        n.trees = (1:15)*100, 
                        shrinkage = (1:10)*0.01,
                        n.minobsinnode = 20)
  set.seed(2022)
  fit.cv <- train(median_house_value~.,data=dt,method="gbm",
                trControl=ctrl,
                metric="RMSE",
                verbose=FALSE,
                tuneGrid = gbmGrid)
  return(fit.cv)
  }

#Search/Forecast
gbmFit1 <- gbmSearch(y_dft_train,x_dft_train)
gbmFit2 <- gbmSearch(y_dfp_train,x_dfp_train)
#plot

plot(gbmFit1,main="Hyperparameters optimization for GBM (T set)",adj=0)
plot(gbmFit2,main="Hyperparameters optimization for GBM (P set)",adj=0)
 

#best parameter
bestpar1 <-gbmFit1$results[best(gbmFit1$results,metric ="RMSE",maximize=FALSE),1:6]
bestpar2 <-gbmFit2$results[best(gbmFit2$results,metric ="RMSE",maximize=FALSE),1:6]

cat("The best models for T and P set are:")
bestpar1
bestpar2

#save the importance
gbmImp1 <- varImp(gbmFit1,scale=FALSE)
gbmImp2 <- varImp(gbmFit2,scale=FALSE)


#predict.train automatically uses the final model chosen during training
gbmPredict1 <- predict.train(gbmFit1,newdata = x_dft_test)
gbmPredict2 <- predict.train(gbmFit2,newdata = x_dfp_test)
mse_gbm1 <- mean((y_dft_test[,1]-gbmPredict1)^2)
mse_gbm2 <- mean((y_dfp_test[,1]-gbmPredict2)^2)

cat("The MSE from GBM for T set is",mse_gbm1)
cat("The MSE from GBM for P set is",mse_gbm2)

#First,collect all mse.
MSE_all <- cbind.data.frame(c(mse_naive1,mse_naive2,mse_ksink1,mse_ksink2,mse_elNet1,
             mse_elNet2,mse_poly1,mse_poly2,mse_rf1,mse_rf2,mse_gbm1,mse_gbm2),
             c("mse_naive","mse_naive","mse_ksink","mse_ksink","mse_elNet",
             "mse_elNet","mse_poly","mse_poly","mse_rf","mse_rf","mse_gbm","mse_gbm"),
             rep(c("T set","P set"),6)) 
colnames(MSE_all) <- c("MSE","Model","Set")



ggplot(MSE_all,aes(x=Model,y=MSE,color=Set))+
  geom_point(size=3)+
  geom_label(aes(label = round(MSE, 4)),
            hjust = 1.2,
            color = "BLACK",
            size=3,
            label.padding = unit(0.2, "lines")
            )

g1 <- ggplot(df,aes(longitude,latitude))+
  geom_polygon(data = map, aes(x=long, y = lat), fill="black", alpha=0.8)+
  geom_point(size=1,aes(color=ocean_proximity))

g2 <- ggplot(df,aes(longitude,latitude))+
  geom_polygon(data = map, aes(x=long, y = lat), fill="black", alpha=0.8)+
  geom_point(size=1,aes(colour= median_house_value))+
  scale_color_gradient(low = "light blue",high = "red")

grid.arrange(g1,g2,ncol=2)
library(ggrepel)
coefdt1 <- cbind.data.frame(polyPredict1$coefs,names(polyPredict1$coefs))
colnames(coefdt1) <- c("coef","variable")

coefdt2 <- cbind.data.frame(polyPredict2$coefs,names(polyPredict2$coefs))
colnames(coefdt2) <- c("coef","variable")

g1 <- ggplot(coefdt1,aes(x=variable,y=coef))+
  geom_point()+
  geom_label(data=filter(coefdt1,coef>0.1|coef<(-0.1)),
             aes(label = c(round(coef, 4))),
            vjust = 1,
            color = "BLACK",
            size=2,
            nudge_x = 1,
            fill="lightblue")+
    geom_label(data=filter(coefdt1,coef>0.1|coef<(-0.1)),
            aes(label = variable),
            vjust = 0.52,
            color = "BLACK",
            size=2,
            label.padding = unit(0.3, "lines"),
            nudge_x = 3, nudge_y = 0.1,
            fill="yellow")+
  ggtitle("coefs for Trans set")+
  theme(aspect.ratio = 1)

 g2 <- ggplot(coefdt2,aes(x=variable,y=coef))+
  geom_point()+
  geom_label(data=filter(coefdt2,coef>2|coef<(-2)),
             aes(label = c(round(coef, 4))),
            vjust = 1,
            color = "BLACK",
            size=2,
            nudge_x = -3,
            fill=NA)+
    geom_label(data=filter(coefdt2,coef>2|coef<(-2)),
            aes(label = variable),
            vjust = 0.52,
            color = "BLACK",
            size=2,
            label.padding = unit(0.3, "lines"),
            nudge_x = -3, nudge_y = 0.5,
            fill="yellow", check_overlap = TRUE)+
  ggtitle("coefs for PCA set")+
  theme(aspect.ratio = 1)
grid.arrange(g1,g2,ncol=2)


#scaled coefs/importance.
Contri_all <- cbind.data.frame(scale(cbind.data.frame(ksinkPredict1$coefs,elNetPredict1$coefs,rfPredict1$coefs[,which(mse_rf$Trans==min(mse_rf$Trans))], gbmImp1$importance)),colnames(x_dft_test))
colnames(Contri_all) <- c("kitchensink","elNet","RF","GBM","Variable") 
Contri_all$Variable[6:8] <- c("INLAND","NEAR.BAY","NEAR.OCEAN")
 
g1 <- ggplot(Contri_all)+
  geom_point(size=3,aes(x=Variable,y=kitchensink,color="kitchensink"))+
  geom_point(size=3,aes(x=Variable,y=elNet,color="elNet"))+
  geom_point(size=3,aes(x=Variable,y=RF,color="RF"))+
  geom_point(size=3,aes(x=Variable,y=GBM,color="GBM"))+
   geom_hline(yintercept=0, linetype="dashed", 
                color = "red", size=0.5)+
  theme(axis.text = element_text(size = 7)) +
  labs(title = "Influence of variables(T set)",  y = "Influence")

# scaled absolute value
Contri_all_abs <- cbind.data.frame(abs(Contri_all[,1:4]),Contri_all[,5])
colnames(Contri_all_abs) <- c("kitchensink","elNet","RF","GBM","Variable") 

g2 <- ggplot(Contri_all_abs)+
  geom_point(size=3,aes(x=Variable,y=kitchensink,color="kitchensink"))+
  geom_point(size=3,aes(x=Variable,y=elNet,color="elNet"))+
  geom_point(size=3,aes(x=Variable,y=RF,color="RF"))+
  geom_point(size=3,aes(x=Variable,y=GBM,color="GBM"))+
  theme(axis.text = element_text(size = 7))  +
  geom_point(aes(x=Contri_all_abs$Variable[1],y=1.2),size=20,fill=NA, pch = 1,color="red")  +
  geom_point(aes(x=Contri_all_abs$Variable[5],y=2),size=15,fill=NA, pch = 1,color="red")  +
  geom_point(aes(x=Contri_all_abs$Variable[6],y=0.6),size=24,fill=NA, pch = 1,color="red") +
  geom_point(aes(x=Contri_all_abs$Variable[7],y=0.9),size=7,fill=NA, pch = 1,color="red")+
  labs(title = "Absolute influence of variables(T set)",  y = "Influence")

grid.arrange(g1,g2,nrow=2)



#scaled coefs/importance.
Contri_all2 <- cbind.data.frame(scale(cbind.data.frame(ksinkPredict2$coefs,elNetPredict2$coefs,rfPredict2$coefs[,which(mse_rf$PCA==min(mse_rf$PCA))], gbmImp2$importance)),colnames(x_dfp_test))
colnames(Contri_all2) <- c("kitchensink","elNet","RF","GBM","Variable") 

 
g1 <- ggplot(Contri_all2)+
  geom_point(size=3,aes(x=Variable,y=kitchensink,color="kitchensink"))+
  geom_point(size=3,aes(x=Variable,y=elNet,color="elNet"))+
  geom_point(size=3,aes(x=Variable,y=RF,color="RF"))+
  geom_point(size=3,aes(x=Variable,y=GBM,color="GBM"))+
   geom_hline(yintercept=0, linetype="dashed", 
                color = "black", size=0.5)+
  theme(axis.text = element_text(size = 7)) +
  labs(title = "Influence of variables(P set)",  y = "Influence")

# scaled absolute value
Contri_all_abs2 <- cbind.data.frame(abs(Contri_all2[,1:4]),Contri_all2[,5])
colnames(Contri_all_abs2) <- c("kitchensink","elNet","RF","GBM","Variable") 

g2 <- ggplot(Contri_all_abs2)+
  geom_point(size=3,aes(x=Variable,y=kitchensink,color="kitchensink"))+
  geom_point(size=3,aes(x=Variable,y=elNet,color="elNet"))+
  geom_point(size=3,aes(x=Variable,y=RF,color="RF"))+
  geom_point(size=3,aes(x=Variable,y=GBM,color="GBM"))+
  theme(axis.text = element_text(size = 7))  +
  labs(title = "Absolute influence of variables(P set)",  y = "Influence")

grid.arrange(g1,g2,nrow=2)