Summary

When fitting a local polynomial regression, one question always arises: which is the ideal bandwidth to use? In this assignment a function is written to find which bandwidth minimizes the mean squared error. For this purpose, we use the Boston housing data provided in the Advanced Statistical Modeling course.

Description

The bandwidth or smoothing parameter, written as h, is essentially the window used to distribute the weights of each local regression when fitting a local polynomial regression. To achieve our objective, we build a function that receives three vectors as input: x and y, which are the data to be used, and a third vector containing the candidate bandwidths to test. The function performs leave-one-out cross-validation (LOOCV), 5-fold cross-validation, 10-fold cross-validation and generalized cross-validation (GCV) to estimate the error for each given bandwidth.
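
In particular, the generalized cross-validation score for a bandwidth \(h\) is computed from the residual sum of squares and the trace of the smoothing matrix \(S_h\), as implemented in the PMSE_GCV function of the Code section:

\[
\mathrm{GCV}(h) = \frac{n \sum_{i=1}^{n} \bigl(y_i - \hat{m}_h(x_i)\bigr)^2}{\bigl(n - \operatorname{tr}(S_h)\bigr)^2}
\]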

In addition, we use the functions h.select (from the sm package) and dpill (from the KernSmooth package), which compute an optimal bandwidth for a given dataset, so that we can compare their choices with our results.
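
For reference, these are the same calls used in the Code section below, with x and y being the LSTAT and ROOM variables:

hsel_fit <- h.select(x, y, method = "cv")   # cross-validation bandwidth, sm package
dpill_fit <- dpill(x, y)                    # direct plug-in bandwidth, KernSmooth package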

Method

As the coded function is computationally expensive due to LOOCV (leave-one-out CV), we defined a fixed vector of bandwidth values to compare: 0.5, 0.68, 0.93, 1.26, 1.72, 2.35, 3.2, 4.35, 5.93, 8.08, 11.01, 15. With these values we computed the different cross-validation methods.
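
The grid is equally spaced on the logarithmic scale and is generated with the same line used in the Code section:

h.v <- exp(seq(from = log(0.5), to = log(15), length.out = 12))
round(h.v, 2)
# 0.50  0.68  0.93  1.26  1.72  2.35  3.20  4.35  5.93  8.08 11.01 15.00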

The following tables contain the mean squared errors for the different bandwidths under LOOCV, 5-fold CV, 10-fold CV and GCV. We show only part of each data frame in the report, but the code is reproducible so the full tables can be inspected.

Head of the results, Polynomial Degree = 1

            PMSE LOOCV   PMSE 5-Fold CV   PMSE 10-Fold CV   PMSE GCV
H = 0.5           0.28             0.31              0.28       0.27
H = 0.68          0.27             0.28              0.27       0.27
H = 0.93          0.26             0.28              0.26       0.27
H = 1.26          0.26             0.29              0.26       0.26
H = 1.72          0.26             0.27              0.26       0.26

Tail of the results, Polynomial Degree = 2

            PMSE LOOCV   PMSE 5-Fold CV   PMSE 10-Fold CV   PMSE GCV
H = 3.2           0.26             0.26              0.26       0.26
H = 4.35          0.26             0.26              0.26       0.26
H = 5.93          0.26             0.26              0.26       0.26
H = 8.08          0.26             0.27              0.26       0.26
H = 11.01         0.27             0.27              0.27       0.27
H = 15            0.28             0.28              0.28       0.28

We can see that for polynomial degree 1 all methods share the same optimal result, with a bandwidth of \(2.35\). For polynomial degree 2 the optimal bandwidth is shared by most of the methods, but this could even be attributed to randomness, since the values for adjacent bandwidths are so close to each other.

To see the behaviour and compare the methods more clearly, we can plot our results.

Finally, we compute the bandwidths with the sm and KernSmooth packages and compare them with the best bandwidth from each cross-validation method.

             PMSE LOOCV   PMSE 5-Fold CV   PMSE 10-Fold CV   PMSE GCV   hsel       dpill
Poly Deg 1   H = 2.35     H = 2.35         H = 2.35          H = 2.35   H = 1.99   H = 1.5
Poly Deg 2   H = 4.35     H = 4.35         H = 4.35          H = 4.35   H = 1.99   H = 1.5

Results

After comparing the mean squared errors for the different cross-validation methods and polynomial degrees, we can see that LOOCV, 5-fold CV, 10-fold CV and generalized CV agree on essentially the same smoothing parameter as the optimal one for each polynomial degree, while the h.select and dpill functions find a lower optimal parameter. It is hard to conclude which smoothing parameter is best, but at least we now know which range of bandwidths could work for this specific dataset. It is also important to compare the different polynomial degrees and see which combination helps reduce the MSE.

Code

library(sm)
library(KernSmooth)
library(knitr)
library(reshape2)
library(ggplot2)
library(gridExtra)
source("locpolreg.R")


load("boston.Rdata")
names(boston.c)[13]<-'ROOM'
# [1] "TOWN"    "TOWNNO"  "TRACT"   "LON"     "LAT"     "MEDV"    "CMEDV"  
# [8] "CRIM"    "ZN"      "INDUS"   "CHAS"    "NOX"     "ROOM"      "AGE"    
#[15] "DIS"     "RAD"     "TAX"     "PTRATIO" "B"       "LSTAT"  

#X = LSTAT  Y = ROOM

df <- boston.c[c("ROOM", "LSTAT")]

x <- df$LSTAT
y <- df$ROOM
h.v <- exp(seq(from = log(0.5), to = log(15), length = 12))


############## FUNCTIONS ##################
cross_validation <- function(x, y, k, h_vector,p){
  #Combine the x and y input vectors into a data frame
  data = as.data.frame(cbind("x"= x,"y"=y))
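  #Shuffle the rows so the k folds form a random partition of the data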
  dfTrain<-data[sample(nrow(data)),]
  folds <- cut( seq(1,nrow(data)), breaks= k, labels=FALSE)
  
  results <- c()
  # Do the cross validation for K fold
  for(i in 1:k){
    #Segment the data by fold using the which() function
    testIndexes <- which(folds==i,arr.ind=TRUE)
    testData <- dfTrain[testIndexes, ]
    trainData <- dfTrain[-testIndexes, ]
    
    PMSE <- matrix(0,nrow =length(testData[,2]) ,ncol=length(h_vector))
    # For each fold, fit a local polynomial regression and calculate the 
    #square error for each different bandwidth
    
    for(h in 1:length(h_vector)){
      fit <- locpolreg(trainData[,1], trainData[,2], h = h_vector[h], p = p,
                       tg = testData$x, doing.plot = FALSE)
      mhat <- fit$mtgr
      sqrd_error <- (testData[,2] - mhat)^2
      PMSE[,h] <- sqrd_error
    }
    results = rbind(results,PMSE)
  }
  results <- as.data.frame(results)
  names(results) <- paste("H =",round(h_vector,2))
  return(as.data.frame(apply(results,2,mean)))
}

PMSE_GCV <- function (x, y, h_vector,p){
  PMSE_gcv <- c()
  for(i in h_vector){
    fit <- locpolreg(x,y,h = i,p=p,doing.plot = F)
    trace <- sum(diag(fit$S))
    mhat <- fit$mtgr
    n = length(x)
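    #GCV = n * RSS / (n - tr(S))^2, with tr(S) the effective number of parameters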
    result <- sum((y - mhat)^2)*n/(n-trace)^2
    PMSE_gcv <- c(PMSE_gcv,result)
  }
  return(PMSE_gcv)
}

multiple_CV <- function(x,y,h_vector,p){
  #Cross Validation for LOOCV, 5Fold and 10Fold
  names <- c("PMSE LOOCV","PMSE 5-Fold CV", "PMSE 10-Fold CV" )
  cv <- c(length(x),5,10)
  results <- matrix(0,nrow=length(h_vector),ncol=length(names))
  
  for(i in 1: length(names)){
    r <- cross_validation(x,y, cv[i] ,h_vector,p)
    results[,i] <- r[,1]
  }
  results <- as.data.frame(results)
  names(results) <- names
  rownames(results) <- rownames(r)
  
  #Append Generalized CV-PMSE
  results[,"PMSE GCV"] = PMSE_GCV(x,y,h_vector,p)
  return(results)
}

## Plotting by reshaping the data
plot_results <- function(r,h.v,p){
  results <- melt(r)
  bandwidth = round(h.v,2)
  results <-  cbind(bandwidth,results)
  names(results) <- c("bandwidth", "method" , "value" )
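  #Jitter the points horizontally so overlapping methods remain visible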
  posn.j = position_jitter(width=0.1)
  plot <- ggplot(results, aes(x=log(bandwidth),y=value, col=method)) + 
    geom_point(position=posn.j) + theme_bw() +
    xlab("Log(Bandwidth)") + ylab("Mean Squared Error") + 
    ggtitle(paste("Local Polynomial Regression Degree =",p))
  return(plot)
}
############## FUNCTIONS ##################


#Getting the Results and plotting them for Degree 1 and 2
results1 <- multiple_CV(x,y,h.v,1)
results2 <- multiple_CV(x,y,h.v,2)
plot1 <- plot_results(results1,h.v,1)
plot2 <- plot_results(results2,h.v,2)
grid.arrange(plot1,plot2,nrow=2,ncol=1)

#Getting the best bandwidth for each CV
results <- list(results1,results2)
best<- data.frame()
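#For each degree, keep the row name (bandwidth) that minimises each CV column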
for(r in 1:length(results)){
  best_h <- c()
  for(i in names(results[[r]])){
    h <- rownames(results[[r]][which.min(results[[r]][,i]),])
    best_h <- c(best_h, h)
  }
  best_h <- t(as.data.frame(best_h))
  best <- rbind(best,best_h)
}
colnames(best) <- names(results1)
rownames(best) <- c("Poly Deg 1","Poly Deg 2")

# Reference bandwidths: h.select() from the sm package and dpill() from KernSmooth
hsel_fit <- h.select(x,y,method="cv")
dpill_fit <- dpill(x,y) 

#Compare all of the best bandwidths according to each method.
best <- cbind(best, "hsel" = paste("H =",round(hsel_fit,2)),  "dpill" = paste("H =",round(dpill_fit,2)))