The ‘Ozone’ data from mlbench package contains the Los Angeles ozone pollution data collected in 1976. It is available as a data frame with 366 rows and 13 variables, that may be helpful to explain the ozone reading in the region. It is now up to us to find out which of the available variables would suit best to predict the ozone reading with maximum accuracy. The objective of this analysis is to accurately predict the ‘daily maximum one-hour average ozone reading’, using linear regression models. We will ensure this by making a 80:20 split of the data as development and validation samples, and use only the 80% sample for building linear models(training).
The models thus built will be used to predict the ozone reading on the 20% validation sample. With the actual ozone readings from 20% validation sample and model predictions thus computed, we will calculate the prediction accuracy and use it as one of the important parameters to decide the optimal model that suits this problem.
#Loading libraries
library(e1071) #Skewness
library(PerformanceAnalytics) #Correlogram (chart.Correlation)
library(Hmisc) #Missing value treatement (impute)
library(corrplot) #Correlogram (corrplot)
library(party) #Selecting best variables (cforest)
library(Boruta) #Deciding if a variable is important (Boruta)
library(caret) #Boxcox transformation (BoxCoxTrans)
library(car) #VIF
library(DMwR) #knnImputation
library(mlbench) #Ozone Data
library(DAAG) #Cross validation
library(relaimpo) #Relative importance of predictors in lm mod
#Prepare input data
###Create Input Datasets
data (Ozone, package="mlbench") #Initialize the data
inputData <- Ozone #Data from mlbench
#Assign names
names(inputData) <- c("Month", "DayOfMonth", "DayOfWeek", "OzoneReading", "PressureHeight", "WindSpeed", "Humidity", "TemperatureSandburg", "TemperatureElMonte", "InversionBaseHeight", "PressureGradient", "InversionTemperature", "Visibility")
#Segregate all continuous and categorical variables
#Place all continuous vars in ipDataCnt
ipDataCnt <- inputData[, c("PressureHeight", "WindSpeed", "Humidity", "TemperatureSandburg", "TemperatureElMonte", "InversionBaseHeight", "PressureGradient", "InversionTemperature", "Visibility")]
#Place all categorical variables in ipDataCat
ipDataCat <- inputData[, c("DayOfMonth","DayOfWeek")]
#Create the response data frame
ipDataResponse <- data.frame(OzoneReading=inputData[, "OzoneReading"])
respName <- "OzoneReading" #name of response variable
response <- inputData[, respName] #response variable as a vector
#Generate plots: Density, Scatter, Box plots
#Set up your working directory here, to a location where you want to store plots.
for (k in names(ipDataCnt)){
x <- as.numeric (ipDataCnt[, k])
Skewness <- round(skewness(x), 2) #calc skewness
dens <- density(x, na.rm=T) #density func
par(mfrow=c(1, 3)) #setup plot-area in 3 columns
#Density plot
plot(dens, type="l", col="red", ylab="Frequency", xlab = k, main = paste(k, ": Density Plot"),sub=paste("Skewness: ", Skewness))
polygon(dens, col="red")
#Scatterplot
plot(x, response, col="blue", ylab="Response", xlab = k, main = paste(k, ": Scatterplot"), pch=20)
abline(response ~ x)
boxplot(x, main=paste(k, ": Box Plot"), sub=paste("Outliers: ", paste(boxplot.stats(x)$out, collapse=" ")))
}
###Oultier analysis
#Method 1: If you choose to remove the outliers (not recommended)
x <- x[!x %in% boxplot.stats(x)$out] # NOT run!
#Method 2: Replace outliers with NAs, be be filled up later during missing value treatment.
repOutlierMissing <- function(x, na.rm = TRUE, ...){
qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
H <- 1.5 * IQR(x, na.rm = na.rm) #Threshold
y <- x
y[x < (qnt[1] - H)] <- NA
y[x > (qnt[2] + H)] <- NA
return(y)
}
ipDataCnt <- as.data.frame (sapply(ipDataCnt, repOutlierMissing)) #this will make outliers as NA
#Re-assemble inputdata with the outlier corrected part
inputData <- cbind(ipDataResponse, ipDataCnt, ipDataCat) #column bind the response, continuous and categorical predictors
Approaches for handling missing values. (1) Eliminating the variable: If a particular variable has more than 30% missing values, it is advisable to consider removing that variable altogether. (2) Eliminating the observations: If there are missing values scattered through out your dataset and you have a good number of observations in your sample, you may choose to eliminate those observations with missing values. After doing this, if you find losing more than 30% of sample size, then you probably need to look for any particular variable that is contributing the most number of missing values. (3) Imputation: Alternatively, you may also impute the missing values with one of these place holders.
#Apply the missing value treatment
#Method 1: Replace missing values with mean (NOT Run)
ipDataCntMatrix <- sapply(ipDataCnt, FUN=function(x){impute(x, mean)})#Missing values with the mean using 'impute' func from 'Hmisc' pkg
ipDataCnt <- as.data.frame(ipDataCntMatrix) #store as dataframe
#Method 2: k-Nearest neighbor imputation method - Applied!
# Impute all except the response variable.
if( anyNA(inputData)) {
# missing value treatment
inputData[, !names(inputData) %in% respName] <- knnImputation(inputData[, !names(inputData) %in% respName])
}
The correlogram is a graphical representation of correlation amongst all combinations of response and predictor variables. The investigator may choose to pick variables that have high correlation with the response based on a pre-determined cutoff. But we do not do that filtering here, since it is possible that a variable with relatively lower correlation can prove significant and valuable in regression models to capture unexplained variations, even if they are of ‘lower’ importance.
par (mfrow=c(1,1))
ipDataRespCntNALess <- na.omit(cbind(ipDataResponse, ipDataCnt)) #remove rows with missing values
chart.Correlation(ipDataRespCntNALess, histogram=TRUE, pch=19) #plot correlogram
corrplot(cor(ipDataRespCntNALess), method="circle", is.corr=TRUE)
#Get and store the summary files
sapply(inputData, summary) #get summary statistics for all columns
## $OzoneReading
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 5.00 9.00 11.53 16.00 38.00 5
##
## $PressureHeight
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5510 5700 5770 5759 5830 5950
##
## $WindSpeed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 3.000 5.000 4.801 6.000 10.000
##
## $Humidity
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19.00 47.25 64.00 58.29 73.00 93.00
##
## $TemperatureSandburg
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 25.00 51.25 62.00 61.90 72.00 93.00
##
## $TemperatureElMonte
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 27.68 51.28 57.11 57.39 64.95 82.58
##
## $InversionBaseHeight
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 111 899 2168 2579 5000 5000
##
## $PressureGradient
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -69.00 -9.75 24.00 17.75 45.00 107.00
##
## $InversionTemperature
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 27.50 51.62 62.78 61.08 70.49 91.76
##
## $Visibility
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 70.0 100.0 108.6 148.0 250.0
##
## $DayOfMonth
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12
## 26 27 28 29 30 31
## 12 12 12 12 11 7
##
## $DayOfWeek
## 1 2 3 4 5 6 7
## 52 52 52 53 53 52 52
Upon performing outlier treatment, you can expect the skewness of the variable to improve. But not always. Sometimes a box-cox transformation of the response variable is preferable to reduce the skewness and make it closer to a normal distribution. While the variable gets transformed, ranking order of the variable remains the same after the transformation.
Because, for the long tailed region of the response, there is a larger variance. This also means, more predictors are expected to have a stronger effect while explaining the larger variance part of the response variable, thereby creating a bias. This can affect all the estimates resulting in spurious models. It can also be argued that a high skewness is a consequence of heteroscedasticity (non-constant variance).
Therefore, if the skewness does not improve after treating the outliers, consider performing a box-cox transformation especially when k-fold cross-validation reveals inconsistent results. Note that, it is not mandatory to apply box-cox transformation on all variables that are skewed. It is up to the best judgement of the investigator to decide if box-cox transformation is indeed needed on a variable.
Nevertheless, just so you know how a box cox transform is done, below is the procedure. As a default, box-cox transformation in the response variable is sufficient, but here, I have chosen to apply it on variables that have a skewness > 1, even after outlier corrections are applied.
skewness(ipDataResponse, na.rm=T) # 0.878864
## [1] 0.8825285
#Transforming the predictor variables
boxcoxTransformedVars <- character(0) #variables that underwent box-cox transformation collects here.
for (colname in colnames(ipDataCnt[, -1])) {
x <- ipDataCnt[, colname]
#Check for high skewness.
if(abs(skewness(x, na.rm=T)) > 1){
boxcoxMod <- BoxCoxTrans(x) #calculates lambda value and store in a model
boxcoxTransformedVars <<- c(boxcoxTransformedVars, colname)
ipDataCnt[, colname] <- predict(boxcoxMod, x) #calculate transformed variable
}
}
#Selecting signifiant variables (continuous) - checking for p-Values
sigPredCnt <- character() #initialise output. Significant preds will accrue here
for (predName in names(ipDataCnt)) {
pred <- inputData[, predName]
mod <- lm(response ~ pred) #build linear model with only current predictor
p_value <- summary(mod)$coefficients[, 4][2] # capture p-Value
#check for significance
if (p_value < 0.1 & p_value > 0) {
#if selected, bind the predictor name if main output
sigPredCnt <- c (sigPredCnt, predName)
}
}
ipDataCntSig <- inputData[, names(inputData) %in% sigPredCnt] #filter selected predictors
#Selecting signifiant variables (categorical) - checking for chi-sq test
sigPredCat <- character() #initialise output. Significant preds will accrue here
for (predName in names(ipDataCat)) {
pred <- inputData[, predName]
chi_sq <- invisible(suppressWarnings(chisq.test(table(response, pred)))) #perform chi-Squared tes
p_value <- chi_sq[3] #capture p-Value
#check for significance
if (p_value < 0.1 & p_value > 0) {
sigPredCat <- c (sigPredCat, predName)
}
}
ipDataCatSig <- as.data.frame(inputData[, names(inputData) %in% sigPredCat])
colnames(ipDataCatSig) <- sigPredCat #assign column names
#Combine the response, continuous significant predictors, categorical significant predictors
inputDataSigF <- cbind(response, ipDataCntSig, ipDataCatSig)
#Part 1: Boruta: Decide if a variable is important or not
# Decide if a variable is important or not using Boruta
borutaOutput <- Boruta(response ~ ., data=na.omit(inputDataSigF), doTrace=2) #perform Boruta search
borutaSigF <- names(borutaOutput$finalDecision[borutaOutput$finalDecision %in% c("Confirmed", "Tentative")]) #collect Confirmed and Tentative variables
#Part 2: Estimate important variables based on mean drop in accuracy - using cforest
#Selecting best variables (variable selection using cforest)
cf1 <- cforest(response ~ . , data= na.omit(inputDataSigF[, names(inputDataSigF) %in% c(borutaSigF, "response")]), control=cforest_unbiased(mtry=2,ntree=50)) #fit the random forest
impVars <- sort(varimp(cf1), decreasing=T) #get variable importance
# Pick top 7. Adjust this as per your needs. More variables will take more computational time.
impVars <- if(length(impVars) > 7){ impVars[1:7] } else { impVars}
#Part 3: Removing multicollinearity
varConsFilter <- names(impVars)[!names(impVars) %in% names(ipDataCatSig)] #exclude categoricals
fullForm <- (paste ("response ~ ", paste (varConsFilter, collapse=" + "), sep="")) #model formula
fullMod <- lm(as.formula(fullForm), data=inputDataSigF) #build linear model
allVIF <- try(vif(fullMod), silent=TRUE) #get all vifs
#if vif of any of the variables in model is greater than threshold (4), drop the max VIF variable and build model again, until all variables have VIF less than threshold.
if(class(allVIF) != "try-error"){
allVIF <- as.data.frame(allVIF) #cast as a dataframe
# check if there are more > 2 rows, max-VIF exceeds 4 and not of 'try-error' clas
while((nrow(allVIF) > 2)& (max(allVIF[, 1]) > 4) & (class(allVIF) != "try-error")) {
#find max VIF variable
rmVars <- rownames(allVIF)[which(allVIF[, 1] == max(allVIF[, 1]))]
impVars <- impVars[!names(impVars) %in% rmVars] #remove
fullForm <- paste ("response ~ ", paste (names(impVars), collapse=" + "), sep="") #new formula
fullMod <- lm(as.formula(fullForm), data=inputDataSigF) #re-build model with new formula
allVIF <- try(as.data.frame(vif(fullMod)), silent=TRUE) # get all vifs
}
}
vFilterVars <- names(fullMod$model)[!names(fullMod$model) %in% "response"]
Lets build all combinations of the selected predictor variables, for building models of all sizes. If you wish to limit the maximum size the model, set the size in maxModelSize variable.
#create all combinations of predictors
maxModelSize <- length(vFilterVars)
cmbMatrix <- matrix(ncol=maxModelSize) #initialise final output
for (n in 1:length(vFilterVars)){
cmbMatFilter <- t(combn(vFilterVars, n)) #all combinations of variable
nrOut <- nrow(cmbMatFilter)
ncOut <- length(vFilterVars)
nrNA <- nrOut
ncNA <- ncOut-ncol(cmbMatFilter)
naMatr<- matrix(rep(NA, nrNA*ncNA), nrow=nrNA, ncol=ncNA)
out <- cbind(cmbMatFilter, naMatr)
cmbMatrix <- rbind(cmbMatrix, out)
}
cmbMatrix <- cmbMatrix[-1, ] #remove the first row that has all NA
## Split training and test data (development and validation)
set.seed(100)
trainingIndex <- sample(1:nrow(inputDataSigF), 0.8*nrow(inputDataSigF)) #random sample rows for training data
inputDataSigFTrain <- na.omit(inputDataSigF[trainingIndex, ]) #remove NA's in response variable
inputDataSigFTesting <- na.omit(inputDataSigF[-trainingIndex, ]) #omitting NA's, since we need to calculate prediction accuracy
Now that we have all the possible combinations for predictors, we can expect to generate several good models from this mix. In the code that follows, you will first build the model and capture the summary of the model in a variable (currSumm). Then, the diagnostic parameters such as R-sq, p-Value, AIC etc are captured. Post that, the Influential observations and accuracy measures are calculated on the training and test data sets. Finally, we perform the k-Fold cross validation and capture the mean-squared error for the model.
#BUILD THE LINEAR MODELS
Final_Output <- data.frame() #initialise the final output dataframe
par (mfrow=c(1,3)) #set margins
for (rownum in 1:nrow(cmbMatrix)){
##Build model for current formula from cmbMatrix
preds <- na.omit(cmbMatrix[rownum, ]) #get the predictor names
form <- paste ("response ~ ", paste (preds, collapse=" + "), sep="") #model formul
currMod <- lm(as.formula(form), data=inputDataSigFTrain) #build the linear model
currSumm <- summary(currMod) #model summary
##Diagnostic parameters-
FStats <- currSumm$fstatistic[1] #Fstatistic
f <- currSumm$fstatistic #parameters for model p-value calc
modelP <- pf(f[1], f[2], f[3], lower=FALSE) #model p-Value
rSQ <- currSumm[8] #R-Squared
ArSQ <- currSumm[9] #adj R-Squared
aic <- AIC(currMod) #AIC
bic <- BIC(currMod) #BIC
## Get Influential Observations-
cutoff <- 4/(nrow(inputDataSigFTrain)-length(currMod$coefficients)-2) #cooks.distance cutoff
infObsRows <- paste(which(cooks.distance(currMod) > cutoff), collapse=" ") #caputure influential observations
## Calculate accuracy measures on Development sample-
trainMap <- mean((abs(currMod$residuals))/abs(na.omit(inputDataSigFTrain$response))) * 100 #MAPE
actFitted <- data.frame(actuals=inputDataSigFTrain$response, fitted=currMod$fitted)
minVals <- apply(actFitted, 1, min)
maxVals <- apply(actFitted, 1, max)
trainMinMax <- minVals/maxVals
trainMinMaxAcc <- mean(trainMinMax)
## Calculate accuracy measures on Validation sample-
predicteds <- predict(currMod, inputDataSigFTesting) #predict for test data
predMap <- mean((abs(inputDataSigFTesting$response - predicteds)/abs(inputDataSigFTesting$response))) * 100 #calculate predicted mape
actual_predicteds <- data.frame(actuals=inputDataSigFTesting$response, predicted=predicteds)
minVals <- apply(actual_predicteds, 1, min)
maxVals <- apply(actual_predicteds, 1, max)
predMinMax <- minVals/maxVals
predMinMaxAcc <- mean(minVals/maxVals) #min-max accuracy on test data
##Perform k-Fold cross-validation-
cvResults <- suppressWarnings(CVlm(na.omit(inputDataSigF), form.lm=currMod, m=5, dots=FALSE, seed=29, legend.pos="topleft", printit=FALSE))
meanSQError <- attr(cvResults, 'ms')
##Collect all stats for Final Output-
currOutput <- data.frame(formula=form, rSQ=rSQ, ArSQ=ArSQ, AIC=aic, BIC=bic, Model.pValue= modelP,
F.Statistic=f[1], training.mape=trainMap, predicted.mape=predMap, training.minmax.accuracy = trainMinMaxAcc, predicted.minmax.accuracy=predMinMaxAcc, influential.obs = infObsRows, k_fold_ms=meanSQError)
#Final Output
Final_Output <- rbind(Final_Output, currOutput)
##Print output so they get accumulated in 'All_Models_In_Combos.txt'
names(actFitted) <- c("actuals", "predicted")
actPredAll <- rbind(actFitted, actual_predicteds)
print (currSumm)
print (vif(currMod))
print (actpPred <- cbind(actPredAll, na.omit(inputDataSigF)[, preds]))
}
write.csv(Final_Output, "Final_Regression_Output.csv", row.names=F) # Export
Final_Regression_Output is the final output file containing diagnostic parameters for all models. In below table, you will find a sample row of ’Final_Regression_Output for one of the models, that we just built using the above logic
rownum=15 #row number of best model from Final_Regression_Output.csv
preds <- na.omit(cmbMatrix[rownum, ]) #get the predictor names
form <- paste ("response ~ ", paste (preds, collapse=" + "), sep="") #model fornula
currMod <- lm(as.formula(form), data=inputDataSigFTrain) #build llinear model
currSumm <- summary(currMod) #model summary
vif(currMod) #variance inflation factor
## InversionBaseHeight PressureHeight Humidity
## 1.5460 1.4330 1.9452
## PressureGradient
## 1.8551
#Get Influential Observations
cutoff <- 4/(nrow(inputDataSigFTrain)-length(currMod$coefficients)-2)
plot (currMod, which=4, cook.levels=cutoff)
abline (h=cutoff, lty=2, col="red")
influencePlot (currMod) #Influence plot for best model fits
## StudRes Hat CookD
## 19 -0.8830961 0.06901362 0.01157109
## 243 3.1844485 0.01855175 0.03714138