In this section, we evaluate the validity and prediction accuracy of our linear model using a held-out test set and residual plots.
## loading in the model we built at the end of R PART2
library(leaps)
library(sp)
data(meuse)
meuse <- na.omit(meuse)
var_selec <- c("cadmium","zinc","om","ffreq","lime","lead")
## creating the subset of selected variables
data_selec <- meuse[var_selec]
data_selec$ffreq <- as.numeric(data_selec$ffreq)
## replicating previous model
model_1 <- lm(lead~. ,data = data_selec)
summary(model_1)
##
## Call:
## lm(formula = lead ~ ., data = data_selec)
##
## Residuals:
## Min 1Q Median 3Q Max
## -77.514 -12.536 -0.082 13.660 59.689
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.5617 7.0328 5.057 1.26e-06 ***
## cadmium -13.0918 1.5459 -8.469 2.43e-14 ***
## zinc 0.4300 0.0128 33.606 < 2e-16 ***
## om -2.3889 0.8193 -2.916 0.004110 **
## ffreq -10.5112 2.9432 -3.571 0.000481 ***
## lime1 -25.0169 5.8700 -4.262 3.62e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.72 on 146 degrees of freedom
## Multiple R-squared: 0.9597, Adjusted R-squared: 0.9584
## F-statistic: 696.1 on 5 and 146 DF, p-value: < 2.2e-16
## loading in the model we built at the end of R PART3, where we transformed the response and predictors
data_selec$cadmium_log <- log(data_selec$cadmium)
data_selec$zinc_log <- log(data_selec$zinc)
data_selec$cadmium_2 <- sqrt(data_selec$cadmium)
data_selec$zinc_2 <- sqrt(data_selec$zinc)
data_selec$lead_log <- log(data_selec$lead)
var_selec_2 <- c("cadmium_log","zinc_log","om","ffreq","lime","lead_log")
data_selec_2 <- data_selec[var_selec_2]
model_2 <- lm(lead_log~., data = data_selec_2)
summary(model_2)
##
## Call:
## lm(formula = lead_log ~ ., data = data_selec_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.51262 -0.07645 -0.00578 0.10017 0.50506
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.103709 0.191824 -5.754 4.95e-08 ***
## cadmium_log -0.049709 0.019982 -2.488 0.01398 *
## zinc_log 1.052586 0.035331 29.792 < 2e-16 ***
## om -0.014211 0.004965 -2.862 0.00482 **
## ffreq -0.060034 0.019145 -3.136 0.00207 **
## lime1 -0.185432 0.035280 -5.256 5.13e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1435 on 146 degrees of freedom
## Multiple R-squared: 0.9547, Adjusted R-squared: 0.9532
## F-statistic: 615.7 on 5 and 146 DF, p-value: < 2.2e-16
plot(model_1)
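Note that plot() on an lm object produces four diagnostic plots in turn (residuals vs. fitted, normal Q-Q, scale-location, and residuals vs. leverage). As an optional sketch using base graphics, the four plots can be shown together in a single 2-by-2 panel:
## optional: display the four diagnostic plots of model_1 in one 2x2 panel
old_par <- par(mfrow = c(2, 2))
plot(model_1)
par(old_par) ## restore the previous graphics settings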
Now we investigate model_2, which was built after the variable transformation.
plot(model_2)
From the residual plots, we can see that the model after transformation does not appear to violate any of the assumptions. With a similarly high R-squared of about 95%, we choose the model with the transformed variables.
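The visual check can be supplemented with formal tests. As a minimal sketch (assuming the lmtest package is installed), a Shapiro-Wilk test on the residuals checks normality and a Breusch-Pagan test checks constant variance:
## formal checks of the residual assumptions for model_2
shapiro.test(residuals(model_2)) ## H0: residuals are normally distributed
library(lmtest)
bptest(model_2) ## H0: residuals have constant (homoscedastic) variance
Large p-values on both tests would be consistent with what the residual plots suggest.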
A training dataset is a set of examples used for learning, that is, to fit the parameters (e.g., weights) of, for example, a classifier.
A test dataset is a dataset that is independent of the training dataset but follows the same probability distribution. If a model fit to the training dataset also fits the test dataset well, minimal overfitting has taken place; a markedly better fit on the training data than on the test data usually points to overfitting.
Because we are fitting a simple linear model, there are no hyperparameters to tune or overfitting issues to control. We simply demonstrate how to split the data into training and test sets.
## we randomly split the data 80/20 into training and testing sets
set.seed(12345678+100*8)
## separate into training and hold-out sets
samp_perc <- sample(1:nrow(data_selec_2), nrow(data_selec_2) * 0.8, replace = FALSE)
train <- data_selec_2[samp_perc, ]
test <- data_selec_2[-samp_perc, ]
model_3 <- lm(lead_log~., data = train)
summary(model_3)
##
## Call:
## lm(formula = lead_log ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.50729 -0.08598 0.00269 0.09793 0.50415
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.004083 0.230982 -4.347 3.00e-05 ***
## cadmium_log -0.036626 0.023212 -1.578 0.1173
## zinc_log 1.030784 0.042617 24.187 < 2e-16 ***
## om -0.012731 0.005723 -2.225 0.0281 *
## ffreq -0.050327 0.023103 -2.178 0.0314 *
## lime1 -0.179298 0.042309 -4.238 4.58e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1506 on 115 degrees of freedom
## Multiple R-squared: 0.9512, Adjusted R-squared: 0.9491
## F-statistic: 448.4 on 5 and 115 DF, p-value: < 2.2e-16
## we predict on the testing dataset
pred <- predict(model_3, newdata = test)
rsquared <- 1 - sum((pred - test$lead_log)^2)/sum((test$lead_log - mean(test$lead_log))^2)
rsquared
## [1] 0.9669326
We obtain a very high test R-squared of 0.9669, comparable to the training R-squared of 0.9512, so the model predicts well on unseen data.
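Because lead_log is on the log scale, the test R-squared above is also on the log scale. As a minimal sketch, the predictions can be back-transformed with exp() and summarized by an RMSE on the original lead concentration scale; note that simply exponentiating ignores the retransformation bias, so this is only a rough check:
## back-transform predictions to the original lead scale and compute test RMSE
pred_lead <- exp(pred) ## predicted lead concentration
obs_lead <- exp(test$lead_log) ## observed lead concentration
sqrt(mean((pred_lead - obs_lead)^2))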