In this section, we evaluate the the validity and prediction accuracy of our linear model by using testing data and residual plots

PART ONE: Residual Plots

## loading in the model we build in the end of R PART2
library(leaps)
library(sp)
data(meuse)
meuse <- na.omit(meuse)
var_selec <- c("cadmium","zinc","om","ffreq","lime","lead")
## creating the sub
data_selec <- meuse[var_selec]
data_selec$ffreq <- as.numeric(data_selec$ffreq)
## replicating previous model
model_1 <- lm(lead~. ,data = data_selec) 
summary(model_1)
## 
## Call:
## lm(formula = lead ~ ., data = data_selec)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -77.514 -12.536  -0.082  13.660  59.689 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  35.5617     7.0328   5.057 1.26e-06 ***
## cadmium     -13.0918     1.5459  -8.469 2.43e-14 ***
## zinc          0.4300     0.0128  33.606  < 2e-16 ***
## om           -2.3889     0.8193  -2.916 0.004110 ** 
## ffreq       -10.5112     2.9432  -3.571 0.000481 ***
## lime1       -25.0169     5.8700  -4.262 3.62e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.72 on 146 degrees of freedom
## Multiple R-squared:  0.9597, Adjusted R-squared:  0.9584 
## F-statistic: 696.1 on 5 and 146 DF,  p-value: < 2.2e-16
## loading in the model we build in the end of R PART3 where we did transformation on response and predictors
data_selec$cadmium_log <- log(data_selec$cadmium)
data_selec$zinc_log <- log(data_selec$zinc)
data_selec$cadmium_2 <- sqrt(data_selec$cadmium)
data_selec$zinc_2 <- sqrt(data_selec$zinc)
data_selec$lead_log <- log(data_selec$lead)
var_selec_2 <- c("cadmium_log","zinc_log","om","ffreq","lime","lead_log")
data_selec_2 <- data_selec[var_selec_2]
model_2 <- lm(lead_log~., data = data_selec_2)
summary(model_2)
## 
## Call:
## lm(formula = lead_log ~ ., data = data_selec_2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.51262 -0.07645 -0.00578  0.10017  0.50506 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.103709   0.191824  -5.754 4.95e-08 ***
## cadmium_log -0.049709   0.019982  -2.488  0.01398 *  
## zinc_log     1.052586   0.035331  29.792  < 2e-16 ***
## om          -0.014211   0.004965  -2.862  0.00482 ** 
## ffreq       -0.060034   0.019145  -3.136  0.00207 ** 
## lime1       -0.185432   0.035280  -5.256 5.13e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1435 on 146 degrees of freedom
## Multiple R-squared:  0.9547, Adjusted R-squared:  0.9532 
## F-statistic: 615.7 on 5 and 146 DF,  p-value: < 2.2e-16
plot(model_1)

Now we investigate the model_2 that is built after variable transformation

plot(model_2)

From the residual plots, we can see that the model after transformation doesn’t violate any of the assumptions. With similar Rsquared of 95%, we choose this model after variable transformation.

PART TWO: training and testing dataset

Because we are building a linear model, there are no parameters to tune or overfitting issue to control. We simply demonstrate how to split training and testing

## we randomly split 80/20 testing and training
set.seed(12345678+100*8)
## seperate to training and hold-off set
samp_perc <-sample(1:nrow(data_selec_2),nrow(data_selec_2)*0.8,replace = F) 
train <-data_selec_2[samp_perc,]
test <- data_selec_2[-samp_perc,]
model_3 <- lm(lead_log~., data = train)
summary(model_3)
## 
## Call:
## lm(formula = lead_log ~ ., data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.50729 -0.08598  0.00269  0.09793  0.50415 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.004083   0.230982  -4.347 3.00e-05 ***
## cadmium_log -0.036626   0.023212  -1.578   0.1173    
## zinc_log     1.030784   0.042617  24.187  < 2e-16 ***
## om          -0.012731   0.005723  -2.225   0.0281 *  
## ffreq       -0.050327   0.023103  -2.178   0.0314 *  
## lime1       -0.179298   0.042309  -4.238 4.58e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1506 on 115 degrees of freedom
## Multiple R-squared:  0.9512, Adjusted R-squared:  0.9491 
## F-statistic: 448.4 on 5 and 115 DF,  p-value: < 2.2e-16
## we predict the testing dataset
pred <- predict(model_3, newdata = test)
rsquared <- 1 - sum((pred - test$lead_log)^2)/sum((test$lead_log - mean(test$lead_log))^2)
rsquared
## [1] 0.9669326

We obtain a very high Rsquared of 0.9669.