In this section, we evaluate the validity and prediction accuracy of our linear model using a held-out test set and residual plots.
## loading in the model we built at the end of R PART2
library(leaps)
library(sp)
data(meuse)
meuse <- na.omit(meuse)
var_selec <- c("cadmium","zinc","om","ffreq","lime","lead")
## creating the subset of selected variables
data_selec <- meuse[var_selec]
data_selec$ffreq <- as.numeric(data_selec$ffreq)
## replicating previous model
model_1 <- lm(lead~. ,data = data_selec)
summary(model_1)
##
## Call:
## lm(formula = lead ~ ., data = data_selec)
##
## Residuals:
## Min 1Q Median 3Q Max
## -77.514 -12.536 -0.082 13.660 59.689
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.5617 7.0328 5.057 1.26e-06 ***
## cadmium -13.0918 1.5459 -8.469 2.43e-14 ***
## zinc 0.4300 0.0128 33.606 < 2e-16 ***
## om -2.3889 0.8193 -2.916 0.004110 **
## ffreq -10.5112 2.9432 -3.571 0.000481 ***
## lime1 -25.0169 5.8700 -4.262 3.62e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.72 on 146 degrees of freedom
## Multiple R-squared: 0.9597, Adjusted R-squared: 0.9584
## F-statistic: 696.1 on 5 and 146 DF, p-value: < 2.2e-16
## loading in the model we built at the end of R PART3, where we transformed the response and predictors
data_selec$cadmium_log <- log(data_selec$cadmium)
data_selec$zinc_log <- log(data_selec$zinc)
data_selec$cadmium_2 <- sqrt(data_selec$cadmium)
data_selec$zinc_2 <- sqrt(data_selec$zinc)
data_selec$lead_log <- log(data_selec$lead)
var_selec_2 <- c("cadmium_log","zinc_log","om","ffreq","lime","lead_log")
data_selec_2 <- data_selec[var_selec_2]
model_2 <- lm(lead_log~., data = data_selec_2)
summary(model_2)
##
## Call:
## lm(formula = lead_log ~ ., data = data_selec_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.51262 -0.07645 -0.00578 0.10017 0.50506
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.103709 0.191824 -5.754 4.95e-08 ***
## cadmium_log -0.049709 0.019982 -2.488 0.01398 *
## zinc_log 1.052586 0.035331 29.792 < 2e-16 ***
## om -0.014211 0.004965 -2.862 0.00482 **
## ffreq -0.060034 0.019145 -3.136 0.00207 **
## lime1 -0.185432 0.035280 -5.256 5.13e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1435 on 146 degrees of freedom
## Multiple R-squared: 0.9547, Adjusted R-squared: 0.9532
## F-statistic: 615.7 on 5 and 146 DF, p-value: < 2.2e-16
plot(model_1)
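Note that plot() on an lm object produces four diagnostic plots in turn (residuals vs. fitted, normal Q-Q, scale-location, and residuals vs. leverage). As an optional sketch using base graphics, the four plots can be shown together in a single 2-by-2 panel:
## optional: display the four diagnostic plots of model_1 in one 2x2 panel
old_par <- par(mfrow = c(2, 2))
plot(model_1)
par(old_par) ## restore the previous graphics settings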
Now we investigate model_2, which was built after the variable transformation.
plot(model_2)
From the residual plots, we can see that the model after transformation does not appear to violate any of the assumptions. With a similarly high R-squared of about 95%, we choose the model with the transformed variables.
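The visual check can be supplemented with formal tests. As a minimal sketch (assuming the lmtest package is installed), a Shapiro-Wilk test on the residuals checks normality and a Breusch-Pagan test checks constant variance:
## formal checks of the residual assumptions for model_2
shapiro.test(residuals(model_2)) ## H0: residuals are normally distributed
library(lmtest)
bptest(model_2) ## H0: residuals have constant (homoscedastic) variance
Large p-values on both tests would be consistent with what the residual plots suggest.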
A training dataset is a set of examples used for learning, that is, to fit the parameters (e.g., weights) of, for example, a classifier.
A test dataset is a dataset that is independent of the training dataset but follows the same probability distribution. If a model fit to the training dataset also fits the test dataset well, minimal overfitting has taken place; a markedly better fit on the training data than on the test data usually points to overfitting.
Because we are fitting a simple linear model, there are no hyperparameters to tune or overfitting issues to control. We simply demonstrate how to split the data into training and test sets.
## we randomly split the data 80/20 into training and testing sets
set.seed(12345678+100*8)
## separate into training and hold-out sets
samp_perc <- sample(1:nrow(data_selec_2), nrow(data_selec_2) * 0.8, replace = FALSE)
train <- data_selec_2[samp_perc, ]
test <- data_selec_2[-samp_perc, ]
model_3 <- lm(lead_log~., data = train)
summary(model_3)
##
## Call:
## lm(formula = lead_log ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.50729 -0.08598 0.00269 0.09793 0.50415
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.004083 0.230982 -4.347 3.00e-05 ***
## cadmium_log -0.036626 0.023212 -1.578 0.1173
## zinc_log 1.030784 0.042617 24.187 < 2e-16 ***
## om -0.012731 0.005723 -2.225 0.0281 *
## ffreq -0.050327 0.023103 -2.178 0.0314 *
## lime1 -0.179298 0.042309 -4.238 4.58e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1506 on 115 degrees of freedom
## Multiple R-squared: 0.9512, Adjusted R-squared: 0.9491
## F-statistic: 448.4 on 5 and 115 DF, p-value: < 2.2e-16
## we predict on the testing dataset
pred <- predict(model_3, newdata = test)
rsquared <- 1 - sum((pred - test$lead_log)^2)/sum((test$lead_log - mean(test$lead_log))^2)
rsquared
## [1] 0.9669326
We obtain a very high test R-squared of 0.9669, comparable to the training R-squared of 0.9512, so the model predicts well on unseen data.
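Because lead_log is on the log scale, the test R-squared above is also on the log scale. As a minimal sketch, the predictions can be back-transformed with exp() and summarized by an RMSE on the original lead concentration scale; note that simply exponentiating ignores the retransformation bias, so this is only a rough check:
## back-transform predictions to the original lead scale and compute test RMSE
pred_lead <- exp(pred) ## predicted lead concentration
obs_lead <- exp(test$lead_log) ## observed lead concentration
sqrt(mean((pred_lead - obs_lead)^2))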