Today I want to show you my personal workflow for linear regression models. Linear regression is arguably the most important regression model to know, so I would like to run a quick analysis on a randomly chosen dataset.
This time we will fit a linear regression model to the Cigarette dataset, which contains data on smoking habits and tax changes in the USA.
We want to predict the number of packs per capita (packpc) given the other variables. In other words, we need to create a regression model.
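The original post does not show its setup chunk, so the following list of packages is an assumption inferred from the functions used throughout the analysis:
# Assumed setup: packages inferred from the functions called below
library(Ecdat)                 # Cigarette dataset (assumption about its source)
library(PerformanceAnalytics)  # chart.Correlation()
library(corrplot)              # corrplot()
library(dplyr)                 # sample_frac(), anti_join()
library(ggplot2)               # ggplot()
library(ggfortify)             # autoplot() method for lm objects
library(gridExtra)             # grid.arrange()
library(regclass)              # VIF()
library(olsrr)                 # ols_plot_cooksd_bar()
library(MASS)                  # stepAIC()
library(relaimpo)              # calc.relimp(), boot.relimp(), booteval.relimp()
library(DAAG)                  # cv.lm()
library(bootstrap)             # crossval()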
cigar.data <- Cigarette
str(cigar.data)
## 'data.frame': 528 obs. of 9 variables:
## $ state : Factor w/ 48 levels "AL","AR","AZ",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ year : int 1985 1985 1985 1985 1985 1985 1985 1985 1985 1985 ...
## $ cpi : num 1.08 1.08 1.08 1.08 1.08 ...
## $ pop : int 3973000 2327000 3184000 26444000 3209000 3201000 618000 11352000 5963000 2830000 ...
## $ packpc: num 116 129 105 100 113 ...
## $ income: int 46014968 26210736 43956936 447102816 49466672 60063368 9927301 166919248 78364336 37902896 ...
## $ tax : num 32.5 37 31 26 31 ...
## $ avgprs: num 102.2 101.5 108.6 107.8 94.3 ...
## $ taxs : num 33.3 37 36.2 32.1 31 ...
chart.Correlation(cigar.data[,3:9], histogram = TRUE, pch = 19)
sapply(cigar.data[,3:9], function(x) sum(is.na(x)))
## cpi pop packpc income tax avgprs taxs
## 0 0 0 0 0 0 0
There are quite a few significant relationships, which suggests we may run into collinearity problems. Fortunately, there are no missing (NA) values.
cor.mtest <- function(mat, ...) {
mat <- as.matrix(mat)
n <- ncol(mat)
p.mat<- matrix(NA, n, n)
diag(p.mat) <- 0
for (i in 1:(n - 1)) {
for (j in (i + 1):n) {
tmp <- cor.test(mat[, i], mat[, j], ...)
p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
}
}
colnames(p.mat) <- rownames(p.mat) <- colnames(mat)
p.mat
}
p.mat <- cor.mtest(cigar.data[,3:9])
cigar.cor <- cor(cigar.data[,3:9])
corrplot(cigar.cor, method = "number", type = "upper",
tl.cex = 0.9, number.cex = 0.6, order="hclust", diag = FALSE,
addCoef.col = "black", tl.col = "black", tl.srt = 45,
# Significance
p.mat = p.mat, sig.level = 0.05, insig = "blank")
There is no need to clean up the dataset. We have not observed any NA values, and although some distributions are bimodal or skewed, there are enough statistically significant relationships to build a regression model.
Next, we need to split the data into two sets so that we can cross-validate the model once it is built. The training set will contain 80% of the observations.
set.seed(1)
cigar.data2 <- cigar.data[,3:9]
cigar.train <- sample_frac(tbl = cigar.data2, replace = FALSE, size = 0.80)
cigar.test <- anti_join(cigar.data2, cigar.train)
The multiple linear regression model we will fit has the following form:
\[ y_n = \beta_0 + \beta_1x_{n1} + \dots + \beta_kx_{nk} + \epsilon_n \]
fit <- lm(packpc ~ ., data = cigar.train)
Linear regression is a parametric model, so there are several assumptions we need to check to make sure the model is sound.
autoplot(fit)
summary(fit)
##
## Call:
## lm(formula = packpc ~ ., data = cigar.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.280 -8.616 0.330 8.161 72.550
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.365e+02 1.060e+01 12.881 < 2e-16 ***
## cpi 4.097e+01 1.308e+01 3.133 0.00185 **
## pop 4.306e-07 1.029e-06 0.419 0.67573
## income -2.647e-08 4.889e-08 -0.541 0.58856
## tax -3.282e-01 3.159e-01 -1.039 0.29942
## avgprs -4.618e-01 9.370e-02 -4.928 1.2e-06 ***
## taxs -6.278e-03 3.026e-01 -0.021 0.98346
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.25 on 415 degrees of freedom
## Multiple R-squared: 0.3822, Adjusted R-squared: 0.3733
## F-statistic: 42.8 on 6 and 415 DF, p-value: < 2.2e-16
confint(fit)
## 2.5 % 97.5 %
## (Intercept) 1.156565e+02 1.573112e+02
## cpi 1.526284e+01 6.667506e+01
## pop -1.591629e-06 2.452893e-06
## income -1.225758e-07 6.964006e-08
## tax -9.490425e-01 2.927122e-01
## avgprs -6.459474e-01 -2.775675e-01
## taxs -6.011955e-01 5.886399e-01
From the summary and diagnostic plots we observe that:

1. Not all the variables are significant.
2. The Residuals vs. Fitted plot shows a pattern in the distribution of the residuals.
3. Some observations are overly influential in the model (Normal Q-Q plot and Cook's distance plot).
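These visual impressions could also be backed up with formal tests. The following is a minimal sketch, assuming the lmtest and car packages (neither is used elsewhere in this post):
library(lmtest)   # assumed: provides bptest()
library(car)      # assumed: provides durbinWatsonTest()
bptest(fit)               # Breusch-Pagan test for heteroscedasticity of the residuals
shapiro.test(resid(fit))  # Shapiro-Wilk test for normality of the residuals
durbinWatsonTest(fit)     # Durbin-Watson test for autocorrelation of the residuals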
We have seen in the corrplot that there are many high correlations. This can be problematic: when the predictors are strongly correlated, the design matrix is nearly singular, \(X^TX\) cannot be reliably inverted, and the coefficient estimates become unstable.
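Before turning to variance inflation factors, a quick way to gauge how close the design matrix is to singularity is its condition number. This is just a sketch using base R's kappa(), not part of the original analysis:
# Condition number of the model matrix: very large values indicate a nearly
# singular (ill-conditioned) design. Scaling first removes the effect of the
# very different units of pop and income.
kappa(model.matrix(fit))
kappa(scale(model.matrix(fit)[, -1]))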
VIF(fit)
## cpi pop income tax avgprs taxs
## 5.027321 35.359664 37.152638 18.056987 11.711699 24.032766
As expected, we have a problem with the collinearity assumption of the model. The variables pop, income, tax and taxs have VIF values that are far too high (above ~10), so at this point we would normally stop and consider another approach that handles multicollinearity, such as a PCA-based model. We will continue with this model as an example, though.
Another problem with our model is the pattern observed in the distribution of the residuals. We can plot each explanatory variable against the model residuals to look at it in detail.
p1 <- ggplot(cigar.train, aes(cigar.train[,1], residuals(fit))) +
geom_point() + geom_smooth(color = "blue")
p2 <- ggplot(cigar.train, aes(cigar.train[,2], residuals(fit))) +
geom_point() + geom_smooth(color = "blue")
p3 <- ggplot(cigar.train, aes(cigar.train[,3], residuals(fit))) +
geom_point() + geom_smooth(color = "blue")
p4 <- ggplot(cigar.train, aes(cigar.train[,4], residuals(fit))) +
geom_point() + geom_smooth(color = "blue")
p5 <- ggplot(cigar.train, aes(cigar.train[,5], residuals(fit))) +
geom_point() + geom_smooth(color = "blue")
p6 <- ggplot(cigar.train, aes(cigar.train[,6], residuals(fit))) +
geom_point() + geom_smooth(color = "blue")
p7 <- ggplot(cigar.train, aes(cigar.train[,7], residuals(fit))) +
geom_point() + geom_smooth(color = "blue")
grid.arrange(p1, p2, p3, p4, p5, p6, p7)
The model is clearly underfitting, for two reasons: the linearity assumption does not hold in most cases, and the model has too few degrees of freedom (too little flexibility) to capture the curvature we see. Given that, we could apply a polynomial transformation to columns 5, 6 and 7 (tax, avgprs and taxs), since their relationships are non-linear and could be fitted better.
ols_plot_cooksd_bar(fit)
Another major problem with our model is the outliers. However, removing them seems like a bad idea since they are real data: dropping extreme values improves the apparent fit, but it hurts the model's forecasting ability in the long run, especially for new extreme values.
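If we want to inspect those influential observations rather than delete them, one option is the conventional 4/n cutoff on Cook's distance. This threshold is a rule of thumb assumed here, not something used in the original analysis:
# Flag observations whose Cook's distance exceeds the conventional 4/n cutoff
cooks <- cooks.distance(fit)
influential <- which(cooks > 4 / nrow(cigar.train))
cigar.train[influential, ]  # inspect them instead of removing them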
Let's check whether we really need all the variables. Since the explanatory variables are highly correlated, it makes sense to try removing some of them.
summary(step(fit, trace = 0))
##
## Call:
## lm(formula = packpc ~ cpi + tax + avgprs, data = cigar.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.272 -8.592 0.061 8.354 72.213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 137.86768 9.29553 14.832 < 2e-16 ***
## cpi 41.68451 12.21447 3.413 0.000706 ***
## tax -0.34300 0.13399 -2.560 0.010823 *
## avgprs -0.47770 0.07908 -6.040 3.4e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.2 on 418 degrees of freedom
## Multiple R-squared: 0.3812, Adjusted R-squared: 0.3767
## F-statistic: 85.82 on 3 and 418 DF, p-value: < 2.2e-16
Just as we suspected, with only three variables we get a slightly better model than the previous one. We will continue the analysis with these three predictors, this time applying a square-root transformation to tax and avgprs to see whether a gentler, concave relationship fits better.
fit2 <- lm(packpc ~ cpi + sqrt(tax) + sqrt(avgprs), data = cigar.train)
summary(stepAIC(fit, trace = 0))
##
## Call:
## lm(formula = packpc ~ cpi + tax + avgprs, data = cigar.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.272 -8.592 0.061 8.354 72.213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 137.86768 9.29553 14.832 < 2e-16 ***
## cpi 41.68451 12.21447 3.413 0.000706 ***
## tax -0.34300 0.13399 -2.560 0.010823 *
## avgprs -0.47770 0.07908 -6.040 3.4e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.2 on 418 degrees of freedom
## Multiple R-squared: 0.3812, Adjusted R-squared: 0.3767
## F-statistic: 85.82 on 3 and 418 DF, p-value: < 2.2e-16
summary(stepAIC(fit2, trace = 0))
##
## Call:
## lm(formula = packpc ~ cpi + sqrt(tax) + sqrt(avgprs), data = cigar.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56.041 -8.881 0.395 8.177 70.956
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 223.235 8.098 27.568 < 2e-16 ***
## cpi 44.585 12.373 3.604 0.000352 ***
## sqrt(tax) -5.698 1.656 -3.440 0.000640 ***
## sqrt(avgprs) -11.403 1.892 -6.028 3.64e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.97 on 418 degrees of freedom
## Multiple R-squared: 0.3973, Adjusted R-squared: 0.393
## F-statistic: 91.85 on 3 and 418 DF, p-value: < 2.2e-16
anova(fit, fit2)
## Analysis of Variance Table
##
## Model 1: packpc ~ cpi + pop + income + tax + avgprs + taxs
## Model 2: packpc ~ cpi + sqrt(tax) + sqrt(avgprs)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 415 138281
## 2 418 134910 -3 3370.5
AIC(fit, fit2)
## df AIC
## fit 8 3657.823
## fit2 5 3641.410
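The BIC comparison mentioned below can be obtained in the same way (its output is not reproduced here). Since BIC penalises each extra parameter more heavily than AIC, it favours the smaller model at least as strongly:
# BIC comparison (values not shown in the original post)
BIC(fit, fit2)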
The \(AIC\) and \(BIC\) calculations point to the same model. The second model is even better than the previous one despite using fewer variables and fewer parameters (5 instead of 8 in the df column). We will stick with this model.
calc.relimp(fit2, type = c("lmg", "last", "first", "pratt", "betasq"), rela = T)
## Response variable: packpc
## Total response variance: 531.6946
## Analysis based on 422 observations
##
## 3 Regressors:
## cpi sqrt(tax) sqrt(avgprs)
## Proportion of variance explained by model: 39.73%
## Metrics are normalized to sum to 100% (rela=TRUE).
##
## Relative importance metrics:
##
## lmg last first betasq pratt
## cpi 0.1689195 0.2123143 0.1885808 0.15191516 -0.2934772
## sqrt(tax) 0.3920353 0.1934852 0.4044931 0.08968051 0.3302400
## sqrt(avgprs) 0.4390453 0.5942006 0.4069261 0.75840433 0.9632372
##
## Average coefficients for different model sizes:
##
## 1X 2Xs 3Xs
## cpi -59.76424 22.06888 44.584804
## sqrt(tax) -14.55899 -10.54206 -5.697947
## sqrt(avgprs) -10.04946 -10.93036 -11.403263
Now we can plot the Bootstrap Measures of Relative Importance over 1000 samples.
boot <- boot.relimp(fit2, b = 1000, type = c("lmg", "last", "first", "pratt"),
rank = TRUE, diff = TRUE, rela = TRUE)
booteval.relimp(boot)
## Response variable: packpc
## Total response variance: 531.6946
## Analysis based on 422 observations
##
## 3 Regressors:
## cpi sqrt(tax) sqrt(avgprs)
## Proportion of variance explained by model: 39.73%
## Metrics are normalized to sum to 100% (rela=TRUE).
##
## Relative importance metrics:
##
## lmg last first pratt
## cpi 0.1689195 0.2123143 0.1885808 -0.2934772
## sqrt(tax) 0.3920353 0.1934852 0.4044931 0.3302400
## sqrt(avgprs) 0.4390453 0.5942006 0.4069261 0.9632372
##
## Average coefficients for different model sizes:
##
## 1X 2Xs 3Xs
## cpi -59.76424 22.06888 44.584804
## sqrt(tax) -14.55899 -10.54206 -5.697947
## sqrt(avgprs) -10.04946 -10.93036 -11.403263
##
##
## Confidence interval information ( 1000 bootstrap replicates, bty= perc ):
## Relative Contributions with confidence intervals:
##
## Lower Upper
## percentage 0.95 0.95 0.95
## cpi.lmg 0.1689 __C 0.1422 0.2054
## sqrt(tax).lmg 0.3920 AB_ 0.3133 0.4633
## sqrt(avgprs).lmg 0.4390 AB_ 0.3872 0.4935
##
## cpi.last 0.2123 _BC 0.0717 0.3310
## sqrt(tax).last 0.1935 ABC 0.0198 0.5104
## sqrt(avgprs).last 0.5942 AB_ 0.3608 0.7552
##
## cpi.first 0.1886 __C 0.1385 0.2338
## sqrt(tax).first 0.4045 AB_ 0.3524 0.4619
## sqrt(avgprs).first 0.4069 AB_ 0.3823 0.4354
##
## cpi.pratt -0.2935 __C -0.4110 -0.1570
## sqrt(tax).pratt 0.3302 _B_ 0.1233 0.5202
## sqrt(avgprs).pratt 0.9632 A__ 0.6827 1.2680
##
## Letters indicate the ranks covered by bootstrap CIs.
## (Rank bootstrap confidence intervals always obtained by percentile method)
## CAUTION: Bootstrap confidence intervals can be somewhat liberal.
##
##
## Differences between Relative Contributions:
##
## Lower Upper
## difference 0.95 0.95 0.95
## cpi-sqrt(tax).lmg -0.2231 * -0.3198 -0.1175
## cpi-sqrt(avgprs).lmg -0.2701 * -0.3119 -0.2202
## sqrt(tax)-sqrt(avgprs).lmg -0.0470 -0.1769 0.0699
##
## cpi-sqrt(tax).last 0.0188 -0.4130 0.2803
## cpi-sqrt(avgprs).last -0.3819 * -0.5837 -0.1715
## sqrt(tax)-sqrt(avgprs).last -0.4007 -0.7211 0.1365
##
## cpi-sqrt(tax).first -0.2159 * -0.3231 -0.1185
## cpi-sqrt(avgprs).first -0.2183 * -0.2688 -0.1677
## sqrt(tax)-sqrt(avgprs).first -0.0024 -0.0746 0.0731
##
## cpi-sqrt(tax).pratt -0.6237 * -0.7640 -0.4590
## cpi-sqrt(avgprs).pratt -1.2567 * -1.6718 -0.8560
## sqrt(tax)-sqrt(avgprs).pratt -0.6330 * -1.1454 -0.1536
##
## * indicates that CI for difference does not include 0.
## CAUTION: Bootstrap confidence intervals can be somewhat liberal.
plot(booteval.relimp(boot,sort=TRUE))
The avgprs variable is the most influential one in this model, so improving its fit is the most promising way to improve the model, and we can do so with a polynomial transformation.
As we found out in the earlier diagnostics, the linearity assumption does not hold for this relationship:
ggplot(cigar.data2, aes(x = packpc, y = avgprs)) + geom_point() +
geom_smooth(method = "lm", se = F) + geom_smooth(col = "red") + theme_classic()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
We can smooth the fit of the hyperplane with a polynomial regression using the poly() function. To find out how many degrees we need, we can loop over candidate degrees and plot the results afterwards.
# First, we need to define the function for RMSE performance
rmse <- function(y, h){
return(sqrt(mean((y - h) ^2)))}
performance <- data.frame()
for (d in 1:12){
poly.fit <- lm(packpc ~ cpi + poly(tax, degree = d) + poly(avgprs, degree = d), data = cigar.train)
performance <- rbind(performance, data.frame(Degree = d,
Data = "Training",
RMSE = rmse(cigar.train$packpc, predict(poly.fit))))
performance <- rbind(performance, data.frame(Degree = d,
Data = "Test",
RMSE = rmse(cigar.test$packpc, predict(poly.fit,
newdata = cigar.test))))
}
# Final plot
ggplot(performance, aes(x = Degree, y = RMSE, linetype = Data)) +
geom_point() + geom_line() + ggtitle(label = 'RMSE of training set and test set')
The best polynomial degree is one that reduces the RMSE on the training set without increasing the RMSE on the test set, since the latter would mean we are overfitting the model. A degree of 10 therefore seems a reasonable choice; in the fit below it is applied to tax, while avgprs keeps a quadratic term.
fit3 <- lm(packpc ~ cpi + poly(tax, degree = 10) + poly(avgprs, degree = 2),
data = cigar.train)
summary(fit3)
##
## Call:
## lm(formula = packpc ~ cpi + poly(tax, degree = 10) + poly(avgprs,
## degree = 2), data = cigar.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.504 -8.599 0.668 8.496 69.022
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.440 16.749 2.534 0.011652 *
## cpi 49.786 12.937 3.848 0.000138 ***
## poly(tax, degree = 10)1 -95.671 37.356 -2.561 0.010793 *
## poly(tax, degree = 10)2 50.244 25.029 2.007 0.045364 *
## poly(tax, degree = 10)3 -30.007 18.730 -1.602 0.109906
## poly(tax, degree = 10)4 39.747 17.760 2.238 0.025759 *
## poly(tax, degree = 10)5 -19.208 17.749 -1.082 0.279800
## poly(tax, degree = 10)6 3.873 17.752 0.218 0.827407
## poly(tax, degree = 10)7 -3.166 17.797 -0.178 0.858913
## poly(tax, degree = 10)8 -20.000 17.639 -1.134 0.257514
## poly(tax, degree = 10)9 41.634 18.171 2.291 0.022460 *
## poly(tax, degree = 10)10 -33.242 17.604 -1.888 0.059699 .
## poly(avgprs, degree = 2)1 -335.392 52.694 -6.365 5.26e-10 ***
## poly(avgprs, degree = 2)2 39.537 30.189 1.310 0.191055
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.58 on 408 degrees of freedom
## Multiple R-squared: 0.4366, Adjusted R-squared: 0.4186
## F-statistic: 24.32 on 13 and 408 DF, p-value: < 2.2e-16
AIC(fit2, fit3)
## df AIC
## fit2 5 3641.41
## fit3 15 3632.98
The fit has not improved much (\(R^2 \approx 0.44\)) and the number of parameters has risen from 5 to 15 (the df column in the AIC comparison). The probability of having overfitted the model is low, but we still have a rather poor model.
To validate the model on the test set we have to calculate the \(RMSE\), which can be expressed as:
\[RMSE = \sqrt{\frac{\sum_{i=1}^{n}(p_i - O_i)^2}{n}}\]
pred <- predict(fit3, newdata = cigar.test)
# RMSE of the predictions on the held-out test set (packpc is on its original scale)
rmse <- sqrt(sum((pred - cigar.test$packpc)^2)/length(cigar.test$packpc))
c(RMSE = rmse, R2 = summary(fit3)$r.squared)
## RMSE R2
## 1.322711e+64 4.365664e-01
plot(cigar.test$packpc, pred)
abline(0, 1, col = "red")
cv.lm(cigar.train, fit3, m = 10)
## Analysis of Variance Table
##
## Response: packpc
## Df Sum Sq Mean Sq F value Pr(>F)
## cpi 1 34986 34986 113.2 < 2e-16 ***
## poly(tax, degree = 10) 10 49383 4938 16.0 < 2e-16 ***
## poly(avgprs, degree = 2) 2 13353 6677 21.6 1.2e-09 ***
## Residuals 408 126121 309
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## fold 1
## Observations in test set: 42
## 345 328 330 195 245 117 150 217 432 177
## Predicted 104.24 101.7 93.9 111.0 108.9 104.36 97.1 135.55 112 117.90
## cvpred 103.98 101.4 93.7 111.0 109.0 104.78 97.5 133.93 112 118.07
## packpc 106.40 119.6 52.5 94.9 89.1 108.50 103.9 139.06 104 122.79
## CV residual 2.42 18.2 -41.2 -16.0 -19.9 3.72 6.3 5.13 -8 4.72
## 422 31 495 164 472 526 65 168 83
## Predicted 107.85 115.1 117.6 110.0 108.55 84.11 109.62 117.8 115.52
## cvpred 106.76 115.2 115.8 110.1 107.98 82.83 109.80 117.8 115.42
## packpc 111.62 142.0 172.6 121.7 114.14 92.47 114.25 90.6 112.07
## CV residual 4.87 26.7 56.9 11.6 6.16 9.64 4.45 -27.2 -3.35
## 297 99 457 126 380 264 287 101 333 76
## Predicted 107.06 115.0 115.60 115.8 96.4 112.2 106.54 120.2 83.65 126.1
## cvpred 106.50 115.0 114.16 115.8 96.2 112.2 106.73 120.3 84.23 125.9
## packpc 109.07 100.3 108.16 82.1 126.1 86.9 111.47 102.7 77.35 190.6
## CV residual 2.57 -14.6 -6.01 -33.7 29.9 -25.4 4.74 -17.6 -6.88 64.7
## 501 322 18 490 455 310 327 290 56
## Predicted 78.7 96.27 128.16 89.87 103.044 111.22 98.89 97.9 111.9
## cvpred 78.2 96.08 127.02 88.86 102.256 110.76 98.67 97.8 112.4
## packpc 82.9 87.13 120.98 92.40 103.137 119.55 91.24 116.3 120.6
## CV residual 4.7 -8.95 -6.04 3.54 0.881 8.79 -7.43 18.5 8.2
## 36 161 134 47
## Predicted 120.19 100.42 140.7 120.00
## cvpred 120.48 100.78 140.8 120.22
## packpc 117.70 109.90 127.2 112.85
## CV residual -2.77 9.12 -13.7 -7.37
##
## Sum of squares = 16385 Mean square = 390 n = 42
##
## fold 2
## Observations in test set: 43
## 194 135 93 409 316 12 230 48 496 428
## Predicted 108.1 113.0 99.32 112.3 99.5 119.4 131.58 137.66 103.54 97.9
## cvpred 108.9 113.1 99.41 113.2 99.4 119.1 132.19 137.21 103.25 98.0
## packpc 122.8 99.0 93.29 124.9 147.3 123.2 124.33 129.40 105.18 116.0
## CV residual 13.9 -14.1 -6.12 11.8 47.9 4.1 -7.86 -7.81 1.93 18.0
## 405 492 176 26 116 55 489 62 206
## Predicted 75.26 84.685 109.55 121.7 115.87 118.6 114.9 117.80 101.66
## cvpred 73.74 83.674 110.17 121.6 116.14 118.4 115.9 117.75 102.11
## packpc 79.59 83.265 106.35 105.5 122.36 140.8 97.5 110.03 97.94
## CV residual 5.85 -0.409 -3.82 -16.1 6.22 22.3 -18.5 -7.72 -4.17
## 15 420 404 504 370 311 242 259 133
## Predicted 163.0 96.98 97.6 109.8 89.90 104.206 106.68 93.5 112.33
## cvpred 160.5 96.41 96.8 109.9 89.70 105.208 107.55 92.8 111.99
## packpc 186.0 93.87 104.2 87.2 84.61 104.892 117.02 114.9 121.82
## CV residual 25.5 -2.54 7.4 -22.8 -5.09 -0.315 9.47 22.1 9.83
## 434 418 119 325 270 146 293 313 207 96
## Predicted 98.62 94.47 112.140 90.7 106.9 111.8 107.4 147 151.6 133.73
## cvpred 98.55 93.91 112.673 90.5 107.7 112.3 108.2 147 150.2 133.57
## packpc 101.63 83.96 112.514 104.1 71.6 125.5 88.9 130 176.1 127.82
## CV residual 3.08 -9.95 -0.159 13.6 -36.1 13.2 -19.3 -17 25.9 -5.74
## 243 113 2 123 234
## Predicted 107.2 106.31 123.94 112.11 100.2
## cvpred 108.0 105.98 123.64 112.24 100.9
## packpc 85.8 112.10 128.53 104.91 57.4
## CV residual -22.2 6.11 4.89 -7.32 -43.4
##
## Sum of squares = 12159 Mean square = 283 n = 43
##
## fold 3
## Observations in test set: 43
## 197 92 396 514 64 298 54 214 156 485
## Predicted 108.5 117.4 93.899 102.98 115.2 94.85 102.17 114.3 108.3 103.7
## cvpred 108.8 115.7 91.422 100.22 113.8 92.96 103.00 113.3 108.7 102.3
## packpc 91.3 131.5 92.393 108.68 123.3 96.34 106.76 125.5 110.1 82.6
## CV residual -17.5 15.8 0.971 8.46 9.5 3.37 3.76 12.2 1.4 -19.7
## 389 238 285 124 28 438 261 139 487 307
## Predicted 104.5 96.62 87.02 123 129.0 77.966 89.512 149.0 105.5 91.3
## cvpred 104.1 96.22 85.89 120 124.7 77.362 88.816 151.7 102.1 89.6
## packpc 82.6 100.60 78.74 185 198.0 77.623 89.463 129.7 124.5 116.3
## CV residual -21.4 4.38 -7.15 65 73.3 0.261 0.647 -22.1 22.4 26.7
## 132 279 296 458 513 528 289 454 77
## Predicted 117.52 104.1 87.4 85.52 105.34 113.978 104.92 112.90 110.76
## cvpred 115.94 104.9 85.8 85.95 101.99 112.636 105.00 111.88 110.35
## packpc 112.49 91.6 97.6 80.34 111.38 112.238 107.01 117.25 114.57
## CV residual -3.46 -13.3 11.8 -5.61 9.39 -0.398 2.02 5.37 4.22
## 154 267 397 205 158 412 87 320 228 69
## Predicted 100.6 99.4 109.0 112.9 104.81 100.4 116.4 82.53 111.6 107.73
## cvpred 100.3 99.7 108.7 111.8 105.69 98.3 115.4 83.05 111.0 108.38
## packpc 103.5 91.5 123.6 134.3 104.63 149.5 98.9 86.60 108.7 104.47
## CV residual 3.2 -8.2 14.8 22.5 -1.06 51.2 -16.5 3.54 -2.3 -3.92
## 292 393 167 235
## Predicted 82.2 111.33 111.59 139
## cvpred 80.7 110.23 111.08 144
## packpc 68.5 104.35 113.02 121
## CV residual -12.2 -5.88 1.94 -23
##
## Sum of squares = 18057 Mean square = 420 n = 43
##
## fold 4
## Observations in test set: 42
## 141 332 229 222 350 179 315 109 346 260
## Predicted 94.31 103 104.5 109.5 91.73 101.24 94.43 121.6 87.68 105.1
## cvpred 94.56 103 103.9 111.3 90.81 100.75 93.99 122.5 86.43 104.2
## packpc 88.97 121 120.1 76.1 88.76 108.06 91.71 136.8 93.91 114.7
## CV residual -5.59 18 16.2 -35.2 -2.06 7.31 -2.28 14.3 7.48 10.6
## 6 84 459 100 174 67 211 68 321
## Predicted 102.69 117.82 94.9 115.8 112.2 107.6 93.3 119.31 105.90
## cvpred 102.01 118.40 94.7 116.9 114.0 106.8 93.0 118.52 105.66
## packpc 109.28 114.68 84.1 92.5 79.5 118.4 120.9 126.14 112.50
## CV residual 7.27 -3.72 -10.6 -24.4 -34.6 11.6 27.9 7.62 6.84
## 24 227 500 169 379 42 299 367 187
## Predicted 125.7 97.59 81.4 139.28 120.0 116 102.7 82.6 145.6
## cvpred 126.7 97.06 -42058.2 138.04 121.6 117 102.4 80.6 146.8
## packpc 104.3 105.20 81.4 145.57 106.2 68 83.0 102.8 127.5
## CV residual -22.4 8.14 42139.6 7.54 -15.4 -49 -19.5 22.2 -19.4
## 41 129 44 343 344 53 232 466 273
## Predicted 121.12 120.66 125.2 91.2 85.5 125.3 106.5 100.87 110.64
## cvpred 120.97 122.20 125.9 90.3 83.8 126.5 108.4 100.56 111.39
## packpc 115.10 125.45 145.3 116.9 93.9 109.2 127.1 101.97 115.85
## CV residual -5.86 3.25 19.4 26.7 10.1 -17.3 18.7 1.41 4.45
## 50 237 276 114 246
## Predicted 119.92 88.96 108.92 121.95 86.24
## cvpred 119.10 88.92 109.67 123.35 85.36
## packpc 127.66 83.97 102.54 115.88 90.82
## CV residual 8.56 -4.96 -7.13 -7.47 5.46
##
## Sum of squares = 1.78e+09 Mean square = 42279978 n = 42
##
## fold 5
## Observations in test set: 42
## 302 355 7 170 241 401 191 120 402
## Predicted 95.95 83.3 121.7 101.57 110.819 84.452 110.22 120.7 88.39
## cvpred 96.29 83.7 121.7 101.86 110.981 84.934 109.96 120.7 89.20
## packpc 91.62 115.0 143.9 92.37 111.745 84.547 115.41 95.3 80.97
## CV residual -4.68 31.3 22.2 -9.49 0.764 -0.388 5.45 -25.4 -8.23
## 384 483 522 262 410 385 468 387 499
## Predicted 105.89 83.8 99.8 114.01 96.3 104.945 101.72 97.0 86.5
## cvpred 106.10 84.7 101.9 114.47 98.0 105.473 103.65 97.3 87.3
## packpc 107.45 72.0 49.3 120.96 82.9 104.958 94.78 75.2 102.5
## CV residual 1.35 -12.8 -52.6 6.49 -15.1 -0.515 -8.87 -22.0 15.2
## 429 498 37 59 182 203 14 115 450 400
## Predicted 82.28 91.8 122.4 123.3 137.0 107.2 125.98 103 90.6 97.9
## cvpred 82.73 92.9 122.5 123.8 138.7 107.0 126.01 103 91.5 98.6
## packpc 74.60 77.5 132.8 98.3 126.0 80.9 116.68 124 79.7 104.3
## CV residual -8.13 -15.4 10.3 -25.5 -12.7 -26.1 -9.33 21 -11.8 5.7
## 336 362 112 247 165 369 509 408 224
## Predicted 111.55 91.9 112.90 111 93.480 101.24 82.43 103.1 100.983
## cvpred 112.01 93.1 112.62 111 94.215 101.39 83.11 103.9 100.780
## packpc 106.83 74.4 121.57 124 94.042 110.86 80.37 81.6 100.323
## CV residual -5.19 -18.7 8.95 13 -0.173 9.47 -2.74 -22.3 -0.458
## 171 427 358 433 21
## Predicted 101.87 110.40 109 108.85 117.27
## cvpred 102.18 111.55 109 109.97 116.94
## packpc 95.55 102.12 121 103.43 112.90
## CV residual -6.63 -9.43 12 -6.54 -4.04
##
## Sum of squares = 10493 Mean square = 250 n = 42
##
## fold 6
## Observations in test set: 42
## 106 494 107 391 381 352 411 209 34
## Predicted 106.811 101.2 113.7 100.2 81.40 91.9 97.03 94.62 125.201
## cvpred 106.725 101.1 111.9 99.7 82.97 92.1 97.07 94.11 126.814
## packpc 107.553 88.8 99.3 119.1 77.80 106.9 89.20 102.49 127.139
## CV residual 0.828 -12.3 -12.6 19.4 -5.17 14.8 -7.88 8.39 0.326
## 312 145 337 281 151 91 295 43 449 366
## Predicted 107.7 118.05 99.60 96.0 115 151.0 102.3 153.7 77.754 99.6
## cvpred 108.0 118.55 99.99 95.5 114 154.9 102.1 157.9 76.722 100.1
## packpc 81.7 115.26 106.90 85.0 136 133.3 117.4 134.0 77.230 69.3
## CV residual -26.3 -3.29 6.91 -10.5 22 -21.6 15.3 -23.9 0.507 -30.8
## 103 361 317 248 81 465 78 89 74 476
## Predicted 116.0 113 85.900 94.9 120.72 105.32 117.5 116.70 119.3 101
## cvpred 115.6 110 85.651 94.7 120.76 105.06 116.3 117.74 120.4 101
## packpc 137.5 126 86.533 105.7 126.59 106.67 87.3 111.45 102.1 119
## CV residual 21.9 16 0.882 11.0 5.84 1.62 -29.0 -6.29 -18.3 18
## 441 22 58 399 196 184 519 86 208
## Predicted 114.8 123.07 113.60 118.9 99.5 110.9 103.92 142.6 107.03
## cvpred 115.0 122.89 114.14 116.1 99.4 109.6 103.60 142.5 106.82
## packpc 102.7 130.37 107.77 163.0 80.5 127.2 97.22 126.7 108.84
## CV residual -12.3 7.48 -6.37 46.9 -18.9 17.7 -6.38 -15.8 2.02
## 448 20 219 178
## Predicted 103.80 121.65 99.65 107.19
## cvpred 103.22 123.06 99.12 107.29
## packpc 103.61 128.00 95.87 109.34
## CV residual 0.39 4.95 -3.24 2.06
##
## Sum of squares = 10513 Mean square = 250 n = 42
##
## fold 7
## Observations in test set: 42
## 192 435 516 202 63 339 251 453 71
## Predicted 127.96 100.9 101.03 97.050 159.9 94.6 107.8 78.60 115.286
## cvpred 127.08 101.1 100.32 97.140 158.9 94.6 106.8 78.63 114.637
## packpc 123.01 72.9 95.64 97.005 181.8 76.4 91.3 81.36 114.032
## CV residual -4.07 -28.2 -4.68 -0.136 22.9 -18.2 -15.5 2.73 -0.605
## 517 478 220 40 520 486 301 360 95 417
## Predicted 78.5 84.93 116 121.31 109.2 78.16 109.0 100.0 117.3 102.18
## cvpred 71.8 84.03 114 120.37 108.6 78.41 107.8 99.9 116.5 102.39
## packpc 92.6 91.61 170 129.83 122.3 79.47 126.4 83.9 115.2 106.84
## CV residual 20.8 7.58 56 9.47 13.8 1.06 18.5 -16.0 -1.3 4.46
## 407 131 118 373 29 502 304 456 497
## Predicted 98.03 106.90 117.1 81.2 114.2 110.0 98.27 108.9 77.497
## cvpred 98.16 107.55 115.9 81.1 114.9 109.9 98.09 108.9 77.615
## packpc 103.01 109.22 129.7 100.9 116.5 122.5 106.34 82.6 76.621
## CV residual 4.86 1.67 13.7 19.8 1.6 12.5 8.25 -26.3 -0.994
## 16 424 488 447 253 140 210 445 35 27
## Predicted 119.69 104.7 93.527 117.6 111.7 113.4 113.85 111 128.1 123.1
## cvpred 118.82 103.8 92.769 116.0 110.3 112.5 112.47 111 127.9 122.6
## packpc 127.56 116.5 93.075 164.8 129.4 127.8 107.59 125 119.5 107.4
## CV residual 8.73 12.7 0.306 48.8 19.1 15.3 -4.88 14 -8.4 -15.2
## 142 300 446 200
## Predicted 107.249 91.57 100.8 98.1
## cvpred 107.780 91.87 100.8 98.4
## packpc 106.842 95.34 87.0 112.6
## CV residual -0.938 3.47 -13.8 14.2
##
## Sum of squares = 12146 Mean square = 289 n = 42
##
## fold 8
## Observations in test set: 42
## 477 33 256 507 506 138 159 503 491 527
## Predicted 68.134 123.12 106.787 95.62 86.96 114.6 156.0 103.67 100.6 106
## cvpred 67.826 124.47 106.725 95.46 87.57 115.9 155.2 102.75 100.3 105
## packpc 68.193 127.60 107.438 87.27 79.81 68.5 172.5 105.58 74.8 116
## CV residual 0.367 3.13 0.713 -8.19 -7.76 -47.4 17.3 2.83 -25.4 11
## 236 233 186 258 72 32 524 166 38 30
## Predicted 107.8 98.48 105.6 111.1 123.3 118.34 100.1 116 146.1 120.5
## cvpred 107.7 98.61 105.6 112.2 124.0 118.49 99.3 118 148.1 122.5
## packpc 120.4 95.08 55.7 100.0 99.4 116.66 122.3 130 127.2 88.7
## CV residual 12.7 -3.53 -49.9 -12.2 -24.6 -1.83 23.0 12 -20.9 -33.8
## 252 85 17 263 98 463 482 240 371
## Predicted 96.402 118.5 112.34 107.95 116.88 84.3 100.9 126.88 90.79
## cvpred 96.761 118.8 113.11 107.46 116.89 83.8 100.8 128.83 90.53
## packpc 95.999 135.4 115.68 105.38 125.31 94.0 111.0 119.21 97.78
## CV residual -0.762 16.6 2.57 -2.08 8.42 10.2 10.3 -9.61 7.26
## 309 204 347 70 288 505 284 365 268 334
## Predicted 83.91 107.502 97.5 119.18 116.2 115.20 106.0 76.1 107.3 92.29
## cvpred 83.72 107.089 96.1 120.85 117.6 115.57 105.6 75.5 106.9 92.34
## packpc 85.87 106.976 80.5 129.10 104.8 121.54 122.6 86.0 154.4 94.72
## CV residual 2.16 -0.112 -15.7 8.25 -12.9 5.97 16.9 10.5 47.6 2.37
## 376 143 274
## Predicted 98.1 113.4794 102.86
## cvpred 97.0 113.8194 102.32
## packpc 118.4 113.8321 93.20
## CV residual 21.4 0.0127 -9.12
##
## Sum of squares = 13561 Mean square = 323 n = 42
##
## fold 9
## Observations in test set: 42
## 475 108 257 382 303 152 319 105 349 306
## Predicted 106.88 110.97 94.02 89.05 145.2 101.4 84.0 120.07 105.8 109.1
## cvpred 107.04 110.26 93.88 88.58 140.5 100.8 83.5 120.53 105.9 109.1
## packpc 103.15 111.94 97.55 94.49 169.1 118.1 111.7 124.79 126.2 97.3
## CV residual -3.88 1.67 3.66 5.91 28.7 17.3 28.3 4.26 20.3 -11.7
## 149 180 183 421 305 254 265 188 122
## Predicted 115.3 115.32 107.1 87.84 90.66 102.96 132.9 111.1 118.1
## cvpred 115.1 114.94 106.7 87.85 90.81 102.96 133.1 110.4 117.9
## packpc 97.0 108.71 95.0 95.10 94.82 95.46 131.6 128.7 97.7
## CV residual -18.1 -6.23 -11.7 7.25 4.01 -7.51 -1.5 18.4 -20.1
## 226 212 383 481 104 111 125 60 136
## Predicted 103.277 104.5 100.71 109.9 104.6 158.7 107.9 117.62 114.5
## cvpred 102.878 104.4 100.94 111.3 104.0 156.0 107.7 117.35 114.4
## packpc 102.635 117.6 96.60 101.1 119.2 176.0 110.8 119.58 130.8
## CV residual -0.243 13.2 -4.34 -10.2 15.2 19.9 3.1 2.23 16.5
## 277 49 244 462 464 75 175 80 137 431
## Predicted 97.42 121.3 86.3 100.7 78.00 118.5 103 115.83 113.10 103.91
## cvpred 95.81 121.5 85.1 101.4 79.81 118.4 102 115.68 112.66 104.62
## packpc 100.63 117.2 74.2 64.4 71.65 105.2 136 113.56 104.31 111.99
## CV residual 4.83 -4.3 -10.9 -37.0 -8.16 -13.2 34 -2.12 -8.35 7.38
## 388 163 282 160
## Predicted 82.8 97.8 95.7 110.92
## cvpred 83.4 97.0 95.5 110.26
## packpc 61.8 123.8 53.0 116.81
## CV residual -21.6 26.7 -42.4 6.55
##
## Sum of squares = 11219 Mean square = 267 n = 42
##
## fold 10
## Observations in test set: 42
## 470 198 368 269 155 354 39 90 436 515
## Predicted 116.61 89.94 78.45 96.72 109.8 95.49 126 114.9 82.3 88.14
## cvpred 118.21 90.35 79.13 96.93 109.5 95.63 129 115.4 83.0 88.20
## packpc 109.07 100.03 82.35 98.48 88.6 90.61 107 64.8 57.9 92.16
## CV residual -9.14 9.68 3.22 1.55 -20.8 -5.03 -22 -50.7 -25.2 3.96
## 79 329 364 474 406 512 88 97 51 523
## Predicted 114.0 82.79 94.7 100.8 115 79.2 118.55 119.75 114.8 108.17
## cvpred 114.6 82.34 96.0 103.2 115 83.4 119.76 120.74 115.5 108.56
## packpc 135.2 78.66 145.6 51.1 119 70.8 129.33 115.84 103.3 105.39
## CV residual 20.6 -3.68 49.6 -52.1 4 -12.6 9.57 -4.91 -12.2 -3.17
## 162 351 121 130 291 46 353 213 335 415
## Predicted 117.79 126 143.18 116.21 102.5 111.57 89.3088 91.2 105.12 83.9
## cvpred 118.92 124 144.45 116.39 102.4 111.73 91.1633 91.2 105.11 85.3
## packpc 112.66 164 148.10 118.25 80.7 107.88 91.1992 92.1 110.16 97.4
## CV residual -6.26 41 3.65 1.86 -21.7 -3.85 0.0359 0.9 5.06 12.1
## 340 510 378 266 19 419 392 467 249 443
## Predicted 80.0 99.8 90.6 95.3 120.98 95.76 90.28 93.45 111.08 105.9
## cvpred 82.5 102.0 92.5 96.3 121.52 98.04 91.33 93.83 111.47 107.0
## packpc 66.4 64.7 52.5 83.0 128.12 93.10 93.28 91.09 112.64 83.8
## CV residual -16.1 -37.3 -39.9 -13.3 6.59 -4.94 1.95 -2.75 1.17 -23.1
## 471 308
## Predicted 103.1 97.3
## cvpred 105.3 97.8
## packpc 92.0 109.5
## CV residual -13.4 11.7
##
## Sum of squares = 17144 Mean square = 408 n = 42
##
## Overall (Sum over all 42 folds)
## ms
## 4208248
# Fitting and prediction functions required by bootstrap::crossval()
theta.fit <- function(x, y) { lsfit(x, y) }                     # least-squares fit
theta.predict <- function(fit, x) { cbind(1, x) %*% fit$coef }  # add intercept column and predict
# Definition of the model matrices
fit4 <- lm(packpc ~ cpi + poly(tax, degree = 10) + poly(avgprs, degree = 2), data = cigar.data2)
X <- as.matrix(cigar.data2[c("cpi","tax","avgprs")])
y <- as.matrix(cigar.data2[c("packpc")])
results <- crossval(X, y, theta.fit, theta.predict, ngroup = 10)
cor(y, fit4$fitted.values)**2
## [,1]
## packpc 0.44
cor(y, results$cv.fit)**2
## [,1]
## packpc 0.379
Here we have the \(R^2\) of the fitted model and the \(R^2\) of the cross-validated model.
We have built a model that explains about 40% of the variance in the training data and predicts about 38% of the variance according to the cross-validation test. That is a rather poor model, so we should consider a more flexible approach such as a GLM, a GAMM or a GLMM. The problems we have observed in our model are outliers, a non-linear pattern in the residuals and multicollinearity. Given the multicollinearity in particular, the most natural alternative to plain linear regression here is a principal component (PCA-based) regression, sketched below.
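As an illustration of that suggested next step, here is a minimal sketch of a principal-component regression. It is not part of the original analysis, and the choice of three components is arbitrary and would need to be justified with the explained-variance summary:
# Sketch of a principal-component regression (not part of the original analysis)
pca <- prcomp(cigar.train[, c("cpi", "pop", "income", "tax", "avgprs", "taxs")],
              center = TRUE, scale. = TRUE)
summary(pca)  # proportion of variance captured by each component
pcr.data <- data.frame(packpc = cigar.train$packpc, pca$x[, 1:3])
pcr.fit <- lm(packpc ~ PC1 + PC2 + PC3, data = pcr.data)
summary(pcr.fit)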