1. Exploratory analysis
  2. Data split
  3. Model creation
  4. Diagnostics
    1. Model assumptions
    2. Collinearity
    3. Residuals
    4. Outliers
  5. Refitting the model
    1. Stepwise algorithm
    2. AIC & BIC values
    3. Relative importance of each variable
    4. Polynomial transformation
  6. Cross-validation
    1. Hold-out method
    2. Cross-validation
  7. Conclusion

Today I want to introduce you to my personal way of working with linear regression models. This is perhaps the most important regression model to know, so I would like to run a quick analysis on a randomly chosen dataset.

This time we will fit a linear regression model to the Cigarette dataset, which contains data on smoking habits and tax changes in the USA.

We want to predict the number of packs per capita given the other variables. In other words, we need to create a regression model.
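Throughout the post I rely on a handful of packages. The setup sketch below is my own reconstruction based on the functions used later (I assume the Cigarette data comes from the Ecdat package); swap in whatever equivalents you prefer.

library(Ecdat)                 # Cigarette dataset
library(PerformanceAnalytics)  # chart.Correlation()
library(corrplot)              # corrplot()
library(dplyr)                 # sample_frac(), anti_join()
library(ggplot2)
library(ggfortify)             # autoplot() method for lm objects
library(gridExtra)             # grid.arrange()
library(MASS)                  # stepAIC()
library(regclass)              # VIF() (fmsb::VIF or car::vif work as well)
library(olsrr)                 # ols_plot_cooksd_bar()
library(relaimpo)              # calc.relimp(), boot.relimp(), booteval.relimp()
library(DAAG)                  # cv.lm()
library(bootstrap)             # crossval()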

1. Exploratory analysis

cigar.data <- Cigarette
str(cigar.data)
## 'data.frame':    528 obs. of  9 variables:
##  $ state : Factor w/ 48 levels "AL","AR","AZ",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ year  : int  1985 1985 1985 1985 1985 1985 1985 1985 1985 1985 ...
##  $ cpi   : num  1.08 1.08 1.08 1.08 1.08 ...
##  $ pop   : int  3973000 2327000 3184000 26444000 3209000 3201000 618000 11352000 5963000 2830000 ...
##  $ packpc: num  116 129 105 100 113 ...
##  $ income: int  46014968 26210736 43956936 447102816 49466672 60063368 9927301 166919248 78364336 37902896 ...
##  $ tax   : num  32.5 37 31 26 31 ...
##  $ avgprs: num  102.2 101.5 108.6 107.8 94.3 ...
##  $ taxs  : num  33.3 37 36.2 32.1 31 ...
chart.Correlation(cigar.data[,3:9], histogram = TRUE, pch = 19)

sapply(cigar.data[,3:9], function(x) sum(is.na(x)))
##    cpi    pop packpc income    tax avgprs   taxs 
##      0      0      0      0      0      0      0

There are a bunch of significant relationships, which could mean we will have too much collinearity. Fortunately, we don’t have any missing values or NAs.

cor.mtest <- function(mat, ...) {
  # Returns a symmetric matrix of pairwise correlation-test p-values for the columns of mat
  mat <- as.matrix(mat)
  n <- ncol(mat)
  p.mat <- matrix(NA, n, n)
  diag(p.mat) <- 0
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      tmp <- cor.test(mat[, i], mat[, j], ...)
      p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
    }
  }
  colnames(p.mat) <- rownames(p.mat) <- colnames(mat)
  p.mat
}

p.mat <- cor.mtest(cigar.data[,3:9])

cigar.cor <- cor(cigar.data[,3:9])
corrplot(cigar.cor, method = "number", type = "upper",
         tl.cex = 0.9, number.cex = 0.6,  order="hclust",  diag = FALSE,
         addCoef.col = "black", tl.col = "black", tl.srt = 45, 
         # Significance: blank out non-significant correlations
         p.mat = p.mat, sig.level = 0.05, insig = "blank")

There is no need to clean up the dataset. We have not observed any NA values, and although some distributions are bimodal or skewed, there are enough statistically significant relationships to build a regression model.

2. Data split

Next, we need to split the data into two different sets so we can cross-validate the model once it is created. We will build the training set with 80% of the data.

set.seed(1)
cigar.data2 <- cigar.data[,3:9]

cigar.train <- sample_frac(tbl = cigar.data2, replace = FALSE, size = 0.80)
cigar.test <- anti_join(cigar.data2, cigar.train)
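Note that anti_join() matches on every column here, so any exactly duplicated rows would be excluded from the test set. An equivalent index-based split (a sketch with the same 80/20 proportion; the rest of the post keeps the dplyr split above) avoids that edge case:

set.seed(1)
idx <- sample(seq_len(nrow(cigar.data2)), size = floor(0.8 * nrow(cigar.data2)))
cigar.train <- cigar.data2[idx, ]
cigar.test  <- cigar.data2[-idx, ]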

3. Model creation

The formula for the multiple linear regression model we will fit is the following:

\[ y_n = \beta_0 + \beta_1x_{n1} + \dots + \beta_kx_{nk} + \epsilon_n \]

fit <- lm(packpc ~ ., data = cigar.train)

4. Diagnostics

Linear regression is a parametric model, so there are some assumptions we need to check to make sure we are building a proper model.

4.1. Model assumptions

autoplot(fit)

summary(fit)
## 
## Call:
## lm(formula = packpc ~ ., data = cigar.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -55.280  -8.616   0.330   8.161  72.550 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.365e+02  1.060e+01  12.881  < 2e-16 ***
## cpi          4.097e+01  1.308e+01   3.133  0.00185 ** 
## pop          4.306e-07  1.029e-06   0.419  0.67573    
## income      -2.647e-08  4.889e-08  -0.541  0.58856    
## tax         -3.282e-01  3.159e-01  -1.039  0.29942    
## avgprs      -4.618e-01  9.370e-02  -4.928  1.2e-06 ***
## taxs        -6.278e-03  3.026e-01  -0.021  0.98346    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.25 on 415 degrees of freedom
## Multiple R-squared:  0.3822, Adjusted R-squared:  0.3733 
## F-statistic:  42.8 on 6 and 415 DF,  p-value: < 2.2e-16
confint(fit)
##                     2.5 %        97.5 %
## (Intercept)  1.156565e+02  1.573112e+02
## cpi          1.526284e+01  6.667506e+01
## pop         -1.591629e-06  2.452893e-06
## income      -1.225758e-07  6.964006e-08
## tax         -9.490425e-01  2.927122e-01
## avgprs      -6.459474e-01 -2.775675e-01
## taxs        -6.011955e-01  5.886399e-01

In the plots we observe that:

1. Not all the variables are significant.
2. The Residuals vs. Fitted plot shows a pattern in the distribution of the residuals.
3. Some values are too influential in the model (Normal Q-Q plot and Cook’s distance plot).
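Beyond the visual check, a couple of formal tests can back up these impressions. This is a small sketch of my own, assuming the lmtest package is available for the Breusch-Pagan test:

shapiro.test(residuals(fit))  # normality of the residuals
lmtest::bptest(fit)           # Breusch-Pagan test for non-constant variance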

4.2. Collinearity

We have seen in the corrplot that there are many high correlations. This can be a problem: if the columns of the design matrix are strongly correlated, \(X^TX\) becomes nearly singular and cannot be inverted reliably, so the coefficient estimates become unstable.
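A quick numeric symptom of this (my own addition, not part of the original analysis) is the condition number of the design matrix; very large values signal a nearly singular \(X^TX\):

kappa(model.matrix(fit), exact = TRUE)  # condition number of the design matrix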

VIF(fit)
##       cpi       pop    income       tax    avgprs      taxs 
##  5.027321 35.359664 37.152638 18.056987 11.711699 24.032766

As expected, we have a problem with the collinearity assumption. The variables pop, income, tax and taxs have VIF values that are far too high (greater than ~10), so at this point we would normally stop and consider another approach that handles multicollinearity, such as a PCA-based model (principal component regression). We will continue with this model as an example, though.
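As a sanity check on what VIF actually measures, here is a small sketch (my illustration, not part of the original analysis): regress one predictor on the remaining ones and apply \(1 / (1 - R^2)\); the result should roughly match the value reported above for pop.

pop.aux <- lm(pop ~ cpi + income + tax + avgprs + taxs, data = cigar.train)
1 / (1 - summary(pop.aux)$r.squared)  # VIF of pop computed by hand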

4.3. Residuals

Another problem with our model is the pattern observed in the distribution of the residuals. We can plot each explanatory variable against the residuals of the model to see it in detail.

p1 <- ggplot(cigar.train, aes(cigar.train[,1], residuals(fit))) +
    geom_point() + geom_smooth(color = "blue")
p2 <- ggplot(cigar.train, aes(cigar.train[,2], residuals(fit))) +
  geom_point() + geom_smooth(color = "blue")
p3 <- ggplot(cigar.train, aes(cigar.train[,3], residuals(fit))) +
  geom_point() + geom_smooth(color = "blue")
p4 <- ggplot(cigar.train, aes(cigar.train[,4], residuals(fit))) +
  geom_point() + geom_smooth(color = "blue")
p5 <- ggplot(cigar.train, aes(cigar.train[,5], residuals(fit))) +
  geom_point() + geom_smooth(color = "blue")
p6 <- ggplot(cigar.train, aes(cigar.train[,6], residuals(fit))) +
  geom_point() + geom_smooth(color = "blue")
p7 <- ggplot(cigar.train, aes(cigar.train[,7], residuals(fit))) +
  geom_point() + geom_smooth(color = "blue")

grid.arrange(p1, p2, p3, p4, p5, p6, p7)

The model is clearly underfitted for two reasons: first, the linearity assumption does not hold in most cases, and second, the model is too simple (it spends very few fitted terms on each predictor) to capture the curvature. Given that, we should apply a polynomial transformation to columns 5, 6 and 7 (tax, avgprs and taxs), since they are non-linear and can be fitted better.
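A more compact way to get one residual panel per predictor (a sketch of my own, assuming the car package is installed) is the component-plus-residual plot:

car::crPlots(fit)  # component + residual (partial residual) plots, one panel per predictor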

4.4. Outliers

ols_plot_cooksd_bar(fit)

Another major problem with our model is the outliers. However, it seems a bad idea to remove them since they are real data. Removing extreme values improves the apparent performance of the model, but it ruins its forecasting ability in the long run, especially for new extreme values.
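If you want to inspect the flagged observations rather than delete them, a quick way (my addition; the 4/n cutoff is just a common convention) is:

cd <- cooks.distance(fit)
influential <- which(cd > 4 / nrow(cigar.train))
cigar.train[influential, ]  # look at the flagged rows instead of removing them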

5. Refitting the model

Let’s check whether we need all the variables. It would be a good idea to remove some explanatory variables, since they are highly correlated.

5.1. Stepwise algorithm

summary(step(fit, trace = 0))
## 
## Call:
## lm(formula = packpc ~ cpi + tax + avgprs, data = cigar.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -55.272  -8.592   0.061   8.354  72.213 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 137.86768    9.29553  14.832  < 2e-16 ***
## cpi          41.68451   12.21447   3.413 0.000706 ***
## tax          -0.34300    0.13399  -2.560 0.010823 *  
## avgprs       -0.47770    0.07908  -6.040  3.4e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.2 on 418 degrees of freedom
## Multiple R-squared:  0.3812, Adjusted R-squared:  0.3767 
## F-statistic: 85.82 on 3 and 418 DF,  p-value: < 2.2e-16

Just as we suspected, with only 3 variables we get a slightly better fit than with the previous model. We will continue the analysis with these three variables, and also try a second model in which tax and avgprs are square-root transformed to soften the curvature seen in the residual plots.

fit2 <- lm(packpc ~ cpi + sqrt(tax) + sqrt(avgprs), data = cigar.train)

5.2. AIC & BIC values

summary(stepAIC(fit, trace = 0))
## 
## Call:
## lm(formula = packpc ~ cpi + tax + avgprs, data = cigar.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -55.272  -8.592   0.061   8.354  72.213 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 137.86768    9.29553  14.832  < 2e-16 ***
## cpi          41.68451   12.21447   3.413 0.000706 ***
## tax          -0.34300    0.13399  -2.560 0.010823 *  
## avgprs       -0.47770    0.07908  -6.040  3.4e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.2 on 418 degrees of freedom
## Multiple R-squared:  0.3812, Adjusted R-squared:  0.3767 
## F-statistic: 85.82 on 3 and 418 DF,  p-value: < 2.2e-16
summary(stepAIC(fit2, trace = 0))
## 
## Call:
## lm(formula = packpc ~ cpi + sqrt(tax) + sqrt(avgprs), data = cigar.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -56.041  -8.881   0.395   8.177  70.956 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   223.235      8.098  27.568  < 2e-16 ***
## cpi            44.585     12.373   3.604 0.000352 ***
## sqrt(tax)      -5.698      1.656  -3.440 0.000640 ***
## sqrt(avgprs)  -11.403      1.892  -6.028 3.64e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.97 on 418 degrees of freedom
## Multiple R-squared:  0.3973, Adjusted R-squared:  0.393 
## F-statistic: 91.85 on 3 and 418 DF,  p-value: < 2.2e-16
anova(fit, fit2)
## Analysis of Variance Table
## 
## Model 1: packpc ~ cpi + pop + income + tax + avgprs + taxs
## Model 2: packpc ~ cpi + sqrt(tax) + sqrt(avgprs)
##   Res.Df    RSS Df Sum of Sq F Pr(>F)
## 1    415 138281                      
## 2    418 134910 -3    3370.5
AIC(fit, fit2)
##      df      AIC
## fit   8 3657.823
## fit2  5 3641.410

Through the \(AIC\) comparison we obtain the same three-variable model. The second model is even better than the previous one, with fewer variables and fewer estimated parameters. We will stick with this model.
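For the \(BIC\) side of the comparison, the base BIC() function works just like AIC(), and stepAIC() can be switched to the BIC penalty by setting k = log(n). This is only a sketch; the output is not reproduced here.

BIC(fit, fit2)
# Stepwise selection with the BIC penalty instead of the default AIC penalty (k = 2)
summary(stepAIC(fit, trace = 0, k = log(nrow(cigar.train))))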

5.3. Relative importance of each variable

calc.relimp(fit2, type = c("lmg", "last", "first", "pratt", "betasq"), rela = T)
## Response variable: packpc 
## Total response variance: 531.6946 
## Analysis based on 422 observations 
## 
## 3 Regressors: 
## cpi sqrt(tax) sqrt(avgprs) 
## Proportion of variance explained by model: 39.73%
## Metrics are normalized to sum to 100% (rela=TRUE). 
## 
## Relative importance metrics: 
## 
##                    lmg      last     first     betasq      pratt
## cpi          0.1689195 0.2123143 0.1885808 0.15191516 -0.2934772
## sqrt(tax)    0.3920353 0.1934852 0.4044931 0.08968051  0.3302400
## sqrt(avgprs) 0.4390453 0.5942006 0.4069261 0.75840433  0.9632372
## 
## Average coefficients for different model sizes: 
## 
##                     1X       2Xs        3Xs
## cpi          -59.76424  22.06888  44.584804
## sqrt(tax)    -14.55899 -10.54206  -5.697947
## sqrt(avgprs) -10.04946 -10.93036 -11.403263

Now we can plot the Bootstrap Measures of Relative Importance over 1000 samples.

boot <- boot.relimp(fit2, b = 1000, type = c("lmg", "last", "first", "pratt"), 
                    rank = TRUE, diff = TRUE, rela = TRUE)
booteval.relimp(boot) 
## Response variable: packpc 
## Total response variance: 531.6946 
## Analysis based on 422 observations 
## 
## 3 Regressors: 
## cpi sqrt(tax) sqrt(avgprs) 
## Proportion of variance explained by model: 39.73%
## Metrics are normalized to sum to 100% (rela=TRUE). 
## 
## Relative importance metrics: 
## 
##                    lmg      last     first      pratt
## cpi          0.1689195 0.2123143 0.1885808 -0.2934772
## sqrt(tax)    0.3920353 0.1934852 0.4044931  0.3302400
## sqrt(avgprs) 0.4390453 0.5942006 0.4069261  0.9632372
## 
## Average coefficients for different model sizes: 
## 
##                     1X       2Xs        3Xs
## cpi          -59.76424  22.06888  44.584804
## sqrt(tax)    -14.55899 -10.54206  -5.697947
## sqrt(avgprs) -10.04946 -10.93036 -11.403263
## 
##  
##  Confidence interval information ( 1000 bootstrap replicates, bty= perc ): 
## Relative Contributions with confidence intervals: 
##  
##                                   Lower  Upper
##                    percentage 0.95 0.95    0.95   
## cpi.lmg             0.1689    __C   0.1422  0.2054
## sqrt(tax).lmg       0.3920    AB_   0.3133  0.4633
## sqrt(avgprs).lmg    0.4390    AB_   0.3872  0.4935
##                                                   
## cpi.last            0.2123    _BC   0.0717  0.3310
## sqrt(tax).last      0.1935    ABC   0.0198  0.5104
## sqrt(avgprs).last   0.5942    AB_   0.3608  0.7552
##                                                   
## cpi.first           0.1886    __C   0.1385  0.2338
## sqrt(tax).first     0.4045    AB_   0.3524  0.4619
## sqrt(avgprs).first  0.4069    AB_   0.3823  0.4354
##                                                   
## cpi.pratt          -0.2935    __C  -0.4110 -0.1570
## sqrt(tax).pratt     0.3302    _B_   0.1233  0.5202
## sqrt(avgprs).pratt  0.9632    A__   0.6827  1.2680
## 
## Letters indicate the ranks covered by bootstrap CIs. 
## (Rank bootstrap confidence intervals always obtained by percentile method) 
## CAUTION: Bootstrap confidence intervals can be somewhat liberal. 
## 
##  
##  Differences between Relative Contributions: 
##  
##                                              Lower   Upper
##                              difference 0.95 0.95    0.95   
## cpi-sqrt(tax).lmg            -0.2231     *   -0.3198 -0.1175
## cpi-sqrt(avgprs).lmg         -0.2701     *   -0.3119 -0.2202
## sqrt(tax)-sqrt(avgprs).lmg   -0.0470         -0.1769  0.0699
##                                                             
## cpi-sqrt(tax).last            0.0188         -0.4130  0.2803
## cpi-sqrt(avgprs).last        -0.3819     *   -0.5837 -0.1715
## sqrt(tax)-sqrt(avgprs).last  -0.4007         -0.7211  0.1365
##                                                             
## cpi-sqrt(tax).first          -0.2159     *   -0.3231 -0.1185
## cpi-sqrt(avgprs).first       -0.2183     *   -0.2688 -0.1677
## sqrt(tax)-sqrt(avgprs).first -0.0024         -0.0746  0.0731
##                                                             
## cpi-sqrt(tax).pratt          -0.6237     *   -0.7640 -0.4590
## cpi-sqrt(avgprs).pratt       -1.2567     *   -1.6718 -0.8560
## sqrt(tax)-sqrt(avgprs).pratt -0.6330     *   -1.1454 -0.1536
## 
## * indicates that CI for difference does not include 0. 
## CAUTION: Bootstrap confidence intervals can be somewhat liberal.
plot(booteval.relimp(boot,sort=TRUE)) 

The avgprs variable is the most influential one in this model, and its fit can still be improved, so we will refine the model with a polynomial transformation.

5.4. Polynomial transformation

As we found out in the previous analysis, the linearity assumption does not hold for this model:

ggplot(cigar.data2, aes(x = packpc, y = avgprs)) + geom_point() +
  geom_smooth(method = "lm", se = F) + geom_smooth(col = "red") + theme_classic()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

We can smooth the fit of the hyperplane with polynomial regression using the poly() function. To find out how many degrees we need, we can try increasing degrees in a for loop and plot the resulting errors.

# First, we need to define a function for the RMSE performance measure
rmse <- function(y, h) {
  sqrt(mean((y - h)^2))
}

performance <- data.frame()

for (d in 1:12) {
  poly.fit <- lm(packpc ~ cpi + poly(tax, degree = d) + poly(avgprs, degree = d),
                 data = cigar.train)
  performance <- rbind(performance,
                       data.frame(Degree = d,
                                  Data = "Training",
                                  RMSE = rmse(cigar.train$packpc, predict(poly.fit))))
  performance <- rbind(performance,
                       data.frame(Degree = d,
                                  Data = "Test",
                                  RMSE = rmse(cigar.test$packpc,
                                              predict(poly.fit, newdata = cigar.test))))
}
# Final plot
ggplot(performance, aes(x = Degree, y = RMSE, linetype = Data)) + 
  geom_point() + geom_line() + ggtitle(label = 'RMSE of training set and test set')

The best polynomial degree is one that reduces the RMSE on the training set without increasing the RMSE on the test set, since the latter would mean we are overfitting the model. Ten degrees seems to work well for this model.

fit3 <- lm(packpc ~ cpi + poly(tax, degree = 10) + poly(avgprs, degree = 2),
           data = cigar.train)
summary(fit3)
## 
## Call:
## lm(formula = packpc ~ cpi + poly(tax, degree = 10) + poly(avgprs, 
##     degree = 2), data = cigar.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.504  -8.599   0.668   8.496  69.022 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 42.440     16.749   2.534 0.011652 *  
## cpi                         49.786     12.937   3.848 0.000138 ***
## poly(tax, degree = 10)1    -95.671     37.356  -2.561 0.010793 *  
## poly(tax, degree = 10)2     50.244     25.029   2.007 0.045364 *  
## poly(tax, degree = 10)3    -30.007     18.730  -1.602 0.109906    
## poly(tax, degree = 10)4     39.747     17.760   2.238 0.025759 *  
## poly(tax, degree = 10)5    -19.208     17.749  -1.082 0.279800    
## poly(tax, degree = 10)6      3.873     17.752   0.218 0.827407    
## poly(tax, degree = 10)7     -3.166     17.797  -0.178 0.858913    
## poly(tax, degree = 10)8    -20.000     17.639  -1.134 0.257514    
## poly(tax, degree = 10)9     41.634     18.171   2.291 0.022460 *  
## poly(tax, degree = 10)10   -33.242     17.604  -1.888 0.059699 .  
## poly(avgprs, degree = 2)1 -335.392     52.694  -6.365 5.26e-10 ***
## poly(avgprs, degree = 2)2   39.537     30.189   1.310 0.191055    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.58 on 408 degrees of freedom
## Multiple R-squared:  0.4366, Adjusted R-squared:  0.4186 
## F-statistic: 24.32 on 13 and 408 DF,  p-value: < 2.2e-16
AIC(fit2, fit3)
##      df     AIC
## fit2  5 3641.41
## fit3 15 3632.98

The fit has not improved remarkably (\(R^2 \approx 0.44\)) and the number of estimated parameters has risen from 8 (the original full model) to 15. The probability of having overfitted the model is low, but we still have a rather poor model.

6. Cross-validation

6.1. Hold-out method

To apply this method we have to calculate the \(RMSE\), which can be expressed as:

\[RMSE = \sqrt{\frac{\sum_{i=1}^{n}(p_i - O_i)^2}{n}}\]

where \(p_i\) is the predicted value and \(O_i\) the observed one.

pred <- predict(fit3, newdata = cigar.test)
# Predictions are already on the scale of packpc, so no exp() back-transformation is needed
rmse <- sqrt(sum((pred - cigar.test$packpc)^2) / length(cigar.test$packpc))
c(RMSE = rmse, R2 = summary(fit3)$r.squared)
plot(cigar.test$packpc, pred)
abline(0, 1, col = "red")

6.2. Cross-validation

cv.lm(cigar.train, fit3, m = 10) 
## Analysis of Variance Table
## 
## Response: packpc
##                           Df Sum Sq Mean Sq F value  Pr(>F)    
## cpi                        1  34986   34986   113.2 < 2e-16 ***
## poly(tax, degree = 10)    10  49383    4938    16.0 < 2e-16 ***
## poly(avgprs, degree = 2)   2  13353    6677    21.6 1.2e-09 ***
## Residuals                408 126121     309                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## 
## fold 1 
## Observations in test set: 42 
##                345   328   330   195   245    117   150    217 432    177
## Predicted   104.24 101.7  93.9 111.0 108.9 104.36  97.1 135.55 112 117.90
## cvpred      103.98 101.4  93.7 111.0 109.0 104.78  97.5 133.93 112 118.07
## packpc      106.40 119.6  52.5  94.9  89.1 108.50 103.9 139.06 104 122.79
## CV residual   2.42  18.2 -41.2 -16.0 -19.9   3.72   6.3   5.13  -8   4.72
##                422    31   495   164    472   526     65   168     83
## Predicted   107.85 115.1 117.6 110.0 108.55 84.11 109.62 117.8 115.52
## cvpred      106.76 115.2 115.8 110.1 107.98 82.83 109.80 117.8 115.42
## packpc      111.62 142.0 172.6 121.7 114.14 92.47 114.25  90.6 112.07
## CV residual   4.87  26.7  56.9  11.6   6.16  9.64   4.45 -27.2  -3.35
##                297    99    457   126   380   264    287   101   333    76
## Predicted   107.06 115.0 115.60 115.8  96.4 112.2 106.54 120.2 83.65 126.1
## cvpred      106.50 115.0 114.16 115.8  96.2 112.2 106.73 120.3 84.23 125.9
## packpc      109.07 100.3 108.16  82.1 126.1  86.9 111.47 102.7 77.35 190.6
## CV residual   2.57 -14.6  -6.01 -33.7  29.9 -25.4   4.74 -17.6 -6.88  64.7
##              501   322     18   490     455    310   327   290    56
## Predicted   78.7 96.27 128.16 89.87 103.044 111.22 98.89  97.9 111.9
## cvpred      78.2 96.08 127.02 88.86 102.256 110.76 98.67  97.8 112.4
## packpc      82.9 87.13 120.98 92.40 103.137 119.55 91.24 116.3 120.6
## CV residual  4.7 -8.95  -6.04  3.54   0.881   8.79 -7.43  18.5   8.2
##                 36    161   134     47
## Predicted   120.19 100.42 140.7 120.00
## cvpred      120.48 100.78 140.8 120.22
## packpc      117.70 109.90 127.2 112.85
## CV residual  -2.77   9.12 -13.7  -7.37
## 
## Sum of squares = 16385    Mean square = 390    n = 42 
## 
## fold 2 
## Observations in test set: 43 
##               194   135    93   409   316    12    230     48    496   428
## Predicted   108.1 113.0 99.32 112.3  99.5 119.4 131.58 137.66 103.54  97.9
## cvpred      108.9 113.1 99.41 113.2  99.4 119.1 132.19 137.21 103.25  98.0
## packpc      122.8  99.0 93.29 124.9 147.3 123.2 124.33 129.40 105.18 116.0
## CV residual  13.9 -14.1 -6.12  11.8  47.9   4.1  -7.86  -7.81   1.93  18.0
##               405    492    176    26    116    55   489     62    206
## Predicted   75.26 84.685 109.55 121.7 115.87 118.6 114.9 117.80 101.66
## cvpred      73.74 83.674 110.17 121.6 116.14 118.4 115.9 117.75 102.11
## packpc      79.59 83.265 106.35 105.5 122.36 140.8  97.5 110.03  97.94
## CV residual  5.85 -0.409  -3.82 -16.1   6.22  22.3 -18.5  -7.72  -4.17
##                15   420   404   504   370     311    242   259    133
## Predicted   163.0 96.98  97.6 109.8 89.90 104.206 106.68  93.5 112.33
## cvpred      160.5 96.41  96.8 109.9 89.70 105.208 107.55  92.8 111.99
## packpc      186.0 93.87 104.2  87.2 84.61 104.892 117.02 114.9 121.82
## CV residual  25.5 -2.54   7.4 -22.8 -5.09  -0.315   9.47  22.1   9.83
##                434   418     119   325   270   146   293 313   207     96
## Predicted    98.62 94.47 112.140  90.7 106.9 111.8 107.4 147 151.6 133.73
## cvpred       98.55 93.91 112.673  90.5 107.7 112.3 108.2 147 150.2 133.57
## packpc      101.63 83.96 112.514 104.1  71.6 125.5  88.9 130 176.1 127.82
## CV residual   3.08 -9.95  -0.159  13.6 -36.1  13.2 -19.3 -17  25.9  -5.74
##               243    113      2    123   234
## Predicted   107.2 106.31 123.94 112.11 100.2
## cvpred      108.0 105.98 123.64 112.24 100.9
## packpc       85.8 112.10 128.53 104.91  57.4
## CV residual -22.2   6.11   4.89  -7.32 -43.4
## 
## Sum of squares = 12159    Mean square = 283    n = 43 
## 
## fold 3 
## Observations in test set: 43 
##               197    92    396    514    64   298     54   214   156   485
## Predicted   108.5 117.4 93.899 102.98 115.2 94.85 102.17 114.3 108.3 103.7
## cvpred      108.8 115.7 91.422 100.22 113.8 92.96 103.00 113.3 108.7 102.3
## packpc       91.3 131.5 92.393 108.68 123.3 96.34 106.76 125.5 110.1  82.6
## CV residual -17.5  15.8  0.971   8.46   9.5  3.37   3.76  12.2   1.4 -19.7
##               389    238   285 124    28    438    261   139   487   307
## Predicted   104.5  96.62 87.02 123 129.0 77.966 89.512 149.0 105.5  91.3
## cvpred      104.1  96.22 85.89 120 124.7 77.362 88.816 151.7 102.1  89.6
## packpc       82.6 100.60 78.74 185 198.0 77.623 89.463 129.7 124.5 116.3
## CV residual -21.4   4.38 -7.15  65  73.3  0.261  0.647 -22.1  22.4  26.7
##                132   279  296   458    513     528    289    454     77
## Predicted   117.52 104.1 87.4 85.52 105.34 113.978 104.92 112.90 110.76
## cvpred      115.94 104.9 85.8 85.95 101.99 112.636 105.00 111.88 110.35
## packpc      112.49  91.6 97.6 80.34 111.38 112.238 107.01 117.25 114.57
## CV residual  -3.46 -13.3 11.8 -5.61   9.39  -0.398   2.02   5.37   4.22
##               154  267   397   205    158   412    87   320   228     69
## Predicted   100.6 99.4 109.0 112.9 104.81 100.4 116.4 82.53 111.6 107.73
## cvpred      100.3 99.7 108.7 111.8 105.69  98.3 115.4 83.05 111.0 108.38
## packpc      103.5 91.5 123.6 134.3 104.63 149.5  98.9 86.60 108.7 104.47
## CV residual   3.2 -8.2  14.8  22.5  -1.06  51.2 -16.5  3.54  -2.3  -3.92
##               292    393    167 235
## Predicted    82.2 111.33 111.59 139
## cvpred       80.7 110.23 111.08 144
## packpc       68.5 104.35 113.02 121
## CV residual -12.2  -5.88   1.94 -23
## 
## Sum of squares = 18057    Mean square = 420    n = 43 
## 
## fold 4 
## Observations in test set: 42 
##               141 332   229   222   350    179   315   109   346   260
## Predicted   94.31 103 104.5 109.5 91.73 101.24 94.43 121.6 87.68 105.1
## cvpred      94.56 103 103.9 111.3 90.81 100.75 93.99 122.5 86.43 104.2
## packpc      88.97 121 120.1  76.1 88.76 108.06 91.71 136.8 93.91 114.7
## CV residual -5.59  18  16.2 -35.2 -2.06   7.31 -2.28  14.3  7.48  10.6
##                  6     84   459   100   174    67   211     68    321
## Predicted   102.69 117.82  94.9 115.8 112.2 107.6  93.3 119.31 105.90
## cvpred      102.01 118.40  94.7 116.9 114.0 106.8  93.0 118.52 105.66
## packpc      109.28 114.68  84.1  92.5  79.5 118.4 120.9 126.14 112.50
## CV residual   7.27  -3.72 -10.6 -24.4 -34.6  11.6  27.9   7.62   6.84
##                24    227      500    169   379  42   299   367   187
## Predicted   125.7  97.59     81.4 139.28 120.0 116 102.7  82.6 145.6
## cvpred      126.7  97.06 -42058.2 138.04 121.6 117 102.4  80.6 146.8
## packpc      104.3 105.20     81.4 145.57 106.2  68  83.0 102.8 127.5
## CV residual -22.4   8.14  42139.6   7.54 -15.4 -49 -19.5  22.2 -19.4
##                 41    129    44   343  344    53   232    466    273
## Predicted   121.12 120.66 125.2  91.2 85.5 125.3 106.5 100.87 110.64
## cvpred      120.97 122.20 125.9  90.3 83.8 126.5 108.4 100.56 111.39
## packpc      115.10 125.45 145.3 116.9 93.9 109.2 127.1 101.97 115.85
## CV residual  -5.86   3.25  19.4  26.7 10.1 -17.3  18.7   1.41   4.45
##                 50   237    276    114   246
## Predicted   119.92 88.96 108.92 121.95 86.24
## cvpred      119.10 88.92 109.67 123.35 85.36
## packpc      127.66 83.97 102.54 115.88 90.82
## CV residual   8.56 -4.96  -7.13  -7.47  5.46
## 
## Sum of squares = 1.78e+09    Mean square = 42279978    n = 42 
## 
## fold 5 
## Observations in test set: 42 
##               302   355     7    170     241    401    191   120   402
## Predicted   95.95  83.3 121.7 101.57 110.819 84.452 110.22 120.7 88.39
## cvpred      96.29  83.7 121.7 101.86 110.981 84.934 109.96 120.7 89.20
## packpc      91.62 115.0 143.9  92.37 111.745 84.547 115.41  95.3 80.97
## CV residual -4.68  31.3  22.2  -9.49   0.764 -0.388   5.45 -25.4 -8.23
##                384   483   522    262   410     385    468   387   499
## Predicted   105.89  83.8  99.8 114.01  96.3 104.945 101.72  97.0  86.5
## cvpred      106.10  84.7 101.9 114.47  98.0 105.473 103.65  97.3  87.3
## packpc      107.45  72.0  49.3 120.96  82.9 104.958  94.78  75.2 102.5
## CV residual   1.35 -12.8 -52.6   6.49 -15.1  -0.515  -8.87 -22.0  15.2
##               429   498    37    59   182   203     14 115   450   400
## Predicted   82.28  91.8 122.4 123.3 137.0 107.2 125.98 103  90.6  97.9
## cvpred      82.73  92.9 122.5 123.8 138.7 107.0 126.01 103  91.5  98.6
## packpc      74.60  77.5 132.8  98.3 126.0  80.9 116.68 124  79.7 104.3
## CV residual -8.13 -15.4  10.3 -25.5 -12.7 -26.1  -9.33  21 -11.8   5.7
##                336   362    112 247    165    369   509   408     224
## Predicted   111.55  91.9 112.90 111 93.480 101.24 82.43 103.1 100.983
## cvpred      112.01  93.1 112.62 111 94.215 101.39 83.11 103.9 100.780
## packpc      106.83  74.4 121.57 124 94.042 110.86 80.37  81.6 100.323
## CV residual  -5.19 -18.7   8.95  13 -0.173   9.47 -2.74 -22.3  -0.458
##                171    427 358    433     21
## Predicted   101.87 110.40 109 108.85 117.27
## cvpred      102.18 111.55 109 109.97 116.94
## packpc       95.55 102.12 121 103.43 112.90
## CV residual  -6.63  -9.43  12  -6.54  -4.04
## 
## Sum of squares = 10493    Mean square = 250    n = 42 
## 
## fold 6 
## Observations in test set: 42 
##                 106   494   107   391   381   352   411    209      34
## Predicted   106.811 101.2 113.7 100.2 81.40  91.9 97.03  94.62 125.201
## cvpred      106.725 101.1 111.9  99.7 82.97  92.1 97.07  94.11 126.814
## packpc      107.553  88.8  99.3 119.1 77.80 106.9 89.20 102.49 127.139
## CV residual   0.828 -12.3 -12.6  19.4 -5.17  14.8 -7.88   8.39   0.326
##               312    145    337   281 151    91   295    43    449   366
## Predicted   107.7 118.05  99.60  96.0 115 151.0 102.3 153.7 77.754  99.6
## cvpred      108.0 118.55  99.99  95.5 114 154.9 102.1 157.9 76.722 100.1
## packpc       81.7 115.26 106.90  85.0 136 133.3 117.4 134.0 77.230  69.3
## CV residual -26.3  -3.29   6.91 -10.5  22 -21.6  15.3 -23.9  0.507 -30.8
##               103 361    317   248     81    465    78     89    74 476
## Predicted   116.0 113 85.900  94.9 120.72 105.32 117.5 116.70 119.3 101
## cvpred      115.6 110 85.651  94.7 120.76 105.06 116.3 117.74 120.4 101
## packpc      137.5 126 86.533 105.7 126.59 106.67  87.3 111.45 102.1 119
## CV residual  21.9  16  0.882  11.0   5.84   1.62 -29.0  -6.29 -18.3  18
##               441     22     58   399   196   184    519    86    208
## Predicted   114.8 123.07 113.60 118.9  99.5 110.9 103.92 142.6 107.03
## cvpred      115.0 122.89 114.14 116.1  99.4 109.6 103.60 142.5 106.82
## packpc      102.7 130.37 107.77 163.0  80.5 127.2  97.22 126.7 108.84
## CV residual -12.3   7.48  -6.37  46.9 -18.9  17.7  -6.38 -15.8   2.02
##                448     20   219    178
## Predicted   103.80 121.65 99.65 107.19
## cvpred      103.22 123.06 99.12 107.29
## packpc      103.61 128.00 95.87 109.34
## CV residual   0.39   4.95 -3.24   2.06
## 
## Sum of squares = 10513    Mean square = 250    n = 42 
## 
## fold 7 
## Observations in test set: 42 
##                192   435    516    202    63   339   251   453      71
## Predicted   127.96 100.9 101.03 97.050 159.9  94.6 107.8 78.60 115.286
## cvpred      127.08 101.1 100.32 97.140 158.9  94.6 106.8 78.63 114.637
## packpc      123.01  72.9  95.64 97.005 181.8  76.4  91.3 81.36 114.032
## CV residual  -4.07 -28.2  -4.68 -0.136  22.9 -18.2 -15.5  2.73  -0.605
##              517   478 220     40   520   486   301   360    95    417
## Predicted   78.5 84.93 116 121.31 109.2 78.16 109.0 100.0 117.3 102.18
## cvpred      71.8 84.03 114 120.37 108.6 78.41 107.8  99.9 116.5 102.39
## packpc      92.6 91.61 170 129.83 122.3 79.47 126.4  83.9 115.2 106.84
## CV residual 20.8  7.58  56   9.47  13.8  1.06  18.5 -16.0  -1.3   4.46
##                407    131   118   373    29   502    304   456    497
## Predicted    98.03 106.90 117.1  81.2 114.2 110.0  98.27 108.9 77.497
## cvpred       98.16 107.55 115.9  81.1 114.9 109.9  98.09 108.9 77.615
## packpc      103.01 109.22 129.7 100.9 116.5 122.5 106.34  82.6 76.621
## CV residual   4.86   1.67  13.7  19.8   1.6  12.5   8.25 -26.3 -0.994
##                 16   424    488   447   253   140    210 445    35    27
## Predicted   119.69 104.7 93.527 117.6 111.7 113.4 113.85 111 128.1 123.1
## cvpred      118.82 103.8 92.769 116.0 110.3 112.5 112.47 111 127.9 122.6
## packpc      127.56 116.5 93.075 164.8 129.4 127.8 107.59 125 119.5 107.4
## CV residual   8.73  12.7  0.306  48.8  19.1  15.3  -4.88  14  -8.4 -15.2
##                 142   300   446   200
## Predicted   107.249 91.57 100.8  98.1
## cvpred      107.780 91.87 100.8  98.4
## packpc      106.842 95.34  87.0 112.6
## CV residual  -0.938  3.47 -13.8  14.2
## 
## Sum of squares = 12146    Mean square = 289    n = 42 
## 
## fold 8 
## Observations in test set: 42 
##                477     33     256   507   506   138   159    503   491 527
## Predicted   68.134 123.12 106.787 95.62 86.96 114.6 156.0 103.67 100.6 106
## cvpred      67.826 124.47 106.725 95.46 87.57 115.9 155.2 102.75 100.3 105
## packpc      68.193 127.60 107.438 87.27 79.81  68.5 172.5 105.58  74.8 116
## CV residual  0.367   3.13   0.713 -8.19 -7.76 -47.4  17.3   2.83 -25.4  11
##               236   233   186   258    72     32   524 166    38    30
## Predicted   107.8 98.48 105.6 111.1 123.3 118.34 100.1 116 146.1 120.5
## cvpred      107.7 98.61 105.6 112.2 124.0 118.49  99.3 118 148.1 122.5
## packpc      120.4 95.08  55.7 100.0  99.4 116.66 122.3 130 127.2  88.7
## CV residual  12.7 -3.53 -49.9 -12.2 -24.6  -1.83  23.0  12 -20.9 -33.8
##                252    85     17    263     98  463   482    240   371
## Predicted   96.402 118.5 112.34 107.95 116.88 84.3 100.9 126.88 90.79
## cvpred      96.761 118.8 113.11 107.46 116.89 83.8 100.8 128.83 90.53
## packpc      95.999 135.4 115.68 105.38 125.31 94.0 111.0 119.21 97.78
## CV residual -0.762  16.6   2.57  -2.08   8.42 10.2  10.3  -9.61  7.26
##               309     204   347     70   288    505   284  365   268   334
## Predicted   83.91 107.502  97.5 119.18 116.2 115.20 106.0 76.1 107.3 92.29
## cvpred      83.72 107.089  96.1 120.85 117.6 115.57 105.6 75.5 106.9 92.34
## packpc      85.87 106.976  80.5 129.10 104.8 121.54 122.6 86.0 154.4 94.72
## CV residual  2.16  -0.112 -15.7   8.25 -12.9   5.97  16.9 10.5  47.6  2.37
##               376      143    274
## Predicted    98.1 113.4794 102.86
## cvpred       97.0 113.8194 102.32
## packpc      118.4 113.8321  93.20
## CV residual  21.4   0.0127  -9.12
## 
## Sum of squares = 13561    Mean square = 323    n = 42 
## 
## fold 9 
## Observations in test set: 42 
##                475    108   257   382   303   152   319    105   349   306
## Predicted   106.88 110.97 94.02 89.05 145.2 101.4  84.0 120.07 105.8 109.1
## cvpred      107.04 110.26 93.88 88.58 140.5 100.8  83.5 120.53 105.9 109.1
## packpc      103.15 111.94 97.55 94.49 169.1 118.1 111.7 124.79 126.2  97.3
## CV residual  -3.88   1.67  3.66  5.91  28.7  17.3  28.3   4.26  20.3 -11.7
##               149    180   183   421   305    254   265   188   122
## Predicted   115.3 115.32 107.1 87.84 90.66 102.96 132.9 111.1 118.1
## cvpred      115.1 114.94 106.7 87.85 90.81 102.96 133.1 110.4 117.9
## packpc       97.0 108.71  95.0 95.10 94.82  95.46 131.6 128.7  97.7
## CV residual -18.1  -6.23 -11.7  7.25  4.01  -7.51  -1.5  18.4 -20.1
##                 226   212    383   481   104   111   125     60   136
## Predicted   103.277 104.5 100.71 109.9 104.6 158.7 107.9 117.62 114.5
## cvpred      102.878 104.4 100.94 111.3 104.0 156.0 107.7 117.35 114.4
## packpc      102.635 117.6  96.60 101.1 119.2 176.0 110.8 119.58 130.8
## CV residual  -0.243  13.2  -4.34 -10.2  15.2  19.9   3.1   2.23  16.5
##                277    49   244   462   464    75 175     80    137    431
## Predicted    97.42 121.3  86.3 100.7 78.00 118.5 103 115.83 113.10 103.91
## cvpred       95.81 121.5  85.1 101.4 79.81 118.4 102 115.68 112.66 104.62
## packpc      100.63 117.2  74.2  64.4 71.65 105.2 136 113.56 104.31 111.99
## CV residual   4.83  -4.3 -10.9 -37.0 -8.16 -13.2  34  -2.12  -8.35   7.38
##               388   163   282    160
## Predicted    82.8  97.8  95.7 110.92
## cvpred       83.4  97.0  95.5 110.26
## packpc       61.8 123.8  53.0 116.81
## CV residual -21.6  26.7 -42.4   6.55
## 
## Sum of squares = 11219    Mean square = 267    n = 42 
## 
## fold 10 
## Observations in test set: 42 
##                470    198   368   269   155   354  39    90   436   515
## Predicted   116.61  89.94 78.45 96.72 109.8 95.49 126 114.9  82.3 88.14
## cvpred      118.21  90.35 79.13 96.93 109.5 95.63 129 115.4  83.0 88.20
## packpc      109.07 100.03 82.35 98.48  88.6 90.61 107  64.8  57.9 92.16
## CV residual  -9.14   9.68  3.22  1.55 -20.8 -5.03 -22 -50.7 -25.2  3.96
##                79   329   364   474 406   512     88     97    51    523
## Predicted   114.0 82.79  94.7 100.8 115  79.2 118.55 119.75 114.8 108.17
## cvpred      114.6 82.34  96.0 103.2 115  83.4 119.76 120.74 115.5 108.56
## packpc      135.2 78.66 145.6  51.1 119  70.8 129.33 115.84 103.3 105.39
## CV residual  20.6 -3.68  49.6 -52.1   4 -12.6   9.57  -4.91 -12.2  -3.17
##                162 351    121    130   291     46     353  213    335  415
## Predicted   117.79 126 143.18 116.21 102.5 111.57 89.3088 91.2 105.12 83.9
## cvpred      118.92 124 144.45 116.39 102.4 111.73 91.1633 91.2 105.11 85.3
## packpc      112.66 164 148.10 118.25  80.7 107.88 91.1992 92.1 110.16 97.4
## CV residual  -6.26  41   3.65   1.86 -21.7  -3.85  0.0359  0.9   5.06 12.1
##               340   510   378   266     19   419   392   467    249   443
## Predicted    80.0  99.8  90.6  95.3 120.98 95.76 90.28 93.45 111.08 105.9
## cvpred       82.5 102.0  92.5  96.3 121.52 98.04 91.33 93.83 111.47 107.0
## packpc       66.4  64.7  52.5  83.0 128.12 93.10 93.28 91.09 112.64  83.8
## CV residual -16.1 -37.3 -39.9 -13.3   6.59 -4.94  1.95 -2.75   1.17 -23.1
##               471   308
## Predicted   103.1  97.3
## cvpred      105.3  97.8
## packpc       92.0 109.5
## CV residual -13.4  11.7
## 
## Sum of squares = 17144    Mean square = 408    n = 42 
## 
## Overall (Sum over all 42 folds) 
##      ms 
## 4208248
# Manual 10-fold cross-validation with crossval() from the bootstrap package.
# theta.fit fits a least-squares model, theta.predict generates predictions from it.
theta.fit <- function(x, y) {lsfit(x, y)}
theta.predict <- function(fit, x) {cbind(1, x) %*% fit$coef}

# Definition of matrices (note: crossval() cross-validates a plain linear fit on the
# three raw predictors, while fit4 is the polynomial model fitted on the full data)
fit4 <- lm(packpc ~ cpi + poly(tax, degree = 10) + poly(avgprs, degree = 2), data = cigar.data2)
X <- as.matrix(cigar.data2[c("cpi", "tax", "avgprs")])

y <- as.matrix(cigar.data2[c("packpc")])

results <- crossval(X, y, theta.fit, theta.predict, ngroup = 10)
cor(y, fit4$fitted.values)**2 
##        [,1]
## packpc 0.44
cor(y, results$cv.fit)**2 
##         [,1]
## packpc 0.379

Here we have the \(R^2\) of the model fitted on the full dataset and the \(R^2\) of the cross-validated predictions.
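If you also want a cross-validated RMSE from the same predictions (my addition, not part of the original output), it is a one-liner:

sqrt(mean((y - results$cv.fit)^2))  # cross-validated RMSE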

7. Conclusion

We have created a model that explains about 40% of the variance in the training data and predicts roughly 38% of the variance according to the cross-validation test. That is quite a poor model, so we should consider building a more flexible model such as a GLM, GAMM or GLMM. The problems we have observed in our model are: outliers, a non-linear pattern in the residuals, and multicollinearity. In this case, given the multicollinearity, the best alternative to the linear regression model would be a regression on principal components (PCA).