Multiple Imputation

Remember that we have seen several imputation techniques:

  • Single imputation (mean/median/mode)
  • LOCF & NOCB
  • Regression Imputation

What about multiple imputation?

Multiple imputation fills in missing values by generating plausible numbers drawn from the distributions of, and relationships among, the observed variables in the data set. This yields M different versions of the data set: the non-missing entries are identical across versions, but the imputed entries differ. Discarding all partially observed units (e.g., through listwise deletion) is generally not recommended because it can lead to substantial bias and poor predictions.

mice

We can use the mice package (Multivariate Imputation by Chained Equations) to perform multiple imputation and the subsequent analysis. The mice() function performs the imputation via chained equations (see https://www.jstatsoft.org/article/view/v045i03 for details).

library(mice)
## 
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
## 
##     filter
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
iris = read.table("iris_mis.txt",sep=",") #load the data
colnames(iris)=c("var1","var2","var3","var4","var5")
dim(iris)
## [1] 150   5
sum(is.na(iris))
## [1] 42

The data includes 150 rows and 5 columns. Additionally, it has 42 missing observations.

Let’s visualize the missingness pattern via the VIM package:

library(VIM)
## Loading required package: colorspace
## Loading required package: grid
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
## 
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
## 
##     sleep
mice_plot = aggr(iris, col=c('darkred','orange'),
                    numbers=TRUE, sortVars=TRUE, prop=FALSE,
                    labels=names(iris), cex.axis=.7,
                    gap=3, ylab=c("Missing data","Pattern")) #gap: margin between the two plots; prop=FALSE: show counts (TRUE shows proportions)

## 
##  Variables sorted by number of missings: 
##  Variable Count
##      var2    13
##      var1    12
##      var3    10
##      var4     7
##      var5     0

Comment on the output! For example, the highest number of missing values belongs to var2, with 13; var5, on the other hand, has no missing observations.

From the pattern plot (see the cross-check sketch after this list):

  • Complete (fully observed) rows: 117
  • Missing values in var3: 10
  • Missing values in both var2 and var1: 9
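As a quick cross-check, mice itself can tabulate the missingness patterns; a minimal sketch using its md.pattern() function:

md.pattern(iris, plot = FALSE) #each row is a missingness pattern; row names give pattern frequencies, the last column the number of missing entries per pattern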

set.seed(1)
iris.imp = mice(iris, m = 5, method = "pmm") #m: number of imputations; you can also set the seed inside the function via seed = 1
## 
##  iter imp variable
##   1   1  var1  var2  var3  var4
##   1   2  var1  var2  var3  var4
##   1   3  var1  var2  var3  var4
##   1   4  var1  var2  var3  var4
##   1   5  var1  var2  var3  var4
##   2   1  var1  var2  var3  var4
##   2   2  var1  var2  var3  var4
##   2   3  var1  var2  var3  var4
##   2   4  var1  var2  var3  var4
##   2   5  var1  var2  var3  var4
##   3   1  var1  var2  var3  var4
##   3   2  var1  var2  var3  var4
##   3   3  var1  var2  var3  var4
##   3   4  var1  var2  var3  var4
##   3   5  var1  var2  var3  var4
##   4   1  var1  var2  var3  var4
##   4   2  var1  var2  var3  var4
##   4   3  var1  var2  var3  var4
##   4   4  var1  var2  var3  var4
##   4   5  var1  var2  var3  var4
##   5   1  var1  var2  var3  var4
##   5   2  var1  var2  var3  var4
##   5   3  var1  var2  var3  var4
##   5   4  var1  var2  var3  var4
##   5   5  var1  var2  var3  var4
#options: "pmm" (Predictive mean matching), "logreg" (Logistic regression), "polyreg" (Polytomous logistic regression), "polr" (Proportional odds model)

The default number of imputations, controlled by the m argument, is 5. In practice it is often better to increase both m and the number of iterations (the maxit argument, default 5).

The other major argument to be concerned with, method, controls how each imputed value is computed.

iris.imp$imp$var3 #compare the 5 imputed data sets for var3
##       1   2   3   4   5
## 9   1.6 1.6 1.6 1.3 1.4
## 20  1.5 1.4 1.9 1.9 1.5
## 23  1.3 1.6 1.5 1.3 1.3
## 34  1.5 1.6 1.6 1.3 1.4
## 54  4.5 4.5 4.0 4.7 4.7
## 110 5.9 6.7 6.6 6.7 6.7
## 117 5.9 5.1 5.0 5.9 4.9
## 127 5.1 5.5 4.5 5.0 4.5
## 142 6.7 5.2 6.7 5.6 5.9
## 150 4.7 4.4 4.7 5.1 4.4
iris.imp$imp$var1 #compare the 5 imputed data sets for var1
##       1   2   3   4   5
## 31  5.7 5.1 5.2 5.2 4.6
## 32  5.1 5.0 5.2 5.2 5.4
## 33  4.7 5.0 4.9 5.1 5.0
## 53  5.9 6.4 6.3 6.5 6.8
## 64  5.7 6.5 6.7 6.3 5.7
## 67  6.9 5.9 6.7 5.8 6.0
## 68  5.8 5.7 6.2 6.5 6.3
## 88  6.4 6.0 6.0 6.4 5.6
## 93  6.1 6.2 5.8 5.8 4.9
## 121 6.4 6.7 6.1 6.7 6.9
## 138 6.2 6.4 6.7 7.7 6.8
## 148 6.7 6.5 6.0 6.4 6.2
iris.comp=complete(iris.imp, "long", include = TRUE) #extract the stacked ("long") data set: the original data plus the 5 imputations
iris.comp$var1.NA = cci(iris$var1)  #cci() returns TRUE where its input is observed and FALSE where it is missing
head(iris.comp[, c("var1", "var1.NA")],25)
##    var1 var1.NA
## 1   5.1    TRUE
## 2   4.9    TRUE
## 3   4.7    TRUE
## 4   4.6    TRUE
## 5   5.0    TRUE
## 6   5.4    TRUE
## 7   4.6    TRUE
## 8   5.0    TRUE
## 9   4.4    TRUE
## 10  4.9    TRUE
## 11  5.4    TRUE
## 12  4.8    TRUE
## 13  4.8    TRUE
## 14  4.3    TRUE
## 15  5.8    TRUE
## 16  5.7    TRUE
## 17  5.4    TRUE
## 18  5.1    TRUE
## 19  5.7    TRUE
## 20  5.1    TRUE
## 21  5.4    TRUE
## 22  5.1    TRUE
## 23  4.6    TRUE
## 24  5.1    TRUE
## 25  4.8    TRUE
library(ggplot2)
ggplot(iris.comp,
       aes(x = .imp, y = var1, color = var1.NA)) + 
  geom_jitter(show.legend = FALSE,
              width = .1)
## Warning: Removed 12 rows containing missing values (`geom_point()`).

The plot above shows how the imputed values for var1 in each of the five imputations (x = 1 to 5) behave relative to the observed data (x = 0). Because we used the default method, pmm (the most common choice), every imputed value is borrowed from an observed donor value, so we are guaranteed not to get values outside the observed range.
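A quick sanity check of this property (a sketch): since pmm borrows donor values from the observed data, every imputed value must occur among the observed values of the variable.

all(unlist(iris.imp$imp$var1) %in% na.omit(iris$var1)) #should be TRUE: every imputed var1 value appears among the observed var1 values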

#Customization of multiple imputation
iris.imp$nmis #see number of missing obs in each var
## var1 var2 var3 var4 var5 
##   12   13   10    7    0
imp_meth=iris.imp$method # shows the imputation method for each variable
imp_meth  #all uses pmm
##  var1  var2  var3  var4  var5 
## "pmm" "pmm" "pmm" "pmm"    ""
imp_pred=iris.imp$predictorMatrix  #see predictor matrix
imp_pred
##      var1 var2 var3 var4 var5
## var1    0    1    1    1    0
## var2    1    0    1    1    0
## var3    1    1    0    1    0
## var4    1    1    1    0    0
## var5    1    1    1    1    0
#Customization
imp_pred[,c("var4")]=0 #drop var4 as a predictor (it will not be used to impute the other variables), but var4 itself will still be imputed
imp_meth[c("var2")]="lasso.norm" #impute the missing values of var2 with lasso linear regression

set.seed(1)
iris.imp.changed = mice(iris, method=imp_meth, predictorMatrix=imp_pred, m=5) #specify method and predictor matrix and then impute
## 
##  iter imp variable
##   1   1  var1  var2  var3  var4
##   1   2  var1  var2  var3  var4
##   1   3  var1  var2  var3  var4
##   1   4  var1  var2  var3  var4
##   1   5  var1  var2  var3  var4
##   2   1  var1  var2  var3  var4
##   2   2  var1  var2  var3  var4
##   2   3  var1  var2  var3  var4
##   2   4  var1  var2  var3  var4
##   2   5  var1  var2  var3  var4
##   3   1  var1  var2  var3  var4
##   3   2  var1  var2  var3  var4
##   3   3  var1  var2  var3  var4
##   3   4  var1  var2  var3  var4
##   3   5  var1  var2  var3  var4
##   4   1  var1  var2  var3  var4
##   4   2  var1  var2  var3  var4
##   4   3  var1  var2  var3  var4
##   4   4  var1  var2  var3  var4
##   4   5  var1  var2  var3  var4
##   5   1  var1  var2  var3  var4
##   5   2  var1  var2  var3  var4
##   5   3  var1  var2  var3  var4
##   5   4  var1  var2  var3  var4
##   5   5  var1  var2  var3  var4
iris.imp.changed$method  #see var2 is imputed via lasso-reg
##         var1         var2         var3         var4         var5 
##        "pmm" "lasso.norm"        "pmm"        "pmm"           ""
iris.imp.changed$predictorMatrix  #see var4 is not used as a predictor
##      var1 var2 var3 var4 var5
## var1    0    1    1    0    0
## var2    1    0    1    0    0
## var3    1    1    0    0    0
## var4    1    1    1    0    0
## var5    1    1    1    0    0
iris.imp.changed$imp$var3 #compare the imputations for var3
##       1   2   3   4   5
## 9   1.5 1.4 1.6 1.3 1.5
## 20  1.9 1.1 1.6 1.5 1.3
## 23  1.4 1.1 1.4 1.3 1.6
## 34  1.9 1.6 1.3 1.3 1.1
## 54  4.8 4.5 4.5 4.5 4.7
## 110 5.0 4.7 4.5 5.5 5.6
## 117 5.6 5.6 5.7 5.8 4.3
## 127 5.1 5.6 5.6 5.1 5.6
## 142 5.0 5.4 4.7 5.5 5.4
## 150 4.1 4.1 5.1 4.2 4.1
iris.imp.changed$imp$var1 #compare the imputations for var1
##       1   2   3   4   5
## 31  4.7 5.0 5.1 4.6 4.9
## 32  4.4 4.9 5.1 5.0 5.1
## 33  5.0 4.8 5.0 5.4 4.4
## 53  6.4 6.7 6.7 5.8 7.0
## 64  5.7 5.2 5.6 6.1 4.9
## 67  5.9 5.6 5.9 6.3 5.9
## 68  6.2 6.2 5.5 5.9 6.2
## 88  6.4 5.4 6.3 5.2 6.6
## 93  5.2 5.7 5.7 5.6 6.2
## 121 7.7 6.3 6.2 6.5 6.5
## 138 6.5 6.4 6.8 6.5 6.3
## 148 6.5 6.8 5.9 6.4 6.4
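The “subsequent analysis” mentioned earlier is usually completed by fitting the analysis model on each imputed data set and pooling the results via Rubin’s rules. A minimal sketch using mice’s with() and pool() on iris.imp (the regression formula here is purely illustrative):

fit = with(iris.imp, lm(var4 ~ var1 + var2 + var3)) #fit the model on each of the m = 5 completed data sets
pooled = pool(fit) #combine estimates and standard errors via Rubin's rules
summary(pooled) #pooled coefficients, standard errors, and p-values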

Hmisc

Hmisc is a multi-purpose package useful for data analysis, high-level graphics, imputing missing values, advanced table making, and model fitting & diagnostics (linear regression, logistic regression, Cox regression, etc.).

The impute() function imputes missing values using a user-defined statistical method (mean, max, median, ...). Its default is the median.

aregImpute() uses a different bootstrap resample for each of the multiple imputations. A flexible additive model (a nonparametric regression method) is fitted on samples drawn with replacement from the original data, and missing values (acting as the dependent variable) are predicted from the non-missing values (the independent variables). Predictive mean matching (the default) is then used to impute the missing values.

Note that it assumes linearity in the variables being predicted. Fisher’s optimum scoring method is used for predicting categorical variables.

library(Hmisc)
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
summary(iris)  # see mean for var3 is 3.77
##       var1            var2            var3            var4      
##  Min.   :4.300   Min.   :2.000   Min.   :1.100   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.750   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.832   Mean   :3.062   Mean   :3.771   Mean   :1.157  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##  NA's   :12      NA's   :13      NA's   :10      NA's   :7      
##      var5          
##  Length:150        
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
iris_mean = with(iris, impute(var3, mean))
summary(iris_mean)
## 
##  10 values imputed to 3.770714
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.100   1.600   4.200   3.771   5.075   6.900
head(iris_mean,25) #observations marked with * were imputed with the mean
##         1         2         3         4         5         6         7         8 
##  1.400000  1.400000  1.300000  1.500000  1.400000  1.700000  1.400000  1.500000 
##         9        10        11        12        13        14        15        16 
## 3.770714*  1.500000  1.500000  1.600000  1.400000  1.100000  1.200000  1.500000 
##        17        18        19        20        21        22        23        24 
##  1.300000  1.400000  1.700000 3.770714*  1.700000  1.500000 3.770714*  1.700000 
##        25 
##  1.900000
set.seed(1)
iris_random = with(iris, impute(var3, 'random')) #similarly you can use min, max, median to impute missing value
head(iris_random,25)  #missing values are imputed with random draws from the observed values
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
##  1.4  1.4  1.3  1.5  1.4  1.7  1.4  1.5 4.9*  1.5  1.5  1.6  1.4  1.1  1.2  1.5 
##   17   18   19   20   21   22   23   24   25 
##  1.3  1.4  1.7 5.6*  1.7  1.5 1.6*  1.7  1.9

The output of aregImpute() includes R² values for the models predicting the missing values; the higher the value, the better the values are predicted. You can also check the imputed values using the following commands.

impute_arg = aregImpute(~ var1 + var2 + var3 + var4 + var5, data = iris, n.impute = 5) #aregImpute() automatically identifies the variable type and treats it accordingly
## Iteration 1 
## Iteration 2 
## Iteration 3 
## Iteration 4 
## Iteration 5 
## Iteration 6 
## Iteration 7 
## Iteration 8 
impute_arg$imputed$var3  # aregImpute takes all aspects of uncertainty in the imputations into account by using the bootstrap to approximate the process of drawing predicted values from a full Bayesian predictive distribution. 
##     [,1] [,2] [,3] [,4] [,5]
## 9    1.6  1.4  1.5  1.3  1.3
## 20   1.3  1.5  1.5  1.4  1.3
## 23   1.4  1.3  1.6  1.5  1.1
## 34   1.4  1.2  1.6  1.4  1.3
## 54   4.4  3.9  3.9  3.0  4.2
## 110  6.6  6.6  6.0  6.1  6.3
## 117  4.9  5.6  5.8  5.9  5.3
## 127  6.0  4.8  5.1  5.3  5.1
## 142  6.0  5.6  5.7  6.3  5.9
## 150  5.1  5.1  5.1  5.0  4.8

Cross Validation

Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called overfitting.

A solution to this problem is a procedure called cross-validation (CV for short). A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV.

The most popular cross-validation techniques:

  1. Validation Set Approach

  2. Leave-one-out cross-validation (LOOCV)

  3. K-fold cross-validation

  4. Repeated K-fold cross-validation

library(MASS)
## Warning: package 'MASS' was built under R version 4.3.3
data(Boston)

Validation Set

In this method, the dataset is divided randomly into training and testing sets.

Generally, 80% of the data is used as the training set and 20% as the test set (or 70% for training and 30% for testing); it depends on your preference.

The following steps are performed to implement this technique:

  1. Randomly sample the dataset

  2. Train the model on the training set

  3. Apply the final model to the test set

  4. Calculate the prediction error using model performance metrics

#Data Partition
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:Hmisc':
## 
##     src, summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(caret)  #for createDataPartition function
## Warning: package 'caret' was built under R version 4.3.3
## Zorunlu paket yükleniyor: lattice
set.seed(1)
train_ind = Boston$medv %>%
  createDataPartition(p = 0.8, list = FALSE) #createDataPartition builds the index of the training set #split ratio: 80%/20%
train  = Boston[train_ind, ]
test = Boston[-train_ind, ]


d_original=dim(Boston)
d_train=dim(train)
d_test=dim(test)
dimens=cbind(d_original,d_train,d_test)
rownames(dimens)=c("number of rows","number of columns")
dimens
##                   d_original d_train d_test
## number of rows           506     407     99
## number of columns         14      14     14

See the dimensions of each data set: the number of columns stays the same, while the number of rows changes as expected.

model = lm(medv ~., data = train) #construct the model
summary(model)
## 
## Call:
## lm(formula = medv ~ ., data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.6891  -2.6859  -0.5393   1.8871  25.5015 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  32.334452   5.582317   5.792 1.42e-08 ***
## crim         -0.130344   0.034279  -3.802 0.000166 ***
## zn            0.052932   0.015536   3.407 0.000724 ***
## indus        -0.023633   0.064464  -0.367 0.714109    
## chas          3.253159   0.957335   3.398 0.000748 ***
## nox         -16.050622   4.059695  -3.954 9.13e-05 ***
## rm            4.095705   0.456716   8.968  < 2e-16 ***
## age          -0.012316   0.014639  -0.841 0.400679    
## dis          -1.636081   0.221466  -7.387 9.05e-13 ***
## rad           0.278750   0.069715   3.998 7.62e-05 ***
## tax          -0.011576   0.003892  -2.975 0.003116 ** 
## ptratio      -0.817708   0.142322  -5.745 1.84e-08 ***
## black         0.010540   0.002888   3.650 0.000298 ***
## lstat        -0.494600   0.055517  -8.909  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.675 on 393 degrees of freedom
## Multiple R-squared:  0.7474, Adjusted R-squared:  0.7391 
## F-statistic: 89.46 on 13 and 393 DF,  p-value: < 2.2e-16

Comment on the output! However, to keep things consistent, we will not exclude any variables from the model at this stage.

pred = model %>%  predict(test)
pred  #see predictions
##         17         18         21         23         24         30         32 
## 21.1688793 17.1753534 12.6968098 16.1798960 13.9698327 21.1720709 18.0476643 
##         37         41         43         44         49         50         53 
## 22.6121537 35.2432337 25.4565332 24.9381469  8.7293878 16.8453308 27.6538427 
##         58         61         65         69         73         76         82 
## 33.0657243 17.6170132 23.1103489 17.4728905 24.8064986 23.9424957 27.0306435 
##         86         92        104        118        129        132        138 
## 27.9718424 27.6330339 20.7445633 23.6268309 19.2088078 19.4771875 19.6093608 
##        145        154        159        162        167        173        174 
##  8.6089508 16.7847009 27.9820632 36.0994981 36.6126852 22.8296782 29.0444214 
##        188        192        197        199        202        205        213 
## 33.8880618 30.3774500 36.1220488 34.7115703 29.4787421 43.9505929 23.4967080 
##        235        236        237        240        241        245        250 
## 32.2620437 25.2392448 30.4894557 28.3591645 27.2554706 15.6226082 24.3309210 
##        255        270        272        277        278        281        283 
## 23.8356867 26.5211681 27.8396482 36.5863508 35.9047516 38.6861122 40.8183217 
##        286        287        293        308        309        310        311 
## 27.4136825 20.4149840 32.7831636 33.2204608 28.5239075 23.5577286 18.7980456 
##        312        314        315        318        321        323        327 
## 27.1455724 25.4286846 25.3793012 18.3886545 25.1382982 23.0241484 24.0288252 
##        347        348        350        369        380        383        386 
## 14.4948424 25.6349727 22.6311726 22.8927592 16.6366449 13.3040134  7.9591663 
##        387        390        391        393        399        407        413 
##  5.5717784 13.9730318 16.9735732  9.6425054  5.9662802  7.2391219  0.8205434 
##        415        418        420        424        429        434        436 
## -5.5122007  6.1870867 14.6408508 12.7430413 14.1227450 16.7201880 13.1950824 
##        445        460        462        476        485        486        494 
## 11.2771605 18.3721848 20.1494107 15.8629210 19.4486860 22.0785962 21.1953072 
##        500 
## 18.8193384
perf_metrics=data.frame( RMSE = RMSE(pred, test$medv), Rsquared = R2(pred, test$medv),
            MAE = MAE(pred, test$medv))
perf_metrics  #see performance metrics of the model for test set
##       RMSE Rsquared      MAE
## 1 5.120283 0.712638 3.375619

Advantages:

  • One of the most basic and simple techniques for evaluating a model.

  • No complex steps for implementation.

Disadvantages:

  • Predictions made by the model are highly dependent on the subset of observations used for training and validation.

  • Using only one subset of the data for training can make the model biased.

Leave-one-out cross-validation (LOOCV)

This method also splits the data set into two parts, but it overcomes a drawback of the validation set approach. LOOCV carries out the cross-validation in the following way:

  1. Train the model on N-1 data points

  2. Test the model against the single data point that was left out in the previous step

  3. Calculate the prediction error

  4. Repeat the above three steps until the model has been trained and tested on every data point

  5. Generate the overall prediction error by averaging the prediction errors from every iteration

train.control = trainControl(method = "LOOCV")  #change the method into LOOCV

model_loocv = train(medv ~., data = Boston, method = "lm", trControl = train.control)
print(model_loocv) #506 observations in total; each fit uses 505 of them
## Linear Regression 
## 
## 506 samples
##  13 predictor
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 505, 505, 505, 505, 505, 505, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   4.870908  0.7191505  3.382797
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

The pro of the LOOCV method is that we make use of all data points, reducing potential bias.

The con of the LOOCV method is that training the model N times leads to expensive computation time if the data set is large.

Additionally, we test the model performance against a single data point at each iteration. This can result in higher variation in the prediction error if some data points are outliers. So we need a good ratio of test data points, a solution provided by the k-fold cross-validation method.

K-fold Cross Validation

This cross-validation technique divides the data into K subsets (folds) of almost equal size. One of these K folds is used as the validation set, and the remaining folds are used to train the model.

In practice, one typically performs k-fold cross-validation using k = 5 or k = 10, as these values have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance.

The complete working procedure of this method is as follows:

  1. Split the data set into K subsets randomly

  2. Use K-1 subsets to train the model

  3. Test the model against the one subset that was left out in the previous step

  4. Repeat the above steps K times, i.e., until the model has been trained and tested on every subset

  5. Generate the overall prediction error by averaging the prediction errors from every fold

set.seed(1)
train.control = trainControl(method = "repeatedcv", number = 10)  #10-fold CV #change the method
model_kfold = train(medv ~., data = Boston, method = "lm",
               trControl = train.control)
print(model_kfold)
## Linear Regression 
## 
## 506 samples
##  13 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 458, 455, 455, 455, 456, 455, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   4.809965  0.7332317  3.370754
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Advantages:

  • Fast computation speed.

  • A very effective method to estimate the prediction error and the accuracy of a model.

Disadvantages:

  • A lower value of K leads to a more biased estimate, and a higher value of K can increase the variability of the model’s performance metrics. Thus, it is very important to use the correct value of K for the model (generally K = 5 or K = 10 is desirable).

Repeated K-fold CV

The K-fold cross-validation algorithm is repeated a certain number of times, and the final model error is taken as the mean error across the repeats.

train.control2 = trainControl(method = "repeatedcv", number = 10, repeats = 6) #repeat 6 times
model_rep_kfold = train(medv ~., data = Boston, method = "lm",
               trControl = train.control2)
print(model_rep_kfold)
## Linear Regression 
## 
## 506 samples
##  13 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 6 times) 
## Summary of sample sizes: 455, 455, 456, 456, 457, 454, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   4.785595  0.7339684  3.372864
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Advantages:

  • In each repetition the data sample is reshuffled, which results in different splits of the sample data.

Disadvantages:

  • With each repetition the algorithm has to train the model from scratch, which means the computation time to evaluate the model grows with the number of repetitions.

Comparison

validation=perf_metrics
loocv=model_loocv$results[,2:4]
kfold=model_kfold$results[,2:4]
rep_kfold=model_rep_kfold$results[,2:4]
res=rbind(validation,loocv,kfold,rep_kfold)
rownames(res)=c("validation","loocv","kfold","rep_kfold")
res
##                RMSE  Rsquared      MAE
## validation 5.120283 0.7126380 3.375619
## loocv      4.870908 0.7191505 3.382796
## kfold      4.809965 0.7332317 3.370754
## rep_kfold  4.785595 0.7339684 3.372864

Which one seems reasonable?

Polynomial Regression

rmse = numeric(7)
for (i in 1:7){
  model = lm(medv ~ poly(lstat, i), data = train)  #fit a degree-i polynomial on the training set
  fit.test = predict(model, newdata=test)
  rmse[i] = sqrt(mean((fit.test-test$medv)^2))  #test-set RMSE for degree i
  
  print(rmse[i])
}
## [1] 5.820606
## [1] 4.93455
## [1] 4.676
## [1] 4.404142
## [1] 4.206907
## [1] 4.251626
## [1] 4.250369
data.frame(degree=which.min(rmse),rmse=rmse[which.min(rmse)])
##   degree     rmse
## 1      5 4.206907
library(ggplot2)
ggplot(train, aes(x=lstat, y=medv)) + 
          geom_point() +
          stat_smooth(method='lm', formula = y ~ poly(x,5), size = 1) + 
          xlab('lstat') +
          ylab('medv')
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

plot(1:7, rmse,
  xlab = "k-th order",
  ylab = "RMSE",
  type = "b",
  col = "blue",
  pch = 16
)
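Note that the chosen degree depends on this particular train/test split. As a sketch, the same search could instead use 10-fold CV on the full Boston data via caret (the object names f, m, and cv_rmse are illustrative):

set.seed(1)
cv_rmse = numeric(7)
for (i in 1:7){
  f = as.formula(paste0("medv ~ poly(lstat, ", i, ")")) #build the degree-i formula
  m = train(f, data = Boston, method = "lm",
            trControl = trainControl(method = "cv", number = 10))
  cv_rmse[i] = m$results$RMSE #cross-validated RMSE for degree i
}
which.min(cv_rmse) #degree with the lowest cross-validated RMSE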

Encoding Categorical Data

Encoding refers to the process of representing categorical variables as numerical values that can be used in machine learning algorithms or statistical analyses. Categorical variables are variables that have a limited, fixed number of possible values or categories.

See how to encode the categorical variables:

Create a data set:

set.seed(1)
y=factor(sample(c(0,1),9,replace = TRUE))
x = factor(sample(c("apple","banana","carrot"),9,replace = TRUE))
z = factor(sample(c("A","B","C"), 9,replace=TRUE))
data = data.frame(y,x,z)
str(data)
## 'data.frame':    9 obs. of  3 variables:
##  $ y: Factor w/ 2 levels "0","1": 1 2 1 1 2 1 1 1 2
##  $ x: Factor w/ 3 levels "apple","banana",..: 2 3 3 1 1 1 2 2 2
##  $ z: Factor w/ 3 levels "A","B","C": 2 3 1 3 1 1 1 1 2

One Hot Encoding

Creating binary columns for each category. Each column represents one category, and the presence of a category is indicated by a 1, while its absence is indicated by a 0. This is useful when categories don’t have any ordinal relationship.

library(fastDummies)
## Warning: package 'fastDummies' was built under R version 4.3.3
## Thank you for using fastDummies!
## To acknowledge our work, please cite the package:
## Kaplan, J. & Schlegel, B. (2023). fastDummies: Fast Creation of Dummy (Binary) Columns and Rows from Categorical Variables. Version 1.7.1. URL: https://github.com/jacobkap/fastDummies, https://jacobkap.github.io/fastDummies/.
dummy_cols(data, select_columns = c("x","z"), remove_selected_columns = FALSE)
##   y      x z x_apple x_banana x_carrot z_A z_B z_C
## 1 0 banana B       0        1        0   0   1   0
## 2 1 carrot C       0        0        1   0   0   1
## 3 0 carrot A       0        0        1   1   0   0
## 4 0  apple C       1        0        0   0   0   1
## 5 1  apple A       1        0        0   1   0   0
## 6 0  apple A       1        0        0   1   0   0
## 7 0 banana A       0        1        0   1   0   0
## 8 0 banana A       0        1        0   1   0   0
## 9 1 banana B       0        1        0   0   1   0
#OR

library(mltools)
## Warning: package 'mltools' was built under R version 4.3.3
library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
data.table = as.data.table(data)
data.encoded = one_hot(data.table)
data.encoded
##    y_0 y_1 x_apple x_banana x_carrot z_A z_B z_C
## 1:   1   0       0        1        0   0   1   0
## 2:   0   1       0        0        1   0   0   1
## 3:   1   0       0        0        1   1   0   0
## 4:   1   0       1        0        0   0   0   1
## 5:   0   1       1        0        0   1   0   0
## 6:   1   0       1        0        0   1   0   0
## 7:   1   0       0        1        0   1   0   0
## 8:   1   0       0        1        0   1   0   0
## 9:   0   1       0        1        0   0   1   0

Dummy Encoding

Similar to one-hot encoding, but one category is dropped to avoid multicollinearity. The dropped category becomes the reference category.

dummy_cols(data, select_columns = c("x","z"), remove_selected_columns = FALSE,remove_first_dummy  = TRUE)
##   y      x z x_banana x_carrot z_B z_C
## 1 0 banana B        1        0   1   0
## 2 1 carrot C        0        1   0   1
## 3 0 carrot A        0        1   0   0
## 4 0  apple C        0        0   0   1
## 5 1  apple A        0        0   0   0
## 6 0  apple A        0        0   0   0
## 7 0 banana A        1        0   0   0
## 8 0 banana A        1        0   0   0
## 9 1 banana B        1        0   1   0
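Base R’s model.matrix() produces the same treatment (dummy) coding, dropping the first level of each factor as the reference; a minimal sketch:

model.matrix(~ x + z, data = data)[, -1] #drop the intercept column; the remaining columns are the dummy variables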

Label Encoding

Assigning each category a unique integer. However, this might imply ordinality (e.g., 1,2,3) where there isn’t any, which could lead to incorrect model assumptions.

data.frame(z=data$z,code=as.integer(factor(data$z)))
##   z code
## 1 B    2
## 2 C    3
## 3 A    1
## 4 C    3
## 5 A    1
## 6 A    1
## 7 A    1
## 8 A    1
## 9 B    2

Count Encoding

Count encoding is a technique used to encode categorical variables by replacing each category with the count of occurrences of that category in the dataset.

table(data$z)
## 
## A B C 
## 5 2 2
data.frame(z=data$z,code= factor(data$z,levels = c('A', 'B', 'C'), labels = table(data$z)))
##   z code
## 1 B    2
## 2 C    2
## 3 A    5
## 4 C    2
## 5 A    5
## 6 A    5
## 7 A    5
## 8 A    5
## 9 B    2
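The same encoding can be obtained more directly by indexing the frequency table with the category names; a small sketch:

data.frame(z = data$z, code = as.vector(table(data$z)[as.character(data$z)])) #look up each observation's category count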

Target Encoding

Target encoding replaces each category with a statistic of the target variable (e.g., its mean or sum) computed within that category.

library(dataPreparation)
## Warning: package 'dataPreparation' was built under R version 4.3.3
## dataPreparation 1.1.1
## Type data_preparation_news() to see new features/changes/bug fixes.
require(data.table) # Build a new data set
data_set = data.table(student = c("Marie", "Marie", "Pierre", "Louis", "Louis"),
                      grades = c(1, 1, 2, 3, 4))


target_encoding = build_target_encoding(data_set, cols_to_encode = "student",
                                         target_col = "grades", functions = c("mean", "sum"))
## [1] "build_target_encoding: Start to compute encoding for target_encoding according to col: grades."
target_encode(data_set, target_encoding = target_encoding)
## [1] "target_encode: Start to encode columns according to target."
##    student grades grades_mean_by_student grades_sum_by_student
## 1:   Marie      1                    1.0                     2
## 2:   Marie      1                    1.0                     2
## 3:  Pierre      2                    2.0                     2
## 4:   Louis      3                    3.5                     7
## 5:   Louis      4                    3.5                     7
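The mean-by-group column can also be reproduced in base R with ave(); a minimal sketch:

ave(data_set$grades, data_set$student, FUN = mean) #per-student mean of grades, aligned with each row of data_set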