Remember that we have seen several imputation techniques:
What about multiple imputation?
Multiple imputation fills in missing values by generating plausible numbers derived from distributions and relationships among observed variables in the data set. This renders M different versions of the data set, where the non-missing data is identical, but the missing data entries differ. Discarding all partially observed data units (e.g., through listwise deletion) is generally not recommended because it can lead to substantial bias and poor predictions
We can use the mice (Multiple Imputation with Chained Equations) to perform multiple imputation and the subsequent analysis.The mice function performs the imputation via chained equations. (see for details, https://www.jstatsoft.org/article/view/v045i03)
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
iris = read.table("iris_mis.txt",sep=",") #load the data
colnames(iris)=c("var1","var2","var3","var4","var5")
dim(iris)## [1] 150 5
## [1] 42
The data includes 150 rows and 5 columns. Additionally, it has 42 missing observations.
Let’s visualize the missingness pattern via VIM package:
## Zorunlu paket yükleniyor: colorspace
## Zorunlu paket yükleniyor: grid
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
##
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
##
## sleep
mice_plot = aggr(iris, col=c('darkred','orange'),
numbers=TRUE, sortVars=TRUE, prop=FALSE,
labels=names(iris), cex.axis=.7,
gap=3, ylab=c("Missing data","Pattern")) #gap: about margin btw two plots & prop=FALSE: you will see the counts if TRUE you'll see percentages##
## Variables sorted by number of missings:
## Variable Count
## var2 13
## var1 12
## var3 10
## var4 7
## var5 0
Comment on the output! e.g., the highest number of missing data belongs to var 2 with a value of 13
OR
e.g., var5 has no missing observations.
Non missing values: 117 Missing values in var3: 10 Missing values in both var2 and var1: 9
set.seed(1)
iris.imp = mice(iris, m = 5, method = "pmm") #m: number of imputation #you can set the seed inside the function via "seed=1".##
## iter imp variable
## 1 1 var1 var2 var3 var4
## 1 2 var1 var2 var3 var4
## 1 3 var1 var2 var3 var4
## 1 4 var1 var2 var3 var4
## 1 5 var1 var2 var3 var4
## 2 1 var1 var2 var3 var4
## 2 2 var1 var2 var3 var4
## 2 3 var1 var2 var3 var4
## 2 4 var1 var2 var3 var4
## 2 5 var1 var2 var3 var4
## 3 1 var1 var2 var3 var4
## 3 2 var1 var2 var3 var4
## 3 3 var1 var2 var3 var4
## 3 4 var1 var2 var3 var4
## 3 5 var1 var2 var3 var4
## 4 1 var1 var2 var3 var4
## 4 2 var1 var2 var3 var4
## 4 3 var1 var2 var3 var4
## 4 4 var1 var2 var3 var4
## 4 5 var1 var2 var3 var4
## 5 1 var1 var2 var3 var4
## 5 2 var1 var2 var3 var4
## 5 3 var1 var2 var3 var4
## 5 4 var1 var2 var3 var4
## 5 5 var1 var2 var3 var4
#options: "pmm" (Predictive mean matching), "logreg" (Logistic regression), "polyreg" (Polytomous logistic regression), "polr" (Proportional odds model)The default number of imputations, controlled by the m argument, is 5. It is better if you increase the iteration number.
The other major argument to be concerned with, method, controls how each imputed value is computed.
| 1 | 2 | 3 | 4 | 5 | |
|---|---|---|---|---|---|
| 9 | 1.6 | 1.6 | 1.6 | 1.3 | 1.4 |
| 20 | 1.5 | 1.4 | 1.9 | 1.9 | 1.5 |
| 23 | 1.3 | 1.6 | 1.5 | 1.3 | 1.3 |
| 34 | 1.5 | 1.6 | 1.6 | 1.3 | 1.4 |
| 54 | 4.5 | 4.5 | 4.0 | 4.7 | 4.7 |
| 110 | 5.9 | 6.7 | 6.6 | 6.7 | 6.7 |
| 117 | 5.9 | 5.1 | 5.0 | 5.9 | 4.9 |
| 127 | 5.1 | 5.5 | 4.5 | 5.0 | 4.5 |
| 142 | 6.7 | 5.2 | 6.7 | 5.6 | 5.9 |
| 150 | 4.7 | 4.4 | 4.7 | 5.1 | 4.4 |
| 1 | 2 | 3 | 4 | 5 | |
|---|---|---|---|---|---|
| 31 | 5.7 | 5.1 | 5.2 | 5.2 | 4.6 |
| 32 | 5.1 | 5.0 | 5.2 | 5.2 | 5.4 |
| 33 | 4.7 | 5.0 | 4.9 | 5.1 | 5.0 |
| 53 | 5.9 | 6.4 | 6.3 | 6.5 | 6.8 |
| 64 | 5.7 | 6.5 | 6.7 | 6.3 | 5.7 |
| 67 | 6.9 | 5.9 | 6.7 | 5.8 | 6.0 |
| 68 | 5.8 | 5.7 | 6.2 | 6.5 | 6.3 |
| 88 | 6.4 | 6.0 | 6.0 | 6.4 | 5.6 |
| 93 | 6.1 | 6.2 | 5.8 | 5.8 | 4.9 |
| 121 | 6.4 | 6.7 | 6.1 | 6.7 | 6.9 |
| 138 | 6.2 | 6.4 | 6.7 | 7.7 | 6.8 |
| 148 | 6.7 | 6.5 | 6.0 | 6.4 | 6.2 |
iris.comp=complete(iris.imp, "long", include = TRUE) #Extract the "tall" matrix which stacks the imputations
iris.comp$var1.NA = cci(iris$var1) #cci returns logical whether its input is complete at each observation
head(iris.comp[, c("var1", "var1.NA")],25)| var1 | var1.NA |
|---|---|
| 5.1 | TRUE |
| 4.9 | TRUE |
| 4.7 | TRUE |
| 4.6 | TRUE |
| 5.0 | TRUE |
| 5.4 | TRUE |
| 4.6 | TRUE |
| 5.0 | TRUE |
| 4.4 | TRUE |
| 4.9 | TRUE |
| 5.4 | TRUE |
| 4.8 | TRUE |
| 4.8 | TRUE |
| 4.3 | TRUE |
| 5.8 | TRUE |
| 5.7 | TRUE |
| 5.4 | TRUE |
| 5.1 | TRUE |
| 5.7 | TRUE |
| 5.1 | TRUE |
| 5.4 | TRUE |
| 5.1 | TRUE |
| 4.6 | TRUE |
| 5.1 | TRUE |
| 4.8 | TRUE |
library(ggplot2)
ggplot(iris.comp,
aes(x = .imp, y = var1, color = var1.NA)) +
geom_jitter(show.legend = FALSE,
width = .1)## Warning: Removed 12 rows containing missing values (`geom_point()`).
The plot above shows how imputed variables behave among unimputed data set (see for var1). Because we used the default method (pmm which is the most common one), we’re guaranteed not to have any extreme values.
## var1 var2 var3 var4 var5
## 12 13 10 7 0
## var1 var2 var3 var4 var5
## "pmm" "pmm" "pmm" "pmm" ""
## var1 var2 var3 var4 var5
## var1 0 1 1 1 0
## var2 1 0 1 1 0
## var3 1 1 0 1 0
## var4 1 1 1 0 0
## var5 1 1 1 1 0
#Customization
imp_pred[,c("var4")]=0 #remove the variable var4 as a predictor (not be included as a predictor during imputation) but still will be imputed.
imp_meth[c("var2")]="lasso.norm" #to impute the missing value of the var2, we prefer to use Lasso regression
set.seed(1)
iris.imp.changed = mice(iris, method=imp_meth, predictorMatrix=imp_pred, m=5) #specify method and predictor matrix and then impute##
## iter imp variable
## 1 1 var1 var2 var3 var4
## 1 2 var1 var2 var3 var4
## 1 3 var1 var2 var3 var4
## 1 4 var1 var2 var3 var4
## 1 5 var1 var2 var3 var4
## 2 1 var1 var2 var3 var4
## 2 2 var1 var2 var3 var4
## 2 3 var1 var2 var3 var4
## 2 4 var1 var2 var3 var4
## 2 5 var1 var2 var3 var4
## 3 1 var1 var2 var3 var4
## 3 2 var1 var2 var3 var4
## 3 3 var1 var2 var3 var4
## 3 4 var1 var2 var3 var4
## 3 5 var1 var2 var3 var4
## 4 1 var1 var2 var3 var4
## 4 2 var1 var2 var3 var4
## 4 3 var1 var2 var3 var4
## 4 4 var1 var2 var3 var4
## 4 5 var1 var2 var3 var4
## 5 1 var1 var2 var3 var4
## 5 2 var1 var2 var3 var4
## 5 3 var1 var2 var3 var4
## 5 4 var1 var2 var3 var4
## 5 5 var1 var2 var3 var4
## var1 var2 var3 var4 var5
## "pmm" "lasso.norm" "pmm" "pmm" ""
## var1 var2 var3 var4 var5
## var1 0 1 1 0 0
## var2 1 0 1 0 0
## var3 1 1 0 0 0
## var4 1 1 1 0 0
## var5 1 1 1 0 0
| 1 | 2 | 3 | 4 | 5 | |
|---|---|---|---|---|---|
| 9 | 1.5 | 1.4 | 1.6 | 1.3 | 1.5 |
| 20 | 1.9 | 1.1 | 1.6 | 1.5 | 1.3 |
| 23 | 1.4 | 1.1 | 1.4 | 1.3 | 1.6 |
| 34 | 1.9 | 1.6 | 1.3 | 1.3 | 1.1 |
| 54 | 4.8 | 4.5 | 4.5 | 4.5 | 4.7 |
| 110 | 5.0 | 4.7 | 4.5 | 5.5 | 5.6 |
| 117 | 5.6 | 5.6 | 5.7 | 5.8 | 4.3 |
| 127 | 5.1 | 5.6 | 5.6 | 5.1 | 5.6 |
| 142 | 5.0 | 5.4 | 4.7 | 5.5 | 5.4 |
| 150 | 4.1 | 4.1 | 5.1 | 4.2 | 4.1 |
| 1 | 2 | 3 | 4 | 5 | |
|---|---|---|---|---|---|
| 31 | 4.7 | 5.0 | 5.1 | 4.6 | 4.9 |
| 32 | 4.4 | 4.9 | 5.1 | 5.0 | 5.1 |
| 33 | 5.0 | 4.8 | 5.0 | 5.4 | 4.4 |
| 53 | 6.4 | 6.7 | 6.7 | 5.8 | 7.0 |
| 64 | 5.7 | 5.2 | 5.6 | 6.1 | 4.9 |
| 67 | 5.9 | 5.6 | 5.9 | 6.3 | 5.9 |
| 68 | 6.2 | 6.2 | 5.5 | 5.9 | 6.2 |
| 88 | 6.4 | 5.4 | 6.3 | 5.2 | 6.6 |
| 93 | 5.2 | 5.7 | 5.7 | 5.6 | 6.2 |
| 121 | 7.7 | 6.3 | 6.2 | 6.5 | 6.5 |
| 138 | 6.5 | 6.4 | 6.8 | 6.5 | 6.3 |
| 148 | 6.5 | 6.8 | 5.9 | 6.4 | 6.4 |
Hmisc is a multiple purpose package useful for data analysis, high level graphics, imputing missing values, advanced table making, model fitting & diagnostics (linear regression, logistic regression & cox regression etc.).
impute() function imputes missing value using user defined statistical method (mean, max, median). It’s default is median.
Different bootstrap resamples are used for each of multiple imputations. Then, a flexible additive model (non parametric regression method) is fitted on samples taken with replacements from original data and missing values (acts as dependent variable) are predicted using non-missing values (independent variable). Then, it uses predictive mean matching (default) to impute missing values.
It is nice to know that it assumes linearity in the variables being predicted. Fishers optimum scoring method is used for predicting categorical variables.
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
## var1 var2 var3 var4
## Min. :4.300 Min. :2.000 Min. :1.100 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.750 Median :3.000 Median :4.350 Median :1.300
## Mean :5.832 Mean :3.062 Mean :3.771 Mean :1.157
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## NA's :12 NA's :13 NA's :10 NA's :7
## var5
## Length:150
## Class :character
## Mode :character
##
##
##
##
##
## 10 values imputed to 3.770714
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.100 1.600 4.200 3.771 5.075 6.900
## 1 2 3 4 5 6 7 8
## 1.400000 1.400000 1.300000 1.500000 1.400000 1.700000 1.400000 1.500000
## 9 10 11 12 13 14 15 16
## 3.770714* 1.500000 1.500000 1.600000 1.400000 1.100000 1.200000 1.500000
## 17 18 19 20 21 22 23 24
## 1.300000 1.400000 1.700000 3.770714* 1.700000 1.500000 3.770714* 1.700000
## 25
## 1.900000
set.seed(1)
iris_random = with(iris, impute(var3, 'random')) #similarly you can use min, max, median to impute missing value
head(iris_random,25) #missing values are imputed by random value in the dataset.## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 4.9* 1.5 1.5 1.6 1.4 1.1 1.2 1.5
## 17 18 19 20 21 22 23 24 25
## 1.3 1.4 1.7 5.6* 1.7 1.5 1.6* 1.7 1.9
The output shows R2 values for predicted missing values. Higher the value, better are the values predicted. You can also check imputed values using the following command.
impute_arg = aregImpute(~ var1 + var2 + var3 + var4 + var5, data = iris, n.impute = 5) #argImpute() automatically identifies the variable type and treats them accordingly.## Iteration 1
Iteration 2
Iteration 3
Iteration 4
Iteration 5
Iteration 6
Iteration 7
Iteration 8
impute_arg$imputed$var3 # aregImpute takes all aspects of uncertainty in the imputations into account by using the bootstrap to approximate the process of drawing predicted values from a full Bayesian predictive distribution. ## [,1] [,2] [,3] [,4] [,5]
## 9 1.6 1.4 1.5 1.3 1.3
## 20 1.3 1.5 1.5 1.4 1.3
## 23 1.4 1.3 1.6 1.5 1.1
## 34 1.4 1.2 1.6 1.4 1.3
## 54 4.4 3.9 3.9 3.0 4.2
## 110 6.6 6.6 6.0 6.1 6.3
## 117 4.9 5.6 5.8 5.9 5.3
## 127 6.0 4.8 5.1 5.3 5.1
## 142 6.0 5.6 5.7 6.3 5.9
## 150 5.1 5.1 5.1 5.0 4.8
Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called overfitting.
A solution to this problem is a procedure called cross-validation (CV for short). A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV.
The most popular cross-validation techniques:
Validation Set Approach
Leave one out cross-validation(LOOCV)
K-fold cross-validation
Repeated K-fold cross-validation
## Warning: package 'MASS' was built under R version 4.3.3
In this method, the dataset is divided randomly into training and testing sets.
Generally, 80% of the dataset is split as a train set and 20% of the dataset is split as a test set. (OR, 70%: train set and 30% : test set). It depends on your preference.
Following steps are performed to implement this technique:
A random sampling of the dataset
Model is trained on the training data set
The final model is applied to the testing data set
Calculate prediction error by using model performance metrics
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:Hmisc':
##
## src, summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Warning: package 'caret' was built under R version 4.3.3
## Zorunlu paket yükleniyor: lattice
set.seed(1)
train_ind = Boston$medv %>%
createDataPartition(p = 0.8, list = FALSE) #createDataPartition helps you define train set index #ratio: 80% and 20%
train = Boston[train_ind, ]
test = Boston[-train_ind, ]
d_original=dim(Boston)
d_train=dim(train)
d_test=dim(test)
dimens=cbind(d_original,d_train,d_test)
rownames(dimens)=c("number of rows","number of columns")
dimens## d_original d_train d_test
## number of rows 506 407 99
## number of columns 14 14 14
See the dimension for each data. The number of columns stays the same. However, the number of rows is changed as expected.
##
## Call:
## lm(formula = medv ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.6891 -2.6859 -0.5393 1.8871 25.5015
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.334452 5.582317 5.792 1.42e-08 ***
## crim -0.130344 0.034279 -3.802 0.000166 ***
## zn 0.052932 0.015536 3.407 0.000724 ***
## indus -0.023633 0.064464 -0.367 0.714109
## chas 3.253159 0.957335 3.398 0.000748 ***
## nox -16.050622 4.059695 -3.954 9.13e-05 ***
## rm 4.095705 0.456716 8.968 < 2e-16 ***
## age -0.012316 0.014639 -0.841 0.400679
## dis -1.636081 0.221466 -7.387 9.05e-13 ***
## rad 0.278750 0.069715 3.998 7.62e-05 ***
## tax -0.011576 0.003892 -2.975 0.003116 **
## ptratio -0.817708 0.142322 -5.745 1.84e-08 ***
## black 0.010540 0.002888 3.650 0.000298 ***
## lstat -0.494600 0.055517 -8.909 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.675 on 393 degrees of freedom
## Multiple R-squared: 0.7474, Adjusted R-squared: 0.7391
## F-statistic: 89.46 on 13 and 393 DF, p-value: < 2.2e-16
Make a comment on the output. However, to satisfy the consistency we’ll not exclude any variables from the model at this level.
## 17 18 21 23 24 30 32
## 21.1688793 17.1753534 12.6968098 16.1798960 13.9698327 21.1720709 18.0476643
## 37 41 43 44 49 50 53
## 22.6121537 35.2432337 25.4565332 24.9381469 8.7293878 16.8453308 27.6538427
## 58 61 65 69 73 76 82
## 33.0657243 17.6170132 23.1103489 17.4728905 24.8064986 23.9424957 27.0306435
## 86 92 104 118 129 132 138
## 27.9718424 27.6330339 20.7445633 23.6268309 19.2088078 19.4771875 19.6093608
## 145 154 159 162 167 173 174
## 8.6089508 16.7847009 27.9820632 36.0994981 36.6126852 22.8296782 29.0444214
## 188 192 197 199 202 205 213
## 33.8880618 30.3774500 36.1220488 34.7115703 29.4787421 43.9505929 23.4967080
## 235 236 237 240 241 245 250
## 32.2620437 25.2392448 30.4894557 28.3591645 27.2554706 15.6226082 24.3309210
## 255 270 272 277 278 281 283
## 23.8356867 26.5211681 27.8396482 36.5863508 35.9047516 38.6861122 40.8183217
## 286 287 293 308 309 310 311
## 27.4136825 20.4149840 32.7831636 33.2204608 28.5239075 23.5577286 18.7980456
## 312 314 315 318 321 323 327
## 27.1455724 25.4286846 25.3793012 18.3886545 25.1382982 23.0241484 24.0288252
## 347 348 350 369 380 383 386
## 14.4948424 25.6349727 22.6311726 22.8927592 16.6366449 13.3040134 7.9591663
## 387 390 391 393 399 407 413
## 5.5717784 13.9730318 16.9735732 9.6425054 5.9662802 7.2391219 0.8205434
## 415 418 420 424 429 434 436
## -5.5122007 6.1870867 14.6408508 12.7430413 14.1227450 16.7201880 13.1950824
## 445 460 462 476 485 486 494
## 11.2771605 18.3721848 20.1494107 15.8629210 19.4486860 22.0785962 21.1953072
## 500
## 18.8193384
perf_metrics=data.frame( RMSE = RMSE(pred, test$medv), Rsquared = R2(pred, test$medv),
MAE = MAE(pred, test$medv))
perf_metrics #see performance metrics of the model for test set| RMSE | Rsquared | MAE |
|---|---|---|
| 5.120283 | 0.712638 | 3.375619 |
Advantages:
One of the most basic and simple techniques for evaluating a model.
No complex steps for implementation.
Disadvantages:
Predictions done by the model is highly dependent upon the subset of observations used for training and validation.
Using only one subset of the data for training purposes can make the model biased.
This method also splits the data set into 2 parts but it overcomes the drawbacks of the validation set approach. LOOCV carry out the cross-validation in the following way:
Train the model on N-1 data points
Testing the model against that one data points which was left in the previous step
Calculate prediction error
Repeat above 3 steps until the model is not trained and tested on all data points
Generate overall prediction error by taking the average of prediction errors in every case
train.control = trainControl(method = "LOOCV") #change the method into LOOCV
model_loocv = train(medv ~., data = Boston, method = "lm", trControl = train.control)
print(model_loocv) #we have 506 total number of obs and each step 505 of them are used.## Linear Regression
##
## 506 samples
## 13 predictor
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 505, 505, 505, 505, 505, 505, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 4.870908 0.7191505 3.382797
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
The pros of the LOOCV method is that we make use all data points reducing potential bias.
The cons of the LOOCV method is that training the model N times leads to expensive computation time if the data set is large.
Additionally, we test the model performance against one data point at each iteration. This might result to higher variation in the prediction error, if some data points are outliers. So, we need a good ratio of testing data points, a solution provided by the k fold cross validation method.
This cross-validation technique divides the data into K subsets(folds) of almost equal size. Out of these K folds, one subset is used as a validation set, and rest others are involved in training the model.
In practice, one typically performs k-fold cross-validation using k = 5 or k = 10, as these values have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance.
Following are the complete working procedure of this method:
Split the data set into K subsets randomly
Use K-1 subsets for training the model
Test the model against that one subset that was left in the previous step
Repeat the above steps for K times i.e., until the model is not trained and tested on all subsets
Generate overall prediction error by taking the average of prediction errors in every case
set.seed(1)
train.control = trainControl(method = "repeatedcv", number = 10) #10-fold CV #change the methodmodel_kfold = train(medv ~., data = Boston, method = "lm",
trControl = train.control)
print(model_kfold)## Linear Regression
##
## 506 samples
## 13 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 458, 455, 455, 455, 456, 455, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 4.809965 0.7332317 3.370754
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Advantages:
Fast computation speed.
A very effective method to estimate the prediction error and the accuracy of a model.
Disadvantages:
K-fold cross-validation algorithm is repeated a certain number of times. The final model error is taken as the mean error from the number of repeats.
train.control2 = trainControl(method = "repeatedcv", number = 10, repeats = 6) #repeat 6 times
model_rep_kfold = train(medv ~., data = Boston, method = "lm",
trControl = train.control2)
print(model_rep_kfold)## Linear Regression
##
## 506 samples
## 13 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 6 times)
## Summary of sample sizes: 455, 455, 456, 456, 457, 454, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 4.785595 0.7339684 3.372864
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Advantages:
Disadvantages:
validation=perf_metrics
loocv=model_loocv$results[,2:4]
kfold=model_kfold$results[,2:4]
rep_kfold=model_rep_kfold$results[,2:4]
rbind(validation,loocv,kfold,rep_kfold)| RMSE | Rsquared | MAE |
|---|---|---|
| 5.120283 | 0.7126380 | 3.375619 |
| 4.870908 | 0.7191505 | 3.382796 |
| 4.809965 | 0.7332317 | 3.370754 |
| 4.785595 | 0.7339684 | 3.372864 |
Which one seems reasonable?
rmse = numeric(7)
for (i in 1:7){
model = lm(medv ~ poly(lstat, i), data = train)
fit.test = predict(model, newdata=test)
rmse[i]=sqrt(mean((fit.test-test$medv)^2))
print(rmse[i])
}## [1] 5.820606
## [1] 4.93455
## [1] 4.676
## [1] 4.404142
## [1] 4.206907
## [1] 4.251626
## [1] 4.250369
| degree | rmse |
|---|---|
| 5 | 4.206907 |
library(ggplot2)
ggplot(train, aes(x=lstat, y=medv)) +
geom_point() +
stat_smooth(method='lm', formula = y ~ poly(x,5), size = 1) +
xlab('Hours Studied') +
ylab('Score')## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Encoding refers to the process of representing categorical variables as numerical values that can be used in machine learning algorithms or statistical analyses. Categorical variables are variables that have a limited, fixed number of possible values or categories.
See how to encode the categorical variables:
Create a data set:
set.seed(1)
y=factor(sample(c(0,1),9,replace = TRUE))
x = factor(sample(c("apple","banana","carrot"),9,replace = TRUE))
z = factor(sample(c("A","B","C"), 9,replace=TRUE))
data = data.frame(y,x,z)
str(data)## 'data.frame': 9 obs. of 3 variables:
## $ y: Factor w/ 2 levels "0","1": 1 2 1 1 2 1 1 1 2
## $ x: Factor w/ 3 levels "apple","banana",..: 2 3 3 1 1 1 2 2 2
## $ z: Factor w/ 3 levels "A","B","C": 2 3 1 3 1 1 1 1 2
Creating binary columns for each category. Each column represents one category, and the presence of a category is indicated by a 1, while its absence is indicated by a 0. This is useful when categories don’t have any ordinal relationship.
## Warning: package 'fastDummies' was built under R version 4.3.3
## Thank you for using fastDummies!
## To acknowledge our work, please cite the package:
## Kaplan, J. & Schlegel, B. (2023). fastDummies: Fast Creation of Dummy (Binary) Columns and Rows from Categorical Variables. Version 1.7.1. URL: https://github.com/jacobkap/fastDummies, https://jacobkap.github.io/fastDummies/.
| y | x | z | x_apple | x_banana | x_carrot | z_A | z_B | z_C |
|---|---|---|---|---|---|---|---|---|
| 0 | banana | B | 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | carrot | C | 0 | 0 | 1 | 0 | 0 | 1 |
| 0 | carrot | A | 0 | 0 | 1 | 1 | 0 | 0 |
| 0 | apple | C | 1 | 0 | 0 | 0 | 0 | 1 |
| 1 | apple | A | 1 | 0 | 0 | 1 | 0 | 0 |
| 0 | apple | A | 1 | 0 | 0 | 1 | 0 | 0 |
| 0 | banana | A | 0 | 1 | 0 | 1 | 0 | 0 |
| 0 | banana | A | 0 | 1 | 0 | 1 | 0 | 0 |
| 1 | banana | B | 0 | 1 | 0 | 0 | 1 | 0 |
## Warning: package 'mltools' was built under R version 4.3.3
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
| y_0 | y_1 | x_apple | x_banana | x_carrot | z_A | z_B | z_C |
|---|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
Similar to one-hot encoding, but one category is dropped to avoid multicollinearity. The dropped category becomes the reference category.
dummy_cols(data, select_columns = c("x","z"), remove_selected_columns = FALSE,remove_first_dummy = TRUE)| y | x | z | x_banana | x_carrot | z_B | z_C |
|---|---|---|---|---|---|---|
| 0 | banana | B | 1 | 0 | 1 | 0 |
| 1 | carrot | C | 0 | 1 | 0 | 1 |
| 0 | carrot | A | 0 | 1 | 0 | 0 |
| 0 | apple | C | 0 | 0 | 0 | 1 |
| 1 | apple | A | 0 | 0 | 0 | 0 |
| 0 | apple | A | 0 | 0 | 0 | 0 |
| 0 | banana | A | 1 | 0 | 0 | 0 |
| 0 | banana | A | 1 | 0 | 0 | 0 |
| 1 | banana | B | 1 | 0 | 1 | 0 |
Assigning each category a unique integer. However, this might imply ordinality (e.g., 1,2,3) where there isn’t any, which could lead to incorrect model assumptions.
| z | code |
|---|---|
| B | 2 |
| C | 3 |
| A | 1 |
| C | 3 |
| A | 1 |
| A | 1 |
| A | 1 |
| A | 1 |
| B | 2 |
Count encoding is a technique used to encode categorical variables by replacing each category with the count of occurrences of that category in the dataset.
##
## A B C
## 5 2 2
| z | code |
|---|---|
| B | 2 |
| C | 2 |
| A | 5 |
| C | 2 |
| A | 5 |
| A | 5 |
| A | 5 |
| A | 5 |
| B | 2 |
## Warning: package 'dataPreparation' was built under R version 4.3.3
## dataPreparation 1.1.1
## Type data_preparation_news() to see new features/changes/bug fixes.
require(data.table) # Build a new data set
data_set = data.table(student = c("Marie", "Marie", "Pierre", "Louis", "Louis"),
grades = c(1, 1, 2, 3, 4))
target_encoding = build_target_encoding(data_set, cols_to_encode = "student",
target_col = "grades", functions = c("mean", "sum"))## [1] "build_target_encoding: Start to compute encoding for target_encoding according to col: grades."
## [1] "target_encode: Start to encode columns according to target."
| student | grades | grades_mean_by_student | grades_sum_by_student |
|---|---|---|---|
| Marie | 1 | 1.0 | 2 |
| Marie | 1 | 1.0 | 2 |
| Pierre | 2 | 2.0 | 2 |
| Louis | 3 | 3.5 | 7 |
| Louis | 4 | 3.5 | 7 |