1. Linear Regression

Linear regression models the functional relationship between the X and Y components of the data. It was developed in the field of statistics and is studied as a model for understanding the relationship between input and output numerical variables. It assumes a linear relationship between the input variables (x) and the single output variable (y); more specifically, that y can be calculated from a linear combination of the input variables (x).

These equations take the form Y = ∑(Bi * Xi) + e, i.e. Y = B1*X1 + B2*X2 + ... + Bn*Xn + e, where the Bi are the coefficients to be estimated and e is the error term.

As we already know, the objective of regression learning is to obtain the values of the coefficients that minimize the difference between the predicted values and the actual values over the training examples.
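
As a minimal sketch of this idea (using mtcars, which we also use below), the least-squares coefficients that lm() estimates can be computed directly from the normal equations, together with the residual sum of squares they minimize (lm() itself uses a more stable QR decomposition, but the coefficients agree):

x <- mtcars$hp
y <- mtcars$mpg
X <- cbind(1, x)                       # design matrix with an intercept column
beta <- solve(t(X) %*% X, t(X) %*% y)  # normal equations: (X'X) b = X'y
beta                                   # same values as coef(lm(mpg ~ hp, mtcars))
rss <- sum((y - X %*% beta)^2)         # the residual sum of squares being minimized
rss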

1.1 Simple Linear Regression Code Example 1

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.1.3

We will start by examining the data before fitting models by creating a scatter plot. Scatter plots help visualize any linear relationship between the dependent (response) variable and the independent (predictor) variable.

ggplot(mtcars, aes(hp, mpg))+
  geom_point()+
  labs(title = "Gross Horse Power VS Miles Per Gallon",
       x = "hp",
       y = "mpg")

We can also compute the correlation coefficient. Correlation is a statistical measure that suggests the level of linear dependence between two variables, and it takes values between -1 and +1. If for every instance where one variable increases the other increases along with it, there is a high positive correlation between them and the coefficient will be close to +1. The opposite is true for an inverse relationship, in which case the correlation between the variables will be close to -1.

cor(mtcars$hp, mtcars$mpg)
## [1] -0.7761684

The linear model function lm(), used below, creates the relationship model between the predictor and the response variable: mpg ~ hp specifies the relation between y and x, and mtcars is the data frame to which the formula is applied.

simple_lm <- lm(mpg~hp, mtcars)
simple_lm
## 
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
## 
## Coefficients:
## (Intercept)           hp  
##    30.09886     -0.06823

Let’s generate the ANOVA table; it consists of the sums of squares, degrees of freedom, F statistic, and p-value.

anova(simple_lm)
## Analysis of Variance Table
## 
## Response: mpg
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## hp         1 678.37  678.37   45.46 1.788e-07 ***
## Residuals 30 447.67   14.92                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Let’s also predict the response variable for the observations in the data set.

pred1 <- predict(simple_lm, mtcars)
pred1
##           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
##           22.593750           22.593750           23.753631           22.593750 
##   Hornet Sportabout             Valiant          Duster 360           Merc 240D 
##           18.158912           22.934891           13.382932           25.868707 
##            Merc 230            Merc 280           Merc 280C          Merc 450SE 
##           23.617174           21.706782           21.706782           17.817770 
##          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
##           17.817770           17.817770           16.112064           15.429781 
##   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
##           14.406357           25.595794           26.550990           25.664022 
##       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
##           23.480718           19.864619           19.864619           13.382932 
##    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
##           18.158912           25.595794           23.890087           22.389065 
##      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
##           12.086595           18.158912            7.242387           22.661978

1.2 Simple Linear Regression Code Example 2

Let’s consider the following two vectors

height <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
weight <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

Finding the correlation coefficient

cor(height, weight)
## [1] 0.9771296

Now let’s determine the model

relation <- lm(weight~height)
relation
## 
## Call:
## lm(formula = weight ~ height)
## 
## Coefficients:
## (Intercept)       height  
##    -38.4551       0.6746

Print the summary

summary(relation)
## 
## Call:
## lm(formula = weight ~ height)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3002 -1.6629  0.0412  1.8944  3.9775 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -38.45509    8.04901  -4.778  0.00139 ** 
## height        0.67461    0.05191  12.997 1.16e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.253 on 8 degrees of freedom
## Multiple R-squared:  0.9548, Adjusted R-squared:  0.9491 
## F-statistic: 168.9 on 1 and 8 DF,  p-value: 1.164e-06

Now we will wrap the new value inside a data frame and use it to predict the weight of a person with height 170.

a <- data.frame(height = 170)
result <-  predict(relation,a)

result
##        1 
## 76.22869
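
As a quick sanity check (a sketch, not part of the original output), the same value can be reproduced directly from the fitted coefficients, since the prediction is just intercept + slope * height:

sum(coef(relation) * c(1, 170))
## [1] 76.22869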

1.3 Simple Linear Regression Code Example 3

Apply simple linear regression to the faithful data set and estimate the next eruption duration if the waiting time since the last eruption has been 80 minutes.

head(faithful)
##   eruptions waiting
## 1     3.600      79
## 2     1.800      54
## 3     3.333      74
## 4     2.283      62
## 5     4.533      85
## 6     2.883      55

Finding the correlation coefficient

cor(faithful$eruptions, faithful$waiting)
## [1] 0.9008112

Applying the lm() function.

relation_faith <- lm(faithful$eruptions~faithful$waiting)
relation_faith
## 
## Call:
## lm(formula = faithful$eruptions ~ faithful$waiting)
## 
## Coefficients:
##      (Intercept)  faithful$waiting  
##         -1.87402           0.07563
summary(relation_faith)
## 
## Call:
## lm(formula = faithful$eruptions ~ faithful$waiting)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.29917 -0.37689  0.03508  0.34909  1.19329 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1.874016   0.160143  -11.70   <2e-16 ***
## faithful$waiting  0.075628   0.002219   34.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4965 on 270 degrees of freedom
## Multiple R-squared:  0.8115, Adjusted R-squared:  0.8108 
## F-statistic:  1162 on 1 and 270 DF,  p-value: < 2.2e-16

Working the solution: because the model above was fit with faithful$eruptions ~ faithful$waiting (i.e. without a data argument), predict() cannot match a new data frame to those variables and would simply return the 272 fitted values along with a warning. Refitting the model with a data argument lets us predict the eruption duration for a waiting time of 80 minutes.

relation_faith <- lm(eruptions ~ waiting, data = faithful)
e <- data.frame(waiting = 80)
result <- predict(relation_faith, e)

result
##       1 
## 4.17622

2. Multiple Linear Regression

Multiple linear regression is an extension of simple linear regression used to predict an outcome variable (y) on the basis of multiple distinct predictor variables (x). In simple linear regression we have one predictor and one response variable, but in multiple regression we have more than one predictor variable and one response variable.

With three predictor variables (x), the prediction of y is expressed by the following equation:

> y = b0 + b1*x1 + b2*x2 + b3*x3
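
In R formula notation this corresponds to y ~ x1 + x2 + x3. As a small self-contained sketch (with simulated placeholder data, not one of the data sets used below), lm() recovers the coefficients of such an equation:

set.seed(1)
df <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
df$y <- 1 + 2 * df$x1 - 0.5 * df$x2 + 3 * df$x3 + rnorm(50, sd = 0.1)
coef(lm(y ~ x1 + x2 + x3, data = df))  # approximately b0 = 1, b1 = 2, b2 = -0.5, b3 = 3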

2.1 Multiple Linear Regression Code Example 1

We will demonstrate this using the diamonds data set (from ggplot2).

head(diamonds)
## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

Applying the lm() function.

multiple_lm <- lm(price ~ ., diamonds)
multiple_lm
## 
## Call:
## lm(formula = price ~ ., data = diamonds)
## 
## Coefficients:
## (Intercept)        carat        cut.L        cut.Q        cut.C        cut^4  
##    5753.762    11256.978      584.457     -301.908      148.035      -20.794  
##     color.L      color.Q      color.C      color^4      color^5      color^6  
##   -1952.160     -672.054     -165.283       38.195      -95.793      -48.466  
##   clarity.L    clarity.Q    clarity.C    clarity^4    clarity^5    clarity^6  
##    4097.431    -1925.004      982.205     -364.918      233.563        6.883  
##   clarity^7        depth        table            x            y            z  
##      90.640      -63.806      -26.474    -1008.261        9.609      -50.119

Now we will generate the ANOVA table.

anova(multiple_lm)
## Analysis of Variance Table
## 
## Response: price
##              Df     Sum Sq    Mean Sq    F value  Pr(>F)    
## carat         1 7.2913e+11 7.2913e+11 5.7092e+05 < 2e-16 ***
## cut           4 6.1332e+09 1.5333e+09 1.2006e+03 < 2e-16 ***
## color         6 1.2598e+10 2.0997e+09 1.6441e+03 < 2e-16 ***
## clarity       7 3.8452e+10 5.4931e+09 4.3012e+03 < 2e-16 ***
## depth         1 4.9405e+06 4.9405e+06 3.8685e+00 0.04921 *  
## table         1 9.2727e+07 9.2727e+07 7.2606e+01 < 2e-16 ***
## x             1 3.2053e+09 3.2053e+09 2.5098e+03 < 2e-16 ***
## y             1 1.1679e+05 1.1679e+05 9.1400e-02 0.76235    
## z             1 2.8609e+06 2.8609e+06 2.2401e+00 0.13448    
## Residuals 53916 6.8857e+10 1.2771e+06                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We will now make predictions using our model.

pred2 <- predict(multiple_lm, diamonds)
summary(pred2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -4308    1073    2819    3933    5886   39394

2.2 Multiple Linear Regression Code Example 2

We will now use the mtcars data set for this example.

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Let’s choose “disp”, “hp” and “wt” as the predictor variables.

input <- mtcars[,c("mpg","disp","hp","wt")]
head(input)
##                    mpg disp  hp    wt
## Mazda RX4         21.0  160 110 2.620
## Mazda RX4 Wag     21.0  160 110 2.875
## Datsun 710        22.8  108  93 2.320
## Hornet 4 Drive    21.4  258 110 3.215
## Hornet Sportabout 18.7  360 175 3.440
## Valiant           18.1  225 105 3.460

We will now create the relationship model.

model <- lm(mpg~disp+hp+wt, data = input)
model
## 
## Call:
## lm(formula = mpg ~ disp + hp + wt, data = input)
## 
## Coefficients:
## (Intercept)         disp           hp           wt  
##   37.105505    -0.000937    -0.031157    -3.800891

We will now get the summary of our model.

summary(model)
## 
## Call:
## lm(formula = mpg ~ disp + hp + wt, data = input)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.891 -1.640 -0.172  1.061  5.861 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.105505   2.110815  17.579  < 2e-16 ***
## disp        -0.000937   0.010350  -0.091  0.92851    
## hp          -0.031157   0.011436  -2.724  0.01097 *  
## wt          -3.800891   1.066191  -3.565  0.00133 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.639 on 28 degrees of freedom
## Multiple R-squared:  0.8268, Adjusted R-squared:  0.8083 
## F-statistic: 44.57 on 3 and 28 DF,  p-value: 8.65e-11

With our created model, let’s predict the mileage for a car with disp = 221, hp = 102 and wt = 2.91.

a <- data.frame(disp = 221, hp = 102, wt = 2.91)
predicted_mileage <- predict(model, a)
predicted_mileage
##        1 
## 22.65987

2.3 Multiple Linear Regression Code Example 3

In this example we will apply multiple linear regression to the stackloss data set and predict the stack loss if the air flow is 62, the water temperature is 19 and the acid concentration is 84.

First let’s see the data set

head(stackloss)
##   Air.Flow Water.Temp Acid.Conc. stack.loss
## 1       80         27         89         42
## 2       80         27         88         37
## 3       75         25         90         37
## 4       62         24         87         28
## 5       62         22         87         18
## 6       62         23         87         18

Let’s choose our predictors

input <- stackloss[,c("Air.Flow","Water.Temp","Acid.Conc.", "stack.loss")]
head(input)
##   Air.Flow Water.Temp Acid.Conc. stack.loss
## 1       80         27         89         42
## 2       80         27         88         37
## 3       75         25         90         37
## 4       62         24         87         28
## 5       62         22         87         18
## 6       62         23         87         18

We will now create the relationship model.

model <- lm(stack.loss~Air.Flow+Water.Temp+Acid.Conc., data = input)
model
## 
## Call:
## lm(formula = stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., 
##     data = input)
## 
## Coefficients:
## (Intercept)     Air.Flow   Water.Temp   Acid.Conc.  
##    -39.9197       0.7156       1.2953      -0.1521

Determining the summary of our model

summary(model)
## 
## Call:
## lm(formula = stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., 
##     data = input)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.2377 -1.7117 -0.4551  2.3614  5.6978 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -39.9197    11.8960  -3.356  0.00375 ** 
## Air.Flow      0.7156     0.1349   5.307  5.8e-05 ***
## Water.Temp    1.2953     0.3680   3.520  0.00263 ** 
## Acid.Conc.   -0.1521     0.1563  -0.973  0.34405    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.243 on 17 degrees of freedom
## Multiple R-squared:  0.9136, Adjusted R-squared:  0.8983 
## F-statistic:  59.9 on 3 and 17 DF,  p-value: 3.016e-09

Now we will predict the stack loss given the other predictor variables.

a <- data.frame(Air.Flow = 62, Water.Temp = 19, Acid.Conc. = 84)
predicted_stack.loss <- predict(model, a)
predicted_stack.loss
##        1 
## 16.28216

3. Cross Validation

How do we know that an estimated regression model generalizes beyond the sample data used to fit it? Ideally, we can obtain new independent data with which to validate our model. For example, we could refit the model to the new data set to see whether the various characteristics of the model (e.g., the estimated regression coefficients) are consistent with the model fit to the original data set.

However, most of the time we cannot obtain new independent data to validate our model. An alternative is to partition the sample data into a training (or model-building) set, which we can use to develop the model, and a validation (or prediction) set, which is used to evaluate the predictive ability of the model. This is called cross-validation. Again, we can compare the model fit to the training set to the model refit to the validation set to assess consistency. The simplest approach to cross-validation is to partition the sample observations randomly with 50% of the sample in each set.
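
A minimal sketch of this simple hold-out approach, reusing the mpg ~ hp model from above with a random 50/50 split (the exact numbers depend on the seed):

set.seed(1)
idx <- sample(nrow(mtcars), size = nrow(mtcars) / 2)  # 50% of the rows for training
train <- mtcars[idx, ]
valid <- mtcars[-idx, ]
fit <- lm(mpg ~ hp, data = train)
pred <- predict(fit, valid)
sqrt(mean((valid$mpg - pred)^2))  # out-of-sample RMSE on the validation set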

Instead of doing a single training/testing split, we can systematize this process and produce multiple, different out-of-sample train/test splits, which will lead to a better estimate of the out-of-sample RMSE.

For 3-fold cross-validation, for instance, we split the data into 3 random and complementary folds, so that each data point appears in exactly one fold. Each fold is used once as the test set while the model is trained on the remaining folds. This leads to a total test set whose size is identical to that of the full data set, but which is composed entirely of out-of-sample predictions.

Schematic of 3-fold cross validation producing three training (blue) and testing (white) splits.
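
A hand-rolled sketch of 3-fold cross-validation on the same simple model shows the mechanics that caret automates for us below (fold assignments are random, so the RMSE will vary with the seed):

set.seed(2)
k <- 3
folds <- sample(rep(1:k, length.out = nrow(mtcars)))  # assign each row to exactly one fold
cv_pred <- numeric(nrow(mtcars))
for (f in 1:k) {
  fit <- lm(mpg ~ hp, data = mtcars[folds != f, ])           # train on the other folds
  cv_pred[folds == f] <- predict(fit, mtcars[folds == f, ])  # predict the held-out fold
}
sqrt(mean((mtcars$mpg - cv_pred)^2))  # RMSE over the combined out-of-sample predictions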

After cross-validation, all models used within each fold are discarded, and a new model is built using the whole data set, with the best model parameter(s), i.e. those that generalized over all folds.

This makes cross-validation quite time consuming, as it takes x + 1 times as long as fitting a single model (where x is the number of cross-validation folds), but it is essential.

It is important to maintain the class proportions within the different folds, i.e. to respect the proportions of the different classes in the original data. This is also taken care of when using the caret package.
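
For example, caret's createFolds() stratifies the folds on the outcome (class proportions for a factor outcome, quantile groups for a numeric one); a small sketch using the built-in iris species:

library(caret)
folds <- createFolds(iris$Species, k = 3)          # stratified fold indices
sapply(folds, function(i) table(iris$Species[i]))  # each fold keeps roughly equal class counts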

The procedure of creating folds and training the models is handled by the train function in caret. Below, we apply it to the diamond price example that we used when introducing model performance.

> 1. We start by setting a random seed to be able to reproduce the example.

> 2. We specify the method (the learning algorithm) we want to use. Here, we use “lm”, but, as we will see later, there are many others to choose from.

> 3. We then set the out-of-sample training procedure to 10-fold cross validation (method = “cv” and number = 10).

To simplify the output in the material for better readability, we set the verbosity flag to FALSE, but it is useful to set it to TRUE in interactive mode.
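
A sketch of what this looks like, assuming the “verbosity flag” refers to trainControl()’s verboseIter argument (which prints a training log for each resampling iteration when TRUE):

library(caret)
# Assumption: verboseIter is the verbosity flag referred to in the text
train.control <- trainControl(method = "cv", number = 10, verboseIter = FALSE)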

3.1 Cross Validation Code Example 1

library(caret)
## Warning: package 'caret' was built under R version 4.1.3
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 4.1.3

We will use the diamonds data set.

head(diamonds)
## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
  1. Setting the seed to generate reproducible random sampling
set.seed(42)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v tibble  3.1.7     v dplyr   1.0.9
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## v purrr   0.3.4
## Warning: package 'tibble' was built under R version 4.1.3
## Warning: package 'tidyr' was built under R version 4.1.3
## Warning: package 'readr' was built under R version 4.1.2
## Warning: package 'purrr' was built under R version 4.1.2
## Warning: package 'dplyr' was built under R version 4.1.3
## Warning: package 'stringr' was built under R version 4.1.1
## Warning: package 'forcats' was built under R version 4.1.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x purrr::lift()   masks caret::lift()
  2. Creating the training data as 80% of the data set
# Split the data into training and test set
training.samples <- diamonds$price %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data  <- diamonds[training.samples, ]
test.data <- diamonds[-training.samples, ]
# Build the model
model <- lm(price ~., data = train.data)
# Make predictions and compute the R2, RMSE and MAE
predictions <- model %>% predict(test.data)
data.frame( R2 = R2(predictions, test.data$price),
            RMSE = RMSE(predictions, test.data$price),
            MAE = MAE(predictions, test.data$price))
##          R2     RMSE      MAE
## 1 0.9213907 1121.769 735.3145

Train a linear model using 10-fold cross-validation, then calculate the RMSE.

train.control <- trainControl(method = "cv", number = 10)
# Train the model
model <- train(price ~., data = diamonds, method = "lm",
               trControl = train.control)
# Summarize the results
print(model)
## Linear Regression 
## 
## 53940 samples
##     9 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 48545, 48546, 48546, 48546, 48545, 48546, ... 
## Resampling results:
## 
##   RMSE      Rsquared  MAE     
##   1131.281  0.919629  740.6904
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Repeating K-fold cross-validation

set.seed(123)
train.control <- trainControl(method = "repeatedcv", 
                              number = 10, repeats = 3)
# Train the model
model <- train(price ~., data = diamonds, method = "lm",
               trControl = train.control)
# Summarize the results
print(model)
## Linear Regression 
## 
## 53940 samples
##     9 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 48547, 48545, 48547, 48546, 48545, 48546, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE    
##   1130.787  0.9196617  740.552
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Challenge

Train a linear model using 10-fold cross-validation and then use it to predict the median value of owner-occupied homes in Boston from the Boston data set as described above. Then calculate the RMSE.

  1. Loading our data set
library(MASS)
## Warning: package 'MASS' was built under R version 4.1.3
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
  2. Setting the seed and defining the training control
set.seed(200) 
train.control <- trainControl(method = "cv", number = 10)
  3. Training our model
model <- train(medv ~., data = Boston, method = "lm",
               trControl = train.control)
  4. Summarizing the results
print(model)
## Linear Regression 
## 
## 506 samples
##  13 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 457, 455, 456, 455, 455, 454, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   4.795749  0.7332363  3.383001
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
  5. Using repeated CV
# Define training control
set.seed(200)
train.control <- trainControl(method = "repeatedcv", 
                              number = 10, repeats = 3)
# Train the model
model <- train(medv ~., data = Boston, method = "lm",
               trControl = train.control)
# Summarize the results
print(model)
## Linear Regression 
## 
## 506 samples
##  13 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 457, 455, 456, 455, 455, 454, ... 
## Resampling results:
## 
##   RMSE      Rsquared  MAE     
##   4.799313  0.731545  3.384661
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE