Linear regression models the functional relationship between the X and Y components of the data. Developed in the field of statistics, it is studied as a model for understanding the relationship between input and output numerical variables. It assumes a linear relationship between the input variables (x) and the single output variable (y); more specifically, that y can be calculated from a linear combination of the input variables (x).
These equations take the form
> Y = B1*X1 + B2*X2 + ... + Bn*Xn + e
As we already know, the objective of regression learning is to obtain the values of the coefficients that will minimize the difference between the predicted value and the actual value given the training examples.
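Before using lm(), it helps to see that, for a single predictor, the coefficients that minimize the squared differences have a simple closed form. The sketch below (our own illustration, using the mtcars data that we model in the next section) computes them by hand and compares them with lm():
slope <- cov(mtcars$hp, mtcars$mpg) / var(mtcars$hp)     # least-squares slope
intercept <- mean(mtcars$mpg) - slope * mean(mtcars$hp)  # least-squares intercept
c(intercept = intercept, slope = slope)
coef(lm(mpg ~ hp, data = mtcars))                        # same values from lm()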
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.1.3
We will start by examining the data with a scatter plot before fitting any models. Scatter plots help visualize any linear relationship between the dependent (response) variable and the independent (predictor) variables.
ggplot(mtcars, aes(hp, mpg))+
geom_point()+
labs(title = "Gross Horse Power VS Miles Per Gallon",
x = "hp",
y = "mpg")
We can also find the correlation coefficient. Correlation is a statistical measure that suggests the level of linear dependence between two variables and takes values between -1 and +1. If for every instance where one variable increases the other increases along with it, there is a high positive correlation between them and the coefficient will be close to +1. The opposite is true for an inverse relationship, in which case the correlation between the variables will be close to -1.
cor(mtcars$hp, mtcars$mpg)
## [1] -0.7761684
The linear model function lm(), used below, creates the relationship model between the predictor and the response variable: mpg~hp specifies the response (mpg) as a function of the predictor (hp), and mtcars is the data frame on which the formula is applied.
simple_lm <- lm(mpg~hp, mtcars)
simple_lm
##
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
##
## Coefficients:
## (Intercept) hp
## 30.09886 -0.06823
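The coefficients can be read as the fitted equation mpg = 30.09886 - 0.06823 * hp. As a quick sanity check (our own addition), we can plug in hp = 110 for the Mazda RX4 and compare with the predict() output further below:
b <- coef(simple_lm)
b["(Intercept)"] + b["hp"] * 110  # about 22.59, the fitted mpg for the Mazda RX4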
Let’s generate the ANOVA table; it consists of the sums of squares, degrees of freedom, F statistic, and p value.
anova(simple_lm)
## Analysis of Variance Table
##
## Response: mpg
## Df Sum Sq Mean Sq F value Pr(>F)
## hp 1 678.37 678.37 45.46 1.788e-07 ***
## Residuals 30 447.67 14.92
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
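The sums of squares in this table connect directly to the correlation coefficient computed earlier: the proportion of variance explained by hp is SS(hp) / (SS(hp) + SS(residuals)), which equals the squared correlation. A quick check (our own addition):
678.37 / (678.37 + 447.67)    # proportion of variance explained, about 0.602
cor(mtcars$hp, mtcars$mpg)^2  # squared correlation, also about 0.602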
Let’s also predict the response variable from the fitted model.
pred1 <- predict(simple_lm, mtcars)
pred1
## Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
## 22.593750 22.593750 23.753631 22.593750
## Hornet Sportabout Valiant Duster 360 Merc 240D
## 18.158912 22.934891 13.382932 25.868707
## Merc 230 Merc 280 Merc 280C Merc 450SE
## 23.617174 21.706782 21.706782 17.817770
## Merc 450SL Merc 450SLC Cadillac Fleetwood Lincoln Continental
## 17.817770 17.817770 16.112064 15.429781
## Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla
## 14.406357 25.595794 26.550990 25.664022
## Toyota Corona Dodge Challenger AMC Javelin Camaro Z28
## 23.480718 19.864619 19.864619 13.382932
## Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa
## 18.158912 25.595794 23.890087 22.389065
## Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
## 12.086595 18.158912 7.242387 22.661978
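It is also useful to see the fitted line on top of the earlier scatter plot. A minimal sketch using ggplot2's geom_smooth() with method = "lm":
ggplot(mtcars, aes(hp, mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Gross Horse Power VS Miles Per Gallon",
       x = "hp",
       y = "mpg")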
Let’s consider the following two vectors
height <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
weight <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
Finding the correlation coefficient
cor(height, weight)
## [1] 0.9771296
Now let’s determine the model
relation <- lm(weight~height)
relation
##
## Call:
## lm(formula = weight ~ height)
##
## Coefficients:
## (Intercept) height
## -38.4551 0.6746
Print the summary of the model.
summary(relation)
##
## Call:
## lm(formula = weight ~ height)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.3002 -1.6629 0.0412 1.8944 3.9775
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -38.45509 8.04901 -4.778 0.00139 **
## height 0.67461 0.05191 12.997 1.16e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.253 on 8 degrees of freedom
## Multiple R-squared: 0.9548, Adjusted R-squared: 0.9491
## F-statistic: 168.9 on 1 and 8 DF, p-value: 1.164e-06
Now we will wrap the new predictor value inside a data frame and use it to predict the weight of a person with height 170.
a <- data.frame(height = 170)
result <- predict(relation,a)
result
## 1
## 76.22869
Apply simple linear regression to the faithful data set and estimate the next eruption duration if the waiting time since the last eruption has been 80 minutes
head(faithful)
## eruptions waiting
## 1 3.600 79
## 2 1.800 54
## 3 3.333 74
## 4 2.283 62
## 5 4.533 85
## 6 2.883 55
Finding the correlation coefficient.
cor(faithful$eruptions, faithful$waiting)
## [1] 0.9008112
Applying the lm() function.
relation_faith <- lm(faithful$eruptions~faithful$waiting)
relation_faith
##
## Call:
## lm(formula = faithful$eruptions ~ faithful$waiting)
##
## Coefficients:
## (Intercept) faithful$waiting
## -1.87402 0.07563
summary(relation_faith)
##
## Call:
## lm(formula = faithful$eruptions ~ faithful$waiting)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.29917 -0.37689 0.03508 0.34909 1.19329
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.874016 0.160143 -11.70 <2e-16 ***
## faithful$waiting 0.075628 0.002219 34.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4965 on 270 degrees of freedom
## Multiple R-squared: 0.8115, Adjusted R-squared: 0.8108
## F-statistic: 1162 on 1 and 270 DF, p-value: < 2.2e-16
Working the solution. Note that because the model above was fitted using the faithful$eruptions ~ faithful$waiting notation, predict() cannot match a new data frame against it (R warns that 'newdata' had 1 row but variables found have 272 rows and simply returns the 272 fitted values). The new value must also be supplied under the predictor's name, waiting, not eruptions. We therefore refit the model with a data argument and predict the eruption duration for a waiting time of 80 minutes:
relation_faith <- lm(eruptions ~ waiting, data = faithful)
new_waiting <- data.frame(waiting = 80)
predict(relation_faith, new_waiting)
Using the coefficients reported above, the prediction is -1.874016 + 0.075628 * 80, i.e. an eruption lasting roughly 4.18 minutes.
Multiple linear regression is an extension of simple linear regression used to predict an outcome variable (y) on the basis of multiple distinct predictor variables (x). In simple linear regression we have one predictor and one response variable, but in multiple regression we have more than one predictor variable and one response variable.
With three predictor variables (x), the prediction of y is expressed by the following equation:
> y = b0 + b1*x1 + b2*x2 + b3*x3
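In matrix form this is y = Xb + e, and the least-squares coefficients solve the normal equations (X'X)b = X'y. As a sketch (our own illustration, not how lm() computes its fit internally), they can be obtained by hand and compared with coef(), here using three predictors from mtcars that we also use further below:
X <- model.matrix(mpg ~ disp + hp + wt, data = mtcars)  # design matrix with intercept column
y <- mtcars$mpg
solve(t(X) %*% X, t(X) %*% y)                  # normal-equations solution
coef(lm(mpg ~ disp + hp + wt, data = mtcars))  # same values from lm()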
We will demonstrate this using the diamonds data set.
head(diamonds)
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
Applying the lm() function.
multiple_lm <- lm(price ~ ., diamonds)
multiple_lm
##
## Call:
## lm(formula = price ~ ., data = diamonds)
##
## Coefficients:
## (Intercept) carat cut.L cut.Q cut.C cut^4
## 5753.762 11256.978 584.457 -301.908 148.035 -20.794
## color.L color.Q color.C color^4 color^5 color^6
## -1952.160 -672.054 -165.283 38.195 -95.793 -48.466
## clarity.L clarity.Q clarity.C clarity^4 clarity^5 clarity^6
## 4097.431 -1925.004 982.205 -364.918 233.563 6.883
## clarity^7 depth table x y z
## 90.640 -63.806 -26.474 -1008.261 9.609 -50.119
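The terms cut.L, cut.Q, clarity^4 and so on appear because cut, color and clarity are ordered factors, which R expands into orthogonal polynomial contrasts (linear, quadratic, cubic, ...) rather than one dummy variable per level. We can inspect this encoding directly:
class(diamonds$cut)      # "ordered" "factor"
contrasts(diamonds$cut)  # the .L, .Q, .C and ^4 columns used in the model matrix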
Now we will generate the ANOVA table.
anova(multiple_lm)
## Analysis of Variance Table
##
## Response: price
## Df Sum Sq Mean Sq F value Pr(>F)
## carat 1 7.2913e+11 7.2913e+11 5.7092e+05 < 2e-16 ***
## cut 4 6.1332e+09 1.5333e+09 1.2006e+03 < 2e-16 ***
## color 6 1.2598e+10 2.0997e+09 1.6441e+03 < 2e-16 ***
## clarity 7 3.8452e+10 5.4931e+09 4.3012e+03 < 2e-16 ***
## depth 1 4.9405e+06 4.9405e+06 3.8685e+00 0.04921 *
## table 1 9.2727e+07 9.2727e+07 7.2606e+01 < 2e-16 ***
## x 1 3.2053e+09 3.2053e+09 2.5098e+03 < 2e-16 ***
## y 1 1.1679e+05 1.1679e+05 9.1400e-02 0.76235
## z 1 2.8609e+06 2.8609e+06 2.2401e+00 0.13448
## Residuals 53916 6.8857e+10 1.2771e+06
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We will now perform prediction using our model.
pred2 <- predict(multiple_lm, diamonds)
summary(pred2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -4308 1073 2819 3933 5886 39394
We will now use the mtcars data set for this example.
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Let’s choose “disp”,“hp” and “wt” as predictor variables
input <- mtcars[,c("mpg","disp","hp","wt")]
head(input)
## mpg disp hp wt
## Mazda RX4 21.0 160 110 2.620
## Mazda RX4 Wag 21.0 160 110 2.875
## Datsun 710 22.8 108 93 2.320
## Hornet 4 Drive 21.4 258 110 3.215
## Hornet Sportabout 18.7 360 175 3.440
## Valiant 18.1 225 105 3.460
We will now create a relationship model.
model <- lm(mpg~disp+hp+wt, data = input)
model
##
## Call:
## lm(formula = mpg ~ disp + hp + wt, data = input)
##
## Coefficients:
## (Intercept) disp hp wt
## 37.105505 -0.000937 -0.031157 -3.800891
We will now get the summary of our model.
summary(model)
##
## Call:
## lm(formula = mpg ~ disp + hp + wt, data = input)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.891 -1.640 -0.172 1.061 5.861
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.105505 2.110815 17.579 < 2e-16 ***
## disp -0.000937 0.010350 -0.091 0.92851
## hp -0.031157 0.011436 -2.724 0.01097 *
## wt -3.800891 1.066191 -3.565 0.00133 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.639 on 28 degrees of freedom
## Multiple R-squared: 0.8268, Adjusted R-squared: 0.8083
## F-statistic: 44.57 on 3 and 28 DF, p-value: 8.65e-11
With our created model let’s predict the mileage given a car with disp = 221, hp = 102 and wt = 2.91
a <- data.frame(disp = 221, hp = 102, wt = 2.91)
predicted_mileage <- predict(model, a)
predicted_mileage
## 1
## 22.65987
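predict() can also return interval estimates instead of a single point prediction. For example (an optional extra, not required for the example above), a 95% prediction interval and a 95% confidence interval for the same car:
predict(model, a, interval = "prediction", level = 0.95)  # interval for a single new car
predict(model, a, interval = "confidence", level = 0.95)  # interval for the mean response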
In this example we will apply multiple linear regression to the stack loss data set, and predict the stack loss when the air flow is 62, the water temperature is 19 and the acid concentration is 84.
First let’s see the data set
head(stackloss)
## Air.Flow Water.Temp Acid.Conc. stack.loss
## 1 80 27 89 42
## 2 80 27 88 37
## 3 75 25 90 37
## 4 62 24 87 28
## 5 62 22 87 18
## 6 62 23 87 18
Let’s choose our predictors
input <- stackloss[,c("Air.Flow","Water.Temp","Acid.Conc.", "stack.loss")]
head(input)
## Air.Flow Water.Temp Acid.Conc. stack.loss
## 1 80 27 89 42
## 2 80 27 88 37
## 3 75 25 90 37
## 4 62 24 87 28
## 5 62 22 87 18
## 6 62 23 87 18
We will now create a relationship model.
model <- lm(stack.loss~Air.Flow+Water.Temp+Acid.Conc., data = input)
model
##
## Call:
## lm(formula = stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.,
## data = input)
##
## Coefficients:
## (Intercept) Air.Flow Water.Temp Acid.Conc.
## -39.9197 0.7156 1.2953 -0.1521
Determining the summary of our model
summary(model)
##
## Call:
## lm(formula = stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.,
## data = input)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.2377 -1.7117 -0.4551 2.3614 5.6978
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -39.9197 11.8960 -3.356 0.00375 **
## Air.Flow 0.7156 0.1349 5.307 5.8e-05 ***
## Water.Temp 1.2953 0.3680 3.520 0.00263 **
## Acid.Conc. -0.1521 0.1563 -0.973 0.34405
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.243 on 17 degrees of freedom
## Multiple R-squared: 0.9136, Adjusted R-squared: 0.8983
## F-statistic: 59.9 on 3 and 17 DF, p-value: 3.016e-09
Now we will predict the stack loss given the other predictor variables.
a <- data.frame(Air.Flow = 62, Water.Temp = 19, Acid.Conc. = 84)
predicted_stack.loss <- predict(model, a)
predicted_stack.loss
## 1
## 16.28216
How do we know that an estimated regression model generalizes beyond the sample data used to fit it? Ideally, we can obtain new independent data with which to validate our model. For example, we could refit the model to the new data set to see if the various characteristics of the model (e.g., the estimated regression coefficients) are consistent with the model fit to the original data set.
However, most of the time we cannot obtain new independent data to validate our model. An alternative is to partition the sample data into a training (or model-building) set, which we can use to develop the model, and a validation (or prediction) set, which is used to evaluate the predictive ability of the model. This is called cross-validation. Again, we can compare the model fit to the training set to the model refit to the validation set to assess consistency. The simplest approach to cross-validation is to partition the sample observations randomly with 50% of the sample in each set.
Instead of doing a single training/testing split, we can systematize this process and produce multiple, different out-of-sample train/test splits, which leads to a better estimate of the out-of-sample RMSE.
For 3-fold cross-validation, for instance, we split the data into 3 random and complementary folds, so that each data point appears exactly once in the test set. This leads to a total test set size that is identical to the size of the full data set but is composed of out-of-sample predictions.
Schematic of 3-fold cross validation producing three training (blue) and testing (white) splits.
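As a rough sketch of what happens under the hood (the caret package used below automates all of this), 3 folds can be built and evaluated by hand, for example for the mpg ~ hp model on mtcars:
set.seed(1)
k <- 3
folds <- sample(rep(1:k, length.out = nrow(mtcars)))     # assign each row to one fold
rmse_per_fold <- sapply(1:k, function(i) {
  fit  <- lm(mpg ~ hp, data = mtcars[folds != i, ])      # train on the other k - 1 folds
  pred <- predict(fit, newdata = mtcars[folds == i, ])   # predict the held-out fold
  sqrt(mean((mtcars$mpg[folds == i] - pred)^2))          # out-of-sample RMSE for this fold
})
rmse_per_fold
mean(rmse_per_fold)  # cross-validated estimate of the RMSE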
After cross-validation, all models used within each fold are discarded, and a new model is built using the whole data set, with the best model parameter(s), i.e. those that generalized over all folds.
This makes cross-validation quite time consuming, as it takes x + 1 times as long as fitting a single model (where x is the number of cross-validation folds), but it is essential.
It is important to maintain the class proportions within the different folds, i.e. to respect the proportion of the different classes in the original data. This is also taken care of when using the caret package.
The procedure of creating folds and training the models is handled by the train function in caret. Below, we apply it to the diamond price example that we used when introducing model performance.
> 1. We start by setting a random seed to be able to reproduce the example.
> 2. We specify the method (the learning algorithm) we want to use. Here, we use “lm”, but, as we will see later, there are many others to choose from.
> 3. We then set the out-of-sample training procedure to 10-fold cross validation (method = “cv” and number = 10).
To simplify the output in the material for better readability, we set the verbosity flag to FALSE, but it is useful to set it to TRUE in interactive mode.
library(caret)
## Warning: package 'caret' was built under R version 4.1.3
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 4.1.3
We will use the diamonds data set.
head(diamonds)
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
set.seed(42)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v tibble 3.1.7 v dplyr 1.0.9
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## v purrr 0.3.4
## Warning: package 'tibble' was built under R version 4.1.3
## Warning: package 'tidyr' was built under R version 4.1.3
## Warning: package 'readr' was built under R version 4.1.2
## Warning: package 'purrr' was built under R version 4.1.2
## Warning: package 'dplyr' was built under R version 4.1.3
## Warning: package 'stringr' was built under R version 4.1.1
## Warning: package 'forcats' was built under R version 4.1.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x purrr::lift() masks caret::lift()
# Split the data into training and test set
training.samples <- diamonds$price %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- diamonds[training.samples, ]
test.data <- diamonds[-training.samples, ]
# Build the model
model <- lm(price ~., data = train.data)
# Make predictions and compute the R2, RMSE and MAE
predictions <- model %>% predict(test.data)
data.frame( R2 = R2(predictions, test.data$price),
RMSE = RMSE(predictions, test.data$price),
MAE = MAE(predictions, test.data$price))
## R2 RMSE MAE
## 1 0.9213907 1121.769 735.3145
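R2(), RMSE() and MAE() are caret helper functions; the same quantities can be written out by hand, which makes their definitions explicit (a sketch using the objects created above):
err <- test.data$price - predictions
sqrt(mean(err^2))                    # RMSE: root mean squared error
mean(abs(err))                       # MAE: mean absolute error
cor(predictions, test.data$price)^2  # R-squared, computed here as the squared correlation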
Train a linear model using 10-fold cross-validation. Then calculate the RMSE.
train.control <- trainControl(method = "cv", number = 10)
# Train the model
model <- train(price ~., data = diamonds, method = "lm",
trControl = train.control)
# Summarize the results
print(model)
## Linear Regression
##
## 53940 samples
## 9 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 48545, 48546, 48546, 48546, 48545, 48546, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 1131.281 0.919629 740.6904
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
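The train object also stores the per-fold performance, which shows how much the RMSE varies across the 10 folds (a sketch; output not shown):
head(model$resample)  # RMSE, Rsquared and MAE for each individual fold
model$results         # the aggregated results printed above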
Repeating K-fold cross-validation
set.seed(123)
train.control <- trainControl(method = "repeatedcv",
number = 10, repeats = 3)
# Train the model
model <- train(price ~., data = diamonds, method = "lm",
trControl = train.control)
# Summarize the results
print(model)
## Linear Regression
##
## 53940 samples
## 9 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 48547, 48545, 48547, 48546, 48545, 48546, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 1130.787 0.9196617 740.552
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Train a linear model using 10-fold cross-validation and then use it to predict the median value of owner-occupied homes in Boston from the Boston data set as described above. Then calculate the RMSE.
library(MASS)
## Warning: package 'MASS' was built under R version 4.1.3
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
set.seed(200)
train.control <- trainControl(method = "cv", number = 10)
model <- train(medv ~., data = Boston, method = "lm",
trControl = train.control)
print(model)
## Linear Regression
##
## 506 samples
## 13 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 457, 455, 456, 455, 455, 454, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 4.795749 0.7332363 3.383001
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
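To complete the exercise, the final model retained by train() can be used to predict medv; a minimal sketch (output not shown):
pred_medv <- predict(model, newdata = Boston)
head(pred_medv)               # predicted median home values for the first rows
RMSE(pred_medv, Boston$medv)  # RMSE of the final model on the full data set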
# Define training control
set.seed(200)
train.control <- trainControl(method = "repeatedcv",
number = 10, repeats = 3)
# Train the model
model <- train(medv ~., data = Boston, method = "lm",
trControl = train.control)
# Summarize the results
print(model)
## Linear Regression
##
## 506 samples
## 13 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 457, 455, 456, 455, 455, 454, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 4.799313 0.731545 3.384661
##
## Tuning parameter 'intercept' was held constant at a value of TRUE