Formula
\[J(\beta)_{Ridge} = \frac{1}{2n}\sum_{i=1}^n(y_i-\hat{y}_i)^2 + \lambda\sum_{j=1}^m\beta^2_j\]
\(\lambda\) = regularization parameter
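To make the formula concrete, here is a minimal sketch in R (the function name ridge_cost, the intercept beta0, and the toy data are my own for illustration, not part of the notes) that computes \(J(\beta)_{Ridge}\) for a given coefficient vector and \(\lambda\):

ridge_cost <- function(y, X, beta0, beta, lambda) {
  y_hat <- beta0 + X %*% beta                        # fitted values
  rss_term <- sum((y - y_hat)^2) / (2 * length(y))   # (1/2n) * sum of squared residuals
  penalty  <- lambda * sum(beta^2)                   # lambda * sum of squared coefficients
  rss_term + penalty
}

# toy example: 10 observations, 2 predictors
X <- matrix(rnorm(20), ncol = 2)
y <- X %*% c(1, -2) + rnorm(10)
ridge_cost(y, X, beta0 = 0, beta = c(1, -2), lambda = 0.5)

Setting lambda = 0 recovers the ordinary least-squares cost; a larger lambda makes large coefficients more expensive.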
Overfitting happens when you have many variables, often useless ones.
Never touch the test set to optimize the model; all of the tuning is based on the training set.
Mean squared error is used to evaluate the model that we got from the training set.
The test-set mean squared error is what tells us how good the model really is.
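As a minimal sketch of the idea (assumptions: mtcars as the data, hp ~ mpg + wt + drat as the model, and an 80/20 split; none of this is prescribed by the notes), fit OLS on the training set only and compare the mean squared error on the training set with the one on the test set:

set.seed(42)
train_idx <- sample(nrow(mtcars), size = floor(0.8 * nrow(mtcars)))
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

fit_ols <- lm(hp ~ mpg + wt + drat, data = train)

mse <- function(actual, predicted) mean((actual - predicted)^2)
mse(train$hp, predict(fit_ols, train))   # training MSE: usually optimistic
mse(test$hp,  predict(fit_ols, test))    # test MSE: the honest measure

The test MSE is typically larger than the training MSE; how much larger tells you how badly the model overfits.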
There are different types of errors; below is a regression example.
Ridge regression and lasso are similar, but there is one important difference: ridge penalizes the squared coefficients while lasso penalizes their absolute values. Regularization is a technique that makes slight modifications to the learning algorithm so that the model generalizes better; this in turn improves the model's performance on unseen data as well.
So you have your whole data set. You fit a linear regression and get a fairly accurate estimate. Then you set part of the data aside and keep the rest as training data. You fit another OLS model and get an estimate, but that estimate is less accurate than the one based on the whole data.
The sum of squared residuals looks smaller on the data used for training than on the whole data set, yet on the testing data the residuals are larger.
lasso regression
Ridge regression: the main idea is that we introduce a small amount of bias when fitting the training data so that the model won't overfit. In return for that bias, ridge regression gives a new regression line with lower variance.
How ridge regression works:
Ridge regression minimizes the sum of squared residuals plus \(\lambda \times \text{slope}^2\). The \(\text{slope}^2\) term is the penalty, and \(\lambda\) determines how severe the penalty is. \(\lambda\) can range from 0 to infinity; as \(\lambda\) increases, the slope gets smaller.
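For example (numbers invented just for illustration): if the slope of the candidate line is 1.3, the penalty term is

\[\lambda \cdot \text{slope}^2 = 1 \cdot 1.3^2 = 1.69 \quad\text{for } \lambda = 1, \qquad 10 \cdot 1.3^2 = 16.9 \quad\text{for } \lambda = 10,\]

so the larger \(\lambda\) is, the more expensive a steep slope becomes, and the minimizer settles on a flatter line.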
When the slope of the line is steep, changes in x have a big effect on y; when the slope is small, it is the other way around.
By introducing a bias, ridge regression can give a better estimate on data it was not trained on.
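As a quick sketch of this shrinkage (using the same mtcars variables as the glmnet example below; the particular lambda values are arbitrary), you can refit the ridge regression for several lambdas and watch the slopes move towards zero:

library(glmnet)

x <- data.matrix(mtcars[, c("mpg", "wt", "drat")])
y <- mtcars$hp

sapply(c(0, 1, 10, 100, 1000), function(l) {
  as.vector(coef(glmnet(x, y, alpha = 0, lambda = l)))   # (intercept, mpg, wt, drat)
})
# each column corresponds to one lambda; the slopes shrink as lambda grows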
Ridge regression involves tuning a hyperparameter, lambda, which means running the model many times for different values of lambda. We can automatically find an optimal value of lambda by using cv.glmnet() as follows:
library(glmnet)   # ridge/lasso models
library(dplyr)    # for select() and %>%

y <- mtcars$hp                                            # response
x <- mtcars %>% select(mpg, wt, drat) %>% data.matrix()   # predictor matrix
lambdas <- 10^seq(3, -2, by = -.1)                        # grid of lambdas from 1000 down to 0.01
fit <- glmnet(x, y, alpha = 0, lambda = lambdas)          # alpha = 0 means ridge regression
summary(fit)
## Length Class Mode
## a0 51 -none- numeric
## beta 153 dgCMatrix S4
## df 51 -none- numeric
## dim 2 -none- numeric
## lambda 51 -none- numeric
## dev.ratio 51 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
cv_fit <- cv.glmnet(x, y, alpha = 0, lambda = lambdas)
plot(cv_fit)
The lowest point in the curve indicates the optimal lambda: the log value of lambda that best minimised the error in cross-validation. We can extract this value as follows:
opt_lambda <- cv_fit$lambda.min
opt_lambda
## [1] 12.58925
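Once we have the optimal lambda, we can inspect the ridge coefficients and make predictions at that value (a follow-up sketch, continuing with the cv_fit and x objects from above):

coef(cv_fit, s = "lambda.min")                          # coefficients at the optimal lambda
y_pred <- predict(cv_fit, newx = x, s = "lambda.min")   # fitted values at the optimal lambda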
# recipes
library(rsample)   # training()/testing() for the split
library(recipes)   # recipe(), step_dummy(), prep(), juice(), bake()

recipeOLS <- recipe(price ~ bedrooms + Bathroom, data = training(split85)) %>%
  step_dummy(all_nominal()) %>%   # creates dummy variables for all nominal predictors
  prep()

DataTraining <- juice(recipeOLS)                   # preprocessed training set
DataTest     <- bake(recipeOLS, testing(split85))  # same preprocessing applied to the test set
# You can now use the lm function to fit the OLS model on the preprocessed training data.
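A minimal sketch of that step (a hypothetical continuation; price, bedrooms and Bathroom come from the recipe above and are assumed to exist in the data):

fit_lm <- lm(price ~ ., data = DataTraining)             # OLS on the juiced training set
mean((DataTest$price - predict(fit_lm, DataTest))^2)     # test-set MSE on the baked data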
There is a difference between the training data and the test data; you do not combine the two.
Think of the bullseye diagram with its four combinations: high or low variance and high or low bias. Variance is the spread: the smaller the standard deviation, the tighter the shots cluster together; the larger the standard deviation, the more they scatter. Bias is how far that cluster sits from the bullseye on average.
Low bias and low variance at the same time is usually wishful thinking; in practice there is a trade-off between the two.
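To see the trade-off in numbers, here is an illustrative simulation sketch (the simulated data, the true slope of 2 and the choice lambda = 1 are all invented for the example): fit OLS and ridge on many simulated training sets and compare the bias and variance of the estimated slope.

library(glmnet)

set.seed(1)
true_slope <- 2
slopes <- replicate(500, {
  x <- rnorm(30)
  X <- cbind(x = x, noise = rnorm(30))                    # glmnet needs at least two columns
  y <- true_slope * x + rnorm(30, sd = 2)
  ols   <- unname(coef(lm(y ~ X))[2])                     # OLS slope for x
  ridge <- coef(glmnet(X, y, alpha = 0, lambda = 1))[2]   # ridge slope for x
  c(ols = ols, ridge = ridge)
})

rowMeans(slopes) - true_slope   # bias: the ridge slope is pulled towards zero
apply(slopes, 1, var)           # variance: the ridge slope varies less across samples

The ridge slope is biased (systematically smaller than the true slope) but has lower variance, which is exactly the trade ridge regression makes.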