Linear regresion is used to know correlation between two or more variables and predict the quantitative value of depedent variable using independent variables. It measures correlation, but not Causation. It is a simple line equation between independent variables and dependent varaible. The following code is used to regress in R
library(ISLR)
## Warning: package 'ISLR' was built under R version 3.4.3
library(MASS)
## Warning: package 'MASS' was built under R version 3.4.3
To use ‘Boston’ dataset in ISLR Package
fix(Boston)
names(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
Now, I will regress ‘medv’ (our dependent variable) on ‘crim’ and ‘zn’. Dataset to be worked with, can be attached or mentioned in our code. I prefer to mention it in code.
linmod <- lm(medv ~ crim + zn, data = Boston)
let us look at our model ad its summary
linmod
##
## Call:
## lm(formula = medv ~ crim + zn, data = Boston)
##
## Coefficients:
## (Intercept) crim zn
## 22.4856 -0.3521 0.1161
summary(linmod)
##
## Call:
## lm(formula = medv ~ crim + zn, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.421 -5.060 -1.558 2.121 30.765
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.48563 0.44173 50.904 < 2e-16 ***
## crim -0.35208 0.04259 -8.267 1.24e-15 ***
## zn 0.11611 0.01571 7.392 6.09e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.065 on 503 degrees of freedom
## Multiple R-squared: 0.234, Adjusted R-squared: 0.2309
## F-statistic: 76.82 on 2 and 503 DF, p-value: < 2.2e-16
I prefer to work with summary command. In the coefficient sub field, we need to check the p-value. More the stars, more significant the variables are. I prefer p-value less than 0.05
Now, let us use all the variables to create our linear model
linmod_1 <- lm(medv~., data = Boston)
and look at its summary
summary(linmod_1)
##
## Call:
## lm(formula = medv ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
## crim -1.080e-01 3.286e-02 -3.287 0.001087 **
## zn 4.642e-02 1.373e-02 3.382 0.000778 ***
## indus 2.056e-02 6.150e-02 0.334 0.738288
## chas 2.687e+00 8.616e-01 3.118 0.001925 **
## nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
## rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
## age 6.922e-04 1.321e-02 0.052 0.958229
## dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
## rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
## tax -1.233e-02 3.760e-03 -3.280 0.001112 **
## ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
## black 9.312e-03 2.686e-03 3.467 0.000573 ***
## lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
We can use ‘names’ function on summary to find what all information summary of the model stores.
names(summary(linmod_1))
## [1] "call" "terms" "residuals" "coefficients"
## [5] "aliased" "sigma" "df" "r.squared"
## [9] "adj.r.squared" "fstatistic" "cov.unscaled"
I use adjusted R Squared parameter to assess my model. The value Closer to 1 is better model. Adjusted R Squared can be seen in summary of linmod_1 also.
summary(linmod_1)$adj.r.squared
## [1] 0.7337897
summary(linmod)$adj.r.squared
## [1] 0.2309427
As linmod_1 has higher Adj R Sq, I would prefer ‘linmod_1’
Now using linmod_1 let us make predictions. I am subsetting Boston data to make redictions (Just for example. Usually there would be a seperate test data set)
test <- head(Boston, 10)
pred <- predict(linmod_1, newdata = test, interval = "prediction")
pred
## fit lwr upr
## 1 30.00384 20.601726 39.40596
## 2 25.02556 15.650368 34.40076
## 3 30.56760 21.189093 39.94610
## 4 28.60704 19.210599 38.00347
## 5 27.94352 18.545176 37.34187
## 6 25.25628 15.863926 34.64864
## 7 23.00181 13.613343 32.39027
## 8 19.53599 10.072729 28.99925
## 9 11.52364 1.956786 21.09049
## 10 18.92026 9.462231 28.37829
Not specifying interval parameter in predict just gives ‘fit’ value, which we generally use. Prediction in interval parameter gives Prediction interval and ‘confidence’ is used for Confidence interval.
For interaction between terms, we use ’*‘instead of’+’ in the formula
intmod <- lm(medv ~ lstat*age, data = Boston)
summary(intmod)
##
## Call:
## lm(formula = medv ~ lstat * age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.806 -4.045 -1.333 2.085 27.552
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.0885359 1.4698355 24.553 < 2e-16 ***
## lstat -1.3921168 0.1674555 -8.313 8.78e-16 ***
## age -0.0007209 0.0198792 -0.036 0.9711
## lstat:age 0.0041560 0.0018518 2.244 0.0252 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.149 on 502 degrees of freedom
## Multiple R-squared: 0.5557, Adjusted R-squared: 0.5531
## F-statistic: 209.3 on 3 and 502 DF, p-value: < 2.2e-16
Let us construct a modl with ‘lstat’ as our only independent variable. An another model with lstat and lstat^2 as our independent varable.
mod <- lm(medv ~ lstat, data = Boston)
summary(mod)
##
## Call:
## lm(formula = medv ~ lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.168 -3.990 -1.318 2.034 24.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384 0.56263 61.41 <2e-16 ***
## lstat -0.95005 0.03873 -24.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
## F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
Now with lstat and lstat^2. We have to square lstat variable. We use identity ‘I’ operator in the formula to square any variable So..
mod_1 <- lm(medv ~ lstat + I(lstat^2), data = Boston)
summary(mod_1)
##
## Call:
## lm(formula = medv ~ lstat + I(lstat^2), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2834 -3.8313 -0.5295 2.3095 25.4148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.862007 0.872084 49.15 <2e-16 ***
## lstat -2.332821 0.123803 -18.84 <2e-16 ***
## I(lstat^2) 0.043547 0.003745 11.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared: 0.6407, Adjusted R-squared: 0.6393
## F-statistic: 448.5 on 2 and 503 DF, p-value: < 2.2e-16
To confirm, which model is better, we can use either R Sq values or Anova test. I Will use anova test this time. To use Anova Test, the models should be nested i.e, one model should be subset of the other.
anova(mod, mod_1)
## Analysis of Variance Table
##
## Model 1: medv ~ lstat
## Model 2: medv ~ lstat + I(lstat^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 504 19472
## 2 503 15347 1 4125.1 135.2 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The model with square term has a p-value that is virtually zero which rejects the Null Hypothesis.
Null Hypothesis : Both models are equally good. Alternate Hypothesis : Model with More variables is better.
So mod_1 is better that mod.
Let us create a model with 1,2,3,4,5th power of ‘lstat’
polymod <- lm(medv ~ poly(lstat, 5), data = Boston)
summary(polymod)
##
## Call:
## lm(formula = medv ~ poly(lstat, 5), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5433 -3.1039 -0.7052 2.0844 27.1153
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.5328 0.2318 97.197 < 2e-16 ***
## poly(lstat, 5)1 -152.4595 5.2148 -29.236 < 2e-16 ***
## poly(lstat, 5)2 64.2272 5.2148 12.316 < 2e-16 ***
## poly(lstat, 5)3 -27.0511 5.2148 -5.187 3.10e-07 ***
## poly(lstat, 5)4 25.4517 5.2148 4.881 1.42e-06 ***
## poly(lstat, 5)5 -19.2524 5.2148 -3.692 0.000247 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.215 on 500 degrees of freedom
## Multiple R-squared: 0.6817, Adjusted R-squared: 0.6785
## F-statistic: 214.2 on 5 and 500 DF, p-value: < 2.2e-16
Rest of the process is same. Select variables with more number of stars and make predictions on new data.
Summary Points:
We will use ‘glmnet()’ function of ‘glmnet’ package to perform Ridge and Lasso regressions.
library(glmnet)
## Warning: package 'glmnet' was built under R version 3.4.3
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-13
With respect to code, Ridge and Lasso regressions are same, except for the alpha parameter.
For Ridge Regression, alpha = 0 For Lasso Regression, alpha = 1
We will use ‘Boston’ dataset of ISLR Package to fit the ridge regression.
library(ISLR)
library(MASS)
fix(Boston)
names(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
In ridge and lasso regression we have to give the predictors as a matrix and response variable as a vector.
x <- model.matrix(medv ~., data = Boston)[,-1]
y <- Boston$medv
In Ridge (and Lasso), for each value of Lambda, we find the co-efficients of all the predictors. So we have to specify a list of lambdas on which we calculate the coefficients.
grid <- 10^seq(10, -2, length = 100)
Now the ridge regression. We specify the input, output, lambda and also specify if we have to standardize the predictors or not.
ridge_mod <- glmnet(x, y, alpha = 0, lambda = grid, standardize = TRUE)
‘coef(ridge_mod)’ is a matrix will the regression coefficients for different values of lambda. Predictors are rows and lambda is the columns. There are 13 predictors and one intercept. So 13 rows. We have 100 values of lamda, so 100 columns.
dim(coef(ridge_mod))
## [1] 14 100
glmnet package has a ‘predict’ function that can be used to predict the coefficients for a given lamda or response of new dataset for a given lambda and a model.
The predict the coefficients for a given lambda, we give the model name, lamda and type as coefficient for the ‘predict’ function.
predict(ridge_mod, s = 50, type = "coefficients")
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 23.580881779
## crim -0.036358794
## zn 0.011707752
## indus -0.052663644
## chas 0.914385347
## nox -2.584648038
## rm 1.136076181
## age -0.008821137
## dis 0.021708896
## rad -0.026332105
## tax -0.002040002
## ptratio -0.238276456
## black 0.003150038
## lstat -0.106839435
Now, I will take the first 25 observations of the boston data as my test data.
test_x <- head(x, 25)
test_y <- head(y, 25)
Now, let us make predictions for the ‘test_x’ dataset. To make predictions, we pass the model, lambda and new data into the predict function.
pred_y <- predict(ridge_mod, s = 50, newx = test_x)
We have seen that, we get different coefficients for different lamda. So now we have to find an optimal lambda where the test MSE is minimum. For that we will use Cross-Validation.
ridge_cv <- cv.glmnet(x, y, alpha = 0, standardize = T)
Let us look at what ‘ridge_cv’ has.
names(ridge_cv)
## [1] "lambda" "cvm" "cvsd" "cvup" "cvlo"
## [6] "nzero" "name" "glmnet.fit" "lambda.min" "lambda.1se"
We use ‘lambda.min’, to find the optimal lambda i.e., lambda at which the MSE is minimum.
ridge_cv$lambda.min
## [1] 0.7438467
We can use this lambda to make our predictions, using Predict function.
pred_y <- predict(ridge_cv, s = 0.7438467, newx = test_x)
We can calculate our MSE by
mean((test_y - pred_y)^2)
## [1] 11.15965
In this, the predictor is ‘cut’ at different points in its range. For example, our Predictor X has a range of 0 to 10. Cut points are defined at 2 and 7. Thre different regressions are fit in the range. There will a linear Regression between 0 to 2, another from 2 to 7, another for 7 to 10. i.e, Regression Coefficients change at these cut points.
This function can be done in ‘lm’ function using ‘cut()’ as another parameter in the model. We will use ‘Wage’ dataset to understand and make predictions using Step Functions.
library(ISLR)
attach(Wage)
names(Wage)
## [1] "year" "age" "maritl" "race" "education"
## [6] "region" "jobclass" "health" "health_ins" "logwage"
## [11] "wage"
Now let us fit our linear regression using step function. We need to make a model that predicts ‘wage’ using ‘age’ variable. We will use ‘cut()’ function to let R decide the cut points
fit <- lm(wage ~ cut(age, 4), data = Wage)
summary(fit)
##
## Call:
## lm(formula = wage ~ cut(age, 4), data = Wage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -98.126 -24.803 -6.177 16.493 200.519
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 94.158 1.476 63.790 <2e-16 ***
## cut(age, 4)(33.5,49] 24.053 1.829 13.148 <2e-16 ***
## cut(age, 4)(49,64.5] 23.665 2.068 11.443 <2e-16 ***
## cut(age, 4)(64.5,80.1] 7.641 4.987 1.532 0.126
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40.42 on 2996 degrees of freedom
## Multiple R-squared: 0.0625, Adjusted R-squared: 0.06156
## F-statistic: 66.58 on 3 and 2996 DF, p-value: < 2.2e-16
Here, R divided ‘age’ into 4 intervals as specified in ‘cut(age, 4)’. We will use ‘breaks’ function to provide cut points at 49 and 64.5
fit_1 <- lm(wage ~ cut(age, breaks = c(49, 70, max(age))), data = Wage)
summary(fit_1)
##
## Call:
## lm(formula = wage ~ cut(age, breaks = c(49, 70, max(age))), data = Wage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -94.19 -26.67 -7.32 16.08 201.19
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 117.154 1.492 78.515
## cut(age, breaks = c(49, 70, max(age)))(70,80] -20.146 8.083 -2.492
## Pr(>|t|)
## (Intercept) <2e-16 ***
## cut(age, breaks = c(49, 70, max(age)))(70,80] 0.0129 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.78 on 849 degrees of freedom
## (2149 observations deleted due to missingness)
## Multiple R-squared: 0.007264, Adjusted R-squared: 0.006095
## F-statistic: 6.212 on 1 and 849 DF, p-value: 0.01288
In this section, we deal with Regression Splines and Natural Splines. Regression splines are more flexible than polynomials and step functions, and in fact are an extension of the two. They involve dividing the range of X into K distinct regions. Within each region, a polynomial function is fit to the data. However, these polynomials are constrained so that they join smoothly at the region boundaries, or knots. Provided that the interval is divided into enough regions, this can produce an extremely flexible fit.
To create a Regression Spline, we use the ‘splines’ library and ‘Wage’ dataset to predict wages using age.
library(splines)
library(ISLR)
attach(Wage)
## The following objects are masked from Wage (pos = 4):
##
## age, education, health, health_ins, jobclass, logwage, maritl,
## race, region, wage, year
To fit a Regression Spline, we use ‘bs()’ function, specifying predictor and knots.
reg_spl <- lm(wage ~ bs(age, knots = c(25, 40, 60)), data = Wage)
summary(reg_spl)
##
## Call:
## lm(formula = wage ~ bs(age, knots = c(25, 40, 60)), data = Wage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -98.832 -24.537 -5.049 15.209 203.207
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.494 9.460 6.394 1.86e-10 ***
## bs(age, knots = c(25, 40, 60))1 3.980 12.538 0.317 0.750899
## bs(age, knots = c(25, 40, 60))2 44.631 9.626 4.636 3.70e-06 ***
## bs(age, knots = c(25, 40, 60))3 62.839 10.755 5.843 5.69e-09 ***
## bs(age, knots = c(25, 40, 60))4 55.991 10.706 5.230 1.81e-07 ***
## bs(age, knots = c(25, 40, 60))5 50.688 14.402 3.520 0.000439 ***
## bs(age, knots = c(25, 40, 60))6 16.606 19.126 0.868 0.385338
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39.92 on 2993 degrees of freedom
## Multiple R-squared: 0.08642, Adjusted R-squared: 0.08459
## F-statistic: 47.19 on 6 and 2993 DF, p-value: < 2.2e-16
The Regression Spline has 6 Degrees of Freedom, i.e, 6 Regression coefficients and an intercept. The ‘bs()’ function fits a cubic spline by default. We specify Degree, to fit different order polynomial.
reg_spl_2 <- lm(wage ~ bs(age, knots = c(25, 40, 60), degree = 2), data = Wage)
summary(reg_spl_2)
##
## Call:
## lm(formula = wage ~ bs(age, knots = c(25, 40, 60), degree = 2),
## data = Wage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -98.306 -24.588 -5.079 15.444 201.901
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 58.557 8.097 7.232
## bs(age, knots = c(25, 40, 60), degree = 2)1 14.014 9.982 1.404
## bs(age, knots = c(25, 40, 60), degree = 2)2 58.145 8.090 7.187
## bs(age, knots = c(25, 40, 60), degree = 2)3 61.092 8.648 7.064
## bs(age, knots = c(25, 40, 60), degree = 2)4 57.463 9.370 6.133
## bs(age, knots = c(25, 40, 60), degree = 2)5 17.948 15.766 1.138
## Pr(>|t|)
## (Intercept) 6.02e-13 ***
## bs(age, knots = c(25, 40, 60), degree = 2)1 0.160
## bs(age, knots = c(25, 40, 60), degree = 2)2 8.32e-13 ***
## bs(age, knots = c(25, 40, 60), degree = 2)3 2.00e-12 ***
## bs(age, knots = c(25, 40, 60), degree = 2)4 9.77e-10 ***
## bs(age, knots = c(25, 40, 60), degree = 2)5 0.255
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39.9 on 2994 degrees of freedom
## Multiple R-squared: 0.0871, Adjusted R-squared: 0.08558
## F-statistic: 57.13 on 5 and 2994 DF, p-value: < 2.2e-16
We can fit a regression spline using degrees of freedom also. Dof = Degree of Polynomial + Number of Knots. In this case we dont specify knots. R decides knots quantiles of the predictor. We can know the knots by ‘attr()’ function.
reg_spl_3 <- lm(wage ~ bs(age, df = 5), data = Wage)
summary(reg_spl_3)
##
## Call:
## lm(formula = wage ~ bs(age, df = 5), data = Wage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -98.993 -24.280 -5.203 15.626 202.006
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 58.366 6.409 9.107 < 2e-16 ***
## bs(age, df = 5)1 27.713 10.682 2.594 0.00952 **
## bs(age, df = 5)2 65.069 6.203 10.490 < 2e-16 ***
## bs(age, df = 5)3 55.451 9.251 5.994 2.29e-09 ***
## bs(age, df = 5)4 68.386 10.290 6.646 3.57e-11 ***
## bs(age, df = 5)5 16.579 15.564 1.065 0.28687
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39.9 on 2994 degrees of freedom
## Multiple R-squared: 0.08727, Adjusted R-squared: 0.08574
## F-statistic: 57.25 on 5 and 2994 DF, p-value: < 2.2e-16
To know the knots
attr(bs(age, df = 5), "knots")
## 33.33333% 66.66667%
## 37 48
We can make predictions by using ‘predict()’ function. I am creating a new dataset for test set that is to be used in Predict function.
agelims = range(age)
age_grid <- seq(agelims[1], agelims[2])
pred_reg <- predict(reg_spl_2, newdata = list(age_grid))
head(pred_reg)
## 1 2 3 4 5 6
## 58.55681 82.60125 118.58814 118.39123 118.77425 118.60822
To fit a Natural Spline, we use the ‘ns()’ function. Regression Spline with Boundary Conditions are called Natural Splines. The formula for Dof is Dof = Degree of Polynomial + Number of Knots - 2. The ‘ns()’ function fits only ‘Natural Cubic Splines’
ns_reg <- lm(wage ~ ns(age, df = 4), data = Wage)
summary(ns_reg)
##
## Call:
## lm(formula = wage ~ ns(age, df = 4), data = Wage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -98.737 -24.477 -5.083 15.371 204.874
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 58.556 4.235 13.827 <2e-16 ***
## ns(age, df = 4)1 60.462 4.190 14.430 <2e-16 ***
## ns(age, df = 4)2 41.963 4.372 9.597 <2e-16 ***
## ns(age, df = 4)3 97.020 10.386 9.341 <2e-16 ***
## ns(age, df = 4)4 9.773 8.657 1.129 0.259
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39.92 on 2995 degrees of freedom
## Multiple R-squared: 0.08598, Adjusted R-squared: 0.08476
## F-statistic: 70.43 on 4 and 2995 DF, p-value: < 2.2e-16
attr(ns(age, df =4), "knots")
## 25% 50% 75%
## 33.75 42.00 51.00
We can use the predict Function to Make predictions.
pred_ns <- predict(ns_reg, newdata = list(age_grid))
head(pred_ns)
## 1 2 3 4 5 6
## 58.55632 82.96999 119.30122 118.82300 119.63414 118.78009
Smoothing Splines are similar to Natural Splines, but it has penalty for over Wiggling of curve. We have to choose the DoF such that this penality is minimized, that is, The curve is smooth. We use “smooth.spline()” to fit smoothing splines. We can either specify the DoF or let R choose for us using Cross Validation.
If we want to specify DoF….
ss1 <- smooth.spline(age, wage, df = 16)
We can instead use Cross Validation to let R choose DoF.
ss2 <- smooth.spline(age, wage, cv = T)
## Warning in smooth.spline(age, wage, cv = T): cross-validation with non-
## unique 'x' values seems doubtful
To know the DoF..
ss2$df
## [1] 6.794596
We now fit a GAM to predict wage using natural spline functions of year and age, treating education as a qualitative predictor. Since this is just a big linear regression model using an appropriate choice of basis functions, we can simply do this using the lm() function
gam_1 <- lm(wage ~ ns(year, 4) + ns(age, 5) +education, data = Wage)
We now fit the model using smoothing splines rather than natural splines. In order to fit more general sorts of GAMs, using smoothing splines or other components that cannot be expressed in terms of basis function and then fit using least squares regression, we will need to use the ‘gam’ library in R. The ‘s()’ function, which is part of the gam library, is used to indicate that ‘s()’ we would like to use a smoothing spline. We specify that the function of year should have 4 degrees of freedom, and that the function of age will have 5 degrees of freedom. Since education is qualitative, we leave it as is, and it is converted into four dummy variables.We use the gam() function in ‘gam()’ order to fit a GAM using these components. All of the terms are fit simultaneously, taking each other into account to explain the response.
library (gam)
## Warning: package 'gam' was built under R version 3.4.3
## Loaded gam 1.14-4
gam_2 <- gam(wage ~ s(year, 4) + s(age, 5) + education, data = Wage)
We can compare models using Anova Test.
gam_m1 <- gam(wage ~ s(age, 5) + education, data = Wage)
gam_m2 <- gam(wage ~ year+s(age,5) + education, data=Wage)
anova(gam_m1, gam_m2, test="F")
## Analysis of Deviance Table
##
## Model 1: wage ~ s(age, 5) + education
## Model 2: wage ~ year + s(age, 5) + education
## Resid. Df Resid. Dev Df Deviance F Pr(>F)
## 1 2990 3711731
## 2 2989 3693842 1 17889 14.476 0.0001448 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since P-Value is less than 5%, we can say that gam_m2 is better than gam_m1
It is important to remember that, the two models specified in ‘anova’ must be nested.