Regression and Smoothing

Linear Regression

Linear regresion is used to know correlation between two or more variables and predict the quantitative value of depedent variable using independent variables. It measures correlation, but not Causation. It is a simple line equation between independent variables and dependent varaible. The following code is used to regress in R

library(ISLR)

## Warning: package 'ISLR' was built under R version 3.4.3

library(MASS)

## Warning: package 'MASS' was built under R version 3.4.3

To use ‘Boston’ dataset in ISLR Package

fix(Boston)
names(Boston)

##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"

Now, I will regress ‘medv’ (our dependent variable) on ‘crim’ and ‘zn’. Dataset to be worked with, can be attached or mentioned in our code. I prefer to mention it in code.

linmod <- lm(medv ~ crim + zn, data = Boston)

let us look at our model ad its summary

linmod

## 
## Call:
## lm(formula = medv ~ crim + zn, data = Boston)
## 
## Coefficients:
## (Intercept)         crim           zn  
##     22.4856      -0.3521       0.1161

summary(linmod)

## 
## Call:
## lm(formula = medv ~ crim + zn, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.421  -5.060  -1.558   2.121  30.765 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 22.48563    0.44173  50.904  < 2e-16 ***
## crim        -0.35208    0.04259  -8.267 1.24e-15 ***
## zn           0.11611    0.01571   7.392 6.09e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.065 on 503 degrees of freedom
## Multiple R-squared:  0.234,  Adjusted R-squared:  0.2309 
## F-statistic: 76.82 on 2 and 503 DF,  p-value: < 2.2e-16

I prefer to work with summary command. In the coefficient sub field, we need to check the p-value. More the stars, more significant the variables are. I prefer p-value less than 0.05

Now, let us use all the variables to create our linear model

linmod_1 <- lm(medv~., data = Boston)

and look at its summary

summary(linmod_1)

## 
## Call:
## lm(formula = medv ~ ., data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.595  -2.730  -0.518   1.777  26.199 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
## crim        -1.080e-01  3.286e-02  -3.287 0.001087 ** 
## zn           4.642e-02  1.373e-02   3.382 0.000778 ***
## indus        2.056e-02  6.150e-02   0.334 0.738288    
## chas         2.687e+00  8.616e-01   3.118 0.001925 ** 
## nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
## rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
## age          6.922e-04  1.321e-02   0.052 0.958229    
## dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
## rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
## tax         -1.233e-02  3.760e-03  -3.280 0.001112 ** 
## ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
## black        9.312e-03  2.686e-03   3.467 0.000573 ***
## lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7338 
## F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16

We can use ‘names’ function on summary to find what all information summary of the model stores.

names(summary(linmod_1))

##  [1] "call"          "terms"         "residuals"     "coefficients" 
##  [5] "aliased"       "sigma"         "df"            "r.squared"    
##  [9] "adj.r.squared" "fstatistic"    "cov.unscaled"

I use adjusted R Squared parameter to assess my model. The value Closer to 1 is better model. Adjusted R Squared can be seen in summary of linmod_1 also.

summary(linmod_1)$adj.r.squared

## [1] 0.7337897

summary(linmod)$adj.r.squared

## [1] 0.2309427

As linmod_1 has higher Adj R Sq, I would prefer ‘linmod_1’

Making Prediction

Now using linmod_1 let us make predictions. I am subsetting Boston data to make redictions (Just for example. Usually there would be a seperate test data set)

test <- head(Boston, 10)

pred <- predict(linmod_1, newdata = test, interval = "prediction")
pred

##         fit       lwr      upr
## 1  30.00384 20.601726 39.40596
## 2  25.02556 15.650368 34.40076
## 3  30.56760 21.189093 39.94610
## 4  28.60704 19.210599 38.00347
## 5  27.94352 18.545176 37.34187
## 6  25.25628 15.863926 34.64864
## 7  23.00181 13.613343 32.39027
## 8  19.53599 10.072729 28.99925
## 9  11.52364  1.956786 21.09049
## 10 18.92026  9.462231 28.37829

Not specifying interval parameter in predict just gives ‘fit’ value, which we generally use. Prediction in interval parameter gives Prediction interval and ‘confidence’ is used for Confidence interval.

Interaction Terms

For interaction between terms, we use ’*‘instead of’+’ in the formula

intmod <- lm(medv ~ lstat*age, data = Boston)
summary(intmod)

## 
## Call:
## lm(formula = medv ~ lstat * age, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.806  -4.045  -1.333   2.085  27.552 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 36.0885359  1.4698355  24.553  < 2e-16 ***
## lstat       -1.3921168  0.1674555  -8.313 8.78e-16 ***
## age         -0.0007209  0.0198792  -0.036   0.9711    
## lstat:age    0.0041560  0.0018518   2.244   0.0252 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.149 on 502 degrees of freedom
## Multiple R-squared:  0.5557, Adjusted R-squared:  0.5531 
## F-statistic: 209.3 on 3 and 502 DF,  p-value: < 2.2e-16

Using Non Linearity in the model

Let us construct a modl with ‘lstat’ as our only independent variable. An another model with lstat and lstat^2 as our independent varable.

mod <- lm(medv ~ lstat, data = Boston)
summary(mod)

## 
## Call:
## lm(formula = medv ~ lstat, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.168  -3.990  -1.318   2.034  24.500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.55384    0.56263   61.41   <2e-16 ***
## lstat       -0.95005    0.03873  -24.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared:  0.5441, Adjusted R-squared:  0.5432 
## F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16

Now with lstat and lstat^2. We have to square lstat variable. We use identity ‘I’ operator in the formula to square any variable So..

mod_1 <- lm(medv ~ lstat + I(lstat^2), data = Boston)
summary(mod_1)

## 
## Call:
## lm(formula = medv ~ lstat + I(lstat^2), data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2834  -3.8313  -0.5295   2.3095  25.4148 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 42.862007   0.872084   49.15   <2e-16 ***
## lstat       -2.332821   0.123803  -18.84   <2e-16 ***
## I(lstat^2)   0.043547   0.003745   11.63   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared:  0.6407, Adjusted R-squared:  0.6393 
## F-statistic: 448.5 on 2 and 503 DF,  p-value: < 2.2e-16

To confirm, which model is better, we can use either R Sq values or Anova test. I Will use anova test this time. To use Anova Test, the models should be nested i.e, one model should be subset of the other.

anova(mod, mod_1)

## Analysis of Variance Table
## 
## Model 1: medv ~ lstat
## Model 2: medv ~ lstat + I(lstat^2)
##   Res.Df   RSS Df Sum of Sq     F    Pr(>F)    
## 1    504 19472                                 
## 2    503 15347  1    4125.1 135.2 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The model with square term has a p-value that is virtually zero which rejects the Null Hypothesis.

Null Hypothesis : Both models are equally good. Alternate Hypothesis : Model with More variables is better.

So mod_1 is better that mod.

Using Polynomials

Let us create a model with 1,2,3,4,5th power of ‘lstat’

polymod <- lm(medv ~ poly(lstat, 5), data = Boston)
summary(polymod)

## 
## Call:
## lm(formula = medv ~ poly(lstat, 5), data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5433  -3.1039  -0.7052   2.0844  27.1153 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       22.5328     0.2318  97.197  < 2e-16 ***
## poly(lstat, 5)1 -152.4595     5.2148 -29.236  < 2e-16 ***
## poly(lstat, 5)2   64.2272     5.2148  12.316  < 2e-16 ***
## poly(lstat, 5)3  -27.0511     5.2148  -5.187 3.10e-07 ***
## poly(lstat, 5)4   25.4517     5.2148   4.881 1.42e-06 ***
## poly(lstat, 5)5  -19.2524     5.2148  -3.692 0.000247 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.215 on 500 degrees of freedom
## Multiple R-squared:  0.6817, Adjusted R-squared:  0.6785 
## F-statistic: 214.2 on 5 and 500 DF,  p-value: < 2.2e-16

Rest of the process is same. Select variables with more number of stars and make predictions on new data.

Summary Points:

Select the variables with more number of stars.
Select the model with Higher R squared Value

Ridge and Lasso

We will use ‘glmnet()’ function of ‘glmnet’ package to perform Ridge and Lasso regressions.

library(glmnet)

## Warning: package 'glmnet' was built under R version 3.4.3

## Loading required package: Matrix

## Loading required package: foreach

## Loaded glmnet 2.0-13

With respect to code, Ridge and Lasso regressions are same, except for the alpha parameter.

For Ridge Regression, alpha = 0 For Lasso Regression, alpha = 1

We will use ‘Boston’ dataset of ISLR Package to fit the ridge regression.

library(ISLR)
library(MASS)
fix(Boston)

names(Boston)

##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"

In ridge and lasso regression we have to give the predictors as a matrix and response variable as a vector.

x <- model.matrix(medv ~., data = Boston)[,-1]
y <- Boston$medv

In Ridge (and Lasso), for each value of Lambda, we find the co-efficients of all the predictors. So we have to specify a list of lambdas on which we calculate the coefficients.

grid <- 10^seq(10, -2, length = 100)

Now the ridge regression. We specify the input, output, lambda and also specify if we have to standardize the predictors or not.

ridge_mod <- glmnet(x, y, alpha = 0, lambda = grid, standardize = TRUE)

‘coef(ridge_mod)’ is a matrix will the regression coefficients for different values of lambda. Predictors are rows and lambda is the columns. There are 13 predictors and one intercept. So 13 rows. We have 100 values of lamda, so 100 columns.

dim(coef(ridge_mod))

## [1]  14 100

glmnet package has a ‘predict’ function that can be used to predict the coefficients for a given lamda or response of new dataset for a given lambda and a model.

Predict Co-efficients

The predict the coefficients for a given lambda, we give the model name, lamda and type as coefficient for the ‘predict’ function.

predict(ridge_mod, s = 50, type = "coefficients")

## 14 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept) 23.580881779
## crim        -0.036358794
## zn           0.011707752
## indus       -0.052663644
## chas         0.914385347
## nox         -2.584648038
## rm           1.136076181
## age         -0.008821137
## dis          0.021708896
## rad         -0.026332105
## tax         -0.002040002
## ptratio     -0.238276456
## black        0.003150038
## lstat       -0.106839435

Predictions for New Data

Now, I will take the first 25 observations of the boston data as my test data.

test_x <- head(x, 25)
test_y <- head(y, 25)

Now, let us make predictions for the ‘test_x’ dataset. To make predictions, we pass the model, lambda and new data into the predict function.

pred_y <- predict(ridge_mod, s = 50, newx = test_x)

Choosing the Best Lambda

We have seen that, we get different coefficients for different lamda. So now we have to find an optimal lambda where the test MSE is minimum. For that we will use Cross-Validation.

ridge_cv <- cv.glmnet(x, y, alpha = 0, standardize = T)

Let us look at what ‘ridge_cv’ has.

names(ridge_cv)

##  [1] "lambda"     "cvm"        "cvsd"       "cvup"       "cvlo"      
##  [6] "nzero"      "name"       "glmnet.fit" "lambda.min" "lambda.1se"

We use ‘lambda.min’, to find the optimal lambda i.e., lambda at which the MSE is minimum.

ridge_cv$lambda.min

## [1] 0.7438467

We can use this lambda to make our predictions, using Predict function.

pred_y <- predict(ridge_cv, s = 0.7438467, newx = test_x)

We can calculate our MSE by

mean((test_y - pred_y)^2)

## [1] 11.15965

Step Functions (For Single Predictor)

In this, the predictor is ‘cut’ at different points in its range. For example, our Predictor X has a range of 0 to 10. Cut points are defined at 2 and 7. Thre different regressions are fit in the range. There will a linear Regression between 0 to 2, another from 2 to 7, another for 7 to 10. i.e, Regression Coefficients change at these cut points.

This function can be done in ‘lm’ function using ‘cut()’ as another parameter in the model. We will use ‘Wage’ dataset to understand and make predictions using Step Functions.

library(ISLR)
attach(Wage)
names(Wage)

##  [1] "year"       "age"        "maritl"     "race"       "education" 
##  [6] "region"     "jobclass"   "health"     "health_ins" "logwage"   
## [11] "wage"

Now let us fit our linear regression using step function. We need to make a model that predicts ‘wage’ using ‘age’ variable. We will use ‘cut()’ function to let R decide the cut points

fit <- lm(wage ~ cut(age, 4), data = Wage)
summary(fit)

## 
## Call:
## lm(formula = wage ~ cut(age, 4), data = Wage)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -98.126 -24.803  -6.177  16.493 200.519 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              94.158      1.476  63.790   <2e-16 ***
## cut(age, 4)(33.5,49]     24.053      1.829  13.148   <2e-16 ***
## cut(age, 4)(49,64.5]     23.665      2.068  11.443   <2e-16 ***
## cut(age, 4)(64.5,80.1]    7.641      4.987   1.532    0.126    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40.42 on 2996 degrees of freedom
## Multiple R-squared:  0.0625, Adjusted R-squared:  0.06156 
## F-statistic: 66.58 on 3 and 2996 DF,  p-value: < 2.2e-16

Here, R divided ‘age’ into 4 intervals as specified in ‘cut(age, 4)’. We will use ‘breaks’ function to provide cut points at 49 and 64.5

fit_1 <- lm(wage ~ cut(age, breaks = c(49, 70, max(age))), data = Wage)
summary(fit_1)

## 
## Call:
## lm(formula = wage ~ cut(age, breaks = c(49, 70, max(age))), data = Wage)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -94.19 -26.67  -7.32  16.08 201.19 
## 
## Coefficients:
##                                               Estimate Std. Error t value
## (Intercept)                                    117.154      1.492  78.515
## cut(age, breaks = c(49, 70, max(age)))(70,80]  -20.146      8.083  -2.492
##                                               Pr(>|t|)    
## (Intercept)                                     <2e-16 ***
## cut(age, breaks = c(49, 70, max(age)))(70,80]   0.0129 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.78 on 849 degrees of freedom
##   (2149 observations deleted due to missingness)
## Multiple R-squared:  0.007264,   Adjusted R-squared:  0.006095 
## F-statistic: 6.212 on 1 and 849 DF,  p-value: 0.01288

Spline

In this section, we deal with Regression Splines and Natural Splines. Regression splines are more flexible than polynomials and step functions, and in fact are an extension of the two. They involve dividing the range of X into K distinct regions. Within each region, a polynomial function is fit to the data. However, these polynomials are constrained so that they join smoothly at the region boundaries, or knots. Provided that the interval is divided into enough regions, this can produce an extremely flexible fit.

To create a Regression Spline, we use the ‘splines’ library and ‘Wage’ dataset to predict wages using age.

library(splines)
library(ISLR)
attach(Wage)

## The following objects are masked from Wage (pos = 4):
## 
##     age, education, health, health_ins, jobclass, logwage, maritl,
##     race, region, wage, year

To fit a Regression Spline, we use ‘bs()’ function, specifying predictor and knots.

reg_spl <- lm(wage ~ bs(age, knots = c(25, 40, 60)), data = Wage)
summary(reg_spl)

## 
## Call:
## lm(formula = wage ~ bs(age, knots = c(25, 40, 60)), data = Wage)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -98.832 -24.537  -5.049  15.209 203.207 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       60.494      9.460   6.394 1.86e-10 ***
## bs(age, knots = c(25, 40, 60))1    3.980     12.538   0.317 0.750899    
## bs(age, knots = c(25, 40, 60))2   44.631      9.626   4.636 3.70e-06 ***
## bs(age, knots = c(25, 40, 60))3   62.839     10.755   5.843 5.69e-09 ***
## bs(age, knots = c(25, 40, 60))4   55.991     10.706   5.230 1.81e-07 ***
## bs(age, knots = c(25, 40, 60))5   50.688     14.402   3.520 0.000439 ***
## bs(age, knots = c(25, 40, 60))6   16.606     19.126   0.868 0.385338    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39.92 on 2993 degrees of freedom
## Multiple R-squared:  0.08642,    Adjusted R-squared:  0.08459 
## F-statistic: 47.19 on 6 and 2993 DF,  p-value: < 2.2e-16

The Regression Spline has 6 Degrees of Freedom, i.e, 6 Regression coefficients and an intercept. The ‘bs()’ function fits a cubic spline by default. We specify Degree, to fit different order polynomial.

reg_spl_2 <- lm(wage ~ bs(age, knots = c(25, 40, 60), degree = 2), data = Wage)
summary(reg_spl_2)

## 
## Call:
## lm(formula = wage ~ bs(age, knots = c(25, 40, 60), degree = 2), 
##     data = Wage)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -98.306 -24.588  -5.079  15.444 201.901 
## 
## Coefficients:
##                                             Estimate Std. Error t value
## (Intercept)                                   58.557      8.097   7.232
## bs(age, knots = c(25, 40, 60), degree = 2)1   14.014      9.982   1.404
## bs(age, knots = c(25, 40, 60), degree = 2)2   58.145      8.090   7.187
## bs(age, knots = c(25, 40, 60), degree = 2)3   61.092      8.648   7.064
## bs(age, knots = c(25, 40, 60), degree = 2)4   57.463      9.370   6.133
## bs(age, knots = c(25, 40, 60), degree = 2)5   17.948     15.766   1.138
##                                             Pr(>|t|)    
## (Intercept)                                 6.02e-13 ***
## bs(age, knots = c(25, 40, 60), degree = 2)1    0.160    
## bs(age, knots = c(25, 40, 60), degree = 2)2 8.32e-13 ***
## bs(age, knots = c(25, 40, 60), degree = 2)3 2.00e-12 ***
## bs(age, knots = c(25, 40, 60), degree = 2)4 9.77e-10 ***
## bs(age, knots = c(25, 40, 60), degree = 2)5    0.255    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39.9 on 2994 degrees of freedom
## Multiple R-squared:  0.0871, Adjusted R-squared:  0.08558 
## F-statistic: 57.13 on 5 and 2994 DF,  p-value: < 2.2e-16

We can fit a regression spline using degrees of freedom also. Dof = Degree of Polynomial + Number of Knots. In this case we dont specify knots. R decides knots quantiles of the predictor. We can know the knots by ‘attr()’ function.

reg_spl_3 <- lm(wage ~ bs(age, df = 5), data = Wage)
summary(reg_spl_3)

## 
## Call:
## lm(formula = wage ~ bs(age, df = 5), data = Wage)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -98.993 -24.280  -5.203  15.626 202.006 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        58.366      6.409   9.107  < 2e-16 ***
## bs(age, df = 5)1   27.713     10.682   2.594  0.00952 ** 
## bs(age, df = 5)2   65.069      6.203  10.490  < 2e-16 ***
## bs(age, df = 5)3   55.451      9.251   5.994 2.29e-09 ***
## bs(age, df = 5)4   68.386     10.290   6.646 3.57e-11 ***
## bs(age, df = 5)5   16.579     15.564   1.065  0.28687    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39.9 on 2994 degrees of freedom
## Multiple R-squared:  0.08727,    Adjusted R-squared:  0.08574 
## F-statistic: 57.25 on 5 and 2994 DF,  p-value: < 2.2e-16

To know the knots

attr(bs(age, df = 5), "knots")

## 33.33333% 66.66667% 
##        37        48

We can make predictions by using ‘predict()’ function. I am creating a new dataset for test set that is to be used in Predict function.

agelims = range(age)
age_grid <- seq(agelims[1], agelims[2])

pred_reg <- predict(reg_spl_2, newdata = list(age_grid))
head(pred_reg)

##         1         2         3         4         5         6 
##  58.55681  82.60125 118.58814 118.39123 118.77425 118.60822

To fit a Natural Spline, we use the ‘ns()’ function. Regression Spline with Boundary Conditions are called Natural Splines. The formula for Dof is Dof = Degree of Polynomial + Number of Knots - 2. The ‘ns()’ function fits only ‘Natural Cubic Splines’

ns_reg <- lm(wage ~ ns(age, df = 4), data = Wage)
summary(ns_reg)

## 
## Call:
## lm(formula = wage ~ ns(age, df = 4), data = Wage)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -98.737 -24.477  -5.083  15.371 204.874 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        58.556      4.235  13.827   <2e-16 ***
## ns(age, df = 4)1   60.462      4.190  14.430   <2e-16 ***
## ns(age, df = 4)2   41.963      4.372   9.597   <2e-16 ***
## ns(age, df = 4)3   97.020     10.386   9.341   <2e-16 ***
## ns(age, df = 4)4    9.773      8.657   1.129    0.259    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39.92 on 2995 degrees of freedom
## Multiple R-squared:  0.08598,    Adjusted R-squared:  0.08476 
## F-statistic: 70.43 on 4 and 2995 DF,  p-value: < 2.2e-16

attr(ns(age, df =4), "knots")

##   25%   50%   75% 
## 33.75 42.00 51.00

We can use the predict Function to Make predictions.

pred_ns <- predict(ns_reg, newdata = list(age_grid))
head(pred_ns)

##         1         2         3         4         5         6 
##  58.55632  82.96999 119.30122 118.82300 119.63414 118.78009

Smoothing Splines

Smoothing Splines are similar to Natural Splines, but it has penalty for over Wiggling of curve. We have to choose the DoF such that this penality is minimized, that is, The curve is smooth. We use “smooth.spline()” to fit smoothing splines. We can either specify the DoF or let R choose for us using Cross Validation.

If we want to specify DoF….

ss1 <- smooth.spline(age, wage, df = 16)

We can instead use Cross Validation to let R choose DoF.

ss2 <- smooth.spline(age, wage, cv = T)

## Warning in smooth.spline(age, wage, cv = T): cross-validation with non-
## unique 'x' values seems doubtful

To know the DoF..

ss2$df

## [1] 6.794596

Generalized Additive Models

We now fit a GAM to predict wage using natural spline functions of year and age, treating education as a qualitative predictor. Since this is just a big linear regression model using an appropriate choice of basis functions, we can simply do this using the lm() function

gam_1 <- lm(wage ~ ns(year, 4) + ns(age, 5) +education, data = Wage)

We now fit the model using smoothing splines rather than natural splines. In order to fit more general sorts of GAMs, using smoothing splines or other components that cannot be expressed in terms of basis function and then fit using least squares regression, we will need to use the ‘gam’ library in R. The ‘s()’ function, which is part of the gam library, is used to indicate that ‘s()’ we would like to use a smoothing spline. We specify that the function of year should have 4 degrees of freedom, and that the function of age will have 5 degrees of freedom. Since education is qualitative, we leave it as is, and it is converted into four dummy variables.We use the gam() function in ‘gam()’ order to fit a GAM using these components. All of the terms are fit simultaneously, taking each other into account to explain the response.

library (gam)

## Warning: package 'gam' was built under R version 3.4.3

## Loaded gam 1.14-4

gam_2 <- gam(wage ~ s(year, 4) + s(age, 5) + education, data = Wage)

We can compare models using Anova Test.

gam_m1 <- gam(wage ~ s(age, 5) + education, data = Wage)
gam_m2 <- gam(wage ~ year+s(age,5) + education, data=Wage)
anova(gam_m1, gam_m2, test="F")

## Analysis of Deviance Table
## 
## Model 1: wage ~ s(age, 5) + education
## Model 2: wage ~ year + s(age, 5) + education
##   Resid. Df Resid. Dev Df Deviance      F    Pr(>F)    
## 1      2990    3711731                                 
## 2      2989    3693842  1    17889 14.476 0.0001448 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since P-Value is less than 5%, we can say that gam_m2 is better than gam_m1

It is important to remember that, the two models specified in ‘anova’ must be nested.