library(MASS)
data('mcycle')
head(mcycle)
## times accel
## 1 2.4 0.0
## 2 2.6 -1.3
## 3 3.2 -2.7
## 4 3.6 0.0
## 5 4.0 -2.7
## 6 6.2 -2.7
dim(mcycle)
## [1] 133 2
str(mcycle)
## 'data.frame': 133 obs. of 2 variables:
## $ times: num 2.4 2.6 3.2 3.6 4 6.2 6.6 6.8 7.8 8.2 ...
## $ accel: num 0 -1.3 -2.7 0 -2.7 -2.7 -2.7 -1.3 -2.7 -2.7 ...
summary(mcycle)
## times accel
## Min. : 2.40 Min. :-134.00
## 1st Qu.:15.60 1st Qu.: -54.90
## Median :23.40 Median : -13.30
## Mean :25.18 Mean : -25.55
## 3rd Qu.:34.80 3rd Qu.: 0.00
## Max. :57.60 Max. : 75.00
The motorcycle data set has 133 observations, with two numeric column variables: Times and Acceleration
# Rename the variables for ease of usage
Y <- mcycle$accel
X <- mcycle$times
#Scatterplot
plot(Y~X, xlab="time",ylab="Acceleration", main="Scatterplot of Acceleration against Time")
#Histogram
par(mfrow=c(1,2))
hist(mcycle$time, prob=T, col="grey", breaks=20, main="Histogram and Density of Time", xlim=c(2,58), xlab="Time")
lines(density(mcycle$time), col="red", lwd=2)
# Add a vertical line that indicates the average of Time
abline(v=mean(mcycle$time), col="blue", lty=2, lwd=2)
#Acceleration
hist(mcycle$accel, prob=T, col="blue", breaks=20, main="Histogram and Density of Acceleration", xlim=c(-140,78), xlab="Acceleration")
lines(density(mcycle$accel), col="red", lwd=2)
# Add a vertical line that indicates the average of Sepal Length
abline(v=mean(mcycle$accel), col="black", lty=2, lwd=2)
# Individual Box plot
boxplot(mcycle$times, notch=T, col="grey", main="Boxplot of Times")
boxplot(mcycle$accel, notch=T, col= "blue", main="Boxplot of Acceleration")
The scatterplot suggests non-linearity in the relationship between the variables. The histogram reveals that both variables aren’t normally distributed. However, there are no outliers in either variables as shown in the box plot.
We seek to create a model that fits the data set. First, we regress Acceleration on Time to investigate the impact of times on acceleration and how much variation in acceleration is explained by times. X is the time since the motorcycle crash and Y is the acceleration of the driver’s head.
cor(mcycle$times,mcycle$accel)
## [1] 0.2964033
Y <- mcycle$accel
X <- mcycle$times
model1 <- lm(Y~X, data= mcycle)
model1_summary <- summary(model1)
model1_summary
##
## Call:
## lm(formula = Y ~ X, data = mcycle)
##
## Residuals:
## Min 1Q Median 3Q Max
## -104.114 -25.926 4.582 36.163 94.197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -53.008 8.713 -6.084 1.2e-08 ***
## X 1.091 0.307 3.552 0.000532 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 46.33 on 131 degrees of freedom
## Multiple R-squared: 0.08785, Adjusted R-squared: 0.08089
## F-statistic: 12.62 on 1 and 131 DF, p-value: 0.0005318
Assessing the Overall Accuracy of the Model
#MSE
(model1_summary$sigma)^2
## [1] 2146.136
model1_summary$r.squared
## [1] 0.08785493
model1_summary$adj.r.squared
## [1] 0.08089199
AIC(model1)
## [1] 1401.722
BIC(model1)
## [1] 1410.393
Fitted Plot and Confidence Interval
#Plot predicted interval and values of linear regression
newx = seq(min(X),max(X),by = 0.05)
conf_interval <- predict(model1, newdata=data.frame(X=newx), interval="confidence",
level = 0.95)
plot(X, Y, xlab="Times", ylab="Acceleration", main="Linear Regression")
abline(model1, col="blue", lwd = 2)
lines(newx, conf_interval[,2], col="red", lwd =1)
lines(newx, conf_interval[,3], col="red", lwd=1)
Although, the p-value of the F-test shows that the model is significant, the adjusted R-square is very low indicating that very small variation in Y (acceleration) is explained by X (times). The plot of the estimated line shows that the linear regression model doesn’t do a good job with fitting the dataset.This is based on the lack of linearity between the variables as shown in the scatter plot.
Diagnostic Plots
#Diagnostic plots
par(mfrow=c(2,2))
plot(model1)
On the basis of the residual plots, there is some evidence of non-linearity.
In this model, we have two predictor variable terms: X and X^2
model2 <- lm(Y~X+I(X^2), data=mcycle)
summary(model2)
##
## Call:
## lm(formula = Y ~ X + I(X^2), data = mcycle)
##
## Residuals:
## Min 1Q Median 3Q Max
## -96.527 -30.817 9.589 29.210 104.728
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.22700 15.49471 -0.983 0.32757
## X -2.30749 1.20439 -1.916 0.05757 .
## I(X^2) 0.05935 0.02038 2.912 0.00422 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 45.06 on 130 degrees of freedom
## Multiple R-squared: 0.1437, Adjusted R-squared: 0.1306
## F-statistic: 10.91 on 2 and 130 DF, p-value: 4.167e-05
The near-zero p-value associated with the quadratic term suggests that it leads to an improved model. We use the anova() function to further quantify the extent to which the quadratic fit is superior to the linear fit
anova(model1 ,model2)
## Analysis of Variance Table
##
## Model 1: Y ~ X
## Model 2: Y ~ X + I(X^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 131 281144
## 2 130 263923 1 17221 8.4823 0.004222 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The anova() function performs a hypothesis test comparing the linear regression model and the quadratic model. The null hypothesis is that the two models fit the data equally well, and the alternative hypothesis is that the quadratric model is superior. Here the F-statistic is 8.48 and the associated p-value is less than 0.05. This provides very clear evidence that the quadratic model superior to the linear model that only contains the predictor X. This is not surprising, since earlier we saw evidence for non-linearity in the relationship between Y and X.
Assessing the Overall Accuracy of the Model
model2_summary <- summary(model2)
(model2_summary$sigma)^2
## [1] 2030.179
model2_summary$r.squared
## [1] 0.1437254
model2_summary$adj.r.squared
## [1] 0.130552
AIC(model2)
## [1] 1395.315
BIC(model2)
## [1] 1406.877
Fitted Plot and Confidence Interval
#plot prediction and conf.interval
timelims <- range(X)
times.grid1=seq(from=timelims [1],to=timelims [2])
preds1=predict (model2 ,newdata =list(X=times.grid1),se=TRUE)
se.bands1=cbind(preds1$fit +2* preds1$se.fit ,preds1$fit -2* preds1$se.fit)
#Finally, we plot the data and add the fit from the degree-2 polynomial.
plot(X ,Y ,xlab="Times", main = "Quadratic",ylab="Acceleration", xlim=timelims ,cex=.5)
lines(times.grid1 ,preds1$fit ,lwd=2,col="blue")
matlines (times.grid1 ,se.bands1,lwd=1,col="red",lty=3)
Although, this model is superior to the linear model, with lower AIC and higher adjusted R-squared value. The plot of the estimated line shows that the quadratic model doesn’t do a good job with fitting the dataset.
Diagnostic Plots
#Diagnostic plots
par(mfrow=c(2,2))
plot(model2)
We see that when the X^2 term is included in the model, there is little discernible pattern in the residuals compared to the linear regression model with only X.
In this model, we have five predictor variable terms: X, X^2, X^3, X^4, and X^5.
model3 <- lm(Y~poly(X,5,raw=T),data=mcycle)
summary(model3)
##
## Call:
## lm(formula = Y ~ poly(X, 5, raw = T), data = mcycle)
##
## Residuals:
## Min 1Q Median 3Q Max
## -77.271 -21.285 0.975 25.386 82.371
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.059e+02 3.499e+01 -3.026 0.003 **
## poly(X, 5, raw = T)1 4.978e+01 1.036e+01 4.804 4.31e-06 ***
## poly(X, 5, raw = T)2 -6.359e+00 1.015e+00 -6.266 5.30e-09 ***
## poly(X, 5, raw = T)3 2.969e-01 4.253e-02 6.982 1.45e-10 ***
## poly(X, 5, raw = T)4 -5.742e-03 7.932e-04 -7.239 3.83e-11 ***
## poly(X, 5, raw = T)5 3.919e-05 5.406e-06 7.250 3.60e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33.9 on 127 degrees of freedom
## Multiple R-squared: 0.5264, Adjusted R-squared: 0.5078
## F-statistic: 28.24 on 5 and 127 DF, p-value: < 2.2e-16
Assessing the Overall Accuracy of the Model
model3_summary <- summary(model3)
(model3_summary$sigma)^2
## [1] 1149.305
model3_summary$r.squared
## [1] 0.5264409
model3_summary$adj.r.squared
## [1] 0.5077968
AIC(model3)
## [1] 1322.537
BIC(model3)
## [1] 1342.77
Fitted Plot and Confidence Interval
times.grid=seq(from=timelims [1],to=timelims [2])
preds2=predict (model3 ,newdata =list(X=times.grid),se=TRUE)
se.bands2=cbind(preds2$fit +2* preds2$se.fit ,preds2$fit -2* preds2$se.fit)
plot(X,Y ,xlab="Times", main = "Degree -5 Polynomial",ylab="Acceleration", xlim=timelims ,cex=.5)
lines(times.grid ,preds2$fit ,lwd=2,col="blue")
matlines (times.grid ,se.bands2,lwd=1,col="red",lty=1)
This model provides a better fit compared to the previous models. The adjusted R-squared value is higher. The model also has a lower AIC and BIC value. However, we see that a better model that provides better fitting could be created. Hence, we explore the use of splines.
First, we create regression splines by specifying a set of knots, producing a sequence of basis functions, and then using least squares to estimate the spline coefficients. The bs() function generates the entire matrix of basis functions for splines with the specified set of knots. By default, cubic splines are produced.This produces a spline with six basis functions. Recall that a cubic spline with three knots has seven degrees of freedom; these degrees of freedom are used up by an intercept, plus six basis functions.)
library (splines)
#we have prespecified knots at times 15.6,23.4, and 57.6 which are the 1st, 2nd, and 3rd quartiles.
summary(X)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.40 15.60 23.40 25.18 34.80 57.60
fit_sp=lm(Y~bs(X,knots=c(15.6,23.4,57.6)),data=mcycle)
summary(fit_sp)
##
## Call:
## lm(formula = Y ~ bs(X, knots = c(15.6, 23.4, 57.6)), data = mcycle)
##
## Residuals:
## Min 1Q Median 3Q Max
## -78.33 -27.41 -0.42 29.55 87.32
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -29.548 20.073 -1.472 0.143495
## bs(X, knots = c(15.6, 23.4, 57.6))1 132.892 37.896 3.507 0.000628
## bs(X, knots = c(15.6, 23.4, 57.6))2 -75.847 21.462 -3.534 0.000572
## bs(X, knots = c(15.6, 23.4, 57.6))3 31.795 30.790 1.033 0.303739
## bs(X, knots = c(15.6, 23.4, 57.6))4 86.612 32.941 2.629 0.009611
## bs(X, knots = c(15.6, 23.4, 57.6))5 2.278 29.709 0.077 0.939009
## bs(X, knots = c(15.6, 23.4, 57.6))6 NA NA NA NA
##
## (Intercept)
## bs(X, knots = c(15.6, 23.4, 57.6))1 ***
## bs(X, knots = c(15.6, 23.4, 57.6))2 ***
## bs(X, knots = c(15.6, 23.4, 57.6))3
## bs(X, knots = c(15.6, 23.4, 57.6))4 **
## bs(X, knots = c(15.6, 23.4, 57.6))5
## bs(X, knots = c(15.6, 23.4, 57.6))6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36.49 on 127 degrees of freedom
## Multiple R-squared: 0.4514, Adjusted R-squared: 0.4298
## F-statistic: 20.9 on 5 and 127 DF, p-value: 3.378e-15
AIC(fit_sp)
## [1] 1342.094
pred_sp=predict(fit_sp,newdata =list(X=times.grid),se=T)
plot(X ,Y, xlab="Times", ylab="Acceleration")
lines(times.grid ,pred_sp$fit ,lwd=2) #regression spline
lines(times.grid ,pred_sp$fit +2*pred_sp$se ,lty="dashed", col="red")
lines(times.grid ,pred_sp$fit -2*pred_sp$se ,lty="dashed",col="red")
Here the degree of freedom is pre-specified and different numbers are used to see the the best curve that fits the data.
#First: Natural Spline- pre-specified no of knots=4
fit2=lm(Y~ns(X,df=4),data=mcycle)
summary(fit2)
##
## Call:
## lm(formula = Y ~ ns(X, df = 4), data = mcycle)
##
## Residuals:
## Min 1Q Median 3Q Max
## -70.102 -26.688 0.452 24.329 90.446
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.43 14.60 1.536 0.1269
## ns(X, df = 4)1 -110.50 16.47 -6.708 5.72e-10 ***
## ns(X, df = 4)2 82.80 16.76 4.939 2.40e-06 ***
## ns(X, df = 4)3 -69.12 33.95 -2.036 0.0438 *
## ns(X, df = 4)4 -30.71 18.33 -1.676 0.0962 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36.06 on 128 degrees of freedom
## Multiple R-squared: 0.46, Adjusted R-squared: 0.4431
## F-statistic: 27.26 on 4 and 128 DF, p-value: 2.278e-16
AIC(fit2)
## [1] 1338.004
plot(X ,Y,main= "Natural Spline with df=4", xlab="Times", ylab="Acceleration")
pred2=predict (fit2 ,newdata=list(X=times.grid),se=T)
lines(times.grid , pred2$fit ,col="red",lwd=2)
lines(times.grid ,pred2$fit +2*pred2$se ,lty="dashed")
lines(times.grid ,pred2$fit -2*pred2$se ,lty="dashed")
fit2b=lm(Y~ns(X,df=6),data=mcycle)
summary(fit2b)
##
## Call:
## lm(formula = Y ~ ns(X, df = 6), data = mcycle)
##
## Residuals:
## Min 1Q Median 3Q Max
## -88.112 -13.897 0.906 15.380 50.787
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.6605 11.0149 -1.422 0.15757
## ns(X, df = 6)1 -75.8861 11.6492 -6.514 1.58e-09 ***
## ns(X, df = 6)2 -131.1411 16.1063 -8.142 3.23e-13 ***
## ns(X, df = 6)3 102.9920 14.8782 6.922 2.01e-10 ***
## ns(X, df = 6)4 -42.1524 13.6411 -3.090 0.00246 **
## ns(X, df = 6)5 57.7834 27.9160 2.070 0.04050 *
## ns(X, df = 6)6 0.5345 13.6277 0.039 0.96877
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.75 on 126 degrees of freedom
## Multiple R-squared: 0.7695, Adjusted R-squared: 0.7585
## F-statistic: 70.1 on 6 and 126 DF, p-value: < 2.2e-16
AIC(fit2b)
## [1] 1228.784
plot(X ,Y, main= "Natural Spline with df=6",xlab="Times", ylab="Acceleration")
pred2b=predict (fit2b ,newdata=list(X=times.grid),se=T)
lines(times.grid , pred2b$fit ,col="red",lwd=2)
lines(times.grid ,pred2b$fit +2*pred2b$se ,lty="dashed")
lines(times.grid ,pred2b$fit -2*pred2b$se ,lty="dashed")
We see that the spline with df=6 fits the data better than the spline with df=4.
fit2c=lm(Y~ns(X,df=10),data=mcycle)
summary(fit2c)
##
## Call:
## lm(formula = Y ~ ns(X, df = 10), data = mcycle)
##
## Residuals:
## Min 1Q Median 3Q Max
## -77.456 -12.715 0.132 11.165 50.647
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.423 11.578 0.123 0.902383
## ns(X, df = 10)1 17.352 16.196 1.071 0.286103
## ns(X, df = 10)2 -61.409 15.352 -4.000 0.000109 ***
## ns(X, df = 10)3 -115.554 18.547 -6.230 6.88e-09 ***
## ns(X, df = 10)4 -141.969 19.734 -7.194 5.55e-11 ***
## ns(X, df = 10)5 -5.013 16.948 -0.296 0.767883
## ns(X, df = 10)6 56.867 18.629 3.053 0.002785 **
## ns(X, df = 10)7 -1.645 17.276 -0.095 0.924283
## ns(X, df = 10)8 2.790 16.126 0.173 0.862925
## ns(X, df = 10)9 -14.362 30.125 -0.477 0.634402
## ns(X, df = 10)10 7.326 14.885 0.492 0.623459
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.45 on 122 degrees of freedom
## Multiple R-squared: 0.8005, Adjusted R-squared: 0.7841
## F-statistic: 48.94 on 10 and 122 DF, p-value: < 2.2e-16
AIC(fit2c)
## [1] 1217.585
plot(X ,Y , main= "Natural Spline with df=10", xlab="Times", ylab="Acceleration")
pred2c=predict (fit2c ,newdata=list(X=times.grid),se=T)
lines(times.grid , pred2c$fit ,col="red",lwd=2)
lines(times.grid ,pred2c$fit +2*pred2c$se ,lty="dashed")
lines(times.grid ,pred2c$fit -2*pred2c$se ,lty="dashed")
fit2d=lm(Y~ns(X,df=15),data=mcycle)
summary(fit2d)
##
## Call:
## lm(formula = Y ~ ns(X, df = 15), data = mcycle)
##
## Residuals:
## Min 1Q Median 3Q Max
## -74.353 -10.776 0.301 10.737 53.355
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.5775 12.5065 -0.046 0.963250
## ns(X, df = 15)1 0.5785 20.9962 0.028 0.978065
## ns(X, df = 15)2 -5.3747 19.6804 -0.273 0.785261
## ns(X, df = 15)3 -29.4141 16.6694 -1.765 0.080248 .
## ns(X, df = 15)4 -88.6391 19.3244 -4.587 1.14e-05 ***
## ns(X, df = 15)5 -83.9127 22.4445 -3.739 0.000288 ***
## ns(X, df = 15)6 -149.8495 25.9023 -5.785 6.16e-08 ***
## ns(X, df = 15)7 -86.6107 20.9990 -4.125 6.98e-05 ***
## ns(X, df = 15)8 -30.2330 18.3279 -1.650 0.101716
## ns(X, df = 15)9 45.5014 24.3217 1.871 0.063869 .
## ns(X, df = 15)10 47.2031 21.9966 2.146 0.033944 *
## ns(X, df = 15)11 -5.1229 20.8652 -0.246 0.806480
## ns(X, df = 15)12 15.3713 19.7548 0.778 0.438079
## ns(X, df = 15)13 -17.3563 19.9187 -0.871 0.385345
## ns(X, df = 15)14 -4.0118 34.6559 -0.116 0.908041
## ns(X, df = 15)15 13.1690 17.8760 0.737 0.462786
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.62 on 117 degrees of freedom
## Multiple R-squared: 0.8059, Adjusted R-squared: 0.781
## F-statistic: 32.38 on 15 and 117 DF, p-value: < 2.2e-16
AIC(fit2d)
## [1] 1223.947
plot(X ,Y, main= "Natural Spline with df=15", xlab="Times", ylab="Acceleration")
pred2d=predict (fit2d ,newdata=list(X=times.grid),se=T)
lines(times.grid , pred2d$fit ,col="red",lwd=2)
lines(times.grid ,pred2d$fit +2*pred2d$se ,lty="dashed")
lines(times.grid ,pred2d$fit -2*pred2d$se ,lty="dashed")
fit2e=lm(Y~ns(X,df=16),data=mcycle)
summary(fit2e)
##
## Call:
## lm(formula = Y ~ ns(X, df = 16), data = mcycle)
##
## Residuals:
## Min 1Q Median 3Q Max
## -72.757 -11.659 0.086 10.667 53.911
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.3708 12.6481 -0.108 0.913882
## ns(X, df = 16)1 -4.8791 20.0990 -0.243 0.808626
## ns(X, df = 16)2 10.7585 23.4960 0.458 0.647890
## ns(X, df = 16)3 -31.6368 17.2242 -1.837 0.068804 .
## ns(X, df = 16)4 -54.3207 18.9762 -2.863 0.004987 **
## ns(X, df = 16)5 -101.8265 20.2903 -5.018 1.90e-06 ***
## ns(X, df = 16)6 -93.3062 24.8924 -3.748 0.000279 ***
## ns(X, df = 16)7 -153.6727 25.6637 -5.988 2.44e-08 ***
## ns(X, df = 16)8 -45.0077 18.6092 -2.419 0.017136 *
## ns(X, df = 16)9 -9.4926 19.9163 -0.477 0.634527
## ns(X, df = 16)10 70.0893 23.5931 2.971 0.003612 **
## ns(X, df = 16)11 18.7594 19.8760 0.944 0.347222
## ns(X, df = 16)12 2.5143 22.0160 0.114 0.909274
## ns(X, df = 16)13 11.6421 19.9557 0.583 0.560758
## ns(X, df = 16)14 -17.1312 20.5426 -0.834 0.406030
## ns(X, df = 16)15 0.6462 34.4200 0.019 0.985054
## ns(X, df = 16)16 11.5622 18.0371 0.641 0.522773
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.59 on 116 degrees of freedom
## Multiple R-squared: 0.808, Adjusted R-squared: 0.7815
## F-statistic: 30.51 on 16 and 116 DF, p-value: < 2.2e-16
AIC(fit2e)
## [1] 1224.456
plot(X ,Y, main= "Natural Spline with df=16", xlab="Times", ylab="Acceleration")
pred2e=predict (fit2e ,newdata=list(X=times.grid),se=T)
lines(times.grid , pred2e$fit ,col="red",lwd=2)
lines(times.grid ,pred2e$fit +2*pred2e$se ,lty="dashed")
lines(times.grid ,pred2e$fit -2*pred2e$se ,lty="dashed")
The natural splines with df=15 and 16 are similar and provides better fit to the data, but are more wiggly.
APPROACH 1: PRE-SPECIFY THE DEGREES OF FREEDOM
To determine lambda, the smoothing parameter, we specified the degree of freedom and let the smooth.spline() function determine which value of λ leads to the pre-specified degree of freedom
fit3a=smooth.spline(X ,Y ,df=6)
newx <- seq(min(fit3a$x), max(fit3a$x), length.out=100)
pred3a <- predict(fit3a, newdata = data.frame(X=newx),se=T,interval = 'confidence')
plot(X ,Y ,cex=.5, xlab="Times", ylab="Acceleration")
title("Smoothing Spline with df=6 ")
#95% CI
res <- (fit3a$yin - fit3a$y)/(1-fit3a$lev) # jackknife residuals
sigma <- sqrt(var(res))
upper <- fit3a$y + 2*sigma*sqrt(fit3a$lev) # upper 95% conf. band
lower <- fit3a$y - 2*sigma*sqrt(fit3a$lev)
lines(fit3a,lwd=2, col="red")
lines(fit3a$x, upper,lty=2)
lines(fit3a$x, lower,lty=2)
fit3b=smooth.spline(X ,Y ,df=16)
plot(X ,Y ,xlim=timelims ,cex=.5, xlab="Times", ylab="Acceleration")
title("Smoothing Spline with df=16")
lines(fit3b ,col="red",lwd=2)
#95% CI
res1 <- (fit3b$yin - fit3b$y)/(1-fit3b$lev) # jackknife residuals
sigma1 <- sqrt(var(res1))
upper1 <- fit3b$y + 2.0*sigma*sqrt(fit3b$lev) # upper 95% conf. band
lower1 <- fit3b$y - 2.0*sigma*sqrt(fit3b$lev)
lines(fit3b$x, upper1,lty=2)
lines(fit3b$x, lower1,lty=2)
APPROACH 2: SELECTION OF THE SMOOTHING PARAMETER USING CROSS VALIDATION
The use of cross validation in selecting the smoothing parameter λ yields 12.74 degrees of freedom and lambda= 9.146889e-05 over 13 iterations.
fit3c=smooth.spline(X ,Y ,cv=TRUE)
fit3c$df
## [1] 12.74136
plot(X ,Y ,xlim=timelims ,cex=.5, xlab="Times", ylab="Acceleration", main ="Smoothing Spline with df determined using Cross Validation")
lines(fit3c ,col="red",lwd=2)
#95% CI
res2 <- (fit3c$yin - fit3c$y)/(1-fit3c$lev) # jackknife residuals
sigma2 <- sqrt(var(res2))
upper2 <- fit3c$y + 2*sigma*sqrt(fit3c$lev) # upper 95% conf. band
lower2 <- fit3c$y - 2*sigma*sqrt(fit3c$lev)
lines(fit3c$x, upper2,lty=2)
lines(fit3c$x, lower2,lty=2)
#Plot all of the regression lines and curves
plot(X ,Y ,xlim=timelims ,cex=.5, xlab="Times", ylab="Acceleration", main ="Smoothing splines with df=6, 12.74, and 16")
lines(fit3a ,col="red",lwd=2)
lines(fit3b ,col="green",lwd=2)
lines(fit3c ,col="blue",lwd=2)
legend ("topright",legend=c("Specified -6 DF","Specified -16 DF","CV-12.74 DF"), col=c("red","green", "blue"),lty=1,lwd=2,cex=.8)
The combined plot shows that the smoothing spline with df=16 and the spline with df=12.74 (using cross validation) provide a good fit for the dataset and are similar. However, the former is more wiggly than the latter, confirming increased flexibility as the degree of freedom increases.
## plot all four models and the data on a single plot
plot(X, Y, xlab="Times", ylab="Acceleration", main ="Fitted Mean Trend") #data
abline(model1, col="blue", lwd = 2) #linear reg
lines(times.grid1 ,preds1$fit ,lwd=2,col="red") #quadratic
lines(times.grid ,preds2$fit ,lwd=2,col="green") #Fifth-degree polynomial
lines(times.grid , pred2d$fit ,col="purple",lwd=2)#Natural Spline- df=15
lines(fit3b ,col="yellow",lwd=2) #Smoothing Spline-lambda determined using cross validation
legend ("bottomright",legend=c("Linear Regression","Quadratic"
,"Fifth-degree polynomial",
"Natural Spline with df=15",
"Smoothing Spline with df by CV")
, col=c("blue","red", "green", "purple","yellow")
,lty=1,lwd=2,cex=.8)
In conclusion, we see that the linear regression does not provide a good fit for the dataset because the relationship between the variables is non linear as shown in the scatterplot. Hence, non-linear regression techniques are explored. The quadratic and the fifth-degree polynomial fits do not do a good job either, but the the natural spline and the smoothing spline with df=12.74 provide the best fit for the dataset. In addition, the smoothing spline is smoother and less wiggly.