Linear and Non-Linear Regression Analysis of Motorcycle Data

Exploratory Data Analysis

library(MASS)
data('mcycle')
head(mcycle)
##   times accel
## 1   2.4   0.0
## 2   2.6  -1.3
## 3   3.2  -2.7
## 4   3.6   0.0
## 5   4.0  -2.7
## 6   6.2  -2.7
dim(mcycle) 
## [1] 133   2
str(mcycle)
## 'data.frame':    133 obs. of  2 variables:
##  $ times: num  2.4 2.6 3.2 3.6 4 6.2 6.6 6.8 7.8 8.2 ...
##  $ accel: num  0 -1.3 -2.7 0 -2.7 -2.7 -2.7 -1.3 -2.7 -2.7 ...
summary(mcycle)
##      times           accel        
##  Min.   : 2.40   Min.   :-134.00  
##  1st Qu.:15.60   1st Qu.: -54.90  
##  Median :23.40   Median : -13.30  
##  Mean   :25.18   Mean   : -25.55  
##  3rd Qu.:34.80   3rd Qu.:   0.00  
##  Max.   :57.60   Max.   :  75.00

The motorcycle data set has 133 observations, with two numeric column variables: Times and Acceleration

# Rename the variables for ease of usage
Y <- mcycle$accel
X <- mcycle$times

#Scatterplot
plot(Y~X, xlab="time",ylab="Acceleration", main="Scatterplot of Acceleration against Time")

#Histogram
par(mfrow=c(1,2))
hist(mcycle$time, prob=T, col="grey", breaks=20, main="Histogram and Density of Time", xlim=c(2,58), xlab="Time")
lines(density(mcycle$time), col="red", lwd=2)

# Add a vertical line that indicates the average of Time
abline(v=mean(mcycle$time), col="blue", lty=2, lwd=2)

#Acceleration
hist(mcycle$accel, prob=T, col="blue", breaks=20, main="Histogram and Density of Acceleration", xlim=c(-140,78), xlab="Acceleration")
lines(density(mcycle$accel), col="red", lwd=2)

# Add a vertical line that indicates the average of Sepal Length
abline(v=mean(mcycle$accel), col="black", lty=2, lwd=2)

# Individual Box plot
boxplot(mcycle$times, notch=T, col="grey", main="Boxplot of Times")
boxplot(mcycle$accel, notch=T, col= "blue", main="Boxplot of Acceleration")

The scatterplot suggests non-linearity in the relationship between the variables. The histogram reveals that both variables aren’t normally distributed. However, there are no outliers in either variables as shown in the box plot.

Linear Regression

We seek to create a model that fits the data set. First, we regress Acceleration on Time to investigate the impact of times on acceleration and how much variation in acceleration is explained by times. X is the time since the motorcycle crash and Y is the acceleration of the driver’s head.

cor(mcycle$times,mcycle$accel)
## [1] 0.2964033
Y <- mcycle$accel
X <- mcycle$times
model1 <- lm(Y~X, data= mcycle)
model1_summary <- summary(model1)
model1_summary 
## 
## Call:
## lm(formula = Y ~ X, data = mcycle)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -104.114  -25.926    4.582   36.163   94.197 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -53.008      8.713  -6.084  1.2e-08 ***
## X              1.091      0.307   3.552 0.000532 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 46.33 on 131 degrees of freedom
## Multiple R-squared:  0.08785,    Adjusted R-squared:  0.08089 
## F-statistic: 12.62 on 1 and 131 DF,  p-value: 0.0005318

Assessing the Overall Accuracy of the Model

#MSE
(model1_summary$sigma)^2
## [1] 2146.136
model1_summary$r.squared 
## [1] 0.08785493
model1_summary$adj.r.squared
## [1] 0.08089199
AIC(model1)
## [1] 1401.722
BIC(model1)
## [1] 1410.393

Fitted Plot and Confidence Interval

#Plot predicted interval and values of linear regression
newx = seq(min(X),max(X),by = 0.05)
conf_interval <- predict(model1, newdata=data.frame(X=newx), interval="confidence",
                         level = 0.95)
plot(X, Y, xlab="Times", ylab="Acceleration", main="Linear Regression")
abline(model1, col="blue", lwd = 2)
lines(newx, conf_interval[,2], col="red", lwd =1)
lines(newx, conf_interval[,3], col="red", lwd=1) 

Although, the p-value of the F-test shows that the model is significant, the adjusted R-square is very low indicating that very small variation in Y (acceleration) is explained by X (times). The plot of the estimated line shows that the linear regression model doesn’t do a good job with fitting the dataset.This is based on the lack of linearity between the variables as shown in the scatter plot.

Diagnostic Plots

#Diagnostic plots
par(mfrow=c(2,2))
plot(model1)

On the basis of the residual plots, there is some evidence of non-linearity.

Quadratic

In this model, we have two predictor variable terms: X and X^2

model2 <- lm(Y~X+I(X^2), data=mcycle) 
summary(model2)
## 
## Call:
## lm(formula = Y ~ X + I(X^2), data = mcycle)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -96.527 -30.817   9.589  29.210 104.728 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -15.22700   15.49471  -0.983  0.32757   
## X            -2.30749    1.20439  -1.916  0.05757 . 
## I(X^2)        0.05935    0.02038   2.912  0.00422 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 45.06 on 130 degrees of freedom
## Multiple R-squared:  0.1437, Adjusted R-squared:  0.1306 
## F-statistic: 10.91 on 2 and 130 DF,  p-value: 4.167e-05

The near-zero p-value associated with the quadratic term suggests that it leads to an improved model. We use the anova() function to further quantify the extent to which the quadratic fit is superior to the linear fit

anova(model1 ,model2)
## Analysis of Variance Table
## 
## Model 1: Y ~ X
## Model 2: Y ~ X + I(X^2)
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1    131 281144                                
## 2    130 263923  1     17221 8.4823 0.004222 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The anova() function performs a hypothesis test comparing the linear regression model and the quadratic model. The null hypothesis is that the two models fit the data equally well, and the alternative hypothesis is that the quadratric model is superior. Here the F-statistic is 8.48 and the associated p-value is less than 0.05. This provides very clear evidence that the quadratic model superior to the linear model that only contains the predictor X. This is not surprising, since earlier we saw evidence for non-linearity in the relationship between Y and X.

Assessing the Overall Accuracy of the Model

model2_summary <- summary(model2)
(model2_summary$sigma)^2
## [1] 2030.179
model2_summary$r.squared 
## [1] 0.1437254
model2_summary$adj.r.squared
## [1] 0.130552
AIC(model2)
## [1] 1395.315
BIC(model2)
## [1] 1406.877

Fitted Plot and Confidence Interval

#plot prediction and conf.interval
timelims <- range(X)
times.grid1=seq(from=timelims [1],to=timelims [2]) 
preds1=predict (model2 ,newdata =list(X=times.grid1),se=TRUE) 
se.bands1=cbind(preds1$fit +2* preds1$se.fit ,preds1$fit -2* preds1$se.fit)

#Finally, we plot the data and add the fit from the degree-2 polynomial.
plot(X ,Y ,xlab="Times", main = "Quadratic",ylab="Acceleration", xlim=timelims ,cex=.5) 
lines(times.grid1 ,preds1$fit ,lwd=2,col="blue") 
matlines (times.grid1 ,se.bands1,lwd=1,col="red",lty=3)

Although, this model is superior to the linear model, with lower AIC and higher adjusted R-squared value. The plot of the estimated line shows that the quadratic model doesn’t do a good job with fitting the dataset.

Diagnostic Plots

#Diagnostic plots
par(mfrow=c(2,2)) 
plot(model2)

We see that when the X^2 term is included in the model, there is little discernible pattern in the residuals compared to the linear regression model with only X.

Fifth-degree Polynomial

In this model, we have five predictor variable terms: X, X^2, X^3, X^4, and X^5.

model3 <- lm(Y~poly(X,5,raw=T),data=mcycle) 
summary(model3)
## 
## Call:
## lm(formula = Y ~ poly(X, 5, raw = T), data = mcycle)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -77.271 -21.285   0.975  25.386  82.371 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -1.059e+02  3.499e+01  -3.026    0.003 ** 
## poly(X, 5, raw = T)1  4.978e+01  1.036e+01   4.804 4.31e-06 ***
## poly(X, 5, raw = T)2 -6.359e+00  1.015e+00  -6.266 5.30e-09 ***
## poly(X, 5, raw = T)3  2.969e-01  4.253e-02   6.982 1.45e-10 ***
## poly(X, 5, raw = T)4 -5.742e-03  7.932e-04  -7.239 3.83e-11 ***
## poly(X, 5, raw = T)5  3.919e-05  5.406e-06   7.250 3.60e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33.9 on 127 degrees of freedom
## Multiple R-squared:  0.5264, Adjusted R-squared:  0.5078 
## F-statistic: 28.24 on 5 and 127 DF,  p-value: < 2.2e-16

Assessing the Overall Accuracy of the Model

model3_summary <- summary(model3)
(model3_summary$sigma)^2 
## [1] 1149.305
model3_summary$r.squared
## [1] 0.5264409
model3_summary$adj.r.squared
## [1] 0.5077968
AIC(model3)
## [1] 1322.537
BIC(model3)
## [1] 1342.77

Fitted Plot and Confidence Interval

times.grid=seq(from=timelims [1],to=timelims [2]) 
preds2=predict (model3 ,newdata =list(X=times.grid),se=TRUE) 
se.bands2=cbind(preds2$fit +2* preds2$se.fit ,preds2$fit -2* preds2$se.fit)

plot(X,Y ,xlab="Times", main = "Degree -5 Polynomial",ylab="Acceleration", xlim=timelims ,cex=.5) 
lines(times.grid ,preds2$fit ,lwd=2,col="blue") 
matlines (times.grid ,se.bands2,lwd=1,col="red",lty=1)

This model provides a better fit compared to the previous models. The adjusted R-squared value is higher. The model also has a lower AIC and BIC value. However, we see that a better model that provides better fitting could be created. Hence, we explore the use of splines.

Splines

Regression Splines

First, we create regression splines by specifying a set of knots, producing a sequence of basis functions, and then using least squares to estimate the spline coefficients. The bs() function generates the entire matrix of basis functions for splines with the specified set of knots. By default, cubic splines are produced.This produces a spline with six basis functions. Recall that a cubic spline with three knots has seven degrees of freedom; these degrees of freedom are used up by an intercept, plus six basis functions.)

library (splines)
#we have prespecified knots at times 15.6,23.4, and 57.6 which are the 1st, 2nd, and 3rd quartiles.
summary(X)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.40   15.60   23.40   25.18   34.80   57.60
fit_sp=lm(Y~bs(X,knots=c(15.6,23.4,57.6)),data=mcycle) 
summary(fit_sp)
## 
## Call:
## lm(formula = Y ~ bs(X, knots = c(15.6, 23.4, 57.6)), data = mcycle)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -78.33 -27.41  -0.42  29.55  87.32 
## 
## Coefficients: (1 not defined because of singularities)
##                                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)                          -29.548     20.073  -1.472 0.143495
## bs(X, knots = c(15.6, 23.4, 57.6))1  132.892     37.896   3.507 0.000628
## bs(X, knots = c(15.6, 23.4, 57.6))2  -75.847     21.462  -3.534 0.000572
## bs(X, knots = c(15.6, 23.4, 57.6))3   31.795     30.790   1.033 0.303739
## bs(X, knots = c(15.6, 23.4, 57.6))4   86.612     32.941   2.629 0.009611
## bs(X, knots = c(15.6, 23.4, 57.6))5    2.278     29.709   0.077 0.939009
## bs(X, knots = c(15.6, 23.4, 57.6))6       NA         NA      NA       NA
##                                        
## (Intercept)                            
## bs(X, knots = c(15.6, 23.4, 57.6))1 ***
## bs(X, knots = c(15.6, 23.4, 57.6))2 ***
## bs(X, knots = c(15.6, 23.4, 57.6))3    
## bs(X, knots = c(15.6, 23.4, 57.6))4 ** 
## bs(X, knots = c(15.6, 23.4, 57.6))5    
## bs(X, knots = c(15.6, 23.4, 57.6))6    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36.49 on 127 degrees of freedom
## Multiple R-squared:  0.4514, Adjusted R-squared:  0.4298 
## F-statistic:  20.9 on 5 and 127 DF,  p-value: 3.378e-15
AIC(fit_sp)
## [1] 1342.094
pred_sp=predict(fit_sp,newdata =list(X=times.grid),se=T) 
plot(X ,Y, xlab="Times", ylab="Acceleration") 
lines(times.grid ,pred_sp$fit ,lwd=2) #regression spline
lines(times.grid ,pred_sp$fit +2*pred_sp$se ,lty="dashed", col="red") 
lines(times.grid ,pred_sp$fit -2*pred_sp$se ,lty="dashed",col="red")

Natural Spline

Here the degree of freedom is pre-specified and different numbers are used to see the the best curve that fits the data.

#First: Natural Spline- pre-specified no of knots=4
fit2=lm(Y~ns(X,df=4),data=mcycle) 
summary(fit2)
## 
## Call:
## lm(formula = Y ~ ns(X, df = 4), data = mcycle)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -70.102 -26.688   0.452  24.329  90.446 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       22.43      14.60   1.536   0.1269    
## ns(X, df = 4)1  -110.50      16.47  -6.708 5.72e-10 ***
## ns(X, df = 4)2    82.80      16.76   4.939 2.40e-06 ***
## ns(X, df = 4)3   -69.12      33.95  -2.036   0.0438 *  
## ns(X, df = 4)4   -30.71      18.33  -1.676   0.0962 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36.06 on 128 degrees of freedom
## Multiple R-squared:   0.46,  Adjusted R-squared:  0.4431 
## F-statistic: 27.26 on 4 and 128 DF,  p-value: 2.278e-16
AIC(fit2)
## [1] 1338.004
plot(X ,Y,main= "Natural Spline with df=4", xlab="Times", ylab="Acceleration") 
pred2=predict (fit2 ,newdata=list(X=times.grid),se=T) 
lines(times.grid , pred2$fit ,col="red",lwd=2)
lines(times.grid ,pred2$fit +2*pred2$se ,lty="dashed") 
lines(times.grid ,pred2$fit -2*pred2$se ,lty="dashed")

fit2b=lm(Y~ns(X,df=6),data=mcycle) 
summary(fit2b)
## 
## Call:
## lm(formula = Y ~ ns(X, df = 6), data = mcycle)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -88.112 -13.897   0.906  15.380  50.787 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -15.6605    11.0149  -1.422  0.15757    
## ns(X, df = 6)1  -75.8861    11.6492  -6.514 1.58e-09 ***
## ns(X, df = 6)2 -131.1411    16.1063  -8.142 3.23e-13 ***
## ns(X, df = 6)3  102.9920    14.8782   6.922 2.01e-10 ***
## ns(X, df = 6)4  -42.1524    13.6411  -3.090  0.00246 ** 
## ns(X, df = 6)5   57.7834    27.9160   2.070  0.04050 *  
## ns(X, df = 6)6    0.5345    13.6277   0.039  0.96877    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.75 on 126 degrees of freedom
## Multiple R-squared:  0.7695, Adjusted R-squared:  0.7585 
## F-statistic:  70.1 on 6 and 126 DF,  p-value: < 2.2e-16
AIC(fit2b)
## [1] 1228.784
plot(X ,Y, main= "Natural Spline with df=6",xlab="Times", ylab="Acceleration") 
pred2b=predict (fit2b ,newdata=list(X=times.grid),se=T) 
lines(times.grid , pred2b$fit ,col="red",lwd=2)
lines(times.grid ,pred2b$fit +2*pred2b$se ,lty="dashed") 
lines(times.grid ,pred2b$fit -2*pred2b$se ,lty="dashed")

We see that the spline with df=6 fits the data better than the spline with df=4.

fit2c=lm(Y~ns(X,df=10),data=mcycle) 
summary(fit2c)
## 
## Call:
## lm(formula = Y ~ ns(X, df = 10), data = mcycle)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -77.456 -12.715   0.132  11.165  50.647 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.423     11.578   0.123 0.902383    
## ns(X, df = 10)1    17.352     16.196   1.071 0.286103    
## ns(X, df = 10)2   -61.409     15.352  -4.000 0.000109 ***
## ns(X, df = 10)3  -115.554     18.547  -6.230 6.88e-09 ***
## ns(X, df = 10)4  -141.969     19.734  -7.194 5.55e-11 ***
## ns(X, df = 10)5    -5.013     16.948  -0.296 0.767883    
## ns(X, df = 10)6    56.867     18.629   3.053 0.002785 ** 
## ns(X, df = 10)7    -1.645     17.276  -0.095 0.924283    
## ns(X, df = 10)8     2.790     16.126   0.173 0.862925    
## ns(X, df = 10)9   -14.362     30.125  -0.477 0.634402    
## ns(X, df = 10)10    7.326     14.885   0.492 0.623459    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.45 on 122 degrees of freedom
## Multiple R-squared:  0.8005, Adjusted R-squared:  0.7841 
## F-statistic: 48.94 on 10 and 122 DF,  p-value: < 2.2e-16
AIC(fit2c)
## [1] 1217.585
plot(X ,Y , main= "Natural Spline with df=10", xlab="Times", ylab="Acceleration") 
pred2c=predict (fit2c ,newdata=list(X=times.grid),se=T) 
lines(times.grid , pred2c$fit ,col="red",lwd=2)
lines(times.grid ,pred2c$fit +2*pred2c$se ,lty="dashed") 
lines(times.grid ,pred2c$fit -2*pred2c$se ,lty="dashed")

fit2d=lm(Y~ns(X,df=15),data=mcycle) 
summary(fit2d)
## 
## Call:
## lm(formula = Y ~ ns(X, df = 15), data = mcycle)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -74.353 -10.776   0.301  10.737  53.355 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -0.5775    12.5065  -0.046 0.963250    
## ns(X, df = 15)1     0.5785    20.9962   0.028 0.978065    
## ns(X, df = 15)2    -5.3747    19.6804  -0.273 0.785261    
## ns(X, df = 15)3   -29.4141    16.6694  -1.765 0.080248 .  
## ns(X, df = 15)4   -88.6391    19.3244  -4.587 1.14e-05 ***
## ns(X, df = 15)5   -83.9127    22.4445  -3.739 0.000288 ***
## ns(X, df = 15)6  -149.8495    25.9023  -5.785 6.16e-08 ***
## ns(X, df = 15)7   -86.6107    20.9990  -4.125 6.98e-05 ***
## ns(X, df = 15)8   -30.2330    18.3279  -1.650 0.101716    
## ns(X, df = 15)9    45.5014    24.3217   1.871 0.063869 .  
## ns(X, df = 15)10   47.2031    21.9966   2.146 0.033944 *  
## ns(X, df = 15)11   -5.1229    20.8652  -0.246 0.806480    
## ns(X, df = 15)12   15.3713    19.7548   0.778 0.438079    
## ns(X, df = 15)13  -17.3563    19.9187  -0.871 0.385345    
## ns(X, df = 15)14   -4.0118    34.6559  -0.116 0.908041    
## ns(X, df = 15)15   13.1690    17.8760   0.737 0.462786    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.62 on 117 degrees of freedom
## Multiple R-squared:  0.8059, Adjusted R-squared:  0.781 
## F-statistic: 32.38 on 15 and 117 DF,  p-value: < 2.2e-16
AIC(fit2d)
## [1] 1223.947
plot(X ,Y, main= "Natural Spline with df=15", xlab="Times", ylab="Acceleration") 
pred2d=predict (fit2d ,newdata=list(X=times.grid),se=T) 
lines(times.grid , pred2d$fit ,col="red",lwd=2)
lines(times.grid ,pred2d$fit +2*pred2d$se ,lty="dashed") 
lines(times.grid ,pred2d$fit -2*pred2d$se ,lty="dashed")

fit2e=lm(Y~ns(X,df=16),data=mcycle)
summary(fit2e)
## 
## Call:
## lm(formula = Y ~ ns(X, df = 16), data = mcycle)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -72.757 -11.659   0.086  10.667  53.911 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -1.3708    12.6481  -0.108 0.913882    
## ns(X, df = 16)1    -4.8791    20.0990  -0.243 0.808626    
## ns(X, df = 16)2    10.7585    23.4960   0.458 0.647890    
## ns(X, df = 16)3   -31.6368    17.2242  -1.837 0.068804 .  
## ns(X, df = 16)4   -54.3207    18.9762  -2.863 0.004987 ** 
## ns(X, df = 16)5  -101.8265    20.2903  -5.018 1.90e-06 ***
## ns(X, df = 16)6   -93.3062    24.8924  -3.748 0.000279 ***
## ns(X, df = 16)7  -153.6727    25.6637  -5.988 2.44e-08 ***
## ns(X, df = 16)8   -45.0077    18.6092  -2.419 0.017136 *  
## ns(X, df = 16)9    -9.4926    19.9163  -0.477 0.634527    
## ns(X, df = 16)10   70.0893    23.5931   2.971 0.003612 ** 
## ns(X, df = 16)11   18.7594    19.8760   0.944 0.347222    
## ns(X, df = 16)12    2.5143    22.0160   0.114 0.909274    
## ns(X, df = 16)13   11.6421    19.9557   0.583 0.560758    
## ns(X, df = 16)14  -17.1312    20.5426  -0.834 0.406030    
## ns(X, df = 16)15    0.6462    34.4200   0.019 0.985054    
## ns(X, df = 16)16   11.5622    18.0371   0.641 0.522773    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.59 on 116 degrees of freedom
## Multiple R-squared:  0.808,  Adjusted R-squared:  0.7815 
## F-statistic: 30.51 on 16 and 116 DF,  p-value: < 2.2e-16
AIC(fit2e)
## [1] 1224.456
plot(X ,Y, main= "Natural Spline with df=16", xlab="Times", ylab="Acceleration") 
pred2e=predict (fit2e ,newdata=list(X=times.grid),se=T) 
lines(times.grid , pred2e$fit ,col="red",lwd=2)
lines(times.grid ,pred2e$fit +2*pred2e$se ,lty="dashed") 
lines(times.grid ,pred2e$fit -2*pred2e$se ,lty="dashed")

The natural splines with df=15 and 16 are similar and provides better fit to the data, but are more wiggly.

Smoothing Spline

APPROACH 1: PRE-SPECIFY THE DEGREES OF FREEDOM

To determine lambda, the smoothing parameter, we specified the degree of freedom and let the smooth.spline() function determine which value of λ leads to the pre-specified degree of freedom

fit3a=smooth.spline(X ,Y ,df=6)
newx <- seq(min(fit3a$x), max(fit3a$x), length.out=100)
pred3a <- predict(fit3a, newdata = data.frame(X=newx),se=T,interval = 'confidence')
plot(X ,Y ,cex=.5, xlab="Times", ylab="Acceleration")
title("Smoothing Spline with df=6 ")

#95% CI
res <- (fit3a$yin - fit3a$y)/(1-fit3a$lev)      # jackknife residuals
sigma <- sqrt(var(res))
upper <- fit3a$y + 2*sigma*sqrt(fit3a$lev)   # upper 95% conf. band
lower <- fit3a$y - 2*sigma*sqrt(fit3a$lev)
lines(fit3a,lwd=2, col="red")
lines(fit3a$x, upper,lty=2)
lines(fit3a$x, lower,lty=2)

fit3b=smooth.spline(X ,Y ,df=16) 
plot(X ,Y ,xlim=timelims ,cex=.5, xlab="Times", ylab="Acceleration")
title("Smoothing Spline with df=16") 
lines(fit3b ,col="red",lwd=2)

#95% CI
res1 <- (fit3b$yin - fit3b$y)/(1-fit3b$lev)      # jackknife residuals
sigma1 <- sqrt(var(res1))
upper1 <- fit3b$y + 2.0*sigma*sqrt(fit3b$lev)   # upper 95% conf. band
lower1 <- fit3b$y - 2.0*sigma*sqrt(fit3b$lev)
lines(fit3b$x, upper1,lty=2)
lines(fit3b$x, lower1,lty=2)

APPROACH 2: SELECTION OF THE SMOOTHING PARAMETER USING CROSS VALIDATION

The use of cross validation in selecting the smoothing parameter λ yields 12.74 degrees of freedom and lambda= 9.146889e-05 over 13 iterations.

fit3c=smooth.spline(X ,Y ,cv=TRUE) 
fit3c$df
## [1] 12.74136
plot(X ,Y ,xlim=timelims ,cex=.5, xlab="Times", ylab="Acceleration", main ="Smoothing Spline with df determined using Cross Validation")
lines(fit3c ,col="red",lwd=2)

#95% CI
res2 <- (fit3c$yin - fit3c$y)/(1-fit3c$lev)      # jackknife residuals
sigma2 <- sqrt(var(res2))
upper2 <- fit3c$y + 2*sigma*sqrt(fit3c$lev)   # upper 95% conf. band
lower2 <- fit3c$y - 2*sigma*sqrt(fit3c$lev)
lines(fit3c$x, upper2,lty=2)
lines(fit3c$x, lower2,lty=2)

#Plot all of the regression lines and curves
plot(X ,Y ,xlim=timelims ,cex=.5, xlab="Times", ylab="Acceleration", main ="Smoothing splines with df=6, 12.74, and 16")
lines(fit3a ,col="red",lwd=2)
lines(fit3b ,col="green",lwd=2)
lines(fit3c ,col="blue",lwd=2) 
legend ("topright",legend=c("Specified -6 DF","Specified -16 DF","CV-12.74 DF"), col=c("red","green", "blue"),lty=1,lwd=2,cex=.8)

The combined plot shows that the smoothing spline with df=16 and the spline with df=12.74 (using cross validation) provide a good fit for the dataset and are similar. However, the former is more wiggly than the latter, confirming increased flexibility as the degree of freedom increases.

Model Comparison

## plot all four models and the data on a single plot
plot(X, Y, xlab="Times", ylab="Acceleration", main ="Fitted Mean Trend") #data
abline(model1, col="blue", lwd = 2) #linear reg
lines(times.grid1 ,preds1$fit ,lwd=2,col="red") #quadratic
lines(times.grid ,preds2$fit ,lwd=2,col="green")  #Fifth-degree polynomial
lines(times.grid , pred2d$fit ,col="purple",lwd=2)#Natural Spline- df=15
lines(fit3b ,col="yellow",lwd=2) #Smoothing Spline-lambda determined using cross validation

legend ("bottomright",legend=c("Linear Regression","Quadratic"
                            ,"Fifth-degree polynomial",
                            "Natural Spline with df=15",
                            "Smoothing Spline with df by CV")
        , col=c("blue","red", "green", "purple","yellow")
        ,lty=1,lwd=2,cex=.8)

In conclusion, we see that the linear regression does not provide a good fit for the dataset because the relationship between the variables is non linear as shown in the scatterplot. Hence, non-linear regression techniques are explored. The quadratic and the fifth-degree polynomial fits do not do a good job either, but the the natural spline and the smoothing spline with df=12.74 provide the best fit for the dataset. In addition, the smoothing spline is smoother and less wiggly.