Today we will look at some examples of nonlinear modeling. The data to be used are from the wine quality data set from the UCI Machine Learning Repository, https://archive.ics.uci.edu/ml/datasets/Wine+Quality. Specifically, we will examine the relationship between pH and fixed acidity for white wine.
First, a little bit about the data. It was originally cited in the study by P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis in the article “Modeling wine preferences by data mining from physicochemical properties”, published in Decision Support Systems, Elsevier, 47(4):547-553, 2009. The data is in the winequality-white.csv file. The variables noted in the data are as follows:
First, we will read the .csv file into memory as a data object. Please take note of the separator used in the table, as well as its headers.
vino <- read.csv('winequality-white.csv',header=T,sep=';')
dim(vino) #for dimensions in the data object
## [1] 4898 12
names(vino) #variable names for each column of data
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
We can see that the data object is made up of 4989 rows and 12 different parameters by which each wine is evaluated. Each parameter has been given a column name. We will focus on the relationship between two variables, pH and fixed.acidity
First, we’ll look at polynomial regression. in order to determine the degree of polynomial we will use, linear models using successively higher degrees will be made and compared via anova() to determine best fit.
fit.1 <- lm(pH~fixed.acidity,data=vino)
fit.2 <- lm(pH~poly(fixed.acidity,2),data=vino)
fit.3 <- lm(pH~poly(fixed.acidity,3),data=vino)
fit.4 <- lm(pH~poly(fixed.acidity,4),data=vino)
fit.5 <- lm(pH~poly(fixed.acidity,5),data=vino)
fit.6 <- lm(pH~poly(fixed.acidity,6),data=vino)
anova(fit.1,fit.2,fit.3,fit.4,fit.5,fit.6)
## Analysis of Variance Table
##
## Model 1: pH ~ fixed.acidity
## Model 2: pH ~ poly(fixed.acidity, 2)
## Model 3: pH ~ poly(fixed.acidity, 3)
## Model 4: pH ~ poly(fixed.acidity, 4)
## Model 5: pH ~ poly(fixed.acidity, 5)
## Model 6: pH ~ poly(fixed.acidity, 6)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4896 91.408
## 2 4895 90.279 1 1.12916 61.4579 5.512e-15 ***
## 3 4894 90.176 1 0.10292 5.6019 0.0179793 *
## 4 4893 89.939 1 0.23654 12.8741 0.0003364 ***
## 5 4892 89.883 1 0.05660 3.0805 0.0793004 .
## 6 4891 89.862 1 0.02061 1.1219 0.2895576
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In examining the related p-values, we find that going up to degree 4 may provide us the best fit. Please note that similar results can be found by examining the coefficients of the highest degreed model.
coef(summary(fit.6))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.1882666 0.001936777 1646.171453 0.000000e+00
## poly(fixed.acidity, 6)1 -4.4999618 0.135546703 -33.198608 3.901999e-218
## poly(fixed.acidity, 6)2 1.0626198 0.135546703 7.839510 5.511562e-15
## poly(fixed.acidity, 6)3 -0.3208178 0.135546703 -2.366843 1.797927e-02
## poly(fixed.acidity, 6)4 0.4863492 0.135546703 3.588056 3.364132e-04
## poly(fixed.acidity, 6)5 0.2379015 0.135546703 1.755126 7.930044e-02
## poly(fixed.acidity, 6)6 -0.1435721 0.135546703 -1.059208 2.895576e-01
Now, we’ll plot the data as well as the line representing the polynomial model. We will also set the stage for plotting confidence intervals. We begin by setting up the bands that will illustrate the intervals.
# We ascertain the range of valuse for the predicting variable
xlims <- range(vino$fixed.acidity)
# Now, set up a grid of values based upon the range.
xgrid <- seq(from=xlims[1],to=xlims[2])
# Here we predictions, based on the model, using our new grid as new data. Oh, and don't forget the standard errors
pred <- predict(fit.4,newdata=list(fixed.acidity=xgrid),se=T)
# now we create an object with two colums, each will serve as the limits to cover approximately 95%.
se.bands <- cbind(pred$fit+2*pred$se.fit,pred$fit - 2*pred$se.fit)
Now we can plot the points, the fitted line, as well as the bands for standard error.
plot(pH~fixed.acidity,data=vino,xlim=xlims,cex=.5,col='darkgrey',pch=10)
title('Degree - 4 Polynomial')
lines(xgrid,pred$fit,lwd=2,col='blue')
matlines(xgrid,se.bands,lwd=1,col='blue',lty=3)
We can also use regression splines to fit the data using the splines library. As an example, we will fit a regression spline to the data, using defalt cubic splines and six degrees of freedom. It is actually fairly similar to the previous procedure.
# First we call the library
library(splines)
# Fit a regression spline to the data w/ 6 degrees of freedom
fit.sp <- lm(pH~bs(fixed.acidity,df=6),data=vino)
# Create another set of predictions on our grid using the model
pred.sp <- predict(fit.sp,newdata=list(fixed.acidity=xgrid),se=T)
# Now plot the data, fitted line, and standard error bands.
plot(pH~fixed.acidity,data=vino,cex=.5,col='darkgrey',pch=10)
lines(xgrid,pred.sp$fit,lwd=2,col='blue')
lines(xgrid,pred.sp$fit + 2*pred.sp$se.fit,lty='dashed',col='blue')
lines(xgrid,pred.sp$fit - 2*pred.sp$se.fit,lty='dashed',col='blue')
In order to use a smoothing spline on the data, we’ll use the smooth.spline() function. Here, we will use two options. For the first smoothing spline, we will choose the degrees of freedom. R will then figure out the appropriate value of lambda. For the second, we will use cross validation by indicating cv=TRUE. Through cross validation, we then arrive at an appropriate value of lambda.
# For the first, we will use 6 degrees of freedom.
fit.ss1 <- smooth.spline(vino$fixed.acidity,vino$pH,df=6)
# And for the second, we will use cross validation to determine degrees of freedom
fit.ss2 <- smooth.spline(vino$fixed.acidity,vino$pH,cv=TRUE)
## Warning in smooth.spline(vino$fixed.acidity, vino$pH, cv = TRUE):
## cross-validation with non-unique 'x' values seems doubtful
The value of degrees of freedom for the second spline is the following:
fit.ss2$df
## [1] 5.514441
Now both smoothing splines can be plotted on the data.
# First, plot the data
plot(pH~fixed.acidity,data=vino,cex=.5,pch=10,col='darkgrey')
lines(fit.ss1,col='red',lwd=2)
lines(fit.ss2,col='blue',lwd=2)
legend('topright',legend=c('6 DF','5.51 DF'),col=c('red','blue'),lty=1,lwd=2,cex=.8)
title('Smoothing Splines')
Finally, we have an example of using local regression. Here, the span will be different for each model using the loess() function.
# The data is plotted
plot(pH~fixed.acidity,data=vino,pch=10,cex=.5,xlim=xlims,col='darkgrey')
title('Local Regression')
# Model with a span of .2
fit.loc1 <- loess(pH~fixed.acidity,data=vino,span=.2)
# Model with a span of .5
fit.loc2 <- loess(pH~fixed.acidity,data=vino,span=.5)
lines(xgrid,predict(fit.loc1,data.frame(fixed.acidity=xgrid)),col='red',lwd=2)
lines(xgrid,predict(fit.loc2,data.frame(fixed.acidity=xgrid)),col='blue',lwd=2)
legend('topright',legend=c('span = .2','span = .5'),col=c('red','blue'),lty=1,lwd=2,cex=.8)