Ex. 1

For Example 19 on Page 79 in the book (Introduction to Algorithms for Data Mining and Machine Learning, by Xin-She Yang), carry out the regression using R.

x   -0.98    1.00    2.02   3.03   4.00
y    2.44   -1.51   -0.47   2.54   7.52
I always plot the data first to see whether it gives any insight into what type of model might be needed. In this case the relationship is clearly quadratic rather than linear, which also agrees with the textbook.
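A minimal sketch for entering the data as R vectors (the names x and y are what the later code expects) and producing the scatter plot:

# Enter the data from the table above
x <- c(-0.98, 1.00, 2.02, 3.03, 4.00)
y <- c(2.44, -1.51, -0.47, 2.54, 7.52)
# Scatter plot of the raw data
plot(x, y, pch = 16)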

Then I created an x-squared variable and fit the model with the base R lm() function, which gives the following output:
x2 <- x^2
mod1 <- lm(y~(x + x2))
mod1
## 
## Call:
## lm(formula = y ~ (x + x2))
## 
## Coefficients:
## (Intercept)            x           x2  
##      -0.506       -2.026        1.007
summary(mod1)
## 
## Call:
## lm(formula = y ~ (x + x2))
## 
## Residuals:
##        1        2        3        4        5 
## -0.00677  0.01517  0.02141 -0.05586  0.02605 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.50552    0.03178   -15.9  0.00393 ** 
## x           -2.02616    0.02648   -76.5  0.00017 ***
## x2           1.00651    0.00795   126.6 0.000062 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0476 on 2 degrees of freedom
## Multiple R-squared:     1,   Adjusted R-squared:     1 
## F-statistic: 1.09e+04 on 2 and 2 DF,  p-value: 0.0000919
So the R code gives us the equation:

\[ y = -0.5055 - 2.0262 x +1.0065 x^2 \]

This matches the textbook solution exactly.
Finally, I plotted the data along with the model predictions to see how well the model fits the data:
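One way to overlay the fitted quadratic on the data is to predict over a fine grid (a sketch; the grid variable xg is introduced here only for illustration):

# Fine grid of x values for a smooth fitted curve
xg <- seq(min(x), max(x), length.out = 100)
plot(x, y, pch = 16)
lines(xg, predict(mod1, newdata = data.frame(x = xg, x2 = xg^2)))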

Plot the residuals to make sure there are no patterns:

plot(x, mod1$residuals, pch=16)

Ex. 2

Implement the nonlinear curvefitting of Example 20 on Page 83 for the following data:

x   0.10   0.50   1.00   1.50   2.00   2.50
y   0.10   0.28   0.40   0.40   0.37   0.32
Following the same procedure as in Example 1, first plot the data.
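As in Example 1, a minimal sketch for entering the data and plotting it:

# Enter the data from the table above and plot it
x <- c(0.10, 0.50, 1.00, 1.50, 2.00, 2.50)
y <- c(0.10, 0.28, 0.40, 0.40, 0.37, 0.32)
plot(x, y, pch = 16)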

Then fit the data to the model \(y = \frac{x}{a+bx^2}\) using the nls() function in the stats package.
f <- function(x, a, b) x/(a + b*x^2)
mod2 <- nls(y ~ f(x, a, b), data = data.frame(x,y), 
               start = list(a = 1, b = 1), trace = T)
## 0.0297 :  1 1
## 0.002244 :  1.345 1.032
## 0.001215 :  1.474 1.006
## 0.00121 :  1.485 1.002
## 0.00121 :  1.485 1.002
## 0.00121 :  1.485 1.002
mod2
## Nonlinear regression model
##   model: y ~ f(x, a, b)
##    data: data.frame(x, y)
##    a    b 
## 1.49 1.00 
##  residual sum-of-squares: 0.00121
## 
## Number of iterations to convergence: 5 
## Achieved convergence tolerance: 0.00000039
s <- summary(mod2, correlation = TRUE)
s$coefficients
##   Estimate Std. Error t value   Pr(>|t|)
## a    1.485    0.08777   16.92 0.00007146
## b    1.002    0.05019   19.96 0.00003714
s$correlation
##         a       b
## a  1.0000 -0.6813
## b -0.6813  1.0000
So the R code gives us the equation:

\[ y = \frac{x}{1.4854+1.0021x^2} \]

This matches the textbook solution exactly.
Plot the data along with the model predictions to see how well the model fits the data:
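A sketch for overlaying the fitted curve on the data, again predicting over a fine grid (xg is only for illustration):

# Smooth fitted curve over the range of the data
xg <- seq(min(x), max(x), length.out = 100)
plot(x, y, pch = 16)
lines(xg, predict(mod2, newdata = data.frame(x = xg)))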

Plot the residuals to make sure there are no patterns:

plot(x, s$residuals, pch=16)

Ex. 3

For data with binary \(y\) values, try to fit the following data to the nonlinear function \[y = \frac{1}{1+e^{a+bx}}\] starting with \(a=1\) and \(b=1\).

x   0.1   0.5   1.0   1.5   2.0   2.5
y   0     0     1     1     1     0
Plot the data.
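A minimal sketch for entering the binary data and plotting it:

# Enter the data from the table above and plot it
x <- c(0.1, 0.5, 1.0, 1.5, 2.0, 2.5)
y <- c(0, 0, 1, 1, 1, 0)
plot(x, y, pch = 16)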

Using the nls() function in the stats package, as in Exercise 2

Fit the data to the model \(y = \frac{1}{1+e^{a+bx}}\) using the nls() function in the stats package.
f3 <- function(x, a, b) 1/(1 + exp(a + b*x))
mod3 <- nls(y ~ f3(x, a, b), data = data.frame(x,y), 
               start = list(a = 1, b = 1), trace = T)
## 2.634 :  1 1
## 1.257 :   4.001 -8.045
## 1.033 :   5.807 -7.680
## 1.004 :    9.202 -12.219
## 1.001 :   12.34 -16.41
## 1 :   15.39 -20.47
## 1 :   18.41 -24.50
## 1 :   21.41 -28.51
## 1 :   24.42 -32.51
## 1 :   27.42 -36.51
## 1 :   30.42 -40.51
## 1 :   33.42 -44.51
## 1 :   36.42 -48.51
mod3
## Nonlinear regression model
##   model: y ~ f3(x, a, b)
##    data: data.frame(x, y)
##     a     b 
##  36.4 -48.5 
##  residual sum-of-squares: 1
## 
## Number of iterations to convergence: 12 
## Achieved convergence tolerance: 0.00000766
s <- summary(mod3, correlation = TRUE)
s$coefficients
##   Estimate Std. Error    t value Pr(>|t|)
## a    36.42     211276  0.0001724   0.9999
## b   -48.51     261861 -0.0001853   0.9999
s$correlation
##        a      b
## a  1.000 -0.951
## b -0.951  1.000
Plot the data along with the model predictions to see how well the model fits the data:
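A sketch for overlaying the nls() fit on the binary data (xg is only for illustration; with these coefficient estimates the fitted curve is essentially a step function):

# Smooth curve of the nls() predictions
xg <- seq(min(x), max(x), length.out = 200)
plot(x, y, pch = 16)
lines(xg, predict(mod3, newdata = data.frame(x = xg)))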

Using the glm() function for logistic regression

mod3.2 <- glm(y~x, family = binomial(link = "logit"))
mod3.2
## 
## Call:  glm(formula = y ~ x, family = binomial(link = "logit"))
## 
## Coefficients:
## (Intercept)            x  
##      -0.898        0.710  
## 
## Degrees of Freedom: 5 Total (i.e. Null);  4 Residual
## Null Deviance:       8.32 
## Residual Deviance: 7.83  AIC: 11.8
summary(mod3.2)
## 
## Call:
## glm(formula = y ~ x, family = binomial(link = "logit"))
## 
## Deviance Residuals: 
##      1       2       3       4       5       6  
## -0.852  -0.957   1.258   1.107   0.965  -1.565  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   -0.898      1.581   -0.57     0.57
## x              0.710      1.056    0.67     0.50
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8.3178  on 5  degrees of freedom
## Residual deviance: 7.8325  on 4  degrees of freedom
## AIC: 11.83
## 
## Number of Fisher Scoring iterations: 4
Plot the data along with the model predictions to see how well the model fits the data:
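A sketch for overlaying the glm() fit on the binary data; type = "response" converts the linear predictor back to probabilities:

# Smooth curve of the predicted probabilities from the logistic regression
xg <- seq(min(x), max(x), length.out = 200)
plot(x, y, pch = 16)
lines(xg, predict(mod3.2, newdata = data.frame(x = xg), type = "response"))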

It looks like the first solution, using the nls() function and the Gauss-Newton method, resulted in better predictions on this data.