Page 113: #2

The following table gives the elongation \(e\) in inches per inch (in./in.) for a given stress \(S\) on a steel wire measured in pounds per square inch (\(lb/in^2\)). Test the model \(e = c_1S\) by plotting the data. Estimate \(c_1\) graphically.

library(ggplot2)  # plotting
library(dplyr)    # provides the pipe operator %>% used later

S <- c(5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100)
e <- c(0, 19, 57, 94, 134, 173, 216, 256, 297, 343, 390)

data <- data.frame(S, e)

ggplot(data, aes(x=S, y=e)) + geom_point() + ggtitle('Scatterplot - Elongation vs. Stress') + labs(x = "Stress", y = "Elongation")

Estimate \(c_1\) graphically.

Estimating graphically with a visual best-fit line through the origin, \(c_1\) appears to be approximately 3.5. This gives the equation:

\[e = 3.5S\]
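
As a quick cross-check of the graphical estimate (a sketch, not required by the problem), the slope that minimizes the sum of squared deviations for the proportionality model can be computed directly:

# Least-squares slope for a line through the origin: c1 = sum(S*e) / sum(S^2)
sum(S * e) / sum(S^2)  # roughly 3.7 for these data, in the same range as the visual estimate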

Page 121: #2.a

For each of the following data sets, formulate the mathematical model that minimizes the largest deviation between the data and the line \(y = ax + b\). If a computer is available, solve for the estimates of \(a\) and \(b\).

data <- data.frame(x = c(1.0, 2.3, 3.7, 4.2, 6.1, 7.0),
                   y = c(3.6, 3.0, 3.2, 5.1, 5.3, 6.8))

ggplot(data, aes(x=x, y=y)) + geom_point() +  labs(x = "x", y = "y")
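
The model the problem statement asks for, minimizing the largest deviation from the line \(y = ax + b\), can be written in the standard Chebyshev (minimax) form as a linear program in \(a\), \(b\), and the largest residual \(r\):

\[\text{Minimize } r \quad \text{subject to} \quad r - (y_i - a x_i - b) \geq 0, \quad r + (y_i - a x_i - b) \geq 0, \quad i = 1, \dots, 6\]

Solving this program gives the Chebyshev estimates of \(a\) and \(b\); the fit below uses the least-squares criterion as a more convenient approximation.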

Using least-squares approximation:

model <- lm(y ~ x, data = data)

summary(model)
## 
## Call:
## lm(formula = y ~ x, data = data)
## 
## Residuals:
##       1       2       3       4       5       6 
##  0.8209 -0.5126 -1.1025  0.5154 -0.3567  0.6355 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   2.2149     0.7737   2.863   0.0458 *
## x             0.5642     0.1703   3.313   0.0296 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8586 on 4 degrees of freedom
## Multiple R-squared:  0.7329, Adjusted R-squared:  0.6661 
## F-statistic: 10.98 on 1 and 4 DF,  p-value: 0.02957

From the regression output, \(a = 0.5642\) and \(b = 2.2149\), giving

\[y = 0.5642x + 2.2149\]

The largest deviation \(d_{max}\) is 1.1025182.
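
The largest deviation can also be read directly from the residuals of the fitted model:

# Largest absolute deviation between the data and the least-squares line
max(abs(model$residuals))
## [1] 1.102518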

ggplot(data, aes(x)) + 
  geom_point(aes(y = y, colour = "data")) + 
  geom_line(aes(y = model$fit, colour = "fit")) + 
  ggtitle("Comparison - Data vs. Least-Squares Fit")

Page 127: #10

Data for planets

data <- data.frame(body = c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune"),
                   period = c(7.6e6, 1.94e7, 3.16e7, 5.94e7, 3.74e8, 9.35e8, 2.64e9, 5.22e9),
                   distance_from_sun = c(5.79e10, 1.08e11, 1.5e11, 2.28e11, 7.79e11, 1.43e12, 2.87e12, 4.5e12))
  
data
##      body   period distance_from_sun
## 1 Mercury 7.60e+06          5.79e+10
## 2   Venus 1.94e+07          1.08e+11
## 3   Earth 3.16e+07          1.50e+11
## 4    Mars 5.94e+07          2.28e+11
## 5 Jupiter 3.74e+08          7.79e+11
## 6  Saturn 9.35e+08          1.43e+12
## 7  Uranus 2.64e+09          2.87e+12
## 8 Neptune 5.22e+09          4.50e+12

Fit the model \(y = ax^{3/2}\), taking \(x\) to be the period and \(y\) the distance from the sun, as in the code below.

Using the least-squares criterion to fit a curve of the form \(y = ax^n\) gives the general form for the optimal coefficient:

\[a = \frac{\sum x_i^{n} y_i}{\sum x_i^{2n}}\]
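
This expression comes from minimizing the sum of squared deviations with respect to \(a\): setting the derivative to zero,

\[\frac{d}{da}\sum_i \left(y_i - a x_i^{n}\right)^2 = -2\sum_i x_i^{n}\left(y_i - a x_i^{n}\right) = 0 \quad\Rightarrow\quad a = \frac{\sum x_i^{n} y_i}{\sum x_i^{2n}}\]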

Substituting the exponent \(n = 3/2\):

\[a=\frac{\sum x_i^{\frac{3}{2}} y_i}{\sum x_i^{3}}\]

Compute \(\sum x_i^{\frac{3}{2}} y_i\):

numerator <- sum(data$period^{3/2} * data$distance_from_sun)

Compute \(\sum x_i^{3}\):

denominator <- sum(data$period^3)

a <- numerator/denominator
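
Printing a gives the fitted coefficient:

a
## [1] 0.01320756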

\[a = 0.01320756 \]

This gives the least-squares approximate model:

\[y = 0.01320756x^{3/2}\]

Plot the actual vs. modeled values:

ggplot(data, aes(x = period, y = distance_from_sun)) + 
  geom_point() + 
  geom_line(aes(x = period, y = (a * period^{3/2})), color = "blue") + 
  labs(x = "Period (sec)", y = "Distance from sun (m)")

Page 136: #7

a. In the following data, \(W\) represents the weight of a fish (bass) and \(l\) represents its length. Fit the model

\[W = kl^3\]

to the data using the least-squares criterion.

data1 <- data.frame(length = c(14.5, 12.5, 17.25, 14.5, 12.625, 17.75, 14.125, 12.625),
                    weight = c(27, 17, 41, 26, 17, 49, 23, 16))

\[k = \frac{\sum l_i^{3} W_i}{\sum l_i^{6}}\]

k1 <- sum(data1$length^3 * data1$weight)/sum(data1$length^6)
k1
## [1] 0.008436761

This gives the least-squares approximate model:

\[W = 0.008436761l^3\]
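
As a quick sanity check (a sketch beyond the original solution), the model's prediction for the first fish in the data:

# Predicted weight for a 14.5 in. bass: roughly 25.7 oz, versus the observed 27 oz
k1 * 14.5^3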

data1$modeled   <- with(data1, k1 * length^3)
data1$abs_error <- with(data1, abs(weight - modeled))
data1$abs_error_squared <- with(data1, abs(weight - modeled)^2)

Calculate \(D\) and the largest absolute deviation:

D <- (sum(data1$abs_error_squared)/8)^{1/2}
(max(data1$abs_error))
## [1] 2.305497

\[D = 1.233306 \leq c_{max} \leq 2.305497 = d_{max}\]

where \(c_{max}\) denotes the largest absolute deviation of the best Chebyshev (minimax) fit, which is bounded between \(D\) and \(d_{max}\).

ggplot(data1, aes(x = length, y = weight)) + 
  geom_point() + 
  geom_line(aes(x = length, y = modeled), color = "red") + 
  labs(x = "Length (in)", y = "Weight (oz)")

b. In the following data, \(g\) represents the girth of a fish. Fit the model

\[W = klg^2\]

to the data using the least-squares criterion.

data2 <- data.frame(length = c(14.5, 12.5, 17.25, 14.5, 12.625, 17.75, 14.125, 12.625),
                    girth  = c(9.75, 8.375, 11.0, 9.75, 8.5, 12.5, 9.0, 8.5),
                    weight = c(27, 17, 41, 26, 17, 49, 23, 16))

Applying the least-squares criterion we get:

\[k=\frac{\sum W_i l_i g_i^2}{\sum (l_i g_i^2)^2}\]

Calculate \(k\):

k2 <- sum(data2$weight * data2$length * data2$girth^2)/sum((data2$length * data2$girth^2)^2)
k2
## [1] 0.01867511

This gives the least-squares approximate model:

\[W = 0.01867511lg^2\]
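
The corresponding check for model 2, using the same 14.5-inch fish (girth 9.75 in.):

# Predicted weight: about 25.7 oz, versus the observed 27 oz
k2 * 14.5 * 9.75^2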

data2$modeled   <- with(data2, k2 * length * girth^2)
data2$abs_error <- with(data2, abs(weight - modeled))
data2$abs_error_squared <- with(data2, abs(weight - modeled)^2)

Calculate \(D\) and the largest absolute deviation:

D <- (sum(data2$abs_error_squared)/8)^{1/2}
(max(data2$abs_error))
## [1] 2.794251

\[D = 1.486233 \leq c_{max} \leq 2.794251 = d_{max}\]

ggplot(data2, aes(x = length * girth^2, y = weight)) + 
  geom_point() + 
  geom_line(aes(x = length * girth^2, y = modeled), color = "red") + 
  labs(x = "Length * girth^2 (in^3)", y = "Weight (oz)")

c. Which of the two models fits the data better? Justify fully. Which model do you prefer and why?

Looking at the resulting plot of actual versus predicted values for models 1 and 2, model 1 appears to fit better upon visual inspection.

data.frame(weight = data2$weight, 
           length = data2$length, 
           model1 = data1$modeled, 
           model2 = data2$modeled) %>%
ggplot(aes(length)) + 
  geom_point(aes(y = weight, colour = "weight")) + 
  geom_line(aes(y = model1, colour = "model1")) + 
  geom_line(aes(y = model2, colour = "model2")) + 
  ggtitle("Comparison - Actual Weight vs. Predicted")

Looking at the error associated with each model below, model 1 has a lower error, which indicates a better fit than model 2.

Model      Error
Model 1    7.9811385
Model 2    9.6598777

Page 157: #4

In the following data, X represents the diameter of a ponderosa pine measured at breast height, and Y is a measure of volume (number of board feet divided by 10). Make a scatterplot of the data. Discuss the appropriateness of using a 13th-degree polynomial that passes through the data points as an empirical model. If you have a computer available, fit a polynomial to the data and graph the results.

data  <- data.frame(x = c(17, 19, 20, 22, 23, 25, 31, 32, 33, 36, 37, 38, 39, 41),
                    y = c(19, 25, 32, 51, 57, 71, 141, 123, 187, 192, 205, 252, 248, 294))


ggplot(data, aes(x = x, y = y)) + geom_point() + labs(x = 'Diameter', y='Volume') + ggtitle('Ponderosa')

In this case, since we have 14 data points, it is possible to pass a unique polynomial of degree at most 13 through them. The model fit below does exactly that: the fitted polynomial passes through every data point, and all 14 residuals are zero. Using a 13th-degree polynomial in this way amounts to overfitting, and such a complex model is unlikely to fit any new data accurately.

model <- lm(data$y ~ poly(data$x, 13))

summary(model)
## 
## Call:
## lm(formula = data$y ~ poly(data$x, 13))
## 
## Residuals:
## ALL 14 residuals are 0: no residual degrees of freedom!
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)        135.5000         NA      NA       NA
## poly(data$x, 13)1  334.0041         NA      NA       NA
## poly(data$x, 13)2   50.6920         NA      NA       NA
## poly(data$x, 13)3    6.3525         NA      NA       NA
## poly(data$x, 13)4    5.8925         NA      NA       NA
## poly(data$x, 13)5   -0.5213         NA      NA       NA
## poly(data$x, 13)6    0.8304         NA      NA       NA
## poly(data$x, 13)7   -8.0260         NA      NA       NA
## poly(data$x, 13)8   -5.2668         NA      NA       NA
## poly(data$x, 13)9    4.2170         NA      NA       NA
## poly(data$x, 13)10  15.1710         NA      NA       NA
## poly(data$x, 13)11  22.4797         NA      NA       NA
## poly(data$x, 13)12  25.7748         NA      NA       NA
## poly(data$x, 13)13 -21.3375         NA      NA       NA
## 
## Residual standard error: NaN on 0 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:    NaN 
## F-statistic:   NaN on 13 and 0 DF,  p-value: NA
data$predicted <- model$fit #predict(model, data)

ggplot(data, aes(x)) + 
  geom_point(aes(y = y, colour = "y")) + 
  geom_line(aes(y = predicted, colour = "predicted")) + 
  ggtitle("Comparison - Actual vs. Predicted")

Page 169: #11

Construct a scatterplot of the given data. Is there a trend in the data? Are any of the data points outliers? Construct a divided difference table. Is smoothing with a low-order polynomial appropriate? If so, choose an appropriate polynomial and fit using the least-squares criterion of best fit. Analyze the goodness of fit by examining appropriate indicators and graphing the model, the data points, and the deviations.

data <- data.frame(length = c(12.5, 12.625, 14.125, 14.5, 17.25, 17.75),
                   weight = c(17, 16.5, 23, 26.5, 41, 49))

ggplot(data, aes(x = length, y = weight)) + geom_point()

There is an apparent trend in the data: as length increases, weight increases, and the relationship looks roughly linear. The data point at (12.625, 16.5) may be an outlier given the apparent linear trend.

Create the divided difference table:

data$delta1 <-  c(NA, diff(data$weight)/diff(data$length))
data$delta2 <-  c(NA, NA, na.omit(diff(data$delta1))/diff(data$length, lag = 2))
data$delta3 <-  c(NA, NA, NA, na.omit(diff(data$delta2))/diff(data$length, lag = 3))
data$delta4 <-  c(NA, NA, NA, NA, na.omit(diff(data$delta3))/diff(data$length, lag = 4))
data$delta5 <-  c(NA, NA, NA, NA, NA, na.omit(diff(data$delta4))/diff(data$length, lag = 5))  

Divided difference table:

length  weight  delta1      delta2      delta3       delta4      delta5
12.500    17.0        NA          NA          NA           NA          NA
12.625    16.5   -4.000000        NA          NA           NA          NA
14.125    23.0    4.333333    5.128205        NA           NA          NA
14.500    26.5    9.333333    2.666667   -1.2307692        NA          NA
17.250    41.0    5.272727   -1.299394   -0.8575266    0.0785774       NA
17.750    49.0   16.000000    3.300699    1.2689912    0.4149303   0.0640672

A negative value appears in the first divided-difference column (delta1), which makes the table unreliable for identifying a low-order polynomial. This negative value could be due to an inaccurate measurement, possibly from the outlier noted above. It would be worth removing the outlier to see the effect on the divided difference table.
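
As a quick check (not part of the original table), dropping the suspected outlier at (12.625, 16.5) and recomputing the first divided differences removes the negative value:

# First divided differences with the suspected outlier (row 2) removed
data_no_outlier <- data[-2, c("length", "weight")]
diff(data_no_outlier$weight) / diff(data_no_outlier$length)
## [1]  3.692308  9.333333  5.272727 16.000000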

Looking at the scatterplot of the data, it may be worth exploring a quadratic or cubic polynomial.

Quadratic Model

m2 <- lm(weight ~ poly(length, 2), data=data)
m2$coefficients
##      (Intercept) poly(length, 2)1 poly(length, 2)2 
##        28.833333        29.436813         3.112574
m2$residuals
##          1          2          3          4          5          6 
##  0.3135981 -0.6123686 -0.3760518  1.2256959 -2.2566506  1.7057768
data$m2_prediction <- m2$fit #predict(m2, data)
data$m2_residuals <- m2$residuals

\[P(l) = 28.833333 + 29.436813 p_1(l) + 3.112574 p_2(l)\]

Note that poly() uses orthogonal polynomials by default, so these coefficients apply to the orthogonal basis polynomials \(p_1\) and \(p_2\), not to the raw powers of \(l\).
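
If the model is wanted in terms of raw powers of \(l\), the fit could be repeated with raw = TRUE (a sketch; coefficients not shown here):

# Refit using raw (non-orthogonal) polynomial terms to obtain coefficients on l and l^2
m2_raw <- lm(weight ~ poly(length, 2, raw = TRUE), data = data)
coef(m2_raw)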

ggplot(data, aes(length)) + 
  geom_point(aes(y = weight, colour = "y")) + 
  geom_line(aes(y = m2_prediction, colour = "m2_prediction")) + 
  ggtitle("Comparison - Actual vs. Predicted")

Cubic Model

m3 <- lm(weight ~ poly(length, 3), data=data)
m3$coefficients
##      (Intercept) poly(length, 3)1 poly(length, 3)2 poly(length, 3)3 
##        28.833333        29.436813         3.112574         2.377461
data$m3_prediction <- m3$fit #predict(m3, data)
data$m3_residuals <- m3$residuals

\[P(l) = 28.833333 + 29.436813 p_1(l) + 3.112574 p_2(l) + 2.377461 p_3(l)\]

ggplot(data, aes(length)) + 
  geom_point(aes(y = weight, colour = "y")) + 
  geom_line(aes(y = m3_prediction, colour = "m3_prediction")) + 
  ggtitle("Comparison - Actual vs. Predicted")

4th-Order Polynomial Model

m4 <- lm(weight ~ poly(length, 4), data=data)
m4$coefficients
##      (Intercept) poly(length, 4)1 poly(length, 4)2 poly(length, 4)3 
##        28.833333        29.436813         3.112574         2.377461 
## poly(length, 4)4 
##         2.089060
data$m4_prediction <- m4$fit # predict(m4, data)
data$m4_residuals <- m4$residuals

\[P(l) = 28.833333 + 29.436813 p_1(l) + 3.112574 p_2(l) + 2.377461 p_3(l) + 2.089060 p_4(l)\]

ggplot(data, aes(length)) + 
  geom_point(aes(y = weight, colour = "weight")) + 
  geom_line(aes(y = m4_prediction, colour = "m4_prediction")) + 
  ggtitle("Comparison - Actual vs. Predicted")

Looking at the fit of the models above, the cubic polynomial looks like the best fit. This, combined with its relatively low sum of squared errors (4.466908), makes it the preferred empirical model. The 4th-order polynomial appears to overfit the data.

Polynomial   Sum of squared errors
quadratic    10.1192309
cubic         4.466908
quartic       0.1027349
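
The sums of squared residuals in the table above can be reproduced directly from the fitted models, for example:

# Sum of squared residuals for each candidate polynomial
sapply(list(quadratic = m2, cubic = m3, quartic = m4),
       function(m) sum(m$residuals^2))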