The following table gives the elongation \(e\) in inches per inch (in./in.) for a given stress \(S\) on a steel wire measured in pounds per square inch (\(lb/in^2\)). Test the model \(e = c_1S\) by plotting the data. Estimate \(c_1\) graphically.
library(ggplot2)  # plotting
library(dplyr)    # used later for the pipe operator %>%

S <- c(5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100)
e <- c(0, 19, 57, 94, 134, 173, 216, 256, 297, 343, 390)
data <- data.frame(S, e)
ggplot(data, aes(x = S, y = e)) + geom_point() + ggtitle('Scatterplot - Elongation vs. Stress') + labs(x = "Stress", y = "Elongation")
Estimating graphically with a visual best-fit line through the origin, \(c_1\) appears to be approximately 3.5. This gives the model:
\[e = 3.5S\]
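As a check on the graphical estimate, the least-squares slope for a line through the origin is \(c_1 = \sum S_i e_i / \sum S_i^2\); computing it (our own check, not part of the graphical method) gives a value near 3.7, of the same order as the eyeball estimate:

c1 <- sum(S * e) / sum(S^2)
c1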
For each of the following data sets, formulate the mathematical model that minimizes the largest deviation between the data and the line \(y = ax + b\). If a computer is available, solve for the estimates of \(a\) and \(b\).
data <- data.frame(x = c(1.0, 2.3, 3.7, 4.2, 6.1, 7.0),
                   y = c(3.6, 3.0, 3.2, 5.1, 5.3, 6.8))
ggplot(data, aes(x=x, y=y)) + geom_point() + labs(x = "x", y = "y")
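The criterion asked for, minimizing the largest deviation, is the Chebyshev criterion: minimize \(r\) subject to \(r \geq y_i - (ax_i + b)\) and \(r \geq (ax_i + b) - y_i\) for each data point. This is a linear program; below is a sketch of a computer solution using the lpSolve package (assuming it is available; \(a\) and \(b\) are split into nonnegative parts because lp() requires nonnegative variables):

library(lpSolve)
x <- data$x; y <- data$y
# Decision variables: a+, a-, b+, b-, r, with a = a+ - a- and b = b+ - b-
obj <- c(0, 0, 0, 0, 1)                # minimize r
con <- rbind(cbind(x, -x, 1, -1, 1),   # r >= y_i - (a*x_i + b)
             cbind(-x, x, -1, 1, 1))   # r >= (a*x_i + b) - y_i
sol <- lp("min", obj, con, rep(">=", 2 * length(x)), c(y, -y))$solution
c(a = sol[1] - sol[2], b = sol[3] - sol[4], r_max = sol[5])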
For comparison, fit an ordinary least-squares line:
model <- lm(y ~ x, data = data)
summary(model)
##
## Call:
## lm(formula = y ~ x, data = data)
##
## Residuals:
## 1 2 3 4 5 6
## 0.8209 -0.5126 -1.1025 0.5154 -0.3567 0.6355
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2149 0.7737 2.863 0.0458 *
## x 0.5642 0.1703 3.313 0.0296 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8586 on 4 degrees of freedom
## Multiple R-squared: 0.7329, Adjusted R-squared: 0.6661
## F-statistic: 10.98 on 1 and 4 DF, p-value: 0.02957
a <- 0.5642
b <- 2.2149
\[y = 0.5642x + 2.2149\]
The largest deviation \(d_{max}\) is 1.1025182.
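This is simply the largest absolute residual of the least-squares fit:

max(abs(model$residuals))
## [1] 1.102518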
ggplot(data, aes(x)) +
  geom_point(aes(y = y, colour = "data")) +
  geom_line(aes(y = model$fit, colour = "fit")) +
  ggtitle("Comparison - Data vs. Least-Squares Fit")
The data for the planets, with period in seconds and distance from the sun in meters:
data <- data.frame(body = c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune"),
                   period = c(7.6e6, 1.94e7, 3.16e7, 5.94e7, 3.74e8, 9.35e8, 2.64e9, 5.22e9),
                   distance_from_sun = c(5.79e10, 1.08e11, 1.5e11, 2.28e11, 7.79e11, 1.43e12, 2.87e12, 4.5e12))
data
## body period distance_from_sun
## 1 Mercury 7.60e+06 5.79e+10
## 2 Venus 1.94e+07 1.08e+11
## 3 Earth 3.16e+07 1.50e+11
## 4 Mars 5.94e+07 2.28e+11
## 5 Jupiter 3.74e+08 7.79e+11
## 6 Saturn 9.35e+08 1.43e+12
## 7 Uranus 2.64e+09 2.87e+12
## 8 Neptune 5.22e+09 4.50e+12
Fit the model \(y = ax^{3/2}\), where \(x\) is the period and \(y\) is the distance from the sun.
Use the least-squares criterion to fit a curve of the form \(y = ax^n\). Minimizing \(\sum_i \left(y_i - ax_i^n\right)^2\) with respect to \(a\) and setting the derivative to zero yields the general form:
\[a = \frac{\sum x_i^{n} y_i}{\sum x_i^{2n}}\]
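This closed form is easy to wrap in a small helper for reuse with other exponents (the function name power_fit is our own, not from the original solution):

# Least-squares coefficient for the one-parameter model y = a * x^n:
# a = sum(x^n * y) / sum(x^(2n))
power_fit <- function(x, y, n) sum(x^n * y) / sum(x^(2 * n))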
Substituting \(n = \frac{3}{2}\):
\[a=\frac{\sum x_i^{\frac{3}{2}} y_i}{\sum x_i^{3}}\]
Compute \(\sum x_i^{\frac{3}{2}} y_i\):
numerator <- sum(data$period^(3/2) * data$distance_from_sun)
Compute \(\sum x_i^{3}\):
denominator <- sum(data$period^3)
a <- numerator/denominator
\[a = 0.01320756 \]
This gives the least-squares approximate model:
\[y = 0.01320756x^{3/2}\]
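As a check, the power_fit helper defined above reproduces the same coefficient:

power_fit(data$period, data$distance_from_sun, 3/2)
## [1] 0.01320756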
Plot the actual vs. modeled values:
ggplot(data, aes(x = period, y = distance_from_sun)) +
  geom_point() +
  geom_line(aes(y = a * period^(3/2)), color = "blue") +
  labs(x = "Period (sec)", y = "Distance from sun (m)")
a. In the following data, \(W\) represents the weight of a fish (bass) and \(l\) represents its length. Fit the model
\[W = kl^3\]
to the data using the least-squares criterion.
data1 <- data.frame(length = c(14.5, 12.5, 17.25, 14.5, 12.625, 17.75, 14.125, 12.625),
                    weight = c(27, 17, 41, 26, 17, 49, 23, 16))
\[k = \frac{\sum l_i^{3} W_i}{\sum l_i^{6}}\]
k1 <- sum(data1$length^3 * data1$weight)/sum(data1$length^6)
k1
## [1] 0.008436761
This gives the least-squares approximate model:
\[W = 0.008436761l^3\]
data1$modeled <- with(data1, k1 * length^3)
data1$abs_error <- with(data1, abs(weight - modeled))
data1$abs_error_squared <- with(data1, abs(weight - modeled)^2)
Calculate \(D = \sqrt{\sum d_i^2 / n}\) and the largest absolute deviation \(d_{max}\), which bound the minimax deviation \(c_{max}\) from below and above:
D <- (sum(data1$abs_error_squared) / 8)^(1/2)
max(data1$abs_error)
## [1] 2.305497
\[D = 1.233306 \leq c_{max} \leq 2.305497 = d_{max}\]
ggplot(data1, aes(x = length, y = weight)) +
geom_point() +
geom_line(aes(x = length, y = modeled), color = "red") +
labs(x = "Length (in)", y = "Weight (oz)")
b. In the following data, \(g\) represents the girth of a fish. Fit the model
\[W = klg^2\]
to the data using the least-squares criterion.
data2 <- data.frame(length = c(14.5, 12.5, 17.25, 14.5, 12.625, 17.75, 14.125, 12.625),
                    girth = c(9.75, 8.375, 11.0, 9.75, 8.5, 12.5, 9.0, 8.5),
                    weight = c(27, 17, 41, 26, 17, 49, 23, 16))
Treating \(z = lg^2\) as a single predictor, applying the least-squares criterion to \(W = kz\) gives:
\[k=\frac{\sum W_i l_i g_i^2}{\sum \left(l_i g_i^2\right)^2}\]
Calculate \(k\):
k2 <- sum(data2$weight * data2$length * data2$girth^2)/sum((data2$length * data2$girth^2)^2)
k2
## [1] 0.01867511
This gives the least-squares approximate model:
\[W = 0.01867511lg^2\]
data2$modeled <- with(data2, k2 * length * girth^2)
data2$abs_error <- with(data2, abs(weight - modeled))
data2$abs_error_squared <- with(data2, abs(weight - modeled)^2)
Calculate \(D\) and the largest absolute deviation \(d_{max}\):
D <- (sum(data2$abs_error_squared) / 8)^(1/2)
max(data2$abs_error)
## [1] 2.794251
\[D = 1.486233 \leq c_{max} \leq 2.794251 = d_{max}\]
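Both bounds can be wrapped in one small function (cmax_bounds is our own name); applied to model 2 it reproduces \(D = 1.486233\) and \(d_{max} = 2.794251\) from above:

# Lower and upper bounds on the minimax deviation c_max,
# computed from the residuals of a least-squares fit
cmax_bounds <- function(obs, pred) {
  d <- abs(obs - pred)
  c(D = sqrt(mean(d^2)), d_max = max(d))
}
cmax_bounds(data2$weight, data2$modeled)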
ggplot(data2, aes(x = length * girth^2, y = weight)) +
  geom_point() +
  geom_line(aes(y = modeled), color = "red") +
  labs(x = "Length x girth^2 (in^3)", y = "Weight (oz)")
c. Which of the two models fits the data better? Justify fully. Which model do you prefer and why?
In the plot below of actual versus predicted values for models 1 and 2, model 1 appears to fit better on visual inspection.
data.frame(weight = data2$weight,
           length = data2$length,
           model1 = data1$modeled,
           model2 = data2$modeled) %>%
  ggplot(aes(length)) +
  geom_point(aes(y = weight, colour = "weight")) +
  geom_line(aes(y = model1, colour = "model1")) +
  geom_line(aes(y = model2, colour = "model2")) +
  ggtitle("Comparison - Actual Weight vs. Predicted")
Comparing the total absolute deviation of each model below, model 1 has the lower total, again indicating a better fit than model 2.

Model | Sum of absolute deviations \(\sum \lvert d_i \rvert\) |
---|---|
Model 1 | 7.9811385 |
Model 2 | 9.6598777 |
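These totals come directly from the absolute-error columns computed earlier:

sum(data1$abs_error)  # model 1
sum(data2$abs_error)  # model 2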
In the following data, \(X\) represents the diameter of a ponderosa pine measured at breast height, and \(Y\) is a measure of volume (number of board feet divided by 10). Make a scatterplot of the data. Discuss the appropriateness of using a 13th-degree polynomial that passes through the data points as an empirical model. If you have a computer available, fit a polynomial to the data and graph the results.
data <- data.frame(x = c(17, 19, 20, 22, 23, 25, 31, 32, 33, 36, 37, 38, 39, 41),
                   y = c(19, 25, 32, 51, 57, 71, 141, 123, 187, 192, 205, 252, 248, 294))
ggplot(data, aes(x = x, y = y)) + geom_point() + labs(x = 'Diameter', y='Volume') + ggtitle('Ponderosa')
Since there are 14 data points, a unique polynomial of degree at most 13 can be passed through them. The model fit below does exactly that: all 14 residuals are zero, so the polynomial interpolates the data exactly. Using a 13th-degree polynomial here is overfitting; the model's complexity captures measurement noise, so it would likely fit new data poorly.
model <- lm(data$y ~ poly(data$x, 13))
summary(model)
##
## Call:
## lm(formula = data$y ~ poly(data$x, 13))
##
## Residuals:
## ALL 14 residuals are 0: no residual degrees of freedom!
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 135.5000 NA NA NA
## poly(data$x, 13)1 334.0041 NA NA NA
## poly(data$x, 13)2 50.6920 NA NA NA
## poly(data$x, 13)3 6.3525 NA NA NA
## poly(data$x, 13)4 5.8925 NA NA NA
## poly(data$x, 13)5 -0.5213 NA NA NA
## poly(data$x, 13)6 0.8304 NA NA NA
## poly(data$x, 13)7 -8.0260 NA NA NA
## poly(data$x, 13)8 -5.2668 NA NA NA
## poly(data$x, 13)9 4.2170 NA NA NA
## poly(data$x, 13)10 15.1710 NA NA NA
## poly(data$x, 13)11 22.4797 NA NA NA
## poly(data$x, 13)12 25.7748 NA NA NA
## poly(data$x, 13)13 -21.3375 NA NA NA
##
## Residual standard error: NaN on 0 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: NaN
## F-statistic: NaN on 13 and 0 DF, p-value: NA
data$predicted <- model$fit #predict(model, data)
ggplot(data, aes(x)) +
geom_point(aes(y = y, colour = "y")) +
geom_line(aes(y = predicted, colour = "predicted")) +
ggtitle("Comparison - Actual vs. Predicted")
Construct a scatterplot of the given data. Is there a trend in the data? Are any of the data points outliers? Construct a divided difference table. Is smoothing with a low-order polynomial appropriate? If so, choose an appropriate polynomial and fit using the least-squares criterion of best fit. Analyze the goodness of fit by examining appropriate indicators and graphing the model, the data points, and the deviations.
data <- data.frame(length = c(12.5, 12.625, 14.125, 14.5, 17.25, 17.75),
                   weight = c(17, 16.5, 23, 26.5, 41, 49))
ggplot(data, aes(x = length, y = weight)) + geom_point()
There is an apparent trend in the data: as length increases, weight increases, and the relationship looks roughly linear. The data point at (12.625, 16.5) may be an outlier given the apparent linear trend.
Construct the divided difference table:
data$delta1 <- c(NA, diff(data$weight)/diff(data$length))
data$delta2 <- c(NA, NA, na.omit(diff(data$delta1))/diff(data$length, lag = 2))
data$delta3 <- c(NA, NA, NA, na.omit(diff(data$delta2))/diff(data$length, lag = 3))
data$delta4 <- c(NA, NA, NA, NA, na.omit(diff(data$delta3))/diff(data$length, lag = 4))
data$delta5 <- c(NA, NA, NA, NA, NA, na.omit(diff(data$delta4))/diff(data$length, lag = 5))
length | weight | delta1 | delta2 | delta3 | delta4 | delta5 |
---|---|---|---|---|---|---|
12.500 | 17.0 | NA | NA | NA | NA | NA |
12.625 | 16.5 | -4.000000 | NA | NA | NA | NA |
14.125 | 23.0 | 4.333333 | 5.128205 | NA | NA | NA |
14.500 | 26.5 | 9.333333 | 2.666667 | -1.2307692 | NA | NA |
17.250 | 41.0 | 5.272727 | -1.299394 | -0.8575266 | 0.0785774 | NA |
17.750 | 49.0 | 16.000000 | 3.300699 | 1.2689912 | 0.4149303 | 0.0640672 |
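For reference, the same table can be built more compactly with a loop; an equivalent sketch (dd is our own name for the fresh table):

# Build the divided-difference columns iteratively
dd <- data[, c("length", "weight")]
col <- data$weight
for (k in 1:5) {
  col <- diff(col) / diff(data$length, lag = k)
  dd[[paste0("delta", k)]] <- c(rep(NA, k), col)
}
dd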
A negative value appears in the first divided-difference column (delta1) above, so the differences do not settle toward a constant and the table cannot cleanly identify a low-order polynomial. That negative value could be due to an inaccurate measurement, most likely the outlier noted earlier; removing the outlier and recomputing the table would be worth trying.
Looking at the scatterplot of the data, it may be worth exploring a quadratic or cubic polynomial.
Quadratic Model
m2 <- lm(weight ~ poly(length, 2), data=data)
m2$coefficients
## (Intercept) poly(length, 2)1 poly(length, 2)2
## 28.833333 29.436813 3.112574
m2$residuals
## 1 2 3 4 5 6
## 0.3135981 -0.6123686 -0.3760518 1.2256959 -2.2566506 1.7057768
data$m2_prediction <- m2$fit #predict(m2, data)
data$m2_residuals <- m2$residuals
Note that poly() fits against an orthogonal polynomial basis, so the printed coefficients do not multiply raw powers of \(l\); the fitted quadratic is
\[P(l) = 28.833333 + 29.436813\,p_1(l) + 3.112574\,p_2(l)\]
where \(p_1\) and \(p_2\) are the orthogonal basis polynomials. (Use poly(length, 2, raw = TRUE) to obtain coefficients on \(1, l, l^2\) directly.)
ggplot(data, aes(length)) +
geom_point(aes(y = weight, colour = "y")) +
geom_line(aes(y = m2_prediction, colour = "m2_prediction")) +
ggtitle("Comparison - Actual vs. Predicted")
Cubic Model
m3 <- lm(weight ~ poly(length, 3), data=data)
m3$coefficients
## (Intercept) poly(length, 3)1 poly(length, 3)2 poly(length, 3)3
## 28.833333 29.436813 3.112574 2.377461
data$m3_prediction <- m3$fit #predict(m3, data)
data$m3_residuals <- m3$residuals
Again in the orthogonal basis used by poly():
\[P(l) = 28.833333 + 29.436813\,p_1(l) + 3.112574\,p_2(l) + 2.377461\,p_3(l)\]
ggplot(data, aes(length)) +
geom_point(aes(y = weight, colour = "y")) +
geom_line(aes(y = m3_prediction, colour = "m3_prediction")) +
ggtitle("Comparison - Actual vs. Predicted")
4th-Order Polynomial Model
m4 <- lm(weight ~ poly(length, 4), data=data)
m4$coefficients
## (Intercept) poly(length, 4)1 poly(length, 4)2 poly(length, 4)3
## 28.833333 29.436813 3.112574 2.377461
## poly(length, 4)4
## 2.089060
data$m4_prediction <- m4$fit # predict(m4, data)
data$m4_residuals <- m4$residuals
In the same orthogonal basis:
\[P(l) = 28.833333 + 29.436813\,p_1(l) + 3.112574\,p_2(l) + 2.377461\,p_3(l) + 2.089060\,p_4(l)\]
ggplot(data, aes(length)) +
geom_point(aes(y = weight, colour = "weight")) +
geom_line(aes(y = m4_prediction, colour = "m4_prediction")) +
ggtitle("Comparison - Actual vs. Predicted")
Looking at the fits above, the cubic polynomial looks best. Its sum of squared errors (4.466908) is well below the quadratic's (10.1192309), and while the quartic drives the error down to 0.1027349, with only six data points that extra flexibility is almost certainly overfitting. The cubic is therefore the preferred empirical model.
Polynomial | Sum of squared errors |
---|---|
quadratic | 10.1192309 |
cubic | 4.466908 |
quartic | 0.1027349 |