Page 113: #2

The following table gives the elongation \(e\) in inches per inch (in/in) for a given stress \(S\) on a steel wire measured in pounds per square inch (\(\mathrm{lb/in^2}\)). Test the model \(e = c_1S\) by plotting the data. Estimate \(c_1\) graphically.

Answer Page 113: #2

# stress S and elongation e data from the problem
S <- c(5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100)
S
##  [1]   5  10  20  30  40  50  60  70  80  90 100
e <- c(0, 19, 57, 94, 134, 173, 216, 256, 297, 343, 390)
e
##  [1]   0  19  57  94 134 173 216 256 297 343 390
df <- data.frame(S, e)



library(ggplot2)
p1 <- ggplot(data = df, aes(x = S, y = e)) +
  geom_point()

p1 <- p1 + labs(title="Data Points", x="S", y="e")
p1

The value of \(c_1\) can be estimated graphically as the slope of the line through the data points:
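The chunk that produced the value below was not echoed in the original; a computation consistent with it (an assumption on my part) is the mean of the slopes between successive data points:

# estimate the slope as the mean of the successive secant slopes
mean(diff(e) / diff(S))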

## [1] 4.09

From the above, the value of \(c_1 = 4.09\).

Hence, \(e = c_1S\) can be written as \[e = 4.09S \]
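As a quick visual test of the model (an addition, not part of the original solution), we can overlay the fitted line on the data:

# overlay the estimated model e = 4.09 S on the data points
p1 + geom_abline(intercept = 0, slope = 4.09, linetype = "dashed") +
  labs(title = "Data with Fitted Model e = 4.09 S")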

Page 121: #2.a

For each of the following data sets, formulate the mathematical model that minimizes the largest deviation between the data and the line y = ax + b. If a computer is available, solve for the estimates of a and b.

Answer Page 121: #2.a

x<- c(1.0, 2.3, 3.7, 4.7, 6.1, 7.0)
y<- c(3.6, 3.0, 3.2, 5.1, 5.3, 6.8)
df2<- data.frame(x,y)
df2
##     x   y
## 1 1.0 3.6
## 2 2.3 3.0
## 3 3.7 3.2
## 4 4.7 5.1
## 5 6.1 5.3
## 6 7.0 6.8
p2 <- ggplot(data = df2, aes(x = x, y = y)) +
  geom_point()
p2 <- p2 + labs(title="Actual Data Points", x="x", y="y")
p2

From the plot above, we look for a straight line \(y = ax + b\) that lies close to all the points. Rather than estimating the slope from a single pair of points such as \((x, y) = (6.1, 5.3)\) and \((x, y) = (7.0, 6.8)\), we average the slopes between successive data points to get \(a\), and then use the means of \(x\) and \(y\) to find the y-intercept \(b\):

x <- c(1.0, 2.3, 3.7, 4.7, 6.1, 7.0)
y <- c(3.6, 3.0, 3.2, 5.1, 5.3, 6.8)
df2 <- data.frame(x, y)

# successive differences in y and in x (first entry is NA)
df2$z  <- c(NA, diff(df2$y))
df2$z2 <- c(NA, diff(df2$x))

# slope between each pair of consecutive points
df2$slope <- df2$z / df2$z2

slope <- mean(df2$slope, na.rm = TRUE)
slope
## [1] 0.6781685
# Finding the y-intercept b:

xmean<- mean(df2$x)
ymean<- mean(df2$y)

b<- ymean - xmean*slope
b
## [1] 1.696904

Hence: \[y = 0.678x + 1.697\]

Plotting \(y = 0.678x + 1.697\) against the data:

df2$fx <- (0.678 * df2$x) + 1.697

p2 <- ggplot(data = df2, aes(x = x)) +
  geom_point(aes(y = y)) +
  geom_line(aes(y = fx))
p2 <- p2 + labs(title="Data & Model y = ax + b", x="x", y="y")
p2
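The problem also asks for the line that minimizes the largest deviation (the Chebyshev criterion), which the averaged-slope line above only approximates. A sketch of that criterion as a linear program, assuming the lpSolve package is available (the slope and intercept are split into positive and negative parts because lp() assumes non-negative variables):

library(lpSolve)

# minimize r subject to r >= |y_i - (a*x_i + b)| at every data point
# decision variables: r, a+, a-, b+, b-  with a = a+ - a-, b = b+ - b-
n <- length(x)
A <- rbind(cbind(1,  x, -x,  1, -1),   # r + a*x_i + b >= y_i
           cbind(1, -x,  x, -1,  1))   # r - a*x_i - b >= -y_i
fit <- lp("min", objective.in = c(1, 0, 0, 0, 0),
          const.mat = A, const.dir = rep(">=", 2 * n),
          const.rhs = c(y, -y))
a_cheb <- fit$solution[2] - fit$solution[3]
b_cheb <- fit$solution[4] - fit$solution[5]
c(a = a_cheb, b = b_cheb, largest_deviation = fit$solution[1])

The resulting \((a, b)\) can be compared with the averaged-slope estimates above.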

Page 127, Number 10

For the planetary data below, fit the model \[y = ax^{3/2} \qquad \text{(Equation 1)}\]

Planets <- data.frame(Body = c("Mercury", "Venus", "Earth", "Mars", "Jupiter","Saturn", "Uranus", "Neptune"),
                      Period = c(7.6e6, 1.94e7, 3.16e7, 5.94e7, 3.74e8, 9.35e8, 2.64e9, 5.22e9),
                      Distance = c(5.79e10, 1.08e11, 1.5e11, 2.28e11, 7.79e11, 1.43e12,
                                   2.87e12, 4.5e12))
Planets
##      Body   Period Distance
## 1 Mercury 7.60e+06 5.79e+10
## 2   Venus 1.94e+07 1.08e+11
## 3   Earth 3.16e+07 1.50e+11
## 4    Mars 5.94e+07 2.28e+11
## 5 Jupiter 3.74e+08 7.79e+11
## 6  Saturn 9.35e+08 1.43e+12
## 7  Uranus 2.64e+09 2.87e+12
## 8 Neptune 5.22e+09 4.50e+12

Answer Page 127, Number 10

Since this is a power curve of the form \(y = Ax^n\), the least-squares estimate of \(a\) is given by:

\[a = \frac{\sum x_i^n y_i}{\sum x_i^{2n}} \qquad \text{(Equation 2)}\]
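This is the standard least-squares result: differentiating the sum of squared deviations with respect to \(a\) and setting it to zero,

\[\frac{d}{da}\sum_i \left(y_i - a x_i^n\right)^2 = -2\sum_i x_i^n\left(y_i - a x_i^n\right) = 0 \quad\Longrightarrow\quad a = \frac{\sum x_i^n y_i}{\sum x_i^{2n}}\]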

In our case \(n = 3/2\), hence our equation becomes \[a = \frac{\sum x_i^{3/2} y_i}{\sum x_i^{3}} \qquad \text{(Equation 3)}\]

# estimating a:

a <- sum(Planets$Period^(3/2)*Planets$Distance)/sum(Planets$Period^(3))
a
## [1] 0.01320756

Therefore, Equation 1 becomes: \[y = 0.01320756x^{3/2}\]

Out of curiosity, we also plot the actual data alongside the fitted curve:

Planets$Distance_fit <- a * Planets$Period^(3/2)

p3 <- ggplot(data = Planets, aes(x = Period, y = Distance)) +
  geom_line()
p3 <- p3 + labs(title="Actual Data", x="x", y="y")

p3_fit <- ggplot(data = Planets, aes(x = Period, y = Distance_fit)) +
  geom_line()
p3_fit <- p3_fit + labs(title="Fitted Power Curve n = 3/2", x="x", y="y")

# multiplot() is not in base R or ggplot2; gridExtra::grid.arrange()
# produces the same side-by-side layout
library(gridExtra)
grid.arrange(p3, p3_fit, ncol = 2)

Page 157, Number 4

In the following data, \(X\) represents the diameter of a ponderosa pine measured at breast height, and \(Y\) is a measure of volume: the number of board feet divided by 10. Make a scatterplot of the data. Discuss the appropriateness of using a 13th-degree polynomial that passes through the data points as an empirical model. If you have a computer available, fit the polynomial to the data and graph the results.

Answer Page 157, Number 4

Make a scatterplot of the data.

Ponderosa_Pine <- data.frame(X = c(17, 19, 20, 22, 23, 25, 31, 32, 33, 36, 37, 38, 39, 41),
                        Y = c(19, 25, 32, 51, 57, 71, 141, 123, 187, 192, 205, 252, 248, 294))

df4<- Ponderosa_Pine

p4<- ggplot(data = df4, aes(x = X, y = Y)) +
  geom_point()
p4 <- p4 + labs(title="Actual Data", x="X", y="Y")
p4

Discuss the appropriateness of using a 13th-degree polynomial that passes through the data points as an empirical model.

Let’s explore what is meant by using a 13th-degree polynomial that passes through the data points as an empirical model. It means that the 14 data points require the constants \(a_i\) to satisfy the following system of linear algebraic equations:

\[ \begin{aligned} 19 &= a_0 + 17a_1 + 17^2 a_2 + \cdots + 17^{13} a_{13} \\ 25 &= a_0 + 19a_1 + 19^2 a_2 + \cdots + 19^{13} a_{13} \\ 32 &= a_0 + 20a_1 + 20^2 a_2 + \cdots + 20^{13} a_{13} \\ 51 &= a_0 + 22a_1 + 22^2 a_2 + \cdots + 22^{13} a_{13} \\ 57 &= a_0 + 23a_1 + 23^2 a_2 + \cdots + 23^{13} a_{13} \\ 71 &= a_0 + 25a_1 + 25^2 a_2 + \cdots + 25^{13} a_{13} \\ 141 &= a_0 + 31a_1 + 31^2 a_2 + \cdots + 31^{13} a_{13} \\ 123 &= a_0 + 32a_1 + 32^2 a_2 + \cdots + 32^{13} a_{13} \\ 187 &= a_0 + 33a_1 + 33^2 a_2 + \cdots + 33^{13} a_{13} \\ 192 &= a_0 + 36a_1 + 36^2 a_2 + \cdots + 36^{13} a_{13} \\ 205 &= a_0 + 37a_1 + 37^2 a_2 + \cdots + 37^{13} a_{13} \\ 252 &= a_0 + 38a_1 + 38^2 a_2 + \cdots + 38^{13} a_{13} \\ 248 &= a_0 + 39a_1 + 39^2 a_2 + \cdots + 39^{13} a_{13} \\ 294 &= a_0 + 41a_1 + 41^2 a_2 + \cdots + 41^{13} a_{13} \\ \end{aligned} \]

Therefore, we need to solve the above system of linear algebraic equations. As shown below, we use R’s poly() function together with the linear-model fitting function lm() to do so:

model1<- summary(lm(Y ~ poly(X, 13, raw=TRUE), data=Ponderosa_Pine))
model1
## 
## Call:
## lm(formula = Y ~ poly(X, 13, raw = TRUE), data = Ponderosa_Pine)
## 
## Residuals:
##        1        2        3        4        5        6        7        8 
##  -0.0777   1.2350  -2.4560   3.2869  -1.7901  -0.6550  13.4945 -29.2786 
##        9       10       11       12       13       14 
##  18.3321   0.6008  -9.5549  10.2363  -3.5314   0.1583 
## 
## Coefficients: (3 not defined because of singularities)
##                             Estimate Std. Error t value Pr(>|t|)
## (Intercept)                1.670e+07  2.823e+07   0.592    0.596
## poly(X, 13, raw = TRUE)1  -6.202e+06  1.031e+07  -0.602    0.590
## poly(X, 13, raw = TRUE)2   1.018e+06  1.666e+06   0.611    0.584
## poly(X, 13, raw = TRUE)3  -9.699e+04  1.563e+05  -0.621    0.579
## poly(X, 13, raw = TRUE)4   5.906e+03  9.383e+03   0.629    0.574
## poly(X, 13, raw = TRUE)5  -2.380e+02  3.732e+02  -0.638    0.569
## poly(X, 13, raw = TRUE)6   6.309e+00  9.774e+00   0.646    0.565
## poly(X, 13, raw = TRUE)7  -1.038e-01  1.590e-01  -0.653    0.561
## poly(X, 13, raw = TRUE)8   8.663e-04  1.315e-03   0.659    0.557
## poly(X, 13, raw = TRUE)9          NA         NA      NA       NA
## poly(X, 13, raw = TRUE)10 -4.741e-08  7.078e-08  -0.670    0.551
## poly(X, 13, raw = TRUE)11         NA         NA      NA       NA
## poly(X, 13, raw = TRUE)12  2.555e-12  3.768e-12   0.678    0.546
## poly(X, 13, raw = TRUE)13         NA         NA      NA       NA
## 
## Residual standard error: 23.14 on 3 degrees of freedom
## Multiple R-squared:  0.9862, Adjusted R-squared:  0.9401 
## F-statistic:  21.4 on 10 and 3 DF,  p-value: 0.01419

The model has produced some coefficients with the value NA, which lm() reports as “singularities”, an instance of multicollinearity: one or more predictor columns are (numerically) linear combinations of the others. With raw powers of \(X\) up to \(X^{13}\), the design matrix is so ill-conditioned at double precision that lm() drops three columns (\(X^9\), \(X^{11}\), \(X^{13}\)).
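To see this concretely (a quick check added here, not part of the original solution), we can inspect the condition number of the \(14 \times 14\) matrix of raw powers; the exact value is platform-dependent, but it is astronomically large:

# condition number of the matrix whose columns are X^0, X^1, ..., X^13
V <- outer(Ponderosa_Pine$X, 0:13, `^`)
kappa(V)

To plot the fitted polynomial, we simply omit the dropped terms and keep the coefficients that were estimated: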

library(ggplot2)

X <- Ponderosa_Pine$X

# fitted coefficients a_0, ..., a_13 (three of them are NA)
coefficients <- as.vector(lm(Y ~ poly(X, 13, raw = TRUE), data = Ponderosa_Pine)$coefficients)

# evaluate the fitted polynomial, skipping the dropped terms
# (X^9, X^11, X^13, whose coefficients are NA)
z <- coefficients[1] + coefficients[2]*X + coefficients[3]*X^2 +
  coefficients[4]*X^3 + coefficients[5]*X^4 + coefficients[6]*X^5 +
  coefficients[7]*X^6 + coefficients[8]*X^7 + coefficients[9]*X^8 +
  coefficients[11]*X^10 +
  coefficients[13]*X^12

ggplot(Ponderosa_Pine) + geom_point(aes(x=X, y=Y)) + 
  geom_line(aes(x=X, y=z))

From the plot above, we can see that multi-term models, namely polynomials, have the ability to capture the trend of the collected data. They are also popular because they are easy to integrate and to differentiate. However, we should be careful about the sensitivity of the coefficients of high-order polynomials to small changes in the data. Because we do expect measurement error to occur, the tendency of high-order polynomials to oscillate, together with the sensitivity of their coefficients to small changes in the data, is a disadvantage that restricts their usefulness in modeling.
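As an aside (an addition, not part of the original solution), the singularity issue can be avoided by fitting with orthogonal polynomials, which is what poly() does by default (raw = FALSE). With 14 points and 14 coefficients the fit interpolates the data, and evaluating it on a fine grid makes the oscillation between data points visible:

# same 13th-degree fit using orthogonal polynomials (the poly() default)
model_orth <- lm(Y ~ poly(X, 13), data = Ponderosa_Pine)

# evaluate the fitted polynomial on a fine grid to expose the oscillation
grid <- data.frame(X = seq(17, 41, by = 0.1))
grid$fit <- predict(model_orth, newdata = grid)

ggplot() +
  geom_point(data = Ponderosa_Pine, aes(x = X, y = Y)) +
  geom_line(data = grid, aes(x = X, y = fit)) +
  labs(title = "13th-Degree Fit, Orthogonal Polynomials", x = "X", y = "Y")

This supports the conclusion above: the polynomial passes through every data point but swings wildly between them.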