In this assignment, you will create your own RMarkdown document and your own code. This code should be closely based on code we have recently used in class, but don’t hesitate to use the internet and others sources as well. All code and answers should be in an RMarkdown document.
1. Load the small dataset Fertilizer Experiment.csv and print it out. There are three columns:
Nitrogen - This shows the pounds per acre of nitrogen applied in an experiment to judge the effect of fertilizer on crop yields.
Nit_by_20 - Shis shows the same data as Column 1 but in units of 20 pounds per acre.
Yield - My source just says ‘bushels’ but I suspect the true unit is bushels per acre.
Nitrogen = read.csv("Fertilizer Experiment.csv")
Nitrogen
## Nitrogen Nit_by_20 Yield
## 1 0 0 24.5
## 2 20 1 41.5
## 3 40 2 52.1
## 4 60 3 61.4
## 5 80 4 73.5
## 6 120 6 86.0
## 7 160 8 92.8
## 8 180 9 94.0
2. Create a linear model to predict crop yield as a function of amount of fertilizer applied. Include …
a. Exploratory plots and relevant comments.
plot(Nitrogen$Nitrogen, Nitrogen$Yield)
postive relationship between x and y, maybe linear maybe quadratic.
b. The linear model, including summary output, a statement of the model, diagnostic plots and your comments.
linear_fit = lm(Nitrogen$Yield ~ Nitrogen$Nitrogen)
summary(linear_fit)
##
## Call:
## lm(formula = Nitrogen$Yield ~ Nitrogen$Nitrogen)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.4080 -3.4420 0.6858 4.6265 8.7088
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.90801 4.30574 8.107 0.000189 ***
## Nitrogen$Nitrogen 0.37354 0.04192 8.911 0.000111 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.255 on 6 degrees of freedom
## Multiple R-squared: 0.9297, Adjusted R-squared: 0.918
## F-statistic: 79.4 on 1 and 6 DF, p-value: 0.0001114
The model is:
yield_hat = 34.9080 + 0.37354 * Nitrogen
If there is no nitrogen applied, then yield will be 34.91 bushels per acre.
If the nitrogen applied per acre increases by 1, the yield will increase by 0.37 bushels per acre.
The p-value of beta1 is 0.000111<0.05 and is significant.
residual = resid(linear_fit)
plot(Nitrogen$Nitrogen, residual, main = "residuals vs x")
plot(linear_fit, which = 1:2)
From the residuals vs x plot, we can tell the residuals of this linear model do have pattern, it seems very likely to have quadratic relation with x, which is not a good thing.
From the residuals vs fitted value plot, we can also tell that the residuals have pattern and the red line is very close to a quadratic line.
From the QQ plot, we can tell it's hard to say it's close to a line since 3 of 5 points are lower than the y=x line. So we may not be able to say the residuals are roughly normally distributed.
plot(Nitrogen$Nitrogen, Nitrogen$Yield)
abline(linear_fit)
comment???
3. Create a quadratic model to predict crop yield as a function of amount of fertilizer applied. Include …
a. The quadratic model, including summary output, a statement of the model, diagnostic plots and your comments.
Nitrogen$Nitrogen2 = Nitrogen$Nitrogen^2
Nitrogen
## Nitrogen Nit_by_20 Yield Nitrogen2
## 1 0 0 24.5 0
## 2 20 1 41.5 400
## 3 40 2 52.1 1600
## 4 60 3 61.4 3600
## 5 80 4 73.5 6400
## 6 120 6 86.0 14400
## 7 160 8 92.8 25600
## 8 180 9 94.0 32400
quad_fit = lm(Yield ~ Nitrogen + Nitrogen2, data = Nitrogen)
summary(quad_fit)
##
## Call:
## lm(formula = Yield ~ Nitrogen + Nitrogen2, data = Nitrogen)
##
## Residuals:
## 1 2 3 4 5 6 7 8
## -0.89532 1.84744 -0.14856 -1.78330 1.04322 -0.01998 -0.13815 0.09464
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.3953155 1.0506461 24.17 2.26e-06 ***
## Nitrogen 0.7543938 0.0293719 25.68 1.67e-06 ***
## Nitrogen2 -0.0020766 0.0001548 -13.42 4.12e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.306 on 5 degrees of freedom
## Multiple R-squared: 0.9981, Adjusted R-squared: 0.9973
## F-statistic: 1314 on 2 and 5 DF, p-value: 1.571e-07
The model is:
yield_hat = 25.3953 + 0.7544 * Nitrogen - -0.0020 * Nitrogen^2
statement?????
residual_quad = resid(quad_fit)
plot(Nitrogen$Yield, residual_quad, main = "residuals vs x")
plot(quad_fit, which = 1:2)
From the residuals vs x plot, no pattern, heteroscedasticity??
From the residuals vs fitted value plot, we can tell residuals have no pattern. heteroscedasticity?
(How to tell if it's independent?)
From the QQ plot,
comment????
xvalues = seq(0, 200, 1)
newNitrogen = data.frame(Nitrogen=xvalues, Nitrogen2=xvalues^2)
head(newNitrogen)
## Nitrogen Nitrogen2
## 1 0 0
## 2 1 1
## 3 2 4
## 4 3 9
## 5 4 16
## 6 5 25
predYields = predict(quad_fit, newNitrogen)
plot(Nitrogen$Nitrogen, Nitrogen$Yield)
lines(xvalues, predYields)
comments??
library("car")
## Warning: package 'car' was built under R version 3.2.5
vif(quad_fit)
## Nitrogen Nitrogen2
## 15.13909 15.13909
VIF = 15.13909, greater than 10, so we really concern about this and don't consider it as a good fit. We decide to center x.
xbar = mean(Nitrogen$Nitrogen)
xbar
## [1] 82.5
Nitrogen$NitrogenC = Nitrogen$Nitrogen - xbar
Nitrogen$NitrogenC2 = Nitrogen$NitrogenC^2
Nitrogen
## Nitrogen Nit_by_20 Yield Nitrogen2 NitrogenC NitrogenC2
## 1 0 0 24.5 0 -82.5 6806.25
## 2 20 1 41.5 400 -62.5 3906.25
## 3 40 2 52.1 1600 -42.5 1806.25
## 4 60 3 61.4 3600 -22.5 506.25
## 5 80 4 73.5 6400 -2.5 6.25
## 6 120 6 86.0 14400 37.5 1406.25
## 7 160 8 92.8 25600 77.5 6006.25
## 8 180 9 94.0 32400 97.5 9506.25
quad_fit_c = lm(Nitrogen$Yield ~ Nitrogen$NitrogenC + Nitrogen$NitrogenC2)
summary(quad_fit_c)
##
## Call:
## lm(formula = Nitrogen$Yield ~ Nitrogen$NitrogenC + Nitrogen$NitrogenC2)
##
## Residuals:
## 1 2 3 4 5 6 7 8
## -0.89532 1.84744 -0.14856 -1.78330 1.04322 -0.01998 -0.13815 0.09464
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 73.4991561 0.7409837 99.19 1.97e-09 ***
## Nitrogen$NitrogenC 0.4117599 0.0080685 51.03 5.46e-08 ***
## Nitrogen$NitrogenC2 -0.0020766 0.0001548 -13.42 4.12e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.306 on 5 degrees of freedom
## Multiple R-squared: 0.9981, Adjusted R-squared: 0.9973
## F-statistic: 1314 on 2 and 5 DF, p-value: 1.571e-07
vif(quad_fit_c)
## Nitrogen$NitrogenC Nitrogen$NitrogenC2
## 1.142397 1.142397
other reasons??? Which model??? ``` e. Run the model again with centered X-values. Write down the model and include a VIF diagnostic.
quad_fit_c = lm(Nitrogen$Yield ~ Nitrogen$NitrogenC + Nitrogen$NitrogenC2)
summary(quad_fit_c)
##
## Call:
## lm(formula = Nitrogen$Yield ~ Nitrogen$NitrogenC + Nitrogen$NitrogenC2)
##
## Residuals:
## 1 2 3 4 5 6 7 8
## -0.89532 1.84744 -0.14856 -1.78330 1.04322 -0.01998 -0.13815 0.09464
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 73.4991561 0.7409837 99.19 1.97e-09 ***
## Nitrogen$NitrogenC 0.4117599 0.0080685 51.03 5.46e-08 ***
## Nitrogen$NitrogenC2 -0.0020766 0.0001548 -13.42 4.12e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.306 on 5 degrees of freedom
## Multiple R-squared: 0.9981, Adjusted R-squared: 0.9973
## F-statistic: 1314 on 2 and 5 DF, p-value: 1.571e-07
vif(quad_fit_c)
## Nitrogen$NitrogenC Nitrogen$NitrogenC2
## 1.142397 1.142397
The centered quadratic model is:
yield_hat = 73.4992 + 0.4118 * (Nitrogen - 82.5) - 0.0021 * (Nitrogen - 82.5)^2
= 25.2326 + 0.7583 * Nitrogen - 0.0021 * Nitrogen
The VIF value is 1.142397.
Q1: slightly different????
Q2: ill-conditioning or just correlation r^2?
Using a model with centered values doesn't change the model itself but makes the matrix more stable. In centered values, there are not always positive relation between x and x^2, so it reduces the correlation and furthermore reduces the VIF.
4. Create a cubic model
a. First create the model using and un-centered X-variable. Include summary output, the written regression function, diagnostics and a VIF analysis.
Nitrogen$Nitrogen3 = Nitrogen$Nitrogen^3
Nitrogen
## Nitrogen Nit_by_20 Yield Nitrogen2 NitrogenC NitrogenC2 Nitrogen3
## 1 0 0 24.5 0 -82.5 6806.25 0
## 2 20 1 41.5 400 -62.5 3906.25 8000
## 3 40 2 52.1 1600 -42.5 1806.25 64000
## 4 60 3 61.4 3600 -22.5 506.25 216000
## 5 80 4 73.5 6400 -2.5 6.25 512000
## 6 120 6 86.0 14400 37.5 1406.25 1728000
## 7 160 8 92.8 25600 77.5 6006.25 4096000
## 8 180 9 94.0 32400 97.5 9506.25 5832000
cubic_fit = lm(Yield ~ Nitrogen + Nitrogen2+Nitrogen3, data = Nitrogen)
#cubic_fit = lm(Nitrogen$Yield ~ Nitrogen$Nitrogen + Nitrogen$Nitrogen2 + Nitrogen$Nitrogen3)
summary(cubic_fit)
##
## Call:
## lm(formula = Yield ~ Nitrogen + Nitrogen2 + Nitrogen3, data = Nitrogen)
##
## Residuals:
## 1 2 3 4 5 6 7 8
## -0.77751 1.79910 -0.25047 -1.85943 1.03904 0.12543 -0.05688 -0.01929
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.528e+01 1.330e+00 19.003 4.52e-05 ***
## Nitrogen 7.661e-01 7.078e-02 10.823 0.000414 ***
## Nitrogen2 -2.259e-03 9.948e-04 -2.271 0.085673 .
## Nitrogen3 6.916e-07 3.719e-06 0.186 0.861523
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.454 on 4 degrees of freedom
## Multiple R-squared: 0.9981, Adjusted R-squared: 0.9967
## F-statistic: 707 on 3 and 4 DF, p-value: 6.639e-06
anova(quad_fit,cubic_fit, test="F")
## Analysis of Variance Table
##
## Model 1: Yield ~ Nitrogen + Nitrogen2
## Model 2: Yield ~ Nitrogen + Nitrogen2 + Nitrogen3
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 5 8.5336
## 2 4 8.4604 1 0.073146 0.0346 0.8615
The cubic model is:
yield_hat = 25.28 + 0.7661 * Nitrogen - 0.002259 * Nitrogen^2 + 0.0000006916 * Nitrogen^3
residual_cubic = resid(cubic_fit)
plot(Nitrogen$Nitrogen, residual_cubic, main = "residuals vs x")
plot(cubic_fit, which = 1:2)
comments? QQ plot??
vif(cubic_fit)
## Nitrogen Nitrogen2 Nitrogen3
## 70.94614 504.70897 226.89763
vif not equal??
VIFs are larger than 10 which is bad, this may be the sign of ill-conditioning. ???
Nitrogen$NitrogenC3 = Nitrogen$NitrogenC^3
Nitrogen
## Nitrogen Nit_by_20 Yield Nitrogen2 NitrogenC NitrogenC2 Nitrogen3
## 1 0 0 24.5 0 -82.5 6806.25 0
## 2 20 1 41.5 400 -62.5 3906.25 8000
## 3 40 2 52.1 1600 -42.5 1806.25 64000
## 4 60 3 61.4 3600 -22.5 506.25 216000
## 5 80 4 73.5 6400 -2.5 6.25 512000
## 6 120 6 86.0 14400 37.5 1406.25 1728000
## 7 160 8 92.8 25600 77.5 6006.25 4096000
## 8 180 9 94.0 32400 97.5 9506.25 5832000
## NitrogenC3
## 1 -561515.625
## 2 -244140.625
## 3 -76765.625
## 4 -11390.625
## 5 -15.625
## 6 52734.375
## 7 465484.375
## 8 926859.375
cubic_fit_c = lm(Nitrogen$Yield ~ Nitrogen$NitrogenC + Nitrogen$NitrogenC2 + Nitrogen$NitrogenC3)
summary(cubic_fit_c)
##
## Call:
## lm(formula = Nitrogen$Yield ~ Nitrogen$NitrogenC + Nitrogen$NitrogenC2 +
## Nitrogen$NitrogenC3)
##
## Residuals:
## 1 2 3 4 5 6 7 8
## -0.77751 1.79910 -0.25047 -1.85943 1.03904 0.12543 -0.05688 -0.01929
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.349e+01 8.256e-01 89.017 9.55e-08 ***
## Nitrogen$NitrogenC 4.075e-01 2.464e-02 16.540 7.83e-05 ***
## Nitrogen$NitrogenC2 -2.088e-03 1.822e-04 -11.458 0.000331 ***
## Nitrogen$NitrogenC3 6.916e-07 3.719e-06 0.186 0.861523
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.454 on 4 degrees of freedom
## Multiple R-squared: 0.9981, Adjusted R-squared: 0.9967
## F-statistic: 707 on 3 and 4 DF, p-value: 6.639e-06
The centered cubic model is:
yield_hat = 73.49 + 0.4075 * (Nitrogen - 82.5) - 0.002088 * (Nitrogen - 82.5)^2 - 0.0000006916 * (Nitrogen - 82.5)^3
?????? coeff should be the same???
residual_cubic_c = resid(cubic_fit_c)
plot(Nitrogen$Nitrogen, residual_cubic_c)
plot(cubic_fit_c, which = 1:2)
comments???
QQ plot always the same?
vif(cubic_fit_c)
## Nitrogen$NitrogenC Nitrogen$NitrogenC2 Nitrogen$NitrogenC3
## 8.594830 1.277552 9.296246
Some of the VIF values are very close to 10, that means this model is not quite stable.???
Did centering the polynomial have an appreciable effect on VIF?
Yes, it did. It changed VIF from 70+, 500+, 200+ to VIF less than 10. ?
5. Re-run the quadratic model from Number 3 using the smaller ‘units of 20 pounds’ variable as X.Write the equation of the quadratic model. Include diagnostics as you wish.
Nitrogen$Nit_by_20_squared = Nitrogen$Nit_by_20^2
quad_fit_20 = lm(Nitrogen$Yield ~ Nitrogen$Nit_by_20 + Nitrogen$Nit_by_20_squared)
summary(quad_fit_20)
##
## Call:
## lm(formula = Nitrogen$Yield ~ Nitrogen$Nit_by_20 + Nitrogen$Nit_by_20_squared)
##
## Residuals:
## 1 2 3 4 5 6 7 8
## -0.89532 1.84744 -0.14856 -1.78330 1.04322 -0.01998 -0.13815 0.09464
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.39532 1.05065 24.17 2.26e-06 ***
## Nitrogen$Nit_by_20 15.08788 0.58744 25.68 1.67e-06 ***
## Nitrogen$Nit_by_20_squared -0.83063 0.06191 -13.42 4.12e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.306 on 5 degrees of freedom
## Multiple R-squared: 0.9981, Adjusted R-squared: 0.9973
## F-statistic: 1314 on 2 and 5 DF, p-value: 1.571e-07
The model is:
yield_hat = 25.3953 + 15.0879 * Nitrogen - 0.8306 * Nitrogen^2
Nitrogen here is in units of 20 pounds per acre.
residual_20 = resid(quad_fit_20)
plot(Nitrogen$Nitrogen, residual_20, main = "residuals vs x")
plot(quad_fit_20, which = 1:2)
residuals no pattern, H
QQ plot????
vif(quad_fit_20)
## Nitrogen$Nit_by_20 Nitrogen$Nit_by_20_squared
## 15.13909 15.13909
xbar_20 = mean(Nitrogen$Nit_by_20)
Nitrogen$Nit_by_20_C = Nitrogen$Nit_by_20 - xbar_20
Nitrogen$Nit_by_20_C2 =Nitrogen$Nit_by_20_C^2
Nitrogen
## Nitrogen Nit_by_20 Yield Nitrogen2 NitrogenC NitrogenC2 Nitrogen3
## 1 0 0 24.5 0 -82.5 6806.25 0
## 2 20 1 41.5 400 -62.5 3906.25 8000
## 3 40 2 52.1 1600 -42.5 1806.25 64000
## 4 60 3 61.4 3600 -22.5 506.25 216000
## 5 80 4 73.5 6400 -2.5 6.25 512000
## 6 120 6 86.0 14400 37.5 1406.25 1728000
## 7 160 8 92.8 25600 77.5 6006.25 4096000
## 8 180 9 94.0 32400 97.5 9506.25 5832000
## NitrogenC3 Nit_by_20_squared Nit_by_20_C Nit_by_20_C2
## 1 -561515.625 0 -4.125 17.015625
## 2 -244140.625 1 -3.125 9.765625
## 3 -76765.625 4 -2.125 4.515625
## 4 -11390.625 9 -1.125 1.265625
## 5 -15.625 16 -0.125 0.015625
## 6 52734.375 36 1.875 3.515625
## 7 465484.375 64 3.875 15.015625
## 8 926859.375 81 4.875 23.765625
quad_fit_20_C = lm(Nitrogen$Yield ~ Nitrogen$Nit_by_20_C + Nitrogen$Nit_by_20_C2)
summary(quad_fit_20_C)
##
## Call:
## lm(formula = Nitrogen$Yield ~ Nitrogen$Nit_by_20_C + Nitrogen$Nit_by_20_C2)
##
## Residuals:
## 1 2 3 4 5 6 7 8
## -0.89532 1.84744 -0.14856 -1.78330 1.04322 -0.01998 -0.13815 0.09464
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 73.49916 0.74098 99.19 1.97e-09 ***
## Nitrogen$Nit_by_20_C 8.23520 0.16137 51.03 5.46e-08 ***
## Nitrogen$Nit_by_20_C2 -0.83063 0.06191 -13.42 4.12e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.306 on 5 degrees of freedom
## Multiple R-squared: 0.9981, Adjusted R-squared: 0.9973
## F-statistic: 1314 on 2 and 5 DF, p-value: 1.571e-07
vif(quad_fit_20_C)
## Nitrogen$Nit_by_20_C Nitrogen$Nit_by_20_C2
## 1.142397 1.142397
The new VIF for centered quadratic model with Nitrogen in units of 20 pounds per acre is 1.142397.
VIF_20_quad = 15.13909 VIF_20_quad_C = 1.142397
VIF_quad = VIF_20_quad VIF_quadC = VIF_20_quad_C
I dont’t agree, the VIFs for two quadratic models are actually the same, and the VIFs for two centered quadratic models are also the same.
Why????????? ```