In this assignment, you will create your own RMarkdown document and your own code. This code should be closely based on code we have recently used in class, but don’t hesitate to use the internet and others sources as well. All code and answers should be in an RMarkdown document.

1. Load the small dataset Fertilizer Experiment.csv and print it out. There are three columns:

Nitrogen - This shows the pounds per acre of nitrogen applied in an experiment to judge the effect of fertilizer on crop yields.

Nit_by_20 - Shis shows the same data as Column 1 but in units of 20 pounds per acre.

Yield - My source just says ‘bushels’ but I suspect the true unit is bushels per acre.

Nitrogen = read.csv("Fertilizer Experiment.csv")
Nitrogen
##   Nitrogen Nit_by_20 Yield
## 1        0         0  24.5
## 2       20         1  41.5
## 3       40         2  52.1
## 4       60         3  61.4
## 5       80         4  73.5
## 6      120         6  86.0
## 7      160         8  92.8
## 8      180         9  94.0

2. Create a linear model to predict crop yield as a function of amount of fertilizer applied. Include …

a. Exploratory plots and relevant comments.

plot(Nitrogen$Nitrogen, Nitrogen$Yield)

postive relationship between x and y, maybe linear maybe quadratic.



b. The linear model, including summary output, a statement of the model, diagnostic plots and your comments.

linear_fit = lm(Nitrogen$Yield ~ Nitrogen$Nitrogen)
summary(linear_fit)
## 
## Call:
## lm(formula = Nitrogen$Yield ~ Nitrogen$Nitrogen)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.4080  -3.4420   0.6858   4.6265   8.7088 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       34.90801    4.30574   8.107 0.000189 ***
## Nitrogen$Nitrogen  0.37354    0.04192   8.911 0.000111 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.255 on 6 degrees of freedom
## Multiple R-squared:  0.9297, Adjusted R-squared:  0.918 
## F-statistic:  79.4 on 1 and 6 DF,  p-value: 0.0001114
The model is:
yield_hat = 34.9080 + 0.37354 * Nitrogen
If there is no nitrogen applied, then yield will be 34.91 bushels per acre.
If the nitrogen applied per acre increases by 1, the yield will increase by 0.37 bushels per acre.

The p-value of beta1 is 0.000111<0.05 and is significant.
residual = resid(linear_fit)
plot(Nitrogen$Nitrogen, residual, main = "residuals vs x")

plot(linear_fit, which = 1:2)

From the residuals vs x plot, we can tell the residuals of this linear model do have pattern, it seems very likely to have quadratic relation with x, which is not a good thing.

From the residuals vs fitted value plot, we can also tell that the residuals have pattern and the red line is very close to a quadratic line.

From the QQ plot, we can tell it's hard to say it's close to a line since 3 of 5 points are lower than the y=x line. So we may not be able to say the residuals are roughly normally distributed.
  1. Include a plot of the regression line with the data. Comments.
plot(Nitrogen$Nitrogen, Nitrogen$Yield)
abline(linear_fit)

comment???

3. Create a quadratic model to predict crop yield as a function of amount of fertilizer applied. Include …

a. The quadratic model, including summary output, a statement of the model, diagnostic plots and your comments.

Nitrogen$Nitrogen2 = Nitrogen$Nitrogen^2
Nitrogen
##   Nitrogen Nit_by_20 Yield Nitrogen2
## 1        0         0  24.5         0
## 2       20         1  41.5       400
## 3       40         2  52.1      1600
## 4       60         3  61.4      3600
## 5       80         4  73.5      6400
## 6      120         6  86.0     14400
## 7      160         8  92.8     25600
## 8      180         9  94.0     32400
quad_fit = lm(Yield ~ Nitrogen + Nitrogen2, data = Nitrogen)
summary(quad_fit)
## 
## Call:
## lm(formula = Yield ~ Nitrogen + Nitrogen2, data = Nitrogen)
## 
## Residuals:
##        1        2        3        4        5        6        7        8 
## -0.89532  1.84744 -0.14856 -1.78330  1.04322 -0.01998 -0.13815  0.09464 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 25.3953155  1.0506461   24.17 2.26e-06 ***
## Nitrogen     0.7543938  0.0293719   25.68 1.67e-06 ***
## Nitrogen2   -0.0020766  0.0001548  -13.42 4.12e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.306 on 5 degrees of freedom
## Multiple R-squared:  0.9981, Adjusted R-squared:  0.9973 
## F-statistic:  1314 on 2 and 5 DF,  p-value: 1.571e-07
The model is:
yield_hat = 25.3953 + 0.7544 * Nitrogen - -0.0020 * Nitrogen^2
statement?????
residual_quad = resid(quad_fit)
plot(Nitrogen$Yield, residual_quad, main = "residuals vs x")

plot(quad_fit, which = 1:2)

From the residuals vs x plot, no pattern, heteroscedasticity??
From the residuals vs fitted value plot, we can tell residuals have no pattern. heteroscedasticity?
(How to tell if it's independent?)
From the QQ plot,
comment????
  1. Include a plot of the regression line with the data. Comments.
xvalues = seq(0, 200, 1)
newNitrogen = data.frame(Nitrogen=xvalues, Nitrogen2=xvalues^2)
head(newNitrogen)
##   Nitrogen Nitrogen2
## 1        0         0
## 2        1         1
## 3        2         4
## 4        3         9
## 5        4        16
## 6        5        25
predYields = predict(quad_fit, newNitrogen)

plot(Nitrogen$Nitrogen, Nitrogen$Yield)
lines(xvalues, predYields)

comments??
  1. Include a VIF diagnostic. If necessary, center the X-values and re-run the regression.
library("car")
## Warning: package 'car' was built under R version 3.2.5
vif(quad_fit)
##  Nitrogen Nitrogen2 
##  15.13909  15.13909
VIF = 15.13909, greater than 10, so we really concern about this and don't consider it as a good fit. We decide to center x.
xbar = mean(Nitrogen$Nitrogen)
xbar
## [1] 82.5
Nitrogen$NitrogenC = Nitrogen$Nitrogen - xbar
Nitrogen$NitrogenC2 = Nitrogen$NitrogenC^2
Nitrogen
##   Nitrogen Nit_by_20 Yield Nitrogen2 NitrogenC NitrogenC2
## 1        0         0  24.5         0     -82.5    6806.25
## 2       20         1  41.5       400     -62.5    3906.25
## 3       40         2  52.1      1600     -42.5    1806.25
## 4       60         3  61.4      3600     -22.5     506.25
## 5       80         4  73.5      6400      -2.5       6.25
## 6      120         6  86.0     14400      37.5    1406.25
## 7      160         8  92.8     25600      77.5    6006.25
## 8      180         9  94.0     32400      97.5    9506.25
quad_fit_c = lm(Nitrogen$Yield ~ Nitrogen$NitrogenC + Nitrogen$NitrogenC2)
summary(quad_fit_c)
## 
## Call:
## lm(formula = Nitrogen$Yield ~ Nitrogen$NitrogenC + Nitrogen$NitrogenC2)
## 
## Residuals:
##        1        2        3        4        5        6        7        8 
## -0.89532  1.84744 -0.14856 -1.78330  1.04322 -0.01998 -0.13815  0.09464 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         73.4991561  0.7409837   99.19 1.97e-09 ***
## Nitrogen$NitrogenC   0.4117599  0.0080685   51.03 5.46e-08 ***
## Nitrogen$NitrogenC2 -0.0020766  0.0001548  -13.42 4.12e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.306 on 5 degrees of freedom
## Multiple R-squared:  0.9981, Adjusted R-squared:  0.9973 
## F-statistic:  1314 on 2 and 5 DF,  p-value: 1.571e-07
vif(quad_fit_c)
##  Nitrogen$NitrogenC Nitrogen$NitrogenC2 
##            1.142397            1.142397
  1. State clearly why you think the model is or is not a good fit. ``` The VIF is 1.142397 < 4, so we consider this centered quadratic model a good fit.

other reasons??? Which model??? ``` e. Run the model again with centered X-values. Write down the model and include a VIF diagnostic.

quad_fit_c = lm(Nitrogen$Yield ~ Nitrogen$NitrogenC + Nitrogen$NitrogenC2)
summary(quad_fit_c)
## 
## Call:
## lm(formula = Nitrogen$Yield ~ Nitrogen$NitrogenC + Nitrogen$NitrogenC2)
## 
## Residuals:
##        1        2        3        4        5        6        7        8 
## -0.89532  1.84744 -0.14856 -1.78330  1.04322 -0.01998 -0.13815  0.09464 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         73.4991561  0.7409837   99.19 1.97e-09 ***
## Nitrogen$NitrogenC   0.4117599  0.0080685   51.03 5.46e-08 ***
## Nitrogen$NitrogenC2 -0.0020766  0.0001548  -13.42 4.12e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.306 on 5 degrees of freedom
## Multiple R-squared:  0.9981, Adjusted R-squared:  0.9973 
## F-statistic:  1314 on 2 and 5 DF,  p-value: 1.571e-07
vif(quad_fit_c)
##  Nitrogen$NitrogenC Nitrogen$NitrogenC2 
##            1.142397            1.142397
The centered quadratic model is:
yield_hat = 73.4992 + 0.4118 * (Nitrogen - 82.5) - 0.0021 * (Nitrogen - 82.5)^2
          = 25.2326 + 0.7583 * Nitrogen - 0.0021 * Nitrogen
          
The VIF value is 1.142397.
  1. The models produced by Parts (a) and (e) are algebraically equivalent. Why might it be better nonetheless to use a model with centered values?
Q1: slightly different????
Q2: ill-conditioning or just correlation r^2?

Using a model with centered values doesn't change the model itself but makes the matrix more stable. In centered values, there are not always positive relation between x and x^2, so it reduces the correlation and furthermore reduces the VIF. 



4. Create a cubic model

a. First create the model using and un-centered X-variable. Include summary output, the written regression function, diagnostics and a VIF analysis.

Nitrogen$Nitrogen3 = Nitrogen$Nitrogen^3
Nitrogen
##   Nitrogen Nit_by_20 Yield Nitrogen2 NitrogenC NitrogenC2 Nitrogen3
## 1        0         0  24.5         0     -82.5    6806.25         0
## 2       20         1  41.5       400     -62.5    3906.25      8000
## 3       40         2  52.1      1600     -42.5    1806.25     64000
## 4       60         3  61.4      3600     -22.5     506.25    216000
## 5       80         4  73.5      6400      -2.5       6.25    512000
## 6      120         6  86.0     14400      37.5    1406.25   1728000
## 7      160         8  92.8     25600      77.5    6006.25   4096000
## 8      180         9  94.0     32400      97.5    9506.25   5832000
cubic_fit = lm(Yield ~ Nitrogen + Nitrogen2+Nitrogen3, data = Nitrogen)
#cubic_fit = lm(Nitrogen$Yield ~ Nitrogen$Nitrogen + Nitrogen$Nitrogen2 + Nitrogen$Nitrogen3)
summary(cubic_fit)
## 
## Call:
## lm(formula = Yield ~ Nitrogen + Nitrogen2 + Nitrogen3, data = Nitrogen)
## 
## Residuals:
##        1        2        3        4        5        6        7        8 
## -0.77751  1.79910 -0.25047 -1.85943  1.03904  0.12543 -0.05688 -0.01929 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.528e+01  1.330e+00  19.003 4.52e-05 ***
## Nitrogen     7.661e-01  7.078e-02  10.823 0.000414 ***
## Nitrogen2   -2.259e-03  9.948e-04  -2.271 0.085673 .  
## Nitrogen3    6.916e-07  3.719e-06   0.186 0.861523    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.454 on 4 degrees of freedom
## Multiple R-squared:  0.9981, Adjusted R-squared:  0.9967 
## F-statistic:   707 on 3 and 4 DF,  p-value: 6.639e-06
anova(quad_fit,cubic_fit, test="F")
## Analysis of Variance Table
## 
## Model 1: Yield ~ Nitrogen + Nitrogen2
## Model 2: Yield ~ Nitrogen + Nitrogen2 + Nitrogen3
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1      5 8.5336                           
## 2      4 8.4604  1  0.073146 0.0346 0.8615
The cubic model is:
yield_hat = 25.28 + 0.7661 * Nitrogen - 0.002259 * Nitrogen^2 + 0.0000006916 * Nitrogen^3  
residual_cubic = resid(cubic_fit)
plot(Nitrogen$Nitrogen, residual_cubic, main = "residuals vs x")

plot(cubic_fit, which = 1:2)

comments? QQ plot??
vif(cubic_fit)
##  Nitrogen Nitrogen2 Nitrogen3 
##  70.94614 504.70897 226.89763
vif not equal??
VIFs are larger than 10 which is bad, this may be the sign of ill-conditioning. ???
  1. Repeat Part (a) using a centered polynomial.
Nitrogen$NitrogenC3 = Nitrogen$NitrogenC^3
Nitrogen
##   Nitrogen Nit_by_20 Yield Nitrogen2 NitrogenC NitrogenC2 Nitrogen3
## 1        0         0  24.5         0     -82.5    6806.25         0
## 2       20         1  41.5       400     -62.5    3906.25      8000
## 3       40         2  52.1      1600     -42.5    1806.25     64000
## 4       60         3  61.4      3600     -22.5     506.25    216000
## 5       80         4  73.5      6400      -2.5       6.25    512000
## 6      120         6  86.0     14400      37.5    1406.25   1728000
## 7      160         8  92.8     25600      77.5    6006.25   4096000
## 8      180         9  94.0     32400      97.5    9506.25   5832000
##    NitrogenC3
## 1 -561515.625
## 2 -244140.625
## 3  -76765.625
## 4  -11390.625
## 5     -15.625
## 6   52734.375
## 7  465484.375
## 8  926859.375
cubic_fit_c = lm(Nitrogen$Yield ~ Nitrogen$NitrogenC + Nitrogen$NitrogenC2 + Nitrogen$NitrogenC3)
summary(cubic_fit_c)
## 
## Call:
## lm(formula = Nitrogen$Yield ~ Nitrogen$NitrogenC + Nitrogen$NitrogenC2 + 
##     Nitrogen$NitrogenC3)
## 
## Residuals:
##        1        2        3        4        5        6        7        8 
## -0.77751  1.79910 -0.25047 -1.85943  1.03904  0.12543 -0.05688 -0.01929 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          7.349e+01  8.256e-01  89.017 9.55e-08 ***
## Nitrogen$NitrogenC   4.075e-01  2.464e-02  16.540 7.83e-05 ***
## Nitrogen$NitrogenC2 -2.088e-03  1.822e-04 -11.458 0.000331 ***
## Nitrogen$NitrogenC3  6.916e-07  3.719e-06   0.186 0.861523    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.454 on 4 degrees of freedom
## Multiple R-squared:  0.9981, Adjusted R-squared:  0.9967 
## F-statistic:   707 on 3 and 4 DF,  p-value: 6.639e-06
The centered cubic model is:
yield_hat = 73.49 + 0.4075 * (Nitrogen - 82.5) - 0.002088 * (Nitrogen - 82.5)^2 - 0.0000006916 * (Nitrogen - 82.5)^3

?????? coeff should be the same???
residual_cubic_c = resid(cubic_fit_c)
plot(Nitrogen$Nitrogen, residual_cubic_c)

plot(cubic_fit_c, which = 1:2)

comments???
QQ plot always the same?
vif(cubic_fit_c)
##  Nitrogen$NitrogenC Nitrogen$NitrogenC2 Nitrogen$NitrogenC3 
##            8.594830            1.277552            9.296246
Some of the VIF values are very close to 10, that means this model is not quite stable.???
  1. Did centering the polynomial have an appreciable effect on VIF?

    Yes, it did. It changed VIF from 70+, 500+, 200+ to VIF less than 10. ?
    5. Re-run the quadratic model from Number 3 using the smaller ‘units of 20 pounds’ variable as X.
  2. Write the equation of the quadratic model. Include diagnostics as you wish.

Nitrogen$Nit_by_20_squared = Nitrogen$Nit_by_20^2
quad_fit_20 = lm(Nitrogen$Yield ~ Nitrogen$Nit_by_20 + Nitrogen$Nit_by_20_squared)
summary(quad_fit_20)
## 
## Call:
## lm(formula = Nitrogen$Yield ~ Nitrogen$Nit_by_20 + Nitrogen$Nit_by_20_squared)
## 
## Residuals:
##        1        2        3        4        5        6        7        8 
## -0.89532  1.84744 -0.14856 -1.78330  1.04322 -0.01998 -0.13815  0.09464 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                25.39532    1.05065   24.17 2.26e-06 ***
## Nitrogen$Nit_by_20         15.08788    0.58744   25.68 1.67e-06 ***
## Nitrogen$Nit_by_20_squared -0.83063    0.06191  -13.42 4.12e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.306 on 5 degrees of freedom
## Multiple R-squared:  0.9981, Adjusted R-squared:  0.9973 
## F-statistic:  1314 on 2 and 5 DF,  p-value: 1.571e-07
The model is:
yield_hat = 25.3953 + 15.0879 * Nitrogen - 0.8306 * Nitrogen^2
Nitrogen here is in units of 20 pounds per acre.
residual_20 = resid(quad_fit_20)
plot(Nitrogen$Nitrogen, residual_20, main = "residuals vs x")

plot(quad_fit_20, which = 1:2)

residuals no pattern, H
QQ plot????
  1. Compare the VIF of this model to the VIF obtained using the larger values for X (uncentered) in Part 2(c).
vif(quad_fit_20)
##         Nitrogen$Nit_by_20 Nitrogen$Nit_by_20_squared 
##                   15.13909                   15.13909
  1. Many statisticians say that centering is most important when the x-values are large (far from zero). Use centered values to run the quadratic model one more time and note the new VIF.
xbar_20 = mean(Nitrogen$Nit_by_20)
Nitrogen$Nit_by_20_C = Nitrogen$Nit_by_20 - xbar_20
Nitrogen$Nit_by_20_C2 =Nitrogen$Nit_by_20_C^2
Nitrogen
##   Nitrogen Nit_by_20 Yield Nitrogen2 NitrogenC NitrogenC2 Nitrogen3
## 1        0         0  24.5         0     -82.5    6806.25         0
## 2       20         1  41.5       400     -62.5    3906.25      8000
## 3       40         2  52.1      1600     -42.5    1806.25     64000
## 4       60         3  61.4      3600     -22.5     506.25    216000
## 5       80         4  73.5      6400      -2.5       6.25    512000
## 6      120         6  86.0     14400      37.5    1406.25   1728000
## 7      160         8  92.8     25600      77.5    6006.25   4096000
## 8      180         9  94.0     32400      97.5    9506.25   5832000
##    NitrogenC3 Nit_by_20_squared Nit_by_20_C Nit_by_20_C2
## 1 -561515.625                 0      -4.125    17.015625
## 2 -244140.625                 1      -3.125     9.765625
## 3  -76765.625                 4      -2.125     4.515625
## 4  -11390.625                 9      -1.125     1.265625
## 5     -15.625                16      -0.125     0.015625
## 6   52734.375                36       1.875     3.515625
## 7  465484.375                64       3.875    15.015625
## 8  926859.375                81       4.875    23.765625
quad_fit_20_C = lm(Nitrogen$Yield ~ Nitrogen$Nit_by_20_C + Nitrogen$Nit_by_20_C2)
summary(quad_fit_20_C)
## 
## Call:
## lm(formula = Nitrogen$Yield ~ Nitrogen$Nit_by_20_C + Nitrogen$Nit_by_20_C2)
## 
## Residuals:
##        1        2        3        4        5        6        7        8 
## -0.89532  1.84744 -0.14856 -1.78330  1.04322 -0.01998 -0.13815  0.09464 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           73.49916    0.74098   99.19 1.97e-09 ***
## Nitrogen$Nit_by_20_C   8.23520    0.16137   51.03 5.46e-08 ***
## Nitrogen$Nit_by_20_C2 -0.83063    0.06191  -13.42 4.12e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.306 on 5 degrees of freedom
## Multiple R-squared:  0.9981, Adjusted R-squared:  0.9973 
## F-statistic:  1314 on 2 and 5 DF,  p-value: 1.571e-07
vif(quad_fit_20_C)
##  Nitrogen$Nit_by_20_C Nitrogen$Nit_by_20_C2 
##              1.142397              1.142397
The new VIF for centered quadratic model with Nitrogen in units of 20 pounds per acre is 1.142397.
  1. Using the four VIFs that you have calculated for different versions of the quadratic model, comment on what ‘many statisticians say’. Do you agree? Why or why not? ``` VIF_quad = 15.13909 VIF_quadC = 1.142397

VIF_20_quad = 15.13909 VIF_20_quad_C = 1.142397

VIF_quad = VIF_20_quad VIF_quadC = VIF_20_quad_C

I dont’t agree, the VIFs for two quadratic models are actually the same, and the VIFs for two centered quadratic models are also the same.

Why????????? ```