Question 1
corn_yield = read.table("http://www1.aucegypt.edu/faculty/hadi/RABE5/Data5/P158.txt",
header = T, sep = "\t")
corn_yield$Fertilizer = factor(corn_yield$Fertilizer)
head(corn_yield)
## Yield Fertilizer
## 1 31 1
## 2 34 1
## 3 34 1
## 4 34 1
## 5 43 1
## 6 35 1
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
corn_yield %>% group_by(Fertilizer) %>% summarise(n = n())
## # A tibble: 4 × 2
## Fertilizer n
## <fct> <int>
## 1 1 10
## 2 2 10
## 3 3 10
## 4 4 10
(Mean_Yield = corn_yield %>% group_by(Fertilizer) %>% summarise(Mean.Yield = mean(Yield)))
## # A tibble: 4 × 2
## Fertilizer Mean.Yield
## <fct> <dbl>
## 1 1 36.6
## 2 2 29.9
## 3 3 34.9
## 4 4 29.8
Part 3
corn_yield$Fertilizer = factor(corn_yield$Fertilizer)
corn.lm <- lm(data = corn_yield, Yield ~ Fertilizer)
summary(corn.lm)
##
## Call:
## lm(formula = Yield ~ Fertilizer, data = corn_yield)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.800 -2.825 -0.600 3.125 10.200
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.600 1.533 23.878 < 2e-16 ***
## Fertilizer2 -6.700 2.168 -3.091 0.00384 **
## Fertilizer3 -1.700 2.168 -0.784 0.43803
## Fertilizer4 -6.800 2.168 -3.137 0.00340 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.847 on 36 degrees of freedom
## Multiple R-squared: 0.3001, Adjusted R-squared: 0.2417
## F-statistic: 5.144 on 3 and 36 DF, p-value: 0.004605
Part 4 Null Hypothesis: The model is best with no predictors Alt hypothesis: The model is best with at least one predictor Test: Anova f test Conclusion. At the 0.05 significane level, we reject the null hypothesis that the reduced model with no predictors is best.
corn.reduced <- lm(data = corn_yield, Yield ~ 1)
anova_corn <- anova(corn.reduced, corn.lm)
anova_corn
## Analysis of Variance Table
##
## Model 1: Yield ~ 1
## Model 2: Yield ~ Fertilizer
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 39 1208.4
## 2 36 845.8 3 362.6 5.1445 0.004605 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Part 5: Fertilizer 1 because it has the lowest p-value.
Question 2 Part 1: The model seems to follow the assumption of constant variance. There doesn’t seem to be a pattern in the residuals vs fitted plot.
plot(corn.lm)
Part 2, 3 The coefficients remain the same but the std errors differ.
corn_yield$res = corn.lm$residuals
(group_variance = corn_yield %>% group_by(Fertilizer) %>% summarise(var = var(res)))
## # A tibble: 4 × 2
## Fertilizer var
## <fct> <dbl>
## 1 1 18.7
## 2 2 22.8
## 3 3 17.0
## 4 4 35.5
corn_yield <- corn_yield %>% left_join(group_variance)
## Joining with `by = join_by(Fertilizer)`
weights = 1/corn_yield$var
head(corn_yield)
## Yield Fertilizer res var
## 1 31 1 -5.6 18.71111
## 2 34 1 -2.6 18.71111
## 3 34 1 -2.6 18.71111
## 4 34 1 -2.6 18.71111
## 5 43 1 6.4 18.71111
## 6 35 1 -1.6 18.71111
corn.wls <- lm(data = corn_yield, Yield ~ Fertilizer, weights = weights)
summary(corn.wls)
##
## Call:
## lm(formula = Yield ~ Fertilizer, data = corn_yield, weights = weights)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -1.8653 -0.6028 -0.1343 0.5652 1.9419
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.600 1.368 26.76 < 2e-16 ***
## Fertilizer2 -6.700 2.037 -3.29 0.00225 **
## Fertilizer3 -1.700 1.889 -0.90 0.37424
## Fertilizer4 -6.800 2.329 -2.92 0.00600 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1 on 36 degrees of freedom
## Multiple R-squared: 0.3054, Adjusted R-squared: 0.2475
## F-statistic: 5.275 on 3 and 36 DF, p-value: 0.004045
Part 4
var_model <- lm(res^2 ~ Fertilizer, data = corn_yield)
var2 = predict(var_model)
corn.wls2 <- lm(Yield ~ Fertilizer, data = corn_yield, weights = 1/var2)
summary(corn.wls2)
##
## Call:
## lm(formula = Yield ~ Fertilizer, data = corn_yield, weights = 1/var2)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -1.9662 -0.6353 -0.1415 0.5957 2.0469
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.600 1.368 26.76 < 2e-16 ***
## Fertilizer2 -6.700 2.037 -3.29 0.00225 **
## Fertilizer3 -1.700 1.889 -0.90 0.37424
## Fertilizer4 -6.800 2.329 -2.92 0.00600 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.054 on 36 degrees of freedom
## Multiple R-squared: 0.3054, Adjusted R-squared: 0.2475
## F-statistic: 5.275 on 3 and 36 DF, p-value: 0.004045
Question 3
library(faraway)
## Warning: package 'faraway' was built under R version 4.5.2
data(cheddar)
head(cheddar)
## taste Acetic H2S Lactic
## 1 12.3 4.543 3.135 0.86
## 2 20.9 5.159 5.043 1.53
## 3 39.0 5.366 5.438 1.57
## 4 47.9 5.759 7.496 1.81
## 5 5.6 4.663 3.807 0.99
## 6 25.9 5.697 7.601 1.09
There may be unequal variance. There appear to be no outliers. The errors appear to be normal.
data(cheddar)
cheddar.lm <- lm(taste ~ ., data = cheddar)
summary(cheddar.lm)
##
## Call:
## lm(formula = taste ~ ., data = cheddar)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.390 -6.612 -1.009 4.908 25.449
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -28.8768 19.7354 -1.463 0.15540
## Acetic 0.3277 4.4598 0.073 0.94198
## H2S 3.9118 1.2484 3.133 0.00425 **
## Lactic 19.6705 8.6291 2.280 0.03108 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.13 on 26 degrees of freedom
## Multiple R-squared: 0.6518, Adjusted R-squared: 0.6116
## F-statistic: 16.22 on 3 and 26 DF, p-value: 3.81e-06
plot(cheddar.lm)
Part 2 No additional term is suggested.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following objects are masked from 'package:faraway':
##
## logit, vif
## The following object is masked from 'package:dplyr':
##
## recode
crPlots(cheddar.lm)
Part 3 Based on the boxcox plot, the model should be transformed because 1 does not fall in the 95% confidence interval.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
boxcox(cheddar.lm, plotit=T, lambda=seq(0.3,1.3,by=0.1))
Part 4 The cr plots have the same results
cheddar$taste_boxcox <- (cheddar$taste^0.65 - 1)/0.65
cheddar.lm2 <- lm(taste_boxcox ~ Acetic + H2S + Lactic, data = cheddar)
summary(cheddar.lm2)
##
## Call:
## lm(formula = taste_boxcox ~ Acetic + H2S + Lactic, data = cheddar)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.385 -1.970 0.136 2.216 7.550
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.37657 6.75929 -1.091 0.28514
## Acetic 0.01099 1.52745 0.007 0.99432
## H2S 1.36966 0.42758 3.203 0.00357 **
## Lactic 6.42131 2.95541 2.173 0.03910 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.47 on 26 degrees of freedom
## Multiple R-squared: 0.6454, Adjusted R-squared: 0.6045
## F-statistic: 15.77 on 3 and 26 DF, p-value: 4.807e-06
crPlots(cheddar.lm2)
Question 4
data(globwarm,package="faraway")
head(globwarm)
## nhtemp wusa jasper westgreen chesapeake tornetrask urals mongolia tasman
## 1000 NA -0.66 -0.03 0.03 -0.66 0.33 -1.49 0.83 -0.12
## 1001 NA -0.63 -0.07 0.09 -0.67 0.21 -1.44 0.96 -0.17
## 1002 NA -0.60 -0.11 0.18 -0.67 0.13 -1.39 0.99 -0.22
## 1003 NA -0.55 -0.14 0.30 -0.68 0.08 -1.34 0.95 -0.26
## 1004 NA -0.51 -0.15 0.41 -0.68 0.06 -1.30 0.87 -0.31
## 1005 NA -0.47 -0.15 0.52 -0.68 0.07 -1.25 0.77 -0.37
## year
## 1000 1000
## 1001 1001
## 1002 1002
## 1003 1003
## 1004 1004
## 1005 1005
globwarm <- globwarm %>% filter(!is.na(nhtemp))
names(globwarm)
## [1] "nhtemp" "wusa" "jasper" "westgreen" "chesapeake"
## [6] "tornetrask" "urals" "mongolia" "tasman" "year"
Part 1
globwarm.lm <- lm(nhtemp ~ wusa + jasper + westgreen + chesapeake +
tornetrask + urals + mongolia + tasman, data = globwarm)
summary(globwarm.lm)
##
## Call:
## lm(formula = nhtemp ~ wusa + jasper + westgreen + chesapeake +
## tornetrask + urals + mongolia + tasman, data = globwarm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.43668 -0.11170 0.00698 0.10176 0.65352
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.242555 0.027011 -8.980 1.97e-15 ***
## wusa 0.077384 0.042927 1.803 0.073647 .
## jasper -0.228795 0.078107 -2.929 0.003986 **
## westgreen 0.009584 0.041840 0.229 0.819168
## chesapeake -0.032112 0.034052 -0.943 0.347346
## tornetrask 0.092668 0.045053 2.057 0.041611 *
## urals 0.185369 0.091428 2.027 0.044567 *
## mongolia 0.041973 0.045794 0.917 0.360996
## tasman 0.115453 0.030111 3.834 0.000192 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1758 on 136 degrees of freedom
## Multiple R-squared: 0.4764, Adjusted R-squared: 0.4456
## F-statistic: 15.47 on 8 and 136 DF, p-value: 5.028e-16
Part 2 There may be some autocorrelation, There appears to be a cyclical effect. The acf plot suggest autocorrelation.
plot(y = residuals(globwarm.lm), x = globwarm$year)
acf(residuals(globwarm.lm))
Part 3 Null: rho = 0 Alt: rho != 0 Reject the null hypothesis since p-value < 0.05
durbinWatsonTest(globwarm.lm)
## lag Autocorrelation D-W Statistic p-value
## 1 0.5710535 0.8166064 0
## Alternative hypothesis: rho != 0
Part 4
n = nrow(globwarm)
old.Y = globwarm$nhtemp
old.X = model.matrix(globwarm.lm)[,-1]
new.Y = old.Y[2:n] - 0.5710535*old.Y[1:(n-1)]
new.X = old.X[2:n] - 0.5710535*old.X[1:(n-1)]
new.globwarm.lm = lm(new.Y ~ new.X)
summary(new.globwarm.lm)
##
## Call:
## lm(formula = new.Y ~ new.X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.37838 -0.09388 -0.00128 0.08734 0.42431
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.09006 0.01638 -5.498 1.73e-07 ***
## new.X 0.14684 0.03743 3.922 0.000136 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1481 on 142 degrees of freedom
## Multiple R-squared: 0.09776, Adjusted R-squared: 0.0914
## F-statistic: 15.39 on 1 and 142 DF, p-value: 0.000136
Part 5 There is no autocorrelation according to the durbinwatson test. There is slight autocorrelation according to the acf plot.
acf(residuals(new.globwarm.lm))
durbinWatsonTest(new.globwarm.lm)
## lag Autocorrelation D-W Statistic p-value
## 1 0.03501057 1.920848 0.578
## Alternative hypothesis: rho != 0
Part 6: The r^2 value of the new model is much lower but all the p-values are statistically significant.