DATA 501 HW 4

Question 1

corn_yield = read.table("http://www1.aucegypt.edu/faculty/hadi/RABE5/Data5/P158.txt",
header = T, sep = "\t")
corn_yield$Fertilizer = factor(corn_yield$Fertilizer)
head(corn_yield)

##   Yield Fertilizer
## 1    31          1
## 2    34          1
## 3    34          1
## 4    34          1
## 5    43          1
## 6    35          1

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

corn_yield %>% group_by(Fertilizer) %>% summarise(n = n())

## # A tibble: 4 × 2
##   Fertilizer     n
##   <fct>      <int>
## 1 1             10
## 2 2             10
## 3 3             10
## 4 4             10

(Mean_Yield = corn_yield %>% group_by(Fertilizer) %>% summarise(Mean.Yield = mean(Yield)))

## # A tibble: 4 × 2
##   Fertilizer Mean.Yield
##   <fct>           <dbl>
## 1 1                36.6
## 2 2                29.9
## 3 3                34.9
## 4 4                29.8

Part 3

corn_yield$Fertilizer = factor(corn_yield$Fertilizer)
corn.lm <- lm(data = corn_yield, Yield ~ Fertilizer)
summary(corn.lm)

## 
## Call:
## lm(formula = Yield ~ Fertilizer, data = corn_yield)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.800 -2.825 -0.600  3.125 10.200 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   36.600      1.533  23.878  < 2e-16 ***
## Fertilizer2   -6.700      2.168  -3.091  0.00384 ** 
## Fertilizer3   -1.700      2.168  -0.784  0.43803    
## Fertilizer4   -6.800      2.168  -3.137  0.00340 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.847 on 36 degrees of freedom
## Multiple R-squared:  0.3001, Adjusted R-squared:  0.2417 
## F-statistic: 5.144 on 3 and 36 DF,  p-value: 0.004605

Part 4 Null Hypothesis: The model is best with no predictors Alt hypothesis: The model is best with at least one predictor Test: Anova f test Conclusion. At the 0.05 significane level, we reject the null hypothesis that the reduced model with no predictors is best.

corn.reduced <- lm(data = corn_yield, Yield ~ 1)
anova_corn <- anova(corn.reduced, corn.lm)
anova_corn

## Analysis of Variance Table
## 
## Model 1: Yield ~ 1
## Model 2: Yield ~ Fertilizer
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1     39 1208.4                                
## 2     36  845.8  3     362.6 5.1445 0.004605 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Part 5: Fertilizer 1 because it has the lowest p-value.

Question 2 Part 1: The model seems to follow the assumption of constant variance. There doesn’t seem to be a pattern in the residuals vs fitted plot.

plot(corn.lm)

Part 2, 3 The coefficients remain the same but the std errors differ.

corn_yield$res = corn.lm$residuals
(group_variance = corn_yield %>% group_by(Fertilizer) %>% summarise(var = var(res)))

## # A tibble: 4 × 2
##   Fertilizer   var
##   <fct>      <dbl>
## 1 1           18.7
## 2 2           22.8
## 3 3           17.0
## 4 4           35.5

corn_yield <- corn_yield %>% left_join(group_variance)

## Joining with `by = join_by(Fertilizer)`

weights = 1/corn_yield$var
head(corn_yield)

##   Yield Fertilizer  res      var
## 1    31          1 -5.6 18.71111
## 2    34          1 -2.6 18.71111
## 3    34          1 -2.6 18.71111
## 4    34          1 -2.6 18.71111
## 5    43          1  6.4 18.71111
## 6    35          1 -1.6 18.71111

corn.wls <- lm(data = corn_yield, Yield ~ Fertilizer, weights = weights)
summary(corn.wls)

## 
## Call:
## lm(formula = Yield ~ Fertilizer, data = corn_yield, weights = weights)
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8653 -0.6028 -0.1343  0.5652  1.9419 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   36.600      1.368   26.76  < 2e-16 ***
## Fertilizer2   -6.700      2.037   -3.29  0.00225 ** 
## Fertilizer3   -1.700      1.889   -0.90  0.37424    
## Fertilizer4   -6.800      2.329   -2.92  0.00600 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1 on 36 degrees of freedom
## Multiple R-squared:  0.3054, Adjusted R-squared:  0.2475 
## F-statistic: 5.275 on 3 and 36 DF,  p-value: 0.004045

Part 4

var_model <- lm(res^2 ~ Fertilizer, data = corn_yield)
var2 = predict(var_model)
corn.wls2 <- lm(Yield ~ Fertilizer, data = corn_yield, weights = 1/var2)
summary(corn.wls2)

## 
## Call:
## lm(formula = Yield ~ Fertilizer, data = corn_yield, weights = 1/var2)
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9662 -0.6353 -0.1415  0.5957  2.0469 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   36.600      1.368   26.76  < 2e-16 ***
## Fertilizer2   -6.700      2.037   -3.29  0.00225 ** 
## Fertilizer3   -1.700      1.889   -0.90  0.37424    
## Fertilizer4   -6.800      2.329   -2.92  0.00600 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.054 on 36 degrees of freedom
## Multiple R-squared:  0.3054, Adjusted R-squared:  0.2475 
## F-statistic: 5.275 on 3 and 36 DF,  p-value: 0.004045

Question 3

library(faraway)

## Warning: package 'faraway' was built under R version 4.5.2

data(cheddar)
head(cheddar)

##   taste Acetic   H2S Lactic
## 1  12.3  4.543 3.135   0.86
## 2  20.9  5.159 5.043   1.53
## 3  39.0  5.366 5.438   1.57
## 4  47.9  5.759 7.496   1.81
## 5   5.6  4.663 3.807   0.99
## 6  25.9  5.697 7.601   1.09

There may be unequal variance. There appear to be no outliers. The errors appear to be normal.

data(cheddar)
cheddar.lm <- lm(taste ~ ., data = cheddar)
summary(cheddar.lm)

## 
## Call:
## lm(formula = taste ~ ., data = cheddar)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.390  -6.612  -1.009   4.908  25.449 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -28.8768    19.7354  -1.463  0.15540   
## Acetic        0.3277     4.4598   0.073  0.94198   
## H2S           3.9118     1.2484   3.133  0.00425 **
## Lactic       19.6705     8.6291   2.280  0.03108 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.13 on 26 degrees of freedom
## Multiple R-squared:  0.6518, Adjusted R-squared:  0.6116 
## F-statistic: 16.22 on 3 and 26 DF,  p-value: 3.81e-06

plot(cheddar.lm)

Part 2 No additional term is suggested.

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following objects are masked from 'package:faraway':
## 
##     logit, vif

## The following object is masked from 'package:dplyr':
## 
##     recode

crPlots(cheddar.lm)

Part 3 Based on the boxcox plot, the model should be transformed because 1 does not fall in the 95% confidence interval.

library(MASS)

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

boxcox(cheddar.lm, plotit=T, lambda=seq(0.3,1.3,by=0.1))

Part 4 The cr plots have the same results

cheddar$taste_boxcox <- (cheddar$taste^0.65 - 1)/0.65
cheddar.lm2 <- lm(taste_boxcox ~ Acetic + H2S + Lactic, data = cheddar)
summary(cheddar.lm2)

## 
## Call:
## lm(formula = taste_boxcox ~ Acetic + H2S + Lactic, data = cheddar)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.385 -1.970  0.136  2.216  7.550 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -7.37657    6.75929  -1.091  0.28514   
## Acetic       0.01099    1.52745   0.007  0.99432   
## H2S          1.36966    0.42758   3.203  0.00357 **
## Lactic       6.42131    2.95541   2.173  0.03910 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.47 on 26 degrees of freedom
## Multiple R-squared:  0.6454, Adjusted R-squared:  0.6045 
## F-statistic: 15.77 on 3 and 26 DF,  p-value: 4.807e-06

crPlots(cheddar.lm2)

Question 4

data(globwarm,package="faraway")
head(globwarm)

##      nhtemp  wusa jasper westgreen chesapeake tornetrask urals mongolia tasman
## 1000     NA -0.66  -0.03      0.03      -0.66       0.33 -1.49     0.83  -0.12
## 1001     NA -0.63  -0.07      0.09      -0.67       0.21 -1.44     0.96  -0.17
## 1002     NA -0.60  -0.11      0.18      -0.67       0.13 -1.39     0.99  -0.22
## 1003     NA -0.55  -0.14      0.30      -0.68       0.08 -1.34     0.95  -0.26
## 1004     NA -0.51  -0.15      0.41      -0.68       0.06 -1.30     0.87  -0.31
## 1005     NA -0.47  -0.15      0.52      -0.68       0.07 -1.25     0.77  -0.37
##      year
## 1000 1000
## 1001 1001
## 1002 1002
## 1003 1003
## 1004 1004
## 1005 1005

globwarm <- globwarm %>% filter(!is.na(nhtemp))
names(globwarm)

##  [1] "nhtemp"     "wusa"       "jasper"     "westgreen"  "chesapeake"
##  [6] "tornetrask" "urals"      "mongolia"   "tasman"     "year"

Part 1

globwarm.lm <- lm(nhtemp ~ wusa + jasper + westgreen + chesapeake +
tornetrask + urals + mongolia + tasman, data = globwarm)

summary(globwarm.lm)

## 
## Call:
## lm(formula = nhtemp ~ wusa + jasper + westgreen + chesapeake + 
##     tornetrask + urals + mongolia + tasman, data = globwarm)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.43668 -0.11170  0.00698  0.10176  0.65352 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.242555   0.027011  -8.980 1.97e-15 ***
## wusa         0.077384   0.042927   1.803 0.073647 .  
## jasper      -0.228795   0.078107  -2.929 0.003986 ** 
## westgreen    0.009584   0.041840   0.229 0.819168    
## chesapeake  -0.032112   0.034052  -0.943 0.347346    
## tornetrask   0.092668   0.045053   2.057 0.041611 *  
## urals        0.185369   0.091428   2.027 0.044567 *  
## mongolia     0.041973   0.045794   0.917 0.360996    
## tasman       0.115453   0.030111   3.834 0.000192 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1758 on 136 degrees of freedom
## Multiple R-squared:  0.4764, Adjusted R-squared:  0.4456 
## F-statistic: 15.47 on 8 and 136 DF,  p-value: 5.028e-16

Part 2 There may be some autocorrelation, There appears to be a cyclical effect. The acf plot suggest autocorrelation.

plot(y = residuals(globwarm.lm), x = globwarm$year)

acf(residuals(globwarm.lm))

Part 3 Null: rho = 0 Alt: rho != 0 Reject the null hypothesis since p-value < 0.05

durbinWatsonTest(globwarm.lm)

##  lag Autocorrelation D-W Statistic p-value
##    1       0.5710535     0.8166064       0
##  Alternative hypothesis: rho != 0

Part 4

n = nrow(globwarm)

old.Y = globwarm$nhtemp
old.X = model.matrix(globwarm.lm)[,-1]

new.Y = old.Y[2:n] - 0.5710535*old.Y[1:(n-1)]
new.X = old.X[2:n] - 0.5710535*old.X[1:(n-1)]

new.globwarm.lm = lm(new.Y ~ new.X)
summary(new.globwarm.lm)

## 
## Call:
## lm(formula = new.Y ~ new.X)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.37838 -0.09388 -0.00128  0.08734  0.42431 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.09006    0.01638  -5.498 1.73e-07 ***
## new.X        0.14684    0.03743   3.922 0.000136 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1481 on 142 degrees of freedom
## Multiple R-squared:  0.09776,    Adjusted R-squared:  0.0914 
## F-statistic: 15.39 on 1 and 142 DF,  p-value: 0.000136

Part 5 There is no autocorrelation according to the durbinwatson test. There is slight autocorrelation according to the acf plot.

acf(residuals(new.globwarm.lm))

durbinWatsonTest(new.globwarm.lm)

##  lag Autocorrelation D-W Statistic p-value
##    1      0.03501057      1.920848   0.578
##  Alternative hypothesis: rho != 0

Part 6: The r^2 value of the new model is much lower but all the p-values are statistically significant.

DATA 501 HW 4

2025-11-06