Use the same data set as we used for Part 1 of this problem set

knitr::opts_chunk$set(echo = TRUE)

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.2
data <- read.csv("ps10data.csv")

head(data)
##   day     sector stockprice
## 1  Mo automotive      55.11
## 2  Mo automotive      85.10
## 3  Mo     health      99.67
## 4  Mo automotive      19.79
## 5  Mo automotive      69.68
## 6  Mo automotive      61.97

2. Multi-way ANOVA and regression.

library(car)
## Warning: package 'car' was built under R version 4.0.2
## Loading required package: carData
anova = aov(stockprice~sector,data = data)
summary(anova)
##              Df  Sum Sq Mean Sq F value   Pr(>F)    
## sector        1   31780   31780   17.26 3.66e-05 ***
## Residuals   698 1284995    1841                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# qqPlot(anova$residuals,main='Residuals distribution',ylab = 'model residuals')
# shapiro.test(anova$residuals) #it kind of seems like that assumption of normality has been rejected but being the sample size bigger than 30 cases and ANOVA a robust statistical tool we can proceed with the test

The p-value of the sector variable is very low (p = 3.66e-05), so it appears that the type of sector (automotive or health) included has a huge impact on the stockprice.

B. Predict by Day

anova2 = aov(stockprice~day,data = data)
automotive = data[data$sector=='automotive',]
health = data[data$sector=='health',]

summary(anova2)
##              Df  Sum Sq Mean Sq F value   Pr(>F)    
## day           6   77286   12881   7.202 1.81e-07 ***
## Residuals   693 1239488    1789                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(stockprice~day,data=automotive))
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## day           6  66982   11164   6.858 7.03e-07 ***
## Residuals   335 545323    1628                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(stockprice~day,data=health))
##              Df Sum Sq Mean Sq F value Pr(>F)  
## day           6  25156    4193   2.273 0.0364 *
## Residuals   351 647533    1845                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The first ANOVA suggests us that there is a strongly significant effect of days on the stockprice for all data (p = 1.81e-07). The same is true for the second and third ANOVA models (p = 7.03e-07, p = 0.0364) where we include only data for the automotive and health sectors respectively.

C. Multiple predictors: effect of sector after day

lm_day_sec = lm(stockprice~day+sector,data=data)
lm_day = lm(stockprice~day,data=data)
anova(lm_day,lm_day_sec)
## Analysis of Variance Table
## 
## Model 1: stockprice ~ day
## Model 2: stockprice ~ day + sector
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1    693 1239488                                  
## 2    692 1206952  1     32536 18.654 1.796e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Sector has an effect after day-of-week is considered (p = 1.796e-05), thus the model that includes also sector is a better model than the day-only model.

D. Effect of day after sector

lm_sec = lm(stockprice~sector,data=data)
anova(lm_sec,lm_day_sec)
## Analysis of Variance Table
## 
## Model 1: stockprice ~ sector
## Model 2: stockprice ~ day + sector
##   Res.Df     RSS Df Sum of Sq      F   Pr(>F)    
## 1    698 1284995                                 
## 2    692 1206952  6     78043 7.4576 9.36e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Similar to our previous analysis, day-of-week has an effect after sector is considered (p = 9.36e-08), thus the model that includes also day-of-week is a better model than the sector model.

E. Model lattice

Compare mean-squares and F-test results for sector and day across the model comparisons (i.e., both with and without the other variable). Are the results of the tests identical or do they differ? Why? Pick which comparison of models you would prefer to use to test the effect, and describe why you feel it is better than the others. Are there any reasons why you might choose the alternative?

summary(aov(stockprice~sector,data=data))
##              Df  Sum Sq Mean Sq F value   Pr(>F)    
## sector        1   31780   31780   17.26 3.66e-05 ***
## Residuals   698 1284995    1841                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(stockprice~day,data=data))
##              Df  Sum Sq Mean Sq F value   Pr(>F)    
## day           6   77286   12881   7.202 1.81e-07 ***
## Residuals   693 1239488    1789                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(stockprice~day+sector,data=data))
##              Df  Sum Sq Mean Sq F value   Pr(>F)    
## day           6   77286   12881   7.385 1.13e-07 ***
## sector        1   32536   32536  18.654 1.80e-05 ***
## Residuals   692 1206952    1744                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(stockprice~sector+day,data=data))
##              Df  Sum Sq Mean Sq F value   Pr(>F)    
## sector        1   31780   31780  18.221 2.24e-05 ***
## day           6   78043   13007   7.458 9.36e-08 ***
## Residuals   692 1206952    1744                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
drop1(lm(stockprice~day+sector,data=data))
## Single term deletions
## 
## Model:
## stockprice ~ day + sector
##        Df Sum of Sq     RSS    AIC
## <none>              1206952 5232.8
## day     6     78043 1284995 5264.6
## sector  1     32536 1239488 5249.4

The sum of squares for day = 78043 and sector = 32536 while the F values for day = 7.458 and sector = 18.654 are those who I think are the most appropriate ones. Infact, by looking at the anova (stockprice~day+sector) and the anova (stockprice~sector+day) we can appreciate how order really matters because our predictors are non-orthogonal and have an effect on each other. This is why the results of the tests are all different. For example, the first two ANOVAS have a sum of squares of 31780 for the sector and 77286 for the day but these are representations of “isolated” results that do not take into account the day effect on sector or the sector effect on day. Indeed, in order to use a more conservative approach (Type 2 ANOVA) and take all effects into account I have used the drop1 function which immediately reports us these values (although it doesn’t give us also the F-test value).