Use the same data set as we used for Part 1 of this problem set
knitr::opts_chunk$set(echo = TRUE)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.2
data <- read.csv("ps10data.csv")
head(data)
## day sector stockprice
## 1 Mo automotive 55.11
## 2 Mo automotive 85.10
## 3 Mo health 99.67
## 4 Mo automotive 19.79
## 5 Mo automotive 69.68
## 6 Mo automotive 61.97
library(car)
## Warning: package 'car' was built under R version 4.0.2
## Loading required package: carData
anova = aov(stockprice~sector,data = data)
summary(anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## sector 1 31780 31780 17.26 3.66e-05 ***
## Residuals 698 1284995 1841
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# qqPlot(anova$residuals,main='Residuals distribution',ylab = 'model residuals')
# shapiro.test(anova$residuals) #it kind of seems like that assumption of normality has been rejected but being the sample size bigger than 30 cases and ANOVA a robust statistical tool we can proceed with the test
The p-value of the sector variable is very low (p = 3.66e-05), so it appears that the type of sector (automotive or health) included has a huge impact on the stockprice.
anova2 = aov(stockprice~day,data = data)
automotive = data[data$sector=='automotive',]
health = data[data$sector=='health',]
summary(anova2)
## Df Sum Sq Mean Sq F value Pr(>F)
## day 6 77286 12881 7.202 1.81e-07 ***
## Residuals 693 1239488 1789
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(stockprice~day,data=automotive))
## Df Sum Sq Mean Sq F value Pr(>F)
## day 6 66982 11164 6.858 7.03e-07 ***
## Residuals 335 545323 1628
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(stockprice~day,data=health))
## Df Sum Sq Mean Sq F value Pr(>F)
## day 6 25156 4193 2.273 0.0364 *
## Residuals 351 647533 1845
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The first ANOVA suggests us that there is a strongly significant effect of days on the stockprice for all data (p = 1.81e-07). The same is true for the second and third ANOVA models (p = 7.03e-07, p = 0.0364) where we include only data for the automotive and health sectors respectively.
lm_day_sec = lm(stockprice~day+sector,data=data)
lm_day = lm(stockprice~day,data=data)
anova(lm_day,lm_day_sec)
## Analysis of Variance Table
##
## Model 1: stockprice ~ day
## Model 2: stockprice ~ day + sector
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 693 1239488
## 2 692 1206952 1 32536 18.654 1.796e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Sector has an effect after day-of-week is considered (p = 1.796e-05), thus the model that includes also sector is a better model than the day-only model.
lm_sec = lm(stockprice~sector,data=data)
anova(lm_sec,lm_day_sec)
## Analysis of Variance Table
##
## Model 1: stockprice ~ sector
## Model 2: stockprice ~ day + sector
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 698 1284995
## 2 692 1206952 6 78043 7.4576 9.36e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Similar to our previous analysis, day-of-week has an effect after sector is considered (p = 9.36e-08), thus the model that includes also day-of-week is a better model than the sector model.
Compare mean-squares and F-test results for sector and day across the model comparisons (i.e., both with and without the other variable). Are the results of the tests identical or do they differ? Why? Pick which comparison of models you would prefer to use to test the effect, and describe why you feel it is better than the others. Are there any reasons why you might choose the alternative?
summary(aov(stockprice~sector,data=data))
## Df Sum Sq Mean Sq F value Pr(>F)
## sector 1 31780 31780 17.26 3.66e-05 ***
## Residuals 698 1284995 1841
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(stockprice~day,data=data))
## Df Sum Sq Mean Sq F value Pr(>F)
## day 6 77286 12881 7.202 1.81e-07 ***
## Residuals 693 1239488 1789
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(stockprice~day+sector,data=data))
## Df Sum Sq Mean Sq F value Pr(>F)
## day 6 77286 12881 7.385 1.13e-07 ***
## sector 1 32536 32536 18.654 1.80e-05 ***
## Residuals 692 1206952 1744
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(stockprice~sector+day,data=data))
## Df Sum Sq Mean Sq F value Pr(>F)
## sector 1 31780 31780 18.221 2.24e-05 ***
## day 6 78043 13007 7.458 9.36e-08 ***
## Residuals 692 1206952 1744
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
drop1(lm(stockprice~day+sector,data=data))
## Single term deletions
##
## Model:
## stockprice ~ day + sector
## Df Sum of Sq RSS AIC
## <none> 1206952 5232.8
## day 6 78043 1284995 5264.6
## sector 1 32536 1239488 5249.4
The sum of squares for day = 78043 and sector = 32536 while the F values for day = 7.458 and sector = 18.654 are those who I think are the most appropriate ones. Infact, by looking at the anova (stockprice~day+sector) and the anova (stockprice~sector+day) we can appreciate how order really matters because our predictors are non-orthogonal and have an effect on each other. This is why the results of the tests are all different. For example, the first two ANOVAS have a sum of squares of 31780 for the sector and 77286 for the day but these are representations of “isolated” results that do not take into account the day effect on sector or the sector effect on day. Indeed, in order to use a more conservative approach (Type 2 ANOVA) and take all effects into account I have used the drop1 function which immediately reports us these values (although it doesn’t give us also the F-test value).