Students should use the CSV file that was already given to them. These are the steps for loading and cleaning it (according to the lab):
library(stringr)
x <- read.csv("https://raw.githubusercontent.com/am2222/GESC258/master/lab6/snow.csv",fileEncoding="UTF-8-BOM")
#now a little stringr magic:
x$snowcm <- str_trim(str_sub(str_trim(x$Snow), start = 1, end = str_locate(str_trim(x$Snow), " ")[,1]))
x$snowcm <- as.numeric(x$snowcm)
The question asks whether snowfall differs among all 4 cities. Students have to use ANOVA because more than two cities are being compared. The cities are “Toronto”, “Hamilton”, “Montreal” and “Ottawa”, so the hypotheses of the ANOVA test are
\[ H_0 : \mu_{Toronto} = \mu_{Hamilton} = \mu_{Montreal}= \mu_{Ottawa}\\ H_a : \text{at least one of the four means differs from the others} \]
Using a significance level of 0.05, we run the ANOVA test. If the p-value is less than 0.05 we reject the null: based on our sample, mean snowfall in at least one of the four cities is different from the others.
If the p-value is more than 0.05 we fail to reject the null: the sample gives no evidence that mean snowfall differs among the cities.
anova(lm(snowcm~City, data=x))
## Analysis of Variance Table
##
## Response: snowcm
## Df Sum Sq Mean Sq F value Pr(>F)
## City 3 218232 72744 23.776 1.364e-11 ***
## Residuals 96 293721 3060
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is 1.364e-11, which is very close to 0. It is also less than 0.05, so we reject the null. As a result, at least one of the cities has a statistically different mean snowfall.
Test statistic: F = 23.776
Marking breakdown:
- State hypothesis -> 1 mark
- Test statistic -> 1 mark
- p-value -> 1 mark
- A sentence interpreting your result -> 1 mark
- Code -> 1 mark
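As a check when marking, the F value and p-value in the four-city table above can be reproduced from its sums of squares and degrees of freedom. The following is only a sketch: the variable names are made up for illustration, and the numbers are copied by hand from the ANOVA output above.

```r
# Values copied from the four-city ANOVA table above
ss_city  <- 218232   # Sum Sq for City
df_city  <- 3        # 4 cities - 1
ss_resid <- 293721   # Sum Sq for Residuals
df_resid <- 96       # 100 observations - 4 cities

# Mean squares are sums of squares divided by their degrees of freedom
ms_city  <- ss_city / df_city
ms_resid <- ss_resid / df_resid

# The F statistic is the ratio of the two mean squares
f_stat <- ms_city / ms_resid

# The p-value is the upper-tail area of the F distribution
p_value <- pf(f_stat, df_city, df_resid, lower.tail = FALSE)

print(f_stat)
print(p_value)
```

The printed values should agree with the F value and Pr(>F) columns of the table up to rounding.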
This question is the same, but this time students need to repeat the above test for every combination of 3 cities. This is the first one: “Toronto”, “Hamilton”, “Montreal”
\[ H_0 : \mu_{Toronto} = \mu_{Hamilton} = \mu_{Montreal}\\ H_a : \text{at least one of the three means differs from the others} \]
They need to filter the data first:
thm <- x[x$City!="Ottawa",] #only keep records that do not have Ottawa in the City column
Now thm only includes data for “Toronto”, “Hamilton” and “Montreal”.
anova(lm(snowcm~City, data=thm))
## Analysis of Variance Table
##
## Response: snowcm
## Df Sum Sq Mean Sq F value Pr(>F)
## City 2 121995 60997 22.958 1.937e-08 ***
## Residuals 72 191294 2657
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
They need to interpret the results the same way as in question one: if the p-value is less than 0.05, reject the null; otherwise fail to reject the null. Here the p-value is 1.937e-08, so the null is rejected.
“Toronto”, “Hamilton”, “Ottawa”
\[ H_0 : \mu_{Toronto} = \mu_{Hamilton} = \mu_{Ottawa}\\ H_a : \text{at least one of the three means differs from the others} \]
tho <- x[x$City!="Montreal",]
anova(lm(snowcm~City, data=tho))
## Analysis of Variance Table
##
## Response: snowcm
## Df Sum Sq Mean Sq F value Pr(>F)
## City 2 173355 86678 31.17 1.773e-10 ***
## Residuals 72 200220 2781
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
“Toronto”, “Montreal”, “Ottawa”
\[ H_0 : \mu_{Toronto} = \mu_{Montreal}= \mu_{Ottawa}\\ H_a : \text{at least one of the three means differs from the others} \]
tmo <- x[x$City!="Hamilton",]
anova(lm(snowcm~City, data=tmo))
## Analysis of Variance Table
##
## Response: snowcm
## Df Sum Sq Mean Sq F value Pr(>F)
## City 2 186331 93166 27.98 1.021e-09 ***
## Residuals 72 239737 3330
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
“Hamilton”, “Montreal”, “Ottawa”
\[ H_0 : \mu_{Hamilton} = \mu_{Montreal}= \mu_{Ottawa}\\ H_a : \text{at least one of the three means differs from the others} \]
hmo <- x[x$City!="Toronto",]
anova(lm(snowcm~City, data=hmo))
## Analysis of Variance Table
##
## Response: snowcm
## Df Sum Sq Mean Sq F value Pr(>F)
## City 2 100271 50136 14.444 5.317e-06 ***
## Residuals 72 249911 3471
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Marking breakdown:
- State hypothesis -> 1 mark in total (0.25 for each of the four tests)
- Test statistic -> 1 mark in total (0.25 for each of the four tests)
- p-value -> 1 mark in total (0.25 for each of the four tests)
- A sentence interpreting your result -> 1 mark in total (0.25 for each of the four tests)
- Code -> 1 mark
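The four leave-one-city-out tests above can also be produced in a single loop, which may be handy for checking student output. This is a minimal sketch, not the lab's own method: `anova_without` is a hypothetical helper, and `toy` is made-up data that only stands in for the lab's data frame `x` so that the example runs on its own.

```r
# Hypothetical helper: one-way ANOVA of snowcm by City with one city left out
anova_without <- function(data, city) {
  subset_data <- data[data$City != city, ]
  anova(lm(snowcm ~ City, data = subset_data))
}

# Made-up data standing in for the lab's `x` (these are NOT the real values)
set.seed(1)
toy <- data.frame(
  City   = rep(c("Toronto", "Hamilton", "Montreal", "Ottawa"), each = 5),
  snowcm = c(rnorm(5, 120, 30), rnorm(5, 110, 30),
             rnorm(5, 210, 30), rnorm(5, 220, 30))
)

# Run the test once per excluded city and label each result
cities  <- unique(toy$City)
results <- lapply(cities, function(city) anova_without(toy, city))
names(results) <- paste("without", cities)

print(results)
```

With the real data, passing `x` instead of `toy` should give the same four tables as the separate calls above.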
This question can have various answers. You can ask students to include graphs or run t-tests. Some example hypotheses for t-tests they could run:
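For reference, each kind of t-test listed in this answer (two-sided two-sample, one-sided two-sample, and one-sample against a fixed value) can be run with base R's `t.test`. The sketch below uses made-up data in `toy` as a stand-in for the lab's `x`, and the reference value 122 is taken from the example hypotheses; treat both as assumptions.

```r
# Made-up data standing in for the lab's `x` (these are NOT the real values)
set.seed(42)
toy <- data.frame(
  City   = rep(c("Hamilton", "Montreal", "Toronto"), each = 8),
  snowcm = c(rnorm(8, 110, 25), rnorm(8, 200, 25), rnorm(8, 120, 25))
)

ham <- toy$snowcm[toy$City == "Hamilton"]
mon <- toy$snowcm[toy$City == "Montreal"]
tor <- toy$snowcm[toy$City == "Toronto"]

# Two-sided: H0 mu_Hamilton = mu_Montreal (Welch test, the t.test default)
two_sided <- t.test(ham, mon)

# One-sided: Ha mu_Hamilton < mu_Montreal
one_sided <- t.test(ham, mon, alternative = "less")

# One-sample: Ha mu_Toronto < 122
one_sample <- t.test(tor, mu = 122, alternative = "less")

print(two_sided$p.value)
print(one_sided$p.value)
print(one_sample$p.value)
```

In every case the interpretation is the same as for ANOVA: reject the null if the p-value is below 0.05, otherwise fail to reject it.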
\[ H_0 : \mu_{Hamilton} = \mu_{Montreal} \\ H_a : \mu_{Hamilton} \ne \mu_{Montreal} \]
\[ H_0 : \mu_{Hamilton} \geq \mu_{Montreal}\\ H_a : \mu_{Hamilton} \lt \mu_{Montreal} \]
\[ H_0 : \mu_{Toronto} \geq 122\\ H_a : \mu_{Toronto} \lt 122 \]
\[ H_0 : \mu_{Hamilton} = 122\\ H_a : \mu_{Hamilton} \ne 122 \]
4. One of the foundational assumptions of classical statistical inference is that observations are independent. Can you think of any potential violations to this assumption with the snow dataset? (out of 5)
It is a bit of a tricky question, but students can, for example, discuss that snowfall is the result of a weather front. As a result, the snowfall in the cities can be dependent on each other if it is the result of the same weather front.
Bonus question: Explain in detail (i.e., what each function is doing) what is happening in this step below
#now a little stringr magic:
x$snowcm <- str_trim(str_sub(str_trim(x$Snow), start = 1, end = str_locate(str_trim(x$Snow), " ")[,1]))
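One way to explain this step is to unpack the nested call into one line per function. The sketch below assumes the `Snow` column holds text such as a number followed by a space and a unit; the sample strings in `snow` are hypothetical and are not taken from the lab file.

```r
library(stringr)

# Hypothetical raw values, assumed to look like the Snow column
snow <- c(" 121.5 cm ", "98 cm", "0.5 cm")

# 1. str_trim removes leading and trailing whitespace
trimmed <- str_trim(snow)

# 2. str_locate finds the first space; column 1 is its start position
space_pos <- str_locate(trimmed, " ")[, 1]

# 3. str_sub keeps the characters from position 1 up to that space
number_txt <- str_sub(trimmed, start = 1, end = space_pos)

# 4. The outer str_trim drops the trailing space that step 3 kept
number_txt <- str_trim(number_txt)

# 5. as.numeric (the next line in the lab code) converts the text to numbers
snowcm <- as.numeric(number_txt)
print(snowcm)
```

Note that a value with no space in it would give `NA` at step 2, and therefore `NA` after conversion.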