Timothy Davidson
27 May 2017
The investigator who undertook this study commutes to work by bicycle.
The investigator cycles to work:
** on weekdays
** through all weather seasons (summer, autumn, winter and spring).
As a commuter, the investigator’s observation is that there are fewer cyclists riding during winter than during other weather seasons.
This investigation will test whether there is statistical evidence of an association between weather season and cycling flows on weekdays and weekends.
Problem statement: Is there statistical evidence of an association between weather season and cycling flows on weekdays and weekends?
How can statistics help?
Data on cycling flows has been sourced.
The data will be visually inspected to determine whether the flow of cyclists is the same during each weather season.
If the proportions do not appear to be the same during each season, a Chi-square test of association will be used to test whether:
** the relationship between the two variables is statistically significant; or
** the variation reflects natural sampling variability, assuming weather season and weekday/weekend riding are independent.
Data source and attribution
To conduct the investigation, open data was sourced from data.vic.gov.au.
Modifications made to the dataset:
** 631 rows of data were removed because six of the columns in each of those rows contained “#REF” errors.
** Assigned labels to factors in R (described in slide 6).
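The row removal described above could be done along the following lines. This is only a sketch on a made-up data frame: the column names and values are hypothetical, and the investigator's actual cleaning step is not shown in the source.

```r
# Hypothetical toy data; these columns and values are illustrative only
raw <- data.frame(
  Season  = c("Summer", "#REF", "Winter", "Autumn"),
  weekend = c("TRUE", "#REF", "FALSE", "TRUE"),
  stringsAsFactors = FALSE
)

# Flag every row in which at least one column holds the "#REF" error marker
has_ref <- apply(raw, 1, function(row) any(grepl("#REF", row, fixed = TRUE)))

# Keep only the rows without the marker
clean <- raw[!has_ref, ]

# Number of rows removed
nrow(raw) - nrow(clean)
```

Counting the flagged rows before dropping them gives a figure that can be checked against the number of removals reported.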
Licence: Creative Commons Attribution 4.0 International.
About the data
Data on cycling flows recorded at 21 locations from 2005 to 2012.
Sample size = 56,373 (after the removal of the 631 rows outlined on the previous slide).
No sampling method was applied by the investigator; rather, the investigator relied on the sample collected and made available as open data.
The data includes a number of variables - however the two most important for this study are:
** weather season
** weekend indicator
Some of the other variables have missing data - R was not instructed to treat the missing data in any particular way, as the other variables are not of importance to this particular study.
Factors
Categorical variables are considered “factors” in R.
However, when importing the data into R, season was treated as a “character” and weekend indicator was treated as “logical”.
R has been instructed to treat both variables as “factors”.
“Weekend” is a logical/binary variable whereby TRUE indicates a weekend observation, otherwise it is FALSE (for weekdays).
Labels have been assigned to make the variable more descriptive: TRUE = “weekend” and FALSE = “weekday”.
The levels for the two main categorical variables are:
** “weekend”: weekend and weekday
** “Season”: Summer, Autumn, Winter and Spring
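library(readr) # provides read_csv()
library(dplyr) # provides the %>% pipe
Cycling <- read_csv("~/Desktop/Cycling.csv")
# Season is imported as character; convert it to a factor
Cycling$Season <- Cycling$Season %>% as.factor
levels(Cycling$Season)
## [1] "Autumn" "Spring" "Summer" "Winter"
# weekend is imported as logical; convert it to a factor and relabel the levels
Cycling$weekend <- Cycling$weekend %>% as.factor
Cycling$weekend <- Cycling$weekend %>% factor(levels = c("FALSE","TRUE"),
                                              labels = c("Weekday","Weekend"))
levels(Cycling$weekend)
## [1] "Weekday" "Weekend"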
The data has been summarised into tables presenting the counts and the flow of cyclists (%) in each weather season, with the percentages calculated separately for weekdays and weekends.
A clustered barchart then helps to visualise the potential association between weather seasons and flow of cyclists on weekdays and weekends. Refer to the next two slides.
If there was no association, the height of the bars (%) of weekend and weekday cyclists within each weather season would be the same.
While it is difficult to tell definitively from the clustered barchart, there does appear to be some difference in the percentage of weekend and weekday cyclists within each season. This is supported by the summary table.
The clustered barchart indicates:
** Spring and Autumn: likely to be a greater flow (%) of cyclists on weekdays than on weekends
** Summer and Winter: likely to be a greater flow (%) of cyclists on weekends than on weekdays.
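To put numbers on the differences described above, the sketch below rebuilds the column percentages from the counts reported in the summary table and subtracts the weekend share from the weekday share for each season. The object names (`counts`, `col_pct`, `pct_diff`) are illustrative and do not come from the original analysis.

```r
# Counts by season (rows) and day type (columns), copied from the summary table
counts <- matrix(
  c( 9297, 3728,
    10807, 4271,
     9656, 3998,
    10341, 4275),
  ncol = 2, byrow = TRUE,
  dimnames = list(Season = c("Autumn", "Spring", "Summer", "Winter"),
                  Day    = c("Weekday", "Weekend"))
)

# Percentage of weekday (and of weekend) records falling in each season
col_pct <- prop.table(counts, margin = 2) * 100

# Positive values: the season has a larger share of weekday than weekend flow
pct_diff <- col_pct[, "Weekday"] - col_pct[, "Weekend"]
round(pct_diff, 2)
```

Every difference is smaller than one percentage point, which is why the bars are hard to tell apart by eye.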
table1 <- table(Cycling$Season, Cycling$weekend)
knitr::kable(table1)
| | Weekday | Weekend |
|---|---|---|
| Autumn | 9297 | 3728 |
| Spring | 10807 | 4271 |
| Summer | 9656 | 3998 |
| Winter | 10341 | 4275 |
table2 <- table(Cycling$Season, Cycling$weekend) %>% prop.table(margin = 2) * 100
knitr::kable(table2)
| | Weekday | Weekend |
|---|---|---|
| Autumn | 23.18396 | 22.91052 |
| Spring | 26.94945 | 26.24754 |
| Summer | 24.07920 | 24.56981 |
| Winter | 25.78739 | 26.27212 |
library(RColorBrewer) # provides brewer.pal() colour palettes
# Column percentages: share of weekday (and of weekend) flow in each season
Season_weekend <- table(Cycling$Season, Cycling$weekend) %>% prop.table(margin = 2) * 100
barplot(Season_weekend, main = "Weekday and weekend cycling flows by weather season",
        ylab = "Percentage", xlab = "Weekend or weekday", beside = TRUE,
        legend.text = rownames(Season_weekend),
        args.legend = list(x = "top", horiz = TRUE, title = "Weather season"),
        ylim = c(0, 35), col = brewer.pal(4, name = "RdBu"))
Type of hypothesis test:
Chi-square test of association
Justification: to explore the association between two categorical variables.
Hypothesis:
Null hypothesis: there is no association in the population between weather season and weekend or weekday cycling (independence)
Alternate hypothesis: there is an association in the population between weather season and weekend or weekday cycling (dependence).
Assumptions:
No more than 25% of the cells have expected counts below 5.
Expected counts are calculated on the assumption that weather season and weekday/weekend cycling are independent.
Decision Rules:
Reject the null hypothesis if p-value < 0.05 significance level
Otherwise, fail to reject the null hypothesis.
Relative to the expected counts, weekend observations are over-represented in Summer and Winter and under-represented in Autumn and Spring.
Weekday observations are over-represented in Autumn and Spring and under-represented in Summer and Winter.
The Chi-square test of association assumes that no more than 25% of the cells have expected counts below 5. This assumption is met as all expected counts are above 5.
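The expected counts behind this check can be reproduced by hand: under independence, each expected count is the row total multiplied by the column total, divided by the grand total. The sketch below is a minimal illustration using the observed counts from the summary table; the object names are assumptions and not part of the original analysis.

```r
# Observed counts by season (rows) and day type (columns)
counts <- matrix(
  c( 9297, 3728,
    10807, 4271,
     9656, 3998,
    10341, 4275),
  ncol = 2, byrow = TRUE,
  dimnames = list(Season = c("Autumn", "Spring", "Summer", "Winter"),
                  Day    = c("Weekday", "Weekend"))
)

# Expected count for each cell: row total * column total / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)
round(expected, 2)

# Proportion of cells with an expected count below 5 (rule: at most 25%)
share_below_5 <- mean(expected < 5)
share_below_5 <= 0.25
```

With a sample this large every expected count is in the thousands, so the rule is satisfied with a wide margin.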
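The critical value for 3 degrees of freedom at the 0.05 significance level was found to be 7.81. The test statistic of 4.71 is below this value.
The investigation found the p-value to be 0.19.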
As this p-value was greater than the 0.05 level of significance, the decision is to fail to reject the null hypothesis.
Failure to reject the null hypothesis indicates that there was no statistically significant association between weather season and cycling flows on weekends and weekdays.
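As a cross-check on the decision above, the test statistic can be computed directly as the sum of (observed - expected)^2 / expected over all cells, with degrees of freedom (rows - 1) x (columns - 1). The following is a sketch under the assumption that the counts in the summary table are the full input; the variable names are illustrative.

```r
# Observed counts by season (rows) and day type (columns)
counts <- matrix(
  c( 9297, 3728,
    10807, 4271,
     9656, 3998,
    10341, 4275),
  ncol = 2, byrow = TRUE,
  dimnames = list(Season = c("Autumn", "Spring", "Summer", "Winter"),
                  Day    = c("Weekday", "Weekend"))
)

# Expected counts under independence
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Pearson chi-square statistic and its degrees of freedom
x2 <- sum((counts - expected)^2 / expected)
df <- (nrow(counts) - 1) * (ncol(counts) - 1)

# Critical value at the 0.05 level and the upper-tail p-value
crit  <- qchisq(0.95, df)
p_val <- pchisq(x2, df, lower.tail = FALSE)

# Decision rule: reject the null hypothesis only if p_val < 0.05
reject_null <- p_val < 0.05
c(x2 = x2, df = df, crit = crit, p_val = p_val)
```

Comparing the statistic with the critical value and comparing the p-value with 0.05 are two views of the same rule, so they always lead to the same decision.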
chi1 <- chisq.test(table(Cycling$weekend, Cycling$Season))
chi1
##
##  Pearson's Chi-squared test
##
## data:  table(Cycling$weekend, Cycling$Season)
## X-squared = 4.706, df = 3, p-value = 0.1946
chi1$observed # observed values
##
##           Autumn Spring Summer Winter
##   Weekday   9297  10807   9656  10341
##   Weekend   3728   4271   3998   4275
chi1$expected %>% round(2) # expected values
##
##            Autumn   Spring  Summer   Winter
##   Weekday 9265.35 10725.75 9712.79 10397.11
##   Weekend 3759.65  4352.25 3941.21  4218.89
chi1$stdres %>% round(2) # standardised residuals
##
##           Autumn Spring Summer Winter
##   Weekday   0.70   1.71  -1.23  -1.19
##   Weekend  -0.70  -1.71   1.23   1.19
qchisq(p = .95, df = 3) # critical value at the 0.05 significance level
## [1] 7.814728
pchisq(q = 4.706, df = 3, lower.tail = FALSE) # p-value from the rounded statistic
## [1] 0.1946353
chi1$p.value # p-value from the unrounded statistic
## [1] 0.194632
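The over- and under-representation described earlier is read from the standardised residuals. As a rough sketch, they can be rebuilt from the observed counts and compared against the conventional cut-off of 1.96 in absolute value; the variable names below are illustrative, and the table is laid out with seasons as rows rather than columns.

```r
# Observed counts by season (rows) and day type (columns)
counts <- matrix(
  c( 9297, 3728,
    10807, 4271,
     9656, 3998,
    10341, 4275),
  ncol = 2, byrow = TRUE,
  dimnames = list(Season = c("Autumn", "Spring", "Summer", "Winter"),
                  Day    = c("Weekday", "Weekend"))
)

n <- sum(counts)

# Expected counts under independence
expected <- outer(rowSums(counts), colSums(counts)) / n

# Variance adjustment: (1 - row proportion) * (1 - column proportion)
adj <- outer(1 - rowSums(counts) / n, 1 - colSums(counts) / n)

# Standardised residuals: (observed - expected) / sqrt(expected * adjustment)
std_res <- (counts - expected) / sqrt(expected * adj)
round(std_res, 2)

# Does any cell depart from independence by more than 1.96 standard errors?
any(abs(std_res) > 1.96)
```

No cell exceeds the cut-off, which is in line with the non-significant overall test.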
Major findings
A Chi-square test of association was used to test for a statistically significant association between cycling flows on weekends and weekdays, and weather season (summer, winter, autumn, spring).
The results of this study did not find a statistically significant association between weather season and cycling flows on weekdays and weekends, χ² = 4.71, df = 3, p = 0.19.
The results are consistent with weather season and the flow of cyclists on weekends and weekdays being independent variables.
Strengths and limitations
While this investigation considered real data, it’s important to note one of the potential limitations.
It is unclear how the original sample was sourced and whether a probability based sampling method was used. It is unclear whether the times, locations and dates chosen to conduct the survey were biased in any way.
The investigator would need to follow up with a proper random sample and repeat the Chi-square test of association before being confident in the conclusion.
A strength of this investigation is that, while it was not possible to collate data on all cycling flows during 2005-2012, the sample was relatively large (with over 56,000 records).
Data.vic.gov.au, ‘Bicycle Volumes - VicRoads’, accessed 23 May 2017.