These are the solutions for Computer Lab 11. They use data sourced from Czarniecka-Skubina et al. (2021) and the datarium R package (Kassambara 2019).
No answer required.
We can test the following hypotheses:
\(H_0\): There is no significant difference between the observed and expected distribution of proportions of coffee consumption frequency of Poles.
\(H_1\): There is a significant difference between the observed and expected distribution of proportions of coffee consumption frequency of Poles.
We have 7 categories of coffee consumption frequency. Therefore this test has \(7-1 = 6\) degrees of freedom.
Example code is provided below:
obs.freq <- c(108, 74, 38, 146, 335, 426, 373)
exp.prop <- c(0.1, 0.05, 0.05, 0.1, 0.2, 0.3, 0.2)
Example code is provided below. Note that we use the data prepared in 1.4 above.
chisq.gof <- chisq.test(x = obs.freq, p = exp.prop)
chisq.gof
##
## Chi-squared test for given probabilities
##
## data: obs.freq
## X-squared = 53.26, df = 6, p-value = 1.04e-09
The test statistic is \(53.26\) and the \(p\)-value is \(1.04 \times 10^{-9}\), which is far below \(0.05\). Based on these results, we reject \(H_0\) at the 5% level of significance, and conclude that there is a significant difference between the observed and expected distribution of proportions of coffee consumption frequency of Poles.
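As a check on the output above, the test statistic can be reproduced directly from its definition, \(\chi^2 = \sum_i (O_i - E_i)^2 / E_i\). The following is a minimal sketch in base R. The object names `exp.count`, `chisq.stat`, `dof`, `p.value` and `crit.value` are introduced here for illustration and are not part of the lab.

```r
# Observed counts and hypothesised proportions, as defined earlier
obs.freq <- c(108, 74, 38, 146, 335, 426, 373)
exp.prop <- c(0.1, 0.05, 0.05, 0.1, 0.2, 0.3, 0.2)

# Expected counts under H0: total sample size times each proportion
exp.count <- sum(obs.freq) * exp.prop

# Pearson Chi-squared statistic: sum of (observed - expected)^2 / expected
chisq.stat <- sum((obs.freq - exp.count)^2 / exp.count)

# Degrees of freedom: number of categories minus one
dof <- length(obs.freq) - 1

# Upper-tail p-value and the 5% critical value for this many degrees of freedom
p.value <- pchisq(chisq.stat, df = dof, lower.tail = FALSE)
crit.value <- qchisq(0.95, df = dof)

round(chisq.stat, 2)
p.value
crit.value
```

The statistic should agree with the `X-squared` value reported by `chisq.test`. It exceeds the 5% critical value for 6 degrees of freedom, which is another way of seeing why \(H_0\) is rejected.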
We note from the R output below that the expected counts are 150, 75, 75, 150, 300, 450, 300. Since all expected counts are greater than 5, the expected count condition for the Chi-squared approximation is met.
Hence our assumptions are satisfied.
chisq.gof$expected
## [1] 150 75 75 150 300 450 300
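The same check can be scripted rather than read off by eye. This is a sketch under the common rule of thumb that every expected count should be at least 5. The names `exp.count`, `props.ok` and `counts.ok` are illustrative. It also confirms that the hypothesised proportions sum to 1, which `chisq.test` requires unless `rescale.p = TRUE` is set.

```r
obs.freq <- c(108, 74, 38, 146, 335, 426, 373)
exp.prop <- c(0.1, 0.05, 0.05, 0.1, 0.2, 0.3, 0.2)

# The hypothesised proportions must sum to 1 (up to floating-point error)
props.ok <- isTRUE(all.equal(sum(exp.prop), 1))

# Expected counts: total sample size times each hypothesised proportion
exp.count <- sum(obs.freq) * exp.prop

# Rule of thumb (assumed here): all expected counts are at least 5
counts.ok <- all(exp.count >= 5)

props.ok
counts.ok
```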
Example R code is provided below:
# This first part creates the observations for the different frequency categories
coffee.consumption <- as.factor(c(rep("Once.Month", 108), rep("Three.per.Month", 74),
rep("Once.Week", 38), rep("Three.or.Four.Week", 146),
rep("Once.Day", 335), rep("Twice.Day", 426),
rep("Three.or.Four.Day", 373)))
# This second part randomizes the order - we now have an object similar to what we might encounter
# if we were analysing a new data set.
coffee.consumption <- sample(coffee.consumption, 1500, replace = FALSE)
Example R code is provided below:
coffee.consumption <- ordered(coffee.consumption,
levels = c("Once.Month", "Three.per.Month", "Once.Week",
"Three.or.Four.Week", "Once.Day", "Twice.Day",
"Three.or.Four.Day"))
Example R code is provided below:
coffee.table <- table(coffee.consumption)
Example R code is provided below.
chisq.test(x = coffee.table, p = exp.prop)
##
## Chi-squared test for given probabilities
##
## data: coffee.table
## X-squared = 53.26, df = 6, p-value = 1.04e-09
The Chi-squared test results are identical to those obtained in 1.5, as expected.
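The reordering step above matters because `as.factor` sorts its levels alphabetically by default, so a table built from the unordered factor would not line up with `exp.prop`. A small sketch of the difference, using illustrative names (`freq.labels`, `counts`, `raw`, `alpha.table`, `ordered.table`):

```r
freq.labels <- c("Once.Month", "Three.per.Month", "Once.Week",
                 "Three.or.Four.Week", "Once.Day", "Twice.Day",
                 "Three.or.Four.Day")
counts <- c(108, 74, 38, 146, 335, 426, 373)

# Default factor: levels are sorted alphabetically
raw <- as.factor(rep(freq.labels, times = counts))
alpha.table <- table(raw)

# Factor with explicit levels: counts follow the intended category order
ordered.table <- table(factor(raw, levels = freq.labels))

levels(raw)
as.vector(alpha.table)    # counts in alphabetical order of the labels
as.vector(ordered.table)  # counts in the order used by exp.prop
```

Passing `alpha.table` to `chisq.test` together with `exp.prop` would pair each count with the wrong hypothesised proportion, so the levels should be set before tabulating.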
Here, our null and alternative hypotheses are:
\(H_0\): There is no association between coffee consumption frequency and age of Poles.
\(H_1\): There is an association between coffee consumption frequency and age of Poles.
The degrees of freedom for our Chi-square test of independence will be \((5-1) \times (2-1) = 4\), since we have \(5\) rows and \(2\) columns.
Example code is provided below:
group1 <- c(261, 174)
group2 <- c(167, 72)
group3 <- c(227, 31)
group4 <- c(269, 43)
group5 <- c(228, 28)
table <- rbind(group1, group2, group3, group4, group5)
chisq.toi <- chisq.test(table)
chisq.toi
##
## Pearson's Chi-squared test
##
## data: table
## X-squared = 130.59, df = 4, p-value < 2.2e-16
The test statistic is \(130.59\) and the \(p\)-value is less than \(2.2 \times 10^{-16}\). Based on these results, we reject \(H_0\) at the 5% level of significance, and conclude that there is an association between coffee consumption frequency and age of Poles.
It would be interesting to conduct this test for segments of the population, to see if this association holds when considering only Poles within a certain age group (e.g. 18-30).
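One way to see which cells drive the association is to inspect the Pearson residuals, \((O - E)/\sqrt{E}\), which `chisq.test` returns in its `residuals` component. The sketch below uses the same counts as above. The object names `age.table`, `toi`, `pearson.res` and `largest` are illustrative, and the column labels are an assumption based on the description of the two columns later in these solutions (daily versus less than daily).

```r
# Same counts as in the test of independence; column labels are assumed
age.table <- rbind(group1 = c(261, 174),
                   group2 = c(167, 72),
                   group3 = c(227, 31),
                   group4 = c(269, 43),
                   group5 = c(228, 28))
colnames(age.table) <- c("Daily", "Less.than.Daily")

toi <- chisq.test(age.table)

# Pearson residuals: (observed - expected) / sqrt(expected)
pearson.res <- toi$residuals
round(pearson.res, 2)

# Row and column position of the largest absolute residual
largest <- which(abs(pearson.res) == max(abs(pearson.res)), arr.ind = TRUE)
largest
```

Positive residuals mark cells with more observations than expected under independence. With these counts, the largest residual sits in the first row of the second column, that is, the first group is over-represented among those who drink coffee less than daily.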
We note from the R output below that the expected counts, listed column by column, are 334.08, 183.552, 198.144, 239.616, 196.608, 100.92, 55.448, 59.856, 72.384, 59.392. Since all expected counts are greater than 5, the expected count condition for the Chi-squared approximation is met.
Hence our assumptions are satisfied.
chisq.toi$expected
## [,1] [,2]
## group1 334.080 100.920
## group2 183.552 55.448
## group3 198.144 59.856
## group4 239.616 72.384
## group5 196.608 59.392
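The expected counts above follow from the rule \(E_{ij} = (\text{row total}_i \times \text{column total}_j)/n\). A minimal sketch reproducing them, and the test statistic, in base R is given below. The names `age.table`, `exp.count`, `chisq.stat`, `dof` and `p.value` are illustrative.

```r
age.table <- rbind(group1 = c(261, 174),
                   group2 = c(167, 72),
                   group3 = c(227, 31),
                   group4 = c(269, 43),
                   group5 = c(228, 28))

# Total sample size
n <- sum(age.table)

# Expected counts under independence: (row total x column total) / n
exp.count <- outer(rowSums(age.table), colSums(age.table)) / n

# Pearson Chi-squared statistic, summed over all cells
chisq.stat <- sum((age.table - exp.count)^2 / exp.count)

# Degrees of freedom: (rows - 1) x (columns - 1)
dof <- (nrow(age.table) - 1) * (ncol(age.table) - 1)

# Upper-tail p-value
p.value <- pchisq(chisq.stat, df = dof, lower.tail = FALSE)

exp.count
round(chisq.stat, 2)
dof
```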
install.packages("datarium")
library(datarium)
Example R code is provided below:
property.table <- table(properties$property_type, properties$buyer_type)
Example R code is provided below:
chisq.test(property.table)
##
## Pearson's Chi-squared test
##
## data: property.table
## X-squared = 82.504, df = 9, p-value = 5.134e-14
The test statistic is \(82.504\) and the \(p\)-value is \(5.134 \times 10^{-14}\). Based on these results, we reject \(H_0\) at the 5% level of significance, and conclude that there is an association between property type and buyer type.
prop.test versus chisq.test
No answer required.
No answer required.
No answer required.
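Although these questions need no written answer, the relationship between the two functions can be illustrated with a short sketch. For two groups, `prop.test` and `chisq.test` on the corresponding \(2 \times 2\) table give the same statistic and \(p\)-value, since both apply the continuity correction by default. The choice of the first two groups from the earlier table is purely illustrative, and the names `daily`, `totals`, `two.by.two`, `pt` and `ct` are not part of the lab.

```r
# Illustrative data: daily coffee drinkers and group sizes for two groups
daily <- c(261, 167)
totals <- c(435, 239)

# Two-sample test of equal proportions
pt <- prop.test(x = daily, n = totals)

# The same data arranged as a 2 x 2 table of daily versus not daily
two.by.two <- cbind(daily = daily, not.daily = totals - daily)
ct <- chisq.test(two.by.two)

# Both tests report the same statistic and p-value
unname(pt$statistic)
unname(ct$statistic)
pt$p.value
ct$p.value
```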
Run the following code to load the plotly package in R.
#install.packages("plotly")
library(plotly)
Example R code is provided below:
options <- c("Observed", "Expected")
once.month <- c(7.2, 10)
three.per.month <- c(4.93, 5)
once.week <- c(2.53, 5)
three.four.per.week <- c(9.73, 10)
once.day <- c(22.33, 20)
twice.day <- c(28.4, 30)
three.four.per.day <- c(24.87, 20)
data <- data.frame(options, once.month, three.per.month, once.week, three.four.per.week, once.day, twice.day, three.four.per.day)
fig <- plot_ly(data, x = ~options, y = ~once.month, type = 'bar', name = 'Once per month',
text = once.month, textposition = 'auto')
fig <- fig %>% add_trace(y = ~three.per.month, name = 'Three times per month',
text = three.per.month, textposition = 'auto')
fig <- fig %>% add_trace(y = ~once.week, name = 'Once per week',
text = once.week, textposition = 'auto')
fig <- fig %>% add_trace(y = ~three.four.per.week, name = 'Three to four times per week',
text = three.four.per.week, textposition = 'auto')
fig <- fig %>% add_trace(y = ~once.day, name = 'Once per day',
text = once.day, textposition = 'auto')
fig <- fig %>% add_trace(y = ~twice.day, name = 'Twice per day',
text = twice.day, textposition = 'auto')
fig <- fig %>% add_trace(y = ~three.four.per.day, name = 'Three to four times per day',
text = three.four.per.day, textposition = 'auto')
fig <- fig %>% layout(yaxis = list(title = 'Cumulative Percentage'),
xaxis = list(title = " ", categoryorder = 'category descending'),
barmode = 'stack')
fig
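The observed percentages plotted above are each category's share of the 1500 respondents, and the expected percentages are the hypothesised proportions multiplied by 100. Rather than typing them in by hand, they could be derived from the earlier objects. The following is a sketch, with `obs.pct`, `exp.pct` and `pct.table` as illustrative names.

```r
obs.freq <- c(108, 74, 38, 146, 335, 426, 373)
exp.prop <- c(0.1, 0.05, 0.05, 0.1, 0.2, 0.3, 0.2)

# Observed percentages: each count as a share of the total, to 2 decimal places
obs.pct <- round(100 * obs.freq / sum(obs.freq), 2)

# Expected percentages: hypothesised proportions on the percentage scale
exp.pct <- 100 * exp.prop

# Side-by-side comparison, one column per frequency category
pct.table <- rbind(Observed = obs.pct, Expected = exp.pct)
colnames(pct.table) <- c("once.month", "three.per.month", "once.week",
                         "three.four.per.week", "once.day", "twice.day",
                         "three.four.per.day")
pct.table
```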
Example R code is provided below.
Note that the total number of people, across the different age categories, who drink coffee at least once a day is 1152, while the total number for those who drink coffee less than once a day is 348.
options2 <- c("Drink Coffee Daily", "Drink Coffee less than Daily")
total <- c(1152, 348)
age.18.25 <- round(c(261, 174) / total, 2)
age.26.30 <- round(c(167, 72) / total, 2)
age.31.40 <- round(c(227, 31) / total, 2)
age.41.50 <- round(c(269, 43) / total, 2)
age.51.65 <- round(c(228, 28) / total, 2)
data2 <- data.frame(options2, age.18.25, age.26.30, age.31.40, age.41.50, age.51.65)
fig2 <- plot_ly(data2, x = ~options2, y = ~age.18.25, type = 'bar', name = '18-25',
text = age.18.25, textposition = 'auto')
fig2 <- fig2 %>% add_trace(y = ~age.26.30, name = '26-30',
text = age.26.30, textposition = 'auto')
fig2 <- fig2 %>% add_trace(y = ~age.31.40, name = '31-40',
text = age.31.40, textposition = 'auto')
fig2 <- fig2 %>% add_trace(y = ~age.41.50, name = '41-50',
text = age.41.50, textposition = 'auto')
fig2 <- fig2 %>% add_trace(y = ~age.51.65, name = '51-65',
text = age.51.65, textposition = 'auto')
fig2 <- fig2 %>% layout(yaxis = list(title = 'Proportion'),
xaxis = list(title = " ", categoryorder = 'category descending'),
barmode = 'stack')
fig2
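The proportions used in this plot are column proportions: each count divided by the total for its consumption category. As an alternative to dividing each age group by `total` separately, `prop.table` with `margin = 2` does this in one step. This is a sketch only, with `age.counts` and `col.props` as illustrative names and row labels taken from the age groups used above.

```r
# Counts by age group (rows) and consumption category (columns)
age.counts <- cbind(daily = c(261, 167, 227, 269, 228),
                    less.than.daily = c(174, 72, 31, 43, 28))
rownames(age.counts) <- c("18-25", "26-30", "31-40", "41-50", "51-65")

# Column totals: number of respondents in each consumption category
colSums(age.counts)

# Column proportions: each count divided by its column total
col.props <- round(prop.table(age.counts, margin = 2), 2)
col.props
```

Each column of `col.props` describes the age composition of one consumption category, which is what the stacked bars display.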
These notes have been prepared by Rupert Kuveke. The copyright for the material in these notes resides with the author named above, with the Department of Mathematics and Statistics and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.