The goal is to adapt the code and add the assumption tests.
This Rmd file requires the rmarkdown package to be installed, as specified in the tutorial.
# Assume data is loaded as data frames: area_A and area_B
# However, to keep it simple, we'll create some example data
## this simulates the sampling of air quality at two locations,
## with 15 samples for area A and 10 for area B
set.seed(2023)
area_A <- data.frame(PM25 = rnorm(15, mean = 15, sd = 5))
area_B <- data.frame(PM25 = rnorm(10, mean = 10, sd = 3.5))
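A quick inspection of the simulated samples, as a minimal sketch using base R summaries, helps confirm the simulation behaves as intended:
## quick look at the simulated samples
summary(area_A$PM25)
sd(area_A$PM25)
summary(area_B$PM25)
sd(area_B$PM25)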
With the simple one-sample t-test we can test whether an area has an expected PM2.5 value that differs from a given mean under the null hypothesis. For instance, there could be an environmental quality norm: the national standard for air quality. Let's say this is a PM2.5 value of 12. We can then test whether an area, such as area A, is above the norm.
national_standard <- 12 ## air quality norm
t_test_one_sample <- t.test(area_A$PM25, mu = national_standard)
t_test_one_sample
##
## One Sample t-test
##
## data: area_A$PM25
## t = 1.9426, df = 14, p-value = 0.07245
## alternative hypothesis: true mean is not equal to 12
## 95 percent confidence interval:
## 11.78886 16.26888
## sample estimates:
## mean of x
## 14.02887
The results indicate no evidence of a significant violation of the norm (p = 0.072, so we fail to reject at the 5% level). Demonstrating a violation, if there is one, would require more samples.
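Two follow-ups, as sketches: since the question is specifically whether area A is above the norm, a one-sided alternative is defensible, and power.t.test gives a rough idea of the sample size needed. The effect size of 2 and sd of 5 below are illustrative assumptions, not estimates from the data.
## one-sided test: is the mean above the norm?
t.test(area_A$PM25, mu = national_standard, alternative = "greater")
## rough sample size needed to detect an exceedance of 2 (assumed sd = 5)
power.t.test(delta = 2, sd = 5, sig.level = 0.05, power = 0.8,
             type = "one.sample", alternative = "one.sided")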
## look at the data with a histogram
hist(area_A$PM25, main = "Air Quality Area A",
     xlab = "PM2.5", breaks = 10)
## better yet use a qqnorm with the 1:1 line
qqnorm(area_A$PM25)
qqline(area_A$PM25)
## or use a fancier qqplot from the car package
## (it prints the row indices of the most extreme points)
car::qqPlot(area_A$PM25)
## [1] 3 6
In the next scenario, an intervention was taken in the area, such as a lower speed limit for cars. You now want to know whether this had an effect by measuring before and after the intervention at exactly the same points. If we have measurements at two points in time, we can test for an improvement at the same locations with the paired t-test. The observations themselves are not independent, because we sampled each location twice, but the differences between the two time points are.
# Assuming a new dataset for area_A after regulation
set.seed(456)
area_A_after <- data.frame(PM25 = rnorm(15, mean = 10, sd = 4))
t_test_paired <- t.test(area_A$PM25, area_A_after$PM25, paired = TRUE)
t_test_paired
##
## Paired t-test
##
## data: area_A$PM25 and area_A_after$PM25
## t = 2.3, df = 14, p-value = 0.03735
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## 0.2400164 6.8733308
## sample estimates:
## mean difference
## 3.556674
## using the "cool" qqplot
car::qqPlot(area_A$PM25 - area_A_after$PM25)
## [1] 3 6
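As a sanity check, the paired t-test is identical to a one-sample t-test on the within-location differences, so the sketch below should reproduce the t, df, and p-value above:
## paired t-test == one-sample t-test on the differences
t.test(area_A$PM25 - area_A_after$PM25, mu = 0)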
It appears that the intervention of lowering the speed limit had the desired effect.
In this case we want to test whether two areas have different expected (mean) air quality values. The motivation could be finding a good location for a rehabilitation centre for people with compromised immune systems: you would want to choose the area with the best air quality.
You get two versions of the t-test below; test the assumptions to find out which is the better test to use.
t_test_two_sample_equal_var <- t.test(area_A$PM25, area_B$PM25, var.equal = TRUE)
t_test_two_sample_equal_var
##
## Two Sample t-test
##
## data: area_A$PM25 and area_B$PM25
## t = 2.0166, df = 23, p-value = 0.05556
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.07214112 5.66109839
## sample estimates:
## mean of x mean of y
## 14.02887 11.23439
t_test_two_sample_unequal_var <- t.test(area_A$PM25, area_B$PM25, var.equal = FALSE)
t_test_two_sample_unequal_var
##
## Welch Two Sample t-test
##
## data: area_A$PM25 and area_B$PM25
## t = 2.2893, df = 21.619, p-value = 0.03221
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.260352 5.328605
## sample estimates:
## mean of x mean of y
## 14.02887 11.23439
## equal sample sizes? --> t.test handles unequal sizes automatically
length(area_A$PM25)
## [1] 15
length(area_B$PM25)
## [1] 10
## Normality
par(mfrow=c(1,2))
car::qqPlot(area_A$PM25)
## [1] 3 6
car::qqPlot(area_B$PM25)
## [1] 8 10
## equal variances? How to check that?
par(mfrow = c(1, 2))
boxplot(area_A$PM25, main = "A", ylim = c(0, 25))
boxplot(area_B$PM25, main = "B", ylim = c(0, 25))
## variance test
var.test(area_A$PM25, area_B$PM25)
##
## F test to compare two variances
##
## data: area_A$PM25 and area_B$PM25
## F = 4.0975, num df = 14, denom df = 9, p-value = 0.03916
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 1.078878 13.150197
## sample estimates:
## ratio of variances
## 4.097528
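The F test above is itself sensitive to non-normality. A more robust option, sketched here with the car package already used for qqPlot, is Levene's test on the stacked data; the pm_stacked data frame is constructed purely for illustration.
## Levene's test: more robust to non-normality than var.test
pm_stacked <- data.frame(
  PM25 = c(area_A$PM25, area_B$PM25),
  area = factor(rep(c("A", "B"), times = c(nrow(area_A), nrow(area_B))))
)
car::leveneTest(PM25 ~ area, data = pm_stacked)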
In this RMarkdown document, we've created fictional air quality datasets for areas A and B, before and after certain interventions. The t-tests were performed according to the theory outlined in the slides from the morning session.
In this section, we’ll generate skewed data that violates the normality assumption required for the t-test. We’ll use the Mann-Whitney U test as a non-parametric alternative to the t-test.
Here we will work with the following scenario: a day-care centre in area A has the option of moving to area C. Given that area A's air quality is too poor, it would be highly beneficial for the children of the day-care in area A (ages 0 to 3) to move to area C … if area C is cleaner most of the time. Can we detect a difference?
As you will see, using a test whose assumptions are violated is not optimal: you risk not using the most powerful test (it is a myth that the t-test is always the most powerful!). In this case it would lead to the incorrect conclusion that the newly sampled area C is just as polluted as area A.
# Generating skewed data: 13 typical values plus two extreme outliers
set.seed(101)
area_C_skewed <- data.frame(PM25 = c(rnorm(15 - 2, mean = 2.5, sd = 5), 90, 78))
## Checking normality using QQnorm
car::qqPlot(area_C_skewed$PM25)
## [1] 14 15
## alternative: KS-test against a normal with the sample mean and sd
ks.test(area_C_skewed$PM25, "pnorm", mean(area_C_skewed$PM25),
        sd(area_C_skewed$PM25)) # two-sided, exact
##
## Exact one-sample Kolmogorov-Smirnov test
##
## data: area_C_skewed$PM25
## D = 0.43545, p-value = 0.004192
## alternative hypothesis: two-sided
## alternative: Shapiro-Wilk test
shapiro.test(area_C_skewed$PM25)
##
## Shapiro-Wilk normality test
##
## data: area_C_skewed$PM25
## W = 0.53161, p-value = 6.4e-06
The QQ plot indicates a violation of the normality assumption. The t-test is therefore suspect: with only 15 observations per group, it is unlikely that the sampling distribution of the mean has converged to the normal under the CLT.
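To illustrate the CLT point, here is a small resampling sketch (2000 bootstrap resamples, an arbitrary choice): the sampling distribution of the mean of the skewed data is itself still far from normal at this sample size.
## resample the skewed data and inspect the distribution of the mean
set.seed(1)
boot_means <- replicate(2000, mean(sample(area_C_skewed$PM25, replace = TRUE)))
car::qqPlot(boot_means)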
The Mann-Whitney U test is a suitable non-parametric alternative to the t-test when the normality assumption is violated. It tests whether values in one group tend to be larger than in the other (a location shift), without assuming a normal distribution.
In this new section, I added outliers that skewed the data and violated the normality assumption. In general, use qqplots to check normality; you can also use the KS-test or the Shapiro-Wilk test, whose p-values above confirm the violation of the normality assumption. Just realize, however, that these tests are not always the best way to check normality: at large sample sizes they reject even trivial departures from normality.
# Conducting Mann-Whitney U Test
t_testViolated <- t.test(area_A$PM25, area_C_skewed$PM25, var.equal = FALSE)
mann_whitney <- wilcox.test(area_A$PM25, area_C_skewed$PM25)
# Printing test summary
t_testViolated
##
## Welch Two Sample t-test
##
## data: area_A$PM25 and area_C_skewed$PM25
## t = -0.072905, df = 14.566, p-value = 0.9429
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -16.40035 15.31824
## sample estimates:
## mean of x mean of y
## 14.02887 14.56992
mann_whitney
##
## Wilcoxon rank sum exact test
##
## data: area_A$PM25 and area_C_skewed$PM25
## W = 192, p-value = 0.0005778
## alternative hypothesis: true location shift is not equal to 0
If we operated under the false assumption that the t-test is always the most powerful test (an oft-repeated inaccuracy), we would have incorrectly concluded that there was no reason for the day-care to move. A test is only optimally powerful if none of its assumptions are violated; when they are, alternatives may be much more powerful.
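A quick simulation sketch backs this up for the scenario above (the generating values mirror the fictional data; 1000 replications is an arbitrary choice): under this contamination the Wilcoxon test rejects far more often than the Welch t-test.
## power comparison under outlier contamination (illustrative assumptions)
set.seed(42)
rejections <- replicate(1000, {
  x <- rnorm(15, mean = 15, sd = 5)             # like area A
  y <- c(rnorm(13, mean = 2.5, sd = 5), 90, 78) # like skewed area C
  c(t_test = t.test(x, y)$p.value < 0.05,
    wilcoxon = wilcox.test(x, y)$p.value < 0.05)
})
rowMeans(rejections) # proportion of rejections for each test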