For all of the applicable parts, plots must be generated using the functions within the ggplot2 package.
Problem 1. Before answering the following questions, take a few minutes to look through the components of the built-in R data set “quakes,” which contains data on n = 1000 seismic events, with magnitudes 4.0 or higher, near the island of Fiji since 1964. Using the “cut()” function, create and save a new vector defining the depths of each seismic event according to the following three intervals: (0, 200], (200, 400], and (400, 700]. Refer to your Lab 4 Activities for guidance on using the “cut()” function.
depth_cat<-with(quakes, cut(depth, breaks=c(0,200,400,700)))
library(ggplot2)
ggplot(quakes, aes(x=depth_cat, y=stations, fill=depth_cat,)) +
geom_boxplot() +
scale_fill_brewer(palette = "Greens") +
coord_flip() +
labs(title = "Boxplot Comparing The Depth of Seismic Events off Fiji with the # of Stations Reporting", x = "Depth of Seismic Event", y = "# of Stations Reporting", fill = "Depth (km)")
Based on the boxplot above, the distributions of # of stations reporting
on seismic events by the depth of the seismic events is very similar. It
appears that all three distributions for seismic events with depths
between 0-200km, 200-400km, and 400-700km are all skewed right with
roughly equal median numbers of stations reporting. It is interesting to
note that all three distributions feature many high outliers. Based only
on the boxplot above, there appears to be no real difference in the
number of stations reporting a seismic event and the depth of said
seismic event.
Check and interpret the following One-Way ANOVA assumptions:
• Normality using both the Normality QQ Plot and Shapiro-Wilk Test, remembering to check this assumption for each depth interval (group).
ggplot(quakes, aes(sample = stations, color = depth_cat, shape = depth_cat)) +
geom_qq(size =2, alpha = 0.5) +
geom_qq_line(lwd=1)+
labs(title = "Normal QQ Plot of The # of Stations Reporting a Seismic Event by The Depth of The Seismic Event",x="Theoretical Quantiles",y="Sample Quantiles") +
scale_color_brewer(palette = "Greens")
with(quakes, shapiro.test(stations[depth_cat=="(0,200]"]))
##
## Shapiro-Wilk normality test
##
## data: stations[depth_cat == "(0,200]"]
## W = 0.85602, p-value < 2.2e-16
with(quakes, shapiro.test(stations[depth_cat=="(200,400]"]))
##
## Shapiro-Wilk normality test
##
## data: stations[depth_cat == "(200,400]"]
## W = 0.81231, p-value = 3.675e-14
with(quakes, shapiro.test(stations[depth_cat=="(400,700]"]))
##
## Shapiro-Wilk normality test
##
## data: stations[depth_cat == "(400,700]"]
## W = 0.82757, p-value < 2.2e-16
Since the distribution of each of the Normal Probability Plots (QQ Plots) for each depth group appeared to be curved and the Shapiro-Wilk normality test p-values for each group were << the significance level of 0.05 we reject the null hypothesis. Thus there is strong enough statistical evidence to day that the distributions of the number of stations reporting a seismic event for each depth interval are not normally distributed.
• Equal Variances using Bartlett’s Test or Levene’s Test - choose which is best!
Since we found the distributions of the data to likely not be normally distributed the best test to evaluate the equality of the variances between the depth groups would be Levene’s Test, since it is robust against strays in Normality.
library(car)
## Warning: package 'car' was built under R version 4.5.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.5.2
with(quakes, leveneTest(stations~depth_cat))
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 3.1182 0.04467 *
## 997
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p-value of 0.04467 for the Levene’s Test for Homogeneity of Variance is below the significance level of 0.05, we fail to reject the null hypothesis for Levene’s Test. Thus there is strong enough statistical evidence to say that there is likely no difference between the variences for each group, therefore the variances are likely equal.
Since the distribution of the data was found to be likely not be Normally distributed, this violates one of the assumptions for the parametric One-Way ANOVA test. Thus a nonparametric alternative like Kruskal-Wallis would be more powerful as it does not rely on a normality assumption.
kruskal.test(stations~depth_cat, data=quakes)
##
## Kruskal-Wallis rank sum test
##
## data: stations by depth_cat
## Kruskal-Wallis chi-squared = 9.0943, df = 2, p-value = 0.0106
According to the results of the Kruskal-Wallis rank sum test, the p-value is 0.0106 which is less the significance level of 0.05, thus we reject the null hypothesis. There is strong enough statistical evidence to say that at least one of the groups distribution is different from the others. In other words, at least one of the seismic event depth groups gets reported with a different distribution than the other depth groups.
quakes_aov<-aov(stations~depth_cat, data = quakes)
summary(quakes_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## depth_cat 2 4364 2182.1 4.582 0.0104 *
## Residuals 997 474783 476.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p-value of the ANOVA test was 0.0104 which is also less than the significance level of 0.05 we once again reject the null hypothesis. Therefore we can once again say that there is strong enough statistical difference in the distribution of at least one of the depth groups. This is the same conclusion that was reached using the more robust Kruskal-Wallis test. This could be because only one of the assumptions for the ANOVA test was violated as the variances of the depth groups was still equal. However even though the same conclusion was reached, the p-values of each test were different, as the p-value of the ANOVA was less than that of the Kruskal-Wallis. This could be because of the normality assumption violation.
Since we have already established that the distribution of data for each depth group is likely not normally distributed, Tukey’s HSD would be not recommended as it assumes a normal distribution. Fisher’s LSD however, does not assume normality of data and thus would be the appropriate test to use for this situation.
library(agricolae)
## Warning: package 'agricolae' was built under R version 4.5.2
print(LSD.test(quakes_aov,trt="depth_cat",p.adj="bonferroni"))
## $statistics
## MSerror Df Mean CV
## 476.2118 997 33.418 65.30097
##
## $parameters
## test p.ajusted name.t ntr alpha
## Fisher-LSD bonferroni depth_cat 3 0.05
##
## $means
## stations std r se LCL UCL Min Max Q25 Q50 Q75
## (0,200] 35.63636 22.91381 418 1.067362 33.54183 37.73090 10 123 19 29 45
## (200,400] 30.07568 19.13864 185 1.604406 26.92728 33.22408 10 132 17 25 35
## (400,700] 32.63980 21.81439 397 1.095229 30.49058 34.78902 10 129 18 25 41
##
## $comparison
## NULL
##
## $groups
## stations groups
## (0,200] 35.63636 a
## (400,700] 32.63980 ab
## (200,400] 30.07568 b
##
## attr(,"class")
## [1] "group"
According to the Fischer’s LSD, Seismic events with depths between 0-200km have significantly higher mean number of reporting stations than those at depths between 200-400km (because they don’t share a letter and the mean is higher: 35.63636 > 30.07568). However seismic events at depths between 400-700km appear to have no significant differences in the mean number of reporting stations between either of the other groups (which is why this group shares the letter designation ab, similar to both a and b). Overall, the 0–200 km depth interval appears to have the highest mean number of reporting stations, although it is not significantly different from the 400–700 km interval.