MA2612Lab4

For all of the applicable parts, plots must be generated using the functions within the ggplot2 package.

Problem 1. Before answering the following questions, take a few minutes to look through the components of the built-in R data set “quakes,” which contains data on n = 1000 seismic events, with magnitudes 4.0 or higher, near the island of Fiji since 1964. Using the “cut()” function, create and save a new vector defining the depths of each seismic event according to the following three intervals: (0, 200], (200, 400], and (400, 700]. Refer to your Lab 4 Activities for guidance on using the “cut()” function.

Using “ggplot()”, generate a plot to visually compare the differences between the number of reporting stations grouped by the three depth intervals, using different colors to distinguish the groups and appropriate labels on the axes. What preliminary observations can you make about the differences in the number of stations reporting between the three depth intervals?

depth_cat<-with(quakes, cut(depth, breaks=c(0,200,400,700)))

library(ggplot2)

ggplot(quakes, aes(x=depth_cat, y=stations, fill=depth_cat,)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Greens") +
  coord_flip() +
  labs(title = "Boxplot Comparing The Depth of Seismic Events off Fiji with the # of Stations Reporting",        x = "Depth of Seismic Event", y = "# of Stations Reporting", fill = "Depth (km)")

Based on the boxplot above, the distributions of # of stations reporting on seismic events by the depth of the seismic events is very similar. It appears that all three distributions for seismic events with depths between 0-200km, 200-400km, and 400-700km are all skewed right with roughly equal median numbers of stations reporting. It is interesting to note that all three distributions feature many high outliers. Based only on the boxplot above, there appears to be no real difference in the number of stations reporting a seismic event and the depth of said seismic event.

Check and interpret the following One-Way ANOVA assumptions:

• Normality using both the Normality QQ Plot and Shapiro-Wilk Test, remembering to check this assumption for each depth interval (group).

ggplot(quakes, aes(sample = stations, color = depth_cat, shape = depth_cat)) +
  geom_qq(size =2, alpha = 0.5) +
  geom_qq_line(lwd=1)+
  labs(title = "Normal QQ Plot of The # of Stations Reporting a Seismic Event by The Depth of The Seismic Event",x="Theoretical Quantiles",y="Sample Quantiles") +
  scale_color_brewer(palette = "Greens")

with(quakes, shapiro.test(stations[depth_cat=="(0,200]"]))

## 
##  Shapiro-Wilk normality test
## 
## data:  stations[depth_cat == "(0,200]"]
## W = 0.85602, p-value < 2.2e-16

with(quakes, shapiro.test(stations[depth_cat=="(200,400]"]))

## 
##  Shapiro-Wilk normality test
## 
## data:  stations[depth_cat == "(200,400]"]
## W = 0.81231, p-value = 3.675e-14

with(quakes, shapiro.test(stations[depth_cat=="(400,700]"]))

## 
##  Shapiro-Wilk normality test
## 
## data:  stations[depth_cat == "(400,700]"]
## W = 0.82757, p-value < 2.2e-16

Since the distribution of each of the Normal Probability Plots (QQ Plots) for each depth group appeared to be curved and the Shapiro-Wilk normality test p-values for each group were << the significance level of 0.05 we reject the null hypothesis. Thus there is strong enough statistical evidence to day that the distributions of the number of stations reporting a seismic event for each depth interval are not normally distributed.

• Equal Variances using Bartlett’s Test or Levene’s Test - choose which is best!

Since we found the distributions of the data to likely not be normally distributed the best test to evaluate the equality of the variances between the depth groups would be Levene’s Test, since it is robust against strays in Normality.

library(car)

## Warning: package 'car' was built under R version 4.5.2

## Loading required package: carData

## Warning: package 'carData' was built under R version 4.5.2

with(quakes, leveneTest(stations~depth_cat))

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value  Pr(>F)  
## group   2  3.1182 0.04467 *
##       997                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the p-value of 0.04467 for the Levene’s Test for Homogeneity of Variance is below the significance level of 0.05, we fail to reject the null hypothesis for Levene’s Test. Thus there is strong enough statistical evidence to say that there is likely no difference between the variences for each group, therefore the variances are likely equal.

Based on the results of the assumption checks in part (b), choose the best comparison test, between One-Way ANOVA and Kruskal-Wallis. Explain your reasoning for your choice.

Since the distribution of the data was found to be likely not be Normally distributed, this violates one of the assumptions for the parametric One-Way ANOVA test. Thus a nonparametric alternative like Kruskal-Wallis would be more powerful as it does not rely on a normality assumption.

Against a significance level of α = 0.05, conduct your test of choice from part (c). Based on the results, does the depth of seismic events differ between the number of stations which reported the events? Why or why not? Explain your reasoning

kruskal.test(stations~depth_cat, data=quakes)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  stations by depth_cat
## Kruskal-Wallis chi-squared = 9.0943, df = 2, p-value = 0.0106

According to the results of the Kruskal-Wallis rank sum test, the p-value is 0.0106 which is less the significance level of 0.05, thus we reject the null hypothesis. There is strong enough statistical evidence to say that at least one of the groups distribution is different from the others. In other words, at least one of the seismic event depth groups gets reported with a different distribution than the other depth groups.

Conduct whichever comparison test you did not conduct in part (d) and interpret the results. Were the same conclusions reached about the depths of seismic events having an effect on the number of reporting stations? Why or why not? Explain your reasoning

quakes_aov<-aov(stations~depth_cat, data = quakes)
summary(quakes_aov)

##              Df Sum Sq Mean Sq F value Pr(>F)  
## depth_cat     2   4364  2182.1   4.582 0.0104 *
## Residuals   997 474783   476.2                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the p-value of the ANOVA test was 0.0104 which is also less than the significance level of 0.05 we once again reject the null hypothesis. Therefore we can once again say that there is strong enough statistical difference in the distribution of at least one of the depth groups. This is the same conclusion that was reached using the more robust Kruskal-Wallis test. This could be because only one of the assumptions for the ANOVA test was violated as the variances of the depth groups was still equal. However even though the same conclusion was reached, the p-values of each test were different, as the p-value of the ANOVA was less than that of the Kruskal-Wallis. This could be because of the normality assumption violation.

Conduct and interpret the results of a pairwise comparison between each pair of depth intervals using the appropriate test between Tukey’s HSD and Fisher’s LSD. Is there a depth interval that appears to have the highest number of reporting stations as compared to the other depth intervals?

Since we have already established that the distribution of data for each depth group is likely not normally distributed, Tukey’s HSD would be not recommended as it assumes a normal distribution. Fisher’s LSD however, does not assume normality of data and thus would be the appropriate test to use for this situation.

library(agricolae)

## Warning: package 'agricolae' was built under R version 4.5.2

print(LSD.test(quakes_aov,trt="depth_cat",p.adj="bonferroni"))

## $statistics
##    MSerror  Df   Mean       CV
##   476.2118 997 33.418 65.30097
## 
## $parameters
##         test  p.ajusted    name.t ntr alpha
##   Fisher-LSD bonferroni depth_cat   3  0.05
## 
## $means
##           stations      std   r       se      LCL      UCL Min Max Q25 Q50 Q75
## (0,200]   35.63636 22.91381 418 1.067362 33.54183 37.73090  10 123  19  29  45
## (200,400] 30.07568 19.13864 185 1.604406 26.92728 33.22408  10 132  17  25  35
## (400,700] 32.63980 21.81439 397 1.095229 30.49058 34.78902  10 129  18  25  41
## 
## $comparison
## NULL
## 
## $groups
##           stations groups
## (0,200]   35.63636      a
## (400,700] 32.63980     ab
## (200,400] 30.07568      b
## 
## attr(,"class")
## [1] "group"

According to the Fischer’s LSD, Seismic events with depths between 0-200km have significantly higher mean number of reporting stations than those at depths between 200-400km (because they don’t share a letter and the mean is higher: 35.63636 > 30.07568). However seismic events at depths between 400-700km appear to have no significant differences in the mean number of reporting stations between either of the other groups (which is why this group shares the letter designation ab, similar to both a and b). Overall, the 0–200 km depth interval appears to have the highest mean number of reporting stations, although it is not significantly different from the 400–700 km interval.

MA2612Lab4

Sasha Teymurazyan

2025-11-19