title: “Earthquake Analysis” author: “Zere Aliakparova” output: html_document date: “2025-12-07”
In this project, I decided to work with the built-in R dataset quakes, which contains information about 1,000 earthquakes near Fiji with magnitude greater than 4.0.
The variables are: lat – latitude long – longitude depth – depth of the earthquake in km mag – magnitude stations – number of reporting stations
Which factors predict earthquake magnitude? More specifically: Do depth and the number of stations help explain variation in magnitude?
I chose this question because: it allows me to use a simple linear model (lm); I can apply robust standard errors using sandwich; I can demonstrate missing data handling via mice and FIML in lavaan, even if I need to artificially create some missing data (I will explain and justify this explicitly).
H1 (stations effect): Higher-magnitude earthquakes will be recorded by a larger number of reporting stations. (Justification: larger quakes produce stronger signals over wider areas.)
H2 (depth effect): Earthquake depth will have a weak or negligible association with magnitude. (Justification: depth does not directly determine energy release in this dataset.)
H0: Neither depth nor number of stations is associated with earthquake magnitude.
Before fitting any model, I want to understand what the data look like: sample size, ranges, and types. This also shows whether I need to transform anything.
data("quakes")
head(quakes)
## lat long depth mag stations
## 1 -20.42 181.62 562 4.8 41
## 2 -20.62 181.03 650 4.2 15
## 3 -26.00 184.10 42 5.4 43
## 4 -17.97 181.66 626 4.1 19
## 5 -20.42 181.96 649 4.0 11
## 6 -19.68 184.31 195 4.0 12
summary(quakes)
## lat long depth mag
## Min. :-38.59 Min. :165.7 Min. : 40.0 Min. :4.00
## 1st Qu.:-23.47 1st Qu.:179.6 1st Qu.: 99.0 1st Qu.:4.30
## Median :-20.30 Median :181.4 Median :247.0 Median :4.60
## Mean :-20.64 Mean :179.5 Mean :311.4 Mean :4.62
## 3rd Qu.:-17.64 3rd Qu.:183.2 3rd Qu.:543.0 3rd Qu.:4.90
## Max. :-10.72 Max. :188.1 Max. :680.0 Max. :6.40
## stations
## Min. : 10.00
## 1st Qu.: 18.00
## Median : 27.00
## Mean : 33.42
## 3rd Qu.: 42.00
## Max. :132.00
str(quakes)
## 'data.frame': 1000 obs. of 5 variables:
## $ lat : num -20.4 -20.6 -26 -18 -20.4 ...
## $ long : num 182 181 184 182 182 ...
## $ depth : int 562 650 42 626 649 195 82 194 211 622 ...
## $ mag : num 4.8 4.2 5.4 4.1 4 4 4.8 4.4 4.7 4.3 ...
## $ stations: int 41 15 43 19 11 12 43 15 35 19 ...
Before jumping into models, I want to see whether there is any visible relationship between: depth and magnitude, stations and magnitude. If there is no visible pattern at all, linear modeling might not be appropriate.
ggplot(quakes, aes(x = depth, y = mag)) +
geom_point(alpha = 0.5, color = "blue") +
theme_minimal(base_size = 14) +
labs(
title = "Depth vs Magnitude",
subtitle = "Scatterplot showing weak relationship",
x = "Depth (km)",
y = "Magnitude"
) +
theme(
plot.title = element_text(face = "bold"),
plot.subtitle = element_text(colour = "pink")
)
Here I am checking: whether magnitude tends to increase or decrease with depth; whether the spread of points suggests linearity or something more complex (e.g., curvature, clusters).
Interpretation of the plot The scatterplot of depth vs magnitude reveals no clear linear relationship.Magnitude values fall within a narrow range (4–6.3), producing a horizontal band of points. Earthquake depth does not strongly predict magnitude in this region.High-density clusters around 400–600 km correspond to deep-focus earthquakes typical of the Fiji-Tonga subduction zone.
ggplot(quakes, aes(x = stations, y = mag)) +
geom_point(alpha = 0.4, color = "steelblue") +
theme_minimal() +
labs(
title = "Stations vs Magnitude",
x = "Number of Stations",
y = "Magnitude"
)
Interpretation The scatterplot shows a clear positive relationship between the number of reporting stations and earthquake magnitude. As magnitude increases, earthquakes tend to be detected by more stations. This creates an upward trend: low-magnitude events cluster around 30–50 stations, while higher-magnitude events (5.5–6.5) are recorded by 80+ stations. Although the pattern is not perfectly linear and includes variability due to geographical and network factors, the overall trend is strong enough to justify using stations as a predictor of magnitude in the regression model. The positive trend in the scatterplot indicates that higher-magnitude earthquakes are recorded by a greater number of stations. This pattern is consistent with physical expectations and supports including stations as a meaningful predictor of magnitude in subsequent linear modeling.
sum(is.na(quakes))
## [1] 0
colSums(is.na(quakes))
## lat long depth mag stations
## 0 0 0 0 0
There are no missing values in the original dataset. Since we need to demonstrate MICE and FIML methods, I artificially introduce a small number of missing values in mag for illustration.
set.seed(123)
quakes_miss <- quakes
miss_idx <- sample(1:nrow(quakes_miss), 30)
quakes_miss$mag[miss_idx] <- NA
colSums(is.na(quakes_miss))
## lat long depth mag stations
## 0 0 0 30 0
Now the dataset contains a manageable amount of missingness that allows me to apply multiple imputation and FIML.
mice_data <- mice(quakes_miss, m = 5, method = "pmm", seed = 123)
##
## iter imp variable
## 1 1 mag
## 1 2 mag
## 1 3 mag
## 1 4 mag
## 1 5 mag
## 2 1 mag
## 2 2 mag
## 2 3 mag
## 2 4 mag
## 2 5 mag
## 3 1 mag
## 3 2 mag
## 3 3 mag
## 3 4 mag
## 3 5 mag
## 4 1 mag
## 4 2 mag
## 4 3 mag
## 4 4 mag
## 4 5 mag
## 5 1 mag
## 5 2 mag
## 5 3 mag
## 5 4 mag
## 5 5 mag
completed <- complete(mice_data, 1)
MICE successfully imputes missing mag values using predictive mean matching, producing a complete dataset for analysis.
lm_cc <- lm(mag ~ depth + stations, data = na.omit(quakes_miss))
summary(lm_cc)
##
## Call:
## lm(formula = mag ~ depth + stations, data = na.omit(quakes_miss))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.57595 -0.13567 -0.00148 0.13364 0.86467
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.204e+00 1.551e-02 271.05 <2e-16 ***
## depth -3.173e-04 3.005e-05 -10.56 <2e-16 ***
## stations 1.542e-02 2.950e-04 52.26 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.201 on 967 degrees of freedom
## Multiple R-squared: 0.7526, Adjusted R-squared: 0.7521
## F-statistic: 1471 on 2 and 967 DF, p-value: < 2.2e-16
The model shows that stations is a strong positive predictor of magnitude, while depth has a weaker effect.
autoplot(lm_cc)
## Warning: `fortify(<lm>)` was deprecated in ggplot2 3.6.0.
## ℹ Please use `broom::augment(<lm>)` instead.
## ℹ The deprecated feature was likely used in the ggfortify package.
## Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the ggfortify package.
## Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the ggfortify package.
## Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Residual plots show mild nonlinearity and some heteroskedasticity, but the model performs reasonably well overall. These patterns justify checking robustness with sandwich estimators and comparing results with imputed/FIML models.
cov_robust <- vcovHC(lm_cc, type = "HC3")
coeftest(lm_cc, vcov = cov_robust)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.2039e+00 1.6658e-02 252.362 < 2.2e-16 ***
## depth -3.1735e-04 3.0189e-05 -10.512 < 2.2e-16 ***
## stations 1.5419e-02 3.3733e-04 45.709 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Robust standard errors confirm the significance pattern from the basic model.
model_fiml <- '
mag ~ depth + stations
'
fit_fiml <- sem(model_fiml, data = quakes_miss, missing = "fiml")
summary(fit_fiml, standardized = TRUE)
## lavaan 0.6-20 ended normally after 24 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 4
##
## Number of observations 1000
## Number of missing patterns 2
##
## Model Test User Model:
##
## Test statistic 0.000
## Degrees of freedom 0
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Observed
## Observed information based on Hessian
##
## Regressions:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## mag ~
## depth -0.000 0.000 -10.577 0.000 -0.000 -0.170
## stations 0.015 0.000 52.343 0.000 0.015 0.838
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .mag 4.204 0.015 271.473 0.000 4.204 10.436
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .mag 0.040 0.002 22.023 0.000 0.040 0.248
FIML produces estimates consistent with both the complete-case and MICE models.
The results show that the number of reporting stations is a strong predictor of earthquake magnitude, while depth has a smaller effect. Diagnostic plots reveal some nonlinearity, but the model is generally stable. Results were consistent across: classical linear regression, robust standard errors, multiple imputation (MICE), full-information maximum likelihood (FIML). This suggests that the relationship between magnitude and its predictors is reliable across different estimation methods. Although depth is statistically significant, its effect size is extremely small, indicating limited practical relevance.