FinalProject_Z.Aliakparova.knit

title: “Earthquake Analysis” author: “Zere Aliakparova” output: html_document date: “2025-12-07”

Introduction

In this project, I decided to work with the built-in R dataset quakes, which contains information about 1,000 earthquakes near Fiji with magnitude greater than 4.0.

The variables are: lat – latitude long – longitude depth – depth of the earthquake in km mag – magnitude stations – number of reporting stations

My main research question

Which factors predict earthquake magnitude? More specifically: Do depth and the number of stations help explain variation in magnitude?

I chose this question because: it allows me to use a simple linear model (lm); I can apply robust standard errors using sandwich; I can demonstrate missing data handling via mice and FIML in lavaan, even if I need to artificially create some missing data (I will explain and justify this explicitly).

Hypotheses

H1 (stations effect): Higher-magnitude earthquakes will be recorded by a larger number of reporting stations. (Justification: larger quakes produce stronger signals over wider areas.)

H2 (depth effect): Earthquake depth will have a weak or negligible association with magnitude. (Justification: depth does not directly determine energy release in this dataset.)

H0: Neither depth nor number of stations is associated with earthquake magnitude.

Before fitting any model, I want to understand what the data look like: sample size, ranges, and types. This also shows whether I need to transform anything.

data("quakes")
head(quakes)

##      lat   long depth mag stations
## 1 -20.42 181.62   562 4.8       41
## 2 -20.62 181.03   650 4.2       15
## 3 -26.00 184.10    42 5.4       43
## 4 -17.97 181.66   626 4.1       19
## 5 -20.42 181.96   649 4.0       11
## 6 -19.68 184.31   195 4.0       12

summary(quakes)

##       lat              long           depth            mag      
##  Min.   :-38.59   Min.   :165.7   Min.   : 40.0   Min.   :4.00  
##  1st Qu.:-23.47   1st Qu.:179.6   1st Qu.: 99.0   1st Qu.:4.30  
##  Median :-20.30   Median :181.4   Median :247.0   Median :4.60  
##  Mean   :-20.64   Mean   :179.5   Mean   :311.4   Mean   :4.62  
##  3rd Qu.:-17.64   3rd Qu.:183.2   3rd Qu.:543.0   3rd Qu.:4.90  
##  Max.   :-10.72   Max.   :188.1   Max.   :680.0   Max.   :6.40  
##     stations     
##  Min.   : 10.00  
##  1st Qu.: 18.00  
##  Median : 27.00  
##  Mean   : 33.42  
##  3rd Qu.: 42.00  
##  Max.   :132.00

str(quakes)

## 'data.frame':    1000 obs. of  5 variables:
##  $ lat     : num  -20.4 -20.6 -26 -18 -20.4 ...
##  $ long    : num  182 181 184 182 182 ...
##  $ depth   : int  562 650 42 626 649 195 82 194 211 622 ...
##  $ mag     : num  4.8 4.2 5.4 4.1 4 4 4.8 4.4 4.7 4.3 ...
##  $ stations: int  41 15 43 19 11 12 43 15 35 19 ...

Visualization

Before jumping into models, I want to see whether there is any visible relationship between: depth and magnitude, stations and magnitude. If there is no visible pattern at all, linear modeling might not be appropriate.

Depth vs Magnitude

ggplot(quakes, aes(x = depth, y = mag)) +
  geom_point(alpha = 0.5, color = "blue") +
  theme_minimal(base_size = 14) +
  labs(
    title = "Depth vs Magnitude",
    subtitle = "Scatterplot showing weak relationship",
    x = "Depth (km)",
    y = "Magnitude"
  ) +
  theme(
    plot.title = element_text(face = "bold"),
    plot.subtitle = element_text(colour = "pink")
  )

Here I am checking: whether magnitude tends to increase or decrease with depth; whether the spread of points suggests linearity or something more complex (e.g., curvature, clusters).

Interpretation of the plot The scatterplot of depth vs magnitude reveals no clear linear relationship.Magnitude values fall within a narrow range (4–6.3), producing a horizontal band of points. Earthquake depth does not strongly predict magnitude in this region.High-density clusters around 400–600 km correspond to deep-focus earthquakes typical of the Fiji-Tonga subduction zone.

Stations vs Magnitude

ggplot(quakes, aes(x = stations, y = mag)) +
geom_point(alpha = 0.4, color = "steelblue") +
theme_minimal() +
labs(
title = "Stations vs Magnitude",
x = "Number of Stations",
y = "Magnitude"
)

Interpretation The scatterplot shows a clear positive relationship between the number of reporting stations and earthquake magnitude. As magnitude increases, earthquakes tend to be detected by more stations. This creates an upward trend: low-magnitude events cluster around 30–50 stations, while higher-magnitude events (5.5–6.5) are recorded by 80+ stations. Although the pattern is not perfectly linear and includes variability due to geographical and network factors, the overall trend is strong enough to justify using stations as a predictor of magnitude in the regression model. The positive trend in the scatterplot indicates that higher-magnitude earthquakes are recorded by a greater number of stations. This pattern is consistent with physical expectations and supports including stations as a meaningful predictor of magnitude in subsequent linear modeling.

Checking Missing Values

sum(is.na(quakes))

## [1] 0

colSums(is.na(quakes))

##      lat     long    depth      mag stations 
##        0        0        0        0        0

There are no missing values in the original dataset. Since we need to demonstrate MICE and FIML methods, I artificially introduce a small number of missing values in mag for illustration.

Creating Artificial Missing Values

set.seed(123)
quakes_miss <- quakes
miss_idx <- sample(1:nrow(quakes_miss), 30)
quakes_miss$mag[miss_idx] <- NA
colSums(is.na(quakes_miss))

##      lat     long    depth      mag stations 
##        0        0        0       30        0

Now the dataset contains a manageable amount of missingness that allows me to apply multiple imputation and FIML.

MICE Imputation

mice_data <- mice(quakes_miss, m = 5, method = "pmm", seed = 123)

## 
##  iter imp variable
##   1   1  mag
##   1   2  mag
##   1   3  mag
##   1   4  mag
##   1   5  mag
##   2   1  mag
##   2   2  mag
##   2   3  mag
##   2   4  mag
##   2   5  mag
##   3   1  mag
##   3   2  mag
##   3   3  mag
##   3   4  mag
##   3   5  mag
##   4   1  mag
##   4   2  mag
##   4   3  mag
##   4   4  mag
##   4   5  mag
##   5   1  mag
##   5   2  mag
##   5   3  mag
##   5   4  mag
##   5   5  mag

completed <- complete(mice_data, 1)

MICE successfully imputes missing mag values using predictive mean matching, producing a complete dataset for analysis.

Linear Model (Complete Cases)

lm_cc <- lm(mag ~ depth + stations, data = na.omit(quakes_miss))
summary(lm_cc)

## 
## Call:
## lm(formula = mag ~ depth + stations, data = na.omit(quakes_miss))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.57595 -0.13567 -0.00148  0.13364  0.86467 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.204e+00  1.551e-02  271.05   <2e-16 ***
## depth       -3.173e-04  3.005e-05  -10.56   <2e-16 ***
## stations     1.542e-02  2.950e-04   52.26   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.201 on 967 degrees of freedom
## Multiple R-squared:  0.7526, Adjusted R-squared:  0.7521 
## F-statistic:  1471 on 2 and 967 DF,  p-value: < 2.2e-16

The model shows that stations is a strong positive predictor of magnitude, while depth has a weaker effect.

Diagnostic Plots

autoplot(lm_cc)

## Warning: `fortify(<lm>)` was deprecated in ggplot2 3.6.0.
## ℹ Please use `broom::augment(<lm>)` instead.
## ℹ The deprecated feature was likely used in the ggfortify package.
##   Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the ggfortify package.
##   Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the ggfortify package.
##   Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Residual plots show mild nonlinearity and some heteroskedasticity, but the model performs reasonably well overall. These patterns justify checking robustness with sandwich estimators and comparing results with imputed/FIML models.

Robust SE (Sandwich)

cov_robust <- vcovHC(lm_cc, type = "HC3")
coeftest(lm_cc, vcov = cov_robust)

## 
## t test of coefficients:
## 
##                Estimate  Std. Error t value  Pr(>|t|)    
## (Intercept)  4.2039e+00  1.6658e-02 252.362 < 2.2e-16 ***
## depth       -3.1735e-04  3.0189e-05 -10.512 < 2.2e-16 ***
## stations     1.5419e-02  3.3733e-04  45.709 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Robust standard errors confirm the significance pattern from the basic model.

FIML (lavaan)

model_fiml <- '
  mag ~ depth + stations
'

fit_fiml <- sem(model_fiml, data = quakes_miss, missing = "fiml")
summary(fit_fiml, standardized = TRUE)

## lavaan 0.6-20 ended normally after 24 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         4
## 
##   Number of observations                          1000
##   Number of missing patterns                         2
## 
## Model Test User Model:
##                                                       
##   Test statistic                                 0.000
##   Degrees of freedom                                 0
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Observed
##   Observed information based on                Hessian
## 
## Regressions:
##                    Estimate   Std.Err  z-value  P(>|z|)   Std.lv   Std.all
##   mag ~                                                                   
##     depth             -0.000    0.000  -10.577    0.000    -0.000   -0.170
##     stations           0.015    0.000   52.343    0.000     0.015    0.838
## 
## Intercepts:
##                    Estimate   Std.Err  z-value  P(>|z|)   Std.lv   Std.all
##    .mag                4.204    0.015  271.473    0.000     4.204   10.436
## 
## Variances:
##                    Estimate   Std.Err  z-value  P(>|z|)   Std.lv   Std.all
##    .mag                0.040    0.002   22.023    0.000     0.040    0.248

FIML produces estimates consistent with both the complete-case and MICE models.

Conclusion

The results show that the number of reporting stations is a strong predictor of earthquake magnitude, while depth has a smaller effect. Diagnostic plots reveal some nonlinearity, but the model is generally stable. Results were consistent across: classical linear regression, robust standard errors, multiple imputation (MICE), full-information maximum likelihood (FIML). This suggests that the relationship between magnitude and its predictors is reliable across different estimation methods. Although depth is statistically significant, its effect size is extremely small, indicating limited practical relevance.