Project 3: Storms

Introduction

To what extent can the intensity of Atlantic tropical storms be predicted from atmospheric and geographic characteristics?

This investigation employs the storms dataset from the dplyr package in R, which comprises tracking data for Atlantic basin tropical storms recorded between 1975 and 2021. The dataset contains 19537 observations across 13 variables, with each observation representing a six-hour measurement of storm characteristics at a discrete point in time.

The variables of primary interest include: wind (maximum sustained wind speed measured in knots), which serves as the response variable; pressure (barometric pressure at the storm center in millibars); lat (latitudinal position); long (longitudinal position); tropicalstorm_force_diameter (diameter of the region experiencing tropical storm force winds in nautical miles); and a derived variable speed_kph, representing storm translational velocity calculated via the Haversine formula. The dataset is accessible through the dplyr package documentation at: https://dplyr.tidyverse.org/reference/storms.html

Data Analysis

The analytical approach encompasses exploratory data analysis to characterize variable distributions and inter-variable relationships. Specifically, histograms are generated to examine the distribution of the response variable, scatter plots to assess bivariate relationships between predictors and response, and a correlation matrix to evaluate potential multicollinearity among predictors. Data preparation involves constructing a unique storm identifier, computing translational velocity using the Haversine formula for geodesic distance between consecutive observations, and excluding records with missing values in relevant fields.

# Load required libraries
library(dplyr)
library(ggplot2)
library(car)
library(geosphere)

# Load the storms dataset
data(storms)

# Examine dataset structure
glimpse(storms)

## Rows: 19,537
## Columns: 13
## $ name                         <chr> "Amy", "Amy", "Amy", "Amy", "Amy", "Amy",…
## $ year                         <dbl> 1975, 1975, 1975, 1975, 1975, 1975, 1975,…
## $ month                        <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,…
## $ day                          <int> 27, 27, 27, 27, 28, 28, 28, 28, 29, 29, 2…
## $ hour                         <dbl> 0, 6, 12, 18, 0, 6, 12, 18, 0, 6, 12, 18,…
## $ lat                          <dbl> 27.5, 28.5, 29.5, 30.5, 31.5, 32.4, 33.3,…
## $ long                         <dbl> -79.0, -79.0, -79.0, -79.0, -78.8, -78.7,…
## $ status                       <fct> tropical depression, tropical depression,…
## $ category                     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ wind                         <int> 25, 25, 25, 25, 25, 25, 25, 30, 35, 40, 4…
## $ pressure                     <int> 1013, 1013, 1013, 1013, 1012, 1012, 1011,…
## $ tropicalstorm_force_diameter <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ hurricane_force_diameter     <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

# Data preparation: construct storm identifier and compute translational velocity
storms <- storms |>
  mutate(name_year = paste(name, year, sep = "_")) |>
  group_by(name_year) |>
  arrange(name_year, month, day, hour) |>
  mutate(
    dist_m = distHaversine(cbind(lag(long), lag(lat)), cbind(long, lat)),
    time_diff_hrs = 6,
    speed_kph = dist_m / 1000 / time_diff_hrs
  ) |>
  ungroup()

# Variable selection and listwise deletion of missing observations
storms_clean <- storms |>
  select(name, year, name_year, wind, pressure, lat, long, 
         tropicalstorm_force_diameter, speed_kph) |>
  filter(!is.na(tropicalstorm_force_diameter), !is.na(speed_kph),
         !is.na(wind), !is.na(pressure))

ggplot(storms_clean, aes(x = wind)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  labs(title = "Distribution of Maximum Sustained Wind Speed",
       x = "Wind Speed (knots)", y = "Frequency") +
  theme_minimal()

ggplot(storms_clean, aes(x = pressure, y = wind)) +
  geom_point(alpha = 0.3, color = "darkblue") +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(title = "Relationship Between Central Pressure and Wind Speed",
       x = "Central Pressure (millibars)", y = "Wind Speed (knots)") +
  theme_minimal()

cor_matrix <- cor(storms_clean |> select(wind, pressure, lat, long, 
                                          tropicalstorm_force_diameter, speed_kph))
round(cor_matrix, 3)

##                                wind pressure    lat   long
## wind                          1.000   -0.924 -0.053 -0.083
## pressure                     -0.924    1.000 -0.114  0.074
## lat                          -0.053   -0.114  1.000  0.162
## long                         -0.083    0.074  0.162  1.000
## tropicalstorm_force_diameter  0.530   -0.638  0.390  0.118
## speed_kph                     0.027   -0.085  0.343  0.252
##                              tropicalstorm_force_diameter speed_kph
## wind                                                0.530     0.027
## pressure                                           -0.638    -0.085
## lat                                                 0.390     0.343
## long                                                0.118     0.252
## tropicalstorm_force_diameter                        1.000     0.276
## speed_kph                                           0.276     1.000

storms_clean |>
  group_by(name, year) |>
  summarise(max_wind = max(wind), min_pressure = min(pressure), 
            n_obs = n(), .groups = "drop") |>
  arrange(desc(max_wind)) |>
  head(10)

## # A tibble: 10 × 5
##    name     year max_wind min_pressure n_obs
##    <chr>   <dbl>    <int>        <int> <int>
##  1 Dorian   2019      160          910    69
##  2 Wilma    2005      160          882    44
##  3 Irma     2017      155          914    65
##  4 Rita     2005      155          897    33
##  5 Dean     2007      150          907    39
##  6 Felix    2007      150          930    25
##  7 Katrina  2005      150          902    30
##  8 Maria    2017      150          908    67
##  9 Ivan     2004      145          910    87
## 10 Matthew  2016      145          934    49

Regression Analysis

Hypothesis Specification

The multiple linear regression framework permits formal hypothesis testing regarding the relationship between each predictor and the response variable. For each regression coefficient \(\beta_j\) (where \(j = 1, 2, \ldots, 5\)), the hypotheses are specified as follows:

\[H_0: \beta_j = 0 \quad \text{(no linear association between predictor } j \text{ and wind speed, controlling for other predictors)}\]

\[H_a: \beta_j \neq 0 \quad \text{(significant linear association exists)}\]

The full model specification is:

\[\text{wind}_i = \beta_0 + \beta_1(\text{pressure}_i) + \beta_2(\text{lat}_i) + \beta_3(\text{long}_i) + \beta_4(\text{diameter}_i) + \beta_5(\text{speed}_i) + \epsilon_i\]

where \(\epsilon_i\) represents the random error term.

storm_model <- lm(wind ~ pressure + lat + long + tropicalstorm_force_diameter + speed_kph, 
                  data = storms_clean)

summary(storm_model)

## 
## Call:
## lm(formula = wind ~ pressure + lat + long + tropicalstorm_force_diameter + 
##     speed_kph, data = storms_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.517  -5.895   0.276   6.125  29.212 
## 
## Coefficients:
##                                Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                   1.307e+03  6.246e+00  209.237  < 2e-16 ***
## pressure                     -1.254e+00  6.200e-03 -202.326  < 2e-16 ***
## lat                          -3.718e-01  9.535e-03  -38.991  < 2e-16 ***
## long                          1.825e-02  4.337e-03    4.207 2.61e-05 ***
## tropicalstorm_force_diameter -3.440e-03  8.243e-04   -4.173 3.03e-05 ***
## speed_kph                     2.364e-03  7.000e-03    0.338    0.736    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.772 on 9691 degrees of freedom
## Multiple R-squared:  0.8801, Adjusted R-squared:  0.8801 
## F-statistic: 1.423e+04 on 5 and 9691 DF,  p-value: < 2.2e-16

confint(storm_model, level = 0.95)

##                                      2.5 %        97.5 %
## (Intercept)                   1.294746e+03  1.319235e+03
## pressure                     -1.266525e+00 -1.242219e+00
## lat                          -3.904553e-01 -3.530758e-01
## long                          9.744983e-03  2.674720e-02
## tropicalstorm_force_diameter -5.055857e-03 -1.824236e-03
## speed_kph                    -1.135701e-02  1.608437e-02

The regression results indicate that central pressure is the most substantial predictor of storm intensity. The coefficient estimate suggests that each one-millibar increase in central pressure is associated with a decrease of approximately 1.25 knots in maximum sustained wind speed, holding all other predictors constant (p < 0.001). This finding aligns with established meteorological theory, wherein lower central pressure corresponds to more intense cyclonic circulation.

The tropical storm force diameter demonstrates a statistically significant positive association with wind speed, indicating that storms with larger wind fields tend to exhibit greater maximum intensity. Geographic position variables (latitude and longitude) and translational velocity capture spatial and dynamic aspects of storm behavior that contribute to intensity variation.

Model Assumptions and Diagnostics

The validity of inferential conclusions from ordinary least squares regression depends upon five assumptions: linearity of the conditional mean, independence of observations, homoscedasticity of residuals, normality of the error distribution, and absence of multicollinearity among predictors.

par(mfrow = c(2, 2))
plot(storm_model)

crPlots(storm_model)

# Variance Inflation Factors for multicollinearity assessment
vif(storm_model)

##                     pressure                          lat 
##                     1.808099                     1.307422 
##                         long tropicalstorm_force_diameter 
##                     1.105391                     2.135600 
##                    speed_kph 
##                     1.217935

Linearity: Examination of the residuals versus fitted values plot reveals residuals distributed approximately symmetrically around zero, though modest curvature at extreme fitted values suggests potential nonlinearity. Component-plus-residual plots facilitate assessment of the linearity assumption for individual predictors.

Independence: A notable limitation of this analysis concerns the temporal dependence structure inherent in the data. Observations represent six-hour measurements within individual storms; consequently, consecutive observations from the same storm exhibit serial correlation. This violation of the independence assumption may result in underestimated standard errors.

Homoscedasticity: The scale-location plot indicates relatively constant residual variance across the range of fitted values, with modest heteroscedasticity evident at higher predicted intensities.

Normality: The normal Q-Q plot demonstrates that residuals conform reasonably well to the theoretical normal distribution, with minor departures in the distributional tails. Given the large sample size, visual assessment via the Q-Q plot is preferred.

Multicollinearity: Variance Inflation Factor values provide a quantitative assessment of multicollinearity. Values exceeding 5 warrant concern, while values exceeding 10 indicate severe multicollinearity. The substantial correlation between pressure and wind speed (r = -0.92) reflects the fundamental physical relationship between these variables.

Conclusion and Future Directions

This analysis demonstrates that maximum sustained wind speed in Atlantic tropical storms is significantly associated with atmospheric and geographic characteristics. The fitted model yields a coefficient of determination (R²) of 0.88 and adjusted R² of 0.88, indicating that the predictor set accounts for approximately 88% of the observed variance in storm intensity.

Central pressure emerges as the dominant predictor (\(\beta\) = -1.254, p < 0.001): each one-millibar increase in central pressure corresponds to a 1.25-knot decrease in wind speed. Geographic position also demonstrates significant predictive value — latitude exhibits a negative association (\(\beta\) = -0.372, p < 0.001), suggesting storms at higher latitudes tend to be less intense, likely due to cooler waters and increased wind shear. The negative coefficient on tropical storm force diameter (\(\beta\) = -0.003, p < 0.001) indicates that more compact storms achieve higher maximum wind speeds.

Notably, storm translational velocity was not a significant predictor (\(\beta\) = 0.002, p = 0.736). One might hypothesize that slower-moving storms would intensify more due to prolonged exposure to warm water; however, after controlling for other variables, speed provides no additional predictive information. z Limitations include temporal dependence among repeated observations within storms, which may bias standard errors. Future research could aggregate to storm-level data, incorporate sea surface temperature, or apply mixed-effects models to account for the nested data structure.

References

Hijmans, R. J. (2022). geosphere: Spherical trigonometry (R package version 1.5-18). https://CRAN.R-project.org/package=geosphere

National Oceanic and Atmospheric Administration. Atlantic hurricane database (HURDAT2). National Hurricane Center. https://www.nhc.noaa.gov/data/

Wickham, H., François, R., Henry, L., Müller, K., & Vaughan, D. (2023). dplyr: A grammar of data manipulation (R package version 1.1.0). https://dplyr.tidyverse.org/