To what extent can the intensity of Atlantic tropical storms be predicted from atmospheric and geographic characteristics?
This investigation employs the storms dataset from the
dplyr package in R, which comprises tracking data for
Atlantic basin tropical storms recorded between 1975 and 2021. The
dataset contains 19537 observations across 13 variables, with each
observation representing a six-hour measurement of storm characteristics
at a discrete point in time.
The variables of primary interest include: wind (maximum
sustained wind speed measured in knots), which serves as the response
variable; pressure (barometric pressure at the storm center
in millibars); lat (latitudinal position);
long (longitudinal position);
tropicalstorm_force_diameter (diameter of the region
experiencing tropical storm force winds in nautical miles); and a
derived variable speed_kph, representing storm
translational velocity calculated via the Haversine formula. The dataset
is accessible through the dplyr package documentation at:
https://dplyr.tidyverse.org/reference/storms.html
The analytical approach encompasses exploratory data analysis to characterize variable distributions and inter-variable relationships. Specifically, histograms are generated to examine the distribution of the response variable, scatter plots to assess bivariate relationships between predictors and response, and a correlation matrix to evaluate potential multicollinearity among predictors. Data preparation involves constructing a unique storm identifier, computing translational velocity using the Haversine formula for geodesic distance between consecutive observations, and excluding records with missing values in relevant fields.
# Load required libraries
library(dplyr)
library(ggplot2)
library(car)
library(geosphere)
# Load the storms dataset
data(storms)
# Examine dataset structure
glimpse(storms)
## Rows: 19,537
## Columns: 13
## $ name <chr> "Amy", "Amy", "Amy", "Amy", "Amy", "Amy",…
## $ year <dbl> 1975, 1975, 1975, 1975, 1975, 1975, 1975,…
## $ month <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,…
## $ day <int> 27, 27, 27, 27, 28, 28, 28, 28, 29, 29, 2…
## $ hour <dbl> 0, 6, 12, 18, 0, 6, 12, 18, 0, 6, 12, 18,…
## $ lat <dbl> 27.5, 28.5, 29.5, 30.5, 31.5, 32.4, 33.3,…
## $ long <dbl> -79.0, -79.0, -79.0, -79.0, -78.8, -78.7,…
## $ status <fct> tropical depression, tropical depression,…
## $ category <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ wind <int> 25, 25, 25, 25, 25, 25, 25, 30, 35, 40, 4…
## $ pressure <int> 1013, 1013, 1013, 1013, 1012, 1012, 1011,…
## $ tropicalstorm_force_diameter <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ hurricane_force_diameter <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
# Data preparation: construct storm identifier and compute translational velocity
storms <- storms |>
mutate(name_year = paste(name, year, sep = "_")) |>
group_by(name_year) |>
arrange(name_year, month, day, hour) |>
mutate(
dist_m = distHaversine(cbind(lag(long), lag(lat)), cbind(long, lat)),
time_diff_hrs = 6,
speed_kph = dist_m / 1000 / time_diff_hrs
) |>
ungroup()
# Variable selection and listwise deletion of missing observations
storms_clean <- storms |>
select(name, year, name_year, wind, pressure, lat, long,
tropicalstorm_force_diameter, speed_kph) |>
filter(!is.na(tropicalstorm_force_diameter), !is.na(speed_kph),
!is.na(wind), !is.na(pressure))
ggplot(storms_clean, aes(x = wind)) +
geom_histogram(bins = 30, fill = "steelblue", color = "white") +
labs(title = "Distribution of Maximum Sustained Wind Speed",
x = "Wind Speed (knots)", y = "Frequency") +
theme_minimal()
ggplot(storms_clean, aes(x = pressure, y = wind)) +
geom_point(alpha = 0.3, color = "darkblue") +
geom_smooth(method = "lm", color = "red", se = FALSE) +
labs(title = "Relationship Between Central Pressure and Wind Speed",
x = "Central Pressure (millibars)", y = "Wind Speed (knots)") +
theme_minimal()
cor_matrix <- cor(storms_clean |> select(wind, pressure, lat, long,
tropicalstorm_force_diameter, speed_kph))
round(cor_matrix, 3)
## wind pressure lat long
## wind 1.000 -0.924 -0.053 -0.083
## pressure -0.924 1.000 -0.114 0.074
## lat -0.053 -0.114 1.000 0.162
## long -0.083 0.074 0.162 1.000
## tropicalstorm_force_diameter 0.530 -0.638 0.390 0.118
## speed_kph 0.027 -0.085 0.343 0.252
## tropicalstorm_force_diameter speed_kph
## wind 0.530 0.027
## pressure -0.638 -0.085
## lat 0.390 0.343
## long 0.118 0.252
## tropicalstorm_force_diameter 1.000 0.276
## speed_kph 0.276 1.000
storms_clean |>
group_by(name, year) |>
summarise(max_wind = max(wind), min_pressure = min(pressure),
n_obs = n(), .groups = "drop") |>
arrange(desc(max_wind)) |>
head(10)
## # A tibble: 10 × 5
## name year max_wind min_pressure n_obs
## <chr> <dbl> <int> <int> <int>
## 1 Dorian 2019 160 910 69
## 2 Wilma 2005 160 882 44
## 3 Irma 2017 155 914 65
## 4 Rita 2005 155 897 33
## 5 Dean 2007 150 907 39
## 6 Felix 2007 150 930 25
## 7 Katrina 2005 150 902 30
## 8 Maria 2017 150 908 67
## 9 Ivan 2004 145 910 87
## 10 Matthew 2016 145 934 49
The multiple linear regression framework permits formal hypothesis testing regarding the relationship between each predictor and the response variable. For each regression coefficient \(\beta_j\) (where \(j = 1, 2, \ldots, 5\)), the hypotheses are specified as follows:
\[H_0: \beta_j = 0 \quad \text{(no linear association between predictor } j \text{ and wind speed, controlling for other predictors)}\]
\[H_a: \beta_j \neq 0 \quad \text{(significant linear association exists)}\]
The full model specification is:
\[\text{wind}_i = \beta_0 + \beta_1(\text{pressure}_i) + \beta_2(\text{lat}_i) + \beta_3(\text{long}_i) + \beta_4(\text{diameter}_i) + \beta_5(\text{speed}_i) + \epsilon_i\]
where \(\epsilon_i\) represents the random error term.
storm_model <- lm(wind ~ pressure + lat + long + tropicalstorm_force_diameter + speed_kph,
data = storms_clean)
summary(storm_model)
##
## Call:
## lm(formula = wind ~ pressure + lat + long + tropicalstorm_force_diameter +
## speed_kph, data = storms_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52.517 -5.895 0.276 6.125 29.212
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.307e+03 6.246e+00 209.237 < 2e-16 ***
## pressure -1.254e+00 6.200e-03 -202.326 < 2e-16 ***
## lat -3.718e-01 9.535e-03 -38.991 < 2e-16 ***
## long 1.825e-02 4.337e-03 4.207 2.61e-05 ***
## tropicalstorm_force_diameter -3.440e-03 8.243e-04 -4.173 3.03e-05 ***
## speed_kph 2.364e-03 7.000e-03 0.338 0.736
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.772 on 9691 degrees of freedom
## Multiple R-squared: 0.8801, Adjusted R-squared: 0.8801
## F-statistic: 1.423e+04 on 5 and 9691 DF, p-value: < 2.2e-16
confint(storm_model, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 1.294746e+03 1.319235e+03
## pressure -1.266525e+00 -1.242219e+00
## lat -3.904553e-01 -3.530758e-01
## long 9.744983e-03 2.674720e-02
## tropicalstorm_force_diameter -5.055857e-03 -1.824236e-03
## speed_kph -1.135701e-02 1.608437e-02
The regression results indicate that central pressure is the most substantial predictor of storm intensity. The coefficient estimate suggests that each one-millibar increase in central pressure is associated with a decrease of approximately 1.25 knots in maximum sustained wind speed, holding all other predictors constant (p < 0.001). This finding aligns with established meteorological theory, wherein lower central pressure corresponds to more intense cyclonic circulation.
The tropical storm force diameter demonstrates a statistically significant positive association with wind speed, indicating that storms with larger wind fields tend to exhibit greater maximum intensity. Geographic position variables (latitude and longitude) and translational velocity capture spatial and dynamic aspects of storm behavior that contribute to intensity variation.
The validity of inferential conclusions from ordinary least squares regression depends upon five assumptions: linearity of the conditional mean, independence of observations, homoscedasticity of residuals, normality of the error distribution, and absence of multicollinearity among predictors.
par(mfrow = c(2, 2))
plot(storm_model)
crPlots(storm_model)
# Variance Inflation Factors for multicollinearity assessment
vif(storm_model)
## pressure lat
## 1.808099 1.307422
## long tropicalstorm_force_diameter
## 1.105391 2.135600
## speed_kph
## 1.217935
Linearity: Examination of the residuals versus fitted values plot reveals residuals distributed approximately symmetrically around zero, though modest curvature at extreme fitted values suggests potential nonlinearity. Component-plus-residual plots facilitate assessment of the linearity assumption for individual predictors.
Independence: A notable limitation of this analysis concerns the temporal dependence structure inherent in the data. Observations represent six-hour measurements within individual storms; consequently, consecutive observations from the same storm exhibit serial correlation. This violation of the independence assumption may result in underestimated standard errors.
Homoscedasticity: The scale-location plot indicates relatively constant residual variance across the range of fitted values, with modest heteroscedasticity evident at higher predicted intensities.
Normality: The normal Q-Q plot demonstrates that residuals conform reasonably well to the theoretical normal distribution, with minor departures in the distributional tails. Given the large sample size, visual assessment via the Q-Q plot is preferred.
Multicollinearity: Variance Inflation Factor values provide a quantitative assessment of multicollinearity. Values exceeding 5 warrant concern, while values exceeding 10 indicate severe multicollinearity. The substantial correlation between pressure and wind speed (r = -0.92) reflects the fundamental physical relationship between these variables.
This analysis demonstrates that maximum sustained wind speed in Atlantic tropical storms is significantly associated with atmospheric and geographic characteristics. The fitted model yields a coefficient of determination (R²) of 0.88 and adjusted R² of 0.88, indicating that the predictor set accounts for approximately 88% of the observed variance in storm intensity.
Central pressure emerges as the dominant predictor (\(\beta\) = -1.254, p < 0.001): each one-millibar increase in central pressure corresponds to a 1.25-knot decrease in wind speed. Geographic position also demonstrates significant predictive value — latitude exhibits a negative association (\(\beta\) = -0.372, p < 0.001), suggesting storms at higher latitudes tend to be less intense, likely due to cooler waters and increased wind shear. The negative coefficient on tropical storm force diameter (\(\beta\) = -0.003, p < 0.001) indicates that more compact storms achieve higher maximum wind speeds.
Notably, storm translational velocity was not a significant predictor (\(\beta\) = 0.002, p = 0.736). One might hypothesize that slower-moving storms would intensify more due to prolonged exposure to warm water; however, after controlling for other variables, speed provides no additional predictive information. z Limitations include temporal dependence among repeated observations within storms, which may bias standard errors. Future research could aggregate to storm-level data, incorporate sea surface temperature, or apply mixed-effects models to account for the nested data structure.
Hijmans, R. J. (2022). geosphere: Spherical trigonometry (R package version 1.5-18). https://CRAN.R-project.org/package=geosphere
National Oceanic and Atmospheric Administration. Atlantic hurricane database (HURDAT2). National Hurricane Center. https://www.nhc.noaa.gov/data/
Wickham, H., François, R., Henry, L., Müller, K., & Vaughan, D. (2023). dplyr: A grammar of data manipulation (R package version 1.1.0). https://dplyr.tidyverse.org/