December 10, 2020

Introduction

Over the last 10 years, tornadoes have caused the highest number of fatalities among natural hazards in the United States. This study presents a multiple regression model that estimates the expected number of injuries caused by a tornado as a function of its intensity and path length.

Research Question

Are the magnitude and path length of a tornado predictive of the number of injuries it causes?

Data

NOAA’s Storm Events Database is an integrated database of severe weather events across the United States from 1950 to this year, with information about a storm event’s location, azimuth, distance, impact, and severity, including the cost of damages to property and crops. The database is hosted on Google BigQuery and is publicly available.

Google Cloud: Severe Storm Event Details

Each case represents a tornado that hit the U.S. since 1950. There are 16,000 observations in our dataset.

## Rows: 16,000
## Columns: 15
## $ storm_date           <chr> "2015-06-07", "2015-11-11", "2016-09-21", "2008-…
## $ storm_time           <chr> "00:15:00", "14:00:00", "17:32:00", "17:59:00", …
## $ magnitude            <chr> "0", "1", "0", "2", "2", "2", "2", "1", "2", "0"…
## $ injured_count        <int> 0, 0, 0, 0, 0, 30, 1, 0, 15, 0, 0, 0, 0, 0, 0, 0…
## $ fatality_count       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ property_loss        <dbl> 0.015, 0.405, 3000.000, 0.510, 75000.000, 12.000…
## $ crop_loss            <dbl> 4.00e-03, 0.00e+00, 3.00e+03, 0.00e+00, 1.00e+04…
## $ yearly_tornado_count <int> 575683, 602617, 614379, 553, 615497, 356, 1204, …
## $ start_lon            <dbl> -94.0213, -94.5585, -92.7308, -96.3000, -94.7771…
## $ start_lat            <dbl> 42.0995, 40.7157, 42.9155, 43.1400, 40.5851, 41.…
## $ end_long             <dbl> -93.9673, -94.3545, -92.7105, -96.4200, -94.6462…
## $ end_lat              <dbl> 42.1081, 40.9904, 42.9341, 43.3000, 40.6043, 41.…
## $ length               <dbl> 2.83, 21.80, 1.65, 12.95, 6.99, 4.50, 12.00, 1.0…
## $ width                <dbl> 120, 1350, 150, 1200, 3000, 1761, 300, 450, 2640…
## $ tornado_path_geom    <chr> "LINESTRING(-94.0213 42.0995, -93.9673 42.1081)"…

Exploring the dataset

Storms Per Year

The dataset was filtered to only include data from 2007 onward.
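A minimal sketch of this filtering step in base R, assuming a character `storm_date` column like the one in the column listing. The data frame name `tornado` and its five rows are made up for illustration; the code actually used for the slides is not shown in the source.

```r
# Hypothetical stand-in for the full dataset; only storm_date matters here.
tornado <- data.frame(
  storm_date = c("1999-05-03", "2006-12-31", "2007-02-01",
                 "2015-06-07", "2015-11-11"),
  stringsAsFactors = FALSE
)

# Parse the character dates, then keep storms from 2007 onward.
tornado$storm_date <- as.Date(tornado$storm_date)
tornado_recent <- tornado[tornado$storm_date >= as.Date("2007-01-01"), ,
                          drop = FALSE]

# Count storms per calendar year in the filtered data.
storms_per_year <- table(format(tornado_recent$storm_date, "%Y"))
print(storms_per_year)
```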

Injuries due to Tornadoes by Year

Predicting Injuries from Tornadoes

Dependent Variable: Injured_Count (number of people injured as a result of the storm)

Independent Variables:

  1. Magnitude: EF-Scale rating for the tornado. Possible values are 0, 1, 2, 3, 4, 5, and unknown.

  2. Length: Path length of the tornado, in miles

Dummy Coding

Converted ‘Magnitude’ to a binary indicator: high (1) or low (0).
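The recoding could look roughly like the sketch below. The cutoff is an assumption: the interpretation of the regression results describes the high group as magnitude "above 3", so ratings of 4 and 5 are coded 1 here. The names `tornado` and `magnitude_high` are hypothetical, and the sample values are invented.

```r
# Hypothetical sample; magnitude is stored as character, as in the column listing.
tornado <- data.frame(
  magnitude = c("0", "1", "4", "5", "unknown", "3"),
  stringsAsFactors = FALSE
)

# Convert to integer; non-numeric entries such as "unknown" become NA.
mag_num <- suppressWarnings(as.integer(tornado$magnitude))

# Assumed cutoff: ratings above 3 are "high" (1), everything else "low" (0).
tornado$magnitude_high <- as.integer(mag_num > 3)

print(tornado)
```

Rows with an unknown rating stay `NA`, and `lm()` drops them by default. That may be the source of the "observations deleted due to missingness" note in the model output.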

Testing for collinearity between the independent variables

The correlation between magnitude and length, although moderate, appears to be statistically significant. However, since ‘Magnitude’ is a dichotomous variable, is it appropriate to compute a correlation between the independent variables in this case?
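On the question of a dichotomous variable: Pearson's correlation computed with a 0/1 variable is the point-biserial correlation, and its significance test is equivalent to a pooled-variance two-sample t-test on the continuous variable. The sketch below checks that equivalence on simulated data. The variable names and simulated values are illustrative assumptions, not the tornado data.

```r
set.seed(1)
n <- 200

# Simulated 0/1 magnitude indicator and a path length that is longer,
# on average, for the "high" group (arbitrary values for illustration).
magnitude_high <- rbinom(n, size = 1, prob = 0.2)
length_mi <- rexp(n, rate = 1 / 3) + 2 * magnitude_high

# Point-biserial correlation: ordinary Pearson correlation with a 0/1 variable.
ct <- cor.test(length_mi, magnitude_high)

# Equivalent test: two-sample t-test with pooled variance.
tt <- t.test(length_mi ~ magnitude_high, var.equal = TRUE)

cat("r =", round(unname(ct$estimate), 3), "\n")
cat("p (correlation) =", ct$p.value, "\n")
cat("p (t-test)      =", tt$p.value, "\n")
```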

Regression Model

m_injury <- lm(injured_count ~ length + magnitude, data = Tornado3)
summary(m_injury)
## 
## Call:
## lm(formula = injured_count ~ length + magnitude, data = Tornado3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -84.19   -1.06    1.13    1.89 1427.17 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.17122    0.38144  -5.692 1.31e-08 ***
## length       0.68550    0.04651  14.738  < 2e-16 ***
## magnitude   19.69801    1.89944  10.370  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.99 on 6003 degrees of freedom
##   (36 observations deleted due to missingness)
## Multiple R-squared:  0.09545,    Adjusted R-squared:  0.09514 
## F-statistic: 316.7 on 2 and 6003 DF,  p-value: < 2.2e-16

Predicted Injury Count = -2.17 + 0.69(length) + 19.70(magnitude)

Holding path length constant, the predicted injury count is about 19.7 injuries higher when the magnitude of the storm is above 3 (high). Holding magnitude constant, it increases by about 0.69 injuries for each additional mile of path length. The adjusted R2 is 0.09514, which means that about 9.5% of the variance in injury counts is explained by the length and magnitude of the tornado.
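To make the coefficients concrete, the fitted equation can be evaluated by hand. The sketch below hard-codes the estimates from the summary output above; the helper name `predict_injuries` and the example inputs are hypothetical.

```r
# Coefficient estimates copied from the summary(m_injury) output.
coefs <- c(intercept = -2.17122, length = 0.68550, magnitude = 19.69801)

# Predicted injury count for a path length (miles) and a 0/1 magnitude indicator.
predict_injuries <- function(length_mi, magnitude_high) {
  unname(coefs["intercept"] +
           coefs["length"] * length_mi +
           coefs["magnitude"] * magnitude_high)
}

high_10 <- predict_injuries(10, 1)  # 10-mile path, high magnitude
low_10  <- predict_injuries(10, 0)  # 10-mile path, low magnitude
low_1   <- predict_injuries(1, 0)   # 1-mile path, low magnitude

print(c(high_10 = high_10, low_10 = low_10, low_1 = low_1))
```

The gap between the first two predictions equals the magnitude coefficient. The short, low-magnitude tornado gets a negative predicted count, which is impossible for a count outcome and is one sign that a linear model is a rough fit for these data.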

Forward Selection

I will add the width variable to see if it improves our adjusted R2.

m_injury2 <- lm(injured_count ~ length + magnitude + width, data = Tornado3)
summary(m_injury2)
## 
## Call:
## lm(formula = injured_count ~ length + magnitude + width, data = Tornado3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -77.12   -1.22    1.18    2.11 1423.97 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.621147   0.405832  -6.459 1.14e-10 ***
## length       0.605839   0.052622  11.513  < 2e-16 ***
## magnitude   18.084474   1.962674   9.214  < 2e-16 ***
## width        0.001498   0.000464   3.228  0.00125 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.97 on 6002 degrees of freedom
##   (36 observations deleted due to missingness)
## Multiple R-squared:  0.09701,    Adjusted R-squared:  0.09656 
## F-statistic: 214.9 on 3 and 6002 DF,  p-value: < 2.2e-16

Adding ‘width’ to our model raised the adjusted R2 only slightly, from 0.09514 to 0.09656.

Predicted Injury Count = -2.62 + 0.606(length) + 18.08(magnitude) + 0.0015(width)
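Adjusted R2 is one way to compare the two nested models; a partial F-test via `anova()` is another. The sketch below runs both on simulated data, since `Tornado3` is not available here. The data-generating values are arbitrary assumptions chosen only so the example runs.

```r
set.seed(42)
n <- 500

# Simulated predictors and outcome (not the tornado data).
sim <- data.frame(
  length    = rexp(n, rate = 1 / 4),
  magnitude = rbinom(n, size = 1, prob = 0.1),
  width     = rexp(n, rate = 1 / 200)
)
sim$injured_count <- 0.7 * sim$length + 20 * sim$magnitude +
  0.01 * sim$width + rnorm(n, sd = 5)

# Smaller model, then the model with width added.
m1 <- lm(injured_count ~ length + magnitude, data = sim)
m2 <- lm(injured_count ~ length + magnitude + width, data = sim)

# Compare adjusted R-squared values.
adj1 <- summary(m1)$adj.r.squared
adj2 <- summary(m2)$adj.r.squared
print(c(without_width = adj1, with_width = adj2))

# Partial F-test: does width explain additional variance?
cmp <- anova(m1, m2)
p_width <- cmp[["Pr(>F)"]][2]
print(cmp)
```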

Diagnostic Plots

Log Transformations

lm_log.model <- lm(log1p(injured_count) ~ log1p(magnitude) + log1p(length) + log1p(width), data = Tornado3)
summary(lm_log.model)
## 
## Call:
## lm(formula = log1p(injured_count) ~ log1p(magnitude) + log1p(length) + 
##     log1p(width), data = Tornado3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0666 -0.1605 -0.0479  0.0330  5.1874 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -0.205797   0.037489  -5.490 4.19e-08 ***
## log1p(magnitude)  2.239611   0.049779  44.991  < 2e-16 ***
## log1p(length)     0.116873   0.009743  11.995  < 2e-16 ***
## log1p(width)      0.029611   0.007664   3.864 0.000113 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4973 on 6002 degrees of freedom
##   (36 observations deleted due to missingness)
## Multiple R-squared:  0.3802, Adjusted R-squared:  0.3799 
## F-statistic:  1227 on 3 and 6002 DF,  p-value: < 2.2e-16
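Because both sides of this model use `log1p`, predictions come out on the log(1 + injuries) scale and need `expm1` to return to a count. The sketch below plugs the coefficients from the output above into that back-transformation. The input values and the helper name `predict_log_model` are hypothetical. It also assumes `magnitude` is still the 0/1 indicator, in which case `log1p(magnitude)` only rescales it (0 stays 0, 1 becomes log 2).

```r
# Coefficient estimates copied from the summary(lm_log.model) output.
b <- c(intercept = -0.205797, magnitude = 2.239611,
       length = 0.116873, width = 0.029611)

# Predict on the log1p scale, then back-transform with expm1.
predict_log_model <- function(magnitude_high, length_mi, width) {
  eta <- b[["intercept"]] +
    b[["magnitude"]] * log1p(magnitude_high) +
    b[["length"]] * log1p(length_mi) +
    b[["width"]] * log1p(width)
  expm1(eta)
}

# Hypothetical tornado: 10-mile path, width of 500 (in the dataset's units).
pred_high <- predict_log_model(1, 10, 500)
pred_low  <- predict_log_model(0, 10, 500)

print(c(high = pred_high, low = pred_low))
```

Unlike the untransformed model, the back-transformed prediction stays above zero in this example, although the residual summary above still shows a long right tail.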

Conclusion

The coefficients indicate that the magnitude, length, and width of a tornado are statistically significant predictors of the injuries caused by tornadoes. However, the linear model does not satisfy the conditions for multiple regression, since the residuals do not appear to be nearly normal and seem to contain outliers (even after transforming the variables).
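A minimal sketch of how residual problems of this kind can be checked numerically, using simulated right-skewed counts rather than the tornado data. The data-generating values are arbitrary assumptions; `rstandard()`, `cooks.distance()`, and `shapiro.test()` are base R functions.

```r
set.seed(7)
n <- 300

# Simulated path lengths and skewed, mostly-zero counts (assumed form).
length_mi <- rexp(n, rate = 1 / 4)
injured <- rnbinom(n, size = 0.2, mu = 0.5 + 0.5 * length_mi)

fit <- lm(injured ~ length_mi)

# Standardized residuals: values beyond +/-3 are a common outlier flag.
std_res <- rstandard(fit)
n_outliers <- sum(abs(std_res) > 3)

# Shapiro-Wilk test of residual normality (small p-value: not normal).
sw <- shapiro.test(residuals(fit))

# Largest Cook's distance: how much a single point influences the fit.
max_cook <- max(cooks.distance(fit))

cat("Outliers (|standardized residual| > 3):", n_outliers, "\n")
cat("Shapiro-Wilk p-value:", sw$p.value, "\n")
cat("Largest Cook's distance:", max_cook, "\n")

# In an interactive session, plot(fit, which = 1:2) draws the
# residuals-vs-fitted and normal Q-Q plots.
```

`shapiro.test()` accepts at most 5000 observations, so for a dataset the size of the one used here a Q-Q plot is the more practical normality check.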