INTRODUCTION

The purpose of this study was to identify storm events and determine whether the location and type of an event are associated with the number of human lives lost. As a Ph.D. student, I continually work on anomaly detection in different contexts. In the context of storm data, understanding how the location of a storm event affects the loss of human life makes it possible to identify the combinations of storm type and location that produce the largest loss of life. Previous studies focused on a specific type of event [1,2]; however, identifying the key factors that determine the number of deaths can provide useful hints for predicting them. An area is usually prepared to deal with specific types of storm events, so an event of a type that rarely occurs in that area can have a larger impact because it is unexpected.

METHODOLOGY

Throughout this paper, we attempt to detect an association between the location of storm events and the number of human lives lost. Section 2.1 explains how we sampled the data, section 2.2 describes the data modifications and the variables used, and section 2.3 gives a detailed explanation of the analysis.

Sample

The dataset used for this study is the storm event data published by the National Weather Service [3]. The dataset contains information about weather events such as hurricanes, tornadoes, thunderstorms, hail, floods, drought conditions, lightning, high winds, snow, temperature extremes, and other weather phenomena, regardless of whether they caused any impact or damage. The dataset has N=166,048 observations and 49 attributes. We used the full dataset for our experiments; that is, we did not sample the data.

Measures

The original dataset has 48 variables; for our experiments we used only 6 of them (Table 1). We selected these variables for their potential to reveal an association between location and the deaths or damage produced by storm events. Event type is the type of event that occurred (e.g., hurricane, snow). Zone type corresponds to the location of the event (e.g., county, zone, or marine). State is the name of the state where the event occurred. Time zone is the time zone where the storm occurred (EST, CST, MST, etc.). Direct deaths and Indirect deaths are the numbers of deaths caused directly and indirectly by the event.

Name               Type
Event type         Categorical
Zone type          Categorical
State              Categorical
Time zone          Categorical
Direct deaths      Numerical
Indirect deaths    Numerical

Table 1. Selected variables.

We combined the direct and indirect death variables into a single response variable, “deaths” (Table 2), which comprises the number of deaths directly and indirectly related to the weather event.

Name          Type
Event type    Categorical
Zone type     Categorical
State         Categorical
Time zone     Categorical
Deaths        Numerical

Table 2. Selected variables after joining direct and indirect death variables into a single response variable deaths.
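The step from Table 1 to Table 2 can be sketched as follows. This is a minimal illustration, not the code used in the study: the field names (`direct_deaths`, `indirect_deaths`, `deaths`) and the two records are hypothetical.

```python
# Hypothetical records: the keys mirror the variables in Table 1, but the
# names and values are made up for illustration.
events = [
    {"event_type": "Flash Flood", "direct_deaths": 2, "indirect_deaths": 1},
    {"event_type": "Hail", "direct_deaths": 0, "indirect_deaths": 0},
]


def combine_deaths(record):
    """Return a copy of `record` with both death counts summed into 'deaths'."""
    combined = {
        key: value
        for key, value in record.items()
        if key not in ("direct_deaths", "indirect_deaths")
    }
    combined["deaths"] = record["direct_deaths"] + record["indirect_deaths"]
    return combined


combined_events = [combine_deaths(event) for event in events]
```

After this step each record carries a single numerical response, as in Table 2.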

We grouped the states into 10 geographical regions, listed in Table 3.

Region               States
NewEngland           “CONNECTICUT”, “MAINE”, “MASSACHUSETTS”, “NEW HAMPSHIRE”, “RHODE ISLAND”, “VERMONT”
Atlantic             “NEW YORK”, “NEW JERSEY”, “PUERTO RICO”, “VIRGIN ISLANDS”
MidAtlantic          “DELAWARE”, “MARYLAND”, “PENNSYLVANIA”, “VIRGINIA”, “WASHINGTON”, “WEST VIRGINIA”
Southeast            “ALABAMA”, “FLORIDA”, “GEORGIA”, “KENTUCKY”, “MISSISSIPPI”, “NORTH CAROLINA”, “SOUTH CAROLINA”, “TENNESSEE”
Great_Lakes          “ILLINOIS”, “INDIANA”, “MICHIGAN”, “MINNESOTA”, “OHIO”, “WISCONSIN”
South_Central        “ARKANSAS”, “LOUISIANA”, “NEW MEXICO”, “OKLAHOMA”, “TEXAS”
Great_Plains         “IOWA”, “KANSAS”, “MISSOURI”, “NEBRASKA”
Rocky_Mountains      “COLORADO”, “MONTANA”, “NORTH DAKOTA”, “SOUTH DAKOTA”, “UTAH”, “WYOMING”
Pacific              “ARIZONA”, “CALIFORNIA”, “GUAM”, “HAWAII”, “NEVADA”, “HAWAII WATERS”
Pacific_Northwest    “ALASKA”, “IDAHO”, “OREGON”, “WASHINGTON”

Table 3. States grouped into regions.
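A grouping like the one in Table 3 is naturally applied as a lookup from state to region. The sketch below uses four of the ten regions, copied as printed in the table; the function and variable names are hypothetical. Because “WASHINGTON” appears under both MidAtlantic and Pacific_Northwest in Table 3, the sketch keeps the first assignment and records the conflict. How the original analysis resolved this case is not stated.

```python
# Four of the ten regions from Table 3, copied as printed.
REGIONS = {
    "NewEngland": ["CONNECTICUT", "MAINE", "MASSACHUSETTS",
                   "NEW HAMPSHIRE", "RHODE ISLAND", "VERMONT"],
    "MidAtlantic": ["DELAWARE", "MARYLAND", "PENNSYLVANIA",
                    "VIRGINIA", "WASHINGTON", "WEST VIRGINIA"],
    "Great_Plains": ["IOWA", "KANSAS", "MISSOURI", "NEBRASKA"],
    "Pacific_Northwest": ["ALASKA", "IDAHO", "OREGON", "WASHINGTON"],
}


def build_state_lookup(regions):
    """Invert region -> states into state -> region.

    The first region listed for a state wins; every state that is listed
    under more than one region is also reported in `conflicts`.
    """
    lookup, conflicts = {}, {}
    for region, states in regions.items():
        for state in states:
            if state in lookup:
                conflicts.setdefault(state, [lookup[state]]).append(region)
            else:
                lookup[state] = region
    return lookup, conflicts


STATE_TO_REGION, CONFLICTS = build_state_lookup(REGIONS)
```

Here `STATE_TO_REGION["KANSAS"]` gives `"Great_Plains"`, and `CONFLICTS` lists the states that need a manual decision before the region variable is used as a predictor.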

Figure 1. Time zones. Image source: https://www.sba.gov/tools/local-assistance/regional offices

RESULTS

The predictor variables were examined using frequency plots and frequency tables. We also examined the standard deviation, mean, median, maximum, and minimum values of the response variable.

We expect to use decision trees for prediction, which can predict our response variable from all or some of the predictor variables. An advantage of decision trees is that they are easy to interpret, so our analysis can provide some hints about the behavior of storm events. To increase the predictive power of decision trees we will use random forest and cross-validation, with which we expect to reduce the error rate on the test set; however, by using random forest we will lose the interpretability of decision trees in exchange for a potential decrease in error. We will analyze both results and then decide whether using random forest is worthwhile.

Simple analysis

Table 4 displays summary statistics for deaths, the single quantitative variable in the dataset. The minimum number of deaths is 0.0 and the maximum is 43; the mean and median are 0.01 and 0.0, respectively. The frequency plot of deaths shows that the distribution is right-skewed (Figure 2).

The remaining 4 variables are categorical. Of the events, 25% are Thunderstorm wind, 17% Hail, 8% Winter weather, 7% Flash flood, 6% Winter storm, and 6% Drought; the remaining 32% are other types of events. 58% of the events occurred in a county/parish and the remaining 42% in a zone. 9% of the events happened in Texas; Kansas, Iowa, and Nebraska each account for 4%, Oklahoma and Kentucky for 3% each, and the remaining 73% are distributed among the other states. 95% of the events occurred in 3 time zones (46% in CST-6, 36% in EST-5, and 13% in MST-7), and the remaining 4% in the other time zones.

Statistic    Value
Min          0.0
Median       0.0
Mean         0.01
Max          43.0

Table 4. Summary statistics of the number of deaths.
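The kind of summary reported above can be reproduced with a few lines of standard-library Python. The sketch below is illustrative only: the ten toy observations and the function names are hypothetical and do not come from the storm event data.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical toy sample (NOT the real data): deaths per event and the
# corresponding event types.
deaths = [0, 0, 0, 1, 0, 0, 4, 0, 0, 0]
event_types = [
    "Thunderstorm Wind", "Hail", "Thunderstorm Wind", "Flash Flood",
    "Hail", "Drought", "Flash Flood", "Thunderstorm Wind", "Hail",
    "Thunderstorm Wind",
]


def summarize(values):
    """Min, median, mean and max of a numerical variable (as in Table 4)."""
    return {
        "min": min(values),
        "median": median(values),
        "mean": mean(values),
        "max": max(values),
    }


def percentages(labels):
    """Share of each category in percent, most frequent category first."""
    counts = Counter(labels)
    return {label: 100 * n / len(labels) for label, n in counts.most_common()}


death_summary = summarize(deaths)
event_shares = percentages(event_types)
```

In this toy sample the mean (0.5) lies above the median (0), which is the same pattern that indicates right skew in Table 4.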

Multivariate analyses

We divided the original dataset into a training set and a test set: 75% of the instances were used for training and the remaining 25% for testing. Because the response variable is numerical and the predictors are categorical, we decided to use random forest regression.
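A 75/25 split of this kind can be sketched as below. The shuffling, the seed, and the function name are assumptions made for illustration; the text does not say how the instances were assigned to each set.

```python
import random


def train_test_split(rows, train_fraction=0.75, seed=0):
    """Shuffle a copy of `rows` and cut it into a training and a test part.

    The seed is only there to make the sketch reproducible.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Row indices stand in for the N=166,048 observations.
rows = list(range(166048))
train, test = train_test_split(rows)
```

With N=166,048 this yields 124,536 training instances and 41,512 test instances.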

We examined the variable importance analysis produced by random forest [4]. The left panel of Figure 3 shows the %IncMSE for each variable, and the right panel shows the mean decrease in node impurity. Splitting on event type produces the largest reduction in node impurity. We used this model for prediction on the test set, where the MSE was 0.096.

Figure 3. Importance of variables
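To make the two quantities above concrete, the sketch below computes an MSE and a permutation-style importance: the increase in MSE after the values of one predictor are shuffled. It is a simplified stand-in and not the randomForest procedure, which, as we understand it, permutes each variable on the out-of-bag samples of every tree and reports a normalized increase. The model here is a toy lookup of mean deaths per event type, and all names and data are hypothetical.

```python
import random


def mse(actual, predicted):
    """Mean squared error between two equal-length sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def fit_group_means(rows, key):
    """Toy stand-in model: mean number of deaths per category of `key`."""
    totals, counts = {}, {}
    for row in rows:
        totals[row[key]] = totals.get(row[key], 0) + row["deaths"]
        counts[row[key]] = counts.get(row[key], 0) + 1
    return {category: totals[category] / counts[category] for category in totals}


def permutation_importance(rows, model, key, column, seed=0):
    """Increase in MSE after shuffling the values of `column` across rows."""
    actual = [row["deaths"] for row in rows]
    baseline = mse(actual, [model[row[key]] for row in rows])
    shuffled = [row[column] for row in rows]
    random.Random(seed).shuffle(shuffled)
    permuted = [dict(row, **{column: value}) for row, value in zip(rows, shuffled)]
    return mse(actual, [model[row[key]] for row in permuted]) - baseline


# Hypothetical data in which deaths depend only on the event type.
rows = [
    {"event_type": "Tornado", "zone_type": "C", "deaths": 2},
    {"event_type": "Hail", "zone_type": "Z", "deaths": 0},
] * 4
model = fit_group_means(rows, "event_type")
importance = {
    column: permutation_importance(rows, model, "event_type", column)
    for column in ("event_type", "zone_type")
}
```

Shuffling a variable the model does not use (here `zone_type`) leaves the error unchanged, while shuffling a variable the model relies on can only keep or raise the error in this toy setting, because the unshuffled model fits the data exactly.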

CONCLUSIONS

This project used random forest regression to identify the best variables for predicting the number of deaths caused by a storm event (N=166,048). The number of deaths in this dataset ranged from 0 to 43. The model's MSE on the test set was 0.096. The random forest results indicated that the type of event is the most discriminative feature, which implies that it is the most important factor to consider when predicting the number of deaths. However, a different categorization of states could provide more hints about the importance of location for deaths produced by storm events. Another limitation of this study is that it used only a subset of the available features; a deeper approach could take into account all the variables in the original dataset. In further research we will address these limitations by using all the available features and a different grouping method for the state variable.

BIBLIOGRAPHY

  1. French, Jean, et al. “Mortality from flash floods: a review of National Weather Service reports, 1969-81.” Public Health Reports 98.6 (1983): 584.

  2. Chen, Yong-Shing, et al. “Effects of Asian dust storm events on daily mortality in Taipei, Taiwan.” Environmental Research 95.2 (2004): 151-155.

  3. National Weather Service. Storm event data. http://www.ncdc.noaa.gov/news/storm-events-database-version-30

  4. Liaw, Andy, and Matthew Wiener. “Classification and regression by randomForest.” R News 2.3 (2002): 18-22.