A statistical analysis of National Weather Service Storm Data

The data for this analysis were obtained from the US National Weather Service. Documentation on the dataset can be found on the Storm Data Documentation website.

Synopsis

In this report, we study data on storm-related events in the USA between 1950 and November 2011. Because the event labels contain typos and multiple spellings for the same type of event, we first perform a quick-and-dirty clean-up. We then show the top 10 most harmful storm-related events in three categories, summed over the full history of the dataset:
1. Number of fatalities
2. Number of injuries
3. Total economic damage (crops + property)

Data Processing

Used packages and system info

sessionInfo()

## R version 3.3.2 (2016-10-31)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 14393)
## 
## locale:
## [1] LC_COLLATE=Dutch_Netherlands.1252  LC_CTYPE=Dutch_Netherlands.1252   
## [3] LC_MONETARY=Dutch_Netherlands.1252 LC_NUMERIC=C                      
## [5] LC_TIME=Dutch_Netherlands.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] backports_1.0.5 magrittr_1.5    rprojroot_1.2   tools_3.3.2    
##  [5] htmltools_0.3.5 Rcpp_0.12.9     stringi_1.1.2   rmarkdown_1.3  
##  [9] knitr_1.15.1    stringr_1.1.0   digest_0.6.12   evaluate_0.10

Loading and inspecting dataset

if(!file.exists("weather.csv.bz2")){
        download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "weather.csv.bz2")
}
weather_raw <- read.csv("weather.csv.bz2", header = TRUE, sep = ",", stringsAsFactors = FALSE)

After downloading the dataset, we load it into memory as weather_raw and inspect the data frame:

# Inspect dataset and check for missing values
str(weather_raw)

## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

colSums(is.na(weather_raw))

##    STATE__   BGN_DATE   BGN_TIME  TIME_ZONE     COUNTY COUNTYNAME 
##          0          0          0          0          0          0 
##      STATE     EVTYPE  BGN_RANGE    BGN_AZI BGN_LOCATI   END_DATE 
##          0          0          0          0          0          0 
##   END_TIME COUNTY_END COUNTYENDN  END_RANGE    END_AZI END_LOCATI 
##          0          0     902297          0          0          0 
##     LENGTH      WIDTH          F        MAG FATALITIES   INJURIES 
##          0          0     843563          0          0          0 
##    PROPDMG PROPDMGEXP    CROPDMG CROPDMGEXP        WFO STATEOFFIC 
##          0          0          0          0          0          0 
##  ZONENAMES   LATITUDE  LONGITUDE LATITUDE_E LONGITUDE_    REMARKS 
##          0         47          0         40          0          0 
##     REFNUM 
##          0

Missing values occur solely in the columns COUNTYENDN, F, LATITUDE and LATITUDE_E. None of these variables is used in the analysis below.
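Note that is.na() counts only true NA values. Several character columns in the str() output above (END_DATE and CROPDMGEXP, for example) contain empty strings, which is.na() does not flag. A minimal sketch of counting both kinds of gap, assuming a small made-up data frame (toy and its values are illustrative, not taken from the dataset):

```r
# Toy data frame (hypothetical values) mixing NA and empty strings
toy <- data.frame(
        END_DATE = c("", "", "4/18/1950 0:00:00"),
        F        = c(3L, NA, 2L),
        stringsAsFactors = FALSE
)

# True NA values per column
na_count <- colSums(is.na(toy))

# Empty strings per column; comparing NA with "" yields NA, so drop those with na.rm
empty_count <- colSums(toy == "", na.rm = TRUE)

print(na_count)
print(empty_count)
```

Since the analysis below only uses EVTYPE, FATALITIES, INJURIES and the damage columns, empty strings matter mainly for the damage multiplier columns, where they are treated as "no multiplier".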

Clean up event labels

head(grep("[wW][iI][nN][dD]", unique(weather_raw$EVTYPE), value=TRUE),6)

## [1] "TSTM WIND"                    "HURRICANE OPAL/HIGH WINDS"   
## [3] "THUNDERSTORM WINDS"           "THUNDERSTORM WIND"           
## [5] "HIGH WINDS"                   "THUNDERSTORM WINDS LIGHTNING"

The result of the grep operation shows a problem with the EVTYPE values in the dataset. Searching case-insensitively for "wind", we see that “TSTM WIND”, “THUNDERSTORM WINDS” and “THUNDERSTORM WIND” refer to the same event but are spelled differently. This inconsistency hampers the categorization of events, so we first try to clean it up.
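The same case-insensitive search can be written without spelling out each character class, using grep's ignore.case argument. A minimal sketch on a short vector of labels; "Gusty wind" and "HAIL" are illustrative additions, the others come from the output above:

```r
# Labels from the grep output above, plus two illustrative ones
labels <- c("TSTM WIND", "THUNDERSTORM WINDS", "Gusty wind", "HAIL")

# ignore.case = TRUE matches "wind" regardless of capitalization
wind_labels <- grep("wind", labels, ignore.case = TRUE, value = TRUE)
print(wind_labels)
```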

To make grouping of the events easier, we perform the following steps. This clean-up is far from ideal, and a more precise analysis would require spending more time on it.

# Make all EVTYPE names lowercase, to remove redundant factors from the use of capital letters.
weather <- weather_raw
weather$EVTYPE <- tolower(weather$EVTYPE)
weather$EVTYPE <- gsub("tstm", "thunderstorm", weather$EVTYPE)
weather$EVTYPE <- gsub("winds", "wind", weather$EVTYPE)
weather$EVTYPE <- gsub("rain|raining", "rain", weather$EVTYPE)
weather$EVTYPE <- gsub("floods|flooding", "flood", weather$EVTYPE)
weather$EVTYPE <- gsub("thu.*wind", "thunderstorm wind", weather$EVTYPE)
weather$EVTYPE <- gsub("record[ ]*", "", weather$EVTYPE)
weather$EVTYPE <- gsub("excessive[ ]*", "", weather$EVTYPE)
weather$EVTYPE <- gsub("heavy[ ]*", "", weather$EVTYPE)
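As a quick check of what these substitutions do, the sketch below applies the wind-related steps, in the same order, to the three spellings found earlier. The vector names raw_labels and cleaned are illustrative only:

```r
# Three spellings of the same event, taken from the grep output above
raw_labels <- c("TSTM WIND", "THUNDERSTORM WINDS", "THUNDERSTORM WIND")

# Apply the same substitutions as the clean-up code, in the same order
cleaned <- tolower(raw_labels)
cleaned <- gsub("tstm", "thunderstorm", cleaned)
cleaned <- gsub("winds", "wind", cleaned)
cleaned <- gsub("thu.*wind", "thunderstorm wind", cleaned)

# All three spellings collapse to a single label
print(unique(cleaned))
```

Compound labels such as "THUNDERSTORM WINDS LIGHTNING" keep their suffix after these steps and therefore remain a separate category, which is one reason the clean-up is only approximate.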

Results

Total fatalities and injuries of storm-related events, 1950-2011

We sum all fatalities and injuries over the whole time period of the dataset, per event-type:

# arrange() and desc() come from dplyr, which must be attached first
library(dplyr)
# Sum fatalities and injuries by EVTYPE and order by descending count
fatal <- aggregate(weather$FATALITIES, by = list(weather$EVTYPE), sum)
names(fatal) <- c("EVTYPE","FATALITIES")
fatal <- arrange(fatal, desc(FATALITIES))
injury <- aggregate(weather$INJURIES, by = list(weather$EVTYPE), sum)
names(injury) <- c("EVTYPE","INJURIES")
injury <- arrange(injury, desc(INJURIES))

# Make figure 1
par(las=2, mar=c(10,5,3,2), mfrow = c(1,2))
barplot(fatal$FATALITIES[1:10]/1000, names.arg = fatal$EVTYPE[1:10], main="No. casualties, 1950-2011", ylab="No. fatalities (thousands)")
barplot(injury$INJURIES[1:10]/1000, names.arg = injury$EVTYPE[1:10], ylab="No. injuries (thousands)")

Figure 1 shows the results of this summation. Tornadoes are clearly responsible for the most storm-related deaths in the US, as well as the most injuries.
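The aggregate-then-rank step behind Figure 1 can also be done in base R alone. The following is a minimal sketch on made-up counts (the toy data frame and its numbers are hypothetical, not taken from the dataset), using the formula interface of aggregate() and order():

```r
# Toy data (hypothetical counts, not taken from the dataset)
toy <- data.frame(
        EVTYPE     = c("tornado", "flood", "tornado", "heat", "flood"),
        FATALITIES = c(3, 1, 2, 4, 0),
        stringsAsFactors = FALSE
)

# Sum fatalities per event type; the formula interface keeps the column names
totals <- aggregate(FATALITIES ~ EVTYPE, data = toy, FUN = sum)

# Order from most to least fatalities using base R only
totals <- totals[order(totals$FATALITIES, decreasing = TRUE), ]

# Keep the top entries (the report uses 10; 2 here because the toy data is small)
top <- head(totals, 2)
print(top)
```

With the formula interface the result already carries the names EVTYPE and FATALITIES, so no separate names() assignment is needed.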

Total crop and property damage of storm-related events, 1950-2011

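To calculate the total economic damage by event type, we have to add the crop damage and the damage done to other property. Each of these consists of a value (CROPDMG, PROPDMG) and a multiplier code (CROPDMGEXP, PROPDMGEXP), where K, M and B stand for thousands, millions and billions. Values with any other code are left unscaled. We apply the multiplier and sum the crop and property damage with the following code: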

# Make a column for weather with the total damage in dollars.
# to_dollars() scales a damage value by its exponent code (k/m/b, any case);
# any other code, including the empty string, leaves the value unscaled.
to_dollars <- function(value, exp_code) {
        multiplier <- c(k = 10^3, m = 10^6, b = 10^9)[tolower(exp_code)]
        multiplier[is.na(multiplier)] <- 1
        unname(value * multiplier)
}
crop <- to_dollars(weather$CROPDMG, weather$CROPDMGEXP)
prop <- to_dollars(weather$PROPDMG, weather$PROPDMGEXP)

weather$TOTALDMG <- crop + prop

# sum total damage by EVTYPE and order
damage <- aggregate(weather$TOTALDMG, by = list(weather$EVTYPE), sum)
names(damage) <- c("EVTYPE","TOTALDMG")
damage <- arrange(damage, desc(TOTALDMG))

# Make figure 2
par(las=2, mar=c(10,5,3,2), mfrow = c(1,1))
barplot(damage$TOTALDMG[1:10]/10^9, names.arg = damage$EVTYPE[1:10], main="Top 10 most damaging storm-related events, 1950-2011", ylab="Damage (billion US dollars)")

Figure 2 shows that floods are responsible for the largest economic damage. Tornadoes, which proved to be the most lethal, are the third most damaging event.
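To make the damage arithmetic concrete, the sketch below works through the first row of the dataset as shown in the str() output (PROPDMG 25 with PROPDMGEXP "K", CROPDMG 0 with an empty CROPDMGEXP). The helper exp_multiplier is a hypothetical name introduced here for illustration; it mirrors the rule used above that only k, m and b (in either case) scale the value:

```r
# Hypothetical helper: map an exponent code to its multiplier.
# Codes other than k/m/b (any case) are treated as 1, as in the code above.
exp_multiplier <- function(exp_code) {
        lookup <- c(k = 1e3, m = 1e6, b = 1e9)
        multiplier <- unname(lookup[tolower(exp_code)])
        multiplier[is.na(multiplier)] <- 1
        multiplier
}

# First row of the dataset, values read from the str() output
prop <- 25 * exp_multiplier("K")   # property damage in dollars
crop <- 0 * exp_multiplier("")     # crop damage in dollars
total <- prop + crop
print(total)
```

For this row the total damage is 25,000 dollars, all of it property damage.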