Synopsis

In this report we aim to identify which weather events are the most harmful in the USA. To do that we used the National Oceanic and Atmospheric Administration (NOAA) Storm Database. This database is an official NOAA publication which documents the occurrence of storms and other significant weather phenomena having sufficient intensity to cause loss of life, injuries, significant property damage, and/or disruption to commerce.

In this report we examine the impact of weather events from two different perspectives: harm done to population health and economic impact. From those analyses, we could see that tornadoes are the most harmful events for the population and that floods have the biggest economic impact.

Data Processing

First of all, we load the libraries we are going to use:

library(dplyr)
library(ggplot2)
library(gridExtra)

We download the data from the NOAA Storm Database and load it into R. It contains weather events from 1950 to November 2011.

destfile="stormdata.csv.bz2"
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
                      destfile)
stormdata<-read.csv(destfile)

We check that the data has been read in properly:

str(stormdata)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

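EVTYPE contains the weather event name, but there are 985 different event types. We took a closer look at a random sample of them:

# Use the full column name rather than relying on partial matching of `$EV`
weather_types<-levels(stormdata$EVTYPE)
set.seed(400)
# Note: this draws from the first 980 of the 985 levels
weather_types[sample(1:980,30)]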
##  [1] "EXTREME WINDCHILL TEMPERATURES" "FLOOD/RAIN/WINDS"              
##  [3] "Summary of June 24"             "HIGH WAVES"                    
##  [5] "URBAN FLOODING"                 "HIGH WINDS/COLD"               
##  [7] "Gusty winds"                    "MARINE THUNDERSTORM WIND"      
##  [9] "HAIL STORM"                     "HEAT/DROUGHT"                  
## [11] "Damaging Freeze"                "RECORD PRECIPITATION"          
## [13] "THUNDERSTORM WIND."             "Saharan Dust"                  
## [15] "Summary September 23"           "Landslump"                     
## [17] "NON-SEVERE WIND DAMAGE"         "TYPHOON"                       
## [19] "HARD FREEZE"                    "Record Warmth"                 
## [21] "FLOOD/FLASH FLOOD"              "Snow"                          
## [23] "MUD SLIDES"                     "Summary of April 21"           
## [25] "COASTAL FLOOD"                  "HAIL 100"                      
## [27] "LANDSLIDES"                     "WALL CLOUD/FUNNEL CLOUD"       
## [29] "DUST DEVIL WATERSPOUT"          "DAMAGING FREEZE"

As can be seen, there are odd entries such as “Summary August 7” and near-duplicate entries such as “HAIL 1.00” and “HAIL 1.75)”, in addition to the plain “HAIL” category. To treat this properly, a more in-depth study of these values would be needed and some of the data would have to be cleaned up. We can check how many of these event types have a real impact, either in the number of events or in the fatalities registered:

by_event<-stormdata %>% 
          group_by(EVTYPE) %>% 
          summarise(sum_fatalities=sum(FATALITIES), 
                    c=n())

quantile(by_event$c, c(0.8,0.9,0.95,0.99,1))
##      80%      90%      95%      99%     100% 
##      7.0     35.2    258.8  12360.6 288661.0
sum(head(by_event$c[order(by_event$c,decreasing = T)],30))/sum(by_event$c)
## [1] 0.9787321
quantile(by_event$sum_fatalities, c(0.8,0.9,0.95,0.99,1))
##     80%     90%     95%     99%    100% 
##    0.00    3.00   14.00  208.88 5633.00
sum(head(by_event$sum_fatalities[order(by_event$sum_fatalities,decreasing = T)],30))/sum(by_event$sum_fatalities)
## [1] 0.9425553

As can be seen, the impact in both fatalities and number of events is very low for most event types: the 30 most frequent types account for about 98% of all records, and the 30 deadliest types account for about 94% of all fatalities. Since this is just a course assignment, I will leave the cleanup for another time. However, we will see that it has some impact on the ranking of events by overall economic impact.
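The cleanup that is left out here could start with a simple normalisation of the labels. The following is only a minimal sketch in base R, not part of the original analysis: the function name `normalise_evtype` and the rules it applies (upper-casing, trimming, folding “HAIL <size>” into “HAIL”, dropping “Summary …” rows) are assumptions chosen for illustration, and a real cleanup would need to be checked against the NOAA documentation.

```r
# Hypothetical helper (an assumption of this sketch, not the report's method):
# normalise raw EVTYPE labels before grouping.
normalise_evtype <- function(x) {
  x <- toupper(trimws(as.character(x)))   # unify case, drop outer spaces
  x <- gsub("[[:space:]]+", " ", x)       # collapse repeated whitespace
  x <- sub("[.)]+$", "", x)               # drop trailing "." or ")"
  x <- sub("^HAIL[ 0-9.]+$", "HAIL", x)   # "HAIL 1.75" -> "HAIL" (assumed rule)
  x[grepl("^SUMMARY", x)] <- NA           # "Summary ..." rows are not event types
  x
}

# A few labels taken from the listings in this report
raw <- c("   HIGH SURF ADVISORY", "HAIL 1.00", "HAIL 1.75)", "HAIL 100",
         "Summary of June 24", "THUNDERSTORM WIND.", "Damaging Freeze",
         "DAMAGING FREEZE")
clean <- normalise_evtype(raw)
print(clean)
```

Applied to the full data set, the cleaned labels could replace EVTYPE before the `group_by()` step, so that variants of the same event are summed together.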

Another aspect to consider is that, according to the documentation, the way the information is recorded has changed over the years. So I decided to check how these changes in the recording process may affect the figures in this report. That analysis is described in a separate report; it does not seem to change the results presented here significantly.
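One simple way to look at changes in recording practice is to count how many distinct event types appear in each year. The sketch below is an assumption about how such a check could be done, not the analysis referred to above; it uses a small made-up data frame `toy` with the same BGN_DATE format as the real data.

```r
# Toy stand-in for stormdata (made-up rows, same column names and date format)
toy <- data.frame(
  BGN_DATE = c("4/18/1950 0:00:00", "6/8/1951 0:00:00",
               "1/6/1996 0:00:00", "7/4/1996 0:00:00", "9/2/1996 0:00:00"),
  EVTYPE   = c("TORNADO", "TORNADO", "WINTER STORM", "FLOOD", "FLOOD"),
  stringsAsFactors = FALSE
)

# Extract the year; the trailing time part is ignored when parsing the date
toy$year <- as.integer(format(as.Date(toy$BGN_DATE, format = "%m/%d/%Y"), "%Y"))

# Number of distinct event types recorded in each year
types_per_year <- sapply(split(toy$EVTYPE, toy$year),
                         function(x) length(unique(x)))
print(types_per_year)
```

On the real data, a sharp jump in this count from one year to the next would suggest a change in how events were recorded rather than a change in the weather.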

Taking into consideration these two factors, we start the analysis on the impact on population health and economic consequences.

Results

Harm done to population

We consider two kinds of harm done to the population: fatalities and injuries.

We will analyze them separately. First we consider fatalities caused by weather events:

Fatalities

by_event<-stormdata %>% 
          group_by(EVTYPE) %>% 
          summarise(sum_fatalities=sum(FATALITIES), 
                    sum_injuries=sum(INJURIES))

top_fatalities<-by_event %>% arrange(desc(sum_fatalities)) %>% head(5)
top_fatalities$EVTYPE<-factor(top_fatalities$EVTYPE, 
                              levels=top_fatalities$EVTYPE[order(top_fatalities$sum_fatalities,
                              decreasing = TRUE)])

From this we can see the five event types with the most fatalities.

The most harmful event type is TORNADO, with the following number of fatalities:

top_fatalities$sum_fatalities[1]
## [1] 5633

Injuries

top_injuries<-by_event %>% arrange(desc(sum_injuries)) %>% head(5)
top_injuries$EVTYPE<-factor(top_injuries$EVTYPE,
                            levels=top_injuries$EVTYPE[order(top_injuries$sum_injuries,
                            decreasing = TRUE)])

From this we can see the five event types with the most injuries.

The event type with the most injured people is, again, TORNADO, with the following number of injuries:

top_injuries$sum_injuries[1]
## [1] 91346

Economic impact

In the database, economic impact is recorded in two separate fields: property damage and crop damage. We follow the same approach, and at the end we add both together to see the total impact of the different weather events on the US economy.

In the database, property and crop damage are each split into two columns: one contains the value of the economic loss and the other its order of magnitude. However, the order-of-magnitude columns contain some errors:

table(stormdata$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5 
## 465934      1      8      5    216     25     13      4      4     28 
##      6      7      8      B      h      H      K      m      M 
##      4      5      1     40      1      6 424665      7  11330
table(stormdata$CROPDMGEXP)
## 
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994

As can be seen, most of the values are K (thousands) or M (millions). From the documentation we know that B stands for billions (10^9), but we don’t know what the other values mean. Since they are very few compared to the others, I will not take them into consideration at all (this includes the lower-case k, m and h). However, it must be noted that, depending on their real meaning (8->10^8?), they could have a huge impact on this analysis.
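The conversion from value and exponent code to dollars can also be written as a lookup table. This is a minimal base R sketch with a hypothetical helper `exp_multiplier`; unlike the analysis in this report, it treats lower-case letters as their upper-case equivalents, which is an assumption and would change the totals slightly.

```r
# Hypothetical helper: map exponent codes to numeric multipliers.
# K, M and B are the documented codes. Treating "k"/"m"/"b" like "K"/"M"/"B"
# is an ASSUMPTION of this sketch; the report itself ignores lower-case codes.
exp_multiplier <- function(exp_code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  mult <- lookup[toupper(as.character(exp_code))]
  mult[is.na(mult)] <- 0   # unknown codes ("", "?", "+", digits, ...) count as 0
  unname(mult)
}

# Made-up example values
dmg  <- c(25, 2.5, 1, 3, 7)
code <- c("K", "M", "B", "?", "m")
dollars <- dmg * exp_multiplier(code)
print(dollars)
```

Keeping the mapping in one named vector makes it easy to add further codes later if their meaning is confirmed.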

First we extract the economic cost for property damage, crop damage and total by type of event.

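# Keep only the columns needed for the economic analysis
economic<-select(stormdata, BGN_DATE, EVTYPE, PROPDMG, CROPDMG, PROPDMGEXP, CROPDMGEXP)

# Convert damage to dollars; exponents other than K, M and B count as 0.
# mutate() is applied to `economic` (not `stormdata`) so the select() above is kept.
economic<-mutate(economic, 
                 PROP=case_when(PROPDMGEXP=="M" ~ PROPDMG*10^6, 
                                PROPDMGEXP=="B" ~ PROPDMG*10^9, 
                                PROPDMGEXP=="K" ~ PROPDMG*10^3, 
                                TRUE ~ 0),
                 CROP=case_when(CROPDMGEXP=="M" ~ CROPDMG*10^6, 
                                CROPDMGEXP=="B" ~ CROPDMG*10^9, 
                                CROPDMGEXP=="K" ~ CROPDMG*10^3, 
                                TRUE ~ 0),
                 TOTAL=PROP+CROP
                 )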

e_by_event<-economic %>% 
            group_by(EVTYPE) %>% 
            summarise(sum_prop=sum(PROP), 
                      sum_crop=sum(CROP),
                      sum_total=sum(TOTAL))

Property damage

First we get the property damage cost by type of weather event.

top_prop<-e_by_event %>% arrange(desc(sum_prop)) %>% head(5)
top_prop$EVTYPE<-factor(top_prop$EVTYPE, 
                        levels=top_prop$EVTYPE[order(top_prop$sum_prop,
                                                     decreasing = TRUE)])

From this we can see that the event types with the greatest property damage are:

## # A tibble: 5 x 2
##              EVTYPE     sum_prop
##              <fctr>        <dbl>
## 1             FLOOD 144657709800
## 2 HURRICANE/TYPHOON  69305840000
## 3           TORNADO  56925660480
## 4       STORM SURGE  43323536000
## 5       FLASH FLOOD  16140811510

The most costly event type for property is FLOOD, with about $145 billion.

Crop damage

First we get the crop damage cost by type of weather event.

top_crop<-e_by_event %>% arrange(desc(sum_crop)) %>% head(5)
top_crop$EVTYPE<-factor(top_crop$EVTYPE, 
                        levels=top_crop$EVTYPE[order(top_crop$sum_crop,
                                                     decreasing = TRUE)])

From this we can see that the event types with the greatest crop damage are:

## # A tibble: 5 x 2
##        EVTYPE    sum_crop
##        <fctr>       <dbl>
## 1     DROUGHT 13972566000
## 2       FLOOD  5661968450
## 3 RIVER FLOOD  5029459000
## 4   ICE STORM  5022113500
## 5        HAIL  3025537450

The most costly event type for crops is DROUGHT, with almost $14 billion.

Total damage

First we get the total damage cost by type of weather event.

top_total<-e_by_event %>% arrange(desc(sum_total)) %>% head(10)
top_total$EVTYPE<-factor(top_total$EVTYPE, 
                        levels=top_total$EVTYPE[order(top_total$sum_total,
                                                     decreasing = TRUE)])

From this we can see that the event types with the greatest total economic damage are:

## # A tibble: 10 x 2
##               EVTYPE    sum_total
##               <fctr>        <dbl>
##  1             FLOOD 150319678250
##  2 HURRICANE/TYPHOON  71913712800
##  3           TORNADO  57340613590
##  4       STORM SURGE  43323541000
##  5              HAIL  18752904170
##  6       FLASH FLOOD  17562128610
##  7           DROUGHT  15018672000
##  8         HURRICANE  14610229010
##  9       RIVER FLOOD  10148404500
## 10         ICE STORM   8967041310

We can also see this by plotting those values in the following figures:

plot_p<-ggplot(top_prop, aes(x=EVTYPE, y=sum_prop/10^9)) + 
        geom_bar(stat="identity", fill=c("red"))  + 
        labs(y="Property damages (Billion $)", x="Event") + 
        theme(axis.text.x = element_text(size=7))

plot_c<-ggplot(top_crop, aes(x=EVTYPE, y=sum_crop/10^9)) + 
        geom_bar(stat="identity", fill=c("red"))  + 
        labs(y="Crop damages (Billion $)", x="Event")


plot_t<-ggplot(top_total, aes(x=EVTYPE, y=sum_total/10^9)) + 
        geom_bar(stat="identity", fill=c("red"))  + 
        labs(y="Economic damage (Billion $)", x="Event") + 
        theme(axis.text.x = element_text(size=8))

grid.arrange(plot_t,plot_p,plot_c, ncol=2, layout_matrix=cbind(c(1,2),c(1,3)))

The most costly event type is FLOOD, with more than $150 billion, while DROUGHT is only 7th. This seems to make sense, since drought causes comparatively little property damage.

One thing that needs to be pointed out is that this list contains both “HURRICANE/TYPHOON” and “HURRICANE”. Further analysis is needed to understand whether this distinction is necessary or whether all of these events can be labeled simply as “HURRICANE”.
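To get a feeling for how much such a relabeling would matter, the two categories can be merged and the totals recomputed. The sketch below uses base R and the totals from the table above; treating “HURRICANE/TYPHOON” as “HURRICANE” is an assumption made only for illustration, not a conclusion of this report.

```r
# Totals copied from the "Total damage" table in this report (subset of rows)
totals <- data.frame(
  EVTYPE    = c("FLOOD", "HURRICANE/TYPHOON", "TORNADO", "STORM SURGE", "HURRICANE"),
  sum_total = c(150319678250, 71913712800, 57340613590, 43323541000, 14610229010),
  stringsAsFactors = FALSE
)

# ASSUMPTION for illustration: both labels describe the same kind of event
totals$EVTYPE[totals$EVTYPE == "HURRICANE/TYPHOON"] <- "HURRICANE"

# Re-aggregate and sort by total damage
merged <- aggregate(sum_total ~ EVTYPE, data = totals, FUN = sum)
merged <- merged[order(merged$sum_total, decreasing = TRUE), ]
print(merged)
```

Under this assumption the combined hurricane category totals about $86.5 billion and stays in second place behind FLOOD, so the main conclusion would not change.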

Conclusions

The most harmful weather events for the population are tornadoes, with 5,633 fatalities and more than 91,000 injuries.

The most harmful weather events for the economy are floods, with a total impact of more than $150 billion, about $145 billion of which is property damage. For crops, however, the most harmful event is DROUGHT, with almost $14 billion in crop damage.