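In this analysis we use data from the U.S. National Oceanic and Atmospheric Administration (NOAA) storm database, collected between 1950 and 2011, to understand which types of severe weather events have caused the largest numbers of injuries and fatalities, and which have had the greatest economic consequences.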

The raw data used for this analysis can be downloaded here. The National Weather Service Storm Data Documentation and the National Climatic Data Center Storm Events FAQ contain useful documentation on how variables in the original dataset have been constructed and defined.

Data Processing

To process the raw input data, we import the contents of the compressed CSV file repdata_data_StormData.csv.bz2 into R:

stormdata <- read.csv("repdata_data_StormData.csv.bz2")

Here we print the first few rows to get an idea of the structure of the data:

head(stormdata)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6

As we can see, the INJURIES and FATALITIES columns report how many injuries and fatalities, respectively, have been linked to a particular weather event. The event type is reported in the EVTYPE column.

Events causing the largest number of injuries

We start our analysis by exploring the number of injuries linked to severe weather events. In particular, we calculate the total number of injuries for each event type, take the mean of these totals, and then subset the data to include only the event types whose total is above that mean:

total_injuries_per_event_type <- aggregate(x = stormdata$INJURIES, by = list(stormdata$EVTYPE), FUN = sum, na.rm=TRUE)
names(total_injuries_per_event_type) <- c("EVTYPE","INJURIES")
avg_injuries <- mean(total_injuries_per_event_type$INJURIES,na.rm=TRUE)
avg_injuries
## [1] 142.668
injuries_above_avg <- total_injuries_per_event_type[which(total_injuries_per_event_type$INJURIES > avg_injuries),]
# sort by descending order for better readability 
injuries_above_avg <- injuries_above_avg[order(-injuries_above_avg$INJURIES),]

The full list of event types that have caused an above-average number of injuries - in descending order - is reported here:

injuries_above_avg
##                 EVTYPE INJURIES
## 830            TORNADO    91346
## 854          TSTM WIND     6957
## 164              FLOOD     6789
## 123     EXCESSIVE HEAT     6525
## 452          LIGHTNING     5230
## 269               HEAT     2100
## 424          ICE STORM     1975
## 147        FLASH FLOOD     1777
## 759  THUNDERSTORM WIND     1488
## 238               HAIL     1361
## 972       WINTER STORM     1321
## 406  HURRICANE/TYPHOON     1275
## 354          HIGH WIND     1137
## 304         HEAVY SNOW     1021
## 953           WILDFIRE      911
## 783 THUNDERSTORM WINDS      908
## 22            BLIZZARD      805
## 182                FOG      734
## 956   WILD/FOREST FIRE      545
## 111         DUST STORM      440
## 978     WINTER WEATHER      398
## 82           DENSE FOG      342
## 844     TROPICAL STORM      340
## 274          HEAT WAVE      309
## 368         HIGH WINDS      302
## 582       RIP CURRENTS      297
## 672        STRONG WIND      280
## 284         HEAVY RAIN      251
## 581        RIP CURRENT      232
## 133       EXTREME COLD      231
## 216              GLAZE      216
## 11           AVALANCHE      170
## 135       EXTREME HEAT      155
## 342          HIGH SURF      152
## 955         WILD FIRES      150
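
The aggregate-then-filter steps used above can be tried out on a small data frame. The following is a minimal sketch: the data frame `toy` and its values are hypothetical and are not taken from the NOAA dataset.

```r
# Toy data with hypothetical values (not taken from the NOAA dataset)
toy <- data.frame(
  EVTYPE   = c("TORNADO", "FLOOD", "TORNADO", "HAIL", "FLOOD", "FOG"),
  INJURIES = c(15, 2, 6, 0, 8, 1)
)

# Total injuries per event type
totals <- aggregate(INJURIES ~ EVTYPE, data = toy, FUN = sum)

# Mean of the per-type totals, used as the cut-off
threshold <- mean(totals$INJURIES)

# Keep the event types whose total exceeds the cut-off
above_avg <- totals[totals$INJURIES > threshold, ]

# Sort in descending order of injuries
above_avg <- above_avg[order(-above_avg$INJURIES), ]
above_avg
```

Here the per-type totals are 21 (TORNADO), 10 (FLOOD), 1 (FOG) and 0 (HAIL), so the cut-off is 8 and only TORNADO and FLOOD are kept.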

Events causing the largest number of fatalities

The steps in this analysis are very similar to those in the previous section. In this instance we need to process the data in the FATALITIES column. We calculate the average number of fatalities per event type and then focus on the event types whose total is above average:

total_fatalities_per_event_type <- aggregate(x = stormdata$FATALITIES, by = list(stormdata$EVTYPE), FUN = sum, na.rm=TRUE)
names(total_fatalities_per_event_type) <- c("EVTYPE","FATALITIES")
avg_fatalities <- mean(total_fatalities_per_event_type$FATALITIES,na.rm=TRUE)
avg_fatalities
## [1] 15.37563
fatalities_above_avg <- total_fatalities_per_event_type[which(total_fatalities_per_event_type$FATALITIES > avg_fatalities),]
# Sort by descending order for better readability:
fatalities_above_avg <- fatalities_above_avg[order(-fatalities_above_avg$FATALITIES),]

The full list of event types that have caused an above-average number of fatalities - in descending order - is reported here:

fatalities_above_avg
##                         EVTYPE FATALITIES
## 830                    TORNADO       5633
## 123             EXCESSIVE HEAT       1903
## 147                FLASH FLOOD        978
## 269                       HEAT        937
## 452                  LIGHTNING        816
## 854                  TSTM WIND        504
## 164                      FLOOD        470
## 581                RIP CURRENT        368
## 354                  HIGH WIND        248
## 11                   AVALANCHE        224
## 972               WINTER STORM        206
## 582               RIP CURRENTS        204
## 274                  HEAT WAVE        172
## 133               EXTREME COLD        160
## 759          THUNDERSTORM WIND        133
## 304                 HEAVY SNOW        127
## 134    EXTREME COLD/WIND CHILL        125
## 672                STRONG WIND        103
## 22                    BLIZZARD        101
## 342                  HIGH SURF        101
## 284                 HEAVY RAIN         98
## 135               EXTREME HEAT         96
## 71             COLD/WIND CHILL         95
## 424                  ICE STORM         89
## 953                   WILDFIRE         75
## 406          HURRICANE/TYPHOON         64
## 783         THUNDERSTORM WINDS         64
## 182                        FOG         62
## 397                  HURRICANE         61
## 844             TROPICAL STORM         58
## 336       HEAVY SURF/HIGH SURF         42
## 437                  LANDSLIDE         38
## 59                        COLD         35
## 368                 HIGH WINDS         35
## 875                    TSUNAMI         33
## 978             WINTER WEATHER         33
## 885  UNSEASONABLY WARM AND DRY         29
## 917       URBAN/SML STREAM FLD         28
## 980         WINTER WEATHER/MIX         28
## 833 TORNADOES, TSTM WIND, HAIL         25
## 959                       WIND         23
## 111                 DUST STORM         22
## 154             FLASH FLOODING         19
## 82                   DENSE FOG         18
## 138          EXTREME WINDCHILL         17
## 169          FLOOD/FLASH FLOOD         17
## 552      RECORD/EXCESSIVE HEAT         17

Events causing the greatest economic consequences

The original dataset provides two main variables we can use to estimate the damage to the economy caused by severe weather events: CROPDMG (damage to crops) and PROPDMG (damage to property).

Both of these variables come with exponential multipliers, respectively CROPDMGEXP and PROPDMGEXP. Unfortunately, not all the values and multipliers are provided (or follow a predictable pattern) in the original dataset, so we first clean the data to isolate the meaningful rows.
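
A quick way to see which multiplier codes occur, and how many rows carry a usable one, is to tabulate them. The sketch below uses a hypothetical vector `exp_codes` with illustrative values. On the real data the same calls would presumably be applied to `stormdata$PROPDMGEXP` or `stormdata$CROPDMGEXP`.

```r
# Toy vector of multiplier codes; the values are illustrative, not real data
exp_codes <- c("K", "K", "M", "", "B", "k", "5", "?", "M")

# Frequency of each code, including the empty string
code_counts <- table(exp_codes, useNA = "ifany")
code_counts

# Codes treated as recognised in this report: K, M and B in either case
recognised <- toupper(exp_codes) %in% c("K", "M", "B")

# Number of entries with a usable multiplier
sum(recognised)
```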

We’ll start with PROPDMG values by filtering out zeros:

propdmg <- subset(stormdata, PROPDMG !=0)

For convenience, figures are reported in million USD. A helper function converts our values into million USD; when the exponential multiplier is not recognised or not provided, the function returns 0:

damage2M <- function(value,exp){
  
    output = 0 
    
    if (exp == "K" || exp == "k" )
        output = value / 1000 
    
    if (exp == "M" || exp == "m" ) 
        output = value 
        
    if (exp == "B" || exp == "b")
        output = value * 1000 
    
    return(output)
}

propdmg$PROPDMG <- damage2M(propdmg$PROPDMG,propdmg$PROPDMGEXP)
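
One caveat when calling damage2M on whole columns: the `||` operator compares single values only, so older versions of R look at just the first element of each vector (and apply one multiplier to every row), while recent versions raise an error. A vectorised alternative might look like the following sketch. The name `damage2M_vec` is hypothetical, it is not the function used to produce the figures in this report, and totals computed with it may differ from those shown here.

```r
# Hypothetical vectorised variant of damage2M (not the function used above)
damage2M_vec <- function(value, exp) {
  exp <- toupper(as.character(exp))
  # Factor that converts each value into million USD, looked up by code
  multiplier <- c(K = 1 / 1000, M = 1, B = 1000)[exp]
  # Unrecognised or missing codes yield 0, as in the original helper
  multiplier[is.na(multiplier)] <- 0
  value * unname(multiplier)
}

# One result per row: 0.025, 2.5, 3000 and 0 million USD
damage2M_vec(c(25, 2.5, 3, 10), c("K", "m", "B", "?"))
```

The lookup by name means each row is converted with its own multiplier, which is the behaviour the prose above describes.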

propdmg <- subset(propdmg, PROPDMG !=0) # Filter out zeros resulting from running the damage2M function

propdmg <- propdmg[c("EVTYPE","PROPDMG")]

total_propdmg_per_event_type <- aggregate(x = propdmg$PROPDMG, by = list(propdmg$EVTYPE), FUN = sum, na.rm=TRUE)
names(total_propdmg_per_event_type) <- c("EVTYPE","PROPDMG")

avg_propdmg <- mean(total_propdmg_per_event_type$PROPDMG,na.rm=TRUE)
avg_propdmg
## [1] 26.80911
propdmg_above_avg <- total_propdmg_per_event_type[which(total_propdmg_per_event_type$PROPDMG > avg_propdmg),]
# Finally, sort by descending order for convenience: 
propdmg_above_avg <- propdmg_above_avg[order(-propdmg_above_avg$PROPDMG),]
propdmg_above_avg
##                 EVTYPE    PROPDMG
## 332            TORNADO 3212.25816
## 48         FLASH FLOOD 1420.12459
## 348          TSTM WIND 1335.96561
## 61               FLOOD  899.93848
## 297  THUNDERSTORM WIND  876.84417
## 103               HAIL  688.69338
## 203          LIGHTNING  603.35178
## 308 THUNDERSTORM WINDS  446.29318
## 156          HIGH WIND  324.73156
## 399       WINTER STORM  132.72059
## 130         HEAVY SNOW  122.25199
## 386           WILDFIRE   84.45934
## 188          ICE STORM   66.00067
## 283        STRONG WIND   62.99381
## 163         HIGH WINDS   55.62500
## 120         HEAVY RAIN   50.84214
## 340     TROPICAL STORM   48.42368
## 389   WILD/FOREST FIRE   39.34495
## 53      FLASH FLOODING   28.49715

We can now repeat the same steps for CROPDMG values:

cropdmg <- subset(stormdata, CROPDMG !=0)

cropdmg$CROPDMG <- damage2M(cropdmg$CROPDMG,cropdmg$CROPDMGEXP)

cropdmg <- subset(cropdmg, CROPDMG !=0) #Filter out zeros if any 

cropdmg <- cropdmg[c("EVTYPE","CROPDMG")]

total_cropdmg_per_event_type <- aggregate(x = cropdmg$CROPDMG, by = list(cropdmg$EVTYPE), FUN = sum, na.rm=TRUE)
names(total_cropdmg_per_event_type) <- c("EVTYPE","CROPDMG")

avg_cropdmg <- mean(total_cropdmg_per_event_type$CROPDMG,na.rm=TRUE)
avg_cropdmg
## [1] 10131.08
cropdmg_above_avg <- total_cropdmg_per_event_type[which(total_cropdmg_per_event_type$CROPDMG > avg_cropdmg),]
# Sort by descending order for better readability:
cropdmg_above_avg <- cropdmg_above_avg[order(-cropdmg_above_avg$CROPDMG),]
cropdmg_above_avg
##                 EVTYPE   CROPDMG
## 42                HAIL 579596.28
## 23         FLASH FLOOD 179200.46
## 27               FLOOD 168037.88
## 115          TSTM WIND 109202.60
## 107            TORNADO 100018.52
## 97   THUNDERSTORM WIND  66791.45
## 10             DROUGHT  33898.62
## 100 THUNDERSTORM WINDS  18684.93
## 60           HIGH WIND  17283.21
## 54          HEAVY RAIN  11122.80

Results

With regard to population health, our analysis shows that tornadoes are the type of event that has caused the highest number of injuries (91346) and fatalities (5633). Other notable events that have caused a considerable number of injuries and fatalities are thunderstorm winds, floods, excessive heat and lightning.

A summary of the report highlighting the 10 most harmful events is shown here:

library(ggplot2) # provides ggplot(), geom_bar() and theme() used below

high_ft <- head(fatalities_above_avg,10)

p1 <- ggplot(high_ft,aes(x=EVTYPE,y=FATALITIES,fill=EVTYPE)) + geom_bar(stat="identity") + theme(legend.position="none") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab("") + ylab("Total fatalities")

high_inj <- head(injuries_above_avg,10)

p2 <- ggplot(high_inj,aes(x=EVTYPE,y=INJURIES,fill=EVTYPE)) + geom_bar(stat="identity") + theme(legend.position="none") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab("") + ylab("Total injuries")

# Credits to the Cookbook for R website for helper function multiplot:
# http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_%28ggplot2%29/
multiplot(p1,p2,cols=2) 
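
ggplot2 draws the bars of a discrete axis in factor-level order, which is alphabetical by default, so the descending sort applied to the tables is not carried over to the plots. One way to keep that order is base R's `reorder()`, sketched here on a hypothetical data frame `toy` whose three values are copied from the fatalities table above.

```r
# Toy subset; the three values are copied from the fatalities table above
toy <- data.frame(
  EVTYPE     = c("FLOOD", "TORNADO", "HEAT"),
  FATALITIES = c(470, 5633, 937)
)

# Order the factor levels by decreasing number of fatalities
toy$EVTYPE <- reorder(toy$EVTYPE, -toy$FATALITIES)
levels(toy$EVTYPE)
```

The same expression can also be used directly inside the plotting call, for example `aes(x = reorder(EVTYPE, -FATALITIES), y = FATALITIES)`.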

With regard to economic damage, our analysis shows that tornadoes have had the highest impact on property damage (estimated 3212m USD), while hail has had the highest impact on crop damage (estimated 579596m USD).

Storm winds, floods and - in the case of property damage - lightning are also events with significant impact.

A summary of the report highlighting the 10 most significant events is shown here:

high_pd <- head(propdmg_above_avg,10)

p3 <- ggplot(high_pd,aes(x=EVTYPE,y=PROPDMG,fill=EVTYPE)) + geom_bar(stat="identity") + theme(legend.position="none") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab("") + ylab("Property damage (M)")

high_cd <- head(cropdmg_above_avg,10)

p4 <- ggplot(high_cd,aes(x=EVTYPE,y=CROPDMG,fill=EVTYPE)) + geom_bar(stat="identity") + theme(legend.position="none") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab("") + ylab("Agricultural damage (M)")

# Credits to the Cookbook for R website for helper function multiplot:
# http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_%28ggplot2%29/
multiplot(p3,p4,cols=2)