Summary

  1. Title

  2. Synopsis

  3. Data Processing

  4. Results

1. Title

The most harmful weather events in the US

2. Synopsis

Here I present an analysis of the damage caused by weather events in the U.S. The data came from the National Oceanic and Atmospheric Administration’s (NOAA) storm database. The events in the database start in the year 1950 and end in November 2011. Results are organized into two categories: population health and economic consequences. In summary, tornadoes are the most harmful events to people and property, whereas hail is the most harmful to crops.

3. Data Processing

As stated in the course’s assignment: “This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.”

The metadata provided (NATIONAL WEATHER SERVICE INSTRUCTION 10-1605) are far from user-friendly, so it takes considerable effort to understand what the variables mean and how they were computed. Here I describe every step, from importing the data to plotting the graphs, so that anyone fluent in R can reproduce this analysis.

  1. Set the working directory and clear all previous objects from memory:
# This call works only inside RStudio, because it relies on the rstudioapi package
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))

rm(list = ls())
  2. Import the raw data; read.csv() decompresses the .bz2 file automatically:
# The file is large, so consider caching this chunk (cache = TRUE) when knitting.
dados <- read.csv("data/repdata-data-StormData.csv.bz2", 
                  header = TRUE, 
                  na.strings = "NA")
  3. Inspect the data:
str(dados)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
head(dados)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE  EVTYPE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL TORNADO
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL TORNADO
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL TORNADO
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL TORNADO
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL TORNADO
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL TORNADO
##   BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1         0                                               0         NA
## 2         0                                               0         NA
## 3         0                                               0         NA
## 4         0                                               0         NA
## 5         0                                               0         NA
## 6         0                                               0         NA
##   END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1         0                      14.0   100 3   0          0       15    25.0
## 2         0                       2.0   150 2   0          0        0     2.5
## 3         0                       0.1   123 2   0          0        2    25.0
## 4         0                       0.0   100 2   0          0        2     2.5
## 5         0                       0.0   150 2   0          0        2     2.5
## 6         0                       1.5   177 2   0          0        6     2.5
##   PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1          K       0                                         3040      8812
## 2          K       0                                         3042      8755
## 3          K       0                                         3340      8742
## 4          K       0                                         3458      8626
## 5          K       0                                         3412      8642
## 6          K       0                                         3450      8748
##   LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1       3051       8806              1
## 2          0          0              2
## 3          0          0              3
## 4          0          0              4
## 5          0          0              5
## 6          0          0              6
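The output above shows BGN_DATE stored as text, so the period covered by the database cannot be read directly from it. A minimal sketch of how the dates could be parsed and their range checked, using two of the values printed by head(dados); the object names `datas_texto` and `datas` are illustrative and not part of the original analysis:

```r
# Two BGN_DATE values in the format shown by head(dados)
datas_texto <- c("4/18/1950 0:00:00", "11/15/1951 0:00:00")

# as.Date() reads the month/day/year part and ignores the trailing time
datas <- as.Date(datas_texto, format = "%m/%d/%Y")

# Earliest and latest date in the vector
range(datas)
```

On the full data set, the same idea would be range(as.Date(dados$BGN_DATE, format = "%m/%d/%Y")).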
  4. Copy the data into a second object for the analysis, so that you can go back to the original version if needed:
dados2 <- dados
  5. Process the data related to population health (FATALITIES and INJURIES):
    i. Begin with the fatalities. Calculate how many fatalities there were per category of weather event:
fatalidades <- aggregate(FATALITIES ~ EVTYPE, dados2, sum)
    ii. Sort the data in decreasing order of fatalities:
fatalidades2 <- fatalidades[order(fatalidades$FATALITIES, decreasing = TRUE),]
    iii. Many events resulted in zero fatalities, so discard them:
fatalidades3 <- subset(fatalidades2, FATALITIES > 0)
    iv. Keep only the top 10 events in terms of fatalities:
fatalidades4 <- fatalidades3[1:10,]
    v. Now focus on the injuries. Calculate how many injuries there were per category of weather event:
machucados <- aggregate(INJURIES ~ EVTYPE, dados2, sum)
    vi. Sort the data in decreasing order of injuries:
machucados2 <- machucados[order(machucados$INJURIES, decreasing = TRUE),]
    vii. Many events resulted in zero injuries, so discard them:
machucados3 <- subset(machucados2, INJURIES > 0)
    viii. Keep only the top 10 events in terms of injuries:
machucados4 <- machucados3[1:10,]
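The fatalities and injuries go through the same four operations: sum per event type, sort, drop zeros, keep the top rows. A minimal sketch of how that pattern could be wrapped in one function and tried on made-up records; `top_eventos` and `exemplo` are hypothetical names and not part of the original analysis:

```r
# Hypothetical helper: total a column per event type, sort in decreasing
# order, drop zero totals, and keep the first n rows
top_eventos <- function(dados, coluna, n = 10) {
  total <- aggregate(dados[[coluna]], by = list(EVTYPE = dados$EVTYPE), FUN = sum)
  names(total)[2] <- coluna
  total <- total[order(total[[coluna]], decreasing = TRUE), ]
  total <- total[total[[coluna]] > 0, ]
  head(total, n)
}

# Made-up records, only to show the behaviour
exemplo <- data.frame(
  EVTYPE     = c("TORNADO", "HAIL", "TORNADO", "FLOOD", "HAIL"),
  FATALITIES = c(3, 0, 2, 1, 0)
)
resultado <- top_eventos(exemplo, "FATALITIES", n = 2)
resultado
```

The sketch uses head() rather than indexing with 1:10 because head() returns only the rows that exist when fewer than n event types remain.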
  6. Process the data related to economic consequences (PROPDMG and CROPDMG):
    i. Create another copy of the data, to keep the original version safe:
dados3 <- dados2
    ii. The original data on property damage and crop damage are stored in a very messy format. PROPDMG and CROPDMG hold a number, while PROPDMGEXP and CROPDMGEXP hold the magnitude of that number: thousands (K or k), millions (m and M), or billions (B) of US dollars. Therefore, convert every value to the same scale, in this case millions of US dollars (the plots in section 4 divide by 1,000 to show billions). Without this step, events would be ranked by the raw numbers regardless of their magnitude.
# Multipliers that convert each magnitude code to millions of US dollars
escala <- c(K = 1e-3, M = 1, B = 1e3)

# Look up the multiplier of every record; toupper() makes "k" and "m" match
fator_prop <- unname(escala[toupper(dados2$PROPDMGEXP)])
fator_crop <- unname(escala[toupper(dados2$CROPDMGEXP)])

# Assumption: blank or unrecognized codes are treated as plain dollars
fator_prop[is.na(fator_prop)] <- 1e-6
fator_crop[is.na(fator_crop)] <- 1e-6

dados3$PROPDMG <- dados2$PROPDMG * fator_prop
dados3$CROPDMG <- dados2$CROPDMG * fator_crop
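The unit conversion described in this step can be checked by hand on a few records. A minimal sketch, assuming a lookup table of multipliers that expresses every value in millions of US dollars and assuming that blank or unrecognized codes mean plain dollars; `para_milhoes` and `exemplo` are hypothetical names, and the records are made up:

```r
# Hypothetical helper: express a damage figure in millions of US dollars
para_milhoes <- function(valor, expoente) {
  escala <- c(K = 1e-3, M = 1, B = 1e3)
  fator <- unname(escala[toupper(expoente)])
  # Assumption: blank or unrecognized codes mean plain dollars
  fator[is.na(fator)] <- 1e-6
  valor * fator
}

# Made-up records, only to show the arithmetic
exemplo <- data.frame(PROPDMG    = c(25, 2.5, 1.2, 500),
                      PROPDMGEXP = c("K", "m", "B", ""))
exemplo$milhoes <- para_milhoes(exemplo$PROPDMG, exemplo$PROPDMGEXP)
exemplo
```

Here 25 with code K is 25,000 dollars, or 0.025 million, whereas 1.2 with code B is 1,200 million. The raw numbers alone would rank these two records in the opposite order.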
    iii. Just in case, create another copy of the data you’ve just transformed in step 3.6.ii:
dados4 <- dados3
    iv. Calculate the total property damage per category of weather event:
propriedades <- aggregate(PROPDMG ~ EVTYPE, dados4, sum)
    v. Sort the data in decreasing order of property damage:
propriedades2 <- propriedades[order(propriedades$PROPDMG, decreasing = TRUE),]
    vi. Many events resulted in zero property damage, so discard them:
propriedades3 <- subset(propriedades2, PROPDMG > 0)
    vii. Keep only the top 10 events in terms of property damage:
propriedades4 <- propriedades3[1:10,]
    viii. Calculate the total crop damage per category of weather event:
lavouras <- aggregate(CROPDMG ~ EVTYPE, dados4, sum)
    ix. Sort the data in decreasing order of crop damage:
lavouras2 <- lavouras[order(lavouras$CROPDMG, decreasing = TRUE),]
    x. Many events resulted in zero crop damage, so discard them:
lavouras3 <- subset(lavouras2, CROPDMG > 0)
    xi. Keep only the top 10 events in terms of crop damage:
lavouras4 <- lavouras3[1:10,]

4. Results

Now it is time to plot the results. Each plot shows only the ten most damaging event types for the variable in question.

  1. Let’s begin with population health. In this first panel we can see the number of fatalities and injuries caused by the most harmful weather events:
par(mfrow = c(2, 1), mar=c(5,5,5,1))
barplot(fatalidades4$FATALITIES,
        names.arg=fatalidades4$EVTYPE,
        main = "Most harmful events: fatalities", 
        xlab = "Weather event", 
        ylab = "Number of fatalities",
        col = "darkgrey",
        border = "white",
        cex.axis = 1, 
        cex.lab = 2, 
        cex.main = 2,
        cex.names = 0.5,
        yaxt="n")
axis(side=2, cex.axis = 1,
     at=axTicks(2), 
     labels=formatC(axTicks(2), format="d", big.mark=','))

barplot(machucados4$INJURIES,
        names.arg=machucados4$EVTYPE,
        main = "Most harmful events: injuries", 
        xlab = "Weather event", 
        ylab = "Number of injuries",
        col = "darkgrey",
        border = "white",
        cex.axis = 1, 
        cex.lab = 2, 
        cex.main = 2,
        cex.names = 0.5,
        yaxt="n")
axis(side=2, cex.axis = 1,
     at=axTicks(2), 
     labels=formatC(axTicks(2), format="d", big.mark=','))

par(mfrow=c(1,1))
  2. Finally, let’s focus on the economic consequences. In this second panel we can see the damage caused to property and crops by the most harmful weather events:
par(mfrow = c(2, 1), mar=c(5,5,5,1))
barplot(propriedades4$PROPDMG/1000,
        names.arg=propriedades4$EVTYPE,
        main = "Most harmful events: property damage", 
        xlab = "Weather event", 
        ylab = "Property damage (US$ billion)",
        col = "darkgrey",
        border = "white",
        cex.axis = 1, 
        cex.lab = 2, 
        cex.main = 2,
        cex.names = 0.5,
        yaxt="n")
axis(side=2, cex.axis = 1,
     at=axTicks(2), 
     labels=formatC(axTicks(2), format="d", big.mark=','))

barplot(lavouras4$CROPDMG/1000,
        names.arg=lavouras4$EVTYPE,
        main = "Most harmful events: crop damage", 
        xlab = "Weather event", 
        ylab = "Crop damage (US$ billion)",
        col = "darkgrey",
        border = "white",
        cex.axis = 1, 
        cex.lab = 2, 
        cex.main = 2,
        cex.names = 0.5,
        yaxt="n")
axis(side=2, cex.axis = 1,
     at=axTicks(2), 
     labels=formatC(axTicks(2), format="d", big.mark=','))

par(mfrow=c(1,1))
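The four bar charts above differ only in their data, title, and y-axis label. A minimal sketch of how the shared settings could be collected in one function; `plot_top` and `exemplo` are hypothetical names, the values are made up, and the tick labels use format = "f" with zero digits, which is an alternative to the format = "d" used above:

```r
# Hypothetical helper: one bar chart with thousands separators on the y axis
plot_top <- function(valores, nomes, titulo, rotulo_y) {
  meio <- barplot(valores,
                  names.arg = nomes,
                  main = titulo,
                  xlab = "Weather event",
                  ylab = rotulo_y,
                  col = "darkgrey",
                  border = "white",
                  cex.names = 0.5,
                  yaxt = "n")
  # Draw the y axis by hand so that the labels get a thousands separator
  marcas <- axTicks(2)
  axis(side = 2, at = marcas,
       labels = formatC(marcas, format = "f", digits = 0, big.mark = ","))
  invisible(meio)  # bar midpoints, as returned by barplot()
}

# Made-up values, only to show the call
exemplo <- data.frame(EVTYPE = c("A", "B", "C"),
                      FATALITIES = c(1500, 900, 300))
meio <- plot_top(exemplo$FATALITIES, exemplo$EVTYPE,
                 "Most harmful events: fatalities", "Number of fatalities")
```

With the objects created in section 3, a call such as plot_top(fatalidades4$FATALITIES, fatalidades4$EVTYPE, "Most harmful events: fatalities", "Number of fatalities") would stand in for the first barplot() and axis() pair.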