Data Processing

Loading the data

We load the NOAA storm database (instructions can be found here).

stormDataRaw <- read.csv("repdata_data_StormData.csv.bz2", header = TRUE)

We see how the dataset is, and the colum names.

dim(stormDataRaw)

## [1] 902297     37

head(stormDataRaw)

##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6

colnames(stormDataRaw)

##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

Processing the data

Our goal is to answer two questions:

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

The stormDataRaw database has a lot of variables, so we are going to create a separate data frame with the variables that are important to answer these questions: type of event, number of fatalities, number of injuries, property damage (amount of dolars) and crop damage (amount of dollars).

library(dplyr)
stormData <- stormDataRaw %>%
    select(REFNUM, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
head(stormData)

##   REFNUM  EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1      1 TORNADO          0       15    25.0          K       0           
## 2      2 TORNADO          0        0     2.5          K       0           
## 3      3 TORNADO          0        2    25.0          K       0           
## 4      4 TORNADO          0        2     2.5          K       0           
## 5      5 TORNADO          0        2     2.5          K       0           
## 6      6 TORNADO          0        6     2.5          K       0

Types of event most harmful to population health

According to the instructions, there are only 48 possible types of events. However, the EVTYPE variable has 985 levels (some in capital letters and some in lower case).

str(levels(stormData$EVTYPE), vec.len = 4)

##  chr [1:985] "?" "ABNORMALLY DRY" "ABNORMALLY WET" "ABNORMAL WARMTH" ...

tail(levels(stormData$EVTYPE))

## [1] "WINTER WEATHER/MIX" "WINTERY MIX"        "Wintry mix"        
## [4] "Wintry Mix"         "WINTRY MIX"         "WND"

The first thing we can do to clean a bit the EVTYPE is transform all the entries to capital letters.

# all EVTYPE to uppercase and refactor
stormData$EVTYPE <- factor(toupper(stormData$EVTYPE))
str(levels(stormData$EVTYPE), vec.len = 4)

##  chr [1:898] "?" "ABNORMALLY DRY" "ABNORMALLY WET" "ABNORMAL WARMTH" ...

To see which events are most harmful to population health we are interested in the number of fatalities (FATALITIES variable) and injuries (INJURIES variable). To reduce the number of types of events, we only keep those with non-zero values in both fatalities and injuries, then refactor and save into a new data frame.

# select just the EVTYPES with FATALITIES>0 or INJURIES>0
stormHealth <- stormData %>%
    select(REFNUM, EVTYPE, FATALITIES, INJURIES) %>%
    filter(FATALITIES > 0 | INJURIES > 0)

stormHealth$EVTYPE <- factor(stormHealth$EVTYPE)
str(levels(stormHealth$EVTYPE))

##  chr [1:205] "AVALANCE" "AVALANCHE" "BLACK ICE" "BLIZZARD" ...

There are still some problems with the EVTYPE variable, but now there are just 205 levels. We know that the EVTYPE variable should be cleaned more (it has too many levels, some repeated), but maybe just cleaning a bit is enough to answer our questions.

So, we can try to rename the values of EVTYPE with regular expressions and see if the results are good enough.

# we understand that TSTM means THUNDERSTORM
stormData$EVTYPE[grep("^TSTM", stormData$EVTYPE)] <- "THUNDERSTORM WIND"

# levels that starts with THUNDER we treated as THUNDERSTORM WIND
stormData$EVTYPE[grep("THUNDER", stormData$EVTYPE)] <- "THUNDERSTORM WIND"

# levels that starts with HURRI are HURRICANE
stormData$EVTYPE[grep("HURRI", stormData$EVTYPE)] <- "HURRICANE"

# levels that starts with RIP are RIP CURRENT
stormData$EVTYPE[grep("^RIP", stormData$EVTYPE)] <- "RIP CURRENT"

# levels that starts with FLOOD we treated as FLOOD (not FLASH FOOD)
stormData$EVTYPE[grep("^FLOOD", stormData$EVTYPE)] <- "FLOOD"

# refactor EVTYPE
stormData$EVTYPE <- factor(stormData$EVTYPE)

With the EVTYPE a bit cleaned, we are going to see which type of events causes more fatalities and injuries (sum).

# grouping by EVTYPE and suming FATALITIES and INJURIES
stormHealth <- stormData %>%
    group_by(EVTYPE) %>%
    summarise_each(funs(sum), FATALITIES, INJURIES) %>%
    arrange(desc(FATALITIES + INJURIES))

stormHealth <- data.frame(stormHealth)
head(stormHealth, 20)

##               EVTYPE FATALITIES INJURIES
## 1            TORNADO       5633    91346
## 2  THUNDERSTORM WIND        722     9536
## 3     EXCESSIVE HEAT       1903     6525
## 4              FLOOD        495     6806
## 5          LIGHTNING        816     5230
## 6               HEAT        937     2100
## 7        FLASH FLOOD        978     1777
## 8          ICE STORM         89     1975
## 9       WINTER STORM        206     1321
## 10         HURRICANE        135     1328
## 11         HIGH WIND        248     1137
## 12              HAIL         15     1361
## 13        HEAVY SNOW        127     1021
## 14       RIP CURRENT        577      529
## 15          WILDFIRE         75      911
## 16          BLIZZARD        101      805
## 17               FOG         62      734
## 18  WILD/FOREST FIRE         12      545
## 19         HEAT WAVE        172      379
## 20        DUST STORM         22      440

We see that the most harmful weather-event is the tornado, followed by: thunderstorm wind, excessive heat, flood and lightning.

If we sum the fatalities and injuries of the rest of the table it can be changed by cleaning more the data, but it is difficult that it changes the top of the table (maybe in the order, but not all of them).

# fatalities and injuries of the tail of the table (rows 21:760)
stormHealth[21:dim(stormHealth)[1], ] %>%
    summarise_each(funs(sum), FATALITIES, INJURIES) %>%
    arrange(desc(FATALITIES, INJURIES))

##   FATALITIES INJURIES
## 1       1820     4722

Types of events with the greatest economic consequences

According to the instructions, the PROPDMGEXP and CROPDMGEXP is a character that indicates the magnitude of the damage: “K” for thousands, “M” for millions and “B” for billions. However, we see that there are more levels (some not in capital letters). So we transform all the levels to capital letters.

levels(stormData$CROPDMGEXP)

## [1] ""  "?" "0" "2" "B" "k" "K" "m" "M"

levels(stormData$PROPDMGEXP)

##  [1] ""  "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
## [18] "m" "M"

# all PROPDMGEXP to capital letters
stormData[stormData$PROPDMGEXP=="h",]$PROPDMGEXP <-"H"
stormData[stormData$PROPDMGEXP=="m",]$PROPDMGEXP <-"M"
# all CROPDMGEXP to capital letters
stormData[stormData$CROPDMGEXP=="m",]$CROPDMGEXP <-"M"
stormData[stormData$CROPDMGEXP=="k",]$CROPDMGEXP <-"K"

We do not know what these other character means (“-”, “?”, “0”, “1”, etc.) so we are going to omit them. Also, to reduce the data frame we only keep those files with some damage (in property or crop).

stormDamage <- stormData %>%
    select(REFNUM, EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) %>%
    filter(PROPDMG + CROPDMG > 0, PROPDMGEXP %in% c("B", "H", "K", "M"),
           CROPDMGEXP %in% c("B", "H", "K", "M"))
head(stormDamage)

##   REFNUM            EVTYPE PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 187566         HURRICANE     0.1          B      10          M
## 2 188271 THUNDERSTORM WIND     5.0          M     500          K
## 3 187568         HURRICANE    25.0          M       1          M
## 4 187570         HURRICANE    48.0          M       4          M
## 5 187571         HURRICANE    20.0          M      10          M
## 6 187640 THUNDERSTORM WIND    50.0          K      50          K

Now we want to sum all the amount of money of damage: crop and property by event type. So we first make a function to be able to multiply variables ...DMG with ...DMGEXP.

# function to convert character (billions, millions...) to numbers
mult = function(x){
    ifelse(x=="B", 10^9, ifelse(x=="M", 10^6, ifelse(x=="K", 10^3, 100)))
}
stormTotalDamage <- stormDamage %>%
    select(EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) %>%
    mutate(totalDamage = PROPDMG*mult(PROPDMGEXP) + CROPDMG*mult(CROPDMGEXP)) %>%
    group_by(EVTYPE) %>%
    summarise(sumTotalDamage = sum(totalDamage)) %>%
    arrange(desc(sumTotalDamage))
head(stormTotalDamage)

## Source: local data frame [6 x 2]
## 
##        EVTYPE sumTotalDamage
## 1       FLOOD   138235749000
## 2   HURRICANE    44313445800
## 3     TORNADO    16520148150
## 4 RIVER FLOOD    10108369000
## 5        HAIL    10020591590
## 6 FLASH FLOOD     8715295130

If we plot the data to see the comparison.

library(ggplot2)
qplot(EVTYPE, sumTotalDamage/10^9,
      data=data.frame(stormTotalDamage)[1:10,], stat = "identity",
      geom = "bar",
      main = "Total damage (crop + property) by event type",
      ylab = "Total amount of dollars \n of damage (in billions)",
      xlab = "Type of event"
      ) + theme(axis.text.x = element_text(angle = 90))

Fig.cap: Plot that show the sum of crop and property damage (in billions) by type of weather event.

So we can say that omitting the strange character of the ...DMGEXP variables, Flood is the weather-event with greates economic consequences, followed by Hurricane and Tornado. To be sure in this conclussion we should study what that strange character means and clean better the data.

NOAA StormData: Study of most harmful types of weather events to population health and economy

Álvaro Hernández

August 22th, 2015

Synopsis

Data Processing

Loading the data

Processing the data

Types of event most harmful to population health

Types of events with the greatest economic consequences

Results