Analysis of the effects of severe events in EE.UU between 1950 and 2011

Synopsis

In this report we aim to identify (1) which type of events are the most harmful with respect to population health and (2) which types of events have the greatest economic consequences, both across the United States. To investigate this, we obtained the data for the years between 1950 and 2001 from the NOAA storm Database. From these data, we found that, across the United States, Tornado are the most harmful events for population health, followed by Excessive heat events. On the other hand, Flood events are have the greates economic consequences.

Data Processing

From the NOAA storm database we obtained data on severe events across the U.S. during the time period 1950-2011. The data includes not only the type of event registered and its characteristics but also the social and economic impact

Reading the data

First of all, we set the working directory and after that we download and unzip the file. After that, we read the unzip data with fread command

setwd("C:/Users/Willem/Documents/DATA_SCIENCE/000_Coursera/Course_5/Assignment2")
if (!file.exists("Data/Stormdata.zip"))
      download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "Data/Stormdata.zip")
filename <- "./Data/Stormdata"

md0 <- fread(filename)

After reading the data, we check the structure and the first rows of the data. We observe that the data set contain 902297 observation of 37 variables

str(md0)

## Classes 'data.table' and 'data.frame':   902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : chr  "3" "2" "2" "2" ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
##  - attr(*, ".internal.selfref")=<externalptr>

head(md0)

##    STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1:       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2:       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3:       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4:       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5:       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6:       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##     EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1: TORNADO         0                                               0
## 2: TORNADO         0                                               0
## 3: TORNADO         0                                               0
## 4: TORNADO         0                                               0
## 5: TORNADO         0                                               0
## 6: TORNADO         0                                               0
##    COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1:         NA         0                      14.0   100 3   0          0
## 2:         NA         0                       2.0   150 2   0          0
## 3:         NA         0                       0.1   123 2   0          0
## 4:         NA         0                       0.0   100 2   0          0
## 5:         NA         0                       0.0   150 2   0          0
## 6:         NA         0                       1.5   177 2   0          0
##    INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1:       15    25.0          K       0                                    
## 2:        0     2.5          K       0                                    
## 3:        2    25.0          K       0                                    
## 4:        2     2.5          K       0                                    
## 5:        2     2.5          K       0                                    
## 6:        6     2.5          K       0                                    
##    LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1:     3040      8812       3051       8806              1
## 2:     3042      8755          0          0              2
## 3:     3340      8742          0          0              3
## 4:     3458      8626          0          0              4
## 5:     3412      8642          0          0              5
## 6:     3450      8748          0          0              6

Types of events most harmful with respect to population health

In order to ask the know the which type of events are the most harmful with respect to population health, we need to sum the number of fatalities (and injuries) per type of event. So we construct two new variables aggregating these numbers per event. We also rename the columns in order to have perfeclty labeled the columns

injur <- aggregate(md0$INJURIES, by=list(Category=md0$EVTYPE), FUN=sum)
colnames(injur) <- c("Event", "Injuries")
fatal <- aggregate(md0$FATALITIES, by=list(Category=md0$EVTYPE), FUN=sum)
colnames(fatal) <- c("Event", "Fatalities")

After that, we merge these two variables into a single data set and sort the resulting data set in decreasing order. In this way, the first type of events of the sorted data set will be the most harmful. Please note that here we have considered that the most harmful event is the one that cause more fatalities. Other approaches could be also possible (such as sum of fatalities and injuries, etc)

md1 <- merge(injur, fatal, by="Event")
md1 <- md1[order(-md1$Fatalities),]

Types of events that have the greatest economic consequences

The economic consequences are given by the variable “PROPDMG” that expresses the magnitude of the number (quantity) and the variable “PROPDMGEXP” that give us the units of these quantities (“B” for billions dollars, “M” for millions dollars, “K” for thousands dollars and “” for dollars). Let’s check how many different values have the variable “PROPDMGEXP”

unique(md0$PROPDMGEXP)

##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8"

As we can see, there are other values for this variable, apart from “”,“K”, “M” and “B”. We could consider these observations as invalid. Let’s see how many “invalid” observations are and the percentage that it represents

x <- sum(md0$PROPDMGEXP == "") + sum(md0$PROPDMGEXP == "K") + sum(md0$PROPDMGEXP == "B")+
      sum(md0$PROPDMGEXP == "M")
(1-x/nrow(md0))*100

## [1] 0.03635167

Less than 0.04% of the values are invalid. This percentage is small enough so it should not be a problem. Therefore, we are going to subset a dataset with only the variables we are interested in (md2), and after that construct a new variable called “mult” to allow us to convert the character variable “PROPDMGEXP” to a numeric value. Finally, we are going to multiply the quantity “PROPDMG” variable by this new created mult variable. The resulting variable “Econom” quantifies the total cost of the observation Please, note that the observations where the “PROPDMGEXP” variable is different from “”,“K”, “M” or “B”, the variable “mult” is equal to 0, so the total cost (“Econom”) is equal to zero.

md2 <- md0[, c("EVTYPE", "PROPDMG", "PROPDMGEXP")]
md2 = as.data.table(md2)
md2$mult <- 0
md2[PROPDMGEXP=="K", mult:=1000]
md2[PROPDMGEXP=="M", mult:=1000000]
md2[PROPDMGEXP=="B", mult:=1000000000]
md2[PROPDMGEXP=="", mult:=1]
md2$Econom <- md2$PROPDMG*md2$mult

Now, we could aggregate the economic consequences per event, in the same way we have done before for the harmful events. Once aggregated, we sort the subset

cost <- aggregate(md2$Econom, by=list(Category=md0$EVTYPE), FUN=sum)
colnames(cost) <- c("Event", "Cost")
cost <- cost[order(-cost$Cost),]

Results

Once we have done all the previous transformation and have created the required variable, we could answer the questions

Most harmful events with respect to population health

The top 10 harmulf events (remember that we have considered that the most harmful event is the one that cause more fatalities) are the followings:

head(md1, 10)

##              Event Injuries Fatalities
## 834        TORNADO    91346       5633
## 130 EXCESSIVE HEAT     6525       1903
## 153    FLASH FLOOD     1777        978
## 275           HEAT     2100        937
## 464      LIGHTNING     5230        816
## 856      TSTM WIND     6957        504
## 170          FLOOD     6789        470
## 585    RIP CURRENT      232        368
## 359      HIGH WIND     1137        248
## 19       AVALANCHE      170        224

As we can see in the above table, Tornado events are the most harmful events with 5633 deaths and 91346 people injured

Let’s see visually the results, plotting the top ten events with a barplot. We are going to plot both the number of fatalities and the number of injuries by event

md10 <- md1[1:10,]
windows()
par(mar=c(6,6,3,1), mgp=c(4,0.5,0), mfrow=c(2,1))
with(md10,barplot(md10$Fatalities, names.arg = md10$Event, las=2, cex.names=0.7, ylab = "People dead", main = "Nº of Fatalities per Event"))
with(md10,barplot(md10$Injuries, names.arg = md10$Event, las=2, cex.names=0.7, ylab = "People injured", main = "Nº of Injuries per Event"))

Events with the greatest economic consequences

In the same way, in the following table we list the top ten events that caouses the worts economic consequences. The Cost is expressed in dollars

head(cost, 10)

##                 Event         Cost
## 170             FLOOD 144657709807
## 411 HURRICANE/TYPHOON  69305840000
## 834           TORNADO  56925660483
## 670       STORM SURGE  43323536000
## 153       FLASH FLOOD  16140811717
## 244              HAIL  15727366777
## 402         HURRICANE  11868319010
## 848    TROPICAL STORM   7703890550
## 972      WINTER STORM   6688497250
## 359         HIGH WIND   5270046260

We observe that Flood are the events that cause the worst economic consequences resulting in losses of more than 144 billion dollars, followed by hurricane/typhoon events with losses of 69 billion dollars. We could plot the results for the top ten events

cost10 <- cost[1:10,]
windows()
par(mar=c(8,6,3,1))
with(cost10,barplot(cost10$Cost/1000000000, names.arg = cost10$Event, las=2, cex.names=0.7, ylab = "Economic cost (in Billion $)", main = "Top 10 most costly events"))