Synopsis

In this report, we use data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database to study which types of severe weather events are most harmful to population health and which have the greatest economic consequences. We find that tornadoes cause the highest number of both fatalities and injuries. With respect to economic losses, floods are the most harmful events for property, while drought is the worst type of event for crops.

Data processing

We download the data file from the class repository, which is an extract of the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, and load it into a data frame in R.

if (!file.exists("stormdata.csv.bz2")){
  download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2","stormdata.csv.bz2")
}
data=read.csv("stormdata.csv.bz2")

The data contains 902297 observations across 37 variables.

dim(data)
## [1] 902297     37

We examine the names of the dataset variables.

names(data)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

The variable EVTYPE contains the type of registered event. There are a total of 985 different types in the dataset.

length(unique(data$EVTYPE))
## [1] 985

Health Impact

The variables FATALITIES and INJURIES contain the number of personal fatalities and injuries in each recorded event. We use that information to compute the total number of fatalities and the total number of injuries for each of the 985 event types, extracting that information from the main dataset.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
dataharmful=data %>% group_by(EVTYPE) %>% summarise(FATALITIES=sum(FATALITIES),INJURIES=sum(INJURIES),.groups="drop")
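
As a rough illustration of this aggregation step, the sketch below applies the same idea to a tiny invented data frame, using base R's aggregate() in place of dplyr. The data frame toy and all of its values are hypothetical and are not taken from the storm database.

```r
# Hypothetical toy data: five recorded events of three types.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "FLOOD", "HAIL"),
  FATALITIES = c(3, 0, 1, 2, 0),
  INJURIES   = c(10, 1, 4, 0, 2),
  stringsAsFactors = FALSE
)

# Sum both health variables within each event type
# (the same idea as group_by() followed by summarise()).
toy_harmful <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE,
                         data = toy, FUN = sum)

# Order event types from most to fewest fatalities.
toy_harmful <- toy_harmful[order(-toy_harmful$FATALITIES), ]
print(toy_harmful)
```

Each event type ends up as a single row holding its totals, which is the shape of dataharmful used later in the report.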

Economic Impact

We will also make use of the four variables measuring the economic damage to property (PROPDMG, PROPDMGEXP) and to crops (CROPDMG, CROPDMGEXP). The first variable of each pair contains a numeric value and the second contains a character indicating a multiplicative factor, such as thousand, million or billion. We examine the different values these indicator variables take:

unique(data$PROPDMGEXP)
##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
unique(data$CROPDMGEXP)
## [1] ""  "M" "K" "m" "B" "?" "0" "k" "2"

We notice that many of these characters do not correspond to the standard multiplier codes, and we assume that they were introduced by error. Thus, we keep only the cases where the indicator is K (thousand), M (million), B (billion) or an empty string (indicating no multiplication).

dataeconomic=data %>% filter((PROPDMGEXP=="K"|PROPDMGEXP=="M"|PROPDMGEXP==""|PROPDMGEXP=="B")&(CROPDMGEXP=="K"|CROPDMGEXP=="M"|CROPDMGEXP==""|CROPDMGEXP=="B"))
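
The same filtering rule can be written more compactly with %in%. The sketch below shows this on a small invented data frame; toy, valid and the example codes are hypothetical names and values chosen only for illustration.

```r
# Hypothetical toy data mixing accepted and unexpected indicator codes.
toy <- data.frame(
  PROPDMGEXP = c("K", "m", "", "B", "+"),
  CROPDMGEXP = c("", "K", "M", "?", "K"),
  stringsAsFactors = FALSE
)

# Codes accepted in this analysis: none, thousand, million, billion.
valid <- c("", "K", "M", "B")

# Keep a row only if BOTH indicators are in the accepted set.
keep <- toy$PROPDMGEXP %in% valid & toy$CROPDMGEXP %in% valid
toy_clean <- toy[keep, ]
print(toy_clean)
```

Note that this rule, like the filter above, is case sensitive, so lowercase codes such as "m" or "k" are discarded along with the other unexpected characters.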

The observations removed because of unexpected factor indicators account for only about 0.04 % of the original data, so the assumption that those indicators were introduced by mistake seems reasonable.

100*(dim(data)[1]-dim(dataeconomic)[1])/dim(data)[1]
## [1] 0.04167142

We now convert those letters indicating a multiplicative factor to the actual numerical multiplicative factor, so we can use it later.

dataeconomic$PROPDMGEXP[dataeconomic$PROPDMGEXP==""]=1
dataeconomic$PROPDMGEXP[dataeconomic$PROPDMGEXP=="K"]=1e3
dataeconomic$PROPDMGEXP[dataeconomic$PROPDMGEXP=="M"]=1e6
dataeconomic$PROPDMGEXP[dataeconomic$PROPDMGEXP=="B"]=1e9
dataeconomic$PROPDMGEXP=as.numeric(dataeconomic$PROPDMGEXP)
dataeconomic$CROPDMGEXP[dataeconomic$CROPDMGEXP==""]=1
dataeconomic$CROPDMGEXP[dataeconomic$CROPDMGEXP=="K"]=1e3
dataeconomic$CROPDMGEXP[dataeconomic$CROPDMGEXP=="M"]=1e6
dataeconomic$CROPDMGEXP[dataeconomic$CROPDMGEXP=="B"]=1e9
dataeconomic$CROPDMGEXP=as.numeric(dataeconomic$CROPDMGEXP)
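
An alternative to the repeated assignments is a small lookup built with match(). The following is only a sketch under the same assumptions as above (four accepted codes); codes, multipliers, exp_to_factor and the toy vectors are hypothetical names introduced for the example.

```r
# Accepted indicator codes and their numeric multipliers, in matching order.
codes       <- c("", "K", "M", "B")
multipliers <- c(1, 1e3, 1e6, 1e9)

# Translate a character vector of codes into numeric factors.
# Codes outside the accepted set come back as NA.
exp_to_factor <- function(x) multipliers[match(x, codes)]

# Hypothetical damage figures and their indicators.
toy_dmg <- c(2.5, 10, 1.2, 0.5)
toy_exp <- c("K", "", "M", "B")

# Damage in dollars: numeric value times multiplier.
damage_usd <- toy_dmg * exp_to_factor(toy_exp)
print(damage_usd)
```

A lookup of this kind keeps the code-to-multiplier mapping in one place, so the property and crop indicators can share it.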

As we did for the health variables, we compute for each event type the total property damage, the total crop damage and their sum, all in dollars.

dataeconomic=dataeconomic %>% group_by(EVTYPE) %>% summarise(PROPDAMAGE=sum(PROPDMG*PROPDMGEXP),CROPDAMAGE=sum(CROPDMG*CROPDMGEXP),TOTALDAMAGE=sum(PROPDMG*PROPDMGEXP)+sum(CROPDMG*CROPDMGEXP),.groups="drop")

Results

Most harmful events with respect to population health across the United States

We are interested in knowing which types of events are the most harmful in terms of population health. We rearrange the types of events in descending order for the number of fatalities and the number of injuries.

f3=arrange(dataharmful,desc(FATALITIES))[1:3,]
i3=arrange(dataharmful,desc(INJURIES))[1:3,]

In terms of total number of fatalities, the most harmful events are tornadoes, followed by excessive heat, and then by flash floods. We can see that tornadoes cause slightly under three times as many fatalities as excessive heat.

print(f3)
## # A tibble: 3 x 3
##   EVTYPE         FATALITIES INJURIES
##   <chr>               <dbl>    <dbl>
## 1 TORNADO              5633    91346
## 2 EXCESSIVE HEAT       1903     6525
## 3 FLASH FLOOD           978     1777

In terms of total number of injuries, the most harmful events are again tornadoes, but this time followed by thunderstorm winds, and then by regular floods. We can see that tornadoes injure over 13 times as many people as thunderstorm winds, the second cause of injuries due to meteorological events.

print(i3)
## # A tibble: 3 x 3
##   EVTYPE    FATALITIES INJURIES
##   <chr>          <dbl>    <dbl>
## 1 TORNADO         5633    91346
## 2 TSTM WIND        504     6957
## 3 FLOOD            470     6789
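
The ratios quoted in the two paragraphs above can be checked directly from the printed totals. The variable names below are hypothetical; the numbers are copied from the two tables.

```r
# Totals copied from the fatalities and injuries tables above.
tornado_fatalities <- 5633
heat_fatalities    <- 1903
tornado_injuries   <- 91346
tstm_injuries      <- 6957

# Tornadoes relative to the second-ranked cause in each category.
fatality_ratio <- tornado_fatalities / heat_fatalities
injury_ratio   <- tornado_injuries / tstm_injuries

print(round(c(fatalities = fatality_ratio, injuries = injury_ratio), 2))
```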

In both categories tornadoes are by far the leading cause: they account for about 3 times as many fatalities as the next cause, excessive heat, and for about 13 times as many injuries as the next cause, thunderstorm winds.

In the following plot we represent a comparison of the three most harmful cases in terms of the number of people affected. Note that the scale is logarithmic. We can see that the number of injuries is in general significantly higher than the number of fatalities.

xvar=c(1,1,1,2,2,2)
yvar=log10(c(f3$FATALITIES,i3$INJURIES))
co=c("black","darkgoldenrod","lightblue","black","red","blue")
ch=c(17,19,18,17,4,0)
plot(xvar,yvar,xlim=c(0,3),xaxt="n",yaxt="n",xlab="",ylab="Number of occurrences [thousands of people]",col=co,pch=ch,cex=2)
axis(1,c(1,2),labels=c("Fatalities","Injuries"))
# tick positions are in people (log10 scale); labels are in thousands of people
axis(2,log10(c(1,5,10,50,100)*1e3),labels=c(1,5,10,50,100),las=2)
leg=c("tornado","excessive heat","flash flood","thunderstorm wind","flood")
legend("topleft",legend=leg,pch=ch[-4],cex=1.2,col=co[-4],bty="n")

Fatalities and injuries, in thousands of people on a logarithmic scale, caused by the three most harmful event types in each of the two categories.

Events with greatest economic consequences across the United States

Now we will turn to which types of events have the greatest economic consequences. We rearrange the types of events in descending order for the amount of damage on property and crops.

p3=arrange(dataeconomic,desc(PROPDAMAGE))[1:3,]
c3=arrange(dataeconomic,desc(CROPDAMAGE))[1:3,]

With respect to property losses, the worst events are regular floods, which produce about twice as much property damage as the next worst events, hurricanes or typhoons. Tornadoes, following closely behind hurricanes and typhoons, are the third worst events in terms of property damage.

print(p3)
## # A tibble: 3 x 4
##   EVTYPE              PROPDAMAGE CROPDAMAGE  TOTALDAMAGE
##   <chr>                    <dbl>      <dbl>        <dbl>
## 1 FLOOD             144657709807 5661968450 150319678257
## 2 HURRICANE/TYPHOON  69305840000 2607872800  71913712800
## 3 TORNADO            56925485483  364950110  57290435593

When considering crop losses, the worst event is drought, which causes between two and three times as much crop damage as the next type of event, regular floods. Regular floods in turn cause slightly more crop damage than the third cause, river floods.

print(c3)
## # A tibble: 3 x 4
##   EVTYPE        PROPDAMAGE  CROPDAMAGE  TOTALDAMAGE
##   <chr>              <dbl>       <dbl>        <dbl>
## 1 DROUGHT       1046106000 13972566000  15018672000
## 2 FLOOD       144657709807  5661968450 150319678257
## 3 RIVER FLOOD   5118945500  5029459000  10148404500
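
As a quick check of the comparisons made for property and crop losses, the ratios can be recomputed from the printed totals. The variable names are hypothetical; the figures are copied from the two tables above.

```r
# Property damage totals (USD) from the property table.
flood_prop     <- 144657709807
hurricane_prop <- 69305840000
tornado_prop   <- 56925485483

# Crop damage totals (USD) from the crop table.
drought_crop     <- 13972566000
flood_crop       <- 5661968450
river_flood_crop <- 5029459000

# Each event type relative to the next one in the ranking.
ratios <- c(
  flood_vs_hurricane   = flood_prop / hurricane_prop,
  hurricane_vs_tornado = hurricane_prop / tornado_prop,
  drought_vs_flood     = drought_crop / flood_crop,
  flood_vs_river_flood = flood_crop / river_flood_crop
)
print(round(ratios, 2))
```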

In the following plot we compare the damage, in billions of dollars, caused by the three worst events in terms of property losses and in terms of crop losses. The vertical scale is logarithmic. We can see that the losses are significantly higher in terms of property than in terms of crops.

yvar=log10(c(p3$PROPDAMAGE,c3$CROPDAMAGE))
co=c("blue","forestgreen","black","darkgoldenrod","blue","cyan")
ch=c(15,19,17,4,15,18)
plot(xvar,yvar,xlim=c(0,3),xaxt="n",yaxt="n",xlab="",ylab="Damage [billions of USD]",col=co,pch=ch,cex=2)
axis(1,c(1,2),labels=c("Property damage","Crop damage"))
# tick positions are in USD (log10 scale); labels are in billions of USD
axis(2,log10(c(5:9,seq(10,100,10),125,150)*1e9),labels=c(5:9,seq(10,100,10),125,150),las=2)
leg=c("flood","hurricane/typhoon","tornado","drought","river flood")
legend("topright",legend=leg,pch=ch[-5],cex=1.2,col=co[-5],bty="n")

Losses in billions of US dollars, on a logarithmic scale, for property and crop damage caused by the three most harmful event types in each of the two categories.