In this report, we use data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database to study which kinds of weather events are most harmful to population health and which have the greatest economic consequences. We find that tornadoes cause the highest number of both fatalities and injuries. With respect to economic loss, floods are the most harmful events for property, while drought is the worst type of event in terms of crop losses.
We download the data, an extract of the NOAA storm database hosted in the class repository, and load it into a data frame in R.
if (!file.exists("stormdata.csv.bz2")){
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2","stormdata.csv.bz2")
}
data=read.csv("stormdata.csv.bz2")
The data contains 902297 observations across 37 variables.
dim(data)
## [1] 902297 37
We examine the names of the dataset variables.
names(data)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
The variable EVTYPE contains the type of registered event. There are a total of 985 different types in the dataset.
length(unique(data$EVTYPE))
## [1] 985
The variables FATALITIES and INJURIES contain the number of personal fatalities and injuries in each recorded event. We use that information to compute the total number of fatalities and the total number of injuries for each of the 985 event types, extracting that information from the main dataset.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
dataharmful=data %>% group_by(EVTYPE) %>% summarise(FATALITIES=sum(FATALITIES),INJURIES=sum(INJURIES),.groups="drop")
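As a minimal sketch of what this grouping step computes, the following example sums fatalities and injuries per event type on a small made-up data frame. The `toy` values are invented for illustration and are not taken from the NOAA data. The sketch uses base R's `aggregate()` instead of dplyr only to stay free of package dependencies.

```r
# Toy data: invented values, NOT from the NOAA storm database.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
  FATALITIES = c(2, 0, 5, 1, 3),
  INJURIES   = c(10, 1, 40, 0, 2)
)

# Sum both columns within each event type (a base R counterpart of
# group_by(EVTYPE) followed by summarise(sum(...))).
toy_totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE,
                        data = toy, FUN = sum)

# Sort by descending number of fatalities.
toy_totals <- toy_totals[order(-toy_totals$FATALITIES), ]
print(toy_totals)
```

Each event type ends up as a single row holding its totals, which is the shape that `dataharmful` has in the report.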
We will also make use of the four variables measuring the economic damage on property (PROPDMG, PROPDMGEXP) and on crops (CROPDMG, CROPDMGEXP). The first variable of each pair contains a numeric value and the second contains a character indicating a multiplicative factor, such as thousand, million or billion. We examine the different possibilities for these variables:
unique(data$PROPDMGEXP)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
unique(data$CROPDMGEXP)
## [1] "" "M" "K" "m" "B" "?" "0" "k" "2"
Many of these characters do not correspond to a standard multiplier code. We assume that they were introduced by error, and we keep only the observations where both indicators are exactly K (thousand), M (million), B (billion) or an empty string (indicating no multiplication). Because the match is exact, lowercase variants such as k and m are discarded as well.
dataeconomic=data %>% filter((PROPDMGEXP=="K"|PROPDMGEXP=="M"|PROPDMGEXP==""|PROPDMGEXP=="B")&(CROPDMGEXP=="K"|CROPDMGEXP=="M"|CROPDMGEXP==""|CROPDMGEXP=="B"))
The observations removed because of unexpected factor indicators account for only about 0.04 % of the original data, which is consistent with the assumption that those indicators were introduced by mistake.
100*(dim(data)[1]-dim(dataeconomic)[1])/dim(data)[1]
## [1] 0.04167142
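The filtering rule and the dropped-share check can be sketched on a toy data frame. The codes below are invented for illustration, and `%in%` is used here as a more compact way of writing the chained `==` comparisons. This is a sketch of the same rule, not the code that produced the figure above.

```r
# Toy exponent codes: invented for illustration, NOT the real data.
toy <- data.frame(
  PROPDMGEXP = c("K", "M", "", "B", "m", "+", "K", "5"),
  CROPDMGEXP = c("",  "K", "", "M", "K", "",  "?", "")
)

valid <- c("", "K", "M", "B")   # codes the report keeps

# Keep a row only when BOTH codes are valid.
keep <- toy$PROPDMGEXP %in% valid & toy$CROPDMGEXP %in% valid
toy_kept <- toy[keep, ]

# Share of rows dropped, in percent.
dropped_pct <- 100 * (nrow(toy) - nrow(toy_kept)) / nrow(toy)
print(dropped_pct)
```

In this toy example half of the rows are dropped, since the invalid codes were packed in deliberately. In the real data the share is far smaller.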
We now convert those letters indicating a multiplicative factor to the actual numerical multiplicative factor, so we can use it later.
dataeconomic$PROPDMGEXP[dataeconomic$PROPDMGEXP==""]=1
dataeconomic$PROPDMGEXP[dataeconomic$PROPDMGEXP=="K"]=1e3
dataeconomic$PROPDMGEXP[dataeconomic$PROPDMGEXP=="M"]=1e6
dataeconomic$PROPDMGEXP[dataeconomic$PROPDMGEXP=="B"]=1e9
dataeconomic$PROPDMGEXP=as.numeric(dataeconomic$PROPDMGEXP)
dataeconomic$CROPDMGEXP[dataeconomic$CROPDMGEXP==""]=1
dataeconomic$CROPDMGEXP[dataeconomic$CROPDMGEXP=="K"]=1e3
dataeconomic$CROPDMGEXP[dataeconomic$CROPDMGEXP=="M"]=1e6
dataeconomic$CROPDMGEXP[dataeconomic$CROPDMGEXP=="B"]=1e9
dataeconomic$CROPDMGEXP=as.numeric(dataeconomic$CROPDMGEXP)
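The same conversion can be expressed as a lookup, sketched below on invented values. `exp_to_factor` is a hypothetical helper that does not appear in the original analysis. It relies on `match()` because an empty string cannot be used to index a named vector in R.

```r
# Exponent codes kept by the report and their multipliers.
codes   <- c("", "K", "M", "B")
factors <- c(1, 1e3, 1e6, 1e9)

# Hypothetical helper (not in the original report): map each code to its
# multiplier; codes outside `codes` become NA.
exp_to_factor <- function(x) factors[match(x, codes)]

# Toy values, invented for illustration.
dmg    <- c(25, 2.5, 10, 1.2)
dmgexp <- c("K", "M", "", "B")

# Damage in dollars = numeric value * multiplier.
dollars <- dmg * exp_to_factor(dmgexp)
print(dollars)
```

A side benefit of the lookup is that any unexpected code surfaces as `NA` instead of being silently coerced.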
Similarly to what we did before, we compute for each of the event types the total property, crop and global damage in dollars.
dataeconomic=dataeconomic %>% group_by(EVTYPE) %>% summarise(PROPDAMAGE=sum(PROPDMG*PROPDMGEXP),CROPDAMAGE=sum(CROPDMG*CROPDMGEXP),TOTALDAMAGE=sum(PROPDMG*PROPDMGEXP)+sum(CROPDMG*CROPDMGEXP),.groups="drop")
We are interested in knowing which types of events are the most harmful in terms of population health. We rearrange the types of events in descending order for the number of fatalities and the number of injuries.
f3=arrange(dataharmful,desc(FATALITIES))[1:3,]
i3=arrange(dataharmful,desc(INJURIES))[1:3,]
In terms of total number of fatalities, the most harmful events are tornadoes, followed by excessive heat, and then by flash floods. We can see that tornadoes cause slightly under three times as many fatalities as excessive heat.
print(f3)
## # A tibble: 3 x 3
## EVTYPE FATALITIES INJURIES
## <chr> <dbl> <dbl>
## 1 TORNADO 5633 91346
## 2 EXCESSIVE HEAT 1903 6525
## 3 FLASH FLOOD 978 1777
In terms of total number of injuries, tornadoes are again the most harmful events, this time followed by thunderstorm winds and then by regular floods. Tornadoes injure over 13 times as many people as thunderstorm winds, the second cause of injuries due to meteorological events.
print(i3)
## # A tibble: 3 x 3
## EVTYPE FATALITIES INJURIES
## <chr> <dbl> <dbl>
## 1 TORNADO 5633 91346
## 2 TSTM WIND 504 6957
## 3 FLOOD 470 6789
For both fatalities and injuries, tornadoes are by far the leading cause: they account for about 3 times as many fatalities as the next cause (excessive heat) and over 13 times as many injuries as the next cause (thunderstorm winds).
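The ratios between the leading cause and the runner-up can be checked directly from the totals printed in the two tables above. The short sketch below copies those totals by hand; the vector names are chosen for this example only.

```r
# Totals copied from the two tables printed above.
fatalities <- c("TORNADO" = 5633, "EXCESSIVE HEAT" = 1903, "FLASH FLOOD" = 978)
injuries   <- c("TORNADO" = 91346, "TSTM WIND" = 6957, "FLOOD" = 6789)

# Ratio of the leading cause to the runner-up in each category.
fatality_ratio <- fatalities[["TORNADO"]] / fatalities[["EXCESSIVE HEAT"]]
injury_ratio   <- injuries[["TORNADO"]] / injuries[["TSTM WIND"]]

print(round(c(fatalities = fatality_ratio, injuries = injury_ratio), 2))
```

The fatality ratio comes out just under 3 and the injury ratio just over 13.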
In the following plot we represent a comparison of the three most harmful cases in terms of the number of people affected. Note that the scale is logarithmic. We can see that the number of injuries is in general significantly higher than the number of fatalities.
xvar=c(1,1,1,2,2,2)
yvar=log10(c(f3$FATALITIES,i3$INJURIES))
co=c("black","darkgoldenrod","lightblue","black","red","blue")
ch=c(17,19,18,17,4,0)
plot(xvar,yvar,xlim=c(0,3),xaxt="n",yaxt="n",xlab="",ylab="Number of occurrences [thousands of people]",col=co,pch=ch,cex=2)
axis(1,c(1,2),labels=c("Fatalities","Injuries"))
axis(2,log10(c(1,5,10,50,100)*1e3),labels=c(1,5,10,50,100),las=2)
leg=c("tornado","excessive heat","flash flood","thunderstorm wind","flood")
legend("topleft",legend=leg,pch=ch[-4],cex=1.2,col=co[-4],bty="n")
Fatalities and injuries, in thousands of people on a logarithmic scale, caused by the three most harmful event types in each of the two categories.
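When a log-scaled axis is drawn by hand, the tick positions and the tick labels must describe the same values. One way to guarantee this, sketched below, is to derive both from a single vector. `log_ticks` is a hypothetical helper introduced for this example and is not part of the original analysis.

```r
# Hypothetical helper (not in the original report): derive both the tick
# positions and the tick labels from ONE vector of values, so that the
# two cannot drift apart.
log_ticks <- function(values, unit) {
  list(at = log10(values * unit), labels = values)
}

# Ticks at 1, 5, 10, 50 and 100 thousand people.
ticks <- log_ticks(c(1, 5, 10, 50, 100), unit = 1e3)
print(ticks$at)
```

Inside a plot, the result would be passed on as `axis(2, at = ticks$at, labels = ticks$labels, las = 2)`.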
Now we will turn to which types of events have the greatest economic consequences. We rearrange the types of events in descending order for the amount of damage on property and crops.
p3=arrange(dataeconomic,desc(PROPDAMAGE))[1:3,]
c3=arrange(dataeconomic,desc(CROPDAMAGE))[1:3,]
With respect to property losses, the worst events are regular floods, which produce about twice as much property damage as the next worst events, hurricanes or typhoons. Tornadoes, following closely behind hurricanes and typhoons, are the third worst events in terms of property damage.
print(p3)
## # A tibble: 3 x 4
## EVTYPE PROPDAMAGE CROPDAMAGE TOTALDAMAGE
## <chr> <dbl> <dbl> <dbl>
## 1 FLOOD 144657709807 5661968450 150319678257
## 2 HURRICANE/TYPHOON 69305840000 2607872800 71913712800
## 3 TORNADO 56925485483 364950110 57290435593
When considering crop losses, the worst event is drought, which causes between two and three times as much crop damage as the next type of event, regular floods. These in turn cause slightly more crop damage than the third cause, river floods.
print(c3)
## # A tibble: 3 x 4
## EVTYPE PROPDAMAGE CROPDAMAGE TOTALDAMAGE
## <chr> <dbl> <dbl> <dbl>
## 1 DROUGHT 1046106000 13972566000 15018672000
## 2 FLOOD 144657709807 5661968450 150319678257
## 3 RIVER FLOOD 5118945500 5029459000 10148404500
In the following plot we compare the property damage and the crop damage, in billions of dollars, for the three worst event types in each category. The vertical scale is logarithmic. We can see that the losses are considerably higher in terms of property than in terms of crops.
yvar=log10(c(p3$PROPDAMAGE,c3$CROPDAMAGE))
co=c("blue","forestgreen","black","darkgoldenrod","blue","cyan")
ch=c(15,19,17,4,15,18)
plot(xvar,yvar,xlim=c(0,3),xaxt="n",yaxt="n",xlab="",ylab="Damage [billions of USD]",col=co,pch=ch,cex=2)
axis(1,c(1,2),labels=c("Property damage","Crop damage"))
axis(2,log10(c(5:9,seq(10,100,10),125,150)*1e9),labels=c(5:9,seq(10,100,10),125,150),las=2)
leg=c("flood","hurricane/typhoon","tornado","drought","river flood")
legend("topright",legend=leg,pch=ch[-5],cex=1.2,col=co[-5],bty="n")
Losses in billions of US dollars, on a logarithmic scale, for property and crop damage caused by the three most harmful event types in each of the two categories.