In this report the aim is to analyse severe weather events and their detrimental effects on public health and economic activity. For that purpose the storm database of the U.S. National Oceanic and Athmospheric Administration (NOAA) has been used. Main part of the analysis is concerned with processing the data and “cleaning” the variable containing the weather events. A summary of the worst events with respect to health and economy is provided. Tornadoes, excessive heat and thunderstorm wind are among the deadliest events, while foods, hurricanes and storms have the most negative effect on the economic activity.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. The data can be obtained from the following link Storm Data.
There is some extra information about the variables and how they are constructed.
National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ
stormdata <- read.csv("stormdata.csv", stringsAsFactors = F)
After reading the data we can check the basic structure of the dataset and get a feeling of the file.
str(stormdata)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
From here can be seen that for the analysis of the weather we can use the variable EVTYPE, to separate each different event.
The impact on human health can be gleaned by summarizing the variables FATALITIES and INJURIES. Whereas the economic damage can be estimated through analyzing the variables PROPDMG and CROPDMG. However, here can also be seen that there is a scaling factors PROPDMGEXP and CROPDMGEXP, containing the letters “K”, “M”, “B” indicating respectively thousands, millions and billions. These variables should be transformed into actual numbers in order to obtain the actual amount of estimated damage.
In order to simplify the coding it is useful to make all variable names lowercase, and also all the events in EVTYPE would be transformed into lowercase.
names(stormdata) <- tolower(names(stormdata))
stormdata$evtype <- tolower(stormdata$evtype)
The date variable should also be transformed into a class “Date”
stormdata$bgn_date <- as.Date((stormdata$bgn_date), "%m/%d/%Y %H:%M:%S")
An inportant part of this analysis is to shrink the original data file, since there are 902297 observations and 37 variables and that would require more time to do the computations. In order to determine the the inpact of weather on health and economy, we are intrested only in observations that are positive and not equal ot 0. Also we can get rid of the rest of the variables since they do not pertain to our analysis.
stormdata_subset <- subset(stormdata, fatalities>0 | injuries>0 | propdmg > 0 | cropdmg >0, select=c(bgn_date, state, evtype, fatalities, injuries, propdmg, propdmgexp, cropdmg, cropdmgexp))
From now on we are working only with the dataset ‘stormdata_subset’. It is useful to see its structure again:
str(stormdata_subset)
## 'data.frame': 254633 obs. of 9 variables:
## $ bgn_date : Date, format: "1950-04-18" "1950-04-18" ...
## $ state : chr "AL" "AL" "AL" "AL" ...
## $ evtype : chr "tornado" "tornado" "tornado" "tornado" ...
## $ fatalities: num 0 0 0 0 0 0 0 0 1 0 ...
## $ injuries : num 15 0 2 2 2 6 1 0 14 0 ...
## $ propdmg : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ propdmgexp: chr "K" "K" "K" "K" ...
## $ cropdmg : num 0 0 0 0 0 0 0 0 0 0 ...
## $ cropdmgexp: chr "" "" "" "" ...
eventtypes <- length(unique(stormdata_subset$evtype))
sort(unique(stormdata_subset$evtype))[1:30]
## [1] " high surf advisory" " flash flood"
## [3] " tstm wind" " tstm wind (g45)"
## [5] "?" "agricultural freeze"
## [7] "apache county" "astronomical high tide"
## [9] "astronomical low tide" "avalance"
## [11] "avalanche" "beach erosion"
## [13] "black ice" "blizzard"
## [15] "blizzard/winter storm" "blowing dust"
## [17] "blowing snow" "breakup flooding"
## [19] "brush fire" "coastal flooding/erosion"
## [21] "coastal erosion" "coastal flood"
## [23] "coastal flooding" "coastal flooding/erosion"
## [25] "coastal storm" "coastal surge"
## [27] "coastalstorm" "cold"
## [29] "cold air tornado" "cold and snow"
We can see that there are 447 event types listed here in the variable ‘evtype’. Since according to the Storm Data Documentation there should be only around 50 types of event, we have to do a lot of cleaning of the data in order to obtain a meaningful summary. We can see the first 30 events in the data.
Transforming this variable requires a lot of work. For example ‘Thunderstorm Wind’ is substituted to various different denotations like ‘tstm’ and others. The full length of this transformations is best viewed into the following code.
stormdata_subset$evtype <- sub("(.*)thu(.*)","Thunderstorm Wind",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("non(.*)tstm(.*)","Strong Wind",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)tstm(.*)","Thunderstorm Wind",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)tunder(.*)","Thunderstorm Wind",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)torn(.*)","Tornado",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)gustnado(.*)","Tornado",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)blizzard(.*)","Blizzard",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)wild(.*)","Wildfire",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)coastal(.*)flood(.*)","Coastal Flood",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)flash(.*)flood(.*)","Flash Flood",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)flood(.*)flash(.*)","Flash Flood",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)flood(.*)","Flood",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)dam(.*)","Flood",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)drown(.*)","Flood",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)swells(.*)","Flood",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)stream(.*)","Flood",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)fog(.*)","Dense Fog",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)drought(.*)","Drought",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)dry(.*)","Drought",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)dust(.*)","Dust Storm",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)excessive(.*)heat(.*)","Excessive Heat",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)heat(.*)","Heat",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)warm(.*)","Heat",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)extreme(.*)","Cold/Wind Chill",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)cool(.*)","Cold/Wind Chill",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)frost(.*)freeze(.*)","Frost/Freeze",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)frost(.*)","Frost/Freeze",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)funnel(.*)","Funnel Cloud",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)hail(.*)","Hail",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)rain(.*)","Heavy Rain",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)snow(.*)","Heavy Snow",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)surf(.*)","High Surf",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)hurricane(.*)","Hurricane (Typhoon)",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)typhoon(.*)","Hurricane (Typhoon)",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)ice(.*)","Ice Storm",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)light(.*)","Lightning",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)lignt(.*)","Lightning",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)rip(.*)","Rip Current",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)sleet(.*)","Sleet Storm",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)wind(.*)","Strong Wind",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)turbulence(.*)","Strong Wind",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)tropical(.*)","Tropical Storm",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)tsu(.*)","Tsunami",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)volc(.*)","Volcanic Ash",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)water(.*)","Waterspout",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)winter(.*)stor(.*)","Winter Storm",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)winter(.*)","Winter Weather",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)mud(.*)","Mudslides",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)cold(.*)","Cold/Wind Chill",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)shower(.*)","Heavy Rain",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)fire(.*)","Wildfire",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)freez(.*)","Frost/Freeze",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)storm(.*)surge(.*)","Storm/Surge Tide",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)coast(.*)","Storm/Surge Tide",stormdata_subset$evtype)
stormdata_subset$evtype <- sub("(.*)tide(.*)","Storm/Surge Tide",stormdata_subset$evtype)
Finally we arrive at the following list of event types, that can be used to summarize the data.
sort(unique(stormdata_subset$evtype))
## [1] "?" "apache county"
## [3] "avalance" "avalanche"
## [5] "beach erosion" "Blizzard"
## [7] "Coastal Flood" "Cold/Wind Chill"
## [9] "Dense Fog" "dense smoke"
## [11] "downburst" "Drought"
## [13] "Dust Storm" "Excessive Heat"
## [15] "excessive wetness" "Flash Flood"
## [17] "Flood" "Frost/Freeze"
## [19] "Funnel Cloud" "glaze"
## [21] "Hail" "Heat"
## [23] "heavy mix" "heavy precipitation"
## [25] "Heavy Rain" "heavy seas"
## [27] "Heavy Snow" "high"
## [29] "high seas" "High Surf"
## [31] "high waves" "Hurricane (Typhoon)"
## [33] "hyperthermia/exposure" "hypothermia"
## [35] "hypothermia/exposure" "Ice Storm"
## [37] "icy roads" "landslide"
## [39] "landslides" "landslump"
## [41] "landspout" "Lightning"
## [43] "low temperature" "marine accident"
## [45] "marine mishap" "microburst"
## [47] "mixed precip" "mixed precipitation"
## [49] "Mudslides" "other"
## [51] "Rip Current" "rock slide"
## [53] "rogue wave" "rough seas"
## [55] "seiche" "Sleet Storm"
## [57] "Storm/Surge Tide" "Strong Wind"
## [59] "Thunderstorm Wind" "Tornado"
## [61] "Tropical Storm" "Tsunami"
## [63] "urban and small" "urban small"
## [65] "Volcanic Ash" "Waterspout"
## [67] "wet microburst" "Wildfire"
## [69] "Winter Storm" "Winter Weather"
## [71] "wintry mix"
The transformation is not perfect, i.e. there are some events that are not put in a broad category, but at least in the eyes of this student it is not very clear how to proceed further. In any case for the sake of the anaysis, that is not very important since the main results would still hold.
Another important part of the analysyis is to get the property and crop damage variables into proper values.
Now we have ‘propdmgexp’ and ‘cropdmgexp’ with values “K”, “M”, “B” indicating thousands, millions and billions. First we can change all letters into uppercase.
stormdata_subset$propdmgexp <- toupper(stormdata_subset$propdmgexp)
stormdata_subset$cropdmgexp <- toupper(stormdata_subset$cropdmgexp)
With the help of a ‘for’ loop and several ‘if’ statements we can make these variables into variables containing numerical values of 1e+3, 1e+6 and so on.
for (i in seq_along(stormdata_subset$propdmgexp)) {
if (stormdata_subset[i,"propdmgexp"]=="H") { stormdata_subset[i,"propdmgexp"]=100}
if (stormdata_subset[i,"propdmgexp"]=="K") { stormdata_subset[i,"propdmgexp"]=1000}
if (stormdata_subset[i,"propdmgexp"]=="M") { stormdata_subset[i,"propdmgexp"]=1000000}
if (stormdata_subset[i,"propdmgexp"]=="B") { stormdata_subset[i,"propdmgexp"]=1000000000}
}
for (i in seq_along(stormdata_subset$cropdmgexp)) {
if (stormdata_subset[i,"cropdmgexp"]=="K") { stormdata_subset[i,"cropdmgexp"]=1000}
if (stormdata_subset[i,"cropdmgexp"]=="M") { stormdata_subset[i,"cropdmgexp"]=1000000}
if (stormdata_subset[i,"cropdmgexp"]=="B") { stormdata_subset[i,"cropdmgexp"]=1000000000}
}
Now that we have numerical values, we can transform them into class ‘numeric’.
stormdata_subset[,6:9] <- sapply(stormdata_subset[,6:9], as.numeric)
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
Finally, we are creating new variables that a the product of the previous ones, and actually contain the estimated value of the damage in dollars, caused by the various types of severe weather.
stormdata_subset$propdamage <- stormdata_subset$propdmg*stormdata_subset$propdmgexp
stormdata_subset$cropdamage <- stormdata_subset$cropdmg*stormdata_subset$cropdmgexp
Making the ‘evtype’ variable into class ‘factor’.
stormdata_subset$evtype <- as.factor(stormdata_subset$evtype)
stormdata_subset$state <- as.factor(stormdata_subset$state)
That was the end of the ‘Data Processing’ part of the anaysis. We have now a clean dataset to work with - ‘stormdata_subset’
We can have a look at some basic statistics of our variables concerned. For example for the fatalities they are:
summary(stormdata_subset$fatalities)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.0 0.1 0.0 583.0
And for the injuries basic statistics are:
summary(stormdata_subset$injuries)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.0 0.6 0.0 1700.0
For the property damages they are:
summary(stormdata_subset$propdamage)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00e+00 2.50e+03 1.00e+04 1.76e+06 4.10e+04 1.15e+11 11591
And for the crop damages they are:
summary(stormdata_subset$cropdamage)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00e+00 0.00e+00 0.00e+00 4.82e+05 0.00e+00 5.00e+09 152670
We can have a boxplot as well and since the data is highly skewed we will transform the data with a log function.
boxplot(log2(stormdata_subset$fatalities),log2(stormdata_subset$injuries),log2(stormdata_subset$propdamage),log2(stormdata_subset$cropdamage), names=c("Fatalities","Injuries", "Propdamage", "Cropdamage"))
## Warning: Outlier (-Inf) in boxplot 1 is not drawn
## Warning: Outlier (-Inf) in boxplot 2 is not drawn
## Warning: Outlier (-Inf) in boxplot 3 is not drawn
## Warning: Outlier (-Inf) in boxplot 4 is not drawn
In order to see which weather events cause most harm, we add all the values across all years for each specific event. The function ‘aggregate’ is used but other ways are also possible. Then we arrange the result in a decsending order.
aggregate_fatalities <- aggregate(stormdata_subset$fatalities, by=list(stormdata_subset$evtype),sum)
aggregate_fatalities_order <- aggregate_fatalities[order(aggregate_fatalities[,2],decreasing=T),][1:15,]
aggregate_injuries <- aggregate(stormdata_subset$injuries, by=list(stormdata_subset$evtype),sum)
aggregate_injuries_order <- aggregate_injuries[order(aggregate_injuries[,2],decreasing=T),][1:15,]
Here is a table with the top 15 most hazardous weather events in the US.
health <- cbind(aggregate_fatalities_order, aggregate_injuries_order)
names(health) <- c("Event", "Fatalities","Event", "Injuries")
health
## Event Fatalities Event Injuries
## 60 Tornado 5636 Tornado 91407
## 14 Excessive Heat 1920 Thunderstorm Wind 9544
## 22 Heat 1223 Flood 6883
## 16 Flash Flood 1035 Excessive Heat 6525
## 42 Lightning 817 Lightning 5231
## 59 Thunderstorm Wind 756 Heat 2703
## 51 Rip Current 572 Ice Storm 2152
## 58 Strong Wind 544 Strong Wind 1915
## 17 Flood 514 Flash Flood 1802
## 8 Cold/Wind Chill 356 Wildfire 1608
## 4 avalanche 224 Hail 1371
## 69 Winter Storm 216 Winter Storm 1338
## 30 High Surf 166 Hurricane (Typhoon) 1331
## 27 Heavy Snow 163 Heavy Snow 1162
## 32 Hurricane (Typhoon) 135 Dense Fog 1077
And a simple barplot illustrating the content of the table graphically.
par(mfrow = c(1,2), cex = .8)
barplot(aggregate_fatalities_order[,2],names.arg=aggregate_fatalities_order[,1],cex.names=.7,col="red", main="Number of Fatalities", xlab="Weather Events", ylab="number")
barplot(aggregate_injuries_order[,2],names.arg=aggregate_injuries_order[,1],cex.names=.7,col="lightblue", main="Number of Injuries", xlab="Weather Events")
From the table and chart we can see that the single most deadly event is tornado. For the rest the biggest number of deaths occur for heat, flash flood, lightning and so on. Most injuries are recorded for thunderstorm, floods, lightnings and so on.
Similarly for the economic effects we sum all the damage in dollar amount across all different event and for all the years. Again the ‘aggregate’ function has been used. Then the result is arranged in decsending order.
aggregate_propdamage <- aggregate(stormdata_subset$propdamage, by=list(stormdata_subset$evtype), sum, na.rm=T)
aggregate_propdamage_order <- aggregate_propdamage[order(aggregate_propdamage[,2],decreasing=T),][1:10,]
aggregate_cropdamage <- aggregate(stormdata_subset$cropdamage, by=list(stormdata_subset$evtype), sum,na.rm=T)
aggregate_cropdamage_order <- aggregate_cropdamage[order(aggregate_cropdamage[,2],decreasing=T),][1:10,]
Here is a simple tabel illustrating the ten most harmful weather events for the property and agriculture.
econ_damage <- cbind(aggregate_propdamage_order,aggregate_cropdamage_order)
names(econ_damage) <- c("Event", "Property Damage", "Event", "Crop Damage")
econ_damage
## Event Property Damage Event Crop Damage
## 17 Flood 1.503e+11 Drought 1.397e+10
## 32 Hurricane (Typhoon) 8.536e+10 Flood 1.115e+10
## 60 Tornado 5.699e+10 Hurricane (Typhoon) 5.516e+09
## 57 Storm/Surge Tide 4.798e+10 Ice Storm 5.022e+09
## 16 Flash Flood 1.691e+10 Hail 3.047e+09
## 21 Hail 1.597e+10 Frost/Freeze 1.701e+09
## 59 Thunderstorm Wind 1.258e+10 Flash Flood 1.532e+09
## 68 Wildfire 8.497e+09 Cold/Wind Chill 1.431e+09
## 61 Tropical Storm 7.716e+09 Thunderstorm Wind 1.274e+09
## 69 Winter Storm 6.689e+09 Heavy Rain 8.062e+08
That is the bar plot corresponding to the table above.
par(mfrow = c(1,2), cex = .8)
barplot(aggregate_propdamage_order[,2],names.arg=aggregate_propdamage_order[,1],cex.names=.7,col="red", main="Property Damage (in dollars)", xlab="Weather Events", ylab="dollars")
barplot(aggregate_cropdamage_order[,2],names.arg=aggregate_cropdamage_order[,1],cex.names=.7,col="lightblue", main="Crop Damage (in dollars)", xlab="Weather Events")
We can see from the table and chart that the most detrimantal weather events with respect to property are floods, hurricanes, tornados, storms and so on. However, the worst events for agriculture are somewhat different, i.e. drought, hurricane, floods, ice storms and so on.