Severe weather events exploratory analysis

Jose Augusto Barros de Oliveira
Reproducible Research PA 2
Feb. 2015

Synopsis

Using the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database, we addressed two questions: across the USA, which types of weather events are most harmful to population health, and which have the greatest economic consequences? We restricted the analysis to 1996-2011, the period in which all event types were recorded, which avoids the bias caused by under-reporting in earlier years. For the USA as a whole, tornadoes were the most harmful to population health, while floods caused the greatest economic damage. As expected, the picture differs among the states, as the last plot shows.

Data Processing

The data were downloaded from the provided link (http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2). After loading, some cleaning was necessary. First, as a convention, we changed the variable names to lower case and removed the underscore character “_”. Then, for the “evtype” variable, we converted all strings to lower case and removed leading blank spaces, in an attempt to reduce the number of distinct values.

database <- read.csv("repdata-data-StormData.csv.bz2", header=TRUE, na.strings="")
names(database) <- tolower(names(database))
names(database) <- gsub("_", "", names(database), fixed=TRUE) # drop every underscore in one pass
str(database$evtype)

##  Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...

database$evtype <- tolower(database$evtype)
database$evtype <- gsub("^ *","",database$evtype) # remove leading blank spaces
database$evtype <- as.factor(database$evtype)
str(database$evtype)

##  Factor w/ 890 levels "?","abnormal warmth",..: 750 750 750 750 750 750 750 750 750 750 ...
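The effect of these cleaning steps can be checked on a few toy values. This is only a sketch: `raw_names` and `raw_events` are made-up samples in the style of the raw file, not the full header or the full list of event labels.

```r
# Toy column names in the style of the raw header (assumed sample)
raw_names <- c("STATE__", "BGN_DATE", "EVTYPE", "PROPDMGEXP")
clean_names <- gsub("_", "", tolower(raw_names), fixed = TRUE)

# Toy event labels with stray leading blanks and mixed case (assumed sample)
raw_events <- c("   HIGH SURF ADVISORY", "TSTM WIND", " Flash Flood", "flash flood")
clean_events <- sub("^ +", "", tolower(raw_events))

print(clean_names)
# Labels that differ only by case or leading blanks collapse into one value
print(length(unique(raw_events)))
print(length(unique(clean_events)))
```

Lower-casing and trimming only merge labels that differ in case or leading blanks; synonyms such as "tstm wind" are left as distinct values.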

The raw data had 985 unique values for “evtype” (890 after the cleaning above), whereas we expected 48, as listed in the database documentation (http://www.ncdc.noaa.gov/stormevents/pd01016005curr.pdf), page 6. We used this list of official event names to subset the database. In addition, as stated at (http://www.ncdc.noaa.gov/stormevents/details.jsp), all of these event types were recorded only from 1996 onward. We therefore kept only the records dated after 01/01/1996, to avoid bias due to under-reporting.

standardevents <- c("Astronomical Low Tide","Avalanche","Blizzard","Coastal Flood","Cold/Wind Chill","Debris Flow","Dense Fog","Dense Smoke","Drought","Dust Devil","Dust Storm","Excessive Heat","Extreme Cold/Wind Chill","Flash Flood","Flood","Frost/Freeze","Funnel Cloud","Freezing Fog","Hail","Heat","Heavy Rain","Heavy Snow","High Surf","High Wind","Hurricane (Typhoon)","Ice Storm","Lake-Effect Snow","Lakeshore Flood","Lightning","Marine Hail","Marine High Wind","Marine Strong Wind","Marine Thunderstorm Wind","Rip Current","Seiche","Sleet","Storm Surge/Tide","Strong Wind","Thunderstorm Wind","Tornado","Tropical Depression","Tropical Storm","Tsunami","Volcanic Ash","Waterspout","Wildfire","Winter Storm","Winter Weather") # official event names.
standardevents <-tolower(standardevents)
database1 <- subset(database,(database$evtype %in% standardevents))
# as.Date() ignores the trailing " 0:00:00" once the date format has matched
database1$date <- as.Date(as.character(database1$bgndate), format="%m/%d/%Y")
database1 <- subset(database1, (database1$date > as.Date("1996-01-01")))
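The two filters above (official event names, then date) can be illustrated on a toy table. This is a minimal sketch under assumptions: the rows of `toy` and the shortened `official` list are invented for the example.

```r
# Toy records (invented); dates follow the "m/d/Y 0:00:00" layout of bgndate
toy <- data.frame(
  evtype  = c("tornado", "tstm wind", "flood", "hail"),
  bgndate = c("4/18/1950 0:00:00", "6/2/1997 0:00:00",
              "1/1/1996 0:00:00", "7/4/2003 0:00:00"),
  stringsAsFactors = FALSE
)
official <- c("tornado", "flood", "hail")  # shortened stand-in for the 48 names

# Trailing characters after the matched format are ignored by as.Date()
toy$date <- as.Date(toy$bgndate, format = "%m/%d/%Y")

# Keep official event names recorded strictly after 1996-01-01
kept <- toy[toy$evtype %in% official & toy$date > as.Date("1996-01-01"), ]
print(kept)
```

Because the comparison is strict, a record dated exactly 1996-01-01 is dropped along with the earlier ones.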

Finally, we converted the damage values to numeric format, applying the scale (K for thousands, M for millions and B for billions) given in “propdmgexp” and “cropdmgexp”. We did this with a loop and conditional statements; values with a missing or unrecognized scale code were left unscaled.

summary(database1$propdmgexp)

##      -      ?      +      0      1      2      3      4      5      6 
##      0      0      0      1      0      0      0      0      0      0 
##      7      8      B      h      H      K      m      M   NA's 
##      0      0     14      0      0 304760      0   6519 193318

numpropdmg <- NULL
database3 <- database1
# Property damage: value in column 25, scale code in column 26
for (i in 1:nrow(database3)){
    if (is.na(database3[i,26])) {
        numpropdmg <- append (numpropdmg, (database3[i,25]))
    } else if (database3[i,26] == "K"){
        numpropdmg <- append (numpropdmg, (database3[i,25]*1000))
    } else if (database3[i,26] == "M"){
        numpropdmg <- append (numpropdmg, (database3[i,25]*1000000))
    } else if (database3[i,26] == "B"){
        numpropdmg <- append (numpropdmg, (database3[i,25]*1000000000))
    } else {
        numpropdmg <- append (numpropdmg, (database3[i,25]))
    }
}
rm(i)
numcropdmg <- NULL
# Crop damage: value in column 27, scale code in column 28
for (i in 1:nrow(database3)){
    if (is.na(database3[i,28])) {
        numcropdmg <- append (numcropdmg, (database3[i,27]))
    } else if (database3[i,28] == "K"){
        numcropdmg <- append (numcropdmg, (database3[i,27]*1000))
    } else if (database3[i,28] == "M"){
        numcropdmg <- append (numcropdmg, (database3[i,27]*1000000))
    } else if (database3[i,28] == "B"){
        numcropdmg <- append (numcropdmg, (database3[i,27]*1000000000))
    } else {
        numcropdmg <- append (numcropdmg, (database3[i,27]))
    }
}
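Growing a vector with `append` inside a loop can be slow on a table of this size. As an alternative, the same rule can be written as a vectorized lookup. This is a sketch, not the code used for the results: `scale_damage` is a hypothetical helper introduced here, and it assumes the same convention as the loops (K, M and B scale the value; any other code, or a missing one, leaves it unchanged).

```r
# Hypothetical helper (not part of the original analysis)
scale_damage <- function(value, exponent) {
  multiplier <- c(K = 1e3, M = 1e6, B = 1e9)
  # Look up each row's multiplier by name; unknown or missing codes give NA
  factor_by_row <- multiplier[as.character(exponent)]
  factor_by_row[is.na(factor_by_row)] <- 1  # leave such values unscaled
  unname(value * factor_by_row)
}

# Toy values (invented) covering every branch of the loop
toy_value <- c(25, 2.5, 1, 10, 3)
toy_code  <- factor(c("K", "M", "B", NA, "0"))
print(scale_damage(toy_value, toy_code))
```

Under these assumptions, `scale_damage(database3$propdmg, database3$propdmgexp)` would yield the same values as the first loop.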

Results

Entire USA

We took the sum of injuries and fatalities as the measure of harm to population health, and totaled it by event type.

library(dplyr)

library(ggplot2)

database2 <- summarise(group_by(database1,evtype), sum(fatalities,injuries))
names(database2)<- c("event","fatinj")
database2 <- database2[order(database2$fatinj, decreasing = TRUE), ]
database2 <- database2[(database2$fatinj > 0),]
database2$event <- as.character(database2$event)
plot <- ggplot(database2, aes(x=fatinj, y=reorder(event,fatinj))) + geom_point(stat="identity") + xlab("Number of injuries and fatalities") + ylab("Events") + ggtitle("Most harmful events in USA 1996 - 2011")
plot

Figure: Most harmful events in USA 1996 - 2011 (x-axis: number of injuries and fatalities; y-axis: events)

Tornadoes were responsible for 22,178 injuries and fatalities between 1996 and 2011.
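The grouping step behind this figure can be reproduced in base R on a toy table. This is a minimal sketch with invented numbers; it uses `tapply` rather than the dplyr calls above, purely to show what "sum of injuries and fatalities per event type" means.

```r
# Toy records (invented numbers)
toy <- data.frame(
  evtype     = c("tornado", "flood", "tornado", "heat"),
  fatalities = c(2, 1, 0, 4),
  injuries   = c(10, 0, 5, 1),
  stringsAsFactors = FALSE
)

# Add fatalities and injuries per record, then total within each event type
harm <- tapply(toy$fatalities + toy$injuries, toy$evtype, sum)
harm <- sort(harm, decreasing = TRUE)
print(harm)
```

Injuries and fatalities carry equal weight in this measure, so an event with many injuries can rank above one with more deaths.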

Similarly, we identified the event with the greatest economic consequences by summing the damage to property and crops and totaling it by event type.

database3 <- cbind(database3,numpropdmg,numcropdmg)
database4 <- summarise(group_by(database3,evtype), sum(numpropdmg,numcropdmg))
names(database4) <- c("event","damage")
database4 <- database4[order(database4$damage, decreasing = TRUE), ]
database4$event <- as.character(database4$event)
plot1 <- ggplot(database4, aes(x=damage, y=reorder(event,damage))) + geom_point(stat="identity") + xlab("Damage (US$)") + ylab("Events") + ggtitle("Most economically damaging weather events in USA 1996 - 2011")
plot1

Figure: total property and crop damage by event type, USA 1996 - 2011 (x-axis: damage (US$); y-axis: events)

Floods were responsible for about US$ 148.92 billion in damage between 1996 and 2011.

Across the USA

Because the USA spans a continent, weather events affect its states in different ways. The next plot helps to visualize this contrast.

database5 <- summarise(group_by(database1,evtype,state.1), sum(fatalities,injuries))
names(database5) <- c("event","state","fatinj")
database6 <- summarise(group_by(database3,evtype,state.1), sum(numpropdmg,numcropdmg))
names(database6) <- c("event", "state","damage")
database7 <-merge(database5,database6)
plot2 <- ggplot(database7, aes(log(fatinj), log(damage), colour = event))+geom_point() + facet_wrap(~state) + xlab("injuries and fatalities") + ylab("damage (US$)") +ggtitle("Severe weather events across USA 1996 - 2011 (logarithmic scales)")
plot2

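Figure: Severe weather events across USA 1996 - 2011 (logarithmic scales), one panel per state (x-axis: injuries and fatalities; y-axis: damage (US$))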
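When reading this plot, it helps to keep in mind how the two summaries are combined and what the logarithm does to zero totals. The sketch below uses invented numbers: `merge` joins the tables on the columns they share (event and state), and `log(0)` is `-Inf` in R, so an event with no injuries or fatalities in a state has no finite position on the horizontal axis.

```r
# Toy per-state summaries (invented numbers)
harm <- data.frame(event  = c("flood", "tornado", "hail"),
                   state  = c("TX", "TX", "KS"),
                   fatinj = c(3, 40, 0))
cost <- data.frame(event  = c("flood", "tornado", "hail"),
                   state  = c("TX", "TX", "KS"),
                   damage = c(2e6, 5e5, 1e4))

# Join on the shared columns: event and state
both <- merge(harm, cost)

# Flag the rows whose log-transformed coordinates are both finite
both$plottable <- is.finite(log(both$fatinj)) & is.finite(log(both$damage))
print(both)
```

Here the hail row is flagged as not plottable, because its injury and fatality total is zero.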