Synopsis

As it was known from the specifications that the first years of the data set have a short number of observations registered, a study was carried on to find out how many years could be discarded without affecting significatively the results. Based on the cumulative frequency of the observations per year (Fig.1), it was evident that the first half of the time period held only 10% of the data. Besides, the number of needed variables was estimated in only six of the original 37; so, a new and lighter data set was created according with this study, leaving the original untouched.

A pointer to the original was added to this reduced dataset, and also two variables representing the effect of the event’s damages to health (fatalities and injuries) and to economy (property damages) respectively. Through two boxplots (Fig.2) it was made evident that the second of these variables showed a highlighted outlier, which was removed by program.

Next, the health and economy effects were grouped by event type, and, in both, the maximum effect was found, together with the type of storm which produced it.

This analysis shows that the most harmful type of storms in USA, in the given period, according with the NOOA storm data base, are TORNADOS to the health, and HURRICANE/TYPHON to the economy. In Fig.3 an ordered group of the ten most harmful types of storms is presented for both health and economy.

Introduction

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Data

The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. It has been downloaded from the web site:

Storm Data under the name “StormData.bz2”.

There is also some documentation of the database available, where it is possible to find how some of the variables are constructed or defined:

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

Data Processing:

Loading the file into a Data Frame

The reading and unpacking is done only if the data is not already cached:

    # Unpacks the downloaded bz2 file:
    filezp <- "E:/Projects/RR_Course_Project_2/StormData.bz2"
    filerd <- bzfile(filezp)
    stormdata_df <- read.csv(filerd, header = TRUE)

Loading of libraries:

library(Hmisc)
library(dplyr)

Initialization of some variables:

ny <- 2011 - 1950 + 1 # Initial spanned period in the dataset
ifreq <- vector(length=ny); num_na <- 0 # Frequency initialization
YEAR_BGN <- vector(length=nrow(stormdata_df))
mult1 <- matrix(1,nrow=nrow(stormdata_df))
mult2 <- mult1

The next loop extracts the year from de beginning date of each observation, increments a counter and detects NA’s. To evaluate economical damage in dollars, considers the factors in PROPDMGEXP and CROPDMGEXP.
Register the appropiated numerical factors of dollars:

for (i in 1:nrow(stormdata_df)){
    if (!is.na(stormdata_df[i,2])) {
        v <- gregexpr("/",stormdata_df[i,2])
        jahr <- as.integer(substring(stormdata_df[i,2],v[[1]][2]+1,v[[1]][2]+4))
        if (is.na(jahr)){
            print(sprintf("na: %s",jahr))
        } else {
            icol <- jahr - 1949
            if (icol > 0 & icol <= ny){
                ifreq[icol] <- ifreq[icol]+1
                YEAR_BGN[i] <- jahr
            } else {
                print(sprintf("Bad year: %d",jahr))
            }
        }    
        
    } else {
        num_na <- num_na + 1
        print(sprintf("na: %s",stormdata_df[i,2]))
    }

    # Register the appropiated factors of dollars:
    # PROPDMGEXP:
    fct <- stormdata_df[i,26]
    if (fct != ""){
        if (sum(grepl(fct,c('K','k')))){mult1[i] <- 1000}
        else {if (fct == 'M'){mult1[i] <- 1000000}
        else {if (fct == 'B'){mult1[i] <- 1000000000}
        else {if (fct == 'h'){mult1[i] <- 100}}}}
    }
    # CROPDMGEXP:
    fct <- stormdata_df[i,28]
    if (fct != ""){
        if (sum(grepl(fct,c('K','k')))){mult2[i] <- 1000}
        else {if (fct == 'M'){mult2[i] <- 1000000}
        else {if (fct == 'B'){mult2[i] <- 1000000000}
        else {if (fct == 'h'){mult2[i] <- 100}}}}
    }
}
print(sprintf("Number of NA's in BGN_DATE: %d",num_na))
## [1] "Number of NA's in BGN_DATE: 0"

Compute the accumulated frequencies of events by years and plot them:

acum_perc <- 100*cumsum(ifreq)
acum_perc <- acum_perc/acum_perc[ny]
# First plot: Cumulative frequencies in percentage
par(mfrow=c(1,1), mar=c(4,4,2,2))
plot(1950:2011,acum_perc,type='l',col = "blue", xlab = "Years", ylab = "Cumulative frequency in %",
     main = "Figure 1: Cumulative frequencies of storm events per year")
abline(h = seq(10,90,20), col = "lightgray", lty = 3)
segments(1981.6,-3.9,1981.6,10, col = "red")
points(1981.6,10,pch=19,col="blue")
minor.tick(nx=10, tick.ratio=0.6)
text(1967,12,"The first half of time period has only 10% of the data",cex=0.6)
grid()

In Fig.1, the plot shows in percentage the cumulative frequencies of observations per year. It can be seen that from 1950 to 1982 there are registered only the 10% of the total observations. So, by discarting the older 10% of the data, the period shortens to 30 years, but keeps the 90 % of data.

for (j in 1:ny) {
    if (acum_perc[j] >= 10){break}
}
y_from <- j+1949
print(sprintf("First year selected: %d",y_from)) 
## [1] "First year selected: 2011"

Create the reduced data set, with only the necessary variables

cost1 <- stormdata_df[25]*mult1; cost2 <- stormdata_df[27]*mult2
IDSTORM <- 1:nrow(stormdata_df)
red_stormdata_df <- cbind(YEAR_BGN,stormdata_df[8],stormdata_df[23:24],cost1,cost2,IDSTORM)

Filter the chosen years and add two consolidated variables:

red_stormdata_df <- filter(red_stormdata_df,YEAR_BGN >= y_from)
# Add HEALTH and ECONOMY
red_stormdata_df <- mutate(red_stormdata_df,HEALTH = FATALITIES + INJURIES)
red_stormdata_df <- mutate(red_stormdata_df,ECONOMY = PROPDMG + CROPDMG)

Look for outliers, through paneled boxplots (Fig.2).
In this case, as we are interested only in maximum values, and not in the median or quintiles, seems more convenient not to use log scales.

par(mfrow = c(1,2), mar=c(4,4,2,2), oma=c(0,0,2,0))
boxplot(red_stormdata_df$HEALTH,xlab="Effects on Health",ylab="Number of Victims")
boxplot(red_stormdata_df$ECONOMY,xlab="Effects on Economy",ylab="Dollars")
mtitle("Figure 2: Search for outliers in data", line = -2)

In Fig.2, the “Effects on Health” boxplot has a couple of points out of the main group, but not too far, and its values are still reasonable. But the 2nd boxplot shows a highly suspicious value, almost one order of magnitude greater than its next neighbour, quite enormous quantity for one event: about 120 billion dollars.

Analysis in search of outliers:

To find out that suspicious storm, let’s arrange the dataframe in descent order of ECONOMY and print the first 6:

descent <- arrange(red_stormdata_df,desc(ECONOMY))
print(head(descent))
##   YEAR_BGN  EVTYPE FATALITIES INJURIES PROPDMG CROPDMG IDSTORM HEALTH
## 1     2011 TORNADO        158     1150 2.8e+09       0  862634   1308
## 2     2011   FLOOD          0        0 2.0e+09       0  868046      0
## 3     2011 TORNADO         44      800 1.5e+09       0  860386    844
## 4     2011 TORNADO          4       45 1.0e+09       0  856214     49
## 5     2011   FLOOD          0        0 1.0e+09       0  867749      0
## 6     2011 TORNADO         20      700 7.0e+08       0  858228    720
##   ECONOMY
## 1 2.8e+09
## 2 2.0e+09
## 3 1.5e+09
## 4 1.0e+09
## 5 1.0e+09
## 6 7.0e+08

We find the reason in the value of the first PROPDMG: too much greater than its companion CROPDMG.
Using the pointer to the original dataset stormdata_df (obs.605953), we find that the refered storm is a 2006’s New-Year flood in the Napa valley (California). In its REMARKS we read: “… the City and Parks Department was hit with $6 million in damage alone. The City of Napa had 600 homes with moderate damage, 150 damaged businesses with costs of at least $70 million.”
As can be seen, this is not at all in accordance with the reported cost. Probably, the PROPDMGEXP factor should be “M” (millions) and not “B” (billions), and thus PROPDMG should be 115 million dollars.
If we take a look of the other next five registers in the list, we find that 4 of them correspond to the famous Katrina hurricane (28/08/2005), reported part as Storm Surge and part as Hurricane/Typhoon, in Louisiana and Mississippi. The values reported in those four cases for ECONOMY, although big, are in accordance with the public knowledge (about 70 billion dollars altogether).
(On the other hand, it is surprising that the injuries + fatalities of Katrina hurricane are reported as almost inexistent, since it is known that there were 1836 deaths. But that is out of the scope of this work, only based in the given data).

REMOVING THE OUTLIER:
So, we are going to change the value of variables PROPDMG and ECOMOMY in the observation about Napa flood from red_stormdata_df:

irow <- which(red_stormdata_df$PROPDMG > 1.1e11) # Suspicious observation
red_stormdata_df[irow,5]<- 1.15e8               # New corrected value of PROPDMG
red_stormdata_df[irow,9]<- 1.15e8 + 32500000    # New corrected value of ECONOMY

Group by event type and find the health and economy effects in each group:

tabla <- split(red_stormdata_df,red_stormdata_df$EVTYPE)
h_effects <- sapply(tabla,function(x){sum(x$HEALTH, na.rm = TRUE)})
e_effects <- sapply(tabla,function(x){sum(x$ECONOMY, na.rm = TRUE)})

Results

The analysis has rendered the following results, as total damages caused during the period by a single type of event:

The type of event most harmful to population health:

max_h_effects <- max(h_effects)
print(sprintf("Most harmful type of event to population health and number of fatalities and injuries:"))
## [1] "Most harmful type of event to population health and number of fatalities and injuries:"
for (i in 1:length(h_effects)){
    if (h_effects[i][1] == max_h_effects){
        print(h_effects[i]) 
    }
}
## TORNADO 
##    6750

The type of event most harmful to economy:

max_e_effects <- max(e_effects)
print(sprintf("Most harmful type of event to economy and cost in dollars:"))
## [1] "Most harmful type of event to economy and cost in dollars:"
for (i in 1:length(e_effects)) {
    if (e_effects[i][1] == max_e_effects){
        print(e_effects[i])
    }
}
##    TORNADO 
## 9850961700

Plots of the ten top types of harmful storms toward health and economy:

h_descent <- sort(h_effects,decreasing = TRUE)
e_descent <- sort(e_effects,decreasing = TRUE)

# Ten top harmful types of storms, in a paneled plot:
par(mfrow = c(2,1), mar=c(4,4,2,2), oma=c(0,0,2,0))
plot(1:10,h_descent[1:10],xaxt = "n", pch=19, col="red",
     xlab="Effects on Health",ylab="Total number of victims")
abline(v = seq(1,9,2), col = "lightgray", lty = 3)
axis(1,labels=names(h_descent[1:10]),at=1:10, cex.axis=0.5, cex.lab=0.8)
grid()
plot(1:10,e_descent[1:10],xaxt = "n", pch=19, col="blue",
     xlab="Effects on Economy",ylab="Total costs in dollars")
abline(v = seq(1,9,2), col = "lightgray", lty = 3)
axis(1,labels=names(e_descent[1:10]),at=1:10, cex.axis=0.5, cex.lab=0.8)
grid()
mtitle("Figure 3: Ten top types of harmful storms (1982-2011)")

In Fig.3 are displayed, in descending order from left to right, the ten types of events most harmful to population health (superior plot) and to property (inferior plot). The vertical edge shows, in the appropiate unity, the respective costs.

NOTE: Since the period 1950-1981 contains only 10% of the observations, surely the leader type of storm in each list is not going to change, and most probably neither will the rest of the two lists, in the complete 1950-2011 time span.

Effects of events with maximum costs to health and economy for every year:

options(dplyr.print_max = 1e9) # Print all the table
years <- group_by(red_stormdata_df, YEAR_BGN) # Group the dataset by stratum year
# Find in each year the cost of the more costly event to the public health and the
# cost of the more harmful event to economy. Print a table.
summarise(years, cost_health = max(HEALTH,na.rm=TRUE), cost_economy=max(ECONOMY,na.rm=TRUE))
## Source: local data frame [1 x 3]
## 
##   YEAR_BGN cost_health cost_economy
##      (int)       (dbl)        (dbl)
## 1     2011        1308      2.8e+09

The year 2005 shows the greatest economical impact of one storm (because of Katrina), while human health was most affected by one event, reportedly, in 1994, at southern Ohio (2/9/1994): 1568 injuries, 1 death.