knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(cache = TRUE)
knitr::opts_chunk$set(warning = FALSE)
knitr::opts_chunk$set(message = FALSE)

[NB] Consider writing your report as if it were to be read by a government or municipal manager who might be responsible for preparing for severe weather events and will need to prioritize resources for different types of events.

The assignment marks are not for doing a perfect analysis, they are for making a plan, carrying out the plan, and explaining your plan and justifying it.

1. Synopsis

Extreme weather events can have severe consequences both for the health of the public and for the economy of the damaged area. Sometimes, they can result in fatalities, injuries, damages to urban and agricultural properties. Preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Furthermore, this presentation aims to make the point about the efficiency of the data collection object of this rapid analysis. Across the decades, the procedure of data entry and collection could change, with natural consequences on the readability of the dataframe.

I will try to verify the condition of the database by answer two questions:

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

Across the United States, which types of events have the greatest economic consequences?

2. Data processing

stormload <- read.csv("repdata-data-StormData.csv.bz2")
saveRDS(stormload, file = "stormdt.rds")
rm(stormload)
stormdt <- readRDS("stormdt.rds")

The dataframe consists of 902297 observations, collected from 1950 to 2011 across the United States. The information collected cover 37 variables, listed as follow. The codebook of the variables is freely reperible online.

dim(stormdt)
names(stormdt)
str(stormdt)
table(sapply(stormdt, class))

3. Question1

The variable ENVTYPE indicate what kind of event the observation collects. Take note that similar storm events can be listed using different wording e.g. “coastal flood” and “coastal flooding.” This should be considered if we want to run a query grouping by event type.

3.1. Data exploration

# Missing values
sum(is.na(stormdt$EVTYPE))

## [1] 0

class(stormdt$EVTYPE)

## [1] "factor"

str(stormdt$EVTYPE)

##  Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...

According to the National Weather Service, instruction 10-1605, Operations and Services Performance, NWSPD 10-16, STORM DATA PREPARATION, the ENVTYPE variable should consider 48 levels. Unfortunately, during the years, the variable today considers 985 levels. The main reason is the lack of homogeneity in data entry procedures. I.e. capital and lower case letters are used randomly. We should heavily clean the variable.

3.2. Recode event categories

Entering the EVTYPE categories from the NOAAS manual. They are adapten to better fit the recoding loop: the same word should do not appear as first in the same place. With the help of the community, I developed a function that recursively recode a given variable. It considers the first three letters of any element of the input. If it finds a match, it will substitute the matching term with the relative element of the input. NB: problems in computing, read comments in the next chunk of code.

envtype_cat2 <- c("Astronomical Low Tide", "Avalanche", "Blizzard", "Coastal Flood", "Cold/Wind Chill", "Debris Flow", "Fog", "Smoke", "Drought", "Devil Dust", "Dust Storm", "Excessive Heat", "Extreme Cold", "Flash Flood", "Flood", "Frost", "Funnel Cloud", "Freezing Fg", "Hail", "Heat", "Heavy Rain", "Snow", "Surf", "High Wind", "Hurricane", "Ice Storm", "Effect-Lake Snow", "Lakeshore", "Lightning", "Hail Marine", "Marine Wind", " Strong Wind Marine", " Thunderstorm Marine Wind", "Rip Current", "Seiche", "Sleet", "Surge/Tide", "Wind Strong", "Thunderstorm Wind", "Tornado", "Depression Tropical", "Tropical Storm", "Tsunami", "Volcanic Ash", "Waterspout", "Wildfire", "Winter Storm", "Weather Winter")

recursive.recode <- function( x )
{
        A <- envtype_cat2
        A3 <- sapply(A,substr,1,3)
        x <- as.character(x)
        n <- max( c( 0, which( sapply( A3, grepl, tolower(x) ) ) ) )
        
        if ( n==0 )
        {
                warning( "nothing found")
                return (x)
        }
        
        A[n]
}

## This operation is so computationally demanding (I should make the funtion more efficient) that four hours are not enought to create the final document. Because of that, I use stormdt$EVTYPE2 as charged created before of knitting the document.

# stormdt$EVTYPE2 <- sapply(stormdt$EVTYPE, recursive.recode)

stormdt$EVTYPE2 <- stormdt$EVTYPE

Now that we have recoded the variable, we are interested kown the most frequent extreme events. NB The recoding procedure is not perfect, hence the data still suffers heavy bias. Since the high number of categories, I will proceed by considering only 12 of them, according to their accurency across the dataframe.

barplot(head(sort(table(stormdt$EVTYPE2), decreasing = TRUE), 12), las = 2, main = "Frequece of extreme events. US. 1951-2011")

Hail is the most frequent event, followed by an unsual category, tstm wind. Since the recondig function returs the original value if no match occurs, tstm wind shuld be more deeply investigated.

Unfortunately, the process of recoding will require many days of manual work, since the database is really messy. In this phase, I do not operate such precise work.

3.5. Exploring the data for question 2

In first instance, I proceed with a rapid exploration of the vraiable of interest. PROPDMG contains a number, and PROPDMGEXP a multiplier. The product of the two variables gives the total damage a single event caused. We will soon create this variable.

summary(stormdt$PROPDMG)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00   12.06    0.50 5000.00

summary(stormdt$PROPDMGEXP)

##             -      ?      +      0      1      2      3      4      5 
## 465934      1      8      5    216     25     13      4      4     28 
##      6      7      8      B      h      H      K      m      M 
##      4      5      1     40      1      6 424665      7  11330

levels(stormdt$PROPDMGEXP)

##  [1] ""  "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
## [18] "m" "M"

3.6. Recoding the data for question 2 (PROPERTIES)

Now I am going to create a new variable called PROPCASH, by recoding PROPDMGEXP in numerical values. Since the dataset is really messed up, the recode is necessary to make the data readable. I.e., PROPDMGEXP should consider three levels, K, M, Bfor thousands, millions and billions respectively. Actually, we have 18. Luckily, the operation in this case is feasible by hand.

unique(stormdt$PROPDMGEXP)

##  [1] K M   B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels:  - ? + 0 1 2 3 4 5 6 7 8 B h H K m M

stormdt$PROPDMGEXP2 <- as.numeric(stormdt$PROPDMGEXP)

stormdt$PROPDMGEXP2[stormdt$PROPDMGEXP2 == "K"] <- as.factor(1000)
stormdt$PROPDMGEXP2[stormdt$PROPDMGEXP2 == "M"] <- as.factor(1e+06)
stormdt$PROPDMGEXP2[stormdt$PROPDMGEXP2 == ""] <- as.factor(1)
stormdt$PROPDMGEXP2[stormdt$PROPDMGEXP2 == "B"] <- as.factor(1e+091)
stormdt$PROPDMGEXP2[stormdt$PROPDMGEXP2 == "m"] <- as.factor(1e+06)
stormdt$PROPDMGEXP2[stormdt$PROPDMGEXP2 == "0"] <- as.factor(1)
stormdt$PROPDMGEXP2[stormdt$PROPDMGEXP2 == "5"] <- as.factor(1e+05)
stormdt$PROPDMGEXP2[stormdt$PROPDMGEXP2 == "6"] <- as.factor(1e+06)
stormdt$PROPDMGEXP2[stormdt$PROPDMGEXP2 == "4"] <- as.factor(10000)
stormdt$PROPDMGEXP2[stormdt$PROPDMGEXP2 == "2"] <- as.factor(100)
stormdt$PROPDMGEXP2[stormdt$PROPDMGEXP2 == "3"] <- as.factor(1000)
stormdt$PROPDMGEXP2[stormdt$PROPDMGEXP2 == "h"] <- as.factor(100)
stormdt$PROPDMGEXP2[stormdt$PROPDMGEXP2 == "7"] <- as.factor(1e+07)
stormdt$PROPDMGEXP2[stormdt$PROPDMGEXP2 == "H"] <- as.factor(100)
stormdt$PROPDMGEXP2[stormdt$PROPDMGEXP2 == "1"] <- as.factor(10)
stormdt$PROPDMGEXP2[stormdt$PROPDMGEXP2 == "8"] <- as.factor(1e+08)

Nobody will probably know the reason behind those values, but I am going to make them as missing.

stormdt$PROPDMGEXP2[stormdt$PROPDMGEXP2 == "+"] <- NA
stormdt$PROPDMGEXP2[stormdt$PROPDMGEXP2 == "-"] <- NA
stormdt$PROPDMGEXP2[stormdt$PROPDMGEXP2 == "?"] <- NA

Now, since the variable is recoded, we can create a new variable PROPCASH containing the actual damage related to an event.

stormdt$PROPCASH <- stormdt$PROPDMG * as.numeric(stormdt$PROPDMGEXP2)

stormdt$PROPDMGEXP2 <- as.numeric(stormdt$PROPDMGEXP2)

summary(stormdt$PROPDMGEXP2)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   8.758  17.000  19.000

3.7. Recoding the data for question 2 (CROPS)

Now I am going to repeat the process for the damage occurred to crops.

# Exploring the crop exponent data
unique(stormdt$CROPDMGEXP)

## [1]   M K m B ? 0 k 2
## Levels:  ? 0 2 B k K m M

class(stormdt$CROPDMGEXP2)

## [1] "NULL"

stormdt$CROPDMGEXP2 <- as.numeric(stormdt$CROPDMGEXP) 

# Assigning values for the crop exponent data 
stormdt$CROPDMGEXP2[stormdt$CROPDMGEXP2 == "M"] <- as.factor(1e+06)
stormdt$CROPDMGEXP2[stormdt$CROPDMGEXP2 == "K"] <- as.factor(1000)
stormdt$CROPDMGEXP2[stormdt$CROPDMGEXP2 == "m"] <- as.factor(1e+06)
stormdt$CROPDMGEXP2[stormdt$CROPDMGEXP2 == "B"] <- as.factor(1e+09)
stormdt$CROPDMGEXP2[stormdt$CROPDMGEXP2 == "0"] <- as.factor(1)
stormdt$CROPDMGEXP2[stormdt$CROPDMGEXP2 == "k"] <- as.factor(1000)
stormdt$CROPDMGEXP2[stormdt$CROPDMGEXP2 == "2"] <- as.factor(100)
stormdt$CROPDMGEXP2[stormdt$CROPDMGEXP2 == ""] <- as.factor(1)

stormdt$CROPDMGEXP2[stormdt$CROPDMGEXP2 == "?"] <- NA

Here I calculate the crop damage value

stormdt$CROPCASH <- stormdt$CROPDMG * stormdt$CROPDMGEXP2

4. Results

4.1 Question 1

Now, I would like to present a graphical visualization if the events that primarily affects human heath.

head(tapply(stormdt$INJURIES, stormdt$EVTYPE2, FUN=sum))

##    HIGH SURF ADVISORY         COASTAL FLOOD           FLASH FLOOD 
##                     0                     0                     0 
##             LIGHTNING             TSTM WIND       TSTM WIND (G45) 
##                     0                     0                     0

par(mfrow = c(1, 2))

barplot(head(sort(xtabs(INJURIES ~ EVTYPE2, stormdt), decreasing = TRUE), 12), las = 2, type = "h", main = "Injuried by event. U.s. 1951 - 2011")

barplot(head(sort(xtabs(FATALITIES ~ EVTYPE2, stormdt), decreasing = TRUE), 12), las = 2, type = "h", main = "Deaths by event. U.s. 1951 - 2011")

Tornados appear to be the most significative cause of death and injuries across the US.

4.2 Question 2

Across the United States, which types of events have the greatest economic consequences on properties and crops?

When plotted, we have a decreasing list of the most economically dangerous climate events for general properies and for crops.

par(mfrow = c(1,2))
barplot(head(sort(xtabs(PROPCASH ~ EVTYPE2, stormdt), decreasing = TRUE), 12), las = 2, main = "Damages to properties,  by event. U.s. 1951 - 2011")

barplot(head(sort(xtabs(CROPCASH ~ EVTYPE2, stormdt), decreasing = TRUE), 12), las = 2, main = "Damages to crops,  by event. U.s. 1951 - 2011")

PC2. Extreme weather eventes historical observations and damages. Data collecting issues and recoding perspectives.

Maurizio Murino

21 December 2015