NOAA Storm Data Analysis

Overview

The purpose of this report is to study the NOAA Storm Database and analyse the effects of severe weather events on (1) population health and (2) economic cost. The data used in this report covers the time period from 1950 to 2011 which was made available by the US National Oceanic and Atmospheric Admistration (NOAA).
The analysis indicates that storm types which have the largest effect on
- population health are tornadoes, thunderstorm wind, excessive heat, floods and lightning. Tornadoes cause the greatest number of casualties by far.
- economic cost are floods, hurricane typhoons, tornadoes and storm surge, which is largely due to property damage. Crop damage costs are highest for drought.

Sources of Information The data for this report was provided via the Coursera “Reproducible Research” course website includes
- Raw Data
- National Weather Service
- National Climactic Data Center Storm Events
In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
No codebook was provided for the data; search results on the NOAA website yielded the following page as a best guess for interpreting the header names for the data columns - however the information relates to the most recent version of the database, rather than an earlier version which this dataset was downloaded from.

Environment The analysis was undertaken in R using RStudio. The following packages need to be open to run the analysis.

library(R.utils)
library(ggplot2)

Data Processing

Raw Data The raw data is in the form of a csv file compressed using bzip2.

bunzip2("StormData.csv.bz2", "StormData.csv", remove = FALSE, skip = TRUE)
## [1] "StormData.csv"
## attr(,"temporary")
## [1] FALSE
rawdata <- read.csv("StormData.csv")

Data Structure The data comprises 37 columns of 902297 rows of data. Key columns of interest for this report are “EVTYPE”, “FATALITIES”, “INJURIES”, “PROPDMG”, “PROPDMGEXP”, “CROPDMG”, “CROPDMGEXP”.

str(rawdata)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436774 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

Exploratory Analysis and Cleaning the Data

A second dataframe called tidydata has been created for the cleaned up data set to maintain the integrity of the original dataset. Only the columns of interest will be tidied for the purposes of this report.

tidydata <- rawdata

Clean the EVTYPE data The EVTYPE data has been cleaned up for the following (1) converting all data to uppercase (2) removing whitespaces and non printable characters.

head(names(table(tidydata$EVTYPE)),20)
##  [1] "   HIGH SURF ADVISORY"  " COASTAL FLOOD"        
##  [3] " FLASH FLOOD"           " LIGHTNING"            
##  [5] " TSTM WIND"             " TSTM WIND (G45)"      
##  [7] " WATERSPOUT"            " WIND"                 
##  [9] "?"                      "ABNORMAL WARMTH"       
## [11] "ABNORMALLY DRY"         "ABNORMALLY WET"        
## [13] "ACCUMULATED SNOWFALL"   "AGRICULTURAL FREEZE"   
## [15] "APACHE COUNTY"          "ASTRONOMICAL HIGH TIDE"
## [17] "ASTRONOMICAL LOW TIDE"  "AVALANCE"              
## [19] "AVALANCHE"              "BEACH EROSIN"
tidydata$EVTYPE <- gsub("^ *", "",tidydata$EVTYPE)
tidydata$EVTYPE <- gsub("[^[:alnum:][:blank:]]","",tidydata$EVTYPE)
tidydata$EVTYPE <- gsub("/", " ",tidydata$EVTYPE)
tidydata$EVTYPE <- toupper(tidydata$EVTYPE)

There are instances of duplicates, spelling errors or abbreviations. The top instances are cleaned up given significance and relevance to the final result.

result <- as.data.frame(table(tidydata$EVTYPE))
result <- result[order(result$Freq,decreasing = TRUE),]
head(result,20)
tidydata$EVTYPE <- gsub("TSTM", "THUNDERSTORM",tidydata$EVTYPE)
tidydata$EVTYPE <- gsub("WINDS", "WIND",tidydata$EVTYPE)
tidydata$EVTYPE <- gsub("FLD", "FLOOD",tidydata$EVTYPE)
tidydata$EVTYPE <- gsub("URBANSML", "URBAN SMALL",tidydata$EVTYPE)
tidydata$EVTYPE <- gsub("  *", " ",tidydata$EVTYPE)

Clean the INJURIES AND FATALITIES data There are no missing data or unusual outliers in the INJURIES or FATALITIES data.

table(is.na(tidydata$INJURIES))
## 
##  FALSE 
## 902297
head(tidydata$INJURIES[order(tidydata$INJURIES, decreasing = TRUE)],10)
##  [1] 1700 1568 1228 1150 1150  800  800  785  780  750
table(is.na(tidydata$FATALITIES))
## 
##  FALSE 
## 902297
head(tidydata$FATALITIES[order(tidydata$FATALITIES, decreasing = TRUE)],10)
##  [1] 583 158 116 114  99  90  75  74  67  57

Clean the PROPDMG and PROPDMGEXP data Total property damage is provided by two columns, PROPDMG which is numeric and PROPDMGEXP which is the multiplier for the amounts (h for 10^2, k for 10^3, m for 10^6 and b for 10^9), otherwise it is multiplied by 1.

str(tidydata$PROPDMG)
##  num [1:902297] 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
names(table(tidydata$PROPDMGEXP))
##  [1] ""  "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
## [18] "m" "M"

These values are converted by the multiplier and stored in a new column PROPDMGAMT.

tidydata$PROPDMGEXP <- toupper(tidydata$PROPDMGEXP)
tidydata$PROPDMGAMT <- tidydata$PROPDMG * 
    ifelse(tidydata$PROPDMGEXP == "B",1E9,
    ifelse(tidydata$PROPDMGEXP == "M",1E6,
    ifelse(tidydata$PROPDMGEXP == "K",1E3,
    ifelse(tidydata$PROPDMGEXP == "H",1E2,1E0))))

Clean the CROPDMG and CROPDMGEXP data Total property damage is provided by two columns, PROPDMG which is numeric and PROPDMGEXP which is the multiplier for the amounts (h for 10^2, k for 10^3, m for 10^6 and b for 10^9), otherwise it is multiplied by 1.

str(tidydata$CROPDMG)
##  num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
names(table(tidydata$CROPDMGEXP))
## [1] ""  "?" "0" "2" "B" "k" "K" "m" "M"

These values are converted by the multiplier and stored in a new column CROPDMGAMT.

tidydata$CROPDMGEXP <- toupper(tidydata$CROPDMGEXP)
tidydata$CROPDMGAMT <- tidydata$CROPDMG * 
    ifelse(tidydata$CROPDMGEXP == "B",1E9,
    ifelse(tidydata$CROPDMGEXP == "M",1E6,
    ifelse(tidydata$CROPDMGEXP == "K",1E3,
    ifelse(tidydata$CROPDMGEXP == "H",1E2,1E0))))

Results

Health Costs The storm types which have the largest effect on population health are tornadoes, thunderstorm wind, excessive heat, floods and lightning. Tornadoes cause the greatest number of casualties by far.

## Calculate rank of total injuries and fatalities by event
resRank <- setNames(aggregate(INJURIES+FATALITIES~EVTYPE,tidydata,sum),c("EVTYPE","TOTAL"))
resRank$RANK <- rank(-resRank$TOTAL)

## Calculate totals by casualty type and add overall rank
resInjuries <- cbind(aggregate(INJURIES~EVTYPE,tidydata,sum),"Injuries")
resFatalities <- cbind(aggregate(FATALITIES~EVTYPE,tidydata,sum),"Fatalities")
resInjuries <- cbind(resInjuries,resRank$RANK)
resFatalities <- cbind(resFatalities,resRank$RANK)
names(resInjuries) <- c("EVENT","COUNT","TYPE","RANK")
names(resFatalities) <- c("EVENT","COUNT","TYPE","RANK")
resHealth <- rbind(resInjuries,resFatalities)
resHealth <- resHealth[order(resHealth$RANK),]
resHealthTop <- resHealth[1:30,]

## bar plot
g1 <- ggplot(resHealthTop,aes(EVENT,COUNT),fill = TYPE)
g1 + geom_bar(aes(fill = TYPE),stat="identity",width = 0.9)+theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))+labs(x="",y = "Count of Casualties", title = "TORNADOES HAVE HIGHEST NUMBER OF CASUALTIES BY FAR")+theme(legend.position = c(0.15,0.85))

Economic Cost The storm types which have the largest effect on economic cost in aggregate are floods, hurricane typhoons, tornadoes and storm surge, which is largely due to damage to property. Crop damage costs are highest for drought.

## Calculate rank of total property and crop damage by event
resRank <- setNames(aggregate(PROPDMGAMT+CROPDMGAMT~EVTYPE,tidydata,sum),c("EVTYPE","TOTAL"))
resRank$RANK <- rank(-resRank$TOTAL)

## Calculate totals by damage type and add overall rank
resProp <- cbind(aggregate(PROPDMGAMT~EVTYPE,tidydata,sum),"Property")
resProp <- cbind(resProp,resRank$RANK)
resCrop <- cbind(aggregate(CROPDMGAMT~EVTYPE,tidydata,sum),"Crop")
resCrop <- cbind(resCrop,resRank$RANK)
names(resProp) <- c("EVENT","TOTAL","TYPE","RANK")
names(resCrop) <- c("EVENT","TOTAL","TYPE","RANK")
resEconomic <- rbind(resProp,resCrop)
resEconomic <- resEconomic[order(resEconomic$RANK),]
resEconomicTop <- resEconomic[1:30,]

## bar plot
g2 <- ggplot(resEconomicTop,aes(EVENT,TOTAL/10^9),fill = TYPE)
g2 + geom_bar(aes(fill = TYPE),stat="identity",width = 0.9)+theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))+labs(x="",y = "Damage in USD (billions)", title = "FLOODS CAUSE THE GREATEST ECONOMIC DAMAGE")+theme(legend.position = c(0.85,0.85))