Synopsis

In this report we aim to identify which types of weather events in the United States are most harmful to population health and which have the greatest economic consequences.

To investigate this, we obtained storm data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. We restricted the analysis to events recorded from 1990 through 2011, because records from the most recent years are the most complete; this period covers approximately 83% of the entire data set.

From these data, we found that tornadoes caused more injuries than any other type of event, while excessive heat caused the most fatalities, with tornadoes second. In terms of economic impact, we found that floods and hurricanes caused the most property damage.


Data Processing

Loading and Processing the Raw Data

From the NOAA storm database, we obtained storm data on major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

We kept only the observations from 1990 through 2011, the most complete part of the record (approximately 83% of the entire data set).

1. Download the data file (if it does not already exist)

We load the required libraries, declare the source URL and the local file name, and download the bz2 archive if the file does not already exist.

library(ggplot2)   # Used for the ggplot plotting
library(stringr)   # Used for str_count (substring counting)

sourceUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
sourceFile <- "repdata-data-StormData.csv.bz2"
if (!file.exists(sourceFile)) {
    download.file(sourceUrl, destfile=sourceFile)
}

2. Load the CSV into the variable data

We read the storm data from the CSV file inside the bz2 archive. Fields are delimited with the “,” character, missing values are coded as “NA”, and the first row is read as the header.

data <- read.csv(sourceFile, header=TRUE, na.strings = "NA", stringsAsFactors=FALSE)

3. Check and Explore the Entire Data Set

After reading in the storm data, we check its dimensions, column names, and first few rows. The data set has 902,297 rows and 37 columns.

dim(data)
## [1] 902297     37
names(data)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"
head(data, 2)
##   STATE__          BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1 4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1 4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                        14   100 3   0          0
## 2         NA         0                         2   150 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
eventIn50 <- sum(str_count(data$BGN_DATE, "195"))
eventIn60 <- sum(str_count(data$BGN_DATE, "196"))
eventIn70 <- sum(str_count(data$BGN_DATE, "197"))
eventIn80 <- sum(str_count(data$BGN_DATE, "198"))
eventIn90 <- sum(str_count(data$BGN_DATE, "199"))
eventIn2k <- sum(str_count(data$BGN_DATE, "200"))
eventIn2t <- sum(str_count(data$BGN_DATE, "201"))

We also counted the weather events by the decade in which they began. This confirms that the most recent years (1990-2011) hold most of the records, approximately 83% of the entire data set, most likely because of a lack of good records in the earlier years.

Number of weather events by decade:
* 50’s - 11191
* 60’s - 25065
* 70’s - 39110
* 80’s - 75191
* 90’s - 228577
* 2k’s - 412828
* 2010 and 2011 - 110335

Note that the event count for 2010 and 2011 alone (110335) is already more than a quarter of the total count for 2000 through 2009 (412828).
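The decade counts can also be derived from parsed dates instead of substring matching. The following is a minimal sketch on a toy vector standing in for data$BGN_DATE; the names bgn_date, decade_counts, and share_since_1990 are illustrative and not part of the original analysis.

```r
# Toy stand-in for data$BGN_DATE (hypothetical values in the same format).
bgn_date <- c("4/18/1950 0:00:00", "6/8/1965 0:00:00", "1/19/1995 0:00:00",
              "11/9/1999 0:00:00", "5/22/2004 0:00:00", "4/27/2011 0:00:00")

# Parse the date part; the trailing " 0:00:00" is ignored by the format.
parsed <- as.Date(bgn_date, format = "%m/%d/%Y")

# Extract the year and round it down to the start of its decade.
year   <- as.integer(format(parsed, "%Y"))
decade <- (year %/% 10) * 10

# Events per decade, and the share of events that began in 1990 or later.
decade_counts    <- table(decade)
share_since_1990 <- mean(year >= 1990)

print(decade_counts)
print(share_since_1990)
```

Applied to the counts listed above, the same ratio is 751,740 of 902,297 events, which is where the figure of approximately 83% comes from.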

4. Subset the data from 1990 through 2011 and transform it

Based on the information provided by the National Climatic Data Center and our findings in step 3 above, the attributes of interest are:
- BGN_DATE : Date the storm event began
- EVTYPE : Type of storm events
- FATALITIES : Number directly killed
- INJURIES : Number directly injured
- PROPDMG : Property damage in whole numbers and hundredths
- PROPDMGEXP : A multiplier where Hundred (H), Thousand (K), Million (M), Billion (B) etc
- CROPDMG : Crop damage in whole numbers and hundredths
- CROPDMGEXP : A multiplier where Hundred (H), Thousand (K), Million (M), Billion (B) etc

First we convert the BGN_DATE field to a proper “Date” format. Then we subset the most recent data (events that began on or after 1 January 1990), keeping only the columns needed for the analysis, and convert EVTYPE to a factor.

After that, we need to reformat the two values related to economic damage: property damage and crop damage. Each is coded as a dollar amount (PROPDMG, CROPDMG) together with an exponent code (PROPDMGEXP, CROPDMGEXP).

The exponent columns are encoded as letters:
* K : Thousand
* M : Million
* B : Billion
* a blank entry is replaced by “0”

So we convert every code to upper case, replace the letters with their numeric multipliers, multiply each amount by its multiplier, and store the results in new variables called PROPERTY and CROP. Because a blank code is replaced by “0”, amounts with no exponent code contribute nothing to the totals. Codes other than K, M, and B are not translated; non-numeric ones (for example H) become NA when as.numeric is applied, which is what the coercion warnings below report.

After subsetting and cleaning, the data set has 751,740 rows and 9 columns.

data$BGN_DATE <- as.Date(data$BGN_DATE, "%m/%d/%Y") 
recentData <- subset(data, BGN_DATE >= as.Date("1990-01-01"),
           select = c("EVTYPE", "INJURIES", "PROPDMG", "PROPDMGEXP", 
                      "CROPDMG", "CROPDMGEXP", "FATALITIES"
                      )
    ) 

recentData$EVTYPE <- factor(recentData$EVTYPE)                  ## -- Event Type --
recentData$PROPDMGEXP <- toupper(recentData$PROPDMGEXP)         ## -- Property Damage --
recentData$PROPDMGEXP[recentData$PROPDMGEXP == ""] <- "0"
recentData$PROPDMGEXP <- gsub("K",10^3,recentData$PROPDMGEXP)
recentData$PROPDMGEXP <- gsub("M",10^6,recentData$PROPDMGEXP)
recentData$PROPDMGEXP <- gsub("B",10^9,recentData$PROPDMGEXP)
recentData$PROPERTY <- recentData$PROPDMG * as.numeric(recentData$PROPDMGEXP)
## Warning: NAs introduced by coercion
recentData$CROPDMGEXP <- toupper(recentData$CROPDMGEXP)         ## -- Crop Damage --
recentData$CROPDMGEXP[recentData$CROPDMGEXP == ""] <- "0"
recentData$CROPDMGEXP <- gsub("K",10^3,recentData$CROPDMGEXP)
recentData$CROPDMGEXP <- gsub("M",10^6,recentData$CROPDMGEXP)
recentData$CROPDMGEXP <- gsub("B",10^9,recentData$CROPDMGEXP)
recentData$CROP <- recentData$CROPDMG * as.numeric(recentData$CROPDMGEXP)
## Warning: NAs introduced by coercion
dim(recentData)
## [1] 751740      9
head(recentData)
##         EVTYPE INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP FATALITIES
## 4408      HAIL        0     0.0          0       0          0          0
## 4409 TSTM WIND        0     0.0          0       0          0          0
## 4410 TSTM WIND        0     0.0          0       0          0          0
## 4411 TSTM WIND        0     0.0          0       0          0          0
## 4412   TORNADO       28     2.5      1e+06       0          0          0
## 4413 TSTM WIND        0     0.0          0       0          0          0
##      PROPERTY CROP
## 4408        0    0
## 4409        0    0
## 4410        0    0
## 4411        0    0
## 4412  2500000    0
## 4413        0    0
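An alternative to the chain of gsub calls is a named lookup vector, which makes the handling of each exponent code explicit. This is only a sketch on toy data: exp_multiplier and damage_in_dollars are hypothetical names, and treating a blank code as a multiplier of 1 is an assumption that differs from the “0” substitution used in this report.

```r
# Hypothetical lookup table: exponent code -> multiplier.
# H (hundred) follows the attribute description earlier in this report.
exp_multiplier <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Convert an amount and its exponent code to dollars.
# Assumption: a blank code means the amount is already in dollars (x1).
# Any code not in the table yields NA, so it can be inspected explicitly.
damage_in_dollars <- function(amount, exp_code) {
  code <- toupper(trimws(exp_code))
  multiplier <- unname(exp_multiplier[code])  # NA where the code is unknown
  multiplier[code == ""] <- 1
  amount * multiplier
}

# Toy rows in the shape of the PROPDMG / PROPDMGEXP columns.
toy <- data.frame(PROPDMG    = c(25, 2.5, 3, 10, 7),
                  PROPDMGEXP = c("K", "m", "B", "", "?"),
                  stringsAsFactors = FALSE)
toy$PROPERTY <- damage_in_dollars(toy$PROPDMG, toy$PROPDMGEXP)
print(toy)
```

With this approach the only NA values are the ones produced by codes outside the table, so they can be counted with sum(is.na(toy$PROPERTY)) before deciding how to treat them.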

Results

1. Weather events most harmful to population health

First, we find out which types of events are most harmful to population health across the United States.

To answer this question we considered the variables FATALITIES and INJURIES. We computed the total number (sum) of fatalities and of injuries caused by each of the 985 types of weather events, and plotted the top 20 of each in descending order.

# Total injuries per event type, sorted in descending order; keep the top 20
injurySum <- tapply(recentData$INJURIES, recentData$EVTYPE, sum, na.rm = TRUE)
injuryDF <- as.data.frame.table(injurySum, row.names = NULL)
injuryDF <- injuryDF[order(injuryDF$Freq, decreasing = TRUE), ]
injuryDF <- injuryDF[1:20, ]
colnames(injuryDF) <- c("Event", "Injuries")
rownames(injuryDF) <- NULL
# Same steps for fatalities
fatalitySum <- tapply(recentData$FATALITIES, recentData$EVTYPE, sum, na.rm = TRUE)
fatalityDF <- as.data.frame.table(fatalitySum, row.names = NULL)
fatalityDF <- fatalityDF[order(fatalityDF$Freq, decreasing = TRUE), ]
fatalityDF <- fatalityDF[1:20, ]
colnames(fatalityDF) <- c("Event", "Fatalities")
rownames(fatalityDF) <- NULL

# geom_bar(stat = "identity") draws one bar per precomputed total;
# reorder() sorts the bars from largest to smallest
ggplot(injuryDF, aes(x = reorder(Event, -Injuries), y = Injuries)) +
    geom_bar(stat = "identity", fill = "green") +
    xlab(NULL) +
    ylab("Number of Injuries") +
    ggtitle("Top 20 Weather Events that caused most Injuries\n in US between 1990 and 2011") +
    theme(plot.title = element_text(size = 20)) +
    theme(axis.text.x = element_text(angle = 90, vjust = 1))

[Figure: top 20 weather events by number of injuries (chunk resultOne)]

ggplot(fatalityDF, aes(x = reorder(Event, -Fatalities), y = Fatalities)) +
    geom_bar(stat = "identity", fill = "darkred") +
    xlab(NULL) +
    ylab("Number of Fatalities") +
    ggtitle("Top 20 Weather Events that caused most Fatalities\n in US between 1990 and 2011") +
    theme(plot.title = element_text(size = 20)) +
    theme(axis.text.x = element_text(angle = 90, vjust = 1))

[Figure: top 20 weather events by number of fatalities (chunk resultOne)]

The first plot shows the top 20 weather events that caused the most injuries. The leading cause of weather-related injuries is Tornado, followed by Flood and then Excessive Heat.

The second plot shows the top 20 weather events that caused the most fatalities, and it reveals something interesting. Even though Tornado is the number one cause of weather-related injuries, it is not the top cause of weather-related fatalities. That is Excessive Heat, followed by Tornado, with Flash Flood third.

Overall, the same three broad categories lead both rankings, provided Flood and Flash Flood are counted together:
- Excessive Heat
- Tornado
- Flood/Flash Flood

TOP 5 Weather Events that caused most Injuries

head(injuryDF, 5)
##            Event Injuries
## 1        TORNADO    26674
## 2          FLOOD     6789
## 3 EXCESSIVE HEAT     6525
## 4      LIGHTNING     5230
## 5      TSTM WIND     5022

TOP 5 Weather Events that caused most Fatalities

head(fatalityDF, 5)
##            Event Fatalities
## 1 EXCESSIVE HEAT       1903
## 2        TORNADO       1752
## 3    FLASH FLOOD        978
## 4           HEAT        937
## 5      LIGHTNING        816
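The two rankings above are built with separate tapply calls. As a rough sketch of a more compact route, aggregate can total both columns in one pass. The toy data frame and the names casualties and top_injuries are illustrative only and do not come from the storm data.

```r
# Toy rows in the shape of recentData (hypothetical values).
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "FLOOD", "FLASH FLOOD", "EXCESSIVE HEAT"),
  FATALITIES = c(1, 0, 2, 3, 5),
  INJURIES   = c(28, 4, 6, 1, 10),
  stringsAsFactors = FALSE
)

# Sum fatalities and injuries per event type in a single call.
casualties <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE,
                        data = toy, FUN = sum)

# Rank by injuries, largest first, and keep the top three.
casualties   <- casualties[order(casualties$INJURIES, decreasing = TRUE), ]
top_injuries <- head(casualties, 3)
print(top_injuries)
```

Sorting the same data frame by FATALITIES instead gives the second ranking without recomputing the totals.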

2. Weather events with the greatest economic consequences

Here, we find out which types of events are most detrimental to the economy across the United States (i.e. caused the largest property or crop damage).

To answer this question we considered the variables PROPERTY and CROP. We computed the total (sum) of property damage and of crop damage caused by each of the 985 types of weather events, and plotted the top 10 of each in descending order. Note that sum is called without na.rm = TRUE here, so an event type with any NA amount (see the coercion warnings in the data processing step) gets an NA total and is sorted to the end, outside the top 10.

propertySum <- tapply(recentData$PROPERTY, recentData$EVTYPE, sum)
propertyDF <- as.data.frame.table(propertySum)
propertyDF <- propertyDF[ order(-propertyDF[,2]), ]
propertyDF <- propertyDF[1:10, ] 
colnames(propertyDF) <- c("Event", "Damages")
rownames(propertyDF) <- NULL
cropSum <- tapply(recentData$CROP, recentData$EVTYPE, sum)
cropDF <- as.data.frame.table(cropSum)
cropDF <- cropDF[ order(-cropDF[,2]), ]
cropDF <- cropDF[1:10, ] 
colnames(cropDF) <- c("Event", "Damages")
rownames(cropDF) <- NULL
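# Two side-by-side panels with a large bottom margin for the rotated labels
par(mfrow = c(1, 2), mai = c(2, 1, 0.5, 0.5))
# Left panel: property damage in billions of dollars
plot1 <- barplot(propertyDF$Damages / 10^9, col = "purple", axes = FALSE, axisnames = FALSE)
# Draw the event names at 45 degrees below each bar
text(plot1, par("usr")[3], labels = propertyDF$Event, srt = 45, adj = c(1.1, 1.1),
     xpd = TRUE, cex = 0.8)
axis(2)
title(main = "Property Damage", ylab = "Damages in Dollars (Billions)")
# Right panel: crop damage in billions of dollars
plot2 <- barplot(cropDF$Damages / 10^9, col = "brown", axes = FALSE, axisnames = FALSE)
text(plot2, par("usr")[3], labels = cropDF$Event, srt = 45, adj = c(1.1, 1.1),
     xpd = TRUE, cex = 0.8)
axis(2)
title(main = "Crop Damage", ylab = "Damages in Dollars (Billions)")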

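[Figure: top 10 weather events by property damage (left) and crop damage (right), in billions of dollars (chunk resultTwo)]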

The plot on the left shows the top 10 weather events by property damage. Flood and Hurricane/Typhoon caused the most property damage (approximately 144.7 and 69.3 billion US dollars, respectively), and both also caused substantial crop damage.

The plot on the right shows the top 10 weather events by crop damage. Drought caused the most crop damage, with a total of about 14.0 billion US dollars, followed by Flood with approximately 5.7 billion US dollars.

It is interesting that although Drought caused the most crop damage, it does not appear among the top 10 event types for property damage.

TOP 10 Weather Events that caused most Property Damage

propertyDF$Damages <- propertyDF$Damages/10^9
colnames(propertyDF) <- c("Event", "Damages in Dollars (Billions)")
head(propertyDF, 10)
##                Event Damages in Dollars (Billions)
## 1              FLOOD                       144.658
## 2  HURRICANE/TYPHOON                        69.306
## 3        STORM SURGE                        43.324
## 4          HURRICANE                        11.868
## 5     TROPICAL STORM                         7.704
## 6       WINTER STORM                         6.688
## 7        RIVER FLOOD                         5.119
## 8           WILDFIRE                         4.765
## 9   STORM SURGE/TIDE                         4.641
## 10         TSTM WIND                         4.485

TOP 10 Weather Events that caused most Crop Damage

cropDF$Damages <- cropDF$Damages/10^9
colnames(cropDF) <- c("Event", "Damages in Dollars (Billions)")
head(cropDF, 10)
##                Event Damages in Dollars (Billions)
## 1            DROUGHT                        13.973
## 2              FLOOD                         5.662
## 3        RIVER FLOOD                         5.029
## 4          ICE STORM                         5.022
## 5               HAIL                         3.026
## 6          HURRICANE                         2.742
## 7  HURRICANE/TYPHOON                         2.608
## 8        FLASH FLOOD                         1.421
## 9       EXTREME COLD                         1.293
## 10      FROST/FREEZE                         1.094
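The property and crop totals are computed with sum and no na.rm argument, so a single NA amount turns the whole total for that event type into NA. The sketch below illustrates the effect on toy data; the values and the names strict, lenient, rank_strict, and rank_lenient are made up for illustration and say nothing about which event types are affected in the real data.

```r
# Toy damage amounts; the NA mimics a row whose exponent code
# could not be converted to a number (hypothetical values).
toy <- data.frame(
  EVTYPE   = c("FLOOD", "FLOOD", "TORNADO", "TORNADO", "HAIL"),
  PROPERTY = c(1e9, 5e8, 2e9, NA, 3e6),
  stringsAsFactors = FALSE
)

# Without na.rm, any group containing an NA sums to NA.
strict  <- tapply(toy$PROPERTY, toy$EVTYPE, sum)
# With na.rm = TRUE, the NA rows are skipped and the rest are summed.
lenient <- tapply(toy$PROPERTY, toy$EVTYPE, sum, na.rm = TRUE)

# order() places NA totals last, so they fall out of a top-N slice.
rank_strict  <- names(strict)[order(-strict)]
rank_lenient <- names(lenient)[order(-lenient)]

print(rank_strict)
print(rank_lenient)
```

In the toy data the group with the largest known damage ranks last under the strict sum and first under the lenient one, which is why the choice of na.rm matters when reading a top 10 table.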

Summary

Floods and tornadoes are common weather events across the United States. In the results above, tornadoes rank among the most harmful events for population health, and floods cause the greatest property damage.

Tornadoes caused by far the most injuries (26,674) but fewer fatalities (1,752) than excessive heat (1,903). Floods, on the other hand, appear to cause fewer injuries than tornadoes, but the proportion of fatalities is very high, up to 15%. Lastly, although hurricanes are not common, they have a huge economic impact.

In conclusion, with regard to human lives, it would be prudent to emphasize better techniques for identification and prediction, alerting the public, and addressing safety procedures for tornadoes, winds, and heat. Monetarily, it is wise to plan for large damages from tornadoes, floods, and droughts, as these are all quite common and have a huge impact. Hurricanes are also very costly because of widespread property damage, second only to floods in the totals above, but they are rarer and generally known ahead of time, which may allow us to be prepared.