Synopsis

In this report we explore the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. Our objective is to identify the types of weather events that are most harmful to population health and those with the greatest economic consequences. To assess the impact on population health, we look at fatalities and injuries. To assess economic impact, we look at property damage and crop damage. We use the full history on file (1950-2011). Aggregating the data across the US, we find that tornadoes caused the most fatalities and injuries, floods caused the most property damage, and drought caused the most crop damage.

Data Processing

First we download the data from the URL, then we unzip and read the file into R.

## Load the required libraries
library(R.utils)
library(dplyr)
if (!file.exists("./data/storm.csv")) {
  fileurl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
  if (!file.exists("./data")) dir.create("./data")

  ## mode = "wb" keeps the compressed file intact on Windows
  download.file(fileurl, destfile = "./data/storm.bz2", mode = "wb")
  bunzip2("./data/storm.bz2", "./data/storm.csv", overwrite = TRUE, remove = FALSE)
}

storm <- read.csv("./data/storm.csv")
dim(storm)
## [1] 902297     37
head(storm)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6
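
As a side note to the download-and-unzip step above, base R can usually read a bzip2-compressed CSV directly, because `read.csv` opens the path through a connection that detects the compression. The sketch below illustrates this on a tiny temporary file (the file and its two rows are made up for illustration); if the same holds for the storm file, the `bunzip2` step would be optional rather than required.

```r
# Write a tiny CSV through a bzip2 connection (toy data, not the storm file)
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "wt")
writeLines(c("EVTYPE,FATALITIES", "TORNADO,0", "FLOOD,2"), con)
close(con)

# read.csv detects the bzip2 compression and decompresses on the fly
toy <- read.csv(tmp, stringsAsFactors = FALSE)
print(toy)

unlink(tmp)  # clean up the temporary file
```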

Next we take a subset of the storm data, with just the variables relevant for our analysis. We need the columns: EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, and CROPDMGEXP.

sub_storm <- select(storm, EVTYPE, FATALITIES:CROPDMGEXP)
str(sub_storm)
## 'data.frame':    902297 obs. of  7 variables:
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...

Assessing impact on population health

To assess impact on population health, we looked at fatalities and injuries.

Here we aggregate the fatalities by weather-event type and pick the top 10 weather events causing fatalities.

fe <- aggregate(FATALITIES ~ EVTYPE, sub_storm, sum)
fe_srt <- arrange(fe, desc(FATALITIES))
top10fe <- fe_srt[1:10,] 

Next we aggregate the injuries by weather-event type and pick the top 10 weather events causing injuries.

ie <- aggregate(INJURIES ~ EVTYPE, sub_storm, sum)
ie_srt <- arrange(ie, desc(INJURIES))
top10ie <- ie_srt[1:10,]
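
The aggregate, sort, and take-the-top-rows pattern used above repeats for each measure. As an illustration only, it could be wrapped in a small helper; `top_n_by` and the toy data frame below are hypothetical names and values, not part of the original analysis, and the sketch uses base R rather than dplyr so that it runs on its own.

```r
# Hypothetical helper: total one numeric column per EVTYPE, return the n largest
top_n_by <- function(df, value_col, n = 10) {
  totals <- aggregate(df[[value_col]], by = list(EVTYPE = df$EVTYPE), FUN = sum)
  names(totals)[2] <- value_col
  totals <- totals[order(-totals[[value_col]]), ]
  head(totals, n)
}

# Toy data standing in for sub_storm
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
  FATALITIES = c(3, 1, 2, 4, 0),
  INJURIES   = c(10, 0, 5, 1, 2)
)

top_fatal <- top_n_by(toy, "FATALITIES", n = 2)
top_injur <- top_n_by(toy, "INJURIES", n = 2)
print(top_fatal)
print(top_injur)
```

On the real data, the equivalent call would be along the lines of `top_n_by(sub_storm, "FATALITIES")`.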

Assessing economic impact

To assess economic impact, we looked at property damage and crop damage.

In order to calculate property damage and crop damage totals, we first process the exponent columns: PROPDMGEXP and CROPDMGEXP by converting the factor values (k, m, b …) to their numeric equivalents (k = 1000, …). Then we multiply the damage amounts by the corresponding exponent to get the final value.

First we look at the exponents present in the data:

table(sub_storm$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5 
## 465934      1      8      5    216     25     13      4      4     28 
##      6      7      8      B      h      H      K      m      M 
##      4      5      1     40      1      6 424665      7  11330
table(sub_storm$CROPDMGEXP)
## 
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994

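The data contain invalid exponents (“-”, “?”, and “+”). The mapping function below assigns them a multiplier of 0, which zeroes out the damage amounts for those entries. This has little effect on the totals, since they make up a very small proportion of the data (21 entries out of 902,297 observations).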
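
The figure of 21 can be checked from the two tables above. The sketch below simply re-enters the printed counts for the invalid symbols by hand. Note that it counts invalid entries, so 21 is an upper bound on affected rows: a single row could carry an invalid symbol in both columns.

```r
# Counts copied from the table() output above
prop_invalid <- c("-" = 1, "?" = 8, "+" = 5)  # PROPDMGEXP
crop_invalid <- c("?" = 7)                    # CROPDMGEXP

n_invalid <- sum(prop_invalid) + sum(crop_invalid)
n_total   <- 902297

n_invalid                  # invalid exponent entries across both columns
100 * n_invalid / n_total  # the same figure as a percentage of all observations
```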

Below is a function to map the exponents to their numeric equivalents.

fexp <- function(x) {
  ## Map a single exponent symbol to its numeric multiplier
  x <- as.character(x)
  if (x %in% c("0", "")) 1
  else if (x == "1") 10
  else if (x %in% c("2", "h", "H")) 10^2
  else if (x %in% c("3", "k", "K")) 10^3
  else if (x == "4") 10^4
  else if (x == "5") 10^5
  else if (x %in% c("6", "m", "M")) 10^6
  else if (x == "7") 10^7
  else if (x == "8") 10^8
  else if (x %in% c("9", "b", "B")) 10^9
  else 0  ## zero out the values with invalid exponents
}
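
Because `fexp` handles one value at a time, it has to be applied with `sapply` over roughly 900,000 rows. A vectorized lookup is one possible alternative. The sketch below is an assumption about how that could look, not the method used in this report; `exp_keys`, `exp_vals`, and `exp_multiplier` are hypothetical names, and the mapping mirrors the one in `fexp`.

```r
# Exponent symbols and their multipliers, in matching order
exp_keys <- c("", "0", "1", "2", "h", "H", "3", "k", "K", "4", "5",
              "6", "m", "M", "7", "8", "9", "b", "B")
exp_vals <- c(1, 1, 1e1, 1e2, 1e2, 1e2, 1e3, 1e3, 1e3, 1e4, 1e5,
              1e6, 1e6, 1e6, 1e7, 1e8, 1e9, 1e9, 1e9)

# Hypothetical vectorized counterpart of fexp: looks up a whole column at once
exp_multiplier <- function(x) {
  m <- exp_vals[match(as.character(x), exp_keys)]
  m[is.na(m)] <- 0  # invalid symbols such as "-", "?", "+" get multiplier 0
  m
}

# Toy check on a handful of values
toy_exp <- factor(c("K", "m", "", "?", "B", "2"))
toy_dmg <- c(25, 2.5, 10, 5, 1, 3)
toy_amt <- toy_dmg * exp_multiplier(toy_exp)
print(toy_amt)
```

With this helper, the property damage amount could be computed as `PROPDMG * exp_multiplier(PROPDMGEXP)` without `sapply`.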

Here we compute and aggregate property damage by weather-event type and pick the top 10 weather events causing property damage.

## Apply the fexp function created above to compute the property damage amount (add new field PROPDMGAMT).
sub_storm <- mutate(sub_storm, PROPDMGAMT = PROPDMG * sapply(PROPDMGEXP,fexp ))

pe <- aggregate(PROPDMGAMT ~ EVTYPE, sub_storm, sum)
pe_srt <- arrange(pe, desc(PROPDMGAMT))
top10pe <- pe_srt[1:10,]

Next we compute and aggregate crop damage by weather-event type and pick the top 10 weather events causing crop damage.

## Apply the fexp function created above to compute the crop damage amount (add new field CROPDMGAMT).
sub_storm <- mutate(sub_storm, CROPDMGAMT = CROPDMG * sapply(CROPDMGEXP,fexp ))

ce <- aggregate(CROPDMGAMT ~ EVTYPE, sub_storm, sum)
ce_srt <- arrange(ce, desc(CROPDMGAMT))
top10ce <- ce_srt[1:10,] 

Results

  1. Population Health Impact

The plots below show the 10 most harmful weather events across the United States with respect to population health. Tornadoes rank highest, having caused the most fatalities and the most injuries.

par(mfrow = c(1,2), mar=c(8,5,4,2),cex = 0.7)
barplot(top10fe$FATALITIES,
        main = "Top 10 harmful weather events\ncausing Fatalities",
        names.arg = top10fe$EVTYPE, las = 2,
        ylab = "Fatalities", col = "blue")

barplot(top10ie$INJURIES,
        main = "Top 10 harmful weather events\ncausing Injuries",
        names.arg = top10ie$EVTYPE, las = 2,
        ylab = "Injuries", col = "blue")

  2. Economic Impact

The plots below show the 10 most harmful weather events across the United States with respect to economic impact. Floods have caused the most property damage, and drought has caused the most crop damage.

par(mfrow = c(1,2), mar=c(11,5,4,2),cex = 0.7)
barplot(top10pe$PROPDMGAMT / 10^9,
        main = "Top 10 harmful weather events\ncausing Property Damage",
        names.arg = top10pe$EVTYPE, las = 2,
        ylab = "Property Damage ($billions)", col = "blue")

barplot(top10ce$CROPDMGAMT / 10^9,
        main = "Top 10 harmful weather events\ncausing Crop Damage",
        names.arg = top10ce$EVTYPE, las = 2,
        ylab = "Crop Damage ($billions)", col = "blue")