NOAA storm data analysis by storm event type

Synopsis

This analysis involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including the estimates of any fatalities, injuries, Crop and property damages. The database has data from year 1950 and end in November 2011. The Data analysis must report the storm event most harmful (both fatality and injuries) to the population along with reporing the most economic loss (both property and crop) by storm event type. The analysis points to Tornados are most harmful to humans (nearly 97,000) and Flooding is the most damaging ($150 Billions) weather event.

Data Processing

The data required to do this analysis can be downloaded from the national weather service website. The data is compressed by .bz2 and the features are comma separated CSV format. After unloading the data rom above URL, we can unzip it using R utility function ‘bunzip2’. Following is the R code to achieve this.

options(scipen=999)
library(ggplot2)
library(data.table)
library(dplyr)
library(knitr)
library(R.utils)

#  read data
 
url <- "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
destfile <- "./repdata-data-StormData.csv.bz2"
 
if(!file.exists("./repdata-data-StormData.csv.bz2")){
  file.create("./repdata-data-StormData.csv.bz2")
  download.file(url, destfile)
}
if(!file.exists("./repdata-data-StormData.csv")){
  bunzip2("repdata-data-StormData.csv.bz2","repdata-data-StormData.csv",remove=FALSE,skip=TRUE)
}

As the file contain data for number of years, the file may have thousands of rows and tens of columns/features. Also, in order to do our analysis, we don’t need to load all the columns. First, read some sample data and determine the necessary columns for further analysis.

# read sample data...sampling. 
noaa.smpl<-read.table(file="repdata-data-StormData.csv",header=TRUE,sep=",",quote = "\"",nrow=4000)
str(noaa.smpl)
head(noaa.smpl[,c('STATE','EVTYPE','FATALITIES','INJURIES','PROPDMG','CROPDMG','PROPDMGEXP','CROPDMGEXP')])
group_by(noaa.smpl,EVTYPE) %>% summarize(fatality=sum(FATALITIES),inj=sum(INJURIES))

## 'data.frame':    4000 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 1203 levels "1/10/1972 0:00:00",..: 416 416 199 849 76 76 79 16 182 182 ...
##  $ BGN_TIME  : int  130 145 1600 900 1500 2000 100 900 2000 2000 ...
##  $ TIME_ZONE : Factor w/ 2 levels "CST","EST": 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 68 levels "","AUTAUGA","BALDWIN",..: 50 3 30 46 23 40 6 63 64 30 ...
##  $ STATE     : Factor w/ 1 level "AL": 1 1 1 1 1 1 1 1 1 1 ...
##  $ EVTYPE    : Factor w/ 3 levels "HAIL","TORNADO",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : logi  NA NA NA NA NA NA ...
##  $ BGN_LOCATI: logi  NA NA NA NA NA NA ...
##  $ END_DATE  : logi  NA NA NA NA NA NA ...
##  $ END_TIME  : logi  NA NA NA NA NA NA ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : logi  NA NA NA NA NA NA ...
##  $ END_LOCATI: logi  NA NA NA NA NA NA ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 3 levels "","K","M": 2 2 2 2 2 2 2 2 2 2 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: logi  NA NA NA NA NA NA ...
##  $ WFO       : Factor w/ 21 levels "","AT","ATA",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: logi  NA NA NA NA NA NA ...
##  $ ZONENAMES : logi  NA NA NA NA NA NA ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : logi  NA NA NA NA NA NA ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
##   STATE  EVTYPE FATALITIES INJURIES PROPDMG CROPDMG PROPDMGEXP CROPDMGEXP
## 1    AL TORNADO          0       15    25.0       0          K         NA
## 2    AL TORNADO          0        0     2.5       0          K         NA
## 3    AL TORNADO          0        2    25.0       0          K         NA
## 4    AL TORNADO          0        2     2.5       0          K         NA
## 5    AL TORNADO          0        2     2.5       0          K         NA
## 6    AL TORNADO          0        6     2.5       0          K         NA
## # A tibble: 3 × 3
##      EVTYPE fatality   inj
##      <fctr>    <dbl> <dbl>
## 1      HAIL        0     0
## 2   TORNADO      225  3537
## 3 TSTM WIND       21    67

We need only the following columns to do answer our qestions. Skip the remaining columns using ‘NULL’ value in the read.table function.

‘EVTYPE’,‘FATALITIES’,‘INJURIES’,‘PROPDMG’,‘CROPDMG’,‘PROPDMGEXP’,‘CROPDMGEXP’

cc<-c(rep('NULL',6),NA,NA,rep('NULL',14),rep(NA,6),rep('NULL',9))
noaa<-read.table(file="repdata-data-StormData.csv",header=TRUE,sep=",",quote = "\"",colClasses=cc)
str(noaa)

## 'data.frame':    902297 obs. of  8 variables:
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...

Across the United States, the following storm event types are most harmful with respect to population health.

harmful.events<-noaa %>% group_by(EVTYPE) %>% 
                summarize(harmful = sum(FATALITIES)+sum(INJURIES)) %>% 
                arrange(desc(harmful))
harmful.events

## # A tibble: 985 × 2
##               EVTYPE harmful
##               <fctr>   <dbl>
## 1            TORNADO   96979
## 2     EXCESSIVE HEAT    8428
## 3          TSTM WIND    7461
## 4              FLOOD    7259
## 5          LIGHTNING    6046
## 6               HEAT    3037
## 7        FLASH FLOOD    2755
## 8          ICE STORM    2064
## 9  THUNDERSTORM WIND    1621
## 10      WINTER STORM    1527
## # ... with 975 more rows

In order to find the storm event types have caused most economic loss, we need to analyze PROPDMG, PROPDMGEXP (for property) and CROPDMG,CROPDMGEXP (for crops). We first need to convert the codes in PROPDMGEXP, CROPDMGEXP columns into multiplyers using the following R code. Also, make sure there are no ‘NA’ values in the PROPDMG and CROPDMG features.

table(noaa$PROPDMGEXP)

## 
##             -      ?      +      0      1      2      3      4      5 
## 465934      1      8      5    216     25     13      4      4     28 
##      6      7      8      B      h      H      K      m      M 
##      4      5      1     40      1      6 424665      7  11330

table(noaa$CROPDMGEXP)

## 
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994

sum(is.na(noaa$PROPDMG))

## [1] 0

sum(is.na(noaa$CROPDMG))

## [1] 0

mult<-function(exp) {
  if (exp=='h' | exp=='H') return(100)
  if (exp=='k' | exp=='K') return(1000)
  if (exp=='m' | exp=='M') return(1000000)
  if (exp=='b' | exp=='B') return(1000000000)
  if (exp=='1') return(10)
  if (exp=='2') return(100)
  if (exp=='3') return(1000)
  if (exp=='4') return(10000)
  if (exp=='5') return(100000)
  if (exp=='6') return(1000000)
  if (exp=='7') return(10000000)
  if (exp=='8') return(100000000)
  return(0)
}
options(digits=11)

Create two new columns to store the multiplyer in the data table which can be used to calulate the total ecomonic loss by storm even type in Billion dollars.

#multiplier columns
noaa$propmult <- as.numeric(lapply(noaa$PROPDMGEXP,mult))
noaa$cropmult <- as.numeric(lapply(noaa$CROPDMGEXP,mult))

noaa.ecomomics <- noaa %>% mutate(proploss=propmult * PROPDMG,croploss= cropmult * CROPDMG) %>%
                       group_by(EVTYPE) %>% summarize(ttl=(sum(proploss)+sum(croploss))/1000000000) %>%
                       arrange(desc(ttl))
noaa.ecomomics

## # A tibble: 985 × 2
##               EVTYPE          ttl
##               <fctr>        <dbl>
## 1              FLOOD 150.31967825
## 2  HURRICANE/TYPHOON  71.91371280
## 3            TORNADO  57.36233359
## 4        STORM SURGE  43.32354100
## 5               HAIL  18.76122167
## 6        FLASH FLOOD  18.24399061
## 7            DROUGHT  15.01867200
## 8          HURRICANE  14.61022901
## 9        RIVER FLOOD  10.14840450
## 10         ICE STORM   8.96704131
## # ... with 975 more rows

Results

The first histogram will show the storm events most harmful to the populations.The histograms will show first 15 most harmful strom events.

# Q1
g<-ggplot(harmful.events[1:15,],aes(x=reorder(EVTYPE,harmful),y=harmful,fill=EVTYPE,color=EVTYPE))
g<-g+geom_histogram(stat="identity") 
g<-g+xlab("Storm events")+ylab("Number of humans harmed")+ggtitle("Total Fatalities & Injuries by event type")
g<-g+theme(axis.text.x = element_text(angle = 90, hjust = 1))
g<-g+geom_text(aes(label=round(harmful)),position = position_dodge(.5),vjust=0,color="black")
g

The second histogram will show the strom events which causes most economic loss.The histograms will show first 15 most loss causing events.

#Q2
g<-ggplot(noaa.ecomomics[1:15,],aes(x=reorder(EVTYPE,ttl),y=ttl,fill=EVTYPE,color=EVTYPE))
g<-g+geom_histogram(stat="identity") 
g<-g+xlab("Storm events")+ylab("Economic loss in Billions")+ggtitle("Property and crop loss (in Billions) by event type")
g<-g+theme(axis.text.x = element_text(angle = 90, hjust = 1))
g<-g+geom_text(aes(label=round(ttl,3)),position = position_dodge(.5),vjust=0,color="black")
g

Remove all variables

## remove datasets 
rm(noaa.ecomomics,noaa,noaa.smpl,harmful.events)

End of the document. (To create yourfilename.md file, use knit(“yourfilename.Rmd”))