Introduction

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Data Processing

Loading the required libraries

library(ggplot2)
library(plyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Data

The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:

There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.

* National Weather Service Storm Data Documentation
* National Climatic Data Center Storm Events FAQ

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

Exploratory Analysis

storm.data<-read.csv(bzfile('./data/repdata_data_StormData.csv.bz2'),header=TRUE)
dim(storm.data)
## [1] 902297     37
str(storm.data)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

Extracting variables of interest

From the variables in storm.data, these are the columns of interest:

Health Variables:
* FATALITIES
* INJURIES

Economic Variables:
* PROPDMG
* PROPDMGEXP
* CROPDMG
* CROPDMGEXP

Events - target variable:
* EVTYPE

vars<-c('FATALITIES','INJURIES','PROPDMG','PROPDMGEXP','CROPDMG','CROPDMGEXP','EVTYPE')
dat<-storm.data[,vars]

Check the last few rows, since the earliest records might have missing values

tail(dat)
##        FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP         EVTYPE
## 902292          0        0       0          K       0          K WINTER WEATHER
## 902293          0        0       0          K       0          K      HIGH WIND
## 902294          0        0       0          K       0          K      HIGH WIND
## 902295          0        0       0          K       0          K      HIGH WIND
## 902296          0        0       0          K       0          K       BLIZZARD
## 902297          0        0       0          K       0          K     HEAVY SNOW

Checking for Missing Values

sum(is.na(dat$FATALITIES))
## [1] 0
sum(is.na(dat$INJURIES))
## [1] 0
sum(is.na(dat$PROPDMG))
## [1] 0
sum(is.na(dat$PROPDMGEXP))
## [1] 0
sum(is.na(dat$CROPDMG))
## [1] 0
sum(is.na(dat$CROPDMGEXP))
## [1] 0
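
The per-column checks above can also be collapsed into a single call. A minimal sketch, using a small made-up data frame (`toy` is hypothetical and stands in for `dat`):

```r
# Hypothetical stand-in for `dat`; the values are made up for illustration.
toy <- data.frame(
  FATALITIES = c(0, 1, NA),
  INJURIES   = c(15, 0, 2),
  EVTYPE     = c("TORNADO", NA, "HAIL"),
  stringsAsFactors = FALSE
)

# is.na() on a data frame returns a logical matrix;
# colSums() then counts the TRUE entries in each column.
na.counts <- colSums(is.na(toy))
print(na.counts)
```

Applied to the real data, `colSums(is.na(dat))` would report all selected columns at once, which avoids repeating or skipping a column by accident.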

Transforming Extracted Variables

sort(table(dat$EVTYPE),decreasing=TRUE)[1:10]
## 
##               HAIL          TSTM WIND  THUNDERSTORM WIND            TORNADO 
##             288661             219940              82563              60652 
##        FLASH FLOOD              FLOOD THUNDERSTORM WINDS          HIGH WIND 
##              54277              25326              20843              20212 
##          LIGHTNING         HEAVY SNOW 
##              15754              15708

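We group similar event types into broader categories by keyword. For example, TSTM WIND and HIGH WIND fall under WIND. Because the assignments below run in sequence, a later match overwrites an earlier one, so THUNDERSTORM WIND and THUNDERSTORM WINDS (which also contain STORM) end up under STORM rather than WIND. Event types that match no keyword stay in OTHER.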

dat$EVENT<-'OTHER'
dat$EVENT[grep('HAIL',dat$EVTYPE,ignore.case=TRUE)]<-'HAIL'
dat$EVENT[grep('HEAT',dat$EVTYPE,ignore.case=TRUE)]<-'HEAT'
dat$EVENT[grep('FLOOD',dat$EVTYPE,ignore.case=TRUE)]<-'FLOOD'
dat$EVENT[grep('WIND',dat$EVTYPE,ignore.case=TRUE)]<-'WIND'
dat$EVENT[grep('STORM',dat$EVTYPE,ignore.case=TRUE)]<-'STORM'
dat$EVENT[grep('SNOW',dat$EVTYPE,ignore.case=TRUE)]<-'SNOW'
dat$EVENT[grep('TORNADO',dat$EVTYPE,ignore.case=TRUE)]<-'TORNADO'
dat$EVENT[grep('WINTER',dat$EVTYPE,ignore.case=TRUE)]<-'WINTER'
dat$EVENT[grep('RAIN',dat$EVTYPE,ignore.case=TRUE)]<-'RAIN'
sort(table(dat$EVENT),decreasing=TRUE)
## 
##    HAIL    WIND   STORM   FLOOD TORNADO   OTHER  WINTER    SNOW    RAIN    HEAT 
##  289270  255362  113156   82686   60700   48970   19604   17660   12241    2648
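
Because each assignment overwrites earlier ones, the order of the `grep` calls decides which group an event type with several keywords lands in. A small sketch on a made-up vector (`ev` and `grp` are illustrative names, not part of the analysis):

```r
ev  <- c("TSTM WIND", "THUNDERSTORM WIND", "HIGH WIND", "WINTER STORM", "HAIL")
grp <- rep("OTHER", length(ev))

# Same pattern as in the analysis: later assignments overwrite earlier ones.
grp[grep("HAIL",   ev, ignore.case = TRUE)] <- "HAIL"
grp[grep("WIND",   ev, ignore.case = TRUE)] <- "WIND"
grp[grep("STORM",  ev, ignore.case = TRUE)] <- "STORM"
grp[grep("WINTER", ev, ignore.case = TRUE)] <- "WINTER"

print(data.frame(ev, grp))
```

Here THUNDERSTORM WIND matches both WIND and STORM and keeps the label assigned last (STORM), while WINTER STORM ends up as WINTER for the same reason.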

Checking the values for the economic variables

sort(table(dat$PROPDMGEXP),decreasing=TRUE)[1:10]
## 
##             K      M      0      B      5      1      2      ?      m 
## 465934 424665  11330    216     40     28     25     13      8      7
sort(table(dat$CROPDMGEXP),decreasing=TRUE)[1:10]
## 
##             K      M      k      0      B      ?      2      m   <NA> 
## 618413 281832   1994     21     19      9      7      1      1

As can be seen, the exponent codes are inconsistent, so we convert them as follows:
* K or k - thousand dollars (10^3)
* M or m - million dollars (10^6)
* B or b - billion dollars (10^9)
* Anything else - treated as dollars (10^0)

dat$PROPDMGEXP<-as.character(dat$PROPDMGEXP)
dat$PROPDMGEXP[is.na(dat$PROPDMGEXP)]<-0
dat$PROPDMGEXP[!grepl('[Kk]|[Mm]|[Bb]',dat$PROPDMGEXP)]<-0
dat$PROPDMGEXP[grepl('[Kk]',dat$PROPDMGEXP)]<-'3'
dat$PROPDMGEXP[grepl('[Mm]',dat$PROPDMGEXP)]<-'6'
dat$PROPDMGEXP[grepl('[Bb]',dat$PROPDMGEXP)]<-'9'
dat$PROPDMGEXP<-as.numeric(as.character(dat$PROPDMGEXP))
dat$property.damage<-dat$PROPDMG*10^dat$PROPDMGEXP

dat$CROPDMGEXP<-as.character(dat$CROPDMGEXP)
dat$CROPDMGEXP[is.na(dat$CROPDMGEXP)]<-0
dat$CROPDMGEXP[!grepl('[Kk]|[Mm]|[Bb]',dat$CROPDMGEXP)]<-0
dat$CROPDMGEXP[grepl('[Kk]',dat$CROPDMGEXP)]<-'3'
dat$CROPDMGEXP[grepl('[Mm]',dat$CROPDMGEXP)]<-'6'
dat$CROPDMGEXP[grepl('[Bb]',dat$CROPDMGEXP)]<-'9'
dat$CROPDMGEXP<-as.numeric(as.character(dat$CROPDMGEXP))
dat$crop.damage<-dat$CROPDMG*10^dat$CROPDMGEXP
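
The same conversion can also be written with a lookup table instead of repeated `grepl` calls. This is a sketch of an alternative, not the code used above; `exp.map` and `to.exponent` are hypothetical names, and the sketch assumes single-character codes (it matches whole codes, whereas `grepl` matches any string containing the letter).

```r
# Hypothetical helper: map a vector of exponent codes to powers of ten.
# Assumes single-character codes; anything not listed counts as 10^0.
exp.map <- c(K = 3, M = 6, B = 9)

to.exponent <- function(code) {
  e <- exp.map[toupper(as.character(code))]  # NA for unknown or missing codes
  e[is.na(e)] <- 0
  unname(e)
}

# Made-up example values covering each kind of code.
codes  <- c("K", "m", "B", "", "5", "?", NA)
amount <- c(25, 2.5, 1, 10, 3, 7, 4)

damage <- amount * 10^to.exponent(codes)
print(damage)
```

The helper could be applied to both `PROPDMGEXP` and `CROPDMGEXP`, so the two conversions would share one definition.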

Print the 10 most frequent values of property and crop damage

sort(table(dat$property.damage),decreasing=TRUE)[1:10]
## 
##      0   5000  10000   1000   2000  25000  50000   3000  20000  15000 
## 663123  31731  21787  17544  17186  17104  13596  10364   9179   8617
sort(table(dat$crop.damage),decreasing=TRUE)[1:10]
## 
##      0   5000  10000  50000  1e+05   1000   2000  25000  20000  5e+05 
## 880198   4097   2349   1984   1233    956    951    830    758    721

Analysis

Aggregating events for public health variables

Table of public health problems by event type

agg.fatalities<-ddply(dat,.(EVENT),summarise,Total=sum(FATALITIES,na.rm=TRUE))
agg.fatalities$type<-'fatalities'
agg.injuries<-ddply(dat,.(EVENT),summarise,Total=sum(INJURIES,na.rm=TRUE))
agg.injuries$type<-'injuries'
agg.health<-rbind(agg.fatalities,agg.injuries)
health.by.event<-join(agg.fatalities,agg.injuries,by='EVENT',type='inner')
health.by.event
##      EVENT Total       type Total     type
## 1    FLOOD  1524 fatalities  8602 injuries
## 2     HAIL    15 fatalities  1371 injuries
## 3     HEAT  3138 fatalities  9224 injuries
## 4    OTHER  2626 fatalities 12224 injuries
## 5     RAIN   114 fatalities   305 injuries
## 6     SNOW   164 fatalities  1164 injuries
## 7    STORM   416 fatalities  5339 injuries
## 8  TORNADO  5661 fatalities 91407 injuries
## 9     WIND  1209 fatalities  9001 injuries
## 10  WINTER   278 fatalities  1891 injuries
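
The joined table above ends up with two columns named `Total` and two named `type`. One way to get a single table with distinct column names is base R's `aggregate`. A sketch on a made-up data frame (`toy` is hypothetical and stands in for `dat`):

```r
# Hypothetical stand-in for `dat`; the values are made up for illustration.
toy <- data.frame(
  EVENT      = c("TORNADO", "TORNADO", "FLOOD", "HEAT", "FLOOD"),
  FATALITIES = c(1, 0, 2, 5, 1),
  INJURIES   = c(15, 2, 0, 9, 4)
)

# Sum both health variables per event group in one call.
# The result has one row per EVENT and one column per variable.
# Note: the formula interface drops rows with NA in these columns by default.
health <- aggregate(cbind(FATALITIES, INJURIES) ~ EVENT, data = toy, FUN = sum)
print(health)
```

With this layout each measure keeps its own name, so later code can refer to `health$FATALITIES` and `health$INJURIES` unambiguously.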

Aggregating events for economic variables

agg.prop<-ddply(dat,.(EVENT),summarise,Total=sum(property.damage,na.rm=TRUE))
agg.prop$type<-'property'
agg.crop<-ddply(dat,.(EVENT),summarise,Total=sum(crop.damage,na.rm=TRUE))
agg.crop$type<-'crop'
agg.economic<-rbind(agg.prop,agg.crop)
economic.by.event<-join(agg.prop,agg.crop,by='EVENT',type='inner')
economic.by.event
##      EVENT        Total     type       Total type
## 1    FLOOD 167502193929 property 12266906100 crop
## 2     HAIL  15733043048 property  3046837473 crop
## 3     HEAT     20325750 property   904469280 crop
## 4    OTHER  97246712337 property 23588880870 crop
## 5     RAIN   3270230192 property   919315800 crop
## 6     SNOW   1024169752 property   134683100 crop
## 7    STORM  66304415393 property  6374474888 crop
## 8  TORNADO  58593098029 property   417461520 crop
## 9     WIND  10847166618 property  1403719150 crop
## 10  WINTER   6777295251 property    47444000 crop

Results

Across the U.S., which types of events are most harmful with respect to population health?

As can be seen from the plot, tornadoes affect public health the most, with 5,661 fatalities and 91,407 injuries.
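
A figure like the one referred to above could be drawn with `ggplot2` from the totals in the health table of the Analysis section. This is only a sketch: the layout (dodged bars, events ordered by size) is an assumption and not necessarily the original figure, and `health.long` is an illustrative name.

```r
library(ggplot2)

# Totals copied from the health table in the Analysis section.
health.long <- data.frame(
  EVENT = rep(c("FLOOD", "HAIL", "HEAT", "OTHER", "RAIN",
                "SNOW", "STORM", "TORNADO", "WIND", "WINTER"), times = 2),
  Total = c(1524, 15, 3138, 2626, 114, 164, 416, 5661, 1209, 278,
            8602, 1371, 9224, 12224, 305, 1164, 5339, 91407, 9001, 1891),
  type  = rep(c("fatalities", "injuries"), each = 10)
)

# One bar per event and outcome, events ordered from most to least affected.
p <- ggplot(health.long, aes(x = reorder(EVENT, -Total), y = Total, fill = type)) +
  geom_col(position = "dodge") +
  labs(x = "Event type", y = "People affected",
       title = "Fatalities and injuries by event type") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(p)
```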

Across the U.S., which types of events have the greatest economic consequences?

As can be seen from the plot, floods have the greatest economic consequences, with about $167.5 billion in property damage and $12.3 billion in crop damage.
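
To rank event types by combined economic damage, the property and crop totals from the economic table in the Analysis section can be added and sorted. A sketch using those figures (`econ` and `total.bn` are illustrative names):

```r
# Totals copied from the economic table in the Analysis section (dollars).
econ <- data.frame(
  EVENT    = c("FLOOD", "HAIL", "HEAT", "OTHER", "RAIN",
               "SNOW", "STORM", "TORNADO", "WIND", "WINTER"),
  property = c(167502193929, 15733043048, 20325750, 97246712337, 3270230192,
               1024169752, 66304415393, 58593098029, 10847166618, 6777295251),
  crop     = c(12266906100, 3046837473, 904469280, 23588880870, 919315800,
               134683100, 6374474888, 417461520, 1403719150, 47444000)
)

# Combined damage in billions of dollars, largest first.
econ$total.bn <- (econ$property + econ$crop) / 1e9
econ <- econ[order(-econ$total.bn), ]
print(econ[, c("EVENT", "total.bn")], row.names = FALSE)
```

Sorting on the combined total puts FLOOD first, while the crop column on its own is largest for the OTHER group.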