Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
Loading the required libraries
library(ggplot2)
library(plyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:
There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
storm.data<-read.csv(bzfile('./data/repdata_data_StormData.csv.bz2'),header=TRUE)
dim(storm.data)
## [1] 902297 37
str(storm.data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
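The load step relies on read.csv() accepting a bzfile() connection, so the archive never has to be unpacked by hand. A minimal sketch of that round trip follows, assuming a temporary file and made-up toy rows (neither is part of the original analysis):

```r
# Toy data standing in for the storm file (hypothetical values)
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"), FATALITIES = c(1, 0))

# Write it as a bzip2-compressed CSV in a temporary location
path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(path, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# Read it back through a bzfile() connection, as the analysis does
roundtrip <- read.csv(bzfile(path), header = TRUE, stringsAsFactors = FALSE)
print(roundtrip)
```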
From the variables in storm.data, these are the columns of interest:
Health Variables:
FATALITIES - approx. number of deaths
INJURIES - approx. number of people injured
Economic Variables:
PROPDMG - approx. property damages
PROPDMGEXP - the units of value of the damaged property
CROPDMG - approx. crop damages
CROPDMGEXP - the units of value of the damaged crops
Events - target variable:
EVTYPE - type of weather event
vars<-c('FATALITIES','INJURIES','PROPDMG','PROPDMGEXP','CROPDMG','CROPDMGEXP','EVTYPE')
dat<-storm.data[,vars]
Check the last few rows, since the earliest records may have missing values, then count the missing values in each column
tail(dat)
## FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP EVTYPE
## 902292 0 0 0 K 0 K WINTER WEATHER
## 902293 0 0 0 K 0 K HIGH WIND
## 902294 0 0 0 K 0 K HIGH WIND
## 902295 0 0 0 K 0 K HIGH WIND
## 902296 0 0 0 K 0 K BLIZZARD
## 902297 0 0 0 K 0 K HEAVY SNOW
sum(is.na(dat$FATALITIES))
## [1] 0
sum(is.na(dat$INJURIES))
## [1] 0
sum(is.na(dat$PROPDMG))
## [1] 0
sum(is.na(dat$PROPDMGEXP))
## [1] 0
sum(is.na(dat$CROPDMG))
## [1] 0
sum(is.na(dat$CROPDMGEXP))
## [1] 0
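The per-column checks above can also be written as a single call. The sketch below uses colSums(is.na(...)) on a small made-up data frame (the values are hypothetical, with one NA planted deliberately) to show what the result looks like when something is missing:

```r
# Toy data frame with one deliberately missing value (hypothetical data)
toy <- data.frame(
  FATALITIES = c(0, 1, NA),
  INJURIES   = c(15, 0, 2),
  EVTYPE     = c("TORNADO", "HAIL", "FLOOD")
)

# Count missing values in every column at once
na.counts <- colSums(is.na(toy))
print(na.counts)
```

Applied to the real data, this would be colSums(is.na(dat)).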
sort(table(dat$EVTYPE),decreasing=TRUE)[1:10]
##
## HAIL TSTM WIND THUNDERSTORM WIND TORNADO
## 288661 219940 82563 60652
## FLASH FLOOD FLOOD THUNDERSTORM WINDS HIGH WIND
## 54277 25326 20843 20212
## LIGHTNING HEAVY SNOW
## 15754 15708
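We group similar event types into broader categories by matching keywords in EVTYPE. For example, TSTM WIND and HIGH WIND become WIND. The assignments run in order, so a later match overrides an earlier one: THUNDERSTORM WIND and THUNDERSTORM WINDS match both WIND and STORM and therefore end up as STORM. Events that match no keyword stay in OTHER.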
dat$EVENT<-'OTHER'
dat$EVENT[grep('HAIL',dat$EVTYPE,ignore.case=TRUE)]<-'HAIL'
dat$EVENT[grep('HEAT',dat$EVTYPE,ignore.case=TRUE)]<-'HEAT'
dat$EVENT[grep('FLOOD',dat$EVTYPE,ignore.case=TRUE)]<-'FLOOD'
dat$EVENT[grep('WIND',dat$EVTYPE,ignore.case=TRUE)]<-'WIND'
dat$EVENT[grep('STORM',dat$EVTYPE,ignore.case=TRUE)]<-'STORM'
dat$EVENT[grep('SNOW',dat$EVTYPE,ignore.case=TRUE)]<-'SNOW'
dat$EVENT[grep('TORNADO',dat$EVTYPE,ignore.case=TRUE)]<-'TORNADO'
dat$EVENT[grep('WINTER',dat$EVTYPE,ignore.case=TRUE)]<-'WINTER'
dat$EVENT[grep('RAIN',dat$EVTYPE,ignore.case=TRUE)]<-'RAIN'
sort(table(dat$EVENT),decreasing=TRUE)
##
## HAIL WIND STORM FLOOD TORNADO OTHER WINTER SNOW RAIN HEAT
## 289270 255362 113156 82686 60700 48970 19604 17660 12241 2648
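Because each grep() assignment overwrites whatever was set before it, the order of the keywords decides the final category. The sketch below replays four of the assignments on a handful of event strings (WINTER STORM is an illustrative addition, not taken from the table above) to make that visible:

```r
# A few event-type strings; "WINTER STORM" is added for illustration
evtype <- c("TSTM WIND", "THUNDERSTORM WIND", "HIGH WIND", "WINTER STORM", "HAIL")

event <- rep("OTHER", length(evtype))
# Same relative order as the analysis: WIND, then STORM, then WINTER
event[grep("HAIL",   evtype, ignore.case = TRUE)] <- "HAIL"
event[grep("WIND",   evtype, ignore.case = TRUE)] <- "WIND"
event[grep("STORM",  evtype, ignore.case = TRUE)] <- "STORM"
event[grep("WINTER", evtype, ignore.case = TRUE)] <- "WINTER"

print(data.frame(evtype, event))
```

THUNDERSTORM WIND is first labelled WIND and then relabelled STORM, while TSTM WIND stays WIND because it does not contain the substring STORM.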
Checking the values for the economic variables
sort(table(dat$PROPDMGEXP),decreasing=TRUE)[1:10]
##
##             K      M      0      B      5      1      2      ?      m
## 465934 424665  11330    216     40     28     25     13      8      7
sort(table(dat$CROPDMGEXP),decreasing=TRUE)[1:10]
##
##             K      M      k      0      B      ?      2      m   <NA>
## 618413 281832   1994     21     19      9      7      1      1
As can be seen, the exponent codes are inconsistent (blank entries, mixed case, digits, and symbols), so we convert them to powers of ten as follows:
* K or k - thousand dollars (10^3)
* M or m - million dollars (10^6)
* B or b - billion dollars (10^9)
* Anything else - dollars (10^0)
dat$PROPDMGEXP<-as.character(dat$PROPDMGEXP)
dat$PROPDMGEXP[is.na(dat$PROPDMGEXP)]<-0
dat$PROPDMGEXP[!grepl('[Kk]|[Mm]|[Bb]',dat$PROPDMGEXP)]<-0
dat$PROPDMGEXP[grepl('[Kk]',dat$PROPDMGEXP)]<-'3'
dat$PROPDMGEXP[grepl('[Mm]',dat$PROPDMGEXP)]<-'6'
dat$PROPDMGEXP[grepl('[Bb]',dat$PROPDMGEXP)]<-'9'
dat$PROPDMGEXP<-as.numeric(as.character(dat$PROPDMGEXP))
dat$property.damage<-dat$PROPDMG*10^dat$PROPDMGEXP
dat$CROPDMGEXP<-as.character(dat$CROPDMGEXP)
dat$CROPDMGEXP[is.na(dat$CROPDMGEXP)]<-0
dat$CROPDMGEXP[!grepl('[Kk]|[Mm]|[Bb]',dat$CROPDMGEXP)]<-0
dat$CROPDMGEXP[grepl('[Kk]',dat$CROPDMGEXP)]<-'3'
dat$CROPDMGEXP[grepl('[Mm]',dat$CROPDMGEXP)]<-'6'
dat$CROPDMGEXP[grepl('[Bb]',dat$CROPDMGEXP)]<-'9'
dat$CROPDMGEXP<-as.numeric(as.character(dat$CROPDMGEXP))
dat$crop.damage<-dat$CROPDMG*10^dat$CROPDMGEXP
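The same conversion can be expressed as a lookup, which avoids repeating the block for each column. This is only a sketch: exp.to.power is a hypothetical helper name, and the sample values are made up to show the arithmetic (for example, 25 with code K becomes 25 * 10^3 = 25000).

```r
# Hypothetical helper: map a damage-exponent code to a power of ten.
# K/k -> 3, M/m -> 6, B/b -> 9, anything else -> 0.
exp.to.power <- function(code) {
  lookup <- c(K = 3, M = 6, B = 9)
  power <- unname(lookup[toupper(as.character(code))])
  power[is.na(power)] <- 0  # codes not in the lookup count as plain dollars
  power
}

# Made-up amounts and codes to show the arithmetic
dmg  <- c(25, 2.5, 1, 7)
code <- c("K", "m", "B", "?")
dollars <- dmg * 10^exp.to.power(code)
print(dollars)
```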
Print the 10 most frequent values of property and crop damage
sort(table(dat$property.damage),decreasing=TRUE)[1:10]
##
## 0 5000 10000 1000 2000 25000 50000 3000 20000 15000
## 663123 31731 21787 17544 17186 17104 13596 10364 9179 8617
sort(table(dat$crop.damage),decreasing=TRUE)[1:10]
##
## 0 5000 10000 50000 1e+05 1000 2000 25000 20000 5e+05
## 880198 4097 2349 1984 1233 956 951 830 758 721
Table of public health problems by event type
agg.fatalities<-ddply(dat,.(EVENT),summarise,Total=sum(FATALITIES,na.rm=TRUE))
agg.fatalities$type<-'fatalities'
agg.injuries<-ddply(dat,.(EVENT),summarise,Total=sum(INJURIES,na.rm=TRUE))
agg.injuries$type<-'injuries'
agg.health<-rbind(agg.fatalities,agg.injuries)
health.by.event<-join(agg.fatalities,agg.injuries,by='EVENT',type='inner')
health.by.event
## EVENT Total type Total type
## 1 FLOOD 1524 fatalities 8602 injuries
## 2 HAIL 15 fatalities 1371 injuries
## 3 HEAT 3138 fatalities 9224 injuries
## 4 OTHER 2626 fatalities 12224 injuries
## 5 RAIN 114 fatalities 305 injuries
## 6 SNOW 164 fatalities 1164 injuries
## 7 STORM 416 fatalities 5339 injuries
## 8 TORNADO 5661 fatalities 91407 injuries
## 9 WIND 1209 fatalities 9001 injuries
## 10 WINTER 278 fatalities 1891 injuries
agg.prop<-ddply(dat,.(EVENT),summarise,Total=sum(property.damage,na.rm=TRUE))
agg.prop$type<-'property'
agg.crop<-ddply(dat,.(EVENT),summarise,Total=sum(crop.damage,na.rm=TRUE))
agg.crop$type<-'crop'
agg.economic<-rbind(agg.prop,agg.crop)
economic.by.event<-join(agg.prop,agg.crop,by='EVENT',type='inner')
economic.by.event
## EVENT Total type Total type
## 1 FLOOD 167502193929 property 12266906100 crop
## 2 HAIL 15733043048 property 3046837473 crop
## 3 HEAT 20325750 property 904469280 crop
## 4 OTHER 97246712337 property 23588880870 crop
## 5 RAIN 3270230192 property 919315800 crop
## 6 SNOW 1024169752 property 134683100 crop
## 7 STORM 66304415393 property 6374474888 crop
## 8 TORNADO 58593098029 property 417461520 crop
## 9 WIND 10847166618 property 1403719150 crop
## 10 WINTER 6777295251 property 47444000 crop
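The joined table above ends up with two columns named Total and two named type. One possible alternative, sketched here with base R's aggregate() on made-up toy rows (the values are hypothetical), sums both damage columns in one call and keeps distinct column names:

```r
# Toy rows standing in for the cleaned data set (hypothetical values)
toy <- data.frame(
  EVENT           = c("FLOOD", "FLOOD", "HAIL", "TORNADO"),
  property.damage = c(1000, 500, 200, 300),
  crop.damage     = c(50, 0, 20, 0)
)

# Sum both damage columns per event in a single call
economic <- aggregate(cbind(property.damage, crop.damage) ~ EVENT,
                      data = toy, FUN = sum)
economic$total <- economic$property.damage + economic$crop.damage

# Order events from most to least damaging
economic <- economic[order(economic$total, decreasing = TRUE), ]
print(economic)
```

The same pattern would apply to FATALITIES and INJURIES for the health table.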
As can be seen from the plot, tornadoes affect public health the most, with 5,661 fatalities and 91,407 injuries.
As can be seen from the plot, floods cause the greatest economic damage, with about $167.5 billion in property damage and $12.3 billion in crop damage.
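The plotting code is not shown in this report. A minimal sketch of how such a chart might be drawn with ggplot2 follows, using the health totals copied from the table above; the layout choices (dodged bars, ordering, labels) are assumptions rather than the original figure:

```r
library(ggplot2)

# Totals copied from the health table above
events <- c("FLOOD", "HAIL", "HEAT", "OTHER", "RAIN",
            "SNOW", "STORM", "TORNADO", "WIND", "WINTER")
health <- data.frame(
  EVENT = rep(events, 2),
  Total = c(1524, 15, 3138, 2626, 114, 164, 416, 5661, 1209, 278,
            8602, 1371, 9224, 12224, 305, 1164, 5339, 91407, 9001, 1891),
  type  = rep(c("fatalities", "injuries"), each = 10)
)

# Side-by-side bars per event, ordered by average count
p <- ggplot(health, aes(x = reorder(EVENT, -Total), y = Total, fill = type)) +
  geom_col(position = "dodge") +
  labs(x = "Event type", y = "Total count", fill = "Outcome") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(p)
```

An equivalent chart for economic damage would use the property and crop totals in place of fatalities and injuries.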