This report analyses the NOAA Storm Database (1950 - 2011) to determine the effects of severe weather events. We use fatalities and injuries to assess the harm to population health, and the property and crop damage of each event to assess the economic impact.
The data reveals that:
Economic Impact: FLOOD as well as HURRICANE/TYPHOON have the largest impact
Population Health: TORNADO and EXCESSIVE HEAT have the most detrimental effect on population health
According to the NATIONAL WEATHER SERVICE INSTRUCTION, the NOAA Storm Database is an official publication of the National Oceanic and Atmospheric Administration (NOAA) which documents:
The occurrence of storms and other significant weather phenomena having sufficient intensity to cause loss of life, injuries, significant property damage, and/or disruption to commerce;
Rare, unusual, weather phenomena that generate media attention, such as snow flurries in South Florida or the San Diego coastal area; and
Other significant meteorological events, such as record maximum or minimum temperatures or precipitation that occur in connection with another event.
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
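The claim that earlier years hold fewer records can be checked by counting events per year. The following is a minimal sketch on a few made-up dates in the same text format as the BGN_DATE column; the vector `bgn_date` and its values are illustrative assumptions, not rows from the database. On the full data, `data$BGN_DATE` would take its place.

```r
# Toy stand-in for data$BGN_DATE (illustrative values only).
bgn_date <- c("4/18/1950 0:00:00", "6/8/1951 0:00:00",
              "5/25/2011 0:00:00", "11/28/2011 0:00:00", "4/27/2011 0:00:00")

# Parse the date part; the trailing time text is ignored by the format.
year <- as.integer(format(as.Date(bgn_date, format = "%m/%d/%Y"), "%Y"))

# Count records per year.
events_per_year <- table(year)
print(events_per_year)
```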
First, we download and read the compressed database:
library(dplyr)
library(knitr)
# Create directory 'data'
if (!file.exists('data'))
{
dir.create('data')
}
# Download the bzip2 database
fileUrl <- 'https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2'
if (!file.exists('./data/StormData.csv.bz2'))
{
download.file(fileUrl, destfile='./data/StormData.csv.bz2',method='curl')
dateDownloaded <- date()
}
data <- read.csv(bzfile('./data/StormData.csv.bz2'))
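As a small illustration of why no separate decompression step is needed, the sketch below writes a tiny bzip2-compressed CSV to a temporary file and reads it back through `bzfile()`. The toy data frame is an assumption made for the example. Passing `stringsAsFactors = FALSE` is an optional choice that keeps text columns as character vectors rather than factors.

```r
# Write a tiny bzip2-compressed CSV to a temporary file (toy data).
tmp <- tempfile(fileext = ".csv.bz2")
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(1, 0))
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv reads through the bzfile connection and decompresses on the fly.
toy_back <- read.csv(bzfile(tmp), stringsAsFactors = FALSE)
str(toy_back)

# Remove the temporary file.
unlink(tmp)
```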
Let’s check the structure of the database
str(data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
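According to the official document NATIONAL WEATHER SERVICE INSTRUCTION, the columns that matter most for our analysis are EVTYPE (event type), FATALITIES, INJURIES, PROPDMG and PROPDMGEXP (property damage and its unit), and CROPDMG and CROPDMGEXP (crop damage and its unit).
The official document states that, for the units of property and crop damage, ‘K’ means thousand, ‘M’ means million, and ‘B’ means billion.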
Let’s check the characters in PROPDMGEXP and CROPDMGEXP columns:
unique(sort(data$PROPDMGEXP))
## [1] - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
unique(sort(data$CROPDMGEXP))
## [1] ? 0 2 B k K m M
## Levels: ? 0 2 B k K m M
We can see that there are some characters other than ‘K’, ‘M’, ‘B’ and their lower-case forms. We treat ‘H’ or ‘h’ as hundred and give every other character a multiplier of 0.
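Before settling on a rule for the unusual codes, it can help to see how often each one occurs. The sketch below counts the codes in a made-up vector; `propdmgexp` and its values are assumptions for illustration, and `data$PROPDMGEXP` would be used on the real data.

```r
# Toy stand-in for data$PROPDMGEXP (illustrative values only).
propdmgexp <- c("K", "K", "M", "", "B", "h", "?", "K", "5", "m")

# Count how often each code occurs, most frequent first.
exp_counts <- sort(table(propdmgexp, useNA = "ifany"), decreasing = TRUE)
print(exp_counts)

# Share of records whose code is one of the documented letters, in any case.
documented <- toupper(propdmgexp) %in% c("K", "M", "B")
print(mean(documented))
```

If the undocumented codes turn out to be rare, treating them as 0 should have little effect on the totals.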
Because we want to find which event type had the greatest economic impact, we first convert the property damage from its abbreviated form to a dollar value:
# Start at 0 so that rows with any other code contribute nothing
data$REALDMG<-0
# 'H'/'h' = hundred, 'K'/'k' = thousand, 'M'/'m' = million, 'B'/'b' = billion
data$REALDMG[data$PROPDMGEXP=='H']<-100*data$PROPDMG[data$PROPDMGEXP=='H']
data$REALDMG[data$PROPDMGEXP=='K']<-1000*data$PROPDMG[data$PROPDMGEXP=='K']
data$REALDMG[data$PROPDMGEXP=='M']<-1000000*data$PROPDMG[data$PROPDMGEXP=='M']
data$REALDMG[data$PROPDMGEXP=='B']<-1000000000*data$PROPDMG[data$PROPDMGEXP=='B']
data$REALDMG[data$PROPDMGEXP=='h']<-100*data$PROPDMG[data$PROPDMGEXP=='h']
data$REALDMG[data$PROPDMGEXP=='k']<-1000*data$PROPDMG[data$PROPDMGEXP=='k']
data$REALDMG[data$PROPDMGEXP=='m']<-1000000*data$PROPDMG[data$PROPDMGEXP=='m']
data$REALDMG[data$PROPDMGEXP=='b']<-1000000000*data$PROPDMG[data$PROPDMGEXP=='b']
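The same conversion can be written once as a lookup table, which also makes it easy to apply to the crop damage columns. This is a minimal sketch rather than the code used for the results in this report. The helper `to_dollars`, the vector `multiplier`, the toy data frame, and the columns PROPREAL, CROPREAL and TOTALDMG are hypothetical names introduced for illustration. It follows the convention stated above that ‘H’ means hundred and that any other code counts as 0.

```r
# Multipliers for the unit codes; 'H' as hundred follows the convention above.
multiplier <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: convert a damage figure and its unit code to dollars.
to_dollars <- function(value, exp_code) {
  m <- multiplier[toupper(as.character(exp_code))]
  m[is.na(m)] <- 0   # empty or unrecognised codes contribute nothing
  unname(value * m)
}

# Toy rows (illustrative values only).
toy <- data.frame(PROPDMG = c(25, 2.5, 1, 3, 7),
                  PROPDMGEXP = c("K", "m", "B", "h", "?"),
                  CROPDMG = c(0, 1, 0, 0, 2),
                  CROPDMGEXP = c("", "K", "", "", "M"))
toy$PROPREAL <- to_dollars(toy$PROPDMG, toy$PROPDMGEXP)
toy$CROPREAL <- to_dollars(toy$CROPDMG, toy$CROPDMGEXP)
toy$TOTALDMG <- toy$PROPREAL + toy$CROPREAL
print(toy)
```

Using `toupper()` handles upper and lower case in one step, and the `as.character()` call lets the helper accept factor columns as well as character columns.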
We then group the data by event type and sum the FATALITIES and INJURIES columns for each group, so that we can see which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health.
event_group<-group_by(data, EVTYPE)
total_fatainj<-summarize(event_group, sum(INJURIES, na.rm=TRUE), sum(FATALITIES, na.rm=TRUE), sum(INJURIES, FATALITIES, na.rm=TRUE))
colnames(total_fatainj)<-c("event","inj","fata","inj_fata")
total_fatainj<-total_fatainj[order(total_fatainj$inj_fata, decreasing = TRUE),]
total_fatainj<-total_fatainj[1:5,]
# give ID to each event
total_fatainj$id<-1:nrow(total_fatainj)
plot(total_fatainj$id, total_fatainj$inj_fata, xlab="ID of events", ylab="Fatalities and Injuries", main="Total fatalities and injuries for each event type")
head(total_fatainj)
## Source: local data frame [5 x 5]
##
## event inj fata inj_fata id
## 1 TORNADO 91346 5633 96979 1
## 2 EXCESSIVE HEAT 6525 1903 8428 2
## 3 TSTM WIND 6957 504 7461 3
## 4 FLOOD 6789 470 7259 4
## 5 LIGHTNING 5230 816 6046 5
The figure and table above show the five most harmful event types. The most harmful is TORNADO, which caused a total of 96979 fatalities and injuries in the United States from 1950 to 2011.
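To put the gap between TORNADO and the other event types in proportion, the totals from the table above can be turned into shares of the top-five total and drawn as a labelled bar chart. This is a sketch of one possible presentation. The vector `harm` simply re-enters the `inj_fata` values printed above, and `share` is a name introduced for the example.

```r
# Totals copied from the table above (injuries plus fatalities).
harm <- c("TORNADO" = 96979, "EXCESSIVE HEAT" = 8428, "TSTM WIND" = 7461,
          "FLOOD" = 7259, "LIGHTNING" = 6046)

# Each event type's share of the top-five total, in percent.
share <- round(100 * harm / sum(harm), 1)
print(share)

# Bar chart labelled with event names; las = 2 rotates the labels.
barplot(harm, las = 2, cex.names = 0.7,
        ylab = "Fatalities and injuries",
        main = "Top 5 event types by fatalities and injuries")
```

Labelling the bars with event names removes the need for a separate ID column to read the figure.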
We then sum the economic impact for each event type, so that we can see which types of events have the greatest economic consequences.
total_dmg<-summarize(event_group, sum(REALDMG,na.rm=TRUE))
colnames(total_dmg)<-c("event","total_dmg")
total_dmg<-total_dmg[order(total_dmg$total_dmg, decreasing = TRUE),]
total_dmg<-total_dmg[1:5,]
# give ID to each event
total_dmg$id<-1:nrow(total_dmg)
# keep the row with the largest total damage
mostdmg<-total_dmg[total_dmg$total_dmg==max(total_dmg$total_dmg),]
plot(total_dmg$id, total_dmg$total_dmg, xlab="ID of events", ylab="Economic Impact ($)", main="Total property damage for each event type")
head(total_dmg)
## Source: local data frame [5 x 3]
##
## event total_dmg id
## 1 FLOOD 144657709800 1
## 2 HURRICANE/TYPHOON 69305840000 2
## 3 TORNADO 56937160480 3
## 4 STORM SURGE 43323536000 4
## 5 THUNDERSTORM WINDS 21735952850 5
The figure and table above show that the greatest economic consequences are caused by FLOOD, which caused a total economic loss of 144657709800 dollars in the United States from 1950 to 2011.
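The tables above list TSTM WIND and THUNDERSTORM WINDS as separate event types, and EVTYPE has 985 distinct levels, which suggests that the same phenomenon is sometimes recorded under several labels. The sketch below shows one way such labels might be merged before grouping. The cleaning rules, the vectors `evtype` and `dmg`, and their values are assumptions made for illustration and were not applied to the results in this report.

```r
# Toy event labels showing the kinds of variants that can occur (illustrative).
evtype <- c("TSTM WIND", "THUNDERSTORM WINDS", " HIGH SURF ADVISORY",
            "Thunderstorm Wind", "FLOOD")
dmg <- c(10, 20, 1, 5, 100)

# Hypothetical cleaning rules: trim, upper-case, expand TSTM, drop plural WINDS.
clean <- toupper(trimws(evtype))
clean <- sub("^TSTM", "THUNDERSTORM", clean)
clean <- sub("WINDS$", "WIND", clean)

# Sum the toy damage figures per cleaned label.
by_event <- tapply(dmg, clean, sum)
print(sort(by_event, decreasing = TRUE))
```

Merging labels in this way could change the ranking of event types, so any such rules would need to be checked against the official event type list.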
From the data processing above, we conclude:
FLOOD caused the greatest economic loss in the United States from 1950 to 2011.
TORNADO is the most harmful event type with respect to population health.