Severe weather events, such as Winter Storm, Tsunami and Tornado are known to cause fatalities, injuries, property damage, etc. The U.S. National Oceanic and Atmospheric Administration (NOAA) storm database lists 48 names that most accurately describe the meteorological events leading to such severe weather events. The primary aim of this project was to address the following two questions.
Across the United States, which types of events are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
Our results show that, between 1950 and 2011, of the 48 events that cause harm, Tornado stands out top place culprit for Fatalities, Injuries as well as Property Damage, when compared to other event types. Excessive Heat and Flash Flood are the second and third place causes of Fatalities. Flood and Excessive Heat are second and third place causes of Injuries. Hail, Flash Flood and Flood are blamed for Crop Damage (in that order).
National Weather Service Storm Data Documentation.The events in the database start in the year 1950 and end in November 2011.
National Climatic Data Center Storm Events FAQ.
stormdata<-read.table("repdata-data-StormData.csv.bz2",sep = ",", header = TRUE)
This is a huge data with 902,297 observations and 37 variables.
dim(stormdata)
## [1] 902297 37
Checking some of the names of the data suggests formatting of names to fit for R-Processing
cnames <- readLines("repdata-data-StormData.csv.bz2", 1)
cnames <- strsplit(cnames, ",", fixed = TRUE)
names(stormdata) <- make.names(cnames[[1]])
names(stormdata)<-gsub("X.|[.]","", names(stormdata))
head(stormdata[,3:10])
## BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE BGN_RANGE BGN_AZI
## 1 0130 CST 97 MOBILE AL TORNADO 0
## 2 0145 CST 3 BALDWIN AL TORNADO 0
## 3 1600 CST 57 FAYETTE AL TORNADO 0
## 4 0900 CST 89 MADISON AL TORNADO 0
## 5 1500 CST 43 CULLMAN AL TORNADO 0
## 6 2000 CST 77 LAUDERDALE AL TORNADO 0
Missing values are common problems in many data analysis and severe storm events are not exceptions. We therefore check the proportion of observations with missing values.
mean(is.na(stormdata)) ## Are missing values important in the analysis to follow?
## [1] 0.0523
About 5.2 percent of the rows have missing values. Since this proportion is relatively small, we will ignore missing values.This number does not account for missing values other than those reported as NAs.
We read to R the file which has the 48 valid names of events,which was downloaded from here, to our working directory, using the following code
event_table<-read.csv("event_table.csv",sep=";")
head(event_table)
## EVTYPE Designator
## 1 Astronomical Low Tide Z
## 2 Avalanche Z
## 3 Blizzard Z
## 4 Coastal Flood Z
## 5 Cold/Wind Chill Z
## 6 Debris Flow C
One can see that the stormdata and event_table share a common variable EVTYPE. However, in order to merge the events of the two datasets by the same id, EVTYPE, we need to change the events in the EVTYPE column of the data_table to upper case and create a new data.
event_table$EVTYPE<-toupper(event_table$EVTYPE)
data_mgd<-merge(stormdata,event_table,by="EVTYPE")
We will do the rest of our analysis using latest tidy data, data_mgd.
The variables that will address the question with regard to population health are FATALITIES and INJURIES.
The variables that will address the question with regard to economic consequences are Property Damage (listed as PROPDMG ) and Crop Damage (listed as CROPDMG ).
Here, we will choose the two columns we want (EVTYPE and FATALITIES ), create an aggregated data-frame with total number of fatalities by event type (EVTYPE) across US and report the ten highest number of fatalities by event type (EVTYPE ).
data_fat<-data_mgd[,c(1,23)]
fat <- aggregate(FATALITIES~EVTYPE, data_fat, sum, rm.na=TRUE)
fat<-fat[order(fat$FATALITIES,decreasing = TRUE),]
head(fat,10)
## EVTYPE FATALITIES
## 38 TORNADO 5634
## 11 EXCESSIVE HEAT 1904
## 13 FLASH FLOOD 979
## 19 HEAT 938
## 27 LIGHTNING 817
## 14 FLOOD 471
## 32 RIP CURRENT 369
## 23 HIGH WIND 249
## 2 AVALANCHE 225
## 45 WINTER STORM 207
The total number of Fatalities between \(1950\) and \(2011\) due to such events may be estimated to be at least
sum(fat$FATALITIES)
## [1] 13101
Following is a Bar chart of the total number of Fatalities between \(1950\) and \(2011\) by the top \(10\) harmful Events.
library(lattice)
barchart(EVTYPE~FATALITIES, data = fat[1:10,],main="Barchart: Fatalities by Top 10 Events")
Here, we will choose the two columns we want (EVTYPE and INJURIES), create an aggregated data-frame with total number of fatalities by event type (EVTYPE) across US and report the ten highest number of injuries by event type (EVTYPE).
data_inj<-data_mgd[,c(1,24)]
inj <- aggregate(INJURIES~EVTYPE, data_inj, sum, rm.na=TRUE)
inj<-inj[order(inj$INJURIES,decreasing = TRUE),]
head(inj,10)
## EVTYPE INJURIES
## 38 TORNADO 91347
## 14 FLOOD 6790
## 11 EXCESSIVE HEAT 6526
## 27 LIGHTNING 5231
## 19 HEAT 2101
## 24 ICE STORM 1976
## 13 FLASH FLOOD 1778
## 37 THUNDERSTORM WIND 1489
## 18 HAIL 1362
## 45 WINTER STORM 1322
The total number of Injuries between 1950 and 2011 due to such events may be estimated to be at least
sum(inj$INJURIES)
## [1] 126736
Following is a Bar-chart of the total number of Injuries between 1950 and 2011 by the top 10 harmful Events for Injuries.
library(lattice)
barchart(EVTYPE~INJURIES, data = inj[1:10,],main="Barchart: Injuries by Top 10 Events")
For the purpose of analyzing property damage, we will choose the three columns we want (EVTYPE, PROPDMG and PROPDMGEXP), create an aggregated data-frame with total number of property damage by event type (EVTYPE) across US and report the ten highest number of damages by event type (EVTYPE). We also remark here that for the purpose of this analysis, we used the column PROPDMGEXPas a multiplying factor for `PROPDMG’ to reflect the actual expenses as follows: \(B=10^6, M=10^3,\dots \), etc., as indicated in the following codes.
data_pd<-data_mgd[,c(1,25,26)]
data_pd$PROPDMG[data_pd$PROPDMGEXP=="B"]<-(10^9)*data_pd$PROPDMG[data_pd$PROPDMGEXP=="B"]
data_pd$PROPDMG[data_pd$PROPDMGEXP=="M"]<-(10^6)*data_pd$PROPDMG[data_pd$PROPDMGEXP=="M"]
data_pd$PROPDMG[data_pd$PROPDMGEXP=="m"]<-(10^6)*data_pd$PROPDMG[data_pd$PROPDMGEXP=="m"]
data_pd$PROPDMG[data_pd$PROPDMGEXP=="K"]<-(10^3)*data_pd$PROPDMG[data_pd$PROPDMGEXP=="K"]
data_pd$PROPDMG[data_pd$PROPDMGEXP=="H"]<-100*data_pd$PROPDMG[data_pd$PROPDMGEXP=="H"]
data_pd$PROPDMG[data_pd$PROPDMGEXP=="h"]<-100*data_pd$PROPDMG[data_pd$PROPDMGEXP=="h"]
pd<- aggregate(PROPDMG~EVTYPE, data_pd, sum, rm.na=TRUE)
pd<-pd[order(pd$PROPDMG,decreasing = TRUE),]
head(pd,10)
## EVTYPE PROPDMG
## 38 TORNADO 3212259
## 13 FLASH FLOOD 1420126
## 14 FLOOD 899939
## 37 THUNDERSTORM WIND 876845
## 18 HAIL 688694
## 27 LIGHTNING 603353
## 23 HIGH WIND 324733
## 45 WINTER STORM 132722
## 21 HEAVY SNOW 122253
## 44 WILDFIRE 84460
The total Property Damage between 1950 and 2011 due to such events may be estimated to be at least (in US Dollars):
sum(pd$PROPDMG)
## [1] 8709094
Here, the three columns we want are EVTYPE, PROPDMG and PROPDMGEXP. We create an aggregated data-frame with total number of property damage by event type (EVTYPE) across US and report the ten highest number of damages by event type (EVTYPE). We also remark here that
data_cd<-data_mgd[,c(1,27,28)]
data_cd$CROPDMG[data_cd$CROPDMGEXP=="B"]<-(10^9)*data_cd$CROPDMG[data_cd$CROPDMGEXP=="B"]
data_cd$CROPDMG[data_cd$CROPDMGEXP=="M"]<-(10^6)*data_cd$CROPDMG[data_cd$CROPDMGEXP=="M"]
data_cd$CROPDMG[data_cd$CROPDMGEXP=="m"]<-(10^6)*data_cd$CROPDMG[data_cd$CROPDMGEXP=="m"]
data_cd$CROPDMG[data_cd$CROPDMGEXP=="K"]<-(10^10)*data_cd$CROPDMG[data_cd$CROPDMGEXP=="K"]
data_cd$CROPDMG[data_cd$CROPDMGEXP=="h"]<-1000*data_cd$CROPDMG[data_cd$CROPDMGEXP=="h"]
data_cd$CROPDMG[data_cd$CROPDMGEXP=="H"]<-100*data_cd$CROPDMG[data_cd$CROPDMGEXP=="H"]
cd<- aggregate(CROPDMG~EVTYPE, data_cd, sum, rm.na=TRUE)
cd<-cd[order(cd$CROPDMG,decreasing = TRUE),]
head(cd,10)
## EVTYPE CROPDMG
## 18 HAIL 579597
## 13 FLASH FLOOD 179201
## 14 FLOOD 168039
## 38 TORNADO 100020
## 37 THUNDERSTORM WIND 66792
## 8 DROUGHT 33900
## 23 HIGH WIND 17284
## 20 HEAVY RAIN 11124
## 16 FROST/FREEZE 7035
## 40 TROPICAL STORM 5900
dim(cd)
## [1] 46 2
The total Crop Damage between 1950 and 2011 due to such events may be estimated to be at least (in US Dollars):
sum(cd$CROPDMG)
## [1] 1188839
To see the combined effects of Property and Crop Damage, we merge the data-frames used before
total<-merge(pd,cd,by="EVTYPE")
head(total)
## EVTYPE PROPDMG CROPDMG
## 1 ASTRONOMICAL LOW TIDE 321 1
## 2 AVALANCHE 1625 1
## 3 BLIZZARD 25319 173
## 4 COASTAL FLOOD 12612 1
## 5 COLD/WIND CHILL 1991 601
## 6 DENSE FOG 8226 1
Then we create a new column combined for the combined damage and report the top ten culprits with table and bar-chart.
total$combined<-total$PROPDMG+total$CROPDMG
total<-total[order(total$combined,decreasing = TRUE),]
head(total,10)
## EVTYPE PROPDMG CROPDMG combined
## 38 TORNADO 3212259 100020 3312279
## 13 FLASH FLOOD 1420126 179201 1599327
## 18 HAIL 688694 579597 1268292
## 14 FLOOD 899939 168039 1067978
## 37 THUNDERSTORM WIND 876845 66792 943638
## 27 LIGHTNING 603353 3582 606934
## 23 HIGH WIND 324733 17284 342017
## 45 WINTER STORM 132722 1980 134702
## 21 HEAVY SNOW 122253 2167 124420
## 44 WILDFIRE 84460 4365 88826
library(lattice)
barchart(EVTYPE~combined, data = total[1:10,],xlab="Property and Crop Damage in US Dollars",main="Barchart: Property and Crop Dammage by Top 10 Events")