The basic goal of this assignment is to explore the NOAA Storm Database and answer some basic questions about severe weather events. You must use the database to answer the questions below and show the code for your entire analysis. Your analysis can consist of tables, figures, or other summaries. You may use any R package you want to support your analysis.
Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This report explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database and answer the following questions:
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
This report shows that across the United States the tornado is most harmful for population health, and the floods results in greatest economic loss.
Let’s download the raw data
if (!file.exists("repdata-data-StormData.csv.bz2"))
{
URL <- "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(URL, destfile = "repdata-data-StormData.csv.bz2", method="curl", mode="wb")
}
raw.data <- read.table("repdata-data-StormData.csv.bz2", header = TRUE, sep = ",", stringsAsFactors = FALSE, na.strings = "NA")
head(raw.data)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
Let’s check to see FEW event types:
length(unique(raw.data$EVTYPE))
## [1] 985
head(unique(raw.data$EVTYPE), 20)
## [1] "TORNADO" "TSTM WIND"
## [3] "HAIL" "FREEZING RAIN"
## [5] "SNOW" "ICE STORM/FLASH FLOOD"
## [7] "SNOW/ICE" "WINTER STORM"
## [9] "HURRICANE OPAL/HIGH WINDS" "THUNDERSTORM WINDS"
## [11] "RECORD COLD" "HURRICANE ERIN"
## [13] "HURRICANE OPAL" "HEAVY RAIN"
## [15] "LIGHTNING" "THUNDERSTORM WIND"
## [17] "DENSE FOG" "RIP CURRENT"
## [19] "THUNDERSTORM WINS" "FLASH FLOOD"
There appears to be duplications. Let’s try to clean and combine some of those.
raw.data$EVTYPE <- gsub("^(HEAT).*", "HEAT", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(RECORD HEAT).*", "HEAT", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(EXTREME HEAT).*", "HEAT", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(Heat).*", "HEAT", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(EXCESSIVE HEAT).*", "HEAT", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(TSTM).*", "THUNDER STORM", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(THUNDERSTORM).*", "THUNDER STORM", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(TROPICAL STORM).*", "TROPICAL STORM", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(FLASH FLOOD).*", "FLOOD", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(WIND).*", "WIND", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(STRONG WIND).*", "WIND", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(HIGH WIND).*", "WIND", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(HURRICANE).*", "HURICCANE", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(SNOW).*", "SNOW", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(HEAVY SNOW).*", "SNOW", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(FIRE).*", "FIRE", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(WILD/FOREST FIRE).*", "FIRE", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(WILDFIRE).*", "FIRE", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(WILD FIRES).*", "FIRE", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(HAIL).*", "HAIL", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(BLIZZARD).*", "BLIZZARD", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(COLD).*", "COLD", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(WINTER WEATHER).*", "COLD", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(EXTREME COLD).*", "COLD", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(RIP).*", "RIP", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(FOG).*", "FOG", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(DENSE FOG).*", "FOG", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(AVALANCHE).*", "AVALANCHE", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(AVALANCE).*", "AVALANCHE", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(RAIN).*", "RAIN", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(HEAVY RAIN).*", "RAIN", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(HIGH SURF).*", "SURF", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(HEAVY SURF).*", "SURF", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(SURF).*", "SURF", raw.data$EVTYPE)
length(unique(raw.data$EVTYPE))
## [1] 658
Let’s consider only the required columns we need for analysis:
names(raw.data)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
data <- raw.data[, c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
str(data)
## 'data.frame': 902297 obs. of 7 variables:
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
Replace the EMPTY values with ZEROES:
data$FATALITIES[(data$FATALITIES=="")]<-0
data$INJURIES[(data$INJURIES=="")]<-0
data$PROPDMG[(data$PROPDMG=="")]<-0
data$CROPDMG[(data$CROPDMG=="")]<-0
The variables PROPDMGEXP and CROPDMGEXP are the factors of multiplicity for variables PROPDMG and CROPDMG
(propdmg.exp <- unique(data$PROPDMGEXP))
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8"
(cropdmg.exp <- unique(data$CROPDMGEXP))
## [1] "" "M" "K" "m" "B" "?" "0" "k" "2"
Instead of two variables (value + factor) we will have one (absolute value):
let’s create two vectors with numeric factor values
propdmg.exp.m<-c(1e3,1e6, 1, 1e9,1e6, 1, 1,1e5,1e6, 1,1e4,1e2,1e3, 1,1e7,1e2, 1, 10,1e8)
cropdmg.exp.m<-c( 1,1e6,1e3,1e6,1e9,1,1,1e3,1e2)
Lets map the values, calculate the absolute numeric values and save them into two new variables: ABSPROPDMG and ABSCROPDMG
library(plyr)
propdmg.temp<-mapvalues(data$PROPDMGEXP, propdmg.exp, propdmg.exp.m)
cropdmg.temp<-mapvalues(data$CROPDMGEXP, cropdmg.exp, cropdmg.exp.m)
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data<-mutate(data, ABSPROPDMG=as.numeric(as.character(propdmg.temp))*data$PROPDMG, ABSCROPDMG=as.numeric(as.character(cropdmg.temp))*data$CROPDMG)
head(data)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO 0 15 25.0 K 0
## 2 TORNADO 0 0 2.5 K 0
## 3 TORNADO 0 2 25.0 K 0
## 4 TORNADO 0 2 2.5 K 0
## 5 TORNADO 0 2 2.5 K 0
## 6 TORNADO 0 6 2.5 K 0
## ABSPROPDMG ABSCROPDMG
## 1 25000 0
## 2 2500 0
## 3 25000 0
## 4 2500 0
## 5 2500 0
## 6 2500 0
Group our dataset based on Event Type (EVTYPE) and calculate the totals for each columns (FATALITIES, INJURIES, ABSPROPDMG, ABSCROPDMG)
data.summary<-data %>% group_by(EVTYPE) %>% summarize(fatalities=sum(FATALITIES), injuries=sum(INJURIES), property.damage=sum(ABSPROPDMG), crop.damage=sum(ABSCROPDMG))
head(data.summary)
## Source: local data frame [6 x 5]
##
## EVTYPE fatalities injuries property.damage crop.damage
## (chr) (dbl) (dbl) (dbl) (dbl)
## 1 HIGH SURF ADVISORY 0 0 200000 0
## 2 COASTAL FLOOD 0 0 0 0
## 3 FLASH FLOOD 0 0 50000 0
## 4 LIGHTNING 0 0 0 0
## 5 TSTM WIND 0 0 8100000 0
## 6 TSTM WIND (G45) 0 0 8000 0
Public Health Impact would include both fatalities and injuries:
data.phi <- data.summary %>% select(EVTYPE, fatalities, injuries) %>% mutate(total.phi=fatalities+injuries) %>%arrange(desc(total.phi))
data.phi[1:5,]
## Source: local data frame [5 x 4]
##
## EVTYPE fatalities injuries total.phi
## (chr) (dbl) (dbl) (dbl)
## 1 TORNADO 5633 91346 96979
## 2 HEAT 3119 9224 12343
## 3 THUNDER STORM 710 9508 10218
## 4 FLOOD 1488 8574 10062
## 5 LIGHTNING 816 5230 6046
From the above, TORNADO is the most harmful weather event across US:
library(tidyr)
library(ggplot2)
public.health<-gather(data.phi[1:5,1:3],Type, Total, fatalities:injuries)
head(public.health)
## Source: local data frame [6 x 3]
##
## EVTYPE Type Total
## (chr) (chr) (dbl)
## 1 TORNADO fatalities 5633
## 2 HEAT fatalities 3119
## 3 THUNDER STORM fatalities 710
## 4 FLOOD fatalities 1488
## 5 LIGHTNING fatalities 816
## 6 TORNADO injuries 91346
g<-ggplot(public.health, aes(x=reorder(EVTYPE, +Total), y=Total, fill=Type)) +
ggtitle("Total fatalities & injuries") +
xlab("Event Type") +
geom_bar(stat="identity", alpha=1) +
coord_flip()
print(g)
Let’s define the economical impact as property.damage + crop.damage and see which types of events have the greatest economic consequences:
data.econ<-data.summary%>%select(EVTYPE, property.damage,crop.damage)%>%mutate(total.e=property.damage+crop.damage)%>%arrange(desc(total.e))
data.econ[1:5,]
## Source: local data frame [5 x 4]
##
## EVTYPE property.damage crop.damage total.e
## (chr) (dbl) (dbl) (dbl)
## 1 FLOOD 162072390896 7099131600 169171522496
## 2 HURICCANE 84756180010 5515292800 90271472810
## 3 TORNADO 56947380677 414953270 57362333947
## 4 STORM SURGE 43323536000 5000 43323541000
## 5 HAIL 15977470013 3026094623 19003564636
FLOOD is the event that have the greatest economic consequences Let’s illustrate:
econ<-gather(data.econ[1:5,1:3],Type, Total, crop.damage:property.damage)
q<-ggplot(econ, aes(x=reorder(EVTYPE, +Total), y=Total, fill=Type)) +
ggtitle("Total Property & Crop Damage") +
xlab("Event Type") +
geom_bar(stat="identity", alpha=1) +
coord_flip()
print(q)
The above analysis shows that the tornadoes are the most harmful events for people’s health in the United States. This report also indicates that the flood events cause the most economic consequences in United States.