The basic goal of this assignment is to explore the NOAA Storm Database and answer some basic questions about severe weather events. You must use the database to answer the questions below and show the code for your entire analysis. Your analysis can consist of tables, figures, or other summaries. You may use any R package you want to support your analysis.
Your data analysis must address the following questions:
1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
2. Across the United States, which types of events have the greatest economic consequences?
Consider writing your report as if it were to be read by a government or municipal manager who might be responsible for preparing for severe weather events and will need to prioritize resources for different types of events. However, there is no need to make any specific recommendations in your report.
This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The data set is a data frame with 902,297 observations of 37 variables.
Important variables:
EVTYPE - event type (TORNADO, TSTM WIND, HAIL, FREEZING RAIN, …)
FATALITIES - number of people who died
INJURIES - number of people injured
PROPDMG - property damage amount (a dollar figure to be scaled by PROPDMGEXP)
PROPDMGEXP - magnitude code for PROPDMG (B, M, K, H, …)
CROPDMG - crop damage amount (a dollar figure to be scaled by CROPDMGEXP)
CROPDMGEXP - magnitude code for CROPDMG (B, M, K, H, …)
library(plyr)
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(doBy)
## Loading required package: survival
library("knitr")
Loading and preprocessing the data
allStormData=read.csv("~/Downloads/repdata-data-StormData.csv.bz2")
dim(allStormData)
## [1] 902297 37
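read.csv can read the bzip2 archive without unpacking it first, because the file connection it opens detects the compression. The following is a minimal sketch on a made-up temporary file (tmp and its three lines are assumptions for illustration, not part of the storm data).

```r
# Write a tiny CSV through a bzip2 connection, then read it back the same
# way the report reads the storm file. The temporary file and its contents
# are invented for illustration.
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
writeLines(c("EVTYPE,FATALITIES,INJURIES",
             "TORNADO,0,15",
             "HAIL,0,0"), con)
close(con)

# read.csv opens the path with file(), which detects bzip2 compression.
small <- read.csv(tmp, stringsAsFactors = FALSE)
print(dim(small))
unlink(tmp)
```

The summaries in this report show EVTYPE and the EXP columns as factors, which is the read.csv default before R 4.0.0. From R 4.0.0 on, strings stay character unless stringsAsFactors = TRUE is passed.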
Select the variables that will be used later in the analysis.
stormData = allStormData[, c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
summary(stormData)
## EVTYPE FATALITIES INJURIES
## HAIL :288661 Min. : 0.0000 Min. : 0.0000
## TSTM WIND :219940 1st Qu.: 0.0000 1st Qu.: 0.0000
## THUNDERSTORM WIND: 82563 Median : 0.0000 Median : 0.0000
## TORNADO : 60652 Mean : 0.0168 Mean : 0.1557
## FLASH FLOOD : 54277 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## FLOOD : 25326 Max. :583.0000 Max. :1700.0000
## (Other) :170878
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## Min. : 0.00 :465934 Min. : 0.000 :618413
## 1st Qu.: 0.00 K :424665 1st Qu.: 0.000 K :281832
## Median : 0.00 M : 11330 Median : 0.000 M : 1994
## Mean : 12.06 0 : 216 Mean : 1.527 k : 21
## 3rd Qu.: 0.50 B : 40 3rd Qu.: 0.000 0 : 19
## Max. :5000.00 5 : 28 Max. :990.000 B : 9
## (Other): 84 (Other): 9
# Pandoc tables
kable(head(stormData), format = "pandoc")
| EVTYPE | FATALITIES | INJURIES | PROPDMG | PROPDMGEXP | CROPDMG | CROPDMGEXP |
|---|---|---|---|---|---|---|
| TORNADO | 0 | 15 | 25.0 | K | 0 | |
| TORNADO | 0 | 0 | 2.5 | K | 0 | |
| TORNADO | 0 | 2 | 25.0 | K | 0 | |
| TORNADO | 0 | 2 | 2.5 | K | 0 | |
| TORNADO | 0 | 2 | 2.5 | K | 0 | |
| TORNADO | 0 | 6 | 2.5 | K | 0 | |
1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Keep only the records that caused harm, i.e. at least one fatality or injury. These are used to find the event types most harmful to population health.
sub<-subset(stormData,
select=c("EVTYPE", "FATALITIES", "INJURIES"),
((!is.na(FATALITIES)) & (!is.na(INJURIES)) &
((FATALITIES > 0) | (INJURIES > 0)) ))
summary(sub)
## EVTYPE FATALITIES INJURIES
## TORNADO :7928 Min. : 0.0000 Min. : 0.000
## LIGHTNING :3305 1st Qu.: 0.0000 1st Qu.: 1.000
## TSTM WIND :2930 Median : 0.0000 Median : 1.000
## FLASH FLOOD : 931 Mean : 0.6906 Mean : 6.408
## THUNDERSTORM WIND: 682 3rd Qu.: 1.0000 3rd Qu.: 3.000
## EXCESSIVE HEAT : 678 Max. :583.0000 Max. :1700.000
## (Other) :5475
# Group the data by EVTYPE and sum the fatalities and injuries for each event type
sub=summaryBy(FATALITIES+INJURIES~EVTYPE, data=sub, FUN=sum)
# Sort by FATALITIES.sum + INJURIES.sum in descending order using arrange
sub = arrange(sub, desc(FATALITIES.sum + INJURIES.sum))
# Keep the 10 most harmful event types
sub=sub[1:10,]
sub
## EVTYPE FATALITIES.sum INJURIES.sum
## 1 TORNADO 5633 91346
## 2 EXCESSIVE HEAT 1903 6525
## 3 TSTM WIND 504 6957
## 4 FLOOD 470 6789
## 5 LIGHTNING 816 5230
## 6 HEAT 937 2100
## 7 FLASH FLOOD 978 1777
## 8 ICE STORM 89 1975
## 9 THUNDERSTORM WIND 133 1488
## 10 WINTER STORM 206 1321
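The group, sum, and rank steps above can also be written in base R. This is a minimal sketch on invented toy records (the data frame toy, its numbers, and the TOTAL column are assumptions for illustration), not the code that produced the table.

```r
# Toy records; the numbers are invented for illustration only.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TSTM WIND", "THUNDERSTORM WIND", "TORNADO", "FLOOD"),
  FATALITIES = c(2, 0, 1, 3, 1),
  INJURIES   = c(10, 4, 2, 20, 0),
  stringsAsFactors = FALSE
)

# Sum both columns per event type (base R counterpart of summaryBy with FUN = sum).
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Rank by combined casualties, largest first.
totals$TOTAL <- totals$FATALITIES + totals$INJURIES
totals <- totals[order(-totals$TOTAL), ]
print(totals)
```

Storing the combined count in its own column makes the ranking criterion visible in the printed table, which the two separate sums do not.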
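Draw bar plots: first a single chart with fatalities (blue) drawn over injuries (red), then the two measures in side-by-side panels.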
barplot(sub$INJURIES.sum,
names.arg = sub$EVTYPE,
main = "Fatalities and Injuries",
ylab = "count of population health damages",
cex.axis = 0.7,cex.names = 0.7,
las = 2, col="red")
barplot(sub$FATALITIES.sum,
cex.axis = 0.7,cex.names = 0.7,
las = 2,col="blue",add=TRUE)
par(mfrow = c(1, 2))
barplot(sub$FATALITIES.sum,
names.arg = sub$EVTYPE,
main = "Fatalities",
ylab = "fatalities",
cex.axis = 0.7, col="blue",
cex.names = 0.7, las = 2)
barplot(sub$INJURIES.sum, names.arg = sub$EVTYPE, main = "Injuries",
ylab = "injuries",
cex.axis = 0.7, col="red",
cex.names = 0.7, las = 2)
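As an alternative to drawing one series over the other, barplot accepts a matrix and can place the two measures next to each other with a legend. The sketch below uses the top three rows of the table above. The names top, counts, and mids are illustrative, and pdf(NULL) is there only so the example runs without a display.

```r
# Top three rows of the casualty table shown earlier in this report.
top <- data.frame(
  EVTYPE     = c("TORNADO", "EXCESSIVE HEAT", "TSTM WIND"),
  FATALITIES = c(5633, 1903, 504),
  INJURIES   = c(91346, 6525, 6957)
)

# barplot() draws one group per matrix column and one bar per matrix row.
counts <- rbind(Fatalities = top$FATALITIES, Injuries = top$INJURIES)
colnames(counts) <- top$EVTYPE

pdf(NULL)  # null device, an assumption so the sketch runs without a display
mids <- barplot(counts, beside = TRUE, col = c("blue", "red"),
                legend.text = TRUE, las = 2, cex.names = 0.7,
                main = "Fatalities and injuries by event type",
                ylab = "count")
dev.off()
```

With beside = TRUE neither series can hide the other, which matters when the smaller measure is the one of interest.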
2. Across the United States, which types of events have the greatest economic consequences?
Keep only the records that caused damage, i.e. a positive property or crop damage amount. These are used to find the event types with the greatest economic consequences.
sub=subset(stormData,
select=c("EVTYPE","PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP"),
(!is.na(PROPDMG)) & (!is.na(CROPDMG)) &
((PROPDMG > 0) | (CROPDMG > 0)))
Replace the codes in CROPDMGEXP and PROPDMGEXP with numeric multipliers according to these rules:
B -> 1,000,000,000
M, m -> 1,000,000
K, k -> 1,000
H, h -> 100
+, -, ?, blank, 0 -> 0
1-8 -> 10
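The mapping can also be expressed as a single lookup function. This is a sketch only: exp_to_multiplier is a hypothetical helper, not part of the original analysis, and it follows the multipliers passed to mapvalues in the code below, where digit codes count as 10.

```r
# Hypothetical helper (not part of the original analysis): translate an
# exponent code into a numeric multiplier.
exp_to_multiplier <- function(code) {
  code <- toupper(as.character(code))      # "k" and "K" are treated alike
  lookup <- c(B = 1e9, M = 1e6, K = 1e3, H = 1e2)
  out <- unname(lookup[code])              # NA for anything not in the lookup
  out[code %in% as.character(1:8)] <- 10   # digit codes, as in the mapvalues call
  out[is.na(out)] <- 0                     # "+", "-", "?", "0", blank, unknown
  out
}

multipliers <- exp_to_multiplier(c("K", "m", "B", "h", "5", "", "?", "0"))
print(multipliers)

# A PROPDMG of 25.0 with PROPDMGEXP "K" is 25,000 dollars.
print(25.0 * exp_to_multiplier("K"))
```

Sending every unrecognised code to 0 drops those records from the totals. That is an assumption carried over from the rules above, not a statement about how NOAA defines the codes.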
unique(sub$CROPDMGEXP)
## [1] M K m B ? 0 k
## Levels: ? 0 2 B k K m M
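# Treat "?" and blank codes as a multiplier of 0
sub$CROPDMGEXP[sub$CROPDMGEXP == "?" | sub$CROPDMGEXP == ""]="0"
# Map each remaining code to its numeric multiplier
CROPDMGDol= mapvalues(sub$CROPDMGEXP,
from=c("M","K","m","B","k","0"),
to=c(1e6,1e3,1e6,1e9,1e3,0e0))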
unique(sub$PROPDMGEXP)
## [1] K M B m + 0 5 6 4 h 2 7 3 H -
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
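# Treat "+", "-", and blank codes as a multiplier of 0
sub$PROPDMGEXP[sub$PROPDMGEXP == "+" |
sub$PROPDMGEXP == "" |
sub$PROPDMGEXP == "-"]="0"
# Map each remaining code to its numeric multiplier; digit codes count as 10
PROPDMGDol = mapvalues(sub$PROPDMGEXP,
from=c("K","M", "B","m","5","6","4","2","3","h","7","H","1","8","0"),
# Unused alternative that reads digit codes as powers of ten:
#to=c(1e3,1e6,1e9,1e6, 1e5,1e6,1e4,1e2,1e3,1e2,1e7,1e2, 1e1,1e8,0e0))
to=c(1e3,1e6,1e9,1e6, 1e1,1e1,1e1,1e1,1e1,1e2,1e1,1e2, 1e1,1e1,0e0))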
# Create new columns with the dollar amounts for property and crop damage
sub$PROP=as.numeric(as.vector(PROPDMGDol))*sub$PROPDMG
sub$CROP=as.numeric(as.vector(CROPDMGDol))*sub$CROPDMG
# Group the data by EVTYPE and sum the crop and property damage for each event type
sub=summaryBy(CROP + PROP ~ EVTYPE, data=sub, FUN=sum)
# Sort by PROP.sum + CROP.sum in descending order using arrange
sub <- arrange(sub, desc(PROP.sum + CROP.sum))
# Keep the 10 event types with the greatest economic consequences
sub=sub[1:10,]
sub
## EVTYPE CROP.sum PROP.sum
## 1 FLOOD 5661968450 144657709800
## 2 HURRICANE/TYPHOON 2607872800 69305840000
## 3 TORNADO 414953110 56937161502
## 4 STORM SURGE 5000 43323536000
## 5 HAIL 3025954450 15732267520
## 6 FLASH FLOOD 1421317100 16140812396
## 7 DROUGHT 13972566000 1046106000
## 8 HURRICANE 2741910000 11868319010
## 9 RIVER FLOOD 5029459000 5118945500
## 10 ICE STORM 5022113500 3944927810
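The ranking depends on the sum of the two columns, which is hard to read at this scale. As a small worked example, the sketch below recomputes the combined damage for the top three rows of the table above and expresses it in billions. The column name TOTAL.billion is illustrative.

```r
# Top three rows copied from the damage table shown above.
top <- data.frame(
  EVTYPE   = c("FLOOD", "HURRICANE/TYPHOON", "TORNADO"),
  CROP.sum = c(5661968450, 2607872800, 414953110),
  PROP.sum = c(144657709800, 69305840000, 56937161502)
)

# Combined damage in billions, rounded to one decimal place.
top$TOTAL.billion <- round((top$CROP.sum + top$PROP.sum) / 1e9, 1)
print(top[, c("EVTYPE", "TOTAL.billion")])
```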
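Draw bar plots: first a single chart with crop damage (blue) drawn over property damage (red), then the two measures in side-by-side panels.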
barplot(sub$PROP.sum,
names.arg = sub$EVTYPE,
main = "Property and crop damage",
ylab = "Value of economic damages",
cex.axis = 0.7,cex.names = 0.7,
las = 2, col="red")
barplot(sub$CROP.sum,
cex.axis = 0.7,cex.names = 0.7,
las = 2,col="blue",add=TRUE)
par(mfrow = c(1, 2))
barplot(sub$CROP.sum,
names.arg = sub$EVTYPE,
main = "Crop damage",
ylab = "crop",
cex.axis = 0.7, col="blue",
cex.names = 0.7, las = 2)
barplot(sub$PROP.sum, names.arg = sub$EVTYPE,
main = "Property damage",
ylab = "property",
cex.axis = 0.7, col="red",
cex.names = 0.7, las = 2)
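1. Tornadoes (EVTYPE “TORNADO”) are the most harmful to population health, with 5,633 fatalities and 91,346 injuries.
2. Floods (EVTYPE “FLOOD”) have the greatest economic consequences, with about $144.7 billion in property damage and $5.7 billion in crop damage.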