In this report, I analyse the NOAA Storm Events data to answer two questions:
1. Across the United States, which types of events are most harmful with respect to population health?
2. Across the United States, which types of events have the greatest economic consequences?
The raw data run from 1950 to November 2011 and contain 902297 observations of 37 variables each. According to the documentation, only 7 of these variables relate to the two questions: they record fatalities, injuries, and economic damage. This analysis focuses only on them.
Settings and libraries used
library("knitr")
library("ggplot2")
library("plotrix")
library("lubridate")
library("chron")
library("dplyr")
library("tidyr")
library("data.table")
library("datasets")
library("lattice")
library("xtable")
options(rpubs.upload.method = "internal")
options(RCurlOptions = list(verbose = FALSE, capath = system.file("CurlSSL", "cacert.pem", package = "RCurl"), ssl.verifypeer = FALSE))
opts_chunk$set(warning = FALSE, error = FALSE, message = FALSE)
Reading raw data
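# bzfile()'s `open` argument is a mode string such as "r", not a file name;
# left unset, read.csv() opens and closes the connection itself
DataFile <- bzfile(description = "repdata_data_StormData.csv.bz2")
Table <- read.csv(DataFile)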
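As a side note on the reading step, `read.csv` can usually open a bzip2-compressed file directly from its path, because R's file connections detect the compression when reading. The sketch below is self-contained: it writes a tiny made-up table (not the storm data) to a temporary `.bz2` file and reads it back. The names `tmp`, `con`, and `small` are illustrative only.

```r
# Write a small, made-up data frame to a bzip2-compressed CSV in a temp file
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(data.frame(EVTYPE = c("TORNADO", "HAIL"),
                     FATALITIES = c(1, 0)),
          con, row.names = FALSE)
close(con)

# Read it back by path; the compression is detected automatically
small <- read.csv(tmp)
dim(small)
```

For the real file, the same call would simply take "repdata_data_StormData.csv.bz2" as its path.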
Because only 7 of the 37 variables are used, the others can be removed safely. Furthermore, PROPDMGEXP and CROPDMGEXP can be dropped once they are merged into the PROPDMG and CROPDMG values.
Table<- Table[c("EVTYPE","FATALITIES", "INJURIES","PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
Because the documentation defines only “B”, “K”, and “M”, I assume that “b”, “k”, and “m” mean the same as their upper-case forms; any other symbol is treated as a multiplier of 1.
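# Convert an exponent code (B, M, K in either case) into its numeric multiplier
clean <- function(x){
  y <- rep(1, length(x))  # undefined codes keep a multiplier of 1
  y[x %in% c("B","b")] <- 1000000000
  y[x %in% c("M","m")] <- 1000000
  y[x %in% c("K","k")] <- 1000
  return (y)
}
# Coerce to numeric and replace missing values with 0
clear <- function(x){
  y <- as.numeric(x)
  y[is.na(y)] <- 0
  return (y)
}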
Table$ECONOMIC<-clear(Table$PROPDMG)*clean(Table$PROPDMGEXP)+
clear(Table$CROPDMG)*clean(Table$CROPDMGEXP)
Table <- Table[c("EVTYPE","FATALITIES", "INJURIES","ECONOMIC")]
str(Table)
## 'data.frame': 902297 obs. of 4 variables:
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ ECONOMIC : num 25000 2500 25000 2500 2500 2500 2500 2500 25000 25000 ...
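The exponent handling above can also be written as a lookup on a named vector. The following is only a sketch under the same assumption (lower-case codes equal their upper-case forms, anything else counts as 1); `exp_multiplier`, `damage`, and `code` are hypothetical names, and the sample values are made up rather than taken from the storm data.

```r
# Hypothetical helper: map an exponent code to its multiplier via a lookup vector
exp_multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  m <- unname(lookup[toupper(as.character(code))])
  m[is.na(m)] <- 1  # unknown or empty codes fall back to 1
  m
}

# Made-up sample values, not rows from the storm data
damage <- c(25, 2.5, 1.2, 7)
code   <- c("K", "k", "M", "")

# Damage in dollars: 25 thousand, 2.5 thousand, 1.2 million, and a bare 7
damage * exp_multiplier(code)
```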
The next problem is that there are too many event types, many of them with unclear or inconsistent names, so I merge them into a few broader groups.
Table$EVTYPE <- toupper(as.character(Table$EVTYPE))
Table<-Table[!grepl("SUMMARY",Table$EVTYPE),]
Table$EVTYPE[grepl("DUST|ASH|VOLCANIC|VOG", Table$EVTYPE)] <- "Dust/Vocano"
Table$EVTYPE[grepl("TSTM|THUNDERSTORM|STORM|SPOUT|HURRICANE|TYPHOON|TORNADO|TORNDAO|DOWNBURST|MICROBURST|WIND|WND|GUSTNADO", Table$EVTYPE)] <- "Storm/Tornado/Wind"
Table$EVTYPE[grepl("WARMTH|HEAT|HOT|WARM|HIGH TEMPERATURE|HYPERTHERMIA|HYPOTHERMIA|DRY|DRIEST|DROUGHT|FIRES|FIRE|WILDFIRE|RED FLAG",Table$EVTYPE)] <- "Heat/Dry/Fire"
Table$EVTYPE[grepl("FROST|COLD|SLEET|FREEZE|WINTER|WINTRY|FREEZING|ICY|LOW TEMP|COOL|ICE|GLAZE|SNOW|BLIZZARD|HAIL",Table$EVTYPE)] <- "Cold/Hail/Snow"
Table$EVTYPE[grepl("LIGHTNING|LIGHTING|LIGNTNING",Table$EVTYPE)]<-"Lightning"
Table$EVTYPE[grepl("CURRENT|COASTAL|BEACH|TIDE|TIDES|TSUNAMI|SURF|WAVES|WAVE|SEAS|SWELL|SWELL|MARINE",Table$EVTYPE)]<-"Ocean conditions"
Table$EVTYPE[grepl("WET|WETNESS|RAIN|PRECIPATATION|PRECIPITATION|PRECIP|SHOWER|SHOWERS|FLOOD|MUD|FLOODING|SEICHE|FLD|STREAM|FLOYD|DROWNING|DAM|RISING WATER|TURBULENCE|HIGH WATER|SLIDE",Table$EVTYPE)] <- "Rain/Wet/Flood"
Table$EVTYPE[grepl("FOG|CLOUD|SMOKE|FUNNEL",Table$EVTYPE)] <- "Fog/Smoke"
Table$EVTYPE[grepl("RECORD|OTHER|DEPRESSION|SOUTHEAST|TEMPERATURE|NO*|URBAN|COUNTY|EXCESSIVE|MIX|HIGH|[?]",Table$EVTYPE)] <- "Other"
Table <- summarise(group_by(Table,EVTYPE),ECONOMIC_CONSEQUENCES = sum(ECONOMIC),INJURIES = sum(INJURIES),FATALITIES=sum(FATALITIES))
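The merging above depends on the order of the assignments: each `grepl` pass relabels whatever it matches, and because the new labels are mixed case while the patterns are upper case, later passes do not touch rows that were already relabelled. A minimal sketch of that behaviour, using made-up event names and damage values and shortened patterns (the vectors `ev` and `dmg` are illustrative only):

```r
# Made-up event names and damage values, for illustration only
ev  <- toupper(c("Tstm Wind", "WINTER STORM", "Heavy Snow", "Flash Flood", "LIGHTNING"))
dmg <- c(10, 20, 30, 40, 50)

# Passes run in sequence; "WINTER STORM" is claimed by the first pass
ev[grepl("TSTM|STORM|WIND", ev)] <- "Storm/Tornado/Wind"
ev[grepl("WINTER|SNOW", ev)]     <- "Cold/Hail/Snow"
ev[grepl("FLOOD", ev)]           <- "Rain/Wet/Flood"
ev[grepl("LIGHTNING", ev)]       <- "Lightning"

# Sum damage per merged group (base R stand-in for group_by + summarise)
totals <- tapply(dmg, ev, sum)
totals
```

Here "WINTER STORM" lands in Storm/Tornado/Wind rather than Cold/Hail/Snow, simply because the storm pattern runs first; reordering the passes would change the grouping.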
Now that the raw data are processed, the next job is to answer the two questions.
The number of injuries and the number of fatalities are hard to merge into a single figure. So I do not merge them; instead, I split the first question into two subquestions and find out which type of event is most harmful in terms of fatalities and injuries separately.
pie(Table$FATALITIES,col=rainbow(nrow(Table)),main="Number of fatalities")
legend("topright",Table$EVTYPE,title="Type of events",cex=0.8,fill=rainbow(nrow(Table)))
Tmp <-Table[rev(order(Table$INJURIES)),c("EVTYPE","INJURIES")]
colnames(Tmp) <- c("Type of events","Number of injuries")
kable(Tmp,format="html")
Type of events | Number of injuries |
---|---|
Storm/Tornado/Wind | 108091 |
Heat/Dry/Fire | 10855 |
Rain/Wet/Flood | 7235 |
Lightning | 5231 |
Cold/Hail/Snow | 4644 |
Dust/Vocano | 2285 |
Fog/Smoke | 1079 |
Ocean conditions | 933 |
Other | 175 |
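To put the injury counts in proportion, the figures from the table above can be turned into percentage shares. This is a small sketch that re-enters the table values by hand; the names `injuries`, `share`, and `ratio` are illustrative and not part of the original analysis.

```r
# Injury counts copied from the table above
injuries <- c("Storm/Tornado/Wind" = 108091,
              "Heat/Dry/Fire"      = 10855,
              "Rain/Wet/Flood"     = 7235,
              "Lightning"          = 5231,
              "Cold/Hail/Snow"     = 4644,
              "Dust/Vocano"        = 2285,
              "Fog/Smoke"          = 1079,
              "Ocean conditions"   = 933,
              "Other"              = 175)

# Percentage share of all injuries per group, rounded to one decimal
share <- round(100 * injuries / sum(injuries), 1)
share

# How many times larger the top group is than the second one (about 9.96)
ratio <- injuries[["Storm/Tornado/Wind"]] / injuries[["Heat/Dry/Fire"]]
ratio
```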
Tmp <- Table[rev(order(Table$ECONOMIC_CONSEQUENCES)),c("EVTYPE","ECONOMIC_CONSEQUENCES")]
colnames(Tmp) <- c("Type of events","Economic Damages($)")
kable(Tmp,format="html")
Type of events | Economic Damages($) |
---|---|
Storm/Tornado/Wind | 241616734559 |
Rain/Wet/Flood | 165473075475 |
Heat/Dry/Fire | 24848387160 |
Cold/Hail/Snow | 24347105722 |
Dust/Vocano | 18450563467 |
Lightning | 945824537 |
Ocean conditions | 709588710 |
Fog/Smoke | 23124100 |
Other | 8438750 |
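The dollar amounts are easier to compare in billions. The sketch below re-enters the table values by hand and rescales them; `damages`, `billions`, and `top_two_share` are illustrative names only.

```r
# Economic damage in dollars, copied from the table above
damages <- c("Storm/Tornado/Wind" = 241616734559,
             "Rain/Wet/Flood"     = 165473075475,
             "Heat/Dry/Fire"      = 24848387160,
             "Cold/Hail/Snow"     = 24347105722,
             "Dust/Vocano"        = 18450563467,
             "Lightning"          = 945824537,
             "Ocean conditions"   = 709588710,
             "Fog/Smoke"          = 23124100,
             "Other"              = 8438750)

# Same figures in billions of dollars, one decimal place
billions <- round(damages / 1e9, 1)
billions

# Fraction of total damage caused by the two largest groups together
top_two_share <- sum(damages[1:2]) / sum(damages)
top_two_share
```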
Clearly, on all three measures, Storm/Tornado/Wind is the most harmful type of event. It accounts for about half of all fatalities, and the number of injuries it caused is roughly ten times that of the second-ranked group (Heat/Dry/Fire). On the other hand, although Heat/Dry/Fire ranks second only to Storm/Tornado/Wind in injuries and fatalities, it does not affect the economy as much as Rain/Wet/Flood does.