Synopsis

In the report, we make use of the data from NOAA Storm Database and explore the damage on human life and economics. The data is downloaded and transformed to be processable. The damage is evaluated by the total and mean number of number of people injured/died, economic loss (in dollars). It is found that Tornado causes most people injury/death while Flood causes most economic loss.

Data Processing

First we download and load the data, store the data in a variable called stormdata.

con <- "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
tmp <- tempfile()
download.file(con, destfile = tmp, quiet = TRUE, mode = 'wb')
stormdata <- read.csv(bzfile(tmp))
unlink(tmp)
str(stormdata)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

Note the data frame contains a lot of information and is a bit messy, so we need to clean the data. First, we get rid of the columns that will not be used in this report, and preserve those with the information of type of events, human injured/died, property damages and time.

library(plyr)
## 
## Attaching package: 'plyr'
## 
## The following object is masked _by_ '.GlobalEnv':
## 
##     rename
library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:plyr':
## 
##     arrange, desc, failwith, id, mutate, summarise, summarize
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
stormdata <- stormdata %>%
  select(EVTYPE, contains("DATE"), contains("DMG"), FATALITIES, INJURIES)

So now we narrow the data down to 9 columns. Among these columns, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP are related to economic loss. We want to combine them together to a number shows how large the economic loss is in dollars. The formula would be Total loss = PROPDMG10^PROPDMGEXP + CROPDMG10^CROPDMGEXP. To do this, we need to change the DMGEXPs to numeric format.

# map H/h, K/k, M/m, B/b to 2,3,6,9 respectively, map empty string to 0
# map unexpected symbols to -1 so that the outcome should be close to 0
# map numbers to numbers
stormdata$PROPDMGEXP <- mapvalues(stormdata$PROPDMGEXP, 
                                  from = c("", "-", "?", "+", "B", 
                                  "h", "H", "K", "m", "M"), 
                                 to = c("0", "-1", "-1", "-1", "9", "2", 
                                  "2", "3", "6", "6" ))
stormdata$CROPDMGEXP <- mapvalues(stormdata$CROPDMGEXP, 
                                  from = c("", "?", "B", "k", "K", "m", "M"), 
                                  to = c("0", "-1", "9", "3", "3", "6", "6" ))

# change factor to numeric type
stormdata$CROPDMGEXP <- as.numeric(levels(stormdata$CROPDMGEXP))[stormdata$CROPDMGEXP]
stormdata$PROPDMGEXP <- as.numeric(levels(stormdata$PROPDMGEXP))[stormdata$PROPDMGEXP]

# store the total economic loss in new column called TotalLoss
stormdata <- mutate(stormdata, TotalLoss = PROPDMG*10^PROPDMGEXP + CROPDMG*10^CROPDMGEXP)

Next we deal with the column EVTYPE. Use summary function to find the 15 most frequent events.

summary(stormdata$EVTYPE, maxsum = 15)
##               HAIL          TSTM WIND  THUNDERSTORM WIND 
##             288661             219940              82563 
##            TORNADO        FLASH FLOOD              FLOOD 
##              60652              54277              25326 
## THUNDERSTORM WINDS          HIGH WIND          LIGHTNING 
##              20843              20212              15754 
##         HEAVY SNOW         HEAVY RAIN       WINTER STORM 
##              15708              11723              11433 
##     WINTER WEATHER       FUNNEL CLOUD            (Other) 
##               7026               6839              61340

Some of the events are clearly the same, like “TSTM WIND”, “THUNDERSTORM WIND”, “THUNDERSTORM WINDS”, so we need to merge them.

rename <- function (x){
  inlist <- list("HAIL", c("TSTM WIND", "THUNDERSTORM WIND", 
                 "THUNDERSTORM WINDS"), "TORNADO", c("FLASH FLOOD", "FLOOD"), 
                 "HIGH WIND", "LIGHTNING", "HEAVY SNOW", "HEAVY RAIN", 
                 "WINTER STORM")
  outlist <- c("HAIL", "THUNDERSTORM", "TORNADO", "FLOOD",
               "HIGH WIND", "LIGHTNING", "HEAVY SNOW", "HEAVY RAIN",
               "WINTER STORM")
  for (i in 1:length(inlist)){
    if (x %in% inlist[[i]])
      return(outlist[i])
  }
  return("Other")
}
stormdata$EVTYPE <- as.factor(sapply(stormdata$EVTYPE, rename))

The time information in BGN_DATE are too detailed, we only need to the which year it is.

require(lubridate)
## Loading required package: lubridate
## 
## Attaching package: 'lubridate'
## 
## The following object is masked from 'package:plyr':
## 
##     here
date <- strptime(stormdata$BGN_DATE, "%m/%d/%Y %H:%M:%S")
stormdata$year <- year(date)

Now the data is ready to process, we can decide which type of event cost more human life or economic loss by looking at the mean or sum of certain column. We did not mix the number of death and the number of injuries together because death is much more serious.

sum_data <- summarize(group_by(stormdata, EVTYPE), 
          TotalFatality = sum(FATALITIES), 
          TotalInjure = sum(INJURIES), 
          TotalEcoLoss = sum(TotalLoss), 
          meanFatality = mean(FATALITIES), 
          meanInjure = mean(INJURIES), 
          meanEcoLoss = mean(TotalLoss))
sum_data_year <- summarize(group_by(stormdata, EVTYPE, year),
          TotalFatality = sum(FATALITIES), 
          TotalInjure = sum(INJURIES), 
          TotalEcoLoss = sum(TotalLoss), 
          meanFatality = mean(FATALITIES), 
          meanInjure = mean(INJURIES), 
          meanEcoLoss = mean(TotalLoss))

Results

The information in sum_data can help us find out which type of events is the most harmful, with regard to population health and economic consequence.

library(knitr)
kable(sum_data, format = "markdown")
## 
## 
## |EVTYPE       | TotalFatality| TotalInjure| TotalEcoLoss| meanFatality| meanInjure| meanEcoLoss|
## |:------------|-------------:|-----------:|------------:|------------:|----------:|-----------:|
## |FLOOD        |          1448|        8566|    1.686e+11|       0.0182|     0.1076|     2117554|
## |HAIL         |            15|        1361|    1.876e+10|       0.0001|     0.0047|       64994|
## |HEAVY RAIN   |            98|         251|    1.428e+09|       0.0084|     0.0214|      121782|
## |HEAVY SNOW   |           127|        1021|    1.067e+09|       0.0081|     0.0650|       67953|
## |HIGH WIND    |           248|        1137|    5.909e+09|       0.0123|     0.0563|      292332|
## |LIGHTNING    |           816|        5230|    9.425e+08|       0.0518|     0.3320|       59824|
## |Other        |          5853|       20942|    2.055e+11|       0.0778|     0.2785|     2732639|
## |THUNDERSTORM |           701|        9353|    1.107e+10|       0.0022|     0.0289|       34242|
## |TORNADO      |          5633|       91346|    5.736e+10|       0.0929|     1.5061|      945762|
## |WINTER STORM |           206|        1321|    6.715e+09|       0.0180|     0.1155|      587374|

It can already be told from the “total” columns of the table that TORNADO cause the most life and injury, while FLOOD is most harmful by economic concerns (ignore Other). For single event, we look at the “mean” columns. Now we see that on average TORNADO cause much more injuries and death that other type of events, while on average FLOOD is also the most harmful by economic concerns.

library(ggplot2)
qplot(year, TotalInjure, data = sum_data_year,
      main = "Number of people injures by different events every year",
      ylab = "No. of injures",
      color = EVTYPE, geom = c("point", "line"))

plot of chunk plot1

From the plot of yearly injuries, we see that some events are not recorded properly until 1990s. So the “mean” columns in the previous table is more trustworthy that “total”s. Again, TORNADO cause more injures that other events, but we found that it is mainly due to some catastrophies in some year.

qplot(year, TotalEcoLoss, data = sum_data_year,
      main = "Economic loss caused by differnet events every year", 
      ylab = "Economic Loss ($)",
      color = EVTYPE, geom = c("point", "line"))

plot of chunk plot2

From the yearly economic loss, the consequence of single event on the statistics is more obvious. Also, it could be that economic loss grows over years because of economic growth.