Synopsis

This report analyzes the Storm Data set provided by the National Oceanic and Atmospheric Administration (NOAA) with respect to the impact of weather events on population health and the economic damage they cause across the US. Basic data exploration reveals data quality issues before 1993, so the earlier years are excluded from the analysis. The event types with the highest total numbers of fatalities and injuries are identified. Similarly, the event types causing the most damage to property and crops are extracted from the data.

Data Processing - Main steps

The data are processed in the following steps:

  1. Data are downloaded from the course website into a separate “data” subdirectory.
  2. The raw data are loaded into a data frame for further analysis. The relevant columns for the analysis are “EVTYPE”, “FATALITIES”, “INJURIES”, “PROPDMG” and “CROPDMG”.
  3. The damage to property and crops is added up per record. These are the two distinct economic damage categories covered in the data set; their sum is stored in a new “TOTDMG” column.
  4. The year of each recorded event is extracted from the “BGN_DATE” column, so that data quality can be checked on the basis of whole years.
  5. Data exploration is performed by analysing the trends over the period covered by the data. This is prompted by a comment in the documentation that data recording in the early years was not comprehensive.
  6. Summaries of “FATALITIES”, “INJURIES”, “PROPDMG” and “CROPDMG” per year are created and visualized (figure 1).
  7. Only the later years, 1993 onwards, are kept for further analysis because of the data quality issues before 1993.
  8. Summaries of “FATALITIES”, “INJURIES”, “PROPDMG” and “CROPDMG” per event type are created and sorted in descending order to identify the most serious event types.
  9. The top ten event types are shown as final results in figures 2 and 3.
  10. Meaningful names are assigned to the columns of the new summary data frames.

  # reshape2 provides dcast(); ggplot2 is attached, but the plots below use base graphics
  library(ggplot2)
  library(reshape2)

  # Download the raw file into the "data" subdirectory (only if not already there)
  url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
  data_dir <- "data"
  destfile <- "FStormData.csv.bz2"
  if (!dir.exists(data_dir)) dir.create(data_dir)
  dest_path_file <- file.path(data_dir, destfile)
  if (!file.exists(dest_path_file)) {
    download.file(url, dest_path_file, mode = "wb")  # binary mode keeps the archive intact
  }

  # read.csv() reads the bz2-compressed file directly
  FStorm <- read.csv(dest_path_file)

  # Total economic damage per record (property + crops)
  FStorm$TOTDMG <- FStorm$PROPDMG + FStorm$CROPDMG

  # Year of the event, extracted from BGN_DATE
  FStorm$YEAR <- as.integer(format(as.POSIXlt(as.character(FStorm$BGN_DATE), format = "%m/%d/%Y %H:%M:%S"), "%Y"))

  # Yearly totals for the data quality check; dcast() names the value column "."
  FAT_by_YEAR <- dcast(FStorm, YEAR ~ ., fun.aggregate = sum, value.var = "FATALITIES")
  INJ_by_YEAR <- dcast(FStorm, YEAR ~ ., fun.aggregate = sum, value.var = "INJURIES")
  PROP_by_YEAR <- dcast(FStorm, YEAR ~ ., fun.aggregate = sum, value.var = "PROPDMG")
  CROP_by_YEAR <- dcast(FStorm, YEAR ~ ., fun.aggregate = sum, value.var = "CROPDMG")

  # Keep only the years 1993 to 2011
  select_years <- 1993:2011
  FStorm <- FStorm[FStorm$YEAR %in% select_years, ]

  # Totals per event type
  FAT_by_EVTYPE <- dcast(FStorm, EVTYPE ~ ., fun.aggregate = sum, value.var = "FATALITIES")
  INJ_by_EVTYPE <- dcast(FStorm, EVTYPE ~ ., fun.aggregate = sum, value.var = "INJURIES")
  PROP_by_EVTYPE <- dcast(FStorm, EVTYPE ~ ., fun.aggregate = sum, value.var = "PROPDMG")
  CROP_by_EVTYPE <- dcast(FStorm, EVTYPE ~ ., fun.aggregate = sum, value.var = "CROPDMG")
  TOT_by_EVTYPE <- dcast(FStorm, EVTYPE ~ ., fun.aggregate = sum, value.var = "TOTDMG")

  # Sort each summary by its total, largest first
  FAT_ranked <- FAT_by_EVTYPE[order(FAT_by_EVTYPE[["."]], decreasing = TRUE), ]
  INJ_ranked <- INJ_by_EVTYPE[order(INJ_by_EVTYPE[["."]], decreasing = TRUE), ]
  PROP_ranked <- PROP_by_EVTYPE[order(PROP_by_EVTYPE[["."]], decreasing = TRUE), ]
  CROP_ranked <- CROP_by_EVTYPE[order(CROP_by_EVTYPE[["."]], decreasing = TRUE), ]
  TOT_ranked <- TOT_by_EVTYPE[order(TOT_by_EVTYPE[["."]], decreasing = TRUE), ]

  # Top ten per measure, side by side, with readable column names
  POP_ranked <- cbind(FAT_ranked[1:10, ], INJ_ranked[1:10, ])
  ECO_ranked <- cbind(PROP_ranked[1:10, ], CROP_ranked[1:10, ])
  colnames(POP_ranked) <- c("Eventtype(Fat)", "Fatalities", "Eventtype(Inj)", "Injuries")
  colnames(ECO_ranked) <- c("Eventtype(Prop)", "Property Damage", "Eventtype(Crop)", "Crop Damage")
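
The aggregate-and-sort pattern above is repeated once per measure and could be folded into a single helper. The following is a minimal sketch, not the code used for this report: `top_events` is a hypothetical name, it uses base R `aggregate()` rather than `dcast()`, and the small `toy` data frame is made up for illustration.

```r
# Hypothetical helper: sum one numeric column per event type and
# return the n event types with the largest totals.
top_events <- function(df, value_col, n = 10) {
  totals <- aggregate(df[[value_col]],
                      by = list(EVTYPE = df$EVTYPE),
                      FUN = sum)
  names(totals)[2] <- value_col
  ranked <- totals[order(totals[[value_col]], decreasing = TRUE), ]
  head(ranked, n)
}

# Made-up toy data, only to show the call
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HAIL"),
  FATALITIES = c(3, 1, 2, 0)
)
top_fat <- top_events(toy, "FATALITIES", n = 2)
print(top_fat)
```

Applied to the real data, a call such as `top_events(FStorm, "INJURIES")` would be expected to give the same ranking as the corresponding `dcast()` and `order()` steps, with a named value column instead of ".".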

Data exploration: Visualization of key data

  par(mfrow = c(2,2))
  plot(FAT_by_YEAR, ylab = "Fatalities")
  plot(PROP_by_YEAR, ylab = "Property Damage")
  plot(INJ_by_YEAR, ylab = "Injuries")
  plot(CROP_by_YEAR, ylab = "Crop Damage")

  dev.off()
## null device 
##           1

Fig. 1: Fatalities, injuries, property damage and crop damage over the years 1950 - 2011

The data can be divided into two phases: 1950 to 1992 and 1993 to 2011. Before 1993, crop damage was not recorded at all, and fatalities and injuries appear to be much lower. The simple conclusion drawn here is that the data set has quality issues before 1993; hence only the data from 1993 onwards are analyzed.
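
The visual impression can be backed up with a small per-year table. The following is a minimal sketch on made-up toy records, not on the real data: it counts the distinct event types per year and finds the first year with any recorded crop damage. The names `toy`, `year_check` and `first_crop_year` are hypothetical; in the report the same columns would come from `FStorm` before the year filter is applied.

```r
# Made-up toy records, only to illustrate the check
toy <- data.frame(
  YEAR    = c(1991, 1991, 1992, 1993, 1993, 1994),
  EVTYPE  = c("TORNADO", "TORNADO", "HAIL", "FLOOD", "HAIL", "TORNADO"),
  CROPDMG = c(0, 0, 0, 5, 2.5, 0)
)

# Per year: number of distinct event types and total crop damage
n_types  <- tapply(toy$EVTYPE, toy$YEAR, function(x) length(unique(x)))
crop_sum <- tapply(toy$CROPDMG, toy$YEAR, sum)
year_check <- data.frame(
  YEAR      = as.integer(names(n_types)),
  N_EVTYPES = as.vector(n_types),
  CROPDMG   = as.vector(crop_sum)
)

# First year in which any crop damage was recorded
first_crop_year <- min(year_check$YEAR[year_check$CROPDMG > 0])
print(year_check)
print(first_crop_year)
```

A jump in either column at a particular year would support choosing that year as the cut-off.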

Results

The numbers of fatalities and injuries are summed per event type and sorted in descending order. The top ten event types for each measure are shown in figure 2, and the corresponding rankings for property and crop damage in figure 3. The column headings have been given meaningful names.

  print(POP_ranked)
##     Eventtype(Fat) Fatalities    Eventtype(Inj) Injuries
## 130 EXCESSIVE HEAT       1903           TORNADO    23310
## 834        TORNADO       1621             FLOOD     6789
## 153    FLASH FLOOD        978    EXCESSIVE HEAT     6525
## 275           HEAT        937         LIGHTNING     5230
## 464      LIGHTNING        816         TSTM WIND     3631
## 170          FLOOD        470              HEAT     2100
## 585    RIP CURRENT        368         ICE STORM     1975
## 359      HIGH WIND        248       FLASH FLOOD     1777
## 856      TSTM WIND        241 THUNDERSTORM WIND     1488
## 19       AVALANCHE        224      WINTER STORM     1321

Fig. 2: Top ten event types by total fatalities and by total injuries over the years 1993 - 2011, in descending order

  print(ECO_ranked)
##        Eventtype(Prop) Property Damage    Eventtype(Crop) Crop Damage
## 153        FLASH FLOOD       1420124.6               HAIL   579596.28
## 834            TORNADO       1387757.1        FLASH FLOOD   179200.46
## 856          TSTM WIND       1335965.6              FLOOD   168037.88
## 170              FLOOD        899938.5          TSTM WIND   109202.60
## 760  THUNDERSTORM WIND        876844.2            TORNADO   100018.52
## 244               HAIL        688693.4  THUNDERSTORM WIND    66791.45
## 464          LIGHTNING        603351.8            DROUGHT    33898.62
## 786 THUNDERSTORM WINDS        446293.2 THUNDERSTORM WINDS    18684.93
## 359          HIGH WIND        324731.6          HIGH WIND    17283.21
## 972       WINTER STORM        132720.6         HEAVY RAIN    11122.80

Fig. 3: Top ten event types by property damage and by crop damage over the years 1993 - 2011, in descending order
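
The totals above sum the raw “PROPDMG” and “CROPDMG” columns. In the Storm Data file these numbers are accompanied by the magnitude columns “PROPDMGEXP” and “CROPDMGEXP” (for example K, M and B for thousands, millions and billions), which the processing above does not apply. The following is a minimal sketch of scaling the values to dollars before summing. It assumes that only K, M and B are meaningful and treats every other code as a multiplier of 1, which is a simplification; the helper name `scale_damage` and the toy values are hypothetical.

```r
# Hypothetical helper: combine a damage value with its magnitude code.
# Assumption: K/M/B (any case) mean 1e3/1e6/1e9; all other codes are left unscaled.
scale_damage <- function(value, exp_code) {
  multiplier <- c(K = 1e3, M = 1e6, B = 1e9)
  m <- multiplier[toupper(as.character(exp_code))]
  m[is.na(m)] <- 1          # unknown or empty code: keep the value as it is
  value * unname(m)
}

# Made-up toy values, only to show the call
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1, 10),
  PROPDMGEXP = c("K", "M", "B", "")
)
toy$PROPDMG_USD <- scale_damage(toy$PROPDMG, toy$PROPDMGEXP)
print(toy)
```

Rankings based on scaled values may differ from those in figure 3, since a single event recorded in billions can outweigh many events recorded in thousands.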

Conclusion

Surprisingly, excessive heat caused the most fatalities in the years 1993-2011, whereas tornadoes caused by far the most injuries. It is assumed that people with health problems are at high risk during periods of excessive heat, so hospitals need to be in a state of heightened alert in such periods.

Property damage totals are highest for flooding (flash flood ranks #1, flood #4), tornadoes and strong winds. Buildings could be protected through a variety of suitable measures. Crops are most endangered by hail; the recommendation would be to have suitable insurance in place.
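
The tables list TSTM WIND, THUNDERSTORM WIND and THUNDERSTORM WINDS as separate event types, so closely related events are split across several rows. The following is a minimal sketch of merging such label variants before aggregating. The two rewrite rules are illustrative assumptions rather than an official NOAA mapping, and `normalize_evtype` is a hypothetical name.

```r
# Hypothetical helper: harmonize a few spelling variants of event type labels.
normalize_evtype <- function(evtype) {
  x <- toupper(trimws(as.character(evtype)))
  x <- sub("^TSTM", "THUNDERSTORM", x)                       # expand the abbreviation
  x <- sub("^THUNDERSTORM WINDS?$", "THUNDERSTORM WIND", x)  # fold the plural form
  x
}

labels <- c("TSTM WIND", "THUNDERSTORM WIND", "THUNDERSTORM WINDS", "Hail ")
merged <- normalize_evtype(labels)
print(unique(merged))
```

In the report this would be applied to `FStorm$EVTYPE` before the per-event-type summaries are computed; further rules, for example for the heat-related labels, would have to be chosen deliberately.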