Summary

This document analyzes the influence of different weather anomalies on public health and the economy in the United States. The research is based on data collected by the U.S. National Oceanic and Atmospheric Administration (NOAA) from 1950 to 2011.

The analysis shows that the most dangerous weather events for the US are tornadoes (which affect the largest number of people in terms of both fatalities and non-fatal injuries) and floods (which cause the most severe economic damage, although tornadoes also contribute substantially to the total damage).

Data processing

For the analysis we’ll need two popular libraries: dplyr for data manipulation and ggplot2 for plotting.

All files needed for the analysis are stored in the NOAA directory inside the current working directory.

library(dplyr)
library(ggplot2)

CurrentDirectory <- getwd()

if(!file.exists("NOAA"))
    dir.create("NOAA")
setwd("NOAA")

We download the data from the source each time just to be sure it’s up to date.

download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", 
              "NOAAData.bz2", method = "curl")
NOAA.file <- read.csv(bzfile("NOAAData.bz2"), stringsAsFactors = FALSE)

Since each damage estimate is stored in two columns instead of one, we need to combine them into a single column. For property damage, the data is contained in the PROPDMG column (mantissa) and the PROPDMGEXP column (exponent). Furthermore, the exponent may be given either as a number or as a letter (H, K, M or B for hundreds, thousands, millions and billions). So we first convert the letter exponents to numeric form, then combine mantissa and exponent and store the result in the PROPDMG column.

Every exponent value other than these letters and the numbers 0 to 12 is treated as NA. In almost all such cases the mantissa is equal to zero, so it’s safe to assume that there was no property damage in these cases.

NOAA.file[NOAA.file$PROPDMGEXP %in% c("B", "b"), "PROPDMGEXP"] <- 9
NOAA.file[NOAA.file$PROPDMGEXP %in% c("M", "m"), "PROPDMGEXP"] <- 6
NOAA.file[NOAA.file$PROPDMGEXP %in% c("K", "k"), "PROPDMGEXP"] <- 3
NOAA.file[NOAA.file$PROPDMGEXP %in% c("H", "h"), "PROPDMGEXP"] <- 2
NOAA.file[!(NOAA.file$PROPDMGEXP %in% as.character(0:12)), "PROPDMGEXP"] <- NA
NOAA.file$PROPDMG <- NOAA.file$PROPDMG * 10^as.numeric(NOAA.file$PROPDMGEXP)

We process the crop damage data in the same manner.

NOAA.file[NOAA.file$CROPDMGEXP %in% c("B", "b"), "CROPDMGEXP"] <- 9
NOAA.file[NOAA.file$CROPDMGEXP %in% c("M", "m"), "CROPDMGEXP"] <- 6
NOAA.file[NOAA.file$CROPDMGEXP %in% c("K", "k"), "CROPDMGEXP"] <- 3
NOAA.file[NOAA.file$CROPDMGEXP %in% c("H", "h"), "CROPDMGEXP"] <- 2
NOAA.file[!(NOAA.file$CROPDMGEXP %in% as.character(0:12)), "CROPDMGEXP"] <- NA
NOAA.file$CROPDMG <- NOAA.file$CROPDMG * 10^as.numeric(NOAA.file$CROPDMGEXP)
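The two conversions above follow the same pattern, so the rule can be illustrated on a tiny example. The following is only a minimal sketch: the helper `decode_exponent` and the data frame `toy_dmg` are hypothetical names invented for illustration and are not part of the original analysis.

```r
# Hypothetical helper: turn an exponent code into a power of ten.
decode_exponent <- function(exp_code) {
    letter_powers <- c(H = 2, K = 3, M = 6, B = 9)
    code <- toupper(exp_code)
    # Letter codes are looked up by name; unknown codes give NA
    power <- unname(letter_powers[code])
    # Numeric codes from 0 to 12 are used as they are
    is_number <- code %in% as.character(0:12)
    power[is_number] <- as.numeric(code[is_number])
    power
}

# Toy records (made-up values, not taken from the NOAA data)
toy_dmg <- data.frame(dmg = c(2.5, 10, 1, 0, 3),
                      dmgexp = c("K", "m", "9", "?", "2"),
                      stringsAsFactors = FALSE)

# Combine mantissa and exponent into a single dollar amount
toy_dmg$dollars <- toy_dmg$dmg * 10^decode_exponent(toy_dmg$dmgexp)
toy_dmg
```

The row with the unrecognised code "?" ends up as NA, which is later ignored by `sum(..., na.rm = TRUE)` when the data is aggregated.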

Let’s take only the data needed for further analysis, and prettify the column names of the working table.

NOAA.damage <- NOAA.file[, c("STATE", "EVTYPE", "FATALITIES", "INJURIES",
                             "PROPDMG", "CROPDMG")]
names(NOAA.damage) <- tolower(names(NOAA.damage))

Data analysis and results

Now we need to aggregate the data for plotting. Since we want to know which weather anomalies cause the most severe damage to people’s health (both deaths and injuries), we group the data by event type. In the ranking we don’t distinguish between fatalities and injuries, since we want to know the total influence of each weather anomaly on public health.

NOAA.human <- group_by(NOAA.damage[, 1:4], evtype) %>%
    summarise(fatalities = sum(fatalities, na.rm = TRUE), 
              injuries = sum(injuries, na.rm = TRUE)) %>%
    mutate(total = fatalities + injuries) %>%
    arrange(desc(total))
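To see what this pipeline does step by step, the same grouping can be applied to a few made-up records. This is only an illustrative sketch: `toy_events` and `toy_summary` are hypothetical names, and the values are invented rather than taken from the NOAA data.

```r
library(dplyr)

# Hypothetical toy records, including one missing value
toy_events <- data.frame(evtype = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
                         fatalities = c(2, 0, 1, 3, NA),
                         injuries = c(10, 4, 5, 0, 1),
                         stringsAsFactors = FALSE)

toy_summary <- toy_events %>%
    group_by(evtype) %>%                                      # one group per event type
    summarise(fatalities = sum(fatalities, na.rm = TRUE),     # NA counts as zero
              injuries = sum(injuries, na.rm = TRUE)) %>%
    mutate(total = fatalities + injuries) %>%                 # combined health impact
    arrange(desc(total))                                      # most harmful first

toy_summary
```

Here TORNADO comes first with a total of 18, followed by FLOOD (5) and HEAT (3).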

head(NOAA.human, 10)
## Source: local data frame [10 x 4]
## 
##               evtype fatalities injuries total
## 1            TORNADO       5633    91346 96979
## 2     EXCESSIVE HEAT       1903     6525  8428
## 3          TSTM WIND        504     6957  7461
## 4              FLOOD        470     6789  7259
## 5          LIGHTNING        816     5230  6046
## 6               HEAT        937     2100  3037
## 7        FLASH FLOOD        978     1777  2755
## 8          ICE STORM         89     1975  2064
## 9  THUNDERSTORM WIND        133     1488  1621
## 10      WINTER STORM        206     1321  1527

The table above lists the ten most harmful weather anomalies in the US.

Let’s look at this data in a plot.

ggplot(head(NOAA.human, 10), aes(x = evtype, y = total)) + 
    geom_bar(stat="identity") +
    labs(title = "The most harmful weather anomalies in terms of human health damage, 1950-2011",
         y = "Number of fatalities and injuries", x = "Type of anomaly")
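By default ggplot2 places a character variable on the x axis in alphabetical order, so the bars do not appear in rank order. One possible way to sort them is sketched below; it is a suggestion rather than part of the original analysis, and `top_events` is a hypothetical data frame holding the first three rows of the table above.

```r
library(ggplot2)

# Values copied from the first three rows of the table above
top_events <- data.frame(evtype = c("TORNADO", "EXCESSIVE HEAT", "TSTM WIND"),
                         total = c(96979, 8428, 7461),
                         stringsAsFactors = FALSE)

# reorder() turns evtype into a factor whose levels are sorted by total
p <- ggplot(top_events, aes(x = reorder(evtype, total), y = total)) +
    geom_bar(stat = "identity") +
    coord_flip() +   # horizontal bars keep long event names readable
    labs(x = "Type of anomaly", y = "Number of fatalities and injuries")
p
```

With `coord_flip()` the event with the largest total ends up at the top of the chart.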

The plot clearly shows that tornadoes are the most dangerous weather anomaly in the United States. They cause more fatalities and injuries than the other nine most influential events combined.
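The comparison with the other nine events can be checked directly from the table. A short sketch, with the totals typed in by hand from the `total` column (the name `health_total` is hypothetical):

```r
# Totals copied from the "total" column of the table above, rank 1 to 10
health_total <- c(96979, 8428, 7461, 7259, 6046, 3037, 2755, 2064, 1621, 1527)

tornado <- health_total[1]
others  <- sum(health_total[-1])   # the nine events ranked 2 to 10

others             # 40198
tornado > others   # TRUE
```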

We analyze the influence of weather on the US economy in the same manner as in the previous case. Again, we don’t distinguish between damage to property and damage to crops, looking only at the total damage.

NOAA.econ <- group_by(NOAA.damage[, c(1, 2, 5, 6)], evtype) %>%
    summarise(propdmg = sum(propdmg, na.rm = TRUE), 
              cropdmg = sum(cropdmg, na.rm = TRUE)) %>%
    mutate(total = propdmg + cropdmg) %>%
    arrange(desc(total))

head(NOAA.econ, 10)
## Source: local data frame [10 x 4]
## 
##               evtype      propdmg     cropdmg        total
## 1              FLOOD 144657709800  5661968450 150319678250
## 2  HURRICANE/TYPHOON  69305840000  2607872800  71913712800
## 3            TORNADO  56947380614   414953270  57362333884
## 4        STORM SURGE  43323536000        5000  43323541000
## 5               HAIL  15735267456  3025954470  18761221926
## 6        FLASH FLOOD  16822673772  1421317100  18243990872
## 7            DROUGHT   1046106000 13972566000  15018672000
## 8          HURRICANE  11868319010  2741910000  14610229010
## 9        RIVER FLOOD   5118945500  5029459000  10148404500
## 10         ICE STORM   3944927860  5022113500   8967041360

Here again one event is clearly far more disastrous than any other, but this time it is a different one: floods, not tornadoes (although tornadoes rank third among the most damaging weather anomalies).
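To put the gap into numbers, the top three totals from the table can be compared directly. This is a small sketch with the values typed in by hand; `econ_total` is a hypothetical name.

```r
# Totals (in dollars) copied from the top three rows of the table above
econ_total <- c("FLOOD" = 150319678250,
                "HURRICANE/TYPHOON" = 71913712800,
                "TORNADO" = 57362333884)

round(econ_total / 1e9, 1)                                    # damage in $ billion
econ_total[["FLOOD"]] / econ_total[["HURRICANE/TYPHOON"]]     # about 2.1
econ_total[["FLOOD"]] / econ_total[["TORNADO"]]               # about 2.6
```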

Let’s look at this data in a plot:

ggplot(head(NOAA.econ, 10), aes(x = evtype, y = total/1e+9)) + 
    geom_bar(stat="identity") +
    labs(title = "The most harmful weather anomalies in terms of economic damage, 1950-2011",
         y = "Total damage, $ billion", x = "Type of anomaly")

As can be seen from the plot, the most severe economic damage is caused by floods.

Housekeeping: we restore the initial working directory and remove all variables from the global environment.

setwd(CurrentDirectory)