In this document we analyze the influence of different weather anomalies on public health and the economy in the United States. The research is based on data collected by the U.S. National Oceanic and Atmospheric Administration (NOAA) from 1950 to 2011.
As the analysis shows, the most dangerous weather events for the US are tornadoes (which affect the largest number of people in terms of both fatalities and non-fatal injuries) and floods (which cause the most severe economic damage, although tornadoes also contribute substantially to the total damage).
For the analysis we’ll need two popular libraries: dplyr for data manipulation and ggplot2 for plotting.
All files needed for the analysis will be stored in the NOAA directory inside the current working directory.
library(dplyr)
library(ggplot2)
CurrentDirectory <- getwd()
if (!file.exists("NOAA")) {
  dir.create("NOAA")
}
setwd("NOAA")
We download the data from the source each time just to be sure it’s up to date.
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
"NOAAData.bz2", method = "curl")
NOAA.file <- read.csv(bzfile("NOAAData.bz2"), stringsAsFactors = FALSE)
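The reading step relies on `bzfile()` decompressing the archive on the fly, so the file never has to be unpacked by hand. A minimal sketch of that round trip on a made-up two-row table (the `toy` data and the temporary file are assumptions for illustration, not part of the NOAA data):

```r
# Toy round trip: write a small CSV through a bzip2 connection, then read it back.
tmp <- tempfile(fileext = ".csv.bz2")
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"),
                  FATALITIES = c(3, 1),
                  stringsAsFactors = FALSE)

con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# bzfile() decompresses while reading, so no separate unpacking step is needed.
toy.back <- read.csv(bzfile(tmp), stringsAsFactors = FALSE)
unlink(tmp)

toy.back
```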
Damage estimates are stored in two columns instead of one, so we need to combine them into a single column. Property damage is held in the PROPDMG column (mantissa) and the PROPDMGEXP column (power-of-ten exponent). The exponent may be given either as a number or as a letter code (H, K, M, B). So we first convert the letter codes to numeric form, then combine mantissa and exponent and store the result in the PROPDMG column.
Any exponent value that is neither one of these letter codes nor a number from 0 to 12 is treated as NA. In almost all such cases the mantissa is zero, so it’s safe to assume that there was no property damage in these cases.
NOAA.file[NOAA.file$PROPDMGEXP %in% c("B", "b"), "PROPDMGEXP"] <- 9
NOAA.file[NOAA.file$PROPDMGEXP %in% c("M", "m"), "PROPDMGEXP"] <- 6
NOAA.file[NOAA.file$PROPDMGEXP %in% c("K", "k"), "PROPDMGEXP"] <- 3
NOAA.file[NOAA.file$PROPDMGEXP %in% c("H", "h"), "PROPDMGEXP"] <- 2
NOAA.file[!(NOAA.file$PROPDMGEXP %in% as.character(0:12)), "PROPDMGEXP"] <- NA
NOAA.file$PROPDMG <- NOAA.file$PROPDMG * 10^as.numeric(NOAA.file$PROPDMGEXP)
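To make the mantissa/exponent arithmetic concrete, here is a small sketch on invented values. It uses a named lookup vector instead of the in-place replacements above, which is an alternative formulation and not the code this report runs; `exp.lookup`, `to.dollars` and the `toy` table are hypothetical names introduced for illustration.

```r
# Hypothetical helper: map exponent codes to powers of ten via a named lookup vector.
exp.lookup <- c(H = 2, K = 3, M = 6, B = 9)

to.dollars <- function(mantissa, exponent) {
  code <- toupper(exponent)
  # Letter codes come from the lookup table; unknown codes yield NA.
  power <- unname(exp.lookup[code])
  # Digit codes "0".."12" are used as the exponent directly.
  is.digit <- code %in% as.character(0:12)
  power[is.digit] <- as.numeric(code[is.digit])
  # Anything else ("", "+", "?", ...) stays NA, so the damage becomes NA too.
  mantissa * 10^power
}

# Invented values, not taken from the NOAA data.
toy <- data.frame(PROPDMG    = c(25, 2.5, 1, 0, 4),
                  PROPDMGEXP = c("K", "m", "B", "?", "2"),
                  stringsAsFactors = FALSE)
toy$damage <- to.dollars(toy$PROPDMG, toy$PROPDMGEXP)
toy$damage

# Among rows with an unusable exponent, how often is the mantissa zero?
table(zero.mantissa = toy$PROPDMG[is.na(toy$damage)] == 0)
```

The last line shows one way to check the claim about zero mantissas; on the real data it would be run on the original columns before they are overwritten.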
We process the crop damage data in the same manner.
NOAA.file[NOAA.file$CROPDMGEXP %in% c("B", "b"), "CROPDMGEXP"] <- 9
NOAA.file[NOAA.file$CROPDMGEXP %in% c("M", "m"), "CROPDMGEXP"] <- 6
NOAA.file[NOAA.file$CROPDMGEXP %in% c("K", "k"), "CROPDMGEXP"] <- 3
NOAA.file[NOAA.file$CROPDMGEXP %in% c("H", "h"), "CROPDMGEXP"] <- 2
NOAA.file[!(NOAA.file$CROPDMGEXP %in% as.character(0:12)), "CROPDMGEXP"] <- NA
NOAA.file$CROPDMG <- NOAA.file$CROPDMG * 10^as.numeric(NOAA.file$CROPDMGEXP)
Let’s take only the data needed for further analysis, and prettify the column names of the working table.
NOAA.damage <- NOAA.file[, c("STATE", "EVTYPE", "FATALITIES", "INJURIES",
"PROPDMG", "CROPDMG")]
names(NOAA.damage) <- tolower(names(NOAA.damage))
Now we need to aggregate the data for plotting. Since we want to know which weather anomalies cause the most severe damage to people’s health (both deaths and injuries), we group the data by event type. In this case, we rank events by fatalities and injuries combined rather than separately, since we want to know the total influence of each weather anomaly on public health.
NOAA.human <- group_by(NOAA.damage[, 1:4], evtype) %>%
summarise(fatalities = sum(fatalities, na.rm = TRUE),
injuries = sum(injuries, na.rm = TRUE)) %>%
mutate(total = fatalities + injuries) %>%
arrange(desc(total))
head(NOAA.human, 10)
## Source: local data frame [10 x 4]
##
##               evtype fatalities injuries total
## 1            TORNADO       5633    91346 96979
## 2     EXCESSIVE HEAT       1903     6525  8428
## 3          TSTM WIND        504     6957  7461
## 4              FLOOD        470     6789  7259
## 5          LIGHTNING        816     5230  6046
## 6               HEAT        937     2100  3037
## 7        FLASH FLOOD        978     1777  2755
## 8          ICE STORM         89     1975  2064
## 9  THUNDERSTORM WIND        133     1488  1621
## 10      WINTER STORM        206     1321  1527
The table above lists the ten most harmful weather anomalies in the US.
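The group, sum, total and sort steps used to build this ranking can be traced by hand on a toy table. The records below are invented for illustration only, and `toy.human` is a hypothetical name:

```r
library(dplyr)

# Toy records, invented for illustration only.
toy <- data.frame(evtype     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
                  fatalities = c(2, 1, 3, 4, 0),
                  injuries   = c(10, 0, 5, 1, 2),
                  stringsAsFactors = FALSE)

toy.human <- toy %>%
  group_by(evtype) %>%                                     # one group per event type
  summarise(fatalities = sum(fatalities, na.rm = TRUE),
            injuries   = sum(injuries, na.rm = TRUE)) %>%  # per-type sums
  mutate(total = fatalities + injuries) %>%                # combined health impact
  arrange(desc(total))                                     # most harmful first

toy.human
```

Here TORNADO ends up first with a total of 20 (5 fatalities plus 15 injuries), followed by HEAT with 5 and FLOOD with 3.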
Let’s plot these ten events.
ggplot(head(NOAA.human, 10), aes(x = evtype, y = total)) +
    geom_bar(stat = "identity") +
    labs(title = "The most harmful weather anomalies in terms of human health damage, 1950-2011",
         y = "Number of fatalities and injuries", x = "Type of anomaly")
It can be clearly seen that tornadoes are the most dangerous weather anomaly in the United States. They cause more fatalities and injuries than the other nine most harmful events combined.
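One presentational detail: when the x variable is a character column, ggplot2 places the categories in alphabetical order, so the bars do not follow the ranking in the table. A possible variant, sketched here on a toy table with made-up totals, sorts the bars with `reorder()` and flips the axes so long event names stay readable. This is a suggestion and not the plot produced in this report.

```r
library(ggplot2)

# Toy totals, invented for illustration only.
toy <- data.frame(evtype = c("TORNADO", "HEAT", "FLOOD"),
                  total  = c(20, 5, 3),
                  stringsAsFactors = FALSE)

# reorder() turns evtype into a factor whose levels are sorted by total,
# so bars are drawn by size rather than alphabetically.
p <- ggplot(toy, aes(x = reorder(evtype, total), y = total)) +
  geom_bar(stat = "identity") +
  coord_flip() +   # horizontal bars leave room for long event names
  labs(x = "Type of anomaly", y = "Number of fatalities and injuries")
p
```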
We analyze the influence of weather on the US economy in the same manner as in the previous case. Again, we don’t distinguish between property damage and crop damage when ranking, looking only at the total damage.
NOAA.econ <- group_by(NOAA.damage[, c(1, 2, 5, 6)], evtype) %>%
summarise(propdmg = sum(propdmg, na.rm = TRUE),
cropdmg = sum(cropdmg, na.rm = TRUE)) %>%
mutate(total = propdmg + cropdmg) %>%
arrange(desc(total))
head(NOAA.econ, 10)
## Source: local data frame [10 x 4]
##
##               evtype      propdmg     cropdmg        total
## 1              FLOOD 144657709800  5661968450 150319678250
## 2  HURRICANE/TYPHOON  69305840000  2607872800  71913712800
## 3            TORNADO  56947380614   414953270  57362333884
## 4        STORM SURGE  43323536000        5000  43323541000
## 5               HAIL  15735267456  3025954470  18761221926
## 6        FLASH FLOOD  16822673772  1421317100  18243990872
## 7            DROUGHT   1046106000 13972566000  15018672000
## 8          HURRICANE  11868319010  2741910000  14610229010
## 9        RIVER FLOOD   5118945500  5029459000  10148404500
## 10         ICE STORM   3944927860  5022113500   8967041360
Here again one event is far more disastrous than any other, but this time it is a different one: floods rather than tornadoes (although tornadoes rank third by total damage).
Let’s plot this data:
ggplot(head(NOAA.econ, 10), aes(x = evtype, y = total/1e+9)) +
    geom_bar(stat = "identity") +
    labs(title = "The most harmful weather anomalies in terms of economic damage, 1950-2011",
         y = "Total damage, $ billion", x = "Type of anomaly")
As can be seen from the plot, the most severe economic damage is caused by floods.
Housekeeping: restore the initial working directory and remove all variables from the global environment.
setwd(CurrentDirectory)
rm(list = ls())