This report analyzes the impact of different severe weather events on public health and the economy in the United States from 1994 to 2011. Our analysis is based on data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which tracks characteristics of major storms and weather events in the U.S. from 1950 to 2011, including when and where they occur, as well as estimates of any fatalities, injuries and property damage. To determine the impact of storms on U.S. public health, we use the estimates of fatalities and injuries; to determine their impact on the economy, we use the estimates of property and crop damage. We focus our attention on the period from 1994 to 2011, because the records for more recent years are more complete. Our finding is that excessive heat and tornadoes are the most harmful events with respect to population health. In particular, tornadoes are the most hazardous weather event in terms of injuries, with more than 22,000 injuries, while excessive heat is the most significant in terms of fatalities, with 1,903 deaths. With respect to the impact on the U.S. economy, we find that floods, droughts and hurricanes/typhoons have the greatest economic consequences. In more detail, floods caused the greatest property damage, more than 144 billion USD, while drought turns out to be the main cause of crop damage, with more than 13 billion USD.
echo = TRUE
library("R.utils")
library(dplyr)
library(ggplot2)
library(gridExtra)
sessionInfo()
## R version 3.1.1 (2014-07-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
##
## locale:
## [1] LC_COLLATE=Italian_Italy.1252 LC_CTYPE=Italian_Italy.1252
## [3] LC_MONETARY=Italian_Italy.1252 LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] gridExtra_0.9.1 ggplot2_1.0.0 dplyr_0.3.0.2 R.utils_2.0.0
## [5] R.oo_1.19.0 R.methodsS3_1.7.0
##
## loaded via a namespace (and not attached):
## [1] assertthat_0.1 colorspace_1.2-4 DBI_0.3.1 digest_0.6.4
## [5] evaluate_0.5.5 formatR_1.0 gtable_0.1.2 htmltools_0.2.6
## [9] knitr_1.8 magrittr_1.0.1 MASS_7.3-33 munsell_0.4.2
## [13] parallel_3.1.1 plyr_1.8.1 proto_0.3-10 Rcpp_0.11.3
## [17] reshape2_1.4 rmarkdown_0.5.1 scales_0.2.4 stringr_0.6.2
## [21] tools_3.1.1 yaml_2.1.13
Read data:
data <- read.csv("repdata-data-StormData.csv", header = TRUE)
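The factor columns in the structure listing that follows reflect the R 3.1.1 session used here, where read.csv converted text columns to factors by default. From R 4.0.0 onward the default is stringsAsFactors = FALSE, so reproducing the factor-based steps of this report (for example the calls to levels()) would need the argument set explicitly. A minimal sketch on an inline toy table, whose rows are made up for illustration:

```r
# Toy CSV text standing in for the storm file; the rows are invented.
csv_text <- "EVTYPE,PROPDMG,PROPDMGEXP
TORNADO,25,K
FLOOD,2.5,M
HAIL,0,"

# Request factors explicitly so the result does not depend on the R version.
toy <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = TRUE)

str(toy)
levels(toy$PROPDMGEXP)  # the blank field becomes the "" level
```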
Look at data:
str(data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436774 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
To reduce the size of the dataset, we keep only the columns of interest:
storm_data <- select(data, STATE, BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
Check for missing values:
sum(is.na(storm_data))
## [1] 0
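A count of zero NA values does not mean that every field is filled in. Blank text fields are read as empty strings rather than NA, and the str() output above lists "" among the levels of PROPDMGEXP and CROPDMGEXP. A small sketch with made-up values showing the difference:

```r
# Invented exponent codes; two of them are blank.
exp_codes <- c("K", "", "M", "", "B")

n_na    <- sum(is.na(exp_codes))   # blanks are not NA
n_blank <- sum(exp_codes == "")    # blanks counted directly

n_na     # 0
n_blank  # 2
```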
Extract the year from the BGN_DATE variable:
storm_data <- mutate(storm_data, year = as.numeric(format(as.Date(BGN_DATE, format = "%m/%d/%Y %H:%M:%S"), "%Y")))
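To make the conversion concrete, here is the same expression applied to three date strings taken from the data preview shown later in this report. as.Date parses the string with the given format and keeps only the calendar date, and format(..., "%Y") then pulls out the year:

```r
# Date strings as they appear in BGN_DATE.
dates <- c("1/6/1995 0:00:00", "10/4/1995 0:00:00", "2/9/1994 0:00:00")

parsed <- as.Date(dates, format = "%m/%d/%Y %H:%M:%S")
years  <- as.numeric(format(parsed, "%Y"))

years  # 1995 1995 1994
```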
Look at the number of recorded events per year:
hist(storm_data$year, breaks = 60)
Keep only the more recent years (1994 onward), whose records should be more complete:
storm_data <- filter(storm_data, year >= 1994)
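The PROPDMGEXP and CROPDMGEXP variables need to be recoded as numeric powers of ten, following the multipliers given in the Storm Events CodeBook (H = hundred, K = thousand, M = million, B = billion). In the lookup table below, digits are read as exponents, while blanks and the symbols "+", "-" and "?" are given an exponent of 0.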
levels(storm_data$PROPDMGEXP)
## [1] "" "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
## [18] "m" "M"
levels(storm_data$CROPDMGEXP)
## [1] "" "?" "0" "2" "B" "k" "K" "m" "M"
unit <- c("", "+", "-", "?", 0:8, "h", "H", "k", "K", "m", "M", "B")
multiplier <- c(rep(0,4), 0:8, 2, 2, 3, 3, 6, 6, 9)
mult.df <- data.frame(unit, multiplier)
storm_data$PROPDMGEXP <- mult.df[match(storm_data$PROPDMGEXP, mult.df$unit),2]
storm_data$CROPDMGEXP <- mult.df[match(storm_data$CROPDMGEXP, mult.df$unit),2]
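The recoding works as a table lookup: match() returns, for each code, its row position in mult.df, and that position is used to pick the exponent from the second column. A short sketch on a handful of codes (the vector codes is made up for illustration; the lookup table is the one defined above):

```r
unit <- c("", "+", "-", "?", 0:8, "h", "H", "k", "K", "m", "M", "B")
multiplier <- c(rep(0, 4), 0:8, 2, 2, 3, 3, 6, 6, 9)
mult.df <- data.frame(unit, multiplier)

# Invented sample of exponent codes, stored as a factor like the real columns.
codes <- factor(c("K", "m", "", "B", "5"))

# Row position of each code in the table, then the exponent in column 2.
exponents <- mult.df[match(codes, mult.df$unit), 2]
exponents  # 3 6 0 9 5
```

One reason match() is convenient here is that the blank code "" must also be looked up, and a named vector cannot be indexed by an empty string.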
To get the amount of economic damage in dollars, let’s multiply each property/crop damage value by 10 raised to its recoded exponent:
storm_data <- mutate(storm_data, PROPERTY_DAMAGE = PROPDMG * 10 ^ PROPDMGEXP, CROP_DAMAGE = CROPDMG * 10 ^ CROPDMGEXP)
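As a quick arithmetic check, the sketch below applies the same formula to four PROPDMG/PROPDMGEXP pairs that appear in the structure listing that follows. An exponent of 0 leaves the value unchanged, so a blank code is treated as plain dollars.

```r
# Value/exponent pairs copied from the processed dataset's str() output.
propdmg <- c(0.1, 50, 5, 500)
propexp <- c(9, 3, 6, 3)

property_damage <- propdmg * 10 ^ propexp
property_damage  # 1e8, 5e4, 5e6, 5e5, as in the PROPERTY_DAMAGE column

25 * 10 ^ 0      # exponent 0: the value stays 25
```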
After data processing, let’s look at the dataset:
str(storm_data)
## 'data.frame': 702131 obs. of 12 variables:
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 763 399 4735 4648 3932 1805 6403 10570 10570 10570 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 201 629 429 657 657 410 786 786 834 244 ...
## $ FATALITIES : num 0 0 0 0 0 2 0 0 0 0 ...
## $ INJURIES : num 0 0 2 0 0 0 0 0 0 0 ...
## $ PROPDMG : num 0 0 0 0 0 0.1 50 5 500 0 ...
## $ PROPDMGEXP : num 0 0 0 0 0 9 3 6 3 0 ...
## $ CROPDMG : num 0 0 0 0 0 10 0 500 0 0 ...
## $ CROPDMGEXP : num 0 0 0 0 0 6 0 3 0 0 ...
## $ year : num 1995 1995 1994 1995 1995 ...
## $ PROPERTY_DAMAGE: num 0e+00 0e+00 0e+00 0e+00 0e+00 1e+08 5e+04 5e+06 5e+05 0e+00 ...
## $ CROP_DAMAGE : num 0e+00 0e+00 0e+00 0e+00 0e+00 1e+07 0e+00 5e+05 0e+00 0e+00 ...
head(storm_data)
## STATE BGN_DATE EVTYPE FATALITIES INJURIES
## 1 AL 1/6/1995 0:00:00 FREEZING RAIN 0 0
## 2 AL 1/22/1995 0:00:00 SNOW 0 0
## 3 AL 2/9/1994 0:00:00 ICE STORM/FLASH FLOOD 0 2
## 4 AL 2/6/1995 0:00:00 SNOW/ICE 0 0
## 5 AL 2/11/1995 0:00:00 SNOW/ICE 0 0
## 6 AL 10/4/1995 0:00:00 HURRICANE OPAL/HIGH WINDS 2 0
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP year PROPERTY_DAMAGE CROP_DAMAGE
## 1 0.0 0 0 0 1995 0e+00 0e+00
## 2 0.0 0 0 0 1995 0e+00 0e+00
## 3 0.0 0 0 0 1994 0e+00 0e+00
## 4 0.0 0 0 0 1995 0e+00 0e+00
## 5 0.0 0 0 0 1995 0e+00 0e+00
## 6 0.1 9 10 6 1995 1e+08 1e+07
The first part of this project asks us to find the most severe weather events in terms of population health. Therefore, we rank the event types by their total number of fatalities to get the list of the 15 most severe weather event types.
fatalities_ranking <-
  storm_data %>%
  group_by(EVTYPE) %>%
  select(FATALITIES) %>%
  summarise(
    FATALITIES = sum(FATALITIES)
  ) %>%
  arrange(desc(FATALITIES)) %>%
  mutate(rank = dense_rank(desc(FATALITIES))) %>%
  filter(rank <= 15) %>%
  mutate(EVTYPE = factor(EVTYPE, levels = EVTYPE))
Then, we do the same for the number of injuries:
injuries_ranking <-
  storm_data %>%
  group_by(EVTYPE) %>%
  select(INJURIES) %>%
  summarise(
    INJURIES = sum(INJURIES)
  ) %>%
  arrange(desc(INJURIES)) %>%
  mutate(rank = dense_rank(desc(INJURIES))) %>%
  filter(rank <= 15) %>%
  mutate(EVTYPE = factor(EVTYPE, levels = EVTYPE))
The second part of this project asks us to find the most severe weather events in terms of economic damage. As in the previous section, we aggregate property and crop damage by weather event type. Then, we rank them to get the lists of the 15 weather events that have had the most severe consequences for the U.S. economy.
property_damage_ranking <-
  storm_data %>%
  group_by(EVTYPE) %>%
  select(PROPERTY_DAMAGE) %>%
  summarise(
    PROPERTY_DAMAGE = sum(PROPERTY_DAMAGE)
  ) %>%
  arrange(desc(PROPERTY_DAMAGE)) %>%
  mutate(rank = dense_rank(desc(PROPERTY_DAMAGE))) %>%
  filter(rank <= 15) %>%
  mutate(EVTYPE = factor(EVTYPE, levels = EVTYPE))
crop_damage_ranking <-
  storm_data %>%
  group_by(EVTYPE) %>%
  select(CROP_DAMAGE) %>%
  summarise(
    CROP_DAMAGE = sum(CROP_DAMAGE)
  ) %>%
  arrange(desc(CROP_DAMAGE)) %>%
  mutate(rank = dense_rank(desc(CROP_DAMAGE))) %>%
  filter(rank <= 15) %>%
  mutate(EVTYPE = factor(EVTYPE, levels = EVTYPE))
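The four pipelines above follow one pattern: sum a column per event type, sort in decreasing order, keep the top entries, and fix the factor levels so that later plots keep the ranking order. As a cross-check, that pattern can be sketched in base R. The helper top_events and the toy data are hypothetical, and the helper ranks by position after sorting rather than with dense_rank, so tied totals are handled differently from the dplyr code:

```r
# Toy data; event counts are made up for illustration.
toy <- data.frame(
  EVTYPE = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD", "HAIL"),
  FATALITIES = c(3, 1, 2, 4, 0, 0),
  stringsAsFactors = FALSE
)

# Hypothetical helper: total of `value` per event type, n largest kept.
top_events <- function(df, value, n = 15) {
  groups <- split(df[[value]], df$EVTYPE)
  totals <- vapply(groups, sum, numeric(1))
  totals <- sort(totals, decreasing = TRUE)
  totals <- totals[seq_len(min(n, length(totals)))]
  out <- data.frame(EVTYPE = names(totals), total = unname(totals),
                    stringsAsFactors = FALSE)
  # Levels in ranking order, so a bar chart does not re-sort alphabetically.
  out$EVTYPE <- factor(out$EVTYPE, levels = out$EVTYPE)
  out$rank <- seq_len(nrow(out))
  out
}

top3 <- top_events(toy, "FATALITIES", n = 3)
print(top3)
```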
Let’s print the two lists of the 15 most significant storm events in terms of harm to population health:
fatalities_ranking
## Source: local data frame [15 x 3]
##
## EVTYPE FATALITIES rank
## 1 EXCESSIVE HEAT 1903 1
## 2 TORNADO 1593 2
## 3 FLASH FLOOD 951 3
## 4 HEAT 930 4
## 5 LIGHTNING 794 5
## 6 FLOOD 450 6
## 7 RIP CURRENT 368 7
## 8 HIGH WIND 242 8
## 9 TSTM WIND 241 9
## 10 AVALANCHE 224 10
## 11 RIP CURRENTS 204 11
## 12 WINTER STORM 195 12
## 13 HEAT WAVE 172 13
## 14 EXTREME COLD 150 14
## 15 THUNDERSTORM WIND 133 15
injuries_ranking
## Source: local data frame [15 x 3]
##
## EVTYPE INJURIES rank
## 1 TORNADO 22571 1
## 2 FLOOD 6778 2
## 3 EXCESSIVE HEAT 6525 3
## 4 LIGHTNING 5116 4
## 5 TSTM WIND 3631 5
## 6 HEAT 2095 6
## 7 ICE STORM 1971 7
## 8 FLASH FLOOD 1754 8
## 9 THUNDERSTORM WIND 1476 9
## 10 WINTER STORM 1298 10
## 11 HURRICANE/TYPHOON 1275 11
## 12 HIGH WIND 1099 12
## 13 HEAVY SNOW 980 13
## 14 HAIL 943 14
## 15 WILDFIRE 911 15
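The rankings list some event types under more than one label, for example RIP CURRENT and RIP CURRENTS, or TSTM WIND and THUNDERSTORM WIND, and the raw EVTYPE levels include entries with leading spaces. In this report each label is ranked separately. If one wanted to merge such variants before aggregating, a sketch could look like the following; the synonym table is an assumption made for illustration and is not part of the original analysis or of the CodeBook:

```r
# A few raw labels as they occur in EVTYPE.
raw <- c("RIP CURRENT", "RIP CURRENTS", "TSTM WIND",
         "THUNDERSTORM WIND", " HIGH SURF ADVISORY")

# Normalize whitespace and case first.
clean <- toupper(trimws(raw))

# Assumed synonym table: variant label -> label to keep.
synonyms <- c("RIP CURRENTS" = "RIP CURRENT",
              "TSTM WIND"    = "THUNDERSTORM WIND")

hit <- clean %in% names(synonyms)
clean[hit] <- unname(synonyms[clean[hit]])

table(clean)
```

With these two merges alone, the fatality totals printed above would combine to 572 for rip currents (368 + 204) and 374 for thunderstorm wind (241 + 133).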
Next, let’s make a plot summarizing this information:
fatalities_plot <-
  qplot(EVTYPE, data = fatalities_ranking, weight = FATALITIES, geom = "bar") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  geom_histogram(colour = "white", fill = "black", binwidth = 1) +
  xlab("Severe Weather Events") +
  scale_y_continuous("Number of Fatalities") +
  ggtitle("Number of Fatalities\n by Top 15 Severe Weather\n Events in the U.S.\n from 1994 - 2011")
injuries_plot <-
  qplot(EVTYPE, data = injuries_ranking, weight = INJURIES, geom = "bar") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  geom_histogram(colour = "darkgreen", fill = "white", binwidth = 1) +
  xlab("Severe Weather Events") +
  scale_y_continuous("Number of Injuries") +
  ggtitle("Number of Injuries\n by Top 15 Severe Weather\n Events in the U.S.\n from 1994 - 2011")
grid.arrange(fatalities_plot, injuries_plot, ncol = 2)
From the plots above, Tornado and Flood turn out to be the 2 most severe weather events in terms of number of injuries, with 22,571 and 6,778 injuries, respectively. Excessive heat and Tornado caused the greatest number of fatalities, with 1,903 and 1,593 deaths, respectively, from 1994 to 2011.
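The plotting code in this report was written for ggplot2 1.0.0 (see the session information). In current ggplot2 releases qplot() is deprecated and a histogram layer on a discrete axis may be rejected, so the code may not run unchanged. A rough equivalent for already-aggregated totals, assuming a ggplot2 version that provides geom_col, is sketched below using the first five rows of the fatalities ranking:

```r
library(ggplot2)

# First five rows of the fatalities ranking printed earlier.
fatalities_top <- data.frame(
  EVTYPE = c("EXCESSIVE HEAT", "TORNADO", "FLASH FLOOD", "HEAT", "LIGHTNING"),
  FATALITIES = c(1903, 1593, 951, 930, 794)
)

# Fix the level order so the bars follow the ranking, not the alphabet.
fatalities_top$EVTYPE <- factor(fatalities_top$EVTYPE,
                                levels = fatalities_top$EVTYPE)

# geom_col draws one bar per row with its height taken from y.
fatalities_plot <- ggplot(fatalities_top, aes(x = EVTYPE, y = FATALITIES)) +
  geom_col(colour = "white", fill = "black") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Severe Weather Events", y = "Number of Fatalities",
       title = "Fatalities by Severe Weather Event, U.S., 1994 - 2011")

print(fatalities_plot)
```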
Finally, let’s look at the 15 most significant storm events in terms of economic damages:
property_damage_ranking
## Source: local data frame [15 x 3]
##
## EVTYPE PROPERTY_DAMAGE rank
## 1 FLOOD 144179608807 1
## 2 HURRICANE/TYPHOON 69305840000 2
## 3 STORM SURGE 43193536000 3
## 4 TORNADO 25630588401 4
## 5 FLASH FLOOD 16398255929 5
## 6 HAIL 15338044461 6
## 7 HURRICANE 11862819010 7
## 8 TROPICAL STORM 7703385550 8
## 9 HIGH WIND 5266939295 9
## 10 WILDFIRE 4765114000 10
## 11 STORM SURGE/TIDE 4641188000 11
## 12 TSTM WIND 4484273495 12
## 13 ICE STORM 3832377860 13
## 14 THUNDERSTORM WIND 3480404972 14
## 15 HURRICANE OPAL 3172846000 15
crop_damage_ranking
## Source: local data frame [15 x 3]
##
## EVTYPE CROP_DAMAGE rank
## 1 DROUGHT 13922066000 1
## 2 FLOOD 5506942450 2
## 3 ICE STORM 5022113500 3
## 4 HAIL 2982699123 4
## 5 HURRICANE 2741410000 5
## 6 HURRICANE/TYPHOON 2607872800 6
## 7 FLASH FLOOD 1402661500 7
## 8 EXTREME COLD 1292973000 8
## 9 FROST/FREEZE 1094086000 9
## 10 HEAVY RAIN 733399800 10
## 11 TROPICAL STORM 677841000 11
## 12 HIGH WIND 633566300 12
## 13 TSTM WIND 553997350 13
## 14 EXCESSIVE HEAT 492402000 14
## 15 THUNDERSTORM WIND 414833050 15
Again, let’s plot the results:
property_damage_plot <-
  qplot(EVTYPE, data = property_damage_ranking, weight = PROPERTY_DAMAGE/10^6, geom = "bar") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  geom_histogram(colour = "white", fill = "darkgrey", binwidth = 1) +
  xlab("Severe Weather Events") +
  scale_y_continuous("Property Damage [Million $]") +
  ggtitle("Million $ Property Damage\n by Top 15 Severe Weather\n Events in the U.S.\n from 1994 - 2011")
crop_damage_plot <-
  qplot(EVTYPE, data = crop_damage_ranking, weight = CROP_DAMAGE/10^6, geom = "bar") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  geom_histogram(colour = "white", fill = "brown", binwidth = 1) +
  xlab("Severe Weather Events") +
  scale_y_continuous("Crop Damage [Million $]") +
  ggtitle("Million $ Crop Damage\n by Top 15 Severe Weather\n Events in the U.S.\n from 1994 - 2011")
grid.arrange(property_damage_plot, crop_damage_plot, ncol = 2)
In terms of property damage, Flood and Hurricane/Typhoon have been the most severe weather events, with more than 144 and 69 billion USD, respectively. Drought and Flood are the top 2 causes of crop damage, with more than 13 and 5 billion USD, respectively.
Our finding is that across the United States from 1994 to 2011, Excessive heat and Tornado had the greatest impact on population health, while Flood, Hurricane/Typhoon and Drought had the greatest economic consequences.