This report analyzes the impact of severe weather on public health and financial damages in the United States. The data come from the National Weather Service, and the variable definitions are unchanged. The data span 1950 to 2011; attention is given to the aggregate impacts of severe weather, and no consideration is given to changes over time. The evidence shows that tornadoes are responsible for the most deaths and injuries, while floods have the greatest economic consequences. While economic consequences are easy to measure and aggregate (property and crop damages, both in US$), health impacts (fatalities and injuries) cannot be combined in a meaningful fashion.
We start by loading the libraries and the data, using the function read.csv, which reads the compressed file directly.
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
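# stringsAsFactors = TRUE keeps text columns as factors, which the levels() calls below rely on (no longer the default since R 4.0)
mydata <- read.csv("repdata%2Fdata%2FStormData.csv.bz2", stringsAsFactors = TRUE)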
A quick check can be done to see the relative importance of missing data.
mean(is.na(mydata))
## [1] 0.05229737
Missing data amount to only 5% of our data points.
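The overall share can hide columns that are almost entirely empty. As a minimal sketch, assuming a small invented data frame in place of mydata (the column SCALE is hypothetical), the same check can be done per column. Note that read.csv by default only treats the string "NA" as missing, so blank text fields are not counted by is.na.

```r
# Toy stand-in for the storm data; all values and the SCALE column are invented.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HEAT", "FLOOD", "HAIL"),
  FATALITIES = c(3, NA, 0, 1),
  SCALE      = c(2, NA, NA, NA)
)

# Share of missing values in each column, largest first.
na_share <- sort(colMeans(is.na(toy)), decreasing = TRUE)
print(na_share)
```

Running the same two lines on mydata would show whether the 5% is spread evenly or concentrated in a few columns.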
We start by determining which events caused the most fatalities and injuries. To do this, we group the data by event type and sum the corresponding fatalities and injuries.
health_data <- mydata %>% group_by(EVTYPE) %>% summarise(FATALITIES=sum(FATALITIES), INJURIES=sum(INJURIES))
health_data <- arrange(health_data, desc(FATALITIES))
head(health_data)
## Source: local data frame [6 x 3]
##
## EVTYPE FATALITIES INJURIES
## (fctr) (dbl) (dbl)
## 1 TORNADO 5633 91346
## 2 EXCESSIVE HEAT 1903 6525
## 3 FLASH FLOOD 978 1777
## 4 HEAT 937 2100
## 5 LIGHTNING 816 5230
## 6 TSTM WIND 504 6957
We’ll create a subset of our dataframe composed only of the top 5 fatality causes, obtained from the previous table.
health_subset <- subset(mydata, EVTYPE == "EXCESSIVE HEAT" | EVTYPE=="TORNADO" | EVTYPE =="FLASH FLOOD" | EVTYPE=="HEAT" | EVTYPE=="LIGHTNING")
health_subset$EVTYPE <- factor(health_subset$EVTYPE)
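The event names above are typed in by hand from the table. A possible alternative, sketched here on invented data with placeholder event names, is to take the top five names from the aggregated totals and filter with %in%, so the subset follows the data if the ranking changes.

```r
# Toy data: event names "A".."F" and all counts are invented for illustration.
health_toy <- data.frame(
  EVTYPE     = c("A", "B", "C", "D", "E", "F", "A", "B"),
  FATALITIES = c(10, 8, 6, 4, 2, 1, 5, 0)
)

# Total fatalities per event type, sorted from largest to smallest.
totals <- aggregate(FATALITIES ~ EVTYPE, data = health_toy, FUN = sum)
totals <- totals[order(-totals$FATALITIES), ]

# Names of the five deadliest event types, then the matching rows.
top5 <- head(as.character(totals$EVTYPE), 5)
toy_subset <- droplevels(subset(health_toy, EVTYPE %in% top5))
print(top5)
```

droplevels() plays the same role as the factor() call above: it discards levels that no longer occur in the subset.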
We now process the data to analyze the economic consequences of severe weather. We start by subsetting our data frame to include only records for which the dollar amount is given in thousands, millions or billions.
prop_subset <- subset(mydata, PROPDMGEXP == "M"| PROPDMGEXP == "B"| PROPDMGEXP=="K" | PROPDMGEXP == "m"|PROPDMGEXP == "b" | PROPDMGEXP == "k")
The next step is to convert the amounts to a common dollar measure, namely billions.
levels(prop_subset$PROPDMGEXP)[levels(prop_subset$PROPDMGEXP)=='B'] <- 1
levels(prop_subset$PROPDMGEXP)[levels(prop_subset$PROPDMGEXP)=='b'] <- 1
levels(prop_subset$PROPDMGEXP)[levels(prop_subset$PROPDMGEXP)=='M'] <- 1/1000
levels(prop_subset$PROPDMGEXP)[levels(prop_subset$PROPDMGEXP)=='m'] <- 1/1000
levels(prop_subset$PROPDMGEXP)[levels(prop_subset$PROPDMGEXP)=='K'] <- 1/1000000
levels(prop_subset$PROPDMGEXP)[levels(prop_subset$PROPDMGEXP)=='k'] <- 1/1000000
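The variable is still a factor, so we now convert it to numeric. The coercion warning comes from leftover factor levels that are not numbers; subset() keeps them as levels even though they no longer occur in the data, so they do not affect the rows we kept.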
prop_subset$PROPDMGEXP <- as.numeric(levels(prop_subset$PROPDMGEXP))[prop_subset$PROPDMGEXP]
## Warning: NAs introduced by coercion
Finally, we obtain the total amount of property damage in billions of dollars.
prop_subset$PROPDAMAGE <- prop_subset$PROPDMG*prop_subset$PROPDMGEXP
Now we can create a dataframe containing the sum of damages caused by each event type.
top_prop <- prop_subset %>% group_by(EVTYPE) %>% summarise(PROP_DAMAGE = sum(PROPDAMAGE))
top_prop <- arrange(top_prop, desc(PROP_DAMAGE))
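The relabelling of factor levels above can also be written as a lookup. The following is a minimal sketch on invented rows, not the method used in this report: a named vector maps each suffix to its multiplier in billions, and toupper() handles the lower-case variants. The name to_billions is an assumption made for this example.

```r
# Toy rows: values are invented; "?" stands for a suffix we do not recognise.
toy <- data.frame(
  EVTYPE     = c("FLOOD", "FLOOD", "HAIL", "TORNADO", "HAIL"),
  PROPDMG    = c(2.5, 500, 10, 1.2, 3),
  PROPDMGEXP = c("B", "m", "K", "M", "?"),
  stringsAsFactors = FALSE
)

# Multipliers that express each recognised suffix in billions of dollars.
to_billions <- c(K = 1e-6, M = 1e-3, B = 1)

# Look up the multiplier by name; unrecognised suffixes give NA.
mult <- unname(to_billions[toupper(toy$PROPDMGEXP)])
toy$PROPDAMAGE <- toy$PROPDMG * mult

# Keep only rows with a recognised suffix, as the subset() call does above.
toy <- toy[!is.na(toy$PROPDAMAGE), ]
print(toy)
```

Because the lookup works on character values, it does not depend on the column being a factor and produces no coercion warning.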
We replicate the above manipulations to obtain an analogous data frame for crop damages.
crop_subset <- subset(mydata, CROPDMGEXP == "B"| CROPDMGEXP == "M"|CROPDMGEXP == "K" |CROPDMGEXP == "b"|CROPDMGEXP == "m"|CROPDMGEXP == "k")
levels(crop_subset$CROPDMGEXP)[levels(crop_subset$CROPDMGEXP)=='B'] <- 1
levels(crop_subset$CROPDMGEXP)[levels(crop_subset$CROPDMGEXP)=='b'] <- 1
levels(crop_subset$CROPDMGEXP)[levels(crop_subset$CROPDMGEXP)=='M'] <- 1/1000
levels(crop_subset$CROPDMGEXP)[levels(crop_subset$CROPDMGEXP)=='m'] <- 1/1000
levels(crop_subset$CROPDMGEXP)[levels(crop_subset$CROPDMGEXP)=='K'] <- 1/1000000
levels(crop_subset$CROPDMGEXP)[levels(crop_subset$CROPDMGEXP)=='k'] <- 1/1000000
crop_subset$CROPDMGEXP <- as.numeric(levels(crop_subset$CROPDMGEXP))[crop_subset$CROPDMGEXP]
## Warning: NAs introduced by coercion
crop_subset$CROPDAMAGE <- crop_subset$CROPDMG*crop_subset$CROPDMGEXP
top_crop <- crop_subset %>% group_by(EVTYPE) %>% summarise(CROP_DAMAGE = sum(CROPDAMAGE))
top_crop <- arrange(top_crop, desc(CROP_DAMAGE))
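We now merge the two data frames by event type, which allows us to compute the total damages. Note that merge() by default keeps only the event types that appear in both data frames.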
top_damage <- merge(top_prop, top_crop)
top_damage$TOTAL_DAMAGE <- top_damage$PROP_DAMAGE + top_damage$CROP_DAMAGE
top_damage <- arrange(top_damage, desc(TOTAL_DAMAGE))
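If event types with only property damage or only crop damage should also be ranked, a full join is one option. This is a sketch on invented totals, assuming that a missing entry can be treated as zero damage, which may not hold for the real data.

```r
# Toy totals in billions; all figures are invented.
prop_toy <- data.frame(EVTYPE = c("FLOOD", "TORNADO", "STORM SURGE"),
                       PROP_DAMAGE = c(10, 5, 4))
crop_toy <- data.frame(EVTYPE = c("FLOOD", "TORNADO", "DROUGHT"),
                       CROP_DAMAGE = c(1, 0.5, 3))

# all = TRUE keeps event types found in either data frame (full join).
both <- merge(prop_toy, crop_toy, by = "EVTYPE", all = TRUE)

# Assumption: an event type absent from one table caused no damage of that kind.
both[is.na(both)] <- 0

both$TOTAL_DAMAGE <- both$PROP_DAMAGE + both$CROP_DAMAGE
both <- both[order(-both$TOTAL_DAMAGE), ]
print(both)
```

With the default inner join, the toy example would keep only FLOOD and TORNADO.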
These are all the transformations needed to conduct the data analysis.
We start by looking at the top causes of fatalities.
head(health_data)
## EVTYPE FATALITIES INJURIES
## 1 TORNADO 5633 91346
## 2 EXCESSIVE HEAT 1903 6525
## 3 FLASH FLOOD 978 1777
## 4 HEAT 937 2100
## 5 LIGHTNING 816 5230
## 6 TSTM WIND 504 6957
Note that the table is ordered by Fatalities. How do these relate to Injuries?
cor(health_data$FATALITIES, health_data$INJURIES)
## [1] 0.9438341
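The strong correlation suggests that event types with high total fatalities also tend to have high total injuries. Because it is computed on totals per event type, it says nothing about rates per event.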
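A Pearson correlation on totals can be dominated by one very large event type. As a rough robustness check, sketched here on invented totals rather than on health_data, one can compare it with the rank-based Spearman correlation and with the Pearson correlation after leaving out the largest event type.

```r
# Toy totals per event type; names and figures are invented.
totals <- data.frame(
  EVTYPE     = c("A", "B", "C", "D", "E"),
  FATALITIES = c(500, 20, 15, 5, 1),
  INJURIES   = c(9000, 30, 60, 10, 40)
)

# Pearson correlation on all event types (sensitive to the large one).
pearson_all <- cor(totals$FATALITIES, totals$INJURIES)

# Spearman correlation uses ranks, so extreme values weigh less.
spearman_all <- cor(totals$FATALITIES, totals$INJURIES, method = "spearman")

# Pearson correlation again, without the event type with the most fatalities.
rest <- totals[-which.max(totals$FATALITIES), ]
pearson_rest <- cor(rest$FATALITIES, rest$INJURIES)

print(c(pearson_all = pearson_all, spearman_all = spearman_all,
        pearson_rest = pearson_rest))
```

In this toy example the Pearson figure drops sharply once the largest event type is removed; whether the same happens with the storm data would need to be checked.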
We can take a look at summary statistics for fatalities, followed by the standard deviations of fatalities and injuries. Note that summary() only summarises its first argument.
summary(health_subset$FATALITIES)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.0771 0.0000 583.0000
sd(health_subset$FATALITIES)
## [1] 1.951963
sd(health_subset$INJURIES)
## [1] 12.12004
It’s easy to see from the table on fatalities that tornadoes are the number 1 cause of death by a large margin. Below we provide a scatterplot relating the number of injuries and fatalities caused by tornadoes.
subset1 <- subset(health_subset, EVTYPE=="TORNADO")
qplot(FATALITIES, INJURIES, data = subset1, main = "Tornado fatalities and Injuries")
Next we examine a similar plot for the other 4 top causes of fatalities.
subset2 <- subset(health_subset, EVTYPE == "EXCESSIVE HEAT" | EVTYPE =="FLASH FLOOD" | EVTYPE=="HEAT" | EVTYPE=="LIGHTNING")
qplot(FATALITIES, INJURIES, data = subset2, facets = .~EVTYPE, xlim = c(0,125), main = "Fatalities and Injuries by Event Type" )
## Warning: Removed 1 rows containing missing values (geom_point).
Note that the x-axis limit excludes one “HEAT” entry that caused close to 600 fatalities, because it lies far outside the range of the other observations. This is why one row is reported as removed.
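We now proceed to analyze the economic impacts of severe weather. We start by taking a look at some summary statistics. These describe the raw PROPDMG figures, before the K/M/B multiplier is applied, first in the property subset and then in the crop subset (both pairs of calls read PROPDMG).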
summary(prop_subset$PROPDMG)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 1.00 24.94 10.00 5000.00
sd(prop_subset$PROPDMG)
## [1] 83.65349
summary(crop_subset$PROPDMG)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 14.22 2.00 5000.00
sd(crop_subset$PROPDMG)
## [1] 68.4765
We can now take a look at the main causes of damage.
head(top_damage)
## EVTYPE PROP_DAMAGE CROP_DAMAGE TOTAL_DAMAGE
## 1 FLOOD 144.65771 5.6619684 150.31968
## 2 HURRICANE/TYPHOON 69.30584 2.6078728 71.91371
## 3 TORNADO 56.93716 0.4149531 57.35211
## 4 STORM SURGE 43.32354 0.0000050 43.32354
## 5 HAIL 15.73227 3.0259544 18.75822
## 6 FLASH FLOOD 16.14081 1.4213171 17.56213
We can illustrate the total damages using the following plot.
barplot(top_damage[1:5,]$TOTAL_DAMAGE, names = c("FLOOD", "Hu./Ty.", "TORNADO", "STORM S.", "HAIL"), main = "Total Damage by Event Type, US$ bn" )
This brief analysis allows us to draw some quick conclusions on the impact of severe weather on health and economic outcomes. Caution must be exercised in reading these, for we didn’t take into account the relative frequency of occurrences, nor did we pay attention to the location and population density of the affected areas. The results here serve as a first step towards understanding some of the consequences of severe weather.
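One of the caveats above, the relative frequency of occurrences, could be examined by dividing the totals by the number of recorded events. The following is a minimal sketch on invented rows, assuming that each row of the data is one recorded event; the column names EVENTS and FATALITIES_PER_EVENT are introduced only for this example.

```r
library(dplyr)

# Toy records: one row per event; all counts are invented.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "TORNADO", "TORNADO", "HEAT", "HEAT"),
  FATALITIES = c(0, 2, 0, 2, 5, 3)
)

# Number of events, total fatalities and fatalities per event, by event type.
per_event <- toy %>%
  group_by(EVTYPE) %>%
  summarise(EVENTS = n(), FATALITIES = sum(FATALITIES)) %>%
  mutate(FATALITIES_PER_EVENT = FATALITIES / EVENTS) %>%
  arrange(desc(FATALITIES_PER_EVENT))

print(per_event)
```

A ranking by fatalities per event can differ from a ranking by totals, since frequent event types accumulate large totals even when individual events are rarely deadly.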