The basic goal of this report is to explore the NOAA Storm Database and answer the following questions about severe weather events:
The report will show that tornados are most harmful to population health in the mainland US between 1993 and 2011. They cause 3 times more casualties than any other severe weather events. Floods have the greatest economic consequences. They are responsible for more than 150 Billion USD in property and crop damages since 1993. Texas, California and the southwest of the US are particularly exposed to severe weather events.
We first load the data into R.
## Download the file.
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(url, "repdata-data-StormData.csv", method="curl")
## Reading and Subsetting the data
dt <- read.csv("./repdata-data-StormData.csv", sep=",", header=T, na.strings = "", stringsAsFactors=FALSE)
dataProcessing <- subset(dt, select=c( "STATE", "EVTYPE", "FATALITIES","INJURIES","PROPDMG",
"PROPDMGEXP", "CROPDMG", "CROPDMGEXP", "BGN_DATE" ))
## Date and US State Formatting
library(lubridate)
dataProcessing$DATE <- as.Date(dataProcessing$BGN_DATE, format = "%m/%d/%Y" )
dataProcessing$STATENAME <- state.name[match(dataProcessing$STATE,state.abb)]
dataProcessing$YEAR <- substr(dataProcessing$DATE,1,4)
dataProcessing$YEAR <- as.numeric(dataProcessing$YEAR)
# 902297 obs. of 12 variables
In this section, we look at the total number of missing values and other potential sources of error. We then chose a strategy for imputing missing value and improving the data quality.
## Calculate the total number of missing value per column.
sapply(dataProcessing, function(x) sum(is.na(x)))
## STATE EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP
## 0 0 0 0 0 465934
## CROPDMG CROPDMGEXP BGN_DATE DATE STATENAME YEAR
## 0 618413 0 0 19111 0
## We run a simple frequency analysis to look at the "damage" format and state name.
# sort(table(dataProcessing$STATENAME),decreasing=T)
sort(table(dataProcessing$PROPDMGEXP),decreasing=T)
##
## K M 0 B 5 1 2 ? m H
## 424665 11330 216 40 28 25 13 8 7 6
## + 7 3 4 6 - 8 h
## 5 5 4 4 4 1 1 1
sort(table(dataProcessing$CROPDMGEXP),decreasing=T)
##
## K M k 0 B ? 2 m
## 281832 1994 21 19 9 7 1 1
The data is quite “dirty”. We choose the following strategy to improve the data quality:
## Missing values
dataProcessing$CROPDMGSUM = 1
dataProcessing[is.na(dataProcessing$CROPDMGEXP), ]$CROPDMGEXP <- dataProcessing[is.na(dataProcessing$CROPDMGEXP), ]$CROPDMGSUM
dataProcessing$PROPDMGSUM = 1
dataProcessing[is.na(dataProcessing$PROPDMGEXP), ]$PROPDMGEXP <- dataProcessing[is.na(dataProcessing$PROPDMGEXP), ]$PROPDMGSUM
## Remove regions outside Northern America and Federal States
dataProcessing2 <- dataProcessing[!is.na(dataProcessing$STATENAME), ]
# 883186 obs. of 14 variables
We transform and create new variables needed for analysing the damage due to severe weather storms.
transform <- function(x) {
if (x=="H"|x=="h") y <- 100
else if (x=="K"|x=="k") y <- 1000
else if (x=="M"|x=="m") y <- 1000000
else if (x=="B"|x=="b") y <- 1000000000
else y <- 1
y
}
dataProcessing2$PROPDMGEXP <- sapply(dataProcessing2$PROPDMGEXP,transform)
dataProcessing2$CROPDMGEXP <- sapply(dataProcessing2$CROPDMGEXP,transform)
## Create Property and Crop Damage Variables for evaluating the economic impact.
dataProcessing2$DamageProperty <- dataProcessing2$PROPDMGEXP * dataProcessing2$PROPDMG
dataProcessing2$DamageCrops <- dataProcessing2$CROPDMGEXP * dataProcessing2$CROPDMG
dataProcessing2$TotalDamage <- dataProcessing2$DamageProperty + dataProcessing2$DamageCrops
## Create Casualty Variable for evaluating the population health impact.
dataProcessing2$Casualties <- dataProcessing2$FATALITIES + dataProcessing2$INJURIES
We finally quantify the amount of analytical data needed for further data processing.
## Subsetting and quality assessment of the date (assuming that more recent years should be considered more complete.)
## dataProcessing3 <- dataProcessing2[dataProcessing2$FATALITIES > 0, ]
## summary(dataProcessing3$DATE)
dataProcessing4 <- dataProcessing2[dataProcessing2$DamageCrops > 0, ]
summary(dataProcessing4$DATE)
## Min. 1st Qu. Median Mean 3rd Qu.
## "1993-01-06" "1997-08-14" "2001-09-20" "2002-05-08" "2007-07-03"
## Max.
## "2011-11-28"
Conclusion: Information about crop damages is not available before 1993. There is also a similar issue for other severe weather events. Before 1993, the NOAA Storm Database only contains casualties history data for tornados (1950), hails (1983) and thunderstorm winds (1983). Considering the issue of comparability of data sources, we chose to only analyse this dataset from June 1993 to November 2011. All data prior to 01-06-1993 will be removed.
## Creating the analytical dataset
analyticalData <- dataProcessing2[dataProcessing2$DATE >= "1993-01-06", ]
# 695571 obs. of 18 variables
Finally, we create a new dataset for analysing the data. We beleive this dataset contains a higher data quality for further analysis and reporting.
First, we look at which types of events are most harmful with respect to population health.
## Aggregate casualties by most harmful events
aggAnalytical <- with(analyticalData, aggregate(analyticalData$Casualties, by=list(analyticalData$EVTYPE), sum))
names(aggAnalytical)[names(aggAnalytical) == 'Group.1'] <- 'Harmful_Events'
names(aggAnalytical)[names(aggAnalytical) == 'x'] <- 'Nb_Casualties'
## Top 10 most harmful events (in terms of casualties)
agg <- aggAnalytical[order(-aggAnalytical$Nb_Casualties), ]
top <- agg[1:10, ]
top
## Harmful_Events Nb_Casualties
## 811 TORNADO 24931
## 129 EXCESSIVE HEAT 8092
## 167 FLOOD 7250
## 451 LIGHTNING 6018
## 833 TSTM WIND 3855
## 272 HEAT 3035
## 152 FLASH FLOOD 2703
## 415 ICE STORM 2064
## 737 THUNDERSTORM WIND 1604
## 939 WINTER STORM 1526
Conclusion: Tornados are by far the most harmful events in terms of casualties, followed by excessive heat, floods, lightning and thunderstorm winds.
Secondly, we look at which states in the US had the most casualties between 1993 and 2011.
## Aggregate casualties by states
aggAnalytical2 <- with(analyticalData, aggregate(analyticalData$Casualties, by=list(analyticalData$STATENAME), sum))
names(aggAnalytical2)[names(aggAnalytical2) == 'Group.1'] <- 'States'
names(aggAnalytical2)[names(aggAnalytical2) == 'x'] <- 'Nb_Casualties'
## Top 10 States most at risk from 1993-2011 (casualties)
agg2 <- aggAnalytical2[order(-aggAnalytical2$Nb_Casualties), ]
agg2[1:10, ]
## States Nb_Casualties
## 43 Texas 10998
## 25 Missouri 7309
## 1 Alabama 4825
## 9 Florida 4043
## 5 California 3738
## 35 Ohio 3056
## 42 Tennessee 3032
## 13 Illinois 2958
## 38 Pennsylvania 2832
## 10 Georgia 2825
Conclusion: Texas is by far the state with the highest number of casualties between 1993-2011, followed by Missouri, Alabama, Florida and California.
## Create a US map to show the number of casualties per state.
library(maps)
library(ggplot2)
library(grDevices)
## First read the data and add a column with the names of the states:
aggAnalytical3 <- with(analyticalData, aggregate(analyticalData$Casualties, by=list(analyticalData$STATE, analyticalData$STATENAME), sum))
names(aggAnalytical3)[names(aggAnalytical3) == 'Group.1'] <- 'state'
names(aggAnalytical3)[names(aggAnalytical3) == 'x'] <- 'Casualties'
names(aggAnalytical3)[names(aggAnalytical3) == 'Group.2'] <- 'region'
agg3 <- aggAnalytical3[order(aggAnalytical3$state), ]
agg3$state <- as.factor(agg3$state)
agg3$region <- tolower(as.factor(agg3$region))
## Create a map
states <- map_data("state")
agg3merged <- merge(states, agg3, by = "region")
agg3merged <- agg3merged[order(agg3merged$order), ]
pal <- colorRampPalette(c("white", "red"))
breaks <- c(quantile(agg3merged$Casualties))
q <- ggplot(data = agg3merged, aes(x = long, y = lat, group = group))
q <- q + geom_polygon(color = "black", aes(fill = factor(Casualties)))
q <- q + ggtitle("Accumulated Number of Casualties between 1993 and 2011
due to Severe Weather Events")
q <- q + scale_fill_manual( values = pal(48), breaks= breaks )
q <- q + labs(fill="Number of\nCasualties")
Fig 1.1 US States showing the total number of casualties due to severe weather events between 1993 and 2011
q
We first look at which types of events have the greatest economic consequences.
## Aggregate Total Damage by most harmful events
aggAnalytical4 <- with(analyticalData, aggregate(analyticalData$TotalDamage, by=list(analyticalData$EVTYPE), sum))
names(aggAnalytical4)[names(aggAnalytical4) == 'Group.1'] <- 'Harmful_Events'
names(aggAnalytical4)[names(aggAnalytical4) == 'x'] <- 'Total_Damage'
## Top 10 most harmful events (economic consequences)
agg4 <- aggAnalytical4[order(-aggAnalytical4$Total_Damage), ]
## USD Currency
library(scales)
agg4$Total_Damage <- dollar(agg4$Total_Damage)
agg4[1:10, ]
## Harmful_Events Total_Damage
## 167 FLOOD $150,137,226,757
## 399 HURRICANE/TYPHOON $71,636,600,800
## 647 STORM SURGE $43,323,466,000
## 811 TORNADO $26,753,187,479
## 241 HAIL $18,758,215,016
## 152 FLASH FLOOD $17,266,310,517
## 94 DROUGHT $15,013,467,000
## 390 HURRICANE $12,103,928,010
## 571 RIVER FLOOD $10,148,401,000
## 415 ICE STORM $8,966,471,360
##Barplot Foundation for Summary Section
agg4A <- aggAnalytical4[order(-aggAnalytical4$Total_Damage), ]
top2 <- agg4A[1:10, ]
Conclusion: Floods are by far the event that has the greatest economic consequence in the US, followed by hurricanes/typhoons, storm surges, tornados and hail.
Secondly, we look at which states in the US had the greatest economic costs.
## Aggregate Total Damage by States
aggAnalytical5 <- with(analyticalData, aggregate(analyticalData$TotalDamage, by=list(analyticalData$STATENAME), sum))
names(aggAnalytical5)[names(aggAnalytical5) == 'Group.1'] <- 'States'
names(aggAnalytical5)[names(aggAnalytical5) == 'x'] <- 'Total_Damage'
## Top 10 States most at risk from 1993-2011 (economic consequences)
agg5 <- aggAnalytical5[order(-aggAnalytical5$Total_Damage), ]
agg5$Total_Damage <- dollar(agg5$Total_Damage)
agg5[1:10, ]
## States Total_Damage
## 5 California $127,029,596,053
## 18 Louisiana $60,396,916,500
## 9 Florida $44,832,174,923
## 24 Mississippi $35,347,581,002
## 43 Texas $31,280,162,407
## 1 Alabama $16,592,394,290
## 13 Illinois $12,750,030,152
## 33 North Carolina $9,348,983,251
## 15 Iowa $8,212,999,790
## 25 Missouri $6,812,246,768
Conclusion: California is, by far, the US State that suffered the heaviest economic consequences of severe weather events, followed by Louisiana, Florida, Mississipi and Texas.
## Create a US map to show the number of casualties per state.
library(maps)
library(ggplot2)
library(grDevices)
## First read the data and add a column with the names of the states:
aggAnalytical6 <- with(analyticalData, aggregate(analyticalData$TotalDamage, by=list(analyticalData$STATE, analyticalData$STATENAME), sum))
names(aggAnalytical6)[names(aggAnalytical6) == 'Group.1'] <- 'state'
names(aggAnalytical6)[names(aggAnalytical6) == 'x'] <- 'Total_Damage'
names(aggAnalytical6)[names(aggAnalytical6) == 'Group.2'] <- 'region'
agg6 <- aggAnalytical6[order(aggAnalytical6$state), ]
agg6$state <- as.factor(agg6$state)
agg6$region <- tolower(as.factor(agg6$region))
## Create a map
states2 <- map_data("state")
agg6merged <- merge(states2, agg6, by = "region")
agg6merged <- agg6merged[order(agg6merged$order), ]
pal2 <- colorRampPalette(c("white", "green"))
breaks2 <- c(quantile(agg6merged$Total_Damage))
q <- ggplot(data = agg6merged, aes(x = long, y = lat, group = group))
q <- q + geom_polygon(color = "black", aes(fill = factor(Total_Damage)))
q <- q + ggtitle("Property and Crop Damage due to Severe Weather Events
between 1993 and 2011 (in USD)")
q <- q + scale_fill_manual( values = pal2(48), breaks= breaks2 )
q <- q + labs(fill="Total Damage in USD")
Fig 2.1: US States showing the total damage (Property and Crop) in USD due to severe weather events between 1993 and 2011
q
par(mfrow = c(2, 1), mar=c(0.5,5,0.75,1)+1.4)
# plot 1 (graf 1.1)
barplot(top$Nb_Casualties, names.arg=top$Harmful_Events, horiz=TRUE, cex.names=0.5, las=1, ylim=rev(c(0,14)),
beside=TRUE,axes=TRUE,col="red", density=NA,legend.text=TRUE,xlim=c(0,30000), space=0.4, cex=0.5,
xlab="", ylab="", main="Top 10 Most Harmful Events - Casualties")
# plot 2 (graf 2.1)
barplot(top2$Total_Damage, names.arg=top2$Harmful_Events, horiz=TRUE, cex.names=0.5, las=1, ylim=rev(c(0,14)),
beside=TRUE,axes=TRUE,col="green", density=NA,legend.text=TRUE,xlim=c(0,200000000000), space=0.4,
cex=0.5, xlab="", ylab="", main="Top 10 Most Harmful Events - Total Damage USD")