In this data analysis we identified the atmospheric events that were primarily responsible for negative effects on the US population and economy between 1950 and 2011. We sought answers to the following two questions: 1. Across the United States, which types of events are most harmful with respect to population health? 2. Across the United States, which types of events have the greatest economic consequences? To investigate the problem, we obtained the Storm Data (47 Mb) from the U.S. National Oceanic and Atmospheric Administration (NOAA). From this data we found that tornadoes are the atmospheric events that most significantly affect the population and economy of the U.S.
The NOAA storm database tracks the characteristics of major storms and weather events in the United States, including when and where they occurred, as well as estimates of any fatalities, injuries, and property damage. The National Weather Service provides detailed documentation about the collection method and the structure of the data.
First, we downloaded the compressed (bz2) database, which is stored as comma-separated values:
## U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database (47 Mb)
NOAA.data.URL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
subfolder <- "./RepData_PA2" ## Subfolder within the default R working directory
NOAA.data.local <- "StormData.csv.bz2" ## Local, compressed destination filename
NOAA.data.local <- paste(subfolder,NOAA.data.local,sep="/")
if (!file.exists(subfolder))
{
  dir.create(subfolder)
}
if (!file.exists(NOAA.data.local))
{
  download.file(NOAA.data.URL, NOAA.data.local, mode = "wb") ## mode="wb" for compressed (binary) files
}
We then loaded the complete dataset to identify the data of interest and to assess the data quality. The function bzfile() was used to read the compressed source file directly, without extracting it first.
if (!exists("StormData"))
{
  StormData <- read.csv(bzfile(NOAA.data.local), stringsAsFactors = FALSE)
}
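How reading through bzfile() behaves can be tried on a small scale without downloading anything. The following is only a minimal sketch on made-up data; the names toy, toy.path and toy.back are hypothetical and the file is a temporary one.

```r
# Toy round trip: write a small CSV through a bzip2 connection, then read it back.
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"),
                  FATALITIES = c(1, 0),
                  stringsAsFactors = FALSE)
toy.path <- tempfile(fileext = ".csv.bz2")  # hypothetical temporary file

con <- bzfile(toy.path, open = "w")         # compressing write connection
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() opens the bzfile() connection, decompresses on the fly and closes it again
toy.back <- read.csv(bzfile(toy.path), stringsAsFactors = FALSE)
toy.back
```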
After reading in the NOAA Storm Data, we checked the dimensions of the dataset (it has 902,297 rows) and its first few rows.
dim(StormData)
## [1] 902297 37
head(StormData)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
Our data analysis requires the following R packages to be installed and loaded:
## dplyr
if("dplyr" %in% rownames(installed.packages()) == FALSE) {install.packages("dplyr")}
suppressMessages(library(dplyr))
## data.table
if("data.table" %in% rownames(installed.packages()) == FALSE) {install.packages("data.table")}
suppressMessages(library(data.table))
## lubridate
if("lubridate" %in% rownames(installed.packages()) == FALSE) {install.packages("lubridate")}
suppressMessages(library(lubridate))
## date
if("date" %in% rownames(installed.packages()) == FALSE) {install.packages("date")}
suppressMessages(library(date))
## ggplot2
if("ggplot2" %in% rownames(installed.packages()) == FALSE) {install.packages("ggplot2")}
suppressMessages(library(ggplot2))
The data of interest are stored in the following columns: EVTYPE, FATALITIES, INJURIES, PROPDMG, CROPDMG, REFNUM, BGN_DATE. Therefore, the original dataset was reduced to these relevant columns. In addition, the size of the dataset was decreased by excluding those records in which FATALITIES, INJURIES, PROPDMG and CROPDMG are all 0, since we are looking for events that caused fatalities, injuries, or damage to property or crops.
StormData.subset <- subset(StormData, FATALITIES > 0 | INJURIES > 0 | PROPDMG > 0 | CROPDMG > 0,
select = c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "CROPDMG", "REFNUM", "BGN_DATE"))
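The effect of the row filter can be checked on a few made-up records. This is only an illustrative sketch; the data frame toy and its values are hypothetical. A record is kept as soon as any one of the four measures is greater than zero, and dropped only when all four are zero.

```r
# Hypothetical records: only the HAIL and HEAT rows have zeros in all four measures.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HAIL", "FLOOD", "HEAT"),
  FATALITIES = c(0, 0, 2, 0),
  INJURIES   = c(3, 0, 0, 0),
  PROPDMG    = c(25, 0, 0, 0),
  CROPDMG    = c(0, 0, 0, 0),
  stringsAsFactors = FALSE)

# Same condition as above: keep a row if at least one measure is positive.
toy.subset <- subset(toy,
                     FATALITIES > 0 | INJURIES > 0 | PROPDMG > 0 | CROPDMG > 0,
                     select = c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "CROPDMG"))
toy.subset
```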
The summary of StormData.subset indicated that the numeric data of interest are ready for analysis, since none of the relevant columns contains NA values:
summary(StormData.subset)
## EVTYPE FATALITIES INJURIES
## Length:254633 Min. : 0.0000 Min. : 0.0000
## Class :character 1st Qu.: 0.0000 1st Qu.: 0.0000
## Mode :character Median : 0.0000 Median : 0.0000
## Mean : 0.0595 Mean : 0.5519
## 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :583.0000 Max. :1700.0000
## PROPDMG CROPDMG REFNUM BGN_DATE
## Min. : 0.00 Min. : 0.000 Min. : 1 Length:254633
## 1st Qu.: 2.00 1st Qu.: 0.000 1st Qu.:281406 Class :character
## Median : 5.00 Median : 0.000 Median :473485 Mode :character
## Mean : 42.75 Mean : 5.411 Mean :484335
## 3rd Qu.: 25.00 3rd Qu.: 0.000 3rd Qu.:703590
## Max. :5000.00 Max. :990.000 Max. :902260
Fatalities and injuries (numbers of persons) were summed into a new measure called health, and the property and crop damage figures (PROPDMG and CROPDMG) were summed into a new measure called economic. To simplify the time series, the dates of the observations were grouped by decade.
StormData.subset <- mutate(StormData.subset,
health = FATALITIES + INJURIES,
economic = PROPDMG + CROPDMG)
StormData.subset <- mutate(StormData.subset,
  ## strip the time part, parse month/day/year, then round the year down to its decade
  decade = trunc(year(as.date(gsub(" .*$", "", BGN_DATE), order = "mdy")) / 10) * 10)
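The decade arithmetic, and one possible way to convert the damage figures into dollar amounts, can be sketched on a toy data frame. This is a minimal illustration and not the code used for the results in this report: it parses the dates with base R's as.Date() instead of the date package, and it assumes that the PROPDMGEXP codes K, M and B stand for thousands, millions and billions and that any other code means a multiplier of 1. The names toy, exp.multiplier and PROPDMG.usd are hypothetical.

```r
# Hypothetical records in the same layout as the Storm Data columns used above.
toy <- data.frame(
  BGN_DATE   = c("4/18/1950 0:00:00", "11/15/1951 0:00:00", "6/8/2004 0:00:00"),
  PROPDMG    = c(25, 2.5, 1.2),
  PROPDMGEXP = c("K", "M", ""),
  stringsAsFactors = FALSE)

# Decade: drop the time part, parse month/day/year, round the year down to a multiple of 10.
toy.date   <- as.Date(sub(" .*$", "", toy$BGN_DATE), format = "%m/%d/%Y")
toy.year   <- as.integer(format(toy.date, "%Y"))
toy$decade <- (toy.year %/% 10) * 10

# Assumed multipliers for the magnitude codes; unknown or empty codes fall back to 1.
exp.multiplier <- c(K = 1e3, M = 1e6, B = 1e9)
mult <- unname(exp.multiplier[toupper(toy$PROPDMGEXP)])
mult[is.na(mult)] <- 1
toy$PROPDMG.usd <- toy$PROPDMG * mult
toy
```

The same lookup could be applied to CROPDMG and CROPDMGEXP before the two damage columns are added together.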
Finally, the subset of Storm Data was aggregated by decades and atmospheric event types (EVTYPE) to provide the summary of health and economic related effects.
StormData.analysis <- summarise(group_by(StormData.subset, decade, EVTYPE), sum.health=sum(health), sum.economic=sum(economic))
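The grouping step can be cross-checked on made-up numbers. The sketch below uses base R's aggregate() rather than dplyr, purely so that it runs without extra packages; the data frame toy and its values are hypothetical.

```r
# Hypothetical per-event records with the two derived measures.
toy <- data.frame(
  decade   = c(1950, 1950, 1950, 1960),
  EVTYPE   = c("TORNADO", "TORNADO", "HAIL", "TORNADO"),
  health   = c(15, 2, 0, 6),
  economic = c(25, 2.5, 10, 2.5),
  stringsAsFactors = FALSE)

# One row per (decade, EVTYPE) pair, with both measures summed within each pair.
toy.analysis <- aggregate(cbind(sum.health = health, sum.economic = economic) ~ decade + EVTYPE,
                          data = toy, FUN = sum)
toy.analysis
```

The two TORNADO records from the 1950s collapse into a single row, while the 1960 TORNADO record stays separate because it belongs to a different decade.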
The data prepared for analysis were ranked to identify the top 3 atmospheric events affecting the population and economy in each decade. The result set contained two different labels for the same event, TSTM WIND and THUNDERSTORM WIND. To analyse consistent event types, both were labelled THUNDERSTORM WIND. The correction follows section 2.1.1 (Storm Data Event Table) of the Storm Data documentation, linked in the Acquiring data section of this document.
top3.evtype.health <- na.omit(setDT(StormData.analysis)[order(decade,-sum.health), .SD[1:3], by=decade])
top3.evtype.health$EVTYPE[top3.evtype.health$EVTYPE == "TSTM WIND"] <- "THUNDERSTORM WIND"
top3.evtype.health$EVTYPE <- factor(top3.evtype.health$EVTYPE)
top3.evtype.health$EVTYPE <- factor(top3.evtype.health$EVTYPE, levels = rev(levels(top3.evtype.health$EVTYPE)))
top3.evtype.economic <- na.omit(setDT(StormData.analysis)[order(decade,-sum.economic), .SD[1:3], by=decade])
top3.evtype.economic$EVTYPE[top3.evtype.economic$EVTYPE == "TSTM WIND"] <- "THUNDERSTORM WIND"
top3.evtype.economic$EVTYPE <- factor(top3.evtype.economic$EVTYPE)
top3.evtype.economic$EVTYPE <- factor(top3.evtype.economic$EVTYPE, levels = rev(levels(top3.evtype.economic$EVTYPE)))
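In the code above the two labels are unified after the top 3 have been selected. An alternative ordering, in which the labels are merged before the sums are computed so that the counts of both spellings are added together, can be sketched as follows. This is only an illustration in base R on made-up numbers; the names toy, toy.sum and toy.top3 are hypothetical.

```r
# Hypothetical per-event totals for a single decade.
toy <- data.frame(
  decade = c(1990, 1990, 1990, 1990, 1990),
  EVTYPE = c("TSTM WIND", "THUNDERSTORM WIND", "TORNADO", "FLOOD", "HAIL"),
  health = c(40, 30, 60, 20, 5),
  stringsAsFactors = FALSE)

# Merge the two spellings first, so that their counts are summed together.
toy$EVTYPE[toy$EVTYPE == "TSTM WIND"] <- "THUNDERSTORM WIND"
toy.sum <- aggregate(health ~ decade + EVTYPE, data = toy, FUN = sum)

# Rank within each decade and keep the three largest totals.
toy.sum  <- toy.sum[order(toy.sum$decade, -toy.sum$health), ]
toy.top3 <- do.call(rbind, lapply(split(toy.sum, toy.sum$decade), head, n = 3))
toy.top3
```

On this toy data the merged THUNDERSTORM WIND total ranks first, although neither spelling on its own exceeds the TORNADO total.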
The following figures visualize the top 3 events contributing to health and economic damages over the decades.
health.plot <- ggplot(top3.evtype.health, aes(x = reorder(EVTYPE, -sum.health), y = sum.health, fill=EVTYPE))
health.plot <- health.plot + geom_bar(stat="identity", fill="#3333CC")
health.plot <- health.plot + guides(fill = "none") ## hide the fill legend
health.plot <- health.plot + facet_grid(~ decade, scales = "free", space = "free")
health.plot <- health.plot + scale_fill_brewer()
health.plot <- health.plot + theme(axis.text.x = element_text(angle = 90, hjust = 1, size=8))
health.plot <- health.plot + theme(panel.background = element_rect(fill = "#333333"))
health.plot <- health.plot + ggtitle("Top 3 atmospheric events\nharmful to population health (by decades)\n")
health.plot <- health.plot + theme(plot.title = element_text(lineheight=.8, face="bold"))
health.plot <- health.plot + xlab("Atmospheric events by decades")
health.plot <- health.plot + ylab("Number of people affected (fatalities + injuries)")
health.plot
economic.plot <- ggplot(top3.evtype.economic, aes(x = reorder(EVTYPE, -sum.economic), y = sum.economic, fill=EVTYPE))
economic.plot <- economic.plot + geom_bar(stat="identity", fill="#006666")
economic.plot <- economic.plot + guides(fill = "none") ## hide the fill legend
economic.plot <- economic.plot + facet_grid(~ decade, scales = "free", space = "free")
economic.plot <- economic.plot + scale_fill_brewer()
economic.plot <- economic.plot + theme(axis.text.x = element_text(angle = 90, hjust = 1, size=8))
economic.plot <- economic.plot + theme(panel.background = element_rect(fill = "#333333"))
economic.plot <- economic.plot + ggtitle("Top 3 atmospheric events\nwith great economic consequence (by decades)\n")
economic.plot <- economic.plot + theme(plot.title = element_text(lineheight=.8, face="bold"))
economic.plot <- economic.plot + xlab("Atmospheric events by decades")
economic.plot <- economic.plot + ylab("Total amount of damage caused [$]")
economic.plot
According to the figures above, tornadoes are the atmospheric events that have most significantly affected the population of the U.S. over the past decades. Although tornadoes caused the highest number of fatalities and injuries in the 2000s, most of the economic damage in that decade was caused primarily by flash floods and secondarily by thunderstorm winds.