Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The goal of this analysis is to use the NOAA Storm Database to answer some basic questions about severe weather events.
This analysis will address the following questions:
Across the United States, which types of events are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
The 47 MB data file for this project is compressed with the bzip2 algorithm to reduce its size. It can be downloaded from the course website:
# downloading data
Url_data <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
File_data <- "StormData.csv.bz2"
if (!file.exists(File_data)) {
  download.file(Url_data, File_data, mode = "wb")
}
# reading data (read.csv decompresses the .bz2 file transparently)
rawdata <- read.csv(file = File_data, header = TRUE, sep = ",")
The events in the database start in 1950 and end in November 2011. In the earlier years there are generally fewer events recorded, most likely due to a lack of good records; more recent years should be considered more complete. Because NOAA began recording all event types in January 1996, the dataset is filtered to remove events recorded before that date.
# subsetting by date
data <- rawdata
data$BGN_DATE <- strptime(rawdata$BGN_DATE, "%m/%d/%Y %H:%M:%S")
data <- subset(data, BGN_DATE > "1995-12-31")
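The same date filter can also be expressed with the `Date` class. The following is a minimal sketch on a toy data frame, not the report's own code: the name `toy` and its four timestamps are made up for illustration, and the timestamps are assumed to follow the same month/day/year layout as BGN_DATE.

```r
# Toy timestamps in the same "month/day/year time" layout as BGN_DATE (made-up values)
toy <- data.frame(
  BGN_DATE = c("4/18/1950 0:00:00", "12/31/1995 0:00:00",
               "1/1/1996 0:00:00", "11/28/2011 0:00:00"),
  stringsAsFactors = FALSE
)

# Parse only the date part; the trailing time text is ignored by the format
toy$BGN_DATE <- as.Date(toy$BGN_DATE, format = "%m/%d/%Y")

# Keep events that began on or after 1 January 1996
recent <- subset(toy, BGN_DATE >= as.Date("1996-01-01"))
print(recent)
```

Comparing against an explicit `as.Date("1996-01-01")` with `>=` states the cutoff directly instead of relying on a string being coerced to a date.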
Seven variables in the dataset pertain to the two questions:
EVTYPE - type of event
FATALITIES - number of fatalities
INJURIES - number of injuries
PROPDMG - property damage amount, to be scaled by ‘PROPDMGEXP’
PROPDMGEXP - the exponent code for ‘PROPDMG’ (property damage)
CROPDMG - crop damage amount, to be scaled by ‘CROPDMGEXP’
CROPDMGEXP - the exponent code for ‘CROPDMG’ (crop damage)
data <- subset(data, select = c(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP))
To further reduce the data size and make the analysis more efficient, events with no recorded fatalities, injuries, property damage, or crop damage are removed from the main data:
# cleaning event type names
data$EVTYPE <- toupper(data$EVTYPE)
# eliminating zero data
data <- data[data$FATALITIES != 0 |
             data$INJURIES != 0 |
             data$PROPDMG != 0 |
             data$CROPDMG != 0, ]
Fatalities and injuries are summed by event type so that the ten most harmful event types can be identified:
healthdata <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = data, FUN=sum)
healthdata$PEOPLE_LOSS <- healthdata$FATALITIES + healthdata$INJURIES
healthdata <- healthdata[order(healthdata$PEOPLE_LOSS, decreasing = TRUE), ]
Top10_events_people <- healthdata[1:10,]
knitr::kable(Top10_events_people, format = "markdown")
| | EVTYPE | FATALITIES | INJURIES | PEOPLE_LOSS |
|---|---|---|---|---|
| 149 | TORNADO | 1511 | 20667 | 22178 |
| 39 | EXCESSIVE HEAT | 1797 | 6391 | 8188 |
| 48 | FLOOD | 414 | 6758 | 7172 |
| 107 | LIGHTNING | 651 | 4141 | 4792 |
| 153 | TSTM WIND | 241 | 3629 | 3870 |
| 46 | FLASH FLOOD | 887 | 1674 | 2561 |
| 146 | THUNDERSTORM WIND | 130 | 1400 | 1530 |
| 182 | WINTER STORM | 191 | 1292 | 1483 |
| 69 | HEAT | 237 | 1222 | 1459 |
| 88 | HURRICANE/TYPHOON | 64 | 1275 | 1339 |
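The aggregate-then-rank steps above can be illustrated on a small example. This is a sketch with made-up numbers; the data frame `toy` and the result `totals` are hypothetical names, not part of the analysis.

```r
# Made-up events: two tornado rows, two flood rows, one heat row
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
  FATALITIES = c(2, 1, 3, 4, 0),
  INJURIES   = c(10, 0, 5, 1, 2)
)

# Sum both columns within each event type
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Combined count, then sort from most to least harmful
totals$PEOPLE_LOSS <- totals$FATALITIES + totals$INJURIES
totals <- totals[order(totals$PEOPLE_LOSS, decreasing = TRUE), ]
print(totals)
```

After sorting, each row keeps the row name it had in the alphabetically ordered output of `aggregate()`. This is why the first column of the table above shows numbers such as 149 and 39 rather than 1 to 10.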
The exponent codes in the two columns PROPDMGEXP and CROPDMGEXP are converted to numeric powers of ten so that damage amounts can be computed and sorted:
data$PROPDMGEXP <- gsub("[Hh]", "2", data$PROPDMGEXP)
data$PROPDMGEXP <- gsub("[Kk]", "3", data$PROPDMGEXP)
data$PROPDMGEXP <- gsub("[Mm]", "6", data$PROPDMGEXP)
data$PROPDMGEXP <- gsub("[Bb]", "9", data$PROPDMGEXP)
data$PROPDMGEXP <- gsub("\\+", "1", data$PROPDMGEXP)
data$PROPDMGEXP <- gsub("\\?|\\-|\\ ", "0", data$PROPDMGEXP)
data$PROPDMGEXP <- as.numeric(data$PROPDMGEXP)
data$CROPDMGEXP <- gsub("[Hh]", "2", data$CROPDMGEXP)
data$CROPDMGEXP <- gsub("[Kk]", "3", data$CROPDMGEXP)
data$CROPDMGEXP <- gsub("[Mm]", "6", data$CROPDMGEXP)
data$CROPDMGEXP <- gsub("[Bb]", "9", data$CROPDMGEXP)
data$CROPDMGEXP <- gsub("\\+", "1", data$CROPDMGEXP)
data$CROPDMGEXP <- gsub("\\-|\\?|\\ ", "0", data$CROPDMGEXP)
data$CROPDMGEXP <- as.numeric(data$CROPDMGEXP)
data$PROPDMGEXP[is.na(data$PROPDMGEXP)] <- 0
data$CROPDMGEXP[is.na(data$CROPDMGEXP)] <- 0
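The same conversion can be written once as a lookup table and reused for both columns. The sketch below is an alternative formulation, not the code used in this report: `exp_lookup` and `damage_value` are hypothetical names, and the mapping simply mirrors the substitutions above (H = 2, K = 3, M = 6, B = 9, "+" = 1, anything else unrecognised = 0).

```r
# Hypothetical lookup table mirroring the gsub() substitutions above
exp_lookup <- c(H = 2, K = 3, M = 6, B = 9, "+" = 1)

# Hypothetical helper: amount * 10^power, where power comes from the code
damage_value <- function(amount, code) {
  code <- toupper(trimws(as.character(code)))
  power <- unname(exp_lookup[code])            # NA for codes not in the table
  is_digit <- grepl("^[0-9]$", code)           # single digits are used as-is
  power[is_digit] <- as.numeric(code[is_digit])
  power[is.na(power)] <- 0                     # blank, "?", "-" and others
  amount * 10^power
}

# Made-up amounts and codes for illustration
example <- damage_value(c(25, 1.5, 2, 10, 3), c("K", "m", "B", "", "2"))
print(example)
```

For instance, an amount of 25 with code "K" becomes 25 * 10^3, and a blank code leaves the amount unchanged.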
library(dplyr)
data <- mutate(data,
               PROPDMGTOTAL = PROPDMG * (10 ^ PROPDMGEXP),
               CROPDMGTOTAL = CROPDMG * (10 ^ CROPDMGEXP))
ecodata <- aggregate(cbind(PROPDMGTOTAL, CROPDMGTOTAL) ~ EVTYPE, data = data, FUN=sum)
ecodata$ECONOMIC_LOSS <- ecodata$PROPDMGTOTAL + ecodata$CROPDMGTOTAL
ecodata <- ecodata[order(ecodata$ECONOMIC_LOSS, decreasing = TRUE), ]
Top10_events_economy <- ecodata[1:10,]
knitr::kable(Top10_events_economy, format = "markdown")
| | EVTYPE | PROPDMGTOTAL | CROPDMGTOTAL | ECONOMIC_LOSS |
|---|---|---|---|---|
| 48 | FLOOD | 143944833550 | 4974778400 | 148919611950 |
| 88 | HURRICANE/TYPHOON | 69305840000 | 2607872800 | 71913712800 |
| 141 | STORM SURGE | 43193536000 | 5000 | 43193541000 |
| 149 | TORNADO | 24616945710 | 283425010 | 24900370720 |
| 66 | HAIL | 14595143420 | 2476029450 | 17071172870 |
| 46 | FLASH FLOOD | 15222203910 | 1334901700 | 16557105610 |
| 86 | HURRICANE | 11812819010 | 2741410000 | 14554229010 |
| 32 | DROUGHT | 1046101000 | 13367566000 | 14413667000 |
| 152 | TROPICAL STORM | 7642475550 | 677711000 | 8320186550 |
| 83 | HIGH WIND | 5247860360 | 633561300 | 5881421660 |
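The totals in the table are raw dollar amounts, which are hard to read at this magnitude. As a small sketch, the three largest ECONOMIC_LOSS values from the table can be rescaled to billions of dollars; the vector names `loss_usd` and `loss_billion` are hypothetical.

```r
# ECONOMIC_LOSS values copied from the first three rows of the table
loss_usd <- c(FLOOD = 148919611950,
              "HURRICANE/TYPHOON" = 71913712800,
              "STORM SURGE" = 43193541000)

# Express in billions of dollars, rounded to two decimals
loss_billion <- round(loss_usd / 1e9, 2)
print(loss_billion)
```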
Question #1: Identifying which weather events cause the most fatalities and injuries:
# plotting health loss
library(ggplot2)
g <- ggplot(data = Top10_events_people, aes(x = reorder(EVTYPE, PEOPLE_LOSS), y = PEOPLE_LOSS))
g <- g + geom_bar(stat = "identity", colour = "black", fill = "lightblue")
g <- g + labs(title = "Total Fatalities & Injuries in USA by weather events 1996-2011")
g <- g + labs(y = "Number of fatalities and injuries", x = "Event Type")
g <- g + coord_flip()
# apply the complete theme first; adding it last would reset the centered title
g <- g + theme_classic()
g <- g + theme(plot.title = element_text(hjust = 0.5))
print(g)
Question #2: Identifying which weather events cause the most economic damage:
# plotting economic loss
g <- ggplot(data = Top10_events_economy, aes(x = reorder(EVTYPE, ECONOMIC_LOSS), y = ECONOMIC_LOSS))
g <- g + geom_bar(stat = "identity", colour = "black", fill = "lightblue")
g <- g + labs(title = "Total Economic Loss in USA by weather events 1996-2011")
g <- g + labs(y = "Size of property and crop loss", x = "Event Type")
g <- g + coord_flip()
# apply the complete theme first; adding it last would reset the centered title
g <- g + theme_classic()
g <- g + theme(plot.title = element_text(hjust = 0.5))
print(g)
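Based on the analysis of the NOAA data from 1996 to 2011, tornadoes caused the most combined fatalities and injuries (22,178), while floods caused the greatest economic loss (about 148.9 billion dollars in combined property and crop damage).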