Analysing the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm data, we will answer the following questions:
Across the United States, which types of events are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
The analysis shows that TORNADOES are the most harmful events for population health, with 91,346 injuries and 5,633 deaths.
FLOODS are the events with the greatest economic consequences, with 150 billion dollars in damages.
Read the full analysis to see the complete report.
library(ggplot2)
setwd("C:/Users/Reinaldo/Desktop/coursera-JHU/reproducibleresearch/week4")
## importing just the necessary columns
df <- read.csv("repdata-data-StormData.csv.bz2", stringsAsFactors = FALSE, strip.white=TRUE, header=TRUE)[,c("EVTYPE", "BGN_DATE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
Checking the data
head(df,10)
## EVTYPE BGN_DATE FATALITIES INJURIES PROPDMG PROPDMGEXP
## 1 TORNADO 4/18/1950 0:00:00 0 15 25.0 K
## 2 TORNADO 4/18/1950 0:00:00 0 0 2.5 K
## 3 TORNADO 2/20/1951 0:00:00 0 2 25.0 K
## 4 TORNADO 6/8/1951 0:00:00 0 2 2.5 K
## 5 TORNADO 11/15/1951 0:00:00 0 2 2.5 K
## 6 TORNADO 11/15/1951 0:00:00 0 6 2.5 K
## 7 TORNADO 11/16/1951 0:00:00 0 1 2.5 K
## 8 TORNADO 1/22/1952 0:00:00 0 0 2.5 K
## 9 TORNADO 2/13/1952 0:00:00 1 14 25.0 K
## 10 TORNADO 2/13/1952 0:00:00 0 0 25.0 K
## CROPDMG CROPDMGEXP
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## 7 0
## 8 0
## 9 0
## 10 0
From this first look at the data frame, we see that the data needs pre-processing. Here I’ll adjust the date, the column names, and the level of detail by aggregating the columns that matter for our analysis.
The NOAA data documentation says that “h” or “H” means 10^2, “k” or “K” means 10^3, “m” or “M” means 10^6, and “b” or “B” means 10^9. I will ignore any other exponent characters, so those values are treated as plain dollar amounts.
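As a rough illustration of the exponent mapping described above, the sketch below turns a damage value and its exponent code into millions of dollars using a named lookup vector. The helper name `exp_to_millions` and the sample values are hypothetical and are not part of the original analysis; the fallback for unlisted codes is an assumption that follows the "plain dollars" treatment described above.

```r
# Hypothetical helper (not part of the original analysis): return the
# multiplier that converts a damage value to millions of dollars.
exp_to_millions <- function(code) {
  # Documented codes only; dividing by 1e6 expresses them in millions.
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9) / 1e6
  # toupper() makes the match case-insensitive ("k" and "K" both hit K).
  mult <- unname(lookup[toupper(code)])
  # Assumption: any other code (blank, digits, symbols) means plain dollars.
  mult[is.na(mult)] <- 1e-6
  mult
}

# Made-up sample values, not taken from the NOAA file.
dmg  <- c(25, 2.5, 1, 3, 10)
code <- c("K", "m", "B", "", "h")
dmg_millions <- dmg * exp_to_millions(code)
print(dmg_millions)
```

A lookup vector keeps the multipliers in one place, so the same helper could be applied to both the property and the crop columns.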
df$BGN_DATE <- as.POSIXct(df$BGN_DATE,format="%m/%d/%Y %H:%M:%S")
names(df) <- tolower(names(df))
fatals <- aggregate(fatalities ~ evtype, data = df, FUN = sum)
fatals <- fatals[order(fatals$fatalities, decreasing = T), ]
injuries <- aggregate(injuries ~ evtype, data = df, FUN = sum)
injuries <- injuries[order(injuries$injuries, decreasing = T), ]
## convert property and crop damage to millions of dollars
pd <- df$propdmg
pde <- df$propdmgexp
cd <- df$cropdmg
cde <- df$cropdmgexp
## property damage: B = billions, M = millions, K = thousands, H = hundreds
pd[pde %in% "B"] <- pd[pde %in% "B"] * 1000
pd[pde %in% c("M", "m")] <- pd[pde %in% c("M", "m")] * 1
pd[pde %in% c("K")] <- pd[pde %in% c("K")] * 0.001
pd[pde %in% c("H", "h")] <- pd[pde %in% c("H", "h")] * 1e-04
## any other exponent code: the value is taken as plain dollars
pd[!(pde %in% c("B", "M", "m", "K", "H", "h"))] <- pd[!(pde %in% c("B", "M", "m", "K", "H", "h"))] * 1e-06
## crop damage: same conversion, with no separate case for hundreds
cd[cde %in% "B"] <- cd[cde %in% "B"] * 1000
cd[cde %in% c("M", "m")] <- cd[cde %in% c("M", "m")] * 1
cd[cde %in% c("K", "k")] <- cd[cde %in% c("K", "k")] * 0.001
cd[!(cde %in% c("B", "M", "m", "K", "k"))] <- cd[!(cde %in% c("B", "M", "m", "K", "k"))] * 1e-06
## summarizes the data: total economic damage (crop + property) per event type
econdmg <- cd + pd
edt <- aggregate(econdmg ~ df$evtype, FUN = sum)
oedt <- edt[order(edt$econdmg, decreasing = T), ]
names(oedt)[1] <- "evtype"
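The aggregate-then-sort pattern used for fatalities, injuries, and economic damage can be tried on a small data frame first. This is only a sketch: the `toy` data frame and its values are made up for illustration and do not come from the NOAA file.

```r
# Made-up data for illustration only (not NOAA values).
toy <- data.frame(
  evtype     = c("TORNADO", "FLOOD", "TORNADO", "HAIL"),
  fatalities = c(1, 2, 3, 0),
  stringsAsFactors = FALSE
)

# Sum fatalities within each event type (formula interface of aggregate()).
totals <- aggregate(fatalities ~ evtype, data = toy, FUN = sum)

# Sort so the event type with the largest total comes first.
totals <- totals[order(totals$fatalities, decreasing = TRUE), ]
print(totals)
```

Taking the first rows of the sorted result, as in `totals[1:5, ]`, then gives the top event types for plotting.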
ggplot(fatals[1:5, ], aes(x=reorder(evtype, -fatalities), y= fatalities)) + geom_bar(stat = "identity", fill="red") + geom_text(aes(label=fatalities), size=3, vjust="inward") +
ylab("Fatalities") + xlab("Event Type") + ggtitle("Top Five Types of Events Causing Deaths Across the U.S")
ggplot(injuries[1:5, ], aes(x=reorder(evtype, -injuries), y= injuries)) + geom_bar(stat = "identity", fill="blue") + geom_text(aes(label=injuries), size=3, vjust="inward") +
ylab("Injuries") + xlab("Event Type") + ggtitle("Top Five Types of Events Causing Injuries Across the U.S")
ggplot(oedt[1:5, ], aes(x=reorder(evtype, -econdmg), y= econdmg)) + geom_bar(stat = "identity", fill="green") + ylab("Economic Damages (million dollars)") +
xlab("Event Type") + ggtitle("Top Five Types of Events Causing Economic Damages Across the U.S")
Tornadoes are the most harmful events for population health, with more than 90,000 people injured or killed.
Floods have the greatest economic consequences, with more than 150 billion dollars in damages.