The present study aims to characterize the effects of different weather events on people’s lives and property. For this purpose we use the U.S. National Oceanic and Atmospheric Administration’s (NOAA) Storm Database. This database records when and where each event occurred, its nature, and its impact in terms of population health and property damage. In this study we focus on answering the following two questions:
Which types of events are most harmful to population health across the United States?
Which types of events have the greatest economic consequences across the United States?
Download the database and load it into a data frame. To avoid altering the original data, the data is copied to a second data frame.
if(!file.exists("data")){
dir.create("data")
}
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2","./data/data.csv.bz2", mode="wb")
data <- read.csv("./data/data.csv.bz2")
dta <- data
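Because the compressed file is large, it can help to skip the download when a local copy already exists. The following is a minimal sketch under that assumption; `fetch_once` and its `downloader` argument are hypothetical names introduced here for illustration and are not part of the original analysis.

```r
# Hypothetical helper: download `url` to `dest` only if `dest` is missing.
# `downloader` defaults to utils::download.file and can be swapped for testing.
fetch_once <- function(url, dest, downloader = utils::download.file) {
  dest_dir <- dirname(dest)
  if (!dir.exists(dest_dir)) {
    dir.create(dest_dir, recursive = TRUE)
  }
  if (!file.exists(dest)) {
    downloader(url, dest, mode = "wb")
  }
  invisible(dest)
}

# Usage (commented out to avoid a network call here):
# fetch_once("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
#            "./data/data.csv.bz2")
# data <- read.csv("./data/data.csv.bz2")
```

With this pattern, re-running the report reuses the file already stored in the data directory.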
Data structure
dim(dta)
## [1] 902297 37
The available information consists of 37 columns and 902297 rows.
Available fields.
names(dta)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
To answer the first question, besides the event type (EVTYPE), we will use both fatalities (FATALITIES) and injuries (INJURIES) fields.
sapply(data[23:24], function(x) sum(is.na(x)))
## FATALITIES INJURIES
## 0 0
str(data[23:24])
## 'data.frame': 902297 obs. of 2 variables:
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
Both fields are numeric and have no missing values.
For the second question, besides the event type (EVTYPE), we will use the property damage (PROPDMG, PROPDMGEXP) and crop damage (CROPDMG, CROPDMGEXP) fields. Each damage value has a complementary unit field, which must be combined with it before any further calculation.
sapply(data[25:28], function(x) sum(is.na(x)))
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 0 0 0 0
str(data[25:28])
The unit fields contain the following categories:
unique(dta$PROPDMGEXP)
## [1] K M B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
unique(dta$CROPDMGEXP)
## [1] M K m B ? 0 k 2
## Levels: ? 0 2 B k K m M
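It is necessary to convert the unit codes (H, K, M, B) to their respective multipliers (100, 1000, 1000000, 1000000000). Lowercase codes are converted to uppercase first. Values that contain no digit (such as -, ? or +) are set to zero, while the numeric codes (0 to 8) are kept as they are and act directly as multipliers. Each damage value is then multiplied by its multiplier to obtain the damage in dollars.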
dta$PROPDMGEXP <- toupper(as.character(dta$PROPDMGEXP))
dta$CROPDMGEXP <- toupper(as.character(dta$CROPDMGEXP))
dta$PROPDMGEXP[dta$PROPDMGEXP=="H"] <-100
dta$PROPDMGEXP[dta$PROPDMGEXP=="K"] <-1000
dta$PROPDMGEXP[dta$PROPDMGEXP=="M"] <-1000000
dta$PROPDMGEXP[dta$PROPDMGEXP=="B"] <-1000000000
dta$PROPDMGEXP[-grep("[0-9]",dta$PROPDMGEXP)] <-0
dta$PROPDMGEXP <- as.numeric(dta$PROPDMGEXP)
dta$CROPDMGEXP[dta$CROPDMGEXP=="H"] <-100
dta$CROPDMGEXP[dta$CROPDMGEXP=="K"] <-1000
dta$CROPDMGEXP[dta$CROPDMGEXP=="M"] <-1000000
dta$CROPDMGEXP[dta$CROPDMGEXP=="B"] <-1000000000
dta$CROPDMGEXP[-grep("[0-9]",dta$CROPDMGEXP)] <-0
dta$CROPDMGEXP <- as.numeric(dta$CROPDMGEXP)
dta$PROPDMGTOT <- as.numeric(dta$PROPDMG)*dta$PROPDMGEXP
dta$CROPDMGTOT <- as.numeric(dta$CROPDMG)*dta$CROPDMGEXP
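The same conversion can also be written with a named lookup vector, which avoids repeating one assignment per unit code. This is only a sketch on made-up values: `unit_multiplier` and `exp_to_multiplier` are hypothetical names, and it assumes that every code outside H, K, M and B counts as zero. Unlike the code above, that assumption also sets the digit codes to zero.

```r
# Named lookup table: unit code -> multiplier (H, K, M, B as stated in the text)
unit_multiplier <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: convert a vector of unit codes to numeric multipliers
exp_to_multiplier <- function(codes) {
  codes <- toupper(as.character(codes))
  mult <- unname(unit_multiplier[codes])  # NA for anything not in the table
  mult[is.na(mult)] <- 0                  # assumption: unknown codes count as zero
  mult
}

# Toy example (made-up values, not taken from the Storm Database)
toy <- data.frame(DMG = c(25, 2.5, 10, 3), EXP = c("K", "m", "?", "B"))
toy$TOTAL <- toy$DMG * exp_to_multiplier(toy$EXP)
toy$TOTAL
```

Keeping the multipliers in a single table makes the mapping easy to review and to change in one place.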
Necessary libraries.
library(reshape2)
library(ggplot2)
To quantify the impact each event has on people’s health, we sum the number of fatalities and injuries caused by each event type across the United States.
type_tot <- aggregate(cbind(data$INJURIES,data$FATALITIES), list(Events = data$EVTYPE), FUN = sum)
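As an illustration of this aggregation step, the sketch below uses the formula interface of `aggregate` on a small made-up data frame. The formula interface keeps the original column names instead of producing V1 and V2; note that it drops rows with missing values by default, which would not matter here because neither field has any. The `toy` data and its values are invented for the example.

```r
# Toy data (made-up values) with the same column names as the Storm Database
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
  INJURIES   = c(10, 2, 5, 1, 0),
  FATALITIES = c(1, 0, 2, 3, 1)
)

# Sum injuries and fatalities per event type; output columns keep their names
toy_tot <- aggregate(cbind(INJURIES, FATALITIES) ~ EVTYPE, data = toy, FUN = sum)
toy_tot$total <- toy_tot$INJURIES + toy_tot$FATALITIES

# Order from most to least harmful
toy_tot[order(-toy_tot$total), ]
```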
Although a fatality and an injury differ in severity, this study adds them together to define the impact of an event, and takes as the most harmful events for population health those with the highest combined number of fatalities and injuries. A new column is created containing the total of injuries and fatalities. The event levels are reordered by this total so that the plot shows the events from most to least harmful.
type_tot$total <- type_tot$V1+type_tot$V2
type_tot$Events <- reorder(type_tot$Events, -type_tot$total)
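The effect of `reorder` on the factor levels can be seen on a small made-up example. `reorder` sorts the levels by increasing value of its second argument, so passing the negated total puts the event with the largest total first. The event names and totals below are invented for the illustration.

```r
events <- factor(c("FLOOD", "HEAT", "TORNADO"))
total  <- c(3, 4, 18)   # made-up totals

levels(events)                     # alphabetical: "FLOOD" "HEAT" "TORNADO"
ordered_events <- reorder(events, -total)
levels(ordered_events)             # largest total first: "TORNADO" "HEAT" "FLOOD"
```

ggplot2 draws a discrete axis in level order, which is why this reordering changes the order of the bars.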
The results are ordered by their impact on population health and the 10 most harmful events are selected.
names(type_tot) <- c("Event", "Injuries", "Fatalities", "total")
type_tot2 <- head(type_tot[order(-type_tot$total),1:3],10)
type_tot3 <- melt(type_tot2, id.var = c("Event"))
str(type_tot3)
## 'data.frame': 20 obs. of 3 variables:
## $ Event : Factor w/ 985 levels "TORNADO","EXCESSIVE HEAT",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ variable: Factor w/ 2 levels "Injuries","Fatalities": 1 1 1 1 1 1 1 1 1 1 ...
## $ value : num 91346 6525 6957 6789 5230 ...
names(type_tot3) <- c("Event", "Type", "Casualties")
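To make the reshaping step concrete, the sketch below builds the same long layout by hand in base R on a made-up two-row table. It is only meant to show what the wide-to-long conversion does to the rows; the `wide` and `long` objects and their values are invented for the example.

```r
# Toy wide table (made-up values): one row per event, one column per measure
wide <- data.frame(
  Event      = c("TORNADO", "FLOOD"),
  Injuries   = c(15, 2),
  Fatalities = c(3, 1)
)

# Long format: one row per (event, measure) pair, in the same row order that
# melt(wide, id.var = "Event") uses, with the columns already renamed
long <- data.frame(
  Event      = rep(wide$Event, times = 2),
  Type       = rep(c("Injuries", "Fatalities"), each = nrow(wide)),
  Casualties = c(wide$Injuries, wide$Fatalities)
)
long
```

The long layout is what allows `fill = Type` to split each bar into injuries and fatalities.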
The final chart shows the number of fatalities and injuries by the different weather events.
ggplot(data = type_tot3, aes(x = Event, y = Casualties, fill=Type)) +
geom_bar(stat = "identity")+
theme(axis.text.x = element_text(angle = 45, hjust = 1))+
ggtitle("Most Harmful Events to Population Health")
For both injuries and fatalities, this plot shows the tornado as the most harmful event for population health.
To quantify the impact each event has on the economy, we sum the property damage and the crop damage across the United States.
dmg_tot <- aggregate(cbind(dta$PROPDMGTOT ,dta$CROPDMGTOT), list(Events = data$EVTYPE), FUN = sum)
dmg_tot$total <- dmg_tot$V1+dmg_tot$V2
dmg_tot$Events <- reorder(dmg_tot$Events, -dmg_tot$total)
dmg_tot1 <- subset(dmg_tot, total>0)
names(dmg_tot1) <- c("Event", "Properties", "Crops", "total")
Selection of the 10 most destructive events.
dmg_tot2 <- head(dmg_tot1[order(-dmg_tot1$total),1:3],10)
dmg_tot3 <- melt(dmg_tot2, id.var = c("Event"))
str(dmg_tot3)
## 'data.frame': 20 obs. of 3 variables:
## $ Event : Factor w/ 985 levels "FLOOD","HURRICANE/TYPHOON",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ variable: Factor w/ 2 levels "Properties","Crops": 1 1 1 1 1 1 1 1 1 1 ...
## $ value : num 1.45e+11 6.93e+10 5.69e+10 4.33e+10 1.57e+10 ...
The final chart shows the damage to property and crops caused by the different weather events.
names(dmg_tot3) <- c("Event", "Damage", "Dollars")
ggplot(data = dmg_tot3, aes(x = Event, y = Dollars, fill=Damage)) +
geom_bar(stat = "identity")+
theme(axis.text.x = element_text(angle = 45, hjust = 1))+
ggtitle("Most Harmful Events to the Economy")
Unlike the case of population health, where the tornado ranks first, the flood is the most adverse weather event for the economy, followed by the hurricane/typhoon. As far as the economy is concerned, tornadoes mainly affect property, while events like droughts have a major effect on crops.