This report provides a brief analysis of the US NOAA storm database, focused on the events with the greatest impact on population health and the economy. The data cover the period 1950-2011 and are available at this link
The first step is to download the source file (if it is not already available) and to load the raw data into a data frame
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
setwd("C:/Users/agustin.izquierdo/Documents/R/coursera/Reproducible Research/week4")
# set locale to English
Sys.setlocale("LC_ALL","English")
## [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
file<- "repdata_data_StormData.csv"
if (!file.exists(file))
{
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",destfile=file)
}
rawdata<-read.csv(file, header=T, stringsAsFactors=F)
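# Note: the file exists at this point, so this condition is FALSE and
# unlink() is never called; the downloaded file is kept for later runs
if(!file.exists(file))
{
unlink(file)
}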
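read.csv() can read the bzip2-compressed download directly, because R file connections opened for reading detect the compression on their own. That is why the code above works even though the download is saved under a .csv name. A minimal sketch with a small made-up data frame (the temporary file and the toy values are illustrative assumptions, not part of the storm data):

```r
# Toy data frame standing in for the storm data (values are made up)
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"),
                  FATALITIES = c(2, 1),
                  stringsAsFactors = FALSE)

# Write it bzip2-compressed, but under a name ending in .csv
tmp <- tempfile(fileext = ".csv")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() opens the file in text mode and decompresses it transparently
toy_back <- read.csv(tmp, header = TRUE, stringsAsFactors = FALSE)
unlink(tmp)
```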
From the raw data we see that only some of the variables are needed for the analysis:
- EVTYPE: event type (type of storm)
- FATALITIES: number of deaths
- INJURIES: number of personal injuries
- PROPDMG: property damage (base)
- PROPDMGEXP: property damage (exponent)
- CROPDMG: crop damage (base)
- CROPDMGEXP: crop damage (exponent)
Therefore we create the base dataframe with the columns we need:
data<-rawdata[,c("EVTYPE", "FATALITIES", "INJURIES","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")]
Let’s run a sanity check for NAs in the numeric columns
# check for missing values
sum(is.na(data$FATALITIES),is.na(data$INJURIES),is.na(data$PROPDMG),is.na(data$CROPDMG))
## [1] 0
So there are no NAs.
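Summing all the is.na() vectors gives a single total. A per-column count makes it easier to see where any missing values would be. A minimal sketch on a made-up data frame (toy values, not taken from the storm data):

```r
# Toy data with a few missing values (made up for illustration)
toy <- data.frame(FATALITIES = c(0, 1, NA),
                  INJURIES   = c(2, 0, 5),
                  PROPDMG    = c(NA, NA, 1.5))

# One NA count per column instead of a single grand total
na_counts <- colSums(is.na(toy))
na_counts
```

On the report's data the equivalent call would be colSums(is.na(data[, c("FATALITIES", "INJURIES", "PROPDMG", "CROPDMG")])).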
Looking at the exponents of CROPDMG and PROPDMG, we see that a conversion is needed to obtain the actual amounts, according to the documentation found in this link
unique(data$PROPDMGEXP)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
unique(data$CROPDMGEXP)
## [1] "" "M" "K" "m" "B" "?" "0" "k" "2"
Therefore, we map K (or k) to 10^3, M (or m) to 10^6, B to 10^9 and H (or h) to 10^2. When the exponent is a digit n, the intended multiplier is 10^n.
We apply the exponent directly, so that the PROPDMG and CROPDMG values in the dataset contain the numeric amounts we need for the analysis
for (i in 1:length(data$PROPDMGEXP))
{
if (data$PROPDMGEXP[i]=='k' | data$PROPDMGEXP[i]=="K")
{data$PROPDMG[i]=data$PROPDMG[i]*10^3}
if (data$PROPDMGEXP[i]=='m' |data$PROPDMGEXP[i]=='M')
{data$PROPDMG[i]=data$PROPDMG[i]*10^6}
if (data$PROPDMGEXP[i]=='B')
{data$PROPDMG[i]=data$PROPDMG[i]*10^9}
if (data$PROPDMGEXP[i]=='h' | data$PROPDMGEXP[i]=='H')
{data$PROPDMG[i]=data$PROPDMG[i]*10^2}
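# Note: PROPDMGEXP is a character column, so is.numeric() is always FALSE here
# and digit exponents ("0" to "8") are left unscaled
if (is.numeric(data$PROPDMGEXP[i]))
{data$PROPDMG[i]=data$PROPDMG[i]*10^data$PROPDMGEXP[i]}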
}
for(i in 1:length(data$CROPDMGEXP))
{
if (data$CROPDMGEXP[i]=='m' |data$CROPDMGEXP[i]=='M')
{data$CROPDMG[i]=data$CROPDMG[i]*10^6}
if (data$CROPDMGEXP[i]=='k' | data$CROPDMGEXP[i]=="K")
{data$CROPDMG[i]=data$CROPDMG[i]*10^3}
if (data$CROPDMGEXP[i]=='B')
{data$CROPDMG[i]=data$CROPDMG[i]*10^9}
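# Note: CROPDMGEXP is a character column, so is.numeric() is always FALSE here
# and digit exponents ("0", "2") are left unscaled
if (is.numeric(data$CROPDMGEXP[i]))
{data$CROPDMG[i]=data$CROPDMG[i]*10^data$CROPDMGEXP[i]}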
}
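The row-by-row loops above can be slow on a table of this size. Because the exponent columns are character vectors, is.numeric() is also FALSE for every element. Below is a vectorized sketch of the mapping described above. It assumes that a digit exponent means 10^digit and that any other symbol leaves the value unscaled. `exp_multiplier` is a hypothetical helper name, the toy values are made up, and totals computed this way may differ slightly from the figures quoted in this report:

```r
# Hypothetical helper: map exponent codes to numeric multipliers
exp_multiplier <- function(code) {
  code <- toupper(code)
  mult <- rep(1, length(code))               # default: leave the value unscaled
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  known <- code %in% names(lookup)
  mult[known] <- unname(lookup[code[known]])
  digit <- grepl("^[0-9]$", code)            # single-digit exponents such as "2"
  mult[digit] <- 10^as.numeric(code[digit])  # assumption: digit n means 10^n
  mult
}

# Toy values (made up for illustration)
toy <- data.frame(PROPDMG    = c(2.5, 1, 3, 4, 7),
                  PROPDMGEXP = c("K", "m", "", "2", "?"),
                  stringsAsFactors = FALSE)
toy$PROPDMG_USD <- toy$PROPDMG * exp_multiplier(toy$PROPDMGEXP)
```

The same helper could be applied to CROPDMG and CROPDMGEXP, so both columns would share one mapping.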
We’ll focus our analysis only on the top 10 events (the most dangerous or the most economically damaging).
Let’s first aggregate fatalities and injuries by event type and select only the top 10 events. Each top-10 table is then re-sorted in ascending order, so the event with the largest total ends up in row 10
health_fatalities<-aggregate(FATALITIES ~ EVTYPE,data,sum)
top_health_effect_fatalities<-arrange(health_fatalities, desc(health_fatalities$FATALITIES))[1:10,]
top_health_effect_fatalities <- top_health_effect_fatalities[order(top_health_effect_fatalities$FATALITIES, decreasing = F), ]
health_injuries<-aggregate(INJURIES ~ EVTYPE,data,sum)
top_health_effect_injuries<-arrange(health_injuries, desc(health_injuries$INJURIES))[1:10,]
top_health_effect_injuries <- top_health_effect_injuries[order(top_health_effect_injuries$INJURIES, decreasing = F), ]
And plot the result:
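The plotting code is not shown in this report. Below is a minimal sketch of one way to draw such a chart with base R, assuming a horizontal bar plot. With horiz = TRUE, barplot() draws the first row at the bottom, so an ascending sort puts the largest bar at the top. The toy values are made up:

```r
# Made-up top-3 table, sorted ascending like the tables built above
toy_top <- data.frame(EVTYPE = c("FLOOD", "HEAT", "TORNADO"),
                      FATALITIES = c(10, 25, 60),
                      stringsAsFactors = FALSE)

op <- par(mar = c(4, 8, 2, 1))   # widen the left margin for the event names
bar_mid <- barplot(toy_top$FATALITIES,
                   names.arg = toy_top$EVTYPE,
                   horiz = TRUE, las = 1,
                   xlab = "Fatalities",
                   main = "Fatalities by event type (toy data)")
par(op)                          # restore the previous graphics settings
```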
From this we see that tornadoes are, by far, the most harmful event type: they caused 5633 fatalities and 91346 injuries over the period
top_health_effect_fatalities[10,]
## EVTYPE FATALITIES
## 1 TORNADO 5633
top_health_effect_injuries[10,]
## EVTYPE INJURIES
## 1 TORNADO 91346
Following the same rationale, we aggregate crop and property damage by event type and select only the top 10 events for each
eco_crop<-aggregate(CROPDMG~EVTYPE, data, sum)
top_eco_effect_crop<-arrange(eco_crop, desc(eco_crop$CROPDMG))[1:10,]
top_eco_effect_crop <- top_eco_effect_crop[order(top_eco_effect_crop$CROPDMG, decreasing = F), ]
eco_prop<-aggregate(PROPDMG ~ EVTYPE, data, sum)
top_eco_effect_prop<-arrange(eco_prop, desc(eco_prop$PROPDMG))[1:10,]
top_eco_effect_prop <- top_eco_effect_prop[order(top_eco_effect_prop$PROPDMG, decreasing = F), ]
And plot the result:
From this we see that floods are the event type with the greatest overall economic impact (150,319,678,257 USD over the period). Droughts cause more crop damage than floods, but floods still account for the largest total economic damage.
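The overall figure combines property and crop damage, a step that is not shown in the code above. Below is a minimal sketch of how such a combined total per event type could be computed, assuming it is simply PROPDMG + CROPDMG after the exponent conversion. `TOTALDMG` is a hypothetical column name and the values are made up:

```r
# Toy damage table in USD (made up for illustration)
toy <- data.frame(EVTYPE  = c("FLOOD", "DROUGHT", "FLOOD", "DROUGHT"),
                  PROPDMG = c(5e6, 1e5, 2e6, 0),
                  CROPDMG = c(1e5, 3e6, 0, 2e6),
                  stringsAsFactors = FALSE)

# Combined damage per record, then summed by event type
toy$TOTALDMG <- toy$PROPDMG + toy$CROPDMG
eco_total <- aggregate(TOTALDMG ~ EVTYPE, toy, sum)
eco_total <- eco_total[order(eco_total$TOTALDMG, decreasing = TRUE), ]
```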
From the data analyzed we can conclude that, over the period analyzed and across the US:
- Tornadoes are the most harmful storm event, with 96979 fatalities and injuries combined
- Floods are the storm event with the biggest economic impact, 150,319,678,257 USD