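The basic goal of this assignment is to explore the NOAA Storm Database to answer two questions: which types of events have the greatest economic consequences, and which have the greatest consequences for population health?
First of all, we read the unzipped dataset in CSV format from the working directory.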
storm<-read.csv("repdata_data_StormData.csv",stringsAsFactors = TRUE)
The dataset has 37 variables, as shown below.
names(storm)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
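Some variables are unnecessary for the purpose of this analysis. We keep only the event type, the fatality and injury counts, and the property and crop damage values with their exponent codes, by subsetting the initial dataset: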
storm_ext<-storm[,c("EVTYPE", "FATALITIES", "INJURIES","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")]
There are 985 different types of event.
length(unique(storm_ext$EVTYPE))
## [1] 985
Focusing on the attributes PROPDMGEXP and CROPDMGEXP we observe the distinct values:
unique(storm_ext$PROPDMGEXP)
## [1] K M B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
unique(storm_ext$CROPDMGEXP)
## [1] M K m B ? 0 k 2
## Levels: ? 0 2 B k K m M
We assume that ‘K’ and ‘k’ multiply the corresponding PROPDMG/CROPDMG value by 1,000, ‘M’ and ‘m’ by 1,000,000, and ‘B’ and ‘b’ by 1,000,000,000; the code below also treats ‘H’ and ‘h’ as a multiplier of 100. The other codes are unintelligible, but only very few observations pair one of these unknown codes with a PROPDMG/CROPDMG value greater than zero.
sum(!storm_ext$PROPDMGEXP %in% c("B","b", "H","h", "K","k", "M","m") & storm_ext$PROPDMG>0)
## [1] 320
sum(!storm_ext$CROPDMGEXP %in% c("B","b", "H","h", "K","k", "M","m") & storm_ext$CROPDMG>0)
## [1] 15
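To see how the counting expression above behaves, here is a minimal sketch on a made-up five-row data frame (the `toy` data and the `known` vector are illustrative and not part of the dataset). In R, `!` binds more loosely than `%in%` but more tightly than `&`, so the condition reads as "code not recognised and damage positive".

```r
# Toy data, invented for illustration only
toy <- data.frame(
  PROPDMG    = c(25, 0, 3, 10, 0),
  PROPDMGEXP = c("K", "", "+", "5", "?"),
  stringsAsFactors = TRUE
)
known <- c("B", "b", "H", "h", "K", "k", "M", "m")

# TRUE only where the code is unrecognised AND the damage is positive
unknown_with_damage <- !toy$PROPDMGEXP %in% known & toy$PROPDMG > 0
n_unknown <- sum(unknown_with_damage)
print(n_unknown)
```

Rows with an unrecognised code but zero damage are not counted, which is why they can safely stay in the dataset.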
We can therefore delete these records from the dataset and add two columns, pmul and cmul, holding the property and crop damage values multiplied by the corresponding unit.
storm_ext_clean<-storm_ext[(storm_ext$PROPDMGEXP %in% c("B","b", "H","h", "K","k", "M","m") | storm_ext$PROPDMG==0)&(storm_ext$CROPDMGEXP %in% c("B","b", "H","h", "K","k", "M","m") | storm_ext$CROPDMG==0),]
size<-nrow(storm_ext_clean)
pmul<-vector(mode="numeric", length=size)
cmul<-vector(mode="numeric", length=size)
for(i in 1:size)
{
  if(storm_ext_clean$PROPDMGEXP[i] %in% c('H','h')){
    pmul[i]=100*storm_ext_clean$PROPDMG[i]}
  else if(storm_ext_clean$PROPDMGEXP[i] %in% c('K','k')){
    pmul[i]=1000*storm_ext_clean$PROPDMG[i]}
  else if(storm_ext_clean$PROPDMGEXP[i] %in% c('M','m')){
    pmul[i]=1000000*storm_ext_clean$PROPDMG[i]}
  else if(storm_ext_clean$PROPDMGEXP[i] %in% c('B','b')){
    pmul[i]=1000000000*storm_ext_clean$PROPDMG[i]}
  else{pmul[i]=0}
  if(storm_ext_clean$CROPDMGEXP[i] %in% c('H','h')){
    cmul[i]=100*storm_ext_clean$CROPDMG[i]}
  else if(storm_ext_clean$CROPDMGEXP[i] %in% c('K','k')){
    cmul[i]=1000*storm_ext_clean$CROPDMG[i]}
  else if(storm_ext_clean$CROPDMGEXP[i] %in% c('M','m')){
    cmul[i]=1000000*storm_ext_clean$CROPDMG[i]}
  else if(storm_ext_clean$CROPDMGEXP[i] %in% c('B','b')){
    cmul[i]=1000000000*storm_ext_clean$CROPDMG[i]}
  else{cmul[i]=0}
}
storm_ext_clean_mul<-cbind(storm_ext_clean,pmul,cmul)
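The loop above visits every row one at a time, which can be slow on a dataset of this size. As a rough alternative, the same conversion could be sketched in vectorised form with a named lookup vector. The names `mult` and `to_dollars` and the toy data are hypothetical and introduced here only for illustration; unrecognised codes map to 0, mirroring the loop's final `else` branch.

```r
# Multipliers keyed by upper-case exponent code (same assumptions as the loop)
mult <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: convert damage values using their exponent codes
to_dollars <- function(dmg, exp_code) {
  m <- mult[toupper(as.character(exp_code))]  # NA for unrecognised codes
  m[is.na(m)] <- 0                            # mirror the loop's else branch
  unname(m * dmg)
}

# Toy data, invented for illustration only
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1, 0, 3),
  PROPDMGEXP = c("K", "m", "B", "", "h"),
  stringsAsFactors = TRUE
)
toy$pmul <- to_dollars(toy$PROPDMG, toy$PROPDMGEXP)
print(toy)
```

Applied to the real data, the same helper would be called once for PROPDMG/PROPDMGEXP and once for CROPDMG/CROPDMGEXP.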
Finally, we can view the results after loading the dplyr and ggplot2 libraries.
library(dplyr)
library(ggplot2)
The top 10 events with the greatest economic consequences are shown in the figure below.
eco_cons<-storm_ext_clean_mul %>% group_by(EVTYPE) %>% summarise(totalsum = sum(pmul+cmul)) %>% top_n(n = 10, wt = totalsum) %>% arrange(desc(totalsum/1000000))
bp<- ggplot(eco_cons, aes(x=EVTYPE, y=totalsum/1000000,fill=EVTYPE))+
geom_bar(width = 1, stat = "identity")+
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank())+labs(title = "Economic Consequences", x = "Event Type", y = "Damage (million dollars)")
bp
The top 10 events with the greatest health consequences are shown in the figure below.
health_cons<-storm_ext_clean_mul %>% group_by(EVTYPE) %>% summarise(totalsum = sum(FATALITIES+INJURIES)) %>% top_n(n = 10, wt = totalsum) %>% arrange(desc(totalsum))
bp<- ggplot(health_cons, aes(x=EVTYPE, y=totalsum,fill=EVTYPE))+
geom_bar(width = 1, stat = "identity")+
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank())+labs(title = "Health Consequences", x = "Event Type", y = "Number of injuries and fatalities")
bp
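In both figures, `arrange()` sorts the summary table, but ggplot2 places a discrete x variable according to its factor levels, so the bars are not necessarily drawn from largest to smallest. One possible way to order them by total, sketched here on invented data, is to re-level the factor with `reorder()` before plotting. The `toy` table and the `bp_sorted` name are assumptions made for this example.

```r
# Toy summary table, invented for illustration only
toy <- data.frame(
  EVTYPE   = factor(c("FLOOD", "HAIL", "TORNADO")),
  totalsum = c(150, 20, 900)
)

# Re-level the factor so that levels follow decreasing totalsum
toy$EVTYPE <- reorder(toy$EVTYPE, -toy$totalsum)
print(levels(toy$EVTYPE))

# Plot only if ggplot2 is installed; bars then follow the new level order
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  bp_sorted <- ggplot(toy, aes(x = EVTYPE, y = totalsum, fill = EVTYPE)) +
    geom_bar(width = 1, stat = "identity") +
    labs(x = "Event Type", y = "Total")
  print(bp_sorted)
}
```

The legend follows the same level order, so it would also list the events from largest to smallest.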