Synopsis

Severe weather events can have catastrophic consequences for the economy and public health of communities and countries. Preventing these effects, or at least reducing them, is a key concern. The U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database records the occurrence of storms and other significant weather phenomena intense enough to cause damage in the United States. This information is collected from a variety of sources around the country.

In the present project, we use the NOAA database to help prevent the catastrophic consequences of these events. We address the following two questions:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
  2. Across the United States, which types of events have the greatest economic consequences?

The project contains two sections: 1) a data processing part, where we describe the dataset and carry out a preliminary exploratory data analysis; 2) a results part, where we present the features most relevant to anticipating the impact of these weather events. We focus on the event types that cause the most damage to the economy and to public health.

Our study shows that, for population health, tornado, flood and heat are the main concerns. We therefore recommend developing a preventive plan to help people protect themselves in case of tornado, flood and excessive heat. These events have been the main causes of fatalities and injuries in the United States during the last 60 years. The events with the greatest economic consequences, mainly through property damage, are flood, hurricane and tornado, so it is important to pay more attention to protecting property against these events. For the particular case of crop damage, drought and flood are the main concerns. We recommend developing programs to mitigate the effect of drought on crop production. Flood is hard to predict; drought, however, can be mitigated with long-term programs that provide water to keep crop production at a good rate.

Loading and Processing the Data

First, we need to understand the NOAA dataset. As a first step we load only the first rows, together with the column names, to identify the information most relevant to our problem. Since the dataset is very large, we use this step to decide which columns to filter out. Note that the dataset used in this project is not the most recent data available on the NOAA webpage. For more details about the dataset, see http://www.ncdc.noaa.gov/stormevents/

options(scipen = 1, digits = 5)
# Download the compressed file only if it is not already there
if (!file.exists("repdata_data_StormData.csv.bz2")) {
    download.file(url = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
                  method = "curl", destfile = "repdata_data_StormData.csv.bz2", quiet = TRUE)
}
# Reading only the first two rows of the raw csv file to inspect the columns
data <- read.csv("./repdata_data_StormData.csv.bz2", header = TRUE, stringsAsFactors = FALSE, strip.white = TRUE, na.strings = c("NA", ""), nrows = 2)
str(data)
## 'data.frame':    2 obs. of  37 variables:
##  $ STATE__   : num  1 1
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00"
##  $ BGN_TIME  : int  130 145
##  $ TIME_ZONE : chr  "CST" "CST"
##  $ COUNTY    : num  97 3
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN"
##  $ STATE     : chr  "AL" "AL"
##  $ EVTYPE    : chr  "TORNADO" "TORNADO"
##  $ BGN_RANGE : num  0 0
##  $ BGN_AZI   : logi  NA NA
##  $ BGN_LOCATI: logi  NA NA
##  $ END_DATE  : logi  NA NA
##  $ END_TIME  : logi  NA NA
##  $ COUNTY_END: num  0 0
##  $ COUNTYENDN: logi  NA NA
##  $ END_RANGE : num  0 0
##  $ END_AZI   : logi  NA NA
##  $ END_LOCATI: logi  NA NA
##  $ LENGTH    : num  14 2
##  $ WIDTH     : num  100 150
##  $ F         : int  3 2
##  $ MAG       : num  0 0
##  $ FATALITIES: num  0 0
##  $ INJURIES  : num  15 0
##  $ PROPDMG   : num  25 2.5
##  $ PROPDMGEXP: chr  "K" "K"
##  $ CROPDMG   : num  0 0
##  $ CROPDMGEXP: logi  NA NA
##  $ WFO       : logi  NA NA
##  $ STATEOFFIC: logi  NA NA
##  $ ZONENAMES : logi  NA NA
##  $ LATITUDE  : num  3040 3042
##  $ LONGITUDE : num  8812 8755
##  $ LATITUDE_E: num  3051 0
##  $ LONGITUDE_: num  8806 0
##  $ REMARKS   : logi  NA NA
##  $ REFNUM    : num  1 2
info <- dim(data)

The dataset contains 37 columns, so we keep only the variables relevant to our problem. To do so, we first need to identify all variables and look for those related to our project. Since we do not have a codebook for this dataset, we rely on the detailed description of the fields/columns at the following link: http://www.ncdc.noaa.gov/stormevents/ftp.jsp. Below we describe the variables we are going to use in our project; the other columns are dropped while reading the raw data.

  1. BGN_DATE: Begin date of the event. Format: character string
  2. STATE: State name where the event occurred. Format: character string
  3. EVTYPE: Type of event. Format: character string
  4. FATALITIES: Number of fatalities in the event. Format: numeric
  5. INJURIES: Number of injuries in the event. Format: numeric
  6. PROPDMG: The estimated amount of damage to property incurred by the weather event. The units are not clear: in the documentation of the newer dataset the letters K or M are appended to the amount (for example, for $10,000 or $10,000,000). Here we do not know the units. The other variable that can be related with its unit is the variable …. Format: numeric
  7. PROPDMGEXP: This variable is not well documented; from its name we assume that it contains the exponent (unit) of PROPDMG, so we will combine it with PROPDMG later. Format: character string
  8. CROPDMG: The estimated amount of damage to crops incurred by the weather event. Again, we do not know the units; as in the previous case, we infer them from the variable CROPDMGEXP. Format: numeric
  9. CROPDMGEXP: This variable is not well documented; from its name we assume that it contains the exponent (unit) of CROPDMG, so we will combine it with CROPDMG later. Format: character string

Now, we read the raw data with the relevant columns:

# Reading the raw csv file, keeping only the nine relevant columns
# ("NULL" drops a column, NA lets read.csv choose its type)
data <- read.csv("./repdata_data_StormData.csv.bz2", colClasses = c("NULL", NA, rep("NULL", 4), NA, NA, rep("NULL", 14), rep(NA, 6), rep("NULL", 9)), header = TRUE, stringsAsFactors = FALSE, strip.white = TRUE, na.strings = c("NA", ""), nrows = 902297)
str(data)
## 'data.frame':    902297 obs. of  9 variables:
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  NA NA NA NA ...
info <- dim(data)
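
The positional colClasses vector above depends on the column order of the file. As a hedged alternative, the same selection can be sketched by column name. The inline CSV text and the variable names csv_text, keep, header and classes are illustrative assumptions, not part of the original analysis.

```r
# Toy CSV standing in for the storm file (hypothetical data, 5 columns)
csv_text <- "STATE__,BGN_DATE,COUNTY,EVTYPE,FATALITIES
1,4/18/1950 0:00:00,97,TORNADO,0
1,4/18/1950 0:00:00,3,TORNADO,1"

# Columns we want to keep, by name
keep <- c("BGN_DATE", "EVTYPE", "FATALITIES")

# Read only the first row to learn the column names
header <- names(read.csv(text = csv_text, nrows = 1))

# "NULL" drops a column; NA lets read.csv guess its type
classes <- ifelse(header %in% keep, NA, "NULL")

small <- read.csv(text = csv_text, colClasses = classes,
                  stringsAsFactors = FALSE)
names(small)
```

With the real file, csv_text would be replaced by the path to the bz2 file in both read.csv calls.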

The dataset contains 9 columns and 902297 rows. The variables FATALITIES and INJURIES are relevant to public health, while PROPDMG and CROPDMG show the consequences of these events for the economy. The other variables may be useful for plotting the data (such as FATALITIES vs. date) or for transforming variables. Since we are studying the data with respect to public health and the economy, we keep only the records where at least one of these four variables is different from zero. In other words, we remove the rows where all of these variables are zero.

First, we check how many rows contain only zeros for these variables:

dim(data[which(data$FATALITIES == 0 & data$INJURIES == 0 & data$PROPDMG ==0 & data$CROPDMG ==0   )  ,])
## [1] 647664      9

We need to remove them from our dataset. The rows we want to keep are:

dim(data[which(!(data$FATALITIES == 0 & data$INJURIES == 0 & data$PROPDMG ==0 & data$CROPDMG ==0)   )  ,])
## [1] 254633      9

We then end up with a smaller, more relevant dataset:

dataDMG <- data[which(!(data$FATALITIES == 0 & data$INJURIES == 0 & data$PROPDMG ==0 & data$CROPDMG ==0)   )  ,]
rm(data)
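
The same row filter can be sketched more compactly with rowSums. This is only an illustration on a toy data frame with hypothetical values; the names toy, impact, has_impact and toyDMG are assumptions, and the sketch assumes the four impact variables contain no missing values.

```r
# Toy data frame (hypothetical values) with the four impact variables
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HAIL", "FLOOD", "HEAT"),
  FATALITIES = c(0, 0, 2, 0),
  INJURIES   = c(15, 0, 0, 0),
  PROPDMG    = c(25, 0, 0, 0),
  CROPDMG    = c(0, 0, 1.5, 0),
  stringsAsFactors = FALSE
)

impact <- c("FATALITIES", "INJURIES", "PROPDMG", "CROPDMG")

# TRUE when at least one impact variable is non-zero (assumes no NA values)
has_impact <- rowSums(toy[, impact] != 0) > 0

# Keep only the rows with some recorded impact
toyDMG <- toy[has_impact, ]
toyDMG$EVTYPE
```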

The date variable BGN_DATE needs to be converted from its original character format to the Date format of R.

dataDMG$BGN_DATE <-as.Date(dataDMG$BGN_DATE, format="%m/%d/%Y")
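
As a small illustration of the conversion above, the sketch below applies the same format string to two raw strings copied from the BGN_DATE output shown earlier. The variable names raw and d are assumptions.

```r
# Raw date strings in the format used by BGN_DATE
raw <- c("4/18/1950 0:00:00", "2/20/1951 0:00:00")

# The format only describes month/day/year; the trailing time is ignored
d <- as.Date(raw, format = "%m/%d/%Y")

class(d)          # a Date vector
format(d, "%Y")   # the year as a character string
```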

The variables PROPDMGEXP and CROPDMGEXP are related to the units of the variables PROPDMG and CROPDMG. They contain the following symbols:

unique(dataDMG$PROPDMGEXP)
##  [1] "K" "M" NA  "B" "m" "+" "0" "5" "6" "4" "h" "2" "7" "3" "H" "-"
unique(dataDMG$CROPDMGEXP)
## [1] NA  "M" "K" "m" "B" "?" "0" "k"

After reading the documentation of the most recent dataset (see link above), we consider it reasonable to assume that B or b means billions, M or m millions, K or k thousands, and H or h hundreds of dollars of damage to property and crops. We therefore need to modify the corresponding variables PROPDMG and CROPDMG accordingly.

library(plyr)
ddply(dataDMG,.(PROPDMGEXP), summarize, length(PROPDMG))
##    PROPDMGEXP    ..1
## 1           -      1
## 2           +      5
## 3           0    210
## 4           2      1
## 5           3      1
## 6           4      4
## 7           5     18
## 8           6      3
## 9           7      3
## 10          B     40
## 11          h      1
## 12          H      6
## 13          K 231428
## 14          m      7
## 15          M  11320
## 16       <NA>  11585
ddply(dataDMG,.(CROPDMGEXP), summarize, length(CROPDMG))
##   CROPDMGEXP    ..1
## 1          ?      6
## 2          0     17
## 3          B      7
## 4          k     21
## 5          K  99932
## 6          m      1
## 7          M   1985
## 8       <NA> 152664
# Percentage of rows whose unit symbol is neither a recognised letter nor missing
removed_valuesPROP <- mean(dataDMG$PROPDMGEXP %in% c("-", "+", "0", "2", "3", "4", "5", "6", "7")) * 100
removed_valuesCROP <- mean(dataDMG$CROPDMGEXP %in% c("?", "0")) * 100

There are no missing values in the property damage and crop damage variables. In the variables holding their units there are many missing values (see results above); these do not modify the original amounts because they were blank in the original dataset. The other symbols (digits, +, - and ?) that appear in these variables are rare, and we decided to remove those rows from the dataset. For the property damage variable these rows are only 0.09661 % of the dataset, and for the crop damage variable only 0.00903 % of the rows.

We transform the units of the property and crop damage using the variables studied above. First, we replace the missing values with the placeholder “NO” to avoid problems when transforming the variables. We also add a new variable called year for easier handling in the plots later.

dataDMG$PROPDMGEXP <-  ifelse(is.na(dataDMG$PROPDMGEXP), "NO", dataDMG$PROPDMGEXP)
dataDMG$CROPDMGEXP <-  ifelse(is.na(dataDMG$CROPDMGEXP), "NO", dataDMG$CROPDMGEXP)
# Drop the rows whose unit symbol is a digit, +, - or ?
dataDMG <- dataDMG[!(dataDMG$CROPDMGEXP %in% c("?", "0")) & !(dataDMG$PROPDMGEXP %in% c("-", "+", "0", "2", "3", "4", "5", "6", "7")), ]

dataDMG$PROPDMG <- ifelse(dataDMG$PROPDMG> 0 & dataDMG$PROPDMGEXP =="M" ,1000000.0*dataDMG$PROPDMG ,dataDMG$PROPDMG)
dataDMG$PROPDMG <- ifelse(dataDMG$PROPDMG > 0 & dataDMG$PROPDMGEXP =="m" ,1000000.0*dataDMG$PROPDMG ,dataDMG$PROPDMG)
dataDMG$PROPDMG <- ifelse(dataDMG$PROPDMG > 0 & dataDMG$PROPDMGEXP =="B" ,1000000000.0*dataDMG$PROPDMG ,dataDMG$PROPDMG)
dataDMG$PROPDMG <- ifelse(dataDMG$PROPDMG > 0 & dataDMG$PROPDMGEXP =="H" ,100.*dataDMG$PROPDMG ,dataDMG$PROPDMG)
dataDMG$PROPDMG <- ifelse(dataDMG$PROPDMG > 0 & dataDMG$PROPDMGEXP =="h" ,100.*dataDMG$PROPDMG ,dataDMG$PROPDMG)
dataDMG$PROPDMG <- ifelse(dataDMG$PROPDMG > 0 & dataDMG$PROPDMGEXP =="K" ,1000.*dataDMG$PROPDMG ,dataDMG$PROPDMG)
dataDMG$PROPDMG <- ifelse(dataDMG$PROPDMG > 0 & dataDMG$PROPDMGEXP =="k" ,1000.*dataDMG$PROPDMG ,dataDMG$PROPDMG)
dataDMG$CROPDMG <- ifelse(dataDMG$CROPDMG> 0 & dataDMG$CROPDMGEXP =="M" ,1000000.0*dataDMG$CROPDMG ,dataDMG$CROPDMG)
dataDMG$CROPDMG <- ifelse(dataDMG$CROPDMG > 0 & dataDMG$CROPDMGEXP =="m" ,1000000.0*dataDMG$CROPDMG ,dataDMG$CROPDMG)
dataDMG$CROPDMG <- ifelse(dataDMG$CROPDMG > 0 & dataDMG$CROPDMGEXP =="B" ,1000000000.0*dataDMG$CROPDMG ,dataDMG$CROPDMG)
dataDMG$CROPDMG <- ifelse(dataDMG$CROPDMG > 0 & dataDMG$CROPDMGEXP =="H" ,100.*dataDMG$CROPDMG ,dataDMG$CROPDMG)
dataDMG$CROPDMG <- ifelse(dataDMG$CROPDMG > 0 & dataDMG$CROPDMGEXP =="h" ,100.*dataDMG$CROPDMG ,dataDMG$CROPDMG)
dataDMG$CROPDMG <- ifelse(dataDMG$CROPDMG > 0 & dataDMG$CROPDMGEXP =="K" ,1000.*dataDMG$CROPDMG ,dataDMG$CROPDMG)
dataDMG$CROPDMG <- ifelse(dataDMG$CROPDMG > 0 & dataDMG$CROPDMGEXP =="k" ,1000.*dataDMG$CROPDMG ,dataDMG$CROPDMG)
dataDMG$year <- as.factor(strftime(dataDMG$BGN_DATE, "%Y"))
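
The chain of ifelse calls above applies one multiplier per unit symbol. A more compact way to express the same idea is a named lookup vector, sketched here on toy values. The names unit_mult, toy, DMG, DMGEXP and DMG_USD are hypothetical, and the numbers are not taken from the storm data.

```r
# Multiplier for each unit symbol; "NO" is the placeholder for a missing unit
unit_mult <- c(H = 1e2, h = 1e2, K = 1e3, k = 1e3,
               M = 1e6, m = 1e6, B = 1e9, NO = 1)

# Toy values (hypothetical), not taken from the storm data
toy <- data.frame(DMG    = c(25, 2.5, 1, 3),
                  DMGEXP = c("K", "M", "B", "NO"),
                  stringsAsFactors = FALSE)

# Index the lookup vector by symbol, then drop the names
toy$DMG_USD <- toy$DMG * unname(unit_mult[toy$DMGEXP])
toy$DMG_USD
```

A symbol that is absent from the lookup vector would give NA here, so this sketch only makes sense after the rows with unrecognised symbols have been removed.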

Results

First, we address the question: Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health? We first take FATALITIES as the variable that measures how harmful a particular weather event is to population health (later we will use the variable INJURIES too). Since the variable “EVTYPE” contains 484 unique events, a plot of total fatalities per event helps to decide which events are most harmful. In Figure 1 we plot the total fatalities over all years against the index of the event type to visualize the most catastrophic events.

library(plyr)
Ftype <-  ddply(dataDMG, .(EVTYPE), summarise, total = sum(FATALITIES))
plot(Ftype$total,xlab =" index", ylab='Total of fatalities',main= "Figure 1")

From this plot we observe that only a few events account for a large number of fatalities. These are very easy to identify; let us order the events and see which ones they are:

ordertotal <- Ftype[order(-Ftype$total),] 
head(ordertotal,10)
##             EVTYPE total
## 402        TORNADO  5630
## 60  EXCESSIVE HEAT  1903
## 72     FLASH FLOOD   978
## 148           HEAT   937
## 253      LIGHTNING   816
## 418      TSTM WIND   504
## 84           FLOOD   470
## 301    RIP CURRENT   368
## 196      HIGH WIND   246
## 11       AVALANCHE   224
FatPerc <- sum(ordertotal[1:10,2])/sum(ordertotal[,2])*100.
FatTornado <-  sum(ordertotal[1:1,2])/sum(ordertotal[,2])*100.
FatExcHeat <- sum(ordertotal[1:2,2])/sum(ordertotal[,2])*100.

The list above shows the 10 most important events in terms of fatalities. These 10 events are responsible for 79.78857% of the fatalities in the United States from 1950 to 2011. In fact, the first one, tornado, is by itself responsible for 37.19855% of all fatalities. From these results, we can say that the most relevant events with respect to fatalities are TORNADO and EXCESSIVE HEAT, which together account for 49.77205% of all fatalities.
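
The percentages quoted above are all of the same kind: the share of the grand total held by the n largest event types. A minimal sketch of that calculation follows; top_share is a hypothetical helper name and the totals are toy values, not the storm data.

```r
# Share (in percent) of the grand total held by the n largest entries.
# top_share is a hypothetical helper name.
top_share <- function(totals, n) {
  sorted <- sort(totals, decreasing = TRUE)
  sum(head(sorted, n)) / sum(totals) * 100
}

# Toy totals per event type (hypothetical values)
totals <- c(A = 50, B = 30, C = 15, D = 5)

top_share(totals, 1)   # share of the largest entry
top_share(totals, 2)   # share of the two largest entries together
```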

In the top plot of Figure 2 we show FATALITIES for every year, to see whether these effects are due to a big catastrophe in a particular year or to events that happen every year. We plot the total fatalities for all events together with those of particular events. Here we took the first events in the list above, namely TORNADO, HEAT and FLOOD. Note that we group all event types containing the word HEAT, and likewise FLOOD, as one event for a better understanding of heat and flood effects.

year_Fat_allevent <- ddply(dataDMG, .(year), summarise, count = sum(FATALITIES))
year_Fat_tornado <- ddply(dataDMG[dataDMG$EVTYPE == "TORNADO",], .(year), summarise, sum = sum(FATALITIES))
year_Fat_FLOOD_all <- ddply(dataDMG[grep("FLOOD", dataDMG$EVTYPE),], .(year), summarise, sum = sum(FATALITIES))
year_Fat_heat_all <- ddply(dataDMG[grep("HEAT", dataDMG$EVTYPE),], .(year), summarise, sum = sum(FATALITIES))
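
The grouping by word used above relies on pattern matching over the event labels. The sketch below shows, on hypothetical labels, which entries such a pattern picks up. The matching is case sensitive by default; whether mixed-case labels should also be included is a choice the analysis above does not discuss, so the second call is only an assumption about one possible variant.

```r
# Hypothetical event labels
ev <- c("HEAT", "EXCESSIVE HEAT", "FLASH FLOOD", "FLOOD", "Heat Wave", "TORNADO")

# fixed = TRUE matches the literal text; the match is case sensitive
heat_exact <- ev[grepl("HEAT", ev, fixed = TRUE)]

# ignore.case = TRUE also catches mixed-case labels such as "Heat Wave"
heat_any <- ev[grepl("heat", ev, ignore.case = TRUE)]

# Every label containing FLOOD, including FLASH FLOOD
flood_all <- ev[grepl("FLOOD", ev, fixed = TRUE)]

heat_exact
heat_any
flood_all
```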

In this plot, the small black rectangles show the total fatalities for each year, the red circles are the fatalities due to tornado, the green circles are fatalities from events related to heat and the blue circles are fatalities from events related to flood. From this plot we conclude that for the years 1950-1992 tornado was the most relevant event for population health; after 1992 other weather events, such as flood and heat, started to appear. Another important point is that fewer events are recorded for earlier years, most likely due to a lack of good records. So, in principle, we do not know whether flood and heat were relevant in earlier years; nevertheless, following our observations we conclude that nowadays it is important to pay more attention to protecting the population against tornado, flood and heat. We can also conclude that these weather events usually happen every year. The increase in fatalities due to heat may also be related to the increase in the earth's temperature in recent years, i.e. global warming. Again, we recommend improving protection against heat, mainly during the summer season when heat peaks. We do not study heat by season in this project; however, it seems a necessary point to study in the future.

Now we investigate injuries, following the same analysis as for fatalities. First, we aggregate the data by event type:

FtypeIJ <-  ddply(dataDMG, .(EVTYPE), summarise, total = sum(INJURIES))
ordertotalIJ <- FtypeIJ[order(-FtypeIJ$total),] 
InjPer <-  sum(ordertotalIJ[1:10,2])/sum(ordertotalIJ[,2])*100.
InjTornado <-  sum(ordertotalIJ[1:1,2])/sum(ordertotalIJ[,2])*100.
head(ordertotalIJ,10)
##                EVTYPE total
## 402           TORNADO 91321
## 418         TSTM WIND  6957
## 84              FLOOD  6789
## 60     EXCESSIVE HEAT  6525
## 253         LIGHTNING  5230
## 148              HEAT  2100
## 233         ICE STORM  1975
## 72        FLASH FLOOD  1777
## 359 THUNDERSTORM WIND  1488
## 131              HAIL  1360

The list above shows the 10 most important events in terms of injuries. These 10 events are responsible for 89.35604% of the injuries in the United States from 1950 to 2011. In fact, the first one, tornado, is by itself responsible for 65.00918% of all injuries. So, tornado is the most relevant weather event with respect to population health. However, it is also important to pay attention to other events such as flood, heat, wind and lightning.

year_Fat_alleventIJ <- ddply(dataDMG, .(year), summarise, mean.count = sum(INJURIES))
year_Fat_tornadoIJ <- ddply(dataDMG[dataDMG$EVTYPE == "TORNADO",], .(year), summarise, sum = sum(INJURIES))
year_Fat_FLOOD_allIJ <- ddply(dataDMG[grep("FLOOD", dataDMG$EVTYPE),], .(year), summarise, sum = sum(INJURIES))
year_Fat_heat_allIJ <- ddply(dataDMG[grep("HEAT", dataDMG$EVTYPE),], .(year), summarise, sum = sum(INJURIES))
year_Fat_WIND_allIJ <- ddply(dataDMG[grep("WIND", dataDMG$EVTYPE),], .(year), summarise, sum = sum(INJURIES))
par(mfrow=c(2,1),mar = c(4, 4, 2, 1))
plot(year_Fat_allevent$year,year_Fat_allevent$count, xlab= "Years", ylab= "Fatalities",main= "Figure 2")
points(year_Fat_tornado$year,year_Fat_tornado$sum, col="red")
points(year_Fat_heat_all$year,year_Fat_heat_all$sum, col="green")
points(year_Fat_FLOOD_all$year,year_Fat_FLOOD_all$sum, col="blue")
legend("topleft",c("Tornado","Heat","Flood"), pch =1,col=c("red","green","blue"))
legend("top",c("All events"),lty=c(1), lwd=c(2.5),col=c("black"))
plot(year_Fat_alleventIJ$year,year_Fat_alleventIJ$mean.count,xlab= "Years", ylab= "Injuries")
points(year_Fat_tornadoIJ$year,year_Fat_tornadoIJ$sum, col="red")
points(year_Fat_heat_allIJ$year,year_Fat_heat_allIJ$sum, col="green")
points(year_Fat_FLOOD_allIJ$year,year_Fat_FLOOD_allIJ$sum, col="blue")
points(year_Fat_WIND_allIJ$year,year_Fat_WIND_allIJ$sum, col="orange")
legend("topleft",c("Tornado","Heat","Flood", "WIND"), pch =1,col=c("red","green","blue","orange"))
legend("top",c("All events"),lty=c(1), lwd=c(2.5),col=c("black"))
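
Because year is stored as a factor, plot() draws one box per year instead of a point, which is why the yearly totals appear as small rectangles. If points on a numeric time axis were preferred, one possible variant is sketched below on toy values; the names yearly and year_num are assumptions and the counts are hypothetical.

```r
# Toy yearly totals (hypothetical values); year is a factor, as in dataDMG
yearly <- data.frame(year  = factor(c("1950", "1951", "1952")),
                     count = c(70, 34, 230))

# Convert the factor labels to numbers; as.numeric(yearly$year) alone would
# return the level codes 1, 2, 3 instead of the years
yearly$year_num <- as.numeric(as.character(yearly$year))

# With a numeric x, plot() draws points rather than one box per level
plot(yearly$year_num, yearly$count, xlab = "Years", ylab = "Fatalities")
```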

The bottom plot of Figure 2 shows INJURIES for every year, to study whether these effects are due to a particular year. We plot the total injuries for all events together with those of particular events. In this plot, the small black rectangles show the total injuries for each year, the red circles are the injuries due to tornado, the green circles are injuries from events related to heat, the blue circles are injuries from events related to flood and the orange circles are the injuries due to wind. Here, the main difference compared with fatalities is that wind produces many injuries but comparatively few fatalities. Here again, we recommend developing a preventive plan to help people stay safe in case of tornado, flood, excessive heat and wind. These are the main causes of injuries in the United States during the last 60 years.

Now we address the question: Across the United States, which types of events have the greatest economic consequences? We have already identified the variables related to the economy, PROPDMG and CROPDMG. First, we study PROPDMG (property damage) and identify the most important events. Below we list the 10 most relevant events according to the variable PROPDMG.

FtypePROP <-  ddply(dataDMG, .(EVTYPE), summarise, total = sum(PROPDMG))
ordertotalPROP <- FtypePROP[order(-FtypePROP$total),] 
PropPerc <- sum(ordertotalPROP[1:10,2])/sum(ordertotalPROP[,2])*100.
PropFlood <-  sum(ordertotalPROP[1:1,2])/sum(ordertotalPROP[,2])*100.
PropHurricane <- sum(ordertotalPROP[1:2,2])/sum(ordertotalPROP[,2])*100.
PropTornado <- sum(ordertotalPROP[1:3,2])/sum(ordertotalPROP[,2])*100.
head(ordertotalPROP,10)
##                EVTYPE      total
## 84              FLOOD 1.4466e+11
## 219 HURRICANE/TYPHOON 6.9306e+10
## 402           TORNADO 5.6937e+10
## 345       STORM SURGE 4.3324e+10
## 72        FLASH FLOOD 1.6141e+10
## 131              HAIL 1.5732e+10
## 210         HURRICANE 1.1868e+10
## 412    TROPICAL STORM 7.7039e+09
## 476      WINTER STORM 6.6885e+09
## 196         HIGH WIND 5.2700e+09

These 10 events are responsible for 88.37176% of the property damage in the United States from 1950 to 2011. In fact, the first one, flood, is by itself responsible for 33.85252% of all property damage. From these results, we can say that the most relevant events with respect to property damage are flood, hurricane and tornado, which together account for 63.39563% of all property damage.

Now, we plot PROPDMG for every year. In the top plot of Figure 3 we plot the total property damage for all events together with that of particular events. Here we took the first events in the list above, namely flood, hurricane and tornado. Note that we group all event types containing the word FLOOD, and likewise HURRICANE, as one event for a better understanding of their effects.

year_Fat_alleventPROP <- ddply(dataDMG, .(year), summarise, mean.count = sum(PROPDMG))
year_Fat_tornadoPROP <- ddply(dataDMG[dataDMG$EVTYPE == "TORNADO",], .(year), summarise, sum = sum(PROPDMG))
year_Fat_FLOOD_allPROP <- ddply(dataDMG[grep("FLOOD", dataDMG$EVTYPE),], .(year), summarise, sum = sum(PROPDMG))
year_Fat_hur_allPROP <- ddply(dataDMG[grep("HURRICANE", dataDMG$EVTYPE),], .(year), summarise, sum = sum(PROPDMG))

In this plot, the small black rectangles show the total property damage for each year, the red circles are the property damage due to tornado, the blue circles are property damage from events related to flood and the green circles are property damage from hurricane. From this plot we conclude that for the years 1950-1992 tornado was the most relevant event affecting the economy; after 1992 other weather events, such as flood and hurricane, became more important. Here we have the same problem as for fatalities and injuries: fewer events are recorded for earlier years. However, we conclude that from the economic point of view it is important to pay more attention to protecting property against flood, hurricane and tornado. Flood and tornado tend to occur every year; hurricane is the only one that does not usually happen every year, but when it does the economic consequences are catastrophic.

To complete the answer to the question about the types of events with a strong economic impact, let us study the variable CROPDMG (crop damage). Below we list the 10 most relevant events according to the variable CROPDMG.

FtypeCROP <-  ddply(dataDMG, .(EVTYPE), summarise, total = sum(CROPDMG))
ordertotalCROP <- FtypeCROP[order(-FtypeCROP$total),] 
CROPPerc <- sum(ordertotalCROP[1:10,2])/sum(ordertotalCROP[,2])*100.
CROPDrought <-  sum(ordertotalCROP[1:1,2])/sum(ordertotalCROP[,2])*100.
CROPFlood <- sum(ordertotalCROP[1:3,2])/sum(ordertotalCROP[,2])*100.
head(ordertotalCROP,10)
##                EVTYPE       total
## 48            DROUGHT 13972566000
## 84              FLOOD  5661968450
## 305       RIVER FLOOD  5029459000
## 233         ICE STORM  5022110000
## 131              HAIL  3000954453
## 210         HURRICANE  2741910000
## 219 HURRICANE/TYPHOON  2607872800
## 72        FLASH FLOOD  1420727100
## 66       EXTREME COLD  1292973000
## 111      FROST/FREEZE  1094086000

These 10 events are responsible for 85.34739% of the crop damage in the United States from 1950 to 2011. In fact, the first one, drought, is alone responsible for 28.49881% of all crop damage. The most relevant events with respect to crop damage are drought and flood (FLOOD and RIVER FLOOD), which together account for 50.30533% of all crop damage.

In the bottom plot of Figure 3 we plot the total crop damage for all events together with that of two particular events, drought and flood.

year_Fat_alleventCROP <- ddply(dataDMG, .(year), summarise, mean.count = sum(CROPDMG))
year_Fat_droughtCROP <- ddply(dataDMG[dataDMG$EVTYPE == "DROUGHT",], .(year), summarise, sum = sum(CROPDMG))
year_Fat_FLOOD_allCROP <- ddply(dataDMG[grep("FLOOD", dataDMG$EVTYPE),], .(year), summarise, sum = sum(CROPDMG))
par(mfrow=c(2,1),mar = c(4, 4, 2, 1))
plot(year_Fat_alleventPROP$year,year_Fat_alleventPROP$mean.count,main= "Figure 3",xlab= "Years", ylab= "Property Damage")
points(year_Fat_tornadoPROP$year,year_Fat_tornadoPROP$sum, col="red")
points(year_Fat_hur_allPROP$year,year_Fat_hur_allPROP$sum, col="green")
points(year_Fat_FLOOD_allPROP$year,year_Fat_FLOOD_allPROP$sum, col="blue")
legend("topleft",c("Tornado","Hurricane","Flood"), pch =1,col=c("red","green","blue"))
legend("top",c("All events"),lty=c(1), lwd=c(2.5),col=c("black"))
plot(year_Fat_alleventCROP$year,year_Fat_alleventCROP$mean.count,xlab= "Years", ylab= "Crop Damage")
points(year_Fat_droughtCROP$year,year_Fat_droughtCROP$sum, col="red")
points(year_Fat_FLOOD_allCROP$year,year_Fat_FLOOD_allCROP$sum, col="blue")
legend("topleft",c("Drought","Flood"), pch =1,col=c("red","blue"))
legend("top",c("All events"),lty=c(1), lwd=c(2.5),col=c("black"))

In this plot, the small black rectangles show the total crop damage for each year, the red circles are the crop damage due to drought and the blue circles are crop damage from events related to flood. Note that for crop damage we do not have data for earlier years. We conclude that for the years after 1992 drought and flood are the main concerns for crop production. We therefore recommend developing programs to mitigate the effect of drought on crop production. Flood is hard to predict; drought, however, can be mitigated with long-term programs that provide water to keep crop production at a good rate.

We may also want to know the distribution of events per state in the United States. This can be done using the variable STATE, and will be part of a future report.